The object recognition problem is that of finding instances of object classes in images or video sequences: faces, giraffes, the digit 5, chairs, etc. This must be accomplished while allowing for intra-class variation as well as changes in illumination and viewpoint. Belongie, Malik and Puzicha (2001) introduced a relational descriptor for shapes represented as point sets, the "shape context". It enables one to compute similarity measures between shapes which, together with similarity measures for texture and color, can be used to drive object recognition. I will outline steps toward a complete theory of object recognition based on shape contexts, and show results on a variety of 2D and 3D recognition problems.
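The core of the shape context is a log-polar histogram: each point describes the shape by binning the relative positions of all other points by log-distance and angle. The sketch below illustrates that idea; the bin counts, radial range, and normalization are illustrative assumptions, not the exact choices from the paper.

```python
import numpy as np

def shape_contexts(points, n_radial=5, n_angular=12):
    """Compute a shape-context-style descriptor for each point in a
    2-D point set: a log-polar histogram of the relative positions of
    all other points. Parameter choices here are assumptions."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    diff = pts[None, :, :] - pts[:, None, :]          # (n, n, 2) offset vectors
    dist = np.hypot(diff[..., 0], diff[..., 1])
    # Normalize by the mean pairwise distance for scale invariance.
    dist = dist / dist[dist > 0].mean()
    angle = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)

    # Log-spaced radial bin edges; points outside the outer ring are ignored.
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_radial + 1)
    descriptors = np.zeros((n, n_radial, n_angular))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r_bin = np.searchsorted(r_edges, dist[i, j]) - 1
            if 0 <= r_bin < n_radial:
                a_bin = int(angle[i, j] / (2 * np.pi) * n_angular) % n_angular
                descriptors[i, r_bin, a_bin] += 1
    return descriptors.reshape(n, -1)
```

Matching two shapes then reduces to comparing these histograms (e.g. with a chi-squared distance) and finding a point-to-point correspondence.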
The action recognition problem is that of finding instances of actions in video sequences: run, jump, kick, etc. This must be accomplished while allowing for variation in the person performing the action, clothing, illumination and viewpoint. We have developed two approaches to recognizing actions. For low-resolution data ("far field"), the approach collects low-resolution optical flow measurements over a spatiotemporal volume for each moving figure, constructs a robust descriptor from this volume, and matches it against stored sequences. We show generalization over person, clothing and illumination, while pose variations are dealt with in a multiple-view framework. For high-resolution data ("near field"), the approach extracts stick figures in each frame and relies on joint-level human body tracking to provide a complete intermediate representation that is robust to lighting and clothing as well as pose.
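The far-field descriptor can be sketched as follows: split the optical flow into half-wave rectified channels, smooth each to tolerate noisy low-resolution flow, and stack the channels over a temporal window for matching. The channel split, blur, and window length below are my assumptions about the general recipe, not the exact construction.

```python
import numpy as np

def motion_descriptors(flow, window=5):
    """Sketch of a far-field motion descriptor (assumed form).

    flow: array of shape (T, H, W, 2) holding per-frame optical flow.
    Splits flow into four half-wave rectified channels (+x, -x, +y, -y),
    applies a 3x3 box blur per frame, and returns one spatiotemporal
    volume of channels per temporal window position."""
    fx, fy = flow[..., 0], flow[..., 1]
    channels = np.stack([np.maximum(fx, 0), np.maximum(-fx, 0),
                         np.maximum(fy, 0), np.maximum(-fy, 0)], axis=-1)
    # 3x3 box blur per frame (wrap-around borders for simplicity).
    blurred = np.zeros_like(channels)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            blurred += np.roll(np.roll(channels, dy, axis=1), dx, axis=2)
    blurred /= 9.0
    # Descriptor at frame t: the blurred channels over frames [t, t+window).
    T = flow.shape[0]
    return [blurred[t:t + window] for t in range(T - window + 1)]
```

A query action would then be compared to stored sequences by correlating these spatiotemporal volumes and taking the best-matching labeled frame.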
This talk is based on joint work; please visit:
http://http.cs.berkeley.edu/projects/vision/vision_group.html for pointers to publications.