2107 Kelley Engineering Center
School of EECS
Oregon State University
Corvallis, OR 97331

Tel: (541) 737-7268
Fax: (541) 737-1300
sinisa at eecs oregonstate edu

Object and Activity Recognition Grounded on Mid-Level Image Representations

Microsoft Research
Sept 20, 2011

Existing approaches to object and activity recognition typically ground their models of objects and human activities onto a pre-selected set of low-level image features. In the face of uncertainty, however, we believe that recognition requires: (1) a more synergistic interaction between high-level inference algorithms and low-level feature extractors, and (2) mid-level image representations for bridging the semantic gap. In this talk, I will present our new mid-level features, called bags of right detections (BORDs), aimed at enabling (1) and (2). BORDs are placed on a deformable grid across an image or video. The inference algorithm adaptively warps the grid of BORDs so that they can better summarize the visual cues relevant for recognition. I will also present two successful applications of BORDs: (i) view-invariant object detection and 3D pose estimation, and (ii) group-activity detection and actor localization.
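As a toy illustration of the deformable-grid idea, the sketch below pools features inside grid cells whose windows can be shifted by per-cell offsets. This is a hypothetical simplification: the actual BORDs pool detection responses rather than raw pixel intensities, and the warping is driven by top-down inference rather than fixed offsets.

```python
import numpy as np

# Toy "deformable grid" cell over an image: each cell pools (here, averages)
# the values inside its window, and the window can be shifted by an offset.
# The offsets stand in for the top-down warping driven by inference.
img = np.arange(64, dtype=float).reshape(8, 8)

def pool_cell(image, row, col, size=4, offset=(0, 0)):
    """Mean of the size x size window anchored at (row, col) + offset."""
    r = np.clip(row + offset[0], 0, image.shape[0] - size)
    c = np.clip(col + offset[1], 0, image.shape[1] - size)
    return image[r:r + size, c:c + size].mean()

# Regular grid cell vs. the same cell warped to better cover a structure.
regular = pool_cell(img, 0, 0)               # → 13.5
warped = pool_cell(img, 0, 0, offset=(1, 1)) # → 22.5
print(regular, warped)
```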
Learning Spatiotemporal Graphs of Human Activities

ICCV '11, Oral presentation
Nov 8, 2011

Given a set of spatiotemporal graphs, we learn their model graph, along with the pdf's associated with the nodes and edges of the model. The model graph adaptively learns, from data, the relevant video segments and their spatiotemporal relations. We present a novel weighted-least-squares formulation of learning a structural archetype of graphs. The model is used for video parsing.
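For illustration only, here is a minimal weighted-least-squares sketch in NumPy. The toy setup and variable names are hypothetical; the paper's actual formulation learns a graph archetype, not this toy regression, but the closed-form solution below is the standard WLS machinery.

```python
import numpy as np

# Toy weighted least squares: find x minimizing ||W^(1/2) (A x - b)||^2.
# In the talk's setting, the observations would encode correspondences
# between training graphs and the model graph, and the weights their match
# confidences (a hypothetical mapping; see the paper for the real setup).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))       # 20 observations, 3 parameters
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.1 * rng.standard_normal(20)
w = rng.uniform(0.5, 1.0, size=20)     # per-observation weights

# Closed-form WLS solution: x = (A^T W A)^{-1} A^T W b
W = np.diag(w)
x_hat = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)
print(np.round(x_hat, 2))              # close to x_true
```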
Shape of Human Activities

4th Int. Workshop on Shape Perception in Human and Computer Vision
May 5, 2011

We formalize activity shape as a spatiotemporal graph, and demonstrate its advantages for video interpretation. Access to video tubes can be provided by a number of existing approaches to multiscale spatiotemporal video segmentation. The resulting tubes can be conveniently organized in a spatiotemporal graph, where nodes correspond to the tubes, and edges capture their hierarchical, temporal, and spatial relationships. Given a set of training spatiotemporal graphs, we learn their archetype, i.e., model graph, and the pdf's associated with model nodes and edges. The graph model adaptively learns, from video data, the relevant video tubes and their space-time relations for activity recognition. This advances beyond much prior work, which typically hand-picks the activity primitives, their total number, and their temporal relations (e.g., allowing only followed-by relations), and then only estimates their relative significance for activity recognition. Since activity shapes are closer to the symbol level, our work demonstrates that they can facilitate activity recognition with a significant reduction in the number of training examples.
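The graph structure described above can be sketched as a small data structure: nodes are video tubes and edges carry a relation type. The tube names and attributes below are hypothetical, and the model's pdf's over nodes and edges are omitted.

```python
# Toy spatiotemporal graph of video tubes: nodes are tubes with temporal
# extents, edges are typed relations (hierarchical, temporal, spatial).
tubes = {
    "person": {"start": 0, "end": 50},
    "arm":    {"start": 10, "end": 40},
    "ball":   {"start": 30, "end": 60},
}
edges = [
    ("person", "arm", "hierarchical"),  # arm is a sub-tube of person
    ("arm", "ball", "temporal"),        # arm motion precedes ball motion
    ("person", "ball", "spatial"),      # person and ball co-occur in space
]

def neighbors(node, relation):
    """All tubes related to `node` by edges of the given type."""
    return [v for u, v, r in edges if u == node and r == relation]

print(neighbors("person", "hierarchical"))  # → ['arm']
```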
From Contours to 3D Object Detection and Pose Estimation

ICCV '11, Oral presentation
Nov 8, 2011

We address view-invariant object detection and pose estimation using contours as basic object features. A top-down feedback from inference warps the image so that the bottom-up extraction of contours can better collectively summarize the relevant visual information and match our 3D object model, under arbitrary non-rigid shape deformations and affine projection.
Object Discovery and Recognition in an Ensemble of Image Segmentations

UIUC, ECE Colloquium
Oct 21, 2010

This talk will present our recent work on region-based object discovery and recognition. Image regions provide direct access to important visual cues for object recognition, including shape smoothness and compositionality. Their use as image features, however, is hindered by a number of factors, including the shortcomings of any particular low-level segmentation used to extract regions, and poor repeatability across different images of the same scene. We address these hindrances in the following common computational steps of object discovery and recognition: (1) feature extraction, (2) feature clustering, and (3) learning and inference of object models. We formulate feature extraction as the maximum-weight independent set (MWIS) problem, where the goal is to select, from an ensemble of distinct image segmentations, the most distinctive regions that cover the entire image. For object discovery, we cluster regions extracted from a set of images by multicoloring a graph which captures collaborating and conflicting deformations of region boundaries across the image set. Finally, for learning and inference, we combine popular object representations -- namely, random forest (RF) and conditional random field (CRF) -- into a new computational framework, called random forest random field (RF)^2. The talk will also present a theoretical analysis of our new algorithms for solving: the MWIS problem, graph multicoloring, and learning and inference of (RF)^2.
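To make the MWIS formulation concrete, here is a simple greedy baseline, not the algorithm from the talk (which comes with a theoretical analysis and different guarantees). Vertices stand for candidate regions, edges connect conflicting (overlapping) regions from different segmentations, and weights score region distinctiveness; the instance below is hypothetical.

```python
def greedy_mwis(weights, edges):
    """Greedy maximum-weight independent set (a baseline heuristic).

    weights: dict vertex -> weight; edges: set of frozenset vertex pairs.
    Returns a list of mutually non-adjacent vertices.
    """
    adj = {v: set() for v in weights}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    chosen, banned = [], set()
    # Visit vertices in decreasing weight; keep one if no neighbor was kept.
    for v in sorted(weights, key=weights.get, reverse=True):
        if v not in banned:
            chosen.append(v)
            banned |= adj[v]
    return chosen

# Toy instance: regions a..d, where a-b and b-c overlap.
w = {"a": 3.0, "b": 5.0, "c": 4.0, "d": 1.0}
e = {frozenset({"a", "b"}), frozenset({"b", "c"})}
print(sorted(greedy_mwis(w, e)))  # → ['b', 'd']
```

Note that on this instance the greedy heuristic returns {b, d} (weight 6), whereas the optimal independent set is {a, c, d} (weight 8), which is exactly why a principled MWIS algorithm matters.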
Multiobject Tracking as Maximum Weight Independent Set

CVPR '11, Oral presentation
June 22, 2011

In this talk, I will present our approach to simultaneous tracking of multiple targets in a video. We first apply object detectors to every video frame. Pairs of detection responses from every two consecutive frames are then used to build a graph of tracklets. The graph helps transitively link the best-matching tracklets that do not violate hard and soft contextual constraints between the resulting tracks. We prove that this data-association problem can be formulated as finding the maximum-weight independent set (MWIS) of the graph. We present a new, polynomial-time MWIS algorithm, and prove that it converges to an optimum. The similarity and contextual constraints between object detections, used for data association, are learned online from object appearance and motion properties. Long-term occlusions are addressed by iteratively repeating MWIS to hierarchically merge smaller tracks into longer ones. Our results demonstrate the advantages of simultaneously accounting for soft and hard contextual constraints in multitarget tracking.
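A minimal sketch of the first step, pairing detection responses across two consecutive frames by similarity, is below. The greedy matching and the position-based similarity are hypothetical stand-ins; the talk's method instead builds a tracklet graph with learned appearance and motion cues and solves MWIS over it.

```python
import itertools

def link_detections(frame_a, frame_b, sim, threshold=0.5):
    """Greedily pair detections in two consecutive frames by similarity.

    frame_a, frame_b: lists of detection ids; sim(d1, d2) -> score in [0, 1].
    Returns a list of (d1, d2) links, each detection used at most once.
    """
    pairs = sorted(itertools.product(frame_a, frame_b),
                   key=lambda p: sim(*p), reverse=True)
    used_a, used_b, links = set(), set(), []
    for d1, d2 in pairs:
        if d1 in used_a or d2 in used_b or sim(d1, d2) < threshold:
            continue
        links.append((d1, d2))
        used_a.add(d1)
        used_b.add(d2)
    return links

# Toy similarity from 1D positions (stand-in for appearance + motion cues).
pos = {"a1": 0.0, "a2": 5.0, "b1": 0.4, "b2": 5.2}
sim = lambda x, y: max(0.0, 1.0 - abs(pos[x] - pos[y]))
print(link_detections(["a1", "a2"], ["b1", "b2"], sim))
# → [('a2', 'b2'), ('a1', 'b1')]
```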
Matching Hierarchies of Deformable Shapes

GbR 2009, Oral presentation
From Hierarchy of Regions to Image Understanding

1st Sino-USA Summer School in Vision, Learning, and Pattern Recognition, Beijing, China, July 2009; HP Labs, Palo Alto, CA, May 2009
Scale-Invariant Region-Based Hierarchical Image Matching

ICPR 2008, Oral presentation
Object Recognition by Discriminative Methods

1st Sino-USA Summer School in Vision, Learning, and Pattern Recognition, Beijing, China, July 2009
What Do Those Images Have in Common

Google Tech Talks, April 2008; Ricoh Innovations, Inc., April 2008; Oregon State University, EECS Dept. Colloquium, January 2008; UCLA, Dept. of Statistics Speakers Series, January 2008; Carnegie Mellon University, VASC Seminar Series, October 2007; UIUC, PAML Seminar Series, September 2007
Extracting Texels in 2.1D Natural Textures

ICCV 2007, Oral presentation
Extracting Subimages of an Unknown Category from a Set of Images

CVPR 2006, Oral presentation
3D Texture Classification Using the Belief Net of a Segmentation Tree

ICPR 2006, Oral presentation