Caltech
Center for Neuromorphic Systems Engineering

Attentional Selection for Learning and Recognition of Objects in Cluttered Scenes
Ueli Rutishauser, Dirk Walther, Christof Koch, and Pietro Perona

Humans and other primates must process highly complex visual stimuli containing multiple objects serially, and machine vision systems face the same problem. Advanced object recognition algorithms can achieve very good recognition performance with objects learned from a single image (one-shot learning). These algorithms perform well as long as they are trained on images in which the object to be learned and recognized occupies a major part of the image. Once clutter occupies major parts of an image, it becomes impossible to learn from that image without manual pre-labeling. These approaches are thus not suitable for an unsupervised environment, as they would mainly learn background clutter instead of the actual objects.

Biological systems solve this dilemma through selective visual attention, which is, at least in part, based on saliency. This suggests that bottom-up attention plays a major role in our ability to distinguish foreground from background. It furthermore suggests that attention has a major influence on how humans recognize objects. Rapid orientation to the most salient objects in an image leads to fast detection and recognition of the most important objects in a scene. We are developing a computational model that combines selective visual attention with object recognition in a manner inspired by the interplay of the ventral (“what”) pathway and the dorsal (“where”) pathway in the primate visual cortex.

The main idea of our approach is to identify objects or object parts by their most salient features. To find the most salient feature for a given attended location, we trace back through the hierarchy of maps used for computing the saliency map and find the feature map that contributes most to the saliency of that location (fig. 1c). In this map, we segment the attended object (fig. 1d) and use a smoothed version of the resulting mask (fig. 1e) to modulate the contrast in the input image such that there is full contrast at the object location, no contrast outside the object, and a graded contrast transition at the edge of the object (fig. 1f).
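The steps in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our implementation: the function names, the flood-fill segmentation with a relative threshold `frac`, the 3x3 box blur, and blending toward a uniform gray level `gray` are all hypothetical choices made to keep the sketch short. Images and maps are plain lists of rows, and all maps are assumed to have the same resolution.

```python
from collections import deque


def winning_map(feature_maps, loc):
    """Name of the feature map with the largest value at the attended location."""
    r, c = loc
    return max(feature_maps, key=lambda name: feature_maps[name][r][c])


def segment(fmap, loc, frac=0.5):
    """Binary mask of the 4-connected region around loc whose values are
    at least frac times the value at loc (assumed segmentation rule)."""
    rows, cols = len(fmap), len(fmap[0])
    thresh = frac * fmap[loc[0]][loc[1]]
    mask = [[0.0] * cols for _ in range(rows)]
    mask[loc[0]][loc[1]] = 1.0
    queue = deque([loc])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and mask[nr][nc] == 0.0 and fmap[nr][nc] >= thresh:
                mask[nr][nc] = 1.0
                queue.append((nr, nc))
    return mask


def smooth(mask):
    """Keep the mask at 1 on the object and add a graded border around it,
    using a 3x3 box blur (pixels outside the image count as 0)."""
    rows, cols = len(mask), len(mask[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += mask[rr][cc]
            out[r][c] = max(mask[r][c], total / 9.0)
    return out


def modulate_contrast(image, mask, gray=128.0):
    """Blend each pixel toward a uniform gray level: the pixel is unchanged
    where the mask is 1 and replaced by gray where the mask is 0."""
    return [[gray + m * (p - gray) for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]
```

A call sequence would be `name = winning_map(maps, loc)`, then `mask = smooth(segment(maps[name], loc))`, then `modulate_contrast(image, mask)`. The recognition stage then only sees image structure inside and near the mask, because the rest of the image has been flattened to a constant.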

Learning from our data sets results in a classifier that can recognize 30 objects. We evaluate the performance of this classifier by computing the true positive rate (TP) and the false positive rate (FP) for each individual object classifier. For every object, the 50 positive samples are used to compute the true positive rate, and the images of all other objects serve as negative samples for obtaining the false positive rate.
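The per-object evaluation described above amounts to a one-versus-rest count. The following is a minimal sketch; the function name and the data layout (one ground-truth label per test image and one set of reported objects per test image) are assumptions made for illustration.

```python
def tp_fp_rates(true_labels, detections, target):
    """True positive and false positive rate for one object's classifier.

    true_labels: ground-truth object name for each test image.
    detections:  for each test image, the set of object names reported.
    target:      the object whose classifier is being evaluated.
    """
    # Images of the target object are positives; all other images are negatives.
    positives = [d for t, d in zip(true_labels, detections) if t == target]
    negatives = [d for t, d in zip(true_labels, detections) if t != target]
    tp_rate = sum(target in d for d in positives) / len(positives)
    fp_rate = sum(target in d for d in negatives) / len(negatives)
    return tp_rate, fp_rate
```

With 30 objects and 50 positive samples each, every classifier would be scored on its own 50 positives and on the images of the 29 other objects as negatives.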

We evaluate performance (true positive rate) for every data set under three conditions: learning and recognition without attention; attention during learning only; and attention during both learning and recognition. The results for one of our test sets are shown in fig. 2.

As can be seen in fig. 2, attention has little effect when the object occupies a large area of the image (> 5%). For relatively smaller objects (that is, larger images), however, the influence of attention becomes more pronounced. In the more difficult cases (low object resolution, small relative object size), attention more than triples the true positive rate while keeping the false positive rate at very low levels (FP < 0.5%).

The larger the image, the greater the computational advantage of attention, because attention massively reduces the number of keypoints that need to be extracted and compared between an image and the model. Our current implementation, which is in no way optimized for speed, processes about 1.5 frames per second on a 2.0 GHz Pentium 4 mobile, both for learning and for recognition. This includes attentional selection, shape estimation, and recognition or learning.

This work is currently under review for publication at the conference on Neural Information Processing Systems (NIPS 2003).



Figure 1. Illustration of the processing steps for estimating the object shape at the attended location and for using it for object recognition: (a) original image with the attended location marked by the yellow circle; (b) the saliency map; (c) the feature map with the strongest contribution at the attended location (here the orientation map with the center at pyramid level 3 and the surround at level 7); (d) the segmented feature map; (e) the smoothed object mask; (f) the contrast-modulated image; (g) extracted features (keypoints) for the recognition algorithm, marked in green; (h) the second most salient image region with its keypoints; (i) for comparison, the whole image with the keypoints used for building a model. Up to five patches are used for learning and recognition.

Figure 2. True positive (TP) rate for our test set of images. The relative object size is varied by keeping the absolute object size constant and adding more background. It can be observed that attention significantly improves the TP rate when the relative object size is below 5%.
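To make the horizontal axis of Figure 2 concrete, the relative object size can be computed as the fraction of image pixels covered by the object. The sketch below is purely illustrative: the function name and the pixel dimensions are hypothetical and are not taken from our data sets.

```python
def relative_object_size(object_area_px, image_width, image_height):
    """Fraction of the image area covered by the object."""
    return object_area_px / (image_width * image_height)


# Hypothetical object of 100 x 100 pixels placed in ever larger images:
# the absolute object size stays constant while background is added.
object_area = 100 * 100
sizes = [relative_object_size(object_area, side, side) for side in (200, 400, 800)]
below_five_percent = [s < 0.05 for s in sizes]
```

In this made-up example only the largest image pushes the object below the 5% mark.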

