The problem of serially processing highly complex visual stimuli that
contain multiple objects is faced not only by humans and other primates
but also by machine vision systems. Advanced object recognition algorithms
can achieve very good recognition performance with objects learned
from a single image (one-shot learning). These algorithms perform well
as long as they are trained on images in which the object to be learned
and recognized occupies a major part of the image. As soon as major
parts of an image are occupied by clutter, it becomes impossible to learn
from such images without manual pre-labeling. These approaches are thus
not suitable in an unsupervised setting, as they would mainly learn
background clutter instead of the actual objects.
Biological systems solve this dilemma by selective visual attention,
which is, at least in part, based on saliency. This suggests that bottom-up
attention plays a major role in our capability to distinguish background
from foreground. It furthermore suggests that attention has a major
influence on how humans recognize objects. Rapid orientation to the
most salient objects in an image leads to fast detection and recognition
of the most important objects in a scene. We are developing a computational
model that combines selective visual attention with object recognition
in a manner inspired by the interplay of the ventral (“what”) pathway
and the dorsal (“where”) pathway in the primate visual cortex.
The main idea of our approach is the identification of objects or object
parts by their most salient features. To find the most salient feature
for a given attended location, we trace back through the hierarchy of maps
used for computing the saliency map to find the feature map that contributes
most to the saliency of that location (fig. 1c). In this map, we segment
the attended object (fig. 1d) and use a smoothed version of this mask
(fig. 1e) to modulate the contrast in the input image such that there
is full contrast at the object location, no contrast outside the object,
and a graded contrast transition at the edge of the object (fig. 1f).
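The steps above can be illustrated with a minimal Python sketch. It is not
the implementation used in this work. It assumes that all feature maps have
been resampled to a common resolution, that the segmentation step has already
produced a binary object mask, that smoothing is a simple box blur, and that
"no contrast" means a uniform gray at the image mean. The function names
(winning_feature_map, smooth_mask, modulate_contrast) are hypothetical.

```python
import numpy as np

def winning_feature_map(feature_maps, location):
    """Index of the map with the largest response at location (row, col).

    Assumption: a flat list of same-sized maps stands in for the full
    hierarchy of maps behind the saliency map.
    """
    r, c = location
    return int(np.argmax([fm[r, c] for fm in feature_maps]))

def smooth_mask(mask, radius=2):
    """Box-blur a binary mask so its edge becomes a graded transition.

    Assumption: the source only says the mask is smoothed; the box
    filter is chosen here for simplicity.
    """
    k = 2 * radius + 1
    padded = np.pad(mask.astype(float), radius, mode="edge")
    out = np.zeros(mask.shape, dtype=float)
    rows, cols = mask.shape
    for dr in range(k):
        for dc in range(k):
            out += padded[dr:dr + rows, dc:dc + cols]
    return out / (k * k)

def modulate_contrast(image, soft_mask):
    """Full contrast where soft_mask is 1, uniform gray where it is 0."""
    background = image.mean()  # assumed zero-contrast level
    return soft_mask * image + (1.0 - soft_mask) * background

# Usage on a toy 8x8 image with a 4x4 object in the center.
image = np.arange(64, dtype=float).reshape(8, 8)
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True
modulated = modulate_contrast(image, smooth_mask(mask, radius=1))
```

Inside the object the pixel values are unchanged, far outside they collapse
to a single gray value, and pixels near the mask boundary are a weighted mix
of the two.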
Learning from our data sets results in a classifier that can recognize
30 objects. Performance of this classifier is evaluated by computing
the true positive rate (TP) and false positive rate (FP) for each individual
object classifier. For every object, its 50 positive samples are used
for computing the true positive rate, and the images of all
other objects are used as negative samples for obtaining the false positive
rate.
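The per-object evaluation described above can be sketched in Python as
follows. This is an illustration under stated assumptions, not the evaluation
code used in this work: each test image is assumed to carry one true object
label, and each image's recognition result is assumed to be the set of object
labels detected in it. The function name tp_fp_rates and the toy labels are
hypothetical.

```python
def tp_fp_rates(detections, labels, target):
    """True and false positive rate of the classifier for one object.

    detections: list of sets, the object labels detected in each image
    labels:     list of the true object label of each image
    target:     the object whose classifier is being evaluated
    """
    positives = [d for d, y in zip(detections, labels) if y == target]
    negatives = [d for d, y in zip(detections, labels) if y != target]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    # Fraction of the target's own images in which it was detected.
    tp = sum(target in d for d in positives) / len(positives)
    # Fraction of other objects' images in which the target was detected.
    fp = sum(target in d for d in negatives) / len(negatives)
    return tp, fp

# Toy usage with two objects and two images each.
labels = ["cup", "cup", "box", "box"]
detections = [{"cup"}, set(), {"cup"}, {"box"}]
cup_tp, cup_fp = tp_fp_rates(detections, labels, "cup")
```

In the setting described here, each call would see the 50 positive images of
the target object and the images of all other objects as negatives.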
We evaluate performance (true positive rate) for every data set with
three different methods: learning and recognition without attention;
only learning with attention; both learning and recognition with attention.
The results for one of our test sets are shown in fig. 2.
As can be seen in fig. 2, attention has little effect when the
object occupies a large area of the image (> 5%). For smaller objects
(larger images), however, the influence of attention becomes more pronounced.
In the more difficult cases (low object resolution, small relative object
size), attention more than triples the true positive rate while keeping
the false positive rate at very low levels (FP < 0.5%).
The larger the image, the greater the computational advantage of
attention. This is because attention massively reduces the number of
keypoints that need to be extracted and compared between an image and
the model. Our current implementation, which is in no way optimized for
speed, processes about 1.5 frames per second on a 2.0 GHz Pentium 4 mobile,
both for learning and recognition. This includes attentional selection,
shape estimation, and recognition or learning.
This work is currently under review for publication at the conference
on Neural Information Processing Systems (NIPS 2003).

Figure 1. Illustration of the processing steps for obtaining
an estimate of the object shape at the attended location and for
using it for object recognition: (a) original image with the attended
location marked by the yellow circle; (b) the saliency map; (c) the
feature map with the strongest contribution at the attended location
(here the orientation map with the center at pyramid level 3 and the
surround at 7); (d) the segmented feature map; (e) the smoothed object
mask; (f) the contrast-modulated image; (g) extracted features (keypoints)
for the recognition algorithm, marked in green; (h) the second most
salient image region with its keypoints. Up to five patches are used
for learning and recognition. (i) for comparison, the whole image with
the keypoints used for building a model.