Caltech
Center for Neuromorphic Systems Engineering

Attentional Selection for Object Recognition in Visual Cortex
Dirk Walther, Laurent Itti, Maximilian Riesenhuber, Tomaso Poggio, Christof Koch

Most models of object recognition assume that objects occur in isolation in the field of view. In everyday experience, however, we are usually confronted with scenes cluttered with a variety of objects, some relevant to our actions and some not. The brain copes with this flood of visual information by serializing the processing of objects through mechanisms of visual attention. Attentional selection of objects is often modeled as all-or-none switching of neuronal connection pathways from the attended region of the retinal input to the recognition units. However, there is little physiological evidence for such all-or-none modulation in early visual areas. We have developed a combined model of spatial attention and object recognition in which the recognition system monitors the entire visual field, yet attentional modulation of as little as 20% at a high level is sufficient to recognize multiple objects.

A first important step is to extract the approximate extent of the attended object. We have extended our model of bottom-up, saliency-based attention to this end. Once we have determined the most salient location in the input image, we trace back why this location is salient: first to the conspicuity map, and then to the feature map, that contributes most to the saliency of the attended location. Because the feature map is much sparser than the saliency map, we can segment the shape of the attended object in this feature map. The resulting mask is used for object-based inhibition of return as well as for modulating the activity of cell populations in the object recognition system.
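The trace-back and segmentation steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout of the maps, the 4-connected flood fill, and the `fraction` threshold are all assumptions made for the example, and all maps are assumed to share one resolution.

```python
import numpy as np
from collections import deque

def trace_back(location, conspicuity_maps, feature_maps):
    """Find the feature map contributing most at the attended location.

    conspicuity_maps: dict mapping channel name -> 2-D array (hypothetical layout)
    feature_maps: dict mapping channel name -> list of 2-D arrays
    Returns (channel name, index of the winning feature map).
    """
    r, c = location
    # Step 1: the conspicuity map with the largest value at the location.
    channel = max(conspicuity_maps, key=lambda k: conspicuity_maps[k][r, c])
    # Step 2: within that channel, the feature map with the largest value.
    index = int(np.argmax([fm[r, c] for fm in feature_maps[channel]]))
    return channel, index

def segment_region(feature_map, seed, fraction=0.5):
    """Flood-fill the 4-connected region around `seed` whose activity is
    at least `fraction` of the seed activity (an assumed criterion)."""
    height, width = feature_map.shape
    mask = np.zeros((height, width), dtype=bool)
    if feature_map[seed] <= 0:
        return mask  # nothing to segment at an inactive seed
    threshold = fraction * feature_map[seed]
    mask[seed] = True
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < height and 0 <= nc < width
            if inside and not mask[nr, nc] and feature_map[nr, nc] >= threshold:
                mask[nr, nc] = True
                queue.append((nr, nc))
    return mask
```

The sparseness of the feature map matters here: a region-growing step like this one only yields an object-shaped mask if activity falls off sharply at the object boundary, which is less likely in the smoother saliency map.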

We modulate the activity of neuron populations at the S2 level of processing in our hierarchical model of object recognition. The rationale for choosing S2 is twofold, biological and computational. In function, the S2 layer corresponds approximately to area V4 in the primate visual cortex. A number of reports from electrophysiology [1-5] and psychophysics [6, 7] show attentional modulation of V4 activity. Hence, the S2 level is a natural choice for modulating recognition. From a computational point of view, it is efficient to apply the modulation as high up in the hierarchy as possible while still retaining some spatial resolution, i.e., at S2. This way, the computation required to obtain the activations of the S2 units from the input image needs to be done only once per image. When the system attends to the next location in the image, only the computation upward from S2 needs to be repeated.
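The modulation step can be sketched as follows. The text does not give the functional form, so the multiplicative gain below is an assumption, chosen only so that zero strength leaves S2 unchanged and 100% strength suppresses all activity outside the attended region. The name `modulate_s2`, its parameters, and the array layout are hypothetical.

```python
import numpy as np

def modulate_s2(s2, attended_mask, strength):
    """Scale S2 activity outside the attended region by (1 - strength).

    s2: array of shape (n_features, height, width), computed once per image
    attended_mask: boolean array of shape (height, width), True inside the
        attended region
    strength: modulation strength in [0, 1]; 0.2 corresponds to 20%
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    # Gain is 1 inside the attended region and (1 - strength) outside it.
    gain = np.where(attended_mask, 1.0, 1.0 - strength)
    # Broadcasting applies the same spatial gain to every feature type.
    return s2 * gain
```

Because `s2` is an input rather than something recomputed inside the function, attending to a new location only requires a new mask and a rerun of the layers above S2, which is the saving described above.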

Using the model described above, we were able to process images containing multiple paperclip stimuli. Remarkably, as little as 20% modulation of the activity of neuron populations at the S2 level of HMAX was sufficient to recognize both paperclips in almost all of the stimuli containing two paperclips.

References
1. Reynolds, J.H., T. Pasternak, and R. Desimone, Attention increases sensitivity of V4 neurons. Neuron, 2000. 26(3): p. 703-714.
2. Treue, S., Neural correlates of attention in primate visual cortex. Trends in Neurosciences, 2001. 24(5): p. 295-300.
3. Connor, C.E., D.C. Preddie, J.L. Gallant, and D.C. Van Essen, Spatial attention effects in macaque area V4. Journal of Neuroscience, 1997. 17(9): p. 3201-3214.
4. Motter, B.C., Neural correlates of attentive selection for color or luminance in extrastriate area V4. Journal of Neuroscience, 1994. 14(4): p. 2178-2189.
5. Luck, S.J., L. Chelazzi, S.A. Hillyard, and R. Desimone, Neural mechanisms of spatial selective attention in areas V1, V2, and V4 of macaque visual cortex. Journal of Neurophysiology, 1997. 77(1): p. 24-42.
6. Intriligator, J. and P. Cavanagh, The spatial resolution of visual attention. Cognitive Psychology, 2001. 43(3): p. 171-216.
7. Braun, J., Visual search among items of different salience: removal of visual attention mimics a lesion in extrastriate area V4. Journal of Neuroscience, 1994. 14(2): p. 554-567.
8. Itti, L., C. Koch, and E. Niebur, A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998. 20(11): p. 1254-1259.
9. Itti, L. and C. Koch, Computational modelling of visual attention. Nature Reviews Neuroscience, 2001. 2(3): p. 194-203.
10. Riesenhuber, M. and T. Poggio, Hierarchical models of object recognition in cortex. Nature Neuroscience, 1999. 2(11): p. 1019-1025.

Figure 5. Our model combines a saliency-based attention system [8, 9] with our hierarchical recognition system HMAX [10]. For the attention system, the retinal image is filtered for colors (red-green and blue-yellow), intensities, and four orientations at four different scales, and six center-surround differences are computed for each of these seven features. The resulting 7 × 6 = 42 feature maps are combined into three conspicuity maps (for color, intensity, and orientation), from which one saliency map is computed. All locations within the saliency map compete in a winner-take-all (WTA) network of integrate-and-fire neurons, and the winning location is attended to. Subsequently, the saliency map is inhibited at the winning location (inhibition of return), allowing the competition to continue so that other locations can be attended to. The hierarchical recognition system starts out from units tuned to bar-like stimuli with small receptive fields, similar to V1 simple cells. In a succession of layers, information is combined alternately by spatial pooling (using a maximum pooling function) and by feature combination (using a weighted sum operation). View-tuned units at the top of the hierarchy respond to a specific view of an object while showing tolerance to changes in scale and position. The activity of units at level S2 is modulated by the attention system.
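The attend-then-inhibit cycle in Figure 5 can be sketched as follows. This is a deliberate simplification: a plain `argmax` stands in for the winner-take-all network of integrate-and-fire neurons, and inhibition of return is modeled as a disc of assumed radius rather than the object-shaped mask used in the model. The function name and parameters are hypothetical.

```python
import numpy as np

def attend_sequence(saliency, n_fixations, ior_radius=1):
    """Return successive attended locations on a 2-D saliency map.

    Each iteration picks the most salient location (a stand-in for the
    WTA network), then suppresses a disc around it (inhibition of return)
    so that the next iteration selects a different location.
    """
    s = saliency.astype(float)  # astype copies, so the input map is untouched
    rows, cols = np.indices(s.shape)
    fixations = []
    for _ in range(n_fixations):
        r, c = np.unravel_index(np.argmax(s), s.shape)
        fixations.append((int(r), int(c)))
        # Suppressed locations can never win again.
        inhibited = (rows - r) ** 2 + (cols - c) ** 2 <= ior_radius ** 2
        s[inhibited] = -np.inf
    return fixations
```

For example, with peaks at two well-separated locations, the first call to the loop body selects the stronger peak and the second selects the weaker one, skipping any strong values that lie within the inhibited disc.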

Figure 6. For each stimulus, the recognition system can identify both, one, or neither of the two paperclip objects present. We show how many stimuli fall into each of the three cases as a function of the attentional modulation strength. Zero modulation strength means no attentional modulation at all. At 100% modulation, the S2 activation outside the focus of attention (FOA) is entirely suppressed. As little as 20% attentional modulation is sufficient to boost recognition performance significantly.

