Caltech
Center for Neuromorphic Systems Engineering


Computational Modeling of Visual Attention Systems
Robert J. Peters, Asha Iyer, Christof Koch
Nathan Mundhenk, Laurent Itti

We have continued to extend our biological model of bottom-up visual attention with several recently characterized retinal and cortical interactions that are known to govern human performance in certain visual tasks. We are testing the behavioral importance of these interactions by comparing our model's predictions against human eye movement data recorded with our infrared eyetracker. In the last year we have worked with three new model components: (1) short-range orientation interactions (for clutter reduction), (2) long-range orientation interactions (for contour facilitation), and (3) retinal filtering (for fovea vs. periphery effects).

Eyetracking psychophysics
We are using a high-speed (120 Hz) infrared eyetracking system from ISCAN, Inc. This system provides a spatial precision of 0.15-0.5° of visual angle when measuring observers' eye movements. We have compiled three image databases for use in human psychophysics experiments and in model testing (see Figure 1): (1) computer-generated fractals, (2) outdoor scenes, and (3) 10m-resolution overhead imagery.

Our psychophysical data are drawn from experiments in which we collect eye movement patterns from subjects while they “free-view” the images for three seconds at a time. While it is well-known that high-level task demands (such as “look for faces” or “look for agricultural fields”) can have a large influence on the patterns of eye movements, our aim is to make the task relatively free from top-down biases, to allow direct comparisons with our bottom-up attention model. Limiting the duration of image presentation to only a few seconds is likely to emphasize bottom-up effects, since top-down effects are likely to be weakest when the image is first presented.

Computational modeling
The basis for this work is a saliency model of bottom-up attention (Itti et al, 1998). In this model an input image is first passed through several sets of linear filters based on V1 cells in order to extract color, intensity, and orientation features at multiple spatial scales. A center-surround operation leads to a set of feature maps, which are combined across scales to form one so-called conspicuity map for each feature type. A non-linear normalization is applied to each map to amplify peaks of activity relative to noise in the background. Finally, the maps are combined across feature types to produce a single feature-independent saliency map.
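The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the model's actual implementation: the function names, the Gaussian center and surround widths, and the simplified normalization (rescale, then weight by the squared difference between the global maximum and the mean) are all hypothetical stand-ins, and a plain difference of Gaussians replaces the V1-like filter banks.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur using only NumPy (edge-padded)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    # Blur along rows, then along columns; "valid" undoes the padding.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def center_surround(feature, center_sigma, surround_sigma):
    """Absolute difference between a fine and a coarse version of a feature."""
    return np.abs(gaussian_blur(feature, center_sigma)
                  - gaussian_blur(feature, surround_sigma))

def normalize(m):
    """Simplified (assumed) normalization: maps with one dominant peak
    keep a large weight, maps with uniform activity are suppressed."""
    m = m - m.min()
    if m.max() > 0:
        m = m / m.max()
    return m * (m.max() - m.mean()) ** 2

def saliency(features, scales=((1, 4), (2, 8))):
    """features: dict of feature name -> 2-D array (e.g. intensity).
    scales: hypothetical (center_sigma, surround_sigma) pairs."""
    conspicuity = []
    for feat in features.values():
        maps = [normalize(center_surround(feat, c, s)) for c, s in scales]
        conspicuity.append(normalize(sum(maps)))  # one map per feature type
    return sum(conspicuity) / len(conspicuity)    # feature-independent map
```

For an image containing a single bright dot on a uniform background, the resulting map peaks at the dot.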

Recently, Parkhurst et al. [1] tested this type of saliency model with several classes of natural and artificial images. They found that the model could accurately predict where human observers were likely to look. That is, locations that were fixated by human observers were likely to have higher than average saliency as predicted by the model. They reported that this effect was stronger for computer-generated fractal images than for natural indoor and outdoor scenes, presumably because our prior experience with natural scenes leads to stronger top-down control of where to look in such images. We replicated these results in our experiments: we assessed how much higher the salience values were at the locations actually visited in human observers' scanpaths, relative to the salience values visited in random scanpaths, and found that the real scanpaths were predicted with a z-score of 10-12 standard deviations above chance levels, for all three image categories. In concrete terms, a z-score of 10 means that only one in 10^23 random scanpaths will match the saliency map better than the human scanpath.
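The comparison against random scanpaths can be illustrated as follows. This is a sketch under assumptions, not the analysis code used in the experiments: the function names are hypothetical, random scanpaths are assumed to be drawn uniformly over the image with the same number of fixations as the human scanpath, and the tail probability assumes a normal distribution of chance scores.

```python
import math
import numpy as np

def scanpath_z_score(saliency_map, fixations, n_random=1000, seed=0):
    """z-score of the mean saliency at human fixations relative to
    the mean saliency collected along random scanpaths.

    fixations: sequence of (row, col) pixel coordinates.
    """
    rng = np.random.default_rng(seed)
    h, w = saliency_map.shape
    rows, cols = np.asarray(fixations).T
    human = saliency_map[rows, cols].mean()
    # Assumption: each random scanpath has as many fixations as the
    # human one, placed uniformly at random over the image.
    rand_rows = rng.integers(0, h, size=(n_random, len(rows)))
    rand_cols = rng.integers(0, w, size=(n_random, len(rows)))
    chance = saliency_map[rand_rows, rand_cols].mean(axis=1)
    return (human - chance.mean()) / chance.std()

def upper_tail_probability(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))
```

Under the normality assumption, `upper_tail_probability(10)` is about 7.6e-24, which is on the order of one in 10^23.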

Short-range orientation interactions
We asked whether a more realistic set of cross-channel interactions could enhance the model's predictive ability. Specifically, we included a set of cross-orientation and cross-scale inhibitory interactions among spatially overlapping orientation-tuned units, based on a model in which overlapping units form an inhibitory pool that tends to suppress each unit's feed-forward response [2]. These interactions depend on four new parameters: two exponents applied to the excitatory and inhibitory inputs, and two scale parameters controlling the extent of cross-scale and cross-orientation interactions. By substituting the non-linear output from these interactions for the basic saliency model's linear output, we formed an “enhanced saliency model”. Qualitatively, these orientation interactions help to reduce clutter at spatial locations with multiple orientations (Figure 2).
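One way to picture such an inhibitory pool is a divisive interaction at a single image location, sketched below. The functional form is an assumption for illustration and is not taken from the model's code: `gamma` and `delta` play the role of the two exponents, `sigma_theta` and `sigma_scale` the two extent parameters, and the constant `S` is an extra, hypothetical term that keeps the denominator positive.

```python
import numpy as np

def pooled_inhibition(E, gamma=2.0, delta=2.0,
                      sigma_theta=1.0, sigma_scale=1.0, S=1.0):
    """Divisive suppression among overlapping orientation-tuned units.

    E: array of shape (n_orientations, n_scales) holding non-negative
       linear filter responses at one spatial location.
    Returns an array of the same shape with the suppressed responses.
    """
    n_theta, n_scale = E.shape
    theta = np.arange(n_theta)
    # Orientation is circular, so use the shorter way around.
    d_theta = np.abs(theta[:, None] - theta[None, :])
    d_theta = np.minimum(d_theta, n_theta - d_theta)
    w_theta = np.exp(-d_theta**2 / (2 * sigma_theta**2))
    scale = np.arange(n_scale)
    d_scale = scale[:, None] - scale[None, :]
    w_scale = np.exp(-d_scale**2 / (2 * sigma_scale**2))
    # Inhibitory pool: weighted sum over nearby orientations and scales.
    pool = w_theta @ (E ** delta) @ w_scale.T
    return E ** gamma / (S ** delta + pool)
```

With these defaults, a unit that responds alone at a location is suppressed less than the same unit at a location where a second orientation is also active, which is the clutter-reduction effect described above.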

When we tested the enhanced model by again measuring its ability to predict human eye movements, we found a significant improvement over the performance of the base model. The z-scores were now 12-14 standard deviations above chance, an improvement of about 2 standard deviations relative to the base model. This improvement was highly significant.

Long-range orientation interactions
Most recently we have been developing a neurobiologically plausible model of long-range orientation interactions, leading to contour detection, for inclusion in the saliency model. In essence, this model involves excitatory and inhibitory interactions among units tuned to different orientations and different spatial locations (Figure 3). Units that can form a reasonable common contour will tend to co-excite each other, while other units will tend to inhibit each other.
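A toy connection weight between two oriented units conveys the idea. The formula below is a hypothetical simplification and is not the kernel used in the model: it is positive when both units are roughly aligned with the line joining them (collinear), negative when they are parallel but side by side, and it falls off with distance; `sigma_d` is an assumed length constant.

```python
import numpy as np

def contour_weight(p_i, theta_i, p_j, theta_j, sigma_d=3.0):
    """Signed interaction between two oriented units.

    p_i, p_j: (x, y) positions; theta_i, theta_j: orientations in radians.
    Positive values mean mutual excitation, negative values inhibition.
    """
    dx, dy = p_j[0] - p_i[0], p_j[1] - p_i[1]
    d = np.hypot(dx, dy)
    if d == 0:
        return 0.0  # no self-interaction
    phi = np.arctan2(dy, dx)  # direction of the line joining the units
    # cos(2 * angle) is invariant to 180-degree flips, as orientation is.
    alignment = 0.5 * (np.cos(2 * (theta_i - phi))
                       + np.cos(2 * (theta_j - phi)))
    return float(np.exp(-d**2 / (2 * sigma_d**2)) * alignment)
```

Two horizontal units placed end to end receive a positive weight, while two horizontal units stacked vertically receive a negative one.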

This model performs well in identifying implicit contours embedded in arrays of randomly oriented Gabor elements. Human observers find such contours to be highly salient, while the standard saliency model is completely blind to such contours. Qualitatively, this model also performs well in identifying perceptually salient contours in each of the three classes of images we have used in human psychophysics (Figure 4). We are currently analyzing to what degree this perceptual salience is reflected in the eye movements made by human observers when viewing such images. As part of this analysis, we have begun parsing our observers' scanpaths into fixation intervals, so that we can compare their eye movements with the model one fixation at a time.
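Parsing a scanpath into fixation intervals is commonly done with a velocity threshold, and the sketch below shows that generic approach. It is not necessarily the procedure used for our data: the 30 deg/s threshold and the 100 ms minimum duration are assumed values, and the function name is hypothetical. Only the 120 Hz sampling rate comes from the eyetracker described above.

```python
import numpy as np

def parse_fixations(x_deg, y_deg, rate_hz=120.0,
                    vel_thresh=30.0, min_dur_s=0.1):
    """Split gaze samples into fixations using a velocity threshold.

    x_deg, y_deg: gaze position in degrees of visual angle, one per sample.
    Returns a list of (start_index, end_index, mean_x, mean_y) tuples,
    with end_index exclusive.
    """
    x = np.asarray(x_deg, dtype=float)
    y = np.asarray(y_deg, dtype=float)
    # Speed of the movement arriving at each sample, in deg/s.
    speed = np.hypot(np.diff(x), np.diff(y)) * rate_hz
    slow = np.concatenate([[True], speed < vel_thresh])

    fixations = []
    start = None

    def close(end):
        # Keep the interval only if it lasts long enough.
        if (end - start) / rate_hz >= min_dur_s:
            fixations.append((start, end,
                              x[start:end].mean(), y[start:end].mean()))

    for i, is_slow in enumerate(slow):
        if is_slow and start is None:
            start = i
        elif not is_slow and start is not None:
            close(i)
            start = None
    if start is not None:
        close(len(slow))
    return fixations
```

Each returned interval can then be compared against the model's saliency map one fixation at a time.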

Retinal filtering
An often forgotten fact of our visual life is that our visual acuity is best only in the central several degrees of our visual field; our ability to perform tasks with peripheral vision decays dramatically with increasing eccentricity. This effect is modulated by spatial frequency, in that peripheral vision's deficit relative to foveal vision is much worse for high spatial frequencies than for low spatial frequencies. In particular, previous psychophysical studies have quantified the eccentricity and spatial frequency dependence of two tasks: (1) contrast detection (how much contrast does a grating patch need to be distinguishable from a uniform gray background), and (2) orientation discrimination (how much contrast is needed so that a horizontal grating can be distinguished from a vertical grating). It has also been shown that people tend to make saccades to locations near their current fixation point. Thus, on average, saccades tend to neglect more peripheral locations; put another way, a peripheral location requires much higher salience (such as a flashing light) to elicit a saccade than does a central location. We asked whether we could quantitatively predict this effect in our eyetracking data, by modulating the saliency model's internal maps according to data from the literature on contrast detection and orientation discrimination. Again, we found that this modification provides a significant improvement in the model's ability to predict human eye movements.
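The idea of modulating internal maps by eccentricity and spatial frequency can be sketched as follows. The exponential falloff and the constant `k` are placeholders chosen for illustration; the model itself uses sensitivity data from the psychophysical literature, which are not reproduced here. Function names and the dictionary layout are likewise hypothetical.

```python
import numpy as np

def eccentricity_weight(shape, fixation, pixels_per_degree,
                        spatial_freq_cpd, k=0.05):
    """Weight in (0, 1] for every pixel, given the current fixation.

    fixation: (row, col) of the current gaze position.
    spatial_freq_cpd: spatial frequency of the map, in cycles per degree.
    Assumed form: sensitivity decays exponentially with eccentricity,
    and decays faster for higher spatial frequencies.
    """
    rows, cols = np.indices(shape)
    ecc_deg = np.hypot(rows - fixation[0],
                       cols - fixation[1]) / pixels_per_degree
    return np.exp(-k * spatial_freq_cpd * ecc_deg)

def apply_retinal_filter(feature_maps, fixation, pixels_per_degree):
    """feature_maps: dict of spatial frequency (cycles/deg) -> 2-D map."""
    return {
        f: m * eccentricity_weight(m.shape, fixation, pixels_per_degree, f)
        for f, m in feature_maps.items()
    }
```

Under this sketch a peripheral location needs a larger raw response than a central one to reach the same filtered value, and the penalty grows with spatial frequency.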

Outlook
We are currently pursuing several aims within this work. First, we are looking for computationally efficient approximations to the neurobiologically-detailed algorithms we have developed throughout the model. For example, now that we have shown the accuracy of our detailed retinal filter algorithm, is it possible to use an approximation in which only the output saliency map is modulated, while maintaining the predictive success of the algorithm? Second, we are looking for new ways to validate the model. In one case, we are planning psychophysics experiments using stimuli (such as Gabor “snakes”) that highlight the effect of including a dedicated contour facilitation module in a model of attention. With such stimuli, human observers are expected to show clear-cut behavioral differences between images containing the embedded contours and control images containing only noise; this in turn provides a distinct signature that an accurate model must mimic. Our model validation efforts will also extend to approaches using scanpaths parsed into fixation intervals, and approaches in which inter-observer variability is used to test whether our model's behavior appears to be drawn from the same statistical distribution as human behavior. Third and last, we would ultimately like to explore the interactions among the various effects we have studied. How does retinal filtering affect contour facilitation? What are the timescales at which these effects appear behaviorally, and neurophysiologically? What is the interplay between these timescales and natural stimuli that involve motion, in contrast to our current data which are based on static images? The answers to these questions will bring us closer not only toward understanding the human visual system, but also toward building machines that assist, interact, collaborate, and synergize with real human visual systems.

Bibliography
[1] D. Parkhurst, K. Law, and E. Niebur. Modeling the role of salience in the allocation of overt visual attention. Vision Research, 42(1):107-123, 2002.

[2] D.K. Lee, L. Itti, C. Koch, and J. Braun. Attention activates winner-take-all competition among visual filters. Nature Neuroscience, 2(4):375-381, 1999.

Figure 1. Three classes of images were used in eyetracking experiments: computer-generated fractals (top row), outdoor scenes (middle row), and overhead satellite imagery (bottom row). Each row shows a sample image (left column) and the corresponding saliency map predicted by the model (right column). More active regions in the saliency map are shown as brighter and more elevated. For each image type, one subject's scanpath from three seconds of free viewing is superimposed on both the original image and the saliency map. Qualitatively, there is a strong correspondence between the model's predictions and the subjects' eye movements.

 

Figure 2. Two different saliency models: one with weak local orientation interactions (left column), one with strong interactions (right column). Given the same inputs (top panels), the two models produce similar outputs (lower panels), but the model with strong interactions produces more focused activation (lower right panel, yellow arrow) relative to the model with weak interactions (lower left panel), due to inhibitory suppression of feed-forward activity.

 

Figure 3. Schematic diagram of a model of contour detection. An input image is processed with multiple orientation filters. These filter outputs then interact via a connection kernel that selectively enhances units that plausibly form part of a contour (essentially, units that are collinear or nearly so). The process iterates with global inhibitory feedback and recurrent excitation until a stable point is reached.

 

Figure 4. (left) An overhead image from the DOI-10m dataset. (right) The output of our contour facilitation algorithm on the overhead image.

