Computational Modeling of Visual Attention Systems
Robert J. Peters, Asha Iyer, Christof Koch
Nathan Mundhenk, Laurent Itti
We have continued to extend our biological model of bottom-up visual attention
with several recently characterized retinal and cortical interactions
that are known to govern human performance in certain visual tasks.
We are testing the behavioral importance of these interactions by comparing
our model's predictions against human eye movement data recorded with
our infrared eyetracker. In the last year we have worked with three
new model components: (1) short-range orientation interactions (for
clutter reduction), (2) long-range orientation interactions (for contour
facilitation), and (3) retinal filtering (for fovea vs. periphery effects).
Eyetracking psychophysics
We are using a high-speed (120 Hz) infrared eyetracking system from ISCAN,
Inc. This system provides us with spatial precision of 0.15-0.5°
of visual angle when measuring observers' eye movements. We have compiled
three image databases for use in human psychophysics experiments and
in model testing (see Figure 1): (1) computer-generated fractals, (2)
outdoor scenes, and (3) 10m-resolution overhead imagery.
Our psychophysical data are drawn from experiments in which we collect
eye movement patterns from subjects while they “free-view”
the images for three seconds at a time. While it is well-known that
high-level task demands (such as “look for faces” or “look
for agricultural fields”) can have a large influence on the patterns
of eye movements, our aim is to make the task relatively free from top-down
biases, to allow direct comparisons with our bottom-up attention model.
Limiting the duration of image presentation to only a few seconds is
likely to emphasize bottom-up effects, since top-down effects are likely
to be weakest when the image is first presented.
Computational modeling
The basis for this work is a saliency model of bottom-up attention (Itti
et al., 1998). In this model an input image is first passed through several
sets of linear filters based on V1 cells in order to extract color,
intensity, and orientation features at multiple spatial scales. A center-surround
operation leads to a set of feature maps, which are combined across
scales to form one so-called conspicuity map for each feature type.
A non-linear normalization is applied to each map to amplify peaks of
activity relative to noise in the background. Finally, the maps are
combined across feature types to produce a single feature-independent
saliency map.
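The normalization and combination steps can be pictured with a minimal Python sketch. It assumes each map is a small 2D list of non-negative floats of the same size. The function names (`local_maxima`, `normalize_map`, `combine_maps`), the 4-neighbour definition of a local maximum, and the plain averaging in the final step are illustrative assumptions rather than the model's actual code. The weighting by (M - m)^2, where M is the global maximum and m the mean of the other local maxima, follows our reading of the scheme in Itti et al. (1998).

```python
def local_maxima(m):
    """Values of cells that are strictly greater than all 4-connected neighbours."""
    rows, cols = len(m), len(m[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            v = m[r][c]
            neighbours = [m[rr][cc]
                          for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                          if 0 <= rr < rows and 0 <= cc < cols]
            if all(v > n for n in neighbours):
                peaks.append(v)
    return peaks


def normalize_map(m, top=1.0):
    """Rescale to [0, top], then weight the map by (top - mean of other peaks)^2."""
    hi = max(max(row) for row in m)
    if hi <= 0:
        return [[0.0 for _ in row] for row in m]
    scaled = [[top * v / hi for v in row] for row in m]
    # Drop the largest peak (the global maximum) and average the rest.
    others = sorted(local_maxima(scaled))[:-1]
    mean_other = sum(others) / len(others) if others else 0.0
    weight = (top - mean_other) ** 2
    return [[weight * v for v in row] for row in scaled]


def combine_maps(maps):
    """Average the normalized maps point by point (averaging is an assumption)."""
    normalized = [normalize_map(m) for m in maps]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(n[r][c] for n in normalized) / len(normalized)
             for c in range(cols)] for r in range(rows)]
```

A map with one isolated peak keeps its full weight, whereas a map with several equally strong peaks is suppressed toward zero, which is the intended "amplify peaks relative to background" behavior.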
Recently, Parkhurst et al. [1] tested this type of saliency model with
several classes of natural and artificial images. They found that the
model could accurately predict where human observers were likely to
look. That is, locations that were fixated by human observers were likely
to have higher than average saliency as predicted by the model. They
reported that this effect was stronger for computer-generated fractal
images than for natural indoor and outdoor scenes, presumably because
our prior experience with natural scenes leads to stronger top-down
control of where to look in such images. We replicated these results
in our experiments: we measured how much higher the salience values were
at the locations actually visited in human observers' scanpaths, relative
to the salience values visited in random scanpaths, and found that the
real scanpaths were predicted with a z-score
of 10-12 standard deviations above chance levels, for all three image
categories. In concrete terms, a z-score
of 10 means that only about one in 10^23 random scanpaths will match
the saliency map better than the human scanpath.
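The arithmetic behind this comparison can be sketched in a few lines of Python. The sketch assumes that the scores of random scanpaths are approximately normally distributed, that a random scanpath is a set of uniformly drawn image locations, and that a scanpath is scored by the mean salience at its visited locations. These are simplifying assumptions for illustration, and the names `upper_tail_probability` and `scanpath_z_score` are hypothetical.

```python
import math
import random
import statistics


def upper_tail_probability(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def scanpath_z_score(saliency, human_path, n_random=1000, seed=0):
    """z-score of the human path's mean salience against random paths.

    saliency is a 2D list; human_path is a list of (row, col) locations.
    Assumes the map is not constant, so the random scores have nonzero spread.
    """
    rng = random.Random(seed)
    rows, cols = len(saliency), len(saliency[0])

    def mean_salience(path):
        return sum(saliency[r][c] for r, c in path) / len(path)

    random_means = []
    for _ in range(n_random):
        # Random path with the same number of locations as the human path.
        path = [(rng.randrange(rows), rng.randrange(cols)) for _ in human_path]
        random_means.append(mean_salience(path))

    mu = statistics.mean(random_means)
    sigma = statistics.stdev(random_means)
    return (mean_salience(human_path) - mu) / sigma


# Tail probability for z = 10: roughly 7.6e-24, i.e. about one in 10^23.
p = upper_tail_probability(10.0)
```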
Short-range orientation interactions
We asked whether a more realistic set of cross-channel interactions
could enhance the model's predictive ability. Specifically, we included
a set of cross-orientation and cross-scale inhibitory interactions among
spatially overlapping orientation-tuned units, based on a model in which
overlapping units form an inhibitory pool that tends to suppress each
unit's feed-forward response [2]. These interactions depend on four
new parameters: two exponents applied to the excitatory and inhibitory
inputs, and two scale parameters controlling the extent of cross-scale
and cross-orientation interactions. By substituting the non-linear output
from these interactions for the basic saliency model's linear output,
we formed an “enhanced saliency model”. Qualitatively, these
orientation interactions help to reduce clutter at spatial locations
with multiple orientations (Figure 2).
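One way to picture the role of the four parameters is a divisive-inhibition sketch for the units at a single image location. Only the existence of two exponents and two scale parameters comes from the description above. The divisive form, the Gaussian pool weights, the semi-saturation constant `semi_sat`, and all default values are assumptions made for illustration.

```python
import math


def pooled_response(energy, gamma=2.0, delta=1.5,
                    sigma_theta=1.0, sigma_scale=1.0, semi_sat=1.0):
    """Apply pooled divisive inhibition to the units at one location.

    energy[s][k] is the non-negative feed-forward response at scale index s
    and orientation index k. gamma and delta are the excitatory and
    inhibitory exponents; sigma_theta and sigma_scale set how far the
    inhibitory pool reaches across orientations and scales.
    """
    n_scales, n_orients = len(energy), len(energy[0])
    out = []
    for s in range(n_scales):
        row = []
        for k in range(n_orients):
            pool = 0.0
            for s2 in range(n_scales):
                for k2 in range(n_orients):
                    d_k = abs(k - k2)
                    d_k = min(d_k, n_orients - d_k)  # orientation wraps around
                    d_s = s - s2
                    w = math.exp(-d_k ** 2 / (2 * sigma_theta ** 2)
                                 - d_s ** 2 / (2 * sigma_scale ** 2))
                    pool += w * energy[s2][k2] ** delta
            row.append(energy[s][k] ** gamma / (semi_sat ** delta + pool))
        out.append(row)
    return out
```

With these placeholder settings, a unit responding alone keeps more of its response than the same unit at a location where all orientations are active, which is the clutter-reduction effect described above.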
When we tested the enhanced model by again measuring its ability to
predict human eye movements, we found a significant improvement over
the performance of the base model. The z-scores were now 12-14
standard deviations above chance, or an improvement of about 2 standard
deviations relative to the base model. This improvement was highly significant
( ).
Long-range orientation interactions
Most recently we have been developing a neurobiologically plausible
model of long-range orientation interactions (leading to contour detection) to
be included in the saliency model. In essence, this model involves excitatory
and inhibitory interactions among units tuned to different orientations
and different spatial locations (Figure 3). Units that can form a reasonable
common contour will tend to co-excite each other, while other units
will tend to inhibit each other.
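A toy version of such a pairwise interaction rule is sketched below. It treats two oriented elements as forming a "reasonable common contour" when both are roughly aligned with the line joining them, and lets the interaction decay with distance. The function name `contour_coupling`, the alignment criterion, the Gaussian distance falloff, and all parameter values are hypothetical choices for illustration, not the connection pattern of the model itself.

```python
import math


def orientation_difference(a, b):
    """Smallest angle between two orientations (orientations repeat every pi)."""
    d = (a - b) % math.pi
    return min(d, math.pi - d)


def contour_coupling(pos_a, theta_a, pos_b, theta_b,
                     sigma_dist=3.0, max_bend=math.pi / 4, inhibition=0.5):
    """Signed coupling between two oriented units.

    Positive values are excitatory, negative values inhibitory.
    Positions are (x, y) pairs; orientations are in radians.
    """
    dx, dy = pos_b[0] - pos_a[0], pos_b[1] - pos_a[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return 0.0  # no self-coupling in this sketch
    phi = math.atan2(dy, dx)  # direction of the line joining the two units
    falloff = math.exp(-dist ** 2 / (2 * sigma_dist ** 2))
    # Total deviation of both elements from the joining line.
    bend = orientation_difference(theta_a, phi) + orientation_difference(theta_b, phi)
    if bend <= max_bend:
        return falloff * math.cos(bend)  # smooth contour: co-excitation
    return -inhibition * falloff         # poor continuation: inhibition
```

Under these assumptions, two collinear horizontal elements excite each other, while two parallel elements stacked side by side, or two elements at right angles, inhibit each other.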
This model performs well in identifying implicit contours embedded in
arrays of randomly oriented Gabor elements. Human observers find such
contours to be highly salient, while the standard saliency model is
completely blind to such contours. Qualitatively, this model also performs
well in identifying perceptually salient contours in each of the three
classes of images we have used in human psychophysics (Figure 4). We
are currently analyzing to what degree this perceptual salience is reflected
in the eye movements made by human observers when viewing such images.
As part of this analysis, we have begun parsing our observers' scanpaths
into fixation intervals, so that we can compare their eye movements
with the model one fixation at a time.
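As an illustration of what parsing a scanpath into fixation intervals involves, the following Python sketch uses a simple velocity-threshold rule. This is a generic technique chosen for the example and not necessarily the procedure used in our analysis. The 120 Hz default matches the eyetracker's sampling rate, while the velocity threshold, the minimum fixation duration, and the name `parse_fixations` are assumptions.

```python
import math


def parse_fixations(samples, rate_hz=120.0, max_velocity=30.0, min_duration=0.1):
    """Split gaze samples into fixation intervals.

    samples is a list of (x, y) gaze positions in degrees of visual angle.
    Returns a list of (start_index, end_index, mean_x, mean_y) tuples, with
    end_index exclusive. max_velocity is in degrees per second and
    min_duration in seconds.
    """
    fixations = []

    def close_interval(start, end):
        n = end - start
        # Keep only intervals long enough to count as a fixation.
        if n / rate_hz >= min_duration:
            xs = [p[0] for p in samples[start:end]]
            ys = [p[1] for p in samples[start:end]]
            fixations.append((start, end, sum(xs) / n, sum(ys) / n))

    start = 0
    for i in range(1, len(samples)):
        step = math.hypot(samples[i][0] - samples[i - 1][0],
                          samples[i][1] - samples[i - 1][1])
        if step * rate_hz > max_velocity:
            # A fast sample-to-sample movement ends the current interval.
            close_interval(start, i)
            start = i
    close_interval(start, len(samples))
    return fixations
```

Samples recorded during a saccade each exceed the velocity threshold, so they form one-sample intervals that are discarded by the duration check.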
Retinal filtering
An often forgotten fact of our visual life is that our visual acuity
is best only in the central several degrees of our visual field; our
ability to perform tasks with peripheral vision decays dramatically
with increasing eccentricity. This effect is modulated by spatial frequency,
in that peripheral vision's deficit relative to foveal vision is much
worse for high spatial frequencies than for low spatial frequencies.
In particular, previous psychophysical studies have quantified the eccentricity
and spatial frequency dependence of two tasks: (1) contrast detection
(how much contrast does a grating patch need to be distinguishable from
a uniform gray background), and (2) orientation discrimination (how
much contrast is needed so that a horizontal grating can be distinguished
from a vertical grating). It has also been shown that people tend to
make saccades to locations near their current fixation point. Thus,
on average, saccades tend to neglect more peripheral locations; put
another way, a peripheral location requires much higher salience (such
as a flashing light) to elicit a saccade than does a central location.
We asked whether we could quantitatively predict this effect in our
eyetracking data, by modulating the saliency model's internal maps according
to data from the literature on contrast detection and orientation discrimination.
Again, we found that this modification provides a significant improvement
in the model's ability to predict human eye movements.
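The general idea of eccentricity-dependent modulation can be sketched as follows. The exponential falloff, whose rate grows with spatial frequency, is an assumed functional form, and the constants `alpha`, `e2`, and `deg_per_pixel` are placeholders rather than the values fitted from the contrast detection and orientation discrimination literature. The function names are hypothetical.

```python
import math


def peripheral_gain(ecc_deg, freq_cpd, alpha=0.1, e2=2.0):
    """Attenuation factor in (0, 1] for a given eccentricity and frequency.

    ecc_deg is eccentricity in degrees; freq_cpd is spatial frequency in
    cycles per degree. Higher frequencies fall off faster with eccentricity.
    """
    return math.exp(-alpha * freq_cpd * ecc_deg / e2)


def apply_retinal_filter(feature_map, fixation, freq_cpd, deg_per_pixel=0.1):
    """Scale each value of a 2D map by the gain at its distance from fixation.

    fixation is the (row, col) of the current gaze position.
    """
    fix_r, fix_c = fixation
    out = []
    for r, row in enumerate(feature_map):
        out.append([
            v * peripheral_gain(math.hypot(r - fix_r, c - fix_c) * deg_per_pixel,
                                freq_cpd)
            for c, v in enumerate(row)
        ])
    return out
```

Applying the function separately to maps at different spatial scales, each with its own `freq_cpd`, would leave values at the fixation point unchanged and attenuate fine-scale maps in the periphery more strongly than coarse-scale maps.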
Outlook
We are currently pursuing several aims within this work. First, we are
looking for computationally efficient approximations to the neurobiologically-detailed
algorithms we have developed throughout the model. For example, now
that we have shown the accuracy of our detailed retinal filter algorithm,
is it possible to use an approximation in which only the output saliency
map is modulated, while maintaining the predictive success of the algorithm?
Second, we are looking for new ways to validate the model. In one case,
we are planning psychophysics experiments using stimuli (such as Gabor
“snakes”) that highlight the effect of including a dedicated contour
facilitation module in a model of attention. With such stimuli, human
observers are expected to show clear-cut behavioral differences between
images containing the embedded contours and control images containing
only noise; this in turn provides a distinct signature that an accurate
model must mimic. Our model validation efforts will also extend to approaches
using scanpaths parsed into fixation intervals, and approaches in which
inter-observer variability is used to test whether our model's behavior
appears to be drawn from the same statistical distribution as human
behavior. Third and last, we would ultimately like to explore the interactions
among the various effects we have studied. How does retinal filtering
affect contour facilitation? What are the timescales at which these
effects appear behaviorally, and neurophysiologically? What is the interplay
between these timescales and natural stimuli that involve motion, in
contrast to our current data which are based on static images? The answers
to these questions will bring us closer not only toward understanding
the human visual system, but also toward building machines that assist,
interact, collaborate, and synergize with real human visual systems.
Bibliography
[1] D. Parkhurst, K. Law, and E. Niebur. Modeling the role of salience in
the allocation of overt visual attention. Vision Research, 42(1):107-123,
2002.
[2] D.K. Lee, L. Itti, C. Koch, and J. Braun. Attention activates winner-take-all
competition among visual filters. Nature Neuroscience, 2(4):375-381,
1999.

Figure 1. Three classes of images were used in eyetracking experiments: computer-generated
fractals (top row), outdoor scenes (middle row), and overhead satellite
imagery (bottom row). Each row shows a sample image (left column) and
the corresponding saliency map predicted by the model (right column).
More active regions in the saliency map are shown as brighter and more
elevated. For each image type, one subject's scanpath from three seconds
of free viewing is superimposed on both the original image and on the
saliency map. Qualitatively there is a strong correspondence between
the model's predictions and the subjects' eye movements.