Automated Event Detection in Underwater Video
Dirk Walther, Duane Edgington, Karen A. Salamy, Michael Risi, R. E. Sherlock,
and Christof Koch
Remotely operated underwater vehicles (ROVs) are becoming increasingly important as
a tool for obtaining quantitative data on the distribution and abundance
of oceanic animals. Using video cameras, it is possible to make quantitative
video transects (QVT) through the water, providing high-resolution data
at the scale of the individual animals and their natural aggregation
patterns. The current manual method of analyzing QVT video by trained
scientists is very labor intensive and poses a serious limitation to
the amount of data that can be obtained from ROV dives.
To overcome the bottleneck in analyzing ROV dive videos, we are developing
an automated system for detecting animals (events) visible in the videos.
This task is difficult due to the low contrast of many translucent animals
and due to debris (“marine snow”) cluttering the scene.
We are processing the videos with an attentional selection algorithm
[1] that has been shown to work robustly for target detection in a variety
of natural scenes [2]. The candidate locations (tokens) identified by
the attentional selection module are combined across video frames using
Kalman filters. If tokens can be tracked successfully over several frames,
they are stored as potentially “interesting” events. Based
on low-level properties, “interesting” events are identified
and marked in the video frames.
Especially in deep water video (below 100 meters), visible animals are
often sparse in space and time. By detecting whether or not there is
an “interesting” candidate object for an animal present
in a particular sequence of underwater video, we have developed a notion
of “boring” video frames: video frames that do not
contain any “interesting” events. By omitting “boring”
frames and marking candidate objects, we aim to enhance the productivity
of human video annotators and/or cue a subsequent object classification
module.
The video is recorded onboard the ROVs using broadcast quality video
cameras. On the launch vessel, the video is stored on digital tapes
for later onshore processing. The video is captured onto hard discs
and processed using a computer cluster with 8 Rack Saver rs1100 dual
Xeon 2.4 GHz servers, configured as a 16 CPU Gigabit Ethernet Beowulf
cluster, running our special purpose processing software written in
C++.
For the video analysis, the video frames first undergo a number of pre-processing
steps such as smoothing of horizontal scan lines and removal of constant
background features. After pre-processing, each frame is scanned for
salient locations using the model of saliency-based attention in humans
by Itti, Koch & Niebur [1]. For this model, each frame is decomposed into
seven channels (intensity contrast, red/green and blue/yellow double
color opponencies, and the four canonical spatial orientations) at
different spatial scales. We compute center-surround contrasts across
scales, similar to those found in the human retina and primary
visual cortex, yielding six “feature maps” for each of the
seven features. In a series of non-linear normalization and summation
steps, the feature maps are combined into one “saliency map”,
in which only a small number of salient locations remain active (Fig.
3c). In a winner-take-all neural network these locations compete for
saliency, yielding one winner. This winning location is then inhibited
(inhibition of return), and the competition continues, thus producing
a sequence of the most salient locations in the frame. An example of
the resulting scan path is shown in Fig. 3d.
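The background removal and the scan-path generation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the C++ software used in the project: the function names `subtract_background` and `scan_path` are hypothetical, the winner-take-all network is reduced to a plain argmax, inhibition of return is approximated by zeroing a small disc around each winner, and negative differences are clipped to zero. The ten-frame averaging window follows the caption of the processing-steps figure; the saliency map itself is taken as given.

```python
def subtract_background(frames, window=10):
    """Subtract from each frame the mean of up to `window` preceding frames.

    frames: list of 2-D lists of floats (grayscale images of equal size).
    Negative differences are clipped to 0.0 (an assumption of this sketch).
    """
    out = []
    for t, frame in enumerate(frames):
        prev = frames[max(0, t - window):t]
        if not prev:
            # No history yet: pass the first frame through unchanged.
            out.append([row[:] for row in frame])
            continue
        n = len(prev)
        out.append([[max(0.0, frame[i][j] - sum(p[i][j] for p in prev) / n)
                     for j in range(len(frame[0]))]
                    for i in range(len(frame))])
    return out


def scan_path(saliency, n_fixations=3, inhibit_radius=1):
    """Return up to n_fixations (row, col) locations, most salient first.

    saliency: 2-D list of non-negative floats (a precomputed saliency map).
    """
    s = [row[:] for row in saliency]  # work on a copy
    rows, cols = len(s), len(s[0])
    path = []
    for _ in range(n_fixations):
        # Winner-take-all, simplified here to an argmax over the map.
        r, c = max(((i, j) for i in range(rows) for j in range(cols)),
                   key=lambda ij: s[ij[0]][ij[1]])
        if s[r][c] <= 0.0:
            break  # nothing salient is left
        path.append((r, c))
        # Inhibition of return: suppress a disc around the winner.
        for i in range(max(0, r - inhibit_radius), min(rows, r + inhibit_radius + 1)):
            for j in range(max(0, c - inhibit_radius), min(cols, c + inhibit_radius + 1)):
                if (i - r) ** 2 + (j - c) ** 2 <= inhibit_radius ** 2:
                    s[i][j] = 0.0
    return path
```

Calling `scan_path` on a saliency map returns the attended locations in the order in which they would be visited, which corresponds to the scan path overlaid on the frame.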
From a binary version of the attended objects we derive a number of
low-level properties that we use for classifying the objects into “interesting”
and “boring” objects. Having identified “interesting”
objects in single frames, we track objects (“visual events”)
across frames using two independent linear Kalman filters for the x
and the y coordinates. If an object cannot be assigned to an already
existing pair of Kalman filters, a new tracker is initialized for this
object. We prune “noisy” objects by discarding objects tracked
for fewer than five frames.
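The tracking step can be illustrated with the following sketch. It pairs two independent one-dimensional Kalman filters per object, starts a new tracker when a detection cannot be assigned, and discards tracks shorter than five frames, as described above. Everything else is an assumption made for illustration: the constant-velocity motion model, the noise values `q` and `r`, the nearest-neighbor assignment with a distance gate, and the names `Kalman1D`, `Track`, and `track_objects`.

```python
class Kalman1D:
    """Constant-velocity Kalman filter for one image coordinate (assumed model)."""

    def __init__(self, pos, q=0.01, r=1.0):
        self.p, self.v = float(pos), 0.0    # state: position and velocity
        self.P = [[1.0, 0.0], [0.0, 1.0]]   # state covariance
        self.q, self.r = q, r               # process / measurement noise (assumed)

    def predict(self):
        (a, b), (c, d) = self.P
        self.p += self.v                    # x <- F x, time step of one frame
        self.P = [[a + b + c + d + self.q, b + d],
                  [c + d, d + self.q]]      # P <- F P F^T + Q
        return self.p

    def update(self, z):
        (a, b), (c, d) = self.P
        s = a + self.r                      # innovation variance
        k0, k1 = a / s, c / s               # Kalman gain
        y = z - self.p                      # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]  # P <- (I - K H) P


class Track:
    """One tracked object: a pair of independent filters for x and y."""

    def __init__(self, x, y):
        self.kx, self.ky = Kalman1D(x), Kalman1D(y)
        self.length = 1                     # number of frames with a detection


def track_objects(frames, gate=5.0, min_length=5):
    """frames: list of per-frame lists of (x, y) detections."""
    tracks = []
    for detections in frames:
        predictions = [(t, t.kx.predict(), t.ky.predict()) for t in tracks]
        used = set()
        for x, y in detections:
            best, best_d = None, gate
            for t, px, py in predictions:
                d = ((x - px) ** 2 + (y - py) ** 2) ** 0.5
                if id(t) not in used and d <= best_d:
                    best, best_d = t, d
            if best is None:
                best = Track(x, y)          # no match: start a new tracker
                tracks.append(best)
            else:
                best.kx.update(x)
                best.ky.update(y)
                best.length += 1
            used.add(id(best))
    # Prune "noisy" objects tracked for fewer than min_length frames.
    return [t for t in tracks if t.length >= min_length]
```

A detection that appears in a single frame far from any predicted position starts its own track and is later pruned, while an object that moves smoothly across frames keeps being assigned to the same pair of filters.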
We draw on the long experience and training of the human annotators to
learn which of the visual events they consider interesting and which
they do not. We ask annotators to view video clips in which the algorithm
has marked all potentially “interesting” events. Using the
annotators’ feedback we can learn by example a notion of how
“interesting” an event is, using the low-level object properties.
Once we have established the visual events and decided which
ones are “interesting”, we can mark them in the video by
drawing the bounding box for the objects into the frames (Fig. 3f).
Occasionally, video has several seconds or even minutes in which nothing
salient appears. We now also have the option of omitting those long
stretches of “boring” video from being passed on to the
annotators.
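Finding the stretches that could be omitted amounts to locating long runs of frames without events. The sketch below assumes a per-frame flag that says whether any “interesting” event is present; the function name `boring_stretches` and the default minimum run length of 90 frames (three seconds at the 30 frames per second implied by the 300-frame, ten-second clips) are illustrative choices, not values taken from the system.

```python
def boring_stretches(has_event, min_run=90):
    """Return (start, end) frame ranges, end exclusive, that contain no
    events and last at least min_run frames.

    has_event: list of booleans, one per video frame.
    """
    runs, start = [], None
    for i, flag in enumerate(has_event):
        if not flag and start is None:
            start = i                      # a run of boring frames begins
        elif flag and start is not None:
            if i - start >= min_run:
                runs.append((start, i))    # the run was long enough to omit
            start = None
    # Close a run that extends to the end of the video.
    if start is not None and len(has_event) - start >= min_run:
        runs.append((start, len(has_event)))
    return runs
```

Short gaps between events are kept, so annotators still see the context around each marked object.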
A crucial part of our analysis is the discrimination between “interesting”
and “boring” visual events in the underwater video. To
evaluate our approach to this problem we captured ten video clips, each
ten seconds (300 frames) long, from video material recorded during
a midwater dive by ROV Ventana on April 30, 1999 in Monterey Bay.
We pre-processed the videos according to the procedure described above,
marked all potentially “interesting” events in the video
frames (see Fig. 3f), and stored the medium-level properties for the
objects (see Fig. 3e). A video annotator was then asked to decide which
of the events they would find “interesting” and which they
would not.
We examined the object properties of the events that the annotator marked
as “interesting” and found that the covariance uxy of the object
pixels with respect to the centroid is a good indicator of
how “interesting” an event is. In Fig. 4, we show the ROC
(Receiver Operating Characteristics) for this discrimination. With a
threshold of |uxy| = 0.4 we obtained a true positive rate of 93.3%
with only 9.6% false positives. This unexpectedly good discrimination
using only one of the twelve properties that we extract from the
objects makes us confident that we can further improve the discrimination
performance by learning a measure of how “interesting” an event
is from all available properties.
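As a rough illustration of this single-property test, the sketch below computes the covariance of the pixel coordinates of a binary object mask about its centroid and evaluates one operating point of an ROC curve. Several details are assumptions, because the text does not specify them: the covariance is left in raw pixel units (the normalization under which 0.4 is a meaningful threshold is not stated), events with |uxy| at or above the threshold are treated as “interesting” (the direction of the comparison is not stated either), and the names `pixel_covariance` and `roc_point` are hypothetical.

```python
def pixel_covariance(mask):
    """Covariance of x and y pixel coordinates about the centroid of a
    binary mask (2-D list of 0/1 values), in raw pixel units."""
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pts:
        raise ValueError("mask contains no object pixels")
    n = len(pts)
    cx = sum(x for x, _ in pts) / n
    cy = sum(y for _, y in pts) / n
    return sum((x - cx) * (y - cy) for x, y in pts) / n


def roc_point(scores, labels, threshold):
    """Return (true positive rate, false positive rate) for one threshold.

    scores: covariance values, one per event.
    labels: True where the annotator marked the event as interesting.
    Assumption: |score| >= threshold predicts "interesting".
    """
    predicted = [abs(s) >= threshold for s in scores]
    positives = sum(1 for l in labels if l)
    negatives = len(labels) - positives
    tp = sum(1 for p, l in zip(predicted, labels) if p and l)
    fp = sum(1 for p, l in zip(predicted, labels) if p and not l)
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr
```

Sweeping `threshold` over the observed range of scores and plotting the resulting pairs would trace out an ROC curve of the kind shown in the figure.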
We are encouraged by these first results, and we are currently developing
a more complex measure for the distinction between “interesting”
and “boring” events in the video.
1. Itti, L., C. Koch, and E. Niebur. A model of saliency-based visual attention
for rapid scene analysis. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 1998. 20(11): p. 1254-1259.
2. Itti, L. and C. Koch. Target Detection using Saliency-Based Attention.
In Proc. RTO/SCI-12 Workshop on Search and Target Acquisition (NATO
Unclassified). 1999. Utrecht, The Netherlands.

Figure 3. Processing steps for the automated video analysis. (a) video frame captured by
the PC with video editing card; (b) the result of the pre-processing.
In this image, scan lines were smoothed out, and the average of the
preceding ten frames was subtracted. The image contrast is enhanced
for illustration purposes; (c) saliency map computed from image (b);
(d) scan path of the saliency model overlaid on top of the image; (e)
object masks that were extracted at the salient location with the bounding
box (in gray) and the major (red) and minor (light blue) axes marked;
(f) video frames with the objects marked that are part of “interesting”
events.

Figure 4. Receiver Operating Characteristics (ROC) for the discrimination between
“interesting” and “boring” events based on the
covariance uxy. The blue line marks the best performance
at a threshold of |uxy| = 0.4.