Caltech
Center for Neuromorphic Systems Engineering

Automated Event Detection in Underwater Video
Dirk Walther, Duane Edgington, Karen A. Salamy, Michael Risi, R. E. Sherlock, and Christof Koch

Remotely operated underwater vehicles (ROVs) are becoming increasingly important as a tool for obtaining quantitative data on the distribution and abundance of oceanic animals. Using video cameras, it is possible to make quantitative video transects (QVT) through the water, providing high-resolution data at the scale of the individual animals and their natural aggregation patterns. The current manual method, in which trained scientists analyze the QVT video, is very labor intensive and seriously limits the amount of data that can be obtained from ROV dives.

To overcome the bottleneck in analyzing ROV dive videos we are developing an automated system for detecting animals (events) visible in the videos. This task is difficult due to the low contrast of many translucent animals and due to debris (“marine snow”) cluttering the scene. We are processing the videos with an attentional selection algorithm [1] that has been shown to work robustly for target detection in a variety of natural scenes [2]. The candidate locations (tokens) identified by the attentional selection module are combined across video frames using Kalman filters. If tokens can be tracked successfully over several frames, they are stored as potentially “interesting” events. Based on low-level properties, “interesting” events are identified and marked in the video frames.

Especially in deep-water video (below 100 meters), visible animals are often sparse in space and time. By detecting whether a particular sequence of underwater video contains an “interesting” candidate object for an animal, we have developed a notion of “boring” video frames: frames that do not contain any “interesting” events. By omitting “boring” frames and marking candidate objects, we aim to enhance the productivity of human video annotators and/or cue a subsequent object classification module.

The video is recorded onboard the ROVs using broadcast-quality video cameras. On the launch vessel, the video is stored on digital tapes for later onshore processing. The video is then captured onto hard disks and processed using a computer cluster with 8 Rack Saver rs1100 dual Xeon 2.4 GHz servers, configured as a 16 CPU Gigabit Ethernet Beowulf cluster, running our special-purpose processing software written in C++.

For the video analysis, the video frames first undergo a number of pre-processing steps, such as smoothing of horizontal scan lines and removal of constant background features. After pre-processing, each frame is scanned for salient locations using the model of saliency-based attention in humans by Itti & Koch [1]. For this model, each frame is decomposed into seven channels (intensity contrast, red/green and blue/yellow double color opponencies, and the four canonical spatial orientations) at different spatial scales. We compute center-surround contrasts across scales, similar to those found in the human retina and primary visual cortex, yielding six “feature maps” for each of the seven features. In a series of non-linear normalization and summation steps, the feature maps are combined into one “saliency map”, in which only a sparse set of salient locations remains active (Fig. 3c). In a winner-take-all neural network, these locations compete for saliency, yielding one winner. This winning location is then inhibited (inhibition of return), and the competition continues, thus producing a sequence of the most salient locations in the frame. An example of the resulting scan path is shown in Fig. 3d.
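The selection loop at the end of this step can be illustrated with a minimal Python sketch. It is not the C++ implementation described above: the winner-take-all network is replaced by a plain maximum search, inhibition of return is approximated by zeroing a square neighbourhood, and the names `scan_path` and `inhibit_radius` are hypothetical.

```python
def scan_path(saliency, n_fixations, inhibit_radius=1):
    """Return up to n_fixations salient (row, col) locations, most salient first."""
    # Work on a copy so the caller's saliency map is left untouched.
    s = [row[:] for row in saliency]
    height, width = len(s), len(s[0])
    path = []
    for _ in range(n_fixations):
        # Stand-in for the winner-take-all network: take the global maximum.
        value, r, c = max(
            (s[r][c], r, c) for r in range(height) for c in range(width)
        )
        if value <= 0:
            break  # nothing salient is left in the map
        path.append((r, c))
        # Inhibition of return (assumed form): suppress a square
        # neighbourhood around the winner so the next winner lies elsewhere.
        for rr in range(max(0, r - inhibit_radius), min(height, r + inhibit_radius + 1)):
            for cc in range(max(0, c - inhibit_radius), min(width, c + inhibit_radius + 1)):
                s[rr][cc] = 0.0
    return path
```

Called on a small map with three separated peaks, the function returns the peaks in order of decreasing saliency and stops early once the map is exhausted.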

From a binary version of the attended objects, we derive a number of low-level properties that we use for classifying the objects into “interesting” and “boring” objects. Having identified “interesting” objects in single frames, we track objects (“visual events”) across frames using two independent linear Kalman filters for the x and the y coordinates. If an object cannot be assigned to an already existing pair of Kalman filters, a new tracker is initialized for this object. We prune “noisy” objects by discarding objects tracked for fewer than five frames.
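The tracking scheme can be sketched in Python as follows. Only the use of two independent linear Kalman filters per object, the creation of a new tracker for unassigned objects, and the five-frame pruning rule come from the description above. The constant-velocity motion model, the noise variances, the greedy nearest-neighbour assignment, and the `gate` distance are assumptions made for illustration, and all names are hypothetical.

```python
class Kalman1D:
    """Kalman filter for one coordinate (assumed constant-velocity model)."""

    def __init__(self, position, process_var=1.0, measurement_var=4.0):
        self.pos, self.vel = float(position), 0.0
        # Covariance of (pos, vel); the initial velocity is very uncertain.
        self.p = [[measurement_var, 0.0], [0.0, 100.0]]
        self.q, self.r = process_var, measurement_var

    def predict(self):
        """Advance the state by one frame and return the predicted position."""
        (p00, p01), (p10, p11) = self.p
        self.pos += self.vel
        # P <- F P F^T + Q with F = [[1, 1], [0, 1]] and Q = diag(q, q).
        self.p = [
            [p00 + p10 + p01 + p11 + self.q, p01 + p11],
            [p10 + p11, p11 + self.q],
        ]
        return self.pos

    def update(self, measured):
        """Correct the state with a measured position."""
        (p00, p01), (p10, p11) = self.p
        gain_pos = p00 / (p00 + self.r)
        gain_vel = p10 / (p00 + self.r)
        residual = measured - self.pos
        self.pos += gain_pos * residual
        self.vel += gain_vel * residual
        # P <- (I - K H) P with H = [1, 0].
        self.p = [
            [(1 - gain_pos) * p00, (1 - gain_pos) * p01],
            [p10 - gain_vel * p00, p11 - gain_vel * p01],
        ]


class Track:
    """One visual event: a pair of independent filters for x and y."""

    def __init__(self, x, y):
        self.fx, self.fy = Kalman1D(x), Kalman1D(y)
        self.frames = 1  # number of frames in which the object was observed


def track_events(detections_per_frame, gate=10.0, min_frames=5):
    """Assign detections to tracks frame by frame; drop short-lived tracks."""
    tracks = []
    for detections in detections_per_frame:
        predictions = [(t, t.fx.predict(), t.fy.predict()) for t in tracks]
        claimed = set()
        for x, y in detections:
            # Assumed assignment rule: nearest unclaimed track within the gate.
            best = None
            for t, px, py in predictions:
                dist = ((x - px) ** 2 + (y - py) ** 2) ** 0.5
                if id(t) not in claimed and dist <= gate and (best is None or dist < best[0]):
                    best = (dist, t)
            if best is None:
                tracks.append(Track(x, y))  # no match: start a new tracker
            else:
                track = best[1]
                track.fx.update(x)
                track.fy.update(y)
                track.frames += 1
                claimed.add(id(track))
    # Prune "noisy" objects tracked for fewer than min_frames frames.
    return [t for t in tracks if t.frames >= min_frames]
```

The sketch never retires a track that stops receiving detections; a practical tracker would also need a rule for that, which the description above does not specify.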

We are learning from the human annotators, with their long experience and training, which of the visual events they consider interesting and which they do not. We ask annotators to view video clips in which the algorithm has marked all potentially “interesting” events. Using the annotators’ feedback, we can learn by example a notion of how “interesting” an event is, based on the low-level object properties.

Once we have established the visual events and decided which ones are “interesting”, we can mark them in the video by drawing the bounding box of each object into the frames (Fig. 3f). Occasionally, the video has several seconds or even minutes in which nothing salient appears. We now also have the option of omitting those long stretches of “boring” video rather than passing them on to the annotators.
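Finding the stretches to omit amounts to locating long runs of frames without any “interesting” event. The following Python sketch shows one way to do this. The per-frame event counts as input, the function name `boring_stretches`, and the `min_length` cutoff are assumptions, since no minimum length is stated above.

```python
def boring_stretches(event_counts, min_length):
    """Return (start, end) frame ranges, end exclusive, that contain no
    events and span at least min_length frames."""
    stretches = []
    start = None  # first frame of the current event-free run, if any
    for i, count in enumerate(event_counts):
        if count == 0:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                stretches.append((start, i))
            start = None
    # Close a run that extends to the end of the clip.
    if start is not None and len(event_counts) - start >= min_length:
        stretches.append((start, len(event_counts)))
    return stretches
```

At 30 frames per second, the rate implied by the 300-frame, ten-second clips described in the evaluation, a `min_length` of 90 would skip only event-free stretches of three seconds or more.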

A crucial part of our analysis is the discrimination between “interesting” and “boring” visual events in the underwater video. To evaluate our approach to this problem, we captured ten video clips, each ten seconds (300 frames) long, from video material recorded during a midwater dive by ROV Ventana on April 30, 1999 in Monterey Bay. We pre-processed the videos according to the procedure described above, marked all potentially “interesting” events in the video frames (see Fig. 3f), and stored the medium-level properties for the objects (see Fig. 3e). A video annotator was then asked to decide which of the events they would find “interesting” and which they would not.

We examined the object properties of the events that the annotator marked as “interesting” and found that the covariance of the object pixels with respect to the centroid, uxy, is a good indicator of how “interesting” an event is. In Fig. 4, we show the ROC (Receiver Operating Characteristics) for this discrimination. With a threshold of |uxy| = 0.4, we obtained a true positive rate of 93.3% with only 9.6% false positives. The unexpectedly good discrimination power of only one of the twelve properties that we extract from the objects makes us confident that we can further improve the discrimination performance by learning a measure of how “interesting” an event is from all available properties.
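The two quantities used in this evaluation can be illustrated with a short Python sketch. Several details are assumptions because they are not specified above: the covariance is computed as the mean product of pixel offsets from the centroid without any further normalization, and an event is called “interesting” when the absolute score is at or above the threshold. The function names are hypothetical, and the sketch does not reproduce the reported figures.

```python
def pixel_covariance(mask):
    """Covariance of x and y over the set pixels of a binary mask, taken
    about their centroid. Assumes the mask has at least one set pixel and
    applies no further normalization (an assumption)."""
    pixels = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    n = len(pixels)
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    return sum((x - cx) * (y - cy) for x, y in pixels) / n


def rates(scores, labels, threshold):
    """True and false positive rates for one threshold.

    Assumed decision rule: abs(score) >= threshold means 'interesting'.
    labels holds the annotator's decisions (True for 'interesting') and
    must contain at least one True and one False entry.
    """
    predicted = [abs(s) >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives
```

Evaluating `rates` over a range of thresholds and plotting the true positive rate against the false positive rate gives an ROC curve of the kind described above.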

We are encouraged by these first results, and we are currently developing a more complex measure for the distinction between “interesting” and “boring” events in the video.

1. Itti, L., C. Koch, and E. Niebur, A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998. 20(11): p. 1254-1259.

2. Itti, L. and C. Koch, Target Detection using Saliency-Based Attention, in Proc. RTO/SCI-12 Workshop on Search and Target Acquisition (NATO Unclassified). 1999. Utrecht, The Netherlands.

Figure 3. Processing steps for the automated video analysis. (a) video frame captured by the PC with video editing card; (b) the result of the pre-processing. In this image, scan lines were smoothed out, and the average of the preceding ten frames was subtracted. The image contrast is enhanced for illustration purposes; (c) saliency map computed from image (b); (d) scan path of the saliency model overlaid on top of the image; (e) object masks that were extracted at the salient locations, with the bounding box (in gray) and the major (red) and minor (light blue) axes marked; (f) video frames with the objects marked that are part of “interesting” events.

Figure 4. Receiver Operating Characteristics (ROC) for the discrimination between “interesting” and “boring” events based on the covariance uxy. The blue line marks the best performance at a threshold of |uxy| = 0.4.

