Caltech
Center for Neuromorphic Systems Engineering

Home
Research
News
People

[back]

Object Recognition by Probabilistic Hypothesis Construction
Pierre Moreels, Michael Maire, Pietro Perona


Abstract. We present a probabilistic framework for recognizing objects in images of cluttered scenes. Hundreds of objects may be considered and searched in parallel. Each object is learned from a single training image and is modeled by the visual appearance of a set of features, as well as their position with respect to a common reference frame. The recognition process computes both the identity and position of objects in the scene by computing the best interpretation (or hypothesis) of the scene in the light of a database of known objects. A hypothesis pairs features in an input image either with features in the database or marks them as clutters. Each hypothesis may be scored in a principled way using a generative model of the image which is defined using the learned objects as well as a model for clutter. While the space of all possible hypotheses is enormously large, one may find the best hypothesis efficiently -- we explore a couple of heuristics to do so. In our initial experiments our algorithm compares favorably with state-of-the-art recognition systems.

Motivation. In the computer vision literature there is broad agreement that objects and object categories should be represented as collections of parts (or features) which appear in a given mutual position or shape (e.g. side-by-side eyes, a nose below them etc). There is, however, disagreement as to the best tradeoff in this design space. On one hand, one may wish to represent the appearance and position of parts in a careful probabilistic framework, which allows one to generate principled learning and detection algorithms. One example of this approach is the “constellation model” [1] which has been successfully applied to unsupervised learning and recognition of object categories amongst clutter. This approach is penalized by a large number of parameters that are needed to represent appearance and shape and by algorithmic complexity. As a result there is a practical limit to the size of the models that one can use, typically limiting the number of object parts below 10. On the other hand, one finds in the literature models containing hundreds of parts. In this case the authors dramatically simplify the way appearance and position are modeled as well as the algorithms used to learn and match models to images. A representative of this approach is David Lowe's algorithm [2] which can recognize simultaneously and quickly multiple individual objects (as opposed to categories).

We are interested in exploring whether probabilistically rigorous modeling may be extended to yield practical data structures and algorithms for models that contain hundreds of features. To this end, we modify the constellation model [1] to incorporate a number of attractive features presented by Lowe [2]: using a KD-tree for efficiently associating features by appearance as well as computing feature positions with respect to a common reference frame rather than with respect to each other. Additionally, we pool representational parameters amongst features. As a result, it is possible to learn models quickly based on a single example; additionally, the system gains the robustness associated with using a large number of features while also offering an expressive probabilistic model for verifying object presence. One additional contribution is exploring efficient algorithms for associating models with images that are based on this probabilistic model and the A* search technique [3].

Figure 1. Sketch of hypothesis build:

a) Initialization: The closest match in the database is identified for each scene feature.

b) Search for a new pairing: starting from the latest assignment, we look for an unassigned feature in the model image. This feature is mapped to its best appearance match in the scene, under the condition that this new pairing is coherent with the pose predicted by the hypothesis. The pose is then updated based on the new match.

Main Results. We compare the performance of our probabilistic search method to that of Lowe's algorithm on a training set consisting of 100 images of toys and common kitchen items, with a single image per object. The test set contained images of single objects and complicated scenes that include several objects. We used a resolution of 480x320 for training images and test images of single objects, and 800x533 for complex scenes.

Our algorithm achieved a performance similar to Lowe's system, with a detection rate of 85\%. Since our method verifies the coherence of each match by scoring the resulting partial hypothesis, our false alarm rate was lower than that of Lowe's method.

In order to measure the accuracy of the pose transformations estimated by each method, the training and test images were manually marked with ground truth information. An ellipse was fitted around objects present in a scene, and a canonical orientation was chosen for each of them. We measured the accuracy of the transformation with the distance, in pixels, between the predicted positions of the ellipses in a scene, and the ground truth previously recorded. The error was averaged across points on the ellipse and across test images. We obtained a mean error of 45 pixels for our method, and 56 pixels for Lowe's algorithm.

Our approach requires examination and valuation of a number of partial and complete hypotheses that are much higher than with Lowe's method. As a result, the probabilistic algorithm is the slower of the two methods. Our unoptimized code for Lowe's method takes in average 2 seconds on a Pentium 4 running at 2.4GHz to identify objects in an 800x533 image, while our probabilistic algorithm requires on average 10 seconds for the same image.

Figure 2. Example of result for a textured object included in a complex scene.

a) Initial object
b) Result of Lowe's algorithm. Since the stuffed bear is a textured object, detection of similar features can occur in many locations, leading to incorrect pairings. As a result, the frame transformation, estimated only from the features positions, is inaccurate.
c) Result of the probabilistic search.

Conclusion. We introduced a probabilistic framework that explains scenes measurements with the presence of objects from a database, and with background detections. Our approach benefits from a high number of matches, which leads to a high confidence on the presence of objects in the scene. The valuation of each complete or partial hypothesis allows us to verify the coherence of each tentative new pairing, thus leading to a lower false alarm rate than that obtained by Lowe.

In the current implementation, the parameters of the location terms in the foreground density are set to arbitrary values. Future work will include actual learning of those parameters from several observations of the same features across multiple images. Independent parameters should ultimately be used for each feature in the database.

Another direction for future research is the relative importance of pose and appearance in the incremental hypothesis build process. When a hypothesis is initialized and consists of a single match, the uncertainty in the predicted frame position is very high, and decreases as more pairings are added. A next version of the hypothesis search method will put a higher weight on appearance-based pairings at the beginning of the search, and a higher weight on location verification when numerous correspondences have been chosen.

References
[1]. D.G. Lowe, “Object recognition from local scale-invariant features”, Proc. Int. Conf. Comp. Vision, Corfu, Greece, 1999
[2]. M. Weber, M. Welling and P. Perona, “Unsupervised learning of models for recognition”, Proc. 6th Europ. Conf. Comp. Vis., 2000.
[3]. J. Pearl, “Heuristics”, Addison-Wesley, 1984.


top