Caltech
Center for Neuromorphic Systems Engineering


Decomposition of Human Motion into Dynamics Based Primitives with Application to Drawing Tasks
Domitilla Del Vecchio, Richard Murray, Pietro Perona

Abstract. Using tools from dynamical systems and systems identification we develop a framework for the study of primitives for human motion, which we refer to as movemes. The objective is understanding human motion by decomposing it into a sequence of elementary building blocks that belong to a known alphabet of dynamical systems. We develop a segmentation and classification algorithm in order to reduce a complex activity into the sequence of movemes that have generated it. We test our ideas on data sampled from five human subjects who were drawing figures using a computer mouse. Our experiments show that we are able to distinguish between movemes and recognize them even when they take place in activities containing an unspecified number of movemes.

Introduction. Building systems that can detect and recognize human actions and activities is an important goal of modern engineering. Applications range from human-machine interfaces to security to entertainment. With the development of information technology we can expect that computer systems will be increasingly embedded in our environment, so that human-machine interaction will need interfaces that are easier to use and more natural. Since humans use their visual and auditory systems to communicate, several works (see for example [10, 20] and the earlier work on building human-machine interfaces using vision [7, 14, 23, 24, 21]) ask whether it is possible to develop computerized equipment able to communicate with humans in a similar way. As described extensively in [4], there is also an immediate need for automated surveillance systems in commercial, law enforcement, and military applications.

A fundamental problem in detecting and recognizing human action is one of representation. Our point of view is that human activity should be decomposed into building blocks which belong to an “alphabet” of elementary actions; for example the activity “answering the phone” could be decomposed into the sequence “step-step-step-reach-lift”, where “step”, “reach” and “lift” may not be further decomposed. We refer to these primitives of motion as movemes. Our aim is then to build an alphabet of movemes, which one can compose to represent and describe human motion in much the same way that phonemes are used in speech. The word “moveme”, in the sense of a primitive of motion, was introduced in [3]. That work studied periodic or stereotypical motions such as walking or running, where the motion is always the same, and therefore its movemes, like phonemes, were repeatable segments of trajectory. Goncalves et al. [6] studied motions that are parametrized by an initial condition and a target, such as “reach”, which requires the specification of a target location. They proposed that movemes ought to be parametrized by goal and style parameters. Their moveme models are phenomenological and non-causal. In this paper we attempt to define movemes in terms of causal dynamical systems. This approach opens the possibility of dealing with problems like prediction, and leads to more compact models parametrized by a small number of parameters. Moreover, the dynamical systems framework allows us to use a set of mathematical tools for determining analytically the performance of the proposed algorithms.

The idea of dynamical primitives of motion has also appeared in neurobiology studies. Bizzi and Mussa-Ivaldi [2] ask whether the motor behavior of vertebrates is based on simple units (motor primitives) that can be combined flexibly to accomplish a variety of motor tasks, and experiments have provided evidence for a modular organization of the spinal cord in frogs and rats. Mussa-Ivaldi et al. [15] ran experiments which showed that the fields induced by the focal activation of the spinal cord follow a principle of vectorial summation, so that a variety of motor control policies can be obtained from a simple linear combination of a few control modules. Experimental results in [9] and [5] support the idea that kinematic and dynamic internal models are utilized in movement planning and control. The “internal model” hypothesis proposes that the brain acquires an inverse dynamic model of the object to be controlled through motor learning, after which motor control can be executed mostly in a feed-forward manner. Thus, dynamics appears to play an important role in the description of human motion.

What is the alphabet of movemes? Which dynamical models should we use to represent them? Can a continuous trajectory of a human body be decomposed automatically into its component movemes? To answer these questions we take a relatively abstract point of view, so as to find a representation framework that may apply to situations where dynamical evolution and switching between different dynamical modes come into play. We introduce a formal definition of a moveme and set up the classification and segmentation problem so that it can be appropriately formalized in a dynamical systems framework. Standard system identification tools and stability arguments can then be applied to derive an analytical error analysis for the proposed algorithm, so as to obtain performance estimates in the presence of noise and modeling uncertainties. Finally we present some experimental results on human drawing data. Even though the particular example considered can be solved in other ways, it is meant to show how the developed techniques can be used in a practical and simple application characterized by modeling uncertainty, noise, and subject variability.

The problem of segmenting data streams originating from different unknown or partially known processes which alternate in time is a general problem of interest to various areas, see for example [8, 11, 22]. We propose a solution to the problem in our particular scenario in which each one of the segments has been generated from the perturbed version of a linear dynamical system belonging to a finite known set of possible linear models. By using system identification techniques [12, 18] and pattern recognition techniques [1, 19] we develop an off-line joint segmentation and classification algorithm and provide analytical error analysis. The dynamical systems representation for describing human motion is not a novel idea; some sample citations include [17, 13, 16]. Our contribution lies mainly in the development of a joint classification-segmentation algorithm, based on a priori given classes of motion (the moveme alphabet), and characterized by a detailed error analysis.
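The setting described above can be illustrated with a small numerical sketch. It is not the algorithm of the paper: it assumes noisy first-order discrete-time models x[k+1] = A x[k] + w[k], a hypothetical two-element alphabet ("rotate" and "decay"), exactly one switch between movemes, and an exhaustive search over the switch time that minimizes the summed one-step prediction error. All names and numerical values below are illustrative assumptions.

```python
import numpy as np


def simulate(A, x0, n, noise, rng):
    """Generate n samples of x[k+1] = A x[k] + w[k], starting from x0."""
    x = np.asarray(x0, dtype=float)
    out = [x]
    for _ in range(n - 1):
        x = A @ x + noise * rng.standard_normal(x.shape)
        out.append(x)
    return np.array(out)


def residual(A, seg):
    """Sum of squared one-step prediction errors of model A on a segment."""
    if len(seg) < 2:
        return 0.0
    pred = seg[:-1] @ A.T  # row k holds A @ seg[k]
    return float(np.sum((seg[1:] - pred) ** 2))


def classify(alphabet, seg):
    """Return (label, cost) of the model in the alphabet that fits seg best."""
    costs = {name: residual(A, seg) for name, A in alphabet.items()}
    best = min(costs, key=costs.get)
    return best, costs[best]


def segment_two(alphabet, traj, min_len=5):
    """Jointly pick one switch index and two labels by exhaustive search.

    Assumption: the trajectory contains exactly two movemes.
    Sample t is shared: it ends the first segment and starts the second.
    """
    best = None
    for t in range(min_len, len(traj) - min_len):
        label1, cost1 = classify(alphabet, traj[: t + 1])
        label2, cost2 = classify(alphabet, traj[t:])
        if best is None or cost1 + cost2 < best[0]:
            best = (cost1 + cost2, t, label1, label2)
    return best[1], best[2], best[3]


def demo(seed=0):
    """Build a synthetic two-moveme trajectory and segment it."""
    rng = np.random.default_rng(seed)
    c, s = np.cos(0.2), np.sin(0.2)
    # Hypothetical alphabet of two linear models (not taken from the paper).
    alphabet = {
        "rotate": np.array([[c, -s], [s, c]]),
        "decay": 0.9 * np.eye(2),
    }
    first = simulate(alphabet["rotate"], [1.0, 0.0], 40, 0.01, rng)
    second = simulate(alphabet["decay"], first[-1], 41, 0.01, rng)[1:]
    traj = np.vstack([first, second])  # true switch at sample index 39
    return segment_two(alphabet, traj)


if __name__ == "__main__":
    print(demo())
```

Because both the switch time and the labels are chosen to minimize a single cost, segmentation and classification are solved jointly rather than in two separate passes. Handling an unspecified number of movemes, as in the experiments, would require a more general search than this single-switch sketch.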

The experimental results show that the performance of the proposed algorithm is about 90% on our data set when training and testing are performed on data coming from distinct subjects. This suggests that the movemes considered are subject-invariant on our data set. Subject-invariance is not a property that we can prove formally; it requires experimental verification. The results we obtain on 2D motions are encouraging in this respect.

The formalism that we introduced is directly applicable to the higher-dimensional case of full-body motion. If one compares it with previous work (e.g. the linear/quadratic input-output maps of [6]), one notices that our causal dynamical systems approach requires far fewer parameters for describing a moveme; hence it promises to require fewer training examples and allow for better generalization. Challenges in extending our results to three-dimensional (3D) motion, which the current paper does not address, include the scalability of the approach, how to segment involuntary actions, and how to link moveme chains into meaningful activities. Additional work is also required to address issues such as the dependency on the number of training examples and the user-dependence of the movemes in a more complex, three-dimensional experimental setting.

Furthermore, it would be interesting to generalize the current segmentation and classification algorithm to the on-line case. In the on-line setting it would be useful to consider the prediction problem, that is, predicting the next action (or actions) on the basis of what has already happened. Moreover, exploring different classes of dynamical systems may help to model human motion with greater accuracy. The extent to which models are user-independent, and the extent to which we need to train on different individuals, should also be addressed.

At a higher level of abstraction, the idea of finding a “language” in which to specify what is possible and what is not seems promising. For example, we know that in the sequence “step-step-reach-lift” for answering the phone, it is not possible to lift the phone before having reached it. Conditions of this kind could determine a model that gives structure to the way in which movemes can be composed. A clear advantage of having such a model is that it could give feedback to the segmentation and classification algorithm so as to increase its robustness.
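One very simple way to encode such a condition is a precedence table that lists, for each moveme, the movemes that must already have occurred. The sketch below is only an illustration under that assumption; the table `REQUIRES` and the function `is_valid` are hypothetical names, and the paper does not specify a particular language model.

```python
# Hypothetical precedence rule: "lift" is allowed only after a "reach".
REQUIRES = {"lift": {"reach"}}


def is_valid(sequence, requires=REQUIRES):
    """Check that every moveme appears only after its prerequisites."""
    seen = set()
    for moveme in sequence:
        # Reject the sequence if a prerequisite has not occurred yet.
        if not requires.get(moveme, set()) <= seen:
            return False
        seen.add(moveme)
    return True


if __name__ == "__main__":
    print(is_valid(["step", "step", "reach", "lift"]))
    print(is_valid(["step", "lift", "reach"]))
```

A check of this kind could be used to discard candidate label sequences that violate the rules before the lowest-cost segmentation is selected.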

References
[1] C.M. Bishop. Neural Networks for Pattern Recognition. Clarendon, Oxford, 1995.

[2] E. Bizzi and F.A. Mussa-Ivaldi. Toward a neurobiology of coordinate transformations. New Cog. Neuroscience, MIT Press, Cambridge, MA:489-500, 1999.

[3] C. Bregler and J. Malik. Learning and recognizing human dynamics in video sequences. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 568-574, Puerto Rico, 1997.

[4] R.T. Collins, A. J. Lipton, and T. Kanade. Introduction to the special section on video surveillance. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22:745-746, August 2000.

[5] J.R. Flanagan and A.M. Wing. The role of internal models in motion planning and control: evidence from grip force adjustments during movements of hand-held loads. The Journal of Neuroscience, 17:1519-1528, 1997.

[6] L. Goncalves, E. Di Bernardo, and P. Perona. Reach out and touch space (motion learning). In Proc. of the Third International Conference on Automatic Face and Gesture Recognition, pages 234-239, Nara, Japan, April 14-16 1998.

[7] L. Goncalves, E. Di Bernardo, E. Ursella, and P. Perona. Monocular tracking for human arm in 3d. In Proc. of the 7th Int. Conference on Computer Vision, ICCV, pages 764-770, 1995.

[8] F. Gustafsson. Adaptive Filtering and Change Detection. John Wiley & Sons, 2000.

[9] M. Kawato. Internal models for motor control and trajectory planning. Current Opinion in Neurobiology, 9:718-727, 1999.

[10] I. Laptev and T. Lindeberg. Tracking of multi-state hand models using particle filtering and a hierarchy of multi-scale image features. In IEEE Workshop on Scale-Space and Morphology, pages 63-74, Vancouver, Canada, July 2001.

[11] M. Lavielle. Optimal segmentation of random processes. IEEE Trans. on Signal Processing, 46:1365-1373, May 1998.

[12] L. Ljung. System Identification. Prentice Hall, New Jersey, 1999.

[13] C. Lu, H. Liu, and N.J. Ferrier. Multidimensional motion segmentation and identification. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 629-636, Hilton Head Island, South Carolina, 2000.

[14] M.E. Munich and P. Perona. Visual input for pen-based computers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:313-328, March 2002.

[15] F.A. Mussa-Ivaldi, S.F. Giszter, and E. Bizzi. Linear combinations of primitives in vertebrate motor control. Proc. of the National Academy of Sciences, 91:7534-7538, 1994.

[16] D. Ormoneit, T. Hastie, and M.J. Black. Functional analysis of human motion data. In Proc. 5th World Congress of the Bernoulli Society for Probability and Mathematical Statistics and 63rd Annual Meeting of the Institute of Mathematical Statistics, Guanajuato, Mexico, 2000.

[17] V. Pavlovic and J.M. Rehg. Impact of dynamic model learning on classification of human motion. In IEEE Conf. Computer Vision and Pattern Recognition, Hilton Head Island, 2000.

[18] T. Söderström and P. Stoica. System Identification. Prentice Hall. Hemel Hempstead, 1989.

[19] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995.

[20] S. Waldherr, S. Thrun, R. Romero, and D. Margaritis. Template-based recognition of pose and motion gestures on a mobile robot. In Proc. of the AAAI 15th National Conference on Artificial Intelligence, pages 977-982, 1998.

[21] P. Wellner. The digital desk calculator: Tactile manipulator on a desk top display. In Proc. of the ACM Symposium on User Interface and Technology, pages 27-33, Hilton Head, November 1991.

[22] A.S. Willsky and H.L. Jones. A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Trans. on Automatic Control, 21:108-112, February 1976.

[23] A. Wilson and A. Bobick. Learning visual behavior for gestures analysis. In Proc. of IEEE Symposium on Computer Vision, pages 229-234, Coral Gables, FL, November 1995.

[24] Y. Yacoob and L. Davis. Recognizing human facial expressions from long image sequences using optical flow. IEEE Trans. on Pattern Analysis and Machine Intelligence 18(6), pages 636-642, 1996.

