Motion Based Recognition
Research Thesis
Submitted in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy
Roman Goldenberg
Submitted to the Senate of the Technion – Israel Institute of Technology
Nisan 5762
Haifa
April 2002
This research was carried out in the Department of Computer Science under
the supervision of Prof. Ehud Rivlin and Prof. Ron Kimmel.
While realizing that I am stepping on shaky ground in trying to write down a list of people I would like to thank, I'll do it anyway. Even though I know that the list will never be complete, that most of the people mentioned here will never read this, and that it is not the best way to express my gratitude, I'll do it primarily for myself. Maybe some day it will help me refresh my memories of the good old days.
I would like to thank my advisors Prof. Ehud Rivlin and Prof. Ron Kimmel for their
devoted guidance through all these years.
I would like to thank my co-authors Dr. Michael Rudzsky from the Technion, Dr. Zoran Duric from George Mason University and Prof. Azriel Rosenfeld from the Center for Automation Research. The work presented here is the result of our common effort.
I wish to thank Prof. Alfred Bruckshtein and Prof. Irad Yavneh from the Technion for the intriguing conversations and bright ideas they shared with us.
I am very grateful to Alexander Berengolts, Ilya Blayvas, Constantin Brif, Alexander Brook, Lev Finkelshtein, Kirill Gokhberg, Eyal and Noam Gordon, Lev Gorelik, Alexander Gutkin, and Yana Katz for their help, advice and simply for being there when I needed it.
The generous financial help of the Technion, VATAT, and the Applied Materials company is gratefully acknowledged.
Contents

Abstract

List of Symbols and Abbreviations

1 Introduction
  1.1 Motivation
    1.1.1 Psychological aspects
    1.1.2 Motion related problems
    1.1.3 Motion modeling
  1.2 Main Results and Contributions
  1.3 Background and Related work
    1.3.1 Motion Segmentation and Tracking
    1.3.2 Classification and Recognition
      1.3.2.1 Detecting and Recognizing Motions
        Recognizing Activities
        Detecting Periodic Activities
        Recognizing Periodic Activities
        Parametrized Modeling for Activity Recognition
        Motion Events
      1.3.2.2 Motion Based Object Classification
        Temporal Texture Analysis
        Dynamics Based Object Classification
    1.3.3 From Motion towards Natural Language Description

2 Rigid Motion
  2.1 Introduction
    2.1.1 Interpreting Relative Vehicle Motion in Traffic Scenes
  2.2 Motion Models of Highway Vehicles
    2.2.1 Real Vehicle Motion
    2.2.2 Departures of Vehicle Motion from Smoothness
    2.2.3 The Sizes of the Smooth and Non-smooth Velocity Components
    2.2.4 Camera Motion
    2.2.5 Independently Moving Vehicles
  2.3 Image Motion
    2.3.1 The Imaging Models
    2.3.2 The Image Motion Field and the Optical Flow Field
    2.3.3 Estimation of Rotation
    2.3.4 Estimating the FOE
    2.3.5 The Image Motion Field of an Observed Vehicle
    2.3.6 Estimating Vehicle Motion from Normal Flow
  2.4 Experiments
    2.4.1 Road and Vehicle Detection
    2.4.2 Stabilization
    2.4.3 Relative Motions of Vehicles
  2.5 Related Work
    2.5.1 Understanding Object Motion
    2.5.2 Independent Motion Detection
    2.5.3 Vehicle Detection and Tracking
    2.5.4 Road Detection and Shape Analysis
  2.6 Conclusions

3 Non-Rigid Motion - Tracking
  3.1 Introduction
  3.2 Snakes and Active Contours
  3.3 From snakes to geodesic active contours
    3.3.1 Level set method
    3.3.2 The AOS scheme
    3.3.3 Re-initialization by the fast marching method
  3.4 Edge indicator functions for color and movies
    3.4.1 Edges in Color
    3.4.2 Tracking objects in movies
  3.5 Implementation details
  3.6 Experimental Results
  3.7 Conclusions

4 Non-Rigid Objects - Motion Based Classification
  4.1 Introduction
    4.1.1 General Framework
  4.2 Futurism and Periodic Motion Analysis
  4.3 Related Work
  4.4 Our Approach
    4.4.1 Segmentation and Tracking
    4.4.2 Periodicity Analysis
    4.4.3 Frame Sequence Alignment
      4.4.3.1 Spatial Alignment
      4.4.3.2 Temporal Alignment
    4.4.4 Parameterization
  4.5 Conclusions

5 Articulated Motion
  5.1 Introduction
  5.2 Background subtraction
  5.3 Edge detection
  5.4 Normal flow
  5.5 Grouping edge pixels into edge segments
  5.6 Grouping edge segments
  5.7 Multiple frames approach
  5.8 Conclusions

6 Discussion

Bibliography
List of Figures

1.1  Extraction of motion information from a sequence of frames (from [20]).
1.2  Total motion magnitude feature vector for a sample of walk (top) and a sample of run (bottom). One cycle of activity is divided into six time divisions shown horizontally. Each frame shows the spatial distribution of motion in a 4x4 spatial grid; the size of each square is proportional to the amount of motion in the neighborhood (from [92]).
1.3  The parameterized modeling and recognition (from [127]).
1.4  Image sequence of "walking", five part tracking including self-occlusion and three sets of five signals (out of 40) recovered during the activity (torso, thigh, calf, foot and arm) (from [66]).
1.5  (a) A graphic depiction of the pose function for the hopping of a kangaroo; (b) a graphic depiction of the pose function for the hopping of a rabbit (from [107]).
1.6  Single frame of a real-world traffic scene recorded at an intersection. Vehicle trajectories obtained by automatic detection and model-based tracking are shown together with the vehicle models used (from [70]).
2.1  The Darboux frame moves along the path Γ which lies on the surface Σ.
2.2  A possible motion of the base of a vehicle as the vehicle hits a bump.
2.3  The plane perspective projection image of P is F = f(X/Z, Y/Z, 1); the weak perspective projection image of P is obtained through the plane perspective projection of the intermediate point P1 = (X, Y, Zc) and is given by G = f(X/Zc, Y/Zc, 1).
2.4  (a-d) A selected frame from each of four sequences. Top: the input images. Middle: results of road detection. Bottom: results of vehicle detection.
2.5  (a-b) Two images taken 1/15th of a second apart; (c-d) their normal flows. One can see the effects of bumps. In the first frame, the flow vectors point downward; in the second, they point upward.
2.6  Stabilization results for one frame from each of three sequences: (a) Input frame. (b) Normal flow. (c) Rotational normal flow. (d) Translational normal flow.
2.7  Identification of distant image points in frames from two sequences.
2.8  Frames 0, 30, and 60 of a sequence showing two vehicles accelerating. The normal flow results are shown below the corresponding image frames.
2.9  Motion analysis results for the acceleration sequence. U, V, W are the scaled (by an unknown distance Zc^-1) components of the relative translational velocity.
2.10 Frames 5, 15, 25, and 35 of the van sequence. The normal flow results are shown below the corresponding image frames.
2.11 Motion analysis results for the van sequence. U, V, W are the scaled (by an unknown distance Zc^-1) components of the relative translational velocity.
2.12 Frames 1, 14 and 26 of the Italian sequence. The normal flow results are shown below the corresponding image frames.
2.13 Motion analysis results for the Italian sequence. U, V, W are the scaled (by an unknown distance Zc^-1) components of the relative translational velocity.
2.14 Frames 1, 25 and 48 of the Haifa sequence. The normal flow results are shown below the corresponding image frames.
2.15 Motion analysis results for the Haifa sequence. U, V, W are the scaled (by an unknown distance Zc^-1) components of the relative translational velocity.
3.1  Two arrays of linked lists used for narrow band representation.
3.2  Curvature flow by the proposed scheme. A non-convex curve vanishes in finite time at a circular point by Grayson's Theorem. The curve evolution is presented for two different time steps. Left: τ = 20; Right: τ = 50.
3.3  Curvature flow CPU time for the explicit scheme and the implicit AOS scheme. First, the whole domain is updated, next, the narrow band is used to increase the efficiency, and finally the AOS speeds the whole process. For the explicit scheme the maximal time step that still maintains stability is chosen. For the AOS scheme, CPU times for several time steps are presented.
3.4  Multiple objects segmentation in a static color image.
3.5  Gray matter segmentation in an MRI brain image. The white contour converges to the outer cortical surface and the black contour converges to the inner cortical surface.
3.6  Tracking a cat in a color movie by the proposed scheme. Top: segmentation of the cat in a single frame. Bottom: tracking the walking cat in the 50 frame sequence.
3.7  Tracking two people in a color movie. Top: curve evolution in a single frame. Bottom: tracking two walking people in a 60 frame movie.
4.1  'Dynamism of a Dog on a Leash', 1912, by Giacomo Balla. Albright-Knox Art Gallery, Buffalo.
4.2  (a) running horse video sequence, (b) first 10 eigen-shapes, (c,d) first and second eigen-shapes enlarged, (e) 'The Red Horseman', 1914, by Carlo Carra, Civico Museo d'Arte Contemporanea, Milan.
4.3  Non-rigid moving object segmentation and tracking.
4.4  Global motion characteristics measured for walking man sequence.
4.5  Global motion characteristics measured for walking cat sequence.
4.6  Deformations of a walking man contour during one motion period (step). Two steps synchronized in phase are shown. One can see the similarity between contours in corresponding phases.
4.7  Inter-frame correlation between object silhouettes. Four graphs show the correlation measured for four initial frames.
4.8  Scale alignment. A minimal square bounding box around the center of the segmented object silhouette (a) is cropped and re-scaled to form a 50 × 50 binary image (b).
4.9  Temporal alignment. Top: original 11 frames of one period subsequence. Bottom: re-sampled 10 frame sequence.
4.10 Temporal shift alignment: (a) average starting frame of all the training set sequences, (b) temporally shifted single-cycle test sequence, (c) the correlation between the reference starting frame and the test sequence frames.
4.11 The principal basis for the 'dogs and cats' training set formed of the 20 first principal component vectors.
4.12 Distances to the 'people' and 'dogs and cats' feature spaces from more than 1000 various images of people, dogs and cats.
4.13 Image sequence parameterization. Top: 11 normalized target images of the original sequence. Bottom: the same images after the parameterization using the principal basis and back-projecting to the image basis. The numbers are the norms of the differences between the original and the back-projected images.
4.14 Feature points extracted from the sequences with walking, running and galloping dogs and walking cats, projected to the 3-D space for visualization.
4.15 Feature points extracted from the sequences showing people walking and running parallel to the image plane and at a 45 degree angle to the camera. Feature points are projected to the 3-D space for visualization.
4.16 Learning curves for (a) dogs and cats data set and (b) people data set. One-nearest-neighbor classification error rate as a function of the number of eigenvectors in the projection basis.
5.1  Example of background subtraction: (a) Original color image from sequence, (b) Foreground confidence values, (c) Thresholded foreground image.
5.2  Example of edge detection for color images: (a) Original color image from sequence, (b) Edge indicator function values, (c) Edge indicator function after applying a threshold.
5.3  Two consecutive frames from an image sequence (a, b) and the computed normal flow (c).
5.4  Example of grouping edge pixels into edge segments. Different edge segments are in different colors.
5.5  Edge grouping results for six frames from an image sequence. Every group of edge segments is enclosed by a best fitting rectangular bounding box.
5.6  Multiple frames based segmentation results for six frames from an image sequence. Every group of edge segments is enclosed by a best fitting rectangular bounding box.
Abstract
In this work we explore how motion-based visual information can be used to solve a number of well-known computer vision problems such as segmentation, tracking, object recognition and classification, and event detection.
We consider three special cases for which the methods used are quite different: rigid, non-rigid and articulated objects.
For rigid objects we address a problem taken from the traffic domain and show how the relative velocity of nearby vehicles can be estimated from a video sequence taken by a camera installed on a moving car.
For non-rigid objects we present a novel geometric variational framework for image segmentation using the active contour approach. The method is successfully used for segmentation and tracking of moving non-rigid targets in color movies, as well as for a number of other applications such as cortical layer segmentation in 3-D MRI brain images, segmentation of defects in VLSI circuits in electron microscope images, analysis of bullet traces in metals, and others.
Relying on the high-accuracy segmentation and tracking results obtained by the fast geodesic contour approach, we present a framework for moving object classification based on the eigen-decomposition of the normalized binary silhouette sequence. We demonstrate the ability of the system to distinguish between various object classes by their static appearance and dynamic behavior.
Finally, we show how observed articulated object motion can be used as a cue for the segmentation and detection of independently moving parts. The method is based on the analysis of normal flow vectors computed in color space and also relies on a number of geometric heuristics for grouping edge segments.
List of Symbols and Abbreviations
Γ - A smooth trajectory space curve.
Σ - A smooth surface on which Γ lies.
⃗t - The tangent vector to the curve Γ.
⃗v - The second tangent vector to the surface Σ (orthogonal to ⃗t).
⃗s - The normal vector to the surface Σ.
κ - The curvature of the curve Γ.
τ - The torsion of the curve Γ.
κg - The geodesic curvature.
κn - The normal curvature.
τg - The geodesic twist.
ω⃗ - The rotational velocity vector.
T⃗ = (U, V, W) - The translational velocity vector.
R - The rotation matrix.
(X, Y, Z) - A scene point.
(Xc, Yc, Zc) - A reference point for weak perspective projection.
f - A focal length.
C(p) = {x(p), y(p)} - A parameterized 2-D curve.
g(C) - An edge indicator function.
N⃗ - The normal to the curve.
κN⃗ - The curvature vector.
ds = |Cp| dp - The Euclidean arclength.
ϕ(x, y) : IR^2 → IR - An implicit representation of C (the level set function).
MLD - Moving Light Display.
SFM - Structure from Motion.
AOS - Additive Operator Splitting.
PCA - Principal Component Analysis.
EKF - Extended Kalman Filter.
TPS - Trajectory Primal Sketch.
FOE - Focus of Expansion.
CSF - Cerebral Spinal Fluid.
MRI - Magnetic Resonance Imaging.
CT - Computed Tomography.
MHI - Motion History Image.
SVD - Singular Value Decomposition.
DTFS - Distance to Feature Space.
Chapter 1
Introduction
1.1 Motivation
Motion based recognition deals with the recognition of an object and its behavior based on its motion in a sequence of images. Generally, it is an approach that favors the direct use of temporal data to solve a number of well-known computer vision problems such as image segmentation, tracking, object classification and recognition.
While static scene analysis has received more attention, partly for historical reasons and partly due to the technical problems of processing large amounts of data, it turns out that some computer vision problems may be solved more easily by analyzing not a single image, but a sequence of images.
1.1.1 Psychological aspects
The importance of motion information can be demonstrated by looking at biological systems. Motion perception in animals and humans has been studied in psychology for a long time. The experiments with Johansson's moving light displays (MLD) [64], [65] show that the movement of several light spots attached to a moving human body, without any additional visual cues such as contours, color, texture or background, is sufficient for recognizing different types of activities (walking, running, dancing) as well as the gender of the subject. Moreover, using the MLDs one can recognize a person by his/her gait.
Michotte [80] showed that humans are able to describe an underlying physical model of object interaction (pushing, launching, triggering, entraining, etc.) even if the observed scene consists only of simple 2-D geometrical objects on a white background. Since the nature of the objects is unknown, the observer cannot use his experience and must base his judgments mainly on the temporal and spatial parameters of the scene. This can be regarded as evidence that causality can be perceived "directly".
Heider and Simmel in 1944 showed that, in addition to causality, people may also perceive intentionality in the actions of inanimate objects. Moreover, in addition to instantaneous intentions, people may even attribute traits to the observed actors. People can compose a whole "story" based on a film whose only "stars" are two triangles, a circle and a square.
1.1.2 Motion related problems
Motion based recognition has been a computer vision research topic for more than 30 years. It would be very difficult to describe here the whole spectrum of motion-related problems the computer vision community has dealt with. Even if we put aside the Structure from Motion (SFM) class of problems, i.e. the reconstruction of a 3D object structure from multiple views of the moving object, we are still left with a great variety of problems and approaches that use motion information in one way or another. Among these, two major groups can be identified.
1. The primary goal of the first group is the motion itself, i.e. the detection and recognition of different types of motion (such as activities, gestures, maneuvers, behavioral patterns, etc.). At a higher level this may include parsing a stream of primitive dynamic events into more complex concepts like scenarios, revealing a physical explanation of the observed motion, building a natural language description, or even reconstructing a set of rules governing the behavior of the agents in a scene (an underlying algorithm).
2. The second group consists of conventional object recognition and classification based primarily on motion analysis. These problems vary from recognizing a person by his or her gait to classifying an aircraft type by its dynamic characteristics.
Yet another group of problems, which usually have to be solved prior to the recognition and classification stage, are segmentation and tracking. The lack of a good, generic and reliable solution for this separate class of lower-level problems is an obstacle for researchers working on the higher-level stage. This probably explains the small number of works on high-level analysis in complex, real-world, outdoor environments, where segmentation and tracking are even more problematic. Since we are analyzing moving objects, segmentation in our context means the detection of independently moving bodies, while tracking means pursuing an object's image on the basis of its current dynamical behavior.
It should also be mentioned that the nature of the problems above, and thus the methods applied, can be significantly influenced by the choice of scene type (e.g. indoor/outdoor, rigid/non-rigid objects), imaging scheme (e.g. static/moving camera) and context (e.g. a football game with a known background, number of objects, rules, etc. - the so-called closed world assumption).
1.1.3 Motion modeling
Usually, for the purpose of classification and recognition, some kind of motion model is defined. Certain assumptions are made about the nature of object motion in order to build a model. In some cases those assumptions reflect the experimental conditions, whereas in others they are simplifications of a more complex observed motion. Among the motion model assumptions used by researchers one can find the planar (2D) motion assumption, the constant velocity assumption, the periodic motion assumption, the smooth trajectories assumption, etc. In some cases known physical (dynamical) models of object motion are used.
Figure 1.1: Extraction of motion information from a sequence of frames (from [20]).
Sometimes not only the motion, but the moving objects themselves are modeled. For non-rigid bodies (e.g. humans) a model of the internal object structure (both geometrical and mechanical) can be used extensively in motion analysis. Many geometrical models have been used to approximate a moving object: polygons, blobs, centroids, stick figures, wire frames, etc.
To fit a predefined motion model to an image sequence one needs to extract some motion based features. Motion information can be obtained by various means. Figure 1.1 shows the classification of methods for motion data extraction suggested in [20].
1.2 Main Results and Contributions
The main objective of this research is to find efficient methods for extracting and using motion information. We concentrated on three major problems.
• In the first we deal with the analysis of rigid body motion in a problem taken from the traffic domain.
– We show how a moving vehicle which is carrying a camera can estimate the relative
motions of nearby vehicles.
– A model for the motion of the observing vehicle is presented.
– A novel algorithm for video stabilization is presented that allows correcting the
image sequence so that transient motions resulting from bumps, etc. are removed.
– A model for the motions of nearby vehicles is presented and it is shown how to
detect their motions relative to the observing vehicle.
• The second part of this thesis deals with general non-rigid motion. Here we approached both the low-level problem of non-rigid target segmentation and tracking and the high-level problem of motion based classification of non-rigid objects.
– An active contour based algorithm for object segmentation in images was developed.
– The fast version of the geodesic active contour model was implemented using a novel numerical scheme based on the Weickert-Romeney-Viergever AOS scheme, the Adalsteinsson-Sethian level set narrow band and Sethian's fast marching method.
– The algorithm was adapted for segmentation and tracking of non-rigid objects in color movies.
– An algorithm for periodicity analysis of the extracted deformable contours of the moving objects was developed.
– A technique for image sequence normalization using the PCA approach is proposed.
– A framework for moving object classification based on the eigen-decomposition of the normalized binary silhouette sequence is presented, capable of distinguishing between various object classes by their static appearance and dynamic behavior.
• Finally, we approached the problem of segmenting an articulated body based on its motion.
– The method is based on the analysis of the normal flow vectors computed on a color image sequence.
– We suggest a simple model describing the motion of articulated parts and develop an algorithm for grouping edge segments based on their apparent motion.
– The technique is further developed to support more robust multiple frame analysis.
1.3 Background and Related work
In this section we give a brief survey of the background literature and related work relevant to the topics developed in this thesis. We discuss the following topics: motion based segmentation and tracking, recognition of motion activities and events, motion based object classification, and the generation of natural language descriptions of dynamic scenes.
1.3.1 Motion Segmentation and Tracking
Image motion is a very strong cue for image segmentation: it can be used to identify image areas corresponding to independently moving objects and to distinguish these from the background. Once a moving object is detected, the motion information can be used to keep track of its image. The two problems are strongly coupled. The features found for segmentation may be used in tracking, and knowledge of the object's position and velocity would certainly help in segmentation.
The problem may be further complicated by introducing non-rigid objects, egomotion,
non-smooth and non-translational motion, occlusion situations, etc.
Irani et al. [62] detect multiple moving objects in an image pair one by one. First, a
dominant motion is computed assuming that the motion of the objects can be approximated
by 2D parametric transformations in the image plane. Then the two images in the pair are
registered using the detected motion and the region corresponding to the motion is detected
by looking for the stationary region. The region is then excluded and the process is repeated
to find other objects and their motions.
It has been suggested to use the benefits of tracking moving objects to improve the segmentation. This is done by using temporal integration of images registered with respect
to the tracked motion, without assuming temporal motion constancy. The temporally integrated image Av(t) serves as a dynamic internal representation image of the tracked object
and is constructed as follows:
\[
\begin{cases}
Av(0) = I(0) \\
Av(t+1) = \omega\, I(t+1) + (1-\omega)\,\mathrm{register}\bigl(Av(t),\, I(t+1)\bigr)
\end{cases}
\]
where ω = 0.3 and register(P, Q) denotes the registration of images P and Q by warping P towards Q according to the motion of the tracked object computed between them. The temporally integrated image is then used for motion analysis and segmentation instead of the first image in each pair. The tracked object remains sharp while other objects blur out, which improves the accuracy of the segmentation and the motion computation.
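As a rough illustration, the update above can be sketched as follows, assuming grayscale frames given as numpy arrays and a registration step approximated here by a plain translational shift (the per-frame displacement of the tracked object is assumed to be estimated elsewhere; all names and parameters are illustrative, not taken from the cited work):

    import numpy as np
    from scipy.ndimage import shift

    def register(av, frame, displacement):
        # Warp the integrated image Av toward the new frame according to the
        # (externally estimated) motion of the tracked object; a pure
        # translation is used here only to keep the sketch short.
        return shift(av, displacement, order=1, mode='nearest')

    def temporal_integration(frames, displacements, omega=0.3):
        # Av(0) = I(0); Av(t+1) = omega*I(t+1) + (1-omega)*register(Av(t), I(t+1))
        av = frames[0].astype(float)
        for frame, disp in zip(frames[1:], displacements):
            av = omega * frame + (1.0 - omega) * register(av, frame, disp)
        return av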
Clarke and Zisserman [25] use epipolar geometry for the detection and tracking of independently moving rigid objects by a moving (translating) observer. The basic idea is that, given a correspondence between a relatively large number of feature points in two frames, the epipole and the translation vector of the camera motion relative to the background can be recovered. Of course, this is true only if most of the features belong to the background and not to the moving objects. Once the camera motion is known, the moving objects can be found by attempting to fit an epipole to the feature matches that are not consistent with the background epipole. The process may be repeated until too few point matches remain, or no object is found. As with the camera motion, the independent translation of each object is recovered and used for tracking it. Corners are used as the feature points, and the image plane extent of a moving object is defined by the smallest rectangle enclosing all the features that have been updated more than a fixed number of times.
Rosales and Sclaroff [96] address the more specific problem of tracking humans. Since a static camera is used, it is possible to acquire statistical measurements of the empty scene over a number of video frames. Color covariance estimates are obtained for each image pixel. Independently moving objects are then detected by thresholding a distance measure between the background model and the incoming image. This gives a binary map of interesting regions, and connected component analysis is used to break it into different moving objects.
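A minimal sketch of this detection step, assuming the per-pixel color mean and inverse covariance of the empty scene have already been estimated from a set of background frames (array shapes, function names and the threshold are illustrative, not taken from [96]):

    import numpy as np
    from scipy.ndimage import label

    def foreground_mask(image, bg_mean, bg_inv_cov, threshold=9.0):
        # image, bg_mean: H x W x 3; bg_inv_cov: H x W x 3 x 3 (per-pixel).
        d = image.astype(float) - bg_mean
        # Squared Mahalanobis distance of each pixel to the background model.
        dist2 = np.einsum('hwi,hwij,hwj->hw', d, bg_inv_cov, d)
        return dist2 > threshold

    def connected_moving_regions(mask):
        # Break the binary map of interesting regions into separate objects.
        labels, n = label(mask)
        return [np.argwhere(labels == k) for k in range(1, n + 1)]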
A special tracker object is assigned to every object being tracked. It contains information about the object's location, a binary support map, blob characteristics, bounding box, and other information that will be used for tracking with the Extended Kalman Filter (EKF).
To reduce the complexity of the tracking problem only two feature points are selected: two opposite corners of the object bounding box. Assuming that the objects undergo locally linear translational motion in 3D and utilizing a 3D central projection camera model, the relative 3D positions and velocities of the features are extracted. The parameterized feature point positions and velocities compose the EKF state vector.
The occlusion detection routine predicts the future locations of objects based on the current estimates of 3D velocity and position. If a collision of the objects is detected, then occlusion is probable. After the collision is confirmed by both the estimated bounding box parameters and the merging binary maps of the objects being tracked, the system uses the EKF prediction to update the object position and velocity. The end of occlusion can be found using the EKF estimates of position and velocity along with the computed area of the blobs.
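The cited work uses an EKF with a 3D central projection model; the following is a much simplified linear, constant-velocity sketch over the two bounding-box corners in the image plane, included only to show how the predicted state can bridge an occlusion (all noise parameters are illustrative):

    import numpy as np

    class CornerTracker:
        # Constant-velocity Kalman filter over two opposite bounding-box
        # corners; state = 4 corner coordinates plus their 4 velocities.
        def __init__(self, corners, dt=1.0, q=1e-2, r=1.0):
            self.x = np.concatenate([corners, np.zeros(4)])
            self.P = np.eye(8)
            self.F = np.eye(8)
            self.F[:4, 4:] = dt * np.eye(4)                     # positions += dt * velocities
            self.H = np.hstack([np.eye(4), np.zeros((4, 4))])   # only positions are observed
            self.Q = q * np.eye(8)
            self.R = r * np.eye(4)

        def predict(self):
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q
            return self.x[:4]            # predicted corners (used during occlusion)

        def update(self, measured_corners):
            y = measured_corners - self.H @ self.x
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ y
            self.P = (np.eye(8) - K @ self.H) @ self.P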
1.3.2 Classification and Recognition
A process of motion based recognition can be coarsely divided into two steps. The first step is to find an appropriate representation for the objects or motions to be recognized - a model. Such a model may be a low-level representation, like the trajectory of some point on a moving object as described by Gould and Shah [56], or, on the other hand, can be organized into a high-level representation, like the motion verbs in the works of Nagel et al. [50], [70] or the function of an object in the approach of Rivlin et al. [36].
Generally, there are two methods to extract motion information from a sequence of images in order to organize it into a model.
• motion correspondence, where a set of feature points (tokens) is detected in each image frame and then a correspondence between those points is established across frames. There are two intrinsic problems associated with this method: the lack of a good universal feature detection operator - either too few or too many tokens are detected - and the combinatorial nature of the correspondence problem - only one trajectory is to be selected from a large set of possible trajectories.
• optical flow, an alternative method for motion analysis that computes the displacement of each pixel between two consecutive frames. The problem here is that only the so-called normal flow can be computed due to the aperture problem, and the computation is also error prone because of boundary over-smoothing, multiple moving objects, etc. [2]
Methods combining both approaches have been proposed [105]; they utilize the local motion direction found by optical flow to facilitate the search for corresponding feature points in consecutive frames.
The second step is to match some unknown input against the model. Usually, the same information extraction method as in the previous step is used to obtain the motion information from the sample. Matching methods, depending on the type of representation used, vary from standard clustering techniques to neural networks and probabilistic approaches.
1.3.2.1 Detecting and Recognizing Motions
According to [89], different motions can be classified according to the spatial and temporal uniformity they exhibit. Three different types of motion patterns are defined:
• temporal textures - regular motion patterns of indeterminate spatial and temporal extent,
• activities - motion patterns which are temporally periodic but limited in spatial extent,
• motion events - isolated simple motions that do not exhibit any temporal or spatial repetition.
Examples of temporal textures include wind-blown trees or grass; examples of activities include walking, running, etc.; examples of motion events are isolated instances of opening a door, starting a car, throwing a ball, a change in direction, a stop, an acceleration, etc.
Temporal textures can be effectively treated with statistical techniques analogous to those used in gray-level texture discrimination [92]. On the other hand, activities and motion events are more discretely structured, and techniques similar to those used in static object recognition are useful in their classification.
Since temporal texture is usually used as a tool for the segmentation and/or recognition of objects possessing a certain motion pattern, and not for the recognition of different motions of a spatially defined object, we discuss it in the Motion Based Object Classification section 1.3.2.2. In what follows we describe in more detail the recognition of activities and survey motion events.
Recognizing Activities
Detecting Periodic Activities Periodicity is common in the natural world. It is
also a salient feature in human perception. Information regarding the nature of a periodic
phenomenon, such as its location, strength, and frequency, is important for understanding the environment. Techniques for periodicity detection and characterization can assist in
many applications requiring object and activity recognition and representation. Periodicity
often involves both space and time, as in cyclic motion.
Polana and Nelson [90] deal with the detection of repetitive motion. Their approach suggests a low-level, feature-based activity recognition mechanism.
An image sequence is considered as a spatio-temporal solid with two spatial dimensions and one time dimension. Repeated activity gives rise to periodic or semi-periodic gray level signals along smooth curves in the image solid - reference curves. If these curves can be identified and samples extracted along them over several cycles, then frequency domain techniques can be used to judge the degree of periodicity and thus detect periodic activities. In the case of stationary activities or locomotory activities the phase curves are straight lines at some angle that depends on the translational velocity.
Signals are extracted along these curves and decomposed using the Fourier transform. The normalized difference between the energy at the highest-amplitude frequency and its multiples and the energy at the remaining frequencies (halfway between them) gives the periodicity measure. That is,
\[
p_f = \frac{\sum_i F_{iw} - \sum_i F_{iw+w/2}}{\sum_i F_{iw} + \sum_i F_{iw+w/2}} \tag{1.1}
\]
where F is the energy spectrum of the signal f and w is the frequency corresponding to the highest amplitude in the energy spectrum. The measure is normalized with respect to the total energy at the frequencies of interest, so that it is one for a completely periodic signal and zero for a flat spectrum. In general, if a signal consists of frequencies other than one single fundamental and its multiples, its periodicity measure will be low. A non-maximum suppression technique is then used to vote for the frequencies on which most of the analyzed signals (collected from different reference curves) agree.
The maximal value of this combined signal is taken as the fundamental frequency, and the periodicity measure for the entire sequence is defined as the sum of the periodicity measures obtained from all the contributing signals.
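A sketch of the periodicity measure (1.1) for a single 1-D signal sampled along a reference curve over several cycles; the choice of the fundamental frequency and the index arithmetic are simplified for illustration:

    import numpy as np

    def periodicity_measure(signal):
        # Energy spectrum of the (mean-removed) signal.
        spectrum = np.abs(np.fft.rfft(signal - np.mean(signal))) ** 2
        w = int(np.argmax(spectrum[1:])) + 1                 # highest-amplitude frequency
        harmonics = np.arange(w, len(spectrum), w)           # i*w
        halfway = np.arange(w + w // 2, len(spectrum), w)    # i*w + w/2
        e_h = spectrum[harmonics].sum()
        e_b = spectrum[halfway].sum()
        # 1 for a purely periodic signal, close to 0 for a flat spectrum.
        return (e_h - e_b) / (e_h + e_b)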
Experiments conducted by the authors [90] show that periodic activities such as walking, exercising or swinging can be distinguished from non-periodic activities by simple thresholding of the periodicity measure.
In [91], optical flow based algorithms are used to transform an image sequence so that the object in question is stabilized at the center of the image frame. Then flow magnitudes in tessellated frame areas of periodic motion are used as feature vectors for motion classification. The problem is that flow based methods are very sensitive to noise. Thus, Liu and Picard [75] propose a robust method based on a new criterion for spatio-temporal periodicity, the ratio between the harmonic energy and the total energy of a 1-D signal along the temporal dimension. In their method multiple fundamentals can be extracted along a temporal line; the values of the fundamental frequencies are used in further processing to help distinguish the periodicity of various activities; regions of periodicity are actually segmented, and frame alignment is used for tracking the moving object.
Recognizing Periodic Activities In [92] the idea above is further developed to include the recognition of different types of periodic activities.
One cycle (defined by the fundamental frequency) of the spatio-temporal solid representing the activity is divided into X × Y × T cells by partitioning the two spatial dimensions into X and Y divisions and the temporal dimension into T divisions, respectively.
A local motion statistic is then selected and computed in each cell of the spatio-temporal grid. The feature vector is composed of XYT elements, each of which is the value of the statistic in the particular cell (see Figure 1.2).
The normalized spatio-temporal solid, while corrected for temporal scale (frequency), is not corrected for temporal translation (phase), so the pattern is tried at each possible phase and the best match is then selected.
Several statistics were tried: the positive and negative curl and divergence estimates, the non-uniformity of flow direction, the directional difference statistics in four directions, etc. The one that gave the best results is the sum of the flow magnitude divided by its standard deviation. The feature values are arranged into a vector and a nearest centroid classifier is used for classification.

Figure 1.2: Total motion magnitude feature vector for a sample of walk (top) and a sample of run (bottom). One cycle of activity is divided into six time divisions shown horizontally. Each frame shows the spatial distribution of motion in a 4x4 spatial grid; the size of each square is proportional to the amount of motion in the neighborhood (from [92]).
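A sketch of the feature computation and classification described above, assuming per-frame flow magnitudes over one frequency-normalized cycle; the grid sizes and the phase search are simplified, and only the sum/standard-deviation statistic is shown:

    import numpy as np

    def activity_features(flow_mag, X=4, Y=4, T=6):
        # flow_mag: (frames, height, width) flow magnitudes over one cycle.
        frames, h, w = flow_mag.shape
        feats = []
        for t in range(T):
            chunk = flow_mag[t * frames // T:(t + 1) * frames // T]
            for y in range(Y):
                for x in range(X):
                    cell = chunk[:, y * h // Y:(y + 1) * h // Y,
                                    x * w // X:(x + 1) * w // X]
                    feats.append(cell.sum() / (cell.std() + 1e-8))
        return np.asarray(feats)        # X*Y*T feature vector

    def classify(feature, centroids):
        # centroids: dict mapping activity name -> mean feature vector.
        return min(centroids, key=lambda name: np.linalg.norm(feature - centroids[name]))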
Three sequences were used to compute the cluster centers, i.e. to train the classifier, while a fourth one was used as the unknown. None of the single features is sufficient, alone, for correct classification. However, the curl and divergence estimates along with the directional difference statistics in four directions provide enough information to correctly classify all the motions used in the experiments. A principal component analysis of the features was also performed, which confirmed the relative importance of those two features.
Parametrized Modeling for Activity Recognition In [127] a framework for the modeling and recognition of temporal activities is proposed. A set of exemplar activities, each represented by a vector of measurements taken over some period of time, is modeled by parameterization in the form of principal components (PCA). An observed activity, which may have undergone some transformation due to temporal shift, speed variations or different distances between the camera and the object, is then projected onto the principal activity basis, giving a coefficient vector. A robust regression is used to recover the coefficients along with the parameters of the transformation. Once the coefficient vector is obtained, the normalized distance to the exemplar activity coefficients is used to recognize the observed activity (see Figure 1.3).
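A minimal sketch of the parameterized modeling step, assuming each activity example has already been resampled to a fixed-length measurement vector; the robust regression over the transformation parameters used in [127] is omitted:

    import numpy as np

    def build_activity_basis(exemplars, k=10):
        # exemplars: (n_examples, n_measurements) matrix of activity vectors.
        mean = exemplars.mean(axis=0)
        _, _, vt = np.linalg.svd(exemplars - mean, full_matrices=False)
        return mean, vt[:k]              # mean and first k principal components

    def recognize(observation, mean, basis, exemplar_coeffs, labels):
        # Project onto the principal activity basis and pick the nearest exemplar.
        c = basis @ (observation - mean)
        dists = np.linalg.norm(exemplar_coeffs - c, axis=1)
        return labels[int(np.argmin(dists))]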
The paper describes three experiments where the method was successfully applied:
• human walking direction recognition (parallel, diagonal, perpendicular away, perpendicular forward)
• recognition of four types of activities (walking, marching, line-walking and kicking
while walking)
• visual based speech recognition.
Figure 1.3: The parameterized modeling and recognition (from [127]).
The first two applications used an approach described in [66] for tracking human motion
using parameterized optical flow, assuming that initial segmentation of the body into parts
is given. Eight motion parameters were tracked for five body parts (arm, torso, thigh, calf and foot) (see Figure 1.4).
The last application used optical flow vectors directly as the measurements to build the principal basis.
Motion Events Motion events are defined as significant changes or discontinuities in
motion. They are usually detected by the presence of discontinuities, which can be found,
for instance, by computing the scale-space of a speed curve.
Gould and Shah's [56], [55] Trajectory Primal Sketch (TPS) is a representation of the significant changes in motion. Changes are identified at various scales by computing the scale-space of the velocity curves vx and vy extracted from a trajectory. This results in a set of TPS contours, each contour corresponding to a change in motion. The strength of the contour, i.e. the number of zero-crossings belonging to that contour, the shape or straightness of the contour, i.e. the sum of the distances between successively linked zero-crossings of the contour divided by (strength - 1), and the frame number at which the contour originated, reveal a lot of information about the event. This representation has been shown to distinguish basic motions like translation, rotation, projectile and cycloid. The results show that the first-derivative discontinuities in the rotation and cycloid trajectories have a sine/cosine relationship with respect to each other, so that they can be separated from the projectile and translation type trajectories.
Figure 1.4: Image sequence of "walking", five part tracking including self-occlusion and three sets of five signals (out of 40) recovered during the activity (torso, thigh, calf, foot and arm) (from [66]).

Engel and Rubin [42] define five types of motion events: smooth starts, smooth stops, pauses, impulse starts and impulse stops. These are considered as motion events that partition a global motion into parts that have some underlying psychological support. The method of detecting those boundaries uses a polar velocity representation (s, ψ), and the features used
for the detection of the perceptual boundaries are the first and second derivatives of the speed, s′ and s′′, and the second derivative of the angle, ψ′′. Force impulses imply a discontinuity, i.e. a zero-crossing in consecutive pairs of s′′ and ψ′′ estimates. If a zero crossing exists in either or both, and its slope exceeds some predefined threshold, then a force impulse is confirmed. Starts and stops are found when the speed is low but increasing or decreasing sufficiently, again determined using thresholds. The simultaneous presence of a start or stop and a force impulse indicates an impulse start or stop, while a pause occurs when a start (smooth or impulse) follows a stop. The authors studied cycloidal motion at two different speeds, fast and slow. At each cusp, a boundary was detected; the slow motion produced a ψ′′ zero crossing along with a pause, while the fast motion asserted force impulses for both s′′ and ψ′′, but no start, stop or pause. Both boundaries were found to agree with human perception.
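A crude sketch of the boundary detection described above, working on a sampled trajectory in the polar velocity representation (s, ψ); all thresholds are illustrative, the smoothing and scale selection of the original work are omitted, and pauses (a start following a stop) are not labeled:

    import numpy as np

    def motion_event_boundaries(vx, vy, slope_thresh=1.0, low_speed=0.5):
        s = np.hypot(vx, vy)                  # speed
        psi = np.unwrap(np.arctan2(vy, vx))   # direction
        s2 = np.gradient(np.gradient(s))      # s''
        psi2 = np.gradient(np.gradient(psi))  # psi''
        events = []
        for t in range(1, len(s) - 1):
            impulse = any(a[t - 1] * a[t + 1] < 0 and
                          abs(a[t + 1] - a[t - 1]) / 2.0 > slope_thresh
                          for a in (s2, psi2))
            rising = s[t] < low_speed and s[t + 1] - s[t - 1] > 0
            falling = s[t] < low_speed and s[t + 1] - s[t - 1] < 0
            if impulse and rising:
                events.append((t, 'impulse start'))
            elif impulse and falling:
                events.append((t, 'impulse stop'))
            elif impulse:
                events.append((t, 'force impulse'))
            elif rising:
                events.append((t, 'smooth start'))
            elif falling:
                events.append((t, 'smooth stop'))
        return events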
Motion events have been demonstrated for simpler motions such as projectile and cycloid motions, as well as for complex motions like human movements. They are thus widely applicable, and can be used alone or in conjunction with other types of features.
Siskind et al. in a number of works [109], [111], [110] build a framework for visual event classification based not on the "object recognition first" principle, but on a direct retrieval of motion event information from the visual data stream. In [110] a maximum likelihood scheme is utilized to recognize simple spatial-motion events, such as those described by the verbs pick up, put down, push, pull, drop and throw.
Event recognition is divided into two independent subtasks: a lower-level task responsible for segmentation and tracking, and an upper-level task responsible for event recognition.
The lower-level task performs 2D object tracking, taking image sequences as input and producing as output a stream of 2D position, orientation, shape and size values for each participating object in the observed event. The tracker operates on a frame-by-frame basis,
tracking colored objects and moving objects independently. Colored object segmentation is
performed by clustering strongly colored pixels by hue values. Moving pixels are detected as
pixels with large interframe gray value difference. Region growing is then applied to build
a set of contiguous object regions. Each region is then abstracted as an ellipse and further
analysis is based on the ellipse parameters only. Inter-frame correspondence analysis is then employed to solve the problem of ellipse drop-outs and spurious ellipses, yielding a set of ellipses that track all of the participant objects in the event through the entire movie.
The upper-level task takes as input the 2D pose stream produced by the lower-level processing, without any image information, and classifies such a pose stream as an instance of a given event type. A maximum-likelihood approach is used to perform the upper-level task. In this approach a supervised learning strategy is used to train a model for each event class from a set of example events of that class. The feature vectors contain both absolute and relative ellipse positions, orientations, velocities and accelerations. Hidden Markov Models (HMM) are adopted by the authors as the generative model within the maximum-likelihood framework. Novel event observations are then classified as belonging to the class whose model best fits the new observation.
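A sketch of the maximum-likelihood classification step, assuming the pose streams have already been converted to fixed-dimension per-frame feature vectors; the hmmlearn package is used here purely as an illustrative stand-in for the generative models, and is not the implementation used in the cited works:

    import numpy as np
    from hmmlearn.hmm import GaussianHMM   # assumed third-party dependency

    def train_event_models(examples_by_class, n_states=5):
        # examples_by_class: dict label -> list of (T_i x D) feature sequences.
        models = {}
        for label, seqs in examples_by_class.items():
            X = np.vstack(seqs)
            lengths = [len(s) for s in seqs]
            models[label] = GaussianHMM(n_components=n_states).fit(X, lengths)
        return models

    def classify_event(models, sequence):
        # Maximum likelihood: the class whose HMM best fits the observation.
        return max(models, key=lambda label: models[label].score(sequence))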
The system developed is reported to perform well in a relatively uncluttered desk-top
environment.
1.3.2.2 Motion Based Object Classification
Temporal Texture Analysis It is shown in [89] how a number of statistical spatial and temporal features derived from the normal optical flow (the component of the motion field parallel to the spatial gradient direction) can be used to classify regional activities characterized by complex, non-rigid motion.
The features suggested in the paper are:
1. Nonuniformity of the normal flow direction, which is the absolute deviation of the normal flow direction histogram from a uniform distribution. To reduce the dependence of the normal flow direction on the underlying intensity texture, the flow directions are normalized by a histogram of gradient directions.
2. Mean flow magnitude divided by its standard deviation, which is a measure of "peakiness" in the velocity distribution.
3. Directional difference statistics in four directions. The statistic is based on the number of pixel pairs at a given offset which differ in their value by a given amount, and gives the ratio between the number of pixels differing by at most one and the number of pixels differing by more than one, in four directions. The offset is proportional to the average flow magnitude to yield scaling invariance. The feature represents a measure of the spatial homogeneity of the field.
4. Positive and negative curl and divergence estimates. Mean values over a region of interest are used.
The first three features are invariant under translation, rotation and temporal and spatial
scaling, while the last one is invariant with respect to translation and rotation only.
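A sketch of the first two statistics, computed from the normal flow components over a region; the exact histogram normalization in [89] may differ, the directional-difference and curl/divergence features are omitted, and the bin count is illustrative:

    import numpy as np

    def temporal_texture_features(nf_u, nf_v, grad_dir, n_bins=16):
        mag = np.hypot(nf_u, nf_v)
        flow_dir = np.arctan2(nf_v, nf_u)
        flow_hist, _ = np.histogram(flow_dir, bins=n_bins, range=(-np.pi, np.pi))
        grad_hist, _ = np.histogram(grad_dir, bins=n_bins, range=(-np.pi, np.pi))
        # Normalize the flow direction histogram by the gradient direction
        # histogram to reduce the dependence on the underlying intensity texture.
        normalized = flow_hist / np.maximum(grad_hist, 1)
        normalized = normalized / max(normalized.sum(), 1e-8)
        nonuniformity = np.abs(normalized - 1.0 / n_bins).sum()
        peakiness = mag.mean() / (mag.std() + 1e-8)
        return nonuniformity, peakiness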
Experiments show that using these features and a nearest centroid classifier it is possible to distinguish between naturally occurring motions such as:
• fluttering crepe paper bands
• cloth waving in the wind
• motion of tree in the wind
• flow of water in a river
• turbulent motion of water
• uniformly expanding image produced by forward observer motion
• uniformly rotating image produced by observer roll
Principal component analysis of the features indicates that the second order features (the
last two) are most useful in classification.
Dynamics Based Object Classification Shavit and Jepson [108] propose a qualitative representation of moving body dynamics (the pose function) for motion-based categorization. The approach is based on the phase portrait of a system, that is, a collection of phase curves in 2n-dimensional space, where n is the number of independent variables.
If motion measurements (of position and velocity) are given for each of the n dimensions of the system (an n-dimensional flow field), the construction of the phase portrait is performed by interpolation of the measurements.
The authors suggest looking at the pose function as composed of two independent parts: a global component and a local component.
The global component of the pose function is a viewer-centered model of the external dynamics (translation of the blob en masse), whereas the local component of the function is an expression of the internal dynamics (deformation of the blob).
In [107] the authors further analyze the use of dynamic systems analysis for motion understanding. For idealized autonomous conservative systems the external dynamics is captured by the Hamiltonian function: the total energy of the system is constant and equal to the sum of the potential and kinetic energies. In this case both the potential and kinetic energies are not time dependent, and the phase curve corresponding to the global component is symmetric.
If the system is not autonomous due to the action of non-conservative forces, then the global component phase curve will not be symmetric, which indicates that the idealized model is not appropriate for the system. In such a case an additional time-dependent component is introduced into the Hamilton equation to compensate for the total energy changes.
In order to build the local component of the pose function the authors consider a local fixed reference frame erected at the instantaneous position of the blob's centroid. The internal body dynamics is expressed by the cluster geometry (the relative positions of the cluster points with respect to the centroid) and the relative deformation velocity.
The principal axes method and the velocity gradient tensor are used to compute the rate of strain and the angular velocity of the principal axes rotation.
It is claimed that the obtained phase curves are sufficient for motion based classification. As a proof of the method's feasibility the authors present an experiment where the pose function is computed for a hopping rabbit and a hopping kangaroo taken from photostats of real animals in locomotion (see Figure 1.5).
Figure 1.5: (a) A graphic depiction of the pose function for the hopping of a kangaroo; (b) a graphic depiction of the pose function for the hopping of a rabbit (from [107]).
The external appearance of the graphs is sufficient for distinguishing between the two
animals.
1.3.3 From Motion towards Natural Language Description
Nagel et al. [70] describe a system for the automatic characterization of vehicle trajectories, extracted from an image sequence, by natural language motion verbs.
Their approach can be decomposed into the following component processes:
1. Feature extraction: extraction of suitable features from each frame of an image sequence. Centroids of the blobs generated by the monotonicity operator are used. The
monotonicity operator finds connected regions in the image, which could be classified
as local minima or maxima of the gray-value function.
2. Feature matching: interframe matching of the resulting feature sets, thereby deriving
displacement vector estimates at selected image plane locations. A multiresolution
approach is used to detect objects moving with different apparent velocities. Computed
displacements of blobs at a coarse level are used to calculate a so-called displacement
compensation, which is then used to advance the displacement of the search area in
the next finer level.
3. Object detection: clustering of these displacement vectors, facilitating the isolation of
displacement vector clusters which can be hypothesized to represent candidates for the
image of a moving object in the scene. The displacements of blobs are tracked along
a certain number of frames and mapped to a vector chain. The footpoint of the first
vector and the total displacement represented by the vector chain define a so-called
image distance vector. The image distance vector forms a cluster seed provided it
satisfies the following consistency constrains:
• stability: the involved blob can be extracted in a certain number of frames
• minimum displacement: the total displacement of the image distance vector has
to exceed a certain minimal displacement threshold.
• coherence: the variance coefficient of the displacements of the vectors in a vector
chain is used to suppress statistical fluctuations of stationary features and vector
chains arising from wrong matches.
Then the image distance vectors are clustered based on the following geometrical relations: is neighbour($\vec{u},\vec{v}$), is parallel($\vec{u},\vec{v}$) and is samelength($\vec{u},\vec{v}$). The vectors of a
cluster form an object candidate and are marked for tracking in the subsequent frames.
4. Tracking: tracking of such object candidates.
5. Backprojection: backprojection of the resulting candidate representation and position
onto the street plane, using an interactively obtained transformation from the image
plane to a scene map.
6. Motion verbs: association of trajectory segments in the street plane with an internal
representation of motion verbs describing the motion of cars on roads. A set of 119
German verbs is divided into four categories representing the verbs describing the
action of the vehicle itself, verbs referring to the road, verbs making additional reference
to other vehicles and verbs referring to other locations. A set of 13 attributes that
are extracted from an image sequence and describe the scene in terms of trajectory
properties, vehicle position relative to other objects or to the road, vehicle orientation,
velocity, acceleration, etc. was defined. For each verb there exists a set of three
predicates acting on the attributes above:
Figure 1.6: Single frame of a real-world traffic scene recorded at an intersection. Vehicle
trajectories obtained by automatic detection and model-based tracking are shown together
with the vehicle models used (from [70]).
• a precondition - true if the attributes are within some predefined range at the
beginning of an interval of validity for a verb
• a monotonicity condition - true if the attribute values remain acceptable for a
verb
• a postcondition - true if the attribute values are within some range indicating the
end of an interval of validity for a verb,
where the interval of validity is the time period for which a particular interpretation is
true. The attributes are computed at each time instant and a finite automaton based
on predicate values is used to determine intervals of validity.
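As a rough illustration of how such predicates could delimit intervals of validity, consider the following minimal sketch. It is not Nagel et al.'s implementation; the attribute names ("speed", "acceleration") and the thresholds are purely hypothetical.

```python
# A minimal sketch (not the implementation of [70]) of precondition, monotonicity
# and postcondition predicates over per-frame attributes delimiting an
# "interval of validity" for a motion verb.

def intervals_of_validity(attributes, pre, mono, post):
    """attributes: list of per-frame attribute dicts (one per time instant).
    pre/mono/post: predicates over a single attribute dict.
    Returns (start, end) frame intervals for which the verb interpretation holds."""
    intervals, start = [], None
    for t, a in enumerate(attributes):
        if start is None:
            if pre(a):                      # verb interpretation may start here
                start = t
        else:
            if post(a):                     # interval of validity ends here
                intervals.append((start, t))
                start = None
            elif not mono(a):               # attributes left the admissible range
                start = None
    return intervals

# Example: a crude "accelerate" verb on synthetic attributes (illustrative only).
frames = [{"speed": 5 + 0.5 * t, "acceleration": 0.5 if t < 20 else 0.0}
          for t in range(30)]
accel = intervals_of_validity(
    frames,
    pre=lambda a: a["acceleration"] > 0.1,
    mono=lambda a: a["acceleration"] > 0.05,
    post=lambda a: a["acceleration"] <= 0.05)
print(accel)   # -> [(0, 20)]
```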
The approach has been applied to various traffic scenes and has successfully described maneuvers
such as passing a barrier, turning, and parking by natural language verbs.
Over the last several years considerable success in traffic scene analysis has been reported
by a large group of researchers at the Institut for Algorithmen und Kognitive Systeme [58]
[59] [60] [61] [69] [71]. To provide accurate input to the higher-level processing, a model-based
approach has been widely used for moving object detection and tracking and for the
scene modeling (see Figure 1.6).
Chapter 2
Rigid Motion
2.1 Introduction
For rigid objects it is easier to obtain 3D translational and sometimes even rotational velocity
components. Classification can then be performed directly on the measured velocities and
their derivatives, or by fitting them to a constrained motion model that depends on physical
parameters of a moving object. In the latter case the classification is performed based on
the extracted model parameters. If a set of models is given for various object classes, classification
can be done by choosing the model that best fits the object's motion characteristics.
In this chapter we show how the constraints induced by a domain help us to extract
motion parameters. We find relative velocity of moving vehicles observed from a moving
platform. It is assumed that the observed vehicle is a rigid body which is moving smoothly
(due to inertia) on a plane, and its rotation is not significant in a small number of frames,
so that only the translational velocity component can be measured accurately enough.
2.1.1 Interpreting Relative Vehicle Motion in Traffic Scenes
Autonomous operation of a vehicle on a road calls for understanding of various events involving the motions of the vehicles in its vicinity. In normal traffic flow, most of the vehicles
on a road move in the same direction without major changes in their distances and relative
speeds. When a nearby vehicle deviates from this norm (e.g. when it passes or changes
lanes), or when it is on a collision course, some action may need to be taken. In this paper we show how a vehicle carrying a camera can estimate the relative motions of nearby
vehicles.
Understanding the relative motions of vehicles requires modeling both the motion of
the observing vehicle and the motions of the other vehicles. We represent the motions of
vehicles using a Darboux motion model that corresponds to the motion of an object moving
along a smooth curve that lies on a smooth surface. We show that deviations from Darboux
motion correspond primarily to small, rapid rotations around the axes of the vehicle. These
rotations arise from the vehicle’s suspension elements in response to unevenness of the road.
We estimate bounds on both the smooth rotations due to Darboux motion (from highway
design principles) and the non-smooth rotations due to the suspension. We show that both
types of rotational motion, as well as the non-smooth translational component of the motion
(bounce), are small relative to the smooth (Darboux) translational motion of the vehicle.
This analysis is used to model the motions of both the observer and observed vehicles.
We use the analysis to show that only the rotational velocity components of the observer
vehicle are important. On the other hand, the rotational velocity components of an observed
vehicle are negligible compared to its translational velocity. As a consequence we need to
estimate the rotational velocity components only for the observing vehicle. This is the case
even when an observed vehicle is changing its direction of motion relative to the observing
vehicle (turning or changing lanes); the turn shows up as a gradual change in the direction
of the relative translational velocity.
An important consequence of the Darboux motion model is that for a fixed forward-looking camera mounted on the observer vehicle the direction of translation (and therefore
the position of the focus of expansion (FOE)) remains the same in the images obtained by the
camera. We use this fact to estimate the observing vehicle’s rotational velocity components;
this is done by finding the rotational flow which, when subtracted from the observed flow,
leaves a radial flow pattern (radiating from the FOE) of minimal magnitude.
We describe the motion field using full perspective projection, estimate its rotational
components, and derotate the field. The flow fields of nearby vehicles are then, under the
Darboux motion model, pure translational fields. We analyze the motions of the other
vehicles under weak perspective projection, and derive their motion parameters. We present
results for several road image sequences obtained from cameras carried by moving vehicles.
The results demonstrate the effectiveness of our approach.
In the next section we present motion models for road vehicles, discuss ideal and real
vehicle motion, and analyze the relative sizes of the smooth and non-smooth velocity components. In Section 2.3 we discuss the image motion, describe a way to estimate the
necessary derotation, and describe methods of estimating nearby vehicle motions from the
normal flow field. Section 2.4 presents experimental results for several sequences taken at
different locations. In Section 2.5 we review some prior work on various topics that are
related to this paper, involving road detection, vehicle detection, and motion analysis, and
compare them with the approach presented in this paper.
2.2 Motion Models of Highway Vehicles
The ideal motion of a ground vehicle does not have six degrees of freedom. If the motion is
(approximately) smooth it can be described as motion along a smooth trajectory Γ lying on
a smooth surface Σ. Moreover, we shall assume that the axes of the vehicle (the fore/aft,
crosswise, and up/down axes) are respectively parallel to the axes of the Darboux frame
defined by Γ and Σ. These axes are defined by the tangent ⃗t to Γ (and Σ), the second
tangent ⃗v to Σ (orthogonal to ⃗t), and the normal ⃗s to Σ (see Figure 2.1). Our assumption
about the axes is reasonable for the ordinary motions of standard types of ground vehicles;
in particular, we are assuming that the first two vehicle axes are parallel to the surface and
that the vehicle’s motion is parallel to its fore/aft axis (the vehicle is not skidding).
Consider a point O moving along a (space) curve Γ. There is a natural coordinate
Figure 2.1: The Darboux frame moves along the path Γ which lies on the surface Σ.
system Otnb associated with Γ, defined by the tangent ⃗t, normal ⃗n, and binormal ⃗b of Γ.
The triple (⃗t,⃗n, ⃗b) is called the moving trihedron or Frenet-Serret coordinate frame. We have
the Frenet-Serret formulas [72]
$$\vec{t}\,' = \kappa\vec{n}, \qquad \vec{n}\,' = -\kappa\vec{t} + \tau\vec{b}, \qquad \vec{b}\,' = -\tau\vec{n} \qquad (2.1)$$
where $\kappa$ is the curvature and $\tau$ the torsion of $\Gamma$.
When the curve Γ lies on a smooth surface Σ, it is more appropriate to use the Darboux
frame (⃗t, ⃗v, ⃗s) [72]. We take the first unit vector of the frame to be the tangent ⃗t of Γ and
the surface normal ⃗s to be the third frame vector; finally we obtain the second frame vector
as ⃗v = ⃗s × ⃗t (see Figure 2.1). Note that ⃗t and ⃗v lie in the tangent plane of Σ. Since the
vector ⃗t belongs to both the Otnb and Otvs frames, they differ only by a rotation around ⃗t,
say through an angle $\psi \equiv \psi(s)$. We thus have
$$\begin{pmatrix} \vec{v} \\ \vec{s} \end{pmatrix} = \begin{pmatrix} \cos\psi & \sin\psi \\ -\sin\psi & \cos\psi \end{pmatrix}\begin{pmatrix} \vec{n} \\ \vec{b} \end{pmatrix}. \qquad (2.2)$$
The derivatives of $\vec{t}$, $\vec{v}$, $\vec{s}$ with respect to arc length along $\Gamma$ can be found from (2.1) and (2.2):
$$\vec{t}\,' = \kappa_g\vec{v} - \kappa_n\vec{s}, \qquad \vec{v}\,' = -\kappa_g\vec{t} + \tau_g\vec{s}, \qquad \vec{s}\,' = \kappa_n\vec{t} - \tau_g\vec{v} \qquad (2.3)$$
where
$$\kappa_g \equiv \kappa\cos\psi, \qquad \kappa_n \equiv \kappa\sin\psi, \qquad \tau_g \equiv \tau + \frac{d\psi}{ds};$$
$\kappa_g$ is called the geodesic curvature, $\kappa_n$ is called the normal curvature, and $\tau_g$ is called the (geodesic) twist.
It is well known that the instantaneous motion of a moving frame is determined by its rotational velocity $\vec{\omega}$ and the translational velocity $\vec{T}$ of the reference point of the frame.
The translational velocity $\vec{T}$ of $O$ is just $\vec{t}$ and the rotational velocity of the $Otvs$ frame is given by the vector
$$\vec{\omega}_d = \tau_g\vec{t} + \kappa_n\vec{v} + \kappa_g\vec{s}.$$
Hence the derivative of any vector in the $Otvs$ frame is given by the vector product of $\vec{\omega}_d$ and that vector. It can be seen that the rate of rotation around $\vec{t}$ is just $\tau_g$, the rate of rotation around $\vec{v}$ is just $\kappa_n$, and the rate of rotation around $\vec{s}$ is just $\kappa_g$.
If, instead of using the arc length $s$ as a parameter, the time $t$ is used, the rotational velocity $\vec{\omega}_d$ and the translational velocity $\vec{T}$ are scaled by the speed $v(t) = ds/dt$ of $O$ along $\Gamma$.
2.2.1 Real Vehicle Motion
We will use two coordinate frames to describe vehicle motion. The “real” vehicle frame
Cξηζ (which moves non-smoothly, in general) is defined by its origin C, which is the center
of mass of the vehicle, and its axes: Cξ (fore/aft), Cη (crosswise), and Cζ (up/down); and
the ideal vehicle frame Otvs (the Darboux frame), which corresponds to the smooth motion
of the vehicle.
The motion of the vehicle can be decomposed into the motion of the $Otvs$ frame and the motion of the $C\xi\eta\zeta$ frame relative to the $Otvs$ frame. As we have just seen, the rotational velocity of the $Otvs$ (Darboux) frame is $v\vec{\omega}_d = v(\tau_g\vec{t} + \kappa_n\vec{v} + \kappa_g\vec{s})$ and its translational velocity is $v\vec{t}$. We denote the rotational velocity of the $C\xi\eta\zeta$ (vehicle) frame by $\vec{\omega}_v$ and its translational velocity by $\vec{T}_v$.
The position of the $C\xi\eta\zeta$ frame relative to the $Otvs$ frame is given by the displacement vector $\vec{d}_{v/d}$ between $C$ and $O$, and the relative orientation of the frames is given by an orthogonal rotational matrix (matrix of direction cosines) which we denote by $R_{v/d}$. The translational velocity of the vehicle (the velocity of $C$) is the sum of three terms: (i) the translational velocity of the Darboux frame $v\vec{t}$, (ii) the translational velocity $\vec{T}_{v/d} \equiv \dot{\vec{d}}_{v/d}$, and (iii) the displacement $v\vec{\omega}_d \times \vec{d}_{v/d}$ due to rotation of $C$ in the $Otvs$ frame. The translational velocity of the vehicle expressed in the $Otvs$ frame is thus $v\vec{\omega}_d \times \vec{d}_{v/d} + v\vec{t} + \dot{\vec{d}}_{v/d}$; its translational velocity in the $C\xi\eta\zeta$ frame is
$$\vec{T}_v = R_{v/d}^T\left(v\vec{\omega}_d \times \vec{d}_{v/d} + v\vec{t} + \dot{\vec{d}}_{v/d}\right). \qquad (2.4)$$
Similarly, the rotational velocity of $C\xi\eta\zeta$ is the sum of two terms: (i) the rotational velocity $vR_{v/d}^T\vec{\omega}_d$ of the $Otvs$ frame, and (ii) the rotational velocity $\vec{\omega}_{v/d}$, which corresponds to the skew matrix $\Omega_{v/d} = R_{v/d}^T\dot{R}_{v/d}$. The rotational velocity of the $C\xi\eta\zeta$ frame expressed in the $Otvs$ frame is thus $v\vec{\omega}_d + R_{v/d}\vec{\omega}_{v/d}$; the corresponding expression in the $C\xi\eta\zeta$ frame is
$$\vec{\omega}_v = vR_{v/d}^T\vec{\omega}_d + \vec{\omega}_{v/d}. \qquad (2.5)$$
Rotations around the fore/aft, sideways, and up/down axes of a vehicle are called roll,
pitch, and yaw, respectively. In terms of our choice of the real vehicle coordinate system,
these are rotations around the ξ, η, and ζ axes.
Figure 2.2: A possible motion of the base of a vehicle as the vehicle hits a bump.
2.2.2 Departures of Vehicle Motion from Smoothness
The motion of a ground vehicle depends on many factors: the type of intended motion;
the speed of the vehicle; the skill of the driver; the size, height and weight of the vehicle; the
type and size of the tires (or tractor treads), and the nature of the suspension mechanism, if
any; and the nature of the surface on which the vehicle is being driven. These factors tend to
remain constant; they undergo abrupt changes only occasionally, e.g. if a tire blows out, or
the vehicle suddenly brakes or swerves to avoid an obstacle, or the type of surface changes.
Such events may produce impulsive changes in the vehicle’s motion, but the effects of these
changes will rapidly be damped out. In addition to these occasional events, “steady-state”
non-smoothness of a ground vehicle’s motion may result from roughness of the surface.
A ground vehicle drives over roads that have varying degrees of roughness [4]. The
roughness consists primarily of small irregularities in the road surface. In discussing the
effects of the roughness of the road on the motion of a ground vehicle we will assume, for
simplicity, an ordinary, well balanced four-wheeled vehicle moving on a planar surface that
is smooth except for occasional small bumps (protrusions). The bumps are assumed to be
“small” relative to the size of the wheels, so that the effect of a wheel passing over a bump
is impulsive. (We could also allow the surface to have small depressions, but a large wheel
cannot deeply penetrate a small depression, so the depressions have much smaller effects
than the bumps.)
As the vehicle moves over the road surface, each wheel hits bumps repeatedly. We assume
that the vehicle has a suspension mechanism which integrates and damps the impulsive
effects of the bumps. Each suspension element is modeled by a spring with damping; its
characteristic function is a sine function multiplied by an exponential damping function
(see [84, 126]). We assume that the suspension elements associated with the four wheels are
independent of each other and are parallel to the vertical axis of the vehicle.
The vehicle may hit new bumps while the effects of the previous bumps are still being
felt. Each hit forces the suspension and adds to the accumulated energy in the spring; thus
it can happen that the suspension is constantly oscillating, which has the effect of moving
the corners of the vehicle up and down. The period of oscillation is typically on the order
of 0.5 sec (see [84, 126]). In general, it takes several periods to damp out the spring; for
example, the damping ratio provided by shock absorbers of passenger cars is in the range
0.2 − 0.4. The maximum velocity of the oscillation is typically on the order of 0.1 m/sec.
Consider the coordinate system Cxyz with origin at the center of mass C of the vehicle
(see Figure 2.2). Let $v_i$ be the velocity of corner $C_i$ of the vehicle, and let the length and width of the vehicle be $L$ and $W$. From the $v_i$'s we can compute the angular velocity matrix
$$\tilde{\Omega} \equiv \begin{pmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 & (v_3 - v_4)/L \\ 0 & 0 & (v_2 - v_3)/W \\ (v_4 - v_3)/L & (v_3 - v_2)/W & 0 \end{pmatrix}. \qquad (2.6)$$
Note that any of the $v_i$'s can be positive or negative. Multiplication by $\tilde{\Omega}$ can be replaced by the vector product with the angular velocity vector $\vec{\omega} = \vec{\imath}\,(v_3 - v_2)/W + \vec{\jmath}\,(v_3 - v_4)/L$, where the rate of rotation around the $x$ axis (the roll velocity) is $\omega_x = (v_3 - v_2)/W$ and the rate of rotation around the $y$ axis (the pitch velocity) is $\omega_y = (v_3 - v_4)/L$. As noted above, we typically have $|v_i| < 0.1$ m/sec. If we assume that $W > 1$ m and $L > 2$ m we have $|\omega_x| < 0.2$ rad/sec and $|\omega_y| < 0.1$ rad/sec. The yaw velocity component is $|\omega_z| = O(v_i/\sqrt{W^2 + L^2})$, which is $|\omega_z| = O(0.01)$ rad/sec. (For a complete derivation see [39].)
The translational velocity vector of the center C of the vehicle is obtained by using the
velocities v1 − v3 , v2 − v3 , 0, and v4 − v3 for the corners and adding v3 to the velocity in the
direction of the z-axis. We thus have
$$\vec{T} = \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix} = \begin{pmatrix} -(v_4 - v_3)H/L \\ (v_2 - v_3)H/W \\ (v_2 + v_4)/2 \end{pmatrix}. \qquad (2.7)$$
If we assume that H < 0.5 m (< W/2) we have |tx | < 0.05 m/sec, |ty | < 0.1 m/sec, and
|tz | < 0.1 m/sec.
We can draw several conclusions from this discussion: (i) The effects of small bumps are
of short duration, i.e. they can be considered to be impulsive. The suspension elements
integrate and damp these effects, resulting in a set of out-of-phase oscillatory motions. (ii)
The yaw component of rotation due to the effect of a bump is very small compared to the
roll and pitch components. (iii) The translational effects of a bump are proportional to the
velocities (or displacements) of the suspension elements and the dimensions of the vehicle
and are quite small.
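To make these orders of magnitude concrete, here is a small numerical sketch of the worst-case bounds implied by (2.6) and (2.7). The vehicle dimensions and the maximal suspension velocity are assumed, typical values, not quantities taken from the experiments.

```python
import numpy as np

# Worst-case bounds from (2.6)-(2.7) under assumed, illustrative values:
# |v_i| <= 0.1 m/s suspension velocities, width W = 1.5 m, length L = 2.5 m,
# height of the mass center H = 0.5 m.
v_max = 0.1                      # m/s
W, L, H = 1.5, 2.5, 0.5          # m

roll_max  = 2 * v_max / W        # |omega_x| = |v3 - v2| / W
pitch_max = 2 * v_max / L        # |omega_y| = |v3 - v4| / L
yaw_order = v_max / np.hypot(W, L)   # |omega_z| = O(v_i / sqrt(W^2 + L^2))

tx_max = 2 * v_max * H / L       # |t_x| = |v4 - v3| H / L
ty_max = 2 * v_max * H / W       # |t_y| = |v2 - v3| H / W
tz_max = v_max                   # |t_z| = |v2 + v4| / 2

print(f"roll < {roll_max:.2f}, pitch < {pitch_max:.2f}, yaw ~ {yaw_order:.2f} rad/s")
print(f"|t| < ({tx_max:.2f}, {ty_max:.2f}, {tz_max:.2f}) m/s")
# Roll and pitch rates on the order of 0.1 rad/s, yaw on the order of 0.01-0.03 rad/s,
# and translational effects of at most about 0.1 m/s, consistent with the text.
```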
2.2.3 The Sizes of the Smooth and Non-smooth Velocity Components
We now compare the sizes of the velocity components which are due to the ideal motion
of the vehicle — i.e., the velocity components of the Darboux frame (Section 2.2) — to the
sizes of the velocity components which are due to departures of the vehicle frame from the
Darboux frame (Section 2.2.2).
The translational velocity of the Darboux frame is just v⃗t; thus the magnitude of the
translational velocity is just v. If v = 10 m/sec (= 36 km/hr≈ 22 mi/hr) this velocity is
much larger than the velocities which are due to departures of the vehicle from the Darboux
frame, which, as we have just seen, are on the order of 0.1m/sec or less.
The rotational velocity of the Darboux frame is $v\vec{\omega}_d = v(\tau_g\vec{t} + \kappa_n\vec{v} + \kappa_g\vec{s})$; thus the magnitude of the rotational velocity is $v\sqrt{\tau_g^2 + \kappa_n^2 + \kappa_g^2}$. In this section we will estimate bounds on $\tau_g$, $\kappa_n$, and $\kappa_g$. Our analysis is based on the analyses in [4, 41, 126] and on the highway design recommendations published by the American Association of State Highway Officials [83].
Good highway design allows a driver to make turns at constant angular velocities, and
to follow spiral arcs in transitioning in and out of turns, in order to reduce undesirable
acceleration effects on the vehicle. A well-designed highway turn also has a transverse slope,
with the outside higher than the inside, to counterbalance the centrifugal force on the turning
vehicles. Thus the ideal (smooth) motion of a vehicle has piecewise constant translational
and rotational velocity components, with smooth transitions between them. Note that the
translational components are constant in the vehicle coordinate frame even when the vehicle
is turning, unless it slows down to make the turn.
To illustrate the typical sizes of these components, consider a ground vehicle moving
with velocity v along a plane curve Γ on the surface Σ. If Σ is a plane and Γ is a circular
arc with radius of curvature ρg = |κg |−1 (i.e., the vehicle is turning with a constant steering
angle), the angular velocity of the vehicle is v⃗ωd = vκg⃗s and there is a centripetal acceleration
⃗ac = v 2 κg⃗v at the vehicle’s center [63]. As a result there is a centrifugal force on the vehicle
proportional to $\|\vec{a}_c\|$ and the mass of the vehicle. If skidding is to be avoided the limit on $\|\vec{a}_c\|$ (see [4]) is given by
$$\|\vec{a}_c\| = v^2\kappa_g \le g(\tan\alpha + \mu_a) \qquad (2.8)$$
where $g$ is the gravitational acceleration, $\alpha$ is the transverse slope, and $\mu_a$ is the coefficient of adhesion between the wheels and the surface. [Typical values of $\mu_a$ range from 0.8--0.9 for dry asphalt and concrete to 0.1 for ice (see [126], page 26).] From (2.8) we have either a lower bound on $\rho_g$ for a given $v$ or an upper bound on $v$ for a given $\rho_g$. For example, if $v = 30$ m/sec ($\approx 108$ km/hr), $\alpha = 0.05$ rad, and $\mu_a = 0.2$, then from $v^2/\rho_g < 0.25g$ we have $\rho_g > 367$ m. This yields an upper bound on the yaw angular velocity of $v|\kappa_g| = v/\rho_g \approx 0.08$ rad/sec, which is somewhat larger than the yaw angular velocity arising from the departures from Darboux motion.
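As a quick check of the numbers quoted above, the following sketch evaluates (2.8) for the same assumed values (v = 30 m/s, α = 0.05 rad, μa = 0.2):

```python
import math

# Worked example of constraint (2.8) with the values used in the text.
g, v, alpha, mu_a = 9.81, 30.0, 0.05, 0.2

a_max = g * (math.tan(alpha) + mu_a)   # maximal admissible centripetal acceleration
rho_min = v**2 / a_max                 # lower bound on the turn radius (m)
yaw_max = v / rho_min                  # upper bound on the yaw rate (rad/s)

print(f"rho_g > {rho_min:.0f} m, yaw rate < {yaw_max:.3f} rad/s")
# -> rho_g > ~367 m and yaw rate < ~0.082 rad/s, matching the bounds in the text.
```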
Other dynamic constraints on a vehicle such as the limits on torques and forces can be
used to obtain constraints on τg and κn . (These and other considerations such as safety and
comfort were used in [83] to make recommendations for highway design; for a summary of
these recommendations see [39].) For both vertical curves (crossing a hill) and turning curves
the (recommended lower bound on the) radius of curvature ρmin grows with the square of
the design velocity vd . However, the resulting (design) yaw and pitch angular velocities are
limited by vd /ρmin . Thus for smaller velocities v the vehicle can negotiate tighter vertical
and turning curves and thus have even larger values of the yaw and pitch angular velocities.
Typical values of the roll and pitch angular velocities are given in [39].
For realistic vehicle speeds we can conclude the following about the impulsive and smooth
translational and rotational velocity components of the vehicle [39]. The impulsive effects
on the translational velocity are approximately two orders of magnitude smaller than the
smooth velocity components themselves. Impulsive effects on the yaw angular velocity are
somewhat smaller than the smooth yaw component arising from worst-case turns of the road;
for moderate turns the impulsive effects are comparable in size to the smooth yaw velocity.
Impulsive effects on the roll angular velocity are approximately an order of magnitude larger
than the smooth roll component arising from worst-case twists (and turns) of the road; for
gentler twists the smooth roll velocity is even smaller. Similarly, impulsive effects on the
pitch angular velocity are approximately an order of magnitude larger than the smooth pitch
velocity arising from worst-case changes of vertical slope (i.e., vertical curves) of the road;
for gentler vertical curves the smooth pitch angular velocity is even smaller. (The impulsive effects
are not significantly affected by turns, twists, or vertical slope.) We can thus conclude that
impulsive effects on the roll and pitch angular velocities are significant and larger than the
corresponding smooth velocities, and that impulsive effects on the yaw angular velocity are
on the order of the smooth yaw velocity.
2.2.4 Camera Motion
Assume that a camera is mounted on the vehicle; let d⃗c be the position vector of the mass
center of the vehicle relative to the nodal point of the camera. The orientation of the vehicle
coordinate system Cξηζ relative to the camera is given by an orthogonal rotational matrix
(a matrix of the direction cosines) which we denote by Rc . The columns of Rc are the unit
vectors of the vehicle coordinate system expressed in the camera coordinate system. We will
assume that the position and orientation of the vehicle relative to the camera coordinate
system do not change as the vehicle moves. Thus we will assume that Rc and d⃗c are constant
and known.
Given the position $\vec{p}_e$ of a scene point $E$ in the vehicle coordinate system $C\xi\eta\zeta$, its position $\vec{r}_e$ in the camera coordinate system is given by
$$\vec{r}_e = R_c\vec{p}_e + \vec{d}_c.$$
Since $R_c$ and $\vec{d}_c$ are constant we have $\dot{\vec{r}}_e = R_c\dot{\vec{p}}_e$. The velocity of $E$ is given by
$$\dot{\vec{r}}_e = -\vec{\omega} \times \vec{r}_e - \vec{T}. \qquad (2.9)$$
In this expression, the rotational velocity is $\vec{\omega} = R_c\vec{\omega}_v$ (see (2.5)), and the translational velocity is $\vec{T} = R_c\vec{T}_v + \vec{\omega} \times \vec{d}_c$ (see (2.4)), where $\vec{\omega}_v$ and $\vec{T}_v$ are the rotational and translational velocities of the vehicle coordinate system.
We saw in Section 2.2.3 that both $\|v\vec{\omega}_d\|$ and $\|\vec{\omega}_{v/d}\|$ are $O(0.1)$ rad/sec. The factors $R_c$ and $R_{v/d}^T$ do not affect the magnitude of either $\vec{\omega}$ or $\vec{T}$. Thus the two components of rotational velocity have comparable magnitudes.
As regards the translational components, note that for normal speeds of the vehicle ($v > 10$ m/sec $\approx 22$ mi/hr), typical suspension elements, and the camera mounted on the vehicle close to the center of mass, we have (see Section 2.2.3) $\|\vec{d}_{v/d}\| = O(0.025)$ m, $\|\dot{\vec{d}}_{v/d}\| = O(0.1)$ m/sec, and $\|\vec{d}_c\| = O(1)$ m. The magnitudes of the translational velocity components are thus $\|v\vec{\omega}_d \times \vec{d}_{v/d}\| \le v\|\vec{\omega}_d\|\,\|\vec{d}_{v/d}\| = O(0.0025)$ m/sec; $\|v\vec{t}\| = v = O(10)$ m/sec; and $\|\vec{\omega} \times \vec{d}_c\| \le \|\vec{\omega}\|\,\|\vec{d}_c\| = O(0.1)$ m/sec. Therefore, the dominant term in the expression for $\vec{T}$ is $v\vec{t}$, since it is two orders of magnitude larger than any of the other three terms of $\vec{T}$.
2.2.5 Independently Moving Vehicles
We are interested in other vehicles that are moving nearby. We assume the other vehicles
are all moving in the same direction. To facilitate the derivation of the motion equations
of a rigid body B we use two rectangular coordinate frames, one (Oxyz) fixed in space, the
other (Cx1 y1 z1 ) fixed in the body and moving with it. The position of the moving frame at
any instant is given by the position d⃗1 of the origin C1 , and by the nine direction cosines of
the axes of the moving frame with respect to the fixed frame. For a given position p⃗ of P in
$Cx_1y_1z_1$ we have the position $\vec{r}_p$ of $P$ in $Oxyz$:
$$\vec{r}_p = R\vec{p} + \vec{d}_1 \qquad (2.10)$$
where $R$ is the matrix of the direction cosines (the frames are taken as right-handed so that $\det R = 1$). The velocity $\dot{\vec{r}}_p$ of $P$ in $Oxyz$ is given by
$$\dot{\vec{r}}_p = \vec{\omega}_1 \times (\vec{r}_p - \vec{d}_1) + \dot{\vec{d}}_1, \qquad (2.11)$$
where $\dot{\vec{d}}_1$ is the translational velocity vector and $\vec{\omega}_1 = (\omega_x\ \omega_y\ \omega_z)^T$ is the rotational velocity vector.
From Section 2.2.3 we have for a typical vehicle $\|\vec{\omega}_1\| = O(0.1)$ rad/sec and $\|\vec{r}_p - \vec{d}_1\| = O(1)$ m. We thus have $\|\vec{\omega}_1 \times (\vec{r}_p - \vec{d}_1)\| \le \|\vec{\omega}_1\|\,\|\vec{r}_p - \vec{d}_1\| = O(0.1)O(1) = O(0.1)$ m/sec. For the translational velocity we have $\|\dot{\vec{d}}_1\| = v$; for normal speeds of the vehicle $v > 10$ m/sec $\approx 22$ mi/hr, so that $\|\dot{\vec{d}}_1\| = O(10)$ m/sec. We can conclude that for any point $P$ on the vehicle the translational velocity is two orders of magnitude larger than the rotational velocity.
If we make the fixed frame $Oxyz$ correspond to the camera frame at time $t$ we have from (2.9) and (2.11) that the velocity of a point $P$ on the vehicle expressed in the camera frame is given by
$$\dot{\vec{r}}_p = -\vec{\omega} \times \vec{r}_p + \vec{\omega}_1 \times (\vec{r}_p - \vec{d}_1) - \vec{T} + \dot{\vec{d}}_1. \qquad (2.12)$$
In (2.12) the vector $-\vec{T} + \dot{\vec{d}}_1$ corresponds to the relative translational velocity between the camera and the independently moving vehicle. Regarding the first and the second terms on the r.h.s. of (2.12) we can see that for comparable rotational velocities $\vec{\omega}$ and $\vec{\omega}_1$ the first term will dominate the second term since usually $\|\vec{r}_p - \vec{d}_1\| \ll \|\vec{r}_p\|$. We will use this observation later.
2.3 Image Motion
2.3.1 The Imaging Models
Let (X, Y, Z) denote the Cartesian coordinates of a scene point with respect to the fixed
camera frame (see Figure 2.3), and let (x, y) denote the corresponding coordinates in the
image plane. The equation of the image plane is Z = f , where f is the focal length of the
camera. The perspective projection onto this plane is given by
$$x = \frac{fX}{Z}, \qquad y = \frac{fY}{Z}. \qquad (2.13)$$
For weak perspective projection we need a reference point $(X_c, Y_c, Z_c)$. A scene point $(X, Y, Z)$ is first projected onto the point $(X, Y, Z_c)$; then, through plane perspective projection, the point $(X, Y, Z_c)$ is projected onto the image point $(x, y)$. The projection equations are then given by
$$x = f\frac{X}{Z_c}, \qquad y = f\frac{Y}{Z_c}. \qquad (2.14)$$
Figure 2.3: The plane perspective projection image of P is F = f (X/Z, Y /Z, 1); the weak
perspective projection image of P is obtained through the plane perspective projection of
the intermediate point P1 = (X, Y, Zc ) and is given by G = f (X/Zc , Y /Zc , 1).
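To illustrate the difference between the two models, here is a minimal sketch comparing perspective projection (2.13) with weak perspective projection (2.14); the focal length and the scene points are arbitrary illustrative values.

```python
import numpy as np

f = 500.0                                    # focal length in pixels (illustrative)

def perspective(P):
    X, Y, Z = P
    return np.array([f * X / Z, f * Y / Z])          # eq. (2.13)

def weak_perspective(P, Zc):
    X, Y, _ = P
    return np.array([f * X / Zc, f * Y / Zc])        # eq. (2.14)

# Points on a shallow object centered at depth Zc = 20 m.
points = np.array([[1.0, 0.5, 19.5], [-1.0, 0.2, 20.5], [0.3, -0.4, 20.0]])
Zc = points[:, 2].mean()

for P in points:
    print(perspective(P), weak_perspective(P, Zc))
# When the depth extent of the object is small compared to Zc the two projections
# nearly coincide, which justifies the weak perspective analysis of nearby vehicles.
```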
2.3.2 The Image Motion Field and the Optical Flow Field
The instantaneous velocity of the image point (x, y) under perspective projection is obtained
by taking the derivatives of (2.13) and using (2.9):
$$\dot{x} = f\,\frac{\dot{X}Z - X\dot{Z}}{Z^2} = \frac{-Uf + xW}{Z} + \frac{xy}{f}\,\omega_x - \left(\frac{x^2}{f} + f\right)\omega_y + \omega_z y, \qquad (2.15)$$
$$\dot{y} = f\,\frac{\dot{Y}Z - Y\dot{Z}}{Z^2} = \frac{-Vf + yW}{Z} + \left(\frac{y^2}{f} + f\right)\omega_x - \frac{xy}{f}\,\omega_y - \omega_z x. \qquad (2.16)$$
The instantaneous velocity of the image point (x, y) under weak perspective projection
can be obtained by taking derivatives of (2.14) with respect to time and using (2.9):
$$\dot{x} = f\,\frac{\dot{X}Z_c - X\dot{Z}_c}{Z_c^2} = \frac{-Uf + xW}{Z_c} - f\omega_y\frac{Z}{Z_c} + \omega_z y, \qquad (2.17)$$
$$\dot{y} = f\,\frac{\dot{Y}Z_c - Y\dot{Z}_c}{Z_c^2} = \frac{-Vf + yW}{Z_c} + f\omega_x\frac{Z}{Z_c} - \omega_z x. \qquad (2.18)$$
Let ⃗ı and ⃗ȷ be the unit vectors in the x and y directions, respectively; ⃗r˙ = ẋ⃗ı + ẏ⃗ȷ is
the projected motion field at the point ⃗r = x⃗ı + y⃗ȷ. If we choose a unit direction vector ⃗nr
at the image point ⃗r and call it the normal direction, then the normal motion field at ⃗r is
⃗r˙n = (⃗r˙ · ⃗nr )⃗nr . ⃗nr can be chosen in various ways; the usual choice (as we shall now see) is
the direction of the image intensity gradient.
Let I(x, y, t) be the image intensity function. The time derivative of I can be written as
$$\frac{dI}{dt} = \frac{\partial I}{\partial x}\frac{dx}{dt} + \frac{\partial I}{\partial y}\frac{dy}{dt} + \frac{\partial I}{\partial t} = (I_x\vec{\imath} + I_y\vec{\jmath})\cdot(\dot{x}\vec{\imath} + \dot{y}\vec{\jmath}) + I_t = \nabla I\cdot\dot{\vec{r}} + I_t$$
where ∇I is the image gradient and the subscripts denote partial derivatives.
If we assume dI/dt = 0, i.e. that the image intensity does not vary with time, then we
have ∇I · ⃗u + It = 0. The vector field ⃗u in this expression is called the optical flow. If we
choose the normal direction ⃗nr to be the image gradient direction, i.e. ⃗nr ≡ ∇I/∥∇I∥, we
then have
$$\vec{u}_n = (\vec{u}\cdot\vec{n}_r)\,\vec{n}_r = \frac{-I_t\,\nabla I}{\|\nabla I\|^2} \qquad (2.19)$$
where ⃗un is called the normal flow.
It was shown in [121] that the magnitude of the difference between ⃗un and the normal
motion field ⃗rn is inversely proportional to the magnitude of the image gradient. Hence
⃗r˙n ≈ ⃗un when ∥∇I∥ is large. Equation (2.19) thus provides an approximate relationship
between the 3-D motion and the image derivatives. We will use this approximation later in
this paper.
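For concreteness, the following minimal sketch computes the normal flow (2.19) from a pair of synthetic frames using simple finite differences; the test image, the derivative operators and the gradient threshold are illustrative choices, not the ones used in the experiments.

```python
import numpy as np

def normal_flow(I0, I1, eps=1e-8):
    """Normal flow (2.19) from two frames via crude finite differences."""
    Iy, Ix = np.gradient(I0.astype(float))        # spatial derivatives (rows = y, cols = x)
    It = I1.astype(float) - I0.astype(float)      # temporal derivative
    mag2 = Ix**2 + Iy**2 + eps
    return -It * Ix / mag2, -It * Iy / mag2, np.sqrt(mag2)   # u_n = -I_t grad(I)/|grad(I)|^2

# Synthetic example: a bright square shifted one pixel to the right between frames.
I0 = np.zeros((32, 32)); I0[12:20, 10:18] = 1.0
I1 = np.zeros((32, 32)); I1[12:20, 11:19] = 1.0
un_x, un_y, grad_mag = normal_flow(I0, I1)
mask = grad_mag > 0.3        # only strong-gradient points carry reliable normal flow [121]
print(un_x[mask].mean())     # positive: the vertical edges move to the right
```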
2.3.3 Estimation of Rotation
We now describe our algorithm for estimating rotation. In this section we give only a brief
description of the algorithm. A full description and a proof of correctness will be given in
a forthcoming paper. We shall use the following notation: Let I be the image intensity at
⃗r, and let ⃗nr = nx⃗ı + ny⃗ȷ = ∇I/∥∇I∥ be the direction of the image intensity gradient at ⃗r.
The normal motion field at ⃗r is the projection of the image motion field onto the gradient
direction ⃗nr and is given by ⃗r˙n = (⃗r˙ · ⃗nr )⃗nr . From (2.15-2.16) we have
$$\begin{aligned}
\dot{\vec{r}}_n\cdot\vec{n}_r = n_x\dot{x} + n_y\dot{y} ={}& \frac{1}{Z}\left[n_x(-Uf + xW) + n_y(-Vf + yW)\right] \\
&+ \left[n_x\frac{xy}{f} + n_y\left(\frac{y^2}{f} + f\right)\right]\omega_x - \left[n_x\left(\frac{x^2}{f} + f\right) + n_y\frac{xy}{f}\right]\omega_y + (n_x y - n_y x)\,\omega_z \qquad (2.20)
\end{aligned}$$
The first term on the r.h.s. of (2.20) is the translational normal motion ⃗r˙ t · ⃗nr and the
remaining three terms are the rotational normal motion ⃗r˙ ω ·⃗nr . From now on we will assume
that the camera is forward-looking, i.e. that the focus of expansion (FOE) is in the image.
The normal flow at ⃗r is defined as −It /∥∇I∥. From [121] we know that the magnitude
of the difference between the normal flow field and the normal motion field is inversely
proportional to the gradient magnitude; we can thus write
$$\dot{\vec{r}}_n\cdot\vec{n}_r = \dot{\vec{r}}_t\cdot\vec{n}_r + \dot{\vec{r}}_\omega\cdot\vec{n}_r = -\frac{I_t}{\|\nabla I\|} + O(\|\nabla I\|^{-1}) = \vec{u}_n\cdot\vec{n}_r + O(\|\nabla I\|^{-1}). \qquad (2.21)$$
If the camera motion is a pure translation the image motion field is a radial pattern; the
magnitude of each image motion vector is proportional to the distance of the image point from
the focus of expansion (FOE) and inversely proportional to the depth of the corresponding
scene point. If the position ⃗r0 = ⃗ıx0 +⃗ȷy0 of the FOE is known the translational motion field
can be obtained from the translational normal motion field by multiplying each $\dot{\vec{r}}_t\cdot\vec{n}_r$ by a vector whose direction is $(\vec{r} - \vec{r}_0)$ and whose magnitude is inversely proportional to the cosine of the angle between the normal direction $\vec{n}_r$ and the radial direction $(\vec{r} - \vec{r}_0)$. The translational motion field is then given by
$$\dot{\vec{r}}_t = (\dot{\vec{r}}_t\cdot\vec{n}_r)\,\frac{\vec{r} - \vec{r}_0}{(\vec{r} - \vec{r}_0)\cdot\vec{n}_r}. \qquad (2.22)$$
Note that if we knew $\vec{\omega}$ we could compute the rotational motion field $\dot{\vec{r}}_\omega$ and the rotational normal motion $\dot{\vec{r}}_\omega\cdot\vec{n}_r$ and use (2.21) to obtain
$$\dot{\vec{r}}_t\cdot\vec{n}_r \approx \vec{u}\cdot\vec{n}_r - \dot{\vec{r}}_\omega\cdot\vec{n}_r. \qquad (2.23)$$
If we combine (2.22) and (2.23) we have
$$\dot{\vec{r}}_t \approx \left(\vec{u}\cdot\vec{n}_r - \dot{\vec{r}}_\omega\cdot\vec{n}_r\right)\frac{\vec{r} - \vec{r}_0}{(\vec{r} - \vec{r}_0)\cdot\vec{n}_r}. \qquad (2.24)$$
If the FOE is known or can be estimated, we can use (2.24) to estimate the rotational velocity vector $\vec{\omega}$ by minimizing $\sum_{\vec{r}} \|\dot{\vec{r}}_t\|^2$. Indeed, the image motion field in the neighborhood of the FOE is composed of the translational image motion and the rotational image motion. The roll component of the rotational image motion is orthogonal to the translational image motion, so that it increases the magnitude of the image motion field and the normal motion field. The yaw and the pitch components of the rotational image motion are approximately constant in the neighborhood of the FOE and just shift the position of the singular (zero) point of the flow field [40]. Furthermore, the rotational normal motion accounts for most of the image motion field at the distant image points [39]. Therefore, if we subtract the rotational image motion, the sum of the magnitudes of the resulting (translational) flow field will be minimal. Using (2.20) and (2.24) we then have
$$\vec{\omega} = \arg\min_{\vec{\omega}} \sum_{\vec{r}} \|\vec{u}\cdot\vec{n}_r - \dot{\vec{r}}_\omega\cdot\vec{n}_r\|^2\,\frac{\|\vec{r} - \vec{r}_0\|^2}{\|(\vec{r} - \vec{r}_0)\cdot\vec{n}_r\|^2}. \qquad (2.25)$$
In matrix form this problem corresponds to minimizing $\|A\vec{\omega} - \vec{b}\|$ (see [114]), where the rows $\vec{a}_i$ of $A$ are given by
$$\vec{a}_i = \frac{\|\vec{r} - \vec{r}_0\|}{\|(\vec{r} - \vec{r}_0)\cdot\vec{n}_r\|}\left[\; n_x\frac{xy}{f} + n_y\left(\frac{y^2}{f} + f\right);\;\; -n_x\left(\frac{x^2}{f} + f\right) - n_y\frac{xy}{f};\;\; n_x y - n_y x \;\right] \qquad (2.26)$$
and the elements $b_i$ of $\vec{b}$ are given by
$$b_i = \vec{u}\cdot\vec{n}_r\,\frac{\|\vec{r} - \vec{r}_0\|}{\|(\vec{r} - \vec{r}_0)\cdot\vec{n}_r\|}.$$
The solution is given by
$$\vec{\omega} = A^+\vec{b}$$
where $A^+$ is the generalized inverse of $A$ (see [114]).
2.3.4 Estimating the FOE
In the case of a forward looking camera and an unknown FOE it is possible to simultaneously estimate the FOE and $\vec{\omega}$. Based on our analysis in Section 2.2 we observe that, in the case of a camera rigidly connected to the vehicle and a scene without independently moving objects, the direction of the translational velocity of the vehicle remains approximately constant throughout the image sequence. The rotational velocity $\vec{\omega}$, however, depends on both the geometry of the road and the activity of the suspension elements; hence, $\vec{\omega}$ changes as the vehicle moves. It is possible to choose a part of the road without horizontal and/or vertical curves and/or changes of lateral slope, so that the vehicle rotations correspond to activities of the suspension elements. In these cases the rotational velocity components change in an almost periodic manner. If the scenes are chosen so that significant parts of the images correspond to distant scene points, the following algorithm can be used to reliably estimate the FOE.
Given N successive image frames taken by a forward looking camera mounted on a moving
vehicle we form the following function
$$\varphi = \sum_{k=1}^{N}\sum_{i=1}^{N_k} \|\vec{u}_i\cdot\vec{n}_r - \dot{\vec{r}}_{\omega_k}\cdot\vec{n}_r\|^2\,\frac{\|\vec{r}_i - \vec{r}_0\|^2}{\|(\vec{r}_i - \vec{r}_0)\cdot\vec{n}_r\|^2} \qquad (2.27)$$
where the inside sum is over all normal flow vectors in the kth frame, and the outside sum
is over all frames; the position of the FOE does not change throughout the sequence (N
frames), but $\vec{\omega}$ changes at every frame (hence the index $k$). We estimate $\vec{\omega}_k$, $k = 1, \ldots, N$, and $\vec{r}_0$ by minimizing $\varphi$.
A straightforward method for minimizing $\varphi$ corresponds to a nonlinear least squares problem. It can be observed from (2.27) that $\varphi$ is linear in the $\vec{\omega}_k$'s and nonlinear in $\vec{r}_0$. In the numerical analysis literature such problems are called "separable nonlinear least squares problems" [97]. Several algorithms for solving such problems have been proposed; a unifying feature of these algorithms is that they have better performance than standard nonlinear least squares algorithms [97], since they are based on solving problems of smaller dimensionality than the unseparated problem.
In this paper we use a simple version of the separable algorithm. We choose an initial $\vec{r}_0$ and solve for the $\vec{\omega}_k$ vectors. This problem is equivalent to solving $N$ linear least squares problems as described in Section 2.3.3. We then have $\vec{\omega}_k = A_k^+\vec{b}_k$ with the $A_k$'s and the $\vec{b}_k$'s appropriately defined. Once the $\vec{\omega}_k$'s are estimated we have a nonlinear least squares problem in the two components of $\vec{r}_0$: we need to minimize $\varphi$ for the fixed $\vec{\omega}_k$'s. From (2.27) we then have
$$\vec{r}_0 = \arg\min_{\vec{r}_0} \sum_{k=1}^{N}\sum_{i=1}^{N_k} \alpha_{k,i}^2\,\frac{\|\vec{r}_i - \vec{r}_0\|^2}{\|(\vec{r}_i - \vec{r}_0)\cdot\vec{n}_r\|^2} \qquad (2.28)$$
where $\alpha_{k,i} = \|\vec{u}_i\cdot\vec{n}_r - \dot{\vec{r}}_{\omega_k}\cdot\vec{n}_r\|$. We use the Gauss-Newton algorithm to solve this problem. The partial derivatives $\partial\varphi/\partial x_0$ and $\partial\varphi/\partial y_0$ that are required by the Gauss-Newton algorithm are readily obtained from (2.27-2.28).
After solving for $\vec{r}_0$ we substitute $\vec{r}_0$ in (2.27) and solve for the $\vec{\omega}_k$'s; we then use these values to solve for $\vec{r}_0$ again. The process is repeated until $\vec{r}_0$ converges. A simple test for convergence is $\|\vec{r}_0(s) - \vec{r}_0(s+1)\| \le 1$, where $s$ is the iteration number.
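The structure of this alternating scheme can be summarized in code. The following is only a structural sketch under assumed data structures (a list of per-frame (points, normals, normal flow) measurements); it inlines a per-frame linear rotation solver like the one in Section 2.3.3 and, for brevity, replaces the Gauss-Newton step by SciPy's general least-squares solver rather than the thesis implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def rotational_normal_flow(x, y, nx, ny, omega, f):
    """Rotational normal motion at an image point, i.e. the omega-terms of (2.20)."""
    return ((nx * x * y / f + ny * (y**2 / f + f)) * omega[0]
            - (nx * (x**2 / f + f) + ny * x * y / f) * omega[1]
            + (nx * y - ny * x) * omega[2])

def solve_omega(pts, nrm, u_n, r0, f):
    """Linear least squares for a single frame's rotation (Section 2.3.3)."""
    A, b = [], []
    for (x, y), (nx, ny), un in zip(pts, nrm, u_n):
        d = np.array([x, y]) - r0
        w = np.linalg.norm(d) / (abs(d @ np.array([nx, ny])) + 1e-6)
        A.append(w * np.array([nx * x * y / f + ny * (y**2 / f + f),
                               -(nx * (x**2 / f + f) + ny * x * y / f),
                               nx * y - ny * x]))
        b.append(w * un)
    return np.linalg.pinv(np.array(A)) @ np.array(b)

def residuals(r0, frames, omegas, f):
    """Weighted residuals of (2.27)-(2.28) for a candidate FOE r0."""
    res = []
    for (pts, nrm, u_n), omega in zip(frames, omegas):
        for (x, y), (nx, ny), un in zip(pts, nrm, u_n):
            d = np.array([x, y]) - r0
            w = np.linalg.norm(d) / (abs(d @ np.array([nx, ny])) + 1e-6)
            res.append(w * (un - rotational_normal_flow(x, y, nx, ny, omega, f)))
    return np.array(res)

def estimate_foe_and_rotations(frames, f, r0_init, max_iter=20):
    """frames: list of (points, normals, u_n) per frame; alternates the two steps."""
    r0 = np.asarray(r0_init, dtype=float)
    for _ in range(max_iter):
        omegas = [solve_omega(pts, nrm, u_n, r0, f) for pts, nrm, u_n in frames]
        r0_new = least_squares(residuals, r0, args=(frames, omegas, f)).x
        if np.linalg.norm(r0_new - r0) <= 1.0:     # convergence test from the text
            return r0_new, omegas
        r0 = r0_new
    return r0, omegas
```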
2.3.5 The Image Motion Field of an Observed Vehicle
From (2.17–2.18) we obtain the (approximate) equations of projected motion for points on
a vehicle under weak perspective:
$$\dot{x} = \frac{Uf - xW}{Z_c}, \qquad (2.29)$$
$$\dot{y} = \frac{Vf - yW}{Z_c}. \qquad (2.30)$$
Equations (2.29-2.30) relate the image (projected) motion field to the scaled translational velocity $Z_c^{-1}\vec{T} = Z_c^{-1}(U\ V\ W)^T$.
Given the point ⃗r = x⃗ı + y⃗ȷ and the normal direction nx⃗ı + ny⃗ȷ, from (2.29-2.30) the
normal motion field $\dot{\vec{r}}_n\cdot\vec{n} = n_x\dot{x} + n_y\dot{y}$ is given by
$$\dot{\vec{r}}_n\cdot\vec{n} = n_x f\,UZ_c^{-1} + n_y f\,VZ_c^{-1} - (n_x x + n_y y)\,WZ_c^{-1}. \qquad (2.31)$$
Let
$$a = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix} \equiv \begin{pmatrix} n_x f \\ n_y f \\ -n_x x - n_y y \end{pmatrix}, \qquad c = \begin{pmatrix} c_1 \\ c_2 \\ c_3 \end{pmatrix} \equiv \begin{pmatrix} UZ_c^{-1} \\ VZ_c^{-1} \\ WZ_c^{-1} \end{pmatrix}. \qquad (2.32)$$
Using (2.32) we can write (2.31) as $\dot{\vec{r}}_n\cdot\vec{n} = a^T c$. The column vector $a$ is formed of observable quantities only, while each element of the column vector $c$ contains quantities which are not directly observable from the images. To estimate $c$ we need estimates of $\dot{\vec{r}}_n\cdot\vec{n}$ at three or more image points.
2.3.6 Estimating Vehicle Motion from Normal Flow
As in Section 2.3.5 we use linear least squares to estimate parameter vector c from the normal
flow.
In the case of a moving vehicle the parameters of interest are the vehicle's trajectory and its rate of approach. The rate of approach
$$\nu = \frac{W}{Z_c}$$
(measured in sec$^{-1}$) is equivalent to the inverse of the time to collision and corresponds to the rate with which an object is approaching the camera (or receding from it). The rate $\nu = 0.1$/sec means that every second the object travels 0.1 of the distance between the observer and its current position. A negative rate of approach means that the object is going away from the camera.
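The following minimal sketch shows how c, and hence the rate of approach, might be recovered by linear least squares from normal flow measurements on the vehicle; the focal length and the synthetic data are illustrative values only.

```python
import numpy as np

def estimate_scaled_velocity(points, normals, rdot_n, f):
    """Recover c = (U, V, W)/Zc from normal motion values via (2.31)-(2.32).
    points: (N,2) image coords on the vehicle; normals: (N,2) unit normals;
    rdot_n: (N,) normal motion values; f: focal length in pixels."""
    A = np.array([[nx * f, ny * f, -(nx * x + ny * y)]
                  for (x, y), (nx, ny) in zip(points, normals)])
    return np.linalg.lstsq(A, np.asarray(rdot_n), rcond=None)[0]

# Synthetic check: generate normal motion from a known c and recover it.
rng = np.random.default_rng(2)
f = 500.0
c_true = np.array([0.002, -0.001, 0.3])          # (U, V, W)/Zc  [1/s]
pts = rng.uniform(-50, 50, (20, 2))
nrm = rng.normal(size=(20, 2))
nrm /= np.linalg.norm(nrm, axis=1, keepdims=True)
rdot_n = [nx * f * c_true[0] + ny * f * c_true[1] - (nx * x + ny * y) * c_true[2]
          for (x, y), (nx, ny) in zip(pts, nrm)]

c = estimate_scaled_velocity(pts, nrm, rdot_n, f)
print(c)                                  # ~ c_true
print("rate of approach:", c[2], "1/s")   # nu = W/Zc, inverse of the time to collision
```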
2.4 Experiments
In Sections 2.4.1 and 2.4.2 we give examples illustrating road detection, stabilization, and
vehicle detection. In Section 2.4.3 we present results for several sequences showing vehicles
in motion.
2.4.1 Road and Vehicle Detection
We detect the road region by finding road edges and lane markers. A Canny edge detector is
applied and Hough-like voting is used to detect dominant straight lines in the image. Each
line is parameterized by its normal angle α and its displacement d from the center of the
image. For all possible values of α and d the image is scanned along the corresponding line.
The number of edge points that are found within a strip along the line, and whose gradient
direction is orthogonal to the line direction, is taken as the vote for the corresponding point
in the α - d plane. Among those lines, only the ones with a certain weight and orientation
are chosen to be road line candidates.
If several lines with close α and d values are candidates, only the best representative of
these lines is chosen. The other lines are eliminated by applying local maximum suppression
in the α - d plane. Note that if the lines represent the two edges of a lane marker, the
gradient directions will have opposite signs for the two edges.
Due to perspective, the road boundaries and lane marker lines should converge to a
single point. Candidate lines that do not converge to a single point are not identified as
road or lane boundaries. The identified lines are the maximal subsets of candidate lines that
all intersect at or close to one point (all the intersection points of every pair of the lines are
located in a small region).
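A point-wise formulation of the voting step described above can be sketched as follows; the (α, d) discretization, the angular tolerance, and the assumption that edge points and gradient angles come from a prior edge detector (e.g. Canny) are illustrative choices, not the exact implementation used here.

```python
import numpy as np

def vote_lines(edge_xy, grad_angle, shape, n_alpha=180, n_d=200, tol=np.radians(10)):
    """Accumulate votes in the (alpha, d) plane: each edge point votes for the lines
    it lies on whose normal direction roughly matches its gradient direction."""
    h, w = shape
    cx, cy = w / 2.0, h / 2.0
    d_max = np.hypot(cx, cy)
    alphas = np.linspace(0, np.pi, n_alpha, endpoint=False)
    votes = np.zeros((n_alpha, n_d))
    for (x, y), g in zip(edge_xy, grad_angle):
        for i, a in enumerate(alphas):
            # only edges whose gradient is (roughly) along the line normal may vote
            diff = (g - a) % np.pi
            if min(diff, np.pi - diff) > tol:
                continue
            d = (x - cx) * np.cos(a) + (y - cy) * np.sin(a)   # signed distance from center
            j = int((d + d_max) / (2 * d_max) * (n_d - 1))
            votes[i, j] += 1
    return votes, alphas

# Dominant lines are local maxima of `votes` (after non-maximum suppression in the
# alpha-d plane); candidates whose pairwise intersections cluster near one point are
# then kept as road/lane boundaries (vanishing-point consistency).
```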
Figure 2.4 presents some road detection and vehicle detection results for four different
sequences (collected in three different countries).
2.4.2 Stabilization
As was shown in [39] the impulsive effects introduce significant changes in the roll and
pitch angular velocities (see Figure 2.5). Figure 2.6 shows three examples of image sequence
stabilization by compensating the rotational effects of road bumps. The estimated rotational
normal flow component (column c) is subtracted from the total normal flow (column b), which
yields the translational normal flow component (column d). One can see that the translational
normal flow components at distant points are close to zero.
The vector of rotation is estimated by using the method based on FOE calculation, as
described in Section 2.3.3, or alternatively, by estimating the amount of rotation from the
apparent shifts of distant points. Two examples of distant point identification, using horizon
points, are shown in Figure 2.7.
2.4.3 Relative Motions of Vehicles
After derotating and detecting moving vehicles, we can analyze their motions using the
algorithm for motion estimation under weak perspective.
In the first experiment we used an image sequence taken in Haifa, Israel, from a vehicle
following two other accelerating vehicles. The sequence consisted of 90 frames (slightly less
than three seconds). Figure 2.8 shows frames 0, 30 and 60, and the corresponding normal
flow on the vehicles. Figure 2.9 shows estimated values of U Zc−1 , V Zc−1 , and W Zc−1 for the
central (closest) vehicle. These values correspond to the translations of the vehicles relative
Figure 2.4: (a-d) A selected frame from each of four sequences. Top: the input images.
Middle: results of road detection. Bottom: results of vehicle detection.
to the vehicle carrying the camera (i.e., in the observer coordinate system). Because of our
choice of coordinate system the rate of approach ν is equal to the negative of W Zc−1 .
The graphs show that the motion components have a simple behavior; before they reach
their extremal values they can be approximated by straight lines, indicating constant relative
accelerations.
In the second experiment we used an image sequence of a van, taken in France, from
another vehicle following the van [12, 38]. The sequence consisted of 56 frames (slightly less
than two seconds). Figure 2.10 shows frames 5, 15, 25, and 35 as well as the corresponding
normal flow. Figure 2.11 shows estimated values of U Zc−1 , V Zc−1 , and W Zc−1 . The graph
shows that there is an impending collision (rate of approach greater than 1 sec−1 ). Around
the 20th frame the rate of approach becomes zero (as do all the velocity components) and
after that it becomes negative because the van starts pulling away from the vehicle carrying
the camera. A similar image sequence was used in [12] in studies of vehicle convoy behavior.
The third sequence (taken from the IEN Galileo Ferrari Vision Image Library) consisted
Figure 2.5: (a-b) Two images taken 1/15th of a second apart; (c-d) their normal flows. One
can see the effects of bumps. In the first frame, the flow vectors point downward; in the
second, they point upward.
of 26 frames. Figure 2.12 shows frames 1, 14 and 26, as well as the corresponding normal
flow. Figure 2.13 shows estimated values of U Zc−1 , V Zc−1 , and W Zc−1 . The graph shows that
the W component of the translational velocity is dominant over the U and V components,
which is correct for a vehicle that overtakes the observer vehicle and does not change lanes;
the two vehicles are moving on parallel courses.
Figure 2.14 shows frames 1, 26 and 47 of another 48-frame sequence, taken in Haifa,
Israel; as well as the corresponding normal flow. Figure 2.15 shows U Zc−1 , V Zc−1 , and W Zc−1
graphs for the left (overtaking) and central vehicles. One can see that the graphs differ
mainly in the values of the W component, since the relative speed of approach for the left
vehicle is greater than that for the central one. The U and V components are relatively
small; all three vehicles are moving in the same direction.
2.5 Related Work
2.5.1 Understanding Object Motion
Some work has been done on understanding object motion, but this work has almost always
assumed a stationary viewpoint. Understanding object motion is based on extracting the
object’s motion parameters from an image sequence. Broida and Chellappa [11] proposed
a framework for motion estimation of a vehicle using Kalman filtering. Weng et al. [125]
assumed an object that possesses an axis of symmetry, and a constant angular momentum
model which constrained the motion over a local frame subsequence to be a superposition of
precession and translation. The trajectory of the center of rotation can be approximated by
a vector polynomial. Changing the parameters of the model with time allows adaptation to
long-term changes in the motion characteristics.
In [38] Duric et al. tried to understand the motions of objects such as tools and vehicles,
based on the fact that the natural axes of the object tend to remain aligned with the local
trihedron defined by the object's trajectory. Based on this observation they used the Frenet-Serret motion model, and showed that knowing how the Frenet-Serret frame is changing
relative to the observer gives essential information for understanding the object’s motion.
Figure 2.6: Stabilization results for one frame from each of three sequences: (a) Input frame.
(b) Normal flow. (c) Rotational normal flow. (d) Translational normal flow
Our present work is a continuation of this work in a more realistic and complicated scenario,
in which the camera is also moving.
2.5.2 Independent Motion Detection
Much work has been done on detection of independently moving objects by a moving observer. However, the work has been related to detection, classification, and tracking of the
motion, and has not paid much attention to motion estimation. Clarke and Zisserman in
[25] addressed the problem of independently moving rigid object detection, assuming that
all motions (including the camera motion) are pure translations. The idea is to track a set
of feature points using correlation and movement assumptions derived from the previous
frames, and, based on the feature point correspondences, determine the epipole (FOE) for
the background points as the intersection of the features’ image plane trajectories. The
assumption is that a majority of the feature points are background points. Moving objects
are found by fitting an epipole to those feature matches that are not consistent with the
background epipole. The image plane extent of the moving object is defined by the smallest
rectangle enclosing these features. The instability of the camera introduces strong rotational
Figure 2.7: Identification of distant image points in frames from two sequences.
components into the relative motion; these are not dealt with in [25].
Torr and Murray in [116] used statistical methods to detect non-rigid motion. A five-dimensional space of image pixels and spatio-temporal intensity gradients is fit to an affine
transformation model in a least squares sense, while identifying the outliers to the fit using
statistical diagnostics. The outliers are spatially clustered to form sets of pixels representing
independently moving objects. The assumption again is that the majority of the pixels come
from background points and their movement is well approximated by a linear or affine vector
field, i.e. the distances to the background points are large compared to the variations in these
distances.
Nelson in [82] suggested two qualitative methods for motion detection. The first uses
knowledge about observer motion to detect independently moving objects by looking for
points whose projected velocity behaviors are not consistent with the constraints imposed
by the observer’s motion. The second method is used to detect so-called “animate motion”,
which can be found by looking for violations of the motion field smoothness over time. This
is valid in cases where the observer motion changes slowly, while the apparent motion of the
moving object changes rapidly.
Irani et al. in [62] used temporal integration to construct a dynamic internal representation image of the tracked object. It is assumed that the motion can be approximated by 2D
parametric transformations in the image plane. Given a pair of images, a dominant motion
is first computed and the corresponding object is excluded from the region of analysis. This
process is then repeated and other objects are found. To track the objects in a long image
sequence, integrated images registered with respect to the tracked motion are used.
Sharma and Aloimonos in [106] demonstrated a method for detecting independent motions of compact objects by an active observer whose motion can be constrained. Fejes and
Davis in [44] used constraints on low-dimensional projected components of flow fields to detect independent motion. They implemented a recursive filter to extract directional motion
parameters. Independent motion detection was done by a combination of repeated-median-based line-fitting and one-dimensional search. The method was demonstrated on scenes with
large amounts of clutter.
Figure 2.8: Frames 0, 30, and 60 of a sequence showing two vehicles accelerating. The normal
flow results are shown below the corresponding image frames.
2.5.3 Vehicle Detection and Tracking
In previous work on vehicle detection and tracking by a moving observer, the detection
made use of model-based object recognition in single frames. Gil et al. [51] combined
multiple motion estimators for vehicle tracking. Vehicle detection was performed using
two features: the bounding rectangle of the moving vehicle, where the convex hull of the
vehicle is computed for every frame and then translated according to the predicted motion
parameters, and an updated 2-D pattern (gray-level mask) based on optimization of the
correlation between the pattern and the image using the motion parameters. These results
were obtained using a stationary camera mounted above a highway under different road and
illumination conditions.
Betke et al. [6] developed a real-time system for detection and tracking of multiple vehicles in a frame sequence taken on a highway from a moving vehicle. The system distinguishes
between distant and passing vehicle detection. In the case of a passing vehicle the recognition
is performed by detecting large brightness differences over small numbers of frames. 2-D car
models are used to create a gray-scale template of the detected vehicle for future tracking.
Distant vehicles are detected by analyzing prominent horizontal and vertical edges. A square
Figure 2.9: Motion analysis results for the acceleration sequence. U , V , W are the scaled
(by an unknown distance $Z_c^{-1}$) components of the relative translational velocity.
region bounded by such edges, which is strongly enough correlated with a vehicle template,
is recognized as a vehicle. For each newly recognized vehicle a separate tracking process
is allocated by a real-time operating system, which tracks the vehicle until it disappears
and makes sure that no other process tracks the same vehicle. When one vehicle occludes
another, one of the tracking processes terminates and the other tracks the occlusion region
as a single moving object.
2.5.4 Road Detection and Shape Analysis
Our work also involved detection of the road markings; in doing this we made no significant
use of motion models. There has been considerable work on road boundary detection in
images obtained by a vehicle driving on the road. A typical detection procedure involves
two steps. First the road is detected in a single frame (or in a small number of frames), and
then the road boundaries are tracked in a long sequence of frames.
Schneiderman and Nashman in [100] left the first step to a human operator and gave a
solution for the second step alone. Their algorithm is based on road markings. In each frame,
edge points are extracted and grouped into clusters representing individual lane markers. The
lane marker models built in the first step are updated using the information obtained from
successive frames. The markers are modeled by second-order polynomials in the image plane.
Tracking is exploited in the sense that edge points are associated with markers based on a
road model formed from analysis of the previous frames. Spatial proximity and gradient
direction are used as clues for clustering. The marker models are updated to satisfy a least
squares criterion of optimality.
Broggi in [9]-[10] presented a parallel real-time algorithm for road boundary detection.
Figure 2.10: Frames 5, 15, 25, and 35 of the van sequence. The normal flow results are
shown below the corresponding image frames.
First the image undergoes an inverse perspective transformation and then road markings are
detected using simple morphological filters that work in parallel on different image areas.
The inverse perspective procedure is based on a priori knowledge of the imaging process
(camera position, orientation, optics, etc.).
Richter and Wetzel in [94] used region segmentation for road surface recognition. A road
model is adjusted to the segmentation results collected from several frames. The adjusted
model is then used for model-based segmentation and tracking. A special lookup mechanism
is used to overcome the problem of obstacles and shadows that may cause the segmentation
algorithm to fail to detect the road regions.
A sophisticated road model that takes into account both horizontal and vertical curvature
was suggested by Dickmanns in [34]. A differential-geometric road representation and a
spatio-temporal driving model are exploited to find the parameters of the road and the
vehicle.
2.6 Conclusions
Understanding the motions of vehicles from images taken by a moving observer requires a
mathematical formulation of the relationships between the observer’s motion and the image
motion field, as well as a model for the other vehicles’ trajectories and their contributions to
the image motion field. In this paper a constant relationship between each vehicle’s frame
and the observer’s frame is assumed. The use of the Darboux frame provides a vocabulary
appropriate for describing long motion sequences.
We have derived equations for understanding the relative motions of vehicles in traffic
Figure 2.11: Motion analysis results for the van sequence. U , V , W are the scaled (by an
unknown distance $Z_c^{-1}$) components of the relative translational velocity.
scenes from a sequence of images taken by a moving vehicle. We use the Darboux motion
model for both the observing vehicle and the nearby moving vehicles. Using a full perspective
imaging model we stabilize the observer sequence so that our model for the observed vehicles’
motions can be applied. Using the weak perspective approximation we analyze the nearby
vehicles’ motions and apply this analysis to long image sequences. Expanding our analysis
to various classes of traffic events [70], and to articulated vehicles, is a direction for future
research.
Figure 2.12: Frames 1, 14 and 26 of the Italian sequence. The normal flow results are shown
below the corresponding image frames.
Figure 2.13: Motion analysis results for the Italian sequence. U , V , W are the scaled (by an
unknown distance $Z_c^{-1}$) components of the relative translational velocity.
Figure 2.14: Frames 1, 25 and 48 of the Haifa sequence. The normal flow results are shown below the corresponding image frames.
Figure 2.15: Motion analysis results for the Haifa sequence. U, V, W are the scaled (by an unknown distance $Z_c^{-1}$) components of the relative translational velocity.
Chapter 3
Non-Rigid Motion - Tracking
3.1 Introduction
It is clear that one of the major cues for motion based non-rigid object classification can be
obtained by analyzing moving body deformations. Hence, the accuracy of the segmentation
and tracking algorithm in finding the target outline is crucial for the quality of the final
result. This rules out a number of available or easy-to-build tracking systems that provide
only a center of mass or a bounding box around the target and calls for more precise and
usually more sophisticated solutions.
In this work we decided to use the geodesic active contour approach [17], where the
segmentation and tracking problem is expressed as a geometric energy minimization.
3.2 Snakes and Active Contours
An important problem in image analysis is object segmentation. It involves the isolation of
a single object from the rest of the image that may include other objects and a background.
Here, we focus on boundary detection of one or several objects by a dynamic model known
as the ‘geodesic active contour’ introduced in [16, 17, 18, 19], see also [68, 104].
Geodesic active contours were introduced as a geometric alternative for ‘snakes’ [115, 67].
Snakes are deformable models that are based on minimizing an energy along a curve. The
curve, or snake, deforms its shape so as to minimize ‘internal’ and ‘external’ energies
along its boundary. The internal part causes the boundary curve to become smooth, while
the external part leads the curve towards the edges of the object in the image.
In [14, 78], a geometric alternative for the snake model was introduced, in which an
evolving curve was formulated by the Osher-Sethian level set method [85]. The method
works on a fixed grid, usually the image pixels grid, and automatically handles changes in
the topology of the evolving contour.
The geodesic active contour model was born later. It is both a geometric model and an energy functional minimization. In [16, 17], it was shown that the geodesic active contour
model is related to the classical snake model. Actually, a simplified snake model yields the
same result as that of a geodesic active contour model, up to an arbitrary constant that
depends on the initial parameterization. Unknown constants are an undesirable property in
most automated models.
Although the geodesic active contour model has many advantages over the snake, its
main drawback is its non-linearity that results in inefficient implementations. For example,
explicit Euler schemes for the geodesic active contour limit the numerical step for stability.
In order to overcome these speed limitations, a multi-resolution approach was used in [123],
and additional heuristic steps were applied in [86], like computationally preferring areas of
high energy.
In this paper we introduce a new method that maintains the numerical consistency and
makes the geodesic active contour model computationally efficient. The efficiency is achieved
by cancelling the limitation on the time step in the numerical scheme, by limiting the computations to a narrow band around the active contour, and by applying an efficient
re-initialization technique.
3.3 From snakes to geodesic active contours
Snakes were introduced in [67, 115] as an active contour model for boundary segmentation. The model is derived by a variational principle from a non-geometric measure. The
model starts from an energy functional that includes ‘internal’ and ‘external’ terms that are
integrated along a curve.
Let the curve C(p) = {x(p), y(p)}, where p ∈ [0, 1] is an arbitrary parameterization. The snake model is defined by the energy functional
$$S[C] = \int_0^1 \left( |C_p|^2 + \alpha |C_{pp}|^2 + 2\beta g(C) \right) dp,$$
where $C_p \equiv \{\partial_p x(p), \partial_p y(p)\}$, and α and β are positive constants.
The last term represents an external energy, where g() is a positive edge indicator function that depends on the image I(x, y); it gets small values along the edges and higher values elsewhere, for example g(x, y) = 1/(|∇I|² + 1). Taking the variational derivative with respect to the curve, δS[C]/δC, we obtain the Euler-Lagrange equations
$$-C_{pp} + \alpha C_{pppp} + \beta \nabla g = 0.$$
One may start with a curve that is close to a significant local minimum of S[C], and use the Euler-Lagrange equations as a gradient descent process that leads the curve to its proper position. Formally, we add a time variable t, and write the gradient descent process as $\partial_t C = -\delta S[C]/\delta C$, or explicitly
$$\frac{dC}{dt} = C_{pp} - \alpha C_{pppp} - \beta \nabla g.$$
The snake model is a linear model, and thus an efficient and powerful tool for object segmentation and edge integration, especially when there is a rough approximation of the boundary
location. There is however an undesirable property that characterizes this model. It depends
on the parameterization. The model is not geometric.
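For concreteness, the following sketch computes the edge indicator g(x, y) = 1/(|∇I|² + 1) mentioned above for a gray level image; the Gaussian pre-smoothing and its scale sigma are assumptions of this sketch, not part of the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def edge_indicator(image, sigma=1.0):
    """Edge indicator g = 1 / (|grad I|^2 + 1) for a gray-level image.

    The image is smoothed with a Gaussian of scale `sigma` (an assumed
    pre-processing step) before the gradient is estimated.
    """
    smoothed = gaussian_filter(image.astype(float), sigma)
    gy, gx = np.gradient(smoothed)          # derivatives along rows and columns
    grad_sq = gx ** 2 + gy ** 2
    return 1.0 / (grad_sq + 1.0)            # close to 0 on edges, close to 1 in flat areas
```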
Motivated by the theory of curve evolution, Caselles et al. [14] and Malladi et al. [78] introduced a geometric flow that includes internal and external geometric measures. Given an initial curve $C_0$, the geometric flow is given by the planar curve evolution equation
$$C_t = g(C)(\kappa - v)\vec N,$$
where $\vec N$ is the normal to the curve, $\kappa\vec N$ is the curvature vector, v is an arbitrary constant, and g(), as before, is an edge indication scalar function. This is a geometric flow, that is, free of the parameterization. Yet, as long as g does not vanish along the boundary, the curve continues its propagation and may skip its desired location. One remedy, proposed in [78], is a control procedure that monitors the propagation and sets g to zero as the curve gets closer to the edge.
The geodesic active contour model was introduced in [16, 17, 18, 19], see also [68, 104],
as a geometric alternative for the snakes. The model is derived from a geometric functional,
where the arbitrary parameter p is replaced with a Euclidean arclength ds = |C_p| dp. The functional reads
$$S[C] = \int_0^1 \left( \alpha + \tilde g(C) \right) |C_p|\, dp.$$
It may be shown to be equivalent to the arclength parameterized functional
$$S[C] = \int_0^{L(C)} \tilde g(C)\, ds + \alpha L(C),$$
where L(C) is the total Euclidean length of the curve. One may equivalently define $g(x, y) = \tilde g(x, y) + \alpha$, in which case
$$S[C] = \int_0^{L(C)} g(C)\, ds,$$
i.e., minimization of the modulated arclength g(C) ds. The Euler-Lagrange equations as a gradient descent process are
$$\frac{dC}{dt} = \left( g(C)\kappa - \langle \nabla g, \vec N \rangle \right) \vec N.$$
Again, internal and external forces are coupled together, yet this time in a way that leads towards a meaningful minimum, which is the minimum of the functional.^a One may add an additional force that comes from an area minimization term, motivated by the balloon force [26]. This way, the contour may be directed to propagate inwards by minimization of the interior. The functional with the additional area term modulated by an edge indicator function reads
$$S[C] = \int_0^{L(C)} g(C)\, ds + \alpha \int_\Omega g\, da,$$
where da is an area element and Ω is the interior of the region enclosed by the contour C. The Euler-Lagrange equation as steepest descent, following the development in [130, 129], is
$$\frac{dC}{dt} = \left( g(C)\kappa - \langle \nabla g, \vec N \rangle - \alpha g(C) \right) \vec N.$$
^a An early version of a geometric-variational model, in which $S[C] = \int g(C)\, ds \,/ \int ds$, that deals with open curves was proposed in [46].
The connection between classical snakes, and the geodesic active contour model was
established in [17] via Maupertuis’ Principle of least action [35]. By Fermat’s Principle, the
final geodesic active contours are geodesics in an isotropic non-homogeneous medium.
Recent applications of the geodesic active contours include 3D shape from multiple views,
also known as shape from stereo [43], segmentation in 3D movies [76], tracking in 2D movies
[86], and refinement of efficient segmentation in 3D medical images [77]. The curve propagation equation is just part of the whole model. Subsequently, the geometric evolution is
implemented by the Osher-Sethian level set method [85].
3.3.1 Level set method
The Osher-Sethian [85] level set method considers evolving fronts in an implicit form. It is
a numerical method that works on a fixed coordinate system and takes care of topological
changes of the evolving interface.
Consider the general geometric planar curve evolution
$$\frac{dC}{dt} = V \vec N,$$
where V is any intrinsic quantity, i.e., V does not depend on a specific choice of parameterization. Now, let $\phi(x, y) : \mathbb{R}^2 \to \mathbb{R}$ be an implicit representation of C, such that C = {(x, y) : ϕ(x, y) = 0}. One example is a distance function from C defined over the coordinate plane, with negative sign in the interior and positive sign in the exterior of the closed curve.
The evolution for ϕ such that its zero set tracks the evolving contour is given by
$$\frac{d\phi}{dt} = V |\nabla\phi|.$$
This relation is easily proven by applying the chain rule and using the fact that the normal of any level set, ϕ = constant, is given by the gradient of ϕ:
$$\frac{d\phi}{dt} = \langle \nabla\phi, C_t \rangle = \langle \nabla\phi, V\vec N \rangle = V \left\langle \nabla\phi, \frac{\nabla\phi}{|\nabla\phi|} \right\rangle = V |\nabla\phi|.$$
This formulation enables us to implement the curve evolution on the fixed x, y coordinate system. It automatically handles topological changes of the evolving curve: the zero level set may split from a single simply connected curve into two separate curves.
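As a minimal illustration of the level set formulation (not the AOS scheme developed below), the following sketch performs one explicit update of dϕ/dt = V|∇ϕ| with V chosen as the curvature κ, i.e. the curvature flow used later in the experiments; the time step dt is assumed small enough for stability.

```python
import numpy as np

def curvature_flow_step(phi, dt=0.1, eps=1e-8):
    """One explicit step of d(phi)/dt = kappa * |grad(phi)| on a 2D grid.

    kappa = div(grad(phi) / |grad(phi)|) is the curvature of the level sets.
    """
    phi_y, phi_x = np.gradient(phi)
    norm = np.sqrt(phi_x ** 2 + phi_y ** 2) + eps
    # divergence of the normalized gradient field = curvature of the level sets
    nyy, _ = np.gradient(phi_y / norm)      # d/dy of the y-component
    _, nxx = np.gradient(phi_x / norm)      # d/dx of the x-component
    kappa = nxx + nyy
    return phi + dt * kappa * norm
```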
Specifically, the corresponding geodesic active contour model written in its level set formulation is given by
$$\frac{d\phi}{dt} = \operatorname{div}\left( g(x, y) \frac{\nabla\phi}{|\nabla\phi|} \right) |\nabla\phi|.$$
Including a weighted area minimization term that yields a constant velocity, modulated by the edge indication function, we have
$$\frac{d\phi}{dt} = \left( \alpha g(x, y) + \operatorname{div}\left( g(x, y) \frac{\nabla\phi}{|\nabla\phi|} \right) \right) |\nabla\phi|.$$
We have yet to determine a numerical scheme and an appropriate edge indication function
g. An explicit Euler scheme with forward time derivative, introduces a numerical limitation
on the time step needed for stability. Moreover, the whole domain needs to be updated
each step, which is a time consuming operation for a sequential computer. The narrow band
approach overcomes the last difficulty by limiting the computations to a narrow strip around
the zero set. First suggested by Chopp [24] in the context of the level set method, and later developed in [1], the narrow band idea limits the computation to a tight strip of a few grid
points around the zero set. The rest of the domain serves only as a sign holder. As the
curve evolves, the narrow band changes its shape and serves as a dynamic numerical support
around the location of the zero level set.
3.3.2 The AOS scheme
Additive operator splitting (AOS) schemes were introduced by Weickert et al. [124] as an
unconditionally stable numerical scheme for non-linear diffusion in image processing. Let us
briefly review its main ingredients and adapt it to our model.
The original AOS model deals with the Perona-Malik [88] non-linear image evolution equation of the form $\partial_t u = \operatorname{div}(g(|\nabla u|)\nabla u)$, with the image given as the initial condition, $u(0) = u_0$. Let us re-write explicitly the right hand side of the evolution equation:
$$\operatorname{div}(g(|\nabla u|)\nabla u) = \sum_{l=1}^{m} \partial_{x_l} \left( g(|\nabla u|)\, \partial_{x_l} u \right),$$
where l is an index running over the m dimensions of the problem, e.g., for a 2D image m = 2, $x_1 = x$, and $x_2 = y$.
As a first step towards discretization, consider the operator
$$A_l(u^k) = \partial_{x_l} \left( g(|\nabla u^k|)\, \partial_{x_l} \right),$$
where the superscript k indicates the iteration number, e.g., $u^0 = u_0$. We can write the explicit scheme
$$u^{k+1} = \left[ I + \tau \sum_{l=1}^{m} A_l(u^k) \right] u^k,$$
where τ is the numerical time step. It requires an upper limit for τ if one desires to establish convergence to a stable steady state. Next, the semi-implicit scheme
$$u^{k+1} = \left[ I - \tau \sum_{l=1}^{m} A_l(u^k) \right]^{-1} u^k$$
is unconditionally stable, yet inverting the large bandwidth matrix is a computationally expensive operation.
Finally, the consistent, first order, semi-implicit, additive operator splitting scheme
$$u^{k+1} = \frac{1}{m} \sum_{l=1}^{m} \left[ I - m\tau A_l(u^k) \right]^{-1} u^k$$
may be applied to efficiently solve the non-linear diffusion.
The AOS semi-implicit scheme in 2D is then given by a linear tridiagonal system of equations
$$u^{k+1} = \frac{1}{2} \sum_{l=1}^{2} \left[ I - 2\tau A_l(u^k) \right]^{-1} u^k, \qquad (3.1)$$
where $A_l(u^k)$ is a matrix corresponding to derivatives along the l-th coordinate axis. It can be efficiently solved for $u^{k+1}$ by the Thomas algorithm, see [124].
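A generic sketch of the Thomas algorithm for such tridiagonal systems is given below; it is not tied to the specific matrices $A_l$ and simply solves a system with sub-diagonal a, diagonal b, super-diagonal c and right hand side d in linear time.

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system in O(n): a is the sub-diagonal, b the
    diagonal, c the super-diagonal, d the right hand side (all length n;
    a[0] and c[-1] are ignored)."""
    n = len(d)
    cp = np.zeros(n)
    dp = np.zeros(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                       # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = np.zeros(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):              # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```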
In our case, the geodesic active contour model is given by
$$\partial_t \phi = \operatorname{div}\left( g(|\nabla u_0|) \frac{\nabla\phi}{|\nabla\phi|} \right) |\nabla\phi|,$$
where $u_0$ is the image and ϕ is the implicit representation of the curve. Since our interest is only in the zero level set of ϕ, we can reset ϕ to be a distance function every numerical iteration. One nice property of distance maps is their unit gradient magnitude almost everywhere. Thereby, the short term evolution for the geodesic active contour given by a distance map, with |∇ϕ| = 1, is
$$\partial_t \phi = \operatorname{div}\left( g(|\nabla u_0|)\, \nabla\phi \right).$$
Note that now $A_l(\phi^k) = A_l(u_0)$, which means that the matrices $[I - 2\tau A_l(u_0)]^{-1}$ can be computed once for the whole image. Yet, we need to keep the ϕ function as a distance map. This is done through re-initialization by Sethian's fast marching method every iteration.
It is now simple to introduce a weighted area ‘balloon’-like force into the scheme. The resulting AOS scheme with the ‘balloon’ then reads
$$\phi^{k+1} = \frac{1}{2} \sum_{l=1}^{2} \left[ I - 2\tau A_l(u_0) \right]^{-1} \left( \phi^k + \tau \alpha g(u_0) \right), \qquad (3.2)$$
where α is the weighted area/balloon coefficient.
In order to reduce the computational cost we use a multi-scale approach [73]. We construct a Gaussian pyramid of the original image. The algorithm is first applied at the lower
resolution. Next, the zero set is embedded at a higher resolution and the ϕ distance function
is computed. Moreover, the computations are performed only within a limited narrow band
around the zero set. The narrow band automatically modifies its shape as we re-initialize the distance map.
3.3.3 Re-initialization by the fast marching method
In order to maintain sub-grid accuracy, we detect the zero level set curve with sub-pixel
accuracy. We apply a linear interpolation in the four pixel cells in which ϕ changes its sign.
The grid points with the exact distance to the zero level set are then used to initialize the
fast marching method.
Sethian’s fast marching method [103, 102] is a computationally optimal numerical method
for distance computation on rectangular grids. The method keeps a front of updated points
sorted in a heap structure, and constructs a numerical solution iteratively, by fixing the
smallest element at the top of the heap and expanding the solution to its neighboring grid
points. This method enjoys a computational complexity bound of O(N log N ), where N is
the number of grid points in the narrow band. See also [22, 118], where consistent O(N log N )
schemes are used to compute distance maps on rectangular grids.
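As a rough, easy-to-run stand-in for this re-initialization step, the sketch below rebuilds a signed distance map from the sign of ϕ with a Euclidean distance transform; this is a different algorithm than fast marching and ignores the sub-pixel initialization described above, so it only illustrates the purpose of the step.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def reinitialize(phi):
    """Reset phi to an approximate signed distance map of its zero level set:
    negative inside (phi < 0), positive outside, as in the text."""
    inside = phi < 0
    dist_out = distance_transform_edt(~inside)   # distance of outside pixels to the interior
    dist_in = distance_transform_edt(inside)     # distance of inside pixels to the exterior
    return dist_out - dist_in
```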
3.4 Edge indicator functions for color and movies
What is a proper edge indicator for color images? Several generalizations for the gradient
magnitude of gray level images were proposed, see e.g. [33, 98, 113]. In [86] Paragios and Deriche introduced a probability based edge indicator function for movies. In this paper we
have chosen the geometric philosophy to extract an edge indicator. We consider a measure
suggested by the Beltrami framework in [113], to construct an edge indicator function.
3.4.1 Edges in Color
According to the Beltrami framework, a color image is considered as a two dimensional
surface in the five dimensional spatial-spectral space. The metric tensor is used to measure
distances on the image manifold. The magnitude of this tensor is an area element of the
color image surface, which can be considered as a generalization of the gradient magnitude.
Formally, the metric tensor of the two-dimensional image, given by the two-dimensional surface {x, y, R(x, y), G(x, y), B(x, y)} in the {x, y, R, G, B} space, is
$$(g_{ij}) = \begin{pmatrix} 1 + R_x^2 + G_x^2 + B_x^2 & R_x R_y + G_x G_y + B_x B_y \\ R_x R_y + G_x G_y + B_x B_y & 1 + R_y^2 + G_y^2 + B_y^2 \end{pmatrix},$$
where $R_x \equiv \partial_x R$. Our edge indicator is the largest eigenvalue of the structure tensor metric. It is the eigenvalue in the direction of maximal change in $dR^2 + dG^2 + dB^2$, and it reads
$$\lambda = 1 + \frac{1}{2}\sum_i |\nabla u^i|^2 + \sqrt{\left( \frac{1}{2}\sum_i |\nabla u^i|^2 \right)^2 - \frac{1}{2}\sum_i \sum_j |\nabla u^i \times \nabla u^j|^2}, \qquad (3.3)$$
where $u^1 = R$, $u^2 = G$, $u^3 = B$. Then, the edge indicator function g is given by a decreasing function of λ, e.g., $g = (1 + \lambda^2)^{-1}$.
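A direct sketch of Equation (3.3) for an RGB image follows; gradients are estimated with simple finite differences, and the cross product of two planar gradients is taken as the scalar z-component.

```python
import numpy as np

def color_edge_indicator(rgb):
    """Largest eigenvalue lambda of the color metric (Eq. 3.3) and the edge
    indicator g = 1 / (1 + lambda^2) for an RGB image of shape (H, W, 3)."""
    channels = [rgb[..., i].astype(float) for i in range(3)]
    grads = [np.gradient(ch) for ch in channels]       # (d/dy, d/dx) per channel
    sum_sq = sum(gy ** 2 + gx ** 2 for gy, gx in grads)
    cross_sq = np.zeros_like(sum_sq)
    for i in range(3):
        for j in range(3):
            gyi, gxi = grads[i]
            gyj, gxj = grads[j]
            cross = gxi * gyj - gyi * gxj               # z-component of grad u^i x grad u^j
            cross_sq += cross ** 2
    lam = 1.0 + 0.5 * sum_sq + np.sqrt(np.maximum(
        (0.5 * sum_sq) ** 2 - 0.5 * cross_sq, 0.0))     # guard against numerical negatives
    return 1.0 / (1.0 + lam ** 2)
```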
3.4.2 Tracking objects in movies
Let us explore two possibilities to track objects in movies. The first considers the whole movie volume as a Riemannian space, as done in [19]. In this case the active contour becomes an active surface. The AOS scheme in the spatial-temporal 3D hybrid space is
$$\phi^{k+1} = \frac{1}{3} \sum_{l} \left[ I - 3\tau A_l(u_0) \right]^{-1} \phi^k,$$
where $A_l(u_0)$ is a matrix corresponding to derivatives along the l-th coordinate axis, where now $l \in \{x, y, T\}$.
The edge indicator function is again derived from the Beltrami framework, where for color movies we pull back the metric
$$(g_{ij}) = \begin{pmatrix} 1 + R_x^2 + G_x^2 + B_x^2 & R_x R_y + G_x G_y + B_x B_y & R_x R_T + G_x G_T + B_x B_T \\ \cdot & 1 + R_y^2 + G_y^2 + B_y^2 & R_y R_T + G_y G_T + B_y B_T \\ \cdot & \cdot & 1 + R_T^2 + G_T^2 + B_T^2 \end{pmatrix}, \qquad (3.4)$$
which is the metric of a 3D volume in the 6D {x, y, T, R, G, B} spatial-temporal-spectral space. Now we have $\sqrt{\det(g_{ij})}\, dx\, dy\, dT$ as a volume element of the image and, again, the largest eigenvalue of the structure tensor metric, λ, can be used as an edge indicator. Intuitively, the larger λ gets, the smaller the spatial-temporal steps one should apply in order to cover the same volume.
A different approach uses the contour location in frame n as an initial condition for the
2D solution in frame n + 1, see e.g. [15, 86]. The above edge indicator is still valid in this
case. Note that the aspect ratios between the time, the image space, and the intensity should be determined according to the application.
The first approach was found to yield accurate results in off-line tracking analysis, while the second approach gives up some of the accuracy, achieved by temporal smoothing in the first approach, for efficiency in real-time tracking.
3.5 Implementation details
There are some implementation considerations one should be aware of. For example, the
summation over the two dimensions in Equations (3.1, 3.2) should be done in such a way that the matrices $A_1(u_0)$ and $A_2(u_0)$, corresponding to the derivatives along the x and y axes respectively, are tridiagonal. This is achieved by spanning the matrix ϕ^k once by
rows and once by columns.
Working on the whole domain is fairly straightforward. The matrix ϕk is represented as
vectors ϕki corresponding to a single row/column. The vectors [I − 2τ Ail (u0 )]−1 ϕki , where Ail
of size |ϕki | × |ϕki | is a matrix corresponding to the derivatives of a single row/column, are
computed using the Thomas algorithm and summed according to their coordinates to yield
the matrix ϕk+1 .
Using the narrow band approach is a bit more tricky. The question is: what is the most efficient representation for the narrow band that would allow us to find, for every row/column, the segments that belong to the narrow band? One may suggest using run-length encoding, but the standard static run-length implementation is not sufficient. This is due to
the fact that the narrow band is rebuilt every iteration using the fast marching algorithm,
which generates narrow band pixels in arbitrary order. Creating the run length encoding
from such a stream of pixels can be done either off-line, by first constructing and then
scanning a map of the whole image (which is clearly inefficient), or on-line using a dynamic
data structure.
In our implementation we use two arrays of linked lists, one for the rows and one for
the columns, where each linked list corresponds to one row/column and contains segments
of adjacent pixels belonging to the narrow band (see Figure 3.1). Each segment is defined
Figure 3.1: Two arrays of linked lists (one for the rows, one for the columns) used for the narrow band representation.
by its first and last pixel coordinates. Since the narrow band is generated as the fast marching algorithm grows outwards, adding a new pixel usually means just changing one of the
boundaries of an already existing segment. For reasonably simple contours the number of
times a new segment is created or an existing one is merged with another is relatively low.
Therefore, the number of segments per row/column is always small and the complexity of
adding a new pixel to the narrow band is practically O(1).
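A possible Python rendering of this data structure is sketched below; the coordinates and insertion policy are illustrative, and the case where a new pixel bridges two existing segments (which would require a merge) is omitted for brevity.

```python
class SegmentList:
    """Segments of adjacent narrow band pixels along one row (or column).

    Each segment is stored as a [first, last] pair; a pixel that touches an
    existing segment only moves its boundary, so for reasonably simple
    contours the expected cost per insertion is O(1).
    """
    def __init__(self):
        self.segments = []                      # disjoint [first, last] pairs

    def add(self, p):
        for seg in self.segments:
            if seg[0] - 1 <= p <= seg[1] + 1:   # extends (or lies inside) this segment
                seg[0] = min(seg[0], p)
                seg[1] = max(seg[1], p)
                return
        self.segments.append([p, p])            # start a new segment

# one list per row and one per column, as in Figure 3.1
rows = [SegmentList() for _ in range(256)]
cols = [SegmentList() for _ in range(256)]
for y, x in [(10, 12), (10, 13), (11, 12)]:     # narrow band pixels in arbitrary order
    rows[y].add(x)
    cols[x].add(y)
```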
The calculations are performed separately for every horizontal and vertical segment ϕki
and the vectors [I − 2τ Ail (u0 )]−1 ϕki , where Ail of size |ϕki | × |ϕki | is a matrix corresponding to
the derivatives of a single segment, are summed according to their coordinates to yield the
new level set function ϕk+1 for the narrow band pixels only.
Another issue requiring special attention is the value of the time step. If we choose a
relatively large time step, the active contour may skip over the object boundary. The time
step should thus correspond to the numerical support of the edges (edge width). This, in
turn, dictates the width of the narrow band that should be wide enough, so that the contour
would not escape it in one iteration. One way to overcome the time step limitation is to use coarse-to-fine scales of boundary smoothing, with an appropriate time step for each scale.
Finally, since the method is based on the AOS, which is a first-order approximation scheme,
the numerical error grows linearly with the time step.
3.6 Experimental Results
As a simple example, the proposed method can be used as a consistent, unconditionally
stable, and computationally efficient, numerical approximation for the curvature flow. The
Figure 3.2: Curvature flow by the proposed scheme. A non-convex curve vanishes in finite time at a circular point by Grayson's Theorem. The curve evolution is presented for two different time steps. Left: τ = 20; Right: τ = 50.
Figure 3.3: Curvature flow CPU time for the explicit scheme and the implicit AOS scheme. First, the whole domain is updated, next, the narrow band is used to increase the efficiency, and finally the AOS speeds up the whole process. For the explicit scheme the maximal time step that still maintains stability is chosen. For the AOS scheme, CPU times for several time steps are presented.
curvature flow, also known as curve shortening flow or geometric heat equation, is a well
studied equation in the theory of curve evolution. It is proven to bring every simple closed
curve into a circular point in finite time [48, 57]. Figure 3.2 shows an application of the
proposed method for a curve evolving by its curvature and vanishes at a point. One can see
how the number of iterations needed for the curve to converge to a point decreases as the
time step is increased.
We tested several implementations for the curvature flow. Figure 3.3 shows the CPU time
it takes the explicit and implicit schemes to evolve a contour into a circular point. For the
explicit scheme we tested both the narrow band and the naive approach in which every grid
point is updated every iteration. The tests were performed on an Ultra SPARC 360M Hz
machine for a 256 × 256 resolution image.
It should be noted that when the narrow band approach is used, the band width should be
increased as τ grows to ensure that the curve does not escape the band in one iteration.
Figure 3.4 shows multiple objects segmentation for a static color image. Here we used a
balloon force to propagate the contour through the narrow passages between objects.
Figure 3.5 presents an example of a medical image application, where gray matter segmentation is performed on a single slice of a human head MRI image. The task is to detect
a narrow layer of the brain bounded by two interfaces - the outer cortical surface (Cerebral
Spinal Fluid(CSF)/gray matter interface) and the inner cortical surface (gray matter/white
Figure 3.4: Multiple objects segmentation in a static color image.
matter interface). The segmentation is performed by two active contours initialized inside
the white matter regions. The negative balloon force coefficient is used to expand the contour
towards the region boundary and the edge indicator functions are chosen to respond only to
the edge points corresponding to the characteristic intensity profiles of the CSF/gray matter
and the gray matter/white matter interfaces respectively. This specific medical problem
introduces new challenging difficulties and possible solutions will be reported elsewhere [53].
Figure 3.5: Gray matter segmentation in a MRI brain image. The white contour converges
to the outer cortical surface and the black contour converges to the inner cortical surface.
Figures 3.6 and 3.7 show segmentation results for color movies with difficult spatial
textures. The tracking is performed at two resolutions. At the lower resolution we search for
temporal edges and at the higher resolution we search for strong spatial edges. The contour
found in the coarse grid is used as the initial contour at the fine grid.
It is possible to compute the inverse matrices of the AOS once for the whole image, or to
invert small sub-matrices as new points enter or exit the narrow band. There is obviously a
trade-off between the two approaches. For initialization, we have chosen the first approach,
since the initial curve starts at the frame of the image and has to travel over most of the
image until it captures the moving objects. For tracking moving objects in a movie, we use the local approach, since now the curve only has to adjust itself to local changes.
Figure 3.6: Tracking a cat in a color movie by the proposed scheme. Top: Segmentation of the cat in a single frame (steps 2, 40 and 60 of the curve evolution). Bottom: Tracking the walking cat in the 50 frame sequence (frames 2, 22 and 50).
Figure 3.7: Tracking two people in a color movie. Top: curve evolution in a single frame (steps 2, 30, 48 and 70). Bottom: tracking two walking people in a 60 frame movie (frames 2, 21, 38 and 59).
3.7 Conclusions
It was shown that an integration of advanced numerical techniques yields a computationally efficient algorithm that solves a geometric segmentation model. The numerical algorithm is consistent with the underlying continuous model. It combines the narrow band level set method with additive operator splitting and the fast marching method. The proposed ‘fast geodesic
active contour’ scheme was applied successfully for image segmentation and tracking in movie
sequences and color images.
Chapter 4
Non-Rigid Objects - Motion Based
Classification
4.1 Introduction
Our goal is to utilize the motion information directly as a tool for object classification. Many examples taken from nature and from psychophysical experiments [80] show that moving objects are easier to recognize than stationary ones, and in some cases motion information alone, even at extremely low resolution as in MLDs [64], [65], is sufficient for recognition and classification.
Moreover, we argue that in many instances motion analysis is the best or the only way
to deal with object recognition when the classification categories are not shape based or in
cases when shape information is hardly obtainable.
4.1.1 General Framework
The most general approach in object classification is to represent each category by a single
prototype, that is, an average member of the category carrying the most prominent common
properties of the class, and then to compare an object in question to the prototype of each
category.
In this context two questions are to be answered in order to build a classifier. First, how
should we represent category prototypes? Second, what methods are to be used to compare
an object with the prototypes?
Since the problem of motion based classification has not been extensively explored yet,
let us take a look at the adjacent field of shape based recognition and classification and see
what are the main problems there and how they are solved.
Among the intrinsic obstacles making shape based recognition a difficult problem one can
name the variability of apparent object view due to several factors [120]: viewing position,
photometric effects, object setting and changing shape.
Let us see how these factors affect motion based classification. The instantaneous motion
of objects can be generally decomposed into three basic motions:
• motion of the center of mass of the moving object (translation)
• changes in orientation (rotation) and
• object deformation (this also includes changes in relative position of object parts)
It is clear enough that the photometric effects do not influence motion characteristics of
a moving object once they are extracted from an image sequence. Shape changes, in the case of motion based recognition, are not an obstacle but a means of classification. Unless occlusion occurs, object setting is not really a problem, since the analysis is performed not on
the whole image but only on a part of it found by the segmentation and tracking processes
which contains only the object of interest.
On the other hand, viewing position can significantly alter the characteristics of apparent
motion and in some cases (e.g. translation along the camera optical axis) even makes it hardly
observable. There are several ways to cope with this problem. The first is to use 3D motion
reconstruction, i.e. to restore 3D velocity vectors from the 2D projection of object motion
onto an image plane. This is certainly an option for rigid bodies and usually requires both a motion model and an image acquisition model to be defined. We used this approach in [37],
where two different imaging models have been defined: one for image sequence stabilization
and another for the analysis of nearby objects motion.
Another way is to keep motion prototypes for a number of views and do the classification
regarding each view as a separate class. For example Yacoob and Black [127] recognize
different human walking directions (parallel, diagonal, perpendicular away, perpendicular
forward) as special classes of human motion.
Another possible approach is to find a transformation that brings the observed motion to
some canonical form stored as a prototype or, alternatively, to restore the observed motion
as a combination of several motion exemplars stored as class representatives. This is, in a sense, analogous to the alignment approach in shape based recognition, where one wishes to
compensate for the transformation separating the viewed object and the stored model and
then to compare them.
Alignment is one of the three large groups of methods used in shape based recognition.
The other two are: invariant properties methods and parts decomposition.
Invariant properties methods assume that certain properties remain invariant under the
transformations that an object is allowed to make. In this work we are looking for those
motion characteristics that are invariant to varying viewing position and motion rate.
Motion rate is an additional factor adding to the variability of observed motion. This new
factor stems from the temporal nature of the input data and does not exist in shape based
recognition. An observable motion rate can change both due to viewing distance variations
and due to the simple fact that the same object can move at different speeds.
A lot of natural, self-propelled objects exhibit periodical motion. Periodicity is a salient
feature in human perception. In nature, many behaviors of animals and humans consist of
specialized periodic motion patterns. For example, certain male birds bob their head when
courting a female, bees dance in a figure-8 pattern to signal that food is near the hive,
elephants toss-up their head when threatened, and humans vigorously shake their hand in
the air to get the attention of another [112], [122].
If so, one motion period (e.g. two steps of a walking man or one rabbit hop) can be used as a natural unit of motion, and extracted motion characteristics can be normalized by
a period duration. The period of motion is often linked to important properties that may
otherwise be difficult to determine, e.g. heart-rate to activity level or period of locomotion
to velocity. This gives rise to another set of problems: detection and characterization of
periodic activities.
In this chapter we present a classification approach based on the analysis of the changing appearance of the moving bodies. The framework is based on segmentation and tracking by the geodesic active contour approach, yielding high accuracy results that allow the periodicity analysis to be performed on the global characteristics of the extracted
deformable object contours. The normalized image sequences are parameterized using the
PCA approach and classified by the k-nearest-neighbor algorithm. The experiments demonstrate the ability of the proposed scheme to distinguish between various object classes by
their static appearance and dynamic behavior.
4.2 Futurism and Periodic Motion Analysis
Futurism is a movement in art, music, and literature that began in Italy around 1909, marked especially by an effort to give formal expression to the dynamic energy and
movement of mechanical processes. A typical example is the ‘Dynamism of a Dog on a
Leash’ by Giacomo Balla, who lived during the years 1871-1958 in Italy, see Figure 4.1 [5].
In this painting one could see how the artist captures in one still image the periodic walking
motion of a dog on a leash. Following Futurism, we show how periodic motions can be
represented by a small number of eigen-shapes that capture the whole dynamic mechanism
of periodic motions. Singular value decomposition of the object silhouettes serves as a basis for behavior classification by principal component analysis. Figure 4.2 presents a running horse video sequence and its eigen-shape decomposition. One can see the similarity
between the first eigen-shapes - Figure 4.2(c,d), and another futurism style painting ”The
Red Horseman” by Carlo Carra [5] - Figure 4.2(e). The boundary contour of the moving
non-rigid object is computed efficiently and accurately by the fast geodesic active contours
[53]. After normalization, the implicit representation of a sequence of silhouette contours
given by their corresponding binary images, is used for generating eigen-shapes for the given
motion. Singular value decomposition produces the eigen-shapes that are used to analyze
the sequence. We show examples of object as well as behavior classification based on the
eigen-decomposition of the sequence.
4.3 Related Work
Motion based recognition has received a lot of attention in the last several years. This is due to the general recognition of the fact that the direct use of temporal data may significantly improve our ability to solve a number of basic computer vision problems, such as image segmentation, tracking, and object classification, as well as to the availability of low cost computer systems powerful enough to process large amounts of data.
In general, when analyzing a moving object, one can use two main sources of information
to rely upon: changes of the moving object position (and orientation) in space, and object
Figure 4.1: ‘Dynamism of a Dog on a Leash’ 1912, by Giacomo Balla. Albright-Knox Art
Gallery, Buffalo.
deformations.
Object position is an easily obtained characteristic, applicable to both rigid and non-rigid bodies, that is provided by most target detection and tracking systems, usually as the center of the target bounding box. A number of techniques [56], [55], [42], [110] were
proposed for the detection of motion events and for the recognition of various types of
motions based on the analysis of the moving object trajectory and its derivatives. Detecting
object orientation is a more challenging problem which is usually solved by fitting a model
that may vary from a simple ellipsoid [110] to a complex 3D vehicle model [69] or a specific
aircraft-class model adapted for noisy radar images as in [29].
While the orientation characteristic is more applicable to rigid objects, it is object deformation that contains the most essential information about the nature of non-rigid body motion. This is especially true for natural non-rigid objects in locomotion that exhibit substantial changes in their apparent view, as in this case the motion itself is caused by these deformations, e.g. walking, running, hopping, crawling, flying, etc.
There exists a large number of papers dealing with the classification of moving non-rigid
objects and their motions based on their appearance. Lipton et al. describe a method for moving target classification based on their static appearance [74] and using skeletonization [47]. Polana and Nelson [93] used local motion statistics computed for image grid cells
to classify various types of activities. An original approach using the temporal templates
and motion history images (MHI) for action representation and classification was suggested
by Davis and Bobick in [7]. Cutler and Davis [32] describe a system for real-time moving
object classification based on periodicity analysis. It would be impossible to describe here
the whole spectrum of papers published in this field and we refer the reader to the following
surveys [20], [49] and [81].
The work most closely related to our approach is that of Yacoob and Black [128], where different types of human activities were recognized using a parameterized representation of measurements collected during one motion period. The measurements were eight motion parameters
tracked for five body parts (arm, torso, thigh, calf and foot).
Figure 4.2: (a) running horse video sequence, (b) first 10 eigen-shapes, (c,d) first and second eigen-shapes enlarged, (e) ‘The Red Horseman’, 1914, by Carlo Carra, Civico Museo d’Arte Contemporanea, Milan.
In this paper we concentrate on the analysis of the deformations of moving non-rigid
bodies in an attempt to extract characteristics that allow us to distinguish between different
types of motions and different classes of objects.
4.4 Our Approach
Our basic assumption is that for any given class of moving objects, like humans, dogs,
cats, and birds, the apparent object view in every phase of its motion can be encoded
as a combination of several basic body views or configurations. Assuming that a living
creature exhibits a pseudo-periodic motion, one motion period can be used as a comparable
information unit. Then, by extracting the basic views from a large training set and projecting
onto them the observed sequence of object views collected from one motion period, we obtain
a parameterized representation of the object’s motion that can be used for classification.
Unlike [128] we do not assume an initial segmentation of the body into parts and do not
explicitly measure the motion parameters. Instead, we work with the changing apparent
view of deformable objects and use the parameterization induced by their form variability.
In what follows we describe the main steps of the process that include,
• Segmentation and tracking of the moving object that yield an accurate external object
boundary in every frame.
• Periodicity analysis, in which we estimate the frequency of the pseudo-periodic motion
and split the video sequence into single-period intervals.
• Frame sequence alignment that brings the single-period sequences above to a standardized form by compensating for temporal shift, speed variations, different object sizes
and imaging conditions.
• Parameterization by building an eigen-shape basis from a training set of possible object
views and projecting the apparent view of a moving body onto this basis.
4.4.1 Segmentation and Tracking
As our approach is based on the analysis of deformations of the moving body, the accuracy
of the segmentation and tracking algorithm in finding the target outline is crucial for the
quality of the final result. This rules out a number of available or easy-to-build tracking
systems that provide only a center of mass or a bounding box around the target and calls
for more precise and usually more sophisticated solutions.
Therefore we decided to use the geodesic active contour approach [17] and specifically
the ‘fast geodesic active contour’ method described in [53], where the segmentation problem
is expressed as a geometric energy minimization. We search for a curve C that minimizes
the functional
$$S[C] = \int_0^{L(C)} g(C)\, ds,$$
where ds is the Euclidean arclength, L(C) is the total Euclidean length of the curve, and g is a positive edge indicator function in a 3D hybrid spatial-temporal space that depends on the pair of consecutive frames $I^{t-1}(x, y)$ and $I^t(x, y)$. It gets small values along the spatial-temporal edges, i.e. moving object boundaries, and higher values elsewhere.
In addition to the scheme described in [53], we also use the background information whenever a static background assumption is valid and a background image B(x, y) is available.
In the active contours framework this can be achieved either by modifying the g function
to reflect the edges in the difference image D(x, y) = |B(x, y) − I t (x, y)|, or by introducing
additional area integration terms to the functional S[C]:
$$S[C] = \int_0^{L(C)} g(C)\, ds + \lambda_1 \int_{inside(C)} |D(x, y) - c_1|^2\, da + \lambda_2 \int_{outside(C)} |D(x, y) - c_2|^2\, da,$$
where $\lambda_1$ and $\lambda_2$ are fixed parameters and $c_1$, $c_2$ are given by
$$c_1 = \operatorname{average}_{inside(C)} [D(x, y)], \qquad c_2 = \operatorname{average}_{outside(C)} [D(x, y)].$$
The latter approach is inspired by the ‘active contours without edges’ model proposed by
Chan and Vese [21] and forces the curve C to close on a region whose interior and exterior
have approximately uniform values in D(x, y). A different approach, which utilizes region information by coupling the motion estimation and the tracking problems, was suggested by Paragios and Deriche in [87].
Figure 4.3 shows some results of moving object segmentation and tracking using the
proposed method.
Figure 4.3: Non-rigid moving object segmentation and tracking.
Contours can be represented in various ways. Here, in order to have a unified coordinate
system and be able to apply a simple algebraic tool, we use the implicit representation of a
simple closed curve as its binary image. That is, the contour is given by an image for which
the exterior of the contour is black while the interior of the contour is white.
4.4.2 Periodicity Analysis
Here we assume that the majority of non-rigid moving objects are self-propelled living creatures whose motion is almost periodic. Thus, one motion period, like a step of a walking man or a rabbit hop, can be used as a natural unit of motion, and extracted motion characteristics can be normalized by the period size.
The problem of detection and characterization of periodic activities was addressed by several research groups and the prevailing technique for periodicity detection and measurements
is the analysis of the changing 1-D intensity signals along spatio-temporal curves associated
with a moving object or the curvature analysis of feature point trajectories [90], [75], [101],
[117]. Here we address the problem using global characteristics of motion such as moving
object contour deformations and the trajectory of the center of mass.
By running frequency analysis on such 1-D contour metrics as the contour area, velocity
of the center of mass, principal axes orientation, etc. we can detect the basic period of the
motion. Figures 4.4 and 4.5 present global motion characteristics derived from segmented
moving objects in two sequences. One can clearly observe the common dominant frequency
in all three graphs.
The period can also be estimated in a straightforward manner by looking for the frame
where the external object contour best matches the object contour in the current frame.
Figure 4.6 shows the deformations of a walking man contour during one motion period
Figure 4.4: Global motion characteristics measured for the walking man sequence: principal axis inclination angle (rad), contour area (sq. pixels), and vertical velocity of the center of mass (pixels/frame), plotted against frame number.
(step). Samples from two different steps are presented and each vertical pair of frames is
phase synchronized. One can clearly see the similarity between the corresponding contours.
An automated contour matching can be performed in a number of ways, e.g. by comparing
contour signatures or by looking at the correlation between the object silhouettes in different
frames. Figure 4.7 shows four graphs of inter-frame silhouette correlation values measured
for four different starting frames taken within one motion period. It is clearly visible that all
four graphs nearly coincide and the local maxima peaks are approximately evenly spaced.
The period, therefore, can be estimated as the average distance between the neighboring
peaks.
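One way to realize this estimate is sketched below: correlate the silhouette of a reference frame with every frame of the sequence, find the local maxima of the correlation curve, and average the spacing between them; the silhouettes are assumed to be binary arrays of equal size.

```python
import numpy as np

def estimate_period(silhouettes, ref=0):
    """Estimate the motion period (in frames) from binary silhouette images.

    Correlates the reference silhouette with every frame, detects local
    maxima of the correlation curve and returns the average peak spacing.
    """
    ref_img = silhouettes[ref].astype(float).ravel()
    corr = np.array([np.dot(ref_img, s.astype(float).ravel())
                     for s in silhouettes])
    corr = corr / corr.max()
    # indices larger than both neighbours -> local maxima of the correlation
    peaks = [i for i in range(1, len(corr) - 1)
             if corr[i] > corr[i - 1] and corr[i] > corr[i + 1]]
    if len(peaks) < 2:
        return None
    return float(np.mean(np.diff(peaks)))
```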
4.4.3 Frame Sequence Alignment
One of the most desirable features of any classification system is the invariance to a set of
possible input transformations. As the input in our case is not a static image, but a sequence
of images, the system should be robust to both spatial and temporal variations.
4.4.3.1 Spatial Alignment:
Scale invariance is achieved by cropping a square bounding box around the center of mass
of the tracked target silhouette and re-scaling it to a predefined size (see Figure 4.8).
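A sketch of this normalization step might look as follows; the silhouette is assumed to be a binary mask, and the nearest-neighbour interpolation used for the re-scaling is an assumption, since the text does not specify one.

```python
import numpy as np
from skimage.transform import resize

def spatial_align(mask, out_size=50):
    """Crop a square box around the silhouette's center of mass and re-scale
    it to out_size x out_size, as in Figure 4.8 (binary output)."""
    ys, xs = np.nonzero(mask)
    cy, cx = int(round(ys.mean())), int(round(xs.mean()))     # center of mass
    half = int(max(np.abs(ys - cy).max(), np.abs(xs - cx).max())) + 1
    padded = np.pad(mask.astype(float), half)                 # avoid clipping at borders
    box = padded[cy:cy + 2 * half, cx:cx + 2 * half]          # square box centered at (cy, cx)
    scaled = resize(box, (out_size, out_size), order=0)       # nearest-neighbour re-scaling
    return (scaled > 0.5).astype(np.uint8)
```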
One way to have orientation invariance is to keep a collection of motion samples for a wide
Figure 4.5: Global motion characteristics measured for the walking cat sequence: principal axis inclination angle (rad), contour area (sq. pixels), and vertical velocity of the center of mass (pixels/frame), plotted against frame number.
range of possible motion directions and then look for the best match. This approach was used
by Yacoob and Black in [128] to distinguish between different walking directions. Although
here we experiment only with motions nearly parallel to the image plane, the system proved
to be robust to small variations in orientation. Since we do not want to keep models for
both left-to-right and right-to-left motion directions, the right-to-left moving sequences are
converted to left-to-right by horizontal mirror flip.
4.4.3.2 Temporal Alignment:
A good estimate of the motion period allows us to compensate for motion speed variations
by re-sampling each period subsequence to a predefined duration. This can be done by interpolation between the binary silhouette images themselves or between their parameterized
representation as explained below. Figure 4.9 presents an original and re-sampled one-period
subsequence after scaling from 11 to 10 frames.
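The re-sampling can be realized, for instance, by linear interpolation between the two nearest original frames, as in the sketch below; the frames may be binary silhouettes or their parameterized representations, and the choice of linear interpolation is an assumption.

```python
import numpy as np

def resample_period(frames, out_len=10):
    """Re-sample a one-period subsequence to out_len frames by linear
    interpolation between the two nearest original frames."""
    frames = np.asarray(frames, dtype=float)          # shape (n, ...)
    n = len(frames)
    out = []
    for t in np.linspace(0, n - 1, out_len):
        i = int(np.floor(t))
        j = min(i + 1, n - 1)
        w = t - i                                     # interpolation weight
        out.append((1 - w) * frames[i] + w * frames[j])
    return np.stack(out)
```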
Temporal shift is another issue that has to be addressed in order to align the phase of the
observed one-cycle sample and the models stored in the training base. In [128] it was done
by solving a minimization problem of finding the optimal parameters of temporal scaling
and time shift transformations so that the observed sequence is best matched to the training
samples. Polana and Nelson [93] handled this problem by matching the test one-period
subsequence to a reference template at all possible temporal translations.
Assuming that in the training set all the sequences are accurately aligned, we find the
Figure 4.6: Deformations of a walking man contour during one motion period (step). Two steps, sampled at frames t, t+3, ..., t+12 and τ, τ+3, ..., τ+12 and synchronized in phase, are shown. One can see the similarity between contours in corresponding phases.
temporal shift of a test sequence by looking for the starting frame that best matches the
generalized (averaged) starting frame of the training samples, as they all look alike. Figure 4.10 shows (a) - the reference starting frame taken as an average over the temporally
aligned training set, (b) - a re-sampled single-period test sequence, and (c) - the correlation between the reference starting frame and the test sequence frames. The maximal correlation
is achieved at the seventh frame, therefore the test sequence is aligned by cyclically shifting
it 7 frames to the left.
4.4.4 Parameterization
In order to reduce the dimensionality of the problem, we first project the object image in every frame onto a low dimensional basis that represents all possible appearances of objects that belong to a certain class, like humans, four-legged animals, etc.
Let n be the number of frames in the training base of a certain class of objects and M be a training samples matrix, where each column corresponds to a spatially aligned image of a moving object written as a binary vector. In our experiments we use 50 × 50 normalized images, therefore M is a 2500 × n matrix. The correlation matrix $M M^T$ is decomposed using Singular Value Decomposition as $M M^T = U \Sigma V^T$, where U is an orthogonal matrix of principal directions and Σ is a diagonal matrix of singular values. In practice, the decomposition is performed on $M^T M$, which is computationally more efficient [119]. The
principal basis {Ui , i = 1..k} for the training set is then taken as k columns of U corresponding
to the largest singular values in Σ. Figure 4.11 presents a principal basis for the training set
formed of 800 sample images collected from more than 60 sequences showing dogs and cats
in motion. The basis is built by taking the k = 20 first principal component vectors.
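Under the notation above, building the principal basis and projecting an image onto it might be sketched as follows; the decomposition is carried out on the smaller matrix M^T M, as mentioned in the text, and np.linalg.svd is used as the decomposition routine.

```python
import numpy as np

def principal_basis(M, k=20):
    """Principal basis for a training matrix M (2500 x n): the k left singular
    vectors of M associated with the largest singular values.

    The decomposition is done on the small n x n matrix M^T M and the
    resulting vectors are mapped back to image space, which is cheaper than
    decomposing the 2500 x 2500 matrix M M^T directly.
    """
    M = np.asarray(M, dtype=float)
    _, _, Vt = np.linalg.svd(M.T @ M)                    # eigen-decomposition of M^T M
    U = M @ Vt.T                                         # back to image space, 2500 x n
    U /= np.linalg.norm(U, axis=0, keepdims=True) + 1e-12
    return U[:, :k]                                      # 2500 x k principal basis B

def project(B, image):
    """Parameterize a normalized 50 x 50 silhouette image: p = B^T v_I."""
    return B.T @ image.astype(float).ravel()
```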
We assume that by building such representative bases for every class of objects and then
finding the basis that best represents a given object image in a minimal distance to the
Figure 4.7: Inter-frame correlation between object silhouettes. Four graphs show the correlation measured for four initial frames, plotted against the frame offset.
feature space (DTFS) sense, we can distinguish between various object classes. Figure 4.12
shows the distances from more than 1000 various images of people, dogs and cats to the
feature space of people and to that of dogs and cats. In all cases, images of people were
closer to the people feature space than to the animals’ feature space and vice versa. This
allows us to distinguish between these two classes. A similar approach was used in [45] for
the detection of pedestrians in traffic scenes.
If the object class is known (e.g. we know that the object is a dog), we can parameterize
the moving object silhouette image I in every frame by projecting it onto the class basis. Let
B be the basis matrix formed from the basis vectors {Ui , i = 1..k}. Then, the parameterized
representation of the object image I is given by the vector p of length k as p = B T vI , where
vI is the image I written as a vector.
The idea of using a parameterized representation in a motion-based recognition context is certainly not a new one. To name a few examples, we mention again the work of Yacoob and Black [128]. Cootes et al. [27] used a similar technique for describing feature point locations by
a reduced parameter set. Baumberg and Hogg [3] used PCA to describe a set of admissible
B-spline models for deformable object tracking. Chomat and Crowley [23] used PCA-based
spatio-temporal filter for human motion recognition.
Figure 4.13 shows several normalized moving object images from the original sequence
and their reconstruction from a parameterized representation by back-projection to the image
space. The numbers below are the norms of differences between the original and the back-
Figure 4.8: Scale alignment. A minimal square bounding box around the center of the segmented object silhouette (a) is cropped and re-scaled to form a 50 × 50 binary image (b).
Figure 4.9: Temporal alignment. Top: original 11 frames of one period subsequence. Bottom:
re-sampled 10 frames sequence.
projected images. These norms can be used as the DTFS estimation.
Now we can use these parameterized representations to distinguish between different types of motion. The reference base for the activity recognition consists of temporally aligned one-period subsequences, where the moving object silhouette in every frame of these subsequences is represented by its projection onto the principal basis. More formally, let $\{I_f : f = 1..T\}$ be a one-period, temporally aligned set of normalized object images, and $p_f$, f = 1..T, a projection of the image $I_f$ onto the principal basis B of size k. Then, the vector P of length kT formed by concatenation of all the vectors $p_f$, f = 1..T, represents a one-period subsequence. By choosing a basis of size k = 20 and the normalized duration of a one-period subsequence to be T = 10 frames, every single-period subsequence is represented by a feature point in a 200-dimensional feature space.
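Concatenating the per-frame projections into a single feature point can be sketched in one line; B here is the 2500 × k principal basis from the previous snippet, and frames is a temporally aligned one-period subsequence.

```python
import numpy as np

def sequence_feature(frames, B):
    """Feature point of one aligned, T-frame period: the length-kT vector of
    concatenated projections p_f = B^T v_{I_f}, f = 1..T."""
    return np.concatenate([B.T @ f.astype(float).ravel() for f in frames])

# with k = 20 basis vectors and T = 10 frames the feature point has length 200
```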
In the following experiment we processed a number of sequences of dogs and cats in
various types of locomotion. From these sequences we extracted 33 samples of walking dogs,
9 samples of running dogs, 9 samples of galloping dogs and 14 samples of walking cats. Let $S_{200 \times m}$ be the matrix of projected single-period subsequences, where m is the number of samples, and let the SVD of the correlation matrix be $SS^T = U^S \Sigma^S (V^S)^T$. In Figure 4.14 we depict the resulting feature points projected for visualization to the 3-D space using the three first principal directions $\{U_i^S : i = 1..3\}$, taken as the column vectors of $U^S$ corresponding to the three largest eigenvalues in $\Sigma^S$. One can easily observe four separable
Figure 4.10: Temporal shift alignment: (a) the average starting frame of all the training set sequences, (b) a temporally shifted single-cycle test sequence, (c) the correlation between the reference starting frame and the test sequence frames.
clusters corresponding to the four groups.
Another experiment was done over the ‘people’ class of images. Figure 4.15 presents feature points corresponding to several sequences showing people walking and running parallel
to the image plane and running at an oblique angle to the camera. Again, all three groups lie
in separable clusters.
The classification can be performed, for example, using the k-nearest-neighbor algorithm.
We conducted the ‘leave one out’ test for the dogs set above, classifying every sample after taking it out of the training set, and the three-nearest-neighbors strategy resulted in a 100% success rate.
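A leave-one-out k-nearest-neighbor test of this kind takes only a few lines; the feature matrix and label array below are placeholders for the 200-dimensional feature points and their activity labels.

```python
import numpy as np

def leave_one_out_knn(features, labels, k=3):
    """Leave-one-out k-nearest-neighbor classification accuracy.

    features: (m, d) array of feature points; labels: length-m label array.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    correct = 0
    for i in range(len(features)):
        dists = np.linalg.norm(features - features[i], axis=1)
        dists[i] = np.inf                              # exclude the sample itself
        nearest = labels[np.argsort(dists)[:k]]
        values, counts = np.unique(nearest, return_counts=True)
        if values[np.argmax(counts)] == labels[i]:
            correct += 1
    return correct / len(features)
```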
Another option is to further reduce the dimensionality of the feature space by projecting
the 200-dimensional feature vectors in S to some principal basis. This can be done using
a basis of any size, exactly as we did it in 3-D for the visualization. Figure 4.16 presents
learning curves for both animals and people data sets. The curves show how the classification
error rate achieved by a one-nearest-neighbor classifier changes as a function of the principal basis size. The curves are obtained by averaging over a hundred iterations in which the data set is randomly split into the training and the testing sets. The principal basis is built on the
Figure 4.11: The principal basis for the ‘dogs and cats’ training set formed of 20 first principal
component vectors.
Figure 4.12: Distances to the ‘people’ and ‘dogs and cats’ feature spaces from more than 1000 various images of people, dogs and cats.
Figure 4.13: Image sequence parameterization. Top: 11 normalized target images of the original sequence. Bottom: the same images after parameterization using the principal basis and back-projection to the image basis. The numbers are the norms of the differences between the original and the back-projected images.
training set and the testing is performed using the ‘leave one out’ strategy on the testing
set.
4.5 Conclusions
We presented a new framework for motion-based classification of moving non-rigid objects. The technique is based on the analysis of the changing appearance of moving objects and relies heavily on the high accuracy segmentation and tracking results obtained using the fast geodesic contour approach. The periodicity analysis is then performed based on the global properties of the extracted moving object contours, followed by spatial and temporal normalization of the video sequence. Normalized one-period subsequences are parameterized by projection onto a principal basis extracted from a training set of images for a given class of objects. A number of experiments show the ability of the system to analyze motions of humans and animals, to distinguish between these two classes based on object appearance, and to classify various types of activities within a class, such as walking, running, and galloping. The ‘dogs and cats’ experiment demonstrates the ability of the system to discriminate between these two classes, which are very similar in appearance, by analyzing their locomotion.
Figure 4.14: Feature points extracted from the sequences with walking, running and galloping dogs and walking cats, projected to the 3-D space for visualization.
Figure 4.15: Feature points extracted from the sequences showing people walking and running parallel to the image plane and at a 45-degree angle to the camera. Feature points are projected to the 3-D space for visualization.
Figure 4.16: Learning curves for (a) the dogs and cats data set and (b) the people data set. One-nearest-neighbor classification error rate as a function of the number of eigenvectors in the projection basis.
Chapter 5
Articulated Motion
5.1 Introduction
Articulated object segmentation is a well known computer vision problem. The problem usually arises whenever simple object detection and segmentation is not sufficient and some additional information is required about the nature of the object of interest and its interactions with other objects. The most prominent example is, of course, recovering the position of human body parts, which is often a first step for numerous applications including posture and gait analysis, human-computer interfaces, automatic surveillance, etc. Many methods have been proposed for tracking and analyzing human body motion. Most of them assume some kind of human body model (either parametric or not) and try to fit the observed visual cues to the model. There are some applications where this approach fails. For example, in moving target classification there is no single model for the objects of interest, as they may belong to different classes such as humans, animals and vehicles. Yet, one may wish to detect the independently moving parts of the object, as they may serve as cues for the classification. This calls for a non-model-based method for articulated body segmentation based on the observed appearance of the object and not on prior knowledge about its nature.
One of the most important applications of articulated object segmentation is shape-based recognition using parts decomposition, whose basic idea is to locate the parts of an object, to classify them into different types of generic components, and then to describe the object in terms of its constituent parts. Parts decomposition in its original form is not applicable to motion based classification, but it can be used as a tool to decompose an object into primitive moving parts (limbs/parts of a mechanism) and then to perform the classification based on either the relative or the absolute motion of the parts (e.g. [127]).
5.2 Background subtraction
Like most tracking applications, we used a reference background image to detect and segment out the moving object. In this work we used the adaptive background subtraction technique suggested in [79], which is based on fusing the color and gradient information. Figure 5.1 shows an example of applying the background subtraction to a color image sequence. The largest foreground connected component, Figure 5.1(c), is then used as the detected moving object mask.
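The background model of [79] fuses color and gradient cues; the sketch below is only a stand-in, a basic running-average color background with a fixed threshold and largest-connected-component selection. All parameters, and the use of scipy.ndimage for labeling, are assumptions.

```python
import numpy as np
from scipy import ndimage

class AdaptiveBackground:
    """Running-average color background model with largest-component masking.
    A simplified stand-in, not the color+gradient fusion algorithm of [79]."""

    def __init__(self, first_frame, alpha=0.05, thresh=30.0):
        self.bg = first_frame.astype(np.float64)   # background estimate
        self.alpha = alpha                         # adaptation rate (assumed)
        self.thresh = thresh                       # foreground threshold (assumed)

    def apply(self, frame):
        frame = frame.astype(np.float64)
        diff = np.linalg.norm(frame - self.bg, axis=2)     # per-pixel color distance
        fg = diff > self.thresh                            # thresholded foreground
        # adapt the background only where the pixel currently looks like background
        rate = self.alpha * (~fg)[..., None]
        self.bg = (1.0 - rate) * self.bg + rate * frame
        # keep the largest connected foreground component as the moving object mask
        lbl, n = ndimage.label(fg)
        if n == 0:
            return fg
        sizes = ndimage.sum(fg, lbl, index=range(1, n + 1))
        return lbl == (1 + int(np.argmax(sizes)))
```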
Figure 5.1: Example of background subtraction (a) Original color image from sequence, (b)
Foreground confidence values, (c) Thresholded foreground image.
5.3 Edge detection
We use the following scheme for estimating the edge strength in color images. First the gradients $(R_x, R_y)$, $(G_x, G_y)$ and $(B_x, B_y)$ are computed by the Sobel operator separately for the three channels R, G and B. Then the edge indicator function is given by the largest eigenvalue of the matrix
$$\begin{pmatrix} R_x^2 + G_x^2 + B_x^2 & R_x R_y + G_x G_y + B_x B_y \\ R_x R_y + G_x G_y + B_x B_y & R_y^2 + G_y^2 + B_y^2 \end{pmatrix}$$
and the normal direction to the edge is given by the corresponding eigenvector $\vec{n}$. The edge sign is taken as the sign of the largest projection of the channel gradients, $\nabla R$, $\nabla G$ and $\nabla B$, onto the normal edge direction vector $\vec{n}$.
Figure 5.2 shows an example of edge detection in a color image (b), and the edge pixels
obtained by applying a threshold to the edge indicator function (c). Note that the edge
detection is performed only in the masked region belonging to the moving object.
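The largest eigenvalue of the 2x2 symmetric matrix above has a closed form, so the edge indicator can be computed per pixel without an eigen-solver. The following is a minimal sketch (function names are assumptions; scipy's Sobel filter stands in for the operator used in the thesis):

```python
import numpy as np
from scipy import ndimage

def color_edge_indicator(img):
    """img: (H, W, 3) float array with channels R, G, B.
    Returns the edge indicator (largest eigenvalue of the color structure
    matrix) and the unit edge-normal direction at every pixel."""
    gx = np.stack([ndimage.sobel(img[..., c], axis=1) for c in range(3)])  # Rx, Gx, Bx
    gy = np.stack([ndimage.sobel(img[..., c], axis=0) for c in range(3)])  # Ry, Gy, By
    a = (gx ** 2).sum(axis=0)        # Rx^2 + Gx^2 + Bx^2
    b = (gx * gy).sum(axis=0)        # RxRy + GxGy + BxBy
    c = (gy ** 2).sum(axis=0)        # Ry^2 + Gy^2 + By^2
    # largest eigenvalue of the 2x2 symmetric matrix [[a, b], [b, c]]
    lam = 0.5 * (a + c) + np.sqrt(0.25 * (a - c) ** 2 + b ** 2)
    # corresponding eigenvector = normal direction to the edge
    nx, ny = lam - c, b
    norm = np.hypot(nx, ny) + 1e-12
    return lam, np.stack([nx / norm, ny / norm])
```

Edge pixels are then obtained by thresholding the indicator within the moving object mask, as described above.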
5.4 Normal flow
In our work we have chosen to use the normal flow for the motion analysis. The normal flow is the projection of the image motion field onto the edge normal direction, i.e. if $\vec{n}_r$ is the direction of the normal to the edge at image point $\vec{r}$ and the optical flow at $\vec{r}$ is $\vec{u}$, then the normal flow at $\vec{r}$ is given by $u_n = \vec{u} \cdot \vec{n}_r$.
In this work we used the algorithm of Duric et al. for normal flow computation, which works directly on color image sequences.
Figure 5.3 presents an example of the computed normal flow for a pair of consecutive
frames from a color sequence.
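We do not reproduce the color normal-flow algorithm of Duric et al. here; as a placeholder, the following sketch computes the classical single-channel normal flow from the brightness constancy equation, u_n = -I_t / |grad I| along the gradient direction. This is a simplifying assumption, not the method actually used in the thesis.

```python
import numpy as np

def normal_flow(frame0, frame1, grad_eps=1e-3):
    """Normal flow between two grayscale frames (2-D float arrays).
    Under brightness constancy the flow component along the image gradient
    (i.e. along the edge normal) is  u_n = -I_t / |grad I|.
    Returns the signed normal flow magnitude and the unit normal direction."""
    It = frame1 - frame0                      # temporal derivative
    Iy, Ix = np.gradient(frame0)              # central-difference spatial derivatives
    mag = np.hypot(Ix, Iy)
    un = np.where(mag > grad_eps, -It / (mag + grad_eps), 0.0)
    normal = np.stack([Ix, Iy]) / (mag + grad_eps)   # unit edge-normal direction
    return un, normal
```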
Figure 5.2: Example of edge detection for color images (a) Original color image from sequence, (b) Edge indicator function values, (c) Edge indicator function after applying a
threshold.
Figure 5.3: Two consecutive frames from an image sequence (a, b) and the computed normal
flow (c).
5.5 Grouping edge pixels into edge segments
The next step is grouping the edge pixels (edgels) into meaningful segments that hopefully belong to a single independently moving object/body part. Our assumption is that a connected set of edgels forming a nearly straight line segment and exhibiting the same motion is likely to belong to the same independently moving part. The grouping algorithm is then:
1. Initialize a new edge segment with an unused edge pixel $p$, and mark $p$ as used.
2. Find the set $N_p$ of unused edgels neighboring $p$.
3. If $N_p$ is empty, close the current segment and return to step 1.
4. Otherwise, find the best fitting edgel $p_{new} \in N_p$, in the sense that $p_{new}$ is the edgel that most closely coincides with the segment direction and has the edge orientation and normal flow closest to those of edgel $p$.
5. Add $p_{new}$ to the segment, update the segment direction, set $p = p_{new}$ and return to step 2.
As the algorithm advances by adding a new pixel to one end of the segment, one should repeat steps 2-4 twice from the same starting point in order to track the edge segment in both directions. The time complexity of the algorithm is linear in the number of edge pixels. The resulting set of edge segments is then filtered by concatenating collinear segments with close end points and by discarding very short edge segments. Figure 5.4 shows the result of edge grouping, where edge segments are presented in different colors.
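A rough sketch of the greedy segment-growing loop described above; the edgel container, the neighbor query and the fitting cost are hypothetical simplifications of the actual criteria (segment direction, orientation and normal flow similarity):

```python
def grow_segments(edgels, neighbors, max_cost=2.0):
    """Greedy grouping of edge pixels into nearly straight segments.
    edgels:    dict mapping pixel -> (orientation, normal_flow)   [hypothetical]
    neighbors: function pixel -> list of adjacent edge pixels     [hypothetical]
    Follows steps 1-5 above; every seed is grown twice to cover both
    directions (ordering of pixels inside a segment is ignored here)."""
    unused = set(edgels)
    segments = []
    while unused:
        seed = unused.pop()                                     # step 1
        segment = [seed]
        for _ in range(2):                                      # both directions
            p = seed
            while True:
                candidates = [q for q in neighbors(p) if q in unused]   # step 2
                if not candidates:                              # step 3
                    break

                # step 4: edgel with the closest orientation and normal flow
                def cost(q):
                    return (abs(edgels[q][0] - edgels[p][0]) +
                            abs(edgels[q][1] - edgels[p][1]))

                p_new = min(candidates, key=cost)
                if cost(p_new) > max_cost:                      # poor fit: stop growing
                    break
                segment.append(p_new)                           # step 5
                unused.discard(p_new)
                p = p_new
        segments.append(segment)
    return segments
```

Collinear segments with close end points would then be concatenated and very short segments discarded, as described above.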
5.6 Grouping edge segments
As we are not interested in extracting the actual 3D motion parameters of the independently
moving object parts and only need the motion information in order to segment the moving
object, it may be acceptable to approximate the observable motion using the planar affine
model. The model error is relatively small provided the distance from the camera to the
object is much larger than the object displacements in the direction perpendicular to the
image plane.
Under this planar motion assumption the velocity of an object point $\vec{r} = (x, y)$ in the image plane (the optical flow) is given by $\dot{\vec{r}} \equiv \vec{u} = \vec{T} - \vec{\omega} \times \vec{r}$, where $\vec{\omega} = (0, 0, \omega_z)$ is the rotational velocity vector and $\vec{T} = (U, V)$ is the translational velocity. Then the normal flow at the point $\vec{r}$ is $\|\vec{u}_n\| = \vec{u} \cdot \vec{n}_r = (\vec{T} - \vec{\omega} \times \vec{r}) \cdot \vec{n}_r$, where $\vec{n}_r = (n_x, n_y)$ is the normal direction at the point $\vec{r}$.
For any two edge segments $S_1$ and $S_2$ that belong to the same independently moving articulated object part $P$, there exists a unique set of motion parameters $\vec{m}^P = (U^P, V^P, \omega_z^P)$ such that for every edge point $\vec{r}_i \in (S_1 \cup S_2)$ the following constraint should hold
$$\vec{u}_i \cdot \vec{n}_{r_i} = \left((U^P, V^P) - \vec{\omega}^P \times \vec{r}_i\right) \cdot \vec{n}_{r_i}. \tag{5.1}$$
Figure 5.4: Example of grouping edge pixels into edge segments. Different edge segments
are in different colors.
Let us denote
$$\vec{a}_i = \begin{pmatrix} n_{x_i} \\ n_{y_i} \\ n_{y_i} x_i - n_{x_i} y_i \end{pmatrix}.$$
Then in matrix form the set of constraints (5.1) for the points of $S_1 \cup S_2$ is given by
$$\vec{b} = A \vec{m}^P \tag{5.2}$$
where the $\vec{a}_i^T$ are the rows of the matrix $A$ and the elements $b_i$ of the vector $\vec{b}$ are given by $b_i = \vec{u}_i \cdot \vec{n}_{r_i}$. The matrix $A$ and the vector $\vec{b}$ are formed from the observable quantities, while the motion parameters vector $\vec{m}^P$ can be estimated by minimizing $\|A\vec{m}^P - \vec{b}\|$.
In general, the two edge segments being compared are not of equal size, which means that more equations in (5.2) come from one edge segment than from the other. Let us assume w.l.o.g. that $|S_1| > |S_2|$. Then, in order to give both segments equal weight in determining the motion parameters, one should compensate for the difference in size by multiplying the second segment's equations by $|S_1|/|S_2|$.
If the two edge segments $S_1$ and $S_2$ indeed belong to the same moving object part and, therefore, exhibit the same motion, then the residual $res_{S_1,S_2} = \|A\vec{m}^P - \vec{b}\|$ should be close to zero, whereas when $S_1$ and $S_2$ are moving differently the residual should be larger. This gives us a grouping cue allowing us to discard pairs of edge segments that do not fit a common motion model.
It turned out that judging by this cue alone is not sufficient. For example, when two object parts are moving similarly in a certain phase of the motion, the edge segments belonging to these parts are likely to be grouped together if the decision is based on the observable motion alone. Therefore one needs to employ additional tests to make the grouping process more reliable.
The other cues we used for segmentation are based on geometric properties of the extracted edge segments. Usually the independently moving parts (e.g. limbs of humans and animals, robot manipulator segments) can be approximated by a cylinder, and as such have almost parallel boundaries in their image projections. Motivated by this assumption, we look for pairs of almost parallel edge segments by comparing their slope angles. For a pair of almost parallel segments the normal distance is calculated. If the segments are almost collinear (very small normal distance), they may represent two parts of the same straight boundary segment, and they are grouped together if their edge points are close and their slope angles are almost identical.
Otherwise, if the normal distance is relatively large, the two segments may belong to different sides of the same limb. Then, in order to be grouped together, the following requirements have to be met: the segments' projections onto each other overlap, and the segments are not separated by the background. The latter property is checked by drawing a line between the centers of mass of the two segments and checking whether this line contains any background pixels.
Figure 5.5 presents an example of edge segment grouping for six frames taken from a single image sequence. For each group of edge segments a rectangular bounding box aligned with the principal axes direction is shown. One can see that the torso part that is visible in frames (a-c) is not detected in frames (d-f), as its bounding edge segments are lost due to the motion of the hands. On the other hand, the arms that were not visible in frames (a-d) are detected in frames (e,f) as they emerge from the torso region. The forward leg is split into two regions in frame (a), but becomes a single region in frame (b), since at this phase of the step it is fully stretched and moves as one piece. The other leg, being in the anti-phase, is represented by a single region in frames (a-c) and then split into two regions in frames (d-f).
Figure 5.5: Edge grouping results for six frames from an image sequence. Every group of
edge segments is enclosed by a best fitting rectangular bounding box
5.7 Multiple frames approach
One of the drawbacks of the scheme described so far is the lack of persistency in the segmentation, i.e. an object part detected in one frame may disappear in the next one and then appear again several frames later.
One way to cope with this problem is to remember the results of the segmentation made for one frame and then use them when doing the segmentation in the following frames. To do so it is necessary to keep track of persistent image elements, so that they can be detected from frame to frame.
In our implementation this is done by tracking the edge segments. An edge segment found in a certain frame enters a list of tracked segments, and its new position in the next frame is detected using the normal flow. A search region is defined around the new segment position predicted by the normal flow vectors, and the corresponding edge segment in the next frame is built from the edgels found within the region. An edge segment that is being tracked for several frames is regarded as stable and enters the database of active edge segments.
The database is a table of size $N_s \times N_s$, where $N_s$ is the number of active segments. An entry $(i, j)$ of the table accumulates information about the mutual relationship between the edge segments $S_i$ and $S_j$. The history data stored for each pair of edge segments includes the mean and the variance of the slope difference, of the normal and tangential distances, and of the motion model residual computed from the normal flow. All these parameters are updated with every new frame. The decision whether two segments should be grouped together is made in a manner similar to that described in the previous section, while taking the historical information into account. For example, if the slope angle between two segments changes significantly within the period of observation (i.e. large slope difference variance), the segments cannot be grouped together. The same applies to the situation when in some frame a pair of segments was separated by the background.
Once it has been decided that two segments do not belong to the same moving object part, the decision is stored in the corresponding table entry and the segments will not be grouped together in subsequent frames, even if their relative motion and geometry look appropriate.
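A sketch of the per-pair history entry of the $N_s \times N_s$ table; the set of statistics, the class and function names and the decision thresholds are assumptions meant only to illustrate the bookkeeping:

```python
import numpy as np

class PairHistory:
    """Accumulated statistics for one (i, j) entry of the Ns x Ns table:
    running mean/variance of slope difference, normal distance, tangential
    distance and common-motion residual, plus a permanent veto flag."""

    def __init__(self):
        self.n = 0
        self.sums = np.zeros(4)
        self.squares = np.zeros(4)
        self.vetoed = False          # set once the pair must never be grouped

    def update(self, slope_diff, normal_d, tangent_d, residual, separated_by_bg):
        x = np.array([slope_diff, normal_d, tangent_d, residual])
        self.n += 1
        self.sums += x
        self.squares += x ** 2
        if separated_by_bg:          # background was seen between the segments
            self.vetoed = True

    def stats(self):
        mean = self.sums / max(self.n, 1)
        var = self.squares / max(self.n, 1) - mean ** 2
        return mean, var

def groupable(hist, max_slope_var=0.05, max_mean_residual=2.0):
    """Grouping decision from the accumulated history (thresholds assumed)."""
    if hist.vetoed:
        return False
    mean, var = hist.stats()
    return var[0] < max_slope_var and mean[3] < max_mean_residual
```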
Figure 5.6: Multiple frames based segmentation results for six frames from an image sequence. Every group of edge segments is enclosed by a best fitting rectangular bounding
box
Figure 5.6 shows an example of moving body segmentation performed using the multiple frames approach. One can see that the segmentation is now more persistent than for the frame-by-frame analysis. For example, the left hand region, once extracted as a separate one in frame (b), remains separate through the whole sequence. Each leg is grouped as one body part and remains so, unlike in Figure 5.5, where the legs were split into two regions in some frames.
5.8 Conclusions
In this chapter we presented a simple, yet powerful method for motion based segmentation. The method can be used for the segmentation of articulated objects based on the analysis of the observed motion of their parts. We used a relatively simple underlying motion model (affine), which proved to be powerful enough to discriminate between edges that move in the same way and those that move independently. The algorithm is based on the analysis of normal flow vectors computed in color space and also relies on a number of geometric heuristics for edge segment grouping. A multi-frame approach is also considered, which yields more robust results.
Chapter 6
Discussion
The goal of this work was to explore the possibility of using motion based information, as well as the various ways it can be used, in order to solve a number of well known computer vision problems, such as segmentation, tracking, object recognition and classification, and event detection.
It turned out that the methods one can apply for the analysis of moving objects differ
significantly between the three basic classes of objects: rigid, non-rigid and articulated.
In Chapter 2 we address the problem of extracting the motion parameters of a moving
rigid body for a relatively complicated, yet constrained case of moving vehicles.
As the image sequences are obtained from a forward looking camera installed on a moving vehicle, the observed motion is a superposition of two independent motions: an apparent translational motion of the car (and the attached camera) relative to the environment, both static (road, buildings) and dynamic (vehicles), and an apparent rotational motion of the environment relative to the camera caused by various disturbances due to road bumps or other similar reasons. In order to extract the relative motion parameters of the nearby vehicles we needed to stabilize the image sequence, i.e. to get rid of the rotational component of the observable motion pattern, yielding the sequence of images that would have been observed if the vehicle motion had been smooth. The stabilization algorithm we developed is based on the analysis of the normal flow vectors using a perspective projection imaging model, as the optical flow vectors participating in the computation are collected from points located at different scene depths.
Road segmentation is performed by finding road edges and lane separating marks using
Hough-like voting and choosing dominant straight lines converging to a single point. Independently moving objects are then detected by clustering vertical and horizontal linear
contour sections representing car boundaries and/or car shadow.
The 3-D car motion vector is computed from translational component of optical flow
vectors assuming translational planar motion and weak perspective camera model.
Our belief is that by analyzing the relative velocity graphs and checking such parameters as typical accelerations, speed of maneuvers and other higher level behavioral patterns, one can distinguish between vehicle classes like private cars, trucks, motorcycles and so on.
The work can be extended to analyze 3-D motion, e.g. for the motion based classification of an aircraft in flight. Actually, this approach has already been used for aircraft type detection from radar imaging [31, 29, 30, 28], and it would be very interesting to find out whether this type of analysis can be performed on optical data.
While there exists a relatively large number of works dealing with the extraction of motion events in traffic scenes [13, 58, 59, 60, 69, 71], all of them do it with a stationary camera and sometimes use an a priori known and modelled observed scene. Here we address the problem of motion event recognition from a driver's viewpoint and try to understand various events involving the motions of nearby vehicles, which usually move in the same direction without major changes in their distances and relative speeds.
The set of interesting events and parameters in this context, such as passing or changing lanes, exiting or merging into a stream, overtaking, turning, accelerating or braking, rate of approach, a dangerously close or approaching car (collision), etc., can be detected based on the analysis of the relative translational velocity of a vehicle and by looking at its position relative to different road parts - lanes, margins, the separating line and so on.
The analysis of non-rigid object motion, on the other hand, poses new challenges. While the exact estimation of such motion parameters as translation and rotation is more problematic (if possible at all) because of the deformations the target undergoes, the same deformations carry essential information about the nature of the object. Therefore we chose to use an active contour based approach for the segmentation and tracking of non-rigid objects in order to provide high-accuracy results allowing us to capture the dynamics of the object deformations.
In Chapter 3 we present a fast and efficient implementation of a numerical scheme for geodesic active contours that is based on a combination of several advanced numerical techniques. The proposed 'fast geodesic active contour' scheme was applied successfully to image segmentation and tracking in video sequences and color images [53]. The method can be used to solve a number of segmentation problems that commonly arise in image analysis and was successfully applied in a number of projects ranging from medical applications, such as cortical layer segmentation in 3-D MRI brain images [52], to the segmentation of defects in VLSI circuits in electron microscope images, and to the analysis of bullet traces in metals.
In Chapter 4 we present a framework for motion-based classification of moving non-rigid objects [54]. The technique is based on the analysis of the changing appearance of moving objects and relies heavily on the high-accuracy segmentation and tracking results obtained using the fast geodesic contour approach described earlier.
By assuming that most non-rigid moving objects are self-propelled living creatures whose motion is almost periodic, we perform the periodicity analysis based on the global properties of the extracted moving object contours.
Knowing the motion period allows us to extract several important quantities that may be used as classification invariants. In addition, by normalizing other measured motion parameters in time, we make them immune to motion rate variations and allow them to be compared to the stored representative templates of the categories.
Normalized one-period subsequences are parameterized by projection onto a principal basis extracted from a training set of images for a given class of objects. A number of experiments show the ability of the system to analyze motions of humans and animals, to distinguish between these two classes based on object appearance, and to classify various types of activities within a class, such as walking, running and galloping.
The 'dogs and cats' experiment demonstrates the ability of the system to discriminate between these two classes, which are very similar in appearance, by analyzing their locomotion. This is an interesting example, because it was claimed [120] that these two classes of objects are indistinguishable in terms of their visual appearance. This raises an interesting philosophical question about the distinguishability of various classes and the definition of the set of classes or categories we would like to relate an object to. This is, of course, an abstract question, unless we limit ourselves to some natural purposive environment, where the classification process is a part of a more complex system allowing an observer to perceive the environment and then act accordingly.
There exists a hypothesis [8], [120] that the most natural object classes are those induced
by nature itself and not imposed by an observer. In other words, the set of natural classes
is independent of our perception and the visual attributes available are sufficient for the
classification.
The principle of 'Natural Modes' [8] states: 'Environmental pressures force objects to have non-arbitrary configurations and properties that define object categories in the space of properties important to interaction between objects and the environment'. This is accompanied by another principle of 'Accessibility', claiming that those properties are 'reflected in observable properties of the object'. By adopting such basic natural classes as 'plants', 'animals', 'humans', 'human made objects', etc., and using motion information as input data for object classification, we show that object motion is one of those basic visual attributes allowing us to distinguish between natural categories.
As we have already mentioned, the methods that can be applied for rigid and non-rigid body classification are quite different. Therefore, first, we would like to distinguish between rigid and non-rigid objects. In most cases natural non-rigid objects in locomotion exhibit substantial changes in the apparent view of the external contour, since the motion itself is caused by these deformations. Hence, we can judge the rigidity of moving objects by the amount of deformation that their external contours undergo. Although this aspect of the problem was not discussed in detail in this thesis, we successfully used this technique in the real-time system we have built to distinguish between such rigid/non-rigid categories as humans, animals and vehicles [95]. After the initial rigid/non-rigid distinction is made we can proceed with finer classification within these two large categories.
Finally, in Chapter 5 we address the problem of articulated object segmentation and tracking. The obvious motivation comes from the problem of human body segmentation, which is required in numerous applications where the posture of a human body is to be recovered.
Our method is based on the analysis of normal flow vectors computed from color image sequences and on grouping edge segments based on the similarity of the motion they exhibit. The similarity is defined in terms of compliance with the underlying motion model, which in our case was taken to be a simple affine motion and proved to be 'expressive' enough for our qualitative purposes. We also suggested a number of geometric heuristics to be used as grouping cues and developed a multi-frame analysis scheme that is more robust than simple frame-to-frame decision making.
The algorithm is meant to be a part of a more complex event recognition system that works in an indoor environment, where changes in the configuration (posture) of humans and their interaction with static objects are of primary interest.
In this context an interesting question arises: does there exist a generic set of primitive recognizable motion events, extractable from an image sequence, that could describe a large, application independent class of verbs or other higher level abstractions? An interesting development of our work would be to use the low level primitives called "acts" introduced in the Conceptual Dependency theory by Schank [99], which is a theory of the representation of the meaning of sentences. The theory defines a set of primitive actions called "acts" in such a way that any verb in a natural language can be represented as a unique combination of those primitive actions. One would then need to develop a set of techniques to extract these primitives and to parse them into higher level events.
Bibliography
[1] D. Adalsteinsson and J. A. Sethian. A fast level set method for propagating interfaces. J. of
Comp. Phys., 118:269–277, 1995.
[2] J. L. Barron, D. J. Fleet, and S. S. Beauchemin. Performance of optical flow techniques.
International Journal of Computer Vision, 12(1):43–77, 1994.
[3] A. Baumberg and D. Hogg. An efficient method for contour tracking using active shape
models. In Proc. IEEE Workshop on Motion of Non-Rigid and Articulated Objects, pages
194–199, Austin, 1994.
[4] M. G. Bekker. Introduction to Terrain-Vehicle Systems. University of Michigan Press, Ann
Arbor, 1969.
[5] J. R. Beniger. The Arts and New Media site.
[6] M. Betke, E. Haritaoglu, and L. S. Davis. Real-time multiple vehicle detection and tracking
from a moving vehicle. Machine Vision and Applications, 12(2):69–83, 2000.
[7] A. Bobick and J. Davis. The representation and recognition of action using temporal templates. IEEE Trans. on PAMI, 23(3):257–267, 2001.
[8] A. F. Bobick. Natural Object Categorization. Ph.D. thesis, tr-1001, MIT Artificial Intelligence
Laboratory, 1987.
[9] A. Broggi. An image reorganization procedure for automotive road following systems. In
Proc. International Conference on Image Processing, volume 3, pages 532–535, 1995.
[10] A. Broggi. Parallel and local feature-extraction: A real-time approach to road boundary
detection. IEEE Transactions on Image Processing, 4(2):217–223, February 1995.
[11] T. J. Broida and R. Chellappa. Estimation of object motion parameters from noisy images.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(1):90–99, January 1986.
[12] P. Burlina and R. Chellappa. Time-to-x: Analysis of motion through temporal parameters.
In Proc. Conference on Computer Vision and Pattern Recognition, pages 461–468, 1994.
[13] H. Buxton and S. Gong. Advanced visual surveillance using bayesian networks. In Proc.
IEEE Special Workshop on Context-Based Vision, pages 111–123, MIT Massachusetts, April
1995.
[14] V. Caselles, F. Catte, T. Coll, and F. Dibos. A geometric model for active contours. Numerische Mathematik, 66:1–31, 1993.
[15] V. Caselles and B. Coll. Snakes in movement. SIAM J. Numer. Analy., 33(6):2445–2456,
1996.
[16] V. Caselles, R. Kimmel, and G. Sapiro. Geodesic active contours. In Proceedings ICCV’95,
pages 694–699, Boston, Massachusetts, June 1995.
[17] V. Caselles, R. Kimmel, and G. Sapiro. Geodesic active contours. IJCV, 22(1):61–79, 1997.
[18] V. Caselles, R. Kimmel, G. Sapiro, and C. Sbert. Minimal surfaces: A geometric three
dimensional segmentation approach. Numerische Mathematik, 77(4):423–425, 1997.
[19] V. Caselles, R. Kimmel, G. Sapiro, and C. Sbert. Minimal surfaces based object segmentation.
IEEE Trans. on PAMI, 19:394–398, 1997.
[20] C. Cedras and M. Shah. Motion-based recognition: A survey. IVC, 13(2):129–155, March
1995.
[21] T.F. Chan and L. A. Vese. Active contours without edges. IEEE trans. on Image Processing,
10(2):266–277, February 2001.
[22] C. S. Chiang, C. M. Hoffmann, and R. E. Lynch. How to compute offsets without self-intersection. In Proc. of SPIE, volume 1620, page 76, 1992.
[23] O. Chomat and J. Crowley. Recognizing motion using local appearance, 1998.
[24] D. L. Chopp. Computing minimal surfaces via level set curvature flow. J. of Computational
Physics, 106(1):77–91, May 1993.
[25] J. C. Clarke and A. Zisserman. Detection and tracking of independent motion. Image and
Vision Computing, 14:565–572, 1996.
[26] L. D. Cohen. On active contour models and balloons. CVGIP: Image Understanding, 53(2):211–218, 1991.
[27] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models: Their
training and application. CVIU, 61(1):38–59, January 1995.
[28] N. J. Cutaia, M. I. Miller, J. A. O’Sullivan, and D. L. Snyder. Multi-target narrow-band
direction finding and tracking based on motion dynamics. In Allerton Conference on Communication, Control, and Computing, Urbana-Champaign, IL, October 1992.
[29] N. J. Cutaia and J. A. O’Sullivan. Automatic target recognition using kinematic priors. In
Proceedings of the 33rd Conference on Decision and Control, pages 3303–3307, Lake Buena
Vista, FL, December 1994.
[30] N. J. Cutaia and J. A. O’Sullivan. Identification of maneuvering aircraft using class dependent
kinematic models. Technical Report ESSRL-95-13, Electronic Signals and Systems Research
Laboratory Monograph, St. Louis, MO, May 1995.
[31] N. J. Cutaia and J. A. O’Sullivan. Performance bounds for recognition of jump-linear systems.
In Proc. of the 1995 American Controls Conference, pages 109–113, Seattle, WA, June 1995.
[32] R. Cutler and L. Davis. Robust real-time periodic motion detection, analysis, and applications. PAMI, 22(8):781–796, August 2000.
[33] S. Di Zenzo. A note on the gradient of a multi image. Computer Vision, Graphics, and Image
Processing, 33:116–125, 1986.
[34] E. D. Dickmanns and B. D. Mysliwetz. Recursive 3-d road and relative ego-state recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):199–213, February
1992.
[35] B. A. Dubrovin, A. T. Fomenko, and S. P. Novikov. Modern Geometry - Methods and
Applications I. Springer-Verlag, New York, 1984.
[36] Z. Duric, J. Fayman, and E. Rivlin. Function from motion. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 18:579–591, 1996.
[37] Z. Duric, R. Goldenberg, E. Rivlin, and A. Rosenfeld. Estimating relative vehicle motion in
traffic scenes. Pattern Recognition, 35(6):1339–53, June 2002.
[38] Z. Duric, E. Rivlin, and A. Rosenfeld. Understanding object motion. IVC, 16(11):785–797,
August 1998.
[39] Z. Duric and A. Rosenfeld. Image sequence stabilization in real time. Real-Time Imaging,
2:271–284, 1996.
[40] Z. Duric, A. Rosenfeld, and L. S. Davis. Egomotion analysis based on the frenet-serret motion
model. IJCV, 15:105–122, 1995.
[41] J. R. Ellis. Vehicle Handling Dynamics. Mechanical Engineering Publications Limited, London, 1994.
[42] S. A. Engel and J. M. Rubin. Detecting visual motion boundaries. In Proc. Workshop on
Motion: Representation and Analysis, pages 107–111, Charleston, S.C., May 1986.
[43] O. Faugeras and R. Keriven. Variational principles, surface evolution PDEs, level set methods, and the stereo problem. IEEE Trans. on Image Processing, 7(3):336–344, 1998.
[44] S. Fejes and L. S. Davis. Detection of independent motion using directional motion estimation.
Computer Vision and Image Understanding, 74(2):101–120, May 1999.
[45] U. Franke, D. Gavrila, S. Gorzig, F. Lindner, F. Paetzold, and C. Wohler. Autonomous
driving goes downtown. IEEE Intelligent System, 13(6):40–48, 1998.
[46] P. Fua and Y. G. Leclerc. Model driven edge detection. Machine Vision and Applications,
3:45–56, 1990.
[47] H. Fujiyoshi and A. Lipton. Real-time human motion analysis by image skeletonization. In
Proc. of the Workshop on Application of Computer Vision, October 1998.
[48] M. Gage and R. S. Hamilton. The heat equation shrinking convex plane curves. J. Diff.
Geom., 23, 1986.
[49] D. M. Gavrila. The visual analysis of human movement: A survey. CVIU, 73(1):82–98,
January 1999.
[50] R. Gerber and H.-H. Nagel. Knowledge representation for the generation of quantified natural
language description of vehicle traffic in image sequence. In Proc. IEEE International Conference on Image Processing (ICIP’96), volume 2, pages 805–808, Lausanne/CH, September
1996.
[51] S. Gil, R. Milanese, and T. Pun. Combining multiple motion estimates for vehicle tracking.
In Proc. European Conference on Computer Vision, pages II:307–320, 1996.
[52] R. Goldenberg, R. Kimmel, E. Rivlin, and M. Rudzsky. Cortex segmentation - a fast variational geometric approach. In Proc. of IEEE workshop on Variational and Level Set Methods
in Computer Vision., pages 127–133, Vancouver, Canada, July 2001.
[53] R. Goldenberg, R. Kimmel, E. Rivlin, and M. Rudzsky. Fast geodesic active contours. IEEE
Trans. on Image Processing, 10(10):1467–75, October 2001.
[54] R. Goldenberg, R. Kimmel, E. Rivlin, and M. Rudzsky. ‘Dynamism of a dog on a leash’ or
behavior classification by eigen-decomposition of periodic motions. In Accepted to ECCV’02,
Copenhagen, May 2002.
[55] K. Gould, K. Rangarajan, and M. Shah. Detection and representation of events in motion trajectories. In Advances in Image Processing and Analysis, chapter 14. SPIE Optical
Engineering Press, June 1992. Gonzalez and Mahdavieh (Eds.).
[56] K. Gould and M. Shah. The trajectory primal sketch: a multi-scale scheme for representing
motion characteristics. In Proc. Conf. on Computer Vision and Pattern Recognition, San
Diego, CA, pages 79–85, 1989.
[57] M. A. Grayson. The heat equation shrinks embedded plane curves to round points. J. Diff.
Geom., 26, 1987.
[58] M. Haag, T. Frank, H. Kollnig, and H.-H. Nagel. Influence of an explicitly modeled 3d scene
on the tracking of partially occluded vehicles. Computer Vision and Image Understanding,
65(2):206–225, 1997.
[59] M. Haag and H.-H. Nagel. Tracking of complex driving manoeuvers in traffic image sequences.
Image and Vision Computing, 16(8):517–527, 1998.
[60] F. Heimes and H.-H. Nagel. Real-time tracking of intersections in image sequences of a
moving camera. Engineering Applications of Artificial Intelligence, 11:215–227, 1998.
[61] F. Heimes, H.-H. Nagel, and T. Frank. Model-based tracking of complex innercity road
intersections. Mathematical and Computer Modelling, 27(9-11):189–203, 1998.
[62] M. Irani, B. Rousso, and S. Peleg. Detecting and tracking multiple moving objects using
temporal integration. In Proc 2nd European Conf on Computer Vision, pages 282–287, Santa
Margharita, Ligure, Italy, 1992.
[63] J. H. Williams, Jr. Fundamentals of Applied Dynamics. Wiley, New York, 1996.
[64] G. Johansson. Visual perception of biological motion and a model for its analysis. Perception
and Psychophysics, 14(2):201–21, 1973.
[65] G. Johansson. Visual motion perception. Scientific American, 232(6):75–80, 85–88, 1975.
[66] S. X. Ju, M. Black, and Y. Yacoob. Cardboard people: A parameterized model of articulated
image motion. In Proc. Int. Conf. on Face and Gestures, pages 561–567, Vermont, 1996.
[67] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International
Journal of Computer Vision, 1:321–331, 1988.
[68] S. Kichenassamy, A. Kumar, P. Olver, A. Tannenbaum, and A. Yezzi. Gradient flows and
geometric active contour models. In Proceedings ICCV’95, Boston, Massachusetts, June 1995.
[69] D. Koller, K. Daniilidis, and H.-H. Nagel. Model-based object tracking in monocular image
sequences of road traffic scenes. International Journal of Computer Vision, 10:257–281, 1993.
[70] D. Koller, H. Heinze, and H.-H. Nagel. Algorithmic characterization of vehicle trajectories
from image sequences by motion verbs. In CVPR91, pages 90–95, Lahaina, Maui, Hawaii,
June 1991.
[71] H. Kollnig and H.-H. Nagel. Ermittlung von begrifflichen beschreibungen von geschehen in
straenverkehrsszenen mit hilfe unscharfer mengen. Informatik Forschung und Entwicklung,
8:186–196, 1993.
[72] E. Kreyszing. Differential Geometry. Dover Publications, Inc., New York, 1991.
[73] Bertrand Leroy, Isabelle L. Herlin, and Laurent Cohen. Multi-resolution algorithms for active contour models. In Proc. 12th Int. Conf. on Analysis and Optimization of Systems
(ICAOS’96), Paris, France, June 1996.
[74] A. Lipton, H. Fujiyoshi, and R. Patil. Moving target classification and tracking from real-time
video. In Proc. IEEE Image Understanding Workshop, pages 129–136, 1998.
[75] F. Liu and R. W. Picard. Finding periodicity in space and time. In Proc. of the 6th Int.
Conf. on Computer Vision, pages 376–383, Bombay, India, 1998.
[76] R. Malladi, R. Kimmel, D. Adalsteinsson, V. Caselles, G. Sapiro, and J. A. Sethian. A
geometric approach to segmentation and analysis of 3d medical images. In Proceedings of
IEEE/SIAM workshop on Biomedical Image Analysis, San-Francisco, California, June 1996.
[77] R. Malladi and J. A. Sethian. An O(N log N) algorithm for shape modeling. Proceedings of
National Academy of Sciences, USA, 93:9389–9392, 1996.
[78] R. Malladi, J. A. Sethian, and B. C. Vemuri. Shape modeling with front propagation: A level
set approach. IEEE Trans. on PAMI, 17:158–175, 1995.
[79] S. J. McKenna, S. Jabri, Z. Duric, A. Rosenfeld, and H. Wechsler. Tracking groups of people.
CVIU, 80(1):42–56, October 2000.
[80] A. E. Michotte. La Perception de la Causalite. Studia Psychologica, Publications Universitaires de Louvain, 1954.
[81] D. M. Moeslund and E. Granum. A survey of computer vision-based human motion capture.
CVIU, 81(3):231–268, March 2001.
[82] R. C. Nelson. Qualitative detection of motion by a moving observer. International Journal
of Computer Vision, 7(1):33–46, November 1991.
[83] American Association of State Highway Officials. A Policy on Geometric Design of Rural
Highways (9th edition). AASHO, 1977.
[84] R. Oldenburger. Mathematical Engineering Analysis. Macmillan, New York, 1950.
[85] S. J. Osher and J. A. Sethian. Fronts propagating with curvature dependent speed: Algorithms based on Hamilton-Jacobi formulations. J. of Comp. Phys., 79:12–49, 1988.
[86] N. Paragios and R. Deriche. A PDE-based level set approach for detection and tracking of
moving objects. In Proc. of the 6th ICCV, Bombay, India, 1998.
[87] N. Paragios and R. Deriche. Geodesic active regions for motion estimation and tracking. In
Proc. of the 7th Int. Conf. on Computer Vision, pages 688–694, Kerkyra, Greece, 1999.
[88] P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Trans. on PAMI, 12:629–639, 1990.
[89] R. Polana and R. C. Nelson. Temporal texture recognition. In Proc. of CVPR, pages 129–134,
1992.
[90] R. Polana and R. C. Nelson. Detecting activities. Journal of Visual Communication and
Image Representation, 5:172–180, 1994.
[91] R. Polana and R. C. Nelson. Low level recognition of human motion. In Proc. of IEEE Workshop on Motion of Non-Rigid and Articulated Objects, pages 77–82, Austin, TX, November
1994.
[92] R. Polana and R. C. Nelson. Recognizing activities. In ICPR94, pages A:815–818, 1994.
[93] R. Polana and R. C. Nelson. Detection and recognition of periodic, nonrigid motion. IJCV,
23(3):261–282, June 1997.
[94] S. Richter and D. Wetzel. A robust and stable road model. In Proc. International Conference
on Pattern Recognition, pages A:777–780, 1994.
[95] E. Rivlin, M. Rudzsky, R. Goldenberg, U. Bogomolov, and S. Lapchev. A real-time system
for classification of moving objects. In Accepted to ICPR’02, Québec City, Canada, August
2002.
[96] R. Rosales and S. Sclaroff. Improved tracking of multiple humans with trajectory prediction
and occlusion modelling. In Proc. of IEEE Conf. on CVPR, Workshop on the Interpretation
of Visual Motion, Santa Barbara, CA, 1998.
[97] A. Ruhe and A. Wedin. Algorithms for separable nonlinear least squares problems. SIAM
Review, 22:318–337, 1980.
[98] G. Sapiro and D. L. Ringach. Anisotropic diffusion of multivalued images with applications
to color filtering. IEEE Trans. Image Proc., 5:1582–1586, 1996.
[99] R. C. Schank. Conceptual dependency: A theory of natural language understanding. Cognitive
Psychology, 3(4):552–631, 1972.
[100] H. Schneiderman and M. Nashman. A discriminating feature tracker for vision-based autonomous driving. IEEE Transactions on Robotics and Automation, 10:769–775, 1994.
[101] S. M. Seitz and C. R. Dyer. View invariant analysis of cyclic motion. Int. Journal of Computer
Vision, 25(3):231–251, December 1997.
[102] J. A. Sethian. Level Set Methods: Evolving Interfaces in Geometry, Fluid Mechanics, Computer Vision and Materials Sciences. Cambridge Univ. Press, 1996.
[103] J. A. Sethian. A marching level set method for monotonically advancing fronts. Proc. Nat.
Acad. Sci., 93(4), 1996.
[104] J. Shah. A common framework for curve evolution, segmentation and anisotropic diffusion.
In Proceedings IEEE CVPR’96, pages 136–142, 1996.
[105] M. Shah, K. Rangarajan, and P.-S. Tsai. Motion trajectories. IEEE Transactions on Systems, Man and Cybernetics, 23(4):1138–1150, July/August 1993.
[106] R. Sharma and Y. Aloimonos. Early detection of independent motion from active control
of normal image flow patterns. IEEE Transactions on Systems, Man, and Cybernetics B,
26(1):42–52, February 1996.
[107] E. Shavit and A. Jepson. Motion understanding from qualitative visual dynamics. In IEEE
workshop on Qualitative Vision, New-York, June 1993.
[108] E. Shavit and A. Jepson. The pose function: An intermediate level representation for motion
understanding. In Proc. of the 10th Israeli Symposium on AI and Vision, Tel-Aviv, December
1993.
[109] J. M. Siskind. Naive Physics, Event Perception, Lexical Semantics and Language Acquisition.
Ph.D. thesis, Elec. Eng’g. and Comp. Sci., MIT, January 1992.
[110] J. M. Siskind and Q. Morris. A maximum-likelihood approach to visual event classification. In Proceedings of the Fourth European Conference on Computer Vision, pages 347–360,
Cambridge, UK, April 1996.
[111] J.M. Siskind. Grounding language in perception. AI Review, 8(5-6):371–391, 1995.
[112] R. H. Smythe. Vision in Animal World. St. Martin’s Press, NY, 1975.
[113] N. Sochen, R. Kimmel, and R. Malladi. A general framework for low level vision. IEEE
Trans. on Image Processing, 7(3):310–318, 1998.
[114] G. W. Stewart. Introduction to Matrix Computations. Academic Press, New York, 1973.
[115] D. Terzopoulos, A. Witkin, and M. Kass. Constraints on deformable models: Recovering 3d
shape and nonrigid motions. Artificial Intelligence, 36:91–123, 1988.
[116] P. H. S. Torr and D. W. Murray. Statistical detection of independent movement from a
moving camera. Image and Vision Computing, 11:180–187, 1993.
[117] P. S. Tsai, M. Shah, K. Keiter, and T. Kasparis. Cyclic motion detection for motion based
recognition. Pattern Recognition, 27(12):1591–1603, December 1994.
[118] J. N. Tsitsiklis. Efficient algorithms for globally optimal trajectories. IEEE Trans. on Automatic Control, 40(9):1528–1538, 1995.
[119] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuro Science,
3(1):71–86, 1991.
[120] S. Ullman. High-level Vision: Object Recognition and Visual Cognition. MIT Press, Cambridge, Massachusetts, 1996.
[121] A. Verri and T. Poggio. Against quantitative optical flow. In Proc. International Conference
on Computer Vision, pages 171–180, 1987.
[122] K. von Frisch. Bees: Their Vision, Chemical Senses, and Language. Cornell University Press,
Ithaca, New York, 1971.
[123] J. Weickert. Fast segmentation methods based on partial differential equations and watershed
transformation. In Mustererkennung, pages 93–100. Springer, Berlin, 1998.
[124] J. Weickert, B. M. ter Haar Romeny, and M. A. Viergever. Efficient and reliable scheme for
nonlinear diffusion filtering. IEEE Trans. on Image Processing, 7(3):398–410, 1998.
[125] J. Weng, T. Huang, and N. Ahuja. 3-d motion estimation, understanding, and prediction from
noisy image sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence,
9(3):370–389, May 1987.
[126] J. Y. Wong. Theory of Ground Vehicles (2nd edition). Wiley, New York, 1993.
[127] Y. Yacoob and M. J. Black. Parameterized modeling and recognition of activities. In Proc.
Int. Conf. on Computer Vision, Bombay, India, 1998.
[128] Y. Yacoob and M. J. Black. Parameterized modeling and recognition of activities. CVIU,
73(2):232–247, February 1999.
[129] S. C. Zhu and A. Yuille. Region competition: Unifying snakes, region growing, and
bayes/MDL for multiband image segmentation. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 18:884–900, 1996.
[130] S. C. Zhu, A. Yuille, and T. S. Lee. Region competition: Unifying snakes, region growing,
and bayes/mdl for multi-band image segmentation. In ICCV95, pages 416–423, 1995.