Motion Based Recognition
Research Thesis
Submitted in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy
Roman Goldenberg
Submitted to the Senate of the Technion – Israel Institute of Technology
Nisan 5762
Haifa
April 2002
This research was carried out in the Department of Computer Science under
the supervision of Prof. Ehud Rivlin and Prof. Ron Kimmel.
While realizing that I am stepping on shaky ground in trying to write down a list of people I would like to thank, I'll do it anyway. Even though I know that the list will never be complete, that most of the people mentioned here will never read this, and that it is not the best way to express my gratitude, I'll do it primarily for myself. Maybe some day it will help me refresh my memories of the good old days.
I would like to thank my advisors Prof. Ehud Rivlin and Prof. Ron Kimmel for their
devoted guidance through all these years.
I would like to thank my co-authors Dr. Michael Rudzsky from the Technion, Dr. Zoran Duric from George Mason University and Prof. Azriel Rosenfeld from the Center for Automation Research. The work presented here is the result of our common effort.
I wish to thank Prof. Alfred Bruckshtein and Prof. Irad Yavneh from the Technion for the intriguing conversations and bright ideas they shared with us.
I am very grateful to Alexander Berengolts, Ilya Blayvas, Constantin Brif, Alexander Brook, Lev Finkelshtein, Kirill Gokhberg, Eyal and Noam Gordon, Lev Gorelik, Alexander Gutkin, and Yana Katz for their help, advice and simply for being there when I needed it.
The generous financial help of the Technion, VATAT, and the Applied Materials company is gratefully acknowledged.
Contents

Abstract

List of Symbols and Abbreviations

1 Introduction
  1.1 Motivation
    1.1.1 Psychological aspects
    1.1.2 Motion related problems
    1.1.3 Motion modeling
  1.2 Main Results and Contributions
  1.3 Background and Related work
    1.3.1 Motion Segmentation and Tracking
    1.3.2 Classification and Recognition
      1.3.2.1 Detecting and Recognizing Motions
        Recognizing Activities
        Detecting Periodic Activities
        Recognizing Periodic Activities
        Parametrized Modeling for Activity Recognition
        Motion Events
      1.3.2.2 Motion Based Object Classification
        Temporal Texture Analysis
        Dynamics Based Object Classification
    1.3.3 From Motion towards Natural Language Description

2 Rigid Motion
  2.1 Introduction
    2.1.1 Interpreting Relative Vehicle Motion in Traffic Scenes
  2.2 Motion Models of Highway Vehicles
    2.2.1 Real Vehicle Motion
    2.2.2 Departures of Vehicle Motion from Smoothness
    2.2.3 The Sizes of the Smooth and Non-smooth Velocity Components
    2.2.4 Camera Motion
    2.2.5 Independently Moving Vehicles
  2.3 Image Motion
    2.3.1 The Imaging Models
    2.3.2 The Image Motion Field and the Optical Flow Field
    2.3.3 Estimation of Rotation
    2.3.4 Estimating the FOE
    2.3.5 The Image Motion Field of an Observed Vehicle
    2.3.6 Estimating Vehicle Motion from Normal Flow
  2.4 Experiments
    2.4.1 Road and Vehicle Detection
    2.4.2 Stabilization
    2.4.3 Relative Motions of Vehicles
  2.5 Related Work
    2.5.1 Understanding Object Motion
    2.5.2 Independent Motion Detection
    2.5.3 Vehicle Detection and Tracking
    2.5.4 Road Detection and Shape Analysis
  2.6 Conclusions

3 Non-Rigid Motion - Tracking
  3.1 Introduction
  3.2 Snakes and Active Contours
  3.3 From snakes to geodesic active contours
    3.3.1 Level set method
    3.3.2 The AOS scheme
    3.3.3 Re-initialization by the fast marching method
  3.4 Edge indicator functions for color and movies
    3.4.1 Edges in Color
    3.4.2 Tracking objects in movies
  3.5 Implementation details
  3.6 Experimental Results
  3.7 Conclusions

4 Non-Rigid Objects - Motion Based Classification
  4.1 Introduction
    4.1.1 General Framework
  4.2 Futurism and Periodic Motion Analysis
  4.3 Related Work
  4.4 Our Approach
    4.4.1 Segmentation and Tracking
    4.4.2 Periodicity Analysis
    4.4.3 Frame Sequence Alignment
      4.4.3.1 Spatial Alignment
      4.4.3.2 Temporal Alignment
    4.4.4 Parameterization
  4.5 Conclusions

5 Articulated Motion
  5.1 Introduction
  5.2 Background subtraction
  5.3 Edge detection
  5.4 Normal flow
  5.5 Grouping edge pixels into edge segments
  5.6 Grouping edge segments
  5.7 Multiple frames approach
  5.8 Conclusions

6 Discussion

Bibliography
List of Figures

1.1  Extraction of motion information from a sequence of frames (from [20]).
1.2  Total motion magnitude feature vector for a sample of walk (top) and a sample of run (bottom). One cycle of activity is divided into six time divisions shown horizontally. Each frame shows the spatial distribution of motion in a 4x4 spatial grid; the size of each square is proportional to the amount of motion in the neighborhood (from [92]).
1.3  The parameterized modeling and recognition (from [127]).
1.4  Image sequence of "walking", five part tracking including self-occlusion and three sets of five signals (out of 40) recovered during the activity (torso, thigh, calf, foot and arm) (from [66]).
1.5  (a) A graphic depiction of the pose function for the hopping of a kangaroo; (b) a graphic depiction of the pose function for the hopping of a rabbit (from [107]).
1.6  Single frame of a real-world traffic scene recorded at an intersection. Vehicle trajectories obtained by automatic detection and model-based tracking are shown together with the vehicle models used (from [70]).
2.1  The Darboux frame moves along the path Γ which lies on the surface Σ.
2.2  A possible motion of the base of a vehicle as the vehicle hits a bump.
2.3  The plane perspective projection image of P is F = f(X/Z, Y/Z, 1); the weak perspective projection image of P is obtained through the plane perspective projection of the intermediate point P1 = (X, Y, Zc) and is given by G = f(X/Zc, Y/Zc, 1).
2.4  (a-d) A selected frame from each of four sequences. Top: the input images. Middle: results of road detection. Bottom: results of vehicle detection.
2.5  (a-b) Two images taken 1/15th of a second apart; (c-d) their normal flows. One can see the effects of bumps. In the first frame, the flow vectors point downward; in the second, they point upward.
2.6  Stabilization results for one frame from each of three sequences: (a) Input frame. (b) Normal flow. (c) Rotational normal flow. (d) Translational normal flow.
2.7  Identification of distant image points in frames from two sequences.
2.8  Frames 0, 30, and 60 of a sequence showing two vehicles accelerating. The normal flow results are shown below the corresponding image frames.
2.9  Motion analysis results for the acceleration sequence. U, V, W are the scaled (by an unknown distance Zc^-1) components of the relative translational velocity.
2.10 Frames 5, 15, 25, and 35 of the van sequence. The normal flow results are shown below the corresponding image frames.
2.11 Motion analysis results for the van sequence. U, V, W are the scaled (by an unknown distance Zc^-1) components of the relative translational velocity.
2.12 Frames 1, 14 and 26 of the Italian sequence. The normal flow results are shown below the corresponding image frames.
2.13 Motion analysis results for the Italian sequence. U, V, W are the scaled (by an unknown distance Zc^-1) components of the relative translational velocity.
2.14 Frames 1, 25 and 48 of the Haifa sequence. The normal flow results are shown below the corresponding image frames.
2.15 Motion analysis results for the Haifa sequence. U, V, W are the scaled (by an unknown distance Zc^-1) components of the relative translational velocity.
3.1  Two arrays of linked lists used for narrow band representation.
3.2  Curvature flow by the proposed scheme. A non-convex curve vanishes in finite time at a circular point by Grayson's Theorem. The curve evolution is presented for two different time steps. Left: τ = 20; Right: τ = 50.
3.3  Curvature flow CPU time for the explicit scheme and the implicit AOS scheme. First, the whole domain is updated, next, the narrow band is used to increase the efficiency, and finally the AOS speeds the whole process. For the explicit scheme the maximal time step that still maintains stability is chosen. For the AOS scheme, CPU times for several time steps are presented.
3.4  Multiple objects segmentation in a static color image.
3.5  Gray matter segmentation in an MRI brain image. The white contour converges to the outer cortical surface and the black contour converges to the inner cortical surface.
3.6  Tracking a cat in a color movie by the proposed scheme. Top: segmentation of the cat in a single frame. Bottom: tracking the walking cat in the 50 frame sequence.
3.7  Tracking two people in a color movie. Top: curve evolution in a single frame. Bottom: tracking two walking people in a 60 frame movie.
4.1  'Dynamism of a Dog on a Leash', 1912, by Giacomo Balla. Albright-Knox Art Gallery, Buffalo.
4.2  (a) running horse video sequence, (b) first 10 eigen-shapes, (c,d) first and second eigen-shapes enlarged, (e) 'The Red Horseman', 1914, by Carlo Carra, Civico Museo d'Arte Contemporanea, Milan.
4.3  Non-rigid moving object segmentation and tracking.
4.4  Global motion characteristics measured for walking man sequence.
4.5  Global motion characteristics measured for walking cat sequence.
4.6  Deformations of a walking man contour during one motion period (step). Two steps synchronized in phase are shown. One can see the similarity between contours in corresponding phases.
4.7  Inter-frame correlation between object silhouettes. Four graphs show the correlation measured for four initial frames.
4.8  Scale alignment. A minimal square bounding box around the center of the segmented object silhouette (a) is cropped and re-scaled to form a 50 × 50 binary image (b).
4.9  Temporal alignment. Top: original 11 frames of one period subsequence. Bottom: re-sampled 10 frame sequence.
4.10 Temporal shift alignment: (a) average starting frame of all the training set sequences, (b) temporally shifted single-cycle test sequence, (c) the correlation between the reference starting frame and the test sequence frames.
4.11 The principal basis for the 'dogs and cats' training set formed of the 20 first principal component vectors.
4.12 Distances to the 'people' and 'dogs and cats' feature spaces from more than 1000 various images of people, dogs and cats.
4.13 Image sequence parameterization. Top: 11 normalized target images of the original sequence. Bottom: the same images after the parameterization using the principal basis and back-projecting to the image basis. The numbers are the norms of the differences between the original and the back-projected images.
4.14 Feature points extracted from the sequences with walking, running and galloping dogs and walking cats, projected to the 3-D space for visualization.
4.15 Feature points extracted from the sequences showing people walking and running parallel to the image plane and at a 45 degree angle to the camera. Feature points are projected to the 3-D space for visualization.
4.16 Learning curves for (a) dogs and cats data set and (b) people data set. One-nearest-neighbor classification error rate as a function of the number of eigenvectors in the projection basis.
5.1  Example of background subtraction: (a) Original color image from sequence, (b) Foreground confidence values, (c) Thresholded foreground image.
5.2  Example of edge detection for color images: (a) Original color image from sequence, (b) Edge indicator function values, (c) Edge indicator function after applying a threshold.
5.3  Two consecutive frames from an image sequence (a, b) and the computed normal flow (c).
5.4  Example of grouping edge pixels into edge segments. Different edge segments are in different colors.
5.5  Edge grouping results for six frames from an image sequence. Every group of edge segments is enclosed by a best fitting rectangular bounding box.
5.6  Multiple frames based segmentation results for six frames from an image sequence. Every group of edge segments is enclosed by a best fitting rectangular bounding box.
Abstract
In this work we explore how motion-based visual information can be used to solve a number of well-known computer vision problems such as segmentation, tracking, object recognition and classification, and event detection.
We consider three special cases for which the methods used are quite different: rigid, non-rigid and articulated objects.
For rigid objects we address a problem taken from the traffic domain and show how the relative velocity of nearby vehicles can be estimated from a video sequence taken by a camera installed on a moving car.
For non-rigid objects we present a novel geometric variational framework for image segmentation using the active contour approach. The method is successfully used for segmentation and tracking of moving non-rigid targets in color movies, as well as for a number of other applications such as cortical layer segmentation in 3-D MRI brain images, segmentation of defects in VLSI circuits in electron microscope images, analysis of bullet traces in metals, and others.
Relying on the high-accuracy segmentation and tracking results obtained by the fast geodesic contour approach, we present a framework for moving object classification based on the eigen-decomposition of the normalized binary silhouette sequence. We demonstrate the ability of the system to distinguish between various object classes by their static appearance and dynamic behavior.
Finally, we show how observed articulated object motion can be used as a cue for the segmentation and detection of independently moving parts. The method is based on the analysis of normal flow vectors computed in color space and also relies on a number of geometric heuristics for grouping edge segments.
List of Symbols and Abbreviations
Γ - A smooth trajectory space curve.
Σ - A smooth surface on which Γ lies.
⃗t - The tangent vector to the curve Γ.
⃗v - The second tangent vector to the surface Σ (orthogonal to ⃗t).
⃗s - The normal vector to the surface Σ.
κ - The curvature of the curve Γ.
τ - The torsion of the curve Γ.
κg - The geodesic curvature.
κn - The normal curvature.
τg - The geodesic twist.
ω⃗ - The rotational velocity vector.
T⃗ = (U, V, W) - The translational velocity vector.
R - The rotation matrix.
(X, Y, Z) - A scene point.
(Xc, Yc, Zc) - A reference point for weak perspective projection.
f - A focal length.
C(p) = {x(p), y(p)} - A parameterized 2-D curve.
g(C) - An edge indicator function.
N⃗ - The normal to the curve.
κN⃗ - The curvature vector.
ds = |Cp| dp - The Euclidean arclength.
ϕ(x, y) : IR^2 → IR - An implicit representation of C (the level set function).
MLD - Moving Light Display.
SFM - Structure from Motion.
AOS - Additive Operator Splitting.
PCA - Principal Component Analysis.
EKF - Extended Kalman Filter.
TPS - Trajectory Primal Sketch.
FOE - Focus of Expansion.
CSF - Cerebral Spinal Fluid.
MRI - Magnetic Resonance Imaging.
CT - Computed Tomography.
MHI - Motion History Image.
SVD - Singular Value Decomposition.
DTFS - Distance to Feature Space.
Chapter 1
Introduction
1.1 Motivation
Motion based recognition deals with the recognition of an object and its behavior based on its motion in a sequence of images. Generally, it is an approach that favors the direct use of temporal data to solve a number of well-known computer vision problems such as image segmentation, tracking, object classification and recognition.
While static scene analysis has received more attention, partly for historical reasons and partly due to the technical problems of processing large amounts of data, it turns out that some computer vision problems may be solved more easily by analyzing not a single image, but a sequence of images.
1.1.1 Psychological aspects
The importance of motion information can be demonstrated by looking at biological systems. Motion perception in animals and humans has been studied in psychology for a long time. The experiments with Johansson's moving light displays (MLD) [64], [65] show that the movement of several light spots attached to a moving human body, without any additional visual cues such as contours, color, texture or background, is sufficient for recognizing different types of activities (walking, running, dancing) as well as the gender of the subject. Moreover, using the MLDs one can recognize a person by his/her gait.
Michotte [80] showed that humans are able to describe an underlying physical model of object interaction (pushing, launching, triggering, entraining, etc.) even if the observed scene consists only of simple 2-D geometrical objects on a white background. Since the nature of the objects is unknown, the observer cannot use his experience and must base his judgments mainly on the temporal and spatial parameters of the scene. This can be regarded as evidence that causality can be perceived "directly".
Heider and Simmel in 1944 showed that, in addition to causality, people may also perceive intentionality in the actions of inanimate objects. Moreover, in addition to instantaneous intentions, people may even attribute traits to the observed actors. People can compose a whole "story" based on a film whose only "stars" are two triangles, a circle and a square.
1.1.2 Motion related problems
Motion based recognition has been a computer vision research topic for more than 30 years. It would be very difficult to describe here the whole spectrum of motion-related problems the computer vision community has dealt with. Even if we put aside the Structure from Motion (SFM) class of problems, i.e. the reconstruction of a 3D object structure from multiple views of the moving object, we are still left with a great variety of problems and approaches that use motion information in one way or another. Among these, two major groups can be identified.
1. The primary goal of the first group is the motion itself, i.e. the detection and recognition of different types of motion (such as activities, gestures, maneuvers, behavioral patterns, etc.). At a higher level this may include parsing a stream of primitive dynamic events into more complex concepts like scenarios, revealing a physical explanation of the observed motion, building a natural language description, or even reconstructing a set of rules governing the behavior of the agents in a scene (an underlying algorithm).
2. The second group consists of conventional object recognition and classification based primarily on motion analysis. These problems vary from recognizing a person by his or her gait to classifying an aircraft type by its dynamic characteristics.
Yet another group of problems, which usually have to be solved prior to the recognition and classification stage, are segmentation and tracking. The lack of a good, generic and reliable solution for this separate class of lower-level problems is an obstacle for researchers working on the higher-level stage. This probably explains the small number of works on high-level analysis in complex, real-world, outdoor environments, where segmentation and tracking are even more problematic. Since we are analyzing moving objects, segmentation in our context means the detection of independently moving bodies, while tracking means pursuing an object's image on the basis of its current dynamical behavior.
It should also be mentioned that the nature of the problems above, and thus the methods applied, can be significantly influenced by the choice of scene type (e.g. indoor/outdoor, rigid/non-rigid objects), imaging scheme (e.g. static/moving camera) and context (e.g. a football game with a known background, number of objects, rules, etc. - the so-called closed world assumption).
1.1.3 Motion modeling
Usually, for the purpose of classification and recognition, some kind of motion model is defined. Certain assumptions are made about the nature of object motion in order to build a model. In some cases those assumptions reflect the experimental conditions, whereas in others they are simplifications of a more complex observed motion. Among the motion model assumptions used by researchers one can find the planar (2D) motion assumption, the constant velocity assumption, the periodic motion assumption, the smooth trajectories assumption, etc. In some cases known physical (dynamical) models of object motion are used.
Figure 1.1: Extraction of motion information from a sequence of frames (from [20]).
Sometimes not only the motion, but the moving objects themselves are modeled. For non-rigid bodies (e.g. humans) a model of the internal object structure (both geometrical and mechanical) can be used extensively in motion analysis. Many geometrical models have been used to approximate a moving object: polygons, blobs, centroids, stick figures, wire frames, etc.
To fit a predefined motion model to an image sequence one needs to extract some motion based features. Motion information can be obtained by various means. Figure 1.1 shows the classification of methods for motion data extraction suggested in [20].
1.2 Main Results and Contributions
The main objective of this research is to find efficient methods for extracting and using motion information. We concentrated on three major problems.
• In the first we deal with the analysis of rigid body motion in a problem taken from the traffic domain.
– We show how a moving vehicle which is carrying a camera can estimate the relative
motions of nearby vehicles.
– A model for the motion of the observing vehicle is presented.
– A novel algorithm for video stabilization is presented that allows correcting the
image sequence so that transient motions resulting from bumps, etc. are removed.
– A model for the motions of nearby vehicles is presented and it is shown how to
detect their motions relative to the observing vehicle.
• The second part of this thesis deals with general non-rigid motion. Here we approached both the low-level problem of non-rigid target segmentation and tracking and the high-level problem of motion based classification of non-rigid objects.
– An active contour based algorithm for object segmentation in images was developed.
– The fast version of the geodesic active contour model was implemented using a novel numerical scheme based on the Weickert-Romeney-Viergever AOS scheme, the Adalsteinsson-Sethian level set narrow band and Sethian's fast marching method.
– The algorithm was adapted for segmentation and tracking of non-rigid objects in color movies.
– An algorithm for periodicity analysis of the extracted deformable contours of the moving objects was developed.
– A technique for image sequence normalization using the PCA approach is proposed.
– A framework for moving object classification based on the eigen-decomposition of the normalized binary silhouette sequence is presented, capable of distinguishing between various object classes by their static appearance and dynamic behavior.
• Finally, we approached the problem of segmenting an articulated body based on its motion.
– The method is based on the analysis of the normal flow vectors computed on a color image sequence.
– We suggest a simple model describing the motion of articulated parts and develop an algorithm for grouping edge segments based on their apparent motion.
– The technique is further developed to support more robust multiple frame analysis.
1.3 Background and Related work
In this section we give a brief survey of the background literature and related work relevant to the topics developed in this thesis. We discuss the following topics: motion based segmentation and tracking, recognition of motion activities and events, motion based object classification, and the generation of natural language descriptions of dynamic scenes.
1.3.1 Motion Segmentation and Tracking
Image motion is a very strong cue for image segmentation: it can be used to identify image areas corresponding to independently moving objects and to distinguish these from the background. Once a moving object is detected, the motion information can be used to keep track of its image. The two problems are strongly coupled. The features found for segmentation may be used in tracking, and knowledge of the object's position and velocity would certainly help in segmentation.
The problem may be further complicated by introducing non-rigid objects, egomotion,
non-smooth and non-translational motion, occlusion situations, etc.
Irani et al. [62] detect multiple moving objects in an image pair one by one. First, a
dominant motion is computed assuming that the motion of the objects can be approximated
by 2D parametric transformations in the image plane. Then the two images in the pair are
registered using the detected motion and the region corresponding to the motion is detected
by looking for the stationary region. The region is then excluded and the process is repeated
to find other objects and their motions.
It has been suggested to use the benefits of tracking moving objects to improve the segmentation. This is done by using temporal integration of images registered with respect
to the tracked motion, without assuming temporal motion constancy. The temporally integrated image Av(t) serves as a dynamic internal representation image of the tracked object
and is constructed as follows:
\[
\begin{cases}
Av(0) = I(0) \\
Av(t+1) = \omega\, I(t+1) + (1-\omega)\,\mathrm{register}\bigl(Av(t),\, I(t+1)\bigr)
\end{cases}
\]
where ω = 0.3 and register(P, Q) denotes the registration of images P and Q by warping P towards Q according to the motion of the tracked object computed between them. The temporally integrated image is then used for motion analysis and segmentation instead of the first image in each pair. The tracked object remains sharp while other objects blur out, which improves the accuracy of the segmentation and the motion computation.
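As a rough illustration, the update above can be sketched as follows, assuming grayscale frames given as numpy arrays and a registration step approximated here by a plain translational shift (the per-frame displacement of the tracked object is assumed to be estimated elsewhere; all names and parameters are illustrative, not taken from the cited work):

    import numpy as np
    from scipy.ndimage import shift

    def register(av, frame, displacement):
        # Warp the integrated image Av toward the new frame according to the
        # (externally estimated) motion of the tracked object; a pure
        # translation is used here only to keep the sketch short.
        return shift(av, displacement, order=1, mode='nearest')

    def temporal_integration(frames, displacements, omega=0.3):
        # Av(0) = I(0); Av(t+1) = omega*I(t+1) + (1-omega)*register(Av(t), I(t+1))
        av = frames[0].astype(float)
        for frame, disp in zip(frames[1:], displacements):
            av = omega * frame + (1.0 - omega) * register(av, frame, disp)
        return av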
Clarke and Zisserman [25] use epipolar geometry for the detection and tracking of independently moving rigid objects by a moving (translating) observer. The basic idea is that, given a correspondence between a relatively large number of feature points in two frames, the epipole and the translation vector of the camera motion relative to the background can be recovered. Of course, this is true only if most of the features belong to the background and not to the moving objects. Once the camera motion is known, the moving objects can be found by attempting to fit an epipole to the feature matches that are not consistent with the background epipole. The process may be repeated until too few point matches remain, or no object is found. As with the camera motion, the independent translation of each object is recovered and used for tracking it. Corners are used as the feature points, and the image plane extent of a moving object is defined by the smallest rectangle enclosing all the features that have been updated more than a fixed number of times.
Rosales and Sclaroff [96] address the more specific problem of tracking humans. Since a static camera is used, it is possible to acquire statistical measurements of the empty scene over a number of video frames. Color covariance estimates are obtained for each image pixel. Independently moving objects are then detected by thresholding a distance measure between the background model and the incoming image. This gives a binary map of interesting regions, and connected component analysis is used to break it into different moving objects.
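A minimal sketch of this detection step, assuming the per-pixel color mean and inverse covariance of the empty scene have already been estimated from a set of background frames (array shapes, function names and the threshold are illustrative, not taken from [96]):

    import numpy as np
    from scipy.ndimage import label

    def foreground_mask(image, bg_mean, bg_inv_cov, threshold=9.0):
        # image, bg_mean: H x W x 3; bg_inv_cov: H x W x 3 x 3 (per-pixel).
        d = image.astype(float) - bg_mean
        # Squared Mahalanobis distance of each pixel to the background model.
        dist2 = np.einsum('hwi,hwij,hwj->hw', d, bg_inv_cov, d)
        return dist2 > threshold

    def connected_moving_regions(mask):
        # Break the binary map of interesting regions into separate objects.
        labels, n = label(mask)
        return [np.argwhere(labels == k) for k in range(1, n + 1)]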
A special tracker object is assigned to every object being tracked. It contains information about the object's location, a binary support map, blob characteristics, bounding box, and other information that will be used for tracking with the Extended Kalman Filter (EKF).
To reduce the complexity of the tracking problem only two feature points are selected: two opposite corners of the object bounding box. Assuming that the objects undergo locally linear translational motion in 3D and utilizing a 3D central projection camera model, the relative 3D positions and velocities of the features are extracted. The parameterized feature point positions and velocities compose the EKF state vector.
The occlusion detection routine predicts the future locations of objects based on the current estimates of 3D velocity and position. If a collision of the objects is detected, then occlusion is probable. After the collision is confirmed by both the estimated bounding box parameters and the merging binary maps of the objects being tracked, the system uses the EKF prediction to update the object position and velocity. The end of occlusion can be found using the EKF estimates of position and velocity along with the computed area of the blobs.
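The cited work uses an EKF with a 3D central projection model; the following is a much simplified linear, constant-velocity sketch over the two bounding-box corners in the image plane, included only to show how the predicted state can bridge an occlusion (all noise parameters are illustrative):

    import numpy as np

    class CornerTracker:
        # Constant-velocity Kalman filter over two opposite bounding-box
        # corners; state = 4 corner coordinates plus their 4 velocities.
        def __init__(self, corners, dt=1.0, q=1e-2, r=1.0):
            self.x = np.concatenate([corners, np.zeros(4)])
            self.P = np.eye(8)
            self.F = np.eye(8)
            self.F[:4, 4:] = dt * np.eye(4)                     # positions += dt * velocities
            self.H = np.hstack([np.eye(4), np.zeros((4, 4))])   # only positions are observed
            self.Q = q * np.eye(8)
            self.R = r * np.eye(4)

        def predict(self):
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q
            return self.x[:4]            # predicted corners (used during occlusion)

        def update(self, measured_corners):
            y = measured_corners - self.H @ self.x
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ y
            self.P = (np.eye(8) - K @ self.H) @ self.P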
1.3.2 Classification and Recognition
A process of motion based recognition can be coarsely divided into two steps. The first step is to find an appropriate representation for the objects or motions to be recognized - a model. Such a model may be a low-level representation, like the trajectory of some point on a moving object as described by Gould and Shah [56], or, on the other hand, can be organized into a high-level representation, like the motion verbs in the works of Nagel et al. [50], [70] or the function of an object in the approach of Rivlin et al. [36].
Generally, there are two methods to extract motion information from a sequence of images in order to organize it into a model.
• motion correspondence, where a set of feature points (tokens) is detected in each image frame and then a correspondence between those points is established across frames. There are two intrinsic problems associated with this method: the lack of a good universal feature detection operator - either too few or too many tokens are detected - and the combinatorial nature of the correspondence problem - only one trajectory is to be selected from a large set of possible trajectories.
• optical flow, an alternative method for motion analysis that computes the displacement of each pixel between two consecutive frames. The problem here is that only the so-called normal flow can be computed due to the aperture problem, and the computation is also error prone because of boundary over-smoothing, multiple moving objects, etc. [2]
Methods combining both approaches have been proposed [105]; they utilize the local motion direction found by optical flow to facilitate the search for corresponding feature points in consecutive frames.
The second step is to match some unknown input against the model. Usually, the same information extraction method as in the previous step is used to obtain the motion information from the sample. Matching methods, depending on the type of representation used, vary from standard clustering techniques to neural networks and probabilistic approaches.
1.3.2.1 Detecting and Recognizing Motions
According to [89], different motions can be classified according to the spatial and temporal uniformity they exhibit. Three different types of motion patterns are defined:
• temporal textures - regular motion patterns of indeterminate spatial and temporal extent,
• activities - motion patterns which are temporally periodic but limited in spatial extent,
• motion events - isolated simple motions that do not exhibit any temporal or spatial repetition.
Examples of temporal textures include wind-blown trees or grass; examples of activities include walking, running, etc.; examples of motion events are isolated instances of opening a door, starting a car, throwing a ball, a change in direction, a stop, an acceleration, etc.
Temporal textures can be effectively treated with statistical techniques analogous to those used in gray-level texture discrimination [92]. On the other hand, activities and motion events are more discretely structured, and techniques similar to those used in static object recognition are useful in their classification.
Since temporal texture is usually used as a tool for the segmentation and/or recognition of objects possessing a certain motion pattern, and not for the recognition of different motions of a spatially defined object, we discuss it in the Motion Based Object Classification section 1.3.2.2. In what follows we describe in more detail the recognition of activities and survey motion events.
Recognizing Activities
Detecting Periodic Activities Periodicity is common in the natural world. It is
also a salient feature in human perception. Information regarding the nature of a periodic
phenomenon, such as its location, strength, and frequency, is important for understanding the environment. Techniques for periodicity detection and characterization can assist in
many applications requiring object and activity recognition and representation. Periodicity
often involves both space and time, as in cyclic motion.
Polana and Nelson [90] deal with the detection of repetitive motion. Their approach suggests a low-level, feature-based activity recognition mechanism.
An image sequence is considered as a spatio-temporal solid with two spatial dimensions and one time dimension. Repeated activity gives rise to periodic or semi-periodic gray level signals along smooth curves in the image solid - reference curves. If these curves can be identified and samples extracted along them over several cycles, then frequency domain techniques can be used to judge the degree of periodicity and thus detect periodic activities. In the case of stationary activities or locomotory activities the phase curves are straight lines at some angle that depends on the translational velocity.
Signals are extracted along these curves and decomposed using the Fourier transform. The normalized difference between the energy at the highest-amplitude frequency and its multiples and the energy at the remaining frequencies (halfway between them) gives the periodicity measure. That is,
\[
p_f = \frac{\sum_i F_{iw} - \sum_i F_{iw+w/2}}{\sum_i F_{iw} + \sum_i F_{iw+w/2}} \tag{1.1}
\]
where F is the energy spectrum of the signal f and w is the frequency corresponding to the highest amplitude in the energy spectrum. The measure is normalized with respect to the total energy at the frequencies of interest, so that it is one for a completely periodic signal and zero for a flat spectrum. In general, if a signal consists of frequencies other than one single fundamental and its multiples, its periodicity measure will be low. A non-maximum suppression technique is then used to vote for the frequencies on which most of the analyzed signals (collected from different reference curves) agree.
The maximal value of this combined signal is taken as the fundamental frequency, and the periodicity measure for the entire sequence is defined as the sum of the periodicity measures obtained from all the contributing signals.
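A sketch of the periodicity measure (1.1) for a single 1-D signal sampled along a reference curve over several cycles; the choice of the fundamental frequency and the index arithmetic are simplified for illustration:

    import numpy as np

    def periodicity_measure(signal):
        # Energy spectrum of the (mean-removed) signal.
        spectrum = np.abs(np.fft.rfft(signal - np.mean(signal))) ** 2
        w = int(np.argmax(spectrum[1:])) + 1                 # highest-amplitude frequency
        harmonics = np.arange(w, len(spectrum), w)           # i*w
        halfway = np.arange(w + w // 2, len(spectrum), w)    # i*w + w/2
        e_h = spectrum[harmonics].sum()
        e_b = spectrum[halfway].sum()
        # 1 for a purely periodic signal, close to 0 for a flat spectrum.
        return (e_h - e_b) / (e_h + e_b)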
Experiments conducted by the authors [90] show that periodic activities such as walking, exercising or swinging can be distinguished from non-periodic activities by simple thresholding of the periodicity measure.
In [91], optical flow based algorithms are used to transform an image sequence so that the object in question is stabilized at the center of the image frame. Then flow magnitudes in tessellated frame areas of periodic motion are used as feature vectors for motion classification. The problem is that flow based methods are very sensitive to noise. Thus, Liu and Picard [75] propose a robust method based on a new criterion for spatio-temporal periodicity, the ratio between the harmonic energy and the total energy of a 1-D signal along the temporal dimension. In their method multiple fundamentals can be extracted along a temporal line; the values of the fundamental frequencies are used in further processing to help distinguish the periodicity of various activities; regions of periodicity are actually segmented, and frame alignment is used for tracking the moving object.
Recognizing Periodic Activities In [92] the idea above is further developed to include the recognition of different types of periodic activities.
One cycle (defined by the fundamental frequency) of the spatio-temporal solid representing the activity is divided into X × Y × T cells by partitioning the two spatial dimensions into X and Y divisions and the temporal dimension into T divisions, respectively.
A local motion statistic is then selected and computed in each cell of the spatio-temporal grid. The feature vector is composed of XYT elements, each of which is the value of the statistic in the particular cell (see Figure 1.2).
The normalized spatio-temporal solid, while corrected for temporal scale (frequency), is not corrected for temporal translation (phase), so the pattern is tried at each possible phase and the best match is then selected.
Several statistics were tried: the positive and negative curl and divergence estimates, the non-uniformity of flow direction, the directional difference statistics in four directions, etc. The one that gave the best results is the sum of the flow magnitude divided by its standard deviation. The feature values are arranged into a vector and a nearest centroid classifier is used for classification.

Figure 1.2: Total motion magnitude feature vector for a sample of walk (top) and a sample of run (bottom). One cycle of activity is divided into six time divisions shown horizontally. Each frame shows the spatial distribution of motion in a 4x4 spatial grid; the size of each square is proportional to the amount of motion in the neighborhood (from [92]).
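A sketch of the feature computation and classification described above, assuming per-frame flow magnitudes over one frequency-normalized cycle; the grid sizes and the phase search are simplified, and only the sum/standard-deviation statistic is shown:

    import numpy as np

    def activity_features(flow_mag, X=4, Y=4, T=6):
        # flow_mag: (frames, height, width) flow magnitudes over one cycle.
        frames, h, w = flow_mag.shape
        feats = []
        for t in range(T):
            chunk = flow_mag[t * frames // T:(t + 1) * frames // T]
            for y in range(Y):
                for x in range(X):
                    cell = chunk[:, y * h // Y:(y + 1) * h // Y,
                                    x * w // X:(x + 1) * w // X]
                    feats.append(cell.sum() / (cell.std() + 1e-8))
        return np.asarray(feats)        # X*Y*T feature vector

    def classify(feature, centroids):
        # centroids: dict mapping activity name -> mean feature vector.
        return min(centroids, key=lambda name: np.linalg.norm(feature - centroids[name]))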
Three sequences were used to compute the cluster centers, i.e. to train the classifier, while a fourth one was used as the unknown. None of the single features is sufficient, alone, for correct classification. However, the curl and divergence estimates along with the directional difference statistics in four directions provide enough information to correctly classify all the motions used in the experiments. A principal component analysis of the features was also performed, which confirmed the relative importance of those two features.
Parametrized Modeling for Activity Recognition In [127] a framework for the modeling and recognition of temporal activities is proposed. A set of exemplar activities, each represented by a vector of measurements taken over some period of time, is modeled by parameterization in the form of principal components (PCA). An observed activity, which may have undergone some transformation due to temporal shift, speed variations or different distances between the camera and the object, is then projected onto the principal activity basis, giving a coefficient vector. A robust regression is used to recover the coefficients along with the parameters of the transformation. Once the coefficient vector is obtained, the normalized distance to the exemplar activity coefficients is used to recognize the observed activity (see Figure 1.3).
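A minimal sketch of the parameterized modeling step, assuming each activity example has already been resampled to a fixed-length measurement vector; the robust regression over the transformation parameters used in [127] is omitted:

    import numpy as np

    def build_activity_basis(exemplars, k=10):
        # exemplars: (n_examples, n_measurements) matrix of activity vectors.
        mean = exemplars.mean(axis=0)
        _, _, vt = np.linalg.svd(exemplars - mean, full_matrices=False)
        return mean, vt[:k]              # mean and first k principal components

    def recognize(observation, mean, basis, exemplar_coeffs, labels):
        # Project onto the principal activity basis and pick the nearest exemplar.
        c = basis @ (observation - mean)
        dists = np.linalg.norm(exemplar_coeffs - c, axis=1)
        return labels[int(np.argmin(dists))]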
The paper describes three experiments where the method was successfully applied:
• human walking direction recognition (parallel, diagonal, perpendicular away, perpendicular forward)
• recognition of four types of activities (walking, marching, line-walking and kicking
while walking)
• visual based speech recognition.
Figure 1.3: The parameterized modeling and recognition (from [127]).
The first two applications used an approach described in [66] for tracking human motion
using parameterized optical flow, assuming that initial segmentation of the body into parts
is given. Eight motion parameters were tracked for five body parts (arm, torso, thigh, calf and foot) (see Figure 1.4).
The last application used optical flow vectors directly as the measurements to build the principal basis.
Motion Events Motion events are defined as significant changes or discontinuities in
motion. They are usually detected by the presence of discontinuities, which can be found,
for instance, by computing the scale-space of a speed curve.
Gould and Shah's [56], [55] Trajectory Primal Sketch (TPS) is a representation of the significant changes in motion. Changes are identified at various scales by computing the scale-space of the velocity curves vx and vy extracted from a trajectory. This results in a set of TPS contours, each contour corresponding to a change in motion. The strength of the contour, i.e. the number of zero-crossings belonging to that contour, the shape or straightness of the contour, i.e. the sum of the distances between successively linked zero-crossings of the contour divided by (strength - 1), and the frame number at which the contour originated, reveal a lot of information about the event. This representation has been shown to distinguish basic motions like translation, rotation, projectile and cycloid. The results show that the first-derivative discontinuities in the rotation and cycloid trajectories have a sine/cosine relationship with respect to each other, so that they can be separated from the projectile and translation type trajectories.
Figure 1.4: Image sequence of "walking", five part tracking including self-occlusion and three sets of five signals (out of 40) recovered during the activity (torso, thigh, calf, foot and arm) (from [66]).

Engel and Rubin [42] define five types of motion events: smooth starts, smooth stops, pauses, impulse starts and impulse stops. These are considered as motion events that partition a global motion into parts that have some underlying psychological support. The method of detecting those boundaries uses a polar velocity representation (s, ψ), and the features used
for the detection of the perceptual boundaries are the first and second derivatives of the speed, s′ and s′′, and the second derivative of the angle, ψ′′. Force impulses imply a discontinuity, i.e. a zero-crossing in consecutive pairs of s′′ and ψ′′ estimates. If a zero crossing exists in either or both, and its slope exceeds some predefined threshold, then a force impulse is confirmed. Starts and stops are found when the speed is low but increasing or decreasing sufficiently, again determined using thresholds. The simultaneous presence of a start or stop and a force impulse indicates an impulse start or stop, while a pause occurs when a start (smooth or impulse) follows a stop. The authors studied cycloidal motion at two different speeds, fast and slow. At each cusp, a boundary was detected; the slow motion produced a ψ′′ zero crossing along with a pause, while the fast motion asserted force impulses for both s′′ and ψ′′, but no start, stop or pause. Both boundaries were found to agree with human perception.
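A crude sketch of the boundary detection described above, working on a sampled trajectory in the polar velocity representation (s, ψ); all thresholds are illustrative, the smoothing and scale selection of the original work are omitted, and pauses (a start following a stop) are not labeled:

    import numpy as np

    def motion_event_boundaries(vx, vy, slope_thresh=1.0, low_speed=0.5):
        s = np.hypot(vx, vy)                  # speed
        psi = np.unwrap(np.arctan2(vy, vx))   # direction
        s2 = np.gradient(np.gradient(s))      # s''
        psi2 = np.gradient(np.gradient(psi))  # psi''
        events = []
        for t in range(1, len(s) - 1):
            impulse = any(a[t - 1] * a[t + 1] < 0 and
                          abs(a[t + 1] - a[t - 1]) / 2.0 > slope_thresh
                          for a in (s2, psi2))
            rising = s[t] < low_speed and s[t + 1] - s[t - 1] > 0
            falling = s[t] < low_speed and s[t + 1] - s[t - 1] < 0
            if impulse and rising:
                events.append((t, 'impulse start'))
            elif impulse and falling:
                events.append((t, 'impulse stop'))
            elif impulse:
                events.append((t, 'force impulse'))
            elif rising:
                events.append((t, 'smooth start'))
            elif falling:
                events.append((t, 'smooth stop'))
        return events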
Motion events have been demonstrated for simpler motions such as projectile and cycloid motions, as well as for complex motions like human movements. They are thus widely applicable, and can be used alone or in conjunction with other types of features.
Siskind et al. in a number of works [109], [111], [110] build a framework for visual event classification based not on the "object recognition first" principle, but on a direct retrieval of motion event information from the visual data stream. In [110] a maximum likelihood scheme is utilized to recognize simple spatial-motion events, such as those described by the verbs pick up, put down, push, pull, drop and throw.
Event recognition is divided into two independent subtasks: a lower-level task responsible for segmentation and tracking, and an upper-level task responsible for event recognition.
The lower-level task performs 2D object tracking, taking image sequences as input and producing as output a stream of 2D position, orientation, shape and size values for each participating object in the observed event. The tracker operates on a frame-by-frame basis,
tracking colored objects and moving objects independently. Colored object segmentation is
performed by clustering strongly colored pixels by hue values. Moving pixels are detected as
pixels with large interframe gray value difference. Region growing is then applied to build
a set of contiguous object regions. Each region is then abstracted as an ellipse and further
analysis is based on the ellipse parameters only. Inter-frame correspondence analysis is then employed to solve the problem of ellipse drop-outs and spurious ellipses, yielding a set of ellipses that track all of the participant objects in the event through the entire movie.
The upper-level task takes as input the 2D pose stream produced by the lower-level processing, without any image information, and classifies such a pose stream as an instance of a given event type. A maximum-likelihood approach is used to perform the upper-level task. In this approach a supervised learning strategy is used to train a model for each event class from a set of example events of that class. The feature vectors contain both absolute and relative ellipse positions, orientations, velocities and accelerations. Hidden Markov Models (HMM) are adopted by the authors as the generative model within the maximum-likelihood framework. Novel event observations are then classified as belonging to the class whose model best fits the new observation.
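A sketch of the maximum-likelihood classification step, assuming the pose streams have already been converted to fixed-dimension per-frame feature vectors; the hmmlearn package is used here purely as an illustrative stand-in for the generative models, and is not the implementation used in the cited works:

    import numpy as np
    from hmmlearn.hmm import GaussianHMM   # assumed third-party dependency

    def train_event_models(examples_by_class, n_states=5):
        # examples_by_class: dict label -> list of (T_i x D) feature sequences.
        models = {}
        for label, seqs in examples_by_class.items():
            X = np.vstack(seqs)
            lengths = [len(s) for s in seqs]
            models[label] = GaussianHMM(n_components=n_states).fit(X, lengths)
        return models

    def classify_event(models, sequence):
        # Maximum likelihood: the class whose HMM best fits the observation.
        return max(models, key=lambda label: models[label].score(sequence))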
The system developed is reported to perform well in a relatively uncluttered desk-top
environment.
1.3.2.2 Motion Based Object Classification
Temporal Texture Analysis It is shown in [89] how a number of statistical spatial and temporal features derived from the normal optical flow (the component of the motion field parallel to the spatial gradient direction) can be used to classify regional activities characterized by complex, non-rigid motion.
The features suggested in the paper are:
1. Nonuniformity of the normal flow direction, which is the absolute deviation of the normal flow direction histogram from a uniform distribution. To reduce the dependence of the normal flow direction on the underlying intensity texture, the flow directions are normalized by a histogram of gradient directions.
2. Mean flow magnitude divided by its standard deviation, which is a measure of "peakiness" in the velocity distribution.
3. Directional difference statistics in four directions. The statistic is based on the number of pixel pairs at a given offset which differ in their value by a given amount, and gives the ratio between the number of pixels differing by at most one and the number of pixels differing by more than one, in four directions. The offset is proportional to the average flow magnitude to yield scaling invariance. The feature represents a measure of the spatial homogeneity of the field.
4. Positive and negative curl and divergence estimates. Mean values over a region of interest are used.
The first three features are invariant under translation, rotation and temporal and spatial
scaling, while the last one is invariant with respect to translation and rotation only.
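A sketch of the first two statistics, computed from the normal flow components over a region; the exact histogram normalization in [89] may differ, the directional-difference and curl/divergence features are omitted, and the bin count is illustrative:

    import numpy as np

    def temporal_texture_features(nf_u, nf_v, grad_dir, n_bins=16):
        mag = np.hypot(nf_u, nf_v)
        flow_dir = np.arctan2(nf_v, nf_u)
        flow_hist, _ = np.histogram(flow_dir, bins=n_bins, range=(-np.pi, np.pi))
        grad_hist, _ = np.histogram(grad_dir, bins=n_bins, range=(-np.pi, np.pi))
        # Normalize the flow direction histogram by the gradient direction
        # histogram to reduce the dependence on the underlying intensity texture.
        normalized = flow_hist / np.maximum(grad_hist, 1)
        normalized = normalized / max(normalized.sum(), 1e-8)
        nonuniformity = np.abs(normalized - 1.0 / n_bins).sum()
        peakiness = mag.mean() / (mag.std() + 1e-8)
        return nonuniformity, peakiness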
Experiments show that using these features and a nearest centroid classifier it is possible to distinguish between naturally occurring motions such as:
• fluttering crepe paper bands
• cloth waving in the wind
• motion of tree in the wind
• flow of water in a river
• turbulent motion of water
• uniformly expanding image produced by forward observer motion
• uniformly rotating image produced by observer roll
Principal component analysis of the features indicates that the second order features (the
last two) are most useful in classification.
Dynamics Based Object Classification Shavit and Jepson [108] propose a qualitative representation of moving body dynamics (the pose function) for motion-based categorization. The approach is based on the phase portrait of a system, that is, a collection of phase curves in 2n-dimensional space, where n is the number of independent variables.
If motion measurements (of position and velocity) are given for each of the n dimensions of the system (an n-dimensional flow field), the construction of the phase portrait is performed by interpolation of the measurements.
The authors suggest looking at the pose function as composed of two independent parts: a global component and a local component.
The global component of the pose function is a viewer-centered model of the external dynamics (translation of the blob en masse), whereas the local component of the function is an expression of the internal dynamics (deformation of the blob).
In [107] the authors further analyze the use of dynamic systems analysis for motion understanding. For idealized autonomous conservative systems the external dynamics is captured by the Hamiltonian function: the total energy of the system is constant and equal to the sum of the potential and kinetic energies. In this case both the potential and kinetic energies are not time dependent, and the phase curve corresponding to the global component is symmetric.
If the system is not autonomous due to the action of non-conservative forces, then the global component phase curve will not be symmetric, which indicates that the idealized model is not appropriate for the system. In such a case an additional time-dependent component is introduced into the Hamilton equation to compensate for the total energy changes.
In order to build the local component of the pose function the authors consider a local fixed reference frame erected at the instantaneous position of the blob's centroid. The internal body dynamics is expressed by the cluster geometry (the relative positions of the cluster points with respect to the centroid) and the relative deformation velocity.
The principal axes method and the velocity gradient tensor are used to compute the rate of strain and the angular velocity of the principal axes rotation.
It is claimed that the obtained phase curves are sufficient for motion based classification. As a proof of the method's feasibility the authors present an experiment where the pose function is computed for a hopping rabbit and a hopping kangaroo taken from photostats of real animals in locomotion (see Figure 1.5).
Figure 1.5: (a) A graphic depiction of the pose function for the hopping of a kangaroo; (b) a graphic depiction of the pose function for the hopping of a rabbit (from [107]).
The external appearance of the graphs is sufficient for distinguishing between the two
animals.
1.3.3 From Motion towards Natural Language Description
Nagel et al. [70] describe a system for the automatic characterization of vehicle trajectories, extracted from an image sequence, by natural language motion verbs.
Their approach can be decomposed into the following component processes:
1. Feature extraction: extraction of suitable features from each frame of an image sequence. Centroids of the blobs generated by the monotonicity operator are used. The
monotonicity operator finds connected regions in the image, which could be classified
as local minima or maxima of the gray-value function.
2. Feature matching: interframe matching of the resulting feature sets, thereby deriving
displacement vector estimates at selected image plane locations. A multiresolution
approach is used to detect objects moving with different apparent velocities. Computed
displacements of blobs at a coarse level are used to calculate a so-called displacement
compensation, which is then used to advance the displacement of the search area in
the next finer level.
3. Object detection: clustering of these displacement vectors, facilitating the isolation of
displacement vector clusters which can be hypothesized to represent candidates for the
image of a moving object in the scene. The displacements of blobs are tracked along
a certain number of frames and mapped to a vector chain. The footpoint of the first
vector and the total displacement represented by the vector chain define a so-called
image distance vector. The image distance vector forms a cluster seed provided it
satisfies the following consistency constrains:
• stability: the involved blob can be extracted in a certain number of frames
• minimum displacement: the total displacement of the image distance vector has
to exceed a certain minimal displacement threshold.
• coherence: the variance coefficient of the displacements of the vectors in a vector
chain is used to suppress statistical fluctuations of stationary features and vector
chains arising from wrong matches.
Then the image distance vectors are clustered based on the following geometrical relations: is neighbour($\vec{u},\vec{v}$), is parallel($\vec{u},\vec{v}$) and is samelength($\vec{u},\vec{v}$). The vectors of a
cluster form an object candidate and are marked for tracking in the subsequent frames.
4. Tracking: tracking of such object candidates.
5. Backprojection: backprojection of the resulting candidate representation and position
onto the street plane, using an interactively obtained transformation from the image
plane to a scene map.
6. Motion verbs: association of trajectory segments in the street plane with an internal
representation of motion verbs describing the motion of cars on roads. A set of 119
German verbs is divided into four categories representing the verbs describing the
action of the vehicle itself, verbs referring to the road, verbs making additional reference
to other vehicles and verbs referring to other locations. A set of 13 attributes that
are extracted from an image sequence and describe the scene in terms of trajectory
properties, vehicle position relative to other objects or to the road, vehicle orientation,
velocity, acceleration, etc. was defined. For each verb there exists a set of three
predicates acting on the attributes above:
Figure 1.6: Single frame of a real-world traffic scene recorded at an intersection. Vehicle
trajectories obtained by automatic detection and model-based tracking are shown together
with the vehicle models used (from [70]).
• a precondition - true if the attributes are within some predefined range at the
beginning of an interval of validity for a verb
• a monotonicity condition - true if the attribute values remain acceptable for a
verb
• a postcondition - true if the attribute values are within some range indicating the
end of an interval of validity for a verb,
where the interval of validity is the time period for which a particular interpretation is
true. The attributes are computed at each time instant and a finite automaton based
on predicate values is used to determine intervals of validity.
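As a rough illustration of how such predicates could delimit intervals of validity, consider the following minimal sketch. It is not Nagel et al.'s implementation; the attribute names ("speed", "acceleration") and the thresholds are purely hypothetical.

```python
# A minimal sketch (not the implementation of [70]) of precondition, monotonicity
# and postcondition predicates over per-frame attributes delimiting an
# "interval of validity" for a motion verb.

def intervals_of_validity(attributes, pre, mono, post):
    """attributes: list of per-frame attribute dicts (one per time instant).
    pre/mono/post: predicates over a single attribute dict.
    Returns (start, end) frame intervals for which the verb interpretation holds."""
    intervals, start = [], None
    for t, a in enumerate(attributes):
        if start is None:
            if pre(a):                      # verb interpretation may start here
                start = t
        else:
            if post(a):                     # interval of validity ends here
                intervals.append((start, t))
                start = None
            elif not mono(a):               # attributes left the admissible range
                start = None
    return intervals

# Example: a crude "accelerate" verb on synthetic attributes (illustrative only).
frames = [{"speed": 5 + 0.5 * t, "acceleration": 0.5 if t < 20 else 0.0}
          for t in range(30)]
accel = intervals_of_validity(
    frames,
    pre=lambda a: a["acceleration"] > 0.1,
    mono=lambda a: a["acceleration"] > 0.05,
    post=lambda a: a["acceleration"] <= 0.05)
print(accel)   # -> [(0, 20)]
```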
The approach has been applied to various traffic scenes and has successfully described maneuvers
such as passing a barrier, turning, and parking by natural language verbs.
Over the last several years considerable success in traffic scene analysis has been reported
by a large group of researchers at the Institut for Algorithmen und Kognitive Systeme [58]
[59] [60] [61] [69] [71]. To provide accurate input to the higher-level processing, a model-based
approach has been widely used for moving object detection and tracking and for the
scene modeling (see Figure 1.6).
Chapter 2
Rigid Motion
2.1 Introduction
For rigid objects it is easier to obtain 3D translational and sometimes even rotational velocity
components. Classification can then be performed directly on the measured velocities and
their derivatives, or by fitting them to a constrained motion model that depends on physical
parameters of a moving object. In the latter case the classification is performed based on
the extracted model parameters. If a set of models is given for various object classes, classification
can be done by choosing the model that best fits the object's motion characteristics.
In this chapter we show how the constraints induced by a domain help us to extract
motion parameters. We find relative velocity of moving vehicles observed from a moving
platform. It is assumed that the observed vehicle is a rigid body which is moving smoothly
(due to inertia) on a plane, and its rotation is not significant in a small number of frames,
so that only the translational velocity component can be measured accurately enough.
2.1.1 Interpreting Relative Vehicle Motion in Traffic Scenes
Autonomous operation of a vehicle on a road calls for understanding of various events involving the motions of the vehicles in its vicinity. In normal traffic flow, most of the vehicles
on a road move in the same direction without major changes in their distances and relative
speeds. When a nearby vehicle deviates from this norm (e.g. when it passes or changes
lanes), or when it is on a collision course, some action may need to be taken. In this paper we show how a vehicle carrying a camera can estimate the relative motions of nearby
vehicles.
Understanding the relative motions of vehicles requires modeling both the motion of
the observing vehicle and the motions of the other vehicles. We represent the motions of
vehicles using a Darboux motion model that corresponds to the motion of an object moving
along a smooth curve that lies on a smooth surface. We show that deviations from Darboux
motion correspond primarily to small, rapid rotations around the axes of the vehicle. These
rotations arise from the vehicle’s suspension elements in response to unevenness of the road.
We estimate bounds on both the smooth rotations due to Darboux motion (from highway
design principles) and the non-smooth rotations due to the suspension. We show that both
types of rotational motion, as well as the non-smooth translational component of the motion
(bounce), are small relative to the smooth (Darboux) translational motion of the vehicle.
This analysis is used to model the motions of both the observer and observed vehicles.
We use the analysis to show that only the rotational velocity components of the observer
vehicle are important. On the other hand, the rotational velocity components of an observed
vehicle are negligible compared to its translational velocity. As a consequence we need to
estimate the rotational velocity components only for the observing vehicle. This is the case
even when an observed vehicle is changing its direction of motion relative to the observing
vehicle (turning or changing lanes); the turn shows up as a gradual change in the direction
of the relative translational velocity.
An important consequence of the Darboux motion model is that for a fixed forward-looking camera mounted on the observer vehicle the direction of translation (and therefore
the position of the focus of expansion (FOE)) remains the same in the images obtained by the
camera. We use this fact to estimate the observing vehicle’s rotational velocity components;
this is done by finding the rotational flow which, when subtracted from the observed flow,
leaves a radial flow pattern (radiating from the FOE) of minimal magnitude.
We describe the motion field using full perspective projection, estimate its rotational
components, and derotate the field. The flow fields of nearby vehicles are then, under the
Darboux motion model, pure translational fields. We analyze the motions of the other
vehicles under weak perspective projection, and derive their motion parameters. We present
results for several road image sequences obtained from cameras carried by moving vehicles.
The results demonstrate the effectiveness of our approach.
In the next section we present motion models for road vehicles, discuss ideal and real
vehicle motion, and analyze the relative sizes of the smooth and non-smooth velocity components. In Section 2.3 we discuss the image motion, describe a way to estimate the
necessary derotation, and describe methods of estimating nearby vehicle motions from the
normal flow field. Section 2.4 presents experimental results for several sequences taken at
different locations. In Section 2.5 we review some prior work on various topics that are
related to this paper, involving road detection, vehicle detection, and motion analysis, and
compare them with the approach presented in this paper.
2.2 Motion Models of Highway Vehicles
The ideal motion of a ground vehicle does not have six degrees of freedom. If the motion is
(approximately) smooth it can be described as motion along a smooth trajectory Γ lying on
a smooth surface Σ. Moreover, we shall assume that the axes of the vehicle (the fore/aft,
crosswise, and up/down axes) are respectively parallel to the axes of the Darboux frame
defined by Γ and Σ. These axes are defined by the tangent ⃗t to Γ (and Σ), the second
tangent ⃗v to Σ (orthogonal to ⃗t), and the normal ⃗s to Σ (see Figure 2.1). Our assumption
about the axes is reasonable for the ordinary motions of standard types of ground vehicles;
in particular, we are assuming that the first two vehicle axes are parallel to the surface and
that the vehicle’s motion is parallel to its fore/aft axis (the vehicle is not skidding).
Consider a point O moving along a (space) curve Γ. There is a natural coordinate
Figure 2.1: The Darboux frame moves along the path Γ which lies on the surface Σ.
system Otnb associated with Γ, defined by the tangent ⃗t, normal ⃗n, and binormal ⃗b of Γ.
The triple (⃗t,⃗n, ⃗b) is called the moving trihedron or Frenet-Serret coordinate frame. We have
the Frenet-Serret formulas [72]
$$\vec{t}\,' = \kappa\vec{n}, \qquad \vec{n}\,' = -\kappa\vec{t} + \tau\vec{b}, \qquad \vec{b}\,' = -\tau\vec{n} \qquad (2.1)$$
where $\kappa$ is the curvature and $\tau$ the torsion of $\Gamma$.
When the curve Γ lies on a smooth surface Σ, it is more appropriate to use the Darboux
frame (⃗t, ⃗v, ⃗s) [72]. We take the first unit vector of the frame to be the tangent ⃗t of Γ and
the surface normal ⃗s to be the third frame vector; finally we obtain the second frame vector
as ⃗v = ⃗s × ⃗t (see Figure 2.1). Note that ⃗t and ⃗v lie in the tangent plane of Σ. Since the
vector ⃗t belongs to both the Otnb and Otvs frames, they differ only by a rotation around ⃗t,
say through an angle $\psi \equiv \psi(s)$. We thus have
$$\begin{pmatrix} \vec{v} \\ \vec{s} \end{pmatrix} = \begin{pmatrix} \cos\psi & \sin\psi \\ -\sin\psi & \cos\psi \end{pmatrix}\begin{pmatrix} \vec{n} \\ \vec{b} \end{pmatrix}. \qquad (2.2)$$
The derivatives of $\vec{t}$, $\vec{v}$, $\vec{s}$ with respect to arc length along $\Gamma$ can be found from (2.1) and (2.2):
$$\vec{t}\,' = \kappa_g\vec{v} - \kappa_n\vec{s}, \qquad \vec{v}\,' = -\kappa_g\vec{t} + \tau_g\vec{s}, \qquad \vec{s}\,' = \kappa_n\vec{t} - \tau_g\vec{v} \qquad (2.3)$$
where
$$\kappa_g \equiv \kappa\cos\psi, \qquad \kappa_n \equiv \kappa\sin\psi, \qquad \tau_g \equiv \tau + \frac{d\psi}{ds};$$
$\kappa_g$ is called the geodesic curvature, $\kappa_n$ is called the normal curvature, and $\tau_g$ is called the (geodesic) twist.
It is well known that the instantaneous motion of a moving frame is determined by its rotational velocity $\vec{\omega}$ and the translational velocity $\vec{T}$ of the reference point of the frame.
The translational velocity $\vec{T}$ of $O$ is just $\vec{t}$ and the rotational velocity of the $Otvs$ frame is given by the vector
$$\vec{\omega}_d = \tau_g\vec{t} + \kappa_n\vec{v} + \kappa_g\vec{s}.$$
Hence the derivative of any vector in the $Otvs$ frame is given by the vector product of $\vec{\omega}_d$ and that vector. It can be seen that the rate of rotation around $\vec{t}$ is just $\tau_g$, the rate of rotation around $\vec{v}$ is just $\kappa_n$, and the rate of rotation around $\vec{s}$ is just $\kappa_g$.
If, instead of using the arc length $s$ as a parameter, the time $t$ is used, the rotational velocity $\vec{\omega}_d$ and the translational velocity $\vec{T}$ are scaled by the speed $v(t) = ds/dt$ of $O$ along $\Gamma$.
2.2.1 Real Vehicle Motion
We will use two coordinate frames to describe vehicle motion. The “real” vehicle frame
Cξηζ (which moves non-smoothly, in general) is defined by its origin C, which is the center
of mass of the vehicle, and its axes: Cξ (fore/aft), Cη (crosswise), and Cζ (up/down); and
the ideal vehicle frame Otvs (the Darboux frame), which corresponds to the smooth motion
of the vehicle.
The motion of the vehicle can be decomposed into the motion of the $Otvs$ frame and the motion of the $C\xi\eta\zeta$ frame relative to the $Otvs$ frame. As we have just seen, the rotational velocity of the $Otvs$ (Darboux) frame is $v\vec{\omega}_d = v(\tau_g\vec{t} + \kappa_n\vec{v} + \kappa_g\vec{s})$ and its translational velocity is $v\vec{t}$. We denote the rotational velocity of the $C\xi\eta\zeta$ (vehicle) frame by $\vec{\omega}_v$ and its translational velocity by $\vec{T}_v$.
The position of the $C\xi\eta\zeta$ frame relative to the $Otvs$ frame is given by the displacement vector $\vec{d}_{v/d}$ between $C$ and $O$, and the relative orientation of the frames is given by an orthogonal rotational matrix (matrix of direction cosines) which we denote by $R_{v/d}$. The translational velocity of the vehicle (the velocity of $C$) is the sum of three terms: (i) the translational velocity of the Darboux frame $v\vec{t}$, (ii) the translational velocity $\vec{T}_{v/d} \equiv \dot{\vec{d}}_{v/d}$, and (iii) the displacement $v\vec{\omega}_d \times \vec{d}_{v/d}$ due to rotation of $C$ in the $Otvs$ frame. The translational velocity of the vehicle expressed in the $Otvs$ frame is thus $v\vec{\omega}_d \times \vec{d}_{v/d} + v\vec{t} + \dot{\vec{d}}_{v/d}$; its translational velocity in the $C\xi\eta\zeta$ frame is
$$\vec{T}_v = R_{v/d}^T\left(v\vec{\omega}_d \times \vec{d}_{v/d} + v\vec{t} + \dot{\vec{d}}_{v/d}\right). \qquad (2.4)$$
Similarly, the rotational velocity of $C\xi\eta\zeta$ is the sum of two terms: (i) the rotational velocity $vR_{v/d}^T\vec{\omega}_d$ of the $Otvs$ frame, and (ii) the rotational velocity $\vec{\omega}_{v/d}$, which corresponds to the skew matrix $\Omega_{v/d} = R_{v/d}^T\dot{R}_{v/d}$. The rotational velocity of the $C\xi\eta\zeta$ frame expressed in the $Otvs$ frame is thus $v\vec{\omega}_d + R_{v/d}\vec{\omega}_{v/d}$; the corresponding expression in the $C\xi\eta\zeta$ frame is
$$\vec{\omega}_v = vR_{v/d}^T\vec{\omega}_d + \vec{\omega}_{v/d}. \qquad (2.5)$$
Rotations around the fore/aft, sideways, and up/down axes of a vehicle are called roll,
pitch, and yaw, respectively. In terms of our choice of the real vehicle coordinate system,
these are rotations around the ξ, η, and ζ axes.
Figure 2.2: A possible motion of the base of a vehicle as the vehicle hits a bump.
2.2.2 Departures of Vehicle Motion from Smoothness
The motion of a ground vehicle depends on many factors: the type of intended motion;
the speed of the vehicle; the skill of the driver; the size, height and weight of the vehicle; the
type and size of the tires (or tractor treads), and the nature of the suspension mechanism, if
any; and the nature of the surface on which the vehicle is being driven. These factors tend to
remain constant; they undergo abrupt changes only occasionally, e.g. if a tire blows out, or
the vehicle suddenly brakes or swerves to avoid an obstacle, or the type of surface changes.
Such events may produce impulsive changes in the vehicle’s motion, but the effects of these
changes will rapidly be damped out. In addition to these occasional events, “steady-state”
non-smoothness of a ground vehicle’s motion may result from roughness of the surface.
A ground vehicle drives over roads that have varying degrees of roughness [4]. The
roughness consists primarily of small irregularities in the road surface. In discussing the
effects of the roughness of the road on the motion of a ground vehicle we will assume, for
simplicity, an ordinary, well balanced four-wheeled vehicle moving on a planar surface that
is smooth except for occasional small bumps (protrusions). The bumps are assumed to be
“small” relative to the size of the wheels, so that the effect of a wheel passing over a bump
is impulsive. (We could also allow the surface to have small depressions, but a large wheel
cannot deeply penetrate a small depression, so the depressions have much smaller effects
than the bumps.)
As the vehicle moves over the road surface, each wheel hits bumps repeatedly. We assume
that the vehicle has a suspension mechanism which integrates and damps the impulsive
effects of the bumps. Each suspension element is modeled by a spring with damping; its
characteristic function is a sine function multiplied by an exponential damping function
(see [84, 126]). We assume that the suspension elements associated with the four wheels are
independent of each other and are parallel to the vertical axis of the vehicle.
The vehicle may hit new bumps while the effects of the previous bumps are still being
felt. Each hit forces the suspension and adds to the accumulated energy in the spring; thus
it can happen that the suspension is constantly oscillating, which has the effect of moving
the corners of the vehicle up and down. The period of oscillation is typically on the order
of 0.5 sec (see [84, 126]). In general, it takes several periods to damp out the spring; for
example, the damping ratio provided by shock absorbers of passenger cars is in the range
0.2 − 0.4. The maximum velocity of the oscillation is typically on the order of 0.1 m/sec.
Consider the coordinate system Cxyz with origin at the center of mass C of the vehicle
(see Figure 2.2). Let $v_i$ be the velocity of corner $C_i$ of the vehicle, and let the length and width of the vehicle be $L$ and $W$. From the $v_i$'s we can compute the angular velocity matrix
$$\tilde{\Omega} \equiv \begin{pmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 & (v_3 - v_4)/L \\ 0 & 0 & (v_2 - v_3)/W \\ (v_4 - v_3)/L & (v_3 - v_2)/W & 0 \end{pmatrix}. \qquad (2.6)$$
Note that any of the $v_i$'s can be positive or negative. Multiplication by $\tilde{\Omega}$ can be replaced by the vector product with the angular velocity vector $\vec{\omega} = \vec{\imath}\,(v_3 - v_2)/W + \vec{\jmath}\,(v_3 - v_4)/L$, where the rate of rotation around the $x$ axis (the roll velocity) is $\omega_x = (v_3 - v_2)/W$ and the rate of rotation around the $y$ axis (the pitch velocity) is $\omega_y = (v_3 - v_4)/L$. As noted above, we typically have $|v_i| < 0.1$ m/sec. If we assume that $W > 1$ m and $L > 2$ m we have $|\omega_x| < 0.2$ rad/sec and $|\omega_y| < 0.1$ rad/sec. The yaw velocity component is $|\omega_z| = O(v_i/\sqrt{W^2 + L^2})$, which is $|\omega_z| = O(0.01)$ rad/sec. (For a complete derivation see [39].)
The translational velocity vector of the center C of the vehicle is obtained by using the
velocities v1 − v3 , v2 − v3 , 0, and v4 − v3 for the corners and adding v3 to the velocity in the
direction of the z-axis. We thus have
$$\vec{T} = \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix} = \begin{pmatrix} -(v_4 - v_3)H/L \\ (v_2 - v_3)H/W \\ (v_2 + v_4)/2 \end{pmatrix}. \qquad (2.7)$$
If we assume that H < 0.5 m (< W/2) we have |tx | < 0.05 m/sec, |ty | < 0.1 m/sec, and
|tz | < 0.1 m/sec.
We can draw several conclusions from this discussion: (i) The effects of small bumps are
of short duration, i.e. they can be considered to be impulsive. The suspension elements
integrate and damp these effects, resulting in a set of out-of-phase oscillatory motions. (ii)
The yaw component of rotation due to the effect of a bump is very small compared to the
roll and pitch components. (iii) The translational effects of a bump are proportional to the
velocities (or displacements) of the suspension elements and the dimensions of the vehicle
and are quite small.
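To make these orders of magnitude concrete, here is a small numerical sketch of the worst-case bounds implied by (2.6) and (2.7). The vehicle dimensions and the maximal suspension velocity are assumed, typical values, not quantities taken from the experiments.

```python
import numpy as np

# Worst-case bounds from (2.6)-(2.7) under assumed, illustrative values:
# |v_i| <= 0.1 m/s suspension velocities, width W = 1.5 m, length L = 2.5 m,
# height of the mass center H = 0.5 m.
v_max = 0.1                      # m/s
W, L, H = 1.5, 2.5, 0.5          # m

roll_max  = 2 * v_max / W        # |omega_x| = |v3 - v2| / W
pitch_max = 2 * v_max / L        # |omega_y| = |v3 - v4| / L
yaw_order = v_max / np.hypot(W, L)   # |omega_z| = O(v_i / sqrt(W^2 + L^2))

tx_max = 2 * v_max * H / L       # |t_x| = |v4 - v3| H / L
ty_max = 2 * v_max * H / W       # |t_y| = |v2 - v3| H / W
tz_max = v_max                   # |t_z| = |v2 + v4| / 2

print(f"roll < {roll_max:.2f}, pitch < {pitch_max:.2f}, yaw ~ {yaw_order:.2f} rad/s")
print(f"|t| < ({tx_max:.2f}, {ty_max:.2f}, {tz_max:.2f}) m/s")
# Roll and pitch rates on the order of 0.1 rad/s, yaw on the order of 0.01-0.03 rad/s,
# and translational effects of at most about 0.1 m/s, consistent with the text.
```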
2.2.3 The Sizes of the Smooth and Non-smooth Velocity Components
We now compare the sizes of the velocity components which are due to the ideal motion
of the vehicle — i.e., the velocity components of the Darboux frame (Section 2.2) — to the
sizes of the velocity components which are due to departures of the vehicle frame from the
Darboux frame (Section 2.2.2).
The translational velocity of the Darboux frame is just v⃗t; thus the magnitude of the
translational velocity is just v. If v = 10 m/sec (= 36 km/hr≈ 22 mi/hr) this velocity is
much larger than the velocities which are due to departures of the vehicle from the Darboux
frame, which, as we have just seen, are on the order of 0.1m/sec or less.
The rotational velocity of the Darboux frame is $v\vec{\omega}_d = v(\tau_g\vec{t} + \kappa_n\vec{v} + \kappa_g\vec{s})$; thus the magnitude of the rotational velocity is $v\sqrt{\tau_g^2 + \kappa_n^2 + \kappa_g^2}$. In this section we will estimate bounds on $\tau_g$, $\kappa_n$, and $\kappa_g$. Our analysis is based on the analyses in [4, 41, 126] and on the highway design recommendations published by the American Association of State Highway Officials [83].
Good highway design allows a driver to make turns at constant angular velocities, and
to follow spiral arcs in transitioning in and out of turns, in order to reduce undesirable
acceleration effects on the vehicle. A well-designed highway turn also has a transverse slope,
with the outside higher than the inside, to counterbalance the centrifugal force on the turning
vehicles. Thus the ideal (smooth) motion of a vehicle has piecewise constant translational
and rotational velocity components, with smooth transitions between them. Note that the
translational components are constant in the vehicle coordinate frame even when the vehicle
is turning, unless it slows down to make the turn.
To illustrate the typical sizes of these components, consider a ground vehicle moving
with velocity v along a plane curve Γ on the surface Σ. If Σ is a plane and Γ is a circular
arc with radius of curvature ρg = |κg |−1 (i.e., the vehicle is turning with a constant steering
angle), the angular velocity of the vehicle is v⃗ωd = vκg⃗s and there is a centripetal acceleration
⃗ac = v 2 κg⃗v at the vehicle’s center [63]. As a result there is a centrifugal force on the vehicle
proportional to $\|\vec{a}_c\|$ and the mass of the vehicle. If skidding is to be avoided the limit on $\|\vec{a}_c\|$ (see [4]) is given by
$$\|\vec{a}_c\| = v^2\kappa_g \le g(\tan\alpha + \mu_a) \qquad (2.8)$$
where $g$ is the gravitational acceleration, $\alpha$ is the transverse slope, and $\mu_a$ is the coefficient of adhesion between the wheels and the surface. [Typical values of $\mu_a$ range from 0.8--0.9 for dry asphalt and concrete to 0.1 for ice (see [126], page 26).] From (2.8) we have either a lower bound on $\rho_g$ for a given $v$ or an upper bound on $v$ for a given $\rho_g$. For example, if $v = 30$ m/sec ($\approx 108$ km/hr), $\alpha = 0.05$ rad, and $\mu_a = 0.2$, then from $v^2/\rho_g < 0.25g$ we have $\rho_g > 367$ m. This yields an upper bound on the yaw angular velocity of $v|\kappa_g| = v/\rho_g \approx 0.08$ rad/sec, which is somewhat larger than the yaw angular velocity arising from the departures from Darboux motion.
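As a quick check of the numbers quoted above, the following sketch evaluates (2.8) for the same assumed values (v = 30 m/s, α = 0.05 rad, μa = 0.2):

```python
import math

# Worked example of constraint (2.8) with the values used in the text.
g, v, alpha, mu_a = 9.81, 30.0, 0.05, 0.2

a_max = g * (math.tan(alpha) + mu_a)   # maximal admissible centripetal acceleration
rho_min = v**2 / a_max                 # lower bound on the turn radius (m)
yaw_max = v / rho_min                  # upper bound on the yaw rate (rad/s)

print(f"rho_g > {rho_min:.0f} m, yaw rate < {yaw_max:.3f} rad/s")
# -> rho_g > ~367 m and yaw rate < ~0.082 rad/s, matching the bounds in the text.
```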
Other dynamic constraints on a vehicle such as the limits on torques and forces can be
used to obtain constraints on τg and κn . (These and other considerations such as safety and
comfort were used in [83] to make recommendations for highway design; for a summary of
these recommendations see [39].) For both vertical curves (crossing a hill) and turning curves
the (recommended lower bound on the) radius of curvature ρmin grows with the square of
the design velocity vd . However, the resulting (design) yaw and pitch angular velocities are
limited by vd /ρmin . Thus for smaller velocities v the vehicle can negotiate tighter vertical
and turning curves and thus have even larger values of the yaw and pitch angular velocities.
Typical values of the roll and pitch angular velocities are given in [39].
For realistic vehicle speeds we can conclude the following about the impulsive and smooth
translational and rotational velocity components of the vehicle [39]. The impulsive effects
on the translational velocity are approximately two orders of magnitude smaller than the
smooth velocity components themselves. Impulsive effects on the yaw angular velocity are
somewhat smaller than the smooth yaw component arising from worst-case turns of the road;
for moderate turns the impulsive effects are comparable in size to the smooth yaw velocity.
Impulsive effects on the roll angular velocity are approximately an order of magnitude larger
than the smooth roll component arising from worst-case twists (and turns) of the road; for
gentler twists the smooth roll velocity is even smaller. Similarly, impulsive effects on the
pitch angular velocity are approximately an order of magnitude larger than the smooth pitch
velocity arising from worst-case changes of vertical slope (i.e., vertical curves) of the road;
for gentler vertical curves the smooth pitch angular velocity is even smaller. (The impulsive effects
are not significantly affected by turns, twists, or vertical slope.) We can thus conclude that
impulsive effects on the roll and pitch angular velocities are significant and larger than the
corresponding smooth velocities, and that impulsive effects on the yaw angular velocity are
on the order of the smooth yaw velocity.
2.2.4 Camera Motion
Assume that a camera is mounted on the vehicle; let d⃗c be the position vector of the mass
center of the vehicle relative to the nodal point of the camera. The orientation of the vehicle
coordinate system Cξηζ relative to the camera is given by an orthogonal rotational matrix
(a matrix of the direction cosines) which we denote by Rc . The columns of Rc are the unit
vectors of the vehicle coordinate system expressed in the camera coordinate system. We will
assume that the position and orientation of the vehicle relative to the camera coordinate
system do not change as the vehicle moves. Thus we will assume that Rc and d⃗c are constant
and known.
Given the position $\vec{p}_e$ of a scene point $E$ in the vehicle coordinate system $C\xi\eta\zeta$, its position $\vec{r}_e$ in the camera coordinate system is given by
$$\vec{r}_e = R_c\vec{p}_e + \vec{d}_c.$$
Since $R_c$ and $\vec{d}_c$ are constant we have $\dot{\vec{r}}_e = R_c\dot{\vec{p}}_e$. The velocity of $E$ is given by
$$\dot{\vec{r}}_e = -\vec{\omega} \times \vec{r}_e - \vec{T}. \qquad (2.9)$$
In this expression, the rotational velocity is $\vec{\omega} = R_c\vec{\omega}_v$ (see (2.5)), and the translational velocity is $\vec{T} = R_c\vec{T}_v + \vec{\omega} \times \vec{d}_c$ (see (2.4)), where $\vec{\omega}_v$ and $\vec{T}_v$ are the rotational and translational velocities of the vehicle coordinate system.
We saw in Section 2.2.3 that both $\|v\vec{\omega}_d\|$ and $\|\vec{\omega}_{v/d}\|$ are $O(0.1)$ rad/sec. The factors $R_c$ and $R_{v/d}^T$ do not affect the magnitude of either $\vec{\omega}$ or $\vec{T}$. Thus the two components of rotational velocity have comparable magnitudes.
As regards the translational components, note that for normal speeds of the vehicle ($v > 10$ m/sec $\approx 22$ mi/hr), typical suspension elements, and the camera mounted on the vehicle close to the center of mass, we have (see Section 2.2.3) $\|\vec{d}_{v/d}\| = O(0.025)$ m, $\|\dot{\vec{d}}_{v/d}\| = O(0.1)$ m/sec, and $\|\vec{d}_c\| = O(1)$ m. The magnitudes of the translational velocity components are thus $\|v\vec{\omega}_d \times \vec{d}_{v/d}\| \le v\|\vec{\omega}_d\|\,\|\vec{d}_{v/d}\| = O(0.0025)$ m/sec; $\|v\vec{t}\| = v = O(10)$ m/sec; and $\|\vec{\omega} \times \vec{d}_c\| \le \|\vec{\omega}\|\,\|\vec{d}_c\| = O(0.1)$ m/sec. Therefore, the dominant term in the expression for $\vec{T}$ is $v\vec{t}$, since it is two orders of magnitude larger than any of the other three terms of $\vec{T}$.
2.2.5 Independently Moving Vehicles
We are interested in other vehicles that are moving nearby. We assume the other vehicles
are all moving in the same direction. To facilitate the derivation of the motion equations
of a rigid body B we use two rectangular coordinate frames, one (Oxyz) fixed in space, the
other (Cx1 y1 z1 ) fixed in the body and moving with it. The position of the moving frame at
any instant is given by the position d⃗1 of the origin C1 , and by the nine direction cosines of
the axes of the moving frame with respect to the fixed frame. For a given position p⃗ of P in
$Cx_1y_1z_1$ we have the position $\vec{r}_p$ of $P$ in $Oxyz$:
$$\vec{r}_p = R\vec{p} + \vec{d}_1 \qquad (2.10)$$
where $R$ is the matrix of the direction cosines (the frames are taken as right-handed so that $\det R = 1$). The velocity $\dot{\vec{r}}_p$ of $P$ in $Oxyz$ is given by
$$\dot{\vec{r}}_p = \vec{\omega}_1 \times (\vec{r}_p - \vec{d}_1) + \dot{\vec{d}}_1, \qquad (2.11)$$
where $\dot{\vec{d}}_1$ is the translational velocity vector and $\vec{\omega}_1 = (\omega_x\ \omega_y\ \omega_z)^T$ is the rotational velocity vector.
From Section 2.2.3 we have for a typical vehicle $\|\vec{\omega}_1\| = O(0.1)$ rad/sec and $\|\vec{r}_p - \vec{d}_1\| = O(1)$ m. We thus have $\|\vec{\omega}_1 \times (\vec{r}_p - \vec{d}_1)\| \le \|\vec{\omega}_1\|\,\|\vec{r}_p - \vec{d}_1\| = O(0.1)O(1) = O(0.1)$ m/sec. For the translational velocity we have $\|\dot{\vec{d}}_1\| = v$; for normal speeds of the vehicle $v > 10$ m/sec $\approx 22$ mi/hr, so that $\|\dot{\vec{d}}_1\| = O(10)$ m/sec. We can conclude that for any point $P$ on the vehicle the translational velocity is two orders of magnitude larger than the rotational velocity.
If we make the fixed frame $Oxyz$ correspond to the camera frame at time $t$ we have from (2.9) and (2.11) that the velocity of a point $P$ on the vehicle expressed in the camera frame is given by
$$\dot{\vec{r}}_p = -\vec{\omega} \times \vec{r}_p + \vec{\omega}_1 \times (\vec{r}_p - \vec{d}_1) - \vec{T} + \dot{\vec{d}}_1. \qquad (2.12)$$
In (2.12) the vector $-\vec{T} + \dot{\vec{d}}_1$ corresponds to the relative translational velocity between the camera and the independently moving vehicle. Regarding the first and the second terms on the r.h.s. of (2.12) we can see that for comparable rotational velocities $\vec{\omega}$ and $\vec{\omega}_1$ the first term will dominate the second term since usually $\|\vec{r}_p - \vec{d}_1\| \ll \|\vec{r}_p\|$. We will use this observation later.
2.3 Image Motion
2.3.1 The Imaging Models
Let (X, Y, Z) denote the Cartesian coordinates of a scene point with respect to the fixed
camera frame (see Figure 2.3), and let (x, y) denote the corresponding coordinates in the
image plane. The equation of the image plane is Z = f , where f is the focal length of the
camera. The perspective projection onto this plane is given by
$$x = \frac{fX}{Z}, \qquad y = \frac{fY}{Z}. \qquad (2.13)$$
For weak perspective projection we need a reference point $(X_c, Y_c, Z_c)$. A scene point $(X, Y, Z)$ is first projected onto the point $(X, Y, Z_c)$; then, through plane perspective projection, the point $(X, Y, Z_c)$ is projected onto the image point $(x, y)$. The projection equations are then given by
$$x = f\frac{X}{Z_c}, \qquad y = f\frac{Y}{Z_c}. \qquad (2.14)$$
Figure 2.3: The plane perspective projection image of P is F = f (X/Z, Y /Z, 1); the weak
perspective projection image of P is obtained through the plane perspective projection of
the intermediate point P1 = (X, Y, Zc ) and is given by G = f (X/Zc , Y /Zc , 1).
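To illustrate the difference between the two models, here is a minimal sketch comparing perspective projection (2.13) with weak perspective projection (2.14); the focal length and the scene points are arbitrary illustrative values.

```python
import numpy as np

f = 500.0                                    # focal length in pixels (illustrative)

def perspective(P):
    X, Y, Z = P
    return np.array([f * X / Z, f * Y / Z])          # eq. (2.13)

def weak_perspective(P, Zc):
    X, Y, _ = P
    return np.array([f * X / Zc, f * Y / Zc])        # eq. (2.14)

# Points on a shallow object centered at depth Zc = 20 m.
points = np.array([[1.0, 0.5, 19.5], [-1.0, 0.2, 20.5], [0.3, -0.4, 20.0]])
Zc = points[:, 2].mean()

for P in points:
    print(perspective(P), weak_perspective(P, Zc))
# When the depth extent of the object is small compared to Zc the two projections
# nearly coincide, which justifies the weak perspective analysis of nearby vehicles.
```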
2.3.2 The Image Motion Field and the Optical Flow Field
The instantaneous velocity of the image point (x, y) under perspective projection is obtained
by taking the derivatives of (2.13) and using (2.9):
$$\dot{x} = f\,\frac{\dot{X}Z - X\dot{Z}}{Z^2} = \frac{-Uf + xW}{Z} + \frac{xy}{f}\,\omega_x - \left(\frac{x^2}{f} + f\right)\omega_y + \omega_z y, \qquad (2.15)$$
$$\dot{y} = f\,\frac{\dot{Y}Z - Y\dot{Z}}{Z^2} = \frac{-Vf + yW}{Z} + \left(\frac{y^2}{f} + f\right)\omega_x - \frac{xy}{f}\,\omega_y - \omega_z x. \qquad (2.16)$$
The instantaneous velocity of the image point (x, y) under weak perspective projection
can be obtained by taking derivatives of (2.14) with respect to time and using (2.9):
$$\dot{x} = f\,\frac{\dot{X}Z_c - X\dot{Z}_c}{Z_c^2} = \frac{-Uf + xW}{Z_c} - f\omega_y\frac{Z}{Z_c} + \omega_z y, \qquad (2.17)$$
$$\dot{y} = f\,\frac{\dot{Y}Z_c - Y\dot{Z}_c}{Z_c^2} = \frac{-Vf + yW}{Z_c} + f\omega_x\frac{Z}{Z_c} - \omega_z x. \qquad (2.18)$$
Let ⃗ı and ⃗ȷ be the unit vectors in the x and y directions, respectively; ⃗r˙ = ẋ⃗ı + ẏ⃗ȷ is
the projected motion field at the point ⃗r = x⃗ı + y⃗ȷ. If we choose a unit direction vector ⃗nr
at the image point ⃗r and call it the normal direction, then the normal motion field at ⃗r is
⃗r˙n = (⃗r˙ · ⃗nr )⃗nr . ⃗nr can be chosen in various ways; the usual choice (as we shall now see) is
the direction of the image intensity gradient.
Let I(x, y, t) be the image intensity function. The time derivative of I can be written as
$$\frac{dI}{dt} = \frac{\partial I}{\partial x}\frac{dx}{dt} + \frac{\partial I}{\partial y}\frac{dy}{dt} + \frac{\partial I}{\partial t} = (I_x\vec{\imath} + I_y\vec{\jmath})\cdot(\dot{x}\vec{\imath} + \dot{y}\vec{\jmath}) + I_t = \nabla I\cdot\dot{\vec{r}} + I_t$$
where ∇I is the image gradient and the subscripts denote partial derivatives.
If we assume dI/dt = 0, i.e. that the image intensity does not vary with time, then we
have ∇I · ⃗u + It = 0. The vector field ⃗u in this expression is called the optical flow. If we
choose the normal direction ⃗nr to be the image gradient direction, i.e. ⃗nr ≡ ∇I/∥∇I∥, we
then have
$$\vec{u}_n = (\vec{u}\cdot\vec{n}_r)\,\vec{n}_r = \frac{-I_t\,\nabla I}{\|\nabla I\|^2} \qquad (2.19)$$
where ⃗un is called the normal flow.
It was shown in [121] that the magnitude of the difference between ⃗un and the normal
motion field ⃗rn is inversely proportional to the magnitude of the image gradient. Hence
⃗r˙n ≈ ⃗un when ∥∇I∥ is large. Equation (2.19) thus provides an approximate relationship
between the 3-D motion and the image derivatives. We will use this approximation later in
this paper.
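For concreteness, the following minimal sketch computes the normal flow (2.19) from a pair of synthetic frames using simple finite differences; the test image, the derivative operators and the gradient threshold are illustrative choices, not the ones used in the experiments.

```python
import numpy as np

def normal_flow(I0, I1, eps=1e-8):
    """Normal flow (2.19) from two frames via crude finite differences."""
    Iy, Ix = np.gradient(I0.astype(float))        # spatial derivatives (rows = y, cols = x)
    It = I1.astype(float) - I0.astype(float)      # temporal derivative
    mag2 = Ix**2 + Iy**2 + eps
    return -It * Ix / mag2, -It * Iy / mag2, np.sqrt(mag2)   # u_n = -I_t grad(I)/|grad(I)|^2

# Synthetic example: a bright square shifted one pixel to the right between frames.
I0 = np.zeros((32, 32)); I0[12:20, 10:18] = 1.0
I1 = np.zeros((32, 32)); I1[12:20, 11:19] = 1.0
un_x, un_y, grad_mag = normal_flow(I0, I1)
mask = grad_mag > 0.3        # only strong-gradient points carry reliable normal flow [121]
print(un_x[mask].mean())     # positive: the vertical edges move to the right
```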
2.3.3 Estimation of Rotation
We now describe our algorithm for estimating rotation. In this section we give only a brief
description of the algorithm. A full description and a proof of correctness will be given in
a forthcoming paper. We shall use the following notation: Let I be the image intensity at
⃗r, and let ⃗nr = nx⃗ı + ny⃗ȷ = ∇I/∥∇I∥ be the direction of the image intensity gradient at ⃗r.
The normal motion field at ⃗r is the projection of the image motion field onto the gradient
direction ⃗nr and is given by ⃗r˙n = (⃗r˙ · ⃗nr )⃗nr . From (2.15-2.16) we have
$$\begin{aligned}
\dot{\vec{r}}_n\cdot\vec{n}_r = n_x\dot{x} + n_y\dot{y} ={}& \frac{1}{Z}\left[n_x(-Uf + xW) + n_y(-Vf + yW)\right] \\
&+ \left[n_x\frac{xy}{f} + n_y\left(\frac{y^2}{f} + f\right)\right]\omega_x - \left[n_x\left(\frac{x^2}{f} + f\right) + n_y\frac{xy}{f}\right]\omega_y + (n_x y - n_y x)\,\omega_z \qquad (2.20)
\end{aligned}$$
The first term on the r.h.s. of (2.20) is the translational normal motion ⃗r˙ t · ⃗nr and the
remaining three terms are the rotational normal motion ⃗r˙ ω ·⃗nr . From now on we will assume
that the camera is forward-looking, i.e. that the focus of expansion (FOE) is in the image.
The normal flow at ⃗r is defined as −It /∥∇I∥. From [121] we know that the magnitude
of the difference between the normal flow field and the normal motion field is inversely
proportional to the gradient magnitude; we can thus write
$$\dot{\vec{r}}_n\cdot\vec{n}_r = \dot{\vec{r}}_t\cdot\vec{n}_r + \dot{\vec{r}}_\omega\cdot\vec{n}_r = -\frac{I_t}{\|\nabla I\|} + O(\|\nabla I\|^{-1}) = \vec{u}_n\cdot\vec{n}_r + O(\|\nabla I\|^{-1}). \qquad (2.21)$$
If the camera motion is a pure translation the image motion field is a radial pattern; the
magnitude of each image motion vector is proportional to the distance of the image point from
the focus of expansion (FOE) and inversely proportional to the depth of the corresponding
scene point. If the position ⃗r0 = ⃗ıx0 +⃗ȷy0 of the FOE is known the translational motion field
can be obtained from the translational normal motion field by multiplying each $\dot{\vec{r}}_t\cdot\vec{n}_r$ by a vector whose direction is $(\vec{r} - \vec{r}_0)$ and whose magnitude is inversely proportional to the cosine of the angle between the normal direction $\vec{n}_r$ and the radial direction $(\vec{r} - \vec{r}_0)$. The translational motion field is then given by
$$\dot{\vec{r}}_t = (\dot{\vec{r}}_t\cdot\vec{n}_r)\,\frac{\vec{r} - \vec{r}_0}{(\vec{r} - \vec{r}_0)\cdot\vec{n}_r}. \qquad (2.22)$$
Note that if we knew $\vec{\omega}$ we could compute the rotational motion field $\dot{\vec{r}}_\omega$ and the rotational normal motion $\dot{\vec{r}}_\omega\cdot\vec{n}_r$ and use (2.21) to obtain
$$\dot{\vec{r}}_t\cdot\vec{n}_r \approx \vec{u}\cdot\vec{n}_r - \dot{\vec{r}}_\omega\cdot\vec{n}_r. \qquad (2.23)$$
If we combine (2.22) and (2.23) we have
$$\dot{\vec{r}}_t \approx \left(\vec{u}\cdot\vec{n}_r - \dot{\vec{r}}_\omega\cdot\vec{n}_r\right)\frac{\vec{r} - \vec{r}_0}{(\vec{r} - \vec{r}_0)\cdot\vec{n}_r}. \qquad (2.24)$$
If the FOE is known or can be estimated, we can use (2.24) to estimate the rotational velocity vector $\vec{\omega}$ by minimizing $\sum_{\vec{r}} \|\dot{\vec{r}}_t\|^2$. Indeed, the image motion field in the neighborhood of the FOE is composed of the translational image motion and the rotational image motion. The roll component of the rotational image motion is orthogonal to the translational image motion, so that it increases the magnitude of the image motion field and the normal motion field. The yaw and the pitch components of the rotational image motion are approximately constant in the neighborhood of the FOE and just shift the position of the singular (zero) point of the flow field [40]. Furthermore, the rotational normal motion accounts for most of the image motion field at the distant image points [39]. Therefore, if we subtract the rotational image motion, the sum of the magnitudes of the resulting (translational) flow field will be minimal. Using (2.20) and (2.24) we then have
$$\vec{\omega} = \arg\min_{\vec{\omega}} \sum_{\vec{r}} \|\vec{u}\cdot\vec{n}_r - \dot{\vec{r}}_\omega\cdot\vec{n}_r\|^2\,\frac{\|\vec{r} - \vec{r}_0\|^2}{\|(\vec{r} - \vec{r}_0)\cdot\vec{n}_r\|^2}. \qquad (2.25)$$
In matrix form this problem corresponds to minimizing $\|A\vec{\omega} - \vec{b}\|$ (see [114]), where the rows $\vec{a}_i$ of $A$ are given by
$$\vec{a}_i = \frac{\|\vec{r} - \vec{r}_0\|}{\|(\vec{r} - \vec{r}_0)\cdot\vec{n}_r\|}\left[\; n_x\frac{xy}{f} + n_y\left(\frac{y^2}{f} + f\right);\;\; -n_x\left(\frac{x^2}{f} + f\right) - n_y\frac{xy}{f};\;\; n_x y - n_y x \;\right] \qquad (2.26)$$
and the elements $b_i$ of $\vec{b}$ are given by
$$b_i = \vec{u}\cdot\vec{n}_r\,\frac{\|\vec{r} - \vec{r}_0\|}{\|(\vec{r} - \vec{r}_0)\cdot\vec{n}_r\|}.$$
The solution is given by
$$\vec{\omega} = A^+\vec{b}$$
where $A^+$ is the generalized inverse of $A$ (see [114]).
2.3.4 Estimating the FOE
In the case of a forward looking camera and an unknown FOE it is possible to simultaneously estimate the FOE and $\vec{\omega}$. Based on our analysis in Section 2.2 we observe that, in the case of a camera rigidly connected to the vehicle and a scene without independently moving objects, the direction of the translational velocity of the vehicle remains approximately constant throughout the image sequence. The rotational velocity $\vec{\omega}$, however, depends on both the geometry of the road and the activity of the suspension elements; hence, $\vec{\omega}$ changes as the vehicle moves. It is possible to choose a part of the road without horizontal and/or vertical curves and/or changes of lateral slope, so that the vehicle rotations correspond to activities of the suspension elements. In these cases the rotational velocity components change in an almost periodic manner. If the scenes are chosen so that significant parts of the images correspond to distant scene points, the following algorithm can be used to reliably estimate the FOE.
Given N successive image frames taken by a forward looking camera mounted on a moving
vehicle we form the following function
$$\varphi = \sum_{k=1}^{N}\sum_{i=1}^{N_k} \|\vec{u}_i\cdot\vec{n}_r - \dot{\vec{r}}_{\omega_k}\cdot\vec{n}_r\|^2\,\frac{\|\vec{r}_i - \vec{r}_0\|^2}{\|(\vec{r}_i - \vec{r}_0)\cdot\vec{n}_r\|^2} \qquad (2.27)$$
where the inside sum is over all normal flow vectors in the kth frame, and the outside sum
is over all frames; the position of the FOE does not change throughout the sequence (N
frames), but $\vec{\omega}$ changes at every frame (hence the index $k$). We estimate $\vec{\omega}_k$, $k = 1, \ldots, N$, and $\vec{r}_0$ by minimizing $\varphi$.
A straightforward method for minimizing $\varphi$ corresponds to a nonlinear least squares problem. It can be observed from (2.27) that $\varphi$ is linear in the $\vec{\omega}_k$'s and nonlinear in $\vec{r}_0$. In the numerical analysis literature such problems are called "separable nonlinear least squares problems" [97]. Several algorithms for solving such problems have been proposed; a unifying feature of these algorithms is that they have better performance than standard nonlinear least squares algorithms [97], since they are based on solving problems of smaller dimensionality than the unseparated problem.
In this paper we use a simple version of the separable algorithm. We choose an initial $\vec{r}_0$ and solve for the $\vec{\omega}_k$ vectors. This problem is equivalent to solving $N$ linear least squares problems as described in Section 2.3.3. We then have $\vec{\omega}_k = A_k^+\vec{b}_k$ with the $A_k$'s and the $\vec{b}_k$'s appropriately defined. Once the $\vec{\omega}_k$'s are estimated we have a nonlinear least squares problem in the two components of $\vec{r}_0$: we need to minimize $\varphi$ for the fixed $\vec{\omega}_k$'s. From (2.27) we then have
$$\vec{r}_0 = \arg\min_{\vec{r}_0} \sum_{k=1}^{N}\sum_{i=1}^{N_k} \alpha_{k,i}^2\,\frac{\|\vec{r}_i - \vec{r}_0\|^2}{\|(\vec{r}_i - \vec{r}_0)\cdot\vec{n}_r\|^2} \qquad (2.28)$$
where $\alpha_{k,i} = \|\vec{u}_i\cdot\vec{n}_r - \dot{\vec{r}}_{\omega_k}\cdot\vec{n}_r\|$. We use the Gauss-Newton algorithm to solve this problem. The partial derivatives $\partial\varphi/\partial x_0$ and $\partial\varphi/\partial y_0$ that are required by the Gauss-Newton algorithm are readily obtained from (2.27-2.28).
After solving for $\vec{r}_0$ we substitute $\vec{r}_0$ in (2.27) and solve for the $\vec{\omega}_k$'s; we then use these values to solve for $\vec{r}_0$ again. The process is repeated until $\vec{r}_0$ converges. A simple test for convergence is $\|\vec{r}_0(s) - \vec{r}_0(s+1)\| \le 1$, where $s$ is the iteration number.
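The structure of this alternating scheme can be summarized in code. The following is only a structural sketch under assumed data structures (a list of per-frame (points, normals, normal flow) measurements); it inlines a per-frame linear rotation solver like the one in Section 2.3.3 and, for brevity, replaces the Gauss-Newton step by SciPy's general least-squares solver rather than the thesis implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def rotational_normal_flow(x, y, nx, ny, omega, f):
    """Rotational normal motion at an image point, i.e. the omega-terms of (2.20)."""
    return ((nx * x * y / f + ny * (y**2 / f + f)) * omega[0]
            - (nx * (x**2 / f + f) + ny * x * y / f) * omega[1]
            + (nx * y - ny * x) * omega[2])

def solve_omega(pts, nrm, u_n, r0, f):
    """Linear least squares for a single frame's rotation (Section 2.3.3)."""
    A, b = [], []
    for (x, y), (nx, ny), un in zip(pts, nrm, u_n):
        d = np.array([x, y]) - r0
        w = np.linalg.norm(d) / (abs(d @ np.array([nx, ny])) + 1e-6)
        A.append(w * np.array([nx * x * y / f + ny * (y**2 / f + f),
                               -(nx * (x**2 / f + f) + ny * x * y / f),
                               nx * y - ny * x]))
        b.append(w * un)
    return np.linalg.pinv(np.array(A)) @ np.array(b)

def residuals(r0, frames, omegas, f):
    """Weighted residuals of (2.27)-(2.28) for a candidate FOE r0."""
    res = []
    for (pts, nrm, u_n), omega in zip(frames, omegas):
        for (x, y), (nx, ny), un in zip(pts, nrm, u_n):
            d = np.array([x, y]) - r0
            w = np.linalg.norm(d) / (abs(d @ np.array([nx, ny])) + 1e-6)
            res.append(w * (un - rotational_normal_flow(x, y, nx, ny, omega, f)))
    return np.array(res)

def estimate_foe_and_rotations(frames, f, r0_init, max_iter=20):
    """frames: list of (points, normals, u_n) per frame; alternates the two steps."""
    r0 = np.asarray(r0_init, dtype=float)
    for _ in range(max_iter):
        omegas = [solve_omega(pts, nrm, u_n, r0, f) for pts, nrm, u_n in frames]
        r0_new = least_squares(residuals, r0, args=(frames, omegas, f)).x
        if np.linalg.norm(r0_new - r0) <= 1.0:     # convergence test from the text
            return r0_new, omegas
        r0 = r0_new
    return r0, omegas
```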
2.3.5 The Image Motion Field of an Observed Vehicle
From (2.17–2.18) we obtain the (approximate) equations of projected motion for points on
a vehicle under weak perspective:
$$\dot{x} = \frac{Uf - xW}{Z_c}, \qquad (2.29)$$
$$\dot{y} = \frac{Vf - yW}{Z_c}. \qquad (2.30)$$
Equations (2.29-2.30) relate the image (projected) motion field to the scaled translational velocity $Z_c^{-1}\vec{T} = Z_c^{-1}(U\ V\ W)^T$.
Given the point ⃗r = x⃗ı + y⃗ȷ and the normal direction nx⃗ı + ny⃗ȷ, from (2.29-2.30) the
normal motion field $\dot{\vec{r}}_n\cdot\vec{n} = n_x\dot{x} + n_y\dot{y}$ is given by
$$\dot{\vec{r}}_n\cdot\vec{n} = n_x f\,UZ_c^{-1} + n_y f\,VZ_c^{-1} - (n_x x + n_y y)\,WZ_c^{-1}. \qquad (2.31)$$
Let
$$a = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix} \equiv \begin{pmatrix} n_x f \\ n_y f \\ -n_x x - n_y y \end{pmatrix}, \qquad c = \begin{pmatrix} c_1 \\ c_2 \\ c_3 \end{pmatrix} \equiv \begin{pmatrix} UZ_c^{-1} \\ VZ_c^{-1} \\ WZ_c^{-1} \end{pmatrix}. \qquad (2.32)$$
Using (2.32) we can write (2.31) as $\dot{\vec{r}}_n\cdot\vec{n} = a^T c$. The column vector $a$ is formed of observable quantities only, while each element of the column vector $c$ contains quantities which are not directly observable from the images. To estimate $c$ we need estimates of $\dot{\vec{r}}_n\cdot\vec{n}$ at three or more image points.
2.3.6 Estimating Vehicle Motion from Normal Flow
As in Section 2.3.5 we use linear least squares to estimate parameter vector c from the normal
flow.
In the case of a moving vehicle the parameters of interest are the vehicle's trajectory and its rate of approach. The rate of approach
$$\nu = \frac{W}{Z_c}$$
(measured in sec$^{-1}$) is equivalent to the inverse of the time to collision and corresponds to the rate with which an object is approaching the camera (or receding from it). The rate $\nu = 0.1$/sec means that every second the object travels 0.1 of the distance between the observer and its current position. A negative rate of approach means that the object is going away from the camera.
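The following minimal sketch shows how c, and hence the rate of approach, might be recovered by linear least squares from normal flow measurements on the vehicle; the focal length and the synthetic data are illustrative values only.

```python
import numpy as np

def estimate_scaled_velocity(points, normals, rdot_n, f):
    """Recover c = (U, V, W)/Zc from normal motion values via (2.31)-(2.32).
    points: (N,2) image coords on the vehicle; normals: (N,2) unit normals;
    rdot_n: (N,) normal motion values; f: focal length in pixels."""
    A = np.array([[nx * f, ny * f, -(nx * x + ny * y)]
                  for (x, y), (nx, ny) in zip(points, normals)])
    return np.linalg.lstsq(A, np.asarray(rdot_n), rcond=None)[0]

# Synthetic check: generate normal motion from a known c and recover it.
rng = np.random.default_rng(2)
f = 500.0
c_true = np.array([0.002, -0.001, 0.3])          # (U, V, W)/Zc  [1/s]
pts = rng.uniform(-50, 50, (20, 2))
nrm = rng.normal(size=(20, 2))
nrm /= np.linalg.norm(nrm, axis=1, keepdims=True)
rdot_n = [nx * f * c_true[0] + ny * f * c_true[1] - (nx * x + ny * y) * c_true[2]
          for (x, y), (nx, ny) in zip(pts, nrm)]

c = estimate_scaled_velocity(pts, nrm, rdot_n, f)
print(c)                                  # ~ c_true
print("rate of approach:", c[2], "1/s")   # nu = W/Zc, inverse of the time to collision
```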
2.4 Experiments
In Sections 2.4.1 and 2.4.2 we give examples illustrating road detection, stabilization, and
vehicle detection. In Section 2.4.3 we present results for several sequences showing vehicles
in motion.
2.4.1 Road and Vehicle Detection
We detect the road region by finding road edges and lane markers. A Canny edge detector is
applied and Hough-like voting is used to detect dominant straight lines in the image. Each
line is parameterized by its normal angle α and its displacement d from the center of the
image. For all possible values of α and d the image is scanned along the corresponding line.
The number of edge points that are found within a strip along the line, and whose gradient
direction is orthogonal to the line direction, is taken as the vote for the corresponding point
in the α - d plane. Among those lines, only the ones with a certain weight and orientation
are chosen to be road line candidates.
If several lines with close α and d values are candidates, only the best representative of
these lines is chosen. The other lines are eliminated by applying local maximum suppression
in the α - d plane. Note that if the lines represent the two edges of a lane marker, the
gradient directions will have opposite signs for the two edges.
Due to perspective, the road boundaries and lane marker lines should converge to a
single point. Candidate lines that do not converge to a single point are not identified as
road or lane boundaries. The identified lines are the maximal subsets of candidate lines that
all intersect at or close to one point (all the intersection points of every pair of the lines are
located in a small region).
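A point-wise formulation of the voting step described above can be sketched as follows; the (α, d) discretization, the angular tolerance, and the assumption that edge points and gradient angles come from a prior edge detector (e.g. Canny) are illustrative choices, not the exact implementation used here.

```python
import numpy as np

def vote_lines(edge_xy, grad_angle, shape, n_alpha=180, n_d=200, tol=np.radians(10)):
    """Accumulate votes in the (alpha, d) plane: each edge point votes for the lines
    it lies on whose normal direction roughly matches its gradient direction."""
    h, w = shape
    cx, cy = w / 2.0, h / 2.0
    d_max = np.hypot(cx, cy)
    alphas = np.linspace(0, np.pi, n_alpha, endpoint=False)
    votes = np.zeros((n_alpha, n_d))
    for (x, y), g in zip(edge_xy, grad_angle):
        for i, a in enumerate(alphas):
            # only edges whose gradient is (roughly) along the line normal may vote
            diff = (g - a) % np.pi
            if min(diff, np.pi - diff) > tol:
                continue
            d = (x - cx) * np.cos(a) + (y - cy) * np.sin(a)   # signed distance from center
            j = int((d + d_max) / (2 * d_max) * (n_d - 1))
            votes[i, j] += 1
    return votes, alphas

# Dominant lines are local maxima of `votes` (after non-maximum suppression in the
# alpha-d plane); candidates whose pairwise intersections cluster near one point are
# then kept as road/lane boundaries (vanishing-point consistency).
```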
Figure 2.4 presents some road detection and vehicle detection results for four different
sequences (collected in three different countries).
2.4.2 Stabilization
As was shown in [39] the impulsive effects introduce significant changes in the roll and
pitch angular velocities (see Figure 2.5). Figure 2.6 shows three examples of image sequence
stabilization by compensating the rotational effects of road bumps. The estimated rotational
normal flow component (column c) is subtracted from the total normal flow (column b), which
yields the translational normal flow component (column d). One can see that the translational
normal flow components at distant points are close to zero.
The vector of rotation is estimated by using the method based on FOE calculation, as
described in Section 2.3.3, or alternatively, by estimating the amount of rotation from the
apparent shifts of distant points. Two examples of distant point identification, using horizon
points, are shown in Figure 2.7.
2.4.3 Relative Motions of Vehicles
After derotating and detecting moving vehicles, we can analyze their motions using the
algorithm for motion estimation under weak perspective.
In the first experiment we used an image sequence taken in Haifa, Israel, from a vehicle
following two other accelerating vehicles. The sequence consisted of 90 frames (slightly less
than three seconds). Figure 2.8 shows frames 0, 30 and 60, and the corresponding normal
flow on the vehicles. Figure 2.9 shows estimated values of U Zc−1 , V Zc−1 , and W Zc−1 for the
central (closest) vehicle. These values correspond to the translations of the vehicles relative
Figure 2.4: (a-d) A selected frame from each of four sequences. Top: the input images.
Middle: results of road detection. Bottom: results of vehicle detection.
to the vehicle carrying the camera (i.e., in the observer coordinate system). Because of our
choice of coordinate system the rate of approach ν is equal to the negative of W Zc−1 .
The graphs show that the motion components have a simple behavior; before they reach
their extremal values they can be approximated by straight lines, indicating constant relative
accelerations.
In the second experiment we used an image sequence of a van, taken in France, from
another vehicle following the van [12, 38]. The sequence consisted of 56 frames (slightly less
than two seconds). Figure 2.10 shows frames 5, 15, 25, and 35 as well as the corresponding
normal flow. Figure 2.11 shows estimated values of U Zc−1 , V Zc−1 , and W Zc−1 . The graph
shows that there is an impending collision (rate of approach greater than 1 sec−1 ). Around
the 20th frame the rate of approach becomes zero (as do all the velocity components) and
after that it becomes negative because the van starts pulling away from the vehicle carrying
the camera. A similar image sequence was used in [12] in studies of vehicle convoy behavior.
The third sequence (taken from the IEN Galileo Ferrari Vision Image Library) consisted
Figure 2.5: (a-b) Two images taken 1/15th of a second apart; (c-d) their normal flows. One
can see the effects of bumps. In the first frame, the flow vectors point downward; in the
second, they point upward.
of 26 frames. Figure 2.12 shows frames 1, 14 and 26, as well as the corresponding normal
flow. Figure 2.13 shows estimated values of U Zc−1 , V Zc−1 , and W Zc−1 . The graph shows that
the W component of the translational velocity is dominant over the U and V components,
which is correct for a vehicle that overtakes the observer vehicle and does not change lanes;
the two vehicles are moving on parallel courses.
Figure 2.14 shows frames 1, 26 and 47 of another 48-frame sequence, taken in Haifa,
Israel; as well as the corresponding normal flow. Figure 2.15 shows U Zc−1 , V Zc−1 , and W Zc−1
graphs for the left (overtaking) and central vehicles. One can see that the graphs differ
mainly in the values of the W component, since the relative speed of approach for the left
vehicle is greater than that for the central one. The U and V components are relatively
small; all three vehicles are moving in the same direction.
2.5 Related Work
2.5.1 Understanding Object Motion
Some work has been done on understanding object motion, but this work has almost always
assumed a stationary viewpoint. Understanding object motion is based on extracting the
object’s motion parameters from an image sequence. Broida and Chellappa [11] proposed
a framework for motion estimation of a vehicle using Kalman filtering. Weng et al. [125]
assumed an object that possesses an axis of symmetry, and a constant angular momentum
model which constrained the motion over a local frame subsequence to be a superposition of
precession and translation. The trajectory of the center of rotation can be approximated by
a vector polynomial. Changing the parameters of the model with time allows adaptation to
long-term changes in the motion characteristics.
In [38] Duric et al. tried to understand the motions of objects such as tools and vehicles,
based on the fact that the natural axes of the object tend to remain aligned with the local
trihedron defined by the object's trajectory. Based on this observation they used the Frenet-Serret motion model, and showed that knowing how the Frenet-Serret frame is changing
relative to the observer gives essential information for understanding the object’s motion.
Figure 2.6: Stabilization results for one frame from each of three sequences: (a) Input frame.
(b) Normal flow. (c) Rotational normal flow. (d) Translational normal flow
Our present work is a continuation of this work in a more realistic and complicated scenario,
in which the camera is also moving.
2.5.2 Independent Motion Detection
Much work has been done on detection of independently moving objects by a moving observer. However, the work has been related to detection, classification, and tracking of the
motion, and has not paid much attention to motion estimation. Clarke and Zisserman in
[25] addressed the problem of independently moving rigid object detection, assuming that
all motions (including the camera motion) are pure translations. The idea is to track a set
of feature points using correlation and movement assumptions derived from the previous
frames, and, based on the feature point correspondences, determine the epipole (FOE) for
the background points as the intersection of the features’ image plane trajectories. The
assumption is that a majority of the feature points are background points. Moving objects
are found by fitting an epipole to those feature matches that are not consistent with the
background epipole. The image plane extent of the moving object is defined by the smallest
rectangle enclosing these features. The instability of the camera introduces strong rotational
Figure 2.7: Identification of distant image points in frames from two sequences.
components into the relative motion; these are not dealt with in [25].
Torr and Murray in [116] used statistical methods to detect non-rigid motion. A five-dimensional space of image pixels and spatio-temporal intensity gradients is fit to an affine
transformation model in a least squares sense, while identifying the outliers to the fit using
statistical diagnostics. The outliers are spatially clustered to form sets of pixels representing
independently moving objects. The assumption again is that the majority of the pixels come
from background points and their movement is well approximated by a linear or affine vector
field, i.e. the distances to the background points are large compared to the variations in these
distances.
Nelson in [82] suggested two qualitative methods for motion detection. The first uses
knowledge about observer motion to detect independently moving objects by looking for
points whose projected velocity behaviors are not consistent with the constraints imposed
by the observer’s motion. The second method is used to detect so-called “animate motion”,
which can be found by looking for violations of the motion field smoothness over time. This
is valid in cases where the observer motion changes slowly, while the apparent motion of the
moving object changes rapidly.
Irani et al. in [62] used temporal integration to construct a dynamic internal representation image of the tracked object. It is assumed that the motion can be approximated by 2D
parametric transformations in the image plane. Given a pair of images, a dominant motion
is first computed and the corresponding object is excluded from the region of analysis. This
process is then repeated and other objects are found. To track the objects in a long image
sequence, integrated images registered with respect to the tracked motion are used.
Sharma and Aloimonos in [106] demonstrated a method for detecting independent motions of compact objects by an active observer whose motion can be constrained. Fejes and
Davis in [44] used constraints on low-dimensional projected components of flow fields to detect independent motion. They implemented a recursive filter to extract directional motion
parameters. Independent motion detection was done by a combination of repeated-median-based line-fitting and one-dimensional search. The method was demonstrated on scenes with
large amounts of clutter.
Figure 2.8: Frames 0, 30, and 60 of a sequence showing two vehicles accelerating. The normal
flow results are shown below the corresponding image frames.
2.5.3 Vehicle Detection and Tracking
In previous work on vehicle detection and tracking by a moving observer, the detection
made use of model-based object recognition in single frames. Gil et al. [51] combined
multiple motion estimators for vehicle tracking. Vehicle detection was performed using
two features: the bounding rectangle of the moving vehicle, where the convex hull of the
vehicle is computed for every frame and then translated according to the predicted motion
parameters, and an updated 2-D pattern (gray-level mask) based on optimization of the
correlation between the pattern and the image using the motion parameters. These results
were obtained using a stationary camera mounted above a highway under different road and
illumination conditions.
Betke et al. [6] developed a real-time system for detection and tracking of multiple vehicles in a frame sequence taken on a highway from a moving vehicle. The system distinguishes
between distant and passing vehicle detection. In the case of a passing vehicle the recognition
is performed by detecting large brightness differences over small numbers of frames. 2-D car
models are used to create a gray-scale template of the detected vehicle for future tracking.
Distant vehicles are detected by analyzing prominent horizontal and vertical edges. A square
Figure 2.9: Motion analysis results for the acceleration sequence. U , V , W are the scaled
(by an unknown distance $Z_c^{-1}$) components of the relative translational velocity.
region bounded by such edges, which is strongly enough correlated with a vehicle template,
is recognized as a vehicle. For each newly recognized vehicle a separate tracking process
is allocated by a real-time operating system, which tracks the vehicle until it disappears
and makes sure that no other process tracks the same vehicle. When one vehicle occludes
another, one of the tracking processes terminates and the other tracks the occlusion region
as a single moving object.
2.5.4 Road Detection and Shape Analysis
Our work also involved detection of the road markings; in doing this we made no significant
use of motion models. There has been considerable work on road boundary detection in
images obtained by a vehicle driving on the road. A typical detection procedure involves
two steps. First the road is detected in a single frame (or in a small number of frames), and
then the road boundaries are tracked in a long sequence of frames.
Schneiderman and Nashman in [100] left the first step to a human operator and gave a
solution for the second step alone. Their algorithm is based on road markings. In each frame,
edge points are extracted and grouped into clusters representing individual lane markers. The
lane marker models built in the first step are updated using the information obtained from
successive frames. The markers are modeled by second-order polynomials in the image plane.
Tracking is exploited in the sense that edge points are associated with markers based on a
road model formed from analysis of the previous frames. Spatial proximity and gradient
direction are used as clues for clustering. The marker models are updated to satisfy a least
squares criterion of optimality.
Broggi in [9]-[10] presented a parallel real-time algorithm for road boundary detection.
Figure 2.10: Frames 5, 15, 25, and 35 of the van sequence. The normal flow results are
shown below the corresponding image frames.
First the image undergoes an inverse perspective transformation and then road markings are
detected using simple morphological filters that work in parallel on different image areas.
The inverse perspective procedure is based on a priori knowledge of the imaging process
(camera position, orientation, optics, etc.).
Richter and Wetzel in [94] used region segmentation for road surface recognition. A road
model is adjusted to the segmentation results collected from several frames. The adjusted
model is then used for model-based segmentation and tracking. A special lookup mechanism
is used to overcome the problem of obstacles and shadows that may cause the segmentation
algorithm to fail to detect the road regions.
A sophisticated road model that takes into account both horizontal and vertical curvature
was suggested by Dickmanns in [34]. A differential-geometric road representation and a
spatio-temporal driving model are exploited to find the parameters of the road and the
vehicle.
2.6 Conclusions
Understanding the motions of vehicles from images taken by a moving observer requires a
mathematical formulation of the relationships between the observer’s motion and the image
motion field, as well as a model for the other vehicles’ trajectories and their contributions to
the image motion field. In this paper a constant relationship between each vehicle’s frame
and the observer’s frame is assumed. The use of the Darboux frame provides a vocabulary
appropriate for describing long motion sequences.
We have derived equations for understanding the relative motions of vehicles in traffic
Figure 2.11: Motion analysis results for the van sequence. U , V , W are the scaled (by an
unknown distance $Z_c^{-1}$) components of the relative translational velocity.
scenes from a sequence of images taken by a moving vehicle. We use the Darboux motion
model for both the observing vehicle and the nearby moving vehicles. Using a full perspective
imaging model we stabilize the observer sequence so that our model for the observed vehicles’
motions can be applied. Using the weak perspective approximation we analyze the nearby
vehicles’ motions and apply this analysis to long image sequences. Expanding our analysis
to various classes of traffic events [70], and to articulated vehicles, is a direction for future
research.
Figure 2.12: Frames 1, 14 and 26 of the Italian sequence. The normal flow results are shown
below the corresponding image frames.
Figure 2.13: Motion analysis results for the Italian sequence. U , V , W are the scaled (by an
unknown distance $Z_c^{-1}$) components of the relative translational velocity.
Figure 2.14: Frames 1, 25 and 48 of the Haifa sequence. The normal flow results are shown below the corresponding image frames.
Figure 2.15: Motion analysis results for the Haifa sequence. U, V, W are the scaled (by an unknown distance $Z_c^{-1}$) components of the relative translational velocity.
Chapter 3
Non-Rigid Motion - Tracking
3.1 Introduction
It is clear that one of the major cues for motion based non-rigid object classification can be
obtained by analyzing moving body deformations. Hence, the accuracy of the segmentation
and tracking algorithm in finding the target outline is crucial for the quality of the final
result. This rules out a number of available or easy-to-build tracking systems that provide
only a center of mass or a bounding box around the target and calls for more precise and
usually more sophisticated solutions.
In this work we decided to use the geodesic active contour approach [17], where the
segmentation and tracking problem is expressed as a geometric energy minimization.
3.2 Snakes and Active Contours
An important problem in image analysis is object segmentation. It involves the isolation of
a single object from the rest of the image that may include other objects and a background.
Here, we focus on boundary detection of one or several objects by a dynamic model known
as the ‘geodesic active contour’ introduced in [16, 17, 18, 19], see also [68, 104].
Geodesic active contours were introduced as a geometric alternative for ‘snakes’ [115, 67].
Snakes are deformable models that are based on minimizing an energy along a curve. The
curve, or snake, deforms its shape so as to minimize ‘internal’ and ‘external’ energies
along its boundary. The internal part causes the boundary curve to become smooth, while
the external part leads the curve towards the edges of the object in the image.
In [14, 78], a geometric alternative for the snake model was introduced, in which an
evolving curve was formulated by the Osher-Sethian level set method [85]. The method
works on a fixed grid, usually the image pixels grid, and automatically handles changes in
the topology of the evolving contour.
The geodesic active contour model was born later. It is both a geometric model and an energy functional minimization. In [16, 17], it was shown that the geodesic active contour
model is related to the classical snake model. Actually, a simplified snake model yields the
same result as that of a geodesic active contour model, up to an arbitrary constant that
depends on the initial parameterization. Unknown constants are an undesirable property in
most automated models.
Although the geodesic active contour model has many advantages over the snake, its
main drawback is its non-linearity that results in inefficient implementations. For example,
explicit Euler schemes for the geodesic active contour limit the numerical step for stability.
In order to overcome these speed limitations, a multi-resolution approach was used in [123],
and additional heuristic steps were applied in [86], like computationally preferring areas of
high energy.
In this paper we introduce a new method that maintains the numerical consistency and
makes the geodesic active contour model computationally efficient. The efficiency is achieved
by cancelling the limitation on the time step in the numerical scheme, by limiting the computations to a narrow band around the active contour, and by applying an efficient
re-initialization technique.
3.3 From snakes to geodesic active contours
Snakes were introduced in [67, 115] as an active contour model for boundary segmentation. The model is derived by a variational principle from a non-geometric measure. The
model starts from an energy functional that includes ‘internal’ and ‘external’ terms that are
integrated along a curve.
Let the curve C(p) = {x(p), y(p)}, where p ∈ [0, 1] is an arbitrary parameterization. The snake model is defined by the energy functional
$$S[C] = \int_0^1 \left( |C_p|^2 + \alpha |C_{pp}|^2 + 2\beta g(C) \right) dp,$$
where $C_p \equiv \{\partial_p x(p), \partial_p y(p)\}$, and α and β are positive constants.
The last term represents an external energy, where g() is a positive edge indicator function that depends on the image I(x, y); it gets small values along the edges and higher values elsewhere, for example g(x, y) = 1/(|∇I|² + 1). Taking the variational derivative with respect to the curve, δS[C]/δC, we obtain the Euler-Lagrange equations
$$-C_{pp} + \alpha C_{pppp} + \beta \nabla g = 0.$$
One may start with a curve that is close to a significant local minimum of S[C], and use the Euler-Lagrange equations as a gradient descent process that leads the curve to its proper position. Formally, we add a time variable t, and write the gradient descent process as $\partial_t C = -\delta S[C]/\delta C$, or explicitly
$$\frac{dC}{dt} = C_{pp} - \alpha C_{pppp} - \beta \nabla g.$$
The snake model is a linear model, and thus an efficient and powerful tool for object segmentation and edge integration, especially when there is a rough approximation of the boundary
location. There is however an undesirable property that characterizes this model. It depends
on the parameterization. The model is not geometric.
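For concreteness, the following sketch computes the edge indicator g(x, y) = 1/(|∇I|² + 1) mentioned above for a gray level image; the Gaussian pre-smoothing and its scale sigma are assumptions of this sketch, not part of the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def edge_indicator(image, sigma=1.0):
    """Edge indicator g = 1 / (|grad I|^2 + 1) for a gray-level image.

    The image is smoothed with a Gaussian of scale `sigma` (an assumed
    pre-processing step) before the gradient is estimated.
    """
    smoothed = gaussian_filter(image.astype(float), sigma)
    gy, gx = np.gradient(smoothed)          # derivatives along rows and columns
    grad_sq = gx ** 2 + gy ** 2
    return 1.0 / (grad_sq + 1.0)            # close to 0 on edges, close to 1 in flat areas
```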
Motivated by the theory of curve evolution, Caselles et al. [14] and Malladi et al. [78] introduced a geometric flow that includes internal and external geometric measures. Given an initial curve $C_0$, the geometric flow is given by the planar curve evolution equation
$$C_t = g(C)(\kappa - v)\vec N,$$
where $\vec N$ is the normal to the curve, $\kappa\vec N$ is the curvature vector, v is an arbitrary constant, and g(), as before, is an edge indication scalar function. This is a geometric flow, that is, free of the parameterization. Yet, as long as g does not vanish along the boundary, the curve continues its propagation and may skip its desired location. One remedy, proposed in [78], is a control procedure that monitors the propagation and sets g to zero as the curve gets closer to the edge.
The geodesic active contour model was introduced in [16, 17, 18, 19], see also [68, 104],
as a geometric alternative for the snakes. The model is derived from a geometric functional,
where the arbitrary parameter p is replaced with a Euclidean arclength ds = |C_p| dp. The functional reads
$$S[C] = \int_0^1 \left( \alpha + \tilde g(C) \right) |C_p|\, dp.$$
It may be shown to be equivalent to the arclength parameterized functional
$$S[C] = \int_0^{L(C)} \tilde g(C)\, ds + \alpha L(C),$$
where L(C) is the total Euclidean length of the curve. One may equivalently define $g(x, y) = \tilde g(x, y) + \alpha$, in which case
$$S[C] = \int_0^{L(C)} g(C)\, ds,$$
i.e., minimization of the modulated arclength g(C) ds. The Euler-Lagrange equations as a gradient descent process are
$$\frac{dC}{dt} = \left( g(C)\kappa - \langle \nabla g, \vec N \rangle \right) \vec N.$$
Again, internal and external forces are coupled together, yet this time in a way that leads towards a meaningful minimum, which is the minimum of the functional.^a One may add an additional force that comes from an area minimization term, motivated by the balloon force [26]. This way, the contour may be directed to propagate inwards by minimization of the interior. The functional with the additional area term modulated by an edge indicator function reads
$$S[C] = \int_0^{L(C)} g(C)\, ds + \alpha \int_\Omega g\, da,$$
where da is an area element and Ω is the interior of the region enclosed by the contour C. The Euler-Lagrange equation as steepest descent, following the development in [130, 129], is
$$\frac{dC}{dt} = \left( g(C)\kappa - \langle \nabla g, \vec N \rangle - \alpha g(C) \right) \vec N.$$
^a An early version of a geometric-variational model, in which $S[C] = \int g(C)\, ds \,/ \int ds$, that deals with open curves was proposed in [46].
The connection between classical snakes, and the geodesic active contour model was
established in [17] via Maupertuis’ Principle of least action [35]. By Fermat’s Principle, the
final geodesic active contours are geodesics in an isotropic non-homogeneous medium.
Recent applications of the geodesic active contours include 3D shape from multiple views,
also known as shape from stereo [43], segmentation in 3D movies [76], tracking in 2D movies
[86], and refinement of efficient segmentation in 3D medical images [77]. The curve propagation equation is just part of the whole model. Subsequently, the geometric evolution is
implemented by the Osher-Sethian level set method [85].
3.3.1 Level set method
The Osher-Sethian [85] level set method considers evolving fronts in an implicit form. It is
a numerical method that works on a fixed coordinate system and takes care of topological
changes of the evolving interface.
Consider the general geometric planar curve evolution
$$\frac{dC}{dt} = V \vec N,$$
where V is any intrinsic quantity, i.e., V does not depend on a specific choice of parameterization. Now, let $\phi(x, y) : \mathbb{R}^2 \to \mathbb{R}$ be an implicit representation of C, such that C = {(x, y) : ϕ(x, y) = 0}. One example is a distance function from C defined over the coordinate plane, with negative sign in the interior and positive sign in the exterior of the closed curve.
The evolution for ϕ such that its zero set tracks the evolving contour is given by
$$\frac{d\phi}{dt} = V |\nabla\phi|.$$
This relation is easily proven by applying the chain rule and using the fact that the normal of any level set, ϕ = constant, is given by the gradient of ϕ:
$$\frac{d\phi}{dt} = \langle \nabla\phi, C_t \rangle = \langle \nabla\phi, V\vec N \rangle = V \left\langle \nabla\phi, \frac{\nabla\phi}{|\nabla\phi|} \right\rangle = V |\nabla\phi|.$$
This formulation enables us to implement the curve evolution on the fixed x, y coordinate system. It automatically handles topological changes of the evolving curve: the zero level set may split from a single simply connected curve into two separate curves.
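As a minimal illustration of the level set formulation (not the AOS scheme developed below), the following sketch performs one explicit update of dϕ/dt = V|∇ϕ| with V chosen as the curvature κ, i.e. the curvature flow used later in the experiments; the time step dt is assumed small enough for stability.

```python
import numpy as np

def curvature_flow_step(phi, dt=0.1, eps=1e-8):
    """One explicit step of d(phi)/dt = kappa * |grad(phi)| on a 2D grid.

    kappa = div(grad(phi) / |grad(phi)|) is the curvature of the level sets.
    """
    phi_y, phi_x = np.gradient(phi)
    norm = np.sqrt(phi_x ** 2 + phi_y ** 2) + eps
    # divergence of the normalized gradient field = curvature of the level sets
    nyy, _ = np.gradient(phi_y / norm)      # d/dy of the y-component
    _, nxx = np.gradient(phi_x / norm)      # d/dx of the x-component
    kappa = nxx + nyy
    return phi + dt * kappa * norm
```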
Specifically, the corresponding geodesic active contour model written in its level set formulation is given by
$$\frac{d\phi}{dt} = \operatorname{div}\left( g(x, y) \frac{\nabla\phi}{|\nabla\phi|} \right) |\nabla\phi|.$$
Including a weighted area minimization term that yields a constant velocity, modulated by the edge indication function, we have
$$\frac{d\phi}{dt} = \left( \alpha g(x, y) + \operatorname{div}\left( g(x, y) \frac{\nabla\phi}{|\nabla\phi|} \right) \right) |\nabla\phi|.$$
We have yet to determine a numerical scheme and an appropriate edge indication function
g. An explicit Euler scheme with forward time derivative, introduces a numerical limitation
on the time step needed for stability. Moreover, the whole domain needs to be updated
each step, which is a time consuming operation for a sequential computer. The narrow band
approach overcomes the last difficulty by limiting the computations to a narrow strip around
the zero set. First suggested by Chopp [24] in the context of the level set method, and later developed in [1], the narrow band idea limits the computation to a tight strip of a few grid
points around the zero set. The rest of the domain serves only as a sign holder. As the
curve evolves, the narrow band changes its shape and serves as a dynamic numerical support
around the location of the zero level set.
3.3.2 The AOS scheme
Additive operator splitting (AOS) schemes were introduced by Weickert et al. [124] as an
unconditionally stable numerical scheme for non-linear diffusion in image processing. Let us
briefly review its main ingredients and adapt it to our model.
The original AOS model deals with the Perona-Malik [88] non-linear image evolution equation of the form $\partial_t u = \operatorname{div}(g(|\nabla u|)\nabla u)$, with the image given as the initial condition, $u(0) = u_0$. Let us re-write explicitly the right hand side of the evolution equation:
$$\operatorname{div}(g(|\nabla u|)\nabla u) = \sum_{l=1}^{m} \partial_{x_l} \left( g(|\nabla u|)\, \partial_{x_l} u \right),$$
where l is an index running over the m dimensions of the problem, e.g., for a 2D image m = 2, $x_1 = x$, and $x_2 = y$.
As a first step towards discretization, consider the operator
$$A_l(u^k) = \partial_{x_l} \left( g(|\nabla u^k|)\, \partial_{x_l} \right),$$
where the superscript k indicates the iteration number, e.g., $u^0 = u_0$. We can write the explicit scheme
$$u^{k+1} = \left[ I + \tau \sum_{l=1}^{m} A_l(u^k) \right] u^k,$$
where τ is the numerical time step. It requires an upper limit for τ if one desires to establish convergence to a stable steady state. Next, the semi-implicit scheme
$$u^{k+1} = \left[ I - \tau \sum_{l=1}^{m} A_l(u^k) \right]^{-1} u^k$$
is unconditionally stable, yet inverting the large bandwidth matrix is a computationally expensive operation.
Finally, the consistent, first order, semi-implicit, additive operator splitting scheme
$$u^{k+1} = \frac{1}{m} \sum_{l=1}^{m} \left[ I - m\tau A_l(u^k) \right]^{-1} u^k$$
may be applied to efficiently solve the non-linear diffusion.
The AOS semi-implicit scheme in 2D is then given by a linear tridiagonal system of equations
$$u^{k+1} = \frac{1}{2} \sum_{l=1}^{2} \left[ I - 2\tau A_l(u^k) \right]^{-1} u^k, \qquad (3.1)$$
where $A_l(u^k)$ is a matrix corresponding to derivatives along the l-th coordinate axis. It can be efficiently solved for $u^{k+1}$ by the Thomas algorithm, see [124].
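A generic sketch of the Thomas algorithm for such tridiagonal systems is given below; it is not tied to the specific matrices $A_l$ and simply solves a system with sub-diagonal a, diagonal b, super-diagonal c and right hand side d in linear time.

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system in O(n): a is the sub-diagonal, b the
    diagonal, c the super-diagonal, d the right hand side (all length n;
    a[0] and c[-1] are ignored)."""
    n = len(d)
    cp = np.zeros(n)
    dp = np.zeros(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                       # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = np.zeros(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):              # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```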
In our case, the geodesic active contour model is given by
$$\partial_t \phi = \operatorname{div}\left( g(|\nabla u_0|) \frac{\nabla\phi}{|\nabla\phi|} \right) |\nabla\phi|,$$
where $u_0$ is the image and ϕ is the implicit representation of the curve. Since our interest is only in the zero level set of ϕ, we can reset ϕ to be a distance function every numerical iteration. One nice property of distance maps is their unit gradient magnitude almost everywhere. Thereby, the short term evolution for the geodesic active contour given by a distance map, with |∇ϕ| = 1, is
$$\partial_t \phi = \operatorname{div}\left( g(|\nabla u_0|)\, \nabla\phi \right).$$
Note that now $A_l(\phi^k) = A_l(u_0)$, which means that the matrices $[I - 2\tau A_l(u_0)]^{-1}$ can be computed once for the whole image. Yet, we need to keep the ϕ function as a distance map. This is done through re-initialization by Sethian's fast marching method every iteration.
It is now simple to introduce a weighted area ‘balloon’-like force into the scheme. The resulting AOS scheme with the ‘balloon’ then reads
$$\phi^{k+1} = \frac{1}{2} \sum_{l=1}^{2} \left[ I - 2\tau A_l(u_0) \right]^{-1} \left( \phi^k + \tau \alpha g(u_0) \right), \qquad (3.2)$$
where α is the weighted area/balloon coefficient.
In order to reduce the computational cost we use a multi-scale approach [73]. We construct a Gaussian pyramid of the original image. The algorithm is first applied at the lower
resolution. Next, the zero set is embedded at a higher resolution and the ϕ distance function
is computed. Moreover, the computations are performed only within a limited narrow band
around the zero set. The narrow band automatically modifies its shape as we re-initialize the distance map.
3.3.3 Re-initialization by the fast marching method
In order to maintain sub-grid accuracy, we detect the zero level set curve with sub-pixel
accuracy. We apply a linear interpolation in the four pixel cells in which ϕ changes its sign.
The grid points with the exact distance to the zero level set are then used to initialize the
fast marching method.
Sethian’s fast marching method [103, 102] is a computationally optimal numerical method
for distance computation on rectangular grids. The method keeps a front of updated points
sorted in a heap structure, and constructs a numerical solution iteratively, by fixing the
smallest element at the top of the heap and expanding the solution to its neighboring grid
points. This method enjoys a computational complexity bound of O(N log N ), where N is
the number of grid points in the narrow band. See also [22, 118], where consistent O(N log N )
schemes are used to compute distance maps on rectangular grids.
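As a rough, easy-to-run stand-in for this re-initialization step, the sketch below rebuilds a signed distance map from the sign of ϕ with a Euclidean distance transform; this is a different algorithm than fast marching and ignores the sub-pixel initialization described above, so it only illustrates the purpose of the step.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def reinitialize(phi):
    """Reset phi to an approximate signed distance map of its zero level set:
    negative inside (phi < 0), positive outside, as in the text."""
    inside = phi < 0
    dist_out = distance_transform_edt(~inside)   # distance of outside pixels to the interior
    dist_in = distance_transform_edt(inside)     # distance of inside pixels to the exterior
    return dist_out - dist_in
```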
3.4 Edge indicator functions for color and movies
What is a proper edge indicator for color images? Several generalizations for the gradient
magnitude of gray level images were proposed, see e.g. [33, 98, 113]. In [86] Paragios and Deriche introduced a probability based edge indicator function for movies. In this paper we
have chosen the geometric philosophy to extract an edge indicator. We consider a measure
suggested by the Beltrami framework in [113], to construct an edge indicator function.
3.4.1 Edges in Color
According to the Beltrami framework, a color image is considered as a two dimensional
surface in the five dimensional spatial-spectral space. The metric tensor is used to measure
distances on the image manifold. The magnitude of this tensor is an area element of the
color image surface, which can be considered as a generalization of the gradient magnitude.
Formally, the metric tensor of the two-dimensional image, given by the two-dimensional surface {x, y, R(x, y), G(x, y), B(x, y)} in the {x, y, R, G, B} space, is
$$(g_{ij}) = \begin{pmatrix} 1 + R_x^2 + G_x^2 + B_x^2 & R_x R_y + G_x G_y + B_x B_y \\ R_x R_y + G_x G_y + B_x B_y & 1 + R_y^2 + G_y^2 + B_y^2 \end{pmatrix},$$
where $R_x \equiv \partial_x R$. Our edge indicator is the largest eigenvalue of the structure tensor metric. It is the eigenvalue in the direction of maximal change in $dR^2 + dG^2 + dB^2$, and it reads
$$\lambda = 1 + \frac{1}{2}\sum_i |\nabla u^i|^2 + \sqrt{\left( \frac{1}{2}\sum_i |\nabla u^i|^2 \right)^2 - \frac{1}{2}\sum_i \sum_j |\nabla u^i \times \nabla u^j|^2}, \qquad (3.3)$$
where $u^1 = R$, $u^2 = G$, $u^3 = B$. Then, the edge indicator function g is given by a decreasing function of λ, e.g., $g = (1 + \lambda^2)^{-1}$.
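A direct sketch of Equation (3.3) for an RGB image follows; gradients are estimated with simple finite differences, and the cross product of two planar gradients is taken as the scalar z-component.

```python
import numpy as np

def color_edge_indicator(rgb):
    """Largest eigenvalue lambda of the color metric (Eq. 3.3) and the edge
    indicator g = 1 / (1 + lambda^2) for an RGB image of shape (H, W, 3)."""
    channels = [rgb[..., i].astype(float) for i in range(3)]
    grads = [np.gradient(ch) for ch in channels]       # (d/dy, d/dx) per channel
    sum_sq = sum(gy ** 2 + gx ** 2 for gy, gx in grads)
    cross_sq = np.zeros_like(sum_sq)
    for i in range(3):
        for j in range(3):
            gyi, gxi = grads[i]
            gyj, gxj = grads[j]
            cross = gxi * gyj - gyi * gxj               # z-component of grad u^i x grad u^j
            cross_sq += cross ** 2
    lam = 1.0 + 0.5 * sum_sq + np.sqrt(np.maximum(
        (0.5 * sum_sq) ** 2 - 0.5 * cross_sq, 0.0))     # guard against numerical negatives
    return 1.0 / (1.0 + lam ** 2)
```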
3.4.2 Tracking objects in movies
Let us explore two possibilities to track objects in movies. The first considers the whole movie volume as a Riemannian space, as done in [19]. In this case the active contour becomes an active surface. The AOS scheme in the spatial-temporal 3D hybrid space is
$$\phi^{k+1} = \frac{1}{3} \sum_{l} \left[ I - 3\tau A_l(u_0) \right]^{-1} \phi^k,$$
where $A_l(u_0)$ is a matrix corresponding to derivatives along the l-th coordinate axis, where now $l \in \{x, y, T\}$.
The edge indicator function is again derived from the Beltrami framework, where for color movies we pull back the metric
$$(g_{ij}) = \begin{pmatrix} 1 + R_x^2 + G_x^2 + B_x^2 & R_x R_y + G_x G_y + B_x B_y & R_x R_T + G_x G_T + B_x B_T \\ \cdot & 1 + R_y^2 + G_y^2 + B_y^2 & R_y R_T + G_y G_T + B_y B_T \\ \cdot & \cdot & 1 + R_T^2 + G_T^2 + B_T^2 \end{pmatrix}, \qquad (3.4)$$
which is the metric of a 3D volume in the 6D {x, y, T, R, G, B} spatial-temporal-spectral space. Now we have $\sqrt{\det(g_{ij})}\, dx\, dy\, dT$ as a volume element of the image and, again, the largest eigenvalue of the structure tensor metric, λ, can be used as an edge indicator. Intuitively, the larger λ gets, the smaller the spatial-temporal steps one should apply in order to cover the same volume.
A different approach uses the contour location in frame n as an initial condition for the
2D solution in frame n + 1, see e.g. [15, 86]. The above edge indicator is still valid in this
case. Note that the aspect ratios between the time, the image space, and the intensity should be determined according to the application.
The first approach was found to yield accurate results in off-line tracking analysis, while the second approach gives up some of the accuracy, achieved by temporal smoothing in the first approach, for efficiency in real-time tracking.
3.5 Implementation details
There are some implementation considerations one should be aware of. For example, the
summation over the two dimensions in Equations (3.1, 3.2) should be done in such a way that the matrices $A_1(u_0)$ and $A_2(u_0)$, corresponding to the derivatives along the x and y axes respectively, are tridiagonal. This is achieved by spanning the matrix ϕ^k once by
rows and once by columns.
Working on the whole domain is fairly straightforward. The matrix ϕk is represented as
vectors ϕki corresponding to a single row/column. The vectors [I − 2τ Ail (u0 )]−1 ϕki , where Ail
of size |ϕki | × |ϕki | is a matrix corresponding to the derivatives of a single row/column, are
computed using the Thomas algorithm and summed according to their coordinates to yield
the matrix ϕk+1 .
Using the narrow band approach is a bit more tricky. The question is: what is the most efficient representation for the narrow band that would allow us to find, for every row/column, the segments that belong to the narrow band? One may suggest using run-length encoding, but the standard static run-length implementation is not sufficient. This is due to
the fact that the narrow band is rebuilt every iteration using the fast marching algorithm,
which generates narrow band pixels in arbitrary order. Creating the run length encoding
from such a stream of pixels can be done either off-line, by first constructing and then
scanning a map of the whole image (which is clearly inefficient), or on-line using a dynamic
data structure.
In our implementation we use two arrays of linked lists, one for the rows and one for
the columns, where each linked list corresponds to one row/column and contains segments
of adjacent pixels belonging to the narrow band (see Figure 3.1). Each segment is defined
Figure 3.1: Two arrays of linked lists (one for the rows, one for the columns) used for the narrow band representation.
by its first and last pixel coordinates. Since the narrow band is generated as the fast marching algorithm grows outwards, adding a new pixel usually means just changing one of the
boundaries of an already existing segment. For reasonably simple contours the number of
times a new segment is created or an existing one is merged with another is relatively low.
Therefore, the number of segments per row/column is always small and the complexity of
adding a new pixel to the narrow band is practically O(1).
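A possible Python rendering of this data structure is sketched below; the coordinates and insertion policy are illustrative, and the case where a new pixel bridges two existing segments (which would require a merge) is omitted for brevity.

```python
class SegmentList:
    """Segments of adjacent narrow band pixels along one row (or column).

    Each segment is stored as a [first, last] pair; a pixel that touches an
    existing segment only moves its boundary, so for reasonably simple
    contours the expected cost per insertion is O(1).
    """
    def __init__(self):
        self.segments = []                      # disjoint [first, last] pairs

    def add(self, p):
        for seg in self.segments:
            if seg[0] - 1 <= p <= seg[1] + 1:   # extends (or lies inside) this segment
                seg[0] = min(seg[0], p)
                seg[1] = max(seg[1], p)
                return
        self.segments.append([p, p])            # start a new segment

# one list per row and one per column, as in Figure 3.1
rows = [SegmentList() for _ in range(256)]
cols = [SegmentList() for _ in range(256)]
for y, x in [(10, 12), (10, 13), (11, 12)]:     # narrow band pixels in arbitrary order
    rows[y].add(x)
    cols[x].add(y)
```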
The calculations are performed separately for every horizontal and vertical segment ϕki
and the vectors [I − 2τ Ail (u0 )]−1 ϕki , where Ail of size |ϕki | × |ϕki | is a matrix corresponding to
the derivatives of a single segment, are summed according to their coordinates to yield the
new level set function ϕk+1 for the narrow band pixels only.
Another issue requiring special attention is the value of the time step. If we choose a
relatively large time step, the active contour may skip over the object boundary. The time
step should thus correspond to the numerical support of the edges (edge width). This, in
turn, dictates the width of the narrow band that should be wide enough, so that the contour
would not escape it in one iteration. One way to overcome the time step limitation is to use coarse-to-fine scales of boundary smoothing, with an appropriate time step for each scale.
Finally, since the method is based on the AOS, which is a first-order approximation scheme,
the numerical error grows linearly with the time step.
3.6 Experimental Results
As a simple example, the proposed method can be used as a consistent, unconditionally
stable, and computationally efficient, numerical approximation for the curvature flow. The
Figure 3.2: Curvature flow by the proposed scheme. A non-convex curve vanishes in finite time at a circular point by Grayson's Theorem. The curve evolution is presented for two different time steps. Left: τ = 20; Right: τ = 50.
Figure 3.3: Curvature flow CPU time for the explicit scheme and the implicit AOS scheme. First, the whole domain is updated, next, the narrow band is used to increase the efficiency, and finally the AOS speeds up the whole process. For the explicit scheme the maximal time step that still maintains stability is chosen. For the AOS scheme, CPU times for several time steps are presented.
curvature flow, also known as curve shortening flow or geometric heat equation, is a well
studied equation in the theory of curve evolution. It is proven to bring every simple closed
curve into a circular point in finite time [48, 57]. Figure 3.2 shows an application of the
proposed method for a curve evolving by its curvature and vanishes at a point. One can see
how the number of iterations needed for the curve to converge to a point decreases as the
time step is increased.
We tested several implementations for the curvature flow. Figure 3.3 shows the CPU time
it takes the explicit and implicit schemes to evolve a contour into a circular point. For the
explicit scheme we tested both the narrow band and the naive approach in which every grid
point is updated every iteration. The tests were performed on an Ultra SPARC 360M Hz
machine for a 256 × 256 resolution image.
It should be noted that when the narrow band approach is used, the band width should be
increased as τ grows to ensure that the curve does not escape the band in one iteration.
Figure 3.4 shows multiple objects segmentation for a static color image. Here we used a
balloon force to propagate the contour through the narrow passages between objects.
Figure 3.5 presents an example of a medical image application, where gray matter segmentation is performed on a single slice of a human head MRI image. The task is to detect
a narrow layer of the brain bounded by two interfaces - the outer cortical surface (Cerebral
Spinal Fluid(CSF)/gray matter interface) and the inner cortical surface (gray matter/white
Figure 3.4: Multiple objects segmentation in a static color image.
matter interface). The segmentation is performed by two active contours initialized inside
the white matter regions. The negative balloon force coefficient is used to expand the contour
towards the region boundary and the edge indicator functions are chosen to respond only to
the edge points corresponding to the characteristic intensity profiles of the CSF/gray matter
and the gray matter/white matter interfaces respectively. This specific medical problem
introduces new challenging difficulties and possible solutions will be reported elsewhere [53].
Figure 3.5: Gray matter segmentation in a MRI brain image. The white contour converges
to the outer cortical surface and the black contour converges to the inner cortical surface.
Figures 3.6 and 3.7 show segmentation results for color movies with difficult spatial
textures. The tracking is performed at two resolutions. At the lower resolution we search for
temporal edges and at the higher resolution we search for strong spatial edges. The contour
found in the coarse grid is used as the initial contour at the fine grid.
It is possible to compute the inverse matrices of the AOS once for the whole image, or to
invert small sub-matrices as new points enter or exit the narrow band. There is obviously a
trade-off between the two approaches. For initialization, we have chosen the first approach,
since the initial curve starts at the frame of the image and has to travel over most of the
image until it captures the moving objects. For tracking moving objects in a movie, we use the local approach, since now the curve only has to adjust itself to local changes.
Figure 3.6: Tracking a cat in a color movie by the proposed scheme. Top: Segmentation of the cat in a single frame (steps 2, 40 and 60 of the curve evolution). Bottom: Tracking the walking cat in the 50 frame sequence (frames 2, 22 and 50).
Figure 3.7: Tracking two people in a color movie. Top: curve evolution in a single frame (steps 2, 30, 48 and 70). Bottom: tracking two walking people in a 60 frame movie (frames 2, 21, 38 and 59).
3.7 Conclusions
It was shown that an integration of advanced numerical techniques yields a computationally efficient algorithm that solves a geometric segmentation model. The numerical algorithm is consistent with the underlying continuous model. It combines the narrow band level set method with additive operator splitting and the fast marching method. The proposed ‘fast geodesic
active contour’ scheme was applied successfully for image segmentation and tracking in movie
sequences and color images.
Chapter 4
Non-Rigid Objects - Motion Based
Classification
4.1 Introduction
Our goal is to utilize the motion information directly as a tool for object classification. Many examples taken from nature and from psychophysical experiments [80] show that moving objects are easier to recognize than stationary ones, and in some cases motion information alone, even at extremely low resolution as in MLDs [64], [65], is sufficient for recognition and classification.
Moreover, we argue that in many instances motion analysis is the best or the only way
to deal with object recognition when the classification categories are not shape based or in
cases when shape information is hardly obtainable.
4.1.1 General Framework
The most general approach in object classification is to represent each category by a single
prototype, that is, an average member of the category carrying the most prominent common
properties of the class, and then to compare an object in question to the prototype of each
category.
In this context two questions are to be answered in order to build a classifier. First, how
should we represent category prototypes? Second, what methods are to be used to compare
an object with the prototypes?
Since the problem of motion based classification has not been extensively explored yet,
let us take a look at the adjacent field of shape based recognition and classification and see
what are the main problems there and how they are solved.
Among the intrinsic obstacles making shape based recognition a difficult problem one can
name the variability of apparent object view due to several factors [120]: viewing position,
photometric effects, object setting and changing shape.
Let us see how these factors affect motion based classification. The instantaneous motion
of objects can be generally decomposed into three basic motions:
• motion of the center of mass of the moving object (translation)
• changes in orientation (rotation) and
• object deformation (this also includes changes in relative position of object parts)
It is clear enough that the photometric effects do not influence motion characteristics of
a moving object once they are extracted from an image sequence. Shape changes, in the case of motion based recognition, are not an obstacle but a means of classification. Unless occlusion occurs, object setting is not really a problem, since the analysis is performed not on
the whole image but only on a part of it found by the segmentation and tracking processes
which contains only the object of interest.
On the other hand, viewing position can significantly alter the characteristics of apparent
motion and in some cases (e.g. translation along the camera optical axis) even makes it hardly
observable. There are several ways to cope with this problem. The first is to use 3D motion
reconstruction, i.e. to restore 3D velocity vectors from the 2D projection of object motion
onto an image plane. This is certainly an option for rigid bodies and usually requires both a motion model and an image acquisition model to be defined. We used this approach in [37],
where two different imaging models have been defined: one for image sequence stabilization
and another for the analysis of nearby objects motion.
Another way is to keep motion prototypes for a number of views and do the classification
regarding each view as a separate class. For example Yacoob and Black [127] recognize
different human walking directions (parallel, diagonal, perpendicular away, perpendicular
forward) as special classes of human motion.
Another possible approach is to find a transformation that brings the observed motion to
some canonical form stored as a prototype or, alternatively, to restore the observed motion
as a combination of several motion exemplars stored as class representatives. This is, in a sense, analogous to the alignment approach in shape based recognition, where one wishes to
compensate for the transformation separating the viewed object and the stored model and
then to compare them.
Alignment is one of the three large groups of methods used in shape based recognition.
The other two are: invariant properties methods and parts decomposition.
Invariant properties methods assume that certain properties remain invariant under the
transformations that an object is allowed to make. In this work we are looking for those
motion characteristics that are invariant to varying viewing position and motion rate.
Motion rate is an additional factor adding to the variability of observed motion. This new
factor stems from the temporal nature of the input data and does not exist in shape based
recognition. An observable motion rate can change both due to viewing distance variations
and due to the simple fact that the same object can move at different speeds.
A lot of natural, self-propelled objects exhibit periodical motion. Periodicity is a salient
feature in human perception. In nature, many behaviors of animals and humans consist of
specialized periodic motion patterns. For example, certain male birds bob their head when
courting a female, bees dance in a figure-8 pattern to signal that food is near the hive,
elephants toss-up their head when threatened, and humans vigorously shake their hand in
the air to get the attention of another [112], [122].
If so, one motion period (e.g. two steps of a walking man or one rabbit hop) can be used as a natural unit of motion, and extracted motion characteristics can be normalized by
a period duration. The period of motion is often linked to important properties that may
otherwise be difficult to determine, e.g. heart-rate to activity level or period of locomotion
to velocity. This gives rise to another set of problems: detection and characterization of
periodic activities.
In this chapter we present a classification approach based on the analysis of the changing appearance of the moving bodies. The framework is based on segmentation and tracking by the geodesic active contour approach, yielding high accuracy results that allow the periodicity analysis to be performed on the global characteristics of the extracted
deformable object contours. The normalized image sequences are parameterized using the
PCA approach and classified by the k-nearest-neighbor algorithm. The experiments demonstrate the ability of the proposed scheme to distinguish between various object classes by
their static appearance and dynamic behavior.
4.2 Futurism and Periodic Motion Analysis
Futurism is a movement in art, music, and literature that began in Italy around 1909, marked especially by an effort to give formal expression to the dynamic energy and
movement of mechanical processes. A typical example is the ‘Dynamism of a Dog on a
Leash’ by Giacomo Balla, who lived during the years 1871-1958 in Italy, see Figure 4.1 [5].
In this painting one could see how the artist captures in one still image the periodic walking
motion of a dog on a leash. Following Futurism, we show how periodic motions can be
represented by a small number of eigen-shapes that capture the whole dynamic mechanism
of periodic motions. Singular value decomposition of the object silhouettes serves as a basis for behavior classification by principal component analysis. Figure 4.2 presents a running horse video sequence and its eigen-shape decomposition. One can see the similarity
between the first eigen-shapes - Figure 4.2(c,d), and another futurism style painting ”The
Red Horseman” by Carlo Carra [5] - Figure 4.2(e). The boundary contour of the moving
non-rigid object is computed efficiently and accurately by the fast geodesic active contours
[53]. After normalization, the implicit representation of a sequence of silhouette contours
given by their corresponding binary images, is used for generating eigen-shapes for the given
motion. Singular value decomposition produces the eigen-shapes that are used to analyze
the sequence. We show examples of object as well as behavior classification based on the
eigen-decomposition of the sequence.
4.3 Related Work
Motion based recognition has received a lot of attention in the last several years. This is due to the general recognition of the fact that the direct use of temporal data may significantly improve our ability to solve a number of basic computer vision problems, such as image segmentation, tracking, and object classification, as well as to the availability of low cost computer systems powerful enough to process large amounts of data.
In general, when analyzing a moving object, one can use two main sources of information
to rely upon: changes of the moving object position (and orientation) in space, and object
Figure 4.1: ‘Dynamism of a Dog on a Leash’ 1912, by Giacomo Balla. Albright-Knox Art
Gallery, Buffalo.
deformations.
Object position is an easily obtained characteristic, applicable to both rigid and non-rigid bodies, that is provided by most target detection and tracking systems, usually as the center of the target bounding box. A number of techniques [56], [55], [42], [110] were
proposed for the detection of motion events and for the recognition of various types of
motions based on the analysis of the moving object trajectory and its derivatives. Detecting
object orientation is a more challenging problem which is usually solved by fitting a model
that may vary from a simple ellipsoid [110] to a complex 3D vehicle model [69] or a specific
aircraft-class model adapted for noisy radar images as in [29].
While the orientation characteristic is more applicable to rigid objects, it is object deformation that contains the most essential information about the nature of non-rigid body motion. This is especially true for natural non-rigid objects in locomotion that exhibit substantial changes in their apparent view, as in this case the motion itself is caused by these deformations, e.g. walking, running, hopping, crawling, flying, etc.
There exists a large number of papers dealing with the classification of moving non-rigid
objects and their motions based on their appearance. Lipton et al. describe a method for moving target classification based on their static appearance [74] and using skeletonization [47]. Polana and Nelson [93] used local motion statistics computed for image grid cells
to classify various types of activities. An original approach using the temporal templates
and motion history images (MHI) for action representation and classification was suggested
by Davis and Bobick in [7]. Cutler and Davis [32] describe a system for real-time moving
object classification based on periodicity analysis. It would be impossible to describe here
the whole spectrum of papers published in this field and we refer the reader to the following
surveys [20], [49] and [81].
The work most closely related to our approach is that of Yacoob and Black [128], where different types of human activities were recognized using a parameterized representation of measurements collected during one motion period. The measurements were eight motion parameters
tracked for five body parts (arm, torso, thigh, calf and foot).
Figure 4.2: (a) running horse video sequence, (b) first 10 eigen-shapes, (c,d) first and second eigen-shapes enlarged, (e) ‘The Red Horseman’, 1914, by Carlo Carra, Civico Museo d’Arte Contemporanea, Milan.
In this paper we concentrate on the analysis of the deformations of moving non-rigid
bodies in an attempt to extract characteristics that allow us to distinguish between different
types of motions and different classes of objects.
4.4 Our Approach
Our basic assumption is that for any given class of moving objects, like humans, dogs,
cats, and birds, the apparent object view in every phase of its motion can be encoded
as a combination of several basic body views or configurations. Assuming that a living
creature exhibits a pseudo-periodic motion, one motion period can be used as a comparable
information unit. Then, by extracting the basic views from a large training set and projecting
onto them the observed sequence of object views collected from one motion period, we obtain
a parameterized representation of the object’s motion that can be used for classification.
Unlike [128] we do not assume an initial segmentation of the body into parts and do not
explicitly measure the motion parameters. Instead, we work with the changing apparent
view of deformable objects and use the parameterization induced by their form variability.
In what follows we describe the main steps of the process that include,
• Segmentation and tracking of the moving object that yield an accurate external object
boundary in every frame.
• Periodicity analysis, in which we estimate the frequency of the pseudo-periodic motion
and split the video sequence into single-period intervals.
• Frame sequence alignment that brings the single-period sequences above to a standardized form by compensating for temporal shift, speed variations, different object sizes
and imaging conditions.
• Parameterization by building an eigen-shape basis from a training set of possible object
views and projecting the apparent view of a moving body onto this basis.
4.4.1 Segmentation and Tracking
As our approach is based on the analysis of deformations of the moving body, the accuracy
of the segmentation and tracking algorithm in finding the target outline is crucial for the
quality of the final result. This rules out a number of available or easy-to-build tracking
systems that provide only a center of mass or a bounding box around the target and calls
for more precise and usually more sophisticated solutions.
Therefore we decided to use the geodesic active contour approach [17] and specifically
the ‘fast geodesic active contour’ method described in [53], where the segmentation problem
is expressed as a geometric energy minimization. We search for a curve C that minimizes
the functional
$$S[C] = \int_0^{L(C)} g(C)\, ds,$$
where ds is the Euclidean arclength, L(C) is the total Euclidean length of the curve, and g is a positive edge indicator function in a 3D hybrid spatial-temporal space that depends on the pair of consecutive frames $I^{t-1}(x, y)$ and $I^t(x, y)$. It gets small values along the spatial-temporal edges, i.e. moving object boundaries, and higher values elsewhere.
In addition to the scheme described in [53], we also use the background information whenever a static background assumption is valid and a background image B(x, y) is available.
In the active contours framework this can be achieved either by modifying the g function
to reflect the edges in the difference image D(x, y) = |B(x, y) − I t (x, y)|, or by introducing
additional area integration terms to the functional S[C]:
$$S[C] = \int_0^{L(C)} g(C)\, ds + \lambda_1 \int_{inside(C)} |D(x, y) - c_1|^2\, da + \lambda_2 \int_{outside(C)} |D(x, y) - c_2|^2\, da,$$
where $\lambda_1$ and $\lambda_2$ are fixed parameters and $c_1$, $c_2$ are given by
$$c_1 = \operatorname{average}_{inside(C)} [D(x, y)], \qquad c_2 = \operatorname{average}_{outside(C)} [D(x, y)].$$
The latter approach is inspired by the ‘active contours without edges’ model proposed by
Chan and Vese [21] and forces the curve C to close on a region whose interior and exterior
have approximately uniform values in D(x, y). A different approach, which utilizes region information by coupling the motion estimation and the tracking problems, was suggested by Paragios and Deriche in [87].
Figure 4.3 shows some results of moving object segmentation and tracking using the
proposed method.
Figure 4.3: Non-rigid moving object segmentation and tracking.
Contours can be represented in various ways. Here, in order to have a unified coordinate
system and be able to apply a simple algebraic tool, we use the implicit representation of a
simple closed curve as its binary image. That is, the contour is given by an image for which
the exterior of the contour is black while the interior of the contour is white.
4.4.2 Periodicity Analysis
Here we assume that the majority of non-rigid moving objects are self-propelled living creatures whose motion is almost periodic. Thus, one motion period, like a step of a walking man or a rabbit hop, can be used as a natural unit of motion, and extracted motion characteristics can be normalized by the period size.
The problem of detection and characterization of periodic activities was addressed by several research groups and the prevailing technique for periodicity detection and measurements
is the analysis of the changing 1-D intensity signals along spatio-temporal curves associated
with a moving object or the curvature analysis of feature point trajectories [90], [75], [101],
[117]. Here we address the problem using global characteristics of motion such as moving
object contour deformations and the trajectory of the center of mass.
By running frequency analysis on such 1-D contour metrics as the contour area, velocity
of the center of mass, principal axes orientation, etc. we can detect the basic period of the
motion. Figures 4.4 and 4.5 present global motion characteristics derived from segmented
moving objects in two sequences. One can clearly observe the common dominant frequency
in all three graphs.
The period can also be estimated in a straightforward manner by looking for the frame
where the external object contour best matches the object contour in the current frame.
Figure 4.6 shows the deformations of a walking man contour during one motion period
Figure 4.4: Global motion characteristics measured for the walking man sequence: principal axis inclination angle (rad), contour area (sq. pixels), and vertical velocity of the center of mass (pixels/frame), plotted against frame number.
(step). Samples from two different steps are presented and each vertical pair of frames is
phase synchronized. One can clearly see the similarity between the corresponding contours.
An automated contour matching can be performed in a number of ways, e.g. by comparing
contour signatures or by looking at the correlation between the object silhouettes in different
frames. Figure 4.7 shows four graphs of inter-frame silhouette correlation values measured
for four different starting frames taken within one motion period. It is clearly visible that all
four graphs nearly coincide and the local maxima peaks are approximately evenly spaced.
The period, therefore, can be estimated as the average distance between the neighboring
peaks.
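One way to realize this estimate is sketched below: correlate the silhouette of a reference frame with every frame of the sequence, find the local maxima of the correlation curve, and average the spacing between them; the silhouettes are assumed to be binary arrays of equal size.

```python
import numpy as np

def estimate_period(silhouettes, ref=0):
    """Estimate the motion period (in frames) from binary silhouette images.

    Correlates the reference silhouette with every frame, detects local
    maxima of the correlation curve and returns the average peak spacing.
    """
    ref_img = silhouettes[ref].astype(float).ravel()
    corr = np.array([np.dot(ref_img, s.astype(float).ravel())
                     for s in silhouettes])
    corr = corr / corr.max()
    # indices larger than both neighbours -> local maxima of the correlation
    peaks = [i for i in range(1, len(corr) - 1)
             if corr[i] > corr[i - 1] and corr[i] > corr[i + 1]]
    if len(peaks) < 2:
        return None
    return float(np.mean(np.diff(peaks)))
```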
4.4.3 Frame Sequence Alignment
One of the most desirable features of any classification system is the invariance to a set of
possible input transformations. As the input in our case is not a static image, but a sequence
of images, the system should be robust to both spatial and temporal variations.
4.4.3.1 Spatial Alignment:
Scale invariance is achieved by cropping a square bounding box around the center of mass
of the tracked target silhouette and re-scaling it to a predefined size (see Figure 4.8).
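A sketch of this normalization step might look as follows; the silhouette is assumed to be a binary mask, and the nearest-neighbour interpolation used for the re-scaling is an assumption, since the text does not specify one.

```python
import numpy as np
from skimage.transform import resize

def spatial_align(mask, out_size=50):
    """Crop a square box around the silhouette's center of mass and re-scale
    it to out_size x out_size, as in Figure 4.8 (binary output)."""
    ys, xs = np.nonzero(mask)
    cy, cx = int(round(ys.mean())), int(round(xs.mean()))     # center of mass
    half = int(max(np.abs(ys - cy).max(), np.abs(xs - cx).max())) + 1
    padded = np.pad(mask.astype(float), half)                 # avoid clipping at borders
    box = padded[cy:cy + 2 * half, cx:cx + 2 * half]          # square box centered at (cy, cx)
    scaled = resize(box, (out_size, out_size), order=0)       # nearest-neighbour re-scaling
    return (scaled > 0.5).astype(np.uint8)
```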
One way to have orientation invariance is to keep a collection of motion samples for a wide
Figure 4.5: Global motion characteristics measured for the walking cat sequence: principal axis inclination angle (rad), contour area (sq. pixels), and vertical velocity of the center of mass (pixels/frame), plotted against frame number.
range of possible motion directions and then look for the best match. This approach was used
by Yacoob and Black in [128] to distinguish between different walking directions. Although
here we experiment only with motions nearly parallel to the image plane, the system proved
to be robust to small variations in orientation. Since we do not want to keep models for
both left-to-right and right-to-left motion directions, the right-to-left moving sequences are
converted to left-to-right by horizontal mirror flip.
4.4.3.2 Temporal Alignment:
A good estimate of the motion period allows us to compensate for motion speed variations
by re-sampling each period subsequence to a predefined duration. This can be done by interpolation between the binary silhouette images themselves or between their parameterized
representation as explained below. Figure 4.9 presents an original and re-sampled one-period
subsequence after scaling from 11 to 10 frames.
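The re-sampling can be realized, for instance, by linear interpolation between the two nearest original frames, as in the sketch below; the frames may be binary silhouettes or their parameterized representations, and the choice of linear interpolation is an assumption.

```python
import numpy as np

def resample_period(frames, out_len=10):
    """Re-sample a one-period subsequence to out_len frames by linear
    interpolation between the two nearest original frames."""
    frames = np.asarray(frames, dtype=float)          # shape (n, ...)
    n = len(frames)
    out = []
    for t in np.linspace(0, n - 1, out_len):
        i = int(np.floor(t))
        j = min(i + 1, n - 1)
        w = t - i                                     # interpolation weight
        out.append((1 - w) * frames[i] + w * frames[j])
    return np.stack(out)
```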
Temporal shift is another issue that has to be addressed in order to align the phase of the
observed one-cycle sample and the models stored in the training base. In [128] it was done
by solving a minimization problem of finding the optimal parameters of temporal scaling
and time shift transformations so that the observed sequence is best matched to the training
samples. Polana and Nelson [93] handled this problem by matching the test one-period
subsequence to a reference template at all possible temporal translations.
Assuming that in the training set all the sequences are accurately aligned, we find the
Figure 4.6: Deformations of a walking man contour during one motion period (step). Two steps, sampled at frames t, t+3, ..., t+12 and τ, τ+3, ..., τ+12 and synchronized in phase, are shown. One can see the similarity between contours in corresponding phases.
temporal shift of a test sequence by looking for the starting frame that best matches the
generalized (averaged) starting frame of the training samples, as they all look alike. Figure 4.10 shows (a) - the reference starting frame taken as an average over the temporally
aligned training set, (b) - a re-sampled single-period test sequence, and (c) - the correlation between the reference starting frame and the test sequence frames. The maximal correlation
is achieved at the seventh frame, therefore the test sequence is aligned by cyclically shifting
it 7 frames to the left.
4.4.4 Parameterization
In order to reduce the dimensionality of the problem, we first project the object image in every frame onto a low dimensional basis that represents all possible appearances of objects that belong to a certain class, like humans, four-legged animals, etc.
Let n be the number of frames in the training base of a certain class of objects and M be a training samples matrix, where each column corresponds to a spatially aligned image of a moving object written as a binary vector. In our experiments we use 50 × 50 normalized images, therefore M is a 2500 × n matrix. The correlation matrix $M M^T$ is decomposed using Singular Value Decomposition as $M M^T = U \Sigma V^T$, where U is an orthogonal matrix of principal directions and Σ is a diagonal matrix of singular values. In practice, the decomposition is performed on $M^T M$, which is computationally more efficient [119]. The
principal basis {Ui , i = 1..k} for the training set is then taken as k columns of U corresponding
to the largest singular values in Σ. Figure 4.11 presents a principal basis for the training set
formed of 800 sample images collected from more than 60 sequences showing dogs and cats
in motion. The basis is built by taking the k = 20 first principal component vectors.
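Under the notation above, building the principal basis and projecting an image onto it might be sketched as follows; the decomposition is carried out on the smaller matrix M^T M, as mentioned in the text, and np.linalg.svd is used as the decomposition routine.

```python
import numpy as np

def principal_basis(M, k=20):
    """Principal basis for a training matrix M (2500 x n): the k left singular
    vectors of M associated with the largest singular values.

    The decomposition is done on the small n x n matrix M^T M and the
    resulting vectors are mapped back to image space, which is cheaper than
    decomposing the 2500 x 2500 matrix M M^T directly.
    """
    M = np.asarray(M, dtype=float)
    _, _, Vt = np.linalg.svd(M.T @ M)                    # eigen-decomposition of M^T M
    U = M @ Vt.T                                         # back to image space, 2500 x n
    U /= np.linalg.norm(U, axis=0, keepdims=True) + 1e-12
    return U[:, :k]                                      # 2500 x k principal basis B

def project(B, image):
    """Parameterize a normalized 50 x 50 silhouette image: p = B^T v_I."""
    return B.T @ image.astype(float).ravel()
```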
We assume that by building such representative bases for every class of objects and then
finding the basis that best represents a given object image in a minimal distance to the
Figure 4.7: Inter-frame correlation between object silhouettes. Four graphs show the correlation measured for four initial frames, plotted against the frame offset.
feature space (DTFS) sense, we can distinguish between various object classes. Figure 4.12
shows the distances from more than 1000 various images of people, dogs and cats to the
feature space of people and to that of dogs and cats. In all cases, images of people were
closer to the people feature space than to the animals’ feature space and vice versa. This
allows us to distinguish between these two classes. A similar approach was used in [45] for
the detection of pedestrians in traffic scenes.
If the object class is known (e.g. we know that the object is a dog), we can parameterize
the moving object silhouette image I in every frame by projecting it onto the class basis. Let
B be the basis matrix formed from the basis vectors {Ui , i = 1..k}. Then, the parameterized
representation of the object image I is given by the vector p of length k as p = B T vI , where
vI is the image I written as a vector.
The idea of using a parameterized representation in a motion-based recognition context is certainly not a new one. To name a few examples, we mention again the work of Yacoob and Black [128]. Cootes et al. [27] used a similar technique for describing feature point locations by
a reduced parameter set. Baumberg and Hogg [3] used PCA to describe a set of admissible
B-spline models for deformable object tracking. Chomat and Crowley [23] used PCA-based
spatio-temporal filter for human motion recognition.
Figure 4.13 shows several normalized moving object images from the original sequence
and their reconstruction from a parameterized representation by back-projection to the image
space. The numbers below are the norms of differences between the original and the back-
Figure 4.8: Scale alignment. A minimal square bounding box around the center of the segmented object silhouette (a) is cropped and re-scaled to form a 50 × 50 binary image (b).
Figure 4.9: Temporal alignment. Top: original 11 frames of one period subsequence. Bottom:
re-sampled 10 frames sequence.
projected images. These norms can be used as the DTFS estimation.
Now we can use these parameterized representations to distinguish between different types of motion. The reference base for the activity recognition consists of temporally aligned one-period subsequences, where the moving object silhouette in every frame of these subsequences is represented by its projection onto the principal basis. More formally, let $\{I_f : f = 1..T\}$ be a one-period, temporally aligned set of normalized object images, and $p_f$, f = 1..T, a projection of the image $I_f$ onto the principal basis B of size k. Then, the vector P of length kT formed by concatenation of all the vectors $p_f$, f = 1..T, represents a one-period subsequence. By choosing a basis of size k = 20 and the normalized duration of a one-period subsequence to be T = 10 frames, every single-period subsequence is represented by a feature point in a 200-dimensional feature space.
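Concatenating the per-frame projections into a single feature point can be sketched in one line; B here is the 2500 × k principal basis from the previous snippet, and frames is a temporally aligned one-period subsequence.

```python
import numpy as np

def sequence_feature(frames, B):
    """Feature point of one aligned, T-frame period: the length-kT vector of
    concatenated projections p_f = B^T v_{I_f}, f = 1..T."""
    return np.concatenate([B.T @ f.astype(float).ravel() for f in frames])

# with k = 20 basis vectors and T = 10 frames the feature point has length 200
```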
In the following experiment we processed a number of sequences of dogs and cats in
various types of locomotion. From these sequences we extracted 33 samples of walking dogs,
9 samples of running dogs, 9 samples of galloping dogs and 14 samples of walking cats. Let $S_{200 \times m}$ be the matrix of projected single-period subsequences, where m is the number of samples, and let the SVD of the correlation matrix be $SS^T = U^S \Sigma^S (V^S)^T$. In Figure 4.14 we depict the resulting feature points projected for visualization to the 3-D space using the three first principal directions $\{U_i^S : i = 1..3\}$, taken as the column vectors of $U^S$ corresponding to the three largest eigenvalues in $\Sigma^S$. One can easily observe four separable
Figure 4.10: Temporal shift alignment: (a) the average starting frame of all the training set sequences, (b) a temporally shifted single-cycle test sequence, (c) the correlation between the reference starting frame and the test sequence frames.
clusters corresponding to the four groups.
Another experiment was done over the ‘people’ class of images. Figure 4.15 presents feature points corresponding to several sequences showing people walking and running parallel
to the image plane and running at an oblique angle to the camera. Again, all three groups lie
in separable clusters.
The classification can be performed, for example, using the k-nearest-neighbor algorithm.
We conducted the ‘leave one out’ test for the dogs set above, classifying every sample after taking it out of the training set, and the three-nearest-neighbors strategy resulted in a 100% success rate.
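A leave-one-out k-nearest-neighbor test of this kind takes only a few lines; the feature matrix and label array below are placeholders for the 200-dimensional feature points and their activity labels.

```python
import numpy as np

def leave_one_out_knn(features, labels, k=3):
    """Leave-one-out k-nearest-neighbor classification accuracy.

    features: (m, d) array of feature points; labels: length-m label array.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    correct = 0
    for i in range(len(features)):
        dists = np.linalg.norm(features - features[i], axis=1)
        dists[i] = np.inf                              # exclude the sample itself
        nearest = labels[np.argsort(dists)[:k]]
        values, counts = np.unique(nearest, return_counts=True)
        if values[np.argmax(counts)] == labels[i]:
            correct += 1
    return correct / len(features)
```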
Another option is to further reduce the dimensionality of the feature space by projecting
the 200-dimensional feature vectors in S to some principal basis. This can be done using
a basis of any size, exactly as we did it in 3-D for the visualization. Figure 4.16 presents
learning curves for both animals and people data sets. The curves show how the classification
error rate achieved by a one-nearest-neighbor classifier changes as a function of the principal basis size. The curves are obtained by averaging over a hundred iterations in which the data set is randomly split into the training and the testing sets. The principal basis is built on the
Figure 4.11: The principal basis for the ‘dogs and cats’ training set formed of 20 first principal
component vectors.
Figure 4.12: Distances to the ‘people’ and ‘dogs and cats’ feature spaces from more than 1000 various images of people, dogs and cats.
Figure 4.13: Image sequence parameterization. Top: 11 normalized target images of the original sequence. Bottom: the same images after parameterization using the principal basis and back-projection to the image basis. The numbers are the norms of the differences between the original and the back-projected images.
training set and the testing is performed using the ‘leave one out’ strategy on the testing
set.
4.5 Conclusions
We presented a new framework for motion-based classification of moving non-rigid objects. The technique is based on the analysis of the changing appearance of moving objects and relies heavily on the high accuracy segmentation and tracking results obtained using the fast geodesic contour approach. The periodicity analysis is then performed based on the global properties of the extracted moving object contours, followed by spatial and temporal normalization of the video sequence. Normalized one-period subsequences are parameterized by projection onto a principal basis extracted from a training set of images for a given class of objects. A number of experiments show the ability of the system to analyze motions of humans and animals, to distinguish between these two classes based on object appearance, and to classify various types of activities within a class, such as walking, running, and galloping. The ‘dogs and cats’ experiment demonstrates the ability of the system to discriminate between these two classes, which are very similar in appearance, by analyzing their locomotion.
Figure 4.14: Feature points extracted from the sequences with walking, running and galloping dogs and walking cats, projected to the 3-D space for visualization.
Figure 4.15: Feature points extracted from the sequences showing people walking and running parallel to the image plane and at a 45-degree angle to the camera. Feature points are projected to the 3-D space for visualization.
Figure 4.16: Learning curves for (a) the dogs and cats data set and (b) the people data set. One-nearest-neighbor classification error rate as a function of the number of eigenvectors in the projection basis.
Chapter 5
Articulated Motion
5.1 Introduction
Articulated object segmentation is a well known computer vision problem. The problem usually arises whenever simple object detection and segmentation is not sufficient and some additional information is required about the nature of the object of interest and its interactions with other objects. The most prominent example is, of course, recovering the position of human body parts, which is often a first step for numerous applications including posture and gait analysis, human-computer interfaces, automatic surveillance, etc. Many methods have been proposed for tracking and analyzing human body motion. Most of them assume some kind of human body model (either parametric or not) and try to fit the observed visual cues to the model. There are some applications where this approach fails. For example, in moving target classification there is no single model for the objects of interest, as they may belong to different classes such as humans, animals and vehicles. Yet, one may wish to detect the independently moving parts of the object, as they may serve as cues for the classification. This calls for a non-model-based method for articulated body segmentation based on the observed appearance of the object and not on prior knowledge about its nature.
One of the most important applications of articulated object segmentation is shape-based recognition using parts decomposition, whose basic idea is to locate the parts of an object, to classify them into different types of generic components, and then to describe the object in terms of its constituent parts. Parts decomposition in its original form is not applicable to motion based classification, but it can be used as a tool to decompose an object into primitive moving parts (limbs/parts of a mechanism) and then to perform the classification based on either the relative or the absolute motion of the parts (e.g. [127]).
5.2 Background subtraction
Like most tracking applications, we used a reference background image to detect and segment out the moving object. In this work we used the adaptive background subtraction technique suggested in [79], which is based on fusing the color and gradient information. Figure 5.1 shows an example of applying the background subtraction to a color image sequence. The largest foreground connected component, Figure 5.1(c), is then used as the detected moving object mask.
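The background model of [79] fuses color and gradient cues; the sketch below is only a stand-in, a basic running-average color background with a fixed threshold and largest-connected-component selection. All parameters, and the use of scipy.ndimage for labeling, are assumptions.

```python
import numpy as np
from scipy import ndimage

class AdaptiveBackground:
    """Running-average color background model with largest-component masking.
    A simplified stand-in, not the color+gradient fusion algorithm of [79]."""

    def __init__(self, first_frame, alpha=0.05, thresh=30.0):
        self.bg = first_frame.astype(np.float64)   # background estimate
        self.alpha = alpha                         # adaptation rate (assumed)
        self.thresh = thresh                       # foreground threshold (assumed)

    def apply(self, frame):
        frame = frame.astype(np.float64)
        diff = np.linalg.norm(frame - self.bg, axis=2)     # per-pixel color distance
        fg = diff > self.thresh                            # thresholded foreground
        # adapt the background only where the pixel currently looks like background
        rate = self.alpha * (~fg)[..., None]
        self.bg = (1.0 - rate) * self.bg + rate * frame
        # keep the largest connected foreground component as the moving object mask
        lbl, n = ndimage.label(fg)
        if n == 0:
            return fg
        sizes = ndimage.sum(fg, lbl, index=range(1, n + 1))
        return lbl == (1 + int(np.argmax(sizes)))
```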
Figure 5.1: Example of background subtraction (a) Original color image from sequence, (b)
Foreground confidence values, (c) Thresholded foreground image.
5.3 Edge detection
We use the following scheme for estimating the edge strength in color images. First the gradients $(R_x, R_y)$, $(G_x, G_y)$ and $(B_x, B_y)$ are computed by the Sobel operator separately for the three channels R, G and B. Then the edge indicator function is given by the largest eigenvalue of the matrix
$$\begin{pmatrix} R_x^2 + G_x^2 + B_x^2 & R_x R_y + G_x G_y + B_x B_y \\ R_x R_y + G_x G_y + B_x B_y & R_y^2 + G_y^2 + B_y^2 \end{pmatrix}$$
and the normal direction to the edge is given by the corresponding eigenvector $\vec{n}$. The edge sign is taken as the sign of the largest projection of the channel gradients, $\nabla R$, $\nabla G$ and $\nabla B$, onto the normal edge direction vector $\vec{n}$.
Figure 5.2 shows an example of edge detection in a color image (b), and the edge pixels
obtained by applying a threshold to the edge indicator function (c). Note that the edge
detection is performed only in the masked region belonging to the moving object.
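The largest eigenvalue of the 2x2 symmetric matrix above has a closed form, so the edge indicator can be computed per pixel without an eigen-solver. The following is a minimal sketch (function names are assumptions; scipy's Sobel filter stands in for the operator used in the thesis):

```python
import numpy as np
from scipy import ndimage

def color_edge_indicator(img):
    """img: (H, W, 3) float array with channels R, G, B.
    Returns the edge indicator (largest eigenvalue of the color structure
    matrix) and the unit edge-normal direction at every pixel."""
    gx = np.stack([ndimage.sobel(img[..., c], axis=1) for c in range(3)])  # Rx, Gx, Bx
    gy = np.stack([ndimage.sobel(img[..., c], axis=0) for c in range(3)])  # Ry, Gy, By
    a = (gx ** 2).sum(axis=0)        # Rx^2 + Gx^2 + Bx^2
    b = (gx * gy).sum(axis=0)        # RxRy + GxGy + BxBy
    c = (gy ** 2).sum(axis=0)        # Ry^2 + Gy^2 + By^2
    # largest eigenvalue of the 2x2 symmetric matrix [[a, b], [b, c]]
    lam = 0.5 * (a + c) + np.sqrt(0.25 * (a - c) ** 2 + b ** 2)
    # corresponding eigenvector = normal direction to the edge
    nx, ny = lam - c, b
    norm = np.hypot(nx, ny) + 1e-12
    return lam, np.stack([nx / norm, ny / norm])
```

Edge pixels are then obtained by thresholding the indicator within the moving object mask, as described above.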
5.4 Normal flow
In our work we have chosen to use the normal flow for the motion analysis. The normal flow is the projection of the image motion field onto the edge normal direction, i.e. if $\vec{n}_r$ is the direction of the normal to the edge at image point $\vec{r}$ and the optical flow at $\vec{r}$ is $\vec{u}$, then the normal flow at $\vec{r}$ is given by $u_n = \vec{u} \cdot \vec{n}_r$.
In this work we used the algorithm of Duric et al. for normal flow computation, which works directly on color image sequences.
Figure 5.3 presents an example of the computed normal flow for a pair of consecutive
frames from a color sequence.
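We do not reproduce the color normal-flow algorithm of Duric et al. here; as a placeholder, the following sketch computes the classical single-channel normal flow from the brightness constancy equation, u_n = -I_t / |grad I| along the gradient direction. This is a simplifying assumption, not the method actually used in the thesis.

```python
import numpy as np

def normal_flow(frame0, frame1, grad_eps=1e-3):
    """Normal flow between two grayscale frames (2-D float arrays).
    Under brightness constancy the flow component along the image gradient
    (i.e. along the edge normal) is  u_n = -I_t / |grad I|.
    Returns the signed normal flow magnitude and the unit normal direction."""
    It = frame1 - frame0                      # temporal derivative
    Iy, Ix = np.gradient(frame0)              # central-difference spatial derivatives
    mag = np.hypot(Ix, Iy)
    un = np.where(mag > grad_eps, -It / (mag + grad_eps), 0.0)
    normal = np.stack([Ix, Iy]) / (mag + grad_eps)   # unit edge-normal direction
    return un, normal
```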
Figure 5.2: Example of edge detection for color images (a) Original color image from sequence, (b) Edge indicator function values, (c) Edge indicator function after applying a
threshold.
Figure 5.3: Two consecutive frames from an image sequence (a, b) and the computed normal
flow (c).
5.5 Grouping edge pixels into edge segments
The next step is grouping the edge pixels (edgels) into meaningful segments that hopefully belong to a single independently moving object/body part. Our assumption is that a connected set of edgels forming a nearly straight line segment and exhibiting the same motion is likely to belong to the same independently moving part. The grouping algorithm is then:
1. Initialize a new edge segment with an unused edge pixel $p$, and mark $p$ as used.
2. Find the set $N_p$ of unused edgels neighboring $p$.
3. If $N_p$ is empty, close the current segment and return to step 1.
4. Otherwise, find the best fitting edgel $p_{new} \in N_p$, in the sense that $p_{new}$ is the edgel that most closely coincides with the segment direction and has the edge orientation and normal flow closest to those of edgel $p$.
5. Add $p_{new}$ to the segment, update the segment direction, set $p = p_{new}$ and return to step 2.
As the algorithm advances by adding a new pixel to one end of the segment, one should repeat steps 2-4 twice from the same starting point in order to track the edge segment in both directions. The time complexity of the algorithm is linear in the number of edge pixels. The resulting set of edge segments is then filtered by concatenating collinear segments with close end points and by discarding very short edge segments. Figure 5.4 shows the result of edge grouping, where edge segments are presented in different colors.
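A rough sketch of the greedy segment-growing loop described above; the edgel container, the neighbor query and the fitting cost are hypothetical simplifications of the actual criteria (segment direction, orientation and normal flow similarity):

```python
def grow_segments(edgels, neighbors, max_cost=2.0):
    """Greedy grouping of edge pixels into nearly straight segments.
    edgels:    dict mapping pixel -> (orientation, normal_flow)   [hypothetical]
    neighbors: function pixel -> list of adjacent edge pixels     [hypothetical]
    Follows steps 1-5 above; every seed is grown twice to cover both
    directions (ordering of pixels inside a segment is ignored here)."""
    unused = set(edgels)
    segments = []
    while unused:
        seed = unused.pop()                                     # step 1
        segment = [seed]
        for _ in range(2):                                      # both directions
            p = seed
            while True:
                candidates = [q for q in neighbors(p) if q in unused]   # step 2
                if not candidates:                              # step 3
                    break

                # step 4: edgel with the closest orientation and normal flow
                def cost(q):
                    return (abs(edgels[q][0] - edgels[p][0]) +
                            abs(edgels[q][1] - edgels[p][1]))

                p_new = min(candidates, key=cost)
                if cost(p_new) > max_cost:                      # poor fit: stop growing
                    break
                segment.append(p_new)                           # step 5
                unused.discard(p_new)
                p = p_new
        segments.append(segment)
    return segments
```

Collinear segments with close end points would then be concatenated and very short segments discarded, as described above.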
5.6 Grouping edge segments
As we are not interested in extracting the actual 3D motion parameters of the independently
moving object parts and only need the motion information in order to segment the moving
object, it may be acceptable to approximate the observable motion using the planar affine
model. The model error is relatively small provided the distance from the camera to the
object is much larger than the object displacements in the direction perpendicular to the
image plane.
Under this planar motion assumption the velocity of an object point $\vec{r} = (x, y)$ in the image plane (the optical flow) is given by $\dot{\vec{r}} \equiv \vec{u} = \vec{T} - \vec{\omega} \times \vec{r}$, where $\vec{\omega} = (0, 0, \omega_z)$ is the rotational velocity vector and $\vec{T} = (U, V)$ is the translational velocity. Then the normal flow at the point $\vec{r}$ is $\|\vec{u}_n\| = \vec{u} \cdot \vec{n}_r = (\vec{T} - \vec{\omega} \times \vec{r}) \cdot \vec{n}_r$, where $\vec{n}_r = (n_x, n_y)$ is the normal direction at the point $\vec{r}$.
For any two edge segments $S_1$ and $S_2$ that belong to the same independently moving articulated object part $P$, there exists a unique set of motion parameters $\vec{m}^P = (U^P, V^P, \omega_z^P)$ such that for every edge point $\vec{r}_i \in (S_1 \cup S_2)$ the following constraint should hold
$$\vec{u}_i \cdot \vec{n}_{r_i} = \left((U^P, V^P) - \vec{\omega}^P \times \vec{r}_i\right) \cdot \vec{n}_{r_i}. \tag{5.1}$$
Figure 5.4: Example of grouping edge pixels into edge segments. Different edge segments
are in different colors.
Let us denote
$$\vec{a}_i = \begin{pmatrix} n_{x_i} \\ n_{y_i} \\ n_{y_i} x_i - n_{x_i} y_i \end{pmatrix}.$$
Then in matrix form the set of constraints (5.1) for the points of $S_1 \cup S_2$ is given by
$$\vec{b} = A \vec{m}^P \tag{5.2}$$
where the $\vec{a}_i^T$ are the rows of the matrix $A$ and the elements $b_i$ of the vector $\vec{b}$ are given by $b_i = \vec{u}_i \cdot \vec{n}_{r_i}$. The matrix $A$ and the vector $\vec{b}$ are formed from the observable quantities, while the motion parameters vector $\vec{m}^P$ can be estimated by minimizing $\|A\vec{m}^P - \vec{b}\|$.
In general, the two edge segments being compared are not of equal size, which means that more equations in (5.2) come from one edge segment than from the other. Let us assume w.l.o.g. that $|S_1| > |S_2|$. Then, in order to give both segments equal weight in determining the motion parameters, one should compensate for the difference in size by multiplying the second segment's equations by $|S_1|/|S_2|$.
If the two edge segments $S_1$ and $S_2$ indeed belong to the same moving object part and, therefore, exhibit the same motion, then the residual $res_{S_1,S_2} = \|A\vec{m}^P - \vec{b}\|$ should be close to zero, whereas when $S_1$ and $S_2$ are moving differently the residual should be larger. This gives us a grouping cue allowing us to discard pairs of edge segments that do not fit a common motion model.
It turned out that judging by this cue alone is not sufficient. For example, when two object parts are moving similarly in a certain phase of the motion, the edge segments belonging to these parts are likely to be grouped together if the decision is based on the observable motion alone. Therefore one needs to employ additional tests to make the grouping process more reliable.
The other cues we used for segmentation are based on geometric properties of the extracted edge segments. Usually the independently moving parts (e.g. limbs of humans and animals, robot manipulator segments) can be approximated by a cylinder, and as such have almost parallel boundaries in their image projections. Motivated by this assumption, we look for pairs of almost parallel edge segments by comparing their slope angles. For a pair of almost parallel segments the normal distance is calculated. If the segments are almost collinear (very small normal distance), they may represent two parts of the same straight boundary segment, and they are grouped together if their edge points are close and their slope angles are almost identical.
Otherwise, if the normal distance is relatively large, the two segments may belong to different sides of the same limb. Then, in order to be grouped together, the following requirements have to be met: the segments' projections onto each other overlap, and the segments are not separated by the background. The latter property is checked by drawing a line between the centers of mass of the two segments and checking whether this line contains any background pixels.
Figure 5.5 presents an example of edge segment grouping for six frames taken from a single image sequence. For each group of edge segments a rectangular bounding box aligned with the principal axes direction is shown. One can see that the torso part that is visible in frames (a-c) is not detected in frames (d-f), as its bounding edge segments are lost due to the motion of the hands. On the other hand, the arms that were not visible in frames (a-d) are detected in frames (e,f) as they emerge from the torso region. The forward leg is split into two regions in frame (a), but becomes a single region in frame (b), since at this phase of the step it is fully stretched and moves as one piece. The other leg, being in the anti-phase, is represented by a single region in frames (a-c) and then split into two regions in frames (d-f).
Figure 5.5: Edge grouping results for six frames from an image sequence. Every group of
edge segments is enclosed by a best fitting rectangular bounding box
5.7 Multiple frames approach
One of the drawbacks of the scheme described so far is the lack of persistency in the segmentation, i.e. an object part detected in one frame may disappear in the next one and then appear again several frames later.
One way to cope with this problem is to remember the results of the segmentation made for one frame and then use them when doing the segmentation in the following frames. To do so it is necessary to keep track of persistent image elements, so that they can be detected from frame to frame.
In our implementation this is done by tracking the edge segments. An edge segment found in a certain frame enters a list of tracked segments, and its new position in the next frame is detected using the normal flow. A search region is defined around the new segment position predicted by the normal flow vectors, and the corresponding edge segment in the next frame is built from the edgels found within the region. An edge segment that is being tracked for several frames is regarded as stable and enters the database of active edge segments.
The database is a table of size $N_s \times N_s$, where $N_s$ is the number of active segments. An entry $(i, j)$ of the table accumulates information about the mutual relationship between the edge segments $S_i$ and $S_j$. The history data stored for each pair of edge segments includes the mean and the variance of the slope difference, of the normal and tangential distances, and of the motion model residual computed from the normal flow. All these parameters are updated with every new frame. The decision whether two segments should be grouped together is made in a manner similar to that described in the previous section, while taking the historical information into account. For example, if the slope angle between two segments changes significantly within the period of observation (i.e. large slope difference variance), the segments cannot be grouped together. The same applies to the situation when in some frame a pair of segments was separated by the background.
Once it has been decided that two segments do not belong to the same moving object part, the decision is stored in the corresponding table entry and the segments will not be grouped together in subsequent frames, even if their relative motion and geometry look appropriate.
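A sketch of the per-pair history entry of the $N_s \times N_s$ table; the set of statistics, the class and function names and the decision thresholds are assumptions meant only to illustrate the bookkeeping:

```python
import numpy as np

class PairHistory:
    """Accumulated statistics for one (i, j) entry of the Ns x Ns table:
    running mean/variance of slope difference, normal distance, tangential
    distance and common-motion residual, plus a permanent veto flag."""

    def __init__(self):
        self.n = 0
        self.sums = np.zeros(4)
        self.squares = np.zeros(4)
        self.vetoed = False          # set once the pair must never be grouped

    def update(self, slope_diff, normal_d, tangent_d, residual, separated_by_bg):
        x = np.array([slope_diff, normal_d, tangent_d, residual])
        self.n += 1
        self.sums += x
        self.squares += x ** 2
        if separated_by_bg:          # background was seen between the segments
            self.vetoed = True

    def stats(self):
        mean = self.sums / max(self.n, 1)
        var = self.squares / max(self.n, 1) - mean ** 2
        return mean, var

def groupable(hist, max_slope_var=0.05, max_mean_residual=2.0):
    """Grouping decision from the accumulated history (thresholds assumed)."""
    if hist.vetoed:
        return False
    mean, var = hist.stats()
    return var[0] < max_slope_var and mean[3] < max_mean_residual
```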
Figure 5.6: Multiple frames based segmentation results for six frames from an image sequence. Every group of edge segments is enclosed by a best fitting rectangular bounding
box
Figure 5.6 shows an example of moving body segmentation performed using the multiple frames approach. One can see that the segmentation is now more persistent than for the frame-by-frame analysis. For example, the left hand region, once extracted as a separate one in frame (b), remains separate through the whole sequence. Each leg is grouped as one body part and remains so, unlike in Figure 5.5, where the legs were split into two regions in some frames.
5.8 Conclusions
In this chapter we presented a simple, yet powerful method for motion based segmentation. The method can be used for the segmentation of articulated objects based on the analysis of the observed motion of their parts. We used a relatively simple underlying motion model (affine), which proved to be powerful enough to discriminate between edges that move in the same way and those that move independently. The algorithm is based on the analysis of normal flow vectors computed in color space and also relies on a number of geometric heuristics for edge segment grouping. A multi-frame approach is also considered, which yields more robust results.
Chapter 6
Discussion
The goal of this work was to explore the possibility of using motion based information, as well as the various ways it can be used, in order to solve a number of well known computer vision problems, such as segmentation, tracking, object recognition and classification, and event detection.
It turned out that the methods one can apply for the analysis of moving objects differ
significantly between the three basic classes of objects: rigid, non-rigid and articulated.
In Chapter 2 we address the problem of extracting the motion parameters of a moving
rigid body for a relatively complicated, yet constrained case of moving vehicles.
As the image sequences are obtained from a forward looking camera installed on a moving vehicle, the observed motion is a superposition of two independent motions: an apparent translational motion of the car (and the attached camera) relative to the environment, both static (road, buildings) and dynamic (vehicles), and an apparent rotational motion of the environment relative to the camera caused by various disturbances due to road bumps or other similar reasons. In order to extract the relative motion parameters of the nearby vehicles we needed to stabilize the image sequence, i.e. to get rid of the rotational component of the observable motion pattern, yielding the sequence of images that would have been observed if the vehicle motion had been smooth. The stabilization algorithm we developed is based on the analysis of the normal flow vectors using a perspective projection imaging model, as the optical flow vectors participating in the computation are collected from points located at different scene depths.
Road segmentation is performed by finding road edges and lane separating marks using
Hough-like voting and choosing dominant straight lines converging to a single point. Independently moving objects are then detected by clustering vertical and horizontal linear
contour sections representing car boundaries and/or car shadow.
The 3-D car motion vector is computed from translational component of optical flow
vectors assuming translational planar motion and weak perspective camera model.
Our belief is that by analyzing the relative velocity graphs and checking such parameters as typical accelerations, speed of maneuvers and other higher level behavioral patterns, one can distinguish between vehicle classes like private cars, trucks, motorcycles and so on.
The work can be extended to analyze 3-D motion, e.g. for the motion based classification of an aircraft in flight. Actually, this approach has already been used for aircraft type detection from radar imaging [31, 29, 30, 28], and it would be very interesting to find out whether this type of analysis can be performed on optical data.
While there exists a relatively large number of works dealing with the extraction of motion events in traffic scenes [13, 58, 59, 60, 69, 71], all of them do it with a stationary camera and sometimes use an a priori known and modelled observed scene. Here we address the problem of motion event recognition from a driver's viewpoint and try to understand various events involving the motions of nearby vehicles, which usually move in the same direction without major changes in their distances and relative speeds.
The set of interesting events and parameters in this context, such as passing or changing lanes, exiting or merging into a stream, overtaking, turning, accelerating or braking, rate of approach, a dangerously close or approaching car (collision), etc., can be detected based on the analysis of the relative translational velocity of a vehicle and by looking at its position relative to different road parts - lanes, margins, the separating line and so on.
The analysis of non-rigid object motion, on the other hand, poses new challenges. While the exact estimation of such motion parameters as translation and rotation is more problematic (if possible at all) because of the deformations the target undergoes, the same deformations carry essential information about the nature of the object. Therefore we chose to use an active contour based approach for the segmentation and tracking of non-rigid objects in order to provide high-accuracy results allowing us to capture the dynamics of the object deformations.
In Chapter 3 we present a fast and efficient implementation of a numerical scheme for geodesic active contours that is based on a combination of several advanced numerical techniques. The proposed 'fast geodesic active contour' scheme was applied successfully to image segmentation and tracking in video sequences and color images [53]. The method can be used to solve a number of segmentation problems that commonly arise in image analysis and was successfully applied in a number of projects ranging from medical applications, such as cortical layer segmentation in 3-D MRI brain images [52], to the segmentation of defects in VLSI circuits in electron microscope images, and to the analysis of bullet traces in metals.
In Chapter 4 we present a framework for motion-based classification of moving non-rigid objects [54]. The technique is based on the analysis of the changing appearance of moving objects and relies heavily on the high-accuracy segmentation and tracking results obtained using the fast geodesic contour approach described earlier.
By assuming that most non-rigid moving objects are self-propelled living creatures whose motion is almost periodic, we perform the periodicity analysis based on the global properties of the extracted moving object contours.
Knowing the motion period allows us to extract several important quantities that may be used as classification invariants. In addition, by normalizing other measured motion parameters in time, we make them immune to motion rate variations and allow them to be compared to the stored representative templates of the categories.
Normalized one-period subsequences are parameterized by projection onto a principal basis extracted from a training set of images for a given class of objects. A number of experiments show the ability of the system to analyze motions of humans and animals, to distinguish between these two classes based on object appearance, and to classify various types of activities within a class, such as walking, running and galloping.
The 'dogs and cats' experiment demonstrates the ability of the system to discriminate between these two classes, which are very similar in appearance, by analyzing their locomotion. This is an interesting example, because it was claimed [120] that these two classes of objects are indistinguishable in terms of their visual appearance. This raises an interesting philosophical question about the distinguishability of various classes and the definition of the set of classes or categories we would like to relate an object to. This is, of course, an abstract question, unless we limit ourselves to some natural purposive environment, where the classification process is a part of a more complex system allowing an observer to perceive the environment and then act accordingly.
There exists a hypothesis [8], [120] that the most natural object classes are those induced
by nature itself and not imposed by an observer. In other words, the set of natural classes
is independent of our perception and the visual attributes available are sufficient for the
classification.
The principle of 'Natural Modes' [8] states: 'Environmental pressures force objects to have non-arbitrary configurations and properties that define object categories in the space of properties important to interaction between objects and the environment'. This is accompanied by another principle of 'Accessibility', claiming that those properties are 'reflected in observable properties of the object'. By adopting such basic natural classes as 'plants', 'animals', 'humans', 'human made objects', etc., and using motion information as input data for object classification, we show that object motion is one of those basic visual attributes allowing us to distinguish between natural categories.
As we have already mentioned, the methods that can be applied for rigid and non-rigid body classification are quite different. Therefore, first, we would like to distinguish between rigid and non-rigid objects. In most cases natural non-rigid objects in locomotion exhibit substantial changes in the apparent view of the external contour, since the motion itself is caused by these deformations. Hence, we can judge the rigidity of moving objects by the amount of deformation that their external contours undergo. Although this aspect of the problem was not discussed in detail in this thesis, we successfully used this technique in the real-time system we have built to distinguish between such rigid/non-rigid categories as humans, animals and vehicles [95]. After the initial rigid/non-rigid distinction is made we can proceed with finer classification within these two large categories.
Finally, in Chapter 5 we address the problem of articulated object segmentation and tracking. The obvious motivation comes from the problem of human body segmentation, which is required in numerous applications where the posture of a human body is to be recovered.
Our method is based on the analysis of normal flow vectors computed from color image sequences and on grouping edge segments based on the similarity of the motion they exhibit. The similarity is defined in terms of compliance with the underlying motion model, which in our case was taken to be a simple affine motion and proved to be 'expressive' enough for our qualitative purposes. We also suggested a number of geometric heuristics to be used as grouping cues and developed a multi-frame analysis scheme that is more robust than simple frame-to-frame decision making.
The algorithm is meant to be a part of a more complex event recognition system that works in an indoor environment, where changes in the configuration (posture) of humans and their interaction with static objects are of primary interest.
In this context an interesting question arises: does there exist a generic set of primitive recognizable motion events, extractable from an image sequence, that could describe a large, application independent class of verbs or other higher level abstractions? An interesting development of our work would be to use the low level primitives called "acts" introduced in the Conceptual Dependency theory by Schank [99], which is a theory of the representation of the meaning of sentences. The theory defines a set of primitive actions called "acts" in such a way that any verb in a natural language can be represented as a unique combination of those primitive actions. One would then need to develop a set of techniques to extract these primitives and to parse them into higher level events.
Bibliography
[1] D. Adalsteinsson and J. A. Sethian. A fast level set method for propagating interfaces. J. of
Comp. Phys., 118:269–277, 1995.
[2] J. L. Barron, D. J. Fleet, and S. S. Beauchemin. Performance of optical flow techniques.
International Journal of Computer Vision, 12(1):43–77, 1994.
[3] A. Baumberg and D. Hogg. An efficient method for contour tracking using active shape
models. In Proc. IEEE Workshop on Motion of Non-Rigid and Articulated Objects, pages
194–199, Austin, 1994.
[4] M. G. Bekker. Introduction to Terrain-Vehicle Systems. University of Michigan Press, Ann
Arbor, 1969.
[5] J. R. Beniger. The Arts and New Media site.
[6] M. Betke, E. Haritaoglu, and L. S. Davis. Real-time multiple vehicle detection and tracking
from a moving vehicle. Machine Vision and Applications, 12(2):69–83, 2000.
[7] A. Bobick and J. Davis. The representation and recognition of action using temporal templates. IEEE Trans. on PAMI, 23(3):257–267, 2001.
[8] A. F. Bobick. Natural Object Categorization. Ph.D. thesis, tr-1001, MIT Artificial Intelligence
Laboratory, 1987.
[9] A. Broggi. An image reorganization procedure for automotive road following systems. In
Proc. International Conference on Image Processing, volume 3, pages 532–535, 1995.
[10] A. Broggi. Parallel and local feature-extraction: A real-time approach to road boundary
detection. IEEE Transactions on Image Processing, 4(2):217–223, February 1995.
[11] T. J. Broida and R. Chellappa. Estimation of object motion parameters from noisy images.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(1):90–99, January 1986.
[12] P. Burlina and R. Chellappa. Time-to-x: Analysis of motion through temporal parameters.
In Proc. Conference on Computer Vision and Pattern Recognition, pages 461–468, 1994.
[13] H. Buxton and S. Gong. Advanced visual surveillance using bayesian networks. In Proc.
IEEE Special Workshop on Context-Based Vision, pages 111–123, MIT Massachusetts, April
1995.
[14] V. Caselles, F. Catte, T. Coll, and F. Dibos. A geometric model for active contours. Numerische Mathematik, 66:1–31, 1993.
[15] V. Caselles and B. Coll. Snakes in movement. SIAM J. Numer. Analy., 33(6):2445–2456,
1996.
[16] V. Caselles, R. Kimmel, and G. Sapiro. Geodesic active contours. In Proceedings ICCV’95,
pages 694–699, Boston, Massachusetts, June 1995.
[17] V. Caselles, R. Kimmel, and G. Sapiro. Geodesic active contours. IJCV, 22(1):61–79, 1997.
[18] V. Caselles, R. Kimmel, G. Sapiro, and C. Sbert. Minimal surfaces: A geometric three
dimensional segmentation approach. Numerische Mathematik, 77(4):423–425, 1997.
[19] V. Caselles, R. Kimmel, G. Sapiro, and C. Sbert. Minimal surfaces based object segmentation.
IEEE Trans. on PAMI, 19:394–398, 1997.
[20] C. Cedras and M. Shah. Motion-based recognition: A survey. IVC, 13(2):129–155, March
1995.
[21] T.F. Chan and L. A. Vese. Active contours without edges. IEEE trans. on Image Processing,
10(2):266–277, February 2001.
[22] C. S. Chiang, C. M. Hoffmann, and R. E. Lynch. How to compute offsets without self-intersection. In Proc. of SPIE, volume 1620, page 76, 1992.
[23] O. Chomat and J. Crowley. Recognizing motion using local appearance, 1998.
[24] D. L. Chopp. Computing minimal surfaces via level set curvature flow. J. of Computational
Physics, 106(1):77–91, May 1993.
[25] J. C. Clarke and A. Zisserman. Detection and tracking of independent motion. Image and
Vision Computing, 14:565–572, 1996.
[26] L. D. Cohen. On active contour models and balloons. CVGIP: Image Understanding, 53(2):211–218, 1991.
[27] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models: Their
training and application. CVIU, 61(1):38–59, January 1995.
[28] N. J. Cutaia, M. I. Miller, J. A. O’Sullivan, and D. L. Snyder. Multi-target narrow-band
direction finding and tracking based on motion dynamics. In Allerton Conference on Communication, Control, and Computing, Urbana-Champaign, IL, October 1992.
[29] N. J. Cutaia and J. A. O’Sullivan. Automatic target recognition using kinematic priors. In
Proceedings of the 33rd Conference on Decision and Control, pages 3303–3307, Lake Buena
Vista, FL, December 1994.
[30] N. J. Cutaia and J. A. O’Sullivan. Identification of maneuvering aircraft using class dependent
kinematic models. Technical Report ESSRL-95-13, Electronic Signals and Systems Research
Laboratory Monograph, St. Louis, MO, May 1995.
[31] N. J. Cutaia and J. A. O’Sullivan. Performance bounds for recognition of jump-linear systems.
In Proc. of the 1995 American Controls Conference, pages 109–113, Seattle, WA, June 1995.
[32] R. Cutler and L. Davis. Robust real-time periodic motion detection, analysis, and applications. PAMI, 22(8):781–796, August 2000.
[33] S. Di Zenzo. A note on the gradient of a multi image. Computer Vision, Graphics, and Image
Processing, 33:116–125, 1986.
[34] E. D. Dickmanns and B. D. Mysliwetz. Recursive 3-d road and relative ego-state recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):199–213, February
1992.
[35] B. A. Dubrovin, A. T. Fomenko, and S. P. Novikov. Modern Geometry - Methods and
Applications I. Springer-Verlag, New York, 1984.
[36] Z. Duric, J. Fayman, and E. Rivlin. Function from motion. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 18:579–591, 1996.
[37] Z. Duric, R. Goldenberg, E. Rivlin, and A. Rosenfeld. Estimating relative vehicle motion in
traffic scenes. Pattern Recognition, 35(6):1339–53, June 2002.
[38] Z. Duric, E. Rivlin, and A. Rosenfeld. Understanding object motion. IVC, 16(11):785–797,
August 1998.
[39] Z. Duric and A. Rosenfeld. Image sequence stabilization in real time. Real-Time Imaging,
2:271–284, 1996.
[40] Z. Duric, A. Rosenfeld, and L. S. Davis. Egomotion analysis based on the frenet-serret motion
model. IJCV, 15:105–122, 1995.
[41] J. R. Ellis. Vehicle Handling Dynamics. Mechanical Engineering Publications Limited, London, 1994.
[42] S. A. Engel and J. M. Rubin. Detecting visual motion boundaries. In Proc. Workshop on
Motion: Representation and Analysis, pages 107–111, Charleston, S.C., May 1986.
[43] O. Faugeras and R. Keriven. Variational principles, surface evolution PDEs, level set methods, and the stereo problem. IEEE Trans. on Image Processing, 7(3):336–344, 1998.
[44] S. Fejes and L. S. Davis. Detection of independent motion using directional motion estimation.
Computer Vision and Image Understanding, 74(2):101–120, May 1999.
[45] U. Franke, D. Gavrila, S. Gorzig, F. Lindner, F. Paetzold, and C. Wohler. Autonomous
driving goes downtown. IEEE Intelligent System, 13(6):40–48, 1998.
[46] P. Fua and Y. G. Leclerc. Model driven edge detection. Machine Vision and Applications,
3:45–56, 1990.
[47] H. Fujiyoshi and A. Lipton. Real-time human motion analysis by image skeletonization. In
Proc. of the Workshop on Application of Computer Vision, October 1998.
[48] M. Gage and R. S. Hamilton. The heat equation shrinking convex plane curves. J. Diff.
Geom., 23, 1986.
[49] D. M. Gavrila. The visual analysis of human movement: A survey. CVIU, 73(1):82–98,
January 1999.
[50] R. Gerber and H.-H. Nagel. Knowledge representation for the generation of quantified natural
language description of vehicle traffic in image sequence. In Proc. IEEE International Conference on Image Processing (ICIP’96), volume 2, pages 805–808, Lausanne/CH, September
1996.
[51] S. Gil, R. Milanese, and T. Pun. Combining multiple motion estimates for vehicle tracking.
In Proc. European Conference on Computer Vision, pages II:307–320, 1996.
[52] R. Goldenberg, R. Kimmel, E. Rivlin, and M. Rudzsky. Cortex segmentation - a fast variational geometric approach. In Proc. of IEEE workshop on Variational and Level Set Methods
in Computer Vision., pages 127–133, Vancouver, Canada, July 2001.
[53] R. Goldenberg, R. Kimmel, E. Rivlin, and M. Rudzsky. Fast geodesic active contours. IEEE
Trans. on Image Processing, 10(10):1467–75, October 2001.
[54] R. Goldenberg, R. Kimmel, E. Rivlin, and M. Rudzsky. ‘Dynamism of a dog on a leash’ or
behavior classification by eigen-decomposition of periodic motions. In Accepted to ECCV’02,
Copenhagen, May 2002.
[55] K. Gould, K. Rangarajan, and M. Shah. Detection and representation of events in motion trajectories. In Advances in Image Processing and Analysis, chapter 14. SPIE Optical
Engineering Press, June 1992. Gonzalez and Mahdavieh (Eds.).
[56] K. Gould and M. Shah. The trajectory primal sketch: a multi-scale scheme for representing
motion characteristics. In Proc. Conf. on Computer Vision and Pattern Recognition, San
Diego, CA, pages 79–85, 1989.
[57] M. A. Grayson. The heat equation shrinks embedded plane curves to round points. J. Diff.
Geom., 26, 1987.
[58] M. Haag, T. Frank, H. Kollnig, and H.-H. Nagel. Influence of an explicitly modeled 3d scene
on the tracking of partially occluded vehicles. Computer Vision and Image Understanding,
65(2):206–225, 1997.
[59] M. Haag and H.-H. Nagel. Tracking of complex driving manoeuvers in traffic image sequences.
Image and Vision Computing, 16(8):517–527, 1998.
[60] F. Heimes and H.-H. Nagel. Real-time tracking of intersections in image sequences of a
moving camera. Engineering Applications of Artificial Intelligence, 11:215–227, 1998.
[61] F. Heimes, H.-H. Nagel, and T. Frank. Model-based tracking of complex innercity road
intersections. Mathematical and Computer Modelling, 27(9-11):189–203, 1998.
[62] M. Irani, B. Rousso, and S. Peleg. Detecting and tracking multiple moving objects using
temporal integration. In Proc 2nd European Conf on Computer Vision, pages 282–287, Santa
Margharita, Ligure, Italy, 1992.
[63] J. H. Williams, Jr. Fundamentals of Applied Dynamics. Wiley, New York, 1996.
[64] G. Johansson. Visual perception of biological motion and a model for its analysis. Perception
and Psychophysics, 14(2):201–21, 1973.
[65] G. Johansson. Visual motion perception. Scientific American, 232(6):75–80, 85–88, 1975.
[66] S. X. Ju, M. Black, and Y. Yacoob. Cardboard people: A parameterized model of articulated
image motion. In Proc. Int. Conf. on Face and Gestures, pages 561–567, Vermont, 1996.
[67] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International
Journal of Computer Vision, 1:321–331, 1988.
[68] S. Kichenassamy, A. Kumar, P. Olver, A. Tannenbaum, and A. Yezzi. Gradient flows and
geometric active contour models. In Proceedings ICCV’95, Boston, Massachusetts, June 1995.
[69] D. Koller, K. Daniilidis, and H.-H. Nagel. Model-based object tracking in monocular image
sequences of road traffic scenes. International Journal of Computer Vision, 10:257–281, 1993.
[70] D. Koller, H. Heinze, and H.-H. Nagel. Algorithmic characterization of vehicle trajectories
from image sequences by motion verbs. In CVPR91, pages 90–95, Lahaina, Maui, Hawaii,
June 1991.
[71] H. Kollnig and H.-H. Nagel. Ermittlung von begrifflichen beschreibungen von geschehen in
straenverkehrsszenen mit hilfe unscharfer mengen. Informatik Forschung und Entwicklung,
8:186–196, 1993.
[72] E. Kreyszing. Differential Geometry. Dover Publications, Inc., New York, 1991.
[73] Bertrand Leroy, Isabelle L. Herlin, and Laurent Cohen. Multi-resolution algorithms for active contour models. In Proc. 12th Int. Conf. on Analysis and Optimization of Systems
(ICAOS’96), Paris, France, June 1996.
[74] A. Lipton, H. Fujiyoshi, and R. Patil. Moving target classification and tracking from real-time
video. In Proc. IEEE Image Understanding Workshop, pages 129–136, 1998.
[75] F. Liu and R. W. Picard. Finding periodicity in space and time. In Proc. of the 6th Int.
Conf. on Computer Vision, pages 376–383, Bombay, India, 1998.
[76] R. Malladi, R. Kimmel, D. Adalsteinsson, V. Caselles, G. Sapiro, and J. A. Sethian. A
geometric approach to segmentation and analysis of 3d medical images. In Proceedings of
IEEE/SIAM workshop on Biomedical Image Analysis, San-Francisco, California, June 1996.
[77] R. Malladi and J. A. Sethian. An O(N log N) algorithm for shape modeling. Proceedings of
National Academy of Sciences, USA, 93:9389–9392, 1996.
[78] R. Malladi, J. A. Sethian, and B. C. Vemuri. Shape modeling with front propagation: A level
set approach. IEEE Trans. on PAMI, 17:158–175, 1995.
[79] S. J. McKenna, S. Jabri, Z. Duric, A. Rosenfeld, and H. Wechsler. Tracking groups of people.
CVIU, 80(1):42–56, October 2000.
[80] A. E. Michotte. La Perception de la Causalite. Studia Psychologica, Publications Universitaires de Louvain, 1954.
[81] D. M. Moeslund and E. Granum. A survey of computer vision-based human motion capture.
CVIU, 81(3):231–268, March 2001.
[82] R. C. Nelson. Qualitative detection of motion by a moving observer. International Journal
of Computer Vision, 7(1):33–46, November 1991.
[83] American Association of State Highway Officials. A Policy on Geometric Design of Rural
Highways (9th edition). AASHO, 1977.
[84] R. Oldenburger. Mathematical Engineering Analysis. Macmillan, New York, 1950.
[85] S. J. Osher and J. A. Sethian. Fronts propagating with curvature dependent speed: Algorithms based on Hamilton-Jacobi formulations. J. of Comp. Phys., 79:12–49, 1988.
[86] N. Paragios and R. Deriche. A PDE-based level set approach for detection and tracking of
moving objects. In Proc. of the 6th ICCV, Bombay, India, 1998.
[87] N. Paragios and R. Deriche. Geodesic active regions for motion estimation and tracking. In
Proc. of the 7th Int. Conf. on Computer Vision, pages 688–694, Kerkyra, Greece, 1999.
[88] P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Trans. on PAMI, 12:629–639, 1990.
[89] R. Polana and R. C. Nelson. Temporal texture recognition. In Proc. of CVPR, pages 129–134,
1992.
[90] R. Polana and R. C. Nelson. Detecting activities. Journal of Visual Communication and
Image Representation, 5:172–180, 1994.
[91] R. Polana and R. C. Nelson. Low level recognition of human motion. In Proc. of IEEE Workshop on Motion of Non-Rigid and Articulated Objects, pages 77–82, Austin, TX, November
1994.
[92] R. Polana and R. C. Nelson. Recognizing activities. In ICPR94, pages A:815–818, 1994.
[93] R. Polana and R. C. Nelson. Detection and recognition of periodic, nonrigid motion. IJCV,
23(3):261–282, June 1997.
[94] S. Richter and D. Wetzel. A robust and stable road model. In Proc. International Conference
on Pattern Recognition, pages A:777–780, 1994.
[95] E. Rivlin, M. Rudzsky, R. Goldenberg, U. Bogomolov, and S. Lapchev. A real-time system
for classification of moving objects. In Accepted to ICPR’02, Québec City, Canada, August
2002.
[96] R. Rosales and S. Sclaroff. Improved tracking of multiple humans with trajectory prediction
and occlusion modelling. In Proc. of IEEE Conf. on CVPR, Workshop on the Interpretation
of Visual Motion, Santa Barbara, CA, 1998.
[97] A. Ruhe and A. Wedin. Algorithms for separable nonlinear least squares problems. SIAM
Review, 22:318–337, 1980.
[98] G. Sapiro and D. L. Ringach. Anisotropic diffusion of multivalued images with applications
to color filtering. IEEE Trans. Image Proc., 5:1582–1586, 1996.
[99] R. C. Schank. Conceptual dependency: A theory of natural language understanding. Cognitive
Psychology, 3(4):552–631, 1972.
[100] H. Schneiderman and M. Nashman. A discriminating feature tracker for vision-based autonomous driving. IEEE Transactions on Robotics and Automation, 10:769–775, 1994.
[101] S. M. Seitz and C. R. Dyer. View invariant analysis of cyclic motion. Int. Journal of Computer
Vision, 25(3):231–251, December 1997.
[102] J. A. Sethian. Level Set Methods: Evolving Interfaces in Geometry, Fluid Mechanics, Computer Vision and Materials Sciences. Cambridge Univ. Press, 1996.
[103] J. A. Sethian. A marching level set method for monotonically advancing fronts. Proc. Nat.
Acad. Sci., 93(4), 1996.
[104] J. Shah. A common framework for curve evolution, segmentation and anisotropic diffusion.
In Proceedings IEEE CVPR’96, pages 136–142, 1996.
[105] M. Shah, K. Rangarajan, and P.-S. Tsai. Motion trajectories. IEEE Transactions on Systems, Man and Cybernetics, 23(4):1138–1150, July/August 1993.
[106] R. Sharma and Y. Aloimonos. Early detection of independent motion from active control
of normal image flow patterns. IEEE Transactions on Systems, Man, and Cybernetics B,
26(1):42–52, February 1996.
[107] E. Shavit and A. Jepson. Motion understanding from qualitative visual dynamics. In IEEE
workshop on Qualitative Vision, New-York, June 1993.
[108] E. Shavit and A. Jepson. The pose function: An intermediate level representation for motion
understanding. In Proc. of the 10th Israeli Symposium on AI and Vision, Tel-Aviv, December
1993.
[109] J. M. Siskind. Naive Physics, Event Perception, Lexical Semantics and Language Acquisition.
Ph.D. thesis, Elec. Eng’g. and Comp. Sci., MIT, January 1992.
[110] J. M. Siskind and Q. Morris. A maximum-likelihood approach to visual event classification. In Proceedings of the Fourth European Conference on Computer Vision, pages 347–360,
Cambridge, UK, April 1996.
[111] J.M. Siskind. Grounding language in perception. AI Review, 8(5-6):371–391, 1995.
[112] R. H. Smythe. Vision in Animal World. St. Martin’s Press, NY, 1975.
[113] N. Sochen, R. Kimmel, and R. Malladi. A general framework for low level vision. IEEE
Trans. on Image Processing, 7(3):310–318, 1998.
[114] G. W. Stewart. Introduction to Matrix Computations. Academic Press, New York, 1973.
[115] D. Terzopoulos, A. Witkin, and M. Kass. Constraints on deformable models: Recovering 3d
shape and nonrigid motions. Artificial Intelligence, 36:91–123, 1988.
[116] P. H. S. Torr and D. W. Murray. Statistical detection of independent movement from a
moving camera. Image and Vision Computing, 11:180–187, 1993.
[117] P. S. Tsai, M. Shah, K. Keiter, and T. Kasparis. Cyclic motion detection for motion based
recognition. Pattern Recognition, 27(12):1591–1603, December 1994.
[118] J. N. Tsitsiklis. Efficient algorithms for globally optimal trajectories. IEEE Trans. on Automatic Control, 40(9):1528–1538, 1995.
[119] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuro Science,
3(1):71–86, 1991.
[120] S. Ullman. High-level Vision: Object Recognition and Visual Cognition. MIT Press, Cambridge, Massachusetts, 1996.
[121] A. Verri and T. Poggio. Against quantitative optical flow. In Proc. International Conference
on Computer Vision, pages 171–180, 1987.
[122] K. von Frisch. Bees: Their Vision, Chemical Senses, and Language. Cornell University Press,
Ithaca, New York, 1971.
[123] J. Weickert. Fast segmentation methods based on partial differential equations and watershed
transformation. In Mustererkennung, pages 93–100. Springer, Berlin, 1998.
[124] J. Weickert, B. M. ter Haar Romeny, and M. A. Viergever. Efficient and reliable scheme for
nonlinear diffusion filtering. IEEE Trans. on Image Processing, 7(3):398–410, 1998.
[125] J. Weng, T. Huang, and N. Ahuja. 3-d motion estimation, understanding, and prediction from
noisy image sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence,
9(3):370–389, May 1987.
[126] J. Y. Wong. Theory of Ground Vehicles (2nd edition). Wiley, New York, 1993.
[127] Y. Yacoob and M. J. Black. Parameterized modeling and recognition of activities. In Proc.
Int. Conf. on Computer Vision, Bombay, India, 1998.
[128] Y. Yacoob and M. J. Black. Parameterized modeling and recognition of activities. CVIU,
73(2):232–247, February 1999.
[129] S. C. Zhu and A. Yuille. Region competition: Unifying snakes, region growing, and
bayes/MDL for multiband image segmentation. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 18:884–900, 1996.
[130] S. C. Zhu, A. Yuille, and T. S. Lee. Region competition: Unifying snakes, region growing,
and bayes/mdl for multi-band image segmentation. In ICCV95, pages 416–423, 1995.