Computer Vision and Image Understanding 113 (2009) 580–587
Note
Target tracking with incomplete detection
Yunqian Ma a,*, Qian Yu b, Isaac Cohen a
a Honeywell Labs, 1985 Douglas Drive North, Golden Valley, MN 55422, USA
b Sarnoff Corporation, 201 Washington Rd, Princeton, NJ 08536, USA
Article info
Article history:
Received 29 August 2006
Accepted 21 January 2009
Available online 30 January 2009
Keywords:
Multiple target tracking
Split and merge of detected regions
Maximum a posteriori
Abstract
In this paper, we address the multiple target tracking problem as a maximum a posteriori (MAP) problem. We adopt a graph representation of all observations over time. To make full use of the visual observations from the image sequence, we introduce both motion and appearance likelihoods. The multiple target tracking problem is formulated as finding multiple optimal paths in the graph. Due to noisy foreground segmentation, an object may be represented by several foreground regions and, conversely, one foreground region may correspond to multiple objects. To deal with this problem, we propose merge, split and mean shift operations that add new hypotheses to the measurement graph. The proposed approach uses a sliding window framework that aggregates information across a fixed number of frames. Experimental results on both indoor and outdoor data sets are reported. Furthermore, we provide a comparison between the proposed approach and existing methods that do not merge/split detected blobs.
© 2009 Elsevier Inc. All rights reserved.
1. Introduction
Multiple target tracking is a key component in visual surveillance. Tracking provides a spatio-temporal description of detected moving regions in the scene; this low-level information is critical for recognition of human actions in video surveillance. In the considered visual tracking problem, the observations used are the detected moving blobs. Incomplete observations due to occlusions, stop-and-go motion or noisy foreground detections constitute the main limitation of blob-based tracking methods. We propose a tracking method that can split and merge detected moving regions, as well as re-acquire moving targets after stop-and-go motion or occlusion.
Several problems need to be addressed by a tracking algorithm. A single moving object (e.g. one person) can be detected as multiple moving blobs; in this case the tracking algorithm needs to 'merge' the detected blobs. Similarly, one detected blob can be composed of multiple moving objects; in this case the tracking algorithm needs to 'split' and segment the detected blob into the corresponding moving objects. The split and merge of detected blobs has to be robust to partial or total occlusions, and must be capable of differentiating detected moving regions of nearby objects. Stop-and-go motion, or non-detection due to similarity of the object to the background, may require the tracker to re-acquire the target.
* Corresponding author.
E-mail addresses: yunqian.ma@honeywell.com (Y. Ma), qyu@sarnoff.com (Q. Yu),
isaac.cohen@honeywell.com (I. Cohen).
1077-3142/$ - see front matter © 2009 Elsevier Inc. All rights reserved.
doi:10.1016/j.cviu.2009.01.002
Moreover, some detected blobs may be due to erroneous motion detection. Here the tracking algorithm needs to filter these observations in the presence of static or dynamic occlusions of the moving objects in the scene. Finally, the number of moving objects in the scene varies as new moving objects enter or leave the field of view of the camera.
A large number of tracking algorithms have been developed in the past decades. The interested reader can refer to [24] for a recent comprehensive survey of the field. Several data association tracking algorithms have been proposed, ranging from simple nearest neighbor association to the complex multiple hypothesis tracker [9,10]. The Probabilistic Data Association (PDA) method [14], which is considered a good compromise between performance and complexity, uses a weighted average of all the measurements within the tracks' validation gate [15] to estimate the target state. The PDA method treats multiple targets as independent objects in terms of observations, and is therefore less suitable for situations where multiple observations correspond to a single target and vice versa. JPDAF [7,8] is an extension of the PDA, in which the measurement-to-target association probabilities are evaluated jointly across the targets. The Multiple Hypothesis Tracker (MHT) was first developed by Reid [9] and propagates multiple hypotheses over time. Ranking [3] the hypotheses requires evaluating all existing hypotheses, so pruning and merging [4] are used to reduce the set of hypotheses to a manageable size. A class of maximum likelihood methods seeks single or multiple best paths in an observation graph [2,5]; however, these methods assume no missing detections and a known number of objects.
Most existing data association algorithms cannot handle the merging or splitting of observations for an accurate estimation of the target state. Multiple hypothesis trackers are the most widely used; however, these methods assume a one-to-one mapping between observations and targets. An attempt to extend these frameworks to merge and split behaviors was proposed in [16], which introduced the concept of a virtual measurement to represent the splitting and merging of detected regions. However, the association was inferred using a brute force method. Senior [17] performs multi-object segmentation using a probabilistic pixel classification algorithm, which uses an appearance model to compute the likelihood of a pixel belonging to a particular object. An iterative approach then finds the front-most model first, deletes it from the foreground, and then fits the second object. Wu et al. [18] define an occlusion relation parameter for addressing the blob splitting problem.
In this paper, we formulate the multiple target tracking problem as a maximum a posteriori (MAP) problem. We expand the set of observations provided by a motion blob detector with hypotheses added by merge, split and mean shift operations, which are designed to deal with noisy foreground segmentation due to occlusion, foreground fragments and missing detections. All these added hypotheses are validated during the MAP estimation.
The remainder of this paper is organized as follows. In Section 2, we present the MAP formulation of the multiple target tracking problem, using motion and appearance probabilities to define the joint likelihood. In Section 3, we present the proposed tracking method using merge, split and mean shift operations. In Section 4, we present experimental results obtained on real indoor and outdoor video sequences, and compare the proposed tracking method with the existing art in the field. Finally, in Section 5 we give conclusions and outline future work.
2. Multiple target tracking formulation
In a multiple target tracking problem, the objective is to track multiple target trajectories over time given a set of noisy measurements provided by a motion detection algorithm, such as [19]. The observations considered are blobs that cannot be regarded as complete observations; furthermore, the targets' positions and velocities are automatically initialized and do not require operator interaction.

The detector usually provides image blobs which contain the estimated location and size as well as appearance information. Within an arbitrary time span $[0, T]$, there is an unknown number $K$ of targets in the monitored scene. Let $y_t = \{y_t^i : i = 1, \ldots, m_t\}$ denote the observations at time $t$; $Y = \bigcup_{t \in \{1,\ldots,T\}} y_t$ is the set of all observations within the duration $[0, T]$. Multiple target tracking can be formulated as finding the set of $K$ best paths $\{s_1, s_2, \ldots, s_K\}$ in the temporal and spatial space, where $K$ represents the number of moving objects or targets in the scene and is unknown. We denote a track by the set of its observations: $s_k = \{s_k(t) : t \in [1, T]\}$, where $s_k(t) \in y_t$ represents the observation of track $s_k$ at time $t$.
2.1. Graph representation and MAP estimate

We utilize a directed graph representation $G = \langle V, E \rangle$ of all measurements within $[0, T]$. Each node in $V = \{y_t^k : t = 1, \ldots, T;\ k = 1, \ldots, m_t\}$ corresponds to a detected moving region; we also consider a special node $y_t^0$ representing the null measurement at time $t$, which corresponds to missed detections. A directed edge $(y_{t_1}^i, y_{t_2}^j) \in E$, $t_1 < t_2$, is defined between two nodes based on proximity and similarity of the corresponding detected blobs. The weight or cost associated with an edge is computed using the motion and appearance models described in the following paragraphs.

To reduce the number of edges in the graph, we consider only edges for which the weight (similarity of motion and appearance) between two nodes exceeds a pre-determined threshold. This is similar to the gating in [15]. An example of such a graph is shown in Fig. 1. At each time instant there are $m_t$ observations; an observation that does not belong to any track represents a false alarm. The shaded node represents a missing observation, inferred by the tracking.

We formulate the multiple target tracking problem as a MAP problem. Finding the $K$ best paths $s_{1,\ldots,K}$ through the graph $G$ means finding

$$s_{1,\ldots,K} = \arg\max P(s_{1,\ldots,K} \mid Y) \quad (1)$$

The posterior of the $K$ best paths can be expressed through the observation likelihood of the $K$ paths and the prior of the $K$ paths as follows:

$$P(s_{1,\ldots,K} \mid Y) \propto P(Y \mid s_{1,\ldots,K})\, P(s_{1,\ldots,K}) \quad (2)$$

The $K$-path multiple target tracking problem then corresponds to the MAP estimate

$$s_{1,\ldots,K} = \arg\max \big( P(Y \mid s_{1,\ldots,K})\, P(s_{1,\ldots,K}) \big) \quad (3)$$

First, we describe the likelihood of the $K$ paths. To make full use of the available visual cues, we consider both motion and appearance likelihood measures. Assuming that each target moves independently, the joint likelihood of the $K$ paths over time $[1, T]$ can be written as

$$P(Y \mid s_{1,\ldots,K}) = \prod_{k=1}^{K} P_{motion}(s_k(1), \ldots, s_k(T))\, P_{color}(s_k(1), \ldots, s_k(T)) \quad (4)$$

The joint likelihood function defined in Eq. (4) essentially represents the smoothness of tracks in both appearance and motion. If the number of objects is known, we can simply apply the $K$ shortest disjoint paths algorithm [21] to solve this problem. When $K = 1$, the tracking algorithm reduces to a Viterbi algorithm [5]. However, the number of targets varies and is usually unknown; thus we need a prior distribution to regularize the likelihood and avoid overfitting the observations.
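The gated construction of the observation graph can be sketched as follows; `affinity` stands in for the combined motion/appearance models of Sections 2.2 and 2.3, and the gate threshold and maximum frame gap are illustrative assumptions of this sketch:

```python
from itertools import product

def build_observation_graph(detections, affinity, gate=1e-3, max_gap=2):
    """Build a gated, directed observation graph.

    detections: dict mapping frame index t -> list of blob observations.
    affinity:   function (blob_a, blob_b) -> combined motion/appearance
                likelihood (a stand-in for the models of Sections 2.2-2.3).
    Edges (t1, i) -> (t2, j) are kept only when the affinity exceeds the
    gate, mirroring the gating used to prune unlikely associations.
    """
    edges = {}
    frames = sorted(detections)
    for t1, t2 in product(frames, frames):
        if not (0 < t2 - t1 <= max_gap):   # short gaps allow missed detections
            continue
        for i, a in enumerate(detections[t1]):
            for j, b in enumerate(detections[t2]):
                w = affinity(a, b)
                if w > gate:               # gating: drop unlikely edges
                    edges[(t1, i), (t2, j)] = w
    return edges
```

Tracks then correspond to paths through this edge set, with the special null node handled separately.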
Fig. 1. Graph representation of measurements.
Second, the prior model $P(s_k : k = 1, \ldots, K)$ is represented as follows:

$$P(s_{1,\ldots,K}) = \prod_{k=1}^{K} \left( \frac{p_d}{1 - p_d} \right)^{|s_k|} \prod_{t=1}^{T} \lambda_f^{f_t} \exp(-C(t)) \quad (5)$$

where $|s_k|$ is the number of measurements associated with track $k$, and $p_d$ denotes the detection rate, which can be determined from prior knowledge of the detection procedure. We assume the number of false alarms $f_t$ follows a Poisson distribution with parameter $\lambda_f$. $C(t)$ represents the overlap between different tracks at each time instant, and can be written as

$$C(t) = \sum_{s_i(t) \cap s_j(t) \neq \emptyset} \frac{|s_i(t) \cap s_j(t)|}{|s_i(t) \cup s_j(t)|} \quad (6)$$

where $s_k(t)$ is the observation associated with track $k$ at time $t$.

Directly optimizing Eq. (3) is computationally impractical for real-time applications. We seek a suboptimal solution as

$$s_{1,\ldots,K} = \arg\max_{k_l \leq k \leq k_u} \Big( \big( \arg\max_{s_{1,\ldots,k}} P(Y \mid s_{1,\ldots,k}) \big)\, P(s_{1,\ldots,k}) \Big) \quad (7)$$

where $k_l$ and $k_u$ are the lower and upper bounds on the number of targets. The optimization is thus reduced to two steps: first find the optimal tracks given the number of tracks, then compute the prior weight of the candidate tracks and select the best weighted solution. Note that the joint likelihood is defined by the product of the appearance and motion likelihoods, while the weight of the edges in a track is the sum of the negative log likelihoods; maximizing the joint likelihood therefore amounts to finding the $K$ shortest disjoint paths. Given the number of tracks, the optimal tracks are computed using [21] (see Appendix A for details). An overview of this approach is shown in Algorithm 1.
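A minimal log-domain sketch of the prior of Eqs. (5) and (6); the box representation, the IoU-based overlap and the parameter values are assumptions of this illustration, not the paper's implementation:

```python
import math

def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def overlap_penalty(tracks, t):
    """C(t) of Eq. (6): summed IoU over pairs of tracks overlapping at t."""
    boxes = [dict(trk).get(t) for trk in tracks]   # trk: list of (t, box)
    total = 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes[i] and boxes[j]:
                total += box_iou(boxes[i], boxes[j])
    return total

def log_prior(tracks, num_false_alarms, p_d=0.9, lam_f=0.5):
    """Log of Eq. (5): a detection-rate term per associated measurement,
    a Poisson false-alarm term per frame, and the overlap penalty C(t)."""
    lp = 0.0
    for trk in tracks:
        lp += len(trk) * math.log(p_d / (1.0 - p_d))
    for t, f_t in enumerate(num_false_alarms):
        lp += f_t * math.log(lam_f)
        lp -= overlap_penalty(tracks, t)           # exp(-C(t)) in Eq. (5)
    return lp
```

Overlapping tracks thus lower the prior, discouraging solutions in which several tracks claim the same image area.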
The definitions of the motion and appearance likelihoods are given in Sections 2.2 and 2.3. The construction and augmentation of the observation graph is described in Section 3.
Algorithm 1. Overview of the multiple target tracking algorithm

Input: $Y$, $k_l$, $k_u$
Output: $K$, $s_{1,\ldots,K}$
1. Build the observation graph.
2. Augment the observation graph with hypotheses.
3. Compute the optimal tracks:
   for $k = k_l$ to $k_u$ do
     find $s_{1,\ldots,k} = \arg\max P(Y \mid s_{1,\ldots,k})$
     compute $P(Y \mid s_{1,\ldots,k})\, P(s_{1,\ldots,k})$
   end for
4. Return the tracks with the best combined weight as the solution.
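Algorithm 1 can be sketched as the following loop; the `solver`, `likelihood` and `prior` callables are placeholders for the K shortest disjoint paths routine of [21] and the models of Sections 2.2 and 2.3:

```python
def track_targets(graph, k_lo, k_hi, solver, likelihood, prior):
    """Two-step scheme of Eq. (7): for each candidate number of tracks k,
    find the best k node-disjoint paths by likelihood alone, then rescore
    with the prior and keep the overall winner.

    solver(graph, k)   -> list of k disjoint paths, or None if none exist
    likelihood(paths)  -> log P(Y | s_1..k)
    prior(paths)       -> log P(s_1..k)
    """
    best_paths, best_score = None, float("-inf")
    for k in range(k_lo, k_hi + 1):
        paths = solver(graph, k)              # arg max of P(Y | s_1..k)
        if paths is None:                     # fewer than k disjoint paths
            continue
        score = likelihood(paths) + prior(paths)
        if score > best_score:
            best_paths, best_score = paths, score
    return best_paths
```

The prior's penalty on extra tracks is what keeps the loop from always preferring the largest k.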
2.2. Motion model
Before discussing the motion and appearance likelihoods, we first introduce the ground plane assumption, upon which the motion model is defined.
2.2.1. Ground plane assumption
We assume the targets are moving on a ground plane, as shown in Fig. 2a. The detected blobs are shown in Fig. 2b. The ground plane assumption provides a rough estimate of the targets' average height, as in Fig. 2c.

Fig. 2. Illustration of the ground plane model.
The homography between the image plane and the ground
plane can be represented as follows.
$$[l_x, l_y, 1]^T = H\, [g_x, g_y, 1]^T \quad (8)$$

where $(l_x, l_y, 1)$ and $(g_x, g_y, 1)$ are homogeneous locations on the image plane and the ground plane, respectively. The homography matrix $H$ can be estimated by least squares from four point correspondences between the image plane and the ground plane. We assume that the contact point of a target with the ground plane corresponds to the bottom point of the detected image blob. Given the homography, we project the contact point onto the ground plane; this gives the location of the target on the ground plane.

The relationship between the average target height $h$ and the position in the image plane can be modeled as a linear combination of $l_x, l_y, 1$: $h = a\, [l_x, l_y, 1]^T$, where $a$ is a $1 \times 3$ vector. To estimate the average height, we select several foreground images which only contain blobs with correct height and use a least squares method to compute the average height.
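A sketch of the homography estimation and contact-point projection of Eq. (8); the standard DLT least-squares formulation is an assumption here, since the paper only states that least squares over four correspondences is used:

```python
import numpy as np

def fit_homography(ground_pts, image_pts):
    """Estimate H with [lx, ly, 1]^T ~ H [gx, gy, 1]^T (Eq. (8)) from
    point correspondences via the standard DLT least-squares system."""
    A = []
    for (gx, gy), (lx, ly) in zip(ground_pts, image_pts):
        A.append([gx, gy, 1, 0, 0, 0, -lx * gx, -lx * gy, -lx])
        A.append([0, 0, 0, gx, gy, 1, -ly * gx, -ly * gy, -ly])
    # the smallest right singular vector of A gives h up to scale
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 3)

def image_to_ground(H, lx, ly):
    """Project the blob's bottom contact point into the ground plane
    using the inverse homography."""
    g = np.linalg.inv(H) @ np.array([lx, ly, 1.0])
    return g[0] / g[2], g[1] / g[2]
```

Since Eq. (8) maps ground to image, the inverse of $H$ is applied when projecting a detected contact point back onto the ground plane.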
If a ground plane is available, we can use a constant velocity motion model in both the 2D image plane and the 3D ground plane; otherwise, only in the 2D image plane. We denote by $x_t^k$ the state vector of target $k$ at time $t$, $x_t^k = [l_x, l_y, w, h, \dot{l}_x, \dot{l}_y, l_{gx}, l_{gy}, \dot{l}_{gx}, \dot{l}_{gy}]$ (location, width, height and velocity in the 2D image, and location and velocity on the ground plane). We consider a linear kinematic model:

$$x_{t+1}^k = A^k x_t^k + w^k \quad (9)$$

where $A^k$ is the following transition matrix:

$$A^k = \begin{pmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix} \quad (10)$$
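In code, the transition matrix of Eq. (10) and a one-step prediction might look as follows (a sketch of the constant-velocity model; the process noise $w^k$ is omitted):

```python
import numpy as np

def transition_matrix():
    """A^k of Eq. (10): constant-velocity dynamics on the 10-dim state
    [lx, ly, w, h, lx', ly', lgx, lgy, lgx', lgy']."""
    A = np.eye(10)
    A[0, 4] = 1.0   # lx  += lx'  (image-plane velocity)
    A[1, 5] = 1.0   # ly  += ly'
    A[6, 8] = 1.0   # lgx += lgx' (ground-plane velocity)
    A[7, 9] = 1.0   # lgy += lgy'
    return A

# one-step prediction x_{t+1} = A x_t (noise term omitted)
x = np.array([0., 0., 10., 20., 1., 2., 5., 5., 0.5, -0.5])
x_next = transition_matrix() @ x
```

Width, height and all velocity components are carried over unchanged; only the position entries are advanced by their velocities.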
We assume $w^k$ to follow a normal distribution, $w^k \sim N(0, Q^k)$.

The observation $y_t^k = [u_x, u_y, w, h, u_{gx}, u_{gy}]$ contains the measurement of a target's position and size in the 2D image plane and its position on the 3D ground plane. Since observations often contain false alarms, the observation model is

$$y_t^k = \begin{cases} H^k x_t^k + v^k & \text{if it belongs to a target} \\ d_t & \text{false alarm} \end{cases} \quad (11)$$

where $y_t^k$ represents a measurement which may arise either from a false alarm or from the target, and $d_t$ is the false alarm at time $t$. We assume $v^k$ to follow a normal distribution, $v^k \sim N(0, R^k)$. According to the definition of the measurement and state vectors, the measurement matrix $H^k$ can be written as

$$H^k = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix} \quad (12)$$

The measurement is modeled as a linear function of the current state if it belongs to a target; otherwise it is modeled as a false alarm $d_t$, which is assumed to follow a uniform distribution.

Let $\hat{s}_k(t_i)$ denote the a posteriori state estimate and $\hat{P}_t(s_k)$ the a posteriori estimate of the error covariance matrix of $s_k$ at time $t$. Along a track $s_k$, the motion likelihood of an edge $(s_k(t_1), s_k(t_2)) \in E$, $t_1 < t_2$, can be represented as $P_{motion}(s_k(t_2) \mid \hat{s}_k(t_1))$. Given the transition and observation models of the Kalman filter, the motion likelihood can be written as

$$P_{motion}(s_k(t_2) \mid \hat{s}_k(t_1)) = \frac{1}{(2\pi)^{3/2} \det(\hat{P}_{t_2}(s_k))} \exp\left( -\frac{e^T \hat{P}_{t_2}^{-1}(s_k)\, e}{2} \right) \quad (13)$$

where $e = y_t^k - H A^{t_2 - t_1} \hat{s}_k(t_1)$, and $\hat{P}_{t_2}(s_k)$ can be computed recursively by a Kalman filter as $\hat{P}_{t_2}(s_k) = H (A \hat{P}_{t_1}(s_k) A^T + Q) H^T + R$.

2.3. Appearance model

In this section, we present the appearance model likelihood. An observation's appearance can be represented by a set of distinctive features such as color, shape, or texture. A good appearance model, such as [11], can be invariant to 2D rigid transformations and to scale changes over a wide range of resolutions.

In this paper, we use a simple color appearance model to illustrate the proposed algorithm; other appearance models, such as [11], can be used within this framework as well. In the simple color appearance model, the appearance of each detected region is modeled using a non-parametric color histogram. All RGB bins are concatenated to form a one-dimensional histogram. The appearance likelihood between two image blobs $(s_k(t_1), s_k(t_2)) \in E$, $t_1 < t_2$, in track $k$ is measured using the symmetric Kullback-Leibler (KL) divergence:

$$P_{color}(s_k(t_2) \mid \hat{s}_k(t_1)) = \frac{1}{2} \sum_{c = r, g, b} (P_i(c) - P_j(c)) \log \frac{P_i(c)}{P_j(c)} \quad (14)$$

Using the motion and appearance models, we associate a weight with each edge defined between two nodes of the graph. This weight combines the appearance and motion likelihood models presented in the previous paragraphs. In Eqs. (11) and (14), we assume that the state of a target at time $t$ is determined by its state at time $t - 1$, and that the observation at time $t$ is a function of the state at time $t$ alone, i.e. the Markov condition. Thus the joint likelihood of the $K$ paths in Eq. (4) can be factorized as

$$P(Y \mid s_{1,\ldots,K}) = \prod_{k=1}^{K} \prod_{(t_1, t_2) \in s_k} P_{motion}(s_k(t_2) \mid \hat{s}_k(t_1))\, P_{color}(s_k(t_2) \mid \hat{s}_k(t_1)) \quad (15)$$

where $(s_k(t_1), s_k(t_2))$ represents an edge in track $k$.
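A minimal sketch of the per-edge weight combining Eqs. (13)-(15); treating the negative log motion likelihood plus the symmetric KL divergence as an additive edge cost is one reasonable reading of these equations, not the paper's exact implementation:

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two color histograms (Eq. (14))."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * np.sum((p - q) * np.log(p / q))

def edge_cost(motion_like, p_hist, q_hist):
    """Negative-log edge cost: summing these costs along a path corresponds
    to the product of per-edge likelihoods in Eq. (15), so shortest paths
    maximize the joint likelihood."""
    return -np.log(motion_like) + symmetric_kl(p_hist, q_hist)
```

Identical histograms contribute zero appearance cost, so the edge cost is then driven by the motion term alone.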
3. Merge, split and re-acquisition
Graph representation [2,22] is widely used to organize the relationship between multiple observations in multiple frames. In such
a graph representation, nodes represent the detected moving regions and edges encode the motion and appearance likelihood of
the two observations belonging to one track in two frames. With
the graph representation, the trajectories of multiple targets can
be represented as paths in such a graph and multiple target tracking problem is to recover such paths.
Most multiple target tracking algorithms [2,6,10] assume that
no two paths pass through the same observation. This assumption
is reasonable when considering complete observations. However,
this assumption is usually violated in the context of visual tracking
problem, where the targets cannot be regarded as points and the
inputs to the tracking algorithm are usually image blobs. Before
finding the shortest disjoint paths, we aim to recover the case of
merge, split or missing observations by adding new hypotheses
in the observation graph. In order to form reasonable hypotheses,
we generate candidate tracks by connecting observation nodes
with large enough motion and appearance likelihood. The threshold of forming candidate tracks is quite strict. And the estimated
positions of these candidate tracks will be used to generate new
hypotheses. In the following paragraphs we present an extension
of the framework to handle split and merge behaviors in estimating the best paths.
3.1. Merge and split hypothesis

The proposed merge and split operations correspond to a recursive association of new observations given estimated trajectories. Using the estimated candidate tracks, we can evaluate how well the $m_{t+1}$ observations $\{y_{t+1}^i : i = 1, \ldots, m_{t+1}\}$ at time $t+1$ fit the estimated tracks ending at time $t$. The spatial overlap between the estimated state at time $t$ and a new observation is considered as the primary cue, and we consider the following two cases:
• If the prediction $\hat{s}_k(t+1)$ of a candidate track has a sufficient spatial overlap with more than one observation at time $t+1$, a merge operation is triggered, merging those observations at time $t+1$ into one new observation. This new observation, carrying the merge hypothesis, is added to the graph. The merge operation is illustrated in Fig. 3a.
• If the predicted positions and shapes of more than one track spatially overlap with a single observation $y_{t+1}$ at time $t+1$ (the set of such candidate tracks is $J$, with $|J| > 1$), a split operation is triggered, splitting the node $y_{t+1}$ into several observations. These observations, which encode the split hypothesis, are added to the observation set at time $t+1$. The split operation proceeds as follows: for each track $s_k$ in $J$ whose prediction has a sufficient overlap with $y_{t+1}$:
- adjust the predicted size and location at time $t+1$ to find the best appearance score $\rho_k = P_{color}(\hat{s}_k(t+1), y_{t+1})$;
- create a new observation node for the track with the largest $\rho_k$ and add it to the graph;
- reduce the confidence of the area occupied by the newly added node and recompute the score $\rho_k$ for each track left in $J$.
This process is iterated until all candidate tracks in $J$ that overlap with the observation $y_{t+1}$ have been tested. The split operation is illustrated in Fig. 3b.
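The overlap tests that trigger the two hypotheses can be sketched as follows; IoU and the 0.3 threshold are illustrative stand-ins for the paper's "sufficient spatial overlap" criterion:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def propose_hypotheses(track_preds, observations, min_overlap=0.3):
    """Generate merge/split hypotheses from spatial overlap.

    track_preds:  {track_id: predicted box (x1, y1, x2, y2)} at time t+1
    observations: list of detected boxes at time t+1
    Yields ('merge', track_id, [obs indices]) when one prediction covers
    several blobs, and ('split', [track ids], obs index) when several
    predictions cover one blob.
    """
    hyps = []
    # merge: one track overlapping several observations
    for tid, pred in track_preds.items():
        hits = [i for i, o in enumerate(observations)
                if iou(pred, o) > min_overlap]
        if len(hits) > 1:
            hyps.append(('merge', tid, hits))
    # split: several tracks overlapping one observation
    for i, o in enumerate(observations):
        owners = [tid for tid, pred in track_preds.items()
                  if iou(pred, o) > min_overlap]
        if len(owners) > 1:
            hyps.append(('split', owners, i))
    return hyps
```

The returned hypotheses are then added as new nodes to the observation graph and validated, like every other node, during the MAP estimation.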
3.2. Mean shift hypothesis
In this section, we present the mean shift hypothesis. For example, the regions are not detected in the case of stop-and-go motion.
In this case, we propose to incorporate additional information from
the images for improving appearance-based tracking. Since at each
time t, we have already maintained the appearance histogram of
each candidate track, we introduce the mean shift operation to
keep track of this appearance distribution when the motion blob
does not provide good enough input.
Mean shift method [13], which can be regarded as a mode-seeking process, was successfully applied to the tag-to-track problem.
Usually the central module of mean shift tracker is based on the
mean shift iterations to find the most probable target position in
the current frame according to the previous target appearance
histogram.
In our multiple target tracking problem, if a candidate track is not associated with an observation at time $t$ due to low motion and appearance likelihood (caused by a fragmented detection, non-detection, a large mismatch in size, etc.), we instantiate a mean shift algorithm to propose the most probable target position given the appearance histogram of the track.
A color histogram of the candidate track is used in mean shift. Note that this histogram is built from past observations along the path (within a sliding window) rather than from only the latest observation; past observations are simply the observations (image blobs) in the track before time $t$. Using the position predicted by mean shift, we add a new observation to the graph.

The final decision is made by considering all the observations in the graph. To prevent mean shift from continuing to track a target after it leaves the field of view, the mean shift hypothesis is considered only for trajectories in which the ratio of real nodes to the total number of observations along the track is larger than a threshold. Because new hypotheses are added to the graph, one track can contain both real nodes (obtained from motion segmentation) and hypothesis nodes; the ratio of real nodes measures how many real observation nodes are in a track. The threshold on this value prevents too many mean shift hypotheses from being added.
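A generic mean shift mode-seeking sketch; in the tracker the sample weights would come from back-projecting the track's appearance histogram onto the image, which is an assumption of this illustration:

```python
import numpy as np

def mean_shift(positions, weights, start, bandwidth=10.0, iters=20, tol=1e-3):
    """Plain mean shift: repeatedly move the window centre to the
    Gaussian-kernel weighted mean of nearby samples until convergence."""
    pos = np.asarray(positions, float)
    w = np.asarray(weights, float)
    c = np.asarray(start, float)
    for _ in range(iters):
        d2 = ((pos - c) ** 2).sum(axis=1)
        k = w * np.exp(-d2 / (2 * bandwidth ** 2))   # kernel weights
        if k.sum() == 0:
            break
        new_c = (pos * k[:, None]).sum(axis=0) / k.sum()
        if np.linalg.norm(new_c - c) < tol:          # converged to a mode
            return new_c
        c = new_c
    return c
```

The returned mode is the proposed target position; it is inserted into the graph as a hypothesis node rather than accepted outright.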
4. Experimental results
In the experiments we used a sliding window of size $W$ to implement the proposed tracking algorithm in real time. The graph contains the observations between times $t$ and $t + W$. When new observations are added to the graph, observations older than $t$ are removed from it. New candidate tracks can also originate from false alarm nodes (nodes not assigned to any track). In the following experiments, we use a window size of $W = 45$.
4.1. Data set
We conducted experiments on both a public data set and our Honeywell data set. In the experimental tests conducted, the inputs to the tracking algorithm were the foreground regions and the original image sequence.
Fig. 3. Merge and split hypothesis added to the graph.
4.1.1. Public data set and results
We evaluate our algorithm on videos selected from a public data set, CLEAR [23]. This data set is captured by a stationary camera in an urban traffic environment; the targets of interest include vehicles and pedestrians. In the considered data set, a large number of partial or complete occlusions between targets (pedestrians and vehicles) are observed.

The left two sub-figures in Figs. 4 and 5 show sample frames from the CLEAR data set with the tracking results of the proposed method. The right two sub-figures in Fig. 4 show the detected moving blobs. For example, the two cars in the top-right figure of Fig. 4 belong to one moving blob; the proposed tracking method can split this blob into two cars.
Fig. 4. Tracking in the CLEAR data set.

4.1.2. Honeywell data set and results

We also tested our tracking algorithm on various real surveillance spots recorded by our lab. The data considered were collected inside the lab room, around the parking lots and at other facilities at Honeywell. The difficulties in these data sets are the cluttered background and frequent occlusions: partial occlusions and noisy foreground segmentation frequently cause split and merged foreground regions. Fig. 6 shows the data sets with tracking results overlaid and the detected foreground. Due to the noisy foreground segmentation, the input foreground for one target may consist of multiple fragmented regions, as shown in Fig. 6a. When two or more moving objects are very close to each other, we may obtain a single moving blob for all of them, as shown in Fig. 6b. Missed detections were frequent in the considered data set: objects merging into the background due to stop-and-go motion are shown in Fig. 6c. Moreover, due to the noisy background, frequent false positives are present in the foreground detection; tracking results in the presence of false alarms are shown in Fig. 6d.

For the data sets where a ground plane is available, using the homography between the ground plane and the image plane, the targets can be tracked on the 3D ground plane, as shown in Fig. 7.

4.2. Performance metrics for quantitative comparison

In this section, we compare the proposed tracking algorithm with the existing JPDA [12], i.e. without the added merge, split and mean shift hypotheses.

Before presenting the comparison results, we first describe the performance metrics. For each detected target, we assign a unique track ID and output a minimum bounding rectangle (MBR) for every frame. In order to provide a quantitative evaluation of the proposed algorithm, we apply the quantitative metric ATA (Average Tracking Accuracy) proposed in [20]. Let $G_i$ denote the $i$th ground truth object and $G_i(t)$ denote the $i$th ground truth object in the $t$th frame; $T_i$ denotes the tracked object for $G_i$. $N_G$ and $N_T$ denote the numbers of ground truth objects and tracked objects, respectively. Let $N_f$ be the total number of ground truth frames in the sequence. Say that for a sequence we have $N_G$ ground-truthed objects and $N_T$ detected objects. Within each frame, the intersection over union of the two MBRs from ground truth and output is used to measure the normalized spatial accuracy of the reported regions. The one-to-one matching between ground truth objects and tracked objects is obtained by computing the measure over all ground truth and detected object combinations and maximizing the overall score for the sequence. Given an optimal mapping, the number of matched ground truth and tracked objects is $N_{mapped}$, and therefore $N_{mapped} \leq N_G$ and $N_{mapped} \leq N_T$.

The ATA metric [20] is defined as follows:

$$ATA = \frac{ \sum_{i=1}^{N_{mapped}} \left( \sum_{t=1}^{N_f} \frac{ |G_i^{(t)} \cap T_i^{(t)}| }{ |G_i^{(t)} \cup T_i^{(t)}| } \Big/ N_{(G_i \cup T_i \neq \emptyset)} \right) }{ (N_G + N_T)/2 } \quad (16)$$

where $N_{(G_i \cup T_i \neq \emptyset)}$ indicates the total number of frames in which either a ground truth object or a detected object or both are present; thus $N_f \leq N_{(G_i \cup T_i \neq \emptyset)}$ always holds. This quantitative metric penalizes both False Negatives (undetected ground truth area) and False Positives (detected boxes that do not overlay any ground truth area) [20]. Note that, because the mapping is defined on the trajectories of ground truth and tracked objects, the metric fully accounts for the temporal consistency of tracking accuracy. For example, suppose we have two ground truth trajectories with the same start and end frames, and the tracked objects swap IDs exactly in the middle. Although, apart from the swapped IDs, all detected MBRs of the two trajectories are exactly the same as the ground truth MBRs, the ATA score according to Eq. (16) will be $(\frac{1}{2} + \frac{1}{2}) / \frac{2+2}{2} = 1/2$. The metric is also well suited to penalizing fragmented trajectories: suppose we have only one ground truth trajectory, and in the obtained tracking result the track is broken into two tracks of equal size. Since only one track will be mapped to the ground truth track, the final ATA score will be $\frac{1}{2} / \frac{1+2}{2} = 1/3$.
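The ATA computation of Eq. (16) can be sketched as follows; the dict-of-boxes track representation is an assumption of this illustration. The fragmented-trajectory example above (score 1/3) can be reproduced with it.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) MBRs."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def ata_score(gt_tracks, out_tracks, mapping):
    """ATA of Eq. (16) for a given one-to-one mapping {gt_id: out_id}.
    Tracks are dicts frame -> (x1, y1, x2, y2). The per-pair score is the
    summed IoU over frames where both boxes exist, normalized by the
    number of frames where either exists; ATA divides the sum over mapped
    pairs by the average of the numbers of GT and output tracks.
    """
    total = 0.0
    for g_id, t_id in mapping.items():
        g, t = gt_tracks[g_id], out_tracks[t_id]
        frames = set(g) | set(t)               # frames where either exists
        s = sum(iou(g[f], t[f]) for f in frames if f in g and f in t)
        total += s / len(frames)
    return total / ((len(gt_tracks) + len(out_tracks)) / 2.0)
```

Note that finding the optimal mapping itself (maximizing this score over all pairings) is a separate assignment step, omitted here.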
The ATA scores of the comparison on the two data sets are shown in Fig. 5. From this comparison table, it is clear that the proposed tracking method achieves a higher ATA score than the existing JPDA method.

Sequence   | Frames              | Proposed method | With split/merge only | With mean shift only | No augmented hypothesis | Method 2
CLEAR      | 1125 (2 sequences)  | 0.293           | 0.252                 | 0.126                | 0.109                   | 0.097
Honeywell  | 1322 (6 sequences)  | 0.327           | 0.269                 | 0.233                | 0.212                   | 0.237

Fig. 5. Comparison with other methods. Method 1: the proposed method; method 2: JPDAF in [12].

4.3. Environment and computing complexity

Without any code optimization, the time performance of our online tracking algorithm with a 45-frame sliding window is close to real time (15-20 fps for 3-5 targets) on a PC platform with a P4 2.8 GHz CPU and 1 GB RAM.

The computing cost of our algorithm is essentially linear in the total number of targets. The quality of motion detection affects the performance of the tracking algorithm: if the foreground regions are very noisy, many hypotheses may be added to the graph, which degrades the time performance.

Fig. 6. Tracking results of the Honeywell dataset.

Fig. 7. Tracking targets using ground plane information. Left, estimated trajectories are plotted in the 2D image. Right, the positions of the moving people in the scene are plotted on the ground plane.

5. Conclusion and future work

In this paper, we present a method for tracking multiple targets with incomplete observations in video surveillance. If we partition application scenarios into easy, medium and difficult cases, most existing tracking algorithms can handle the easy cases relatively well. However, for the medium and difficult cases, multiple targets can be merged into one blob, especially during partial occlusion, and one target can be split into several blobs due to noisy foreground detection. Missed detections are also frequent in the presence of stop-and-go motion, or when we are unable to distinguish foreground from background regions without adjusting the detection parameters to each sequence considered. In this paper, we use a graph representation of the observations and formulate the multiple target tracking problem as a maximum a posteriori estimate. We expand the set of hypotheses by considering merge, split and mean shift operations. The proposed tracking
method can deal with noisy foreground segmentation due to occlusion, foreground fragments and missing detections. Experimental
results show good performance on the test data sets. Also, we have
a quantitative comparison between the proposed tracking method
with the existing tracking method, and show a better ATA score for
the proposed tracking method.
For the future work, first we will introduce other appearance
models such as in [11]. Moreover we will incorporate the model
information into our tracking framework.
Appendix A. K shortest disjoint path algorithm
We aim to find the K shortest paths [1] (starting from a virtual node s and ending at a virtual node t) in the observation graph G. The K shortest paths are required to be node-disjoint. Two or more paths that share no common nodes (or edges) are called node-disjoint (or edge-disjoint). We first introduce the solution to the K shortest edge-disjoint paths problem, which can then be used to solve the K shortest node-disjoint paths problem. Details can be found in [21].
A.1. K shortest edge-disjoint path
By augmenting each edge in the original graph G with unit capacity, the problem can be transformed into finding a flow of value K from node s to node t whose total cost is minimum. Note that we also need to recover each disjoint path in this flow. The algorithm is as follows.
Algorithm 2. K shortest edge-disjoint paths

Input: initial graph G
Output: {p_i, i = 1, ..., K}

Augment the graph G to T with unit capacity on all edges
Find the maximum flow F in T
if F < K, i.e. T does not contain a flow of value K then
    return: the graph G contains no K disjoint paths
end if
for i = F downto F - K + 1 do
    Compute a flow of value i and create T' by removing zero-flow edges from T
    Find the shortest path p_i in T'
    Remove the edges of p_i from T
end for

Since the maximum flow problem can be solved in polynomial time by the Ford-Fulkerson algorithm in O(|E| · maxflow), the total complexity of the algorithm is O((|E| · maxflow + |V|^2) · K).
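The computation above can also be sketched as a unit-capacity successive-shortest-path (min-cost flow) routine: each augmentation finds a shortest path in the residual graph, and the final flow is decomposed into edge-disjoint paths. The following Python is an illustrative sketch under stated assumptions (non-negative edge costs, small graphs; Bellman-Ford is used because residual edges may have negative cost); the function name and toy graph are ours, not the paper's implementation:

```python
def k_edge_disjoint_paths(edges, s, t, k):
    """edges: list of (u, v, cost) with implicit unit capacity.
    Returns up to k minimum-total-cost edge-disjoint paths from s to t."""
    adj = {}      # residual graph: adj[u] = [[v, cost, capacity, rev_index], ...]
    forward = []  # (u, index into adj[u]) for each original edge

    def ensure(node):
        adj.setdefault(node, [])

    for u, v, c in edges:
        ensure(u); ensure(v)
        forward.append((u, len(adj[u])))
        adj[u].append([v, c, 1, len(adj[v])])       # forward edge, capacity 1
        adj[v].append([u, -c, 0, len(adj[u]) - 1])  # residual (reverse) edge
    ensure(s); ensure(t)

    found = 0
    for _ in range(k):
        # Bellman-Ford shortest path on the residual graph
        dist = {n: float("inf") for n in adj}
        prev = {}
        dist[s] = 0
        for _ in range(len(adj) - 1):
            for u in adj:
                if dist[u] == float("inf"):
                    continue
                for i, (v, cost, cap, _) in enumerate(adj[u]):
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, i)
        if dist[t] == float("inf"):
            break  # fewer than k edge-disjoint paths exist
        # push one unit of flow along the shortest augmenting path
        v = t
        while v != s:
            u, i = prev[v]
            adj[u][i][2] -= 1
            adj[v][adj[u][i][3]][2] += 1
            v = u
        found += 1

    # flow decomposition: an original edge carries flow iff its residual
    # capacity dropped to zero; walk from s to t along such edges
    nxt = {}
    for u, i in forward:
        if adj[u][i][2] == 0:
            nxt.setdefault(u, []).append(adj[u][i][0])
    paths = []
    for _ in range(found):
        path, node = [s], s
        while node != t:
            node = nxt[node].pop()
            path.append(node)
        paths.append(path)
    return paths
```

Pushing flow through reverse residual edges is what allows a later augmentation to reroute an earlier path, which a greedy "find path, delete its edges" scheme cannot do.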
A.2. K shortest node-disjoint path
This problem can easily be transformed into the edge-disjoint path problem. For each node v that is not allowed to be shared by two or more paths, split v into two nodes, v' and v''. The edge (v', v'') has unit capacity and zero cost. All incoming edges (x, v) are changed to (x, v') and all outgoing edges (v, x) are changed to (v'', x). We can then apply the K shortest edge-disjoint paths algorithm on the new graph to obtain the K shortest node-disjoint paths.
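The node-splitting reduction described above is mechanical and can be sketched in a few lines. This is an illustrative helper (the "_in"/"_out" naming convention is our own choice), which leaves the virtual source and sink unsplit:

```python
def split_nodes(edges, s, t):
    """edges: list of (u, v, cost). Returns the transformed edge list in
    which every internal node v becomes v_in -> v_out, so that
    node-disjointness reduces to edge-disjointness."""
    new_edges = []
    nodes = {u for u, v, _ in edges} | {v for _, v, _ in edges}
    for v in nodes:
        if v not in (s, t):
            # internal edge: unit capacity, zero cost
            new_edges.append((f"{v}_in", f"{v}_out", 0))
    for u, v, c in edges:
        u_out = u if u in (s, t) else f"{u}_out"
        v_in = v if v in (s, t) else f"{v}_in"
        new_edges.append((u_out, v_in, c))
    return new_edges
```

Any path through the transformed graph uses the single internal edge of each node it visits, so no two edge-disjoint paths can share an internal node.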
References
[1] J.K. Wolf, A.M. Viterbi, G. Dixon, Finding the best set of K paths through a trellis with application to multitarget tracking, IEEE Trans. Aerospace Electronic Systems 25 (2) (1989) 287–296.
[2] B. LaScala, G.W. Pulford, Viterbi data association tracking for over-the-horizon
radar, Proc. Int. Radar Sym. 3 (1998) 155–164.
[3] K.G. Murty, An algorithm for ranking all the assignments in order of increasing
cost, Operations Res. 16 (1968) 682–687.
[4] K. Buckley, A. Vaddiraju, R. Perry, A new pruning/merging algorithm for MHT
multitarget tracking, Radar—2000 (2000).
[5] T. Quach, M. Farooq, Maximum likelihood track formation with the Viterbi algorithm, in: Proc. 33rd IEEE Conf. on Decision and Control, 1994, pp. 271–276.
[6] D. Castanon, Efficient algorithms for finding the K best paths through a trellis,
IEEE Trans. Aerospace Elect. Sys. 26 (2) (1990) 405–410.
[7] T.E. Fortmann, Y. Bar-Shalom, M. Scheffe, Sonar tracking of multiple targets using joint probabilistic data association, IEEE J. Oceanic Eng. OE-8 (3) (1983) 173–184.
[8] Y. Bar-Shalom, T.E. Fortmann, Tracking and Data Association, Mathematics in Science and Engineering, Series 179, Academic Press, San Diego, CA, 1988.
[9] D.B. Reid, An algorithm for tracking multiple targets, IEEE Trans. Automatic
Control 24 (6) (1979) 843–854.
[10] I.J. Cox, S.L. Hingorani, An efficient implementation of Reid’s multiple
hypothesis tracking algorithm and its evaluation for the purpose of visual
tracking, IEEE Trans. Pattern Anal. Machine Intell. 18 (2) (1996) 138–150.
[11] J. Kang, I. Cohen, G.G. Medioni, Object reacquisition using invariant appearance
model, in: International Conference on Pattern Recognition, Cambridge, United
Kingdom, 2004, pp. 759–762.
[12] J. Kang, I. Cohen, G. Medioni, Continuous tracking within and across camera streams, in: IEEE Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin, June 2003.
[13] Yizong Cheng, Mean shift mode seeking and clustering, IEEE Trans. Pattern
Anal. Machine Intell. 17 (8) (1995) 790–799.
[14] S. Blackman, Multiple Target Tracking with Radar Applications, Artech House,
Berlin, 1986.
[15] Y. Bar-Shalom, E. Tse, Tracking in a cluttered environment with probabilistic data association, Automatica (1975) 451–460.
[16] A. Genovesio, J. Olivo-Marin, Split and merge data association filter for dense
multi-target tracking, in: ICPR, 2004, pp. IV: 677–680.
[17] A. Senior, Tracking people with probabilistic appearance models, PETS02
(2002) 48–55.
[18] Ying Wu, Ting Yu, Gang Hua, Tracking appearances with occlusions, in: Proc.
IEEE Conf. on Computer Vision and Pattern Recognition, Madison, WI, vol. I,
2003, pp.789–795.
[19] C. Stauffer, W.E.L. Grimson, Adaptive background mixture models for real-time tracking, in: IEEE International Conference on Computer Vision and Pattern Recognition, 1999, vol. 2, pp. 246–252.
[20] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, M. Boonstra, V. Korzhova, Performance Evaluation Protocol for Text, Face, Hands, Person and Vehicle Detection & Tracking in Video Analysis and Content Extraction (VACE-II), Technical Report, University of South Florida, 2005.
[21] J.W. Suurballe, Disjoint paths in a network, Networks 4 (1974) 125–145.
[22] I. Cohen, G. Medioni, Detecting and tracking objects in video surveillance, in: Proc. IEEE Computer Vision and Pattern Recognition, Fort Collins, June 1999.
[23] http://www.clear-evaluation.org/.
[24] Alper Yilmaz, Omar Javed, Mubarak Shah, Object tracking: a survey, ACM
Comput. Surv. 38 (4) (2006).