Multiple Object Tracking: A Literature Review

Wenhan Luo1, Junliang Xing2, Xiaoqin Zhang3, Xiaowei Zhao1, Tae-Kyun Kim1
1 Imperial College London, London, United Kingdom
2 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
3 Institute of Intelligent System and Decision, Wenzhou University, Zhejiang, China

Abstract Multiple Object Tracking (MOT) is an important computer vision task which has gained increasing attention due to its academic and commercial potential. Although different kinds of approaches have been proposed to tackle this problem, there still exist many unsolved issues. For example, factors such as abrupt appearance changes and severe object occlusions pose great challenges for MOT. In order to help readers understand this topic and start working on it, we contribute a systematic and comprehensive review. In the review, we inspect the recent advances in various aspects of this topic and propose some interesting directions for future research.

To the best of our knowledge, there has not been any review on this topic in the community. We endeavor to provide a thorough review of the development of this problem over the last decades. The main contributions of this review are fourfold: 1) Key aspects of a multiple object tracking system, including how to formulate MOT generally, how to categorize MOT algorithms, what needs to be considered when developing a MOT system and how to evaluate a MOT system, are discussed from the perspective of understanding a topic. We believe that this could not only provide researchers, especially newcomers to the topic of MOT, a general understanding of the state of the art, but also help them to comprehend the essential components of a MOT system and the inter-component connections. 2) Instead of enumerating individual works, we discuss existing work according to the various aspects involved in a MOT system. In each aspect, methods are divided into different groups and each group is discussed in detail for its principles, advances and drawbacks. 3) We examine experiments of existing publications and give tables which list results on the popular data sets to provide convenient comparison. We also provide some interesting discoveries by analyzing these tables. 4) We offer some potential directions and respective discussions about MOT, which are still open issues and need more research effort. This would be helpful for researchers to identify further interesting problems.

Keywords Multiple Object Tracking · Observation Model · Dynamic Model · Object Detection · Association · Tracklet · Survey

1 Introduction

Multiple Object Tracking (MOT), or Multiple Target Tracking (MTT), plays an important role in computer vision. The task of MOT is largely partitioned into locating multiple objects, maintaining their identities and yielding their individual trajectories given an input video. Objects to track can be, for example, pedestrians on the street (Yang et al. 2011; Pellegrini et al. 2009), vehicles (Koller et al. 1994; Betke et al. 2000), sport players on the court (Lu et al. 2013; Xing et al. 2011; Nillius et al. 2006), or a flock of animals (birds (Luo et al. 2014), bats (Betke et al. 2007), ants (Khan et al. 2004), fishes (Spampinato et al. 2008, 2012; Fontaine et al. 2007), cells (Meijering et al. 2009; Li et al. 2008), etc.). The multiple "objects" could also be different parts of a single object. In this review, we mainly focus on the research about pedestrians. There are three underlying reasons for this. Firstly, compared to other conventional objects in computer vision, pedestrians are typical non-rigid objects, which makes them an ideal example for studying the
MOT problem. Secondly, video of pedestrians raises a huge number of practical applications, which further results in great commercial potential. Thirdly, according to incomplete statistics collected by ourselves, at least 70% of current MOT research efforts are devoted to pedestrians (cf. Table 9 to Table 26).

As a mid-level task in computer vision, multiple object tracking grounds high-level tasks such as action recognition, behavior analysis, etc. It has numerous applications, some of which are presented in the following.

– Visual Surveillance. The massive amount of videos, especially surveillance videos, requires automatic analysis to detect abnormal behaviors, which is based on analyses of objects' actions, trajectories, etc. To obtain such information, we need to locate targets and track them, which is exactly the objective of multiple object tracking.
– Human Computer Interface (HCI). Visual information, such as expression and gesture, can be employed to achieve advanced HCI. Extraction of visual information requires visual tracking as the basis. When multiple objects appear in the scene, we need to consider interactions among them. In this case, MOT plays a crucial role in making the HCI more natural and intelligent.
– Virtual Augmented Reality (VAR). MOT also has applications for this problem. For instance, MOT can supply users with a better experience in video conferences.
– Medical Image Processing. Some tasks of medical image processing require laborious manual labeling, for instance, labeling multiple cells in images. In this case, MOT can help to save a large amount of labeling cost.

The various applications above have sparked enormous interest in this topic. However, compared with Single Object Tracking (SOT), which primarily focuses on designing sophisticated appearance models or motion models to deal with challenging factors such as scale changes, out-of-plane rotation and illumination variations, multiple object tracking additionally requires maintaining the identities among multiple objects. Besides the common challenges in both SOT and MOT, the further key issues making MOT challenging include (but are not limited to): 1) frequent occlusions, 2) initialization and termination of tracks, 3) small size of objects (Betke et al. 2007), 4) similar appearance among objects, and 5) interaction among multiple objects. In order to deal with the MOT problem, a wide range of solutions have been proposed in recent years. These solutions focus on different aspects of a MOT system, making it difficult for researchers, especially newcomers to MOT, to gain a comprehensive understanding of this problem. Thus we here provide a review to discuss the various aspects of multiple object tracking.

1.1 Differences from Other Related Reviews

To the best of our knowledge, there has not been any comprehensive literature review on the topic of multiple object tracking. However, there have been some other reviews related to multiple object tracking, which are listed in Table 1. We group these surveys into three sets and highlight the differences from ours as follows. The first set (Zhan et al. 2008; Hu et al. 2004; Kim et al. 2010; Candamo et al. 2010) also involves crowds, i.e., multiple objects, but the focus is different from ours. We intend to review the primary aspects of developing a multi-object tracking system. In comparison, tracking is only discussed as an individual part in those reviews (Zhan et al. 2008; Hu et al. 2004; Kim et al. 2010; Candamo et al. 2010; Wang 2013). More specifically, Zhan et al. (2008) focus on crowd modeling, thus object tracking is only a step to obtain features for crowd modeling. Hu et al. (2004) and Kim et al. (2010) discuss papers about building a surveillance system for high-level vision tasks, such as behavior understanding, so tracking is an intermediate step. Candamo et al. (2010) review publications about behavior recognition in a special scenario, i.e., transit scenes. In that review, object tracking is discussed as a core technology along with motion detection and object classification. Multiple object tracking is also discussed as one module for video surveillance under multiple cameras (Wang 2013). The second set (Forsyth et al. 2006; Cannons 1991; Yilmaz et al. 2006; Li et al. 2013) is dedicated to general visual tracking techniques (Forsyth et al. 2006; Cannons 1991; Yilmaz et al. 2006) or some special issues such as appearance models in visual tracking (Li et al. 2013). Their scope is wider than our review, while our review is more comprehensive and detailed in multiple object tracking. The third set (Wu et al. 2013a; Leal-Taixé et al. 2015) is benchmark work about general visual tracking (Wu et al. 2013a) and specific multiple object tracking (Leal-Taixé et al. 2015). Their attention is paid to experimental study rather than literature review.

1.2 Contributions

The main contributions of this review are fourfold:

– Key aspects in a multiple object tracking system, including how to formulate MOT generally, how to categorize MOT algorithms, what needs to be considered when developing a MOT system and how
to evaluate a MOT system, are discussed from the perspective of understanding a topic. We believe that this could not only provide researchers, especially newcomers to the topic of MOT, a general understanding of the state of the art, but also help them to comprehend the essential components of a MOT system and the inter-component connections.
– Instead of enumerating individual works, we discuss existing work according to the various aspects involved in a MOT system. In each aspect, methods are divided into different groups and each group is discussed in detail for its principles, advances and drawbacks.
– We examine experiments of existing publications and give tables which list results on the popular data sets to provide convenient comparison. We also provide some interesting discoveries by analyzing these tables.
– We offer some potential directions and respective discussions about MOT, which are still open issues and need more research effort. This would be helpful for researchers to identify further interesting problems.

It is worth noting that this manuscript is dedicated to reviewing the literature on recent advances in multiple object tracking. As mentioned above, we also present experimental results on publicly available data sets, excerpted from existing publications, to provide a view of how the state-of-the-art MOT methods perform. For a more comprehensive survey of experimental studies in multiple object tracking, readers may refer to the work by Leal-Taixé et al. (2015).

1.3 Organization of This Review

The goal of this review is to provide an overview of the major aspects which readers need to consider when developing a system for multiple object tracking. These aspects include what the current state of research about MOT is, all the detailed issues requiring consideration in building a system, and how to evaluate a MOT system. Accordingly, the organization of this review is shown in Figure 1. Section 2 introduces some preliminary knowledge of MOT, including some important terminologies in this problem (Section 2.1) and the denotations used in this article (Section 2.2). Section 3 describes the MOT problem, including its formulation (Section 3.1) and categorization (Section 3.2). Section 4 is devoted to the most important issues involved in multi-object tracking, specifically the appearance model (Section 4.1), motion model (Section 4.2), interaction model (Section 4.3), exclusion model (Section 4.4), occlusion handling model (Section 4.5) and estimation methods, including both probabilistic inference (Section 4.6) and deterministic optimization (Section 4.7). Furthermore, issues concerning evaluation of a MOT system, including metrics (Section 5.1), public data sets (Section 5.2), public codes (Section 5.3) and benchmark results (Section 5.4), are discussed in Section 5. Finally, conclusions are drawn and some interesting directions are provided in Section 6.

[Fig. 1: organization of this review — Formulation (Sec. 3.1), Categorization (Sec. 3.2), Appearance (Sec. 4.1), Motion (Sec. 4.2), Interaction (Sec. 4.3), Exclusion (Sec. 4.4), Occlusion (Sec. 4.5), Probabilistic Inference (Sec. 4.6), Deterministic Optimization (Sec. 4.7), Metric (Sec. 5.1), Data Set (Sec. 5.2), Public Algorithm (Sec. 5.3), Benchmark Result (Sec. 5.4)]

2 Preliminaries

Before beginning the review, we provide some preliminary knowledge for going deeper into the MOT problem. We first introduce some terminologies that are frequently used in the MOT problem. To make the review easy to follow, we then list the denotations that are globally applied in the manuscript.

2.1 Terminologies

The terminologies listed in the following play an important role in MOT research.

Object. In computer vision, an object is considered as a continuous closed area in an image which is distinct from its surroundings. The interest of multiple object
tracking often comes from one same type of objects, so an object should additionally be identified with an identity.

Detection. Detection is a computer vision task which localizes objects in images. It is usually conducted by training an object detector from a training dataset. In most situations, detection does not involve temporal information.

Tracking. Tracking is to localize an identical object in continuous frames. Thus tracking always involves a video or image sequence with temporal information. In MOT, tracking means simultaneously localizing multiple objects and maintaining their identities.

Detection response. Detection responses are also known as detection observations or detection hypotheses, which are the outputs of an object detector trained for a specific kind of objects, e.g., human, vehicle, face or animal. They are configurations of objects, including positions, sizes, etc., in an image sequence.

Trajectory. A trajectory is the output of a MOT system. One trajectory corresponds to one target, thus a trajectory is unique. At the same time, one trajectory is composed of multiple object responses of an identical target in an image sequence, each representing the location, size and some other information in one frame.

Tracklet. A tracklet is an intermediate level of output between detection responses and trajectories. It is composed of several detection responses which are believed to be from an identical target. In fact, a detection response can be viewed as a tracklet composed of only one detection response. A tracklet is usually obtained by linking confident detection responses, thus it is shorter than a trajectory regarding the time span. In some approaches, the final trajectories are obtained by progressively linking detection responses into longer and longer tracklets and eventually forming trajectories. Figure 2 shows these three concepts.

Fig. 2: Detection responses (left), tracklets (center), and trajectories (right) are shown in 6 continuous frames. Different colors encode different targets. Best viewed in color

Data association. Data association is a typical solution to multiple object tracking if we cast it as a paradigm of matching detection responses across frames based on object detection. The technique of data association figures out inter-frame correspondences between detection hypotheses.
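To make these terms concrete, the following minimal sketch (our own illustration, not code from any surveyed work; all class and field names are hypothetical) models detection responses, tracklets and trajectories as plain Python data structures.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Detection:
    """One detection response: the configuration of an object in a single frame."""
    frame: int
    x: float
    y: float
    w: float
    h: float
    score: float = 1.0

@dataclass
class Tracklet:
    """A short run of detections believed to come from one identical target."""
    target_id: int
    detections: List[Detection] = field(default_factory=list)

    def span(self):
        # Time span covered by the tracklet (shorter than a full trajectory).
        frames = [d.frame for d in self.detections]
        return (min(frames), max(frames)) if frames else (None, None)

# A trajectory has the same structure but covers the target's whole presence
# in the sequence, so here it is simply treated as an "extended" tracklet.
Trajectory = Tracklet
```

Data association then amounts to deciding which detections (or tracklets) should be merged into the same trajectory across frames.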
2.2 Denotations

Throughout this manuscript, we represent a scalar by a lowercase letter (e.g., x) and a vector by a bold lowercase letter (e.g., x). We use a bold capital letter (e.g., X) to denote a matrix or a set of vectors. Capital letters (e.g., X) are adopted to denote specific functions or variables. Table 2 lists the symbols utilized throughout this review. Besides the symbols in the table, there might be some symbols for a specific reference. As these symbols are not commonly employed, we did not put them in the table; we will introduce their meanings in context.

3 MOT Problem

The objective of MOT is to produce trajectories of objects as they move around the image plane. To elaborate this problem, in this section we first endeavor to give a general mathematical formulation of MOT, then we discuss its categorization from different aspects.
3.1 Problem Formulation

Multiple object tracking can generally be formulated as a multi-variable estimation problem. Given an image sequence {I_1, I_2, ..., I_t, ...} as input, we employ s_t^i to denote the state of the i-th object in the t-th frame. We use S_t = (s_t^1, s_t^2, ..., s_t^{M_t}) to denote the states of all the M_t objects in the t-th frame, s_{1:t}^i = {s_1^i, s_2^i, ..., s_t^i} to denote the sequential states of the i-th object from the first frame to the t-th frame, and S_{1:t} = {S_1, S_2, ..., S_t} to denote all the sequential states of all the objects from the first frame to the t-th frame. Note that the object number may vary from frame to frame.

To estimate the states of objects, we need to collect some observations from the image sequence. Correspondingly, we utilize o_t^i to denote the collected observations for the i-th object in the t-th frame, O_t = (o_t^1, o_t^2, ..., o_t^{M_t}) to denote the collected observations for all the M_t objects in the t-th frame, o_{1:t}^i = {o_1^i, o_2^i, ..., o_t^i} to denote the sequential observations of the i-th object collected from the first frame to the t-th frame, and O_{1:t} = {O_1, O_2, ..., O_t} to denote all the collected sequential observations of all the objects from the first frame to the t-th frame.

The objective of multiple object tracking is to find the "optimal" sequential states of all the objects, which can be generally modeled by performing MAP (maximal a posteriori) estimation from the conditional distribution of the sequential states of all the objects given all the observations:

Ŝ_{1:t} = arg max_{S_{1:t}} P(S_{1:t} | O_{1:t}).   (1)

The estimation can be performed using probabilistic inference algorithms based on a two-step iterative procedure (Liu et al. 2012; Breitenstein et al. 2009; Yang et al. 2009b; Mitzel and Leibe 2011; Rodriguez et al. 2011; Kratz and Nishino 2010; Reid 1979):

Predict: P(S_t | O_{1:t−1}) = ∫ P(S_t | S_{t−1}) P(S_{t−1} | O_{1:t−1}) dS_{t−1}
Update: P(S_t | O_{1:t}) ∝ P(O_t | S_t) P(S_t | O_{1:t−1})

In the formulas above, P(S_t | S_{t−1}) and P(O_t | S_t) are the Dynamic Model and the Observation Model, respectively. These two models play a very important role in a tracking algorithm. Since the distributions of these two models are usually unknown, sampling methods like the Particle Filter (Jin and Mokhtarian 2007; Yang et al. 2005; Hess and Fern 2009; Han et al. 2007; Hu et al. 2012; Liu et al. 2012; Breitenstein et al. 2009; Yang et al. 2009b), MCMC (Khan et al. 2004, 2005, 2006), RJMCMC (Choi et al. 2013), etc. are employed to perform the estimation.
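For illustration, the sketch below implements one predict–update iteration of a bootstrap particle filter for a single target. The constant-velocity dynamic model, the isotropic Gaussian observation likelihood and all parameter values are our own assumptions for the sketch, not the specifics of any cited tracker.

```python
import numpy as np

def particle_filter_step(particles, weights, observation, motion_std=1.0, obs_std=2.0):
    """One predict-update iteration for a single target.

    particles:   (N, 4) array of states [x, y, vx, vy]
    weights:     (N,) normalized importance weights
    observation: (2,) observed position [x, y]
    """
    # Predict: propagate each particle with a constant-velocity dynamic model
    # P(S_t | S_{t-1}) plus Gaussian process noise.
    particles = particles.copy()
    particles[:, :2] += particles[:, 2:]                      # x += vx, y += vy
    particles += np.random.normal(0.0, motion_std, particles.shape)

    # Update: re-weight particles by the observation likelihood P(O_t | S_t).
    sq_dist = np.sum((particles[:, :2] - observation) ** 2, axis=1)
    likelihood = np.exp(-0.5 * sq_dist / obs_std ** 2)
    weights = weights * likelihood
    weights /= weights.sum() + 1e-12

    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(weights):
        idx = np.random.choice(len(weights), size=len(weights), p=weights)
        particles, weights = particles[idx], np.full(len(weights), 1.0 / len(weights))
    return particles, weights
```

In a multi-object setting, one such filter (or a joint state over all targets) is run per target, with data association deciding which observation updates which filter.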
The estimation problem can also be coped with by deterministic optimization approaches, e.g., directly maximizing the likelihood function P(O_{1:t} | S_{1:t}) as a delegate of P(S_{1:t} | O_{1:t}):

Ŝ_{1:t} = arg max_{S_{1:t}} P(S_{1:t} | O_{1:t}) = arg max_{S_{1:t}} P(O_{1:t} | S_{1:t}).   (2)

3.2 Categorization

As objects are represented by region shapes (bounding box or ellipse (Kuo and Nevatia 2011)) in most works, it is not necessary to organize the discussion by object representations. Alternatively, we provide discussion according to the following aspects.
3.2.1 Initialization Method

Note that, when the number of objects is one, DFT degrades to the classical visual tracking problem.

DBT is more popular for the fact that new objects are discovered and disappearing objects are terminated automatically. DFT requires manual initialization of each object to be tracked, thus it cannot deal with the case that new objects appear. However, it is model-free, i.e., free of pre-trained object detectors, so it can deal with sequences of any type of objects. However, the setting of a fixed number of objects limits its applications in practical systems. Table 3 lists the major differences between DBT and DFT.

Table 3: Comparison between DBT and DFT. Part of this table is from the work by Yang and Nevatia (2012c)

3.2.2 Processing Mode

According to the way of processing data, MOT could be categorized into online tracking and offline tracking. The difference is whether observations from future frames are utilized when handling the current frame. Online tracking utilizes observations up to the current time instant to conduct the estimation, while offline tracking employs observations both from the past and from the future.

Online tracking. In online tracking, the image sequence is handled in a step-wise way, thus online tracking is also named sequential tracking. As shown in Figure 4 (left), we present a toy example in which there are four objects (different circles) in a video sequence with IDs a, b, c and d. The arrow attached to each object indicates its movement direction. The green arrows represent observations in the past. The results are represented by each object's location and its ID. Based on the up-to-time observations, trajectories are output on the fly.

Offline tracking. Offline tracking (Song et al. 2010; Qin and Shelton 2012; Yang and Nevatia 2012a,b; Brendel et al. 2011; Yang et al. 2011; Kuo et al. 2010; Henriques et al. 2011; Sugimura et al. 2009; Choi and Savarese 2010) utilizes a batch way to process the data, therefore it is also called batch tracking. Figure 4 (right) illustrates how batch tracking processes observations. Observations from all the frames are required to be obtained in advance and are investigated together to estimate the final output. Note that, due to limits of computational ability, sometimes it is not possible to handle all the frames at one time. Alternatively, one solution is to divide the whole video into a set of segments or clips, handle these clips respectively, and fuse the results hierarchically.

[Fig. 4: toy example of online tracking (left) versus offline tracking (right): four objects a, b, c and d are tracked over frames 1 to N, and the result lists every object's per-frame position]

In general, online tracking is appropriate in the case that the video stream is obtained sequentially. Offline tracking typically deals with the data globally when all the frames have been obtained. Theoretically, offline tracking could obtain the globally optimal solution, while it is not as practical as online tracking. To be clearer, we compare them in Table 4.

3.2.3 Mathematical Methodology

MOT could be classified into probabilistic tracking and deterministic tracking according to the adopted mathematical methodology. There are two differences between them. First, the approaches to estimating the states of objects are different. In probabilistic tracking, the estimation is based on probabilistic inference, while in deterministic tracking the estimation is based on deterministic optimization. Second, the outputs are different. The output of probabilistic tracking may differ across running trials, while it is constant in deterministic tracking.

3.2.4 Discussion

The insights behind "online vs offline" and "DBT vs DFT" are related. The difference between DBT and DFT is whether a detection model is adopted (DBT) or not (DFT). The key to differentiating online and offline tracking is the way they process observations. Readers may question whether DFT is identical to online tracking because DFT always processes observations sequentially. That is true because DFT is free of (type-specific) object detection; it cannot attain future observations, thus it can only follow the sequential way. Another vagueness may arise between DBT and offline tracking, as in DBT tracklets or detection responses are usually associated in a batch way. Note that there is also sequential DBT which conducts association between previously obtained trajectories and new detection responses (Luo et al. 2014; Luo and Kim 2013; Xing et al. 2009).

At the same time, "online vs offline" and "probabilistic vs deterministic" are also related. In practice, online tracking usually adopts probabilistic inference for estimation, while there indeed exists deterministic-optimization-based online tracking, such as online tracking by linking up-to-time trajectories and detections in the next frame based on the Hungarian algorithm. On the other hand, offline tracking always employs deterministic optimization in the derivation of the states of objects.
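As a concrete example of such deterministic, online association, the sketch below matches existing trajectories to new detections in one frame with the Hungarian algorithm (via scipy.optimize.linear_sum_assignment). The IoU-based cost and the gating threshold are illustrative assumptions, not the choices of any particular cited work.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_boxes, det_boxes, max_cost=0.7):
    """One-frame association between tracks and detections.

    track_boxes, det_boxes: lists of (x, y, w, h); cost is 1 - IoU.
    Returns (matches, unmatched_track_ids, unmatched_detection_ids).
    """
    if not track_boxes or not det_boxes:
        return [], list(range(len(track_boxes))), list(range(len(det_boxes)))

    def iou(a, b):
        ax2, ay2 = a[0] + a[2], a[1] + a[3]
        bx2, by2 = b[0] + b[2], b[1] + b[3]
        iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
        ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
        inter = iw * ih
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    cost = np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)          # minimum-cost assignment
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    matched_t = {m[0] for m in matches}
    matched_d = {m[1] for m in matches}
    unmatched_tracks = [r for r in range(len(track_boxes)) if r not in matched_t]
    unmatched_dets = [c for c in range(len(det_boxes)) if c not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```

Unmatched detections typically spawn new tracks, while tracks left unmatched for several frames are terminated.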
3.2.5 MOT Special Cases

Some publications cannot be simply grouped into traditional multi-target tracking as they exhibit different attributes. We present three special cases as follows.

MOT in sport scenarios. Tracking of multiple targets on the sport court has wide applications such
as technical statistics in a match and automatic analysis of strategies. There are several differences between traditional MOT and this special problem. The first one is that the sport court is special. There are frequent shot switches and actions to zoom in and zoom out, which make it challenging. However, this scene also exhibits useful features. For example, regarding some kinds of sport such as basketball and football, there is a clear boundary between the court and the background. This kind of boundary could dismiss most of the confusion from the background once it is recognized. For instance, Xing et al. (2011) propose to segment the playfield from the non-playfield as a pre-processing step for MOT. The second difference is that targets in this problem typically wear uniforms. This can lead to two effects. On one hand, as targets dress very similarly to each other, the appearance of targets is not as diverse as in traditional MOT, thus it may be more difficult to differentiate objects. On the other hand, as claimed before, targets of the same team wear clothes of the same color or pattern, while ones from different teams do not follow this. This could be helpful. For example, Lu et al. (2013) train a logistic regression classifier which maps image patches to team labels using RGB color histograms. This step improves the precision while retaining the same level of recall rate.

MOT in aerial scenes. Multiple object tracking in aerial scenes is different from normal MOT as it exhibits these features: 1) the size of targets is considerably small, thus the commonly used appearance information is not reliable, 2) the frame rate is quite low, and 3) the object density is high, e.g., the number of objects could be up to hundreds. Therefore, the solution is also different from that of the normal multiple object tracking problem. Reilly et al. (2010) propose several techniques to tackle the difficulties mentioned above. To detect objects, motion detection is applied after background modeling. Obtaining objects in each frame, the Hungarian algorithm is adapted to solve the assignment of detections to existing trajectories. When computing the affinity, the spatial proximity, velocity orientation and context are considered. Note that context is relatively important in this special case. Shi et al. (2013) deal with the multi-frame multi-target association problem
in the aerial scene as a rank-1 tensor approximation problem. A high-order tensor is constructed based on all the candidate trajectories. Then the multi-dimensional assignment problem for MOT is transformed into the rank-1 approximation. In the end the approximation is solved by an l1 tensor power iteration.

Generic MOT. As we have discussed, the majority of MOT work focuses on some special kinds of objects, such as vehicles and pedestrians. The reason is that DBT solutions without manual initialization are more practical in real-life applications, and object detection has achieved great progress, especially for pedestrians. However, these methods, although free of manual initialization, rely on specific kinds of object detectors to obtain detection observations. Typically, these object detectors have already been trained in advance. There arise two problems: 1) these pre-trained detectors can only be applied to image sequences of the same type of objects, and 2) as these detectors are pre-trained, they are data-specific, thus the performance is not optimal when they are applied to other image sequences. On the other hand, DFT approaches, although requiring manual initialization, can be applied to type-free sequences.

Observing this, recently some researchers have proceeded a step further to investigate generalization of the MOT problem to any kind of objects with minimal manual labeling and free of offline trained detectors. Zhao et al. (2012) propose to track multiple similar objects by requiring one instance in the first frame to be labeled. They first track this object, collect training samples, and train an object detector for this kind of objects in the first few frames. Then they start from the first frame again, detect the top M (specified by the user) similar objects and track them in the subsequent frames. Compared with DFT, the method of Zhao et al. (2012) saves much labeling labor. However, as the number of objects to track is still fixed, it is not comparable with DBT. To make the generalization of MOT more practical, a generic MOT problem is proposed by Luo and Kim (2013), in which multiple similar objects are tracked with the labeling of one example in the first frame. Object detection is progressively refined with the accumulation of training samples. A specialized tracker is initialized once an object is discovered, and multiple trackers are learned together based on multi-task learning (Evgeniou and Pontil 2004; Caruana 1997). Dealing with the same problem, Luo et al. (2014) propose a bi-label propagation framework. The detection and tracking of multiple objects is cast as class and object label propagation in the spatial-temporal video tube. Multiple similar objects are also tracked by Dicle et al. (2013); however, the detection responses are given as input to the algorithm rather than obtained by detection. Brostow and Cipolla (2006) deal with generic MOT without any supervision. Assuming that a pair of points that appears to move together is likely to be part of the same individual, feature points are tracked and motion clustering is conducted to discover and maintain identical entities.

4 MOT Components

As shown in Figure 5, MOT involves two primary components. One is the observation model and the other one is the dynamic model. The observation model measures similarity between object states and observations. To be more specific, an observation model includes modeling of appearance, motion, interaction, exclusion and occlusion. The dynamic model investigates state transitions across frames. It can be classified into probabilistic inference and deterministic optimization. All these components will be presented in this section.

4.1 Appearance Model

Appearance is an important cue for affinity computation in MOT. However, it is worth noting that, different from the single object tracking approach which primarily focuses on constructing a sophisticated appearance model to discriminate the object from the background, multiple object tracking does not mainly focus on the appearance model, i.e., the appearance cue is important but not the only cue to depend on. This is partly because the multiple objects in MOT can hardly be discriminated by relying on appearance information alone.

Technically, an appearance model includes two components, i.e., visual representation and statistical measuring. Visual representation is closely related to features, but it is more than features. It is how to precisely describe the visual characteristics of an object based on features, and in general it can be grouped into two sets, visual representation based on a single cue and that based on multiple cues. Statistical measuring is the computation of similarity or dissimilarity between different observations once the visual representation is ready. Eq. 4 gives an illustration of appearance modeling, where o_i and o_j are visual representations of different observations based on a single cue or multiple cues, and F(•, •) is a function to measure the similarity S_{i,j} between o_i and o_j. In the following, we first discuss the features/cues employed in MOT, and then describe appearance models based on a single cue and on multiple cues respectively.

S_{i,j} = F(o_i, o_j)   (4)
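The structure of Eq. 4 can be sketched as a pluggable pair of a visual representation and a statistical measure. The particular choices below (a joint RGB histogram and the Bhattacharyya coefficient) are only one example of a single-cue instantiation, not a prescription from the surveyed works.

```python
import numpy as np

def color_histogram(patch, bins=8):
    """Visual representation: a normalized joint RGB histogram of an image patch (H x W x 3, values in [0, 255])."""
    hist, _ = np.histogramdd(patch.reshape(-1, 3), bins=(bins,) * 3, range=[(0, 256)] * 3)
    return hist.ravel() / max(hist.sum(), 1e-12)

def bhattacharyya_similarity(h_i, h_j):
    """Statistical measure: Bhattacharyya coefficient between two normalized histograms."""
    return float(np.sum(np.sqrt(h_i * h_j)))

def appearance_affinity(o_i, o_j, represent=color_histogram, measure=bhattacharyya_similarity):
    """S_ij = F(o_i, o_j): represent each observation, then measure the similarity of the representations."""
    return measure(represent(o_i), represent(o_j))
```

Swapping `represent` or `measure` yields the other single-cue and multi-cue models discussed below.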
4.1.1 Feature

Features are indispensable for MOT. As shown in Figure 6, different kinds of features have been employed in MOT. We categorize features into the following subsets.

Point features. Point features are successful in single object tracking (Shi and Tomasi 1994). For MOT, point features can also be helpful. For instance, the KLT tracker is employed to track feature points and generate a set of trajectories or short tracklets (Sugimura et al. 2009; Zhao et al. 2012). Choi and Savarese (2010) utilize the KLT features (Tomasi and Kanade 1991) as additional features to estimate the camera's motion, greatly improving tracking performance. Similarly, KLT tracking is utilized by Benfold and Reid (2011) to estimate motion. Local feature points (Lowe 2004) are adopted along with the bag-of-words model by Yang et al. (2009b) to capture the texture characteristics of a region. Point features are also employed by Brostow and Cipolla (2006) for motion clustering.

Color/intensity features. This is the most popularly utilized feature for MOT. Usually the color or intensity features along with a measurement are employed to calculate the affinity between two counterparts (detection hypotheses, tracklets or short trajectories). The simple raw pixel template is employed by Yamaguchi et al. (2011) to compute the appearance affinity. Color histograms are used by Sugimura et al. (2009), Song et al. (2010), Mitzel et al. (2010), Izadinia et al. (2012), Okuma et al. (2004) and Mitzel and Leibe (2011).

Optical flow. The optical flow feature can be employed to conduct short-term visual tracking. Thus many solutions to MOT utilize optical flow to link detection responses from continuous frames into short tracklets for further data association processing (Rodriguez et al. 2009) or directly use it for data association (Izadinia et al. 2012). Besides this, optical flow is also employed to complement HOG for the observation model (Andriyenko and Schindler 2011). Additionally, optical flow is popular in extremely crowded scenarios for discovering crowd motion patterns (Ali and Shah 2008; Rodriguez et al. 2011).

Gradient/pixel-comparison features. There are some features based on gradient or pixel comparison. Mitzel et al. (2010) utilize a variation of the level-set formulation, which integrates three terms penalizing the deviation between foreground and background, the deviation of the embedding function from a signed distance function, and the length of the contour, to track objects in continuous frames. Besides its success in human detection, HOG (Dalal and Triggs 2005) plays a vital role in the multiple pedestrian tracking problem as well. For instance, HOG is employed (Izadinia et al. 2012; Kuo et al. 2010; Breitenstein et al. 2009; Choi and Savarese 2012; Yu et al. 2008) to detect objects and/or compute similarity between pedestrian detections for data association.

Region covariance matrix features. Region covariance matrix features (Porikli et al. 2006; Tuzel et al. 2006) are robust to issues such as illumination changes, scale variations, etc. Therefore, they are also employed for the MOT problem. The region-covariance-matrix-based dissimilarity is used to compare appearance for data association (Henriques et al. 2011). Covariance matrices along with other features constitute the feature pool for appearance learning by Kuo et al. (2010). Hu et al. (2012) utilize the covariance matrix to represent the object for both single and multiple object tracking.
Fig. 6: Some exemplar features. (a) Image showing optical flow (Ali and Shah 2008), (b) image showing covariance matrix, (c) image showing point feature (Brostow and Cipolla 2006), (d) image showing gradient based features (Dalal and Triggs 2005), (e) image showing depth (Mitzel et al. 2010) and (f) image showing color feature (Okuma et al. 2004). Best viewed in color
Depth. Depth information is employed for various computer vision tasks. With regard to MOT, Mitzel et al. (2010) utilize depth information to correct the bounding box of a detection response and re-initialize the bounding box for level-set tracking in their work. Depth information is integrated into the framework of Ess et al. (2009, 2007) to augment detection hypotheses with a depth flag 0 or 1, which further refines the detection responses. Similarly, Ess et al. (2008) employ depth information to obtain more accurate object detections in a mobile vision system and then use the detection result for multiple object tracking. The stereo depth is taken into account by Giebel et al. (2004) to estimate the weight of a particle in the proposed Bayesian framework for multiple 3D object tracking. Gavrila and Munder (2007) integrate depth to generate detections and consequently verify them for multiple object tracking from a moving car.

Others. Some other features, which are not so popular, are utilized to conduct multiple object tracking as well. For instance, gait features in the frequency domain, which are unique for every person, are employed by Sugimura et al. (2009) to maximize the discrimination between the tracked individuals. Given a trajectory, line fitting via linear regression is conducted to extract the periodic component of the trajectory. Then the Fast Fourier Transform (FFT) is applied to the residual periodic signal to obtain the amplitude spectra and phase of the trajectory, which are utilized to compute the dissimilarity between a trajectory and other trajectories. The Probabilistic Occupancy Map (POM) (Fleuret et al. 2008; Berclaz et al. 2011) is employed to estimate how probable an object would occur in a specific grid under the multi-camera setting. It creates detection hypotheses on top of background modeling as the input for MOT.

Generally speaking, most of the features are efficient. At the same time, they also have shortcomings. For instance, the color histogram has well-studied similarity measures, but it ignores the spatial layout of the object region. Point features are efficient, but sensitive to issues like occlusion and out-of-plane rotation. Gradient-based features like HOG can describe the shape of an object and are robust to issues such as illumination changes, but they cannot handle occlusion and deformation well. Region covariance matrix features are more robust as they take more information into account, but this benefit is obtained at the cost of more computation. Depth features make the computation of affinity more accurate, but they require multiple views of the same scenery and/or an additional algorithm (Felzenszwalb and Huttenlocher 2006) to obtain depth.

4.1.2 Single cue based appearance model

In most cases, the appearance model in MOT favors simplicity and efficiency, thus a single cue is a popular option. In the following we present appearance models employing a single cue in five aspects.
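As a warm-up for the single-cue models below, here is a small sketch of a normalized cross-correlation affinity between two equally sized patches; the squared-exponential mapping mirrors the raw-pixel-template form described next, though the exact form used in each cited work may differ.

```python
import numpy as np

def ncc(patch_a, patch_b):
    """Normalized cross-correlation between two equally sized gray-scale patches."""
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def template_affinity(template, candidate):
    """Squared-exponential affinity of the NCC score (higher means more similar)."""
    return float(np.exp(-(ncc(template, candidate) - 1.0) ** 2))
```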
Raw pixel template representation. The pixel is the most granular element of images and videos, thus pixels are the foundation of computer vision problems. Besides its importance, this representation is also popular for its simplicity. The raw pixel template representation is the raw pixel intensity or color of a region. It can encode spatial information since the comparison is element-wise when matching two templates. Yamaguchi et al. (2011) employ the Normalized Cross Correlation (NCC) to evaluate the predicted position of an object. This appearance model is very simple, but helpful. The appearance affinity is calculated as the NCC between the target template and a candidate bounding box (Ali and Shah 2008). Note that the target template is progressively updated at each time instant. Wu et al. (2012) build a network-flow approach to handle multiple target tracking. When they compute the transitional cost on the arcs of the network as flows, the normalized cross correlation between the upper one-fourth bounding boxes of the corresponding two detection observations is used. A simple patch-based tracker is implemented to calculate the data likelihood with regard to appearance (Pellegrini et al. 2009). The similarity is computed between the initial bounding box and a candidate bounding box as the squared exponential P_data(p) ∝ exp(−(NCC(p, p′) − 1)²), where p and p′ represent the candidate bounding box and the initial bounding box. Despite its efficiency, this kind of representation easily suffers from changes of illumination, occlusion or some other issues.

Color histogram representation. The color histogram is the most popular representation for appearance modeling in MOT approaches as a result of its effectiveness in capturing the statistical information of the target region. For example, Kratz and Nishino (2010) employ the color histogram model (Pérez et al. 2002) to calculate the likelihood in terms of appearance, and they use an exponential function to transform the histogram distance into a probability. Similarly, to capture the dissimilarity, Sugimura et al. (2009) use the Bhattacharyya distance between hue-saturation color histograms when constructing a graph. A Mean Shift tracker employing the color histogram is utilized to sequentially seize the object (Choi and Savarese 2010). The appearance model is defined as the RGB color histogram of the trajectory by Leibe et al. (2008). It is initialized as the first detection response's color histogram and evolves as a weighted mean of all the detection responses which belong to this trajectory. The likelihood considering appearance is proportional to the Bhattacharyya coefficient of two histograms. Affinity regarding appearance is dealt with by calculating the Bhattacharyya distance between the average HSV color histograms of the concerned tracklets (Qin and Shelton 2012). Xing et al. (2009) represent the target (pedestrian body) as a constitution of three overlapped part areas, i.e., the Full Body (FB), the Head-Shoulder (HS) and the Head-Torso (HT). To link tracklets for data association, they consider the color histogram of each part in the detection response. The affinity regarding appearance to link two tracklets T_i and T_j is calculated as in Eq. 5,

S(T_i, T_j) = exp(−D(c_i, c_j)),   (5)

where D(c_i, c_j) = mean{D(c_i^k, c_j^k) | k = FB, HT, HS}, c means the color histogram and D(•, •) is the Bhattacharyya distance measure. The link affinity of two detection responses regarding appearance is calculated based on RGB histograms by Zhang et al. (2008). Given two detection observations o_i and o_j, the corresponding color histograms c_i and c_j are extracted. Then the Bhattacharyya distance D_ij between c_i and c_j is obtained, and the probability of linking o_i and o_j is

P(o_i | o_j) = N(D_ij; D_s, σ_s²) / (N(D_ij; D_s, σ_s²) + N(D_ij; D_d, σ_d²)),   (6)

where N(•; D_s, σ_s²) and N(•; D_d, σ_d²) are two Gaussian distributions of appearance dissimilarity between the same object and between different objects, respectively. The parameters of these two distributions are learned from the training data. Note that, besides its advantages, the color histogram representation has the drawback of losing spatial information.

Covariance matrix representation. The covariance matrix is robust to illumination change, rotation, etc. The covariance matrix descriptor is employed to represent the appearance of an object by Henriques et al. (2011). The likelihood concerning appearance to link two detection responses is modeled as

P_link(T_i, T_j) ∝ N(D_ij; D̄, Σ),   (7)

where N is the Gaussian distribution, D_ij is the appearance dissimilarity between T_i and T_j, and D̄ and Σ are parameters estimated from training data. A block-division appearance model which divides the object region into blocks is proposed by Hu et al. (2012). Within each block, the covariance matrix is extracted as the region descriptor to characterize the block. At the same time, the likelihood of each block is computed with regard to the corresponding block of the target, and the likelihood of the whole region is the product of the likelihoods of all blocks.
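A region covariance descriptor of the kind referenced above can be sketched as follows. The per-pixel feature set [x, y, R, G, B, |Ix|, |Iy|] and the generalized-eigenvalue dissimilarity are common choices in the covariance-tracking literature, but the exact feature set varies from paper to paper.

```python
import numpy as np
from scipy.linalg import eigh

def region_covariance(patch):
    """Covariance descriptor of an image region.

    patch: (H, W, 3) float array. Per-pixel feature: [x, y, R, G, B, |Ix|, |Iy|].
    """
    h, w, _ = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    gray = patch.mean(axis=2)
    iy, ix = np.gradient(gray)                       # gradients along rows and columns
    feats = np.stack([xs, ys, patch[..., 0], patch[..., 1], patch[..., 2],
                      np.abs(ix), np.abs(iy)], axis=-1).reshape(-1, 7)
    return np.cov(feats, rowvar=False) + 1e-6 * np.eye(7)   # regularized 7x7 covariance

def covariance_dissimilarity(c1, c2):
    """Dissimilarity via generalized eigenvalues of the two covariance matrices."""
    lam = eigh(c1, c2, eigvals_only=True)
    return float(np.sqrt(np.sum(np.log(np.maximum(lam, 1e-12)) ** 2)))
```

This dissimilarity could then be plugged into a learned likelihood such as Eq. 7.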
Pixel comparison representation. Nothing could be simpler than giving a binary result of a comparison between two pixels, and this is the advantage of this type of representation over other kinds of representation. Zhao et al. (2012) adopt Random Ferns in tracking (Kalal et al. 2012). It encodes the results of comparisons between pairs of pixels and votes on the comparison based on training data. The probability of a patch being positive is calculated based on how many positive samples and negative samples have been recorded by that leaf.

Bag of words representation. Fast dense SIFT-like features (Lowe 2004) are computed by Yang et al. (2009b) and encoded based on the bag-of-words model. To incorporate spatial information, the spatial pyramid matching (SPM) method (Lazebnik et al. 2006) is adapted. This is used as an observation model for appearance modeling.

4.1.3 Multi-cue based appearance model

Different kinds of cues could compensate each other, which would make the appearance model robust. However, there arises the issue of how to fuse the information from multiple cues. Regarding this, we present multi-cue based appearance models according to five kinds of fusion strategies: Boosting, Concatenating, Summation, Product and Cascading (also see Table 5).

Boosting. The strategy of Boosting usually selects a portion of features from a feature pool sequentially via a Boosting-based algorithm (e.g., Adaboost by Kuo et al. (2010) and RealBoost by Yang and Nevatia (2012c)). Features are selected according to their discrimination power. A discriminative appearance model is proposed by Kuo et al. (2010) to assign high similarity to tracklets which are of the same object, but low affinity to tracklets of different objects. Specifically, the color histogram in RGB space, HOG and the covariance matrix descriptor are employed as features. They choose 15 regions so that they have 45 cues in total in the feature pool. Collecting positive and negative training pairs according to the so-called spatial-temporal constraints, they employ Adaboost to choose the most representative features to discriminate pairs of tracklets belonging to the same object from those belonging to different objects. Similarly, Yang and Nevatia (2012c) adopt the features of Kuo et al. (2010) and employ the standard RealBoost algorithm to learn the feature weights from a training sample set, which is composed of correctly linked pairs (as positive samples) and incorrectly matched pairs (as negative samples). A HybridBoost algorithm is proposed by Li et al. (2009) to automatically select features with maximum discrimination. This algorithm employs a hybrid loss function composed of a classification term and a ranking term. The ranking term ranks correct tracklet associations higher than their counterparts and the classification term dismisses wrong associations.

Concatenating. An SVM classifier is trained to distinguish a specific target from targets in its temporal window. Color, HOG and optical flow are concatenated and further processed with PCA projection for dimension reduction to describe the detection response (Brendel et al. 2011). The similarity S between two detection responses is

S = exp(−(f − f′)ᵀ M (f − f′)),   (8)

where f and f′ are features corresponding to the two detection responses, and M is a distance metric matrix learned online.

Summation. Mitzel et al. (2010) simultaneously segment and track multiple objects. As pedestrians and backgrounds often contain the same colors, color information alone yields a rather unreliable segmentation for pedestrians. To address this, they integrate color information with depth information. To be specific, they first compute an expected depth of the foreground in the following frames according to the current depth and a maximum velocity. Then each depth of the foreground in the following frame can be assigned a probability based on a Gaussian distribution centered at the expected depth. The probability based on the color representation is the accordance of the color histogram, calculated in the Lab space, with the learned appearance model (Bibby and Reid 2008). Then these two probabilities (computed from the color and depth cues) are weighted by a parameter α as in Eq. 9. A similar weighting strategy is adopted by Liu et al. (2012) to balance the two cues of raw pixel intensity and silhouette.

P_i = (1 − α) P_{i,color} + α P_{i,depth},  i ∈ {fg, bg}.   (9)

Besides similarity, distances could also be added together to conduct multiple object tracking. The distance in terms of different appearance cues, including the RGB color histogram, correlogram and LBP, is computed as a summation for matching (Takala and Pietikainen 2007).
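The summation strategy reduces to a weighted average of per-cue scores. Below is a minimal sketch, assuming the per-cue scores are already comparable probabilities or affinities; the cue names and weights are arbitrary placeholders.

```python
import numpy as np

def summation_fusion(score_by_cue, weights=None):
    """Combine per-cue probabilities (or affinities) by a weighted sum.

    score_by_cue: dict such as {"color": 0.72, "depth": 0.41}; weights default to uniform.
    With two cues and weights (1 - alpha, alpha) this reduces to the form of Eq. 9.
    """
    cues = sorted(score_by_cue)
    w = np.ones(len(cues)) if weights is None else np.asarray([weights[c] for c in cues], dtype=float)
    p = np.asarray([score_by_cue[c] for c in cues], dtype=float)
    return float(np.dot(w / w.sum(), p))

# Example: weight the depth cue with alpha = 0.3 against the colour cue
score = summation_fusion({"color": 0.72, "depth": 0.41}, weights={"color": 0.7, "depth": 0.3})
```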
Product. Yang et al. (2009b) integrate multiple cues including color, shapes and bags of local features (Lowe 2004; Lazebnik et al. 2006) to calculate the likelihood of linking a detection response with an existing trajectory. Assuming these three cues are f¹, f² and f³, the likelihood of linking a detection response with a trajectory (state s) is modeled as P(f¹, f², f³ | s). These cues are assumed to be independent from each other, thus the likelihood is

P(f¹, f², f³ | s) = ∏_{i=1}^{3} P(fⁱ | s).   (10)

A similar formula is adopted by Song et al. (2010). The likelihood considering the color histogram is multiplied with the likelihood regarding the foreground response as the final likelihood in the observation model. In this work, these two cues are assumed to be independent. Likelihoods in terms of three cues, shape, texture and depth, are multiplied to compute the weight of a particle in a Bayesian framework (Giebel et al. 2004). Dividing the scene under multiple cameras into multiple grids, an appearance model is constructed based on a color model and ground plane occupancy estimation (Berclaz et al. 2006). Similarities concerning these two cues are multiplied in the MAP formula.
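The product strategy of Eq. 10 amounts to multiplying per-cue likelihoods under an independence assumption; the log-space accumulation in the sketch below is our own numerical-stability choice, not part of the cited formulations.

```python
import numpy as np

def product_fusion(likelihoods):
    """Combine per-cue likelihoods assuming independence: P(f1,...,fn | s) = prod_i P(f_i | s)."""
    vals = np.maximum(np.asarray(likelihoods, dtype=float), 1e-12)
    return float(np.exp(np.sum(np.log(vals))))

# e.g. colour, shape and local-feature likelihoods for one candidate state
p = product_fusion([0.8, 0.55, 0.9])
```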
be specific, a term which considers differences between is from the displacement between the observed position
the velocities of one object in different time instants is and the estimated position of head of T2 , defined as
formulated in Eq. 12 as, ∆p2 = ptail + vtail ∆t − phead . The probability is
N −2 X
M
vi − vt+1
2 , Pm (T1 , T2 ) = N (∆p1 ; 0, Σp ) N (∆p2 ; 0, Σp ) . (14)
X
t
Cdyn = i (12)
t=1 i=1
This motion model is essentially the same as the one
where vit is the velocity of target i at time t. It is com- by Xing et al. (2009). It is quite popular (Kuo et al.
puted as the displacement between object positions in 2010; Kuo and Nevatia 2011; Yang et al. 2011; Qin and
two continuous frames. The first summation takes all Shelton 2012; Nillius et al. 2006). However, it only con-
the N frames into account and the second summa- siders the pair of tracklets itself. A motion model be-
tion counts all the M trajectories/objects. Intuitively, tween two pairs of tracklets are also taken into consid-
this term penalizes the difference between velocities and eration (Yang and Nevatia 2012b). Figure 7(b) shows
forces trajectories to be smooth. A constant velocity the pairwise motion model. Considering two pairs of
model simultaneously considering the forward velocity tracklets, (T1 , T3 ) and (T2 , T4 ), they suppose T1 and
and the backward velocity is proposed by Xing et al. T2 are tail-close tracklets. Firstly the earlier tail time
(2009) to compute the affinity linking two tracklets in when T1 and T2 end is defined as tx = min{te1 , te2 },
terms of motion. Given two tracklets Ti and Tj , let where tei means the ending time of tracklet Ti . Sim-
us assume there is a temporal gap between the tail ilarly, the later time when T3 and T4 start is repre-
of Ti and the head of Tj . The forward-direction mo- sented as ty = max{ts3 , ts4 }. Obviously ty > tx . Then
tion is described by a Gaussian distribution centered they compute the relative distance between the esti-
in phead
j , the position of the head response in Tj , with mated positions of T1 and T2 at frame ty as ∆p1 =
variance ΣjB . It estimates the probability of the posi- te te
(p11 + v1tail (ty − te1 )) − (p22 + v2tail (ty − te2 )), where v1tail
tion of tail response in Ti plus forward displacement and v2tail are the tail velocity of T1 and T2 . On the
as ptail
i + viF ∆t. The backward-direction motion is also other hand, the real relative distance between T1 and
represented as a Gaussian distribution, with the dif- t t
T2 at frame ty is ∆p2 = p3y − p4y . Similar to the mo-
ference that motion is calculated backwardly from the tion model in the unary term, they employ a zero-mean
position of head response in Tj to the position of tail Gaussian function as N (∆p1 − ∆p2 ; 0, Σp ) to estimate
response in Ti . The model is given in Eq. 13 as the linkable probability. The insight behind this pair-
wise motion model is that if the difference between ∆p1
Pm (Ti , Tj ) = N ptail + viF ∆t; phead , ΣjB ∗
i j and ∆p2 is small and T1 is associated with T3 (i.e., the
(13) label of the node corresponding to (T1 , T3 ) is 1), then
N phead + viB ∆t; ptail F
j i , Σi .
the probability to associate T2 and T4 is high.
Different from previous graph based MOT approaches Besides considering position and velocity, Kuo and
which treat each node as an individual observation (e.g, Nevatia (2011) also take the accelerate rate into consid-
one detection response), Yang and Nevatia (2012b) em- eration. The probability concerning motion of a state
ploy the Conditional Random Field (CRF) model, treat- {ŝk } (k is the frame index) given observation tracklet
ing the node as a pair of tracklets. The label of each {ok } is modeled as,
node indicates whether the two tracklets correspond-
ing to this node can be associated or not. This is ad- Y
dressed by the unary term in the CRF model, consider- P ({ŝk } | {ok }) = N (xk − x̂k ; 0, Σp )
ing both the appearance and motion information. The k
(15)
Y Y
probability in terms of motion is calculated based on N (vk ; 0, Σv ) N (ak ; 0, Σa ) ,
the displacement between the estimated positions via k k
a linear motion model and the observed positions. Fig-
k+1 −x̂k vk −vk−1
ure 7(a) illustrates this model clear. Given two tracklets where vk = x̂tk+1 −tk is the velocity, ak = 0.5(tk+1 −tk−1 )
T1 and T2 , assuming that T1 is the front one along is the acceleration, and N is a zero-mean Gaussian dis-
the time axis compared with T2 , there is a time gap tribution.
∆t between tail of T1 and head of T2 . The probabil-
4.2.2 Non-linear motion model
Fig. 7: The unary motion model (a) and pairwise motion model (b) (Yang and Nevatia 2012b)
The linear motion model is commonly used to explain an object's movement. However, there are some cases with which the linear motion model cannot deal. To this end, non-linear motion models are proposed to produce more accurate motion affinity between tracklets. For instance, Yang and Nevatia (2012a) employ a non-linear motion model to handle the situation that targets may move freely. Given two tracklets T1 and T2 which belong to the same target in Figure 8(a), the linear motion model (Yang and Nevatia 2012b) would produce a low probability of linking them, which is not consistent with the truth. Alternatively, employing the non-linear motion model, which is composed of a set of pattern tracklets, the gap between the tail of tracklet T1 and the head of tracklet T2 could be reasonably explained by a tracklet T0 ∈ S. As shown in Figure 8(b), the tracklet T0 is a support tracklet to explain T1 and T2 because there exist elements {(p_i, s_i, v_i)} in T0 which are matched with the tail of T1 and the head of T2, where p, s and v are position, size and velocity, respectively. Then the real path bridging T1 and T2 is estimated based on T0, and the affinity between T1 and T2 is calculated in the same way as in the linear motion model, but based on the non-linear motion positions.

4.3 Interaction Model

The interaction model, also known as the mutual motion model, captures the influence of an object on other objects. This is a distinct issue of multiple object tracking compared with single object tracking. In a crowded scene, an object obviously experiences some "force" from others. For instance, when a pedestrian is walking on the street, he adjusts his speed, direction and destination in order to avoid collisions with others. Another example is that when a crowd of people walks across a street, each of them follows others and guides others at the same time, i.e., they form a motion pattern and everyone follows this pattern. In fact, these are examples of two typical interaction models known as the social force models (Helbing and Molnar 1995) and the crowd motion pattern models (Hu et al. 2008). There are some representative works of these models, which are illustrated as follows.

4.3.1 Social force models

Social force models are also known as group models. In these models, each object is considered to be dependent on other objects and environmental factors. This type of information could alleviate performance deterioration in crowded scenes. In social force models, targets are treated as agents which determine their speed, velocity and destination based on observations of other objects and the surrounding environment. More specifically, in social force models, target behavior is modeled based on two aspects, individual force and group force.

Individual force. For each individual in the scenario of multiple objects, two types of force are considered: 1) fidelity, which means one should not change his desired destination, and 2) constancy, which means one should not suddenly change his velocity, including speed and direction.

Group force. For a whole group, three types of force are considered: 1) attraction, which means individuals moving together as a group should stay close, 2) repulsion, which means that individuals moving together as a group should keep some distance away from others to make all members comfortable, and 3) coherence, which means individuals moving together as a group should have similar velocity.

The majority of existing publications with a social force model follow these two types of force. For instance, assuming that each pedestrian in a social group always adjusts its trajectory at an early stage in order to avoid possible collisions, Pellegrini et al. (2009) maximize the minimum distance which makes targets comfortable. In their model, each subject is represented as s_i = {p_i^t, v_i^t}, where p_i^t and v_i^t are position and velocity at time t. Regarding subject i, the minimum distance between it and another subject j that makes them comfortable is D_ij(v_i), and the corresponding energy term is C_ij.
Fig. 8: An image comparing the linear motion model (a) with the non-linear motion model (b) (Yang and Nevatia 2012a). Best viewed in color
To model this property among multiple objects considering subject i, they balance each pairwise energy term by a weight. Thus, the final interaction energy between subject i and the other objects is

C_i^inter(v_i) = Σ_{j≠i} w_ij C_ij(v_i),   (16)

where w_ij is the weight assigned to subject j considering subject i. Furthermore, they assume subject i walks towards a destination p̃_i with a desired speed u_i. Therefore they have two more energy terms: C_i^ssc(v_i) = (u_i − ‖v_i‖)^2, penalizing sudden speed changes, and C_i^dfd(v_i) = −((p̃_i − p_i) · v_i)/(‖p̃_i − p_i‖ ‖v_i‖), penalizing drift from the destination. The complete energy objective is then

C_i(v_i) = C_i^inter(v_i) + λ_1 C_i^ssc(v_i) + λ_2 C_i^dfd(v_i),   (17)

where λ_1 and λ_2 are parameters to balance these terms. By minimizing the above energy function for subject i, the search space of its destination could be largely reduced, and the data association procedure is further simplified.
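The sketch below illustrates how such an energy could be evaluated for one candidate velocity. The exponential form of the pairwise comfort energy, the prediction horizon and all parameter values are assumptions made for illustration only; the original work defines its own C_ij and D_ij(v_i).

```python
import numpy as np

def pairwise_energy(d_min, sigma=2.0):
    """Illustrative comfort energy C_ij: grows as the expected minimum
    distance d_min between two subjects shrinks (form and sigma are assumed)."""
    return np.exp(-d_min ** 2 / (2.0 * sigma ** 2))

def expected_min_distance(p_i, v_i, p_j, v_j, horizon=3.0, steps=30):
    """Smallest predicted distance between subjects i and j when both keep
    constant velocity over a short horizon (stands in for D_ij(v_i))."""
    ts = np.linspace(0.0, horizon, steps)
    gaps = [np.linalg.norm((p_i + v_i * t) - (p_j + v_j * t)) for t in ts]
    return min(gaps)

def total_energy(v_i, p_i, others, dest, desired_speed, lam1=1.0, lam2=1.0):
    """Complete objective of Eq. 17 for subject i and a candidate velocity v_i.
    `others` is a list of (weight w_ij, position p_j, velocity v_j) tuples."""
    # Interaction term (Eq. 16): weighted sum of pairwise comfort energies.
    e_inter = sum(w * pairwise_energy(expected_min_distance(p_i, v_i, p_j, v_j))
                  for w, p_j, v_j in others)
    # Speed term: penalize deviation from the desired speed.
    e_speed = (desired_speed - np.linalg.norm(v_i)) ** 2
    # Destination term: penalize moving away from the desired destination.
    to_dest = dest - p_i
    e_dest = -np.dot(to_dest, v_i) / (np.linalg.norm(to_dest) * np.linalg.norm(v_i) + 1e-9)
    return e_inter + lam1 * e_speed + lam2 * e_dest
```

Minimizing this objective over candidate velocities, e.g. on a small grid, would play the role of the destination/velocity search described in the text.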
Yamaguchi et al. (2011) also assume objects are agents of a social force model. The destination of an object is determined by considering the so-called personal, social and environmental factors, which are formulated as terms in a cost function. In their model, each object state is represented as s_i = {p_i, v_i, u_i, p̃_i, S_i}, where p_i is position, v_i is velocity, u_i is the desired speed, p̃_i is the desired destination and S_i is the group of objects including object i. The behavior model to navigate objects has six terms: 1) a damping term penalizing sudden changes of velocity, C_damping(v; s_i) = ‖v − v_i‖; 2) a speed term giving a cost C_speed(v; s_i) = (u_i − ‖v‖)^2 if the speed deviates from the desired speed; 3) a direction term C_direction(v; s_i) = −((p̃_i − p_i)/‖p̃_i − p_i‖) · (v/‖v‖), penalizing the case that the object does not follow the desired direction; 4) an attraction term C_attraction(v; s_i, S_i) = Σ_{j∈S_i∖{i}} ((v_i/‖v_i‖) · (v_j/‖v_j‖)) ((∆p_ij/‖∆p_ij‖) · (v/‖v‖)), assuming that people tend to stay close when they move together; 5) a group term C_group(v; s_i, S_i) = ‖v − v̄_{S_i}‖, penalizing the variance of velocity within a group (v̄_{S_i} is the group average velocity); and 6) a collision term (Pellegrini et al. 2009) (see Eq. 16 for details).

Social grouping behavior is considered to improve data association for MOT (Qin and Shelton 2012). To be specific, they assume people form K groups, where K can be learned optimally, and every tracklet assigned to the same group should be consistent with the group mean trajectory. Thus they have an extra cost term which takes the distance between a concerned tracklet and its assigned group trajectory into consideration.

Two factors, repulsion and group motion, are considered by Choi and Savarese (2010). The repulsion factor tries to separate objects if they are too close to each other. Given two targets i and j at time t, the potential concerning repulsion is S_repulsion(i, j) = exp(−1/(c_r D_ij)), where D_ij is the distance between the two targets in 3D space and c_r is a controlling parameter. It is obvious that if two objects are too close, then the potential between them is very small. The group motion factor assumes the relative distance between two objects in two consecutive frames should stay unchanged. This also means that the velocities of the pair of objects should be similar to each other. Thus the group motion is modeled as a potential term S_group(i, j) = exp(−c_g ‖v_i − v_j‖)/(1 + exp(s_g (D_ij − D_g))), where v is velocity and c_g is a parameter to control this factor.
The first factor is a soft step function modeling whether two objects are considered to be in the same group: D_g is a threshold distance and s_g is a parameter controlling the slope. From the group motion factor, if two objects are close enough and have similar velocity, so that they are considered to be in the same group, then the potential is high.

A social force model which has four components is proposed by Scovanner and Tappen (2009) to learn the dynamics of pedestrians in the real world. These four components contribute four energy terms: 1) C_LM, which constrains the movement of a target to avoid jumps in the space grid; 2) C_CV, which maintains the target's constant velocity; 3) C_Dest, which guides a target to reach a destination; and 4) C_AV, which takes other targets into consideration, producing repulsion to avoid possible collisions. These four energy terms are weighted to form an energy objective which is then minimized to predict the movement of a target, generating tracks.

The data association problem and the group relationship mining problem are jointly estimated by Pellegrini et al. (2010). They model the trajectory assignment problem based on the motion and appearance information, and mine the group relationship by assuming that targets belonging to the same group keep their distance constant and have the same direction. A third-order CRF model is proposed and an energy function with regard to these two problems is constructed. By inferring the most probable estimation, these two problems are solved.

4.3.2 Crowd motion pattern models

Inspired by the crowd simulation literature (Zhan et al. 2008), motion patterns are introduced to alleviate the difficulty of tracking an individual object in a crowd. In general, this type of model is usually applied in the over-crowded scenario, i.e., when the density of targets is considerably high. In a highly crowded scene, objects are usually quite small, and cues such as appearance and individual motion are ambiguous. In this case, motion from the crowd is a comparably reliable cue for the problem.

There has been some work in this direction. For example, an assumption is made that the behavior of an individual is determined by the scene layout and the surrounding objects (Ali and Shah 2008). In order to model the influence from others and the scene structure, three kinds of force from floor fields are proposed. These fields are the Static Floor Field (SFF), the Boundary Floor Field (BFF) and the Dynamic Floor Field (DFF). SFF considers the scene structure, including the favorite path a crowd takes and the sinking point (exit) of the scene. BFF takes the barriers in the scene into consideration by recognizing the physical and virtual barriers in the scene. DFF captures the motion of a crowd around the object being tracked to determine the future positions of objects in the crowd.

Zhao et al. (2012) deal with the multiple object tracking problem in the structured crowd scene by observing that the crowd exhibits clear motion patterns in this case, and these patterns could benefit the tracking problem. Motion patterns are discovered by ND tensor voting (Mordohai and Medioni 2010) over the tracklet points obtained by a KLT tracker. These motion patterns are represented as a set of 4-D points {p_i = (x_i, y_i, u_{x_i}, v_{y_i}), i = 1, ..., n}, where x, y are spatial positions and u, v are speeds along the two dimensions of the image. These motion patterns are then employed to 1) estimate the probability of a candidate for detection and 2) predict the velocity of an object at a specific position for tracking.

Observing that a group of pedestrians exhibits collective spatio-temporal structure, the movement of an object within any local space-time location of a video is learned by training a set of Hidden Markov Models (HMM) (Kratz and Nishino 2010, 2012). The entire video is viewed as a 3D volume, and it is divided into local spatio-temporal cubes. The motion pattern of a specific spatio-temporal cube is represented as a 3D Gaussian distribution over the 3D gradients of all the pixels in the cube. This motion pattern is assumed to vary through time and to exhibit the Markov property. Thus the future motion pattern can be predicted based on the previous states (motion patterns), and this predicted motion pattern can be employed to constrain tracking of the object in this spatio-temporal location.

The motion pattern models described above make the assumption that the objects move coherently in a common direction. This may hold in the case of so-called structured crowd scenarios, but does not comply with the unstructured crowd, which exhibits various modalities of motion. To address this, the Correlated Topic Model (CTM) is adopted by Rodriguez et al. (2009) to learn various motion behaviors in the scene. A tracker which can predict a rough displacement based on a scene codebook built from all the moving pixels in the scene, along with the learned high-level behavior, are weighted to track objects in unstructured scenes. Similar to image retrieval, motion patterns could also be retrieved (Rodriguez et al. 2011). Motion patterns are first learned in an unsupervised and offline manner from a database composed of a large number of videos. Then, given a test video, a set of space-time patches are matched to explain the test video. After that, motion priors in the retrieved video patches are transferred to the test video as a prior to assist object tracking, along with a Kalman filter based tracker.
4.4 Exclusion Model

Exclusion is a constraint that arises when seeking a solution to the MOT problem due to physical collisions. Given multiple detection responses and multiple trajectory hypotheses, generally there are two constraints to be considered. The first one is the so-called detection-level exclusion (Milan et al. 2013), i.e., two different detection responses in the same frame cannot be assigned to an identical trajectory hypothesis. The second one is the so-called trajectory-level exclusion, i.e., two trajectories cannot occupy an identical detection response. The modeling of these constraints is presented as follows.

4.4.1 Detection-level exclusion modeling

The detection-level exclusion is explicitly modeled by Milan et al. (2013) by defining a cost term that penalizes the case in which two simultaneous detection responses o_i^t and o_j^t at time t are assigned the same trajectory label although they are distant.

KC and De Vleeschouwer (2013) employ label propagation for multiple object tracking. To model exclusion, a special exclusion graph is constructed to capture the constraint that detection responses with the same time stamp (occurring at the same time) should have different labels. Given all the detection responses, they define a graph where nodes represent detection responses. These nodes are not fully connected; instead, each node (one detection) is connected only to the nodes (other detections) occurring at the same time as the node concerned. The weight of each edge between two nodes is uniform, w_ij = 1/n, where n is the number of detection responses occurring at the same time as these two nodes. After constructing this graph, the Laplacian matrix L of this exclusion graph is computed, and the label error regarding exclusion is maximized as

arg max_{Y∈S} Tr(YLY),   (18)

where Y = [y_1, ..., y_{|V|}] is the label assignment of all the |V| nodes in the graph, S is the set of all row-stochastic matrices of size |V| × |V| and Tr(·) is the trace norm of a matrix.

A topic model based on the Dirichlet Process Mixture Model (DPMM) is employed for multi-object tracking by Luo et al. (2015). To model the detection-level exclusion, so-called cannot-links are introduced to indicate that if two tracklets overlap in their time span, then they cannot be assigned to one cluster, i.e., trajectory.

4.4.2 Trajectory-level exclusion modeling

A term modeling exclusion is accounted for in an energy function by Andriyenko and Schindler (2011) as

C_exc ∝ Σ_{t=1}^{N} Σ_{i≠j} 1/‖p_i^t − p_j^t‖^2.   (19)

The first summation accounts for all the N frames in the sequence, and the second summation accounts for all the detection responses in the same frame; p_i^t is the ground-plane coordinate of object i at time t. If two detection responses are too close, this leads to a considerably large cost in the energy function.
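A direct transcription of this penalty is short; the sketch below sums the inverse squared pairwise distances per frame, as in Eq. 19. The function name and the small epsilon guarding against division by zero are implementation assumptions.

```python
import numpy as np

def exclusion_cost(positions_per_frame, eps=1e-6):
    """Exclusion penalty in the spirit of Eq. 19: sums 1 / ||p_i^t - p_j^t||^2
    over all ordered pairs of objects in every frame. `positions_per_frame`
    is a list of (num_objects x 2) arrays of ground-plane coordinates."""
    cost = 0.0
    for pts in positions_per_frame:
        for i in range(len(pts)):
            for j in range(len(pts)):
                if i != j:
                    cost += 1.0 / (np.sum((pts[i] - pts[j]) ** 2) + eps)
    return cost
```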
To model the trajectory-level exclusion, Milan et al. (2013) penalize the case that two close trajectories T_i and T_j have different labels. The penalty is proportional to the spatio-temporal overlap between T_i and T_j: the closer the two trajectories, the higher the penalty.

Similarly, mutual exclusion is modeled as an additional cost term to penalize the case that two trajectories are very close to each other. The cost is inversely proportional to the minimum distance between the trajectories within their temporal overlap (Andriyenko et al. 2012). By doing so, one of the trajectories would be abandoned to avoid the collision.

Exclusion is also modeled as an extra constraint in the objective function of network flow (Butt and Collins 2013). Denote the detection observations at frame k as O_k = {o_1^k, ..., o_{M_k}^k}. Given detection responses in two consecutive frames O_k and O_{k+1}, one detection from O_k and one detection from O_{k+1} can form a match. Based on all matches between these two frames, a graph G = (V, E) is constructed, where each node in G is a pair of detections and each edge in E represents flow in the graph, where flow 1 means linkable and 0 means not. Conflicting edges are represented as E_conflict. Recalling the constraint that one detection should be occupied by no more than one trajectory, the flow through edges in E_conflict is constrained to be at most 1.

4.5 Occlusion Handling

Occlusion is a fatal issue in multiple object tracking. It could lead to ID switches or fragmentation of trajectories.
In order to handle occlusion, various kinds of strategies have been proposed. We here summarize the popular ones in the following.

Part-to-whole. This strategy is the most popular one for occlusion handling. It is built on the assumption that part of the object is still visible when occlusion happens. This assumption holds in most cases because even a complete occlusion still begins as a partial occlusion. Based on this assumption, this strategy observes and utilizes the visible part to infer the state of the whole object.

Hu et al. (2012) propose a block-division model to deal with occlusion along with the task of recovering the occlusion relationships among objects. In this model, the object is divided into multiple non-overlapping blocks, and for each block an appearance model based on subspace learning is constructed. The likelihood is computed according to the reconstruction error in the subspace corresponding to each block. This model benefits the tracking problem under occlusion in two aspects. Firstly, spatial information is considered, as the likelihood of an observation is the product of the likelihoods of all its blocks. Secondly, an occlusion map can be obtained according to the reconstruction errors of all blocks, and this occlusion map can be utilized to reason about the occlusion relationships among objects. The relationship can be further utilized to selectively update the appearance model.

A part-based appearance model is learned to discriminate an object from other objects around it and from the background (Yang and Nevatia 2012c). To explicitly deal with occlusion, the object is represented as 15 parts. Given a tracklet T_k, its appearance model consists of a set of features {f^1, f^2, ..., f^{n_k}} and the corresponding weights {w^1, w^2, ..., w^{n_k}} from its head observation and its tail observation. The link probability considering appearance between two tracklets is calculated as the similarity between the tail of one tracklet and the head of the other tracklet, which is Σ_i w^i F(f_j^i, f_k^i), where f_j^i and f_k^i are from the ith part of the two tracklets and F(·, ·) is a similarity evaluation function. The meaning of the part model is that, once a part is found to be occluded, all the features from that part are assumed to be invalid. Learning of the appearance model is conducted via a boosting algorithm.

A part-based model is also applied by Izadinia et al. (2012) as a multi-person multi-part tracker. Beginning from a state-of-the-art part-based human detector (Felzenszwalb et al. 2010), they track the whole human body and individual body parts, and the final trajectory estimation is obtained by jointly considering the association between the whole human body and the individual human body parts. Figure 9 shows how the part-based model handles occlusion. The pedestrian is occluded from frame 47 to frame 134. During this period, the whole-body human detector would be confused. However, thanks to the part detector, the visible parts are detected. Based on these parts, trajectories of the visible parts are estimated. Furthermore, along with the trajectory of the whole body, the complete trajectory is recovered. A similar part-based model for occlusion handling is also proposed by Shu et al. (2012).

In general, tracking based on appearance information may fail when occlusion happens. Analogous to the part-based model, which can still observe some parts in case of occlusion, feature point clustering based tracking, which assumes that feature points with similar motion should belong to the same object, is also applicable to address occlusion. As long as some parts of an object are visible, the clustering of feature point trajectories will work. There are some examples (Sugimura et al. 2009; Brostow and Cipolla 2006; Fragkiadaki et al. 2012).

Hypothesize-and-test. This strategy sidesteps challenges from occlusion by hypothesizing proposals and testing the proposals according to the observations at hand.

An Explicit Occlusion Model (EOM) is proposed by Zhang et al. (2008) and integrated into the cost-flow framework to better handle long-term occlusion in data association for MOT. In classical data association, two tracklets are assumed to be linkable only when the temporal gap between them is small. This would somehow result in fragments in the final association estimation. Increasing the temporal gap threshold to make more tracklets linkable could ease the situation to some extent, but would possibly yield more erroneous associations. To deal with this, occlusion hypotheses are generated based on occlusion constraints: if the distance and the scale difference of two observations are small enough, then they are occludable. Assuming o_i is occluded by o_j, a corresponding occlusion hypothesis õ_i^j = (p_j, s_i, f_i, t_j) is generated, where p_j and t_j are the position and time stamp of o_j, and s_i and f_i are the size and appearance feature of o_i. Along with the original observations (tracklets), all the observations are given as input to the cost-flow framework and MAP estimation is conducted to obtain the optimal solution.

The model adopted by Tang et al. (2013) and Tang et al. (2014) is also of a hypothesize-and-test fashion to handle occlusion. Different from a traditional detector, which treats occlusion as a distraction, occlusion is employed to help detection by observing that occlusion yields typical appearance patterns. Specifically, a double-person detector is built to be aware of different levels of occlusion between two people. They train the double-person detector based on instances generated by synthetically combining two objects with different levels of occlusion, thus the resulting detector can be occlusion aware.
Fig. 9: An image illustrating how the part-based model deals with occlusion (Izadinia et al. 2012)
Fig. 10: Training samples for the double-person detector (Tang et al. 2014). From left to right, the level of occlusion
increases
Figure 10 shows exemplar training instances for the double-person detector. Along with the traditional single-person detector, this multi-person detector could be employed as the basis of multiple object tracking.

Buffer-and-recover. This strategy buffers observations when occlusion happens and remembers the states of objects before occlusion. When occlusion ends, object states are recovered based on the buffered observations and the stored states before occlusion.

Mitzel et al. (2010) combine a level-set tracker based on image segmentation and a high-level tracker based on detection for MOT. In their approach, the high-level tracker is employed to initialize new tracks from detection responses and the level-set tracker is used to tackle the frame-to-frame data association. When occlusion occurs, the level-set tracker would fail. To tackle this, the high-level tracker keeps a trajectory alive for up to 15 frames when occlusion happens, and extrapolates the position to grow the dormant trajectory through the occlusion. In case the object reappears, the track is fired again and the identity is maintained. A similar idea is also used by Mitzel and Leibe (2011).

Ryoo and Aggarwal (2008) propose an "observe-and-explain" strategy to handle inter-object occlusion and scene-object occlusion. Their strategy could save computation cost, as an observation mode is activated when the state of tracking is not clear due to occlusion. When they obtain enough observations, explanations are generated to correspond to the observations. This could also be treated as a "buffer-and-recover" strategy.

Others. The strategies described above do not cover all the tactics in the community. On one hand, in practice there exist methods which address occlusion based on the overlap between detection bounding boxes; this is simple but works in some cases. On the other hand, the three types of strategies are not strictly parallel. Sometimes they are combined and used simultaneously.

4.6 Probabilistic Inference

Approaches based on the probabilistic inference framework typically represent the states of objects (such as size, position and velocity) with probability distributions. The goal of the algorithm is to estimate the probability distribution of the target state by a variety of probabilistic reasoning methods based on the existing observations. Through the probability distribution, object states can be estimated. As this kind of approach requires only the existing observations, it is especially appropriate for the task of online tracking. As only the existing observations are employed for estimation, it is natural to impose the Markov property on the object state sequence. This assumption includes two aspects; let us illustrate them by recalling the formulation in Section 3.1.
First, the current object state only depends on the previous states. Further, it only depends on the very last state if the first-order Markov property is imposed. Formally, P(S_t|S_{1:t−1}) = P(S_t|S_{t−1}).

Second, the observation of an object is only related to the state corresponding to that observation. Formally, P(O_{1:t}|S_{1:t}) = ∏_{i=1}^{t} P(O_i|S_i).

Based on these two aspects, the estimation in Eq. 1 can be conducted by the two steps of predict and update:

Predict: P(S_t|O_{1:t−1}) = ∫ P(S_t|S_{t−1}) P(S_{t−1}|O_{1:t−1}) dS_{t−1},
Update: P(S_t|O_{1:t}) ∝ P(O_t|S_t) P(S_t|O_{1:t−1}).

In the formulas above, P(S_t|S_{t−1}) and P(O_t|S_t) are the dynamic model and the observation model, respectively. The dynamic model corresponds to the tracking strategy, while the observation model provides observation measurements concerning the object states. The predict step estimates the current state based on all the previous observations; more specifically, the posterior probability distribution of the current state is estimated by integrating over the space of the last object state via the dynamic model. The update step updates the posterior probability distribution of the object states based on the observation model.
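A minimal sketch of this predict/update recursion for a single target is given below, using a constant-velocity Kalman filter; the state layout [x, y, vx, vy] and all noise magnitudes are assumptions made for illustration, not values taken from the survey.

```python
import numpy as np

dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)   # dynamic model P(S_t | S_{t-1})
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # observation model P(O_t | S_t)
Q = 0.01 * np.eye(4)                        # process noise covariance (assumed)
R = 1.0 * np.eye(2)                         # measurement noise covariance (assumed)

def predict(x, P):
    """Predict step: propagate the state distribution with the dynamic model."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Update step: correct the prediction with the detection z = [u, v]."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

x, P = np.zeros(4), np.eye(4)
for z in [np.array([1.0, 0.5]), np.array([2.1, 1.1]), np.array([3.0, 1.4])]:
    x, P = predict(x, P)
    x, P = update(x, P, z)
```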
According to these formulas, the states of objects can be estimated by iteratively conducting the predict and update steps. However, in practice the object state distribution cannot be represented by a simple distribution such as a multivariate normal distribution, thus there is no analytical solution to the integral. Additionally, for multiple objects the dimension of the set of states is very large, which makes the integration and the derivation of an approximate solution more difficult.

Various kinds of probabilistic inference models have been applied to multi-object tracking (Kratz and Nishino 2010; Fortmann et al. 1983; Giebel et al. 2004), such as the Kalman filter (Rodriguez et al. 2011; Reid 1979), the extended Kalman filter (Mitzel and Leibe 2011) and the particle filter (Jin and Mokhtarian 2007; Yang et al. 2005; Hess and Fern 2009; Han et al. 2007; Hu et al. 2012; Liu et al. 2012; Breitenstein et al. 2009; Yang et al. 2009b).

Kalman filter. In the case of a linear system with Gaussian-distributed object states, the Kalman filter is proved to be the optimal estimator. It has been applied in, e.g., (Rodriguez et al. 2011; Reid 1979).

Extended Kalman filter. For the nonlinear case, the extended Kalman filter is a solution. It approximates the nonlinear system by a Taylor expansion (Mitzel and Leibe 2011).

Particle filter. Monte Carlo sampling based models have become popular in tracking, especially after the introduction of the particle filter (Jin and Mokhtarian 2007; Yang et al. 2005; Hess and Fern 2009; Han et al. 2007; Hu et al. 2012; Liu et al. 2012; Breitenstein et al. 2009; Yang et al. 2009b; Khan et al. 2004). Typically, the strategy of Maximum A Posteriori (MAP) estimation (Liu et al. 2012; Breitenstein et al. 2009; Yang et al. 2009b; Mitzel and Leibe 2011; Rodriguez et al. 2011; Kratz and Nishino 2010; Reid 1979) is adopted to derive the state with the maximum probability.
4.7 Deterministic Optimization

Approaches based on the deterministic optimization framework first obtain observations from each frame of the image sequence. Defining a similarity among observations, the tracking problem is cast as a specific optimization problem; by seeking the optimal solution of this optimization problem, the final estimation of the tracking results is obtained. Approaches within this framework are suitable for the task of offline tracking because observations from all the frames, or at least a time window, are required to be available in advance. Given observations (usually detection hypotheses) from all the frames, these methods endeavor to globally associate observations belonging to an identical object into a trajectory. The key issue is how to seek the optimal association. Usually, the global optimum over all trajectories is formulated as a particular type of optimization problem. Some popular and well-studied approaches are detailed in the following.

Bipartite Graph Matching. By modeling the MOT problem as bipartite graph matching, the two disjoint sets of graph nodes could be existing trajectories and new detections in sequential tracking, or two sets of tracklets in batch tracking. Weights between nodes are modeled as affinities between trajectories and detections. Then a greedy bipartite assignment algorithm (Shu et al. 2012; Breitenstein et al. 2009; Wu and Nevatia 2007) or the Hungarian algorithm (Qin and Shelton 2012; Reilly et al. 2010; Perera et al. 2006; Xing et al. 2009; Huang et al. 2008) is employed to derive the matching between nodes in the two sets.
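As a concrete illustration of the Hungarian-algorithm variant, the sketch below assigns detections to trajectories from a toy affinity matrix; the affinity values and the gating threshold are assumptions for the example, not values prescribed by any of the cited methods.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy affinity matrix between 3 existing trajectories (rows) and
# 4 new detections (columns); numbers are made up for illustration.
affinity = np.array([[0.9, 0.1, 0.2, 0.0],
                     [0.2, 0.8, 0.1, 0.1],
                     [0.1, 0.3, 0.7, 0.2]])

# The Hungarian algorithm minimizes cost, so negate the affinities.
rows, cols = linear_sum_assignment(-affinity)

# Keep only assignments whose affinity clears a gating threshold;
# unmatched detections would typically start new trajectories.
matches = [(r, c) for r, c in zip(rows, cols) if affinity[r, c] > 0.5]
```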
Dynamic Programming. Extended dynamic programming (Wolf et al. 1989), linear programming (Jiang et al. 2007; Berclaz et al. 2009; Andriyenko and Schindler 2010), quadratic Boolean programming (Leibe et al. 2007), K-shortest paths (Berclaz et al. 2011; Choi and Savarese 2012) and set cover (Wu et al. 2011) are adopted to solve the association problem among detections or tracklets.

Min-cost Max-flow Network Flow. A network flow, known from transportation networks, is a directed graph where each edge has a capacity. In the MOT problem, the nodes in the flow network are usually
low-level observations, which could be detection responses or tracklets. Usually the flow is modeled as an indicator to link two nodes (flow is 1) or not (flow is 0). To meet the flow balance requirement, a source node corresponding to the start of a trajectory and a sink node corresponding to the end of a trajectory are added to the original graph (see Figure 11). One trajectory corresponds to one flow path in the graph. The flow transported from the source node to the sink node equals the number of trajectories, or objects, in the video, and the cost of transporting the flow from the source node to the sink node is the negative log-likelihood of all the association hypotheses. Some examples (Zhang et al. 2008; Choi and Savarese 2012; Wu et al. 2012; Butt and Collins 2013; Pirsiavash et al. 2011) adopt this formulation for the MOT problem.
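The following sketch builds a tiny unit-capacity flow network for two frames with two detections each, in the spirit of the formulation above; the use of networkx, the split of every detection into in/out nodes, the fixed number of two trajectories and the integer costs (standing in for scaled negative log-likelihoods) are all illustrative assumptions, not the exact construction of the cited works.

```python
import networkx as nx

G = nx.DiGraph()
G.add_node("S", demand=-2)   # push 2 units of flow = 2 trajectories
G.add_node("T", demand=2)

detections = ["a1", "a2", "b1", "b2"]   # frame A: a1, a2; frame B: b1, b2
for d in detections:
    # Split each detection into in/out nodes so one detection carries at most one trajectory.
    G.add_edge(d + "_in", d + "_out", capacity=1, weight=-5)  # reward for explaining the detection
    G.add_edge("S", d + "_in", capacity=1, weight=0)          # a trajectory may start here
    G.add_edge(d + "_out", "T", capacity=1, weight=0)         # a trajectory may end here

# Transition costs between frames (lower = more likely the same object).
for u, v, c in [("a1", "b1", 1), ("a1", "b2", 8), ("a2", "b1", 8), ("a2", "b2", 1)]:
    G.add_edge(u + "_out", v + "_in", capacity=1, weight=c)

flow = nx.min_cost_flow(G)   # flow[u][v] == 1 marks edges on the selected trajectories
```

In this toy instance the optimal flow links a1 with b1 and a2 with b2, recovering the two intended trajectories.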
Conditional Random Field. The Conditional Random Field (CRF) model is adopted to handle the multiple object tracking problem (Yang and Nevatia 2012b; Yang et al. 2011; Milan et al. 2013). Defining a graph G = (V, E), where V is the set of vertices and E is the set of edges between vertices, low-level tracklets are given as input to the graph. Each node in the graph is defined as a pair of tracklets, and a label is predicted to indicate whether this pair of tracklets can be linked (label is 1) or not (label is 0). These labels compose the label map, which corresponds to the optimal association of the tracklets for the MOT problem.

MWIS. The maximum-weight independent set (MWIS) is the heaviest subset of non-adjacent nodes of an attributed graph. Concerning the MOT problem, nodes in the attributed graph represent pairs of tracklets in consecutive frames, the weight of a node represents the affinity of the pair of tracklets, and an edge connects two nodes if the corresponding pairs share the same detection. Given this graph, the data association problem is modeled as the MWIS problem (Shafique et al. 2008; Brendel et al. 2011).

In practice, deterministic optimization usually outperforms approaches based on probabilistic inference, especially in the case of occlusion among objects. With the help of global information, occlusion is addressed better than by its counterparts. However, global information simultaneously results in a higher consumption of time and space for the optimization. Additionally, the requirement of access to all frames in advance limits the application of these methods in online tracking scenarios.

5 MOT Evaluation

Given a developed MOT algorithm, metrics and data sets are required to evaluate its performance. At the same time, one also needs to compare with appropriate public algorithms to verify the developed algorithm. In the following, we list metrics, publicly available data sets, codes and benchmark results.

5.1 Metrics

Evaluation metrics for MOT approaches are crucial as they provide a standard for fair quantitative comparison. A brief review of different MOT evaluation metrics is presented in this section. As many approaches to MOT employ the tracking-by-detection strategy, they often measure detection performance as well as tracking performance. Metrics for object detection are therefore also employed in MOT approaches. Based on this, MOT metrics can be largely categorized into two sets, evaluating detection and tracking respectively, as listed in Table 6.

5.1.1 Metrics for detection

We further group metrics for detection into two subsets: one set measures accuracy, and the other measures precision.

Accuracy. The commonly used Recall and Precision metrics, as well as the average False Alarms per Frame (FAF) rate, are employed as MOT metrics (Yang et al. 2011). Choi and Savarese (2010) use the False Positives Per Image (FPPI) to evaluate detection performance in MOT. A comprehensive metric called Multiple Object Detection Accuracy (MODA), which considers the relative number of false positives and missed detections, is utilized by Kasturi et al. (2009).

Precision. The Multiple Object Detection Precision (MODP) metric measures the quality of alignment between true detections and the ground truth (Kasturi et al. 2009).

5.1.2 Metrics for tracking

Metrics for tracking are classified into four subsets by different attributes, as follows.

Accuracy. This kind of metric measures how accurately an algorithm can track the target objects. The ID switches (IDS) metric (Yamaguchi et al. 2011) counts how many times a MOT algorithm switches to wrong objects. The Multiple Object Tracking Accuracy (MOTA) metric (Keni and Rainer 2008) combines the false positive rate, the false negative rate and the mismatch rate for MOT.
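A compact way to see how these error types combine is the CLEAR MOT accuracy score; the helper below and its example counts are purely illustrative and assume the error counts have already been accumulated over the whole sequence.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """CLEAR MOT accuracy: 1 - (FN + FP + IDSW) / total ground-truth objects,
    with all quantities summed over frames (Keni and Rainer 2008)."""
    return 1.0 - (false_negatives + false_positives + id_switches) / float(num_ground_truth)

# Example: 120 misses, 80 false alarms and 15 identity switches over a
# sequence containing 2,000 ground-truth object instances in total.
score = mota(120, 80, 15, 2000)   # -> 0.8925
```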
Precision. Two metrics, Multiple Object Tracking Precision (MOTP) (Keni and Rainer 2008) and Tracking Distance Error (TDE) (Kratz and Nishino 2010), belong to this subset. They describe how precisely the objects are tracked in terms of overlap and distance.
Fig. 11: An example of the cost-flow network with 3 timesteps and 9 observations (Zhang et al. 2008)
Completeness. Metrics for completeness indicate how completely the ground truth trajectories are tracked. The metrics Mostly Tracked (MT), Partly Tracked (PT), Mostly Lost (ML) and Fragmentation (FM) (Li et al. 2009) can be grouped into this set.

Robustness. To assess the ability of a MOT algorithm to recover from occlusion, the metrics Recover from Short-term occlusion (RS) and Recover from Long-term occlusion (RL) are considered (Song et al. 2010).

5.2 Data Sets

To compare with various state-of-the-art MOT methods, publicly available data sets are employed to evaluate the proposed methods in individual publications. We summarize the popular data sets in the literature to give a clear view in Table 7.

5.3 Public Algorithms

We examine the literature and, in Table 8, list algorithms for which the associated source code is publicly available, to make further comparisons convenient.

5.4 Benchmark Results

We list public results on the data sets mentioned above to obtain a direct comparison among different approaches and to provide convenience for future comparison in Table 9 to Table 26. Please note that this kind of direct comparison among different approaches on the same data set may not be fair. We list some issues which may result in unfairness in the case of direct comparison:

– Different methodologies. For example, some publications belong to offline methods while others belong to online ones. Due to the difference described in Section 3.2.2, it is unfair to directly compare them.
– Different detection hypotheses. Different approaches adopt various detectors to obtain detection hypotheses. One approach based on different detection hypotheses would output different results, let alone different approaches.
– Some approaches utilize observations from multiple views while other approaches adopt information from a single view. This makes the comparison between them difficult.
– Prior information, such as the scene structure and the number of pedestrians, is employed by some approaches. Direct comparison between these approaches and others is not so convincing.

In order to make a direct and fair comparison, one needs to fix all the other components while varying the concerned component. For instance, adopting different data association models while keeping all other parts the same allows directly comparing the performance of different data association methods. This is what another survey (Leal-Taixé et al. 2015), specifically focusing on the evaluation of multiple object tracking, provides. For intensive experimental comparison among different MOT
Table 6: An overview of evaluation metrics for MOT. The up arrow (resp. down arrow) indicates that the perfor-
mance is better if the quantity is greater (resp. smaller)
solutions, one may refer to (Leal-Taixé et al. 2015). In spite of the issues mentioned above, it is still worthwhile to list all the public results on the same data sets, for the following reasons.

– By listing the public results, it at least provides an intuitive comparison among different methods on the same data set, and convenience for future comparison if one evaluates a newly developed MOT algorithm.
– Although comparison among individual methods may not be fair, an intuitive comparison between different types of methods, such as that between offline methods and online methods, can tell us how these types of methods behave on public data sets.
– Additionally, we can observe how research on MOT has progressed over the years by comparing the performance of methods from different years.

We list public results on the popular data sets, specifically, PETS2009S2L1 (Table 9), PETS2009S2L2 (Table 10), PETS2009S2L3 (Table 11), PETS2009S1L1-2 (Table 12), PETS2009S1L1-1 (Table 13), PETS2009 S3 MF1 (Table 14), CAVIAR (Table 15), Airport (Table 16), Town Center (Table 17), i-LIDS (Table 18), TRECVID2008 (Table 19), ETH Central (Table 20), Hockey (Table 21), Parking Lot (Table 22), TUD Crossing (Table 23), TUD Stadtmitte (Table 24), ETHMS (Table 25) and TUD Campus (Table 26). For each data set, we report the results in terms of the MOTA, MOTP, IDS, Precision, Recall, MT, PT, ML, FM and F1 metrics. At the same time, we organize the results in two groups, which correspond to offline and online methods respectively. Please note that: 1) in each table, the top rows and bottom rows (separated by a line) are the results of offline and online methods, respectively; 2) in some tables there is no separation of rows, which means all the methods in the table belong to offline methods; 3) cells in the tables may be empty, which means that we did not find the corresponding value either in the original publication or in other publications which cite it; 4) in some cases, there could be different results for a single publication (for example, the result from the original publication versus the result from another publication which compares with it). This might be because different configurations were adopted (e.g. different detection hypotheses). In such cases, we quote either the most popularly cited or the latest results.
Table 7: An overview of publicly available data sets. The tick means ground truth is available while the cross
means not available
Data set             Multi-view   Ground truth   Web link
PETS 2009            ✓            ✓              www.cvg.rdg.ac.uk/PETS2009/a.html
PETS 2006            ✓            ✗              www.cvg.rdg.ac.uk/PETS2006/data.html
PETS 2007            ✓            ✓              www.pets2007.net/
CAVIAR               ✓            ✓              http://groups.inf.ed.ac.uk/vision/CAVIAR/CAVIARDATA1/
Trecvid 2008         ✓            ✗              www-nlpir.nist.gov/projects/tv2008/
TUD                  ✗            ✓              www.d2.mpi-inf.mpg.de/datasets
Caltech Pedestrian   ✗            ✓              www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/
UBC Hockey           ✗            ✗              www.cs.ubc.ca/~okumak/research.html
Lids AVSS 2007       ✗            ✓              www.eecs.qmul.ac.uk/~andrea/avss2007_d.html
ETH pedestrian       ✓            ✓              www.vision.ee.ethz.ch/~aess/dataset/
ETHZ Central         ✗            ✓              www.vision.ee.ethz.ch/datasets/
Town Centre          ✗            ✓              www.robots.ox.ac.uk/ActiveVision/Research/Projects/2009bbenfold_headpose/project.html#datasets
Zara                 ✗            ✗              https://graphics.cs.ucy.ac.cy/research/downloads/crowd-data
UCSD                 ✗            ✗              http://www.svcl.ucsd.edu/projects/anomaly/dataset.htm
UCF Crowds           ✗            ✗              www.crcv.ucf.edu/data/crowd.php
We conduct an analysis of the benchmark results on the Town Center and CAVIAR data sets to investigate the comparison between offline methods and online methods. The reason we choose these two data sets is that they are popularly utilized for evaluation. We average the values of each metric across each type of method, and report the mean value and the standard deviation in Table 27 and Table 28. To make the comparison more intuitive, we additionally show the comparative results in the form of bar charts. For each data set, IDS and FM are shown separately as their value range is different from that of the other metrics.

Some interesting observations can be made based on this analysis. As shown in Figure 12 and Figure 13, offline methods generally outperform online ones regarding most of the metrics. The same conclusion can also be drawn from Figure 14 and Figure 15. This coincides with the fact that offline methods employ global temporal information for the estimation.

Additionally, we conduct an analysis of the year-wise results on the PETS2009S2L1 data set. To be specific, we calculate the metric values of methods published in years ranging from 2009 to 2015, and report the mean and standard deviation values in Table 29. These results are also represented in Figure 16 and Figure 17. In general, the performance improves across the years. We suspect that contributing factors such as better models and progress in object detection could explain the achieved progress.
6 Conclusion and Future Directions

This paper has presented a comprehensive review of Multiple Object Tracking (MOT). The review has illustrated the current state, the key issues and the evaluation of this topic by classification, discussion and case study. Although great progress in MOT has recently been achieved, there are still issues remaining to be tackled.

– MOT with video adaptation. As aforementioned, the majority of current MOT methods require an offline trained object detector. This raises the problem that the detection result for a specific video is not optimal, since the object detector is not trained for the given video, which decreases the performance of multiple object tracking. Customization of the object detector is therefore necessary to improve MOT performance. One solution is proposed by Shu et al. (2013), which adapts a generic pedestrian detector to a specific video by progressively refining the generic pedestrian detector. This is an important direction for improving the pre-processing stage of MOT.
Fig. 12: Comparison between offline methods and online methods on the Town Center data set in terms of MOTA, MOTP, Precision, Recall, MT, ML and F1 metrics
Fig. 13: Comparison between offline methods and online methods on the Town Center data set in terms of IDS and FM metrics
Fig. 14: Comparison between offline methods and online methods on the CAVIAR data set in terms of MOTA, MOTP, Precision, Recall, MT, ML and F1 metrics
Fig. 15: Comparison between offline methods and online methods on the CAVIAR data set in terms of IDS and FM metrics
Ben Shitrit H, Berclaz J, Fleuret F, Fua P (2011) Tracking multiple people under global appearance constraints. In: Proc. IEEE Int. Conf. Comput. Vis., pp 137–144
Benfold B, Reid I (2011) Stable multi-target tracking in real-time surveillance video. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 3457–3464
Berclaz J, Fleuret F, Fua P (2006) Robust people tracking with global trajectory optimization. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 744–750
Berclaz J, Fleuret F, Fua P (2009) Multiple object tracking using flow linear programming. In: Proc. IEEE Int. Workshop Perform. Eval. Track. Surveillance, pp 1–8
Berclaz J, Fleuret F, Turetken E, Fua P (2011) Multiple object tracking using k-shortest paths optimization. IEEE Trans Pattern Anal Mach Intel 33(9):1806–1819
Betke M, Haritaoglu E, Davis LS (2000) Real-time multiple vehicle detection and tracking from a moving vehicle. Mach Vis Appl 12(2):69–83
Betke M, Hirsh DE, Bagchi A, Hristov NI, Makris NC, Kunz TH (2007) Tracking large variable numbers of objects in clutter. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1–8
Bibby C, Reid I (2008) Robust real-time visual tracking using pixel-wise posteriors. In: Proc. Eur. Conf. Comput. Vis., pp 831–844
Bose B, Wang X, Grimson E (2007) Multi-class object tracking algorithm that handles fragmentation and grouping. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1–8
Breitenstein MD, Reichlin F, Leibe B, Koller-Meier E, Van Gool L (2009) Robust tracking-by-detection using a detector confidence particle filter. In: Proc. IEEE Int. Conf. Comput. Vis., pp 1515–1522
Breitenstein MD, Reichlin F, Leibe B, Koller-Meier E, Van Gool L (2011) Online multiperson tracking-by-detection from a single, uncalibrated camera. IEEE Trans Pattern Anal Mach Intel 33(9):1820–1833
Brendel W, Amer M, Todorovic S (2011) Multiobject tracking as maximum weight independent set. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1273–1280
Brostow G, Cipolla R (2006) Unsupervised bayesian detection of independent motion in crowds. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 594–601
Butt A, Collins R (2013) Multi-target tracking by lagrangian relaxation to min-cost network flow. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1846–1853
Candamo J, Shreve M, Goldgof DB, Sapper DB, Kasturi R (2010) Understanding transit scenes: A survey on human behavior-recognition algorithms. IEEE Trans Intell Transp Syst 11(1):206–224
Cannons K (1991) A review of visual tracking. Tech. Rep. CSE-2008-07, Dept. Comput. Sci. Eng., York Univ.
Caruana R (1997) Multitask learning. Mach Learn 28(1):41–75
Chang TH, Gong S, Ong EJ (2000) Tracking multiple people under occlusion using multiple cameras. In: Proc. Brit. Mach. Vis. Conf., pp 1–10
Chen X, Qin Z, An L, Bhanu B (2014) An online learned elementary grouping model for multi-target tracking. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1242–1249
Choi W, Savarese S (2010) Multiple target tracking in world coordinate with single, minimally calibrated camera. In: Proc. Eur. Conf. Comput. Vis., pp 553–567
Choi W, Savarese S (2012) A unified framework for multi-target tracking and collective activity recognition. In: Proc. Eur. Conf. Comput. Vis., pp 215–230
Choi W, Pantofaru C, Savarese S (2013) A general framework for tracking multiple people from a moving camera. IEEE Trans Pattern Anal Mach Intel 35(7):1577–1591
Conte D, Foggia P, Percannella G, Vento M (2010) Performance evaluation of a people tracking system on pets2009
database. In: Proc. IEEE Int. Conf. Advanced Video Signal-Based Surveillance, pp 119–126
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 886–893
Dehghan A, Tian Y, Torr PH, Shah M (2015) Target identity-aware network flow for online multiple target tracking. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1146–1154
Dicle C, Sznaier M, Camps O (2013) The way they move: Tracking multiple targets with similar appearance. In: Proc. IEEE Int. Conf. Comput. Vis., pp 2304–2311
Ess A, Leibe B, Van Gool L (2007) Depth and appearance for mobile scene analysis. In: Proc. IEEE Int. Conf. Comput. Vis., pp 1–8
Ess A, Leibe B, Schindler K, Van Gool L (2008) A mobile vision system for robust multi-person tracking. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1–8
Ess A, Leibe B, Schindler K, Van Gool L (2009) Robust multiperson tracking from a mobile platform. IEEE Trans Pattern Anal Mach Intel 31(10):1831–1846
Evgeniou T, Pontil M (2004) Regularized multi-task learning. In: Proc. ACM Int. Conf. Knowl. Discov. Data Min, pp 109–117
Felzenszwalb P, Girshick R, McAllester D, Ramanan D (2010) Object detection with discriminatively trained part-based models. IEEE Trans Pattern Anal Mach Intel 32(9):1627–1645
Felzenszwalb PF, Huttenlocher DP (2006) Efficient belief propagation for early vis. Int J Comput Vis 70(1):41–54
Fleuret F, Berclaz J, Lengagne R, Fua P (2008) Multicamera people tracking with a probabilistic occupancy map. IEEE Trans Pattern Anal Mach Intel 30(2):267–282
Fontaine E, Barr AH, Burdick JW (2007) Model-based tracking of multiple worms and fish. In: Proc. IEEE Int. Conf. Comput. Vis. Workshops, pp 1–13
Forsyth DA, Arikan O, Ikemoto L, O'Brien J, Ramanan D, et al. (2006) Computational studies of human motion: Part 1, tracking and motion synthesis. Found Trends Comput Graph Vis 1(2-3):77–254
Fortmann TE, Bar-Shalom Y, Scheffe M (1983) Sonar tracking of multiple targets using joint probabilistic data association. IEEE J Ocean Eng 8(3):173–184
Fragkiadaki K, Zhang W, Zhang G, Shi J (2012) Two-granularity tracking: Mediating trajectory and detection graphs for tracking under occlusions. In: Proc. Eur. Conf. Comput. Vis., pp 552–565
Gammeter S, Ess A, Jäggli T, Schindler K, Leibe B, Van Gool L (2008) Articulated multi-body tracking under egomotion. In: Proc. Eur. Conf. Comput. Vis., pp 816–830
Gavrila DM, Munder S (2007) Multi-cue pedestrian detection and tracking from a moving vehicle. Int J Comput Vis 73(1):41–59
Giebel J, Gavrila DM, Schnörr C (2004) A bayesian framework for multi-cue 3d object tracking. In: Proc. Eur. Conf. Comput. Vis., pp 241–252
Han B, Joo SW, Davis LS (2007) Probabilistic fusion tracking using mixture kernel-based bayesian filtering. In: Proc. IEEE Int. Conf. Comput. Vis., pp 1–8
Helbing D, Molnar P (1995) Social force model for pedestrian dynamics. Phys Rev E 51(5):4282–4286
Henriques JF, Caseiro R, Batista J (2011) Globally optimal solution to multi-object tracking with merged measurements. In: Proc. IEEE Int. Conf. Comput. Vis., pp 2470–2477
Hess R, Fern A (2009) Discriminatively trained particle filters for complex multi-object tracking. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 240–247
Table 27: Benchmark result comparison between offline and online methods on the Town Center data set
Method MOTA ↑ MOTP ↑ Precision ↑ Recall ↑ MT ↑ ML ↓ F1 ↑ IDS ↓ FM ↓
Offline 0.700 ± 0.052 0.727 ± 0.032 0.797 ± 0.102 0.723 ± 0.086 0.848 ± 0.014 0.055 ± 0.006 0.757 ± 0.089 45.000 ± 28.717 32.333 ± 9.292
Online 0.681 ± 0.046 0.695 ± 0.012 0.710 ± 0.002 0.641 ± 0.001 0.605 ± 0.059 0.076 ± 0.004 0.673 ± 0.001 262.333 ± 139.848 387.000 ± 93.338
Table 28: Benchmark result comparison between offline and online methods on the CAVIAR data set
Method MOTA ↑ MOTP ↑ Precision ↑ Recall ↑ MT ↑ ML ↓ F1 ↑ IDS ↓ FM ↓
Offline 0.872 ± 0.000 0.763 ± 0.000 0.963 ± 0.014 0.869 ± 0.048 0.860 ± 0.036 0.023 ± 0.020 0.926 ± 0.008 8.222 ± 4.024 20.222 ± 14.864
Online 0.729 ± 0.192 0.872 ± 0.000 0.904 ± 0.121 0.802 ± 0.064 0.780 ± 0.095 0.043 ± 0.027 0.859 ± 0.107 14.200 ± 3.271 31.800 ± 17.908
Hofmann M, Haag M, Rigoll G (2013a) Unified hierarchical multi-object tracking using global data association. In: Proc. IEEE Int. Conf. Advanced Video Signal-Based Surveillance, pp 22–28
Hofmann M, Wolf D, Rigoll G (2013b) Hypergraphs for joint multi-view reconstruction and multi-object tracking. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 3650–3657
Hu M, Ali S, Shah M (2008) Detecting global motion patterns in complex videos. In: Proc. IEEE Int. Conf. Pattern. Recognit., pp 1–5
Hu W, Tan T, Wang L, Maybank S (2004) A survey on visual surveillance of object motion and behaviors. IEEE Trans Syst Man Cybern Part C-Appl Rev 34(3):334–352
Hu W, Li X, Luo W, Zhang X, Maybank S, Zhang Z (2012) Single and multiple object tracking using log-euclidean riemannian subspace and block-division appearance model. IEEE Trans Pattern Anal Mach Intel 34(12):2420–2440
Huang C, Wu B, Nevatia R (2008) Robust object tracking by hierarchical association of detection responses. In: Proc. Eur. Conf. Comput. Vis., pp 788–801
Ishiguro K, Yamada T, Ueda N (2008) Simultaneous clustering and tracking unknown number of objects. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1–8
Izadinia H, Saleemi I, Li W, Shah M (2012) (mp)2t: Multiple people multiple parts tracker. In: Proc. Eur. Conf. Comput. Vis., pp 100–114
Jiang H, Fels S, Little JJ (2007) A linear programming approach for multiple object tracking. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1–8
Jin Y, Mokhtarian F (2007) Variational particle filter for multi-object tracking. In: Proc. IEEE Int. Conf. Comput. Vis., pp 1–8
Kalal Z, Mikolajczyk K, Matas J (2012) Tracking-learning-detection. IEEE Trans Pattern Anal Mach Intel 34(7):1409–1422
Kasturi R, Goldgof D, Soundararajan P, Manohar V, Garofolo J, Bowers R, Boonstra M, Korzhova V, Zhang J (2009) Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol. IEEE Trans Pattern Anal Mach Intel 31(2):319–336
KC AK, De Vleeschouwer C (2013) Discriminative label propagation for multi-object tracking with sporadic appearance features. In: Proc. IEEE Int. Conf. Comput. Vis., pp 2000–2007
Keni B, Rainer S (2008) Evaluating multiple object tracking performance: the clear mot metrics. EURASIP J Image Video Process
Khan Z, Balch T, Dellaert F (2004) An mcmc-based particle filter for tracking multiple interacting targets. In: Proc. Eur. Conf. Comput. Vis., pp 279–290
Khan Z, Balch T, Dellaert F (2005) Mcmc-based particle filtering for tracking a variable number of interacting targets. IEEE Trans Pattern Anal Mach Intel 27(11):1805–1819
Khan Z, Balch T, Dellaert F (2006) Mcmc data association and sparse factorization updating for real time multitarget tracking with merged and multiple measurements. IEEE Trans Pattern Anal Mach Intel 28(12):1960–1972
Kim IS, Choi HS, Yi KM, Choi JY, Kong SG (2010) Intelligent visual surveillance-a survey. Int J Control Autom Syst 8(5):926–939
Koller D, Weber J, Malik J (1994) Robust multiple car tracking with occlusion reasoning. In: Proc. Eur. Conf. Comput. Vis., pp 189–196
Kratz L, Nishino K (2010) Tracking with local spatio-temporal motion patterns in extremely crowded scenes. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 693–700
Kratz L, Nishino K (2012) Tracking pedestrians using local spatio-temporal motion patterns in extremely crowded scenes. IEEE Trans Pattern Anal Mach Intel 34(5):987–1002
Kuo CH, Nevatia R (2011) How does person identity recognition help multi-person tracking? In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1217–1224
Kuo CH, Huang C, Nevatia R (2010) Multi-target tracking by on-line learned discriminative appearance models. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 685–692
Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 2169–2178
Leal-Taixé L, Pons-Moll G, Rosenhahn B (2011) Everybody needs somebody: Modeling social and grouping behavior on a linear programming multiple people tracker. In: Proc. IEEE Int. Conf. Comput. Vis. Workshops, pp 120–127
Leal-Taixé L, Pons-Moll G, Rosenhahn B (2012) Branch-and-price global optimization for multi-view multi-target tracking. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1987–1994
Leal-Taixé L, Milan A, Reid I, Roth S, Schindler K (2015) MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942
Leibe B, Schindler K, Van Gool L (2007) Coupled detection and trajectory estimation for multi-object tracking. In: Proc. IEEE Int. Conf. Comput. Vis., pp 1–8
Leibe B, Schindler K, Cornelis N, Van Gool L (2008) Coupled object detection and tracking from static cameras and moving vehicles. IEEE Trans Pattern Anal Mach Intel 30(10):1683–1698
Li K, Miller ED, Chen M, Kanade T, Weiss LE, Campbell PG (2008) Cell population tracking and lineage construction with spatiotemporal context. Med Image Anal 12(5):546–566
Li X, Hu W, Shen C, Zhang Z, Dick A, Hengel AVD (2013) A survey of appearance models in visual object tracking. ACM Trans Intell Syst Technol 4(4):58
Li Y, Huang C, Nevatia R (2009) Learning to associate: HybridBoosted multi-target tracker for crowded scene. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 2953–2960
Liu Y, Li H, Chen YQ (2012) Automatic tracking of a large number of moving targets in 3d. In: Proc. Eur. Conf. Comput. Vis., pp 730–742
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
Lu WL, Ting JA, Little JJ, Murphy KP (2013) Learning to track and identify players from broadcast sports videos. IEEE Trans Pattern Anal Mach Intel 35(7):1704–1716
Luo W, Kim TK (2013) Generic object crowd tracking by multi-task learning. In: Proc. Brit. Mach. Vis. Conf., pp 73.1–73.13
Luo W, Kim TK, Stenger B, Zhao X, Cipolla R (2014) Bi-label propagation for generic multiple object tracking. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1290–1297
Luo W, Stenger B, Zhao X, Kim TK (2015) Automatic topic discovery for multi-object tracking. In: Proc. AAAI Conf. Artif. Intell., pp 3820–3826
McLaughlin N, Martinez Del Rincon J, Miller P (2013) Online multiperson tracking with occlusion reasoning and unsupervised track motion model. In: Proc. IEEE Int. Workshop Perform. Eva. Track. Surveillance, pp 37–42
Meijering E, Dzyubachyk O, Smal I, van Cappellen WA (2009) Tracking in cell and developmental biology. Semin Cell Dev Biol 20(8):894–902
Milan A, Schindler K, Roth S (2013) Detection- and trajectory-level exclusion in multiple object tracking. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 3682–3689
Milan A, Roth S, Schindler K (2014) Continuous energy minimization for multitarget tracking. IEEE Trans Pattern Anal Mach Intel 36(1):58–72
Mitzel D, Leibe B (2011) Real-time multi-person tracking with detector assisted structure propagation. In: Proc. IEEE Int. Conf. Comput. Vis. Workshops, pp 974–981
Mitzel D, Horbert E, Ess A, Leibe B (2010) Multi-person tracking with sparse detection and continuous segmentation. In: Proc. Eur. Conf. Comput. Vis., pp 397–410
Mordohai P, Medioni G (2010) Dimensionality estimation, manifold learning and function approximation using tensor voting. J Mach Learn Res 11:411–450
Nillius P, Sullivan J, Carlsson S (2006) Multi-target tracking - linking identities using bayesian network inference. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 2187–2194
Okuma K, Taleghani A, De Freitas N, Little JJ, Lowe DG (2004) A boosted particle filter: Multitarget detection and tracking. In: Proc. Eur. Conf. Comput. Vis., pp 28–39
Park Y, Lepetit V, Woo W (2008) Multiple 3d object tracking for augmented reality. In: Proc. IEEE/ACM Int. Symp. Mix. Augment. Real., pp 117–120
Pellegrini S, Ess A, Schindler K, Van Gool L (2009) You’ll never walk alone: Modeling social behavior for multi-target tracking. In: Proc. IEEE Int. Conf. Comput. Vis., pp 261–268
Pellegrini S, Ess A, Van Gool L (2010) Improving data association by joint modeling of pedestrian trajectories and groupings. In: Proc. Eur. Conf. Comput. Vis., pp 452–465
Perera AA, Srinivas C, Hoogs A, Brooksby G, Hu W (2006) Multi-object tracking through simultaneous long occlusions and split-merge conditions. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 666–673
Pérez P, Hue C, Vermaak J, Gangnet M (2002) Color-based probabilistic tracking. In: Proc. Eur. Conf. Comput. Vis., pp 661–675
Pirsiavash H, Ramanan D, Fowlkes CC (2011) Globally-optimal greedy algorithms for tracking a variable number of objects. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1201–1208
Poiesi F, Mazzon R, Cavallaro A (2013) Multi-target tracking on confidence maps: An application to people tracking. Comput Vis Image Underst 117(10):1257–1272
Porikli F, Tuzel O, Meer P (2006) Covariance tracking using model update based on lie algebra. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 728–735
Possegger H, Mauthner T, Roth PM, Bischof H (2014) Occlusion geodesics for online multi-object tracking. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1306–1313
Qin Z, Shelton CR (2012) Improving multi-target tracking via social grouping. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1972–1978
Reid DB (1979) An algorithm for tracking multiple targets. IEEE Trans Autom Control 24(6):843–854
Reilly V, Idrees H, Shah M (2010) Detection and tracking of large number of targets in wide area surveillance. In: Proc. Eur. Conf. Comput. Vis., pp 186–199
Rodriguez M, Ali S, Kanade T (2009) Tracking in unstructured crowded scenes. In: Proc. IEEE Int. Conf. Comput. Vis., pp 1389–1396
Rodriguez M, Sivic J, Laptev I, Audibert JY (2011) Data-driven crowd analysis in videos. In: Proc. IEEE Int. Conf. Comput. Vis., pp 1235–1242
Ryoo MS, Aggarwal JK (2008) Observe-and-explain: A new approach for multiple hypotheses tracking of humans and objects. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1–8
Scovanner P, Tappen MF (2009) Learning pedestrian dynamics from the real world. In: Proc. IEEE Int. Conf. Comput. Vis., pp 381–388
Segal AV, Reid I (2013) Latent data association: Bayesian model selection for multi-target tracking. In: Proc. IEEE Int. Conf. Comput. Vis., pp 2904–2911
Shafique K, Lee MW, Haering N (2008) A rank constrained continuous formulation of multi-frame multi-target tracking problem. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1–8
Shi J, Tomasi C (1994) Good features to track. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 593–600
Shi X, Ling H, Xing J, Hu W (2013) Multi-target tracking by rank-1 tensor approximation. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 2387–2394
Shi X, Ling H, Hu W, Yuan C, Xing J (2014) Multi-target tracking with motion context in tensor power iteration. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 3518–3525
Shu G, Dehghan A, Oreifej O, Hand E, Shah M (2012) Part-based multiple-person tracking with partial occlusion handling. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1815–1821
Shu G, Dehghan A, Shah M (2013) Improving an object detector and extracting regions using superpixels. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 3721–3727
Song B, Jeng TY, Staudt E, Roy-Chowdhury AK (2010) A stochastic graph evolution framework for robust multi-target tracking. In: Proc. Eur. Conf. Comput. Vis., pp 605–619
Spampinato C, Chen-Burger YH, Nadarajan G, Fisher RB (2008) Detecting, tracking and counting fish in low quality unconstrained underwater videos. In: Proc. Int. Conf. Comput. Vis. Theory Appl., pp 514–519
Spampinato C, Palazzo S, Giordano D, Kavasidis I, Lin FP, Lin YT (2012) Covariance based fish tracking in real-life underwater environment. In: Proc. Int. Conf. Comput. Vis. Theory Appl., pp 409–414
Stalder S, Grabner H, Van Gool L (2010) Cascaded confidence filtering for improved tracking-by-detection. In: Proc. Eur. Conf. Comput. Vis., pp 369–382
Sugimura D, Kitani KM, Okabe T, Sato Y, Sugimoto A (2009) Using individuality to track individuals: clustering individual trajectories in crowds using local appearance and frequency trait. In: Proc. IEEE Int. Conf. Comput. Vis., pp 1467–1474
Sun Z, Bebis G, Miller R (2006) On-road vehicle detection: A review. IEEE Trans Pattern Anal Mach Intel 28(5):694–711
Takala V, Pietikainen M (2007) Multi-object tracking using color, texture and motion. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit.
Tang S, Andriluka M, Milan A, Schindler K, Roth S, Schiele B (2013) Learning people detectors for tracking in crowded scenes. In: Proc. IEEE Int. Conf. Comput. Vis., pp 1049–1056
Tang S, Andriluka M, Schiele B (2014) Detection and tracking of occluded people. Int J Comput Vis 110(1):58–69
Tang S, Andres B, Andriluka M, Schiele B (2015) Subgraph decomposition for multi-target tracking. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 5033–5041
Tomasi C, Kanade T (1991) Detection and tracking of point features. Tech. Rep. CMU-CS-91-132, School of Computer Science, Carnegie Mellon Univ.
Tuzel O, Porikli F, Meer P (2006) Region covariance: A fast descriptor for detection and classification. In: Proc. Eur. Conf. Comput. Vis., pp 589–600
Wang X (2013) Intelligent multi-camera video surveillance: A review. Pattern Recognit Lett 34(1):3–19
Wen L, Li W, Yan J, Lei Z, Yi D, Li SZ (2014) Multiple target tracking based on undirected hierarchical relation hypergraph. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1282–1289
Wolf JK, Viterbi AM, Dixon GS (1989) Finding the best set of k paths through a trellis with application to multitarget tracking. IEEE Trans Aerosp Electron Syst 25(2):287–296
Wu B, Nevatia R (2006) Tracking of multiple, partially occluded humans based on static body part detection. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 951–958
Wu B, Nevatia R (2007) Detection and tracking of multiple, partially occluded humans by bayesian combination of edgelet based part detectors. Int J Comput Vis 75(2):247–266
Wu Y, Lim J, Yang MH (2013a) Online object tracking: A benchmark. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 2411–2418
Wu Z, Kunz TH, Betke M (2011) Efficient track linking methods for track graphs using network-flow and set-cover techniques. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1185–1192
Wu Z, Thangali A, Sclaroff S, Betke M (2012) Coupling detection and data association for multiple object tracking. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1948–1955
Wu Z, Zhang J, Betke M (2013b) Online motion agreement tracking. In: Proc. Brit. Mach. Vis. Conf., pp 63.1–63.10
Xing J, Ai H, Lao S (2009) Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1200–1207
Xing J, Ai H, Liu L, Lao S (2011) Multiple player tracking in sports video: a dual-mode two-way bayesian inference approach with progressive observation modeling. IEEE Trans Image Process 20(6):1652–1667
Yamaguchi K, Berg AC, Ortiz LE, Berg TL (2011) Who are you with and where are you going? In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1345–1352
Yan X, Wu X, Kakadiaris IA, Shah SK (2012) To track or to detect? an ensemble framework for optimal selection. In: Proc. Eur. Conf. Comput. Vis., pp 594–607
Yan X, Kakadiaris IA, Shah SK (2014) What do i see? modeling human visual perception for multi-person tracking. In: Proc. Eur. Conf. Comput. Vis., pp 314–329
Yang B, Nevatia R (2012a) Multi-target tracking by online learning of non-linear motion patterns and robust appearance models. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1918–1925
Yang B, Nevatia R (2012b) An online learned CRF model for multi-target tracking. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 2034–2041
Yang B, Nevatia R (2012c) Online learned discriminative part-based appearance models for multi-human tracking. In: Proc. Eur. Conf. Comput. Vis., pp 484–498
Yang B, Huang C, Nevatia R (2011) Learning affinities and dependencies for multi-target tracking using a CRF model. In: Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp 1233–1240
Yang C, Duraiswami R, Davis L (2005) Fast multiple object tracking via a hierarchical particle filter. In: Proc. IEEE Int. Conf. Comput. Vis., pp 212–219
Yang J, Vela PA, Shi Z, Teizer J (2009a) Probabilistic multiple people tracking through complex situations. In: Proc. IEEE Int. Workshop Perform. Eva. Track. Surveillance, pp xxx–xxx
Yang M, Yu T, Wu Y (2007) Game-theoretic multiple target tracking. In: Proc. IEEE Int. Conf. Comput. Vis., pp 1–8
Yang M, Lv F, Xu W, Gong Y (2009b) Detection driven adaptive multi-cue integration for multiple human tracking. In: Proc. IEEE Int. Conf. Comput. Vis., pp 1554–1561