
Multiple Object Tracking in Recent Times: A

Literature Review
Mk Bashar*, Samia Islam*, Kashifa Kawaakib Hussain, Md. Bakhtiar Hasan, A.B.M. Ashikur Rahman, Md. Hasanul Kabir
Computer Science and Engineering, Islamic University of Technology, Gazipur, Bangladesh
{bashar38, samiaislam, kashifa, bakhtiarhasan, ashikiut, hasanul}@iut-dhaka.edu

arXiv:2209.04796v1 [cs.CV] 11 Sep 2022

Abstract—Multiple object tracking has gained a lot of interest from researchers in recent years, and it has become one of the trending problems in computer vision, especially with the recent advancement of autonomous driving. MOT is one of the critical vision tasks, posing issues such as occlusion in crowded scenes, similar appearance, small-object detection difficulty, and ID switching. To tackle these challenges, researchers have tried to utilize the attention mechanism of the transformer, the interrelation of tracklets with graph convolutional neural networks, and the appearance similarity of objects in different frames with the Siamese network; they have also tried simple IoU-matching-based CNN networks and motion prediction with LSTMs. To take these scattered techniques under one umbrella, we have studied more than a hundred papers published over the last three years and have tried to extract the techniques that researchers have focused on most in recent times to solve the problems of MOT. We have enlisted numerous applications and possibilities and shown how MOT can be related to real life. Our review tries to show the different perspectives of the techniques that researchers have used over time and gives some future directions for potential researchers. Moreover, we have included popular benchmark datasets and metrics in this review.

Index Terms—MOT, Multiple Object Tracking, Object Tracking, Occlusion, Computer Vision

I. INTRODUCTION

Throughout the last decade, real-life problems have been solved by deep learning-based algorithms, and in recent years deep learning has been used extensively in computer vision. Object tracking is one of the most important tasks in computer vision, and it comes right after object detection. To accomplish the task of object tracking, an object first needs to be localized in a frame. Then each object is assigned an individual unique ID, and the same object across consecutive frames forms a trajectory. Here, an object can be anything: pedestrians, vehicles, players in a sport, birds in the sky, etc. If we want to track more than one object in a frame, the task is called Multiple Object Tracking, or MOT. In MOT, we can track all objects of a single class or all objects of several classes. If we only track a single object, it is called Single Object Tracking, or SOT. MOT is more challenging than SOT; thus, researchers have proposed numerous deep learning-based architectures for solving MOT-related problems.

To organize the last three years of research, we present a literature review on MOT. There are also some review papers on MOT from previous years [1], [2], [3], [4], but all of them have limitations: some only include deep learning-based approaches, only focus on data association, or only analyze the problem statement; others do not categorize the papers well; and real-life applications are often missing.

We have tried to overcome all of these issues in this work. We have tried to go through almost every paper on MOT from 2020 to 2022; after filtering them, we have reviewed more than a hundred papers in this work. While going through the papers, the first thing that caught our attention is that there are many challenges in MOT. We then made an attempt to find out the different approaches used to face those challenges. To establish the approaches, the papers have used various MOT datasets, and to evaluate their work, they have taken help from various MOT metrics. So we have included a quick review of the datasets. Additionally, we have included a summary of new metrics along with previously existing ones. We have also tried to list some of the MOT applications among the vast use cases of Multiple Object Tracking. Going through these papers, some scopes of future work drew our attention, which we mention later on.

Therefore, to sum up, we have organized our work in the following manner:
1) Figuring out MOT's main challenges
2) Listing down frequently used MOT approaches
3) Writing a summary of the MOT benchmark datasets
4) Writing a summary of MOT metrics
5) Exploring various applications
6) Some suggestions regarding future works

* These authors contributed equally to this work.
II. MOT MAIN CHALLENGES

Multiple Object Tracking has some challenges to tackle. Although occlusion is the main challenge in MOT, there are several other challenges that a tracker has to deal with in an MOT problem.

A. Occlusion

Occlusion occurs when something we want to see is entirely or partially hidden or occluded by another object in the same frame. Most MOT approaches are implemented based only on cameras, without sensor data. That is why it is challenging for a tracker to track the location of objects when they obscure each other. Furthermore, occlusion becomes more severe in crowded scenes, where the interaction between people must be modeled [5]. Over time, the use of bounding boxes to locate an object has become very popular in the MOT community. But in crowded scenes [6], occlusions are very difficult to handle since ground-truth bounding boxes often overlap each other. This problem can be solved partially by jointly addressing the object tracking and segmentation tasks [7]. In the literature, we can see appearance information and graph information being used to find global attributes that help resolve occlusion [8], [9], [10], [11]. However, frequent occlusion still has a significant impact on accuracy in MOT problems. Thus researchers try to attack this problem without bells and whistles. Occlusion is illustrated in Figure 1a. In Figure 1b, the red-dressed woman is almost covered by the lamp post, which is an example of occlusion.

B. Challenges for Lightweight Architecture

Though recent solutions for most problems depend on heavy-weight architectures, they are very resource hungry. Thus, in MOT, heavy-weight architecture is counterintuitive to achieving real-time tracking. Therefore, researchers always cherish lightweight architectures. For lightweight architecture in MOT, there are some additional challenges to consider [12]. Bin et al. mentioned three challenges for lightweight architecture:
• Object tracking architecture requires both pre-trained weights for good initialization and tracking data for fine-tuning, because NAS algorithms need direction from the target task and, at the same time, solid initialization.
• NAS algorithms need to focus on both the backbone network and feature extraction, so that the final architecture can suit the target tracking task perfectly.
• The final architecture needs to compile compact and low-latency building blocks.

C. Some Common Challenges

MOT architectures often suffer from inaccurate object detection. If objects are not detected correctly, the whole effort of tracking will go in vain. Sometimes the speed of object detection becomes a major factor for MOT architectures. Background distortion can make object detection quite difficult, and lighting also plays a vital role in object detection and recognition; all of these factors become even more important in object tracking. Due to the motion of the camera or the object, motion blurring makes MOT more challenging. Many times, an MOT architecture finds it hard to decide whether an object is a true newcomer or not. One of the challenges is the proper association between detections and tracklets. Incorrect and imprecise object detection is also a cause of low accuracy in many cases. There are further challenges as well: similar appearances frequently confuse models; initialization and termination of tracks is a crucial task in MOT; interaction among multiple objects; and ID switching (the same object identified as different in consecutive frames even though the object did not leave the frame). Due to non-rigid deformations and inter-class similarity in shape and other appearance properties, people and vehicles create additional challenges in many cases [13]. For example, vehicles have different shapes and colors than people's clothes. Last but not least, smaller-sized objects create a variety of visual elements in scale. Liting et al. try to solve this problem with higher-resolution images at higher computational complexity. They also used a hierarchical feature map with traditional multiscale prediction techniques [14].
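Since occlusion in crowded scenes is usually reasoned about through overlapping bounding boxes, a minimal sketch of the intersection-over-union (IoU) overlap measure that most of the trackers discussed below rely on may help (illustrative code, not from any of the cited papers):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two partially overlapping pedestrian boxes, as in a crowded scene:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333... (partial occlusion)
```

An IoU near 1 between two ground-truth boxes is exactly the crowded-scene situation described above where bounding-box-based trackers struggle.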
Fig. 1. (a) Illustration of the occlusion of two objects (green and blue). In frame 1, the two objects are separate from each other. In frame 2, they are partially occluded. In frame 3, they are totally occluded. (b) A real-world example of occlusion.

III. MOT APPROACHES

The task of multiple object tracking is normally done in two steps: object detection and target association. Some works focus on object detection; some focus on data association. There is a diversity of approaches for these two steps, and an approach cannot always be cleanly labeled as belonging to the detection phase or the association phase. Sometimes we can see the approaches overlapping: most papers use various combinations of MOT components, so we cannot say that the approaches are independent of each other. Yet we have tried to figure out the frequently used approaches so that they can help in deciding which one to follow.

Fig. 2. Recent MOT approaches categorization: Transformer, Graph Model, Attention Module, Detection and Association, Motion Model, Siamese Network, and Tracklet Association.

A. Transformer

Recently there have been many works in computer vision [15] implemented with transformers, and the same is true in MOT. The transformer is a deep learning model which, like other encoder-decoder models, has two parts: an encoder and a decoder [16]. The encoder captures self-attention, whereas the decoder captures cross-attention. This attention mechanism helps to memorize context for the long term. The transformer predicts its output in a query-key fashion. Though it was used solely as a language model in the past, in recent years vision researchers have focused on it to take advantage of contextual memoization. In most cases in MOT, researchers try to predict the location of an object in the next frame based on previous information, a task we think the transformer is well suited to handle. As the transformer is specialized for sequential information, frame-by-frame processing can be done naturally with it. A summary of transformer-based approaches in MOT is presented in Table I.

Peize et al. have built TransTrack [17] using the transformer architecture: they produce two sets of bounding boxes from two types of queries, i.e., object queries and track queries, and by simple IoU matching they decide the final set of boxes, which is the tracking box for every object. It is essentially the same as the tracking-by-detection paradigm. Moreover, it leverages the prediction for tracking from previous detection knowledge, utilizing the transformer query-key mechanism. Tim et al. have done a similar thing by introducing TrackFormer [7], excluding some implementation details.

In [18], patches of images are first detected; then probabilistic concepts are used to obtain expected tracks, and the frames are cropped according to the bounding boxes to get the patches. Using those patches, the tracks of the current frames are predicted.

Later on, En et al. proposed combining the attention model with a transformer encoder to make a Guided Transformer Encoder (GTE) [19] that only goes through the significant pixels of each frame in a global context.

In [20], Dynamic Object Queries (DOQ) have been applied to make the detection more flexible. Additionally, query-based tracking has also been applied to semantic segmentation. Yihong et al. propose a multi-scaled, pixel-by-pixel dense query system [5], which generates dense heatmaps for targets and can give more accurate results.

The papers [21] and [22] focus more on the computational cost of running the architecture in real time. In [21], the transformer layer is built upon an exemplar attention module which reduces the dimension of the input while providing global information; thus the layer can work in real time. In [22], Daitao et al. have improved the computation time by using a lightweight attention layer, applied by the transformer model, which is inserted in a pyramid network.

In [23], Zhou et al. introduce the concept of global tracking. Instead of processing frame by frame, they take a window of 32 frames and track within it, which utilizes the transformer's cross-attention mechanism more efficiently. According to the authors, if an object is lost within a window and is reborn again, their tracker can successfully track it, which gives their tracker lower ID switching, one of the main challenges of MOT.

In [24], Zeng et al. extend DETR [25], an object detection transformer. They introduce a Query Interaction Module to filter the output of the DETR decoder before adding a detection to the tracklet.

In [26], Zhu et al. use the encoder of the transformer to generate a feature map, and then they use three tracking heads to predict bounding box classification, regression, and embedding. In most of the cases we show, others use convolutional layers or popular CNN architectures to extract features from a frame, but this adds an extra load to the main architecture. This model, ViTT, utilizes relatively lightweight transformer encoders compared to others. Moreover, the tracking heads are simple feed-forward networks. Consequently, they produce a lightweight architecture.

B. Graph Model

A Graph Convolutional Network (GCN) is a special kind of convolutional network where the neural network is applied in a graph fashion [27] instead of a linear form. A recent trend has been to use graph models in solving MOT problems, where the detected objects from consecutive frames are considered nodes and the link between two nodes is considered an edge. Normally, data association in this domain is done by applying the Hungarian algorithm [28]. An overview of solving MOT problems using graph models is given in Table II.

Guillem et al. detect and track objects globally by using a message passing network (MPN) combined with a graph to extract deep features throughout the graph [29]. Later on, [30] and [31] have done something similar. Gaoang et al. have taken the same approach [32], but they removed the appearance information because, according to them, appearance features can
Fig. 3. Utilizing the encoder-decoder architecture of the transformer, TrackFormer [7] casts multi-object tracking as a set prediction problem, performing joint detection and tracking-by-attention.
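The query-key mechanism that TransTrack and TrackFormer build on is scaled dot-product attention; a minimal plain-Python sketch (illustrative only, not the papers' implementation):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query scores every key,
    and the scores weight an average over the value vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# A "track query" attending over two frame features: the weight is
# tilted toward the key that matches the query direction.
track_query = [[1.0, 0.0]]
frame_keys = [[1.0, 0.0], [0.0, 1.0]]
frame_vals = [[1.0], [0.0]]
print(attention(track_query, frame_keys, frame_vals))
```

In a decoder, the queries come from track/object embeddings and the keys/values from encoded image features, which is the cross-attention described above.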

TABLE I
SUMMARY OF TRANSFORMER RELATED PAPERS

Reference | Year | Detection / Appearance Feature Extraction | Data Association | Dataset | MOTA (%)
[17] | 2020 | Decoder of DETR | Decoder of Transformer | MOT17, MOT20 | 74.5, 64.5
[7] | 2021 | CNN | Decoder of Transformer | MOT17 | 62.5
[18] | 2020 | CNN | Transformer | MOT16, MOT17 | 73.3, 73.6
[19] | 2022 | Faster R-CNN | Hungarian Algorithm | MOT16, MOT17, MOT20 | 75.8, 74.7, 70.5
[20] | 2022 | CNN + Encoder of Transformer | Decoder + Feed Forward Network | MOT15, MOT16, MOT17 | 40.3, 65.7, 65.0
[5] | 2021 | DETR | Deformable Dual Decoder | MOT17, MOT20 | 71.9, 62.3
[21] | 2021 | Exemplar Attention based encoder | Exemplar Attention based encoder | TrackingNet | 70.55 (Precision)
[22] | 2022 | Transformer Pyramid Network | Multihead and pooling attention | UAV123 | 85.83 (Precision)
[23] | 2022 | CenterNet | Tracking transformer | TAO, MOT17 | 45.8 (HOTA), 75.3
[24] | 2021 | DETR | Decoder and Query Interaction Module + Temporal aggregation network | MOT17, DanceTrack, BDD100k | 57.2 (HOTA), 54.2 (HOTA), 32.0 (nMOTA)
[26] | 2021 | Encoder | Bounding Box Regression Network | MOT16 | 65.7
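MOTA, the headline metric in Table I and the tables that follow, is the standard CLEAR MOT accuracy score: MOTA = 1 − (FN + FP + IDSW) / GT, where FN, FP, and IDSW are false negatives, false positives, and identity switches, and GT is the total number of ground-truth objects over all frames. A minimal sketch:

```python
def mota(fn, fp, idsw, num_gt):
    """MOTA = 1 - (FN + FP + ID switches) / total ground-truth objects.
    Note it can be negative when errors outnumber ground-truth boxes."""
    return 1.0 - (fn + fp + idsw) / num_gt

# Hypothetical error counts for one sequence:
print(round(100 * mota(fn=1000, fp=500, idsw=100, num_gt=10000), 1))  # 84.0
```

This is why ID switching, discussed in Section II, directly lowers the scores reported throughout the tables.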

cause more occlusion. Also, they have followed an advanced embedding strategy to design the tracklets.

But Jiahe et al. use two graphs, an Appearance Graph Network and a Motion Graph Network, to identify the similarity of appearance and motion across frames, respectively [33].

Peng et al. have also used two graph modules to solve the MOT problem, but one of the modules is for generating proposals and the other one is for scoring them [34]. For proposal generation, they considered small tracklets or detected objects as nodes, with each node connected to all the others. For the next module, they trained a GCN to rank the proposals according to their scores, as can be seen in Figure 4.

In [35], Jiawei et al. have solved two problems: the association problem and the assignment problem. To solve the association problem, they focused more on matching features within the same frame across the graph rather than finding relationships between two frames. For the assignment problem, they integrated a quadratic programming layer to learn more robust features.

So far, these papers have worked on the single-camera MOT problem. But in the next year, 2022, Kha et al. worked on the multi-camera MOT problem [36]. They established a dynamic graph to accumulate new feature information, instead of a static graph like the other papers.

C. Detection and Target Association

In this kind of approach, detection is done by any deep learning model, but the main challenge is to associate targets, i.e., to keep track of the trajectory of the object of interest [37]. Different papers follow different approaches in this regard.

Margret et al. have picked both the bottom-up approach and the top-down approach [38]. In the bottom-up approach, point trajectories are determined, while in the top-down approach, bounding boxes are determined. By combining the two, a full track of objects can be found.

In [39], to solve the association problem, Hasith et al. simply detected the objects and used the famous Hungarian Algorithm to associate the information. In the same year, 2019, Paul et al. proposed TrackR-CNN [40], an extension of R-CNN and obviously a revolutionary contribution in the field of MOT. TrackR-CNN is a 3-D convolutional network that can do both detection and tracking along with segmentation.

In the year 2020, Yifu et al. performed object detection and re-identification in two separate branches [54]. The

Fig. 4. (a) Frames with detected objects. (b) Graph constructed with the detected objects or tracklets as nodes, and proposal generation. (c) Ranking the proposals with GCN. (d) Trajectory inference. (e) Final output [34]
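The Hungarian algorithm [28] used by several of the works above solves a minimum-cost one-to-one assignment between tracks and detections. A brute-force sketch of the same objective for tiny inputs (real trackers use the O(n³) Hungarian algorithm, e.g. scipy.optimize.linear_sum_assignment; the cost values here are made up):

```python
from itertools import permutations

def best_assignment(cost):
    """Minimum-cost one-to-one assignment of tracks to detections,
    by brute force over all permutations (only viable for tiny n)."""
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return best_perm, best

# cost[i][j] could be e.g. 1 - IoU between track i's predicted box
# and detection j (hypothetical values):
cost = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9],
        [0.8, 0.9, 0.3]]
print(best_assignment(cost)[0])  # (0, 1, 2)
```

Whatever the model producing the costs (graph features, appearance embeddings, motion), this assignment step is the common core of data association.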

TABLE II
SUMMARY OF GRAPH MODEL RELATED PAPERS

Reference | Year | Detection | Association | Dataset | MOTA (%)
[29] | 2020 | ResNet50 | Message Passing | MOT15, MOT16, MOT17 | 51.5, 58.6, 58.8
[33] | 2020 | ResNet-34 | Hungarian algorithm | MOT16, MOT17 | 47.7, 50.2
[31] | 2021 | SeResNet-50 | Human-Interaction Model | MOT15, MOT16, DukeMTMCT | 80.4, 50.0, 86.7
[32] | 2021 | CenterNet, CompACT | Box and Tracklet Motion Embedding | MOT17, KITTI, UA-Detrac | 56.0, 87.6, 22.5
[34] | 2021 | ResNet50-IBN | Proposal Generation and Scoring | MOT17, MOT20 | 59.0, 56.3
[35] | 2021 | CenterNet | Graph Matching | MOT16, MOT17 | 65.0, 66.2
[30] | 2022 | CenterPoint, MEGVII | Message Passing | nuScenes | 55.4

branches are similar in architecture, and both use object centers to extract features, for detection and re-identification respectively. They claim to have focused equally on the two tasks; that is why they have named their approach FairMOT.

In the year 2021, we find two papers improving data association using LSTM. Bisheng et al. propose Detection Refinement for Tracking (DRT), which does the detection task by semi-supervised learning, producing a heatmap to localize the objects more correctly [42]. The architecture has two branches, where the secondary branch can recover occluded objects. The paper also solves the data association problem with LSTM [55]. Chanho et al. also used bilinear LSTM in this regard [43].

Besides, in [44], Qiang et al. have done data association by proposing CorrTracker, a correlational network that is able to propagate information across the associations. They have done the object detection part by self-supervised learning. But Jiangmiao et al. have detected objects by Faster-RCNN extended with residual networks, have combined it with similarity learning, and have ultimately proposed the Quasi-Dense Tracking model (QDTrack) [45].

In the same year, Yaoye et al. have introduced the D2LA network [41], which is based on FairMOT as introduced in [54], to keep a balance in the trade-off between accuracy and complexity. To avoid occlusion, they have taken measures, namely a strip attention module. On the other hand, Norman et al. estimate the geometry of each detected object and map each object to its corresponding pose so that they can identify the object after occlusion [56].

Ramana et al. have proposed their own dataset along with their own architectures, namely HeadHunter for detection and HeadHunter-T for tracking [46]. There are two stages in HeadHunter. In the first stage, they have used an FPN and ResNet-50 to extract features. In the second stage, they have used Faster-RCNN and an RPN to generate object proposals.

Jialian et al. have proposed two modules [47]: a cost volume
TABLE III
SUMMARY OF DETECTION AND TARGET ASSOCIATION RELATED PAPERS

Reference | Year | Detection | Association | Dataset | MOTA (%)
[38] | 2018 | Faster R-CNN | Correlation Co-Clustering | MOT15, MOT16, MOT17 | 35.6, 47.1, 51.2
[39] | 2019 | DPM, F-RCNN, SDP, RRC | Hungarian Algorithm | MOT17, KITTI | 46.9, 85.04
[40] | 2019 | Mask R-CNN | Distance Measurement | KITTI MOTS, MOTSChallenge | 65.1 (MOTSA)
[41] | 2021 | CenterNet | Hungarian Algorithm | MOT15, MOT16, MOT17, MOT20 | 60.6, 74.9, 73.7, 61.8
[42] | 2021 | ResNet50 | LSTM-based Motion Model | MOT16, MOT17 | 76.3, 76.4
[43] | 2021 | CenterNet | Bilinear LSTM | MOT16, MOT17 | 48.3, 51.5
[44] | 2021 | CenterNet | Correlation Learning | MOT15, MOT16, MOT17, MOT20 | 62.3, 76.6, 76.5, 65.2
[45] | 2021 | Faster R-CNN | Quasi-dense Similarity Matching | MOT16, MOT17, BDD100K, Waymo | 69.8, 68.7, 64.3, 51.18
[46] | 2021 | HeadHunter | HeadHunter-T | CroHD | 63.6
[47] | 2021 | CenterNet | CVA (Cost Volume based Association) | MOT16, MOT17, nuScenes, MOTS | 70.1, 69.1, 5.9 (AMOTA), 65.5 (MOTSA)
[48] | 2022 | Mask-RCNN | Hungarian Algorithm | MOT17, MOT20, NTU-MOTD | 43.21, 57.70, 92.12
[49] | 2022 | YOLOv4 | Hungarian Algorithm | TAMU2015V, UGA2015V, UGA2018V | 79.0, 65.5, 73.4
[50] | 2022 | DLA-34 | Hungarian Algorithm | MOT15, MOT16, MOT17, MOT20 | 55.8, 73.8, 74.0, 60.2
[51] | 2022 | DPM and YOLOv5 with detection modifier (DM) | Global and Partial Feature Matching | MOT16 | 46.5
[52] | 2022 | YOLOX with later NMS | Kalman Filtering, Bicubic Interpolation and ReID Model | MOT17, MOT20 | 78.3, 75.7
[53] | 2022 | T-ReDet module | ReID-NMS Model | MOT16, MOT17, MOT20 | 63.9, 62.5, 57.4

based association (CVA) module and a motion-guided feature warper (MFW) module, to extract object localization offset information and to transmit the information from frame to frame, respectively. They have named the integration of the whole process TraDeS (TRAck to DEtect and Segment). Changzhi et al. have made ParallelMOT [57], which has two different branches for detection and re-identification, similar to [54].

In 2022, we can see diversity in the problem statements of MOT. [48] is an exceptional paper where Cheng-Jen et al. have introduced indoor multiple object tracking. They have proposed a depth-enhanced tracker (DET) to improve the tracking-by-detection strategy, along with an indoor MOT dataset. We can again see a different kind of problem statement in [49], which is to track crop seedlings. In this paper, Chenjiao et al. have used YOLOv4 as an object detector and tracked the bounding boxes obtained from the detector by optical flow.

Oluwafunmilola et al. have done object tracking along with object forecasting [50]. They have detected bounding boxes using FairMOT [54], then stacked a forecasting network on top, making a Joint Learning Architecture (JLE). Zhihong et al. have extracted new features of each frame to get the information globally and have accumulated partial features for occlusion handling [51]. They have merged these two kinds of features to detect pedestrians accurately.

No paper except [52] has taken any measure to preserve significant bounding boxes so that they are not eliminated in the data association stage. After detecting, Hong et al. applied Non-Maximum Suppression (NMS) in the tracking phase to reduce the probability of important bounding boxes being removed [53]. Jian et al. have also used NMS to reduce redundant bounding boxes from the detector. They have re-detected trajectory locations by comparing features and re-identified bounding boxes with the help of IoU. The ultimate outcome is a joint re-detection and re-identification tracker (JDI).

D. Attention Module

To re-identify occluded objects, attention is needed. Attention means we only consider the objects of interest, nullifying the background, so that their features are remembered for long, even after occlusion. A summary of the use of attention modules in the MOT field is given in Table IV.

In [41], Yaoye et al. have incorporated a strip attention module to re-identify pedestrians occluded by the background. This module is actually a pooling layer that includes max and mean pooling, which extracts features from the pedestrians more fruitfully so that when they are blocked, the model does not forget them and can re-identify them later. Song et al. wanted to use information from object localization in data association and also information from data association in object localization. To link the two, they used two attention modules, one for the target and one for distraction [59]. Then they finally applied memory aggregation to make strong attention.

Tianyi et al. have proposed a spatial-attention mechanism [60], implementing a Spatial Transformation Network (STN) in an appearance model to force the model to focus only on the foreground. On the other hand, Lei et al. have first proposed a Prototypical Cross-Attention Module (PCAM) to
Fig. 5. The structure of the attention-based tracking head with cross-attention [58]
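To illustrate the max-and-mean pooling idea behind the strip attention module of [41], here is a loose, hypothetical sketch on a raw 2-D feature map (the actual module operates on learned CNN features inside the network):

```python
def strip_pool(feat):
    """Row-wise and column-wise pooling of a 2-D feature map, keeping
    both mean and max statistics per strip. Long, thin pooling windows
    preserve elongated (pedestrian-shaped) structures better than
    square pooling windows."""
    rows = [(sum(r) / len(r), max(r)) for r in feat]        # horizontal strips
    cols = [(sum(c) / len(c), max(c)) for c in zip(*feat)]  # vertical strips
    return rows, cols

feat = [[0.0, 1.0],
        [0.5, 0.5]]
rows, cols = strip_pool(feat)
print(rows)  # [(0.5, 1.0), (0.5, 0.5)]
print(cols)  # [(0.25, 0.5), (0.75, 1.0)]
```

Keeping both statistics is the hedge against occlusion described above: the max survives even when most of a strip is covered, while the mean summarizes the visible context.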

TABLE IV
SUMMARY OF ATTENTION RELATED PAPERS

Reference | Year | Attention Mechanism | Dataset | MOTA (%)
[41] | 2021 | Strip Pooling | MOT15, MOT16, MOT17, MOT20 | 60.6, 74.9, 73.7, 61.8
[59] | 2021 | Temporal Aware Target Attention and Distractor Attention | MOT16, MOT17, MOT20 | 59.1, 59.7, 56.6
[60] | 2021 | Spatial Transformation Network (STN) | MOT16, MOT17 | 50.5, 50.0
[61] | 2021 | Spatio-Temporal Cross-Attention | BDD100K (Validation), KITTI-MOTS (Validation) | 27.4 (MOTSA), 66.4 (mMOTSA)
[62] | 2021 | Self-Attention in Detection | Custom Dataset: Sparse Scene, Dense Scene | 70.9, 56.4
[36] | 2021 | Graph Structural and Temporal Self-Attention | PETS09, EPFL, CAMPUS, MCT, CityFlow (Validation) | 93.5, 66.3, 96.7, 95.7, 90.9
[58] | 2022 | Self- and Cross-Attention as Tracking Head | MOT17, MOT20 | 75.6, 70.4

extract relevant features from past frames. They have then used a Prototypical Cross-Attention Network (PCAN) to transmit the contrasting features of foreground and background throughout the frames [61].

Huiyuan et al. have proposed a self-attention mechanism to detect vehicles [62]. The paper [36] also has a self-attention module, applied in the dynamic graph to combine internal and external information of the cameras.

JiaXu et al. have used both cross- and self-attention in a lightweight fashion [58]. In Figure 5, we can see the cross-attention head of that architecture. The self-attention module is used to extract robust features, decreasing background occlusion. Then the data is passed to the cross-attention module for instance association.

E. Motion Model

Motion is an inevitable property of objects, so this feature can be used in the area of multi-object tracking, be it for detection or association. The motion of an object can be calculated from the difference in its position between two frames, and based on this measure, different decisions can be taken, as we have seen going through the papers. An overview is given in Table V.

Hasith et al. and Oluwafunmilola et al. have used motion to compute a dissimilarity cost in [39] and [63], respectively. Motion is calculated as the difference between the actual location and the predicted location. To predict the location of an occluded object, Bisheng et al. used a motion model based on LSTM [42]. Wenyuan et al. incorporated a motion model with a Deep Affinity Network (DAN) [64] to optimize data association by eliminating the locations where it is not possible for an object to be situated [65].

Qian et al. have also calculated motion, by measuring distance from consecutive satellite frames with Accumulative Multi-Frame Differencing (AMFD) and low-rank matrix completion (LRMC) [66], and have formed a motion model baseline (MMB) to detect objects and to reduce the number of false alarms. Hang et al. have used motion features to identify foreground objects in the field of vehicle driving [67]. They have detected relevant objects by comparing motion features with a GLV model. Gaoang et al. have proposed a local-global motion (LGM) tracker that finds the consistencies of the motion and thus associates the tracklets [32]. Apart from these, Ramana et al. have used a motion model to predict the motion of the object rather than for data association; it has three modules: Integrated Motion Localization (IML), Dynamic Reconnection Context (DRC), and 3D Integral Image (3DII) [46].

In the year 2022, Shoudong et al. have used a motion model for both motion prediction and association by proposing the Motion-Aware Tracker (MAT) [68]. Zhibo et al. have proposed a compensation tracker (CT), which can recover lost objects with a motion compensation module [69]. But Xiaotong et
TABLE V
SUMMARY OF MOTION MODEL RELATED PAPERS

Reference | Year | Motion Mechanism | Dataset | MOTA (%)
[39] | 2019 | Dissimilarity Distance between Detected and Predicted Object | MOT17, KITTI | 46.9, 85.04
[63] | 2021 | Dissimilarity Distance between Detected and Predicted Object | MOT15, MOT16, MOT17, MOT20 | 55.8, 73.8, 74.0, 60.2
[42] | 2021 | LSTM-based Model on Consecutive Frames | MOT16, MOT17 | 76.3, 76.4
[65] | 2021 | Kalman Filtering | MOT17 | 44.3
[66] | 2021 | Accumulative Multi-Frame Differencing and Low-Rank Matrix Completion | VISO | 73.6
[67] | 2021 | Distance of Motion Feature and Mean Vector of Gaussian Local Velocity Model | NJDOT | 100 (Anomaly Detection Accuracy)
[32] | 2021 | Box and Tracklet Motion Embedding | MOT17, KITTI, UA-Detrac | 56.0, 87.6, 22.5
[46] | 2021 | Particle Filtering and Enhanced Correlation Coefficient Maximization | CroHD | 63.6
[68] | 2022 | Combination of Camera Motion and Pedestrian Motion (IML), Dynamic Motion-based Reconnection (DRC) | MOT16, MOT17 | 70.5, 69.5
[69] | 2022 | Motion Compensation with Basic Tracker | MOT16, MOT17, MOT20 | 69.8, 68.8, 66.0
[18] | 2022 | Kalman Filtering | MOT16, MOT17 | 73.3, 73.6
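Many of the motion mechanisms in Table V boil down to predicting a box from its recent displacement; a hypothetical constant-velocity sketch, standing in for the Kalman-filter-style predictors above (not from any specific paper):

```python
def predict_next_box(prev_box, curr_box):
    """Constant-velocity prediction: shift the current box by the
    displacement observed between the last two frames. The predicted
    box is then compared against new detections (e.g. via IoU or a
    dissimilarity distance) during data association."""
    return tuple(c + (c - p) for p, c in zip(prev_box, curr_box))

# A box (x1, y1, x2, y2) moving 5 px to the right per frame:
print(predict_next_box((0, 0, 10, 10), (5, 0, 15, 10)))  # (10, 0, 20, 10)
```

A Kalman filter refines exactly this idea by also tracking the uncertainty of the velocity estimate and blending prediction with measurement.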

Xiaotong et al. have used a motion model to predict the bounding boxes of objects [18], as done in [67], but to make image patches as discussed in III-A.

F. Siamese Network

Similarity information between two frames helps a lot in object tracking. Thus the Siamese network tries to learn the similarities and differentiate the inputs. This network has two parallel sub-networks sharing the same weights and parameter space. The parameters between the twin networks are tied, and the network is trained on a loss function that measures the semantic similarity between the inputs. A summary of papers applying Siamese networks to the MOT task is given in Table VI.

Daitao et al. have proposed a pyramid network that embeds a lightweight transformer attention layer. Their Siamese Transformer Pyramid Network augments the target features with lateral cross attention between pyramid features, producing a robust target-specific appearance representation [22]. Bing et al. have tried to uplift region-based multiple object tracking by incorporating motion modeling [70]. They have embedded the Siamese tracking framework into Faster R-CNN to achieve fast tracking through a lightweight tracker and shared network parameters.

Cong et al. have proposed a Cleaving Network using a Siamese Bi-directional GRU (SiaBiGRU) to post-process trajectories and eliminate corrupted tracklets. They have then established a Re-connection Network to link those tracklets into a trajectory [31].

In a typical MOT network, there are prediction and detection modules. The prediction module tries to predict the appearance of an object in the next frame, and the detection module detects the objects. The results of these two modules are used to match features and update the trajectories of objects. Xinwen et al. have proposed a Siamese RPN (Region Proposal Network) structure as the predictor, along with an adaptive threshold determination method for the data association module [71]. Thus the overall stability of the Siamese network has been improved.

In contrast to the transformer models of III-A, JiaXu et al. have proposed a lightweight attention-based tracking head under the structure of a Siamese network that enhances the localization of foreground objects within a box [58]. On the other hand, Philippe et al. have incorporated their efficient transformer layer into a Siamese tracking network, replacing the convolutional layer with the exemplar transformer layer [21].

G. Tracklet Association

A group of consecutive frames of an object of interest is called a tracklet. In detecting and tracking objects, tracklets are first identified using different algorithms and then associated together to establish a trajectory. Tracklet association is obviously a challenging task in MOT problems, and some papers focus specifically on this issue. An overview is presented in Table VII.

Jinlong et al. have proposed Tracklet-Plane Matching (TPM) [72], where short tracklets are first created from the detected objects and aligned in a tracklet plane, with each tracklet assigned to a hyperplane according to its start and end time. Thus large trajectories are formed. This process can also handle non-neighboring and overlapping tracklets. To improve performance, they have also proposed two schemes.

Duy et al. have first built tracklets with a 3D geometric algorithm [73]. They have formed trajectories from multiple cameras and, owing to this, have optimized the association globally by formulating spatial and temporal information.

In [31], Cong et al. have proposed a Position Projection Network (PPN) to transfer trajectories from the local to the global context. Daniel et al. re-identify occluded objects by assigning a newly appearing object to a previously found occluded object depending on motion. Then they have implemented already found tracks further for regression, thus have taken
Fig. 6. (a) A typical Siamese Network that has symmetric pyramid architecture, (b) A typical Discriminative network, (c) Siamese Transformer Pyramid Network
that is proposed in [22]
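The weight tying between the twin branches described above can be shown with a toy sketch. Here a random linear embedding stands in for the learned sub-network (an assumption purely for illustration; real trackers use deep convolutional branches trained with a similarity loss):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))  # ONE weight matrix shared by BOTH branches

def embed(x):
    """Both branches of the Siamese pair run this same function (tied weights)."""
    v = W @ x
    return v / np.linalg.norm(v)

def similarity(x1, x2):
    # Cosine similarity between the twin embeddings; a training loss
    # (e.g. contrastive loss) would push this up for matching pairs.
    return float(embed(x1) @ embed(x2))

a = rng.standard_normal(8)
noisy_a = a + 0.05 * rng.standard_normal(8)   # slightly perturbed view of a
b = rng.standard_normal(8)                    # an unrelated input

print(round(similarity(a, a), 3))  # → 1.0
# similarity(a, noisy_a) stays high because both inputs pass through the
# same shared weights; similarity(a, b) is essentially arbitrary.
```

The point of the sketch is only the parameter sharing: both inputs pass through identical weights, so the learned metric is symmetric in its two inputs.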

TABLE VI
SUMMARY OF SIAMESE NETWORK RELATED PAPERS

Reference | Year | Method | Dataset | MOTA (%)
[22] | 2020 | CNN for Appearance Extraction, LSTM and RNN for Motion Modelling | Duke-MTMCT, MOT16 | 73.5, 55.0
[70] | 2021 | Implicit and Explicit Motion Modelling | MOT17, TAO-person, HiEve | 65.9, 44.3 (TAP@0.5), 53.2
[71] | 2021 | Siamese Network with Region Proposal Network | MOT16, MOT17, MOT20 | 65.8, 67.2, 62.3
[21] | 2021 | Single Instance-Level Attention | TrackingNet | 70.55 (Precision)
[58] | 2022 | Dynamic Search Region Refinement and Attention-Based Tracking | MOT17, MOT20 | 67.2, 70.4
[22] | 2022 | Transformer-Based Appearance Similarity | UAV123 | 85.83 (Precision)

the tracking-by-regression approach. Furthermore, they have extended their work by extracting the temporal direction to improve the performance [74].

In [75], we can see a different strategy from the former ones. En et al. have considered each trajectory as a center vector and made a trajectory-center memory bank (TMB), which is updated dynamically and used to calculate cost. The whole process is named multi-view trajectory contrastive learning (MTCL). Additionally, they have created learnable view sampling (LVS), which treats each detection as a key point and helps to view the trajectory in a global context. They have also proposed a similarity-guided feature fusion (SGFF) approach to avoid vague features.

The authors of [76] have developed a tracklet booster (TBooster) to alleviate the errors that occur during association. TBooster has two components: Splitter and Connector. In the first module, tracklets are split where ID switching occurs. Thus the problem of assigning the same ID to multiple objects can be resolved. In the second module, the tracklets of the same object are linked, so that assigning multiple IDs to the same object can be avoided. Tracklet embedding is done by the Connector.

IV. MOT BENCHMARKS

A typical MOT dataset contains video sequences. In those sequences, every object is identified by a unique ID until it goes out of the frame. Once a new object comes into the frame, it gets a new unique ID. MOT has a good number of benchmarks. Among them, the MOT Challenge benchmarks have several versions; since 2015, in almost every year, they publish a new benchmark with more variations. There are also some popular benchmarks such as PETS, KITTI, STEP, and DanceTrack.

As of now, the MOT Challenge has 17 datasets for object tracking, which include MOT15 [81], MOT16 [82], MOT20 [6], and others. The MOT15 benchmark contains the Venice, KITTI, ADL-Rundle, ETH-Pedcross, ETH-Sunnyday, PETS, and TUD-Crossing datasets. This benchmark is filmed in unconstrained environments with both static and moving cameras. MOT16 and MOT17 are basically updated versions of MOT15 with more accurate ground truth and strictly followed protocols. MOT20 is a pedestrian detection challenge. This benchmark has 8 challenging video sequences (4 train, 4 test) in unconstrained environments [6]. In addition to object tracking, the MOTS dataset has segmentation tasks too [40]. In general, a tracking dataset has a bounding box with a unique identifier for objects in a frame. But in MOTS, every
TABLE VII
SUMMARY OF TRACKLET ASSOCIATION RELATED PAPERS

Reference | Year | Method | Dataset | MOTA (%)
[72] | 2020 | Tracklet-plane matching process to resolve confusing short tracklets | MOT16, MOT17 | 50.9, 52.4
[77] | 2021 | CenterTrack [78] and DG-Net [79] as tracking graph and GAEC+KLj [80] heuristic solver for lifted multicut solver | WILDTRACK, PETS-09, Campus | 97.1, 74.2, 77.5
[31] | 2020 | CNN for Appearance Extraction, LSTM and RNN for Motion Modelling | Duke-MTMCT, MOT16 | 73.5, 55.0
[74] | 2021 | Regression-based two-stage tracking | MOT16, MOT17, MOT20 | 66.8, 65.1, 61.2
[76] | 2021 | Tracklet splitter splits potential false IDs and connector connects pure tracks to trajectory | MOT17, MOT20 | 61.5, 54.6
[75] | 2022 | Learnable view sampling for similarity-guided feature fusion and trajectory-center memory bank for re-identification | MOT15, MOT16, MOT17, MOT20 | 62.1, 74.3, 73.5, 63.2

object has a segmentation mask also. The TAO [83] dataset has a huge size due to tracking each and every object in a frame. There is a dataset called Head Tracking 21; the task for this benchmark is to track the head of every pedestrian. The STEP dataset has every pixel segmented and tracked. Some other datasets are included in Table VIII. The frequency of the datasets used in the papers we review is shown in Fig. 7. From the chart, we can see that the MOT17 dataset is used more frequently than the other datasets.

TABLE VIII
STATISTICS OF PUBLICLY AVAILABLE DATASETS

Dataset | No. of Frames | Size (Bytes) | Published Year | Reference
DanceTrack* | 105000 | 16.5G | 2022 | [84]
TAO VOS | - | 2.4G | 2021 | [85]
Head Tracking 21 | 11464 | 4.1G | 2021 | [46]
STEP-ICCV21 | 2075 | 380M | 2021 | [86]
MOTSynth-MOT CVPR22 | 1381119 | - | 2021 | [87]
MOTSynth-MOTS CVPR22 | 1378244 | - | 2021 | [87]
MOT20 | 13410 | 5.0G | 2020 | [6]
3D-ZeF20 | 14398 | 14.0G | 2020 | [88]
TAO | 4447038 | 347G | 2020 | [83]
CTMC-v1 | 152498 | 768M | 2020 | [89]
OWTB | 4447038 | 350G | 2020 | [83]
MOTS | 5906 | 783.5M | 2019 | [40]
MOT16 | 11235 | 1.9G | 2016 | [82]
MOT17 | 33705 | 5.5G | 2016 | [82]
PETS 2016 | - | - | 2016 | [90]
MOT15 | 11283 | 1.3G | 2015 | [81]
KITTI Tracking | - | 15G | 2012 | [91]
TUD Multiview Pedestrians | 179 | 387M | 2010 | [92]
PETS 2009 | - | 4.9G | 2009 | [93]
TUD Campus, Crossing | 272 | 100M | 2008 | [94]
* This dataset has scenes indoors only.

Fig. 7. The number of papers for each dataset

V. MOT METRICS

A. MOTP

Multiple Object Tracking Precision. It is a score given based on how precise the tracker is in finding the position of the object [95], regardless of the tracker's ability to recognize object configurations and maintain consistent trajectories. As MOTP can only provide localization accuracy, it is often used in conjunction with MOTA (Multiple Object Tracking Accuracy), as MOTA alone cannot account for localization errors. Localization is one of the outputs of an MOT task; it lets us know where the object is in a frame. Alone, it cannot provide a thorough idea of the tracker's performance in object tracking.

MOTP = (Σ_{i,t} d_t^i) / (Σ_t c_t)

d_t^i: The distance between the actual object and its respective hypothesis at time t, within a single frame, for each object o_i to which the tracker assigns a hypothesis h_i.
c_t: The number of matches between objects and hypotheses made at time t.

B. MOTA

Multiple Object Tracking Accuracy. This metric measures how well the tracker detects objects and predicts trajectories without taking precision into account. The metric takes into account three types of error [95].

MOTA = 1 − (Σ_t (m_t + fp_t + mme_t)) / (Σ_t g_t)

m_t: The number of misses at time t
fp_t: The number of false positives
mme_t: The number of identity switches
g_t: The number of objects present at time t

MOTA has several drawbacks. MOTA overemphasizes the effect of accurate detection. It focuses on matches between
predictions and ground truths at the detection level and does not consider association. When we consider MOTA without identity switching, the metric is more heavily affected by poor precision than by poor recall. The aforementioned limitations could lead researchers to tune their trackers towards better precision and accuracy at the detection level whilst ignoring other important aspects of tracking. MOTA can only take into account short-term associations: it can only evaluate how well an algorithm performs first-order association, not how well it associates throughout the whole trajectory. Moreover, it does not take association precision/ID transfer into account at all; in fact, if a tracker is able to correct an association mistake, the metric punishes it instead of rewarding it. While the highest MOTA score is 1, there is no fixed minimum value, which can lead to a negative MOTA score.

C. IDF1

The Identification Metric. It tries to map predicted trajectories to actual trajectories, in contrast to metrics like MOTA which perform bijective mapping at the detection level. It was designed for measuring 'identification', which, unlike detection and association, has to do with what trajectories are there [96].

ID-Recall = |IDTP| / (|IDTP| + |IDFN|)

ID-Precision = |IDTP| / (|IDTP| + |IDFP|)

IDF1 = |IDTP| / (|IDTP| + 0.5|IDFP| + 0.5|IDFN|)

IDTP: Identity True Positive. The predicted object trajectory and ground truth object trajectory match.
IDFN: Identity False Negative. Any ground truth detection that went undetected and has an unmatched trajectory.
IDFP: Identity False Positive. Any predicted detection that is false.

Due to MOTA's heavy reliance on detection accuracy, some prefer IDF1 as this metric puts more focus on association. However, IDF1 has some flaws as well. In IDF1, the best unique bijective mapping does not lead to the best alignment between predicted and actual trajectories; the end result leaves room for better matches. The IDF1 score can decrease even if there are correct detections, and it can also decrease if there are a lot of unmatched trajectories. This incentivizes researchers to increase the total number of unique IDs rather than focus on making decent detections and associations.

D. Track-mAP

This metric matches ground truth trajectories and predicted trajectories. Such a match is made between trajectories when the trajectory similarity score, S_tr, between the pair is greater than or equal to a threshold α_tr. Also, the predicted trajectory must have the highest confidence score [96].

Pr_n = |TPTr|_n / n

Re_n = |TPTr|_n / |gtTraj|

n: The total number of predicted trajectories. Predicted trajectories are arranged according to their confidence scores in descending order.
Pr_n: The precision of the tracker.
TPTr: True Positive Trajectories. Any predicted trajectory that has found a match.
|TPTr|_n: The number of true positive trajectories among the n predicted trajectories.
Re_n: The recall.
|gtTraj|: The number of ground truth object trajectories. Using the equations for precision and recall, further calculation is done to obtain the final Track-mAP score.

InterpPr_n = max_{m ≥ n} (Pr_m)

We first interpolate the precision values and obtain InterpPr for each value of n. Then we plot a graph of InterpPr against Re_n for each value of n. We now have the precision-recall curve; the integral of this curve gives the Track-mAP score. There are some demerits to Track-mAP as well. It is difficult to visualize the tracking result for Track-mAP, as it has several outputs for a single trajectory. The effect of trajectories with low confidence scores on the final score is obscured. There is also a way to 'hack' the metric: researchers can get a higher score by creating several predictions with low confidence scores. This increases the chances of getting a decent match and thus increases the score; however, it is not an indicator of good tracking. Track-mAP cannot indicate whether trackers have better detection and association.

E. HOTA

Higher Order Tracking Accuracy. The source paper [96] describes HOTA as: "HOTA measures how well the trajectories of matching detections align, and averages this over all matching detections, while also penalizing detections that don't match." HOTA is supposed to be a single score that can cover all the elements of tracking evaluation. It is also supposed to be decomposable into sub-metrics. HOTA compensates for the shortcomings of the other commonly used metrics. While metrics like MOTA ignore association and heavily depend on detection, or vice versa (IDF1), novel concepts such as TPAs, FPAs, and FNAs are developed so that association can be measured just like how TPs, FNs, and FPs are used to measure detection.

HOTA_α = sqrt( (Σ_{c∈{TP}} A(c)) / (|TP| + |FN| + |FP|) )

A(c) = |TPA(c)| / (|TPA(c)| + |FNA(c)| + |FPA(c)|)

A(c): Measures how similar the predicted trajectory and the ground-truth trajectory are.
TP: True Positive. A ground truth detection and predicted
detection are matched together, given that S ≥ α, where S is the localization similarity and α is the threshold.
FN: False Negative. A ground truth detection that was missed.
FP: False Positive. A predicted detection with no respective ground truth detection.
TPA: True Positive Association. The set of True Positives that have the same ground truth ID and the same prediction ID as a given TP c.

TPA(c) = {k}, k ∈ {TP | prID(k) = prID(c) ∧ gtID(k) = gtID(c)}

FNA: The set of ground truth detections with the same ground truth ID as a given TP c. However, these detections were assigned a prediction ID different from c, or none at all.

FNA(c) = {k}, k ∈ {TP | prID(k) ≠ prID(c) ∧ gtID(k) = gtID(c)} ∪ {FN | gtID(k) = gtID(c)}

FPA: The set of predicted detections with the same prediction ID as a given TP c. However, these detections were assigned a ground-truth ID different from c, or none at all.

FPA(c) = {k}, k ∈ {TP | prID(k) = prID(c) ∧ gtID(k) ≠ gtID(c)} ∪ {FP | prID(k) = prID(c)}

HOTA_α means HOTA calculated for a particular value of α. Further calculation needs to be done to get the final HOTA score: we find the value of HOTA_α for different values of α, ranging from 0 to 1, and then calculate their average.

HOTA = ∫_0^1 HOTA_α dα ≈ (1/19) Σ_{α ∈ {0.05, 0.1, ..., 0.9, 0.95}} HOTA_α

We are able to break down HOTA into several sub-metrics. This is useful because we can take different elements of the tracking evaluation and use them for comparison, and we can get a better idea of the kinds of errors our tracker is making. There are five types of errors commonly found in tracking: false negatives, false positives, fragmentations, mergers, and deviations. These can be measured through detection recall, detection precision, association recall, association precision, and localization, respectively.

F. LocA

Localization Accuracy [96].

LocA = ∫_0^1 (1/|TP_α|) Σ_{c∈{TP_α}} S(c) dα

S(c): The spatial similarity score between the predicted detection and the ground truth detection. This sub-metric deals with the error type deviation, i.e., localization errors. Localization errors are caused when the predicted detections and ground truth detections are not aligned. This is similar to, but unlike, MOTP, as it includes several localization thresholds. Commonly used metrics like MOTA and IDF1 do not take localization into account despite the importance of object localization in tracking.

G. AssA: Association Accuracy Score

According to the MOT Benchmark: "The average of the association jaccard index over all matching detections and then averaged over localization threshold" [96]. Association is the part of the result of an MOT task that lets us know whether objects in different frames belong to the same or different objects; objects with the same ID are part of the same trajectory. Association Accuracy gives us the average alignment between matched trajectories. It focuses on association errors. These are caused when a single object in the ground truth is given two different predicted detections, or a single predicted detection is given two different ground truth objects.

AssA_α = (1/|TP|) Σ_{c∈{TP}} A(c)

H. DetA: Detection Accuracy

According to the MOT Benchmark: "Detection Jaccard Index averaged over localization threshold" [96]. Detection is another output of an MOT task. It is simply what objects are within the frame. The detection accuracy is the portion of correct detections. Detection errors exist when ground truth detections are missed or when there are false detections.

DetA_α = |TP| / (|TP| + |FN| + |FP|)

I. DetRe: Detection Recall

The equation is given for one localization threshold; we need to average over all localization thresholds [96].

DetRe_α = |TP| / (|TP| + |FN|)

Detection recall errors are false negatives. They happen when the tracker misses an object that exists in the ground truth. Detection accuracy can be broken down into detection recall and detection precision.

J. DetPr: Detection Precision

The equation is given for one localization threshold; we need to average over all localization thresholds [96].

DetPr_α = |TP| / (|TP| + |FP|)

As mentioned previously, detection precision is part of detection accuracy. Detection precision errors are false positives. They happen when the tracker makes predictions that do not exist in the ground truth.
K. AssRe: Association Recall

We need to calculate the equation below, then average over all matching detections, and finally average the result over the localization thresholds [96].

AssRe_α = (1/|TP|) Σ_{c∈{TP}} |TPA(c)| / (|TPA(c)| + |FNA(c)|)

Association recall errors happen when the tracker assigns different predicted trajectories to the same ground-truth trajectory. Association accuracy can be broken down into association recall and association precision.

L. AssPr: Association Precision

We need to calculate the equation below, then average over all matching detections, and finally average the result over the localization thresholds [96].

AssPr_α = (1/|TP|) Σ_{c∈{TP}} |TPA(c)| / (|TPA(c)| + |FPA(c)|)

Association precision makes up part of association accuracy. Association precision errors occur when two different ground truth trajectories are given the same prediction identity.

M. MOTSA: Multi Object Tracking and Segmentation Accuracy

This is a variation of the MOTA metric, so that the tracker's performance on segmentation tasks can also be evaluated.

MOTSA = 1 − (|FN| + |FP| + |IDS|) / |M| = (|TP| − |FP| − |IDS|) / |M|

Here M is a set of N non-empty ground truth masks, each assigned a ground truth track ID. TP is the set of true positives; a true positive occurs when a hypothesized mask is mapped to a ground truth mask. FP, the false positives, are hypothesized masks without any corresponding ground truth masks, and FN, the false negatives, are ground truth masks without any corresponding hypothesized masks. IDS, the ID switches, are ground truth masks belonging to the same track but assigned different IDs.

The downsides of MOTSA include giving more importance to detection than to association and being affected greatly by the choice of matching threshold.

N. AMOTA: Average Multiple Object Tracking Accuracy

This is calculated by averaging the MOTA value over all recall values.

AMOTA = (1/L) Σ_{r ∈ {1/L, 2/L, ..., 1}} (1 − (FP_r + FN_r + IDS_r) / num_gt)

The value num_gt is the number of ground truth objects in all the frames. For a specific recall value r, the numbers of false positives, false negatives, and identity switches are denoted as FP_r, FN_r, and IDS_r. The number of recall values is denoted by L.

VI. APPLICATIONS

There is a myriad of applications for MOT. Much work has gone into tracking various objects, including pedestrians, animals, fish, vehicles, sports players, etc. Actually, the domain of multiple object tracking cannot be confined to only a few fields. But to give an idea from an application point of view, we will cover the papers depending on specific applications.

A. Autonomous Driving

Autonomous driving can be said to be the most common task in Multiple Object Tracking. In recent years, this has been a very hot topic in artificial intelligence.

Gao et al. have proposed a dual-attention network for autonomous driving where they have integrated two attention modules [97]. Fu et al. have first detected vehicles with a self-attention mechanism and then used multi-dimensional information for association. They have also handled occlusion by re-tracking the missed vehicles [62]. Pang et al. have combined vehicle detection with a Multiple Measurement Models filter (RFS-M3) based on random finite sets (RFS), introducing 3-D MOT [98]. Luo et al. have also applied 3-D MOT by proposing SimTrack, which detects and associates vehicles from point clouds captured by LiDAR.

Mackenzie et al. have done two studies: one for self-driving cars and the other for sports [99]. They have looked into the overall performance of Multiple Object Avoidance (MOA), a tool for measuring attention for action in autonomous driving. Zou et al. have proposed a lightweight framework for the full-stack perception of traffic scenes in the 2-D domain captured by roadside cameras [100]. Cho et al. have identified and tracked vehicles from traffic surveillance cameras with YOLOv4 and DeepSORT after projecting the images from local to global coordinate systems [101].

B. Pedestrian Tracking

Pedestrian tracking is one of the most frequent tasks of multiple object tracking systems. As street-camera videos are easy to capture, much work has been done regarding human or pedestrian tracking. Consequently, pedestrian tracking is considered an individual field of research.

Zhang et al. have proposed DROP (Deep Re-identification Occlusion Processing), which can re-identify occluded pedestrians with the help of the pedestrians' appearance features [102]. Sundararaman et al. have proposed HeadHunter to detect pedestrians' heads, followed by a re-identification module for tracking [46]. On the other hand, Stadler et al. have proposed an occlusion handling strategy rather than a feature-based approach, followed by a regression-based method [74]. Chen et al. have introduced a framework using Faster R-CNN, KCF trackers, and the Hungarian algorithm to detect vehicle-mounted far-infrared (FIR) pedestrians [103]. Ma et al. have made a multiple-stage framework for trajectory processing and a Siamese Bi-directional GRU (SiaBiGRU) for post-processing them [31]. They have also used a Position Projection Network for cross-camera trajectory matching.
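Detector-plus-tracker pipelines like the YOLOv4 and DeepSORT combinations mentioned above share one core step: associating current detections with existing tracks. A minimal sketch of that step, using an IoU cost matrix and the Hungarian algorithm from SciPy, is shown below; this is not the full DeepSORT, which additionally uses appearance features and Kalman-predicted boxes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def associate(tracks, detections, iou_min=0.3):
    """Hungarian matching on an IoU cost matrix; returns (track_idx, det_idx) pairs."""
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)        # minimizes total cost
    return [(r, c) for r, c in zip(rows, cols)
            if 1.0 - cost[r, c] >= iou_min]         # reject weak overlaps

tracks = [[0, 0, 10, 10], [20, 20, 30, 30]]          # boxes carried from last frame
detections = [[21, 19, 31, 29], [1, 1, 11, 11], [50, 50, 60, 60]]
print(associate(tracks, detections))  # → [(0, 1), (1, 0)]
```

Unmatched detections (here the third one) would start new tracks, and unmatched tracks would be marked as missed or occluded.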
Later on, in [104], Wang et al. have tracked pedestrians simply by using YOLOv5 for detection and DeepSORT for tracking. Patel et al. have proposed a number of algorithms regarding different aspects [105]. At first, they have created an algorithm to localize objects; then they proposed a tracking algorithm to identify any suspicious pedestrians in the crowd. There are a couple of algorithms for measuring physical distances as well.

C. Vehicle Surveillance

Vehicle surveillance is also a very important task along with autonomous driving. To monitor the activities of vehicles, MOT can be applied.

Shi et al. have introduced a motion-based tracking method along with a Gaussian local velocity (GLV) modeling method to identify the normal movement of vehicles, and also a discrimination function to detect anomalous driving [67]. Quang et al. have focused more on Vietnamese vehicles' speed detection. They have first detected traffic with YOLOv4 and estimated speed by back-projecting into a 3-D coordinate system with the Haversine method [106].

Wang et al. have used a graph convolutional neural network to associate the bounding boxes of vehicles into tracklets and proposed an embedding strategy, reconstruct-to-embed with global motion consistency, to convert the tracklets into tracks [32]. Zhang et al. have proposed a convolutional network based on YOLOv5 to solve the low recognition accuracy problem in tracking vehicles [107]. At last, Diego et al. have published a review paper on the traffic environment itself, discussing various works of multiple object tracking in the traffic domain [108].

D. Sports Player Tracking

In the age of artificial intelligence, rigorous analysis of players in any sport is one of the most important tactics. Thus MOT is used in many ways for sports player tracking.

In [109], Kalafatić et al. have tried to solve the occlusion problem in football player tracking with a typical tracking-by-detection approach. They have also mentioned some challenges, like similar appearance, the varying size of the projection of players, and changing illumination, which MOT researchers should keep in mind besides tracking. However, Naik et al. have addressed identity switching in real-world sports videos [110]. They have proposed a novel approach, DeepPlayer-Track, to track players and referees while retaining the tracking identity. They have used YOLOv4 and SORT to some extent.

In [111], Zheng et al. have argued that MOT can replace the use of hardware chips for target tracking. For long-term real-time multi-camera multi-target tracking of soccer players, they utilize the KCF algorithm, which has shown good robustness in terms of accuracy. In [112], Cioppa et al. have proposed a novel dataset of soccer videos, in which they have annotated multiple players, referees, and the ball. They have also given some baselines on that dataset. In [113], Vats et al. have introduced ice hockey video analysis. Their system can track players, identify teams, and identify individual players. Their work overcomes the challenges of camera panning and zooming in hockey broadcast videos.

E. Wild Life Tracking

One of the potential use cases of MOT is wildlife tracking. It helps wildlife researchers avoid costly sensors, which are not so reliable in some cases.

In [114], Marcos et al. have developed a UAV-based single-animal tracking system. They have used YOLOv3 with a particle filter for object tracking. Furthermore, in [115], Zhang et al. have addressed the challenge of animal motion and behavior analysis for wildlife tracking. Consequently, they have proposed AnimalTrack, a large-scale benchmark dataset for multi-animal tracking. They have also provided some baselines.

In [116], Guo et al. have proposed a method to utilize MOT to detect negative behavior of animals. As the analysis of animal behavior is very important for breeding, they have shown that, using the two very popular trackers FairMOT [54] and JDE [65], they can track groups of pigs and laying hens, which has further helped them to analyze improvements in health and welfare. However, one of the most interesting jobs is done by Ju et al. In [117], they have argued that monitoring turkey health during reproduction is very important. Thus they have proposed a method to identify the behavior of turkeys utilizing MOT. They have introduced a turkey tracker and a head tracker to identify turkey behavior.

MOT is also playing a vital role in tracking underwater entities like fish. In [118], Li et al. have proposed CMFTNet, which is implemented by applying Joint Detection and Embedding for extracting and associating features. Deformable convolution is furthermore applied to sharpen the features in complex contexts, and finally, with the help of a weight-counterpoised loss, the fish can be tracked accurately. Also, Filip et al. have analyzed some past multiple object tracking works on tracking fish [119].

F. Others

We can see the real-life application of MOT in other fields as well; MOT is not limited to some particular tasks.

In the field of visual surveillance, Ahmed et al. have presented a collaborative robotic framework that is based on SSD and YOLO for detection and a combination of a number of tracking algorithms [120]. Urbann et al. have proposed a Siamese network-based approach for online tracking under surveillance scenarios [121]. Nagrath et al. have analyzed various approaches and datasets of multiple object tracking for surveillance [122].

Robotics is a very trendy topic in today's world. In [123], Wilson et al. have introduced audio-visual object tracking (AVOT). Pereira et al. have implemented mobile robots and tracked them with typical SORT and Deep-SORT algorithms integrated with their proposed cost matrices [124].

We can also see the implementation of MOT in agriculture. To track tomato cultivation, Ge et al. have used a combination of YOLO-based shufflenetv2 as a baseline, CBAM for
the attention mechanism, BiFPN as the multi-scale fusion structure, and DeepSORT for tracking [125]. Tan et al. have also used YOLOv4 as the detector for cotton seedlings and an optical flow-based tracking method to track the seedlings [49].

MOT can also be utilized in various real-life applications such as security monitoring, social distancing monitoring, radar tracking, activity recognition, smart elderly care, criminal tracking, person re-identification, behavior analysis, and so on.

VII. FUTURE DIRECTIONS

As MOT has been a trending research topic for many years, numerous efforts have already been made on it. Still, there is a lot of scope in this field. Here we would like to point out some of the potential directions for MOT.

1) Multiple object tracking under multiple cameras is a bit challenging. The main challenge is how to fuse the scenes. If scenes from non-overlapping cameras are fused together and projected into a virtual world, then MOT can be utilized to track a target object continuously over a large area. A similar kind of effort can be seen in [31]. A relatively new multi-camera multiple people tracking dataset is also available [126]. Zhang et al. have proposed a real-time online tracking system for multi-target multi-camera tracking [127].

2) A class-based tracking system can be integrated with multiple object tracking. An MOT algorithm tries to track almost all moving objects in a frame; it would be better suited to real-life scenarios if class-based tracking were possible. For example, a bird-tracking MOT system could be very useful at airports: currently, some manual preventive mechanism is applied to prevent collisions between birds and airplanes on the runway, and this could be totally automated using a class-based MOT system. Class-based tracking also helps surveillance in many ways, because it allows a certain type of object to be tracked efficiently.

3) MOT is widely applied to 2D scenes. Though it is a challenging task, analyzing 3D videos utilizing MOT would be a good research topic. 3D tracking can provide more accurate tracking and occlusion handling: since depth information is kept in a 3D scene, it helps to overcome occlusion, one of the main challenges of MOT.

4) So far, in most papers, the transformer has been used as a black box, but transformers can be applied more specifically to solve different MOT tasks. Some approaches are based entirely on detection, with further regression applied to predict the bounding boxes of the next frame [128]. In that case, DETR [25] can be used for detection, as it has very high efficiency in detecting objects.

5) In any application, a lightweight architecture is very important for real-life deployment, because lightweight architectures are resource-efficient and real-life scenarios are mostly resource-constrained. In MOT, a lightweight architecture is also crucial if we want to deploy a model on IoT embedded devices. To track in real time, a lightweight architecture likewise plays a very important role: if we can achieve more fps without decreasing accuracy, the method can be implemented in real-life applications, where a lightweight architecture is very necessary.

6) To apply MOT in real-life scenarios, online multiple object tracking is the only possible solution; thus inference time plays a very crucial role. In recent times we observe a trend of researchers pursuing more accuracy, but if we can achieve an inference time of over thirty frames per second, then we can use MOT for real-time tracking. As real-time tracking is the key to surveillance, it is one of the major future directions for MOT researchers.

7) A trend of applying quantum computing in computer vision can be seen in recent times, and quantum computing can be used in MOT as well. Zaech et al. have published the first paper on MOT using Adiabatic Quantum Computing (AQC) with the help of the Ising model [129]. They expect that AQC can speed up the NP-hard assignment problem during association in the future. As quantum computing has very high potential in the near future, this can be a very promising domain to research.

VIII. CONCLUSION

In this paper, we have tried to compile a summary and review of recent trends in MOT within computer vision. We have tried to analyze the limitations and significant challenges. At the same time, we have found that besides major challenges like occlusion handling and ID switching, there are also some minor challenges that may sit in the driving position in terms of better precision; we have added them too. Brief theories related to each approach are included in this study, and we have tried to focus on each approach equally. We have added some popular benchmark datasets along with their insights, and we have included some possibilities for future directions based on recent MOT trends. Our observation from this study is that researchers have recently focused more on transformer-based architectures because of the contextual information memorization of the transformer. As the transformer is resource-hungry, focusing on a specific module is necessary to get better accuracy with a lightweight architecture, per our study. Finally, we hope this study will serve as a complement for researchers in the field starting their journey into Multiple Object Tracking.

REFERENCES

[1] S. K. Pal, A. Pramanik, J. Maiti, and P. Mitra, “Deep learning in multi-object detection and tracking: state of the art,” Applied Intelligence, vol. 51, no. 9, pp. 6400–6429, 2021.
[2] W. Luo, J. Xing, A. Milan, X. Zhang, W. Liu, and T.-K. Kim, “Multiple object tracking: A literature review,” Artificial Intelligence, vol. 293, p. 103448, 2021.
[3] Y. Park, L. M. Dang, S. Lee, D. Han, and H. Moon, “Multiple object tracking in deep learning approaches: A survey,” Electronics, vol. 10, no. 19, p. 2406, 2021.
[4] L. Rakai, H. Song, S. Sun, W. Zhang, and Y. Yang, “Data association in multiple object tracking: A survey of recent techniques,” Expert Systems with Applications, p. 116300, 2021.
[5] Y. Xu, Y. Ban, G. Delorme, C. Gan, D. Rus, and X. Alameda-Pineda, “Transcenter: Transformers with dense queries for multiple-object tracking,” arXiv preprint arXiv:2103.15145, 2021.
[6] P. Dendorfer, H. Rezatofighi, A. Milan, J. Shi, D. Cremers, I. Reid, S. Roth, K. Schindler, and L. Leal-Taixé, “Mot20: A benchmark for multi object tracking in crowded scenes,” arXiv preprint arXiv:2003.09003, 2020.
[7] T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer, “Trackformer: Multi-object tracking with transformers,” arXiv preprint arXiv:2101.02702, 2021.
[8] W. Huo, J. Ou, and T. Li, “Multi-target tracking algorithm based on deep learning,” in Journal of Physics: Conference Series, vol. 1948, p. 012011, IOP Publishing, 2021.
[9] A. Milan, S. H. Rezatofighi, A. Dick, I. Reid, and K. Schindler, “Online multi-target tracking using recurrent neural networks,” in Thirty-First AAAI conference on artificial intelligence, 2017.
[10] Y. Tian, A. Dehghan, and M. Shah, “On detection, data association and segmentation for multi-target tracking,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 9, pp. 2146–2160, 2018.
[11] M. Ullah and F. Alaya Cheikh, “A directed sparse graphical model for multi-target tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1816–1823, 2018.
[12] B. Yan, H. Peng, K. Wu, D. Wang, J. Fu, and H. Lu, “Lighttrack: Finding lightweight neural networks for object tracking via one-shot architecture search,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15180–15189, 2021.
[13] C.-Y. Chong, “An overview of machine learning methods for multiple target tracking,” in 2021 IEEE 24th International Conference on Information Fusion (FUSION), pp. 1–9, IEEE, 2021.
[14] L. Lin, H. Fan, Y. Xu, and H. Ling, “Swintrack: A simple and strong baseline for transformer tracking,” arXiv preprint arXiv:2112.00995, 2021.
[15] J. Bi, Z. Zhu, and Q. Meng, “Transformer in computer vision,” in 2021 IEEE International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI), pp. 178–188, IEEE, 2021.
[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[17] P. Sun, J. Cao, Y. Jiang, R. Zhang, E. Xie, Z. Yuan, C. Wang, and P. Luo, “Transtrack: Multiple object tracking with transformer,” arXiv preprint arXiv:2012.15460, 2020.
[18] X. Chen, S. M. Iranmanesh, and K.-C. Lien, “Patchtrack: Multiple object tracking using frame patches,” arXiv preprint arXiv:2201.00080, 2022.
[19] E. Yu, Z. Li, S. Han, and H. Wang, “Relationtrack: Relation-aware multiple object tracking with decoupled representation,” IEEE Transactions on Multimedia, 2022.
[20] Y. Liu, T. Bai, Y. Tian, Y. Wang, J. Wang, X. Wang, and F.-Y. Wang, “Segdq: Segmentation assisted multi-object tracking with dynamic query-based transformers,” Neurocomputing, 2022.
[21] P. Blatter, M. Kanakis, M. Danelljan, and L. Van Gool, “Efficient visual tracking with exemplar transformers,” arXiv preprint arXiv:2112.09686, 2021.
[22] D. Xing, N. Evangeliou, A. Tsoukalas, and A. Tzes, “Siamese transformer pyramid networks for real-time uav tracking,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2139–2148, 2022.
[23] X. Zhou, T. Yin, V. Koltun, and P. Krähenbühl, “Global tracking transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8771–8780, 2022.
[24] F. Zeng, B. Dong, T. Wang, X. Zhang, and Y. Wei, “Motr: End-to-end multiple-object tracking with transformer,” arXiv preprint arXiv:2105.03247, 2021.
[25] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision, pp. 213–229, Springer, 2020.
[26] X. Zhu, Y. Jia, S. Jian, L. Gu, and Z. Pu, “Vitt: vision transformer tracker,” Sensors, vol. 21, no. 16, p. 5608, 2021.
[27] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[28] H. W. Kuhn, “The hungarian method for the assignment problem,” Naval research logistics quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
[29] G. Brasó and L. Leal-Taixé, “Learning a neural solver for multiple object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6247–6257, 2020.
[30] J.-N. Zaech, A. Liniger, D. Dai, M. Danelljan, and L. Van Gool, “Learnable online graph representations for 3d multi-object tracking,” IEEE Robotics and Automation Letters, 2022.
[31] C. Ma, F. Yang, Y. Li, H. Jia, X. Xie, and W. Gao, “Deep trajectory post-processing and position projection for single & multiple camera multiple object tracking,” International Journal of Computer Vision, vol. 129, no. 12, pp. 3255–3278, 2021.
[32] G. Wang, R. Gu, Z. Liu, W. Hu, M. Song, and J.-N. Hwang, “Track without appearance: Learn box and tracklet embedding with local and global motion patterns for vehicle tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9876–9886, 2021.
[33] J. Li, X. Gao, and T. Jiang, “Graph networks for multiple object tracking,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 719–728, 2020.
[34] P. Dai, R. Weng, W. Choi, C. Zhang, Z. He, and W. Ding, “Learning a proposal classifier for multiple object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2443–2452, 2021.
[35] J. He, Z. Huang, N. Wang, and Z. Zhang, “Learnable graph matching: Incorporating graph partitioning with deep feature learning for multiple object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5299–5309, 2021.
[36] K. G. Quach, P. Nguyen, H. Le, T.-D. Truong, C. N. Duong, M.-T. Tran, and K. Luu, “Dyglip: A dynamic graph model with link prediction for accurate multi-camera multiple object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13784–13793, 2021.
[37] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in 2016 IEEE international conference on image processing (ICIP), pp. 3464–3468, IEEE, 2016.
[38] M. Keuper, S. Tang, B. Andres, T. Brox, and B. Schiele, “Motion segmentation & multiple object tracking by correlation co-clustering,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 1, pp. 140–153, 2018.
[39] H. Karunasekera, H. Wang, and H. Zhang, “Multiple object tracking with attention to appearance, structure, motion and size,” IEEE Access, vol. 7, pp. 104423–104434, 2019.
[40] P. Voigtlaender, M. Krause, A. Osep, J. Luiten, B. B. G. Sekar, A. Geiger, and B. Leibe, “Mots: Multi-object tracking and segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7942–7951, 2019.
[41] Y. Song, P. Zhang, W. Huang, Y. Zha, T. You, and Y. Zhang, “Multiple object tracking based on multi-task learning with strip attention,” IET Image Processing, vol. 15, no. 14, pp. 3661–3673, 2021.
[42] B. Wang, C. Fruhwirth-Reisinger, H. Possegger, H. Bischof, G. Cao, and E. M. Learning, “Drt: Detection refinement for multiple object tracking,” in 32nd British Machine Vision Conference: BMVC 2021, The British Machine Vision Association, 2021.
[43] C. Kim, L. Fuxin, M. Alotaibi, and J. M. Rehg, “Discriminative appearance modeling with multi-track pooling for real-time multi-object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9553–9562, 2021.
[44] Q. Wang, Y. Zheng, P. Pan, and Y. Xu, “Multiple object tracking with correlation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3876–3886, 2021.
[45] J. Pang, L. Qiu, X. Li, H. Chen, Q. Li, T. Darrell, and F. Yu, “Quasi-dense similarity learning for multiple object tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 164–173, 2021.
[46] R. Sundararaman, C. De Almeida Braga, E. Marchand, and J. Pettre, “Tracking pedestrian heads in dense crowd,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3865–3875, 2021.
[47] J. Wu, J. Cao, L. Song, Y. Wang, M. Yang, and J. Yuan, “Track to detect and segment: An online multi-object tracker,” in Proceedings of
the IEEE/CVF conference on computer vision and pattern recognition, pp. 12352–12361, 2021.
[48] C.-J. Liu and T.-N. Lin, “Det: Depth-enhanced tracker to mitigate severe occlusion and homogeneous appearance problems for indoor multiple-object tracking,” IEEE Access, 2022.
[49] C. Tan, C. Li, D. He, and H. Song, “Towards real-time tracking and counting of seedlings with a one-stage detector and optical flow,” Computers and Electronics in Agriculture, vol. 193, p. 106683, 2022.
[50] O. Kesa, O. Styles, and V. Sanchez, “Multiple object tracking and forecasting: Jointly predicting current and future object locations,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 560–569, 2022.
[51] Z. Sun, J. Chen, M. Mukherjee, C. Liang, W. Ruan, and Z. Pan, “Online multiple object tracking based on fusing global and partial features,” Neurocomputing, vol. 470, pp. 190–203, 2022.
[52] H. Liang, T. Wu, Q. Zhang, and H. Zhou, “Non-maximum suppression performs later in multi-object tracking,” Applied Sciences, vol. 12, no. 7, p. 3334, 2022.
[53] J. He, X. Zhong, J. Yuan, M. Tan, S. Zhao, and L. Zhong, “Joint re-detection and re-identification for multi-object tracking,” in International Conference on Multimedia Modeling, pp. 364–376, Springer, 2022.
[54] Y. Zhang, C. Wang, X. Wang, W. Zeng, and W. Liu, “Fairmot: On the fairness of detection and re-identification in multiple object tracking,” International Journal of Computer Vision, vol. 129, no. 11, pp. 3069–3087, 2021.
[55] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[56] N. Muller, Y.-S. Wong, N. J. Mitra, A. Dai, and M. Nießner, “Seeing behind objects for 3d multi-object tracking in rgb-d sequences,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6071–6080, 2021.
[57] C. Lv, C. Shu, Y. Lv, and C. Song, “Parallelmot: Pay more attention in tracking,” in 2021 IEEE International Conference on Computer Science, Artificial Intelligence and Electronic Engineering (CSAIEE), pp. 252–256, IEEE, 2021.
[58] J. Wan, H. Zhang, J. Zhang, Y. Ding, Y. Yang, Y. Li, and X. Li, “Dsrrtracker: Dynamic search region refinement for attention-based siamese multi-object tracking,” arXiv preprint arXiv:2203.10729, 2022.
[59] S. Guo, J. Wang, X. Wang, and D. Tao, “Online multiple object tracking with cross-task synergy,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8136–8145, 2021.
[60] T. Liang, L. Lan, X. Zhang, and Z. Luo, “A generic mot boosting framework by combining cues from sot, tracklet and re-identification,” Knowledge and Information Systems, vol. 63, no. 8, pp. 2109–2127, 2021.
[61] L. Ke, X. Li, M. Danelljan, Y.-W. Tai, C.-K. Tang, and F. Yu, “Prototypical cross-attention networks for multiple object tracking and segmentation,” Advances in Neural Information Processing Systems, vol. 34, 2021.
[62] H. Fu, J. Guan, F. Jing, C. Wang, and H. Ma, “A real-time multi-vehicle tracking framework in intelligent vehicular networks,” China Communications, vol. 18, no. 6, pp. 89–99, 2021.
[63] O. Kesa, O. Styles, and V. Sanchez, “Joint learning architecture for multiple object tracking and trajectory forecasting,” arXiv preprint arXiv:2108.10543, 2021.
[64] S. Sun, N. Akhtar, H. Song, A. Mian, and M. Shah, “Deep affinity network for multiple object tracking,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 1, pp. 104–119, 2019.
[65] W. Qin, H. Du, X. Zhang, Z. Ma, X. Ren, and T. Luo, “Joint prediction and association for deep feature multiple object tracking,” in Journal of Physics: Conference Series, vol. 2026, p. 012021, IOP Publishing, 2021.
[66] Q. Yin, Q. Hu, H. Liu, F. Zhang, Y. Wang, Z. Lin, W. An, and Y. Guo, “Detecting and tracking small and dense moving objects in satellite videos: A benchmark,” IEEE Transactions on Geoscience and Remote Sensing, 2021.
[67] H. Shi, H. Ghahremannezhad, and C. Liu, “Anomalous driving detection for traffic surveillance video analysis,” in 2021 IEEE International Conference on Imaging Systems and Techniques (IST), pp. 1–6, IEEE, 2021.
[68] S. Han, P. Huang, H. Wang, E. Yu, D. Liu, and X. Pan, “Mat: Motion-aware multi-object tracking,” Neurocomputing, 2022.
[69] Z. Zou, J. Huang, and P. Luo, “Compensation tracker: Reprocessing lost object for multi-object tracking,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 307–317, 2022.
[70] B. Shuai, A. Berneshawi, X. Li, D. Modolo, and J. Tighe, “Siammot: Siamese multi-object tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12372–12382, 2021.
[71] X. Gao, Z. Shen, and Y. Yang, “Multi-object tracking with siamese-rpn and adaptive matching strategy,” Signal, Image and Video Processing, pp. 1–9, 2022.
[72] J. Peng, T. Wang, W. Lin, J. Wang, J. See, S. Wen, and E. Ding, “Tpm: Multiple object tracking with tracklet-plane matching,” Pattern Recognition, vol. 107, p. 107480, 2020.
[73] D. M. Nguyen, R. Henschel, B. Rosenhahn, D. Sonntag, and P. Swoboda, “Lmgp: Lifted multicut meets geometry projections for multi-camera multi-object tracking,” arXiv preprint arXiv:2111.11892, 2021.
[74] D. Stadler and J. Beyerer, “Improving multiple pedestrian tracking by track management and occlusion handling,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10958–10967, 2021.
[75] E. Yu, Z. Li, and S. Han, “Towards discriminative representation: Multi-view trajectory contrastive learning for online multi-object tracking,” arXiv preprint arXiv:2203.14208, 2022.
[76] G. Wang, Y. Wang, R. Gu, W. Hu, and J.-N. Hwang, “Split and connect: A universal tracklet booster for multi-object tracking,” IEEE Transactions on Multimedia, 2022.
[77] D. M. Nguyen, R. Henschel, B. Rosenhahn, D. Sonntag, and P. Swoboda, “Lmgp: Lifted multicut meets geometry projections for multi-camera multi-object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8866–8875, 2022.
[78] X. Zhou, V. Koltun, and P. Krähenbühl, “Tracking objects as points,” in European Conference on Computer Vision, pp. 474–490, Springer, 2020.
[79] Z. Zheng, X. Yang, Z. Yu, L. Zheng, Y. Yang, and J. Kautz, “Joint discriminative and generative learning for person re-identification,” in proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2138–2147, 2019.
[80] M. Keuper, E. Levinkov, N. Bonneel, G. Lavoué, T. Brox, and B. Andres, “Efficient decomposition of image and mesh graphs by lifted multicuts,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1751–1759, 2015.
[81] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler, “Motchallenge 2015: Towards a benchmark for multi-target tracking,” arXiv preprint arXiv:1504.01942, 2015.
[82] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, “Mot16: A benchmark for multi-object tracking,” arXiv preprint arXiv:1603.00831, 2016.
[83] A. Dave, T. Khurana, P. Tokmakov, C. Schmid, and D. Ramanan, “Tao: A large-scale benchmark for tracking any object,” in European conference on computer vision, pp. 436–454, Springer, 2020.
[84] P. Sun, J. Cao, Y. Jiang, Z. Yuan, S. Bai, K. Kitani, and P. Luo, “Dancetrack: Multi-object tracking in uniform appearance and diverse motion,” arXiv preprint arXiv:2111.14690, 2021.
[85] P. Voigtlaender, L. Luo, C. Yuan, Y. Jiang, and B. Leibe, “Reducing the annotation effort for video object segmentation datasets,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3060–3069, 2021.
[86] M. Weber, J. Xie, M. Collins, Y. Zhu, P. Voigtlaender, H. Adam, B. Green, A. Geiger, B. Leibe, D. Cremers, et al., “Step: Segmenting and tracking every pixel,” arXiv preprint arXiv:2102.11859, 2021.
[87] M. Fabbri, G. Brasó, G. Maugeri, O. Cetintas, R. Gasparini, A. Ošep, S. Calderara, L. Leal-Taixé, and R. Cucchiara, “Motsynth: How can synthetic data help pedestrian detection and tracking?,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10849–10859, 2021.
[88] M. Pedersen, J. B. Haurum, S. H. Bengtson, and T. B. Moeslund, “3d-zef: A 3d zebrafish tracking benchmark dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2426–2436, 2020.
[89] S. Anjum and D. Gurari, “Ctmc: Cell tracking with mitosis detection dataset challenge,” in Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition Workshops, pp. 982–983, 2020.
[90] L. Patino, T. Cane, A. Vallee, and J. Ferryman, “Pets 2016: Dataset and challenge,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–8, 2016.
[91] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE conference on computer vision and pattern recognition, pp. 3354–3361, IEEE, 2012.
[92] M. Andriluka, S. Roth, and B. Schiele, “Monocular 3d pose estimation and tracking by detection,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 623–630, IEEE, 2010.
[93] J. Ferryman and A. Shahrokni, “Pets2009: Dataset and challenge,” in 2009 Twelfth IEEE international workshop on performance evaluation of tracking and surveillance, pp. 1–6, IEEE, 2009.
[94] M. Andriluka, S. Roth, and B. Schiele, “People-tracking-by-detection and people-detection-by-tracking,” in 2008 IEEE Conference on computer vision and pattern recognition, pp. 1–8, IEEE, 2008.
[95] K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: the clear mot metrics,” EURASIP Journal on Image and Video Processing, vol. 2008, pp. 1–10, 2008.
[96] J. Luiten, A. Osep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, and B. Leibe, “Hota: A higher order metric for evaluating multi-object tracking,” International journal of computer vision, vol. 129, no. 2, pp. 548–578, 2021.
[97] M. Gao, L. Jin, Y. Jiang, and J. Bie, “Multiple object tracking using a dual-attention network for autonomous driving,” IET Intelligent Transport Systems, vol. 14, no. 8, pp. 842–848, 2020.
[98] S. Pang, D. Morris, and H. Radha, “3d multi-object tracking using random finite set-based multiple measurement models filtering (rfs-m 3) for autonomous vehicles,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 13701–13707, IEEE, 2021.
[99] A. K. Mackenzie, M. L. Vernon, P. R. Cox, D. Crundall, R. C. Daly, D. Guest, A. Muhl-Richardson, and C. J. Howard, “The multiple object avoidance (moa) task measures attention for action: Evidence from driving and sport,” Behavior research methods, vol. 54, no. 3, pp. 1508–1529, 2022.
[100] Z. Zou, R. Zhang, S. Shen, G. Pandey, P. Chakravarty, A. Parchami, and H. X. Liu, “Real-time full-stack traffic scene perception for autonomous driving with roadside cameras,” arXiv preprint arXiv:2206.09770, 2022.
[101] K. Cho and D. Cho, “Autonomous driving assistance with dynamic objects using traffic surveillance cameras,” Applied Sciences, vol. 12, no. 12, p. 6247, 2022.
[102] X. Zhang, X. Wang, and C. Gu, “Online multi-object tracking with pedestrian re-identification and occlusion processing,” The Visual Computer, vol. 37, no. 5, pp. 1089–1099, 2021.
[103] H. Chen, W. Cai, F. Wu, and Q. Liu, “Vehicle-mounted far-infrared pedestrian detection using multi-object tracking,” Infrared Physics & Technology, vol. 115, p. 103697, 2021.
[104] Y. Wang and H. Yang, “Multi-target pedestrian tracking based on yolov5 and deepsort,” in 2022 IEEE Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC), pp. 508–514, IEEE, 2022.
[105] A. S. Patel, R. Vyas, O. Vyas, M. Ojha, and V. Tiwari, “Motion-compensated online object tracking for activity detection and crowd behavior analysis,” The Visual Computer, pp. 1–21, 2022.
[106] P. H. Quang, P. P. Thanh, T. N. Van Anh, S. V. Phi, B. Le Nhat, and H. N. Trong, “Vietnamese vehicles speed detection with video-based and deep learning for real-time traffic flow analysis system,” in 2021 15th International Conference on Advanced Computing and Applications (ACOMP), pp. 62–69, IEEE, 2021.
[107] K. Zhang, C. Wang, X. Yu, A. Zheng, M. Gao, Z. Pan, G. Chen, and Z. Shen, “Research on mine vehicle tracking and detection technology based on yolov5,” Systems Science & Control Engineering, vol. 10, no. 1, pp. 347–366, 2022.
[108] D. M. Jiménez-Bravo, Á. L. Murciego, A. S. Mendes, H. S. San Blás, and J. Bajo, “Multi-object tracking in traffic environments: A systematic literature review,” Neurocomputing, 2022.
[109] Z. Kalafatić, T. Hrkać, and K. Brkić, “Multiple object tracking for football game analysis,” in 2022 45th Jubilee International Convention on Information, Communication and Electronic Technology (MIPRO), pp. 936–941, IEEE, 2022.
[110] B. T. Naik, M. F. Hashmi, Z. W. Geem, and N. D. Bokde, “Deepplayer-track: Player and referee tracking with jersey color recognition in soccer,” IEEE Access, vol. 10, pp. 32494–32509, 2022.
[111] B. Zheng, “Soccer player video target tracking based on deep learning,” Mobile Information Systems, vol. 2022, 2022.
[112] A. Cioppa, S. Giancola, A. Deliege, L. Kang, X. Zhou, Z. Cheng, B. Ghanem, and M. Van Droogenbroeck, “Soccernet-tracking: Multiple object tracking dataset and benchmark in soccer videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3491–3502, 2022.
[113] K. Vats, P. Walters, M. Fani, D. A. Clausi, and J. Zelek, “Player tracking and identification in ice hockey,” arXiv preprint arXiv:2110.03090, 2021.
[114] J. T. Marcos and S. W. Utete, “Animal tracking within a formation of drones,” in 2021 IEEE 24th International Conference on Information Fusion (FUSION), pp. 1–8, IEEE, 2021.
[115] L. Zhang, J. Gao, Z. Xiao, and H. Fan, “Animaltrack: A large-scale benchmark for multi-animal tracking in the wild,” arXiv preprint arXiv:2205.00158, 2022.
[116] Q. Guo, Y. Sun, L. Min, A. van Putten, E. F. Knol, B. Visser, T. Rodenburg, L. Bolhuis, P. Bijma, et al., “Video-based detection and tracking with improved re-identification association for pigs and laying hens in farms,” in Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, SciTePress, 2022.
[117] S. Ju, M. A. Erasmus, F. Zhu, and A. R. Reibman, “Turkey behavior identification using video analytics and object tracking,” in 2021 IEEE International Conference on Image Processing (ICIP), pp. 1219–1223, IEEE, 2021.
[118] W. Li, F. Li, and Z. Li, “Cmftnet: Multiple fish tracking based on counterpoised jointnet,” Computers and Electronics in Agriculture, vol. 198, p. 107018, 2022.
[119] F. Děchtěrenko, D. Jakubková, J. Lukavskỳ, and C. J. Howard, “Tracking multiple fish,” PeerJ, vol. 10, p. e13031, 2022.
[120] I. Ahmed, S. Din, G. Jeon, F. Piccialli, and G. Fortino, “Towards collaborative robotics in top view surveillance: A framework for multiple object tracking by detection using deep learning,” IEEE/CAA Journal of Automatica Sinica, vol. 8, no. 7, pp. 1253–1270, 2020.
[121] O. Urbann, O. Bredtmann, M. Otten, J.-P. Richter, T. Bauer, and D. Zibriczky, “Online and real-time tracking in a surveillance scenario,” arXiv preprint arXiv:2106.01153, 2021.
[122] P. Nagrath, N. Thakur, R. Jain, D. Saini, N. Sharma, and J. Hemanth, “Understanding new age of intelligent video surveillance and deeper analysis on deep learning techniques for object tracking,” in IoT for Sustainable Smart Cities and Society, pp. 31–63, Springer, 2022.
[123] J. Wilson and M. C. Lin, “Avot: Audio-visual object tracking of multiple objects for robotics,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 10045–10051, IEEE, 2020.
[124] R. Pereira, G. Carvalho, L. Garrote, and U. J. Nunes, “Sort and deep-sort based multi-object tracking for mobile robotics: Evaluation with new data association metrics,” Applied Sciences, vol. 12, no. 3, p. 1319, 2022.
[125] Y. Ge, S. Lin, Y. Zhang, Z. Li, H. Cheng, J. Dong, S. Shao, J. Zhang, X. Qi, and Z. Wu, “Tracking and counting of tomato at different growth period using an improving yolo-deepsort network for inspection robot,” Machines, vol. 10, no. 6, p. 489, 2022.
[126] X. Han, Q. You, C. Wang, Z. Zhang, P. Chu, H. Hu, J. Wang, and Z. Liu, “Mmptrack: Large-scale densely annotated multi-camera multiple people tracking benchmark,” arXiv preprint arXiv:2111.15157, 2021.
[127] X. Zhang and E. Izquierdo, “Real-time multi-target multi-camera tracking with spatial-temporal information,” in 2019 IEEE Visual Communications and Image Processing (VCIP), pp. 1–4, IEEE, 2019.
[128] P. Bergmann, T. Meinhardt, and L. Leal-Taixe, “Tracking without bells and whistles,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 941–951, 2019.
[129] J.-N. Zaech, A. Liniger, M. Danelljan, D. Dai, and L. Van Gool, “Adiabatic quantum computing for multi object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8811–8822, 2022.