
Target tracking with incomplete detection

2009

Computer Vision and Image Understanding 113 (2009) 580–587
Contents lists available at ScienceDirect. Journal homepage: www.elsevier.com/locate/cviu

Yunqian Ma (a,*), Qian Yu (b), Isaac Cohen (a)
(a) Honeywell Labs, 1985 Douglas Drive North, Golden Valley, MN 55422, USA
(b) Sarnoff Corporation, 201 Washington Rd, Princeton, NJ 08536, USA

Article history: received 29 August 2006; accepted 21 January 2009; available online 30 January 2009.
Keywords: multiple target tracking; split and merge of detected regions; maximum a posteriori.

Abstract

In this paper, we address the multiple target tracking problem as a maximum a posteriori problem. We adopt a graph representation of all observations over time. To make full use of the visual observations from the image sequence, we introduce both motion and appearance likelihoods. The multiple target tracking problem is formulated as finding multiple optimal paths in the graph. Due to noisy foreground segmentation, an object may be represented by several foreground regions and, similarly, one foreground region may correspond to multiple objects. To deal with this problem, we propose merge, split and mean shift operations that add new hypotheses to the measurement graph. The proposed approach uses a sliding window framework that aggregates information across a fixed number of frames. Experimental results on both indoor and outdoor data sets are reported. Furthermore, we provide a comparison between the proposed approach and existing methods that do not merge/split detected blobs. © 2009 Elsevier Inc. All rights reserved.

1. Introduction

Multiple target tracking is a key component in visual surveillance. Tracking provides a spatio-temporal description of the detected moving regions in the scene; this low-level information is critical for the recognition of human actions in video surveillance.
In the considered visual tracking problem, the observations used are the detected moving blobs. Incomplete observations due to occlusions, stop-and-go motion, or noisy foreground detections constitute the main limitation of blob-based tracking methods. We propose a tracking method that allows splitting and merging of detected moving regions, as well as re-acquiring moving targets after a stop-and-go motion or an occlusion. Several problems need to be addressed by a tracking algorithm. A single moving object (e.g. one person) can be detected as multiple moving blobs; in this case the tracking algorithm needs to 'merge' the detected blobs. Similarly, one detected blob can be composed of multiple moving objects; in this case the tracking algorithm needs to 'split' and segment the detected blob into the corresponding moving objects. The split and merge of detected blobs has to be robust to partial or total occlusions, as well as capable of differentiating detected moving regions of nearby objects. Stop-and-go motion, or non-detection due to the similarity of the object to the background, may require the tracker to re-acquire the target. Moreover, the detected blobs could be due to erroneous motion detection; here the tracking algorithm needs to filter these observations in the presence of static or dynamic occlusions of the moving objects in the scene. Finally, the number of moving objects in the scene varies as new moving objects enter or leave the field of view of the camera. A large number of tracking algorithms have been developed in the past decades; the interested reader can refer to [24] for a recent comprehensive survey of the field.

(* Corresponding author. E-mail addresses: yunqian.ma@honeywell.com (Y. Ma), qyu@sarnoff.com (Q. Yu), isaac.cohen@honeywell.com (I. Cohen). doi:10.1016/j.cviu.2009.01.002)
Several data association tracking algorithms have been proposed, ranging from simple nearest neighbor association to the complex multiple hypothesis tracker [9,10]. The Probabilistic Data Association (PDA) method [14], which is considered a good compromise between performance and complexity, uses a weighted average of all the measurements within a track's validation gate [15] to estimate the target state. The PDA method treats multiple targets as independent objects in terms of observations, and is therefore less suitable for situations where multiple observations correspond to a single target and vice versa. The JPDAF [7,8] is an extension of the PDA in which the measurement-to-target association probabilities are evaluated jointly across the targets. The Multiple Hypothesis Tracker (MHT) was first developed by Reid [9]; it propagates multiple hypotheses over time. Ranking the hypotheses [3] requires evaluating all existing hypotheses, and thus pruning and merging [4] were used to reduce the set of hypotheses to a manageable size. A class of maximum likelihood methods seeks single or multiple best paths in an observation graph [2,5]; however, these methods assume no missing detections and a known number of objects. Most existing data association algorithms cannot address the merging or splitting of observations for an accurate estimation of the target state. The multiple hypothesis trackers are the most widely used; however, these methods assume a one-to-one mapping between observations and targets. An attempt to extend these frameworks to merge and split behaviors was proposed in [16], which introduced the concept of virtual measurements to represent the splitting and merging of detected regions; however, the association was inferred using a brute force method.
Senior [17] performs multi-object segmentation using a probabilistic pixel classification algorithm, which uses the appearance model to compute the likelihood of a pixel belonging to a particular object; an iterative approach finds the front-most model first, deletes it from the foreground, and then fits the second object. Wu et al. [18] define an occlusion relation parameter to address the blob splitting problem.

In this paper, we formulate the multiple target tracking problem as a maximum a posteriori (MAP) problem. We expand the set of observations provided by a motion blob detector with hypotheses added by merge, split and mean shift operations, which are designed to deal with noisy foreground segmentation due to occlusions, foreground fragments and missing detections. All these added hypotheses are validated during the MAP estimation.

The remainder of this paper is organized as follows. In Section 2, we present the MAP formulation of the multiple target tracking problem, using joint motion and appearance probabilities. In Section 3, we present the proposed tracking method using merge, split and mean shift operations. In Section 4, we present experimental results obtained on real indoor and outdoor video sequences, and compare the proposed tracking method with existing methods. Finally, in Section 5 we conclude and outline future work.

2. Multiple target tracking formulation

In a multiple target tracking problem, the objective is to track multiple target trajectories over time given a set of noisy measurements provided by a motion detection algorithm, such as [19]. The observations considered are blobs that cannot be regarded as complete observations; furthermore, the targets' positions and velocities are automatically initialized and do not require operator interaction.
The detector usually provides image blobs which contain the estimated location and size as well as the appearance information. Within an arbitrary time span $[0,T]$, there is an unknown number $K$ of targets in the monitored scene. Let $y_t = \{y_t^i : i = 1, \ldots, m_t\}$ denote the observations at time $t$; $Y = \cup_{t \in \{1,\ldots,T\}} y_t$ is the set of all observations within $[0,T]$. Multiple target tracking can be formulated as finding the set of $K$ best paths $\{s_1, s_2, \ldots, s_K\}$ in the temporal and spatial space, where $K$, the number of moving objects or targets in the scene, is unknown. We denote a track by the set of its observations: $s_k = \{s_k(t) : t \in [1,T]\}$, where $s_k(t) \in y_t$ represents the observation of track $s_k$ at time $t$.

2.1. Graph representation and MAP estimate

We utilize a graph representation $G = \langle V, E \rangle$ of all measurements within $[0,T]$. The graph is directed, with a set of nodes $V = \{y_t^k : t = 1, \ldots, T;\ k = 1, \ldots, K\}$. Each node corresponds to a detected moving region; in addition, a special node $y_t^0$ represents the null measurement at time $t$, which corresponds to missed detections. A directed edge $(y_{t_1}^i, y_{t_2}^j) \in E$, $t_1 < t_2$, is defined between two nodes based on the proximity and similarity of the corresponding detected blobs. The weight or cost associated to an edge is computed using the motion and appearance models described in the following paragraphs. To reduce the number of edges in the graph, we retain only edges for which the similarity of motion and appearance between the two nodes exceeds a pre-determined threshold; this is similar to the gating in [15]. An example of such a graph is shown in Fig. 1. At each time instant there are $m_t$ observations; an observation that does not belong to any track represents a false alarm, and the shaded node represents a missing observation, inferred by the tracking.

Fig. 1. Graph representation of measurements.

We formulate the multiple target tracking problem as a MAP problem. Finding the $K$ best paths $s_{1,\ldots,K}$ through the graph $G$ amounts to finding

$s_{1,\ldots,K} = \arg\max P(s_{1,\ldots,K} \mid Y)$  (1)

The posterior of the $K$ paths can be expressed through the observation likelihood of the $K$ paths and their prior:

$P(s_{1,\ldots,K} \mid Y) \propto P(Y \mid s_{1,\ldots,K})\, P(s_{1,\ldots,K})$  (2)

so the MAP estimate becomes

$s_{1,\ldots,K} = \arg\max P(Y \mid s_{1,\ldots,K})\, P(s_{1,\ldots,K})$  (3)

First, we describe the likelihood of the $K$ paths. To make full use of the available visual cues, we consider both motion and appearance likelihood measures. By assuming that each target moves independently, the joint likelihood of the $K$ paths over $[1,T]$ can be represented as:

$P(Y \mid s_{1,\ldots,K}) = \prod_{k=1}^{K} P_{motion}(s_k(1), \ldots, s_k(T))\, P_{color}(s_k(1), \ldots, s_k(T))$  (4)

The joint likelihood defined in Eq. (4) essentially represents the smoothness of the tracks in both appearance and motion. If the number of objects is known, we can simply apply a $K$ shortest disjoint paths algorithm [21] to solve this problem; when $K = 1$, the tracking algorithm reduces to the Viterbi algorithm [5]. However, the number of targets varies and is usually unknown, so we need a prior distribution to regularize the likelihood and avoid overfitting the observations.

Second, the prior model of $P(s_k : k = 1, \ldots, K)$ is represented as follows:

$P(s_{1,\ldots,K}) = \prod_{k=1}^{K} \left(\frac{p_d}{1 - p_d}\right)^{|s_k|} \prod_{t=1}^{T} \lambda_f^{f_t} \exp(-C(t))$  (5)

where $|s_k|$ is the number of measurements associated to track $k$, and $p_d$ denotes the detection rate, which can be determined from prior knowledge of the detection procedure. We assume that the number of false alarms $f_t$ follows a Poisson distribution with parameter $\lambda_f$.
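As a concrete illustration, the gated observation graph of Section 2.1 can be built as sketched below. This is a minimal sketch, not the authors' implementation: `likelihood` stands in for the motion and appearance models of Sections 2.2 and 2.3, and the gate value is an arbitrary placeholder.

```python
import math

def edge_cost(p_motion, p_color):
    """Edge cost as the minus log of the joint motion/appearance likelihood,
    so that maximizing a path's likelihood equals finding a shortest path."""
    return -math.log(p_motion) - math.log(p_color)

def build_graph(observations, likelihood, gate=1e-3):
    """observations: dict mapping time t -> list of blob ids.
    likelihood(a, b): placeholder returning (p_motion, p_color) for two blobs.
    Returns adjacency: blob -> list of (successor blob, edge cost); edges whose
    joint likelihood falls below the gating threshold are dropped, as in [15]."""
    adj = {}
    times = sorted(observations)
    for i, t1 in enumerate(times):
        for t2 in times[i + 1:]:
            for a in observations[t1]:
                for b in observations[t2]:
                    p_m, p_c = likelihood(a, b)
                    if p_m * p_c > gate:
                        adj.setdefault(a, []).append((b, edge_cost(p_m, p_c)))
    return adj
```

Because each edge cost is the minus log of the joint likelihood, maximizing the likelihood of $K$ paths in Eq. (4) is equivalent to finding $K$ shortest disjoint paths in such a graph.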
$C(t)$ represents the overlap between two different tracks at each time instant, which can be written as

$C(t) = \sum_{s_i(t) \cap s_j(t) \neq \emptyset} \frac{|s_i(t) \cap s_j(t)|}{|s_i(t) \cup s_j(t)|}$  (6)

where $s_k(t)$ is the observation associated to track $k$ at time $t$. Directly optimizing Eq. (3) is computationally impractical for real-time applications. We seek a suboptimal solution as

$s_{1,\ldots,K} = \arg\max_{k_l \le k \le k_u} \left(\arg\max_{s_{1,\ldots,k}} P(Y \mid s_{1,\ldots,k})\right) P(s_{1,\ldots,k})$  (7)

where $k_l$ and $k_u$ are the lower and upper bounds on the number of targets. The optimization is thus reduced to two steps: first find the optimal tracks given the number of tracks, then compute the prior weight of the candidate tracks and select the best weighted solution. Note that the joint likelihood is the product of the appearance and motion likelihoods, while the weight of the edges along a track is the sum of the minus log likelihoods; maximizing the joint likelihood is therefore equivalent to finding the $K$ shortest disjoint paths. Given the number of tracks, the optimal tracks are computed using [21] (see Appendix A for details). An overview of the approach is given in Algorithm 1. The motion and appearance likelihoods are defined in Sections 2.2 and 2.3, and the construction and augmentation of the observation graph is described in Section 3.

Algorithm 1. Overview of the multiple target tracking algorithm
  Input: $Y$, $k_l$, $k_u$
  Output: $K$, $s_{1,\ldots,K}$
  1. Build the observation graph
  2. Augment the observation graph with hypotheses
  3. Compute the optimal tracks:
     for $k = k_l$ to $k_u$ do
       find $s_{1,\ldots,k} = \arg\max P(Y \mid s_{1,\ldots,k})$
       compute $P(Y \mid s_{1,\ldots,k})\, P(s_{1,\ldots,k})$
     end for
  4. Return the tracks with the best combined weight as the solution

2.2. Motion model

Before discussing the motion and appearance likelihoods, we first introduce the ground plane assumption upon which the motion model is defined.

2.2.1. Ground plane assumption

We assume the targets are moving on a ground plane, as shown in Fig. 2a. The detected blobs are shown in Fig. 2b, and the ground plane assumption provides a rough estimate of the average targets' height, as in Fig. 2c.

Fig. 2. Illustration of the ground plane model.

The homography between the image plane and the ground plane can be represented as follows:

$[l_x, l_y, 1]^T = H\, [g_x, g_y, 1]^T$  (8)

where $(l_x, l_y, 1)$ and $(g_x, g_y, 1)$ are locations on the image plane and the ground plane, respectively. The homography matrix $H$ can be estimated by least squares from four point correspondences between the image plane and the ground plane. We assume that the contact point of a target with the ground plane corresponds to the bottom point of the detected image blob; given the homography, projecting this contact point to the ground plane yields the location of the target on the ground plane. The relationship between the average target height $h$ and the position in the image plane is modeled as a linear combination of $l_x, l_y, 1$: $h = a\,[l_x, l_y, 1]^T$, where $a$ is a $1 \times 3$ vector. To estimate the average height, we select several foreground images which only contain blobs with correct height and use a least squares fit. If a ground plane is available, we use a constant velocity motion model in both the 2D image plane and the 3D ground plane; otherwise only in the 2D image plane. We denote by $x_t^k$ the state vector of target $k$ at time $t$, $x_t^k = [l_x, l_y, w, h, \dot l_x, \dot l_y, l_{gx}, l_{gy}, \dot l_{gx}, \dot l_{gy}]$ (location, width, height and velocity in the 2D image, and location and velocity on the ground plane). We consider a linear kinematic model:

$x_{t+1}^k = A^k x_t^k + w^k$  (9)

where $A^k$ is the transition matrix given below.
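For illustration, the homography of Eq. (8) and the projection of a blob's contact point to the ground plane can be sketched as follows. This is a minimal sketch using NumPy; the function names and the four-point least-squares (DLT) formulation are our assumptions, not the paper's code.

```python
import numpy as np

def fit_homography(img_pts, ground_pts):
    """Least-squares homography H with [lx, ly, 1]^T = H [gx, gy, 1]^T (Eq. (8)),
    from >= 4 point correspondences, via the direct linear transform (DLT)."""
    A = []
    for (lx, ly), (gx, gy) in zip(img_pts, ground_pts):
        A.append([gx, gy, 1, 0, 0, 0, -lx * gx, -lx * gy, -lx])
        A.append([0, 0, 0, gx, gy, 1, -ly * gx, -ly * gy, -ly])
    # The null vector of A (last right-singular vector) is H up to scale.
    _, _, Vt = np.linalg.svd(np.asarray(A, float))
    return Vt[-1].reshape(3, 3)

def to_ground(H, contact_point):
    """Project the bottom (contact) point of a blob to the ground plane by
    inverting Eq. (8) and normalizing the homogeneous coordinate."""
    g = np.linalg.solve(H, [contact_point[0], contact_point[1], 1.0])
    return g[:2] / g[2]
```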
In $2 \times 2$ block form, with the state blocks ordered as (image position, size, image velocity, ground position, ground velocity), the constant-velocity transition matrix is

$A^k = \begin{pmatrix} I_2 & 0 & I_2 & 0 & 0 \\ 0 & I_2 & 0 & 0 & 0 \\ 0 & 0 & I_2 & 0 & 0 \\ 0 & 0 & 0 & I_2 & I_2 \\ 0 & 0 & 0 & 0 & I_2 \end{pmatrix}$  (10)

i.e., each position is advanced by the corresponding velocity and all other components are kept. We assume $w^k$ to be normally distributed, $w^k \sim N(0, Q^k)$. The observation $y_t^k = [u_x, u_y, w, h, u_{gx}, u_{gy}]$ contains the measurement of a target's position and size in the 2D image plane and its position on the 3D ground plane. Since observations often contain false alarms, the observation model is represented as:

$y_t^k = \begin{cases} H^k x_t^k + v^k & \text{if it belongs to a target} \\ d_t & \text{if it is a false alarm} \end{cases}$  (11)

where $y_t^k$ represents the measurement, which may arise either from a false alarm or from the target, and $d_t$ is the false alarm rate at time $t$. We assume $v^k$ to be normally distributed, $v^k \sim N(0, R^k)$. According to the definition of the measurement and state vectors, the measurement matrix $H^k$ selects the measured components of the state; in the same block notation,

$H^k = \begin{pmatrix} I_2 & 0 & 0 & 0 & 0 \\ 0 & I_2 & 0 & 0 & 0 \\ 0 & 0 & 0 & I_2 & 0 \end{pmatrix}$  (12)

The measurement is modeled as a linear function of the current state if it belongs to a target; otherwise it is modeled as a false alarm $d_t$, which is assumed to be uniformly distributed.

Let $\hat s_k(t_i)$ denote the a posteriori state estimate and $\hat P_t(s_k)$ the a posteriori estimate of the error covariance matrix of $s_k$ at time $t$. Along a track $s_k$, the motion likelihood of an edge $(s_k(t_1), s_k(t_2)) \in E$, $t_1 < t_2$, can be represented as $P_{motion}(s_k(t_2) \mid \hat s_k(t_1))$. Given the transition and observation models of the Kalman filter, the motion likelihood can be written as:

$P_{motion}(s_k(t_2) \mid \hat s_k(t_1)) = \frac{1}{(2\pi)^{3/2} \det(\hat P_{t_2}(s_k))^{1/2}} \exp\left(-\frac{e^T \hat P_{t_2}^{-1}(s_k)\, e}{2}\right)$  (13)

where $e = y_{t_2}^k - H A^{t_2 - t_1} \hat s_k(t_1)$, and $\hat P_{t_2}(s_k)$ can be computed recursively by a Kalman filter as $\hat P_{t_2}(s_k) = H (A \hat P_{t_2 - 1}(s_k) A^T + Q) H^T + R$.

2.3. Appearance model

In this section, we present the appearance likelihood. An observation's appearance can be represented by a set of distinctive features such as color, shape, or texture; a good appearance model, such as [11], can be invariant to 2D rigid transformations and scale changes over a wide range of transformations and resolutions. In this paper, we use a simple color appearance model to illustrate the proposed algorithm; of course, other appearance models such as [11] can be used within this framework as well. In the simple color appearance model, the appearance of each detected region is modeled using a non-parametric color histogram, with all RGB bins concatenated to form a one-dimensional histogram. The appearance likelihood between two image blobs $(s_k(t_1), s_k(t_2)) \in E$, $t_1 < t_2$, in track $k$ is measured using the symmetric Kullback-Leibler (KL) divergence, defined as follows:

$P_{color}(s_k(t_2) \mid \hat s_k(t_1)) = \frac{1}{2} \sum_{c = r,g,b} (P_i(c) - P_j(c)) \log \frac{P_i(c)}{P_j(c)}$  (14)

where $P_i$ and $P_j$ are the color histograms of the two blobs. Using the motion and appearance models, we associate a weight to each edge defined between two nodes of the graph. This weight combines the appearance and motion likelihood models presented in the previous paragraphs. In Eqs. (11) and (14), we assume that the state of a target at time $t$ is determined by the previous state at time $t-1$, and that the observation at time $t$ is a function of the state at time $t$ alone, i.e., the Markov condition. Thus the joint likelihood of the $K$ paths in Eq. (4) can be factorized as follows:

$P(Y \mid s_{1,\ldots,K}) = \prod_{k=1}^{K} \prod_{(t_1, t_2) \in s_k} P_{motion}(s_k(t_2) \mid \hat s_k(t_1))\, P_{color}(s_k(t_2) \mid \hat s_k(t_1))$  (15)

where $(s_k(t_1), s_k(t_2))$ represents an edge in track $k$.

3. Merge, split and re-acquisition

Graph representations [2,22] are widely used to organize the relationship between multiple observations across multiple frames: nodes represent the detected moving regions and edges encode the motion and appearance likelihood of two observations in two frames belonging to one track. With this representation, the trajectories of multiple targets correspond to paths in the graph, and the multiple target tracking problem is to recover these paths. Most multiple target tracking algorithms [2,6,10] assume that no two paths pass through the same observation. This assumption is reasonable for complete observations; however, it is usually violated in visual tracking, where targets cannot be regarded as points and the inputs to the tracking algorithm are image blobs. Before finding the shortest disjoint paths, we therefore aim to recover the cases of merged, split or missing observations by adding new hypotheses to the observation graph. To form reasonable hypotheses, we generate candidate tracks by connecting observation nodes with sufficiently large motion and appearance likelihood; the threshold for forming candidate tracks is quite strict. The estimated positions of these candidate tracks are then used to generate new hypotheses. In the following paragraphs we present an extension of the framework to handle split and merge behaviors when estimating the best paths.

3.1. Merge and split hypothesis

The proposed merge and split behaviors correspond to a recursive association of new observations, given estimated trajectories. Using the estimated candidate tracks, we can evaluate how well the $m_{t+1}$ observations $\{y_{t+1}^i : i = 1, \ldots, m_{t+1}\}$ at time $t+1$ fit the estimated tracks ending at time $t$. The spatial overlap between the estimated state at time $t$ and a new observation is considered as a primary cue, and we consider the following two cases:
• If the prediction $\tilde s_k(t+1)$ of a track has a sufficient spatial overlap with more than one observation at time $t+1$, a merge operation is triggered, which merges those observations at time $t+1$ into one new observation. This new observation, carrying the merge hypothesis, is added to the graph. The merge operation is illustrated in Fig. 3a.

• If the predicted positions and shapes of more than one track spatially overlap with a single observation $y_{t+1}$ at time $t+1$ (let $J$, $|J| > 1$, be this set of candidate tracks), a split operation is triggered, which splits the node $y_{t+1}$ into several observations. These observations, which encode the split hypothesis, are added to the observation set at time $t+1$. The split operation proceeds as follows. For each track $s_k$ in $J$ whose prediction has a sufficient overlap with $y_{t+1}$:
  - adjust the predicted size and location at time $t+1$ to find the best appearance score $c_k = P_{color}(\tilde s_k(t+1), y_{t+1})$;
  - create a new observation node for the track with the largest $c_k$ and add it to the graph;
  - reduce the confidence of the area occupied by the newly added node and recompute the score $c_k$ for each track left in $J$.
  This process is iterated until all candidate tracks in $J$ that overlap with the observation $y_{t+1}$ have been tested. The split operation is illustrated in Fig. 3b.

3.2. Mean shift hypothesis

In this section, we present the mean shift hypothesis. Regions may go undetected, for example in the case of stop-and-go motion. In such cases, we propose to incorporate additional information from the images to improve appearance-based tracking. Since at each time $t$ we already maintain the appearance histogram of each candidate track, we introduce a mean shift operation to keep track of this appearance distribution when the motion blobs do not provide good enough input.
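The overlap tests of Section 3.1 that trigger the merge and split operations can be sketched as follows; the box layout (x1, y1, x2, y2), the overlap threshold and the function names are illustrative assumptions, not the paper's implementation.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def propose_hypotheses(predictions, observations, overlap=0.3):
    """Return ('merge', track, obs_list) when one predicted track overlaps
    several observations, and ('split', track_list, obs) when several
    predicted tracks overlap a single observation."""
    proposals = []
    for tk, pred in predictions.items():
        hits = [o for o, box in observations.items() if iou(pred, box) > overlap]
        if len(hits) > 1:
            proposals.append(("merge", tk, hits))
    for o, box in observations.items():
        owners = [tk for tk, pred in predictions.items() if iou(pred, box) > overlap]
        if len(owners) > 1:
            proposals.append(("split", owners, o))
    return proposals
```

In the full algorithm, each proposal would be turned into a new observation node in the graph and validated later by the MAP estimation.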
The mean shift method [13], which can be regarded as a mode-seeking process, has been successfully applied to visual tracking. The central module of a mean shift tracker uses mean shift iterations to find the most probable target position in the current frame according to the previous target appearance histogram. In our multiple target tracking problem, if a candidate track is not associated with an observation at time $t$ due to low motion and appearance likelihood (caused by a fragmented detection, a missed detection, a large mismatch in size, etc.), we instantiate a mean shift procedure to propose the most probable target position given the appearance histogram of the track. A color histogram of the candidate track is used in the mean shift iterations; note that this histogram is established using the past observations along the path (within the sliding window) rather than only the latest one, where past observations are simply the observations (image blobs) in the track before time $t$. Using the position predicted by mean shift, we add a new observation to the graph; the final decision is made by considering all the observations in the graph. To prevent mean shift from continuing to track a target after it leaves the field of view, the mean shift hypothesis is considered only for trajectories in which the ratio of real nodes to the total number of observations along the track exceeds a threshold. Since new hypotheses are added to the graph, one track can contain both real nodes (obtained from motion segmentation) and hypothesis nodes; the real-node ratio measures how many real observation nodes a track contains, and thresholding it prevents too many mean shift hypotheses from being added.

4. Experimental results

In the experiments we used a sliding window of size $W$ to implement the proposed tracking algorithm in real time. The graph contains the observations between time $t$ and $t + W$.
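The mode-seeking step of Section 3.2 can be sketched as a mean shift iteration on a weight image (e.g. the backprojection of the track's color histogram); the window radius and convergence tolerance below are illustrative choices, not values from the paper.

```python
import numpy as np

def mean_shift(weights, start, radius=8, iters=20):
    """Seek the mode of a non-negative weight image starting from the track's
    last predicted position (y, x), by repeatedly moving to the centroid of
    the weights inside a square window."""
    h, w = weights.shape
    y, x = start
    for _ in range(iters):
        y0, y1 = max(0, int(y) - radius), min(h, int(y) + radius + 1)
        x0, x1 = max(0, int(x) - radius), min(w, int(x) + radius + 1)
        win = weights[y0:y1, x0:x1]
        total = win.sum()
        if total == 0:          # no appearance support in the window
            break
        ys, xs = np.mgrid[y0:y1, x0:x1]
        ny, nx = (ys * win).sum() / total, (xs * win).sum() / total
        converged = abs(ny - y) < 0.5 and abs(nx - x) < 0.5
        y, x = ny, nx
        if converged:
            break
    return y, x
```

The converged position would then be added to the graph as a hypothesis node, subject to the real-node ratio gate described above.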
When new observations are added to the graph, observations older than $t$ are removed from it. New candidate tracks can also emerge from false alarm nodes (nodes not assigned to any track). In the following experiments, we use a window size of $W = 45$.

4.1. Data set

We conducted experiments on both a public data set and our Honeywell data set. In all tests, the inputs to the tracking algorithm were the foreground regions and the original image sequence.

Fig. 3. Merge and split hypotheses added to the graph.

4.1.1. Public data set and results

We evaluate our algorithm on videos selected from a public data set, CLEAR [23]. This data set is captured by a stationary camera in an urban traffic environment; the targets of interest are vehicles and pedestrians. In the considered data set, a large number of partial or complete occlusions between targets (pedestrians and vehicles) are observed. The left two sub-figures in Figs. 4 and 5 show sample frames of the CLEAR data set with the results of the proposed tracking method; the right two sub-figures in Fig. 4 show the detected moving blobs. For example, the two cars in the top right figure of Fig. 4 are detected as a single moving blob; the proposed tracking method splits this blob into the two cars.

Fig. 4. Tracking in the CLEAR data set.

4.1.2. Honeywell data set and results

We also tested our tracking algorithm on various real surveillance videos collected by our lab, inside the lab room, around the parking lots and at other facilities at Honeywell. The difficulties in these data sets are the cluttered background and frequent occlusions. Partial occlusions and noisy foreground segmentation frequently produce split and merged foreground regions. Fig. 6 shows the data sets with tracking results overlaid and the detected foreground. Due to noisy foreground segmentation, the input foreground for one target may consist of multiple fragmented regions, as shown in Fig. 6a. When two or more moving objects are very close to each other, we may obtain a single moving blob for all of them, as shown in Fig. 6b. Missed detections were frequent in the considered data set: objects merging into the background due to stop-and-go motion are shown in Fig. 6c. Moreover, due to the noisy background, frequent false positives are present in the foreground detection; tracking results in the presence of false alarms are shown in Fig. 6d. For the data sets where a ground plane is available, using the homography between the ground plane and the image plane, the targets can be tracked on the 3D ground plane, as shown in Fig. 7.

4.2. Performance metrics for quantitative comparison

In this section, we compare the proposed tracking algorithm with the existing JPDA [12], i.e. without the added merge, split and mean shift hypotheses. Before presenting the comparison results, we first describe the performance metrics. For each detected target, we assign a unique track ID and output a minimum bounding rectangle (MBR) for every frame. To provide a quantitative evaluation of the proposed algorithm, we apply the ATA (Average Tracking Accuracy) metric proposed in [20]. Let $G_i$ denote the $i$th ground truth object and $G_i(t)$ the $i$th ground truth object in the $t$th frame; $T_i$ denotes the tracked object for $G_i$. $N_G$ and $N_T$ denote the number of ground truth objects and tracked objects respectively, and $N_f$ the total number of ground truth frames in the sequence. Within each frame, the intersection over union of the two MBRs from ground truth and output measures the normalized spatial accuracy of the reported regions. The one-to-one matching between ground truth objects and tracked objects is performed by computing this measure over all ground truth and detected object combinations and maximizing the overall score for the sequence. Given an optimal mapping, the number of matched ground truth and tracked objects is $N_{mapped}$, and therefore $N_{mapped} \le N_G$ and $N_{mapped} \le N_T$. The ATA metric [20] is defined as follows:

$ATA = \frac{\sum_{i=1}^{N_{mapped}} \frac{\sum_{t=1}^{N_f} |G_i^{(t)} \cap T_i^{(t)}| / |G_i^{(t)} \cup T_i^{(t)}|}{N_{(G_i \cup T_i \neq \emptyset)}}}{(N_G + N_T)/2}$  (16)

where $N_{(G_i \cup T_i \neq \emptyset)}$ indicates the total number of frames in which either a ground truth object or a detected object or both are present; thus $N_f \le N_{(G_i \cup T_i \neq \emptyset)}$ always holds. This quantitative metric penalizes both false negatives (undetected ground truth area) and false positives (detected boxes that do not overlap any ground truth area) [20]. Note that, since the mapping is defined on the trajectories of ground truth and tracked objects, it fully accounts for the temporal consistency of tracking accuracy. For example, suppose we have two ground truth trajectories with the same start and end frames, and the tracked objects swap IDs exactly in the middle. Although, except for the swapped IDs, all detected MBRs of the two trajectories exactly match the ground truth MBRs, the ATA score according to Eq. (16) will be $(\frac{1}{2} + \frac{1}{2}) / \frac{2+2}{2} = 1/2$. The metric is also well suited for penalizing fragmented trajectories: suppose we have only one ground truth trajectory and the obtained track is broken into two tracks of equal length. Since only one track will be mapped to the ground truth track, the final ATA score will be $\frac{1}{2} / \frac{1+2}{2} = 1/3$.

The ATA scores of the comparison on the two data sets are shown in the comparison table (Fig. 5). From this comparison it is clear that the proposed tracking method achieves a higher ATA score than the existing JPDA method.
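A minimal sketch of the ATA metric of Eq. (16), brute-forcing the one-to-one track mapping over permutations (adequate for small track counts; a Hungarian solver would be used at scale). The track layout, a dict of frame to MBR, is an illustrative assumption.

```python
from itertools import permutations

def _iou(a, b):
    """Intersection over union of two MBRs (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def _sta(gt, out):
    """Per-pair accuracy: mean MBR overlap over frames where either track exists."""
    frames = set(gt) | set(out)
    if not frames:
        return 0.0
    return sum(_iou(gt[f], out[f]) for f in frames if f in gt and f in out) / len(frames)

def ata(gt_tracks, out_tracks):
    """ATA: best one-to-one track mapping score, normalized by (N_G + N_T)/2."""
    n_g, n_t = len(gt_tracks), len(out_tracks)
    small, large = (gt_tracks, out_tracks) if n_g <= n_t else (out_tracks, gt_tracks)
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        best = max(best, sum(_sta(small[i], large[j]) for i, j in enumerate(perm)))
    return best / ((n_g + n_t) / 2)
```

On the fragmentation example above (one ground truth track broken into two output halves of equal length), this returns 1/3, matching the worked example.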
4.3. Environment and computing complexity

Without any code optimization, the time performance of our online tracking algorithm with a 45-frame sliding window is close to real time (15–20 fps for 3–5 targets) on a PC platform with a P4 2.8 GHz CPU and 1 GB of RAM. The computing cost of the algorithm is essentially linear in the total number of targets. The quality of the motion detection affects the performance of the tracking algorithm: if the foreground regions are very noisy, many hypotheses may be added to the graph, which degrades the time performance.

Sequence    Frames               Proposed method   With split/merge only   With mean shift only   No augmented hypotheses   Method 2
CLEAR       1125 (2 sequences)   0.293             0.252                   0.126                  0.109                     0.097
Honeywell   1322 (6 sequences)   0.327             0.269                   0.233                  0.212                     0.237

Fig. 5. Comparison with other methods. Method 1 is the proposed method; method 2 is the JPDAF in [12].

Fig. 6. Tracking results of the Honeywell dataset.

5. Conclusion and future work

In this paper, we presented a method for tracking multiple targets with incomplete observations in video surveillance. If we partition application scenarios into easy, medium and difficult cases, most existing tracking algorithms can handle the easy cases relatively well. In the medium and difficult cases, however, multiple targets can be merged into one blob, especially during partial occlusions, and one target can be split into several blobs due to noisy foreground detection. Missed detections are also frequent in the presence of stop-and-go motion, or when foreground cannot be distinguished from background without adjusting the detection parameters to each sequence considered. In this paper, we use a graph representation of the observations and formulate the multiple target tracking problem as a maximum a posteriori estimate. We expand the set of hypotheses by considering merge, split and mean shift operations.
The proposed tracking method can deal with noisy foreground segmentation due to occlusion, foreground fragments and missing detections. Experimental results show good performance on the test data sets. We also provide a quantitative comparison between the proposed tracking method and an existing tracking method, showing a better ATA score for the proposed method. As future work, we will first introduce other appearance models, such as the one in [11]; moreover, we will incorporate model information into our tracking framework.

Fig. 7. Tracking targets using ground plane information. Left: estimated trajectories plotted in the 2D image. Right: positions of the moving people in the scene plotted on the ground plane.

Appendix A. K shortest disjoint path algorithm

We aim to find the K shortest paths [1] (starting from a virtual node s and ending at a virtual node t) in the observation graph G. The K shortest paths are required to be node-disjoint: two or more paths that share no common nodes (or edges) are called node-disjoint (or edge-disjoint). We first introduce the solution to the K shortest edge-disjoint paths problem, which can then be used to solve the K shortest node-disjoint paths problem. Details can be found in [21].

A.1. K shortest edge-disjoint paths

By augmenting the edges of the original graph G with unit capacity, the problem can be transformed into the following one: find a flow of value K from node s to node t whose total cost is minimum. Note that we also need to recover each disjoint path in this flow. The algorithm is as follows.

Algorithm 2. K shortest edge-disjoint paths
Input: initial graph G
Output: {p_i, i = 1, ..., K}
  Augment the graph G to T with unit capacity on all edges
  Find the maximum flow F in T
  if F < K, i.e., T does not contain a flow of value K, then
    return: the graph G contains no K disjoint paths
  end if
  for i = F down to F − K + 1 do
    Compute the flow of value i and create T′ by removing the zero-flow edges from T
    Find the shortest path p_i in T′
    Remove the edges of p_i from T
  end for

Since the maximum flow problem can be solved in polynomial time by the Ford–Fulkerson algorithm in O(|E| · maxflow), the total complexity of the algorithm is O((|E| · maxflow + |V|²) · K).

A.2. K shortest node-disjoint paths

This problem can easily be transformed into the edge-disjoint path problem. For each node v that is not allowed to be shared by two or more paths, split v into two nodes, v′ and v″. The edge (v′, v″) has unit capacity and zero cost. All incoming edges (x, v) are changed to (x, v′) and all outgoing edges (v, x) are changed to (v″, x). We can then apply the K shortest edge-disjoint path algorithm to the new graph to obtain the K shortest node-disjoint paths.

References

[1] J.K. Wolf, A.M. Viterbi, G. Dixon, Finding the best set of K paths through a trellis with application to multitarget tracking, IEEE Trans. Aerospace Electron. Syst. 25 (2) (1989) 287–296.
[2] B. LaScala, G.W. Pulford, Viterbi data association tracking for over-the-horizon radar, Proc. Int. Radar Symp. 3 (1998) 155–164.
[3] K.G. Murty, An algorithm for ranking all the assignments in order of increasing cost, Operations Res. 16 (1968) 682–687.
[4] K. Buckley, A. Vaddiraju, R. Perry, A new pruning/merging algorithm for MHT multitarget tracking, in: Radar 2000, 2000.
[5] T. Quach, M. Farooq, Maximum likelihood track formation with the Viterbi algorithm, in: Proc. 33rd IEEE Conf. on Decision and Control, 1994, pp. 271–276.
[6] D. Castanon, Efficient algorithms for finding the K best paths through a trellis, IEEE Trans. Aerospace Electron. Syst. 26 (2) (1990) 405–410.
[7] T.E. Fortmann, Y. Bar-Shalom, M.
Scheffe, Sonar tracking of multiple targets using joint probabilistic data association, IEEE J. Oceanic Eng. OE-8 (3) (1983) 173–184.
[8] Y. Bar-Shalom, T.E. Fortmann, Tracking and Data Association, Mathematics in Science and Engineering, Series 179, Academic Press, San Diego, CA, 1988.
[9] D.B. Reid, An algorithm for tracking multiple targets, IEEE Trans. Automatic Control 24 (6) (1979) 843–854.
[10] I.J. Cox, S.L. Hingorani, An efficient implementation of Reid's multiple hypothesis tracking algorithm and its evaluation for the purpose of visual tracking, IEEE Trans. Pattern Anal. Machine Intell. 18 (2) (1996) 138–150.
[11] J. Kang, I. Cohen, G.G. Medioni, Object reacquisition using invariant appearance model, in: International Conference on Pattern Recognition, Cambridge, United Kingdom, 2004, pp. 759–762.
[12] J. Kang, I. Cohen, G. Medioni, Continuous tracking within and across camera streams, in: IEEE Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin, June 2003.
[13] Y. Cheng, Mean shift, mode seeking, and clustering, IEEE Trans. Pattern Anal. Machine Intell. 17 (8) (1995) 790–799.
[14] S. Blackman, Multiple Target Tracking with Radar Applications, Artech House, Berlin, 1986.
[15] Y. Bar-Shalom, E. Tse, Tracking in a cluttered environment with probabilistic data association, Automatica (1975) 451–460.
[16] A. Genovesio, J. Olivo-Marin, Split and merge data association filter for dense multi-target tracking, in: ICPR, 2004, pp. IV: 677–680.
[17] A. Senior, Tracking people with probabilistic appearance models, in: PETS, 2002, pp. 48–55.
[18] Y. Wu, T. Yu, G. Hua, Tracking appearances with occlusions, in: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Madison, WI, vol. I, 2003, pp. 789–795.
[19] C. Stauffer, W.E.L. Grimson, Adaptive background mixture models for real-time tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 1999, vol. 2, pp. 246–252.
[20] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, M. Boonstra, V. Korzhova, Performance Evaluation Protocol for Text, Face, Hands, Person and Vehicle Detection & Tracking in Video Analysis and Content Extraction (VACE-II), Technical Report, University of South Florida, 2005.
[21] J.W. Suurballe, Disjoint paths in a network, Networks 4 (1974) 125–145.
[22] I. Cohen, G. Medioni, Detecting and tracking objects in video surveillance, in: Proc. IEEE Computer Vision and Pattern Recognition, Fort Collins, June 1999.
[23] http://www.clear-evaluation.org/.
[24] A. Yilmaz, O. Javed, M. Shah, Object tracking: a survey, ACM Comput. Surv. 38 (4) (2006).
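As a concrete illustration of the node-splitting construction in Appendix A.2, here is a minimal Python sketch. The graph format and function names are ours, and instead of the flow-based Algorithm 2 it uses a greedy simplification (find a shortest path on the split graph, delete that path's internal nodes, repeat); this demonstrates the v → (v′, v″) transformation but, unlike the exact min-cost-flow method, it cannot reroute previously found paths:

```python
import heapq
import itertools

def split_nodes(edges, nodes, source, sink):
    """Node splitting from Appendix A.2: every internal node v becomes
    v_in -> v_out with zero cost; giving that edge unit capacity makes
    edge-disjoint paths node-disjoint in the original graph."""
    new_edges = {}
    for v in nodes:
        if v not in (source, sink):
            new_edges[((v, 'in'), (v, 'out'))] = 0.0
    for (u, v), cost in edges.items():
        uu = u if u in (source, sink) else (u, 'out')
        vv = v if v in (source, sink) else (v, 'in')
        new_edges[(uu, vv)] = cost
    return new_edges

def shortest_path(adj, source, sink):
    """Dijkstra (non-negative edge costs assumed); node list or None."""
    tie = itertools.count()           # tie-breaker: avoids comparing nodes
    dist, prev, done = {source: 0.0}, {}, set()
    pq = [(0.0, next(tie), source)]
    while pq:
        d, _, u = heapq.heappop(pq)
        if u in done:
            continue
        done.add(u)
        if u == sink:
            break
        for v, c in adj.get(u, []):
            if d + c < dist.get(v, float('inf')):
                dist[v], prev[v] = d + c, u
                heapq.heappush(pq, (d + c, next(tie), v))
    if sink not in done:
        return None
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

def k_node_disjoint_paths(edges, nodes, source, sink, k):
    """Greedy sketch: up to k node-disjoint paths, cheapest first."""
    adj = {}
    for (u, v), c in split_nodes(edges, nodes, source, sink).items():
        adj.setdefault(u, []).append((v, c))
    paths = []
    for _ in range(k):
        p = shortest_path(adj, source, sink)
        if p is None:
            break
        # Recover original labels: source, each v_out node, sink.
        paths.append([source] + [x[0] for x in p
                                 if isinstance(x, tuple) and x[1] == 'out'] + [sink])
        gone = set(p) - {source, sink}   # delete used internal nodes
        adj = {u: [(v, c) for v, c in vs if v not in gone]
               for u, vs in adj.items() if u not in gone}
    return paths

# Example: two parallel routes s->a->t and s->b->t.
edges = {('s', 'a'): 1.0, ('a', 't'): 1.0, ('s', 'b'): 1.0, ('b', 't'): 1.0}
print(k_node_disjoint_paths(edges, ['s', 'a', 'b', 't'], 's', 't', 2))
# -> [['s', 'a', 't'], ['s', 'b', 't']]
```

In the tracking context, s and t are the virtual start/end nodes and the internal nodes are the observations, so node-disjointness ensures that no observation is assigned to two trajectories.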