Multiple Target Detection and Tracking in a Multiple Camera Network
TRITA-EE 2015:85
Contents
1 Introduction
1.1 Methodology
1.2 Ethical implications
1.3 Structure of the report
2 Background
2.1 Introduction to Distributed Smart Cameras
2.2 Multiple Smart Cameras
2.3 Architecture of the network
2.4 Mathematical Algorithms for Fusion
2.4.1 Fusion for target tracking
2.4.2 Scalable fusion algorithms
2.5 Detection and tracking algorithms
2.5.1 Detection approaches
2.5.2 Tracking approaches
3 State of the art
4 Problem formulation
4.1 Simplification of the samples
4.2 Data exchange in camera network
5 Implementation
5.1 The Background Subtraction
5.2 Detecting Process
5.3 Tracking implementation
5.4 Changes made to the original approach
6 Results
6.1 Simulation samples
6.2 Background dependency
6.3 Cameras influence
6.4 Evaluation of the detection algorithm
6.5 Synthetic Rectangles influence
6.6 Tracks Evaluation
6.7 Data Exchange
6.8 Computational complexity
7 Conclusion and Future Works
References
1 Introduction
Multiple-camera systems are employed in scenarios where the main task is the surveillance of a particular area. Depending on the client, the required functionalities may include facial recognition, detection of static or dynamic camouflaged targets, tracing of special targets, detection of unusual behaviors, identification, classification of targets and objects belonging to the same type, etc.
The scene presents some challenges, such as a crowded room, similar colors of the people's clothes, the proximity of the individuals as they walk near each other, and distance problems between the focal length of a camera and the zone under study. These constraints affect the background algorithm, producing shadows, wrong subtraction of targets, and fused blobs, which lead to incorrect detections and identifications of the people. Hence, these issues need to be studied, and measures have to be taken to develop a robust algorithm.
Regarding the method of acquiring and combining the data, different topologies can be used for the communication between the nodes. Depending on the chosen one, a central processing node or several processing nodes can be employed to decrease the workload of the camera nodes and to perform the computationally expensive operations [4]. The processing task is then distributed over several nodes, which decreases the delay and reduces the bandwidth requirements.
1.1 Methodology
The literature associated with the thesis has been approached from a general basis, covering how a system of distributed smart cameras works, the architecture, and the algorithms related to the communication between the nodes, and then progressively studying the more specific and complex concepts related to the topic of the project in order to understand the later implementation. In particular, the literature study includes a detailed analysis of the detection and tracking algorithms (the Probabilistic Occupancy Map and K-Shortest Paths). After acquiring the necessary knowledge, the OpenCV code provided by the researchers of [12] was tested and studied in order to translate the main concepts into MATLAB code. Parts of the work that were not in their free implementation had to be developed, such as the use of the homographies to perform the triangulation calculations and the influence of the virtual rectangles on the detection parameters. Once these issues were solved, the implementation of the POM equations was attempted. However, after much research and considerable effort, we reached the conclusion that the model implemented in C++ did not yield good performance in MATLAB, and the development of the equations was therefore approached from a different point of view. As the goal of the thesis is to implement a network that is capable of detecting a number of people and to observe the communication between the cameras during the process, the results are good enough and are obtained quickly, although the detection and tracking processes are less accurate than POM and KSP.
2 Background
2.1 Introduction to Distributed Smart Cameras
A smart camera is a vision system that captures images and is considered a smart sensor or node, owing to its capability of computing and processing specific data in order to produce the best results. Its fields of application are varied, such as computer vision, image sensors, embedded computing, and sensor networks, and the improvements in each field have resulted in a new concept of visual communication for solving complex tasks in real time. The advantages in comparison with previous technologies, like some of the prototypes developed in the 1980s, which integrated sensing, and in the 1990s, which featured the first fabricated CMOS implementations, are reduced size, lower power consumption, higher reliability, lower workload, less bandwidth used to transmit the data, etc. [1]
Some of the applications developed in the last decade to improve performance are: video surveillance, to detect unusual behaviors in the scene and to identify criminals or suspicious individuals; adaptive motion sensors applied to vehicles, to make the task of driving more comfortable and to prevent possible traffic accidents; medical applications, such as complex surgeries that require a lot of time and accuracy, aid for elderly or handicapped people, and tests and research to face some of the most common diseases; robots that simulate the behavior of animals or humans; video games and other entertainment based on gesture and motion recognition; etc.
2.2 Multiple Smart Cameras
The denomination distributed cameras is given to a system with various cameras spread over the space, which may have overlapping fields of view and cover a determined area. Thus, the images obtained from different angles are analyzed jointly at the same time. The main purpose of this design is to handle the occlusions that may appear in one determined view due to static or dynamic objects. Besides, a system with multiple cameras makes it possible to obtain better quality, since each lens can focus on a determined point covering a smaller area, giving a sharper capture of the image.
The main tasks on which the distributed smart cameras focus are the detection, tracking, and recognition of objects or humans in a given zone.
The second issue is modeling the track, in order to follow targets using the data acquired by the object detector as input. The most common algorithms for estimating the posterior object location are based on Kalman filters and particle filters, explained in detail later [6], [7], [8].
The last aspect is the identification and recognition of the targets. There are two main approaches, local feature-based and global feature-based, depending on whether we are interested in the detail of one person, for example, or in a multiple-detection framework [2].
In the main node there is a single track database, and the data received from the other sensors are aligned and associated with the tracks belonging to this database, so that the tracks can be updated and their state estimates produced.
This architecture model produces the optimal results, because there is no loss of information, and in the case where there is enough bandwidth to process the data it is ideal to put into practice. However, to satisfy the requirements, the power consumed and the energy needed by the main base station might be high. The development is quite simple too. It has been shown in many studies that it is a good implementation for wired solutions, such as DCS (Dynamic Service Controller) networks, where the resources are not the main problem to consider. Issues appear where there are not enough resources for processing, as in WSNs (Wireless Sensor Networks). There may also be failures in the fusion center due to a system crash or an undesired unknown guest that compromises the security of the digital environment [3]. To avoid such obstacles, other structures have been developed.
that they receive. The high-level nodes can achieve considerable savings in communication if the data arrive at a rate slower than the sensor observation rate.
One of the advantages is that, in each case, the communication does not have to be synchronized, and the local nodes can do many calculations before reporting to a more important node. This type of architecture can be divided depending on whether there is feedback or not.
When there is no feedback from the high level, the common information between the received estimate and the current estimate is the last estimate communicated from the low level. When the state is dynamic, the probabilities have to be extrapolated to a common time. As long as the independence conditions on the observation fields are satisfied, a lot of sensor data can be processed before the results are sent to the high level to be fused.
In the case with feedback, fusion takes place at both levels, so the low-level and high-level functions are almost identical. It is equivalent to a periodic broadcast: when the feedback reaches each node, it is necessary that the fusion node removes its own previously sent information, which was stored, before combining the local estimates. The hierarchical architecture is employed in many applications, e.g., a fusion node for radar data, another node for infrared sensor data, and a node that combines the generated tracks [4].
Distributed networks are the other solution to mitigate the lack of performance of traditional networks in certain contexts. They can be defined as the opposite concept to centralized models. In this kind of configuration there is no hierarchical relationship; all the fusion nodes are at the same level as far as processing is concerned. Each node can communicate with the others, depending on the layout used, and it may be done in an asynchronous way. The sensors depend on each other to achieve the best result, which is then distributed to the clients.
In a fully distributed network, the sensing devices may have their own processor to fuse the local data and then cooperate with others to complement their own inputs. The common information between nodes depends on the specific communication path, but now the fusion equations are not simple in comparison with the hierarchical cases. Thus, the proper choice of an algorithm that works well is more difficult, because a long pedigree of data is needed, and the memory requirements to store it are higher. Hence, more computation is employed to identify the common ancestors in order to apply the fusion equations.
The main premise of the centralized techniques is no longer available for the new ones, since they usually assume that the errors in the data to be fused are independent. Now, only the sensor data can be considered conditionally independent given the state to be estimated. In contrast, the "rumor propagation" or "chicken little phenomenon" has to be taken into account, since the information arriving by multiple paths has to be fused and the redundant information must not be counted twice.
Figure 2: Different architectures of networks. Extracted from [3].
There are some issues that have to be addressed, such as how the nodes have to relate in order to share the assignments for fusion, that is, what the task of each sensor will be and what responsibilities it will have in a given scenario. It is also a challenge how to combine the results from two fusion nodes when making a decision on a hypothesis, such as detecting the presence of targets, and how to fuse target tracks. Other requirements also have to be fulfilled, since distributed configurations work with more complex algorithms.
2.4 Mathematical Algorithms for Fusion
2.4.1 Fusion for target tracking
While the architecture designed for the network is one of the main aspects of the communication, there are other aspects at the same level of importance, such as the implementation of an algorithm that works according to our requirements. It is important to fuse the data in the nodes, to achieve high-quality results, and then to select what to do in the next step. It is essential to estimate and predict some of the posterior data or tracks, to be able to reconstruct all the information in case of missing messages. To achieve this, we have to deal with the spatial registration and temporal prediction of the target on the inputs of the sensors, i.e., the data that we are getting as input.
Data alignment is the first step in the fusion process; to perform it, it is necessary to convert the data to a common coordinate system so that the target data obtained by the sensors can be fused. To process the data, predetermined target kinematics are linked to the track, and a temporal prediction is carried out using, for example, a Kalman filter to estimate the location or state of the target. After that, the measurements are divided into sets by means of association methods. Some of the more distinguished approaches are presented below.
The nearest-neighbor approach assumes that the measurement closest to the predicted track state represents the target. The advantages are obvious, since hardly any computational resources are needed to implement this method. However, it may not work well in some instances where the quality of the sequence is poor or where following the track becomes ambiguous [3].
The last step is to update the state estimates of each track using a filter, for example the particle filter, to combine the associated measurements with the predicted states.
2.4.2 Scalable fusion algorithms
The basics of distributed information fusion have been studied deeply, but practical applications of these theoretical results to non-deterministic information are still uncertain. The main point is to identify and remove the redundant information received from the different sensors that has to be fused, to achieve an efficient use of the resources. Building the optimal fusion formula requires carrying long pedigree information, which is not practical in environments with limited communication bandwidth. For this reason, sub-optimal algorithms are used that, under certain assumptions, provide a solution in a determined environment. Some of the approaches are described in the next paragraphs.
The first approach removes the common prior information previously received over the channel:

p(x) = \frac{p_1(x)\, p_2(x) / \bar{p}(x)}{\int p_1(x)\, p_2(x) / \bar{p}(x)\, dx}    (1)

where p_1(x) and p_2(x) are the two probability density functions to be fused, and \bar{p}(x) is the density function received from the same channel at the previous communication time; it is the common "prior information" to be removed in the fusion formula [5].
Hence, each node just needs to keep the estimate of the previous transmission or reception from the channel and remove it when combining the current estimates. Therefore, there is no need to save long chains of data to carry out this implementation. However, if the time between transmission and reception is long, the approach is not useful, and the dependent information might be lost.
Naïve fusion assumes that the dependency between the density functions is negligible. This standard represents the simplest type; nevertheless, it can be unreliable in practice [5]. The formula is written as:

p(x) = \frac{p_1(x)\, p_2(x)}{\int p_1(x)\, p_2(x)\, dx}    (2)
One of the patterns that works with high accuracy when the dependency between the two distributions is unknown is Chernoff fusion. It is based on the following equation,

p(x) = \frac{p_1^{\omega}(x)\, p_2^{1-\omega}(x)}{\int p_1^{\omega}(x)\, p_2^{1-\omega}(x)\, dx}    (3)

where ω ∈ [0, 1] is an appropriate parameter which minimizes a chosen criterion [5].
The resulting fused density function that minimizes the joint data lies halfway between the two original densities in terms of the Kullback-Leibler distance.
Because of that, the resources needed to satisfy the different cases might be computationally extensive. One particular case is choosing the parameter ω that minimizes the determinant of the fused covariance, which means, in the Gaussian case, minimizing the Shannon information. Another case is setting this parameter to 0.5, in which case the denominator becomes \int \sqrt{p_1(x)\, p_2(x)}\, dx, which is the Bhattacharyya bound [5].
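To make formulas (2) and (3) concrete, the following MATLAB sketch fuses two densities discretized on a one-dimensional grid; the Gaussian shapes and the grid are arbitrary illustrative choices, not taken from [5].

```matlab
% Naive and Chernoff fusion of two discretized densities (toy example).
g  = @(x, m, s) exp(-(x - m).^2 ./ (2 * s^2)) ./ (s * sqrt(2 * pi));
x  = linspace(-5, 5, 1001);  dx = x(2) - x(1);
p1 = g(x, -0.5, 1.0);                % first density to be fused
p2 = g(x,  0.7, 1.5);                % second density to be fused

% Naive fusion, eq. (2): assumes the dependency is negligible.
pn = p1 .* p2;
pn = pn / (sum(pn) * dx);            % divide by the integral (normalize)

% Chernoff fusion, eq. (3), with w = 0.5 (Bhattacharyya case).
w  = 0.5;
pc = p1.^w .* p2.^(1 - w);
bhatt = sum(pc) * dx;                % denominator: the Bhattacharyya bound
pc = pc / bhatt;
```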
The most recent approach in this area is the hidden Markov model, which represents the intensity variations of a pixel in an image sequence as discrete states [8].
The two filters presented above yield very good results when the objects are not close to each other. For tracking multiple objects in video sequences, the most likely measurement for a particular moving object needs to be associated with the state of the object, which is called the correspondence problem. To solve it, one algorithm that performs well in practice is the nearest-neighbor approach, which is very simple to implement and does not consume many computational resources.
The last filter, the mean-shift filter, is introduced to optimize a smooth similarity function and to obtain the direction of the movement of a target. The tracking is based on a non-parametric feature-space analysis. The goal is to maximize the appearance similarity iteratively, by comparing the histograms of the target and of the hypothesized object model. The similarity is defined by the Bhattacharyya distance, and it has to converge for the operation to be carried out correctly [6].
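The similarity test at the core of this iteration can be sketched in a few lines of MATLAB; the histograms below are toy values, whereas a real tracker would build them from the pixel colors inside the target and candidate windows.

```matlab
% Bhattacharyya coefficient/distance between two color histograms.
% q: target model histogram, p: histogram of the hypothesized window.
q = [0.1 0.4 0.3 0.2];               % toy target histogram (sums to 1)
p = [0.2 0.3 0.3 0.2];               % toy candidate histogram (sums to 1)

rho = sum(sqrt(p .* q));             % Bhattacharyya coefficient in [0, 1]
dB  = sqrt(1 - rho);                 % Bhattacharyya distance

% Mean shift moves the candidate window in the direction that increases
% rho, iterating until the displacement (and hence dB) converges.
```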
3 State of the art
In this section we present the state of the art and the algorithms that, step by step, build a robust approach that is able first to detect, then to track, and finally to identify people or even objects in adverse conditions, under some previous assumptions that we make to preserve the right performance and obtain the best results.
In the first one we can highlight, on one hand, the blob-based methods, in which binary blobs are extracted from a single camera and shape analysis is combined with tracking to locate people. One example of this is the Bayesian Multiple-Blob Tracker (BraMBLe), which generates a blob likelihood based on a known background model and an appearance model. It can be extended to multiple cameras, but they will work independently; the only advantage is that the target can be observed from other angles in case of occlusion or wrong detection, just to check whether the received information is correct. Basically, a background/foreground segmentation is performed on calibrated images. On the other hand, there are the color-based methods, which take into account the several colors in the scene to be able to differentiate targets from the background in real time. To carry that out, the images are segmented pixel by pixel and classified, i.e., Gaussian mixtures are continuously updated.
In the second approach, the system is composed of more cameras, which becomes necessary when the main requirement is to accurately detect and track several targets at 3-D locations in a complex environment. There are mainly three approaches to do this; two of them have already been presented, the blob-based and the color-based methods, but now the fusion of the data arriving from the different angles is one of the issues to keep in mind, since the images can be processed jointly. The third, the occupancy-map methods, differs from the others in that it uses a discretized map onto which the detected objects are back-projected. In some of the works, a mixture of Gaussians is fitted to the resulting score map to estimate the likely locations of the individuals, and is then combined with a Kalman filter to model the motion. The detection method in the state of the art developed here shares characteristics with the last one, but differs in how the usual color and motion models are combined with the generative model to estimate the probabilities of occupancy.
3.1.5 Requirements
To ensure the proper running of the tasks, some assumptions in the spatio-temporal dimension have to be made. These are:
POM is performed on individual time frames, which means that it does not have consistency along time and does not follow a continuity flow (temporally independent frames) [12]. Therefore, POM does not maintain the identities of the targets across the frames, which makes the use of an additional algorithm necessary to track people. The optimization scheme introduces 4 seconds of delay between the first image acquisition and the last results, which is compatible with many applications in the real world and yields high robustness.
= \frac{1}{Z} \prod_{c} e^{-\Psi(B^c,\, A^c)}    (8)
where B^c is the image produced by the background subtraction, A^c is the synthetic image obtained by putting rectangles at the locations where X_k = 1, Ψ is the pseudo-distance between the image produced by the background subtraction and the image created synthetically, and Z is a given normalization value.
To calculate with precision, over the whole captured sequence, the location of a target in the space, the POM algorithm approximates the marginal conditional probabilities of occupancy of the ground plane at the real location of the body. To estimate these quantities, a uniform value is used to compute the average virtual image. Later, a recursive method is needed to improve, iteration by iteration, the estimation of the location on the ground plane. To make the task easier, a synthetic image is assigned to each discretized location of the ground plane (each grid cell), consisting of a rectangle, the projection of a 3-D cylinder, since people are represented with this shape (for this reason two homographies are needed, for the ground plane and for the head plane). Thus, the work is to set, approximate, and contain the silhouettes of the humans in these synthetic images developed by the algorithm.
After all these operations, the final results are shown on the cameras (a sum of the synthetic images and the frames extracted after applying the background subtraction), and then each one can operate independently, thanks to the posterior support of the object tracker, which is in charge of identifying the targets across the frames and thus recognizing every different individual from a single point of view.
3.1.9 Virtual Rectangle Influence
Depending on the parameters used to produce the particular area of the rectangles, the detection task might be improved or deteriorated. It has been tested that for model heights between 1.70 meters (the height of an average person) and 2.20 meters there are no significant changes. With less height, the model efficiency tends to decrease since, if the model is shorter than the person, the algorithm will be more sensitive in the production and calculation of the binary blobs. Another test carried out was to check whether the algorithm can detect children. It was observed that small kids with a height of 80 centimeters were not detected, since the blob produced by the background subtraction corresponded to approximately less than a quarter of that obtained for an average adult.
|\bar{A}^c_{k,\xi}| = |\bar{A}^c| + \frac{\xi - q_k}{1 - q_k}\, |(1 - \bar{A}^c) \otimes A^c_k|    (10)

|B^c \otimes \bar{A}^c_{k,\xi}| = |B^c \otimes \bar{A}^c| + \frac{\xi - q_k}{1 - q_k}\, |B^c \otimes (1 - \bar{A}^c) \otimes A^c_k|    (11)

\Psi(B^c, \bar{A}^c_{k,\xi}) = \frac{1}{\sigma}\, \frac{|B^c| - 2\,|B^c \otimes \bar{A}^c_{k,\xi}| + |\bar{A}^c_{k,\xi}|}{|\bar{A}^c_{k,\xi}|}    (12)

q_k \leftarrow \frac{1}{1 + \exp\!\left(\lambda_k + \sum_c \Psi(B^c, \bar{A}^c_{k,1}) - \Psi(B^c, \bar{A}^c_{k,0})\right)}    (13)
where \bar{A}^c is the average synthetic image computed with the current estimates of the q_k, q_k is the marginal probability of occupancy of location k, A^c_k is the image composed of ones inside a rectangle standing for the silhouette of an individual at location k seen from camera c and zeros elsewhere, \bar{A}^c_{k,\xi} is the average synthetic image with q_k forced to the value ξ (zero or one), B^c is the binary image generated by the background subtraction for camera c, \Psi(B^c, \bar{A}^c_{k,\xi}) is a normalized pseudo-distance that accounts for the asymmetry between the background-subtracted and the synthetic images, σ is a fixed value set to 0.01, and λ_k = log((1 − ε_k)/ε_k), where ε_k is the prior probability of presence at location k.
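Equations (10)-(13) can be read as a fixed-point iteration on the vector of occupancy probabilities q_k. The MATLAB sketch below follows that structure for a single camera, with precomputed rectangle masks Ak{k}; it is a simplified illustration of the update, not the authors' optimized implementation, and the prior ε_k is an assumed value.

```matlab
% Simplified fixed-point iteration of the occupancy probabilities q_k
% for one camera (illustrative only).
% B:     binary background-subtraction image (h x w)
% Ak:    cell array; Ak{k} is the rectangle mask of location k (h x w)
G = numel(Ak);  q = 0.5 * ones(G, 1);      % uniform initialization
sigma = 0.01;  eps_k = 0.01;               % eps_k: assumed prior of presence
lambda = log((1 - eps_k) / eps_k);

for it = 1:50                              % iterate until q stabilizes
    % Average synthetic image for the current q:
    % Abar(x) = 1 - prod_k (1 - q_k * Ak{k}(x)).
    tmp = ones(size(B));
    for k = 1:G, tmp = tmp .* (1 - q(k) * Ak{k}); end
    Abar = 1 - tmp;

    for k = 1:G
        psi = zeros(1, 2);
        for xi = [0 1]                     % force q_k to 0, then to 1
            w   = (xi - q(k)) / (1 - q(k));        % weight, eqs. (10)-(11)
            Axi = Abar + w * ((1 - Abar) .* Ak{k});
            psi(xi + 1) = (sum(B(:)) - 2 * sum(B(:) .* Axi(:)) ...
                           + sum(Axi(:))) / (sigma * sum(Axi(:)));  % eq. (12)
        end
        q(k) = 1 / (1 + exp(lambda + psi(2) - psi(1)));             % eq. (13)
    end
end
```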
Nevertheless, the real problem appears when many people are in the sequence, making the detection difficult since the tolerance of the algorithm is exceeded. Some researchers have proposed another method, based on people detectors, that can respond in a crowded environment.
The well-known particle filter can solve some of the limitations, as can the Markov chain Monte Carlo model, which recovers trajectories over the frames by computing them in batches, or the Probability Hypothesis Density filter, which works well in noisy contexts [13]. However, many parameters have to be configured before doing the calculations, and the temporal window used to carry out these approaches has to be small, which implies a loss of generality and a problem in real applications.
The final results produced by POM on each camera are taken as input by the KSP algorithm. The only value needed by the tracker is the neighborhood size, which saves resources in the processor. The inconsistency of the detection algorithm along time makes it impossible to identify a target over the frames. KSP provides the solution of linking detections across frames and gives continuity in time (the probabilistic occupancy map works as a discretization of time, in a discontinuous form, without producing flows). Thus, if an object is not detected in one frame but is detected in the previous and following ones, a correct trajectory will still be produced, so the possible mistakes due to background segmentation or occlusion in a single camera are solved.
3.2.4 Advantages
In comparison with previous techniques, K-shortest paths offers a very fast optimization at run time, because it is formally and algorithmically simpler than the others. Hence, the resources dedicated to the task are used efficiently to calculate the outcomes.
3.2.5 Assumptions
The utilization of linear programming to perform KSP only requires the costs for an a priori determined number of detected objects and a restriction to the most likely locations where an object may be found. Moreover, as an improvement, this method does not need an appearance model.
The only assumptions that have to be taken into account are that objects do not appear or vanish anywhere in the sequence (except at determined locations defined as the entrance v_source and the exit v_sink), that they do not move too fast across the frames, and that they cannot share a particular place with another object (it is impossible to share a grid cell on the ground).
\forall k, t: \quad \sum_{j \in N(k)} f^t_{k,j} \leq 1    (16)
The total number of humans detected may vary over time. It is necessary to introduce two additional virtual nodes, v_source and v_sink, to represent the arrivals and departures of objects in the graph system (they may be considered as doors or borders of the camera field of view, and they do not represent any real location on the physical map). Thus, all flows leaving v_source will end in v_sink after a random duration of time, as expressed in (17).
Figure 3: A complete flow system consisting of three positions and three time frames. Here, one possible entrance and exit point is assumed. Flows to and from the virtual positions are shown as dashed lines. Taken from [13].
\underbrace{\sum_{j \in N(v_{source})} f^t_{v_{source},j}}_{\text{leaving } v_{source}} \;=\; \underbrace{\sum_{k\,:\,v_{sink} \in N(k)} f^t_{k,v_{sink}}}_{\text{arriving at } v_{sink}}    (17)
The job of the K-Shortest Paths algorithm is to find the k paths {p_1, ..., p_k} between the entrance and departure virtual nodes (v_source and v_sink) such that the total cost is minimal. The focus is on the particular instance where the graph is directed and the paths are node-disjoint (a location cannot be shared by two objects, and flows cannot be created or suppressed anywhere) and node-simple (a path visits every node at most once in the directed acyclic graph, fig. 3). Thanks to the acyclic nature of the graph, the average complexity is almost linear in the number of nodes, which improves the speed of the solvers.
To find the k optimal paths, a single shortest path in the graph is first found and its total cost is calculated. Then, an iterative computation method (the disjoint-path algorithm) is applied to find the rest of the trajectories. The optimal number k is obtained when the cost of iteration k + 1 is higher than that of the previous one.
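The stopping rule can be illustrated with a simplified greedy variant in MATLAB: repeatedly extract a shortest source-to-sink path from a directed acyclic graph and delete its intermediate nodes to keep the paths node-disjoint, stopping when the cost increases. The true KSP algorithm reoptimizes the previously found paths instead of deleting nodes, so this toy graph and procedure only approximate the idea.

```matlab
% Greedy node-disjoint shortest paths on a DAG (simplified KSP sketch).
% Toy directed graph with named nodes and positive edge costs.
G = digraph({'src','src','a','b','a','b'}, ...
            {'a','b','c','c','snk','snk'}, [1 2 1 3 4 1]);
paths = {};  costs = [];
while true
    [p, c] = shortestpath(G, 'src', 'snk'); % single shortest path + cost
    if isempty(p), break; end                % no remaining path
    if ~isempty(costs) && c > costs(end)     % cost went up: k is reached
        break;
    end
    paths{end+1} = p;   costs(end+1) = c;    % accept path p_k
    G = rmnode(G, p(2:end-1));               % keep paths node-disjoint
end
```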
between the inputs and the outputs. The batches are overlapped such that the last frame treated is combined with the current one.
To reduce the number of calculations, only the closest neighboring nodes are taken into account for each node in an independent time step. To implement this, a fixed threshold parameter is given, as can be checked in (18). If the value to arrive at some point does not exceed this number, the location is considered not reachable and the flow is therefore removed.
The evaluation sequences in [13] include: an indoor basketball court with ten players moving fast and many occlusions; a scenario with ten people, some important light changes, and precision issues in the camera calibration, with only seven frames per second to acquire the sequence; and a ball sequence captured from only one single point of view, which makes the task of detection difficult since the colors and the appearance of the balls are quite similar and, besides, the speed of the balls is in general much higher than that of people.
The final results prove that there is an improvement when using KSP with respect to the previous methods, especially in comparison with Dynamic Programming (DP) and with the results of only detecting targets with the POM algorithm, in all the scenarios. The pedestrian-tracking results show that the precision of POM is, in general, quite good and that the improvement carried out by the trackers does not imply a big change in quality, since people do not move fast. Even so, the efficiency of KSP is always higher, especially in those cases where illumination issues make the detection difficult.
A test with one single camera to obtain the inputs has been done to observe the superiority of the K-Shortest Paths over Dynamic Programming, which often leaves people outside the discretized ground area instead of trying to fix the noisy detections.
For the final results on the balls, the color model used by DP was useless, since all of them share a similar color, which is a disadvantage compared with KSP [13].
To check the run-time speed of the model, measurements were taken, showing that KSP is much faster than basic linear programming, by a factor of 100-1000, and about 10 times faster than other sophisticated approaches such as DP.
4 Problem formulation
The goal of the thesis is to be able to track an unknown number of individuals N_ind that appear in a video sequence, recorded from different angles of an unknown scenario taken at head level. The issues to deal with are the overlapping fields of view produced by the cameras, the occluded targets present in each scene on a single camera, the crowded environments, the proximity of one person to another, and problems related to luminosity, color, etc.
Table 1: Notations (Deterministic Quantities)
4.1 Simplification of the samples
To save time and make the computation faster, the image frames taken by each camera c are reduced. That means that if the whole sequence is composed of 3000 frames, for example, we do not take all of them, but one frame sample out of every 25. This results in 120 frames, which yields a huge improvement in speed, saving a lot of time. It has been tested that a larger step between image frames gives worse results, as the individuals that appear present big displacements in the scene. Therefore, the velocity of the moving 'objects' v_s has to be taken into account, to acquire solid background/foreground data.
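In MATLAB, this subsampling amounts to indexing the sequence with a fixed step; the file name below is hypothetical, and the step of 25 matches the example above (the NumFrames property requires a recent MATLAB release).

```matlab
% Take one frame out of every 25 from a video sequence (hypothetical file).
v    = VideoReader('camera0.avi');   % assumed input file name
step = 25;                           % keep 3000/25 = 120 frames
frames = {};
for f = 1:step:v.NumFrames
    frames{end+1} = read(v, f);      % store the f-th frame
end
```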
Certain locations may be hidden in a particular view, which means that someone can appear or disappear at a determined moment of a sequence in a single view. Nevertheless, when the data are fused, this issue is solved satisfactorily and the mark of a target is seen in the top view. Hence, the images Bg_t = (Bg_t^1, ..., Bg_t^C), acquired at time t for 1 ≤ t ≤ T, are stored, and the task is to find the values of the probabilities of a detection X_t^c for each location k of the discretized ground (1, ..., G), as shown in equation (19).
Thus, each view works independently in each node n, and after the scanning of all the G locations for each camera c, the information is sent to a central processing node N_c after coordinate-translation calculations. To do that, all the information acquired at the same time, X_t^c, is added together, and the unnecessary data that exist due to incorrect detections are filtered out as much as possible, R_t. Hence, the redundant information contains the most important values, since we work on overlapping planes. The process is shown in figures 5c, 5d and 5e.
Once all the results have been computed in N_c, the triangulation is done again, the original time-frame f images are combined with the final results F_t^c, and these are seen on each view. Figure 5f illustrates the scheme.
There are some constraints that should be explained. One of them is that a certain location k cannot be occupied by more than one individual, so the proximity of two or more detected targets is problematic and can lead to incorrect results by connecting the trajectories of different people. In opposition to the algorithm explained by the authors of [13], there will not be 'virtual' locations that mark the entrances and departures of the bodies, because the restricted view of a particular camera c is compensated by another.
Figure 5: (a) Ground from top view to camera views. (b) Probability estimation on each camera.
There are other complications, such as the fact that the evidence of a person is stronger and detected better if the blobs of the target appear in every view, since redundant information is taken as valuable data.
The last one is that the probabilities are computed independently at each time step, so there is no link between the probability for a certain location k at t − 1 and the corresponding one at t.
5 Implementation
Regarding the functionality of the system explained above, the approach is composed of the following parts.
First, the video sequences from the different cameras, 4 nodes n in our case, which usually share overlapping fields of view, are required as input. In order to obtain these sequences, we have used the information provided on the web page of the authors of [12] (http://cvlab.epfl.ch/data/pom).
The next step is to obtain the frames of each sequence, giving as a result a series of images I^c for each camera c, with a width w and a height h that depend on the resolution of the MATLAB plot function. To be able to detect the N_ind individuals, a background/foreground algorithm has to be applied and re-updated at each time frame f, and image-processing operations are then computed to produce robust results.
Once we have all the prior information and the frames after the background subtraction, all the data have to be fused in a central node N_c to develop a 'virtual' top view TP (a hypothetical view produced after the application of homographies to the camera views, simulating a fixed camera monitoring the floor from a high position), where the individuals occupy positions of a discretized ground k. The resolution of this particular view is provided on the same web page.
After the calculations carried out in N_c, the information is sent back to the n nodes, and synthetic rectangles Sr_k are drawn at the locations k where the bodies are supposed to be standing. This method makes it possible to handle the occlusions that are often present in crowded environments.
The final algorithm that we have developed presents some differences from the one the authors describe. In the first instance, the authors define a common ground plane with different calibrated locations k. Therefore, despite the fact that each target appears at a different w and h in each camera view c, the occupation of the ground is accurately detected before fusing the data, so the problems with the triangulation and the conversion from one view to another are avoided. They create a 'virtual' ground where the central node N_c is supposed to be emplaced. There, all the prior data can be observed from the TP, the tracks of the targets are computed, and all the information is sent back to the n nodes.
Figure 6: Original frame of an image and the corresponding background subtraction (difference-frame algorithm).
eigenbackgrounds and MoG presented too high a run time (around 2 to 4 seconds more than the others) for the task of extracting the foreground in the various frames. The rest performed well in all the cases; hence, the low-complexity model was chosen to improve the speed of the program, since the results were very similar.
To understand the equations, the first frame has been set as the background Bgrd. Therefore, the sequence has to start by recording the plane under study without targets on it. If the video starts with someone in the scene, the mistakes will be carried into incorrect detections, but this can be changed quickly in the code.
To create the final binary image, the pixels have to be evaluated and the values thresholded.
To finalize, some image processing is applied, such as a median filter to get more accurate blobs, the removal of some small parts with respect to a threshold value, and the removal of some holes to obtain sharper pictures. An example of the result is shown in figure 6.
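A minimal sketch of this difference-frame chain, assuming the first frame is an empty background and reusing the frames obtained earlier, could look as follows in MATLAB; the threshold of 30 and the minimum blob size of 50 pixels are illustrative values.

```matlab
% Difference-frame background subtraction with simple post-processing.
Bgrd = rgb2gray(frames{1});           % first frame = empty background
I    = rgb2gray(frames{f});           % current frame f
D    = imabsdiff(I, Bgrd);            % per-pixel absolute difference
Bg   = D > 30;                        % threshold (hand-tuned value)
Bg   = medfilt2(Bg, [3 3]);           % median filter: cleaner blobs
Bg   = bwareaopen(Bg, 50);            % remove blobs below 50 pixels
Bg   = imfill(Bg, 'holes');           % fill holes for sharper shapes
```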
after applying background subtraction at the same time, to synchronize all the data at our disposal. To do that, the numbers of the first and last frames that we want to evaluate have to be annotated, as well as the step between each Bg^c image. Every binary picture has to be set to a common resolution (a predetermined w and h; concretely, a resolution of 360x288 pixels) to carry out the operations.
In the first step, we set all the data that we are going to need, defining important parameters such as the probabilities of the ground and the discretized locations k on the ground seen in the TP.
After processing all the data necessary to have all the initial values, the update of equations (9) to (13) has to be developed sequentially for every camera c, location k, and convergence iteration it, until a fixed stable point is achieved, i.e., the values of q_k reach a stable solution. Sadly, after a lot of investigation, various coding methods tested, and talks with some experts on the matter, the conclusion has been that the equations cannot be developed efficiently in MATLAB, since they were designed for OpenCV. The run time of such an implementation is computationally too expensive, and hence it is useless for research purposes. The predominance of the iterative character of the problem over the scanning of the pixels was the main reason to choose the C++ language over the MATLAB used by us, since the latter is more efficient at tasks other than multiple targets in multiple views.
Therefore, some changes have been applied to the original algorithm in order to obtain results in a reasonable time.
Figure 7: Synthetic images of each camera c for the different locations k, standing for the supposed presence of an individual.
A tested threshold value is set; if this value is surpassed by the summation of the pixels in |\Psi(Bg^c_k, \bar{A}^c_{k,1})|, a blue rectangle is drawn and the probability of the ground for that view, q^c_k, is computed as one (eq. 22). Figure 7 and figure 8 illustrate the results.

\sum \left( |\Psi(Bg^c_k, \bar{A}^c_{k,1})| \right) > thrs    (22)
For each frame step f, once the probabilities q^c_k have been set to particular values to continue the process, the homographies are applied again, to q^c_k to obtain the combined probabilities in the TP view, and to the Bg^c images.
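One possible reading of the rule in (22) as code is an overlap test per ground location; the MATLAB sketch below reuses the binary image Bg and the rectangle masks Ak{k} from the earlier sketches, with the threshold thrs as a hand-tuned, assumed parameter.

```matlab
% Per-location detection test of eq. (22) for one camera view.
% Bg:    binary background-subtraction image of this camera
% Ak{k}: synthetic rectangle mask for ground location k
thrs = 200;                          % hand-tuned threshold (assumed value)
G    = numel(Ak);
qc   = zeros(G, 1);                  % per-view ground probabilities q_k^c
for k = 1:G
    overlap = sum(sum(Bg & Ak{k}));  % foreground pixels inside rectangle k
    if overlap > thrs
        qc(k) = 1;                   % detection: draw a blue rectangle here
    end
end
```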
Figure 8: Detected targets of the background subtraction, with blue rectangles drawn on each camera c.
The fused Bg^c images give as a result a map of gray intensities, where the feet correspond to the darkest colors on the ground, thereby initializing the probabilities on the ground q_k to particular values, as can be checked in figure 9.
The fused q^c_k give more accurate targets in TP, where the detected individuals in this view have been drawn as squares to cover more space; an example is shown in figure 9.
Finally, the addition of both yields the exact location of the individual seen from the top.
The ideas of the new approach have been taken from some of the works previously done, such as those that appear in [16], [18] and [19]. The main concept of representing targets depending on the location of the feet on the ground plane can be found as well in the projects [15], [17], [21], where the data are fused to obtain different weights that represent the likelihood of a detection.
Therefore, we proceed to explain what we have done to solve the last step, the tracking.
Figure 9: Homographies applied to the images after background subtraction and to the probabilities of each view, computed as squares.
The only inputs needed are the various frame images stored in the computer, which are mixed with the tracking blobs that appear at the feet location of each person once the triangulation has been applied over the top view. The final results can be seen in figure 10.
Figure 10: Final results of the image sequences with the corresponding white tracking blobs.

Regarding the data exchange between the nodes n and the central node N_c, the implementation of the investigators is simpler in comparison with ours, as illustrated in figure 11, since the probabilities on the top view and on the camera views are exactly the same. Thus, there is no necessity of combining the data in the main node, because the information is defined as calibrated. The only data that have to be sent are those from N_c to the nodes, to be able to solve problems such as occlusions, missed targets, and the tracking process.
The main reason for the improvement of their code in comparison with ours is that MATLAB performs well for rapidly trying out some simple ideas, plotting, linear equations, or sequential access to pixels. But when it comes to iterative operations over non-linear regions with complex structures, other languages are more adequate. C++, for instance, is flexible and allows arbitrary algorithm development with more logical and refined structures, which gives the possibility of implementing almost anything one can imagine. The use of OpenCV, in addition, provides libraries that make coding the algorithm faster.
Figure 11: Graphics of the information exchange in the algorithm of the authors. (a) Ground from top view to camera views. (b) Probability estimation for the calibrated locations.
6 Results
The results obtained in the implementation are shown in this section, where the main algorithm to analyze is the detection of people in the scene. First, the video sequences used for the testbed are described, as well as the first method implemented, the background subtraction. Then, the results of the detection algorithm are presented, with an evaluation of the conditions in which the approach works reasonably well and those in which it does not. Finally, the accuracy of the tracking is checked after the process of exchanging information between the nodes n and the main node N_c, together with how this affects the whole computation.
The hardware of the system is composed of the 4 different camera nodes and the central node N_c, defined as a 'virtual' node where the main computations are processed to fuse the outputs of each view, so that a simplification of the calculations is achieved.
On the other hand, the sequence with a maximum of 6 people entering the room has a duration of 1 minute and 58 seconds, involving a total of 2955 frames. The total number of frames taken into account is therefore 118. Reducing the amount of data leads to a faster computation, and the system shows similar final results, as the individuals do not move too fast in the room.
The main problem to solve in these cases results in a more complex reconstruction problem than usually happens in real-life situations, since people tend to occlude each other and the cameras are closer to the area of interest than in other environments, making the task of detection more challenging [12].
The region of the space under study has a size of approximately 30 m², and the area is discretized into G = 17 × 17 = 289 locations, corresponding to a regular grid with a resolution of 32.3 cm. The resolution taken for our results is less precise than the 20 cm chosen by the researchers in [12], since the algorithm implemented has some differences and the computation with more location points was slower to carry out.
Figure 12: Wrong background subtraction and detection in frame 826 for camera 2.
For the particular sequences tested, it turns out that for a small crowded room it is more difficult to obtain good results than with the Probabilistic Occupancy Map. The more people enter the scene, the more challenging the procedure to track the individuals becomes. However, in an outdoor environment where the cameras are located further away (but not so far that people are hardly seen) and a wider area is recorded, the task would be easier.
the rectangle influence for this tracking algorithm, applied once the detection is done.
Figure 13: Good final results in a non-crowded environment, with an inaccuracy of less than 20-30 cm.
Figure 14: Good final results in a crowded environment, with an inaccuracy of less than 20-30 cm. (a) Tracks with four targets.
Figure 15: (a) Inaccurate track for one of the targets. (b) Missed detections in some views and inaccuracy of more than 30 cm.
Figure 16: Wrong final results in a dense environment, with fused blobs and false and missed detections. (a) Wrong detections due to a crowded environment.
6.7 Data Exchange
One of the most important points to observe in the thesis is the exchange of information between the camera nodes c and the central node N_c, and how it affects the whole process. To perform the translation of coordinates from the particular views to the top view, and vice versa (i.e., a triangulation operation), the imagehomog function has been used (freely provided by Mike Brookes, with some modifications). The inputs needed are a given image in gray scale or RGB, and a matrix with the values of the coordinates of the points to take into account. For our purposes, we need the points of the ground plane, whose values are freely available in the database and are presented in equations (23), (24), (25) and (26). The calibration thus consists of one homography per camera view, which projects the ground plane. The head plane has not been considered. Equation (27) shows how the translation computation is performed.
H_0 = \begin{pmatrix} 0.176138 & 0.647589 & -63.412272 \\ -0.180912 & 0.622446 & -0.125533 \\ -0.000002 & 0.001756 & 0.102316 \end{pmatrix} \quad (ground plane for camera 0)    (23)

H_1 = \begin{pmatrix} 0.177291 & 0.004724 & 31.224545 \\ 0.169895 & 0.661935 & -79.781865 \\ -0.000028 & 0.001888 & 0.054634 \end{pmatrix} \quad (ground plane for camera 1)    (24)

H_2 = \begin{pmatrix} -0.104843 & 0.099275 & 50.734500 \\ 0.107082 & 0.102216 & 7.822562 \\ -0.000054 & 0.001922 & -0.068053 \end{pmatrix} \quad (ground plane for camera 2)    (25)

H_3 = \begin{pmatrix} -0.142865 & 0.553150 & -17.395045 \\ -0.125726 & 0.039770 & 75.937144 \\ -0.000011 & 0.001780 & 0.015675 \end{pmatrix} \quad (ground plane for camera 3)    (26)
Figure 17: Ground plane for the top view, on the left, and for the camera views.
To get the final results, the translation of the coordinates from the cameras has to be done n times, for the Bg^c images and the q^c_k probabilities. Once the coordinates are expressed in top-view values, the results can be shown from this perspective. However, to display the corrected q^c_k probabilities from the different angles, another triangulation projection has to be carried out.
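Whatever the exact interface of imagehomog, the underlying coordinate translation is a projective mapping by the 3x3 matrices in (23)-(26). The following MATLAB sketch applies the camera-0 matrix to a pair of hypothetical image points in homogeneous coordinates; the inverse matrix performs the mapping back from the top view.

```matlab
% Project image points to the top view with a ground-plane homography.
H0 = [ 0.176138  0.647589 -63.412272;   % camera 0 matrix, eq. (23)
      -0.180912  0.622446  -0.125533;
      -0.000002  0.001756   0.102316];
pts = [120 200; 310  40];                % image points (x y), one per row
ph  = [pts, ones(size(pts, 1), 1)]';     % to homogeneous coordinates
tp  = H0 * ph;                           % apply the homography
tp  = tp(1:2, :) ./ tp(3, :);            % normalize by the third row
% The inverse mapping (top view back to the camera) uses inv(H0).
```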
Once the original images I^c and the background images Bg^c have been stored in a folder on the computer, the detection algorithm is tested.
In the early stages, when there are no people in the scene, the algorithm is faster; concretely, the difference between inputs and outputs is approximately 1.5 seconds. This speed is related to the processing time that is saved because nobody is standing in the area of interest, so there are fewer heavy calculations to perform.
Later, when targets start to move around, the program is slower, taking between 3 and 4 seconds, as can be checked in figures 18 and 19. Here, the time consumed is linked to the calculations carried out by the homography operations to translate the coordinates to the top view. The iterations over each location on the ground and the drawing of the synthetic rectangles when there is a detection have to be taken into account too, but this is a lighter computational activity. The complexity is almost independent of the number of individuals, since the only variation is the addition of one more rectangle in the image, the time difference being of the order of milliseconds or microseconds. The speed is even faster than the POM algorithm, which takes about 4 seconds to achieve the outcomes.
The tracking needs all the detection probabilities for each location and time frame f, and it has to wait for the whole sequence to be processed before starting the calculations. This takes a total of 8 minutes for the sequence with 4 people and 6 minutes and 5 seconds for the sequence with 6 people in the laboratory. The calculations to produce the outputs of the tracking computation take between 2 and 3 seconds for each time frame f.
Figure 18: CDF of the time to process all the calculations for each frame in the sequence with 0 to 6 people entering the room.
Figure 19: CDF of the time to process all the operations for each frame in the sequence with 0 to 4 people entering the room.
7 Conclusion and Future Works
7.1 Conclusion
This thesis has focused on the problem of detecting and tracking various targets in an environment with a set of multiple cameras that record the same area from different angles, distances, and height positions, as well as on the communication between the nodes to perform the calculations required to achieve the best results. The objective has been to develop a robust algorithm which assigns a probability on the ground to a person standing at a certain location, and which is able to follow them along the image sequences.
First of all, each sequence recorded from a point of view is taken and divided into frames, to perform all the operations together and minimize the processing time. An algorithm is needed to visualize the moving targets and to differentiate them from the static parts. Once the cameras can detect the presence of an individual, the results are sent to and combined in a central node, which has been used to carry out the larger operations with high computational cost. The fused and filtered results are sent back to every node, and the final results are shown at the same time for all the views.
The results show a decrease in performance with respect to the algorithm implemented by the authors in C++, because MATLAB has a more graphics-oriented character in which recursive equations are a hard task to implement. It has been concluded that the method used depends heavily on the background algorithm; thus, an improvement in this area would significantly increase the detection and tracking rates. However, the time to develop the algorithm could then be longer, and the workload exchanged between the nodes heavier to process.
The first one shows better results in terms of precision and accuracy. The suspect positions on the ground, in comparison with POM, are significantly reduced and the detection process gains speed, at the expense of a higher run time due to the complicated dependency structure between locations. Besides, the switches between identities still exist.
The second, carried out by the same researchers as POM and KSP, proposes a model to compute the probabilities of presence of multiple objects on the ground from a single depth map, without requiring the large amount of data that modern RGB-D sensors provide. Therefore, less bandwidth is required, and the operations compute faster to get the results.
There are also other fields where the detection and tracking process with multiple cameras could evolve. The vast majority of systems are designed to work in real time, and the motion of the cameras could change from a static location to a framework with dynamic smart nodes in the near future.
References
[1] B. Rinner, and W. Wolf: “An Introduction to Distributed Smart Cameras”
Proceedings of the IEEE, vol. 96, no. 10, pp. 1565-1575, Oct. 2008
[2] A. C. Sankaranarayanan, A. Veeraraghavan, and R. Chellappa: “Object De-
tection, Tracking and Recognition for Multiple Smart Cameras” Proceedings
of the IEEE, vol. 96, no. 10, pp. 1606-1624, Oct. 2008
[3] M. E. Liggins II, C. Y. Chong, I. Kadar, M.G. Alford, V. Vannicola, and S.
Thomopoulos: “Distributed Fusion Architectures and Algorithms for Target
Tracking” Proceedings of the IEEE, vol. 85, no. 1, pp. 95-107, Jan. 1997
[4] C. Y. Chong: “Distributed Architectures for Data Fusion” International
Conference on Multisource-Multisensor Information Fusion, pp. 1-8, Feb.
1998
[5] K. C. Chang, C. Y. Chong, I. Kadar, and S. Mori: “On Scalable Distributed
Sensor Fusion” The 11th International Conference on Information Fusion,
pp. 1-8, Jul. 2008
[6] Mohammad Shafiee, Zohreh Azimifar, and Alexander Wong: “A Deep-
structured Conditional Random Field Model for Object Silhouette Tracking”
CoRR, vol. abs/1501.00752, Jan. 2015
[7] Y. S. Chia, W. Y. Kow, W. L. Khong, A. Kiring, and K. T. K. Teoi: “Kernel-
Based Object Tracking via Particle Filter and Mean Shift Algorithm” The
11th International Conference on Hybrid Intelligent Systems (HIS), pp. 522-
527, Dec. 2011
[8] B. Deori, and D. M. Thounaojam: “A Survey On Moving Object Tracking
In Video” International Journal on Information Theory, vol. 3, no. 3, Jul.
2014
[9] Emil Eriksson, György Dán and Viktoria Fodor: “Predictive Distributed
Visual Analysis for Video in Wireless Sensor Networks,” IEEE Trans. on
Mobile Computing, to appear
[10] Emil Eriksson, György Dán, Viktoria Fodor: “Algorithms for Distributed
Feature Extraction in Multi-camera Visual Sensor Networks,” in Proc. of
IFIP/TC6 Networking, May 2015
[11] Emil Eriksson, György Dán and Viktoria Fodor: “Real-time Distributed
Visual Feature Extraction from Video in Sensor Networks,” in Proc. of
IEEE International Conference on Distributed Computing in Sensor Sys-
tems (DCOSS), May 2014.
[12] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua: "Multicamera People Tracking with a Probabilistic Occupancy Map" IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 267-282, Feb. 2008
[13] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua: "Multiple Object Tracking Using K-Shortest Paths Optimization" IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 9, pp. 1806-1819, Sep. 2011
[14] Seth Benton: “Background subtraction, part 2: MATLAB to C using
MCS”, EE Times newsletter, Aug. 2008