
Multiple Target Detection and Tracking in a

Multiple Camera Network

JUAN MANUEL PABLO RODRÍGUEZ

Master’s Degree Project


Stockholm, Sweden September 2015

TRITA-EE 2015:85
Multiple Target Detection and Tracking in
a Multiple Camera Network

JUAN MANUEL PABLO RODRÍGUEZ

Stockholm September 2015

Department of Communication Networks


School of Electrical Engineering
Kungliga Tekniska Högskolan
Abstract
Given synchronized video sequences from a number of cameras with overlapping fields of view, the detection and tracking of an a priori unknown number of individuals entering a determined area is considered. It is shown that a generative model can accurately follow the individuals and effectively handle problems such as occlusions in each view independently. The aim of this thesis is to implement the exchange of information between the cameras where the detection and tracking processes take place. The inputs are obtained from synchronized videos, and the frames are taken individually and treated as independent images. The proposed algorithm was implemented in MATLAB, and results obtained on a personal computer are presented. The results show that the algorithm achieves good tracking accuracy and relatively low computational complexity, while also making it possible to observe the communication requirements between the cameras and the processing node.

Contents
1 Introduction
  1.1 Methodology
  1.2 Ethical implications
  1.3 Structure of the report

2 Background
  2.1 Introduction to Distributed Smart Cameras
  2.2 Multiple Smart Cameras
  2.3 Architecture of the network
  2.4 Mathematical Algorithms for Fusion
    2.4.1 Fusion for target tracking
    2.4.2 Scalable fusion algorithms
  2.5 Detection and tracking algorithms
    2.5.1 Detection approaches
    2.5.2 Tracking approaches

3 State of the art
  3.1 Object detector: Probabilistic Occupancy Map
    3.1.1 Introduction
    3.1.2 Related approaches
    3.1.3 General idea
    3.1.4 Advantages of the algorithm
    3.1.5 Requirements
    3.1.6 Generative Image Model
    3.1.7 Camera Calibration
    3.1.8 Central processor
    3.1.9 Virtual Rectangle Influence
    3.1.10 Equations of the probabilistic ground plane
    3.1.11 Problems addressed
  3.2 K-Shortest Path Algorithm to track targets
    3.2.1 Introduction
    3.2.2 Previous and related approaches
    3.2.3 Main concept
    3.2.4 Advantages
    3.2.5 Assumptions
    3.2.6 Formalization and equations
    3.2.7 Complexity reduction
  3.3 Previous Testbed and results

4 Problem formulation
  4.1 Simplification of the samples
  4.2 Data exchange in camera network

5 Implementation
  5.1 The Background Subtraction
  5.2 Detecting Process
  5.3 Tracking implementation
  5.4 Changes made to the original approach

6 Results
  6.1 Simulation samples
  6.2 Background dependency
  6.3 Cameras influence
  6.4 Evaluation of the detection algorithm
  6.5 Synthetic Rectangles influence
  6.6 Tracks Evaluation
  6.7 Data Exchange
  6.8 Computational complexity

7 Conclusion and Future Works
  7.1 Conclusion
  7.2 Future work

1 Introduction
Multiple camera systems are employed in scenarios where the surveillance of a particular area is the main task. Some of the functionalities that may be required by a given client are facial recognition, detection of static or dynamic camouflaged targets, tracking of specific targets, detection of unusual behaviours, identification, classification of targets and objects belonging to the same type, etc.

The environment is composed of a set of cameras that have to be calibrated, and the information reaching each one has to be processed first individually and then fused. The cameras therefore form a system of smart cameras in which each one plays the role of a sensor node [1].

The scene presents some challenges, such as a crowded room, similar colors of people's clothes, proximity of the individuals as they walk close to each other, and distance problems between the cameras and the zone under study. These constraints affect the background subtraction algorithm, producing shadows, wrongly subtracted targets and merged blobs, which leads to incorrect detections and identifications of the people. Hence, these issues need to be studied, and measures have to be taken to develop a robust algorithm.

Regarding the method of acquiring and combining the data, different topologies can be used to implement the communication between the nodes. Depending on the chosen one, a central processing node or several processing nodes should be employed to decrease the workload of the camera nodes and perform the computationally expensive operations [4]. The processing task is then distributed over several nodes, which decreases the delay and reduces the bandwidth requirements.

The goal of this thesis is to implement an algorithm that detects multiple people in an unknown scenario and follows the targets over the video sequences.

1.1 Methodology
The information associated with the thesis has been approached from a general basis, starting with the details of how a system of distributed smart cameras works, the architecture and the algorithms related to the communication between the nodes, and progressing to the more specific and complex concepts of the project topic needed to understand the later implementation. Specifically, the literature study includes a deep study of the detection and tracking algorithms (Probabilistic Occupancy Map and K-Shortest Path). After acquiring the necessary knowledge, the OpenCV code provided by the authors of [12] was tested and studied in order to translate the main concepts to MATLAB code. Parts of the work not covered by their free implementation had to be developed, such as the use of homographies to perform triangulation calculations and the influence of the virtual rectangles on the detection parameters. Once these issues were solved, the implementation of the POM equations was carried out. However, after much research and considerable effort, we reached the conclusion that the model implemented in C++ did not perform well in MATLAB, and the development of the equations was therefore approached from another point of view. As the goal of the thesis is to implement a network that is capable of detecting a number of people and to observe the communication between the cameras during the process, the results are good enough and obtained quickly, even though the detection and tracking processes are less accurate than POM and KSP.

1.2 Ethical implications

The algorithm may be used for research purposes or as the basis of future work in this field. Some examples could be the identification of players in a sports game, rescue operations, or the surveillance of an area. However, we are aware that the model could also be used, for instance, for military activities. There may be ethical problems with the privacy and security of some people, depending on the purpose on which potential developers focus. If such work raises any of the issues described above, the responsibility remains with the authors who carry out that practice, while we consider ourselves free of guilt, since we develop a system that does not harm anybody. We would appreciate it if the names of the researchers of this thesis were included in the references of future projects that use it.

1.3 Structure of the report


The thesis report is organized in seven sections. Section II describes the concepts of distributed smart cameras and their characteristics, the architectures that can be built to let a set of cameras communicate with each other, and the algorithms needed to carry out this communication. A brief summary of detection and tracking approaches finishes that section and introduces Section III, where the Probabilistic Occupancy Map (POM) and the K-Shortest Path (KSP) algorithms are laid out. In Section IV, the problem is formulated in order to work out how to achieve the main goal; some of its characteristics are shown together with the scheme of the communication between the nodes. Section V discusses the implementation problem and what has been done to achieve the aims and solve the issues. Section VI presents the final results and how the different parameters that configure the system affect them. Finally, Section VII ends with a conclusion and possible future work to improve this area.

2 Background
2.1 Introduction to Distributed Smart Cameras
A smart camera is a vision system that captures images and can be considered a smart sensor or node, due to its capability to compute and process specific data in order to produce the best results. The fields of application are numerous, such as computer vision, image sensors, embedded computing and sensor networks, and the improvements in each field result in a new way of understanding visual communication to solve complex tasks in real time. The advantages in comparison with previous technologies, such as the prototypes developed in the 1980s, which integrated sensing, and in the 1990s, which featured the first fabricated CMOS implementations, are reduced size, lower power consumption, higher reliability, lower workload, less bandwidth used to transmit the data, etc. [1]

Within the framework of these new communication technologies, research in related fields, such as the selection of the architecture to link the entire network and the corresponding algorithms to compute and estimate the data that flow through this environment, has become important. The variety of image-processing algorithms is therefore huge, and there are implementations for different tasks such as motion detection, segmentation, tracking, object recognition, etc. [1]

Some of the applications developed in the last decade to improve performance are: video surveillance, to detect unusual behaviors in a scene and to identify criminals or suspect individuals; adaptive motion sensors applied to vehicles to make driving more comfortable and prevent traffic accidents; medical applications such as complex surgeries that require a lot of time and accuracy, aid for elderly or handicapped people, and tests and research to face some of the most common diseases; robots that simulate the behavior of animals or humans; video games and other entertainment based on gesture and motion recognition; etc.

Figure 1: Architecture of a smart camera. Extracted from [1].

2.2 Multiple Smart Cameras
The term distributed cameras denotes a system of multiple cameras spread over space, possibly with overlapping fields of view, covering a determined area. The images obtained from different angles are thus analyzed jointly at the same time. The main purpose of this design is to handle the occlusions that may appear in one particular view due to static or dynamic objects. Besides, a system with multiple cameras makes it possible to obtain better quality, since the lenses can focus on a determined point covering a smaller area, giving a sharper capture of the image.

To build a model of distributed smart cameras, one approach is to design an image-extraction algorithm for a single camera and then extrapolate it to a system with several cameras, bearing in mind that algorithms to compute the joint and separate data have to be implemented so that the nodes can communicate. It is thus essential to determine which sensor acquires the information, and which node processes, computes and fuses the data and sends the results back to those interested in them. Such a problem was considered in the context of distributed feature extraction for object recognition in [9], [10] and [11].

It is important to note that the 3-D spatio-temporal environment is projected onto a planar screen in each camera, so a proper calibration with corresponding reference points is needed for efficient communication. Once the correspondences between object locations across views are established, the next step is to estimate the presence of the object at predefined 3-D coordinates, which is known as triangulation. The inverse operation can be done with planar scenes and homographies: for example, the ground plane (from a virtual or real top view) can be taken to represent the object in a 2-D plane, so that the information can be processed in every camera. After making the assumptions and doing the proper calculations, the planar image has to be transformed back to the world plane, which may be a tedious problem, but it can be done with the aid of virtual parallel and perpendicular lines on the planes. [2]
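As a minimal sketch of this mechanism, the MATLAB snippet below maps a pixel location to ground-plane coordinates with a planar homography. The matrix H is a placeholder value; a real H would come from the camera calibration.

    % Minimal sketch: projecting an image point onto the ground plane with
    % a 3x3 homography H. H = eye(3) is a placeholder; in practice H is
    % obtained from the calibration (see Section 3.1.7).
    H = eye(3);
    p_img = [320; 240; 1];        % pixel location in homogeneous coordinates
    p_gnd = H * p_img;            % apply the homography
    p_gnd = p_gnd / p_gnd(3);     % normalize the homogeneous scale
    % p_gnd(1:2) now holds the ground-plane coordinates of the pixel.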

The main tasks on which the distributed smart cameras focus are detection, tracking and recognition of objects or humans in a given zone.

The first aspect is detection, i.e. the ability to localize objects or individuals in the space under study. In general, it is a complex procedure, since objects belonging to the same class may have different shapes, besides other aspects such as illumination, color, pose, angle or camera parameters. Depending on the behaviors or results that we expect, various algorithms may be defined to fit the requirements. Among them are background modeling for moving objects, which is comparatively simple since only the motion present in the sequences is taken into account, and background subtraction, to determine which objects are always fixed in the scene and extract the desired pixels for the calculations. [2]

The second issue is modeling the track, following the targets using the data acquired by the object detector as input. The most common algorithms to estimate the posterior object location are based on Kalman filters and particle filters, explained in detail later. [6], [7], [8]

The last aspect is the identification and recognition of the targets. There are two main approaches, local feature-based and global feature-based, depending on whether we are interested in the details of one person, for example, or in a multiple-detection framework. [2]

2.3 Architecture of the network


The need to lay out and choose a proper architecture for the network is one of the basics that influences how the communication between the nodes is done and, hence, the relationships between them. Over the years, several models have been designed and studied to improve the quality and complexity of the links between transmitters and receivers. The main architectures are explained below in detail. We consider models with memory, since the case of study is an environment where smart cameras process data continuously over time.

The centralized network is the traditional architecture for fusing information, where data from multiple locations in space are sent to the same central point to be analyzed. The results are then distributed to the final user.

In the main node there is a single track database, and the data received from the other sensors are aligned and associated with the tracks belonging to this database in order to update them and produce their state estimates.

This architecture produces the optimal results, because there is no loss of information, and in the case where there is enough bandwidth to process the data, it is ideal to put into practice. However, to satisfy the requirements, the power and energy needed by the main base station might be high.

The development is also quite simple. It has been shown in many studies to be a good implementation for wired solutions, like DCS (Dynamic Service Controller) networks, where resources are not the main concern. Issues appear where there are not enough processing resources, as in WSNs (Wireless Sensor Networks). There may also be failures in the fusion center due to a system crash, or because an undesired, unknown intrusion compromises the security of the digital environment [3]. To avoid such obstacles, other structures have been developed.

Hierarchical networks are one of the solutions proposed to compensate for the lack of efficiency of centralized networks in some scenarios. Here, the fusion nodes are arranged in a hierarchy where, usually, the low-level nodes process the sensor data and the high-level nodes perform the operations to combine the flow of information that they receive. The high-level nodes can achieve considerable savings in communication if data arrive at a rate slower than the sensor observation rate.

One of the advantages is that the communication does not have to be synchronized, and the local nodes can do many calculations before reporting to a more important node. This type of architecture can be divided depending on whether it has feedback or not.

When there is no feedback from the high level, the common information between the received estimate and the current estimate is the last estimate communicated from the low level. When the state is dynamic, the probabilities have to be extrapolated to a common time. As long as the independence conditions on the observation fields are satisfied, a lot of sensor data can be processed before the results are sent to the high level to be fused.

In the case with feedback, fusion takes place at both levels, and the low- and high-level functions are almost identical. This is equivalent to a periodic broadcast: when the feedback reaches each node, the fusion node has to remove its own previously sent information, which was stored, before combining the local estimates. The hierarchical architecture is employed in many applications, e.g., a fusion node for radar data, another node for infrared sensor data, and a node that combines the generated tracks. [4]

Distributed networks are the other solution to mitigate the shortcomings of traditional networks in certain contexts. They can be defined as the opposite of centralized models. In this kind of configuration there is no hierarchical relationship: all the fusion nodes are at the same level as far as processing is concerned. The nodes can communicate with each other depending on the layout used, possibly in an asynchronous way. The sensors depend on each other to achieve the best result, which is then distributed to the clients.

In a fully distributed network, the sensing devices may have their own processor to fuse the local data and then cooperate with others to complement their own inputs. The common information between nodes depends on the specific communication path, but now the fusion equations are not simple in comparison with the hierarchical case. The proper choice of an algorithm that works well is thus more difficult, because a long pedigree of data is needed and the memory requirements to store it are higher. Hence, more computation is employed to identify the common ancestors in order to apply the fusion equations.

The main premise of the centralized techniques no longer holds for the new ones, since centralized techniques usually assume that the errors in the data to be fused are independent. Now, only the sensor data can be considered conditionally independent given the state to be estimated. Moreover, the "rumor propagation" or "chicken little" phenomenon has to be taken into account, since information arriving over multiple paths has to be fused and redundant information is not desired. Besides, the information communicated between nodes has to contain all the data needed to reconstruct the optimal estimate; if these cannot be sent, the resulting fused estimate will be suboptimal. [3]

Figure 2: Different architectures of networks. Extracted from [3].
sent, the resulting fused estimate will be suboptimal. [3]

The main advantage of the distributed and hierarchical solutions in comparison with centralized networks is the lighter processing load at each fusion node, since the workload is shared by several nodes and tracks are communicated less often. There is no single big database to save the information, since every node has its own local database. The speed of the transmissions is therefore higher, and the protection against failures improves survivability. [3]

There are some issues that have to be addressed, such as how the nodes should relate in order to share the fusion assignments, i.e., what the task of each sensor will be and what responsibilities it will have in a given scenario. It is also a challenge how to combine the results from two fusion nodes when making a decision on a hypothesis, such as detecting the presence of targets, and how to fuse target tracks. Further requirements have to be fulfilled, since distributed configurations work with more complex algorithms.

2.4 Mathematical Algorithms for Fusion
2.4.1 Fusion for target tracking
While the architecture designed for the network is one of the main aspects of the communication, other aspects are equally important, such as the implementation of an algorithm that works according to our requirements. It is important to fuse the data in the nodes, to achieve high-quality results, and then to select what to do in the next step. It is essential to estimate and predict some of the posterior data or tracks, to be able to reconstruct all the information in case of missing messages. To do so, we have to deal with the spatial registration and temporal prediction of the target on the inputs of the sensors, i.e., the data that we receive as input.

The distributed data association problem is discussed in terms of track-to-track association likelihoods. We present later the distributed versions of two centralized tracking algorithms, joint probabilistic data association (JPDA) and multiple hypothesis tracking (MHT). [3]

Data alignment is the first step in the fusion process; it requires converting the data to a common coordinate system so that the target data obtained by the sensors can be fused. To compute the data, predetermined target kinematics are linked to the track, and a temporal prediction is carried out using, for example, a Kalman filter to estimate the location or state of the target. After that, the measurements are divided into sets by association methods. Some of the more distinguished approaches are presented below.

The nearest neighbor approach assumes that the measurement closest to the predicted track position represents the target. The advantages are obvious, since few computational resources are needed to implement this method. However, it may not work well in instances where the quality of the sequence is poor or where following the track is ambiguous. [3]

The JPDA algorithm assigns a weight to each detection, representing the likelihood of it being associated with the target. One of the main advantages of doing this is that it allows the assignment of a single sensor report to multiple tracks, but excessive clutter may be a real problem and drive the tracker away from the original position. [3]

The MHT model considers possible associations of actual detections to previous tracks, and the likelihoods are computed by comparing the extrapolated track state estimates with the detections. The only problem is that more computational resources are spent if we want better results in complex situations. [3]

The last step is updating the estimates of the states of each track using a filter, for example the particle filter, to combine the associated measurements with the predicted states.

2.4.2 Scalable fusion algorithms
The basics of distributed information fusion have been studied in depth, but practical applications of these theoretical results to non-deterministic information are still uncertain. The main point is to identify and remove the redundant information received from the different sensors that has to be fused, to achieve efficient use of the resources. Building the optimal fusion formula requires carrying long pedigree information, which is not practical in environments with limited communication bandwidth. For this reason, sub-optimal algorithms are used that, under certain assumptions, provide a solution in a determined environment. Some of the approaches are described in the next paragraphs.

The channel filter is defined by a pair of transmitting and receiving agents. The receiving agent is in charge of removing the common information in a single channel. The equation that defines how it works is

p(x) = \frac{p_1(x)\,p_2(x)/\bar{p}(x)}{\int p_1(x)\,p_2(x)/\bar{p}(x)\,dx}    (1)

where p_1(x) and p_2(x) are the two probability density functions to be fused and \bar{p}(x) is the density function received from the same channel at the previous communication time, i.e., the common "prior information" to be removed in the fusion formula. [5]

Hence, it only needs to keep the estimate of the previous transmission or reception from the channel and remove it when combining the current estimates. There is therefore no need to save long chains of data to carry out this implementation. However, if the time between transmission and reception is long, the approach is useless and the dependent information might be lost.

Naive fusion assumes that the dependency between the density functions is negligible. It is the simplest type of fusion; nevertheless, it can be unreliable in practice [5]. The formula is written as

p(x) = \frac{p_1(x)\,p_2(x)}{\int p_1(x)\,p_2(x)\,dx}    (2)

One approach that works with high accuracy when the dependency between two distributions is unknown is Chernoff fusion. It is based on the equation

p(x) = \frac{p_1^{\omega}(x)\,p_2^{1-\omega}(x)}{\int p_1^{\omega}(x)\,p_2^{1-\omega}(x)\,dx}    (3)

where \omega \in [0, 1] is an appropriate parameter that minimizes a chosen criterion. [5]

The resulting fused density function that minimizes the joint data lies halfway between the two original densities in terms of the Kullback-Leibler distance. Due to that, the resources needed to satisfy the different cases might be computationally extensive. One particular case is choosing the parameter \omega that minimizes the determinant of the fused covariance, which means, in the Gaussian case, minimizing the Shannon information. Another case is setting this parameter to 0.5, in which case the denominator becomes \int \sqrt{p_1(x)\,p_2(x)}\,dx, which is the Bhattacharyya bound. [5]
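A minimal numerical sketch of equation (3) on discretized 1-D densities is given below; the Gaussian inputs are assumed, and \omega = 0.5 gives the Bhattacharyya case just mentioned.

    % Chernoff fusion of two discretized 1-D densities, equation (3).
    % The input densities are assumed Gaussians for illustration.
    x  = linspace(-10, 10, 2001);
    p1 = exp(-(x - 0).^2 / (2 * 1^2)) / sqrt(2 * pi * 1^2);   % N(0, 1)
    p2 = exp(-(x - 1).^2 / (2 * 2^2)) / sqrt(2 * pi * 2^2);   % N(1, 4)
    omega = 0.5;                     % Bhattacharyya case
    num = p1.^omega .* p2.^(1 - omega);
    p   = num / trapz(x, num);       % normalize so the result integrates to 1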

2.5 Detection and tracking algorithms


2.5.1 Detection approaches
Detection protocols to identify a given object or human in a scene have become a demanding research area in the domain of image processing and surveillance applications. Most of the methods are based on segmentation using background subtraction. In video object tracking, selecting the right features is crucial to be able to distinguish the objects in space, taking into account their uniqueness. Some of these characteristics are color features; gradient features for human detection; edge features, which are less sensitive to illumination changes; texture features; optical flow, referring to the brightness constraint of each pixel in consecutive frames; spatio-temporal features for the recognition of actions in a three-dimensional model; multiple-feature fusion; and biological features that describe the main movement behaviors of humans. The best-known types of algorithms are the following:
Segmentation based: used to segment the image frame into segments, in order to find the objects in which we are interested. The criterion for a good partition plays an important role. There are three main methods.

The first one is the graph cut, where the input image is considered as a graph and segmentation is addressed as a partitioning problem. [8]

The second, mean-shift clustering, is used to find the clusters of image pixels in the frame. [8]

Finally, active contours can be defined as a closed contour evolving towards the boundary of the object. [8]
Background modeling based: particular attention has been paid to this family, since it is the pillar of the later work. There are various methods.

The Gaussian mixture model is a technique for modeling a dynamic background, as it can represent a complex distribution for each pixel, but it has the disadvantage of slow convergence in the early stages. [6]

Eigen-space decomposition of the background is another approach, which projects the current image onto the eigen-space and calculates the difference between the reconstructed and actual images in order to detect the foreground objects, providing much robustness against light changes. [8]

The most recent in this area is the Hidden Markov model, which represents
the intensity variations of a pixel in an image sequence as discrete states. [8]
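As a minimal sketch of the background-subtraction idea underlying these methods (a simple running-average model, not the mixture or eigen-space variants above; the learning rate, threshold and frame file names are assumed):

    % Running-average background model with thresholding. The learning
    % rate alpha, threshold tau and frame file names are hypothetical.
    alpha = 0.05;  tau = 30;
    bg = double(rgb2gray(imread('frame001.png')));   % assumed first frame
    for t = 2:100
        f = double(rgb2gray(imread(sprintf('frame%03d.png', t))));
        fgMask = abs(f - bg) > tau;          % foreground = large deviation
        bg = (1 - alpha) * bg + alpha * f;   % slowly adapt the background
    end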

2.5.2 Tracking approaches


The main task of an object tracker is to find the motion trajectories of an object in a video sequence and how they progress over time, identifying the object in each frame. The usual pattern is to have the object detected by one of the approaches explained above and then to follow the trajectory between the instances of the object across frames. Filters are the key to processing this information. Some of them are explained below:

The Kalman filter is a single-object state estimation procedure. It is used as an estimator, under Gaussian assumptions, to predict and correct the past, present and future states of an object with predefined linear system dynamics. The two typical stages of the filter are the prediction, i.e. the time update, and the correction, i.e. the measurement update. To address the issue of non-linear system dynamics, variants such as the Extended Kalman filter have emerged, which shows faster convergence in terms of iterations. However, it still does not resolve the issues associated with non-Gaussian behavior. To face this setback, researchers in the field have paid attention to particle filters. [6] [7]

The particle filter is a hypothesis tracker based on the recursive Bayesian filter with Monte Carlo sampling. The conditional state density at a given time is represented by a set of samples with weights indicating the importance of each sample. The steps the filter follows to estimate the state of the dynamics from observations are: selection of random samples, prediction by generating new samples from the selected ones, and correction by computing the corresponding weights. Despite performing with a high percentage of success in real cases, it may be computationally expensive when the appearance of the object changes drastically and a large number of particles is required. [7]
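The following is a minimal bootstrap particle filter sketch for a 1-D state in MATLAB (assumed random-walk dynamics and Gaussian likelihood; the measurement value and all parameters are hypothetical), illustrating the selection, prediction and correction steps:

    % One iteration of a bootstrap particle filter for a 1-D state.
    % Dynamics (random walk) and likelihood (Gaussian) are assumed.
    N  = 500;                        % number of particles
    xp = randn(1, N);                % initial samples
    z  = 0.7;                        % current measurement (hypothetical)
    xp = xp + 0.1 * randn(1, N);     % prediction: propagate the samples
    w  = exp(-(z - xp).^2 / (2 * 0.5^2));   % correction: likelihood weights
    w  = w / sum(w);
    xhat = sum(w .* xp);             % weighted state estimate
    c = cumsum(w);  c(end) = 1;      % selection: resample by the weights
    idx = discretize(rand(1, N), [0 c]);
    xp  = xp(idx);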

Both filters presented above yield very good results when the objects are not close to each other. For tracking multiple objects in video sequences, the most likely measurement for a particular moving object needs to be associated with the state of that object, which is called the correspondence problem. To solve it, one algorithm that performs well in practice is the nearest neighbor approach, which is very simple to implement and does not use many computational resources.

The last filter, the mean-shift filter, is introduced to optimize a smooth similarity function and obtain the direction of movement of a target. The tracking is based on a non-parametric feature-space analysis. The goal is to maximize the appearance similarity iteratively by comparing the histograms of the target and the hypothesized object model. The histogram comparison is defined by the Bhattacharyya distance, and it has to converge to carry out the operation correctly. [6]
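A minimal sketch of this histogram comparison (assumed 16-bin grayscale histograms and hypothetical image files):

    % Bhattacharyya coefficient/distance between two normalized histograms.
    % Bin count and image file names are assumed for illustration.
    h1 = histcounts(double(rgb2gray(imread('target.png'))), 16, ...
                    'Normalization', 'probability');
    h2 = histcounts(double(rgb2gray(imread('candidate.png'))), 16, ...
                    'Normalization', 'probability');
    bc = sum(sqrt(h1 .* h2));        % Bhattacharyya coefficient, in [0, 1]
    d  = sqrt(1 - bc);               % Bhattacharyya distance (0 = identical)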

3 State of the art
In this section we present the state of the art: the algorithms that, step by step, build a robust approach to first detect, then track and finally identify people or even objects in adverse conditions, under some assumptions adopted to preserve correct operation and obtain the best results.

3.1 Object detector: Probabilistic Occupancy Map


3.1.1 Introduction
The detection algorithm in the state of the art is called the Probabilistic Occupancy Map (POM) [12] and, as the name indicates, it estimates the probabilities of an object or person occupying a certain position of the ground plane of the space under study. We explain later the steps to follow to accomplish the required detection. To check the reliability of this algorithm, several scenarios have been tested that presented difficulties with poses, illumination changes, occlusions, etc., from various points of view.

3.1.2 Related approaches


Some previous and related works are introduced to give a general idea of the model. The studies are divided into monocular approaches and multi-view approaches.

Among the monocular approaches, we can highlight, on the one hand, blob-based methods, which operate on binary blobs extracted from a single camera and combine shape analysis with tracking to locate people. One example is the Bayesian Multiple-Blob Tracker (BraMBLe), which generates a blob likelihood based on a known background model and an appearance model. It can be extended to multiple cameras, but they would work independently; the only advantage is that the target can be observed from other angles in case of occlusion or wrong detection, just to check whether the received information is correct. Basically, a background/foreground segmentation is performed on calibrated images. On the other hand, there are color-based methods, which take into account the various colors in the scene to be able to differentiate targets from the background in real time. To do so, the images are segmented pixel by pixel and classified, i.e., Gaussian mixtures are continuously updated.

In the multi-view approach, more cameras compose the system, because this becomes necessary when the requirement is to accurately detect and track several targets at 3-D locations in a complex environment. There are mainly three families of methods; two of them have already been presented as blob-based and color-based methods, but now the fusion of data arriving from other angles is one of the issues to keep in mind, since the images can be processed jointly. The remaining family, occupancy map methods, differs from the others in that the technique uses a discretized map onto which the detected objects are back-projected. In some of these works, a mixture of Gaussians is fitted to the resulting score map to estimate the likely locations of individuals, and is then combined with a Kalman filter to model the motion. The detection method in the state of the art described here shares common characteristics with the latter, but differs in how the usual color and motion models are combined with the generative model to estimate the probabilities of occupancy.

3.1.3 General idea


A brief summary of the global operation of the algorithm is presented here, so that readers can easily pick out the main ideas of the object detector. First, a background/foreground subtraction algorithm is performed to detect people, carried out individually in each camera. The estimated probabilities are written as P(X_k^t = 1 \mid B^t), where X^t = (X_1^t, \dots, X_G^t) is a vector of Boolean random variables standing for the occupancy of the G locations of the ground plane, X_k^t is the variable for location k, and B^t denotes the binary images generated by the background subtraction. The data are then sent to a central node that processes the common information of all the views acquired at the same time instant. The results are the approximate marginal conditional probabilities of occupancy of the ground plane, computed to achieve the highest possible accuracy. The approximation is obtained by minimizing the Kullback-Leibler divergence between a product law and the real posterior, i.e., by iterating calculations so that the synthetic image produced by the algorithm (the image obtained by putting rectangles of human size at occupied locations) gets closer in distance to the real location of the object on the ground plane (around 100 iterations, according to the studies carried out to produce the best results) [12]. The results are sent back to each camera, and the cameras work independently after that.

3.1.4 Advantages of the algorithm


The accuracy of POM provides localizations within a few tens of centimeters. In contrast to most other state-of-the-art methods, which update estimates recursively frame by frame and may fail when difficulties persist over a number of consecutive frames, POM works with basic color methods in each frame individually and can recover from such situations thanks to the computation of globally optimal scores, taking the information over a predefined number of frames. Moreover, traditional approaches tend to be pessimistic with respect to the joint information gathered at the central station from the different nodes, and produce wrong results because some data are discarded if a camera cannot detect correctly. In contrast, POM works well given sufficient evidence from only one view.

18
3.1.5 Requirements
To ensure the proper run of the tasks, some assumptions in the spatio-
temporal dimension have to be set. These are:

Individuals do not take in account the presence of others in their proximity


when they are moving, and people detected cannot share the same space in the
ground plane. This can be formalized as:
Y
P (X 1 , ..., X G ) = P (X k ) (4)
k

The relations of dependence between each view depend of the identification


of individuals in the room, so, each image taken by the different cameras at
the same time is considered as a vector X = (X1 , ..., XG ) plus some additional
noise. Thus, as soon as the presence of all individuals is known, the views can
be treated as independent as we can observe in the equation (5)
Y
P (B 1 , ..., B C |X) = P (B C |X) (5)
c

POM is performed in individual time frames, which means that does not
have consistency along time or does not follow a continuity flow (temporal in-
dependent frame) [12]. Therefore, POM does not maintain identities of targets
across the frames which makes necessary the use of an additional algorithm to
track people. The optimization scheme introduces 4 seconds of delay between
the first image acquisition and the last results, which are compatible with many
applications in the real world and yield high robustness.

3.1.6 Generative Image Model


The first thing to be done is the background subtraction, to isolate the targets of interest. The calculations are done recursively, improving the obtained image iteration by iteration (reducing the noise in the frames and accurately differentiating targets from background). Typically, one color is assigned to the background (e.g., white) and another to the detections (e.g., green or black); the color model depends on how the cameras are calibrated to obtain the images. The model is given by the following equations:

P(B \mid X) = \prod_c P(B^c \mid X)    (6)
            = \prod_c P(B^c \mid A^c)    (7)
            = \frac{1}{Z} \prod_c e^{-\Psi(B^c, A^c)}    (8)

where B^c is the image produced by the background subtraction, A^c is the synthetic image obtained by putting rectangles at the locations where X_k = 1, \Psi is the pseudo-distance between the image produced by the background subtraction and the synthetically created image, and Z is a normalization constant.

3.1.7 Camera Calibration


The POM algorithm needs a calibration consisting of two homographies per camera, which project the ground plane to a virtual top view and the head plane (the plane parallel to the ground located 1.75 meters above it). The calibration can be done using the Tsai method. Once the ground plane is defined, it is divided into small discrete virtual grid cells (of square or rectangular shape) to assign detected targets to positions in the space under study. Since the cameras are calibrated to work together, some reference points are needed to establish a 2-D coordinate system, so that the same object can be recognized from the different views and redundant information reduced. Each grid cell is thus perfectly located on the ground and known to the system.

3.1.8 Central processor


The central processing unit can be a computer or a camera designed to perform this task. Once the images of the same time step from each viewpoint have been processed by the individual background subtraction, the joint data are computed at the ground plane locations.

To precisely calculate, over the whole captured sequence, the location of a target in space, the POM algorithm approximates the marginal conditional probabilities of occupancy of the ground plane at the real location of the body. To estimate these quantities, a uniform value is first used to compute the average synthetic image; a recursive method is then needed to improve, iteration by iteration, the estimate of the location on the ground plane. To make the task easier, a synthetic image is assigned to each discretized cell of the ground plane, consisting of a rectangle, the image-plane projection of a 3-D cylinder, since people are modeled with this shape (for this reason two homographies are needed, for the ground plane and the head plane). Thus, the task is to set, approximate and contain the silhouettes of humans within the synthetic image developed by the algorithm.

After these operations, the final results are shown for each camera (a sum of the synthetic images and the frames extracted after applying the background subtraction), and each camera can then operate independently, thanks to the posterior support of the object tracker, which is in charge of identifying the targets across frames and thereby recognizing every individual from a single point of view.

3.1.9 Virtual Rectangle Influence
Depending on the parameters used to produce the particular area of the rectangles, detection performance may improve or deteriorate. Tests have shown that for model heights between 1.70 meters (the height of an average person) and 2.20 meters there are no significant changes. With smaller heights the model efficiency tends to decrease, since if the model is shorter than the person, the algorithm becomes more sensitive in the production and evaluation of the binary blobs. Another test checked whether the algorithm can detect children: small kids with a height of 80 centimeters were not detected, since the blob produced by the background subtraction corresponded to less than approximately a quarter of that of an average adult.

3.1.10 Equations of the probabilistic ground plane


To run the system, an iterative problem has to be computed to obtain the probability of occupancy q_k of each location k. To estimate these q_k, a uniform small value is first assigned to them to compute the average synthetic images; after that, every q_k is re-estimated in an iterative process until a stable solution is reached. The main cost is the computation of the distance between the background-subtracted and the synthetic images, which has to be done for each location, each camera and every iteration, the number of iterations usually being of the order of 100. The system of equations needed to obtain the ground-plane probabilities is provided in [12] and given below:
\bar{A}^c = 1 - \bigotimes_k \left(1 - q_k A_k^c\right)    (9)

\left|\bar{A}^c_{k,\xi}\right| = \left|\bar{A}^c\right| + \frac{\xi - q_k}{1 - q_k} \left|(1 - \bar{A}^c) \otimes A_k^c\right|    (10)

\left|B^c \otimes \bar{A}^c_{k,\xi}\right| = \left|B^c \otimes \bar{A}^c\right| + \frac{\xi - q_k}{1 - q_k} \left|B^c \otimes (1 - \bar{A}^c) \otimes A_k^c\right|    (11)

\Psi\!\left(B^c, \bar{A}^c_{k,\xi}\right) = \frac{1}{\sigma} \, \frac{|B^c| - 2\left|B^c \otimes \bar{A}^c_{k,\xi}\right| + \left|\bar{A}^c_{k,\xi}\right|}{\left|\bar{A}^c_{k,\xi}\right|}    (12)

q_k \leftarrow \frac{1}{1 + \exp\!\left(\lambda_k + \sum_c \Psi\!\left(B^c, \bar{A}^c_{k,1}\right) - \Psi\!\left(B^c, \bar{A}^c_{k,0}\right)\right)}    (13)
c
where A is the computed average of the synthetic image with the current
estimates of qk s, qk is the marginal probability, Ack is the image composed of
ones inside a rectangle standing for the silhouette of an individual at location
c
k seen from camera c, and zeros elsewhere, Ak,ξ is the average synthetic im-
age, ξ is the value forced to be zero or one, Bc is the binary image generated
c
by the background subtraction for the camera c, Ψ(Bc , Ak,ξ ) is a normalized
pseudo-distance to account for the asymmetry between the background and the

21
synthetic images, σ is a fixed value set to 0.01 and λk is log( 1−ǫ
ǫk ) where ǫk is
k

the prior probability of presence at location i.
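The following is a minimal single-camera MATLAB sketch of the fixed-point iteration of equation (13) (with one camera, the sum over c reduces to a single term). All inputs here are hypothetical stand-ins, and the pseudo-distances are evaluated naively rather than through the incremental identities (9)-(11) that make the real algorithm efficient:

    % Naive single-camera POM fixed-point iteration, equation (13).
    % B, the rectangles A and all parameters are hypothetical stand-ins.
    G = 50;                               % number of ground-plane locations
    B = rand(240, 320) > 0.9;             % fake background-subtraction mask
    A = false(240, 320, G);               % A(:,:,k): silhouette rectangle at k
    for k = 1:G                           % random rectangles; real ones come
        r = randi(200);  c = randi(300);  % from the calibration homographies
        A(r:r+39, c:c+19, k) = true;
    end
    sigma  = 0.01;
    epsk   = 0.01;                        % assumed prior of presence
    lambda = log((1 - epsk) / epsk);
    q = epsk * ones(G, 1);                % uniform small initialization
    for it = 1:100                        % on the order of 100 iterations
        for k = 1:G
            psi = zeros(1, 2);
            for xi = 0:1                  % force q_k to 0, then to 1
                qk = q;  qk(k) = xi;
                Abar = 1 - prod(1 - reshape(qk, 1, 1, G) .* double(A), 3);
                d = sum(B(:)) - 2 * sum(B(:) .* Abar(:)) + sum(Abar(:));
                psi(xi + 1) = d / (sigma * sum(Abar(:)));
            end
            q(k) = 1 / (1 + exp(lambda + psi(2) - psi(1)));
        end
    end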

3.1.11 Problems addressed


The algorithm has a series of problems that must be mentioned. They are related to the quality of the background subtraction, the presence of many people covered by a single camera, and the proximity between them. The first issue rarely occurs if a sophisticated approach is used for extracting the blobs. The main weakness here is that spurious blobs may be produced by reflections or shadows in the scene, and these may interfere with the true location calculated by the detector, since they appear as moving silhouettes. [12]

Nevertheless, the real problem appears when many people are in the sequence, making detection difficult because the tolerance of the algorithm is exceeded. Some researchers have proposed another method, based on people detectors, that can cope with crowded environments.

3.2 K-Shortest Path Algorithm to track targets


3.2.1 Introduction
The K-Shortest Path (KSP) algorithm is the model that continues the operation of the system after the Probabilistic Occupancy Map. It is the algorithm in charge of tracking the targets previously detected by POM and, as may be deduced from the name, its task is to find the k paths that estimate the motion of the targets and optimize the trajectories at minimal cost for the system. We explain below some of the previous and related research leading to the KSP model, its mathematical aspects, and some of the tested scenarios used to observe parameters such as reliability and efficiency in real cases and to compare them with previous methods used for this problem.

3.2.2 Previous and related approaches


A considerable amount of work has focused on the recursive update of tracks using sophisticated object detectors to obtain the image inputs from which the trackers can start their work. The most generic algorithms are the Kalman filter and the mean-shift algorithm, which address multi-target tracking efficiently in real-time applications when the number of objects is small, but they do not work well in crowded environments.

The well-known particle filter can overcome some of these limitations, as can the Markov chain Monte Carlo model, which recovers trajectories over frames by computing the paths, or the Probability Hypothesis Density filter, which works well in noisy contexts [13]. However, many parameters have to be configured before doing the calculations, and the temporal window of these approaches has to be small, which implies a loss of generality and a problem in real applications.

To provide better outcomes, some researchers have been working on hybrid approaches that reduce some of the complexity of the system by dividing the work hierarchically and increasing the available observation time. Nevertheless, despite performing well in some cases, they do not guarantee reaching a global optimum, because of formulation issues.

Before the development of the current KSP, Dynamic Programming seemed to produce good results, linking multiple detections of several trajectories simultaneously over time. Problems emerged when it was applied in the real world, since the computational complexity can be very high in some cases. Reducing some of the cost brings other limitations, such as a tendency to mix trajectories over time, or wrong tracking when there is not enough information to analyze them.

Another approach, on which the KSP model is based, is Linear Programming. The purpose is to find the global optimum and solve the data association problem in multiple-people tracking. To this end, a network graph is built in which the nodes are connected to future and past observations, subject to computational optimization. The final result is then found using min-cost flow algorithms.

3.2.3 Main concept


Keeping in mind the previous work related to the implementation of the K-Shortest Path algorithm, we now give a general idea of the approach.

The final results produced by POM in each camera are taken as input by the KSP procedure. The only value needed by the tracker is the neighborhood size, which saves processor resources. The temporal inconsistency of the detection algorithm makes it impossible to identify a target across frames on its own. KSP provides the solution by linking detections across frames and providing continuity in time (the probabilistic occupancy map works as a discretization of time, in a discontinuous form, without producing flows). Thus, if an object is not detected in a frame but is detected in the previous and following ones, a correct trajectory is still produced, so mistakes due to background segmentation or occlusion are resolved within a single camera.

3.2.4 Advantages
In comparison with previous techniques, the K-shortest paths approach offers very fast optimization at run time, because it is formally and algorithmically simpler than the others. The resources dedicated to the task are therefore used efficiently to calculate the outcomes.

3.2.5 Assumptions
Using Linear Programming to formulate KSP only requires the costs for an a priori determined number of detected objects and a restriction to the most likely locations where an object may be found. As an improvement over other methods, this approach does not need an appearance model.

The only assumptions that have to be taken into account are that objects do not appear or vanish anywhere in the sequence (except at determined locations defined as the entrance v_source and the exit v_sink), that they do not move too fast across frames, and that they cannot share a particular place with one another (it is impossible to share a grid cell on the ground).

3.2.6 Formalization and equations


The physical area is discretized into G different locations and the time interval into T instants. For any location k, let N(k) \subset \{1, \dots, G\} denote the neighborhood of k, that is, the locations that an object located at k at time t can reach at time t + 1. Hence, the neighborhood is defined as the possible locations that a target might occupy in the next time step. Several conditions are now enforced to build the approach. These are:

The sum of flows arriving at any location j at time t is equal to a discrete variable m_j^t that labels the vertex, which in turn equals the sum of outgoing flows from location j at time t [13]:

\forall t, j \quad \underbrace{\sum_{i \,:\, j \in N(i)} f_{i,j}^{t-1}}_{\text{arriving at } j \text{ at } t} \;=\; m_j^t \;=\; \underbrace{\sum_{k \in N(j)} f_{j,k}^{t}}_{\text{leaving from } j \text{ at } t}    (14)

A location cannot be occupied by two or more targets at a time, as explained before, and the flows are defined as non-negative (15). An upper bound of 1 can therefore be set on the sum of all outgoing flows from a given location, imposing (16):

\forall k, j, t \quad f_{k,j}^{t} \geq 0    (15)

\forall k, t \quad \sum_{j \in N(k)} f_{k,j}^{t} \leq 1    (16)

The total number of humans detected may vary over time. It is therefore necessary to introduce two additional virtual nodes, v_source and v_sink, to represent the arrivals and departures of objects in the graph (they may be thought of as doors or the borders of the camera field of view, and they do not represent any real location on the physical map). Thus, all flows leaving v_source end in v_sink after some duration in time (17).
Figure 3: A complete flow system consisting of three positions and three time frames, assuming one possible entrance and exit point. Flows to and from the virtual positions are shown as dashed lines. Taken from [13].

\underbrace{\sum_{j \in N(v_{\mathrm{source}})} f_{v_{\mathrm{source}},j}^{t}}_{\text{leaving } v_{\mathrm{source}}} \;=\; \underbrace{\sum_{k \,:\, v_{\mathrm{sink}} \in N(k)} f_{k,v_{\mathrm{sink}}}^{t}}_{\text{arriving at } v_{\mathrm{sink}}}    (17)

Under the assumptions made to ensure compatibility with real cases, the formulation converges toward an integer solution in small areas over relatively short periods of time.

The task of the K-Shortest Paths algorithm is to find the k paths {p_1, ..., p_k} between the virtual entrance and departure nodes (v_source and v_sink) such that the total cost is minimal. The focus is on the particular instance where the graph is directed and the paths are node-disjoint (a location cannot be shared by two objects and flows cannot be created or suppressed anywhere) and node-simple (a path visits every node at most once in the directed acyclic graph, fig. 3). Thanks to the acyclic nature of the graph, the average complexity is almost linear in the number of nodes, which considerably speeds up the solvers.

To find the k optimal paths, a single shortest path in the graph is first computed together with its total cost. Then, an iterative procedure (the disjoint-path algorithm) is applied to find the remaining trajectories. The optimal number k is reached when the cost of iteration k + 1 is higher than that of the previous one.
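As an illustration, a minimal MATLAB sketch of this iterative search is given below. It is not the authors' OpenCV implementation: the flow graph is assumed to be available as a digraph G with named nodes, edge costs in G.Edges.Weight, virtual nodes named 'vsource' and 'vsink', and no direct edge between them. For simplicity, node-disjointness is enforced by deleting the interior nodes of each accepted path instead of performing the cost transformation of the full KSP algorithm.

function paths = kspSketch(G)
% Iteratively extract node-disjoint shortest paths from the DAG until the
% next path would cost more than the previous one (the stopping rule above).
    paths = {};
    prevCost = Inf;
    while true
        % 'acyclic' handles negative edge costs on a directed acyclic graph
        [p, cost] = shortestpath(G, 'vsource', 'vsink', 'Method', 'acyclic');
        if isempty(p) || (~isempty(paths) && cost > prevCost)
            break;   % optimal k reached: the (k+1)-th path is more expensive
        end
        paths{end+1} = p; %#ok<AGROW>
        prevCost = cost;
        % Remove the interior nodes of the accepted path so that no later
        % path can reuse them (the virtual nodes are kept).
        G = rmnode(G, p(2:end-1));
    end
end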

3.2.7 Complexity reduction


To be able to support a real-time application such as broadcasting, the sequence is split into batches of 100 frames, which translates into a 4-second delay between the inputs and the outputs. The batches overlap, so that the last frame already processed is combined with the current batch.

To reduce the number of calculations, only the closest neighboring nodes are taken into account for each node at a given time step. To implement this, a fixed threshold parameter is used, as shown in (18). If the value associated with reaching a point does not exceed this number, the location is considered unreachable and the corresponding flows are removed.

max_{ ||j−k|| < τ_1,  t−τ_2 < u < t+τ_2 }  ρ_j^u        (18)
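A minimal MATLAB sketch of this pruning is given below; the variable names (gridPos for the G location coordinates, rho for the G × T detection scores, prune for the fixed threshold) are illustrative assumptions, not the authors' code.

reachable = false(G, 1);
for k = 1:G
    % locations j within the spatial window of k: ||j - k|| < tau1
    nearby = find(vecnorm(gridPos - gridPos(k,:), 2, 2) < tau1);
    % time instants u within the temporal window: t - tau2 < u < t + tau2
    win = max(t-tau2+1, 1) : min(t+tau2-1, T);
    % keep the flows into k only if the best nearby score beats the threshold
    reachable(k) = max(rho(nearby, win), [], 'all') > prune;
end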

3.3 Previous Testbed and results


Some results are presented in the following paragraphs to compare the reliability and efficiency of KSP with the previous state of the art. The data set consists of a series of multi-camera video sequences of pedestrians, shot at shoulder level in various environments. Some of them are described below:

A crowded outdoor terrace with nine people appearing and walking in front of the cameras, to test the ability of the algorithm to cope with a crowded environment.

An indoor basketball court with ten players moving fast and many occlusions.

A dark underground passageway, which presents challenges due to poor lighting conditions and the large area covered by the system (only one or two cameras), making people very small in the scene and detection and motion estimation really difficult.

A scenario with ten people, significant lighting changes and precision issues in the camera calibration, with the sequence acquired at only seven frames per second.

A ball sequence captured from a single point of view is also tested. Detection is difficult here because the colors and appearances of the balls are quite similar, and the balls generally move much faster than people.

The final results show that KSP improves on previous methods in all the scenarios, especially in comparison with Dynamic Programming (DP) and with detection-only results from the POM algorithm.

The pedestrian tracking results show that the precision of POM is in general quite good, and the improvement brought by the trackers does not change the quality much, since people do not move fast. Even so, the efficiency of KSP is always higher, especially in those cases where illumination issues make detection difficult.

A test using a single camera to obtain the inputs was also performed; it shows the superiority of K-Shortest Paths over Dynamic Programming, which often places people outside the discretized ground area instead of trying to correct the noisy detections.

For the ball sequence, the color model used by DP was useless, since all the balls share a similar color, which puts DP at a disadvantage with respect to KSP [13].

Run-time measurements show that KSP is much faster than basic Linear Programming, by a factor of 100-1000, and about 10 times faster than other sophisticated approaches such as DP.

4 Problem formulation
The goal of the thesis is to track an unknown number of individuals N_ind that appear in a video sequence, recorded at head level from different angles of an unknown scenario. The issues to deal with are the overlapping fields of view produced by the cameras, the targets occluded in the scene of a single camera, crowded environments, the proximity of one person to another, and problems related to luminosity, color, etc.

The problem is formulated as computing probabilities on a 'virtual' ground plane VGP based on a scan of the acquired images Bg^c. The images are taken at each time step, known as the temporal frame f, from a set of calibrated videos VR^c that observe the same plane from different angles. The probabilities are set after several image-processing operations. A scheme is shown in figure 4. The following sections address the whole formulation in depth. Tables 1 and 2 show the notation for the main variables.

Figure 4: Scheme with the main steps of the algorithm implemented.

Table 1: Notations (Deterministic Quantities)

N_ind   Maximum number of individuals allowed in a given video sequence
c       Number of cameras for the particular scenario (also used as the camera index)
f       Frame taken to be processed in the temporal frame
vs      Velocity of each target that cannot be surpassed
G       Total number of discrete locations of the ground plane
t       Time step of the sequence (omitted where possible for clearer notation)
T       Duration of the whole sequence in discretized time
k       Index of a location in the discretized virtual ground plane
N_c     Central node where all the data are gathered to be fused and processed in order to obtain the final results
w*h     Resolution of an image, where w is the width and h the height
n       Number of nodes where the operations are performed, one per camera
q_k     Ground-plane probabilities at the central node, where all per-view probabilities are fused and combined
it      Number of iterations needed to converge to a fixed point and obtain the ground probabilities
thrs    Threshold set to a value of approximately 1200-1700
q_k^c   Ground-plane probabilities computed in each camera for a given node

Table 2: Notations (Random Quantities)

VR^c    Calibrated video sequence of each camera
I^c     Images acquired from a video, i.e., images taken at each time frame
VGP     Virtual discretized ground plane where the probabilities are set
Bg^c    Images acquired from a sequence after applying background subtraction
TP      Top-view image created to carry out the computations of the central node
X^c     Image with the probabilities of detections computed on the ground
R       Fused, filtered information stored in the central node
F^c     Image with the final results sent back to each view
Sr_k    Synthetic rectangle that may appear at each location k
A_k^c   Synthetic image where ones represent the standing silhouette of an individual at location k seen by camera c, and zeros elsewhere

4.1 Simplification of the samples
To save time and make the computation faster, the number of image frames taken from each camera c is reduced. If the whole sequence consists of 3000 frames, for example, we do not take all of them but one frame out of every 25. This yields 120 frames, which greatly improves the speed. Tests showed that a larger step between image frames gave worse results, as the individuals exhibited large displacements between consecutive samples. Therefore, the velocity of the moving 'objects' vs has to be taken into account to acquire solid background/foreground data.
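As an illustration, a minimal MATLAB sketch of this subsampling could look as follows; the file name and the step of 25 are illustrative, not fixed by the method.

vr = VideoReader('camera0.avi');   % hypothetical sequence file
step = 25;                         % keep one frame out of every 25
frames = {};
idx = 0;
while hasFrame(vr)
    f = readFrame(vr);
    if mod(idx, step) == 0
        frames{end+1} = f; %#ok<AGROW>   % retained sample
    end
    idx = idx + 1;
end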

4.2 Data exchange in camera network


The ground plane on the top view TP is divided into a number of locations G, i.e., discretized, and triangulated to each camera view (figure 5a).

Certain locations may be hidden in a particular view, which means that someone can appear or disappear at a given moment of the sequence in a single view. Nevertheless, once the fused data are applied, this issue is solved satisfactorily and the mark of the target is visible in the top view. Hence, the images Bg_t = (Bg_t^1, ..., Bg_t^c) acquired at time t, for 1 ≤ t ≤ T, are stored, and the task is to find the detection probabilities X_t^c for each location k of the discretized ground (1, ..., G), as expressed in equation (19).

P(X_1, ..., X_T | Bg_1, ..., Bg_T)        (19)

Thus, each view works independently in its node n and, after scanning all G locations for each camera c, the information is sent to a central processing node N_c after coordinate-translation calculations. There, all the information X_t^c acquired at the same time is added, and the spurious data caused by incorrect detections are filtered out as much as possible, yielding R_t. The redundant information is in fact the most valuable, since we work on overlapping planes. The process is shown in figures 5c, 5d and 5e.

Once all the results have been computed in N_c, the triangulation is performed again, and the original time-frame f images are combined with the final results F_t^c and displayed on each view. Figure 5f illustrates this step.

Some constraints should be explained. One of them is that a given location k cannot be occupied by more than one individual; hence, the proximity of two or more detected targets may lead to incorrect results by connecting the trajectories of different people. In contrast to the algorithm described by the authors of [13], there are no 'virtual' locations representing the entrances and departures of the bodies, because the restricted view of a particular camera c is compensated by the others.

Figure 5: Scheme of the exchange-of-information operations. (a) Ground from top view to camera views. (b) Probability estimation on each camera. (c) Top view of the background. (d) Probabilities on the top view. (e) Filtering calculations. (f) Final results on each camera.

There are other considerations: the evidence for a person is stronger, and the person is detected better, if the blobs of the target appear in every view, since redundant information is treated as valuable data.

The last constraint is that the probabilities are computed independently at each time step, so there is no link between the probability for a given location k at t − 1 and the corresponding probability at t.

5 Implementation
Regarding the functionality of the system explained above, the approach is composed of the following parts.

First, the input video sequences from the different cameras are required, 4 nodes n in our case, which share overlapping fields of view. These sequences were obtained from the web page provided by the authors of [12] (http://cvlab.epfl.ch/data/pom).

The next step is to extract the frames of each sequence, yielding a series of images I^c for each camera c, with width w and height h depending on the resolution of the MATLAB plot function. To detect the N_ind individuals, a background/foreground algorithm has to be applied and re-updated at each time frame f, and image-processing operations are then carried out to obtain robust results.

Once all the prior information and the background-subtracted frames are available, the data have to be fused in a central node N_c to build a 'virtual' top view TP (a hypothetical view produced by applying homographies to the camera views, simulating a fixed camera monitoring the floor from a high position), where the individuals occupy positions k of a discretized ground. The resolution of this particular view is provided on the same web page.

After the calculations carried out in N_c, the information is sent back to the n nodes and synthetic rectangles Sr_k are drawn at the locations k where the bodies are supposed to be standing. This method makes it possible to handle the occlusions that are often present in crowded environments.

The final algorithm we have developed differs in some respects from the one the authors describe. In the first place, the authors define a common ground plane with different calibrated locations k. Therefore, although each target appears at different w and h coordinates in each camera view c, the occupancy on the ground is accurately detected before fusing the data, so the problems of triangulation and conversion from one view to another are avoided. They create a 'virtual' ground where the central node N_c is assumed to be placed. There, all the prior data can be observed from the TP, the tracks of the targets are computed, and the information is sent back to the n nodes.

5.1 The Background Subtraction


There are different approaches to building a strong background algorithm. To fit the requirements, several of them were evaluated: eigenbackgrounds, frame difference (low complexity), approximate median (medium complexity) and Mixture of Gaussians, MoG (high complexity) [14]. After some tests, the eigenbackgrounds and MoG showed run times that were too high (around 2 to 4 seconds more than the others) for the task of extracting the foreground over many frames. The remaining methods performed well in all cases; hence, the low-complexity model was chosen to improve the speed of the program, since the results were very similar.

Figure 6: Original frame of an image and the corresponding background subtraction (frame difference algorithm).

For the equations below, the first frame is set as the background Bgrd. Therefore, the sequence has to start by recording the plane under study without any targets on it. If the video starts with someone in the scene, the error propagates into incorrect detections, but this can be changed quickly in the code.

The background Bgrd and the current frame frm have to be converted to grayscale in order to apply the main equation, the difference image: taking the absolute difference between the gray frame and the gray background gives the desired result, equation (20).

diff = |grayfrm − grayBgrd|        (20)

To create the final binary image, the pixels have to be evaluated and the
values thresholded.

Finally, some image processing is applied: a median filter to obtain more accurate blobs, removal of small components below a threshold size, and hole filling to obtain sharper pictures. An example of the result is shown in figure 6.
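The following MATLAB sketch summarizes this pipeline under assumed variable names; the threshold values are illustrative and would need tuning for a particular sequence.

grayBgrd = rgb2gray(Bgrd);                            % background = first frame
grayFrm  = rgb2gray(frm);                             % current frame
diffImg  = abs(double(grayFrm) - double(grayBgrd));   % difference image, eq. (20)
bw = diffImg > 30;                                    % threshold to a binary image
bw = medfilt2(bw, [3 3]);                             % median filter for cleaner blobs
bw = bwareaopen(bw, 50);                              % remove small spurious parts
bw = imfill(bw, 'holes');                             % fill holes for sharper silhouettes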

5.2 Detecting Process


To run the detector algorithm, the frames of each individual camera have to be stored in folders on the computer. We take the different frames frm, after applying background subtraction, at the same time instant, in order to synchronize all the available data. To do this, the numbers of the first and last frame to be evaluated must be specified, as well as the step between consecutive Bg^c images. Every binary picture is resampled to a common resolution (a predetermined w and h, specifically 360x288 pixels) to carry out the operations.

In the first step, we set up all the data we are going to need, defining important parameters such as the ground probabilities and the discretized locations k of the ground as seen in the TP.

Secondly, a function to translate ground points from a camera view to the top view and vice versa, i.e., a triangulation operation, is needed to calibrate all the information obtained previously. Some points would have to be picked manually if the corresponding equations were not available. Fortunately, the web page of the authors of [12] provides matrices with the values of the ground-plane points needed to compute the translation properly. Therefore, the function that applies the homography transformation to an image with bilinear interpolation can be used satisfactorily after adjusting some parameters. The only disadvantage is that the resolution of the image changes each time the function imagehomog is applied; hence, a reconstruction of the samples has to be performed afterwards to obtain calibrated results.

The 'virtual' discretized ground plane VGP, once created, has to be inserted into the camera views c, and a superposition with the Bg^c images is performed to observe at which location k the people in the scene are standing. Once this is done, each pixel of the four images is scanned at the given resolution to create the synthetic images A_k^c, in which the ones inside a rectangle represent the standing silhouette of an individual at location k seen from camera c, and zeros represent empty space. To speed up the process, only the Bg^c pixels that overlap the ground pixels of a given location k are considered.
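A minimal sketch of building one synthetic image A_k^c in MATLAB is shown below; the rectangle geometry (rectW, rectH) and the foot point (footR, footC) are assumed to come from the homography of location k into camera c, and the names are illustrative.

Ak = zeros(h, w);                        % w, h: the common 360x288 resolution
r1 = max(1, footR - rectH + 1);          % top edge of the silhouette rectangle
r2 = footR;                              % bottom edge sits on the ground point
c1 = max(1, footC - floor(rectW/2));     % left edge, centered on the feet
c2 = min(w, footC + floor(rectW/2));     % right edge, clipped to the image
Ak(r1:r2, c1:c2) = 1;                    % ones mark the expected body area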

After processing all the data needed for the initial values, the update of equations (9) to (13) has to be carried out sequentially for every camera c, location k and convergence iteration it, until a stable fixed point is reached, i.e., until the values of q_k converge. Unfortunately, after extensive investigation, several coding approaches and discussions with experts on the matter, the conclusion was that these equations are impractical to implement in MATLAB, since they were designed for OpenCV. The run time of such an implementation is computationally too expensive, and hence useless for research purposes. The dominance of the iterative character of the problem over the pixel scanning was the main reason the original authors chose C++ over the MATLAB we use, since the latter is more efficient at tasks other than multiple-target tracking in multiple views.

Figure 7: Synthetic images of each camera c for the different locations k, standing for the supposed presence of an individual.

Therefore, some changes have been applied to the original algorithm in order to obtain good results. After the creation of A_k^c, an equation is implemented to get a method as similar as possible to the one originally described. This equation is (21) and only involves the distance between the region of the synthetic image A_k^c where a rectangle is supposed to be located (ξ equal to 1) and the corresponding Bg^c image for camera c and location k (the foreground pixels of the individuals are set to zero and the empty space to one).
|Ψ(Bg_k^c, A_{k,1}^c)| = |Bg_k^c − A_{k,1}^c|        (21)

A tuned threshold value is set; if the summation of the pixels in |Ψ(Bg_k^c, A_{k,1}^c)| surpasses this value, a blue rectangle is drawn and the ground probability for that view, q_k^c, is set to one (eq. 22). Figures 7 and 8 illustrate the results.

Σ |Ψ(Bg_k^c, A_{k,1}^c)| > thrs        (22)
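In MATLAB, this per-location test reduces to a few lines, sketched here under assumed variable names (Bgk for the binary background image restricted to location k, Ak for the synthetic image, thrs for the tuned threshold of roughly 1200-1700):

score = sum(abs(double(Bgk(:)) - Ak(:)));   % pixel-wise distance, eq. (21)
if score > thrs
    qkc(k) = 1;     % location k is considered occupied in view c, eq. (22)
    % ...a blue rectangle is then drawn at location k on the displayed frame
end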

For each frame step f, once the probabilities q_k^c have been set, the homographies are applied again, both to q_k^c to obtain the combined probabilities in the TP view and to the Bg^c images.

Figure 8: Detected targets after background subtraction, with blue rectangles drawn on each camera c.

The fused Bg^c images yield a gray-intensity map in which the feet correspond to the darker values on the ground, thereby initializing the ground probabilities q_k to particular values, as can be seen in figure 9.

The fused q_k^c give more accurate targets in the TP; the individuals detected in this view are drawn as squares to cover more space. An example is shown in figure 9.

Finally, the addition of both yields the exact location of each individual seen from the top.

The final q_k is the output of the whole detection process.
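The fusion itself is a simple accumulation, sketched below in MATLAB under the assumption that each per-view map has already been warped to the top view (qkcTop{c}); the normalization is illustrative.

qk = zeros(G, 1);
for c = 1:numCams
    qk = qk + qkcTop{c};   % add the warped probabilities of view c
end
qk = qk / numCams;         % fused ground probability, favoring redundant views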

The ideas of this new approach were drawn from previous works such as [16], [18] and [19]. The core concept of representing targets by the location of their feet on the ground plane can also be found in the projects [15], [17] and [21], where data are fused to obtain weights that represent the likelihood of a detection.

5.3 Tracking implementation


As the previous algorithm differs from the Probabilistic Occupancy Map method explained by the authors of [12], the K-Shortest Paths algorithm cannot be applied to the whole implemented system either. Since the accuracy of the detected targets is quite close to the actual positions of the bodies, and since KSP is only an improvement over the continuous sequence, not being able to adapt it is not a major inconvenience; like POM, it was designed with OpenCV.

Figure 9: Homographies applied to the images after background subtraction and to the probabilities of each view, drawn as squares.

Therefore, we proceed to explain how we solved the last step of the system, which takes the probabilities q_k as input.

The only inputs needed are the frame images stored on the computer, which are combined with the tracking blobs that appear at the feet location of each person once the triangulation from the top view has been applied. The final results can be seen in figure 10.

5.4 Changes made to the original approach


As explained above, some changes were made with respect to the algorithm implemented by the researchers of [12] [13]. The main problem was re-updating the ground probabilities in MATLAB, since the time to process the four image sequences was excessively high. From this point on, the whole algorithm therefore works differently, while sharing as many characteristics as possible with the original. The K-Shortest Paths algorithm could not be reproduced either, since the Probabilistic Occupancy Map shown here has different features, and therefore the tracking algorithm receives different inputs; moreover, the original system was designed to run with OpenCV. Hence, some changes had to be applied there too.

Figure 10: Final results of the image sequences with the corresponding white tracking blobs.

Regarding the data exchange between the nodes n and the central node N_c, the researchers' implementation is simpler than ours, as illustrated in figure 11, since the probabilities on the top view and on the camera views are exactly the same. Thus, there is no need to combine the data in the main node, because the information is calibrated by construction. The only data that have to be sent are those from N_c back to the nodes, in order to handle occlusions, missed targets and the tracking process.

The main reason their code performs better than ours is that MATLAB works well for rapidly trying out simple ideas, plotting, linear equations or sequential access to pixels. But when it comes to iterative operations over non-linear regions with complex structures, other languages are more adequate. C++, for instance, is flexible and allows arbitrary algorithm development with more logical and refined structures, which makes it possible to implement almost anything one can imagine. The use of OpenCV, moreover, provides libraries that make coding the algorithm faster.

To summarize, MATLAB's biggest advantage is that simple equations can be easily written and run. In return, the tool imposes a rather linear way of structuring computations, which may lead to undesired results; in our case, MATLAB is the root of the constraints on how the proposed problems could be solved.

Figure 11: Graphics of the information exchange in the algorithm of the authors. (a) Ground from top view to camera views. (b) Probability estimation for the calibrated locations. (c) Calculation of the tracks. (d) Final results on all the views.

6 Results
The results obtained with the implementation are presented in this section; the main algorithm to analyze is the detection of people in the scene. First, the video sequences used for the testbed are described, as well as the first method implemented, the background subtraction. Then, the results of the detection algorithm are presented with an evaluation of the conditions under which the approach works reasonably well and those under which it does not. Finally, the accuracy of the tracking is assessed after the process of exchanging information between the nodes n and the main node N_c, together with how this affects the whole computation.

6.1 Simulation samples


In order to test the implemented code, a multi-camera surveillance video trace is used to evaluate the approach. The hardware of the system consists of the 4 camera nodes and the central node N_c, defined as a 'virtual' node where the main computations are performed to fuse the outputs of each view, thereby simplifying the calculations.

The inputs can be downloaded from the web page: http://cvlab.epfl.ch/data/pom.

Specifically, we use the Laboratory sequences, where, depending on the videos chosen, between 0 and 4 or between 0 and 6 people appear in the indoor scene. The samples were shot inside a laboratory by 4 cameras placed in the corners of a 50 m² room at shoulder level. More precisely, cameras 0 and 1 are located at a height of 2.3 meters and cameras 2 and 3 at 1.8 meters above the ground plane. The frame rate for both test excerpts is 25 fps, the videos are encoded with the MPEG-4 codec, and the resolution is 360x288, the first parameter being the width and the second the height.

The sequence with a maximum of 4 people sequentially entering the room lasts 2 minutes and 36 seconds and comprises 3915 frames. Of this input, only an excerpt of 156 frames is evaluated: the sequence is subsampled by taking one frame out of every 25.

On the other hand, the sequence with a maximum of 6 people entering the room lasts 1 minute and 58 seconds and comprises 2955 frames; the total number of frames taken into account is therefore 118. Reducing the amount of data leads to faster computation, and the system shows similar final results, since the individuals do not move very fast in the room.

These cases pose a more complex reconstruction problem than usually occurs in real-life situations, since people tend to occlude each other and the cameras are closer to the area of interest than in other environments, which makes the detection task more challenging [12].

The region of space under study covers approximately 30 m², and the area is discretized into G = 17 × 17 = 289 locations, corresponding to a regular grid with a resolution of 32.3 cm (17 × 0.323 m ≈ 5.5 m per side, i.e., roughly 30 m²). This resolution is less precise than the 20 cm chosen by the researchers in [12], since the implemented algorithm has some differences and the computation with more location points was slower to carry out.

6.2 Background dependency


For the generation of the black blobs that represent the silhouettes of people in the room, the low-complexity frame difference has been applied, as explained in section 5.1. The dependence of the Probabilistic Occupancy Map algorithm on the blob extraction is very strong, and the same holds for the proposed code: a wrong or low-quality background extraction leads to incorrect detections. Therefore, using a good background approach is a key point for improving the system. The main problem is that, although there are many models for blob extraction, there are also many scenarios with too many variables, such as the distance of the cameras to the recorded scene, illumination changes, similar clothing colors, the proximity between targets and the occlusions in the views. Hence, a suitable method must be chosen for each particular environment, which makes it hard to develop a system that can be used for general purposes.

In general, the background algorithm performs well in most cases, except for some spurious blobs that may occasionally appear. One of the worst results can be seen in frame 826 for camera 2 (figure 12) in the laboratory sequence with 4 people entering the room, where it causes many false detections. Fortunately, after some computations they are filtered out and almost totally corrected.

6.3 Cameras influence


Some of the variables to consider are the number of cameras used, their placement, the angle of each one and the height at which they are mounted, which yield better or worse results depending on the scanned zone. Since the algorithm is designed to give more weight to points with redundant information at the central node N_c, the results improve when the cameras have overlapping fields of view. It is also important that the feet of the targets are recorded; hence, the inclination of the cameras must give a good perspective of the ground plane. A minimum of 3 cameras is needed to achieve the outcomes, and they have to be calibrated so that the data can be fused properly. Increasing this number gives better outputs, with the improvement roughly following a logarithmic curve. However, a single uncalibrated camera, or one that records an area that is too difficult to film from its particular viewpoint, may produce wrong results. Thus, the optimum lies in using between 3 and 5 calibrated cameras covering the area from different angles without occlusions.

Figure 12: Wrong background subtraction and detection in frame 826 for camera 2.

6.4 Evaluation of the detection algorithm


Observing the pictures plotted in MATLAB for the outputs of the detection algorithm on the laboratory sequence, several conclusions can be drawn.

First of all, the algorithm operates on a frame-by-frame basis, so there is no temporal consistency and it therefore does not maintain the identities of the targets across frames.

As explained in section 6.2, the detection algorithm is correlated with the accuracy of the background subtraction. The distance between the cameras and the area of interest, the height of the targets, the proximity between people, etc., play an important role in the detection.

For the particular sequences tested, obtaining good results in a small crowded room is more difficult than with the Probabilistic Occupancy Map. The more people enter the scene, the more challenging the procedure to track the individuals becomes. However, in an outdoor environment where the cameras are located farther away (but not so far that people are hardly visible) and a wider area is recorded, the task would be easier.

6.5 Synthetic Rectangles influence


It is shown that the algorithm remains robust with silhouettes as small as 70 pixels in height, which corresponds to a distance of 15 meters with a standard focal length. The width of the rectangles has been set to 60 pixels, which corresponds approximately to a real distance between 0.5 and 0.9 meters depending on the depth (perspective must be taken into account), and their height is variable, depending on where the blobs of the feet overlap with the blobs of the discretized ground in each view. Unfortunately, the variable height may produce overly large rectangles when the background subtraction goes wrong. Nevertheless, in the final result we observe that this imprecision is successfully corrected. Besides, the cameras, as explained, are located at a maximum height of 2.3 meters, so the wrong outcomes are insignificant.

6.6 Tracks Evaluation


The white tracking blobs that represent the triangulated feet of the individuals are, in general, quite accurate in the early stages, as in the K-Shortest Paths results. However, as the sequence progresses, the background implementation starts to degrade when more people appear in the scene, producing fused blobs for two different targets and thus yielding worse results than the algorithm in [13].

In most cases the occlusions are solved successfully. A calibration of the cameras to compute the tracks has been carried out, camera number 2 being really hard to correct; hence, some imprecise results appear when someone enters the room for the first time.

As can be seen in the figures below for an environment with 6 people entering the room, the method is weaker than the one implemented by the researchers of [12] because, as explained, their model was designed to be developed in OpenCV with recursive equations. As our task was to implement it in MATLAB, we applied the concepts of this tool and scanned the pixels several times to get the best results for the scenes tested, since the MATLAB implementations we found performed worse for multiple cameras with multiple targets. To improve the method, which shares some similarities with [15] and [21], the white tracking blobs could be treated as probability weights that predict the location of an individual on the ground plane. A particular color or a number could be assigned to each person in order to keep the identities across frames. We do not consider the rectangles useful for this tracking algorithm once the detection is done.

Figure 13: Good final results for a non-crowded environment, with an inaccuracy of less than 20-30 cm. (a) Tracks with one target. (b) Tracks with three targets, handling occlusions.

Figure 14: Good final results for a crowded environment, with an inaccuracy of less than 20-30 cm. (a) Tracks with four targets. (b) Tracks with six targets.

Figure 15: Inaccurate final results with missed detections. (a) Inaccurate track for one of the targets. (b) Missed detections in some views and inaccuracy of more than 30 cm.

Figure 16: Wrong final results in a dense environment, with fused blobs, false and missed detections. (a) Wrong detections due to a crowded environment. (b) Wrong detections due to bad background subtraction.

6.7 Data Exchange
One of the most important points to observe in the thesis is the exchange of information between the camera nodes c and the central node N_c, and how it affects the whole process. To translate coordinates from the particular views to the top view and vice versa (i.e., a triangulation operation), the imagehomog function has been used (freely provided by Mike Brookes, with some modifications). The inputs needed are an image in grayscale or RGB and a matrix with the coordinates of the points to take into account. For our purposes, we need the points of the ground plane, whose values are freely available in the database and are presented in equations (23), (24), (25) and (26). The calibration thus consists of one homography per camera view, which projects the ground plane; the head plane has not been considered. Equation (27) shows how the translation is computed.

 
H_0 = [  0.176138   0.647589  −63.412272
        −0.180912   0.622446   −0.125533
        −0.000002   0.001756    0.102316 ]    Ground plane for camera 0    (23)

H_1 = [  0.177291   0.004724   31.224545
         0.169895   0.661935  −79.781865
        −0.000028   0.001888    0.054634 ]    Ground plane for camera 1    (24)

H_2 = [ −0.104843   0.099275   50.734500
         0.107082   0.102216    7.822562
        −0.000054   0.001922   −0.068053 ]    Ground plane for camera 2    (25)

H_3 = [ −0.142865   0.553150  −17.395045
        −0.125726   0.039770   75.937144
        −0.000011   0.001780    0.015675 ]    Ground plane for camera 3    (26)

H · X_image = X_topview        (27)
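As an illustration, the translation in (27) can be computed in MATLAB as follows, using the homography of camera 0 from (23) in homogeneous coordinates; the pixel coordinates are illustrative.

H0 = [ 0.176138  0.647589 -63.412272;
      -0.180912  0.622446  -0.125533;
      -0.000002  0.001756   0.102316];
xImage = [120; 200; 1];      % illustrative pixel (u; v; 1) in camera 0
xTop   = H0 * xImage;        % projective top-view coordinates
xTop   = xTop / xTop(3);     % normalize so the third component is 1
% The inverse mapping, top view -> camera view, uses inv(H0), as described
% in the text for fitting the ground plane into each view.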

To create the ground plane, we generate a virtual image with dimensions 358x360 (width × height) in the top view, and the triangulation has to be done for each camera c. As this process is the opposite of the one described in eq. (27), we use the inverses of the matrices given in eqs. (23) to (26) to fit the ground into each view, which must be calibrated before being added to the background images. The triangulation increases the size of the ground-plane image, which is calibrated later; the exchange of data is therefore not affected by a loss of quality. The ground is illustrated in figure 17.

Figure 17: Ground plane for the top view (left) and the camera views.

To get the final results, the translation of coordinates from the cameras has to be done n times, for the Bg^c images and the q_k^c probabilities. Once the coordinates are expressed in top-view values, the results can be shown from this perspective. However, to display the corrected q_k^c probabilities from the different angles, another triangulation projection has to be carried out.

After applying the imagehomog function, first to project the coordinates to the TP and then to triangulate the probabilities, the samples are poorly calibrated and the image resolution differs from the original, so considerable work is needed to obtain the correct outputs. This explains why some of the results in figures 15 and 16 are worse than those of the K-Shortest Paths algorithm developed in C++.

The approach of [13] only needs communication between N_c and n for the creation of the ground plane, and each location k is the same in all the views, merely seen from different perspectives. For that reason, the data do not undergo as many changes as in the implemented algorithm; hence, their results are more consistent and robust.

6.8 Computational complexity


The computational cost has been evaluated in order to measure the delay between the inputs and the outputs of the detection and tracking algorithms. Once the original images I^c and the background images Bg^c have been stored in folders on the computer, the detection algorithm is tested.

In the early stages, when there are no people in the scene, the algorithm is faster; specifically, the delay between inputs and outputs is approximately 1.5 seconds. This speed comes from the processing time saved when nobody is standing in the area of interest, as there are fewer heavy calculations to perform.

Later, when targets start to move around, the program is slower, taking between 3 and 4 seconds, as can be seen in figures 18 and 19. Here, the time consumed is dominated by the homography operations that translate the coordinates to the top view. The iterations over each ground location and the drawing of the synthetic rectangles upon a detection also contribute, but these are lighter computational activities. The complexity is almost independent of the number of individuals, since the only variation is the addition of one more rectangle to the image, a time difference on the order of milliseconds or microseconds. The speed is even better than that of the POM algorithm, which takes about 4 seconds to produce its outcomes.

The tracking needs all the detection probabilities for each location and time frame f, and must wait for the whole sequence to be processed before starting its calculations. This amounts to a total of 8 minutes for the 4-people sequence and 6 minutes and 5 seconds for the 6-people laboratory sequence. Producing the tracking outputs then takes between 2 and 3 seconds for each time frame f.

Figure 18: CDF of the time to process all the calculations for each frame in the sequence with 0 to 6 people entering the room.

Figure 19: CDF of the time to process all the operations for each frame in the sequence with 0 to 4 people entering the room.

7 Conclusion and Future Works
7.1 Conclusion
This thesis has focused on the problem of detecting and tracking multiple targets in an environment observed by a set of cameras that record the same area from different angles, distances and heights, as well as on the communication between the nodes needed to perform the required calculations and achieve the best results. The objective has been to develop a robust algorithm that assigns a probability to a person standing at a given ground location and is able to follow people along the image sequences.

First of all, each sequence recorded from a given viewpoint is taken and divided into frames, and all the sequences are processed together to minimize the processing time. An algorithm is needed to separate the moving targets from the static parts. Once the cameras can detect the presence of an individual, the results are sent to and combined in a central node, which carries out the larger operations with high computational cost. The fused and filtered results are sent back to every node, and the final results are shown simultaneously for all the views.

The results show a decrease in yield with respect to the algorithm implemented by the authors in C++, because MATLAB is more graphically oriented and recursive equations are hard to implement efficiently in it. It has been concluded that the method depends heavily on the background algorithm; an improvement in this area would therefore increase the detection and tracking rates significantly. However, the time needed to develop such an algorithm could be considerable, and the workload exchanged between the nodes would be heavier to process.

7.2 Future work


There are already works that develop systems to improve the results, make the calculations faster and simpler, or operate in scenarios where challenging conditions make the process difficult to implement. Some of this new research can be found in [20] and [16].

The first shows better results in terms of precision and accuracy. Compared with POM, the candidate positions on the ground are significantly reduced and the detection process is faster, at the expense of a higher run time caused by the complicated dependency structure between locations. Moreover, identity switches still occur.

The second, carried out by the same researchers as POM and KSP, proposes a model to compute the probabilities of the presence of multiple objects on the ground from a single depth map, without requiring the large amounts of data that modern RGB-D sensors provide. Less bandwidth is therefore required, and the operations compute the results faster.

There are also other directions in which detection and tracking with multiple cameras could evolve.

The vast majority of systems are designed to work in real time, and in the near future the cameras could move from static locations to a framework of dynamic smart nodes.

The identification of specific people, in order to track a single individual through facial recognition or movement behaviour, could be a worthwhile task. The main problem is that, at the moment, a database with pictures of all the individuals entering the scene is needed.

Even more interesting could be the classification of objects in a particular area according to their shapes and shared similarities. This would make it possible to follow targets as different as cars, people, animals, or moving objects with distinct patterns within the same video sequence.

References
[1] B. Rinner, and W. Wolf: “An Introduction to Distributed Smart Cameras”
Proceedings of the IEEE, vol. 96, no. 10, pp. 1565-1575, Oct. 2008
[2] A. C. Sankaranarayanan, A. Veeraraghavan, and R. Chellappa: “Object De-
tection, Tracking and Recognition for Multiple Smart Cameras” Proceedings
of the IEEE, vol. 96, no. 10, pp. 1606-1624, Oct. 2008
[3] M. E. Liggins II, C. Y. Chong, I. Kadar, M.G. Alford, V. Vannicola, and S.
Thomopoulos: “Distributed Fusion Architectures and Algorithms for Target
Tracking” Proceedings of the IEEE, vol. 85, no. 1, pp. 95-107, Jan. 1997
[4] C. Y. Chong: “Distributed Architectures for Data Fusion” International
Conference on Multisource-Multisensor Information Fusion, pp. 1-8, Feb.
1998
[5] K. C. Chang, C. Y. Chong, I. Kadar, and S. Mori: “On Scalable Distributed
Sensor Fusion” The 11th International Conference on Information Fusion,
pp. 1-8, Jul. 2008
[6] Mohammad Shafiee, Zohreh Azimifar, and Alexander Wong: “A Deep-
structured Conditional Random Field Model for Object Silhouette Tracking”
CoRR, vol. abs/1501.00752, Jan. 2015
[7] Y. S. Chia, W. Y. Kow, W. L. Khong, A. Kiring, and K. T. K. Teoi: “Kernel-
Based Object Tracking via Particle Filter and Mean Shift Algorithm” The
11th International Conference on Hybrid Intelligent Systems (HIS), pp. 522-
527, Dec. 2011
[8] B. Deori, and D. M. Thounaojam: “A Survey On Moving Object Tracking
In Video” International Journal on Information Theory, vol. 3, no. 3, Jul.
2014
[9] Emil Eriksson, György Dán and Viktoria Fodor: “Predictive Distributed
Visual Analysis for Video in Wireless Sensor Networks,” IEEE Trans. on
Mobile Computing, to appear
[10] Emil Eriksson, György Dán, Viktoria Fodor: “Algorithms for Distributed
Feature Extraction in Multi-camera Visual Sensor Networks,” in Proc. of
IFIP/TC6 Networking, May 2015
[11] Emil Eriksson, György Dán and Viktoria Fodor: “Real-time Distributed
Visual Feature Extraction from Video in Sensor Networks,” in Proc. of
IEEE International Conference on Distributed Computing in Sensor Sys-
tems (DCOSS), May 2014.
[12] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua: “Multicamera People
Tracking with a Probabilistic Occupancy Map” IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, vol. 30, no. 2, p. 267-282, Feb. 2008

[13] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua: “Multiple Object Track-
ing Using K-Shortest Paths Optimization” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 33, no. 9, p. 1806-1819, Feb. 2011
[14] Seth Benton: “Background subtraction, part 2: MATLAB to C using
MCS”, EE Times newsletter, Aug. 2008

[15] Thiago T. Santos, Carlos H. Morimoto: “Multicamera People Detection and Tracking using Integration” Pattern Recognition Letters, vol. 32, no. 1, pp. 47-55, Jan. 2011
[16] T. Bagautdinov, F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua: “Probability Occupancy Maps for Occluded Depth Images” IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015
[17] Akos Utasi, C. Benedek: “A multi-view annotation tool for people detec-
tion evaluation” Proceedings of the 1st International Workshop on Visual
Interfaces for Ground Truth Collection in Computer Vision Applications,
no. 3, May 2012
[18] P. Carr, Y. Sheikh, I. Matthews: “Monocular Object Detection Using 3D
Geometric Primitives” European Conference on Computer Vision (ECCV)
2012, Oct. 2012

[19] Saad M. Khan, Mubarak Shah: “A Multiview Approach to Tracking Peo-


ple in Crowded Scenes using a Planar Homography Constraint” European
Conference on Computer Vision (ECCV) 2006, vol. 3954, p.133-146, May
2006
[20] Wan Jiuqing, Li Achuan: “Multiple People Tracking Using Camera Net-
works with Overlapping Views” International Journal of Distributed Sensor
Networks Volume 2015, art. ID 591067, Jan. 2015
[21] Ran Eshel, Yael Moses: “Homography Based Multiple Camera Detection
and Tracking of People in a Dense Crowd” M.Sc. dissertation for research
project Under the supervision of Dr. Yael Moses, Sep. 2008

