
Bayesian Object Detection in Dynamic Scenes

Yaser Sheikh and Mubarak Shah

Computer Vision Laboratory


School of Computer Science
University of Central Florida
Orlando, FL 32826

Abstract

Detecting moving objects using stationary cameras is an important precursor to many activity recognition, object recognition and tracking algorithms. In this paper, three innovations are presented over existing approaches. Firstly, the model of the intensities of image pixels as independently distributed random variables is challenged, and it is asserted that useful correlation exists in the intensities of spatially proximal pixels. This correlation is exploited to sustain high levels of detection accuracy in the presence of nominal camera motion and dynamic textures. By using a nonparametric density estimation method over a joint domain-range representation of image pixels, multi-modal spatial uncertainties and complex dependencies between the domain (location) and range (color) are directly modeled. Secondly, temporal persistence is proposed as a detection criterion. Unlike previous approaches to object detection, which detect objects by building adaptive models of only the background, the foreground is also modeled to augment the detection of objects (without explicit tracking), since objects detected in a preceding frame contain substantial evidence for detection in the current frame. Thirdly, the background and foreground models are used competitively in a MAP-MRF decision framework, stressing spatial context as a condition of pixel-wise labeling, and the posterior function is maximized efficiently using graph cuts. Experimental validation of the proposed method is presented on a diverse set of dynamic scenes.

1 Introduction

Automated surveillance systems typically use stationary sensors to monitor an environment of interest. The assumption that the sensor remains stationary between the incidence of each video frame allows the use of statistical background modelling techniques for the detection of moving objects. Since 'interesting' objects in a scene are usually defined to be moving ones, such object detection provides a reliable foundation for other surveillance tasks like tracking, and is often also an important prerequisite for action or object recognition. However, the assumption of a stationary sensor does not necessarily imply a stationary background. Examples of 'nonstationary' background motion abound in the real world, including periodic motions, such as ceiling fans, pendulums or escalators, and dynamic textures, such as fountains, swaying trees or ocean ripples. Furthermore, the assumption that the sensor remains stationary is often nominally violated by common phenomena such as wind or ground vibrations, and to a larger degree by (stationary) hand-held cameras. If natural scenes are to be modeled, it is essential that object detection algorithms operate reliably in such circumstances.

In the context of this work, background modeling methods can be classified into two categories: (1) methods that employ local (pixel-wise) models of intensity and (2) methods that employ regional models of intensity. Most background modelling approaches tend to fall into the first category of pixel-wise models. In their work, Wren et al. [21] modeled the color of each pixel, I(x, y), with a single three-dimensional Gaussian, I(x, y) ~ N(µ(x, y), Σ(x, y)). The mean µ(x, y) and the covariance Σ(x, y) were learned from color observations in consecutive frames. Once the pixel-wise background model was derived, the likelihood of each incident pixel color could be computed and labeled. Similar approaches that used Kalman filtering for updating were proposed in [8] and [9], and a robust detection algorithm was also proposed in [7]. However, the single Gaussian pdf is ill-suited to most outdoor situations, since repetitive object motion, shadows or reflectance often cause multiple pixel colors to belong to the background at each pixel. To address some of these issues, Friedman and Russell, and independently Stauffer and Grimson [2, 18], proposed modeling each pixel intensity as a mixture of Gaussians instead, to account for the multi-modality of the 'underlying' likelihood function of the background color. While the use of Gaussian mixture models has been tested extensively, it does not explicitly model the spatial dependencies of neighboring pixel colors that may be caused by a variety of real dynamic motion. Since most of these phenomena are 'periodic', the presence of multiple models describing each pixel mitigates this effect somewhat by allowing a mode for each periodically observed pixel intensity; however, performance notably deteriorates since dynamic textures usually do not repeat exactly. Another limitation of this approach is the need to specify the number of Gaussians (models) for the E-M algorithm or the K-means approximation. Some methods that address the uncertainty of spatial location using local models have also been proposed. In [1], Elgammal et al. proposed nonparametric estimation methods for per-pixel background modeling. Kernel density estimation (KDE) was used to establish

Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05)
1063-6919/05 $20.00 © 2005 IEEE
membership, and since KDE is a data-driven process, multiple modes in the intensity of the background were also handled. They addressed the issue of nominally moving cameras with a local search for the best match for each incident pixel in neighboring models. Ren et al. also explicitly addressed the issue of background subtraction in a dynamic scene by introducing the concept of a spatial distribution of Gaussians (SDG) [16]. 'Nonstationary' backgrounds have most recently been addressed by Pless et al. [15] and Mittal et al. [12]. Pless et al. proposed several pixel-wise models based on the distributions of the image intensities and spatio-temporal derivatives. Mittal et al. proposed an adaptive kernel density estimation scheme with a pixel-wise joint model of color (for a normalized color space) and the optical flow at each pixel. Other notable pixel-wise detection schemes include [19], where topology-free HMMs are described and several state-splitting criteria are compared in the context of background modeling, and [17], where a three-state HMM is used to model the background.

The second category of methods uses region models of the background. In [20], Toyama et al. proposed a three-tiered algorithm that used region-based (spatial) scene information in addition to a per-pixel background model: region- and frame-level information served to verify pixel-level inferences. Another global method, proposed by Oliver et al. [13], used eigenspace decomposition to detect objects. The background was modeled by the eigenvectors corresponding to the η largest eigenvalues, which encompass possible illuminations in the field of view (FOV). Foreground objects are detected by projecting the current image into the eigenspace and finding the difference between the reconstructed and actual images. The most recent region-based approaches are by Monnet et al. [11] and Zhong et al. [22], who simultaneously proposed modeling image regions as an autoregressive moving average (ARMA) process, which is used to incrementally learn (using PCA) and then predict motion patterns in the scene.

The proposed work has three novel contributions. Firstly, the method proposed here provides a principled means of modeling the spatial dependencies of observed intensities. The model of image pixels as independent random variables, an assumption almost ubiquitous in background subtraction methods, is challenged, and it is further asserted that there exists useful structure in the spatial proximity of pixels. This structure is exploited to sustain high levels of detection accuracy in the presence of nominal camera motion and dynamic textures. By using nonparametric density estimation methods over a joint domain-range representation, the background itself is modeled as a single distribution and multi-modal spatial uncertainties are directly handled. Secondly, unlike all previous approaches, the foreground is explicitly modeled to augment the detection of objects without using tracking information. The criterion of temporal persistence is proposed for simultaneous use with the conventional criterion of background difference, without explicitly tracking objects. Thirdly, instead of directly applying a threshold to membership probabilities, which implicitly assumes independence of labels, we propose a MAP-MRF framework that competitively uses the foreground and background models for object detection, while enforcing spatial context in the process. The rest of the paper is organized as follows. A description of the proposed approach is presented in Section 2, including a discussion of modelling spatial uncertainty, the use of the foreground model for object detection, and the overall MAP-MRF framework. Experimental results are discussed in Section 3, followed by conclusions in Section 4.

2 Object Detection

In this section we describe the global representation of the background, the use of temporal persistence to formulate object detection as a competitive binary classification problem, and the overall MAP-MRF decision framework. For an image of size M × N, let S discretely and regularly index the image lattice, S = {(i, j) | 1 ≤ i ≤ N, 1 ≤ j ≤ M}. In the context of object detection with a stationary camera, the objective is to assign a binary label from the set L = {background, foreground} to each of the sites in S.

2.1 Joint Domain-Range Background Model

If the primary source of spatial uncertainty of a pixel is image misalignment, a Gaussian density would be an adequate model, since the corresponding point in the subsequent frame is equally likely to lie in any direction. However, in the presence of dynamic textures, cyclic motion, and nonstationary backgrounds in general, the 'correct' model of spatial uncertainty would often have an arbitrary shape and may be bi-modal or multi-modal, because by definition motion follows a certain repetitive pattern. Such arbitrarily structured spaces can best be analyzed using nonparametric methods, since these methods make no underlying assumptions on the shape of the density. Nonparametric estimation methods operate on the principle that dense regions in a given feature space, populated by feature points from a class, correspond to the modes of the 'true' pdf. In this work, analysis is performed in a feature space where the p pixels are represented by x_i ∈ R^5, i = 1, 2, ..., p. The feature vector, x, is a joint domain-range representation, where the space of the image lattice is the domain, (x, y), and some color space, for instance (r, g, b), is the range. Using this representation allows a global model of the entire background, f_{R,G,B,X,Y}(r, g, b, x, y), rather than a collection of pixel-wise models. Pixel-wise models ignore the dependencies between proximal pixels, and it is asserted here that these dependencies are important. The joint representation provides a direct means to model and exploit this dependency.

In order to build a background model, consider the situation at time t, before which all pixels, represented in 5-space, form the set ψ_b = {y_1, y_2, ..., y_n} of the background. Given this sample set, at the observation of the frame at time t, the probability of each pixel-vector belonging to the background can be computed using the kernel density estimator

([14]). The kernel density estimator is a member of the nonparametric class of estimators and, under appropriate conditions, the estimate it produces is a valid probability density itself. Thus, to find the probability that a candidate point, x, belongs to the background, ψ_b, an estimate can be computed,

P(x|ψ_b) = n^{-1} Σ_{i=1}^{n} ϕ_H(x − y_i),   (1)

where H is a symmetric positive definite d × d bandwidth matrix, and

ϕ_H(x) = |H|^{−1/2} ϕ(H^{−1/2} x),   (2)

where ϕ is a d-variate kernel function, usually satisfying ∫ ϕ(x) dx = 1, ϕ(x) = ϕ(−x), ∫ x ϕ(x) dx = 0, ∫ x x^T ϕ(x) dx = I_d, and usually compactly supported. The d-variate Gaussian density is a common choice as the kernel ϕ,

ϕ_H(x) = |H|^{−1/2} (2π)^{−d/2} exp(−(1/2) x^T H^{−1} x).   (3)

Within the joint domain-range representation, the kernel density estimator explicitly models spatial dependencies without running into the difficulties of parametric modelling. Furthermore, since it is known that the rgb axes are correlated, it is worth noting that kernel density estimation also accounts for this correlation. Lastly, in order to ensure that the algorithm remains adaptive to slower changes (such as illumination change or relocation), a sliding window of length ρ_b frames is maintained. This parameter corresponds to the learning rate of the system.
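Equations 1-3 can be sketched in a few lines of Python. A diagonal bandwidth matrix H = diag(h_k^2) is assumed here, and the bandwidths, sample sizes, and pixel values below are illustrative choices of mine, not values from the paper:

```python
import math
import random

def kde_likelihood(x, samples, h):
    """Gaussian-kernel estimate of P(x | samples) over the joint
    domain-range 5-space (Eqs. 1-3), assuming a diagonal bandwidth
    matrix H = diag(h_k^2)."""
    d = len(x)
    # |H|^(-1/2) (2*pi)^(-d/2) for diagonal H
    norm = 1.0 / (math.prod(h) * (2.0 * math.pi) ** (d / 2))
    total = 0.0
    for y in samples:
        q = sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, h))
        total += norm * math.exp(-0.5 * q)       # phi_H(x - y_i)
    return total / len(samples)                  # n^(-1) * sum over i

# Toy background sample set psi_b: 5-vectors (x, y, r, g, b) drawn from a
# static grayish scene; per-axis bandwidths (spatial vs. color) are made up.
random.seed(0)
psi_b = [(random.uniform(0, 360), random.uniform(0, 240),
          random.gauss(120, 5), random.gauss(120, 5), random.gauss(120, 5))
         for _ in range(500)]
h = (8.0, 8.0, 15.0, 15.0, 15.0)

p_bg = kde_likelihood((100.0, 100.0, 121.0, 119.0, 122.0), psi_b, h)
p_odd = kde_likelihood((100.0, 100.0, 250.0, 10.0, 10.0), psi_b, h)
print(p_bg > p_odd)   # a background-like color is far more probable
```

Note that a single call evaluates the global background model: the spatial coordinates of the candidate pixel participate in the kernel exactly as its color does.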
2.2 Modeling the Foreground

The intensity difference of interesting objects from the background has been, by far, the most widely used criterion for object detection. In this paper, temporal persistence is proposed as a property of real foreground objects: interesting objects tend to have smooth motion and tend to maintain consistent colors from frame to frame. The joint representation used here allows competitive classification between the foreground and background. To that end, models for both the background and the foreground are maintained. An appealing aspect of this representation is that the foreground model can be constructed in a similar fashion to the background model: a joint domain-range nonparametric density ψ_f = {z_1, z_2, ..., z_m}. Just as there was a learning rate parameter ρ_b for the background model, a parameter ρ_f for the number of foreground samples is defined.

However, unlike the background, at any time instant the likelihood of observing a foreground pixel at any location (i, j) of any color is uniform. Then, once a foreground region has been detected at time t, there is an increased likelihood of observing a foreground region at time t + 1 in the same proximity with a similar color distribution. Thus, the foreground likelihood is expressed as a mixture of a uniform function and the kernel density function,

P(x|ψ_f) = αγ + (1 − α) m^{-1} Σ_{i=1}^{m} ϕ_H(x − z_i),   (4)

where α ≪ 1 is a small positive constant that represents the uniform likelihood, and γ is the uniform distribution, equal to 1/(R × G × B × M × N) (R, G, B are the support of the color values, typically 256, and M, N are the spatial support of the image). If an object is detected in the preceding frame, the likelihood of observing the colors of that object in the same proximity increases according to the second term in Equation 4. Therefore, as objects of interest are detected, all pixels that are classified as 'interesting' are used to update the foreground model ψ_f. In this way, simultaneous models are maintained of both the background and the foreground, which are then used competitively to estimate interesting regions. Finally, to allow objects to become part of the background (e.g. a car having been parked, or new construction in an environment), all pixels are used to update ψ_b. Figure 1 shows plots of some marginals of the foreground model.

At this point, whether a pixel vector x is 'interesting' or not can be competitively estimated using a simple likelihood ratio classifier [4]: −ln( P(x|ψ_b) / P(x|ψ_f) ) > κ, where κ is a threshold which balances the trade-off between sensitivity to change and robustness to noise. The utility of using the foreground model for detection can be clearly seen in Figure 2: evidently, the higher the likelihood of belonging to the foreground, the lower the likelihood ratio. However, as is described next, instead of using only likelihoods, prior information of neighborhood spatial context is enforced in a MAP-MRF framework. This removes the need to specify the arbitrary parameter κ.

Figure 2: Improvement in discrimination using temporal persistence. (a) Histogrammed log-likelihood values for background membership. (b) Histogrammed log-likelihood ratio values. Clearly, the variance between clusters is decidedly enhanced.
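A minimal sketch of Equation 4 together with the likelihood-ratio test, assuming the same diagonal-bandwidth Gaussian kernel as in Equations 1-3; the sample sets, bandwidths, frame size, and value of α are illustrative stand-ins, not values from the paper:

```python
import math
import random

def kernel_5d(x, y, h):
    # Gaussian product kernel over (x, y, r, g, b), diagonal bandwidths h
    d = len(x)
    norm = 1.0 / (math.prod(h) * (2.0 * math.pi) ** (d / 2))
    q = sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, h))
    return norm * math.exp(-0.5 * q)

def p_foreground(x, psi_f, h, alpha=1e-3, M=240, N=360):
    """Eq. 4: mixture of a uniform density gamma and a KDE over the
    recently detected foreground samples psi_f."""
    gamma = 1.0 / (256 ** 3 * M * N)      # uniform over color x location
    if not psi_f:
        return gamma                       # nothing detected yet
    kde = sum(kernel_5d(x, z, h) for z in psi_f) / len(psi_f)
    return alpha * gamma + (1.0 - alpha) * kde

def is_interesting(x, psi_b, psi_f, h, kappa=0.0):
    # -ln( P(x|psi_b) / P(x|psi_f) ) > kappa  (clamped to avoid log(0))
    p_b = sum(kernel_5d(x, y, h) for y in psi_b) / len(psi_b)
    p_f = p_foreground(x, psi_f, h)
    return -math.log(max(p_b, 1e-300) / max(p_f, 1e-300)) > kappa

random.seed(1)
h = (8.0, 8.0, 15.0, 15.0, 15.0)
# Background: grayish pixels spread over the frame; foreground: a red blob
# detected near (50, 50) in the previous frame.
psi_b = [(random.uniform(0, 360), random.uniform(0, 240),
          random.gauss(120, 5), random.gauss(120, 5), random.gauss(120, 5))
         for _ in range(400)]
psi_f = [(random.gauss(50, 3), random.gauss(50, 3),
          random.gauss(200, 5), random.gauss(30, 5), random.gauss(30, 5))
         for _ in range(50)]

print(is_interesting((51, 49, 198, 32, 28), psi_b, psi_f, h))    # red blob again
print(is_interesting((200, 120, 121, 119, 120), psi_b, psi_f, h))  # background
```

The uniform term γ keeps the foreground likelihood nonzero everywhere, so a genuinely new object can still win the ratio test before it has ever been added to ψ_f.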
2.3 MAP-MRF Estimation

The inherent spatial coherency of objects in the real world is often applied in a post-processing step, in the form of morphological operators like erosion and dilation, or by neglecting connected components containing only a few pixels [18]. Furthermore, directly applying a threshold to membership probabilities implies conditional independence of labels, i.e. P(ℓ_i | ℓ_j) = P(ℓ_i) for i ≠ j. We assert that
Figure 1: Foreground modelling. Using kernel density estimates on a model built from recent frames, the foreground can be detected in subsequent frames using the property of temporal persistence. (a) Current frame. (b) The X,Y-marginal, f_{X,Y}(x, y); high membership probabilities are seen in regions where foreground in the current frame matches the recently detected foreground, and the nonparametric nature of the model allows the arbitrary shape of the foreground to be captured accurately. (c) The B,G-marginal, f_{B,G}(b, g). (d) The B,R-marginal, f_{B,R}(b, r). (e) The G,R-marginal, f_{G,R}(g, r).
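The model maintenance described above (every pixel refreshes ψ_b over a window of ρ_b frames, while only pixels labeled foreground refresh ψ_f) can be sketched with two bounded buffers. The window lengths and helper names here are illustrative, not from the paper:

```python
from collections import deque

rho_b, rho_f = 5, 1000             # illustrative window lengths
psi_b = deque(maxlen=rho_b)        # one entry per frame: that frame's 5-vectors
psi_f = deque(maxlen=rho_f)        # most recent foreground 5-vectors

def update_models(frame_pixels, labels):
    """frame_pixels: list of (x, y, r, g, b) tuples for the current frame;
    labels: parallel booleans, True where the pixel was labeled foreground."""
    psi_b.append(list(frame_pixels))           # every pixel updates psi_b
    for px, fg in zip(frame_pixels, labels):
        if fg:
            psi_f.append(px)                   # only detections update psi_f

def background_samples():
    # Flatten the per-frame window into the sample set of Eq. 1
    return [px for frame in psi_b for px in frame]

for t in range(8):                             # eight dummy frames of 4 pixels
    pixels = [(x, 0, t, t, t) for x in range(4)]
    update_models(pixels, [x == 0 for x in range(4)])
print(len(psi_b), len(background_samples()), len(psi_f))
```

Because ψ_b is updated with all pixels, a detected object that stops moving gradually acquires background support and is absorbed, as the paper's parked-car example requires.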

such conditional independence rarely exists between proximal sites. Instead of applying ad-hoc heuristics, Markov Random Fields provide a mathematical foundation to make a global inference using local information. The MRF prior is precisely the constraint of spatial context we wish to impose on L. The set of neighbors, N, is defined as the set of sites within a radius r ∈ R from site i = (i, j),

N_i = {s ∈ S | distance(i, s) ≤ r, s ≠ i},

where distance(a, b) denotes the Euclidean distance between the pixel locations a and b. The 4-neighborhood and 8-neighborhood cliques are two commonly used neighborhoods. The pixel-vectors x̂ = {x_1, x_2, ..., x_p} are conditionally independent given L, with conditional density functions f(x_i | ℓ_i). Thus, since each x_i is dependent on L only through ℓ_i, the likelihood function may be written as,

l(x̂|L) = Π_{i=1}^{p} f(x_i | ℓ_i) = Π_{i=1}^{p} f(x_i | ψ_f)^{ℓ_i} f(x_i | ψ_b)^{1−ℓ_i}.   (5)

Spatial context is enforced in the decision through a pairwise interaction MRF prior, used for its discontinuity-preserving properties,

p(L) ∝ exp( λ Σ_{i=1}^{p} Σ_{j=1}^{p} [ ℓ_i ℓ_j + (1 − ℓ_i)(1 − ℓ_j) ] ),

where λ is a constant and i ≠ j are neighbors. By Bayes' law,

p(L|x̂) = p(x̂|L) p(L) / p(x̂),   (6)

where p(x̂|L) is as defined in Equation 5, p(L) is as defined above, and p(x̂) = p(x̂|ψ_f) + p(x̂|ψ_b). The log-posterior, ln p(L|x̂), is then equivalent to (ignoring constant terms)

L(L|x̂) = Σ_{i=1}^{p} ln( f(x_i|ψ_f) / f(x_i|ψ_b) ) ℓ_i + λ Σ_{i=1}^{p} Σ_{j=1}^{p} [ ℓ_i ℓ_j + (1 − ℓ_i)(1 − ℓ_j) ].   (7)

The MAP estimate is the binary image that maximizes

arg max_{L ∈ 𝕃} L(L|x̂),   (8)

where 𝕃 is the set of 2^{NM} possible configurations of L. An exhaustive search of the solution space is not feasible due to its size, but since L belongs to the F² class of energy functions (as defined in [10]), efficient algorithms exist for the maximization of L using graph cuts [5, 10]. To optimize the energy function (Equation 7), we construct a graph G = ⟨V, E⟩ with a 4-neighborhood system N. In the graph, there are two distinct terminals, the source s and the sink t, and n nodes corresponding to the image pixel locations; thus V = {v_1, v_2, ..., v_n, s, t}. The graph construction is as described in [5]: a directed edge (s, i) with weight τ (the log-likelihood ratio) is added from s to node i if τ > 0; otherwise, a directed edge (i, t) with weight −τ is added between node i and the sink t. For the second term in Equation 7, undirected edges of weight λ are added if the corresponding pixels are neighbors as defined by N. The minimum cut can then be computed through several approaches: the Ford-Fulkerson algorithm [3], the faster version in [5], or the generic version of [10]. The configuration found corresponds to an optimal estimate of L.

Table 1: Object-level detection rates. Object sensitivity and specificity for five sequences (each one hour long).

Seq.    Objects  Det.  Mis-Det.  Det. Rate  Mis-Det. Rate
Seq. 1  84       84    0         100.00%    0.00%
Seq. 2  115      114   1         99.13%     0.87%
Seq. 3  161      161   0         100.00%    0.00%
Seq. 4  94       94    0         100.00%    0.00%
Seq. 5  170      169   2         99.41%     1.18%
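The graph construction can be made concrete with a small self-contained sketch. To stay compact it uses a plain Edmonds-Karp max-flow rather than the faster algorithms of [5, 10], and the grid size, weights, and helper names are illustrative choices of mine:

```python
from collections import deque

def min_cut_labels(tau, lam, width, height):
    """MAP-MRF labels via an s-t min cut on the construction of Sec. 2.3.
    tau[i] is the per-pixel log-likelihood ratio, lam the smoothness
    weight; returns a list with 1 = foreground, 0 = background."""
    n = width * height
    S, T = n, n + 1                              # source and sink ids
    cap = {}                                     # residual capacities
    adj = [[] for _ in range(n + 2)]

    def edge(u, v, c):
        if (u, v) not in cap:                    # register adjacency once
            adj[u].append(v)
            adj[v].append(u)
        cap[(u, v)] = cap.get((u, v), 0.0) + c
        cap.setdefault((v, u), 0.0)

    for i in range(n):
        if tau[i] > 0:                           # cutting (s, i) = cost of
            edge(S, i, tau[i])                   # labeling i background
        elif tau[i] < 0:                         # cutting (i, t) = cost of
            edge(i, T, -tau[i])                  # labeling i foreground
        x, y = i % width, i // width
        if x + 1 < width:                        # undirected smoothness edges
            edge(i, i + 1, lam)
            edge(i + 1, i, lam)
        if y + 1 < height:
            edge(i, i + width, lam)
            edge(i + width, i, lam)

    while True:                                  # Edmonds-Karp augmentation
        parent = {S: None}
        q = deque([S])
        while q and T not in parent:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 1e-12:
                    parent[v] = u
                    q.append(v)
        if T not in parent:
            break
        bottleneck, v = float("inf"), T
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[(parent[v], v)])
            v = parent[v]
        v = T
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= bottleneck
            cap[(v, u)] += bottleneck
            v = u

    reachable, q = {S}, deque([S])               # source side = foreground
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in reachable and cap[(u, v)] > 1e-12:
                reachable.add(v)
                q.append(v)
    return [1 if i in reachable else 0 for i in range(n)]

# Toy 4x4 frame: left half looks foreground (tau = +2), right half
# background (tau = -2); one noisy unary term is smoothed by the prior.
tau = [2.0 if (i % 4) < 2 else -2.0 for i in range(16)]
tau[5] = -2.0                                    # noisy pixel inside the object
labels = min_cut_labels(tau, 1.5, 4, 4)
print(labels[5], sum(labels))                    # the noisy pixel is recovered
```

With λ = 1.5, flipping the noisy pixel to background would cut three smoothness edges (cost 4.5) against a unary cost of 2, so the min cut keeps it foreground, which is exactly the spatial-context effect the MRF prior is meant to provide.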
3 Results and Discussion

The algorithm was tested in the presence of nominal camera motion, dynamic textures, and cyclic motion. On a 3.06 GHz Intel Pentium 4 processor with 1 GB RAM, an optimized implementation can process up to 11 fps for a frame size of 240 × 360. Comparative results for the mixture of Gaussians method are also shown. The first sequence tested involved a camera mounted on a tall tripod. The wind caused the tripod to sway back and forth, causing nominal motion in the scene. In Figure 4, the first row shows the current image.

Figure 3: Detection in dynamic scenes. The first column shows the original images, the second column the results obtained by the Mixture of Gaussians method [18], and the third column the results obtained by the proposed method. Morphological operators were not used in the results.

Figure 5: Pixel-level detection sensitivity and specificity. Average true negatives: proposed method 99.65%, Mixture of Gaussians 94.22%. Average true positives: proposed method 90.66%, Mixture of Gaussians 75.42%.
The second row shows the foreground detected by the Mixture of Gaussians method [18], and it is evident that the motion causes substantial degradation in performance, despite a 5-component mixture model and a high learning rate of 0.05. The third row shows the foreground detected using the proposed approach. It is stressed that no morphological operators like erosion/dilation or median filters were used in the presentation of these results. Figure 3 shows results on a variety of scenes with dynamic textures, including fountains (a), shimmering water (b), and waving trees (c) and (d).

We performed quantitative analysis at both the pixel level and the object level. For the first experiment, we manually segmented a 300-frame sequence containing nominal motion (as seen in Figure 4). In the sequence, two objects (a person and then a car) move across the field of view, causing the two bumps in the number of foreground pixels. The per-frame detection rates are shown in Figure 5 in terms of specificity and sensitivity, where

sensitivity = (# of true positives detected) / (total # of true positives),
specificity = (# of true negatives detected) / (total # of true negatives).
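As a sanity check on these definitions, a tiny helper (illustrative, not the authors' evaluation code) that computes both rates from flat binary masks:

```python
def pixel_rates(detected, truth):
    """Per-frame sensitivity and specificity from parallel 0/1 masks."""
    tp = sum(1 for d, t in zip(detected, truth) if d and t)
    tn = sum(1 for d, t in zip(detected, truth) if not d and not t)
    pos = sum(truth)
    neg = len(truth) - pos
    sensitivity = tp / pos if pos else 1.0
    specificity = tn / neg if neg else 1.0
    return sensitivity, specificity

truth    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # ground-truth foreground mask
detected = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]   # one miss, one false alarm
print(pixel_rates(detected, truth))          # (0.75, 0.8333...)
```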
Clearly, the detection accuracy, both in terms of sensitivity and specificity, is consistently higher than that of the mixture of Gaussians approach. Next, to evaluate detection at the object level (detecting whether an object is present or not), we evaluated five sequences, each one hour long. Sensitivity and specificity were measured in an identical fashion to the pixel-level experiment, with an object defined as each contiguous region of pixels. Results are shown in Table 1.

4 Conclusion

There are a number of fundamental innovations in this work. From an intuitive point of view, using the joint representation of image pixels allows local spatial structure of a sequence to be represented explicitly in the modeling process. The background is represented by a single distribution, and a kernel density estimator is used to find membership probabilities. Another novel proposition in this work is temporal persistence as a criterion for detection without feedback from higher-level modules. By making coherent models of both the background and the foreground, the paradigm of object detection changes from identifying outliers with respect to a background model to explicitly classifying between the foreground and background models. The likelihoods obtained in this way are utilized in a MAP-MRF framework that allows an optimal global inference of the solution based on local information. The resulting algorithm performed suitably in several challenging settings.

Since analysis is being performed in R^5, it is important to consider how the so-called curse of dimensionality affects performance. Typically, higher-dimensional feature spaces mean large, sparsely populated volumes, but at high frame rates the overriding advantage in the context of background modeling and object detection is the generous availability of data. Here, the magnitude of the sample size is seen as an effective means of reducing the variance of the density estimate that would otherwise be expected [4] (pg. 323). Future directions include using a fully parameterized bandwidth matrix for adaptive kernel density estimation. Another promising area of future work is to fit this work in with nonparametric approaches to tracking, like mean-shift tracking. Since both background and foreground models are continuously maintained, the detection information can be used to weight likelihoods a priori.

Acknowledgements

The authors would like to thank Omar Javed for his comments and critiques. This material is based upon work funded in part by the U.S. Government. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the U.S. Government.

References

[1] A. Elgammal, D. Harwood and L. Davis, "Background and Foreground Modeling Using Nonparametric Kernel Density Estimation for Visual Surveillance," Proceedings of the IEEE, 2002.
Figure 4: Background subtraction with a nominally moving camera (motion is an average of 12 pixels). The top row shows the original images, the second row the results obtained by the Mixture of Gaussians method [18], and the third row the results obtained by the proposed method. Morphological operators were not used in the results.

[2] N. Friedman and S. Russell, "Image Segmentation in Video Sequences: A Probabilistic Approach," Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, 1997.
[3] L. Ford and D. Fulkerson, Flows in Networks, Princeton University Press, 1962.
[4] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 1990.
[5] D. Greig, B. Porteous and A. Seheult, "Exact Maximum A Posteriori Estimation for Binary Images," Journal of the Royal Statistical Society, Series B, Vol. 51, No. 2, 1989.
[6] S. Geman and D. Geman, "Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images," TPAMI, 1984.
[7] I. Haritaoglu, D. Harwood and L. Davis, "W4: Real-time Surveillance of People and Their Activities," TPAMI, 2000.
[8] K.-P. Karmann, A. Brandt and R. Gerl, "Using Adaptive Tracking to Classify and Monitor Activities in a Site," Time Varying Image Processing and Moving Object Recognition, Elsevier Science Publishers, 1990.
[9] D. Koller, J. Weber, T. Huang, J. Malik, G. Ogasawara, B. Rao and S. Russell, "Towards Robust Automatic Traffic Scene Analysis in Real-time," ICPR, 1994.
[10] V. Kolmogorov and R. Zabih, "What Energy Functions Can Be Minimized via Graph Cuts?," TPAMI, 2004.
[11] A. Monnet, A. Mittal, N. Paragios and V. Ramesh, "Background Modeling and Subtraction of Dynamic Scenes," ICCV, 2003.
[12] A. Mittal and N. Paragios, "Motion-Based Background Subtraction Using Adaptive Kernel Density Estimation," CVPR, 2004.
[13] N. Oliver, B. Rosario and A. Pentland, "A Bayesian Computer Vision System for Modeling Human Interactions," TPAMI, 2000.
[14] E. Parzen, "On Estimation of a Probability Density Function and Mode," Annals of Mathematical Statistics, 1962.
[15] R. Pless, J. Larson, S. Siebers and B. Westover, "Evaluation of Local Models of Dynamic Backgrounds," CVPR, 2003.
[16] Y. Ren, C.-S. Chua and Y.-K. Ho, "Motion Detection with Nonstationary Background," MVA, Springer-Verlag, 2003.
[17] J. Rittscher, J. Kato, S. Joga and A. Blake, "A Probabilistic Background Model for Tracking," ECCV, 2000.
[18] C. Stauffer and W. Grimson, "Learning Patterns of Activity Using Real-time Tracking," TPAMI, 2000.
[19] B. Stenger, V. Ramesh, N. Paragios, F. Coetzee and J. Buhmann, "Topology Free Hidden Markov Models: Application to Background Modeling," ECCV, 2000.
[20] K. Toyama, J. Krumm, B. Brumitt and B. Meyers, "Wallflower: Principles and Practice of Background Maintenance," ICCV, 1999.
[21] C. Wren, A. Azarbayejani, T. Darrell and A. Pentland, "Pfinder: Real-time Tracking of the Human Body," TPAMI, 1997.
[22] J. Zhong and S. Sclaroff, "Segmenting Foreground Objects from a Dynamic Textured Background via a Robust Kalman Filter," ICCV, 2003.
