Image-Based Markerless 3d Human Motion Capture Using Multiple Cues

Pedram Azad (1), Ales Ude (2), Tamim Asfour (1), Gordon Cheng (3), and
Ruediger Dillmann (1)

(1) Institute for Computer Science and Engineering, University of Karlsruhe,
    Germany
    azad|asfour|dillmann@ira.uka.de
(2) Jozef Stefan Institute, Ljubljana, Slovenia
    ales.ude@ijs.si
(3) Computational Neuroscience Laboratories, ATR, Kyoto, Japan
    gordon@atr.jp

1 Introduction

The idea of markerless human motion capture is to capture human motion
without any additional arrangements, operating on image sequences only.
Implementing such a system on a humanoid robot, and thus giving the robot the
ability to perceive human motion, would be valuable for various reasons.
Captured trajectories, which are calculated in joint angle space, can serve as
a solid base for learning human-like movements. Commercial human motion
capture systems such as the Vicon system, which are popular both in the film
industry and in biological research, require reflective markers and
time-consuming manual post-processing of captured sequences. Therefore, such
systems can only provide data for highly supervised offline learning
algorithms. In contrast, a real-time human motion capture system using the
image data acquired by the robot's head would be a big step toward autonomous
online learning of movements. Another application for the data computed by
such a system is the recognition of actions and activities, serving as a
perceptive component for human-robot interaction. However, providing data for
learning of movements, often referred to as learning-by-imitation, is the more
challenging goal, since transforming captured movements in configuration space
into the robot's kinematics and reproducing them on the robot places higher
demands on smoothness and accuracy.
For application on the active head of a humanoid robot, a number of
restrictions have to be dealt with. In addition to the limitation to two
cameras positioned at approximately eye distance, one has to take into account
that an active head can potentially move. Furthermore, computations have to be
performed in real-time, preferably at a frame rate of 30 Hz or higher, in
order to achieve optimal results.
The general problem definition is to find the correct configuration of the
underlying articulated 3d human model for each input image or image tuple.
The main problem is that the search space grows exponentially with the number
of Degrees Of Freedom (DOF). A realistic model of the human body has at least
25 DOF, or 17 DOF if only the upper body is modeled, leading in both cases to
a very high-dimensional search space.
There are several approaches to the general problem of markerless human
motion capture, differing in the sensors incorporated and the intended
application. When using multiple cameras, i.e. three or more cameras located
around the area of interest, two classes of systems have shown very good
results. The first class is based on the calculation of 3d voxel data, as done
by [5, 13]. The second class is based on particle filtering and became popular
through the work of Deutscher et al. [6]. Recently, we have started to adapt
and extend this system for real-time application on a humanoid robot head [3];
the newest results are presented in the following. Other approaches depend on
an additional 3d sensor, such as the Swiss Ranger, and the Iterative Closest
Point (ICP) algorithm, as presented by [10]. However, the goal of that system
is not to acquire smooth trajectories but to classify the activities of a
human into categories such as walking, waving, bowing, etc. Other approaches
concentrate on deriving as much information as possible from monocular image
sequences [15], and on reducing the size of the search space by restricting
the range of possible movements, e.g. by incorporating a task-specific dynamic
model [14]. Our experience is that it is not possible to build a general 3d
human motion capture system in this way, since in many cases a single camera
is not sufficient to determine accurate 3d information based on the principle
of depth through scaling. A further strategy to reduce the search space is
search space decomposition, i.e. performing a hierarchical search, as done by
[8]. However, this limits the power of the system, since in many cases the
global view is needed to determine the correct configuration; e.g. for
rotations around the body axis, the information provided by the positions of
the arms is very helpful.
We use the Bayesian framework of particle filtering to compute the
probability distribution of the current configuration, as described in detail
in [3]. Particle filtering, also known as the Condensation Algorithm in the
context of visual tracking, as introduced in [9], has proven to be an
applicable and robust technique for contour tracking in general [4, 11, 12],
and for human motion capture in particular, as shown in [6, 15].
In particle filters, a larger search space requires a greater number of
particles. One strategy to cope with this problem is to reduce the
dimensionality of configuration space by restricting the range of the
subject's potential movements, as already mentioned, or to approach a linear
relationship between the dimension of configuration space and the size of the
search space by performing a hierarchical search. A general yet effective way
to reduce the number of particles is based on the idea of Simulated Annealing,
presented in [6, 7]. However, the final system, which uses three cameras at
fixed positions in the corners of a room, requires on average 15 seconds to
process one frame on a 1 GHz PIII CPU [7].
Theoretically, an edge-based cue alone would be sufficient to track the
movements of a human, given an adequate number of particles. However, to span
the search space with sufficient resolution when using an edge-based cue only,
millions of particles would be necessary for a successful tracker. Therefore,
the common approach using particle filters for human motion capture is to
combine edge and region information within the likelihood function, which
evaluates how well a given configuration matches the current observation.
Although this is a powerful approach, the computational effort is relatively
high. In particular, the evaluation of the region-based cue is computationally
expensive.
Our strategy is to combine as many cues derivable from the input images as
possible, reducing the search space implicitly by achieving faster convergence
of the probability distribution. We present a running system on our humanoid
robot ARMAR that uses the benefits of a stereo setup and combines edge, region
and skin color information. The initial configuration is found automatically,
a necessity for any perceptive component of a vision system. The system is
able to capture real 3d motion with high smoothness and accuracy for a purely
vision-based algorithm, without using markers or manual post-processing. The
processing rate of our algorithm is 15 Hz on a 3 GHz CPU.

2 Using Particle Filters for Human Motion Capture


Particle filtering, often also referred to as the Condensation Algorithm, has
become popular for various visual tracking applications. The benefits of a
particle filter compared to a Kalman filter are the ability to track
non-linear movements and the ability to maintain multiple hypotheses
simultaneously. The price one has to pay for these advantages is the higher
computational effort. The probability distribution representing the likelihood
of the configurations in configuration space matching the observations is
modeled by a finite set of $N$ particles
$S = \{(s_1, \pi_1), \ldots, (s_N, \pi_N)\}$, where $s_i$ denotes one
configuration and $\pi_i$ the likelihood associated with it. The core of a
particle filter is the likelihood function $p(z|s)$ computing the
probabilities $\pi_i$, where $s$ denotes a given configuration and $z$ the
current observations, i.e. the current image pair. This likelihood function
must be evaluated for each particle in each frame, i.e. $N \cdot f$ times per
second. For example, $N = 1000$ particles and $f = 30$ Hz result in
$N \cdot f = 30000$ evaluations per second. A detailed description of using
particle filters for human motion capture can be found in [3].
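
As a rough illustration of the particle set and of the role of the likelihood
function, the following Python sketch performs one generic resample, diffuse
and re-weight step. It is a minimal sketch under stated assumptions, not the
implementation described in [3]: the function names, the purely Gaussian
diffusion, the uniform fallback and the weighted-mean estimate are choices
made here for illustration only.

```python
import random


def particle_filter_step(particles, weights, likelihood, noise_sigma, rng=random):
    """One generic Condensation-style update.

    particles   -- list of configurations s_i (each a list of floats)
    weights     -- list of likelihoods pi_i from the previous frame
    likelihood  -- callable s -> value proportional to p(z|s) for the new frame
    noise_sigma -- standard deviation of the Gaussian diffusion (assumed model)
    """
    n = len(particles)
    # Resample: draw N configurations with probability proportional to pi_i.
    resampled = rng.choices(particles, weights=weights, k=n)
    # Diffuse: perturb every dimension with Gaussian noise.
    moved = [[x + rng.gauss(0.0, noise_sigma) for x in s] for s in resampled]
    # Measure: one likelihood evaluation per particle (N evaluations per frame).
    new_weights = [likelihood(s) for s in moved]
    total = sum(new_weights)
    if total == 0.0:
        # Assumption: fall back to uniform weights if every likelihood underflows.
        new_weights = [1.0 / n] * n
    else:
        new_weights = [w / total for w in new_weights]
    return moved, new_weights


def weighted_mean(particles, weights):
    """Estimate the configuration as the weighted mean (weights must sum to 1)."""
    dim = len(particles[0])
    return [sum(w * s[d] for s, w in zip(particles, weights)) for d in range(dim)]
```

Calling `particle_filter_step` once per frame with $N = 1000$ particles at 30 Hz
would trigger the 30000 likelihood evaluations per second mentioned above,
which is why the cost of a single evaluation dominates the design.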

2.1 Edge Cue


Given the projected edges of a configuration $s$ of the human model and the
current input image $z$, the likelihood function $p(z|s)$ for the edge cue
calculates the likelihood that the configuration leading to the set of
projected edges is the proper configuration, i.e. the one matching the
gradient image best. The basic technique is to traverse the projected edges
and search at fixed distances $\Delta$ for high-contrast features
perpendicular to the projected edge within a fixed search distance $\delta$
(in each direction), i.e. finding edge pixels in the camera image, as
illustrated in Figure 1 [9].

Fig. 1. Illustration of the search of edges

For this purpose, the camera image is usually preprocessed to generate an edge
image using a gradient-based edge detector. The likelihood is calculated on
the basis of the Sum of Squared Differences (SSD). For convenience of
notation, it is assumed that all edges are contained in one contiguous spline
with $M = L/\Delta$ discretizations, where $L$ denotes the sum of the lengths
of all projected edges in the current image. The distance at which an edge
feature has been found for the $m$-th point is denoted as $d_m$, and $\mu$
denotes a constant maximum error which is applied in case no edge feature
could be found. The likelihood function can then be formulated as:

\[
p(z|s) \propto \exp\left\{ -\frac{1}{2\sigma^2 M} \sum_{m=1}^{M} f(d_m, \mu) \right\} \tag{1}
\]

where $f(\nu, \mu) = \min(\nu^2, \mu^2)$. Another approach is to spread the
gradients in the gradient image with a Gaussian filter or any other suitable
operator and to sum the gradient values along a projected edge, as done in
[6], rather than performing a search perpendicular to each pixel of the
projected edge. By doing this, the computational effort is reduced
significantly, even when picking the highest possible discretization of
$\Delta = 1$ pixel. Furthermore, one does not have to make the non-trivial
decision which gradient pixel to take for each pixel of the projected edge.
Assuming that the spread gradient map has been remapped to values between 0
and 1, the modified likelihood function can be formulated as:

\[
p_g(z|s) \propto \exp\left\{ -\frac{1}{2\sigma_g^2 M_g} \sum_{m=1}^{M_g} (1 - g_m)^2 \right\} \tag{2}
\]

where $g_m$ denotes the remapped gradient value for the $m$-th point.
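
The gradient-based likelihood of Equation (2) can be sketched in a few lines
of Python. This is only an illustration under assumptions: the gradient map is
a nested list already remapped to [0, 1], the projected edge is given as a
list of integer pixel coordinates, and pixels outside the image are treated as
having gradient 0. None of these details are specified in the text.

```python
import math


def gradient_cue_likelihood(gradient_map, edge_pixels, sigma_g):
    """Evaluate Equation (2) up to a constant factor.

    gradient_map -- 2d list of spread gradient values remapped to [0, 1]
    edge_pixels  -- list of (x, y) pixels on the projected model edges
    sigma_g      -- weighting parameter sigma_g
    """
    m_g = len(edge_pixels)
    if m_g == 0:
        raise ValueError("at least one projected edge pixel is required")
    height = len(gradient_map)
    width = len(gradient_map[0])
    total = 0.0
    for x, y in edge_pixels:
        # Assumption: pixels projected outside the image count as gradient 0.
        inside = 0 <= x < width and 0 <= y < height
        g = gradient_map[y][x] if inside else 0.0
        total += (1.0 - g) ** 2
    return math.exp(-total / (2.0 * sigma_g ** 2 * m_g))
```

A projected edge that lies exactly on pixels with gradient value 1 yields the
maximum value 1.0; every pixel that misses the gradient ridge lowers the
result.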

2.2 Region Cue

The second cue commonly used is region-based, for which a foreground
segmentation technique has to be applied. The choice of the segmentation
algorithm is independent of the likelihood function itself. The most common
approach is background subtraction. However, this segmentation method assumes
a static camera setup and is therefore not suitable for application on a
potentially moving robot head. Another option is to segment motion by using
difference images or optical flow. Both methods also assume a static camera
setup. It has to be mentioned that there are extensions of the basic optical
flow algorithm that allow real motion in the scene to be distinguished from
ego-motion [16]. However, the problem with all motion-based methods, which do
not include background subtraction, is that the quality of the segmentation
result is not sufficient for a region-based cue. Only those parts of the image
that contain edges or any other kind of texture can be segmented, and the
silhouette of segmented moving objects often contains parts of the background,
resulting in a relatively blurred segmentation result.
Having segmented the foreground in the input image, where foreground
pixels are set to 1 and background pixels are set to 0, the likelihood function
commonly used can be formulated as [6]:
\[
p_r(z|s) \propto \exp\left\{ -\frac{1}{2\sigma_r^2 M_r} \sum_{m=1}^{M_r} (1 - r_m)^2 \right\} \tag{3}
\]

where $r_m$ denotes the segmentation value of the $m$-th pixel from the set of
pixels of all projected body part regions. Although this function can be
optimized further, using the fact that $r_m \in \{0, 1\}$, its computation is
still rather inefficient. The bottleneck is the computation of the set of all
$M_r$ projected pixels together with reading the corresponding values from the
segmentation map.
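
The following sketch evaluates Equation (3) and uses the optimization hinted
at above: because each segmentation value is either 0 or 1, the squared term
equals the number of projected pixels that fall on background. The data layout
(a nested list for the segmentation map, a list of pixel coordinates for the
projected regions) is an assumption made for illustration.

```python
import math


def region_cue_likelihood(segmentation, region_pixels, sigma_r):
    """Evaluate Equation (3) up to a constant factor.

    segmentation  -- 2d list with 1 for foreground and 0 for background
    region_pixels -- list of (x, y) pixels of all projected body part regions
    sigma_r       -- weighting parameter sigma_r
    """
    m_r = len(region_pixels)
    if m_r == 0:
        raise ValueError("at least one projected region pixel is required")
    # For r_m in {0, 1}: (1 - r_m)^2 == 1 - r_m, so the sum is simply the
    # number of projected pixels that hit background.
    background_hits = sum(1 for x, y in region_pixels if segmentation[y][x] == 0)
    return math.exp(-background_hits / (2.0 * sigma_r ** 2 * m_r))
```

Even in this form the loop still has to visit every projected pixel, which
illustrates why the text identifies the pixel enumeration as the bottleneck.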

2.3 Fusion of Multiple Cues

The two cues introduced above are fused by simply multiplying the two
likelihood functions, resulting in:
\[
p_{g,r}(z|s) \propto \exp\left\{ -\frac{1}{2} \left( \frac{\sum_{m=1}^{M_g} (1 - g_m)^2}{\sigma_g^2 M_g} + \frac{\sum_{m=1}^{M_r} (1 - r_m)^2}{\sigma_r^2 M_r} \right) \right\} \tag{4}
\]

Any other cue can be fused within the particle filter with the same rule.
One way of combining the information provided by multiple cameras is to
incorporate the likelihoods for each image in the exact same manner [6]. In our
system, we additionally use 3d information which can be computed explicitly
by knowing the stereo calibration. This separate cue is then combined with
the other likelihoods with the same method, as will be described in Section 3.
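
Since every cue has the form of an exponential, multiplying likelihoods is the
same as adding their exponents. The short sketch below shows both variants;
the function names are hypothetical, and working in the log domain is a
suggestion made here rather than something stated in the text. It can help
because the product of several very small likelihoods may underflow to zero in
floating-point arithmetic.

```python
import math


def fuse(likelihoods):
    """Fuse independent cue (or camera) likelihoods by multiplication."""
    return math.prod(likelihoods)


def fuse_log(exponents):
    """Same fusion in the log domain.

    exponents -- the non-negative terms e_k of the cues exp(-e_k)
    Returns the logarithm of the fused likelihood.
    """
    return -sum(exponents)
```

For moderate exponents both functions agree after exponentiation; for large
exponents only the log-domain value keeps the ranking of particles intact.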

3 Multiple Cues in the proposed System


In this section, we introduce the cues our system is based on. Instead of the
commonly used region-based likelihood function $p_r$, as introduced in
Equation (3), we incorporate the result of foreground segmentation in a more
efficient way, as will be introduced in Section 3.1. In Section 3.2 we present
the results of studies regarding the effectiveness of the introduced cues,
leading to a new likelihood function. As already mentioned, we use the
benefits of a stereo system in an additional explicit way, as will be
introduced in Section 3.3. The final combined likelihood function is presented
in Section 3.4.

3.1 Edge Filtering using Foreground Segmentation

When looking deeper into the region-based likelihood function $p_r$, one can
identify two separate effects:
• Leading to a faster convergence of the particle filter
• Compensating the failure of the edge-based likelihood function in cluttered
backgrounds
The first property is discussed in detail in Section 3.2, and an efficient alterna-
tive is presented. The second property can be implemented explicitly by using
the result of foreground segmentation directly to generate a filtered edge map,
containing only foreground edge pixels. In general, there are two possibilities:
• Filtering the gradient image by masking out background pixels with the
segmentation result
• Calculating gradients on the segmentation result
While the first alternative preserves more details in the image, the second
alternative computes a sharper silhouette. Furthermore, in the second case
gradient computation can be optimized for binarized input images, which is why
we currently use this approach. As explained in Section 2.2, the only commonly
used foreground segmentation technique is background subtraction, which we
cannot use, since the robot head can potentially move. It has to be mentioned
that accounting for a moving robot head is not merely a burden; there are
several benefits of using an active head, which will be discussed in Section
7. As an alternative to background subtraction, we use a solid-colored shirt,
which allows us to perform tests practically anywhere in our lab. Since
foreground segmentation is performed in almost any markerless human motion
capture system, we do not restrict ourselves compared to other approaches, but
only exchange the need for a completely static setup for the restriction of
wearing a colored shirt. We want to point out that the proposed generation of
a filtered edge map does not depend on the segmentation technique.
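
The two possibilities for generating a filtered edge map can be sketched as
follows. Both functions are illustrative assumptions: images are nested lists,
and for the second alternative a pixel is marked as an edge pixel if the
binary mask changes toward its right or lower neighbor, which is one simple
way to compute gradients on a binarized image.

```python
def mask_gradient_image(gradient, foreground):
    """Alternative 1: keep gradient values only at foreground pixels."""
    return [[g if fg else 0 for g, fg in zip(g_row, f_row)]
            for g_row, f_row in zip(gradient, foreground)]


def gradient_of_segmentation(foreground):
    """Alternative 2: compute edges directly on the binary segmentation.

    Assumption: a pixel is an edge pixel if its value differs from the
    value of its right or lower neighbor.
    """
    height = len(foreground)
    width = len(foreground[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            right = x + 1 < width and foreground[y][x] != foreground[y][x + 1]
            below = y + 1 < height and foreground[y][x] != foreground[y + 1][x]
            edges[y][x] = 1 if (right or below) else 0
    return edges
```

The second function only compares neighboring binary values, which reflects
why gradient computation on a binarized image can be made very cheap.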

3.2 Cue Studies and Distance Likelihood Function

In order to understand the benefits and drawbacks of each likelihood function,
and thus to get a feeling for what a likelihood function can and cannot do, it
is helpful to take a look at the corresponding probability distributions in a
simple one-dimensional example. The experiment we use in simulation is
tracking a square of fixed size in 2d, which can be simplified further to
tracking the intersection of a square with a straight line along that line,
i.e. in one dimension. The model of the square to be tracked is defined by the
midpoint $(x, y)$ and the edge length $k$, where $y$ and $k$ are constant and
$x$ is the one-dimensional configuration to be predicted. In the following, we
compare three different likelihood functions separately: the gradient-based
cue $p_g$, the region-based cue $p_r$, and a third cue $p_d$, which is based
on the Euclidean distance:

\[
p_d(z|s) \propto \exp\left\{ -\frac{1}{2\sigma_d^2} |f(s) - c|^2 \right\} \tag{5}
\]

where $c$ is a vector of arbitrary dimension which has been calculated
previously on the basis of the observations $z$, and
$f: \mathbb{R}^{\dim(s)} \to \mathbb{R}^{\dim(c)}$ is a transformation mapping
a configuration $s$ to the vector space of $c$. In our example, $c$ denotes
the midpoint of the square in the observation $z$, $\dim(s) = \dim(c) = 1$,
and $f(s) = s$. For efficiency, we have used the squared Euclidean distance,
practically resulting in the SSD. Evidently, in this simple case there is no
need to use a particle filter for tracking if the configuration to be
predicted, $c$, can be determined directly. However, in this example we want
to show the characteristic properties of the likelihood function $p_d$, in
order to describe its performance in the final likelihood function of the
human motion capture system, presented in Sections 3.3 and 3.4.

For the experiments, we used $N = 15$ particles and picked
$\sigma_g = \sigma_r = \sqrt{5}$ and $\sigma_d = 0.1$. In the update step of
the particle filter we applied Gaussian noise only, with an amplification
factor of $\omega = 3$. The task was to find a static square with $k = 70$,
based on the pixel data at the intersection of the square with the x-axis. As
one can see in Figure 2, the gradient-based likelihood function $p_g$ produces
the narrowest distribution. The probability distributions produced by $p_r$
and $p_d$ are relatively similar; their narrowness can be adjusted by varying
$\sigma_r$ and $\sigma_d$, respectively. The effect of each distribution can
be seen in Figure 3. While the gradient cue leads to the fastest convergence
for starting points in a close neighborhood of the goal, the region cue and
the distance cue converge faster the farther the starting point is away from
the goal.

Fig. 2. Comparison of probability distributions (probability over object
position x for the gradient, region and distance cue)

Fig. 3. Comparison of iteration numbers (iterations needed over initial x
distance for the gradient, region and distance cue): an iteration number of
100 indicates that the goal was not found

In Figures 4-6, the initial distance from the goal $\Delta x_0$ was varied. As
expected, $p_g$ leads to the fastest and smoothest convergence for
$\Delta x_0 = 5$. $\Delta x_0 = 15$ is already close to the border of the
convergence radius for $p_g$; the particle filter first tends in the wrong
direction and then finally converges to the goal position. With
$\Delta x_0 = 80$, it is impossible for $p_g$ to find the global maximum; it
converges to the (wrong) local maximum, matching the right edge of the model
with the left edge of the square in the image. For $\Delta x_0 = 5$ and
$\Delta x_0 = 15$, $p_r$ and $p_d$ behave quite similarly. However, for
$\Delta x_0 = 80$, $p_d$ converges significantly faster, since it has the
global view at any time. In contrast, $p_r$ has to approach the goal slowly to
reach the area in which it can converge fast.

Fig. 4. Comparison of convergence for $\Delta x_0 = 5$ (prediction error over
iteration for the gradient, region and distance cue)

Fig. 5. Comparison of convergence for $\Delta x_0 = 15$ (prediction error over
iteration for the gradient, region and distance cue)

In conclusion, whenever it is possible to determine a discrete point directly,
it is the best choice to use the likelihood function $p_d$ rather than $p_r$.
While it is not possible to track successfully without the edge cue,
especially when scaling has to be taken into account, it is also not possible
to rely on the edge cue only. The higher the dimensionality of the search
space, the more drastic the lack of a sufficient number of particles becomes.
Thus, in the case of human motion capture with 17 or more dimensions, the
configurations will never perfectly match the image observations. Note that
the simulated experiment examined a static case. In the dynamic case, the
robustness of the tracker is always related to the frame rate at which images
are captured and processed, and to the speed of the subject's movements. In
the next section, we show how the likelihood function $p_d$ is incorporated in
our system in 3d, leading to a significant implicit reduction of the search
space.
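
The behavior of the distance cue in the one-dimensional experiment can be
reproduced in spirit with the small simulation below. It is a sketch under
several assumptions that the text does not fix: the amplification factor is
interpreted as the standard deviation of the Gaussian noise, the estimate is
the weighted mean of the particles, all particles start at the same position,
and the goal position 160 in the usage lines is a hypothetical value.
Likelihoods are normalized in the log domain, because with a value of 0.1 for
the distance cue parameter the plain exponential underflows for particles far
from the goal.

```python
import math
import random


def log_distance_cue(s, c, sigma_d):
    """Logarithm of Equation (5) for the 1d case f(s) = s, up to a constant."""
    return -((s - c) ** 2) / (2.0 * sigma_d ** 2)


def track_static_point(c, start, n=15, sigma_d=0.1, omega=3.0,
                       iterations=100, seed=0):
    """Run a 1d particle filter that uses only the distance cue.

    Returns the list of weighted-mean estimates, one per iteration.
    """
    rng = random.Random(seed)
    particles = [start] * n
    weights = [1.0] * n
    estimates = []
    for _ in range(iterations):
        # Resample proportionally to the weights, then apply Gaussian noise only.
        particles = rng.choices(particles, weights=weights, k=n)
        particles = [s + rng.gauss(0.0, omega) for s in particles]
        # Normalize in the log domain to avoid underflow far away from c.
        logs = [log_distance_cue(s, c, sigma_d) for s in particles]
        top = max(logs)
        weights = [math.exp(value - top) for value in logs]
        total = sum(weights)
        estimates.append(sum(w * s for w, s in zip(weights, particles)) / total)
    return estimates


if __name__ == "__main__":
    # Hypothetical goal position; initial distance of 80 as in Figure 6.
    errors = [abs(e - 160.0) for e in track_static_point(c=160.0, start=80.0)]
    print(errors[0], errors[-1])
```

Because the distance cue always favors the particle closest to the goal, the
estimate moves toward the goal in every iteration regardless of the initial
distance, which is the "global view" property discussed above.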

3.3 Using Stereo Information

There are various ways to use stereo information in a vision system. One
possibility is to calculate depth maps; however, the quality of depth maps is
in general not sufficient, and only rather rough information can be derived
from them. Another option in a particle filter framework is to project the
model into both the left and the right image, evaluate the likelihood function
for both images, and multiply the resulting likelihoods, as already mentioned
in Section 2.3. This approach can be described as implicit stereo. A third
alternative is to determine correspondences for specific features in the image
pair and calculate the 3d position for each match explicitly by triangulation.

Fig. 6. Comparison of convergence for $\Delta x_0 = 80$ (prediction error over
iteration for the gradient, region and distance cue)

In the proposed system, we use both implicit stereo and stereo triangu-
lation. As features we use the hands and the head, which are segmented by
color and matched in a preprocessing step. Thus, the hands and the head
can be understood as three natural markers. The image processing line for
determining the positions of the hands and the head in the input image is
described in Section 4.
In principle, there are two alternatives for using the likelihood function
$p_d$ together with skin color blobs: apply $p_d$ in 2d for each image
separately and let the 3d position be calculated implicitly by the particle
filter, or apply $p_d$ in 3d to the triangulated 3d positions of the matched
skin color blobs. We have found that the first approach does not lead to a
robust acquisition of 3d information. This is not surprising, since in a
high-dimensional space the mismatch between the number of particles used and
the size of the search space is more drastic. Together with the fact that in
Figure 4 the prediction result of the likelihood function $p_d$ is noisy
within an area of 1-2 pixels even in a very simple experiment, this leads to a
considerable error of the implicit stereo calculation in the real scenario.
The error of stereo triangulation grows quadratically with the distance from
the camera. In order to observe the complete upper body of a human, the
subject has to be located at a distance of at least 2-3 meters from the camera
head. Thus, a potential error of two or more pixels in each camera image can
lead to a significant error of the triangulation result. For this reason, in
the proposed system, we apply $p_d$ in 3d to the triangulation result of
matched skin color blobs. By doing this, the particle filter is forced to
always move the peak of the probability distribution toward configurations in
which the positions of the hands and the head from the model are very close to
the real 3d positions, which have been determined on the basis of the image
observations.
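
The quadratic growth of the triangulation error can be made concrete with the
standard depth-from-disparity relation for an ideal rectified stereo pair.
This is a simplified model assumed here for illustration; the system itself
triangulates with the full stereo calibration. The focal length of 400 pixels
in the usage lines is a hypothetical value, while the baseline of 0.11 m
corresponds to the camera distance reported in Section 6.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth Z = f * B / d for an ideal rectified stereo pair (assumed model)."""
    return focal_px * baseline_m / disparity_px


def depth_error(focal_px, baseline_m, depth_m, disparity_error_px):
    """Depth error caused by underestimating the disparity by a given amount."""
    true_disparity = focal_px * baseline_m / depth_m
    measured = depth_from_disparity(focal_px, baseline_m,
                                    true_disparity - disparity_error_px)
    return measured - depth_m


if __name__ == "__main__":
    for depth in (1.5, 3.0):
        # Hypothetical focal length of 400 px, baseline 0.11 m, 1 px error.
        print(depth, depth_error(400.0, 0.11, depth, 1.0))
```

Under these assumptions, a disparity error of one pixel corresponds to roughly
5 cm at 1.5 m but roughly 22 cm at 3 m, i.e. doubling the distance increases
the error by about a factor of four.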

3.4 Final Likelihood Function

In the final likelihood function, we use two different components: the edge
cue based on the likelihood function $p_g$, and the distance cue based on the
likelihood function $p_d$, as explained in Sections 3.2 and 3.3. We have found
that when leaving out the square in Equation (2), i.e. calculating the Sum of
Absolute Differences (SAD) instead of the Sum of Squared Differences (SSD),
the quality of the results remains the same for our application. In this
special case one can optimize $p_g$ further, resulting in:

\[
p'_g(z|s) \propto \exp\left\{ -\frac{1}{2\sigma_g^2} \left( 1 - \frac{1}{M_g} \sum_{m=1}^{M_g} g_m \right) \right\} \tag{6}
\]

For a system capable of real-time application, we have decided to replace the
region-based cue based on $p_r$ completely by the distance cue based on $p_d$.
As our experimental results show, and as expected from the studies in Section
3.2, a relatively small set of particles is then sufficient for a successful
system. The distance cue drags the peak of the distribution into a subspace in
which the hands and the head are located at the true positions. Thus, the
search space is reduced implicitly, practically leaving the choice within this
subspace to the cooperating gradient cue, based on the likelihood function
$p'_g$. In order to formulate the distance cue, first the function
$d_i(s, c)$ is defined as:

\[
d_i(s, c) :=
\begin{cases}
|f_i(s) - c|^2 & : c \neq 0 \\
0 & : \text{otherwise}
\end{cases} \tag{7}
\]

where $n := \dim(s)$ is the number of DOF of the human model, $\dim(c) = 3$,
and $i \in \{1, 2, 3\}$ indicates the function for the left hand, right hand
or head. The transformation $f_i: \mathbb{R}^n \to \mathbb{R}^3$ transforms
the $n$-dimensional configuration of the human model into the 3d position of
the left hand, right hand or head, respectively, using the forward kinematics
of the human model. Furthermore:

\[
g(c) :=
\begin{cases}
1 & : c \neq 0 \\
0 & : \text{otherwise}
\end{cases} \tag{8}
\]

The likelihood function for the distance cue is then formulated as:

\[
p'_d(z|s) \propto \exp\left\{ -\frac{1}{2\sigma_d^2} \cdot \frac{d_1(s, c_1) + d_2(s, c_2) + d_3(s, c_3)}{g(c_1) + g(c_2) + g(c_3)} \right\} \tag{9}
\]

where the vectors $c_i$ are computed on the basis of the image observations
$z$ using skin color segmentation and stereo triangulation, as explained in
Section 3.3. If the position of a hand or the head cannot be determined
because of occlusions or any other disturbance, the corresponding vector $c_i$
is set to the zero vector. Note that this does not falsify the resulting
probability distribution in any way. Since all likelihoods of a generation $k$
are independent of the likelihoods calculated for any previous generation, the
distribution for each generation is also independent. Thus, it does not make
any difference that one $c_i$ was present in the last image pair and is
missing in the next. The final likelihood function is the product of $p'_g$
and $p'_d$:

\[
p(z|s) \propto \exp\left\{ -\frac{1}{2} \left[ \frac{1}{\sigma_d^2} \cdot \frac{\sum_{i=1}^{3} d_i(s, c_i)}{\sum_{i=1}^{3} g(c_i)} + \frac{1}{\sigma_g^2} \left( 1 - \frac{1}{M_g} \sum_{m=1}^{M_g} g_m \right) \right] \right\} \tag{10}
\]
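
A compact way to read Equation (10) is as the sum of a distance term and a
gradient term inside one exponential. The sketch below computes the logarithm
of the final likelihood under some assumptions made for illustration: a
missing observation is represented by `None` instead of the zero vector, the
model positions of the hands and the head are passed in precomputed rather
than derived from forward kinematics, and the distance term is set to zero
when no feature at all is observed (a case the equations leave undefined).

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def final_log_likelihood(model_points, observed_points, gradient_values,
                         sigma_d, sigma_g):
    """Logarithm of Equation (10), up to an additive constant.

    model_points    -- f_i(s): 3d positions of left hand, right hand and head
    observed_points -- c_i: triangulated 3d positions, or None if not observed
    gradient_values -- g_m: remapped gradient values along the projected edges
    """
    visible = [(f, c) for f, c in zip(model_points, observed_points)
               if c is not None]
    distance_term = 0.0
    if visible:
        # Equations (7)-(9): mean squared distance over the observed features.
        mean_sq = sum(squared_distance(f, c) for f, c in visible) / len(visible)
        distance_term = mean_sq / sigma_d ** 2
    # Equation (6): one minus the mean gradient value along the edges.
    mean_gradient = sum(gradient_values) / len(gradient_values)
    gradient_term = (1.0 - mean_gradient) / sigma_g ** 2
    return -0.5 * (distance_term + gradient_term)
```

Returning the logarithm keeps the two cues additive, so their relative
influence can be inspected directly through the two weighting parameters.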

4 Image Processing Line


The image processing line is a pipeline that transforms the input image pair
into a skin color map and a gradient map, which are then used by the
likelihood function presented in Section 3.4. Figure 7 shows the image
processing line for one image; in the system the pipeline is applied twice:
once for the left and once for the right input image. After the input images
are smoothed with a 3 × 3 Gaussian kernel, the HSI image is computed. The HSI
image is then filtered twice, once for skin color segmentation and once for
foreground segmentation by segmenting the shirt color. A simple 1 × 2 gradient
operator is applied to the segmented foreground image, which is sufficient and
the most efficient for a binarized image. Finally, a gradient pixel map is
generated by applying a 3 × 3 or 5 × 5 Gaussian kernel, as done in [6].

Fig. 7. Visualization of the image processing line (Smooth, HSI, then Shirt
Segmentation, Gradients and Gradient Pixel Map on one branch, and Skin Color
Segmentation and Eroded Skin Color Map on the other)

Currently, the hands and the head are segmented using a fixed interval color
model in HSI color space. The resulting color blobs are matched, taking into
account their size, the ratio between the height and width of the bounding
box, and the epipolar geometry. By doing this, false regions in the background
can be discarded easily. Finally, the centroids of matched regions are
triangulated using the parameters of the calibrated stereo setup. As will be
discussed in Section 7, we are currently working on implementing a more
sophisticated hand and head tracking system, which can deal with occlusions of
skin-colored regions.
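
The blob matching step can be sketched as a set of plausibility checks
followed by a greedy assignment. Everything specific in this sketch is an
assumption: the `Blob` structure, the threshold values, the greedy strategy,
and in particular the simplification of the epipolar constraint to "nearly the
same image row with positive disparity", which only holds for rectified
images.

```python
from dataclasses import dataclass


@dataclass
class Blob:
    cx: float      # centroid x in pixels
    cy: float      # centroid y in pixels
    width: float   # bounding box width in pixels
    height: float  # bounding box height in pixels
    area: float    # number of pixels in the blob


def is_plausible_match(left, right, max_row_diff=3.0,
                       max_area_ratio=1.5, max_aspect_ratio=1.5):
    """Check size, bounding box shape and a simplified epipolar constraint."""
    # Simplified epipolar constraint (rectified images assumed).
    if abs(left.cy - right.cy) > max_row_diff:
        return False
    # Similar size.
    if max(left.area, right.area) / min(left.area, right.area) > max_area_ratio:
        return False
    # Similar ratio between height and width of the bounding box.
    aspect_left = left.height / left.width
    aspect_right = right.height / right.width
    ratio = max(aspect_left, aspect_right) / min(aspect_left, aspect_right)
    if ratio > max_aspect_ratio:
        return False
    # Points in front of the cameras have positive disparity.
    return left.cx > right.cx


def match_blobs(left_blobs, right_blobs):
    """Greedily pair each left blob with the closest plausible right blob."""
    matches = []
    used = set()
    for left in left_blobs:
        candidates = [(abs(left.cy - right.cy), j)
                      for j, right in enumerate(right_blobs)
                      if j not in used and is_plausible_match(left, right)]
        if candidates:
            _, j = min(candidates)
            used.add(j)
            matches.append((left, right_blobs[j]))
    return matches
```

Blobs without a plausible partner, such as skin-colored regions in the
background that appear in only one image, simply remain unmatched and are
discarded.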

5 Integrating Vision Toolkit


The complete system has been implemented making extensive use of the
Integrating Vision Toolkit (IVT) [2]. With the IVT, the complete image
processing line presented in Section 4 could be implemented in less than 50
lines of code. The IVT provides a clean interface to capture devices of any
kind, as well as a convenient application for stereo camera calibration based
on OpenCV. For implementing Graphical User Interfaces, Qt is integrated
optionally, as is the OpenCV library for image processing routines which are
not yet available in the IVT. The library is implemented in an easy-to-use
software architecture, hiding all dependencies behind clean interfaces. The
IVT fully supports the operating systems Linux, Mac OS and Windows. The
project is available on SourceForge; the link is included in the References.

6 Experimental Results
The experiments presented in this section were performed on the humanoid
robot ARMAR. In the robot head, two Dragonfly cameras are positioned at a
distance of approximately eleven centimeters. As input for the image
processing line, we used a resolution of 320 × 240, captured at a frame rate
of 25 Hz. The particle filter was run with a set of $N = 1000$ particles. The
computation times for one image pair, processed on a 3 GHz CPU, are listed in
Table 1. As one can see, the processing rate of the system is 15 Hz, which is
not yet real-time for an image sequence captured at 25 Hz, but very close. Of
course, if the subject moves more slowly, a processing rate of 15 Hz is
sufficient. The relationship between the speed of the movements to be tracked
and the frame rate at which the images are captured (and, for real-time
application, processed) is briefly discussed in Section 7. Figure 8 shows six
screenshots illustrating how the system automatically initializes itself. No
configuration is given to the system; it autonomously finds the only possible
configuration matching the observations. Figure 9 shows four screenshots of
the same video sequence, showing the performance of the human motion capture
system tracking a punch with the left hand. The corresponding video and videos
of other sequences can be downloaded from
http://i61www.ira.uka.de/users/azad/videos.

Step                                       Time [ms]
Image Processing Line                          14
1000 Forward Kinematics and Projection         23
1000 Evaluations of Likelihood Function        29
Total                                          66

Table 1. Processing times with N = 1000 particles on a 3 GHz CPU

Fig. 8. Screenshots showing automatic initialization

Fig. 9. Screenshots showing tracking performance

7 Discussion
We have presented an image-based markerless human motion capture system
for real-time application. The system is capable of computing very smooth and
accurate trajectories in configuration space for such a system. We presented
our strategy of multi-cue fusion within the particle filter, and showed the
results of studies examining the properties of the cues commonly used and
a further distance cue. We showed that by using this distance cue combined
with stereo vision, which has not yet been used in markerless human motion
capture, we could reduce the size of the search space implicitly. This reduction
of search space allows us to capture human motion with a particle filter using
as few as 1000 particles with a processing rate of 15 Hz on a 3 GHz CPU. We
plan to investigate and implement several improvements of the system:
• Currently, the subsystem for detection of the hands and the head in the
images is not powerful enough to deal with occlusions of skin-colored regions
in the image. To overcome this problem, we are working on implementing
a more sophisticated hand and head tracking system, as
presented by Argyros and Lourakis [1]. With it, we expect the system to be
able to robustly track long and complicated sequences, since it will no
longer be necessary to avoid occlusions between the hands and the head.
• For any kind of tracking, the effective size of the search space increases
exponentially with the potential speed of the movements and, conversely,
decreases exponentially with the frame rate at which images are captured.
For this reason, the human motion capture systems with the most convincing
results use a frame rate of 60 Hz or higher, as done in [6]. Commercial
marker-based tracking systems use frame rates from 100 Hz up to 400 Hz
and higher to acquire smooth trajectories. We therefore want to
perform further tests with the new Dragonfly2 camera, which is capable
of providing the same image data as the Dragonfly camera, but at a frame
rate of 60 Hz instead of 30 Hz.
• In the theory of particle filters, several methods exist that reduce
the number of particles effectively needed by modifying the standard
filtering algorithm. To this end, we want to investigate the work on
Partitioned Sampling [12] and Annealed Particle Filtering [6].
• We plan to extend the human model by incorporating the legs and feet.
For this purpose in particular, we want to exploit the
benefits of an active head, since with a static head it is hardly possible to
keep the complete human in the robot's field of view at any one time
step.
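
As a rough illustration of the frame-rate argument in the second item above, the sketch below uses a simple box model that is an assumption made here, not the authors' analysis: each joint angle is taken to change by at most max_speed / frame_rate in either direction between two frames, so the region to be searched is a box whose volume is that per-joint range raised to the number of degrees of freedom. In this model the volume grows as a power of the speed-to-frame-rate ratio, with the number of degrees of freedom as exponent. The values for DOF and MAX_SPEED are hypothetical.

```python
def search_volume(max_speed: float, frame_rate: float, dof: int) -> float:
    """Volume of the box of configurations reachable within one frame.

    Assumes (hypothetically) that every joint angle can change by at most
    max_speed / frame_rate in either direction between two frames.
    """
    per_joint_range = 2.0 * max_speed / frame_rate
    return per_joint_range ** dof


DOF = 10         # hypothetical number of degrees of freedom
MAX_SPEED = 3.0  # hypothetical maximum joint speed in rad/s

# Compare the search volume at 30 Hz with the one at 60 Hz.
ratio = search_volume(MAX_SPEED, 30.0, DOF) / search_volume(MAX_SPEED, 60.0, DOF)
print(f"search volume at 30 Hz is about {ratio:.0f} times that at 60 Hz")
```

Under these assumptions, doubling the frame rate halves the per-joint range and therefore shrinks the search volume by a factor of 2 to the power of DOF, which gives an intuition for why a 60 Hz camera is attractive for this kind of tracking.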
To the best of our knowledge, the proposed system is the first purely image-based
markerless human motion capture system designed for a robot head that
can track human movements with such accuracy and smoothness while being
suitable for real-time application. The system does not assume
a static camera in any way; future work will also concentrate on
experiments that exploit this ability to capture human motion while
actively tracking the subject.

References
1. A. A. Argyros and M. I. A. Lourakis. Real-time tracking of multiple skin-colored
objects with a possibly moving camera. In European Conference on Computer
Vision (ECCV), volume 3, pages 368–379, Prague, Czech Republic, 2004.
2. P. Azad. Integrating Vision Toolkit. http://ivt.sourceforge.net.
3. P. Azad, A. Ude, R. Dillmann, and G. Cheng. A full body human motion capture
system using particle filtering and on-the-fly edge detection. In International
Conference on Humanoid Robots (Humanoids), Santa Monica, USA, 2004.
4. A. Blake and M. Isard. Active Contours. Springer, 1998.
5. F. Caillette and T. Howard. Real-time markerless human body tracking with
multi-view 3-d voxel reconstruction. In British Machine Vision Conference,
volume 2, pages 597–606, Kingston, UK, 2004.
6. J. Deutscher, A. Blake, and I. Reid. Articulated body motion capture by an-
nealed particle filtering. In Computer Vision and Pattern Recognition (CVPR),
pages 2126–2133, Hilton Head, USA, 2000.
7. J. Deutscher, A. Davison, and I. Reid. Automatic partitioning of high dimen-
sional search spaces associated with articulated body motion capture. In Com-
puter Vision and Pattern Recognition (CVPR), pages 669–676, Kauai, USA,
2001.
8. D. Gavrila and L. Davis. 3-d model-based tracking of humans in action: a multi-
view approach. In International Conference on Computer Vision and Pattern
Recognition (CVPR), pages 73–80, San Francisco, USA, 1996.
9. M. Isard and A. Blake. Condensation - conditional density propagation for
visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998.
10. S. Knoop, S. Vacek, and R. Dillmann. Modeling joint constraints for an articu-
lated 3d human body model with artificial correspondences in icp. In Interna-
tional Conference on Humanoid Robots (Humanoids), Tsukuba, Japan, 2005.
11. J. MacCormick. Probabilistic models and stochastic algorithms for visual track-
ing. PhD thesis, University of Oxford, UK, 2000.
12. J. MacCormick and M. Isard. Partitioned sampling, articulated objects, and
interface-quality hand tracking. In European Conference on Computer Vision
(ECCV), pages 3–19, Dublin, Ireland, 2000.
13. I. Mikic, M. Trivedi, E. Hunter, and P. Cosman. Human body model acquisi-
tion and tracking using voxel data. International Journal of Computer Vision,
53(3):199–223, 2003.
14. K. Rohr. Human movement analysis based on explicit motion models. Motion-
Based Recognition, pages 171–198, 1997.
15. H. Sidenbladh. Probabilistic Tracking and Reconstruction of 3D Human Mo-
tion in Monocular Video Sequences. PhD thesis, Royal Institute of Technology,
Stockholm, Sweden, 2001.
16. K. Wong and M. Spetsakis. Motion segmentation and tracking. In International
Conference on Vision Interface, pages 80–87, Calgary, Canada, 2002.
