Image-Based Markerless 3D Human Motion Capture Using Multiple Cues
Pedram Azad1, Ales Ude2, Tamim Asfour1, Gordon Cheng3, and Ruediger Dillmann1

1 Institute for Computer Science and Engineering, University of Karlsruhe, Germany
  azad|asfour|dillmann@ira.uka.de
2 Jozef Stefan Institute, Ljubljana, Slovenia
  ales.ude@ijs.si
3 Computational Neuroscience Laboratories, ATR, Kyoto, Japan
  gordon@atr.jp
1 Introduction
the likelihood that the configuration leading to the set of projected edges is the proper configuration, i.e. the one matching the gradient image best. The basic likelihood function commonly used is formulated as:

p_g(z|s) \propto \exp\left\{-\frac{1}{2\sigma_g^2 M_g}\sum_{m=1}^{M_g}\left(1-g_m\right)^2\right\} \qquad (2)

where gm denotes the remapped gradient value for the m-th point.
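For concreteness, here is a minimal sketch of how such an edge cue could be evaluated for one particle. The function name, the array layout, and the assumption that the gradient map has already been remapped to [0, 1] are our own illustration, not the authors' implementation:

```python
import numpy as np

def edge_cue_likelihood(projected_points, gradient_map, sigma_g=0.1):
    """Edge-based likelihood p_g(z|s) of Equation (2).

    projected_points: (M_g, 2) integer array of (x, y) pixel coordinates
                      of the model edges projected under configuration s
    gradient_map:     2D array of gradient magnitudes remapped to [0, 1]
    """
    g = gradient_map[projected_points[:, 1], projected_points[:, 0]]
    ssd = np.mean((1.0 - g) ** 2)          # (1/M_g) * sum_m (1 - g_m)^2
    return np.exp(-ssd / (2.0 * sigma_g ** 2))
```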
The second cue commonly used is region-based, for which a foreground segmentation technique has to be applied. The choice of segmentation algorithm is independent of the likelihood function itself. The most common approach is background subtraction. However, this segmentation method assumes a static camera setup and is therefore not suitable for application on a potentially moving robot head. Another option is to segment motion by using difference images or optical flow; both methods also assume a static camera setup. It has to be mentioned that there are extensions of the basic optical flow algorithm that allow distinguishing real motion in the scene from ego-motion [16]. However, the problem with all motion-based methods (background subtraction is not among them) is that the quality of the segmentation result is not sufficient for a region-based cue: only those parts of the image that contain edges or some other kind of texture can be segmented, and the silhouette of segmented moving objects often contains parts of the background, resulting in a relatively blurred segmentation result.
Having segmented the foreground in the input image, where foreground
pixels are set to 1 and background pixels are set to 0, the likelihood function
commonly used can be formulated as [6]:
p_r(z|s) \propto \exp\left\{-\frac{1}{2\sigma_r^2 M_r}\sum_{m=1}^{M_r}\left(1-r_m\right)^2\right\} \qquad (3)
where rm denotes the segmentation value of the m-th pixel from the set of pixels of all projected body part regions. Although this function can be optimized further using the fact that rm ∈ {0, 1}, its computation is still rather inefficient: the bottleneck is computing the set of all Mr projected pixels together with reading the corresponding values from the segmentation map.
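The following sketch illustrates both the computation and the bottleneck just described. The function name and the mask-based rasterization are illustrative assumptions, not the authors' code:

```python
import numpy as np

def region_cue_likelihood(region_masks, segmentation, sigma_r=0.1):
    """Region-based likelihood p_r(z|s) of Equation (3).

    region_masks: list of boolean images, one mask per projected body
                  part region (rasterizing these masks and looking up
                  every pixel is the bottleneck described above)
    segmentation: binary foreground map (1 foreground, 0 background)
    """
    union = np.logical_or.reduce(region_masks)    # all M_r pixels
    r = segmentation[union]                       # lookup of r_m values
    # Since r_m is 0 or 1, (1 - r_m)^2 == 1 - r_m: the mean is simply
    # the fraction of projected pixels falling on background.
    cost = np.mean(1.0 - r)
    return np.exp(-cost / (2.0 * sigma_r ** 2))
```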
The two cues introduced above are fused by simply multiplying the two likelihood functions, resulting in:

p_{g,r}(z|s) \propto \exp\left\{-\frac{1}{2}\left(\frac{\sum_{m=1}^{M_g}(1-g_m)^2}{\sigma_g^2 M_g}+\frac{\sum_{m=1}^{M_r}(1-r_m)^2}{\sigma_r^2 M_r}\right)\right\} \qquad (4)

Any other cue can be fused within the particle filter by the same rule.
One way of combining the information provided by multiple cameras is to
incorporate the likelihoods for each image in the exact same manner [6]. In our
system, we additionally use 3d information, which can be computed explicitly given the stereo calibration. This separate cue is then combined with the other likelihoods by the same method, as will be described in Section 3.
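As a sketch of this fusion rule: multiplying likelihoods is a sum in the log domain, which also applies across camera views. The normalization step is our own addition for numerical stability, not part of the paper:

```python
import numpy as np

def fuse_particle_weights(log_cues):
    """Fuse cues (and camera views) by multiplying likelihoods, i.e.
    summing their exponents as in Equation (4), then normalize to
    particle weights.

    log_cues: (num_cues, num_particles) array; each row is the log
              likelihood of one cue, e.g. the edge and region cues
              evaluated on the left and right camera image.
    """
    log_p = np.sum(log_cues, axis=0)       # product of the likelihoods
    log_p -= np.max(log_p)                 # for numerical stability
    w = np.exp(log_p)
    return w / np.sum(w)
```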
Looking more closely at the region-based likelihood function pr, one can identify two separate roles:
• Leading to a faster convergence of the particle filter
• Compensating for the failure of the edge-based likelihood function in cluttered backgrounds
The first property is discussed in detail in Section 3.2, and an efficient alterna-
tive is presented. The second property can be implemented explicitly by using
the result of foreground segmentation directly to generate a filtered edge map,
containing only foreground edge pixels. In general, there are two possibilities:
• Filtering the gradient image by masking out background pixels with the
segmentation result
• Calculating gradients on the segmentation result
While the first alternative preserves more details in the image, the second
alternative computes a sharper silhouette. Furthermore, in the second case
gradient computation can be optimized for binarized input images, which
is why we currently use this approach. As explained in Section 2.2, the only
commonly used foreground segmentation technique is background subtraction,
which we cannot use, since the robot head can potentially move. It has to be mentioned that accounting for a moving robot head is not a burden: using an active head has several benefits, which will be discussed in Section 7. As an alternative to background subtraction, we use a solid-colored shirt, which allows us to perform tests practically anywhere in our lab. Since foreground segmentation is performed in almost any markerless human motion capture system, we do not restrict ourselves compared to other approaches; we merely trade the need for a completely static setup for the restriction of wearing a colored shirt. We want to
point out that the proposed generation of a filtered edge map does not depend
on the segmentation technique.
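A minimal sketch of the second alternative, computing a silhouette edge map directly from the binary segmentation result. The morphological border test below is one possible optimization for binarized input, not necessarily the authors' implementation:

```python
import numpy as np

def filtered_edge_map(segmentation):
    """Edge map from a binary foreground segmentation: a pixel is an
    edge pixel iff it is foreground and has at least one background
    4-neighbour. For binary input this replaces full gradient
    computation and yields a sharp silhouette.
    """
    s = segmentation.astype(bool)
    interior = s.copy()
    # Erode by one pixel using shifted copies (4-neighbourhood).
    interior[1:, :] &= s[:-1, :]
    interior[:-1, :] &= s[1:, :]
    interior[:, 1:] &= s[:, :-1]
    interior[:, :-1] &= s[:, 1:]
    return s & ~interior                   # silhouette pixels
```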
In order to understand the benefits and drawbacks of each likelihood function, and thus to get a feeling for what a likelihood function can and cannot do, it is helpful to look at the corresponding probability distributions in a simple one-dimensional example. The experiment we use in simulation is tracking a square of fixed size in 2d, which can be further simplified to tracking the intersection of a square with a straight line along that line, i.e. in one dimension. The model of the square to be tracked is defined by the midpoint (x, y) and the edge length k, where y and k are constant and x is the one-dimensional configuration to be predicted. In the following, we compare three different likelihood functions separately: the gradient-based cue pg, the region-based cue pr, and a third cue pd, which is based on the Euclidean distance:
p_d(z|s) \propto \exp\left\{-\frac{1}{2\sigma_d^2}\,|f(s)-c|^2\right\} \qquad (5)
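The following toy models illustrate the qualitative shapes of the three cues in this one-dimensional experiment. All functional forms and parameters are illustrative assumptions chosen to reproduce the behavior discussed below, not the paper's exact simulation:

```python
import numpy as np

def toy_cues(x, x_true=160.0, k=40.0):
    """Illustrative 1D cue models for tracking the square's midpoint x.
    x_true is the true midpoint; the edge length k is fixed."""
    dx = x - x_true
    # Distance cue (Equation 5): f(s) = x, c = x_true; a smooth,
    # global basin of attraction over the whole search range.
    p_d = np.exp(-dx ** 2 / (2 * 20.0 ** 2))
    # Region cue: overlap of the hypothesized interval [x-k/2, x+k/2]
    # with the true one; informative only while the intervals overlap.
    overlap = max(0.0, k - abs(dx)) / k
    p_r = np.exp(-(1.0 - overlap) ** 2 / (2 * 0.3 ** 2))
    # Edge cue: responds only when a hypothesized edge lands close to
    # a true edge; narrow peaks, flat (uninformative) elsewhere.
    d_edge = min(abs(dx), abs(abs(dx) - k))
    p_g = np.exp(-d_edge ** 2 / (2 * 2.0 ** 2))
    return p_g, p_r, p_d
```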
[Figure: probability distributions of the gradient cue, region cue, and distance cue over the object position x.]

[Figure: number of iterations needed for convergence as a function of the initial distance ∆x0, for each cue.]
edge of the square in the image. For ∆x0 = 5 and ∆x0 = 15, pr and pd behave quite similarly. However, for ∆x0 = 80, pd converges significantly faster, since it retains a global view at all times; pr, in contrast, has to approach the goal slowly before reaching the area in which it can converge fast. One can conclude that whenever a discrete point can be determined directly, the likelihood function pd is a better choice than pr. While it is not possible to track successfully without the edge cue (especially when scaling has to be taken into account), it is also not possible to rely on the edge cue alone. The higher the dimensionality of the search space, the more drastic the lack of a sufficient number of particles becomes. Thus, in the case of
[Two figures: prediction error over the iterations for the gradient cue, region cue, and distance cue.]
[Figure: prediction error over the iterations for the gradient cue, region cue, and distance cue in a longer run.]

There are various ways to use stereo information in a vision system. One possibility is to calculate depth maps; however, the quality of depth maps is in general not sufficient, and only rather rough information can be derived from them. Another option in a particle filter framework is to project the model
into both the left and the right image, evaluate the likelihood function for both images, and multiply the resulting likelihoods, as already mentioned in Section 2.3. This approach can be described as implicit stereo. A third alternative is to determine correspondences for specific features in the image pair and to calculate the 3d position of each match explicitly by triangulation.
In the proposed system, we use both implicit stereo and stereo triangu-
lation. As features we use the hands and the head, which are segmented by
color and matched in a preprocessing step. Thus, the hands and the head
can be understood as three natural markers. The image processing line for
determining the positions of the hands and the head in the input image is
described in Section 4.
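The paper does not spell out the triangulation method. A standard choice, sketched here under that assumption, is linear (DLT) triangulation of the matched centroids using the calibrated projection matrices:

```python
import numpy as np

def triangulate(P_left, P_right, u_left, u_right):
    """Linear (DLT) triangulation of one matched blob centroid.

    P_left, P_right: 3x4 projection matrices of the calibrated stereo pair
    u_left, u_right: (x, y) pixel centroids of the matched color blobs
    Returns the 3D point in the world frame.
    """
    A = np.vstack([
        u_left[0]  * P_left[2]  - P_left[0],
        u_left[1]  * P_left[2]  - P_left[1],
        u_right[0] * P_right[2] - P_right[0],
        u_right[1] * P_right[2] - P_right[1],
    ])
    # Solve A X = 0 for the homogeneous 3D point via SVD.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```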
In principle, there are two alternatives for using the likelihood function pd together with skin color blobs: apply pd in 2d for each image separately and let the 3d position be computed implicitly by the particle filter, or apply pd in 3d to the triangulated 3d positions of the matched skin color blobs. We have found that the first approach does not lead to a robust acquisition of 3d information. This is not surprising: in a high-dimensional space, the mismatch between the number of particles used and the size of the search space is more drastic. Together with the fact that in Figure 4 the prediction result of the likelihood function pd is noisy within an area of 1-2 pixels even in a very simple experiment, this leads to a considerable error in the implicit stereo calculation in the real scenario. Moreover, the accuracy of stereo triangulation decreases quadratically with the distance from the camera. In
order to observe the complete upper body of a human, the subject has to
be located at a distance of at least 2-3 meters from the camera head. Thus,
a potential error of two or more pixels in each camera image can lead to a
significant error of the triangulation result. For this reason, in the proposed
system, we apply pd in 3d to the triangulation result of matched skin color
blobs. By doing this, the particle filter is forced to always move the peak of
In the final likelihood function, we use two different components: the edge cue based on the likelihood function pg, and the distance cue based on the likelihood function pd, as explained in Sections 3.2 and 3.3. We have found that when leaving out the square in Equation (2), i.e. calculating the Sum of Absolute Differences (SAD) instead of the Sum of Squared Differences (SSD), the quality of the results remains the same for our application. In this special case, pg can be optimized further, resulting in:
p_g'(z|s) \propto \exp\left\{-\frac{1}{2\sigma_g^2 M_g}\sum_{m=1}^{M_g}(1-g_m)\right\} \qquad (6)
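A sketch of this optimized edge cue: since gm ∈ [0, 1], the absolute differences |1 − gm| reduce to (1 − gm), so only a plain sum over the sampled gradient values remains (names are illustrative):

```python
import numpy as np

def edge_cue_sad(projected_points, gradient_map, sigma_g=0.1):
    """Optimized edge cue p'_g of Equation (6), using SAD instead of
    SSD; only a sum of the sampled gradient values is needed."""
    g = gradient_map[projected_points[:, 1], projected_points[:, 0]]
    return np.exp(-np.mean(1.0 - g) / (2.0 * sigma_g ** 2))
```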
The likelihood function for the distance cue is then formulated as:

p_d'(z|s) \propto \exp\left\{-\frac{1}{2\sigma_d^2}\,\frac{d_1(s,c_1)+d_2(s,c_2)+d_3(s,c_3)}{g(c_1)+g(c_2)+g(c_3)}\right\} \qquad (9)

where the vectors ci are computed on the basis of the image observations z using skin color segmentation and stereo triangulation, as explained in Section 3.3. If
the position of a hand or the head cannot be determined because of occlusions or any other disturbance, the corresponding vector ci is set to the zero vector. Note that this does not distort the resulting probability distribution in any way: since all likelihoods of a generation k are independent of the likelihoods calculated for any previous generation, the distribution for each generation is also independent. Thus, it makes no difference that one ci was present in the last image pair and is missing in the next. The final likelihood function is the product of p'g and p'd:
p(z|s) \propto \exp\left\{-\frac{1}{2}\left(\frac{1}{\sigma_d^2}\,\frac{\sum_{i=1}^{3} d_i(s,c_i)}{\sum_{i=1}^{3} g(c_i)}+\frac{1}{\sigma_g^2 M_g}\sum_{m=1}^{M_g}(1-g_m)\right)\right\} \qquad (10)
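A sketch of the exponent of this final likelihood for one particle. We assume, as the zero-vector convention above suggests, that g(ci) acts as an indicator for a successfully triangulated blob; the definitions of di and g (presumably given in the omitted Equations (7) and (8)) are therefore assumptions here:

```python
import numpy as np

def final_log_likelihood(d, valid, g_mean, sigma_d=100.0, sigma_g=0.1):
    """Exponent of Equation (10) for one particle.

    d:      assumed squared distances d_i(s, c_i) between the model's
            hand/head positions under s and the triangulated blob
            positions c_i
    valid:  indicator g(c_i); False where a blob was not detected, so
            its (zero-vector) c_i contributes nothing
    g_mean: (1/M_g) * sum_m (1 - g_m) from the edge cue
    """
    d = np.asarray(d, dtype=float)
    valid = np.asarray(valid, dtype=bool)
    dist_term = d[valid].sum() / max(valid.sum(), 1) / sigma_d ** 2
    edge_term = g_mean / sigma_g ** 2
    return -0.5 * (dist_term + edge_term)
```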
[Figure: image processing line for localizing the hands and the head (smoothing, HSI color segmentation, gradient computation).]

The hands and the head are segmented using a fixed interval color model in HSI color
space. The resulting color blobs are matched, taking into account their size, the ratio between the height and width of the bounding box, and the epipolar geometry. By doing this, false regions in the background can be discarded easily. Finally, the centroids of matched regions are triangulated using the parameters of the calibrated stereo setup. As will be discussed in Section 7, we are currently working on a more sophisticated hand/head tracking system that can deal with occlusions of skin-colored regions.
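A rough sketch of such a segmentation and matching step. OpenCV's HSV segmentation and connected-components analysis stand in for the paper's fixed interval HSI model; all thresholds and names are illustrative assumptions:

```python
import cv2
import numpy as np

def skin_blobs(bgr, lower=(0, 60, 60), upper=(20, 255, 255), min_area=100):
    """Segment skin-colored blobs with a fixed-interval color model
    (HSV as a stand-in for HSI; bounds are illustrative).
    Returns (centroid, stats_row) per blob, largest first."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(lower), np.array(upper))
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    blobs = [(centroids[i], stats[i]) for i in range(1, n)
             if stats[i, cv2.CC_STAT_AREA] >= min_area]
    return sorted(blobs, key=lambda b: -b[1][cv2.CC_STAT_AREA])

def match_blobs(blobs_l, blobs_r, max_dy=5, max_ratio=1.5):
    """Match left/right blobs by area, bounding-box aspect ratio, and
    the epipolar constraint (rectified images: similar image row)."""
    def aspect(s):
        return s[cv2.CC_STAT_WIDTH] / max(s[cv2.CC_STAT_HEIGHT], 1)
    matches = []
    for c_l, s_l in blobs_l:
        for c_r, s_r in blobs_r:
            area = s_l[cv2.CC_STAT_AREA] / max(s_r[cv2.CC_STAT_AREA], 1)
            ok_area = 1 / max_ratio <= area <= max_ratio
            ok_shape = abs(aspect(s_l) - aspect(s_r)) < 0.5
            ok_epi = abs(c_l[1] - c_r[1]) <= max_dy
            if ok_area and ok_shape and ok_epi:
                matches.append((c_l, c_r))
                break
    return matches
```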
6 Experimental Results
The experiments presented in this section were performed on the humanoid robot ARMAR. In the robot head, two Dragonfly cameras are positioned at a baseline of approximately eleven centimeters. As input for the image processing line, we used a resolution of 320 × 240, captured at a frame rate of 25 Hz. The particle filter was run with a set of N = 1000 particles. The computation times for one image pair, processed on a 3 GHz CPU, are listed in Table 1. As one can see, the processing rate of the system is 15 Hz, which is not yet real-time for an image sequence captured at 25 Hz, but very close; for slower movements, a processing rate of 15 Hz is sufficient. The relationship between the speed of the movements to be tracked and the frame rate at which the images are captured (and, for real-time application, processed) is briefly discussed in Section 7. Figure 8 contains six screenshots showing how the system initializes itself automatically: the system is not given any initial configuration; it autonomously finds the only possible configuration matching the observations. Figure 9 shows four screenshots from the same video sequence, illustrating the performance of the human motion capture system tracking a punch with the left hand. The corresponding video and videos of other sequences can be downloaded from http://i61www.ira.uka.de/users/azad/videos.
                                              Time [ms]
  Image Processing Line                              14
  1000 × Forward Kinematics and Projection           23
  1000 × Evaluations of Likelihood Function          29
  Total                                              66

Table 1. Processing times with N = 1000 particles on a 3 GHz CPU
7 Discussion
We have presented an image-based markerless human motion capture system for real-time application. The system is capable of computing smooth and accurate trajectories in configuration space. We presented our strategy for multi-cue fusion within the particle filter, and showed the results of studies examining the properties of the commonly used cues and of an additional distance cue. We showed that by using this distance cue combined with stereo vision, which had not yet been used in markerless human motion capture, we could implicitly reduce the size of the search space. This reduction allows us to capture human motion with a particle filter using only 1000 particles.
References
1. A. A. Argyros and M. I. A. Lourakis. Real-time tracking of multiple skin-colored objects with a possibly moving camera. In European Conference on Computer Vision (ECCV), volume 3, pages 368–379, Prague, Czech Republic, 2004.
2. P. Azad. Integrating Vision Toolkit. http://ivt.sourceforge.net.
3. P. Azad, A. Ude, R. Dillmann, and G. Cheng. A full body human motion capture
system using particle filtering and on-the-fly edge detection. In International
Conference on Humanoid Robots (Humanoids), Santa Monica, USA, 2004.
4. A. Blake and M. Isard. Active Contours. Springer, 1998.
5. F. Caillette and T. Howard. Real-time markerless human body tracking with
multi-view 3-d voxel reconstruction. In British Machine Vision Conference,
volume 2, pages 597–606, Kingston, UK, 2004.
6. J. Deutscher, A. Blake, and I. Reid. Articulated body motion capture by an-
nealed particle filtering. In Computer Vision and Pattern Recognition (CVPR),
pages 2126–2133, Hilton Head, USA, 2000.
7. J. Deutscher, A. Davison, and I. Reid. Automatic partitioning of high dimen-
sional search spaces associated with articulated body motion capture. In Com-
puter Vision and Pattern Recognition (CVPR), pages 669–676, Kauai, USA,
2001.
8. D. Gavrila and L. Davis. 3-d model-based tracking of humans in action: a multi-view approach. In Computer Vision and Pattern Recognition (CVPR), pages 73–80, San Francisco, USA, 1996.
9. M. Isard and A. Blake. Condensation - conditional density propagation for
visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998.
10. S. Knoop, S. Vacek, and R. Dillmann. Modeling joint constraints for an articulated 3d human body model with artificial correspondences in ICP. In International Conference on Humanoid Robots (Humanoids), Tsukuba, Japan, 2005.
11. J. MacCormick. Probabilistic models and stochastic algorithms for visual track-
ing. PhD thesis, University of Oxford, UK, 2000.
12. J. MacCormick and M. Isard. Partitioned sampling, articulated objects, and interface-quality hand tracking. In European Conference on Computer Vision (ECCV), pages 3–19, Dublin, Ireland, 2000.
13. I. Mikic, M. Trivedi, E. Hunter, and P. Cosman. Human body model acquisi-
tion and tracking using voxel data. International Journal of Computer Vision,
53(3):199–223, 2003.
14. K. Rohr. Human movement analysis based on explicit motion models. Motion-
Based Recognition, pages 171–198, 1997.
15. H. Sidenbladh. Probabilistic Tracking and Reconstruction of 3D Human Mo-
tion in Monocular Video Sequences. PhD thesis, Royal Institute of Technology,
Stockholm, Sweden, 2001.
16. K. Wong and M. Spetsakis. Motion segmentation and tracking. In International
Conference on Vision Interface, pages 80–87, Calgary, Canada, 2002.