Robust Single-View Geometry and Motion Reconstruction
Hao Li∗ (ETH Zurich)    Bart Adams† (KU Leuven)    Leonidas J. Guibas‡ (Stanford University)    Mark Pauly§ (ETH Zurich)

∗Applied Geometry Group, E-mail: hao@inf.ethz.ch
†Computer Graphics Group, E-mail: bart.adams@cs.kuleuven.be
‡Geometric Computing Group, E-mail: guibas@cs.stanford.edu
§Applied Geometry Group, E-mail: pauly@inf.ethz.ch
Figure 1: Reconstruction of complex deforming objects from high-resolution depth scans. Our method accurately captures the global
topology and shape motion, as well as dynamic, small-scale details, such as wrinkles and folds.
Abstract
We present a framework and algorithms for robust geometry and
motion reconstruction of complex deforming shapes. Our method
makes use of a smooth template that provides a crude approximation of the scanned object and serves as a geometric and topological
prior for reconstruction. Large-scale motion of the acquired object
is recovered using a novel space-time adaptive, non-rigid registration method. Fine-scale details such as wrinkles and folds are synthesized with an efficient linear mesh deformation algorithm. Subsequent spatial and temporal filtering of detail coefficients allows
transfer of persistent geometric detail to regions not observed by
the scanner. We show how this two-scale process allows faithful recovery of small-scale shape and motion features, leading to a high-quality reconstruction. We illustrate the robustness and generality
of our algorithm on a variety of examples composed of different
materials and exhibiting a large range of dynamic deformations.
Keywords: animation reconstruction, non-rigid registration, partial scans, 3D scanning, geometry synthesis, template tracking
1 Introduction
Accurate digitization of complex real-world objects is one of the
central problems in visual computing. Commercial solutions for
rigid objects are widely available and can be considered a mature
technology. However, many of the assumptions of rigid scanning
methods are no longer valid in a dynamic setting where the acquired shape is in motion and deforms. High temporal and spatial
resolution is essential to faithfully recover the small-scale geometric detail that is often created as a result of the dynamic motion of
the scanned model. Recent advances in 3D scanning technology
facilitate the acquisition of dynamic objects, but pose substantial
challenges for reconstruction algorithms. We consider the problem
of marker-less, high-resolution geometry and motion reconstruction
from single-view scans of a deforming shape. The main advantage of single-view 3D scanners is the simplicity of the acquisition
setup, requiring no calibration or synchronization of multiple sensing units. However, single-view reconstruction of dynamic shapes
is particularly challenging, since every scan covers a small section
of the object’s surface. Large and complex shape deformations constantly create or destroy geometric detail, such as wrinkles or folds
in cloth, that needs to be distinguished from acquisition noise.
We address these challenges by introducing a novel template-based
dynamic registration algorithm that offers significant improvements
in terms of accuracy and robustness over previous methods. A
key feature of our approach is the separation of large-scale motion from small-scale shape dynamics. We introduce a time- and
space-adaptive deformation model that robustly captures the large-scale deformation of the object with minimal assumptions about
the dynamics of the motion and without requiring an underlying physical model or kinematic skeleton. Our method dynamically adds degrees of freedom to the deformation model where
needed, effectively extracting a generalized skeleton for the acquired shape. Small-scale dynamics are handled by a novel detail synthesis method that computes a displacement field to adjust the
deformed template to match the high-resolution input scans. The
combination of these tools allows the efficient processing of extended scan sequences and yields a complete high-resolution geometry representation of the scanned object with full correspondences
over all time instances.
We make a clear distinction between static and dynamic detail.
Static detail includes all small-scale geometric features that are persistent in the shape and are not affected by the motion of the object.
In the example shown in Figure 2, the mouth, eyes, and nose of the
hand-puppet are static detail, since the entire face region is rigid.
Dynamic detail consists of features that are transient. Deformation
of the object can cause dynamic detail to appear and disappear, such
as the folds in the body of the puppet. Our non-rigid registration
method makes use of a template model to reconstruct the overall
motion of the shape and provide a geometric prior for shape completion and topology control. In contrast to recent methods in performance capture [de Aguiar et al. 2008; Vlasic et al. 2008], we deliberately remove fine-scale detail from the template to avoid confusing static detail with dynamic detail. High-resolution templates
from rigid scans typically have all detail “baked in”, even transient
features that are then erroneously transferred to all reconstructed
surfaces (see also Figure 10). Our detail synthesis method automatically extracts detail from the high-resolution 3D input scans,
propagates detail into occluded regions, and separates salient features from high-frequency noise.
Figure 2: Deforming shapes typically contain both permanent detail, such as the face region of the puppet, and transient detail, such
as the dynamic folds in the cloth. Transient detail still persists over
a number of adjacent frames and can thus be distinguished from
temporally incoherent noise.
Contributions. The methods we propose are general in that they are not specifically designed for a certain acquisition setup or particular motion models. Our tool requires no user interaction beyond aligning the template with the first scan and specifying a few global parameters. The main technical contributions of this paper are
• an efficient non-rigid registration method based on a non-linear deformation model that automatically adapts to the motion of the scanned object,
• a detail synthesis method that employs a spatio-temporal analysis of detail vectors to propagate detail into occluded regions
and remove high-frequency acquisition noise,
• the integration of these methods into a complete 3D geometry
and motion reconstruction framework.
The reconstructed surface meshes come with temporally consistent correspondences, which allows further operations such as mesh editing, texturing, or signal processing to be applied to the animation sequence. We demonstrate the versatility of our approach
by showing high-resolution reconstructions of highly deformable
shapes such as cloth, as well as the more coherent motion of articulated shapes. In addition, our purely data-driven algorithm is
able to accurately reproduce subtle secondary motions such as hand
tremor, or the behavior of complex materials such as the crumpling
of a paper bag.
2 Related Work
Non-rigid registration methods were initially developed to align
3D scans of rigid objects that are distorted due to device nonlinearities and calibration inaccuracies [Ikemoto et al. 2003; Brown
and Rusinkiewicz 2004; Brown and Rusinkiewicz 2007]. These
methods achieve highly accurate alignments for subtle warps, but
are not suitable for large-scale deformations such as a bending arm.
Park and Hodgins [2006; 2008] developed a system that uses a very dense and large set of markers to capture and synthesize dynamic motions such as muscle bulging and flesh jiggling. While high-resolution motions can be captured accurately, marker-based motion capture systems typically have a time-consuming calibration process and high hardware cost, and require actors to wear unnatural skin-tight clothing with optical beacons.
Template-Based Methods. More general deformation models
have been proposed to capture dynamic shapes [Allen et al. 2003;
Sumner et al. 2007; Botsch and Sorkine 2008]. Various methods
make use of a template model to simplify correspondence estimation and provide a prior for geometry and topology reconstruction,
often relying on a small set of manually specified correspondences
[Blanz and Vetter 1999; Allen et al. 2003; Pauly et al. 2005; Amberg et al. 2007]. Several unsupervised methods were proposed that
require no manual intervention [Anguelov et al. 2004; Bronstein
et al. 2006], but typically lead to higher computational complexity
that makes these methods less suitable for long scan sequences.
Marker-less methods are widely used in the acquisition and modeling of facial animations. In [Zhang et al. 2004], the deformation of an accurate face template is driven by time-coherent optical
flow features and geometric closest point constraints. Since many
features in a human face are persistent, their system can robustly
handle long sequences of facial animations. More recently, several
papers avoid the use of markers to reproduce complex animations
of human performances and cloth deformations from multi-view
video [Bradley et al. 2008; de Aguiar et al. 2008; Vlasic et al. 2008].
The latter two methods initialize the recording process with a high
resolution full-body laser scan of the subject in a static pose. A
low-resolution template model is created to robustly recover complex motions by combining various tracking and silhouette fitting
techniques. Details of the high resolution models are then transferred back to the animated template. While large-scale deformations such as flowing garments are nicely captured, fine-scale geometric details such as folds that are not persistent in the surface
are captured in the high-resolution model, remaining permanently
throughout the reconstructed animation and possibly yielding unnatural deformations. An extension of this approach has been presented in [Ahmed et al. 2008] that follows a similar rationale to our
method. A low-resolution template is tracked and subsequently enriched with local detail extracted from the acquired data. However,
the specifics of this system differ substantially from our solution.
The input stems from a multi-view acquisition system using eight
video cameras, the template tracking is based on a shape-skeleton
and silhouette matching, and the detail synthesis is performed based
on surface normals reconstructed using shape from shading.
Registration Without A Template. Since creating an accurate
and sufficiently detailed template of a deforming object can be difficult, various methods have been proposed that do not rely on
a complete model. The algorithm presented by Mitra and colleagues [2007] aggregates all scans into a 4D space-time surface
and estimates inter-frame motion from kinematic properties of the
deforming surface. Süssmuth and coworkers [2008] introduced a
space-time approach that first computes an implicit 4D surface representation. A template is extracted from the initial frame and
warped to the subsequent frames by maximizing local rigidity.
These methods require adjacent frames to be sufficiently dense in
space and time and are mainly designed for articulated motions.
Sharf and colleagues [2008] introduced a volumetric space-time reconstruction technique that represents shape motion as an incompressible flow of material through time. This strong regularization
makes the method particularly suitable for very noisy input data.
Wand and coworkers [2007] introduced a statistical framework that
performs pairwise alignment and merging over all adjacent scans
within a global non-linear optimization process, leading to high
computational cost. While significant performance improvements
were achieved in a follow-up work using a volumetric meshless deformation model [Wand et al. 2009], this approach is still substantially slower than our method. In addition, the lack of a template
can lead to topological ambiguities and misalignments for unseen
parts (see also Figure 14).
Several researchers have proposed pairwise non-rigid registration
algorithms that are specifically designed for large deformations. Li
and coworkers [2008] developed a registration framework that simultaneously solves for point correspondences, surface deformation, and region of overlap within a single global optimization.
Our registration method uses similar components, but avoids the coupled non-linear optimization of correspondences and deformation to obtain an efficient alignment method for extended scan sequences.
Figure 3: Processing pipeline. A smooth template mesh is registered to each of the input scans using a non-linear, adaptive deformation
model. Small-scale detail coefficients are estimated and integrated into the template. The final reconstruction is obtained through detail
aggregation and filtering to propagate detail into occluded regions and separate salient features from noise.
Chang and Zwicker [2008] solve a discrete labeling problem to detect the set of optimal correspondences and apply graph
cuts to optimize for a consistent deformation from source to target.
They extend their scheme in [Chang and Zwicker 2009] using a reduced space deformation model represented by a volumetric grid
that encloses the underlying scan. Although significant motion and
occlusions can be handled, their deformation field representation
breaks down for topologically difficult scenarios such as shapes
with nearby or touching surfaces. Huang and colleagues [2008]
suggested a registration technique that finds an alignment by diffusing consistent closest point correspondences over the target shape
while preserving isometries as much as possible. Their implementation has proved to be efficient for large isometric deformations,
yet the correspondence search is sensitive to topological changes
and holes that commonly occur in partial acquisition systems.
3 Overview
Our dynamic acquisition system shown in Figure 4 provides dense
depth maps with a spatial resolution of 0.5 mm at 25 frames per second (see [Weise et al. 2007] for a description of a similar acquisition
setup). This allows us to capture fine-scale geometric detail of deforming objects at high temporal resolution. However, input scans
are typically highly incomplete and contain considerable amounts
of measurement noise. We found that a template model is essential
as a geometric and topological prior for the robust reconstruction of
shapes that undergo complex deformations, in particular for single-view acquisition, where large parts of the object are occluded.
Figure 3 gives an overview of our processing pipeline. Static acquisition is used to reconstruct the initial template. We remove all
high-frequency detail from the template using low-pass filtering to
avoid transferring potentially transient features to future scans. This
significantly simplifies template construction since we do not require high geometric precision. To initialize the computations, we
manually specify a rigid alignment of the template to the first frame
of the scan sequence and apply one step of the pairwise non-rigid
registration method described in Section 4.
We propose a two-scale approach to reconstruct a complete and
consistent surface for each frame. Template registration uses a nonlinear reduced deformable model to recover the large-scale motion
and align the template to each of the input scans (Section 4). The
template-to-scan registration makes use of detail coefficients estimated in the previous frame to enable feature locking and improve
the alignment accuracy. The final reconstruction is then obtained
using a separate detail synthesis pass that runs once forward and
once backward in time to aggregate and propagate detail into occluded regions (Section 5).
4 Template Registration

The registration stage captures the large-scale motion of the subject by fitting a coarse template shape to every frame of the scan sequence. Scans do not have to be a subset of the geometry described by the template, as in most previous methods (e.g. [Allen et al. 2003]). Our method robustly handles part-in-part registration, as opposed to the simpler part-in-whole matching (see e.g. Figure 12). We assume minimal prior knowledge about the acquired motion and thus employ a general deformation model to capture a sufficiently large range of shape deformations. We extend the embedded deformation framework proposed in [Sumner et al. 2007] to automatically adapt to the motion of the captured data. This allows recovering unknown complex material behavior and improves the robustness and efficiency of the registration.

4.1 Surface Deformation Model
Embedded deformation computes a warping field using a deformation graph to discretize the underlying space. Each node $x_i$ of the graph induces a deformation within a local influence region of radius $r_i$. We represent such a local deformation as an affine transformation specified by a $3 \times 3$ matrix $A_i$ and a $3 \times 1$ translation vector $b_i$. Graph nodes are connected by an edge whenever two nodes influence the same vertex of the mesh. A vertex $v_j$ of the embedded shape is mapped to the position
$$v_j' = \sum_{x_i} w(v_j, x_i, r_i)\left[A_i(v_j - x_i) + x_i + b_i\right], \qquad (1)$$
where $w(v_j, x_i, r_i) = \max\!\left(0, \left(1 - d^2(v_j, x_i)/r_i^2\right)^3\right)$ are the normalized weights, with $d(v_j, x_i)$ the distance between $v_j$ and $x_i$. We exploit the topological prior of the template and replace Euclidean distances in the original formulation by geodesic distances measured on the template mesh. This improvement avoids distortion artifacts that often occur when geodesically distant parts of the object come into close contact (Figure 8). We use a variant of the fast marching method to efficiently compute approximate geodesic distances [Kimmel and Sethian 1998].

Figure 4: Our real-time structured light scanner based on active stereo delivers high resolution input scans from a single view.
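For concreteness, the following is a minimal sketch of the warp of Eq. (1) in Python/NumPy. The array shapes, helper names, the per-vertex weight normalization, and the Euclidean stand-in for the geodesic distances are our own illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np

def node_weights(d, r):
    """Compactly supported weights w(v_j, x_i, r_i) of Eq. (1), normalized per vertex.
    d : (V, N) vertex-to-node distances;  r : (N,) node influence radii."""
    w = np.maximum(0.0, (1.0 - (d / r) ** 2) ** 3)            # (V, N)
    return w / np.maximum(w.sum(axis=1, keepdims=True), 1e-12)

def warp_vertices(v, x, A, b, w):
    """Apply v_j' = sum_i w_ji [A_i (v_j - x_i) + x_i + b_i].
    v : (V, 3) vertices, x : (N, 3) nodes, A : (N, 3, 3), b : (N, 3), w : (V, N)."""
    local = np.einsum('nab,vnb->vna', A, v[:, None, :] - x[None, :, :])
    local += x[None, :, :] + b[None, :, :]
    return np.einsum('vn,vna->va', w, local)

# Tiny usage example: identity transformations leave the mesh unchanged.
rng = np.random.default_rng(0)
v, x = rng.normal(size=(5, 3)), rng.normal(size=(3, 3))
A, b = np.tile(np.eye(3), (3, 1, 1)), np.zeros((3, 3))
d = np.linalg.norm(v[:, None, :] - x[None, :, :], axis=2)     # Euclidean stand-in
w = node_weights(d, np.full(3, 10.0))
assert np.allclose(warp_vertices(v, x, A, b, w), v)
```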
During non-rigid registration we solve for the unknown transformations $(A_i, b_i)$. A feature-preserving deformation field is obtained by maximizing local rigidity using the energy
$$E_{rigid} = \sum_{x_i} \Big[ (a_1^T a_2)^2 + (a_1^T a_3)^2 + (a_2^T a_3)^2 + (1 - a_1^T a_1)^2 + (1 - a_2^T a_2)^2 + (1 - a_3^T a_3)^2 \Big] \qquad (2)$$
that measures the deviation of the column vectors $a_1, a_2, a_3$ of $A_i$ from orthogonality and unit length. An additional regularization term ensures smoothness of the deformation. We extend the original formulation of [Sumner et al. 2007] using the geodesic distance weights to handle non-uniformly sampled graph nodes:
$$E_{smooth} = \sum_{x_i} \sum_{x_j} w(x_i, x_j, r_i + r_j)\, \big\| A_i(x_j - x_i) + x_i + b_i - (x_j + b_j) \big\|_2^2. \qquad (3)$$
Minimizing these combined energies with the fitting term defined
below yields affine transformations for each node, which in turn
define a smooth deformation field for the template mesh. We solve
this non-linear problem using a standard Gauss-Newton algorithm
as described in [Sumner et al. 2007; Li et al. 2008].
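The two regularization energies translate directly into residuals that a Gauss-Newton solver can stack. The sketch below merely evaluates Eqs. (2) and (3) for a candidate set of node transformations; the edge-weight dictionary is an assumed data structure, not the authors' representation.

```python
import numpy as np

def e_rigid(A):
    """Eq. (2): deviation of each A_i from orthonormality. A : (N, 3, 3)."""
    a1, a2, a3 = A[:, :, 0], A[:, :, 1], A[:, :, 2]            # column vectors
    dot = lambda u, v: np.einsum('ni,ni->n', u, v)
    return np.sum(dot(a1, a2)**2 + dot(a1, a3)**2 + dot(a2, a3)**2
                  + (1 - dot(a1, a1))**2 + (1 - dot(a2, a2))**2
                  + (1 - dot(a3, a3))**2)

def e_smooth(x, A, b, w_edges):
    """Eq. (3): consistency of neighboring node transformations.
    w_edges maps a node pair (i, j) to the weight w(x_i, x_j, r_i + r_j)."""
    e = 0.0
    for (i, j), w in w_edges.items():
        pred = A[i] @ (x[j] - x[i]) + x[i] + b[i]              # where node i maps x_j
        e += w * np.sum((pred - (x[j] + b[j]))**2)
    return e
```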
Figure 5: The deformation graph is dynamically refined during
non-rigid registration to adapt to the deformation of the scanned
object. Color-coded images indicate the regularization energy that
determines where new nodes are added to the graph. The bottom
row shows the initial and final deformation graphs for the hand and
the sumo reconstruction.
4.2 Robust Pairwise Registration
We iteratively compute closest point correspondences in the spirit
of non-rigid ICP methods, followed by a pruning and deformation
step. To avoid local minima in the non-linear optimization, we use
the simple yet effective technique proposed by [Li et al. 2008] that
progressively relaxes the regularization energies of the deformation
model. Similar strategies were also applied in [Allen et al. 2003;
Amberg et al. 2007]. In this way, the template can be accurately
aligned to scans that undergo considerable deformations without
the use of sparse, high-dimensional features.
Since our input data is sufficiently coherent in time, we repeatedly
use closest point correspondences between the template and each
input scan to determine the optimal deformation. In order to obtain
an accurate fit, we augment the smooth template with detail information extracted from the previous frame. Template vertices $v_i^j$ of frame $j$ are displaced in the direction of the corresponding surface normal $n_i^j$, yielding $\tilde{v}_i^j = v_i^j + d_i^{j-1} n_i^j$, where $d_i^{j-1}$ is the detail coefficient of frame $j-1$ (see Section 5). The correspondence energy combines the point-to-point and the point-to-plane metric to avoid incorrect correspondences in large featureless regions:
$$E_{fit} = \sum_{(v_i^j, c_i^j) \in C} \alpha_{point} \big\| \tilde{v}_i^j - c_i^j \big\|_2^2 + \alpha_{plane} \big| n_{c_i^j}^T (\tilde{v}_i^j - c_i^j) \big|^2, \qquad (4)$$
where $c_i^j$ denotes the closest point on the input scan from $\tilde{v}_i^j$, with corresponding surface normal $n_{c_i^j}$. We use $\alpha_{point} = 0.1$ and $\alpha_{plane} = 1$ in all our experiments. Correspondences are discarded if they are too far apart, have incompatible normal orientations, lie on the boundary of the partial input scans, or stem from back-facing or self-occluded vertices of the template.
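A hedged sketch of the fitting energy (4) together with the pruning rules above; the thresholds `max_dist` and `min_cos` and the two mask arrays are illustrative placeholders, since the paper does not state these values.

```python
import numpy as np

def e_fit(v_tilde, n_v, c, n_c, on_boundary, visible,
          a_point=0.1, a_plane=1.0, max_dist=0.02, min_cos=0.6):
    """Sum point-to-point and point-to-plane terms of Eq. (4) over valid pairs.
    v_tilde : (V, 3) detail-displaced template vertices
    n_v, n_c: (V, 3) unit normals of template vertices / closest points
    c       : (V, 3) closest points on the input scan
    on_boundary, visible : (V,) boolean masks implementing the pruning rules."""
    diff = v_tilde - c
    keep = (np.linalg.norm(diff, axis=1) < max_dist)           # not too far apart
    keep &= (np.einsum('vi,vi->v', n_v, n_c) > min_cos)        # compatible normals
    keep &= ~on_boundary & visible                             # scan boundary, occlusion
    point = np.sum(diff[keep]**2)
    plane = np.sum(np.einsum('vi,vi->v', n_c[keep], diff[keep])**2)
    return a_point * point + a_plane * plane
```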
Iterative Optimization. For each template-to-scan alignment, we initialize the registration with high stiffness weights $\alpha_{smooth} = 10$ and $\alpha_{rigid} = 100$. We then alternate in each iteration between correspondence computation and template deformation by minimizing $E_{tot} = E_{fit} + \alpha_{smooth} E_{smooth} + \alpha_{rigid} E_{rigid}$. If the relative total energy did not change significantly between iterations $j$ and $j+1$ (i.e., $|E_{tot}^{j+1} - E_{tot}^{j}| / E_{tot}^{j} < \sigma$), we additionally relax the regularization weights to $\alpha_{smooth} \leftarrow \frac{1}{2}\alpha_{smooth}$ and $\alpha_{rigid} \leftarrow \frac{1}{2}\alpha_{rigid}$. This relaxation strategy effectively improves the robustness by avoiding suboptimal local minima and allows handling pairs of scans that undergo significant deformations. In all our experiments we use $\sigma = 0.005$. The iterative optimization is repeated until $\alpha_{rigid} < 0.1$ or until a maximum number of iterations $N_{max} = 100$ is reached.
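The relaxation schedule can be summarized in a few lines; `correspondences` and `solve_gauss_newton_step` stand in for the components described above, so this is a sketch of the control flow, not the authors' actual code.

```python
def register(template, scan, solve_gauss_newton_step, correspondences,
             sigma=0.005, n_max=100):
    """One template-to-scan alignment with progressive stiffness relaxation."""
    a_smooth, a_rigid = 10.0, 100.0
    e_prev = None
    for _ in range(n_max):
        C = correspondences(template, scan)       # closest points + pruning
        e_tot = solve_gauss_newton_step(template, C, a_smooth, a_rigid)
        # Relax regularization once the total energy stagnates.
        if e_prev is not None and abs(e_tot - e_prev) / e_prev < sigma:
            a_smooth *= 0.5
            a_rigid *= 0.5
        if a_rigid < 0.1:
            break
        e_prev = e_tot
    return template
```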
Note that detail information of the previous frame is only used to
improve the accuracy of the registration by enabling geometric feature locking. The resulting continuous space deformation is applied
to the template vertices without added detail. As discussed in Section 5 the final detail coefficients are obtained through a separate
detail synthesis pass.
4.3 Dynamic Graph Refinement
We replace the static, uniform sampling of the deformation graph
in [Sumner et al. 2007] and [Li et al. 2008] with a spatially and temporally adaptive node distribution. While the idea of adaptive mesh
deformation has been explored in previous work, for instance in
the context of multi-resolution shape modeling from images [Zhang
and Seitz 2000], we propose to adapt the degrees of freedom of the
deformation model instead of the geometry itself in order to improve registration robustness and efficiency.
A hierarchical graph representation is pre-computed from a dense uniform sampling of graph nodes by successively merging nodes in a bottom-up fashion. The initial uniform node sampling corresponds to the highest resolution level $l = L_{max}$ of the deformation graph, which we restrict to roughly one tenth of the number of mesh vertices. We thus avoid over-fitting in regions of small-scale deformations, which are instead captured by our detail synthesis method (Section 5). We uniformly sub-sample the nodes of each level by repeatedly increasing their average sampling distance $r_{l-1} = 4\,r_l$ until $l$ reaches $L_{min}$. Each of the remaining nodes $x_i^l$ from level $l \in L_{min} \ldots L_{max}$ forms a cluster $C_i^l$ which contains every node from the level below $x_i^{l+1}$ that is not closer to any other cluster from $l$. The resulting cluster hierarchy is then used for adaptive refinement. We choose $L_{min} = L_{max}/2$ for all our experiments.
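A possible construction of this hierarchy, sketched with Euclidean distances and a greedy subsampling (the paper samples nodes on the template and uses geodesic distances, so this is only a structural illustration under those simplifying assumptions):

```python
import numpy as np

def subsample(nodes, radius):
    """Greedy Poisson-disk-like subsampling at the given spacing."""
    kept = []
    for p in nodes:
        if all(np.linalg.norm(p - q) >= radius for q in kept):
            kept.append(p)
    return np.array(kept)

def build_hierarchy(nodes_lmax, r_lmax, n_levels):
    """Return per-level node positions and child clusters C_i^l (index lists)."""
    levels, radius = [nodes_lmax], r_lmax
    for _ in range(n_levels - 1):
        radius *= 4.0                        # r_{l-1} = 4 r_l
        levels.append(subsample(levels[-1], radius))
    levels = levels[::-1]                    # levels[0] = coarsest level L_min
    clusters = []
    for l in range(len(levels) - 1):
        coarse, fine = levels[l], levels[l + 1]
        d = np.linalg.norm(fine[:, None, :] - coarse[None, :, :], axis=2)
        owner = d.argmin(axis=1)             # nearest coarse node claims each fine node
        clusters.append([np.where(owner == i)[0] for i in range(len(coarse))])
    return levels, clusters
```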
Refinement Criterion. Registration starts with a coarse uniform graph at level $L_{min}$ and dynamically adapts the graph resolution by inserting nodes in regions with high regularization residual ($E_{smooth}$), which indicates a strong discrepancy of neighboring node transformations (see Figure 5). In all our examples we set the threshold for refinement to 10% of the highest regularization value. One step of refinement substitutes every $x_i^l$ that exhibits high regularization with all nodes contained in $C_i^l$. To avoid unnecessary refinements for every new upcoming target frame, adaptive refinement is only performed if the global regularization term is still above a certain threshold, i.e. $E_{smooth} > 0.01$, for the maximum number of iterations $N_{max} = 100$ of pairwise registration.
The dynamic refinement effectively learns an adaptive deformation
model that is consistent with the motion of the scanned object. Additional nodes will be inserted automatically in regions of high deformation, while large rigid parts can be accurately deformed by
a single graph node. In addition to being less susceptible to local
minima, this leads to significant performance improvements (up to
a factor of four in our examples) as compared to a uniform sampling
with a high level of node redundancy. As illustrated in Figure 5, our
adaptive model is suitable for a wide variety of dynamic objects,
from articulated shapes to complex cloth folding.
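One refinement step then amounts to substituting over-strained nodes by their clusters; again a structural sketch with assumed bookkeeping, not the paper's data structures:

```python
def refine_once(active, e_smooth_per_node, clusters, threshold_frac=0.1):
    """active: list of (level, index) ids of the current graph nodes;
    clusters[l][i]: indices of the finer-level nodes forming cluster C_i^l."""
    thresh = threshold_frac * max(e_smooth_per_node.values())
    refined = []
    for (l, i) in active:
        if l < len(clusters) and e_smooth_per_node[(l, i)] > thresh:
            refined += [(l + 1, j) for j in clusters[l][i]]  # substitute by C_i^l
        else:
            refined.append((l, i))
    return refined
```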
4.4 Multi-Frame Stabilization
The warped template $T^{j-1}$ obtained after alignment to scan $j-1$ is the zero-energy state when aligning to scan $j$, for each frame of the entire template warping process.
visible in the scan, dynamic details, such as cracks and fissures in
paper-like materials, can be accurately captured, since the method
prevents the template from deforming back to its initial undeformed
state. However, unobserved template parts are inherently prone to
accumulation of misalignments, especially for lengthier scan sequences as illustrated in Figure 6. In contrast to our formulation,
classical template fitting methods [Zhang et al. 2004; de Aguiar
et al. 2008; Vlasic et al. 2008] warp the same initial template to
each recorded frame and thus use a deformation model that behaves globally elastically in time. For complex articulated subjects, such as human bodies, missing data in occluded regions would pull the template back to its original shape, which can be very different from that of the current frame. Therefore, multi-view acquisition
systems are usually used in combination with sparse and robust feature tracking [de Aguiar et al. 2008] and sometimes enhanced with
manual intervention [Vlasic et al. 2008] to ensure reliable tracking.
In our dense acquisition setting, the surface coverage of the template by the input scans is spatially and temporally coherent over
time. Thus, for non-occluded regions, the template shape from a
closer time instance represents in general a more likely shape prior than the initial template $T_{init}$. On the other hand, we make the assumption that no better knowledge exists than $T_{init}$ for template regions that are never observed or not seen for an extended period.
To address this issue we introduce a time-dependent combination
of plastic and elastic deformation to accurately track exposed surface regions and reduce the accumulation of errors in less recently
observed parts of the scanned object. After the pairwise registration of $T^{j-1}$ to scan $j$ as presented in Section 4.2, we obtain the plastically deformed template $T^j$. A visibility confidence weight $c_i^j$ can then be defined for each vertex $v_i^j \in T^j$ as $c_i^j = \max\{0, (P + j_i^{last} - j)/P\}$, with $j_i^{last}$ the last frame in which $v_i$ has been observed, and $P$ a constant (we chose $P = 30$ in all our examples) that defines a temporal confidence range of visibility. All template vertices with $c_i^j = 1$ are visible in the current frame, while those with $c_i^j = 0$ are no longer considered confident. For the same frame, an elastically deformed template $\tilde{T}^j$ with vertices $\tilde{v}_i^j$ is created by warping $T_{init}$ to the current frame $j$ using the linearized thin-plate energy as described in [Botsch and Sorkine 2008]. Hard positional constraints are defined for all vertices with confidence $c_i^j = 1$. The resulting template $\bar{T}^j$ with vertices $\bar{v}_i^j$ is obtained by linearly blending $T^j$ and $\tilde{T}^j$ with the visibility confidence weights, yielding $\bar{v}_i^j = c_i^j v_i^j + (1 - c_i^j)\tilde{v}_i^j$.

Figure 6: A hybrid plastic and elastic deformation model is used to stabilize the registration for multiple input frames, as repeated pairwise alignment is susceptible to error accumulation. The accumulation of misalignments is shown on frame 30 of the sumo sequence.
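The stabilization itself reduces to a per-vertex blend once the plastically deformed template $T^j$ and the elastically warped template $\tilde{T}^j$ are available. A minimal sketch, assuming per-vertex NumPy arrays (the thin-plate warp itself is not shown):

```python
import numpy as np

def visibility_confidence(j, j_last, P=30):
    """Per-vertex confidence c_i^j = max{0, (P + j_last_i - j) / P}."""
    return np.maximum(0.0, (P + j_last - j) / P)

def stabilized_vertices(v_plastic, v_elastic, c):
    """Blend v_bar = c * v_plastic + (1 - c) * v_elastic, per vertex."""
    return c[:, None] * v_plastic + (1.0 - c[:, None]) * v_elastic
```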
5 Detail Synthesis
Non-rigid registration aligns the template sequentially with all input scans. The resulting deformation fields induced by the graph
capture the large-scale deformation but might miss small deformations that give rise to dynamic detail such as wrinkles and folds.
To recover fine-scale detail at the spatial resolution of the scanner,
we perform a separate detail synthesis stage that is composed of
two steps: First, a per-vertex optimization from local correspondences is applied to estimate detail coefficients for each vertex of
the template. These preliminary detail coefficients are the ones used
for template alignment as detailed in Section 4. After the template
has been registered to the entire scan sequence, we perform an additional pass that exploits the temporal coherence of the scan sequence to improve the reconstruction quality by propagating detail
into occluded regions.
Figure 7: Detail synthesis. Reconstructing detail from the current frame leads to a lack of detail in occluded regions. Aggregating detail over temporally adjacent frames propagates detail into hole regions and reduces noise. The color-coded images show the magnitude of the detail coefficients relative to the bounding box diagonal.

Linear Mesh Deformation. Since the deformed template is already well-aligned with the input scan, we employ an efficient linear mesh deformation algorithm similar to [Zhang et al. 2004] to estimate detail coefficients. For each vertex $v_i$ in the template mesh, we trace an undirected ray in normal direction $n_i$ and find the closest intersection point on the input scan. If an intersection point $c_i$ is found, a point-to-point correspondence constraint is created, provided both points have the same normal orientation and are sufficiently close. Since the template has no high-frequency detail, its normal vector field is smooth, leading to spatially coherent correspondences. We compute the detail coefficients $d_i$ by minimizing the energy resulting from the extracted correspondences subject to a regularization constraint:
$$E_{detail} = \sum_{i \in V} \left\| v_i + d_i n_i - c_i \right\|_2^2 + \beta \sum_{(i,j) \in E} |d_i - d_j|^2, \qquad (5)$$
where V and E are index sets of mesh vertices and edges, respectively. The parameter β balances detail synthesis with smoothness
and is set to β = 0.5 in all our experiments. The resulting system
of equations is linear and sparse and can thus be solved efficiently.
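Under the assumption of unit normals, minimizing Eq. (5) reduces to a sparse symmetric linear system in the scalar coefficients $d_i$; the following is a sketch of the normal equations using SciPy, where the vertex mask and edge list are assumed inputs rather than the paper's data structures.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def solve_detail(v, n, c, has_corr, edges, beta=0.5):
    """Solve for detail coefficients d minimizing Eq. (5), assuming unit normals.
    v, n, c : (V, 3) vertices, normals, target points; has_corr : (V,) bool mask
    of vertices with a valid ray intersection; edges : list of (i, j) pairs."""
    V = len(v)
    # Data term: for constrained vertices the optimum satisfies d_i = n_i . (c_i - v_i).
    M = sp.lil_matrix((V, V))
    M.setdiag(has_corr.astype(float) + 1e-9)   # tiny shift keeps the system regular
    rhs = np.where(has_corr, np.einsum('vi,vi->v', n, c - v), 0.0)
    # Regularization beta * sum |d_i - d_j|^2 adds a graph Laplacian term.
    for i, j in edges:
        M[i, i] += beta; M[j, j] += beta
        M[i, j] -= beta; M[j, i] -= beta
    return spla.spsolve(sp.csr_matrix(M), rhs)
```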
Aggregation. The linear mesh deformation method described above estimates detail coefficients independently for each frame in those regions of the object that are observed by a particular scan. To transfer detail to occluded regions we perform a separate processing pass that aggregates detail coefficients using an exponentially weighted moving average. We use the formulation of Roberts [1959] and define this moving average as
$$\bar{d}_i^j = (1 - \gamma)\,\bar{d}_i^{j-1} + \gamma\, d_i^j \qquad (6)$$
with $\gamma$ set to 0.5 in all our examples. The influence of past detail
coefficients decays quickly in this formulation, which is important,
since transient or dynamic detail such as wrinkles and folds might
not persist during deformation. Note that details in the template
only disappear when they vanish in the input scans of succeeding
frames. For instance, the details of a rigid object will persist and
not fade toward zero coefficients since only observed coefficients
are combined during detail synthesis. When processing scan $j$, we first update the vertices $v_i^j \leftarrow v_i^j + \bar{d}_i^{j-1} n_i^j$ and perform the linear mesh deformation described in the previous section. This yields the new detail coefficients $d_i^j$ that are then used to update the moving average $\bar{d}_i^j$, which will in turn be employed to process the subsequent scans. The entire detail aggregation process is performed
by running sequentially once forward and once backward through
the scans while performing the linear mesh deformation and updating the moving averages. Going back and forth allows us to backpropagate persistent details seen at future instances to earlier scans
(see Figure 7). As a final step, we apply a band-limiting bilateral
filter [Aurich and Weule 1995] that operates in the time domain and
detail range to further reduce temporal noise.
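The aggregation of Eq. (6) with a forward and a backward pass can be sketched as follows; in the full pipeline each updated average feeds back into the linear mesh deformation of the next frame, which this standalone illustration omits.

```python
import numpy as np

def aggregate(detail, observed, gamma=0.5):
    """detail, observed : (F, V) per-frame coefficients and visibility masks.
    Only observed coefficients update the moving average (Eq. 6)."""
    avg = np.zeros_like(detail)
    run = np.zeros(detail.shape[1])
    for j in range(detail.shape[0]):                    # forward pass
        run = np.where(observed[j], (1 - gamma) * run + gamma * detail[j], run)
        avg[j] = run
    run = avg[-1].copy()
    for j in range(detail.shape[0] - 2, -1, -1):        # backward pass
        run = np.where(observed[j], (1 - gamma) * run + gamma * detail[j], run)
        avg[j] = run
    return avg
```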
6 Results

We show a variety of acquired geometry and motion sequences processed with our system that exhibit substantially different dynamic behavior. Accurate reconstruction of these objects is challenging due to the high noise level in the scans, missing data caused by occlusions or specularity, unknown correspondences, and the large and complex motion and deformations of the acquired objects. The statistics for the results are shown in Table 1. All templates were constructed by performing an online rigid registration technique similar to [Rusinkiewicz et al. 2002] on our acquired data, followed by a surface reconstruction technique based on algebraic point set surfaces described in [Guennebaud and Gross 2007]. Given the roughly aligned template mesh, our system runs completely automatically without any user intervention. Only a few parameters (such as the weighting coefficients of the different energy terms) have to be chosen manually. For all examples, we use the same initial parameter settings. During optimization we automatically adapt the parameters using the approach detailed in Section 4.
Figure 9 shows the warped template and final reconstruction of the puppet. This example is particularly difficult due to the close proximity of multiple surface sheets when closing the puppet's hands. The reconstruction of a hand in Figure 8 demonstrates that our detail synthesis method is capable of capturing the intricate folds and wrinkles of human skin, even though the scans contain a large amount of measurement noise. Figure 12 illustrates how detail is propagated correctly into occluded regions, which leads to a plausible high-resolution reconstruction even for parts of the model that have not been observed in a particular scan. Figure 13 shows the reconstruction of a crumpling paper bag. Despite substantial holes caused by oversaturation in the reflections, the dynamics of the material as well as sharp geometric creases are faithfully captured.
                            Puppet    Head    Hand    Paper Bag    Sumo
# Scans                        100     200      35           85      34
Min # Points per Scan          23k     53k     19k          82k     85k
Max # Points per Scan          37k     68k     25k         123k     86k
Input Data Size (Mb)           430   1,690     120          145     430
# Template Vertices            48k     64k     46k          64k    107k
Begin # Graph Nodes             20     152      77           37      52
End # Graph Nodes              100     458    1238           86     110
Output Data Size (Mb)          530   2,030     180          960     540
Registration Time               39     247      15           65      26
Detail Synthesis Time           26      92       8           36      23
Total Time                      65     339      23          101      49

Table 1: Statistics for the results shown in this paper. All computations were performed on a 3.0 GHz Dual Quad-Core Intel Xeon machine with 8 GB RAM. Timings are measured in minutes and include I/O operations.
Figure 8: The zooms illustrate how high-frequency detail such as
the skin folds is faithfully recovered and transferred to occluded
regions. Even though the scan is connected at the fingertips, shape
topology is correctly recovered (red circle).
Figure 9: The global motion of the puppet’s shape as well as fine-scale static and dynamic detail are captured accurately using the template
registration and detail synthesis algorithm. The intricate folds of the cloth are handled robustly in the registration.
7 Evaluation
Figure 10 illustrates the difference between tracking a high-resolution template versus our two-scale approach that separates
global shape motion and dynamic detail reconstruction. For comparison we use the first frame of our two-scale reconstruction as the
high-resolution template, which is then aligned with the input scan
sequence using the registration method of Section 4. As can be seen
in the zoom, dynamic detail created by the motion, in particular in
the cloth, is not captured accurately. In contrast, our detail synthesis approach avoids the artifacts created by “baked-in” geometric
detail and leads to a high-quality reconstruction of both static and
dynamic detail. While a fairly large range of template smoothness
can be tolerated, an overly coarse template can deteriorate the reconstruction as shown in Figure 11.
The necessity of using a template for robust reconstruction of complex deforming shapes is illustrated in Figure 14. The method
of [Wand et al. 2009] that avoids the use of a template cannot track
the motion of the fingers accurately. In particular, the correspondence estimation fails when previously unseen parts of the shape,
such as the back of the fingers, come into view. Figure 15 shows
a comparison of our method to the dynamic registration approach
of [Süssmuth et al. 2008] using the same template in both reconstructions.
We evaluate the robustness of the template tracking and detail synthesis method using the ground-truth comparison shown in Figure 16. The scanning process has been simulated by creating a set of artificial depth maps from a fixed viewpoint. The ground-truth animation of the 3D model was obtained from dense motion capture data provided by [Park and Hodgins 2006]. In order to test the stability of the template tracking, we sampled the entire sequence at successively lower temporal resolution.
Figure 10: Warping a high-resolution template without detail synthesis leads to inferior results as compared to our two-scale reconstruction approach (cf. Figure 9). The color coding shows the distance between both results relative to the bounding box diagonal.
Figure 11: Evaluation of the reconstruction (frame 1 and 50) for
three different initial templates. The upper row shows the original
template. The coarser template in the second row is produced by
surface reconstruction from points that are uniformly subsampled
at half of the density of the original template. The last row illustrates the reconstruction using an even coarser template. This is
obtained from only 25% of the initial point density.
Figure 12: Our method faithfully recovers both the large-scale motion of the turning head, as well as the dynamic features created by the
expression, such as wrinkles on the forehead or around the mouth. Intricate geometric details such as the ears are accurately captured, even
though they are only observed in few frames. The color-coded images show the number of frames a certain region has been observed.
The non-rigid registration robustly aligns the template with the scans for a temporally sub-sampled sequence consisting of only 34 frames. The large
inter-frame motion, especially of the arms and legs, is tracked correctly, even though our correspondence computations do not make
use of feature points, markers, or user assistance. Template tracking breaks down at 17 frames, where the fast motion of the arms
cannot be recovered anymore (see Figure 17 (a)). Detail synthesis
for the 34-frame sequence reliably recovers most of the fine-scale
geometry correctly. Artifacts appear in the fingers and toes due to
the coarse approximation of the template. In addition, drawbacks
of the single-view acquisition become apparent in regions that are
not observed by the scanner, such as the back of the sumo. Quantitatively, we measured the maximum of the average distance over all frames as 0.0012, and the maximum of the maximum distance over all frames as 0.0283, both as fractions of the bounding box diagonal.
Limitations. We make few assumptions on the geometry and motion of the scanned objects. The correspondence estimation based on closest points, however, requires a sufficiently high acquisition frame-rate, as otherwise misalignments can occur, as shown in Figure 17 (a). Similarly, for parts of the shape that are out of view for an extended period of time, registration can fail if these regions have undergone deformations while not being observed by the scanner. In such a case, our system would require user interaction to re-initialize the registration. This is an inherent limitation of single-view systems where more than half of the object surface is occluded at any time instance. However, even some multi-view systems (e.g. [Vlasic et al. 2008]) permit user assistance to adjust incorrect optimizations. Similar manual assistance might be required for longer sequences, where the scanner infrequently produces inferior data in certain frames. These frames need to be removed manually and the registration re-started with user assistance.
Figure 14: Reconstruction without a template is particularly challenging for single-view acquisition. The results in the center have been produced by the authors of [Wand et al. 2009].
Figure 13: Sharp creases and intricate folds created by the complex, non-smooth deformation of a crumpling paper bag are captured accurately.
Figure 15: Comparison of two template-based reconstruction
methods. The results in the bottom right have been produced by
the authors of [Süssmuth et al. 2008].
Figure 16: Ground truth comparison for a synthetic full-body example with fast motion. The top row shows every frame of the input sequence.
The color-coded image indicates the number of frames in which a certain part of the shape is covered by the scans. The graph shows the
maximum and average error distance between the ground truth and the reconstruction for each frame.
While
none of our sequences required such manual intervention, the acquisition of longer sequences was inhibited by this limitation of our
scanning system.
Global aspects, such as the loop closure problem well-known in
rigid scanning [Pulli 1999] are currently not considered in our system. To address these limitations, more sophisticated feature tracking would be required in order to establish reliable correspondences
across larger spatial and temporal distances. We currently do not
prevent global self-intersections of the reconstructed meshes. However, as shown in Figure 17 (b), our method robustly recovers,
mainly due to the use of geodesic distances on the template mesh
and the correspondence pruning strategy based on normal consistency and visibility. Avoiding self-intersections entirely would require an additional self-collision handling step in the shape deformation optimization algorithm, which would add a significant overhead to the overall reconstruction pipeline. Our method does not
discover topological errors in the template, as shown in Figure 17
(c). In the template reconstruction the pinky has been erroneously
connected to the paper bag, which leads to artifacts in the final
frames of the sequences, where the finger is lifted off the bag.
8 Conclusion
We have presented a robust algorithm for geometry and motion reconstruction of dynamic shapes. One of the main benefits of our
method is simplicity. Our scanning system requires no specialized hardware or complex calibration or synchronization, and can
be readily deployed in different acquisition scenarios. We do not
require silhouette or feature extraction, manual correction of correspondences, or the explicit construction of a shape skeleton. Our
system demonstrates that even for single-view acquisition, high-quality results can be obtained for a variety of scanned objects, with
a realistic reconstruction of shape dynamics and fine-scale features.
Key to the success of our algorithm is the robust template tracking
based on an adaptive deformation model. Our novel detail synthesis method exploits the accurate registration to aggregate and propagate geometric detail into occluded regions. As future work we
plan to resolve aforementioned limitations and incorporate global
self-collision handling. Moreover, we want to evaluate the algorithm in a multi-view setting where larger parts of the object are
seen at the same or alternating time instances. As our current acquisition system only allows us to scan within a working volume
of 40 × 30 × 60 cm³, we wish to extend our scanning setup to
allow acquisition of larger objects such as full human body performances. The tests on synthetic data indicate that our reconstruction
algorithm should perform well for such cases. Finally, the proposed
Figure 17: Limitations: (a) registration can fail if the framerate is too low relative to the motion of the scanned object; (b)
self-intersections are not prevented during template alignment; (c)
wrong template topology leads to artifacts when the finger is lifted
off the paper bag.
registration algorithm can be used to acquire and learn material behavior (such as the crumpling of paper or folding of skin). Such
information can be used to improve the realism of physically-based
simulation algorithms.
Acknowledgements. The authors would like to thank Thibaut
Weise for providing the real-time 3D scanner, Carsten Stoll for his
performance capture data, Sang Il Park and Jessica Hodgins for
the animated sumo. Special thanks go to Johannes Schmid for
helping with the video editing, Michael Wand, Martin Bokeloh,
and Jochen Süssmuth for performing the comparisons, Qi-Xing
Huang and Maks Ovsjanikov for their feedback and discussions.
This work is supported by SNF grant 200021-112122, NSF grants
ITR 0205671, FRG 0354543, FODAVA 808515, as well as NIH
grant GM-072970 and the Fund for Scientific Research, Flanders
(F.W.O.-Vlaanderen).
References
Ahmed, N., Theobalt, C., Dobrev, P., Seidel, H.-P., and Thrun, S. 2008. Robust fusion of dynamic shape and normal capture for high-quality reconstruction of time-varying geometry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), 1–8.

Allen, B., Curless, B., and Popović, Z. 2003. The space of human body shapes: reconstruction and parameterization from range scans. ACM Transactions on Graphics 22, 3, 587–594.

Amberg, B., Romdhani, S., and Vetter, T. 2007. Optimal step nonrigid ICP algorithms for surface registration. In Proceedings of IEEE CVPR.

Anguelov, D., Srinivasan, P., Pang, H.-C., Koller, D., Thrun, S., and Davis, J. 2004. The correlated correspondence algorithm for unsupervised registration of nonrigid surfaces. In Advances in Neural Inf. Proc. Systems 17.

Aurich, V., and Weule, J. 1995. Non-linear Gaussian filters performing edge preserving diffusion. In Mustererkennung 1995, 17. DAGM-Symposium, Springer-Verlag, 538–545.

Blanz, V., and Vetter, T. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of ACM SIGGRAPH 99, ACM Press / ACM SIGGRAPH, 187–194.

Botsch, M., and Sorkine, O. 2008. On linear variational surface deformation methods. IEEE Transactions on Visualization and Computer Graphics 14, 1, 213–230.

Bradley, D., Popa, T., Sheffer, A., Heidrich, W., and Boubekeur, T. 2008. Markerless garment capture. ACM Transactions on Graphics 27, 3, 99:1–99:9.

Bronstein, A. M., Bronstein, M. M., and Kimmel, R. 2006. Generalized multidimensional scaling: a framework for isometry-invariant partial surface matching. Proc. National Academy of Sciences (PNAS) 103.

Brown, B., and Rusinkiewicz, S. 2004. Non-rigid range-scan alignment using thin-plate splines. In Symp. on 3D Data Processing, Visualization, and Transmission.

Brown, B. J., and Rusinkiewicz, S. 2007. Global non-rigid alignment of 3-D scans. ACM Transactions on Graphics 26, 3, 21:1–21:10.

Chang, W., and Zwicker, M. 2008. Automatic registration for articulated shapes. Computer Graphics Forum (Proc. SGP) 27, 5, 1459–1468.

Chang, W., and Zwicker, M. 2009. Range scan registration using reduced deformable models. Computer Graphics Forum (Proceedings of Eurographics 2009), to appear.

de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H.-P., and Thrun, S. 2008. Performance capture from sparse multi-view video. ACM Transactions on Graphics 27, 3, 98:1–98:10.

Guennebaud, G., and Gross, M. 2007. Algebraic point set surfaces. ACM Transactions on Graphics 26, 3, 23:1–23:10.

Huang, Q., Adams, B., Wicke, M., and Guibas, L. J. 2008. Non-rigid registration under isometric deformations. Computer Graphics Forum (Proc. SGP) 27, 5, 1449–1457.

Ikemoto, L., Gelfand, N., and Levoy, M. 2003. A hierarchical method for aligning warped meshes. In Proceedings of 4th Int. Conference on 3D Digital Imaging and Modeling, 434–441.

Kimmel, R., and Sethian, J. A. 1998. Computing geodesic paths on manifolds. Proc. Natl. Acad. Sci. USA, 8431–8435.

Li, H., Sumner, R. W., and Pauly, M. 2008. Global correspondence optimization for non-rigid registration of depth scans. Computer Graphics Forum (Proc. SGP) 27, 5, 1421–1430.

Mitra, N. J., Flory, S., Ovsjanikov, M., Gelfand, N., Guibas, L., and Pottmann, H. 2007. Dynamic geometry registration. In Symposium on Geometry Processing, 173–182.

Park, S. I., and Hodgins, J. K. 2006. Capturing and animating skin deformation in human motion. ACM Transactions on Graphics 25, 3, 881–889.

Park, S. I., and Hodgins, J. K. 2008. Data-driven modeling of skin and muscle deformation. ACM Transactions on Graphics 27, 3, 96:1–96:6.

Pauly, M., Mitra, N. J., Giesen, J., Gross, M., and Guibas, L. J. 2005. Example-based 3D scan completion. In Symposium on Geometry Processing.

Pulli, K. 1999. Multiview registration for large data sets. In Second Int. Conf. on 3D Dig. Image and Modeling, 160–168.

Roberts, S. 1959. Control chart tests based on geometric moving averages. Technometrics 1, 239–250.

Rusinkiewicz, S., Hall-Holt, O., and Levoy, M. 2002. Real-time 3D model acquisition. ACM Transactions on Graphics 21, 3, 438–446.

Sharf, A., Alcantara, D. A., Lewiner, T., Greif, C., Sheffer, A., Amenta, N., and Cohen-Or, D. 2008. Space-time surface reconstruction using incompressible flow. ACM Transactions on Graphics 27, 5, 110:1–110:10.

Sumner, R. W., Schmid, J., and Pauly, M. 2007. Embedded deformation for shape manipulation. ACM Transactions on Graphics 26, 3, 80:1–80:7.

Süssmuth, J., Winter, M., and Greiner, G. 2008. Reconstructing animated meshes from time-varying point clouds. Computer Graphics Forum (Proceedings of SGP 2008) 27, 5, 1469–1476.

Vlasic, D., Baran, I., Matusik, W., and Popović, J. 2008. Articulated mesh animation from multi-view silhouettes. ACM Transactions on Graphics 27, 3, 97:1–97:9.

Wand, M., Jenke, P., Huang, Q., Bokeloh, M., Guibas, L., and Schilling, A. 2007. Reconstruction of deforming geometry from time-varying point clouds. In Symposium on Geometry Processing, 49–58.

Wand, M., Adams, B., Ovsjanikov, M., Berner, A., Bokeloh, M., Jenke, P., Guibas, L., Seidel, H.-P., and Schilling, A. 2009. Efficient reconstruction of non-rigid shape and motion from real-time 3D scanner data. ACM Transactions on Graphics, to appear.

Weise, T., Leibe, B., and Van Gool, L. 2007. Fast 3D scanning with automatic motion compensation. In IEEE Conference on Computer Vision and Pattern Recognition, 1–8.

Zhang, L., and Seitz, S. M. 2000. Image-based multiresolution shape recovery by surface deformation. In Proc. SPIE, S. F. El-Hakim and A. Gruen, Eds., vol. 4309, 51–61.

Zhang, L., Snavely, N., Curless, B., and Seitz, S. M. 2004. Spacetime faces: high resolution capture for modeling and animation. ACM Transactions on Graphics 23, 3, 548–558.