3D Scanning Deformable Objects with a Single RGBD Sensor

Mingsong Dou∗¹, Jonathan Taylor², Henry Fuchs¹, Andrew Fitzgibbon² and Shahram Izadi²
¹ Department of Computer Science, UNC-Chapel Hill
² Microsoft Research

Abstract

We present a 3D scanning system for deformable objects that uses only a single Kinect sensor. Our work allows a considerable amount of nonrigid deformation during scanning, and achieves high quality results without heavily constraining user or camera motion. We do not rely on any prior shape knowledge, enabling general object scanning with freeform deformations. To deal with the drift problem when nonrigidly aligning the input sequence, we automatically detect loop closures, distribute the alignment error over the loop, and finally use a bundle adjustment algorithm to optimize for the latent 3D shape and nonrigid deformation parameters simultaneously. We demonstrate high quality scanning results on some challenging sequences, comparing with state-of-the-art nonrigid techniques, as well as ground truth data.

∗ Most of this work was conducted at Microsoft Research.

Figure 1. A mother holds an energetic baby while rotating in front of a Kinect camera. Our system registers scans with large deformations into a unified surface model.

1. Introduction

With the availability of commodity depth cameras, 3D scanning has become mainstream, with applications in 3D printing, CAD, measurement and gaming. However, many existing 3D scanning systems (e.g. [14, 9]) use rigid alignment algorithms and thus require the object or scene being scanned to remain static. In many scenarios, such as reconstructing humans, particularly children, and animals, or in-hand scanning of soft and deformable objects such as toys, nonrigid movement is inevitable.

Recent work on nonrigid scanning is constrained by one or more of the following: 1) reliance on specific user motion (e.g. [24, 11, 18]); 2) the need for multiple cameras (e.g. [5]); 3) capture of a static pre-scan as a template prior (e.g. [29]); or 4) nonrigid alignment of partial static scans (e.g. [11]).

To address these issues we present a new 3D scanning system for arbitrary scenes, based on a single sensor, which allows for large deformations during acquisition. Further, our system avoids the need for any static capture, either as a template prior or for acquiring initial partial scans.

Our work uses a Kinect sensor which gives a partial, noisy scan of an object at each frame. Our goal is to combine all these scans into a complete high quality model (Fig. 1). Even for rigid alignment, the problem of drift occurs when aligning a sequence of partial scans consecutively: the alignment error accumulates quickly and the scan does not close seamlessly. Drift is more serious in the case of nonrigid alignment (Fig. 2(c)). KinectFusion alleviates some of this drift by aligning the current frame with the fused model instead of the previous frame [14].

Many follow-up systems based on KinectFusion have specifically looked at scanning humans (e.g., for 3D printing or generating avatars) where the user rotates in front of the Kinect while maintaining a roughly rigid pose, e.g., [24, 11, 18, 27, 7]. This highlights the fundamental issue when scanning living things – they ultimately move.

To make this problem more tractable, some systems make strong assumptions about the nonrigid object being a human, using either parametric models [24, 7] or limiting the user to certain poses such as a ‘T’ shape [3]. We wish to avoid such scene assumptions. Li et al. [11] adopt a more general nonrigid registration framework which can support a wider range of poses, clothing or even multiple users. This system demonstrates compelling results but relies on a very specific type of user interaction: the user moves in roughly 45 degree increments in front of the Kinect, and at each step
remains static, whilst the motorized sensor scans up and down. Each of these partial static scans is then nonrigidly registered and a global model reconstructed. Here the user is assumed to explicitly perform a loop closure at the end of the sequence. For certain energetic subjects, such as children or animals, who do not follow instructions well, such a usage scenario may be constraining.

Zeng et al. [27] show that when using nonrigid alignment to an embedded deformation (ED) graph model [15] for quasi-rigid motion, drift is greatly alleviated, and loop closure can be made implicit. However, for nonrigid motion, our experience (Fig. 6) shows that drift is still a serious problem even when scanning mildly deforming objects such as a turning head.

In this paper, we detect loop closures explicitly to handle severe drift without restricting user motion. However, dealing with such loop closures is only one piece of the puzzle, as this only evenly distributes error over the loop instead of minimizing the alignment residual. Thus, our pipeline also performs a dense nonrigid bundle adjustment to simultaneously optimize the final shape and the nonrigid parameters at each frame. We use loop closure to provide the initialization for the bundle adjustment step. Our experiments show that bundle adjustment gives improved data alignment and thus a high quality final model.

We summarize previous work in the next section and describe our surface and deformation models in Sec. 2 and 3. From Sec. 4 through Sec. 6, we explain the preprocessing procedures for bundle adjustment, including partial scan extraction, coarse scan alignment, and loop closure detection. Then we illustrate our bundle adjustment algorithm in Sec. 7. Finally, we show results in Sec. 8.

1.1. Related Work

Dou et al. [5] designed a system to scan dynamic objects with eight Kinect sensors, where drift is not a concern given that a relatively complete model is captured at each frame. Tong et al. [18] illustrated a full body scanning system with three Kinects. Their system uses a turntable to turn people around, but cannot handle large deformations. Other high-end multi-camera setups include [4, 20, 6, 21]. In our work we wish to move away from complex rigs, and support more lightweight and commodity consumer setups, using only a single off-the-shelf depth sensor.

More lightweight capture setups have been demonstrated, but they either still require complex lighting, more than one camera, or cannot generate high quality results [8, 12, 10, 23, 19, 26, 25].

More severe deformations can be handled with template-based systems. For example, Zollhofer et al. [29] first acquire a template of the scene under near-rigid motion using KinectFusion, and then adapt that template to non-rigid sequences. Even more specialized are systems based on human shape models [20, 24, 28]. The shape prior means they cannot scan general shapes, including even humans holding objects, or in unusual clothing. More general approaches either work on diverse (non-rigged) templates [8, 4, 12, 10], or use template-less spatio-temporal representations [13, 22, 17]. Instead, our system discovers the latent surface model without the need for an initial rigid scan or a statically captured template model. It also attempts to mitigate the drift inherent in non-template-based models.

2. Triangular Mesh Surface Model

Throughout this paper, we use a triangular mesh as our fundamental surface representation. We parameterize a triangle mesh by the set of 3D vertex locations V = {v_m}_{m=1}^M and the set of triangle indices T ⊂ {(i, j, k) : 1 ≤ i, j, k ≤ M}. We will also occasionally query the triangulation through the function N(m), which returns the indices of the vertices neighboring vertex m, or through the use of a variable τ ∈ T representing a single triangle face.

We will often need to label a mesh using a subscript (e.g., V_i), in which case we label the vertices with a corresponding superscript (e.g., v_m^i). A point on the surface itself is parameterized using a surface coordinate u = (τ, u, v), where τ ∈ T is a triangle index and (u, v) is a barycentric coordinate in the unit triangle. The position of this coordinate can then be evaluated using a linear combination of the vertices of τ as

    S(u; V) = u v_{τ1} + v v_{τ2} + (1 − u − v) v_{τ3}    (1)

and its surface normal computed as (with ⌊⌊x⌋⌋ := x/‖x‖)

    S^⊥(u; V) = ⌊⌊(v_{τ2} − v_{τ1}) × (v_{τ3} − v_{τ1})⌋⌋    (2)

3. Embedded Deformation Model

In general, we will want to allow our meshes to deform, for example to allow our surface reconstruction to explain the data in a depth sequence. Our desire to keep our algorithm agnostic to object class led us to choose the embedded deformation (ED) model of [15] to parameterize the nonrigid deformations of a mesh V. In this model, a set of K "ED nodes" are uniformly sampled throughout the mesh at a set of fixed locations {g_k}_{k=1}^K ⊂ R^3. Each vertex m is "skinned" to the deformation nodes by a set of fixed weights {w_mk}_{k=1}^K ⊂ [0, 1], where w_mk = (max(0, 1 − d(v_m, g_k)/d_max))^2 / w_sum, with d(v_m, g_k) the geodesic distance between the two, d_max the distance of v_m to its (c+1)-th nearest ED node, and 1/w_sum the normalization weight. Note that v_m is only influenced by its c nearest nodes (c = 4 in our experiments), since all other nodes have weight 0. The weighted deformation of the vertices surrounding g_k is parameterized by a local affine transformation A_k ∈ R^{3×3} and a translation t_k ∈ R^3.
Figure 2. Scanning pipeline. (a) input color and depth; (b) partial scans; (c) coarse-aligned scans; (d) LC-aligned scans; (e) LC-fused surface; (f) BA-optimized surface; (g) deformed surface. The input sequence has around 400 frames, which are fused into 40 partial scans (Sec. 4). Partial scans are consecutively placed in the reference pose to achieve the coarse alignment (Sec. 5). Next, loop closures are detected and the alignment is refined (Sec. 6); all the LC-aligned scans are fused volumetrically to get the LC-fused surface, which serves as the initialization for the following bundle adjustment stage (Sec. 7). As a by-product of the system, the reconstructed model can be deformed back to each frame.
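The surface parameterization of Sec. 2 (Eqs. 1 and 2) is compact enough to state directly in code. The following is a minimal sketch (function and variable names are ours, not from the authors' implementation):

```python
import numpy as np

def surface_point(u, V, T):
    """Evaluate S(u; V) of Eq. 1: barycentric combination of the
    vertices of triangle tau = T[t] with weights (u, v, 1-u-v)."""
    t, bu, bv = u                      # surface coordinate (tau, u, v)
    i, j, k = T[t]
    return bu * V[i] + bv * V[j] + (1.0 - bu - bv) * V[k]

def surface_normal(u, V, T):
    """Evaluate S^perp(u; V) of Eq. 2: normalized face normal, which is
    constant over the triangle (independent of (u, v))."""
    t, _, _ = u
    i, j, k = T[t]
    n = np.cross(V[j] - V[i], V[k] - V[i])
    return n / np.linalg.norm(n)

# Toy mesh: a single right triangle in the z = 0 plane.
V = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
T = [(0, 1, 2)]
p = surface_point((0, 0.25, 0.25), V, T)   # 0.25*v0 + 0.25*v1 + 0.5*v2
n = surface_normal((0, 0.25, 0.25), V, T)  # +z for this winding
```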

In addition, we follow [27, 11] in augmenting the deformation using a global rotation R ∈ SO(3) and translation T ∈ R^3. The precise location of vertex v_m deformed using the parameter set G = {R, T} ∪ {A_k, t_k}_{k=1}^K is

    ED(v_m; G) = R Σ_{k=1}^K w_mk [A_k (v_m − g_k) + g_k + t_k] + T    (3)

and its associated surface normal is

    ED^⊥(n_m; G) = ⌊⌊ R Σ_{k=1}^K w_mk A_k^{−T} n_m ⌋⌋    (4)

In addition, we allow the former functional to be applied to an entire mesh at a time to produce a deformed mesh ED(V; G) := {ED(v_m; G)}_{m=1}^M.

In general, we will want to find parameters that either exactly or approximately satisfy some constraints (e.g. ED(v_m; G) ≈ p_k ∈ R^3), and thus encode these constraints softly in an energy function E(G) (e.g. E(G) = ‖p_k − ED(v_m; G)‖^2). In order to prevent this model from using its tremendous amount of flexibility to deform in unreasonable ways, we follow the standard practice of regularizing the deformation by augmenting E(G) with

    E_rot(G) = Σ_{k=1}^K ‖A_k^T A_k − I‖_F + Σ_{k=1}^K (det(A_k) − 1)^2    (5)

which encourages local affine transformations to be rigid (reflection is eliminated by enforcing a positive determinant), and

    E_smooth(G) = Σ_{k=1}^K Σ_{j∼k} ‖A_j (g_k − g_j) + g_j + t_j − (g_k + t_k)‖^2    (6)

which encourages neighboring affine transformations to be similar. For clarity, we use E_reg(G) = α E_rot(G) + E_smooth(G) in later equations, where α = 10 in our experiments. In addition, rigidity is encouraged by penalizing the deformations at the ED nodes,

    E_rigid(G) = Σ_k ρ(‖A_k − I‖_F) + Σ_k ρ(‖t_k‖^2),    (7)

where ρ(·) is a robust kernel function. We minimize this energy using standard nonlinear least squares optimization [15, 5, 10].

4. Extracting Partial Scans

The first phase of our algorithm begins by preprocessing an RGBD sequence into a set of high quality, but only partial, scans {V_i}_{i=1}^N of the object of interest. Each of these segments is reconstructed from a small contiguous set of F frames using the method of [5] to fuse the depth data into a triangular mesh. These short segments can be reliably reconstructed using standard methods, in contrast to longer sequences where camera and reconstruction drift generally leave gross errors at loop closure boundaries. In addition, these segments compress the information contained in the full sequence, drastically reducing the computational complexity of fitting our surface model to the entire sequence as described in the following sections.

To reconstruct the partial scan for segment i, we begin by iteratively fusing data from each frame f ∈ {1, ..., F} into the reference frame, which is set as the first frame. This is trivially accomplished for frame 1, so for frame f ∈ {2, ..., F} we extract from the current volumetric representation of the reference frame the reference mesh V_i^1 and align it to frame f using an ED deformation with parameters G_i^f. Note that the parameters G_i^{f−1} can be used to initialize this optimization. We then observe the deformed mesh ED(V_i^1; G_i^f), and find a set of nearby points on V_i^f to establish a set of correspondences between V_i^f and V_i^1.
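To make the ED machinery of Sec. 3 concrete, here is a small sketch of the skinning weights and the deformation of Eq. 3. All names are ours, and Euclidean distance stands in for the geodesic distance used in the paper:

```python
import numpy as np

def skinning_weights(v, nodes, c=4):
    """w_mk = (max(0, 1 - d(v, g_k)/d_max))^2, normalized to sum to 1.
    d_max is the distance to the (c+1)-th nearest node, so only the c
    nearest nodes receive nonzero weight."""
    d = np.linalg.norm(nodes - v, axis=1)
    d_max = np.sort(d)[c]              # distance to the (c+1)-th nearest node
    w = np.maximum(0.0, 1.0 - d / d_max) ** 2
    return w / w.sum()

def ed_deform(v, nodes, A, t, R, T, w):
    """ED(v; G) of Eq. 3: blend the per-node affine transforms, then
    apply the global rotation R and translation T."""
    local = sum(w[k] * (A[k] @ (v - nodes[k]) + nodes[k] + t[k])
                for k in range(len(nodes)))
    return R @ local + T

# Sanity check: an identity graph (A_k = I, t_k = 0, R = I, T = 0)
# must leave the vertex unchanged, since the weights sum to 1.
rng = np.random.default_rng(0)
nodes = rng.standard_normal((6, 3))
A = [np.eye(3)] * 6
t = [np.zeros(3)] * 6
v = np.array([0.1, 0.2, 0.3])
w = skinning_weights(v, nodes)
out = ed_deform(v, nodes, A, t, np.eye(3), np.zeros(3), w)
```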
These correspondences can then be used to estimate a parameter set Ĝ_i^f that aligns V_i^f back to V_i^1 in the reference frame [15], and that can be used to volumetrically fuse the data from frame f into the reference frame (where V_i^1 lives). After completing this operation for all frames, a single surface V_i is extracted from the volumetric representation using marching cubes [5].

After this initial fusing, we have obtained a set of partially reconstructed segments {V_i}_{i=1}^N, each of which is a partial scan of the object of interest at a different time and in a different pose. Examples of partial scans are shown in Figure 2(b). Ultimately, we want all segments {V_i}_{i=1}^N to be explained by a single complete mesh V (we call it the latent mesh) and a set of ED graphs {G_i}_{i=1}^N that deform {V_i}_{i=1}^N to V. But it is not immediately clear where to get such a mesh, and how to get a good initial estimate of the deformation parameters required to achieve this. Instead, we proceed by deforming all segments into the reference pose, fusing the results together into a complete mesh, and using the deformations to provide a good initial guess for the parameters that minimize an appropriate energy.

5. Coarse Scan Alignment

In this section, we describe how we find deformation parameters G_i for each segment V_i so that a set of roughly aligned meshes {ED(V_i; G_i)}_{i=1}^N can be obtained in the reference pose (i.e. the pose of V_1). We first align each segment V_i to its immediate neighbor V_{i+1}, yielding a parameter set G_{i→i+1}, using the technique in [5]. This is effortless, as adjacent scans have similar poses and G_{i→i+1} can be initialized using the parameters already estimated by [5] when aligning the first frame to the last frame of segment i.

To obtain an alignment of segment V_{i+1} back to the reference frame, it is helpful to assume that we have already obtained such an alignment for segment V_i, which is trivial for i = 1. Then for each vertex v_m^i of mesh V_i, we find the nearest surface point v_{μ(m)}^{i+1} on V_{i+1} (closer than 1cm) to its deformed position ED(v_m^i; G_{i→i+1}). Similarly, the alignment parameter set G_i tells us that v_m^i should be located at ṽ_m^i = ED(v_m^i; G_i) in the reference frame. This process establishes a set of correspondences {⟨v_{μ(m)}^{i+1}, ṽ_m^i⟩}_{m=1}^M which provide constraints that can be used to estimate G_{i+1} using the standard ED alignment algorithm [15].

6. Error Redistribution

Naturally, the error in the propagation step accumulates, making the deformation parameter sets more and more unreliable as i increases. On the other hand, we assume that our sequence includes a loop closure, and thus there should be some later segments that match reasonably well to earlier segments. We would thus like to identify such pairs and establish rough constraints between them, in the form of correspondences, so that the deformations can be refined. To this end, we consider matching the aligned scan ED(V_i; G_i) against the aligned scans {ED(V_j; G_j)}_{j=1}^{i−K}, where K ≥ 1 restricts matching to frames with enough movement. To measure the overlap of a mesh V_i with a mesh V_j, we define the overlap ratio

    d(V_i, V_j) = (1/M_i) Σ_{m=1}^{M_i} I[ min_{m′} ‖v_m^i − v_{m′}^j‖ < δ ]    (8)

as the proportion of vertices in V_i that have a neighboring vertex in V_j within δ (we use δ = 4cm). We thus calculate d_ij = d(ED(V_i; G_i), ED(V_j; G_j)) and consider as possible candidates the set of scan indices J_i = {j : d_ij ≥ r_1, |i − j| > K, d_ij > d_{i,j−1}, d_ij > d_{i,j+1}}, i.e., the indices whose aligned scan is at least K indices away with a ‘peak’ overlap ratio of at least r_1. For any scan index j ∈ J_i, we then consider doing a more expensive, but more accurate, direct alignment of V_j to V_i using a set of ED parameters G_{j→i} [5]. If d(V_i, ED(V_j; G_{j→i})) ≥ r_2, we then find a set of correspondences C_ij ⊆ {1, ..., M_i} × {1, ..., M_j} such that for any (m, m′) ∈ C_ij, ‖v_m^i − ED(v_{m′}^j; G_{j→i})‖ is less than 1cm. We set C_ij = ∅ for any other pair of frames that did not pass this test. In our experiments we let r_1 = 30% and r_2 = 50%.

With these loop closing correspondences extracted, we use Li et al.'s algorithm [11] to re-estimate the ED graph parameters G = {G_i}_{i=1}^N by minimizing the energy

    min_G  λ_corr E_corr(G) + λ_reg Σ_i E_reg(G_i) + λ_rigid Σ_i E_rigid(G_i),    (9)

where

    E_corr(G) = Σ_{i=1}^N Σ_{j≠i} Σ_{(m,m′)∈C_ij} ‖ED(v_m^i; G_i) − ED(v_{m′}^j; G_j)‖^2.    (10)

After the set of deformation parameters G is estimated, we deform the scans accordingly and fuse them volumetrically to obtain a rough latent surface V. Fig. 2(c,d) and 3(b) show examples of scan alignment before and after loop closure.

7. Dense Nonrigid Bundle Adjustment

At this point, the above procedure has succeeded in giving us a rough surface representation of our object of interest, but the process has washed out the fine details that can be seen in the partial scans (see Fig. 2 and Fig. 8). This is largely a result of the commitment to a set of noisy correspondences used for error distribution. Eq. 9 does not aim to refine these correspondences, and thus misalignments are inevitable. As shown in Fig. 2, where large deformation exists, the misalignment is still visible where a loop closure has occurred, and the fused model looks flat and misses many details.
Figure 3. Scanning a person with slight deformation. (a) input color & depth; (b) before/after LC; (c) LC-fused surface; (d) BA-optimized surface; (e) KinectFusion. Before loop closure (LC), scans are poorly aligned. After LC, the surface is topologically correct but noisy. Bundle adjustment (BA) removes spurious noise without further smoothing details such as the shirt collar.

To improve both the data alignment and recover the fine details, we employ a bundle adjustment (BA) type technique to refine V so as to explain all the data summarized in the partial scans {V_i}_{i=1}^N. We parameterize the deformation that each partial scan V_i has to undergo to be explained by the reference V using a set of ED deformation parameters G_i. We then cast an energy E(V) over the latent mesh V as a combination of the following terms.

7.1. Deformation Terms

For each data point v_m^i in segment V_i, we expect that some ED graph G_i deforms it towards the latent mesh V, so that ED(v_m^i; G_i) gets explained by V. We thus add an energy term designed to encourage the distance of ED(v_m^i; G_i) to the latent surface to be small, and the normals to match. This term is

    E_data(V) = Σ_{i=1}^N min_{G_i} [ Σ_{m=1}^{M_i} min_u ( λ_data E_point(v_m^i; G_i, u, V) + λ_normal E_normal(n_m^i; G_i, u, V) ) + λ_reg E_reg(G_i) + λ_rigid E_rigid(G_i) ]

where

    E_point(v; G, u, V) = ‖ED(v; G) − S(u; V)‖^2    (11)

and

    E_normal(n; G, u, V) = ‖ED^⊥(n; G) − S^⊥(u; V)‖^2.    (12)

S(u; V) and S^⊥(u; V) are the corresponding point and normal of ED(v; G)¹ on the latent surface V, as explained in Section 2.

As we continue to use the ED deformation model, the terms E_reg(G_i) and E_rigid(G_i) continue to provide regularization for the ED graphs.

¹ Note that we do not set an ED graph G on the latent mesh V to deform V towards partial scan V_i and minimize Σ_{m=1}^{M_i} min_u ‖v_m^i − S(u; ED(V; G))‖^2, because this gives many unnecessary ED nodes, as V is complete and V_i is partial.

7.2. Surface Regularization Terms

In addition, we regularize the latent mesh using the Laplacian regularizer

    E_lap(V) = Σ_{m=1}^M ‖ v_m − (1/|N(m)|) Σ_{m′∈N(m)} v_{m′} ‖^2,    (13)

where N(m) is the set of indices of vertices that neighbor v_m. This term attracts a vertex to the centroid of its neighbors, penalizing unevenness of the surface, but has the potential to shrink the surface by dragging the set of boundary vertices inwards. We thus also add an energy term encouraging isometry:

    E_iso(V) = Σ_{m∈B} Σ_{m′∈N(m)} | ‖v_{m′} − v_m‖^2 − L_{mm′}^2 |^2,    (14)

where B ⊆ {1, ..., M} is the set of indices of such boundary vertices, and L_{mm′} is the length ‖v_{m′} − v_m‖ in the initial mesh.

7.3. Solving

Combining all of the above energy terms, we obtain the full energy

    E(V) = E_data(V) + λ_lap E_lap(V) + λ_iso E_iso(V)    (15)

that we seek to minimize. To deal with the inner minimizations, we follow the lead of [16, 29] in defining a set of latent variables, passing them through the sums, and rewriting the energy in terms of a lifted energy defined over these additional latent variables. In our case, we have the ED deformation parameter sets G = {G_i}_{i=1}^N and the surface coordinates U = {u_m^1}_{m=1}^{M_1} ∪ ... ∪ {u_m^N}_{m=1}^{M_N}, which allows us to obtain a lifted energy E′(V, G, U) such that

    E(V) = min_{G,U} E′(V, G, U) ≤ E′(V, G′, U′)    (16)

for any G′ and U′. We can thus minimize our desired energy by minimizing this lifted energy.
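The two surface regularizers of Sec. 7.2 (Eqs. 13 and 14) reduce to a few lines. A minimal sketch with dictionary-based adjacency (names are ours):

```python
import numpy as np

def laplacian_energy(V, neighbors):
    """E_lap of Eq. 13: squared distance of each vertex to the
    centroid of its neighbors."""
    e = 0.0
    for m, nbrs in neighbors.items():
        centroid = np.mean([V[n] for n in nbrs], axis=0)
        e += float(np.sum((V[m] - centroid) ** 2))
    return e

def isometry_energy(V, V0, boundary, neighbors):
    """E_iso of Eq. 14: penalize change of squared edge lengths around
    boundary vertices, relative to the initial mesh V0."""
    e = 0.0
    for m in boundary:
        for n in neighbors[m]:
            L2 = float(np.sum((V0[n] - V0[m]) ** 2))
            e += (float(np.sum((V[n] - V[m]) ** 2)) - L2) ** 2
    return e

# Toy 3-vertex chain: vertex 1 sits exactly at the centroid of its
# neighbors, so only the endpoints contribute to E_lap.
V0 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
nbrs = {0: [1], 1: [0, 2], 2: [1]}
e_lap = laplacian_energy(V0, nbrs)             # endpoints contribute 1 each
e_iso = isometry_energy(V0, V0, [0, 2], nbrs)  # unchanged mesh -> 0
```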
Figure 4. Top: partial scan alignment residuals during bundle adjustment (low resolution and high resolution BA). Bottom: two examples of aligned scans before and after BA. The cross sections of the scans are given in the middle.

Figure 5. Bundle adjustment iterations (low resolution, then full resolution). Top row: evolving surface model. Bottom: per-vertex residuals. Note the increase in detail of the right forearm and hand.

To minimize the lifted energy, we notice that all terms are in a sum-of-squares form. We thus use the Levenberg–Marquardt algorithm implemented in Ceres [1] to minimize E′(V, G, U). We initialize the latent mesh V using the coarse mesh recovered in the previous section, G using the corresponding ED parameter sets, and U by conducting a single closest point computation.

Note that even though the surface normal S^⊥(u; ·) is constant with respect to the barycentric coordinate u (an entire triangle on the latent surface shares the same normal vector), it does give constraints to the latent mesh and the ED graphs, which makes the latent surface smooth and improves the alignment.

Some special care has to be taken to allow the Levenberg–Marquardt algorithm to interact with a surface coordinate variable u ∈ U [16, 2]. Such a variable has the atypical parameterization u = (τ, u, v), where τ is discrete (a triangle ID) and (u, v) are real valued coordinates in the unit triangle. As the coordinate (u, v) will typically lie strictly within the unit triangle, τ remains constant locally, and only the Jacobians with respect to (u, v), which are well defined, are provided to the optimizer. When an update (u, v) ← (u, v) + δ(du, dv) is requested that would exit the unit triangle, the coordinate should first move the distance δ̂ to the edge of the triangle. The adjacent triangle τ′ is then looked up, a new direction (du′, dv′) and step size δ′ = δ − δ̂ are computed, and finally the procedure is called recursively after updating τ ← τ′, (du, dv) ← (du′, dv′) and δ ← δ′. Eventually the step size δ will be sufficiently small that an update does not need to leave a triangle.

8. Experiments

In the following experiments, we evaluate our method on a variety of RGBD sequences of various objects of interest. Each sequence is between 200 and 400 frames, and we fuse these volumetrically into 20 to 40 partial scans by fusing the data from each F = 10 frame subsegment. We set the size of the voxels in the fusion procedure to 2mm cubed when scanning a close object and 3mm cubed for objects at a further distance. This results in partial scans with around 100,000 vertices. When conducting nonrigid alignment for both partial scan extraction and alignment, ED nodes are sampled so as to remain roughly 5cm (measured in geodesic distance) from their neighbors. This endows each ED graph with roughly 150 to 200 nodes, depending on the dimensions of the object of interest.

After detecting the loop closure constraints and performing error redistribution, the aligned partial scans are volumetrically fused to get an initial latent mesh for the final bundle adjustment stage. We perform bilateral filtering on the volume data to ameliorate any misalignment. We also perform a simple remeshing to eliminate thin triangles on the initial latent mesh extracted with marching cubes, which makes the bundle adjustment numerically stable.

The bundle adjustment is the most expensive stage, given the huge number of parameters to be optimized in Eq. 16: roughly 5,000 graph nodes, 300,000 vertices of the latent mesh and three million surface coordinates. A limitation of this procedure is that the number of vertices on the latent mesh and its triangulation remain fixed throughout the bundle adjustment stage. Thus, if the initial mesh does not have the correct shape topology or has missing parts due to poor initial alignment, it is difficult for the bundle adjustment to recover the correct shape. To handle the above issues, we take a coarse-to-fine approach, running the bundle adjustment twice at different levels of detail. In the first run, a low resolution latent mesh is used, with an average distance between neighboring vertices of 1cm. The first run quickly converges and improves the partial scan alignments G significantly, from which a better initial latent mesh can be built. In the second run, we use the full resolution mesh, where the average distance between neighboring vertices is about 2mm. Initializing the parameters from the previous bundle adjustment, the vertices on the latent mesh do not need to move much along the tangent direction, so we constrain each vertex to move only as a displacement along the direction normal to the initial latent mesh, which reduces the number of parameters on the latent surface by nearly two thirds.
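The boundary-crossing rule for the barycentric step described in Sec. 7.3 hinges on computing the distance δ̂ at which the straight-line step (u, v) + t(du, dv) leaves the unit triangle. A minimal sketch of that computation, with the adjacent-triangle lookup and direction remapping omitted (names are ours):

```python
def step_to_boundary(u, v, du, dv):
    """Largest t >= 0 such that (u + t*du, v + t*dv) stays inside the
    unit triangle {u >= 0, v >= 0, u + v <= 1} (the delta-hat of
    Sec. 7.3)."""
    t = float('inf')
    if du < 0:
        t = min(t, -u / du)                  # would hit the edge u = 0
    if dv < 0:
        t = min(t, -v / dv)                  # would hit the edge v = 0
    if du + dv > 0:
        t = min(t, (1.0 - u - v) / (du + dv))  # would hit u + v = 1
    return t

def walk(u, v, du, dv, delta):
    """Advance a barycentric coordinate by step size delta, stopping at
    the triangle boundary. A full implementation would then look up the
    adjacent triangle, remap the direction, and recurse with the
    remaining step delta - t, as described in Sec. 7.3."""
    t = min(delta, step_to_boundary(u, v, du, dv))
    return u + t * du, v + t * dv, delta - t
```

For an interior step the full budget is spent and the remainder is zero; a step that hits an edge returns the unspent remainder for the recursion into the adjacent triangle.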
Figure 6. KinectFusion with nonrigid alignment. The accumulated surfaces after fusing 10, 30, 50, 70 and 90 frames are shown. Note the nose gets blurred at the end.

In this scheme, only a single displacement parameter per vertex, instead of three, is required to parameterize the full 3D position.

Fig. 5 illustrates the intermediate latent surfaces together with the alignment residual at each BA iteration; the alignment error is computed for each vertex on the latent surface as its average distance to the deformed partial scans. Fig. 4 plots the average alignment residuals during BA (including both accepted and rejected BA iterations) on various data sequences. The alignment error typically goes down from 3mm to less than 1mm. Examples of aligned scans before and after BA are also given in Fig. 4, where the cross sections of the scans are shown to demonstrate the alignment quality and the bundle adjustment's ability to recover the true structure of the object.

8.1. Comparison with KinectFusion

Our system is designed for dynamically moving objects, but it still works in more restricted cases such as rigid scenes (i.e. scanning static objects). Fig. 7 shows the comparison in reconstruction quality between our method and KinectFusion on a static mannequin. To compare the two systems quantitatively, we first generate a 3D model of the mannequin, which serves as the ground truth, and then synthesize a sequence of depth maps and color images by moving a virtual camera around the 3D model. We run our algorithm and KinectFusion on the synthetic data. As shown in Fig. 7, both systems give appealing reconstructions which are faithful to the ground truth. KinectFusion has an average reconstruction error of 0.94mm vs. 1.21mm for our system. Our system has lower residuals on the side that is observed by the reference frame (1st row in Fig. 7; the error map uses the same scale as Fig. 5), while it has higher residuals on the other side (2nd row in Fig. 7) due to the flexibility introduced by the nonrigid alignment. Naturally, we don't expect to outperform a method that exploits the rigidity of this scene, but we are satisfied that our system can get similar results without requiring such assumptions.

Figure 7. Rigid scanning: ground truth, KinectFusion, ours.

In contrast though, KinectFusion fails in dynamic cases. Fig. 3 shows the reconstruction results of KinectFusion on a sequence with slight head movement. Replacing ICP in KinectFusion with the nonrigid alignment algorithm [5] does not result in a reasonable reconstruction either. As shown in Figure 6, when non-rigidly fusing more than 30 frames, the drifting artifacts result in a blurred nose.

8.2. Comparison with 3D Self-portraits

3D self-portraits [11] is among the first systems with the capability to scan a dynamic object with a single consumer sensor. We want to stress that our system handles continuously deforming objects, while 3D self-portraits first reconstructs eight static scans and then non-rigidly fuses them. This difference prevents us from comparing the two systems quantitatively, but we show the reconstructed models of the same person from the two systems side by side in Fig. 9. The software Shapifyme, which implements 3D self-portraits, appears to heavily smooth the reconstructions, and our implementation of 3D self-portraits gives more detailed reconstructions. We then ran the bundle adjustment algorithm of Sec. 7 on the eight scans, and found that it improves the reconstruction further, showing another advantage of our approach. Compared with 3D self-portraits, our system allows continuous movement and recovers more facial details.

Figure 9. Comparison with 3D Self-Portraits. Scanning results of (a) shapifyme; (b) 3D self-portraits implemented by us; (c) BA-optimized 3D self-portraits; (d) our system.

8.3. Synthetic sequence

We tested our system on the Saskia dataset [21], which contains dramatic deformations. The original sequence has a roughly complete model at each frame, and thus we synthesize one depth map and color image from each frame with a virtual camera rotating around the subject. Our reconstruction system results in a shape in a reference pose (i.e. the latent mesh V), as shown on the left of Fig. 10. To measure alignment error, we then deform V to each frame and compute the distance from the frame data. To achieve this, a backward ED graph G̃_i from V to each partial scan V_i is first computed using correspondences. The deformations from partial scan V_i to the frames in segment i have already
Figure 8. Top: reconstruction after loop closure. Bottom: final reconstruction after bundle adjustment.

Figure 10. Alignment error on the Saskia dataset. The first shape in each triple is the deformed reconstructed surface, the second is the ground truth, and the third shows the alignment error (same scale as Fig. 5). The per-frame alignment error is drawn at the bottom.

been computed as explained in Section 4, so we first deform V to each partial scan's pose and then to each frame's pose. The alignment error is then measured between the deformed reconstruction and the synthesized depth map. We draw the alignment error at each frame at the bottom of Fig. 10.

The Saskia sequence poses a particular challenge, as the topology changes when the dress touches the legs. This introduces some artifacts on the legs in the reconstructed latent mesh V and also gives some problems in the deformed latent mesh in each frame's pose.

8.4. Scanned examples

Fig. 3 shows a sequence with small deformations. The loop closure technique described in Sec. 6 reconstructs a reasonable model, but some artifacts exist due to misalignment, giving a poor reconstruction (e.g. the arm is unrealistically thin). During bundle adjustment, the arm gradually expands as optimization iterations are performed until it is a realistic size (see Fig. 5).

We tested our system on several situations, including full body scans and upper body scans. We also tried to scan objects other than human beings. Fig. 8 shows some scan examples. In all the scans that we performed, the Kinect sensor is mounted on a tripod, and we let people turn around freely in front of it or, in the case of an object, let it be rotated by the "director" of the scene.

9. Conclusions

We have presented a system which merges a sequence of images from a single range sensor into a unified 3D model, without requiring an initial template. In contrast to previous systems, a wider range of deformations can be handled, including wriggling children. Some limitations remain, however. First, although complex scene topologies can be handled, the topology is restricted to be constant throughout the sequence, and if the coarse-scale reconstruction does not correctly choose the topology, it cannot currently be changed at the fine scale.

The computational cost is also high. We run our experiments on a desktop PC with an 8-core 3.0GHz Intel Xeon CPU and 64GB of memory. For a sequence with 400 frames, the partial scan preprocessing stage takes around 30 seconds per frame, the initial alignment and loop closure detection take about 1 hour, and the final bundle adjustment up to 5 hours. However, these results are using only lightly optimized im-
plementations, and if we were to assume the user intends to
ment. Our bundle adjustment technique in Sec. 7, however,
3D print a “shelfie”, the 3D printing process will itself take
improves the reconstruction. Another example with consid-
a considerable time. Even if the goal is to upload the model
erable deformations is shown in Fig. 2, where the loop clo-
for use in a game, an overnight process remains valuable.
sure gives a problematic alignment of the partial scans and
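The alignment-error evaluation in Section 8, which warps the latent mesh V through an embedded-deformation (ED) graph [15] and measures distances against a frame's depth data, can be sketched as follows. This is an illustrative sketch under our own naming conventions, not the paper's implementation; the neighbor lists, weights, and per-node transforms are assumed to be given (e.g., from the correspondences described above).

```python
import numpy as np

def deform_vertices(verts, nodes, A, t, weights, nbrs):
    """Warp vertices with an embedded-deformation (ED) graph [15]:
    each vertex v maps to sum_j w_j * (A_j (v - g_j) + g_j + t_j),
    where g_j, A_j, t_j are a graph node's position, rotation, and
    translation, and the weights w_j blend the vertex's nearest nodes."""
    out = np.zeros_like(verts)
    for k in range(nbrs.shape[1]):
        j = nbrs[:, k]                      # k-th nearest node per vertex
        w = weights[:, k][:, None]          # its blending weight
        out += w * (np.einsum('nij,nj->ni', A[j], verts - nodes[j])
                    + nodes[j] + t[j])
    return out

def alignment_error(deformed_verts, scan_points):
    """Mean nearest-neighbor distance from the deformed reconstruction
    to a frame's point cloud (the back-projected depth map)."""
    diff = deformed_verts[:, None, :] - scan_points[None, :, :]
    return np.sqrt((diff ** 2).sum(-1)).min(axis=1).mean()
```

With identity rotations and zero translations the warp is the identity, and the error against the same point cloud is zero. In the evaluation above, the warp from V would be applied twice in sequence: first through the backward graph G̃i to the partial scan Vi, then through that segment's per-frame deformation.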
References

[1] S. Agarwal, K. Mierle, and Others. Ceres solver. http://ceres-solver.org. 6
[2] T. J. Cashman and A. W. Fitzgibbon. What shape are dolphins? Building 3D morphable models from 2D images. IEEE TPAMI, 35(1):232–244, 2013. 6
[3] Y. Cui, W. Chang, T. Nöll, and D. Stricker. KinectAvatar: fully automatic body capture using a single Kinect. In Computer Vision–ACCV 2012 Workshops, pages 133–147. Springer, 2013. 1
[4] E. de Aguiar, C. Stoll, C. Theobalt, N. Ahmed, H.-P. Seidel, and S. Thrun. Performance capture from sparse multi-view video. ACM TOG (Proc. SIGGRAPH), 27:1–10, 2008. 2
[5] M. Dou, H. Fuchs, and J.-M. Frahm. Scanning and tracking dynamic objects with commodity depth cameras. In Proc. ISMAR, pages 99–106. IEEE, 2013. 1, 2, 3, 4, 7
[6] J. Gall, C. Stoll, E. De Aguiar, C. Theobalt, B. Rosenhahn, and H.-P. Seidel. Motion capture using joint skeleton tracking and surface estimation. In Proc. CVPR. IEEE, 2009. 2
[7] T. Helten, A. Baak, G. Bharaj, M. Muller, H.-P. Seidel, and C. Theobalt. Personalization and evaluation of a real-time depth-based full body tracker. In Proc. 3DV, pages 279–286, 2013. 1
[8] C. Hernández, G. Vogiatzis, G. J. Brostow, B. Stenger, and R. Cipolla. Non-rigid photometric stereo with colored lights. In Proc. ICCV, pages 1–8. IEEE, 2007. 2
[9] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, et al. KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera. In Proc. ACM UIST, pages 559–568. ACM, 2011. 1
[10] H. Li, B. Adams, L. J. Guibas, and M. Pauly. Robust single-view geometry and motion reconstruction. ACM TOG (Proc. SIGGRAPH Asia), 28(5), December 2009. 2, 3
[11] H. Li, E. Vouga, A. Gudym, L. Luo, J. T. Barron, and G. Gusev. 3D self-portraits. ACM TOG, 32(6):187, 2013. 1, 3, 4, 7
[12] M. Liao, Q. Zhang, H. Wang, R. Yang, and M. Gong. Modeling deformable objects from a single depth camera. In Proc. ICCV, pages 167–174. IEEE, 2009. 2
[13] N. J. Mitra, S. Flöry, M. Ovsjanikov, N. Gelfand, L. J. Guibas, and H. Pottmann. Dynamic geometry registration. In Proc. SGP, pages 173–182, 2007. 2
[14] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Proc. ISMAR, pages 127–136. IEEE, 2011. 1
[15] R. W. Sumner, J. Schmid, and M. Pauly. Embedded deformation for shape manipulation. In SIGGRAPH, 2007. 2, 3, 4
[16] J. Taylor, R. Stebbing, V. Ramakrishna, C. Keskin, J. Shotton, S. Izadi, A. Hertzmann, and A. Fitzgibbon. User-specific hand modeling from monocular depth sequences. In Proc. CVPR. IEEE, 2014. 5, 6
[17] A. Tevs, A. Berner, M. Wand, I. Ihrke, M. Bokeloh, J. Kerber, and H.-P. Seidel. Animation cartography: intrinsic reconstruction of shape and motion. ACM TOG, 31(2):12, 2012. 2
[18] J. Tong, J. Zhou, L. Liu, Z. Pan, and H. Yan. Scanning 3D full human bodies using Kinects. TVCG, 18(4):643–650, 2012. 1, 2
[19] L. Valgaerts, C. Wu, A. Bruhn, H.-P. Seidel, and C. Theobalt. Lightweight binocular facial performance capture under uncontrolled lighting. ACM TOG (Proc. SIGGRAPH Asia), 31(6):187, November 2012. 2
[20] D. Vlasic, I. Baran, W. Matusik, and J. Popović. Articulated mesh animation from multi-view silhouettes. ACM TOG (Proc. SIGGRAPH), 2008. 2
[21] D. Vlasic, P. Peers, I. Baran, P. Debevec, J. Popović, S. Rusinkiewicz, and W. Matusik. Dynamic shape capture using multi-view photometric stereo. ACM TOG, 28(5):174, 2009. 2, 7
[22] M. Wand, B. Adams, M. Ovsjanikov, A. Berner, M. Bokeloh, P. Jenke, L. Guibas, H.-P. Seidel, and A. Schilling. Efficient reconstruction of nonrigid shape and motion from real-time 3D scanner data. ACM TOG, 28:15, 2009. 2
[23] T. Weise, S. Bouaziz, H. Li, and M. Pauly. Realtime performance-based facial animation. ACM TOG, 30(4):77, 2011. 2
[24] A. Weiss, D. A. Hirshberg, and M. J. Black. Home 3D body scans from noisy image and range data. In Proc. ICCV, 2011. 1, 2
[25] C. Wu, C. Stoll, L. Valgaerts, and C. Theobalt. On-set performance capture of multiple actors with a stereo camera. ACM TOG, 32(6):161, 2013. 2
[26] G. Ye, Y. Liu, N. Hasler, X. Ji, Q. Dai, and C. Theobalt. Performance capture of interacting characters with handheld Kinects. In Proc. ECCV, pages 828–841. Springer, 2012. 2
[27] M. Zeng, J. Zheng, X. Cheng, and X. Liu. Templateless quasi-rigid shape modeling with implicit loop-closure. In Proc. CVPR, pages 145–152. IEEE, 2013. 1, 2, 3
[28] Q. Zhang, B. Fu, M. Ye, and R. Yang. Quality dynamic human body modeling using a single low-cost depth camera. In Proc. CVPR, pages 676–683. IEEE, 2014. 2
[29] M. Zollhöfer, M. Nießner, S. Izadi, C. Rehmann, C. Zach, M. Fisher, C. Wu, A. Fitzgibbon, C. Loop, C. Theobalt, and M. Stamminger. Real-time non-rigid reconstruction using an RGB-D camera. ACM TOG, 33(4), 2014. 1, 2, 5
