3D Scanning Deformable Objects With A Single RGBD Sensor
3D Scanning Deformable Objects With A Single RGBD Sensor
3D Scanning Deformable Objects With A Single RGBD Sensor
Mingsong Dou∗1 , Jonathan Taylor2 , Henry Fuchs1 , Andrew Fitzgibbon2 and Shahram Izadi2
1
Department of Computer Science, UNC-Chapel Hill
2
Microsoft Research
Abstract
1
remains static, whilst the motorized sensor scans up and on human shape models [20, 24, 28]. The shape prior
down. Each of these partial static scans are then nonrigidly means they cannot scan general shapes, including even hu-
registered and a global model reconstructed. Here the user mans holding objects, or in unusual clothing. More general
is assumed to explicitly perform a loop closure at the end of approaches either work on diverse (non-rigged) templates
sequence. For certain energetic subjects, such as children or [8, 4, 12, 10], or use template-less spatio-temporal repre-
animals, who do not follow instructions well, such a usage sentations [13, 22, 17]. Instead our system discovers the
scenario may be constraining. latent surface model without the need for an initial rigid
Zeng et al. [27] show that when using nonrigid align- scan or statically captured template model. It also attempts
ment to an embedded deformation (ED) graph model [15] to mitigate the drift inherent in non-template-based models.
for quasi-rigid motion, drift is greatly alleviated, and loop-
closure can be made implicit. However, for nonrigid mo- 2. Triangular Mesh Surface Model
tion, our experience (Fig. 6) shows that drift is still a seri-
Throughout this paper, we use a triangular mesh as our
ous problem even when scanning mildly deforming objects
fundamental surface representation. We parameterize a tri-
such as a turning head.
angle mesh by the set of 3D vertex locations V = {vM m=1 }
In this paper, we detect loop closures explicitly to han-
and the set of triangle indices T ⊂ {(i, j, k) : 1 ≤ i, j, k ≤
dle severe drift without restricting user motions. However,
M }. We will also occasionally query the triangulation
dealing with such loop closures, is only one piece of the
through the function N (m) which returns the indices of the
puzzle, as this only evenly distributes error over the loop
vertices neighboring vertex m, or through the use of a vari-
instead of minimizing the alignment residual. Thus, our
able τ ∈ T representing a single triangle face.
pipeline also performs a dense nonrigid bundle adjustment
We will often need to label a mesh using a subscript
to simultaneously optimize the final shape and nonrigid pa-
(e.g., Vi ) in which case we will label the vertices with a
rameters at each frame. We use loop closure to provide
corresponding superscript (e.g., vim ). Indeed, a point on the
the initialization for the bundle adjustment step. Our ex-
surface itself is parameterized using a surface coordinate
periments show that bundle adjustment gives improved data
u = (τ, u, v) where τ ∈ T is a triangle index and (u, v)
alignment and thus a high quality final model.
is a barycentric coordinate in the unit triangle. The posi-
We will summarize previous work in the next section and
tion of this coordinate can then be evaluated using a linear
describe our surface and deformation model in Sec. 2&3.
combination of the vertices in τ as
From Sec. 4 through Sec. 6, we explain the preprocess-
ing procedures for bundle adjustment, including partial scan S(u; V) = uvτ1 + vvτ2 + (1 − u − v)vτ3 (1)
extraction, coarse scan alignment, and loop closure detec-
tion. Then we illustrate our bundle adjustment algorithm in and its surface normal computed as (with bbxcc := x/kxk)
Sec. 7. Finally, we show results in Sec. 8.
S ⊥ (u; V) = bb(vτ2 − vτ1 ) × (vτ3 − vτ1 )cc (2)
1.1. Related Work
Dou et al. [5] designed a system to scan dynamic objects
3. Embedded Deformation Model
with eight-Kinect sensors, where drift is not a concern given In general, we will want to allow our meshes to deform,
that a relatively complete model is captured at each frame. for example to allow our surface reconstruction to explain
Tong et al. [18] illustrated a full body scanning system with the data in a depth sequence. Our desire to keep our al-
three Kinects. Their system uses a turntable to turn people gorithm agnostic to object class led us to choose the em-
around, but cannot handle large deformations. Other high- bedded deformation (ED) model of [15] to parameterize
end multi-camera setups include [4, 20, 6, 21]. In our work the non-rigid deformations of a mesh V. In this model,
we wish to move away from complex rigs, and support more a set of K “ED nodes” are uniformly sampled through-
lightweight and commodity consumer setups, using only a out the mesh at a set of fixed locations {gk }Kk=1 ⊆ R .
3
single off-the-shelf depth sensor. Each vertex m is “skinned” to the deformation nodes by
More lightweight capture setups have been demon- a set of fixed weights {wmk }Kk=1 ⊆ [0, 1], where wmk =
strated, but either still require complex lighting, more (max(0, 1 − d(vm , gk )/dmax ))2 /wsum with d(vm , gk ) the
than one camera, or cannot generate high quality results geodesic distance between the two, dmax the distance of
[8, 12, 10, 23, 19, 26, 25]. vm to its c+1-th nearest ED node, and w1sum the normal-
More severe deformations can be handled with template- ization weight. Note vm is only influenced by its c near-
based systems. For example, Zollhofer et al. [29] first ac- est nodes (c = 4 in our experiments) since other nodes
quire a template of the scene under near-rigid motion us- have weights 0. The weighted deformation of the vertices
ing Kinect fusion, and then adapt that template to non- surrounding gk is parameterized by a local affine transfor-
rigid sequences. Even more specialized are systems based mation Ak ∈ R3×3 and translation tk ∈ R3 . In addition,
(a) input color and depth (b) partial scans (c) coarse-aligned scans (d) LC-aligned scans (e) LC-fused surface (f) BA-optimized surface (g) deformed surface
Figure 2. Scanning pipeline. The input sequence has around 400 frames which are fused into 40 partial scans (Sec. 4). Partial scans
are consecutively placed in the reference pose to achieve the coarse alignment (Sec. 5). Next, loop closures are detected and alignment is
refined (Sec. 6); all the LC-aligned scans are fused volumetrically to get the LC-fused surface which serves as the initial for the following
bundle adjustment stage (Sec. 7). As a by-product of the system, the reconstructed model can be deformed back to each frame.
we follow [27, 11] in augmenting the deformation using a Esmooth (G) in later equations, where α = 10 in our experi-
global rotation R ∈ SO(3) and translation T ∈ R3 . The ments. In addition, rigidity is encouraged by penalizing the
precise location of vertex vm deformed using the parameter deformations at ED nodes,
set G = {R, T } ∪ {Ak , tk }K
k=1 is X X
ρ ktk k2 , (7)
Erigid (G) = ρ (kAk − IkF ) +
K
X k k
ED(vm ; G) = R wmk [Ak (vm − gk ) + gk + tk ] + T
k=1 where ρ(·) is a robustness kernel function. We minimize
(3) this energy using standard nonlinear least squares optimiza-
and its associated surface normal is: tion [15, 5, 10].
$$ K %%
4. Extracting Partial Scans
X
⊥ −T
ED (nm ; G) = R wmk Ak nm . (4)
k=1
The first phase of our algorithm begins by preprocessing
In addition, we allow the former functional to be applied an RGBD sequence into a set of high quality, but only par-
to an entire mesh at a time to produce a deformed mesh tial, scans {Vi }Ni=1 of the object of interest. Each of these
ED(V; G) := {ED(vm ; G)}M m=1 .
segments is reconstructed from a small contiguous set of F
In general, we will want to find parameters that ei- frames using the method of [5] to fuse the depth data into
ther exactly or approximately satisfy some constraints (e.g. a triangular mesh. These short segments can be reliably
ED(vm ; G) ≈ pk ∈ R3 ), and thus encode these con- reconstructed using standard methods, in contrast to larger
straints softly in an energy function E(G) (e.g. E(G) = sequences where camera and reconstruction drift generally
kpk − ED(vm ; G)k2 ). In order to prevent this model from leaves gross errors at loop closure boundaries. In addition,
using its tremendous amount of flexibility to deform in un- these segments compress the information contained in the
reasonable ways, we follow the standard practice of regu- full sequence, drastically reducing the computational com-
larizing the deformation by augmenting E(G) with plexity of fitting our surface model to the entire sequence as
described in following sections.
K K
X X To reconstruct the partial scan for segment i, we begin
Erot (G) = kATk Ak − IkF + (det(Ak ) − 1)2 , (5)
by iteratively fusing data from each frame f ∈ {1, ..., F }
k=1 k=1
into the reference frame which is set as the first frame.
that encourages local affine transformations to be rigid (re- This is trivially accomplished for frame 1, so for frame
flection is eliminated by enforcing a positive determinant), f ∈ {2, ..., F } we extract from the current volumetric rep-
and resentation of the reference frame, the reference mesh Vi1
K X and align it to frame f using an ED deformation with pa-
rameters Gfi . Note that the parameters Gif −1 can be used to
X
Esmooth (G) = kAj (gk −gj )+gj +tj −(gk +tk )k2 ,
k=1 j∼k initialize this optimization. We then observe the deformed
(6) mesh ED(Vi1 ; Gfi ), and find a set of nearby points on Vif
that encourages neighboring affine transformations to be to establish a set of correspondences between Vif and Vi1 .
similar. For clarity, we use Ereg (G) = αErot (G) + These correspondences can then be used to estimate a pa-
rameter set Ĝfi that aligns Vif back to Vi1 in the reference refined. To this end, we consider matching the aligned scan
frame [15] and that can be used to volumetrically fuse the ED(Vi ; Gi ) against the aligned scans {ED(Vj ; Gj )}i−K
j=1 ,
data from frame f into the reference frame (where Vi1 lives). where K ≥ 1 restricts to frames with enough movement.
After completing this operation for all frames, a single sur- To measure the overlap of a mesh Vj and a mesh Vi , we
face Vi is extracted from the volumetric representation us- define the overlap ratio
ing marching cubes [5]. M
After this initial fusing, we have obtained a set of par- 1 Xi
d(Vi , Vj ) = I[min kvim − vjm0 k < δ] (8)
tially reconstructed segments {Vi }N i=1 , each of which is a Mi m=1 m0
partial scan of the object of interest at a different time and
in a different pose. Examples of partial scans are shown in as the proportion of vertices in Vi that have a neighboring
Figure 2(b). Ultimately, we want all segments {Vi }N vertex in Vj within δ (we use δ =4cm). We thus calculate
i=1 to
be explained by a single complete mesh V (we call it the dij = d(ED(Vi ; Gi ), ED(Vj ; Gj )) and consider as possi-
latent mesh) and a set of ED graphs {Gi }N ble candidates, the set of scan indices Ji = {j : dij ≥
i=1 that deforms
{Vi }Ni=1 to V. But it is not immediately clear where to get r1 , |i − j| > K, dij > dij−1 , dij > dij+1 }, the indices whose
such a mesh, and how to get a good initial estimate of the aligned scan is at least K indices away with a ‘peak’ over-
deformation parameters required to achieve this. Instead, lap ratio of at least r1 . For any scan index j ∈ Ji , we
we proceed by deforming all segments into the reference then consider doing a more expensive, but more accurate,
pose, fusing the results together into a complete mesh, and direct alignment of Vj to Vi using a set of ED parameters
using the deformations to provide a good initial guess for Gj→i [5]. If d(Vi , ED(Vj , Gji )) ≥ r2 we then find a set of
the parameters that minimize an appropriate energy. correspondences Cij ⊆ {1, ..., Mi } × {1, ..., Mj } for which
for any (m, m0 ) ∈ Cij , we have thatkvim −ED(vjm0 , Gj→i )k
5. Coarse Scan Alignment is less than 1 cm. We set Cij = ∅ for any other pairs of
frames that did not pass this test. In our experiment we let
In this section, we describe how we find deformation pa- r1 = 30%, r2 = 50%.
rameters Gi for each segment Vi so that a set of roughly With these loop closing correspondences extracted, we
aligned meshes {ED(Vi ; Gi )}N i=1 can be obtained in the use Li et al.’s algorithm [11] to re-estimate ED graph pa-
reference pose (i.e. pose of V1 ). We first align each segment rameters G = {Gi }N i=1 , by minimizing the energy;
Vi to its immediate neighbor Vi+1 yielding a parameter set P P
Gi→i+1 by using the technique in [5]. This is effortless min λcorr Ecorr (G) + λreg Ereg (Gi ) + λrigid Erigid (Gi ),
G i i
as adjacent scans have similar poses and the Gi→i+1 can (9)
be initialized using the parameters already estimated by [5] where
when aligning the first frame to the last frame of segment i.
N
To obtain an alignment of segment Vi+1 back to the ref- X X
Ecorr (G) = kED(vim ; Gi ) − ED(vjm0 ; Gj )k2 .
erence frame, it is helpful to assume that we have already
i=1 (m,m0 )∈Cij
obtained such an alignment for segment Vi , which is trivial j=1
i j6=i
for i = 1. Then for each vertex vm of mesh Vi , we find the (10)
i+1
nearest surface point vµ(m) on Vi+1 (closer than 1cm) to its After the set of deformation parameters G is estimated, we
deformed position ED(vim , Gi→i+1 ). Similarly, the align- deform the scans accordingly and fuse them volumetrically
ment parameter set Gi tells us that vim should be located at to obtain a rough latent surface V. Fig. 2(c,d)&3(b) show
i
ṽm = ED(vm ; Gi ) in the reference frame. This process examples of scan alignment before and after loop closure.
i+1 i
establishes a set of correspondences {hvµ(m) , ṽm i}M
m=1
which provide constraints that can be used to estimate Gi+1 7. Dense Nonrigid Bundle Adjustment
using the standard ED alignment algorithm [15].
At this point, the above procedure has succeeded in giv-
6. Error Redistribution ing us a rough surface representation of our object of inter-
est, but the process has washed out the fine details that can
Naturally, the error in the propagation step accumulates, be seen in the partial scans (see Fig. 2 and Fig. 8). This is
making the deformation parameter sets more and more un- largely a result of the commitment to a set of noisy corre-
reliable as i increases. On the other hand, we assume that spondences used for error distribution. Eq. 9 does not aim
our sequence includes a loop closure and thus there should to refine these correspondences, and thus misalignments are
be some later segments that could match reasonably well inevitable. As shown in Fig. 2 where large deformation ex-
to earlier segments. We would thus like to identify such ists, the misalignment is still visible where a loop closure
pairs and establish rough constraints between them, in the has occurred, and the fused model looks flat and misses
form of correspondences, so that the deformations can be many details.
(a) input color & depth (b) Before/After LC (c) LC-fused surface (d) BA-optimized surface (e) KinectFusion
Figure 3. Scanning a person with slight deformation. Before loop closure (LC), scans are poorly aligned. After LC, the surface is
topologically correct but noisy. Bundle Adjustment (BA) removes spurious noise without further smoothing details such as the shirt collar.
To improve both the data alignment and recover the fine 7.2. Surface Regularization Terms
details we employ a bundle adjustment (BA) type technique
In addition, we regularize the latent mesh using the
to refine V as to explain all the data summarized in the par-
Laplacian regularizer
tial scans {Vi }N
i=1 . We parameterize the deformation that
each partial scan Vi has to undergo to be explained by the M
X 1 X
reference V using a set of ED deformation parameters Gi . Elap (V) = kvm − vm0 k2 , (13)
|N (m)|
We then cast an energy E(V) over the latent mesh V as a m=1 m0 ∈N (m)
combination of the following terms.
where N (m) is the set of indices of vertices that neighbor
7.1. Deformation Terms vm . This term attracts a vertex to the centroid of its neigh-
bors, penalizing unevenness of the surface, but has the po-
For each data point vim in segment Vi , we expect that
tential to shrink the surface by dragging the set of boundary
some ED graph Gi deforms it towards the latent mesh V and
vertices inwards. We thus also add a energy term encourag-
ED(vim ; Gi ) gets explained by V. We thus add an energy
ing isometry as
term designed to encourage the distance of ED(vim ; Gi ) to
the latent surface to be close, and for the normal to match.
X X
Eiso (V) = |kvm0 − vm k2 − L2mm0 |2 (14)
This term is m∈B m0 ∈N (i)
N Mi
X X where B ⊆ {1, ..., M } is the set of indices of such boundary
Edata (V) = min min λdata Epoint (vim ; Gi , u, V)+
Gi u vertices, and Lmm0 is the length kvm0 − vm k in the initial
i=1 m=1
mesh.
λnormal Enormal (nim ; Gi , u, V)
+ λreg Ereg (Gi ) + λrigid Erigid (Gi ) 7.3. Solving
Combining all of the above energy terms, we obtain the
where
full energy
Epoint (v; G, u, V) = kED(v; G) − S(u; V)k2 (11)
E(V) = Edata (V) + λlap Elap (V) + λiso Eiso (V) (15)
and
that we seek to minimize. To deal with the inner minimiza-
Enormal (n; G, u, V) = kED(n; G) − S ⊥ (u; V)k2 . (12) tions, we follow the lead of [16, 29] of defining a set of
latent variables, passing them through the sums, and rewrit-
S(u; V) and S ⊥ (u; V) are corresponding point and normal ing the energy in terms of a lifted energy defined over these
of ED(v; G)1 in the latent surface V, which we have ex- additional latent variables. In our case, we have the ED
plained in Section 2. deformation parameter sets G = {Gi }N i=1 and the surface
As we continue to use the ED deformation model, the M1 m MN
coordinates U = {um }
1 m=1 ∪ ... ∪ {u }
N m=1 , which allows
terms Ereg (G) and Erigid (G) continue to provide regulariza- us to obtain a lifted energy E 0 (V, G, U) such that
tion for ED graphs.
1 Note
E(V) = min E 0 (V, G, U) ≤ E 0 (V, G 0 , U 0 ) (16)
that we do not set ED graph Gi on the latent mesh V to G,U
PMi i
deform V towards partial scan Vi and minimize i=1 minu kvm −
2
S(u; ED(V; G))k , because this gives many unnecessary ED nodes as for any G 0 and U 0 . We can thus minimize our desired energy
V is complete and Vi is partial. by minimizing this lifted energy and to this end, we notice
Low Resolution BA High Resolution BA