
Robust Single-View Geometry and Motion Reconstruction


Hao Li∗, ETH Zurich
Bart Adams†, KU Leuven
Leonidas J. Guibas‡, Stanford University
Mark Pauly§, ETH Zurich

∗ Applied Geometry Group, E-mail: hao@inf.ethz.ch
† Computer Graphics Group, E-mail: bart.adams@cs.kuleuven.be
‡ Geometric Computing Group, E-mail: guibas@cs.stanford.edu
§ Applied Geometry Group, E-mail: pauly@inf.ethz.ch

Figure 1 (panels: input scans, reconstruction): Reconstruction of complex deforming objects from high-resolution depth scans. Our method accurately captures the global topology and shape motion, as well as dynamic, small-scale details, such as wrinkles and folds.

Abstract

We present a framework and algorithms for robust geometry and motion reconstruction of complex deforming shapes. Our method makes use of a smooth template that provides a crude approximation of the scanned object and serves as a geometric and topological prior for reconstruction. Large-scale motion of the acquired object is recovered using a novel space-time adaptive, non-rigid registration method. Fine-scale details such as wrinkles and folds are synthesized with an efficient linear mesh deformation algorithm. Subsequent spatial and temporal filtering of detail coefficients allows transfer of persistent geometric detail to regions not observed by the scanner. We show how this two-scale process allows faithful recovery of small-scale shape and motion features leading to a high-quality reconstruction. We illustrate the robustness and generality of our algorithm on a variety of examples composed of different materials and exhibiting a large range of dynamic deformations.

Keywords: animation reconstruction, non-rigid registration, partial scans, 3D scanning, geometry synthesis, template tracking

1 Introduction

Accurate digitization of complex real-world objects is one of the central problems in visual computing. Commercial solutions for rigid objects are widely available and can be considered a mature technology. However, many of the assumptions of rigid scanning methods are no longer valid in a dynamic setting where the acquired shape is in motion and deforms. High temporal and spatial resolution is essential to faithfully recover the small-scale geometric detail that is often created as a result of the dynamic motion of the scanned model. Recent advances in 3D scanning technology facilitate the acquisition of dynamic objects, but pose substantial challenges for reconstruction algorithms.

We consider the problem of marker-less, high-resolution geometry and motion reconstruction from single-view scans of a deforming shape. The main advantage of single-view 3D scanners is the simplicity of the acquisition setup, requiring no calibration or synchronization of multiple sensing units. However, single-view reconstruction of dynamic shapes is particularly challenging, since every scan covers a small section of the object's surface. Large and complex shape deformations constantly create or destroy geometric detail, such as wrinkles or folds in cloth, that needs to be distinguished from acquisition noise.

We address these challenges by introducing a novel template-based dynamic registration algorithm that offers significant improvements in terms of accuracy and robustness over previous methods. A key feature of our approach is the separation of large-scale motion from small-scale shape dynamics. We introduce a time- and space-adaptive deformation model that robustly captures the large-scale deformation of the object with minimal assumptions about the dynamics of the motion and without requiring an underlying physical model or kinematic skeleton. Our method dynamically adds degrees of freedom to the deformation model where needed, effectively extracting a generalized skeleton for the acquired shape.
Small-scale dynamics are handled by a novel detail synthesis method that computes a displacement field to adjust the deformed template to match the high-resolution input scans. The combination of these tools allows the efficient processing of extended scan sequences and yields a complete high-resolution geometry representation of the scanned object with full correspondences over all time instances.

We make a clear distinction between static and dynamic detail. Static detail includes all small-scale geometric features that are persistent in the shape and are not affected by the motion of the object. In the example shown in Figure 2, the mouth, eyes, and nose of the hand-puppet are static detail, since the entire face region is rigid. Dynamic detail consists of features that are transient. Deformation of the object can cause dynamic detail to appear and disappear, such as the folds in the body of the puppet.

Figure 2 (labels: permanent detail, transient detail, noise; frames 15, 47, 50, 84): Deforming shapes typically contain both permanent detail, such as the face region of the puppet, and transient detail, such as the dynamic folds in the cloth. Transient detail still persists over a number of adjacent frames and can thus be distinguished from temporally incoherent noise.

Our non-rigid registration method makes use of a template model to reconstruct the overall motion of the shape and provide a geometric prior for shape completion and topology control. In contrast to recent methods in performance capture [de Aguiar et al. 2008; Vlasic et al. 2008], we deliberately remove fine-scale detail from the template to avoid confusing static detail with dynamic detail. High-resolution templates from rigid scans typically have all detail "baked in", even transient features that are then erroneously transferred to all reconstructed surfaces (see also Figure 10). Our detail synthesis method automatically extracts detail from the high-resolution 3D input scans, propagates detail into occluded regions, and separates salient features from high-frequency noise.

The methods we propose are general in that they are not specifically designed for a certain acquisition setup or particular motion models. Our tool requires no user interaction beyond aligning the template with the first scan and specifying a few global parameters.

Contributions. The main technical contributions of this paper are
• an efficient non-rigid registration method based on a nonlinear deformation model that automatically adapts to the motion of the scanned object,
• a detail synthesis method that employs a spatio-temporal analysis of detail vectors to propagate detail into occluded regions and remove high-frequency acquisition noise,
• the integration of these methods into a complete 3D geometry and motion reconstruction framework.

The reconstructed surface meshes come with temporally consistent correspondences, which enables further applications such as mesh editing, texturing, or signal processing to be applied to the animation sequence. We demonstrate the versatility of our approach by showing high-resolution reconstructions of highly deformable shapes such as cloth, as well as the more coherent motion of articulated shapes. In addition, our purely data-driven algorithm is able to accurately reproduce subtle secondary motions such as hand tremor, or the behavior of complex materials such as the crumpling of a paper bag.

2 Related Work

Template-Based Methods. Non-rigid registration methods were initially developed to align 3D scans of rigid objects that are distorted due to device nonlinearities and calibration inaccuracies [Ikemoto et al. 2003; Brown and Rusinkiewicz 2004; Brown and Rusinkiewicz 2007]. These methods achieve highly accurate alignments for subtle warps, but are not suitable for large-scale deformations such as a bending arm. More general deformation models have been proposed to capture dynamic shapes [Allen et al. 2003; Sumner et al. 2007; Botsch and Sorkine 2008]. Various methods make use of a template model to simplify correspondence estimation and provide a prior for geometry and topology reconstruction, often relying on a small set of manually specified correspondences [Blanz and Vetter 1999; Allen et al. 2003; Pauly et al. 2005; Amberg et al. 2007]. Several unsupervised methods were proposed that require no manual intervention [Anguelov et al. 2004; Bronstein et al. 2006], but typically lead to higher computational complexity that makes these methods less suitable for long scan sequences.

Park and Hodgins [2006; 2008] developed a system that uses a very dense and large set of markers to capture and synthesize dynamic motions such as muscle bulging and flesh jiggling. While high resolution motions can be captured accurately, marker-based motion capture systems typically have a time-consuming calibration process and high hardware cost, and require actors to wear unnatural skin-tight clothing with optical beacons.

Marker-less methods are widely used in the acquisition and modeling of facial animations. In [Zhang et al. 2004], the deformation of an accurate face template is driven by time-coherent optical flow features and geometric closest point constraints. Since many features in a human face are persistent, their system can robustly handle long sequences of facial animations. More recently, several papers avoid the use of markers to reproduce complex animations of human performances and cloth deformations from multi-view video [Bradley et al. 2008; de Aguiar et al. 2008; Vlasic et al. 2008]. The latter two methods initialize the recording process with a high resolution full-body laser scan of the subject in a static pose. A low-resolution template model is created to robustly recover complex motions by combining various tracking and silhouette fitting techniques.
Details of the high resolution models are then transferred back to the animated template. While large-scale deformations such as flowing garments are nicely captured, fine-scale geometric details such as folds that are not persistent in the surface are captured in the high-resolution model, remaining permanently throughout the reconstructed animation and possibly yielding unnatural deformations. An extension of this approach has been presented in [Ahmed et al. 2008] that follows a similar rationale to our method. A low-resolution template is tracked and subsequently enriched with local detail extracted from the acquired data. However, the specifics of this system differ substantially from our solution. The input stems from a multi-view acquisition system using eight video cameras, the template tracking is based on a shape-skeleton and silhouette matching, and the detail synthesis is performed based on surface normals reconstructed using shape from shading.

Registration Without A Template. Since creating an accurate and sufficiently detailed template of a deforming object can be difficult, various methods have been proposed that do not rely on a complete model. The algorithm presented by Mitra and colleagues [2007] aggregates all scans into a 4D space-time surface and estimates inter-frame motion from kinematic properties of the deforming surface. Süssmuth and coworkers [2008] introduced a space-time approach that first computes an implicit 4D surface representation. A template is extracted from the initial frame and warped to the subsequent frames by maximizing local rigidity. These methods require adjacent frames to be sufficiently dense in space and time and are mainly designed for articulated motions. Sharf and colleagues [2008] introduced a volumetric space-time reconstruction technique that represents shape motion as an incompressible flow of material through time. This strong regularization makes the method particularly suitable for very noisy input data. Wand and coworkers [2007] introduced a statistical framework that performs pairwise alignment and merging over all adjacent scans within a global non-linear optimization process, leading to high computational cost. While significant performance improvements were achieved in a follow-up work using a volumetric meshless deformation model [Wand et al. 2009], this approach is still substantially slower than our method. In addition, the lack of a template can lead to topological ambiguities and misalignments for unseen parts (see also Figure 14).

Figure 3 (diagram labels: Static Acquisition, Smooth Template, Dynamic Acquisition, Partial Scans, Non-Rigid Registration, Warped Template, Detail Estimation, Detail Coefficients, Detail Aggregation, Reconstructed Shape): Processing pipeline. A smooth template mesh is registered to each of the input scans using a non-linear, adaptive deformation model. Small-scale detail coefficients are estimated and integrated into the template. The final reconstruction is obtained through detail aggregation and filtering to propagate detail into occluded regions and separate salient features from noise.

Several researchers have proposed pairwise non-rigid registration algorithms that are specifically designed for large deformations. Li and coworkers [2008] developed a registration framework that simultaneously solves for point correspondences, surface deformation, and region of overlap within a single global optimization. Our registration method uses similar components, but avoids the coupled non-linear optimization of correspondences and deformation to obtain an efficient alignment method for extended scan sequences. Chang and Zwicker [2008] solve a discrete labeling problem to detect the set of optimal correspondences and apply graph cuts to optimize for a consistent deformation from source to target.
They extend their scheme in [Chang and Zwicker 2009] using a reduced space deformation model represented by a volumetric grid that encloses the underlying scan. Although significant motion and occlusions can be handled, their deformation field representation breaks down for topologically difficult scenarios such as shapes with nearby or touching surfaces. Huang and colleagues [2008] suggested a registration technique that finds an alignment by diffusing consistent closest point correspondences over the target shape while preserving isometries as much as possible. Their implementation has proved to be efficient for large isometric deformations, yet the correspondence search is sensitive to topological changes and holes that commonly occur in partial acquisition systems.

3 Overview

Our dynamic acquisition system shown in Figure 4 provides dense depth maps with a spatial resolution of 0.5 mm at 25 frames per second (see [Weise et al. 2007] for a description of a similar acquisition setup). This allows us to capture fine-scale geometric detail of deforming objects at high temporal resolution. However, input scans are typically highly incomplete and contain considerable amounts of measurement noise. We found that a template model is essential as a geometric and topological prior for the robust reconstruction of shapes that undergo complex deformations, in particular for single-view acquisition, where large parts of the object are occluded.

Figure 3 gives an overview of our processing pipeline. Static acquisition is used to reconstruct the initial template. We remove all high-frequency detail from the template using low-pass filtering to avoid transferring potentially transient features to future scans. This significantly simplifies template construction since we do not require high geometric precision. To initialize the computations, we manually specify a rigid alignment of the template to the first frame of the scan sequence and apply one step of the pairwise non-rigid registration method described in Section 4.

We propose a two-scale approach to reconstruct a complete and consistent surface for each frame. Template registration uses a nonlinear reduced deformable model to recover the large-scale motion and align the template to each of the input scans (Section 4). The template-to-scan registration makes use of detail coefficients estimated in the previous frame to enable feature locking and improve the alignment accuracy. The final reconstruction is then obtained using a separate detail synthesis pass that runs once forward and once backward in time to aggregate and propagate detail into occluded regions (Section 5).

4 Template Registration

The registration stage captures the large-scale motion of the subject by fitting a coarse template shape to every frame of the scan sequence. Scans do not have to be a subset of the geometry described by the template, as in most previous methods (e.g. [Allen et al. 2003]). Our method robustly handles part-in-part registration, as opposed to the simpler part-in-whole matching (see e.g. Figure 12). We assume minimal prior knowledge about the acquired motion and thus employ a general deformation model to capture a sufficiently large range of shape deformations. We extend the embedded deformation framework proposed in [Sumner et al. 2007] to automatically adapt to the motion of the captured data. This allows recovering unknown complex material behavior and improves the robustness and efficiency of the registration.

4.1 Surface Deformation Model

Embedded deformation computes a warping field using a deformation graph to discretize the underlying space. Each node $x_i$ of the graph induces a deformation within a local influence region of radius $r_i$. We represent such a local deformation as an affine transformation specified by a $3 \times 3$ matrix $A_i$ and a $3 \times 1$ translation vector $b_i$. Graph nodes are connected by an edge whenever two nodes influence the same vertex of the mesh.
A vertex $v_j$ of the embedded shape is mapped to the position

$$ v_j' = \sum_{x_i} w(v_j, x_i, r_i) \left[ A_i (v_j - x_i) + x_i + b_i \right], \tag{1} $$

where $w(v_j, x_i, r_i)$ are the normalized weights $w(v_j, x_i, r_i) = \max(0, (1 - d^2(v_j, x_i)/r_i^2)^3)$ with $d(v_j, x_i)$ the distance between $v_j$ and $x_i$. We exploit the topological prior of the template and replace Euclidean distances in the original formulation by geodesic distances measured on the template mesh. This improvement avoids distortion artifacts that often occur when geodesically distant parts of the object come into close contact (Figure 8). We use a variant of the fast marching method to efficiently compute approximate geodesic distances [Kimmel and Sethian 1998].

Figure 4: Our real-time structured light scanner based on active stereo delivers high resolution input scans from a single view.

During non-rigid registration we solve for the unknown transformations $(A_i, b_i)$. A feature preserving deformation field is obtained by maximizing local rigidity using the energy

$$ E_{rigid} = \sum_{x_i} \left[ (a_1^T a_2)^2 + (a_1^T a_3)^2 + (a_2^T a_3)^2 + (1 - a_1^T a_1)^2 + (1 - a_2^T a_2)^2 + (1 - a_3^T a_3)^2 \right] \tag{2} $$

that measures the deviation of the column vectors $a_1, a_2, a_3$ of $A_i$ from orthogonality and unit length. An additional regularization term ensures smoothness of the deformation. We extend the original formulation of [Sumner et al. 2007] using the geodesic distance weights to handle non-uniformly sampled graph nodes:

$$ E_{smooth} = \sum_{x_i} \sum_{x_j} w(x_i, x_j, r_i + r_j) \left\| A_i (x_j - x_i) + x_i + b_i - (x_j + b_j) \right\|_2^2 . \tag{3} $$

Minimizing these combined energies with the fitting term defined below yields affine transformations for each node, which in turn define a smooth deformation field for the template mesh. We solve this non-linear problem using a standard Gauss-Newton algorithm as described in [Sumner et al. 2007; Li et al. 2008].
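To make Eqs. 1 and 2 concrete, the following is a minimal Python sketch rather than the authors' implementation. It assumes that the weights are normalized by dividing by their sum over all nodes, and it substitutes Euclidean distance for the geodesic distance described above. The names `weight`, `deform_vertex`, and `rigidity_energy` are hypothetical.

```python
import math

def weight(dist, radius):
    """Unnormalized falloff max(0, (1 - d^2 / r^2)^3)."""
    return max(0.0, (1.0 - (dist * dist) / (radius * radius)) ** 3)

def mat_vec(A, v):
    """Multiply a 3x3 matrix (list of rows) with a 3-vector."""
    return [sum(A[r][c] * v[c] for c in range(3)) for r in range(3)]

def deform_vertex(v, nodes):
    """Blend the affine maps of all influencing nodes (Eq. 1).

    nodes is a list of (x_i, r_i, A_i, b_i) tuples. ASSUMPTION: Euclidean
    distance stands in for geodesic distance, and weights are normalized
    to sum to one.
    """
    raw = [weight(math.dist(v, x), r) for (x, r, _, _) in nodes]
    total = sum(raw)
    if total == 0.0:
        return list(v)  # no node influences this vertex; leave it unchanged
    out = [0.0, 0.0, 0.0]
    for w, (x, _, A, b) in zip(raw, nodes):
        local = mat_vec(A, [v[k] - x[k] for k in range(3)])
        for k in range(3):
            out[k] += (w / total) * (local[k] + x[k] + b[k])
    return out

def rigidity_energy(A):
    """Per-node term of Eq. 2: deviation of the columns of A from orthonormality."""
    a1, a2, a3 = ([A[r][c] for r in range(3)] for c in range(3))
    dot = lambda u, w: sum(p * q for p, q in zip(u, w))
    return (dot(a1, a2) ** 2 + dot(a1, a3) ** 2 + dot(a2, a3) ** 2
            + (1 - dot(a1, a1)) ** 2 + (1 - dot(a2, a2)) ** 2
            + (1 - dot(a3, a3)) ** 2)
```

A rotation matrix gives zero rigidity energy, while a uniform scaling is penalized only through the unit-length terms.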
Figure 5 (panels: initial and final deformation graphs over time): The deformation graph is dynamically refined during non-rigid registration to adapt to the deformation of the scanned object. Color-coded images indicate the regularization energy that determines where new nodes are added to the graph. The bottom row shows the initial and final deformation graphs for the hand and the sumo reconstruction.

4.2 Robust Pairwise Registration

We iteratively compute closest point correspondences in the spirit of non-rigid ICP methods, followed by a pruning and deformation step. To avoid local minima in the non-linear optimization, we use the simple yet effective technique proposed by [Li et al. 2008] that progressively relaxes the regularization energies of the deformation model. Similar strategies were also applied in [Allen et al. 2003; Amberg et al. 2007]. In this way, the template can be accurately aligned to scans that undergo considerable deformations without the use of sparse, high-dimensional features.

Since our input data is sufficiently coherent in time, we repeatedly use closest point correspondences between the template and each input scan to determine the optimal deformation. In order to obtain an accurate fit, we augment the smooth template with detail information extracted from the previous frame. Template vertices $v_i^j$ of frame $j$ are displaced in the direction of the corresponding surface normal $n_i^j$ yielding $\tilde{v}_i^j = v_i^j + d_i^{j-1} n_i^j$, where $d_i^{j-1}$ is the detail coefficient of frame $j - 1$ (see Section 5). The correspondence energy combines the point-to-point and the point-to-plane metric to avoid incorrect correspondences in large featureless regions:

$$ E_{fit} = \sum_{(v_i^j, c_i^j) \in \mathcal{C}} \alpha_{point} \left\| \tilde{v}_i^j - c_i^j \right\|_2^2 + \alpha_{plane} \left| n_{c_i^j}^T (\tilde{v}_i^j - c_i^j) \right|^2 , \tag{4} $$

where $c_i^j$ denotes the closest point on the input scan from $\tilde{v}_i^j$ with corresponding surface normal $n_{c_i^j}$. We use $\alpha_{point} = 0.1$ and $\alpha_{plane} = 1$ in all our experiments.
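The fitting term of Eq. 4 can be evaluated as in the sketch below. This is an illustration under stated assumptions, not the paper's code: correspondences are assumed to be given already as triples of displaced template vertex, closest scan point, and scan normal, and the function names `displace_along_normal` and `fit_energy` are hypothetical.

```python
def dot(u, w):
    """Dot product of two 3-vectors."""
    return sum(p * q for p, q in zip(u, w))

def displace_along_normal(v, n, d_prev):
    """Displace a template vertex along its normal by the previous frame's detail coefficient."""
    return [v[k] + d_prev * n[k] for k in range(3)]

def fit_energy(pairs, alpha_point=0.1, alpha_plane=1.0):
    """Sum of weighted point-to-point and point-to-plane residuals (Eq. 4).

    pairs is a list of (v_tilde, c, n_c): displaced template vertex,
    closest scan point, and the scan normal at that point.
    """
    total = 0.0
    for v_tilde, c, n_c in pairs:
        diff = [v_tilde[k] - c[k] for k in range(3)]
        total += alpha_point * dot(diff, diff) + alpha_plane * dot(n_c, diff) ** 2
    return total
```

With the default weights, an offset along the scan normal costs eleven times as much as the same offset within the tangent plane, which lets the template slide over featureless regions.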
Correspondences are discarded if they are too far apart, have incompatible normal orientations, lie on the boundary of the partial input scans, or stem from back-facing or self-occluded vertices of the template.

Iterative Optimization. For each template-to-scan alignment, we initialize the registration with high stiffness weights $\alpha_{smooth} = 10$ and $\alpha_{rigid} = 100$. We then alternate in each iteration between correspondence computation and template deformation by minimizing

$$ E_{tot} = E_{fit} + \alpha_{smooth} E_{smooth} + \alpha_{rigid} E_{rigid} . $$

If the relative total energy did not change significantly between iterations $j$ and $j + 1$ (i.e., $|E_{tot}^{j+1} - E_{tot}^{j}| / E_{tot}^{j} < \sigma$), we additionally relax the regularization weights to $\alpha_{smooth} \leftarrow \frac{1}{2} \alpha_{smooth}$ and $\alpha_{rigid} \leftarrow \frac{1}{2} \alpha_{rigid}$. This relaxation strategy effectively improves the robustness by avoiding suboptimal local minima and allows handling pairs of scans that undergo significant deformations. In all our experiments we use $\sigma = 0.005$. The iterative optimization is repeated until $\alpha_{rigid} < 0.1$ or until a maximum number of iterations $N_{max} = 100$ is reached.

Note that detail information of the previous frame is only used to improve the accuracy of the registration by enabling geometric feature locking. The resulting continuous space deformation is applied to the template vertices without added detail. As discussed in Section 5, the final detail coefficients are obtained through a separate detail synthesis pass.

4.3 Dynamic Graph Refinement

We replace the static, uniform sampling of the deformation graph in [Sumner et al. 2007] and [Li et al. 2008] with a spatially and temporally adaptive node distribution. While the idea of adaptive mesh deformation has been explored in previous work, for instance in the context of multi-resolution shape modeling from images [Zhang and Seitz 2000], we propose to adapt the degrees of freedom of the deformation model instead of the geometry itself in order to improve registration robustness and efficiency.
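The stiffness relaxation used by the iterative optimization of Section 4.2 can be sketched as a small driver loop. This is only an outline of the control flow: `solve_step` is a hypothetical callback that stands for one round of correspondence computation and energy minimization and returns the resulting total energy, and `register_pair` is likewise an assumed name.

```python
def register_pair(solve_step, alpha_smooth=10.0, alpha_rigid=100.0,
                  sigma=0.005, n_max=100, alpha_rigid_min=0.1):
    """Run the relaxation schedule and return the final weights and iteration count.

    solve_step(alpha_smooth, alpha_rigid) is a HYPOTHETICAL callback that
    recomputes correspondences, minimizes the total energy once, and
    returns its value.
    """
    previous = None
    iterations = 0
    while alpha_rigid >= alpha_rigid_min and iterations < n_max:
        energy = solve_step(alpha_smooth, alpha_rigid)
        iterations += 1
        # Halve both stiffness weights once the energy has stagnated.
        if previous is not None and previous > 0.0:
            if abs(energy - previous) / previous < sigma:
                alpha_smooth *= 0.5
                alpha_rigid *= 0.5
        previous = energy
    return alpha_smooth, alpha_rigid, iterations
```

Starting from a rigid weight of 100, ten halvings are needed before it drops below 0.1, so the loop stops early only if the energy stagnates at least ten times within the iteration budget.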
A hierarchical graph representation is pre-computed from a dense uniform sampling of graph nodes by successively merging nodes in a bottom-up fashion. The initial uniform node sampling corresponds to the highest resolution level $l = L_{max}$ of the deformation graph that we restrict to roughly one tenth of the number of mesh vertices. We thus avoid over-fitting in regions of small-scale deformations, which are instead captured by our detail synthesis method (Section 5). We uniformly sub-sample the nodes of each level by repeatedly increasing their average sampling distance $r_{l-1} = 4 r_l$ until $l$ reaches $L_{min}$. Each of the remaining nodes $x_i^l$ from level $l \in L_{min} \ldots L_{max}$ forms a cluster $C_i^l$ which contains every node $x_i^{l+1}$ from the level below that is not closer to any other cluster from $l$. The resulting cluster hierarchy is then used for adaptive refinement. We choose $L_{min} = L_{max}/2$ for all our experiments.

Refinement Criterion. Registration starts with a coarse uniform graph at level $L_{min}$ and dynamically adapts the graph resolution by inserting nodes in regions with high regularization residual ($E_{smooth}$), which indicates a strong discrepancy of neighboring node transformations (see Figure 5). In all our examples we set the threshold for refinement to 10% of the highest regularization value. One step of refinement substitutes every $x_i^l$ that exhibits high regularization with all nodes contained in $C_i^l$. To avoid unnecessary refinements for every new upcoming target frame, adaptive refinement is only performed if the global regularization term is still above a certain threshold, i.e. $E_{smooth} > 0.01$, after the maximum number of iterations $N_{max} = 100$ of pairwise registration.

The dynamic refinement effectively learns an adaptive deformation model that is consistent with the motion of the scanned object. Additional nodes will be inserted automatically in regions of high deformation, while large rigid parts can be accurately deformed by a single graph node.
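One refinement step as described above can be sketched in a few lines. The sketch assumes that a smoothness residual is available per active node and that the cluster hierarchy is stored as a mapping from a node to its finer-level nodes. Both data layouts and the names `nodes_to_refine` and `refine_graph` are hypothetical.

```python
def nodes_to_refine(residuals, fraction=0.10):
    """Return nodes whose residual exceeds a fraction of the largest residual.

    residuals maps node id -> smoothness residual (ASSUMED per-node layout).
    """
    if not residuals:
        return []
    threshold = fraction * max(residuals.values())
    return sorted(node for node, value in residuals.items() if value > threshold)

def refine_graph(active, clusters, residuals, fraction=0.10):
    """Replace every flagged node with the finer nodes of its cluster.

    clusters maps node id -> list of node ids on the next finer level;
    nodes without a stored cluster are kept as they are.
    """
    flagged = set(nodes_to_refine(residuals, fraction))
    refined = []
    for node in active:
        if node in flagged and clusters.get(node):
            refined.extend(clusters[node])
        else:
            refined.append(node)
    return refined
```

For example, with residuals of 1.0, 0.05, and 0.2 the threshold is 0.1, so the first and third node are substituted by their clusters while the second stays coarse.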
In addition to being less susceptible to local minima, this leads to significant performance improvements (up to a factor of four in our examples) as compared to a uniform sampling with a high level of node redundancy. As illustrated in Figure 5, our adaptive model is suitable for a wide variety of dynamic objects, from articulated shapes to complex cloth folding.

4.4 Multi-Frame Stabilization

The warped template $T^{j-1}$ obtained after alignment to scan $j - 1$ is the zero-energy state when aligning to scan $j$ for each frame of the entire template warping process. For surface regions that are visible in the scan, dynamic details, such as cracks and fissures in paper-like materials, can be accurately captured, since the method prevents the template from deforming back to its initial undeformed state. However, unobserved template parts are inherently prone to accumulation of misalignments, especially for longer scan sequences, as illustrated in Figure 6.

In contrast to our formulation, classical template fitting methods [Zhang et al. 2004; de Aguiar et al. 2008; Vlasic et al. 2008] warp the same initial template to each recorded frame and thus use a deformation model that behaves globally elastic in time. For complex articulated subjects, such as human bodies, missing data in occluded regions would pull the template back to its original shape, which can be very different from that of the current frame. Therefore, multi-view acquisition systems are usually used in combination with sparse and robust feature tracking [de Aguiar et al. 2008] and sometimes enhanced with manual intervention [Vlasic et al. 2008] to ensure reliable tracking.

In our dense acquisition setting, the surface coverage of the template by the input scans is spatially and temporally coherent. Thus, for non-occluded regions, the template shape from a closer time instance represents in general a more likely shape prior than the initial template $T_{init}$.
On the other hand, we make the assumption that no better knowledge exists than $T_{init}$ for template regions that are never observed or not seen for an extended period. To address this issue we introduce a time-dependent combination of plastic and elastic deformation to accurately track exposed surface regions and reduce the accumulation of errors in less recently observed parts of the scanned object.

After the pairwise registration of $T^{j-1}$ to scan $j$ as presented in Section 4.2, we obtain the plastically deformed template $T^j$. A weight $c_i^j$ for visibility confidence can then be defined for each vertex $v_i^j \in T^j$ as

$$ c_i^j = \max\{0, (P + j_i^{last} - j)/P\} $$

with $j_i^{last}$ the last frame where $v_i$ has been observed, and $P$ a constant (we chose $P = 30$ in all our examples) that defines a temporal confidence range of visibility. All template vertices with $c_i^j = 1$ are visible in the current frame, while $c_i^j = 0$ represent those that are no longer considered confident. For the same frame, an elastically deformed template $\tilde{T}^j$ with vertices $\tilde{v}_i^j$ is created by warping $T_{init}$ to the current frame $j$ using the linearized thin-plate energy as described in [Botsch and Sorkine 2008]. Hard positional constraints are defined for all vertices with confidence $c_i^j = 1$. The resulting template $\bar{T}^j$ with vertices $\bar{v}_i^j$ is obtained by linearly blending $T^j$ and $\tilde{T}^j$ with the confidence weights for visibility, yielding the vertices $\bar{v}_i^j = c_i^j v_i^j + (1 - c_i^j) \tilde{v}_i^j$.

Figure 6 (panels: coverage with color scale 0 to 34, without stabilization, with stabilization, ground truth): A hybrid plastic and elastic deformation model is used to stabilize the registration for multiple input frames as repeated pairwise alignment is susceptible to error accumulation. The accumulation of misalignments is shown on frame 30 of the sumo sequence.

5 Detail Synthesis

Non-rigid registration aligns the template sequentially with all input scans.
The resulting deformation fields induced by the graph capture the large-scale deformation but might miss small deformations that give rise to dynamic detail such as wrinkles and folds. To recover fine-scale detail at the spatial resolution of the scanner, we perform a separate detail synthesis stage that is composed of two steps: First, a per-vertex optimization from local correspondences is applied to estimate detail coefficients for each vertex of the template. These preliminary detail coefficients are the ones used for template alignment as detailed in Section 4. After the template has been registered to the entire scan sequence, we perform an additional pass that exploits the temporal coherence of the scan sequence to improve the reconstruction quality by propagating detail into occluded regions.

Linear Mesh Deformation. Since the deformed template is already well-aligned with the input scan, we employ an efficient linear mesh deformation algorithm similar to [Zhang et al. 2004] to estimate detail coefficients. For each vertex $v_i$ in the template mesh, we trace an undirected ray in normal direction $n_i$ and find the closest intersection point on the input scan. In case an intersection point $c_i$ is found, a point-to-point correspondence constraint is created, if both points have the same normal orientation and are sufficiently close. Since the template has no high-frequency detail, its normal vector field is smooth, leading to spatially coherent correspondences. We compute the detail coefficients $d_i$ by minimizing the energy resulting from the extracted correspondences subject to a regularization constraint

$$E_{\text{detail}} = \sum_{i \in V} \| v_i + d_i n_i - c_i \|_2^2 + \beta \sum_{(i,j) \in E} | d_i - d_j |^2, \qquad (5)$$

where $V$ and $E$ are index sets of mesh vertices and edges, respectively. The parameter $\beta$ balances detail synthesis with smoothness and is set to $\beta = 0.5$ in all our experiments. The resulting system of equations is linear and sparse and can thus be solved efficiently.

Figure 7: Detail synthesis. Reconstructing detail from the current frame leads to lack of detail in occluded regions. Aggregating detail over temporally adjacent frames propagates detail into hole regions and reduces noise. The color-coded images show the magnitude of the detail coefficients relative to the bounding box diagonal. Panels: input, aggregated detail, single-frame detail, aligned template (color scale 0 to $2 \cdot 10^{-3}$).

Aggregation. The linear mesh deformation method described above estimates detail coefficients independently for each frame in those regions of the object that are observed by a particular scan. To transfer detail to occluded regions we perform a separate processing pass that aggregates detail coefficients using a so-called exponentially weighted moving average. We use the formulation of Roberts [1959] and define this moving average as

$$\bar{d}_i^{\,j} = (1 - \gamma)\,\bar{d}_i^{\,j-1} + \gamma\, d_i^j \qquad (6)$$

with $\gamma$ set to 0.5 in all our examples. The influence of past detail coefficients decays quickly in this formulation, which is important, since transient or dynamic detail such as wrinkles and folds might not persist during deformation. Note that details in the template only disappear when they vanish in the input scans of succeeding frames. For instance, the details of a rigid object will persist and not fade toward zero coefficients since only observed coefficients are combined during detail synthesis. When processing scan $j$, we first update the vertices $v_i^j \leftarrow v_i^j + \bar{d}_i^{\,j-1} n_i^j$ and perform the linear mesh deformation described in the previous section. This yields the new detail coefficients $d_i^j$ that are then used to update the moving average $\bar{d}_i^{\,j}$, which will in turn be employed to process the subsequent scans. The entire detail aggregation process is performed by running sequentially once forward and once backward through the scans while performing the linear mesh deformation and updating the moving averages.
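The per-frame estimate of Equation (5) and the moving average of Equation (6) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation. The function names `estimate_detail` and `aggregate` are hypothetical. The data term is applied only to vertices that have a correspondence, and each correspondence is assumed to lie on the ray along the unit normal, so the data term reduces to $(d_i - t_i)^2$ with $t_i$ the signed offset along the normal. Gauss-Seidel sweeps stand in for the sparse linear solve, and the vertex update that precedes each per-frame solve is omitted.

```python
def estimate_detail(offsets, has_corr, edges, beta=0.5, sweeps=200):
    """Approximately minimize Eq. (5) restricted to the normal direction.

    offsets[i]  : signed distance t_i from vertex i to its correspondence
                  along the unit normal (ignored where has_corr[i] is False)
    has_corr[i] : True if a correspondence was found for vertex i
    edges       : list of (i, j) mesh edges
    """
    n = len(offsets)
    neighbors = [[] for _ in range(n)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    d = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            m = 1.0 if has_corr[i] else 0.0
            denom = m + beta * len(neighbors[i])
            if denom == 0.0:
                continue  # isolated vertex without data: leave at zero
            t = offsets[i] if has_corr[i] else 0.0
            # Exact minimizer of the energy with respect to d[i] alone.
            d[i] = (m * t + beta * sum(d[k] for k in neighbors[i])) / denom
    return d


def aggregate(avg_prev, d_new, observed, gamma=0.5):
    """Eq. (6), applied only to vertices observed in the current frame."""
    return [
        (1.0 - gamma) * a + gamma * d if seen else a
        for a, d, seen in zip(avg_prev, d_new, observed)
    ]


# Three-vertex chain: only the end vertices have correspondences.
detail = estimate_detail([1.0, 0.0, 1.0], [True, False, True], [(0, 1), (1, 2)])
average = aggregate([0.2, 0.4], [0.6, 0.0], [True, False])
```

On the three-vertex chain the smoothness term fills in the unobserved middle vertex, so all three coefficients converge to 1. In `aggregate`, a vertex that is not observed in the current frame keeps its previous average, which mirrors the statement that only observed coefficients are combined.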
Going back and forth allows us to backpropagate persistent details seen at future instances to earlier scans (see Figure 7). As a final step, we apply a band-limiting bilateral filter [Aurich and Weule 1995] that operates in the time domain and detail range to further reduce temporal noise.

6 Results

We show a variety of acquired geometry and motion sequences processed with our system that exhibit substantially different dynamic behavior. Accurate reconstruction of these objects is challenging due to the high noise level in the scans, missing data caused by occlusions or specularity, unknown correspondences, and the large and complex motion and deformations of the acquired objects. The statistics for the results are shown in Table 1. All templates were constructed by performing an online rigid registration technique similar to [Rusinkiewicz et al. 2002] on our acquired data, followed by a surface reconstruction technique based on algebraic point set surfaces described in [Guennebaud and Gross 2007]. Given the roughly aligned template mesh, our system runs completely automatically without any user intervention. Only a few parameters (such as the weighting coefficients of the different energy terms) have to be chosen manually. For all examples, we use the same initial parameter settings. During optimization we automatically adapt the parameters using the approach detailed in Section 4.

Figure 9 shows the warped template and final reconstruction of the puppet. This example is particularly difficult due to the close proximity of multiple surface sheets when closing the puppet’s hands. The reconstruction of a hand in Figure 8 demonstrates that our detail synthesis method is capable of capturing the intricate folds and wrinkles of human skin, even though the scans contain a large amount of measurement noise. Figure 12 illustrates how detail is propagated correctly into occluded regions, which leads to a plausible high-resolution reconstruction even for parts of the model that have not been observed in a particular scan. Figure 13 shows the reconstruction of a crumpling paper bag. Despite substantial holes caused by oversaturation in the reflections, the dynamics of the material as well as sharp geometric creases are faithfully captured.

Table 1: Statistics for the results shown in this paper. All computations were performed on a 3.0 GHz Dual Quad-Core Intel Xeon machine with 8 GB RAM. Timings are measured in minutes and include I/O operations.

                               Puppet    Head    Hand   Paper Bag    Sumo
# Scans                           100     200      35          85      34
Min # Points per Scan             23k     53k     19k         82k     85k
Max # Points per Scan             37k     68k     25k        123k     86k
Input Data Size (Mb)              430   1,690     120         145     430
# Template Vertices               48k     64k     46k         64k    107k
Begin # Graph Nodes                20     152      77          37      52
End # Graph Nodes                 100     458    1238          86     110
Output Data Size (Mb)             530   2,030     180         960     540
Registration Time (min)            39     247      15          65      26
Detail Synthesis Time (min)        26      92       8          36      23
Total Time (min)                   65     339      23         101      49

Figure 8: The zooms illustrate how high-frequency detail such as the skin folds is faithfully recovered and transferred to occluded regions. Even though the scan is connected at the fingertips, shape topology is correctly recovered (red circle). Panel labels: input scans, reconstruction, input scan, warped template, reconstruction.

Figure 9: The global motion of the puppet’s shape as well as fine-scale static and dynamic detail are captured accurately using the template registration and detail synthesis algorithm. The intricate folds of the cloth are handled robustly in the registration. Panel labels: input, warped template, input (side), reconstruction (back), reconstruction, textured reconstruction.

Figure 12: Our method faithfully recovers both the large-scale motion of the turning head, as well as the dynamic features created by the expression, such as wrinkles on the forehead or around the mouth. Intricate geometric details such as the ears are accurately captured, even though they are only observed in few frames. The color-coded images show the number of frames a certain region has been observed. Panel labels: input scans, warped template, coverage (color scale 0 to 200), reconstruction.

7 Evaluation

Figure 10 illustrates the difference between tracking a high-resolution template versus our two-scale approach that separates global shape motion and dynamic detail reconstruction. For comparison we use the first frame of our two-scale reconstruction as the high-resolution template, which is then aligned with the input scan sequence using the registration method of Section 4. As can be seen in the zoom, dynamic detail created by the motion, in particular in the cloth, is not captured accurately. In contrast, our detail synthesis approach avoids the artifacts created by “baked-in” geometric detail and leads to a high-quality reconstruction of both static and dynamic detail. While a fairly large range of template smoothness can be tolerated, an overly coarse template can deteriorate the reconstruction as shown in Figure 11.

Figure 10: Warping a high-resolution template without detail synthesis leads to inferior results as compared to our two-scale reconstruction approach (cf. Figure 9). The color coding shows the distance between both results relative to the bounding box diagonal. Panel labels: warped high-res template, reconstruction vs. high-res template, reconstruction with detail synthesis.

Figure 11: Evaluation of the reconstruction (frame 1 and 50) for three different initial templates. The upper row shows the original template. The coarser template in the second row is produced by surface reconstruction from points that are uniformly subsampled at half of the density of the original template. The last row illustrates the reconstruction using an even coarser template. This is obtained from only 25% of the initial point density. Panel labels: reconstruction (front), smooth templates, frame 1, frame 50, reconstruction (side).

The necessity of using a template for robust reconstruction of complex deforming shape is illustrated in Figure 14. The method of [Wand et al. 2009] that avoids the use of a template cannot track the motion of the fingers accurately. In particular, the correspondence estimation fails when previously unseen parts of the shape, such as the back of the fingers, come into view. Figure 15 shows a comparison of our method to the dynamic registration approach of [Süssmuth et al. 2008] using the same template in both reconstructions.

We evaluate the robustness of the template tracking and detail synthesis method using the ground truth comparison shown in Figure 16. The scanning process has been simulated by creating a set of artificial depth maps from a fixed viewpoint. The ground-truth animation of the 3D model was obtained from dense motion capture data provided by [Park and Hodgins 2006]. In order to test the stability of the template tracking, we sampled the entire sequence at successively lower temporal resolution. The non-rigid registration robustly aligns the template with the scans for a temporally sub-sampled sequence consisting of only 34 frames. The large inter-frame motion, especially of the arms and legs, is tracked correctly, even though our correspondence computations do not make use of feature points, markers, or user assistance. Template tracking breaks down at 17 frames, where the fast motion of the arms cannot be recovered anymore (see Figure 17 (a)). Detail synthesis for the 34-frame sequence reliably recovers most of the fine-scale geometry correctly. Artifacts appear in the fingers and toes due to the coarse approximation of the template. In addition, drawbacks of the single-view acquisition become apparent in regions that are not observed by the scanner, such as the back of the sumo.
Quantitatively, we measured the maximum of the average distance over all frames as 0.0012 and the maximum of the maximum distance over all frames as 0.0283, both as a fraction of the bounding box diagonal.

Figure 13: Sharp creases and intricate folds created by the complex, non-smooth deformation of a crumpling paper bag are captured accurately. Panel labels: input scans, warped template, reconstruction, texture.

Figure 14: Reconstruction without a template is particularly challenging for single-view acquisition. The results in the center have been produced by the authors of [Wand et al. 2009]. Panel labels: input scans, [Wand et al. 2009], our approach, each shown for frame 22 and frame 34.

Figure 15: Comparison of two template based reconstruction methods. The results in the bottom right have been produced by the authors of [Süssmuth et al. 2008]. Panel labels: input scans, our approach, [Süssmuth et al. 2008].

Figure 16: Ground truth comparison for a synthetic full-body example with fast motion. The top row shows every frame of the input sequence. The color-coded image indicates the number of frames in which a certain part of the shape is covered by the scans. The graph shows the maximum and average error distance between the ground truth and the reconstruction for each frame. Panel labels: input, coverage (color scale 0 to 34), reconstruction, ground truth. Graph axes: error (0 to 0.032) over frame (0 to 34), with curves for max and average.

Limitations. We make few assumptions on the geometry and motion of the scanned objects. The correspondence estimation based on closest points, however, requires a sufficiently high acquisition frame-rate as otherwise, misalignments can occur, as shown in Figure 17 (a). Similarly, for parts of the shape that are out of view for an extended period of time, registration can fail if these regions have undergone deformations while not being observed by the scanner. In such a case, our system would require user interaction to re-initialize the registration. This is an inherent limitation of single-view systems where more than half of the object surface is occluded at any time instance. However, even some multi-view systems (e.g. [Vlasic et al. 2008]) permit user assistance to adjust incorrect optimizations. Similar manual assistance might be required for longer sequences, where the scanner infrequently produces inferior data in certain frames. These frames need to be removed manually and the registration re-started with user assistance. While none of our sequences required such manual intervention, the acquisition of longer sequences was inhibited by this limitation of our scanning system. Global aspects, such as the loop closure problem well-known in rigid scanning [Pulli 1999], are currently not considered in our system. To address these limitations, more sophisticated feature tracking would be required in order to establish reliable correspondences across larger spatial and temporal distances.

We currently do not prevent global self-intersections of the reconstructed meshes. However, as shown in Figure 17 (b), our method robustly recovers, mainly due to the use of geodesic distances on the template mesh and the correspondence pruning strategy based on normal consistency and visibility. Avoiding self-intersections entirely would require an additional self-collision handling step in the shape deformation optimization algorithm, which would add a significant overhead to the overall reconstruction pipeline. Our method does not discover topological errors in the template, as shown in Figure 17 (c). In the template reconstruction the pinky has been erroneously connected to the paper bag, which leads to artifacts in the final frames of the sequences, where the finger is lifted off the bag.

8 Conclusion

We have presented a robust algorithm for geometry and motion reconstruction of dynamic shapes. One of the main benefits of our method is simplicity.
Our scanning system requires no specialized hardware or complex calibration or synchronization, and can be readily deployed in different acquisition scenarios. We do not require silhouette or feature extraction, manual correction of correspondences, or the explicit construction of a shape skeleton. Our system demonstrates that even for single-view acquisition, high-quality results can be obtained for a variety of scanned objects, with a realistic reconstruction of shape dynamics and fine-scale features. Key to the success of our algorithm is the robust template tracking based on an adaptive deformation model. Our novel detail synthesis method exploits the accurate registration to aggregate and propagate geometric detail into occluded regions.

As future work we plan to resolve the aforementioned limitations and incorporate global self-collision handling. Moreover, we want to evaluate the algorithm in a multi-view setting where larger parts of the object are seen at the same or alternating time instances. As our current acquisition system only allows us to scan within a working volume of 40 × 30 × 60 cm³, we wish to extend our scanning setup to allow acquisition of larger objects such as full human body performances. The tests on synthetic data indicate that our reconstruction algorithm should perform well for such cases. Finally, the proposed template registration algorithm can be used to acquire and learn material behavior (such as the crumpling of paper or folding of skin). Such information can be used to improve the realism of physically-based simulation algorithms.

Figure 17: Limitations: (a) registration can fail if the framerate is too low relative to the motion of the scanned object; (b) self-intersections are not prevented during template alignment; (c) wrong template topology leads to artifacts when the finger is lifted off the paper bag. Panel labels: initial alignment, input scans, (a) failed alignment, (b) self-intersection, (c) reconstruction.

Acknowledgements.
The authors would like to thank Thibaut Weise for providing the real-time 3D scanner, Carsten Stoll for his performance capture data, and Sang Il Park and Jessica Hodgins for the animated sumo. Special thanks go to Johannes Schmid for helping with the video editing, Michael Wand, Martin Bokeloh, and Jochen Süssmuth for performing the comparisons, and Qi-Xing Huang and Maks Ovsjanikov for the feedback and discussions. This work is supported by SNF grant 200021-112122, NSF grants ITR 0205671, FRG 0354543, FODAVA 808515, as well as NIH grant GM-072970 and the Fund for Scientific Research, Flanders (F.W.O.-Vlaanderen).

References

AHMED, N., THEOBALT, C., DOBREV, P., SEIDEL, H.-P., AND THRUN, S. 2008. Robust fusion of dynamic shape and normal capture for high-quality reconstruction of time-varying geometry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), 1–8.

ALLEN, B., CURLESS, B., AND POPOVIĆ, Z. 2003. The space of human body shapes: reconstruction and parameterization from range scans. ACM Transactions on Graphics 22, 3, 587–594.

AMBERG, B., ROMDHANI, S., AND VETTER, T. 2007. Optimal step nonrigid icp algorithms for surface registration. In Proceedings of IEEE CVPR.

ANGUELOV, D., SRINIVASAN, P., PANG, H.-C., KOLLER, D., THRUN, S., AND DAVIS, J. 2004. The correlated correspondence algorithm for unsupervised registration of nonrigid surfaces. In Advances in Neural Inf. Proc. Systems 17.

AURICH, V., AND WEULE, J. 1995. Non-linear gaussian filters performing edge preserving diffusion. In Mustererkennung 1995, 17. DAGM-Symposium, Springer-Verlag, 538–545.

BLANZ, V., AND VETTER, T. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of ACM SIGGRAPH 99, ACM Press / ACM SIGGRAPH, 187–194.

BOTSCH, M., AND SORKINE, O. 2008. On linear variational surface deformation methods. IEEE Transactions on Visualization and Computer Graphics 14, 1, 213–230.

BRADLEY, D., POPA, T., SHEFFER, A., HEIDRICH, W., AND BOUBEKEUR, T. 2008. Markerless garment capture. ACM Transactions on Graphics 27, 3, 99:1–99:9.

BRONSTEIN, A. M., BRONSTEIN, M. M., AND KIMMEL, R. 2006. Generalized multidimensional scaling: a framework for isometry-invariant partial surface matching. Proc. National Academy of Sciences (PNAS) 103.

BROWN, B., AND RUSINKIEWICZ, S. 2004. Non-rigid range-scan alignment using thin-plate splines. In Symp. on 3D Data Processing, Visualization, and Transmission.

BROWN, B. J., AND RUSINKIEWICZ, S. 2007. Global non-rigid alignment of 3-d scans. ACM Transactions on Graphics 26, 3, 21:1–21:10.

CHANG, W., AND ZWICKER, M. 2008. Automatic registration for articulated shapes. Computer Graphics Forum (Proc. SGP) 27, 5, 1459–1468.

CHANG, W., AND ZWICKER, M. 2009. Range scan registration using reduced deformable models. Computer Graphics Forum (Proceedings of Eurographics 2009), to appear.

DE AGUIAR, E., STOLL, C., THEOBALT, C., AHMED, N., SEIDEL, H.-P., AND THRUN, S. 2008. Performance capture from sparse multi-view video. ACM Transactions on Graphics 27, 3, 98:1–98:10.

GUENNEBAUD, G., AND GROSS, M. 2007. Algebraic point set surfaces. In ACM Transactions on Graphics, ACM, New York, NY, USA, vol. 26, 23:1–23:10.

HUANG, Q., ADAMS, B., WICKE, M., AND GUIBAS, L. J. 2008. Non-rigid registration under isometric deformations. Computer Graphics Forum (Proc. of SGP) 27, 5, 1459–1468.

IKEMOTO, L., GELFAND, N., AND LEVOY, M. 2003. A hierarchical method for aligning warped meshes. In Proceedings of 4th Int. Conference on 3D Digital Imaging and Modeling, 434–441.

KIMMEL, R., AND SETHIAN, J. A. 1998. Computing geodesic paths on manifolds. In Proc. Natl. Acad. Sci. USA, 8431–8435.

LI, H., SUMNER, R. W., AND PAULY, M. 2008. Global correspondence optimization for non-rigid registration of depth scans. Computer Graphics Forum (Proc. SGP) 27, 5, 1421–1430.

MITRA, N. J., FLORY, S., OVSJANIKOV, M., GELFAND, N., GUIBAS, L., AND POTTMANN, H. 2007. Dynamic geometry registration. In Symposium on Geometry Processing, 173–182.

PARK, S. I., AND HODGINS, J. K. 2006. Capturing and animating skin deformation in human motion. ACM Transactions on Graphics 25, 3, 881–889.

PARK, S. I., AND HODGINS, J. K. 2008. Data-driven modeling of skin and muscle deformation. ACM Transactions on Graphics 27, 3, 96:1–96:6.

PAULY, M., MITRA, N. J., GIESEN, J., GROSS, M., AND GUIBAS, L. J. 2005. Example-based 3d scan completion. In Symposium on Geometry Processing.

PULLI, K. 1999. Multiview registration for large data sets. In Second Int. Conf. on 3D Dig. Image and Modeling, 160–168.

ROBERTS, S. 1959. Control chart tests based on geometric moving averages. Technometrics 1, 239–250.

RUSINKIEWICZ, S., HALL-HOLT, O., AND LEVOY, M. 2002. Real-time 3D model acquisition. ACM Transactions on Graphics 21, 3, 438–446.

SHARF, A., ALCANTARA, D. A., LEWINER, T., GREIF, C., SHEFFER, A., AMENTA, N., AND COHEN-OR, D. 2008. Space-time surface reconstruction using incompressible flow. ACM Transactions on Graphics 27, 5, 110:1–110:10.

SUMNER, R. W., SCHMID, J., AND PAULY, M. 2007. Embedded deformation for shape manipulation. ACM Transactions on Graphics 26, 3, 80:1–80:7.

SÜSSMUTH, J., WINTER, M., AND GREINER, G. 2008. Reconstructing animated meshes from time-varying point clouds. Computer Graphics Forum (Proceedings of SGP 2008) 27, 5, 1469–1476.

VLASIC, D., BARAN, I., MATUSIK, W., AND POPOVIĆ, J. 2008. Articulated mesh animation from multi-view silhouettes. ACM Transactions on Graphics 27, 3, 97:1–97:9.

WAND, M., JENKE, P., HUANG, Q., BOKELOH, M., GUIBAS, L., AND SCHILLING, A. 2007. Reconstruction of deforming geometry from time-varying point clouds. In Symposium on Geometry Processing, 49–58.

WAND, M., ADAMS, B., OVSJANIKOV, M., BERNER, A., BOKELOH, M., JENKE, P., GUIBAS, L., SEIDEL, H.-P., AND SCHILLING, A. 2009. Efficient reconstruction of non-rigid shape and motion from real-time 3d scanner data. ACM Transactions on Graphics. (to appear).

WEISE, T., LEIBE, B., AND GOOL, L. V. 2007. Fast 3d scanning with automatic motion compensation. In IEEE Conference on Computer Vision and Pattern Recognition, 1–8.

ZHANG, L., AND SEITZ, S. M. 2000. Image-based multiresolution shape recovery by surface deformation. SPIE, S. F. El-Hakim and A. Gruen, Eds., vol. 4309, 51–61.

ZHANG, L., SNAVELY, N., CURLESS, B., AND SEITZ, S. M. 2004. Spacetime faces: high resolution capture for modeling and animation. ACM Transactions on Graphics 23, 3, 548–558.