ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans
Angela Dai1,3,5 Daniel Ritchie2 Martin Bokeloh3 Scott Reed4 Jürgen Sturm3 Matthias Nießner5
1 Stanford University   2 Brown University   3 Google   4 DeepMind   5 Technical University of Munich
3D scans of indoor environments suffer from sensor occlusions, leaving 3D reconstructions with highly incomplete 3D
geometry (left). We propose a novel data-driven approach based on fully-convolutional neural networks that transforms
incomplete signed distance functions (SDFs) into complete meshes at unprecedented spatial extents (middle). In addition
to scene completion, our approach infers semantic class labels even for previously missing geometry (right). Our approach
outperforms existing approaches both in terms of completion and semantic labeling accuracy by a significant margin.
Abstract

We introduce ScanComplete, a novel data-driven approach for taking an incomplete 3D scan of a scene as input and predicting a complete 3D model along with per-voxel semantic labels. The key contribution of our method is its ability to handle large scenes with varying spatial extent, managing the cubic growth in data size as scene size increases. To this end, we devise a fully-convolutional generative 3D CNN model whose filter kernels are invariant to the overall scene size. The model can be trained on scene subvolumes but deployed on arbitrarily large scenes at test time. In addition, we propose a coarse-to-fine inference strategy in order to produce high-resolution output while also leveraging large input context sizes. In an extensive series of experiments, we carefully evaluate different model design choices, considering both deterministic and probabilistic models for completion and semantic inference. Our results show that we outperform other methods not only in the size of the environments handled and processing efficiency, but also with regard to completion quality and semantic segmentation performance by a significant margin.

1. Introduction

With the wide availability of commodity RGB-D sensors such as Microsoft Kinect, Intel RealSense, and Google Tango, 3D reconstruction of indoor spaces has gained momentum [22, 11, 24, 42, 6]. 3D reconstructions can help create content for graphics applications, and virtual and augmented reality applications rely on obtaining high-quality 3D models from the surrounding environments. Although significant progress has been made in tracking accuracy and efficient data structures for scanning large spaces, the resulting reconstructed 3D model quality remains unsatisfactory.

One fundamental limitation in quality is that, in general, one can only obtain partial and incomplete reconstructions of a given scene, as scans suffer from occlusions and the physical limitations of range sensors. In practice, even with careful scanning by human experts, it is virtually impossible to scan a room without holes in the reconstruction. Holes are both aesthetically unpleasing and can lead to severe problems in downstream processing, such as 3D printing or scene editing, as it is unclear whether certain areas of the scan represent free space or occupied space. Traditional approaches, such as Laplacian hole filling [36, 21, 44] or Poisson Surface Reconstruction [13, 14], can fill small holes. However, completing high-level scene geometry, such as
missing walls or chair legs, is much more challenging.

One promising direction towards solving this problem is to use machine learning for completion. Very recently, deep learning approaches for 3D completion and other generative tasks involving a single object or depth frame have shown promising results [29, 39, 10, 9, 7]. However, generative modeling and structured output prediction in 3D remains challenging. When represented with volumetric grids, data size grows cubically as the size of the space increases, which severely limits resolution. Indoor scenes are particularly challenging, as they are not only large but can also be irregularly shaped with varying spatial extents.

In this paper, we propose a novel approach, ScanComplete, that operates on large 3D environments without restrictions on spatial extent. We leverage fully-convolutional neural networks that can be trained on smaller subvolumes but applied to arbitrarily-sized scene environments at test time. This ability allows efficient processing of 3D scans of very large indoor scenes: we show examples with bounds of up to 1480 × 1230 × 64 voxels (≈ 70 × 60 × 3 m). We specifically focus on the tasks of scene completion and semantic inference: for a given partial input scan, we infer missing geometry and predict semantic labels on a per-voxel basis. To obtain high-quality output, the model must use a sufficiently high resolution to predict fine-scale detail. However, it must also consider a sufficiently large context to recognize large structures and maintain global consistency. To reconcile these competing concerns, we propose a coarse-to-fine strategy in which the model predicts a multi-resolution hierarchy of outputs. The first hierarchy level predicts scene geometry and semantics at low resolution but large spatial context. Following levels use a smaller spatial context but higher resolution, and take the output of the previous hierarchy level as input in order to leverage global context.

In our evaluations, we show scene completion and semantic labeling at unprecedented spatial extents. In addition, we demonstrate that it is possible to train our model on synthetic data and transfer it to completion of real RGB-D scans taken from commodity scanning devices. Our results outperform existing completion methods and obtain significantly higher accuracy for semantic voxel labeling.

In summary, our contributions are:
• 3D fully-convolutional completion networks for processing 3D scenes with arbitrary spatial extents.
• A coarse-to-fine completion strategy which captures both local detail and global structure.
• Scene completion and semantic labeling, both of which outperform existing methods by significant margins.

2. Related Work

3D Shape and Scene Completion. Completing 3D shapes has a long history in geometry processing and is often applied as a post-process to raw, captured 3D data. Traditional methods typically focus on filling small holes by fitting local surface primitives such as planes or quadrics, or by using a continuous energy minimization [36, 21, 44]. Many surface reconstruction methods that take point cloud inputs can be seen as such an approach, as they aim to fit a surface and treat the observations as data points in the optimization process; e.g., Poisson Surface Reconstruction [13, 14].

Other shape completion methods have been developed, including approaches that leverage symmetries in meshes or point clouds [40, 19, 26, 34, 37] or part-based structural priors derived from a database [38]. One can also ‘complete’ shapes by replacing scanned geometry with aligned CAD models retrieved from a database [20, 32, 15, 17, 33]. Such approaches assume exact database matches for objects in the 3D scans, though this assumption can be relaxed by allowing modification of the retrieved models, e.g., by non-rigid registration such that they better fit the scan [25, 31].

To generalize to entirely new shapes, data-driven structured prediction methods show promising results. One of the first such methods is Voxlets [8], which uses a random decision forest to predict unknown voxel neighborhoods.

Deep Learning in 3D. With the recent popularity of deep learning methods, several approaches for shape generation and completion have been proposed. 3D ShapeNets [3] learns a 3D convolutional deep belief network from a shape database. This network can generate and complete shapes, and also repair broken meshes [23].

Several other works have followed, using 3D convolutional neural networks (CNNs) for object classification [18, 27] or completion [7, 9]. To more efficiently represent and process 3D volumes, hierarchical 3D CNNs have been proposed [30, 41]. The same hierarchical strategy can also be used for generative approaches which output higher-resolution 3D models [29, 39, 10, 9]. One can also increase the spatial extent of a 3D CNN with dilated convolutions [43]. This approach has recently been used for predicting missing voxels and semantic inference [35]. However, these methods operate on a fixed-size volume whose extent is determined at training time. Hence, they focus on processing either a single object or a single depth frame. In our work, we address this limitation with our new approach, which is invariant to differing spatial extent between train and test, thus allowing processing of large scenes at test time while maintaining a high voxel resolution.
Figure 1. Overview of our method: we propose a hierarchical coarse-to-fine approach, where each level takes a partial 3D scan as input,
and predicts a completed scan as well as per-voxel semantic labels at the respective level’s voxel resolution using our autoregressive 3D
CNN architecture (see Fig. 3). The next hierarchy level takes as input the output of the previous levels (both completion and semantics),
and is then able to refine the results. This process allows leveraging a large spatial context while operating on a high local voxel resolution.
In the final result, we see global completion as well as local surface detail and high-resolution semantic labels.
3. Method Overview

Our ScanComplete method takes as input a partial 3D scan, represented by a truncated signed distance field (TSDF) stored in a volumetric grid. The TSDF is generated from depth frames following the volumetric fusion approach of Curless and Levoy [4], which has been widely adopted by modern RGB-D scanning methods [22, 11, 24, 12, 6]. We feed this partial TSDF into our new volumetric neural network, which outputs a truncated, unsigned distance field (TDF). At train time, we provide the network with a target TDF, which is generated from a complete ground-truth mesh. The network is trained to output a TDF which is as similar as possible to this target complete TDF.

Our network uses a fully-convolutional architecture with three-dimensional filter banks. Its key property is its invariance to input spatial extent, which is particularly critical for completing large 3D scenes whose sizes can vary significantly. That is, we can train the network using random spatial crops sampled from training scenes, and then test on different spatial extents at test time.

The memory requirements of a volumetric grid grow cubically with spatial extent, which limits manageable resolutions. Small voxel sizes capture local detail but lack spatial context; large voxel sizes provide large spatial context but lack local detail. To get the best of both worlds while maintaining high resolution, we use a coarse-to-fine hierarchical strategy. Our network first predicts the output at a low resolution in order to leverage more global information from the input. Subsequent hierarchy levels operate at a higher resolution and smaller context size. They condition on the previous level's output in addition to the current-level incomplete TSDF. We use three hierarchy levels, with a large context of several meters (∼6m³) at the coarsest level, up to a fine-scale voxel resolution of ∼5cm³; see Fig. 1.
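To make the context/resolution trade-off concrete, the sketch below (not the authors' code) shows how such a coarse-to-fine pass could be driven over an arbitrarily large scene. The three voxel sizes follow Sec. 4 (18.8cm, 9.4cm, 4.7cm); `predict_level` is a hypothetical callable standing in for the trained network of each hierarchy level.

```python
import numpy as np

VOXEL_SIZES_M = [0.188, 0.094, 0.047]  # coarsest to finest hierarchy level

def downsample(volume, factor):
    """Nearest-neighbor downsampling of a volumetric grid by an integer factor."""
    return volume[::factor, ::factor, ::factor]

def upsample(volume, factor):
    """Nearest-neighbor upsampling to match the next (finer) level's grid."""
    return volume.repeat(factor, axis=0).repeat(factor, axis=1).repeat(factor, axis=2)

def complete_scene(tsdf_fine, predict_level):
    """Coarse-to-fine inference over an arbitrarily large scene.

    tsdf_fine:      partial input TSDF sampled at the finest (4.7cm) resolution.
    predict_level:  hypothetical callable (level, tsdf, prev) -> (tdf, semantics)
                    wrapping the trained autoregressive network of that level.
    """
    prev = None
    for level, voxel_size in enumerate(VOXEL_SIZES_M):
        factor = round(voxel_size / VOXEL_SIZES_M[-1])          # 4, 2, 1
        tsdf_level = downsample(tsdf_fine, factor)
        tdf, sem = predict_level(level, tsdf_level, prev)
        if level + 1 < len(VOXEL_SIZES_M):
            up = round(voxel_size / VOXEL_SIZES_M[level + 1])   # 2x to the next level
            prev = (upsample(tdf, up), upsample(sem, up))
    return tdf, sem
```

In practice, the upsampled coarse prediction and the current-level partial TSDF are fed jointly to the next level's network, as detailed in Sec. 5.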
Our network uses an autoregressive architecture based on that of Reed et al. [28]. We divide the volumetric space of a given hierarchy level into a set of eight voxel groups, such that voxels from the same group do not neighbor each other; see Fig. 2. The network predicts all voxels in group one, followed by all voxels in group two, and so on. The prediction for each group is conditioned on the predictions for the groups that precede it. Thus, we use eight separate networks, one for each voxel group; see Fig. 2.

Figure 2. Our model divides volumetric space into eight interleaved voxel groups, such that voxels from the same group do not neighbor each other. It then predicts the contents of these voxel groups autoregressively, predicting voxel group i conditioned on the predictions for groups 1 . . . i − 1. This approach is based on prior work in autoregressive image modeling [28].
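The following minimal sketch (an illustration, not the authors' implementation) derives the group index of each voxel from the parity of its coordinates, which guarantees that voxels of the same group never neighbor each other, and fills the prediction volume one group at a time; the `group_networks` predictors and the particular group ordering are assumptions.

```python
import numpy as np

def voxel_group_index(shape):
    """Return an [X, Y, Z] array with values 0..7 assigning each voxel to a group."""
    x, y, z = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]),
                          np.arange(shape[2]), indexing="ij")
    return (x % 2) * 4 + (y % 2) * 2 + (z % 2)

def predict_autoregressively(tsdf, group_networks):
    """Fill a prediction volume group by group, conditioning on earlier groups.

    group_networks: hypothetical list of eight per-group predictors; each one sees
    the input TSDF plus the partially-filled prediction volume.
    """
    groups = voxel_group_index(tsdf.shape)
    prediction = np.zeros_like(tsdf)
    for g in range(8):
        mask = (groups == g)
        # Each network predicts the whole volume; only its own group is kept.
        prediction[mask] = group_networks[g](tsdf, prediction)[mask]
    return prediction
```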
We also explore multiple options for the training loss function which penalizes differences between the network output and the ground truth target TDF. As one option, we use a deterministic ℓ1-distance, which forces the network to focus on a single mode. This setup is ideal when partial scans contain enough context to allow for a single explanation of the missing geometry. As another option, we use a probabilistic model formulated as a classification problem, i.e., TDF values are discretized into bins and their probabilities are weighted based on the magnitude of the TDF value. This setup may be better suited for very sparse inputs, as the predictions can be multi-modal.
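The sketch below illustrates both loss options in PyTorch under stated assumptions: the 32-bin setting follows Tab. 1, while the bin boundaries and the magnitude-based weighting of the classification targets are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

TRUNCATION = 3.0   # TDF truncation in voxel units (Sec. 4)
NUM_BINS = 32

def deterministic_loss(pred_tdf, target_tdf):
    """Option 1: direct l1 regression to the target TDF."""
    return F.l1_loss(pred_tdf, target_tdf)

def quantize_tdf(tdf):
    """Map TDF values in [0, TRUNCATION] to integer bin indices 0..NUM_BINS-1."""
    bins = (tdf / TRUNCATION * (NUM_BINS - 1)).round().long()
    return bins.clamp(0, NUM_BINS - 1)

def probabilistic_loss(pred_logits, target_tdf):
    """Option 2: classification over quantized distance bins.

    pred_logits: [B, NUM_BINS, X, Y, Z]. Voxels with small TDF (near a surface)
    are up-weighted here; the particular weighting is illustrative only.
    """
    target_bins = quantize_tdf(target_tdf)                     # [B, X, Y, Z]
    weights = 1.0 + (TRUNCATION - target_tdf).clamp(min=0.0)   # larger near surfaces
    per_voxel = F.cross_entropy(pred_logits, target_bins, reduction="none")
    return (weights * per_voxel).mean()
```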
In addition to predicting complete geometry, the model jointly predicts semantic labels on a per-voxel basis. The semantic label prediction also leverages the fully-convolutional autoregressive architecture as well as the coarse-to-fine prediction strategy to obtain an accurate semantic segmentation of the scene. In our results, we demonstrate how completion greatly helps semantic inference.

4. Data Generation

To train our ScanComplete CNN architecture, we prepare training pairs of partial TSDF scans and their complete TDF counterparts. We generate training examples from SUNCG [35], using 5359 train scenes and 155 test scenes from the train-test split from prior work [35]. As our network requires only depth input, we virtually scan depth data by generating scanning trajectories mimicking real-world scanning paths. To do this, we extract trajectory statistics from the ScanNet dataset [5] and compute the mean and variance of camera heights above the ground as well as the camera angle between the look and world-up vectors. For each room in a SUNCG scene, we then sample from this distribution to select a camera height and angle.
Within each 1.5m³ region in a room, we select one camera to add to the training scanning trajectory. We choose the camera c whose resulting depth image D(c) is most similar to depth images from ScanNet. To quantify this similarity, we first compute the histogram of depth values H(D(c)) for all cameras in ScanNet, and then compute the average histogram, H̄. We then compute the Earth Mover's Distance between histograms for all cameras in ScanNet and H̄, i.e., EMD(H(D(c)), H̄) for all cameras c in ScanNet. We take the mean µ_EMD and variance σ²_EMD of these distance values. This gives us a Gaussian distribution over distances to the average depth histogram that we expect to see in real scanning trajectories. For each candidate camera c, we compute its probability under this distribution, i.e., N(EMD(H(D(c)), H̄); µ_EMD, σ_EMD). We take a linear combination of this term with the percentage of pixels in D(c) which cover scene objects (i.e., not floor, ceiling, or wall), reflecting the assumption that people tend to focus scans on interesting objects rather than pointing a depth sensor directly at the ground or a wall. The highest-scoring camera c* under this combined objective is added to the training scanning trajectory. This way, we encourage a realistic scanning trajectory, which we use for rendering virtual views from the SUNCG scenes.
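A hedged numpy sketch of this scoring rule is given below. The depth binning, the ScanNet statistics (avg_hist, mu_emd, sigma_emd), and the mixing weight alpha are placeholders; only the overall structure (EMD to the average ScanNet histogram, Gaussian likelihood, linear combination with object coverage) follows the text.

```python
import numpy as np

DEPTH_BINS = np.linspace(0.0, 6.0, 61)   # hypothetical 10cm-wide depth bins

def depth_histogram(depth_image):
    """Normalized histogram of the valid depths in a rendered view."""
    hist, _ = np.histogram(depth_image[depth_image > 0], bins=DEPTH_BINS)
    hist = hist.astype(np.float64)
    return hist / max(hist.sum(), 1e-8)

def emd_1d(hist_a, hist_b):
    """Earth Mover's Distance between two 1D histograms (L1 distance of CDFs)."""
    return np.abs(np.cumsum(hist_a - hist_b)).sum()

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def camera_score(depth_image, object_mask, avg_hist, mu_emd, sigma_emd, alpha=0.5):
    """Combine histogram similarity to ScanNet with non-structural object coverage."""
    emd = emd_1d(depth_histogram(depth_image), avg_hist)
    similarity = gaussian_pdf(emd, mu_emd, sigma_emd)     # N(EMD; mu_EMD, sigma_EMD)
    object_fraction = object_mask.mean()                  # pixels not floor/ceiling/wall
    return alpha * similarity + (1.0 - alpha) * object_fraction

# Per 1.5m region, the candidate with the highest score would be added to the trajectory:
# best_cam = max(candidates, key=lambda c: camera_score(render_depth(c), object_mask(c),
#                                                       avg_hist, mu_emd, sigma_emd))
```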
For rendered views, we store per-pixel depth in meters. We then volumetrically fuse [4] the data into a dense regular grid, where each voxel stores a truncated signed distance value. We set the truncation to 3× the voxel size, and we store TSDF values in voxel-distance metrics. We repeat this process independently for three hierarchy levels, with voxel sizes of 4.7cm³, 9.4cm³, and 18.8cm³.
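For reference, a simplified per-voxel sketch of this fusion step [4] is shown below; `project` is a hypothetical helper that maps a world-space point to pixel coordinates and camera-space depth, and the loop is written for clarity rather than speed.

```python
import numpy as np

def fuse_depth_frame(tsdf, weight, voxel_origin, voxel_size, depth, project,
                     truncation_voxels=3.0):
    """Integrate one depth frame into a TSDF grid stored in voxel-distance units."""
    trunc = truncation_voxels  # truncation of 3x the voxel size (Sec. 4)
    for idx in np.ndindex(tsdf.shape):
        center = voxel_origin + (np.array(idx) + 0.5) * voxel_size
        u, v, cam_z = project(center)      # hypothetical camera projection
        if cam_z <= 0 or not (0 <= u < depth.shape[1] and 0 <= v < depth.shape[0]):
            continue
        d = depth[int(v), int(u)]
        if d <= 0:
            continue
        sdf = (d - cam_z) / voxel_size     # signed distance in voxel units
        if sdf < -trunc:
            continue                       # far behind the observed surface: skip
        sdf = min(sdf, trunc)
        # Weighted running average per voxel, as in Curless and Levoy.
        w = weight[idx]
        tsdf[idx] = (tsdf[idx] * w + sdf) / (w + 1.0)
        weight[idx] = w + 1.0
```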
We generate target TDFs for training using complete meshes from SUNCG. To do this, we employ the level set generation toolkit by Batty [1]. For each voxel, we store a truncated distance value (no sign; truncation of 3× voxel size), as well as a semantic label of the closest object to the voxel center. As with TSDFs, TDF values are stored in voxel-distance metrics, and we repeat this ground truth data generation for each of the three hierarchy levels.

For training, we uniformly sample subvolumes at 3m intervals out of each of the train scenes. We keep all subvolumes containing any non-structural object voxels (e.g., tables, chairs), and randomly discard subvolumes that contain only structural voxels (i.e., wall/ceiling/floor) with 90% probability. This results in a total of 225,414 training subvolumes. We use voxel grid resolutions of [32 × 16 × 32], [32 × 32 × 32], and [32 × 64 × 32] for each level, resulting in spatial extents of [6m × 3m × 6m], [3m³], [1.5m × 3m × 1.5m], respectively. For testing, we test on entire scenes. Both the input partial TSDF and complete target TDF are stored as uniform grids spanning the full extent of the scene, which varies across the test set. Our fully-convolutional architecture allows training and testing on different sizes and supports varying training spatial extents.
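A sketch of this sampling rule, assuming a per-voxel semantic label volume with hypothetical class ids for the structural categories:

```python
import numpy as np

STRUCTURAL_LABELS = {0, 1, 2}   # hypothetical ids for wall / ceiling / floor
FREE_LABEL = -1                 # hypothetical id for empty space

def sample_subvolumes(labels, crop_dims, stride_voxels, rng=np.random.default_rng(0)):
    """Yield corner indices of training crops, skipping most structure-only crops."""
    X, Y, Z = labels.shape
    cx, cy, cz = crop_dims
    for x in range(0, X - cx + 1, stride_voxels):
        for z in range(0, Z - cz + 1, stride_voxels):
            crop = labels[x:x + cx, 0:cy, z:z + cz]
            occupied = crop[crop != FREE_LABEL]
            has_object = np.any(~np.isin(occupied, list(STRUCTURAL_LABELS)))
            # Keep all crops with non-structural objects; keep only ~10% of the rest.
            if has_object or rng.random() > 0.9:
                yield (x, 0, z)
```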
Note that the sign of the input TSDF encodes known and unknown space according to camera visibility, i.e., voxels with a negative value lie behind an observed surface and are thus unknown. In contrast, we use an unsigned distance field (TDF) for the ground truth target volume, since all voxels are known in the ground truth. One could argue that the target distance field should use a sign to represent space inside objects. However, this is infeasible in practice, since the synthetic 3D models from which the ground truth distance fields are generated are rarely watertight. The use of implicit functions (TSDF and TDF) rather than a discrete occupancy grid allows for better gradients in the training process; this is demonstrated by a variety of experiments on different types of grid representations in prior work [7].

5. ScanComplete Network Architecture

Our ScanComplete network architecture for a single hierarchy level is shown in Fig. 3. It is a fully-convolutional architecture operating directly in 3D, which makes it invariant to different training and testing input data sizes.

At each hierarchy level, the network takes the input partial scan as input (encoded as a TSDF in a volumetric grid) as well as the previous low-resolution TDF prediction (if not the base level) and any previous voxel group TDF predictions. Each of the input volumes is processed with a series of 3D convolutions with 1×1×1 convolution shortcuts. They are then all concatenated feature-wise and further processed with 3D convolutions with shortcuts. At the end, the network splits into two paths, one outputting the geometric completion, and the other outputting semantic segmentation, which are measured with an ℓ1 loss and voxel-wise softmax cross entropy, respectively. An overview of the architectures between hierarchy levels is shown in Fig. 1.
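The PyTorch sketch below mirrors this description at a schematic level: per-input convolution stacks with 1×1×1 shortcut branches, feature-wise concatenation, a shared trunk, and separate completion and semantic heads. Channel counts, block depths, and the handling of the autoregressive voxel-group inputs (omitted here) are placeholders rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ConvShortcutBlock(nn.Module):
    """3x3x3 convolutions with a 1x1x1 convolution shortcut."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1))
        self.shortcut = nn.Conv3d(in_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.main(x) + self.shortcut(x))

class ScanCompleteLevel(nn.Module):
    def __init__(self, num_classes, feat=32, prev_channels=2, has_prev_level=True):
        super().__init__()
        self.enc_scan = ConvShortcutBlock(1, feat)                 # partial input TSDF
        self.enc_prev = (ConvShortcutBlock(prev_channels, feat)    # coarser-level TDF + semantics
                         if has_prev_level else None)
        trunk_in = feat * (2 if has_prev_level else 1)
        self.trunk = nn.Sequential(ConvShortcutBlock(trunk_in, feat),
                                   ConvShortcutBlock(feat, feat))
        self.head_tdf = nn.Conv3d(feat, 1, kernel_size=1)            # geometric completion
        self.head_sem = nn.Conv3d(feat, num_classes, kernel_size=1)  # semantic logits

    def forward(self, tsdf, prev_pred=None):
        feats = [self.enc_scan(tsdf)]
        if self.enc_prev is not None:
            feats.append(self.enc_prev(prev_pred))
        x = self.trunk(torch.cat(feats, dim=1))
        return self.head_tdf(x), self.head_sem(x)
```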
Figure 3. Our ScanComplete network architecture for a single hierarchy level. We take as input a TSDF partial scan, and autoregressively
predict both the completed geometry and semantic segmentation. Our network trains for all eight voxel groups in parallel, as we use ground
truth for previous voxel groups at train time. In addition to input from the current hierarchy level, the network takes the predictions (TDF
and semantics) from the previous level (i.e., the next coarser resolution) as input, if available; cf. Fig. 1.
5.1. Training

To train our networks, we use the training data generated from the SUNCG dataset as described in Sec. 4.

At train time, we feed ground truth volumes as the previous voxel group inputs to the network. For the previous hierarchy level input, however, we feed in volumes predicted by the previous hierarchy level network. Initially, we trained on ground-truth volumes here, but found that this tended to produce highly over-smoothed final output volumes. We hypothesize that the network learned to rely heavily on sharp details in the ground truth volumes that are sometimes not present in the predicted volumes, as the network predictions cannot perfectly recover such details and tend to introduce some smoothing. By using previous hierarchy level predicted volumes as input instead, the network must learn to use the current-level partial input scan to resolve details, relying on the previous level input only for more global, lower-frequency information (such as how to fill in large holes in walls and floors). The one downside to this approach is that the networks for each hierarchy level can no longer be trained in parallel. They must be trained sequentially, as the networks for each hierarchy level depend on output predictions from the trained networks at the previous level. Ideally, we would train all hierarchy levels in a single, end-to-end procedure. However, current GPU memory limitations make this intractable.

Since we train our model on synthetic data, we introduce height jittering for training samples to counter overfitting, jittering every training sample in height by a (uniform) random jitter in the range [0, 0.1875]m. Since our training data is skewed towards walls and floors, we apply re-weighting in the semantic loss, using a 1:10 ratio for structural classes (e.g., wall/floor/ceiling) versus all other object classes.
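A small sketch of these two measures follows, assuming a hypothetical class-id layout; the jitter range [0, 0.1875]m and the 1:10 structural re-weighting follow the text, while the wrap-around shift is a simplification of a proper padded translation.

```python
import numpy as np
import torch
import torch.nn.functional as F

def height_jitter(tsdf, labels, voxel_size_m, rng=np.random.default_rng()):
    """Shift a training sample upward by a random offset in [0, 0.1875] m."""
    offset_voxels = int(rng.uniform(0.0, 0.1875) / voxel_size_m)
    # np.roll wraps around; a real implementation would pad instead of wrapping.
    return (np.roll(tsdf, offset_voxels, axis=1),      # assume axis 1 is height
            np.roll(labels, offset_voxels, axis=1))

def semantic_loss(sem_logits, sem_target, num_classes, structural_ids=(0, 1, 2)):
    """Voxel-wise cross entropy with structural classes down-weighted 1:10."""
    weights = torch.ones(num_classes)
    weights[list(structural_ids)] = 0.1
    return F.cross_entropy(sem_logits, sem_target, weight=weights)
```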
For our final model, we train all networks on an NVIDIA GTX 1080, using the Adam optimizer [16] with learning rate 0.001 (decayed to 0.0001). We train one network for each of the eight voxel groups at each of the three hierarchy levels, for a total of 24 trained networks. Note that the eight networks within each hierarchy level are trained in parallel, with a total training time for the full hierarchy of ∼3 days.

6. Results and Evaluation

Completion Evaluation on SUNCG. We first evaluate different architecture variants for geometric scene completion in Tab. 1. We test on 155 SUNCG test scenes, varying the following architectural design choices:

• Hierarchy Levels: our three-level hierarchy (3) vs. a single 4.7cm-only level (1). For the three-level hierarchy, we compare training on ground truth volumes (gt train) vs. predicted volumes (pred. train) from the previous hierarchy level.
• Probabilistic/Deterministic: a probabilistic model (prob.) that outputs per-voxel a discrete distribution over some number of quantized distance value bins (#quant) vs. a deterministic model that outputs a single distance value per voxel (det.).
• Autoregressive: our autoregressive model that predicts eight interleaved voxel groups in sequence (autoreg.) vs. a non-autoregressive variant that predicts all voxels independently (non-autoreg.).
• Input Size: the width and depth of the input context at train time, using either 16 or 32 voxels.
Hierarchy Levels | Probabilistic/Deterministic | Autoregressive | Input Size | ℓ1-Err (entire) | ℓ1-Err (pred. surf.) | ℓ1-Err (target surf.) | ℓ1-Err (unk. space)
1 prob. (#quant=256) non-autoreg. 32 0.248 0.311 0.969 0.324
1 prob. (#quant=256) autoreg. 16 0.226 0.243 0.921 0.290
1 prob. (#quant=256) autoreg. 32 0.218 0.269 0.860 0.283
1 prob. (#quant=32) autoreg. 32 0.208 0.252 0.839 0.271
1 prob. (#quant=16) autoreg. 32 0.212 0.325 0.818 0.272
1 prob. (#quant=8) autoreg. 32 0.226 0.408 0.832 0.284
1 det. non-autoreg. 32 0.248 0.532 0.717 0.330
1 det. autoreg. 16 0.217 0.349 0.808 0.282
1 det. autoreg. 32 0.204 0.284 0.780 0.266
3 (gt train) prob. (#quant=32) autoreg. 32 0.336 0.840 0.902 0.359
3 (pred. train) prob. (#quant=32) autoreg. 32 0.202 0.405 0.673 0.251
3 (gt train) det. autoreg. 32 0.303 0.730 0.791 0.318
3 (pred. train) det. autoreg. 32 0.182 0.419 0.534 0.225
Table 1. Quantitative scene completion results for different variants of our completion-only model evaluated on synthetic SUNCG ground
truth data. We measure the ℓ1 error against the ground truth distance field (in voxel space, up to truncation distance of 3 voxels). Using an
autoregressive model with a three-level hierarchy and large input context size gives the best performance.
We measure completion quality using ℓ1 distances with respect to the entire target volume (entire), predicted surface (pred. surf.), target surface (target surf.), and unknown space (unk. space). Using only a single hierarchy level, an autoregressive model improves upon a non-autoregressive model, and reducing the number of quantization bins from 256 to 32 improves completion (further reduction reduces the discrete distribution's ability to approximate a continuous distance field). Note that the increase in pred. surf. error from the hierarchy is tied to the ability to predict more unknown surface, as seen by the decrease in unk. space error. Moreover, for our scene completion task, a deterministic model performs better than a probabilistic one, as intuitively we aim to capture a single output mode—the physical reality behind the captured 3D scan. An autoregressive, deterministic, full hierarchy with the largest spatial context provides the highest accuracy.
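The sketch below spells out one plausible reading of these four measures; the surface and unknown-space masks are assumptions (voxels within one voxel of a surface, and negative input TSDF, respectively), not the authors' exact evaluation code.

```python
import numpy as np

def completion_errors(pred_tdf, target_tdf, input_tsdf, surface_thresh=1.0):
    """l1 errors over the four masks reported in Tab. 1 (assumed mask definitions)."""
    def masked_l1(mask):
        return float(np.abs(pred_tdf[mask] - target_tdf[mask]).mean()) if mask.any() else 0.0

    entire      = np.ones_like(target_tdf, dtype=bool)
    pred_surf   = pred_tdf < surface_thresh
    target_surf = target_tdf < surface_thresh
    unk_space   = input_tsdf < 0          # behind observed surfaces -> unobserved
    return {"entire":       masked_l1(entire),
            "pred. surf.":  masked_l1(pred_surf),
            "target surf.": masked_l1(target_surf),
            "unk. space":   masked_l1(unk_space)}
```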
We also compare our method to alternative scene completion methods in Tab. 2. As a baseline, we compare to Poisson Surface Reconstruction [13, 14]. We also compare to 3D-EPN, which was designed for completing single objects, as opposed to scenes [7]. Additionally, we compare to SSCNet, which completes the subvolume of a scene viewed by a single depth frame [35]. For this last comparison, in order to complete the entire scene, we fuse the predictions from all cameras of a test scene into one volume, then evaluate ℓ1 errors over this entire volume. Our method achieves lower reconstruction error than all the other methods. Note that while jointly predicting semantics along with completion does not improve on completion, Tab. 3 shows that it significantly improves semantic segmentation performance.

We show a qualitative comparison of our completion against state-of-the-art methods in Fig. 4. For these results, we use the best performing architecture according to Tab. 1. We can run our method on arbitrarily large scenes as test input, thus predicting missing geometry in large areas even when input scans are highly partial, and producing more complete results as well as more accurate local detail. Note that our method is O(1) at test time in terms of forward passes; we run more efficiently than previous methods which operate on fixed-size subvolumes and must iteratively make predictions on subvolumes of a scene, typically O(wd) for a w × h × d scene.

Completion Results on ScanNet (real data). We also show qualitative completion results on real-world scans in Fig. 6. We run our model on scans from the publicly-available RGB-D ScanNet dataset [5], which has data captured with an Occipital Structure Sensor, similar to a Microsoft Kinect or Intel PrimeSense sensor. Again, we use the best performing network according to Tab. 1. We see that our model, trained only on synthetic data, learns to generalize and transfer to real data.

Semantic Inference on SUNCG. In Tab. 3, we evaluate and compare our semantic segmentation on the SUNCG dataset. All methods were trained on the train set of scenes used by SSCNet [35] and evaluated on the test set. We use the SUNCG 11-label set. Our semantic inference benefits significantly from the joint completion and semantic task, significantly outperforming current state of the art.
Figure 4. Completion results on synthetic SUNCG scenes; left to right: input, Poisson Surface Reconstruction [14], 3D-EPN [7], SSCNet
[35], Ours, ground truth.
bed ceil. chair floor furn. obj. sofa table tv wall wind. avg
(vis) ScanNet [5] 44.8 90.1 32.5 75.2 41.3 25.4 51.3 42.4 9.1 60.5 4.5 43.4
(vis) SSCNet [35] 67.4 95.8 41.6 90.2 42.5 40.7 50.8 58.4 20.2 59.3 49.7 56.1
(vis) Ours [sem-only, no hier] 63.6 92.9 41.2 58.0 27.2 19.6 55.5 49.0 9.0 58.3 5.1 43.6
(vis) Ours [sem-only] 82.9 96.1 48.2 67.5 64.5 40.8 80.6 61.7 14.8 69.1 13.7 58.2
(vis) Ours [no hier] 70.3 97.6 58.9 63.0 46.6 34.1 74.5 66.5 40.9 86.5 43.1 62.0
(vis) Ours 80.1 97.8 63.4 94.3 59.8 51.2 77.6 65.4 32.4 84.1 48.3 68.6
(int) SSCNet [35] 65.6 81.2 48.2 76.4 49.5 49.8 61.1 57.4 14.4 74.0 36.6 55.8
(int) Ours [no hier] 68.6 96.9 55.4 71.6 43.5 36.3 75.4 68.2 33.0 88.4 33.1 60.9
(int) Ours 82.3 97.1 60.0 93.2 58.0 51.6 80.6 66.1 26.8 86.9 37.3 67.3
Table 3. Semantic labeling accuracy on SUNCG scenes. We measure per-voxel class accuracies for both the voxels originally visible in
the input partial scan (vis) as well as the voxels in the intersection of our predictions, SSCNet, and ground truth (int). Note that we show
significant improvement over a semantic-only model that does not perform completion (sem-only) as well as the current state-of-the-art.
Fig. 5 shows qualitative semantic segmentation results on SUNCG scenes. Our ability to process the entire scene at test time, in contrast to previous methods which operate on fixed subvolumes, along with the autoregressive, joint completion task, produces more globally consistent and accurate voxel labels.

For semantic inference on real scans, we refer to the appendix.

Figure 5. Semantic voxel labeling results on SUNCG; from left to right: input, SSCNet [35], ScanNet [5], Ours, and ground truth.

Figure 6. Completion results on real-world scans from ScanNet [5]. Despite being trained only on synthetic data, our model is also able to complete many missing regions of real-world data.

7. Conclusion and Future Work

In this paper, we have presented ScanComplete, a novel data-driven approach that takes an input partial 3D scan and predicts both completed geometry and semantic voxel labels for the entire scene at once. The key idea is to use a fully-convolutional network that decouples train and test resolutions, thus allowing for variably-sized test scenes with unbounded spatial extents. In addition, we use a coarse-to-fine prediction strategy combined with a volumetric autoregressive network that leverages large spatial contexts while simultaneously predicting local detail. As a result, we achieve both unprecedented scene completion results as well as volumetric semantic segmentation with significantly higher accuracy than previous state of the art.

Our work is only a starting point for obtaining high-quality 3D scans from partial inputs, which is a typical problem for RGB-D reconstructions. One important aspect for future work is to further improve output resolution. Currently, our final output resolution of ∼5cm³ voxels is still not enough—ideally, we would use even higher resolutions in order to resolve fine-scale objects, e.g., cups. In addition, we believe that end-to-end training across all hierarchy levels would further improve performance with the right joint optimization strategy. Nonetheless, we believe that we have set an important baseline for completing entire scenes. We hope that the community further engages in this exciting task, and we are convinced that we will see many improvements along these directions.
Acknowledgments

This work was supported by a Google Research Grant, a Stanford Graduate Fellowship, and a TUM-IAS Rudolf Mößbauer Fellowship. We would also like to thank Shuran Song for helping with the SSCNet comparison.

References

[1] C. Batty. SDFGen. https://github.com/christopherbatty/SDFGen.
[2] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV), 2017.
[3] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015.
[4] B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 303–312. ACM, 1996.
[5] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
[6] A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt. Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics (TOG), 36(3):24, 2017.
[7] A. Dai, C. R. Qi, and M. Nießner. Shape completion using 3d-encoder-predictor cnns and shape synthesis. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
[8] M. Firman, O. Mac Aodha, S. Julier, and G. J. Brostow. Structured prediction of unobserved voxels from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5431–5440, 2016.
[9] X. Han, Z. Li, H. Huang, E. Kalogerakis, and Y. Yu. High Resolution Shape Completion Using Deep Neural Networks for Global Structure and Local Geometry Inference. In IEEE International Conference on Computer Vision (ICCV), 2017.
[10] C. Häne, S. Tulsiani, and J. Malik. Hierarchical surface prediction for 3d object reconstruction. arXiv preprint arXiv:1704.00710, 2017.
[11] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, et al. Kinectfusion: real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pages 559–568. ACM, 2011.
[12] O. Kähler, V. A. Prisacariu, C. Y. Ren, X. Sun, P. Torr, and D. Murray. Very high frame rate volumetric integration of depth images on mobile devices. IEEE Transactions on Visualization and Computer Graphics, 21(11):1241–1250, 2015.
[13] M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson surface reconstruction. In Proceedings of the fourth Eurographics symposium on Geometry processing, volume 7, 2006.
[14] M. Kazhdan and H. Hoppe. Screened poisson surface reconstruction. ACM Transactions on Graphics (TOG), 32(3):29, 2013.
[15] Y. M. Kim, N. J. Mitra, D.-M. Yan, and L. Guibas. Acquiring 3d indoor environments with variability and repetition. ACM Transactions on Graphics (TOG), 31(6):138, 2012.
[16] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[17] Y. Li, A. Dai, L. Guibas, and M. Nießner. Database-assisted object retrieval for real-time 3d reconstruction. In Computer Graphics Forum, volume 34, pages 435–446. Wiley Online Library, 2015.
[18] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
[19] N. J. Mitra, L. J. Guibas, and M. Pauly. Partial and approximate symmetry detection for 3d geometry. In ACM Transactions on Graphics (TOG), volume 25, pages 560–568. ACM, 2006.
[20] L. Nan, K. Xie, and A. Sharf. A search-classify approach for cluttered indoor scene understanding. ACM Transactions on Graphics (TOG), 31(6):137, 2012.
[21] A. Nealen, T. Igarashi, O. Sorkine, and M. Alexa. Laplacian mesh optimization. In Proceedings of the 4th international conference on Computer graphics and interactive techniques in Australasia and Southeast Asia, pages 381–389. ACM, 2006.
[22] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on, pages 127–136. IEEE, 2011.
[23] D. T. Nguyen, B.-S. Hua, M.-K. Tran, Q.-H. Pham, and S.-K. Yeung. A field model for repairing 3d shapes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 5, 2016.
[24] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG), 2013.
[25] M. Pauly, N. J. Mitra, J. Giesen, M. H. Gross, and L. J. Guibas. Example-based 3d scan completion. In Symposium on Geometry Processing, number EPFL-CONF-149337, pages 23–32, 2005.
[26] M. Pauly, N. J. Mitra, J. Wallner, H. Pottmann, and L. J. Guibas. Discovering structural regularity in 3d geometry. In ACM Transactions on Graphics (TOG), volume 27, page 43. ACM, 2008.
[27] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2016.
[28] S. E. Reed, A. van den Oord, N. Kalchbrenner, S. Gómez, Z. Wang, D. Belov, and N. de Freitas. Parallel multi-scale autoregressive density estimation. In Proceedings of The 34th International Conference on Machine Learning (ICML), 2017.
[29] G. Riegler, A. O. Ulusoy, H. Bischof, and A. Geiger. Octnetfusion: Learning depth fusion from data. arXiv preprint arXiv:1704.01047, 2017.
[30] G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[31] J. Rock, T. Gupta, J. Thorsen, J. Gwak, D. Shin, and D. Hoiem. Completing 3d object shape from one depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2484–2493, 2015.
[32] T. Shao, W. Xu, K. Zhou, J. Wang, D. Li, and B. Guo. An interactive approach to semantic modeling of indoor scenes with an rgbd camera. ACM Transactions on Graphics (TOG), 31(6):136, 2012.
[33] Y. Shi, P. Long, K. Xu, H. Huang, and Y. Xiong. Data-driven contextual modeling for 3d scene understanding. Computers & Graphics, 55:55–67, 2016.
[34] I. Sipiran, R. Gregor, and T. Schreck. Approximate symmetry detection in partial 3d meshes. In Computer Graphics Forum, volume 33, pages 131–140. Wiley Online Library, 2014.
[35] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. Proceedings of 30th IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[36] O. Sorkine and D. Cohen-Or. Least-squares meshes. In Shape Modeling Applications, 2004. Proceedings, pages 191–199. IEEE, 2004.
[37] P. Speciale, M. R. Oswald, A. Cohen, and M. Pollefeys. A symmetry prior for convex variational 3d reconstruction. In European Conference on Computer Vision, pages 313–328. Springer, 2016.
[38] M. Sung, V. G. Kim, R. Angst, and L. Guibas. Data-driven structural priors for shape completion. ACM Transactions on Graphics (TOG), 34(6):175, 2015.
[39] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. arXiv preprint arXiv:1703.09438, 2017.
[40] S. Thrun and B. Wegbreit. Shape from symmetry. In Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, volume 2, pages 1824–1831. IEEE, 2005.
[41] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG), 36(4):72, 2017.
[42] T. Whelan, S. Leutenegger, R. F. Salas-Moreno, B. Glocker, and A. J. Davison. Elasticfusion: Dense slam without a pose graph. Proc. Robotics: Science and Systems, Rome, Italy, 2015.
[43] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[44] W. Zhao, S. Gao, and H. Lin. A robust hole-filling algorithm for triangular mesh. The Visual Computer, 23(12):987–997, 2007.
Appendix

In this appendix, we provide additional details for our ScanComplete submission. First, we show a qualitative evaluation on real-world RGB-D data; see Sec. A. Second, we evaluate our semantics predictions on real-world benchmarks; see Sec. B. Further, we provide details on the comparisons to Dai et al. [7] in Sec. C and visualize the subvolume blocks used for the training of our spatially-invariant network in Sec. D. In Sec. E, we compare the timings of our network against previous approaches, showing that we not only outperform them in terms of accuracy and qualitative results, but also have a significant run-time advantage due to our architecture design. Finally, we show additional results on synthetic data for completion and semantics in Sec. F.

[...] block boundaries. Even though the quantitative error metrics are not too bad for the baseline approach, the visual inspection reveals that the boundary artifacts introduced at these seams are problematic.
Figure 9. Additional results on ScanNet for our completion and semantic voxel labeling predictions.
Figure 10. Additional results on Google Tango scans for our completion and semantic voxel labeling predictions.
bed ceil. chair floor furn. obj. sofa table tv wall wind. avg
ScanNet [5] 11.7 88.7 13.2 81.3 11.8 13.4 25.2 18.7 4.2 53.5 0.5 29.3
SSCNet [35] 33.1 42.4 21.4 42.0 24.7 8.6 39.3 25.2 13.3 47.7 24.1 29.3
Ours 50.4 95.5 35.3 89.4 45.2 31.3 57.4 38.2 16.7 72.2 33.3 51.4
Table 6. Semantic labeling on SUNCG scenes, measured as IOU per class over the visible surface of the partial test scans.
Figure 11. Additional results on SUNCG for our completion and semantic voxel labeling predictions.