Interactive 3D Modeling of Indoor Environments with a Consumer Depth Camera
Hao Du1  Peter Henry1  Xiaofeng Ren2  Dieter Fox1,2  Dan B Goldman3  Steven M. Seitz1
{duhao,peter,fox,seitz}@cs.washington.edu  xiaofeng.ren@intel.com  dgoldman@adobe.com
1University of Washington  2Intel Labs Seattle  3Adobe Systems
Abstract

The arrival of cheap consumer depth cameras, led by Microsoft's Kinect system, presents a huge opportunity for 3D modeling of personal spaces. While 3D indoor mapping techniques are becoming increasingly robust, they are still too brittle to enable non-technical users to build consistent and complete maps of indoor environments. This is due to technical challenges such as limited lighting, occlusion, and lack of texture, and to the fact that novice users lack a deep understanding of the underlying algorithms and their limitations. In this research, we use a prototype affordable RGB-D camera, which provides both color and depth, to build a real-time interactive system that assists and guides a user through the modeling process. Color and depth are jointly utilized to achieve robust 3D alignment. The system offers online feedback and guidance, tolerates user errors and alignment failures, and enables novice users to capture complete and dense 3D models. We evaluate our system and algorithms with extensive experiments.

[Figure 1: system diagram. Color and depth frames feed a real-time 3D registration and modeling module, which drives strategy, positioning and feedback, suggestions and visualization, and path planning; user input and control of the work flow underlies the whole pipeline.]
Figure 1. Interactive 3D mapping: The depth and color frames collected by the user are aligned and globally registered in real time. The system alerts the user if the current data cannot be aligned, and provides guidance on where more data needs to be collected. The user can track the model quality and "rewind" data or introduce additional constraints to improve the global consistency of the model.
1. Introduction

Building 3D models of indoor environments has great potential and many interesting uses. For example, having access to an accurate, photorealistic model of one's home can enable many scenarios such as virtual remodeling or online furniture shopping. Such a model can also provide rich context information for smart home applications.

Indoor 3D modeling is also a hard problem, for many reasons such as limited lighting, occlusion, limited field of view, and lack of texture. There has been a lot of work and progress on 3D modeling and reconstruction of environments. State-of-the-art research systems can build 3D models at a city scale [22, 18, 1]. On the other hand, building a complete model of a room, say a small room with textureless walls, remains a challenge.

Many recent works have addressed the robustness and completeness issues in indoor modeling and searched for ways to solve or bypass them. Sinha et al [21] built an interactive system to enable a user to mark planar surfaces. Furukawa et al [7] used the Manhattan World assumption to automatically find such planes. In both cases, photos have to be carefully taken and registered, and geometric details are sacrificed for the sake of large surface textures and appealing visualization.

Our objective is to enable a non-technical user to build dense and complete models of his/her personal environments. One technology that makes this feasible is the wide availability of consumer depth cameras, such as those deployed in the Microsoft Kinect system [19]. These cameras directly provide dense color and depth information. However, their field of view is limited (about 60°) and the data is rather noisy and low resolution (640×480). Henry et al [11] showed that such cameras are suitable for dense 3D modeling, but much was left to be desired, such as robustness for use by non-experts, or complete coverage of the environment including featureless or low-light areas.

The key idea behind our work is to take advantage of online user interaction and guidance in order to solve many of the issues in 3D environment modeling. We design and implement an interactive 3D modeling system in which the user holds a depth camera to freely scan an environment and enjoys real-time feedback. Our approach has several advantages:
Robust: We compute 3D alignments of depth frames on-the-fly, so that the system can detect failures (caused, for example, by fast motion or featureless areas) and prompt the user to "rewind" and resume scanning. The success of 3D registration of consecutive frames is thus "guaranteed".

Complete: A 3D environment model is constructed on-the-fly. The user can check the model in 3D at any time for coverage and quality. The system also automatically provides suggestions where the map may yet be incomplete.

Dense: Largely due to the nature of the depth sensor, the model constructed by our system is dense, without assuming planar surfaces or a "box" model of a room. A dense model reveals details of the environment and has many uses, such as recognizing architectural elements, robot motion planning, telepresence, or visualization.
In addition to developing an interactive mapping system, we introduce a variant of RANSAC for frame-to-frame matching that combines the strengths of the color and depth cues provided by our camera. In contrast to the standard inlier count used to rank matches, our approach learns a classifier that takes additional features, such as visibility consistency, into account. The learned classifier results in more robust frame-to-frame alignments and provides an improved criterion for detecting alignment failures, which is important for our real-time user feedback.
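As a rough illustration of this idea, the sketch below ranks candidate RANSAC transformations with a learned score instead of the raw inlier count. The feature set, the logistic model standing in for the classifier, and all names (hypothesis_features, best_alignment, the weights w) are illustrative assumptions, not the paper's actual implementation:

    import numpy as np

    def hypothesis_features(inlier_count, mean_residual, visibility_score):
        # Feature vector for one candidate transformation. Standard RANSAC
        # ranks hypotheses by inlier_count alone; here additional cues such
        # as a visibility-consistency score enter the decision. All three
        # features are illustrative stand-ins, not the paper's exact set.
        return np.array([inlier_count, mean_residual, visibility_score, 1.0])

    def score_hypothesis(feat, w):
        # Logistic score in (0, 1): estimated probability that the
        # hypothesized alignment is correct.
        return 1.0 / (1.0 + np.exp(-np.dot(w, feat)))

    # Weights would be learned offline from alignments labeled as
    # correct or failed; these values are purely illustrative.
    w = np.array([0.08, -2.0, 3.5, -1.0])

    def best_alignment(hypotheses, w, accept=0.5):
        # hypotheses: list of (transform, inlier_count, mean_residual,
        # visibility_score). Returns the best transform, or None when even
        # the best score falls below the acceptance threshold, which is
        # what would trigger the "rewind and resume scanning" prompt.
        scored = [(score_hypothesis(hypothesis_features(n, r, v), w), T)
                  for (T, n, r, v) in hypotheses]
        best_score, best_T = max(scored, key=lambda x: x[0])
        return best_T if best_score >= accept else None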
This paper is organized as follows. After discussing related work, Section 3 gives an overview of our mapping system. The frame alignment approach is introduced in Section 4, followed by a description of the interactive mapping technique in Section 5. Section 6 provides experimental results. We conclude in Section 7.

2. Related Works

Modeling and reconstructing the world in 3D is a problem of central importance in computer vision. Various techniques have been developed for the alignment of multiple views, such as pairwise matching of sparse [12] or dense point clouds [2, 3], two-view and multi-view geometries [10], and joint optimization of camera poses and 3D features through bundle adjustment [24].

3D vision techniques, combined with local feature extraction [15], have led to exciting results in 3D modeling. PhotoTourism [22] is an example where sparse 3D models are constructed from web photos. There has been a lot of work on multi-view stereo techniques [20]. The patch-based framework [9], which has been most successful on object modeling, has also been applied to environment modeling. The work of Furukawa et al [8] built on these works to obtain dense indoor models using the Manhattan world assumption.

There have been many successful efforts to build real-time systems for 3D structure recovery. Davison et al. built real-time SLAM (simultaneous localization and mapping) systems using monocular cameras [5]. The Parallel Tracking and Modeling system (PTAM) [13] is a closely related system applying SLAM techniques. Another example of real-time sparse 3D modeling can be found in [16]. One recent development is the dense 3D modeling work of [17], which uses PTAM and flow techniques to compute dense depths. Many real-time systems are limited in the scale they can handle.

Due to the difficulties of indoor modeling, such as lighting and lack of texture, interactive approaches have been proposed to utilize human input. [6] was an early example showing very impressive facade models and visualizations with manual labeling. [23] used interactions to extract planes from a single image. [21] is a recent example combining user input with vanishing line analysis and multi-view stereo to recover polygonal structures. Our work is different and novel, as we enable online user interaction, utilizing user input on-the-fly for both capturing data and extracting geometric primitives.

Recently, there have been many efforts to push the limits of 3D modeling to a large scale. One example is the city-scale, or "Rome"-scale, sparse 3D reconstruction [1]. Another example is the real-time urban street reconstruction work of Pollefeys et al [18]. In comparison, indoor modeling has not taken off beyond a few small-scale results.

This may soon change with the arrival of mass-produced depth cameras. We believe there are great opportunities to make use of these cameras for 3D modeling. The work of Henry et al [11] is most relevant to this work. They showed how to use both color and depth for sequential alignment of depth frames and carried out experimental studies of various alignment algorithms and their combinations. Our work aims at making such a depth-camera-based modeling system online, incorporating various aspects of user interaction to make 3D modeling robust, easy to use, and capable of producing dense, complete models of personal spaces.

3. System Overview

Figure 2 gives an overview of our interactive mapping system. The system is based on the well-established structure of online mapping approaches, where each data frame is matched against the most recent frame to provide visual odometry information, and against a subset of previous frames to detect "loop closures" [14, 11, 4]. While visual odometry results in local consistency, loop closures provide constraints used to globally optimize all camera poses.
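As a minimal sketch of this structure (not the paper's actual implementation), the following shows one step of such an online mapping loop over a pose graph. The PoseGraph class, process_frame, and the align and loop_candidates callables are hypothetical stand-ins, and optimize() is only a placeholder for the global pose optimization a real system would perform:

    import numpy as np

    class PoseGraph:
        # Minimal pose-graph container (illustrative only). Nodes are 4x4
        # camera poses; edges are relative transforms coming from visual
        # odometry or loop closures.
        def __init__(self):
            self.poses = {0: np.eye(4)}  # frame id -> pose estimate
            self.edges = []              # (id_from, id_to, relative 4x4)

        def add_edge(self, src, dst, rel):
            self.edges.append((src, dst, rel))
            if dst not in self.poses:
                # Initialize the new pose by chaining along the edge.
                self.poses[dst] = self.poses[src] @ rel

        def optimize(self):
            # Placeholder: a real system would run nonlinear least squares
            # over all poses using every edge; here the chained estimates
            # are simply kept.
            pass

    def process_frame(fid, frame, prev_id, graph, align, loop_candidates):
        # One step of the online loop. align(frame, other_id) is assumed
        # to return a relative 4x4 transform or None on failure (e.g. the
        # RANSAC matcher of Section 4); loop_candidates(fid) yields a
        # subset of earlier frames to test for loop closures.
        rel = align(frame, prev_id)            # visual odometry
        if rel is None:
            return False                       # trigger the "rewind" prompt
        graph.add_edge(prev_id, fid, rel)
        for cid in loop_candidates(fid):       # loop-closure constraints
            rel_lc = align(frame, cid)
            if rel_lc is not None:
                graph.add_edge(cid, fid, rel_lc)
        graph.optimize()                       # globally refine all poses
        return True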
The globally aligned map is visualized in real time, as shown in Figure 3. The user can assess the quality of frame alignment via a bar shown in the visualizer. In order to avoid capturing data that cannot be aligned consecutively, the system prompts the user to "rewind" and resume scanning whenever alignment fails.
[Figure 2: overview diagram of the interactive mapping system; blocks include "Suggest Places to Visit", "User Check and Visit Incomplete Spots", and "User Control Viewpoint".]

4. Color and Depth RANSAC
[Figure 6: plot of misalignment percentage versus number of RANSAC inliers, with curves for "ransac" and "ransac+visibility".]
Figure 6. Alignment error of 7-point 2D RANSAC and 3-point 3D RANSAC versus number of inliers.

Figure 6 shows the reprojection error versus the number of inliers found between two consecutive frames. As can be seen, using depth information within RANSAC (3D RANSAC) leads to considerably lower alignment error for the same number of inliers.
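The sketch below illustrates the 3-point variant: because each matched feature carries a full 3D point from the depth map, a rigid transform can be fit in closed form from just three correspondences, whereas 2D image matches require seven points for epipolar geometry. The SVD solution used here gives the same result as Horn's quaternion method [12]; the iteration count and the 3 cm inlier threshold are illustrative values, not the paper's settings:

    import numpy as np

    def rigid_transform(P, Q):
        # Closed-form least-squares rigid transform with Q ~= R @ P + t,
        # for 3xN arrays of matched 3D points (N >= 3). This SVD solution
        # is equivalent to Horn's quaternion method [12].
        cp = P.mean(axis=1, keepdims=True)
        cq = Q.mean(axis=1, keepdims=True)
        U, _, Vt = np.linalg.svd((P - cp) @ (Q - cq).T)
        # Diagonal correction guards against a reflection solution.
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T
        return R, cq - R @ cp

    def ransac_3d(P, Q, iters=500, tol=0.03, seed=0):
        # 3-point 3D RANSAC over matched feature points with depth. tol
        # is an inlier distance threshold in meters (illustrative value).
        rng = np.random.default_rng(seed)
        n = P.shape[1]
        best = np.zeros(n, dtype=bool)
        for _ in range(iters):
            idx = rng.choice(n, size=3, replace=False)
            R, t = rigid_transform(P[:, idx], Q[:, idx])
            inl = np.linalg.norm(R @ P + t - Q, axis=0) < tol
            if inl.sum() > best.sum():
                best = inl
        if best.sum() < 3:
            return None                              # alignment failure
        R, t = rigid_transform(P[:, best], Q[:, best])  # refit on inliers
        return R, t, best

In a full pipeline, the resulting hypothesis would then be accepted or rejected by a score such as the learned classifier sketched earlier, rather than by the raw inlier count.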
[11] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments. In International Symposium on Experimental Robotics, 2010. 1, 2
[12] B. K. P. Horn. Closed-form solution of absolute orientation using unit quaternions. J. Opt. Soc. Am. A, 4(4):629-642, 1987. 2
[13] G. Klein and D. W. Murray. Parallel tracking and mapping for small AR workspaces. In Int'l. Symp. on Mixed and Augmented Reality, 2007. 2, 5
[14] K. Konolige, J. Bowman, J. D. Chen, P. Mihelich, M. Calonder, V. Lepetit, and P. Fua. View-based maps. International Journal of Robotics Research (IJRR), 29(10), 2010. 2
[15] D. Lowe. Distinctive image features from scale-invariant keypoints. Int'l. J. Comp. Vision, 60(2), 2004. 2
[16] E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, and P. Sayd. Real time localization and 3D reconstruction. In CVPR, volume 1, pages 363-370, 2006. 2
[17] R. A. Newcombe and A. J. Davison. Live dense reconstruction with a single moving camera. In CVPR, 2010. 2
[18] M. Pollefeys, D. Nister, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S.-J. Kim, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewenius, R. Yang, G. Welch, and H. Towles. Detailed real-time urban 3D reconstruction from video. Int'l. J. Comp. Vision, 72(2):143-67, 2008. 1, 2
[19] PrimeSense. http://www.primesense.com/. 1
[20] S. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In CVPR, volume 1, pages 519-528, 2006. 2
[21] S. Sinha, D. Steedly, R. Szeliski, M. Agrawala, and M. Pollefeys. Interactive 3D architectural modeling from unordered photo collections. ACM Transactions on Graphics (TOG), 27(5):1-10, 2008. 1, 2
[22] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. In SIGGRAPH, 2006. 1, 2, 5
[23] P. Sturm and S. Maybank. A method for interactive 3D reconstruction of piecewise planar objects from single images. In BMVC, pages 265-274, 1999. 2
[24] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment: a modern synthesis. Vision algorithms: theory and practice, pages 153-177, 2000. 2