Interactive Dense 3D Modeling of Indoor Environments

Hao Du¹  Peter Henry¹  Xiaofeng Ren²  Dieter Fox¹,²  Dan B Goldman³  Steven M. Seitz¹
{duhao,peter,fox,seitz}@cs.washington.edu   xiaofeng.ren@intel.com   dgoldman@adobe.com
¹ University of Washington   ² Intel Labs Seattle   ³ Adobe Systems

Abstract

The arrival of cheap consumer depth cameras, led by Microsoft's Kinect system, presents a huge opportunity for 3D modeling of personal spaces. While 3D indoor mapping techniques are becoming increasingly robust, they are still too brittle to enable non-technical users to build consistent and complete maps of indoor environments. This is due to technical challenges such as limited lighting, occlusion, and lack of texture, and to the fact that novice users lack a deep understanding of the underlying algorithms and their limitations. In this research, we use a prototype affordable RGB-D camera, which provides both color and depth, to build a real-time interactive system that assists and guides a user through the modeling process. Color and depth are jointly utilized to achieve robust 3D alignment. The system offers online feedback and guidance, tolerates user errors and alignment failures, and enables novice users to capture complete and dense 3D models. We evaluate our system and algorithms with extensive experiments.

Figure 1. Interactive 3D mapping: The depth and color frames collected by the user are aligned and globally registered in real time. The system alerts the user if the current data cannot be aligned, and provides guidance on where more data needs to be collected. The user can track the model quality and "rewind" data or introduce additional constraints to improve the global consistency of the model. [Diagram: color and depth streams feed real-time 3D registration and modeling; the system returns feedback, suggestions, and visualization that inform the user's strategy, positioning, and path planning; user input controls the overall work flow.]
1. Introduction

Building 3D models of indoor environments has great potential and many interesting uses. For example, having access to an accurate, photorealistic model of one's home can enable many scenarios such as virtual remodeling or online furniture shopping. Such a model can also provide rich context information for smart home applications.

Indoor 3D modeling is also a hard problem, for reasons such as limited lighting, occlusion, limited field of view, and lack of texture. There has been a lot of work and progress on 3D modeling and reconstruction of environments. State-of-the-art research systems can build 3D models at a city scale [22, 18, 1]. On the other hand, building a complete model of a room, say a small room with textureless walls, remains a challenge.

Many recent works have addressed the robustness and completeness issues in indoor modeling and searched for solutions that solve or bypass them. Sinha et al. [21] built an interactive system that enables a user to mark planar surfaces. Furukawa et al. [7] used the Manhattan World assumption to automatically find such planes. In both cases, photos have to be carefully taken and registered, and geometric details are sacrificed for the sake of large surface textures and appealing visualization.

Our objective is to enable a non-technical user to build dense and complete models of his/her personal environments. One technology that makes this feasible is the wide availability of consumer depth cameras, such as those deployed in the Microsoft Kinect system [19]. These cameras directly provide dense color and depth information. However, their field of view is limited (about 60°) and the data is rather noisy and low resolution (640×480). Henry et al. [11] showed that such cameras are suitable for dense 3D modeling, but much was left to be desired, such as robustness for use by non-experts, or complete coverage of the environment including featureless or low-light areas.

The key idea behind our work is to take advantage of online user interaction and guidance in order to solve many of the issues in 3D environment modeling. We design and implement an interactive 3D modeling system in which the user holds a depth camera to freely scan an environment and enjoys real-time feedback. Our approach has several advantages:

Robust: We compute 3D alignments of depth frames on-the-fly, so that the system can detect failures (caused, for example, by fast motion or featureless areas) and prompt the user to "rewind" and resume scanning. The success of 3D registration of consecutive frames is thus "guaranteed".

Complete: A 3D environment model is constructed on-the-fly. The user can check the model in 3D at any time for coverage and quality. The system also automatically provides suggestions where the map may still be incomplete.

Dense: Largely due to the nature of the depth sensor, the model constructed by our system is dense, without assuming planar surfaces or a "box" model of a room. A dense model reveals details of the environment and has many uses, such as recognizing architectural elements, robot motion planning, telepresence, and visualization.

In addition to developing an interactive mapping system, we introduce a variant of RANSAC for frame-to-frame matching that combines the strengths of the color and depth cues provided by our camera. In contrast to the standard inlier count used to rank matches, our approach learns a classifier that takes additional features, such as visibility consistency, into account. The learned classifier results in more robust frame-to-frame alignments and provides an improved criterion for detecting alignment failures, which is important for our real-time user feedback.

This paper is organized as follows. After discussing related work, Section 3 gives an overview of our mapping system. The frame alignment approach is introduced in Section 4, followed by a description of the interactive mapping technique. Section 6 provides experimental results. We conclude in Section 7.

2. Related Work

Modeling and reconstructing the world in 3D is a problem of central importance in computer vision. Various techniques have been developed for the alignment of multiple views, such as pairwise matching of sparse [12] or dense point clouds [2, 3], two-view and multi-view geometries [10], and joint optimization of camera poses and 3D features through bundle adjustment [24].

3D vision techniques, combined with local feature extraction [15], have led to exciting results in 3D modeling. Photo Tourism [22] is an example where sparse 3D models are constructed from web photos. There has been a lot of work on multi-view stereo techniques [20]. The patch-based framework [9], which has been most successful on object modeling, has also been applied to environment modeling. The work of Furukawa et al. [8] built on these works to obtain dense indoor models using the Manhattan world assumption.

There have been many successful efforts to build real-time systems for 3D structure recovery. Davison et al. built real-time SLAM (simultaneous localization and mapping) systems using monocular cameras [5]. The Parallel Tracking and Mapping system (PTAM) [13] is a closely related system applying SLAM techniques. Another example of real-time sparse 3D modeling can be found in [16]. One recent development is the dense 3D modeling work of [17], which uses PTAM and flow techniques to compute dense depth. Many real-time systems are limited in the scale they can handle.

Due to the difficulties of indoor modeling, such as lighting and lack of texture, interactive approaches have been proposed that utilize human input. [6] was an early example showing very impressive facade models and visualizations with manual labeling. [23] used interactions to extract planes from a single image. [21] is a recent example combining user input with vanishing line analysis and multi-view stereo to recover polygonal structures. Our work is different and novel in that we enable online user interaction, utilizing user input on-the-fly both for capturing data and for extracting geometric primitives.

Recently, there have been many efforts to push the limits of 3D modeling to a large scale. One example is the city-scale, or "Rome"-scale, sparse 3D reconstruction [1]. Another example is the real-time urban street reconstruction work of Pollefeys et al. [18]. In comparison, indoor modeling has not taken off beyond a few small-scale results.

This may soon change with the arrival of mass-produced depth cameras. We believe there are great opportunities to make use of these cameras for 3D modeling. The work of Henry et al. [11] is most relevant to ours. They showed how to use both color and depth for sequential alignment of depth frames and carried out experimental studies of various alignment algorithms and their combinations. Our work aims at making such a depth-camera-based modeling system online, incorporating various aspects of user interaction to make 3D modeling robust, easy to use, and capable of producing dense, complete models of personal spaces.

3. System Overview

Figure 2 gives an overview of our interactive mapping system. The system is based on the well-established structure of online mapping approaches, where each data frame is matched against the most recent frame to provide visual odometry information, and against a subset of previous frames to detect "loop closures" [14, 11, 4]. While visual odometry results in local consistency, loop closures provide constraints used to globally optimize all camera poses.

Figure 2. Detailed system overview: Frame alignment, loop closure detection, and global alignment are performed in real time. Green boxes represent user interactions. The user is alerted if alignment fails, notified of suggested places to visit, and can verify and improve model quality via manual loop closure insertion. [Diagram: RGB-D frames feed local alignment (RANSAC + visibility) and failure detection, which update the model through global alignment and loop closure detection; the user controls the viewpoint, re-adjusts and "rewinds"/resumes after failures, checks and visits incomplete spots suggested by the completeness assessment, and verifies or adds loop closures via the visualization.]

The globally aligned map is visualized in real time, as shown in Figure 3. The user can assess the quality of frame alignment via a bar shown in the visualizer. In order to avoid capturing data that cannot be aligned consecutively, the system alerts the user whenever local alignment fails. In this case, the user has to re-locate the camera with respect to the global model. To do so, the live camera data is matched against a frame that is globally aligned within the map. By default, this frame is the most recently matched frame, but it can also be any map frame chosen by the user. Once the camera is re-localized, the mapping process proceeds as before. All data collected between failure detection and re-localization is discarded, and the constraint graph used for global alignment is updated appropriately.

Figure 3. Real-time visualization of the mapping process: The left panel provides a viewer of the globally aligned 3D map. The health bar in the center right panel indicates the current quality of frame alignment. In a failure case, as is shown, the user is guided to re-locate the camera with respect to a specific frame contained in the map. The upper right panel shows the target frame, and the lower right panel indicates the current camera view.

To enable building complete maps, the system continuously checks the current model for completeness. This analysis provides visual feedback about incomplete areas and guides the user to locations that provide the necessary viewpoints. In addition to automatic loop closure detection, which cannot be expected to work in every case, the user can check the model for inconsistencies and add loop closure constraints between pairs of frames chosen from the map.
4. Color and Depth RANSAC

In this section we describe our real-time matching algorithm for visual odometry. The added depth information associated with image pixels enables us to use 3-point matching, yielding more robust estimation of the relative camera pose between frame pairs. The proposed visibility criteria evaluate the quality of a relative camera pose transform. Using a combination of the visibility criteria and RANSAC significantly improves matching accuracy.

4.1. RANSAC and 3-Point Matching Algorithm

Initial feature matches are established with feature descriptors, such as SIFT or Calonder, applied to interest points in the 2D images. These matched image feature points are associated with their 3D locations. Instead of traditional methods that use the 7-point or 8-point algorithm to estimate fundamental matrices (or essential matrices when the camera intrinsics are known) [10], we directly estimate the full camera pose transform with the 3-point algorithm using the 3D locations. The full camera pose transform, as compared with the essential matrix, provides the added information of translational scale. As there are outliers in the initial feature matches, RANSAC is applied to determine the feature match inliers, and thereby the underlying camera pose transform.

Consider N pairs of initial feature matches between frames F_1 and F_2, represented by 3D coordinates (X_1^i, X_2^i), i = 1, 2, ..., N, in their respective reference systems. The problem is to find a relative transform (R, T) (rotation and translation) that best complies with the initial matches while being robust to outliers. A typical RANSAC approach samples the solution space to get a candidate (R, T), estimating its fitness by counting the number of inliers, f_0,

    f_0(F_1, F_2, R, T) = \sum_{i=1}^{N} L(X_1^i, X_2^i, R, T),    (1)

where

    L(X_1^i, X_2^i, R, T) = \begin{cases} 1, & e = \| R X_1^i + T - X_2^i \| < \epsilon \\ 0, & \text{otherwise} \end{cases}    (2)

and ε is the threshold beneath which a feature match (X_1^i, X_2^i) is determined to be an inlier with respect to the particular (R, T). RANSAC chooses the transform consistent with the largest number of inlier matches.
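For concreteness, a minimal numpy sketch of this 3-point RANSAC loop is given below. It uses the standard SVD-based (Kabsch) solution for the rigid transform hypothesized from three sampled correspondences, which is a common choice but not necessarily the solver used in our system; the threshold and iteration counts are placeholders.

import numpy as np

def rigid_transform(P, Q):
    """Least-squares rigid transform (R, T) with Q ≈ R P + T; P, Q are (n, 3)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - cp).T @ (Q - cq))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cq - R @ cp

def ransac_3point(X1, X2, eps=0.05, iters=500, seed=0):
    """3-point RANSAC over matched 3D points X1, X2 ((N, 3) arrays).
    Scores each candidate by the inlier count f_0 of Eqs. (1)-(2)."""
    rng = np.random.default_rng(seed)
    best_f0, best_R, best_T = 0, None, None
    for _ in range(iters):
        idx = rng.choice(len(X1), size=3, replace=False)
        R, T = rigid_transform(X1[idx], X2[idx])
        e = np.linalg.norm(X1 @ R.T + T - X2, axis=1)   # e in Eq. (2)
        f0 = int((e < eps).sum())                       # inlier count, Eq. (1)
        if f0 > best_f0:
            best_f0, best_R, best_T = f0, R, T
    return best_R, best_T, best_f0

In our full pipeline the candidates are not ranked by f_0 alone but are re-scored with the visibility features described next.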
4.2. Visibility Criteria

Owing to the depth maps captured by the RGB-D camera, the quality of a camera pose transform can be assessed by laying out the point clouds in 3D space and performing a visibility check, termed the Visibility Criteria. The visibility criteria help obtain more accurate relative camera poses (Sec. 4.3) and also provide cues for suggesting loop closure candidates (Sec. 5.3).

Consider the 2D example shown in Figure 4 (left): the scene is a horizontal line shown in black, captured by a pair of cameras. The circles and stars are the depth maps sampled at the camera pixels. When (R, T) is the genuine relative transformation, there is no visibility conflict. When (R*, T*) is a wrong relative transformation, as shown in Figure 4 (right), overlaying the point clouds from both cameras reveals visibility conflicts: when a camera captures a point in 3D, the space along its viewing ray should be completely empty; if there exist points from the other camera in between, there is a conflict.

Figure 4. Visibility conflicts.

In practice, we count the visibility conflicts by projecting the point cloud C_1 from frame F_1 onto the image plane of F_2 and checking its depth in F_2's camera coordinate system. If any such depth is smaller than the depth of F_2's pixel at the corresponding location (ignoring errors by setting a tolerance equal to the depth accuracy), it is counted as a visibility conflict. The same approach is applied in reverse by projecting C_2 onto the image plane of F_1. Several criteria can be derived from the visibility check. We utilize the following ones: the number of visibility conflicts (f_1); the average squared distance of points with visibility conflicts (f_2); and the number of visibility inliers (f_3), obtained by counting the pixels with no visibility conflicts in either direction of projection.

A good candidate transform (R, T) ideally gives f_1 = 0 and f_2 = 0. f_3 indicates the size of the overlapping area between a pair of frames, measured in pixels.
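The sketch below illustrates one way to compute the conflict features for a single direction of projection, assuming a simple pinhole model with intrinsics (fx, fy, cx, cy); the tolerance handling and variable names are simplifications for illustration rather than the exact implementation.

import numpy as np

def visibility_features(points1, depth2, R, T, fx, fy, cx, cy, tol=0.03):
    """Project C_1 (points1, (N, 3) in F_1 coordinates) into F_2's depth map
    (depth2, HxW in meters, 0 = no reading) and count visibility conflicts.
    Returns (f1, f2): conflict count and mean squared conflict distance."""
    H, W = depth2.shape
    p = points1 @ R.T + T                        # C_1 expressed in F_2's frame
    z = p[:, 2]
    z_safe = np.where(z > 1e-6, z, 1.0)          # avoid division by zero
    u = np.round(fx * p[:, 0] / z_safe + cx).astype(int)
    v = np.round(fy * p[:, 1] / z_safe + cy).astype(int)
    inb = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    zi, d2 = z[inb], depth2[v[inb], u[inb]]
    # Conflict: a projected point lies in front of F_2's observed surface,
    # i.e. inside space that F_2's viewing ray says must be empty.
    conflict = (d2 > 0) & (zi < d2 - tol)
    f1 = int(conflict.sum())
    f2 = float(((d2[conflict] - zi[conflict]) ** 2).mean()) if f1 else 0.0
    return f1, f2

Applying the same check in the reverse direction and counting the conflict-free pixels yields the inlier feature f_3.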
4.3. Decision Function for RANSAC

We now have not only the number of RANSAC inliers, f_0, but multiple features f_i (i = 1, 2, 3, ..., m) with which the RANSAC algorithm can pick the final transform. Given a frame pair F_1, F_2 and a candidate transform (R, T), a decision function is needed to estimate how likely the candidate (R, T) is to be a good solution. The general form of the decision function is g(f_1, f_2, ..., f_m). We define g to be a linear function of the f_i, i.e.,

    g(F_1, F_2, R, T) = \sum_{i=0}^{m} \alpha_i f_i,    (3)

and estimate the weights α_i through linear regression. We demonstrate the effectiveness of incorporating the added visibility criteria in Section 6.1.
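A minimal sketch of this re-ranking step is shown below: the weights are fit by ordinary least squares on candidate transforms labeled by how close they are to ground truth, and each candidate is then scored with Eq. (3). The feature layout and the 0/1 training target are assumptions for illustration, not the exact training setup.

import numpy as np

def fit_weights(features, labels):
    """Fit alpha in g = sum_i alpha_i f_i (Eq. 3) by linear regression.
    features: (n_candidates, m+1) rows [f_0, f_1, ..., f_m];
    labels:   (n_candidates,), e.g. 1.0 for near-ground-truth transforms, else 0.0."""
    alpha, *_ = np.linalg.lstsq(features, labels, rcond=None)
    return alpha

def best_candidate(features, alpha):
    """Index of the candidate transform with the highest score g."""
    return int(np.argmax(features @ alpha))

The highest-scoring candidate then replaces the plain inlier-count winner of standard RANSAC.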
5. User Interaction

Our system incorporates user interaction in three ways: failure detection and rewind/resume in matching, completeness guidance, and user-assisted loop closure.

5.1. Rewind and Resume

In interior scene reconstruction, each captured frame usually covers a small area of the entire scene. Thus, the connections (relative camera poses) between neighboring frames are critical for a successful reconstruction. Our system, through online feedback to the user, guarantees that only frames that can be successfully aligned to at least one of the previous frames are added to the map. If a newly captured frame cannot be aligned (for example, because the user moved the camera too quickly or moved to an area with insufficient features), the system stops recording frames until it receives a frame that can be successfully aligned.

Using what we call Rewind and Resume, users can capture new frames by selecting any existing frame as the target frame to align against. The related Undo operation is straightforward but extremely useful. If the system accepts frames that the user does not want (e.g., they may be blurry or not very well aligned), the user can manually 'undo' back to the nearest desired frame and continue from there, recapturing the scene from a different viewing angle. This also gives the user the ability to remove frames that inadvertently contain moving objects, such as people walking in front of the camera.

5.2. Completeness

Capturing a complete 3D model of the scene is desirable, because large missing areas in an incomplete 3D model significantly lower its visual quality. A missing area exists in the scene either because the area has never been captured or because the frames that did contain the area did not get depth values, for reasons such as range or relative angle.

We consider completeness in a user-defined manner. With a passive capturing system, it can be very difficult for the user to be aware of which parts of the scene have been captured. With an online system, the user can view the current reconstruction in real time, inspect the up-to-date 3D model, and directly see which areas are missing.

In order to further assist users in finding uncaptured areas, our system is able to estimate completeness. Consider a bounding box that contains the currently reconstructed point cloud. The inside of the bounding box can be represented by a grid of 3D voxels. Each voxel is classified into one of three categories: (1) there is at least one scene point in the voxel; (2) there must be no scene point in the voxel; (3) neither of the above. All voxels are initialized in category (3). A voxel is assigned to category (1) when it contains a scene point. A voxel is assigned to category (2) when it is not in category (1) and it has been seen through by the viewing ray of any of the existing cameras.

Figure 5. Completeness Guide. Our system displays the classification of voxels from a user-specified 2D slice in the 3D point cloud. Green: the voxel is guaranteed to be empty; Red: it is "occupied"; Blue: unknown area.

Figure 5 shows such a classification for a user-specified 2D slice of the 3D point cloud. The user's goal is then to paint all areas in either green or red by exploring the 3D space.
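A simplified sketch of this classification is given below. Voxels containing reconstructed points become category (1), and voxels traversed by a camera viewing ray in front of a measured point become category (2); for brevity, the ray is sampled at half-voxel steps rather than walked exactly, which is an implementation shortcut of this sketch.

import numpy as np

UNKNOWN, OCCUPIED, EMPTY = 0, 1, 2    # categories (3), (1), (2)

def classify_voxels(points, cam_centers, origin, res, shape):
    """points, cam_centers: (N, 3) arrays, one camera center per scene point.
    origin: bounding-box corner; res: voxel edge length; shape: (nx, ny, nz)."""
    grid = np.zeros(shape, dtype=np.uint8)            # everything starts UNKNOWN
    limit = np.array(shape) - 1

    def to_idx(p):
        return tuple(np.clip(((p - origin) / res).astype(int), 0, limit))

    # Category (2): free space seen through along each viewing ray, stopping
    # one voxel short of the measured point.
    for c, p in zip(cam_centers, points):
        d = np.linalg.norm(p - c)
        for t in np.arange(0.0, max(d - res, 0.0), res / 2):
            grid[to_idx(c + (p - c) * (t / d))] = EMPTY

    # Category (1): voxels containing at least one scene point take precedence.
    for p in points:
        grid[to_idx(p)] = OCCUPIED
    return grid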
considered a matched feature pair if and only if X1i is the
5.3. Interactive Loop Closure best match in Frame 1 for Feature X2i and vice versa. To
enable good performance of the 7-Point RANSAC based
When frames are aligned with reliable relative poses (as on the features’ 2D pixel locations, we include the best fea-
is the case with our system), a small number of loop clo- ture match if and only if the second best match has a sig-
sure constraints can achieve global consistency. Our system nificantly higher distance in feature descriptor space (1.25
can obtain loop closure constraints through both RANSAC times further away than the best distance).
matching and using ICP. While RANSAC matching can be
done automatically, this is computationally expensive and
only works if the inconsistent areas of the map have match- 2D versus 3D RANSAC
ing views. Also, performing RANSAC matching against all
To generate ground truth data in a realistic scenario, we col-
previously recorded frames is computationally prohibitive.
lected a data sequence in a lab and used our system to gen-
The Visibility Criteria are used to suggest frames that are
erate the globally optimized map shown in the bottom right
likely inconsistent. The user can select any pair of frames
panel of Fig. 11. The consistency of that map indicates that
to perform a RANSAC or ICP based alignment. The user
the camera poses for the individual frames are sufficiently
then inspects the resulting map and decides to accept or re-
accurate to serve as ground truth for our experiment.
ject the change.
We randomly sampled frame pairs from this data set and
6. Experiment Results determined their transformation using 7 point RANSAC
on 2D pixel features and 3 point RANSAC on 3D feature
In this section, we present our evaluation and experiment points. Out of all pairs, 7 point RANSAC found a transfor-
results using our system. We show that 3-Point RANSAC mation for 15,149 pairs (≥ 7 inliers), with an average re-
based on feature points with 3D locations performs sig- projection error of 0.69m. 3D RANSAC determined trans-
nificantly better than 7-Point RANSAC based on feature formations for 15,693 pairs (≥ 3 inliers) with an error of
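For reference, the quoted depth accuracies are consistent with the usual stereo triangulation error model, δz ≈ z²·δd / (f·b), if one assumes the disparity is quantized at roughly 1/8 of a pixel; that quantization step is an assumption used only to check the quoted numbers, since only the baseline, focal length, and resolution are specified above.

# Depth error from triangulation: delta_z ≈ z^2 * delta_d / (f * b).
# f and b are the camera specifications quoted above; the 1/8-pixel
# disparity step is an assumption used only to reproduce the quoted figures.
f_px, baseline_m, disparity_step_px = 570.0, 0.075, 1.0 / 8.0

def depth_error_m(z_m):
    return z_m ** 2 * disparity_step_px / (f_px * baseline_m)

print(f"{depth_error_m(0.3) * 100:.2f} cm at 0.3 m")   # ~0.03 cm
print(f"{depth_error_m(5.0) * 100:.1f} cm at 5 m")     # ~7.3 cm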
2D versus 3D RANSAC

To generate ground-truth data in a realistic scenario, we collected a data sequence in a lab and used our system to generate the globally optimized map shown in the bottom right panel of Fig. 11. The consistency of that map indicates that the camera poses of the individual frames are sufficiently accurate to serve as ground truth for our experiment.

We randomly sampled frame pairs from this data set and determined their transformations using 7-point RANSAC on 2D pixel features and 3-point RANSAC on 3D feature points. Out of all pairs, 7-point RANSAC found a transformation for 15,149 pairs (≥ 7 inliers), with an average reprojection error of 0.69m. 3D RANSAC determined transformations for 15,693 pairs (≥ 3 inliers) with an error of 0.31m. Figure 6 provides a more detailed view of this result. It shows the reprojection error versus the number of inliers found between two consecutive frames. As can be seen, using depth information within RANSAC (3D RANSAC) significantly reduces reprojection errors, generating good results even for small numbers of inliers.

Figure 6. Alignment error of 7-point 2D RANSAC and 3-point 3D RANSAC versus number of inliers.

RANSAC with Visibility Features

We also compare the performance of 3-point RANSAC with and without the visibility criteria features in the RANSAC objective. To collect the data needed to train the linear regression model, we placed the depth camera at 12 different locations, measured their ground-truth distances, and collected 100 depth frames at each location. We then randomly picked pairs of camera frames, ran the RANSAC algorithm, and recorded all estimated transforms along with the number of inliers and the visibility features associated with these transforms (we only used pairs of frames that had at least one transform within 1m of the ground-truth distance for the entire RANSAC run). We randomly split the data into a training and an evaluation set. The training set was used to estimate the linear regression model, which was then used to re-rank the RANSAC transforms in the evaluation set.

Figure 7 shows that "RANSAC + Visibility" produces more accurate camera pose translations, indicating more accurate camera pose transforms. An example result is given in Figure 8. The top row shows a pair of frames to be matched; the bottom row shows the projection of the point cloud from the top left frame onto the camera pose of the top right frame using the estimated camera transform without (left) and with (right) the visibility criteria. The pixels with pink color indicate visibility conflicts. As can be seen, our RANSAC version chooses a transformation with fewer inliers but an overall improved alignment. This experiment shows that the visibility criteria help RANSAC find a solution that is closer to the ground-truth camera pose transform.

Figure 7. Accuracy of 3D RANSAC with and without visibility features. Left: alignment error vs. number of inliers. Right: percentage of misaligned frames (misalignment threshold 0.1).

Figure 8. Visibility RANSAC: Top row, image pair. Bottom row, the aligned point cloud. Red: visibility conflicts. Bottom left: using the transform obtained from regular RANSAC; bottom right: using the transform obtained from visibility RANSAC. Original RANSAC: #inliers = 26, avg. distance = 1.64. Visibility RANSAC: #inliers = 23, avg. distance = 0.48.

6.2. Interactive Mapping

To evaluate the capability of our interactive system to generate improved visual odometry data, we performed a small study in which five persons were tasked to collect data for a map of a small meeting room. A 3D map generated with our system is shown in Fig. 9. For each of the five people, we determined whether (s)he was able to collect data that could be consecutively aligned for visual odometry. Three of the people were "expert users" who had substantial experience in using the depth camera for mapping purposes. Two persons were "novice users" who had not previously collected mapping data. The different mapping runs contained between 357 and 781 frames, with roughly 3 frames processed per second.

When using the interactive system, every person was able to collect a data set that covered the entire room and for which all consecutive frames could be aligned. Two of the expert users were able to do so without any intervention by our system. The other three users required on average 16 interventions; that is, the failure detection module in Fig. 2 triggered 16 camera re-localizations, which typically took only 1-2 seconds to achieve. Without the interactive system, none of these three users was able to successfully capture a sequence without visual odometry failures. The mean time between tracking failures was 20.5 frames. This experiment shows that it is difficult to collect good mapping data without interaction, and that our interactive system makes it easier by overcoming frame-to-frame failures.

Figure 9. Top-down view of the 3D map of the meeting room used to evaluate the benefits of interactive mapping.
6.3. Comparison to PMVS

To demonstrate the advantage of cheap depth cameras over standard cameras for dense 3D modeling, we collected a depth camera sequence and a collection of high-resolution camera images of a wall with a textured whiteboard standing in front of it. Fig. 10 shows a zoom into the whiteboard part of the reconstruction achieved by our system (left) and by PMVS [8] using the high-resolution images (right). As can be seen, while PMVS is not able to generate a dense 3D reconstruction, our system generates a full reconstruction of the whiteboard.

Figure 10. 3D reconstruction achieved by our system (left) and by PMVS using high-resolution camera images (right). The blue parts in the right image indicate areas without any depth reconstruction.

6.4. Large Scale Mapping

Fig. 11 shows examples of maps we built with our interactive system. These results were achieved by collecting good visual odometry data using the failure detection and re-localization process, along with the interactive system for adding pairwise loop closure constraints. For instance, the globally consistent map shown at the top was generated with 25 loop closure constraints originating from the interactive loop closure suggestion strategy.

7. Discussion

We presented an interactive system for building dense 3D reconstructions of indoor environments using cheap depth cameras. Such cameras are extremely promising for enabling people to build maps of their personal spaces. At the same time, the solution is far from trivial, partly due to high noise and a limited field of view.

To generate robust visual odometry estimates, we introduce a RANSAC variant that takes full advantage of both depth and color information. The RANSAC criterion is based on a linear regression function that incorporates additional visibility constraints extracted from the depth data. We demonstrate that our RANSAC approach significantly outperforms existing counterparts.

To ensure that the collected data can be aligned into a globally consistent map, our system continuously checks the quality of frame alignment and alerts the user in case of alignment errors. Such errors can occur, for instance, if the user swipes the camera very quickly or gets too close to a featureless wall. Our experiments indicate that alignment errors are otherwise extremely difficult to overcome, even for expert users. Furthermore, we demonstrate that our system is able to generate consistent 3D maps of large-scale indoor environments. In future work, we intend to extract structural information about walls, furniture, and other objects from these 3D maps.
Figure 11. (top) 3D map of an office environment built with our system. (bottom) Closeup view and additional test environment.

References

[1] S. Agarwal, N. Snavely, I. Simon, S. Seitz, and R. Szeliski. Building Rome in a day. In CVPR, pages 72–79, 2010.
[2] P. J. Besl and N. D. McKay. A method for registration of 3-D shapes. IEEE Trans. PAMI, 14(2), 1992.
[3] Y. Chen and G. Medioni. Object modeling by registration of multiple range images. Image Vision Comput., 10(3):145–155, 1992.
[4] L. Clemente, A. Davison, I. Reid, J. Neira, and J. Tardós. Mapping large loops with a single hand-held camera. 2007.
[5] A. Davison, I. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Trans. PAMI, pages 1052–1067, 2007.
[6] P. Debevec, C. J. Taylor, and J. Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In SIGGRAPH, 1996.
[7] Y. Furukawa, B. Curless, S. Seitz, and R. Szeliski. Manhattan-world stereo. 2009.
[8] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski. Reconstructing building interiors from images. In ICCV, 2009.
[9] Y. Furukawa and J. Ponce. Accurate, dense, and robust multi-view stereopsis. IEEE Trans. PAMI, 2009.
[10] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003.
[11] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments. In International Symposium on Experimental Robotics, 2010.
[12] B. K. P. Horn. Closed-form solution of absolute orientation using unit quaternions. J. Opt. Soc. Am. A, 4(4):629–642, 1987.
[13] G. Klein and D. W. Murray. Parallel tracking and mapping for small AR workspaces. In Int'l. Symp. on Mixed and Augmented Reality, 2007.
[14] K. Konolige, J. Bowman, J. D. Chen, P. Mihelich, M. Calonder, V. Lepetit, and P. Fua. View-based maps. International Journal of Robotics Research (IJRR), 29(10), 2010.
[15] D. Lowe. Distinctive image features from scale-invariant keypoints. Int'l. J. Comp. Vision, 60(2), 2004.
[16] E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, and P. Sayd. Real time localization and 3D reconstruction. In CVPR, volume 1, pages 363–370, 2006.
[17] R. A. Newcombe and A. J. Davison. Live dense reconstruction with a single moving camera. In CVPR, 2010.
[18] M. Pollefeys, D. Nister, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S.-J. Kim, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewenius, R. Yang, G. Welch, and H. Towles. Detailed real-time urban 3D reconstruction from video. Int'l. J. Comp. Vision, 72(2):143–167, 2008.
[19] PrimeSense. http://www.primesense.com/.
[20] S. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In CVPR, volume 1, pages 519–528, 2006.
[21] S. Sinha, D. Steedly, R. Szeliski, M. Agrawala, and M. Pollefeys. Interactive 3D architectural modeling from unordered photo collections. ACM Transactions on Graphics (TOG), 27(5):1–10, 2008.
[22] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. In SIGGRAPH, 2006.
[23] P. Sturm and S. Maybank. A method for interactive 3D reconstruction of piecewise planar objects from single images. In BMVC, pages 265–274, 1999.
[24] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment: a modern synthesis. Vision Algorithms: Theory and Practice, pages 153–177, 2000.