Efficient Large-Scale Multi-View Stereo For Ultra High-Resolution Image Sets
DOI 10.1007/s00138-011-0346-8
ORIGINAL PAPER
Abstract We present a new approach for large-scale multi-view stereo matching, which is designed to operate on ultra high-resolution image sets and efficiently compute dense 3D point clouds. We show that, using a robust descriptor for matching purposes and high-resolution images, we can skip the computationally expensive steps that other algorithms require. As a result, our method has low memory requirements and low computational complexity while producing 3D point clouds containing virtually no outliers. This makes it exceedingly suitable for large-scale reconstruction. The core of our algorithm is the dense matching of image pairs using DAISY descriptors, implemented so as to eliminate redundancies and optimize memory access. We use a variety of challenging data sets to validate and compare our results against other algorithms.

Keywords Multi-view stereo · 3D reconstruction · DAISY · High-resolution images

E. Tola (B)
Aurvis Ltd, Ankara, Turkey
e-mail: engin.tola@gmail.com
URL: http://www.aurvis.com

C. Strecha · P. Fua
Computer Vision Laboratory, EPFL, Lausanne, Switzerland
e-mail: christoph.strecha@epfl.ch

P. Fua
e-mail: pascal.fua@epfl.ch

1 Introduction

Multi-view stereo reconstruction of complex scenes has made significant progress in recent years, as evidenced by the quality of the models now being produced. However, most state-of-the-art approaches do not exploit the very high resolution—20 Megapixel and more—that modern cameras can readily acquire. Instead, they rely on moderate sized images, 1–4 Megapixel in recent papers [6,10,17,26,28], and assume a small capture space that makes it possible to use visual hull constraints and volumetric optimization algorithms. Because these are computationally expensive and have large memory requirements, this, in turn, limits the size of the images that can be used. The few approaches that have been shown to scale up to larger images, 2–6 Megapixel, rely on local plane sweeping strategies [6–8,27], which tend to be very slow when applied to larger images. Admittedly, some algorithms are capable of operating on higher resolution images as a whole [8] or by dividing them into overlapping tiles of smaller sub-images and later combining the results [9], but none have used whole images as large as we consider here, mainly because their computation time does not scale linearly with image resolution or because of their high memory requirements.

In this paper, we show that higher resolution images, with their richer texture, can both provide more reliable matches than lower resolution ones and be handled fast enough for practical large scale use. To this end, we introduce an approach, which has very low memory and computational cost requirements, for

– producing quickly point clouds such as the one depicted in Fig. 1, which contain far fewer outliers than those recovered by techniques that rely on plane sweeping [6–8,27];
– exploiting images as large as 40 Megapixel, such as the 31 images depicted in Fig. 2, while reducing the processing time to less than 15 min instead of the many hours that most state-of-the-art algorithms [6–8,27] would have required.
Fig. 1 Statue reconstruction. The data set contains 127 18-Megapixel images of a statue. The reconstruction time was 29.5 min and the final cloud contains 15.3 million points. First row Some images from the data set. Second row The depth maps computed for the images shown in the first row. Last two rows The renderings of the cloud in its raw and colored form from different viewpoints (color figure online)
Our approach operates directly at the highest resolution, which is key for achieving both accuracy and robustness. This is in complete contrast to typical hierarchical approaches that are popular in the field, and the effectiveness of our approach therefore constitutes an important and novel result. What makes this possible is the DAISY descriptor [25] that has been shown to be very powerful for dense wide baseline matching. However, while in our earlier work [25] we focused on using it to compute a data term for use within a graph cut framework, we show here that processing high-resolution images reduces the need for smoothness priors to the point where such a computationally demanding optimization scheme becomes unnecessary. Instead, we introduce a much faster approach for checking match consistency across multiple frames and rejecting erroneous matches, and the resulting algorithm produces very dense and accurate 3D clouds such as those depicted in Figs. 1, 2, and 3 in very acceptable computation times, as can be seen in Table 2.
Fig. 2 EPFL campus reconstruction. The data set contains 31 40-Megapixel images of the EPFL campus shot from a helicopter. The reconstruction time was 14.2 min and the final cloud contains 11.3 million points. First row Some images from the data set. Second row The depth maps computed for the images shown in the first row. As can be seen, the depth maps, while being mostly correct, contain some outliers. Third row The final point cloud from different view points in its raw form. Last row The colorized renderings of the same view points shown in the third row. Notice that the few erroneous points present in the depth maps are filtered out and the final cloud contains virtually no outliers (color figure online)
Fig. 3 Lausanne Cathedral-Ground reconstruction. The 3D point cloud is rendered from several viewpoints. It was constructed in 419 min from
1,302 21-Megapixel images and contains 148.2 million points
While being very effective, these methods do not scale up to modeling extended scenes because the memory and computational requirements would become prohibitive. Thus, large-scale methods use different representations, as discussed below.

A number of techniques replace the unified representation of volumetric methods by a set of depth maps [3,7,14,19,27]. They scale up to very large scenes, sometimes at the cost of losing some of the accuracy and completeness of volumetric methods [16,22]. This class of MVS methods includes global approaches [20,23,25] that impose smoothness priors in the form of Markov Random Fields and local ones [1,19] that rely on gradient-based optimizations to enforce local smoothness. The accuracy of the global methods is limited by the fact that they can only handle relatively small resolution images. This is due to the very large amounts of memory required to store many potential depth states for many pixels. Memory issues also limit GPU-based implementations since it is hard to effectively deal with very large images that do not fit into the relatively small memories of GPUs [3,14]. By contrast, local approaches [1,19] can deal with high-resolution images, in principle at least. However, not only are they relatively slow, but they also require initial depth estimates and do not converge well if these are not sufficiently close to the desired solution.

A different approach for handling large-scale scenes and images of arbitrary sizes is to directly reconstruct oriented point clouds [6,8]. This is achieved by making it unnecessary to incorporate smoothness priors into the computation. Instead, the algorithms rely on visibility constraints to filter the points and reject erroneous ones. For example, in [6], a very dense and accurate point cloud is computed by locally estimating planar-oriented patches. A similar approach is proposed in [8] with the addition of an energy minimization-based method to compute and refine a mesh that represents the 3D points. Both produce very impressive results and outperform most other algorithms on many different data sets [18].

In spirit, our approach is very close to that of [6], except for the fact that using the DAISY descriptor [25] lets us handle much higher resolution images by removing the need for computationally intensive optimizations and produce higher quality point clouds in a fraction of the time. We use temporary depth maps while computing an oriented point cloud but do not impose any smoothness. The inherent smoothness and robustness of the DAISY descriptor makes it unnecessary and lets us skip many of the expensive computations required by other algorithms. In principle, our results could be further refined using a method such as [8], but their quality and density is such that this is barely necessary.

3 Approach

Ours is a two-step approach. We start by computing dense point clouds from image pairs and then check for consistency using additional ones. This is an effective way to ensure with high probability that only correct points are retained, especially if the pairwise clouds are of good quality. In the remainder of this section, we first sketch these two steps and then provide more specific implementation details.

3.1 Pairwise point clouds
[Fig. 4 DAISY descriptor computation: oriented gradients are split into separate orientation layers ("Gradient Separation") and their magnitudes are aggregated into orientation histograms by convolution with Gaussian kernels G_Σ of increasing size ("Histogram Aggregation")]
Given an image pair, ϒ_i = (I_s, I_t), whose baseline might be relatively large, we use the DAISY descriptor we introduced in earlier work [25] to measure similarity along epipolar lines and compute a dense depth map. DAISY is composed of concatenated gradient histograms that are computed by first estimating gradients into separate orientation layers and then aggregating their magnitudes within each layer into orientation histograms, as shown in Fig. 4. This can be done by simply thresholding and convolving the oriented gradients with Gaussian filters of various sizes. It produces the same kind of invariance as SIFT [13] or SURF [2] histogram building but can be computed much more efficiently for every single image point and in every direction. DAISY is therefore ideally suited for dense matching.

In [25], we introduced an EM framework based on max-flow min-cut optimization to estimate both depths and occlusions, which made the complete approach computationally very intensive. By contrast, here we discard this optimization scheme. Instead, we directly use the DAISY matching score to compute the probability of a pixel x having a depth d in one image as

P(d) = \frac{1}{Z} \exp\left( -\frac{\| D^i_x - D^j_{x'(d)} \|^2}{\sigma} \right),   (1)

where D^i_x and D^j_{x'(d)} are the descriptors at x in one image and at the corresponding point x'(d) in the other, assuming that the depth is indeed d. The sharpness of the distribution is controlled by σ and Z is a normalizing constant that ensures that the probabilities sum to one.

To decide whether or not to assign a depth to a pixel, we look for the first two probability maxima along the uniformly sampled epipolar line and consider the ratio of their values:

R_x = \frac{P_{\mathrm{best}}(d)}{P_{\mathrm{second\ best}}(d)}.   (2)

We treat the depth as valid if it is above a threshold, which we take to be 0.8 for all results presented here. The choice for this threshold will be justified in Sect. 5.1.

This approach is depicted in Fig. 5. As will be shown in the results section, we tested the performance of this very simple decision rule under changing image resolution, camera baseline, and descriptor parameters and found it to be very reliable.

This produces a dense point cloud, which we denote as X^i. These pairwise clouds may, of course, still contain a few spurious points.
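To make the decision rule concrete, the following is a minimal sketch of ours, not the released implementation. It assumes the DAISY descriptors along the uniformly sampled epipolar line are already available, reads the peak-ratio test of Eq. 2 as requiring the second-best maximum to stay below 0.8 times the best one, and uses illustrative values for σ and the helper names.

```python
import numpy as np

def pairwise_depth(desc_src, line_descs, line_depths, sigma=0.5, ratio_thresh=0.8):
    """Assign a depth to one pixel from DAISY distances along the epipolar line (Eqs. 1-2)."""
    # Eq. 1: matching probabilities from squared descriptor distances; Z normalizes them.
    d2 = np.sum((line_descs - desc_src) ** 2, axis=1)
    p = np.exp(-d2 / sigma)
    p /= p.sum()

    # Two strongest probability maxima along the sampled line.
    order = np.argsort(p)[::-1]
    p_best, p_second = p[order[0]], p[order[1]]

    # Eq. 2 peak-ratio test: keep the depth only if the best peak clearly dominates.
    if p_second < ratio_thresh * p_best:
        return float(line_depths[order[0]])
    return None   # ambiguous match: leave the pixel without a depth
```

In the actual pipeline the two maxima are additionally forced apart by the descriptor radius before this test, as described in Sect. 4.2.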
Fig. 6 Lausanne Cathedral-Aerial reconstruction. The 3D point cloud was constructed in 22 min from 61 24-Megapixel aerial images and contains 12.7 million points. First row Some images from the dataset. Courtesy of J.M. Zellweger. Last two rows The renderings of the colorized points from several viewpoints (color figure online)
3.2 Enforcing consistency

To eliminate these spurious points, we only retain those for which there is consistent evidence in multiple pairs. Given a 3D point, X, computed from one specific pair, ϒ_i = (I_s, I_t), consistency is measured by reprojecting the point in other images and computing

\epsilon_{i,j}(X) = \frac{|d(X, i) - \mathrm{dmap}_{i,j}(X)|}{\mathrm{dmap}_{i,j}(X)},   (3)

where d(X, i) is the depth of the 3D point X with respect to camera i and dmap_{i,j}(X) is the depth value computed at the projection of X in image i using the image pair (i, j). A 3D point is retained if this consistency measure is small enough for at least C depth maps. Formally, we write

X = \Big\{ X_j, \ \mathrm{iff} \ \Big[ \sum_{i \in Q_j} V(X_j \in X^i) \Big] > C \Big\},   (4)

where X is the final point cloud, X^i is the 3D point cloud for the sub-maximization problem for ϒ_i introduced above, V(·) is a boolean function that returns 1 if its argument is true and 0 otherwise, and Q_j represents a set of image pairs. This definition inherently subsumes the common left-to-right and right-to-left tests often used by correlation-based approaches [4]. This stems from the fact that the depth map obtained by reversing the roles of the cameras can be among the ones chosen for verification (Figs. 6, 7, 8).

Since a point can be instantiated from several different image pairs, for optimum results, it is important to only retain the one whose precision can be expected to be highest and to ignore the others. This is done by considering three geometric factors: (i) the baseline of the stereo pair, (ii) the focal length of the camera, and (iii) the distance of the point to the camera center. Factor (i) affects the expected precision of the point since larger baselines tend to yield more precise depth estimates. Factors (ii) and (iii) control the information content, as a closer camera or a zoomed-in one typically yields more textured views. Thus, we take the precision estimate to be

q(X) = \frac{f \cdot \sin(\theta)}{\| X - C \|},   (5)

where f is the focal length, sin(θ) the baseline measure with θ being the angle between the two camera rays, and ‖X − C‖ the distance to the camera center. When merging the pairwise clouds, we cluster 3D points and only retain from each cluster the one with the highest precision. Not only does this ensure that our final cloud is formed of high precision points, it also provides expected precision measures that could be used to determine where additional images are needed so as to guide further processing.
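The consistency test of Eqs. 3–4 and the precision measure of Eq. 5 are simple enough to summarize in a few lines. The sketch below is an illustration under our own naming, not the authors' implementation: the `project` routine returning pixel coordinates and depth, and the collection of verification depth maps, are assumed to be provided elsewhere.

```python
import numpy as np

def is_consistent(X, verification_maps, project, eps_thresh, C=3):
    """Eqs. 3-4: retain the 3D point X only if it agrees with at least C other depth maps."""
    votes = 0
    for cam, dmap in verification_maps:      # depth maps of nearby cameras (Sect. 4.3 uses the 10 closest)
        u, v, d = project(X, cam)            # pixel coordinates and depth of X in that camera
        stored = dmap[int(v), int(u)]
        if stored <= 0:                      # no depth was estimated at that pixel
            continue
        if abs(d - stored) / stored < eps_thresh:   # Eq. 3
            votes += 1                              # V(.) in Eq. 4
    return votes >= C

def precision(X, cam_center, focal, theta):
    """Eq. 5: expected precision of a point; larger values are better."""
    return focal * np.sin(theta) / np.linalg.norm(np.asarray(X) - np.asarray(cam_center))
```

When several pairs instantiate essentially the same point, only the candidate with the highest `precision` value is kept, which is what the octree clustering of Sect. 4.3 implements.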
4 Implementation details

The two main computational steps introduced in Sects. 3.1 and 3.2 are conceptually simple but require careful implementation to yield accurate results and to run fast on many big images, such as those presented in this paper. Here, we provide some of the critical details.

Fig. 7 Cathedral portal reconstruction. The 3D point cloud was constructed in 9.2 min from 13 21-Megapixel images and contains 20.3 million points. Top row Some example images of the data set. Middle row Computed 3D raw points are rendered from different view points and bottom row shows the colorized rendering of these view points (color figure online)
4.1 Viewpoint selection

View selection is very important since we need to pair up images for which our algorithm is well suited in order to achieve good precision. To this end, we exploit the following metrics to decide if two images are suitable candidates:

– The images should be close enough to one another to be easily matched but should also be sufficiently separated for reliable depth estimation. In general, we quantify this by measuring the angle between the camera principal rays. However, if the 3D points computed as a by-product in the camera calibration stage are available, the angles between these calibration points and the camera centers are averaged for all the visible points to compute this metric.
– Since the DAISY descriptor is not scale invariant, images should be close to each other in scale. Rough scale estimates are computed by projecting five virtual spheres arranged in a cross-hair pattern at the mean value of the expected depth range. The ratio of the radii of the projected spheres is used to measure the scale difference between images. As before, if the calibration points are available, we place a sphere at all the visible calibration points, project them onto the cameras, and compute the scale estimate by averaging the ratio of these projections.

Cameras with a baseline larger than 10° and smaller than 30° and a scale difference between 0.8 and 1.2 are assumed to be matchable. In practice, one single depth map is computed per image by pairing it up with the image that has the closest scale in the matchable set.
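As an illustration only, not the authors' code, the pairing rule just described can be written in a few lines. The angle and scale estimates are assumed to be computed beforehand as explained above; the function and parameter names are ours.

```python
import math

def matchable(baseline_angle_deg, scale_ratio):
    """Sect. 4.1 pairing rule: baseline between 10 and 30 degrees, scale ratio between 0.8 and 1.2."""
    return 10.0 < baseline_angle_deg < 30.0 and 0.8 < scale_ratio < 1.2

def pick_partner(candidates):
    """Keep the matchable candidate whose scale is closest to that of the reference image.

    `candidates` is a list of (image_id, baseline_angle_deg, scale_ratio) tuples.
    """
    ok = [(abs(math.log(s)), img) for img, a, s in candidates if matchable(a, s)]
    return min(ok, key=lambda t: t[0])[1] if ok else None
```

Such a rule returns at most one partner per reference image, which is how a single depth map per image is obtained.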
Fig. 8 Pillar reconstruction. The dataset contains 214 18-Megapixel images of a building pillar. The reconstruction time was 48.9 min and the final cloud contains 63.1 million points. First row Some example images, second row their depth maps, and last two rows the renderings of the cloud from different viewpoints in its raw and colored form, respectively (color figure online)
4.2 Depth map estimation

The matching probabilities, given in Eq. 1 for a pixel, are computed at regular intervals along the epipolar lines. If we were to sample the depth space uniformly instead of its projection onto the image plane, as is done in most plane sweeping-based methods, the probability distribution would be compressed or expanded non-uniformly depending on camera positions and pixel locations. This sampling has several advantages: (i) it removes the dependency of the probability distribution on the geometry of the specific matching problem, (ii) it produces a better modeling of the distribution, (iii) it enables a finer control over the resolution of the search space, and (iv) it allows us to generically define rules on the distribution without having to handle special conditions. In addition, a preset number of depth states is not desirable as it might be too small for some pixels, resulting in the correct match being missed. Alternatively, it might be too large, resulting in unnecessary computation. A closed form solution for uniform sampling of the epipolar line is given in the "Appendix".

In practice, we do not initially compute full probability distributions for each pixel. Instead, we compute them at sparse locations so that we can constrain the search for neighboring ones. This has both computational and algorithmic advantages: a smaller depth-range results in fewer operations and reduces the number of potential matches, thereby increasing robustness. This effectively simulates what global optimizers, such as belief propagation and graph cuts, do by enforcing smoothness, but at a much lower computational cost. This process may fail at depth discontinuities but the resulting erroneous depths will be eliminated by the consistency checks and, therefore, will not degrade the overall performance.

Given the probability distribution, we perform non-maxima suppression and require the best and second best probabilities to be separated by at least the size of the descriptor radius. This is necessary since the descriptor characteristics change smoothly and the probabilities of nearby pixels are correlated.
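The peak-separation step can be illustrated with a short sketch. This is a simplified illustration under our own naming, not the actual implementation: `probs` holds one matching probability per epipolar-line sample and `min_sep` plays the role of the descriptor radius expressed in samples.

```python
import numpy as np

def best_two_peaks(probs, min_sep):
    """Return the indices of the best and second-best probability peaks,
    forcing them to be at least `min_sep` samples apart (Sect. 4.2)."""
    best = int(np.argmax(probs))
    masked = probs.astype(float)
    lo, hi = max(0, best - min_sep), min(len(probs), best + min_sep + 1)
    masked[lo:hi] = -np.inf          # suppress the neighbourhood of the best peak
    second = int(np.argmax(masked))
    return best, second
```

The two peaks returned this way are the ones that enter the ratio test of Eq. 2.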
Finally, to compute the matching score, we modified the original DAISY source code in a number of ways. Since a scale estimate is associated to each image at the viewpoint selection stage presented in Sect. 4.1, we compute the descriptors for each image at the correct scale. In addition, each descriptor is computed orthogonal to the epipolar line passing through it to account for the perspective deformation of the texture and, instead of performing partial histogram normalization, as suggested in the original paper, we normalize descriptors as a whole. The descriptors are computed over relatively small regions since the input images have a narrow baseline setup, and thus full normalization produces more stable descriptors compared to partial normalization over these small aggregation regions. Finally, as the images to be matched have relatively close viewpoints, we use a very short version of the descriptor, as will be discussed in the results section. This stage is very efficient as descriptors are precomputed at their respective orientation and scale and matching is performed at integer resolution.

4.3 Consistency computation

Once the depth maps have been computed, they are turned into a point cloud by verifying the accuracy of each 3D point on more than one depth map. This verification would be costly if we checked each point against all depth maps. Instead, for each point, the verification is performed on the closest ten images whose corresponding camera looks towards the point.

The depth projection error, given in Eq. 3, measures the error made between two depth maps computed from different view points using different image pairs and, if this error for a pixel is within a given threshold for at least C = 3 depth maps, the point is included in the cloud. The threshold for this error is set according to the discretization error of the depth estimation procedure described above. Since the depth maps are computed at integer resolution and since they are computed using different camera configurations, there will always be a difference in the estimated location of the same point in two depth maps. This discretization error can be computed using the calibration parameters, but doing so would cost valuable computation time. Therefore, instead of doing it for each pixel, we settle for a single value for all the pixels of an image by taking the maximum discretization error value computed at only the image corners as the depth projection error threshold.

Since depth maps can be of different qualities depending on the baseline of the cameras used to compute them, within a given 3D volume, we retain only the 3D point whose expected precision, as computed by Eq. 5, is greatest. To this end, a sparse octree structure is employed. Points are placed in an octree and only the point with the biggest precision value within an octree cell is retained. This way, points of lower quality are discarded when there is a point source with more precision, but they are kept within the reconstruction if no other reliable point source exists.

After this filtering, the normals for the points are computed by a plane fitting procedure over a kd-tree structure from the closest 32 points. The colors of the points are assigned to them using the images from which the points are initialized.

4.4 Memory requirements

The most memory-intensive steps of the algorithm are depth-map computation and consistency estimation.

For depth map estimation, only two images need to be loaded into memory. Individual descriptors must then be computed for each one of their pixels. Since each descriptor requires 36 floating point numbers, this means that for two X = 6 Megapixel images, the algorithm requires approximately X × 36 × 4 × 2 = 1,728 MB of memory. The output is a floating point depth value per pixel, which takes X × 4 = 24 MB of memory.

To enforce consistency, a 3D point from one depth-map is instantiated if it is present in at least C others. However, as explained above, instead of checking consistency against all depth-maps, we only use the ten closest ones. This means that only 11 depth-maps need to be loaded at any one time, which represents 11 × X × 4 = 264 MB of memory for our 6-Megapixel images. In practice, if enough RAM is available, we maintain several more depth maps in memory. This reduces latency times since a depth map can be used by several others to validate their 3D points.
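The figures above follow from simple arithmetic, reproduced here as a sketch; the helper below is ours, not part of the released software, and uses the same 36-float descriptor size and 10-neighbour rule quoted in the text.

```python
def memory_estimate_mb(megapixels, loaded_maps=11):
    """Approximate memory use of the two stages of Sect. 4.4, in MB (1 MB = 1e6 bytes)."""
    pixels = megapixels * 1e6
    matching = pixels * 36 * 4 * 2 / 1e6   # descriptors: 36 floats x 4 bytes, for both images of a pair
    depth_map = pixels * 4 / 1e6           # one float per pixel of output
    consistency = loaded_maps * pixels * 4 / 1e6
    return matching, depth_map, consistency

# For 6-Megapixel images this gives roughly 1,728 MB, 24 MB and 264 MB,
# matching the values quoted above.
print(memory_estimate_mb(6))
```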
Fig. 9 Fountain-P11. The 3D point cloud was reconstructed in 2.2 min from 11 6-Megapixel images and contains 2.5 million points. First two
rows The raw 3D points and their colorized versions. Last row The mesh computed from these points using the Poisson reconstruction algorithm
(color figure online)
5 Results

In this section, we present the results of our multi-view stereo algorithm. This section is divided into three subsections where

– we validate our choice of parameters and compare the discriminative power of DAISY against standard normalized cross correlation (NCC) in the framework of our approach,
– we compare our results against other approaches on publicly available benchmark data sets where we also present quantitative accuracy and computation time values,
– we show results on the ultra high resolution and very large-scale image sets for which the method presented here is primarily designed.

5.1 Parameter choices and image resolution

In Fig. 10, the Fountain-P11 sequence, given in Fig. 9, is used to demonstrate the effect of various DAISY parameters on the accuracy and density of the computed depth maps and to compare it against basic NCC. We used the six pairs of narrow-to-wide baseline images of 3,072 × 2,048 resolution and the curves are plotted for the averaged results. In these experiments, we used a standard baseline version of NCC instead of more sophisticated recent implementations that involve either warping the correlation window [6] or introducing a multi-resolution approach [14]. This is because we are not trying to prove that DAISY performs much better than these methods. Instead, we want to show that these time consuming optimizations are not necessary when one uses higher resolution images.

As discussed in Sect. 3, the probabilities of each depth state along a uniformly sampled epipolar line are computed and a depth value is assigned to a pixel if the ratio, R_x, of the probability of the depth to the probability of the next best depth state exceeds a threshold. The detection rate is measured by changing this threshold and error rates are expressed as the ratio of the error of the assigned depth to the actual depth value, defined as:

e = \frac{|d_{\mathrm{assigned}} - d_{\mathrm{ground\ truth}}|}{d_{\mathrm{ground\ truth}}}.   (6)

Figure 10a depicts the influence of image resolution on depth accuracy for a fixed set of DAISY parameters, R = 8, Q = 2, T = 4, and H = 4. For a detailed description of these parameters, we refer the interested reader to [25]. For comparison purposes, we also plot results using NCC with patch size P = 17, which is roughly equivalent to the area used by DAISY.

The graphs show that resolution has a direct impact on accuracy and motivates the use of higher resolution images to achieve a low error rate. In addition, using higher resolution images results in improved accuracy for both DAISY and NCC, but DAISY consistently and significantly outperforms NCC at all resolutions. This largely stems from the fact that we compute depths at pixel resolution. DAISY is more resilient to pixel sampling errors than NCC and hence is not affected adversely by this fact.

Better NCC results could of course be obtained by taking surface orientation into account to warp the sampling grid appropriately [6]. Alternatively, as is done in [14], NCC scores could be improved by averaging different measurements for a set of surface normals to account for possible errors caused by incorrect normal estimates. However, the results depicted in Fig. 10 are nevertheless significant because they show that the DAISY descriptor is robust enough to make these time consuming optimizations unnecessary, which is what makes our approach both effective and fast. Moreover, in Sect. 5.2, we will show that our approach is as accurate as that of [6], which relies on plane sweeping and, in effect, warps the surface patches and therefore optimizes the NCC parameters.
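For completeness, the accuracy statistics used in Fig. 10 can be computed as follows. This is our own minimal sketch of the evaluation protocol described above; per-pixel estimated and ground-truth depths and the mask of assigned pixels are assumed to be given.

```python
import numpy as np

def accuracy_stats(d_est, d_gt, assigned, error_levels=(0.005, 0.01, 0.02)):
    """Detection rate and correct-pixel percentages for the relative error of Eq. 6.

    d_est, d_gt : arrays of estimated and ground-truth depths
    assigned    : boolean mask of pixels that received a depth
    error_levels: relative-error thresholds (e.g. 0.01 for 1 %)
    """
    e = np.abs(d_est[assigned] - d_gt[assigned]) / d_gt[assigned]   # Eq. 6
    detection = 100.0 * assigned.mean()
    correct = [100.0 * float((e < lvl).mean()) for lvl in error_levels]
    return detection, correct
```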
Fig. 10 Parameter choices and image resolution. We use the 3,072 × 2,048 images of the Fountain-P11 dataset. a Effect of the image resolution on depth map quality estimated by reducing the size of the original images. Note that higher resolution images yield more pixels with low error rates. b Varying the DAISY radius R and NCC patch size P computed at 100% detection rate on the original images. R = 8 and P = 17 perform best at low error rates. c Detection rate versus correct pixel percentage for varying descriptor radius
Fig. 11 Consistency threshold selection. We use the Fountain-P11 dataset to see the effect of the confidence threshold C on the quality of the point clouds. A point is assumed an inlier if its distance to the ground truth is smaller than 0.1% of the actual depth. a Number of outliers with respect to varying C threshold values. b The inlier number for the same thresholds
Table 1 Computation times and scene details for benchmark datasets

Data set       Image no.   Res.   Comp. time, 1 core   Comp. time, 8 core   Point no.
HerzJesu-P8    8           6      8.9                  1.5                  3.2
Entry-P10      10          6      9.3                  1.6                  1.4
Fountain-P11   11          6      12.2                 2.2                  2.5
Castle-P30     30          6      20.8                 4.2                  1.8
HerzJesu-P25   25          6      23.1                 4.3                  4.9

Resolutions are given in megapixels, point numbers are in millions, and computation times are in minutes

histogram of deviations of recovered depth from the ground truth. These deviations are expressed as multiples of the ground truth variance and the percentage of recovered depths that fall below this error threshold. On these moderate sized images, we perform roughly equivalently to [6], albeit much faster, as will be discussed below. We are a little less accurate than the best approach [8], which was to be expected for two reasons. First, we only triangulate for benchmarking purposes and use a standard triangulation algorithm that does not enforce global consistency constraints as is done in [8]. Second, Ref. [8] includes a variational mesh refinement scheme, whereas we discretize the depth states as discussed in Sect. 3. This does not influence the algorithm adversely on the high-resolution image sets, like the ones that will be presented in the next section, for which our method is designed, but handicaps it on lower resolution ones. So, one can deduce that to achieve the accuracy of [8] on 6 Megapixel images, our method would need higher resolution images to reduce the discretization error inherent in our approach. However, since our method runs very efficiently for high-resolution images, this handicap for lower resolution images does not constitute a major problem as long as higher-resolution images can be acquired. In case no high-resolution images exist, Ref. [8] seems to be a better solution than ours.

In a sense, the benchmarking site compares two things that are not comparable: the raw output of the matching algorithm in our case against a refined version in the other cases. For a more direct comparison, we therefore used the publicly available code of one of the best performing point cloud-based algorithms [6] and ran it using either 7 × 7 or 17 × 17 patches on the Fountain and Herzjesu sequences, which consist of 3,072 × 2,048 images. In Fig. 16, we plot the number of correctly found points versus the allowed error threshold, where a point is assumed correct if its distance to the ground truth is smaller than the error threshold. To compute the point error, both the reconstructed 3D points and the ground truth are projected onto the images. The error is then taken as the smallest ratio of the depth error to the actual depth value, as given in Eq. 6, across all the input images. Using the 17 × 17 patches results in denser clouds than using the 7 × 7 ones, but they remain far sparser than ours for any given accuracy level.

5.2.2 Speed

Running [6] using a 17 × 17 patch size takes 843 min for Fountain and 508 min for Herzjesu. These numbers become 338 and 204 min for 7 × 7 patches. When using our method with descriptor size R = 8, they drop to 12.2 and 8.9 min, respectively. For consistency, we used a single core in all cases. Our approach is about 40–50 times faster than [6] for parameters chosen so as to produce clouds as dense as ours. The speed difference would have been much more dramatic for large-scale sequences like the ones presented in the next section, such as the Lausanne Cathedral-Ground sequence shown in Fig. 3 or the Lausanne City sequence shown in Fig. 17.

There are two main reasons for the speed difference. First, the use of NCC as a similarity measure in [6] forces an optimization over the surface normal for all the points besides an optimization over depth. This is necessary for NCC since failure to do so will result in poor performance, as it is known to be very sensitive to perspective deformations. This additional optimization over the surface normal is avoided in our algorithm mainly because DAISY is quite robust to both perspective and sampling errors and thus a single optimization over depth is sufficient. Second, the memory access pattern of our depth estimation framework is very efficient. In [6], the authors propose a diffusion-like approach where, starting from some seed points, new points are included into the cloud by iteratively searching, expanding and filtering new points near already included points. In addition, more than two images are used for measuring the photo-consistency score of a point, and this may mean loading and releasing an image many times if the image set is large enough not to fit into memory. In our case, however, depth estimation is done only on a pair of images and the consistency check is done in the cloud computation stage. This separation of jobs is what enables us to access memory very efficiently.

We are unfortunately unable to give timing estimates for the other methods discussed above because the corresponding publications do not mention them and no source code is publicly available. However, there is little reason to believe that any of those implemented on CPUs would be any faster. Furthermore, since the DAISY descriptor relies on convolutions, it could easily be implemented on a GPU. This should yield further increases in speed such as those reported in [14] without a need for a computer cluster as is done in that work.

In Fig. 18, we present the computation times for depth map estimation using different image resolutions and different numbers of cores. All the experiments are done on a 12 GB Intel Xeon 2.5 GHz Quad Core machine. When all 8 cores are used, depth maps are computed for 40, 20, 10, and 5 Megapixel images in 35, 20, 11, and 5 s, with an average of 2,100 depth tests per pixel for the highest resolution.
Fig. 12 HerzJesu-P25. The 3D point cloud was constructed in 4.3 min from 25 6-Megapixel images and contains 4.9 million points. Top row The raw 3D points and bottom row their colorized renderings (color figure online)
Fig. 13 Entry-P10. The 3D point cloud was constructed in 1.6 min from 10 6-Megapixel images and contains 1.4 million points. Left images The raw points and right images the colorized versions of these points (color figure online)
Fig. 14 Castle-P30. The 3D point cloud was constructed in 4 min from 30 6 Megapixel relatively low textured images and contains 1.8 million
points. We show the raw 3D points as seen from several viewpoints
From these experiments, we see that doubling the number of cores roughly halves the computation time. Naturally, the computation time can be further reduced using a larger number of cores.

Note that running our approach on all eight cores for the whole pipeline is on average 6 times faster than running on a single one, as seen in Table 1, even when using only simple OpenMP preprocessor instructions for for-loop parallelization, thus indicating that our method is inherently parallel and could easily be made to run even faster on a cluster if additional speed was required. Albeit minor, there is a speed loss compared to the depth map computation stage for parallel operation, and this is mostly due to disk read/write operations when storing intermediate results.

5.3 Handling large data sets

We now present the results on large scale sets of very high resolution images. We used the algorithm outlined in [21] to register these images. For each set, we render the final point cloud, which is computed by combining individual depth maps as explained in Sect. 3.2, from different view points.
Fig. 15 Comparing against state-of-the-art. We compare our reconstructions to those of other methods reported in [22] on a Fountain-P11, b HerzJesu-P25, c Entry-P10 and d Castle-P30. We plot the cumulative error histograms and the legend is Hiep [8], Fur [6], Tyl [27], St [19] and Jan [11]. Our own method is labeled as "Ours". On these moderate-resolution images, our accuracy is comparable to that of [6] and a little worse than that of [8], which can be attributed to the fact that, unlike [8], we do not refine the point clouds
Fig. 16 Comparison against [6]. For Fountain-P11 on the left and HerzJesu-P8 on the right, we plot the total number of reconstructed pixels whose depth is within a given distance of the ground-truth value as a function of that distance. Note that for any such distance, our method produces many more points and therefore far denser clouds. Furthermore, it is also faster since, when using a single core, it only requires 12.2 min for Fountain and 8.9 min for Herzjesu, whereas [6], using 17 × 17 patches, requires 843 and 508 min, respectively
The points are shaded according to their estimated normals as explained in Sect. 3. We also show colorized point clouds, where the color of each point is assigned from the image from which the point is initialized. For some data sets, we also show the depth maps computed from selected image pairs in inverse depth representation. The red color denotes the pixels for which the algorithm decides that there is no good match.

Each set was processed on a 12 GB Intel Quad Core Xeon 2.5 GHz machine and computation times and input characteristics are listed in Table 2. For more extensive, detailed and animated results, please visit our website [24].

Table 2 Computation times and scene details for large scale datasets

Data set                    Image no.   Resolution   CPU time   Point no.
Portal                      13          21           9.2        20.3
EPFL                        31          40           14.2       11.35
Lausanne Cathedral-Aerial   61          24           22.1       12.7
Statue                      127         18           29.5       15.3
Pillar                      214         18           48.9       63.1
Lausanne Cathedral-Ground   1,302       21           419        148.2
Lausanne City               4,484       6–21         1,632      272

Resolutions are given in megapixels, point numbers are in millions, and computation times are in minutes using 8 cores

Figure 2 depicts the EPFL Campus dataset. It consists of 40 Megapixel images taken from a helicopter, which represents the highest resolution we tested our algorithm on. The campus is fully reconstructed including trees, grass walkways, parked cars, and train tracks. The only exceptions are some building façades that were not seen from any viewpoint.

Next, in Figs. 1 and 8, we tested our algorithm on two datasets, Statue and Pillar, that contain images of very different scales. The viewpoint selection algorithm successfully paired up images that are similar in scale and, thanks to the precision measure of Eq. 5, only the highest-precision 3D points are retained.

Figures 3 and 6 show the reconstructions of the Lausanne Cathedral computed using ground-level and aerial images, respectively. The ground-level reconstruction is the more accurate, but some of the roofs are missing because they were not visible in any of the images.
Fig. 17 City scale reconstruction. The 3D point cloud was constructed from 4,484 6 to 21 Megapixel ground-level images in 27 h and contains 272 million points. For lack of a machine able to display the whole cloud, this figure depicts a decimated version of it. The top left figure is a top view that shows the extent of the reconstruction and the others are close-up views seen from different perspectives
By contrast, the aerial reconstruction is more complete, even though fewer images were used, because almost all the input images contain the full façade of the cathedral, as can be seen in the first row of Fig. 6. It is less accurate because the images were taken from further away. In addition, in Fig. 7, we reconstructed the main portal of the cathedral using a smaller subset of the ground-level data set. In this case, to increase the density of the computed cloud, we computed two depth maps per image of the data set by pairing it up with the closest two cameras.

Finally, in Fig. 17, we present results on the Lausanne City sequence [21], which is the largest data set we tested our algorithm on. It contains 3,504 6-Megapixel images and 980 21-Megapixel images of the downtown area of Lausanne, seen at different scales. There is much clutter, such as people and cars, and some images were taken at different times of day. In addition, since the cameras are distributed more or less uniformly across the whole area, instead of being clustered at a few landmark locations, there is sometimes only a small overlap between them and only a few images see the same location. This means that, unlike in the case of community photo collection datasets [3,7], we cannot rely on the same place being imaged many times over. Instead, our MVS has to operate effectively even when the scene is only sparsely covered, which it does. It took 1,632 min to compute a cloud containing 272 million points. This may seem long but represents only 27 h, or a little over 1 day, on a single PC, as opposed to a cluster, and without using GPU processing. In other words, this remains manageable on an ordinary computer even though the dataset involved is quite large.

6 Conclusion

In this paper, we presented a novel multi-view stereo algorithm that can handle large-scale very high resolution images at relatively low computational cost.
Fig. 18 Depth map computation times. Computation times for depth maps of different resolution images with respect to different core numbers. The experiments are performed on a 12 GB Intel Xeon 2.5 GHz Quad Core machine. Using 8 cores, we can compute depth maps for 40, 20, 10 and 5 Megapixel images in 35, 20, 11 and 5 s, respectively. In the experiments, on average 2,100 depth states are tested per pixel for the highest resolution. Computation times decrease almost linearly with the number of cores, i.e., doubling the number of cores reduces the computation time by half

In contrast to many state-of-the-art methods that use moderate sized images, we advocate the use of larger resolution ones. We showed that the rich texture of such images removes the need for expensive optimization algorithms and makes it possible to obtain very dense high-quality 3D point clouds using a computationally very simple scheme. This is in part due to the DAISY descriptor being more reliable than normalized cross correlation, making the kind of plane sweeping strategies used in many competing approaches unnecessary.

We validated our approach on various large-scale high-resolution data sets acquired under different conditions, such as from a helicopter, from an airplane, or from the ground, with different viewpoint densities and scale changes. We also compared our results to those of state-of-the-art methods on benchmark data sets and showed that we produce reconstructions of similar accuracy but much faster.

In short, using a robust metric for matching purposes, one can bypass the complex processing other algorithms have to use and reduce the computational load immensely. This makes for a very efficient algorithm that processes ultra high-resolution images in minutes on a standard desktop machine, as evidenced by the experiments we have presented.

The accuracy of the proposed approach could be further improved by performing sub-pixel sampling of the epipolar line for depth estimation, but this would slow down the algorithm by allocating CPU time unnecessarily across the whole disparity range. A better approach would be a multi-resolution sampling of the range by moving in smaller steps for plausible depth values and in larger steps otherwise. Using a descriptor for this type of algorithm has advantages over intensity-based measures, since a descriptor would be more robust to errors in translation or sampling.

Additional speed improvements can be achieved by constraining depth ranges per pixel instead of per image as is done in this work. This could be accomplished easily using an initial rough mesh computed from the calibration points. Also, preventing unnecessary computation for low contrast regions such as sky or saturated areas would definitely help. Another natural extension of this work would be the computation of a triangulated mesh from the reconstructed point cloud that takes visibility into account, both for compact representation and to eliminate the few remaining outliers. Finally, since we believe that the output of our algorithm could serve as input to point cloud refinement techniques such as [8], we make our software available from our website [24].

Appendix: Uniform step sampling

Given two camera calibrations, we present a closed form method to compute a sequence of 3D points such that they all project to a single location on one camera and their projections form a uniformly sampled line on the other camera.

Let a camera be parametrized by its intrinsic parameters K, and extrinsic parameters of rotation matrix R and camera center C. The projection of a 3D point X is defined as

\lambda x = K R (X - C)   (7)

with x = [x \ y \ 1]^T its image coordinates and λ its depth. Then, the back projected ray emanating from point x = [x \ y \ 1]^T is parametrized in terms of the depth variable λ as:

X(\lambda) = \lambda R^T K^{-1} x + C.   (8)

Given two cameras P_0 = (K_0, R_0, C_0) and P_1 = (K_1, R_1, C_1), we would like to sample the back projected line X(λ) so that the projected samples on camera P_1 are uniformly separated (see Fig. 19). The two points separated by depth dλ on X(λ) are equal to:

X(\lambda) = \lambda R_0^T K_0^{-1} x + C_0, \qquad X(\lambda + d\lambda) = (\lambda + d\lambda) R_0^T K_0^{-1} x + C_0.   (9)

Then, the projections of these points on image P_1 are

\omega \, [u \ \ v \ \ 1]^T = \lambda a + b,   (10)

(\omega + d\omega) \, [u + r\,du \ \ v + r\,dv \ \ 1]^T = (\lambda + d\lambda) a + b,   (11)
with a = K_1 R_1 R_0^T K_0^{-1} x and b = K_1 R_1 (C_0 − C_1). In the equations above, (u, v) and ω are the coordinates and depth of X(λ) on image P_1, respectively, ω + dω is the depth of the point X(λ + dλ), (du, dv) is the slope of the epipolar line, and r is the sampling resolution.

Fig. 19 Uniform step sampling framework

The slope of the epipolar line can be found by projecting any two points X(λ_min) and X(λ_max) onto P_1. Let the coordinates of the projections be (u_min, v_min) and (u_max, v_max). The slope is then computed as:

dl = \sqrt{(u_{max} - u_{min})^2 + (v_{max} - v_{min})^2}, \quad du = (u_{max} - u_{min})/dl, \quad dv = (v_{max} - v_{min})/dl.   (12)

Rearranging Eqs. 10 and 11, one can see the relationship between the changes in the depths for the two views as:

a \, d\lambda = \omega \, [r\,du \ \ r\,dv \ \ 0]^T + d\omega \, [u + r\,du \ \ v + r\,dv \ \ 1]^T.   (13)

If we denote the ith element of vector a with a_i and expand the third row of Eq. 13 as a_2 \, d\lambda = d\omega, the first two rows of Eq. 13 become

d\lambda = \frac{\omega \, r\,du}{a_0 - a_2 (u + r\,du)} = \frac{\omega \, r\,dv}{a_1 - a_2 (v + r\,dv)}.   (14)

Equation 14 is the update in the depth with respect to camera P_0 that one must make in order to move r pixels in the direction of (du, dv) on camera P_1, where (u, v) and ω are the current projection and current depth with respect to camera P_1 (see Fig. 19). By iterating this process, the sequence of 3D points with a uniform sampling projection can be computed. Algorithm 1 presents the pseudo-code for this closed form solution.

Algorithm 1 Uniform step sampling of the epipolar line
1: a ⇐ K_1 R_1 R_0^T K_0^{-1} x
2: b ⇐ K_1 R_1 (C_0 − C_1)
3: u_min ⇐ (λ_min a_0 + b_0)/(λ_min a_2 + b_2), v_min ⇐ (λ_min a_1 + b_1)/(λ_min a_2 + b_2)
4: u_max ⇐ (λ_max a_0 + b_0)/(λ_max a_2 + b_2), v_max ⇐ (λ_max a_1 + b_1)/(λ_max a_2 + b_2)
5: dl ⇐ sqrt((u_max − u_min)^2 + (v_max − v_min)^2)
6: du ⇐ (u_max − u_min)/dl, dv ⇐ (v_max − v_min)/dl
7: ω_min ⇐ λ_min a_2 + b_2
8: ω ⇐ ω_min
9: λ ⇐ λ_min
10: (u, v) ⇐ (u_min, v_min)
11: while λ < λ_max do
12:   if |du| not equal to 0 then
13:     dλ ← ω r du / (a_0 − a_2 (u + r du))
14:   else
15:     dλ ← ω r dv / (a_1 − a_2 (v + r dv))
16:   end if
17:   λ ← λ + dλ
18:   (u, v) ← (u + r du, v + r dv)
19:   X ← λ R^T K^{−1} x + C
20:   ω ← λ a_2 + b_2
21: end while
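Algorithm 1 translates almost line for line into the following sketch (NumPy). It is an illustration under the camera conventions defined above rather than the authors' code; the small guard against a non-positive depth step is our addition and is not part of the original pseudo-code.

```python
import numpy as np

def uniform_epipolar_samples(K0, R0, C0, K1, R1, C1, x, lam_min, lam_max, r=1.0):
    """Depths whose projections on camera P1 are spaced r pixels apart along the epipolar line."""
    ray = R0.T @ np.linalg.inv(K0) @ x          # back-projected ray direction of pixel x (Eq. 8)
    a = K1 @ R1 @ ray
    b = K1 @ R1 @ (C0 - C1)

    def proj(lam):                              # (u, v, omega) of X(lam) on camera P1 (Eq. 10)
        p = lam * a + b
        return p[0] / p[2], p[1] / p[2], p[2]

    u, v, w = proj(lam_min)
    u_max, v_max, _ = proj(lam_max)
    dl = np.hypot(u_max - u, v_max - v)
    du, dv = (u_max - u) / dl, (v_max - v) / dl  # Eq. 12

    lam, samples = lam_min, []
    while lam < lam_max:
        if abs(du) > 1e-12:                      # Eq. 14, first form
            dlam = w * r * du / (a[0] - a[2] * (u + r * du))
        else:                                    # Eq. 14, second form
            dlam = w * r * dv / (a[1] - a[2] * (v + r * dv))
        if dlam <= 0:                            # safety guard, not in Algorithm 1
            break
        lam += dlam
        u, v = u + r * du, v + r * dv
        X = lam * ray + C0                       # 3D point on the ray (Eq. 8)
        w = lam * a[2] + b[2]
        samples.append((lam, X))
    return samples
```

Each returned depth corresponds to a one-pixel step on the epipolar line of camera P_1, which is exactly the sampling used for the matching probabilities of Eq. 1.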
References

1. Alvarez, L., Deriche, R., Weickert, J., Sanchez, J.: Dense disparity map estimation respecting image discontinuities: a PDE and scale-space based approach. J. Mach. Learning Res. 13(1/2), 3–21 (2002)
2. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: speeded up robust features. In: European Conference on Computer Vision (2006)
3. Frahm, J.-M., Georgel, P., Gallup, D., Johnson, T., Raguram, R., Jen, Y.-H., Wu, C., Dunn, E., Clipp, B., Lazebnik, S., Pollefeys, M.: Building Rome on a cloudless day. In: European Conference on Computer Vision (2010)
4. Fua, P.: A parallel stereo algorithm that produces dense depth maps and preserves image features. Mach. Vis. Appl. 6(1), 35–49 (1993)
5. Fua, P.: From multiple stereo views to multiple 3D surfaces. Comput. Vis. Image Underst. 24(1), 19–35 (1997)
6. Furukawa, Y., Ponce, J.: Accurate, dense, and robust multi-view stereopsis. IEEE Trans. Pattern Anal. Mach. Intell. (preprint, 2009)
7. Goesele, M., Snavely, N., Curless, B., Hoppe, H., Seitz, S.M.: Multi-view stereo for community photo collections, pp. 1–8. In: International Conference on Computer Vision (2007)
8. Hiep, V.H., Keriven, R., Labatut, P., Pons, J.-P.: Towards high-resolution large-scale multi-view stereo, pp. 1430–1437. In: Conference on Computer Vision and Pattern Recognition (2009)
9. Hirschmüller, H.: Stereo processing by semi-global matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 30, 328–341 (2008)
10. Hornung, A., Hornung, E., Kobbelt, L.: Hierarchical volumetric multi-view stereo reconstruction of manifold surfaces based on dual graph embedding, pp. 503–510. In: Conference on Computer Vision and Pattern Recognition (2006)
11. Jancosek, M., Shekhovtsov, A., Pajdla, T.: Scalable multi-view stereo. In: 3DIM (2009)
12. Kazhdan, M., Bolitho, M., Hoppe, H.: Poisson surface reconstruction, pp. 61–70. In: SGP '06: Proceedings of the Fourth Eurographics Symposium on Geometry Processing (2006)
13. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 20(2), 91–110 (2004)
14. Pollefeys, M., Nistér, D., Frahm, J.-M., Akbarzadeh, A., Mordohai, P., Clipp, B., Engels, C., Gallup, D., Kim, S.-J., Merrell, P., Salmi, C., Sinha, S., Talton, B., Wang, L., Yang, Q., Stewénius, H., Yang, R., Welch, G., Towles, H.: Detailed real-time urban 3D reconstruction from video. Int. J. Comput. Vis. 78(2–3), 143–167 (2008). doi:10.1007/s11263-007-0086-4
15. Salman, N., Yvinec, M.: Surface reconstruction from multi-view stereo. In: Asian Conference on Computer Vision (2009)
16. Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms, pp. 519–528. In: Conference on Computer Vision and Pattern Recognition (2006)
17. Starck, J., Miller, G., Hilton, A.: Volumetric stereo with silhouette and feature constraints. In: British Machine Vision Conference (September 2006)
18. Strecha, C.: Multi-view evaluation (2008). http://cvlab.epfl.ch/data
19. Strecha, C., Fransens, R., Van Gool, L.: Wide-baseline stereo from multiple views: a probabilistic account. In: Conference on Computer Vision and Pattern Recognition (2004)
20. Strecha, C., Fransens, R., Van Gool, L.: Combined depth and outlier estimation in multi-view stereo. In: Conference on Computer Vision and Pattern Recognition (2006)
21. Strecha, C., Pylvanainen, T., Fua, P.: Dynamic and scalable large scale image reconstruction. In: Conference on Computer Vision and Pattern Recognition (June 2010)
22. Strecha, C., von Hansen, W., Van Gool, L., Fua, P., Thoennessen, U.: On benchmarking camera calibration and multi-view stereo for high resolution imagery. In: Conference on Computer Vision and Pattern Recognition (2008)
23. Sun, J., Li, Y., Kang, S.B., Shum, H.-Y.: Symmetric stereo matching for occlusion handling, pp. 399–406. In: Conference on Computer Vision and Pattern Recognition (2005)
24. Tola, E.: Efficient large scale multiview stereo for ultra high resolution images (2011). http://www.engintola.com/research/emvs
25. Tola, E., Lepetit, V., Fua, P.: DAISY: an efficient dense descriptor applied to wide baseline stereo. IEEE Trans. Pattern Anal. Mach. Intell. 32(5), 815–830 (2010)
26. Tran, S., Davis, L.: 3D surface reconstruction using graph cuts with surface constraints, pp. 219–231. In: European Conference on Computer Vision (2006)
27. Tylecek, R., Sara, R.: Depth map fusion with camera position refinement. In: CVWW (2009)
28. Vogiatzis, G., Hernandez Esteban, C., Torr, P.H.S., Cipolla, R.: Multiview stereo via volumetric graph-cuts and occlusion robust photo-consistency. IEEE Trans. Pattern Anal. Mach. Intell. 29, 2241–2246 (2007)

Author Biography

Engin Tola received both his Bachelor and Master of Science degrees in Electrical and Electronics Engineering from the Middle East Technical University (METU), Turkey, in 2003 and 2005, respectively. Afterwards he joined the Computer Vision Laboratory of EPFL (Swiss Federal Institute of Technology) and gained his PhD degree in Computer Science in 2010. In 2011, he co-founded Aurvis to continue his research in developing efficient 3D reconstruction techniques and Argutek for mobile augmented reality applications.