2108.08291v1

Pixel-Perfect Structure-from-Motion with Featuremetric Refinement
Philipp Lindenberger1 * Paul-Edouard Sarlin2 * Viktor Larsson2 Marc Pollefeys2,3

Departments of 1 Mathematics and 2 Computer Science, ETH Zurich 3 Microsoft
Abstract refined
arXiv:2108.08291v1 [cs.CV] 18 Aug 2021
Structure
Finding local features that are repeatable across multiple from
views is a cornerstone of sparse 3D reconstruction. The clas-
sical image matching paradigm detects keypoints per-image Motion
once and for all, which can yield poorly-localized features
and propagate large errors to the final geometry. In this pa-
per, we refine two key steps of structure-from-motion by a
direct alignment of low-level image information from multi-
ple views: we first adjust the initial keypoint locations prior
to any geometric estimation, and subsequently refine points
and camera poses as a post-processing. This refinement is
robust to large detection noise and appearance changes, as
it optimizes a featuremetric error based on dense features
featuremetric refinement
predicted by a neural network. This significantly improves Figure 1: From sparse to dense. We improve the accu-
the accuracy of camera poses and scene geometry for a wide racy of sparse Structure-from-Motion by refining 2D key-
range of keypoint detectors, challenging viewing conditions, points, camera poses, and 3D points using the direct align-
and off-the-shelf deep features. Our system easily scales ment of deep features. This featuremetric optimization lever-
to large image collections, enabling pixel-perfect crowd- ages dense image information but can scale to scenes with
sourced localization at scale. Our code is publicly avail- thousands of images. Such refinement results in subpixel-
able at github.com/cvg/pixel-perfect-sfm as accurate reconstructions, even in challenging conditions.
an add-on to the popular SfM software COLMAP.
cally selects such points for each image independently and

1. Introduction relies on these initial detections for the remainder of the
reconstruction process. However, detecting keypoints from
Mapping the world is an important requirement for spatial a single view is inherently inaccurate due to appearance
intelligence applications in augmented reality or robotics. changes and discrete image sampling [31]. The advent of
Tasks like visual localization or path planning can bene- convolutional neural network (CNNs) for detection has mag-
fit from accurate sparse or dense 3D reconstructions of nified this issue, as they generally do not retain local image
the environment. These can be built from images using information and instead favor global context.
Structure-from-Motion (SfM), which associates observa-
Multi-view geometric optimization with bundle adjust-
tions across views to estimate camera parameters and 3D
ment [4, 42, 82] is commonly used to refine cameras and
scene geometry. Sparse reconstruction based on matching
points using reprojection errors. Dusmanu et al. [24] pro-
local image features [10, 21, 23, 34, 51, 57, 59, 65] is the
posed to refine keypoint locations prior to SfM via an analo-
most common due to its scalability and its robustness to
gous geometric cost constrained with local optical flow. This
appearance changes introduced by varying devices, view-
can improve SfM, but has limited accuracy and scalability.
points, and temporal conditions found in crowdsourced sce-
narios [2, 29, 35, 41, 47, 50, 58]. In this work, we argue that local image information is
SfM assumes that sparse interest points [10, 21, 23, 34, 51, valuable throughout the SfM process to improve its accu-
59, 62, 84, 92] can be reliably detected across views. It typi- racy. We adjust both keypoints and bundles, before and after
reconstruction, by direct image alignment [18, 26, 52] in a
* indicates equal contributions learned feature space. Exploiting this locally-dense informa-
1
Sparse feature matching SfM result Differently, dense matching [13, 49, 61, 74, 77, 81, 83]
considers all pixels in each image, resulting in denser and
raw more accurate correspondences. It has been successful for
refined
constrained settings like optical flow [40, 76] or stereo depth
direct optimization estimation [90], but is not suitable for large-scale SfM due
to its high computational cost due to many redundant corre-
CNN
spondences. Several recent works [46, 60, 78, 96] improve
Featuremetric the matching efficiency by first matching coarsely and subse-
keypoint Featuremetric
Dense features adjustment bundle adjustment quently refining correspondences using a local search. This
is however limited to image pairs and thus cannot create
Figure 2: Refinement pipeline. Our refinement works on point tracks required by SfM.
top of any SfM pipeline that is based on local features. We Our work combines the best of both paradigms by leverag-
perform a two-stage adjustment of keypoints and bundles. ing dense local information to refine sparse observations. It
The approach first refines the 2D keypoints only from tenta- is inherently amenable to SfM as it can optimize all locations
tive matches by optimizing a direct cost over dense feature over multiple views in a track simultaneously.
maps. The second stage operates after SfM and refines 3D
Subpixel estimation is a well-studied problem in correspon-
points and poses with a similar featuremetric cost.
dence search. Common approaches either upsample the input
images or fit polynomials or Gaussian distributions to local
tion is significantly more accurate than geometric optimiza- image neighborhoods [28,36,39,51,69]. With the widespread
tion, while deep, high-dimensional features extracted by a interest in CNNs for local features, solutions tailored to 2D
CNN ensure wider convergence in challenging conditions. heatmaps have been recently developed, such as learning fine
This formulation elegantly combines globally-discriminative local sub-heatmaps [38] or estimating subpixel corrections
sparse matching with locally-accurate dense details. It is with regression [14, 80] or the soft-argmax [55, 93]. Cleaner
applicable to both incremental [70, 75] and global [9, 12, 54] heatmaps can also arise from aggregating predictions over
SfM irrespective of the types of sparse or dense features. multiple virtual views using data augmentation [21].
We validate our approach in experiments evaluating the Detections or local affine frames can be combined across
accuracy of both 3D structure and camera poses in vari- multiple views with known poses in a least-squares geo-
ous conditions. We demonstrate drastic improvements for metric optimization [25, 82]. Dusmanu et al. [24] instead
multiple hand-crafted and learned local features using off- refine keypoints solely based on tentative matches, without
the-shelf CNNs. The resulting system produces accurate assuming known geometry. This geometric formulation ex-
reconstructions and scales well to large scenes with thou- hibits remarkable robustness, but is based on a local optical
sands of images. In the context of visual localization, it can, flow whose estimation for each correspondence is expen-
in addition to providing a more accurate map, also refine sive and approximate. We unify both keypoint and bundle
poses of single query images with minimal overhead. optimizations into a joint framework that optimizes a fea-
For the benefit of the research community, we will release turemetric cost, resulting in more accurate geometries and a
our code as an extension to COLMAP [70, 71] and to the more efficient keypoint refinement.
popular localization toolbox hloc [63, 64]. We believe that
our featuremetric refinement can significantly improve the Direct alignment optimizes differences in pixel intensities
accuracy of existing datasets [67] and push the community by implicitly defining correspondences through the motion
towards sub-pixel accurate localization at large scale. and geometry. It therefore does not suffer from geometric
noise and is naturally subpixel accurate via image interpola-
2. Related work tion. Direct photometric optimization has been successfully
applied to optical flow [8,52], visual odometry [18,26,27,44],
Image matching is at the core of SfM and visual SLAM, SLAM [5, 72], multi-view stereo (MVS) [19, 22, 91], and
which typically rely on sparse local features for their effi- pose refinement [73]. It generally fails for moderate displace-
ciency and robustness. The process i) detects a small num- ments or appearances changes, and is thus not suitable for
ber of interest points, ii) computes their visual descriptors, large-baseline SfM. One notable work by Woodford & Ros-
iii) matches them with a nearest neighbor search, and iv) ver- ten [88] refines dense SfM+MVS models with a robust image
ifies the matches with two-view epipolar estimation and normalization. It focuses on dense mapping with accurate
RANSAC. The correspondences then serve for relative or initial poses and moderate appearance changes. Georgel et
absolute pose estimation and 3D triangulation. As keypoints al. [30] instead estimate more accurate relative poses by el-
are sparse, small inaccuracies in their locations can result in egantly combining photometric and geometric costs. They
large errors for the estimated geometric quantities. show that dense information can improve sparse estimation
2
but their approach ignores appearance changes. Differently, quires many observations to reduce the geometric noise.
our work improves the entire SfM pipeline starting with Operating on an existing reconstruction, it cannot recover
tentative matches and addresses larger, challenging changes. observations arising from noisy keypoints that are matched
To improve on the weaknesses of photometric optimiza- correctly but discarded by the geometric verification.
tion, numerous recent works align multi-dimensional image Track refinement: To improve the accuracy of the key-
representations. Examples of this featuremetric optimization points prior to any geometric 3D estimation, Dusmanu et
include frame tracking with handcrafted [6, 56] or learned al. [24] optimize their locations over tentative tracks formed
descriptors [17,53,86,87,89], optical flow [7,11], MVS [94], by raw, unverified matches. They exploit the inherent struc-
and dense SfM in small scenes [79]. Closer to our work, ture of the matching graph to discard incorrect matches with-
PixLoc [66] learns deep features with a large basin of con- out relying on geometric constraints. Given two-view dense
vergence for wide-baseline pose refinement. It improves the flow fields {Tv→u } between the neighborhoods of matching
accuracy of sparse matching but is designed for single im- keypoints u and v, this keypoint adjustment optimizes, for
ages and disregards the scalability to multiple images or large each tentative track j, the multi-view cost
scenes. Here we extend this paradigm to other steps of SfM X
j
and propose an efficient algorithm that scales to thousands EKA = kpv + Tv→u [pv ] − pu kγ , (2)
of images. We show that learning task-specific wide-context (u,v)∈M(j)
features is not necessary and demonstrate highly accurate
refinements with off-the-shelf features. where M(i) denotes the set of matches that forms the track
In conclusion, our work is the first to apply robust feature- and [·] is a lookup with subpixel interpolation. A deep neu-
metric optimization to a large-scale sparse reconstruction ral network is trained to regress the flow of a single point
problem and show significant benefits for visual localization. from two input patches and the flow field is interpolated
from a sparse grid. This dramatically improves the keypoint
3. Background accuracy, but some errors remain as the regression and inter-
polation are only approximate.
Given N images {Ii } observing a scene, we are inter- Both bundle and keypoint adjustments are based on ge-
ested in accurately estimating its 3D structure, represented as ometric observations, namely keypoint locations and flow,
sparse points {Pj ∈ R3 }, intrinsic parameters {Ci } of the but do not account for their respective uncertainties. They
cameras, and the poses {(Ri , ti ) ∈ SE(3)} of the images, thus require a large number of observations to average out
represented as rotation matrices and translation vectors. the geometric noise and their accuracy is in practice limited.
A typical SfM pipeline performs geometric estimation
from correspondences between sparse 2D keypoints {pu } 4. Approach
observing the same 3D point from different views, collec-
tively called a track. Association between observations is Summarizing dense image information into sparse points
based on matching local image descriptors {du ∈ RD }, is necessary to perform global data association and optimiza-
but the estimated geometry relies solely on the location of tion at scale. However, refining geometry is an inherently
the keypoints, whose accuracy is thus critical. Keypoints local operation, which, we show, can efficiently benefit from
are detected from local image information for each image locally-dense pixels. Given constraints provided by coarse
individually, without considering multiple views simultane- but global correspondences or initial 3D geometry, the dense
ously. Subsequent steps of the pipeline discover additional information only needs to be locally accurate and invariant
information about the scene, such as its its geometry or its but not globally discriminative. While SfM typically discards
multi-view appearance. Two approaches leverage this infor- image information as early as possible, we instead exploit
mation to reduce the detection noise and refine the keypoints. it in several steps of the process thanks to direct alignment.
Leveraging the power of deep features, this translates into
Global refinement: Bundle adjustment [82] is the gold stan- featuremetric keypoint and bundle adjustments that elegantly
dard for refining structure and poses given initial estimates. integrate into any SfM pipeline by replacing their geometric
It minimizes the total geometric error counterparts. Figure 2 shows an overview.
X X We first introduce the featuremetric optimization in Sec-
EBA = kΠ (Ri Pj + ti , Ci ) − pu kγ , (1) tion 4.1. We then describe our formulations of keypoint
j (i,u)∈T (j)
adjustment, in Section 4.2, and bundle adjustment, in Sec-
tion 4.3, and analyze their efficiency.
where T (j) is the set the images and keypoints in track j,
Π(·) projects to the image plane, and k·kγ is a robust 4.1. Featuremetric optimization
norm [33]. This formulation implicitly refines the keypoints
while ensuring their geometric consistency. It however ig- Direct alignment: We consider the error between image
nores the uncertainty of the initial detections and thus re- intensities at two sparse observations: r = Ii [pu ] − Ij [pv ].
3
Local image derivatives implicitly define a flow from one This allows the optimization to split tracks connected by
point to the other through a gradient descent update: weak correspondences, providing robustness to mismatches.
∂Ij The confidence is not based on the dense features since these
Tv→u [pv ] ∝ − [p ]> r . (3) are not expected to disambiguate correspondences at the
∂p v
global image level.
This flow can be efficiently computed at any location in a
neighborhood around v, without approximate interpolation Efficiency : This direct formulation simply compares pre-
nor descriptor matching. It naturally emerges from the direct computed features on sparse points and is thus much more
optimization of the photometric error, which can be mini- scalable than patch flow regression (Eq. 2), which performs
mized with second-order methods in the same way as the a dense local correlation for each correspondence. All tracks
aforementioned geometric costs. Unlike the flow regressed are optimized independently, which is very fast in practice
from a black-box neural network [24], this flow can be made despite the sheer number of tentative matches.
consistent across multiple view by jointly optimizing the Drift: Because of the lack of geometric constraints, the
cost over all pairs of observations. points are free to move anywhere on the underlying 3D sur-
Learned representation: SfM can handle image collec- face of the scene. The featuremetric cost biases the updates
tions with unconstrained viewing conditions exhibiting large towards areas with low spatial feature gradients and with
changes in terms of illumination, resolution, or camera mod- better-defined features. This can result in a large drift if not
els. The image representation used should be robust to such accounted for. Keypoints should however remain repeatable
changes and ensure an accurate refinement in any condition. w.r.t. unrefined detections to ensure the matchability of new
We thus turn to features computed by deep CNNs, which images, such as for visual localization. It is thus critical to
can exhibit high invariance by capturing a large context, yet limit the drift, while allowing the refinement of noisier key-
retain fine local details. For each image Ii , we compute a D- points. For each track, we freeze the location of the keypoint
dimensional, L2-normalized feature map Fi ∈ RW ×H×D ū with highest connectivity, as in [24], and constrain the
at identical resolution. We use the same representations for location pu of each keypoint w.r.t. to its initial detection p0u ,
keypoint and bundle adjustments, requiring a single forward such that pu − p0u ≤ K.
pass per image. Our experiments show that multiple off-the- Once all tracks are refined, the geometric estimation pro-
shelf dense local descriptors can result in highly accurate ceeds, typically using two-view epipolar geometric verifica-
refinements. However, our formulation can also be applied tion followed by incremental or global SfM.
to robust intensity representations, such as the normalized 4.3. Bundle adjustment
cross-correlation (NCC) over local image patches [88].
The estimated structure and motion can then be refined
4.2. Keypoint adjustment with a similar featuremetric cost. Here keypoints are implic-
Once local features are detected, described, and matched, itly defined by the projections of the 3D points into the 2D
we refine the keypoint locations before geometrically verify- image planes, and only poses and 3D points are optimized.
ing the tentative matches. Objective: We minimize for each track j the error between
Track separation: Connected components in the match- its observations and a reference appearance f j :
ing graph define tentative tracks – sets of keypoints that X X
are likely to observe the same 3D point, but whose obser- EFBA = Fi [Π (Ri Pj + ti , Ci )] − f j γ .
j (i,u)∈T (j)
vations have not yet been geometrically verified. Because
(5)
a 3D point has a single projection on a given image plane,
The reference is selected at the beginning of the optimization
valid tracks cannot contain multiple keypoints detected in
and kept fixed from then on. This reduces the drift of the
the same image. We can leverage this property to efficiently
points significantly, as also noted in [5], but is more flexible
prune out most incorrect matches using the track separation
than the common ray-based parametrization [26, 44, 88].
algorithm introduced in [24]. This speeds up the subsequent
The reference is defined as the observation closest to the
optimization and reduces the noise in the estimation.
robust mean µ over all initial observations f ju of the track:
Objective: We then adjust the locations of 2D keypoints
belonging to the same track j by optimizing its featuremetric f j = argmin µj − f (6)
consistency along tentative matches with the cost f ∈{f ju }
X
with µj = argmin kf − µkγ . (7)
j
X
EFKA = wuv Fi(u) [pu ] − Fi(v) [pv ] γ , (4) µ∈RD
(u,v)∈M(j) f ∈{f ju }
where wuv is the confidence of the correspondence (u, v), This ensures robustness to outlier observations and accounts
such as the similarity of its local feature descriptors d>
u dv . for the unknown topology of the feature space.
4
ETH3D indoor ETH3D outdoor SuperPoint - raw
SfM features
Accuracy (%) Completeness (%) Accuracy (%) Completeness (%)
ë Refinement
1cm 2cm 5cm 1cm 2cm 5cm 1cm 2cm 5cm 1cm 2cm 5cm
SIFT [51] 75.62 85.04 92.45 0.21 0.87 3.61 57.64 71.92 85.23 0.06 0.34 2.45
ë Patch Flow 80.99 89.06 95.06 0.24 0.97 3.88 64.79 78.90 90.04 0.08 0.41 2.76
ë ours 82.82 89.77 94.77 0.25 0.96 3.75 68.43 80.73 91.28 0.08 0.42 2.75
SuperPoint [21] 75.76 85.61 93.38 0.59 2.21 8.89 50.45 65.07 80.26 0.10 0.55 3.92
ë Patch Flow 85.77 91.57 95.85 0.72 2.51 9.59 64.94 77.65 88.86 0.15 0.77 4.93 SuperPoint - refined
ë ours 89.33 93.58 96.58 0.74 2.53 9.51 71.27 82.58 92.08 0.16 0.83 5.06
D2-Net [23] 47.18 64.94 83.37 0.47 1.87 7.07 20.87 34.55 56.53 0.03 0.19 1.78
ë Patch Flow 79.10 86.64 93.26 1.45 4.53 12.95 57.34 70.71 84.12 0.21 1.06 6.02
ë ours 82.49 88.83 94.35 1.36 4.13 11.80 65.71 77.95 89.22 0.21 1.01 5.63
R2D2 [59] 66.30 79.21 90.00 0.53 2.06 8.62 49.32 66.10 83.10 0.11 0.55 3.63
ë Patch Flow 77.94 85.82 92.48 0.66 2.32 9.07 64.14 78.10 90.18 0.16 0.71 4.09
ë ours 80.67 87.61 93.42 0.67 2.31 8.95 67.77 80.85 91.91 0.16 0.73 4.09 correct/incorrect @ 1cm
Table 1: 3D sparse triangulation. Our refinement yields significantly more accurate and complete point clouds than the
common geometric SfM pipeline. It is more effective than the existing Patch Flow [24], especially at 1cm or with SIFT.
Efficiency: Compared to the keypoint adjustment (Eq. 4), is computed with iteratively reweighted least squares [37].
using a reference feature reduces the number of residuals Simultaneously storing all high-dimensional feature
from O(N 2 ) to O(N ). On the other hand, all tracks need to patches incurs high memory requirements during BA. We
be updated simultaneously because of the interdependency dramatically increase its efficiency by exhaustively precom-
caused by the camera poses. To accelerate the convergence, puting patches of feature distances and directly
interpolate
we form a reduced camera system based on the Schur com- an approximate cost Ēij = Fi − f j γ pij . To improve
plement and use embedded point iterations [42]. The refine- the convergence, we store and optimize its spatial derivatives
ment generally converges within a few camera updates. ∂ Ēij/∂p . This reduces the residual size from D to 3 with
ij
no loss of accuracy. See Appendix C for more details.

4.4. Implementation
Run time and memory: S2DNet can extract 3-5 dense fea-
Dense extractor: Our refinement can work with any off- ture maps per second and both featuremetric adjustments run
the-shelf CNN that produces feature maps that are locally in less than 5 minutes for 100 images. As these features are
discriminative. These should be of the same resolution as 128-dimensional, the memory consumption can be a bottle-
the input (stride 1) to enable subpixel accuracy. The radius neck. We believe that much fewer dimensions are actually
of convergence, or context, of such features depends on the required for refinement, and retraining a compact feature
amount of noise in the keypoints. Most detectors like SIFT extractor would improve the efficiency of the optimization.
have at most a few pixels of error, while others like D2-Net
exhibit a much larger detection noise. In our experiments, we 5. Experiments
use S2DNet [31] for dense feature extraction, as it computes
fine features very efficiently in only 4 convolutions, but also We evaluate our featuremetric refinement on various SfM
produce, if required, deeper features with a larger context. tasks with several handcrafted and learned local features
These can then be combined into a multi-level optimization and show substantial improvements for all of them. We first
scheme [26, 66, 86] that sequentially refines based on coarse evaluate its accuracy on the tasks of triangulation and camera
to fine features. The convergence can thus be adjusted de- pose estimation in Sections 5.1 and 5.2, respectively. We
pending on the detector and on the image resolution. We then assess in Section 5.3 the impact of the refinement on
show in Section 5.4 that other dense features work well too. two-view and multi-view pose estimation for end-to-end
reconstruction in challenging conditions. Lastly, Section 5.4
Optimization: The optimization problems of both key- analyzes the validity and scalability of our design decisions
point and bundle adjustments are solved with the Levenberg- through an ablation study.
Marquardt [45] algorithm implemented using Ceres [3].
Feature maps are stored as collections of 16×16 patches 5.1. 3D triangulation
centered around the initial keypoint detections. We thus
constrain points to move at most K= 8 pixels. The feature We first evaluate the accuracy of the refined 3D structure
lookup is implemented as bicubic interpolation. We use the given known camera poses and intrinsics.
Cauchy loss γ with a scale of 0.25. The robust mean in Eq. 7 Evaluation: We use the ETH3D benchmark [73], which
5
Recall [%]
100 SfM features AUC (%) SfM features Task 1: Stereo Task 2: Multiview
(# keypoints) AUC@K◦ AUC@5◦ @N
ë Refinement 1mm 1cm 10cm ë Refinement
80 • SIFT 16.92 56.08 81.65
5◦ 10◦ 5 10 25
ë Patch Flow 14.62 52.69 81.69 SuperPoint+SuperGlue (2k) 58.78 71.01 63.02 77.36 86.76
60 ë ours 25.38 60.22 84.07 ë ours 65.89 76.51 68.87 82.09 89.73
• SuperPoint 15.38 51.20 82.33 SIFT (2k) 38.09 48.05 25.12 50.82 77.28
ë Patch Flow 28.46 63.99 86.79 ë ours 40.59 50.87 28.01 53.59 79.49
40 ë ours 40.00 71.97 86.86
D2-Net (4k) 16.83 22.40 16.52 33.07 49.35
• D2-Net 1.54 12.16 56.10 ë ours 25.89 33.32 21.33 40.69 57.93
20 ë Patch Flow 16.92 54.70 75.16
ë ours 17.69 55.03 76.26 Table 3: End-to-end SfM. The proposed refinement im-
0 100 101 102
• R2D2 11.53 52.88 82.69 proves the accuracy of poses estimated by epipolar geometry
mm ë Patch Flow
ë ours
25.38 61.42 84.14
27.69 63.86 86.13
(stereo) or a complete SfM pipeline (multiview) with crowd-
raw refined sourced imagery. Improvements are substantial for both stan-
dard (SIFT) and recent (SuperGlue) matching configurations,
Table 2: Camera pose estimation. We plot the cumulative especially when few images N observe the scene.
translation error and report its AUC. Our refinement im-
proves the accuracy of the query camera poses for all local
features, even when for SIFT, whose detections are already 5.2. Camera pose estimation
well-localized. It is generally more accurate than Patch Flow. We now evaluate the impact of our refinement on the task
of camera pose estimation from a single image.
is composed of 13 indoor and outdoor scenes and provides Evaluation: We again follow the setup of [24] based on the
images with millimeter-accurate camera poses and highly- ETH3D benchmark. For each scene, 10 images are randomly
accurate ground truth dense reconstructions obtained with selected as queries. For each of them, the remaining images,
a laser scanner. We follow the protocol introduced in [24], excluding the 2 most covisible ones, are used to triangu-
in which a sparse 3D model is triangulated for each scene late a sparse 3D partial model. Each query is then matched
using COLMAP [70] with fixed camera poses and intrinsics. against its corresponding partial model and the resulting
Following the original benchmark setup, we report the ac- 2D-3D matches serve to estimate its absolute pose using
curacy and completeness of the reconstruction, in %, as the LO-RANSAC+PnP [15] followed by geometric refinement.
ratio of triangulated and ground-truth dense points that are We compare the 130 estimated query poses to their ground
within a given distance of each other. truth and report the area under the cumulative translation
error curve (AUC) up to 1mm, 1cm, and 10cm.
Baselines: We evaluate our featuremetric refinement with
the hand-crafted local features SIFT [51] and the learned Baselines: Patch Flow performs multi-view optimization
ones SuperPoint [21], D2-Net [23], and R2D2 [59], using the over each partial model independently as well as over the
associated publicly available code repositories. We compare matches between each query and its partial model. Simi-
our approach to the geometric optimization of [24], referred larly, we first refine each partial model as in Section 5.1. We
here as Patch Flow. We re-compute the numbers provided in then adjust the query keypoints using its tentative matches,
the original paper using the code provided by the authors. estimate an initial pose, and refine it with featuremetric BA.
Results: Table 1 shows that our approach results in sig- Results: The AUC and its cumulative plot are shown in
nificantly more accurate and complete 3D reconstructions Table 2. Our refinement substantially improves the local-
compared to the traditional geometric SfM. It is more ac- ization accuracy for all local features, including SIFT, for
curate than Patch Flow, especially at the strict threshold of which Patch Flow does not show any benefit. At all error
1cm, and exhibits similar completeness. The improvements thresholds, featuremetric optimization is consistently more
are consistent across all local features, both indoors and accurate than its geometric counterparts. The accuracy of
outdoors. The gap with Patch Flow is especially large for SuperPoint is raised far higher than other detectors, despite
SIFT, which already detects well-localized keypoints. This the high sparsity of the 3D models that it produces. This
confirms that our featuremetric optimization better captures shows how more accurate keypoint detections can result in
low-level image information and yields a finer alignment. much more accurate visual localization.
Patch Flow is more complete for larger thresholds as it partly
5.3. End-to-end Structure-from-Motion
solves a different problem by increasing the keypoint re-
peatability with its large receptive field, while we focus on While the previous experiments precisely quantify the ac-
their localization. curacy of the refinement, they do not contain any variations
6
of appearance or camera models. We thus turn to crowd- SuperPoint Acc. (%) Compl. (%) track AUC
sourced imagery and evaluate the benefits of our featuremet- ë Refinement 1cm 2cm 1cm 2cm length 1cm
ric optimization in an end-to-end reconstruction pipeline.
unrefined 18.42 32.23 0.06 0.49 4.17 51.20
KA vs. BA
Evaluation: We use the data, protocol, and code of the ë Patch Flow [24] 37.00 55.18 0.15 0.93 5.24 63.53
ë F-KA 36.85 54.48 0.15 0.90 5.02 69.84
2020 Image Matching Challenge [1, 43]. It is based on large
ë F-BA 43.65 62.44 0.18 1.06 4.17 67.61
collections of crowd-sourced images depicting popular land- ë F-KA+BA (full) 46.46 65.41 0.19 1.14 5.02 71.97
marks around the world. Pseudo ground truth poses are ob-
w/ F-BA drift 47.93 66.52 0.20 1.17 5.02 64.51
bonus
tained with SfM [70] and used for two tasks. The stereo Patch Flow + F-BA 46.30 65.22 0.19 1.13 5.24 -
task evaluates relative poses estimated from image pairs by higher resolution 47.67 65.39 0.21 1.21 5.12 -
decomposing their epipolar geometry. This is a critical step
dense feats
photometric BA [88] 28.43 45.87 0.11 0.72 4.17 -
of global SfM as it initializes its global optimization. The VGG-16 ImageNet 36.86 54.99 0.15 0.90 4.61 -
multiview task runs incremental SfM for small subsets of DSIFT [49] 38.78 56.46 0.16 0.96 4.73 -
PixLoc [66] 29.49 46.60 0.12 0.74 4.48 -
images, making the SfM problem much harder, and eval-
uates the final relative poses within each subset. For each
Table 4: Ablation study on ETH3D. i) Featuremetric key-
task, we report the AUC of the pose error at the threshold
point and bundle adjustments (KA and BA) both largely
of 5◦ , where the pose error is the maximum of the angular
improve the triangulation and localization accuracy. Patch
errors in rotation and translation. As the evaluation server
Flow produces a longer track length because of its larger
accepts at most correspondences, we cannot evaluate our
receptive field but is less accurate. ii) Letting the BA drift by
method using the test data. We instead test on a subset of the
updating reference features or increasing the image resolu-
publicly available validation scenes, and tune the RANSAC
tion both improve the triangulation, at the expense of poorer
and matching parameters on the remaining scenes. More
localization and increased run time, respectively. iii) Dif-
details on this setup are provided in the Appendix.
ferent image representations are better than the unrefined
Baselines: We evaluate our refinement in combination with detections but S2DNet (our default) works best.
SIFT [51], D2-Net [23], and SuperPoint+SuperGlue [21, 65].
100k 53560s
We limit the number of detected keypoints to 2k for computa-
Ours
tional reasons, but increase this number to 4k for D2-Net as Patch Flow
it otherwise performs poorly. In the stereo task, we adjust the 10k 5458s
run time [s]
keypoints using the entire exhaustive tentative match graph

(4950 pairs per scene). We use LO-DEGENSAC [15, 16] 1018s
1000
for match verification, the ratio test for SIFT, and the mu- 215s
162s
tual check for SIFT and D2-Net. In the multiview task, we
100
adjust keypoints for each subset independently, considering
only the matches between images in the subset, and run our 14s
bundle adjustment after SfM. 10
10 images 100 images 1000 images
Results: Table 3 summarizes the results. For stereo, our 3.9K points 30K points 216K points
featuremetric keypoint adjustment significantly improves Relative run times for 1000 images:
the accuracy of the two-view epipolar geometries across
all local features and despite the challenging conditions. In Features F-KA F-BA
multiview setting, it also improves the accuracy of the SfM

0% 20% 40% 60% 80% 100%
poses, especially for small sets of images. Featuremetric
optimization is particularly effective in this situation, as Figure 3: Run-times. We show the duration, in logarithmic
geometric optimization cannot fully suppress the detection scale, of the refinement for varying numbers of images. Our
noise due to the small number of observations. We visualize refinement is more than ten times faster than Patch Flow [24],
tracks of a 5-image reconstruction in Figure 4 and highlight whose run-time is dominated by the computation of the
the accuracy of the refined SfM model. pairwise flow, which scales quadratically. Thanks to our
precomputed cost patches, the featuremetric BA is fast. The
5.4. Additional insights KA amounts for the majority of the refinement time.
Ablation study: Table 4 shows the performance of sev-
eral variants of our featuremetric optimization on ETH3D NCC-normalized intensity patches with fronto-parallel warp-
in terms of triangulation (scene Facade only) and localiza- ing. Our final configuration, based on on the dense features
tion (all scenes). We compare both types of adjustments, of S2DNet [31], performs best across all metrics. We will
minor tweaks, and different image representations, including now show that it is also fairly efficient.
7
Figure 4: Refined SfM tracks. We show patches centered around reprojections of 3x 3D points observed in 4 images of the St.
Peter’s Square scene. Deep features and their correlation maps with a reference are robust to scale or illumination changes, yet
preserve local details required for fine alignment. Points refined with our approach (in green) are consistent across multiple
views while those of a standard SfM pipeline (in red) are misaligned because the initial keypoint detections (in blue) are noisy.
Scalability: We run SfM on subsets of images of the Aachen spondences. Through extensive experiments we show that
Day-Night dataset [67, 68, 95]. Figure 3 shows the run times this results in more accurate camera poses and structure; in
of the refinement for subsets of 10, 100 and 1000 images. challenging conditions and for different local features.
The featuremetric refinement is an order of magnitude faster While we optimize against dense feature maps, we keep
than Patch-Flow [24]. Precomputing distance maps reduces the sparse scene representation of SfM. This ensures not
the peak memory requirement of the bundle adjustment from only that the approach is scalable but also that the result-
80 GB to less than 10GB for 1000 images. As storing feature ing 3D model is compatible with downstream applications,
maps only requires 50 GB of disk space, this refinement e.g. mapping for visual localization. Since our refinement
can easily run on a desktop PC. We thus refined the entire works well even with few observations, as it does not need to
Aachen Day-Night v1.1 model, composed of 7k images, average out the keypoint detection noise, it has the potential
in less than 2 hours. Scene partitioning [70] could further to achieve more accurate results using fewer images.
reduce the peak memory. See Appendix D for more details.
We thus believe that our approach can have a large impact
in the localization community as it can improve the accuracy
6. Conclusion of the ground truth poses of standard benchmark datasets, of
which many are currently saturated. Since this refinement is
In this paper we argue that the recipe for accurate large- less sensitive to under-sampling, it enables benchmarking
scale Structure-from-Motion is to perform an initial coarse for crowd-sourced scenarios beyond densely-photographed
estimation using sparse local features, which are by neces- tourism landmarks.
sity globally-discriminative, followed by a refinement using
Acknowledgements: The authors thank Mihai Dusmanu, Rémi Pautrat,
locally-accurate dense features. Since the dense feature only Marcel Geppert, and the anonymous reviewers for their thoughtful com-
need to be locally-discriminative, they can afford to capture ments. Paul-Edouard Sarlin was supported by gift funding from Huawei,
much lower-level texture, leading to more accurate corre- and Viktor Larsson by an ETH Zurich Postdoctoral Fellowship.
8
Appendix 3D triangulation error
raw
A. Additional results on ETH3D refined
A.1. Triangulation 1m
We refine the triangulation of SuperPoint [21] keypoints

for the ETH3D Courtyard scene and show in Figure 5 the
distribution of triangulation errors for points observed by 10cm
different numbers of images (track length). Our featuremetric
refinement provides the largest improvement for points with
low track length, for which the estimates of the traditional
1cm
geometric BA are dominated by the noise of the keypoint
detection. For larger track lengths, the refined point cloud
has an accuracy close to the Faro Focus X 330 laser scanner Lidar
from which the ground truth is computed. 1mm
We show in Figure 10 the raw and refined point clouds 2 3 4 5 6 7 8+
for SuperPoint and D2-Net. The benefits of our refinement track length
are easily visible in 3D. Planar walls exhibit fewer noisy
keypoints and the refined point clouds are more complete. Figure 5: Triangulation errors vs. track length. The ini-
tial, unrefined output, based on geometric BA, exhibits high
A.2. Camera pose estimation errors for 3D points that are observed by few images (low
track length). Our refinement significantly reduces these er-
We analyze in Table 5 how the different kinds of adjust- rors and brings the accuracy of the sparse point cloud close
ments impact the accuracy of camera localization. The full to the ground truth acquired by Lidar (2mm accuracy).
method presented in the main paper first refines the 3D SfM
model with featuremetric keypoint and bundle adjustments. KA BA qKA qBA AUC (%)
SuperPoint
It then refines each keypoint in the query image using its ë Refinement 1mm 1cm 10cm
tentative 2D-3D correspondences by minimizing the feature-
metric error between its observation in the query and the unrefined 15.38 51.20 82.33
ë refined X 16.15 53.34 82.49
most similar observation of the respective 3D points. Re- ë refined X X 16.92 54.71 84.08
fining the query keypoints before RANSAC increases the ë refined X X X 38.46 70.44 85.28
number of inlier matches and stabilizes the pose estimation ë refined (full) X X X X 40.00 71.97 86.86
in challenging scenarios where few 3D points are matches. ë Patch Flow X X 28.46 63.04 86.65
Once an initial pose is estimated with PnP+RANSAC,
we refine it via a small featuremetric bundle adjustment Table 5: Ablation study for pose estimation. The accuracy
over the inlier correspondences. This optimizes each query of the camera pose is improved by refining the map (KA
keypoint against the closest descriptor within the matched and BA) and by refining the query keypoints before (qKA)
track. As opposed to refining each query keypoint against and after (qBA) pose estimation. The largest improvement is
all observations in the matched track, this has the benefit of brought by qKA. It increases the number of inlier matches
scaling linearly in the number of query keypoints and yields and the likelihood of finding the correct pose with RANSAC.
a similar accuracy.
B. Impact of various parameters various patch sizes. Smaller 10×10 patches achieve suffi-
cient accuracys and require significantly less memory.
B.1. Patch size
B.2. Image resolution
Figure 6 shows how much our refinement displaces the
detected keypoints during the triangulation of SuperPoint on The image resolution at which the dense features are ex-
Courtyard using dense features extracted from 1600x1066- tracted has a large impact on the accuracy of the refinement.
pixel images. When using full feature maps without any In Figure 8 we quantify in the impact on both triangula-
constraints in keypoint adjustment, most points are moved tion accuracy and run time for the ETH3D Courtyard scene
by more than 1 pixel, but most often by less than 8 pixels. (38 images). The accuracy drops significantly when the res-
This confirms that storing the feature maps as 16×16 patches olution is smaller than 1600×1066px, which amounts to
is sufficient and rather conservative. 25% of the full image resolution. Doubling the resolution
We show in Figure 7 the accuracy of the triangulation for to 3200×2132px yields noticeable improvements, albeit sig-
9
Distribution of 2D point movement 90 40
100
80 80 30
cumulative distr. [%]
Accuracy [%]
Time [s]
60
70 20
40
20 60 Accuracy @ 2cm 10
Accuracy @ 1cm
0 Feature extraction time
2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 50 0
1 10 400 800 1200 1600 2000 2400 2800 3200
point movement [px] Image Size (Max Edge Length)
Figure 6: Distribution of point movements. We show the Figure 8: Impact of the image resolution. Increasing the
cumulative distribution of the distance traveled by the 2D image resolution increases the accuracy, but at the cost of
keypoints during the featuremetric refinement of SuperPoint longer feature extraction time and higher VRAM require-
with KA and BA. 60% of the points move by fewer than 2 ments. For all experiments on ETH3D, we used a maximum
pixels and 99% remain within 8 pixels of the initial detec- edge length of 1600px, which is very close to saturating the
tions. accuracy while providing low run times.
90 20 80 8
Accuracy [%] 75 6
Time [s]
80 15 70 4
Memory [GB]
Accuracy [%]
65 Accuracy @ 1cm 2
F-BA run time
70 10 60 0
16 32 64 128
Feature dimensionality
60 Accuracy @ 2cm 5
Figure 9: Impact of the feature dimensionality. Dense fea-
Accuracy @ 1cm
RAM tures computed by S2DNet can be naively reduced to acceler-
50 0 ate the featuremetric bundle adjustment by 2 while incurring
4 8 12 16 20 24 28 32
Patch Size only a minor drop of triangulation accuracy.
Figure 7: Impact of the patch size. Smaller patches for

each observation significantly reduce memory requirements Retaining pairwise constraints however allows the opti-
but can impair the accuracy of the refinement. Patches of mization to separate tracks that were incorrectly merged by
size 10×10 offer a good trade-off with high accuracy and the track separation algorithm. This is not necessary in the
moderate memory consumption. bundle adjustment, as tracks are already filtered by the robust
geometric estimation and can thus be assumed to be correct,
but is common for unverified track. We evaluate the impact
nificantly increases the extraction time and the consumption of the reference selection in the keypoint adjustment and re-
of GPU VRAM. As a reference, extracting only fine-level port the results in Table 6. For both SuperPoint and D2-Net,
S2DNet features (4 convolutions) from 3200×2132px im- using the feature center results in lower completeness and
ages requires around 10GB of GPU VRAM. accuracy than the topological center. It also results in a lower
track length, which confirms that the topological reference
B.3. Reference selection for keypoint adjustment allows to retain incorrectly-merged tracks. Since the feature
Selecting some observations as references is necessary center still performs relatively well, it could be considered
to avoid the drift. In a given track, the keypoint adjustment in case of tighter computational constraints.
selects the point that is the most connected (topological cen- Furthermore, Table 6 highlights the importance of the
ter), while the bundle adjustment selects the point closest featuremetric keypoint adjustment. The benefits are larger
to the robust mean in feature space (feature center). Could for D2-Net, which detects very noisy keypoints. As a con-
we use the feature center for selecting the reference of the sequence, many correct albeit noisy matches are rejected by
keypoint adjustment? By minimizing the feature distance to the geometric verification. Our keypoint adjustment not only
this unique reference, we could reduce the number of resid- allows more points to be triangulated, thus increasing the
uals from quadratic (pairwise constraints) to linear (unary completeness of the model, but also increases the accuracy
constraints) and thus accelerate the optimization. of the triangulated points.
10
Triangulation Acc. (%) Compl. (%)track C. Cost map approximation
ë Refinement 1cm 2cm 1cm 2cm 5cm length
We mention in Section 4.4 that the memory efficiency of
unrefined 18.03 31.97 0.07 0.49 5.03 4.17 the bundle adjustment can be improved by precomputing the
SuperPoint
ë Patch Flow [24] 37.00 55.18 0.15 0.93 7.44 5.24

featuremetric cost. We provide here more details.
ë F-BA 43.65 62.44 0.18 1.06 7.70 4.17
ë +F-KA (feat-ref) 45.05 64.84 0.18 1.12 7.76 4.88 Description: Given D-dimension features, the featuremet-
ë +F-KA (topol-ref) 46.46 65.41 0.19 1.14 8.19 5.02 ric bundle adjustment (Eq. 5) involves residuals and Jaco-
unrefined 7.68 13.98 0.02 0.17 2.19 3.29 bian matrices of dimension D. Unlike the keypoint adjust-
D2-Net
ë Patch Flow [24] 34.64 52.36 0.16 1.00 8.10 4.99 ment, which can optimize tracks independently, all bundle
ë F-BA 39.30 58.59 0.15 0.94 6.99 3.29
ë +F-KA (feat-ref) 43.35 62.54 0.19 1.18 8.36 4.49 parameters are updated simultaneously and the memory re-
ë +F-KA (topol-ref) 44.21 64.22 0.20 1.20 8.72 4.63 quirements are thus prohibitive. Given the 2D reprojection
pij = Π (Ri Pj + ti , Ci ), this formulation loads in memory
Table 6: Additional ablation study on ETH3D Facade. the dense features Fi , interpolates them at pij , and compute
the residuals rij = Fi pij − f j for the cost Eij = krij kγ .

i) Featuremetric keypoint adjustment significantly improves
the completeness, especially for noisy keypoints as in D2- To reduce the memory footprint, we can exhaustively
Net. ii) Keypoint adjustment against the topological center precompute patches of feature distances and treat
them as
in each tentative track (topol-ref) improves the point cloud one-dimensional residuals r̄ij = Fi − f j pij . The cost
in accuracy and completeness over KA towards the robust then becomes Ēij = γ(r̄ij ). Such distances only need to be
feature center (feat-ref) because it allows to merge tracks. computed once since the reference f j is kept fixed through-
out the optimization. This precomputed cost reduces the
peak memory by a factor D, with often D=128. It is simi-
B.4. Number of feature levels lar to the Neural Reprojection Error recently introduced by
Using multiple feature levels enlarges the basin of con- Germain et al. [32] for camera localization.
vergence but increases the computational requirements. The Analysis: Swapping the distance computation and the sparse
radius of convergence that is required depends on the noise of interpolation introduces an approximation error. We first
the keypoint detector and on the resolution of the image from write the bilinear or bicubic interpolation as a sum over
which keypoints are detected. When performing detection features Fk on the discrete grid:
and refinement at identical image resolutions, the optimal X X
displacement is at most a few pixels for most keypoint de- F [p] = wk Fk with wk = 1 . (8)
tectors. In this case, the fine level of S2DNet feature maps k k
is sufficient. We empirically measured that its radius of con- We assume that the features are L2-normalized kFk k = 1,
vergence is approximately 3 pixels, although the multiview such that kF [p]k ≈ 1. For a squared loss function, the
constraints enable to refine over much larger distances. approximation error can then be written as:
We thus use a single feature level for all experiments 2 2
involving SIFT, SuperPoint, and R2D2. D2-Net require a dif- kF − f k [p] − kF[p] − f k
ferent treatment, as its detection noise is significantly larger. 2 1 XX 2
≈ 1 − kF[p]k = wk wl kFk − Fl k . (9)
This is partly due to the aggressive downsampling of its CNN 2
k l
backbone and to the low resolution of its output heatmap.
As a consequence, we employ both fine and medium feature This error is zero at points on the discrete grid and increases
levels for D2-Net. Both keypoint and bundle adjustments run with the roughness of the feature space. This approximation
the optimization successively at the coarser and finer levels. thus displaces the local minimum of the cost by at most 1
pixel but most often by much less.
B.5. Dimensionality of the features Improvement: This approximation however degrades the
Throughout this paper, we used 128-dimensional dense correctness of the approximate Hessian matrix that the
features extracted by S2DNet [31]. Relying on compact fea- Levenberg-Marquardt algorithm [45] relies on for fast con-
tures would easily reduce the memory footprint and the run vergence. We found that also optimizing the squared spatial
time of the refinement. To demonstrate these benefits, we derivatives of this cost significantly improves the conver-
show in Figure 9 the relationship between the dimension, the gence. This simply amounts to augmenting the scalar resid-
run time of the BA, and the triangulation accuracy when re- ual map with dense derivative maps:
 
taining only the first k channels of the S2DNet features. Fea- Fi − f j
tures with fewer dimensions yield a faster refinement. The  ∂ kFi −f j k 
accuracy drops moderately but we expect a smaller reduction r̃ij =  ∂x
 pij . (10)
∂ kFi −f j k
 
with features explicitly trained for smaller dimensions. ∂y
11
SuperPoint Acc. (%) Compl. (%) Time Memory sparse local and dense features are extracted at full image
ë Refinement 1cm 2cm 1cm 2cm (s) (GB) resolution, which is generally not larger than 1024px.
unrefined 64.27 76.47 0.37 1.44 - - D.3. Ablation study - Section 5.4
ë ours (exact) 81.31 88.50 0.47 1.74 42.22 7.3
ë ours (cost maps) 80.27 87.81 0.47 1.72 29.86 0.15 The triangulation metrics are reported for the ETH3D
scene Facade, which is the largest with 76 images. We use
Table 7: Triangulation with cost map approximations. SuperPoint local features as they perform best in all earlier
Using precomputed cost maps increase the efficiency of experiments and we store dense feature maps in every exper-
the bundle adjustment with a marginal loss of accuracy. iment. The localization AUC is measured over all 13 scenes
in ETH3D with 10 holdout images per scene. We now detail
the different baselines.
This improvement results in three-dimensional residuals,
Localization is achieved in “F-KA” by first refining the
which is still smaller than D when D=128. Using the spatial
keypoints, triangulating the map and finally performing
derivatives, we can also compute an exact, more accurate
query keypoint adjustment as described in section A.2. For
bicubic spline interpolation of the cost landscape.
localization with “F-BA”, we refined the triangulated model
Evaluation: We now show experimentally that this approx- using featuremetric bundle adjustment and then refined the
imation often does not, or only minimally, impairs the ac- pose from PnP+RANSAC using qBA.
curacy of the refinement. Table 7 reports the results of the In the entry “w/ F-BA drift”, we use the robust refer-
triangulation of SuperPoint features on the ETH3D dataset. ence (Eq. 7) to select the observation in each track which
The approximation reduces the accuracy by less than 1% is most similar to the robust reference as the source frame.
and does not alter the completeness. It however significantly The optimizer then minimizes the error between each other
reduces the memory consumption of the bundle adjustment, observation and the current, moving reference of the source
allowing it to scale to thousands of images. Note that all frame. Since only the index of the source frame is fixed
experiments in Sections 5.1, 5.2, and 5.3 do not use the during the optimization, this method does not account for
cost map approximation as the corresponding scenes are drift, which appears to yield higher accuracy but suffers from
relatively small. repeatability problems during localization.
The baseline “PatchFlow + F-BA” uses the keypoint re-
D. Experimental details finement from Dusmanu et al. [24] as initialization, and runs
our featuremetric bundle adjustment on top of it. We used the
D.1. ETH3D - Sections 5.1 and 5.2
exact same parameters for PatchFlow as presented in [24].
For the experiments on ETH3D, we use the evaluation The entry “higher resolution” corresponds to input images
code provided by Dusmanu et al. [24]. We use the origi- at double the resolution than all the other experiments, i.e.
nal implementations of SuperPoint [21], D2-Net [23], and 3200 pixels in the longest dimension.
R2D2 [59], and extract root-normalized SIFT [51] features For the “photometric” baseline, we use RGB images
using COLMAP [70]. For both sparse and dense feature (while Woodford et al. [88] use grayscale images), we warp
extraction, the images are resized so that their longest di- patches of 4×4 pixels at the featuremap resolution (1600 pix-
mension is equal to 1600 pixels. The tentative matches are els in the longest dimension) with fronto-parallel assumption,
filtered according to the recipe described in [24]. and apply normalized cross correlation (NCC). Identically to
our featuremetric BA and to LSPBA [88], the source frame
D.2. Structure-from-Motion - Section 5.3 is selected as the observation closest to the robust mean.
We tune the hyperparameters on the training scenes Tem- We report results for dense features extracted from a
ple Nara Japan, Trevi Fountain, and Brandenburg Gate. The VGG-16 CNN, trained on ImageNet [20], at the layer
results in the main paper are computed on the test scenes conv1 2 (64 channels) and for the fine feature map pre-
Sacre Coeur, Saint Peter’s Square, and Reichstag, using the dicted by PixLoc [66] (32 channels). The model of PixLoc,
data and code provided by the challenge organizers. trained on MegaDepth [48], was kindly provided by its au-
For SIFT [51], we use the mutual check, a ratio test with thors. In DSIFT [49] (128 channels), we apply a bin size of
threshold 0.85 for the multi-view and 0.9 for the stereo tasks, 4 and a step size of 1 and refer to the VLFeat implementa-
and DEGENSAC with an inlier threshold of 0.5px. For D2- tion [85] for more details.
Net [23], we use the mutual check and inlier thresholds of
D.4. Scalability
2px and 0.5px for raw and refined keypoints, respectively.
For SuperPoint+SuperGlue [21,65], we do not use additional All experiments were conducted on 8 CPU cores (Intel
match filtering and we select an inlier thresholds of 1.1px Xeon E5-2630v4) and one NVIDIA RTX 1080 Ti. The sub-
and 0.5px for raw and refined keypoints, respectively. All sets from the Aachen Day-Night v1.1 model [67, 68, 95]
12
were selected as the images with the largest visibility over-
lap, in descending order. To accelerate the feature matching,
each image was matched only to its top 20 most covisible
reference images in the original Aachen SfM model. We
use SuperPoint [21] features and match image pairs with the
mutual check and distance thresholding at 0.7. During BA,
we apply the sparse Schur solver from Ceres for each linear
system in LM, while we use sparse Cholesky in KA, similar
to [24]. Featuremetric bundle adjustment is stopped after 30
iterations while KA runs for at most 100 iterations and stops
when parameters change by less than 10−4 .
To refine the full Aachen Day-Night model, we use Su-
perPoint features matched with SuperGlue [65] from the
Hierarchical Localization toolbox [63, 64]. We refine the
keypoints with KA, then triangulate the points with fixed
poses from the reference model. Finally, we run a full bundle
adjustment of the model with the proposed approximation
by cost maps.
13
raw / refined point clouds
SuperPoint
Accuracy - unrefined
Accuracy - refined
raw / refined point clouds

D2-Net
Accuracy - unrefined
Accuracy - refined
Figure 10: Refinement on ETH3D Courtyard. In the top parts, we show for both SuperPoint (top) and D2-Net (bottom)
top-down views of the sparse point clouds triangulated with raw (in red) and refined (in green) keypoints. The refined point
clouds better fit the geometry of the scene, especially on planar walls. In the lower parts, we also show images in which points
are colored as accurate (in green) or inaccurate (in red) at 1cm for raw (left) and refined (right) point clouds.
14
References [19] Amaël Delaunoy and Marc Pollefeys. Photometric bundle
adjustment for dense multi-view 3d modeling. In CVPR, 2014.
[1] CVPR 2020 Image Matching Challenge. 2
https://www.cs.ubc.ca/research/ [20] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li
image-matching-challenge/. Accessed March 1, Fei-Fei. ImageNet: A large-scale hierarchical image database.
2021. 7 In CVPR, 2009. 12
[2] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Si- [21] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi-
mon, Brian Curless, Steven M Seitz, and Richard Szeliski. novich. SuperPoint: Self-supervised interest point detection
Building Rome in a day. Communications of the ACM, and description. In CVPR Workshop on Deep Learning for
54(10):105–112, 2011. 1 Visual SLAM, 2018. 1, 2, 5, 6, 7, 9, 12, 13
[3] Sameer Agarwal, Keir Mierle, and Others. Ceres solver. [22] Frédéric Devernay and Olivier D Faugeras. Computing dif-
http://ceres-solver.org. 5 ferential properties of 3-D shapes from stereoscopic images
[4] Sameer Agarwal, Noah Snavely, Steven M Seitz, and Richard without 3-D models. 1994. 2
Szeliski. Bundle adjustment in the large. In ECCV, 2010. 1 [23] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Polle-
[5] Hatem Alismail, Brett Browning, and Simon Lucey. Photo- feys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-Net:
metric bundle adjustment for vision-based SLAM. In ACCV, A trainable CNN for joint detection and description of local
2016. 2, 4 features. In CVPR, 2019. 1, 5, 6, 7, 12
[6] Hatem Alismail, Brett Browning, and Simon Lucey. Robust [24] Mihai Dusmanu, Johannes L. Schönberger, and Marc Polle-
tracking in low light and sudden illumination changes. In feys. Multi-View Optimization of Local Feature Geometry.
3DV, 2016. 3 In ECCV, 2020. 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13
[7] Epameinondas Antonakos, Joan Alabort-i Medina, Georgios [25] Ivan Eichhardt and Daniel Barath. Optimal multi-view cor-
Tzimiropoulos, and Stefanos P Zafeiriou. Feature-based lucas– rection of local affine frames. In BMVC, 2019. 2
kanade and active appearance models. IEEE Transactions on [26] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct
Image Processing, 2015. 3 sparse odometry. TPAMI, 2017. 1, 2, 4, 5
[8] Simon Baker, Ralph Gross, and Iain Matthews. Lucas-kanade [27] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-
20 years on: A unifying framework. IJCV, 56, 2003. 2 SLAM: Large-scale direct monocular SLAM. In ECCV, 2014.
[9] Daniel Barath, Dmytro Mishkin, Ivan Eichhardt, Ilia 2
Shipachev, and Jiri Matas. Efficient initial pose-graph genera- [28] Wolfgang Förstner and Eberhard Gülch. A fast operator for
tion for global sfm. In CVPR, 2021. 2 detection and precise location of distinct points, corners and
[10] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: centres of circular features. In Proc. ISPRS intercommission
Speeded up robust features. In ECCV, 2006. 1 conference on fast processing of photogrammetric data, 1987.
2
[11] Che-Han Chang, Chun-Nan Chou, and Edward Y Chang.
[29] Jan-Michael Frahm, Pierre Fite-Georgel, David Gallup, Tim
CLKN: Cascaded Lucas-Lanade networks for image align-
Johnson, Rahul Raguram, Changchang Wu, Yi-Hung Jen,
ment. In CVPR, 2017. 3
Enrique Dunn, Brian Clipp, Svetlana Lazebnik, et al. Building
[12] Avishek Chatterjee and Venu Madhav Govindu. Efficient and Rome on a cloudless day. In ECCV, 2010. 1
robust large-scale rotation averaging. In CVPR, 2013. 2
[30] P. Georgel, Selim Benhimane, and Nassir Navab. A unified
[13] Christopher B Choy, JunYoung Gwak, Silvio Savarese, and approach combining photometric and geometric information
Manmohan Chandraker. Universal correspondence network. for pose estimation. In BMVC, 2008. 2
In NIPS, 2016. 2 [31] Hugo Germain, Guillaume Bourmaud, and Vincent Lepetit.
[14] Peter Hviid Christiansen, Mikkel Fly Kragh, Yury Brodskiy, S2DNet: Learning accurate correspondences for sparse-to-
and Henrik Karstoft. UnsuperPoint: End-to-end unsupervised dense feature matching. In ECCV, 2020. 1, 5, 7, 11
interest point detector and descriptor. arXiv:1907.04011, [32] Hugo Germain, Vincent Lepetit, and Guillaume Bourmaud.
2019. 2 Neural Reprojection Error: Merging feature learning and cam-
[15] Ondřej Chum, Jiřı́ Matas, and Josef Kittler. Locally optimized era pose estimation. In CVPR, 2021. 11
RANSAC. In Joint Pattern Recognition Symposium, pages [33] Frank R Hampel, Elvezio M Ronchetti, Peter J Rousseeuw,
236–243. Springer, 2003. 6, 7 and Werner A Stahel. Robust statistics: the approach based
[16] Ondrej Chum, Tomas Werner, and Jiri Matas. Two-view on influence functions. Wiley, 1986. 3
geometry estimation unaffected by a dominant plane. In [34] Christopher G Harris, Mike Stephens, et al. A combined
CVPR, 2005. 7 corner and edge detector. In Alvey vision conference, 1988. 1
[17] Ronald Clark, Michael Bloesch, Jan Czarnowski, Stefan [35] Jared Heinly, Johannes L Schonberger, Enrique Dunn, and
Leutenegger, and Andrew J. Davison. LS-Net: Learning to Jan-Michael Frahm. Reconstructing the World* in Six Days
solve nonlinear least squares for monocular stereo. In ECCV, *(as Captured by the Yahoo 100 Million Image Dataset). In
2018. 3 CVPR, 2015. 1
[18] Jan Czarnowski, Stefan Leutenegger, and Andrew J. Davi- [36] Heiko Hirschmüller, Peter R Innocent, and Jon Garibaldi.
son. Semantic texture for robust dense tracking. In ICCV Real-time correlation-based stereo vision with reduced border
Workshops, 2017. 1, 2 errors. IJCV, 2002. 2
15
[37] Paul W Holland and Roy E Welsch. Robust regression us- [56] Seonwook Park, Thomas Schöps, and Marc Pollefeys. Illumi-
ing iteratively reweighted least-squares. Communications in nation Change Robustness in Direct Visual SLAM. In ICRA,
Statistics-theory and Methods, 6(9):813–827, 1977. 5 2017. 3
[38] Danying Hu, Daniel DeTone, and Tomasz Malisiewicz. Deep [57] Rémi Pautrat, Viktor Larsson, Martin R Oswald, and Marc
ChArUco: Dark ChArUco Marker Pose Estimation. In CVPR, Pollefeys. Online invariance selection for local feature de-
2019. 2 scriptors. In ECCV, 2020. 1
[39] Andres Huertas and Gerard Medioni. Detection of inten- [58] Filip Radenovic, Johannes L Schonberger, Dinghuang Ji, Jan-
sity changes with subpixel accuracy using laplacian-gaussian Michael Frahm, Ondrej Chum, and Jiri Matas. From dusk till
masks. TPAMI, 1986. 2 dawn: Modeling in the dark. In CVPR, 2016. 1
[40] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, [59] Jerome Revaud, Philippe Weinzaepfel, César De Souza, Noe
Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolu- Pion, Gabriela Csurka, Yohann Cabon, and Martin Humen-
tion of optical flow estimation with deep networks. In CVPR, berger. R2D2: Repeatable and reliable detector and descriptor.
2017. 2 In NeurIPS, 2019. 1, 5, 6, 12
[41] Arnold Irschara, Christopher Zach, Jan-Michael Frahm, and [60] Ignacio Rocco, Relja Arandjelović, and Josef Sivic. Efficient
Horst Bischof. From structure-from-motion point clouds to neighbourhood consensus networks via submanifold sparse
fast location recognition. In CVPR, 2009. 1 convolutions. In ECCV, 2020. 2
[42] Yekeun Jeong, David Nister, Drew Steedly, Richard Szeliski, [61] Ignacio Rocco, Mircea Cimpoi, Relja Arandjelović, Akihiko
and In-So Kweon. Pushing the envelope of modern methods Torii, Tomas Pajdla, and Josef Sivic. Neighbourhood consen-
for bundle adjustment. TPAMI, 2011. 1, 5 sus networks. In NeurIPS, 2018. 2
[43] Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas,
[62] Edward Rosten and Tom Drummond. Machine learning for
Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image Match-
high-speed corner detection. In ECCV, 2006. 1
ing across Wide Baselines: From Paper to Practice. IJCV,
[63] Paul-Edouard Sarlin. Visual localization made
2020. 7
easy with hloc. https://github.com/cvg/
[44] Christian Kerl, Jürgen Sturm, and Daniel Cremers. Dense
Hierarchical-Localization/. 2, 13
visual slam for RGB-D cameras. In IROS, 2013. 2, 4
[64] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and
[45] Kenneth Levenberg. A method for the solution of certain
Marcin Dymczyk. From coarse to fine: Robust hierarchical
non-linear problems in least squares. Quarterly of applied
localization at large scale. In CVPR, 2019. 2, 13
mathematics, 2(2):164–168, 1944. 5, 11
[46] Xinghui Li, Kai Han, Shuda Li, and Victor Prisacariu. Dual- [65] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz,
resolution correspondence networks. In NeurIPS, 2020. 2 and Andrew Rabinovich. SuperGlue: Learning feature match-
ing with graph neural networks. In CVPR, 2020. 1, 7, 12,
[47] Yunpeng Li, Noah Snavely, Dan Huttenlocher, and Pascal
13
Fua. Worldwide pose estimation using 3D point clouds. In
ECCV, 2012. 1 [66] Paul-Edouard Sarlin, Ajaykumar Unagar, Måns Larsson,
[48] Zhengqi Li and Noah Snavely. MegaDepth: Learning single- Hugo Germain, Carl Toft, Viktor Larsson, Marc Pollefeys,
view depth prediction from internet photos. In CVPR, 2018. Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl, and
12 Torsten Sattler. Back to the Feature: Learning robust camera
localization from pixels to pose. In CVPR, 2021. 3, 5, 7, 12
[49] Ce Liu, Jenny Yuen, and Antonio Torralba. SIFT Flow: Dense
correspondence across scenes and its applications. TPAMI, [67] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars
2010. 2, 7, 12 Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Oku-
[50] Liu Liu, Hongdong Li, and Yuchao Dai. Efficient global 2D- tomi, Marc Pollefeys, Josef Sivic, Fredrik Kahl, and Tomas
3D matching for camera localization in a large-scale 3D map. Pajdla. Benchmarking 6DOF outdoor visual localization in
In ICCV, 2017. 1 changing conditions. In CVPR, 2018. 2, 8, 12
[51] David G Lowe. Distinctive image features from scale- [68] Torsten Sattler, Tobias Weyand, Bastian Leibe, and Leif
invariant keypoints. IJCV, 60(2):91–110, 2004. 1, 2, 5, 6, 7, Kobbelt. Image retrieval for image-based localization re-
12 visited. In BMVC, 2012. 8, 12
[52] Bruce D. Lucas and Takeo Kanade. An iterative image reg- [69] Daniel Scharstein and Richard Szeliski. A taxonomy and eval-
istration technique with an application to stereo vision. In uation of dense two-frame stereo correspondence algorithms.
IJCAI, 1981. 1, 2 IJCV, 2002. 2
[53] Zhaoyang Lv, Frank Dellaert, James M. Rehg, and Andreas [70] Johannes Lutz Schönberger and Jan-Michael Frahm.
Geiger. Taking a deeper look at the inverse compositional Structure-from-motion revisited. In CVPR, 2016. 2, 6, 7,
algorithm. In CVPR, 2019. 3 8, 12
[54] Daniel Martinec and Tomas Pajdla. Robust rotation and trans- [71] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys,
lation estimation in multiview reconstruction. In CVPR, 2007. and Jan-Michael Frahm. Pixelwise view selection for unstruc-
2 tured multi-view stereo. In ECCV, 2016. 2
[55] Yuki Ono, Eduard Trulls, Pascal Fua, and Kwang Moo Yi. [72] Thomas Schops, Torsten Sattler, and Marc Pollefeys. BAD
LF-Net: Learning local features from images. In NeurIPS, SLAM: Bundle Adjusted Direct RGB-D SLAM. In CVPR,
2018. 2 June 2019. 2
16
[73] Thomas Schops, Johannes L Schonberger, Silvano Galliani, [91] Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and
Torsten Sattler, Konrad Schindler, Marc Pollefeys, and An- Long Quan. Recurrent MVSNet for high-resolution multi-
dreas Geiger. A multi-view stereo benchmark with high- view stereo depth inference. In CVPR, 2019. 2
resolution images and multi-camera videos. In CVPR, 2017. [92] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal
2, 5 Fua. LIFT: Learned invariant feature transform. In ECCV,
[74] Xi Shen, François Darmon, Alexei A Efros, and Mathieu 2016. 1
Aubry. RANSAC-Flow: generic two-stage image alignment. [93] Baosheng Yu and Dacheng Tao. Heatmap regression via
In ECCV, 2020. 2 randomized rounding. arXiv:2009.00225, 2020. 2
[75] Noah Snavely, Steven M Seitz, and Richard Szeliski. Mod- [94] Zehao Yu and Shenghua Gao. Fast-MVSNet: Sparse-to-dense
eling the world from internet photo collections. IJCV, 2008. multi-view stereo with learned propagation and gauss-newton
2 refinement. In CVPR, 2020. 3
[76] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. [95] Zichao Zhang, Torsten Sattler, and Davide Scaramuzza. Ref-
PWC-Net: CNNs for optical flow using pyramid, warping, erence Pose Generation for Long-term Visual Localization
and cost volume. In CVPR, 2018. 2 via Learned Features and View Synthesis. IJCV, 2020. 8, 12
[77] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and [96] Qunjie Zhou, Torsten Sattler, and Laura Leal-Taixe.
Xiaowei Zhou. LoFTR: Detector-free local feature matching Patch2Pix: Epipolar-guided pixel-level correspondences. In
with Transformers. CVPR, 2021. 2 CVPR, 2021. 2
[78] Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea
Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, and Ak-
ihiko Torii. InLoc: Indoor Visual Localization with Dense
Matching and View Synthesis. TPAMI, 2019. 2
[79] Chengzhou Tang and Ping Tan. BA-Net: Dense bundle ad-
justment network. In ICLR, 2019. 3
[80] Jiexiong Tang, Hanme Kim, Vitor Guizilini, Sudeep Pil-
lai, and Rares Ambrus. Neural outlier rejection for self-
supervised keypoint learning. In ICLR, 2020. 2
[81] Engin Tola, Vincent Lepetit, and Pascal Fua. DAISY: An effi-
cient dense descriptor applied to wide-baseline stereo. TPAMI,
2009. 2
[82] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and
Andrew W Fitzgibbon. Bundle adjustment — a modern syn-
thesis. In International workshop on vision algorithms, 1999.
1, 2, 3
[83] Prune Truong, Martin Danelljan, and Radu Timofte. GLU-
Net: Global-local universal network for dense flow and corre-
spondences. In CVPR, 2020. 2
[84] Michał J Tyszkiewicz, Pascal Fua, and Eduard Trulls. Disk:
Learning local features with policy gradient. In NeurIPS,
2020. 1
[85] Andrea Vedaldi and Brian Fulkerson. VLFeat: An open and
portable library of computer vision algorithms. In ACM inter-
national conference on Multimedia, 2010. 12
[86] Lukas Von Stumberg, Patrick Wenzel, Qadeer Khan, and
Daniel Cremers. GN-Net: The Gauss-Newton loss for multi-
weather relocalization. RA-L, 5(2):890–897, 2020. 3, 5
[87] Lukas Von Stumberg, Patrick Wenzel, Nan Yang, and Daniel
Cremers. LM-Reloc: Levenberg-Marquardt based direct vi-
sual relocalization. In 3DV, 2020. 3
[88] Oliver J Woodford and Edward Rosten. Large scale photo-
metric bundle adjustment. In BMVC, 2020. 2, 4, 7, 12
[89] Binbin Xu, Andrew J. Davison, and Stefan Leutenegger. Deep
probabilistic feature-metric tracking. RA-L, 6(1):223–230,
2021. 3
[90] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan.
MVSNet: Depth inference for unstructured multi-view stereo.
ECCV, 2018. 2
17

2108.08291v1

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

2108.08291v1

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

2108.08291v1

Uploaded by

Copyright:

Available Formats

Pixel-Perfect Structure-from-Motion with Featuremetric Refinement

Philipp Lindenberger1 * Paul-Edouard Sarlin2 * Viktor Larsson2 Marc Pollefeys2,3

cally selects such points for each image independently and

no loss of accuracy. See Appendix C for more details.

keypoints using the entire exhaustive tentative match graph

multiview setting, it also improves the accuracy of the SfM

We refine the triangulation of SuperPoint [21] keypoints

Figure 7: Impact of the patch size. Smaller patches for

ë Patch Flow [24] 37.00 55.18 0.15 0.93 7.44 5.24

raw / refined point clouds

You might also like