NeurAR: Neural Uncertainty for Autonomous 3D Reconstruction with Implicit Neural Representations
Abstract—Implicit neural representations have shown compelling results in offline 3D reconstruction and also recently demonstrated the potential for online SLAM systems. However, applying them to autonomous 3D reconstruction, where a robot is required to explore a scene and plan a view path for the reconstruction, has not been studied. In this paper, we explore for the first time the possibility of using implicit neural representations for autonomous 3D scene reconstruction by addressing two key challenges: 1) seeking a criterion to measure the quality of the candidate viewpoints for the view planning based on the new representations, and 2) learning the criterion from data so that it can generalize to different scenes instead of hand-crafting one. To solve these challenges, firstly, a proxy of Peak Signal-to-Noise Ratio (PSNR) is proposed to quantify viewpoint quality; secondly, the proxy is optimized jointly with the parameters of an implicit neural network for the scene. With the proposed view quality criterion from neural networks (termed Neural Uncertainty), we can then apply implicit representations to autonomous 3D reconstruction. Our method demonstrates significant improvements on various metrics for the rendered image quality and the geometry quality of the reconstructed 3D models when compared with variants using TSDF or reconstruction without view planning. Project webpage: https://kingteeloki-ran.github.io/NeurAR/

Index Terms—Computer Vision for Automation; Motion and Path Planning; Planning under Uncertainty

Manuscript received: August 7, 2022; Revised: November 23, 2022; Accepted: December 21, 2022. This paper was recommended for publication by Editor Cesar Cadena upon evaluation of the Associate Editor and Reviewers' comments. This work was supported in part by NSFC under Grants 62233013, 62088101, 62103372, and the Fundamental Research Funds for the Central Universities, China. (Corresponding Author: Qi Ye)
1 Yunlong Ran and Jing Zeng are with the College of Control Science and Engineering, Zhejiang University, Hangzhou, China. {yunlong_ran, zengjing}@zju.edu.cn
2 Shibo He, Jiming Chen, and Qi Ye are with the College of Control Science and Engineering, the State Key Laboratory of Industrial Control Technology, Zhejiang University, and the Key Laboratory of Collaborative Sensing and Autonomous Unmanned Systems of Zhejiang Province. {s18he, cjm, qi.ye}@zju.edu.cn
3 Lincheng Li and Yingfeng Chen are with Fuxi AI Lab, NetEase Inc., Hangzhou, China. {lilincheng, chenyingfeng1}@corp.netease.com
4 Gimhee Lee is with the Department of Computer Science, National University of Singapore, Singapore. dcslgh@nus.edu.sg

I. INTRODUCTION

Autonomous 3D reconstruction has a wide range of applications, e.g. augmented/virtual reality, autonomous driving, filming, gaming, medicine, and architecture. The problem requires a robot to decide, at each step, which viewpoint to move towards in order to obtain the best reconstruction quality of an unknown scene at the lowest cost, i.e. view planning. In this work, we assume a robot can localize itself and that, at each viewpoint, the information of the scene is captured by an RGB image (with an optional depth image).

Implicit neural representations for 3D objects have shown their potential to be precise in geometry encoding, efficient in memory consumption (adaptive to scene size and complexity), predictive in filling unseen regions, and flexible in the amount of training data. Reconstruction from offline images [1] or from online images captured by a human-held camera [2] has recently achieved compelling results with implicit neural representations. However, leveraging these advancements to achieve high-quality autonomous 3D reconstruction has not been studied.

Previous 3D representations for autonomous 3D reconstruction include point cloud, volume, and surface. To plan views without global information of a scene, previous work resorts to a greedy strategy: given the current position of the robot and the reconstruction status, they quantify the quality of the candidate viewpoints via information gain to plan the next best view (NBV). In these works, the information gain relies on hand-crafted criteria, each designed ad hoc for a particular combination of a 3D representation and a reconstruction algorithm. For example, Mendez et al. [3] define view cost by the triangulation uncertainty given by the algorithm inferring depth from stereo RGB images, Isler et al. [4] quantify the information gain for a view using the entropy of the voxels seen from this viewpoint, Wu et al. [5] identify the quality of the view by a Poisson field from point clouds, and Song et al. [6] leverage mesh holes and boundaries to guide the view planning.

To use implicit neural representations for autonomous 3D reconstruction, a key capability is to quantify the quality of the candidate viewpoints. For implicit neural representations, how should viewpoint quality be defined? Is it possible for the neural network to learn a measurement of the quality from data instead of defining it heuristically? In this paper, we make efforts to answer both questions.

The quality of reconstructed 3D models can be measured by the quality of images rendered from different viewpoints, and this measurement is adopted in many offline 3D reconstruction works. PSNR is one popular measurement; it is defined according to the difference between the images rendered from the reconstructed model and from the ground truth model.

The ground truth images, however, are unavailable for unvisited viewpoints to calculate PSNR during autonomous reconstruction. Is it possible to learn a proxy for PSNR? In [7], [8], the authors point out that if the target variable to regress is under a Gaussian noise model and the target distribution conditioned on the input is optimized by maximum likelihood, the optimum value of the noise variance is given by the residual variance of the target values and the regressed ones.
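For reference, the standard result referred to here can be written as follows (generic notation, not the paper's own equations): for targets t_n regressed by y_n under i.i.d. Gaussian noise,

-\log p(\{t_n\} \mid \sigma^2) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} (t_n - y_n)^2 + \frac{N}{2} \log \sigma^2 + \text{const}, \qquad \frac{\partial}{\partial \sigma^2}(\cdot) = 0 \;\Rightarrow\; \sigma^2_{\star} = \frac{1}{N} \sum_{n=1}^{N} (t_n - y_n)^2,

i.e. the maximum-likelihood noise variance equals the mean squared residual between the targets and the regressed values.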
Fig. 1. View paths planned by NeurAR and the uncertainty maps used to guide the planner. Given current poses 5 and 15, paths are planned toward viewpoints having higher uncertainties, i.e. poses 6 and 16. Uncertainties of the viewpoints are shown in red text. Notice that darker regions in the uncertainty maps correspond to worse-quality regions of the rendered images.
Inspired by this, we assume the color to regress for a spatial point in a scene to be a random variable modeled by a Gaussian distribution. The Gaussian distribution models the uncertainty of the reconstruction and its variance quantifies the uncertainty. When the regression network converges, the variance of the distribution is given by the squared error between the predicted color and the ground truth color; the integral of the uncertainty of points in the frustum of a viewpoint can then be taken as a proxy of PSNR to measure the quality of candidate viewpoints.

With the key questions solved, we are able to build an autonomous 3D reconstruction system (NeurAR) using an implicit neural network. In summary, the contributions of the paper are:
• We propose the first autonomous 3D reconstruction system using an implicit neural representation.
• We propose a novel view quality criterion that is learned online from continuously added input images per target scene instead of being hand-engineered or learned from a large corpus of 3D scenes.
• Our proposed method significantly improves on various metrics over alternatives using voxel-based representations or man-designed paths for the reconstruction.

II. RELATED WORK

View Planning Most view planning methods focus on the NBV problem, which uses feedback from the current partial reconstruction to determine the NBV. According to the representations used for the 3D models, these methods can be divided into voxel-based methods [3], [4] and surface-based methods [5], [9].

The voxel-based methods are most commonly used due to their simplicity in representing space. Vasquez-Gomez et al. [10] analyze a set of boundary voxels and determine the NBV for dense 3D modeling of small-scale objects. Isler et al. [4] provide several metrics to quantify the volume information contained in the voxels. Mendez et al. [3] define the information gain by the triangulation uncertainty of stereo images. Despite their simplicity, these methods suffer from memory consumption growing with scene complexity and spatial resolution.

A complete volumetric map does not necessarily guarantee a perfect 3D surface. Therefore, researchers propose to analyze the shape and quality of the reconstructed surface for NBV [5], [9]. Wu et al. [5] estimate a confidence map representing the completeness and smoothness of the constructed Poisson iso-surface, and the confidence map is used to guide the calculation of the NBV. Schmid et al. [9] propose an information gain to evaluate the quality of observed surfaces and unknown voxels near the observed surfaces, then plan a path by RRT.

In contrast, some methods explore function learning to solve the NBV problem [11]–[13]. Supervised learning-based methods [13] learn the information gain of a viewpoint given a partial occupancy map, and [12] learns an informed distribution of high-utility viewpoints based on a partial occupancy map. For the reinforcement learning-based methods [11], no hand-crafted heuristics are required and an agent explores the viewpoints with high overall coverage. These learning-based methods require a large-scale dataset for training and may be hard to generalize to different scenes.

Different from the hand-crafted heuristics, the learned NBV policies, and the existing learning-based information gains, our proposed neural uncertainty is learned per target scene during the reconstruction, requiring no manual definition and no large training set, and it is able to work in any new scene.

Online Dense Reconstruction For online 3D reconstruction from RGB images, most methods use Multi-View Stereo (MVS) [14] to reconstruct dense models by first obtaining a sparse set of initial matches, iteratively expanding the matches to nearby locations, and performing surface reconstruction. With the release of commodity RGB-D sensors, the fusion of point clouds reprojected from depth images gained popularity. KinectFusion [15] achieves real-time 3D reconstruction with a moving depth camera by integrating points from depth images with Truncated Signed Distance Functions (TSDFs). OctoMap [16] builds a probabilistic occupancy volume based on an octree. Recently, implicit representations have shown compelling results in 3D reconstruction, either for radiance field approximation from RGB images [17] or for shape approximation from point clouds [18]. These novel representations are also studied for online dense reconstruction. iMAP [2] adopts MLPs as the scene representation and reconstructs the scene from RGBD images. In addition to using MLPs to represent a scene implicitly, NeRFusion [19] further combines a feature volume to fuse information from different views as a latent scene representation. Similarly, we use an implicit neural function to represent 3D models, but we focus on how to leverage this representation for autonomous view planning.
acceleration in Section III-C. The view planner module (Section III-D) samples viewpoints from empty space, measures the contributions of these sampled viewpoints by composing

[Figure: system overview with a 3D Reconstruction module, a Renderer, and a View Planner exchanging RGB/depth images, poses, and uncertainties.]
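As a rough sketch (not the paper's implementation) of the planner step described above: a candidate viewpoint can be scored by rendering the uncertainty of the current implicit model over its image and accumulating it, and the highest-scoring candidate is taken as the next best view. The helpers render_uncertainty and sample_free_viewpoints are hypothetical placeholders.

import numpy as np

def score_view(pose, render_uncertainty):
    # Hypothetical hook: volume-render the per-pixel uncertainty of F_theta at this pose.
    uncertainty_map = render_uncertainty(pose)        # (H, W) array
    # Uncertainty accumulated over the view frustum serves as the view quality proxy.
    return float(np.mean(uncertainty_map))

def next_best_view(sample_free_viewpoints, render_uncertainty, num_candidates=64):
    # Sample candidate poses from empty space and greedily pick the most uncertain view.
    candidates = sample_free_viewpoints(num_candidates)
    scores = [score_view(p, render_uncertainty) for p in candidates]
    return candidates[int(np.argmax(scores))]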
L_{mean} = \lVert \bar{C}_I - \mu_I \rVert_2^2 \leq L_I = \frac{1}{N} \sum_{r=1}^{N} \lVert C_r - \mu_r \rVert_2^2. \quad (6)
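The inequality in (6) is the usual convexity (Jensen) bound; reading \bar{C}_I and \mu_I as the means over the N rays of the ground-truth colors C_r and the predicted colors \mu_r (the definitions accompanying (5) are not reproduced in this excerpt), it follows from

\Big\lVert \frac{1}{N} \sum_{r=1}^{N} (C_r - \mu_r) \Big\rVert_2^2 \;\le\; \frac{1}{N} \sum_{r=1}^{N} \lVert C_r - \mu_r \rVert_2^2,

so matching only the image means constrains each pixel far more weakly than matching every ray.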
For L_mean of (6), the constraint for each pixel is too weak and the network is not able to converge to a meaningful result (PSNR about 10). As we have supervision for the color of each pixel and L_mean is smaller than or equal to L_I, we choose to minimize L_I instead. The loss function above is differentiable w.r.t. θ, and µ, σ, ρ are the outputs of the MLPs (Fθ); we can use gradient descent to determine the network parameters θ and µ, σ, ρ. To make σ positive, we let the network estimate σ²: the output s of the MLPs is activated by e^s to obtain σ².

Consider the minimization with respect to θ. Given σ_I, we can see that the maximum likelihood solution under a conditional Gaussian distribution for each point is equivalent to minimizing the mean-of-squares error function given by L_I in (6). Applying Fθ to a point on a ray, µ, ρ, σ can be obtained.

3) Neural Uncertainty and PSNR: For the variance σ_I², or Neural Uncertainty, the optimum value can be obtained by setting the derivative of L_color with respect to σ_I to zero, giving

\sigma_I^2 = L_I. \quad (7)

The equation above indicates that the optimal solution for σ_I² is the squared error between the predicted image and the ground truth image.

On the other hand, PSNR is defined as 10 \log_{10} \frac{MAX_I^2}{MSE}, where MAX_I is the maximum possible pixel value of the image (255 when a pixel is represented using 8 bits) and MSE is the mean squared error between the two images, the same as L_I. Then we establish a linear relationship between the logarithm of Neural Uncertainty and PSNR, i.e.

PSNR = A \log \sigma_I^2, \quad (8)

where A is a constant coefficient.

To verify the linear relationship, we scatter data pairs of (PSNR, \log \sigma_I^2) for images in the testing set evaluated at different iterations when optimizing Fθ for a cabin scene (the scene is shown in Section IV). Two different training strategies are conducted: online training with images captured along a planned trajectory added sequentially, and offline batch training using all the images precaptured from the trajectory. As can be seen in Fig. 3(c-d), a strong correlation exists between PSNR and \log \sigma_I^2, whose Pearson Correlation Coefficient (PCC) is -0.96 for online and -0.92 for offline. The two variables are almost perfectly negatively linearly related. During the training, σ_I and L_I are jointly optimized. The loss curves of the uncertainty part \log \sigma_I and the ratio part L_I/\sigma_I^2 in (5) for online training are shown in Fig. 3(b). Notice that the loss curve for the ratio part stays almost constant during training, which also verifies the effectiveness of using uncertainty as a proxy of PSNR.

Fig. 3. Loss curves and the linear relationship between \log \sigma_I^2 and PSNR. A PCC value of -1 signifies strong negative correlation. (a) Training loss curves for online training and offline training. (b) The loss curves for the uncertainty part \log \sigma_I and the ratio part L_I/\sigma_I^2 in (5). (c) Linearity when the scene is optimized using offline images. (d) Linearity when the scene is optimized using online images.

For the verification and the usage of the linear relationship for autonomous reconstruction, we adopt NeRF [1] as our implicit representation for a scene. From the formulation and the derivation of the relation between neural uncertainty and PSNR, our neural uncertainty is agnostic to the underlying function Fθ, which can be MLPs as in NeRF or networks based on trainable feature vectors with an MLP decoder [19].

C. Online Training and Acceleration

Though online reconstruction with Neural Uncertainty supervised by images can achieve accuracy similar to NeRF, the convergence is too slow for view planning. We accelerate training by introducing a particle filter, depth supervision, and a keyframe strategy.

Particle filter keeps particles (rays) active in high-loss regions, which helps the network optimize details faster. At each step, when a new image is added to the image pool for training, a set of particles is randomly sampled from the image. After an iteration of training, a quarter of the particles are resampled according to the weight of the particles, which is defined according to the loss of a ray, and the other particles are uniformly sampled from the image.

The particle filter is applied after coarse learning of the whole scene is done. At the early stage of training, the model knows little about the scene and tends to have a higher loss for rays shooting at objects than for rays in empty space. This results in particles always staying at the surfaces of objects, and therefore the network learns the whole space more slowly. Considering this, at the beginning iterations of the training, we use random sampling.
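A minimal sketch of the ray resampling just described, with illustrative shapes and names (not the paper's code): after an iteration, a quarter of the particles are redrawn with probability proportional to the per-ray loss, and the rest are drawn uniformly from the image.

import numpy as np

def resample_particles(ray_losses, num_particles, resample_frac=0.25, rng=np.random):
    # ray_losses: per-pixel (per-ray) losses over the current training image, flattened.
    num_pixels = ray_losses.shape[0]
    n_weighted = int(resample_frac * num_particles)

    # A quarter of the particles follow the loss, keeping high-loss regions active.
    weights = ray_losses / ray_losses.sum()
    weighted_idx = rng.choice(num_pixels, size=n_weighted, p=weights)

    # The remaining particles are sampled uniformly from the image.
    uniform_idx = rng.choice(num_pixels, size=num_particles - n_weighted)

    return np.concatenate([weighted_idx, uniform_idx])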
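For concreteness, the objective this acceleration serves can be sketched as below. Equation (5) is not reproduced in this excerpt, so the exact form, a log-variance term plus the ratio L_I/\sigma_I^2 described above, is our reading of the text; how \sigma_I^2 is aggregated from the per-point variances is likewise assumed here and is simply passed in.

import torch

def image_loss(pred_rgb, gt_rgb, sigma2_I):
    # pred_rgb, gt_rgb: (N, 3) predicted and ground-truth colors of the N sampled rays.
    # sigma2_I: image-level Neural Uncertainty tensor, obtained elsewhere by rendering exp(s).
    L_I = ((pred_rgb - gt_rgb) ** 2).sum(dim=-1).mean()    # the L_I of (6)
    return torch.log(sigma2_I) + L_I / sigma2_I            # uncertainty part + ratio part

At the optimum over sigma2_I this reduces to the relation in (7), \sigma_I^2 = L_I.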
Depth supervision can greatly speed up training. Depth images for NeRF are rendered similarly to the color images in [1]. We define the depth loss and our final loss as

L_{depth} = \frac{1}{N} \sum_{r=1}^{N} \lVert \hat{z}_r - z_r \rVert_2^2, \qquad L = L_{color} + \lambda_d L_{depth}, \quad (9)

where \hat{z}_r and z_r represent the depth rendered from the reconstructed 3D model Fθ and the ground truth depth for a pixel. As depth captured from real sensors typically has noise, we find that using depth supervision may prevent the model from converging well due to the conflict between the noisy depth and the depth inferred from the multiview RGB images. A balance needs to be struck between the two. We strengthen depth supervision at the early stage of training to accelerate training and decrease it after obtaining a coarse 3D structure. The weight of the depth loss is decreased from 1 to 1/10 after N_d iterations to emphasize structure from multiview images.

Keyframe pool We follow iMAP [2] to maintain a keyframe pool containing 4 images for continual training. The pool is initialized with the first four views, and during training, the image with the minimum image loss in the pool is replaced with a new image or an image from a set of seen images.

other scenes. RGBD images are rendered by the Unity Engine and we assume their corresponding camera poses are known. To simulate depth noise, all rendered depth images are corrupted with noise scaling approximately quadratically with depth z [20]. The depth noise model is ϵ = N(µ(z), σ(z)), where µ(z) = 0.0001125z² + 0.0048875, σ(z) = 0.002925z² + 0.003325, and the constant parameters are acquired by fitting the model to the noise reported for the Intel RealSense L515. For the depth noise model for Alexander, which is of large size, µ(z) = 0.00001235z² + 0.00004651 and σ(z) = 0.00001228z² + 0.00001571, with the constant parameters acquired by fitting the model to the noise reported for the Lidar VLP-16.

Implementation details The networks and the hyperparameters of all the experiments for the different scenes below are set to the same values. Most hyperparameters use the default settings of NeRF, including the parameters of the Adam optimizer (β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁷), 64/128 sampling points on a ray for coarse/fine sampling, a batch size of 1024, etc.

[Figure: network architecture with positional encodings γ(X), γ(x), 256-wide MLP layers, a Neural Uncertainty branch, and RGB and Uncertainty outputs.]
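The depth-noise simulation above is easy to reproduce; a numpy sketch with the stated coefficients (per-pixel Gaussian noise whose mean and standard deviation grow roughly quadratically with the depth z), where the function and constant names are ours:

import numpy as np

# Coefficients quoted above: L515 for the smaller scenes, VLP-16 for Alexander.
L515_MU,  L515_STD  = (0.0001125, 0.0048875), (0.002925, 0.003325)
VLP16_MU, VLP16_STD = (0.00001235, 0.00004651), (0.00001228, 0.00001571)

def add_depth_noise(depth, mu_coef=L515_MU, std_coef=L515_STD, rng=np.random):
    # eps ~ N(mu(z), sigma(z)) with mu(z) = a*z^2 + b and sigma(z) = c*z^2 + d.
    mu = mu_coef[0] * depth**2 + mu_coef[1]
    sigma = std_coef[0] * depth**2 + std_coef[1]
    return depth + rng.normal(mu, sigma)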
Fig. 6. Comparison of the reconstructed models of the cabin, drums, alexander, and tank scenes using different methods (Ground Truth, Online RRT (ours), Online FT, Offline FT, Online RS, TSDF FT, TSDF RS, TSDF RRT). Refer to the supplementary video for higher resolution, more comparisons and more viewpoints.

TABLE I. Evaluations on the reconstructed 3D models using different methods.
For the implicit scene representation, we adopt TSDF as the baseline, as it is one of the most used representations for SLAM and autonomous reconstruction [4], [9], [21]. For view path planning based on the proposed Neural Uncertainty, we construct two variants: one using a pre-defined circular trajectory, with which existing work [21], [22] usually compares, and the other randomly sampling a viewpoint instead of using the view cost to choose the NBV at each step.

The variants are: 1) TSDF FT: RGBD images and corresponding poses are collected from a Fixed circular Trajectory and are fed into the system sequentially. The voxel resolution of the TSDF is 1 cm. 2) TSDF RS: replaces the fixed trajectory in TSDF FT with Randomly Sampled NBVs. 3) TSDF RRT: as most work on autonomous scene reconstruction is not open source, we re-implement the view cost defined in [9], which adopts a TSDF representation. The online trajectory is planned with RRT and the view cost is defined according to the quality of a reconstructed voxel, measured by the number of rays traversing through it. Images and corresponding poses for the fusion are collected from the planned views and are fed into the system sequentially. 4) Offline FT: NeurAR with views from the fixed trajectory and trained offline. 5) Online FT: NeurAR with views from the fixed trajectory. 6) Online RS: NeurAR with randomly sampled NBVs. 7) Online even cover.: NeurAR with views evenly distributed in a dome to cover the whole space. 8) Ours offline: NeurAR trained offline, i.e. trained from scratch with images precaptured from all the views planned by our proposed method.

Metrics The quality of the reconstructed models can be measured in two aspects: the quality of the rendered images (measuring both the geometry and the texture) and the quality of the geometry of the constructed surface. For the former, we evenly distribute about 200 testing views, 80 m from the center for Alexander and 3 m and 3.4 m from the center for the other scenes, render images of the reconstructed models, and evaluate PSNR, SSIM and LPIPS for these images. For the latter, we adopt the metrics from iMAP [2]: Accuracy (cm), Completion (cm), and Completion Ratio (the percentage of points in the reconstructed mesh with Completion under a threshold, 30 cm for Alexander and 1 cm for the others). For the geometry metrics, about 300k points are sampled from the surfaces.

A. Neural Uncertainty

Fig. 3 has quantitatively verified the correlation between Neural Uncertainty and the image quality. Here, we further demonstrate the correlation with examples.
In this experiment, the reconstruction network is optimized by the loss defined in (5) with 73 images collected from cameras placed roughly on one hemisphere. Fig. 5 shows the images rendered from different viewpoints of the reconstructed 3D models and their corresponding uncertainty maps during training. The uncertainty map is rendered from an uncertainty field where the value of each point in the field is 1/\log \sigma. The viewpoints of Row 1 are in the hemisphere seen during training and the viewpoint of the last row is in the other hemisphere. At the beginning, for all viewpoints, the object region and the empty space both have high uncertainty. For the seen viewpoints, as training continues, the uncertainty of the whole space decreases and the quality of the rendered images improves. When the network converges, the uncertainty remains high only in local areas having complicated geometries. For the unseen viewpoint, though the uncertainty for the empty space decreases dramatically with training, the uncertainty is still very high on the whole surface of the object even when the network converges.

Fig. 1 shows that the Neural Uncertainty can successfully guide the planner to plan view paths for cameras to look at regions that are not well reconstructed. For example, given current pose 15 and Fθ, a path is planned toward viewpoint 16, which has a higher-uncertainty map. The image rendered from viewpoint 16 by Fθ exhibits poor quality (zoom in on the image and the uncertainty map for details under the roof).

B. View planning with Neural Uncertainty

To show the efficacy of our proposed method, we compare metrics in Table I and the rendered images of the reconstructed 3D models using different methods in Fig. 6. Table I demonstrates that, except for the even coverage, our method outperforms all variants on all metrics significantly. Our proposed NeurAR achieves better reconstruction results with shorter view paths compared with the other variants and existing work.

Compared with methods using the implicit representation without path planning (Offline FT, Online FT, Online RS), our method demonstrates significant improvements in image quality and geometry accuracy. Methods without the planned views 1) are even unable to converge to a reasonable result in the large scene (NeRF will fail due to overfitting without carefully planned input viewpoint coverage), 2) have many holes in the objects, as these regions are unseen in the images, and 3) show inferior image details on the surfaces of objects, as these regions have little overlap between different views, making inferring the 3D geometry hard (check the red box in Fig. 6 for a visual comparison). In addition, these variants tend to have ghost effects in the empty space. This is largely because the viewpoints are designed to make the camera look at the objects and the empty space is not considered. Notice that placing viewpoints covering the whole space without the aid of visualization tools as feedback is non-trivial for humans. Assuming we have the 3D shape of a scene with the target object in the center (which is our goal), to cover the whole space we distribute viewpoints evenly on a dome centered at the scene center with a radius of 80 meters for Alexander and 3 meters for the others. All viewpoints look at the center. An even coverage path can provide a good reference, as even coverage is an optimum solution for many scenes of simple shape and texture. The metrics for Online even cover. and Ours in Table I demonstrate that our planned views can achieve similar or even better reconstruction results, while our method has no or only minor prior knowledge about the scene.

Ours offline in most scenes gives better results than NeurAR trained online, as using all views enforces multiview constraints from the start of the training.

Beyond the mean PSNR reported in Table I, our method also achieves a much smaller variance in the PSNR of the images rendered from different viewpoints. For example, the PSNR variance of the rendered images for the cabins reconstructed using Online FT, Online RS and our method is 32.55 dB, 15.44 dB and 4.78 dB, respectively, indicating that the quality of our method is more even across viewpoints. Further, for the average path length in the smaller scenes, our NeurAR traverses 43.27 m while Online RS traverses about 70.24 m; in the larger scene, our NeurAR traverses 907.60 m while Online RS traverses about 1329.75 m.

For the models reconstructed with TSDF (TSDF FT, TSDF RS, TSDF RRT), we render images via volume rendering. In the images in the last three columns of Fig. 6, the surfaces exhibit many holes; in addition, the surfaces of the reconstructed models are rugged due to the noise in the depth images. Though NeurAR uses depth too, it depends on the depth images to accelerate convergence only at the early stage of training and decreases their effect after coarse structures have been learned. The finer geometry is acquired by multiview image supervision. NeurAR fills the holes seen in TSDF RRT and provides finer details. It outperforms TSDF RRT in image quality, geometry quality and path length (43.27 m vs 57.39 m). Though post-processing can be applied to extract finer meshes for the reconstructions using TSDF and to obtain smoother images than those we render from TSDF directly, denoising and filling the holes are non-trivial tasks, particularly in scenes having complex structures.

C. Ablation study

Training Iterations between Steps The number of NeRF optimization iterations allowed for planning a view step affects the final results. We run our NeurAR system in the cabin scene with different numbers of iterations and compare the PSNR of the rendered images of the reconstructed models. The PSNR for the models optimized using 300, 700, and 1400 iterations is 26.16, 28.58, and 26.91, respectively. This is because too few iterations may lead to the uncertainty not being optimized well, while too many steps may lead to overfitting to the images already added.

Depth noise We construct variants of NeurAR using different noise magnitudes, from no noise and the noise magnitude of the L515 to double (2 × µ(z), 2 × σ(z)) and triple (3 × µ(z), 3 × σ(z)) the noise magnitude of the L515. Table II shows the metrics measuring the cabin models reconstructed by these variants: NeurAR can be accelerated with very noisy depth at the cost of only a minor drop in reconstruction quality. This robustness verifies the effect of dampening the depth supervision during training, described in Section III-C.

Number of views To further study the influence, we set the maximum allowed views from 18 to 58 for the cabin scene and