MESHMVS: MULTI-VIEW STEREO GUIDED MESH RECONSTRUCTION
Rakesh Shrestha1 , Zhiwen Fan2 , Siyu Zhu2 , Zuozhuo Dai2 , Qingkun Su2 , Ping Tan1
Simon Fraser University1 , Alibaba A.I Labs2
{rakeshs,pingtan}@sfu.ca, {waynefan.fzw,siting.zsy,zuozhuo.dzz,qingkun.sqk}@alibaba-inc.com
ABSTRACT
Deep learning based 3D shape generation methods generally utilize latent features
extracted from color images to encode the objects’ semantics and guide the shape
generation process. These color image semantics only implicitly encode 3D
information, potentially limiting the accuracy of the generated shapes. In this paper
we propose a multi-view mesh generation method which incorporates geometry
information in the color images explicitly by using the features from intermediate
2.5D depth representations of the input images and regularizing the 3D shapes
against these depth images. Our system first predicts a coarse 3D volume from the
color images by probabilistically merging voxel occupancy grids from individual
views. Depth images corresponding to the multi-view color images are predicted; these, along with
the rendered depth images of the coarse shape, form a contrastive input whose features guide the
refinement of the coarse shape through a series of graph convolution networks. An attention-based
multi-view feature pooling module is proposed to fuse the contrastive depth features from different
viewpoints before they are fed to the graph convolution networks.
We validate the proposed multi-view mesh generation method on ShapeNet, where
we obtain a significant improvement: a 34% decrease in Chamfer distance to
ground truth and a 14% increase in F1-score compared with the state-of-the-art
multi-view shape generation method.
1 INTRODUCTION
3D shape generation is a long-standing research problem in computer vision and computer graphics
with applications in autonomous driving, augmented reality, etc. Conventional approaches mainly
leverage multi-view geometry based on stereo correspondences between images but are restricted
by the coverage provided by the input views. With the availability of large-scale 3D shape datasets
and the success of deep learning in several computer vision tasks, 3D representations such as voxel
grid Choy et al. (2016); Tulsiani et al. (2017); Yan et al. (2016) and point cloud Yang et al. (2018);
Fan et al. (2017) have been explored for single-view 3D reconstruction. Among them, triangle mesh
representation has received the most attention as it has various desirable properties for a wide range
of applications and is capable of modeling detailed geometry without high memory requirement.
Single-view 3D reconstruction methods Wang et al. (2018); Huang et al. (2015); Kar et al. (2015);
Su et al. (2014) generate the 3D shape from merely a single color image but suffer from occlusion
and limited visibility which leads to low quality reconstructions in the unseen areas. Multi-view
methods Wen et al. (2019); Choy et al. (2016); Kar et al. (2017); Gwak et al. (2017) extend the
input to images from different viewpoints which provides more visual information and improves the
accuracy of the generated shapes. Recent work in multi-view mesh reconstruction Wen et al. (2019)
introduces a multi-view deformation network using perceptual features from each color image for
refining the meshes generated by Pixel2Mesh Wang et al. (2018). Although promising results were
obtained, this method relies on perceptual features from color images which do not explicitly encode
the objects’ geometry and could restrict the accuracy of the 3D models.
Figure 1: Architecture of the proposed method. The voxel grid prediction module predicts a coarse
voxel grid representation which is further refined by a series of GCNs. The GCNs use contrastive
depth features from rendered depths of the current shape and the predicted depths from MVSNet.
Multi-view features are pooled using a multi-head attention mechanism.

In this work, we present a novel multi-view mesh generation method. We start by predicting coarse
volumetric occupancy grid representations for the color images of each input viewpoint independently
using a shared fully convolutional network; these are merged into a single voxel grid in a probabilistic
fashion, followed by a cubify operation Gkioxari et al. (2019) that converts it into a triangle mesh.
We then use Graph Convolutional Networks (GCN) Scarselli et al. (2008); Wang et al. (2018)
to fine-tune the cubified voxel grid in a coarse-to-fine manner. The GCN refines the coarse mesh by
using the feature vector of each graph node (mesh vertices) obtained by projecting the vertices on
the 2D contrastive depth features. The contrastive depth features are extracted from the rendered
depth maps of the current mesh and predicted depth maps from a multi-view stereo network. We also
propose an attention-based method to fuse features from multiple views that learns the importance
of different views for each mesh vertex. Constraints between the intermediate refined meshes from
the GCN and the predicted depth maps of different viewpoints further improve the final mesh quality.
By employing multi-view voxel grid generation and refining it using geometry information from both
the current mesh (through the rendered depth maps) and predicted depth maps, we are able to generate
high-quality meshes. We validate our method on the ShapeNet Chang et al. (2015) benchmark and
our method achieves the best performance among all previous multi-view and single-view mesh
generation methods.
2 RELATED WORK
2.1 TRADITIONAL SHAPE GENERATION METHODS
3D model generation has traditionally been tackled using multi-view geometry principles. Among
them, structure-from-motion (SfM) Schonberger & Frahm (2016) and simultaneous localization
and mapping (SLAM) Cadena et al. (2016) are popular techniques that perform 3D reconstruction
and camera pose estimation at the same time. Closer to our problem setup, multi-view stereo
methods infer 3D geometry from images with known camera parameters. Volumetric methods Kar
et al. (2017); Kutulakos & Seitz (2000); Seitz & Dyer (1999) predict voxel grid representation of
objects by estimating the relationship between each voxel and object surfaces. Point cloud based
methods Furukawa & Ponce (2009); Lhuillier & Quan (2005) start with a sparse point cloud and
gradually increase the density of points to obtain a final dense point cloud of the object. Durou et al.
(2008); Zhang et al. (1999); Favaro & Soatto (2005) reason about shading, texture and defocus cues
on the visible parts of the object to infer its 3D geometry. While the results of these works are
impressive in terms of quality and completeness of reconstruction, they still struggle with poorly
textured and reflective surfaces and require carefully selected input views.
2.2 DEEP SHAPE GENERATION METHODS
Deep learning based approaches can learn to infer 3D structure from training data and can be robust
against poorly textured and reflective surfaces as well as limited and arbitrarily selected input views.
These methods can be categorized into single view and multi-view methods. Huang et al. (2015);
Su et al. (2014) use shape component retrieval and deformation from a large dataset for single-view
3D shape generation. Kurenkov et al. (2018) extend this idea by introducing free-form deformation
networks on retrieved object templates from a database. Some works learn shape deformation from
ground truth foreground masks of 2D images Kar et al. (2015); Yan et al. (2016); Tulsiani et al.
(2017). Recurrent Neural Networks (RNN) based methods Choy et al. (2016); Kar et al. (2017);
Gwak et al. (2017) are another popular solution to this problem. Gwak et al. (2017); Lin et al.
(2019) introduce image silhouettes along with adversarial multi-view constraints and optimize object
mesh models using multi-view photometric constraints. Predicting mesh directly from color images
was proposed in Wang et al. (2018); Wickramasinghe et al. (2019); Pan et al. (2019); Wen et al.
(2019); Gkioxari et al. (2019); Tang et al. (2019). DR-KFS Jin et al. (2019) introduces a differentiable
visual similarity metric while SeqXY2SeqZ Han et al. (2020) represents 3D shapes using a set of 2D
voxel tubes for shape reconstruction. Front2Back Yao et al. (2020) generates 3D shapes by fusing
predicted depth and normal images and DV-Net Jia et al. (2020) predicts dense object point clouds
using dual-view RGB images with a gated control network to fuse point clouds from the two views.
2.3 DEPTH ESTIMATION
Compared to 3D shape generation, depth prediction is an easier problem formulation since it simplifies
the task to per-view depth map estimation. Deep learning based multi-view stereo depth estimation
was first introduced in Hartmann et al. (2017) where a learned cost metric is used to estimate patch
similarities. DeepMVS Huang et al. (2018) warps multi-view images to 3D space and then applies
deep networks for regularization and aggregation to estimate depth images. Learned 3D cost volume
based depth prediction was proposed in MVSNet Yao et al. (2018) where a 3 dimensional cost volume
is built using homographically warped 2D features from multi-view images and 3D CNNs are used
for cost regularization and depth regression. This idea was further extended by Chen et al. (2019);
Luo et al. (2019); Gu et al. (2019); Yao et al. (2019).
3 METHODOLOGY
Figure 1 shows the architecture of the proposed system which takes as input multi-view color images
of an object with known poses and outputs a triangle mesh representing the surface of the object.
3.1 MULTI-VIEW VOXEL GRID PREDICTION
Single-view Voxel Grid Prediction The single-view voxel branch consists of a ResNet feature
extractor and a fully convolutional voxel grid prediction network. It generates the coarse initial shape
of an object from one viewpoint as a voxel occupancy grid using a color image. Here, we set the
resolution of the generated voxel occupancy grid to 32 × 32 × 32. The voxel prediction networks for
all viewpoints share the same weights.
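To make the shared-weight design concrete, the sketch below shows a minimal single-view voxel branch in PyTorch; the backbone depth, the head layout, and the interpretation of the channel dimension as the depth axis of the 32 × 32 × 32 grid are illustrative assumptions, not the exact architecture used here.

```python
import torch
import torch.nn as nn
import torchvision

class SingleViewVoxelBranch(nn.Module):
    """Illustrative single-view voxel predictor: ResNet features -> 32^3 occupancy logits.

    The backbone depth and head layout are assumptions, not the paper's exact specification.
    """
    def __init__(self, grid_size=32):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])   # B x 512 x 7 x 7 for 224 input
        # Map the 2D feature map to grid_size depth slices, then upsample each slice to grid_size^2.
        self.head = nn.Sequential(
            nn.Conv2d(512, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, grid_size, 1),                                 # B x D x 7 x 7
            nn.Upsample(size=(grid_size, grid_size), mode="bilinear", align_corners=False),
        )

    def forward(self, image):                                             # image: B x 3 x 224 x 224
        return self.head(self.features(image))                           # B x 32 x 32 x 32 occupancy logits

# The same network (shared weights) is applied independently to every input view.
voxel_net = SingleViewVoxelBranch()
views = torch.rand(3, 3, 224, 224)                                       # 3 viewpoints of one object
per_view_logits = torch.stack([voxel_net(v.unsqueeze(0)) for v in views])  # 3 x 1 x 32 x 32 x 32
```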
Probabilistic Occupancy Grid Merging Voxel occupancy grid predicted from a single viewpoint
suffers from occlusion and limited visibility. In order to fuse voxel grids from different viewpoints,
we propose a probabilistic occupancy grid merging method which merges the voxel grids from each
input viewpoint probabilistically to obtain the final voxel grid output. This allows occluded regions
in one view to be estimated from other views where those regions are visible, and increases the
confidence of the prediction in overlapping regions. The occupancy probability of each voxel is
represented by p(x), which is converted to log-odds (logit):

l(x) = \log \frac{p(x)}{1 - p(x)}    (1)
A Bayesian update on the probabilities reduces to a simple summation of log-odds Konolige (1997).
Hence, the multi-view log-odds of a voxel is given by:

l(x) = l_1(x) + l_2(x) + \dots + l_n(x)    (2)

where l_i is the voxel's log-odds in view i and n is the number of input views. The final voxel
probability p(x) is obtained by applying the inverse of Equation (1), which is a sigmoid function.
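The merging step can be implemented in a few lines; the sketch below assumes the per-view occupancy grids have already been resampled into a common global frame (the resampling itself is sketched in the appendix).

```python
import torch

def merge_occupancy_grids(per_view_probs, eps=1e-6):
    """Probabilistically merge per-view voxel occupancy grids (Eq. 1-2).

    per_view_probs: (n_views, D, H, W) occupancy probabilities in (0, 1),
    assumed to be already aligned in a common global frame.
    """
    p = per_view_probs.clamp(eps, 1.0 - eps)
    log_odds = torch.log(p / (1.0 - p))        # Eq. (1): per-view logits l_i(x)
    merged_log_odds = log_odds.sum(dim=0)      # Eq. (2): Bayesian update = sum of log-odds
    return torch.sigmoid(merged_log_odds)      # inverse of Eq. (1) gives the merged probability

# Example: three 32^3 grids from three views.
probs = torch.rand(3, 32, 32, 32)
merged = merge_occupancy_grids(probs)          # (32, 32, 32), higher where the views agree on occupancy
```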
3.2 MESH REFINEMENT
The cubified mesh from the voxel branch only provides a coarse reconstruction of the object’s
surface. We apply graph convolutional networks which represent each mesh vertex as a graph node
and deform the vertices to more accurate positions.
GCN-based Mesh Deformation The features pooled from the multi-view images, along with the 3D
coordinates of the vertices in the world frame, are used as the features of the graph nodes. A series of
Graph Convolutional Network (GCN) blocks is applied to deform the mesh at the current stage to the
next stage, starting with the cubified voxel grid. A graph convolution deforms mesh vertices by
propagating features from neighboring vertices:

f_i' = \mathrm{ReLU}\Big(W_0 f_i + \sum_{j \in N(i)} W_1 f_j\Big)

where N(i) is the set of neighboring vertices of the i-th vertex in the mesh, f denotes the feature
vector of a vertex, and W_0 and W_1 are learnable parameters of the model. Each GCN block applies
several graph convolutions to transform the vertex features, followed by a final vertex refinement
operation where the features and vertex coordinates are further transformed as

v_i' = v_i + \tanh(W_{\mathrm{vert}} [f_i; v_i])

where the matrix W_{\mathrm{vert}} is another learnable parameter, to obtain the deformed mesh.
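The sketch below illustrates one such GCN block; the dense adjacency matrix, the hidden dimension and the number of graph convolutions per block are illustrative choices, not the exact configuration used here.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """f_i' = ReLU(W0 f_i + sum_{j in N(i)} W1 f_j), with neighborhoods given by an adjacency matrix."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w0 = nn.Linear(in_dim, out_dim, bias=False)
        self.w1 = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, feats, adj):              # feats: V x C, adj: V x V (0/1, no self-loops)
        return torch.relu(self.w0(feats) + adj @ self.w1(feats))

class GCNBlock(nn.Module):
    """Several graph convolutions followed by the vertex refinement v_i' = v_i + tanh(W_vert [f_i; v_i])."""
    def __init__(self, feat_dim, hidden_dim, num_convs=3):
        super().__init__()
        dims = [feat_dim] + [hidden_dim] * num_convs
        self.convs = nn.ModuleList(GraphConv(a, b) for a, b in zip(dims[:-1], dims[1:]))
        self.w_vert = nn.Linear(hidden_dim + 3, 3)

    def forward(self, verts, feats, adj):       # verts: V x 3, feats: V x feat_dim
        for conv in self.convs:
            feats = conv(feats, adj)
        offsets = torch.tanh(self.w_vert(torch.cat([feats, verts], dim=1)))
        return verts + offsets, feats           # deformed vertices and updated features

# Example: a toy mesh with 4 vertices whose node features are pooled depth features + coordinates.
verts = torch.rand(4, 3)
adj = torch.tensor([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]], dtype=torch.float32)
feats = torch.cat([torch.rand(4, 480), verts], dim=1)        # 480-d pooled features + 3-d coordinates
new_verts, new_feats = GCNBlock(feat_dim=483, hidden_dim=128)(verts, feats, adj)
```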
Contrastive Depth Feature Extraction Yao et al. (2020) demonstrate that using intermediate,
image-centric 2.5D representations instead of directly generating 3D shapes in the global frame from
raw 2D images can improve 3D reconstruction quality. We therefore propose to formulate the features
for the graph nodes using 2.5D depth maps as additional inputs alongside the RGB features.
Specifically, we render the mesh at each GCN stage to depth images at all the input views using
Kato et al. (2018) and use them along with the predicted depths for depth feature extraction. We call
this form of depth input contrastive depth as it contrasts the rendered depths of the current
mesh against the predicted depths and allows the network to reason about the deformation better than
when using predicted depth or color images alone. Given the 2D features, corresponding feature
vectors of individual vertices can be found by projecting the 3D vertex coordinates to the feature
planes using known camera parameters. We use VGG-16 Simonyan & Zisserman (2014) as our
contrastive depth feature extraction network.
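As a concrete illustration of this step, the sketch below concatenates a rendered and a predicted depth map of one view (the Input Concatenation variant evaluated in Section 4.3), extracts a 2D feature map, and bilinearly samples a feature vector for each vertex at its projected image location. The small convolutional stand-in for VGG-16 and the pinhole projection conventions are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the VGG-16 contrastive depth feature extractor: 2-channel input
# (rendered depth + predicted depth concatenated) -> C-channel feature map.
feature_net = nn.Sequential(nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 64, 3, padding=1))

def vertex_features(verts_world, K, R, t, rendered_depth, predicted_depth):
    """Sample per-vertex features from the contrastive depth feature map of one view.

    verts_world: V x 3; K: 3 x 3 intrinsics; R: 3 x 3, t: 3 extrinsics (world -> camera).
    rendered_depth, predicted_depth: 1 x 1 x H x W depth images of the same view.
    """
    contrastive = torch.cat([rendered_depth, predicted_depth], dim=1)    # 1 x 2 x H x W
    fmap = feature_net(contrastive)                                      # 1 x C x H x W

    cam = verts_world @ R.T + t                                          # V x 3, camera coordinates
    uv = cam @ K.T                                                       # perspective projection
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)                          # V x 2 pixel coordinates

    h, w = fmap.shape[-2:]                                               # normalize to [-1, 1] for grid_sample
    grid = torch.stack([2 * uv[:, 0] / (w - 1) - 1, 2 * uv[:, 1] / (h - 1) - 1], dim=1)
    grid = grid.view(1, 1, -1, 2)                                        # 1 x 1 x V x 2
    sampled = F.grid_sample(fmap, grid, mode="bilinear", align_corners=True)
    return sampled[0, :, 0].T                                            # V x C per-vertex features
```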
Multi-View Depth Estimation We extend MVSNet Yao et al. (2018) and predict the depth maps of
all views since the original implementation predicts depth of only one reference view. This is achieved
by transforming the feature volumes to each view’s coordinate frame using homography warping and
applying identical cost volume regularization and depth regression on each view. A detailed network
architecture diagram of this module is provided in the appendix.
Attention-based Multi-View Feature Pooling In order to fuse multi-view contrastive depth features,
we formulate an attention module by adapting the multi-head attention mechanism originally designed
for sequence-to-sequence machine translation with the transformer (encoder-decoder) architecture
Vaswani et al. (2017). In a transformer architecture, the encoder hidden state is mapped to
lower-dimensional key-value pairs (K, V) while the decoder hidden state is mapped to a query vector
Q using independent fully connected layers. The encoder hidden state in our case is the multi-view
features while the decoder hidden state is the mean of the multi-view features. The attention weights
are computed using the scaled dot-product:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{N}}\right)V    (3)

where N is the number of input views.
Multiple attention heads are used, which are concatenated and transformed to obtain the final output:

\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)    (4)

\mathrm{MultiHead}(Q, K, V) = [\mathrm{head}_1; \dots; \mathrm{head}_h]W^O    (5)

where the W matrices are parameters to be learned, h is the number of attention heads and i ∈ [1, h].
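A minimal sketch of this pooling module is given below: the per-vertex multi-view features serve as keys and values, their mean serves as the query, and the outputs of several heads are concatenated as in Equations (4)-(5). The 4800-dimensional input, 480-dimensional output and 5 heads follow Section 4.1; the head dimension and the rest of the layer shapes are assumptions.

```python
import torch
import torch.nn as nn

class MultiViewAttentionPool(nn.Module):
    """Pool per-vertex features from N views into one vector per vertex with multi-head attention."""
    def __init__(self, feat_dim=4800, out_dim=480, head_dim=96, num_heads=5):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        self.to_q = nn.Linear(feat_dim, num_heads * head_dim)
        self.to_k = nn.Linear(feat_dim, num_heads * head_dim)
        self.to_v = nn.Linear(feat_dim, num_heads * head_dim)
        self.out = nn.Linear(num_heads * head_dim, out_dim)

    def forward(self, view_feats):                        # view_feats: V x N x feat_dim
        v_count, n_views, _ = view_feats.shape
        query = view_feats.mean(dim=1, keepdim=True)      # decoder state = mean over views, V x 1 x D

        def split(x):                                     # -> V x heads x seq x head_dim
            return x.view(v_count, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.to_q(query)), split(self.to_k(view_feats)), split(self.to_v(view_feats))
        scores = (q @ k.transpose(-2, -1)) / (n_views ** 0.5)   # Eq. (3): scaled by sqrt(N views)
        attn = scores.softmax(dim=-1)                           # V x heads x 1 x N view weights
        pooled = attn @ v                                       # V x heads x 1 x head_dim
        pooled = pooled.transpose(1, 2).reshape(v_count, -1)    # concatenate heads, Eq. (5)
        return self.out(pooled)                                 # V x out_dim fused features

# Example: 4800-d contrastive depth features from 3 views for 100 mesh vertices.
pool = MultiViewAttentionPool()
fused = pool(torch.rand(100, 3, 4800))                          # 100 x 480 fused features
```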
Figure 2: Attention weights visualization. From left to right: input images from 3 viewpoints,
corresponding ground truth point clouds color-coded by their view order and the predicted mesh
vertices color-coded by the attention weights of the views. Only the view with the maximum attention
weight is visualized for each predicted point, for clarity.
We choose multi-head attention as our feature pooling method since it allows the model to attend to
information from different representation subspaces of the features by training multiple attention
heads in parallel. This method is also invariant to the order and number of input views. We visualize
the learned attention weights (averaged over the attention heads) in Figure 2, where we can observe
that the attention weights roughly take into account the visibility/occlusion information from each view.
3.3 LOSS FUNCTIONS
Mesh losses The losses derived from Wang et al. (2018) that constrain the mesh predicted by each
GCN block (P) to resemble the ground truth (Q) include the Chamfer distance

L_{\mathrm{chamfer}}(P, Q) = |P|^{-1} \sum_{(p,q) \in \Lambda_{P,Q}} ||p - q||^2 + |Q|^{-1} \sum_{(q,p) \in \Lambda_{Q,P}} ||q - p||^2

and the surface normal loss

L_{\mathrm{normal}}(P, Q) = -|P|^{-1} \sum_{(p,q) \in \Lambda_{P,Q}} |u_p \cdot u_q| - |Q|^{-1} \sum_{(q,p) \in \Lambda_{Q,P}} |u_q \cdot u_p|,

with additional regularization in the form of the edge length loss

L_{\mathrm{edge}}(V, E) = \frac{1}{|E|} \sum_{(v,v') \in E} ||v - v'||^2

for visually appealing results.
Depth loss Our depth prediction network is supervised using the adaptive reversed Huber loss
(also known as the BerHu criterion) Lambert-Lacroix & Zwald (2016):

L_{\mathrm{depth}} = |x| \ \text{if} \ |x| \le c, \quad \text{otherwise} \ \frac{x^2 + c^2}{2c}

Contrastive depth loss The BerHu loss is also applied between the rendered depth images at the
different GCN stages and the predicted depth images:

L_{\mathrm{contrastive}} = |x| \ \text{if} \ |x| \le c, \quad \text{otherwise} \ \frac{x^2 + c^2}{2c}

Voxel loss The binary cross-entropy loss between the predicted voxel occupancy probabilities p(x)
and the ground truth occupancies y(x) is used to supervise the voxel predictions:

L_{\mathrm{voxel}} = -\big(y(x) \log p(x) + (1 - y(x)) \log(1 - p(x))\big)

Final loss We use the weighted sum of the individual losses discussed above as the final loss to train
our model in an end-to-end fashion:

L = \lambda_{\mathrm{chamfer}} L_{\mathrm{chamfer}} + \lambda_{\mathrm{normal}} L_{\mathrm{normal}} + \lambda_{\mathrm{edge}} L_{\mathrm{edge}} + \lambda_{\mathrm{depth}} L_{\mathrm{depth}} + \lambda_{\mathrm{contrastive}} L_{\mathrm{contrastive}} + \lambda_{\mathrm{voxel}} L_{\mathrm{voxel}}

where L is the final loss term.
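A minimal sketch of the BerHu criterion shared by the depth and contrastive depth losses is given below; setting the threshold c adaptively to a fraction of the largest residual in the batch is a common convention and an assumption here, since the exact choice of c is not specified above.

```python
import torch

def berhu_loss(pred, target, c=None):
    """Reversed Huber (BerHu): |x| below the threshold c, (x^2 + c^2) / (2c) above it."""
    diff = (pred - target).abs()
    if c is None:
        # Adaptive threshold: a fixed fraction of the largest residual (a common convention; assumed here).
        c = 0.2 * diff.max().clamp(min=1e-6)
    quadratic = (diff ** 2 + c ** 2) / (2 * c)
    return torch.where(diff <= c, diff, quadratic).mean()

# Example: supervising a predicted 56x56 depth map against its ground truth.
pred_depth, gt_depth = torch.rand(1, 1, 56, 56), torch.rand(1, 1, 56, 56)
loss = berhu_loss(pred_depth, gt_depth)
```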
4 EXPERIMENTS
4.1 EXPERIMENTAL SETUP
Comparisons We evaluate the proposed method against various multi-view shape generation
methods. The state-of-the-art method is Pixel2Mesh++ Wen et al. (2019) (referred to as P2M++). Wen
et al. (2019) also provide a baseline by directly extending Pixel2Mesh Wang et al. (2018) to operate on
multi-view images (referred to as MVP2M) using their statistical feature pooling method to aggregate
features from multiple color images. Results from additional multi-view shape generation baselines
3D-R2N2 Choy et al. (2016) and LSM Kar et al. (2017) are also reported.
Figure 3: Qualitative evaluation on ShapeNet dataset. From top to bottom: one of the input images,
ground truth mesh, multi-view extended Pixel2Mesh, Pixel2Mesh++, and ours. Our predictions are
closer to the actual shape, especially for the objects with more complex topologies.
Dataset We evaluate our method against the state-of-the-art methods on the dataset from Choy et al.
(2016) which is a subset of ShapeNet Chang et al. (2015) and has been widely used by recent 3D shape
generation methods. It contains 50K 3D CAD models from 13 categories. Each model is rendered
with a transparent background from 24 randomly chosen camera viewpoints to obtain color images.
The corresponding camera intrinsics and extrinsics are provided in the dataset. Since the dataset does
not contain depth images, we render them using a custom depth renderer at the same viewpoints as
the color images and with the same camera intrinsics. We follow the training/testing/validation split
of Gkioxari et al. (2019).
Implementation For the depth prediction module, we follow the original MVSNet Yao et al. (2018)
implementation. The output depth dimensions are reduced by a factor of 4, to 56×56 from the 224×224
input images. The number of depth hypotheses is chosen as 48, which offers a balance between
accuracy and running/training time efficiency. These depth hypotheses represent values from 0.1 m
to 1.3 m at an interval of 25 mm. These values were chosen based on the range of depths present in
the dataset.
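As a quick sanity check of these numbers, 48 hypotheses spanning 0.1 m to 1.3 m at 25 mm spacing follow from (1.3 − 0.1)/0.025 = 48; whether both endpoints are included is an implementation detail left open in the sketch below.

```python
import torch

# 48 fronto-parallel depth hypotheses between 0.1 m and 1.3 m at 25 mm spacing:
# (1.3 - 0.1) / 0.025 = 48 steps. Starting at 0.1 m (endpoint handling is an assumption).
depth_hypotheses = 0.1 + 0.025 * torch.arange(48)   # 0.100, 0.125, ..., 1.275 m
```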
The hierarchical features obtained from the contrastive depth feature extractor have a total of 4800
dimensions for each view. The aggregated multi-view features are compressed to 480 dimensions
after attentive feature pooling. 5 attention heads are used for merging the multi-view features.
The loss function weights are set as λchamfer = 1, λnormal = 1.6 × 10−4 , λdepth = 0.1, λcontrastive =
0.001 and λvoxel = 1. Two settings of λedge were used: λedge = 0 (referred to as Best), which gives better
quantitative results, and λedge = 0.2 (referred to as Pretty), which gives better qualitative results.
Training and Runtime The network is optimized using Adam optimizer with a learning rate of
10−4 . The training is done on 5 Nvidia RTX-2080 GPUs with effective batch size 5. The depth
prediction network (MVSNet) is trained independently for 30 epochs. Then the whole system is
trained for another 40 epochs with the weights of the MVSNet frozen. Our system is implemented in
the PyTorch deep learning framework and takes around 60 hours to train.
Evaluation Metric Following Wang et al. (2018); Wen et al. (2019), we use the F1-score as our
evaluation metric. The F1-score is the harmonic mean of precision and recall, where precision/recall
are calculated as the percentage of points in the prediction/ground truth that can find a nearest
neighbor in the other within a threshold τ. Two values of τ are used: 10^{-4} and 2 × 10^{-4} m^2.
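A small sketch of this metric is shown below; it compares squared nearest-neighbor distances against τ, consistent with thresholds expressed in m^2, and assumes both shapes are represented as sampled point sets.

```python
import torch

def f1_score(pred_points, gt_points, tau=1e-4):
    """F1-score between two point sets with squared-distance threshold tau (in m^2)."""
    d2 = torch.cdist(pred_points, gt_points) ** 2                  # P x G squared distances
    precision = (d2.min(dim=1).values <= tau).float().mean()       # predicted points with a GT neighbor
    recall = (d2.min(dim=0).values <= tau).float().mean()          # GT points with a predicted neighbor
    return 2 * precision * recall / (precision + recall + 1e-8)

# Example with points sampled from the predicted and ground truth surfaces.
pred, gt = torch.rand(10000, 3), torch.rand(10000, 3)
print(f1_score(pred, gt, tau=1e-4), f1_score(pred, gt, tau=2e-4))
```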
4.2 COMPARISON WITH PREVIOUS MULTI-VIEW SHAPE GENERATION METHODS
We quantitatively compare our method against previous works for multi-view shape generation
in Table 1 and show the effectiveness of our method in improving shape quality. Our method
outperforms the state-of-the-art method Pixel2Mesh++ Wen et al. (2019), with a 34% decrease in
Chamfer distance to ground truth and a 15% increase in F1-score at threshold τ. Note that in Table 1
the same model is trained for all the categories, but accuracy is evaluated on individual categories as
well as averaged over the categories. We provide the Chamfer distances in the appendix.
                                  F-score (τ) ↑                              F-score (2τ) ↑
Category     3D-R2N2    LSM   MVP2M   P2M++    Ours    Ours   3D-R2N2    LSM   MVP2M   P2M++    Ours    Ours
                                              (pretty) (best)                                  (pretty) (best)
Couch          45.47   43.02   53.17   57.56   71.63   73.63    59.97   55.49   73.24   75.33   85.28   88.24
Cabinet        54.08   50.80   56.85   65.72   75.91   76.39    64.42   60.72   76.58   81.57   87.61   88.84
Bench          44.56   49.33   60.37   66.24   81.11   83.76    62.47   65.92   75.69   79.67   90.56   92.57
Chair          37.62   48.55   54.19   62.05   77.63   78.69    54.26   64.95   72.36   77.68   88.24   90.02
Monitor        36.33   43.65   53.41   60.00   74.14   76.64    48.65   56.33   70.63   75.42   86.04   88.89
Firearm        55.72   56.14   79.67   80.74   92.92   94.32    76.79   73.89   89.08   89.29   96.81   97.67
Speaker        41.48   45.21   48.90   54.88   66.02   67.83    52.29   56.65   68.29   71.46   79.76   82.34
Lamp           32.25   45.58   50.82   62.56   72.47   75.93    49.38   64.76   65.72   74.00   82.00   85.33
Cellphone      58.09   60.11   66.07   74.36   85.57   86.45    69.66   71.39   82.31   86.16   93.40   94.28
Plane          47.81   55.60   75.16   76.79   89.23   92.13    70.49   76.39   86.38   86.62   94.65   96.57
Table          48.78   48.61   65.95   71.89   82.37   83.68    62.67   62.22   79.96   84.19   90.24   91.97
Car            59.86   51.91   67.27   68.45   77.01   80.43    78.31   68.20   84.64   85.19   88.99   92.33
Watercraft     40.72   47.96   61.85   62.99   75.52   80.48    63.59   66.95   77.49   77.32   86.77   90.35
Mean           46.37   49.73   61.05   66.48   78.58   80.80    62.53   64.91   77.10   80.30   88.49   90.72
Table 1: Quantitative comparison against state-of-the-art multi-view shape generation methods. We
report the F-score on each semantic category along with the mean over all categories, using two
thresholds τ and 2τ for the nearest neighbor match, where τ = 10^{-4} m^2.
We also provide visual results for qualitative assessment of the generated shapes by our Pretty model
in Figure 3 which shows that it is able to more accurately predict topologically diverse shapes.
4.3 ABLATION STUDIES
Contrastive Depth Feature Extraction We evaluate several methods for contrastive feature extraction (Sub-section 3.2). These methods are 1) Input Concatenation: using the concatenated
rendered and predicted depth maps as input to the VGG feature extractor, 2) Input Difference: using
the difference of the two depth maps as input to VGG, 3) Feature Concatenation: concatenating
features from rendered and predicted depths extracted by shared VGG, 4) Feature Difference: using
difference of the features from the two depth maps extracted by shared VGG, and 5) None: using the
VGG features from the predicted depths only. The quantitative results are summarized in Table 2 and
show that the Input Concatenation method produces better results than the other formulations.
                              F1-τ    F1-2τ
(1) Input Concatenation       80.80   90.72
(2) Input Difference          80.41   90.54
(3) Feature Concatenation     80.45   90.54
(4) Feature Difference        80.30   90.40
(5) None                      79.40   89.95

Table 2: Comparisons of different contrastive depth formulations. In the 1st and 2nd rows, the
concatenation and difference of the rendered and predicted depths are fed to the VGG feature
extractor, while in the 3rd and 4th rows, the concatenation and difference of the VGG features from
the two depths are used for mesh refinement. Row 5 (None) uses VGG features from the predicted
depth only.
Attention Module In the 5th row and 6th row of Table 3, we present the performance of the
proposed attention method against statistical feature pooling Wen et al. (2019) and a simpler attention
mechanism Hu et al. (2020); Yang et al. (2020) where the pooled features are simply the weighted sum
of the multi-view features. We find that the three methods perform similarly on our final architecture,
but the multi-head attention method performs better on more light-weight architectures.
Contrastive Depth Losses We also evaluate the effect of using additional regularization from
contrastive depth losses: rendered depth vs. predicted depth and rendered depth vs. ground truth
depth in the 2nd, 3rd and 4th rows of Table 3, which show that introducing the additional loss terms to
constrain the refined meshes improves the accuracy of the generated shapes.
Ground truth depth as input In row 7 we use ground truth depths instead of predicted depths, which
gives an upper bound on our mesh prediction accuracy in relation to the depth prediction accuracy.
Sphere initialization Row 8 uses a sphere as the coarse shape instead of the cubified voxel grid.
Naive multi-view Mesh R-CNN In row 9 of Table 3 we extend Mesh R-CNN Gkioxari et al.
(2019) to multi-view using the statistical feature pooling method proposed in Wen et al. (2019) for
mesh refinement, while in row 10 we further extend their single-view voxel grid prediction method to
our probabilistic multi-view voxel grid prediction.
                                                                              F1-τ    F1-2τ
(1) Baseline framework                                                        79.82   90.18
(2) Baseline + rendered vs predicted depth loss (final model)                 80.80   90.72
(3) Baseline + rendered vs GT depth loss                                      80.35   90.55
(4) Baseline + rendered vs predicted depth loss + rendered vs GT depth loss   80.45   90.56
(5) Baseline with stats pooling                                               79.63   90.10
(6) Baseline with simple attention                                            80.03   90.21
(7) Baseline with GT depth                                                    84.58   92.86
(8) Sphere initialization                                                     73.78   85.49
(9) Naive multi-view Mesh R-CNN (single-view voxel prediction)                72.74   84.99
(10) Naive multi-view Mesh R-CNN (multi-view voxel prediction)                76.97   88.24

Table 3: Comparison of shape generation accuracy with different settings of additional contrastive
depth losses and multi-view feature pooling. The Baseline framework uses the multi-head attention
mechanism without any contrastive depth losses.
Number of Views We test the performance of our framework with respect to the number of views.
Table 4 shows that the accuracy of our method increases as we increase the number of input views
for training. These experiments also validate that the attention-based feature pooling can efficiently
encode features from different views to take advantage of a larger number of views.
Table 5 shows the results when using different numbers of views during testing on our model trained
with 3 views, which indicates that increasing the number of views during testing does not improve the
accuracy, while decreasing the number of views causes a significant drop in accuracy.
Metric     2       3       4       5       6
F1-τ     73.60   80.80   82.61   83.76   84.25
F1-2τ    85.80   90.72   91.78   92.73   93.14

Table 4: Accuracy w.r.t. the number of views during training. The evaluation was performed on the
same number of views as training.

Metric     2       3       4       5       6
F1-τ     72.46   80.80   80.98   80.94   80.85
F1-2τ    84.49   90.72   91.03   91.16   91.20

Table 5: Accuracy w.r.t. the number of views during testing. The same model trained with 3 views
was used in all of the cases.
5 CONCLUSION
We propose a neural network based solution to predict 3D triangle mesh models of objects from
images taken from multiple views. First, we propose a multi-view voxel grid prediction module which
probabilistically merges voxel grids predicted from the individual input views. We then cubify the
merged voxel grid into a triangle mesh and apply graph convolutional networks to further refine the
mesh. The features for the mesh vertices are extracted from a contrastive depth input consisting of the
rendered depths at each refinement stage along with the predicted depths. The proposed mesh
reconstruction method outperforms existing methods by a large margin and is capable of
reconstructing objects with more complex topologies.
REFERENCES
Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, José Neira, Ian Reid,
and John J Leonard. Past, present, and future of simultaneous localization and mapping: Toward
the robust-perception age. IEEE Transactions on robotics, 32(6):1309–1332, 2016.
Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li,
Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d
model repository. arXiv preprint arXiv:1512.03012, 2015.
Rui Chen, Songfang Han, Jing Xu, and Hao Su. Point-based multi-view stereo network. In
Proceedings of the IEEE International Conference on Computer Vision, pp. 1538–1547, 2019.
Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A
unified approach for single and multi-view 3d object reconstruction. In European conference on
computer vision, pp. 628–644. Springer, 2016.
Jean-Denis Durou, Maurizio Falcone, and Manuela Sagona. Numerical methods for shape-from-shading: A new survey with benchmarks. Computer Vision and Image Understanding, 109(1):
22–43, 2008.
Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object
reconstruction from a single image. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pp. 605–613, 2017.
Paolo Favaro and Stefano Soatto. A geometric approach to shape from defocus. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 27(3):406–417, 2005.
Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE
transactions on pattern analysis and machine intelligence, 32(8):1362–1376, 2009.
Georgia Gkioxari, Jitendra Malik, and Justin Johnson. Mesh r-cnn. In Proceedings of the IEEE
International Conference on Computer Vision, pp. 9785–9795, 2019.
Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume
for high-resolution multi-view stereo and stereo matching. arXiv preprint arXiv:1912.06378, 2019.
JunYoung Gwak, Christopher B Choy, Manmohan Chandraker, Animesh Garg, and Silvio Savarese.
Weakly supervised 3d reconstruction with adversarial constraint. In 2017 International Conference
on 3D Vision (3DV), pp. 263–272. IEEE, 2017.
Zhizhong Han, Guanhui Qiao, Yu-Shen Liu, and Matthias Zwicker. Seqxy2seqz: Structure learning
for 3d shapes by sequentially predicting 1d occupancy segments from 2d coordinates. arXiv
preprint arXiv:2003.05559, 2020.
Wilfried Hartmann, Silvano Galliani, Michal Havlena, Luc Van Gool, and Konrad Schindler. Learned
multi-patch similarity. In Proceedings of the IEEE International Conference on Computer Vision,
pp. 1586–1594, 2017.
Qingyong Hu, Bo Yang, Linhai Xie, Stefano Rosa, Yulan Guo, Zhihua Wang, Niki Trigoni, and
Andrew Markham. Randla-net: Efficient semantic segmentation of large-scale point clouds.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs:
Learning multi-view stereopsis. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 2821–2830, 2018.
Qixing Huang, Hai Wang, and Vladlen Koltun. Single-view reconstruction via joint analysis of image
and shape collections. ACM Transactions on Graphics (TOG), 34(4):1–10, 2015.
Xin Jia, Shourui Yang, Yuxin Peng, Junchao Zhang, and Shengyong Chen. Dv-net: Dual-view
network for 3d reconstruction by fusing multiple sets of gated control point clouds. Pattern
Recognition Letters, 131:376–382, 2020.
Jiongchao Jin, Akshay Gadi Patil, Zhang Xiong, and Hao Zhang. Dr-kfs: A differentiable visual
similarity metric for 3d shape reconstruction, 2019.
Abhishek Kar, Shubham Tulsiani, Joao Carreira, and Jitendra Malik. Category-specific object
reconstruction from a single image. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pp. 1966–1974, 2015.
Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In
Advances in neural information processing systems, pp. 365–376, 2017.
Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer. In The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Kurt Konolige. Improved occupancy grids for map building. Autonomous Robots, 4(4):351–367,
1997.
Andrey Kurenkov, Jingwei Ji, Animesh Garg, Viraj Mehta, JunYoung Gwak, Christopher Choy, and
Silvio Savarese. Deformnet: Free-form deformation network for 3d shape reconstruction from a
single image. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp.
858–866. IEEE, 2018.
Kiriakos N Kutulakos and Steven M Seitz. A theory of shape by space carving. International journal
of computer vision, 38(3):199–218, 2000.
Sophie Lambert-Lacroix and Laurent Zwald. The adaptive berhu penalty in robust regression. Journal
of Nonparametric Statistics, 28(3):487–514, 2016.
Maxime Lhuillier and Long Quan. A quasi-dense approach to surface reconstruction from uncalibrated
images. IEEE transactions on pattern analysis and machine intelligence, 27(3):418–433, 2005.
Chen-Hsuan Lin, Oliver Wang, Bryan C Russell, Eli Shechtman, Vladimir G Kim, Matthew Fisher,
and Simon Lucey. Photometric mesh optimization for video-aligned 3d object reconstruction. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 969–978,
2019.
Keyang Luo, Tao Guan, Lili Ju, Haipeng Huang, and Yawei Luo. P-mvsnet: Learning patch-wise
matching confidence aggregation for multi-view stereo. In Proceedings of the IEEE International
Conference on Computer Vision, pp. 10452–10461, 2019.
Junyi Pan, Xiaoguang Han, Weikai Chen, Jiapeng Tang, and Kui Jia. Deep mesh reconstruction from
single rgb images via topology modification networks. In Proceedings of the IEEE International
Conference on Computer Vision, pp. 9964–9973, 2019.
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The
graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113, 2016.
Steven M Seitz and Charles R Dyer. Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision, 35(2):151–173, 1999.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
Hao Su, Qixing Huang, Niloy J Mitra, Yangyan Li, and Leonidas Guibas. Estimating image depth
using shape collections. ACM Transactions on Graphics (TOG), 33(4):1–11, 2014.
Jiapeng Tang, Xiaoguang Han, Junyi Pan, Kui Jia, and Xin Tong. A skeleton-bridged deep learning
approach for generating meshes of complex topologies from single rgb images. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4541–4550, 2019.
Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision
for single-view reconstruction via differentiable ray consistency. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 2626–2634, 2017.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information
processing systems, pp. 5998–6008, 2017.
Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh:
Generating 3d mesh models from single rgb images. In Proceedings of the European Conference
on Computer Vision (ECCV), pp. 52–67, 2018.
Chao Wen, Yinda Zhang, Zhuwen Li, and Yanwei Fu. Pixel2mesh++: Multi-view 3d mesh generation
via deformation. In Proceedings of the IEEE International Conference on Computer Vision, pp.
1042–1051, 2019.
Udaranga Wickramasinghe, Edoardo Remelli, Graham Knott, and Pascal Fua. Voxel2mesh: 3d mesh
model generation from volumetric data, 2019.
Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets:
Learning single-view 3d object reconstruction without 3d supervision. In Advances in neural
information processing systems, pp. 1696–1704, 2016.
Bo Yang, Sen Wang, Andrew Markham, and Niki Trigoni. Robust attentional aggregation of deep
feature sets for multi-view 3d reconstruction. International Journal of Computer Vision, 128(1):
53–73, 2020.
Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. Foldingnet: Point cloud auto-encoder via
deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 206–215, 2018.
Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured
multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pp.
767–783, 2018.
Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent mvsnet for
high-resolution multi-view stereo depth inference. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 5525–5534, 2019.
Yuan Yao, Nico Schertler, Enrique Rosales, Helge Rhodin, Leonid Sigal, and Alla Sheffer. Front2back:
Single view 3d shape reconstruction via front to back prediction. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 531–540, 2020.
Ruo Zhang, Ping-Sing Tsai, James Edwin Cryer, and Mubarak Shah. Shape-from-shading: a survey.
IEEE transactions on pattern analysis and machine intelligence, 21(8):690–706, 1999.
A APPENDIX
NETWORK ARCHITECTURE
MVSNET ARCHITECTURE
Figure 4: Depth prediction network (MVSNet) architecture
Our depth prediction module is based on MVSNet Yao et al. (2018), which constructs a regularized
3D cost volume to estimate the depth map of the reference view. Here, we extend MVSNet to predict
the depth maps of all views instead of only the reference view. This is achieved by transforming
the feature volumes to each view's coordinate frame using homography warping and applying
identical cost volume regularization and depth regression on each view. This allows the reuse of
pre-regularization feature volumes for efficient multi-view depth prediction invariant to the order of
input images. Figure 4 shows the architecture of our depth estimation module.
PROBABILISTIC OCCUPANCY GRID MERGING
We use the single-view voxel prediction network from Gkioxari et al. (2019) to predict voxel grids
for each of the input images in their respective local coordinate frames. The occupancy grids
are transformed to the global frame (which is set to the coordinate frame of the first image) by
finding the equivalent global grid values in the local grids after applying bilinear interpolation on the
closest matches. The voxel grids in global coordinates are then probabilistically merged according
to Sub-section 3.1 of the main submission.
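A sketch of this resampling step is given below; it assumes the relative pose is available as a single 4 × 4 transform acting on normalized grid coordinates, and it uses trilinear interpolation via grid_sample in place of the interpolation described above.

```python
import torch
import torch.nn.functional as F

def resample_voxel_grid(local_probs, local_from_global, grid_size=32):
    """Resample a per-view occupancy grid into the global frame before probabilistic merging.

    local_probs: 1 x 1 x D x H x W occupancy probabilities in the view's local frame.
    local_from_global: 4 x 4 transform mapping normalized global grid coordinates ([-1, 1]^3)
    to normalized local grid coordinates (an assumption about the convention used here).
    """
    d = grid_size
    # Normalized coordinates of every global voxel center.
    lin = torch.linspace(-1, 1, d)
    zz, yy, xx = torch.meshgrid(lin, lin, lin, indexing="ij")
    coords = torch.stack([xx, yy, zz, torch.ones_like(xx)], dim=-1).reshape(-1, 4)   # D^3 x 4

    local = coords @ local_from_global.T                     # D^3 x 4 homogeneous local coordinates
    local = local[:, :3] / local[:, 3:4]
    grid = local.view(1, d, d, d, 3)                         # sampling grid for grid_sample (x, y, z order)
    # Trilinear interpolation of the local grid at the transformed locations; outside -> 0 (empty).
    return F.grid_sample(local_probs, grid, mode="bilinear", padding_mode="zeros", align_corners=True)

# Example: bring a view's 32^3 prediction into the global (first view's) frame, then merge as in Eq. (2).
probs_local = torch.rand(1, 1, 32, 32, 32)
T = torch.eye(4)                                             # identity transform for illustration
probs_global = resample_voxel_grid(probs_local, T)           # 1 x 1 x 32 x 32 x 32
```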
EXPERIMENTS
We quantitatively compare our method against previous works for multi-view shape generation
in Table 6 and show the effectiveness of our proposed shape generation method in improving shape
quality. Our method outperforms the state-of-the-art method Pixel2Mesh++ Wen et al. (2019) with a
34% decrease in Chamfer distance to ground truth, which shows the effectiveness of our proposed
method. Note that in Table 6 the same model is trained for all the categories, but accuracy is evaluated
on individual categories as well as averaged over all the categories.
                         Chamfer Distance (CD) ↓
Category     3D-R2N2    LSM    MVP2M   P2M++    Ours
Couch          0.806   0.730   0.534   0.439   0.220
Cabinet        0.613   0.634   0.488   0.337   0.230
Bench          1.362   0.572   0.591   0.549   0.159
Chair          1.534   0.495   0.583   0.461   0.201
Monitor        1.465   0.592   0.658   0.566   0.217
Firearm        0.432   0.385   0.305   0.305   0.123
Speaker        1.443   0.767   0.745   0.635   0.402
Lamp           6.780   1.768   0.980   1.135   0.755
Cellphone      1.161   0.362   0.445   0.325   0.138
Plane          0.854   0.496   0.403   0.422   0.084
Table          1.243   0.994   0.511   0.388   0.181
Car            0.358   0.326   0.321   0.249   0.165
Watercraft     0.869   0.509   0.463   0.508   0.175
Mean           1.455   0.664   0.541   0.486   0.211

Table 6: Quantitative comparison against state-of-the-art multi-view shape generation methods.
Following Wen et al. (2019), we report the Chamfer Distance from ground truth (in m^2 × 1000) for
the different methods. Note that the same model is trained for all the categories, but accuracy is
evaluated on individual categories as well as averaged over all the categories.
ABLATION STUDIES
Coarse Shape Generation We compare the voxel grids predicted by our proposed probabilistic
multi-view merging against the single-view method of Gkioxari et al. (2019). As shown in Table 7,
the accuracy of the initial shape generated from the probabilistically merged voxel grid is higher than
that from individual views.
Accuracy at different GCN stages We analyze the accuracy of meshes at different GCN stages
in Table 8 to validate that our method produces the meshes in a coarse-to-fine manner.
Metric    Single-view   Multi-view
F1-τ         25.19         31.27
F1-2τ        36.75         44.46

Table 7: Accuracy of the predicted voxel grids from single-view prediction compared against the
proposed probabilistically merged multi-view voxel grids. The voxel branch was trained separately
without the mesh refinement and evaluation was performed on the cubified voxel grids. We use three
views for probabilistic grid merging.

Metric    Cubified   Stage-1   Stage-2   Stage-3
F1-τ        31.48      76.78     79.88     80.80
F1-2τ       44.40      88.32     90.19     90.72

Table 8: Accuracy of the refined meshes at different GCN stages. 1, 2 and 3 indicate the performance
at the corresponding graph convolution blocks, while Cubified is for the cubified voxel grids used as
input for the first GCN block. All the stages, including the voxel prediction, were trained jointly and
hence the accuracy of the voxel predictions varies from that in Table 7.

Resolution of Depth Prediction We conduct experiments using different numbers of depth hypotheses
in our depth prediction network (Sub-section A), producing depth values at different resolutions. A
higher number of depth hypotheses means a finer resolution of the predicted depths. The quantitative
results with different numbers of hypotheses are summarized in Table 9. We set the number of depth
hypotheses to 48 for our final architecture, which is equivalent to a resolution of 25 mm.
Metric    24      48      72      96
F1-τ     80.29   80.80   80.69   80.34
F1-2τ    90.43   90.72   90.74   90.47

Table 9: Accuracy w.r.t. the number of depth hypotheses. A higher number of depth hypotheses
increases the resolution of the predicted depth values at the expense of a higher memory requirement.
The range of depths is the same for all the models and is based on the minimum/maximum depth in
the ShapeNet Chang et al. (2015) dataset.
Generalization Capability We conduct experiments to evaluate the generalization capability of
our system across the semantic categories. We train our model with only 12 out of the 13 categories
and test on the category that was left out. Table 10 shows that the accuracy generally does not
decrease significantly when compared with the model that was trained on all 13 categories.
                  F-score (τ) ↑            F-score (2τ) ↑
Category      Excluding   Including    Excluding   Including
Couch           63.29       73.63        80.79       88.24
Cabinet         68.26       76.39        83.10       88.84
Bench           76.08       83.76        87.42       92.57
Chair           60.60       78.69        75.93       90.02
Monitor         67.26       76.64        81.57       88.89
Firearm         78.59       94.32        86.28       97.67
Speaker         62.39       67.83        77.77       82.34
Lamp            63.50       75.93        74.66       85.33
Cellphone       67.24       86.45        80.54       94.28
Plane           57.48       92.13        67.27       96.57
Table           76.41       83.68        86.86       91.97
Car             59.08       80.43        75.58       92.33
Watercraft      64.97       80.48        78.95       90.35

Table 10: Accuracy when a category is excluded during training and evaluation is performed on that
category, to verify how well training on the other categories generalizes to the excluded category.