Figure 2. Example predictions from Mesh R-CNN on Pix3D. Using initial voxel predictions allows our outputs to vary in topology; converting these predictions to meshes and refining them allows us to capture fine structures like tabletops and chair legs.

fixed mesh templates, limiting them to fixed mesh topologies. As shown in Figure 1, we overcome this limitation by utilizing multiple 3D shape representations: we first predict coarse voxelized object representations, which are converted to meshes and refined to give highly accurate mesh predictions. As shown in Figure 2, this hybrid approach allows Mesh R-CNN to output meshes of arbitrary topology while also capturing fine object structures.

We benchmark our approach on two datasets. First, we evaluate our mesh prediction branch on ShapeNet [4], where our hybrid approach of voxel prediction and mesh refinement outperforms prior work by a large margin. Second, we deploy our full Mesh R-CNN system on the recent Pix3D dataset [60], which aligns 395 models of IKEA furniture to real-world images featuring diverse scenes, clutter, and occlusion. To date Pix3D has primarily been used to evaluate shape predictions for models trained on ShapeNet, using perfectly cropped, unoccluded image segments [41, 60, 73], or synthetic rendered images of Pix3D models [76]. In contrast, using Mesh R-CNN we are the first to train a system on Pix3D that can jointly detect objects of all categories and estimate their full 3D shapes.

2. Related Work

Our system inputs a single RGB image and outputs a set of detected object instances, with a triangle mesh for each object. Our work is most directly related to recent advances in 2D object recognition and 3D shape prediction. We also draw more broadly from work on other 3D perception tasks.

2D Object Recognition Methods for 2D object recognition vary both in the type of information predicted per object and in the overall system architecture. Object detectors output per-object bounding boxes and category labels [12, 13, 36, 38, 46, 47]; Mask R-CNN [18] additionally outputs instance segmentation masks. Our method extends this line of work to output a full 3D mesh per object.

Single-View Shape Prediction Recent approaches use a variety of shape representations for single-image 3D reconstruction. Some methods predict the orientation [10, 20] or 3D pose [31, 44, 66] of known shapes. Other approaches predict novel 3D shapes as sets of 3D points [8, 34], patches [15, 70], or geometric primitives [9, 64, 67]; others use deep networks to model signed distance functions [42]. These methods can flexibly represent complex shapes, but rely on post-processing to extract watertight mesh outputs. Some methods predict regular voxel grids [5, 71, 72]; while intuitive, scaling to high-resolution outputs requires complex octree [50, 62] or nested shape architectures [49]. Others directly output triangle meshes, but are constrained to deform from fixed [56, 57, 69] or retrieved mesh templates [51], limiting the topologies they can represent. Our approach uses a hybrid of voxel prediction and mesh deformation, enabling high-resolution output shapes that can flexibly represent arbitrary topologies.

Some methods reconstruct 3D shapes without 3D annotations [23, 25, 48, 68, 75]. This is an important direction, but at present we consider only the fully supervised case due to the success of strong supervision for 2D perception.

Multi-View Shape Prediction There is a broad line of work on multi-view reconstruction of objects and scenes, from classical binocular stereo [17, 53] to using shape priors [1, 2, 6, 21] and modern learning techniques [24, 26, 54]. In this work, we focus on single-image shape reconstruction.

3D Inputs Our method inputs 2D images and predicts semantic labels and 3D shapes. Due to the increasing availability of depth sensors, there has been growing interest in methods predicting semantic labels from 3D inputs such as RGB-D images [16, 58] and pointclouds [14, 32, 45, 59, 63]. We anticipate that incorporating 3D inputs into our method could improve the fidelity of our shape predictions.

Datasets Advances in 2D perception have been driven by large-scale annotated datasets such as ImageNet [52] and COCO [37]. Datasets for 3D shape prediction have lagged their 2D counterparts due to the difficulty of collecting 3D annotations. ShapeNet [4] is a large-scale dataset of CAD models which are rendered to give synthetic images. The IKEA dataset [33] aligns CAD models of IKEA objects to real-world images; Pix3D [60] extends this idea to a larger set of images and models. Pascal3D [74] aligns CAD models to real-world images, but it is unsuitable for shape reconstruction since its train and test sets share the same small set of models. KITTI [11] annotates outdoor street scenes with 3D bounding boxes, but does not provide shape annotations.

3. Method

Our goal is to design a system that inputs a single image, detects all objects, and outputs a category label, bounding box, segmentation mask, and 3D triangle mesh for each detected object. Our system must be able to handle cluttered real-world images, and must be trainable end-to-end. Our output meshes should not be constrained to a fixed topology.
Figure 3. System overview of Mesh R-CNN. We augment Mask R-CNN with 3D shape inference. The voxel branch predicts a coarse
shape for each detected object which is further deformed with a sequence of refinement stages in the mesh refinement branch.
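The voxel branch's per-object occupancy predictions are converted into an initial triangle mesh before refinement (the cubify threshold used in our experiments is 0.2, see Implementation Details below). The following is only a rough illustrative sketch of such a voxel-to-mesh conversion, not the paper's operator: it emits one cube per occupied voxel and makes no attempt at the vertex merging or interior-face removal a practical implementation would need.

```python
import torch

# Minimal sketch: turn a predicted occupancy grid into a triangle "cube soup".
_CUBE_VERTS = torch.tensor(
    [[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], dtype=torch.float32)
_CUBE_FACES = torch.tensor(  # 12 triangles covering the 6 faces of a unit cube
    [[0, 1, 2], [1, 3, 2], [4, 6, 5], [5, 6, 7],
     [0, 2, 4], [2, 6, 4], [1, 5, 3], [3, 5, 7],
     [0, 4, 1], [1, 4, 5], [2, 3, 6], [3, 7, 6]], dtype=torch.int64)

def naive_cubify(occupancy: torch.Tensor, threshold: float = 0.2):
    """occupancy: (D, H, W) tensor of probabilities -> (verts, faces)."""
    verts, faces = [], []
    for idx in occupancy.gt(threshold).nonzero(as_tuple=False):
        offset = len(verts) * 8                  # 8 vertices emitted per cube so far
        verts.append(_CUBE_VERTS + idx.float())  # translate unit cube to this voxel
        faces.append(_CUBE_FACES + offset)       # re-index faces into the global list
    if not verts:
        return torch.empty(0, 3), torch.empty(0, 3, dtype=torch.int64)
    return torch.cat(verts), torch.cat(faces)
```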
The cubified mesh from the voxel branch only provides a coarse 3D shape, and it cannot accurately model fine structures like chair legs. The mesh refinement branch processes this initial cubified mesh, refining its vertex positions with a sequence of refinement stages. Similar to [69], each refinement stage consists of three operations: vertex alignment¹, which extracts image features for vertices; graph convolution, which propagates information along mesh edges; and vertex refinement, which updates vertex positions. Each layer of the network maintains a 3D position v_i and a feature vector f_i for each mesh vertex.

¹ Vertex alignment is called perceptual feature pooling in [69].

Given pointclouds P and Q sampled from the predicted and ground-truth surfaces, the chamfer distance is

L_cham(P, Q) = |P|^{-1} \sum_{(p,q) \in \Lambda_{P,Q}} \|p - q\|^2 + |Q|^{-1} \sum_{(q,p) \in \Lambda_{Q,P}} \|q - p\|^2,    (1)

and the (absolute) normal distance is given by

L_norm(P, Q) = -|P|^{-1} \sum_{(p,q) \in \Lambda_{P,Q}} |u_p \cdot u_q| - |Q|^{-1} \sum_{(q,p) \in \Lambda_{Q,P}} |u_q \cdot u_p|.    (2)

The chamfer and normal distances penalize mismatched positions and normals between two pointclouds, but minimizing these distances alone results in degenerate meshes (see Figure 5). High-quality mesh predictions require additional shape regularizers: to this end we use an edge loss

L_edge(V, E) = \frac{1}{|E|} \sum_{(v, v') \in E} \|v - v'\|^2,

where E ⊆ V × V are the edges of the predicted mesh. Alternatively, a Laplacian loss [7] also imposes smoothness constraints.

The mesh loss of the i-th stage is a weighted sum of L_cham(P^i, P^gt), L_norm(P^i, P^gt) and L_edge(V^i, E^i). The mesh refinement branch is trained to minimize the mean of these losses across all refinement stages.
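As a concrete reference, the losses above reduce to a few lines once points and unit normals have been sampled from both surfaces. The sketch below is not the authors' implementation; it simply writes out Equations (1), (2), and the edge loss directly.

```python
import torch

def chamfer_and_normal(P, normals_p, Q, normals_q):
    """P: (N, 3), Q: (M, 3) surface samples; normals_*: matching unit normals."""
    d = torch.cdist(P, Q) ** 2                   # squared pairwise distances
    idx_pq = d.argmin(dim=1)                     # nearest Q point for each p
    idx_qp = d.argmin(dim=0)                     # nearest P point for each q
    chamfer = d.min(dim=1).values.mean() + d.min(dim=0).values.mean()   # Eq. (1)
    # Eq. (2): negative mean absolute cosine between matched normals.
    normal = -(normals_p * normals_q[idx_pq]).sum(dim=1).abs().mean() \
             - (normals_q * normals_p[idx_qp]).sum(dim=1).abs().mean()
    return chamfer, normal

def edge_loss(verts, edges):
    """verts: (V, 3); edges: (E, 2) long tensor of vertex index pairs."""
    v0, v1 = verts[edges[:, 0]], verts[edges[:, 1]]
    return ((v0 - v1) ** 2).sum(dim=1).mean()    # mean squared edge length

# The per-stage mesh loss is then a weighted sum, e.g.
#   loss_i = w_cham * chamfer + w_norm * normal + w_edge * edge_loss(V_i, E_i),
# averaged over all refinement stages.
```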
4. Experiments

We benchmark our mesh predictor on ShapeNet [4], where we compare with state-of-the-art approaches. We then evaluate our full Mesh R-CNN for the task of 3D shape prediction in the wild on the challenging Pix3D dataset [60].

                    Chamfer (↓)   F1^τ (↑)   F1^{2τ} (↑)
N3MR [25]                 2.629      33.80         47.72
3D-R2N2 [5]               1.445      39.01         54.62
PSG [8]                   0.593      48.58         69.78
Pixel2Mesh [69]†          0.591      59.72         74.19
MVD [56]                      -      66.39             -
GEOMetrics [57]               -      67.37             -
Pixel2Mesh [69]‡          0.463      67.89         79.88
Ours (Best)               0.306      74.84         85.75
Ours (Pretty)             0.391      69.83         81.76
Table 1. Single-image shape reconstruction results on ShapeNet, using the evaluation protocol from [69]. For [69], † are results reported in their paper and ‡ is the model released by the authors.

4.1. ShapeNet

ShapeNet [4] provides a collection of 3D shapes, represented as textured CAD models organized into semantic categories following WordNet [43], and has been widely used as a benchmark for 3D shape prediction. We use the subset of ShapeNetCore.v1 and rendered images from [5]. Each mesh is rendered from up to 24 random viewpoints, giving RGB images of size 137 × 137. We use the train / test splits provided by [69], which allocate 35,011 models (840,189 images) to train and 8,757 models (210,051 images) to test; models used in train and test are disjoint. We reserve 5% of the training models as a validation set.

The task on this dataset is to input a single RGB image of a rendered ShapeNet model on a blank background, and output a 3D mesh for the object in the camera coordinate system. During training the system is supervised with pairs of images and meshes.

Evaluation We adopt evaluation metrics used in recent work [56, 57, 69]. We sample 10k points uniformly at random from the surface of predicted and ground-truth meshes, and use them to compute Chamfer distance (Equation 1), Normal consistency (one minus Equation 2), and F1^τ at various distance thresholds τ, which is the harmonic mean of the precision at τ (the fraction of predicted points within τ of a ground-truth point) and the recall at τ (the fraction of ground-truth points within τ of a predicted point). Lower is better for Chamfer distance; higher is better for all other metrics.

With the exception of normal consistency, these metrics depend on the absolute scale of the meshes. In Table 1 we follow [69] and rescale by a factor of 0.57; for all other results we follow [8] and rescale so the longest edge of the ground-truth mesh's bounding box has length 10.

Implementation Details Our backbone feature extractor is ResNet-50 pretrained on ImageNet. Since images depict a single object, the voxel branch receives the entire conv5_3 feature map, bilinearly resized to 24 × 24, and predicts a 48 × 48 × 48 voxel grid. The VertAlign operator concatenates features from conv2_3, conv3_4, conv4_6, and conv5_3 before projecting to a vector of dimension 128. The mesh refinement branch has three stages, each with six graph convolution layers (of dimension 128) organized into three residual blocks. We train for 25 epochs using Adam [27] with learning rate 10^-4 and 32 images per batch on 8 Tesla V100 GPUs. We set the cubify threshold to 0.2 and weight the losses with λ_voxel = 1, λ_cham = 1, λ_norm = 0, and λ_edge = 0.2.

Baselines We compare with previously published methods for single-image shape prediction. N3MR [25] is a weakly supervised approach that fits a mesh via a differentiable renderer without 3D supervision. 3D-R2N2 [5] and MVD [56] output voxel predictions. PSG [8] predicts pointclouds. Appendix D additionally compares with OccNet [42]. Pixel2Mesh [69] predicts meshes by deforming and subdividing an initial ellipsoid. GEOMetrics [57] extends [69] with adaptive face subdivision. Both are trained to minimize Chamfer distances; however [69] computes it using predicted mesh vertices, while [57] uses points sampled uniformly from predicted meshes. We adopt the latter as it better matches test-time evaluation. Unlike ours, these methods can only predict connected meshes of genus zero.

The training recipe and backbone architecture vary among prior work. Therefore, for a fair comparison with our method we also compare against several ablated versions of our model (see Appendix C for exact details):

• Voxel-Only: A version of our method that terminates with the cubified meshes from the voxel branch.

• Pixel2Mesh+: We reimplement Pixel2Mesh [69]; we outperform their original model due to a deeper backbone, a better training recipe, and minimizing the Chamfer loss on sampled points rather than on vertex positions.

• Sphere-Init: Similar to Pixel2Mesh+, but initializes from a high-resolution sphere mesh, performing three stages of vertex refinement without subdivision.

• Ours (light): Uses a smaller non-residual mesh refinement branch with three graph convolution layers per stage. We will adopt this lightweight design on Pix3D.

Voxel-Only is essentially a version of our method that omits the mesh refinement branch, while Pixel2Mesh+ and Sphere-Init omit the voxel prediction branch.
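For reference, the F1^τ metric described above under Evaluation can be computed as below; the sketch assumes pred and gt are point samples from the two surfaces and is not the authors' evaluation code.

```python
import torch

def f1_score(pred, gt, tau):
    """pred: (N, 3), gt: (M, 3) surface samples; tau: distance threshold."""
    d = torch.cdist(pred, gt)                                # (N, M) pairwise distances
    precision = (d.min(dim=1).values < tau).float().mean()   # pred points near some gt point
    recall = (d.min(dim=0).values < tau).float().mean()      # gt points near some pred point
    if precision + recall == 0:
        return torch.tensor(0.0)
    return 2 * precision * recall / (precision + recall)     # harmonic mean
```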
                              Full Test Set                                                        Holes Test Set
                              Chamfer(↓)  Normal  F1^0.1  F1^0.3  F1^0.5  |V|       |F|         Chamfer(↓)  Normal  F1^0.1  F1^0.3  F1^0.5  |V|       |F|
Pixel2Mesh [69]‡              0.205       0.736   33.7    80.9    91.7    2466±0    4928±0      0.272       0.689   31.5    75.9    87.9    2466±0    4928±0
Best    Voxel-Only            0.916       0.595   7.7     33.1    54.9    1987±936  3975±1876   0.760       0.592   8.2     35.7    59.5    2433±925  4877±1856
        Sphere-Init           0.132       0.711   38.3    86.5    95.1    2562±0    5120±0      0.138       0.705   40.0    85.4    94.3    2562±0    5120±0
        Pixel2Mesh+           0.132       0.707   38.3    86.6    95.1    2562±0    5120±0      0.137       0.696   39.3    85.5    94.4    2562±0    5120±0
        Ours (light)          0.133       0.725   39.2    86.8    95.1    1894±925  3791±1855   0.130       0.723   41.6    86.7    94.8    2273±899  4560±1805
        Ours                  0.133       0.729   38.8    86.6    95.1    1899±928  3800±1861   0.130       0.725   41.7    86.7    94.9    2291±903  4595±1814
Pretty  Sphere-Init           0.175       0.718   34.5    82.2    92.9    2562±0    5120±0      0.186       0.684   34.4    80.2    91.7    2562±0    5120±0
        Pixel2Mesh+           0.175       0.727   34.9    82.3    92.9    2562±0    5120±0      0.196       0.685   34.4    79.9    91.4    2562±0    5120±0
        Ours (light)          0.176       0.699   34.8    82.4    93.1    1891±924  3785±1853   0.178       0.688   36.3    82.0    92.4    2281±895  4576±1798
        Ours                  0.171       0.713   35.1    82.6    93.2    1896±928  3795±1861   0.171       0.700   37.1    82.4    92.7    2292±902  4598±1812
Table 2. We report results both on the full ShapeNet test set (left) and on a subset of the test set consisting of meshes with visible holes (right). We compare our full model with several ablated versions: Voxel-Only omits the mesh refinement head, while Sphere-Init and Pixel2Mesh+ omit the voxel head. We show results both for Best models, which optimize for metrics, and for Pretty models, which strike a balance between shape metrics and mesh quality (see Figure 5); these two categories of models should not be compared against each other. We also report the number of vertices |V| and faces |F| in predicted meshes (per-instance average, mean±std). ‡ refers to the model released by the authors.
Figure 5. Training without the edge length regularizer L_edge results in degenerate predicted meshes that have many overlapping faces. Adding L_edge eliminates this degeneracy but results in worse agreement with the ground truth as measured by standard metrics such as Chamfer distance.

Figure 6. Pixel2Mesh+ predicts meshes by deforming an initial sphere, so it cannot properly model objects with holes. In contrast our method can model objects with arbitrary topologies.

Best vs Pretty As previously noted in [69] (Section 4.1), standard metrics for shape reconstruction are not well-correlated with mesh quality. Figure 5 shows that models trained without shape regularizers give meshes that are preferred by the metrics despite being highly degenerate, with irregularly-sized faces and many self-intersections. These degenerate meshes would be difficult to texture and may not be useful for downstream applications.

Due to the strong effect of shape regularizers on both mesh quality and quantitative metrics, we suggest only quantitatively comparing methods trained with the same shape regularizers. We thus train two versions of all our ShapeNet models: a Best version with λ_edge = 0 to serve as an upper bound on quantitative performance, and a Pretty version that strikes a balance between quantitative performance and mesh quality by setting λ_edge = 0.2.

Comparison with Prior Work Table 1 compares our Pretty and Best models with prior work on shape prediction from a single image. We use the evaluation protocol from [69], with a 0.57 mesh scaling factor and a threshold value τ = 10^-4 on squared Euclidean distances. For Pixel2Mesh, we provide the performance reported in their paper [69] as well as the performance of their open-source pretrained model. Table 1 shows that we outperform prior work by a wide margin, validating the design of our mesh predictor.

Ablation Study Fairly comparing with prior work is challenging due to differences in backbone networks, losses, and shape regularizers. For a controlled evaluation, we ablate variants using the same backbone and training recipe, shown in Table 2. ShapeNet is dominated by simple objects of genus zero. Therefore we evaluate both on the entire test set and on a subset consisting of objects with one or more holes (Holes Test Set)². In this evaluation we remove the ad-hoc scaling factor of 0.57, and we rescale meshes so the longest edge of the ground-truth mesh's bounding box has length 10, following [8] (see the sketch below). We compare the open-source Pixel2Mesh model against our ablations in this evaluation setting. Pixel2Mesh+ (our reimplementation of [69]) significantly outperforms the original due to an improved training recipe and deeper backbone.

² We annotated 3075 test set models and flagged whether they contained holes. This resulted in 17% (or 534) of the models being flagged. See Appendix G for more details and examples.
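A minimal sketch of the rescaling used for Table 2, assuming the same scale factor is applied to both the predicted and the ground-truth mesh (this is an illustration, not the authors' exact evaluation code):

```python
import torch

def rescale_pair(pred_verts, gt_verts, target=10.0):
    """Rescale so the longest edge of the ground-truth bounding box has length `target`."""
    extents = gt_verts.max(dim=0).values - gt_verts.min(dim=0).values   # bbox edge lengths
    scale = target / extents.max()
    return pred_verts * scale, gt_verts * scale
```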
Pix3D S1 APbox APmask APmesh chair sofa table bed desk bkcs wrdrb tool misc |V| |F|
Voxel-Only 94.4 88.4 5.3 0.0 3.5 2.6 0.5 0.7 34.3 5.7 0.0 0.0 2354±706 4717±1423
Pixel2Mesh+ 93.5 88.4 39.9 30.9 59.1 40.2 40.5 30.2 50.8 62.4 18.2 26.7 2562±0 5120±0
Sphere-Init 94.1 87.5 40.5 40.9 75.2 44.2 50.3 28.4 48.6 42.5 26.9 7.0 2562±0 5120±0
Mesh R-CNN (ours) 94.0 88.4 51.1 48.2 71.7 60.9 53.7 42.9 70.2 63.4 21.6 27.8 2367±698 4743±1406
# test instances 2530 2530 2530 1165 415 419 213 154 79 54 11 20
Pix3D S2
Voxel-Only 71.5 63.4 4.9 0.0 0.1 2.5 2.4 0.8 32.2 0.0 6.0 0.0 2346±630 4702±1269
Pixel2Mesh+ 71.1 63.4 21.1 26.7 58.5 10.9 38.5 7.8 34.1 3.4 10.0 0.0 2562±0 5120±0
Sphere-Init 72.6 64.5 24.6 32.9 75.3 15.8 40.1 10.1 45.0 1.5 0.8 0.0 2562±0 5120±0
Mesh R-CNN (ours) 72.2 63.9 28.8 42.7 70.8 27.2 40.9 18.2 51.1 2.9 5.2 0.0 2358±633 4726±1274
# test instances 2356 2356 2356 777 504 392 218 205 84 134 22 20
Table 3. Performance on Pix3D S1 & S2. We report mean APbox, APmask and APmesh, as well as per-category APmesh. All AP values are in %. The Voxel-Only baseline outputs the cubified voxel predictions. The Sphere-Init and Pixel2Mesh+ baselines deform an initial sphere and thus are limited to making predictions homeomorphic to spheres. Our Mesh R-CNN is flexible and can capture arbitrary topologies. We outperform the baselines consistently while predicting meshes with fewer vertices and faces.
CNN init   # refine steps   APbox   APmask   APmesh
COCO       3                94.0    88.4     51.1
IN         3                93.1    87.0     48.4
COCO       2                94.6    88.3     49.3
COCO       1                94.2    88.9     48.6
Table 4. Ablations of Mesh R-CNN on Pix3D.

We draw several conclusions from Table 2: (a) On the Full Test Set, our full model and Pixel2Mesh+ perform on par. However, on the Holes Test Set, our model dominates as it is able to predict topologically diverse shapes, while Pixel2Mesh+ is restricted to making predictions homeomorphic to spheres and cannot model holes or disconnected components (see Figure 6). This discrepancy is quantitatively more salient on Pix3D (Section 4.2) as it contains more complex shapes. (b) Sphere-Init and Pixel2Mesh+ perform similarly overall (both Best and Pretty), suggesting that mesh subdivision may be unnecessary for strong quantitative performance. (c) The deeper residual mesh refinement architecture (inspired by [69]) performs on par with the lighter non-residual architecture, motivating our use of the latter on Pix3D. (d) Voxel-Only performs poorly compared to methods that predict meshes, demonstrating that mesh predictions better capture fine object structure. (e) Each Best model outperforms its corresponding Pretty model; this is expected since Best is an upper bound on quantitative performance.

4.2. Pix3D

We now turn to Pix3D [60], which consists of 10,069 real-world images and 395 unique 3D models. Here the task is to jointly detect and predict 3D shapes for known object categories. Pix3D does not provide standard train/test splits, so we prepare two splits of our own.

Our first split, S1, randomly allocates 7539 images for training and 2530 for testing. Despite the small number of unique object models compared to ShapeNet, S1 is challenging since the same model can appear with varying appearance (e.g. color, texture), in different orientations, under different lighting conditions, in different contexts, and with varying occlusion. This is a stark contrast with ShapeNet, where objects appear against blank backgrounds.

Our second split, S2, is even more challenging: we ensure that the 3D models appearing in the train and test sets are disjoint. Success on this split requires generalization not only to the variations present in S1, but also to novel 3D shapes of known categories: for example a model may see kitchen chairs during training but must recognize armchairs during testing. This split is possible due to Pix3D's unique annotation structure, and poses interesting challenges for both 2D recognition and 3D shape prediction.

Evaluation We adopt metrics inspired by those used for 2D recognition: APbox, APmask and APmesh. The first two are standard metrics used for evaluating COCO object detection and instance segmentation at intersection-over-union (IoU) 0.5. APmesh evaluates 3D shape prediction: it is the mean area under the per-category precision-recall curves for F1^0.3 at 0.5³. Pix3D is not exhaustively annotated, so for evaluation we only consider predictions with box IoU > 0.3 with a ground-truth region. This avoids penalizing the model for correct predictions corresponding to unannotated objects.

We compare predicted and ground-truth meshes in the camera coordinate system. Our model assumes known camera intrinsics for VertAlign. In addition to predicting the box of each object on the image plane, Mesh R-CNN predicts the depth extent by appending a 2-layer MLP head, similar to the box regressor head. As a result, Mesh R-CNN predicts a 3D bounding box for each object. See Appendix E for more details.

³ A mesh prediction is considered a true positive if its predicted label is correct, it is not a duplicate detection, and its F1^0.3 > 0.5.
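As a reference for the APmesh computation described above, the per-category average precision can be sketched as below. Matching detections to ground truth, duplicate suppression, and the IoU > 0.3 filtering of unannotated regions are assumed to have already been applied when building `is_tp`; this is not the authors' evaluation code.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """scores: detection confidences; is_tp: 1 if F1^0.3 > 0.5 match, else 0; num_gt: #GT objects."""
    order = np.argsort(-np.asarray(scores))          # sort detections by decreasing score
    tp = np.asarray(is_tp, dtype=np.float64)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)
    return np.trapz(precision, recall)               # area under the precision-recall curve

# APmesh is then the mean of average_precision over all categories.
```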
Figure 7. Examples of Mesh R-CNN predictions on Pix3D. Mesh R-CNN detects multiple objects per image, reconstructs fine details such
as chair legs, and predicts varying and complex mesh topologies for objects with holes such as bookcases and tables.
Index  Inputs         Operation               Output shape
(1)    Input          Backbone features       h × w × 256
(2)    Input          Input vertex features   |V| × 128
(3)    Input          Input vertex positions  |V| × 3
(4)    (1), (3)       VertAlign               |V| × 256
(5)    (2), (3), (4)  Concatenate             |V| × 387
(6)    (5)            GraphConv(387 → 128)    |V| × 128
(7)    (3), (6)       Concatenate             |V| × 131
(8)    (7)            GraphConv(131 → 128)    |V| × 128
(9)    (3), (8)       Concatenate             |V| × 131
(10)   (9)            GraphConv(131 → 128)    |V| × 128
(11)   (3), (10)      Concatenate             |V| × 131
(12)   (11)           Linear(131 → 3)         |V| × 3
(13)   (12)           Tanh                    |V| × 3
(14)   (3), (13)      Addition                |V| × 3
Table 10. Architecture for a single mesh refinement stage on Pix3D.

The initial mesh is a level-2 icosphere with 162 vertices, 320 faces, and 480 edges, which results from applying two face subdivision operations to a regular icosahedron and projecting all resulting vertices onto a sphere. For the Pixel2Mesh+ baseline, the mesh refinement stages are the same as our full model, except that we apply a face subdivision operation prior to VertAlign in refinement stages 2 and 3.

Like Pixel2Mesh+, the Sphere-Init baseline omits the voxel branch and uses an identical initial sphere mesh for all images. However, unlike Pixel2Mesh+, the initial mesh is a level-4 icosphere with 2562 vertices, 5120 faces, and 7680 edges, which results from applying four face subdivision operations to a regular icosahedron. Due to this large initial mesh, the mesh refinement stages are identical to our full model and do not use mesh subdivision.

Pixel2Mesh+ and Sphere-Init both predict meshes with the same number of vertices and faces, and with identical topologies; the only difference between them is whether all subdivision operations are performed before the mesh refinement branch (Sphere-Init) or whether mesh refinement is interleaved with mesh subdivision (Pixel2Mesh+). On ShapeNet, the Pixel2Mesh+ and Sphere-Init baselines are trained with a batch size of 96; on Pix3D they use the same training recipe as our full model.

D. Comparison with Occupancy Networks

Occupancy Networks [42] (OccNet) also predict 3D meshes with neural networks. Rather than outputting a mesh directly from the neural network as in our approach, they train a neural network to compute a signed distance between a query point in 3D space and the object boundary. At test time a 3D mesh can be extracted from a set of query points. Like our approach, OccNets can also predict meshes with varying topology per input instance.

                       Chamfer(↓)  Normal  F1^0.1  F1^0.3  F1^0.5  |V|        |F|
        OccNet [42]    0.264       0.789   33.4    80.5    91.3    2499±60    4995±120
Best    Ours (light)   0.135       0.725   38.9    86.7    95.0    1978±951   3958±1906
        Ours           0.139       0.728   38.3    86.3    94.9    1985±960   3971±1924
Pretty  Ours (light)   0.185       0.696   34.3    82.0    92.8    1976±956   3954±1916
        Ours           0.180       0.709   34.6    82.2    93.0    1982±961   3967±1926
Table 11. Comparison between our method and Occupancy Networks (OccNet) [42] on ShapeNet. We use the same evaluation metrics and setup as Table 2.

Table 11 compares our approach with OccNet on the ShapeNet test set. We obtained test-set predictions for OccNet from the authors. Our method and OccNet are trained on slightly different splits of the ShapeNet dataset, so we compare our methods on the intersection of our respective test splits. From Table 11 we see that OccNets achieve higher normal consistency than our approach; however, both the Best and Pretty versions of our model outperform OccNets on all other metrics.
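For concreteness, a simplified sketch of the per-stage architecture listed in Table 10 is given below. The graph convolution (simple "self + mean of neighbors" rule) and the vert_align helper (bilinear sampling of backbone features at projected vertex locations) are stand-ins for the operators used in the paper, and the projection of vertices to normalized image coordinates is assumed to be done by the caller.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_self = nn.Linear(in_dim, out_dim)
        self.w_neigh = nn.Linear(in_dim, out_dim)

    def forward(self, x, edges):                      # x: (V, C), edges: (E, 2) long
        msg = self.w_neigh(x)                         # per-vertex message
        neigh = torch.zeros_like(msg)                 # summed neighbor messages
        deg = torch.zeros(x.shape[0], 1, device=x.device)
        ones = torch.ones(edges.shape[0], 1, device=x.device)
        for a, b in ((0, 1), (1, 0)):                 # treat edges as undirected
            neigh.index_add_(0, edges[:, a], msg[edges[:, b]])
            deg.index_add_(0, edges[:, a], ones)
        return F.relu(self.w_self(x) + neigh / deg.clamp(min=1))

def vert_align(feats, verts_ndc):
    """feats: (1, C, H, W); verts_ndc: (V, 2) in [-1, 1] -> (V, C) sampled features."""
    grid = verts_ndc.view(1, 1, -1, 2)
    sampled = F.grid_sample(feats, grid, align_corners=False)   # (1, C, 1, V)
    return sampled.view(feats.shape[1], -1).t()

class RefinementStage(nn.Module):
    def __init__(self, img_dim=256, vert_dim=128):
        super().__init__()
        self.gc1 = GraphConv(vert_dim + 3 + img_dim, vert_dim)   # rows (5)-(6): 387 -> 128
        self.gc2 = GraphConv(vert_dim + 3, vert_dim)             # rows (7)-(8): 131 -> 128
        self.gc3 = GraphConv(vert_dim + 3, vert_dim)             # rows (9)-(10): 131 -> 128
        self.refine = nn.Linear(vert_dim + 3, 3)                 # rows (11)-(12): 131 -> 3

    def forward(self, img_feats, vert_feats, verts, verts_ndc, edges):
        aligned = vert_align(img_feats, verts_ndc)               # row (4): (V, 256)
        x = self.gc1(torch.cat([vert_feats, verts, aligned], dim=1), edges)
        x = self.gc2(torch.cat([verts, x], dim=1), edges)
        x = self.gc3(torch.cat([verts, x], dim=1), edges)
        offsets = torch.tanh(self.refine(torch.cat([verts, x], dim=1)))   # rows (12)-(13)
        return verts + offsets, x        # row (14): refined positions, plus updated features
```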
\bar{d}_z = \frac{d_z \cdot f}{z_c \, h}    (4)

Note that the depth extent of an object is related to the size of the object (here approximated by the object's bounding box height h), its location z_c along the Z-axis (far-away objects need to be bigger in order to explain the image), and the focal length f. At inference time the depth extent d_z of the object is recovered from the predicted \bar{d}_z and the predicted height of the object bounding box h, given the focal length f and the center of the object z_c along the Z-axis. Note that we assume the center of the object z_c is given, since Pix3D annotations are not metric and due to the inherent scale-depth ambiguity.
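The inversion used at inference time follows directly from Eq. (4); a minimal sketch with the variable names from the text (f: focal length, h: predicted box height, z_c: assumed object center along the Z-axis), making no claim about the actual head implementation:

```python
def encode_depth_extent(d_z, f, h, z_c):
    return d_z * f / (z_c * h)        # regression target, Eq. (4)

def decode_depth_extent(d_z_bar, f, h, z_c):
    return d_z_bar * z_c * h / f      # recover the depth extent at inference time
```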
F. Pix3D: Visualizations and Comparisons

Figure 9. Qualitative comparisons between Pixel2Mesh+ and Mesh R-CNN on Pix3D. Each row shows the same example for Pixel2Mesh+ (first three columns) and Mesh R-CNN (last three columns), respectively. For each method, we show the input image along with the predicted 2D mask (chair, bookcase, table, bed) and box (in green) superimposed. We show the 3D mesh rendered on the input image and an additional view of the 3D mesh.
Figure 9 shows qualitative comparisons between Pixel2Mesh+ and Mesh R-CNN. Pixel2Mesh+ is limited to making predictions homeomorphic to spheres and thus cannot capture varying topologies, e.g. holes. In addition, Pixel2Mesh+ has a hard time capturing high curvatures, such as sharp table tops and legs. This is due to the large deformations required when starting from a sphere, which are not encouraged by the shape regularizers. On the other hand, Mesh R-CNN initializes its shapes with cubified voxel predictions, resulting in better initial shape representations which require less drastic deformations.

G. ShapeNet Holes test set

We construct the ShapeNet Holes Test set by selecting models from the ShapeNet test set that have visible holes from any viewpoint. Figure 10 shows several input images for randomly selected models from this subset. This test set is very challenging – many objects have small holes resulting from thin structures, and some objects have holes which are not visible from all viewpoints.

Figure 10. Example input images for randomly selected models from the Holes Test Set on ShapeNet. For each model we show three different input images showing the model from different viewpoints. This set is extremely challenging – some models may have very small holes (such as the holes in the back of the chair in the left model of the first row, or the holes on the underside of the table on the right model of row 2), and some models may have holes which are not visible in all input images (such as the green chair in the middle of the fourth row, or the gray desk on the right of the ninth row).

References

[1] Sid Yingze Bao, Manmohan Chandraker, Yuanqing Lin, and Silvio Savarese. Dense object reconstruction with semantic priors. In CVPR, 2013.
[2] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In SIGGRAPH, 1999.
[3] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[4] Angel X. Chang, Thomas A. Funkhouser, Leonidas J. Guibas, Pat Hanrahan, Qi-Xing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An information-rich 3D model repository. CoRR abs/1512.03012, 2015.
[5] Christopher B. Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV, 2016.
[6] Amaury Dame, Victor A. Prisacariu, Carl Y. Ren, and Ian Reid. Dense reconstruction using 3D object shape priors. In CVPR, 2013.
[7] Mathieu Desbrun, Mark Meyer, Peter Schröder, and Alan H. Barr. Implicit fairing of irregular meshes using diffusion and curvature flow. In SIGGRAPH, 1999.
[8] Haoqiang Fan, Hao Su, and Leonidas J. Guibas. A point set generation network for 3D object reconstruction from a single image. In CVPR, 2017.
[9] Sanja Fidler, Sven Dickinson, and Raquel Urtasun. 3D object detection and viewpoint estimation with a deformable 3D cuboid model. In NeurIPS, 2012.
[10] David F. Fouhey, Abhinav Gupta, and Martial Hebert. Data-driven 3D primitives for single image understanding. In ICCV, 2013.
[11] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. IJRR, 2013.
[12] Ross Girshick. Fast R-CNN. In ICCV, 2015.
[13] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[14] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3D semantic segmentation with submanifold sparse convolutional networks. In CVPR, 2018.
[15] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, and Mathieu Aubry. A papier-mâché approach to learning 3D surface generation. In CVPR, 2018.
[16] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from RGB-D images for object detection and segmentation. In ECCV, 2014.
[17] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003.
[18] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[20] Derek Hoiem, Alexei A. Efros, and Martial Hebert. Geometric context from a single image. In ICCV, 2005.
[21] Christian Häne, Nikolay Savinov, and Marc Pollefeys. Class specific 3D object shape priors using surface normals. In CVPR, 2014.
[22] Max Jaderberg, Karen Simonyan, and Andrew Zisserman. Spatial transformer networks. In NeurIPS, 2015.
[23] Angjoo Kanazawa, Shubham Tulsiani, Alexei A. Efros, and Jitendra Malik. Learning category-specific mesh reconstruction from image collections. In ECCV, 2018.
[24] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In NeurIPS, 2017.
[25] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3D mesh renderer. In CVPR, 2018.
[26] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. End-to-end learning of geometry and context for deep stereo regression. In ICCV, 2017.
[27] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[28] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[29] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
[30] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.
[31] Abhijit Kundu, Yin Li, and James M. Rehg. 3D-RCNN: Instance-level 3D object reconstruction via render-and-compare. In CVPR, 2018.
[32] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. PointCNN: Convolution on X-transformed points. In NeurIPS, 2018.
[33] Joseph J. Lim, Hamed Pirsiavash, and Antonio Torralba. Parsing IKEA objects: Fine pose estimation. In ICCV, 2013.
[34] Chen-Hsuan Lin, Chen Kong, and Simon Lucey. Learning efficient point cloud generation for dense 3D object reconstruction. In AAAI, 2018.
[35] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[36] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
[37] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[38] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, and Scott Reed. SSD: Single shot multibox detector. In ECCV, 2016.
[39] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[40] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In SIGGRAPH, 1987.
[41] Priyanka Mandikal, Navaneet Murthy, Mayank Agarwal, and R. Venkatesh Babu. 3D-LMNet: Latent embedding matching for accurate and diverse 3D point cloud reconstruction from a single image. In BMVC, 2018.
[42] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. In CVPR, 2019.
[43] George A. Miller. WordNet: A lexical database for English. Commun. ACM, 1995.
[44] Georgios Pavlakos, Xiaowei Zhou, Aaron Chan, Konstantinos G. Derpanis, and Kostas Daniilidis. 6-DoF object pose from semantic keypoints. In ICRA, 2017.
[45] Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017.
[46] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[47] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[48] Danilo Jimenez Rezende, S. M. Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3D structure from images. In NeurIPS, 2016.
[49] Stephan R. Richter and Stefan Roth. Matryoshka networks: Predicting 3D geometry via nested shape layers. In CVPR, 2018.
[50] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. OctNet: Learning deep 3D representations at high resolutions. In CVPR, 2017.
[51] Jason Rock, Tanmay Gupta, Justin Thorsen, JunYoung Gwak, Daeyun Shin, and Derek Hoiem. Completing 3D object shape from one depth image. In CVPR, 2015.
[52] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[53] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 2002.
[54] Tanner Schmidt, Richard Newcombe, and Dieter Fox. Self-supervised visual descriptor learning for dense correspondence. IEEE Robotics and Automation Letters, 2017.
[55] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[56] Edward Smith, Scott Fujimoto, and David Meger. Multi-view silhouette and depth decomposition for high resolution 3D object representation. In NeurIPS, 2018.
[57] Edward J. Smith, Scott Fujimoto, Adriana Romero, and David Meger. GEOMetrics: Exploiting geometric structure for graph-encoded objects. In ICML, 2019.
[58] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3D object detection in RGB-D images. In CVPR, 2016.
[59] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. SPLATNet: Sparse lattice networks for point cloud processing. In CVPR, 2018.
[60] Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tianfan Xue, Joshua B. Tenenbaum, and William T. Freeman. Pix3D: Dataset and methods for single-image 3D shape modeling. In CVPR, 2018.
[61] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[62] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. In ICCV, 2017.
[63] Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, and Qian-Yi Zhou. Tangent convolutions for dense prediction in 3D. In CVPR, 2018.
[64] Yonglong Tian, Andrew Luo, Xingyuan Sun, Kevin Ellis, William T. Freeman, Joshua B. Tenenbaum, and Jiajun Wu. Learning to infer and execute 3D shape programs. In ICLR, 2019.
[65] Alexander Toshev and Christian Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.
[66] Shubham Tulsiani and Jitendra Malik. Viewpoints and keypoints. In CVPR, 2015.
[67] Shubham Tulsiani, Hao Su, Leonidas J. Guibas, Alexei A. Efros, and Jitendra Malik. Learning shape abstractions by assembling volumetric primitives. In CVPR, 2017.
[68] Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.
[69] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In ECCV, 2018.
[70] Peng-Shuai Wang, Chun-Yu Sun, Yang Liu, and Xin Tong. Adaptive O-CNN: A patch-based deep representation of 3D shapes. In SIGGRAPH Asia, 2018.
[71] Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, Bill Freeman, and Josh Tenenbaum. MarrNet: 3D shape reconstruction via 2.5D sketches. In NeurIPS, 2017.
[72] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NeurIPS, 2016.
[73] Jiajun Wu, Chengkai Zhang, Xiuming Zhang, Zhoutong Zhang, William T. Freeman, and Joshua B. Tenenbaum. Learning 3D shape priors for shape completion and reconstruction. In ECCV, 2018.
[74] Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In WACV, 2014.
[75] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In NeurIPS, 2016.
[76] Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Joshua B. Tenenbaum, William T. Freeman, and Jiajun Wu. Learning to reconstruct shapes from unseen classes. In NeurIPS, 2018.