
Mesh R-CNN

Georgia Gkioxari Jitendra Malik Justin Johnson


Facebook AI Research (FAIR)

arXiv:1906.02739v2 [cs.CV] 25 Jan 2020

Abstract
Rapid advances in 2D perception have led to systems that accurately detect objects in real-world images. However, these systems make predictions in 2D, ignoring the 3D structure of the world. Concurrently, advances in 3D shape prediction have mostly focused on synthetic benchmarks and isolated objects. We unify advances in these two areas. We propose a system that detects objects in real-world images and produces a triangle mesh giving the full 3D shape of each detected object. Our system, called Mesh R-CNN, augments Mask R-CNN with a mesh prediction branch that outputs meshes with varying topological structure by first predicting coarse voxel representations which are converted to meshes and refined with a graph convolution network operating over the mesh's vertices and edges. We validate our mesh prediction branch on ShapeNet, where we outperform prior work on single-image shape prediction. We then deploy our full Mesh R-CNN system on Pix3D, where we jointly detect objects and predict their 3D shapes. Project page: https://gkioxari.github.io/meshrcnn/.

Figure 1. Mesh R-CNN takes an input image, predicts object instances in that image and infers their 3D shape. To capture diversity in geometries and topologies, it first predicts coarse voxels which are refined for accurate mesh predictions.

1. Introduction

The last few years have seen rapid advances in 2D object recognition. We can now build systems that accurately recognize objects [19, 30, 55, 61], localize them with 2D bounding boxes [13, 47] or masks [18], and predict 2D keypoint positions [3, 18, 65] in cluttered, real-world images. Despite their impressive performance, these systems ignore one critical fact: that the world and the objects within it are 3D and extend beyond the XY image plane.

At the same time, there have been significant advances in 3D shape understanding with deep networks. A menagerie of network architectures have been developed for different 3D shape representations, such as voxels [5], pointclouds [8], and meshes [69]; each representation carries its own benefits and drawbacks. However, this diverse and creative set of techniques has primarily been developed on synthetic benchmarks such as ShapeNet [4] consisting of rendered objects in isolation, which are dramatically less complex than natural-image benchmarks used for 2D object recognition like ImageNet [52] and COCO [37].

We believe that the time is ripe for these hitherto distinct research directions to be combined. We should strive to build systems that (like current methods for 2D perception) can operate on unconstrained real-world images with many objects, occlusion, and diverse lighting conditions but that (like current methods for 3D shape prediction) do not ignore the rich 3D structure of the world.

In this paper we take an initial step toward this goal. We draw on state-of-the-art methods for 2D perception and 3D shape prediction to build a system which inputs a real-world RGB image, detects the objects in the image, and outputs a category label, segmentation mask, and a 3D triangle mesh giving the full 3D shape of each detected object.

Our method, called Mesh R-CNN, builds on the state-of-the-art Mask R-CNN [18] system for 2D recognition, augmenting it with a mesh prediction branch that outputs high-resolution triangle meshes.

Our predicted meshes must be able to capture the 3D structure of diverse, real-world objects. Predicted meshes should therefore dynamically vary their complexity, topology, and geometry in response to varying visual stimuli.
However, prior work on mesh prediction with deep networks [23, 57, 69] has been constrained to deform from fixed mesh templates, limiting them to fixed mesh topologies. As shown in Figure 1, we overcome this limitation by utilizing multiple 3D shape representations: we first predict coarse voxelized object representations, which are converted to meshes and refined to give highly accurate mesh predictions. As shown in Figure 2, this hybrid approach allows Mesh R-CNN to output meshes of arbitrary topology while also capturing fine object structures.

We benchmark our approach on two datasets. First, we evaluate our mesh prediction branch on ShapeNet [4], where our hybrid approach of voxel prediction and mesh refinement outperforms prior work by a large margin. Second, we deploy our full Mesh R-CNN system on the recent Pix3D dataset [60] which aligns 395 models of IKEA furniture to real-world images featuring diverse scenes, clutter, and occlusion. To date Pix3D has primarily been used to evaluate shape predictions for models trained on ShapeNet, using perfectly cropped, unoccluded image segments [41, 60, 73], or synthetic rendered images of Pix3D models [76]. In contrast, using Mesh R-CNN we are the first to train a system on Pix3D which can jointly detect objects of all categories and estimate their full 3D shape.

Figure 2. Example predictions from Mesh R-CNN on Pix3D. Using initial voxel predictions allows our outputs to vary in topology; converting these predictions to meshes and refining them allows us to capture fine structures like tabletops and chair legs.

2. Related Work

Our system inputs a single RGB image and outputs a set of detected object instances, with a triangle mesh for each object. Our work is most directly related to recent advances in 2D object recognition and 3D shape prediction. We also draw more broadly from work on other 3D perception tasks.

2D Object Recognition Methods for 2D object recognition vary both in the type of information predicted per object, and in the overall system architecture. Object detectors output per-object bounding boxes and category labels [12, 13, 36, 38, 46, 47]; Mask R-CNN [18] additionally outputs instance segmentation masks. Our method extends this line of work to output a full 3D mesh per object.

Single-View Shape Prediction Recent approaches use a variety of shape representations for single-image 3D reconstruction. Some methods predict the orientation [10, 20] or 3D pose [31, 44, 66] of known shapes. Other approaches predict novel 3D shapes as sets of 3D points [8, 34], patches [15, 70], or geometric primitives [9, 64, 67]; others use deep networks to model signed distance functions [42]. These methods can flexibly represent complex shapes, but rely on post-processing to extract watertight mesh outputs. Some methods predict regular voxel grids [5, 71, 72]; while intuitive, scaling to high-resolution outputs requires complex octree [50, 62] or nested shape architectures [49]. Others directly output triangle meshes, but are constrained to deform from fixed [56, 57, 69] or retrieved mesh templates [51], limiting the topologies they can represent. Our approach uses a hybrid of voxel prediction and mesh deformation, enabling high-resolution output shapes that can flexibly represent arbitrary topologies. Some methods reconstruct 3D shapes without 3D annotations [23, 25, 48, 68, 75]. This is an important direction, but at present we consider only the fully supervised case due to the success of strong supervision for 2D perception.

Multi-View Shape Prediction There is a broad line of work on multi-view reconstruction of objects and scenes, from classical binocular stereo [17, 53] to using shape priors [1, 2, 6, 21] and modern learning techniques [24, 26, 54]. In this work, we focus on single-image shape reconstruction.

3D Inputs Our method inputs 2D images and predicts semantic labels and 3D shapes. Due to the increasing availability of depth sensors, there has been growing interest in methods predicting semantic labels from 3D inputs such as RGB-D images [16, 58] and pointclouds [14, 32, 45, 59, 63]. We anticipate that incorporating 3D inputs into our method could improve the fidelity of our shape predictions.

Datasets Advances in 2D perception have been driven by large-scale annotated datasets such as ImageNet [52] and COCO [37]. Datasets for 3D shape prediction have lagged their 2D counterparts due to the difficulty of collecting 3D annotations. ShapeNet [4] is a large-scale dataset of CAD models which are rendered to give synthetic images. The IKEA dataset [33] aligns CAD models of IKEA objects to real-world images; Pix3D [60] extends this idea to a larger set of images and models. Pascal3D [74] aligns CAD models to real-world images, but it is unsuitable for shape reconstruction since its train and test sets share the same small set of models. KITTI [11] annotates outdoor street scenes with 3D bounding boxes, but does not provide shape annotations.

3. Method

Our goal is to design a system that inputs a single image, detects all objects, and outputs a category label, bounding box, segmentation mask and 3D triangle mesh for each detected object. Our system must be able to handle cluttered real-world images, and must be trainable end-to-end. Our output meshes should not be constrained to a fixed topology in order to accommodate a wide variety of complex real-world objects. We accomplish these goals by marrying state-of-the-art 2D perception with 3D shape prediction.

Figure 3. System overview of Mesh R-CNN. We augment Mask R-CNN with 3D shape inference. The voxel branch predicts a coarse shape for each detected object which is further deformed with a sequence of refinement stages in the mesh refinement branch.

Specifically, we build on Mask R-CNN [18], a state-of-the-art 2D perception system. Mask R-CNN is an end-to-end region-based object detector. It inputs a single RGB image and outputs a bounding box, category label, and segmentation mask for each detected object. The image is first passed through a backbone network (e.g. ResNet-50-FPN [35]); next a region proposal network (RPN) [47] gives object proposals which are processed with object classification and mask prediction branches.

Part of Mask R-CNN's success is due to RoIAlign which extracts region features from image features while maintaining alignment between the input image and features used in the final prediction branches. We aim to maintain similar feature alignment when predicting 3D shapes.

We infer 3D shapes with a novel mesh predictor, comprising a voxel branch and a mesh refinement branch. The voxel branch first estimates a coarse 3D voxelization of an object, which is converted to an initial triangle mesh. The mesh refinement branch then adjusts the vertex positions of this initial mesh using a sequence of graph convolution layers operating over the edges of the mesh.

The voxel branch and mesh refinement branch are homologous to the existing box and mask branches of Mask R-CNN. All take as input image-aligned features corresponding to RPN proposals. The voxel and mesh losses, described in detail below, are added to the box and mask losses and the whole system is trained end-to-end. The output is a set of boxes along with their predicted object scores, masks and 3D shapes. We call our system Mesh R-CNN, which is illustrated in Figure 3.

We now describe our mesh predictor, consisting of the voxel branch and mesh refinement branch, along with its associated losses in detail.

3.1. Mesh Predictor

At the core of our system is a mesh predictor which receives convolutional features aligned to an object's bounding box and predicts a triangle mesh giving the object's full 3D shape. Like Mask R-CNN, we maintain correspondence between the input image and features used at all stages of processing via region- and vertex-specific alignment operators (RoIAlign and VertAlign). Our goal is to capture instance-specific 3D shapes of all objects in an image. Thus, each predicted mesh must have instance-specific topology (genus, number of vertices, faces, connected components) and geometry (vertex positions).

We predict varying mesh topologies by deploying a sequence of shape inference operations. First, the voxel branch makes bottom-up voxelized predictions of each object's shape, similar to Mask R-CNN's mask branch. These predictions are converted into meshes and adjusted by the mesh refinement head, giving our final predicted meshes.

The output of the mesh predictor is a triangle mesh T = (V, F) for each object. V = {vi ∈ R³} is the set of vertex positions and F ⊆ V × V × V is a set of triangular faces.

3.1.1 Voxel Branch

The voxel branch predicts a grid of voxel occupancy probabilities giving the coarse 3D shape of each detected object. It can be seen as a 3D analogue of Mask R-CNN's mask prediction branch: rather than predicting an M × M grid giving the object's shape in the image plane, we instead predict a G × G × G grid giving the object's full 3D shape.

Like Mask R-CNN, we maintain correspondence between input features and predicted voxels by applying a small fully-convolutional network [39] to the input feature map resulting from RoIAlign. This network produces a feature map with G channels giving a column of voxel occupancy scores for each position in the input.
Figure 4. Predicting voxel occupancies aligned to the image plane requires an irregularly-shaped voxel grid. We achieve this effect by making voxel predictions in a space that is transformed by the camera's (known) intrinsic matrix K. Applying K⁻¹ transforms our predictions back to world space. This results in frustum-shaped voxels in world space.

Maintaining pixelwise correspondence between the image and our predictions is complex in 3D since objects become smaller as they recede from the camera. As shown in Figure 4, we account for this by using the camera's (known) intrinsic matrix to predict frustum-shaped voxels.

Cubify: Voxel to Mesh The voxel branch produces a 3D grid of occupancy probabilities giving the coarse shape of an object. In order to predict more fine-grained 3D shapes, we wish to convert these voxel predictions into a triangle mesh which can be passed to the mesh refinement branch. We bridge this gap with an operation called cubify. It inputs voxel occupancy probabilities and a threshold for binarizing voxel occupancy. Each occupied voxel is replaced with a cuboid triangle mesh with 8 vertices, 18 edges, and 12 faces. Shared vertices and edges between adjacent occupied voxels are merged, and shared interior faces are eliminated. This results in a watertight mesh whose topology depends on the voxel predictions. Cubify must be efficient and batched. This is not trivial and we provide technical implementation details of how we achieve this in Appendix A. Alternatively, marching cubes [40] could extract an isosurface from the voxel grid, but is significantly more complex.

Voxel Loss The voxel branch is trained to minimize the binary cross-entropy between predicted voxel occupancy probabilities and true voxel occupancies.

3.1.2 Mesh Refinement Branch

The cubified mesh from the voxel branch only provides a coarse 3D shape, and it cannot accurately model fine structures like chair legs. The mesh refinement branch processes this initial cubified mesh, refining its vertex positions with a sequence of refinement stages. Similar to [69], each refinement stage consists of three operations: vertex alignment, which extracts image features for vertices; graph convolution, which propagates information along mesh edges; and vertex refinement, which updates vertex positions. Each layer of the network maintains a 3D position vi and a feature vector fi for each mesh vertex.

Vertex Alignment yields an image-aligned feature vector for each mesh vertex¹. We use the camera's intrinsic matrix to project each vertex onto the image plane. Given a feature map, we compute a bilinearly interpolated image feature at each projected vertex position [22].

In the first stage of the mesh refinement branch, VertAlign outputs an initial feature vector for each vertex. In subsequent stages, the VertAlign output is concatenated with the vertex feature from the previous stage.

¹ Vertex alignment is called perceptual feature pooling in [69].
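For illustration, the following is a minimal PyTorch sketch of a VertAlign-style operation: project vertices with the camera intrinsics and bilinearly sample the feature map at the projected locations. The function name, tensor layout, and the assumption that K is expressed in feature-map pixel coordinates are ours, not the released implementation.

```python
import torch
import torch.nn.functional as F

def vert_align(feat_map, verts, K):
    """Sample an image-aligned feature for each 3D vertex (illustrative sketch).

    feat_map: (B, C, H, W) image features
    verts:    (B, V, 3) vertex positions (x, y, z) in camera coordinates
    K:        (B, 3, 3) intrinsics, assumed to map to feature-map pixel coords
    returns:  (B, V, C) per-vertex features
    """
    B, C, H, W = feat_map.shape
    # Perspective projection: (u*z, v*z, z) = K @ (x, y, z)
    proj = torch.bmm(verts, K.transpose(1, 2))             # (B, V, 3)
    uv = proj[..., :2] / proj[..., 2:3].clamp(min=1e-5)    # pixel coordinates

    # grid_sample expects coordinates normalized to [-1, 1]
    u = 2.0 * uv[..., 0] / (W - 1) - 1.0
    v = 2.0 * uv[..., 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).unsqueeze(2)        # (B, V, 1, 2)

    feats = F.grid_sample(feat_map, grid, align_corners=True)  # (B, C, V, 1)
    return feats.squeeze(3).transpose(1, 2)                 # (B, V, C)
```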
Graph Convolution [29] propagates information along mesh edges. Given input vertex features {fi} it computes updated features f′i = ReLU(W0 fi + Σ_{j∈N(i)} W1 fj), where N(i) gives the i-th vertex's neighbors in the mesh, and W0 and W1 are learned weight matrices. Each stage of the mesh refinement branch uses several graph convolution layers to aggregate information over local mesh regions.

Vertex Refinement computes updated vertex positions v′i = vi + tanh(Wvert [fi ; vi]), where Wvert is a learned weight matrix. This updates the mesh geometry, keeping its topology fixed. Each stage of the mesh refinement branch terminates with vertex refinement, producing an intermediate mesh output which is further refined by the next stage.
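As a concrete illustration of these two operations, here is a minimal PyTorch sketch of a graph convolution layer over an explicit edge list and a vertex refinement step. The class names, the edge-list representation, and the use of bias terms (not written in the formulas above) are our own assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """f'_i = ReLU(W0 f_i + sum_{j in N(i)} W1 f_j) over an undirected edge list."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w0 = nn.Linear(in_dim, out_dim)
        self.w1 = nn.Linear(in_dim, out_dim)

    def forward(self, vert_feats, edges):
        # vert_feats: (V, D); edges: (E, 2) long tensor, each undirected edge listed once
        msg = self.w1(vert_feats)                      # transform once, then scatter
        agg = torch.zeros_like(msg)
        agg.index_add_(0, edges[:, 0], msg[edges[:, 1]])  # messages i <- j
        agg.index_add_(0, edges[:, 1], msg[edges[:, 0]])  # messages j <- i
        return torch.relu(self.w0(vert_feats) + agg)

class VertexRefine(nn.Module):
    """v'_i = v_i + tanh(W_vert [f_i; v_i]): moves vertices, keeps topology fixed."""
    def __init__(self, feat_dim):
        super().__init__()
        self.w_vert = nn.Linear(feat_dim + 3, 3)

    def forward(self, verts, feats):
        offset = torch.tanh(self.w_vert(torch.cat([feats, verts], dim=1)))
        return verts + offset
```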

Mesh Losses Defining losses that operate natively on triangle meshes is challenging, so we instead use loss functions defined over a finite set of points. We represent a mesh with a pointcloud by densely sampling its surface. Consequently, a pointcloud loss approximates a loss over shapes.

Similar to [57], we use a differentiable mesh sampling operation to sample points (and their normal vectors) uniformly from the surface of a mesh. To this end, we implement an efficient batched sampler; see Appendix B for details. We use this operation to sample a pointcloud P^gt from the ground-truth mesh, and a pointcloud P^i from each intermediate mesh prediction from our model.

Given two pointclouds P, Q with normal vectors, let Λ_{P,Q} = {(p, argmin_q ‖p − q‖) : p ∈ P} be the set of pairs (p, q) such that q is the nearest neighbor of p in Q, and let u_p be the unit normal to point p. The chamfer distance between pointclouds P and Q is given by

    Lcham(P, Q) = |P|⁻¹ Σ_{(p,q)∈Λ_{P,Q}} ‖p − q‖² + |Q|⁻¹ Σ_{(q,p)∈Λ_{Q,P}} ‖q − p‖²    (1)

and the (absolute) normal distance is given by

    Lnorm(P, Q) = −|P|⁻¹ Σ_{(p,q)∈Λ_{P,Q}} |u_p · u_q| − |Q|⁻¹ Σ_{(q,p)∈Λ_{Q,P}} |u_q · u_p|.    (2)

The chamfer and normal distances penalize mismatched positions and normals between two pointclouds, but minimizing these distances alone results in degenerate meshes (see Figure 5). High-quality mesh predictions require additional shape regularizers: to this end we use an edge loss Ledge(V, E) = 1/|E| Σ_{(v,v′)∈E} ‖v − v′‖², where E ⊆ V × V are the edges of the predicted mesh. Alternatively, a Laplacian loss [7] also imposes smoothness constraints.

The mesh loss of the i-th stage is a weighted sum of Lcham(P^i, P^gt), Lnorm(P^i, P^gt) and Ledge(V^i, E^i). The mesh refinement branch is trained to minimize the mean of these losses across all refinement stages.
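The sketch below shows the three loss terms for a single pair of sampled pointclouds, using brute-force nearest neighbors for clarity (a practical implementation would use a batched, accelerated nearest-neighbor search). Function names are our own, and normals are assumed to be unit length.

```python
import torch

def chamfer_and_normal(p, q, p_normals, q_normals):
    """Chamfer distance (Eq. 1) and absolute normal distance (Eq. 2), O(|P||Q|) sketch."""
    d = torch.cdist(p, q)                  # (|P|, |Q|) pairwise Euclidean distances
    idx_pq = d.argmin(dim=1)               # nearest neighbor in Q for each point of P
    idx_qp = d.argmin(dim=0)               # nearest neighbor in P for each point of Q

    cham = ((p - q[idx_pq]) ** 2).sum(dim=1).mean() + \
           ((q - p[idx_qp]) ** 2).sum(dim=1).mean()
    norm = -(p_normals * q_normals[idx_pq]).sum(dim=1).abs().mean() \
           - (q_normals * p_normals[idx_qp]).sum(dim=1).abs().mean()
    return cham, norm

def edge_loss(verts, edges):
    """L_edge: mean squared length of the predicted mesh's edges."""
    v0, v1 = verts[edges[:, 0]], verts[edges[:, 1]]
    return ((v0 - v1) ** 2).sum(dim=1).mean()
```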
4. Experiments

We benchmark our mesh predictor on ShapeNet [4], where we compare with state-of-the-art approaches. We then evaluate our full Mesh R-CNN for the task of 3D shape prediction in the wild on the challenging Pix3D dataset [60].

4.1. ShapeNet

ShapeNet [4] provides a collection of 3D shapes, represented as textured CAD models organized into semantic categories following WordNet [43], and has been widely used as a benchmark for 3D shape prediction. We use the subset of ShapeNetCore.v1 and rendered images from [5]. Each mesh is rendered from up to 24 random viewpoints, giving RGB images of size 137 × 137. We use the train / test splits provided by [69], which allocate 35,011 models (840,189 images) to train and 8,757 models (210,051 images) to test; models used in train and test are disjoint. We reserve 5% of the training models as a validation set.

The task on this dataset is to input a single RGB image of a rendered ShapeNet model on a blank background, and output a 3D mesh for the object in the camera coordinate system. During training the system is supervised with pairs of images and meshes.

Evaluation We adopt evaluation metrics used in recent work [56, 57, 69]. We sample 10k points uniformly at random from the surface of predicted and ground-truth meshes, and use them to compute Chamfer distance (Equation 1), Normal consistency (one minus Equation 2), and F1^τ at various distance thresholds τ, which is the harmonic mean of the precision at τ (fraction of predicted points within τ of a ground-truth point) and the recall at τ (fraction of ground-truth points within τ of a predicted point). Lower is better for Chamfer distance; higher is better for all other metrics. With the exception of normal consistency, these metrics depend on the absolute scale of the meshes. In Table 1 we follow [69] and rescale by a factor of 0.57; for all other results we follow [8] and rescale so the longest edge of the ground-truth mesh's bounding box has length 10.
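For concreteness, F1^τ can be computed from the two sampled pointclouds roughly as in the sketch below (our own helper, assuming the points have already been rescaled as described above).

```python
import torch

def f1_at_threshold(pred_pts, gt_pts, tau):
    """F1^tau: harmonic mean of precision and recall at distance threshold tau."""
    d = torch.cdist(pred_pts, gt_pts)
    precision = (d.min(dim=1).values < tau).float().mean()  # predicted points near GT
    recall = (d.min(dim=0).values < tau).float().mean()     # GT points near prediction
    return 2 * precision * recall / (precision + recall + 1e-8)
```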
Implementation Details Our backbone feature extractor is ResNet-50 pretrained on ImageNet. Since images depict a single object, the voxel branch receives the entire conv5_3 feature map, bilinearly resized to 24 × 24, and predicts a 48 × 48 × 48 voxel grid. The VertAlign operator concatenates features from conv2_3, conv3_4, conv4_6, and conv5_3 before projecting to a vector of dimension 128. The mesh refinement branch has three stages, each with six graph convolution layers (of dimension 128) organized into three residual blocks. We train for 25 epochs using Adam [27] with learning rate 10⁻⁴ and 32 images per batch on 8 Tesla V100 GPUs. We set the cubify threshold to 0.2 and weight the losses with λvoxel = 1, λcham = 1, λnorm = 0, and λedge = 0.2.

                  Chamfer (↓)  F1^τ (↑)  F1^2τ (↑)
N3MR [25]         2.629        33.80     47.72
3D-R2N2 [5]       1.445        39.01     54.62
PSG [8]           0.593        48.58     69.78
Pixel2Mesh [69]†  0.591        59.72     74.19
MVD [56]          -            66.39     -
GEOMetrics [57]   -            67.37     -
Pixel2Mesh [69]‡  0.463        67.89     79.88
Ours (Best)       0.306        74.84     85.75
Ours (Pretty)     0.391        69.83     81.76

Table 1. Single-image shape reconstruction results on ShapeNet, using the evaluation protocol from [69]. For [69], † are results reported in their paper and ‡ is the model released by the authors.

Baselines We compare with previously published methods for single-image shape prediction. N3MR [25] is a weakly supervised approach that fits a mesh via a differentiable renderer without 3D supervision. 3D-R2N2 [5] and MVD [56] output voxel predictions. PSG [8] predicts pointclouds. Appendix D additionally compares with OccNet [42]. Pixel2Mesh [69] predicts meshes by deforming and subdividing an initial ellipsoid. GEOMetrics [57] extends [69] with adaptive face subdivision. Both are trained to minimize Chamfer distances; however [69] computes it using predicted mesh vertices, while [57] uses points sampled uniformly from predicted meshes. We adopt the latter as it better matches test-time evaluation. Unlike ours, these methods can only predict connected meshes of genus zero.

The training recipe and backbone architecture vary among prior work. Therefore for a fair comparison with our method we also compare against several ablated versions of our model (see Appendix C for exact details):

• Voxel-Only: A version of our method that terminates with the cubified meshes from the voxel branch.

• Pixel2Mesh+: We reimplement Pixel2Mesh [69]; we outperform their original model due to a deeper backbone, better training recipe, and minimizing Chamfer on sampled points rather than vertex positions.

• Sphere-Init: Similar to Pixel2Mesh+, but initializes from a high-resolution sphere mesh, performing three stages of vertex refinement without subdivision.

• Ours (light): Uses a smaller nonresidual mesh refinement branch with three graph convolution layers per stage. We will adopt this lightweight design on Pix3D.

Voxel-Only is essentially a version of our method that omits the mesh refinement branch, while Pixel2Mesh+ and Sphere-Init omit the voxel prediction branch.
                           Full Test Set                                                         Holes Test Set
                           Chamfer(↓) Normal F1^0.1 F1^0.3 F1^0.5 |V|       |F|        Chamfer(↓) Normal F1^0.1 F1^0.3 F1^0.5 |V|       |F|
        Pixel2Mesh [69]‡   0.205      0.736  33.7   80.9   91.7   2466±0    4928±0     0.272      0.689  31.5   75.9   87.9   2466±0    4928±0
Best    Voxel-Only         0.916      0.595  7.7    33.1   54.9   1987±936  3975±1876  0.760      0.592  8.2    35.7   59.5   2433±925  4877±1856
        Sphere-Init        0.132      0.711  38.3   86.5   95.1   2562±0    5120±0     0.138      0.705  40.0   85.4   94.3   2562±0    5120±0
        Pixel2Mesh+        0.132      0.707  38.3   86.6   95.1   2562±0    5120±0     0.137      0.696  39.3   85.5   94.4   2562±0    5120±0
        Ours (light)       0.133      0.725  39.2   86.8   95.1   1894±925  3791±1855  0.130      0.723  41.6   86.7   94.8   2273±899  4560±1805
        Ours               0.133      0.729  38.8   86.6   95.1   1899±928  3800±1861  0.130      0.725  41.7   86.7   94.9   2291±903  4595±1814
Pretty  Sphere-Init        0.175      0.718  34.5   82.2   92.9   2562±0    5120±0     0.186      0.684  34.4   80.2   91.7   2562±0    5120±0
        Pixel2Mesh+        0.175      0.727  34.9   82.3   92.9   2562±0    5120±0     0.196      0.685  34.4   79.9   91.4   2562±0    5120±0
        Ours (light)       0.176      0.699  34.8   82.4   93.1   1891±924  3785±1853  0.178      0.688  36.3   82.0   92.4   2281±895  4576±1798
        Ours               0.171      0.713  35.1   82.6   93.2   1896±928  3795±1861  0.171      0.700  37.1   82.4   92.7   2292±902  4598±1812

Table 2. We report results both on the full ShapeNet test set (left), as well as a subset of the test set consisting of meshes with visible holes (right). We compare our full model with several ablated versions: Voxel-Only omits the mesh refinement head, while Sphere-Init and Pixel2Mesh+ omit the voxel head. We show results both for Best models which optimize for metrics, as well as Pretty models that strike a balance between shape metrics and mesh quality (see Figure 5); these two categories of models should not be compared. We also report the number of vertices |V| and faces |F| in predicted meshes (mean±std, per-instance average). ‡ refers to the released model by the authors.

Figure 5. Training without the edge length regularizer Ledge results in degenerate predicted meshes that have many overlapping faces. Adding Ledge eliminates this degeneracy but results in worse agreement with the ground-truth as measured by standard metrics such as Chamfer distance.

Figure 6. Pixel2Mesh+ predicts meshes by deforming an initial sphere, so it cannot properly model objects with holes. In contrast our method can model objects with arbitrary topologies.

Best vs Pretty As previously noted in [69] (Section 4.1), standard metrics for shape reconstruction are not well-correlated with mesh quality. Figure 5 shows that models trained without shape regularizers give meshes that are preferred by metrics despite being highly degenerate, with irregularly-sized faces and many self-intersections. These degenerate meshes would be difficult to texture, and may not be useful for downstream applications.

Due to the strong effect of shape regularizers on both mesh quality and quantitative metrics, we suggest only quantitatively comparing methods trained with the same shape regularizers. We thus train two versions of all our ShapeNet models: a Best version with λedge = 0 to serve as an upper bound on quantitative performance, and a Pretty version that strikes a balance between quantitative performance and mesh quality by setting λedge = 0.2.

Comparison with Prior Work Table 1 compares our Pretty and Best models with prior work on shape prediction from a single image. We use the evaluation protocol from [69], using a 0.57 mesh scaling factor and threshold value τ = 10⁻⁴ on squared Euclidean distances. For Pixel2Mesh, we provide the performance reported in their paper [69] as well as the performance of their open-source pretrained model. Table 1 shows that we outperform prior work by a wide margin, validating the design of our mesh predictor.

Ablation Study Fairly comparing with prior work is challenging due to differences in backbone networks, losses, and shape regularizers. For a controlled evaluation, we ablate variants using the same backbone and training recipe, shown in Table 2. ShapeNet is dominated by simple objects of genus zero. Therefore we evaluate both on the entire test set and on a subset consisting of objects with one or more holes (Holes Test Set)². In this evaluation we remove the ad-hoc scaling factor of 0.57, and we rescale meshes so the longest edge of the ground-truth mesh's bounding box has length 10, following [8]. We compare the open-source Pixel2Mesh model against our ablations in this evaluation setting. Pixel2Mesh+ (our reimplementation of [69]) significantly outperforms the original due to an improved training recipe and deeper backbone.

² We annotated 3075 test set models and flagged whether they contained holes. This resulted in 17% (or 534) of the models being flagged. See Appendix G for more details and examples.
Pix3D S1            APbox  APmask  APmesh  chair  sofa  table  bed   desk  bkcs  wrdrb  tool  misc  |V|       |F|
Voxel-Only          94.4   88.4    5.3     0.0    3.5   2.6    0.5   0.7   34.3  5.7    0.0   0.0   2354±706  4717±1423
Pixel2Mesh+         93.5   88.4    39.9    30.9   59.1  40.2   40.5  30.2  50.8  62.4   18.2  26.7  2562±0    5120±0
Sphere-Init         94.1   87.5    40.5    40.9   75.2  44.2   50.3  28.4  48.6  42.5   26.9  7.0   2562±0    5120±0
Mesh R-CNN (ours)   94.0   88.4    51.1    48.2   71.7  60.9   53.7  42.9  70.2  63.4   21.6  27.8  2367±698  4743±1406
# test instances    2530   2530    2530    1165   415   419    213   154   79    54     11    20

Pix3D S2            APbox  APmask  APmesh  chair  sofa  table  bed   desk  bkcs  wrdrb  tool  misc  |V|       |F|
Voxel-Only          71.5   63.4    4.9     0.0    0.1   2.5    2.4   0.8   32.2  0.0    6.0   0.0   2346±630  4702±1269
Pixel2Mesh+         71.1   63.4    21.1    26.7   58.5  10.9   38.5  7.8   34.1  3.4    10.0  0.0   2562±0    5120±0
Sphere-Init         72.6   64.5    24.6    32.9   75.3  15.8   40.1  10.1  45.0  1.5    0.8   0.0   2562±0    5120±0
Mesh R-CNN (ours)   72.2   63.9    28.8    42.7   70.8  27.2   40.9  18.2  51.1  2.9    5.2   0.0   2358±633  4726±1274
# test instances    2356   2356    2356    777    504   392    218   205   84    134    22    20

Table 3. Performance on Pix3D S1 & S2. We report mean APbox, APmask and APmesh, as well as per-category APmesh. All AP performances are in %. The Voxel-Only baseline outputs the cubified voxel predictions. The Sphere-Init and Pixel2Mesh+ baselines deform an initial sphere and thus are limited to making predictions homeomorphic to spheres. Our Mesh R-CNN is flexible and can capture arbitrary topologies. We outperform the baselines consistently while predicting meshes with fewer vertices and faces.
CNN init  # refine steps  APbox  APmask  APmesh
COCO      3               94.0   88.4    51.1
IN        3               93.1   87.0    48.4
COCO      2               94.6   88.3    49.3
COCO      1               94.2   88.9    48.6

Table 4. Ablations of Mesh R-CNN on Pix3D.

We draw several conclusions from Table 2: (a) On the Full Test Set, our full model and Pixel2Mesh+ perform on par. However, on the Holes Test Set, our model dominates as it is able to predict topologically diverse shapes while Pixel2Mesh+ is restricted to make predictions homeomorphic to spheres, and cannot model holes or disconnected components (see Figure 6). This discrepancy is quantitatively more salient on Pix3D (Section 4.2) as it contains more complex shapes. (b) Sphere-Init and Pixel2Mesh+ perform similarly overall (both Best and Pretty), suggesting that mesh subdivision may be unnecessary for strong quantitative performance. (c) The deeper residual mesh refinement architecture (inspired by [69]) performs on par with the lighter non-residual architecture, motivating our use of the latter on Pix3D. (d) Voxel-Only performs poorly compared to methods that predict meshes, demonstrating that mesh predictions better capture fine object structure. (e) Each Best model outperforms its corresponding Pretty model; this is expected since Best is an upper bound on quantitative performance.

4.2. Pix3D

We now turn to Pix3D [60], which consists of 10069 real-world images and 395 unique 3D models. Here the task is to jointly detect and predict 3D shapes for known object categories. Pix3D does not provide standard train/test splits, so we prepare two splits of our own.

Our first split, S1, randomly allocates 7539 images for training and 2530 for testing. Despite the small number of unique object models compared to ShapeNet, S1 is challenging since the same model can appear with varying appearance (e.g. color, texture), in different orientations, under different lighting conditions, in different contexts, and with varying occlusion. This is a stark contrast with ShapeNet, where objects appear against blank backgrounds.

Our second split, S2, is even more challenging: we ensure that the 3D models appearing in the train and test sets are disjoint. Success on this split requires generalization not only to the variations present in S1, but also to novel 3D shapes of known categories: for example a model may see kitchen chairs during training but must recognize armchairs during testing. This split is possible due to Pix3D's unique annotation structure, and poses interesting challenges for both 2D recognition and 3D shape prediction.

Evaluation We adopt metrics inspired by those used for 2D recognition: APbox, APmask and APmesh. The first two are standard metrics used for evaluating COCO object detection and instance segmentation at intersection-over-union (IoU) 0.5. APmesh evaluates 3D shape prediction: it is the mean area under the per-category precision-recall curves for F1^0.3 at 0.5³. Pix3D is not exhaustively annotated, so for evaluation we only consider predictions with box IoU > 0.3 with a ground-truth region. This avoids penalizing the model for correct predictions corresponding to unannotated objects.

³ A mesh prediction is considered a true positive if its predicted label is correct, it is not a duplicate detection, and its F1^0.3 > 0.5.

We compare predicted and ground-truth meshes in the camera coordinate system. Our model assumes known camera intrinsics for VertAlign. In addition to predicting the box of each object on the image plane, Mesh R-CNN predicts the depth extent by appending a 2-layer MLP head, similar to the box regressor head. As a result, Mesh R-CNN predicts a 3D bounding box for each object. See Appendix E for more details.
Figure 7. Examples of Mesh R-CNN predictions on Pix3D. Mesh R-CNN detects multiple objects per image, reconstructs fine details such
as chair legs, and predicts varying and complex mesh topologies for objects with holes such as bookcases and tables.

Implementation details We use ResNet-50-FPN [35] as the backbone CNN; the box and mask branches are identical to Mask R-CNN. The voxel branch resembles the mask branch, but the pooling resolution is decreased to 12 (vs. 14 for masks) due to memory constraints, giving 24 × 24 × 24 voxel predictions. We adopt the lightweight design for the mesh refinement branch from Section 4.1. We train for 12 epochs with a batch size of 64 per image on 8 Tesla V100 GPUs (two images per GPU). We use SGD with momentum, linearly increasing the learning rate from 0.002 to 0.02 over the first 1K iterations, then decaying by a factor of 10 at 8K and 10K iterations. We initialize from a model pretrained for instance segmentation on COCO. We set the cubify threshold to 0.2 and the loss weights to λvoxel = 3, λcham = 1, λnorm = 0.1 and λedge = 1 and use weight decay 10⁻⁴; detection loss weights are identical to Mask R-CNN.

Comparison to Baselines As discussed in Section 1, we are the first to tackle joint detection and shape inference in the wild on Pix3D. To validate our approach we compare with ablated versions of Mesh R-CNN, replacing our full mesh predictor with Voxel-Only, Pixel2Mesh+, and Sphere-Init branches (see Section 4.1). All baselines otherwise use the same architecture and training recipe.

Table 3 (top) shows the performance on S1. We observe that: (a) Mesh R-CNN outperforms all baselines, improving over the next-best by 10.6% APmesh overall and across most categories; Tool and Misc⁴ have very few test-set instances (11 and 20 respectively), so their AP is noisy. (b) Mesh R-CNN shows large gains vs. Sphere-Init for objects with complex shapes such as bookcase (+21.6%), table (+16.7%) and chair (+7.3%). (c) Voxel-Only performs very poorly – this is expected due to its coarse predictions.

⁴ Misc consists of objects such as fire hydrant, picture frame, vase, etc.

Table 3 (bottom) shows the performance on the more challenging S2 split. Here we observe: (a) The overall performance on 2D recognition (APbox, APmask) drops significantly compared to S1, signifying the difficulty of recognizing novel shapes in the wild. (b) Mesh R-CNN outperforms all baselines for shape prediction for all categories except sofa, wardrobe and tool. (c) Absolute performance on wardrobe, tool and misc is small for all methods due to significant shape disparity between models in train and test and lack of training data.

Figure 8. More examples of Mesh R-CNN predictions on Pix3D.

Table 4 compares pretraining on COCO vs ImageNet, and compares different architectures for the mesh predictor. COCO vs. ImageNet initialization improves 2D recognition (APmask 88.4 vs. 87.0) and 3D shape prediction (APmesh 51.1 vs. 48.4). Shape prediction is degraded when using only one mesh refinement stage (APmesh 51.1 vs. 48.6).

Figures 2, 7 and 8 show example predictions from Mesh R-CNN. Our method can detect multiple objects per image, reconstruct fine details such as chair legs, and predict varying and complex mesh topologies for objects with holes such as bookcases and desks.

Discussion

We propose Mesh R-CNN, a novel system for joint 2D perception and 3D shape inference. We validate our approach on ShapeNet and show its merits on Pix3D. Mesh R-CNN is a first attempt at 3D shape prediction in the wild. Despite the lack of large supervised data, e.g. compared to COCO, Mesh R-CNN shows promising results. Mesh R-CNN is an object centric approach. Future work includes reasoning about the 3D layout, i.e. the relative pose of objects in the 3D scene.

Acknowledgements We would like to thank Kaiming He, Piotr Dollár, Leonidas Guibas, Manolis Savva and Shubham Tulsiani for valuable discussions. We would also like to thank Lars Mescheder and Thibault Groueix for their help.
Appendix

A. Implementation of Cubify

Algorithm 1 outlines the cubify operation. Cubify takes as input voxel occupancy probabilities V of shape N × D × H × W as predicted from the voxel branch and a threshold value τ. Each occupied voxel is replaced with a cuboid triangle mesh (unit cube) with 8 vertices, 18 edges, and 12 faces. Shared vertices and edges between adjacent occupied voxels are merged, and shared interior faces are eliminated. This results in a watertight mesh T = (V, F) for each example in the batch whose topology depends on the voxel predictions.

Algorithm 1 is an inefficient implementation of cubify as it involves nested for loops which in practice increase the time complexity, especially for large batches and large voxel sizes. In particular, this implementation takes > 300ms for N = 32 voxels of size 32 × 32 × 32 on a Tesla V100 GPU. We replace the nested for loops with 3D convolutions and vectorize our computations, resulting in a time complexity of ≈ 30ms for the same voxel inputs.

Data: V ∈ [0, 1]^(N×D×H×W), τ ∈ [0, 1]
unit_cube = (Vcube, Fcube)
for (n, z, y, x) ∈ range(N, D, H, W) do
    if V[n, z, y, x] > τ then
        add unit cube at (n, z, y, x)
        if V[n, z−1, y, x] > τ then remove back faces end
        if V[n, z+1, y, x] > τ then remove front faces end
        if V[n, z, y−1, x] > τ then remove top faces end
        if V[n, z, y+1, x] > τ then remove bottom faces end
        if V[n, z, y, x−1] > τ then remove left faces end
        if V[n, z, y, x+1] > τ then remove right faces end
    end
end
merge shared verts
return a list of meshes {Ti = (Vi, Fi)} for i = 0, ..., N−1

Algorithm 1: Cubify
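As a rough illustration of the vectorized strategy mentioned above, the sketch below computes, for every voxel and every one of the six face directions, whether that face should be kept (i.e. the adjacent voxel is unoccupied). It is our own simplified sketch, not the released implementation: it covers only the neighbor tests of Algorithm 1 and omits assembling the vertex/face lists and merging shared vertices.

```python
import torch
import torch.nn.functional as F

def exposed_faces(vox_probs, tau):
    """Vectorized neighbor tests of Algorithm 1 (sketch).

    vox_probs: (N, D, H, W) occupancy probabilities
    returns:   (N, 6, D, H, W) boolean mask of faces to keep
               (direction order: z-, z+, y-, y+, x-, x+)
    """
    occ = vox_probs > tau                               # (N, D, H, W)
    occ_f = occ.float().unsqueeze(1)                    # (N, 1, D, H, W)
    # Zero-pad so boundary voxels see empty neighbors outside the grid.
    padded = F.pad(occ_f, (1, 1, 1, 1, 1, 1))           # pads W, H, D by 1 on each side
    shifts = [
        padded[:, :, :-2, 1:-1, 1:-1],   # neighbor at z-1
        padded[:, :, 2:, 1:-1, 1:-1],    # neighbor at z+1
        padded[:, :, 1:-1, :-2, 1:-1],   # neighbor at y-1
        padded[:, :, 1:-1, 2:, 1:-1],    # neighbor at y+1
        padded[:, :, 1:-1, 1:-1, :-2],   # neighbor at x-1
        padded[:, :, 1:-1, 1:-1, 2:],    # neighbor at x+1
    ]
    neighbors = torch.cat(shifts, dim=1) > 0.5          # (N, 6, D, H, W)
    # A face is kept iff its voxel is occupied and the neighbor in that direction is not.
    return occ.unsqueeze(1) & ~neighbors
```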
B. Mesh Sampling

As described in the main paper, the mesh refinement head is trained to minimize chamfer and normal losses that are defined on sets of points sampled from the predicted and ground-truth meshes.

Computing these losses requires some method of converting meshes into sets of sampled points. Pixel2Mesh [69] is trained using similar losses. In their case ground-truth meshes are represented with points sampled uniformly at random from the surface of the mesh, but this sampling is performed offline before the start of training; they represent predicted meshes using their vertex positions. Computing these losses using vertex positions of predicted meshes is very efficient since it avoids the need to sample meshes online during training; however it can lead to degenerate predictions since the loss would not encourage the interior of predicted faces to align with the ground-truth mesh.

To avoid these potential degeneracies, we follow [57] and compute the chamfer and normal losses by randomly sampling points from both the predicted and ground-truth meshes. This means that we need to sample the predicted meshes online during training, so the sampling must be efficient and we must be able to backpropagate through the sampling procedure to propagate gradients backward from the sampled points to the predicted vertex positions.

Given a mesh with vertices V ⊂ R³ and faces F ⊆ V × V × V, we can sample a point uniformly from the surface of the mesh as follows. We first define a probability distribution over faces where each face's probability is proportional to its area:

    P(f) = area(f) / Σ_{f′∈F} area(f′)    (3)

We then sample a face f = (v1, v2, v3) from this distribution. Next we sample a point p uniformly from the interior of f by setting p = Σi wi vi, where w1 = 1 − √ξ1, w2 = (1 − ξ2)√ξ1, w3 = ξ2√ξ1, and ξ1, ξ2 ∼ U(0, 1) are sampled from a uniform distribution.

This formulation allows propagating gradients from p backward to the face vertices vi and can be seen as an instance of the reparameterization trick [28].
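A minimal PyTorch sketch of this sampling procedure is given below (our own rendering of the steps above, for a single unbatched mesh). The discrete face choice itself is not differentiable, but gradients flow from the sampled points back to the vertex positions through the barycentric combination.

```python
import torch

def sample_points_from_mesh(verts, faces, num_samples):
    """Uniform, differentiable surface sampling (sketch).

    verts: (V, 3) float tensor; faces: (F, 3) long tensor; returns (num_samples, 3).
    """
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]

    # Face areas give the unnormalized sampling weights of Eq. (3).
    areas = 0.5 * torch.cross(v1 - v0, v2 - v0, dim=1).norm(dim=1)
    face_idx = torch.multinomial(areas, num_samples, replacement=True)

    # Uniform point in each chosen triangle via the sqrt reparameterization.
    xi1, xi2 = torch.rand(num_samples, 1), torch.rand(num_samples, 1)
    w0 = 1.0 - xi1.sqrt()
    w1 = (1.0 - xi2) * xi1.sqrt()
    w2 = xi2 * xi1.sqrt()
    return w0 * v0[face_idx] + w1 * v1[face_idx] + w2 * v2[face_idx]
```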
C. Mesh R-CNN Architecture

At a high level we use the same overall architecture for predicting meshes on ShapeNet and Pix3D, but we slightly specialize to each dataset due to memory constraints from the backbone and task-specific heads. On Pix3D, Mask R-CNN adds time and memory complexity in order to perform object detection and instance segmentation.

ShapeNet. The overall architecture of our ShapeNet model is shown in Table 6; the architecture of the voxel branch is shown in Table 8.
Index  Inputs            Operation                    Output shape
(1)    Input             conv2_3 features             35 × 35 × 256
(2)    Input             conv3_4 features             18 × 18 × 512
(3)    Input             conv4_6 features             9 × 9 × 1024
(4)    Input             conv5_3 features             5 × 5 × 2048
(5)    Input             Input vertex features        |V| × 128
(6)    Input             Input vertex positions       |V| × 3
(7)    (1), (6)          VertAlign                    |V| × 256
(8)    (2), (6)          VertAlign                    |V| × 512
(9)    (3), (6)          VertAlign                    |V| × 1024
(10)   (4), (6)          VertAlign                    |V| × 2048
(11)   (7),(8),(9),(10)  Concatenate                  |V| × 3840
(12)   (11)              Linear(3840 → 128)           |V| × 128
(13)   (5), (6), (12)    Concatenate                  |V| × 259
(14)   (13)              ResGraphConv(259 → 128)      |V| × 128
(15)   (14)              2× ResGraphConv(128 → 128)   |V| × 128
(16)   (15)              GraphConv(128 → 3)           |V| × 3
(17)   (16)              Tanh                         |V| × 3
(18)   (6), (17)         Addition                     |V| × 3

Table 5. Architecture for a single residual mesh refinement stage on ShapeNet. For ShapeNet we follow [69] and use residual blocks of graph convolutions: ResGraphConv(D1 → D2) consists of two graph convolution layers (each preceded by ReLU) and an additive skip connection, with a linear projection if the input and output dimensions are different. The output of the refinement stage are the vertex features (15) and the updated vertex positions (18). The first refinement stage does not take input vertex features (5), so for this stage (13) only concatenates (6) and (12).

Index  Inputs                    Operation               Output shape
(1)    Input                     Image                   137 × 137 × 3
(2)    (1)                       ResNet-50 conv2_3       35 × 35 × 256
(3)    (2)                       ResNet-50 conv3_4       18 × 18 × 512
(4)    (3)                       ResNet-50 conv4_6       9 × 9 × 1024
(5)    (4)                       ResNet-50 conv5_3       5 × 5 × 2048
(6)    (5)                       Bilinear interpolation  24 × 24 × 2048
(7)    (6)                       Voxel Branch            48 × 48 × 48
(8)    (7)                       cubify                  |V| × 3, |F| × 3
(9)    (2), (3), (4), (5), (8)   Refinement Stage 1      |V| × 3, |F| × 3
(10)   (2), (3), (4), (5), (9)   Refinement Stage 2      |V| × 3, |F| × 3
(11)   (2), (3), (4), (5), (10)  Refinement Stage 3      |V| × 3, |F| × 3

Table 6. Overall architecture for our ShapeNet model. Since we do not predict bounding boxes or masks, we feed the conv5_3 features from the whole image into the voxel branch. The architecture for the refinement stage is shown in Table 5, and the architecture for the voxel branch is shown in Table 8.

Index  Inputs            Operation               Output shape
(1)    Input             conv2_3 features        35 × 35 × 256
(2)    Input             conv3_4 features        18 × 18 × 512
(3)    Input             conv4_6 features        9 × 9 × 1024
(4)    Input             conv5_3 features        5 × 5 × 2048
(5)    Input             Input vertex features   |V| × 128
(6)    Input             Input vertex positions  |V| × 3
(7)    (1), (6)          VertAlign               |V| × 256
(8)    (2), (6)          VertAlign               |V| × 512
(9)    (3), (6)          VertAlign               |V| × 1024
(10)   (4), (6)          VertAlign               |V| × 2048
(11)   (7),(8),(9),(10)  Concatenate             |V| × 3840
(12)   (11)              Linear(3840 → 128)      |V| × 128
(13)   (5), (6), (12)    Concatenate             |V| × 259
(14)   (13)              GraphConv(259 → 128)    |V| × 128
(15)   (6), (14)         Concatenate             |V| × 131
(16)   (15)              GraphConv(131 → 128)    |V| × 128
(17)   (6), (16)         Concatenate             |V| × 131
(18)   (17)              GraphConv(131 → 128)    |V| × 128
(19)   (18)              Linear(128 → 3)         |V| × 3
(20)   (19)              Tanh                    |V| × 3
(21)   (6), (20)         Addition                |V| × 3

Table 7. Architecture for the nonresidual mesh refinement stage used in the lightweight version of our ShapeNet models. Each GraphConv operation is followed by ReLU. The output of the stage are the vertex features (18) and updated vertex positions (21). The first refinement stage does not take input vertex features (5), so for this stage (13) only concatenates (6) and (12).

Index  Inputs  Operation                         Output shape
(1)    Input   Image features                    V/2 × V/2 × D
(2)    (1)     Conv(D → 256, 3 × 3), ReLU        V/2 × V/2 × 256
(3)    (2)     Conv(256 → 256, 3 × 3), ReLU      V/2 × V/2 × 256
(4)    (3)     TConv(256 → 256, 2 × 2, 2), ReLU  V × V × 256
(5)    (4)     Conv(256 → V, 1 × 1)              V × V × V

Table 8. Architecture of our voxel prediction branch. For ShapeNet we use V = 48 and for Pix3D we use V = 24. TConv is a transpose convolution with stride 2.
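For reference, a plausible PyTorch rendering of Table 8 is given below. This is our own sketch under the assumptions stated in the table, not the released code; the (B, V, V, V) output is interpreted as a column of V occupancy scores per spatial position, i.e. a V × V × V voxel grid.

```python
import torch.nn as nn

def make_voxel_head(in_channels, V):
    """Voxel prediction branch sketch: two 3x3 convs, a stride-2 transpose conv
    (V/2 -> V spatial resolution), and a 1x1 conv producing V output channels."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2), nn.ReLU(inplace=True),
        nn.Conv2d(256, V, kernel_size=1),
    )
```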
We consider two different architectures for the mesh refinement network on ShapeNet. Our full model as well as our Pixel2Mesh+ and Sphere-Init baselines use mesh refinement stages with three residual blocks of two graph convolutions each, similar to [69]; the architecture of these stages is shown in Table 5. We also consider a shallower lightweight design which uses only three graph convolution layers per stage, omitting residual connections and instead concatenating the input vertex positions before each graph convolution layer. The architecture of this lightweight design is shown in Table 7.

As shown in Table 2, we found that these two architectures perform similarly on ShapeNet even though the lightweight design uses half as many graph convolution layers per stage. We therefore use the nonresidual design for our Pix3D models.

Pix3D. The overall architecture of our full Mesh R-CNN system on Pix3D is shown in Table 9. The backbone, RPN, box branch, and mask branch are identical to Mask R-CNN [18]. The voxel branch is the same as in the ShapeNet models (see Table 8), except that we predict voxels at a lower resolution (48 × 48 × 48 for ShapeNet vs. 24 × 24 × 24 for Pix3D) due to memory constraints. Table 10 shows the exact architecture of the mesh refinement stages for our Pix3D models.

Baselines. The Voxel-Only baseline is identical to the full model, except that it omits all mesh refinement branches and terminates with the mesh resulting from cubify. On ShapeNet, the Voxel-Only baseline is trained with a batch size of 64 (vs. a batch size of 32 for our full model); on Pix3D it uses the same training recipe as our full model.
Index  Inputs    Operation                                                                                   Output shape
(1)    Input     Input Image                                                                                 H × W × 3
(2)    (1)       Backbone: ResNet-50-FPN                                                                     h × w × 256
(3)    (2)       RPN                                                                                         h × w × A × 4
(4)    (2), (3)  RoIAlign                                                                                    14 × 14 × 256
(5)    (4)       Box branch: 2× downsample, Flatten, Linear(7*7*256 → 1024), Linear(1024 → 5C)               C × 5
(6)    (4)       Mask branch: 4× Conv(256 → 256, 3 × 3), TConv(256 → 256, 2 × 2, 2), Conv(256 → C, 1 × 1)    28 × 28 × C
(7)    (2), (3)  RoIAlign                                                                                    12 × 12 × 256
(8)    (7)       Voxel Branch                                                                                24 × 24 × 24
(9)    (8)       cubify                                                                                      |V| × 3, |F| × 3
(10)   (7), (9)  Refinement Stage 1                                                                          |V| × 3, |F| × 3
(11)   (7), (10) Refinement Stage 2                                                                          |V| × 3, |F| × 3
(12)   (7), (11) Refinement Stage 3                                                                          |V| × 3, |F| × 3

Table 9. Overall architecture of Mesh R-CNN on Pix3D. The backbone, RPN, box, and mask branches are identical to Mask R-CNN. The RPN produces a bounding box prediction for each of the A anchors at each spatial location in the input feature map; a subset of these candidate boxes are processed by the other branches, but here we show only the shapes resulting from processing a single box for the subsequent task-specific heads. Here C is the number of categories (10 = 9 + background for Pix3D); the box branch produces per-category bounding boxes and classification scores, while the mask branch produces per-category segmentation masks. TConv is a transpose convolution with stride 2. We use a ReLU nonlinearity between all Linear, Conv, and TConv operations. The architecture of the voxel branch is shown in Table 8, and the architecture of the refinement stages is shown in Table 10.

Index  Inputs         Operation               Output shape
(1)    Input          Backbone features       h × w × 256
(2)    Input          Input vertex features   |V| × 128
(3)    Input          Input vertex positions  |V| × 3
(4)    (1), (3)       VertAlign               |V| × 256
(5)    (2), (3), (4)  Concatenate             |V| × 387
(6)    (5)            GraphConv(387 → 128)    |V| × 128
(7)    (3), (6)       Concatenate             |V| × 131
(8)    (7)            GraphConv(131 → 128)    |V| × 128
(9)    (3), (8)       Concatenate             |V| × 131
(10)   (9)            GraphConv(131 → 128)    |V| × 128
(11)   (3), (10)      Concatenate             |V| × 131
(12)   (11)           Linear(131 → 3)         |V| × 3
(13)   (12)           Tanh                    |V| × 3
(14)   (3), (13)      Addition                |V| × 3

Table 10. Architecture for a single mesh refinement stage on Pix3D.

The Pixel2Mesh+ baseline is our reimplementation of [69]. This baseline omits the voxel branch; instead all images use an identical initial mesh. The initial mesh is a level-2 icosphere with 162 vertices, 320 faces, and 480 edges, which results from applying two face subdivision operations to a regular icosahedron and projecting all resulting vertices onto a sphere. For the Pixel2Mesh+ baseline, the mesh refinement stages are the same as our full model, except that we apply a face subdivision operation prior to VertAlign in refinement stages 2 and 3.

Like Pixel2Mesh+, the Sphere-Init baseline omits the voxel branch and uses an identical initial sphere mesh for all images. However, unlike Pixel2Mesh+ the initial mesh is a level-4 icosphere with 2562 vertices, 5120 faces, and 7680 edges, which results from applying four face subdivision operations to a regular icosahedron. Due to this large initial mesh, the mesh refinement stages are identical to our full model, and do not use mesh subdivision.

Pixel2Mesh+ and Sphere-Init both predict meshes with the same number of vertices and faces, and with identical topologies; the only difference between them is whether all subdivision operations are performed before the mesh refinement branch (Sphere-Init) or whether mesh refinement is interleaved with mesh subdivision (Pixel2Mesh+). On ShapeNet, the Pixel2Mesh+ and Sphere-Init baselines are trained with a batch size of 96; on Pix3D they use the same training recipe as our full model.
D. Comparison with Occupancy Networks

Occupancy Networks [42] (OccNet) also predict 3D meshes with neural networks. Rather than outputting a mesh directly from the neural network as in our approach, they train a neural network to compute a signed distance between a query point in 3D space and the object boundary. At test time a 3D mesh can be extracted from a set of query points. Like our approach, OccNets can also predict meshes with varying topology per input instance.

Table 11 compares our approach with OccNet on the ShapeNet test set. We obtained test-set predictions for OccNet from the authors. Our method and OccNet are trained on slightly different splits of the ShapeNet dataset, so we compare our methods on the intersection of our respective test splits. From Table 11 we see that OccNets achieve higher normal consistency than our approach; however both the Best and Pretty versions of our model outperform OccNets on all other metrics.

                      Chamfer(↓)  Normal  F1^0.1  F1^0.3  F1^0.5  |V|       |F|
        OccNet [42]   0.264       0.789   33.4    80.5    91.3    2499±60   4995±120
Best    Ours (light)  0.135       0.725   38.9    86.7    95.0    1978±951  3958±1906
        Ours          0.139       0.728   38.3    86.3    94.9    1985±960  3971±1924
Pretty  Ours (light)  0.185       0.696   34.3    82.0    92.8    1976±956  3954±1916
        Ours          0.180       0.709   34.6    82.2    93.0    1982±961  3967±1926

Table 11. Comparison between our method and Occupancy Networks (OccNet) [42] on ShapeNet. We use the same evaluation metrics and setup as Table 2.

E. Depth Extent Prediction

Predicting an object's depth extent from a single image is an ill-posed problem. In an earlier version of our work, we assumed the range of an object in the Z-axis was given at train & test time. Since then, we have attempted to predict the depth extent by training a 2-layer MLP head of similar architecture to the bounding box regressor head. Formally, this head is trained to predict the scale-normalized depth extent (in log space) of the object, as follows:

    d̄z = (dz · f) / (zc · h)    (4)

Note that the depth extent of an object is related to the size of the object (here approximated by the object's bounding box height h), its location zc in the Z-axis (far away objects need to be bigger in order to explain the image) and the focal length f. At inference time the depth extent dz of the object is recovered from the predicted d̄z and the predicted height of the object bounding box h, given the focal length f and the center of the object zc in the Z-axis. Note that we assume the center of the object zc is given, since Pix3D annotations are not metric and due to the inherent scale-depth ambiguity.
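The inference-time inversion of Equation (4) is simple; the sketch below spells it out (our own helper, assuming the head outputs the log of the scale-normalized depth extent as described above).

```python
import torch

def recover_depth_extent(log_dz_bar, box_height, focal_length, z_center):
    """Recover the depth extent dz from the predicted log(dz_bar).

    Eq. (4): dz_bar = dz * f / (zc * h)  =>  dz = exp(log_dz_bar) * zc * h / f
    """
    return torch.exp(log_dz_bar) * z_center * box_height / focal_length
```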
F. Pix3D: Visualizations and Comparisons

Figure 9 shows qualitative comparisons between Pixel2Mesh+ and Mesh R-CNN. Pixel2Mesh+ is limited to making predictions homeomorphic to spheres and thus cannot capture varying topologies, e.g. holes. In addition, Pixel2Mesh+ has a hard time capturing high curvatures, such as sharp table tops and legs. This is due to the large deformations required when starting from a sphere, which are not encouraged by the shape regularizers. On the other hand, Mesh R-CNN initializes its shapes with cubified voxel predictions resulting in better initial shape representations which require less drastic deformations.

Figure 9. Qualitative comparisons between Pixel2Mesh+ and Mesh R-CNN on Pix3D. Each row shows the same example for Pixel2Mesh+ (first three columns) and Mesh R-CNN (last three columns), respectively. For each method, we show the input image along with the predicted 2D mask (chair, bookcase, table, bed) and box (in green) superimposed. We show the 3D mesh rendered on the input image and an additional view of the 3D mesh.

G. ShapeNet Holes test set

We construct the ShapeNet Holes Test set by selecting models from the ShapeNet test set that have visible holes from any viewpoint. Figure 10 shows several input images for randomly selected models from this subset. This test set is very challenging – many objects have small holes resulting from thin structures, and some objects have holes which are not visible from all viewpoints.

Figure 10. Example input images for randomly selected models from the Holes Test Set on ShapeNet. For each model we show three different input images showing the model from different viewpoints. This set is extremely challenging – some models may have very small holes (such as the holes in the back of the chair in the left model of the first row, or the holes on the underside of the table on the right model of row 2), and some models may have holes which are not visible in all input images (such as the green chair in the middle of the fourth row, or the gray desk on the right of the ninth row).
References
[8] Haoqiang Fan, Hao Su, and Leonidas J. Guibas. A point
[1] Sid Yingze Bao, Manmohan Chandraker, Yuanqing Lin, and set generation network for 3d object reconstruction from a
Silvio Savarese. Dense object reconstruction with semantic single image. In CVPR, 2017. 1, 2, 5, 7
Figure 10. Example input images for randomly selected models from the Holes Test Set on ShapeNet. For each model we show three different input images showing the model from different viewpoints. This set is extremely challenging – some models may have very small holes (such as the holes in the back of the chair in the left model of the first row, or the holes on the underside of the table on the right model of row 2), and some models may have holes which are not visible in all input images (such as the green chair in the middle of the fourth row, or the gray desk on the right of the ninth row).