Implicit-PDF: Non-Parametric Representation of Probability Distributions On The Rotation Manifold
Kieran Murphy * 1 Carlos Esteves * 1 Varun Jampani 1 Srikumar Ramalingam 1 Ameesh Makadia 1
Figure 1. We introduce a method to predict arbitrary distributions over the rotation manifold. This is particularly useful for pose estimation
of symmetric and nearly symmetric objects, since output distributions can include both uncertainty on the estimation and the symmetries
of the object. a-top: The cube has 24 symmetries, which are represented by 24 points on SO(3), and all modes are correctly inferred by
our model. a-bottom: The cylinder has a continuous symmetry around one axis, which traces a cycle on SO(3). It also has a discrete
2-fold symmetry (a “flip”), so the distribution is represented as two cycles. The true pose distribution for the vase depicted on the left
would trace a single cycle on SO(3) since it does not have a flip symmetry. b: This cylinder has a mark that uniquely identifies its pose,
when visible (top). When the mark is not visible (bottom), our model correctly distributes the probability over poses where the mark is
invisible. This example is analogous to a coffee cup when the handle is not visible. The resulting intricate distribution cannot be easily
approximated with usual unimodal or mixture distributions on SO(3), but is easily handled by our IPDF model. Visualization: Points
with non-negligible probability are displayed as dots on the sphere according to their first canonical axis, colored according to the rotation
about that axis. The ground truth (used for evaluation only, not training) is shown as a solid outline. Refer to Section 3.5 for more details.
monly used parametric distributions on SO(3) that require complicated approximations for computing the normalizing term and further are not flexible enough to fit complex distributions accurately (Gilitschenski et al., 2019; Deng et al., 2020; Mohlin et al., 2020). Our primary contributions are

• Implicit-PDF, a novel approach for modeling non-parametric distributions on the rotation manifold. Our implicit representation can be applied to realistic, challenging pose estimation problems where uncertainty can arise from approximate or exact symmetries, self-occlusion, and noise. We propose different sampling strategies which allow us to both efficiently reconstruct full distributions on SO(3) and generate multiple pose candidates with continuous precision.

• SYMSOL, a new dataset with inherent ambiguities for analyzing pose estimation with uncertainty. The dataset contains shapes with high order of symmetry, as well as nearly-symmetric shapes, that challenge probabilistic approaches to accurately learn complex pose distributions. When possible, objects are paired with their ground truth “symmetry maps”, which allows quantitative evaluation of predicted distributions.

Our IPDF method is extensively evaluated on the new SYMSOL dataset as well as traditional pose estimation benchmarks. To aid our analysis, we develop a novel visualization method for distributions on SO(3) that provides an intuitive way to qualitatively assess predicted distributions. Through evaluation of predicted distributions and poses, we obtain a broad assessment of our method: IPDF is the only technique that can consistently and accurately recover the complex pose uncertainty distributions arising from a high degree of symmetry or self-occlusion, while being supervised by only a single pose per example. Further, while IPDF has the expressive power to model non-trivial distributions, it does not sacrifice the ability to predict poses in non-ambiguous situations, and reaches state-of-the-art performance with the usual metrics on many categories of Pascal3D+ (Xiang et al., 2014) and ModelNet10-SO(3) (Liao et al., 2019).

2. Related work

Symmetries are plentiful in our natural and human-made worlds, and so it is not surprising there is a history in computer vision of exploiting strong priors or assumptions on shape or texture symmetry to recover 3D structure from a single image (Poggio & Vetter, 1992; Hong et al., 2004; Rothwell et al., 1993). However, among the more recent machine learning approaches for pose estimation, symmetries are treated as nuisances and strategies have been developed to utilize symmetry annotations at training. With known symmetries at training, a canonical normalization of rotation space unambiguously resolves each set of equivalent rotations to a single one, allowing training to proceed as in single-valued regression (Pitteri et al., 2019). In Corona et al.
(2018), manually annotated symmetries on 3D shapes are required to jointly learn an image embedding and classification of the object’s symmetry order. Learning representations that cover a few specific symmetry classes is considered in Saxena et al. (2009).

In contrast to these works, Sundermeyer et al. (2019) make pose or symmetry supervision unnecessary by using a denoising autoencoder to isolate pose information. Neither Sundermeyer et al. (2019) nor Corona et al. (2018) directly predict pose, and thus require comparing against many rendered images of the same exact object for pose inference. In a similar vein, Okorn et al. (2020) use a learned comparison against a dictionary of images to construct a histogram over poses. Deng et al. (2019) propose a particle filter framework for 6D object pose tracking, where each particle represents a discrete distribution over SO(3) with 191K bins. Similar to the previously mentioned works, this discrete rotation likelihood is estimated by codebook matching, and an autoencoder is trained to generate the codes.

As noted earlier, symmetries are not the only source of pose uncertainty. Aiming to utilize more flexible representations, a recent direction of work has looked to directional statistics (Mardia & Jupp, 2000) to consider parametric probability distributions. Regressing to the parameters of a von Mises distribution over (Euler) angles (Prokudin et al., 2018), as well as to the Bingham (Peretroukhin et al., 2020; Deng et al., 2020; Gilitschenski et al., 2019) and Matrix Fisher distributions (Mohlin et al., 2020) over SO(3), has been proposed. Since it is preferable to train these probabilistic models with a likelihood loss, the distribution’s normalizing term must be computed, which is itself a challenge (it is a hypergeometric function of a matrix argument for Bingham and Matrix Fisher distributions). Gilitschenski et al. (2019) and Deng et al. (2020) approximate this function and its gradient via interpolation in a lookup table, Mohlin et al. (2020) use a hand-crafted approximation scheme to compute the gradient, and Peretroukhin et al. (2020) simply forgo the likelihood loss. In the simplest setting these models are unimodal, and thus ill-equipped to deal with non-trivial distributions. To this end, Prokudin et al. (2018), Gilitschenski et al. (2019), and Deng et al. (2020) propose using multimodal mixture distributions. One challenge to training the mixtures is avoiding mode collapse, for which a winner-take-all strategy can be used (Deng et al., 2020). An alternative to the mixture models is to directly predict multiple pose hypotheses (Manhardt et al., 2019), but this does not share any of the benefits of a probabilistic representation.

Bayesian deep learning provides a general framework to reason about model uncertainty, and in Kendall & Cipolla (2016) test-time dropout (Gal & Ghahramani, 2016) was used to approximate Bayesian inference for camera relocalization. Inference with random dropout applied to the trained model is used to generate Monte Carlo pose samples, and thus this approach does not offer a way to estimate the density at arbitrary poses (sampling large numbers of poses would also be impractical).

An alternative framework for representing arbitrary complex distributions is Normalizing Flows (Rezende & Mohamed, 2015). In principle, the reparameterization trick for Lie groups introduced in Falorsi et al. (2019) allows for constructing flows on the Lie algebra of SO(3). Rezende et al. (2020) develop normalizing flows for compact and connected differentiable manifolds; however, it is still unclear how to effectively construct flows on non-Euclidean manifolds, and so far there has been little evidence of a successful application to realistic problems at the complexity of learning arbitrary distributions on SO(3).

The technical design choices of our implicit pose model are inspired by the very successful implicit shape (Mescheder et al., 2019) and scene (Mildenhall et al., 2020) representations, which can represent detailed geometry with a multilayer perceptron that takes a low-dimensional position and/or directions as inputs.

We introduce the details of our approach next.

3. Methods

The method centers upon a multilayer perceptron (MLP) which implicitly represents probability distributions over SO(3). The input to the MLP is a pair comprising a rotation and a visual representation of an image obtained using a standard feature extractor such as a residual network; the output is an unnormalized log probability. Roughly speaking, we construct the distribution for a given image by populating the space of rotations with such queries, and then normalizing the probabilities. This procedure is highly parallelizable and efficient (see Supp. for time ablations). In the following we provide details for the key ingredients of our method.

3.1. Formalism

Our goal is, given an input x ∈ X (for example, an image), to obtain a conditional probability distribution p(·|x) : SO(3) → R+ that represents the pose of x. We achieve this by training a neural network to estimate the unnormalized joint log probability function f : X × SO(3) → R. Let α be the normalization term such that p(x, R) = α exp(f(x, R)), where p is the joint distribution. The computation of α is infeasible, requiring integration over X.

From the product rule, p(R|x) = p(x, R)/p(x). We estimate p(x) by marginalizing over SO(3), and since SO(3) is low-dimensional, we approximate the integral with a discrete sum as follows,

    p(x) = ∫_{SO(3)} p(x, R) dR
         = α ∫_{SO(3)} exp(f(x, R)) dR
         ≈ α Σ_{i=1}^{N} exp(f(x, R_i)) V,    (1)

where the {R_i} are centers of an equivolumetric partitioning of SO(3) with N partitions of volume V = π²/N (see Section 3.4 for details). Now α cancels out in the expression for p(R|x), giving

    p(R|x) ≈ (1/V) · exp(f(x, R)) / Σ_{i=1}^{N} exp(f(x, R_i)),    (2)

where all the RHS terms are obtained from querying the neural network.

During training, the model receives pairs of inputs x and corresponding ground truth R, and the objective is to maximize p(R|x). See Section 3.3 for details.

Inference – single pose. To make a single pose prediction, we solve

    R*_x = argmax_{R ∈ SO(3)} f(x, R),    (3)

with gradient ascent, since f is differentiable. The initial guess comes from evaluating a grid {R_i}. Since the domain of this optimization problem is SO(3), we project the values back onto the manifold after each gradient ascent step.

Inference – full distribution. Alternatively, we may want to predict a full probability distribution. In this case p(R_i|x) is evaluated over the SO(3) equivolumetric partition {R_i}. This representation allows us to reason about uncertainty and observe complex patterns of symmetries and near-symmetries.

Our method can estimate intricate distributions on the manifold without direct supervision of such distributions. By learning to maximize the likelihood of a single ground truth pose per object over a dataset, with no prior knowledge of each object’s symmetries, appropriate patterns expressing symmetries and uncertainty naturally emerge in our model’s outputs, as shown in Fig. 1.

3.2. Network

Inspired by recent breakthroughs in implicit shape and scene representations (Mescheder et al., 2019; Park et al., 2019; Sitzmann et al., 2019), we adopt a multilayer perceptron (MLP) to implicitly represent the pose distribution. Differently from most implicit models, we train a single model to represent the pose of any instance of multiple categories, so an input descriptor (e.g., pre-trained CNN features for image inputs) is also fed to the MLP, which we produce with a pre-trained ResNet (He et al., 2015). Most implicit representation methods for shapes and scenes take a position in Euclidean space and/or a viewing direction as inputs. In our case, we take an arbitrary 3D rotation, so we must revisit the longstanding question of how to represent rotations (Levinson et al., 2020). We found it best to use a 3 × 3 rotation matrix to avoid discontinuities present in other representations (Saxena et al., 2009). Following Mildenhall et al. (2020), we found positionally encoding each element of the input to be beneficial. See the supplement for ablative studies on these design choices.

3.3. Loss

We train our model by minimizing the predicted negative log-likelihood of the (single) ground truth pose. This requires normalizing the output distribution, which we approximate by evaluating Eq. (2) using the method described in Section 3.4 to obtain an equivolumetric grid over SO(3), in which case the normalization is straightforward. During training, we rotate the grid such that R_0 coincides with the ground truth. Then, we evaluate p(R_0|x) as in Eq. (2), and the loss is simply

    L(x, R_0) = −log p(R_0|x).    (4)

We noticed that the method is robust enough to be trained without an equivolumetric grid; evaluating Eq. (2) for randomly sampled R_i ∈ SO(3), provided that one of them coincides with the ground truth, works similarly well. The equivolumetric partition is still required during inference for accurate representation of the probabilities.

3.4. Sampling the rotation manifold

Training and producing an estimate of the most likely pose does not require precise normalization of the probabilities predicted by the network. However, when the distribution is the object of interest (e.g., an accurate distribution will be used in a downstream task), we can normalize by evaluating on a grid of points with equal volume in SO(3) and approximating the distribution as a histogram.

We employ a method of generating equivolumetric grids developed by Yershova et al. (2010), which uses as its starting point the HEALPix method of generating equal-area grids on the 2-sphere (Gorski et al., 2005). A useful property of this sampling is that it is generated hierarchically, permitting multi-resolution sampling if desired.

The Hopf fibration is leveraged to cover SO(3) by threading a great circle through each point on the surface of a 2-sphere.
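The normalization of Eq. (2), the loss of Eq. (4), and the manifold projection used in the gradient-ascent inference of Eq. (3) are simple to sketch in code. The snippet below is an illustrative sketch, not the authors' released implementation: `f` stands in for the MLP, a random rotation grid stands in for the equivolumetric grid of Section 3.4 (which, per Section 3.3, suffices for training), and since the text does not specify the projection step, we use the standard SVD-based nearest-rotation (orthogonal Procrustes) projection.

```python
import numpy as np

def random_rotations(n, rng):
    # Stand-in for the equivolumetric grid: Haar-uniform random rotations
    # from the QR decomposition of Gaussian matrices.
    rots = []
    for _ in range(n):
        q, r = np.linalg.qr(rng.standard_normal((3, 3)))
        q = q * np.sign(np.diag(r))      # fix the sign ambiguity of QR
        if np.linalg.det(q) < 0:
            q[:, [0, 1]] = q[:, [1, 0]]  # flip an improper rotation
        rots.append(q)
    return np.stack(rots)

def log_prob(f, x, R, grid):
    # Eq. (2): p(R|x) ~= exp(f(x, R)) / (V * sum_i exp(f(x, R_i))),
    # with V = pi^2 / N the cell volume (SO(3) has total volume pi^2).
    logits = np.array([f(x, Ri) for Ri in grid])
    V = np.pi ** 2 / len(grid)
    m = logits.max()                     # stable log-sum-exp
    log_norm = np.log(V) + m + np.log(np.exp(logits - m).sum())
    return f(x, R) - log_norm

def nll_loss(f, x, R_gt, grid):
    # Eq. (4): negative log-likelihood of the single ground-truth pose;
    # during training the grid is rotated so one point equals R_gt.
    return -log_prob(f, x, R_gt, grid)

def project_to_so3(M):
    # Nearest rotation matrix via SVD: one way to map a gradient-ascent
    # iterate of Eq. (3) back onto the manifold.
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    return U @ D @ Vt
```

Because α cancels in Eq. (2), the discrete probabilities `exp(log_prob) * V` summed over the grid equal exactly 1, which is a useful sanity check.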
Figure 3. IPDF predicted distributions for SYMSOL. (a) The cone has one great circle of equivalent orientations under symmetry. (b)
The 60 modes of icosahedral symmetry would be exceedingly difficult for a mixture density network based approach, but IPDF can get
quite close (we omit the ground truths from the left and middle visualizations for clarity). (c) The marked tetrahedron (“tetX”) has one red
face. When it is visible, the 12-fold tetrahedral symmetry reduces to only three equivalent rotations. With less information about the
location of the red face, more orientations are possible: 6 when two white faces are visible (middle) and 9 when only one white face is
visible (right). (d) The orientation of the marked sphere (“sphereX”) is unambiguous when both markings are visible (left). When they
are not (middle), all orientations with the markings on the hidden side of the sphere are possible. When only a portion of the markings are
visible (right; inset is a magnification showing several pixels of the X are visible), the IPDF distribution captures the partial information.
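The discrete symmetry sets mentioned above (the tetrahedron's 12 rotations, the cube's 24, the icosahedron's 60) can be enumerated directly as the rotation groups of the solids. As an illustration (this is not SYMSOL's generation code), the cube's group is exactly the signed permutation matrices with determinant +1:

```python
import itertools
import numpy as np

def cube_symmetries():
    # The cube's 24 rotational symmetries: 3x3 signed permutation
    # matrices with determinant +1.
    group = []
    for perm in itertools.permutations(range(3)):
        for signs in itertools.product((1.0, -1.0), repeat=3):
            R = np.zeros((3, 3))
            for row, col in enumerate(perm):
                R[row, col] = signs[row]
            if np.isclose(np.linalg.det(R), 1.0):
                group.append(R)
    return group
```

Of the 6 × 8 = 48 signed permutations, exactly half are proper rotations, giving the 24 modes visible in Figure 1a.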
surfaces of revolution (cone, cylinder), with 100,000 renderings of each shape from poses sampled uniformly at random from SO(3). Each image is paired with its ground truth symmetries (the set of rotations of the source object that would not change the image), which are easily derived for these shapes. As would be the case in most practical situations, where symmetries are not known and/or only approximate, we use such annotations only for evaluation and not for training. Access to the full set of equivalent rotations opens new avenues of evaluating model performance rarely possible with pose estimation datasets.

While the textureless solids generate a challenging variety of distributions, they can still be approximated with mixtures of simple unimodal distributions such as the Bingham (Deng et al., 2020; Gilitschenski et al., 2019). We go one step further and break the symmetry of objects by texturing them with small markers (SYMSOL II). When the marker is visible, the pose distribution is no longer ambiguous and collapses given the extra information. When the marker is not visible, only a subspace of the symmetric rotations for the textureless shape is possible.

For example, consider a textureless sphere. Its pose distribution is uniform – rotations will not change the input image. Now suppose we mark this sphere with a small arrow. If the arrow is visible, the pose distribution collapses to an impulse. If the arrow is not visible, the distribution is no longer uniform, since about half of the space of possible rotations can now be eliminated. This distribution cannot be easily approximated by mixtures of unimodals.

SYMSOL II objects include a sphere marked with a small letter “X” capped with a dot to break flip symmetry when visible (sphX), a tetrahedron with one red and three white faces (tetX), and a cylinder marked with a small filled off-centered circle (cylO). We render 100,000 images for each.

The two SYMSOL datasets test expressiveness, but the solids are relatively simple and the dataset does not require generalization to unseen objects. ModelNet10-SO(3) was introduced by Liao et al. (2019) to study pose estimation on rendered images of CAD models from ModelNet10 (Wu et al., 2015). As in SYMSOL, the rotations of the objects cover all of SO(3) and therefore present a difficulty for methods that rely on particular rotation formats such as Euler angles (Liao et al., 2019; Prokudin et al., 2018).

The Pascal3D+ dataset (Xiang et al., 2014) is a popular benchmark for pose estimation on real images, consisting
Table 1. Distribution estimation on SYMSOL I and II. We report the average log likelihood on both parts of the SYMSOL dataset, as
a measure for how well the multiple equivalent ground truth orientations are represented by the output distribution. For reference, a
minimally informative uniform distribution over SO(3) has an average log likelihood of -2.29. IPDF’s expressivity allows it to more
accurately represent the complicated pose distributions across all of the shapes. A separate model was trained for each shape for all
baselines and for all of SYMSOL II, but only a single IPDF model was trained on all five shapes of SYMSOL I.
                              SYMSOL I (log likelihood ↑)               SYMSOL II (log likelihood ↑)
                              avg.   cone   cyl.   tet.   cube   ico.   avg.   sphX   cylO   tetX
Deng et al. (2020)           −1.48   0.16  −0.95   0.27  −4.44  −2.45   2.57   1.12   2.99   3.61
Gilitschenski et al. (2019)  −0.43   3.84   0.88  −2.29  −2.29  −2.29   3.70   3.32   4.88   2.90
Prokudin et al. (2018)       −1.87  −3.34  −1.28  −1.86  −0.50  −2.39   0.48  −4.19   4.16   1.48
IPDF (Ours)                   4.10   4.45   4.26   5.70   4.81   1.28   7.57   7.30   6.91   8.49
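The Table 1 metric can be sketched as follows, under the assumption that `log_p(x, R)` returns the model's normalized log density and that the log likelihood is averaged over each image's ground-truth symmetry set before averaging over the dataset (the exact aggregation is an assumption here). The −2.29 reference value in the caption is log(1/π²), the log density of the uniform distribution over SO(3), whose total volume is π².

```python
import numpy as np

def average_log_likelihood(log_p, examples):
    # `examples`: iterable of (x, sym_set) pairs, where sym_set holds all
    # ground-truth rotations equivalent under the object's symmetry.
    per_image = [np.mean([log_p(x, R) for R in sym_set])
                 for x, sym_set in examples]
    return float(np.mean(per_image))

# The uniform density on SO(3) scores log(1/pi^2), about -2.29.
uniform_log_p = lambda x, R: -np.log(np.pi ** 2)
```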
Finally, we evaluate on T-LESS (Hodaň et al., 2017), consisting of texture-less industrial parts with various discrete and continuous approximate symmetries. As in Gilitschenski et al. (2019), we use the Kinect RGB single-object images, tight-cropped and color-normalized. Although the objects are nearly symmetric, their symmetry-breaking features are visible in most instances. Nonetheless, it serves as a useful benchmark to compare distribution metrics with Gilitschenski et al. (2019).

4.2. Baselines

We compare to several recent works which parameterize distributions on SO(3) for the purpose of pose estimation. Gilitschenski et al. (2019) and Deng et al. (2020) output the parameters for mixtures of Bingham distributions and interpolate from a large lookup table to compute the normalization constant. Mohlin et al. (2020) output the parameters for a unimodal matrix Fisher distribution and similarly employ an approximation scheme to compute the normalization constant. Prokudin et al. (2018) decompose SO(3) into the product of three independent distributions over Euler angles, with the capability for multimodality through an ‘infinite mixture’ approach. Finally, we compare to the spherical regression work of Liao et al. (2019), which directly regresses to Euler angles, to highlight the comparative advantages of distribution-based methods. We quote reported values and run publicly released code when values are unavailable. See Supplemental Material for additional details. We find that IPDF proves competitive across the board.

Liao et al. (2019)        0.496   0.658   28.7
Deng et al. (2020)        0.562   0.694   32.6
Prokudin et al. (2018)    0.456   0.528   49.3
Mohlin et al. (2020)      0.693   0.757   17.1
IPDF (ours)               0.719   0.735   21.5
IPDF (ours), top-2        0.868   0.888    4.9
IPDF (ours), top-4        0.904   0.926    4.8

4.3. SYMSOL I: symmetric solids

We report the average log likelihood in Table 1, and the gap between IPDF and the baselines is stark. The average log likelihood indicates how successful the prediction is at distributing probability mass around all of the ground truths. The expressivity afforded by our method allows it to capture both the continuous and discrete symmetries present in the dataset. As the order of the symmetry increases from 12 for the tetrahedron, to 24 for the cube, and finally 60 for the icosahedron, the baselines struggle and tend to perform at the same level as a minimally informative (uniform) distribution over SO(3). The difference between IPDF and the baselines in Table 1 is further cemented by the fact that a single IPDF model was trained on all five shapes while the baselines were allowed a separate model per shape. Interestingly, while the winner-take-all strategy of Deng et al. (2020) enabled training with more Bingham modes than Gilitschenski et al. (2019), it seems to have hindered the ability to faithfully represent the continuous symmetries of the cone and cylinder, as suggested by the relative performance of these methods.
Table 3. Results on a standard pose estimation benchmark, Pascal3D+. As is common, we show accuracy at 30° (top) and median error in degrees (bottom), for each category and also averaged over categories. Our IPDF is at or near state-of-the-art on many categories. ‡ The results for Liao et al. (2019) and Mohlin et al. (2020) differ from their published numbers. For Liao et al. (2019), published errors are known to be incorrectly scaled by a √2 factor, and Mohlin et al. (2020) evaluates on a non-standard test set. See Supplemental for details.
                          avg.   aero  bike  boat  bottle  bus   car   chair  table  mbike  sofa  train  tv

Acc@30° ↑
‡Liao et al. (2019)      0.819   0.82  0.77  0.55  0.93   0.95  0.94  0.85   0.61   0.80   0.95  0.83   0.82
‡Mohlin et al. (2020)    0.825   0.90  0.85  0.57  0.94   0.95  0.96  0.78   0.62   0.87   0.85  0.77   0.84
Prokudin et al. (2018)   0.838   0.89  0.83  0.46  0.96   0.93  0.90  0.80   0.76   0.90   0.90  0.82   0.91
Tulsiani & Malik (2015)  0.808   0.81  0.77  0.59  0.93   0.98  0.89  0.80   0.62   0.88   0.82  0.80   0.80
Mahendran et al. (2018)  0.859   0.87  0.81  0.64  0.96   0.97  0.95  0.92   0.67   0.85   0.97  0.82   0.88
IPDF (Ours)              0.837   0.81  0.85  0.56  0.93   0.95  0.94  0.87   0.78   0.85   0.88  0.78   0.86

Median error (°) ↓
‡Liao et al. (2019)      13.0    13.0  16.4  29.1  10.3    4.8   6.8  11.6   12.0   17.1   12.3   8.6   14.3
‡Mohlin et al. (2020)    11.5    10.1  15.6  24.3   7.8    3.3   5.3  13.5   12.5   12.9   13.8   7.4   11.7
Prokudin et al. (2018)   12.2     9.7  15.5  45.6   5.4    2.9   4.5  13.1   12.6   11.8    9.1   4.3   12.0
Tulsiani & Malik (2015)  13.6    13.8  17.7  21.3  12.9    5.8   9.1  14.8   15.2   14.7   13.7   8.7   15.4
Mahendran et al. (2018)  10.1     8.5  14.8  20.5   7.0    3.1   5.1   9.3   11.3   14.2   10.2   5.6   11.7
IPDF (Ours)              10.3    10.8  12.9  23.4   8.8    3.4   5.3  10.0    7.3   13.6    9.5   6.4   12.3
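Both Table 3 metrics derive from the geodesic angle between the predicted and ground-truth rotations. A minimal sketch (not the paper's evaluation code):

```python
import numpy as np

def geodesic_angle_deg(R1, R2):
    # Geodesic distance on SO(3): the rotation angle of R1^T R2.
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def pose_metrics(pred_gt_pairs, threshold_deg=30.0):
    # Returns (accuracy at the angular threshold, median error in degrees),
    # the two quantities reported in Table 3.
    errs = np.array([geodesic_angle_deg(Rp, Rg) for Rp, Rg in pred_gt_pairs])
    return float(np.mean(errs <= threshold_deg)), float(np.median(errs))
```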
Figure 4. Bathtubs may have exact or approximate 2-fold symmetries around one or more axes. We show our predicted probabilities as solid disks, the ground truth as circles, and the predictions of Liao et al. (2019) as crosses. Our model assigns high probabilities to all symmetries, while the regression method ends up far from every symmetry mode (note the difference in position and color between circles and crosses).

4.4. SYMSOL II: nearly-symmetric solids

When trained on the solids with distinguishing features which are visible only from a subset of orientations, IPDF is far ahead of the baselines (Table 1). The prediction serves as a sort of ‘belief state’, with the flexibility of being unconstrained by a particular parameterization of the distribution. The marked cylinder in the right half of Figure 1 displays this nicely. When the red marking is visible, the pose is well defined from the image and the network outputs a sharp peak at the correct, unambiguous location. When the cylinder marking is not visible, there is irreducible ambiguity, conveyed in the output with half of the full cylindrical symmetry shown in the left side of the figure.

The pose distribution of the marked tetrahedron in Figure 3c takes a discrete form. Depending on which faces are visible, a subset of the full 12-fold tetrahedral symmetry can be ruled out. For example, with the one red face visible in the left subplot of Figure 3c, there is nothing to distinguish the three remaining faces, and the implicit distribution reflects this state with three modes.

Figure 3d shows the IPDF prediction for various views of the marked sphere. When the marking is not visible at all (middle subplot), the half of SO(3) where the marking faces the camera can be ruled out; IPDF assigns zero probability to half of the space. When only a portion of the marking is visible (right subplot), IPDF yields a nontrivial distribution with an intermediate level of ambiguity, capturing the partial information contained in the image.

4.5. ModelNet10-SO(3)

Unimodal methods perform poorly on categories with rotational symmetries such as bathtub, desk and table (see the supplementary material for complete per-category results). When trained with a single ground truth pose selected randomly from among multiple distinct rotations, these methods tend to split the difference and predict a rotation equidistant from all equivalent possibilities. The most extreme example of this behavior is the bathtub category, which contains instances with approximate or exact two-fold symmetry around one or more axes (see Fig. 4). With two modes of symmetry separated by 180°, the outputs tend to be 90° away from each mode. We observe this behavior in Liao et al. (2019); Mohlin et al. (2020).

Since our model can easily represent any kind of symmetry, it does not suffer from this problem, as illustrated in Fig. 4. The predicted distribution captures the symmetry of the object but returns only one of the possibilities during inference. This is penalized by metrics that rely on a single ground truth, since picking the mode that is not annotated
results in a 180° error, while picking the midpoint between two modes (which is far from both) results in a 90° error. Since some bathtub instances have two-fold symmetries over more than one axis (like the top-right of Fig. 4), our median error ends up closer to 180° when the symmetry annotation is incomplete, which in turn significantly increases the average over all categories. We observe the same for other multi-modal methods (Prokudin et al., 2018; Deng et al., 2020).

Our performance increases dramatically in the top-k evaluation even for k = 2 (see Table S4). The ability to output pose candidates is an advantage of our model, and is not possible for direct regression (Liao et al., 2019) or unimodal methods (Mohlin et al., 2020). While models based on mixtures of unimodal distributions could, in theory, produce pose candidates, their current implementations (Gilitschenski et al., 2019; Deng et al., 2020) suffer from mode collapse and are constrained to a fixed number of modes.

4.6. Pascal3D+

In contrast to the full coverage of SO(3) and the presence of symmetries and ambiguities in the SYMSOL and ModelNet10-SO(3) datasets, Pascal3D+ serves as a check that pose estimation performance in the unambiguous case is not sacrificed. In fact, as the results of Table 3 show, IPDF performs as well as or better than the baselines which

4.7. T-LESS

The results of Table 4, and specifically the success of the regression method of Liao et al. (2019), show that approximate or exact symmetries are not an issue in the particular split of the T-LESS dataset used in Gilitschenski et al. (2019). All methods are able to achieve median angular errors of less than 4°. Among the methods which predict a probability distribution over pose, IPDF maximizes the average log likelihood and minimizes the spread, when correctly factoring the uncertainty into the metric evaluation.

5. Conclusion

In this work we have demonstrated the capacity of an implicit function to represent highly expressive, non-parametric distributions on the rotation manifold. It performs as well as or better than state-of-the-art parameterized distribution methods on standard pose estimation benchmarks where the ground truth is a single pose. On the new and difficult SYMSOL dataset, the implicit method is far superior while being simple to implement, as it does not require any onerous calculations of a normalization constant. In particular, we show in SYMSOL II that our method can represent distributions that cannot be approximated well by current mixture-based models. See the Supplementary Material for additional visualizations, ablation studies and timing evaluations, extended discussion about metrics, and implementation details.
Data. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

Mardia, K. V. and Jupp, P. E. Directional Statistics. John Wiley and Sons, Ltd, London, 2000.

McAllister, R., Gal, Y., Kendall, A., Van Der Wilk, M., Shah, A., Cipolla, R., and Weller, A. Concrete Problems for Autonomous Vehicle Safety: Advantages of Bayesian Deep Learning. International Joint Conferences on Artificial Intelligence, Inc., 2017.

Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., and Geiger, A. Occupancy Networks: Learning 3D Reconstruction in Function Space. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV, 2020.

Mohlin, D., Bianchi, G., and Sullivan, J. Probabilistic Orientation Estimation with Matrix Fisher Distributions. In Advances in Neural Information Processing Systems 33, 2020.

Okorn, B., Xu, M., Hebert, M., and Held, D. Learning Orientation Distributions for Object Pose Estimation. In IEEE International Conference on Robotics and Automation (ICRA), 2020.

Park, J. J., Florence, P., Straub, J., Newcombe, R. A., and Lovegrove, S. DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 165–174, 2019. doi: 10.1109/CVPR.2019.00025.

Peretroukhin, V., Giamou, M., Rosen, D. M., Greene, W. N., Roy, N., and Kelly, J. A Smooth Representation of SO(3) for Deep Rotation Learning with Uncertainty. In Proceedings of Robotics: Science and Systems (RSS), 2020.

Pitteri, G., Ramamonjisoa, M., Ilic, S., and Lepetit, V. On Object Symmetries and 6D Pose Estimation from Images. CoRR, abs/1908.07640, 2019. URL http://arxiv.org/abs/1908.07640.

Poggio, T. and Vetter, T. Recognition and Structure from One 2D Model View: Observations on Prototypes, Object Classes and Symmetries. Technical report, Massachusetts Institute of Technology Artificial Intelligence Lab, 1992.

Prokudin, S., Gehler, P., and Nowozin, S. Deep Directional Statistics: Pose Estimation with Uncertainty Quantification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 534–551, 2018.

Rezende, D. and Mohamed, S. Variational Inference with Normalizing Flows. In International Conference on Machine Learning, pp. 1530–1538. PMLR, 2015.

Rezende, D. J., Papamakarios, G., Racaniere, S., Albergo, M., Kanwar, G., Shanahan, P., and Cranmer, K. Normalizing Flows on Tori and Spheres. In International Conference on Machine Learning, pp. 8083–8092. PMLR, 2020.

Rothwell, C., Forsyth, D. A., Zisserman, A., and Mundy, J. L. Extracting Projective Structure from Single Perspective Views of 3D Point Sets. In 1993 (4th) International Conference on Computer Vision, pp. 573–582. IEEE, 1993.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Saxena, A., Driemeyer, J., and Ng, A. Y. Learning 3-D Object Orientation from Images. In IEEE International Conference on Robotics and Automation (ICRA), 2009.

Sitzmann, V., Zollhöfer, M., and Wetzstein, G. Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 1119–1130, 2019.

Su, H., Qi, C. R., Li, Y., and Guibas, L. J. Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2686–2694, 2015.

Sundermeyer, M., Marton, Z., Durner, M., Brucker, M., and Triebel, R. Implicit 3D Orientation Learning for 6D Object Detection from RGB Images. CoRR, abs/1902.01275, 2019.

Suwajanakorn, S., Snavely, N., Tompson, J. J., and Norouzi, M. Discovery of Latent 3D Keypoints via End-to-End Geometric Reasoning. In Advances in Neural Information Processing Systems (NIPS), pp. 2063–2074, 2018.

Tulsiani, S. and Malik, J. Viewpoints and Keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X.,
and Xiao, J. 3D ShapeNets: A Deep Representation for
Volumetric Shapes. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, pp.
1912–1920, 2015.
Xiang, Y., Mottaghi, R., and Savarese, S. Beyond PASCAL:
A benchmark for 3D object detection in the wild. In 2014
IEEE Winter Conference on Applications of Computer
Vision (WACV), pp. 75–82, March 2014.
Xiang, Y., Kim, W., Chen, W., Ji, J., Choy, C., Su, H.,
Mottaghi, R., Guibas, L., and Savarese, S. ObjectNet3D:
A Large Scale Database for 3D Object Recognition. In
European Conference Computer Vision (ECCV), 2016.
Yershova, A., Jain, S., LaValle, S. M., and Mitchell, J. C. Generating Uniform Incremental Grids on SO(3) Using the Hopf Fibration. The International Journal of Robotics Research, 29(7):801–812, 2010.
Figure S1. Sample IPDF outputs on Pascal3D+ objects. We visualize predictions by the IPDF model, trained on all twelve object
categories, which yielded the results in Table 3 of the main text. The ground truth rotations are displayed as the colored open circles.
In Figure S1 we show sample predictions from IPDF trained on the objects in Pascal3D+. The network outputs much more information about the pose of the object in the image than can be expressed in a single estimate. Even in the examples where the distribution is unimodal and the pose is relatively unambiguous, IPDF provides rich information about the uncertainty around the most likely pose. The expressivity of IPDF also allows it to capture category-level symmetries, which appear as multiple modes in the distributions above. The most striking example in Figure S1 is the bicycle in the second row: the pose estimate is highly uncertain, yet there is still information in the exclusion of certain regions of SO(3), which have been ‘ruled out’. This expressivity allows an unprecedented amount of information to be contained in the predicted pose distributions.
cover the full joint space. Normalizing the distributions is similarly straightforward: we query over a product of Cartesian and HEALPix-derived grids. Predicted distributions on modified images of SYMSOL are shown in Figure S2. For two renderings of a cone from identical orientation but different translations, only the predicted distribution over translation differs between the two images.
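As an illustrative sketch of this normalization, the unnormalized outputs can be summed over an equivolumetric product grid. The grid sizes, the unit translation box, and the variable names below are our assumptions, not the paper's implementation:

```python
import numpy as np

# Normalize a joint rotation+translation distribution over a product grid.
# Grid sizes and the unit translation box are illustrative assumptions.
rng = np.random.default_rng(0)
N_R = 72 * 8**1                 # points in a HEALPix-SO(3) grid at level 1
N_T = 5**3                      # 5x5x5 Cartesian grid over the translation box
logits = rng.standard_normal((N_R, N_T))    # stand-in for the network outputs

VOL_ROT = np.pi**2              # volume of SO(3) under the paper's convention
VOL_TRANS = 1.0                 # assumed unit translation box
cell = (VOL_ROT / N_R) * (VOL_TRANS / N_T)  # volume of one joint grid cell

p_joint = np.exp(logits)
p_joint /= p_joint.sum() * cell             # joint density now integrates to 1

# Marginal densities, obtained by summing out the other variable.
p_rot = p_joint.sum(axis=1) * (VOL_TRANS / N_T)
p_trans = p_joint.sum(axis=0) * (VOL_ROT / N_R)
```

Each marginal again integrates to one against its own grid, which is what the marginal visualizations in Figure S2 rely on.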
Figure S2. Extension to 6DOF rotation+translation estimation. We train IPDF on a modified SYMSOL I dataset, where the objects are
also translated in space. Shown above are two images of a cone with the same orientation but shifted in space. We query the network over
the full joint space of translations and rotations, and visualize the marginal distributions. Each point in rotation space has a corresponding
point in translation space, and we color them the same to indicate the correspondence. While uninformative in the plots above, this coloring scheme allows nontrivial joint distributions to be expressed.
Table S1. Spread estimation on SYMSOL. This metric evaluates how closely the probability mass is centered on any of the equivalent
ground truths. For this reason, we can only evaluate it on SYMSOL I, where all ground truths are known at test time. Values are in
degrees.
             cone   cyl.   tet.   cube   ico.
Deng et al.  10.1   15.2   16.7   40.7   29.5
Ours          1.4    1.4    4.6    4.0    8.4
We evaluate the spread metric on the SYMSOL I dataset, where the full set of ground truths is known at test time, for IPDF
and the method of Deng et al. (2020). The results are shown in Table S1.
The metric values, in degrees, show how well the implicit method is able to home in on the ground truths. For the cone and
cylinder, the spread of probability mass away from the continuous rotational symmetry has a typical scale of just over one
degree.
The predicted distributions in Figure S3 for a tetrahedron and cone visually ground the values of Table S1. Many of the individual unimodal Bingham components can be identified in the output distributions of Deng et al. (2020), highlighting the difficulty of covering the great circle of ground-truth rotations for the cone with only a limited number of unimodal distributions (bottom). The spread around the ground truths for both shapes is significantly larger and more diffuse than for IPDF, shown on the right.
Figure S3. Comparison of predicted distributions: tetrahedron and cone. We show predicted pose distributions for a tetrahedron
(top) and cone (bottom). Displayed on the left is the method of Deng et al. (2020), which outputs parameters for a mixture of Bingham
distributions. The right side shows IPDF. The predicted distributions from the implicit method are much more densely concentrated
around the ground truth, providing a visual grounding for the significant difference in the spread values of Table S1.
Table S2. Inference time evaluation. For our method, we measure the time needed to generate the normalized distribution over SO(3) given a single 224 × 224 image. The numbers of samples correspond to the HEALPix-SO(3) grids of levels 3, 4, and 5, respectively. The coarsest grid has an average distance of approximately 5° between nearest neighbors. Processing time grows sublinearly in the number of samples.

Method        Number of samples   frames/s ↑   Acc@15° ↑   Acc@30° ↑   Med. (°) ↓
Liao et al.           -              18.2         0.522       0.652       38.2
Ours                 37 k            18.3         0.717       0.735       25.1
Ours                295 k             9.1         0.723       0.738       17.6
Ours               2359 k             2.4         0.723       0.738       18.7
S5. Ablations
In Figure S4, we show the average log likelihood on the five shapes of SYMSOL I through ablations to various aspects of
the method. The top row shows the dependence on the size of the dataset. Performance levels off after 50,000 images per
shape, but is greatly diminished with only 10,000 examples. Note that almost all of the values for 10,000 images are less than the log likelihood of a uniform distribution over SO(3), −log A = −2 log π ≈ −2.29, the ‘safest’ distribution to output if
Figure S4. Ablative studies. We report the average log likelihood for the shapes of SYMSOL I with various aspects of the method ablated.
Error bars are the standard deviation over five networks trained with different random seeds. In the top row, we show the dependence on
the size of the dataset, with performance leveling off after 50,000 images per shape. The subsequent row varies the positional encoding,
with 0 positional encoding terms corresponding to no positional encoding at all: the flattened rotation matrix is the query rotation. The
third row examines the role of the rotation format when querying the MLP (before positional encoding is applied). The final row shows that, during training, the inexact normalization arising from queries randomly sampled over SO(3) leads to performance roughly equivalent to the proper normalization obtained with an equivolumetric grid of query points. Note that evaluation uses an equivolumetric grid in both cases to calculate the log likelihood.
training is unsuccessful. This indicates overfitting: with only one annotated rotation for each training example, a minimum number of examples is needed to connect all the ground truths with each view of a shape. The network becomes confident about the rotations it has seen paired with a particular view and assigns small probability to the unseen ground truths, resulting in large negative log likelihood values.
The second row varies the positional encoding applied to the rotations when querying the MLP. 0 positional encoding terms
corresponds to no positional encoding at all: the flattened rotation matrix is used as the query rotation. The positional
encoding benefits the three shapes with discrete symmetries and is neutral or even slightly negative for the cone and cylinder.
Intended to facilitate the representation of high frequency features (Mildenhall et al., 2020), positional encoding helps
capture the twelve modes of tetrahedral symmetry with two terms, whereas four are necessary for peak performance on the
cube and icosahedron. For all shapes, including more positional encoding terms eventually degrades the performance.
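A minimal sketch of the query encoding described above; the function name and the exact frequency scaling (powers of two, in the spirit of Mildenhall et al. (2020)) are our assumptions:

```python
import numpy as np

def encode_query(R, m):
    """Positionally encode a 3x3 rotation matrix with m frequency terms.

    With m = 0, the flattened rotation matrix itself is the query, as in
    the ablation; otherwise the output has 2 * m * 9 values (sines and
    cosines of the scaled matrix entries).
    """
    flat = np.asarray(R).reshape(-1)       # the 9 matrix entries
    if m == 0:
        return flat
    freqs = 2.0 ** np.arange(m)            # 1, 2, 4, ... (assumed scaling)
    scaled = np.outer(freqs, flat)         # shape (m, 9)
    return np.concatenate([np.sin(scaled).ravel(), np.cos(scaled).ravel()])
```

For example, `encode_query(np.eye(3), 2)` produces a 36-dimensional query, while `encode_query(np.eye(3), 0)` returns the 9 flattened matrix entries.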
In the third row, we compare different formats for the query rotation, pre-positional encoding. For all shapes, representing
rotations as matrices is optimal, with axis-angle and quaternion formats comparable to each other and a fair amount worse.
Representing rotations via Euler angles averages out near the log likelihood of a uniform distribution (−2.29), though with a
large spread which indicates most but not all runs fail to train.
Finally, the fourth row examines the effect of normalization in the likelihood loss during training. Randomly sampling
queries from SO(3) offers simplicity and freedom over the exact number of queries, but results in inexact normalization of
the probability distribution. During training, this leads to roughly equivalent performance as when an equivolumetric grid of
queries is used, which can be exactly normalized.
Figure S5. The efficacy of gradient ascent on Pascal3D+. We report the average performance across classes on Pascal3D+, for the same
IPDF model, using different means to extract a single-valued pose estimate. The error bars are the standard deviation among random
sampling attempts, and the curves are slightly offset horizontally for clarity.
In Figure S5 we show the efficacy of performing gradient ascent to extract the most likely pose from IPDF, given an image. The first way to find the rotation with maximal probability is to sample SO(3) and take the argmax over the unnormalized outputs of IPDF. Predictably, finer sampling resolution yields more accurate predictions, indicated by a shrinking median angular error (left) and growing accuracy at 30° (right), averaged over the categories of Pascal3D+. The second way to produce an estimate leverages the fact that IPDF is fully differentiable. We use the best guess from a sampling of queries as the starting value for gradient ascent on the output of IPDF. The space of valid rotations is embedded in a much larger query space, so we project the updated query back to SO(3) after every step of gradient ascent, and run it for 100 steps. The estimates returned by gradient ascent yield optimal performance for any sampling of more than 10,000 queries, whereas the argmax requires more than 500,000 queries for similar results. The difference between argmax and gradient ascent shows primarily in the median angular error (left): improvements of an estimate on the order of a degree benefit this statistic more than the accuracy at 30°.
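The projection back to SO(3) is not spelled out above; one standard choice, sketched here as an assumption about the implementation, is the SVD-based nearest rotation matrix:

```python
import numpy as np

def project_to_so3(M):
    """Project a 3x3 matrix to the nearest rotation (Frobenius norm).

    Orthogonal Procrustes solution via SVD; one singular vector is flipped
    when needed so the determinant is +1 (a rotation, not a reflection).
    """
    U, _, Vt = np.linalg.svd(M)
    if np.linalg.det(U @ Vt) < 0:
        U[:, -1] *= -1
    return U @ Vt

# After each gradient-ascent step on the (unconstrained) query matrix,
# snap it back onto the rotation manifold:
rng = np.random.default_rng(0)
R = project_to_so3(np.eye(3) + 0.05 * rng.standard_normal((3, 3)))
```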
Figure S6. Distributions modelling a scenario with multiple ground truths. P1 and P2 are mixtures of two normal distributions, with
the components centered on the ground truths at x = ±1. P3 is a normal distribution centered on only one of the two ground truths. P4 is
a uniform distribution over the interval [−2, 2].
The results for the spread and average log likelihood, defined in the main text, are shown in Table S3. There are several takeaways from this simplified example. The spread, measuring the average distance from probability mass to the nearest ground truth, captures how well any of the ground truths is represented. By this metric, P1 and P3 are equivalent. When the full set of ground truths is not known at evaluation, the spread ceases to be meaningful.
The average log likelihood measures how well all ground truths are represented and is invariant to whether the full set of
GTs is provided with each test example, or only a subset. The latter is the predominant scenario for pose estimation datasets,
where annotations are not provided for near or exact symmetries. This means only one ground truth is provided for each test
example, out of possibly several equivalent values. In Table S3, the average log likelihood ranks the distributions in the
order one would expect, with the ‘ignorant’ uniform distribution (P4 ) performing slightly worse than P1 and P2 , and with
P3 severely penalized for failing to cover both of the ground truths.
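This toy example is easy to reproduce numerically. In the sketch below, we take the spread to be the expected distance from a sample of the distribution to the nearest ground truth (the reading under which P1 and P3 are equivalent); the component width is our assumption, and P2 is omitted since it behaves like P1:

```python
import numpy as np

# Numerical reproduction of the 1-D toy example (Figure S6 / Table S3).
# SIGMA is our assumption; the qualitative ranking is unchanged for
# moderate widths.
GT = np.array([-1.0, 1.0])   # the two equivalent ground truths
SIGMA = 0.5

def normal_pdf(x, mu, sigma=SIGMA):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def p1(x):  # mixture with one mode per ground truth
    return 0.5 * normal_pdf(x, -1.0) + 0.5 * normal_pdf(x, 1.0)

def p3(x):  # single mode, covers only one ground truth
    return normal_pdf(x, 1.0)

def p4(x):  # 'ignorant' uniform over [-2, 2]
    return np.where(np.abs(x) <= 2.0, 0.25, 0.0)

x = np.linspace(-6.0, 6.0, 200_001)
dx = x[1] - x[0]

def spread(p):
    # E_{x~p}[min_i |x - GT_i|]: distance of probability mass to the
    # nearest ground truth, computed by quadrature.
    return np.sum(p(x) * np.min(np.abs(x[:, None] - GT), axis=1)) * dx

def avg_log_likelihood(p):
    # Average log density evaluated at the ground truths.
    return np.mean(np.log(p(GT)))

# Spread cannot distinguish P1 from P3 (both concentrate mass near a GT),
# while the average log likelihood ranks P1 > P4 > P3 as in Table S3.
assert abs(spread(p1) - spread(p3)) < 1e-3
assert avg_log_likelihood(p1) > avg_log_likelihood(p4) > avg_log_likelihood(p3)
```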
Precision and spread metrics are misleading because they penalize correct predictions that lack a corresponding annotation. Our solution is to drop the precision metric and to split the distribution into modes when computing the spread, by finding connected components in the predicted probability distribution.
The recall metrics are problematic when viewed independently of precision, since they can be easily optimized for by
returning a large number of candidate poses covering the whole space. Our solution here is to limit the number of output
pose candidates to k, yielding metrics that we denote the top-k accuracy@15°, top-k accuracy@30°, and top-k error. For
example, the metrics reported by Liao et al. (2019); Mohlin et al. (2020) on ModelNet10-SO(3) are equivalent to our top-1.
One issue with the top-k evaluation is that we cannot disentangle whether errors are due to the dataset (lack of symmetry annotations) or due to the model. Since there is no way around this without expensive annotation, we find it useful to report the top-k for different k, including k = 1, where no model errors are forgiven.
Now, for each entry in the dataset, $R_{\mathrm{GT}}$ is the single annotated ground truth, the top-$k$ pose predictions are $\{\hat{R}_i\}_{1\le i\le k}$, and we have $k$ normalized probability distributions corresponding to each of the top-$k$ modes, $\{\hat{p}_i\}_{1\le i\le k}$. The following equations describe the metrics:
\[
\text{top-}k\ \text{accuracy@}\alpha = \left\{\, \min_{1\le j\le k} d(R_{\mathrm{GT}}, \hat{R}_j) < \alpha \,\right\}, \tag{5}
\]
Typically, accuracy and spread are averaged over the whole dataset, while the median error over all entries is reported.
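A direct implementation of Eq. (5) for illustration; the geodesic distance is the usual rotation angle of $R_{\mathrm{GT}}^{\top}\hat{R}_j$, and all function names here are ours:

```python
import numpy as np

def geodesic_distance(R1, R2):
    # Rotation angle of the relative rotation R1^T R2, in radians.
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def topk_accuracy_at(R_gt, R_hats, alpha):
    # Eq. (5): 1 if the best of the k predicted modes is within alpha
    # of the single annotated ground truth.
    return float(min(geodesic_distance(R_gt, R) for R in R_hats) < alpha)

def rot_z(theta):  # helper for the example below
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Two candidate poses, 0.5 rad and 0.1 rad away from the ground truth:
acc = topk_accuracy_at(np.eye(3), [rot_z(0.5), rot_z(0.1)], alpha=0.2)  # -> 1.0
```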
Efficient implementation The input to the MLP is a concatenation of the image descriptor produced by a CNN and a
query pose. During both training and inference, we evaluate densities for a large number of poses per image. A naive
implementation would replicate and tile the image descriptors $\{d_i\}_{0 \le i < N_B}$ and pose queries $\{q_j\}_{0 \le j < N_Q}$, where $N_B$ is the mini-batch size and $N_Q$ is the number of pose queries, and evaluate the first fully connected operation with weights $W$ (before applying bias and nonlinearity) in a batched fashion, as follows:
\[
W \begin{bmatrix} d_1 & d_1 & d_1 & \cdots & d_2 & d_2 & d_2 & \cdots \\ q_1 & q_2 & q_3 & \cdots & q_1 & q_2 & q_3 & \cdots \end{bmatrix}. \tag{8}
\]
                          avg.   bathtub  bed    chair  desk   dresser  tv     n. stand  sofa   table  toilet

Acc@15°
  Deng et al. (2020)      0.562  0.140    0.788  0.800  0.345  0.563    0.708  0.279     0.733  0.440  0.832
  Prokudin et al. (2018)  0.456  0.114    0.822  0.662  0.023  0.406    0.704  0.187     0.590  0.108  0.946
  Mohlin et al. (2020)    0.693  0.322    0.882  0.881  0.536  0.682    0.790  0.516     0.919  0.446  0.957
  IPDF (ours)             0.719  0.392    0.877  0.874  0.615  0.687    0.799  0.567     0.914  0.523  0.945
  IPDF (ours), top-2      0.868  0.735    0.946  0.900  0.803  0.810    0.883  0.756     0.959  0.932  0.960
  IPDF (ours), top-4      0.904  0.806    0.966  0.905  0.862  0.870    0.899  0.842     0.966  0.956  0.963

Acc@30°
  Deng et al. (2020)      0.694  0.325    0.880  0.908  0.556  0.649    0.807  0.466     0.902  0.485  0.958
  Prokudin et al. (2018)  0.528  0.175    0.847  0.777  0.061  0.500    0.788  0.306     0.673  0.183  0.972
  Mohlin et al. (2020)    0.757  0.403    0.908  0.935  0.674  0.739    0.863  0.614     0.944  0.511  0.981
  IPDF (ours)             0.735  0.410    0.883  0.917  0.629  0.688    0.832  0.570     0.921  0.531  0.967
  IPDF (ours), top-2      0.888  0.770    0.953  0.946  0.825  0.812    0.918  0.762     0.968  0.945  0.982
  IPDF (ours), top-4      0.926  0.846    0.973  0.953  0.889  0.874    0.939  0.851     0.975  0.972  0.988

Median Error (°)
  Deng et al. (2020)      32.6   147.8    9.2    8.3    25.0   11.9     9.8    36.9      10.0   58.6   8.5
  Prokudin et al. (2018)  49.3   122.8    3.6    9.6    117.2  29.9     6.7    73.0      10.4   115.5  4.1
  Mohlin et al. (2020)    17.1   89.1     4.4    5.2    13.0   6.3      5.8    13.5      4.0    25.8   4.0
  IPDF (ours)             21.5   161.0    4.4    5.5    7.1    5.5      5.7    7.5       4.1    9.0    4.8
  IPDF (ours), top-2      4.9    6.8      4.1    5.5    5.3    4.9      5.3    5.1       3.9    3.7    4.8
  IPDF (ours), top-4      4.8    6.0      4.1    5.4    5.1    4.7      5.2    4.8       3.9    3.7    4.8
When computed this way, this single step is the computational bottleneck. An alternative, much more efficient method is to
observe that
\[
W \begin{bmatrix} d_i \\ q_j \end{bmatrix} = W \begin{bmatrix} d_i \\ 0 \end{bmatrix} + W \begin{bmatrix} 0 \\ q_j \end{bmatrix} = W_d\, d_i + W_q\, q_j, \tag{9}
\]
where $W = [W_d\ W_q]$. In this manner, $W_d$ can be applied batchwise to the image descriptors, yielding an $N_O \times N_B$ output, and $W_q$ can be applied to all query poses independently, yielding an $N_O \times N_Q$ output, where $N_O$ is the number of output channels (the number of rows in $W$). An $N_O \times N_Q \times N_B$ tensor equivalent to Eq. (8) is then obtained via a broadcasting sum, drastically reducing the number of operations.
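The factorization in Eq. (9) is easy to check numerically. The NumPy sketch below (our own, with arbitrary sizes) compares the naive tiled matmul of Eq. (8) against the broadcast sum:

```python
import numpy as np

# Assumed sizes: N_B images, N_Q pose queries, descriptor dim D,
# query dim Q, and N_O output channels of the first FC layer.
N_B, N_Q, D, Q, N_O = 4, 8, 16, 9, 32
rng = np.random.default_rng(0)
d = rng.standard_normal((N_B, D))       # image descriptors
q = rng.standard_normal((N_Q, Q))       # pose queries
W = rng.standard_normal((N_O, D + Q))   # first-layer weights
Wd, Wq = W[:, :D], W[:, D:]             # split W = [Wd Wq]

# Naive: tile every (descriptor, query) pair and apply W to each column.
naive = np.stack([W @ np.concatenate([d[i], q[j]])
                  for i in range(N_B) for j in range(N_Q)])
naive = naive.reshape(N_B, N_Q, N_O)

# Efficient: apply Wd and Wq separately, then combine with a broadcast sum.
efficient = (d @ Wd.T)[:, None, :] + (q @ Wq.T)[None, :, :]

assert np.allclose(naive, efficient)
```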
SYMSOL For the SYMSOL experiments, three positional encoding terms were used for the query, and four fully
connected layers of 256 units with ReLU activation for the MLP. One network was trained for all five shapes of SYMSOL I
with a batch size of 128 images for 100,000 steps (28 epochs). A different network was trained for each of the three textured shapes of SYMSOL II; these were trained with a batch size of 64 images for 50,000 steps (36 epochs). The loss calculation requires evaluating a coverage of points on SO(3), along with the ground truth, in order to find the approximate normalization rescaling of the likelihoods. We found that this coverage did not need to be particularly dense, and used 4096 points during training.
T-LESS For T-LESS, only one positional encoding term was used, and the MLP consisted of a single layer of 256 units
with ReLU activation. The images were color-normalized and tight-cropped as in Gilitschenski et al. (2019). Training was
with a batch size of 64 images for 50,000 steps (119 epochs).
ModelNet10-SO(3) For ModelNet10-SO(3) (Liao et al., 2019), we use four fully connected layers of 256 units with
ReLU activation as in SYMSOL. We train a single model for the whole dataset, for 100,000 steps with batch size of 64.
Following Liao et al. (2019) and Mohlin et al. (2020), we concatenate a one-hot encoding of the class label to the image
descriptor before feeding it to the MLP.
Pascal3D+ We used a learning rate of 10^-5 for 150,000 steps, with the same schedule as in the other experiments (linear ramp for the first 1000 steps, then cosine decay). The vision model was an ImageNet-pretrained ResNet-101, and the MLP
Table S5. IPDF architecture. m is the number of positional encoding frequencies and n is the number of fully connected layers in the MLP. The factor of 2 comes from using both sines and cosines in the positional encoding. The vision descriptor is the result of applying global average pooling to the output of an ImageNet-pretrained ResNet, yielding a 2048-dimensional vector. We use an ImageNet-pretrained ResNet-50 for SYMSOL, T-LESS, and ModelNet10-SO(3), and a ResNet-101 for Pascal3D+.
consisted of two fully connected layers of 256 units with ReLU activation (trained on all classes at once, without class label
information). We supplemented the Pascal3D+ training images with synthetic images from Render for CNN (Su et al.,
2015), such that every mini-batch of 64 images consisted of 25% real images and 75% synthetic.
[Gilitschenski et al. (2019)] We trained the multi-modal Bingham distribution model from Gilitschenski et al. (2019) using their PyTorch code.² For this baseline we again trained a single model per shape for ModelNet10-SO(3) and SYMSOL. We followed the published approach and trained the model in two stages: the first stage with fixed dispersion, and the second stage updating all distribution parameters. For a batch size of 32, a single training step for a 4-component distribution takes almost 2 seconds on an NVIDIA Tesla P100 GPU. The time is dominated by the lookup-table interpolation used to calculate the distribution's normalizing term (and its gradient), and is linear in the number of mixture components (training with 12 mixture components took over 7 seconds per step). This limited our ability to tune hyperparameters effectively or to train with a large number of mixture components.
[Prokudin et al. (2018)] We trained the infinite mixture model from Prokudin et al. (2018) using their TensorFlow code.³ The only modification was during evaluation: the log likelihood required our method of normalization via an equivolumetric grid, because representing a distribution over SO(3) as the product of three individually normalized von Mises distributions lacks the necessary Jacobian. We left the improperly normalized log likelihood in their loss, as originally formulated. A different model was trained per shape category of SYMSOL and ModelNet10-SO(3).
Note that our implicit pose distribution is trained as a single model for the whole of the SYMSOL I and ModelNet10-SO(3) datasets, so the comparisons against Deng et al. (2020), Gilitschenski et al. (2019), and Prokudin et al. (2018) favor the baselines. Our method outperforms them nevertheless.
¹ https://github.com/Multimodal3DVision/torch_bingham
² https://github.com/igilitschenski/deep_bingham
³ https://github.com/sergeyprokudin/deep_direct_stat
S8.2. A note on Pascal3D+ evaluations with respect to Liao et al. and Mohlin et al.
In the Pascal3D+ table in the main paper, as noted in its caption, we report numbers for Liao et al. (2019) and Mohlin et al. (2020) that differ from the numbers reported in their papers (these are the rows marked with ‡).
Liao et al. (2019) An error in the evaluation code, reported on GitHub⁴, incorrectly measured the angular error: reported numbers were incorrectly lower by a factor of √2. The authors corrected the evaluation code for ModelNet10-SO(3) and posted updated numbers, which we show in our paper. However, their evaluation code used for Pascal3D+ still contains the incorrect √2 factor: comparing the corrected ModelNet10-SO(3) geodesic distance function⁵ and the Pascal3D+ geodesic distance function⁶, the √2 difference is clear. We sanity-checked this by running their Pascal3D+ code with the incorrect metric and were able to closely match the numbers in the paper. In the main paper, we report performance obtained using the corrected evaluation code.
Mohlin et al. (2020) We found that the code released by Mohlin et al. (2020) uses different dataset splits for training and testing on Pascal3D+ than many of the other baselines we compared against. Annotated images in the Pascal3D+ dataset are selected from one of four source image sets: ImageNet train, ImageNet val, PASCAL VOC train, and PASCAL VOC val. Methods like Mahendran et al. and Liao et al. place all the ImageNet images (ImageNet train, ImageNet val) in the training partition (i.e., used for training and/or validation): "We use the ImageNet-trainval and Pascal-train images as our training data and the Pascal-val images as our testing data." (Mahendran et al. (2018), Sec. 4). However, in the code released by Mohlin et al. (2020), we observe that the test set is sourced from the ImageNet data⁷. We reran the Mohlin et al. code as-is and were able to match their published numbers. After logging both evaluation loops, we confirmed that the test data differs between Mohlin et al. and Liao et al. The numbers we report in the main paper for Mohlin et al. are after modifying the data pipeline to match Liao et al., which is also what we follow for our IPDF experiments. We ran Mohlin et al. with and without augmentation and warping in the data pipeline and chose the best results (with warping and augmentation).
⁴ https://github.com/leoshine/Spherical_Regression/issues/8
⁵ https://github.com/leoshine/Spherical_Regression/blob/a941c732927237a2c7065695335ed949e0163922/S3.3D_Rotation/lib/eval/GTbox/eval_quat_multilevel.py#L45
⁶ https://github.com/leoshine/Spherical_Regression/blob/a941c732927237a2c7065695335ed949e0163922/S1.Viewpoint/lib/eval/eval_aet_multilevel.py#L135
⁷ https://github.com/Davmo049/Public_prob_orientation_estimation_with_matrix_fisher_distributions/blob/4baba6d06ca36db4d4cf8c905c5c3b70ab5fb54a/Pascal3D/Pascal3D.py#L558-L583