
Implicit-PDF: Non-Parametric Representation of Probability Distributions On The Rotation Manifold



Kieran Murphy * 1 Carlos Esteves * 1 Varun Jampani 1 Srikumar Ramalingam 1 Ameesh Makadia 1

* Equal contribution. 1 Google Research, New York, NY, USA. Correspondence to: <implicitpdf@gmail.com>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

arXiv:2106.05965v2 [cs.CV] 1 Jul 2022

Abstract

Single image pose estimation is a fundamental problem in many vision and robotics tasks, and existing deep learning approaches suffer by not completely modeling and handling: i) uncertainty about the predictions, and ii) symmetric objects with multiple (sometimes infinite) correct poses. To this end, we introduce a method to estimate arbitrary, non-parametric distributions on SO(3). Our key idea is to represent the distributions implicitly, with a neural network that estimates the probability given the input image and a candidate pose. Grid sampling or gradient ascent can be used to find the most likely pose, but it is also possible to evaluate the probability at any pose, enabling reasoning about symmetries and uncertainty. This is the most general way of representing distributions on manifolds, and to showcase the rich expressive power, we introduce a dataset of challenging symmetric and nearly-symmetric objects. We require no supervision on pose uncertainty – the model trains only with a single pose per example. Nonetheless, our implicit model is expressive enough to handle complex distributions over 3D poses, while still obtaining accurate pose estimation on standard non-ambiguous environments, achieving state-of-the-art performance on Pascal3D+ and ModelNet10-SO(3) benchmarks. Code, data, and visualizations may be found at implicit-pdf.github.io.

1. Introduction

There is a growing realization in deep learning that bestowing a network with the ability to express uncertainty is universally beneficial and of crucial importance to systems where safety and interpretability are primary concerns (Leibig et al., 2017; Han et al., 2007; Ching et al., 2018). A quintessential example is the task of 3D pose estimation, which is both a vital ingredient in many real-world robotics and computer vision applications where propagating uncertainty can facilitate complex downstream reasoning (McAllister et al., 2017), and an inherently ambiguous problem due to the abundant approximate and exact symmetries in our 3D world.

Many everyday objects possess symmetries such as the box or vase depicted in Fig. 1 (a). It is tempting to formulate a model of uncertainty that precisely mirrors the pose ambiguities of such shapes; however, it becomes immediately evident that such an approach is not scalable, as it is unrealistic to enumerate or characterize all sources of pose uncertainty. Even in a simple scenario such as a coffee mug with self-occlusion, the pose uncertainty manifests as a complex distribution over 3D orientations, as in Fig. 1 (b).

Figure 1. We introduce a method to predict arbitrary distributions over the rotation manifold. This is particularly useful for pose estimation of symmetric and nearly symmetric objects, since output distributions can include both uncertainty on the estimation and the symmetries of the object. a-top: The cube has 24 symmetries, which are represented by 24 points on SO(3), and all modes are correctly inferred by our model. a-bottom: The cylinder has a continuous symmetry around one axis, which traces a cycle on SO(3). It also has a discrete 2-fold symmetry (a “flip”), so the distribution is represented as two cycles. The true pose distribution for the vase depicted on the left would trace a single cycle on SO(3) since it does not have a flip symmetry. b: This cylinder has a mark that uniquely identifies its pose, when visible (top). When the mark is not visible (bottom), our model correctly distributes the probability over poses where the mark is invisible. This example is analogous to a coffee cup when the handle is not visible. The resulting intricate distribution cannot be easily approximated with usual unimodal or mixture distributions on SO(3), but is easily handled by our IPDF model. Visualization: Points with non-negligible probability are displayed as dots on the sphere according to their first canonical axis, colored according to the rotation about that axis. The ground truth (used for evaluation only, not training) is shown as a solid outline. Refer to Section 3.5 for more details.

This paper addresses two long-standing and open challenges in pose estimation: (a) what is the most general representation for expressing arbitrary pose distributions, including the challenging ones arising from symmetrical and near-symmetrical objects, in a neural network, and (b) how do we effectively train the model in typical scenarios where the supervision is a single 3D pose per observation (as in Pascal3D+ (Xiang et al., 2014), ObjectNet3D (Xiang et al., 2016), ModelNet10-SO(3) (Liao et al., 2019)), i.e. without supervision on the distribution, or priors on the symmetries.

To this end, we propose an implicit representation for non-parametric probability distributions over the rotation manifold SO(3) (we refer to our model as implicit-PDF, or IPDF for short). Such an implicit representation can be parameterized with a neural network and successfully trained with straightforward sampling strategies – uniform or even random querying of the implicit function is sufficient to reconstruct the unnormalized distribution and approximate the normalizing term. For inference, in addition to reconstructing the full probability distribution, we can combine the sampling strategy with gradient ascent to make pose predictions at arbitrary (continuous) resolution. The use of a non-parametric distribution, while being simple, offers maximal expressivity for arbitrary densities and poses arising from symmetrical and near-symmetrical 3D objects. The simplicity of our approach is in stark contrast to commonly used parametric distributions on SO(3) that require complicated approximations for computing the normalizing term and further are not flexible enough to fit complex distributions accurately (Gilitschenski et al., 2019; Deng et al., 2020; Mohlin et al., 2020). Our primary contributions are

• Implicit-PDF, a novel approach for modeling non-parametric distributions on the rotation manifold. Our implicit representation can be applied to realistic challenging pose estimation problems where uncertainty can arise from approximate or exact symmetries, self-occlusion, and noise. We propose different sampling strategies which allow us to both efficiently reconstruct full distributions on SO(3) as well as generate multiple pose candidates with continuous precision.

• SYMSOL, a new dataset with inherent ambiguities for analyzing pose estimation with uncertainty. The dataset contains shapes with high order of symmetry, as well as nearly-symmetric shapes, that challenge probabilistic approaches to accurately learn complex pose distributions. When possible, objects are paired with their ground truth “symmetry maps”, which allows quantitative evaluation of predicted distributions.

Our IPDF method is extensively evaluated on the new SYMSOL dataset as well as traditional pose estimation benchmarks. To aid our analysis, we develop a novel visualization method for distributions on SO(3) that provides an intuitive way to qualitatively assess predicted distributions. Through evaluation of predicted distributions and poses, we obtain a broad assessment of our method: IPDF is the only technique that can consistently and accurately recover the complex pose uncertainty distributions arising from a high degree of symmetry or self-occlusion, while being supervised by only a single pose per example. Further, while IPDF has the expressive power to model non-trivial distributions, it does not sacrifice the ability to predict poses in non-ambiguous situations and reaches state-of-the-art performance with the usual metrics on many categories of Pascal3D+ (Xiang et al., 2014) and ModelNet10-SO(3) (Liao et al., 2019).

2. Related work

Symmetries are plentiful in our natural and human-made worlds, and so it is not surprising there is a history in computer vision of exploiting strong priors or assumptions on shape or texture symmetry to recover 3D structure from a single image (Poggio & Vetter, 1992; Hong et al., 2004; Rothwell et al., 1993). However, among the more recent machine learning approaches for pose estimation, symmetries are treated as nuisances and strategies have been developed to utilize symmetry annotations at training. With known symmetries at training, a canonical normalization of rotation space unambiguously resolves each set of equivalent rotations to a single one, allowing training to proceed as in single-valued regression (Pitteri et al., 2019). In Corona et al.
(2018), manually annotated symmetries on 3D shapes are required to jointly learn image embedding and classification of the object’s symmetry order. Learning representations that cover a few specific symmetry classes is considered in Saxena et al. (2009).

In contrast to these works, Sundermeyer et al. (2019) make pose or symmetry supervision unnecessary by using a denoising autoencoder to isolate pose information. Neither Sundermeyer et al. (2019) nor Corona et al. (2018) directly predict pose, and thus require comparing against many rendered images of the same exact object for pose inference. In a similar vein, Okorn et al. (2020) use a learned comparison against a dictionary of images to construct a histogram over poses. Deng et al. (2019) propose a particle filter framework for 6D object pose tracking, where each particle represents a discrete distribution over SO(3) with 191K bins. Similar to the previously mentioned works, this discrete rotation likelihood is estimated by codebook matching and an autoencoder is trained to generate the codes.

As noted earlier, symmetries are not the only source of pose uncertainty. Aiming to utilize more flexible representations, a recent direction of work has looked to directional statistics (Mardia & Jupp, 2000) to consider parametric probability distributions. Regression to the parameters of a von Mises distribution over (Euler) angles (Prokudin et al., 2018), as well as regression to the Bingham (Peretroukhin et al., 2020; Deng et al., 2020; Gilitschenski et al., 2019) and Matrix Fisher distributions (Mohlin et al., 2020) over SO(3) have been proposed. Since it is preferable to train these probabilistic models with a likelihood loss, the distribution’s normalizing term must be computed, which is itself a challenge (it is a hypergeometric function of a matrix argument for Bingham and Matrix Fisher distributions). Gilitschenski et al. (2019) and Deng et al. (2020) approximate this function and gradient via interpolation in a lookup table, Mohlin et al. (2020) use a hand-crafted approximation scheme to compute the gradient, and Peretroukhin et al. (2020) simply forgo the likelihood loss. In the simplest setting these models are unimodal, and thus ill equipped to deal with non-trivial distributions. To this end, Prokudin et al. (2018), Gilitschenski et al. (2019), and Deng et al. (2020) propose using multimodal mixture distributions. One challenge to training the mixtures is avoiding mode collapse, for which a winner-take-all strategy can be used (Deng et al., 2020). An alternative to the mixture models is to directly predict multiple pose hypotheses (Manhardt et al., 2019), but this does not share any of the benefits of a probabilistic representation.

Bayesian deep learning provides a general framework to reason about model uncertainty, and in Kendall & Cipolla (2016) test time dropout (Gal & Ghahramani, 2016) was used to approximate Bayesian inference for camera relocalization. Inference with random dropout applied to the trained model is used to generate Monte Carlo pose samples, and thus this approach does not offer a way to estimate the density at arbitrary poses (sampling large numbers of poses would also be impractical).

An alternative framework for representing arbitrary complex distributions is Normalizing Flows (Rezende & Mohamed, 2015). In principle, the reparameterization trick for Lie groups introduced in Falorsi et al. (2019) allows for constructing flows to the Lie algebra of SO(3). Rezende et al. (2020) develop normalizing flows for compact and connected differentiable manifolds; however, it is still unclear how to effectively construct flows on non-Euclidean manifolds, and so far there has been little evidence of a successful application to realistic problems at the complexity of learning arbitrary distributions on SO(3).

The technical design choices of our implicit pose model are inspired by the very successful implicit shape (Mescheder et al., 2019) and scene (Mildenhall et al., 2020) representations, which can represent detailed geometry with a multilayer perceptron that takes low-dimensional position and/or directions as inputs.

We introduce the details of our approach next.

3. Methods

The method centers upon a multilayer perceptron (MLP) which implicitly represents probability distributions over SO(3). The input to the MLP is a pair comprising a rotation and a visual representation of an image obtained using a standard feature extractor such as a residual network; the output is an unnormalized log probability. Roughly speaking, we construct the distribution for a given image by populating the space of rotations with such queries, and then normalizing the probabilities. This procedure is highly parallelizable and efficient (see Supp. for time ablations). In the following we provide details for the key ingredients of our method.

3.1. Formalism

Our goal is, given an input $x \in \mathcal{X}$ (for example, an image), to obtain a conditional probability distribution $p(\cdot \mid x) : \mathrm{SO}(3) \to \mathbb{R}^+$ that represents the pose of $x$. We achieve this by training a neural network to estimate the unnormalized joint log probability function $f : \mathcal{X} \times \mathrm{SO}(3) \to \mathbb{R}$. Let $\alpha$ be the normalization term such that $p(x, R) = \alpha \exp(f(x, R))$, where $p$ is the joint distribution. The computation of $\alpha$ is infeasible, requiring integration over $\mathcal{X}$. From the product rule, $p(R \mid x) = p(x, R)/p(x)$. We estimate $p(x)$ by marginalizing over SO(3), and since SO(3) is low-dimensional, we approximate the integral with a discrete sum as follows,

$$
\begin{aligned}
p(x) &= \int_{R \in \mathrm{SO}(3)} p(x, R)\, dR \\
     &= \alpha \int_{R \in \mathrm{SO}(3)} \exp(f(x, R))\, dR \\
     &\approx \alpha \sum_{i}^{N} \exp(f(x, R_i))\, V,
\end{aligned}
\tag{1}
$$

where the $\{R_i\}$ are centers of an equivolumetric partitioning of SO(3) with $N$ partitions of volume $V = \pi^2/N$ (see Section 3.4 for details). Now $\alpha$ cancels out in the expression for $p(R \mid x)$, giving

$$
p(R \mid x) \approx \frac{1}{V} \frac{\exp(f(x, R))}{\sum_{i}^{N} \exp(f(x, R_i))},
\tag{2}
$$

where all the RHS terms are obtained from querying the neural network.

During training, the model receives pairs of inputs $x$ and corresponding ground truth $R$, and the objective is to maximize $p(R \mid x)$. See Section 3.3 for details.

Inference – single pose. To make a single pose prediction, we solve

$$
R_x^* = \arg\max_{R \in \mathrm{SO}(3)} f(x, R),
\tag{3}
$$

with gradient ascent, since $f$ is differentiable. The initial guess comes from evaluating a grid $\{R_i\}$. Since the domain of this optimization problem is SO(3), we project the values back into the manifold after each gradient ascent step.

Inference – full distribution. Alternatively, we may want to predict a full probability distribution. In this case $p(R_i \mid x)$ is evaluated over the SO(3) equivolumetric partition $\{R_i\}$. This representation allows us to reason about uncertainty and observe complex patterns of symmetries and near-symmetries.

Our method can estimate intricate distributions on the manifold without direct supervision of such distributions. By learning to maximize the likelihood of a single ground truth pose per object over a dataset, with no prior knowledge of each object’s symmetries, appropriate patterns expressing symmetries and uncertainty naturally emerge in our model’s outputs, as shown in Fig. 1.

3.2. Network

Inspired by recent breakthroughs in implicit shape and scene representations (Mescheder et al., 2019; Park et al., 2019; Sitzmann et al., 2019), we adopt a multilayer perceptron (MLP) to implicitly represent the pose distribution. Differently from most implicit models, we train a single model to represent the pose of any instance of multiple categories, so an input descriptor is also fed to the MLP; for image inputs, we produce it with a pre-trained ResNet (He et al., 2015). Most implicit representation methods for shapes and scenes take a position in Euclidean space and/or a viewing direction as inputs. In our case, we take an arbitrary 3D rotation, so we must revisit the longstanding question of how to represent rotations (Levinson et al., 2020). We found it best to use a 3 × 3 rotation matrix to avoid discontinuities present in other representations (Saxena et al., 2009). Following Mildenhall et al. (2020), we found positionally encoding each element of the input to be beneficial. See the supplement for ablative studies on these design choices.

3.3. Loss

We train our model by minimizing the predicted negative log-likelihood of the (single) ground truth pose. This requires normalizing the output distribution, which we approximate by evaluating Eq. (2) using the method described in Section 3.4 to obtain an equivolumetric grid over SO(3), in which case the normalization is straightforward. During training, we rotate the grid such that $R_0$ coincides with the ground truth. Then, we evaluate $p(R_0 \mid x)$ as in Eq. (2), and the loss is simply

$$
\mathcal{L}(x, R_0) = -\log(p(R_0 \mid x)).
\tag{4}
$$

We noticed that the method is robust enough to be trained without an equivolumetric grid; evaluating Eq. (2) for randomly sampled $R_i \in \mathrm{SO}(3)$, provided that one of them coincides with the ground truth, works similarly well. The equivolumetric partition is still required during inference for accurate representation of the probabilities.

3.4. Sampling the rotation manifold

Training and producing an estimate of the most likely pose does not require precise normalization of the probabilities predicted by the network. However, when the distribution is the object of interest (e.g. an accurate distribution will be used in a downstream task), we can normalize by evaluating on a grid of points with equal volume in SO(3) and approximating the distribution as a histogram.

We employ a method of generating equivolumetric grids developed by Yershova et al. (2010), which uses as its starting point the HEALPix method of generating equal area grids on the 2-sphere (Gorski et al., 2005). A useful property of this sampling is that it is generated hierarchically, permitting multi-resolution sampling if desired.

The Hopf fibration is leveraged to cover SO(3) by threading a great circle through each point on the surface of a 2-sphere.

The grids are generated recursively from a starting seed of 72 points, and grow by a factor of eight each iteration. Figure 2 shows grids after one and two subdivisions. For evaluation, we use the grid after 5 subdivisions, with a little more than two million points.

Figure 2. Equivolumetric grid on SO(3). In order to normalize the output distribution, we sample unnormalized densities on an equivolumetric grid following Yershova et al. (2010). This iterative method starts with HEALPix (Gorski et al., 2005) which generates equal-area grids hierarchically on the sphere. Left: a grid with 576 samples, right: 4608 samples.

3.5. Visualization

We introduce a novel method to display distributions over SO(3). A common approach to visualizing such distributions is via multiple marginal distributions, e.g. over each of the three canonical axes (Lee et al., 2008; Mohlin et al., 2020). This is in general incomplete as it is not able to fully specify the joint distribution.

In order to show the full joint distribution, we display the entire space of rotations with the help of the Hopf fibration. With this method, we project a great circle of points on SO(3) to each point on the 2-sphere, and then use the color wheel to indicate the location on the great circle. More intuitively, we may view each point on the 2-sphere as the direction of a canonical z-axis, and the color indicates the tilt angle about that axis. To represent probability density, we vary the size of the points on the plot. Finally, we display the surface of the 2-sphere using the Mollweide projection.

As the method projects to a lower dimensional space, there are limitations arising from occlusions, but also a freedom in the projection axis which allows finding more or less informative views. The visualization benefits from relatively sparse distributions where much of the space has negligible probability. We did not find this to be limiting in practice: even the 60 modes of a distribution expressing icosahedral symmetry are readily resolved (Fig. 3b).

3.6. Evaluation metrics

The appropriateness of different metrics depends on the nature of predictions (a probability distribution or a set of values) and on the state of knowledge of the ground truth.

Prediction as a distribution: Log likelihood. In the most general perspective, ground truth annotations accompanying an image are observations from an unknown distribution which incorporates symmetry, ambiguity, and human error involved in the process of annotation. The task of evaluation is a comparison between two distributions given samples from one, for which likelihood is standard (Goodfellow et al., 2014; Clauset et al., 2009; Okorn et al., 2020; Gilitschenski et al., 2019). We report the log likelihood averaged over test set annotations, $\mathbb{E}_{x \sim p(x),\, R \sim p_{GT}(R \mid x)}[\log p(R \mid x)]$. Importantly, the average log likelihood is invariant to whether one ground truth annotation is available or a set of all equivalent annotations.

Prediction as a distribution: Spread. When a complete set of equivalent ground truth values is known (e.g. a value for each equivalent rotation under symmetry), the expected angular deviation to any of the ground truth values is $\mathbb{E}_{R \sim p(R \mid x)}[\min_{R' \in \{R_{GT}\}} d(R, R')]$, where $d : \mathrm{SO}(3) \times \mathrm{SO}(3) \to \mathbb{R}^+$ is the geodesic distance between rotations. This measure has been referred to as the Mean Absolute Angular Deviation (MAAD) (Prokudin et al., 2018; Gilitschenski et al., 2019), and encapsulates both the deviation from the ground truths and the uncertainty around them.

Prediction as a finite set: precision. The most common evaluation scenario in pose estimation tasks is a one-to-one comparison between a single-valued prediction and a ground truth annotation. However, in general, both the prediction and ground truth may be multi-valued, though often only one of the ground truths is available for evaluation. To compensate, sometimes symmetries are implicitly imposed on the entire dataset by reporting flip-invariant metrics (Suwajanakorn et al., 2018; Esteves et al., 2019). These metrics evaluate precision, where a prediction need only be close to one of the ground truths to score well. Usually, the median angular error and accuracy at some angular threshold θ are reported in this setting.

Prediction as a finite set: recall. We can also evaluate the coverage of multiple ground truths given multiple predictions, indicating recall. We employ a simple method of clustering by connected components to extract multiple predictions from an output distribution, and rank by probability mass, to return top-k recall metrics; median error and accuracy at θ are evaluated in this setting. When k = 1 and the ground truth is unique, these coincide with the precision metrics. See the supplement for extended discussion.

4. Experiments

4.1. Datasets

To highlight the strengths of our method, we put it to the test on a range of challenging pose estimation datasets.

First, we introduce a new dataset (SYMSOL I) of images rendered around simple symmetric solids. It includes images of platonic solids (tetrahedron, cube, icosahedron) and

Figure 3. IPDF predicted distributions for SYMSOL. (a) The cone has one great circle of equivalent orientations under symmetry. (b)
The 60 modes of icosahedral symmetry would be exceedingly difficult for a mixture density network based approach, but IPDF can get
quite close (we omit the ground truths from the left and middle visualizations for clarity). (c) The marked tetrahedron (“tetX”) has one red
face. When it is visible, the 12-fold tetrahedral symmetry reduces to only three equivalent rotations. With less information about the
location of the red face, more orientations are possible: 6 when two white faces are visible (middle) and 9 when only one white face is
visible (right). (d) The orientation of the marked sphere (“sphereX”) is unambiguous when both markings are visible (left). When they
are not (middle), all orientations with the markings on the hidden side of the sphere are possible. When only a portion of the markings are
visible (right; inset is a magnification showing several pixels of the X are visible), the IPDF distribution captures the partial information.

surfaces of revolution (cone, cylinder), with 100,000 renderings of each shape from poses sampled uniformly at random from SO(3). Each image is paired with its ground truth symmetries (the set of rotations of the source object that would not change the image), which are easily derived for these shapes. As would be the case in most practical situations, where symmetries are not known and/or only approximate, we use such annotations only for evaluation and not for training. Access to the full set of equivalent rotations opens new avenues of evaluating model performance rarely possible with pose estimation datasets.

While the textureless solids generate a challenging variety of distributions, they can still be approximated with mixtures of simple unimodal distributions such as the Bingham (Deng et al., 2020; Gilitschenski et al., 2019). We go one step further and break the symmetry of objects by texturing with small markers (SYMSOL II). When the marker is visible, the pose distribution is no longer ambiguous and collapses given the extra information. When the marker is not visible, only a subspace of the symmetric rotations for the textureless shape is possible.

For example, consider a textureless sphere. Its pose distribution is uniform – rotations will not change the input image. Now suppose we mark this sphere with a small arrow. If the arrow is visible, the pose distribution collapses to an impulse. If the arrow is not visible, the distribution is no longer uniform, since about half of the space of possible rotations can now be eliminated. This distribution cannot be easily approximated by mixtures of unimodals.

SYMSOL II objects include a sphere marked with a small letter “X” capped with a dot to break flip symmetry when visible (sphX), a tetrahedron with one red and three white faces (tetX), and a cylinder marked with a small filled off-centered circle (cylO). We render 100,000 images for each.

The two SYMSOL datasets test expressiveness, but the solids are relatively simple and the dataset does not require generalization to unseen objects. ModelNet10-SO(3) was introduced by Liao et al. (2019) to study pose estimation on rendered images of CAD models from ModelNet10 (Wu et al., 2015). As in SYMSOL, the rotations of the objects cover all of SO(3) and therefore present a difficulty for methods that rely on particular rotation formats such as Euler angles (Liao et al., 2019; Prokudin et al., 2018).

The Pascal3D+ dataset (Xiang et al., 2014) is a popular benchmark for pose estimation on real images, consisting

Table 1. Distribution estimation on SYMSOL I and II. We report the average log likelihood on both parts of the SYMSOL dataset, as
a measure for how well the multiple equivalent ground truth orientations are represented by the output distribution. For reference, a
minimally informative uniform distribution over SO(3) has an average log likelihood of -2.29. IPDF’s expressivity allows it to more
accurately represent the complicated pose distributions across all of the shapes. A separate model was trained for each shape for all
baselines and for all of SYMSOL II, but only a single IPDF model was trained on all five shapes of SYMSOL I.
                            SYMSOL I (log likelihood ↑)                     SYMSOL II (log likelihood ↑)
Method                      avg.    cone    cyl.    tet.    cube    ico.    avg.    sphX    cylO    tetX
Deng et al. (2020)          −1.48   0.16    −0.95   0.27    −4.44   −2.45   2.57    1.12    2.99    3.61
Gilitschenski et al. (2019) −0.43   3.84    0.88    −2.29   −2.29   −2.29   3.70    3.32    4.88    2.90
Prokudin et al. (2018)      −1.87   −3.34   −1.28   −1.86   −0.50   −2.39   0.48    −4.19   4.16    1.48
IPDF (Ours)                 4.10    4.45    4.26    5.70    4.81    1.28    7.57    7.30    6.91    8.49
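The uniform reference value quoted in the caption of Table 1 can be checked by hand. Assuming natural logarithms, and taking the total volume of SO(3) to be π² as implied by V = π²/N in Section 3.1, a uniform density assigns 1/π² to every rotation, so

$$
\log p_{\text{uniform}}(R \mid x) = \log \frac{1}{\pi^2} = -2 \ln \pi \approx -2.29 .
$$

Entries of −2.29 in the table therefore correspond to a prediction that is no more informative than the uniform distribution.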

of twelve categories of objects. Though some of the categories contain instances with symmetries (e.g. bottle and table), the ground truth annotations have generally been disambiguated and restricted to subsets of SO(3). This allows methods which regress to a single pose to perform competitively (Liao et al., 2019). Nevertheless, the dataset is a challenging test on real images.

Finally, we evaluate on T-LESS (Hodaň et al., 2017), consisting of texture-less industrial parts with various discrete and continuous approximate symmetries. As in Gilitschenski et al. (2019), we use the Kinect RGB single-object images, tight-cropped and color-normalized. Although the objects are nearly symmetric, their symmetry-breaking features are visible in most instances. Nonetheless, it serves as a useful benchmark to compare distribution metrics with Gilitschenski et al. (2019). We find that IPDF proves competitive across the board.

4.2. Baselines

We compare to several recent works which parameterize distributions on SO(3) for the purpose of pose estimation. Gilitschenski et al. (2019) and Deng et al. (2020) output the parameters for mixtures of Bingham distributions and interpolate from a large lookup table to compute the normalization constant. Mohlin et al. (2020) output the parameters for a unimodal matrix Fisher distribution and similarly employ an approximation scheme to compute the normalization constant. Prokudin et al. (2018) decompose SO(3) into the product of three independent distributions over Euler angles, with the capability for multimodality through an ‘infinite mixture’ approach. Finally, we compare to the spherical regression work of Liao et al. (2019), which directly regresses to Euler angles, to highlight the comparative advantages of distribution-based methods. We quote reported values and run publicly released code when values are unavailable. See Supplemental Material for additional details.

4.3. SYMSOL I: symmetric solids

We report the average log likelihood in Table 1, and the gap between IPDF and the baselines is stark. The average log likelihood indicates how successful the prediction is at distributing probability mass around all of the ground truths. The expressivity afforded by our method allows it to capture both the continuous and discrete symmetries present in the dataset. As the order of the symmetry increases from 12 for the tetrahedron, to 24 for the cube, and finally 60 for the icosahedron, the baselines struggle and tend to perform at the same level as a minimally informative (uniform) distribution over SO(3). The difference between IPDF and the baselines in Table 1 is further cemented by the fact that a single IPDF model was trained on all five shapes while the baselines were allowed a separate model per shape. Interestingly, while the winner-take-all strategy of Deng et al. (2020) enabled training with more Bingham modes than Gilitschenski et al. (2019), it seems to have hindered the ability to faithfully represent the continuous symmetries of the cone and cylinder, as suggested by the relative performance of these methods.

Table 2. ModelNet10-SO(3) accuracy and median angle error. Metrics are averaged over categories. Our model can output pose candidates, so we also evaluate top-k metrics, which are more robust to the lack of symmetry annotations in this dataset. See Supplementary Material for the complete table with per-category metrics.

Method                      Acc@15° ↑   Acc@30° ↑   Med. (°) ↓
Liao et al. (2019)          0.496       0.658       28.7
Deng et al. (2020)          0.562       0.694       32.6
Prokudin et al. (2018)      0.456       0.528       49.3
Mohlin et al. (2020)        0.693       0.757       17.1
IPDF (ours)                 0.719       0.735       21.5
IPDF (ours), top-2          0.868       0.888       4.9
IPDF (ours), top-4          0.904       0.926       4.8

Table 3. Results on a standard pose estimation benchmark, Pascal3D+. As is common, we show accuracy at 30° (top) and median error in degrees (bottom), for each category and also averaged over categories. Our IPDF is at or near state-of-the-art on many categories. ‡ The results for Liao et al. (2019) and Mohlin et al. (2020) differ from their published numbers. For Liao et al. (2019), published errors are known to be incorrectly scaled by a √2 factor, and Mohlin et al. (2020) evaluates on a non-standard test set. See Supplemental for details.

Acc@30° ↑                 avg.   aero   bike   boat   bottle bus    car    chair  table  mbike  sofa   train  tv
‡Liao et al. (2019)       0.819  0.82   0.77   0.55   0.93   0.95   0.94   0.85   0.61   0.80   0.95   0.83   0.82
‡Mohlin et al. (2020)     0.825  0.90   0.85   0.57   0.94   0.95   0.96   0.78   0.62   0.87   0.85   0.77   0.84
Prokudin et al. (2018)    0.838  0.89   0.83   0.46   0.96   0.93   0.90   0.80   0.76   0.90   0.90   0.82   0.91
Tulsiani & Malik (2015)   0.808  0.81   0.77   0.59   0.93   0.98   0.89   0.80   0.62   0.88   0.82   0.80   0.80
Mahendran et al. (2018)   0.859  0.87   0.81   0.64   0.96   0.97   0.95   0.92   0.67   0.85   0.97   0.82   0.88
IPDF (Ours)               0.837  0.81   0.85   0.56   0.93   0.95   0.94   0.87   0.78   0.85   0.88   0.78   0.86

Median error (°) ↓        avg.   aero   bike   boat   bottle bus    car    chair  table  mbike  sofa   train  tv
‡Liao et al. (2019)       13.0   13.0   16.4   29.1   10.3   4.8    6.8    11.6   12.0   17.1   12.3   8.6    14.3
‡Mohlin et al. (2020)     11.5   10.1   15.6   24.3   7.8    3.3    5.3    13.5   12.5   12.9   13.8   7.4    11.7
Prokudin et al. (2018)    12.2   9.7    15.5   45.6   5.4    2.9    4.5    13.1   12.6   11.8   9.1    4.3    12.0
Tulsiani & Malik (2015)   13.6   13.8   17.7   21.3   12.9   5.8    9.1    14.8   15.2   14.7   13.7   8.7    15.4
Mahendran et al. (2018)   10.1   8.5    14.8   20.5   7.0    3.1    5.1    9.3    11.3   14.2   10.2   5.6    11.7
IPDF (Ours)               10.3   10.8   12.9   23.4   8.8    3.4    5.3    10.0   7.3    13.6   9.5    6.4    12.3

4.4. SYMSOL II: nearly-symmetric solids

When trained on the solids with distinguishing features which are visible only from a subset of orientations, IPDF is far ahead of the baselines (Table 1). The prediction serves as a sort of ‘belief state’, with the flexibility of being unconstrained by a particular parameterization of the distribution. The marked cylinder in the right half of Figure 1 displays this nicely. When the red marking is visible, the pose is well defined from the image and the network outputs a sharp peak at the correct, unambiguous location. When the cylinder marking is not visible, there is irreducible ambiguity conveyed in the output with half of the full cylindrical symmetry shown on the left side of the figure.

The pose distribution of the marked tetrahedron in Figure 3c takes a discrete form. Depending on which faces are visible, a subset of the full 12-fold tetrahedral symmetry can be ruled out. For example, with the one red face visible in the left subplot of Figure 3c, there is nothing to distinguish the three remaining faces, and the implicit distribution reflects this state with three modes.

Figure 3d shows the IPDF prediction for various views of the marked sphere. When the marking is not visible at all (middle subplot), the half of SO(3) where the marking faces the camera can be ruled out; IPDF assigns zero probability to half of the space. When only a portion of the marking is visible (right subplot), IPDF yields a nontrivial distribution with an intermediate level of ambiguity, capturing the partial information contained in the image.

Figure 4. Bathtubs may have exact or approximate 2-fold symmetries around one or more axes. We show our predicted probabilities as solid disks, the ground truth as circles, and the predictions of Liao et al. (2019) as crosses. Our model assigns high probabilities to all symmetries, while the regression method ends up far from every symmetry mode (note the difference in position and color between circles and crosses).

4.5. ModelNet10-SO(3)

Unimodal methods perform poorly on categories with rotational symmetries such as bathtub, desk and table (see the supplementary material for complete per-category results). When trained with a single ground truth pose selected randomly from among multiple distinct rotations, these methods tend to split the difference and predict a rotation equidistant from all equivalent possibilities. The most extreme example of this behavior is the bathtub category, which contains instances with approximate or exact two-fold symmetry around one or more axes (see Fig. 4). With two modes of symmetry separated by 180°, the outputs tend to be 90° away from each mode. We observe this behavior in Liao et al. (2019) and Mohlin et al. (2020).

Since our model can easily represent any kind of symmetry, it does not suffer from this problem, as illustrated in Fig. 4. The predicted distribution captures the symmetry of the object but returns only one of the possibilities during inference. This is penalized by metrics that rely on a single ground truth, since picking the mode that is not annotated results in a 180° error, while picking the midpoint between two modes (which is far from both) results in a 90° error. Since some bathtub instances have two-fold symmetries over more than one axis (like the top-right of Fig. 4), our median error ends up closer to 180° when the symmetry annotation is incomplete, which in turn significantly increases the average over all categories. We observe the same for other multi-modal methods (Prokudin et al., 2018; Deng et al., 2020).

Our performance increases dramatically in the top-k evaluation even for k = 2 (see Table S4). The ability to output pose candidates is an advantage of our model, and is not possible for direct regression (Liao et al., 2019) or unimodal methods (Mohlin et al., 2020). While models based on mixtures of unimodal distributions could, in theory, produce pose candidates, their current implementations (Gilitschenski et al., 2019; Deng et al., 2020) suffer from mode collapse and are constrained to a fixed number of modes.

Figure 5. IPDF predicted distributions on Pascal3D+. We display a sampling of IPDF pose predictions to highlight the richness of information contained in the full distribution output, as compared to a single pose estimate. Uncertainty regions and multi-modal predictions are freely expressed, owing to the non-parametric nature of IPDF.

4.6. Pascal3D+

In contrast to the full coverage of SO(3) and the presence of symmetries and ambiguities in the SYMSOL and ModelNet10-SO(3) datasets, Pascal3D+ serves as a check that pose estimation performance in the unambiguous case is not sacrificed. In fact, as the results of Table 3 show, IPDF performs as well as or better than the baselines which constitute a variety of methods to tackle the pose estimation problem. The feat is remarkable given that our method was designed for maximal expressiveness and not for the single-prediction, single-ground truth scenario. IPDF performance in terms of median angular error, while good, overlooks the wealth of information contained in the full predicted distribution. Sample pose predictions are shown in Figure 5 and in the Supplemental; the distributions express uncertainty and category-level pose ambiguities.

Table 4. Pose estimation on T-LESS. LL is the log-likelihood, spread is the mean angular error, and Med. is the median angular error for single-valued predictions. Gilitschenski et al. (2019) underestimate the spread in their evaluation by disregarding the dispersion.

                              LL ↑    Spread (°) ↓   Med. (°) ↓
Deng et al. (2020)            5.3     23.1           3.1
Gilitschenski et al. (2019)   6.9     3.4            2.7
Prokudin et al. (2018)        8.8     34.3           1.2
Liao et al. (2019)            -       -              2.6
IPDF (Ours)                   9.8     4.1            1.3

4.7. T-LESS

The results of Table 4, and specifically the success of the regression method of Liao et al. (2019), show that approximate or exact symmetries are not an issue in the particular split of the T-LESS dataset used in Gilitschenski et al. (2019). All methods are able to achieve median angular errors of less than 4°. Among the methods which predict a probability distribution over pose, IPDF maximizes the average log likelihood and minimizes the spread, when correctly factoring the uncertainty into the metric evaluation.

5. Conclusion

In this work we have demonstrated the capacity of an implicit function to represent highly expressive, non-parametric distributions on the rotation manifold. It performs as well as or better than state of the art parameterized distribution methods, on standard pose estimation benchmarks where the ground truth is a single pose. On the new and difficult SYMSOL dataset, the implicit method is far superior while being simple to implement as it does not require any onerous calculations of a normalization constant. Particularly, we show in SYMSOL II that our method can represent distributions that cannot be approximated well by current mixture-based models. See the Supplementary Material for additional visualizations, ablation studies and timing evaluations, extended discussion about metrics, and implementation details.

References

Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T., Way, G. P., Ferrero, E., Agapow, P.-M., Zietz, M., Hoffman, M. M., et al. Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface, 15(141):20170387, 2018.

Clauset, A., Shalizi, C. R., and Newman, M. E. Power-law distributions in empirical data. SIAM review, 51(4):661–703, 2009.

Corona, E., Kundu, K., and Fidler, S. Pose Estimation for Objects with Rotational Symmetry. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7215–7222. IEEE, 2018.

Deng, H., Bui, M., Navab, N., Guibas, L., Ilic, S., and Birdal, T. Deep Bingham Networks: Dealing with Uncertainty and Ambiguity in Pose Estimation. arXiv preprint arXiv:2012.11002, 2020.

Deng, X., Mousavian, A., Xiang, Y., Xia, F., Bretl, T., and Fox, D. PoseRBPF: A Rao-Blackwellized Particle Filter for 6D Object Pose Estimation. In Proceedings of Robotics: Science and Systems, Freiburg im Breisgau, Germany, June 2019. doi: 10.15607/RSS.2019.XV.049.

Esteves, C., Sud, A., Luo, Z., Daniilidis, K., and Makadia, A. Cross-Domain 3D Equivariant Image Embeddings. In International Conference on Machine Learning (ICML), 2019.

Falorsi, L., de Haan, P., Davidson, T. R., and Forré, P. Reparameterizing Distributions on Lie Groups. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3244–3253. PMLR, 2019.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In International Conference on Machine Learning (ICML), pp. 1050–1059. PMLR, 2016.

Gilitschenski, I., Sahoo, R., Schwarting, W., Amini, A., Karaman, S., and Rus, D. Deep Orientation Uncertainty Learning Based on a Bingham Loss. In International Conference on Learning Representations, 2019.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672–2680, 2014.

Gorski, K. M., Hivon, E., Banday, A. J., Wandelt, B. D., Hansen, F. K., Reinecke, M., and Bartelmann, M. HEALPix: A Framework for High-Resolution Discretization and Fast Analysis of Data Distributed on the Sphere. The Astrophysical Journal, 622(2):759–771, Apr 2005. ISSN 1538-4357. doi: 10.1086/427976. URL http://dx.doi.org/10.1086/427976.

Han, D., Kwong, T., and Li, S. Uncertainties in real-time flood forecasting with neural networks. Hydrological Processes: An International Journal, 21(2):223–228, 2007.

He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385, 2015.

Hodaň, T., Haluza, P., Obdržálek, Š., Matas, J., Lourakis, M., and Zabulis, X. T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects. IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.

Hong, W., Yang, A. Y., Huang, K., and Ma, Y. On Symmetry and Multiple-View Geometry: Structure, Pose, and Calibration from a Single Image. International Journal of Computer Vision, 60(3):241–265, 2004.

Kendall, A. and Cipolla, R. Modelling uncertainty in deep learning for camera relocalization. In 2016 IEEE international conference on Robotics and Automation (ICRA), pp. 4762–4769. IEEE, 2016.

Lee, T., Leok, M., and McClamroch, N. H. Global symplectic uncertainty propagation on SO(3). In Proceedings of the 47th IEEE Conference on Decision and Control, CDC 2008, December 9-11, 2008, Cancún, Mexico, pp. 61–66, 2008. doi: 10.1109/CDC.2008.4739058. URL https://doi.org/10.1109/CDC.2008.4739058.

Leibig, C., Allken, V., Ayhan, M. S., Berens, P., and Wahl, S. Leveraging uncertainty information from deep neural networks for disease detection. Scientific reports, 7(1):1–14, 2017.

Levinson, J., Esteves, C., Chen, K., Snavely, N., Kanazawa, A., Rostamizadeh, A., and Makadia, A. An Analysis of SVD for Deep Rotation Estimation. In Advances in Neural Information Processing Systems 34, 2020.

Liao, S., Gavves, E., and Snoek, C. G. M. Spherical Regression: Learning Viewpoints, Surface Normals and 3D Rotations on n-Spheres. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Mahendran, S., Ali, H., and Vidal, R. A mixed classification-regression framework for 3d pose estimation from 2d images. The British Machine Vision Conference (BMVC), 2018.

Manhardt, F., Arroyo, D. M., Rupprecht, C., Busam, B., Birdal, T., Navab, N., and Tombari, F. Explaining the Ambiguity of Object Detection and 6D Pose From Visual Data. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

Mardia, K. V. and Jupp, P. E. Directional Statistics. John Wiley and Sons, LTD, London, 2000.

McAllister, R., Gal, Y., Kendall, A., Van Der Wilk, M., Shah, A., Cipolla, R., and Weller, A. Concrete problems for autonomous vehicle safety: Advantages of bayesian deep learning. International Joint Conferences on Artificial Intelligence, Inc., 2017.

Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., and Geiger, A. Occupancy Networks: Learning 3D Reconstruction in Function Space. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2019.

Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV, 2020.

Mohlin, D., Bianchi, G., and Sullivan, J. Probabilistic Orientation Estimation with Matrix Fisher Distributions. In Advances in Neural Information Processing Systems 33, 2020.

Okorn, B., Xu, M., Hebert, M., and Held, D. Learning Orientation Distributions for Object Pose Estimation. In IEEE International Conference on Robotics and Automation (ICRA), 2020.

Park, J. J., Florence, P., Straub, J., Newcombe, R. A., and Lovegrove, S. Deepsdf: Learning continuous signed distance functions for shape representation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 165–174, 2019. doi: 10.1109/CVPR.2019.00025.

Peretroukhin, V., Giamou, M., Rosen, D. M., Greene, W. N., Roy, N., and Kelly, J. A Smooth Representation of SO(3) for Deep Rotation Learning with Uncertainty. In Proceedings of Robotics: Science and Systems (RSS), Jul. 12–16 2020.

Pitteri, G., Ramamonjisoa, M., Ilic, S., and Lepetit, V. On Object Symmetries and 6D Pose Estimation from Images. CoRR, abs/1908.07640, 2019. URL http://arxiv.org/abs/1908.07640.

Poggio, T. and Vetter, T. Recognition and Structure from one 2D Model View: Observations on Prototypes, Object Classes and Symmetries. Technical report, MASSACHUSETTS INST OF TECH CAMBRIDGE ARTIFICIAL INTELLIGENCE LAB, 1992.

Prokudin, S., Gehler, P., and Nowozin, S. Deep Directional Statistics: Pose Estimation with Uncertainty Quantification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 534–551, 2018.

Rezende, D. and Mohamed, S. Variational Inference with Normalizing Flows. In International Conference on Machine Learning, pp. 1530–1538. PMLR, 2015.

Rezende, D. J., Papamakarios, G., Racaniere, S., Albergo, M., Kanwar, G., Shanahan, P., and Cranmer, K. Normalizing Flows on Tori and Spheres. In International Conference on Machine Learning, pp. 8083–8092. PMLR, 2020.

Rothwell, C., Forsyth, D. A., Zisserman, A., and Mundy, J. L. Extracting Projective Structure from Single Perspective Views of 3D Point Sets. In 1993 (4th) International Conference on Computer Vision, pp. 573–582. IEEE, 1993.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Saxena, A., Driemeyer, J., and Ng, A. Y. Learning 3-D Object Orientation from Images. In IEEE International Conference on Robotics and Automation (ICRA), 2009.

Sitzmann, V., Zollhöfer, M., and Wetzstein, G. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 1119–1130, 2019.

Su, H., Qi, C. R., Li, Y., and Guibas, L. J. Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views. In Proceedings of the IEEE international conference on computer vision, pp. 2686–2694, 2015.

Sundermeyer, M., Marton, Z., Durner, M., Brucker, M., and Triebel, R. Implicit 3D Orientation Learning for 6D Object Detection from RGB Images. CoRR, abs/1902.01275, 2019.

Suwajanakorn, S., Snavely, N., Tompson, J. J., and Norouzi, M. Discovery of Latent 3D Keypoints via End-to-end Geometric Reasoning. In Advances in Neural Information Processing Systems (NIPS), pp. 2063–2074, 2018.

Tulsiani, S. and Malik, J. Viewpoints and keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., and Xiao, J. 3D ShapeNets: A Deep Representation for Volumetric Shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912–1920, 2015.

Xiang, Y., Mottaghi, R., and Savarese, S. Beyond PASCAL: A benchmark for 3D object detection in the wild. In 2014 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 75–82, March 2014.

Xiang, Y., Kim, W., Chen, W., Ji, J., Choy, C., Su, H., Mottaghi, R., Guibas, L., and Savarese, S. ObjectNet3D: A Large Scale Database for 3D Object Recognition. In European Conference Computer Vision (ECCV), 2016.

Yershova, A., Jain, S., Lavalle, S. M., and Mitchell, J. C. Generating Uniform Incremental Grids on SO (3) Using the Hopf Fibration. The International journal of robotics research, 29(7):801–812, 2010.

Supplemental Material for Implicit-PDF: Non-Parametric Representation of Probability Distributions on the Rotation Manifold
S1. Additional IPDF predictions for objects from Pascal3D+

Figure S1. Sample IPDF outputs on Pascal3D+ objects. We visualize predictions by the IPDF model, trained on all twelve object
categories, which yielded the results in Table 3 of the main text. The ground truth rotations are displayed as the colored open circles.

In Figure S1 we show sample predictions from IPDF trained on the objects in Pascal3D+. The network outputs much more
information about the pose of the object in the image than can be expressed in a single estimate. Even in the examples where
the distribution is unimodal, and the pose is relatively unambiguous, IPDF provides rich information about the uncertainty
around the most likely pose. The expressivity of IPDF allows it to express category-level symmetries, which appear as
multiple modes in the distributions above. The most stand-out example in Figure S1 is of the bicycle in the second row:
the pose estimate of IPDF is incredibly uncertain, yet still there is information in the exclusion of certain regions of SO(3)
which have been ‘ruled out’. The expressivity of IPDF allows an unprecedented level of information to be contained in the
predicted pose distributions.

S2. Extension of IPDF beyond SO(3)


IPDF is not limited to probability distributions on SO(3), which nevertheless served as a challenging and practical testing ground for the method. With minor modifications, IPDF can be extended to the problem of pose with six degrees of freedom (6DOF): we append translation coordinates to the rotation query, and use 10× more samples during training to adequately cover the full joint space. Normalizing the distributions is similarly straightforward, by querying over a product of Cartesian and HEALPix-derived grids. Predicted distributions on modified images of SYMSOL are shown in Figure S2. For two renderings of a cone from an identical orientation but different translations, only the predicted distribution over translation differs between the two images.

Figure S2. Extension to 6DOF rotation+translation estimation. We train IPDF on a modified SYMSOL I dataset, where the objects are also translated in space. Shown above are two images of a cone with the same orientation but shifted in space. We query the network over the full joint space of translations and rotations, and visualize the marginal distributions. Each point in rotation space has a corresponding point in translation space, and we give both the same color to indicate the correspondence. While uninformative in the above plots, this coloring scheme allows nontrivial joint distributions to be expressed.

S3. SYMSOL spread evaluation, compared to multimodal Bingham

Table S1. Spread estimation on SYMSOL. This metric evaluates how closely the probability mass is centered on any of the equivalent ground truths. For this reason, we can only evaluate it on SYMSOL I, where all ground truths are known at test time. Values are in degrees.

               cone   cyl.   tet.   cube   ico.
Deng et al.    10.1   15.2   16.7   40.7   29.5
Ours           1.4    1.4    4.6    4.0    8.4

We evaluate the spread metric on the SYMSOL I dataset, where the full set of ground truths is known at test time, for IPDF
and the method of Deng et al. (2020). The results are shown in Table S1.
The metric values, in degrees, show how well the implicit method is able to home in on the ground truths. For the cone and
cylinder, the spread of probability mass away from the continuous rotational symmetry has a typical scale of just over one
degree.
The predicted distributions in Figure S3 for a tetrahedron and cone visually ground the values of Table S1. Many of the individual unimodal Bingham components can be identified in the output distributions of Deng et al. (2020), highlighting the difficulty of covering the great circle of ground truth rotations for the cone with only a limited number of unimodal distributions (bottom). For both shapes, the probability mass is spread significantly more widely and diffusely around the ground truths than it is for IPDF, shown on the right.
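
As a rough illustration of how a spread value like those in Table S1 could be computed, the sketch below takes the spread to be the probability-weighted angular distance from each query rotation to its nearest ground truth. The function names, the discrete query grid, and the toy two-fold symmetric example are assumptions made for illustration, not the evaluation code behind the table.

```python
import numpy as np

def geodesic_angle_deg(R1, R2):
    """Rotation angle in degrees of R1^T R2, broadcasting over leading axes."""
    rel = np.swapaxes(R1, -1, -2) @ R2
    cos = (np.trace(rel, axis1=-2, axis2=-1) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def spread_deg(grid_rotations, probabilities, ground_truths):
    """Expected angular distance to the nearest ground truth.

    grid_rotations: (N, 3, 3) query rotations
    probabilities:  (N,) probability mass per query, summing to 1
    ground_truths:  (M, 3, 3) equivalent ground truth rotations
    """
    # (N, M) angles between every query and every ground truth.
    angles = geodesic_angle_deg(grid_rotations[:, None], ground_truths[None, :])
    return float(np.sum(probabilities * angles.min(axis=1)))

def rot_z(deg):
    """Rotation by `deg` degrees about the z axis (hypothetical helper)."""
    t = np.radians(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Toy case: two-fold symmetry about z, with mass placed 5 degrees off each mode.
ground_truths = np.stack([rot_z(0.0), rot_z(180.0)])
grid = np.stack([rot_z(5.0), rot_z(175.0)])
probs = np.array([0.5, 0.5])
toy_spread = spread_deg(grid, probs, ground_truths)
```

Because the minimum is taken over all equivalent ground truths, a prediction is not penalized for placing mass on a symmetry mode other than the annotated one, which is why the metric requires the full ground truth set.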

Figure S3. Comparison of predicted distributions: tetrahedron and cone. We show predicted pose distributions for a tetrahedron
(top) and cone (bottom). Displayed on the left is the method of Deng et al. (2020), which outputs parameters for a mixture of Bingham
distributions. The right side shows IPDF. The predicted distributions from the implicit method are much more densely concentrated
around the ground truth, providing a visual grounding for the significant difference in the spread values of Table S1.

S4. Computational cost


We evaluate the computational cost of our method by measuring the time it takes to obtain the pose distribution for a single image, which determines the frequency at which it could run in real time. The fair baseline here is the direct regression method of Liao et al. (2019), using the same ResNet-50 backbone and the same size of MLP. The only difference is that while Liao et al. (2019) feed only the image descriptor to the MLP, our model concatenates the descriptor with a number of query poses from a grid.
Table S2 shows the results. When using the coarsest grid, the performance overhead is negligible with respect to the baseline. This grid has approximately 5° between nearest neighbors, which might be enough for some applications. When increased accuracy is required, our model can use more samples, trading speed for accuracy. Note that the MLP operations are highly parallelizable on GPUs, so the processing time grows more slowly than linearly with the grid size.

Table S2. Inference time evaluation. For our method, we measure the time needed to generate the normalized distribution over SO(3) given a single 224 × 224 image. The numbers of samples correspond to the HEALPix-SO(3) grids of levels 3, 4, and 5, respectively. The coarsest grid has an average distance of approximately 5° between nearest neighbors. The processing time grows more slowly than linearly.

Method        Number of samples   frames/s ↑   Acc@15° ↑   Acc@30° ↑   Med. (°) ↓
Liao et al.   -                   18.2         0.522       0.652       38.2
Ours          37 k                18.3         0.717       0.735       25.1
Ours          295 k               9.1          0.723       0.738       17.6
Ours          2359 k              2.4          0.723       0.738       18.7
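
The sample counts in Table S2 are consistent with a grid that pairs each HEALPix pixel on the sphere with a set of in-plane rotations, giving 72 · 8^level rotations in total. The short sketch below reproduces the counts under that assumption; the function name and the split into sphere pixels and tilts are illustrative, not taken from the released code.

```python
def healpix_so3_grid_size(level):
    """Number of rotations in a HEALPix-SO(3) grid at the given level (assumed construction)."""
    sphere_pixels = 12 * 4 ** level   # HEALPix pixels covering the sphere
    tilts = 6 * 2 ** level            # in-plane rotations paired with each pixel
    return sphere_pixels * tilts      # equals 72 * 8**level

# Levels 3, 4 and 5 are the ones listed in Table S2.
grid_sizes = {level: healpix_so3_grid_size(level) for level in (3, 4, 5)}
```

Each additional level multiplies the number of queries by eight, which is why the sub-linear growth in processing time matters for the finer grids.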

S5. Ablations
Figure S4. Ablative studies. We report the average log likelihood for the shapes of SYMSOL I with various aspects of the method ablated. Error bars are the standard deviation over five networks trained with different random seeds. In the top row, we show the dependence on the size of the dataset, with performance leveling off after 50,000 images per shape. The subsequent row varies the positional encoding, with 0 positional encoding terms corresponding to no positional encoding at all: the flattened rotation matrix is the query rotation. The third row examines the role of the rotation format when querying the MLP (before positional encoding is applied). The final row shows that, during training, inexact normalization arising from the queries being randomly sampled over SO(3) leads to performance roughly equivalent to that of the proper normalization from using the equivolumetric grid as the query points. Note that evaluation makes use of an equivolumetric grid in both cases, to calculate the log likelihood.

In Figure S4, we show the average log likelihood on the five shapes of SYMSOL I under ablations of various aspects of the method. The top row shows the dependence on the size of the dataset. Performance levels off after 50,000 images per shape, but is greatly diminished for only 10,000 examples. Note that almost all of the values for 10,000 images are less than the log likelihood of a uniform distribution over SO(3), −log A = −2 log π ≈ −2.29 (where A = π² is the volume of SO(3)), the ‘safest’ distribution to output if training is unsuccessful. This indicates overfitting: with only one rotation for each training example, a minimum number of examples is needed to connect all the ground truths with each view of a shape. The network becomes confident about the rotations it has seen paired with a particular view, and assigns small probability to the unseen ground truths, resulting in large negative log likelihood values.
The second row varies the positional encoding applied to the rotations when querying the MLP. Zero positional encoding terms
correspond to no positional encoding at all: the flattened rotation matrix is used as the query rotation. The positional
encoding benefits the three shapes with discrete symmetries and is neutral or even slightly negative for the cone and cylinder.
Intended to facilitate the representation of high frequency features (Mildenhall et al., 2020), positional encoding helps
capture the twelve modes of tetrahedral symmetry with two terms, whereas four are necessary for peak performance on the
cube and icosahedron. For all shapes, including more positional encoding terms eventually degrades the performance.
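
To make the ablated quantity concrete, here is a minimal sketch of a positional encoding applied to flattened rotation matrices, in the spirit of Mildenhall et al. (2020). The frequency schedule 2^k · π and the choice to drop the raw input when terms are present are assumptions of this sketch; the paper's exact encoding may differ.

```python
import numpy as np

def positional_encoding(rotations, num_terms):
    """Encode flattened rotation matrices with sinusoids of increasing frequency.

    rotations: (N, 3, 3) array of query rotations.
    num_terms: number of frequency terms; 0 returns the flattened matrices unchanged.
    """
    flat = rotations.reshape(len(rotations), 9)
    if num_terms == 0:
        return flat
    # Assumed schedule: frequencies 2^k * pi for k = 0, ..., num_terms - 1.
    freqs = (2.0 ** np.arange(num_terms)) * np.pi
    scaled = flat[:, None, :] * freqs[None, :, None]                   # (N, T, 9)
    encoded = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)  # (N, T, 18)
    return encoded.reshape(len(rotations), -1)                         # (N, 18 * T)

queries = np.stack([np.eye(3), np.eye(3)])
no_encoding = positional_encoding(queries, 0)    # shape (2, 9)
two_terms = positional_encoding(queries, 2)      # shape (2, 36)
```

Higher-frequency terms let the MLP separate nearby rotations more sharply, which fits the observation that discrete symmetries benefit from a few terms while too many terms degrade performance.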
In the third row, we compare different formats for the query rotation, pre-positional encoding. For all shapes, representing
rotations as matrices is optimal, with axis-angle and quaternion formats comparable to each other and a fair amount worse.
Representing rotations via Euler angles averages out near the log likelihood of a uniform distribution (−2.29), though with a
large spread which indicates most but not all runs fail to train.
Finally, the fourth row examines the effect of normalization in the likelihood loss during training. Randomly sampling
queries from SO(3) offers simplicity and freedom over the exact number of queries, but results in inexact normalization of
the probability distribution. During training, this leads to roughly equivalent performance as when an equivolumetric grid of
queries is used, which can be exactly normalized.
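
The normalization discussed here can be sketched as a softmax over the queried rotations, rescaled by the volume each query is taken to represent. The sketch assumes N queries that each stand for a cell of volume π²/N, which is exact for an equivolumetric grid and only approximate for random samples; the function name is hypothetical.

```python
import numpy as np

SO3_VOLUME = np.pi ** 2  # consistent with the uniform log likelihood of -2 log(pi)

def log_likelihood(logit_gt, logits_other):
    """Log density at the ground truth, normalized over N = 1 + len(logits_other) queries."""
    logits = np.concatenate([[logit_gt], np.asarray(logits_other, dtype=float)])
    n = logits.size
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = logits.max()
    log_z = m + np.log(np.sum(np.exp(logits - m)))
    # Each query is assumed to represent a cell of volume SO3_VOLUME / n.
    return logit_gt - log_z + np.log(n / SO3_VOLUME)

# Equal logits everywhere recover the uniform distribution over SO(3).
uniform_ll = log_likelihood(0.0, np.zeros(4095))
# A ground truth logit well above the rest gives a much higher log likelihood.
peaked_ll = log_likelihood(10.0, np.zeros(4095))
```

With equal logits the result is −2 log π ≈ −2.29 regardless of N, matching the uniform baseline quoted in the ablations. The training loss is the negative of this quantity.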

Figure S5. The efficacy of gradient ascent on Pascal3D+. We report the average performance across classes on Pascal3D+, for the same
IPDF model, using different means to extract a single-valued pose estimate. The error bars are the standard deviation among random
sampling attempts, and the curves are slightly offset horizontally for clarity.

In Figure S5 we show the efficacy of performing gradient ascent to extract the most likely pose from IPDF, given an image.
The first way to find the rotation with maximal probability is by sampling from SO(3) and taking the argmax over the
unnormalized outputs of IPDF. Predictably, finer resolution of the samples yields more accurate predictions, indicated by
shrinking median angular error (left) and growing accuracy at 30° (right) averaged over the categories of Pascal3D+. The
second way to produce an estimate leverages the fact that IPDF is fully differentiable. We use the best guess from a sampling
of queries as a starting value for gradient ascent on the output of IPDF. The space of valid rotations is embedded in a much
larger query space, so we project the updated query back to SO(3) after every step of gradient ascent, and run it for 100
steps. The estimates returned by gradient ascent yield optimal performance for anything more than 10,000 queries, whereas
argmax requires more than 500,000 queries for similar results. The difference between the argmax and gradient ascent is
primarily in the median angular error (left): improvements of an estimate on the order of a degree would benefit this statistic
more than the accuracy at 30°.
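
The projected gradient ascent described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the projection back to SO(3) is done with an SVD (an assumption, since the text does not state how the projection is performed), the step size is arbitrary, and the hypothetical `grad_fn` uses a toy objective f(R) = trace(R_peak^T R) as a stand-in for the gradient of the IPDF network output with respect to the query.

```python
import numpy as np

def project_to_so3(m):
    """Return the rotation matrix closest (in Frobenius norm) to the 3x3 matrix m."""
    u, _, vt = np.linalg.svd(m)
    # Flip the last singular direction if needed so that det(R) = +1.
    d = np.sign(np.linalg.det(u @ vt))
    return u @ np.diag([1.0, 1.0, d]) @ vt

def geodesic_angle(r1, r2):
    """Geodesic distance (radians) between two rotation matrices."""
    cos = (np.trace(r1.T @ r2) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def projected_gradient_ascent(grad_fn, r_init, step_size=0.2, num_steps=100):
    """Take a gradient step in the 9-D query space, then project back to SO(3)."""
    r = r_init
    for _ in range(num_steps):
        r = project_to_so3(r + step_size * grad_fn(r))
    return r

# Toy stand-in for the network: f(R) = trace(R_peak^T R), whose gradient with
# respect to the entries of R is R_peak. A real model would use autodiff here.
r_peak = rot_z(0.7)
grad_fn = lambda r: r_peak

r_start = rot_z(1.7)  # plays the role of the best guess from sampled queries
r_final = projected_gradient_ascent(grad_fn, r_start)
print(geodesic_angle(r_start, r_peak), geodesic_angle(r_final, r_peak))
```

In this toy setting the starting guess is one radian away from the peak, and the projected iterates move monotonically toward it.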

S6. Metrics for evaluation: extended discussion


S6.1. Prediction as a distribution: spread and average log likelihood
Here we compare the metrics used in the main text on a simplified example in one dimension, where the ground truth
consists of two values: {x_GT} = ±1. We evaluate the four distributions (P1, P2, P3, P4) shown in Figure S6, which model
the ground truth to varying degrees.

Figure S6. Distributions modelling a scenario with multiple ground truths. P1 and P2 are mixtures of two normal distributions, with
the components centered on the ground truths at x = ±1. P3 is a normal distribution centered on only one of the two ground truths. P4 is
a uniform distribution over the interval [−2, 2].

Table S3. Distribution-based evaluation metrics from the main text.


                                          Full GT at evaluation                 Partial GT at evaluation
Distribution                              Spread ↓   Average log likelihood ↑   Spread ↓   Average log likelihood ↑
P1 = ½ (N(−1, 0.1²) + N(1, 0.1²))         0.08       0.69                       1.04       0.69
P2 = ½ (N(−1, 0.25²) + N(1, 0.25²))       0.20       −0.23                      1.10       −0.23
P3 = N(−1, 0.1²)                          0.08       −98.62                     1.04       −98.62
P4 = U(−2, 2)                             0.50       −1.39                      1.25       −1.39

The results for the spread and average log likelihood, defined in the main text, are shown in Table S3. There are several
takeaways from this simplified example. The spread, being the average over the ground truths of the minimum error, captures
how well any of the ground truths are represented. By this metric, P1 and P3 are equivalent. When the full set of ground
truths is not known at evaluation, the spread ceases to be meaningful.
The average log likelihood measures how well all ground truths are represented and is invariant to whether the full set of
GTs is provided with each test example, or only a subset. The latter is the predominant scenario for pose estimation datasets,
where annotations are not provided for near or exact symmetries. This means only one ground truth is provided for each test
example, out of possibly several equivalent values. In Table S3, the average log likelihood ranks the distributions in the
order one would expect, with the ‘ignorant’ uniform distribution (P4 ) performing slightly worse than P1 and P2 , and with
P3 severely penalized for failing to cover both of the ground truths.
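
The entries of Table S3 can be checked numerically. The sketch below is an illustration under stated assumptions: the spread is taken as the expectation, under the predicted distribution, of the distance to the nearest provided ground truth, and the "partial GT" case is modeled as providing exactly one of the two ground truths per example, each equally often. The integration grid and all function names are illustrative choices.

```python
import numpy as np

GROUND_TRUTHS = np.array([-1.0, 1.0])
GRID = np.linspace(-4.0, 4.0, 80001)  # integration grid with spacing 1e-4

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

DISTRIBUTIONS = {
    "P1": lambda x: 0.5 * (normal_pdf(x, -1.0, 0.1) + normal_pdf(x, 1.0, 0.1)),
    "P2": lambda x: 0.5 * (normal_pdf(x, -1.0, 0.25) + normal_pdf(x, 1.0, 0.25)),
    "P3": lambda x: normal_pdf(x, -1.0, 0.1),
    "P4": lambda x: np.where(np.abs(x) <= 2.0, 0.25, 0.0),
}

def average_log_likelihood(pdf):
    """Mean log density at the ground truths. Providing one ground truth per
    example (each equally often) gives the same average as providing both."""
    return float(np.mean(np.log(pdf(GROUND_TRUTHS))))

def spread(pdf, gts):
    """Expected distance to the nearest provided ground truth (Riemann sum)."""
    dx = GRID[1] - GRID[0]
    min_err = np.min(np.abs(GRID[:, None] - np.asarray(gts)[None, :]), axis=1)
    return float(np.sum(pdf(GRID) * min_err) * dx)

def spread_partial_gt(pdf):
    """Spread when only a single ground truth is provided per example."""
    return float(np.mean([spread(pdf, [gt]) for gt in GROUND_TRUTHS]))

for name, pdf in DISTRIBUTIONS.items():
    print(name,
          round(spread(pdf, GROUND_TRUTHS), 2),
          round(spread_partial_gt(pdf), 2),
          round(average_log_likelihood(pdf), 2))
```

The spread changes when only one ground truth is supplied, while the average log likelihood does not, which is the point made in the discussion above.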

S6.2. Prediction as a finite set and unknown symmetries: top-k


For the case where only a single ground truth is available, despite potential symmetries, the log-likelihood metric is the only
one that remains meaningful without modification.
Precision and spread metrics are misleading because they penalize correct predictions that don’t have a corresponding
annotation. Our solution is to drop the precision metric and split the distribution into different modes to compute the spreads,
by finding connected components in the predicted probability distribution.
The recall metrics are problematic when viewed independently of precision, since they can be easily optimized for by
returning a large number of candidate poses covering the whole space. Our solution here is to limit the number of output
pose candidates to k, yielding metrics that we denote the top-k accuracy@15°, top-k accuracy@30°, and top-k error. For
example, the metrics reported by Liao et al. (2019); Mohlin et al. (2020) on ModelNet10-SO(3) are equivalent to our top-1.
One issue with the top-k evaluation is that we cannot disentangle if errors are due to the dataset (lack of symmetry
annotations), or due to the model. Since there is no way around it without expensive annotation, we find it useful to report
the top-k for different k, including k = 1, where no model errors are forgiven.
Now, for each entry in the dataset, R_GT is the single annotated ground truth, the top-k pose predictions are {R̂_i}_{1≤i≤k},
and we have k normalized probability distributions, one for each of the top-k modes, {p̂_i}_{1≤i≤k}. The following
equations describe the metrics,

\[
\text{top-}k\ \text{accuracy@}\alpha = \left( \left\{ \min_{1 \le j \le k} d(R_{GT}, \hat{R}_j) \right\} < \alpha \right), \tag{5}
\]

\[
\text{top-}k\ \text{error} = \min_{1 \le j \le k} d(R_{GT}, \hat{R}_j), \tag{6}
\]

\[
\text{top-}k\ \text{spread} = \min_{1 \le j \le k} \left\{ \int_{SO(3)} \hat{p}_j(R)\, d(R, R_{GT})\, dR \right\}. \tag{7}
\]

Typically, accuracy and spread are averaged over the whole dataset, while the median error over all entries is reported.
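
As an illustration of Eqs. (5) and (6), the following sketch evaluates the top-k error and top-k accuracy for a single example. It assumes that d is the geodesic distance between rotation matrices and that the candidate list is already ordered by rank; the function names and the example rotations are hypothetical. The top-k spread of Eq. (7) is omitted because it requires the per-mode distributions.

```python
import numpy as np

def geodesic_distance_deg(r1, r2):
    """Geodesic distance between two rotation matrices, in degrees."""
    cos = (np.trace(r1.T @ r2) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def top_k_error(r_gt, candidates, k):
    """Eq. (6): minimum distance to the ground truth over the first k candidates."""
    return min(geodesic_distance_deg(r_gt, r) for r in candidates[:k])

def top_k_accuracy(r_gt, candidates, k, alpha_deg):
    """Eq. (5): 1.0 if any of the first k candidates is within alpha, else 0.0."""
    return float(top_k_error(r_gt, candidates, k) < alpha_deg)

def rot_z(theta_deg):
    t = np.radians(theta_deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# An object with a two-fold symmetry: the top-ranked candidate is the
# un-annotated symmetric pose, and the second candidate is near the annotation.
r_gt = rot_z(0.0)
candidates = [rot_z(180.0), rot_z(10.0), rot_z(90.0)]

for k in (1, 2):
    print(k, top_k_error(r_gt, candidates, k), top_k_accuracy(r_gt, candidates, k, 15.0))
```

With k = 1 the symmetric prediction is counted as a 180° error, whereas with k = 2 the same model output is scored as correct at the 15° threshold.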

S7. ModelNet10-SO(3) detailed results


Table S4 extends the ModelNet10-SO(3) table in the main paper and shows per-category metrics.
Since our model predicts a full distribution of rotations, we find the modes of this distribution, by first thresholding by
density and then assigning to the same mode any two points that are closer than a second threshold. This method outputs a
variable number of modes for each input, as opposed to methods based on mixtures of unimodal distributions (Gilitschenski
et al., 2019; Deng et al., 2020), where the number of modes is a fixed hyperparameter.
We then rank the modes by their total probability mass, assign their most likely pose as the mode center, and return the top-k
centers for a given k. The evaluation takes the minimum error over the list of candidates, as described in Section S6.2. This
kind of top-k evaluation is common practice for image classification tasks like ImageNet (Russakovsky et al., 2015).
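
The mode-finding procedure just described can be sketched as below. This is only an illustration under assumptions: the distribution is given as probability masses on a finite set of rotations, "closer than a second threshold" is interpreted as connected components under geodesic distance, and both threshold values as well as the function names are hypothetical.

```python
import numpy as np

def pairwise_geodesic(rotations):
    """Pairwise geodesic distances (radians) for an (N, 3, 3) stack of rotations."""
    traces = np.einsum("aij,bij->ab", rotations, rotations)  # trace(R_a^T R_b)
    return np.arccos(np.clip((traces - 1.0) / 2.0, -1.0, 1.0))

def find_modes(rotations, probs, prob_threshold, dist_threshold):
    """Return (mass, center) pairs, sorted by decreasing probability mass."""
    keep = np.flatnonzero(probs > prob_threshold)            # 1) threshold
    adjacency = pairwise_geodesic(rotations[keep]) < dist_threshold
    labels = -np.ones(len(keep), dtype=int)
    num_modes = 0
    for seed in range(len(keep)):                            # 2) connected components
        if labels[seed] >= 0:
            continue
        labels[seed] = num_modes
        stack = [seed]
        while stack:
            node = stack.pop()
            for nb in np.flatnonzero(adjacency[node] & (labels < 0)):
                labels[nb] = num_modes
                stack.append(nb)
        num_modes += 1
    modes = []
    for m in range(num_modes):                               # 3) mass and center
        members = keep[labels == m]
        mass = float(probs[members].sum())
        center = rotations[members[np.argmax(probs[members])]]
        modes.append((mass, center))
    modes.sort(key=lambda mc: mc[0], reverse=True)           # 4) rank by mass
    return modes

def rot_z(theta_deg):
    t = np.radians(theta_deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

angles = [0.0, 2.0, 4.0, 178.0, 180.0, 182.0, 90.0]
rotations = np.stack([rot_z(a) for a in angles])
probs = np.array([0.14, 0.30, 0.10, 0.10, 0.25, 0.10, 0.01])

modes = find_modes(rotations, probs, prob_threshold=0.03, dist_threshold=np.radians(5.0))
print([round(mass, 2) for mass, _ in modes])
```

The number of returned modes depends on the input, in contrast to mixture models with a fixed number of components; the top-k centers are then simply the first k entries of the returned list.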
As expected, all metrics improve by increasing k, but the symmetric categories, where the single ground-truth evaluation is
inappropriate, improve dramatically, suggesting that the lower top-1 performance can indeed be attributed to the lack of
symmetry annotations for evaluation and is not a limitation of our model.

S8. Implementation specifics


We train with the Adam optimizer (β1 = 0.9, β2 = 0.999) with a linear warm up to the base learning rate of 10⁻⁴ over
1000 steps, and then cosine decay to zero over the remainder of training.
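
A learning rate schedule of this form can be written as a small function. This is a sketch under assumptions: the warm up is taken to start from zero, and the function name and signature are illustrative rather than taken from the authors' code.

```python
import math

def learning_rate(step, total_steps, base_lr=1e-4, warmup_steps=1000):
    """Linear warm up to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Example: the 100,000-step SYMSOL I setting.
for step in (0, 500, 1000, 50500, 100000):
    print(step, learning_rate(step, total_steps=100000))
```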

Efficient implementation The input to the MLP is a concatenation of the image descriptor produced by a CNN and a
query pose. During both training and inference, we evaluate densities for a large number of poses per image. A naive
implementation would replicate and tile image descriptors {d_i}_{0≤i<N_B} and pose queries {q_j}_{0≤j<N_Q}, where N_B is the
mini-batch size and N_Q is the number of pose queries, and evaluate the first fully connected operation with weights W
(before applying bias and nonlinearity) in a batched fashion, as follows,

\[
W \begin{bmatrix}
d_1 & d_1 & d_1 & \cdots & d_2 & d_2 & d_2 & \cdots \\
q_1 & q_2 & q_3 & \cdots & q_1 & q_2 & q_3 & \cdots
\end{bmatrix}. \tag{8}
\]

Method                    avg.   bathtub  bed    chair  desk   dresser  tv     n. stand  sofa   table  toilet

Acc@15°
Deng et al. (2020)        0.562  0.140    0.788  0.800  0.345  0.563    0.708  0.279     0.733  0.440  0.832
Prokudin et al. (2018)    0.456  0.114    0.822  0.662  0.023  0.406    0.704  0.187     0.590  0.108  0.946
Mohlin et al. (2020)      0.693  0.322    0.882  0.881  0.536  0.682    0.790  0.516     0.919  0.446  0.957
IPDF (ours)               0.719  0.392    0.877  0.874  0.615  0.687    0.799  0.567     0.914  0.523  0.945
IPDF (ours), top-2        0.868  0.735    0.946  0.900  0.803  0.810    0.883  0.756     0.959  0.932  0.960
IPDF (ours), top-4        0.904  0.806    0.966  0.905  0.862  0.870    0.899  0.842     0.966  0.956  0.963

Acc@30°
Deng et al. (2020)        0.694  0.325    0.880  0.908  0.556  0.649    0.807  0.466     0.902  0.485  0.958
Prokudin et al. (2018)    0.528  0.175    0.847  0.777  0.061  0.500    0.788  0.306     0.673  0.183  0.972
Mohlin et al. (2020)      0.757  0.403    0.908  0.935  0.674  0.739    0.863  0.614     0.944  0.511  0.981
IPDF (ours)               0.735  0.410    0.883  0.917  0.629  0.688    0.832  0.570     0.921  0.531  0.967
IPDF (ours), top-2        0.888  0.770    0.953  0.946  0.825  0.812    0.918  0.762     0.968  0.945  0.982
IPDF (ours), top-4        0.926  0.846    0.973  0.953  0.889  0.874    0.939  0.851     0.975  0.972  0.988

Median Error (°)
Deng et al. (2020)        32.6   147.8    9.2    8.3    25.0   11.9     9.8    36.9      10.0   58.6   8.5
Prokudin et al. (2018)    49.3   122.8    3.6    9.6    117.2  29.9     6.7    73.0      10.4   115.5  4.1
Mohlin et al. (2020)      17.1   89.1     4.4    5.2    13.0   6.3      5.8    13.5      4.0    25.8   4.0
IPDF (ours)               21.5   161.0    4.4    5.5    7.1    5.5      5.7    7.5       4.1    9.0    4.8
IPDF (ours), top-2        4.9    6.8      4.1    5.5    5.3    4.9      5.3    5.1       3.9    3.7    4.8
IPDF (ours), top-4        4.8    6.0      4.1    5.4    5.1    4.7      5.2    4.8       3.9    3.7    4.8

Table S4. ModelNet10-SO(3) per-category results.

When computed this way, this single step is the computational bottleneck. An alternative, much more efficient method is to
observe that

\[
W \begin{bmatrix} d_i \\ q_j \end{bmatrix}
= W \begin{bmatrix} d_i \\ 0 \end{bmatrix} + W \begin{bmatrix} 0 \\ q_j \end{bmatrix}
= W_d d_i + W_q q_j, \tag{9}
\]

where W = [W_d W_q]. In this manner, W_d can be applied batchwise to image descriptors, yielding an N_O × N_B output,
and W_q can be applied to all query poses independently, yielding an N_O × N_Q output, where N_O is the number of output
channels (number of rows in W). An N_O × N_Q × N_B tensor equivalent to Eq. (8) is then obtained via a broadcasting sum,
drastically reducing the number of operations.
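
The equivalence between the tiled computation of Eq. (8) and the broadcast sum of Eq. (9) can be checked with a few lines of NumPy. The dimensions and random values below are arbitrary illustrative choices, and the variable names are not taken from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_b, n_q = 4, 6          # mini-batch size N_B and number of pose queries N_Q
n_d, n_p, n_o = 8, 5, 3  # descriptor dim, query dim, output channels N_O

descriptors = rng.normal(size=(n_d, n_b))  # columns are the d_i
queries = rng.normal(size=(n_p, n_q))      # columns are the q_j
w_d = rng.normal(size=(n_o, n_d))
w_q = rng.normal(size=(n_o, n_p))
w = np.concatenate([w_d, w_q], axis=1)     # W = [W_d  W_q]

# Naive version, Eq. (8): tile descriptors and queries, then one large matmul.
tiled_d = np.repeat(descriptors, n_q, axis=1)  # d1 d1 ... d2 d2 ...
tiled_q = np.tile(queries, (1, n_b))           # q1 q2 ... q1 q2 ...
naive = w @ np.concatenate([tiled_d, tiled_q], axis=0)  # (N_O, N_B * N_Q)

# Efficient version, Eq. (9): two small matmuls and a broadcasting sum.
efficient = (w_q @ queries)[:, :, None] + (w_d @ descriptors)[:, None, :]  # (N_O, N_Q, N_B)

# Reorder the naive result to (N_O, N_Q, N_B) for comparison.
naive_as_tensor = naive.reshape(n_o, n_b, n_q).transpose(0, 2, 1)
matches = np.allclose(naive_as_tensor, efficient)
print(efficient.shape, matches)
```

The naive version multiplies W against N_B · N_Q columns, while the efficient version multiplies against only N_B + N_Q columns before the elementwise sum.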

SYMSOL For the SYMSOL experiments, three positional encoding terms were used for the query, and four fully
connected layers of 256 units with ReLU activation for the MLP. One network was trained for all five shapes of SYMSOL I
with a batch size of 128 images for 100,000 steps (28 epochs). A different network was trained for each of the three textured
shapes of SYMSOL II; these were trained with a batch size of 64 images for 50,000 steps (36 epochs). The loss calculation
requires evaluating a coverage of points on SO(3) along with the ground truth in order to find the approximate normalization
rescaling of the likelihoods. We found that this coverage did not need to be particularly dense, and used 4096 points for
training.
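
The normalization step can be illustrated with a short sketch. It assumes a softmax-style normalization over the ground truth plus the coverage queries, with each query assigned an equal share of the volume of SO(3), taken here as π² (consistent with the uniform log likelihood of −2.29 quoted earlier). The function name and the exact form are assumptions for illustration; the main text defines the actual loss.

```python
import numpy as np

SO3_VOLUME = np.pi ** 2  # assumed volume of SO(3); -log(pi^2) is about -2.29

def approx_log_likelihood(logit_gt, logits_coverage):
    """Approximate log p(R_GT | x) from unnormalized network outputs.

    logit_gt: unnormalized output at the ground-truth rotation.
    logits_coverage: unnormalized outputs at the coverage queries on SO(3).
    """
    logits = np.concatenate([[logit_gt], np.asarray(logits_coverage, dtype=float)])
    n = logits.shape[0]
    shifted = logits - logits.max()  # for numerical stability
    log_softmax_gt = shifted[0] - np.log(np.sum(np.exp(shifted)))
    # Each query stands for a cell of volume SO3_VOLUME / n.
    return float(log_softmax_gt + np.log(n / SO3_VOLUME))

# With identical outputs everywhere, the result is the uniform log likelihood.
uniform_ll = approx_log_likelihood(0.0, np.zeros(4096))
# A larger output at the ground truth increases the log likelihood.
peaked_ll = approx_log_likelihood(5.0, np.zeros(4096))
print(round(uniform_ll, 2), peaked_ll > uniform_ll)
```

The training loss would then be the negative of this quantity, averaged over the mini-batch.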

T-LESS For T-LESS, only one positional encoding term was used, and the MLP consisted of a single layer of 256 units
with ReLU activation. The images were color-normalized and tight-cropped as in Gilitschenski et al. (2019). Training used
a batch size of 64 images for 50,000 steps (119 epochs).

ModelNet10-SO(3) For ModelNet10-SO(3) (Liao et al., 2019), we use four fully connected layers of 256 units with
ReLU activation as in SYMSOL. We train a single model for the whole dataset, for 100,000 steps with batch size of 64.
Following Liao et al. (2019) and Mohlin et al. (2020), we concatenate a one-hot encoding of the class label to the image
descriptor before feeding it to the MLP.

Pascal3D+ We used a learning rate of 10⁻⁵ for 150,000 steps, with the same schedule as in the other experiments (linear
ramp for the first 1000 steps, then cosine decay). The vision model was an ImageNet pre-trained ResNet101, and the MLP
consisted of two fully connected layers of 256 units with ReLU activation (trained on all classes at once, without class label
information). We supplemented the Pascal3D+ training images with synthetic images from Render for CNN (Su et al.,
2015), such that every mini-batch of 64 images consisted of 25% real images and 75% synthetic.

Layer                       Activation   Output
Vision Description Input    -            2048
Rotation Input              -            [3, 3]
Flatten                     -            9
Positional Encoding         -            [2m × 9]
Concatenate                 -            [2048 + 2m × 9]
Dense (× n)                 ReLU         256
Dense                       None         1

Table S5. IPDF architecture. m is the number of positional encoding frequencies and n is the number of fully connected layers in the
MLP. The factor of 2 comes from using both sines and cosines in the positional encoding. The vision description is the result of applying
global average pooling to the output of an ImageNet pre-trained ResNet to obtain a 2048-dimensional vector. We use an ImageNet
pre-trained ResNet50 for SYMSOL, T-LESS, and ModelNet10-SO(3), and ResNet101 for Pascal3D+.
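
The layer shapes of Table S5 can be traced with a small NumPy sketch of the MLP head. This is an illustration only: the choice of frequencies 2⁰, …, 2^(m−1) for the positional encoding is an assumption, the weights are random, and the function names are hypothetical. The vision descriptor is a random stand-in for the pooled ResNet output.

```python
import numpy as np

def positional_encoding(x, num_frequencies):
    """Map (..., 9) inputs to (..., 2 * m * 9) sines and cosines.

    The frequencies 2^0 .. 2^(m-1) are an assumed, NeRF-style choice.
    """
    freqs = 2.0 ** np.arange(num_frequencies)
    angles = x[..., None, :] * freqs[:, None]                         # (..., m, 9)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-2)   # (..., 2m, 9)
    return enc.reshape(*x.shape[:-1], -1)

def init_mlp(rng, in_dim, hidden=256, num_layers=2):
    """n hidden Dense layers of 256 units, followed by a Dense layer with one output."""
    dims = [in_dim] + [hidden] * num_layers + [1]
    return [(rng.normal(scale=1.0 / np.sqrt(a), size=(a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def ipdf_head(params, descriptor, rotations, num_frequencies):
    """Unnormalized log-probabilities for N_Q query rotations and one image."""
    flat = rotations.reshape(rotations.shape[0], 9)                   # Flatten
    enc = positional_encoding(flat, num_frequencies)                  # (N_Q, 2m * 9)
    tiled = np.broadcast_to(descriptor, (enc.shape[0], descriptor.shape[0]))
    h = np.concatenate([tiled, enc], axis=1)                          # (N_Q, 2048 + 2m * 9)
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)                                # Dense + ReLU
    w, b = params[-1]
    return (h @ w + b)[:, 0]                                          # Dense, no activation

rng = np.random.default_rng(0)
m, n = 3, 4                                   # e.g. the SYMSOL setting
descriptor = rng.normal(size=2048)            # stand-in for the pooled ResNet feature
rotations = np.stack([np.eye(3)] * 5)         # five identical query rotations
params = init_mlp(rng, in_dim=2048 + 2 * m * 9, num_layers=n)
outputs = ipdf_head(params, descriptor, rotations, num_frequencies=m)
print(outputs.shape)
```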

S8.1. Baseline methods


[Deng et al. (2020)] We trained the multi-modal Bingham distribution model from Deng et al. (2020) using their PyTorch
code.¹ Note that this is a follow-up to an earlier paper which references the same implementation (Deng et al., 2020). Our
only modification was a minor one to remove the translation component from the model as only the rotation representation
needs to be learned. We found the model performed best with the same general settings as used in the reference paper (rWTA
loss with two stage training – first stage trains rotations only, the second stage trains both rotations and mixture coefficients).
For the ModelNet10-SO(3) and SYMSOL datasets we trained a single model per shape category, and we found no benefit
with increasing the number of components (we used 10 for ModelNet10 and 16 for SYMSOL).

[Gilitschenski et al. (2019)] We trained the multi-modal Bingham distribution model from Gilitschenski et al. (2019)
using their PyTorch code.² For this baseline we again trained a single model per shape for ModelNet10-SO(3) and SYMSOL.
We followed the published approach and trained the model in two stages: the first stage keeps the dispersion fixed and the second
stage updates all distribution parameters. For a batch size of 32, a single training step for a 4-component distribution takes
almost 2 seconds on an NVIDIA Tesla P100 GPU. The time is dominated by the lookup table interpolation to calculate the
distribution’s normalizing term (and gradient), and is linear in the number of mixture components (training with 12 mixture
components took over 7 seconds per step). This limited our ability to tune hyperparameters effectively or train with a large
number of mixture components.

[Prokudin et al. (2018)] We trained the infinite mixture model from Prokudin et al. (2018) using their TensorFlow code.³
The only modification was during evaluation: the log likelihood required our method of normalization via equivolumetric
grid because representing a distribution over SO(3) as the product of three individually normalized von Mises distributions
lacks the necessary Jacobian. We left the improperly normalized log likelihood in their loss, as it was originally formulated.
A different model was trained per shape category of SYMSOL and ModelNet10-SO(3).
Note that our implicit pose distribution is trained as a single model for the whole of SYMSOL I and ModelNet10-SO(3)
datasets, so the comparisons against Deng et al. (2020), Gilitschenski et al. (2019), and Prokudin et al. (2018) favor the
baselines. Our method outperforms them nevertheless.
¹ https://github.com/Multimodal3DVision/torch_bingham
² https://github.com/igilitschenski/deep_bingham
³ https://github.com/sergeyprokudin/deep_direct_stat

S8.2. A note on Pascal3D+ evaluations with respect to Liao et al. and Mohlin et al.
In the Pascal3D+ table in the main paper, as noted in its caption, we report numbers for Liao et al. (2019) and
Mohlin et al. (2020) which differ from the numbers reported in their papers (these are the rows marked with ‡).

Liao et al. (2019) An error in the evaluation code, reported on GitHub⁴, incorrectly measured the angular error: reported
numbers were incorrectly lower by a factor of √2. The authors corrected the evaluation code for ModelNet10-SO(3) and
posted updated numbers, which we show in our paper. However, their evaluation code used for Pascal3D+ still contains the
incorrect √2 factor: comparing the corrected ModelNet10-SO(3) geodesic distance function⁵ and the Pascal3D+ geodesic
distance function⁶, the √2 difference is clear. We sanity checked this by running their Pascal3D+ code with the incorrect
metric and were able to closely match the numbers in the paper. In the main paper, we report performance obtained using
the corrected evaluation code.

Mohlin et al. (2020) We found that the code released by Mohlin et al. (2020) uses different dataset splits for training and
testing on Pascal3D+ than many of the other baselines we compared against. Annotated images in the Pascal3D+ dataset are
selected from one of four source image sets: ImageNet train, ImageNet val, PASCALVOC train, and PASCALVOC val.
Methods like Mahendran et al. and Liao et al. place all the ImageNet images (ImageNet train, ImageNet val) in the training
partition (i.e. used for training and/or validation): “We use the ImageNet-trainval and Pascal-train images as our training
data and the Pascal-val images as our testing data.” (Mahendran et al., 2018, Sec. 4). However, in the code released
by Mohlin et al. (2020), we observe that the test set is sourced from the ImageNet data⁷. We reran the Mohlin et al. code as-is
and were able to match their published numbers. After logging both evaluation loops, we confirmed that the test data differs
between Mohlin et al. and Liao et al. The numbers we report in the main paper for Mohlin et al. are after modifying the data
pipeline to match Liao et al., which is also what we follow for our IPDF experiments. We ran Mohlin et al. with and without
augmentation and warping in the data pipeline and chose the best results (which were with warping and augmentation).

⁴ https://github.com/leoshine/Spherical_Regression/issues/8
⁵ https://github.com/leoshine/Spherical_Regression/blob/a941c732927237a2c7065695335ed949e0163922/S3.3D_Rotation/lib/eval/GTbox/eval_quat_multilevel.py#L45
⁶ https://github.com/leoshine/Spherical_Regression/blob/a941c732927237a2c7065695335ed949e0163922/S1.Viewpoint/lib/eval/eval_aet_multilevel.py#L135
⁷ https://github.com/Davmo049/Public_prob_orientation_estimation_with_matrix_fisher_distributions/blob/4baba6d06ca36db4d4cf8c905c5c3b70ab5fb54a/Pascal3D/Pascal3D.py#L558-L583
