Int J Comput Vis
DOI 10.1007/s11263-013-0627-y
Rotational Projection Statistics for 3D Local Surface Description
and Object Recognition
Yulan Guo · Ferdous Sohel ·
Mohammed Bennamoun · Min Lu · Jianwei Wan
Received: 12 August 2012 / Accepted: 6 April 2013
© Springer Science+Business Media New York 2013
Abstract Recognizing 3D objects in the presence of noise,
varying mesh resolution, occlusion and clutter is a very challenging task. This paper presents a novel method named
Rotational Projection Statistics (RoPS). It has three major
modules: local reference frame (LRF) definition, RoPS feature description and 3D object recognition. We propose a
novel technique to define the LRF by calculating the scatter
matrix of all points lying on the local surface. RoPS feature
descriptors are obtained by rotationally projecting the neighboring points of a feature point onto 2D planes and calculating a set of statistics (including low-order central moments
and entropy) of the distribution of these projected points.
Using the proposed LRF and RoPS descriptor, we present
a hierarchical 3D object recognition algorithm. The performance of the proposed LRF, RoPS descriptor and object
recognition algorithm was rigorously tested on a number of
popular and publicly available datasets. Our proposed techniques exhibited superior performance compared to existing techniques. We also showed that our method is robust
with respect to noise and varying mesh resolution. Our RoPS
based algorithm achieved recognition rates of 100, 98.9, 95.4
and 96.0 % respectively when tested on the Bologna, UWA,
Queen’s and Ca’ Foscari Venezia Datasets.
Keywords Surface descriptor · Local feature · Local
reference frame · 3D representation · Feature matching ·
3D object recognition
Y. Guo (B) · M. Lu · J. Wan
College of Electronic Science and Engineering, National
University of Defense Technology, Changsha, Hunan,
People’s Republic of China
e-mail: yulan.guo@nudt.edu.cn
Y. Guo · F. Sohel · M. Bennamoun
School of Computer Science and Software Engineering,
The University of Western Australia, Perth, Australia
1 Introduction
Object recognition is an active research area in computer
vision with numerous applications including navigation, surveillance, automation, biometrics, surgery and education
(Guo et al. 2013c; Johnson and Hebert 1999; Lei et al. 2013;
Tombari et al. 2010). The aim of object recognition is to
correctly identify the objects that are present in a scene and
recover their poses (i.e., position and orientation) (Mian et al.
2006b). Beyond object recognition from 2D images (Brown
and Lowe 2003; Lowe 2004; Mikolajczyk and Schmid 2004),
3D object recognition has been extensively investigated during the last two decades due to the availability of low cost
scanners and high speed computing devices (Mamic and Bennamoun 2002). However, recognizing objects from range
images in the presence of noise, varying mesh resolution,
occlusion and clutter is still a challenging task.
Existing algorithms for 3D object recognition can broadly
be classified into two categories, i.e., global feature based
and local feature based algorithms (Bayramoglu and Alatan 2010; Castellani et al. 2008). The global feature based
algorithms construct a set of features which encode the
geometric properties of the entire 3D object. Examples
of these algorithms include the geometric 3D moments
(Paquet et al. 2000), shape distribution (Osada et al. 2002)
and spherical harmonics (Funkhouser et al. 2003). However,
these algorithms require complete 3D models and are therefore sensitive to occlusion and clutter (Bayramoglu and Alatan 2010). In contrast, the local feature based algorithms
define a set of features which encode the characteristics
of the local neighborhood of feature points. The local feature based algorithms are robust to occlusion and clutter.
They are therefore suitable for recognizing even partially visible objects in a cluttered scene (Petrelli and Di Stefano
2011).
A number of local feature based 3D object recognition
algorithms have been proposed in the literature, including
point signature based (Chua and Jarvis 1997), spin image
based (Johnson and Hebert 1999), tensor based (Mian et
al. 2006b) and exponential map (EM) based (Bariya et al.
2012) algorithms. Most of these algorithms follow a paradigm that has three phases, i.e., feature matching, hypothesis generation and verification, and pose refinement (Taati
and Greenspan 2011). Among these phases, feature matching plays a critical role since it directly affects the effectiveness and efficiency of the two subsequent phases (Taati and
Greenspan 2011).
Descriptiveness and robustness of a feature descriptor are
crucial for accurate feature matching (Bariya and Nishino
2010). The feature descriptors should be highly descriptive
to ensure an accurate and efficient object recognition. That is
because the accuracy of feature matching directly influences
the quality of the estimated transformation which is used to
align the model to the scene, as well as the computational time
required for verification and refinement (Taati and Greenspan
2011). Moreover, the feature descriptors should be robust to
a set of nuisances, including noise, varying mesh resolution,
clutter, occlusion, holes and topology changes (Bronstein et
al. 2010a; Boyer et al. 2011).
A number of local feature descriptors exist in the literature
(Sect. 2.1). These descriptors can be divided into two broad
categories based on whether they use a local reference frame
(LRF) or not. Feature descriptors without any LRF use a
histogram or the statistics of the local geometric information (e.g., normal, curvature) to form a feature descriptor
(Sect. 2.1.1). Examples of this category include surface signature (Yamany and Farag 2002), local surface patch (LSP)
(Chen and Bhanu 2007) and THRIFT (Flint et al. 2007).
In contrast, feature descriptors with LRF encode the spatial distribution and/or geometric information of the neighboring points with respect to the defined LRF (Sect. 2.1.2).
Examples include spin image (Johnson and Hebert 1999),
intrinsic shape signatures (ISS) (Zhong 2009) and MeshHOG
(Zaharescu et al. 2012). However, most of the existing feature descriptors still suffer from either low descriptiveness or
weak robustness (Bariya et al. 2012).
In this paper we present a highly descriptive and robust
feature descriptor together with an efficient 3D object recognition algorithm. This paper first proposes a unique, repeatable and robust LRF for both local feature description and
object recognition (Sect. 3). The LRF is constructed by performing an eigenvalue decomposition on the scatter matrix
of all the points lying on the local surface together with
a sign disambiguation technique. A novel feature descriptor, namely Rotational Projection Statistics (RoPS), is then
presented (Sect. 4). RoPS exhibits both high discriminative
power and strong robustness to noise, varying mesh resolution and a set of deformations. The RoPS feature descriptor is
generated by rotationally projecting the neighboring points
onto three local coordinate planes and calculating several
statistics (e.g., central moments and entropy) of the distribution matrices of the projected points. Finally, this paper
presents a novel hierarchical 3D object recognition algorithm
based on the proposed LRF and RoPS feature descriptor
(Sect. 6). Comparative experiments on four popular datasets
were performed to demonstrate the superiority of the proposed method (Sect. 7).
The rest of this paper is organized as follows. Section
2 provides a brief literature review of local surface feature
descriptors and 3D object recognition algorithms. Section 3
introduces a novel technique for LRF definition. Section 4
describes our proposed RoPS method for local surface feature description. Section 5 presents the evaluation results of
the RoPS descriptor on two datasets. Section 6 introduces
a RoPS based hierarchical algorithm for 3D object recognition. Section 7 presents the results and analysis of our 3D
object recognition experiments on four datasets. Section 8
concludes this paper.
2 Related Work
This section presents a brief overview of the existing main
methods for local surface feature description and local feature
based 3D object recognition.
2.1 Local Surface Feature Description
2.1.1 Features Without LRF
Stein and Medioni (1992) proposed a splash feature by
recording the relationship between the normals of the geodesic neighboring points and the feature point. This relationship is then encoded into a 3D vector and finally transformed into curvatures and torsion angles. Hetzel et al. (2001)
constructed a set of features by generating histograms using
depth values, surface normals, shape indices and their combinations. Results show that the surface normal and shape index
exhibit high discrimination capabilities. Yamany and Farag
(2002) introduced a surface signature by encoding the surface
curvature information into a 2D histogram. This method can
be used to estimate scaling transformations as well as recognizing objects in 3D scenes. Chen and Bhanu (2007) proposed a LSP feature that encodes the shape indices and normal deviations of the neighboring points. Flint et al. (2008)
introduced a THRIFT feature by calculating a weighted histogram of the deviation angles between the normals of the
neighboring points and the feature point. Taati et al. (2007)
considered the selection of a good local surface feature for
3D object recognition as an optimization problem and proposed a set of variable-dimensional local shape descriptors
(VD-LSD). However, the process of selecting an optimized
subset of VD-LSDs for a specific object is very time consuming (Taati and Greenspan 2011). Kokkinos et al. (2012) proposed a generalization of 2D shape context feature (Belongie
et al. 2002) to curved surfaces, namely intrinsic shape context
(ISC). The ISC is a meta-descriptor which can be applied to
any photometric or geometric field defined on a surface.
Without an LRF, most of these methods generate a feature
descriptor by accumulating certain geometric attributes (e.g.,
normal, curvature) into a histogram. Since most of the 3D
spatial information is discarded during the process of histogramming, the descriptiveness of the features without LRF
is limited (Tombari et al. 2010).
2.1.2 Features with LRF
Chua and Jarvis (1997) proposed a point signature by using
the distances from the neighboring points to their corresponding projections on a fitted plane. One merit of the point signature is that no surface derivative is required. One of its
limitations relates to the fact that the reference direction may
not be unique. It is also sensitive to mesh resolution (Mian et
al. 2010). Johnson and Hebert (1998) used the surface normal
as a reference axis and proposed a spin image representation
by spinning a 2D image about the normal of a feature point
and summing up the number of points falling into the bins of
that image. The spin image is one of the most cited methods.
But its descriptiveness is relatively low and it is also sensitive
to mesh resolution (Zhong 2009). Frome et al. (2004) also
used the normal vector as a reference axis and generated a
3D shape context (3DSC) by counting the weighted number of points falling in the neighboring 3D spherical space.
However, a reference axis is not a complete reference frame
and there is an uncertainty in the rotation around the normal
(Petrelli and Di Stefano 2011).
Sun and Abidi (2001) introduced an LRF by using the normal of a feature point and an arbitrarily chosen neighboring
point. Based on the LRF, they proposed a descriptor named
point’s fingerprint by projecting the geodesic circles onto
the tangent plane. It was reported that their approach outperforms the 2D histogram based methods. One major limitation
of this method is that their LRF is not unique (Tombari et al.
2010). Mian et al. (2006b) proposed a tensor representation
by defining an LRF for a pair of oriented points and encoding the intersected surface area into a multidimensional table.
This representation is robust to noise, occlusion and clutter.
However, a pair of points are required to define an LRF, which
causes a combinatorial explosion (Zhong 2009). Novatnack
and Nishino (2008) used the surface normal and a projected
eigenvector on the tangent plane to define an LRF. They proposed an EM descriptor by encoding the surface normals of
the neighboring points into a 2D domain. The effectiveness
of exploiting geometric scale variability in the EM descriptor
has been demonstrated. Zhong (2009) introduced an LRF by
calculating the eigenvectors of the scatter matrix of the neighboring points of a feature point, and proposed an ISS feature
by recording the point distribution in the spherical angular
space. Since the sign of the LRF is not defined unambiguously, four feature descriptors can be generated from a single
feature point. Mian et al. (2010) proposed a keypoint detection method and used a similar LRF to Zhong (2009) for
their feature description. Tombari et al. (2010) analyzed the
strong impact of LRF on the performance of feature descriptors and introduced a unique and unambiguous LRF by performing an eigenvalue decomposition on the scatter matrix
of the neighboring points and using a sign disambiguation
technique. Based on the proposed LRF, they introduced a
feature descriptor called signature of histograms of orientations (SHOT). SHOT is very robust to noise, but sensitive
to mesh resolution variation. Petrelli and Di Stefano (2011)
proposed a novel LRF which aimed to estimate a repeatable
LRF at the border of a range image. Zaharescu et al. (2012)
proposed a MeshHOG feature by first projecting the gradient vectors onto three planes defined by an LRF and then
calculating a two-level histogram of these vectors.
However, none of the existing LRF definition techniques
is simultaneously unique, unambiguous, and robust to noise
and mesh resolution. Besides, most of the existing feature
descriptors suffer from a number of limitations, including a
low robustness and discriminating power (Bariya et al. 2012).
2.2 3D Object Recognition
Most of the existing algorithms for local feature based 3D
object recognition follow a three-phase paradigm including
feature matching, hypothesis generation and verification, and
pose refinement (Taati and Greenspan 2011).
Stein and Medioni (1992) used the splash features to represent the objects and generated hypotheses by using a set
of triplets of feature correspondences. These hypotheses are
then grouped into clusters using geometric constraints. They
are finally verified through a least square calculation. Chua
and Jarvis (1997) used point signatures of a scene to match
them against those of their models. The rigid transformation
between the scene and a candidate model was then calculated
using three pairs of corresponding points. Its ability to recognize objects in both single-object and multi-object scenes has
been demonstrated. However, verifying each triplet of feature correspondences is very time consuming. Johnson and
Hebert (1999) generated point correspondences by matching the spin images of the scene with the spin images of the
models. These point correspondences are first grouped using
geometric consistency. The groups are then used to calculate rigid transformations, which are finally verified. This algorithm is robust to clutter and occlusion, and capable of recognizing objects in complicated real scenes. Yamany and
Farag (2002) used surface signatures as feature descriptors
and adopted a similar strategy to Johnson and Hebert (1999)
for object recognition. Mian et al. (2006b) obtained feature
correspondences and model hypothesis by matching the tensor representations of the scene with those of the models.
The hypothesis model is then transformed to the scene and
finally verified using the iterative closest point (ICP) algorithm (Besl and McKay 1992). Experimental results revealed
that it is superior in terms of recognition rate and efficiency
compared to the spin image based algorithm. Mian et al.
(2010) also developed a 3D object recognition algorithm
based on keypoint matching. This algorithm can be used to
recognize objects at different and unknown scales. Taati and
Greenspan (2011) developed a 3D object recognition algorithm based on their proposed VD-LSD feature descriptors.
The optimal VD-LSD descriptor is selected based on the
geometry of the objects and the characteristics of the range
sensors. Bariya et al. (2012) introduced a 3D object recognition algorithm based on the EM feature descriptor and a
constrained interpretation tree.
There are some algorithms in the literature which do not
follow the aforementioned three-phase paradigm. For example, Frome et al. (2004) performed 3D object recognition
using the sum of the distances between the scene features (i.e.
3DSC) and their corresponding model features. This algorithm is efficient. However, it is not able to segment the recognized object from a scene, and its effectiveness on real data
has not been demonstrated. Shang and Greenspan (2010) proposed a potential well space embedding (PWSE) algorithm
for real-time 3D object recognition in sparse range images.
It cannot however handle clutter and therefore requires the
objects to be segmented a priori from the scene.
None of the existing object recognition algorithms has
explicitly explored the use of LRF to boost the performance of the recognition. Moreover, most of these algorithms
require three pairs of feature correspondences to establish a
transformation between a model and a scene. This not only
increases the run time due to the combinatorial explosion
of the matching pairs, but also decreases the precision of
the estimated transformation (since the chance to find three
correct feature correspondences is much lower compared to
finding only one correct correspondence).
2.3 Paper Contributions
This paper is an extended version of Guo et al. (2013a, b).
It has three major contributions, which are summarized as
follows.
(i) We introduce a unique, unambiguous and robust 3D LRF
using all the points lying on the local surface rather than
just the mesh vertices. Therefore, our proposed LRF is
more robust to noise and varying mesh resolution. We
also use a novel sign disambiguation technique; our proposed LRF is therefore unique and unambiguous. This
LRF offers a solid foundation for effective and robust
feature description and object recognition.
(ii) We introduce a highly descriptive and robust RoPS feature descriptor. RoPS is generated by rotationally projecting the neighboring points onto three coordinate
planes and encoding the rich information of the point
distribution into a set of statistics. The proposed RoPS
descriptor has been evaluated on two datasets. Experimental results show that RoPS achieved a high power of
descriptiveness. It is shown to be robust to a number of
deformations including noise, varying mesh resolution,
rotation, holes and topology changes. (see Sect. 5 for
details) .
(iii) We introduce an efficient hierarchical 3D object recognition algorithm based on the LRF and RoPS feature
descriptor. One major advantage of our algorithm is that a
single correct feature correspondence is sufficient for
object recognition. Moreover, by integrating our robust
LRF, the proposed object recognition algorithm can
work with any of the existing feature descriptors (e.g.,
spin image) in the literature. Rigorous evaluations of
the proposed 3D object recognition algorithm were conducted on four different popular datasets. Experimental
results show that our algorithm achieved high recognition rates, good efficiency and strong robustness to
different nuisances. It consistently resulted in the best
recognition results on the four datasets.
3 Local Reference Frame
A unique, repeatable and robust LRF is important for both
effective and efficient feature description and 3D object
recognition. The advantages of such an LRF are manifold. First,
the repeatability of an LRF directly affects the descriptiveness and robustness of the feature descriptor, i.e., an LRF with
a low repeatability will result in a poor performance of feature
matching (Petrelli and Di Stefano 2011). Second, compared
with the methods which associate multiple descriptors to a
single feature point (e.g., ISS Zhong 2009), a unique LRF
can help to improve both the precision and the efficiency of
feature matching (Tombari et al. 2010). Third, a robust 3D
LRF helps to boost the performance of 3D object recognition.
We propose a novel LRF by fully employing the point
localization information of the local surface. The three axes
for the LRF are determined by performing an eigenvalue
decomposition on the scatter matrix of all points lying on
the local surface. The sign of each axis is disambiguated by
aligning the direction to the majority of the point scatter.
3.1 Coordinate Axis Construction

Given a feature point p and a support radius r, the local surface mesh S, which contains N triangles and M vertices, is cropped from the range image using a sphere of radius r centered at p. For the ith triangle with vertices p_{i1}, p_{i2} and p_{i3}, a point lying within the triangle can be represented as:

p_i(s, t) = p_{i1} + s (p_{i2} - p_{i1}) + t (p_{i3} - p_{i1}),  (1)

where 0 ≤ s, t ≤ 1 and s + t ≤ 1, as illustrated in Fig. 1.

Fig. 1 An illustration of a triangle mesh and a point lying on the surface. An arbitrary point within a triangle can be represented by the triangle's vertices

The scatter matrix C_i of all the points lying within the ith triangle can be calculated as:

C_i = \frac{\int_0^1 \int_0^{1-s} (p_i(s,t) - p)(p_i(s,t) - p)^T \, dt \, ds}{\int_0^1 \int_0^{1-s} \, dt \, ds}.  (2)

Using Eq. 1, the scatter matrix C_i can be expressed as:

C_i = \frac{1}{12} \sum_{j=1}^{3} \sum_{k=1}^{3} (p_{ij} - p)(p_{ik} - p)^T + \frac{1}{12} \sum_{j=1}^{3} (p_{ij} - p)(p_{ij} - p)^T.  (3)

The overall scatter matrix C of the local surface S is calculated as the weighted sum of the scatter matrices of all the triangles, that is:

C = \sum_{i=1}^{N} w_{i1} w_{i2} C_i,  (4)

where N is the number of triangles in the local surface S. Here, w_{i1} is the ratio between the area of the ith triangle and the total area of the local surface S, that is:

w_{i1} = \frac{\| (p_{i2} - p_{i1}) \times (p_{i3} - p_{i1}) \|}{\sum_{i=1}^{N} \| (p_{i2} - p_{i1}) \times (p_{i3} - p_{i1}) \|},  (5)

where × denotes the cross product. The second weight w_{i2} is related to the distance from the feature point to the centroid of the ith triangle, that is:

w_{i2} = \left( r - \left\| p - \frac{p_{i1} + p_{i2} + p_{i3}}{3} \right\| \right)^2.  (6)

Note that the first weight w_{i1} is expected to improve the robustness of the LRF to varying mesh resolution, since a compensation with respect to the triangle area is incorporated through this weighting. The second weight w_{i2} is expected to improve the robustness of the LRF to occlusion and clutter, since distant points contribute less to the overall scatter matrix.

We then perform an eigenvalue decomposition of the overall scatter matrix C, that is:

C V = V E,  (7)

where E is a diagonal matrix of the eigenvalues {λ_1, λ_2, λ_3} of the matrix C, and V contains the three orthogonal eigenvectors {v_1, v_2, v_3}, ordered by decreasing magnitude of their associated eigenvalues. The three eigenvectors offer a basis for the LRF definition. However, the signs of these vectors are numerical accidents and are not repeatable between different trials, even on the same surface (Bro et al. 2008; Tombari et al. 2010). We therefore propose a novel sign disambiguation technique, which is described in the next subsection.

It is worth noting that, although some existing techniques also use an eigenvalue decomposition to construct the LRF (e.g., Mian et al. 2010; Tombari et al. 2010; Zhong 2009), they calculate the scatter matrix using just the mesh vertices. Instead, our technique employs all the points on the local surface and is therefore more robust compared to existing techniques (as demonstrated in Sect. 3.3).

Fig. 2 The six models of the tuning dataset

3.2 Sign Disambiguation

In order to eliminate the sign ambiguity of the LRF, each eigenvector should point in the major direction of the scatter vectors (which start from the feature point and point towards the points lying on the local surface). Therefore, the sign of each eigenvector is determined from the sign of the inner product of the eigenvector and the scatter vectors. Specifically, the unambiguous vector v_1 is defined as:

v_1 = v_1 \cdot \mathrm{sign}(h),  (8)

where sign(·) denotes the signum function that extracts the sign of a real number, and h is calculated as:

h = \sum_{i=1}^{N} w_{i1} w_{i2} \left( \int_0^1 \int_0^{1-s} (p_i(s,t) - p) \cdot v_1 \, dt \, ds \right) = \sum_{i=1}^{N} w_{i1} w_{i2} \left( \frac{1}{6} \sum_{j=1}^{3} (p_{ij} - p) \cdot v_1 \right).  (9)

Similarly, the unambiguous vector v_3 is defined as:

v_3 = v_3 \cdot \mathrm{sign}\left( \sum_{i=1}^{N} w_{i1} w_{i2} \left( \frac{1}{6} \sum_{j=1}^{3} (p_{ij} - p) \cdot v_3 \right) \right).  (10)
Given two unambiguous vectors v1 and v3 , v2 is defined
as v3 × v1 . Therefore, a unique and unambiguous 3D LRF
for feature point p is finally defined. Here, p is the origin,
and v1 , v2 and v3 are the x, y and z axes respectively. With
this LRF, a unique, pose invariant and highly discriminative
local feature descriptor can now be generated.
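As an illustrative sketch (not the authors' implementation), the LRF construction of Eqs. 3-10 can be written in a few lines of NumPy. The mesh is assumed to be already cropped to the support sphere and given as a vertex array plus a triangle index array; all function and variable names are ours:

```python
import numpy as np

def local_reference_frame(p, verts, tris, r):
    """Sketch of the proposed LRF (Eqs. 3-10).

    p     : (3,)   feature point (origin of the LRF)
    verts : (M, 3) vertices of the cropped local surface mesh
    tris  : (N, 3) vertex indices of the N local surface triangles
    r     : support radius
    Returns a 3x3 matrix whose rows are the x, y and z axes.
    """
    P = verts[tris]                       # (N, 3, 3): p_i1, p_i2, p_i3 per triangle
    d = P - p                             # triangle vertices relative to p
    s = d.sum(axis=1)                     # (N, 3): sum_j (p_ij - p)

    # Per-triangle scatter matrix, Eq. (3)
    Ci = (np.einsum('ni,nj->nij', s, s) +
          np.einsum('nki,nkj->nij', d, d)) / 12.0

    # Area weight w_i1 (Eq. 5) and quadratic distance weight w_i2 (Eq. 6)
    cross = np.cross(P[:, 1] - P[:, 0], P[:, 2] - P[:, 0])
    area = np.linalg.norm(cross, axis=1)
    w1 = area / area.sum()
    w2 = (r - np.linalg.norm(p - P.mean(axis=1), axis=1)) ** 2

    # Overall scatter matrix (Eq. 4) and its eigendecomposition (Eq. 7)
    C = np.einsum('n,nij->ij', w1 * w2, Ci)
    _, evecs = np.linalg.eigh(C)          # eigenvalues in ascending order
    v1, v3 = evecs[:, 2], evecs[:, 0]     # largest / smallest eigenvalue

    # Sign disambiguation, Eqs. (8)-(10)
    h1 = np.sum(w1 * w2 * (s @ v1) / 6.0)
    h3 = np.sum(w1 * w2 * (s @ v3) / 6.0)
    v1 = -v1 if h1 < 0 else v1
    v3 = -v3 if h3 < 0 else v3
    v2 = np.cross(v3, v1)                 # y axis: v2 = v3 x v1
    return np.stack([v1, v2, v3])
```

Since v1 and v3 are orthonormal eigenvectors, taking v2 = v3 × v1 automatically yields a right-handed orthonormal frame.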
3.3 Performance of the Proposed LRF
To evaluate the repeatability and robustness of our proposed
LRF, we calculated the LRF errors between the corresponding points in the scenes and models. The six models (i.e.,
“Armadillo”, “Asia Dragon”, “Bunny”, “Dragon”, “Happy
Buddha” and “Thai Statue”) used in this experiment were
taken from the Stanford 3D scanning repository (Curless and
Levoy 1996). They are shown in Fig. 2. The six scenes were
created by resampling the models down to 1/2 of their original mesh resolution and then adding Gaussian noise with a
standard deviation of 0.1 mesh resolution (mr) to the data.
We refer to this dataset as the “tuning dataset” in the rest of
this paper.
We randomly selected 1,000 points in each model, and we refer to these points as feature points. We then obtained the corresponding points in the scene by searching for the points with the smallest distances to the feature points in the model. For each point pair (p_{Si}, p_{Mi}), we calculated the LRFs for both points, denoted as L_{Si} and L_{Mi}, respectively. Using a similar criterion as in Mian et al. (2006a), the error between
two LRFs of the ith point pair can be calculated by:

\epsilon_i = \arccos\left( \frac{\mathrm{trace}(L_{Si} L_{Mi}^{-1}) - 1}{2} \right) \cdot \frac{180}{\pi},  (11)

where \epsilon_i represents the amount of rotation error between the two LRFs and is zero in the case of no error.
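As a sketch, Eq. 11 can be evaluated as follows for two 3 × 3 LRF matrices (illustrative code, not the authors'; the clipping simply guards arccos against floating-point round-off):

```python
import numpy as np

def lrf_error_deg(L_s, L_m):
    """Rotation angle in degrees between two LRFs, Eq. (11)."""
    t = (np.trace(L_s @ np.linalg.inv(L_m)) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(t, -1.0, 1.0)))
```

For identical frames the error is 0°; for frames differing by a quarter turn about one axis it is 90°.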
Our proposed LRF technique was tested on the tuning dataset in comparison with several existing techniques, namely those proposed by Novatnack and Nishino (2008), Mian et al. (2010), Tombari et al. (2010), and Petrelli and Di Stefano (2011). We tested each LRF technique five times by randomly selecting 1,000 different point pairs each time. The
overall LRF errors of each technique are shown in Fig. 3 as a
histogram. Ideally, all of the LRF errors should lie around the
zero value (in the first bin of the histogram). It is clear that our
proposed technique performed best, with 83.5 % of the point
pairs having LRF errors less than 10°, whereas the second best technique (that of Petrelli and Di Stefano (2011))
secured only 43.2 % of the point pairs with LRF errors less
than 10◦ . Other techniques only had around 40 % point pairs
with LRF errors less than 10◦ . These results clearly indicate
that our proposed LRF is more repeatable and more robust
than the state-of-the-art in the presence of noise and mesh
resolution variation.
In order to further assess the influence of the weighting strategy, we used a linear distance weight

w_{i3} = r - \left\| p - \frac{p_{i1} + p_{i2} + p_{i3}}{3} \right\|

(following the approach of Tombari et al. 2010) to replace the weights w_{i1} and w_{i2} in Eqs. 4, 9 and 10, resulting in a modified LRF. The histogram of the LRF errors of this modified technique is shown in Fig. 3. The performance of the modified LRF decreased significantly compared to the originally proposed LRF. This observation reveals that the weighting strategy using both the quadratic distance weight w_{i2} and the area weight w_{i1} produces more robust results than one using only a linear distance weight w_{i3}.
Figure 3 shows that part of the LRF errors of each technique are larger than 80◦ . This is mainly due to the presence
of local symmetrical surfaces (e.g., flat or spherical surfaces)
in the scenes. For a local symmetrical surface, there is an
inherent sign ambiguity of its LRF because the distribution
Fig. 3 Histogram of the LRF errors for the six scenes and models of the tuning dataset (percentage of point pairs in each error bin, in degrees, for the Novatnack, Mian, Tombari, Petrelli, proposed and modified techniques). Our proposed technique outperformed the existing techniques by a large margin. (Figure best seen in color.)
of points is almost the same in all directions. In order to deal
with this case, we adopt a feature point selection technique
which uses the ratio of eigenvalues to avoid local symmetrical
surfaces (see Sect. 6.2).
Once an LRF is determined, the next step is to define a
local surface descriptor. In the next section, we propose a
novel RoPS descriptor.
4 Local Surface Description

A local surface descriptor needs to be invariant to rotation and robust to noise, varying mesh resolution, occlusion, clutter and other nuisances. In this section, we propose a novel local surface feature descriptor, namely RoPS, by performing local surface rotation, neighboring point projection and statistics calculation.

4.1 RoPS Feature Descriptor

An illustrative example of the overall RoPS method is given in Fig. 4. From a range image/model, a local surface is selected for a feature point p given a support radius r. Figure 4a, b respectively show a model and a local surface. We have already defined the LRF for p, and the vertices of the triangles in the local surface S constitute a pointcloud Q = {q_1, q_2, ..., q_M}. The pointcloud Q is then transformed with respect to the LRF in order to achieve rotation invariance, resulting in a transformed pointcloud Q' = {q'_1, q'_2, ..., q'_M}. We then follow a number of steps, which are described as follows.

First, the pointcloud is rotated around the x axis by an angle θ_k, resulting in a rotated pointcloud Q'(θ_k), as shown in Fig. 4c. This pointcloud Q'(θ_k) is then projected onto the three coordinate planes (i.e., the xy, xz and yz planes) to obtain three projected pointclouds Q'_i(θ_k), i = 1, 2, 3. Note that the projection offers a means to describe the 3D local surface in a concise and efficient manner. That is because the 2D projections clearly preserve a certain amount of unique 3D geometric information of the local surface from each particular viewpoint.

Next, for each projected pointcloud Q'_i(θ_k), a 2D bounding rectangle is obtained, which is subsequently divided into L × L bins, as shown in Fig. 4d. The number of points falling into each bin is then counted to yield an L × L matrix D, as shown in Fig. 4e. We refer to the matrix D as a "distribution matrix" since it represents the 2D distribution of the neighboring points. The distribution matrix D is further normalized such that the sum of all bins is equal to one, in order to achieve invariance to variations in mesh resolution.

The information in the distribution matrix D is further condensed in order to achieve computational and storage efficiency. In this paper, a set of statistics is extracted from the distribution matrix D, including central moments (Demi et al. 2000; Hu 1962) and the Shannon entropy (Shannon 1948). The central moments are utilized for their mathematical simplicity and rich descriptiveness (Hu 1962), while the Shannon entropy is selected for its strong power to measure the information contained in a probability distribution (Shannon 1948).

The central moment \mu_{mn} of order m + n of the matrix D is defined as:

\mu_{mn} = \sum_{i=1}^{L} \sum_{j=1}^{L} (i - \bar{i})^m (j - \bar{j})^n D(i, j),  (12)

where

\bar{i} = \sum_{i=1}^{L} \sum_{j=1}^{L} i \, D(i, j),  (13)

and

\bar{j} = \sum_{i=1}^{L} \sum_{j=1}^{L} j \, D(i, j).  (14)

The Shannon entropy e is calculated as:

e = -\sum_{i=1}^{L} \sum_{j=1}^{L} D(i, j) \log(D(i, j)).  (15)
Theoretically, a complete set of central moments can be
used to uniquely describe the information contained in a
matrix (Hu 1962). However in practice, only a small subset
of the central moments can sufficiently represent the distribution matrix D. These selected central moments together with
the Shannon entropy are then used to form a statistics vector,
as shown in Fig. 4f. The three statistics vectors from the x y,
Fig. 4 An illustration of the generation of a RoPS feature descriptor for one rotation. a The Armadillo model and the local surface around a feature point. b The local surface is cropped and transformed in the LRF. c The local surface is rotated around a coordinate axis. d The neighboring points are projected onto three 2D planes. e A distribution matrix is obtained for each plane by partitioning the 2D plane into bins and counting up the number of points falling into each bin. The dark color indicates a large number. f Each distribution matrix is then encoded into several statistics. g The statistics from three distribution matrices are concatenated to form a sub-feature descriptor for one rotation. (Figure best seen in color.)
xz and yz planes are then concatenated to form a sub-feature f_x(θk). Note that f_x(θk) denotes the total statistics for the kth rotation around the x axis, as shown in Fig. 4g.

In order to encode the "complete" information of the local surface, the pointcloud Q′ is rotated around the x axis by a set of angles {θk}, k = 1, 2, ..., T, resulting in a set of sub-features {f_x(θk)}, k = 1, 2, ..., T. Further, Q′ is rotated by a set of angles around the y axis and a set of sub-features {f_y(θk)}, k = 1, 2, ..., T is calculated. Finally, Q′ is rotated by a set of angles around the z axis and a set of sub-features {f_z(θk)}, k = 1, 2, ..., T is calculated. The overall feature descriptor is then generated by concatenating the sub-features of all the rotations into a vector, that is:

f = {f_x(θk), f_y(θk), f_z(θk)}, k = 1, 2, ..., T.   (16)
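The rotate, project, bin and summarize pipeline (Eqs. 12–16) can be sketched as follows. This is an illustrative reimplementation rather than the authors' code: the rotation angles θk = kπ/T and the use of NumPy's `histogram2d` for binning are our assumptions, and the input points are assumed to be already expressed in the LRF.

```python
import numpy as np

def rops_descriptor(points, T=3, L=5):
    """Sketch of a RoPS-style descriptor (Eq. 16): rotate the LRF-aligned
    pointcloud about the x, y and z axes, project each rotated cloud onto
    the xy, xz and yz planes, bin each projection into an L x L normalized
    distribution matrix, and collect {mu11, mu21, mu12, mu22, e} from each
    matrix (Eqs. 12-15)."""
    def stats(D):
        idx = np.arange(1, L + 1)
        ii, jj = np.meshgrid(idx, idx, indexing="ij")
        i_bar, j_bar = (ii * D).sum(), (jj * D).sum()
        mu = lambda m, n: (((ii - i_bar) ** m) * ((jj - j_bar) ** n) * D).sum()
        nz = D[D > 0]                      # skip empty bins to avoid log(0)
        return [mu(1, 1), mu(2, 1), mu(1, 2), mu(2, 2), -(nz * np.log(nz)).sum()]

    def rotation(axis, theta):
        c, s = np.cos(theta), np.sin(theta)
        R = np.eye(3)
        a, b = [(1, 2), (0, 2), (0, 1)][axis]  # plane rotated for x/y/z axis
        R[a, a] = R[b, b] = c
        R[a, b], R[b, a] = -s, s
        return R

    f = []
    for axis in range(3):                        # rotations about x, y and z
        for k in range(T):
            P = points @ rotation(axis, k * np.pi / T).T
            for u, v in ((0, 1), (0, 2), (1, 2)):    # xy, xz and yz planes
                D, _, _ = np.histogram2d(P[:, u], P[:, v], bins=L)
                f.extend(stats(D / D.sum()))
    return np.asarray(f)                         # 3 * T * 3 * 5 = 135 for T = 3
```

With T = 3 and L = 5 the resulting vector has 3 × 3 × 3 × 5 = 135 elements, matching the RoPS length reported in Table 2.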
It is expected that the RoPS descriptor would be highly discriminative (as demonstrated in Sect. 5) since it encodes the geometric information of a local surface from a set of viewpoints. Note that some view-based methods can be found in the literature, such as (Yamauchi et al. 2006; Ohbuchi et al. 2008) and (Atmosukarto and Shapiro 2010). However, these methods are based on global features and originate from the 3D shape retrieval area. They are therefore not suitable for 3D object recognition due to their sensitivity to occlusion and clutter.

Other related methods include the spin image (Johnson and Hebert 1999) and snapshot (Malassiotis and Strintzis 2007) descriptors. A spin image is generated by projecting a local surface onto a 2D plane using a cylindrical parametrization. Similarly, a snapshot is obtained by rendering a local surface from the viewpoint which is perpendicular to the surface. Our RoPS differs from these methods in several aspects. First, RoPS represents a local surface from a set of viewpoints rather than just one view (as in the case of spin image and snapshot). Second, RoPS is associated with a unique and unambiguous LRF, and it is invariant to rotation. In contrast, spin image discards cylindrical angular information and snapshot is sensitive to rotation. Third, RoPS is more compact than spin image and snapshot since RoPS further encodes the 2D matrices with a set of statistics. The typical lengths of RoPS, spin image and snapshot are 135, 225 and 1600, respectively (see Table 2, Johnson and Hebert 1999 and Malassiotis and Strintzis 2007).
4.2 RoPS Generation Parameters
The RoPS feature descriptor has four parameters: (i) the
combination of statistics, (ii) the number of partition bins
L, (iii) the number of rotations T around each coordinate
axis, and (iv) the support radius r. The performance of the RoPS descriptor under different settings of these parameters was tested on the tuning dataset using the criterion of the recall versus 1-precision curve (RP Curve).
Recall versus 1-precision curve is one of the most popular
criteria used for the assessment of a feature descriptor (Flint
et al. 2008; Hou and Qin 2010; Ke and Sukthankar 2004;
Mikolajczyk and Schmid 2005). It is calculated as follows:
given a scene, a model and the ground truth transformation,
a scene feature is matched against all model features to find
the closest feature. If the ratio between the smallest distance
and the second smallest one is less than a threshold, then the
scene feature and the closest model feature are considered a
match. Further, a match is considered a true positive only if
the distance between the physical locations of the two features is sufficiently small, otherwise it is considered a false
positive. Therefore, recall is defined as:

recall = (number of true positives) / (total number of positives).   (17)

1-precision is defined as:

1-precision = (number of false positives) / (total number of matches).   (18)
By varying the threshold, a RP Curve can be generated.
Ideally, a RP Curve would fall in the top left corner of the
plot, which means that the feature obtains both high recall
and precision.
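The matching procedure and the RP Curve computation described above can be sketched as follows. This is an illustrative setup, not the authors' code: the array layout and the `gt_map` encoding of ground-truth correspondences are our assumptions.

```python
import numpy as np

def rp_curve(scene_feats, model_feats, model_pts, gt_map, dist_tol, thresholds):
    """Recall vs 1-precision points (Eqs. 17-18) for ratio-test matching.
    gt_map[i] is the model point index corresponding to scene feature i
    under the ground-truth transformation."""
    # distance from every scene feature to every model feature
    d = np.linalg.norm(scene_feats[:, None, :] - model_feats[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    nearest, second = order[:, 0], order[:, 1]
    rows = np.arange(len(scene_feats))
    ratio = d[rows, nearest] / d[rows, second]
    # a match is a true positive if the matched model point lies close to
    # the ground-truth corresponding point
    phys = np.linalg.norm(model_pts[nearest] - model_pts[gt_map], axis=1)
    curve = []
    for t in thresholds:
        matched = ratio < t
        tp = np.count_nonzero(matched & (phys < dist_tol))
        fp = np.count_nonzero(matched & (phys >= dist_tol))
        recall = tp / len(scene_feats)             # Eq. (17)
        one_precision = fp / max(tp + fp, 1)       # Eq. (18)
        curve.append((one_precision, recall))
    return curve
```

Sweeping `thresholds` over (0, 1] traces out the RP Curve for one scene–model pair.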
4.2.1 The Combination of Statistics
The selection of the subset of statistics plays an important role
in the generation of a RoPS feature descriptor. It determines
not only the capability for encapsulating the information in
a distribution matrix but also the size of a feature vector.
We considered eight combinations of statistics (a number of low-order moments and the entropy), as listed in Table 1, and tested the performance of each combination in terms of the RP Curve. The other three parameters were kept constant at L = 5, T = 3 and r = 15 mr. It is worth noting that the zeroth-order central moment µ00 and the first-order central moments µ01 and µ10 were excluded from the combinations of statistics, because these moments are constant (i.e., µ00 = 1, µ01 = 0 and µ10 = 0) and therefore carry no information about the local surface. Our experimental results are shown in Fig. 5a.
It is clear that the No. 6 combination achieved the best performance, followed by the No. 5 combination, while the No. 3, No. 4 and No. 8 combinations obtained comparable performance, with recall a little lower than that of the No. 6 combination. The superior performance of the No. 6 combination is due to two facts. First, the low-order moments µ11, µ21, µ12, µ22 and the entropy e contain the most meaningful and significant information of the distribution matrix; consequently, the descriptiveness of these statistics is sufficiently high. Second, the low-order moments are more robust to noise and varying mesh resolution than the high-order moments. Beyond the high precision and recall, the size of the No. 6 combination is also small, which means that the calculation and matching of feature descriptors can be performed efficiently. Therefore, the No. 6 combination, i.e., {µ11, µ21, µ12, µ22, e}, was selected to represent the information in a distribution matrix and to form the RoPS descriptor.

Table 1 Different combinations of the statistics

No.   Combination of the statistics
1     µ02, µ11, µ20
2     µ02, µ11, µ20, µ03, µ12, µ21, µ30
3     µ02, µ11, µ20, µ03, µ12, µ21, µ30, µ04, µ13, µ22, µ31, µ40
4     µ02, µ11, µ20, µ03, µ12, µ21, µ30, µ04, µ13, µ22, µ31, µ40, e
5     µ11, µ21, µ12, µ22
6     µ11, µ21, µ12, µ22, e
7     µ11, µ21, µ12, µ22, µ31, µ13
8     µ11, µ21, µ12, µ22, µ31, µ13, e
4.2.2 The Number of Partition Bins

The number of partition bins L is another important parameter in the RoPS generation. It determines both the descriptiveness and the robustness of a descriptor. That is, a dense partition of the projected points offers more details about the point distribution; however, it also increases the sensitivity to noise and varying mesh resolution. We tested the performance of the RoPS descriptor on the tuning dataset with respect to different numbers of partition bins, while the two other parameters were set to T = 3 and r = 15 mr. The experimental results are shown in Fig. 5b as a twin plot, where the right plot is a magnified version of the region indicated by the rectangle in the left plot.
The plot shows that the performance of RoPS descriptor
improved as the number of partition bins increased from 3
to 5. This is because more details about the point distribution were encoded into the feature descriptor. However, for
a number of partition bins larger than 5, the performance
degraded as the number of partition bins increased. This is because a dense partition makes the distribution matrix more susceptible to variations in the spatial positions of the neighboring points. It can therefore be inferred that five is
the most suitable number of partitions as a tradeoff between
the descriptiveness and the robustness to noise and varying
mesh resolution. We therefore used L = 5 in this paper.
4.2.3 The Numbers of Rotations
The number of rotations T determines the “completeness”
when describing the local surface using a RoPS feature
descriptor. That is, increasing the number of rotations means
Fig. 5 Effect of the RoPS generation parameters. (a) Different combinations of statistics. (b) The number of partition bins L. There is a
twin plot in (b), where the right plot is a magnified version of the region
indicated by the rectangle in the left plot. (c) The number of rotations
T . There is a twin plot in (c), where the right plot is a magnified ver-
sion of the region indicated by the rectangle in the left plot. (d) The
support radius r . (We chose the No. 6 combination of the statistics and
set L = 5, T = 3 and r = 15 mr in this paper as a tradeoff between
effectiveness and efficiency. Figure best seen in color)
that more information of the local surface is encoded into the overall feature descriptor. We tested the performance of the RoPS feature descriptor with respect to a varying number of rotations while keeping the other parameters constant (i.e., r = 15 mr). The results are given in Fig. 5c as a twin
plot, where the right plot is a magnified version of the region
indicated by the rectangle in the left plot.
It was found that as the number of rotations increased,
the descriptiveness of the RoPS increased, resulting in an
improvement of the matching performance (which confirmed
our assumption). Specifically, the performance of the RoPS
descriptor improved significantly as the number of rotations
increased from 1 to 2, as shown in the left plot of Fig. 5c.
The performance then improved slightly as the number of
rotations increased from 2 to 6, as indicated in the magnified
version shown in the right plot of Fig. 5c. In fact, there was
no notable difference between the performance with respect
to the number of rotations of three and six. That is because
almost all the information of the local surface is encoded in
the feature descriptor by rotating the neighboring points three
times around each axis. Therefore, increasing the number of
rotations any further will not necessarily add any significant
information to the feature descriptor. Moreover, increasing
the number of rotations will cost more computational and
memory resources. We therefore set the number of rotations to three in this paper.
4.2.4 The Support Radius
The support radius r determines the amount of surface that
is encoded by the RoPS feature descriptor. The value of r
can be chosen depending on how local the feature should
be, and a tradeoff lies between the feature’s descriptiveness
and robustness to occlusion. That is, a large support radius
enables the RoPS descriptor to encapsulate more information
of the object and therefore provides more descriptiveness. On
the other hand, a large support radius increases the sensitivity to occlusion and clutter. We tested the performance of
the RoPS feature descriptor with respect to varying support
radius while keeping the other parameters fixed. The results
are given in Fig. 5d.
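The support-region selection described above (a radius expressed in multiples of the mesh resolution) can be sketched as follows; the function and parameter names are ours, not from the paper.

```python
import numpy as np

def crop_local_surface(vertices, p, r_mr, mr):
    """Return the neighboring points of feature point p within a support
    radius given in multiples of the mesh resolution (mr); a minimal
    sketch with hypothetical names."""
    mask = np.linalg.norm(vertices - p, axis=1) <= r_mr * mr
    return vertices[mask]
```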
Fig. 6 An illustration of the descriptor's robustness to occlusion and clutter with respect to varying support radius. The red, green and blue spheres respectively represent the support regions with radii of 25, 15 and 5 mr for a feature point. (Figure best seen in color.)

The results show that the recall and precision performance of the RoPS feature descriptor improved steadily as the support radius increased from 5 mr (mr = mesh resolution) to 25 mr. Specifically, there was a significant improvement of the matching performance as the support radius increased from 5 to 10 mr; this is because a radius of 5 mr is too small to contain sufficient discriminating information of the underlying surface. The RoPS feature descriptor achieved good results with a support radius of 15 mr, achieving a high precision of about 0.9 and a high recall of about 0.9. Although the performance of the RoPS feature descriptor further improved slightly as the support radius was increased to 25 mr, the performance deteriorated sharply when the support radius was set to 30 mr. We chose to set the support radius to 15 mr in this paper to maintain a strong robustness to occlusion and clutter. An illustration is shown in Fig. 6. The range image contains two objects in the presence of occlusion and clutter, and a feature point is selected near the tail of the chicken. The red, green and blue spheres respectively represent the support regions with radii of 25, 15 and 5 mr for the feature point. As the radius increases from 5 to 25 mr, points on the surface within the support region are more likely to be missing due to occlusion, and points from other objects (e.g., the T-rex on the right) are more likely to be included in the support region due to clutter. Therefore, the resulting feature descriptor is more likely to be affected by occlusion and clutter.

Note that several adaptive-scale keypoint detection methods have been proposed for the purpose of determining the support radius based on the inherent scale of a feature point (Tombari et al. 2013). However, we simply adopt a fixed support radius since our focus is on feature description and object recognition rather than keypoint detection. Moreover, our proposed RoPS descriptor has been demonstrated to achieve an even better performance compared to methods with adaptive-scale keypoint detection (e.g., EM matching and keypoint matching), as analyzed in Sect. 7.

5 Performance of the RoPS Descriptor

The descriptiveness and robustness of our proposed RoPS feature descriptor were first evaluated on the Bologna Dataset (Tombari et al. 2010) with respect to different levels of noise, varying mesh resolution and their combinations. They were also evaluated on the PHOTOMESH Dataset (Zaharescu et al. 2012) with respect to 13 transformations. In these experiments, RoPS was compared to several state-of-the-art feature descriptors.

Fig. 7 A scene from the Bologna Dataset
5.1 Performance on the Bologna Dataset
5.1.1 Dataset and Parameter Setting
The Bologna Dataset used in this paper comprises six models and 45 scenes. The six models (i.e., “Armadillo”, “Asia
Dragon”, “Bunny”, “Dragon”, “Happy Buddha” and “Thai
Statue”) were taken from the Stanford 3D Scanning Repository. They are shown in Fig. 2. Each scene was synthetically
generated by randomly rotating and translating three to five
models in order to create clutter and pose variances. As a
result, the ground truth rotations and translations between
each model and its instances in the scenes were known a priori during the process of construction. An example scene is
shown in Fig. 7.
The performance of each feature descriptor was assessed
using the criterion of RP Curve (as detailed in Sect. 4.2). We
compared our RoPS feature descriptor with five state-of-the-art feature descriptors, including spin image (Johnson and
Hebert 1999), normal histogram (NormHist) (Hetzel et al.
2001), LSP (Chen and Bhanu 2007), THRIFT (Flint et al.
2007) and SHOT (Tombari et al. 2010). The support radius r
for all methods was set to 15 mr as a compromise between descriptiveness and robustness to occlusion. The parameters for generating all these feature descriptors were tuned
by optimizing the performance in terms of RP Curve on the
tuning dataset. The tuned parameter settings for all feature descriptors are presented in Table 2.

Table 2 Tuned parameter settings for six feature descriptors

            Support radius (mr)   Dimensionality   Length
Spin image  15                    15 × 15          225
NormHist    15                    15 × 15          225
LSP         15                    15 × 15          225
THRIFT      15                    32 × 1           32
SHOT        15                    8 × 2 × 2 × 10   320
RoPS        15                    3 × 3 × 3 × 5    135
In order to avoid the impact of the keypoint detection
method on feature’s descriptiveness, we randomly selected
1,000 feature points from each model, and extracted their
corresponding points from the scene. We then employed the
methods listed in Table 2 to extract feature descriptors for
these feature points. Finally, we calculated a RP Curve for
each feature descriptor to evaluate the performance.
5.1.2 Robustness to Noise
In order to evaluate the robustness of these feature descriptors to noise, we added Gaussian noise with standard deviations of 0.1, 0.2, 0.3, 0.4 and 0.5 mr to the scene data.
The RP Curves under different levels of noise are presented
in Fig. 8.
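Noise levels expressed in units of mesh resolution (mr) can be reproduced with a sketch like the following; it assumes the common definition of mesh resolution as the average edge length, which the paper does not spell out here.

```python
import numpy as np

def mesh_resolution(vertices, edges):
    """Mesh resolution (mr) taken as the average edge length, a common
    definition; 'edges' lists vertex-index pairs."""
    d = vertices[edges[:, 0]] - vertices[edges[:, 1]]
    return np.linalg.norm(d, axis=1).mean()

def add_gaussian_noise(vertices, edges, sigma_in_mr, seed=0):
    """Perturb vertices with Gaussian noise whose standard deviation is a
    fraction of the mesh resolution (e.g., 0.1 mr)."""
    sigma = sigma_in_mr * mesh_resolution(vertices, edges)
    rng = np.random.default_rng(seed)
    return vertices + rng.normal(0.0, sigma, size=vertices.shape)
```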
We made a number of observations. (i) These feature
descriptors achieved comparable performance on noise-free
data, with high recall together with high precision, as shown
in Fig. 8a.
(ii) With noise, our proposed RoPS feature descriptor
achieved the best performance in most cases, and is followed
by SHOT. Specifically, the performance of RoPS is better
than SHOT under a low-level noise with a standard deviation
of 0.1 mr, as shown in Fig. 8b. As the standard deviation of the
noise increased to 0.2 and 0.3 mr, SHOT performed slightly
better than RoPS, as indicated in Fig. 8c, d. However, the performance of our proposed RoPS was significantly better than
SHOT under high levels of noise, e.g., with a noise deviation
larger than 0.3 mr, as shown in Fig. 8e, f. It can be inferred
that RoPS is very robust to noise, particularly in the case of
scenes with a high level of noise.
(iii) As the noise level increased, the performance of LSP
and THRIFT deteriorated sharply, as shown in Fig. 8b–e.
THRIFT failed to work even under a low-level of noise with
a standard deviation of 0.1 mr. This result is also consistent
with the conclusion given in (Flint et al. 2008). Although NormHist and spin image worked relatively well under low- and medium-level noise with a standard deviation less than 0.2 mr, they failed completely under noise with a large standard deviation. The sensitivity of spin image, NormHist, THRIFT and LSP to noise is due to the fact that they rely on surface normals to generate their feature descriptors. Since the calculation of a surface normal includes a process of differentiation, it is very susceptible to noise.
(iv) The strong robustness of our RoPS feature descriptor
to noise can be explained by at least three facts. First, RoPS
encodes the “complete” information of the local surface from
various viewpoints through rotation and therefore, encodes
more information than the existing methods. Second, RoPS
only uses the low-order moments of the distribution matrices
to form its feature descriptor and is therefore less affected by
noise. Third, our proposed unique, unambiguous and stable
LRF also helps to increase the descriptiveness and robustness
of the RoPS feature descriptor.
5.1.3 Robustness to Varying Mesh Resolution
In order to evaluate the robustness of these feature descriptors
to varying mesh resolution, we resampled the noise-free scene meshes to 1/2, 1/4 and 1/8 of their original mesh resolution. The
RP Curves under different levels of mesh decimation are
presented in Fig. 9a–c.
It was found that our proposed RoPS feature descriptor outperformed all the other descriptors by a large margin under all levels of mesh decimation. It is also notable that the performance of our RoPS feature descriptor at 1/8 of the original mesh resolution was even comparable to the best results given by the existing feature descriptors at 1/2 of the original mesh resolution. Specifically, RoPS obtained a precision of more than 0.7 and a recall of more than 0.7 at 1/8 of the original mesh resolution, whereas spin image obtained a precision of around 0.8 and a recall of around 0.8 at 1/2 of the original mesh resolution, as shown in Fig. 9a, c. This indicates that our RoPS feature descriptor is very robust to varying mesh resolution.
The strong robustness of RoPS to varying mesh resolution is due to at least two factors. First, the LRF of RoPS
is derived by calculating the scatter matrix of all the points
lying on the local surface rather than just the vertices, which
makes RoPS robust to different mesh resolution. Second, the
2D projection planes are sparsely partitioned and only the
low-order moments are used to form the feature descriptor,
which further improves the robustness of our method to mesh
resolution.
5.1.4 Robustness to Combined Noise and Mesh Decimation
In order to further test the robustness of these feature descriptors to combined noise and mesh decimation, we resampled the scene meshes down to 1/2 of their original mesh resolution and added Gaussian random noise with a standard deviation of 0.1 mr to the scenes. The resulting RP Curves are
presented in Fig. 9d.
Fig. 8 Recall vs 1-precision curves in the presence of noise. (Figure best seen in color.)
As shown in Fig. 9d, RoPS significantly outperformed
the other methods in the scenes with both noise and mesh
decimation, obtaining a high precision of about 0.9 and a
high recall of about 0.9. It is followed by NormHist, SHOT,
spin image and LSP, while THRIFT failed to work.
As summarized in Table 2, the length of the RoPS feature descriptor is 135, while those of spin image, NormHist, LSP and SHOT are 225, 225, 225 and 320, respectively. RoPS is thus more compact, and therefore more efficient for feature matching, than these methods. Note that although the length of THRIFT is smaller than that of RoPS, THRIFT's recall and precision performance is surpassed by our RoPS feature descriptor by a large margin.
5.2 Performance on the PHOTOMESH Dataset
The PHOTOMESH Dataset contains three null shapes. Two
of the null shapes were obtained with multi-view stereo
reconstruction algorithms, and the other one was generated
with a modeling program. 13 transformations were applied
to each shape. The transformations include color noise,
color shot noise, geometry noise, geometry shot noise, rotation, scale, local scale, sampling, hole, micro-hole, topology
changes and isometry. Each transformation has five different
levels of strength.
To make a rigorous comparison with (Zaharescu et al. 2012), we set the support radius r to αr √(A_M/π), where A_M is the total area of a mesh and αr is 2 %.

Fig. 9 Recall vs 1-precision curves with respect to mesh resolution. (Figure best seen in color.)

RoPS feature descriptors
were calculated at all points of the shapes, without any feature detection. We used the average normalized L 2 distance
between the feature descriptors of corresponding points to
measure the quality of a feature descriptor, as in (Zaharescu
et al. 2012). The experimental results of the RoPS descriptor are shown in Table 3. For comparison, the results of the
MeshHOG descriptor (Gaussian curvature) without and with
MeshDOG are also reported in Tables 4 and 5, respectively.
The RoPS descriptor was clearly invariant to color noise and color shot noise, because the geometric information used in RoPS cannot be affected by color deformations. RoPS was also invariant to rotation and scale, which means that it was invariant to rigid transformations.
The RoPS descriptor also turned out to be very robust to geometry noise, geometry shot noise, local scale, holes, micro-holes, topology and isometry with noise. The average normalized L2 distances for all these transformations were no more than 0.06, even under the highest level of transformations. The biggest challenge for the RoPS descriptor was sampling. The average normalized L2 distance increased from 0.01 to 0.06 as the strength level changed from 1 to 5. However, RoPS was more robust to sampling than MeshHOG.
As shown in Tables 3 and 4, the average normalized L 2 distance of RoPS with a strength level of 5 was even smaller
than that of MeshHOG with a strength level of 1, i.e., 0.02
and 0.04, respectively. Overall, the average normalized L2 distances of the RoPS descriptor were much smaller under all strength levels of all transformations compared to MeshHOG.
6 3D Object Recognition Algorithm
So far we have developed a novel LRF and a RoPS feature descriptor. In this section, we propose a new hierarchical 3D object recognition algorithm based on the LRF and RoPS descriptor. Our 3D object recognition algorithm consists of four major modules, i.e., model representation, candidate model generation, transformation hypothesis generation, and verification and segmentation. A flow chart illustration of the algorithm is given in Fig. 10.
6.1 Model Representation
We first construct a model library for the 3D objects that
we are interested in. Given a model M, Nm seed points are
evenly selected from the model pointcloud. Since the feature
descriptors of closely located feature points may be similar
(since they represent more or less the same local surface), a
Table 3 Robustness of RoPS descriptor

Transform.            Strength
                      1      ≤2     ≤3     ≤4     ≤5
Color noise           0.00   0.00   0.00   0.00   0.00
Color shot noise      0.00   0.00   0.00   0.00   0.00
Geometry noise        0.01   0.01   0.01   0.02   0.02
Geometry shot noise   0.01   0.01   0.02   0.03   0.05
Rotation              0.00   0.00   0.00   0.00   0.00
Scale                 0.00   0.00   0.00   0.00   0.00
Local scale           0.01   0.01   0.02   0.02   0.02
Sampling              0.01   0.02   0.04   0.05   0.06
Holes                 0.01   0.01   0.01   0.01   0.02
Micro-holes           0.00   0.01   0.01   0.01   0.01
Topology              0.01   0.01   0.02   0.02   0.03
Isometry + noise      0.02   0.02   0.01   0.02   0.02
Average               0.00   0.01   0.01   0.02   0.02
Table 4 Robustness of MeshHOG (Gaussian curvature) without MeshDOG detector

Transform.            Strength
                      1      ≤2     ≤3     ≤4     ≤5
Color noise           0.00   0.00   0.00   0.00   0.00
Color shot noise      0.00   0.00   0.00   0.00   0.00
Geometry noise        0.07   0.08   0.09   0.10   0.11
Geometry shot noise   0.02   0.03   0.05   0.06   0.09
Rotation              0.00   0.00   0.00   0.00   0.00
Scale                 0.00   0.00   0.00   0.00   0.00
Local scale           0.06   0.07   0.08   0.09   0.10
Sampling              0.10   0.12   0.13   0.13   0.13
Holes                 0.01   0.02   0.04   0.03   0.05
Micro-holes           0.01   0.01   0.03   0.04   0.04
Topology              0.07   0.10   0.11   0.11   0.12
Isometry + noise      0.08   0.08   0.08   0.09   0.09
Average               0.04   0.04   0.05   0.06   0.06
resolution control strategy (Zhong 2009) is further enforced
on these seed points to extract the final feature points. For
each feature point pm , the LRF Fm and the feature descriptor (e.g., our RoPS descriptor) f m are calculated. The point
position pm , LRF Fm and feature descriptor f m of all the
feature points are then stored in a library for object recognition.
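The seed-point pruning mentioned above can be approximated with a simple greedy scheme; this is a stand-in for the resolution control strategy of Zhong (2009), whose exact procedure is not reproduced here.

```python
import numpy as np

def resolution_control(points, min_dist):
    """Greedily keep seed points so that no two kept points are closer
    than min_dist; a simple stand-in for a resolution control strategy."""
    kept = []
    for p in points:
        if all(np.linalg.norm(p - q) >= min_dist for q in kept):
            kept.append(p)
    return np.asarray(kept)
```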
In order to speed up the process of feature matching during online recognition, the local feature descriptors from all
models are indexed using a k-d tree method (Bentley 1975).
Note that, the model feature calculation and indexing can be
performed offline, while the following modules are operated
online.
6.2 Candidate Model Generation
The input scene S is first decimated, which results in a low
resolution mesh S ′ . The vertices of S which are nearest
to the vertices of S′ are selected as seed points (following an approach similar to that of Mian et al. 2006b). Next, a resolution control strategy (Zhong 2009) is enforced on these
seed points to prune out redundant seed points. A boundary
checking strategy (Mian et al. 2010) is also applied to the
seed points to eliminate the boundary points of the range
image. Further, since the LRF of a point can be ambiguous
when two eigenvalues of the overall scatter matrix of the
underlying local surface (see Eq. 4) are equal, we impose
Table 5 Robustness of MeshHOG (Gaussian curvature) with MeshDOG detector

Transform.            Strength
                      1      ≤2     ≤3     ≤4     ≤5
Color noise           0.00   0.00   0.00   0.00   0.00
Color shot noise      0.00   0.00   0.00   0.00   0.00
Geometry noise        0.26   0.29   0.31   0.33   0.34
Geometry shot noise   0.04   0.09   0.14   0.21   0.29
Rotation              0.01   0.01   0.01   0.01   0.01
Scale                 0.01   0.01   0.01   0.01   0.00
Local scale           0.21   0.25   0.28   0.30   0.31
Sampling              0.31   0.34   0.34   0.36   0.36
Holes                 0.02   0.02   0.07   0.07   0.07
Micro-holes           0.01   0.01   0.07   0.07   0.08
Topology              0.13   0.20   0.22   0.25   0.28
Isometry + noise      0.23   0.24   0.22   0.25   0.25
Average               0.10   0.12   0.14   0.15   0.17
Fig. 10 Flow chart of the 3D object recognition algorithm. The module of model representation is performed offline, and the other modules are
operated online
a constraint on the ratio of the eigenvalues, λ1/λ2 > τλ, to exclude seed points with symmetrical local surfaces, as in (Zhong 2009; Mian et al. 2010). The remaining seed points
are considered feature points. It is worth noting that, the
feature point detection and LRF calculation procedures can
be performed simultaneously. Given the LRF Fs of a feature point ps , its feature descriptor f s is subsequently calculated.
The scene features are exactly matched against all model
features in the library using the previously constructed
k-d tree. If the ratio between the smallest distance and the
second smallest one is less than a threshold τ f , the scene
feature and its closest model feature are considered a feature correspondence. Each feature correspondence votes for
a model. The models which have received votes from feature correspondences are considered candidate models. They are then ranked according to the number of votes received. With these ranked models, the subsequent steps (Sects. 6.3 and 6.4) can be performed starting from the most likely candidate model.
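The k-d tree matching, ratio test and vote counting described above can be sketched with SciPy's k-d tree; the function, variable and threshold names are illustrative, not from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_and_vote(scene_feats, model_feats, model_ids, tau_f=0.8):
    """Match scene features against a k-d tree of all model features with
    the ratio test, then count votes per model; model_ids[m] names the
    model that contributed model feature m."""
    tree = cKDTree(model_feats)
    dist, idx = tree.query(scene_feats, k=2)     # two nearest model features
    keep = dist[:, 0] / dist[:, 1] < tau_f       # ratio test with threshold tau_f
    correspondences = [(int(s), int(idx[s, 0])) for s in np.nonzero(keep)[0]]
    votes = {}
    for s, m in correspondences:
        model = int(model_ids[m])
        votes[model] = votes.get(model, 0) + 1
    ranked = sorted(votes, key=votes.get, reverse=True)  # candidate models
    return correspondences, ranked
```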
6.3 Transformation Hypothesis Generation

For a feature correspondence which votes for the model M, a rigid transformation is calculated by aligning the LRF of the model feature to the LRF of the scene feature. Specifically, given the LRF Fs and the point position ps of a scene feature, and the LRF Fm and the point position pm of the corresponding model feature, the rigid transformation can be estimated by:

R = Fs^T Fm, (19)
t = ps − R pm, (20)

where R is the rotation matrix and t is the translation vector of the rigid transformation. It is worth noting that a transformation can be estimated from a single feature correspondence using our RoPS feature descriptor. This is a major advantage of our algorithm compared with most of the existing algorithms (e.g., splash, point signature and spin image based methods), which require at least three correspondences to calculate a transformation (Johnson and Hebert 1999). Our algorithm not only eliminates the combinatorial explosion of feature correspondences but also improves the reliability of the estimated transformation.

Once all the plausible transformations (Ri, ti), i = 1, 2, ..., Nt between the scene S and the model M are calculated, these transformations are grouped into several clusters. Specifically, for each plausible transformation, its rotation matrix Ri is first converted into three Euler angles, which form a vector ui. In this manner, the difference between any two rotation matrices can be measured by the Euclidean distance between their corresponding Euler angles. The transformations whose Euler angles are around ui (with distances less than τa) and whose translations are around ti (with distances less than τt) are grouped into a cluster Ci. Therefore, each plausible transformation (Ri, ti) results in a cluster Ci. The cluster center (Rc, tc) of Ci is calculated as the average rotation and translation in that cluster. Next, a confidence score sc for each cluster is calculated as:

sc = nf / d, (21)

where nf is the number of feature correspondences in the cluster, and d is the average distance between the scene features and their corresponding model features which fall within the cluster. The clusters are sorted according to their confidence scores, and the ones with confidence scores smaller than half of the maximum score are first pruned out. We then select the valid clusters from the remaining clusters, starting from the highest scored one and discarding the nearby clusters whose distances to the selected clusters are small (using τa and τt). τa and τt are empirically set to 0.2 and 30 mr throughout this paper. The selected clusters are then allowed to proceed to the final verification and segmentation stage (Sect. 6.4).

6.4 Verification and Segmentation

Given a scene S, a candidate model M and a transformation hypothesis (Rc, tc), the model M is first transformed to the scene S using the transformation hypothesis (Rc, tc). This transformation is further refined using the ICP algorithm (Besl and McKay 1992), resulting in a residual error ε. After ICP refinement, the visible proportion α is calculated as:

α = nc / ns, (22)

where nc is the number of corresponding points between the scene S and the model M, and ns is the total number of points in the scene S. Here, a scene point and a transformed model point are considered corresponding if their distance is less than twice the model resolution (Mian et al. 2006b).

The candidate model M and the transformation hypothesis (Rc, tc) are accepted as being correct only if the residual error ε is smaller than a threshold τε and the proportion α is larger than a threshold τα. However, it is hard to determine these thresholds, because strict thresholds will reject correct hypotheses for objects which are highly occluded in the scene, while loose thresholds will produce many false positives. In this paper, a flexible thresholding scheme is therefore developed. To deal with a highly occluded but well aligned object, we select a small error threshold τε1 together with a small proportion threshold τα1. Meanwhile, in order to increase the tolerance to the residual error resulting from an inaccurate estimation of the transformation, we select a relatively larger error threshold τε2 together with a larger proportion threshold τα2. We chose these thresholds empirically and set them as τε1 = 0.75 mr, τε2 = 1.5 mr, τα1 = 0.04 and τα2 = 0.2 throughout the paper.

Therefore, if ε < τε1 and α > τα1, or ε < τε2 and α > τα2, the candidate model M and the transformation hypothesis (Rc, tc) are accepted, and the scene points which correspond to this model are removed from the scene. Otherwise, this transformation hypothesis is rejected and the next transformation hypothesis is verified in turn. If no transformation hypothesis results in an accurate alignment, we conclude that the model M is not present in the scene S. If more than one transformation hypothesis is accepted, multiple instances of the model M are present in the scene S.

Once all the transformation hypotheses for a candidate model M are tested, the object recognition algorithm proceeds to the next candidate model. This process continues until either all the candidate models have been verified or there are too few points left in the scene for recognition.
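The single-correspondence transformation of Eqs. (19)–(20) can be sketched as follows, assuming NumPy and that each LRF is stored as a 3x3 matrix whose rows are the frame axes (the row-vs-column convention is an assumption of this sketch):

```python
import numpy as np

def transform_from_correspondence(F_s, p_s, F_m, p_m):
    """Rigid transformation aligning a model feature's LRF to the LRF of
    the corresponding scene feature (a sketch of Eqs. 19-20).

    F_s, F_m are 3x3 LRF matrices (rows = axes) and p_s, p_m the feature
    point positions. A model point q maps into the scene as R @ q + t.
    """
    R = F_s.T @ F_m
    t = p_s - R @ p_m
    return R, t
```

Because one correspondence fully determines (R, t), no triplet enumeration is needed, which is the combinatorial advantage discussed above.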
7 Performance of 3D Object Recognition

The effectiveness of our proposed RoPS based 3D object recognition algorithm was evaluated by a set of experiments on four datasets, including the Bologna Dataset (Tombari et al. 2010), the UWA Dataset (Mian et al. 2006b), the Queen's Dataset (Taati and Greenspan 2011) and the Ca' Foscari Venezia Dataset (Rodolà et al. 2012). These four datasets are amongst the most popular publicly available datasets, containing multiple objects in each scene in the presence of occlusion and clutter.

7.1 Recognition Results on the Bologna Dataset

We used the Bologna Dataset to evaluate the effectiveness of our proposed RoPS based 3D object recognition algorithm. We specifically focused on the performance with respect to noise and varying mesh resolution. We also aimed to demonstrate the capability of our 3D object recognition algorithm to integrate existing feature descriptors that have no LRF.

We used our RoPS together with the five feature descriptors (as detailed in Sect. 5.1.1) to perform object recognition. For feature descriptors that do not have a dedicated LRF, e.g., spin image, NormHist, LSP and THRIFT, the LRFs were defined using our proposed technique. The average numbers of detected feature points in an unsampled scene and a model were 985 and 1,000, respectively.

In order to evaluate the performance of the 3D object recognition algorithms on noisy data, we added Gaussian noise with increasing standard deviations of 0.1, 0.2, 0.3, 0.4 and 0.5 mr to each scene. The average recognition rates of the six algorithms on the 45 scenes are shown in Fig. 11a. It can be seen that both the RoPS and SHOT based algorithms achieved the best results, with recognition rates of 100 % under all levels of noise. The spin image and NormHist based algorithms achieved recognition rates higher than 97 % under low-level noise with deviations less than 0.1 mr; however, their performance deteriorated sharply as the noise increased. The LSP and THRIFT based algorithms were very sensitive to noise.

In order to evaluate the effectiveness of the 3D object recognition algorithms with respect to varying mesh resolution, the 45 noise-free scenes were resampled to 1/2, 1/4 and 1/8 of their original mesh resolution. The average recognition rates on the 45 scenes with respect to different mesh resolutions are given in Fig. 11b. It is shown that the RoPS based algorithm achieved the best performance, obtaining a 100 % recognition rate under all levels of mesh decimation. It was followed by the NormHist and spin image based algorithms, which obtained recognition rates of 97.8 and 91.1 % respectively in scenes with 1/8 of the original mesh resolution.

7.2 Recognition Results on the UWA Dataset

The UWA Dataset contains five 3D models and 50 real scenes. The scenes were generated by randomly placing four or five real objects together in a scene and scanning them from a single viewpoint using a Minolta Vivid 910 scanner. An illustration of the five models is given in Fig. 12, and two sample scenes are shown in Fig. 13a, c.

For the sake of consistency in comparison, RoPS based 3D object recognition experiments were performed on the same data as Mian et al. (2006b) and Bariya et al. (2012). Besides, the Rhino model was excluded from the recognition results, since it contains large holes and cannot be recognized by the spin image based algorithm in any of the scenes. Comparison was performed with a number of state-of-the-art algorithms, including the tensor (Mian et al. 2006b), spin image (Mian et al. 2006b), keypoint (Mian et al. 2010), VD-LSD (Taati and Greenspan 2011) and EM based (Bariya et al. 2012) algorithms. Comparison results are shown in Fig. 14 with respect to varying levels of occlusion. The average numbers of detected feature points in a scene and a model were 2,259 and 4,247, respectively.

Occlusion is defined according to Johnson and Hebert (1999) as:

occlusion = 1 − (model surface patch area in scene) / (total model surface area). (23)

The ground truth occlusion values were automatically calculated for the correctly recognized objects and manually calculated for the objects which were not correctly recognized. As shown in Fig. 14, our RoPS based algorithm outperformed all the existing algorithms. It achieved a recognition rate of 100 % with up to 80 % occlusion, and a recognition rate of 93.1 % even under 85 % occlusion. The average recognition rate of our RoPS based algorithm was 98.8 %, while the average recognition rates of the spin image, tensor and EM based algorithms were 87.8, 96.6 and 97.5 % respectively, with up to 84 % occlusion. The overall average recognition rate of our RoPS based algorithm was 98.9 %. Moreover, no false positive occurred in the experiments when using our RoPS based algorithm, and only two out of the total 188 objects in the 50 scenes were not correctly recognized. These results confirm that our RoPS based algorithm is able to recognize objects in complex scenes in the presence of significant clutter, occlusion and mesh resolution variation.

Two sample scenes and their corresponding recognition results are shown in Fig. 13. All objects were correctly recognized and their poses were accurately recovered except for the T-Rex in Fig. 13d. The reason for the failure in Fig. 13d relates to the excessive occlusion of the T-Rex. It is highly occluded and its visible surface is sparsely distributed over several parts of the body rather than concentrated in a single area. Therefore, almost no reliable feature could be extracted from the object.
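The occlusion measure of Eq. (23), following Johnson and Hebert's (1999) definition, is simply one minus the visible surface fraction; a trivial sketch with hypothetical area values:

```python
def occlusion(patch_area_in_scene, total_model_area):
    """Occlusion as in Eq. (23): the fraction of the model surface that
    is NOT visible in the scene. Areas are in the same (arbitrary) unit.
    """
    return 1.0 - patch_area_in_scene / total_model_area
```

For example, a model with only a fifth of its surface visible is 80 % occluded, which is near the limit at which the algorithm still recognized every object.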
Fig. 11 Recognition rates on the Bologna Dataset with respect to (a) noise deviation (mr) and (b) mesh decimation, for the spin image, NormHist, LSP, THRIFT, SHOT and RoPS based algorithms. (Figure best seen in color.)
Fig. 12 The five models of the UWA Dataset
Fig. 13 Two sample scenes and our recognition results on the UWA Dataset. The correctly recognized objects have been superimposed by their
3D complete models from the library. All objects were correctly recognized except for the T-Rex in (d). (Figure best seen in color.)
Note that, although we used a fixed support radius (i.e., r = 15 mr) for feature description throughout this paper, the proposed algorithm is generic, and different adaptive-scale keypoint detection methods can be seamlessly integrated with our RoPS descriptor. In order to further demonstrate the generic nature of our algorithm, we generated RoPS descriptors using the support radii estimated by the adaptive-scale method in (Mian et al. 2010). The recognition result is shown in Fig. 14. The recognition performance of the adaptive-scale RoPS based algorithm was better than that reported in (Mian et al. 2010), which means that our RoPS descriptor was more descriptive than the descriptor used in (Mian et al. 2010). It is also observed that the performance of adaptive-scale RoPS was marginally worse than that of its fixed-scale counterpart. This is because the errors of scale estimation adversely affected the performance of feature matching, and ultimately object recognition. That is, the corresponding points in a scene and a model may have different estimated scales due to the estimation errors. As reported in (Tombari et al. 2013), the scale repeatability of the adaptive-scale detector in (Mian et al. 2010) was less than 85 and 60 % on the Retrieval dataset and the Random Views dataset, respectively.
Fig. 14 Recognition rates on the UWA Dataset with respect to occlusion (%), for the tensor, spin image, EM, keypoint, VD-LSD, RoPS and adaptive-scale RoPS based algorithms. (Figure best seen in color.)

7.3 Recognition Results on the Queen's Dataset

The Queen's Dataset contains five models and 80 real scenes. The 80 scenes were generated by randomly placing one,
three, four or five of the models in a scene and scanned from
a single viewpoint using a LIDAR sensor. The five models
were generated by merging several range images of a single
object. Since all scenes and models were represented in the
form of pointclouds, we first converted them into triangular
meshes in order to calculate the LRFs using our proposed
technique. A scene pointcloud was converted by mapping
the 3D pointcloud onto the 2D retina plane of the sensor and
performing a 2D Delaunay triangulation over the mapped
points. The 2D points and triangles were then mapped back
to the 3D space, resulting in a triangular mesh. A model
pointcloud was converted into a triangular mesh using the
Marching Cubes algorithm (Guennebaud and Gross 2007).
An illustration of the five models is given in Fig. 15, and two
sample scenes are shown in Fig. 16a, c.
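The scene conversion described above (projecting the point cloud onto the sensor's 2D retina plane, triangulating in 2D and lifting the triangles back to 3D) can be sketched as follows, assuming NumPy and SciPy, with `project` a hypothetical stand-in for the sensor's projection function:

```python
import numpy as np
from scipy.spatial import Delaunay

def pointcloud_to_mesh(points3d, project):
    """Convert a scene point cloud to a triangular mesh, as in Sect. 7.3.

    `points3d` is an (N, 3) array; `project` maps (N, 3) points to their
    (N, 2) retina-plane coordinates. A 2D Delaunay triangulation of the
    projected points supplies the connectivity, which is reused for the
    original 3D points. Returns the vertices and an (M, 3) array of
    triangle vertex indices.
    """
    uv = project(points3d)
    tri = Delaunay(uv)
    return points3d, tri.simplices
```

For a range sensor this works because each retina-plane location sees at most one surface point, so the 2D connectivity is valid in 3D.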
First, we performed object recognition using our RoPS
based algorithm on the full dataset which contains 80 real
scenes. The average number of detected feature points in a
scene and a model were 3,296 and 4,993, respectively. The
results are shown in parentheses in Table 6, with a comparison to the results given by Bariya et al. (2012). It can be seen that the average recognition rate of our algorithm is 95.4 %; in contrast, the average recognition rate of the EM based algorithm is 82.4 %. These results indicate that our algorithm is superior to the EM based algorithm, although a complicated keypoint detection and scale selection strategy has been adopted by the EM based algorithm.
To make a direct comparison with the results given by
Taati and Greenspan (2011), we performed our RoPS based
3D object recognition on the same subset dataset which contains 55 scenes. The results are given in Table 6, with comparisons to the results provided by two variants of VD-LSD,
3DSC and four variants of spin image. As shown in Table 6, our average recognition rate was 95.4 %, while the second best result achieved by VD-LSD (SQ) was 83.8 %. The
RoPS based algorithm achieved the best recognition rates for all five models. More than 97 % of the instances of Angel, Big Bird and Gnome were correctly recognized. Although RoPS's recognition rate for Zoe was relatively low (i.e., 87.2 %), it still outperformed the existing algorithms by a large margin, since the second best result achieved by VD-LSD (SQ) was 71.8 %. Figure 16 shows two sample scenes
and our recognition results on the Queen’s Dataset. It can be
seen that our RoPS based algorithm was able to recognize
objects with large amounts of occlusion and clutter.
Note that the Queen's Dataset is more challenging than the UWA Dataset, since the former is noisier and its points are not uniformly distributed. That is the reason why the spin image based algorithm had a significant drop in recognition performance between the two datasets. Specifically, the average recognition rate of the spin image based algorithm on the UWA Dataset was 87.8 %, while its best result on the Queen's Dataset was only 54.4 %. Similarly, a notable decrease in performance can also be found for the EM based algorithm, with a 97.5 % recognition rate on the UWA Dataset and an 81.9 % recognition rate on the Queen's Dataset. In contrast, our RoPS based algorithm was consistently effective and robust to different kinds of variations (including noise, varying mesh resolution and occlusion); it outperformed the existing algorithms and achieved consistently high results on both datasets, obtaining a recognition rate of 98.9 % on the UWA Dataset and 95.4 % on the Queen's Dataset.
We also performed a timing experiment to measure the average processing time to recognize each object in a scene. The experiment was conducted on a computer with a 3.16 GHz Intel Core2 Duo CPU and 4 GB of RAM. The code was implemented in MATLAB without any program optimization or parallel computing techniques. The average computational time to detect feature points and calculate LRFs was 42.6 s. The average computational time to generate RoPS descriptors was 7.2 s. Feature matching consumed 46.6 s, while the computational time for transformation hypothesis generation was negligible. Finally, verification and segmentation took 57.4 s on average.
7.4 Recognition Results on the Ca' Foscari Venezia Dataset

This dataset is composed of 20 models and 150 scenes. Each scene contains 3 to 5 objects in the presence of occlusion and clutter. In total, there are 497 object instances in the scenes. This recently released dataset is the largest available 3D object recognition dataset. It is also more challenging than many other datasets, containing several models with large flat and featureless areas, and several models which are very similar in shape (Rodolà et al. 2012).
Fig. 15 The five models in the Queen's dataset
Fig. 16 Two sample scenes and our recognition results on the Queen's dataset. The correctly recognized objects have been superimposed by their 3D complete models from the library. All objects were correctly recognized except for the Angel in (d). (Figure best seen in color.)
Table 6 Recognition rates (%) on the Queen's dataset

Method | Angel | Big Bird | Gnome | Kid | Zoe | Average
RoPS | 97.4 (97.9) | 100.0 (100.0) | 97.4 (97.9) | 94.9 (95.8) | 87.2 (85.4) | 95.4 (95.4)
EM | NA (77.1) | NA (87.5) | NA (87.5) | NA (83.3) | NA (76.6) | 81.9 (82.4)
VD-LSD (SQ) | 89.7 | 100.0 | 70.5 | 84.6 | 71.8 | 83.8
VD-LSD (VQ) | 56.4 | 97.4 | 69.2 | 51.3 | 64.1 | 67.7
3DSC | 53.8 | 84.6 | 61.5 | 53.8 | 56.4 | 62.1
Spin image (impr.) | 53.8 | 84.6 | 38.5 | 51.3 | 41.0 | 53.8
Spin image (orig.) | 15.4 | 64.1 | 25.6 | 43.6 | 28.2 | 35.4
Spin image spherical (impr.) | 53.8 | 74.4 | 38.5 | 61.5 | 43.6 | 54.4
Spin image spherical (orig.) | 12.8 | 61.5 | 30.8 | 43.6 | 30.8 | 35.9

The best results are in bold fonts. Results on the full 80-scene dataset are shown in parentheses; the remaining entries refer to the 55-scene subset
Table 7 Precision and recall values on the Ca' Foscari Venezia dataset

Precision/Recall | Method | Armadillo | Bunny | Cat1 | Centaur1 | Chef | Chicken | Dog7 | Dragon | Face
Precision | RoPS | 97 | 100 | 100 | 100 | 100 | 97 | 100 | 100 | 100
Precision | Game-theoretic | 100 | 100 | 78 | 96 | 93 | 93 | 95 | 100 | 91
Recall | RoPS | 100 | 100 | 44 | 100 | 100 | 100 | 91 | 100 | 100
Recall | Game-theoretic | 97 | 97 | 82 | 100 | 100 | 100 | 86 | 89 | 95

Precision/Recall | Method | Ganesha | Gorilla0 | Horse7 | Lioness13 | Para | Rhino | T-Rex | Victoria3 | Wolf2
Precision | RoPS | 100 | 100 | 100 | 100 | 97 | 96 | 100 | 100 | 100
Precision | Game-theoretic | 89 | 95 | 97 | 88 | 97 | 91 | 97 | 83 | 82
Recall | RoPS | 100 | 100 | 100 | 100 | 97 | 100 | 100 | 95 | 100
Recall | Game-theoretic | 100 | 91 | 100 | 100 | 94 | 91 | 97 | 83 | 95

The best results are in bold fonts
The precision and recall values of the RoPS based algorithm on this dataset are shown in Table 7; the results reported in (Rodolà et al. 2012) are also included for comparison. As in (Rodolà et al. 2012), two out of the 20 models were left out of the recognition tests and used as clutter. The average numbers of detected feature points in a scene and a model were 2,210 and 5,000, respectively. The RoPS based algorithm achieved better precision results compared to (Rodolà et al. 2012). The average precision of the RoPS based algorithm was 99 %, which was higher than that of (Rodolà et al. 2012) by a margin of 6 %. Moreover, the precision values of 14 individual models were as high as 100 %.

The average recall of the RoPS based algorithm was 96 %; in contrast, the average recall of (Rodolà et al. 2012) was 95 %. Furthermore, the RoPS based algorithm achieved equal or better recall values on 17 of the 18 models. Note that SHOT descriptors and a game-theoretic framework are used in (Rodolà et al. 2012) for 3D object recognition. It is thus observed that our RoPS based algorithm performed better than the SHOT based algorithm on this dataset.
In summary, the superior performance of our RoPS based 3D object recognition algorithm is due to several reasons. First, the high descriptiveness and strong robustness of our RoPS feature descriptor improve the accuracy of feature matching and therefore boost the performance of 3D object recognition. Second, the unique, repeatable and robust LRF enables the estimation of a rigid transformation from a single feature correspondence, which reduces the errors of the transformation hypotheses; the probability of selecting one correct feature correspondence is much higher than the probability of selecting three correct feature correspondences. Moreover, our proposed hierarchical object recognition algorithm enables object recognition to be performed in an effective and efficient manner.
8 Conclusion
In this paper, we proposed a novel RoPS feature descriptor for 3D local surface description, and a new hierarchical
RoPS based algorithm for 3D object recognition. The RoPS
feature descriptor is generated by rotationally projecting the
neighboring points around a feature point onto three coordinate planes and calculating the statistics of the distribution
of the projected points. We also proposed a novel LRF by
calculating the scatter matrix of all points lying on the local
surface rather than just mesh vertices. The unique and highly
repeatable LRF facilitates the effectiveness and robustness
of the RoPS descriptor.
We performed a set of experiments to assess our RoPS
feature descriptor with respect to a set of different nuisances
including noise, varying mesh resolution and holes. Comparative experimental results show that our RoPS descriptor outperforms the state-of-the-art methods, obtaining high
descriptiveness and strong robustness to noise, varying mesh
resolution and other deformations.
Moreover, we performed extensive experiments for 3D
object recognition in complex scenes in the presence of noise,
varying mesh resolution, clutter and occlusion. Experimental results on the Bologna Dataset show that our RoPS based
algorithm is very effective and robust to noise and mesh resolution variation. Experimental results on the UWA Dataset
show that the RoPS based algorithm is very robust to occlusion and outperforms existing algorithms. The recognition
results achieved on the Queen’s Dataset show that our algorithm outperforms the state-of-the-art algorithms by a large
margin. The RoPS based algorithm was further tested on the
largest available 3D object recognition dataset (i.e., the Ca’
Foscari Venezia Dataset), reporting superior results. Overall,
our algorithm has achieved significant improvements over
the existing 3D object recognition algorithms when tested
on the same dataset.
Interesting future research directions include the extension of the proposed RoPS feature to encode both geometric and photometric information. Integrating geometric and
photometric cues would be beneficial for the recognition of
3D objects with poor geometric but rich photometric features
(e.g., a flat or spherical surface). Another direction is to adopt
our RoPS descriptors to perform 3D shape retrieval on a large
scale 3D shape corpus, e.g., the SHREC Datasets (Bronstein
et al. 2010b).
Acknowledgments The authors would like to acknowledge the following institutions: Stanford University for providing the 3D models;
Bologna University for providing the 3D scenes; INRIA for providing
the PHOTOMESH Dataset; Queen’s University for providing the 3D
models and scenes; Università Ca’ Foscari Venezia for providing the
3D models and scenes. The authors also acknowledge A. Zaharescu
from Aimetis Corporation for the results on the PHOTOMESH Dataset
shown in Tables 3 and 4. This research is supported by a China Scholarship Council (CSC) scholarship and Australian Research Council
Grants (DE120102960, DP110102166).
References
Atmosukarto, I., & Shapiro, L. (2010). 3D object retrieval using salient
views. In Proceedings of the First ACM International Conference on
Multimedia, Information Retrieval (pp. 73–82). Vancouver, Canada.
Bariya, P., & Nishino, K. (2010). Scale-hierarchical 3D object recognition in cluttered scenes. In IEEE Conference on Computer Vision
and Pattern Recognition (pp. 1657–1664). San Francisco, CA.
Bariya, P., Novatnack, J., Schwartz, G., & Nishino, K. (2012). 3D geometric scale variability in range images: Features and descriptors.
International Journal of Computer Vision, 99(2), 232–255.
Bayramoglu, N., & Alatan, A. (2010). Shape index SIFT: Range image
recognition using local features. In 20th International Conference
on Pattern Recognition (pp. 352–355). Istanbul, Turkey.
Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and
object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4), 509–522.
Bentley, J. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9), 509–517.
Besl, P., & McKay, N. (1992). A method for registration of 3-D shapes.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
14(2), 239–256.
Boyer, E., Bronstein, A., Bronstein, M., Bustos, B., Darom, T., Horaud,
R., Hotz, I., Keller, Y., Keustermans, J., & Kovnatsky, A., et al.
(2011). SHREC 2011: Robust feature detection and description
benchmark. In Eurographics Workshop on Shape Retrieval (pp. 79–
86). Llandudno, UK.
Bro, R., Acar, E., & Kolda, T. (2008). Resolving the sign ambiguity in
the singular value decomposition. Journal of Chemometrics, 22(2),
135–140.
Bronstein, A., Bronstein, M., Bustos, B., Castellani, U., Crisani, M.,
Falcidieno, B., Guibas, L., Kokkinos, I., Murino, V., & Ovsjanikov,
M., et al. (2010a). SHREC 2010: Robust feature detection and
description benchmark. In Eurographics Workshop on 3D Object
Retrieval (pp. 320–322). Kerkrade, The Netherlands.
Bronstein, A., Bronstein, M., Castellani, U., Falcidieno, B., Fusiello,
A., Godil, A., Guibas, L., Kokkinos, I., Lian, Z., & Ovsjanikov,
M., et al. (2010b). SHREC 2010: Robust large-scale shape retrieval
benchmark. In Eurographics Workshop on 3D Object Retrieval. Norrköping, Sweden.
Brown, M., & Lowe, D. (2003). Recognising panoramas. In 9th IEEE
International Conference on Computer Vision (pp. 1218–1225).
Nice, France.
Castellani, U., Cristani, M., Fantoni, S., & Murino, V. (2008). Sparse
points matching by combining 3D mesh saliency with statistical
descriptors. In S. Groeller (Ed.), In Computer Graphics Forum (pp.
643–652). Oxford: Blackwell.
Chen, H., & Bhanu, B. (2007). 3D free-form object recognition in
range images using local surface patches. Pattern Recognition Letters, 28(10), 1252–1262.
Chua, C., & Jarvis, R. (1997). Point signatures: A new representation
for 3D object recognition. International Journal of Computer Vision,
25(1), 63–85.
Curless, B., & Levoy, M. (1996). A volumetric method for building
complex models from range images. In 23rd Annual Conference on
Computer Graphics and Interactive Techniques (pp. 303–312). New
Orleans, LA.
Demi, M., Paterni, M., & Benassi, A. (2000). The first absolute central
moment in low-level image processing. Computer Vision and Image
Understanding, 80(1), 57–87.
Flint, A., Dick, A., & Van den Hengel, A. (2007). THRIFT: Local 3D structure
recognition. In 9th Conference on Digital Image Computing Techniques and Applications (pp. 182–188).
Flint, A., Dick, A., & Van den Hengel, A. (2008). Local 3D structure
recognition in range images. IET Computer Vision, 2(4), 208–217.
Frome, A., Huber, D., Kolluri, R., Bülow, T., & Malik, J. (2004). Recognizing objects in range data using regional point descriptors. In 8th
European Conference on Computer Vision (pp. 224–237). Prague,
Czech Republic.
Funkhouser, T., Min, P., Kazhdan, M., Chen, J., Halderman, A., Dobkin,
D., et al. (2003). A search engine for 3D models. ACM Transactions
on Graphics, 22(1), 83–105.
Guennebaud, G., & Gross, M. (2007). Algebraic point set surfaces.
ACM Transactions on Graphics, 26(3), 23.
Guo, Y., Bennamoun, M., Sohel, F., Wan, J., & Lu, M. (2013a). 3D
free form object recognition using rotational projection statistics. In
IEEE 14th Workshop on the Applications of Computer Vision (pp.
1–8). Clearwater, Florida.
Guo, Y., Sohel, F., Bennamoun, M., Wan, J., & Lu, M. (2013b). RoPS:
A local feature descriptor for 3D rigid objects based on rotational
projection statistics. In 1st International Conference on Communications, Signal Processing, and their Applications (pp. 1–6). Sharjah,
UAE.
Guo, Y., Wan, J., Lu, M., & Niu, W. (2013c). A parts-based method for
articulated target recognition in laser radar data. Optik. doi:10.1016/j.ijleo.2012.08.035.
Hetzel, G., Leibe, B., Levi, P., & Schiele, B. (2001). 3D object recognition from range images using local feature histograms. IEEE Conference on Computer Vision and Pattern Recognition, 2(II), 394.
Hou, T., & Qin, H. (2010). Efficient computation of scale-space features
for deformable shape correspondences. In European Conference on
Computer Vision (pp. 384–397). Heraklion, Greece.
Hu, M. (1962). Visual pattern recognition by moment invariants. IRE
Transactions on Information Theory, 8(2), 179–187.
Johnson, A., & Hebert, M. (1998). Surface matching for object recognition in complex three-dimensional scenes. Image and Vision Computing, 16(9–10), 635–651.
Johnson, A. E., & Hebert, M. (1999). Using spin images for efficient
object recognition in cluttered 3D scenes. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 21(5), 433–449.
Ke, Y., & Sukthankar, R. (2004). PCA-SIFT: A more distinctive representation for local image descriptors. IEEE Conference on Computer
Vision and Pattern Recognition, 2, 498–506.
Kokkinos, I., Bronstein, M., Litman, R., & Bronstein, A. (2012). Intrinsic shape context descriptors for deformable shapes. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 159–166).
Providence, RI.
Lei, Y., Bennamoun, M., & El-Sallam, A. (2013). An efficient 3D face
recognition approach based on the fusion of novel local low-level
features. Pattern Recognition, 46(1), 24–37.
Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Malassiotis, S., & Strintzis, M. (2007). Snapshots: A novel local surface
descriptor and matching algorithm for robust 3D surface alignment.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
29(7), 1285–1290.
Mamic, G., & Bennamoun, M. (2002). Representation and recognition
of 3D free-form objects. Digital Signal Processing, 12(1), 47–76.
Mian, A., Bennamoun, M., & Owens, R. (2006a). A novel representation
and feature matching algorithm for automatic pairwise registration
of range images. International Journal of Computer Vision, 66(1),
19–40.
Mian, A., Bennamoun, M., & Owens, R. (2006b). Three-dimensional
model-based object recognition and segmentation in cluttered
scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10), 1584–1601.
Mian, A., Bennamoun, M., & Owens, R. (2010). On the repeatability
and quality of keypoints for local feature-based 3D object retrieval
from cluttered scenes. International Journal of Computer Vision,
89(2), 348–361.
Mikolajczyk, K., & Schmid, C. (2004). Scale & affine invariant interest
point detectors. International Journal of Computer Vision, 60(1),
63–86.
Mikolajczyk, K., & Schmid, C. (2005). A performance evaluation
of local descriptors. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 27(10), 1615–1630.
Novatnack, J., & Nishino, K. (2008). Scale-dependent/invariant local
3D shape descriptors for fully automatic registration of multiple sets
of range images. In 10th European Conference on Computer Vision
(pp. 440–453). Marseille, France.
Ohbuchi, R., Osada, K., Furuya, T., & Banno, T. (2008). Salient local
visual features for shape-based 3D model retrieval. In IEEE International Conference on Shape Modeling and Applications (pp. 93–
102).
Osada, R., Funkhouser, T., Chazelle, B., & Dobkin, D. (2002). Shape
distributions. ACM Transactions on Graphics, 21(4), 807–832.
Paquet, E., Rioux, M., Murching, A., Naveen, T., & Tabatabai, A.
(2000). Description of shape information for 2-D and 3-D objects.
Signal Processing: Image Communication, 16(1), 103–122.
Petrelli, A., & Di Stefano, L. (2011). On the repeatability of the
local reference frame for partial shape matching. In IEEE International Conference on Computer Vision (pp. 2244–2251). Barcelona,
Spain.
Rodolà, E., Albarelli, A., Bergamasco, F., & Torsello, A. (2012). A scale
independent selection process for 3D object recognition in cluttered
scenes. International Journal of Computer Vision, 102, 129–145.
Shang, L., & Greenspan, M. (2010). Real-time object recognition in
sparse range images using error surface embedding. International
Journal of Computer Vision, 89(2), 211–228.
Shannon, C. (1948). A mathematical theory of communication. Bell
System Technical Journal, 27(3), 379–423.
Stein, F., & Medioni, G. (1992). Structural indexing: Efficient 3D object
recognition. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 14(2), 125–145.
Sun, Y., & Abidi, M. (2001). Surface matching by 3D point’s fingerprint.
In B. Buxton & R. Cipolla (Eds.), 8th IEEE International Conference on Computer Vision (pp. 263–269). Piscataway: Institute of
Electrical and Electronics Engineers Inc.
Taati, B., Bondy, M., Jasiobedzki, P., & Greenspan, M. (2007). Variable
dimensional local shape descriptors for object recognition in range
data. In 11th IEEE International Conference on Computer Vision
(pp. 1–8). Rio de Janeiro, Brazil.
Taati, B., & Greenspan, M. (2011). Local shape descriptor selection
for object recognition in range data. Computer Vision and Image
Understanding, 115(5), 681–694.
Tombari, F., Salti, S., & Di Stefano, L. (2010). Unique signatures of
histograms for local surface description. In European Conference on
Computer Vision (pp. 356–369). Crete, Greece.
Tombari, F., Salti, S., & Di Stefano, L. (2013). Performance evaluation
of 3D keypoint detectors. International Journal of Computer Vision,
102, 198–220.
Yamany, S., & Farag, A. (2002). Surface signatures: An orientation
independent free-form surface representation scheme for the purpose
of objects registration and matching. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 24(8), 1105–1120.
Yamauchi, H., Saleem, W., Yoshizawa, S., Karni, Z., Belyaev, A., & Seidel, H. (2006). Towards stable and salient multi-view representation
of 3D shapes. In IEEE International Conference on Shape Modeling
and Applications (pp. 40–46). Matsushima, Japan.
Zaharescu, A., Boyer, E., & Horaud, R. (2012). Keypoints and local
descriptors of scalar functions on 2D manifolds. International Journal of Computer Vision, 100, 78–98.
Zhong, Y. (2009). Intrinsic shape signatures: A shape descriptor for 3D
object recognition. In IEEE International Conference on Computer
Vision Workshops (pp. 689–696). Kyoto, Japan.