Review
Review: Deep Learning on 3D Point Clouds
Saifullahi Aminu Bello 1,2 , Shangshu Yu 1 , Cheng Wang 1, * , Jibril Muhmmad Adam 1
and Jonathan Li 1,3
1 Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Xiamen University,
422 Siming South Road, Xiamen 361005, China; 23020170155983@stu.xmu.edu.cn (S.A.B.);
23020180155671@stu.xmu.edu.cn (S.Y.); 23020170155980@stu.xmu.edu.cn (J.M.A.); junli@uwaterloo.ca (J.L.)
2 Department of Computer Science, Kano University of Science and Technology, Wudil,
P.M.B 3244 Kano State, Nigeria
3 Department of Geography and Environmental Management, University of Waterloo, 200 University Avenue,
Waterloo, ON N2L 3G1, Canada
* Correspondence: cwang@xmu.edu.cn
Received: 21 April 2020; Accepted: 20 May 2020; Published: 28 May 2020
Abstract: A point cloud is a set of points defined in a 3D metric space. Point clouds have become one
of the most significant data formats for 3D representation and are gaining popularity as a result of the increasing availability of acquisition devices, as well as increasing application in areas such as robotics, autonomous driving, and augmented and virtual reality. Deep learning
is now the most powerful tool for data processing in computer vision and is becoming the most
preferred technique for tasks such as classification, segmentation, and detection. While deep learning techniques are mainly applied to data with a structured grid, point clouds are unstructured, and this lack of structure makes their direct processing with deep learning very challenging. This paper reviews recent state-of-the-art deep learning
techniques, mainly focusing on raw point cloud data. The initial work on deep learning directly with
raw point cloud data did not model local regions; therefore, subsequent approaches model local
regions through sampling and grouping. More recently, several approaches have been proposed
that not only model the local regions but also explore the correlation between points in the local
regions. From the survey, we conclude that approaches that model local regions and take into account
the correlation between points in the local regions perform better. Contrary to existing reviews,
this paper provides a general structure for learning with raw point clouds, and various methods
were compared based on the general structure. This work also introduces the popular 3D point
cloud benchmark datasets and discusses the application of deep learning in popular 3D vision tasks,
including classification, segmentation, and detection.
Keywords: point cloud; deep learning; datasets; classification; segmentation; object detection
1. Introduction
We live in a three-dimensional world; however, since the invention of the camera, visual
information of the 3D world has been projected onto 2D images. Two-dimensional images, however, lose depth information and the relative positions between objects in the real world. These
factors make 2D images less suitable for applications that require depth and positioning information
such as robotics, autonomous driving, virtual reality, and augmented reality, among others [1–3].
To capture the 3D world with depth information, the early convention was to use stereo vision, where
two or more calibrated digital cameras are used to extract 3D information [4,5]. A point cloud is a data structure often used to represent 3D geometry; it is the immediate representation of the 3D information extracted from stereo vision cameras [6,7] as well as of the depth maps produced by RGB-D cameras. Recently, 3D point cloud data have become popular as a result of the increasing availability
of sensing devices, especially light detection and ranging (LiDAR)-based devices such as Tele-15 [8],
Leica BLK360 [9], Kinect V2 [10], etc., and, more recently, mobile phones with a time-of-flight (ToF)
depth camera. These sensing devices allow the easy acquisition of the 3D world in 3D point clouds.
A point cloud is simply a set of data points in space. The point cloud of a scene is the set
of 3D points around the surfaces of the objects in the scene. In its simplest form, 3D point cloud
data are represented by the XYZ coordinates of the points, or these may include additional features
such as surface normals and RGB values. Point cloud data represent a very convenient format for
representing the 3D world. Point clouds are commonly used as a data format in several disciplines
such as geomatics/surveying (mobile mapping); architecture, engineering, and construction (AEC);
and Building Information Modelling (BIM) [11–13]. Point clouds have a range of applications in
different areas such as robotics [14], autonomous driving [15], augmented and virtual reality [16],
manufacturing and building rendering [17], etc.
In the past, the processing of point clouds for visual intelligence has been based on handcrafted
features [18–23]. A review of handcrafted-based feature learning techniques is conducted in [24].
Handcrafted features do not require large amounts of training data and were favored when point cloud data were scarce and deep learning was not yet popular. However, with the increasing
availability of acquisition devices, point cloud data are now readily available, making the use of deep
learning for its processing feasible.
Deep learning is a machine learning approach based on artificial neural networks designed
to mimic the human brain [25]. Deep learning models are used to learn feature representations of
data through multiple processing layers that learn multiple levels of abstraction [26]. Deep learning
models are used in several areas, including computer vision, speech recognition, natural language
processing, etc. They have achieved state-of-the-art results comparable to—and in some instances
surpassing—human expert performance [27–29].
In the computer vision field, deep learning has achieved notable success with 2D data [27,30–33].
However, the application of deep learning on 3D point clouds is not easy due to the inherent nature of
the point clouds. In this paper, the challenges of using point clouds for deep learning are presented.
This paper reviews the early approaches devised to overcome these challenges, and the recent
state-of-the-art approaches that directly operate on the point clouds, focusing more on the latter.
This paper is intended to serve as a guide to new researchers in the field of deep learning with point
clouds as it presents the recent state-of-the-art approaches of deep learning with point cloud data.
In contrast to existing reviews [34–36], this paper’s focus is mainly on point cloud data; it gives a
general structure for learning with raw point clouds, and various methods are compared based on the
general structure. Popular point cloud benchmarked datasets are also introduced and summarized in
tabular form for easy analysis.
The rest of the paper is organized as follows: Section 2 discusses the methodology used. Section 3
discusses the challenges of point clouds which make the application of deep learning more difficult.
Section 4 reviews the methods for overcoming these challenges by converting the point clouds into a
structured grid. Section 5 contains in-depth information regarding the various deep learning methods
that process point clouds directly. In Section 6, 3D point clouds benchmark datasets are presented.
The application of the various approaches in the 3D vision tasks is discussed in Section 7. The summary
and conclusion of the paper are given in Section 8.
2. Methodology
Articles reviewed in this paper were all published between 2015 and 2020. The article is mainly
focused on point cloud data; however, it includes a brief review of other approaches based on structured
3D data. The article includes the first works that use deep learning on voxel-based and multiview 3D
representation, which were published in 2015 and 2016, respectively. It also includes a few highly cited
works on the two representations.
Deep learning with raw point clouds was pioneered by PointNet, published in 2017. The works
reviewed in this category were published from 2017 to 2020. We have mainly searched for the
relevant papers using the major conference repositories such as Conference on Computer Vision
and Pattern Recognition (CVPR), International Conference on Computer Vision (ICCV), European
Conference on Computer Vision (ECCV), Association for the Advancement of Artificial Intelligence
(AAAI) Conference, International Conference on Learning Representations (ICLR) as well as Google
Scholar. Many benchmarked datasets have an online leaderboard; we also consider leading works
from these leaderboards.
The datasets selected in this paper were all published after 2010, and they mainly target common computer vision tasks. The data are all tagged with ground truth (GT) labels. Tabular details are provided for easy understanding of the datasets.
The methods reviewed are organized according to Figure 1. Performances of these methods on
three popular computer vision tasks are reported in Section 7.
The order in which the points are stored does not change the scene represented; therefore, a point cloud is invariant to permutation [42]. For illustration purposes, the unordered nature of point sets is shown in Figure 2c.
These properties of point clouds are very challenging for deep learning, especially convolutional
neural networks (CNN). This is because CNNs are based on convolution operation, which is performed
on data that are ordered, regular, and on a structured grid. Early approaches overcome these challenges
by converting the point clouds into a structured grid format, as shown in Section 4. However,
researchers have recently been developing approaches that directly use the power of deep learning for
the raw point cloud, without the need for conversion to a structured grid; see Section 5.
Figure 2. Properties of point clouds. (a) Irregular: sparse and dense regions. (b) Unstructured: no grid; each point is independent, and the distance between neighboring points is not fixed. (c) Unordered: as a set, point clouds are invariant to permutation.
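As a concrete illustration of the unordered property, the following numpy sketch (with hypothetical point values) shows that a symmetric aggregation such as max pooling yields the same result regardless of the stored order of the points, whereas an order-dependent operation does not:

```python
import numpy as np

# A toy point cloud of 5 points with XYZ coordinates (hypothetical values).
points = np.array([[0.1, 0.2, 0.3],
                   [0.5, 0.1, 0.9],
                   [0.4, 0.7, 0.2],
                   [0.9, 0.3, 0.6],
                   [0.2, 0.8, 0.5]])

# Shuffle the storage order of the points; the underlying scene is unchanged.
rng = np.random.default_rng(0)
shuffled = points[rng.permutation(len(points))]

# A symmetric (order-independent) aggregation gives identical descriptors...
assert np.allclose(points.max(axis=0), shuffled.max(axis=0))

# ...whereas an order-dependent operation (e.g., differences between consecutively
# stored points, as a grid convolution implicitly assumes) generally does not.
print(np.allclose(np.diff(points, axis=0), np.diff(shuffled, axis=0)))  # typically False
```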
The online operation is the learning stage. In this stage, the deep convolutional neural network is designed, usually using varying numbers of 3D convolutional, pooling, and fully connected layers.
In 3D ShapeNets [48], 3D shapes are represented as a probability distribution of binary variables on
a 3D voxel grid; this technique was the first to use 3D Deep Convolutional Neural Networks. The inputs
to the network—point clouds, computer-aided design (CAD) models, or RGB-D images—are converted
into a 3D binary voxel grid and are processed using a convolutional deep belief network [49].
A three-dimensional CNN is used for landing zone detection for unmanned rotorcraft in [43]. A LiDAR mounted on the rotorcraft is used to obtain point clouds of the landing site, which are then voxelized into 3D
volumes, and a 3D CNN binary classifier is applied to classify the landing site as safe or otherwise.
In VoxNet [44], a 3D convolutional neural network for object recognition is proposed. As with 3D
ShapeNets [48], the input to VoxNet is converted into a 3D binary occupancy grid before applying 3D
convolution operations to generate a feature vector which is passed through fully connected layers to
obtain class scores. Two voxel-based models were proposed by Qi et al. [45]: the first model addressed
overfitting using auxiliary training tasks to predict objects from partial subvolumes, while the second
model mimicked multi-view CNNs by convolving the 3D shapes with an anisotropic probing kernel.
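To make the voxel-based pipeline concrete, the sketch below, assuming an N × 3 numpy array of points, converts a point cloud into the kind of binary occupancy grid consumed by networks such as VoxNet; the 3D CNN itself is omitted, and the normalization choice is illustrative:

```python
import numpy as np

def voxelize(points: np.ndarray, resolution: int = 32) -> np.ndarray:
    """Convert an (N, 3) point cloud into a binary occupancy grid.

    A minimal sketch of the preprocessing used by voxel-based networks;
    normalization choices vary between implementations.
    """
    # Normalize the cloud into the unit cube [0, 1]^3.
    mins = points.min(axis=0)
    extent = points.max(axis=0) - mins
    normalized = (points - mins) / (extent.max() + 1e-9)

    # Map each point to a voxel index and mark that voxel as occupied.
    indices = np.clip((normalized * resolution).astype(int), 0, resolution - 1)
    grid = np.zeros((resolution, resolution, resolution), dtype=np.float32)
    grid[indices[:, 0], indices[:, 1], indices[:, 2]] = 1.0
    return grid

# Example: a random cloud becomes a sparse 32 x 32 x 32 occupancy grid.
occupancy = voxelize(np.random.rand(2048, 3))
print(occupancy.shape, occupancy.sum())  # (32, 32, 32), number of occupied voxels
```

Most of the voxels in such a grid are empty, which is exactly the sparsity issue discussed next.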
Although voxel-based methods have shown good performance, they suffer from high memory
consumption due to the sparsity of the voxels, as shown in Figure 4. The voxel sparsity results in
wasted computation when convolving over the non-occupied regions. The memory consumption also
limits the voxel resolution, usually to between 32³ and 64³. These drawbacks come in addition to the artifacts introduced by the voxelization operation itself.
To overcome the challenges of memory consumption and voxelization, [50,51] proposed
adaptive representation by using unbalanced octrees which focus on the relevant dense regions.
This representation is more complex than the regular 3D voxels, but it is still limited to a resolution of only 256³ voxels.
Figure 5. Multi-view projection of a point cloud to 2D images. Each 2D image represents the same
object viewed from a different angle.
MultiviewCNN [52] was the first approach in this direction. The proposed network bypassed the
need for 3D descriptors for recognition and achieved state-of-the-art accuracy. Leng et al. [53] proposed
a stacked local convolutional autoencoder (SLCAE) for 3D object retrieval. Multi-resolution filtering,
which captures information at multiple scales, was introduced by Qi et al. [45]; in addition, the authors used data augmentation for better generalization. RotationNet [58] uses rotation to select the best viewpoint that maximizes the class likelihood; it led the ModelNet40 [48] leaderboard at the time of
this review.
Multi-view based networks have better performance than voxel-based methods; this is because
(1) they use 2D techniques which have already been well researched and (2) they can contain richer
information as they do not have the quantization artifacts of voxelization.
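As a rough illustration of the multi-view idea, and not the exact rendering pipeline of [52], the sketch below orthographically projects a point cloud into depth images from several azimuth angles; each image could then be processed by a standard 2D CNN:

```python
import numpy as np

def depth_views(points: np.ndarray, n_views: int = 12, size: int = 64) -> np.ndarray:
    """Project an (N, 3) cloud into `n_views` orthographic depth images.

    A simplified stand-in for the rendered views used by multi-view CNNs;
    real pipelines typically render shaded meshes instead of raw points.
    """
    views = np.zeros((n_views, size, size), dtype=np.float32)
    centered = points - points.mean(axis=0)
    centered /= (np.abs(centered).max() + 1e-9)           # fit the cloud into [-1, 1]^3
    for v, angle in enumerate(np.linspace(0, 2 * np.pi, n_views, endpoint=False)):
        c, s = np.cos(angle), np.sin(angle)
        rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # rotate about z
        p = centered @ rot.T
        u = np.clip(((p[:, 0] + 1) / 2 * (size - 1)).astype(int), 0, size - 1)
        w = np.clip(((p[:, 2] + 1) / 2 * (size - 1)).astype(int), 0, size - 1)
        depth = p[:, 1] + 1.0                              # depth along the viewing axis
        np.maximum.at(views[v], (w, u), depth)             # keep the largest depth per pixel
    return views

imgs = depth_views(np.random.rand(4096, 3))
print(imgs.shape)  # (12, 64, 64)
```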
Lattice-based approaches [59,60] instead map the point cloud onto a regular lattice, such as a spherical lattice. Compared to voxel-based and multi-view approaches, they have better performance in terms of segmentation, with SplatNet achieving state-of-the-art accuracy on semantic segmentation. They are also better than the voxel-based approaches in terms of classification.
5.1. PointNet
Convolutional neural networks use convolutional layers to learn hierarchical feature representations
as the network deepens [27]. Convolutional layers rely on the convolution operation, which requires a structured grid,
which is lacking in point cloud data. PointNet [42] was the first method to apply deep learning to an
unstructured point cloud, and it formed the basis from which most other techniques were developed.
The architecture of PointNet is shown in Figure 6. The input to PointNet is a raw point cloud
P ∈ R^{N×D}, where N represents the number of points in the point cloud and D the dimension.
Usually, D = 3, representing the XYZ values of each point; however, additional features can be used.
Because points are unordered, PointNet is built on two basic functions: multilayer perceptron (MLP),
with learnable parameters, and a maxpooling function. The MLPs are feature transformations that
transform the feature dimension of the points from a D = 3 dimensional space to a D = 1024 dimensional space, and their
parameters are shared by all the points in each layer. To obtain a global feature, the maxpooling
function is used as a symmetric function. A symmetric function is a function whose output is the same
irrespective of the input order. The maxpooling produces one global 1024-dimensional feature vector.
The feature vector represents the feature descriptor of the input, which can be used for recognition and
segmentation tasks.
Figure 6. Architecture of PointNet [42]. PointNet is composed of multilayer perceptrons (MLPs), which
are shared point-wise, and two spatial transformer networks (STN) of 3 × 3 and 64 × 64 dimensions
which learn the canonical representation of the input set. The global feature is obtained with a
winner-takes-all principle and can be used for classification and segmentation tasks.
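A minimal PyTorch sketch of this shared-MLP-plus-maxpooling design is given below; the spatial transformer networks of Figure 6 are omitted, and the 64/128/1024 layer widths follow the architecture described above:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Simplified PointNet encoder: shared point-wise MLPs + max pooling.

    Input:  (B, N, 3) point clouds.  Output: (B, 1024) global features.
    The input/feature transform networks (STNs) are omitted in this sketch.
    """
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(          # shared across all points (1x1 convolutions)
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        x = self.mlp(points.transpose(1, 2))   # (B, 1024, N) point-wise features
        return x.max(dim=2).values             # symmetric max pooling -> (B, 1024)

features = TinyPointNet()(torch.rand(8, 1024, 3))
print(features.shape)  # torch.Size([8, 1024])
```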
Figure 7. Basic operations for capturing local structures in a point cloud. Given P ∈ R^{N×(3+c)} points, each point is represented by XYZ coordinates and c feature channels (for input points, c can be point features such as normals, RGB, etc., or zero). M ≤ N centroid points are sampled from the N points, and the k-nearest neighbor (kNN) points to each of the centroids are selected to form M groups. Each group represents a local region (receptive field). A non-linear function, usually approximated by a PointNet-based MLP, is then applied to each local region to learn the C-dimensional local region features (C > c).
Sampling is employed to reduce the resolution of points across layers in the same way that
the convolution operation reduces the resolution of feature maps via convolutional and pooling
layers. Given a point cloud P ∈ R^{N×3} of N points, the sampling reduces it to M points P̂ ∈ R^{M×3},
where M ≤ N. The subsampled M points, also referred to as representative points or centroids,
are used to represent the local region from which they were sampled. Two approaches are popular for
subsampling: (1) random point sampling, where each of the N points is equally likely to be sampled;
and (2) farthest point sampling (FPS), where the M points are sampled iteratively such that each newly sampled point is the most distant from the already sampled points. Other sampling methods include uniform
sampling and Gumbel subset sampling [63].
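A simple sketch of farthest point sampling, the most common of these strategies, is given below (an O(N × M) numpy implementation written for clarity rather than speed):

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, m: int) -> np.ndarray:
    """Return indices of `m` centroids chosen by farthest point sampling (FPS).

    Each new centroid is the point farthest from all previously chosen ones.
    """
    n = points.shape[0]
    chosen = np.zeros(m, dtype=int)
    dist = np.full(n, np.inf)
    chosen[0] = np.random.randint(n)            # start from a random seed point
    for i in range(1, m):
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)              # distance to the nearest chosen centroid
        chosen[i] = int(dist.argmax())          # pick the most distant remaining point
    return chosen

cloud = np.random.rand(4096, 3)
centroids = cloud[farthest_point_sampling(cloud, 512)]   # (512, 3) representative points
```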
As regards the grouping operation, once the representative points are sampled, the k-nearest neighbor (kNN) algorithm is used to select the points nearest to each representative point and group them into a local patch; see Figure 8. The points in a local patch are used to compute the local feature representation of the neighborhood, analogous to the way the receptive field in grid convolution corresponds to the pixels of the feature map covered by a kernel. The kNN is either used directly, where the k nearest points to a centroid are selected, or a ball query is used, in which points are selected only when they are within a certain radius of the centroid points.
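The two grouping strategies can be sketched as follows; both assume numpy arrays of points and centroids, and the ball query pads groups with the nearest neighbor, a common but not universal convention:

```python
import numpy as np

def knn_group(points: np.ndarray, centroids: np.ndarray, k: int) -> np.ndarray:
    """Group the k nearest neighbors of each centroid: returns an (M, k) index array."""
    d = np.linalg.norm(points[None, :, :] - centroids[:, None, :], axis=-1)  # (M, N) distances
    return np.argsort(d, axis=1)[:, :k]

def ball_query(points: np.ndarray, centroids: np.ndarray, radius: float, k: int) -> np.ndarray:
    """Group up to k neighbors within `radius` of each centroid.

    Neighbors outside the radius are replaced by the centroid's nearest neighbor,
    mirroring the padding behavior commonly used with ball queries.
    """
    d = np.linalg.norm(points[None, :, :] - centroids[:, None, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]
    sorted_d = np.take_along_axis(d, idx, axis=1)
    outside = sorted_d > radius
    idx[outside] = idx[:, :1].repeat(k, axis=1)[outside]
    return idx
```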
Regarding the non-linear mapping function, once the nearest points to each representative point
are obtained, the next step is to map them into a feature vector that represents the local structure. In grid
convolution, the receptive field is mapped into a feature neuron using simple matrix multiplication
and summation with convolutional kernels. This is not easy in point clouds, because the points
are not structured; therefore, most approaches approximate the function using a PointNet-based
method [42] which is composed of multilayer perceptrons, h(·), and a maxpooling symmetric function,
g(·), as shown in Equation (1).
$f(\{x_1, \ldots, x_k\}) \approx g(h(x_1), \ldots, h(x_k))$ (1)
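In code, Equation (1) amounts to applying a shared transformation h(·) to every neighbor and max pooling the results within each group; a minimal numpy sketch, with a single random linear layer standing in for the learned MLP, is:

```python
import numpy as np

def local_features(points, group_idx, weights):
    """Approximate f({x_1..x_k}) = g(h(x_1), ..., h(x_k)) for every local region.

    points:    (N, 3) input cloud
    group_idx: (M, k) neighbor indices from sampling + grouping
    weights:   (3, C) parameters of a single linear layer standing in for h(.)
    returns:   (M, C) local region features
    """
    neighbors = points[group_idx]                 # (M, k, 3) grouped points
    h = np.maximum(neighbors @ weights, 0.0)      # shared "MLP" h(.) with ReLU
    return h.max(axis=1)                          # symmetric max pooling g(.)

cloud = np.random.rand(4096, 3)
idx = np.random.randint(0, 4096, size=(512, 32))              # placeholder grouping result
feats = local_features(cloud, idx, np.random.randn(3, 64))    # (512, 64) local features
```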
Figure 8. Sampling and grouping of points into local patches. The red dots are the centroid points
selected using sampling algorithms, and the grouping shown is a ball query in which points are selected
based on a certain radius distance to the centroid.
In SO-Net [67], a self-organizing map (SOM) is built over the input points; for each SOM node, the kNN search is used to find the nearest-neighbor points, which are passed through a series of fully connected layers to extract point features. The point features are maxpooled to generate M node features. To obtain the global features of the input point cloud, the M node features are also aggregated using maxpooling.
Pointwise convolution was proposed by Hua et al. [68]. In this technique, there are no subsampled/representative points, because the convolution operation is performed on all the input points. For each point, nearest-neighbor points are sampled based on the size or radius of a kernel centered on the point. The radius can be adjusted to obtain different numbers of neighbor points in any layer. Four pointwise
convolutions are applied independently on the input, and each transforms the input points from
three-dimensional to nine-dimensional. The final feature is obtained by concatenating the output of
the four pointwise convolutions for each point, lifting the points from 3D to 36D. The final feature has
the same resolution as the input point clouds and can be used for segmentation using a convolution
layer or classification task using fully connected layers.
3DPointCapsNet [69] proposed an approach that does not consider the local correlation between
points, but region correlation is achieved using the novel dynamic routing procedure proposed
by Sabour et al. [70]. The authors used 16 PointNet-like MLPs with maxpooling; each of the 16 outputs is used as a primary capsule for the dynamic routing procedure, which produces a 64 × 64 latent capsule that serves as the feature representation. The dynamic routing causes the outputs of the 16 PointNet-like MLPs to target 16 different regions of the input shape.
between the hierarchy of receptive fields and captures the relationship between points within the
receptive field.
A PointNet-like MLP is a popular function for mapping the points in a local patch into a feature vector; however, SpiderCNN [76] argues that an MLP does not account for the prior geometry of point clouds and requires a sufficiently large number of parameters. To address these issues, the authors proposed a family of filters composed of two functions: a step function that encodes local geodesic information, followed by a third-order Taylor expansion. The approach learns hierarchical
representations and achieves state-of-the-art performance in classification and segmentation tasks.
Point attention transformers (PAT) were proposed by Yang et al. [63]. They proposed a
new subsampling method termed “Gumbel subset sampling (GSS)”, which, unlike farthest point
sampling (FPS), is permutation-invariant and is robust to outliers. They used absolute and relative
position embedding, where each point is represented by a set of its absolute position and relative
position to other points in its neighborhood; PointNet is then applied to the set, and to further
capture the relationship between points, a modified multi-head attention (MHA) mechanism is used.
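For intuition, the sketch below applies standard multi-head self-attention from PyTorch across a set of point features, so that each point's feature is updated from all others; PAT's modified MHA differs in its details:

```python
import torch
import torch.nn as nn

point_feats = torch.rand(2, 256, 64)        # (batch, points, feature dim) - hypothetical features
mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
attended, _ = mha(point_feats, point_feats, point_feats)   # self-attention over the point set
print(attended.shape)                       # torch.Size([2, 256, 64])
```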
New sampling and grouping techniques with learnable parameters were proposed by Liu et al. [77] in a module termed the dynamic points agglomeration module (DPAM), which learns an agglomeration matrix that, when multiplied with the incoming point features, reduces the resolution (similar to sampling) and produces an aggregated feature (similar to grouping and pooling).
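Conceptually, this agglomeration can be sketched as a learned soft-assignment matrix multiplied with the incoming features; the module below is a simplified stand-in for DPAM, not its actual implementation:

```python
import torch
import torch.nn as nn

class Agglomerate(nn.Module):
    """Reduce (B, N, C) point features to (B, M, C) via a learned soft assignment.

    Each of the M output nodes is a learned weighted combination of the N input
    point features, playing the role of sampling + grouping + pooling at once.
    """
    def __init__(self, in_channels: int, m_out: int):
        super().__init__()
        self.score = nn.Linear(in_channels, m_out)   # predicts per-point assignment scores

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        a = torch.softmax(self.score(feats), dim=1)  # (B, N, M) agglomeration matrix
        return a.transpose(1, 2) @ feats             # (B, M, C) aggregated features

out = Agglomerate(64, 128)(torch.rand(4, 1024, 64))
print(out.shape)  # torch.Size([4, 128, 64])
```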
Point2Node [81] explores three levels of correlation: self correlation, which explores the channel-wise correlation of a node's feature; local correlation, which explores the local dependency among nodes in a local region; and non-local correlation, which is used to capture better global features by considering long-range local features.
5.3. Summary
Table 1 summarizes the approaches, showing their sampling, grouping, and mapping functions.
The methods employ local region computation based on sampling and grouping; PointNet is an
exception. In [68,74,79], the authors do not use the sampling technique, and as such, these methods are
more computationally intensive. To improve discriminative ability, several methods have exploited the
correlation between points in a local region. By default, graph-based methods capture the correlation
between the points using edges. Point2Node [81] exploits not only the local correlation but also the non-local correlation between points and has better performance in terms of semantic segmentation.
The performance of the discussed methods in classification, segmentation, and object detection applications is shown in Section 7.
Table 1. Summary of methods showing the sampling, grouping, and mapping functions used. CNN:
convolutional neural network; DGCNN: dynamic graph CNN; SOM: self-organizing map; k-NN:
k-nearest neighbor; MLP: multi-layer perceptron.
6. Benchmark Datasets
A considerable number of point cloud datasets have been published in recent years. Most of the
existing datasets are provided by universities and industries and can provide a fair comparison for
testing diverse approaches. These public benchmark datasets consist of virtual scenes or real scenes,
which focus particularly on point cloud classification, segmentation, registration, and object detection.
They are particularly useful in deep learning since they can provide huge amounts of ground truth
labels to train the network. The point clouds are obtained by different platforms/methods, such as
structure from motion (SfM), red green blue–depth (RGB-D) cameras, and light detection and ranging
(LiDAR) systems. The availability of benchmark datasets usually decreases as the size and complexity
increases. In this section, some popular datasets for 3D research are introduced. The datasets are also
summarized in Tables 2 and 3 for easy analysis.
6.1.1. ModelNet
This dataset was developed by the Princeton Vision & Robotics Labs [48]. ModelNet40 has
40 man-made object categories (such as an airplane, bookshelf and chair) for shape classification and
recognition. It consists of 12,311 CAD models, which are split into 9.84 × 103 training and 2.47 × 103
testing shapes. The ModelNet10 dataset is a subset of ModelNet40 that only contains 10 categories of
classes; it is also divided into 3.99 × 10³ training and 908 testing shapes.
6.1.2. ShapeNet
This large-scale dataset was developed by Stanford University et al. [87]. It provides semantic category labels for models, rigid alignments, parts and bilateral symmetry planes, physical sizes, and keywords, as well as other planned annotations. ShapeNet indexed almost 3.0 × 10⁶ models when the dataset was published, and 2.20 × 10⁵ models have been classified into 3.14 × 10³ categories. ShapeNetCore is a subset of ShapeNet, which consists of nearly 5.13 × 10⁴ unique 3D models. It provides 55 common object categories and annotations. ShapeNetSem is also a subset of ShapeNet, which contains 1.2 × 10⁴ models. It is smaller but covers more extensive categories, amounting to a total of 270.
6.1.4. Shape2Motion
Shape2Motion [94] was developed by Beihang University and the National University of Defense Technology. It is a new benchmark dataset for 3D shape mobility analysis. The benchmark consists of 45 shape categories with 2.44 × 10³ models; the shapes are obtained from ShapeNet and 3D Warehouse [95]. The proposed approach takes a single 3D shape as input and jointly predicts motion part segmentation results and the corresponding motion attributes.
6.1.5. ScanObjectNN
ScanObjectNN [96] was developed by Hong Kong University of Science and Technology et al.
It is the first real-world dataset for point cloud classification. About 1.50 × 10⁴ objects are selected from indoor datasets (SceneNN [97] and ScanNet [64]), and the objects are split into 15 categories with 2.9 × 10³ unique object instances.
6.2.1. NYUDv2
The New York University Depth Dataset v2 (NYUDv2) [98] was developed by New York
University et al. The dataset provides 1.45 × 10³ RGB-D images (obtained by Kinect v1 [99]) captured from 464 various indoor scenes. All of the images are provided with segmentation labels. This dataset is mainly used to understand how 3D cues can lead to better segmentation for indoor objects.
6.2.2. SUN3D
This dataset was developed by Princeton University [100]. It is an RGB-D video dataset in which the
videos were captured from 254 different spaces in 41 buildings. SUN3D provides 415 sequences with
camera poses and object labels. The point cloud data are generated by structure from motion (SfM).
6.2.3. S3DIS
Stanford 3D Large-Scale Indoor Spaces (S3DIS) [101] was developed by Stanford University et al.
S3DIS was collected from three different buildings with 271 rooms, covering an area of more than 6.00 × 10³ m². It contains over 2.15 × 10⁸ points, and each point is provided with an instance-level semantic segmentation label (13 categories).
6.2.4. SceneNN
Singapore University of Technology and Design et al. developed this dataset [97]. SceneNN is an RGB-D (obtained by Kinect v2 [102,103]) scene dataset collected from 101 indoor scenes. It provides 40 semantic classes for the indoor scenes, and all semantic labels are the same as in the NYUDv2 dataset.
6.2.5. ScanNet
ScanNet [64] is a large-scale indoor dataset developed by Stanford University et al. It contains 1.51 × 10³ scanned scenes, including nearly 2.5 × 10⁶ RGB-D images (obtained by an Occipital Structure sensor) from 707 different indoor environments. The dataset provides ground truth labels for 3D object classification with 17 categories and semantic segmentation with 20 categories. For object classification, ScanNet divides all instances into 9677 instances for training and 2.61 × 10³ instances for testing, and it splits all scans into 1201 training scenes and 312 testing scenes for semantic segmentation.
6.2.6. Matterport3D
Matterport3D [104] is the largest indoor dataset and was developed by Princeton University et al.
The covered area of this dataset is 2.19 × 10⁵ m² from 2.06 × 10³ rooms, and there is 4.66 × 10⁴ m² of floor space. It consists of 1.08 × 10⁴ panoramic views; the views are from 1.94 × 10⁵ RGB-D images
of 90 large-scale buildings. The labels contain surface reconstructions, camera poses, and semantic
segmentation. This dataset investigates five tasks for scene understanding: keypoint matching, view
overlap prediction, surface normal estimation, region-type classification, and semantic segmentation.
6.2.7. 3DMatch
This benchmark dataset was developed by Princeton University et al. [105]. It is a large collection
of existing datasets, such as Analysis-by-Synthesis [106], 7-Scenes [107], SUN3D [100], RGB-D Scenes
v.2 [108] and Halber et al. [109]. The 3DMatch benchmark consists of 62 scenes with 54 training scenes
and eight testing scenes. It leverages correspondence labels from RGB-D scene reconstruction datasets,
and then provides ground truth labels for point cloud registration.
6.3.1. KITTI
The KITTI dataset [111,112] is one of the best known in the field of autonomous driving and was
developed by Karlsruhe Institute of Technology et al. It can be used for the research of stereo images,
optical flow estimation, 3D detection, 3D tracking, visual odometry, and so on. The data acquisition
platform is equipped with two color cameras, two grayscale cameras, a Velodyne HDL-64E [113,114]
3D laser scanner, and a high-precision GPS/IMU system. KITTI provides raw data with five categories:
road, city, residential, campus and person. The depth completion and prediction benchmarks consist of more than 93,000 depth maps. The 3D object detection benchmark contains 7.48 × 10³ training point clouds and 7.51 × 10³ testing point clouds. The visual odometry benchmark is formed by 22 sequences, with 11 sequences (00–10) of LiDAR data for training and 11 sequences (11–21) of LiDAR
data for testing. Meanwhile, semantic labeling [115] for the KITTI odometry dataset has recently been
published; SemanticKITTI contains 28 classes including ground, structure, vehicle, nature, human,
object, and others.
6.3.3. iQmulus
This large-scale urban scene dataset was developed by Mines ParisTech et al. in January 2013 [119]. All of the 3D point cloud data have been classified and segmented into 50 classes. The data were collected by StereopolisII MLS, a system developed by the French National Mapping Agency (IGN), using a Riegl LMS-Q120i sensor [120] to acquire 3.00 × 10⁸ points.
6.3.5. NCLT
This dataset was developed by the University of Michigan [123]. It contains 27 trajectories through the University of Michigan's North Campus collected between January 2012 and April 2013. The dataset also provides images, LiDAR, GPS and INS ground truth for long-term autonomous vehicle research. The LiDAR
point clouds were collected by a Velodyne HDL-32E LiDAR [124,125] scanner.
6.3.6. Semantic3D
This high-quality, high-density dataset was developed by ETH Zurich [126]. It contains more than four billion points, where the point clouds were acquired by static terrestrial laser scanners. Eight semantic classes are provided: man-made terrain, natural terrain, high vegetation, low vegetation, buildings, hardscape, scanning artifacts, and cars. The dataset is split into 15 training scenes and
15 testing scenes.
6.3.7. DBNet
This real-world LiDAR-video dataset was developed by Xiamen University et al. [127]. It aims at
learning driving policy; in this respect, it is different from previous outdoor datasets. DBNet provides
LiDAR point clouds, video recordings, GPS and driver behaviors for driving behavior study. It contains point cloud data from 1.00 × 10³ km of driving captured by a Velodyne laser scanner.
6.3.8. NPM3D
The Nuage de Points et Modélisation 3D (NPM3D) dataset was developed by PSL Research
University [128]. It is a benchmark for point cloud classification and segmentation, and all point clouds
are labeled into 50 different classes. It contains 1.431 × 10⁹ points collected in Paris and Lille. The data were acquired by a mobile laser scanning system including a Velodyne HDL-32E LiDAR [124,125] and
GPS/INS systems.
6.3.9. Apollo
Apollo was developed by Baidu Research et al., and it is a large-scale autonomous driving dataset [129,130]. It provides labeled data for 3D car instance understanding, LiDAR point cloud object detection and tracking, and LiDAR-based localization. For the 3D car instance understanding task, there are 5.28 × 10³ images with more than 6.00 × 10⁴ car instances. Each car has an industry-grade CAD model. The 3D object detection and tracking benchmark dataset contains 53 min of sequences for training and 50 min of sequences for testing, which were acquired at a frame rate of 10 frames per second and labeled at a frame rate of 2 fps. The Apollo-SouthBay dataset provides LiDAR
frame data for localization; it was collected in the southern San Francisco Bay Area. They equipped
a high-end autonomous driving sensor suite (Velodyne HDL-64E [113,114], NovAtel ProPak6 [131],
and IMU-ISA-100C [132]) to a standard Lincoln MKZ sedan.
6.3.10. nuScenes
The nuTonomy scenes (nuScenes) dataset [133] was developed by nuTonomy (an APTIV company) and introduces a novel metric for 3D object detection. The metric consists of multiple aspects: classification, velocity, size, localization, orientation, and an attribute estimation of the object. This dataset was acquired with an autonomous vehicle sensor suite (six cameras, five radars and one LiDAR sensor) with a 360° field of view. It contains 1.00 × 10³ driving scenes collected in Boston and Singapore, two cities with dense traffic. The objects in this dataset cover 23 classes and eight attributes, and they are all labeled with 3D bounding boxes.
6.3.11. BLVD
This dataset was developed by Xi'an Jiaotong University and was collected in Changshu (China) [134]. It introduces a new benchmark which focuses on dynamic 4D object tracking, 5D interactive event recognition and 5D intention prediction. The BLVD dataset consists of 654 video clips, where the videos comprise 120k frames at a frame rate of 10 frames per second. All frames are annotated to obtain 2.49 × 10⁵ 3D annotations. There are 4.90 × 10³ unique objects in total for tracking, 6.00 × 10³ fragments for interactive event recognition, and 4.90 × 10³ objects for intention prediction.
6.3.12. Whu-TLS
Wuhan University TLS (Whu-TLS) [135] was developed by Wuhan University. It consists of
115 scans and over 1.74 × 10⁹ 3D points in total collected from 11 different environments (i.e., a subway
station, high-speed railway platform, mountain, forest, park, campus, residence, riverbank, heritage
building, underground excavation and tunnel) with varying point densities, clutter, and occlusion.
The ground-truth transformations, the transformations calculated by [136], and the registration graphs
are also provided for researchers, with the aim of yielding better comparisons and insights into the
strengths and weaknesses of different registration approaches on a common basis [135].
Table 2. Categorization of benchmark datasets. (cls: classification, seg: segmentation, loc: localization,
reg: registration, aut: autonomous driving, det: object detection, dri: driving behavior, mot: motion
estimation, odo: odometry, CAD: computer-aided design, LiDAR: light detection and ranging.).
Figure 9. Deep learning tasks with point clouds. (a) Object classification; (b) Parts segmentation;
(c) Object detection; (d) Semantic segmentation [42].
7.1. Classification
Object classification has been one of the primary areas in which deep learning is used. In object
classification, the objective is as follows: given a point cloud, a network should classify it into a certain
category. Classification is a pioneering task in deep learning because the early breakthrough deep
learning models such as AlexNet [27], VGGNet [137], and ResNet [32] were classification models.
In point clouds, most early techniques for classification using deep learning relied on a structured grid,
as shown in Section 4; however, this section is limited to approaches that process point clouds directly.
The features learned by the techniques reviewed in both Sections 4 and 5 can easily be
used for the classification task by passing them through a fully connected network whose last
layer represents classes. Other machine learning classifiers such as SVM can also be used [44,138].
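As an example of such a classification head, the hedged sketch below maps a 1024-dimensional global descriptor, as produced by PointNet-style encoders, to class scores; the hidden layer widths are illustrative:

```python
import torch
import torch.nn as nn

# A typical classification head on top of a learned global point cloud feature.
# Layer widths (512, 256) follow common practice but are not tied to any one paper.
classifier = nn.Sequential(
    nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 40),            # 40 output classes, e.g., ModelNet40
)

logits = classifier(torch.rand(8, 1024))   # global features from an encoder
pred = logits.argmax(dim=1)                # predicted class index per shape
```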
Figure 10 shows the performance of point-based deep learning approaches on ModelNet40 over time. Geo-CNN [74] exhibits state-of-the-art results on ModelNet40 at the time of this review.
Figure 10. Classification accuracy (%) of point-based deep learning approaches on ModelNet40 over time (accuracy axis range 82-96%).
7.2. Segmentation
The segmentation of point clouds is the grouping of points into homogeneous regions.
Traditionally, segmentation is done using edges [139] or surface properties such as normals, curvature
and orientation [139,140]. Recently, feature-based deep learning approaches have been used to segment the points into different groups. The groups could be different parts of an object, which is referred to as part segmentation, or different class categories, which is referred to as semantic segmentation.
In part segmentation, the input point cloud represents a certain object, and the goal is to assign each point to a certain part, as shown in Figure 9b. In [42,67,79], the learned global descriptor is concatenated with the features of the points and then passed through an MLP to classify each point into a part category. In the approaches of [39,41], the global descriptor is propagated into high-resolution predictions using interpolation and deconvolution methods, respectively. In Pointwise Conv [68], the learned per-point features are used to achieve segmentation by passing them through dense convolutional layers. An encoder-decoder architecture was used by Klokov and Lempitsky [78] for both part and semantic segmentation. Table 4 shows the results of various techniques on the ShapeNet part dataset. R-S CNN [73] and A-CNN [75], which consider the local correlation between points in a local region, have better part segmentation accuracy.
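A minimal sketch of this per-point labeling scheme, concatenating a global shape descriptor with per-point features before a shared point-wise MLP (the layer sizes and feature dimensions are illustrative, not those of any specific paper):

```python
import torch
import torch.nn as nn

class PartSegHead(nn.Module):
    """Per-point part classifier: concatenate a global shape descriptor with
    per-point features and apply a shared point-wise MLP."""
    def __init__(self, point_dim=64, global_dim=1024, num_parts=50):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(point_dim + global_dim, 256), nn.ReLU(),
                                 nn.Linear(256, num_parts))

    def forward(self, point_feats, global_feat):
        b, n, _ = point_feats.shape
        g = global_feat.unsqueeze(1).expand(b, n, -1)          # broadcast descriptor to every point
        return self.mlp(torch.cat([point_feats, g], dim=-1))   # (B, N, num_parts) part logits

logits = PartSegHead()(torch.rand(2, 2048, 64), torch.rand(2, 1024))
print(logits.shape)  # torch.Size([2, 2048, 50])
```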
Table 4. Parts segmentation on the ShapeNet part dataset. The score is the mean intersection over
union (mIOU).
In semantic segmentation, the goal is to assign each point to a particular class. For example, in Figure 9d, the points belonging to the chair are shown in red, while those of the ceiling and floor are shown in green and blue, respectively. Popular public datasets for semantic segmentation evaluation are
S3DIS [101] and ScanNet [64]. Table 5 shows the performances of some of the state-of-the-art methods
on S3DIS and ScanNet datasets. Point2Node [81], which considers both local correlation and non-local
correlation, has the best mean intersection over union (mIOU) on S3DIS and overall accuracy (OA)
on ScanNet.
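For reference, the two metrics reported in Table 5 can be computed from per-point predictions as follows (a straightforward numpy sketch):

```python
import numpy as np

def miou_and_oa(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    """Compute mean intersection over union (mIoU) and overall accuracy (OA)
    from per-point predictions and ground truth labels (both of shape (N,))."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # ignore classes absent from both
            ious.append(inter / union)
    oa = (pred == gt).mean()
    return float(np.mean(ious)), float(oa)

miou, oa = miou_and_oa(np.random.randint(0, 13, 10000), np.random.randint(0, 13, 10000), 13)
```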
Table 5. Semantic segmentation on S3DIS and ScanNet datasets. OA: overall accuracy.
In instance segmentation, the grouping is based on instances, so that multiple objects of the same class are uniquely identified. Instance segmentation is now an active field of research. Some state-of-the-art works on instance segmentation on point clouds are [143–147], which are built on the basis of PointNet/PointNet++ feature learning. The performances of the state-of-the-art methods in instance segmentation are shown in Table 6. 3D-MPA [148] has the best performance in terms of average precision at an overlap of 50% on the ScanNet dataset at the time of this review.
Table 6. Instance segmentation on the ScanNet dataset. The measure is mean average precision (AP) at an overlap of 0.5 (50%).
Method | Avg AP 50% | Bathtub | Bed | Bookshelf | Cabinet | Chair | Counter | Curtain | Desk | Door | Picture | Refrigerator | Shower Curtain | Sink
3D-MPA [148] | 0.611 | 1.00 | 0.833 | 0.765 | 0.526 | 0.756 | 0.136 | 0.588 | 0.47 | 0.438 | 0.358 | 0.65 | 0.857 | 0.429
MTML [149] | 0.549 | 1.00 | 0.807 | 0.588 | 0.327 | 0.647 | 0.004 | 0.815 | 0.18 | 0.418 | 0.182 | 0.445 | 1.00 | 0.442
3D-BoNet [144] | 0.488 | 1.00 | 0.672 | 0.59 | 0.301 | 0.484 | 0.098 | 0.62 | 0.306 | 0.341 | 0.125 | 0.434 | 0.796 | 0.402
PanopticFusion-inst [150] | 0.478 | 0.667 | 0.712 | 0.595 | 0.259 | 0.55 | 0.00 | 0.613 | 0.175 | 0.25 | 0.437 | 0.411 | 0.857 | 0.485
ResNet-backbone [151] | 0.459 | 1.00 | 0.737 | 0.159 | 0.259 | 0.587 | 0.138 | 0.475 | 0.217 | 0.416 | 0.128 | 0.315 | 0.714 | 0.411
MASC [152] | 0.447 | 0.528 | 0.555 | 0.381 | 0.382 | 0.633 | 0.002 | 0.509 | 0.26 | 0.361 | 0.327 | 0.451 | 0.571 | 0.367
3D-SIS [153] | 0.382 | 1.00 | 0.432 | 0.245 | 0.19 | 0.577 | 0.013 | 0.263 | 0.033 | 0.32 | 0.075 | 0.422 | 0.857 | 0.117
UNet-backbone [151] | 0.319 | 0.667 | 0.715 | 0.233 | 0.189 | 0.479 | 0.008 | 0.218 | 0.067 | 0.201 | 0.107 | 0.123 | 0.438 | 0.15
R-PointNet [145] | 0.306 | 0.5 | 0.405 | 0.311 | 0.348 | 0.589 | 0.054 | 0.068 | 0.126 | 0.283 | 0.028 | 0.219 | 0.214 | 0.331
3D-BEVIS [154] | 0.248 | 0.667 | 0.566 | 0.076 | 0.035 | 0.394 | 0.027 | 0.035 | 0.098 | 0.099 | 0.025 | 0.098 | 0.375 | 0.126
Seg-Cluster [146] | 0.215 | 0.37 | 0.337 | 0.285 | 0.105 | 0.325 | 0.025 | 0.282 | 0.085 | 0.105 | 0.007 | 0.079 | 0.317 | 0.114
Sgpn_scannet [146] | 0.143 | 0.208 | 0.39 | 0.169 | 0.065 | 0.275 | 0.029 | 0.069 | 0.00 | 0.087 | 0.014 | 0.027 | 0.00 | 0.112
MaskRCNN 2d->3d Proj [155] | 0.058 | 0.333 | 0.002 | 0.00 | 0.053 | 0.002 | 0.002 | 0.021 | 0.00 | 0.045 | 0.238 | 0.065 | 0.00 | 0.014
Author Contributions: S.A.B. and C.W. conceived and planned the manuscript. S.A.B. and S.Y. wrote the
manuscript. S.A.B., S.Y. and J.M.A. revised the manuscript. C.W. and J.L. supervised the work. All authors
provided critical feedback and helped shape the research, analysis, and manuscript. All authors have read and
agreed to the published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China, grant number U1605254.
Acknowledgments: The authors would like to acknowledge the comments and suggestions given by the
anonymous reviewers. S.A.B. and J.M.A. also acknowledge the China Scholarship Council (CSC) for the financial
support provided.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
1D One-dimensional
2D Two-dimensional
3D Three-dimensional
4D Four-dimensional
5D Five-dimensional
6D Six-dimensional
References
1. Hillel, A.B.; Lerner, R.; Levi, D.; Raz, G. Recent progress in road and lane detection: A survey. Mach. Vis. Appl.
2014, 25, 727–745. [CrossRef]
2. Pendleton, S.D.; Andersen, H.; Du, X.; Shen, X.; Meghjani, M.; Eng, Y.H.; Rus, D.; Ang, M.H. Perception,
planning, control, and coordination for autonomous vehicles. Machines 2017, 5, 6. [CrossRef]
3. Weingarten, J.W.; Gruener, G.; Siegwart, R. A state-of-the-art 3D sensor for robot navigation. In Proceedings
of the 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat.
No.04CH37566), Sendai, Japan, 28 September–2 October 2004; Volume 3, pp. 2155–2160.
4. Lucas, B.D.; Kanade, T. An Iterative Image Registration Technique with an Application to Stereo Vision.
In Proceedings of the International Joint Conference on Artificial Intelligence, Vancouver, BC, Canada, 24–28
August 1981.
5. Ayache, N. Artificial Vision for Mobile Robots: Stereo Vision and Multisensory Perception; MIT Press: Cambridge,
MA, USA, 1991.
6. Liu, Y.; Dai, Q.; Xu, W. A point-cloud-based multiview stereo algorithm for free-viewpoint video. IEEE Trans.
Vis. Comput. Graph. 2009, 16, 407–418.
7. Fathi, H.; Brilakis, I. Automated sparse 3D point cloud generation of infrastructure using its distinctive
visual features. Adv. Eng. Inform. 2011, 25, 760–770. [CrossRef]
8. Livox Tech. Tele-15; Livox Tech: Shenzhen, China, 2020.
9. Leica Geosystems. LEICA BLK360; Leica Geosystems: St. Gallen, Switzerland, 2016.
10. Microsoft Corporation. Kinect V2 3D Scanner; Microsoft Corporation: Redmond, WA, USA, 2014.
11. Schwarz, B. Mapping the world in 3D. Nat. Photonics 2010, 4, 429–430. [CrossRef]
12. Tang, P.; Huber, D.; Akinci, B.; Lipman, R.; Lytle, A. Automatic reconstruction of as-built building information
models from laser-scanned point clouds: A review of related techniques. Autom. Constr. 2010, 19, 829–843.
[CrossRef]
13. Wang, C.; Cho, Y.K.; Kim, C. Automatic BIM component extraction from point clouds of existing buildings
for sustainability applications. Autom. Constr. 2015, 56, 1–13. [CrossRef]
14. Pomerleau, F.; Colas, F.; Siegwart, R. A Review of Point Cloud Registration Algorithms for Mobile Robotics.
Found. Trends Robot. 2015, 4, 1–104. [CrossRef]
15. Chen, S.; Liu, B.; Feng, C.; Vallespi-Gonzalez, C.; Wellington, C. 3D Point Cloud Processing and Learning for
Autonomous Driving. arXiv 2020, arXiv:2003.00601.
16. Park, J.; Seo, D.; Ku, M.; Jung, I.; Jeong, C. Multiple 3D Object Tracking using ROI and Double Filtering for
Augmented Reality. In Proceedings of the 2011 Fifth FTRA International Conference on Multimedia and
Ubiquitous Engineering, Loutraki, Greece, 28–30 June 2011; pp. 317–322.
17. Fabio, R. From point cloud to surface: The modeling and visualization problem. Int. Arch. Photogramm.
Remote Sens. Spat. Inf. Sci. 2003, 34, W10.
18. Johnson, A.E.; Hebert, M. Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes.
IEEE Trans. Pattern Anal. Mach. Intell. 1999, 21, 433–449. [CrossRef]
19. Chen, H.; Bhanu, B. 3D Free-Form Object Recognition in Range Images Using Local Surface Patches.
In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), Cambridge, UK,
23–26 August 2004; pp. 136–139. [CrossRef]
20. Zhong, Y. Intrinsic shape signatures: A shape descriptor for 3D object recognition. In Proceedings of the
12th IEEE International Conference on Computer Vision Workshops, ICCV Workshops 2009, Kyoto, Japan,
27 September–4 October 2009; pp. 689–696. [CrossRef]
21. Rusu, R.B.; Blodow, N.; Marton, Z.C.; Beetz, M. Aligning point cloud views using persistent feature
histograms. In Proceedings of the 2008 IEEE/RSJ International Conference on Intelligent Robots and
Systems, Nice, France, 22–26 September 2008; pp. 3384–3391. [CrossRef]
22. Rusu, R.B.; Blodow, N.; Beetz, M. Fast Point Feature Histograms (FPFH) for 3D registration. In Proceedings
of the 2009 IEEE International Conference on Robotics and Automation (ICRA 2009), Kobe, Japan, 12–17
May 2009; pp. 3212–3217. [CrossRef]
23. Tombari, F.; Salti, S.; di Stefano, L. Unique shape context for 3d data description. In Proceedings of the ACM
Workshop on 3D Object Retrieval (3DOR ’10), Firenze, Italy, 25 October 2010; Daoudi, M., Spagnuolo, M.,
Veltkamp, R.C., Eds.; pp. 57–62. [CrossRef]
24. Hänsch, R.; Weber, T.; Hellwich, O. Comparison of 3d interest point detectors and descriptors for point
cloud fusion. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2014, 2, 57. [CrossRef]
25. Hinton, G.E. Connectionist Learning Procedures. Artif. Intell. 1989, 40, 185–234. [CrossRef]
26. LeCun, Y.; Bengio, Y.; Hinton, G.E. Deep learning. Nature 2015, 521, 436–444. [CrossRef] [PubMed]
27. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural
Networks. In Proceedings of the Advances in Neural Information Processing Systems 25: 26th Annual
Conference on Neural Information Processing Systems 2012, Lake Tahoe, NV, USA, 3–6 December 2012;
pp. 1106–1114.
28. Ciresan, D.C.; Meier, U.; Schmidhuber, J. Multi-column deep neural networks for image classification.
In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI,
USA, 16–21 June 2012; pp. 3642–3649.
29. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [CrossRef]
30. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. arXiv 2014,
arXiv:1411.4038.
31. Saito, S.; Li, T.; Li, H. Real-Time Facial Segmentation and Performance Capture from RGB Input.
In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14
October 2016; Volume 9912, pp. 244–261.
32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
33. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
34. Guo, Y.; Wang, H.; Hu, Q.; Liu, H.; Liu, L.; Bennamoun, M. Deep Learning for 3D Point Clouds: A Survey.
arXiv 2019, arXiv:1912.12033.
35. Ioannidou, A.; Chatzilari, E.; Nikolopoulos, S.; Kompatsiaris, I. Deep Learning Advances in Computer
Vision with 3D Data: A Survey. ACM Comput. Surv. 2017, 50, 20:1–20:38. [CrossRef]
36. Liu, W.; Sun, J.; Li, W.; Hu, T.; Wang, P. Deep Learning on Point Clouds and Its Application: A Survey.
Sensors 2019, 19, 4188. [CrossRef] [PubMed]
37. Guo, Y.; Sohel, F.; Bennamoun, M.; Wan, J.; Lu, M. A novel local surface feature for 3D object recognition
under clutter and occlusion. Inf. Sci. 2015, 293, 196–213. [CrossRef]
38. Nurunnabi, A.; West, G.; Belton, D. Outlier detection and robust normal-curvature estimation in mobile
laser scanning 3D point cloud data. Pattern Recognit. 2015, 48, 1404–1419. [CrossRef]
39. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric
Space. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on
Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5099–5108.
40. Dimitrov, A.; Golparvar-Fard, M. Segmentation of building point cloud models including detailed
architectural/structural features and MEP systems. Autom. Constr. 2015, 51, 32–45. [CrossRef]
41. Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. PointCNN: Convolution On X-Transformed Points.
In Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on
Neural Information Processing Systems 2018 (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018;
pp. 828–838.
42. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and
Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 77–85. [CrossRef]
43. Maturana, D.; Scherer, S. 3D Convolutional Neural Networks for landing zone detection from LiDAR.
In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2015), Seattle, WA,
USA, 26–30 May 2015; pp. 3471–3478. [CrossRef]
44. Maturana, D.; Scherer, S. VoxNet: A 3D Convolutional Neural Network for real-time object recognition.
In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS
2015), Hamburg, Germany, 28 September–2 October 2015; pp. 922–928. [CrossRef]
45. Qi, C.R.; Su, H.; Nießner, M.; Dai, A.; Yan, M.; Guibas, L.J. Volumetric and Multi-view CNNs for Object
Classification on 3D Data. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; pp. 5648–5656. [CrossRef]
46. Wang, C.; Cheng, M.; Sohel, F.; Bennamoun, M.; Li, J. NormalNet: A voxel-based CNN for 3D object
classification and retrieval. Neurocomputing 2019, 323, 139–147. [CrossRef]
47. Ghadai, S.; Lee, X.Y.; Balu, A.; Sarkar, S.; Krishnamurthy, A. Multi-Resolution 3D Convolutional Neural
Networks for Object Recognition. arXiv 2018, arXiv:1805.12254.
48. Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; Xiao, J. 3D ShapeNets: A deep representation for
volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR 2015), Boston, MA, USA, 7–12 June 2015; pp. 1912–1920. [CrossRef]
49. Hinton, G.E.; Osindero, S.; Teh, Y.W. A Fast Learning Algorithm for Deep Belief Nets. Neural Comput. 2006,
18, 1527–1554. [CrossRef]
50. Riegler, G.; Ulusoy, A.O.; Geiger, A. OctNet: Learning Deep 3D Representations at High Resolutions.
In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017),
Honolulu, HI, USA, 21–26 July 2017; pp. 6620–6629. [CrossRef]
51. Tatarchenko, M.; Dosovitskiy, A.; Brox, T. Octree Generating Networks: Efficient Convolutional Architectures
for High-resolution 3D Outputs. In Proceedings of the IEEE International Conference on Computer Vision
(ICCV 2017), Venice, Italy, 22–29 October 2017; pp. 2107–2115. [CrossRef]
52. Su, H.; Maji, S.; Kalogerakis, E.; Learned-Miller, E.G. Multi-view Convolutional Neural Networks for 3D
Shape Recognition. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV
2015), Santiago, Chile, 7–13 December 2015; pp. 945–953. [CrossRef]
53. Leng, B.; Guo, S.; Zhang, X.; Xiong, Z. 3D object retrieval with stacked local convolutional autoencoder.
Signal Process. 2015, 112, 119–128. [CrossRef]
54. Bai, S.; Bai, X.; Zhou, Z.; Zhang, Z.; Latecki, L.J. GIFT: A Real-Time and Scalable 3D Shape Search Engine.
In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016),
Las Vegas, NV, USA, 27–30 June 2016; pp. 5023–5032. [CrossRef]
55. Kalogerakis, E.; Averkiou, M.; Maji, S.; Chaudhuri, S. 3D Shape Segmentation with Projective Convolutional
Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 6630–6639. [CrossRef]
56. Cao, Z.; Huang, Q.; Ramani, K. 3D Object Classification via Spherical Projections. In Proceedings of the
2017 International Conference on 3D Vision, 3DV 2017, Qingdao, China, 10–12 October 2017; pp. 566–574.
[CrossRef]
57. Zhang, L.; Sun, J.; Zheng, Q. 3D Point Cloud Recognition Based on a Multi-View Convolutional Neural
Network. Sensors 2018, 18, 3681. [CrossRef]
58. Kanezaki, A.; Matsushita, Y.; Nishida, Y. RotationNet: Joint Object Categorization and Pose Estimation
Using Multiviews From Unsupervised Viewpoints. In Proceedings of the 2018 IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5010–5019.
59. Su, H.; Jampani, V.; Sun, D.; Maji, S.; Kalogerakis, E.; Yang, M.; Kautz, J. SPLATNet: Sparse Lattice Networks
for Point Cloud Processing. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June 2018; pp. 2530–2539. [CrossRef]
60. Rao, Y.; Lu, J.; Zhou, J. Spherical Fractal Convolutional Neural Networks for Point Cloud Recognition.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach,
CA, USA, 15–20 June 2019.
61. Oster, M.; Douglas, R.J.; Liu, S. Computation with Spikes in a Winner-Take-All Network. Neural Comput.
2009, 21, 2437–2465. [CrossRef]
62. Xiang, C.; Qi, C.R.; Li, B. Generating 3D Adversarial Point Clouds. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
63. Yang, J.; Zhang, Q.; Ni, B.; Li, L.; Liu, J.; Zhou, M.; Tian, Q. Modeling Point Clouds With Self-Attention
and Gumbel Subset Sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
64. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. ScanNet: Richly-annotated 3D
Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
65. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection.
In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt
Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499. [CrossRef]
66. Kohonen, T. The self-organizing map. Neurocomputing 1998, 21, 1–6. [CrossRef]
67. Li, J.; Chen, B.M.; Lee, G.H. SO-Net: Self-Organizing Network for Point Cloud Analysis. In Proceedings of
the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT,
USA, 18–22 June 2018; pp. 9397–9406. [CrossRef]
68. Hua, B.; Tran, M.; Yeung, S. Pointwise Convolutional Neural Networks. In Proceedings of the 2018 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June
2018; pp. 984–993. [CrossRef]
69. Zhao, Y.; Birdal, T.; Deng, H.; Tombari, F. 3D Point Capsule Networks. In Proceedings of the 2019 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019;
pp. 1009–1018.
70. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic Routing Between Capsules. In Proceedings of the Advances in
Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 3856–3866.
71. Zhao, H.; Jiang, L.; Fu, C.W.; Jia, J. PointWeb: Enhancing Local Neighborhood Features for Point Cloud
Processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Long Beach, CA, USA, 15–20 June 2019.
72. Wu, W.; Qi, Z.; Fuxin, L. PointConv: Deep Convolutional Networks on 3D Point Clouds. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20
June 2019.
73. Liu, Y.; Fan, B.; Xiang, S.; Pan, C. Relation-Shape Convolutional Neural Network for Point Cloud Analysis.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach,
CA, USA, 15–20 June 2019.
74. Lan, S.; Yu, R.; Yu, G.; Davis, L.S. Modeling Local Geometric Structure of 3D Point Clouds Using Geo-CNN.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach,
CA, USA, 15–20 June 2019.
75. Komarichev, A.; Zhong, Z.; Hua, J. A-CNN: Annularly Convolutional Neural Networks on Point Clouds.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach,
CA, USA, 15–20 June 2019.
76. Xu, Y.; Fan, T.; Xu, M.; Zeng, L.; Qiao, Y. SpiderCNN: Deep Learning on Point Sets with Parameterized
Convolutional Filters. In Proceedings of the Computer Vision—ECCV 2018—15th European Conference,
Munich, Germany, 8–14 September 2018; Volume 11212, pp. 90–105. [CrossRef]
77. Liu, J.; Ni, B.; Li, C.; Yang, J.; Tian, Q. Dynamic Points Agglomeration for Hierarchical Point Sets Learning.
In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2
November 2019.
78. Klokov, R.; Lempitsky, V.S. Escape from Cells: Deep Kd-Networks for the Recognition of 3D Point Cloud
Models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy,
22–29 October 2017; pp. 863–872. [CrossRef]
79. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic Graph CNN for Learning on
Point Clouds. arXiv 2018, arXiv:1801.07829.
80. Wang, C.; Samari, B.; Siddiqi, K. Local Spectral Graph Convolution for Point Set Feature Learning.
In Proceedings of the Computer Vision—ECCV 2018—15th European Conference, Munich, Germany, 8–14
September 2018; Volume 11208, pp. 56–71. [CrossRef]
81. Han, W.; Wen, C.; Wang, C.; Li, Q.; Li, X. Forthcoming: Point2Node: Correlation Learning of Dynamic-Node
for Point Cloud Feature Modeling. In Proceedings of the Conference on Artificial Intelligence (AAAI),
New York, NY, USA, 7–12 February 2020.
82. Landrieu, L.; Simonovsky, M. Large-Scale Point Cloud Semantic Segmentation with Superpoint Graphs.
In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake
City, UT, USA, 18–23 June 2018; pp. 4558–4567.
83. Landrieu, L.; Boussaha, M. Point Cloud Oversegmentation with Graph-Structured Deep Metric Learning. In
Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long
Beach, CA, USA, 15–20 June 2019; pp. 7440–7449.
84. Wang, L.; Huang, Y.; Hou, Y.; Zhang, S.; Shan, J. Graph Attention Convolution for Point Cloud Semantic
Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10296–10305.
85. Liang, Z.; Yang, M.; Deng, L.; Wang, C.; Wang, B. Hierarchical Depthwise Graph Convolutional Neural
Network for 3D Semantic Segmentation of Point Clouds. In Proceedings of the 2019 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8152–8158.
86. Zhang, Z.; Hua, B.S.; Yeung, S.K. ShellNet: Efficient Point Cloud Convolutional Neural Networks Using
Concentric Shells Statistics. In Proceedings of the IEEE International Conference on Computer Vision (ICCV),
Seoul, Korea, 27 October–2 November 2019.
87. Chang, A.X.; Funkhouser, T.A.; Guibas, L.J.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.;
Su, H. ShapeNet: An Information-Rich 3D Model Repository. arXiv 2015, arXiv:1512.03012.
88. Yi, L.; Kim, V.G.; Ceylan, D.; Shen, I.; Yan, M.; Su, H.; Lu, C.; Huang, Q.; Sheffer, A.; Guibas, L. A scalable
active framework for region annotation in 3D shape collections. ACM Trans. Graph. 2016, 35, 210.
[CrossRef]
89. Dai, A.; Qi, C.R.; Nießner, M. Shape Completion using 3D-Encoder-Predictor CNNs and Shape Synthesis.
In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu,
HI, USA, 21–26 July 2017.
90. Park, K.; Rematas, K.; Farhadi, A.; Seitz, S.M. PhotoShape: Photorealistic Materials for Large-Scale Shape
Collections. ACM Trans. Graph. 2018, 37, 192. [CrossRef]
91. Mo, K.; Zhu, S.; Chang, A.X.; Yi, L.; Tripathi, S.; Guibas, L.J.; Su, H. PartNet: A Large-Scale Benchmark for
Fine-Grained and Hierarchical Part-Level 3D Object Understanding. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
92. Xiang, Y.; Kim, W.; Chen, W.; Ji, J.; Choy, C.; Su, H.; Mottaghi, R.; Guibas, L.; Savarese, S. ObjectNet3D:
A Large Scale Database for 3D Object Recognition. In Proceedings of the European Conference Computer
Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016.
93. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database.
In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA,
20–25 June 2009; pp. 248–255.
94. Wang, X.; Zhou, B.; Shi, Y.; Chen, X.; Zhao, Q.; Xu, K. Shape2Motion: Joint Analysis of Motion Parts and
Attributes from 3D Shapes. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
95. 3D Warehouse. Available online: https://3dwarehouse.sketchup.com/ (accessed on 21 December 2019).
96. Uy, M.A.; Pham, Q.H.; Hua, B.S.; Nguyen, D.T.; Yeung, S.K. Revisiting Point Cloud Classification: A New
Benchmark Dataset and Classification Model on Real-World Data. In Proceedings of the IEEE International
Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019.
97. Hua, B.S.; Pham, Q.H.; Nguyen, D.T.; Tran, M.K.; Yu, L.F.; Yeung, S.K. SceneNN: A Scene Meshes Dataset
with aNNotations. In Proceedings of the International Conference on 3D Vision (3DV), Stanford, CA, USA,
25–28 October 2016.
98. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from RGBD
images. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012;
pp. 746–760.
99. Wasenmüller, O.; Stricker, D. Comparison of Kinect v1 and v2 depth images in terms of accuracy and
precision. In Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November
2016; pp. 34–45.
100. Xiao, J.; Owens, A.; Torralba, A. SUN3D: A database of big spaces reconstructed using SfM and object labels.
In Proceedings of the IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8
December 2013; pp. 1625–1632.
101. Armeni, I.; Sener, O.; Zamir, A.R.; Jiang, H.; Brilakis, I.K.; Fischer, M.; Savarese, S. 3D Semantic Parsing of
Large-Scale Indoor Spaces. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 1534–1543. [CrossRef]
102. Fankhauser, P.; Bloesch, M.; Rodriguez, D.; Kaestner, R.; Hutter, M.; Siegwart, R. Kinect v2 for mobile robot
navigation: Evaluation and modeling. In Proceedings of the 2015 International Conference on Advanced
Robotics (ICAR), Istanbul, Turkey, 27–31 July 2015; pp. 388–394.
103. Lachat, E.; Macher, H.; Mittet, M.; Landes, T.; Grussenmeyer, P. First experiences with Kinect v2 sensor for
close range 3D modelling. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2015, 40, 93. [CrossRef]
104. Chang, A.; Dai, A.; Funkhouser, T.; Halber, M.; Niessner, M.; Savva, M.; Song, S.; Zeng, A.; Zhang, Y.
Matterport3D: Learning from RGB-D Data in Indoor Environments. In Proceedings of the International
Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017.
105. Zeng, A.; Song, S.; Nießner, M.; Fisher, M.; Xiao, J.; Funkhouser, T. 3DMatch: Learning Local Geometric
Descriptors from RGB-D Reconstructions. In Proceedings of the 2017 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
106. Valentin, J.; Dai, A.; Nießner, M.; Kohli, P.; Torr, P.; Izadi, S.; Keskin, C. Learning to Navigate the Energy
Landscape. arXiv 2016, arXiv:1603.05772.
107. Shotton, J.; Glocker, B.; Zach, C.; Izadi, S.; Criminisi, A.; Fitzgibbon, A. Scene coordinate regression forests
for camera relocalization in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2930–2937.
108. De Deuge, M.; Quadros, A.; Hung, C.; Douillard, B. Unsupervised feature learning for classification of
outdoor 3D scans. In Proceedings of the Australasian Conference on Robotics and Automation, Sydney,
Australia, 2–4 December 2013; Volume 2, p. 1.
109. Halber, M.; Funkhouser, T.A. Structured Global Registration of RGB-D Scans in Indoor Environments. arXiv
2016, arXiv:1607.08539.
110. Wang, C.; Hou, S.; Wen, C.; Gong, Z.; Li, Q.; Sun, X.; Li, J. Semantic line framework-based indoor building
modeling using backpacked laser scanning point cloud. ISPRS J. Photogramm. Remote Sens. 2018, 143, 150–166.
[CrossRef]
111. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite.
In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI,
USA, 16–21 June 2012; pp. 3642–3649.
112. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets Robotics: The KITTI Dataset. Int. J. Robot. Res. 2013,
32, 1231–1237. [CrossRef]
113. Halterman, R.; Bruch, M. Velodyne HDL-64E lidar for unmanned surface vehicle obstacle detection.
In Proceedings of the Unmanned Systems Technology XII. International Society for Optics and Photonics,
Orlando, FL, USA, 6–9 April 2010; Volume 7692, p. 76920D.
114. Glennie, C.; Lichti, D.D. Static calibration and analysis of the Velodyne HDL-64E S2 for high accuracy mobile
scanning. Remote Sens. 2010, 2, 1610–1624. [CrossRef]
115. Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. SemanticKITTI: A Dataset
for Semantic Scene Understanding of LiDAR Sequences. In Proceedings of the 2019 IEEE/CVF International
Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019.
116. Pomerleau, F.; Liu, M.; Colas, F.; Siegwart, R. Challenging data sets for point cloud registration algorithms.
Int. J. Robot. Res. 2012, 31, 1705–1711. [CrossRef]
117. Demski, P.; Mikulski, M.; Koteras, R. Characterization of Hokuyo UTM-30LX laser range finder for an
autonomous mobile robot. In Advanced Technologies for Intelligent Systems of National Border Security; Springer:
Berlin, Germany, 2013; pp. 143–153.
118. Pouliot, N.; Richard, P.L.; Montambault, S. LineScout power line robot: Characterization of a UTM-30LX
LIDAR system for obstacle detection. In Proceedings of the 2012 IEEE/RSJ International Conference on
Intelligent Robots and Systems, Vilamoura, Portugal, 7–12 October 2012; pp. 4327–4334.
119. Brédif, M.; Vallet, B.; Serna, A.; Marcotegui, B.; Paparoditis, N. TerraMobilita/IQmulus Urban Point Cloud
Classification Benchmark. In Proceedings of the Workshop on Processing Large Geospatial Data, Cardiff,
UK, 8 July 2014.
120. RIEGL Laser Measurement Systems. LMS-Q120i; RIEGL Laser Measurement Systems GmbH: Horn,
Austria, 2010.
121. Maddern, W.; Pascoe, G.; Linegar, C.; Newman, P. 1 Year, 1000km: The Oxford RobotCar Dataset. Int. J.
Robot. Res. 2017, 36, 3–15. [CrossRef]
122. Csaba, G.; Somlyai, L.; Vámossy, Z. Mobile robot navigation using 2D LIDAR. In Proceedings of the 2018
IEEE 16th World Symposium on Applied Machine Intelligence and Informatics (SAMI), Herl’any, Kosice,
Slovakia, 7–10 February 2018; pp. 143–148.
123. Carlevaris-Bianco, N.; Ushani, A.K.; Eustice, R.M. University of Michigan North Campus long-term vision
and lidar dataset. Int. J. Robot. Res. 2016, 35, 1023–1035. [CrossRef]
124. Chan, T.; Lichti, D.D.; Belton, D. Temporal analysis and automatic calibration of the Velodyne HDL-32E
LiDAR system. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2013, 2, 61–66. [CrossRef]
125. Jozkow, G.; Wieczorek, P.; Karpina, M.; Walicka, A.; Borkowski, A. Performance evaluation of sUAS
equipped with Velodyne HDL-32E LiDAR sensor. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2017,
42, 171. [CrossRef]
126. Hackel, T.; Savinov, N.; Ladicky, L.; Wegner, J.D.; Schindler, K.; Pollefeys, M. Semantic3D.net: A new
large-scale point cloud classification benchmark. arXiv 2017, arXiv:1704.03847.
127. Chen, Y.; Wang, J.; Li, J.; Lu, C.; Luo, Z.; Xue, H.; Wang, C. Lidar-video driving dataset: Learning driving
policies effectively. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Salt Lake City, UT, USA, 18–23 June 2018; pp. 5870–5878.
128. Roynard, X.; Deschaud, J.E.; Goulette, F. Paris-Lille-3D: A large and high-quality ground-truth urban point
cloud dataset for automatic segmentation and classification. Int. J. Robot. Res. 2018, 37, 545–557. [CrossRef]
129. Song, X.; Wang, P.; Zhou, D.; Zhu, R.; Guan, C.; Dai, Y.; Su, H.; Li, H.; Yang, R. ApolloCar3D: A large 3D car
instance understanding benchmark for autonomous driving. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5452–5462.
130. Lu, W.; Zhou, Y.; Wan, G.; Hou, S.; Song, S. L3-Net: Towards Learning Based LiDAR Localization for
Autonomous Driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Long Beach, CA, USA, 15–20 June 2019; pp. 6389–6398.
131. Sun, B.; Yeary, M.; Sigmarsson, H.H.; McDaniel, J.W. Fine Resolution Position Estimation Using the
Kalman Filter. In Proceedings of the 2019 IEEE International Instrumentation and Measurement Technology
Conference (I2MTC), Auckland, New Zealand, 20–23 May 2019.
132. Liu, W.; Shi, X.; Zhu, F.; Tao, X.; Wang, F. Quality analysis of multi-GNSS raw observations and a
velocity-aided positioning approach based on smartphones. Adv. Space Res. 2019, 63, 2358–2377. [CrossRef]
133. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O.
nuScenes: A multimodal dataset for autonomous driving. arXiv 2019, arXiv:1903.11027.
134. Xue, J.; Fang, J.; Li, T.; Zhang, B.; Zhang, P.; Ye, Z.; Dou, J. BLVD: Building A Large-scale 5D Semantics
Benchmark for Autonomous Driving. In Proceedings of the International Conference on Robotics and
Automation, Montreal, QC, Canada, 20–24 May 2019.
135. Dong, Z.; Liang, F.; Yang, B.; Xu, Y.; Zang, Y.; Li, J.; Wang, Y.; Dai, W.; Fan, H.; Hyyppä, J. Registration of
large-scale terrestrial laser scanner point clouds: A review and benchmark. ISPRS J. Photogramm. Remote Sens.
2020, 163, 327–342. [CrossRef]
136. Dong, Z.; Yang, B.; Liang, F.; Huang, R.; Scherer, S. Hierarchical registration of unordered TLS point clouds
based on binary shape context descriptor. ISPRS J. Photogramm. Remote Sens. 2018, 144, 61–79. [CrossRef]
137. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition.
In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA,
USA, 7–9 May 2015.
138. Yang, Y.; Feng, C.; Shen, Y.; Tian, D. FoldingNet: Point Cloud Auto-Encoder via Deep Grid Deformation.
In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018),
Salt Lake City, UT, USA, 18–22 June 2018; pp. 206–215. [CrossRef]
139. Rabbani, T.; van den Heuvel, F.; Vosselman, G. Segmentation of point clouds using smoothness constraints.
In Proceedings of the ISPRS Commission V Symposium: Image Engineering and Vision Metrology
(ISPRS 2006), Dresden, Germany, 25–27 September 2006; Maas, H., Schneider, D., Eds.; Volume 35, Part 6,
pp. 248–253.
140. Jagannathan, A.; Miller, E.L. Three-Dimensional Surface Mesh Segmentation Using Curvedness-Based
Region Growing Approach. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 2195–2204. [CrossRef] [PubMed]
141. Mao, J.; Wang, X.; Li, H. Interpolated Convolutional Networks for 3D Point Cloud Understanding.
In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2
November 2019.
142. Liu, Y.; Fan, B.; Meng, G.; Lu, J.; Xiang, S.; Pan, C. DensePoint: Learning Densely Contextual Representation
for Efficient Point Cloud Processing. In Proceedings of the IEEE International Conference on Computer
Vision (ICCV), Seoul, Korea, 27 October–2 November 2019.
143. Pham, Q.; Nguyen, D.T.; Hua, B.; Roig, G.; Yeung, S. JSIS3D: Joint Semantic-Instance Segmentation
of 3D Point Clouds With Multi-Task Pointwise Networks and Multi-Value Conditional Random Fields.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long
Beach, CA, USA, 16–20 June 2019; pp. 8827–8836.
144. Yang, B.; Wang, J.; Clark, R.; Hu, Q.; Wang, S.; Markham, A.; Trigoni, N. Learning Object Bounding Boxes for
3D Instance Segmentation on Point Clouds. arXiv 2019, arXiv:1906.01140.
145. Yi, L.; Zhao, W.; Wang, H.; Sung, M.; Guibas, L.J. GSPN: Generative Shape Proposal Network for 3D Instance
Segmentation in Point Cloud. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3947–3956.
146. Wang, W.; Yu, R.; Huang, Q.; Neumann, U. SGPN: Similarity Group Proposal Network for 3D Point Cloud
Instance Segmentation. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June 2018; pp. 2569–2578. [CrossRef]
147. Wang, X.; Liu, S.; Shen, X.; Shen, C.; Jia, J. Associatively Segmenting Instances and Semantics in Point
Clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019),
Long Beach, CA, USA, 16–20 June 2019; pp. 4096–4105.
148. Engelmann, F.; Bokeloh, M.; Fathi, A.; Leibe, B.; Nießner, M. 3D-MPA: Multi Proposal Aggregation for 3D
Semantic Instance Segmentation. arXiv 2020, arXiv:2003.13867.
149. Lahoud, J.; Ghanem, B.; Oswald, M.R.; Pollefeys, M. 3D Instance Segmentation via Multi-Task Metric
Learning. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV),
Seoul, Korea, 27 October–2 November 2019; pp. 9255–9265.
150. Narita, G.; Seno, T.; Ishikawa, T.; Kaji, Y. PanopticFusion: Online Volumetric Semantic Mapping at the Level
of Stuff and Things. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), Macau, China, 3–8 November 2019; pp. 4205–4212.
151. Liang, Z.; Yang, M.; Wang, C. 3D Graph Embedding Learning with a Structure-aware Loss Function for
Point Cloud Semantic Instance Segmentation. arXiv 2019, arXiv:1902.05247.
152. Liu, C.; Furukawa, Y. MASC: Multi-scale Affinity with Sparse Convolution for 3D Instance Segmentation.
arXiv 2019, arXiv:1902.04478.
153. Hou, J.; Dai, A.; Nießner, M. 3D-SIS: 3D Semantic Instance Segmentation of RGB-D Scans. In Proceedings
of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA,
USA, 15–20 June 2019; pp. 4421–4430.
154. Elich, C.; Engelmann, F.; Schult, J.; Kontogianni, T.; Leibe, B. 3D-BEVIS: Birds-Eye-View Instance
Segmentation. arXiv 2019, arXiv:1904.02199.
155. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R.B. Mask R-CNN. In Proceedings of the IEEE International
Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [CrossRef]
156. Girshick, R.B.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection
and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2014), Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [CrossRef]
157. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV),
Santiago, Chile, 7–13 December 2015.
158. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems 28: Annual
Conference on Neural Information Processing Systems 2015, Montreal, QC, Canada, 7–12 December 2015;
Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R., Eds.; pp. 91–99.
159. Redmon, J.; Divvala, S.K.; Girshick, R.B.; Farhadi, A. You Only Look Once: Unified, Real-Time Object
Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR
2016), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [CrossRef]
160. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on
Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
[CrossRef]
161. Wang, D.Z.; Posner, I. Voting for Voting in Online Point Cloud Object Detection. In Proceedings of the
Robotics: Science and Systems XI, Sapienza University of Rome, Rome, Italy, 13–17 July 2015. [CrossRef]
162. Engelcke, M.; Rao, D.; Wang, D.Z.; Tong, C.H.; Posner, I. Vote3Deep: Fast object detection in 3D point clouds
using efficient convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on
Robotics and Automation (ICRA 2017), Singapore, 29 May–3 June 2017; pp. 1355–1361. [CrossRef]
163. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum PointNets for 3D Object Detection From RGB-D Data.
In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018),
Salt Lake City, UT, USA, 18–22 June 2018; pp. 918–927. [CrossRef]
164. Shi, S.; Wang, X.; Li, H. PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach,
CA, USA, 15–20 June 2019.
165. Qi, C.R.; Litany, O.; He, K.; Guibas, L.J. Deep Hough Voting for 3D Object Detection in Point Clouds.
arXiv 2019, arXiv:1904.09664.
166. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object
Detection From Point Clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
167. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.E.; Fu, C.; Berg, A.C. SSD: Single Shot MultiBox
Detector. In Proceedings of the Computer Vision—ECCV 2016—14th European Conference, Amsterdam,
The Netherlands, 11–14 October 2016; Volume 9905, pp. 21–37. [CrossRef]
168. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D Object Detection Network for Autonomous Driving.
In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017,
Honolulu, HI, USA, 21–26 July 2017; pp. 6526–6534. [CrossRef]
169. Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep Continuous Fusion for Multi-sensor 3D Object Detection.
In Proceedings of the Computer Vision—ECCV 2018—15th European Conference, Munich, Germany, 8–14
September 2018; Volume 11220, pp. 663–678. [CrossRef]
170. Shin, K.; Kwon, Y.P.; Tomizuka, M. RoarNet: A Robust 3D Object Detection based on RegiOn Approximation
Refinement. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV 2019), Paris, France, 9–12
June 2019; pp. 2510–2515. [CrossRef]
171. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D Proposal Generation and Object Detection
from View Aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS 2018), Madrid, Spain, 1–5 October 2018; pp. 1–8. [CrossRef]
172. Yang, B.; Liang, M.; Urtasun, R. HDNET: Exploiting HD Maps for 3D Object Detection. In Proceedings
of the 2nd Annual Conference on Robot Learning (CoRL 2018), Zürich, Switzerland, 29–31 October 2018;
Volume 87, pp. 146–155.
173. Yang, B.; Luo, W.; Urtasun, R. PIXOR: Real-Time 3D Object Detection From Point Clouds. In Proceedings of
the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT,
USA, 18–22 June 2018; pp. 7652–7660. [CrossRef]
174. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337.
[CrossRef]
175. Uy, M.A.; Lee, G.H. PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place
Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Salt Lake City, UT, USA, 18–23 June 2018.
176. Liu, Z.; Zhou, S.; Suo, C.; Yin, P.; Chen, W.; Wang, H.; Li, H.; Liu, Y.H. LPD-Net: 3D Point Cloud Learning
for Large-Scale Place Recognition and Environment Analysis. In Proceedings of the IEEE International
Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).