A Comprehensive Overview of Deep Learning Techniques For 3D Point Cloud Classification and Semantic Segmentation

Machine Vision and Applications (2024) 35:67
https://doi.org/10.1007/s00138-024-01543-1
RESEARCH
A comprehensive overview of deep learning techniques for 3D point

cloud classification and semantic segmentation
Sushmita Sarker1 · Prithul Sarker1 · Gunner Stone1 · Ryan Gorman1 · Alireza Tavakkoli1 · George Bebis1 ·
Javad Sattarvand2
Received: 15 November 2023 / Revised: 15 November 2023 / Accepted: 14 April 2024

© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024
Abstract
Point cloud analysis has a wide range of applications in many areas such as computer vision, robotic manipulation, and
autonomous driving. While deep learning has achieved remarkable success on image-based tasks, there are many unique
challenges faced by deep neural networks in processing massive, unordered, irregular and noisy 3D points. To stimulate
future research, this paper analyzes recent progress in deep learning methods employed for point cloud processing and
presents challenges and potential directions to advance this field. It serves as a comprehensive review on two major tasks in
3D point cloud processing—namely, 3D shape classification and semantic segmentation.
Keywords 3D classification · Computer vision · Point cloud · Semantic segmentation
1 Introduction RealSense, and Apple depth cameras) [1], and photogramme-

try technologies allow the creation of very large and precise
The advancement of 3D point cloud acquisition techniques point clouds. Point clouds are the raw output of most 3D data
combined with the accessibility of acquisition devices has gathering devices and serve as a versatile geometric repre-
enabled the use of real-world 3D models in a variety of sentation of 3D data [2].
robotic applications, including autonomous driving, aug- Effective point cloud analysis techniques are therefore
mented reality, and robotics. 3D scanners, Light Detection essential to understanding of 3D targets. While hand-crafted
and Ranging (LiDAR), RGB-D cameras (such as Kinect, features on point clouds have long been discussed in graphics
and vision, the recent overwhelming success of convolutional
B Sushmita Sarker neural networks (CNNs) for image analysis suggests that
sushmitasarkers@unr.edu extending CNN insights to the point cloud domain could be
Prithul Sarker beneficial. However, 3D point clouds are collections embed-
prithuls@unr.edu ded in continuous space, unlike images, which are structured
Gunner Stone on regular pixel grids. This makes effective feature aggrega-
gunnerstone@unr.edu tion and message carrying methods among points in the cloud
Ryan Gorman difficult to design, preventing the use of traditional deep net-
ryangorman@unr.edu work employed in computer vision. To form latent space
Alireza Tavakkoli mappings between input point coordinates and ground truth
tavakkol@unr.edu labels, pioneering research [3, 4] and subsequent work [5–
George Bebis 11] have developed specialty modules for feature aggregation
bebis@unr.edu and message passing and yielded suitable neural networks for
Javad Sattarvand point cloud data.
jsattarvand@unr.edu The focus of this review is on the analysis of deep learning
approaches for processing 3D point clouds for shape clas-
1 Department of Computer Science and Engineering, sification and semantic segmentation. We will also discuss
University of Nevada, Reno, NV, USA
some of the most prominent publicly available datesets used
2 Department of Mining and Metallurgical Engineering, to handle diverse point cloud processing challenges. These
University of Nevada, Reno, NV, USA
0123456789().: V,-vol 123

67 Page 2 of 54 S. Sarker et al.
datasets include ModelNet40 [12], ScanObjectNN [13], assessment of several indicators as well as future research
ShapeNet [14], S3DIS [15], Intra [16], Semantic3D [17], directions in this field and Sect. 8 concludes the paper.
SemanticPOSS [18], and SydneyUrbanObjects [19]. Although
there exists previous surveys of deep learning for 3D data,
such as [20–25], this review specifically aims to bridge the
gap by addressing techniques that previous surveys haven’t 2 Datasets and evaluation metrics
comprehensively covered. it offers an immersive exploration
of the latest frontiers in point cloud analysis. The primary 2.1 Datasets
objective of this paper is to equip readers with an extensive
understanding of the diverse representations present in point A high quality dataset is crucial for both training and evalu-
clouds, with a particular emphasis on recent advancements ating the effectiveness of machine learning algorithms. With
within the field of raw point-based methodologies which have the rise of deep neural networks, a reliable, well annotated
surged to the forefront of innovation. The key contributions and large datasets is even more crucial. In contrast to fea-
of our paper are as follows: ture engineering used in traditional machine learning, deep
network models rely on data and its annotations to extract
1. Deep learning models for shape classification and seman- appropriate feature embeddings. The purpose of 3D shape
tic segmentation of 3D point clouds, covering the most classification is to identify objects in a 3D point cloud [33–
up-to-date (2015–2023) advancements in this domain. 36] and assign a label to each discrete point. Thus, a large
2. Our review goes beyond existing papers by encompassing amount of well annotated training data is necessary for the
all existing methods for point cloud classification and seg- model to train effectively.
mentation that have not been extensively discussed before. In this paper, we collected a significant number of datasets
3. We present a comprehensive taxonomy that encompasses to examine the performance of the state-of-the-art deep learn-
both supervised and unsupervised approaches, including ing methods for various point cloud applications. Tables 1
previously overlooked mesh-based methods. Our paper and 2 lists some of the most common large-scale datasets
addresses the notable gaps in existing review papers by currently used for 3D point cloud shape classification and
incorporating these previously unexplored methods. segmentation.
4. Our analysis classifies and briefly discusses the myriad The purpose of 3D shape classification is to identify
models available, each leveraging distinct representations objects in a 3D point cloud [33–36] and assign a label to
and methodologies. This enables readers to grasp the each discrete point. Thus, a large amount of well annotated
diverse range of approaches within the context of their training data is necessary for the model to train effectively.
specific strengths and applications. For each dataset in Table 1, we present the establishment year,
5. We conduct comprehensive comparisons of existing number of samples, number of classes and a brief description.
methods using multiple publicly available datasets and We also categorize these datasets into two types: real-world
thoroughly expound upon the inherent strengths and lim- datasets [13, 30] and synthetic datasets [12, 14]. Objects in
itations embedded within these diverse approaches. the synthetic datasets exhibit no occlusion and are complete.
6. Our paper includes a thorough discussion of the current Objects in real-world datasets may be partially occluded
challenges in the field and offers insightful directions for while background noise, outliers and point perturbations may
future research. be present in the data. ModelNet10 and ModelNet40 [12] are
the most popular datasets employed for point cloud shape
Our paper’s novelty is evident not only in its coverage classification.
of recent advancements but also in its meticulous attention Table 2 provides an overview of commonly used large-
to previously overlooked areas in the literature. Addition- scale datasets for 3D point cloud segmentation. These
ally, the unique structure of our paper serves as a remarkable datasets are carefully curated and labeled to ensure repre-
resource, catering to readers of all backgrounds—from new- sentation of real-world scenarios and a wide range of object
comers seeking an approachable entry point to experts classes and scene types.
seeking a comprehensive taxonomy and insights into the lat- The datasets can be broadly classified into two groups:
est deep learning methods for point cloud processing. indoor datasets and outdoor datasets. In Table 2, we pro-
The structure of this paper is as follows: Sect. 2 introduces vide details such as the establishment year, number of points,
the datasets and evaluation metrics for the respective tasks. number of classes, sensors used, and a brief description for
Moving forward, Sects. 3 and 4 reviews the state-of-the-art each dataset. The data collection process involves various
methods for 3D shape classification, while Sects. 5 and 6 pro- sensors, including RGB-D cameras [37], Mobile Laser Scan-
vide comprehensive insights into the cutting-edge methods ners (MLS) [27, 31, 38], Aerial Laser Scanners (ALS) [39,
for semantic segmentation. Section 7 contains a quantitative 40], and other 3D scanners [15]. Photogrammetry is often
123
A comprehensive overview of deep learning techniques for 3D point cloud classification and… Page 3 of 54 67
Table 1 Available point cloud datasets for classification

Dataset Year Type No. of Samples No. of classes Description
McGill 3D Shape Benchmark [26] 2005 Syn 454 19 Most of the models in this dataset
were created using CAD modeling
software and rest came from the
Princeton Shape Benchmark
Sydney Urban Objects [19] 2013 RW 588 14 It contains scans generated by
mobile platforms equipped with
outdoor 3D scanner Velodyne
LIDAR. Raw Velodyne range
images are used for feature learn-
ing, and interpolated depth images
are used for feature evaluation
Paris-rueMadame [27] 2014 RW 642 26 Data were obtained by the Mobile
Laser Scanning (MLS) technology
and correspond to a 160 m long
street portion
ModelNet40 [12] 2015 Syn 12311 40 It is a complete and well-maintained
collection of 3D CAD object mod-
els. The related point cloud data
points are evenly sampled from the
mesh surfaces, and they are further
preprocessed by being moved to the
origin and scaled into a unit sphere.
It has 2468 meshes designated for
testing and 9843 meshes for train-
ing
ModelNet10 [12] 2015 Syn 4899 10 A subset of ModelNet40 dataset
which contains aligned orientation
of the CAD models from 10 cat-
egories.The shapes are split into
80-20 ratios for training (3,991) and
test (908)
ShapeNetCore [14] 2015 Syn 51190 55 ShapeNet, contains over 300 mil-
lion models, with 220,000 classified
into 3,135 classes. ShaperNetCore
is a subset of the ShapeNet dataset
IQmulus [28] 2015 RW – 22 This dataset contains 3D MLS data
from a dense urban environment in
Paris (France), composed of 300
million points
Object Scans [29] 2016 RW 398 9 This dataset includes more than
10,000 3D scans of real objects
ScanNet [30] 2017 RW 12283 17 ScanNet is an RGB-D video collec-
tion with 2.5M views across 1513
scenes annotated with mesh sur-
faces and 3D camera postures
Paris-Lille-3D [31] 2017 RW 2479 50 The dataset comprises of a point
cloud created by the Mobile Laser
System that was collected over
around 2 kms in two French cities
(Paris and Lille)
ScanObjectNN [13] 2019 RW 2902 15 ScanObjectNN is based on scanned
indoor scene data. With 2902 dis-
tinct object instances, it has 15,000
objects that are divided into 15 cat-
egories. Due of the background,
missing pieces, and deformations, it
is a difficult point cloud classifica-
tion dataset
123
Table 1 continued
Dataset Year Type No. of Samples No. of classes Description
Intra [16] 2020 Syn 2025 2 IntrA is an open-access 3D intracra-

nial aneurysm dataset of 103 3D
models of whole brain vessels and
1909 blood vessel segments, includ-
ing 1694 healthy vessel segments
and 215 aneurysm segments for
diagnosis
ModelNet40-C [32] 2022 RW 12308 40 ModelNet40-C is built using Mod-
elNet40 as a foundation. With 15
typical and realistic corruptions, it is
the first comprehensive dataset for
3D point cloud corruption robust-
ness
Syn synthetic, RW real world type dataset
employed to map the three-dimensional distance between intersection over union (mIoU), overall accuracy (OA), and
objects. mean class accuracy [15, 17, 31, 38]. These metrics pro-
vide insights into the quality of segmentation results. The
2.2 Evaluation metrics IoU metric calculates the intersection over union between
two sets, specifically the predicted bounding box (A) and the
Many evaluation metrics have been proposed to evaluate ground-truth bounding box (B). This overlap ratio is particu-
different point cloud Application. To evaluate classification larly relevant in segmentation tasks. The mIoU is the average
model usually the metric ‘Accuracy’ is used. In general, IoU computed for each category, providing an overall mea-
“accuracy” refers to the proportion of the model that pre- sure of segmentation accuracy. The IoU can be computed
dicts the correct outcome. using the following equation:
TP +TN |A ∩ B|
Accuracy = I oU Scor e(A, B) =
T P + T N + FP + FN |A ∪ B|
Here, TP, TN, FP, FN represent true positive, true negative, These metrics enable a comprehensive assessment of the
false positive and false negative respectively. accuracy and effectiveness of 3D point cloud application
For 3D point cloud classification, overall accuracy (OA) algorithms.
and mean class accuracy (mAcc) are the most commonly
used performance standards. “OA” evaluates the average
accuracy across all test instances, while “mAcc” is used to 3 3D point cloud classification
evaluate the mean accuracy across all shape classes. Nowa-
days, dice coefficient (F1) score is also used as a criterion for The subject of a 3D point cloud shape classification is to pro-
performance evaluation in classification of 3D point clouds. duce a label for the entire point cloud determining the shape
of the object it contains. Analogous to 2D image classifica-
1
C
tion, methods for 3D shape classification tasks usually follow
m Acc = Accuracy two main stages. First the embedding of each point is learned
C
c=1
in order to generate a global embedding with an aggregation
2 × pr ecision × r ecall
F1 = encoder. Next the embedding is passed through several fully
pr ecision + r ecall connected layers to obtain the final shape label.
Based on the input data type, point cloud classification
The F1 score is defined as the harmonic mean of precision
models can be generally divided into five major classes, i.e.,
and recall. Where,
mesh based methods, projection-based methods, volumetric-
TP TP based methods, hybrid methods, and raw point-based meth-
Pr ecision = and Recall = ods. Mesh data is a common method for representing 3D
T P + FP T P + FN
shapes in computer graphics, consisting of interconnected
In 3D point cloud segmentation, several performance vertices, edges, and faces. While mesh data provides an effi-
metrics are commonly used for evaluation, including mean cient way to store and render 3D models, it also presents
123
Table 2 Available point cloud datasets for segmentation

Name Year #Points Classes Sensors Description
Oakland [41] 2009 1.61M 5 MLS The data set consists of two subsets (part2 and part3),
each having a unique local reference frame and
100,000 3-D points in each file. Filtered, labeled, and
remapped from 44 into 5 labels, the training/validation
and testing data have 36932/91579 and 1.3 M points,
respectively
IQmulus [28] 2013 300M 22 MLS The database comprises 300 million points of 3D MLS
data from a dense metropolitan setting in Paris, France
Paris-rue-Madame [27] 2014 20M 27 MLS The Paris-rue-Madame dataset contains 3D Mobile
Laser Scanning (MLS) data from rue Madame, a street
in the 6th Parisian district (France) which comprises
an approximately 160 m long street section
ScanNet [30] 2017 – 20 RGB-D ScanNet is a dataset of 2.5 million RGB-D images
of 1513 scans collected in 707 different places. It
contains richly annotated RGB-D scans of real-world
surroundings
S3DIS [15] 2017 695.9M 13 Matterport The dataset is collected in 6 large-scale indoor areas
and covers over 6,000 square meter. It has over 70,000
RGB images. It contains 272 3D room scenes of 13
categories
Semantic3D [17] 2017 4B 8 MLS Semantic3D is a point cloud database of scanned
urban outdoor scenes with over 3 billion points. It
has 15 training and 15 test scenes with 8 class labels
annotated. This extensive set of 3D point clouds with
labels includes a variety of urban scenes
Paris-Lille-3D [31] 2018 143.1M 50 MLS Paris-Lille-3D is a metropolitan 3D point clouds
dataset with 140 million points spanning a distance
of 2 kms in two separate cities. The items were manu-
ally segmented, and each one was assigned to one of
50 classes
DublinCity [42] 2019 260M 13 ALS The dataset contains about 260 million labeled laser
scanning points out of 1.4 billion points. These are
carefully annotated into approximately 100,000 items
from the Dublin LiDAR point cloud in 2015
SemanticKITTI [38] 2019 4549M 28 MLS SemanticKITTI is a large-scale outdoor-scene dataset
based on the KITTI Vision Benchmark for point cloud
semantic segmentation. It contains 43552 scans of out-
door scenes of 28 classes, of which 23201 are used for
training, while the rest of 20351 are reserved for test-
ing
PreSIL [43] 2019 3135M 24 LiDAR It has more than 50,000 frames and includes high-
definition images with full resolution depth informa-
tion, semantic image segmentation, point-wise seg-
mentation, and in-depth annotations for every vehicle
and person
SensatUrban [44] 2020 2847M 13 UAV The collection includes sizable portions of two UK
Photogrammetry cities, covering around 6 square kilometers of the city
landscape. Each 3D point in the dataset is assigned to
one of 13 semantic classifications
123
Table 2 continued
Name Year #Points Classes Sensors Description
Swiss3DCities [45] 2020 226M 5 UAV Swiss3DCities is a new outdoor urban 3D pointcloud
Photogrammetry dataset, covering a total area of 2.7 square kilometers,
and is sampled from three Swiss cities with different
characteristics. It is manually annotated for semantic
segmentation using per-point labels
LASDU [46] 2020 3.12M 5 ALS An enhanced large-scale geometric dataset based on
ShapeNetCore with accurate semantic region anno-
tations, and detailed per-point labeling. It includes
31963 models in 16 different shape categories
Campus3D [47] 2020 937.1M 24 UAV It is a well annotated 3D point cloud dataset for
Photogrammetry different outdoor scene comprehension tasks. The
dataset was created by photogrammetric processing
of unmanned aerial vehicle (UAV) images taken at
the National University of Singapore (NUS)
SemanticPOSS [18] 2020 216M 14 MLS A huge number of dynamic instances are included
in 2988 different and challenging LiDAR scans that
make up the SemanticPOSS dataset for 3D point cloud
segmentation. It employs the same data structure as
SemanticKITTI and is acquired at Peking University
Toronto-3D [48] 2020 78.3M 8 MLS Toronto-3D is a large-scale urban outdoor point cloud
dataset that was collected for semantic segmentation
by an MLS system in Toronto, Canada. This dataset
has 78.3 million points and spans a distance of nearly
1 km of road
DALES [40] 2020 505M 8 ALS The Dayton Annotated LiDAR Earth Scan (DALES)
data set is a new, extensive aerial LiDAR data set with
more than 500 million hand-labeled points covering
an area of 10 square kilometers and eight object cate-
gories
RELLIs-3D [49] 2021 176.1M 20 LiDAR RELLIS-3D is a multi-modal dataset for off-road
robotics. It was gathered in an off-road setting and
includes 6235 photos and 13,556 LiDAR scan anno-
tations
WADS [50] 2021 3.6B 22 MLS It is the first multi-modal dataset with dense point-wise
labeled sequential LiDAR scans acquired in harsh
winter conditions
H3D [51] 2021 73.4M – ALS The H3D (Honda Research Institute 3D) dataset is a
large-scale RGB-D dataset which was released by the
Honda Research Institute and contains over 100,000
images of 244 objects in various cluttered scenes, with
6D pose annotations for each object instance
SynLIDAR [52] 2022 19,482M 32 Synthetic SynLiDAR is a 19 billion point synthetic LiDAR
LiDAR sequential point cloud dataset with point-by-point
annotations of 32 semantic classes
STPLS3D-Real [53] 2022 – 6 UAV The dataset includes outdoor images taken at four
Photogrammetry actual locations. The aerial images were taken using a
crosshatch-style flight pattern, with specified overlaps
of 75–85% and flight heights of 25–70 ms
STPLS3D-SyntheticV1 [53] 2022 – 7 UAV For semantic and instance segmentation applications,
Photogrammetry STPLS3D is an extensively annotated synthetic 3D
aerial photogrammetry point cloud dataset. It contains
16 square kilometers of landscape and up to 18 fine-
grained semantic categories
ALS Airborne Laser Scanning; MLS Mobile Laser Scanning
123
Table 3 Comperative 3D point cloud classification result on various available datasets for projection-based, volumetric-based, mesh-based and Hybrid methods
Model name Year Rep. ModelNet 40 ModelNet 10 ScanObjectNN SydneyUrbanObjects
(OA) (mAcc) (OA) (mAcc) (OA) (mAcc) (F1)
Mesh-based Methods
Geometry Image [84] 2016 PM 83.90% 51.30*% 88.40% 74.90*% – – –
Cross-atlas [85] 2019 PM 87.50% – 91.20% – – – –
SNGC [86] 2019 PM 91.60% – – – – – –
MeshNet [54] 2019 PM 91.90% 81.90*% – – – – –
MeshWalker [55] 2020 PM 92.30% – – – – – –
CurveNet [57] 2020 CM + Vol. 90.70% – 94.20% – – – 79.30%
PolyNet (unsqueezed) [56] 2021 PM 92.42% 82.86*% 94.93% 84.62*% – – –
RepSurf-U [58] 2022 TCM 94.70% 91.70% – – 84.60% 81.90% –
Projection-based Methods
MVCNN [60] 2015 80 views 90.10% – – – – – –
MHBN [62] 2018 6 views 93.10% 94.70% 95.00% 95.00% – – –
GVCNN [61] 2018 8 views 93.10% 84.50% – – – – -
CNN + LSTM + voting [66] 2019 8 views 91.05% – 95.29% – – – 75.30%
Dominant Set clustering [65] 2019 24 views 93.80% – – – – – –
SimpleView [87] 2021 6 view 93.90% 91.80% – – 80.50% 75.50% –
MVTN [67] 2021 20 views 93.50% 92.20% – – 82.80% – –
MVACPN [68] 2021 6 views 93.64% 91.53% – – – – -
Volumetric-based Methods
3D ShapeNet [12] 2015 Vol. – 77.32% – 83.54% – – –
A comprehensive overview of deep learning techniques for 3D point cloud classification and…
VoxNet [75] 2015 Vol. – 83.00% – 92.00% – – 73.00%

FPNN [88] 2016 Vol. 88.40% – – – – – –
VRN Ensemble [73] 2016 Vol. – 95.54% – 97.14% – – –
Binary VoxNetPlus [89] 2017 Vol. – 85.47% – 92.32% – – 75.50%
LightNet [90] 2017 Vol. – 86.90% – 93.39% – – 76.00%
LP-3DCNN [91] 2019 Vol. – 92.10% – 94.40% – – –
MRCNN [71] 2019 Vol. – 86.20% – 91.30% – – –
MS-VDNN [92] 2020 Vol. – 92.93% – 95.30% – – –
AF2M [72] 2021 Vol. 93.16% – – – – – –
Hybrid Methods
FusionNet [93] 2016 Vol. + 60 views – 90.80% – 93.11% – – –
PVNet [77] 2018 PC + 12 Views 93.20% – – – – – –
Page 7 of 54
67
123
SydneyUrbanObjects
Results with ‘*’ sign show mean average precision score. The symbol ‘–’ is used to indicate that results are unavailable. The methods are arranged in chronological order within their corresponding
challenges due to its inherent complexity and irregular-
ity. Projection-based methods project the unstructured point
cloud into 2D images (rasterization) to extract features. These
categories. The top-performing methods in each category have been highlighted in bold, while the method(s) achieving the best overall performance across all categories are underlined
76.00% features are then fed into 2D or 3D convolutional networks.
In volumetric-based methods, the point cloud is discretized
(F1)
into a regular grid, creating a volumetric representation of

–
–
–
–
–
–
–
–
the data. In contrast, point-based methods directly work on
raw points in the cloud. Point-based methods have become
84.40%
(mAcc)
increasingly popular since they reduce the computational

complexity of the network without any explicit information
–
–
–
–
–
–
–
–
loss. Papers that combine the benefits of both point and pro-
ScanObjectNN
jection methods are referred to in this review as “hybrid”

techniques.
86.40%
85.50%
In Sect. 3, we have discussed the various models for 3D

(OA)
shape classification with a focus on input data representation.

–
–
–
–
–
–
Meanwhile, Sect. 4 presents an in-depth analysis of method-

ologies exclusively reliant on raw point clouds as their input.
(mAcc)
Table 3 provides a comprehensive comparison of the afore-

mentioned methods for 3D point cloud classification across
–
–
–
–
–
–
–
–
–
various datasets. The methods have been categorized based

ModelNet 10
on the representation of the input point cloud utilized in

each approach. Furthermore, the methods are arranged in
95.20%
(OA)
chronological order within their respective categories. The

–
–
–
–
–
–
–
–
evaluation of each method’s performance is based on met-

rics such as overall accuracy (OA), mean accuracy (mAcc),
mean average precision (mAP), and F1 score.
91.20%
88.90%
(mAcc)
3.1 Mesh-based methods

–
–
–
–
–
–
–
ModelNet 40
A mesh is the most widely used structure for representing

surfaces in computer graphics and is comprised of a set of
95.40%
91.60%
92.00%
93.60%
93.50%
94.00%
92.70%
94.40%
93.20%
(OA)
faces and vertices that define surfaces on a three-dimensional

shape. As a result, this representation carries structural infor-
mation about the object’s surface. Furthermore, by pruning
vertices in a mesh and removing extraneous data, mesh-based
PC + 12 Views
PC + 20 views
PC + 3 Views
representations provide a memory-efficient way to store com-

PC + Image
PC + Vol.
PC + Vol.
PC + Vol.
PC + Vol.
PC + Vol.
plete geometry details. However, this representation is often

overlooked as a suitable input modality for deep learning
Rep.
algorithms. This could be attributed to the fact that a 3D mesh

does not provide a grid-like pattern for representing the data
to be used in CNNs. In addition, weight sharing in mesh-
OA Overall Accuracy; mAcc Mean Accuracy
2018
2018
2019
2021
2022
2022
2022
2022
2022
Year
based approaches presents a difficult challenge because of

changes in the number of vertices, the permutation of adja-
cent vertices, and their pairwise distances.
To learn 3D shape representation from mesh data, Feng
Point-Voxel Transformer [81]
et al. introduced a mesh neural network as MeshNet in [54].

EQ-PointNet++ (SSG) [83]
This approach introduces face-unit and feature splitting and

PointView-GCN [80]
proposes a general architecture with usable and efficient

Table 3 continued
3DmFV-Net [76]
building blocks. The face unit takes as input the features of

PointCMT [82]
PointGrid [74]
GSV-Net [94]
PVRNet [78]
DSPoint [79]
the vertices and edges that make up a single face and applies
Model name
convolutional operations to learn representations of that face.

These representations are then combined to form higher-level
representations of the mesh. MeshNet can effectively manage
123
Fig. 1 A taxonomy of deep learning methods for 3D point cloud classification
mesh irregularity and complexity concerns for 3D shape rep- nition. Similarly, Ran et al. [58] proposed a novel approach,
resentation. Alternatively, MeshWalker [55] directly learns Triangular RepSurf, that draws inspiration from triangle
the shape from a given mesh without the need for changing meshes in computer graphics. It can be computed by prede-
the mesh data representations. This is done by examining the fined geometric priors after surface reconstruction. RepSurf
geometry and topology of the mesh by randomly traversing has two variants: Triangular RepSurf and Umbrella RepSurf.
its surface via a number of walks. Each walk is arranged as a Triangular RepSurf represents each local region as a triangle
list of vertices and imposes some degree of regularity on the mesh, while Umbrella RepSurf represents each local region
mesh. as an umbrella-shaped structure [59].
Instead of randomly processing vertices, PolyNet [56] Table 3 shows that RepSurf-U [58] outperforms all other
can efficiently learn and extract features from a polygon mesh-based methods on ScanObjectNN [13] dataset, achiev-
mesh representation of 3D objects using continuous polyno- ing an OA of 84.6% and an mAcc of 81.9%. PolyNet
mial convolution (PolyConv). A PolyConv is a polynomial (unsqueezed) [56] achieves the highest OA and mAcc scores
function with learnable coefficients that develops continuous on ModelNet 40 [12] dataset with 92.4% and 82.86%
distributions about the features of vertices sharing the same respectively. On the other hand, among all other methods,
ploygonal face. These convolutional filters distribute appro- CurveNet [57] achieves the highest F1 score of 79.3% on
priate weights among the vertices in the local patches formed SydneyUrbanObjects dataset.
by each vertice and its neighboring vertices on the surface
of the polygon. This process is invariant with respect to the
3.2 Projection-based methods
quantity of neighboring vertices, their permutations, the pair-
wise distances between them and the choice of central vertex
Projection-based methods are popular approaches for ana-
in a local patch. Although mesh-based learning has gained
lyzing unstructured 3D point clouds. By projecting the point
significant popularity in the field of computer vision, it still
cloud onto multiple two-dimensional (2D) planes (or views),
poses some challenges. This can be attributed to the fact that
the data can be more effectively analyzed using standard
3D meshes do not conform to the grid-like structure that is
image processing techniques. The resulting view-wise fea-
typically used for representing data in convolutional neural
tures can be collected and concatenated to produce a more
networks (CNNs).
precise classification of the point cloud’s shape. However,
In order to address the challenges with meshes, recent
one significant challenge faced by projection-based methods
advancements in this area have drawn inspiration from trian-
is in combining multiple view-wise features into a distinct
gle meshes and curvature maps used in computer graphics.
global representation that accurately captures the overall
Muzahid et al. [57] recently introduced a new approach,
structure of the point cloud.
called CurveNet, using curvature directions to capture geo-
Su et al. introduced a novel approach for processing
metric features from polygon meshes as inputs to a 3D CNN.
point clouds called Multi-View Convolutional Neural Net-
The data structure of CurveNet enables object class label
work (MVCNN) [60]. This approach involves representing
prediction by learning perceptually significant and salient
point clouds as a collection of 2D images captured from
features. Curvature directions provide detailed surface infor-
multiple views obtained at different angles. Features are
mation from 3D objects, allowing the model to generate more
extracted from these different views and then combined
precise and discriminative features for accurate object recog-
into a global descriptor through max-pooling. However, one
123
potential drawback of max-pooling is that it only retains voxelization. Each point in the point cloud is assigned to the
the most important parts of each view, resulting in a loss closest voxel center in the 3D grid, resulting in a volumetric
of information. While MVCNN does not explicitly differ- representation of the point cloud.
entiate between various views, it is beneficial to have some Wu et al. [12] introduced 3D ShapeNet, a deep belief-
understanding of the relationships between them. based convolutional network that learns the distribution of
One such method that specifically attempts to etablish points from diverse 3D shapes. The network represents the
relationships between views of a point cloud is Group-view points as a probability distribution of binary variables on
Convolutional Neural Network (GVCNN) [61]. GVCNN voxel grids. Despite promising results, these methods strug-
splits the set of views into groups based on their discrimi- gle to scale to dense 3D data, as memory consumption
nation scores, thereby leveraging the relationship between increases cubically with resolution.
them for better results. Other techniques such as Multi- To enable hierarchical learning of features, Ghadai et al.
View Harmonized Bilinear Pooling Network (MHBN) [62] in [71] presented a flexible multi-level unstructured voxel
use harmonized bilinear pooling to combine local convolu- representation of spatial data in their MRCNN framework.
tional features into a more compact and discriminative global This method uses a multi-level voxelization framework,
descriptor. Meanwhile, Yang et al. proposed a method that described as a binary occupancy grid at two levels. This is
exploits the inter-relationships between views and regions done to represent a 3D object with two distinct user-specified
using a relation network to generate a more accurate 3D resolutions of voxel grids. Despite the lack of structure in the
object representation [63]. Unlike earlier techniques, Wei et multi-level data representation, MRCNN can successfully
al. presented the View Graph Convolutional Network (View- learn features, allowing for more effective and efficient 3D
GCN) which employs a directed graph to consider many shape classification.
views simultaneously [64]. Other strategies such as Dom- To exclusively take voxelized data as input in an end-
inant Set Clustering Network (DSCN) [65], and Learning to-end encoder-decoder CNN architecture, Cheng et al.
Multi-View 3D Object Recognition (LMV3D) [66] have also suggested a similar technique called (AF)2-S3Net [72]. This
been proposed to improve recognition accuracy. method uses a multi-branch attentive feature fusion mod-
Despite the popularity of methods that utilize raw point ule to learn both global contexts and local features in the
data such as PointNet [3], some projection-based methods encoder. To promote generalizability, an adaptive feature
have yielded promising classification results. For instance, selection module with feature map re-weighting is utilized
Abdullah et al. recently introduced a differentiable mod- on the decoder side to actively emphasize contextual infor-
ule that predicts the optimal viewpoint for a multi-view mation from a feature fusion module.
network [67]. MVTN overcomes the static nature of exist- Table 3 illustrates that AF2M [72] achieved the highest
ing projection-based techniques by utilizing adaptive view- overall accuracy on the ModelNet40 dataset. However, they
points, which it learns to regress. These viewpoints are ren- did not report their mAcc result. On the other hand, VRN
dered with a differentiable module to train the task-specific Ensemble [73] only provided their mAcc results for both
network end-to-end. This results in the most appropriate ModelNet40 and ModelNet10 [12], which are highlighted in
views for the task at hand. bold and underlined in the table as they represent the highest
In another study, Wang et al. [68] presented a multi-view results for those datasets among all other approaches. Nev-
attention-convolution pooling network (MVACPN) frame- ertheless, it is worth noting that none of the papers recorded
work using Res2Net [69] to extract features from several their findings for the ScanObjectNN dataset.
2D views. MVACPN effectively resolves the issues of fea-
ture information loss caused by feature representation and
detail information loss during dimensionality reduction by 3.4 Raw point-based methods
employing the attention-convolution pooling method.
The results in Table 3 demonstrate that the MHBN [62] To address information loss and maintain point cloud details,
method outperformed other projection-based methods with Raw Point-based Methods offer a promising alternative in
the highest mean accuracy on the ModelNet 40 [12] and Mod- point cloud processing. These methods operate directly on
elNet 10 datasets, while [66] obtained the highest overall the raw point cloud data, avoiding the need for transforma-
accuracy on ModelNet10 among all other approaches. tion into other representations. PointNet [3] pioneered this
approach by consuming unordered point sets and achieving
3.3 Volumetric-based methods permutation invariance through symmetric functions. This
novel approach has facilitated the accurate analysis of raw
Another approach to produce structured data for processing point cloud data, eliminating the need for conversion into
in traditional CNN architectures is to convert the point cloud other representations. Raw point processing has been a major
into a regular 3D grid of cubic voxels using a process called focus of recent models. Since a substantial body of work
123
Fig. 2 The Stanford Bunny [70] model in different three-dimensional representations
exists on raw point processing, we will explore the different Gaussian mixture model (GMM) and computing the Fisher
learning strategies employed in these methods in Sect. 4. vector (FV) representation of the point cloud. In contrast
to the 3DmFV approach, PVNet [77] uses an embedding
network to project high-level global features collected from
3.5 Hybrid methods
multi-view images into the subspace of point clouds, which
are then blended with point cloud features using a soft atten-
As discussed earlier, point cloud classification techniques
tion mask. Finally, for fused features and multi-view features,
fall under three broad categories of projection-based, voxel-
a residual connection is used to achieve shape recognition.
based or point-based neural network(NN) models to handle
To further improve accuracy, You et al., proposed to leverage
3D input. All these approaches, however, are computation-
the relationship between a 3D point cloud and its multiple
ally inefficient. The memory usage and computation cost of
views through a relation score module in PVRNet [78].
voxel-based models expand cubically with input resolution.
In DSPoint [79], Zhang et al. introduced a dual-scale
In point-based networks, the majority of the computational
point cloud recognition approach that combines local fea-
cost is on processing the sparse input points to produce data
tures and global geometric architecture. Unlike conventional
conducive for the remainder of the network. This process
designs, DSPoint operates concurrently on voxels and points,
often leads to poor memory localization, rather than effec-
extracting local and global features. The network disentan-
tive feature extraction. Approaches that combine a variety of
gles point features through channel dimensions, enabling
input data modalities are known as Hybrid methods. These
dual-scale processing. It utilizes pointwise convolution for
methods are relatively new, and an increasing number of
fine-grained geometry parsing and voxel-wise global atten-
researchers are investigating various challenging questions
tion for long-range structural exploration. To align and blend
in this domain.
the local–global modalities, a co-attention fusion module
Integrating voxel-based learning with point-based learn-
is designed for feature alignment, facilitating inter-scale
ing into a unified framework has been the subject of recent
cross-modality interaction by incorporating high-frequency
developments in point cloud classification. For example,
coordinate information. In contrast, PointView-GCN [80]
PointGrid, presented in [74], is a hybrid network that com-
uses multi-level Graph Convolutional Networks (GCNs) to
bines the point and grid representations. To retrieve the
hierarchically aggregate shape features from single-view
specifics of local geometry, it uses a 3D CNN to learn the
point clouds. It captures geometrical cues and multiview rela-
grid cells with fixed points. The PointGrid model employs
tions for 3D shape classification by leveraging partial point
the same transformation mechanism as VoxNet [75], but it
cloud data from multiple views.
can better describe scale changes, minimizes data loss, and
Voxel-based models have regular data locality and can
takes up less memory.
efficiently encode coarse-grained features. On the other hand
The development of real-time algorithms for 3D point
point-based networks preserve the accuracy of location infor-
cloud classification is highly challenging due to the large size
mation with the flexible fields. Inspired by this, Zhang et al.
and complexity of point clouds. To overcome this challenge,
in [81] proposed a hybrid point cloud learning architecture,
Ben-Shabat et al. [76] utilized the 3D modified Fisher Vec-
called PointVoxel Transformer. The authors used the Sparse
tor (3DmFV) approach to convert an input point cloud into
Window Attention (SWA) module to gather coarse-grained
3D grids, followed by CNNs to extract features and fully
local features from nonempty voxels. The module not only
connected layer to classify them in real-time. The 3DMFV
bypasses the expensive irregular data structuring and invalid
method is an extension of the widely-used 2D modified
empty voxel computation, but also obtains linear computa-
Fisher Vector (mFV) method for image classification.
tional complexity with respect to voxel resolution. In another
Specifically, it extends the mFV method to 3D point clouds
recent work by Yan et al., called PointCMT [82], both image
by encoding the gradient information of the 3D grids using a
123
and point cloud data is used to train the model for shape number of parameters reported by each paper and specifies
analysis tasks. This approach combines the advantages of whether the model used only the point cloud or also incorpo-
both modalities, leveraging the rich texture information from rated point features such as normals as input. The evaluation
images and the geometric structure from point clouds. of each method’s performance is based on metrics such as
In [83], the authors introduced the Embedding-Querying overall accuracy (OA), mean accuracy (mAcc), and F1 score.
(EQ-Paradigm), a unified approach for 3D point cloud under- Notably, the results for the Intra dataset are showcased and
standing. EQ-Paradigm combines different task heads with obtained from [98].
existing 3D backbone architectures, offering advantages such
as a unified framework for tasks like object detection, seman-
tic segmentation, and shape classification. It seamlessly 4.1 Supervised training
integrates with diverse 3D backbone architectures and effi-
ciently handles large point clouds. Supervised learning for point cloud is a powerful approach
Table 3 Shows the qualitative evaluation of classifica- for processing and analyzing 3D data. It involves training
tion results of hybrid methods on various publicly available the system on labeled point clouds to extract meaningful
datasets. The best-performing method on ModelNet 40 information such as object classification, semantic segmen-
dataset was PointView-GCN, which achieved an OA of tation, and registration [3, 99]. The algorithm is continuously
95.40%, while PointCMT achieved the highest mAcc and improved as it compares its predictions with the desired
F1 scores among other methods on ScanObjectNN dataset. output, allowing for more accurate results with each itera-
tion. However, supervised learning requires large amounts
of labeled data, which can be costly to obtain.
4 Learning strategies for point based Supervised learning approaches can be classified into
methods in classification seven categories: pointwise MLP, hierarchical-based, con
volution-based, RNN-based, graph-based, transformer-based,
Since the development of PointNet, numerous models have and other methods. These categories can be further grouped
emerged that can process raw point data directly without into feedforward and sequential training based on the model
information loss [3, 95–97]. These models employ diverse architecture and data processing method.
techniques and network architectures to handle unstructured Feed-forward training It is an extensively used technique
data. In this section, we will discuss about the learning for processing point clouds, where the individual points of
strategies that these models have adopted for processing raw the point cloud are passed through multiple layers in a neural
points. We provide a detailed discussion of each category network to generate activation maps for successive layers.
and highlight their key differences and commonalities. Gen- This allows the model to capture complex relationships by
erally, methods in this category can be broadly categorized transforming the data through non-linear transformations.
into two groups depending on the type of supervision used Based on the operations performed on points in each layer,
during training. Figure 3 provides a comprehensive catego- this group encompasses the following methods: multilayer
rization of raw point-based approaches for point cloud shape perceptron (MLP)-based, convolution-based, hierarchical-
classification. based, and graph-based architectures.
Supervision in training Supervision in point cloud pro- Sequential training It is a type of training method, the
cessing involves training neural networks with labeled point model is trained on a sequence of input data. In point cloud
clouds to make predictions on unlabeled ones. It can be processing, the input data is treated as ordered points or
divided into two categories: supervised and unsupervised patches, processed in sequence. Unlike feed-forward train-
training. Supervised methods use labeled data to teach the ing, where data flows from input to output, sequential training
model how to predict outputs for new point clouds. Unsu- uses the output from one time step as the input for the next.
pervised methods, on the other hand, identify patterns and This approach is beneficial in point cloud processing as it
structures in the input data without prior knowledge of the allows the model to process local patches and predict fea-
output. Both supervised and unsupervised methods are cru- tures for the next point. Sequential training is commonly used
cial in point cloud processing, depending on the availability in recurrent neural networks (RNNs) and transformer-based
and quality of labeled data. More details about supervised architectures designed to process sequential data.
and unsupervised methods of raw point cloud processing are Supervised learning is a crucial element of point cloud
discussed in Sects. 4.1 and 4.2 respectively. processing pipelines, particularly in cases where high accu-
Table 4 presents a comprehensive comparison of raw racy is essential. In the following sections, we will provide
point-based methods for 3D point cloud classification across an in-depth discussion of the various network architectures
various datasets. The methods are organized chronologically that are utilized for feature learning of individual points, with
within their respective categories. The table includes the supervised learning being the primary technique.
123
Fig. 3 A taxonomy of deep learning approaches for raw point-based 3D point cloud classification
4.1.1 Multi-layer perceptron (MLP) methods the Farthest Point Sampling (FPS) technique and presented a
local-non-local (L-NL) module to capture the local and long-
This method is based on fully connected layers that process range relationships of the sampled points. Duan et al. [97]
each point independently. The network takes a point cloud proposed utilizing MLP to learn structural relational prop-
and applies a set of transforms and shared MLPs to generate erties between distinct local structures using a Structural
features. These features are then aggregated to yield a global Relational Network (SRN). Lin et al. [102] used a lookup
representation using max-pooling that describes the original table to speed up the inference process for both the input and
input cloud. Another MLP classifies that global representa- function spaces learned by PointNet. On a consumer grade
tion to produce output scores for each class. computer, the inference time for the ModelNet and ShapeNet
PointNet [3], in particular, uses multiple MLP layers to datasets is 1.5 ms and 32 times faster than PointNet. In RPNet,
learn pointwise features independently and a max-pooling Ran et al. [103] studied the capabilities of local relation oper-
layer to extract global features. The local structural infor- ators and developed the group relation aggregator (GRA), a
mation between points cannot be captured since features are scalable and efficient module for learning from both low-
learned independently for each point in PointNet. As a result, level and high-level relations. The module calculates a group
Qi et al. presented PointNet++ [4], a hierarchical network that feature by aggregating the features of inner-group points that
captures complex geometric patterns in the neighborhood are weighted by geometric and semantic relations. RPNet
of each point. PointNet++ is inspired by standard CNNs, contains approximately a third of the parameters of Point-
which use a stack of convolutional layers to capture fea- Net++ and double the computation speed.
tures at different scales. The points within a sphere centered Previous works have mainly focused on utilizing advanced
at x are defined as the local region of the point x. In par- local geometric extractors such as convolution, graphs, and
ticular, one set abstraction level contains a sampling layer, a other mechanisms to capture 3D geometries. However, these
grouping layer to identify local regions, and a PointNet layer. methods can lead to increased computational costs and
PointNet [3] and PointNet++ [4] prompted a lot of follow-up memory usage. To address this challenge, Ma et al. [104]
work due to their easy implementation and promising perfor- developed PointMLP, a pure residual MLP network that does
mance. Mo-Net [100] has a similar design to PointNet, but it not rely on “complex” local geometrical extractors. Despite
takes a fixed collection of moments as input. SRINet [101] this simplicity, PointMLP performs well due to highly opti-
uses a PointNet-based backbone to extract a global feature mized MLPs. They developed a lightweight local geometric
and graph-based aggregation to extract local features after affine module that adaptively modifies the point feature in
projecting a point cloud to generate rotation invariant repre- a local region to boost efficiency and generalization ability.
sentations. PointMLP trains two times faster and tests seven times faster
Yan et al. [7] used an Adaptive Sampling (AS) module in than the current models. PointNext [105] overcomes the lim-
their work, PointASNL, to adaptively adjust the coordinates itation of PointNet++ [4] by utilizing a thorough analysis of
and attributes of points. They sampled these points using model training and scaling techniques. The authors add sep-
123
Fig. 4 A simplified architecture

of PointNet [3] where
parameters n and m denote point
number and feature dimension,
respectively
arable MLPs and an inverted residual bottleneck design to associated with the offsets of these surrounding points with
PointNet++ to facilitate effective and efficient model scal- respect to the center point, and the convolution operation is
ing. In PointStack [106], the authors proposed a method performed by sliding the kernel over the input point cloud,
that utilizes multi-resolution features and learnable pooling multiplying the weights of the kernel with the corresponding
to extract meaningful features from point cloud data. The features of the surrounding points, and summing the prod-
multi-resolution features capture the underlying structure of ucts. This process is repeated at each location of the point
the point cloud data at different scales, while the learnable cloud, resulting in a new set of features that represent the
pooling enables the system to dynamically adjust the pooling convolved output.
operation based on the features. Pointwise-CNN [108] employs a unique approach to
Table 4 shows that PointStack achieves the best results for define convolutional kernels on each grid by transform-
both the ModelNet 40 and ScanObjectNN datasets, while ing non-uniform 3D point clouds into uniform grids, with
PointASNL achieves the best result for the ModelNet 10 weights assigned to all points that fall within the same grid.
dataset, and PointNet++ for the Intra dataset among all the The output of the current layer is determined by comput-
MLP-based methods. ing the mean features of all the nearby points on the same
grid, which are weighted and aggregated from all the grids.
Meanwhile, Mao et al. [109] introduced the interpolated
4.1.2 Convolutional methods convolution operator InterpConv to assess the geometric
relations between input point clouds and kernel-weight coor-
The architecture of convolution networks is an emulation of dinates by superimposing point features onto neighboring
biological processes and is closely related to the organiza- discrete convolutional kernel-weight coordinates.
tion of the visual cortex in animals. In this architecture, each To achieve rotation invariance, Zhang et al. [110] intro-
cortical neuron primarily responds to inputs within its recep- duced the RIConv operator, which transforms convolution
tive field. Multiple neurons with overlapping receptive fields into 1D using a clustering approach on low-level rotation
respond to the entire field at a particular location. To extract invariant geometric features. Another approach proposed
features from low-level to high-level features, convolutional by Zhang et al. [111] is shellConv, an efficient permuta-
networks are stacked with convolution layers, rectified lin- tion invariant convolution for point cloud deep learning. It
ear units, and pooling layers. The strengths of convolutional partitions the local point neighborhood into concentric spher-
networks include shared weights, translation invariance, and ical shells, extracting representative features based on the
feature extraction, as demonstrated in several works, such statistics of the points inside. ShellNet [111] utilizes Shell-
as ApolloCar3D [107], and Semantic3D [17]. VoxNet [75] Conv as the core convolution, enabling it to handle larger
illustrated the use of 2D grid kernels for processing 3D point receptive fields with fewer layers. However, it may not cap-
cloud data. However, due to the irregularity of point clouds, ture long-range point relations and overlooks certain patterns
constructing convolution kernels for 3D point clouds presents present in point cloud structures. To overcome this limitation,
greater challenges compared to 2D counterparts. Modern 3D Point-PlaneNet [112] introduces a new neural network that
convolution methods can be categorized as discrete or con- leverages spatial local correlations by considering the dis-
tinuous based on the nature of the convolution kernel used. tance between points and planes. The proposed PlaneConv
Discrete convolution Discrete convolution for point operation learns a set of planes in R n space, allowing it to
cloud processing involves defining a convolutional kernel extract local geometric features from point clouds. Addition-
on a regular grid based on a set of surrounding points that ally, DeltaConv [113] introduces anisotropic filters on point
are located within a certain radius from the center point. clouds by mixing geometric operators from vector calculus,
This technique leverages the structural properties of point which allows the network to be split into scalar and vector
clouds, which can be seen as sets of irregularly spaced points
in a high-dimensional space. The weights of the kernel are
123
streams that can expressively represent directional informa- spherical harmonic kernels during volumetric function con-
tion. volution.
Continuous convolution Current 3D convolution meth- Designing efficient CNNs for point cloud analysis is
ods differ from traditional discrete convolution by defining a challenging task, requiring a delicate trade-off between
convolutional kernels in a continuous space. Instead of fixed- accuracy and speed. Although CNNs have achieved remark-
size kernels sliding over a grid structure as in 2D convolution, able success in image and pattern recognition, increasing
these methods assign weights to neighboring points based the network complexity often results in decreased speed.
on their spatial distribution relative to the center point. This This challenge is further amplified when dealing with point
allows for a more flexible and detailed feature extraction pro- clouds, as they can contain a large number of points with
cess, as 3D convolution can be seen as a weighted sum over varying densities.
a subset of points in continuous space. Table 4 includes models from both discrete and continu-
In RS-CNN [115], the convolutional network is based on ous convolution methods. The results indicate that DeltaNet
relation-shape convolution. The input to an RS-Conv kernel attained the highest overall accuracy (OA) on both the Mod-
is a local subset of points around a given point. The mapping elNet40 and ScanObjectNN datasets. DensePoint, on the
from low-level relations like Euclidean distance and rela- other hand, achieved the best OA on the ModelNet10 dataset.
tive location is learned using an MLP to high-level relations Moreover, PointConv exhibited the highest F1 score on the
between points in the local subset. Using a collection of learn- Intra dataset compared to other convolution-based methods.
able kernel points, Thomas et al. [116] suggested both rigid
and flexible Kernel Point Convolution (KPConv) operators 4.1.3 Hierarchical methods
for 3D point clouds. Liu et al. in their work DensePoint [117],
described comvolution as a Single-layer Perceptron (SLP) Hierarchical data structures like kd-trees and octrees are
with a nonlinear activator. To fully exploit the contextual commonly employed in point cloud processing to construct
information, features are learned by concatenating all of the networks. These networks represent the point cloud in a
previous layers’ features. The convolution kernel is divided structured manner and learn features hierarchically, from leaf
into spatial and feature components by ConvPoint [118]. nodes to root nodes. By partitioning the point cloud into sub-
The spatial part’s positions are chosen at random from a sets of points at different levels of detail, these methods allow
unit sphere, and the weighting function is trained using a the model to capture local details at lower levels and global
basic MLP. In PointConv [119], convolution is defined as a context at higher levels. As a result, these methods are effec-
Monte Carlo estimation of a continuous 3D convolution with tive in reducing the computational complexity of point cloud
regard to an important sample. A weighting function and a processing tasks.
density function are used in the procedure, which is accom- In their paper, Lei et al. [123] introduced an octree
plished using MLP layers and kernelized density estimation. guided CNN with spherical convolutional kernels applied
The 3D convolution is further simplified into two opera- to each layer corresponding to the octree layers. Compared
tions: matrix multiplication and 2D convolution, in oreder to to OctNet [124], which relies on octree data structures,
increase memory and computational performance. Its mem- Kd-Net [125] utilizes multiple K-d trees with different split-
ory consumption can be lowered by 64 times with the same ting directions, with non-leaf node representations computed
parameter settings. using an MLP. Parameter sharing based on node the splitting
Several methods have been proposed to handle large-scale type enables Kd-Net to efficiently learn hierarchical features
point cloud scenes using feature fusion, such as Spider- while managing memory consumption.
CNN [120]. SpiderCNN uses a unit called SpiderConv that To achieve feature learning and aggregation, 3DCon-
extends convolution operations on regular grids by combin- textNet [126] utilizes a balanced K-d tree to learn and
ing a step function with a Taylor expansion defined on the k aggregate features, leveraging both local and global contex-
nearest neighbors. The Taylor expansion captures the inher- tual cues. MLPs are employed to model the relationships
ent local geometric fluctuations by interpolating arbitrary between positions, allowing feature learning at each level.
values at the vertices of a cube, whereas the step function The non-leaf nodes compute features from their children
catches the coarse geometry by storing the local geometric nodes using MLP and max pooling, enabling classification
distance. PCNN [121] is another 3D convolution network that until reaching the root node. SO-Net [127] establishes its
utilizes the radial basis function for processing point clouds. structure through point-to-node k-nearest neighbor search
Its point convolution operator is derived from extension oper- and a Self-Organizing Map (SOM), ensuring permutation
ators that enable the transformation of point data into a invariance. The SOM simulates the spatial distribution of
continuous function space. SPHNet [122], which is based point clouds by setting the positions of points, while individ-
on PCNN [121], achieves rotation invariance by integrating ual point features are learned through fully connected layers.
Pre-training with a point cloud auto-encoder is proposed to
123
Fig. 5 Different types of point convolution [114]
enhance network performance in various applications. How- unique method of generating and manipulating graphs in the
ever, processing large and complex scenes with this network feature space.
may encounter limitations due to the massive amount of point PointWeb [131], based on PointNet++ [4], uses Adap-
cloud data involved. tive Feature Adjustment (AFA) to improve point features in
DRNet [128] is another hierarchical network that learns the local neighborhood context, generating a graph in the
local point features from the point cloud in different reso- feature space that is dynamically modified after each layer.
lutions. The DRNet architecture consists of two branches: a DGCNN [132] also generates a graph in the feature space,
Full-Resolution (FR) branch and a Multi-Resolution (MR) and an MLP is used for feature learning for each edge in
branch. The FR branch learns local point features from the EdgeConv’s core layer. Channel-wise symmetric aggrega-
full-resolution point cloud. The MR branch learns local point tion is used for edge features associated with each point’s
features from downsampled versions of the point cloud. The neighbors. In addition, LDGCNN [133] improves the per-
two branches are then fused to produce a final feature repre- formance of DGCNN [132] by removing the transformation
sentation. network and linking the hierarchical features from different
Table 4 clearly illustrates that among the hierarchical layers.
methods, So-Net consistently outperformed all others across Liu et al. [134] presented a Dynamic Points Agglomera-
the assessed datasets. tion Module (DPAM) based on graph convolution to reduce
the process of point agglomeration that includes sampling,
grouping, and pooling to a single step. This is accomplished
4.1.4 Graph-based methods by multiplying the agglomeration matrix and the points fea-
ture matrix. A hierarchical learning architecture is built by
Graph-based networks provide an alternative approach to stacking multiple DPAMs based on the PointNet architecture.
analyzing point clouds by representing points as vertices in a DPAM dynamically exploits the relationship between points
graph connected by directed edges. These networks operate and agglomerates points in a semantic space, as opposed
in either the spatial or spectral domain for feature learning. to PointNet++’s hierarchical methodology [4]. On the other
In the spatial domain, MLP-based convolutions are applied hand, KCNet [135] takes a different approach by learning
to spatial neighbors, and pooling generates coarsened graphs features based on kernel correlation to exploit local geo-
by aggregating neighboring features. In the spectral domain, metric structures. By defining kernels as a collection of
convolutions are achieved through spectral filtering using the learnable points, KCNet characterizes the geometric types
eigenvectors of the graph Laplacian matrix [129, 130]. Each of local structures, and subsequently determines the affilia-
vertex is assigned features like coordinates, intensities, or tion between the kernel and a specific point’s neighborhood.
colors, while geometric properties between connected points In RGCNN [136], a graph is built by linking each point
are assigned to edges. Numerous graph-based approaches in the point cloud to all other points and updates the graph
have been proposed for point cloud analysis, each with its
123
Laplacian matrix in each layer. The loss function includes Table 4 presents a comparative analysis of multiple graph-
a graph-signal smoothness prior to improve the compara- based techniques for 3D point cloud classification. Notably,
bility of features among nearby vertices. Alternatively, in CurveNet [156] demonstrated remarkable performance with
PointGCN [137], a graph is constructed from a point cloud the highest OA of 94.20% on the ModelNet 40 dataset,
using k nearest neighbors, and each edge is weighted using outshining other graph-based methods. Meanwhile, Grid-
a Gaussian kernel. The graph spectral domain is utilized GCN [9] demonstrated exceptional performance by securing
to design convolutional filters using Chebyshev polynomi- the top OA and mAcc on the ModelNet 10 dataset among all
als. To capture both global and local properties of the point methodologies evaluated.
cloud, global pooling and multi-resolution pooling tech-
niques are employed. Graph convolutional networks (GCN)
4.1.5 Recurrent neural network-based methods
surpass other point-based models by preserving data gran-
ularity and utilizing point interconnectivity. However, data
RNNs are popular for processing temporal data and have
structure operations such as Farthest Point Sampling (FPS)
been applied in point cloud analysis to capture local context.
and neighbor point querying consume a significant amount
These neural networks utilize their internal state to handle
of time in point-based networks, limiting their speed and
variable length sequences of inputs, making them well-suited
scalability.
for point cloud data. Various RNN-based techniques have
To address this issue, Xu et al. [9] introduced Grid-GCN, a
been developed, highlighting the significance of local context
fast and scalable method for point cloud learning. Grid-GCN
in point cloud analysis.
utilizes Coverage-Aware Grid Query (CAGQ), a data struc-
RCNet [157] constructs a permutation-invariant network
turing technique that enhances spatial coverage and reduces
for 3D point cloud processing using regular RNN and
theoretical temporal complexity by leveraging grid space effi-
2D CNN. After partitioning the point cloud into parallel
ciency. CAGQ achieves a 50% speedup compared to common
beams and sorting them along a specified dimension, each
sampling methods like FPS and Ball Query.
beam is input into a shared RNN. For hierarchical feature
Additionally, Yang et al. [153] proposed PointManifold,
aggregation, the learnt features are used as an input to an
a point cloud classification method based on graph neu-
efficient 2D CNN. RCNet-E is proposed to ensemble multi-
ral networks and manifold learning. PointManifold employs
ple RCNets with varied partitions and sorting directions to
various learning algorithms to embed point cloud features,
improve its description ability. Another RNN-based model,
enhancing the assessment of geometric continuity on the
Point2Sequence [158], identifies correlations between dis-
surface. By acquiring the point cloud nature in a low-
tinct locations in local point cloud regions. To aggregate local
dimensional space and combining it with features in the
region features, it treats features learnt from a local region at
original 3D space, the representation capabilities and clas-
many scales as sequences and feeds these sequences from all
sification network performance are improved.
local regions into an RNN-based encoder-decoder structure.
In [154], a novel method called Convolution in the
Several other methods also learn from both 3D point clouds
Cloud (CIC) is proposed for learning deformable kernels in
and 2D images.
3D graph convolution networks. CIC involves dynamically
According to Table 4, Point2sequence achieves the highest
deforming a cloud of kernels to match the local structure of
overall accuracy on the ModelNet 40 dataset, while RCNet-
the point cloud. It consists of two stages: randomly sam-
E performs best on the ModelNet 10 dataset over all other
pling initial kernels and iteratively updating them based
RNN-based methods.
on a loss function that measures the discrepancy with the
ground truth label. Meanwhile, Xu et al.’s Position Adaptive
Convolution (PAConv) [155] presents a generic convolution 4.1.6 Transformer-based methods
procedure for 3D point cloud analysis. PAConv dynami-
cally builds convolution kernels using self-adaptively learned One of the most significant recent breakthroughs in natural
weight matrices from point positions via the ScoreNet mod- language processing and 2D vision is the Transformer [209],
ule. This data-driven approach allows PAConv to handle which has demonstrated superior performance in captur-
irregular and unordered point cloud data more effectively ing long-range relationships. The success of Transformer
than traditional 2D convolutions. CurveNet, a proposition has also led to notable improvements in point-based mod-
by Xiang et al. [156], enhances point cloud geometry learn- els through the use of self-attention. With the attention
ing through a novel aggregation strategy. CurveNet utilizes mechanism, the Transformer can weigh the relevance of
a curve grouping operator and a curve aggregation opera- each point to the others, enabling better feature extraction
tor to generate continuous sequences of point segments and and discrimination. The development of Transformer-based
effectively learn features. architectures [160, 162] has greatly enhanced performance.
Nevertheless, the bottleneck of these models still remains
123
Table 4 Comperative 3D point cloud classification result on various available datasets for Point based methods
67
Model name Year Input Params ModelNet 40 ModelNet 10 ScanObjectNN Intra

(in Million) (OA) (mAcc) (OA) (mAcc) (OA) (mAcc) (F1)
123
Pointwise MLP methods
DeepSets [138] 2017 PC – 82.00% – - – - – –
Page 18 of 54
PointNet [3] 2017 PC 3.5 89.20% 86.20% – - 68.20% 63.40% 68.40%

PointNet++ [4] 2017 PC+PF 1.74 90.70% – - – 77.90% 75.40% 90.30%
MO-Net [100] 2019 PC 3.1 92.40% 90.30% – - – - –
SRN-PointNet++ [97] 2019 PC – 91.50% – - – - – –
JUSTLOOKUP [102] 2019 PC – 89.50% 86.40% 92.90% 92.10% – - –
HPGCNN+DC [139] 2020 PC 1.0 92.60% 90.40% – - – - –
PointASNL [7] 2020 PC+PF 3.98 93.20% – 95.90% – - – –
ASSANet [140] 2021 PC – 92.90% – - – - – –
RPNet-W9 [103] 2021 PC+PF – 94.10% – - – - – –
PointMLP [104] 2022 PC 12.6 94.50% 91.40% – - 85.40% 83.90% –
PointNeXt-S [105] 2022 PC 1.4 93.20% 90.80% – - 87.70% 85.80% –
PointNeXt+HyCoRe [141] 2022 PC – - – - – 88.30% 87.00% –
PointMLP+HyCoRe [141] 2022 PC – 94.50% 91.90% – - 87.20% 85.90% –
PointStack [106] 2022 PC 1.62 94.70% 92.40% – - 89.40% 88.50% –
Hierarchical methods
KD-Net [125] 2017 PC 2.0 91.80% 88.50% 94.00% 93.50% – - –
SO-Net [127] 2018 PC+Normals – 93.40% 90.80% 96.70% 95.50% – - 86.80%
SCN [142] 2018 PC – 90.00% 87.60% – - – - –
A-SCN [142] 2018 PC – 89.80% 87.40% – - – - –
3DContextNet [126] 2018 PC – 90.20% – - – - – –
3DContextNet [126] 2018 PC+Normals – 91.10% – - – - – –
Convolution-based methods
SphericalCNNs [143] 2018 PC 0.5 86.90% – – – – – –
Pointwise-CNN [108] 2018 PC+PF – 86.10% 81.40% – – – – –
MCConvolution [144] 2018 PC – 90.90% – – – – – –
SpiderCNN [120] 2018 PC+PF – 92.40% – – – 73.70% 69.80% 87.20%
PointCNN [145] 2018 PC 0.45 92.20% 88.10% – – 78.50% 75.10% 87.50%
S. Sarker et al.
Table 4 continued
Flex-Convolution [146] 2018 PC – 90.20% – – – – – –

PCNN [121] 2018 PC 1.4 92.30% – 94.90% – – – –
PointConv [119] 2019 PC+PF 18.6 92.50% – – – – – 88.30%
RS-CNN [115] 2019 PC – 93.60% – – – – – –
GeoCNN [147] 2019 PC – 93.40% 91.10% – – – – –
-CNN [123] 2019 PC – 92.00% 88.70% 94.60% 94.40% – – –
A-CNN [148] 2019 PC – 92.60% 90.30% 95.50% 95.30% – – –
SFCNN [149] 2019 PC+PF – 92.30% – – – – – –
DensePoint [117] 2019 PC+PF 0.53 93.20% – 96.60% – – – –
KPConvrigid [116] 2019 PC 15.2 92.90% – – – – – –
KPConvdeform [116] 2019 PC – 92.70% – – – – – –
InterpCNN [109] 2019 PC+PF 12.8 93.00% – – – – – –
ShellNet(SS=16) [111] 2019 PC – 93.10% – – – – – –
ConvPoint [118] 2020 PC – 91.80% 88.50% – – – – –
Point-PlaneNet [112] 2020 PC+PF – 92.10% 90.50% – – – – –
DRNet [128] 2021 PC+PF – 93.10% – – – 80.30% 78.00% –
DeltaNet [113] 2022 PC 93.80% 91.20% 84.70% –
Graph-based methods
ECC [150] 2017 PC – 87.40% 83.20% 90.80% 90.00% – – –
KCNet [135] 2018 PC 0.9 91.00% – 94.40% – –
LocalSpecGCN [151] 2018 PC+Normals – 92.10% – – – – – –

RGCNN [136] 2018 PC+Normals 2.24 90.50% 87.30% – – – – –
3DTI-Net [152] 2018 PC 2.6 91.70% – – – – – –
PointGCN [137] 2018 PC – 89.51% 86.05% 91.91% 91.57% – – –
PointWeb [131] 2019 PC – 92.30% 89.40% – – – – –
DGCNN [132] 2019 PC 1.84 93.50% 90.70% – – 78.10% 73.60% 73.80%
LDGCNN [133] 2019 PC 1.08 92.90% 90.30% – – – – –
DPAM [134] 2020 PC+Normals – 91.90% 89.90% 94.60% 94.30% – – –
Grid-GCN [9] 2020 PC – 93.10% 91.30% 97.50% 97.40% – – –
PointManifold [153] 2020 PC – 93.00% 90.10% – – – – –
3D-GCN [154] 2020 PC 0.89 92.10% – – – – – –
PAConv [155] 2021 PC – 93.90% – – – – – 90.60%
Page 19 of 54
CurveNet [156] 2021 PC – 94.20% – 96.30% – – – –

67
123
Table 4 continued
67

123
RNN-based Methods
RCNet [157] 2019 PC 13.3 91.60% – 94.70% – – – –
Page 20 of 54
RCNet-E [157] 2019 PC 39.9 92.30% – 95.60% – – – –

Point2sequence [158] 2019 PC – 92.60% 90.40% 95.30% 95.10% – – –
Transformer-based methods
PAT [159] 2019 PC – 91.70% – – – – – –
Point Transformer [160] 2021 PC – 93.70% 90.60% – – – – –
PCT [161] 2021 PC 2.88 93.20% – – – – – 91.40%
Point Transformer [162] 2021 PC+Normals 13.5 92.80% – – – – – –
3DMedPT [98] 2021 PC+Normals 1.54 93.40% – – – – – 93.60%
Perceiver [163] 2021 PC – 85.70% – – – – – –
PointTnT [164] 2022 PC 3.9 92.60% – – – 85.00% 83.50% –
Patchformer [165] 2022 PC 2.45 93.50% – – – – – –
PTv2 [166] 2022 PC – 94.20% 91.60% – – – –
LCPFormer [167] 2023 PC – 93.60% 90.70% – – – –
SPoTr [168] 2023 PC – – – – – 88.60% 86.80%
IBT [169] 2023 PC – 93.60% 91.00% – – 82.80% 80.00%
APES(global-based) [170] 2023 PC 93.80% – – – – – –
Other methods
DeepRBFNet [171] 2018 PC 3.2 90.2% 87.80% – – – – –
DeepRBFNet [171] 2018 PC+Normals 3.2 92.1% 88.80% – – – – –
PointHop [172] 2019 PC – 89.10% 84.40% – – – – –
FG-Net [173] 2020 PC – 93.80% 93.10% – – – – –
PointHop++ [174] 2020 PC 0.16 91.10% 87.00% – – – – –
PRA-Net [175] 2021 PC – 93.70% 91.20% – – 82.10% 79.10% –
GDANet [176] 2021 PC – 93.80% – – – 88.50% – –
PointSCNet [177] 2021 PC + Normals 1.827 93.70% – – – – – –
APP-Net [178] 2022 PC + Normals 1.827 94.00% – – – 87.00% – –
PointMetaBase-XXL [179] 2023 PC 22.7 – – – – 87.90% – –
Unsupervised-based methods
FoldingNet [180] 2018 PC – 88.40% – 94.40% – – – –
PPF-FoldNet [181] 2018 PC+Normals – – – – – – –
S. Sarker et al.
Table 4 continued
Latent-GAN [182] 2018 PC – 84.50% – 95.40% – – –

MRTNet-VAE [183] 2018 PC – 86.40% – – – – –
Hassani et al. [184] 2019 PC – 89.10% – – – – – –
3DPointCapsNet [185] 2019 PC – 89.30% – – – – – –
Cluster-Net [186] 2019 PC – 88.80% – 93.80% – – –
GBNet [187] 2019 PC 8.39 93.80% 91.00% – – 80.50% 77.80% –
PointGrow [188] 2020 PC 0.25 85.80% – – – – –
ParAE [189] 2021 PC – 91.60% – – – – –
OcCo+DGCNN [190] 2021 PC – 93.00% – – – 83.90% – –
OGNet + MD [191] 2021 PC – 93.31% 90.71% – – – – –
DGCNN+MD [191] 2021 PC – 93.39% 90.26% – – – – –
MD [191] 2022 PC – 93.30% 90.26% – – – – –
STRL + DGCNN [192] 2021 PC – 90.90% – – – – – –
IAE + DGCNN [193] 2022 PC – 94.20% 91.60% – – – – –
AL-GAN + PointNet++ [194] 2022 PC – 92.70% – – – – – –
Point-MAE [195] 2022 PC – 93.80% – – – 85.20% – –
Point-BERT [196] 2022 PC – 93.80% – – – 83.07% –
P2P [197] 2022 PC 1.2 94.00% 91.60% – – 89.30% 88.50% –
PointCaps [198] 2022 PC+Normals 3.52 94.70% – 91.70% – – – –
MAE3D [199] 2022 PC – 93.40% – – – 86.20% – –
Point-M2AE [200] 2022 PC – 94.00% – – – 86.43% – –

Hao et al. [201] 2022 PC 0.88 94.20% – – – 82.60% –
MaskPoint [202] 2022 PC – 93.80% – – – 84.60% –
ACT [203] 2022 PC – 92.69% – – – 88.21% – –
I2P-MAE [204] 2022 PC – 93.72% – – – 90.11% – –
PointGPT-L [205] 2023 PC(1k P) – 94.70% – – – 93.40% -
PointGPT-L [205] 2023 PC(8k P) – 94.90% – – – – -
point2vec [206] 2023 PC – 94.80% – – – 87.50% -
ReCoN [207] 2023 PC(1K P) – 93.90% – – – 90.65% -
ReCoN [207] 2023 PC(8K P) – 94.20% – – – – – –
PointMLP+ULIP [208] 2023 PC – 93.30% 89.60% – – 86.90% 85.80% –
Here ‘Params’ represents the number of parameters, while ‘OA’ refers to the overall accuracy, ‘mAcc’ denotes mean accuracy, and ‘F1’ stands for the Dice score. The symbol ‘–’ means the results
Page 21 of 54
are unavailable. The methods are arranged in chronological order within their corresponding categories. The top-performing methods in each category have been highlighted in bold, while the
method(s) achieving the best overall performance across all categories are underlined
67
123
Fig. 6 Illustration of a
graph-based network
the time-consuming operation of sampling and aggregating Perceiver, another attention-based architecture introduced
characteristics from irregular sites. in [163] is a scalable attention-based architecture for high-
Point Attention Transformers (PATs) [159] learns high- dimensional inputs, such as images, movies, and audio, with-
dimensional features by encoding each point’s absolute and out domain-specific assumptions. It utilizes cross-attention
relative positions with respect to its neighbors. To extract and latent self-attention blocks to process a fixed-dimensional
hierarchical features, it utilizes a trainable, permutation- latent bottleneck. 3D medical point Transformer (3DMedPT)
invariant, and non-linear end-to-end Gumbel Subset Sam- [98] is an attention-based model specifically designed for
pling (GSS) layer, which captures relationships between medical point clouds for examining the complex biological
points using Group Shuffle Attention (GSA). This unique structures that are vital for disease detection and treatment.
approach enables the model to capture local structures within Insufficient training samples of medical data can lead to poor
each group while also considering the global context of feature learning. To enhance feature representations in med-
the entire point cloud. Zhao et al. [160] proposed a similar ical point clouds, it employs an attention module to capture
model, the Point Transformer, which employs a self-attention local and global feature interactions, position embeddings for
module to retrieve spatial characteristics from local neighbor- precise local geometry, and Multi-Graph Reasoning (MGR)
hoods around each point, and encodes positional information. for global knowledge transmission.
The network has a highly expressive Point Transformer layer, Similarly, Berg et al. [164] propose the two-stage Point
which is invariant to permutation and cardinality, making it Transformer-in-Transformer (Point-TnT) technique, which
ideal for point cloud processing. combines both local and global attention mechanisms by
Point Transformer V2 [166] is an enhanced version producing patches of local features via a sparse collection
of the Point Transformer architecture for 3D point cloud of anchor points. Self-attention can then be used on both the
processing. It introduces two innovations: grouped vector points within the patches and the patches themselves, result-
attention and partition-based pooling. Grouped vector atten- ing in a highly effective method for processing unstructured
tion reduces computational cost by performing attention only point cloud data. LCPFormer [167] is a recent transformer-
within groups of points, maintaining accuracy while learning based architecture for 3D point cloud analysis. LCPFormer
long-range dependencies. Partition-based pooling improves introduces a novel local context propagation (LCP) mod-
accuracy on large point clouds by dividing them into smaller ule that enables the model to learn long-range dependencies
partitions and pooling features within each partition, enabling between points in a point cloud. The LCP module works by
global feature learning with reduced computational load. first dividing the input point cloud into local regions. Then, it
Engel et al. [162] introduced another model called Point propagates the features of each local region to its neighbor-
Transformer, which operates directly on unordered and ing local regions. This allows the model to learn long-range
unstructured point sets. The Point Transformer uses a local– dependencies between points that are not directly connected.
global attention mechanism to capture spatial point relations To learn local and global shape contexts with reduced
and shape information, allowing it to extract both local and complexity Park et al. introduced SPoTr [168], a self-
global aspects of the point cloud. SortNet, a component of positioning mechanism that works by first randomly select-
Point Transformer, produces input permutation invariance by ing a subset of points from the input point cloud. These points
selecting points based on a learned score. The Point Trans- are then used to create a local coordinate system. The remain-
former produces a sorted and permutation invariant feature ing points are then projected into this local coordinate system.
list that can be utilized directly in standard computer vision This allows the model to learn local shape contexts with-
applications. out the need for global attention. Wu et al. [170] proposed
123
Fig. 7 Illustration of the transformer based encoder architecture [210]
an Attention-Based Point Cloud Edge Sampling (APES) for two modules: intra-region structure learning (ISL) and inter-
sampling points from a point cloud based on their importance region relationship learning (IRL). The ISL module can
to the outline of the object. The attention mechanism in APES adaptively incorporate local structural information into point
is based on the self-attention mechanism used in transformer features, while the IRL module dynamically and effectively
models. The self-attention mechanism computes the atten- preserves inter-region relations using a differentiable region
tion weights between each point in the point cloud and all partition method and a representative point-based strategy.
other points in the point cloud. The points with the highest FG-Net [173] proposes a comprehensive deep learning
weights are then selected to form a new, downsampled point framework for large-scale point cloud understanding that
cloud. achieves accurate and real-time performance with a single
Table 4 presents a comparison of various pointwise GPU. The network incorporates a noise and outlier filter-
transformer-based methods on different datasets. Here, PTv2 ing mechanism, utilizes a deep CNN to exploit local feature
[166] achieved the highest OA and mAcc on the ModelNet correlations and geometric patterns, and employs efficient
40 dataset, while SPoTr [168] showed best performance on techniques such as inverse density sampling and feature
ScanObjectNN dataset. pyramid-based residual learning to address efficiency con-
cerns. Another recent work in this area is proposed by
Xu et al. in GDANet [176]. It introduces the Geometry-
4.1.7 Other methods Disentangled Attention Network, which dynamically disen-
tangles point clouds into contour and flat parts of 3D objects.
Apart from the methods discussed earlier, there are several It utilizes the disentangled components to generate holistic
techniques that cannot be neatly categorized into a specific representations and applies different attention mechanisms
class. These methods utilize multiple modalities to learn intri- to fuse them with the original features. The network also
cate representations of point clouds, thereby enabling them captures and refines 3D geometric semantics from the disen-
to capture intricate patterns and relationships. Hence, in this tangled components to supplement local information.
section, we will explore these unconventional methods that PointSCNet [177] captures the geometrical structure and
transcend traditional classification boundaries, providing a local region correlation of a point cloud using three key com-
comprehensive overview of each. ponents: a space-filling curve-guided sampling module, an
With prior knowledge of kernel positions and sizes, information fusion module, and a channel-spatial attention
RBFNet [171] aggregates features from sparsely distributed module. The sampling module selects points with geometri-
Radial Basis Function (RBF) kernels to explicitly charac- cal correlation using Z-order curve coding. The information
terize the spatial distribution of points. PointAugment, an fusion module combines structure and correlation informa-
auto-augmentation framework introduced by Li et al. [211], tion through a correlation tensor and skip connections. The
optimizes and augments point cloud data by automatically channel-spatial attention module enhances critical sites and
learning each input sample’s shape-wise transformation and feature channels for improved network representation. Lu et
pointwise displacement. Prokudin et al. [212] transform the al. [178] proposed APP-Net, a network that utilizes auxiliary
point cloud into a vector with a short fixed length by encoding points and push and pull operations to efficiently classify
the point cloud as minimal distances to a uniformly dis- point cloud data. The auxiliary points guide the network’s
tributed basis point set sampled from a unit ball. Finally, attention to important regions, while the push and pull oper-
common machine learning techniques are applied to pro- ations allow for efficient computation and improved feature
duce the encoder representation. Cheng et al. [175] present representation.
the Point Relation-Aware Network (PRA-Net), comprising
123
PointMeta [179] by Lin et al. is a unified meta-architecture torized local covariance matrix and point coordinates as its
for point cloud analysis. It abstracts the computation pipeline input. Hassani and Haley [184] suggested an unsupervised
into four meta-functions: neighbor update, neighbor aggrega- multi-task autoencoder to learn point and shape features,
tion, point update, and position embedding. These functions inspired by Inception module [218] and DGCNN [132].
enable learning of local and global features, point refinement, Multi-scale graphs are used to build the encoder. The decoder
and encoding of spatial relationships. PointMeta offers flexi- is built utilizing three unsupervised tasks: clustering, self-
bility and efficiency in designing point cloud analysis models. supervised classification, and reconstruction, all of which are
However, a detailed computational complexity analysis is not combined and trained together with a multi-task loss.
provided in the paper. Latent-GAN [182] is one of the first networks to use GAN
Table 4 shows that among the models in other methods, for raw point clouds. The authors discuss various methods
APP-Net achieved the highest overall accuracy (OA) score of such as autoencoders, variational autoencoders (VAE) [219],
94.00% on the ModelNet 40 dataset. However, among all the GAN, and flow-based models that have been proposed for
models across different methodologies, FG-Net emerged as learning effective representations of 3D point clouds and gen-
the leader in mean accuracy (mAcc) with a score of 93.10%. erating new ones. 3DAAE [220] can learn the representation
On the ScanObjectNN dataset, PRA-Net achieved the highest of 3D point clouds by using an end-to-end approach. This
mAcc score, and GDANet achieved the highest OA score. model generates output by first learning a latent space for 3D
shapes and then using adversarial training. The inventors of
4.2 Unsupervised training 3DAAE created a 3D autoencoder that takes 3D data as input
and produces a 3D output.
Unsupervised representation learning is a technique that aims 3D point-capsule networks [122] have been developed to
to learn useful and informative features from unlabeled data. address the sparsity issue in point clouds while preserving
In the context of point cloud understanding, this approach their spatial arrangements. This network extends the 2D cap-
involves training deep neural networks to extract latent fea- sule networks to the 3D domain and uses an autoencoder to
tures from raw, unannotated point cloud data. Unsupervised handle the sparsity of point clouds. In contrast, 3DPointCap-
representation learning for point clouds has gained signifi- sNet [185] incorporates pointwise MLP and convolutional
cant attention in recent years due to its ability to reduce the layers to extract point-independent features and employs
need for labeled data and improve the performance of various several maxpooling layers to derive a global latent represen-
point cloud applications including natural language under- tation. Unsupervised dynamic routing is then used to learn
standing [213], object detection [214], graph learning [215], representative latent capsules. In addition, Pang et al. [195]
and visual localization [216]. By pre-training deep neural introduced a novel approach using masked autoencoders for
networks on unlabeled data, unsupervised learning uncovers self-supervised learning of point clouds. They addressed
latent features without human-defined annotations, reducing challenges related to point cloud properties, such as loca-
reliance on labeled data. It can be categorized into generative tion information leakage and uneven density, by dividing the
modeling, where synthetic point clouds are generated, and input into irregular patches, applying random masking, and
self-supervised learning, which involves predicting missing using an asymmetric design with shifting mask token opera-
information from partially observed point clouds. This active tion. This enabled a transformer-based autoencoder to learn
research field holds promise for improving the accuracy and latent characteristics from unmasked patches and reconstruct
efficiency of point cloud processing tasks. the masked ones.
Point clouds are discrete samples of a continuous three-
4.2.1 Generative model-based methods dimensional surface. As a result, sample differences in the
underlying 3D shapes are inescapable. The conventional
Unsupervised approaches like generative adversarial net- autoencoding paradigm requires the encoder to record sam-
works (GANs) [217] and autoencoders (AEs) [184] learn pling fluctuations in the same way that the decoder must
representation of provided data [121]. AEs consist of an recreate the original point cloud. Yan et al. [193] introduced
encoder, internal representation, and decoder, and are widely the Implicit Autoencoder (IAE) to overcome the challenge of
used for data representation and generation. They can cap- sampling fluctuations in point clouds. By using an implicit
ture point cloud irregularities and address sparsity during decoder instead of a point cloud decoder, IAE generates a
upsampling. GANs, on the other hand, consist of a generator continuous representation that can be shared across multi-
and discriminator, aiming to generate realistic data samples. ple samplings of the same model. This approach allows the
GANs learn to produce new data with similar statistics as the encoder to focus on learning valuable features by ignoring
training set. sampling changes during reconstruction.
FoldingNet [180] is an end-to-end unsupervised deep Point-BERT [196] is a more advanced version of BERT
autoencoder network that uses the concatenation of a vec- that employs transformers to generalize 3D point cloud
123
Fig. 8 The general pipeline of unsupervised representation learning resentations to downstream tasks for network initialization. Pre-trained
on point clouds. Neural networks are trained on unannotated point networks can then be fine-tuned with a small amount of annotated task-
clouds using unsupervised learning, followed by transfer of learned rep- specific point cloud data [221]
learning. A point cloud tokenizer with a discrete Varia- It employs a 2D-guided masking strategy to focus on seman-
tional AutoEncoder (dVAE) is intended to generate discrete tically important point tokens and capture key spatial cues
point tokens containing significant local information once for significant 3D structures. Through self-supervised pre-
the network separates a point cloud into many local point training and multi-view 2D feature reconstruction, I2P-MAE
patches. Then it feeds some patches of input point clouds enables superior 3D representations from 2D pre-trained
into the backbone transformers, using random masking. models. In their paper [203], Dong et al. proposed ACT
Under the supervision of point tokens obtained by the tok- (Autoencoders as Cross-Modal Teachers), a method for
enizer, the pre-training goal is to recover the original point training 3D point cloud models using pretrained 2D image
tokens at the masked places. In [200], Zhang et al. intro- transformers. ACT involves two steps: pretraining a 2D
duced Point-M2AE, a pre-training framework for learning image transformer on a large image dataset and fine-tuning it
3D representations of point clouds. It utilizes a multi-scale on a 3D point cloud dataset. The fine-tuning process utilizes
masking strategy, pyramid architectures, local spatial self- the 2D image transformer to generate a latent representation
attention, and complementary skip connections to capture of the 3D point cloud, which is then used to train the 3D point
detailed information and high-level semantics of shapes. cloud model.
This paper also highlights the significance of a lightweight
decoder in Point-M2AE, which contributes to the recon-
4.2.2 Self-supervised methods
struction of point tokens and promotes the quality of shape
representation. [199] discusses another method for learning
Self-supervised learning in point cloud processing is a pow-
representations for 3D point clouds using masked autoen-
erful technique that leverages unannotated data to improve
coders. In the proposed method, a portion of the points in the
performance across various applications. By incorporating
point cloud are masked out and the masked autoencoder is
geometric and topological priors, models can learn feature
trained to reconstruct the masked out points.
representations. This involves training a model to predict
To addresses the challenge of limited 3D datasets for
local geometric properties, such as normals or curvatures,
learning high-quality 3D features, in [204], the authors pro-
using the point positions as input.
posed Image-to-Point Masked Autoencoders (I2P-MAE). By
Due to the complex nature of 3D scene understanding
leveraging 2D pre-trained models, I2P-MAE reconstructs
tasks and the vast differences provided by camera per-
masked point tokens using an encoder-decoder architecture.
spectives, illumination, occlusions, and other factors, there
123
are yet no effective and generalizable pre-trained models cept to point clouds, addressing challenges such as disorder
available. In their paper, Huang et al. [192] address this properties and low information density. PointGPT pre-trains
problem by proposing a self-supervised Spatio-temporal transformer models using a point cloud auto-regressive gen-
Representation Learning (STRL) framework that learns from eration task. The method employs a dual masking strategy
unlabeled 3D point clouds. STRL utilizes two temporally in the extractor-generator based transformer decoder, cap-
correlated frames, applies spatial data augmentation, and turing dependencies between points and generating coherent
self-supervisedly learns invariant representations. and realistic point clouds.
Occlusion Completion (OcCo) is an unsupervised pre- Table 4 presents findings encompassing both generative-
training method proposed by Wang et al. [190], which based and self-supervised-based methods. The outcomes
comprises of three separate mechanisms. The first step is illuminate that amid all models derived from diverse method-
to use view-point occlusions to create masked point clouds. ologies, PointGPT-L secured the top OA for both the Mod-
The second step is to complete reconstructing the occluded elNet40 and ScanObjectNN datasets.
point cloud, and the final step is to use the encoder weights
as the initialization for the downstream point cloud task. Sun
et al. [191] developed a novel self-supervised learning tech- 5 3D point cloud semantic segmentation
nique called Mixing and Disentangling (MD) for learning
3D point cloud representations in response to the enormous The task of 3D point cloud segmentation requires a com-
success of self-supervised learning. The authors combined prehensive understanding of both the overall geometric
two input shapes and demand that the model learn to distin- structure and the specific properties of each individual point.
guish the inputs from the mixed shape. This reconstruction Depending on the level of detail required, 3D point cloud
task serves as the pretext optimization objective for self- segmentation techniques can be broadly classified into three
supervised learning, and the disentangling process drives the categories: semantic segmentation at the scene level, instance
model to mine the geometric prior knowledge. segmentation at the object level, and part segmentation at the
Xue et al. [208] introduced ULIP (Unified Language- part level. In this paper, our exclusive focus has been on
Image-Point Cloud) as a pre-training method for learning a semantic segmentation, rather than encompassing all forms
unified representation of language, images, and point clouds of segmentation.
in 3D understanding. ULIP learns a common embedding While many classification models have been shown to
space for these modalities, enabling various 3D tasks. By perform well on established benchmarks, they also rely on
leveraging the shared information about 3D objects, ULIP segmentation datasets to showcase their unique contributions
creates informative and discriminative representations. It uti- and generalization capabilities. This section will primarily
lizes a large-scale dataset of language, images, and point focus on models that have not been previously discussed in
clouds, generated with triplets describing the same object, the classification part of this paper.
and trains the model to predict the missing modality in Semantic segmentation involves the partitioning of a
each triplet. PointCaps [198] introduces a capsule network, a point cloud into distinct subsets, determined by the seman-
structured representation learning approach for point clouds. tic interpretation of individual points. Based on the input
The method consists of two operations: learning local and data representation, this segmentation can be categorized
global features of the point cloud using the capsule network, into four types, akin to the classification of 3D shapes:
and subsequently classifying the point cloud into prede- projection-based, discretization-based, hybrid methods, and
fined classes. Qi et al. [207] propose ReCon (Contrast with raw point-based. Approaches such as projection [223, 224],
Reconstruct), a self-supervised method for 3D representation volumetric [225, 226], and hybrid representations [227, 228]
learning. ReCon combines contrastive learning and genera- initiate the process by transforming a point cloud into an
tive pretraining in two stages. In the contrastive learning step, intermediary regular representation.
ReCon learns local and global features of 3D point clouds
through pairwise comparisons. In the generative pretraining 5.1 Projection-based methods
step, ReCon learns high-level features by generating data
similar to the training set The projection-based method is a widely adopted approach
Point2vec [206] extends the data2vec [222] framework for semantic segmentation of point clouds. It involves assign-
for self-supervised representation learning on point clouds, ing semantic labels to individual points in a 3D point cloud
overcoming the limitation of leaking positional information by projecting it onto multiple 2D planes or views. Each pro-
during training. Point2vec unleashes the full potential of jected view is processed using 2D segmentation techniques,
data2vec-like pre-training on point clouds. In response to and the results are fused to obtain the final semantic seg-
the growing popularity of Large Language Models, Chen et mentation. This method offers advantages such as reduced
al. introduced PointGPT [205] that extends the GPT con- complexity and the utilization of existing image-based seg-
123
Fig. 9 A taxonomy of deep learning methods for 3D point cloud semantic segmentation
mentation techniques. However, the choice of projection used 2D segmentation networks to do pixel-by-pixel tagging
views can impact segmentation accuracy, and complex point on these samples. The residual correction [230] is used to
cloud geometries or non-planar surfaces may pose chal- merge the scores predicted from RGB and depth pictures.
lenges. Nonetheless, the projection-based method remains In order to address the information loss issue, Snap-
a powerful tool for point cloud semantic segmentation. It Net [231] takes some selected snapshots of the point cloud
can be further categorized into multi-View, range-view, and to generate pairs of RGB and depth images. They then cat-
bird’s eye view approaches. egorize each pair of 2D photos pixel by pixel using a fully
convolutional network. Finally, to complete the work, this
model projects the marked points into 3D space. SnapNet
5.1.1 Multi-view based method attempts to solve the problem of information loss, but it runs
into issues throughout the image production process. Con-
Multi-view approaches in point cloud segmentation pro- sequently, SnapNet-R2 [232] is proposed as a solution for
cessing harness a potent paradigm by integrating data from SnapNet. It directly analyses multiple views to produce dense
diverse perspectives. This holistic strategy offers comprehen- 3D point markers, which improves the segmentation result.
sive scene insights. These methods utilizes multiple sensor The process of creating a labeled point cloud can be broken
viewpoints to capture a wide range of geometric details, bol- down into two parts: the labeling of SnapNet 3D and the
stering resilience against occlusions and lighting variations. 2D labeling of RGB-D images extracted from stereo images.
The fusion of data from multiple sources mitigates limita- Although the model provides a technique that makes it simple
tions tied to individual viewpoints. Yet, multi-view strategies to implement, its segmentation accuracy on object bound-
demand precise sensor calibration and view alignment for aries still needs to be improved.
accurate data integration and coherent segmentation out- Viewpoint selection and occlusions impact multi-view
comes. segmentation algorithms, but they also suffer from infor-
Lawin et al. [223] were the first to project a 3D point mation loss and blurring effects due to many-to-one map-
cloud from several virtual camera views onto 2D planes. ping. The nearest predicted label (NLA) strategy improves
Then, using synthetic images, a multi-stream fully connected occluded location processing over K-nearest neighbor (KNN).
network is utilized to predict pixel-wise scores. The final Processing point cloud data is computationally expen-
semantic label of each point is calculated by combining sive, and existing projection-based methods have accuracy
the re-projected scores from various perspectives. Similarly, or parameter issues. The Multi-scale Interaction Network
Boulch et al. [229] used numerous camera angles to obtain (MINet) [233] balances resources across scales, enhancing
various RGB and depth pictures of a point cloud. They next
123
efficiency and outperforming point-based, image-based, and multi-resolution feature maps using bilinear interpolation. It
projection-based techniques in accuracy, parameter count, improves model complexity while preserving performance.
and runtime. In contrast to previous methods, this approach maintains
neighborhood information more effectively and considers
5.1.2 Range-view based methods temporal information in single scan segmentation tasks. To
address these issues, Wang et al. [240] presented Meta-
Range view methods for point cloud segmentation processing RangeSeg, which adopts a unique range residual image repre-
leverage the inherent spatial structure of the data to preserve sentation to collect spatial-temporal information. To capture
fine-grained geometric details, akin to human perception. the meta features, Meta-Kernel is used, which minimizes
Processing the point cloud directly in its original form the discrepancy between the 2D range image coordinates
enhances accuracy for tasks requiring precise distance and input and the Cartesian coordinates output. The multi-scale
angle measurements. While these methods minimize prepro- features were extracted using an efficient U-Net backbone.
cessing, they maintain local geometric context due to point Moreover, the Feature Aggregation Module (FAM) gathers
proximity, aiding analysis and classification. Range view meta features and multi-scale features, enhancing the range
techniques also align well with diverse sensors and capture channel’s role.
devices, facilitating integration into real-world applications. GFNet [241] is based on a Geometric Flow Network
Yet, these methods may be sensitive to sensor viewpoint (GFN), which can learn the geometric relationships between
changes, possibly introducing data inconsistencies and com- different views of a 3D point cloud. The GFN comprises a
promising robustness in processing and interpretation. feature extractor and a geometric flow network. The feature
Wu et al. [224] developed an end-to-end network based extractor captures features from the 3D point cloud, while the
on SqueezeNet [234] and Conditional Random Fields (CRF) geometric flow network learns geometric relationships across
to perform quick and accurate segmentation of 3D point different views. These relationships facilitate fusion of fea-
clouds. Later, they came up with another version named tures, leading to enhanced semantic segmentation accuracy.
SqueezeSegV2 [235], a segmentation pipeline that uses an GFNet has several advantages over traditional methods for
unsupervised domain adaption pipeline to solve domain semantic segmentation of 3D point clouds. It can accommo-
shift and increase segmentation accuracy. To process LiDAR date the irregular and unstructured nature of 3D point clouds
images, all of these methods use conventional convolu- by utilizing a deep learning model capable of learning from
tions, which is problematic since convolution filters pick non-grid data.
up local features that are only active in specific portions CeNet [242] is an efficient method for semantic segmen-
of the image. As a result, the network’s capacity is under- tation of LiDAR point clouds. It utilizes a compact CNN
utilized, and segmentation performance suffers. To address architecture, resulting in faster training and inference due to a
this, the author presented SqueezeSegV3 [236], an updated reduced parameter count. CeNet consists of three main com-
version of the previous SqueezeSeg [224] models that uses ponents: a feature extractor to capture point cloud features, a
Spatially-Adaptive Convolution (SAC) to apply various fil- spatial attention module for emphasizing important features,
ters to different regions depending on the input image. and a temporal attention module for integrating features
RangeNet++ by Milioto et al. [237] enables real-time across LiDAR sequence frames. In [243], a novel range view
semantic segmentation of LiDAR point clouds. It employs representation for LiDAR point clouds is introduced. Based
GPU-enabled KNN-based postprocessing to address dis- on a CNN architecture, RangeFormer extracts features from
cretization errors and blurry inference outputs after convert- the range view representation. These features are utilized
ing 2D range image labels to 3D point clouds. Spherical by the spatial attention module to learn spatial relationships
projection preserves more information than single-view between points, while the temporal attention module fuses
projection, but it may introduce issues like discretization mis- features from different frames. Finally, a decoder predicts
takes and occlusions. Lite-HDSeg [238] is another real-time the semantic label for each point in the range view.
3D LiDAR point cloud segmentation method. It utilizes a new LENet [244] is a compact and resource-efficient network
encoder-decoder architecture with light-weight harmonic for LiDAR point cloud semantic segmentation. It incorpo-
dense convolutions. Additionally, the authors introduce ICM, rates a novel multi-scale convolution attention module that
an improved global contextual module capturing multi-scale captures long-range dependencies. By utilizing convolutions
contextual data, and MCSPN, a multi-class Spatial Propaga- with varying kernel sizes, features are extracted at multiple
tion Network refining semantic boundaries substantially. scales. Attention mechanisms are employed to assign weights
Zhao et al. proposed a projection-based LiDAR seman- to features from different scales, improving the network’s
tic segmentation pipeline with a unique network topology precision in learning semantic segmentation predictions.
and efficient postprocessing [239]. Their FIDNet incorpo-
rates a parameter-free FID module that directly upsamples
123
5.1.3 Bird’s eye view-based methods subdividing the point cloud space into a regular grid and
assigning points to corresponding grid cells. This enables
Bird’s-Eye View (BEV) is a 2D representation obtained by the application of standard 3D convolutions, similar to volu-
projecting a 3D point cloud onto a top-down plane. It provides metric data. On the other hand, Sparse discretization targets
a flattened view of the point cloud, enabling the applica- occupied cells, optimizing resource efficiency in line with
tion of 2D image-based segmentation techniques. BEV is point cloud sparsity.
widely used in point cloud segmentation for tasks like object
detection and road segmentation in autonomous driving and
5.2.1 Dense discretization representation
robotics. It facilitates the analysis of spatial relationships and
captures valuable geometric and contextual information in
Dense Discretization Representation (DDR) converts contin-
the horizontal plane.
uous point clouds into a structured and discrete form using
PolarNet [245] introduces a nearest-neighbor-free seg-
small voxels or grids. This structured representation enables
mentation approach for LiDAR data. By converting the
the use of standard 3D convolutional operations and simpli-
Cartesian point cloud to a polar bird’s-eye representation, it
fies the handling of irregular and unstructured data. However,
balances points among grid cells in a polar coordinate system.
it involves a trade-off between resolution, efficiency, and pos-
Polar convolution layers in a deep neural network architec-
sible discretization artifacts or information loss.
ture are utilized to extract features and perform semantic
Tchapmi et al. [248] proposed SEGCloud as a means
segmentation.
of achieving fine-grained, globally consistent semantic seg-
SalsaNet [246] presents an efficient and accurate method
mentation. Different degrees of geometric relations are first
for road and vehicle segmentation in LiDAR point clouds
hierarchically abstracted from point clouds in the Fully-
for autonomous driving. Its lightweight network architecture
Convolutional Point Network (FCPN) [249], and then 3D
incorporates spatial and channel-wise attention mechanisms
convolutions and weighted average pooling are used to
to capture local and global contextual information. The
extract features and incorporate long-range dependencies.
approach employs a two-step segmentation strategy, using
This approach can handle large-scale point clouds and has
a novel focal loss function to handle class imbalance and
strong inference scalability.
improve performance on rare classes.
ScanComplete [250] proposed a method for 3D scan
DGPolarNet [247] addresses the challenges of captur-
completion and per-voxel semantic tagging. It utilizes fully-
ing long-range dependencies and modeling local context by
convolutional neural networks that can adapt to different
employing a dynamic graph convolutional network. This net-
input data sizes during training and testing. A coarse-to-fine
work dynamically constructs a graph structure based on the
approach is employed to enhance the resolution of predicted
input point cloud, capturing spatial relationships between
results. The volumetric representation preserves the neigh-
points. Multi-scale features and graph convolutions are uti-
borhood structure of 3D point clouds and allows for direct use
lized to extract discriminative features at different abstraction
of 3D convolutions. These factors contribute to the improved
levels.
performance in this field. However, the voxelization stage
Table 5 provides a comprehensive overview of non-
introduces discretization artifacts and information loss.
point-based methods for semantic segmentation outcomes
In Cylinder3D [251], a comprehensive analysis of various
in 3D point clouds across diverse datasets. In the cate-
representations and backbones in 2D and 3D spaces is carried
gory of projection-based methods, Rangeformer [243] and
out to determine the usefulness of 3D representations and
MINet [233] achieved the highest results on the nuScenes
networks in LiDAR segmentation. It proposes a 3D cylinder
and SemanticPOSS datasets. However, among models across
partition and convolution-based framework to leverage the
different methodologies, RangeFormer and DeePr3SS [223]
3D topology relations and structures of driving-scene point
demonstrated superior performance in SemanticKITTI and
clouds. Additionally, a context modeling module based on
Semantic3D (red.) datasets.
dimension decomposition is introduced to capture high-rank
context information progressively.
5.2 Discretization-based methods
Discretization-based methods transform continuous point 5.2.2 Sparse discretization representation

cloud data into a discrete representation for efficient anal-
ysis while retaining geometric features. Such discretization Sparse Discretization Representation (SDR) selects a subset
serves as a bridge between the raw point cloud and con- of points from point clouds for analysis, offering memory
ventional convolutional operations. These methods can be efficiency and computational speed advantages. However,
further categorized into two main groups: dense discretiza- it may struggle with preserving fine details and dense spa-
tion and sparse discretization. Dense discretization involves tial relationships. SDR techniques address these challenges
123
through adaptive sampling and contextual information incor- 5.3 Hybrid methods
poration.
Choy et al. [252] proposed the MinkowskiNet, a 4D DRINet++ [260] leverages the voxel-as-point concept to
spatio-temporal convolutional neural network for 3D video enhance the geometric and sparse characteristics of point
perception. To properly process high-dimensional data, a clouds. It consists of two key modules: Sparse Feature
generalized sparse convolution is presented. To ensure con- Encoder and Sparse Geometry Feature Enhancement, designed
sistency, a trilateral-stationary conditional random field is for efficiency and performance improvement. The Sparse
used. To encode the local geometrical structures within each Geometry Feature Enhancement improves geometric attributes
voxel, Meng et al. [226] proposed a kernel-based interpolated through multi-scale sparse projection and fusion, while the
variational autoencoder architecture. To generate a continu- Sparse Feature Encoder captures local context information.
ous representation and capture the distribution of points in PIG-Net [261], proposed by Hedge et al., adopts a point-
each voxel, RBFs are used for each voxel instead of the binary inception-based deep neural network for 3D point cloud
occupancy representation. A VAE is also used to map each segmentation. By incorporating an inception module-based
voxel’s point distribution to a compact latent space. Then, to inception layer, PIG-Net effectively extracts local features,
achieve robust feature learning, both symmetry groups and an leading to enhanced performance. To prevent overfitting,
equivalence CNN are used. Volumetric based networks can Global Average Pooling (GAP) is employed as a regular-
be trained and evaluated on point clouds of various spatial ization technique.
sizes due to the scalability of 3D CNN. To address the challenge of limited data availability Yan
Furthermore, Rosu et al. [253] introduced LatticeNet as a et al. in [262] proposed JS3C-Net that leverages contextual
way to analyze large point clouds efficiently. DeformsSlice, shape priors learned from scene completion and then uses
a data-dependent interpolation module, is also included to these priors to improve the segmentation of sparse point
back project the lattice feature to point clouds. SPVConv by clouds. JS3C-Net consists of two main components: a scene
Tang et al. [254] introduces a lightweight 3D module that completion network and a segmentation network. The scene
combines a high-resolution point-based branch with Sparse completion network is responsible for predicting a dense
Convolution. This module efficiently preserves fine details point cloud from a sparse point cloud. The segmentation net-
in large outdoor landscapes. The authors further explore effi- work is responsible for predicting the semantic labels of the
cient 3D models using SPVConv and conduct a 3D Neural dense point cloud.
Architecture Search (3D-NAS) to discover optimal network (AF)2-S3Net [72] employs attentive feature fusion with
architectures for improved performance across a diverse adaptive feature selection to enhance the segmentation accu-
design space. racy of sparse point clouds. Comprising a feature extractor,
SVASeg [255] utilizes Sparse Voxel-based Attention attentive feature fusion module, and segmentation network,
(SVHA) to capture long-range dependencies between sparse it extracts features from the point cloud, fuses them atten-
points in point clouds. SVHA module points into local tively, and predicts semantic labels based on adaptive feature
regions, computes attention weights, and aggregates features selection using an attention mechanism.
from neighboring regions to predict semantic labels. This RPVNet [274] is an efficient range-point-voxel fusion net-
approach offers advantages in learning long-range depen- work for LiDAR point cloud segmentation. It consists of three
dencies and is efficient, allowing training and inference on a branches: range, point, and voxel, which extract features from
single GPU. Swin3D [256] is a pretrained transformer back- the range image, point cloud, and voxelized point cloud,
bone for 3D indoor scene understanding. It is based on the respectively. These features are then intelligently fused using
Swin Transformer [257] architecture capable of capturing a Gated Fusion Module (GFM) to achieve state-of-the-art
long-range dependencies among points within a 3D point performance. The GFM selectively combines the relevant
cloud. Swin3D performs self-attention on sparse voxels with features from the three branches for each point.
linear memory complexity and effectively captures the irreg- Hou et al. [277] proposed Point-to-Voxel Knowledge Dis-
ular nature of point signals through generalized contextual tillation (PVD), a hybrid method for semantic segmentation
relative positional embedding. of LiDAR point clouds. PVD utilizes knowledge distilla-
In Table 5, the results reveal that among discretization- tion by training a large teacher network on a large dataset
based methods, Cylinder3D [258], SVASeg [255], and and using its point-level predictions to train a small student
MS1_DVS [259] emerge as the top performers in network. The student network learns from the point-level
SemanticKITTI, nuScenes, and Semantic3D (reduced) predictions of the teacher network to achieve accurate seman-
datasets. However, across various methodologies, tic segmentation. 2DPASS [278] proposed another hybrid
MS1_DVS [259] and Swin3D-L [256] excel, surpassing all method that combines 2D and 3D information for LiDAR
other approaches in the Semantic3D SanNet, S3DIs (area-5 point cloud semantic segmentation. It extracts features from
and 6-fold) datasets. both a 2D image and a 3D grid, fusing them to generate point-
123
Table 5 Comperative 3D point cloud semantic segmentation result on various available datasets
Model Name Year Semantic Semantic nuScenes ScanNet Semantic3D(red.) S3DIS(Area5) S3DIS(6-fold)
KITTI POSS
(mIoU) (mIoU) (mIoU) (OA) (mIoU) (OA) (mIoU) (OA) (mIoU) (OA) (mIoU)
Projection Based Methods

SnapNet [231] 2017 – – – – – 88.60% 59.10% – – – –
DeePr3SS [223] 2017 – – – – – 88.90% 58.50% – – – –
SqueezeSeg [224] 2018 29.50% 16.80% – – – – – – – – –
SqueezeSeg+CRF [224] 2018 30.80% 18.70% – – – – – – – – –
SqueezeSegV2 [235] 2019 39.70% 29.80% – – – – – – – – –
SqueezeSegV2+ CRF [235] 2019 39.60% 28.90% – – – – – – – – –
DarkNet21Seg [38] 2019 47.40% – – – – – – – – – –
DarkNet53Seg [38] 2019 49.90% – – – – – – – – – –
RangeNet53 [237] 2019 49.90% 25.40% – – – – – – – – –
RangNet21 [237] 2019 47.40% – – – – – – – – – –
RangeNet53++ [237] 2019 52.20% 28.90% 65.50% – – – – – – – –
KPRNet [263] 2020 63.10% – – – – – – – – – –
PolarNet [245] 2020 54.30% – 69.40% – – – – – – – –
3D-MiniNet-KNN [264] 2020 55.80% – – – – – – – – – –
SqueezeSegV3-21 [236] 2020 51.60% – – – – – – – – – –
SqueezeSegV3-53 [236] 2020 55.90% – – – – – – – – – –
SalsaNet [246] 2020 45.40% – – – – – – – – – –
SalsaNext [265] 2020 59.50% – 72.20% – – – – – – – –
DeepTemporalSeg [266] 2020 37.60% – – – – – – – – – –

AMVNet [267] 2020 65.30% – 76.10% – – – – – – – –
MPF [268] 2021 55.50% – – – – – – – – – –
TORNADONet [269] 2021 61.10% – – – – – – – – – –
TORNADONet-HiRes [269] 2021 63.10% – – – – – – – – – –
Lite-HDSeg [238] 2021 63.80% – – – – – – – – – –
MINet [233] 2021 52.40% 30.50% – – – – – – – – –
MINet+k-NN [233] 2021 55.20% 35.10% – – – – – – – – –
FPS-Net [270] 2021 57.10% – – – – – – – – – –
FIDNet [239] 2021 59.50% – – – – – – – – – –
Meta-RangeSeg [240] 2022 61.00% – – – – – – – – – –
DGPolarNet [247] 2022 56.50% – – – – – – – – – –
Page 31 of 54
GFNet [241] 2022 65.40% – 76.10% – – – – – – –

CENet [242] 2022 64.70% – 74.70% – – – – – – –
67
RangeFormer [243] 2023 73.30% – 80.10% – – – – – – –
123
LENet [244] 2023 64.20% – – – – – – – – – –
Table 5 continued
67
Model Name Year Semantic Semantic nuScenes ScanNet Semantic3D(red.) S3DIS(Area5) S3DIS(6-fold)
KITTI POSS
123
(mIoU) (mIoU) (mIoU) (OA) (mIoU) (OA) (mIoU) (OA) (mIoU) (OA) (mIoU)
Discretization-based Methods
Page 32 of 54
SEGCloud [248] 2017 – – – – – 88.10% 61.30% 57.35% 48.90% – –

SparseConvNet [225] 2018 – – – – 72.50% – – – – – –
MS1_DVS [259] 2018 – – – – – 84.80% 57.10% – – – –
MS3_DVS [259] 2018 – – – – – 88.40% 65.30% 57.93% 46.32% – –
SPLATNet [271] 2018 18.40% – – – 39.30% – – – – – –
MinkowskiNet [252] 2019 – – – – 73.60% – – 71.71% 65.40% – –
VV-Net [226] 2019 – – – – – – – – – 87.78% 78.22%
LatticeNet [253] 2020 52.90% – – – 64.00% – – – – – –
SVASeg [255] 2022 65.20% – 74.70% – – – – – – – –
Cylinder3D [251] 2021 61.80% – 77.90% – – – – – –
Swin3D-L [256] 2023 – – – – 77.90% – – – 74.50% – 79.80%
Hybrid Methods
3DMV [227] 2018 – – – – 48.40% – – – – – –
PVCNN [272] 2019 39.00% – – – – – – 86.87% 57.63% – –
UPB [273] 2019 – – – – 63.40% – – – – – –
SPVConv [254] 2020 66.40% – 77.40% – – – – – – – –
SPVNAS [254] 2020 66.40% – – – – – – – – – –
JS3C-Net [262] 2021 66.00% 60.20% – – – – – – – – –
AF2-S3Net [72] 2021 69.70% – 62.20% – – – – – – – –
RPVNet [274] 2021 70.30% – – – – – – – – – -
DRINet++ [260] 2021 70.70% – 80.40% – – – – – – – –
PMF [275] 2021 63.90%* – – 76.90% – – – – – – –
MVPNet [276] 2022 53.90% – – – 64.10% – – – – – –
DSPoint [79] 2022 – – – – – – – 70.90% 63.30% – –
Point Voxel Transformer [81] 2022 64.90% – – – – – – – 68.20% 88.30% 69.20%
PVKD [277] 2022 71.20% - 76.00% – – – – – – – –
Sparse EQ-Net [83] 2022 – – – – 74.30% – – - 71.30% – 77.50%
2DPASS [278] 2022 72.90% – 80.80% – – – – – – –
DeepViewAgg [279] 2022 – – – – – – – - 67.20% – 74.70%
LidarMultiNet [280] 2022 – – 81.40% – – – – – – – –
SAT [281] 2023 – – – – 74.20% – – – – 72.60% –
Here ‘OA’ refers to the overall accuracy and ‘mIoU’ denotes mean intersection over union. The symbol ‘–’ means the results are unavailable. The methods are arranged in chronological order
within their corresponding categories. The top-performing methods in each category have been highlighted in bold, while the method(s) achieving the best overall performance across all categories
S. Sarker et al.
are underlined
level predictions. These predictions are then used to perform 6.1.1 Pointwise MLP methods
semantic segmentation on the LiDAR point cloud.
In [281], Zhou et al. introduced Size-Aware Transformer Because of the high efficiency, these methods mainly use
(SAT) for 3D point cloud semantic segmentation. SAT adapts shared MLP as the basic unit in their networks. Point-wise
receptive fields to object sizes, incorporating multi-scale fea- features retrieved using shared MLP are unable to capture
tures and enabling each point to select its attentive fields. the local geometry in point clouds as well as point-to-point
The model includes the Multi-Granularity Attention (MGA) relations [3]. Several networks have been developed to cap-
scheme for efficient feature aggregation and the Re-Attention ture a broader context for each point and learn richer local
module for dynamic adjustment of attention scores. SAT structures, including methods based on neighboring feature
addresses challenges through point-voxel cross attention and pooling, attention-based aggregation, and local–global fea-
a shunted strategy based on multi-head self-attention. By tai- ture concatenation.
loring receptive fields based on object sizes, SAT improves Chen et al. [282] presented a Local Spatial Aware (LSA)
object understanding and achieves enhanced performance. layer that learns spatial awareness weights based on the spa-
Table 5 indicates JS3C-Net [262] and LidarMultiNet [280] tial layouts and local structures of point clouds in order
outperformed all other methods by achieving the highest to better represent the spatial distribution of a point cloud.
mIoU scores on the SemanticPOSS and nuScenes datasets, For large-scale point cloud segmentation, Hu et al. [6] sug-
while PMF [275] and Point Voxel Transformer [81] excelled gested RandLA-Net, an efficient and lightweight network.
in achieving the highest OA scores on the Sannet and This network makes use of random point sampling to attain
S3DIS(6-fold) datasets. a remarkable level of computation and memory efficiency.
To collect and maintain geometric characteristics, a local
feature aggregation module is also provided. In order to
reduce the number of redundant ConvNet channels, Hu et
al. [128] proposed a novel concept. DRNet, which identifies
6 Learning strategies for point based the most significant channels for each class (dissect) in an
methods in semantic segmentation interpretable manner and dynamically runs channels accord-
ing to classes in need (reconstruct). This significantly reduces
For semantic segmentation, we will adopt a similar frame- the network’s parameter usage, resulting in a lower memory
work as shape classification. In this section, we have exclu- footprint.
sively examined approaches that utilize raw point clouds as Raw point cloud data inevitably contains outliers or noise
input. These methods can be further categorized into two as it is generated through different reconstruction algorithms
groups based on the type of learning supervision employed: using 3D sensors. Though the MLP method has proven
supervised and unsupervised methods. Figure 10 provides a to be efficient, it still fails to capture the spatial relations
comprehensive categorization of raw point-based approaches which is a major downside for this method. In order to
for point cloud semantic segmentation. Additionally, Table 6 extract motion data from a series of massive 3D LiDAR point
offers a detailed comparison of raw point-based methods for clouds, Wang et al. [283] developed PointMotionNet, a point-
point cloud semantic segmentation across various datasets. based spatiotemporal pyramid architecture. A key element
The methods are organized chronologically within their of PointMotionNet is a cutting-edge method for point-based
respective categories. The evaluation of each method’s per- spatiotemporal convolution, which utilizes a time invariant
formance is based on metrics such as overall accuracy (OA) spatial adjacent space to detect point correspondences across
and mean intersection over union (mIoU). time and extracts spatiotemporal properties.
PS2-Net [284] is a locally and globally aware deep
learning framework for semantic segmentation on 3D scene-
level point clouds. It incorporates local structures through
6.1 Supervised training EdgeConv and global context through NetVLAD, enabling
effective integration of local structures and global context.
Similar to 3D shape classification, supervised learning meth- PS2-Net is permutation invariant, making it suitable for han-
ods for semantic segmentation can be categorized into dling unordered point clouds. To capture contextual shape
seven distinct groups: pointwise MLP, hierarchical-based, information, Sahin et al. proposed ODFNet [285], which
convolution-based, RNN-based, graph-based, transformer- utilizes local point orientation distributions around a point.
based, and other approaches. These categories can be fur- Cone volumes divide the spherical neighborhood of a point,
ther organized into feedforward and sequential training and statistics within each volume serve as point features. The
paradigms based on the underlying model architecture and ODF neural network employs an ODFBlock with MLP lay-
data processing techniques. ers to process the orientation distribution function, enabling
123
Fig. 10 A taxonomy of deep learning approaches for raw point-based 3D point cloud semantic segmentation
representation of local patches considering point density dis- in a sphere space are formulated through an optimization
tribution along multiple orientations. problem. It’s important to acknowledge that the radius neigh-
Table 6 provides a detailed look at point-based methods borhood is employed to maintain a consistent receptive field,
for segmenting 3D point clouds in various datasets. Within and grid subsampling is used in each layer to achieve great
the pointwise MLP methods category, RandLA-Net [6] robustness under variable point cloud densities.
stands out with exceptional results in the Semantic3D (red. In order to simultaneously handle the instance and seman-
and sem.) datasets. Meanwhile, PointNeXt-XL [105] and tic segmentation of 3D point clouds, Zhao et al. [288]
RepSurf-U [286] secure the best Overall Accuracy (OA) in introduced a novel combined instance and semantic segmen-
the S3DIS area-5 and 6-fold datasets, respectively. Notably, tation approach called JSNet. It utilizes an efficient backbone
PointMotionNet [283] achieves the highest mIoU score network to extract robust features from the point clouds and
across all models and methodologies in the SemanticKITTI a feature fusion module to combine different layer char-
dataset. acteristics for more discriminative features. A combined
instance semantic segmentation module converts semantic
characteristics into instance embedding space and fuses them
6.1.2 Convolution-based methods
with instance features. Additionally, it aggregates instance
features into a semantic feature space to support seman-
Point clouds are a type of 3D data that can be difficult to
tic segmentation. DPC [287] addresses the issue of limited
process with traditional convolution operators. To address
receptive field size in point convolutional networks for 3D
this challenge, several approaches have been proposed that
point cloud processing tasks. By expanding the receptive field
use efficient convolution operators specifically designed for
size of point convolutions, DPC enhances the performance
point clouds.
of semantic segmentation and object classification. It can be
The impact of the receptive field on the performance of
easily integrated into existing point convolutional networks.
aggregation-based approaches was demonstrated by Engel-
The method proposed in [10], JSENet, is a two-stream
mann et al. [287], in their ablation studies with visualization
network with one stream for semantic segmentation and one
findings. Instead of using the k nearest neighbors, they pro-
stream for edge detection. The two streams share a com-
posed using a Dilated Point Convolution (DPC) method to
mon encoder, which extracts features from the point cloud.
aggregate dilated nearby features. This procedure has been
The features from the encoder are then fed into two separate
shown to be quite successful in boosting receptiveness. Based
decoders that generate the semantic segmentation and edge
on kernel point convolution, Thomas et al. [116] suggested a
detection outputs. In [5], Liu et al. compared different local
Kernel Point Fully Convolutional Network (KP-FCNN). The
aggregation operators for point cloud analysis. They explored
euclidean distances to kernel points determine the convolu-
max pooling, average pooling, and voxel pooling on object
tion weights of KPConv, and the number of kernel points
classification and segmentation tasks. The study revealed that
is not set. The best coverage places for the kernel points
123
each operator has unique strengths and weaknesses. Max a slice pooling layer, an RNN layer, and a slice unpool-
pooling captures discriminative features but reduces spatial ing layer. The slice pooling layer projects features from
resolution, average pooling finds a balance, and voxel pool- unordered point clouds into an ordered sequence of feature
ing preserves global structures but may lose details. vectors, which are then processed by the RNN layer. 3P-
To address these limitations, the authors introduced a RNN [290] addresses semantic segmentation using raw point
hybrid operator that combines max and average pooling, clouds, combining a pyramid pooling module and a bidi-
resulting in improved performance across various tasks. rectional RNN. The pyramid pooling module extracts local
In [289] Li et al. proposed DenseKPNet, which is built on a spatial data, while the bidirectional RNN captures global
dense convolutional neural network with Kernel Point Con- context. Inspired by PointNet, 3P-RNN employs pointwise
volution (KPConv). KPConv is a novel convolution operation pyramid pooling for local feature acquisition, resulting in
that is specifically designed for point clouds. It allows the net- faster processing compared to simple pooling in PointNet++.
work to learn local features from neighboring points, as well To convert unordered point feature sets into an ordered
as global features from the entire point cloud. series of feature vectors, Huang et al. [291] introduced a
Table 6 indicates that among pointwise convolution meth- lightweight local dependency modeling module that used a
ods, DenseKPNet [289] achieved the highest OA and mIoU slice pooling layer. Addressing the limitations of rigid and
in the S3DIS (area-5 and 6-fold) and Semantic3D (red.) fixed pooling operations, Zhao et al. [292] proposed the
datasets. On the other hand, ConvPoint [118] delivered Dynamic Aggregation Network (DAR-Net), which considers
optimal results for Semantic3D (sem.), and KPConv [116] both the global scene intricacy and local geometric factors.
excelled on SemanticKITTI and ScanNet datasets. DAR-Net employs self-adaptive receptive fields and node
weights to dynamically aggregate inter-medium features.
6.1.3 Hierarchical methods Table 6 demonstrates that within the category of RNN-
based methods, 3P-RNN attained the peak OA and mIoU in
Hierarchical methods leverage the inherent structural rela- the S3DIS (area-5) dataset. Simultaneously, RSNet achieved
tionships within point clouds to enhance segmentation accu- the highest mIoU for the S3DIS (6-fold) dataset.
racy and capture finer details. Hierarchical segmentation can
lead to improved object part delineation and better handling 6.1.5 Transformer-based methods
of complex scenes with varying levels of detail. Given the
limited literature on hierarchical data structures in point cloud Lai et al. [320] introduced the Stratified Transformer, which
understanding, and considering the papers covered in the effectively captures long-range contexts while maintaining
classification section, this discussion focuses on previously good generalization and performance. The network samples
unmentioned studies that utilize this approach for semantic neighboring points within a window as keys for each query
segmentation. point and sparsely samples remote points. This approach
Xie et al. [142] developed a novel representation using allows for a larger receptive field with minimal additional
shape context as a fundamental element in their network calculations, encompassing both denser local points and
architecture. The model can capture and propagate object sparser distant points. The method incorporates first-layer
part information without relying on a fixed grid, and fea- point embedding to aggregate local information, acceler-
tures a simple yet effective contextual modeling mecha- ating convergence and improving performance. Contextual
nism inspired by self-attention based models. Attentional relative position encoding is employed to capture adaptable
ShapeContextNet (A-SCN) is an end-to-end solution for position information. Additionally, a memory-efficient tech-
point cloud classification and segmentation problems. nique addresses the challenge of fluctuating point counts
According to Table 6, 3DContextNet [126] emerged as in each window. Existing point transformers suffer from
the top performer within the hierarchical methods category quadratic complexity in generating attention maps, making
on both the S3DIS datasets. them computationally expensive.
Zhang et al. proposed Patchformer [165] which addresses
6.1.4 RNN-based methods this flaw by combining Patch Attention (PAT) and the Multi-
Scale Attention (MST) module to progressively learn a
Recurrent Neural Networks (RNN) have also been uti- significantly smaller range of bases for computing atten-
lized for semantic segmentation of point clouds to capture tion maps. Through a weighted summation on these bases,
underlying context information. Bidirectional RNN has been PAT captures the whole shape context while simultaneously
successfully applied to enhance the handling of point clouds attaining linear complexity for the input size. In the mean-
in methods like 3P-RNN [290] and RSNets [291], enabling time, the model receives multi-scale features from the MST
better context capture. RSNets leverage a lightweight local block, which creates attention among features of various
dependence module to capture local structures, incorporating scales.
123
Table 6 Comperative 3D point cloud semantic segmentation results on various available datasets for Point based methods
67
Model Name Year SemanticKITTI S3DIS(Area5) S3DIS(6-fold) ScanNet Semantic3D(sem.) Semantic3D(red.) STPLS3D SensatUrban
(mIoU) (OA) (mIoU) (OA) (mIoU) (OA) (mIoU) (OA) (mIoU) (OA) (mIoU) (mIoU) (mIoU)
123
Pointwise MLP Methods
PointNet [3] 2017 14.60% – 41.10% 78.60% 47.60% – 14.69 – – – – – 52.53%
Page 36 of 54
PointNet++ [4] 2017 20.10% – – 81.00% 54.50% 84.50% 33.90% 85.70% 63.10% – – 15.92% 58.13%
MS-CU [293] 2017 – – – 79.20% 47.80% – – – – – – – –
PointSIFT [294] 2018 – – – 88.70% 70.20% 86.20% 41.50% – – – – – –
Engelmann et al. [295] 2018 – 84.20% 52.20% 84.00% 58.30% – – – – – – – –
LSANet [282] 2019 – – – 86.80% 62.20% 85.10% – – – – – – –
RandLA-Net [6] 2020 53.90% – – 88.00% 70.00% – – 94.60% 74.80% 94.80% 77.40% 50.53% 62.80%
PointASNL [7] 2020 46.80% – 68.70% – – – 63.00% – – – – – –
HPGCNN [139] 2020 50.50% – – 90.30% 69.20% – – – – – – – –
PS2-Net [284] 2020 – 84.60% 52.95% 88.22% 66.60% 87.21% 44.90% – – – – – –
BAAF-Net [284] 2021 59.90% – 65.40% 88.90% 72.20% – – – – 94.90% 75.40% – –
RPNet-D27 [103] 2021 – – – 70.80% – 68.20% – – – – – – –
ASSANet [140] 2021 – – 66.80% – – – – – – – – – –
PointMotionNet [283] 2022 81.80% – – – – – – – – – – – –
RepSurf-U [286] 2022 – 90.20% 68.90% 90.80% 74.30% – 70.00% – – – – – –
PointNeXt-XL [105] 2022 – 90.60% 70.50% 90.30% 74.90% – 71.20% – – – – – –
ODFNet [285] 2022 – – – 90.80% 72.20% – – – – – – –
Convolution-based Methods
PCCN [296] 2018 – – 58.30% – – – – – – – – – –
TangentConv [297] 2018 35.90% 82.50% 52.80% – – 80.10% 40.90% – – – – – –
PointCNN [145] 2018 – 85.91% 66.36% 88.14% 75.61% 79.70% 55.70% – – – – – –
A-CNN [148] 2019 – – – 87.30% – 85.40% – – – – – – –
KPConv [116] 2019 58.80% – 67.10% 68.50% 67.10% – 68.40% – – 92.90% 74.60% 53.73% 64.50%
InterpCNN [109] 2019 – – – 88.70% 66.70% – – – – – – – –
ShellNet [111] 2019 – – – 67.90% 66.80% 85.20% – – – 93.20% 69.40% – –
PointConv [119] 2019 – – – – – – – – – – – – –
JSNet [288] 2019 – 87.70% 54.50% 88.70% 61.70% – – – – – – – –
JSENet [10] 2020 – – 67.70% – – 69.90% – – – – – –
PosPool [5] 2020 – – 66.70% – – – – – – – – –
Point-PlaneNet [112] 2020 – – – 83.90% 54.80% – – – – – – – –
ConvPoint [118] 2020 – – – 88.80% 68.20% – – 93.40% 76.50% – – – –
DPC [287] 2020 – 86.80% 61.30% – – – 59.20% – – – – – –
DenseKPNet [289] 2022 – 90.80% 68.90% 89.30% 71.90% – – – – 94.90% 77.90% – –
S. Sarker et al.
Table 6 continued
Hierarchical Methods
A-SCN [142] 2018 – – – 81.59% 52.72% – – – – – – – –
3DContextNet [126] 2018 – 84.90% 55.60% 90.60% 72.00% – – – – – – – –
RNN-based Methods
G+RCU [293] 2017 – – 45.10% 81.10% 49.70% – – – – – – – –
RSNet [291] 2018 – – 51.93% – 56.50% – 39.35% – – – – – –
3P-RNN [290] 2018 – 86.90% 56.30% 85.70% 53.40% – – – – – – – –
RCNet [157] 2019 – – – 82.01% 51.40% – – – – – – – –
RCNet-E [157] 2019 – – – 83.58% 53.21% – – – – – – – –
Graph-based Methods
SPG [298] 2018 – 86.38% 58.04% 85.50% 62.10% – – 92.90% 76.20% 94.00% 73.20% – –
DGCNN [132] 2018 – – 47.10% 84.10% 56.10% – – – – – – – –
SSP+SPG [299] 2019 – 87.90% 61.70% 87.90% 68.40% – – – – – – – –
PointWeb [131] 2019 – 87.00% 60.30% 87.30% 66.70% 85.90% – – – – – – –
GACNet [300] 2019 – 87.79% 62.85% – – – – – – 91.90% 70.80% – –
HDGCN [301] 2019 – – 59.33% – 66.85% – – – – – – – –
Jiang et al. [302] 2019 – 87.18% 61.85% 88.20% 67.83% – 61.80% – – – – – –
DPAM [134] 2019 – 86.10% 60.00% 87.60% 64.50% – – – – – – – –
PAG [303] 2020 – 88.81% 69.32% 88.10% 65.90% – – – – – – – –
SegGCN [304] 2020 – 88.20% 63.60% – – – 58.90% – – – – – –
SPH3D-GCN [305] 2021 – 87.70% 59.50% 88.60% 68.90% – 61.00% – – – – – –

RG-GCN [306] 2022 – – – 88.10% 63.70% – – – – – – – –
Page 37 of 54
67
123
Table 6 continued
67
123
Transformer-based Methods
PAT [159] 2019 – – 60.10% – 64.30% – – – – – – – –
Page 38 of 54
FastPointTransformer [307] 2021 – – 70.10% – – – 72.10% – – – – – –

PointTransformer [160] 2021 – 90.80% 70.40% 90.20% 73.50% – – – – – – 47.64% –
Spg 2022 – 91.50% 72.00% – – – 73.70% – – – – – –
PatchFormer [165] 2022 – – 68.10% – – – – – – – – – –
LCPFormer [167] 2022 – 90.80% 70.20% – – – – – – – – 63.40%
PTv2 [166] 2022 – – 71.60% – – – 75.20% – – – – –
StratifiedFormer+PAGWN [308] 2022 – 91.40% 72.20% 91.70% 77.60% – – – – – – –
SPoTr [168] 2023 – 90.70% 70.80% – – – – – – – – – –
Unsupervised-based Methods
PointConstrast [309] 2020 – – 70.90% – – – – – – – – –
DGCNN-OcCo [191] 2021 – – – 84.60% 58.00% – – – – – – – –
GuidedPoint [310] 2021 67.70% 68.80% – – – 74.00% (V2) – – – – –
IAE [193] 2022 – 85.90% 60.70% – – – – – – – – – –
ACT [203] 2022 – 71.10*% 61.20% – – – – – – – – – –
Hao et al. [201] 2022 – 90.10% 69.30% 90.20% 73.50% – 75.80% – – – – – –
HybridCR [311] 2022 52.30% – 51.50% – 69.20%% – 56.80% (v2) – – – 76.80% –
NAPL [312] 2022 61.60% – – – – – – – – – – – –
SQN [313] 2022 50.80% – 61.40% – – – 56.90% 94.80% 72.30% 93.70% 74.70% – 54.00%
WeakLabel-3DNet [314] 2022 53.70% – 68.10% – – – 67.90% – 75.30% – – – –
Other Methods
FG-Net [173] 2021 53.80% 88.20% – – 70.80% – 68.50% 93.60% 78.20% – – – –
SCF-Net [315] 2021 – 88.40% 71.60% – – – – – – 94.70% 77.60% 50.65% -
KPConv+RFCR [316] 2021 – – 68.73% – – – 70.20% – – – 77.80% – –
Shao et al. [317] 2022 – – – 88.90% 71.70% – – – – 95.30% 79.00% – 63.60%
ConvNet+CBL [318] 2022 90.60% 69.40% 89.60% 73.10% – 70.50% – – 95.00% 78.40% – –
MSIDA-Net [319] 2022 59.80% – – 89.20% 73.00% – – – – 94.60% 77.80%
LEAK(Cylinder3D) [258] 2023 65.20% – – – – – – – – – – – –
LEAK(RandLA-Net) [258] 2023 – – – 90.90% 76.10% – – – – – – – –
PointMetaBase-XXL [179] 2023 – 90.80% 71.30% 91.30% 77.00% – 71.40% – – – – –
Here ’OA’ refers to the overall accuracy and ’mIoU’ denotes mean intersection over union. The symbol ‘-’ means the results are unavailable. The methods are arranged in chronological order within
their corresponding categories. The top-performing methods in each category have been highlighted in bold, while the method(s) achieving the best overall performance across all categories are
underlined
S. Sarker et al.
[308] proposes a novel Window Normalization (WN) primary challenge in learning from point clouds is capturing
module for 3D point cloud understanding. WN is a simple yet local structures and relationships.
effective module that can be easily integrated into existing The capacity of graph convolution to extract local shape
point cloud neural networks. WN works by normalizing the information from neighbors is very powerful. Inspired
features of each point in a local window to have unit length. by this, Lian et al. introduce the Hierarchical Depthwise
This helps to unify the point densities in different parts of Graph Convolutional Neural Network (HDGCN) [301].
the point cloud, which can improve the performance of point HDGCN employs a memory-efficient depthwise graph con-
cloud neural networks on tasks such as semantic segmenta- volution, known as DGConv, along with pointwise convolu-
tion and object detection. tion. DGConv enables local feature extraction and transfer
According to the results presented in Table 6, Stratified- between points and their neighbors while being order-
Former+PAGWN [308] emerged as the top performer for invariant.
both the S3DIS area-5 and 6-fold datasets among all models Zeng et al. introduced RG-GCN [306], a Random Graph-
from various methodologies. based Graph Convolution Network for point cloud semantic
segmentation. It comprises two main components: a random
6.1.6 Graph-based methods graph module that constructs a random graph for each point
cloud, and a graph convolution network based on a modified
Graph networks are used in a variety of methods to capture the PointNet++ architecture, known for its effectiveness in point
underlying forms and geometric features of 3D point clouds. cloud semantic segmentation.
Ma et al. [321] suggested a Point Global Context Rea- Table 6, demonstrates that among the graph-based meth-
soning (PointGCR) module that uses an undirected graph ods, SPG [298] attained the best OA and mIoU for both
representation to capture global contextual information along the Semantic3D datasets. Meanwhile, PAG [303] yielded the
the channel dimension. PointGCR is a plug-and-play mod- optimal results for S3DIS (area-5), and SPH3D-GCN [305]
ule that can be trained from beginning to end. To boost excelled in the S3DIS (6-fold) dataset.
performance, it may be readily added into an existing seg-
mentation network. Furthermore, numerous recent studies 6.1.7 Unsupervised training
have attempted to perform semantic segmentation of point
clouds with less supervision. For semantic segmentation of PointContrast [309] is an unsupervised pre-training method
point clouds, Xu et al. [322] studied many inexact supervi- for 3D point cloud understanding. It employs a contrastive
sion techniques. They also presented a network that can be learning framework to learn representations from unlabeled
trained with points that are only partially tagged. point cloud data. This two-stage method extracts local fea-
Kang et al. developed PyramNet [323] using Graph tures by grouping points into patches and encodes them
Embedding Module (GEM) and Pyramid Attention Net- into fixed-dimensional representations. The contrastive loss
work (PAN). GEM transforms the point cloud into a directed is then used to encourage similarity among representations
acyclic graph, using a covariance matrix for adjacent similar- from the same patch and dissimilarity among representations
ity instead of euclidean distance. The PAN module extracts from different patches. PointContrast highlights the poten-
features with different semantic intensities using convolution tial of leveraging unlabeled data for effective representation
kernels of four distinct sizes. Graph Attention Convolution learning in 3D point cloud analysis.
(GAC) was introduced in [300] as a way to learn useful Authors in [311] proposed a novel Hybrid Contrastive
features from a nearby adjacent set selectively. This is accom- Regularization (HybridCR) framework for weakly-supervised
plished by assigning attention weights to various surrounding point cloud semantic segmentation. HybridCR leverages
points and feature channels depending on their spatial place- both point consistency and contrastive regularization with
ments and feature differences. GAC is similar to the widely pseudo labeling in an end-to-end manner. Fundamentally,
used CRF model in that it may learn to capture discriminative HybridCR explicitly and effectively considers the seman-
features for segmentation. tic similarity between local neighboring points and global
For effective graph convolution of 3D point clouds, Lei characteristics of 3D classes. In their work [314], a novel
et al. [305] proposed a spherical kernel. It quantizes local weakly supervised framework called WeakLabel3D-Net is
3D space systematically to capture geometric relationships. proposed for understanding real-scene LiDAR point clouds.
The spherical CNN kernel shares weights for similar struc- This multi-task framework achieves state-of-the-art results
tures, providing translation-invariance, and supports precise on various LiDAR datasets, even with limited labeled data.
geometric learning through asymmetry. It eliminates edge- WeakLabel3D-Net comprises a point cloud encoder, task-
dependent filter generation, making it efficient for large point specific decoders, and a weakly supervised loss function.
clouds. Vertices represent points, edges connect neighbors, The encoder extracts features, decoders generate predictions,
and coarsening is done using farthest point sampling. The and the loss function trains the network with labeled data.
123
The framework utilizes a modified PointNet++ encoder, task- segmentations. MSIDA addresses challenges posed by disor-
specific decoders, and a combination of cross-entropy and dered and unevenly distributed large-scale 3D point clouds.
consistency losses to encourage consistent predictions for It includes an MSI block to encode spatial information
neighboring points. using cylindrical and spherical coordinate systems, and DA
Zhao et al. [312] proposed Number-Adaptive Prototype blocks for weighted fusion of local features and improved
Learning (NAPL), a weakly supervised approach that learns local region understanding. By incorporating spatial infor-
from a small number of labeled points. It learns prototypes mation and adaptive feature integration, the MSIDA module
by clustering unlabeled points and then predicts the class of enhances point cloud segmentation, enabling better compre-
a point by finding the closest prototype. What sets NAPL hension of complex geometric structures in scenes.
apart is its adaptive learning of the number of prototypes To address the challenge of unsatisfactory segmenta-
for each class. This is achieved using a novel loss function tion performance on scene boundaries in 3D point cloud
that penalizes the classifier for assigning the same class to data, [318] introduced metrics to quantify this issue and pro-
nearby points, encouraging the learning of distinct proto- poses a Contrastive Boundary Learning (CBL) framework.
types even for close points of the same class. For semantic CBL quantifies the issue and enhances feature discrimina-
segmentation of large-scale 3D point cloud, Hu et al. pro- tion across boundaries by contrasting representations using
posed Semantic Query Network (SQN) [313], a graph-based scene contexts at multiple scales. Camuffo et al. [258] intro-
method that leverages both point consistency and contrastive duced LEAK, which clusters classes into macro groups
regularization. Fundamentally, SQN explicitly and effec- based on mutual prediction errors for point cloud semantic
tively considers the semantic similarity between neighboring segmentation. LEAK aligns class-conditional prototypical
3D points, allowing the extremely sparse training signals to feature representations for fine and coarse classes to reg-
be back-propagated to a much wider spatial region, thereby ularize the learning process. This prototypical contrastive
achieving superior performance under weak supervision. learning approach improves generalization across domains
According to the findings in Table 6, amidst models and reduces forgetting during knowledge distillation from
employing unsupervised methodology [201] showcased the prototypes. Additionally, it incorporates a per-class fairness
most promising results on both the S3DIS area-5 and 6-fold index to ensure balanced class-wise results.
datasets. However, when considering models spanning all Table 6 shows that among the models categorized under
methodologies, [201] outperformed the rest on the ScanNet other methods, PointMetaBase-XXL [179] demonstrated
dataset, while SQN [313] achieved the highest Overall Accu- promising results across multiple datasets, including S3DIS
racy (OA) for the Semantic3D (sem.) dataset. (area-5, 6-fold), and the ScanNet dataset. However, when
compared to models from all other methodologies, both FG-
6.1.8 Other methods Net [173] and [317] excelled in both Semantic3D datasets.
Fan et al. [315] proposed SCF, a learnable module for extract-

ing Spatial Contextual Features (SCF) from large-scale point 7 Discussion and future directions
clouds. SCF consists of three components: a local polar rep-
resentation block, a dual-distance attentive pooling block, Although extensive research has been conducted on point
and a global contextual feature block. The module con- cloud processing networks for 3D object classification and
structs spatial representations invariant to z-axis rotation, segmentation, their performance still falls behind when com-
learns discriminative local features using neighboring repre- pared to RGB images. This disparity is due to the irregular
sentations, and incorporates global context based on spatial and sparse nature of the point cloud. As a result, there is still
location and neighborhood volume ratio. Gong et al. [316] considerable work to be done.
proposed Receptive Field Component Reasoning (RFCR) for This section aims to highlight potential key research direc-
point cloud segmentation, utilizing Target Receptive Field tions and future applications in a comprehensive manner. The
Component Codes (RFCCs) to guide a coarse-to-fine cate- following aspects will be explored to provide insight into the
gory reasoning approach. The method incorporates a gradual possible directions of this field.
RFCR module that enhances neural network representation
by iteratively reasoning about receptive field components, 7.1 Algorithmic advances
enabling the learning of progressively complex features.
Additionally, a feature densification technique employing Point cloud processing algorithms are critical to efficiently
centrifugal potential is introduced to improve feature selec- process the vast amounts of data contained in point clouds.
tion by separating positive and negative features. Future algorithms may incorporate more advanced deep
[319] presented Multispatial Information and Dual Adap- learning techniques to better handle the complexities of
tive (MSIDA) module for learning point cloud semantic point cloud data. A possible strategy for dealing with raw
123
Table 7 A summarization and comparison between existing methods for 3D point cloud understanding
Methods Strengths Weaknesses
Projection-based By leveraging 2D convolutional Generates quantization artifacts that

architectures to solve a 3D task, pro- may make it difficult to see the
jection based techniques eventually data’s inherent invariances. Multi-
make 3D learning more scalable by resolution features and geometric
bridging the gap between 2D and 3D features are not exploited prop-
learning erly. Not appropriate for tasks
requiring per-point processing. Due
to repeated convolution proce-
dures, spatial information of small
instances is lost
Voxel-based Voxel-based models are compatible Voxelization results in the loss of
with traditional 3D convolutions, geometrical and spatial resolution
have regular data locality, and can information. It is not scaleable
effectively encode coarse-grained since the computational and mem-
features ory footprints increase cubically
with resolution
Range view Range view methods directly pro- Range view methods are viewpoint-
cess raw point cloud data, preserv- sensitive, with changes in sen-
ing intricate geometry and offer- sor placement or orientation intro-
ing an intuitive spatial represen- ducing data variability that can
tation akin to human perception. undermine robust processing and
By working with raw point coor- interpretation. Sparse data regions,
dinates, they maintain underlying resulting from sensor characteris-
geometric information, ideal for tics, can challenge these methods,
tasks demanding precise measure- potentially impeding accurate anal-
ments of distances and angles ysis
Bird’s-Eye View Bird’s-eye view offers a top-down Bird’s-eye view inherently lose
perspective, simplifying tasks like some of the fine geometric details
object detection by converting the present in the range view, limiting
problem from 3D to 2D. It’s more precision in geometric analysis. It
stable against sensor pose changes excels in horizontal plane under-
than range view, which is beneficial standing but struggles with accurate
in scenarios with varying sensor ori- vertical capture, potentially causing
entation ambiguity due to object overlap at
varying heights
Hybrid By integrating high-level image High coupling between image
semantics to points, the resolution and LiDAR models lowers over-
mismatch issue between dense RGB all model reliability and raises
and sparse depth can be resolved development costs
Point-based Point-based models directly learn Usually requires a lot of computing
from sparse and unstructured point power, especially with the large-
clouds while maintaining the accu- scale point cloud dataset
racy of point location
Supervised based Supervised learning can produce The amount of labeled data needed
highly accurate results when trained for supervised learning is signifi-
on a large dataset of labeled point cant, and it may struggle with new
clouds. Once a model is trained, or unseen data. It can also be biased
it can be used to process new towards the training data, leading to
point clouds quickly and efficiently. incorrect results and prone to over-
This is especially useful in real- fitting
time applications where processing
speed is critical
123
Table 7 continued
Methods Strengths Weaknesses
Unsupervised based Unsupervised learning can be used Unsupervised learning methods can
to identify patterns and structures be more computationally intensive
in point cloud data without the than supervised learning methods,
need for labeled data. It helps to which can be a limitation when
cluster similar objects or segments dealing with large point cloud
within point clouds and can iden- datasets. It lacks labeled data guid-
tify outliers and anomalies in point ance, which can potentially result in
cloud data, which may be missed by less accuracy than supervised learn-
supervised learning methods ing. The evaluation of unsupervised
learning models can be challenging
due to the absence of a clear objec-
tive function to optimize
point clouds could be distinct from traditional methods. In represents 3D input data as point clouds to take advantage
comparison to CNN, transformer architectures have recently of sparsity and employs voxel-based convolution to produce
demonstrated promising accuracy on point cloud learning a contiguous memory access pattern. (DSPoint) extracted
benchmarks. The self-attention operator, which is at the heart local global features by simultaneously operating on voxels
of transformer networks, is invariant to the input elements’ and points, combining both local features and global geo-
permutation and cardinality. As a result, the transformer fam- metric architecture. A combination of projection and raw
ily of models is admirably adapted to point cloud processing. point-based approaches is being studied by some researchers.
Self-supervised representation learning on point cloud data In PointView-GCN [80], the network uses multi-level GCNs
has proven to be another promising solution. Self-supervised to record both the geometrical cues of an object and its
representation learning analyzes how to properly pre-train multiview relations, which hierarchically collect the shape
deep neural networks with unlabeled and raw input data. attributes of single-view point clouds. Table 7 summarizes
Instead of creating representations based on human-defined and compares different existing methods for 3D point cloud
annotations, self-supervised learning learns latent features understanding.
from unlabeled data. It is commonly accomplished by cre-
ating a pretext assignment to pre-train the model before 7.2 Improved sensor technology
fine-tuning it on subsequent tasks. Self-supervised learn-
ing has significantly enhanced computer vision by reducing Sensor technology has witnessed notable advancements, with
reliance on labeled data. The increasing volume of papers lidar and photogrammetry systems showing great promise.
on various methods published between 2015 and 2023 Lidar sensors, renowned for their precise distance measure-
underscores the growing research interest in unsupervised ment and object detection capabilities, have become more
networks. accurate, affordable, and accessible to diverse industries.
In addition, many researchers are currently interested in Multispectral and hyperspectral imaging techniques, com-
optimizing neural network training and model compression. plementing point cloud data with material and chemical
Reducing the parameters of a network can speed up training information, offer opportunities in archaeology, geology, and
while also allowing deep learning techniques to be used on forestry, enabling detailed analysis and conservation efforts.
devices with limited resources. These advanced algorithms Although sensors have improved, there is room for fur-
may also incorporate more sophisticated data fusion tech- ther enhancement. Higher accuracy can minimize errors and
niques to integrate and be benefited from different point cloud enhance point cloud quality, leading to more precise mod-
representation. To handle 3D input, previous research has eling and analysis. Fusion of data from multiple sensors
used either voxel-based or point-based NN models. However, provides a comprehensive understanding of environments,
both methods are computationally inefficient. With increas- improving accuracy and reducing errors. For instance, the
ing input resolution, voxel-based models’ memory use and combined use of lidar and photogrammetry systems captures
computation costs grow cubically. Rather than extracting fea- both geometric and textural information. Additionally, sens-
tures, up to 80% of the effort in point-based networks is spent ing techniques like multispectral and hyperspectral imaging
shaping the sparse input, which has poor memory localiza- provide insights into environmental composition and prop-
tion. Hence, recent research has focused on maximizing the erties, benefiting various applications.
benefits of both strategies while minimizing the drawbacks. The ongoing evolution of sensor technology will signif-
Liu et al. presented Point-Voxel CNN (PVCNN) [272], which icantly impact point cloud processing. The availability of
123
more accurate, affordable, and capable sensors facilitates semantic segmentation. Because of the potential for practical
the capture of high-resolution point cloud data, empower- applications such as autonomous driving, robot manipula-
ing researchers and engineers to make informed decisions tion, and augmented reality, point cloud understanding has
and develop advanced solutions. recently gained a lot of attention. Specific deep learning
frameworks are designed to match point clouds from several
7.3 Advancement in datasets scans of the same scene, and generative networks are adapted
to enhance the quality of point cloud data in terms of noise and
Point cloud datasets play a vital role in various fields, includ- missing points. Deep learning techniques that are correctly
ing autonomous vehicles, robotics, virtual reality, and 3D adapted have been found to be effective in addressing the
modeling. While advancements have been made in dataset unique challenges presented by point cloud data. A detailed
creation, further developments are needed to enhance their taxonomy is presented, accompanied by a performance eval-
quality and usefulness. uation of multiple approaches using widely utilized datasets.
Efficiency improvements in collecting and processing The benefits and drawbacks of various methodologies, as
point cloud data are key areas for advancement. Integrat- well as potential research directions, are also highlighted.
ing data from multiple sensors, such as LiDAR, cameras, We believe that our work stands as a confident and impact-
or radar, can provide more comprehensive environmental ful addition to the field, providing a valuable resource for
information and enable algorithms to handle complex sce- researchers and practitioners alike.
narios. Diverse datasets that encompass a broader range of
Acknowledgements This material is based upon work supported by
environments and scenarios would enhance the performance the National Science Foundation under Grant No. OIA-2148788.
and accuracy of point cloud processing algorithms. Manual
annotation of datasets is time-consuming, and developing Author Contributions Sushmita Sarker authored the main manuscript
automated annotation methods would expedite the process text. Gunner Stone generated the figures (Fig. 6). Prithul Sarker, Gunner
Stone, and Ryan Gorman contributed materials, and verified the accu-
and improve accuracy. Additionally, incorporating temporal racy of all information. The manuscript was collectively reviewed by
information, such as capturing data at different times or using all authors.
motion-capturing sensors, would enable algorithms to track
and predict environmental changes. Data availability No datasets were generated or analysed during the
current study.
Advancements in these areas would greatly improve the
practicality and applicability of point cloud datasets in var-
Declarations
ious industries. Ongoing development and improvement of
point cloud datasets are essential for advancing the field of Conflict of interest The authors declare no Conflict of interest.
point cloud processing and facilitating new applications.
7.4 Cloud computing

References
Cloud computing has significantly impacted point cloud pro-
cessing, enhancing its accessibility and affordability. In the 1. Liang, Z., Guo, Y., Feng, Y., Chen, W., Qiao, L., Zhou, L., Zhang,
future, advanced cloud-based processing tools are antici- J., Liu, H.: Stereo matching using multi-level cost volume and
pated to incorporate real-time capabilities and distributed multi-scale feature constancy. IEEE Trans. Pattern Anal. Mach.
Intell. 43(1), 300–315 (2019)
computing, enabling efficient handling of large data vol-
2. Guo, Y., Sohel, F., Bennamoun, M., Lu, M., Wan, J.: Rotational
umes in real-time. Real-time processing would facilitate projection statistics for 3d local surface description and object
quicker decision-making for critical applications like disas- recognition. Int. J. Comput. Vis. 105(1), 63–86 (2013)
ter response and autonomous systems. Evolving cloud-based 3. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: deep learning on
point sets for 3d classification and segmentation. In: Proceedings
tools, with specialization and advancements, will unlock new
of the IEEE Conference on Computer Vision and Pattern Recog-
applications and use cases for point cloud data. nition, pp. 652–660 (2017)
Advances in technology, algorithms, and applications will 4. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: deep hierarchical
shape the future of point cloud processing, fostering innova- feature learning on point sets in a metric space. Adv. Neural Inf.
Process. Syst. 30 (2017)
tion and the emergence of novel applications and use cases.
5. Liu, Z., Hu, H., Cao, Y., Zhang, Z., Tong, X.: A closer look at
local aggregation operators in point cloud analysis. In: European
Conference on Computer Vision, pp. 326–342. Springer (2020)
8 Conclusions 6. Hu, Q., Yang, B., Xie, L., Rosa, S., Guo, Y., Wang, Z., Trigoni,
N., Markham, A.: Randla-net: efficient semantic segmentation
of large-scale point clouds. In: Proceedings of the IEEE/CVF
This paper presents current state-of-the-art methodologies Conference on Computer Vision and Pattern Recognition, pp.
and recent advancements in 3D shape classification and 11108–11117 (2020)
123
7. Yan, X., Zheng, C., Li, Z., Wang, S., Cui, S.: Pointasnl: robust tion in the era of deep neural networks: a survey. IEEE Trans.
point clouds processing using nonlocal neural networks with Image Process. 29, 2947–2962 (2019)
adaptive sampling. In: Proceedings of the IEEE/CVF Confer- 25. Guo, Y., Wang, H., Hu, Q., Liu, H., Liu, L., Bennamoun, M.: Deep
ence on Computer Vision and Pattern Recognition, pp. 5589–5598 learning for 3D point clouds: a survey. IEEE Trans. Pattern Anal.
(2020) Mach. Intell. 43(12), 4338–4364 (2020)
8. Bytyqi, Q., Wolpert, N., Schömer, E.: Local-area-learning net- 26. Zhang, J.: The mcgill shape benchmark (2005). http://www.cim.
work: meaningful local areas for efficient point cloud analysis. mcgill.ca/shape/benchMark/
arXiv preprint arXiv:2006.07226 (2020) 27. Serna, A., Marcotegui, B., Goulette, F., Deschaud, J.-E.: Paris-
9. Xu, Q., Sun, X., Wu, C.-Y., Wang, P., Neumann, U.: Grid-gcn rue-madame database: a 3d mobile laser scanner dataset for
for fast and scalable point cloud learning. In: Proceedings of the benchmarking urban detection, segmentation and classification
IEEE/CVF Conference on Computer Vision and Pattern Recog- methods. In: 4th International Conference on Pattern Recogni-
nition, pp. 5661–5670 (2020) tion, Applications and Methods ICPRAM 2014 (2014)
10. Hu, Z., Zhen, M., Bai, X., Fu, H., Tai, C.-l.: Jsenet: joint semantic 28. Vallet, B., Brédif, M., Serna, A., Marcotegui, B., Paparoditis,
segmentation and edge detection network for 3d point clouds. In: N.: Terramobilita/iQmulus urban point cloud analysis benchmark.
European Conference on Computer Vision, pp. 222–239. Springer Comput. Graph. 49, 126–133 (2015)
(2020) 29. Choi, S., Zhou, Q.-Y., Miller, S., Koltun, V.: A large dataset of
11. Lin, C., Li, C., Liu, Y., Chen, N., Choi, Y.-K., Wang, W.: object scans. arXiv:1602.02481 (2016)
Point2skeleton: learning skeletal representations from point 30. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T.,
clouds. In: Proceedings of the IEEE/CVF Conference on Com- Nießner, M.: Scannet: richly-annotated 3d reconstructions of
puter Vision and Pattern Recognition, pp. 4277–4286 (2021) indoor scenes. In: Proceedings of the IEEE Conference on Com-
12. Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, puter Vision and Pattern Recognition, pp. 5828–5839 (2017)
J.: 3d shapenets: a deep representation for volumetric shapes. In: 31. Roynard, X., Deschaud, J.-E., Goulette, F.: Paris-Lille-3D: a large
Proceedings of the IEEE Conference on Computer Vision and and high-quality ground-truth urban point cloud dataset for auto-
Pattern Recognition, pp. 1912–1920 (2015) matic segmentation and classification. Int. J. Robot. Res. 37(6),
13. Uy, M.A., Pham, Q.-H., Hua, B.-S., Nguyen, T., Yeung, S.-K.: 545–557 (2018)
Revisiting point cloud classification: a new benchmark dataset 32. Sun, J., Zhang, Q., Kailkhura, B., Yu, Z., Xiao, C., Mao, Z.M.:
and classification model on real-world data. In: Proceedings of Benchmarking robustness of 3d point cloud recognition against
the IEEE/CVF International Conference on Computer Vision, pp. common corruptions. arXiv preprint arXiv:2201.12296 (2022)
1588–1597 (2019) 33. Nygren, P., Jasinski, M.: A comparative study of segmentation
14. Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, and classification methods for 3d point clouds. Master’s thesis,
Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: University of Gothenburg (2016)
Shapenet: an information-rich 3d model repository. arXiv preprint 34. Johnson, A.E., Hebert, M.: Using spin images for efficient object
arXiv:1512.03012 (2015) recognition in cluttered 3D scenes. IEEE Trans. Pattern Anal.
15. Armeni, I., Sener, O., Zamir, A.R., Jiang, H., Brilakis, I., Fischer, Mach. Intell. 21(5), 433–449 (1999)
M., Savarese, S.: 3d semantic parsing of large-scale indoor spaces. 35. Chen, D.-Y., Tian, X.-P., Shen, Y.-T., Ouhyoung, M.: On visual
In: Proceedings of the IEEE Conference on Computer Vision and similarity based 3d model retrieval. In: Computer Graphics
Pattern Recognition, pp. 1534–1543 (2016) Forum, vol. 22, pp. 223–232. Wiley (2003)
16. Yang, X., Xia, D., Kin, T., Igarashi, T.: Intra: 3d intracranial 36. Khatib, O., Kumar, V., Sukhatme, G.: Experimental Robotics: The
aneurysm dataset for deep learning. In: The IEEE Conference 12th International Symposium on Experimental Robotics, vol. 79.
on Computer Vision and Pattern Recognition (CVPR) (2020) Springer (2013)
17. Hackel, T., Savinov, N., Ladicky, L., Wegner, J.D., Schindler, K., 37. Endres, F., Hess, J., Sturm, J., Cremers, D., Burgard, W.: 3-D
Pollefeys, M.: Semantic3d. net: a new large-scale point cloud clas- mapping with an RGB-D camera. IEEE Trans. Robot. 30(1), 177–
sification benchmark. arXiv preprint arXiv:1704.03847 (2017) 187 (2013)
18. Pan, Y., Gao, B., Mei, J., Geng, S., Li, C., Zhao, H.: Semanticposs: 38. Behley, J., Garbade, M., Milioto, A., Quenzel, J., Behnke, S.,
A point cloud dataset with large quantity of dynamic instances. Stachniss, C., Gall, J.: Semantickitti: a dataset for semantic
In: 2020 IEEE Intelligent Vehicles Symposium (IV), pp. 687–693. scene understanding of lidar sequences. In: Proceedings of the
IEEE (2020) IEEE/CVF International Conference on Computer Vision, pp.
19. De Deuge, M., Quadros, A., Hung, C., Douillard, B.: Unsuper- 9297–9307 (2019)
vised feature learning for classification of outdoor 3d scans. In: 39. Rottensteiner, F., Sohn, G., Jung, J., Gerke, M., Baillard, C.,
Australasian Conference on Robitics and Automation, vol. 2, p. Benitez, S., Breitkopf, U.: The isprs benchmark on urban object
1. University of New South Wales Kensington, Australia (2013) classification and 3d building reconstruction. ISPRS Annals of
20. Ioannidou, A., Chatzilari, E., Nikolopoulos, S., Kompatsiaris, I.: the Photogrammetry, Remote Sensing and Spatial Information
Deep learning advances in computer vision with 3D data: a survey. Sciences I-3 (2012), Nr. 1 1(1), 293–298 (2012)
ACM Comput. Surv. CSUR 50(2), 1–38 (2017) 40. Varney, N., Asari, V.K., Graehling, Q.: Dales: a large-scale aerial
21. Ahmed, E., Saint, A., Shabayek, A.E.R., Cherenkova, K., Das, lidar data set for semantic segmentation. In: Proceedings of the
R., Gusev, G., Aouada, D., Ottersten, B.: A survey on deep learn- IEEE/CVF Conference on Computer Vision and Pattern Recog-
ing advances on different 3d data representations. arXiv preprint nition Workshops, pp. 186–187 (2020)
arXiv:1808.01462 (2018) 41. Munoz, D., Bagnell, J.A., Vandapel, N., Hebert, M.: Contextual
22. Zhang, J., Zhao, X., Chen, Z., Lu, Z.: A review of deep learning- classification with functional max-margin Markov networks. In:
based semantic segmentation for point cloud. IEEE Access 7, 2009 IEEE Conference on Computer Vision and Pattern Recog-
179118–179133 (2019) nition, pp. 975–982. IEEE (2009)
23. Xie, Y., Tian, J., Zhu, X.X.: Linking points with labels in 3D: 42. Zolanvari, S., Ruano, S., Rana, A., Cummins, A., Silva, R.E.,
a review of point cloud semantic segmentation. IEEE Geosci. Rahbar, M., Smolic, A.: Dublincity: annotated lidar point cloud
Remote Sens. Mag. 8(4), 38–59 (2020) and its applications. arXiv preprint arXiv:1909.03613 (2019)
24. Rahman, M.M., Tan, Y., Xue, J., Lu, K.: Notice of violation of 43. Hurl, B., Czarnecki, K., Waslander, S.: Precise synthetic image
IEEE publication principles: recent advances in 3D object detec- and lidar (presil) dataset for autonomous vehicle perception. In:
123
2019 IEEE Intelligent Vehicles Symposium (IV), pp. 2522–2529. ceedings of the IEEE International Conference on Computer
IEEE (2019) Vision, pp. 945–953 (2015)
44. Hu, Q., Yang, B., Khalid, S., Xiao, W., Trigoni, N., Markham, A.: 61. Feng, Y., Zhang, Z., Zhao, X., Ji, R., Gao, Y.: GVCNN: group-
Towards semantic segmentation of urban-scale 3d point clouds: view convolutional neural networks for 3d shape recognition. In:
A dataset, benchmarks and challenges. In: Proceedings of the Proceedings of the IEEE Conference on Computer Vision and
IEEE/CVF Conference on Computer Vision and Pattern Recog- Pattern Recognition, pp. 264–272 (2018)
nition (2021) 62. Yu, T., Meng, J., Yuan, J.: Multi-view harmonized bilinear net-
45. Can, G., Mantegazza, D., Abbate, G., Chappuis, S., Giusti, A.: work for 3d object recognition. In: Proceedings of the IEEE
Semantic segmentation on swiss3dcities: a benchmark study on Conference on Computer Vision and Pattern Recognition, pp.
aerial photogrammetric 3D pointcloud dataset. Pattern Recognit. 186–194 (2018)
Lett. 150, 108–114 (2021) 63. Yang, Z., Wang, L.: Learning relationships for multi-view 3d
46. Ye, Z., Xu, Y., Huang, R., Tong, X., Li, X., Liu, X., Luan, K., object recognition. In: Proceedings of the IEEE/CVF International
Hoegner, L., Stilla, U.: LASDU: a large-scale aerial lidar dataset Conference on Computer Vision, pp. 7505–7514 (2019)
for semantic labeling in dense urban areas. ISPRS Int. J. Geo Inf. 64. Wei, X., Yu, R., Sun, J.: View-gcn: View-based graph convo-
9(7), 450 (2020) lutional network for 3d shape analysis. In: Proceedings of the
47. Li, X., Li, C., Tong, Z., Lim, A., Yuan, J., Wu, Y., Tang, J., Huang, IEEE/CVF Conference on Computer Vision and Pattern Recog-
R.: Campus3d: a photogrammetry point cloud benchmark for hier- nition, pp. 1850–1859 (2020)
archical understanding of outdoor scene. In: Proceedings of the 65. Wang, C., Pelillo, M., Siddiqi, K.: Dominant set clustering and
28th ACM International Conference on Multimedia, pp. 238–246 pooling for multi-view 3d object recognition. arXiv preprint
(2020) arXiv:1906.01592 (2019)
48. Tan, W., Qin, N., Ma, L., Li, Y., Du, J., Cai, G., Yang, K., Li, J.: 66. Ma, C., Guo, Y., Yang, J., An, W.: Learning multi-view represen-
Toronto-3d: a large-scale mobile lidar dataset for semantic seg- tation with LSTM for 3-D shape recognition and retrieval. IEEE
mentation of urban roadways. In: Proceedings of the IEEE/CVF Trans. Multimedia 21(5), 1169–1182 (2018)
Conference on Computer Vision and Pattern Recognition Work- 67. Hamdi, A., Giancola, S., Ghanem, B.: Mvtn: multi-view trans-
shops, pp. 202–203 (2020) formation network for 3d shape recognition. In: Proceedings of
49. Jiang, P., Osteen, P., Wigness, M., Saripalli, S.: RELLIS-3D the IEEE/CVF International Conference on Computer Vision, pp.
Dataset: Data, Benchmarks and Analysis (2020) 1–11 (2021)
50. Bos, J.P., Chopp, D., Kurup, A., Spike, N.: Autonomy at the end 68. Wang, W., Wang, T., Cai, Y.: Multi-view attention-convolution
of the Earth: an inclement weather autonomous driving data set. pooling network for 3d point cloud classification. Appl. Intell.
In: Autonomous Systems: Sensors, Processing, and Security for 1–12 (2021)
Vehicles and Infrastructure 2020, vol. 11415, pp. 36–48. SPIE 69. Gao, S.-H., Cheng, M.-M., Zhao, K., Zhang, X.-Y., Yang, M.-H.,
(2020). International Society for Optics and Photonics Torr, P.: Res2net: a new multi-scale backbone architecture. IEEE
51. Kölle, M., Laupheimer, D., Schmohl, S., Haala, N., Rottensteiner, Trans. Pattern Anal. Mach. Intell. 43(2), 652–662 (2019)
F., Wegner, J.D., Ledoux, H.: The hessigheim 3d (h3d) bench- 70. Turk, G.: The Stanford bunny (2000). Accessed 14 May 2007
mark on semantic segmentation of high-resolution 3d point clouds 71. Ghadai, S., Yeow Lee, X., Balu, A., Sarkar, S., Krishnamurthy, A.:
and textured meshes from uav lidar and multi-view-stereo. ISPRS Multi-level 3d CNN for learning multi-scale spatial features. In:
Open J. Photogramm. Remote Sens. 1, 100001 (2021) Proceedings of the IEEE/CVF Conference on Computer Vision
52. Xiao, A., Huang, J., Guan, D., Zhan, F., Lu, S.: Transfer learning and Pattern Recognition Workshops (2019)
from synthetic to real lidar point cloud for semantic segmentation. 72. Cheng, R., Razani, R., Taghavi, E., Li, E., Liu, B.: 2-s3net:
In: Proceedings of the AAAI Conference on Artificial Intelli- Attentive feature fusion with adaptive feature selection for sparse
gence, vol. 36, pp. 2795–2803 (2022) semantic segmentation network. In: Proceedings of the IEEE/CVF
53. Chen, M., Hu, Q., Hugues, T., Feng, A., Hou, Y., McCul- Conference on Computer Vision and Pattern Recognition, pp.
lough, K., Soibelman, L.: Stpls3d: a large-scale synthetic and 12547–12556 (2021)
real aerial photogrammetry 3d point cloud dataset. arXiv preprint 73. Brock, A., Lim, T., Ritchie, J.M., Weston, N.: Generative and dis-
arXiv:2203.09065 (2022) criminative voxel modeling with convolutional neural networks.
54. Feng, Y., Feng, Y., You, H., Zhao, X., Gao, Y.: Meshnet: Mesh arXiv preprint arXiv:1608.04236 (2016)
neural network for 3d shape representation. In: Proceedings of the 74. Le, T., Duan, Y.: Pointgrid: a deep network for 3d shape under-
AAAI Conference on Artificial Intelligence, vol. 33, pp. 8279– standing. In: Proceedings of the IEEE Conference on Computer
8286 (2019) Vision and Pattern Recognition, pp. 9204–9214 (2018)
55. Lahav, A., Tal, A.: Meshwalker: deep mesh understanding by ran- 75. Maturana, D., Scherer, S.: Voxnet: a 3d convolutional neural
dom walks. ACM Trans. Graph. TOG 39(6), 1–13 (2020) network for real-time object recognition. In: 2015 IEEE/RSJ Inter-
56. Yavartanoo, M., Hung, S.-H., Neshatavar, R., Zhang, Y., Lee, national Conference on Intelligent Robots and Systems (IROS),
K.M.: Polynet: polynomial neural network for 3d shape recog- pp. 922–928. IEEE (2015)
nition with polyshape representation. In: 2021 International 76. Ben-Shabat, Y., Lindenbaum, M., Fischer, A.: 3DMFV: three-
Conference on 3D Vision (3DV), pp. 1014–1023. IEEE (2021) dimensional point cloud classification in real-time using con-
57. Muzahid, A., Wan, W., Sohel, F., Wu, L., Hou, L.: Curvenet: volutional neural networks. IEEE Robot. Autom. Lett. 3(4),
curvature-based multitask learning deep networks for 3d object 3145–3152 (2018)
recognition. IEEE/CAA J. Autom. Sin. 8(6), 1177–1187 (2020) 77. You, H., Feng, Y., Ji, R., Gao, Y.: Pvnet: a joint convolutional
58. Ran, H., Liu, J., Wang, C.: Surface representation for point clouds. network of point cloud and multi-view for 3d shape recognition.
In: Proceedings of the IEEE/CVF Conference on Computer Vision In: Proceedings of the 26th ACM International Conference on
and Pattern Recognition (CVPR), pp. 18942–18952 (2022) Multimedia, pp. 1310–1318 (2018)
59. Foorginejad, A., Khalili, K.: Umbrella curvature: a new curvature 78. You, H., Feng, Y., Zhao, X., Zou, C., Ji, R., Gao, Y.: Pvrnet:
estimation method for point clouds. Procedia Technol. 12, 347– point-view relation neural network for 3d shape recognition. In:
352 (2014) Proceedings of the AAAI Conference on Artificial Intelligence,
60. Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view vol. 33, pp. 9119–9126 (2019)
convolutional neural networks for 3d shape recognition. In: Pro-
123
79. Zhang, R., Zeng, Z., Guo, Z., Gao, X., Fu, K., Shi, J.: Dspoint: ference on Computer Vision and Pattern Recognition, pp. 949–958
dual-scale point cloud recognition with high-frequency fusion. (2019)
arXiv preprint arXiv:2111.10332 (2021) 98. Yu, J., Zhang, C., Wang, H., Zhang, D., Song, Y., Xiang, T., Liu,
80. Mohammadi, S.S., Wang, Y., Del Bue, A.: Pointview-gcn: 3d D., Cai, W.: 3d medical point transformer: Introducing convolu-
shape classification with multi-view point clouds. In: 2021 IEEE tion to attention networks for medical point cloud analysis. arXiv
International Conference on Image Processing (ICIP), pp. 3103– preprint arXiv:2112.04863 (2021)
3107. IEEE (2021) 99. Aoki, Y., Goforth, H., Srivatsan, R.A., Lucey, S.: Pointnetlk:
81. Zhang, C., Wan, H., Shen, X., Wu, Z.: Pvt: point-voxel transformer robust & efficient point cloud registration using pointnet. In: Pro-
for point cloud learning. arXiv preprint arXiv:2108.06076 (2021) ceedings of the IEEE/CVF Conference on Computer Vision and
82. Yan, X., Zhan, H., Zheng, C., Gao, J., Zhang, R., Cui, S., Li, Z.: Pattern Recognition, pp. 7163–7172 (2019)
Let images give you more: point cloud cross-modal training for 100. Joseph-Rivlin, M., Zvirin, A., Kimmel, R.: Momen (e) t: flavor
shape analysis. arXiv preprint arXiv:2210.04208 (2022) the moments in learning to classify shapes. In: Proceedings of the
83. Yang, Z., Jiang, L., Sun, Y., Schiele, B., Jia, J.: A unified query- IEEE/CVF International Conference on Computer Vision Work-
based paradigm for point cloud understanding. In: Proceedings shops (2019)
of the IEEE/CVF Conference on Computer Vision and Pattern 101. Sun, X., Lian, Z., Xiao, J.: Srinet: learning strictly rotation-
Recognition, pp. 8541–8551 (2022) invariant representations for point cloud classification and seg-
84. Sinha, A., Bai, J., Ramani, K.: Deep learning 3d shape surfaces mentation. In: Proceedings of the 27th ACM International Con-
using geometry images. In: European Conference on Computer ference on Multimedia, pp. 980–988 (2019)
Vision, pp. 223–240. Springer (2016) 102. Lin, H., Xiao, Z., Tan, Y., Chao, H., Ding, S.: Justlookup: one
85. Li, S., Luo, Z., Zhen, M., Yao, Y., Shen, T., Fang, T., Quan, L.: millisecond deep feature extraction for point clouds by lookup
Cross-atlas convolution for parameterization invariant learning on tables. In: 2019 IEEE International Conference on Multimedia
textured mesh surface. In: Proceedings of the IEEE/CVF Confer- and Expo (ICME), pp. 326–331. IEEE (2019)
ence on Computer Vision and Pattern Recognition, pp. 6143–6152 103. Ran, H., Zhuo, W., Liu, J., Lu, L.: Learning inner-group relations
(2019) on point clouds. In: Proceedings of the IEEE/CVF International
86. Haim, N., Segol, N., Ben-Hamu, H., Maron, H., Lipman, Y.: Conference on Computer Vision, pp. 15477–15487 (2021)
Surface networks via general covers. In: Proceedings of the 104. Ma, X., Qin, C., You, H., Ran, H., Fu, Y.: Rethinking network
IEEE/CVF International Conference on Computer Vision, pp. design and local geometry in point cloud: a simple residual mlp
632–641 (2019) framework. arXiv preprint arXiv:2202.07123 (2022)
87. Goyal, A., Law, H., Liu, B., Newell, A., Deng, J.: Revisiting point 105. Qian, G., Li, Y., Peng, H., Mai, J., Hammoud, H., Elhoseiny,
cloud shape classification with a simple and effective baseline. In: M., Ghanem, B.: Pointnext: revisiting pointnet++ with improved
International Conference on Machine Learning, pp. 3809–3820 training and scaling strategies. arXiv:2206.04670 (2022)
(2021). PMLR 106. Wijaya, K.T., Paek, D.-H., Kong, S.-H.: Advanced feature learn-
88. Li, Y., Pirk, S., Su, H., Qi, C.R., Guibas, L.J.: Fpnn: field probing ing on point clouds using multi-resolution features and learnable
neural networks for 3d data. Adv. Neural Inf. Process. Syst. 29 pooling. arXiv preprint arXiv:2205.09962 (2022)
(2016) 107. Song, X., Wang, P., Zhou, D., Zhu, R., Guan, C., Dai, Y., Su, H.,
89. Ma, C., An, W., Lei, Y., Guo, Y.: Bv-cnns: binary volumetric Li, H., Yang, R.: Apollocar3d: a large 3d car instance understand-
convolutional networks for 3d object recognition. In: BMVC, vol. ing benchmark for autonomous driving. In: Proceedings of the
1, p. 4 (2017) IEEE/CVF Conference on Computer Vision and Pattern Recog-
90. Zhi, S., Liu, Y., Li, X., Guo, Y.: Lightnet: a lightweight 3d con- nition, pp. 5452–5462 (2019)
volutional neural network for real-time 3d object recognition. In: 108. Hua, B.-S., Tran, M.-K., Yeung, S.-K.: Pointwise convolutional
3DOR@ Eurographics (2017) neural networks. In: Proceedings of the IEEE Conference on Com-
91. Kumawat, S., Raman, S.: Lp-3dcnn: unveiling local phase in 3d puter Vision and Pattern Recognition, pp. 984–993 (2018)
convolutional neural networks. In: Proceedings of the IEEE/CVF 109. Mao, J., Wang, X., Li, H.: Interpolated convolutional networks for
Conference on Computer Vision and Pattern Recognition, pp. 3d point cloud understanding. In: Proceedings of the IEEE/CVF
4903–4912 (2019) International Conference on Computer Vision (ICCV) (2019)
92. Muzahid, A., Wan, W., Hou, L.: A new volumetric cnn for 3d 110. Zhang, Z., Hua, B.-S., Rosen, D.W., Yeung, S.-K.: Rotation
object classification based on joint multiscale feature and subvol- invariant convolutions for 3d point clouds deep learning. In:
ume supervised learning approaches. Comput. Intell. Neurosci. 2019 International Conference on 3d Vision (3DV), pp. 204–213
2020 (2020) (2019). IEEE
93. Hegde, V., Zadeh, R.: Fusionnet: 3d object classification using 111. Zhang, Z., Hua, B.-S., Yeung, S.-K.: Shellnet: efficient point cloud
multiple data representations. arXiv preprint arXiv:1607.05695 convolutional neural networks using concentric shells statistics.
(2016) In: Proceedings of the IEEE/CVF International Conference on
94. Hoang, L., Lee, S.-H., Lee, E.-J., Kwon, K.-R.: GSV-NET: a Computer Vision (ICCV) (2019)
multi-modal deep learning network for 3D point cloud classifi- 112. Peyghambarzadeh, S.M.M., Azizmalayeri, F., Khotanlou, H.,
cation. Appl. Sci. 12(1), 483 (2022) Salarpour, A.: Point-PlaneNet: plane kernel based convolutional
95. Qi, C.R., Su, H., Nießner, M., Dai, A., Yan, M., Guibas, L.J.: neural network for point clouds analysis. Digital Signal Process.
Volumetric and multi-view CNNs for object classification on 3d 98, 102633 (2020)
data. In: Proceedings of the IEEE Conference on Computer Vision 113. Wiersma, R., Nasikun, A., Eisemann, E., Hildebrandt, K.: Delta-
and Pattern Recognition, pp. 5648–5656 (2016) conv: anisotropic point cloud learning with exterior calculus.
96. Ben-Shabat, Y., Lindenbaum, M., Fischer, A.: 3d point cloud arXiv preprint arXiv:2111.08799 (2021)
classification and segmentation using 3d modified fisher vector 114. Camuffo, E., Mari, D., Milani, S.: Recent advancements in learn-
representation for convolutional neural networks. arXiv preprint ing algorithms for point clouds: an updated overview. Sensors
arXiv:1711.08241 (2017) 22(4), 1357 (2022)
97. Duan, Y., Zheng, Y., Lu, J., Zhou, J., Tian, Q.: Structural relational 115. Liu, Y., Fan, B., Xiang, S., Pan, C.: Relation-shape convolutional
reasoning of point clouds. In: Proceedings of the IEEE/CVF Con- neural network for point cloud analysis. In: Proceedings of the
123
IEEE/CVF Conference on Computer Vision and Pattern Recog- 134. Liu, J., Ni, B., Li, C., Yang, J., Tian, Q.: Dynamic points agglom-
nition, pp. 8895–8904 (2019) eration for hierarchical point sets learning. In: Proceedings of
116. Thomas, H., Qi, C.R., Deschaud, J.-E., Marcotegui, B., Goulette, the IEEE/CVF International Conference on Computer Vision, pp.
F., Guibas, L.J.: Kpconv: flexible and deformable convolution for 7546–7555 (2019)
point clouds. In: Proceedings of the IEEE/CVF International Con- 135. Shen, Y., Feng, C., Yang, Y., Tian, D.: Mining point cloud local
ference on Computer Vision, pp. 6411–6420 (2019) structures by kernel correlation and graph pooling. In: Proceed-
117. Liu, Y., Fan, B., Meng, G., Lu, J., Xiang, S., Pan, C.: Dense- ings of the IEEE Conference on Computer Vision and Pattern
point: learning densely contextual representation for efficient Recognition, pp. 4548–4557 (2018)
point cloud processing. In: Proceedings of the IEEE/CVF Inter- 136. Te, G., Hu, W., Zheng, A., Guo, Z.: Rgcnn: Regularized graph cnn
national Conference on Computer Vision (ICCV) (2019) for point cloud segmentation. In: Proceedings of the 26th ACM
118. Boulch, A.: ConvPoint: continuous convolutions for point cloud International Conference on Multimedia, pp. 746–754 (2018)
processing. Comput. Graph. 88, 24–34 (2020) 137. Zhang, Y., Rabbat, M.: A graph-cnn for 3d point cloud classi-
119. Wu, W., Qi, Z., Fuxin, L.: Pointconv: deep convolutional networks fication. In: 2018 IEEE International Conference on Acoustics,
on 3d point clouds. In: Proceedings of the IEEE/CVF Confer- Speech and Signal Processing (ICASSP), pp. 6279–6283 (2018).
ence on Computer Vision and Pattern Recognition, pp. 9621–9630 IEEE
(2019) 138. Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdi-
120. Xu, Y., Fan, T., Xu, M., Zeng, L., Qiao, Y.: Spidercnn: deep nov, R.R., Smola, A.J.: Deep sets. Adv. Neural Inf. Process. Syst.
learning on point sets with parameterized convolutional filters. 30 (2017)
In: Proceedings of the European Conference on Computer Vision 139. Dang, J., Yang, J.: Hpgcnn: hierarchical parallel group convolu-
(ECCV), pp. 87–102 (2018) tional neural networks for point clouds processing. In: Proceed-
121. Atzmon, M., Maron, H., Lipman, Y.: Point convolutional ings of the Asian Conference on Computer Vision (ACCV) (2020)
neural networks by extension operators. arXiv preprint 140. Qian, G., Hammoud, H., Li, G., Thabet, A., Ghanem, B.:
arXiv:1803.10091 (2018) ASSANet: an anisotropic separable set abstraction for efficient
122. Poulenard, A., Rakotosaona, M.-J., Ponty, Y., Ovsjanikov, M.: point cloud representation learning. Adv. Neural Inf. Process.
Effective rotation-invariant point cnn with spherical harmonics Syst. 34, 28119–28130 (2021)
kernels. In: 2019 International Conference on 3D Vision (3DV), 141. Montanaro, A., Valsesia, D., Magli, E.: Rethinking the composi-
pp. 47–56 (2019). IEEE tionality of point clouds through regularization in the hyperbolic
123. Lei, H., Akhtar, N., Mian, A.: Octree guided cnn with spherical space. arXiv preprint arXiv:2209.10318 (2022)
kernels for 3d point clouds. In: Proceedings of the IEEE/CVF 142. Xie, S., Liu, S., Chen, Z., Tu, Z.: Attentional shapecontextnet
Conference on Computer Vision and Pattern Recognition, pp. for point cloud recognition. In: Proceedings of the IEEE Confer-
9631–9640 (2019) ence on Computer Vision and Pattern Recognition, pp. 4606–4615
124. Riegler, G., Osman Ulusoy, A., Geiger, A.: Octnet: learning deep (2018)
3d representations at high resolutions. In: Proceedings of the IEEE 143. Esteves, C., Allen-Blanchette, C., Makadia, A., Daniilidis, K.:
Conference on Computer Vision and Pattern Recognition, pp. Learning so (3) equivariant representations with spherical cnns.
3577–3586 (2017) In: Proceedings of the European Conference on Computer Vision
125. Klokov, R., Lempitsky, V.: Escape from cells: deep kd-networks (ECCV), pp. 52–68 (2018)
for the recognition of 3d point cloud models. In: Proceedings of 144. Hermosilla, P., Ritschel, T., Vázquez, P.-P., Vinacua, À., Ropinski,
the IEEE International Conference on Computer Vision, pp. 863– T.: Monte Carlo convolution for learning on non-uniformly sam-
872 (2017) pled point clouds. ACM Trans. Graph. TOG 37(6), 1–12 (2018)
126. Zeng, W., Gevers, T.: 3dcontextnet: Kd tree guided hierarchical 145. Li, Y., Bu, R., Sun, M., Wu, W., Di, X., Chen, B.: Pointcnn: con-
learning of point clouds using local and global contextual cues. volution on x-transformed points. Adv. Neural Inf. Process. Syst.
In: Proceedings of the European Conference on Computer Vision 31 (2018)
(ECCV) Workshops, pp. 0–0 (2018) 146. Groh, F., Wieschollek, P., Lensch, H.: Flex-convolution (million-
127. Li, J., Chen, B.M., Lee, G.H.: So-net: Self-organizing network for scale point-cloud learning beyond grid-worlds). arXiv preprint
point cloud analysis. In: Proceedings of the IEEE Conference on arXiv:1803.07289 (2018)
Computer Vision and Pattern Recognition, pp. 9397–9406 (2018) 147. Lan, S., Yu, R., Yu, G., Davis, L.S.: Modeling local geometric
128. Qiu, S., Anwar, S., Barnes, N.: Dense-resolution network for structure of 3d point clouds using geo-cnn. In: Proceedings of the
point cloud classification and segmentation. In: Proceedings of IEEE/cvf Conference on Computer Vision and Pattern Recogni-
the IEEE/CVF Winter Conference on Applications of Computer tion, pp. 998–1008 (2019)
Vision, pp. 3813–3822 (2021) 148. Komarichev, A., Zhong, Z., Hua, J.: A-cnn: annularly convolu-
129. Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral net- tional neural networks on point clouds. In: Proceedings of the
works and locally connected networks on graphs. arXiv preprint IEEE/CVF Conference on Computer Vision and Pattern Recog-
arXiv:1312.6203 (2013) nition, pp. 7421–7430 (2019)
130. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional 149. Rao, Y., Lu, J., Zhou, J.: Spherical fractal convolutional neu-
neural networks on graphs with fast localized spectral filtering. ral networks for point cloud recognition. In: Proceedings of the
Adv. Neural Inf. Process. Syst. 29 (2016) IEEE/CVF Conference on Computer Vision and Pattern Recog-
131. Zhao, H., Jiang, L., Fu, C.-W., Jia, J.: Pointweb: Enhancing local nition (CVPR) (2019)
neighborhood features for point cloud processing. In: Proceedings 150. Simonovsky, M., Komodakis, N.: Dynamic edge-conditioned fil-
of the IEEE/CVF Conference on Computer Vision and Pattern ters in convolutional neural networks on graphs. In: Proceedings
Recognition, pp. 5565–5573 (2019) of the IEEE Conference on Computer Vision and Pattern Recog-
132. Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., nition, pp. 3693–3702 (2017)
Solomon, J.M.: Dynamic graph CNN for learning on point clouds. 151. Wang, C., Samari, B., Siddiqi, K.: Local spectral graph convolu-
ACM Trans. Graph. TOG 38(5), 1–12 (2019) tion for point set feature learning. In: Proceedings of the European
133. Zhang, K., Hao, M., Wang, J., Silva, C.W., Fu, C.: Linked dynamic Conference on Computer Vision (ECCV), pp. 52–66 (2018)
graph cnn: Learning on point cloud via linking hierarchical fea-
tures. arXiv preprint arXiv:1904.10014 (2019)
123
152. Pan, G., Wang, J., Ying, R., Liu, P.: 3dti-net: learn inner transform ence on Computer Vision and Pattern Recognition, pp. 5333–5343
invariant 3d geometry features using dynamic gcn. arXiv preprint (2023)
arXiv:1812.06254 (2018) 171. Chen, W., Han, X., Li, G., Chen, C., Xing, J., Zhao, Y., Li, H.: Deep
153. Yang, D., Gao, W.: Pointmanifold: using manifold learning rbfnet: point cloud feature learning using radial basis functions.
for point cloud classification. arXiv preprint arXiv:2010.07215 arXiv preprint arXiv:1812.04302 (2018)
(2020) 172. Zhang, M., You, H., Kadam, P., Liu, S., Kuo, C.-C.J.: Pointhop:
154. Lin, Z.-H., Huang, S.-Y., Wang, Y.-C.F.: Convolution in the cloud: an explainable machine learning method for point cloud classifi-
Learning deformable kernels in 3d graph convolution networks for cation. IEEE Trans. Multimed. 22(7), 1744–1755 (2020)
point cloud analysis. In: Proceedings of the IEEE/CVF Confer- 173. Liu, K., Gao, Z., Lin, F., Chen, B.M.: Fg-net: fast large-scale
ence on Computer Vision and Pattern Recognition, pp. 1800–1809 lidar point clouds understanding network leveraging correlated
(2020) feature mining and geometric-aware modelling. arXiv preprint
155. Xu, M., Ding, R., Zhao, H., Qi, X.: Paconv: Position adaptive arXiv:2012.09439 (2020)
convolution with dynamic kernel assembling on point clouds. In: 174. Zhang, M., Wang, Y., Kadam, P., Liu, S., Kuo, C.-C.J.:
Proceedings of the IEEE/CVF Conference on Computer Vision Pointhop++: a lightweight learning model on point sets for 3d
and Pattern Recognition, pp. 3173–3182 (2021) classification. In: 2020 IEEE International Conference on Image
156. Xiang, T., Zhang, C., Song, Y., Yu, J., Cai, W.: Walk in the cloud: Processing (ICIP), pp. 3319–3323. IEEE (2020)
Learning curves for point clouds shape analysis. In: Proceedings 175. Cheng, S., Chen, X., He, X., Liu, Z., Bai, X.: Pra-net: point
of the IEEE/CVF International Conference on Computer Vision relation-aware network for 3d point cloud analysis. IEEE Trans.
(ICCV), pp. 915–924 (2021) Image Process. 30, 4436–4448 (2021)
157. Wu, P., Chen, C., Yi, J., Metaxas, D.: Point cloud processing via 176. Xu, M., Zhang, J., Zhou, Z., Xu, M., Qi, X., Qiao, Y.: Learning
recurrent set encoding. In: Proceedings of the AAAI Conference geometry-disentangled representation for complementary under-
on Artificial Intelligence, vol. 33, pp. 5441–5449 (2019) standing of 3d object point cloud. In: Proceedings of the AAAI
158. Liu, X., Han, Z., Liu, Y.-S., Zwicker, M.: Point2sequence: Conference on Artificial Intelligence, vol. 35, pp. 3056–3064
Learning the shape representation of 3d point clouds with an (2021)
attention-based sequence to sequence network. In: Proceedings 177. Chen, X., Wu, Y., Xu, W., Li, J., Dong, H., Chen, Y.: Pointscnet:
of the AAAI Conference on Artificial Intelligence, vol. 33, pp. point cloud structure and correlation learning based on space-
8778–8785 (2019) filling curve-guided sampling. Symmetry 14(1), 8 (2021)
159. Yang, J., Zhang, Q., Ni, B., Li, L., Liu, J., Zhou, M., Tian, Q.: 178. Lu, T., Liu, C., Chen, Y., Wu, G., Wang, L.: App-net: auxiliary-
Modeling point clouds with self-attention and gumbel subset sam- point-based push and pull operations for efficient point cloud
pling. In: Proceedings of the IEEE/CVF Conference on Computer classification. arXiv preprint arXiv:2205.00847 (2022)
Vision and Pattern Recognition, pp. 3323–3332 (2019) 179. Lin, H., Zheng, X., Li, L., Chao, F., Wang, S., Wang, Y., Tian, Y.,
160. Zhao, H., Jiang, L., Jia, J., Torr, P.H., Koltun, V.: Point transformer. Ji, R.: Meta architecture for point cloud analysis. In: Proceedings
In: Proceedings of the IEEE/CVF International Conference on of the IEEE/CVF Conference on Computer Vision and Pattern
Computer Vision, pp. 16259–16268 (2021) Recognition, pp. 17682–17691 (2023)
161. Guo, M.-H., Cai, J.-X., Liu, Z.-N., Mu, T.-J., Martin, R.R., Hu, 180. Yang, Y., Feng, C., Shen, Y., Tian, D.: Foldingnet: point cloud
S.-M.: PCT: point cloud transformer. Comput. Vis. Media 7(2), auto-encoder via deep grid deformation. In: Proceedings of the
187–199 (2021) IEEE Conference on Computer Vision and Pattern Recognition,
162. Engel, N., Belagiannis, V., Dietmayer, K.: Point transformer. IEEE pp. 206–215 (2018)
Access 9, 134826–134840 (2021) 181. Deng, H., Birdal, T., Ilic, S.: Ppf-foldnet: unsupervised learning
163. Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., Car- of rotation invariant 3d local descriptors. In: Proceedings of the
reira, J.: Perceiver: General perception with iterative attention. In: European Conference on Computer Vision (ECCV), pp. 602–618
International Conference on Machine Learning, pp. 4651–4664. (2018)
PMLR (2021) 182. Achlioptas, P., Diamanti, O., Mitliagkas, I., Guibas, L.: Learning
164. Berg, A., Oskarsson, M., O’Connor, M.: Points to patches: representations and generative models for 3d point clouds. In:
Enabling the use of self-attention for 3d shape recognition. arXiv International Conference on Machine Learning, pp. 40–49. PMLR
preprint arXiv:2204.03957 (2022) (2018)
165. Zhang, C., Wan, H., Shen, X., Wu, Z.: Patchformer: an efficient 183. Gadelha, M., Wang, R., Maji, S.: Multiresolution tree networks
point transformer with patch attention. In: Proceedings of the for 3d point cloud processing. In: Proceedings of the European
IEEE/CVF Conference on Computer Vision and Pattern Recog- Conference on Computer Vision (ECCV), pp. 103–118 (2018)
nition, pp. 11799–11808 (2022) 184. Hassani, K., Haley, M.: Unsupervised multi-task feature learning
166. Wu, X., Lao, Y., Jiang, L., Liu, X., Zhao, H.: Point transformer on point clouds. In: Proceedings of the IEEE/CVF International
v2: grouped vector attention and partition-based pooling. arXiv Conference on Computer Vision, pp. 8160–8171 (2019)
preprint arXiv:2210.05666 (2022) 185. Zhao, Y., Birdal, T., Deng, H., Tombari, F.: 3d point capsule
167. Huang, Z., Zhao, Z., Li, B., Han, J.: Lcpformer: towards effective networks. In: Proceedings of the IEEE/CVF Conference on Com-
3D point cloud analysis via local context propagation in trans- puter Vision and Pattern Recognition, pp. 1009–1018 (2019)
formers. IEEE Trans. Circuits Syst. Video Technol. (2023) 186. Chen, C., Li, G., Xu, R., Chen, T., Wang, M., Lin, L.: Cluster-
168. Park, J., Lee, S., Kim, S., Xiong, Y., Kim, H.J.: Self-positioning net: deep hierarchical cluster network with rigorously rotation-
point-based transformer for point cloud understanding. In: Pro- invariant representation for point cloud analysis. In: Proceedings
ceedings of the IEEE/CVF Conference on Computer Vision and of the IEEE/CVF Conference on Computer Vision and Pattern
Pattern Recognition, pp. 21814–21823 (2023) Recognition, pp. 4994–5002 (2019)
169. Li, Z., Gao, P., Yuan, H., Wei, R., Paul, M.: Exploiting inductive 187. Sun, H., Li, S., Zheng, X., Lu, X.: Remote sensing scene clas-
bias in transformer for point cloud classification and segmenta- sification by gated bidirectional network. IEEE Trans. Geosci.
tion. arXiv preprint arXiv:2304.14124 (2023) Remote Sens. 58(1), 82–96 (2019)
170. Wu, C., Zheng, J., Pfrommer, J., Beyerer, J.: Attention-based point 188. Sun, Y., Wang, Y., Liu, Z., Siegel, J., Sarma, S.: Pointgrow: autore-
cloud edge sampling. In: Proceedings of the IEEE/CVF Confer- gressively learned point cloud generation with self-attention. In:
123
Proceedings of the IEEE/CVF Winter Conference on Applications tion learning guided by generative pretraining. arXiv preprint
of Computer Vision, pp. 61–70 (2020) arXiv:2302.02318 (2023)
189. Eckart, B., Yuan, W., Liu, C., Kautz, J.: Self-supervised learning 208. Xue, L., Gao, M., Xing, C., Martín-Martín, R., Wu, J., Xiong, C.,
on 3d point clouds by learning discrete generative models. In: Xu, R., Niebles, J.C., Savarese, S.: Ulip: learning unified represen-
Proceedings of the IEEE/CVF Conference on Computer Vision tation of language, image and point cloud for 3d understanding.
and Pattern Recognition, pp. 8248–8257 (2021) arXiv preprint arXiv:2212.05171 (2022)
190. Wang, H., Liu, Q., Yue, X., Lasenby, J., Kusner, M.J.: Unsu- 209. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L.,
pervised point cloud pre-training via occlusion completion. In: Gomez, A.N., Kaiser, Ł, Polosukhin, I.: Attention is all you need.
Proceedings of the IEEE/CVF International Conference on Com- Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017)
puter Vision, pp. 9782–9792 (2021) 210. Lu, D., Xie, Q., Wei, M., Xu, L., Li, J.: Transformers in 3d point
191. Sun, C., Zheng, Z., Wang, X., Xu, M., Yang, Y.: Self-supervised clouds: a survey. arXiv preprint arXiv:2205.07417 (2022)
point cloud representation learning via separating mixed shapes. 211. Li, R., Li, X., Heng, P.-A., Fu, C.-W.: Pointaugment: an auto-
IEEE Trans. Multimed. (2022) augmentation framework for point cloud classification. In: Pro-
192. Huang, S., Xie, Y., Zhu, S.-C., Zhu, Y.: Spatio-temporal self- ceedings of the IEEE/CVF Conference on Computer Vision and
supervised representation learning for 3d point clouds. In: Pro- Pattern Recognition, pp. 6378–6387 (2020)
ceedings of the IEEE/CVF International Conference on Computer 212. Prokudin, S., Lassner, C., Romero, J.: Efficient learning on point
Vision, pp. 6535–6545 (2021) clouds with basis point sets. In: Proceedings of the IEEE/CVF
193. Yan, S., Yang, Z., Li, H., Guan, L., Kang, H., Hua, G., Huang, Q.: International Conference on Computer Vision, pp. 4332–4341
Implicit autoencoder for point cloud self-supervised representa- (2019)
tion learning. arXiv preprint arXiv:2201.00785 (2022) 213. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: pre-
194. Liu, Q., Zhao, J., Cheng, C., Sheng, B., Ma, L.: Pointalcr: adver- training of deep bidirectional transformers for language under-
sarial latent gan and contrastive regularization for point cloud standing. arXiv preprint arXiv:1810.04805 (2018)
completion. Vis. Comput. 38, 3341–3349 (2022) 214. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast
195. Pang, Y., Wang, W., Tay, F.E., Liu, W., Tian, Y., Yuan, L.: for unsupervised visual representation learning. In: Proceedings
Masked autoencoders for point cloud self-supervised learning. of the IEEE/CVF Conference on Computer Vision and Pattern
arXiv preprint arXiv:2203.06604 (2022) Recognition, pp. 9729–9738 (2020)
196. Yu, X., Tang, L., Rao, Y., Huang, T., Zhou, J., Lu, J.: Point-bert: 215. Hu, W., Liu, B., Gomes, J., Zitnik, M., Liang, P., Pande, V.,
pre-training 3d point cloud transformers with masked point mod- Leskovec, J.: Strategies for pre-training graph neural networks.
eling. In: Proceedings of the IEEE/CVF Conference on Computer arXiv preprint arXiv:1905.12265 (2019)
Vision and Pattern Recognition, pp. 19313–19322 (2022) 216. Schönberger, J.L., Pollefeys, M., Geiger, A., Sattler, T.: Semantic
197. Wang, Z., Yu, X., Rao, Y., Zhou, J., Lu, J.: P2p: tuning pre-trained visual localization. In: Proceedings of the IEEE Conference on
image models for point cloud analysis with point-to-pixel prompt- Computer Vision and Pattern Recognition, pp. 6896–6906 (2018)
ing. arXiv preprint arXiv:2208.02812 (2022) 217. Wu, J., Zhang, C., Xue, T., Freeman, B., Tenenbaum, J.: Learning
198. Denipitiyage, D., Jayasundara, V., Rodrigo, R., Edussooriya, a probabilistic latent space of object shapes via 3d generative-
C.U.: Pointcaps: raw point cloud processing using capsule net- adversarial modeling. Adv. Neural Inf. Process. Syst. 29, 82–90
works with Euclidean distance routing. J. Vis. Commun. Image (2016)
Represent. 88, 103612 (2022) 218. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov,
199. Jiang, J., Lu, X., Zhao, L., Dazeley, R., Wang, M.: Masked autoen- D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with
coders in 3d point cloud representation learning. arXiv preprint convolutions. In: Proceedings of the IEEE Conference on Com-
arXiv:2207.01545 (2022) puter Vision and Pattern Recognition, pp. 1–9 (2015)
200. Zhang, R., Guo, Z., Gao, P., Fang, R., Zhao, B., Wang, D., Qiao, Y., 219. Doersch, C.: Tutorial on variational autoencoders. arXiv preprint
Li, H.: Point-m2ae: multi-scale masked autoencoders for hierar- arXiv:1606.05908 (2016)
chical point cloud pre-training. arXiv preprint arXiv:2205.14401 220. Zamorski, M., Ziski, T.: Adversarial autoencoders for compact
(2022) representations of 3d point clouds. Comput. Vis. Image Underst.
201. Hao, F., Li, J., Song, R., Li, Y., Cao, K.: Mixed feature predic- 193, 102921 (2020)
tion on boundary learning for point cloud semantic segmentation. 221. Xiao, A., Huang, J., Guan, D., Lu, S.: Unsupervised repre-
Remote Sens. 14(19), 4757 (2022) sentation learning for point clouds: a survey. arXiv preprint
202. Liu, H., Cai, M., Lee, Y.J.: Masked discrimination for self- arXiv:2202.13589 (2022)
supervised learning on point clouds. In: Computer Vision–ECCV 222. Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., Auli, M.:
2022: 17th European Conference, Tel Aviv, Israel, October 23–27, Data2vec: a general framework for self-supervised learning in
2022, Proceedings, Part II, pp. 657–675. Springer (2022) speech, vision and language. In: International Conference on
203. Dong, R., Qi, Z., Zhang, L., Zhang, J., Sun, J., Ge, Z., Yi, L., Machine Learning, pp. 1298–1312. PMLR (2022)
Ma, K.: Autoencoders as cross-modal teachers: can pretrained 223. Lawin, F.J., Danelljan, M., Tosteberg, P., Bhat, G., Khan, F.S.,
2d image transformers help 3d representation learning? arXiv Felsberg, M.: Deep projective 3d semantic segmentation. In:
preprint arXiv:2212.08320 (2022) International Conference on Computer Analysis of Images and
204. Zhang, R., Wang, L., Qiao, Y., Gao, P., Li, H.: Learning 3d repre- Patterns, pp. 95–107. Springer (2017)
sentations from 2d pre-trained models via image-to-point masked 224. Wu, B., Wan, A., Yue, X., Keutzer, K.: Squeezeseg: convolutional
autoencoders. arXiv preprint arXiv:2212.06785 (2022) neural nets with recurrent crf for real-time road-object segmen-
205. Chen, G., Wang, M., Yang, Y., Yu, K., Yuan, L., Yue, Y.: Pointgpt: tation from 3d lidar point cloud. In: 2018 IEEE International
auto-regressively generative pre-training from point clouds. arXiv Conference on Robotics and Automation (ICRA), pp. 1887–1893.
preprint arXiv:2305.11487 (2023) IEEE (2018)
206. Zeid, K.A., Schult, J., Hermans, A., Leibe, B.: Point2vec for self- 225. Graham, B., Engelcke, M., Van Der Maaten, L.: 3d semantic seg-
supervised representation learning on point clouds. arXiv preprint mentation with submanifold sparse convolutional networks. In:
arXiv:2303.16570 (2023) Proceedings of the IEEE Conference on Computer Vision and
207. Qi, Z., Dong, R., Fan, G., Ge, Z., Zhang, X., Ma, K., Pattern Recognition, pp. 9224–9232 (2018)
Yi, L.: Contrast with reconstruct: contrastive 3d representa-
123
226. Meng, H.-Y., Gao, L., Lai, Y.-K., Manocha, D.: Vv-net: voxel 244. Ding, B.: Lenet: lightweight and efficient lidar semantic seg-
vae net with group convolutions for point cloud segmentation. mentation using multi-scale convolution attention. arXiv preprint
In: Proceedings of the IEEE/CVF International Conference on arXiv:2301.04275 (2023)
Computer Vision, pp. 8500–8508 (2019) 245. Zhang, Y., Zhou, Z., David, P., Yue, X., Xi, Z., Gong, B., Foroosh,
227. Dai, A., Nießner, M.: 3dmv: joint 3d-multi-view prediction for H.: Polarnet: an improved grid representation for online lidar point
3d semantic scene segmentation. In: Proceedings of the European clouds semantic segmentation. In: Proceedings of the IEEE/CVF
Conference on Computer Vision (ECCV), pp. 452–468 (2018) Conference on Computer Vision and Pattern Recognition, pp.
228. Jaritz, M., Gu, J., Su, H.: Multi-view pointnet for 3d scene 9601–9610 (2020)
understanding. In: Proceedings of the IEEE/CVF International 246. Aksoy, E.E., Baci, S., Cavdar, S.: Salsanet: fast road and vehicle
Conference on Computer Vision Workshops, pp. 0–0 (2019) segmentation in lidar point clouds for autonomous driving. In:
229. Boulch, A., Le Saux, B., Audebert, N.: Unstructured point cloud 2020 IEEE Intelligent Vehicles Symposium (IV), pp. 926–932.
semantic labeling using deep segmentation networks. 3dor@ IEEE (2020)
eurographics 3, 1–8 (2017) 247. Song, W., Liu, Z., Guo, Y., Sun, S., Zu, G., Li, M.: Dgpolarnet:
230. Audebert, N., Saux, B.L., Lefèvre, S.: Semantic segmentation of dynamic graph convolution network for lidar point cloud semantic
earth observation data using multimodal and multi-scale deep net- segmentation on polar bev. Remote Sens. 14(15), 3825 (2022)
works. In: Asian Conference on Computer Vision, pp. 180–196. 248. Tchapmi, L., Choy, C., Armeni, I., Gwak, J., Savarese, S.:
Springer (2016) Segcloud: semantic segmentation of 3d point clouds. In: 2017
231. Boulch, A., Guerry, J., Le Saux, B., Audebert, N.: Snapnet: 3d International Conference on 3D Vision (3DV), pp. 537–547. IEEE
point cloud semantic labeling with 2d deep segmentation net- (2017)
works. Comput. Gr. 71, 189–198 (2018) 249. Rethage, D., Wald, J., Sturm, J., Navab, N., Tombari, F.: Fully-
232. Guerry, J., Boulch, A., Le Saux, B., Moras, J., Plyer, A., Fil- convolutional point networks for large-scale point clouds. In:
liat, D.: Snapnet-r: consistent 3d multi-view semantic labeling for Proceedings of the European Conference on Computer Vision
robotics. In: Proceedings of the IEEE International Conference (ECCV), pp. 596–611 (2018)
on Computer Vision Workshops, pp. 669–678 (2017) 250. Dai, A., Ritchie, D., Bokeloh, M., Reed, S., Sturm, J., Nießner, M.:
233. Li, S., Chen, X., Liu, Y., Dai, D., Stachniss, C., Gall, J.: Multi-scale Scancomplete: large-scale scene completion and semantic seg-
interaction for real-time lidar data segmentation on an embedded mentation for 3d scans. In: Proceedings of the IEEE Conference on
platform. IEEE Robot. Autom. Lett. 7(2), 738–745 (2021) Computer Vision and Pattern Recognition, pp. 4578–4587 (2018)
234. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, 251. Zhou, H., Zhu, X., Song, X., Ma, Y., Wang, Z., Li, H., Lin, D.:
W.J., Keutzer, K.: Squeezenet: alexnet-level accuracy with 50x Cylinder 3d: an effective 3d framework for driving-scene lidar
fewer parameters and < 0.5 mb model size. arXiv preprint semantic segmentation. arXiv preprint arXiv:2008.01550 (2020)
arXiv:1602.07360 (2016) 252. Choy, C., Gwak, J., Savarese, S.: 4d spatio-temporal convnets:
235. Wu, B., Zhou, X., Zhao, S., Yue, X., Keutzer, K.: Squeezesegv2: Minkowski convolutional neural networks. In: Proceedings of the
improved model structure and unsupervised domain adaptation IEEE/CVF Conference on Computer Vision and Pattern Recog-
for road-object segmentation from a lidar point cloud. In: 2019 nition, pp. 3075–3084 (2019)
International Conference on Robotics and Automation (ICRA), 253. Rosu, R.A., Schütt, P., Quenzel, J., Behnke, S.: Latticenet: fast
pp. 4376–4382. IEEE (2019) point cloud segmentation using permutohedral lattices. arXiv
236. Xu, C., Wu, B., Wang, Z., Zhan, W., Vajda, P., Keutzer, K., preprint arXiv:1912.05905 (2019)
Tomizuka, M.: Squeezesegv3: spatially-adaptive convolution for 254. Tang, H., Liu, Z., Zhao, S., Lin, Y., Lin, J., Wang, H., Han,
efficient point-cloud segmentation. In: European Conference on S.: Searching efficient 3d architectures with sparse point-voxel
Computer Vision, pp. 1–19. Springer (2020) convolution. In: Computer Vision–ECCV 2020: 16th European
237. Milioto, A., Vizzo, I., Behley, J., Stachniss, C.: Rangenet++: fast Conference, Glasgow, UK, August 23–28, 2020, Proceedings,
and accurate lidar semantic segmentation. In: 2019 IEEE/RSJ Part XXVIII, pp. 685–702. Springer (2020)
International Conference on Intelligent Robots and Systems 255. Zhao, L., Xu, S., Liu, L., Ming, D., Tao, W.: Svaseg: sparse voxel-
(IROS), pp. 4213–4220. IEEE (2019) based attention for 3d lidar point cloud semantic segmentation.
238. Razani, R., Cheng, R., Taghavi, E., Bingbing, L.: Lite-hdseg: lidar Remote Sens. 14(18), 4471 (2022)
semantic segmentation using lite harmonic dense convolutions. In: 256. Yang, Y.-Q., Guo, Y.-X., Xiong, J.-Y., Liu, Y., Pan, H., Wang, P.-S.,
2021 IEEE International Conference on Robotics and Automation Tong, X., Guo, B.: Swin3d: a pretrained transformer backbone for
(ICRA), pp. 9550–9556. IEEE (2021) 3d indoor scene understanding. arXiv preprint arXiv:2304.06906
239. Zhao, Y., Bai, L., Huang, X.: Fidnet: lidar point cloud seman- (2023)
tic segmentation with fully interpolation decoding. In: 2021 257. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S.,
IEEE/RSJ International Conference on Intelligent Robots and Guo, B.: Swin transformer: hierarchical vision transformer using
Systems (IROS), pp. 4453–4458. IEEE (2021) shifted windows. In: Proceedings of the IEEE/CVF International
240. Wang, S., Zhu, J., Zhang, R.: Meta-rangeseg: lidar sequence Conference on Computer Vision, pp. 10012–10022 (2021)
semantic segmentation using multiple feature aggregation. arXiv 258. Camuffo, E., Michieli, U., Milani, S.: Learning from mistakes:
preprint arXiv:2202.13377 (2022) self-regularizing hierarchical semantic representations in point
241. Qiu, H., Yu, B., Tao, D.: Gfnet: geometric flow network cloud segmentation. arXiv preprint arXiv:2301.11145 (2023)
for 3d point cloud semantic segmentation. arXiv preprint 259. Roynard, X., Deschaud, J.-E., Goulette, F.: Classification de
arXiv:2207.02605 (2022) scènes de nuages de points 3d par réseau convolutionnel profond
242. Cheng, H.-X., Han, X.-F., Xiao, G.-Q.: Cenet: toward concise and voxelique multi-échelles. In: RFIAP et CFPT 2018 (2018)
efficient lidar semantic segmentation for autonomous driving. In: 260. Ye, M., Wan, R., Xu, S., Cao, T., Chen, Q.: Drinet++: effi-
2022 IEEE International Conference on Multimedia and Expo cient voxel-as-point point cloud segmentation. arXiv preprint
(ICME), pp. 01–06. IEEE (2022) arXiv:2111.08318 (2021)
243. Kong, L., Liu, Y., Chen, R., Ma, Y., Zhu, X., Li, Y., Hou, Y., 261. Hegde, S., Gangisetty, S.: Pig-net: inception based deep learning
Qiao, Y., Liu, Z.: Rethinking range view representation for lidar architecture for 3d point cloud segmentation. Comput. Gr. 95,
segmentation. arXiv preprint arXiv:2303.05367 (2023) 13–22 (2021)
123
262. Yan, X., Gao, J., Li, J., Zhang, R., Li, Z., Huang, R., Cui, S.: Proceedings of the IEEE/CVF Conference on Computer Vision
Sparse single sweep lidar point cloud segmentation via learning and Pattern Recognition, pp. 5575–5584 (2022)
contextual shape priors from scene completion. In: Proceedings 280. Ye, D., Zhou, Z., Chen, W., Xie, Y., Wang, Y., Wang, P., Foroosh,
of the AAAI Conference on Artificial Intelligence, vol. 35, pp. H.: Lidarmultinet: towards a unified multi-task network for lidar
3101–3109 (2021) perception. arXiv preprint arXiv:2209.09385 (2022)
263. Kochanov, D., Nejadasl, F.K., Booij, O.: Kprnet: improving 281. Zhou, J., Xiong, Y., Chiu, C., Liu, F., Gong, X.: Sat: size-aware
projection-based lidar semantic segmentation. arXiv preprint transformer for 3d point cloud semantic segmentation. arXiv
arXiv:2007.12668 (2020) preprint arXiv:2301.06869 (2023)
264. Alonso, I., Riazuelo, L., Montesano, L., Murillo, A.C.: 3d- 282. Chen, L.-Z., Li, X.-Y., Fan, D.-P., Wang, K., Lu, S.-P., Cheng, M.-
mininet: learning a 2d representation from point clouds for fast M.: Lsanet: feature learning on point sets by local spatial aware
and efficient 3d lidar semantic segmentation. IEEE Robot. Autom. layer. arXiv preprint arXiv:1905.05442 (2019)
Lett. 5(4), 5432–5439 (2020) 283. Wang, J., Li, X., Sullivan, A., Abbott, L., Chen, S.: Pointmotion-
265. Cortinhal, T., Tzelepis, G., Erdal Aksoy, E.: Salsanext: fast, net: point-wise motion learning for large-scale lidar point clouds
uncertainty-aware semantic segmentation of lidar point clouds. sequences. In: Proceedings of the IEEE/CVF Conference on Com-
In: International Symposium on Visual Computing, pp. 207–222. puter Vision and Pattern Recognition, pp. 4419–4428 (2022)
Springer (2020) 284. Zhao, N., Chua, T.-S., Lee, G.H.: Ps2-net: a locally and globally
266. Dewan, A., Burgard, W.: Deeptemporalseg: temporally consistent aware network for point-based semantic segmentation. In: 2020
semantic segmentation of 3d lidar scans. In: 2020 IEEE Inter- 25th International Conference on Pattern Recognition (ICPR), pp.
national Conference on Robotics and Automation (ICRA), pp. 723–730 (2021)
2624–2630. IEEE (2020) 285. Sahin, Y.H., Mertan, A., Unal, G.: Odfnet: using orientation dis-
267. Liong, V.E., Nguyen, T.N.T., Widjaja, S., Sharma, D., Chong, tribution functions to characterize 3d point clouds. Comput. Gr.
Z.J.: Amvnet: assertion-based multi-view fusion network for lidar 102, 610–618 (2022)
semantic segmentation. arXiv preprint arXiv:2012.04934 (2020) 286. Ran, H., Liu, J., Wang, C.: Surface representation for point clouds.
268. Alnaggar, Y.A., Afifi, M., Amer, K., ElHelw, M.: Multi projec- In: Proceedings of the IEEE/CVF Conference on Computer Vision
tion fusion for real-time semantic segmentation of 3d lidar point and Pattern Recognition, pp. 18942–18952 (2022)
clouds. In: Proceedings of the IEEE/CVF Winter Conference on 287. Engelmann, F., Kontogianni, T., Leibe, B.: Dilated point convolu-
Applications of Computer Vision (WACV), pp. 1800–1809 (2021) tions: on the receptive field size of point convolutions on 3d point
269. Gerdzhev, M., Razani, R., Taghavi, E., Bingbing, L.: Tornado-net: clouds. In: 2020 IEEE International Conference on Robotics and
multiview total variation semantic segmentation with diamond Automation (ICRA), pp. 9463–9469. IEEE (2020)
inception module. In: 2021 IEEE International Conference on 288. Zhao, L., Tao, W.: Jsnet: joint instance and semantic segmentation
Robotics and Automation (ICRA), pp. 9543–9549. IEEE (2021) of 3d point clouds. In: Proceedings of the AAAI Conference on
270. Xiao, A., Yang, X., Lu, S., Guan, D., Huang, J.: Fps-net: a Artificial Intelligence, vol. 34, pp. 12951–12958 (2020)
convolutional fusion network for large-scale lidar point cloud seg- 289. Li, Y., Li, X., Zhang, Z., Shuang, F., Lin, Q., Jiang, J.: Densekpnet:
mentation. ISPRS J. Photogramm. Remote Sens. 176, 237–249 dense kernel point convolutional neural networks for point cloud
(2021) semantic segmentation. IEEE Trans. Geosci. Remote Sens. 60,
271. Su, H., Jampani, V., Sun, D., Maji, S., Kalogerakis, E., Yang, 1–13 (2022)
M.-H., Kautz, J.: Splatnet: sparse lattice networks for point cloud 290. Ye, X., Li, J., Huang, H., Du, L., Zhang, X.: 3d recurrent neural
processing. In: Proceedings of the IEEE Conference on Computer networks with context fusion for point cloud semantic segmenta-
Vision and Pattern Recognition, pp. 2530–2539 (2018) tion. In: Proceedings of the European Conference on Computer
272. Liu, Z., Tang, H., Lin, Y., Han, S.: Point-voxel cnn for efficient Vision (ECCV), pp. 403–417 (2018)
3d deep learning. Adv. Neural Inf. Process. Syst. 32 (2019) 291. Huang, Q., Wang, W., Neumann, U.: Recurrent slice networks for
273. Chiang, H.-Y., Lin, Y.-L., Liu, Y.-C., Hsu, W.H.: A unified point- 3d segmentation of point clouds. In: Proceedings of the IEEE
based framework for 3d segmentation. In: 2019 International Conference on Computer Vision and Pattern Recognition, pp.
Conference on 3D Vision (3DV), pp. 155–163. IEEE (2019) 2626–2635 (2018)
274. Xu, J., Zhang, R., Dou, J., Zhu, Y., Sun, J., Pu, S.: Rpvnet: a 292. Zhao, Z., Liu, M., Ramani, K.: Dar-net: dynamic aggrega-
deep and efficient range-point-voxel fusion network for lidar point tion network for semantic scene segmentation. arXiv preprint
cloud segmentation. In: Proceedings of the IEEE/CVF Interna- arXiv:1907.12022 (2019)
tional Conference on Computer Vision, pp. 16024–16033 (2021) 293. Engelmann, F., Kontogianni, T., Hermans, A., Leibe, B.: Explor-
275. Zhuang, Z., Li, R., Jia, K., Wang, Q., Li, Y., Tan, M.: Perception- ing spatial context for 3d semantic segmentation of point clouds.
aware multi-sensor fusion for 3d lidar semantic segmentation. In: Proceedings of the IEEE International Conference on Com-
In: Proceedings of the IEEE/CVF International Conference on puter Vision Workshops, pp. 716–724 (2017)
Computer Vision, pp. 16280–16290 (2021) 294. Jiang, M., Wu, Y., Zhao, T., Zhao, Z., Lu, C.: Pointsift: a sift-like
276. Luo, C., Li, X., Cheng, N., Li, H., Lei, S., Li, P.: Mvp-net: multiple network module for 3d point cloud semantic segmentation. arXiv
view pointwise semantic segmentation of large-scale point clouds. preprint arXiv:1807.00652 (2018)
arXiv preprint arXiv:2201.12769 (2022) 295. Engelmann, F., Kontogianni, T., Schult, J., Leibe, B.: Know what
277. Hou, Y., Zhu, X., Ma, Y., Loy, C.C., Li, Y.: Point-to-voxel knowl- your neighbors do: 3d semantic segmentation of point clouds.
edge distillation for lidar semantic segmentation. In: Proceedings In: Proceedings of the European Conference on Computer Vision
of the IEEE/CVF Conference on Computer Vision and Pattern (ECCV) Workshops, pp. 0–0 (2018)
Recognition, pp. 8479–8488 (2022) 296. Wang, S., Suo, S., Ma, W.-C., Pokrovsky, A., Urtasun, R.: Deep
278. Lai, X., Chen, Y., Lu, F., Liu, J., Jia, J.: Spherical transformer parametric continuous convolutional neural networks. In: Pro-
for lidar-based 3d recognition. In: Proceedings of the IEEE/CVF ceedings of the IEEE Conference on Computer Vision and Pattern
Conference on Computer Vision and Pattern Recognition, pp. Recognition, pp. 2589–2597 (2018)
17545–17555 (2023) 297. Tatarchenko, M., Park, J., Koltun, V., Zhou, Q.-Y.: Tangent con-
279. Robert, D., Vallet, B., Landrieu, L.: Learning multi-view aggre- volutions for dense prediction in 3d. In: Proceedings of the IEEE
gation in the wild for large-scale 3d semantic segmentation. In: Conference on Computer Vision and Pattern Recognition, pp.
3887–3896 (2018)
123
298. Landrieu, L., Simonovsky, M.: Large-scale point cloud seman- supervised multi-tasks understanding. In: 2022 International Con-
tic segmentation with superpoint graphs. In: Proceedings of the ference on Robotics and Automation (ICRA), pp. 5108–5115
IEEE Conference on Computer Vision and Pattern Recognition, (2022)
pp. 4558–4567 (2018) 315. Fan, S., Dong, Q., Zhu, F., Lv, Y., Ye, P., Wang, F.-Y.: Scf-net:
299. Landrieu, L., Boussaha, M.: Point cloud oversegmentation with learning spatial contextual features for large-scale point cloud
graph-structured deep metric learning. In: Proceedings of the segmentation. In: Proceedings of the IEEE/CVF Conference on
IEEE/CVF Conference on Computer Vision and Pattern Recog- Computer Vision and Pattern Recognition, pp. 14504–14513
nition, pp. 7440–7449 (2019) (2021)
300. Wang, L., Huang, Y., Hou, Y., Zhang, S., Shan, J.: Graph attention 316. Gong, J., Xu, J., Tan, X., Song, H., Qu, Y., Xie, Y., Ma, L.:
convolution for point cloud semantic segmentation. In: Proceed- Omni-supervised point cloud segmentation via gradual receptive
ings of the IEEE/CVF Conference on Computer Vision and field component reasoning. In: Proceedings of the IEEE/CVF
Pattern Recognition (CVPR) (2019) Conference on Computer Vision and Pattern Recognition, pp.
301. Liang, Z., Yang, M., Deng, L., Wang, C., Wang, B.: Hierarchical 11673–11682 (2021)
depthwise graph convolutional neural network for 3d semantic 317. Shao, Y., Tong, G., Peng, H.: Mining local geometric structure
segmentation of point clouds. In: 2019 International Confer- for large-scale 3d point clouds semantic segmentation. Neuro-
ence on Robotics and Automation (ICRA), pp. 8152–8158. IEEE computing 500, 191–202 (2022)
(2019) 318. Tang, L., Zhan, Y., Chen, Z., Yu, B., Tao, D.: Contrastive bound-
302. Jiang, L., Zhao, H., Liu, S., Shen, X., Fu, C.-W., Jia, J.: Hierar- ary learning for point cloud segmentation. In: Proceedings of the
chical point-edge interaction network for point cloud semantic IEEE/CVF Conference on Computer Vision and Pattern Recog-
segmentation. In: Proceedings of the IEEE/CVF International nition, pp. 8489–8499 (2022)
Conference on Computer Vision (ICCV) (2019) 319. Shuang, F., Li, P., Li, Y., Zhang, Z., Li, X.: Msida-net: point
303. Rui, X., Gu, C., He, Z., Wu, K.: An efficient and dynamical way cloud semantic segmentation via multi-spatial information and
for local feature extraction on point cloud. In: 2020 the 3rd Inter- dual adaptive blocks. Remote Sens. 14(9), 2187 (2022)
national Conference on Control and Computer Vision, pp. 50–55 320. Lai, X., Liu, J., Jiang, L., Wang, L., Zhao, H., Liu, S., Qi, X.,
(2020) Jia, J.: Stratified transformer for 3d point cloud segmentation. In:
304. Lei, H., Akhtar, N., Mian, A.: Seggcn: efficient 3d point cloud Proceedings of the IEEE/CVF Conference on Computer Vision
segmentation with fuzzy spherical kernel. In: Proceedings of the and Pattern Recognition, pp. 8500–8509 (2022)
IEEE/CVF Conference on Computer Vision and Pattern Recog- 321. Ma, Y., Guo, Y., Liu, H., Lei, Y., Wen, G.: Global context reasoning
nition, pp. 11611–11620 (2020) for semantic segmentation of 3d point clouds. In: Proceedings of
305. Lei, H., Akhtar, N., Mian, A.: Spherical kernel for efficient graph the IEEE/CVF Winter Conference on Applications of Computer
convolution on 3d point clouds. IEEE Trans. Pattern Anal. Mach. Vision, pp. 2931–2940 (2020)
Intell. 43(10), 3664–3680 (2020) 322. Xu, X., Lee, G.H.: Weakly supervised semantic point cloud seg-
306. Zeng, Z., Xu, Y., Xie, Z., Wan, J., Wu, W., Dai, W.: Rg-gcn: mentation: Towards 10x fewer labels. In: Proceedings of the
a random graph based on graph convolution network for point IEEE/CVF Conference on Computer Vision and Pattern Recog-
cloud semantic segmentation. Remote Sens. 14(16), 4055 (2022) nition, pp. 13706–13715 (2020)
307. Park, C., Jeong, Y., Cho, M., Park, J.: Fast point transformer. In: 323. Zhiheng, K., Ning, L.: Pyramnet: point cloud pyramid attention
Proceedings of the IEEE/CVF Conference on Computer Vision network and graph embedding module for classification and seg-
and Pattern Recognition, pp. 16949–16958 (2022) mentation. arXiv preprint arXiv:1906.03299 (2019)
308. Wang, Q., Shi, S., Li, J., Jiang, W., Zhang, X.: Window nor-
malization: enhancing point cloud understanding by unifying
inconsistent point densities. arXiv preprint arXiv:2212.02287
Publisher’s Note Springer Nature remains neutral with regard to juris-
(2022)
dictional claims in published maps and institutional affiliations.
309. Xie, S., Gu, J., Guo, D., Qi, C.R., Guibas, L., Litany, O.: Pointcon-
trast: unsupervised pre-training for 3d point cloud understanding.
Springer Nature or its licensor (e.g. a society or other partner) holds
In: Computer Vision–ECCV 2020: 16th European Conference,
exclusive rights to this article under a publishing agreement with the
Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pp.
author(s) or other rightsholder(s); author self-archiving of the accepted
574–591. Springer (2020)
manuscript version of this article is solely governed by the terms of such
310. Jiang, L., Shi, S., Tian, Z., Lai, X., Liu, S., Fu, C.-W., Jia, J.:
publishing agreement and applicable law.
Guided point contrastive learning for semi-supervised point cloud
semantic segmentation. In: Proceedings of the IEEE/CVF Inter-
national Conference on Computer Vision, pp. 6423–6432 (2021)
311. Li, M., Xie, Y., Shen, Y., Ke, B., Qiao, R., Ren, B., Lin, S., Ma, L.: Sushmita Sarker is a Ph.D. student
Hybridcr: weakly-supervised 3d point cloud semantic segmenta- in Computer Science and Engi-
tion via hybrid contrastive regularization. In: Proceedings of the neering at the University of Nevada,
IEEE/CVF Conference on Computer Vision and Pattern Recog- Reno, USA. She earned her B.Eng.
nition, pp. 14930–14939 (2022) degree in Electronics and Com-
312. Zhao, Y., Wang, J., Li, X., Hu, Y., Zhang, C., Wang, Y., Chen, S.: munication Engineering from
Number-adaptive prototype learning for 3d point cloud semantic Gujarat Technological University,
segmentation. arXiv preprint arXiv:2210.09948 (2022) India, in 2017 and her Master’s
313. Hu, Q., Yang, B., Fang, G., Guo, Y., Leonardis, A., Trigoni, N., degree in Computer Science Engi-
Markham, A.: Sqn: weakly-supervised semantic segmentation of neering from the University of
large-scale 3d point clouds. In: Computer Vision–ECCV 2022: Nevada, Reno, in 2023. Her
17th European Conference, Tel Aviv, Israel, October 23–27, 2022, research interests span computer
Proceedings, Part XXVII, pp. 600–619. Springer (2022) vision, point cloud processing, and
314. Liu, K., Zhao, Y., Gao, Z., Chen, B.M.: Weaklabel3d-net: A deep learning. Her primary research
complete framework for real-scene lidar point clouds weakly focuses on the analysis of medical
123
images using generative networks to improve disease prediction and Ryan C. Gorman earned his Bache-
detection. Additionally, she endeavors to utilize point cloud data for lor of Science degree in Computer
mapping sagebrush and understanding the impacts of wildfires on for- Science and Engineering from the
est ecosystems. University of Nevada, Reno, in
2021. He is currently a Ph.D. stu-
Prithul Sarker is a Ph.D. candi- dent at the same institution in
date in the Department of Com- the Department of Computer Sci-
puter Science and Engineering at ence and Engineering. His cur-
the University of Nevada, Reno, rent research focuses on enhanc-
USA. He received B.S. degree ing wildfire data quality through
in electrical and electronic engi- multi-modality and super-resolution
neering from Bangladesh Univer- techniques. Ryan’s research inter-
sity of Engineering and Technol- ests lie in optimizing machine learn-
ogy, Bangladesh in 2017, and M.S. ing and computer vision efficiency
degree in computer science from while ensuring effectiveness. He
the University of Nevada, Reno, is dedicated to making advanced models more accessible, especially
USA in 2023. His primary research for compute limited devices.
focus is on utilizing virtual reality
(VR) technology to assess pupil- Javad Sattarvand is an asso-
lary function and artificial tech- ciate professor and the department
nology (AI) in healthcare. His research aims to improve the repro- chair at the Department of Min-
ducibility and validity of findings related to pupillary function, with a ing and Metallurgical Engineer-
focus on its significance in physiological and psychological processes ing at the University of Nevada,
using VR technologies. Additionally, he has performed comprehensive Reno, since 2015. Dr. Sattarvand
analyses on a range of medical data modalities, including time series received his Ph.D. in Mining Engi-
data sourced from assessments and images from medical diagnoses, to neering from the RWTH Univer-
predict and detect conditions and diseases such as concussion, breast sity of Aachen - Germany in 2009.
cancer and ocular disorder. With 23 years of academic appoint-
ments, he has supervised 34 mas-
Gunner Stone (M.Sc., 2023, B.Sc., ter’s and 5 Ph.D. students, received
2020) is currently a Ph.D. can- over 4 million dollars in research
didate in Computer Science and grants and published over 100 peer-
Engineering at the University of reviewed papers in journals and
Nevada, Reno. At the university, conferences. He is the teacher for mining automation and mining soft-
he actively collaborates within the ware engineering courses at the University of Nevada, Reno. With
Human-Machine Perception Lab industrial expertise in the optimization of mining operations, mine
and the GEARS Lab. His primary monitoring, and mine control systems, he has founded three start-up
research focus lies in LiDAR point- companies and accomplished more than 20 industrial projects and the
cloud classification and part seg- commercialization of several software and hardware products for min-
mentation using Deep Learning, ing automation.
especially in the domain of for-
est ecology. This interest aims to
leverage LiDAR pointclouds for
mapping forest attributes to under-
stand the impacts of wildfires and climatic changes on forest ecosys-
tems.
123
George Bebis received the B.S. Alireza Tavakkoli is an Associate

degree in mathematics and the Professor in the Department of
M.S. degree in computer science Computer Science and Engineer-
from the University of Crete, Crete, ing at the University of Nevada,
Greece, in 1987 and 1991, respec- Reno. He received his BSc and
tively, and the Ph.D. degree in MSc degrees in Electrical Engi-
electrical and computer engineer- neering from the Sharif University
ing from the University of Cen- of Technology in 2001 and 2004,
tral Florida, Orlando, FL, USA, and MSC and PhD degrees in
in 1996. He is currently a Foun- Computer Science from the Uni-
dation Professor with the Depart- versity of Nevada, Reno in 2006
ment of Computer Science and and 2009. He is the Director of the
Engineering (CSE), University of Human Machine Perception Lab
Nevada, Reno (UNR), Reno, NV, at UNR. His main interests are in
USA, and the Director of the Com- visual computing, artificial intel-
puter Vision Laboratory. From 2013 to 2018, he served as a Depart- ligence, and perception. His research projects are funded by federal
ment Chair of CSE, UNR. His research has been funded by NSF, agencies such as NSF, NASA, NIH and DoD. He has published over
NASA, ONR, NIJ, and Ford Motor Company. His research inter- 100 peer-reviewed articles and occasionally serves as a grant review
ests include computer vision, image processing, pattern recognition, panelist as well as a reviewer for several journals and conferences. He
machine/deep learning, and evolutionary computing. Dr. Bebis is an is a senior member of the IEEE and currently the chief guest editor of
Associate Editor of the Machine Vision and Applications Journal and a special research topic in the journal frontiers in ophthalmology.
serves on the Editorial Board of the International Journal on Artificial
Intelligence Tools, and the Computer Methods in Biomechanics and
Biomedical Engineering: Imaging and Visualization. He has served on
the program committees of various national and international confer-
ences and is the founder and steering committee chair of the Interna-
tional Symposium on Visual Computing (ISVC).
123

A Comprehensive Overview of Deep Learning Techniques For 3D Point Cloud Classification and Semantic Segmentation

Uploaded by

Document Informationclick to expand document information

Document Informationclick to expand document information

Copyright:

Available Formats

A Comprehensive Overview of Deep Learning Techniques For 3D Point Cloud Classification and Semantic Segmentation

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

A Comprehensive Overview of Deep Learning Techniques For 3D Point Cloud Classification and Semantic Segmentation

Uploaded by

Copyright:

Available Formats

Machine Vision and Applications (2024) 35:67

A comprehensive overview of deep learning techniques for 3D point

Received: 15 November 2023 / Revised: 15 November 2023 / Accepted: 14 April 2024

Keywords 3D classification · Computer vision · Point cloud · Semantic segmentation

1 Introduction RealSense, and Apple depth cameras) [1], and photogramme-

0123456789().: V,-vol 123

Table 1 Available point cloud datasets for classification

Intra [16] 2020 Syn 2025 2 IntrA is an open-access 3D intracra-

Table 2 Available point cloud datasets for segmentation

VoxNet [75] 2015 Vol. – 83.00% – 92.00% – – 73.00%

into a regular grid, creating a volumetric representation of

increasingly popular since they reduce the computational

jection methods are referred to in this review as “hybrid”

In Sect. 3, we have discussed the various models for 3D

shape classification with a focus on input data representation.

Meanwhile, Sect. 4 presents an in-depth analysis of method-

Table 3 provides a comprehensive comparison of the afore-

various datasets. The methods have been categorized based

on the representation of the input point cloud utilized in

chronological order within their respective categories. The

evaluation of each method’s performance is based on met-

3.1 Mesh-based methods

A mesh is the most widely used structure for representing

faces and vertices that define surfaces on a three-dimensional

representations provide a memory-efficient way to store com-

plete geometry details. However, this representation is often

algorithms. This could be attributed to the fact that a 3D mesh

based approaches presents a difficult challenge because of

et al. introduced a mesh neural network as MeshNet in [54].

This approach introduces face-unit and feature splitting and

proposes a general architecture with usable and efficient

building blocks. The face unit takes as input the features of

convolutional operations to learn representations of that face.

Fig. 1 A taxonomy of deep learning methods for 3D point cloud classification

Fig. 2 The Stanford Bunny [70] model in different three-dimensional representations

Fig. 4 A simplified architecture

Fig. 5 Different types of point convolution [114]

Model name Year Input Params ModelNet 40 ModelNet 10 ScanObjectNN Intra

PointNet [3] 2017 PC 3.5 89.20% 86.20% – - 68.20% 63.40% 68.40%

Flex-Convolution [146] 2018 PC – 90.20% – – – – – –

LocalSpecGCN [151] 2018 PC+Normals – 92.10% – – – – – –

CurveNet [156] 2021 PC – 94.20% – 96.30% – – – –

Model name Year Input Params ModelNet 40 ModelNet 10 ScanObjectNN Intra

RCNet-E [157] 2019 PC 39.9 92.30% – 95.60% – – – –

Latent-GAN [182] 2018 PC – 84.50% – 95.40% – – –

Point-M2AE [200] 2022 PC – 94.00% – – – 86.43% – –

Fig. 7 Illustration of the transformer based encoder architecture [210]

Discretization-based methods transform continuous point 5.2.2 Sparse discretization representation

Projection Based Methods

DeepTemporalSeg [266] 2020 37.60% – – – – – – – – – –

GFNet [241] 2022 65.40% – 76.10% – – – – – – –

RangeFormer [243] 2023 73.30% – 80.10% – – – – – – –

SEGCloud [248] 2017 – – – – – 88.10% 61.30% 57.35% 48.90% – –

SPH3D-GCN [305] 2021 – 87.70% 59.50% 88.60% 68.90% – 61.00% – – – – – –

FastPointTransformer [307] 2021 – – 70.10% – – – 72.10% – – – – – –

Fan et al. [315] proposed SCF, a learnable module for extract-

Projection-based By leveraging 2D convolutional Generates quantization artifacts that