Robustness-Aware 3D Object Detection in Autonomous Driving: A Review and Outlook
Ziying Song, Lin Liu, Feiyang Jia, Yadan Luo, Caiyan Jia, Guoxin Zhang, Lei Yang, Li Wang

arXiv:2401.06542v3 [cs.CV] 15 Aug 2024

Abstract—In the realm of modern autonomous driving, the perception system is indispensable for accurately assessing the state of the surrounding environment, thereby enabling informed prediction and planning. The key step of this system is 3D object detection, which utilizes vehicle-mounted sensors such as LiDAR and cameras to identify the size, the category, and the location of nearby objects. Despite the surge in 3D object detection methods aimed at enhancing detection precision and efficiency, there is a gap in the literature that systematically examines their resilience against environmental variations, noise, and weather changes. This study emphasizes the importance of robustness, alongside accuracy and latency, in evaluating perception systems under practical scenarios. Our work presents an extensive survey of camera-only, LiDAR-only, and multi-modal 3D object detection algorithms, thoroughly evaluating their trade-offs between accuracy, latency, and robustness, particularly on datasets like KITTI-C and nuScenes-C to ensure fair comparisons. Among these, multi-modal 3D detection approaches exhibit superior robustness, and a novel taxonomy is introduced to reorganize the literature for enhanced clarity. This survey aims to offer a more practical perspective on the current capabilities and the constraints of 3D object detection algorithms in real-world applications, thus steering future research towards robustness-centric advancements.

Index Terms—3D Object Detection, Perception, Robustness, Autonomous Driving

This work was supported in part by the National Key R&D Program of China (2018AAA0100302) and by the STI 2030-Major Projects under Grant 2021ZD0201404. (Corresponding author: Caiyan Jia.)
Ziying Song, Lin Liu, Feiyang Jia, and Caiyan Jia are with the School of Computer Science & Technology, Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China (e-mail: 22110110@bjtu.edu.cn; liulin010811@gmail.com; feiyangjia@bjtu.edu.cn; cyjia@bjtu.edu.cn).
Yadan Luo is with the School of Information Technology and Electrical Engineering, The University of Queensland, St Lucia, QLD 4072, Australia (e-mail: uqyluo@uq.edu.au).
Guoxin Zhang is with the School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: zhangguoxincs@gmail.com).
Lei Yang is with the State Key Laboratory of Intelligent Green Vehicle and Mobility, and the School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China (e-mail: yanglei20@mails.tsinghua.edu.cn).
Li Wang is with the School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China (e-mail: wangli_bit@bit.edu.cn).

I. INTRODUCTION

AUTONOMOUS driving systems, fundamental to the future of transportation, heavily rely on advanced perception, decision-making, and control technologies. These systems employ a range of sensors [1], such as cameras, LiDAR, and radar, as depicted in Fig. 1, to effectively perceive surrounding environments. This capability is crucial for recognizing road signs, detecting and tracking vehicles, and predicting pedestrian behaviors, enabling safe operation amidst complex traffic conditions [2].

The primary task of perception is to accurately understand the surrounding environment and minimize collision risks [3]. This is where 3D object detection methods become essential. These approaches enable autonomous systems to accurately identify objects in the vicinity, including their position, shape, and category [4]. Such detailed environmental perception enhances the system's ability to comprehend the driving context and make more informed decisions.

The advancement of autonomous driving technologies has spurred a wave of research in 3D object detection, leading to the development of diverse and innovative methods. These approaches are typically categorized based on their input types, including camera-only [5]–[35], LiDAR-only [36]–[98], and multi-modal methods [65], [99]–[122]. The current landscape of 3D object detection methods is prolific, necessitating a comprehensive summarization to offer intriguing insights to the research community. While comprehensive, prior surveys, such as [4], [123], often overlook the safety aspects of autonomous driving perception, particularly in terms of the system's robustness against varying testing data following deployment.

In real-world testing scenarios, the conditions encountered usually differ greatly from those during training. Environmental variability, sensor discrepancies or noise, and spatial misalignment can cause a shift in the input sensory data distribution, leading to a significant drop in detector performance [108], [111], [124], [125]. We identify and discuss three major factors critical for assessing detection robustness.

• Environmental Variability. A detection algorithm needs to perform well under different environmental conditions, including variations in lighting, weather, and seasons. The algorithm should exhibit adaptability, ensuring that it does not fail due to changes in the environment.
• Sensor Noise. This includes handling noise introduced by sensor malfunctions, such as motion blur in a camera. An algorithm must possess the capability to effectively manage hardware noise, ensuring the accurate processing of input data.
• Misalignment. In real-world scenarios, sensor calibration errors can complicate the synchronization of multi-modal input data, causing misalignment due to external factors (e.g., uneven road surfaces) or internal factors (e.g., system clock misalignment). An algorithm should be fault-tolerant and may incorporate an elastic alignment to mitigate the impact of misalignment on detection performance.
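Benchmark toolkits typically emulate these three factors by corrupting clean inputs at controlled severity levels. The sketch below is purely illustrative: the function names and severity scalings are our own, not taken from any cited benchmark.

```python
import random

def corrupt_points(points, severity, seed=0):
    """Sensor noise: add zero-mean Gaussian jitter to LiDAR points.
    `severity` (1-5) scales the noise standard deviation (metres);
    the 0.02 m/step scaling is an assumption for illustration."""
    rng = random.Random(seed)
    std = 0.02 * severity
    return [tuple(c + rng.gauss(0.0, std) for c in p) for p in points]

def darken_image(pixels, severity):
    """Environmental variability: a uniform brightness drop on 0-255
    pixel values, mimicking a night-time or overcast capture."""
    factor = 1.0 - 0.15 * severity
    return [max(0, round(v * factor)) for v in pixels]

def shift_extrinsics(translation, severity):
    """Misalignment: offset the camera-LiDAR translation vector
    (metres) as a crude stand-in for a calibration error."""
    return tuple(t + 0.05 * severity for t in translation)
```

Sweeping the severity from 1 to 5 and re-running a detector on the corrupted copies yields exactly the per-corruption, per-severity scores that the robustness metrics of Section II average over.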

Fig. 1: An illustration of 3D object detection in autonomous driving scenarios with different sensors. (The figure shows an autonomous vehicle whose camera images and LiDAR point clouds are processed by detectors built on 2D CNNs, 3D sparse CNNs, PointNet/PointNet++, and Transformers, each followed by a detection head that predicts 3D objects in the image and in the point cloud.)
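Regardless of the backbone sketched in Fig. 1, every detector ultimately emits the same kind of prediction: a 3D box carrying a location, a size, and a category. A minimal illustrative container (our own, not tied to any particular framework):

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """One detected object: centre location (metres), size (length,
    width, height), heading angle (radians), category, and score."""
    x: float
    y: float
    z: float
    length: float
    width: float
    height: float
    yaw: float
    category: str
    score: float

    def volume(self) -> float:
        """Volume of the box in cubic metres."""
        return self.length * self.width * self.height
```

Camera-only, LiDAR-only, and multi-modal detectors differ only in how they arrive at a list of such boxes; downstream prediction and planning consume the same structure.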

To ensure safe operation in varying test environments, assessing the robustness of 3D object detection algorithms is essential. They must maintain efficient, accurate, and reliable performance across diverse scenarios. In this survey, we conduct extensive experimental comparisons among existing algorithms. Centered around 'Accuracy, Latency, Robustness', we delve into existing solutions, offering insightful guidance for practical deployment in autonomous driving.

• Accuracy: Current research often prioritizes accuracy as a key performance metric. However, a deeper understanding of these methods' performance in complex environments and extreme weather conditions is needed to ensure real-world reliability. A more detailed analysis of false positives and false negatives is necessary for improvement.
• Latency: Real-time capability is vital for autonomous driving. The latency of a 3D object detection method impacts the system's ability to make timely decisions, particularly in emergencies.
• Robustness: Robustness refers to the system's stability under various conditions, including weather, lighting, sensory, and alignment changes. Many existing evaluations may not fully consider the diversity of real-world scenarios, necessitating a more comprehensive adaptability assessment.

Through an in-depth analysis of extensive experimental results, with a focus on 'Accuracy, Latency, Robustness', we have identified significant safety-perception advantages in multi-modal 3D detection. By integrating information from diverse sensors or data sources, multi-modal methods provide a richer and more diverse perception capability for autonomous driving systems, thereby enhancing the understanding of, and response to, the surrounding environment. Our research provides practical guidance for the future deployment of autonomous driving technology. By discussing these key areas, we aim to align the technology more closely with real-world needs and enhance its societal benefits effectively.

The structure of this paper is organized as follows: First, we introduce the datasets and evaluation metrics for 3D object detection, with a particular focus on robustness, in Section II. Subsequent sections systematically examine existing 3D object detection methods, including camera-only approaches (Section III), LiDAR-only approaches (Section IV), and multi-modal approaches (Section V). The paper concludes with a comprehensive summary of our findings in Section VII.

TABLE I: Advantages and limitations of different modalities.

Type        | Sensor        | Hardware Cost ($) | Advantages                                                                    | Limitations
Image       | Camera        | 10^2–10^3         | + Dense data format incorporating additional color and texture information.   | - Missing depth information; the camera is affected by light, weather, etc.
Point cloud | LiDAR         | 10^4–10^5         | + Accurate depth information, less affected by light; + larger field of view. | - High computational cost for sparse and disordered point cloud data; no color information.
Multi-modal | Camera, LiDAR | 10^4–10^5         | + Simultaneous color and depth information.                                   | - Fusion methods can produce noise interference.

II. DATASETS

Currently, autonomous driving systems primarily rely on sensors such as cameras and LiDAR, generating data in two modalities, point clouds and images. Based on these data types, existing public benchmarks predominantly manifest in three forms: camera-only, LiDAR-only, and multi-modal. Table I delineates the advantages and the disadvantages of each of these three forms. Among them, there are many reviews [123], [126]–[132] providing a comprehensive overview of clean autonomous driving datasets, as shown in Table II. The most notable ones include KITTI [133], nuScenes [134], and Waymo [135].

In recent times, the pioneering work on clean autonomous driving datasets has provided rich resources for 3D object detection. As autonomous driving technology transitions from breakthrough stages to practical implementation, we have conducted guided research to systematically review the currently available robustness datasets. We focus on noisy scenarios and systematically review datasets related to the robustness of 3D detection. Many studies have collected new datasets to evaluate model robustness under different conditions. Early research explored camera-only approaches under adverse conditions [136], [137], with datasets that were notably small in scale and exclusively applicable to camera-only visual tasks, rather than multi-modal sensor stacks that include LiDAR. Subsequently, a series of multi-modal datasets [138]–[141] have focused on noise concerns. For instance, the GROUNDED dataset [138] focuses on ground-penetrating
radar localization under varying weather conditions. Additionally, the ApolloScape open dataset [140] incorporates LiDAR, camera, and GPS data, encompassing cloudy and rainy conditions as well as brightly lit scenarios. The Ithaca365 dataset [141] is designed for robustness in autonomous driving research, providing scenarios under various challenging weather conditions, such as rain and snow.

TABLE II: Public datasets for 3D object detection in autonomous driving. 'C', 'L' and 'R' denote Camera, LiDAR and Radar, respectively.

Dataset         | Year | Sensors | Frames | Annotations | Scenes | Categories
KITTI [133]     | 2012 | CL      | 15K    | 200K        | 50     | 3
nuScenes [134]  | 2019 | CLR     | 40K    | 1.4M        | 1000   | 10
Lyft L5 [150]   | 2019 | CL      | 46K    | 1.3M        | 366    | 9
H3D [151]       | 2019 | L       | 27K    | 1.1M        | 160    | 8
Apollo [140]    | 2019 | CL      | 140K   | -           | 103    | 27
Argoverse [152] | 2019 | CL      | 46K    | 993K        | 366    | 9
A*3D [153]      | 2019 | CL      | 39K    | 230K        | -      | 7
Waymo [135]     | 2020 | CL      | 230K   | 12M         | 1150   | 3
A2D2 [154]      | 2020 | CL      | 12.5K  | 43K         | -      | 38
PandaSet [155]  | 2020 | CL      | 14K    | -           | 179    | 28
KITTI-360 [156] | 2020 | CL      | 80K    | 68K         | 11     | 19
Cirrus [157]    | 2020 | CL      | 6285   | -           | 12     | 8
ONCE [158]      | 2021 | CL      | 15K    | 417K        | -      | 5
OpenLane [159]  | 2022 | CL      | 200K   | -           | 1000   | 14

Due to the prohibitive cost of collecting extensive noisy datasets from the real world, which renders the formation of large-scale datasets impractical, many studies have shifted their focus to synthetic datasets. ImageNet-C [142] is a seminal work in corruption robustness research, benchmarking classical image classification models against prevalent corruptions and perturbations. This line of research has subsequently been extended to robustness datasets tailored for 3D object detection in autonomous driving. Additionally, there are adversarial attacks [143]–[145] designed for studying the robustness of 3D object detection. However, these attacks may not exclusively concentrate on natural corruption, which is less common in autonomous driving scenarios.

To better emulate the distribution of noisy data in the real world, several studies [124], [125], [146]–[149] have developed toolkits for robustness benchmarks. These benchmark toolkits enable the simulation of various scenarios using clean autonomous driving datasets, such as KITTI [133], nuScenes [134], and Waymo [135]. Among them, Dong et al. [125] systematically designed 27 common corruptions in 3D object detection to benchmark the corruption robustness of existing detectors. By applying these corruptions comprehensively to public datasets, they established three corruption-robust benchmarks: KITTI-C, nuScenes-C, and Waymo-C. [125] denotes model performance on the original validation set as AP_clean. For each corruption type c at each severity s, [125] adopts the same metric to measure model performance as AP_{c,s}. The corruption robustness of a model is calculated by averaging over all corruption types and severities as

    AP_cor = (1/|C|) Σ_{c∈C} (1/5) Σ_{s=1}^{5} AP_{c,s},    (1)

where C is the set of corruptions in evaluation. It should be noted that for different kinds of 3D object detectors, the set of corruptions can be different (e.g., [125] does not evaluate camera noises for LiDAR-only models). Thus, the results of AP_cor are not directly comparable between different kinds of models, and [125] performs a fine-grained analysis under each corruption. It also calculates the relative corruption error (RCE) by measuring the percentage of performance drop as

    RCE_{c,s} = (AP_clean − AP_{c,s}) / AP_clean;    RCE = (AP_clean − AP_cor) / AP_clean.    (2)

Unlike KITTI-C and Waymo-C, nuScenes-C primarily assesses performance using the mean Average Precision (mAP) and the nuScenes Detection Score (NDS) computed across ten object categories. The mAP is determined using the 2D center distance on the ground plane instead of the 3D Intersection over Union (IoU). The NDS metric consolidates mAP with other aspects, such as scale and orientation, into a unified score. Analogous to KITTI-C, [125] denotes the model's performance on the validation set as mAP_clean and NDS_clean, respectively. The corruption robustness metrics, mAP_cor and NDS_cor, are evaluated by averaging over all corruption types and severities. Additionally, [125] calculates the Relative Corruption Error (RCE) under both the mAP and NDS metrics, similar to the formulation in Eq. (2).

Additionally, some studies [143], [146], [160] examine robustness in single-modal contexts. For instance, [146] proposes a LiDAR-only benchmark that utilizes physically-aware simulation methods to simulate degraded point clouds under various real-world common corruptions. This benchmark, tailored for point cloud detectors, includes 1,122,150 examples across 7,481 scenes, covering 25 common corruption types with six severity levels. Moreover, [146] devises novel evaluation metrics, including CE_AP (%) and mCE, and calculates the corruption error (CE) to assess performance degradation based on Overall Accuracy (OA) by

    CE^m_{c,s} = OA^m_clean − OA^m_{c,s},    (3)

where OA^m_{c,s} is the overall accuracy of detector m under corruption c at severity level s (excluding "clean," i.e., severity level 0) and OA^m_clean is the accuracy on clean data. For detector m, the mean CE (mCE) is calculated by

    mCE^m = ( Σ_{s=1}^{5} Σ_{c=1}^{25} CE^m_{c,s} ) / (5C),    (4)

where C = 25 is the number of corruption types.
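Concretely, the metrics of Eqs. (1)-(4) reduce to simple averages over a table of per-corruption, per-severity scores. A minimal sketch (the helper names and the toy numbers are ours, not from any cited benchmark):

```python
SEVERITIES = (1, 2, 3, 4, 5)

def ap_cor(ap, corruptions, severities=SEVERITIES):
    """Eq. (1): AP averaged over every corruption type and severity
    level; ap[(c, s)] holds AP_{c,s}."""
    vals = [ap[(c, s)] for c in corruptions for s in severities]
    return sum(vals) / len(vals)

def rce(ap_clean, ap_corrupted):
    """Eq. (2): relative corruption error, the fractional AP drop."""
    return (ap_clean - ap_corrupted) / ap_clean

def mce(oa_clean, oa, corruptions, severities=SEVERITIES):
    """Eqs. (3)-(4): mean corruption error, averaging the
    overall-accuracy drops CE_{c,s} = OA_clean - OA_{c,s}."""
    ces = [oa_clean - oa[(c, s)] for c in corruptions for s in severities]
    return sum(ces) / len(ces)
```

For example, with two corruption types whose AP stays at 0.6 and 0.4 across all severities, `ap_cor` averages to 0.5; `rce(0.8, 0.5)` then reports the corresponding relative drop of 37.5%.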
III. CAMERA-ONLY 3D OBJECT DETECTION

In this section, we introduce Camera-only 3D object detection methods. Compared to LiDAR-only methods, the camera solution is more cost-effective, and the images obtained from cameras require no complex preprocessing. Therefore, it is favored by many automotive manufacturers, particularly in the context of multi-view applications such as BEV (bird's-eye view) systems. Generally, as shown in Fig. 2, Camera-only methods can be categorized into three types: monocular, stereo-based, and multi-view (bird's-eye view). Due to the excellent cost-effectiveness of Camera-only methods, there have been numerous reviews and investigations conducted to summarize and explore them. However, the majority of existing reviews on 3D object detection are limited to specific methodologies, with a predominant focus on accuracy. This survey aims to revisit the fundamental considerations of safety-perception deployment, redefining the discourse around existing categorizations and exploring 'Accuracy, Latency, and Robustness' as the core dimensions for an in-depth analysis of current methodologies. The objective is to provide additional insights to guide the development of existing technologies.

Fig. 2: The general pipeline of Camera-only methods. (The figure sketches three pipelines: (a) monocular, where a camera input passes through prior-guided, camera-only, or depth-assisted feature optimization before the detection head; (b) stereo, where left/right images undergo 2D-detection-based or Pseudo-LiDAR-based perspective error correction before the detection head; and (c) multi-view, where depth-based or query-based view transformation produces multi-view features for the detection head.)

A. Monocular 3D object detection

Monocular 3D object detection refers to performing 3D object detection using only one camera, aiming to infer the 3D positions, sizes, and orientations of objects from a single image [131]. In recent years, monocular 3D object detection has gained increasing attention due to its advantages of low cost, low power consumption, and ease of deployment in real-world applications. However, monocular methods face many challenges owing to the insufficient 3D information in monocular pictures, such as accurately localizing 3D positions and handling occluded scenes. Overcoming these challenges relies on leveraging depth information to supplement the missing 3D information in monocular images. Typically, most approaches employ depth estimation tasks to acquire depth information from images. However, monocular depth estimation is an ill-posed and highly challenging task, prompting researchers to dedicate significant efforts to optimizing the accuracy and stability of depth estimation.

1) Prior-guided monocular 3D object detection: In recent years, prior-guided monocular methods [7], [8], [11]–[13], [18], [19], [21], [26], [161]–[169], [183], [346] have continuously explored how to utilize the hidden prior knowledge of object shapes and scene geometry in images to address the challenges of monocular 3D object detection. The effective integration of this prior knowledge is crucial for mitigating the uncertainty and ill-posed nature inherent in monocular 3D object detection problems. By introducing pre-trained subnetworks or auxiliary tasks, prior knowledge can provide additional information or constraints to assist in the accurate localization of 3D objects and enhance detection precision and robustness.

Widely adopted prior knowledge about 3D objects includes object shapes [162], [165], [166], [347]–[349], geometric consistency [7], [8], [12], [19], [169], [350], temporal constraints [179], [351], and segmentation information [165]. Object shape provides insights into the appearance and structure of an object, aiding in more accurate inference of the spatial position and pose of the object. Geometric consistency knowledge assists the model in better understanding the relative positional relationships between objects in the scene, thereby improving detection consistency and robustness. Temporal constraints consider the continuity and stability of an object across different frames, providing vital clues for object detection. Additionally, leveraging segmentation information enables the model to better comprehend semantic information in images, facilitating precise localization and identification of objects. As a result, current works are dedicated to further exploring and utilizing prior knowledge to enhance the performance and robustness of monocular 3D object detection by integrating prior knowledge with deep learning approaches, thus driving continuous development and innovation in this field.

2) Camera-only monocular 3D object detection: Camera-only monocular 3D object detection [7]–[9], [11], [18], [22], [24]–[26], [168], [178]–[183] is a class of methods that utilize images captured by a single camera to detect and localize 3D objects. Camera-only monocular methods employ convolutional neural networks (CNNs) to directly regress 3D bounding box parameters from images, enabling the estimation of the spatial dimensions and poses of objects in three dimensions. Inspired by 2D detection networks, these direct regression methods can be trained end-to-end, facilitating holistic learning and inference for 3D objects. The unique challenge of monocular 3D object detection lies in inferring objects' 3D positions, dimensions, and orientations solely from a single image without relying on additional depth maps or point cloud data. Consequently, the direct regression approaches demonstrate practicality and broad applicability. By learning features from images, CNNs can predict the 3D information of objects. The network gradually optimizes its parameters through end-to-end training to enhance the accurate extraction of 3D information. These direct regression methods streamline the entire detection process and reduce the reliance on supplementary information, improving the algorithms' robustness and generalization capability. Nevertheless, monocular 3D object detection still presents challenges, such as occlusion, viewpoint variations, and lighting conditions, which may affect the accuracy of 3D detection.
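Direct-regression pipelines of this kind typically predict, per object, an image-plane keypoint and a depth, and recover the 3D centre through the camera intrinsics via the pinhole model. A simplified, self-contained decoding step (the intrinsic values in the test are made up for illustration):

```python
def decode_center(u, v, depth, fx, fy, cx, cy):
    """Back-project a predicted 2D keypoint (u, v) with predicted
    depth z into a 3D centre in the camera frame:
        x = (u - cx) * z / fx,  y = (v - cy) * z / fy.
    (fx, fy) are focal lengths in pixels; (cx, cy) is the
    principal point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)
```

Keypoint-based heads pair such a decoding with regressed dimensions and orientation to assemble the full 3D box; the single predicted depth tends to dominate the localization error, which is why depth quality is so central to monocular methods.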

TABLE III: Camera-only 3D object detection methods.

Monocular
- Prior-guided (direct regression using geometric prior knowledge): Deep MANTA [CVPR2017] [161], Mono3D++ [AAAI2019] [162], 3D-RCNN [CVPR2018] [163], ROI-10D [CVPR2019] [164], MonoDR [ECCV2020] [165], Autolabeling [CVPR2020] [166], MonoPSR [CVPR2019] [21], 3DVP [CVPR2015] [167], MultiBin [CVPR2017] [168], M3D-RPN [ICCV2019] [11], SHIFT R-CNN [ICIP2019] [169], RTM3D [ECCV2020] [7], UR3D [ECCV2020] [19], Decoupled-3D [AAAI2020] [12], GUP Net [ICCV2021] [26], MonoFlex [CVPR2021] [8], Mix-Teaching [TCSVT2023] [28], MonoPair [CVPR2020] [13], MonoJSG [CVPR2022] [10], Geo Aug [CVPR2022] [170], Monoground [CVPR2022] [171], MonoPGC [ICRA2023] [172], MonoEdge [WACV2023] [173], GPro3D [Neurocomputing2023] [174], MonoGAE [arXiv2023] [175], GUPNet++ [arXiv2023] [176], NeurOCS [CVPR2023] [177].
- Camera-only (uses the RGB image information captured by the monocular camera): Smoke [CVPR2020] [178], Kinematic3D [ECCV2020] [179], FQNet [CVPR2019] [18], FCOS3D [CVPR2021] [24], PGD [CoRL2022] [25], CaDDN [CVPR2021] [22], MoVi-3D [ECCV2020] [180], MonoDIS [ICCV2019] [9], GS3D [CVPR2019] [181], MonoGRNet [TPAMI2021] [182], MonoRCNN [ICCV2021] [183], MonoFENet [TIP2019] [184], MonoCon [AAAI2022] [185], MonoXiver [ICCV2023] [186], SGM3D [RAL2022] [187], MonoDETR [ICCV2023] [29], MonoDTR [CVPR2022] [23], DiD-M3D [ECCV2022] [188], MonoNeRD [ICCV2023] [189], MonoSAID [IRS2024] [190], WeakMono3D [CVPR2023] [191], DDCDC [Neurocomputing2023] [192], Obmo [TIP2023] [193], Shape-Aware [TITS2023] [194], Lite-FPN [KBS2023] [195], OOD-M3D [TCE2024] [196], MonoTDP [arXiv2023] [197], Cube R-CNN [CVPR2023] [198], M2S [TIP2023] [199].
- Depth-assisted (extracting depth information via camera parallax): PatchNet [ECCV2020] [200], DD3D [ICCV2021] [27], Pseudo-LiDAR [CVPR2019] [6], DeepOptics [ICCV2019] [201], AM3D [ICCV2019] [20], MonoTAKD [arXiv2024] [202], MonoPixel [TITS2022] [203], DDMP-3D [CVPR2021] [204], D4LCN [CVPRW2020] [205], ADD [AAAI2023] [206], PDR [TCSVT2023] [207], Pseudo-Mono [ECCV2022] [208], Deviant [ECCV2022] [209], CMAN [TITS2022] [5], ODM3D [WACV2024] [210], MonoGAE [arXiv2023] [175], FD3D [AAAI2023] [211], MonoSKD [arXiv2023] [212].

Stereo
- 2D-Detection-based (integrate 2D information about the object into the image): Disp R-CNN [CVPR2020] [213], TL-Net [CVPR2019] [214], ZoomNet [AAAI2020] [215], IDA-3D [CVPR2020] [216], YOLOStereo3D [ICRA2021] [14], SIDE [WACV2022] [217], VPFNet [TMM2022] [218], FCNet [Entropy2022] [219], MC-Stereo [arXiv2023] [220], PCW-Net [ECCV2022] [221], ICVP [ICIP2023] [222], MoCha-Stereo [arXiv2024] [223], UCFNet [TPAMI2023] [224], IGEV-Stereo [CVPR2023] [225], NMRF-Stereo [arXiv2024] [226].
- Pseudo-LiDAR-based (incorporate additional information from pseudo-LiDAR to simulate LiDAR depth): Pseudo-LiDAR [CVPR2019] [6], Pseudo-LiDAR++ [ICLR2020] [227], E2E-PL [CVPR2020] [228], CG-Stereo [IROS2020] [229], SGM3D [RAL2022] [187], RTS3D [AAAI2023] [230], RT3DStereo [ITS2019] [231], RT3D-GMP [ITSC2020] [232], CDN [NIPS2020] [233].
- Volume-based (perform 3D object detection directly on 3D stereo volumes): GC-Net [ICCV2017] [234], ESGN [TCSVT2022] [235], DSGN [CVPR2020] [17], DSGN++ [TPAMI2022] [236], LIGA-Stereo [ICCV2021] [237], PLUMENet [IROS2021] [238], Selective-IGEV [arXiv2024] [239], ViTAS [arXiv2024] [240], LaC+GANet [AAAI2022] [241], DMIO [arXiv2024] [242], HCR [IVC2024] [243], LEAStereo [NIPS2020] [244], CREStereo [CVPR2022] [245], Abc-Net [TVC2022] [246], AcfNet [AAAI2020] [247], CAL-Net [ICASSP2021] [248], CFNet [CVPR2021] [249], PFSMNet [TITS2021] [250], DCVSMNet [arXiv2024] [251], DPCTF [TIP2021] [252], ACVNet [CVPR2022] [253].

Multi-view
- Depth-based (convert 2D spatial features into 3D spatial features through depth estimation): BEVDepth [AAAI2023] [15], BEVDet [arXiv2021] [254], BEVDet4D [arXiv2022] [33], LSS [ECCV2020] [255], BEVHeight [CVPR2023] [256], BEVHeight++ [arXiv2023] [35], BEV-SAN [CVPR2023] [257], BEVUDA [arXiv2022] [258], BEVPoolv2 [arXiv2022] [259], BEVStereo [AAAI2023] [260], BEVStereo++ [arXiv2023] [261], TiG-BEV [arXiv2022] [262], DG-BEV [CVPR2023] [263], HotBEV [NeurIPS2024] [264], BEVNeXt [CVPR2024] [265].
- Query-based (influenced by the transformer technology stack, these methods explicitly or implicitly query bird's-eye-view (BEV) features): PolarFormer [AAAI2023] [266], SparseBEV [ICCV2023] [34], BEVFormer [ECCV2022] [16], PETR [ECCV2022] [31], PETRv2 [ICCV2023] [32], M3DETR [WACV2022] [84], FrustumFormer [CVPR2023] [267], DETR4D [arXiv2022] [268], Sparse4D [arXiv2022] [269], Sparse4D v2 [arXiv2023] [270], Sparse4D v3 [arXiv2023] [271], SOLOFusion [ICLR2022] [272], CAPE [CVPR2023] [273], VEDet [CVPR2023] [274], Graph-DETR3D [ACMMM] [275], 3DPPE [CVPR2023] [276], BEVDistill [ICLR2023] [277], StreamPETR [ICCV2023] [278], Far3D [ICCV2023] [279], CLIP-BEVFormer [CVPR2024] [280], BEVFormer v2 [CVPR2023] [281].
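The depth-assisted and Pseudo-LiDAR entries in Table III share one core operation: lifting a (typically estimated) depth map into a pseudo point cloud through pinhole geometry, so that a LiDAR-style detector can consume it. A minimal sketch (ours; real pipelines use calibrated intrinsics and learned depth):

```python
def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a dense depth map (list of rows, values in
    metres) into 3D points in the camera frame, skipping pixels
    with no valid depth estimate."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # missing or invalid depth
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```

The detector then treats these points exactly like LiDAR returns; the residual gap to real LiDAR stems from depth-estimation error rather than from the detector itself.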

The representative work Smoke [178] abandons the regression of 2D bounding boxes and predicts the 3D box for each detected object by combining the estimation of individual key points with the regression of 3D variables.

3) Depth-assisted monocular 3D object detection: Depth estimation plays a crucial role in depth-assisted monocular 3D object detection. To achieve more accurate monocular detection results, numerous studies [20], [27], [200], [201] leverage pre-trained auxiliary depth estimation networks. Specifically, the process begins by transforming monocular images into depth images using pre-trained depth estimators, such as MonoDepth [352]. Subsequently, two primary methodologies are employed to handle depth images and monocular images. Remarkable progress has been made in Pseudo-LiDAR detectors that use a pre-trained depth estimation network to generate Pseudo-LiDAR representations [200], [353]. However, there is a significant performance gap between Pseudo-LiDAR and LiDAR-only detectors due to the errors in image-to-LiDAR generation. Thus, Hong et al. [354] attempted to transfer deeper structural information from point clouds to assist monocular image detection. By leveraging the mean-teacher framework, they aligned the outputs of the LiDAR-only teacher model and the Camera-only student model at both the feature level and the response level, aiming to achieve cross-modal knowledge transfer. Such depth-assisted monocular 3D object detection, by effectively integrating depth information, not only enhances detection accuracy but also extends the applicability of monocular vision to tasks involving 3D scene understanding.

B. Stereo-based 3D object detection

Stereo-based 3D object detection is designed to identify and localize 3D objects using a pair of stereo images. Leveraging the inherent capability of stereo cameras to capture dual perspectives, stereo-based methods excel in acquiring highly accurate depth information through stereo matching and calibration, a feature that distinguishes them from monocular camera setups. Despite these advantages, stereo-based methods still face a considerable performance gap when compared to LiDAR-only counterparts. Furthermore, the realm of 3D object detection from stereo images remains relatively underexplored, with only limited research efforts dedicated to this domain. Specifically, these approaches involve the utilization of image pairs captured from distinct viewpoints to estimate the 3D spatial depth of each object.

1) 2D-detection-based methods: Traditional 2D object detection frameworks can be modified to address stereo detection

problems. Stereo R-CNN [284] employs an image-based 2D detector to predict 2D proposals, generating left and right regions of interest (RoIs) for the corresponding left and right images. Subsequently, in the second stage, it directly estimates the parameters of 3D objects based on the previously generated RoIs. This paradigm has been widely adopted by subsequent works [14], [194], [213]–[217].

2) Pseudo-LiDAR-only methods: The disparity map predicted from stereo images can be transformed into a depth map and further converted into pseudo-LiDAR points. Consequently, similar to monocular detection methods, pseudo-LiDAR representations can also be employed in stereo-based 3D object detection approaches. These methods aim to enhance disparity estimation in stereo matching to achieve more accurate depth predictions. Regarding the contribution of depth in 3D detection, Wang et al. [6] are pioneers in introducing the Pseudo-LiDAR representation. This representation is generated from an image with a depth map, requiring the model to perform a depth estimation task to assist in detection. Subsequent works have followed this paradigm and made optimizations by introducing additional color information to augment the pseudo point cloud [20], auxiliary tasks (instance segmentation [355], foreground and background segmentation [356], and domain adaptation [357]), and coordinate transform schemes [200], [358]. To achieve both high accuracy and high responsiveness, Meng et al. [359] propose a lightweight Pseudo-LiDAR 3D detection system. These studies indicate that the power of the pseudo-LiDAR representation stems from the coordinate transformation rather than the point cloud representation itself.

TABLE IV: A comprehensive performance analysis of various categories of Camera-only 3D object detection methods across different datasets. We report the inference time (ms) originally reported in the papers, AP3D (%) for 3D car detection on the KITTI test benchmark, and mAP (%) and NDS scores on the nuScenes test set. ‘R.E.P.’ denotes ‘Representation’. ‘PUB’ denotes ‘Publication’. ‘M.V.’ denotes ‘Multi-view’. ‘L.T.’ denotes ‘Latency Time’.

Method | PUB | L.T. | GPU | KITTI Car (Easy / Mod. / Hard) | nuScenes (mAP / NDS)
[R.E.P. = Mono.]
FQNet [18] CVPR2019 500 1080Ti 2.77 1.51 1.01 - -
ROI-10D [164] CVPR2019 200 - 4.32 2.02 1.46 - -
MonoGRNet [182] AAAI2019 60 TITANX 9.61 5.74 4.25 - -
MonoDIS [9] CVPR2019 100 V100 10.37 7.94 6.40 30.4 38.4
MonoPair [13] CVPR2020 60 1080Ti 13.04 9.99 8.65 - -
SMOKE [178] CVPR2020 30 TITANX 14.03 9.76 7.84 - -
PatchNet [200] ECCV2020 400 1080 15.68 11.12 10.17 - -
CaDDN [22] CVPR2021 - - 19.17 13.41 11.46 - -
FCOS3D [24] CVPR2021 - - - - - 35.8 42.8
MonoFlex [8] CVPR2021 30 2080Ti 19.94 13.89 12.07 - -
PGD [25] CVPR2022 28 1080Ti - - - 38.6 44.8
MonoDTR [23] CVPR2022 37 V100 21.99 15.39 12.73 - -
NeurOCS [177] CVPR2023 - - 29.89 18.94 15.90 - -
MonoATT [282] CVPR2023 56 3090 24.72 17.37 15.00 - -
MonoDETR [29] ICCV2023 38 3090 25.00 16.47 13.58 - -
MonoCD [283] CVPR2024 36 2080Ti 25.53 16.59 14.53 - -
[R.E.P. = Stereo]
RT3DStereo [231] ITS2019 79 TITANX 29.90 23.28 18.96 - -
Stereo R-CNN [284] CVPR2019 420 TITANXp 47.58 30.23 23.72 - -
Pseudo-LiDAR [6] CVPR2019 - - 54.53 34.05 28.25 - -
OC-Stereo [285] ICRA2020 350 TITANXp 55.15 37.60 30.25 - -
ZoomNet [215] AAAI2020 - - 55.98 38.64 30.97 - -
Disp R-CNN [213] CVPR2020 - - 58.53 37.91 31.93 - -
DSGN [17] CVPR2020 682 V100 73.50 52.18 45.14 - -
CG-Stereo [229] IROS2020 570 2080Ti 74.39 53.58 46.50 - -
YoloStereo3D [14] ICRA2021 80 1080Ti 65.68 41.25 30.42 - -
LIGA-Stereo [237] ICCV2021 400 TITANXp 81.39 64.66 57.22 - -
PLUMENet [238] IROS2021 150 V100 83.00 66.30 56.70 - -
ESGN [235] TCSVT2022 62 3090 65.80 46.39 38.42 - -
SNVC [286] AAAI2022 - - 78.54 61.34 54.23 - -
DSGN++ [236] TPAMI2022 281 2080Ti 83.21 67.37 59.91 - -
StereoDistill [287] AAAI2023 - - 81.66 66.39 57.39 - -
[R.E.P. = M.V.]
BEVDet [254] arXiv2021 526 3090 - - - 42.2 48.2
DETR3D [30] PMLR2022 - - - - - 41.2 47.9
Graph-DETR3D [275] ACMMM2022 - - - - - 42.5 49.5
BEVDet4D [33] arXiv2022 526 3090 - - - 42.1 54.5
PETR [31] ECCV2022 93 V100 - - - 44.1 50.4
BEVFormer [16] ECCV2022 588 V100 - - - 48.1 56.9
Sparse4D [269] arXiv2022 164 3090 - - - 51.1 59.5
PolarFormer [266] AAAI2023 - - - - - 49.3 57.2
BEVDistill [277] ICLR2023 - - - - - 49.8 59.4
VEDet [274] CVPR2023 - - - - - 50.5 58.5
PETRv2 [32] ICCV2023 53 3090 - - - 51.9 60.1
BEVDepth [15] AAAI2023 - - - - - 52.0 60.9
BEVStereo [260] AAAI2023 - - - - - 52.5 61.0
DistillBEV [288] ICCV2023 - - - - - 52.5 61.2
BEVStereo++ [261] arXiv2023 - - - - - 54.6 62.5
SparseBEV [34] ICCV2023 43 3090 - - - 55.6 63.6
CAPE [273] CVPR2023 - - - - - 52.5 61.0
Sparse4Dv2 [270] arXiv2023 49 3090 - - - 55.7 63.8
Sparse4Dv3 [271] arXiv2023 51 3090 - - - 57.0 65.6
StreamPETR [278] CVPR2023 32 3090 - - - 62.0 67.6
Far3D [279] ICCV2023 - - - - - 63.5 68.7
BEVNeXt [265] CVPR2024 227 3090 - - - 55.7 64.2
CLIP-BEVFormer [280] CVPR2024 - - - - - 44.7 54.7

TABLE V: A comprehensive performance analysis of various categories of LiDAR-only 3D object detection methods across different datasets. ‘P.V.’ denotes ‘Point-Voxel based’. The other settings are the same as Table IV.

Method | PUB | L.T. | GPU | KITTI Car (Easy / Mod. / Hard) | nuScenes (mAP / NDS)
[R.E.P. = View]
PIXOR [289] CVPR2018 35 TITANXp 81.70 77.05 72.95 - -
HDNet [290] CoRL2018 - - 89.14 86.57 78.32 - -
BirdNet [291] ITSC2018 - - 75.52 50.81 50.00 - -
RCD [88] arXiv2020 301 V100 85.37 82.61 77.80 - -
RangeRCNN [87] arXiv2020 45 V100 88.47 81.33 77.09 - -
RangeIoUDet [89] CVPR2021 22 V100 88.60 79.80 76.76 - -
RangeDet [90] ICCV2021 83 2080Ti 85.41 77.36 72.60 - -
[R.E.P. = Point]
IPOD [70] arXiv2018 - - 71.40 53.46 48.34 - -
PointRGCN [292] arXiv2019 262 1080Ti 85.97 75.73 70.60 - -
StarNet [293] arXiv2019 - - 81.63 73.99 67.07 - -
PointRCNN [45] CVPR2019 - - 85.94 75.76 68.32 - -
STD [98] ICCV2019 80 TITANv 87.95 79.71 75.09 - -
PI-RCNN [294] AAAI2020 11 TITAN 84.37 74.82 70.03 - -
Point-GNN [44] CVPR2020 643 1070 88.33 79.47 72.29 - -
3DSSD [40] CVPR2020 38 TITANv 88.36 79.57 74.55 - -
3D-CenterNet [295] PR2021 19 TITANXp 86.83 80.17 75.96 - -
DGCNN [80] NeurIPS2021 - - - - - 53.3 63.0
PC-RGNN [296] AAAI2021 - - 89.13 79.90 75.54 - -
Pointformer [66] CVPR2021 - - 87.13 77.06 69.25 - -
IA-SSD [297] CVPR2022 12 2080Ti 88.34 80.13 75.04 - -
SASA [94] AAAI2022 36 V100 88.76 82.16 77.16 - -
SVGA-Net [63] AAAI2022 - - 87.33 80.47 75.91 - -
PG-RCNN [298] TGRS2023 60 3090 89.38 82.13 77.33 - -
[R.E.P. = Voxel]
SECOND [43] Sensors2018 50 1080Ti 83.13 73.66 66.20 - -
VoxelNet [42] CVPR2018 220 TITANX 77.47 65.11 57.73 - -
PointPillars [51] CVPR2019 16 1080Ti 79.05 74.99 68.30 - -
CBGS [75] arXiv2019 - - - - - 52.8 63.3
PartA2 [299] TPAMI2020 71 TITANXp 85.94 77.86 72.00 - -
Voxel-FPN [48] Sensors2020 20 1080Ti 85.64 76.70 69.44 - -
TANet [300] AAAI2020 35 TITANv 83.81 75.38 67.66 - -
CVC-Net [78] NeurIPS2020 - - - - - 55.8 64.2
SegVoxelNet [301] ICRA2020 40 1080Ti 84.19 75.81 67.80 - -
HotSpotNet [302] ECCV2020 40 V100 87.60 78.31 73.34 59.3 66.0
Associate-3Ddet [71] CVPR2020 60 1080Ti 85.99 77.40 70.53 - -
CenterPoint [57] CVPR2021 70 TITAN - - - 58.0 65.5
CIA-SSD [77] AAAI2021 31 TITANXp 89.59 80.28 72.87 - -
SIEV-NET [303] TGRS2021 45 1080Ti 85.21 76.18 70.06 - -
VoTr-TSD [58] ICCV2021 139 V100 89.90 82.09 79.14 - -
Voxel R-CNN [47] AAAI2021 40 2080Ti 90.90 81.62 77.06 - -
PillarNet [93] ECCV2022 - - - - - 66.0 71.4
VoxelNeXt [46] CVPR2023 - - - - - 64.5 70.0
[R.E.P. = P.V.]
PV-RCNN [304] CVPR2020 80 1080Ti 90.25 81.43 76.82 - -
SA-SSD [305] CVPR2020 40 2080Ti 88.75 79.79 74.16 - -
HVPR [83] CVPR2021 28 2080Ti 86.38 77.92 73.04 - -
VIC-NET [306] ICRA2021 - - 88.60 81.57 77.09 - -
PVGNet [82] CVPR2021 - - 89.94 81.81 77.09 - -
CT3D [67] ICCV2021 - - 87.83 81.77 77.16 - -
Pyramid R-CNN [86] ICCV2021 - - 88.39 82.08 77.49 - -
PV-RCNN++ [50] IJCV2023 - - 90.14 81.88 77.15 - -
VP-Net [36] TGRS2023 59 2080Ti 90.46 82.03 79.65 - -
SASAN [307] TNNLS2023 104 V100 90.40 81.90 77.20 - -
PVT-SSD [308] CVPR2023 49 3080Ti 90.65 82.29 76.85 - -
HCPVF [309] TCSVT2023 70 3090 89.34 82.63 77.72 - -
APVR [310] TAI2023 - - 91.45 82.17 78.08 58.6 65.9
HPV-RCNN [311] TCSS2023 81 A100 89.33 80.61 75.53 - -

3) Volume-based methods: The general procedure of volume-based methods is to generate a cost volume from the left and the right images to represent disparity information,
which is then utilized in the subsequent detection process. Volume-based methods bypass the pseudo-LiDAR representation and perform 3D object detection directly on 3D stereo volumes. These methods have inherited the traditional matching idea, but most computations now rely on 3D convolutional networks, such as those found in references [17], [234], [235], [237], [241], [244], [253], [360]. For example, the pioneering work GC-Net [234] uses an end-to-end neural network for stereo matching, obviating the need for any post-processing steps, and regressively computes disparity from a cost volume constructed by a pair of stereo features. GwcNet [360] employs a proposed group-wise correlation method to construct the cost volume. LEAStereo [244] utilizes NAS technology to select the optimal structure for the 3D cost volume. GANet [241] designs a semi-global aggregation layer and a local guidance aggregation layer to further improve accuracy. ACVNet [253] introduces an attention concatenation unit to generate more accurate similarity metrics. DSGN [17] proposes a 3D geometric volume derived from stereo matching networks and applies a grid-based 3D detector on the volume for 3D object detection. LIGA-Stereo [237] uses a LiDAR-based detector as a teacher model to guide geometry-aware feature learning. ESGN [235] achieves efficient stereo matching through an efficient geometry-aware feature generation (EGFG) module. Due to the benefits of large-scale training data and end-to-end training, deep learning-based stereo methods have achieved outstanding results [242].

TABLE VI: A comprehensive performance analysis of various categories of multi-modal 3D object detection methods across different datasets. ‘P.P.’ denotes Point-Projection. ‘F.P.’ denotes Feature-Projection. ‘A.P.’ denotes Auto-Projection. ‘D.P.’ denotes Decision-Projection. ‘Q.L.’ denotes Query-Learning. ‘U.F.’ denotes Unified-Feature. The other settings are the same as Table IV.

Method | PUB | L.T. | GPU | KITTI Car (Easy / Mod. / Hard) | nuScenes (mAP / NDS)
[R.E.P. = P.P.]
MVX-Net [312] ICRA2019 - - 85.50 73.30 67.40 - -
RoarNet [313] IV2019 - - 83.71 73.04 59.16 - -
ComplexerYOLO [314] CVPRW2019 16 1080Ti 55.63 49.44 44.13 - -
PointPainting [100] CVPR2020 - - 82.11 71.70 67.08 46.4 58.1
EPNet [104] ECCV2020 - - 89.81 79.28 74.59 - -
PointAugmenting [315] CVPR2021 542 1080Ti - - - 66.8 71.0
FusionPainting [316] ITSC2021 - - - - - 66.5 70.7
MVP [317] NeurIPS2021 - - - - - 66.4 70.5
Centerfusion [318] WACV2021 - - - - - - -
EPNet++ [105] TPAMI2022 - - 91.37 81.96 76.71 - -
MSF [319] TGRS2024 63 V100 - - - 68.2 71.6
PPF-Det [320] TITS2024 29 TITANX 89.51 84.46 78.91 - -
[R.E.P. = F.P.]
Cont Fuse [321] ECCV2018 60 - 82.54 66.22 64.04 - -
MMF [322] CVPR2019 80 - 86.81 76.75 68.41 - -
Focals Conv [110] CVPR2022 125 2080Ti 90.55 82.28 77.59 67.8 71.8
VFF [323] CVPR2022 - - 89.50 82.09 79.29 68.4 72.4
LargeKernel3D [60] CVPR2023 145 2080Ti - - - 71.2 74.2
SupFusion [118] ICCV2023 - - - - - 56.6 64.6
VoxelNextFusion [122] TGRS2023 54 A6000 90.90 82.93 80.6 68.8 72.5
RoboFusion [324] IJCAI2024 322 A100 91.75 84.08 80.71 69.9 72.0
[R.E.P. = A.P.]
PI-RCNN [294] AAAI2020 90 TITAN 84.37 74.82 70.03 - -
3D-CVF [41] ECCV2020 75 1080Ti 89.20 80.05 73.11 - -
3D Dual-Fusion [325] arXiv2022 - - 91.01 82.40 79.39 70.6 73.1
AutoAlignV2 [326] ECCV2022 208 V100 - - - 68.4 72.4
HMFI [120] ECCV2022 - - 88.90 81.93 77.30 - -
LoGoNet [121] ICCV2023 - - 91.80 85.06 80.74 - -
GraphAlign [111] ICCV2023 26 A6000 90.96 83.49 80.14 66.5 70.6
GraphAlign++ [112] TCSVT2024 149 V100 90.98 83.76 80.16 68.5 72.2
[R.E.P. = D.P.]
CLOCs [327] IROS2020 - - 83.68 68.78 61.67 - -
AVOD [328] IROS2018 100 TITANXp 81.94 71.88 66.38 - -
MV3D [329] CVPR2017 240 TITANX 71.09 62.35 55.12 - -
F-PointNets [330] CVPR2018 - - 81.20 70.39 62.19 - -
F-ConvNet [96] IROS2019 - - 82.11 71.70 67.08 46.4 58.1
F-PointPillars [331] ICCVW2021 - - 88.90 79.28 78.07 - -
Fast-CLOCs [332] WACV2022 - - 89.11 80.34 76.98 - -
Graph R-CNN [113] ECCV2022 13 1080Ti 91.89 83.27 77.78 - -
[R.E.P. = Q.L.]
TransFusion [108] CVPR2022 265 V100 - - - 68.9 71.7
DeepInteraction [116] NeurIPS2022 204 A100 - - - 70.8 73.4
SparseFusion [101] ICCV2023 188 A6000 - - - 72.0 73.8
AutoAlign [333] IJCAI2022 - - - - - 65.8 70.9
SparseLIF [334] arXiv2024 340 A100 - - - 75.9 77.7
FusionFormer [335] arXiv2024 263 A100 - - - 71.4 74.1
FSF [336] TPAMI2024 141 3090 - - - 70.6 74.0
[R.E.P. = U.F.]
BEVFusion-PKU [109] NeurIPS2022 - - - - - 69.2 71.8
BEVFusion-MIT [337] ICRA2023 119 3090 - - - 70.2 72.9
EA-BEV [338] arXiv2023 195 V100 - - - 71.2 73.1
BEVFusion4D [339] arXiv2023 500 V100 - - - 72.0 73.5
FocalFormer3D [340] ICCV2023 109 V100 - - - 71.6 73.9
FUTR3D [106] CVPR2023 - - - - - 69.4 72.1
UniTR [341] ICCV2023 107 A100 - - - 70.9 74.5
VirConv [99] CVPR2023 92 V100 92.48 87.20 82.45 68.7 72.3
MSMDFusion [342] CVPR2023 265 V100 - - - 71.5 74.0
SFD [114] CVPR2022 10 2080Ti 91.73 84.76 77.92 - -
CMT [103] ICCV2023 167 A100 - - - 72.0 74.1
UVTR [65] NeurIPS2022 - - - - - 67.1 71.1
ObjectFusion [117] ICCV2023 274 V100 - - - 71.0 73.3
GraphBEV [343] arXiv2024 141 A100 - - - 71.7 73.6
ContrastAlign [344] arXiv2024 154 A100 - - - 71.8 73.8
IS-Fusion [345] CVPR2024 - - - - - 73.0 75.2

C. Multi-view 3D object detection

Recently, multi-view 3D object detection has demonstrated superior accuracy and robustness compared to monocular and stereo 3D object detection approaches. In contrast to LiDAR-only 3D object detection, the latest panoramic Bird’s Eye View (BEV) approaches eliminate the need for high-precision maps, elevating detection from 2D to 3D. This advancement has led to significant developments in multi-view 3D object detection. In comparison to previous reviews [4], [123], [126], [130], [131], there has been extensive research on effectively leveraging multi-view images for 3D object detection. A key challenge in multi-camera 3D object detection is recognizing the same object across different images and aggregating object features from multiple view inputs. The common current practice involves uniformly mapping multi-view features to the Bird’s Eye View (BEV) space. Therefore, multi-view 3D object detection, also called BEV-camera-only 3D object detection, revolves around the core challenge of unifying 2D views into the BEV space. Based on different spatial transformations, it can be categorized into two main lines of methods. One comprises depth-based methods [15], [33], [35], [254]–[258], [262], [263], [267], [361], [362], represented by LSS [255] and also known as 2D-to-3D transformation. The other comprises query-based methods [16], [31], [32], [34], [84], [266]–[279], [363], represented by DETR3D [30], which make a query from 3D to 2D.

1) Depth-based Multi-view methods: The direct transformation from 2D to BEV space poses a significant challenge. LSS [255] was the first to propose a depth-based method, utilizing 3D space as an intermediary. This approach involves initially predicting the grid depth distribution of 2D features and then elevating these features to voxel space. This method holds promise for achieving the transformation from 2D to BEV space more effectively. Following LSS [255], CaDDN [22] adopted a similar depth representation approach. It employed a network structure akin to LSS, primarily for predicting a categorical depth distribution. By compressing voxel-space features into BEV space, it performed the final 3D detection. It is worth noting that CaDDN is not part of multi-view 3D object detection but rather single-view 3D object detection, which has influenced subsequent research on depth. The main distinction between LSS [255] and CaDDN [22] lies in
CaDDN’s use of actual ground truth depth values to supervise its prediction of categorical depth distribution, resulting in a superior depth network capable of more accurately extracting 3D information from 2D space. This line of research has sparked a series of subsequent studies, such as BEVDet [254], its temporal version BEVDet4D [33], and BEVDepth [15]. These studies are significant in advancing the transformation from 2D to 3D space and enabling more accurate object detection in the BEV space, providing valuable insights and directions for the field’s development. Furthermore, some studies have addressed the issue of insufficient depth solely by encoding height information. These studies have found that with increasing distance, the depth disparity between the car and the ground rapidly diminishes [35], [256].

2) Query-based Multi-view methods: Under the influence of Transformer technology, such as in the works [364]–[367], query-based Multi-view methods retrieve 2D spatial features from 3D space. Inspired by Tesla’s perception system, DETR3D [30] introduces 3D object queries to address the aggregation of multi-view features. It achieves this by extracting image features from different perspectives and projecting them into 2D space using learned 3D reference points, thus obtaining image features in the Bird’s Eye View (BEV) space. Query-based Multi-view methods, as opposed to Depth-based Multi-view methods, acquire sparse BEV features by employing a reverse querying technique, fundamentally impacting subsequent query-based developments [16], [31], [32], [34], [84], [266]–[279], [363]. However, due to the potential inaccuracies associated with explicit 3D reference points, PETR [31], influenced by DETR [368] and DETR3D [30], adopts an implicit positional encoding method for constructing the BEV space, influencing subsequent works [32], [278].

D. Analysis: Accuracy, Latency, Robustness

Currently, the 3D object detection solutions based on Bird’s Eye View (BEV) perception are rapidly advancing. Despite the existence of numerous reviews [4], [123], [126], [130], [131], a comprehensive review of this field remains inadequate. It is noteworthy that Shanghai AI Lab and SenseTime Research have provided a thorough review [369] of the technical roadmap for BEV solutions. However, unlike existing reviews [4], [123], [126], [130], [131], which primarily focus on the technical roadmap and the current state of the art, we consider crucial aspects such as autonomous driving safety perception. Following an analysis of the technical roadmap and the current state of development for Camera-only solutions, we base our discussion on the foundational principles of ‘Accuracy, Latency, and Robustness’, integrating the perspectives of safety perception to guide the practical implementation of safety perception in autonomous driving.

1) Accuracy: Accuracy is a focal point of interest in most research articles and reviews and is indeed of paramount importance. While accuracy can be reflected through AP (average precision), considering AP alone for comparison may not provide a comprehensive view, as different methodologies may exhibit substantial differences due to differing paradigms. As shown in Fig. 3 (a), we selected ten representative methods (including classic and latest research) for comparison, and it is evident that there are significant metric disparities between monocular 3D object detection [10], [13], [23], [28], [29], [178], [182], [183] and stereo-based 3D object detection [14], [17], [215], [230], [233], [235], [236], [285], [371]. The current scenario indicates that the accuracy of monocular 3D object detection is far lower than that of stereo-based 3D object detection. Stereo-based 3D object detection leverages the capture of images from two different perspectives of the same scene to obtain depth information; the greater the baseline between cameras, the wider the range of depth information captured. As shown in Fig. 3 (b), there were monocular 3D object detection methods [9], [24], [25], [27], [372] on the nuScenes dataset [134], but no related research on stereo-based 3D object detection.

Starting from 2021, monocular methods have gradually been supplanted by multi-view (bird’s-eye-view perception) 3D object detection methods [15], [16], [30]–[32], [271], [272], [278], [279], [281], leading to a significant improvement in mAP. The emergence of the novel bird’s-eye-view paradigm and the increase in sensor quantity have substantially impacted mAP. It can be observed that initially, the disparity between DD3D [27] and DETR3D [30] is not prominent, but with the continuous enhancement of multi-view 3D object detection, particularly with the advent of novel works such as Far3D [279], the gap has widened. In other words, camera-only 3D object detection methods on multi-camera datasets like nuScenes [134] are predominantly based on bird’s-eye-view perception. If we consider accuracy solely from this single dimension, the increase in sensor quantity has significantly improved accuracy metrics (including mAP, NDS, AP, etc.).

2) Latency: In 3D object detection, latency (frames per second, FPS) and accuracy are critical metrics for evaluating algorithm performance [378]. As shown in Table IV, monocular-based 3D object detection, which relies on data from a single camera, typically achieves higher FPS due to lower computational requirements. However, its accuracy is often inferior to stereo or multi-view systems due to the absence of depth information. Stereo-based detection, leveraging disparity information from dual cameras, enhances depth estimation accuracy but introduces greater computational complexity, potentially reducing FPS. Multi-view detection provides richer scene information and improved accuracy but demands extensive data processing, computational power, and algorithmic optimization for reasonable FPS levels. Notably, the nuScenes dataset lacks representation of stereo-based methods, with the monocular method FCOS3D [24], introduced in 2021, standing out as emblematic. Over time, multi-view 3D object detection has rapidly evolved in terms of accuracy and latency. In practice, real-time performance is also an important consideration when deploying a robust 3D object detection system. For example, ER3D [379] takes stereo images as input and predicts 3D bounding boxes, leveraging a fast but inaccurate method of semi-global matching for depth estimation. Li et al. [380] propose a lightweight Pseudo-LiDAR 3D detection system that achieves high accuracy and responsiveness. RTS3D [230] proposes a novel framework for faster and more accurate 3D object detection using stereo images. FastFusion [381], a three-stage stereo-LiDAR deep fusion scheme, integrates
[Fig. 3: six comparison panels; columns: KITTI test (left) and nuScenes test (right); rows: Camera-Only (a, b), LiDAR-Only (c, d), Multi-modal (e, f).]
Fig. 3: (a) The AP3D comparison of monocular-based methods [10], [13], [23], [28], [29], [178], [182], [183], [185], [370] and
stereo-based methods [14], [17], [213], [215], [230], [233], [235], [236], [285], [371] on KITTI test dataset. (b) The mAP (left)
and NDS (right) comparison of monocular-based methods [9], [24], [25], [27], [372] and Multi-view methods [15], [16], [30]–
[32], [271], [272], [278], [279], [281] on the nuScenes test dataset. (c) The AP3D comparison of View-based methods [87]–[90],
Voxel-based methods [43], [47], [51], [53], [54], [58], [298], [299], Point-based [40], [44], [45], [94], [294], [297], [373], and
Point-Voxel-based methods [83], [86], [98], [304], [374] on KITTI test dataset. (d) The mAP (left) and NDS (right) comparison
of Voxel-based methods [46], [51], [57], [65], [93], [108], [110], [340], [372] and Point-based methods [40] on the nuScenes
test dataset. (e) The AP3D comparison of Point-Projection-based (P.P.) methods [100], [104], [105], [312], Feature-Projection-
based (F.P.) methods [110], [118], [322], Auto-Projection-based (A.P.) methods [41], [111], [120], [121], [294], [325], [375],
Decision-Projection-based (D.P.) methods [96], [327]–[329], [331], [332], [376], and Query-Learning-based (Q.L.) methods
[377] on KITTI test dataset. (f) The mAP (left) and NDS (right) comparison of Point-Projection-based (P.P.) methods [315],
Feature-Projection-based (F.P.) methods [60], Auto-Projection-based (A.P.) methods [111], [326], Query-Learning-based (Q.L.)
methods [108], [116], [333] and Unified-Feature-based (U.F.) methods [65], [101], [103], [106], [109], [338]–[342] on the
nuScenes test dataset.
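All of the AP-style numbers compared in Fig. 3 derive from the same underlying metric: the area under the precision-recall curve. The sketch below is a generic, benchmark-agnostic version of all-point interpolated AP; actual benchmarks differ in box matching and recall sampling (e.g., KITTI evaluates at 40 discrete recall points), so this is illustrative rather than a reimplementation of any leaderboard.

```python
import numpy as np

def average_precision(recall, precision):
    """All-point interpolated AP: area under the precision-recall
    curve after enforcing monotonically non-increasing precision.
    `recall`/`precision` are arrays sorted by increasing recall."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Sweep right-to-left so each precision is the max achievable
    # at that recall or any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]  # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

For example, a detector reaching precision 1.0 at recall 0.5 and precision 0.5 at recall 1.0 scores AP = 0.75 under this definition.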
LiDAR priors into each step of the classical stereo-matching taxonomy, thereby gaining high-precision dense depth sensing in real time. In conclusion, achieving safe autonomous driving necessitates balancing latency and accuracy in 3D object detection algorithms. While monocular detection is faster, it lacks precision. Stereo and multi-view methods are accurate but slower. Future research should focus on maintaining high precision while emphasizing increased FPS and reduced latency to meet the dual requirements of real-time responsiveness and safety in autonomous driving.

3) Robustness: Robustness constitutes a pivotal factor in the safety perception of autonomous driving, representing a topic of significant attention that has previously been overlooked in comprehensive reviews. In the current meticulously designed clean datasets and benchmarks, such as KITTI [133], nuScenes [134], and Waymo [135], this aspect is not commonly addressed. Presently, research works [124], [125], [147], [148], [300], [382], [383] such as RoboBEV [124] and Robo3D [148] incorporate considerations of robustness into 3D object detection, exemplified by factors such as sensor misses, as illustrated in Fig. 4. They adopt a methodology of introducing disturbances into datasets relevant to 3D object detection to assess robustness. This includes introducing various types of noise, such as variations in weather conditions, sensor malfunctions, motion disturbances, and object-related perturbations, aimed at unraveling the distinct impacts of different noise sources on the model. Typically, most papers investigating robustness conduct evaluations by introducing noise to the validation sets of clean datasets, such as KITTI [133], nuScenes [134], and Waymo [135]. Additionally, we highlight findings from Ref. [125], where KITTI-C [125] and nuScenes-C [125] are emphasized as examples to illustrate the results of Camera-Only 3D object detection methods. Tables VII and VIII provide an overall comparison, revealing that, in general, Camera-Only methods are less robust than LiDAR-Only and multi-modal fusion methods and are highly susceptible to various types of noise. In KITTI-C, three representative works (SMOKE [178], PGD [25], and ImVoxelNet [384]) show consistently lower overall performance and reduced robustness to noise. In nuScenes-C, noteworthy methods such as DETR3D [30] and BEVFormer [16] exhibit greater robustness than FCOS3D [24] and PGD [25], suggesting that as the number of sensors increases, overall robustness improves. In conclusion, future Camera-Only methods need to consider not only cost and accuracy metrics (mAP, NDS, etc.) but also factors related to safety perception and robustness. Our analysis aims to provide valuable insights for the safety of future autonomous driving systems.

Fig. 4: Corruption examples in the RoboBEV [124] benchmark: simulating camera malfunction.

IV. LIDAR-ONLY 3D OBJECT DETECTION

LiDAR-only methods capture precise 3D information, leading to higher detection accuracy and robustness, particularly in extreme weather conditions [125]. This is because, in comparison to optical radiation, the laser beams emitted by LiDAR systems can penetrate certain weather disturbances, such as raindrops and haze, with only slight interference. However, the high cost of LiDAR remains one of the main barriers to large-scale adoption of LiDAR-only methods. Generally, as shown in Fig. 5, LiDAR-only methods can be categorized into four types: (1) view-based 3D object detection, (2) voxel-based 3D object detection, (3) point-based 3D object detection, and (4) point-voxel-based 3D object detection. In contrast to previous reviews [4], [123], [126], [130], [131], our survey extends beyond the conventional classifications of LiDAR-only methods. We adopt a more foundational idea to classify LiDAR-only methods based on their core data representations (BEV, voxels, pillars) and underlying model structures (CNNs, Transformers, PointNet). We provide a comprehensive understanding of the technological paradigms of LiDAR-only methods, analyzing and classifying these systems from a more essential, technical lineage perspective.

A. View-based 3D object detection

View-based methods transform point clouds into pseudo-images using BEV and range views. Based on the different data representation views, view-based methods can be divided into two categories: 1) Range View, 2) BEV View. In these representations, each pixel contains 3D spatial information rather than RGB values. Due to the dense representation of pseudo-images, traditional or specialized 2D convolutions can be seamlessly applied to range images, making the feature extraction process highly efficient. However, compared to other LiDAR-only methods, detection using range views is more susceptible to occlusion and scale variations.

1) Range View: Due to the sparsity of point cloud data, projecting it directly onto an image plane results in a sparse 2D point map. Therefore, most methods [87]–[90], [385], [386] project point clouds into cylinder coordinates to generate a dense front-view representation by using the following projection function:

θ = atan2(y, x),
ϕ = arcsin(z / √(x² + y² + z²)),        (5)
r = ⌊θ / ∆θ⌋,
c = ⌊ϕ / ∆ϕ⌋,

where p = (x, y, z)ᵀ denotes a 3D point and (r, c) denotes the 2D map position of its projection. θ and ϕ denote the azimuth and elevation angle when observing the point. ∆θ and ∆ϕ are the average horizontal and vertical angle resolutions between consecutive beam emitters, respectively. VeloFCN [387] is an influential work that first introduces
TABLE VII: Comparison with SOTA methods on KITTI-C validation set. The results are evaluated based on the car class
with AP of R40 at moderate difficulty. ‘RCE’ denotes Relative Corruption Error from Ref. [125].
LiDAR-Only Camera-Only Multi-modal
Corruptions | SECOND† PointPillars† PointRCNN† PV-RCNN† Part-A2† 3DSSD† | SMOKE† PGD† ImVoxelNet† | EPNet† Focals Conv† LoGoNet* VirConv-S*
None(APclean ) 81.59 78.41 80.57 84.39 82.45 80.03 7.09 8.10 11.49 82.72 85.88 86.07 91.95
Snow 52.34 36.47 50.36 52.35 42.70 27.12 2.47 0.63 0.22 34.58 34.77 51.45 51.17
Rain 52.55 36.18 51.27 51.58 41.63 26.28 3.94 3.06 1.24 36.27 41.30 55.80 50.57
Weather Fog 74.10 64.28 72.14 79.47 71.61 45.89 5.63 0.87 1.34 44.35 44.55 67.53 75.63
Sunlight 78.32 62.28 62.78 79.91 76.45 26.09 6.00 7.07 10.08 69.65 80.97 75.54 63.62
Density 80.18 76.49 80.35 82.79 80.53 77.65 - - - 82.09 84.95 83.68 80.70
Cutout 73.59 70.28 73.94 76.09 76.08 73.05 - - - 76.10 78.06 77.17 75.18
Crosstalk 80.24 70.85 71.53 82.34 79.95 46.49 - - - 82.10 85.82 82.00 75.67
Gaussian (L) 64.90 74.68 61.20 65.11 60.73 59.14 - - - 60.88 82.14 61.85 63.16
Sensor Uniform (L) 79.18 77.31 76.39 81.16 77.77 74.91 - - - 79.24 85.81 82.94 70.74
Impulse (L) 81.43 78.17 79.78 82.81 80.80 78.28 - - - 81.63 85.01 84.66 80.50
Gaussian (C) - - - - - - 1.56 1.71 2.43 80.64 80.97 84.29 82.55
Uniform (C) - - - - - - 2.67 3.29 4.85 81.61 83.38 84.45 82.56
Impulse (C) - - - - - - 1.83 1.14 2.13 81.18 80.83 84.20 82.54
Moving Obj. 52.69 50.15 50.54 54.60 79.57 77.96 1.67 2.64 5.93 55.78 49.14 14.44 32.28
Motion Motion Blur - - - - - - 3.51 3.36 4.19 74.71 81.08 84.52 82.58
Local Density 75.10 69.56 74.24 77.63 79.57 77.96 - - - 76.73 80.84 78.63 78.73
Local Cutout 68.29 61.80 67.94 72.29 75.06 73.22 - - - 69.92 76.64 64.88 71.01
Local Gaussian 72.31 76.58 69.82 70.44 77.44 75.11 - - - 75.76 82.02 55.66 72.85
Local Uniform 80.17 78.04 77.67 82.09 80.77 78.64 - - - 81.71 84.69 79.94 79.61
Object Local Impulse 81.56 78.43 80.26 84.03 82.25 79.53 - - - 82.21 85.78 84.29 82.07
Shear 41.64 39.63 39.80 47.72 37.08 26.56 1.68 2.99 1.33 41.43 45.77 - -
Scale 73.11 70.29 71.50 76.81 75.90 75.02 0.13 0.15 0.33 69.05 69.48 - -
Rotation 76.84 72.70 75.57 79.93 75.50 76.98 1.11 2.14 2.57 74.62 77.76 - -
Alignment Spatial - - - - - - - - - 35.14 43.01 - -
Average(APcor ) 70.45 65.48 67.74 72.59 69.92 60.55 2.68 2.42 3.05 67.81 71.87 80.93 85.66
RCE (%) ↓ 13.65 16.49 15.92 13.98 15.20 24.34 62.20 70.12 73.46 22.03 18.02 5.97 6.84
†: Results from Ref. [125].
* denotes the result of our re-implementation.
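Corruption rows such as ‘Gaussian (L)’ in Table VII are produced by perturbing clean validation data before evaluation. The snippet below is a minimal, illustrative sketch of such a LiDAR corruption; the exact noise models and severity levels of Ref. [125] differ, and `sigma` here is an assumed placeholder.

```python
import numpy as np

def corrupt_points_gaussian(points, sigma=0.02, seed=0):
    """Add zero-mean Gaussian jitter to the xyz channels of an
    (N, 4) LiDAR point cloud (x, y, z, intensity); the intensity
    channel is left untouched. Illustrative of a 'Gaussian (L)'
    style corruption applied to a clean validation set."""
    rng = np.random.default_rng(seed)
    noisy = points.copy()
    noisy[:, :3] += rng.normal(0.0, sigma, size=(points.shape[0], 3))
    return noisy
```

Evaluating the unchanged detector on such perturbed copies of the validation set, one corruption type at a time, yields the per-row AP values reported above.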
TABLE VIII: Comparison with SOTA methods on nuScenes-C validation set with mAP. 'D.I.' refers to DeepInteraction [116]. 'RCE' denotes Relative Corruption Error from Ref. [125].

Columns: LiDAR-only (PointPillars†, SSN†, CenterPoint†) | Camera-only (FCOS3D†, PGD†, DETR3D†, BEVFormer†) | Multi-modal (FUTR3D†, TransFusion†, BEVFusion†, D.I.*)

None (APclean)    27.69 46.65 59.28 | 23.86 23.19 34.71 41.65 | 64.17 66.38 68.45 69.90
Weather:
  Snow            27.57 46.38 55.90 |  2.01  2.30  5.08  5.73 | 52.73 63.30 62.84 62.36
  Rain            27.71 46.50 56.08 | 13.00 13.51 20.39 24.97 | 58.40 65.35 66.13 66.48
  Fog             24.49 41.64 43.78 | 13.53 12.83 27.89 32.76 | 53.19 53.67 54.10 54.79
  Sunlight        23.71 40.28 54.20 | 17.20 22.77 34.66 41.68 | 57.70 55.14 64.42 64.93
Sensor:
  Density         27.27 46.14 58.60 |     -     -     -     - | 63.72 65.77 67.79 68.15
  Cutout          24.14 40.95 56.28 |     -     -     -     - | 62.25 63.66 66.18 66.23
  Crosstalk       25.92 44.08 56.64 |     -     -     -     - | 62.66 64.67 67.32 68.12
  FOV lost         8.87 15.40 20.84 |     -     -     -     - | 26.32 24.63 27.17 42.66
  Gaussian (L)    19.41 39.16 45.79 |     -     -     -     - | 58.94 55.10 60.64 57.46
  Uniform (L)     25.60 45.00 56.12 |     -     -     -     - | 63.21 64.72 66.81 67.42
  Impulse (L)     26.44 45.58 57.67 |     -     -     -     - | 63.43 65.51 67.54 67.41
  Gaussian (C)        -     -     - |  3.96  4.33 14.86 15.04 | 54.96 64.52 64.44 66.52
  Uniform (C)         -     -     - |  8.12  8.48 21.49 23.00 | 57.61 65.26 65.81 65.90
  Impulse (C)         -     -     - |  3.55  3.78 14.32 13.99 | 55.16 64.37 64.30 65.65
Motion:
  Compensation     3.85 10.39 11.02 |     -     -     -     - | 31.87  9.01 27.57 39.95
  Moving Obj.     19.38 35.11 44.30 | 10.36 10.47 16.63 20.22 | 45.43 51.01 51.63     -
  Motion Blur         -     -     - | 10.19  9.64 11.06 19.79 | 55.99 64.39 64.74 65.45
Object:
  Local Density   26.70 45.42 57.55 |     -     -     -     - | 63.60 65.65 67.42 67.71
  Local Cutout    17.97 32.16 48.36 |     -     -     -     - | 61.85 63.33 63.41 65.19
  Local Gaussian  25.93 43.71 51.13 |     -     -     -     - | 62.94 63.76 64.34 64.75
  Local Uniform   27.69 46.87 57.87 |     -     -     -     - | 64.09 66.20 67.58 66.44
  Local Impulse   27.67 46.88 58.49 |     -     -     -     - | 64.02 66.29 67.91 67.86
  Shear           26.34 43.28 49.57 | 17.20 16.66 17.46 24.71 | 55.42 62.32 60.72     -
  Scale           27.29 45.98 51.13 |  6.75  6.57 12.02 17.64 | 55.42 62.32 60.72     -
  Rotation        27.80 46.93 54.68 | 17.21 16.84 27.28 33.97 | 59.64 63.36 65.13     -
Alignment:
  Spatial             -     -     - |     -     -     -     - | 63.77 66.22 68.39     -
  Temporal            -     -     - |     -     -     -     - | 51.43 43.65 49.02     -
Average (APcor)   23.42 40.37 49.81 | 10.26 10.68 18.60 22.79 | 56.99 58.73 61.03 62.92
RCE (%) ↓         15.42 13.46 15.98 | 57.00 53.95 46.89 46.41 | 11.45 11.52 10.84 11.09

†: Results from Ref. [125].
* denotes the result of our re-implementation.
the projection method in cylindrical coordinates. It was then followed by [87]–[90]. LaserNet [385] utilizes DLA-Net [388] to obtain multi-scale features and detect 3D objects from this representation. Inspired by LaserNet, some works have borrowed models from 2D object detection to handle range images. For example, U-Net [389] is applied in [87], [386], [390], RPN [391] is employed in [87], [88], and FPN [392] is leveraged in [90]. Considering the limitations of traditional 2D CNNs in extracting features from range images, some works have resorted to novel operators, including range dilated convolutions [88], graph operators [393], and meta-kernel convolutions [90]. Furthermore, some works have focused on
[Fig. 5 diagrams: (a) view-based detection: raw points are projected into an image, processed by a 2D CNN backbone, and passed to a detection head; (b) voxel-based detection: voxelization into voxels (3D CNN) or pillars/pseudo-images (2D CNN over the BEV), followed by a detection head; (c) point-based detection: point sampling, then feature learning via set abstraction/MLP, GNN, or Transformer, followed by a detection head; (d) point-voxel-based detection: early fusion of point and voxel features in the backbone, or late fusion in which a voxel-based detector yields object proposals that are refined with key points via RoI-grid pooling.]

Fig. 5: The general pipelines for LiDAR-only 3D object detection.
addressing issues of occlusion and scale variation in range view. Specifically, these methods [87], [89] construct feature transformation structures from the range view to the point view and from the point view to the BEV (Bird's Eye View) perspective to convert range features into BEV perspective.

2) BEV View: Compared to range view detection, BEV-based detection is more robust to occlusion and scale variation challenges. Hence, feature extraction from the range view and object detection from the BEV become the most practical solution for range-based 3D object detection. The BEV representation is encoded by height, intensity, and density. Point clouds are discretized into a regular 2D grid. To encode more detailed height information, point clouds are evenly divided into M slices, resulting in M height maps where each grid cell stores the maximum height value of the point clouds. The intensity feature represents the reflectance value of the point within each grid cell, and the point cloud density indicates the number of points in each cell. PIXOR [289], which outputs oriented 3D object estimates decoded from pixel-wise neural network predictions, is a pioneering work in this field, followed by [89], [90], [291], [394]. These methods usually entail three stages. First, point clouds are projected into a novel cell encoding for BEV projection. Next, both the object's location on the plane and its heading are estimated through a convolutional neural network originally designed for image processing. Considering scale variation and occlusion, RangeRCNN [87] and RangeIOUDet [89] introduce a point view that serves as a bridge from RV to BEV, which provides pointwise features for the models.

B. Voxel-based 3D object detection

Voxel-based methods segment sparse point clouds into regular voxels, achieving a dense representation through voxelization. Despite spatial convolution enhancing 3D information
perception, challenges persist in achieving high detection accuracy. These challenges include 1) high computational complexity, which demands substantial memory and computational resources due to the numerous voxels representing 3D space, 2) spatial information loss that occurs during voxelization, leading to difficulties in accurately detecting small objects, and 3) inconsistencies in scale and density, inherent to specific voxel grids, which pose challenges in adapting to diverse scenes with varying object scales and point cloud densities. Overcoming these challenges requires addressing limitations in data representation, enhancing network feature capacity, improving object localization accuracy, and enhancing the model's understanding of complex scenes. Ensuring safety perception in autonomous driving is crucial, and despite varying optimization strategies, these methods converge on common perspectives of model optimization, focusing on 1) data representation and 2) model structure.

1) Data representation: Voxel-based methods first rasterize point clouds into discrete grid representations. Grid representations are closely related to accuracy, computational complexity, and memory requirements. Using a voxel size that is too large results in significant information loss, while using a voxel size that is too small increases the burdens of computation and memory. As shown in Fig. 5 (b), according to the height along the z-axis, the types of grid representations can be categorized into voxels and pillars.

a) Voxel: The voxelization process divides the 3D space into regular voxel grids with size (dL × dW × dH) in the x, y, and z directions, respectively. Only non-empty voxel units that contain points are stored and used for feature extraction. However, due to the sparse distribution of point clouds, the majority of voxel units are empty. As a pioneering work in voxel-based methods [36]–[40], [42], [43], [46]–[48], [54], [57], [60], [61], [65], [69], [77], [92], [108], VoxelNet [42] proposes a novel voxel feature encoding (VFE) layer to extract features from the points inside a voxel cell. Then, following works [38], [39], [47], [57], [75], [80] have extended the VoxelNet network by adopting similar voxel encoding approaches. Existing methods often perform local partitioning and feature extraction uniformly across all positions in the point cloud. This approach limits the receptive field for distant regions and causes information truncation. Therefore, some works have proposed different approaches to voxel partitioning: 1) Different coordinate systems: some approaches have reexamined voxel partitioning from different coordinate system perspectives, e.g., [78], [395] from cylindrical and [62] from spherical coordinate systems. Sphereformer [62] facilitates the aggregation of information from sparsely distant points by dividing the 3D space into multiple non-overlapping radial windows using spherical coordinates (r, θ, ϕ), thereby enhancing information integration from dense point regions. 2) Multi-scale voxels: some works generate voxels of different scales [48], [76] or use reconfigurable voxels [79]; e.g., HVNet [76] proposes a hybrid voxel network which integrates different scales in the point-level voxel feature encoder (VFE).

b) Pillars: Pillars can be considered a special form of voxels. Specifically, point clouds are discretized into a grid uniformly distributed on the x-y plane without binning along the z-axis. Pillar features can be aggregated from points through a PointNet [396] and then scattered back to construct a 2D BEV image for feature extraction. As the pioneering work in this series [51], [52], [55], [64], [91], [93], PointPillars [51] first introduces the pillar representation. Following works have extended the ideas from 2D detection to PointPillars. PillarNet [93] adopts the 'encoder-neck-head' detection architecture to enhance the performance of pillar-based methods. SWFormer [64] and ESS [55] draw inspiration from the Swin Transformer [365] and apply a hierarchical window mechanism to pseudo-images, thereby enabling the network to maintain a global receptive field. PillarNeXt [52] integrates a series of mature 2D detection techniques and achieves performance comparable to voxel-based methods.

2) Model Structure: There are three major types of neural networks in voxel-based methods: 1) 2D CNNs for processing BEV feature maps and pillars. 2) 3D Sparse CNNs for processing voxels. 3) Transformers for handling both voxels and pillars.

a) 2D CNNs: 2D CNNs are primarily used to detect 3D objects from a bird's-eye view perspective, including processing BEV (Bird's Eye View) feature maps and pillars [51], [52], [64], [91], [93]. Specifically, the 2D CNNs used for processing BEV feature maps often come from well-developed 2D object detection networks, such as Darknet [397], ResNet [398], FPN [392], and RPN [391]. One significant advantage of 2D CNNs compared to 3D CNNs is their faster speed. However, due to their difficulty in capturing spatial relationships and shape information, 2D CNNs typically exhibit lower accuracy.

b) 3D Sparse CNNs: 3D Sparse CNNs consist of two core operators: sparse convolution and submanifold convolution [399], which ensure that the convolutional operation is performed only on non-empty voxels. SECOND [43] implements efficient computation of sparse convolution [399] and submanifold convolution [400] operators to gain fast inference speed by constructing a hash table. It is followed by [39], [46], [47], [57], [122]. However, the limited receptive field of 3D Sparse CNNs, which leads to information truncation, restricts the model's feature extraction capabilities. Meanwhile, the sparse representation of features makes it challenging for the model to capture fine-grained object boundaries and detailed information. To optimize these issues, main optimization strategies have emerged: 1) Expanding the model's receptive field. Some methods [60], [61] extend the concept of large kernel convolution from 2D to 3D space or introduce additional downsampling layers in the model [46]. 2) Combining sparse and dense representations. Methods in this category typically utilize dense prediction heads to prevent information loss [42], [43], [47], [57], [299], retrieve lost 3D information from the detection process [37], [47], [57], [86], [299], or add additional auxiliary tasks to the model [38], [39], [77], [299], [301]. Methods employing dense prediction heads typically require high-resolution Bird's Eye View (BEV) feature maps for conducting dense predictions on them. Considering computational complexity, some recent methods aim to establish global sparse and local dense prediction relationships [56].

c) Transformer: Due to the amazing performance of transformers [365], [366], many efforts have been made to
adapt Transformers to 3D object detection. Particularly, recent studies [124], [125] have confirmed the excellent robustness of transformer-based models, which will further advance research in the domain of safety perception for autonomous driving. Compared with CNNs, the query-key-value design and the self-attention mechanism allow transformers to model global relationships, resulting in a larger receptive field. However, the primary limitation for efficiently applying Transformer-based models is the quadratic time and space complexity of the global attention mechanism. Hence, designing specialized attention mechanisms for Transformer-based 3D object detectors is critical. Transformer [364], DETR [368], and ViT [366] are the works that have most significantly influenced 3D transformer-based methods [55], [58], [59], [64], [91], [101], [108]. They have each inspired subsequent 3D detection works in various aspects: the design of attention mechanisms, the architecture of encoders and decoders, and the development of patch-based inputs and architectures similar to visual transformers. Inspired by the Transformer [364], VoTr [58] is the first work to incorporate a transformer into a voxel-based backbone network, composed of sparse attention and sparse submanifold attention modules. Subsequent works [59] have continued to build on the foundation of voxel-transformer, further optimizing the temporal complexity of the attention mechanism. DETR [368] has inspired a range of networks to adopt an encoder-decoder structure akin to DETR's. TransFusion [108] is a notable work that generates object queries from initial detections, applying cross-attention to LiDAR and image features within the Transformer decoder for 3D object detection. Meanwhile, many papers [55], [64], [91] are exploring and refining the patch-based input mechanism from ViT [366] and the window attention mechanism from Swin Transformer [365]; e.g., SST [55] and SWFormer [64] group local regions of voxels into patches, apply sparse regional attention, and then apply region shift to change the grouping. Notably, SEFormer [91] is the first to introduce object structure encoding into the transformer module.

C. Point-based 3D object detection

Unlike voxel-based methods, point-based methods retain the original information to the maximum extent, facilitating fine-grained feature acquisition. However, the performance of point-based methods is still affected by two crucial factors: 1) the number of contextual points in the point cloud sampling stage and 2) the context radius used in the point-based backbone. These factors significantly impact the speed and accuracy of point-based methods, including the detection of small objects, which is critical for safety considerations. Therefore, optimizing these two factors is paramount, based on existing literature. In this regard, we primarily focus on elucidating 1) Point Cloud Sampling and 2) Point-based Backbone.

1) Point Cloud Sampling: As an extensively utilized method, FPS (Farthest Point Sampling) aims to select a set of representative points from the raw points, such that their mutual distances are maximized, thereby optimally covering the entire spatial distribution of the point cloud. PointRCNN [45], a pioneering two-stage detector in point-based methods, utilizes PointNet++ [401] with multi-scale grouping as the backbone network. In the first stage, it generates 3D proposals from point clouds in a bottom-up manner. The second stage network refines the proposals by combining semantic features and local spatial features. However, existing methods relying on FPS still face several issues: 1) Points irrelevant to detection also participate in the sampling process, leading to additional computational burden. 2) The distribution of points across different parts of an object is uneven, resulting in suboptimal sampling strategies. Subsequent works have attempted various optimization strategies, such as segmentation-guided background point filtering [94], random sampling [293], feature space sampling [40], voxel-based sampling [44], [113], coordinate refinement [66], and ray-based grouping sampling [95].

2) Point-based Backbone: The feature learning stage in point-based methods aims to extract discriminative feature representations from raw points. The neural network used in the feature learning phase should possess the ability to be locally aware and to integrate contextual information. Based on the aforementioned motivations, a multitude of detectors have been designed for processing raw points. However, most methods can be categorized according to the core operators they utilize: 1) PointNet-based methods [45], [94], [98], [295]. 2) Graph Neural Network-based methods [44], [63], [292], [293], [402]. 3) Transformer-based methods [66], [403].

a) PointNet-based: PointNet-based methods [45], [94], [98], [295] primarily rely on Set Abstraction [396] to perform downsampling on raw points, aggregation of local information, and integration of contextual information, while preserving the symmetry invariance of the raw points. PointRCNN [45], as the first two-stage work in point-based methods, achieved amazing performance at its time; however, it still faces the issue of high computational cost. Subsequent work [70], [94] has addressed this issue by introducing an additional semantic segmentation task during the detection process to filter out background points that contribute minimally to detection. Furthermore, some efforts have focused on resolving the issue of the uncontrolled receptive field in PointNet & PointNet++, such as through the use of GNN [80] or Transformer [66] techniques.

b) Graph-based: GNNs (Graph Neural Networks) possess key elements such as an adaptive structure, dynamic neighborhood, the capability to construct both local and global contextual relationships, and robustness against irregular sampling. These characteristics naturally endow GNNs with an advantage in handling irregular point clouds. Point-GNN [44], a pioneering work, designs a one-stage graph neural network to predict objects with an auto-registration mechanism, merging, and scoring operations, which demonstrate the potential of using graph neural networks as a new approach for 3D object detection. Most graph-based, point-based methods [44], [292], [293], [296], [402] aim to fully utilize contextual information. This motivation has led to further improvements in subsequent works [296], [402].
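The graph construction shared by the graph-based detectors above can be sketched with a fixed-radius neighborhood and one max-aggregation round over neighbor features (a generic numpy sketch, not the exact operator of any cited method; radius and features are illustrative):

```python
import numpy as np

def radius_graph(points, r):
    """Edges (i, j), i != j, between points within Euclidean distance r."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    mask = (dist <= r) & ~np.eye(len(points), dtype=bool)
    src, dst = np.nonzero(mask)
    return np.stack([src, dst], axis=1)

def max_aggregate(feats, edges):
    """One message-passing round: node i keeps the elementwise max of
    its own feature and its neighbors' features."""
    out = feats.copy()
    for i, j in edges:
        out[i] = np.maximum(out[i], feats[j])
    return out

pts = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.5], [0.0, 0.0, 5.0]])
edges = radius_graph(pts, r=1.0)   # only points 0 and 1 are linked
feats = np.array([[1.0], [2.0], [3.0]])
print(max_aggregate(feats, edges))  # [[2.] [2.] [3.]]
```

Real detectors replace the max with learned update functions, but the dynamic-neighborhood property discussed above is already visible: the isolated far point keeps its own feature untouched.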
c) Transformer-based: Up to this point, a series of methods [66], [403]–[405] have explored the use of transformers for feature learning in point clouds, achieving excellent results. Pointformer [66] introduced local and global attention modules for processing 3D point clouds. The local transformer module models interactions among points within local areas, with the aim of learning contextually relevant regional features at the object level. The global transformer, on the other hand, focuses on learning context-aware representations at the scene level. Subsequently, the local-global Transformer combines local features with high-resolution global features to further capture dependencies between multi-scale representations. Group-free [403] adapted the Transformer to suit 3D object detection, enabling it to model both object-to-object and object-to-pixel relationships and to extract object features without manual grouping. Moreover, by iteratively refining the spatial encoding of objects at different stages, the detection performance is further enhanced. Point-based transformers directly process unstructured and unordered raw point clouds, which results in significantly higher computational complexity compared to structured voxel data.

D. Point-Voxel based 3D object detection

Point-voxel methods aim to leverage the fine-grained information capture capabilities of point-based methods and the computational efficiency of voxel-based methods. By integrating these methods, point-voxel based methods enable a more detailed processing of point cloud data, capturing both the global structure and micro-geometric details. This is critically important for safety perception in autonomous driving, as the accuracy of decisions made by autonomous driving systems depends on high-precision detection results.

The key goal of point-voxel methods is to enable feature interplay between voxels and points via point-to-voxel or voxel-to-point transformations. The idea of leveraging point-voxel feature fusion in backbones has been explored by many works [49], [50], [53], [67], [72], [82]–[84], [86], [97], [304], [374], [406]. These methods fall into two categories: 1) Early Fusion. Early fusion methods [49], [72], [82]–[84], [97] fuse voxel features and point features within the backbone network. 2) Late Fusion. Late fusion methods [50], [53], [67], [86], [304], [374], [406] typically employ a two-stage detection approach, using voxel-based methods for initial proposal box generation, followed by sampling and refining key point features from the point cloud to enhance 3D proposals.

a) Early Fusion: Some methods [49], [72], [82]–[84], [97] have explored using new convolutional operators to fuse voxel and point features, with PVCNN [49] potentially being the first work in this direction. In this method, the voxel-based branch initially converts points into a low-resolution voxel grid and aggregates neighboring voxel features through convolution. Then, the voxel-level features are transformed back into point-level features and fused with the features obtained from the point-based branch. Following closely, SPVCNN [97], which builds upon PVCNN, extends PVCNN to the domain of object detection. Other methods attempt to make improvements from other perspectives, such as auxiliary tasks [72] or multi-scale feature fusion [82]–[84].

b) Late Fusion: The methods in this series predominantly adopt a two-stage detection framework. Initially, voxel-based methods are employed to generate preliminary object proposals. This is followed by a refinement phase, where point-level features are leveraged for the precise delineation of detection boxes. As a milestone in PV-based methods, PV-RCNN [304] utilizes SECOND [43] as the first-stage detector and proposes a second-stage refinement with RoI grid pooling for the fusion of keypoint features. Subsequent works have followed the aforementioned paradigm, focusing on advancements in second-stage detection. Notable developments include the use of attention mechanisms [67], [85], [86], scale-aware pooling [374], and point density-aware refinement modules [53].

PV-based methods simultaneously possess the computational efficiency of voxel-based approaches and the capability of point-based methods to capture fine-grained information. However, constructing point-to-voxel or voxel-to-point relationships, along with the feature fusion of voxels and points, incurs additional computational overhead. Consequently, compared to voxel-based methods, PV-based methods can achieve better detection accuracy and robustness, but at the cost of increased inference time.

E. Analysis: Accuracy, Latency, Robustness

In the autonomous driving sector, the development of LiDAR-only 3D object detection solutions is advancing rapidly. A series of works [1], [4], [123], [126], [130], [131] have comprehensively summarized the current technological roadmaps, such as the extensive review of LiDAR-only solutions by the Shanghai AI Lab and SenseTime Research [126]. However, there is a lack of summarization and guidance from the perspective of safety perception and cost impact in autonomous driving. Therefore, in this section, following an analysis of the technological roadmaps and the current state of LiDAR-only solutions, we intend to base our discussion on the fundamental principles of 'Accuracy, Latency, and Robustness.' It aims to guide the practical implementation of economically efficient and safe sensing in autonomous driving.

1) Accuracy: Referring to Section III on Camera-only methods, we investigated the core factors influencing LiDAR-only methods. Representative methods from each category underwent comparative performance analysis on the KITTI [133] and nuScenes [134] datasets, as shown in Fig. 3 (c, d). The current scenario indicates that the latest view-based methods exhibit lower performance compared to other categories. View-based approaches transform point clouds into pseudo-images for processing with 2D detectors, which favors inference speed but sacrifices 3D spatial information. Therefore, an effective representation of 3D spatial information is pivotal for LiDAR-only methods. Initially, point-based and PV-based methods outperformed voxel-based approaches in LiDAR-only detection. Over time, methods like Voxel RCNN [47], which utilize RoI pooling modules for fine-grained information aggregation, have achieved comparable or superior performance. Voxel RCNN's RoI pooling module effectively addresses the loss of detailed 3D spatial information resulting from voxelization.
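The information loss introduced by voxelization, which RoI pooling aims to recover, can be illustrated with a minimal mean-pooling voxelizer (a generic sketch, not the encoder of any cited method; the voxel size is illustrative):

```python
import numpy as np

def voxelize_mean(points, voxel_size):
    """Hash points to integer voxel indices and mean-pool the points in
    each occupied voxel; empty voxels are never materialized."""
    idx = np.floor(points / voxel_size).astype(np.int64)
    voxels, inv = np.unique(idx, axis=0, return_inverse=True)
    inv = inv.reshape(-1)                       # guard against shaped inverse
    sums = np.zeros((len(voxels), points.shape[1]))
    np.add.at(sums, inv, points)                # scatter-add points per voxel
    counts = np.bincount(inv, minlength=len(voxels))
    return voxels, sums / counts[:, None]

pts = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [1.5, 0.0, 0.0]])
voxels, centroids = voxelize_mean(pts, voxel_size=1.0)
print(len(voxels))    # 2 occupied voxels out of the full grid
print(centroids[0])   # [0.15 0.15 0.15]; sub-voxel detail is averaged away
```

The two nearby points collapse into a single cell whose centroid no longer distinguishes them, which is exactly the fine-grained loss that second-stage point feature refinement compensates for.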
2) Latency: Section III highlights latency's importance in autonomous driving safety and user experience. While Camera-only methods tend to outperform LiDAR-only methods in terms of inference speed, the latter still maintain a competitive edge due to their accurate 3D perception. We conducted tests using an A100 graphics card to measure the FPS of significant LiDAR-only approaches, and evaluated their performance using the original research's AP and NDS metrics. As shown in Table V, view-based methods excel in model latency due to the reduction in point cloud dimensions and the efficiency of 2D CNNs. Voxel-based methods achieve exceptional inference speed due to the use of structured voxel data and well-optimized 3D sparse convolutions. However, point-based methods face challenges in applying efficient operators during data preprocessing and feature extraction stages due to the irregular representation of point clouds. Point-GNN [44] is an extreme example of this, with model latency several times that of contemporary voxel algorithms. Transformer-based methods [67] face significant challenges in real-time inference. The current research trend in transformer-based methods is the development of efficient attention operators, like [55], [64], [365]. Moreover, for PV-based methods, the construction of point-to-voxel or voxel-to-point relationships, along with the feature fusion of voxels and points, incurs additional computational overhead. To conclude, common accuracy optimization strategies, such as two-stage optimization or attention mechanisms, typically compromise inference speed in autonomous driving models. Achieving a balance between accuracy and speed is an evolving challenge in this field. Future studies should prioritize the simultaneous improvement of accuracy and FPS (frames per second), together with the reduction of latency, in order to meet the urgent requirements of real-time response and safety in autonomous driving.

3) Robustness: Previous comprehensive reviews have not focused significantly on the topic of robustness. Presently, research works [124], [125], [147], [148], [300], [382], [383] like RoboBEV [124] and Robo3D [148] on 3D object detection incorporate considerations of robustness, exemplified by factors such as sensor misses. Robo-LiDAR [146] represents the first comprehensive exploration solely dedicated to the robustness of LiDAR-only methods. In a manner akin to BR3D [125], this method evaluates robustness by integrating disturbances into datasets pertinent to 3D object detection, such as KITTI [133]. The method involves proposing a variety of noise types and 25 typical degradations associated with object- and scene-level natural weather conditions, noise interferences, density variations, and object transformations. In this section, we will combine the work of Ref. [125] and Robo-LiDAR [146] with the aim of systematically analyzing the robustness of LiDAR-only methods. As shown in Table VII, generally, LiDAR-only methods exhibit higher robustness to noise compared to Camera-only methods. In Multi-modal methods [99], [105], [106], [108], [110], [116], [121], the complementary interplay of data types becomes evident when disturbances are limited to LiDAR sensor data. In such scenarios, image data can partially mitigate the impact on point cloud integrity, consequently elevating the performance of fusion methods above that of methods relying solely on LiDAR. When disturbances affect both image and point cloud data concurrently, the efficacy of most Multi-modal methods significantly diminishes. It is worth noting that DTS [407] and Bi3D [408] enhance model robustness through domain adaptation methods.

As shown in Table IX, under various noise conditions, LiDAR-only methods experience varying degrees of accuracy decline, with the most significant reduction observed in extreme weather noise scenarios. These results indicate an urgent need in the field of autonomous driving to address the robustness issue of point cloud detectors. For most types of corruptions, voxel-based methods generally exhibit greater robustness than point-based methods, as shown in Tables VII, VIII, and IX. A plausible explanation is that voxelization, through the spatial quantization of a group of adjacent points, mitigates the local randomness and spatial information disruption caused by noise and density degradation. Specifically, for severe corruptions (e.g., shear, FFD in the transformation), the point-voxel-based method [304] exhibits greater robustness. PointRCNN [45] does not show the highest robustness against any form of corruption, highlighting potential limitations inherent in point-based methods. In conclusion, future works should explore robustness optimization from the perspectives of data representation and model architecture. The above analysis aims to offer valuable insights for future work related to robustness.

TABLE IX: Comparison with LiDAR-only detectors on corrupted validation sets of KITTI from Ref. [146] on Car detection with CEAP (%). CEAP (%) denotes Corruption Error from Ref. [146]. The best one is highlighted in bold. 'T.F.' denotes Transformation.

Columns: PV (PV-RCNN) | Point (PointRCNN) | Voxel (SECOND, SE-SSD, CenterPoint) | Avg.

Scene-level
  Weather  rain          25.11 | 23.31 | 21.81 29.51 25.83 | 26.45
           snow          44.23 | 37.74 | 34.84 49.19 38.74 | 45.64
           fog            1.59 |  3.52 |  1.60  1.59  1.11 |  1.88
  Noise    uniform rad   10.19 |  8.32 |  9.51  9.34  8.15 |  7.82
           gaussian rad  13.02 |  9.98 | 12.13 11.02 10.17 |  9.65
           impulse rad    2.20 |  3.86 |  2.23  1.18  1.80 |  2.46
           background     2.93 |  6.49 |  2.41  2.14  1.86 |  2.46
           upsample       0.81 |  1.84 |  0.31  0.55  0.46 |  0.75
  Density  cutout         3.75 |  3.97 |  4.27  4.26  4.11 |  4.00
           local dec     14.04 |     - | 13.88 17.01 14.64 | 14.44
           local inc      1.40 |  3.34 |  1.33  0.90  0.95 |  1.68
           beam del       0.58 |  0.79 |  0.73  1.07  0.47 |  0.73
           layer del      2.94 |  3.46 |  3.10  3.37  2.67 |  3.17
Object-level
  Noise    uniform       15.44 | 12.95 |  9.48  6.99  6.51 |  8.94
           gaussian      20.48 | 17.62 | 12.98  9.56  9.49 | 12.42
           impulse        3.30 |  4.70 |  2.53  2.20  2.11 |  3.26
           upsample       1.12 |  1.95 |  0.67  0.22  0.16 |  0.74
  Density  cutout        15.81 | 15.62 | 14.99 16.51 14.06 | 15.47
           local dec     14.38 | 14.16 | 13.23 15.08 12.52 | 13.84
           local inc     13.93 | 14.19 | 13.74 11.03 11.64 | 12.81
  T.F.     shear         37.27 | 40.96 | 40.35 40.35 40.00 | 39.71
           FFD           32.42 | 38.88 | 33.15 37.96 32.86 | 34.93
           rotation       0.60 |  0.47 |  0.31  0.27  0.38 |  0.52
           scale          5.78 |  8.13 |  6.96  6.53  7.50 |  6.97
           translation    3.82 |  3.03 |  3.24  1.37  3.91 |  3.77
mCE                      11.49 | 11.64 | 10.60 11.17 10.09 | 11.01

V. MULTI-MODAL 3D OBJECT DETECTION

Multi-modal 3D object detection refers to the technique of using data features from different sensors and integrating these features to achieve complementarity, thus enabling the
detection of 3D objects. As shown in Fig. 6, the approach particularly emphasizes the combination of image data and point cloud data. Image data is rich in semantic features, such as color and texture, but often lacks depth information. In contrast, point cloud data provides depth information and geometric structure, which is crucial for accurately perceiving and interpreting the 3D characteristics of a scene. Since a single type of sensor cannot fully and accurately perceive the 3D environment, multi-modal 3D object detection acquires features with rich semantic information by fusing various types

[Fig. 6 diagram: (a) Projection-based feature alignment, in which 3D features are projected into the 2D domain and concatenated with 2D features; (b) Non-Projection-based feature alignment, in which 3D and 2D features are aligned via deformable attention without an explicit projection step.]

Fig. 6: Projection-based for feature alignment vs. Non-Projection-based for feature alignment.

A. Projection-based 3D object detection

Projection-based 3D object detection refers to the use of projection matrices during the feature fusion stage to achieve the integration of point cloud and image features. It is important to clarify that the focus here is on projection during the feature fusion period, rather than projections in other stages of the fusion process, which include projections needed for processes such as data augmentation. As shown in Fig. 7, we have developed a more detailed classification of projection-based 3D object detection based on the different types of projection used in the fusion stage, including Point-Projection-based [100], [104], [105], [312]–[318], [385], Feature-Projection-based [60], [110], [118], [122], [321]–[323], [409], [410], Auto-Projection-based [41], [111], [112], [120], [121], [294], [325], [326], [375], and Decision-Projection-based methods [96], [113], [327]–[332], [376].

1) Point-Projection-based 3D object detection: Point-Projection-based 3D object detection methods [100], [313]–[318], [385] involve projecting image features onto raw point clouds to enhance the representational capability of the original point cloud data, as shown in Fig. 7 (a). The initial step in these methods is to establish a strong correlation between LiDAR points and image pixels, which is achieved using calibration matrices. Following this, the point cloud features are enhanced by augmenting them with additional data. This augmentation takes two forms: either through the incorporation of segmentation scores [100], [316], [319] or by using CNN features [104], [105], [312], [315], [317] from the correlated pixels. PointPainting [100] and PointAugmenting [315] represent advancements in multi-modal 3D object
detection methods by enhancing the traditional cut-and-paste
of data.
augmentation. These techniques aim to seamlessly integrate
In the field of autonomous driving, there are a variety of data from different domains, such as point clouds and 2D
fusion methods for multi-modal 3D object detection. Previous imagery, while carefully managing potential overlaps or colli-
reviews [4], [123], [126], [130], [131] have mostly classified sions between objects in both domains. PointPainting enhances
these methods based on different stages of fusion (early, LiDAR points by appending segmentation scores. However, it
middle, late), but this classification is overly simplistic and has limitations in effectively capturing the color and texture
does not fully consider the special requirements of autonomous details present in images. To address these shortcomings, more
driving. Given the fundamental differences between the two sophisticated approaches like FusionPainting [316] have been
heterogeneous modalities of point clouds and images, the developed, following a similar paradigm. MVP [317] builds
alignment step in multi-modal fusion is particularly critical. upon the concept of PointPainting [100]. It initially utilizes
It ensures the consistency and accuracy of information from image instance segmentation and establishes an alignment
different sensors and data sources during the fusion process. between the segmentation masks and the point cloud using
In autonomous driving, the key to achieving feature alignment a projection matrix. The key distinction of MVP lies in its
lies in whether to use a calibration matrix (also known approach to sampling: it randomly selects pixels within each
as a projection matrix). However, the inherent error of the range, ensuring consistency with the points in the point cloud.
calibration matrix, being a type of prior knowledge, poses These selected pixels are then linked to their nearest neighbors
a challenge. Some works, like [115], [333], avoid using the in the point cloud. The depth value of the LiDAR point in
projection matrix and reduce projection errors by adopting this linkage is assigned as the depth of the corresponding
learning methods. pixel. Subsequently, these points are projected back to the
Therefore, based on different methods of feature alignment, LiDAR coordinate system, resulting in the generation of
we can categorize multi-modal 3D object detection methods virtual LiDAR points.
into two types: (1) projection-based for feature alignment, and 2) Feature-Projection-based 3D object detection: In con-
(2) model-based for feature alignment. This taxonomy is more trast to Point-Projection-based methods, Feature-Projection-
detailed and scientific, better reflecting the characteristics and based 3D object detection methods [60], [110], [118], [122],
progress of multi-modal 3D object detection methods in the [322], [323], [409], [410], [422], as shown in Fig. 7 (b),
field of autonomous driving. primarily focus on fusing point cloud features with image
18

Point Cloud 3D Feature 3D Feature 3D ROI Pooling 3D Feature 3D Feature

Fuse Q

Project
Flatten
Unified
Project

+ +

Project
Feature

Project

Query
Cross
+ Deformable
+ Cam.-to-BEV
Attention projection
K&V

Image Feature Image Feature Image Feature 2D ROI Pooling Image Feature Image Feature
(a) Point-Projection (b) Feature-Projection (c) Auto-Projection (d) Decision-Projection (e) Query-learning (f) Unified-Feature

Fig. 7: Projection-based 3D object detection: (a) Point-Projection-based methods [100], [104], [105], [312]–[320], [385], (b)
Feature-Projection-based methods [60], [110], [118], [122], [321]–[324], [409]–[411], (c) Auto-Projection-based methods [41],
[111], [112], [120], [121], [294], [325], [326], [375], [377], [412]–[414], (d) Decision-Projection-based methods [96], [113],
[327]–[332], [376]. Non-Projection-based 3D object detection: (e) Query-Learning-based methods [101], [108], [115], [116],
[333], [334], [415]–[417], (f) Unified-Feature-based methods [65], [99], [102], [103], [106], [109], [114], [117], [338]–[345],
[418]–[421].

features during the feature extraction phase of the point clouds.


During this fusion process, point cloud features are projected
onto corresponding image features, and subsequently, these
image and point cloud features are integrated together. This
process is achieved by applying a calibration matrix to trans-
form the voxel’s three-dimensional coordinate system into
the pixel coordinate system of the image, thereby facilitating
the effective fusion of point cloud and image modalities.
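A minimal NumPy sketch of this calibration-matrix projection (cf. Eq. (6)) is given below; the function name and array shapes are illustrative rather than taken from any particular codebase, and the downsampling factor h is applied to the normalized pixel coordinates to model a downsampled feature map:

```python
import numpy as np

def project_lidar_to_image(points, K, R, T, h=1.0):
    """Project LiDAR points onto the image plane (cf. Eq. (6)).

    points: (N, 3) LiDAR coordinates (P_x, P_y, P_z)
    K:      (3, 3) camera intrinsic matrix
    R, T:   (3, 3) rotation and (3,) translation of the LiDAR
            relative to the camera reference frame
    h:      scale factor accounting for feature-map downsampling
    Returns (N, 2) pixel coordinates (u, v) and (N,) depths z_c.
    """
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous (N, 4)
    extrinsic = np.hstack([R, T.reshape(3, 1)])             # [R | T], shape (3, 4)
    cam = K @ extrinsic @ pts_h.T                           # (3, N) camera-frame projection
    z_c = cam[2]                                            # depth on the image plane
    uv = h * (cam[:2] / z_c).T                              # perspective divide, then scale
    return uv, z_c
```

With identity extrinsics, a point on the optical axis lands at the principal point (c_x, c_y); in practice, points with non-positive z_c lie behind the camera and must be masked before any feature painting or fusion.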
Specifically, the projection of a three-dimensional point cloud onto the image plane can be articulated as follows:

z_c [u, v, 1]^T = h K [R | T] [P_x, P_y, P_z, 1]^T,    (6)

where P_x, P_y, and P_z represent the three-dimensional spatial coordinates of the LiDAR points, while u and v denote the corresponding two-dimensional coordinates. The term z_c indicates the depth of the point's projection on the image plane. Additionally, K represents the intrinsic parameters of the camera, and R and T signify the rotation and translation of the LiDAR relative to the camera's reference frame, respectively. The factor h accounts for the scale change due to downsampling.

A quintessential example of the Feature-Projection-based method, ContFuse [422], employs continuous convolution to amalgamate multi-scale convolutional feature maps from each sensor. Within this technique, the projection of the point cloud facilitates the correspondence between the image and the Bird's Eye View (BEV). In essence, Feature-Projection-based 3D object detection is accomplished during the point cloud feature extraction phase. Compared to Point-Projection-based methods, these approaches do not perform fusion on the original point cloud but achieve a profound depth feature fusion, resulting in more robust performance.

3) Auto-Projection-based 3D object detection: As shown in Fig. 8, a partial image from the KITTI [133] dataset exemplifies that projection inaccuracies persist even in this classic clean dataset. Consequently, the issue of projection errors cannot be completely eliminated through manual calibration; instead, they can only be mitigated. This is a frequent challenge in practical dataset deployments. Many studies, such as Point- and Feature-Projection-based methods, have performed fusion through direct projection without addressing the projection error issue. A few works [41], [111], [112], [120], [121], [325], [326], [375] have sought to mitigate these errors through approaches such as projection offsets and neighboring projections. For instance, Deformable Cross Attention [367] has been employed to learn offsets in the context of already projected data. We have systematically reviewed and synthesized methods that tackle projection errors, designating them as Auto-Projection-based 3D object detection methods, as shown in Fig. 7 (c). As representative works addressing feature alignment, HMFI [120], GraphAlign [111], and GraphAlign++ [112] utilize a priori knowledge of projection calibration matrices to project onto corresponding images for local graph modeling. This approach simulates intermodal relationships, enabling multi-modal 3D object detectors to effectively identify more appropriate alignment relationships, thereby achieving faster and more accurate feature alignment between modalities. AutoAlignV2 [326] focuses on sparse learnable sampling points for cross-modal relational modeling, enhancing calibration error tolerance and significantly accelerating feature aggregation across different modalities. In summary, Auto-Projection-based 3D object detection methods mitigate errors arising from feature alignment by leveraging neighbor relationships or neighbor offsets, thereby enhancing robustness in multi-modal 3D object detection.

Fig. 8: Examples of misalignment between point clouds and images.

4) Decision-Projection-based 3D object detection: Decision-Projection-based 3D object detection methods [96], [113], [327]–[332], [376], as early implementations of multi-modal 3D object detection schemes, use projection matrices to align features in Regions of Interest (RoI) or specific results, as shown in Fig. 7 (d). These methods are primarily focused on the alignment of features in localized areas of interest or specific detection outcomes.

Graph-RCNN [113] projects each graph node to its location in the camera image and collects the feature vector at that pixel through bilinear interpolation. F-PointNet [330] performs detection on the 2D image to determine the class and localization of the object, and for each detected object, the corresponding point clouds in 3D space are obtained through the conversion matrix of the calibrated sensor parameters and 3D space. MV3D [329] employs a transformation of the LiDAR point cloud into Bird's Eye View (BEV) and Front View (FV) projections for generating proposals. During this process, a specialized 3D proposal network is used to create precise 3D candidate boxes. These 3D proposals are then projected onto feature maps from multiple perspectives to facilitate feature alignment between the two modalities. Differing from MV3D [329], AVOD [328] streamlines this approach by omitting the FV component and introducing a more refined region proposal mechanism. In summary, Decision-Projection-based 3D object detection methods primarily achieve feature fusion at a high level through projection, with limited interaction between heterogeneous modalities. This often leads to the alignment and fusion of erroneous features, resulting in issues of reduced accuracy and robustness.

B. Non-Projection-based 3D object detection

Non-Projection-based 3D object detection methods achieve fusion without relying on feature alignment, thereby yielding robust feature representations. They circumvent the limitations of camera-to-LiDAR projection, which often reduces the semantic density of camera features and impacts the effectiveness of techniques like Focals Conv [110] and PointPainting [100]. Non-Projection-based methods typically employ cross-attention mechanisms or the construction of a unified space to address the inherent misalignment issues in direct feature projection. These methods are primarily divided into two categories: (1) Query-Learning-based [108], [115], [116], [333], [377], [415] and (2) Unified-Feature-based [65], [99], [101], [103], [106], [109], [114], [338]–[342]. Query-Learning-based methods entirely negate the need for alignment during the fusion process. Conversely, Unified-Feature-based methods, though constructing a unified feature space, do not completely avoid projection; it usually occurs within a single modality context. For example, BEVFusion [109] utilizes LSS [255] for camera-to-BEV projection. This process, taking place before fusion, demonstrates considerable robustness in scenarios with feature misalignment.

1) Query-Learning-based 3D object detection: Query-Learning-based 3D object detection methods, as exemplified by works such as [108], [115], [116], [333], [377], [415], [423], eschew the necessity for projection within the feature fusion process, as shown in Fig. 7 (e). Instead, they attain feature alignment through cross-attention mechanisms before engaging in the fusion of features. Point cloud features are typically employed as queries, while image features serve as keys and values, facilitating a global feature query to acquire highly robust multi-modal features. Furthermore, DeepInteraction [116] incorporates multimodality interaction, wherein point cloud and image features are utilized as distinct queries to enable further feature interaction. In comparison to the exclusive use of point cloud features as queries, the comprehensive incorporation of image features leads to the acquisition of more resilient multi-modal features. Overall, Query-Learning-based 3D object detection methods employ a transformer-based structure for feature querying to achieve feature alignment. Ultimately, the multi-modal features are integrated into LiDAR-only pipelines, such as CenterPoint [57].

2) Unified-Feature-based 3D object detection: Unified-Feature-based 3D object detection methods, represented by works such as [65], [99], [101], [103], [106], [109], [114], [338]–[342], generally employ projection before feature fusion, achieving the pre-fusion unification of heterogeneous modalities, as shown in Fig. 7 (f). In the BEV fusion series, which utilizes LSS for depth estimation [101], [109], [338], [339], the front-view features are transformed into BEV features, followed by the fusion of BEV image and BEV point cloud features. Alternatively, CMT [103] and UniTR [341] employ transformers for tokenization of point clouds and images, constructing an implicit unified space through transformer encoding. CMT [103] utilizes projection in the position encoding process, but entirely avoids dependency on projection relations at the feature learning level. FocalFormer3D [340], FUTR3D [106], and UVTR [65] leverage transformers' queries to implement schemes similar to DETR3D [30], constructing a unified sparse BEV feature space through queries, thus mitigating the instability introduced by direct projection. VirConv [99], MSMDFusion [342], and SFD [114] construct a unified space through pseudo-point clouds, with the projection occurring before feature learning. The issues introduced by direct projection are addressed through subsequent feature learning. In summary, Unified-Feature-based 3D object detection methods [65], [99], [101], [103], [106], [109], [114], [338]–[342] currently represent high-precision and robust solutions. Although they incorporate projection matrices, such projection does not occur during the multi-modal fusion stage, distinguishing them as Non-Projection-based 3D object detection methods. Unlike Auto-Projection-based 3D object detection approaches, they do not directly address projection error issues but instead opt for unified space construction, considering multiple dimensions for multi-modal 3D object detection, thereby obtaining highly robust multi-modal features.

C. Analysis: Accuracy, Latency, Robustness

In the preceding Sections III-D and IV-E, we have conducted a comprehensive analysis of 'Accuracy, Latency, Robustness' for camera-only and LiDAR-only approaches. Subsequently, we extend our examination to multi-modal 3D object detection methods, employing a similar analytical framework.

1) Accuracy: As shown in Fig. 3 (e) and (f), we conducted comparative evaluations on both the KITTI and nuScenes test datasets. The majority of Projection-based 3D object detection methods have predominantly undergone experimentation on the KITTI dataset, with only a minority extending their evaluation to nuScenes. As shown in Fig. 3 (e), it is evident that Feature-Projection-based and Auto-Projection-based methods exhibit superior overall performance, while Decision-Projection-based methods, primarily dated prior to 2020, tend to manifest relatively lower Average Precision (AP) metrics. A scant few Non-Projection-based 3D object detection methods, such as CAT-Det [377], have been experimented with on the KITTI dataset. As shown in Fig. 3 (f), the latest methods predominantly belong to the Unified-Feature-based methods, underscoring the suitability of the panoramic camera offered by nuScenes for achieving modality-unifying strategies like BEVFusion [109]. Overall, it is discernible that Non-Projection-based methods present more effective solutions in terms of Accuracy metrics (e.g., AP, mAP, NDS, etc.).

2) Latency: As shown in Table VI, we conducted a comparative analysis of mono-modal 3D object detection methods (LiDAR-only and Camera-only) and multi-modal 3D object detection on the KITTI and nuScenes datasets, presenting scatter plots for Latency (FPS) and Accuracy metrics (AP, mAP, NDS, etc.). It is noteworthy that, in comparison to mono-modal 3D object detection methods (LiDAR-only and Camera-only), multi-modal 3D object detection approaches generally exhibit lower FPS. The results on the KITTI dataset indicate that GraphAlign excels in both AP and FPS metrics. Additionally, LoGoNet [121], Focals Conv [110], and EPNet [104] demonstrate outstanding performance. GraphAlign [111] maintains its position as having the highest FPS, but its NDS performance is suboptimal on the nuScenes dataset. In contrast, UniTR performs exceptionally well in both NDS and FPS metrics. Overall, it can be observed that within Projection-based methods, Auto-Projection-based and Feature-Projection-based methods exhibit superior overall performance, while within Unified-Feature-based methods, the overall performance is more outstanding. In the meticulous evaluation of the KITTI and nuScenes datasets, emphasis is placed on the trade-off between FPS and NDS metrics.

3) Robustness: In the previous Sections III-D3 and IV-E3, we analyzed the robustness of mono-modal 3D object detection (Camera-only and LiDAR-only). In this section, based on Tables VII and VIII, we analyze the robustness of multi-modal 3D object detection. From KITTI-C [125] and nuScenes-C [125], it can be seen that multi-modal 3D object detection is more robust compared to mono-modal 3D object detection (Camera-only and LiDAR-only), with smaller RCE. In KITTI-C, representative articles LoGoNet [121] for Auto-Projection-based and VirConv [99] for Unified-Feature-based exhibit greater robustness, while EPNet [104] for Point-Projection-based and Focals Conv [110] for Feature-Projection-based show slightly weaker performance. Additionally, in nuScenes-C, among Non-Projection-based methods, FUTR3D [106], TransFusion [108], BEVFusion [109], and DeepInteraction [116] all demonstrate strong robustness. It is worth noting that MetaBEV [420] explores the problem of modal loss caused by feature misalignments and sensor failures in BEV features of LiDAR and camera through deformable attention based on BEVFusion [109]. ObjectFusion [117] proposes a novel object-centric fusion to align object-centric features of different modalities. GraphBEV [343] mitigates misalignment issues by matching neighbor depth features through graph matching.

VI. FUTURE OUTLOOKS

Through reviewing the literature and analyzing the research trends of the past few years, we make some predictions on future research directions for 3D object detection from the perspective of robustness.

A. 3D Object Detection with Large Models

Inspired by the success of large language models (LLMs) such as ChatGPT [424] and vision foundation models (VFMs) like SAM, many researchers have focused on research related to large models. Compared to conventional methods, a large autonomous driving model based on LLMs can mainly address the following two problems. Firstly, there is the endless corner-case problem. LLMs have a common-sense ability and may become a new paradigm for solving corner cases in autonomous driving. Secondly, current methods lack intuitive reasoning and the ability to provide textual explanations, and LLMs excel precisely in this direction. It is worth researching how to combine large models with 3D object detection to enhance robustness and generalization and to improve the handling of corner cases. However, there is currently limited research on combining large models and 3D object detection. For example, RoboFusion [324] has integrated TransFusion [108] and Focals Conv [110] with VFMs like SAM [425] to enhance its ability in harsh weather conditions. SEAL [426] uses VFMs like SAM [425] to segment different car point cloud sequences and can segment any car point cloud by encouraging spatial and temporal consistency during the representation learning stage. CLIP-BEVFormer [280] combines CLIP [427] and BEVFormer [16], leveraging the universal capabilities of CLIP [427] to enhance generalization on corner cases. VisLED [428] is a language-driven active learning framework for open-set 3D object detection, which utilizes active learning techniques to query various information-rich data samples from unlabeled pools. Almost all existing works are proposed and evaluated on close-range datasets. Although these datasets may be large and diverse, they are still insufficient for real-world applications. In the real world, the generalization and robustness on corner cases are of utmost importance, and 3D object detection with large models is a good starting point for solving open-set 3D object detection. In addition, current 3D object detection
algorithms lack interpretability, and LLMs can bring hope for more robust 3D object detection and avoid unexpected situations caused by black-box detectors.

B. 3D Object Detection in End-to-End Autonomous Driving

UniAD [429] undoubtedly brought another hot topic to the field of autonomous driving after winning the CVPR Best Paper Award: end-to-end autonomous driving. End-to-end autonomous driving is a fully differentiable machine learning system that takes raw sensor input data and other metadata as prior information and directly outputs the control signals or trajectory planning for vehicles [430]. Generally, the autonomous driving system integrates multiple tasks, such as detection, tracking, online mapping, motion prediction, and planning. 3D object detection is closely related to other perception tasks and downstream tasks such as prediction and planning. Therefore, pursuing high accuracy in 3D object detection alone may not be optimal when considering the autonomous driving system as a whole. Although impressive progress has been made in end-to-end research, we believe three areas can be further improved in current 3D object detection for end-to-end autonomous driving. First, 3D object detection can guide more effective multi-modal environmental perception, allowing for better data integration from multi-modal sources. Second, the current inference capabilities of end-to-end autonomous driving are concerning. Third, 3D object detection enhanced with large language models (LLMs) provides stronger explanatory power, which in turn benefits subsequent tasks.

VII. CONCLUSION

3D object detection plays a crucial role in autonomous driving perception. In recent years, this field has witnessed rapid development, yielding many research results. Based on the diverse data forms generated by sensors, these methods are primarily categorized into three types: image-based, point cloud-based, and multi-modal. The primary metrics for evaluation in these methods are high accuracy and low latency. Numerous reviews have summarized these approaches, focusing on the core principles of 'high accuracy and low latency' in delineating their technical trajectories. However, in the transition of autonomous driving technology from breakthroughs to practical applications, existing reviews have not prioritized safety perception as a central concern, failing to encompass the current technological pathways related to safety perception. For instance, recent multi-modal fusion methods typically undergo robustness testing during the experimental phase, a facet not adequately considered in current reviews. Therefore, in this study, we re-examine 3D object detection algorithms with a central focus on the key aspects of 'Accuracy, Latency, and Robustness'. We reclassify the methods surveyed in previous reviews, placing particular emphasis on re-segmenting them from the perspective of safety perception. We aim for this work to offer new insights for future research in 3D object detection, transcending the confines of high-accuracy exploration.

REFERENCES

[1] E. Arnold, O. Y. Al-Jarrah, M. Dianati, S. Fallah, D. Oxtoby, and A. Mouzakitis, "A survey on 3d object detection methods for autonomous driving applications," IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 10, pp. 3782–3795, 2019.
[2] J. Liu, H. Wang, L. Peng, Z. Cao, D. Yang, and J. Li, "Pnnuad: Perception neural networks uncertainty aware decision-making for autonomous vehicle," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 24355–24368, 2022.
[3] K. Yang, B. Li, W. Shao, X. Tang, X. Liu, and H. Wang, "Prediction failure risk-aware decision-making for autonomous vehicles on signalized intersections," IEEE Transactions on Intelligent Transportation Systems, 2023.
[4] L. Wang, X. Zhang, Z. Song, J. Bi, G. Zhang, H. Wei, L. Tang, L. Yang, J. Li, C. Jia et al., "Multi-modal 3d object detection in autonomous driving: A survey and taxonomy," IEEE Transactions on Intelligent Vehicles, 2023.
[5] Y. Cao, H. Zhang, Y. Li, C. Ren, and C. Lang, "Cman: Leaning global structure correlation for monocular 3d object detection," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 24727–24737, 2022.
[6] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, "Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8445–8453.
[7] P. Li, H. Zhao, P. Liu, and F. Cao, "Rtm3d: Real-time monocular 3d detection from object keypoints for autonomous driving," in European Conference on Computer Vision. Springer, 2020, pp. 644–660.
[8] Y. Zhang, J. Lu, and J. Zhou, "Objects are different: Flexible monocular 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3289–3298.
[9] A. Simonelli, S. R. Bulo, L. Porzi, M. López-Antequera, and P. Kontschieder, "Disentangling monocular 3d object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1991–1999.
[10] Q. Lian, P. Li, and X. Chen, "Monojsg: Joint semantic and geometric cost volume for monocular 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1070–1079.
[11] G. Brazil and X. Liu, "M3d-rpn: Monocular 3d region proposal network for object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9287–9296.
[12] Y. Cai, B. Li, Z. Jiao, H. Li, X. Zeng, and X. Wang, "Monocular 3d object detection with decoupled structured polygon estimation and height-guided depth estimation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 10478–10485.
[13] Y. Chen, L. Tai, K. Sun, and M. Li, "Monopair: Monocular 3d object detection using pairwise spatial relationships," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12093–12102.
[14] Y. Liu, L. Wang, and M. Liu, "Yolostereo3d: A step back to 2d for efficient stereo 3d detection," in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 13018–13024.
[15] Y. Li, Z. Ge, G. Yu, J. Yang, Z. Wang, Y. Shi, J. Sun, and Z. Li, "Bevdepth: Acquisition of reliable depth for multi-view 3d object detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 1477–1485.
[16] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, and J. Dai, "Bevformer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers," in European Conference on Computer Vision. Springer, 2022, pp. 1–18.
[17] Y. Chen, S. Liu, X. Shen, and J. Jia, "Dsgn: Deep stereo geometry network for 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12536–12545.
[18] L. Liu, J. Lu, C. Xu, Q. Tian, and J. Zhou, "Deep fitting degree scoring network for monocular 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1057–1066.
[19] X. Shi, Z. Chen, and T.-K. Kim, "Distance-normalized unified representation for monocular 3d object detection," in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX 16. Springer, 2020, pp. 91–107.
[20] X. Ma, Z. Wang, H. Li, P. Zhang, W. Ouyang, and X. Fan, “Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6851–6860.
[21] J. Ku, A. D. Pon, and S. L. Waslander, “Monocular 3d object detection leveraging accurate proposals and shape reconstruction,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 11867–11876.
[22] C. Reading, A. Harakeh, J. Chae, and S. L. Waslander, “Categorical depth distribution network for monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8555–8564.
[23] K.-C. Huang, T.-H. Wu, H.-T. Su, and W. H. Hsu, “Monodtr: Monocular 3d object detection with depth-aware transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4012–4021.
[24] T. Wang, X. Zhu, J. Pang, and D. Lin, “Fcos3d: Fully convolutional one-stage monocular 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 913–922.
[25] T. Wang, X. Zhu, J. Pang, and D. Lin, “Probabilistic and geometric depth: Detecting objects in perspective,” in Conference on Robot Learning. PMLR, 2022, pp. 1475–1485.
[26] Y. Lu, X. Ma, L. Yang, T. Zhang, Y. Liu, Q. Chu, J. Yan, and W. Ouyang, “Geometry uncertainty projection network for monocular 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3111–3121.
[27] D. Park, R. Ambrus, V. Guizilini, J. Li, and A. Gaidon, “Is pseudo-lidar needed for monocular 3d object detection?” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3142–3152.
[28] L. Yang, X. Zhang, J. Li, L. Wang, M. Zhu, C. Zhang, and H. Liu, “Mix-teaching: A simple, unified and effective semi-supervised learning framework for monocular 3d object detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
[29] R. Zhang, H. Qiu, T. Wang, Z. Guo, X. Xu, Y. Qiao, P. Gao, and H. Li, “Monodetr: Depth-guided transformer for monocular 3d object detection,” arXiv preprint arXiv:2203.13310, 2022.
[30] Y. Wang, V. C. Guizilini, T. Zhang, Y. Wang, H. Zhao, and J. Solomon, “Detr3d: 3d object detection from multi-view images via 3d-to-2d queries,” in Conference on Robot Learning. PMLR, 2022, pp. 180–191.
[31] Y. Liu, T. Wang, X. Zhang, and J. Sun, “Petr: Position embedding transformation for multi-view 3d object detection,” in European Conference on Computer Vision. Springer, 2022, pp. 531–548.
[32] Y. Liu, J. Yan, F. Jia, S. Li, A. Gao, T. Wang, and X. Zhang, “Petrv2: A unified framework for 3d perception from multi-camera images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3262–3272.
[33] J. Huang and G. Huang, “Bevdet4d: Exploit temporal cues in multi-camera 3d object detection,” arXiv preprint arXiv:2203.17054, 2022.
[34] H. Liu, Y. Teng, T. Lu, H. Wang, and L. Wang, “Sparsebev: High-performance sparse 3d object detection from multi-camera videos,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 18580–18590.
[35] L. Yang, T. Tang, J. Li, P. Chen, K. Yuan, L. Wang, Y. Huang, X. Zhang, and K. Yu, “Bevheight++: Toward robust visual centric 3d object detection,” arXiv preprint arXiv:2309.16179, 2023.
[36] Z. Song, H. Wei, C. Jia, Y. Xia, X. Li, and C. Zhang, “Vp-net: Voxels as points for 3d object detection,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
[37] G. Wang, B. Tian, Y. Ai, T. Xu, L. Chen, and D. Cao, “Centernet3d: An anchor free object detector for autonomous driving,” arXiv preprint, Jul 2020.
[38] Y. Hu, Z. Ding, R. Ge, W. Shao, L. Huang, K. Li, and Q. Liu, “Afdetv2: Rethinking the necessity of the second stage for object detection from point clouds,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, pp. 969–979.
[39] R. Ge, Z. Ding, Y. Hu, Y. Wang, S. Chen, L. Huang, and Y. Li, “Afdet: Anchor free one stage 3d object detection,” arXiv preprint arXiv:2006.12671, 2020.
[40] Z. Yang, Y. Sun, S. Liu, and J. Jia, “3dssd: Point-based 3d single stage object detector,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11040–11048.
[41] J. H. Yoo, Y. Kim, J. Kim, and J. W. Choi, “3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16. Springer, 2020, pp. 720–736.
[42] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud based 3d object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4490–4499.
[43] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, p. 3337, 2018.
[44] W. Shi and R. Rajkumar, “Point-gnn: Graph neural network for 3d object detection in a point cloud,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 1711–1719.
[45] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal generation and detection from point cloud,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 770–779.
[46] Y. Chen, J. Liu, X. Zhang, X. Qi, and J. Jia, “Voxelnext: Fully sparse voxelnet for 3d object detection and tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[47] J. Deng, S. Shi, P. Li, W. Zhou, Y. Zhang, and H. Li, “Voxel r-cnn: Towards high performance voxel-based 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, 2021, pp. 1201–1209.
[48] H. Kuang, B. Wang, J. An, M. Zhang, and Z. Zhang, “Voxel-fpn: Multi-scale voxel feature aggregation for 3d object detection from lidar point clouds,” Sensors, vol. 20, no. 3, p. 704, 2020.
[49] Z. Liu, H. Tang, Y. Lin, and S. Han, “Point-voxel cnn for efficient 3d deep learning,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[50] S. Shi, L. Jiang, J. Deng, Z. Wang, C. Guo, J. Shi, X. Wang, and H. Li, “Pv-rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection,” International Journal of Computer Vision, vol. 131, no. 2, pp. 531–551, 2023.
[51] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast encoders for object detection from point clouds,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12697–12705.
[52] J. Li, C. Luo, and X. Yang, “Pillarnext: Rethinking network designs for 3d object detection in lidar point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[53] J. S. Hu, T. Kuai, and S. L. Waslander, “Point density-aware voxels for lidar 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8469–8478.
[54] H. Wu, C. Wen, W. Li, X. Li, R. Yang, and C. Wang, “Transformation-equivariant 3d object detection for autonomous driving,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
[55] L. Fan, Z. Pang, T. Zhang, Y.-X. Wang, H. Zhao, F. Wang, N. Wang, and Z. Zhang, “Embracing single stride 3d object detector with sparse transformer,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2022.
[56] L. Fan, F. Wang, N. Wang, and Z. Zhang, “Fully sparse 3d object detection,” Advances in Neural Information Processing Systems, vol. 35, pp. 351–363, 2022.
[57] T. Yin, X. Zhou, and P. Krahenbuhl, “Center-based 3d object detection and tracking,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021.
[58] J. Mao, Y. Xue, M. Niu, H. Bai, J. Feng, X. Liang, H. Xu, and C. Xu, “Voxel transformer for 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3164–3173.
[59] C. He, R. Li, S. Li, and L. Zhang, “Voxel set transformer: A set-to-set approach to 3d object detection from point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[60] Y. Chen, J. Liu, X. Zhang, X. Qi, and J. Jia, “Largekernel3d: Scaling up kernels in 3d sparse cnns,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13488–13498.
[61] T. Lu, X. Ding, H. Liu, G. Wu, and L. Wang, “Link: Linear kernel for lidar-based 3d perception,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1105–1115.
[62] X. Lai, Y. Chen, F. Lu, J. Liu, and J. Jia, “Spherical transformer for lidar-based 3d recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17545–17555.
[63] Q. He, Z. Wang, H. Zeng, Y. Zeng, and Y. Liu, “Svga-net: Sparse voxel-graph attention network for 3d object detection from point clouds,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, pp. 870–878.
[64] P. Sun, M. Tan, W. Wang, C. Liu, F. Xia, Z. Leng, and D. Anguelov, “Swformer: Sparse window transformer for 3d object detection in point clouds,” in European Conference on Computer Vision. Springer, 2022.
[65] Y. Li, Y. Chen, X. Qi, Z. Li, J. Sun, and J. Jia, “Unifying voxel-based representation with transformer for 3d object detection,” Advances in Neural Information Processing Systems, vol. 35, pp. 18442–18455, 2022.
[66] X. Pan, Z. Xia, S. Song, L. E. Li, and G. Huang, “3d object detection with pointformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7463–7472.
[67] H. Sheng, S. Cai, Y. Liu, B. Deng, J. Huang, X.-S. Hua, and M.-J. Zhao, “Improving 3d object detection with channel-wise transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2743–2752.
[68] Q. Xu, Y. Zhong, and U. Neumann, “Behind the curtain: Learning occluded shapes for 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, 2022, pp. 2893–2901.
[69] W. Zheng, W. Tang, L. Jiang, and C.-W. Fu, “Se-ssd: Self-ensembling single-stage object detector from point cloud,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14494–14503.
[70] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “Ipod: Intensive point-based object detector for point cloud,” arXiv preprint arXiv:1812.05276, 2018.
[71] L. Du, X. Ye, X. Tan, J. Feng, Z. Xu, E. Ding, and S. Wen, “Associate-3ddet: Perceptual-to-conceptual association for 3d point cloud object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 13329–13338.
[72] J. Li, H. Dai, L. Shao, and Y. Ding, “From voxel to point: Iou-guided 3d object detection for point cloud with voxel-to-point decoder,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 4622–4631.
[73] Z. Li, Y. Yao, Z. Quan, W. Yang, and J. Xie, “Sienet: Spatial information enhancement network for 3d object detection from point cloud,” arXiv preprint arXiv:2103.15396, 2021.
[74] D. Zhang, D. Liang, Z. Zou, J. Li, X. Ye, Z. Liu, X. Tan, and X. Bai, “A simple vision transformer for weakly semi-supervised 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8373–8383.
[75] B. Zhu, Z. Jiang, X. Zhou, Z. Li, and G. Yu, “Class-balanced grouping and sampling for point cloud 3d object detection,” arXiv preprint arXiv:1908.09492, 2019.
[76] M. Ye, S. Xu, and T. Cao, “Hvnet: Hybrid voxel network for lidar based 3d object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 1631–1640.
[77] W. Zheng, W. Tang, S. Chen, L. Jiang, and C.-W. Fu, “Cia-ssd: Confident iou-aware single-stage object detector from point cloud,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 4, 2021, pp. 3555–3562.
[78] Q. Chen, L. Sun, E. Cheung, and A. L. Yuille, “Every view counts: Cross-view consistency in 3d object detection with hybrid-cylindrical-spherical voxelization,” Advances in Neural Information Processing Systems, vol. 33, pp. 21224–21235, 2020.
[79] T. Wang, X. Zhu, and D. Lin, “Reconfigurable voxels: A new representation for lidar-based point clouds,” arXiv preprint, Apr 2020.
[80] Y. Wang and J. Solomon, “Object dgcnn: 3d object detection using dynamic graphs,” Advances in Neural Information Processing Systems, vol. 34, 2021.
[81] X. Zhu, Y. Ma, T. Wang, Y. Xu, J. Shi, and D. Lin, “Ssn: Shape signature networks for multi-class object detection from point clouds,” in European Conference on Computer Vision. Springer, 2020, pp. 581–597.
[82] Z. Miao, J. Chen, H. Pan, R. Zhang, K. Liu, P. Hao, J. Zhu, Y. Wang, and X. Zhan, “Pvgnet: A bottom-up one-stage 3d object detector with integrated multi-level features,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021.
[83] J. Noh, S. Lee, and B. Ham, “Hvpr: Hybrid voxel-point representation for single-stage 3d object detection,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021.
[84] T. Guan, J. Wang, S. Lan, R. Chandra, Z. Wu, L. Davis, and D. Manocha, “M3detr: Multi-representation, multi-scale, mutual-relation 3d object detection with transformers,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2022, pp. 772–782.
[85] J. Wang, S. Lan, M. Gao, and L. S. Davis, “Infofocus: 3d object detection for autonomous driving with dynamic information modeling,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16. Springer, 2020, pp. 405–420.
[86] J. Mao, M. Niu, H. Bai, X. Liang, H. Xu, and C. Xu, “Pyramid r-cnn: Towards better performance and adaptability for 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2723–2732.
[87] Z. Liang, M. Zhang, Z. Zhang, X. Zhao, and S. Pu, “Rangercnn: Towards fast and accurate 3d object detection with range image representation,” arXiv preprint, 2020.
[88] A. Bewley, P. Sun, T. Mensink, D. Anguelov, and C. Sminchisescu, “Range conditioned dilated convolutions for scale invariant 3d object detection,” in Conference on Robot Learning, 2020.
[89] Z. Liang, Z. Zhang, M. Zhang, X. Zhao, and S. Pu, “Rangeioudet: Range image based real-time 3d object detector optimized by intersection over union,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2021.
[90] L. Fan, X. Xiong, F. Wang, N. Wang, and Z. Zhang, “Rangedet: In defense of range view for lidar-based 3d object detection,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2021.
[91] X. Feng, H. Du, H. Fan, Y. Duan, and Y. Liu, “Seformer: Structure embedding transformer for 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 632–640.
[92] H. Wu, J. Deng, C. Wen, X. Li, C. Wang, and J. Li, “Casa: A cascade attention network for 3-d object detection from lidar point clouds,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2022.
[93] G. Shi, R. Li, and C. Ma, “Pillarnet: Real-time and high-performance pillar-based 3d object detection,” in European Conference on Computer Vision. Springer, 2022, pp. 35–52.
[94] C. Chen, Z. Chen, J. Zhang, and D. Tao, “Sasa: Semantics-augmented set abstraction for point-based 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 1, 2022, pp. 221–229.
[95] H. Wang, S. Shi, Z. Yang, R. Fang, Q. Qian, H. Li, B. Schiele, and L. Wang, “Rbgnet: Ray-based grouping for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1110–1119.
[96] Z. Wang and K. Jia, “Frustum convnet: Sliding frustums to aggregate local point-wise features for amodal 3d object detection,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 1742–1749.
[97] H. Tang, Z. Liu, S. Zhao, Y. Lin, J. Lin, H. Wang, and S. Han, “Searching efficient 3d architectures with sparse point-voxel convolution,” in European conference on computer vision. Springer, 2020, pp. 685–702.
[98] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “Std: Sparse-to-dense 3d object detector for point cloud,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1951–1960.
[99] H. Wu, C. Wen, S. Shi, X. Li, and C. Wang, “Virtual sparse convolution for multimodal 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[100] S. Vora, A. H. Lang, B. Helou, and O. Beijbom, “Pointpainting: Sequential fusion for 3d object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4604–4612.
[101] Y. Xie, C. Xu, M.-J. Rakotosaona, P. Rim, F. Tombari, K. Keutzer, M. Tomizuka, and W. Zhan, “Sparsefusion: Fusing multi-modal sparse representations for multi-sensor 3d object detection,” arXiv preprint arXiv:2304.14340, 2023.
[102] Z. Yu, W. Wan, M. Ren, X. Zheng, and Z. Fang, “Sparsefusion3d: Sparse sensor fusion for 3d object detection by radar and camera in environmental perception,” IEEE Transactions on Intelligent Vehicles, pp. 1–14, 2023.
[103] J. Yan, Y. Liu, J. Sun, F. Jia, S. Li, T. Wang, and X. Zhang, “Cross modal transformer via coordinates encoding for 3d object detection,” arXiv preprint arXiv:2301.01283, 2023.
[104] T. Huang, Z. Liu, X. Chen, and X. Bai, “Epnet: Enhancing point features with image semantics for 3d object detection,” in European Conference on Computer Vision. Springer, 2020, pp. 35–52.
[105] Z. Liu, T. Huang, B. Li, X. Chen, X. Wang, and X. Bai, “Epnet++: Cascade bi-directional fusion for multi-modal 3d object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 7, pp. 8324–8341, 2023.
[106] X. Chen, T. Zhang, Y. Wang, Y. Wang, and H. Zhao, “Futr3d: A unified sensor fusion framework for 3d detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023.
[107] Y. Zhou, P. Sun, Y. Zhang, D. Anguelov, J. Gao, T. Ouyang, J. Guo, J. Ngiam, and V. Vasudevan, “End-to-end multi-view fusion for 3d
object detection in lidar point clouds,” in Conference on Robot Learning, 2019.
[108] X. Bai, Z. Hu, X. Zhu, Q. Huang, Y. Chen, H. Fu, and C.-L. Tai, “Transfusion: Robust lidar-camera fusion for 3d object detection with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[109] T. Liang, H. Xie, K. Yu, Z. Xia, Z. Lin, Y. Wang, T. Tang, B. Wang, and Z. Tang, “Bevfusion: A simple and robust lidar-camera fusion framework,” Advances in Neural Information Processing Systems, vol. 35, pp. 10421–10434, 2022.
[110] Y. Chen, Y. Li, X. Zhang, J. Sun, and J. Jia, “Focal sparse convolutional networks for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5428–5437.
[111] Z. Song, H. Wei, L. Bai, L. Yang, and C. Jia, “Graphalign: Enhancing accurate feature alignment by graph matching for multi-modal 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3358–3369.
[112] Z. Song, C. Jia, L. Yang, H. Wei, and L. Liu, “Graphalign++: An accurate feature alignment by graph matching for multi-modal 3d object detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
[113] H. Yang, Z. Liu, X. Wu, W. Wang, W. Qian, X. He, and D. Cai, “Graph r-cnn: Towards accurate 3d object detection with semantic-decorated local graph,” in European Conference on Computer Vision. Springer, 2022.
[114] X. Wu, L. Peng, H. Yang, L. Xie, C. Huang, C. Deng, H. Liu, and D. Cai, “Sparse fuse dense: Towards high quality 3d detection with depth completion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5418–5427.
[115] Y. Li, A. Yu, T. Meng, B. Caine, J. Ngiam, D. Peng, J. Shen, B. Wu, Y. Lu, D. Zhou, Q. Le, A. Yuille, and M. Tan, “Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[116] Z. Yang, J. Chen, Z. Miao, W. Li, X. Zhu, and L. Zhang, “Deepinteraction: 3d object detection via modality interaction,” Advances in Neural Information Processing Systems, vol. 35, 2022.
[117] Q. Cai, Y. Pan, T. Yao, C.-W. Ngo, and T. Mei, “Objectfusion: Multi-modal 3d object detection with object-centric fusion,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 18067–18076.
[118] Y. Qin, C. Wang, Z. Kang, N. Ma, Z. Li, and R. Zhang, “Supfusion: Supervised lidar-camera fusion for 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 22014–22024.
[119] R. Qian, X. Lai, and X. Li, “3d object detection for autonomous driving: A survey,” Pattern Recognition, vol. 130, p. 108796, 2022.
[120] X. Li, B. Shi, Y. Hou, X. Wu, T. Ma, Y. Li, and L. He, “Homogeneous multi-modal feature fusion and interaction for 3d object detection,” in European Conference on Computer Vision. Springer, 2022, pp. 691–707.
[121] X. Li, T. Ma, Y. Hou, B. Shi, Y. Yang, Y. Liu, X. Wu, Q. Chen, Y. Li, Y. Qiao, and L. He, “Logonet: Towards accurate 3d object detection with local-to-global cross-modal fusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[122] Z. Song, G. Zhang, J. Xie, L. Liu, C. Jia, S. Xu, and Z. Wang, “Voxelnextfusion: A simple, unified, and effective voxel fusion framework for multimodal 3-d object detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–12, 2023.
[123] Y. Wang, Q. Mao, H. Zhu, J. Deng, Y. Zhang, J. Ji, H. Li, and Y. Zhang, “Multi-modal 3d object detection in autonomous driving: a survey,” International Journal of Computer Vision, pp. 1–31, 2023.
[124] S. Xie, L. Kong, W. Zhang, J. Ren, L. Pan, K. Chen, and Z. Liu, “Robobev: Towards robust bird’s eye view perception under corruptions,” arXiv preprint, Apr 2023.
[125] Y. Dong, C. Kang, J. Zhang, Z. Zhu, Y. Wang, X. Yang, H. Su, X. Wei, and J. Zhu, “Benchmarking robustness of 3d object detection to common corruptions in autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[126] J. Mao, S. Shi, X. Wang, and H. Li, “3d object detection for autonomous driving: A comprehensive survey,” International Journal of Computer Vision, pp. 1–55, 2023.
[127] S. Y. Alaba and J. E. Ball, “Deep learning-based image 3-d object detection for autonomous driving,” IEEE Sensors Journal, vol. 23, no. 4, pp. 3378–3394, 2023.
[128] A. Singh and V. Bankiti, “Surround-view vision-based 3d detection for autonomous driving: A survey,” arXiv preprint arXiv:2302.06650, 2023.
[129] A. Singh, “Transformer-based sensor fusion for autonomous driving: A survey,” arXiv preprint arXiv:2302.11481, 2023.
[130] X. Wang, K. Li, and A. Chehri, “Multi-sensor fusion technology for 3d object detection in autonomous driving: A review,” IEEE Transactions on Intelligent Transportation Systems, 2023.
[131] Y. Peng, Y. Qin, X. Tang, Z. Zhang, and L. Deng, “Survey on image and point-cloud fusion-based object detection in autonomous vehicles,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 22772–22789, 2022.
[132] Y. Wu, Y. Wang, S. Zhang, and H. Ogai, “Deep 3d object detection networks using lidar data: A review,” IEEE Sensors Journal, vol. 21, no. 2, pp. 1152–1171, 2020.
[133] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012, pp. 3354–3361.
[134] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11621–11631.
[135] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2446–2454.
[136] C. Sakaridis, D. Dai, and L. Van Gool, “Semantic foggy scene understanding with synthetic data,” International Journal of Computer Vision, vol. 126, pp. 973–992, 2018.
[137] B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao, “Dehazenet: An end-to-end system for single image haze removal,” IEEE transactions on image processing, vol. 25, no. 11, pp. 5187–5198, 2016.
[138] T. Ort, I. Gilitschenski, and D. Rus, “Grounded: The localizing ground penetrating radar evaluation dataset,” in Robotics: Science and Systems, vol. 2, 2021.
[139] M. Pitropov, D. E. Garcia, J. Rebello, M. Smart, C. Wang, K. Czarnecki, and S. Waslander, “Canadian adverse driving conditions dataset,” The International Journal of Robotics Research, vol. 40, no. 4-5, pp. 681–690, 2021.
[140] X. Huang, P. Wang, X. Cheng, D. Zhou, Q. Geng, and R. Yang, “The apolloscape open dataset for autonomous driving and its application,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2702–2719, 2019.
[141] C. A. Diaz-Ruiz, Y. Xia, Y. You, J. Nino, J. Chen, J. Monica, X. Chen, K. Luo, Y. Wang, M. Emond et al., “Ithaca365: Dataset and driving perception under repeated and challenging weather conditions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21383–21392.
[142] D. Hendrycks and T. Dietterich, “Benchmarking neural network robustness to common corruptions and perturbations,” arXiv preprint arXiv:1903.12261, 2019.
[143] S. Xie, Z. Li, Z. Wang, and C. Xie, “On the adversarial robustness of camera-based 3d object detection,” arXiv preprint arXiv:2301.10766, 2023.
[144] J. Sun, Y. Cao, Q. A. Chen, and Z. M. Mao, “Towards robust lidar-based perception in autonomous driving: General black-box adversarial sensor attack and countermeasures,” in 29th USENIX Security Symposium (USENIX Security 20), 2020, pp. 877–894.
[145] D. Liu, R. Yu, and H. Su, “Extending adversarial attacks and defenses to deep 3d point cloud classifiers,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 2279–2283.
[146] S. Li, Z. Wang, F. Juefei-Xu, Q. Guo, X. Li, and L. Ma, “Common corruption robustness of point cloud detectors: Benchmark and enhancement,” IEEE Transactions on Multimedia, 2023.
[147] K. Yu, T. Tao, H. Xie, Z. Lin, Z. Wu, Z. Xia, T. Liang, H. Sun, J. Deng, D. Hao, Y. Wang, X. Liang, and B. Wang, “Benchmarking the robustness of lidar-camera fusion for 3d object detection,” arXiv preprint, 2022.
[148] L. Kong, Y. Liu, X. Li, R. Chen, W. Zhang, J. Ren, L. Pan, K. Chen, and Z. Liu, “Robo3d: Towards robust and reliable 3d perception against corruptions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19994–20006.
[149] L. Kong, S. Xie, H. Hu, L. X. Ng, B. R. Cottereau, and W. T. Ooi, “Robodepth: Robust out-of-distribution depth estimation under corruptions,” arXiv preprint arXiv:2310.15171, 2023.
[150] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska et al., “Lyft level 5 av dataset 2019,” https://level5.lyft.com/dataset, vol. 1, p. 3, 2019.
[151] A. Patil, S. Malla, H. Gang, and Y.-T. Chen, “The h3d dataset for full-surround 3d multi-object detection and tracking in crowded urban scenes,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 9552–9557.
[152] M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan et al., “Argoverse: 3d tracking and forecasting with rich maps,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8748–8757.
[153] Q.-H. Pham, P. Sevestre, R. S. Pahwa, H. Zhan, C. H. Pang, Y. Chen, A. Mustafa, V. Chandrasekhar, and J. Lin, “A*3d dataset: Towards autonomous driving in challenging environments,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 2267–2273.
[154] J. Geyer, Y. Kassahun, M. Mahmudi, X. Ricou, R. Durgesh, A. S. Chung, L. Hauswald, V. H. Pham, M. Mühlegg, S. Dorn et al., “A2d2: Audi autonomous driving dataset,” arXiv preprint arXiv:2004.06320, 2020.
[155] P. Xiao, Z. Shao, S. Hao, Z. Zhang, X. Chai, J. Jiao, Z. Li, J. Wu, K. Sun, K. Jiang et al., “Pandaset: Advanced sensor suite dataset for autonomous driving,” in 2021 IEEE International Intelligent Transportation Systems Conference (ITSC). IEEE, 2021, pp. 3095–3101.
[156] Y. Liao, J. Xie, and A. Geiger, “Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3292–3310, 2022.
[157] Z. Wang, S. Ding, Y. Li, J. Fenn, S. Roychowdhury, A. Wallin, L. Martin, S. Ryvola, G. Sapiro, and Q. Qiu, “Cirrus: A long-range bi-pattern lidar dataset,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 5744–5750.
[158] J. Mao, M. Niu, C. Jiang, H. Liang, J. Chen, X. Liang, Y. Li, C. Ye, W. Zhang, Z. Li et al., “One million scenes for autonomous driving: Once dataset,” arXiv preprint arXiv:2106.11037, 2021.
[159] L. Chen, C. Sima, Y. Li, Z. Zheng, J. Xu, X. Geng, H. Li, C. He, J. Shi, Y. Qiao et al., “Persformer: 3d lane detection via perspective transformer and the openlane benchmark,” in European Conference on Computer Vision. Springer, 2022, pp. 550–567.
[160] J. Ren, L. Pan, and Z. Liu, “Benchmarking and analyzing point cloud classification under corruptions,” in International Conference on Machine Learning. PMLR, 2022, pp. 18559–18575.
[161] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teuliere, and T. Chateau, “Deep manta: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2040–2049.
[162] T. He and S. Soatto, “Mono3d++: Monocular 3d vehicle detection with two-scale 3d hypotheses and task priors,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 8409–8416.
[163] A. Kundu, Y. Li, and J. M. Rehg, “3d-rcnn: Instance-level 3d object reconstruction via render-and-compare,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3559–3568.
[164] F. Manhardt, W. Kehl, and A. Gaidon, “Roi-10d: Monocular lifting of 2d detection to 6d pose and metric shape,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2069–2078.
[165] D. Beker, H. Kato, M. A. Morariu, T. Ando, T. Matsuoka, W. Kehl, and A. Gaidon, “Monocular differentiable rendering for self-supervised 3d object detection,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16. Springer, 2020, pp. 514–529.
[166] S. Zakharov, W. Kehl, A. Bhargava, and A. Gaidon, “Autolabeling 3d objects with differentiable rendering of sdf shape priors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12224–12233.
[167] Y. Xiang, W. Choi, Y. Lin, and S. Savarese, “Data-driven 3d voxel patterns for object category recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1903–1911.
[168] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka, “3d bounding box estimation using deep learning and geometry,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017, pp. 7074–7082.
[169] A. Naiden, V. Paunescu, G. Kim, B. Jeon, and M. Leordeanu, “Shift r-cnn: Deep monocular 3d object detection with closed-form geometric
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1685–1694.
[171] Z. Qin and X. Li, “Monoground: Detecting monocular 3d objects from the ground,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3793–3802.
[172] Z. Wu, Y. Gan, L. Wang, G. Chen, and J. Pu, “Monopgc: Monocular 3d object detection with pixel geometry contexts,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 4842–4849.
[173] M. Zhu, L. Ge, P. Wang, and H. Peng, “Monoedge: Monocular 3d object detection using local perspectives,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 643–652.
[174] F. Yang, X. Xu, H. Chen, Y. Guo, Y. He, K. Ni, and G. Ding, “Gpro3d: Deriving 3d bbox from ground plane in monocular 3d object detection,” Neurocomputing, vol. 562, p. 126894, 2023.
[175] L. Yang, J. Yu, X. Zhang, J. Li, L. Wang, Y. Huang, C. Zhang, H. Wang, and Y. Li, “Monogae: Roadside monocular 3d object detection with ground-aware embeddings,” arXiv preprint arXiv:2310.00400, 2023.
[176] Y. Lu, X. Ma, L. Yang, T. Zhang, Y. Liu, Q. Chu, T. He, Y. Li, and W. Ouyang, “Gupnet++: Geometry uncertainty propagation network for monocular 3d object detection,” arXiv preprint arXiv:2310.15624, 2023.
[177] Z. Min, B. Zhuang, S. Schulter, B. Liu, E. Dunn, and M. Chandraker, “Neurocs: Neural nocs supervision for monocular 3d object localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21404–21414.
[178] Z. Liu, Z. Wu, and R. Tóth, “Smoke: Single-stage monocular 3d object detection via keypoint estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 996–997.
[179] G. Brazil, G. Pons-Moll, X. Liu, and B. Schiele, “Kinematic 3d object detection in monocular video,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16. Springer, 2020, pp. 135–152.
[180] A. Simonelli, S. R. Bulo, L. Porzi, E. Ricci, and P. Kontschieder, “Towards generalization across depth for monocular 3d object detection,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16. Springer, 2020, pp. 767–782.
[181] B. Li, W. Ouyang, L. Sheng, X. Zeng, and X. Wang, “Gs3d: An efficient 3d object detection framework for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1019–1028.
[182] Z. Qin, J. Wang, and Y. Lu, “Monogrnet: A geometric reasoning network for monocular 3d object localization,” in Proceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 8851–8858.
[183] X. Shi, Q. Ye, X. Chen, C. Chen, Z. Chen, and T.-K. Kim, “Geometry-based distance decomposition for monocular 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15172–15181.
[184] W. Bao, B. Xu, and Z. Chen, “Monofenet: Monocular 3d object detection with feature enhancement networks,” IEEE Transactions on Image Processing, vol. 29, pp. 2753–2765, 2019.
[185] X. Liu, N. Xue, and T. Wu, “Learning auxiliary monocular contexts helps monocular 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1810–1818.
[186] X. Liu, C. Zheng, K. B. Cheng, N. Xue, G.-J. Qi, and T. Wu, “Monocular 3d object detection with bounding box denoising in 3d by perceiver,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6436–6446.
[187] Z. Zhou, L. Du, X. Ye, Z. Zou, X. Tan, L. Zhang, X. Xue, and J. Feng, “Sgm3d: Stereo guided monocular 3d object detection,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 10478–10485, 2022.
[188] L. Peng, X. Wu, Z. Yang, H. Liu, and D. Cai, “Did-m3d: Decoupling instance depth for monocular 3d object detection,” in European Conference on Computer Vision. Springer, 2022, pp. 71–88.
[189] J. Xu, L. Peng, H. Cheng, H. Li, W. Qian, K. Li, W. Wang, and D. Cai, “Mononerd: Nerf-like representations for monocular 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6814–6824.
constraints,” in 2019 IEEE international conference on image process- [190] C. Xia, W. Zhao, H. Han, Z. Tao, B. Ge, X. Gao, K.-C. Li, and
ing (ICIP). IEEE, 2019, pp. 61–65. Y. Zhang, “Monosaid: Monocular 3d object detection based on scene-
[170] Q. Lian, B. Ye, R. Xu, W. Yao, and T. Zhang, “Exploring geometric level adaptive instance depth estimation,” Journal of Intelligent &
consistency for monocular 3d object detection,” in Proceedings of the Robotic Systems, vol. 110, no. 1, p. 2, 2024.
[191] R. Tao, W. Han, Z. Qiu, C.-z. Xu, and J. Shen, “Weakly supervised monocular 3d object detection using multi-view projection and direction consistency,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17482–17492.
[192] X. Wu, D. Ma, X. Qu, X. Jiang, and D. Zeng, “Depth dynamic center difference convolutions for monocular 3d object detection,” Neurocomputing, vol. 520, pp. 73–81, 2023.
[193] C. Huang, T. He, H. Ren, W. Wang, B. Lin, and D. Cai, “Obmo: One bounding box multiple objects for monocular 3d object detection,” IEEE Transactions on Image Processing, vol. 32, pp. 6570–6581, 2023.
[194] L. Chen, J. Sun, Y. Xie, S. Zhang, Q. Shuai, Q. Jiang, G. Zhang, H. Bao, and X. Zhou, “Shape prior guided instance disparity estimation for 3d object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5529–5540, 2021.
[195] L. Yang, X. Zhang, J. Li, L. Wang, M. Zhu, and L. Zhu, “Lite-fpn for keypoint-based monocular 3d object detection,” Knowledge-Based Systems, vol. 271, p. 110517, 2023.
[196] C. Park, H. Kim, J. Jang, and J. Paik, “Odd-m3d: Object-wise dense depth estimation for monocular 3d object detection,” IEEE Transactions on Consumer Electronics, 2024.
[197] X. Li, J. Liu, Y. Lei, L. Ma, X. Fan, and R. Liu, “Monotdp: Twin depth perception for monocular 3d object detection in adverse scenes,” arXiv preprint arXiv:2305.10974, 2023.
[198] G. Brazil, A. Kumar, J. Straub, N. Ravi, J. Johnson, and G. Gkioxari, “Omni3d: A large benchmark and model for 3d object detection in the wild,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 13154–13164.
[199] J. U. Kim, H.-I. Kim, and Y. M. Ro, “Stereoscopic vision recalling memory for monocular 3d object detection,” IEEE Transactions on Image Processing, 2023.
[200] X. Ma, S. Liu, Z. Xia, H. Zhang, X. Zeng, and W. Ouyang, “Rethinking pseudo-lidar representation,” Jan 2020, pp. 311–327.
[201] J. Chang and G. Wetzstein, “Deep optics for monocular depth estimation and 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 10193–10202.
[202] H.-I. Liu, C. Wu, J.-H. Cheng, W. Chai, S.-Y. Wang, G. Liu, J.-N. Hwang, H.-H. Shuai, and W.-H. Cheng, “Monotakd: Teaching assistant knowledge distillation for monocular 3d object detection,” arXiv preprint arXiv:2404.04910, 2024.
[203] Y. Kim, S. Kim, S. Sim, J. W. Choi, and D. Kum, “Boosting monocular 3d object detection with object-centric auxiliary depth supervision,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 2, pp. 1801–1813, 2022.
[204] L. Wang, L. Du, X. Ye, Y. Fu, G. Guo, X. Xue, J. Feng, and L. Zhang, “Depth-conditioned dynamic message propagation for monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 454–463.
[205] M. Ding, Y. Huo, H. Yi, Z. Wang, J. Shi, Z. Lu, and P. Luo, “Learning depth-guided convolutions for monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition workshops, 2020, pp. 1000–1001.
[206] Z. Wu, Y. Wu, J. Pu, X. Li, and X. Wang, “Attention-based depth distillation with 3d-aware positional encoding for monocular 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 3, 2023, pp. 2892–2900.
[207] H. Sheng, S. Cai, N. Zhao, B. Deng, M.-J. Zhao, and G. H. Lee, “Pdr: Progressive depth regularization for monocular 3d object detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
[208] C. Tao, J. Cao, C. Wang, Z. Zhang, and Z. Gao, “Pseudo-mono for monocular 3d object detection in autonomous driving,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
[209] A. Kumar, G. Brazil, E. Corona, A. Parchami, and X. Liu, “Deviant: Depth equivariant network for monocular 3d object detection,” in European Conference on Computer Vision. Springer, 2022, pp. 664–683.
[210] W. Zhang, D. Liu, C. Ma, and W. Cai, “Alleviating foreground sparsity for semi-supervised monocular 3d object detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 7542–7552.
[211] Z. Wu, Y. Gan, Y. Wu, R. Wang, X. Wang, and J. Pu, “Fd3d: Exploiting foreground depth map for feature-supervised monocular 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 6, 2024, pp. 6189–6197.
[212] S. Wang and J. Zheng, “Monoskd: General distillation framework for monocular 3d object detection via spearman correlation coefficient,” arXiv preprint arXiv:2310.11316, 2023.
[213] J. Sun, L. Chen, Y. Xie, S. Zhang, Q. Jiang, X. Zhou, and H. Bao, “Disp r-cnn: Stereo 3d object detection via shape prior guided instance disparity estimation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10548–10557.
[214] Z. Qin, J. Wang, and Y. Lu, “Triangulation learning network: from monocular to stereo 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7615–7623.
[215] Z. Xu, W. Zhang, X. Ye, X. Tan, W. Yang, S. Wen, E. Ding, A. Meng, and L. Huang, “Zoomnet: Part-aware adaptive zooming neural network for 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12557–12564.
[216] W. Peng, H. Pan, H. Liu, and Y. Sun, “Ida-3d: Instance-depth-aware 3d object detection from stereo vision for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13015–13024.
[217] X. Peng, X. Zhu, T. Wang, and Y. Ma, “Side: center-based stereo 3d detector with structure-aware instance depth estimation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 119–128.
[218] C.-H. Wang, H.-W. Chen, Y. Chen, P.-Y. Hsiao, and L.-C. Fu, “Vopifnet: Voxel-pixel fusion network for multi-class 3d object detection,” IEEE Transactions on Intelligent Transportation Systems, pp. 1–11, 2024.
[219] Y. Wu, Z. Liu, Y. Chen, X. Zheng, Q. Zhang, M. Yang, and G. Tang, “Fcnet: Stereo 3d object detection with feature correlation networks,” Entropy, vol. 24, no. 8, p. 1121, 2022.
[220] M. Feng, J. Cheng, H. Jia, L. Liu, G. Xu, and X. Yang, “Mc-stereo: Multi-peak lookup and cascade search range for stereo matching,” arXiv preprint arXiv:2311.02340, 2023.
[221] Z. Shen, Y. Dai, X. Song, Z. Rao, D. Zhou, and L. Zhang, “Pcw-net: Pyramid combination and warping cost volume for stereo matching,” in European conference on computer vision. Springer, 2022, pp. 280–297.
[222] O.-H. Kwon and E. Zell, “Image-coupled volume propagation for stereo matching,” in 2023 IEEE International Conference on Image Processing (ICIP). IEEE, 2023, pp. 2510–2514.
[223] Z. Chen, W. Long, H. Yao, Y. Zhang, B. Wang, Y. Qin, and J. Wu, “Mocha-stereo: Motif channel attention network for stereo matching,” arXiv preprint arXiv:2404.06842, 2024.
[224] Z. Shen, X. Song, Y. Dai, D. Zhou, Z. Rao, and L. Zhang, “Digging into uncertainty-based pseudo-label for robust stereo matching,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[225] G. Xu, X. Wang, X. Ding, and X. Yang, “Iterative geometry encoding volume for stereo matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21919–21928.
[226] T. Guan, C. Wang, and Y.-H. Liu, “Neural markov random field for stereo matching,” arXiv preprint arXiv:2403.11193, 2024.
[227] Y. You, Y. Wang, W.-L. Chao, D. Garg, G. Pleiss, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving,” in ICLR, 2020.
[228] R. Qian, D. Garg, Y. Wang, Y. You, S. Belongie, B. Hariharan, M. Campbell, K. Q. Weinberger, and W.-L. Chao, “End-to-end pseudo-lidar for image-based 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5881–5890.
[229] C. Li, J. Ku, and S. L. Waslander, “Confidence guided stereo 3d object detection with split depth estimation,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 5776–5783.
[230] P. Li, S. Su, and H. Zhao, “Rts3d: Real-time stereo 3d detection from 4d feature-consistency embedding space for autonomous driving,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 1930–1939.
[231] H. Königshof, N. O. Salscheider, and C. Stiller, “Realtime 3d object detection for automated driving using stereo vision and semantic information,” in 2019 IEEE Intelligent Transportation Systems Conference (ITSC). IEEE, 2019, pp. 1405–1410.
[232] H. Königshof and C. Stiller, “Learning-based shape estimation with grid map patches for realtime 3d object detection for automated driving,” in 2020 IEEE 23rd International conference on intelligent transportation systems (ITSC). IEEE, 2020, pp. 1–6.
[233] D. Garg, Y. Wang, B. Hariharan, M. Campbell, K. Q. Weinberger, and W.-L. Chao, “Wasserstein distances for stereo disparity estimation,” Advances in Neural Information Processing Systems, vol. 33, pp. 22517–22529, 2020.
[234] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 66–75.
[235] A. Gao, Y. Pang, J. Nie, Z. Shao, J. Cao, Y. Guo, and X. Li, “Esgn: Efficient stereo geometry network for fast 3d object detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2022.
[236] Y. Chen, S. Huang, S. Liu, B. Yu, and J. Jia, “Dsgn++: Exploiting visual-spatial relation for stereo-based 3d detectors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4416–4429, 2022.
[237] X. Guo, S. Shi, X. Wang, and H. Li, “Liga-stereo: Learning lidar geometry aware representations for stereo-based 3d detector,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3153–3163.
[238] Y. Wang, B. Yang, R. Hu, M. Liang, and R. Urtasun, “Plumenet: Efficient 3d object detection from stereo images,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 3383–3390.
[239] X. Wang, G. Xu, H. Jia, and X. Yang, “Selective-stereo: Adaptive frequency information selection for stereo matching,” arXiv preprint arXiv:2403.00486, 2024.
[240] C.-W. Liu, Q. Chen, and R. Fan, “Playing to vision foundation model’s strengths in stereo matching,” arXiv preprint arXiv:2404.06261, 2024.
[241] B. Liu, H. Yu, and Y. Long, “Local similarity pattern and cost self-reassembling for deep stereo matching networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1647–1655.
[242] Y. Shi, “Rethinking iterative stereo matching from diffusion bridge model perspective,” arXiv preprint arXiv:2404.09051, 2024.
[243] T. Yuan, J. Hu, S. Ou, W. Yang, and Y. Hei, “Hourglass cascaded recurrent stereo matching network,” Image and Vision Computing, p. 105074, 2024.
[244] X. Cheng, Y. Zhong, M. Harandi, Y. Dai, X. Chang, H. Li, T. Drummond, and Z. Ge, “Hierarchical neural architecture search for deep stereo matching,” Advances in neural information processing systems, vol. 33, pp. 22158–22169, 2020.
[245] J. Li, P. Wang, P. Xiong, T. Cai, Z. Yan, L. Yang, J. Liu, H. Fan, and S. Liu, “Practical stereo matching via cascaded recurrent network with adaptive correlation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16263–16272.
[246] X. Li, Y. Fan, G. Lv, and H. Ma, “Area-based correlation and non-local attention network for stereo matching,” The Visual Computer, vol. 38, no. 11, pp. 3881–3895, 2022.
[247] Y. Zhang, Y. Chen, X. Bai, S. Yu, K. Yu, Z. Li, and K. Yang, “Adaptive unimodal cost volume filtering for deep stereo matching,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12926–12934.
[248] S. Chen, B. Li, W. Wang, H. Zhang, H. Li, and Z. Wang, “Cost affinity learning network for stereo matching,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 2120–2124.
[249] Z. Shen, Y. Dai, and Z. Rao, “Cfnet: Cascade and fused cost volume for robust stereo matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13906–13915.
[250] K. Zeng, Y. Wang, Q. Zhu, J. Mao, and H. Zhang, “Deep progressive fusion stereo network,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 25437–25447, 2021.
[251] M. Tahmasebi, S. Huq, K. Meehan, and M. McAfee, “Dcvsmnet: Double cost volume stereo matching network,” arXiv preprint arXiv:2402.16473, 2024.
[252] Y. Deng, J. Xiao, S. Z. Zhou, and J. Feng, “Detail preserving coarse-to-fine matching for stereo matching and optical flow,” IEEE Transactions on Image Processing, vol. 30, pp. 5835–5847, 2021.
[253] G. Xu, J. Cheng, P. Guo, and X. Yang, “Attention concatenation volume for accurate and efficient stereo matching,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 12981–12990.
[254] J. Huang, G. Huang, Z. Zhu, Y. Ye, and D. Du, “Bevdet: High-performance multi-camera 3d object detection in bird-eye-view,” arXiv preprint arXiv:2112.11790, 2021.
[255] J. Philion and S. Fidler, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16. Springer, 2020, pp. 194–210.
[256] L. Yang, K. Yu, T. Tang, J. Li, K. Yuan, L. Wang, X. Zhang, and P. Chen, “Bevheight: A robust framework for vision-based roadside 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21611–21620.
[257] X. Chi, J. Liu, M. Lu, R. Zhang, Z. Wang, Y. Guo, and S. Zhang, “Bev-san: Accurate bev 3d object detection via slice attention networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17461–17470.
[258] J. Liu, R. Zhang, X. Chi, X. Li, M. Lu, Y. Guo, and S. Zhang, “Multi-latent space alignments for unsupervised domain adaptation in multi-view 3d object detection,” arXiv preprint arXiv:2211.17126, 2022.
[259] J. Huang and G. Huang, “Bevpoolv2: A cutting-edge implementation of bevdet toward deployment,” arXiv preprint arXiv:2211.17111, 2022.
[260] Y. Li, H. Bao, Z. Ge, J. Yang, J. Sun, and Z. Li, “Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 1486–1494.
[261] Y. Li, J. Yang, J. Sun, H. Bao, Z. Ge, and L. Xiao, “Bevstereo++: Accurate depth estimation in multi-view 3d object detection via dynamic temporal stereo,” arXiv preprint arXiv:2304.04185, 2023.
[262] P. Huang, L. Liu, R. Zhang, S. Zhang, X. Xu, B. Wang, and G. Liu, “Tig-bev: Multi-view bev 3d object detection via target inner-geometry learning,” arXiv preprint arXiv:2212.13979, 2022.
[263] S. Wang, X. Zhao, H.-M. Xu, Z. Chen, D. Yu, J. Chang, Z. Yang, and F. Zhao, “Towards domain generalization for multi-view 3d object detection in bird-eye-view,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13333–13342.
[264] P. Dong, Z. Kong, X. Meng, P. Yu, Y. Gong, G. Yuan, H. Tang, and Y. Wang, “Hotbev: Hardware-oriented transformer-based multi-view 3d detector for bev perception,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[265] Z. Li, S. Lan, J. M. Alvarez, and Z. Wu, “Bevnext: Reviving dense bev frameworks for 3d object detection,” arXiv preprint arXiv:2312.01696, 2023.
[266] Y. Jiang, L. Zhang, Z. Miao, X. Zhu, J. Gao, W. Hu, and Y.-G. Jiang, “Polarformer: Multi-camera 3d object detection with polar transformer,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 1042–1050.
[267] Y. Wang, Y. Chen, and Z. Zhang, “Frustumformer: Adaptive instance-aware resampling for multi-view 3d detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5096–5105.
[268] Z. Luo, C. Zhou, G. Zhang, and S. Lu, “Detr4d: Direct multi-view 3d object detection with sparse attention,” arXiv preprint arXiv:2212.07849, 2022.
[269] X. Lin, T. Lin, Z. Pei, L. Huang, and Z. Su, “Sparse4d: Multi-view 3d object detection with sparse spatial-temporal fusion,” arXiv preprint arXiv:2211.10581, 2022.
[270] ——, “Sparse4d v2: Recurrent temporal fusion with sparse model,” arXiv preprint arXiv:2305.14018, 2023.
[271] X. Lin, Z. Pei, T. Lin, L. Huang, and Z. Su, “Sparse4d v3: Advancing end-to-end 3d detection and tracking,” arXiv preprint arXiv:2311.11722, 2023.
[272] J. Park, C. Xu, S. Yang, K. Keutzer, K. Kitani, M. Tomizuka, and W. Zhan, “Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection,” arXiv preprint arXiv:2210.02443, 2022.
[273] K. Xiong, S. Gong, X. Ye, X. Tan, J. Wan, E. Ding, J. Wang, and X. Bai, “Cape: Camera view position embedding for multi-view 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21570–21579.
[274] D. Chen, J. Li, V. Guizilini, R. A. Ambrus, and A. Gaidon, “Viewpoint equivariance for multi-view 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9213–9222.
[275] Z. Chen, Z. Li, S. Zhang, L. Fang, Q. Jiang, and F. Zhao, “Graph-detr3d: rethinking overlapping regions for multi-view 3d object detection,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 5999–6008.
[276] C. Shu, J. Deng, F. Yu, and Y. Liu, “3dppe: 3d point positional encoding for transformer-based multi-camera 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3580–3589.
[277] Z. Chen, Z. Li, S. Zhang, L. Fang, Q. Jiang, and F. Zhao, “Bevdistill: Cross-modal bev distillation for multi-view 3d object detection,” arXiv preprint arXiv:2211.09386, 2022.
[278] S. Wang, Y. Liu, T. Wang, Y. Li, and X. Zhang, “Exploring object-centric temporal modeling for efficient multi-view 3d object detection,” arXiv preprint arXiv:2303.11926, 2023.
[279] X. Jiang, S. Li, Y. Liu, S. Wang, F. Jia, T. Wang, L. Han, and X. Zhang, “Far3d: Expanding the horizon for surround-view 3d object detection,” arXiv preprint arXiv:2308.09616, 2023.
[280] C. Pan, B. Yaman, S. Velipasalar, and L. Ren, “Clip-bevformer: Enhancing multi-view image-based bev detector with ground truth flow,” arXiv preprint arXiv:2403.08919, 2024.
[281] C. Yang, Y. Chen, H. Tian, C. Tao, X. Zhu, Z. Zhang, G. Huang, H. Li, Y. Qiao, L. Lu et al., “Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17830–17839.
[282] Y. Zhou, H. Zhu, Q. Liu, S. Chang, and M. Guo, “Monoatt: Online monocular 3d object detection with adaptive token transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17493–17503.
[283] L. Yan, P. Yan, S. Xiong, X. Xiang, and Y. Tan, “Monocd: Monocular 3d object detection with complementary depths,” in CVPR, 2024.
[284] P. Li, X. Chen, and S. Shen, “Stereo r-cnn based 3d object detection for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7644–7652.
[285] A. D. Pon, J. Ku, C. Li, and S. L. Waslander, “Object-centric stereo matching for 3d object detection,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 8383–8389.
[286] S. Li, Z. Liu, Z. Shen, and K.-T. Cheng, “Stereo neural vernier caliper,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1376–1385.
[287] Z. Liu, X. Ye, X. Tan, E. Ding, and X. Bai, “Stereodistill: Pick the cream from lidar for distilling stereo-based 3d object detection,” arXiv preprint arXiv:2301.01615, 2023.
[288] Z. Wang, D. Li, C. Luo, C. Xie, and X. Yang, “Distillbev: Boosting multi-camera 3d object detection with cross-modal knowledge distillation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8637–8646.
[289] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object detection from point clouds,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 7652–7660.
[290] B. Yang, M. Liang, and R. Urtasun, “Hdnet: Exploiting hd maps for 3d object detection,” in Conference on Robot Learning. PMLR, 2018, pp. 146–155.
[291] J. Beltrán, C. Guindel, F. M. Moreno, D. Cruzado, F. Garcia, and A. De La Escalera, “Birdnet: a 3d object detection framework from lidar information,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 3517–3523.
[292] J. Zarzar, S. Giancola, and B. Ghanem, “Pointrgcn: Graph convolution networks for 3d vehicles detection refinement,” arXiv: Computer Vision and Pattern Recognition, Nov 2019.
[293] J. Ngiam, B. Caine, W. Han, B. Yang, Y. Chai, P. Sun, Y. Zhou, X. Yi, O. Alsharif, P. Nguyen, Z. Chen, J. Shlens, and V. Vasudevan, “Starnet: Targeted computation for object detection in point clouds,” Cornell University - arXiv, Aug 2019.
[294] L. Xie, C. Xiang, Z. Yu, G. Xu, Z. Yang, D. Cai, and X. He, “Pi-rcnn: An efficient multi-sensor 3d object detector with point-based attentive cont-conv fusion module,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 12460–12467.
[295] Q. Wang, J. Chen, J. Deng, and X. Zhang, “3d-centernet: 3d object detection network for point clouds with center estimation priority,” Pattern Recognition, p. 107884, Jul 2021.
[296] Y. Zhang, D. Huang, and Y. Wang, “Pc-rgnn: Point cloud completion and graph neural network for 3d object detection,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 4, 2021, pp. 3430–3437.
[297] Y. Zhang, Q. Hu, G. Xu, Y. Ma, J. Wan, and Y. Guo, “Not all points are equal: Learning highly efficient point-based detectors for 3d lidar point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18953–18962.
[298] I. Koo, I. Lee, S.-H. Kim, H.-S. Kim, W.-j. Jeon, and C. Kim, “Pg-rcnn: Semantic surface point generation for 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 18142–18151.
[299] S. Shi, Z. Wang, J. Shi, X. Wang, and H. Li, “From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 8, pp. 2647–2664, 2020.
[300] Z. Liu, X. Zhao, T. Huang, R. Hu, Y. Zhou, and X. Bai, “Tanet: Robust 3d object detection from point clouds with triple attention,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11677–11684.
[301] H. Yi, S. Shi, M. Ding, J. Sun, K. Xu, H. Zhou, Z. Wang, S. Li, and G. Wang, “Segvoxelnet: Exploring semantic context and depth-aware features for 3d vehicle detection from point cloud,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), May 2020.
[302] Q. Chen, L. Sun, Z. Wang, K. Jia, and A. Yuille, “Object as hotspots: An anchor-free 3d object detection approach via firing of hotspots,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16. Springer, 2020, pp. 68–84.
[303] C. Yu, J. Lei, B. Peng, H. Shen, and Q. Huang, “Siev-net: A structure-information enhanced voxel network for 3d object detection from lidar point clouds,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, 2022.
[304] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, “Pv-rcnn: Point-voxel feature set abstraction for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10529–10538.
[305] C. He, H. Zeng, J. Huang, X.-S. Hua, and L. Zhang, “Structure aware single-stage 3d object detection from point cloud,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11873–11882.
[306] T. Jiang, N. Song, H. Liu, R. Yin, Y. Gong, and J. Yao, “Vic-net: voxelization information compensation network for point cloud 3d object detection,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 13408–13414.
[307] H. Zhang, G. Luo, X. Wang, Y. Li, W. Ding, and F.-Y. Wang, “Sasan: Shape-adaptive set abstraction network for point-voxel 3d object detection,” IEEE Transactions on Neural Networks and Learning Systems, 2023.
[308] H. Yang, W. Wang, M. Chen, B. Lin, T. He, H. Chen, X. He, and W. Ouyang, “Pvt-ssd: Single-stage 3d object detector with point-voxel transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13476–13487.
[309] B. Fan, K. Zhang, and J. Tian, “Hcpvf: Hierarchical cascaded point-voxel fusion for 3d object detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
[310] J. Cao, C. Tao, Z. Zhang, Z. Gao, X. Luo, S. Zheng, and Y. Zhu, “Accelerating point-voxel representation of 3d object detection for automatic driving,” IEEE Transactions on Artificial Intelligence, 2023.
[311] C. Feng, C. Xiang, X. Xie, Y. Zhang, M. Yang, and X. Li, “Hpv-rcnn: Hybrid point–voxel two-stage network for lidar based 3-d object detection,” IEEE Transactions on Computational Social Systems, 2023.
[312] V. A. Sindagi, Y. Zhou, and O. Tuzel, “Mvx-net: Multimodal voxelnet for 3d object detection,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 7276–7282.
[313] K. Shin, Y. P. Kwon, and M. Tomizuka, “Roarnet: A robust 3d object detection based on region approximation refinement,” in 2019 IEEE intelligent vehicles symposium (IV). IEEE, 2019, pp. 2510–2515.
[314] M. Simon, K. Amende, A. Kraus, J. Honer, T. Samann, H. Kaulbersch, S. Milz, and H. Michael Gross, “Complexer-yolo: Real-time 3d object detection and tracking on semantic point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
[315] C. Wang, C. Ma, M. Zhu, and X. Yang, “Pointaugmenting: Cross-modal augmentation for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11794–11803.
[316] S. Xu, D. Zhou, J. Fang, J. Yin, Z. Bin, and L. Zhang, “Fusionpainting: Multimodal fusion with adaptive attention for 3d object detection,” in 2021 IEEE International Intelligent Transportation Systems Conference (ITSC). IEEE, 2021, pp. 3047–3054.
[317] G. P. Meyer, J. Charland, D. Hegde, A. Laddha, and C. Vallespi-Gonzalez, “Sensor fusion for joint 3d object detection and semantic segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2019, pp. 0–0.
[318] R. Nabati and H. Qi, “Centerfusion: Center-based radar and camera fusion for 3d object detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 1527–1536.
[319] S. Xu, F. Li, Z. Song, J. Fang, S. Wang, and Z.-X. Yang, "Multi-sem fusion: Multimodal semantic fusion for 3d object detection," 2023.
[320] G. Xie, Z. Chen, M. Gao, M. Hu, and X. Qin, "Ppf-det: Point-pixel fusion for multi-modal 3d object detection," IEEE Transactions on Intelligent Transportation Systems, 2024.
[321] M. Liang, B. Yang, S. Wang, and R. Urtasun, "Deep continuous fusion for multi-sensor 3d object detection," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 641–656.
[322] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun, "Multi-task multi-sensor fusion for 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7345–7353.
[323] Y. Li, X. Qi, Y. Chen, L. Wang, Z. Li, J. Sun, and J. Jia, "Voxel field fusion for 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1120–1129.
[324] Z. Song, G. Zhang, L. Liu, L. Yang, S. Xu, C. Jia, F. Jia, and L. Wang, "Robofusion: Towards robust multi-modal 3d object detection via sam," arXiv preprint arXiv:2401.03907, 2024.
[325] Y. Kim, K. Park, M. Kim, D. Kum, and J. W. Choi, "3d dual-fusion: Dual-domain dual-query camera-lidar fusion for 3d object detection," arXiv preprint arXiv:2211.13529, 2022.
[326] Z. Chen, Z. Li, S. Zhang, L. Fang, Q. Jiang, and F. Zhao, "Autoalignv2: Deformable feature aggregation for dynamic multi-modal 3d object detection," Jul 2022.
[327] S. Pang, D. Morris, and H. Radha, "Clocs: Camera-lidar object candidates fusion for 3d object detection," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 10386–10393.
[328] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, "Joint 3d proposal generation and object detection from view aggregation," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1–8.
[329] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3d object detection network for autonomous driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1907–1915.
[330] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, "Frustum pointnets for 3d object detection from rgb-d data," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 918–927.
[331] A. Paigwar, D. Sierra-Gonzalez, Ö. Erkent, and C. Laugier, "Frustum-pointpillars: A multi-stage approach for 3d object detection using rgb camera and lidar," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2926–2933.
[332] S. Pang, D. Morris, and H. Radha, "Fast-clocs: Fast camera-lidar object candidates fusion for 3d object detection," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 187–196.
[333] Z. Chen, Z. Li, S. Zhang, L. Fang, Q. Jiang, F. Zhao, B. Zhou, and H. Zhao, "Autoalign: Pixel-instance feature aggregation for multi-modal 3d object detection," in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, Jul 2022.
[334] H. Zhang, L. Liang, P. Zeng, X. Song, and Z. Wang, "Sparselif: High-performance sparse lidar-camera fusion for 3d object detection," arXiv preprint arXiv:2403.07284, 2024.
[335] C. Hu, H. Zheng, K. Li, J. Xu, W. Mao, M. Luo, L. Wang, M. Chen, K. Liu, Y. Zhao et al., "Fusionformer: A multi-sensory fusion in bird's-eye-view and temporal consistent transformer for 3d object detection," arXiv preprint arXiv:2309.05257, 2023.
[336] Y. Li, L. Fan, Y. Liu, Z. Huang, Y. Chen, N. Wang, and Z. Zhang, "Fully sparse fusion for 3d object detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–15, 2024.
[337] Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. L. Rus, and S. Han, "Bevfusion: Multi-task multi-sensor fusion with unified bird's-eye view representation," pp. 2774–2781, 2023.
[338] H. Hu, F. Wang, J. Su, L. Hu, T. Feng, Z. Zhang, and W. Zhang, "Ea-bev: Edge-aware bird's-eye-view projector for 3d object detection."
[339] H. Cai, Z. Zhang, Z. Zhou, Z. Li, W. Ding, and J. Zhao, "Bevfusion4d: Learning lidar-camera fusion under bird's-eye-view via cross-modality guidance and temporal aggregation," arXiv preprint arXiv:2303.17099, 2023.
[340] "Focusing on hard instance for 3d object detection," Aug 2023.
[341] H. Wang, H. Tang, S. Shi, A. Li, Z. Li, B. Schiele, and L. Wang, "Unitr: A unified and efficient multi-modal transformer for bird's-eye-view representation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6792–6802.
[342] Y. Jiao, Z. Jie, S. Chen, J. Chen, X. Wei, L. Ma, and Y.-G. Jiang, "Msmdfusion: Fusing lidar and camera at multiple scales with multi-depth seeds for 3d object detection," Sep 2022.
[343] Z. Song, L. Yang, S. Xu, L. Liu, D. Xu, C. Jia, F. Jia, and L. Wang, "Graphbev: Towards robust bev feature alignment for multi-modal 3d object detection," arXiv preprint arXiv:2403.11848, 2024.
[344] Z. Song, F. Jia, H. Pan, Y. Luo, C. Jia, G. Zhang, L. Liu, Y. Ji, L. Yang, and L. Wang, "Contrastalign: Toward robust bev feature alignment via contrastive learning for multi-modal 3d object detection," arXiv preprint arXiv:2405.16873, 2024.
[345] J. Yin, J. Shen, R. Chen, W. Li, R. Yang, P. Frossard, and W. Wang, "Is-fusion: Instance-scene collaborative fusion for multimodal 3d object detection," arXiv preprint arXiv:2403.15241, 2024.
[346] M. Zeeshan Zia, M. Stark, and K. Schindler, "Are cars just 3d boxes? Jointly estimating the 3d shape of multiple objects," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3678–3685.
[347] H. Chen, Y. Huang, W. Tian, Z. Gao, and L. Xiong, "Monorun: Monocular 3d object detection by reconstruction and uncertainty propagation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10379–10388.
[348] J. Ku, A. Pon, and S. Waslander, "Monocular 3d object detection leveraging accurate proposals and shape reconstruction," arXiv preprint, Apr 2019.
[349] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, "Deepsdf: Learning continuous signed distance functions for shape representation," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019.
[350] E. Jörgensen, C. Zach, and F. Kahl, "Monocular 3d object detection and box fitting trained end-to-end using intersection-over-union loss," arXiv preprint, Jun 2019.
[351] H.-N. Hu, Q.-Z. Cai, D. Wang, J. Lin, M. Sun, P. Krahenbuhl, T. Darrell, and F. Yu, "Joint monocular 3d vehicle detection and tracking," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5390–5399.
[352] C. Godard, O. Mac Aodha, and G. J. Brostow, "Unsupervised monocular depth estimation with left-right consistency," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 270–279.
[353] X. Chu, J. Deng, Y. Li, Z. Yuan, Y. Zhang, J. Ji, and Y. Zhang, "Neighbor-vote: Improving monocular 3d object detection through neighbor distance voting," arXiv preprint, Jul 2021.
[354] Y. Hong, H. Dai, and Y. Ding, "Cross-modality knowledge distillation network for monocular 3d object detection," in European Conference on Computer Vision. Springer, 2022, pp. 87–104.
[355] X. Weng and K. Kitani, "Monocular 3d object detection with pseudo-lidar point cloud," in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
[356] X. Wang, W. Yin, T. Kong, Y. Jiang, L. Li, and C. Shen, "Task-aware monocular depth estimation for 3d object detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12257–12264.
[357] X. Ye, L. Du, Y. Shi, Y. Li, X. Tan, J. Feng, E. Ding, and S. Wen, "Monocular 3d object detection via feature domain adaptation," in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX. Springer, 2020, pp. 17–34.
[358] L. Wang, L. Zhang, Y. Zhu, Z. Zhang, T. He, M. Li, and X. Xue, "Progressive coordinate transforms for monocular 3d object detection," Advances in Neural Information Processing Systems, vol. 34, pp. 13364–13377, 2021.
[359] H. Meng, C. Li, G. Chen, L. Chen et al., "Efficient 3d object detection based on pseudo-lidar representation," IEEE Transactions on Intelligent Vehicles, 2023.
[360] X. Guo, K. Yang, W. Yang, X. Wang, and H. Li, "Group-wise correlation stereo network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3273–3282.
[361] P. Cao, H. Chen, Y. Zhang, and G. Wang, "Multi-view frustum pointnet for object detection in autonomous driving," in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 3896–3899.
[362] C. Xu, B. Wu, J. Hou, S. Tsai, R. Li, J. Wang, W. Zhan, Z. He, P. Vajda, K. Keutzer et al., "Nerf-det: Learning geometry-aware volumetric representation for multi-view 3d object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 23320–23330.
[363] D. Wang, X. Cui, X. Chen, Z. Zou, T. Shi, S. Salcudean, Z. J. Wang, and R. Ward, "Multi-view 3d reconstruction with transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5722–5731.
[364] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[365] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2021.
[366] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[367] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable detr: Deformable transformers for end-to-end object detection," arXiv preprint arXiv:2010.04159, 2020.
[368] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in European Conference on Computer Vision. Springer, 2020, pp. 213–229.
[369] H. Li, C. Sima, J. Dai, W. Wang, L. Lu, H. Wang, J. Zeng, Z. Li, J. Yang, H. Deng et al., "Delving into the devils of bird's-eye-view perception: A review, evaluation and recipe," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[370] Z. Chong, X. Ma, H. Zhang, Y. Yue, H. Li, Z. Wang, and W. Ouyang, "Monodistill: Learning spatial features for monocular 3d object detection," arXiv preprint arXiv:2201.10830, 2022.
[371] J. Chen, Q. Wang, W. Peng, H. Xu, X. Li, and W. Xu, "Disparity-based multiscale fusion network for transportation detection," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 10, pp. 18855–18863, 2022.
[372] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," arXiv preprint, Apr 2019.
[373] J. Li, S. Luo, Z. Zhu, H. Dai, A. S. Krylov, Y. Ding, and L. Shao, "3d iou-net: Iou guided 3d object detector for point clouds," arXiv preprint arXiv:2004.04962, 2020.
[374] Z. Li, F. Wang, and N. Wang, "Lidar r-cnn: An efficient and universal 3d object detector," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7546–7555.
[375] C. Zhang, H. Wang, Y. Cai, L. Chen, Y. Li, M. A. Sotelo, and Z. Li, "Robust-fusionnet: Deep multimodal sensor fusion for 3-d object detection under severe weather conditions," IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–13, 2022.
[376] C. Chen, L. Z. Fragonara, and A. Tsourdos, "Roifusion: 3d object detection from lidar and vision," IEEE Access, vol. 9, pp. 51710–51721, 2021.
[377] Y. Zhang, J. Chen, and D. Huang, "Cat-det: Contrastively augmented transformer for multi-modal 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 908–917.
[378] L. N. Smith and N. Topin, "Super-convergence: Very fast training of neural networks using large learning rates," in Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, vol. 11006. SPIE, 2019, pp. 369–386.
[379] H. Meng, C. Li, G. Chen, Z. Gu, and A. Knoll, "Er3d: An efficient real-time 3d object detection framework for autonomous driving," in 29th IEEE International Conference on Parallel and Distributed Systems, 2023.
[380] C. Li, H. Meng, G. Chen, and L. Chen, "Real-time pseudo-lidar 3d object detection with geometric constraints," in 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2022, pp. 3298–3303.
[381] H. Meng, C. Li, C. Zhong, J. Gu, G. Chen, and A. Knoll, "Fastfusion: Deep stereo-lidar fusion for real-time high-precision dense depth sensing," Journal of Field Robotics, vol. 40, no. 7, pp. 1804–1816, 2023.
[382] Z. Zhu, Y. Zhang, H. Chen, Y. Dong, S. Zhao, W. Ding, J. Zhong, and S. Zheng, "Understanding the robustness of 3d object detection with bird's-eye-view representations in autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21600–21610.
[383] Y. Zhang, J. Hou, and Y. Yuan, "A comprehensive study of the robustness for lidar-based 3d object detectors against adversarial attacks," International Journal of Computer Vision, pp. 1–33, 2023.
[384] D. Rukhovich, A. Vorontsova, and A. Konushin, "Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 2397–2406.
[385] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington, "Lasernet: An efficient probabilistic 3d object detector for autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12677–12686.
[386] G. P. Meyer, J. Charland, S. Pandey, A. Laddha, S. Gautam, C. Vallespi-Gonzalez, and C. K. Wellington, "Laserflow: Efficient and probabilistic object detection and motion forecasting," IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 526–533, 2020.
[387] B. Li, T. Zhang, and T. Xia, "Vehicle detection from 3d lidar using fully convolutional network," arXiv preprint arXiv:1608.07916, 2016.
[388] Y. Su, W. Liu, Z. Yuan, M. Cheng, Z. Zhang, X. Shen, and C. Wang, "Dla-net: Learning dual local attention features for semantic segmentation of large-scale building facade point clouds," Pattern Recognition, vol. 123, p. 108372, 2022.
[389] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[390] P. Sun, W. Wang, Y. Chai, G. Elsayed, A. Bewley, X. Zhang, C. Sminchisescu, and D. Anguelov, "Rsn: Range sparse net for efficient, accurate lidar 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5725–5734.
[391] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1137–1149, Jun 2017.
[392] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.
[393] Y. Chai, P. Sun, J. Ngiam, W. Wang, B. Caine, V. Vasudevan, X. Zhang, and D. Anguelov, "To the point: Efficient 3d object detection in the range image with graph convolution kernels," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16000–16009.
[394] A. Barrera, C. Guindel, J. Beltrán, and F. García, "Birdnet+: End-to-end 3d object detection in lidar bird's eye view," in 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2020, pp. 1–6.
[395] H. Zhou, X. Zhu, X. Song, Y. Ma, Z. Wang, H. Li, and D. Lin, "Cylinder3d: An effective 3d framework for driving-scene lidar semantic segmentation," arXiv preprint arXiv:2008.01550, 2020.
[396] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "Pointnet: Deep learning on point sets for 3d classification and segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660.
[397] J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[398] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016.
[399] B. Graham, "Spatially-sparse convolutional neural networks," arXiv preprint arXiv:1409.6070, 2014.
[400] B. Graham, M. Engelcke, and L. v. d. Maaten, "3d semantic segmentation with submanifold sparse convolutional networks," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018.
[401] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "Pointnet++: Deep hierarchical feature learning on point sets in a metric space," Advances in Neural Information Processing Systems, vol. 30, 2017.
[402] M. Feng, S. Z. Gilani, Y. Wang, L. Zhang, and A. Mian, "Relation graph network for 3d object detection in point clouds," IEEE Transactions on Image Processing, pp. 92–107, Jan 2021.
[403] Z. Liu, Z. Zhang, Y. Cao, H. Hu, and X. Tong, "Group-free 3d object detection via transformers," Apr 2021.
[404] H. Zhao, L. Jiang, J. Jia, P. Torr, and V. Koltun, "Point transformer," in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2021.
[405] J. Liu, T. He, H. Yang, R. Su, J. Tian, J. Wu, H. Guo, K. Xu, and W. Ouyang, "3d-queryis: A query-based framework for 3d instance segmentation," Nov 2022.
[406] Y. Chen, S. Liu, X. Shen, and J. Jia, "Fast point r-cnn," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9775–9784.
[407] Q. Hu, D. Liu, and W. Hu, "Density-insensitive unsupervised domain adaption on 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17556–17566.
[408] J. Yuan, B. Zhang, X. Yan, T. Chen, B. Shi, Y. Li, and Y. Qiao, "Bi3d: Bi-domain active learning for cross-domain 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15599–15608.
[409] D. Xu, D. Anguelov, and A. Jain, "Pointfusion: Deep sensor fusion for 3d bounding box estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 244–253.
[410] B. Ding, J. Xie, and J. Nie, "C2bn: Cross-modality and cross-scale balance network for multi-modal 3d object detection," in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[411] C. Lin, D. Tian, X. Duan, J. Zhou, D. Zhao, and D. Cao, "Cl3d: Camera-lidar 3d object detection with point feature enhancement and point-guided fusion," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 10, pp. 18040–18050, 2022.
[412] Z. Song, L. Peng, J. Hu, D. Yao, and Y. Zhang, "A re-calibration method for object detection with multi-modal alignment bias in autonomous driving," arXiv preprint arXiv:2405.16848, 2024.
[413] M. Liu, Y. Chen, J. Xie, Y. Zhu, Y. Zhang, L. Yao, Z. Bing, G. Zhuang, K. Huang, and J. T. Zhou, "Menet: Multi-modal mapping enhancement network for 3d object detection in autonomous driving," IEEE Transactions on Intelligent Transportation Systems, 2024.
[414] Z. Wu, Y. Wu, X. Wang, Y. Gan, and J. Pu, "A robust diffusion modeling framework for radar camera 3d object detection," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 3282–3292.
[415] C. Zhang, H. Wang, L. Chen, Y. Li, and Y. Cai, "Mixedfusion: An efficient multimodal data fusion framework for 3-d object detection and tracking," IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15, 2023.
[416] J. Hou, Z. Liu, Z. Zou, X. Ye, X. Bai et al., "Query-based temporal fusion with explicit motion for 3d object detection," Advances in Neural Information Processing Systems, vol. 36, 2024.
[417] Z. Liu, X. Ye, Z. Zou, X. He, X. Tan, E. Ding, J. Wang, and X. Bai, "Multi-modal 3d object detection by box matching," arXiv preprint arXiv:2305.07713, 2023.
[418] L. Zheng, S. Li, B. Tan, L. Yang, S. Chen, L. Huang, J. Bai, X. Zhu, and Z. Ma, "Rcfusion: Fusing 4d radar and camera with bird's-eye view features for 3d object detection," IEEE Transactions on Instrumentation and Measurement, 2023.
[419] Y. Zeng, C. Ma, M. Zhu, Z. Fan, and X. Yang, "Cross-modal 3d object detection and tracking for auto-driving," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 3850–3857.
[420] C. Ge, J. Chen, E. Xie, Z. Wang, L. Hong, H. Lu, Z. Li, and P. Luo, "Metabev: Solving sensor failures for bev detection and map segmentation," arXiv preprint arXiv:2304.09801, 2023.
[421] T. Zhou, J. Chen, Y. Shi, K. Jiang, M. Yang, and D. Yang, "Bridging the view disparity between radar and camera features for multi-modal fusion 3d object detection," IEEE Transactions on Intelligent Vehicles, vol. 8, no. 2, pp. 1523–1535, 2023.
[422] Z. Wang, W. Zhan, and M. Tomizuka, "Fusing bird view lidar point cloud and front view camera image for deep object detection," arXiv preprint, Nov 2017.
[423] S. Jiang, S. Xu, L. Liu, Z. Song, Y. Bo, Z.-X. Yang et al., "Sparseinteraction: Sparse semantic guidance for radar and camera 3d object detection," in ACM Multimedia 2024.
[424] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., "Gpt-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
[425] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment anything," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
[426] Y. Liu, L. Kong, J. Cen, R. Chen, W. Zhang, L. Pan, K. Chen, and Z. Liu, "Segment any point cloud sequences by distilling vision foundation models," Advances in Neural Information Processing Systems, vol. 36, 2024.
[427] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
[428] R. Greer, B. Antoniussen, A. Møgelmose, and M. Trivedi, "Language-driven active learning for diverse open-set 3d object detection," arXiv preprint arXiv:2404.12856, 2024.
[429] Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang et al., "Planning-oriented autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17853–17862.
[430] D. Xu, H. Li, Q. Wang, Z. Song, L. Chen, and H. Deng, "M2da: Multi-modal fusion transformer incorporating driver attention for autonomous driving," arXiv preprint arXiv:2403.12552, 2024.

Ziying Song was born in Xingtai, Hebei Province, China, in 1997. He received the B.S. degree from Hebei Normal University of Science and Technology (China) in 2019 and a master's degree from Hebei University of Science and Technology (China) in 2022. He is now a Ph.D. student majoring in Computer Science and Technology at Beijing Jiaotong University (China), with a research focus on computer vision.

Lin Liu was born in Jinzhou, Liaoning Province, China, in 2001. He is a college student majoring in Computer Science and Technology at China University of Geosciences (Beijing). Since Dec. 2022, he has been recommended for a master's degree in Computer Science and Technology at Beijing Jiaotong University. His research interests are in computer vision.

Feiyang Jia was born in Yinchuan, Ningxia Province, China, in 1998. He received his B.S. degree from Beijing Jiaotong University (China) in 2020 and a master's degree from Beijing Technology and Business University (China) in 2023. He is now a Ph.D. student majoring in Computer Science and Technology at Beijing Jiaotong University (China), with a research focus on computer vision.

Yadan Luo (Member, IEEE) received the B.S. degree in computer science from the University of Electronic Science and Technology of China, and the Ph.D. degree from the University of Queensland. Her research interests include machine learning, computer vision, and multimedia data analysis. She is now a lecturer with the University of Queensland.
Caiyan Jia, born on March 2, 1976, is a lecturer and a postdoctoral fellow of the Chinese Computer Society. She graduated from Ningxia University in 1998 with a bachelor's degree in mathematics, from Xiangtan University in 2001 with a master's degree in computational mathematics, specializing in intelligent information processing, and from the Institute of Computing Technology of the Chinese Academy of Sciences in 2004 with a doctorate in engineering, specializing in data mining. She is now a professor in the School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China.

Guoxin Zhang was born in 1998 in Xingtai, Hebei Province, China. He received his bachelor's and master's degrees from Hebei University of Science and Technology in 2021 and 2024, respectively. He has been a Ph.D. student in the School of Computer Science at Beijing University of Posts and Telecommunications (China) since 2024. His research interests are in computer vision.

Lei Yang (Graduate Student Member, IEEE) received his B.E. degree from Taiyuan University of Technology, Taiyuan, China, and his M.S. degree from the Robotics Institute at Beihang University in 2018. He then worked as an algorithm researcher in the Autonomous Driving R&D Department of JD.COM from 2018 to 2020. He has been a Ph.D. student in the School of Vehicle and Mobility at Tsinghua University since 2020. His current research interests include computer vision, 3D scene understanding, and autonomous driving.

Li Wang was born in Shangqiu, Henan Province, China, in 1990. He received his Ph.D. degree in mechatronic engineering from the State Key Laboratory of Robotics and System, Harbin Institute of Technology, in 2020. He was a visiting scholar at Nanyang Technological University for two years, and a postdoctoral fellow in the State Key Laboratory of Automotive Safety and Energy and the School of Vehicle and Mobility, Tsinghua University. Currently, he is an assistant professor in the School of Mechanical Engineering, Beijing Institute of Technology. His research interests include autonomous driving perception, 3D robot vision, and multi-modal fusion.