
RoboFusion: Towards Robust Multi-Modal 3D Object Detection via SAM

Ziying Song1 , Guoxing Zhang2 , Lin Liu1 , Lei Yang3 , Shaoqing Xu4 , Caiyan Jia1∗ ,
Feiyang Jia1 , Li Wang5
1 School of Computer Science and Technology & Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University
2 Hebei University of Science and Technology  3 Tsinghua University  4 University of Macau  5 Beijing Institute of Technology
{songziying, cyjia, feiyangjia}@bjtu.edu.cn
arXiv:2401.03907v4 [cs.CV] 23 Apr 2024

∗ Corresponding author.

Abstract

Multi-modal 3D object detectors are dedicated to exploring secure and reliable perception systems for autonomous driving (AD). Although achieving state-of-the-art (SOTA) performance on clean benchmark datasets, they tend to overlook the complexity and harsh conditions of real-world environments. With the emergence of visual foundation models (VFMs), opportunities and challenges arise for improving the robustness and generalization of multi-modal 3D object detection in AD. Therefore, we propose RoboFusion, a robust framework that leverages VFMs like SAM to tackle out-of-distribution (OOD) noise scenarios. We first adapt the original SAM to AD scenarios, yielding SAM-AD. To align SAM or SAM-AD with multi-modal methods, we then introduce AD-FPN for upsampling the image features extracted by SAM. We employ wavelet decomposition to denoise the depth-guided images, further reducing noise and weather interference. Finally, we employ self-attention mechanisms to adaptively reweight the fused features, enhancing informative features while suppressing excess noise. In summary, RoboFusion significantly reduces noise by leveraging the generalization and robustness of VFMs, thereby enhancing the resilience of multi-modal 3D object detection. Consequently, RoboFusion achieves SOTA performance in noisy scenarios, as demonstrated by the KITTI-C and nuScenes-C benchmarks. Code is available at https://github.com/adept-thu/RoboFusion.

1 Introduction

Multi-modal 3D object detection plays a pivotal role in autonomous driving (AD) [Wang et al., 2023a; Song et al., 2024a]. Different modalities often provide complementary information. For instance, images contain richer semantic representations, yet lack depth information. In contrast, point clouds offer geometric and depth details, but they are sparse and lack semantic information. Therefore, effectively leveraging the advantages of multiple modalities while mitigating their limitations contributes to enhancing the robustness and accuracy of perception systems [Song et al., 2023].

With the emergence of AD datasets [Geiger et al., 2012; Caesar et al., 2020; Zhang et al., 2023c], state-of-the-art (SOTA) methods [Liu et al., 2023; Bai et al., 2022; Chen et al., 2022; Huang et al., 2020; Li et al., 2023; Song et al., 2024b] have achieved record-breaking performance on 'clean' datasets [Geiger et al., 2012; Caesar et al., 2020]. However, they overlook the exploration of robustness and generalization in out-of-distribution (OOD) scenarios [Dong et al., 2023]. For example, the KITTI dataset [Geiger et al., 2012] lacks severe weather conditions. When SOTA methods [Chen et al., 2022; Li et al., 2023; Liu et al., 2023] learn from these sunny-weather datasets, can they truly generalize and maintain robustness in severe weather conditions such as snow and fog?

The answer is 'No', as shown in Fig. 1 and verified in Table 3. Domain adaptation (DA) techniques are often employed to address these challenges [Wang et al., 2023b; Tsai et al., 2023; Peng et al., 2023; Hu et al., 2023]. Although DA techniques improve the robustness of 3D object detection and reduce the need for annotated data, they have profound drawbacks, including domain shift limitations, label shift issues, and overfitting risks [Oza et al., 2023]. For instance, DA techniques may be constrained when the differences between two domains are significant, leading to performance degradation on the target domain.

Recently, both Natural Language Processing (NLP) and Computer Vision (CV) have witnessed the appearance and power of a series of foundation models [Kirillov et al., 2023; OpenAI, 2023; Zhao et al., 2023; Zhang et al., 2023a], resulting in the emergence of new paradigms in deep learning. For example, a series of novel visual foundation models (VFMs) [Kirillov et al., 2023; Zhao et al., 2023; Zhang et al., 2023a] have been developed. Thanks to their extensive training on huge datasets, these models exhibit powerful generalization capabilities. These developments have inspired a new idea: leveraging the robustness and generalization abilities of VFMs to achieve generalization in OOD noisy scenarios, much like how adults generalize knowledge when encountering new situations, without relying on DA techniques [Wang et al., 2023b; Tsai et al., 2023].
[Figure 1 — panels: (a) Gaussian Distribution (clean KITTI train/val sets vs. noisy KITTI-C val set; large vs. small gap); (b) Robustness Comparison (SOTA detector vs. RoboFusion); (c) Performance (KITTI moderate-level car AP, clean vs. noisy: Ours 88.04/85.70, LoGoNet 85.04/62.58).]

Figure 1: (a) We employ Gaussian distributions to represent the distributional disparities among the datasets. Indeed, there exists a large gap in data distribution between an OOD noise validation set and a clean validation set. The X-axis represents the set of mean pixel values in a dataset, $X = \{x_i \mid i = 1, 2, \ldots, N\}$, with $x_i = \frac{1}{H \times W \times 3}\sum_{h=1}^{H}\sum_{w=1}^{W}\sum_{c=1}^{3} I_{hwc}$, where $N$ is the number of images in the dataset, $H$ is the height, $W$ is the width, and $I_{hwc}$ denotes the pixel values of each image. (b) Visual foundation models (VFMs) like SAM [Kirillov et al., 2023] show robust performance in many noisy scenarios. Yet, current methods are not robust enough to predict 3D tasks for autonomous driving perception. (c) To this end, we propose a robust framework, RoboFusion, which employs VFMs in SOTA multi-modal 3D object detection. Empirical results reveal that our method surpasses the top-performing LoGoNet [Li et al., 2023] on the KITTI leaderboard by a margin of 23.12% mAP (Weather) in KITTI-C [Dong et al., 2023] noisy scenarios. Notably, our RoboFusion also outperforms LoGoNet [Li et al., 2023] on the clean KITTI [Geiger et al., 2012] dataset.
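To make the distribution comparison in Fig. 1(a) concrete, the sketch below computes the per-image mean pixel value $x_i$ defined in the caption for a set of images; the file lists and the Gaussian summary are illustrative assumptions, not part of the released code.

```python
import numpy as np
from PIL import Image

def mean_pixel_value(image_path: str) -> float:
    """x_i = mean over all H*W*3 pixel values of one RGB image (Fig. 1a)."""
    img = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float64)
    return img.mean()  # equivalent to summing I_hwc and dividing by H*W*3

def dataset_distribution(image_paths):
    """Collect X = {x_i} for a dataset and summarize it with a Gaussian fit."""
    x = np.array([mean_pixel_value(p) for p in image_paths])
    return x.mean(), x.std()

# Hypothetical usage: compare a clean validation split against a corrupted one.
# mu_clean, sigma_clean = dataset_distribution(clean_val_paths)
# mu_noisy, sigma_noisy = dataset_distribution(kitti_c_val_paths)
# A large |mu_clean - mu_noisy| relative to the stds reflects the gap in Fig. 1(a).
```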

Inspired by the success of VFMs in CV tasks, in this work we intend to use these models to tackle the challenges of multi-modal 3D object detectors in OOD noise scenarios. Therefore, we propose a robust framework, RoboFusion, which leverages VFMs like SAM to adapt a 3D multi-modal object detector from clean scenarios to OOD noise scenarios. In particular, the adaptation strategies for SAM are as follows. 1) We utilize features extracted from SAM rather than its inferred segmentation results. 2) We propose SAM-AD, a SAM pre-trained for AD scenarios. 3) We introduce a novel AD-FPN to address the issue of feature upsampling when aligning VFMs with a multi-modal 3D object detector. 4) To further reduce noise interference and retain essential signal features, we design a Depth-Guided Wavelet Attention (DGWA) module that effectively attenuates both high-frequency and low-frequency noise. 5) After fusing point cloud features and image features, we propose Adaptive Fusion to further enhance feature robustness and noise resistance through self-attention, which re-weights the fused features adaptively. We validate RoboFusion's robustness against OOD noise scenarios on the KITTI-C and nuScenes-C datasets [Dong et al., 2023], achieving SOTA performance amid noise, as shown in Fig. 1.

2 Related Work

2.1 Multi-Modal 3D Object Detection
Currently, multi-modal 3D object detection has received considerable attention on popular datasets [Geiger et al., 2012; Caesar et al., 2020]. BEVFusion [Liu et al., 2023] fuses multi-modal representations in a unified 3D or BEV space. TransFusion [Bai et al., 2022] builds a two-stage pipeline where proposals are generated based on LiDAR features and further refined using query image features. DeepInteraction [Yang et al., 2022] and SparseFusion [Xie et al., 2023] further optimize the camera branch on top of TransFusion. Previous methods are highly optimized to achieve the best performance on clean datasets. However, they ignore common factors in the real world (e.g., bad weather and sensor noise). In this work, we consider a real-world robustness perspective and design a robust multi-modal 3D perception framework, RoboFusion.

2.2 Visual Foundation Models for 3D Object Detection
Motivated by the success of Large Language Models (LLMs) [OpenAI, 2023], VFMs have started to be explored in the CV community. SAM [Kirillov et al., 2023] leverages ViT [Dosovitskiy et al., 2020] to train on the huge SA-1B dataset, containing 11 million samples, which enables SAM to generalize to many scenes. Currently, there have been a few research endeavors aiming at integrating 3D object detectors with SAM. For instance, SAM3D [Zhang et al., 2023b], as a LiDAR-only method, solely transforms LiDAR's 3D perspective into a BEV (Bird's Eye View) 2D space to harness the generalization capabilities of SAM, yielding sub-optimal performance on 'clean' datasets. Another work in progress, 3D-Box-Segment-Anything¹, tries to utilize SAM for 3D object detection. This indicates the high level of attention that SAM-like foundation models are receiving for 3D scenes in the literature. Our RoboFusion, as a multi-modal method, gives clear strategies to leverage the generalization capabilities of VFMs to address the OOD noise challenges inherent in existing 3D multi-modal object detection methods.

3 RoboFusion
In this section, we present RoboFusion, a framework that harnesses the robustness and generalization capabilities of VFMs such as SAM [Kirillov et al., 2023] for multi-modal 3D object detection.

¹ https://github.com/dvlab-research/3D-Box-Segment-Anything
[Figure 2 diagram — blocks: Points, Voxelization, 3D Backbone; Image, SAM-AD, AD-FPN, Depth Encoder, DWT/IDWT, DGWA (Conv, Linear, Upsample, Max Pool, Q/K/V, Image Embedding, Output Conv); Adaptive Fusion; Detection Head; Results.]

Figure 2: The framework of RoboFusion. The LiDAR branch follows the baselines [Chen et al., 2022; Bai et al., 2022] to generate LiDAR features. In the camera branch, we first extract robust image features using the highly optimized SAM-AD and acquire multi-scale features using AD-FPN. Second, a sparse depth map S is generated from the raw points and fed into a depth encoder to obtain depth features, which are fused with the multi-scale image features F_i to obtain depth-guided image features F̂_i; wavelet attention is then used to remove mutation noise. Finally, Adaptive Fusion integrates the point cloud features with the robust, depth-aware image features via a self-attention mechanism.

The overall architecture is depicted in Fig. 2 and comprises the following components: 1) the SAM-AD & AD-FPN module, which obtains robust multi-scale image features; 2) the Depth-Guided Wavelet Attention (DGWA) module, which employs wavelet decomposition to denoise depth-guided image features; and 3) the Adaptive Fusion module, which adaptively fuses point cloud features with image features, as sketched below.
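The following is a minimal, hedged sketch of how these three stages could be wired together; the module names (SAMADEncoder-style encoder, ADFPN, DGWA, AdaptiveFusion, lidar_backbone) are hypothetical stand-ins for the components described above and in Fig. 2, not the released implementation.

```python
import torch
import torch.nn as nn

class RoboFusionSketch(nn.Module):
    """Illustrative composition of the camera and LiDAR branches (Fig. 2)."""

    def __init__(self, sam_ad_encoder, ad_fpn, dgwa, adaptive_fusion,
                 lidar_backbone, detection_head):
        super().__init__()
        self.sam_ad = sam_ad_encoder        # SAM-AD image encoder
        self.ad_fpn = ad_fpn                # multi-scale features from a single-stride embedding
        self.dgwa = dgwa                    # depth-guided wavelet attention denoising
        self.fusion = adaptive_fusion       # self-attention re-weighting of fused features
        self.lidar_backbone = lidar_backbone
        self.head = detection_head

    def forward(self, image, points, depth_map):
        img_embed = self.sam_ad(image)               # robust image embedding
        img_feats = self.ad_fpn(img_embed)           # multi-scale image features F_i
        img_feats = self.dgwa(img_feats, depth_map)  # denoised, depth-guided features
        pts_feats = self.lidar_backbone(points)      # voxelized LiDAR features
        fused = self.fusion(pts_feats, img_feats)    # adaptively re-weighted fusion
        return self.head(fused)
```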
3.1 SAM-AD & AD-FPN
Preliminaries. SAM [Kirillov et al., 2023], a VFM, achieves generalization across diverse scenes due to its extensive training on the large-scale SA-1B dataset, which contains over 11 million samples and 1 billion high-quality masks. Currently, the SAM family [Kirillov et al., 2023; Zhao et al., 2023; Zhang et al., 2023a] primarily supports 2D tasks, so directly extending VFMs like SAM to 3D tasks presents a gap. To address this, we combine SAM with multi-modal 3D models, merging 2D robust feature representations with 3D point cloud features to achieve robust fused features.

SAM-AD. To further adapt SAM to AD (autonomous driving) scenarios, we perform pre-training on SAM to obtain SAM-AD. Specifically, we curate an extensive collection of image samples from well-established datasets (i.e., KITTI [Geiger et al., 2012] and nuScenes [Caesar et al., 2020]), forming the foundational AD dataset. Following DMAE [Wu et al., 2023], we pre-train SAM on this dataset to obtain SAM-AD, as shown in Fig. 3. We denote by x a clean image from the AD dataset (i.e., KITTI [Geiger et al., 2012] and nuScenes [Caesar et al., 2020]) and by η the noise generated by [Dong et al., 2023] based on x. The noise type and severity are randomly chosen from the four weather corruptions (i.e., rain, snow, fog, and strong sunlight) and the five severities from 1 to 5, respectively. We employ the image encoder of SAM [Kirillov et al., 2023] or MobileSAM [Zhang et al., 2023a] as our encoder, while the decoder and the reconstruction loss are the same as in DMAE [Wu et al., 2023]. For FastSAM [Zhao et al., 2023], we adopt YOLOv8² to pre-train FastSAM on the AD dataset. To avoid overfitting, we use random resizing and cropping as data augmentation. We also set the mask ratio to 0.75 and train for 400 epochs on 8 NVIDIA A100 GPUs.

AD-FPN. As a promptable segmentation model, SAM has three components: an image encoder, a prompt encoder, and a mask decoder. Generally, the image encoder provides high-quality and highly robust image embeddings for downstream models, while the mask decoder is only designed to provide decoding services for semantic segmentation. Furthermore, what we require are robust image features rather than the processing of prompt information by the prompt encoder. Therefore, we employ SAM's image encoder to extract robust image features. However, SAM utilizes the ViT series [Dosovitskiy et al., 2020] as its image encoder, which excludes multi-scale features and provides only high-dimensional, low-resolution features. To generate the multi-scale features required for object detection, inspired by [Li et al., 2022a], we design an AD-FPN that offers ViT-based multi-scale features. Specifically, leveraging the high-dimensional image embedding with stride 16 (scale = 1/16) provided by SAM, we produce a series of multi-scale features $F_{ms}$ with strides of {32, 16, 8, 4}. Sequentially, we acquire multi-scale features $F_i \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C_i}$ by integrating $F_{ms}$ in a bottom-up manner similar to FPN [Lin et al., 2017].

² https://github.com/ultralytics/ultralytics
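A minimal sketch of the AD-FPN idea described above: starting from a single stride-16 ViT embedding, scaled copies at strides {32, 16, 8, 4} are produced and merged in a coarse-to-fine, FPN-style pass. The channel sizes and the use of plain strided/transposed convolutions are assumptions for illustration; they are not the paper's exact layer configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ADFPNSketch(nn.Module):
    """Build multi-scale features from SAM's single stride-16 image embedding."""

    def __init__(self, embed_dim=256, out_dim=256):
        super().__init__()
        self.down = nn.Conv2d(embed_dim, out_dim, 3, stride=2, padding=1)   # 1/16 -> 1/32
        self.keep = nn.Conv2d(embed_dim, out_dim, 1)                         # 1/16
        self.up2 = nn.ConvTranspose2d(embed_dim, out_dim, 2, stride=2)       # 1/16 -> 1/8
        self.up4 = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, out_dim, 2, stride=2),
            nn.ConvTranspose2d(out_dim, out_dim, 2, stride=2),               # 1/16 -> 1/4
        )
        self.smooth = nn.Conv2d(out_dim, out_dim, 3, padding=1)

    def forward(self, embedding):                 # embedding: (B, C, H/16, W/16)
        f32, f16 = self.down(embedding), self.keep(embedding)
        f8, f4 = self.up2(embedding), self.up4(embedding)
        # Integrate the strides {32, 16, 8, 4} step by step, ending at stride 4.
        f16 = f16 + F.interpolate(f32, size=f16.shape[-2:], mode="nearest")
        f8 = f8 + F.interpolate(f16, size=f8.shape[-2:], mode="nearest")
        f4 = f4 + F.interpolate(f8, size=f4.shape[-2:], mode="nearest")
        return self.smooth(f4)                    # F_i at resolution (H/4, W/4)

# Example: a SAM-style embedding of a 1024x1024 image is (B, 256, 64, 64).
# feats = ADFPNSketch()(torch.randn(1, 256, 64, 64))   # -> (1, 256, 256, 256)
```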
[Figure 3 diagram — Clean Image + Weather Noise → Noisy Image → Masked Noisy Image → SAM-AD encoder + DMAE Decoder → Reconstructed Image; Reconstruction Loss.]

Figure 3: An illustration of the pre-training framework. We corrupt a clean image x with η, which contains multiple weather noises, and then randomly mask several patches of the noisy image x + η to obtain a masked noisy image Mask(x + η). The SAM-AD encoder and the DMAE decoder are trained to reconstruct the clean image x̂ from Mask(x + η).
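A hedged sketch of one training step of this DMAE-style objective: corrupt, mask, reconstruct, and regress to the clean image. The corruption function, the encoder/decoder arguments, and the plain MSE loss are illustrative assumptions standing in for the weather corruptions of [Dong et al., 2023] and the DMAE decoder.

```python
import torch
import torch.nn.functional as F

def random_patch_mask(images, patch=16, mask_ratio=0.75):
    """Zero out a random 75% of non-overlapping patches (mask ratio from Sec. 3.1)."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    keep = torch.rand(b, gh, gw, device=images.device) > mask_ratio
    keep = keep.repeat_interleave(patch, 1).repeat_interleave(patch, 2)
    return images * keep.unsqueeze(1)

def pretrain_step(encoder, decoder, optimizer, clean, add_weather_noise):
    """One SAM-AD pre-training step: reconstruct clean x from Mask(x + eta)."""
    noisy = add_weather_noise(clean)               # x + eta, random type and severity
    masked = random_patch_mask(noisy)              # Mask(x + eta)
    recon = decoder(encoder(masked))               # x_hat
    loss = F.mse_loss(recon, clean)                # reconstruction loss (DMAE-style)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```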

[Figure 4 diagram — LiDAR Feature and Camera Feature are projected to Q/K/V through Conv layers and fused by Adaptive Fusion.]
Figure 4: The architecture of Adaptive Fusion, which adaptively re-weights the fused features using self-attention.

3.2 Depth-Guided Wavelet Attention
Although SAM-AD or SAM has the capability to extract robust image features, the gap between the 2D and 3D domains still persists, and cameras, which lack geometric information, often amplify noise in a corrupted environment and give rise to negative transfer issues. To mitigate this problem, we propose the Depth-Guided Wavelet Attention (DGWA) module, which can be split into two steps. 1) A depth-guided network is designed that adds a geometry prior to image features by combining image features with depth features from the point cloud. 2) The image features are decomposed into four wavelet subbands using the Haar wavelet transform [Liu et al., 2020a], and an attention mechanism then allows denoising of informative features in the subbands.

Formally, we take image features $F_i \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C_i}$ and raw points $P \in \mathbb{R}^{N \times C_p}$ as input. We project $P$ onto the image plane to acquire a sparse depth map $S \in \mathbb{R}^{H \times W \times 2}$. Next, we feed $S$ into the depth encoder $DE(\cdot)$, which consists of several convolution and max-pooling blocks, to acquire depth features $F_d \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C_i}$. Afterward, we leverage a convolution to encode $(F_i, F_d)$ into depth-guided image features $\hat{F}_i \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 16}$, given by
$\hat{F}_i = \mathrm{Conv}(\mathrm{Concat}(F_i, DE(S)))$.  (1)

Subsequently, we employ the discrete wavelet transform (DWT), a reversible operator, to partition the input $\hat{F}_i$ into four subbands. Specifically, we encode the rows and columns of the input separately into one low-frequency band $\tilde{f}_i^{LL} \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 4}$ and three high-frequency bands $(\tilde{f}_i^{LH}, \tilde{f}_i^{HL}, \tilde{f}_i^{HH}) \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 4}$, with the low-pass filter $f_L = (\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})$ and the high-pass filter $f_H = (\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}})$. In this state, the low-frequency band retains coarse-grained information while the high-frequency bands retain fine-grained information. In other words, it becomes easier to capture mutation signals and thereby filter out noise. We concatenate the four subband features along the channel dimension to acquire wavelet features $\tilde{F}_i = [\tilde{f}_i^{LL}, \tilde{f}_i^{LH}, \tilde{f}_i^{HL}, \tilde{f}_i^{HH}] \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 16}$. Next, we perform wavelet attention $\mathrm{Att}_\omega$ to query informative features in the wavelet features. Concretely, we employ $\hat{F}_i$ as the Query and $\tilde{F}_i$ as the Key/Value, given by
$F_{att} = \mathrm{Att}_\omega(\hat{F}_i, \tilde{F}_i) = \sigma\big(\hat{F}_i W^q (\tilde{F}_i W^k)^T / \sqrt{C_i}\big)\, \tilde{F}_i W^v$.  (2)
Finally, we leverage the IDWT (inverse DWT) to convert $\tilde{F}_i$ back to $\hat{F}_i$ and integrate this converted $\hat{F}_i$ with $F_{att}$ to obtain denoised features $F_{out} \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times 16}$ by
$F_{out} = \mathrm{MLP}(\mathrm{Concat}(F_{att}, \hat{F}_i))$,  (3)
where $F_{out}$ preserves informative features and restrains redundant mutation noise in the frequency domain.
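A hedged sketch of the DGWA computation in Eqs. (1)-(3): depth-guided encoding, a single-level Haar DWT, attention from the depth-guided features (query) over the wavelet subbands (key/value), and an MLP merge. The channel sizes, the depth-encoder layout, and the flattening of spatial positions into attention tokens are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def haar_dwt(x):
    """Single-level 2D Haar DWT with filters (1/sqrt2, +-1/sqrt2); returns LL, LH, HL, HH."""
    a, b = x[..., 0::2, :], x[..., 1::2, :]            # even / odd rows
    lo_r, hi_r = (a + b) / 2 ** 0.5, (a - b) / 2 ** 0.5
    def split_cols(t):
        c, d = t[..., :, 0::2], t[..., :, 1::2]        # even / odd columns
        return (c + d) / 2 ** 0.5, (c - d) / 2 ** 0.5
    ll, lh = split_cols(lo_r)
    hl, hh = split_cols(hi_r)
    return ll, lh, hl, hh

class DGWASketch(nn.Module):
    def __init__(self, c_img=256, c_depth=16, c_hat=16):
        super().__init__()
        self.depth_enc = nn.Sequential(                 # DE(.): conv + max-pool blocks
            nn.Conv2d(2, c_depth, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(c_depth, c_depth, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.fuse = nn.Conv2d(c_img + c_depth, c_hat, 3, padding=1)    # Eq. (1)
        self.q = nn.Linear(c_hat, c_hat)
        self.k = nn.Linear(4 * c_hat, c_hat)
        self.v = nn.Linear(4 * c_hat, c_hat)
        self.mlp = nn.Linear(2 * c_hat, c_hat)          # Eq. (3) merge

    def forward(self, f_img, sparse_depth):
        # Eq. (1): depth-guided image features F_hat.
        f_hat = self.fuse(torch.cat([f_img, self.depth_enc(sparse_depth)], dim=1))
        # Haar DWT and channel-wise concatenation of the four subbands.
        ll, lh, hl, hh = haar_dwt(f_hat)
        f_wav = torch.cat([ll, lh, hl, hh], dim=1)
        # Eq. (2): F_hat queries the wavelet features (key/value).
        tokens = f_hat.flatten(2).transpose(1, 2)       # (B, HW, c_hat)
        wav_tokens = f_wav.flatten(2).transpose(1, 2)   # (B, HW/4, 4*c_hat)
        q, k, v = self.q(tokens), self.k(wav_tokens), self.v(wav_tokens)
        att = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1) @ v
        # Eq. (3): merge with F_hat (IDWT(DWT(F_hat)) equals F_hat for this orthogonal
        # transform, so F_hat stands in for the reconstructed features).
        return self.mlp(torch.cat([att, tokens], dim=-1))

# out = DGWASketch()(torch.randn(1, 256, 64, 64), torch.randn(1, 2, 256, 256))
```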
3.3 Adaptive Fusion
Following the incorporation of image depth features within the DGWA module, we propose the Adaptive Fusion technique to combine point cloud attributes with robust, depth-enriched image features. Different types of noise affect LiDAR and images to different degrees, which raises a corruption-imbalance problem. Therefore, considering the distinct influences of various noises on LiDAR and camera, we employ self-attention to re-weight the fused features adaptively, as shown in Fig. 4. The corruption degree of each modality is dynamic, and the self-attention mechanism allows adaptive re-weighting of features to enhance informative features and suppress redundant noise.
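A minimal sketch of the self-attention re-weighting described above: concatenated LiDAR and camera feature tokens are projected to a shared width and re-weighted so that the less corrupted modality can dominate. The projection sizes and the simple token concatenation are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AdaptiveFusionSketch(nn.Module):
    """Self-attention over concatenated LiDAR/camera feature tokens (Fig. 4)."""

    def __init__(self, c_lidar=128, c_cam=16, c_fused=128, heads=4):
        super().__init__()
        self.proj_lidar = nn.Linear(c_lidar, c_fused)
        self.proj_cam = nn.Linear(c_cam, c_fused)
        self.attn = nn.MultiheadAttention(c_fused, heads, batch_first=True)
        self.out = nn.Linear(c_fused, c_fused)

    def forward(self, lidar_tokens, cam_tokens):
        # lidar_tokens: (B, N_l, c_lidar); cam_tokens: (B, N_c, c_cam)
        tokens = torch.cat([self.proj_lidar(lidar_tokens),
                            self.proj_cam(cam_tokens)], dim=1)
        # Self-attention re-weights every token against both modalities, so
        # informative features are enhanced and noisy ones are suppressed.
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.out(fused + tokens)   # residual keeps the original signal

# fused = AdaptiveFusionSketch()(torch.randn(2, 100, 128), torch.randn(2, 400, 16))
```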
4 Experiments
4.1 Datasets
We perform experiments on both the clean public benchmarks (KITTI [Geiger et al., 2012] and nuScenes [Caesar et al., 2020]) and the noisy public benchmarks (KITTI-C and nuScenes-C [Dong et al., 2023]).
Table 1: Comparison with SOTA methods on the KITTI validation and test sets for the car class with AP of R40.

Method | AP3D (%) validation set: mAP / Easy / Mod. / Hard | AP3D (%) test set: mAP / Easy / Mod. / Hard
Voxel R-CNN | 86.84 / 92.38 / 85.29 / 82.86 | 83.19 / 90.90 / 81.62 / 77.06
VFF | 86.91 / 92.31 / 85.51 / 82.92 | 83.62 / 89.50 / 82.09 / 79.29
CAT-Det | 83.58 / 90.12 / 81.46 / 79.15 | 82.62 / 89.87 / 81.32 / 76.68
LoGoNet | 87.13 / 92.04 / 85.04 / 84.31 | 85.87 / 91.80 / 85.06 / 80.74
Focals Conv-F | - / - / - / - | 83.47 / 90.55 / 82.28 / 77.59
Baseline* | 86.75 / 92.05 / 85.51 / 82.70 | - / - / - / -
RoboFusion-L | 88.87 / 93.30 / 88.04 / 85.27 | 85.58 / 91.75 / 84.08 / 80.71
RoboFusion-B | 88.45 / 93.22 / 87.87 / 84.27 | 85.32 / 91.98 / 83.76 / 80.23
RoboFusion-T | 88.08 / 93.28 / 87.60 / 83.36 | 85.09 / 91.68 / 83.70 / 79.89
* denotes our reproduced results based on the officially released code.

Table 2: Comparison with SOTA methods on the nuScenes validation and test sets.

Method | LiDAR | Camera | validation set: NDS / mAP | test set: NDS / mAP
FUTR3D | VoxelNet | ResNet-101 | 68.3 / 64.5 | - / -
BEVFusion-mit | VoxelNet | Swin-T | 71.4 / 68.5 | 72.9 / 70.2
DeepInteraction | VoxelNet | ResNet-50 | 72.6 / 69.9 | 73.4 / 70.8
CMT | VoxelNet | ResNet-50 | 72.9 / 70.3 | 74.1 / 72.0
SparseFusion | VoxelNet | ResNet-50 | 72.8 / 70.4 | 73.8 / 72.0
TransFusion | VoxelNet | ResNet-50 | 71.3 / 67.5 | 71.6 / 68.9
Baseline* | VoxelNet | ResNet-50 | 70.8 / 67.3 | - / -
RoboFusion-L | VoxelNet | SAM | 72.1 / 69.9 | 72.0 / 69.9
RoboFusion-B | VoxelNet | FastSAM | 71.9 / 69.4 | 71.8 / 69.4
RoboFusion-T | VoxelNet | MobileSAM | 71.3 / 69.1 | 71.5 / 69.1
* denotes our reproduced results based on the officially released code.

KITTI
The KITTI dataset provides synchronized LiDAR point clouds and front-view camera images, and consists of 3,712 training samples, 3,769 validation samples, and 7,518 test samples. The standard evaluation metric for object detection is the mean Average Precision (mAP), computed using recall at 40 positions (R40).

nuScenes
The nuScenes dataset is a large-scale 3D detection benchmark consisting of 700 training scenes, 150 validation scenes, and 150 testing scenes. The data are collected using six multi-view cameras and a 32-channel LiDAR sensor, and include 360-degree object annotations for 10 object classes. To evaluate detection performance, the primary metrics are the mean Average Precision (mAP) and the nuScenes detection score (NDS).

KITTI-C and nuScenes-C
In terms of data robustness, [Dong et al., 2023] has designed 27 types of common corruptions for both LiDAR and camera, with the aim of benchmarking the corruption robustness of existing 3D object detectors, and has established corruption robustness benchmarks³, namely KITTI-C and nuScenes-C, by synthesizing corruptions on public datasets. Specifically, we utilize KITTI-C and nuScenes-C in our work. It is worth noting that Ref. [Dong et al., 2023] only adds noise to the validation sets and keeps the train and test sets clean.

Table 3: Comparison with SOTA methods on the KITTI-C validation set. The results are evaluated on the car class with AP of R40 at moderate difficulty. 'S.L.', 'D.', 'C.O.', and 'C.T.' denote Strong Sunlight, Density, Cutout, and Crosstalk, respectively.

Method | Clean | mAP | Weather: Snow / Rain / Fog / S.L. | Sensor: D. / C.O. / C.T.
SECOND† | 81.59 | 64.33 | 52.34 / 52.55 / 74.10 / 78.32 | 80.18 / 73.59 / 80.24
PointPillars† | 78.41 | 49.80 | 36.47 / 36.18 / 64.28 / 62.28 | 76.49 / 70.28 / 70.85
PointRCNN† | 80.57 | 59.14 | 50.36 / 51.27 / 72.14 / 62.78 | 80.35 / 73.94 / 71.53
PV-RCNN† | 84.39 | 65.83 | 52.35 / 51.58 / 79.47 / 79.91 | 82.79 / 76.09 / 82.34
SMOKE† | 7.09 | 4.51 | 2.47 / 3.94 / 5.63 / 6.00 | - / - / -
ImVoxelNet† | 11.49 | 3.22 | 0.22 / 1.24 / 1.34 / 10.08 | - / - / -
EPNet† | 82.72 | 46.21 | 34.58 / 36.27 / 44.35 / 69.65 | 82.09 / 76.10 / 82.10
Focals Conv-F† | 85.88 | 50.40 | 34.77 / 41.30 / 44.55 / 80.97 | 84.95 / 78.06 / 85.82
LoGoNet* | 85.04 | 62.58 | 51.45 / 55.80 / 67.53 / 75.54 | 83.68 / 77.17 / 82.00
RoboFusion-L | 88.04 | 85.70 | 85.29 / 86.48 / 85.53 / 85.50 | 85.71 / 83.17 / 84.12
RoboFusion-B | 87.87 | 84.70 | 84.11 / 85.54 / 84.00 / 85.15 | 84.34 / 81.30 / 82.45
RoboFusion-T | 87.60 | 84.60 | 84.67 / 84.79 / 84.17 / 84.75 | 84.11 / 81.21 / 83.07
†: Results from Ref. [Dong et al., 2023]. * denotes re-implemented result.

4.2 Experimental Settings
Since KITTI and nuScenes are distinct datasets with varying evaluation metrics and characteristics, we provide a detailed description of our RoboFusion settings for each dataset.

Network Architecture. Our RoboFusion consists of three variants: RoboFusion-L, RoboFusion-B, and RoboFusion-T, which utilize SAM-B [Kirillov et al., 2023], FastSAM [Zhao et al., 2023], and MobileSAM [Zhang et al., 2023a], respectively. It is noteworthy that, because the convolutional operations of FastSAM in RoboFusion-B already generate multi-scale features, the AD-FPN module is not employed there.

RoboFusion in KITTI and KITTI-C. We validate our RoboFusion on the KITTI dataset using Focals Conv [Chen et al., 2022] as the baseline. The input voxel size is set to (0.05 m, 0.05 m, 0.1 m), with anchor sizes for cars at [3.9, 1.6, 1.56] and anchor rotations at [0, 1.57]. We adopt the same data augmentation solution as Focals Conv-F.

RoboFusion with nuScenes and nuScenes-C. We validate our RoboFusion on the nuScenes dataset using TransFusion [Bai et al., 2022] as the baseline. The detection range is set to [-54 m, 54 m] for the X and Y axes and [-5 m, 3 m] for the Z axis. The input voxel size is set to (0.075 m, 0.075 m, 0.2 m), and the maximum number of points contained in each voxel is set to 10. It is noteworthy that the Adaptive Fusion module is applied exclusively to Focals Conv rather than TransFusion, while TransFusion uses its own fusion module.

Training and Testing Details. Our RoboFusion is trained from scratch using the Adam optimizer and incorporates several foundation models as image encoders, including SAM, FastSAM and MobileSAM. To enable effective training on the KITTI and nuScenes datasets, we utilize 8 NVIDIA A100 GPUs; the runtime is also evaluated on an NVIDIA A100 GPU. Specifically, for KITTI, our RoboFusion based on Focals Conv [Chen et al., 2022] is trained for 80 epochs. For nuScenes, our RoboFusion based on TransFusion [Bai et al., 2022] is trained for 20 epochs. During the model inference stage, we employ a non-maximum suppression (NMS) operation in the Region Proposal Network (RPN) with an IoU threshold of 0.7 and select the top 100 region proposals as inputs for the detection head. After refinement, we apply NMS again with an IoU threshold of 0.1 to eliminate redundant predictions. For additional details regarding our method, please refer to OpenPCDet⁴.

³ https://github.com/thu-ml/3D_Corruptions_AD
⁴ https://github.com/open-mmlab/OpenPCDet
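The settings above can be summarized in a small configuration sketch. The dictionary layout and key names below are illustrative (OpenPCDet configs are YAML-based and structured differently); only the numeric values are taken from the text.

```python
# Hedged summary of the experimental settings described in Sec. 4.2.
ROBOFUSION_SETTINGS = {
    "kitti": {
        "baseline": "Focals Conv",
        "voxel_size": (0.05, 0.05, 0.1),          # meters
        "anchor_size_car": [3.9, 1.6, 1.56],
        "anchor_rotations": [0.0, 1.57],
        "epochs": 80,
    },
    "nuscenes": {
        "baseline": "TransFusion",
        "point_cloud_range_xy": [-54.0, 54.0],
        "point_cloud_range_z": [-5.0, 3.0],
        "voxel_size": (0.075, 0.075, 0.2),
        "max_points_per_voxel": 10,
        "epochs": 20,
    },
    "inference": {
        "rpn_nms_iou": 0.7,       # NMS in the RPN
        "num_proposals": 100,     # proposals kept for the detection head
        "post_nms_iou": 0.1,      # NMS after box refinement
    },
}
```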
Table 4: Comparison with SOTA methods on the nuScenes-C validation set with mAP. 'S.L.', 'D.', 'C.O.', and 'C.T.' denote Strong Sunlight, Density, Cutout, and Crosstalk, respectively.

Method | Clean | mAP | Weather: Snow / Rain / Fog / S.L. | Sensor: D. / C.O. / C.T.
PointPillars† | 27.69 | 25.87 | 27.57 / 27.71 / 24.49 / 23.71 | 27.27 / 24.14 / 25.92
SSN† | 46.65 | 43.70 | 46.38 / 46.50 / 41.64 / 40.28 | 46.14 / 40.95 / 44.08
CenterPoint† | 59.28 | 52.49 | 55.90 / 56.08 / 43.78 / 54.20 | 58.60 / 56.28 / 56.64
FCOS3D† | 23.86 | 11.44 | 2.01 / 13.00 / 13.53 / 17.20 | - / - / -
PGD† | 23.19 | 12.85 | 2.30 / 13.51 / 12.83 / 22.77 | - / - / -
DETR3D† | 34.71 | 22.00 | 5.08 / 20.39 / 27.89 / 34.66 | - / - / -
BEVFormer† | 41.65 | 26.29 | 5.73 / 24.97 / 32.76 / 41.68 | - / - / -
FUTR3D† | 64.17 | 55.50 | 52.73 / 58.40 / 53.19 / 57.70 | 63.72 / 62.25 / 62.66
TransFusion† | 66.38 | 58.87 | 63.30 / 63.35 / 53.67 / 55.14 | 65.77 / 63.66 / 64.67
BEVFusion† | 68.45 | 61.87 | 62.84 / 66.13 / 54.10 / 64.42 | 67.79 / 66.18 / 67.32
DeepInteraction* | 69.90 | 62.14 | 62.36 / 66.48 / 54.79 / 64.93 | 68.15 / 66.23 / 68.12
CMT* | 70.28 | 63.46 | 62.56 / 61.44 / 66.26 / 63.59 | 69.65 / 68.70 / 68.26
RoboFusion-L | 69.91 | 67.24 | 67.12 / 67.58 / 67.01 / 67.24 | 69.48 / 69.18 / 68.68
RoboFusion-B | 69.40 | 66.33 | 66.07 / 67.01 / 65.54 / 66.71 | 69.02 / 69.01 / 68.04
RoboFusion-T | 69.09 | 65.82 | 65.96 / 66.45 / 64.34 / 66.54 | 68.58 / 68.20 / 68.17
†: Results from Ref. [Dong et al., 2023]. * denotes re-implemented result.

Table 5: Performance of different VFMs on RoboFusion. 'RCE' denotes Relative Corruption Error [Dong et al., 2023]. 'mAP (Weather)' denotes the average value across the four types of weather corruption: Snow, Rain, Fog, and Strong Sunlight.

Method | Model Size | FPS (A100) | mAP (Weather) | mAP (Clean) | RCE
RoboFusion-L | 97.54M | 3.1 | 67.24 | 69.91 | 0.04
RoboFusion-B | 81.01M | 3.5 | 66.33 | 69.40 | 0.04
RoboFusion-T | 13.94M | 6.0 | 65.82 | 69.09 | 0.05
DeepInteraction | 57.82M | 4.9 | 62.14 | 69.90 | 0.10
TransFusion | 36.96M | 6.2 | 58.37 | 66.38 | 0.12

4.3 Comparing with state-of-the-art
We conduct evaluations on the clean datasets KITTI and nuScenes, as well as the noisy datasets KITTI-C and nuScenes-C. While SOTA methods are primarily focused on achieving high accuracy, we place greater emphasis on the robustness and generalization of the methods. These factors are crucial for the practical deployment of 3D object detection in AD scenarios, making the evaluation on the noisy datasets more important from our perspective.

Results on the clean benchmark. As shown in Table 1, we compare our RoboFusion with SOTA methods, including Voxel R-CNN [Deng et al., 2021], VFF [Li et al., 2022b], CAT-Det [Zhang et al., 2022], Focals Conv-F [Chen et al., 2022], and LoGoNet [Li et al., 2023], on the KITTI validation and test sets. As shown in Table 2, we also compare our RoboFusion with SOTA methods, including FUTR3D [Chen et al., 2023], TransFusion [Bai et al., 2022], BEVFusion [Liu et al., 2023], DeepInteraction [Yang et al., 2022], CMT [Yan et al., 2023] and SparseFusion [Xie et al., 2023], on the nuScenes test and validation sets. Our RoboFusion achieves SOTA performance on the clean benchmarks (KITTI and nuScenes).

Results on the noisy benchmark. In real-world AD scenarios, the distribution of data often differs from that of the training or testing data, as shown in Fig. 1 (a). Specifically, Ref. [Dong et al., 2023] provides a novel noisy benchmark that includes KITTI-C and nuScenes-C, which we primarily use to evaluate weather and sensor corruptions, including rain, snow, fog, strong sunlight, density, cutout, and so on. In addition, comparisons of our RoboFusion with SOTA methods in other settings are presented in the Appendix⁵.

As shown in Table 3, SOTA methods, including SECOND [Yan et al., 2018], PointPillars [Lang et al., 2019], PointRCNN [Shi et al., 2019], PV-RCNN [Shi et al., 2020], SMOKE [Liu et al., 2020b], ImVoxelNet [Rukhovich et al., 2022], EPNet [Huang et al., 2020], Focals Conv-F [Chen et al., 2022], and LoGoNet [Li et al., 2023], experience a significant decrease in performance in the noisy scenarios, particularly under weather conditions such as snow and rain. This can be attributed to the fact that the 'clean' KITTI dataset does not include examples in snowy or rainy weather.

⁵ https://arxiv.org/abs/2401.03907
On the other hand, VFMs like SAM-AD have been trained on a diverse range of data and exhibit robustness and generalization in OOD scenarios, which leads to the higher performance of our RoboFusion. Furthermore, multi-modal methods like LoGoNet and Focals Conv-F demonstrate better robustness and generalization in sensor-noise scenarios, while LiDAR-only methods like PV-RCNN [Shi et al., 2020] are more robust in weather-noise scenarios. This observation motivates our research on adaptive fusion schemes for point cloud and image features. Overall, on the KITTI-C [Dong et al., 2023] dataset, our RoboFusion's performance is nearly on par with the clean scene, indicating a high level of robustness and generalization.

As shown in Table 4, SOTA methods including PointPillars [Lang et al., 2019], SSN [Zhu et al., 2020], CenterPoint [Yin et al., 2021], FCOS3D [Wang et al., 2021], PGD [Wang et al., 2022a], DETR3D [Wang et al., 2022b], BEVFormer [Li et al., 2022c], FUTR3D [Chen et al., 2023], TransFusion [Bai et al., 2022], BEVFusion [Liu et al., 2023], DeepInteraction [Yang et al., 2022] and CMT [Yan et al., 2023] show relatively higher robustness in nuScenes-C than in KITTI-C when faced with weather noise. However, BEVFusion performs well in the presence of snow, rain, and strong sunlight noise but experiences a significant performance drop in foggy scenarios. In contrast, our method exhibits strong robustness and generalization in both weather- and sensor-noise scenarios in nuScenes-C.
Table 6: Impacts of different SAM usages on the KITTI and KITTI-C validation sets for the car class with AP of R40. 'S.L.' denotes Strong Sunlight.

Solution | AP3D (%): mAP / Easy / Mod. / Hard | AP_Weather (%): Snow / Rain / Fog / S.L.
Offline | 80.41 / 88.76 / 77.38 / 75.11 | - / - / - / -
No optim | 86.45 / 91.86 / 84.80 / 82.71 | 45.11 / 47.77 / 63.10 / 79.21
Optim | 88.00 / 92.41 / 86.77 / 84.81 | 57.43 / 54.27 / 68.81 / 82.07

Table 7: Influence of pre-training on SAM on the KITTI-C validation set for the car class with AP of R40 at moderate difficulty. 'S.L.', 'D.', 'C.O.', and 'C.T.' denote Strong Sunlight, Density, Cutout, and Crosstalk, respectively.

VFM | Weather: Snow / Rain / Fog / S.L. | Sensor: D. / C.O. / C.T.
SAM | 57.43 / 54.27 / 68.81 / 82.07 | 84.21 / 83.04 / 84.06
SAM-AD | 80.68 / 81.68 / 81.67 / 83.48 | 84.71 / 84.17 / 84.12

Table 8: Roles of RoboFusion-L modules on the KITTI-C validation set for the car class with AP of R40 at moderate difficulty. 'A.F.' denotes the Adaptive Fusion module. 'S.L.' denotes Strong Sunlight.

Method | SAM-AD | AD-FPN | DGWA | A.F. | Snow | Rain | Fog | S.L. | FPS (A100)
a) | | | | | 34.77 | 41.30 | 44.55 | 80.97 | 10.8
b) | ✓ | | | | 80.68 | 81.68 | 81.67 | 83.48 | 4.0
c) | ✓ | ✓ | | | 82.32 | 83.60 | 82.39 | 83.98 | 3.6
d) | ✓ | ✓ | ✓ | | 83.99 | 85.63 | 84.01 | 84.81 | 3.4
e) | ✓ | ✓ | ✓ | ✓ | 85.29 | 86.48 | 85.53 | 85.50 | 3.1

4.4 Ablation Study
Performance of Different VFMs on RoboFusion. To analyze the noise robustness and FPS performance of different-sized VFMs, i.e., SAM, FastSAM and MobileSAM, we conduct comparative experiments of RoboFusion-L, RoboFusion-B and RoboFusion-T against the SOTA methods DeepInteraction [Yang et al., 2022] and TransFusion [Bai et al., 2022] on the nuScenes-C [Dong et al., 2023] validation set, as shown in Table 5. Our RoboFusion exhibits remarkable robustness in weather-noise scenarios. Furthermore, our RoboFusion-T has an FPS similar to TransFusion [Bai et al., 2022]. Overall, we have presented a viable application of SAM to 3D object detection tasks.

Impacts of Different SAM usages. As shown in Table 6, we experiment on our RoboFusion-L. Specifically, the first row is the offline usage, which involves loading pre-saved image features during training; this implies that certain online data augmentations cannot be utilized. The second (No optim) and third (Optim) rows are online usages, where the former omits fine-tuning and keeps the SAM parameters fixed, while the latter fine-tunes and updates them. Offline usage therefore performs worse than online usage. Additionally, fine-tuning the weights of SAM demonstrates superior performance, yielding an improvement in the snow, rain, and fog noise scenarios.

Influence of Pre-training on SAM. As shown in Table 7, to investigate the value of pre-training VFMs like SAM, FastSAM, and MobileSAM for AD scenarios, we evaluate our RoboFusion-L with SAM and with SAM-AD. Through pre-training, SAM-AD has gained a better understanding of AD scenarios than the original SAM. The pre-training strategy effectively improves the performance of our RoboFusion, demonstrating a significant improvement in the snow, rain, and fog noise scenarios.

Roles of Different Modules in RoboFusion. As shown in Table 8, we present ablation experiments for the different modules of our RoboFusion-L built upon SAM-AD, including AD-FPN, DGWA, and Adaptive Fusion. Leveraging the strong capabilities of SAM-AD in AD scenarios, SAM-AD yields a significant improvement over the baseline Focals Conv [Chen et al., 2022], from (34.77%, 41.30%, 44.55%, 80.97%) to (80.68%, 81.68%, 81.67%, 83.48%). Subsequently, AD-FPN, DGWA, and Adaptive Fusion achieve even higher performance on the foundation of SAM-AD. This further highlights the substantial contributions of the diverse modules within our RoboFusion framework in addressing OOD noise scenarios in AD.

5 Conclusions
In this work, we propose a robust framework, RoboFusion, to enhance the robustness and generalization of multi-modal 3D object detectors using VFMs like SAM, FastSAM, and MobileSAM. Specifically, we pre-train SAM for AD scenarios, yielding SAM-AD. To align SAM or SAM-AD with multi-modal 3D object detectors, we introduce AD-FPN for feature upsampling. To further mitigate noise and weather interference, we apply wavelet decomposition for depth-guided image denoising. Subsequently, we utilize self-attention mechanisms to adaptively reweight the fused features, enhancing informative attributes and suppressing excess noise. Extensive experiments demonstrate that our RoboFusion effectively integrates VFMs to boost feature robustness and address OOD noise challenges. We anticipate this work will lay a strong foundation for future research on building robust and dependable foundation AD models.

Limitation and Future Work. First, RoboFusion relies heavily on the representation capability of VFMs. This raises the baseline models' generalization ability, but increases their complexity. Second, the inference speed of RoboFusion-L and RoboFusion-B is relatively slow due to the limitations of SAM and FastSAM; however, the inference speed of RoboFusion-T is competitive with some SOTA methods without VFMs (e.g., TransFusion). In the future, to improve the real-time applicability of VFMs, we will attempt to incorporate SAM only in the training phase to guide a fast student model, and will meanwhile explore more noise scenarios.

Acknowledgments
This work was supported by the Fundamental Research Funds for the Central Universities (2023YJS019) and the National Key R&D Program of China (2018AAA0100302).
A Appendix
A.1 Broader Impacts
Our work aims to develop a robust framework to address out-of-distribution (OOD) noise scenarios in autonomous driving (AD). To the best of our knowledge, RoboFusion is the first method that leverages the generalization capabilities of visual foundation models (VFMs) like SAM [Kirillov et al., 2023], FastSAM [Zhao et al., 2023], and MobileSAM [Zhang et al., 2023a] for multi-modal 3D object detection. Although existing multi-modal 3D object detection methods achieve state-of-the-art (SOTA) performance on 'clean' datasets, they overlook robustness in real-world scenarios [Song et al., 2024a]. Therefore, we believe it is valuable to combine VFMs and multi-modal 3D object detection to mitigate the impact of OOD noise scenarios.

A.2 More Results
Specific classes AP on the nuScenes-C validation set. As shown in Table 9, we present a comparison of per-class AP between TransFusion and our RoboFusion-L on the nuScenes-C validation set, encompassing scenarios with snow, rain, fog, and strong sunlight noise. It is evident from the results that RoboFusion-L exhibits superior performance.

Table 9: Comparison with TransFusion on the nuScenes validation set under 'Snow, Rain, Fog, and Strong Sunlight' noisy scenarios. 'T.F.', 'R.F.', 'S.L.', 'C.V.', 'Motor.', 'Ped.', and 'T.C.' are short for TransFusion, RoboFusion-L, Strong Sunlight, construction vehicle, motorcycle, pedestrian, and traffic cone, respectively.

Noise | Method | mAP | Car | Truck | C.V. | Bus | Trailer | Barrier | Motor. | Bike | Ped. | T.C.
Snow | T.F. | 63.30 | 84.55 | 58.41 | 25.50 | 62.31 | 56.00 | 70.19 | 69.98 | 43.69 | 84.24 | 78.16
Snow | R.F. | 67.12 | 87.21 | 60.88 | 29.47 | 67.45 | 58.99 | 75.12 | 71.45 | 48.28 | 86.23 | 86.12 (selected gains: +3.82, +5.14, +4.93, +4.59, +7.96)
Rain | T.F. | 63.35 | 85.37 | 56.87 | 25.12 | 64.65 | 55.10 | 71.99 | 68.21 | 44.13 | 83.87 | 78.14
Rain | R.F. | 67.58 | 86.79 | 60.44 | 30.21 | 65.41 | 58.12 | 75.47 | 71.39 | 50.87 | 88.91 | 88.23 (selected gains: +4.23, +5.09, +6.74, +5.04, +10.09)
Fog | T.F. | 53.67 | 80.23 | 48.51 | 18.04 | 50.69 | 53.03 | 62.24 | 54.53 | 25.27 | 80.63 | 66.48
Fog | R.F. | 67.01 | 87.56 | 59.03 | 29.36 | 66.10 | 57.23 | 74.33 | 72.01 | 49.50 | 87.08 | 87.91 (selected gains: +13.34, +10.52, +11.32, +15.41, +12.09, +17.48, +24.23, +21.43)
S.L. | T.F. | 55.14 | 81.99 | 48.07 | 19.78 | 51.09 | 52.57 | 63.68 | 55.09 | 26.98 | 82.68 | 69.49
S.L. | R.F. | 67.24 | 87.67 | 57.74 | 31.00 | 64.29 | 58.94 | 75.23 | 70.23 | 50.82 | 88.70 | 87.82 (selected gains: +12.10, +11.22, +13.20, +11.55, +15.14, +23.30, +18.33)

Roles of Different Modules in RoboFusion. To assess the roles of the different modules in RoboFusion, we conduct an ablation study on the original SAM rather than SAM-AD, as shown in Table 10, where a) is the result of the baseline [Chen et al., 2022] and b)-e) show the performance of our RoboFusion-L with different modules. According to Table 10, the SAM and AD-FPN modules significantly improve the performance in OOD noisy scenarios. It is worth noticing that the DGWA module significantly improves the performance, especially in snowy noise scenarios. By Table 10, the impact of fog noise on point clouds is relatively minor; nonetheless, using the A.F. (Adaptive Fusion) module to dynamically aggregate point cloud features and image features yields significant enhancements in fog-noise scenarios.

Table 10: Roles of RoboFusion modules on the KITTI-C validation set for the car class with AP of R40 at moderate difficulty. 'A.F.' denotes the Adaptive Fusion module. 'S.L.' denotes Strong Sunlight.

Method | SAM | AD-FPN | DGWA | A.F. | Snow | Rain | Fog | S.L.
a) | | | | | 34.77 | 41.30 | 44.55 | 80.97
b) | ✓ | | | | 57.43 | 54.27 | 68.81 | 82.07
c) | ✓ | ✓ | | | 59.81 | 56.59 | 69.68 | 83.20
d) | ✓ | ✓ | ✓ | | 66.45 | 58.11 | 70.53 | 84.01
e) | ✓ | ✓ | ✓ | ✓ | 68.47 | 59.07 | 74.38 | 84.07

More Results on the KITTI-C validation set. Besides the experimental results mentioned in the main text, we test our RoboFusion on KITTI-C and nuScenes-C [Dong et al., 2023] to extend our work to a wider range of noise scenarios, including Gaussian, Uniform, Impulse, Moving Object, Motion Blur, Local Density, Local Cutout, Local Gaussian, Local Uniform, and Local Impulse, as shown in Tables 11, 12, and 13. From these tables, compared with LiDAR-only methods including SECOND [Yan et al., 2018], PointPillars [Lang et al., 2019], PointRCNN [Shi et al., 2019] and PV-RCNN [Shi et al., 2020], camera-only methods including SMOKE [Liu et al., 2020b] and ImVoxelNet [Rukhovich et al., 2022], and multi-modal methods including EPNet [Huang et al., 2020], Focals Conv [Chen et al., 2022], and LoGoNet [Li et al., 2023], our RoboFusion-L, RoboFusion-B, and RoboFusion-T consistently outperform across various noise scenarios and achieve the best overall performance. Overall, our RoboFusion demonstrates superior performance in weather-noise (i.e., Snow, Rain, Fog, and Strong Sunlight) scenarios and exhibits better results across a broader range of scenarios, which shows remarkable robustness and generalizability.

Performance Comparison Analysis with LoGoNet. In addition, to provide a clearer analysis of performance across different noise scenarios, we present a more detailed comparative study of our RoboFusion-L and LoGoNet [Li et al., 2023] on the KITTI-C validation dataset, as shown in Table 14. It is worth noting that LoGoNet is a SOTA multi-modal 3D detector known for its exceptional robustness and high accuracy. [Dong et al., 2023] provides noise at varying levels, with the KITTI-C dataset including 5 severities. It is evident that our method demonstrates a high degree of robustness, exhibiting the most stable results as the noise severity varies. For instance, under snow conditions, the performance of our RoboFusion-L shows only a marginal variation from 86.69% to 83.67% across severities 1 to 5. In contrast, LoGoNet's performance drops from 55.07% to 45.02% over the same severity range. Furthermore, in the presence of moving object noise, our method outperforms LoGoNet. In summary, our RoboFusion exhibits remarkable robustness and generalization capabilities, making it well-suited to diverse noise scenarios.

More Results on the nuScenes-C validation set. As depicted in Table 15, compared with LiDAR-only methods including PointPillars [Lang et al., 2019] and CenterPoint [Yin et al., 2021], camera-only methods FCOS3D
Table 11: Comparison with SOTA methods on KITTI-C validation set. The results are evaluated based on the car class with AP of R40
at moderate difficulty. The best one is highlighted in bold.‘S.L.’ denote Strong Sunlight. ‘RCE’ denotes Relative Corruption Error from
Ref.[Dong et al., 2023].

LiDAR-Only Camera-Only LC Fusion


Corruptions RoboFusion (Ours)
SECOND † PointPillars † PointRCNN † PV-RCNN † SMOKE † ImVoxelNet † EPNet † Focals Conv † LoGoNet *
L B T
None(APclean ) 81.59 78.41 80.57 84.39 7.09 11.49 82.72 85.88 85.04 88.04 87.87 87.60
Snow 52.34 36.47 50.36 52.35 2.47 0.22 34.58 34.77 51.45 85.29 84.70 84.60
Rain 52.55 36.18 51.27 51.58 3.94 1.24 36.27 41.30 55.80 86.48 85.54 84.79
Weather
Fog 74.10 64.28 72.14 79.47 5.63 1.34 44.35 44.55 67.53 85.53 84.00 84.17
S.L. 78.32 62.28 62.78 79.91 6.00 10.08 69.65 80.97 75.54 85.50 85.15 84.75
Density 80.18 76.49 80.35 82.79 - - 82.09 84.95 83.68 85.71 84.34 84.11
Cutout 73.59 70.28 73.94 76.09 - - 76.10 78.06 77.17 83.17 81.30 81.21
Crosstalk 80.24 70.85 71.53 82.34 - - 82.10 85.82 82.00 84.12 82.45 83.07
Gaussian (L) 64.90 74.68 61.20 65.11 - - 60.88 82.14 61.85 76.56 78.32 76.52
Sensor Uniform (L) 79.18 77.31 76.39 81.16 - - 79.24 85.81 82.94 85.05 83.04 84.11
Impulse (L) 81.43 78.17 79.78 82.81 - - 81.63 85.01 84.66 85.26 85.06 85.46
Gaussian (C) - - - - 1.56 2.43 80.64 80.97 84.29 82.16 84.63 82.17
Uniform (C) - - - - 2.67 4.85 81.61 83.38 84.45 83.30 85.20 83.30
Impulse (C) - - - - 1.83 2.13 81.18 80.83 84.20 83.51 84.55 82.91
Moving Obj. 52.69 50.15 50.54 54.60 1.67 5.93 55.78 49.14 14.44 49.30 49.12 49.90
Motion
Motion Blur - - - - 3.51 4.19 74.71 81.08 84.52 84.17 84.56 84.18
Local Density 75.10 69.56 74.24 77.63 - - 76.73 80.84 78.63 83.21 82.53 83.22
Local Cutout 68.29 61.80 67.94 72.29 - - 69.92 76.64 64.88 77.22 75.27 76.23
Object Local Gaussian 72.31 76.58 69.82 70.44 - - 75.76 82.02 55.66 79.02 78.32 78.33
Local Uniform 80.17 78.04 77.67 82.09 - - 81.71 84.69 79.94 84.69 83.70 84.37
Local Impulse 81.56 78.43 80.26 84.03 - - 82.21 85.78 84.29 85.26 85.08 85.06
Average(APcor ) 71.68 66.34 68.76 73.41 3.25 3.60 70.35 74.43 71.89 81.72 81.31 81.12
RCE (%) ↓ 12.14 15.38 14.65 13.00 54.11 68.65 14.94 13.32 15.46 7.17 7.46 7.38

: Results from Ref. [Dong et al., 2023].
* denotes re-implement result.

Table 12: Comparison with SOTA methods on KITTI-C validation set. The results are evaluated based on the car class with AP of R40 at
easy difficulty. The best one is highlighted in bold. ‘S.L.’ denotes Strong Sunlight. ‘RCE’ denotes Relative Corruption Error from Ref.[Dong
et al., 2023].

Lidar-Only Camera-Only LC Fusion


Corruptions RoboFusion (Ours)
SECOND † PointPillars † PointRCNN † PV-RCNN † SMOKE † ImVoxelNet † EPNet † Focals Conv † LoGoNet *
L B T
None(APclean ) 90.53 87.75 91.65 92.10 10.42 17.85 92.29 92.00 92.04 93.30 93.22 93.28
Snow 73.05 55.99 71.93 73.06 3.68 0.30 48.03 53.80 74.24 88.77 88.18 88.31
Rain 73.31 55.17 70.79 72.37 5.66 1.77 50.93 61.44 75.96 88.12 88.57 87.75
Weather
Fog 85.58 74.27 85.01 89.21 8.06 2.37 64.83 68.03 86.60 88.96 88.16 88.09
S.L. 88.05 67.42 64.90 87.27 8.75 15.72 81.77 90.03 80.30 89.79 89.23 90.36
Density 90.45 86.86 91.33 91.98 - - 91.89 91.14 91.85 92.90 92.08 92.12
Cutout 81.75 78.90 83.33 83.40 - - 84.17 83.84 84.20 85.94 85.75 84.75
Crosstalk 89.63 78.51 77.38 90.52 - - 91.30 92.01 88.15 91.71 91.54 92.07
Gaussian (L) 73.21 86.24 74.28 74.61 - - 66.99 88.56 64.62 80.96 84.30 83.23
Sensor Uniform (L) 89.50 87.49 89.48 90.65 - - 89.70 91.77 90.75 92.89 91.28 91.63
Impulse (L) 90.70 87.75 90.80 91.91 - - 91.44 92.10 91.66 91.90 91.95 92.30
Gaussian (C) - - - - 2.09 3.74 91.62 89.51 91.64 91.94 92.08 91.57
Uniform (C) - - - - 3.81 7.66 91.95 91.20 91.84 92.01 92.14 92.93
Impulse (C) - - - - 2.57 3.35 91.68 89.90 91.65 91.96 92.04 91.33
Moving Obj. 62.64 58.49 59.29 63.36 2.69 9.63 66.32 54.57 16.83 53.09 51.94 51.70
Motion
Motion Blur - - - - 5.39 6.75 89.65 91.56 91.96 91.99 92.09 92.06
Local Density 87.74 82.90 88.37 89.60 - - 89.40 89.60 89.00 92.02 92.42 92.42
Local Cutout 81.29 75.22 83.30 84.38 - - 82.40 85.55 77.57 87.30 87.49 87.79
Object Local Gaussian 82.05 87.69 82.44 77.89 - - 85.72 89.78 60.03 89.56 89.41 89.62
Local Uniform 90.11 87.83 89.30 90.63 - - 91.32 91.88 88.51 91.59 91.53 91.75
Local Impulse 90.58 87.84 90.60 91.91 - - 91.67 92.02 91.34 92.09 91.97 90.69
Average(APcor ) 83.10 77.41 80.78 83.92 4.74 5.69 81.63 83.91 80.93 88.27 88.20 88.12
RCE(%)↓ 8.20 11.78 11.85 8.87 54.46 68.07 11.54 8.78 12.07 5.39 5.39 5.53

: Results from Ref. [Dong et al., 2023].
* denotes re-implement result.
Table 13: Comparison with SOTA methods on KITTI-C validation set. The results are evaluated based on the car class with AP of R40
at hard difficulty. The best one is hightlighted in bold. ‘S.L.’ denotes Strong Sunlight. ‘RCE’ denotes Relative Corruption Error from
Ref.[Dong et al., 2023].

Lidar-Only Camera-Only LC Fusion


Corruptions RoboFusion (Ours)
SECOND † PointPillars † PointRCNN † PV-RCNN † SMOKE † ImVoxelNet † EPNet † Focals Conv † LoGoNet *
L B T
None(APclean ) 78.57 75.19 78.06 82.49 5.57 9.20 80.16 83.36 84.31 85.27 84.27 83.36
Snow 48.62 32.96 45.41 48.62 1.92 0.20 32.39 30.41 45.57 64.26 62.49 62.74
Rain 48.79 32.65 45.78 48.20 3.16 0.99 34.69 35.71 50.12 66.07 64.89 63.18
Weather
Fog 68.93 58.19 68.05 75.05 4.56 1.03 38.12 39.50 60.47 80.03 78.37 77.29
S.L. 74.62 58.69 61.11 78.02 4.91 8.24 66.43 78.06 73.62 80.02 77.52 81.61
Density 77.04 72.85 77.58 81.15 - - 79.77 82.38 81.98 83.06 83.03 83.05
Cutout 70.79 67.32 71.57 74.60 - - 73.95 76.69 76.18 76.96 77.00 77.38
Crosstalk 76.92 67.51 69.41 80.98 - - 79.54 83.22 80.36 82.94 83.22 83.08
Gaussian (L) 61.09 71.12 56.73 62.70 - - 56.88 77.15 59.98 74.45 75.03 73.81
Sensor Uniform (L) 75.61 74.09 72.25 78.93 - - 75.92 81.62 80.68 81.74 81.79 82.44
Impulse (L) 78.33 74.65 76.88 81.79 - - 79.14 83.28 82.51 83.13 83.16 83.24
Gaussian (C) - - - - 1.18 1.96 78.20 79.01 82.22 82.86 83.05 81.32
Uniform (C) - - - - 2.19 3.90 79.14 81.39 82.37 83.22 83.03 82.06
Impulse (C) - - - - 1.52 1.71 78.51 78.87 82.16 82.75 83.00 81.59
Moving Obj. 48.02 45.47 46.23 50.75 1.40 4.63 50.97 45.34 13.66 43.56 42.62 42.89
Motion
Motion Blur - - - - 2.95 3.32 72.49 77.75 82.50 83.12 83.06 82.92
Local Density 71.45 65.70 71.09 75.39 - - 74.36 77.30 76.83 81.71 81.24 81.15
Local Cutout 63.25 56.69 63.50 68.58 - - 66.53 72.40 60.62 71.95 72.07 73.78
Object Local Gaussian 68.16 73.11 65.65 68.03 - - 72.71 78.52 54.02 76.38 76.41 76.26
Local Uniform 76.67 74.68 74.37 80.17 - - 78.85 81.99 77.44 82.04 82.06 82.33
Local Impulse 78.47 75.18 77.38 82.33 - - 79.79 83.20 82.21 82.99 83.16 82.99
Average(APcor ) 67.92 62.55 65.18 70.95 2.64 2.88 67.41 71.18 69.27 77.16 76.81 76.75
RCE(%)↓ 13.55 16.80 16.49 13.98 52.54 68.62 15.89 14.59 17.83 9.51 9.71 7.93

: Results from Ref. [Dong et al., 2023].
* denotes re-implement result.

Table 14: Performance comparison of our RoboFusion-L with LoGoNet on KITTI-C with 5 noise severities. The results are reported based
on the car with AP of R40 at moderate difficulty. ‘S.L.’ denotes Strong Sunlight. The better one is marked in bold.

Severity
Corruptions APs
1 2 3 4 5

Snow 55.07 / 86.69 52.98 / 86.55 53.08 / 85.94 51.14 / 83.61 45.02 / 83.67 51.45 / 85.29
Rain 57.29 / 87.84 56.90 / 87.75 56.76 / 86.49 55.05 / 85.24 53.01 / 85.07 55.80 / 86.48
Weather
Fog 75.93 / 87.31 69.69 / 86.58 64.77 / 84.71 64.69 / 84.56 62.58 / 84.51 67.53 / 85.53
S.L. 82.03 / 87.26 80.53 / 86.53 76.75 / 84.66 71.12 / 84.61 67.31 / 84.46 75.54 / 85.50

Density 86.60 / 86.81 84.59 / 86.59 84.05 / 85.60 82.74 / 85.27 82.42 / 84.30 83.68 / 85.71
Cutout 82.18 / 87.64 80.02 / 86.21 77.41 / 83.25 74.66 / 80.81 71.59 / 77.94 77.17 / 83.17
Crosstalk 84.22 / 84.41 83.38 / 84.38 81.41 / 84.13 80.78 / 83.79 80.22 / 83.90 82.00 / 84.12
Gaussian (L) 84.69 / 85.41 82.52 / 84.66 77.43 / 81.39 47.28 / 73.58 17.31 / 57.79 61.85 / 76.56
Sensor Uniform (L) 84.77 / 85.77 84.64 / 85.42 84.39 / 85.47 82.32 / 85.00 78.59 / 83.59 82.94 / 85.05
Impulse (L) 84.45 / 84.95 84.73 / 82.88 84.92 / 82.20 84.63 / 80.51 84.56 / 80.29 84.66 / 82.16
Gaussian (C) 84.53 / 85.77 84.47 / 85.42 84.31 / 85.47 84.18 / 85.32 83.96 / 84.32 84.29 / 85.26
Uniform (C) 84.74 / 85.57 84.57 / 85.08 84.54 / 82.96 84.36 / 82.53 84.05 / 80.36 84.45 / 83.30
Impulse (C) 84.53 / 85.70 84.26 / 83.63 84.38 / 83.54 83.95 / 82.42 83.86 / 82.28 84.20 / 83.51

Moving Obj. 58.89 / 78.46 12.78 / 67.86 0.43 / 41.07 0.06 / 36.28 0.07 / 22.85 14.44 / 49.30
Motion
Motion Blur 84.64 / 85.23 84.53 / 84.98 84.56 / 84.72 84.45 / 83.00 84.43 / 82.96 84.52 / 84.17

Local Density 82.31 / 85.23 81.66 / 84.87 80.15 / 82.70 76.53 / 82.08 72.52 / 81.21 78.63 / 83.21
Local Cutout 76.77 / 82.94 72.46 / 81.31 65.87 / 78.14 59.14 / 74.12 50.17 / 69.61 64.88 / 77.22
Object
Local Gaussian 84.45 / 86.81 81.12 / 86.25 67.13 / 82.72 33.33 / 76.01 12.27 / 63.31 55.66 / 79.02
Local Uniform 84.51 / 85.91 84.35 / 85.65 81.95 / 85.23 79.62 / 84.66 69.25 / 81.99 79.94 / 84.68
Local Impulse 84.53 / 85.65 84.47 / 85.13 84.32 / 85.18 84.40 / 85.16 83.72 / 85.16 84.29 / 85.25

APc 79.35 / 85.56 75.73 / 84.38 72.93 / 81.77 68.22 / 79.92 63.34 / 76.97 71.81 / 81.72

Clean 85.04 / 88.04


Table 15: Comparison with SOTA methods on nuScenes-C validation set with mAP. ‘D.I.’ refers to DeepInteraction [Yang et al., 2022].
The best one is highlighted in bold. ‘S.L.’ denotes Strong Sunlight. ‘RCE’ denotes Relative Corruption Error from Ref.[Dong et al., 2023].

Lidar-Only Camera-Only LC Fusion


Corruptions RoboFusion (Ours)
PointPillars† CenterPoint† FCOS3D† DETR3D† BEVFormer† FUTR3D† TransFusion† BEVFusion† D.I.*
L B T
None(APclean ) 27.69 59.28 23.86 34.71 41.65 64.17 66.38 68.45 69.90 69.91 69.40 69.09
Snow 27.57 55.90 2.01 5.08 5.73 52.73 63.30 62.84 62.36 67.12 66.07 65.96
Rain 27.71 56.08 13.00 20.39 24.97 58.40 65.35 66.13 66.48 67.58 67.01 66.45
Weather
Fog 24.49 43.78 13.53 27.89 32.76 53.19 53.67 54.10 54.79 67.01 65.54 64.34
S.L. 23.71 54.20 17.20 34.66 41.68 57.70 55.14 64.42 64.93 67.24 66.71 66.54
Density 27.27 58.60 - - - 63.72 65.77 67.79 68.15 69.48 69.02 68.58
Cutout 24.14 56.28 - - - 62.25 63.66 66.18 66.23 69.18 69.01 68.20
Crosstalk 25.92 56.64 - - - 62.66 64.67 67.32 68.12 68.68 68.04 68.17
FOV lost 8.87 20.84 - - - 26.32 24.63 27.17 42.66 39.48 39.30 39.43
Gaussian (L) 19.41 45.79 - - - 58.94 55.10 60.64 57.46 57.77 57.07 56.00
Sensor Uniform (L) 25.60 56.12 - - - 63.21 64.72 66.81 67.42 64.57 64.25 64.99
Impulse (L) 26.44 57.67 - - - 63.43 65.51 67.54 67.41 65.64 65.45 65.44
Gaussian (C) - - 3.96 14.86 15.04 54.96 64.52 64.44 66.52 66.73 66.75 66.53
Uniform (C) - - 8.12 21.49 23.00 57.61 65.26 65.81 65.90 65.77 65.76 65.56
Impulse (C) - - 3.55 14.32 13.99 55.16 64.37 64.30 65.65 64.82 64.75 64.56
Compensation 3.85 11.02 - - - 31.87 9.01 27.57 39.95 41.88 39.54 41.28
Motion
Motion Blur - - 10.19 11.06 19.79 55.99 64.39 64.74 65.45 67.21 66.52 66.42
Local Density 26.70 57.55 - - - 63.60 65.65 67.42 67.71 66.74 66.59 65.88
Local Cutout 17.97 48.36 - - - 61.85 63.33 63.41 65.19 66.82 66.53 66.76
Obeject Local Gaussian 25.93 51.13 - - - 62.94 63.76 64.34 64.75 65.08 65.17 64.77
Local Uniform 27.69 57.87 - - - 64.09 66.20 67.58 66.44 66.71 66.19 65.40
Local Impulse 27.67 58.49 - - - 64.02 66.29 67.91 67.86 66.53 66.87 66.67
Average(APcor ) 22.99 49.78 8.94 18.71 22.12 56.88 58.77 61.35 62.92 63.90 63.43 63.23
RCE (%) ↓ 16.95 16.01 62.51 46.07 46.89 11.34 11.45 10.36 9.97 8.58 8.59 8.47

: Results from Ref. [Dong et al., 2023].
* denotes re-implement result.

[Wang et al., 2021], DETR3D [Wang et al., 2022b], and BEVFormer [Li et al., 2022c], and multi-modal methods including FUTR3D [Chen et al., 2023], TransFusion [Bai et al., 2022], BEVFusion [Liu et al., 2023] and DeepInteraction [Yang et al., 2022], our RoboFusion demonstrates superior performance across more noise scenarios in AD on average. For instance, our RoboFusion-L excels in 10 noise scenarios, including Weather (Snow, Rain, Fog, Strong Sunlight), Sensor (Density, Cutout, Crosstalk), Motion (Compensation, Motion Blur), and Object (Local Cutout), outperforming DeepInteraction [Yang et al., 2022], which achieves the best performance in only 5 of these noise scenarios. Overall, our method exhibits not only exceptional robustness in weather-induced noise scenarios but also remarkable resilience across broader noise types, including sensor, motion and object noise.

A.3 Visualization
As shown in Fig. 5, we provide visualization results comparing our RoboFusion-L and LoGoNet on the KITTI-C dataset. Overall, compared to SOTA methods like LoGoNet [Li et al., 2023], our method enhances the robustness of multi-modal 3D object detection by leveraging the generalization capability and robustness of VFMs to mitigate OOD noisy scenarios in AD.

[Figure 5 panels — rows: LoGoNet, RoboFusion-L; columns: Snow, Rain, Fog, Sunlight; legend: Ground Truth, True Positives, False Positives, Improved Predictions.]
Figure 5: Visualization results of LoGoNet and our RoboFusion on the KITTI-C dataset. We use red boxes to represent false positives, green boxes for true positives, and black boxes for the ground truth. Blue dashed ovals highlight the pronounced improvements in predictions.

A.4 More Limitations
Although we have mentioned the two main limitations in the 'Conclusions' section of the main text, our RoboFusion still has other limitations. Our method does not achieve the best performance in all noisy scenarios; for instance, as shown in Table 11, it is not the best in the 'Moving Object' noise scenario. Furthermore, we conduct experiments only on the corruption datasets [Dong et al., 2023] rather than real-world datasets. It would be valuable to construct a real-world corruption dataset, but doing so would be expensive.

References
[Bai et al., 2022] Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen, Hongbo Fu, and Chiew-Lan Tai. TransFusion: Robust LiDAR-camera fusion for 3D object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1090–1099, 2022.
[Caesar et al., 2020] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.
[Chen et al., 2022] Yukang Chen, Yanwei Li, Xiangyu Zhang, Jian Sun, and Jiaya Jia. Focal sparse convolutional networks for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5428–5437, 2022.
[Chen et al., 2023] Xuanyao Chen, Tianyuan Zhang, Yue Wang, Yilun Wang, and Hang Zhao. FUTR3D: A unified sensor fusion framework for 3D detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 172–181, 2023.

References

[Bai et al., 2022] Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen, Hongbo Fu, and Chiew-Lan Tai. TransFusion: Robust LiDAR-camera fusion for 3D object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1090–1099, 2022.

[Caesar et al., 2020] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.

[Chen et al., 2022] Yukang Chen, Yanwei Li, Xiangyu Zhang, Jian Sun, and Jiaya Jia. Focal sparse convolutional networks for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5428–5437, 2022.

[Chen et al., 2023] Xuanyao Chen, Tianyuan Zhang, Yue Wang, Yilun Wang, and Hang Zhao. FUTR3D: A unified sensor fusion framework for 3D detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 172–181, 2023.

[Deng et al., 2021] Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, and Houqiang Li. Voxel R-CNN: Towards high performance voxel-based 3D object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1201–1209, 2021.

[Dong et al., 2023] Yinpeng Dong, Caixin Kang, Jinlai Zhang, Zijian Zhu, et al. Benchmarking robustness of 3D object detection to common corruptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1022–1032, 2023.

[Dosovitskiy et al., 2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, et al. An image is worth 16x16 words: Transformers for image recognition at scale. CoRR, abs/2010.11929, 2020.

[Geiger et al., 2012] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361, 2012.

[Hu et al., 2023] Qianjiang Hu, Daizong Liu, and Wei Hu. Density-insensitive unsupervised domain adaption on 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17556–17566, 2023.

[Huang et al., 2020] Tengteng Huang, Zhe Liu, Xiwu Chen, and Xiang Bai. EPNet: Enhancing point features with image semantics for 3D object detection. In European Conference on Computer Vision, pages 35–52. Springer, 2020.

[Kirillov et al., 2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.

[Lang et al., 2019] Alex H Lang, Sourabh Vora, et al. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12697–12705, 2019.

[Li et al., 2022a] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In European Conference on Computer Vision, pages 280–296. Springer, 2022.

[Li et al., 2022b] Yanwei Li, Xiaojuan Qi, Yukang Chen, Liwei Wang, Zeming Li, Jian Sun, and Jiaya Jia. Voxel field fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1120–1129, 2022.

[Li et al., 2022c] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, et al. BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. In European Conference on Computer Vision, pages 1–18. Springer, 2022.

[Li et al., 2023] Xin Li, Tao Ma, Yuenan Hou, Botian Shi, et al. LoGoNet: Towards accurate 3D object detection with local-to-global cross-modal fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17524–17534, 2023.

[Lin et al., 2017] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.

[Liu et al., 2020a] Lin Liu, Jianzhuang Liu, Shanxin Yuan, Gregory Slabaugh, Aleš Leonardis, Wengang Zhou, and Qi Tian. Wavelet-based dual-branch network for image demoiréing. In European Conference on Computer Vision, pages 86–102. Springer, 2020.

[Liu et al., 2020b] Zechen Liu, Zizhang Wu, and Roland Tóth. SMOKE: Single-stage monocular 3D object detection via keypoint estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 996–997, 2020.

[Liu et al., 2023] Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela L Rus, and Song Han. BEVFusion: Multi-task multi-sensor fusion with unified bird's-eye view representation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2774–2781. IEEE, 2023.

[OpenAI, 2023] OpenAI. GPT-4 technical report. https://cdn.openai.com/papers/gpt-4.pdf, 2023.

[Oza et al., 2023] Poojan Oza, Vishwanath A Sindagi, Vibashan Vishnukumar Sharmini, and Vishal M Patel. Unsupervised domain adaptation of object detectors: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

[Peng et al., 2023] Xidong Peng, Xinge Zhu, and Yuexin Ma. CL3D: Unsupervised domain adaptation for cross-LiDAR 3D detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 2047–2055, 2023.

[Rukhovich et al., 2022] Danila Rukhovich, Anna Vorontsova, and Anton Konushin. ImVoxelNet: Image to voxels projection for monocular and multi-view general-purpose 3D object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2397–2406, 2022.

[Shi et al., 2019] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–779, 2019.

[Shi et al., 2020] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10529–10538, 2020.

[Song et al., 2023] Ziying Song, Haiyue Wei, Lin Bai, et al. GraphAlign: Enhancing accurate feature alignment by graph matching for multi-modal 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3358–3369, 2023.

[Song et al., 2024a] Ziying Song, Lin Liu, Feiyang Jia, et al. Robustness-aware 3D object detection in autonomous driving: A review and outlook. arXiv preprint arXiv:2401.06542, 2024.

[Song et al., 2024b] Ziying Song, Guoxin Zhang, Jun Xie, Lin Liu, et al. VoxelNextFusion: A simple, unified and effective voxel fusion framework for multi-modal 3D object detection. arXiv preprint arXiv:2401.02702, 2024.

[Tsai et al., 2023] Darren Tsai, Julie Stephany Berrio, Mao Shan, Eduardo Nebot, and Stewart Worrall. Viewer-centred surface completion for unsupervised domain adaptation in 3D object detection. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9346–9353, 2023.

[Wang et al., 2021] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. FCOS3D: Fully convolutional one-stage monocular 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 913–922, 2021.

[Wang et al., 2022a] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Probabilistic and geometric depth: Detecting objects in perspective. In Conference on Robot Learning, pages 1475–1485. PMLR, 2022.

[Wang et al., 2022b] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, et al. DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. In Conference on Robot Learning, pages 180–191. PMLR, 2022.

[Wang et al., 2023a] Li Wang, Xinyu Zhang, Ziying Song, et al. Multi-modal 3D object detection in autonomous driving: A survey and taxonomy. IEEE Transactions on Intelligent Vehicles, 2023.
[Wang et al., 2023b] Yan Wang, Junbo Yin, Wei Li, et al.
SSDA3D: Semi-supervised domain adaptation for 3D ob-
ject detection from point cloud. In Proceedings of the
AAAI Conference on Artificial Intelligence, volume 37,
pages 2707–2715, 2023.
[Wu et al., 2023] QuanLin Wu, Hang Ye, Yuntian Gu, et al.
Denoising masked autoencoders help robust classification.
In The Eleventh International Conference on Learning
Representations, 2023.
[Xie et al., 2023] Yichen Xie, Chenfeng Xu, Marie-Julie
Rakotosaona, Patrick Rim, et al. SparseFusion: Fusing
multi-modal sparse representations for multi-sensor 3D
object detection. arXiv preprint arXiv:2304.14340, 2023.
[Yan et al., 2018] Yan Yan, Yuxing Mao, and Bo Li. SEC-
OND: Sparsely embedded convolutional detection. Sen-
sors, 18(10):3337, 2018.
[Yan et al., 2023] Junjie Yan, Yingfei Liu, Jianjian Sun, Fan Jia, et al. Cross modal transformer via coordinates encoding for 3D object detection. arXiv preprint arXiv:2301.01283, 2023.
[Yang et al., 2022] Zeyu Yang, Jiaqi Chen, Zhenwei Miao,
et al. DeepInteraction: 3D object detection via modality
interaction. Advances in Neural Information Processing
Systems, 35:1992–2005, 2022.
[Yin et al., 2021] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3D object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11784–11793, 2021.
[Zhang et al., 2022] Yanan Zhang, Jiaxin Chen, and
Di Huang. CAT-Det: Contrastively augmented trans-
former for multi-modal 3D object detection. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 908–917, 2022.
[Zhang et al., 2023a] Chaoning Zhang, Dongshen Han,
Yu Qiao, Jung Uk Kim, et al. Faster segment anything:
Towards lightweight SAM for mobile applications. arXiv
preprint arXiv:2306.14289, 2023.
[Zhang et al., 2023b] Dingyuan Zhang, Dingkang Liang,
Hongcheng Yang, Zhikang Zou, Xiaoqing Ye, Zhe Liu,
and Xiang Bai. SAM3D: Zero-shot 3D object de-
tection via segment anything model. arXiv preprint
arXiv:2306.02245, 2023.
[Zhang et al., 2023c] Xinyu Zhang, Li Wang, Jian Chen, et al. Dual Radar: A multi-modal dataset with dual 4D radar for autonomous driving. arXiv preprint arXiv:2310.07602, 2023.
[Zhao et al., 2023] Xu Zhao, Wenchao Ding, Yongqi An,
Yinglong Du, et al. Fast segment anything, 2023.
[Zhu et al., 2020] Xinge Zhu, Yuexin Ma, Tai Wang, et al.
SSN: Shape signature networks for multi-class object de-
tection from point clouds. In European Conference on
Computer Vision, pages 581–597. Springer, 2020.
