
UniMODE: Unified Monocular 3D Object Detection

Zhuoling Li1, Xiaogang Xu2, SerNam Lim3, Hengshuang Zhao1*

1 The University of Hong Kong    2 Zhejiang University    3 University of Central Florida
lizhuoling@connect.hku.hk    xgxu@zhejianglab.com    sernam@ucf.edu    hszhao@cs.hku.hk

* Corresponding author.
1 This paper has been accepted for publication in CVPR 2024.
arXiv:2402.18573v1 [cs.CV] 28 Feb 2024

Abstract

Realizing unified monocular 3D object detection, including both indoor and outdoor scenes, holds great importance in applications like robot navigation. However, involving various scenarios of data to train models poses challenges due to their significantly different characteristics, e.g., diverse geometry properties and heterogeneous domain distributions. To address these challenges, we build a detector based on the bird's-eye-view (BEV) detection paradigm, where the explicit feature projection is beneficial to addressing the geometry learning ambiguity when employing multiple scenarios of data to train detectors. Then, we split the classical BEV detection architecture into two stages and propose an uneven BEV grid design to handle the convergence instability caused by the aforementioned challenges. Moreover, we develop a sparse BEV feature projection strategy to reduce computational cost and a unified domain alignment method to handle heterogeneous domains. Combining these techniques, a unified detector UniMODE is derived, which surpasses the previous state-of-the-art on the challenging Omni3D dataset (a large-scale dataset including both indoor and outdoor scenes) by 4.9% AP3D, revealing the first successful generalization of a BEV detector to unified 3D object detection.

Figure 1. Illustration of some challenges (e.g., diverse geometry properties, heterogeneous domain distributions) in unified detection. (1) Comparing sub-figures (a) and (b), indoor objects are small and close, while outdoor objects are far and sparse. Besides, the camera parameters vary widely. (2) Comparing sub-figures (a), (b), and (c), which correspond to a real-world indoor image, a real-world outdoor image, and a synthetic indoor image, the image styles are different. (3) Although the category “Picture” is labeled in sub-figure (c), it is not labeled in sub-figure (d), which suggests label conflict among different sub-datasets. Unlabeled objects are highlighted by red ellipses.
1. Introduction

Monocular 3D object detection aims to accurately determine the precise 3D bounding boxes of targets using only single images captured by cameras [13, 16]. Compared to 3D object detection based on other modalities such as LiDAR point clouds, the monocular-based solution offers advantages in terms of cost-effectiveness and comprehensive semantic features [17, 19]. Moreover, owing to wide-ranging applications like autonomous driving [8], monocular 3D object detection has drawn much attention recently.

Thanks to the efforts of the research community, numerous detectors have been developed. Some are designed for outdoor scenarios [9, 38] such as urban driving, while others focus on indoor detection [28]. Despite their common goal of monocular 3D object detection, these detectors exhibit significant differences in their network architectures [5]. This divergence hinders researchers from combining data of various scenarios to train a unified model that performs well in diverse scenes, which is demanded by many important applications like robot navigation [30].

The most critical challenge in unified 3D object detection lies in addressing the distinct characteristics of different scenarios. For example, indoor objects are smaller and closer in proximity, while outdoor detection needs to cover a vast perception range. Recently, Cube RCNN [5] has served as a predecessor in studying this problem. It directly produces 3D box predictions in the camera view and adopts a depth decoupling strategy to tackle the domain gap among scenes. However, we observe that it suffers serious convergence difficulty and is prone to collapsing during training.

To overcome the unstable convergence of Cube RCNN, we employ the recently popular bird's-eye-view (BEV) detection paradigm to develop a unified 3D object detector. This is because the feature projection in the BEV paradigm aligns the image space with the 3D real-world space explicitly [15], which alleviates the learning ambiguity in monocular 3D object detection. Nevertheless, after extensive exploration, we find that naively adopting existing BEV detection architectures [15, 18] does not yield promising performance, which is mainly attributed to the following obstacles.

First of all, as shown in Fig. 1 (a) and (b), the geometry properties (e.g., perception ranges, target positions) between indoor and outdoor scenes are quite diverse. Specifically, indoor objects are typically a few meters away from the camera, while outdoor targets can be more than 100 m away. Since a unified BEV detector is required to recognize objects in all scenarios, the BEV feature has to cover the maximum possible perception range. Meanwhile, as indoor objects are often small, the BEV grid resolution for indoor detection needs to be precise. All these characteristics can lead to unstable convergence and a significant computational burden. To address these challenges, we develop a two-stage detection architecture. In this architecture, the first stage produces an initial target position estimation, and the second stage locates targets using this estimation as prior information, which helps stabilize the convergence process. Moreover, we introduce an innovative uneven BEV grid split strategy that expands the BEV space range while maintaining a manageable BEV grid size. Furthermore, a sparse BEV feature projection strategy is developed to reduce the projection computational cost by 82.6%.

Another obstacle arises from the heterogeneous domain distributions (e.g., image styles, label definitions) across various scenarios. For example, as depicted in Fig. 1 (a), (b), and (c), the data can be collected in real scenes or synthesized virtually. Besides, comparing Fig. 1 (c) and (d), a class of objects may be annotated in one scene but not labeled in another, leading to confusion during network convergence. To handle these conflicts, we propose a unified domain alignment technique consisting of two parts: domain adaptive layer normalization to align features, and a class alignment loss for alleviating label definition conflicts.

Combining all these innovative techniques, a Unified Monocular Object DEtector named UniMODE is developed, and it achieves state-of-the-art (SOTA) performance on the Omni3D benchmark. In the unified detection setting, UniMODE surpasses the SOTA detector, Cube RCNN, by an impressive 4.9% in terms of AP3D (average precision based on 3D intersection over union). Furthermore, when evaluated in the indoor and outdoor detection settings individually, UniMODE outperforms Cube RCNN by 11.9% and 9.1%, respectively. This work represents a pioneering effort to explore the generalization of BEV detection architectures to unified detection, seamlessly integrating indoor and outdoor scenes. It showcases the immense potential of BEV detection across a broad spectrum of scenarios and underscores the versatility of this technology.

2. Related Work

Monocular 3D object detection. Due to its advantages of being economical and flexible, monocular 3D object detection attracts much research attention [22]. Existing detectors can be broadly categorized into two groups, camera-view detectors and BEV detectors. Among them, camera-view detectors generate results in the 2D image plane before converting them into the 3D real space [10, 25]. This group is generally easier to implement. However, the conversion from the 2D camera plane to the 3D physical space can introduce additional errors [32], which negatively impact downstream planning tasks typically performed in 3D [7].

BEV detectors, on the other hand, transform image features from the 2D camera plane to the 3D physical space before generating results in 3D [12]. This approach benefits downstream tasks, as planning is also performed in the 3D space [18]. However, the challenge with BEV detectors is that the feature transformation process relies on accurate depth estimation, which can be difficult to achieve with only camera images [23]. As a result, convergence becomes unstable when dealing with diverse data scenarios [5].

Unified object detection. In order to improve the generalization ability of detectors, some works have explored the integration of multiple data sources during model training [14, 34]. For example, in the field of 2D object detection, SMD [40] improves the performance of detectors through learning a unified label space. In the 3D object detection domain, PPT [36] investigates the utilization of extensive 3D point cloud data from diverse datasets for pre-training detectors. In addition, Uni3DETR [35] reveals how to devise a unified point-based 3D object detector that behaves well in different domains. For the camera-based detection track, Cube RCNN [5] serves as the sole predecessor in the study of unified monocular 3D object detection. However, Cube RCNN is plagued by the unstable convergence issue, necessitating further in-depth analysis within this track.

3. Method

3.1. Overall Framework

The overall framework of UniMODE is illustrated in Fig. 2. As shown, a monocular image $I \in \mathbb{R}^{3 \times H \times W}$ sampled from multiple scenarios (e.g., indoor and outdoor, real and synthetic, daytime and nighttime) is input to the feature extraction module (including a backbone and a neck) to produce a representative feature $F \in \mathbb{R}^{C \times \frac{H}{16} \times \frac{W}{16}}$. Then, $F$ is processed by 4 fully convolutional heads, namely the "domain head", "proposal head", "feature head", and "depth head", respectively.

Figure 2. The overall detection framework of UniMODE. The illustrated modules proposed in this work include the proposal head, sparse BEV feature projection, uneven BEV feature grid, domain adaptive layer normalization, and class alignment loss.

Among them, the role of the domain head is to predict which pre-defined data domain an input image is most relevant to, and the classification confidence produced by the domain head is subsequently utilized in domain alignment. The proposal head aims to estimate the rough target distribution before the 6 Transformer decoders, and the estimated distribution serves as prior information for the second-stage detection. This design alleviates the distribution mismatch between diverse training domains (refer to Section 3.2). The proposal head output is encoded as M proposal queries. In addition, N queries are randomly initialized and concatenated with the proposal queries for the second-stage detection, leading to M + N queries in the second stage.

The feature head and depth head are responsible for projecting the image feature into the BEV plane and obtaining the BEV feature. During this projection, we develop a technique to remove unnecessary projection points, which reduces the computing burden by about 82.6% (refer to Section 3.4). Besides, we propose the uneven BEV feature grid (refer to Section 3.3), which means the BEV grids closer to the camera enjoy more precise resolution, and the grids farther from the camera cover broader perception areas. This design well balances the grid size contradiction between indoor detection and outdoor detection without extra memory burden.

Obtaining the projected BEV feature, a BEV encoder is employed to further refine the feature, and 6 decoders are adopted to generate the second-stage detection results. As mentioned before, M + N queries are used during this process. After the 6 decoders, the queries are decoded as detection results by the query FFN. In the decoder part, the unified domain alignment strategy is devised to align the data of various scenarios from both the feature and loss perspectives. Refer to Section 3.5 for more details.
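To make the query initialization described above concrete, the following is a minimal PyTorch sketch of how a first-stage proposal output (2D-center heatmap, 2D-to-3D center offsets, and center depths) could be turned into M proposal queries and concatenated with N learnable queries, as elaborated in Section 3.2. The class and function names, the simple MLP encoder, and the top-k selection over the flattened heatmap are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class ProposalQueryEncoder(nn.Module):
    """Turn first-stage proposals into M queries and append N learned queries.

    A minimal sketch of the two-stage query initialization; the exact head
    design and embedding layout of UniMODE are not reproduced here.
    """

    def __init__(self, num_proposals=100, num_random=100, embed_dim=256):
        super().__init__()
        self.num_proposals = num_proposals
        # Small MLP that encodes a 3D proposal center into a query embedding.
        self.center_mlp = nn.Sequential(
            nn.Linear(3, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )
        # N randomly initialized (learnable) queries for potentially missed targets.
        self.random_queries = nn.Parameter(torch.randn(num_random, embed_dim))

    def forward(self, heatmap, offset, depth, intrinsics):
        # heatmap: (B, num_cls, H, W) 2D-center confidences (after sigmoid).
        # offset:  (B, 2, H, W) offsets from 2D centers to projected 3D centers.
        # depth:   (B, 1, H, W) predicted depths of the 3D centers.
        # intrinsics: (B, 3, 3) camera matrices, assumed scaled to the feature resolution.
        b, _, h, w = heatmap.shape
        scores, _ = heatmap.max(dim=1)                       # best class score per cell
        topk_scores, topk_idx = scores.flatten(1).topk(self.num_proposals, dim=1)
        ys = torch.div(topk_idx, w, rounding_mode="floor")   # grid row indices
        xs = topk_idx % w                                    # grid column indices

        # Gather the predicted offsets and depths of the selected cells.
        flat_off = offset.flatten(2)                         # (B, 2, H*W)
        flat_dep = depth.flatten(2)                          # (B, 1, H*W)
        off = torch.gather(flat_off, 2, topk_idx.unsqueeze(1).expand(-1, 2, -1))
        dep = torch.gather(flat_dep, 2, topk_idx.unsqueeze(1)).squeeze(1)  # (B, M)

        # 2D centers (in feature-map pixels) shifted toward the projected 3D centers.
        u = xs.float() + off[:, 0]
        v = ys.float() + off[:, 1]

        # Back-project to 3D camera coordinates: X = (u - cx) * z / fx, etc.
        fx, fy = intrinsics[:, 0, 0:1], intrinsics[:, 1, 1:2]
        cx, cy = intrinsics[:, 0, 2:3], intrinsics[:, 1, 2:3]
        x3d = (u - cx) * dep / fx
        y3d = (v - cy) * dep / fy
        centers = torch.stack([x3d, y3d, dep], dim=-1)       # (B, M, 3)

        proposal_queries = self.center_mlp(centers)          # (B, M, C)
        random_queries = self.random_queries.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([proposal_queries, random_queries], dim=1)  # (B, M+N, C)
```

The derived 3D centers would additionally serve as the initial reference points of the second-stage decoder, which is why the iterative reference refinement of deformable DETR becomes unnecessary in this design.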

3.2. Two-Stage Detection Architecture

The integration of indoor and outdoor 3D object detection is challenging due to diverse geometry properties (e.g., perception ranges, target positions). Indoor detection typically involves close-range targets, while outdoor detection concerns targets scattered over a broader 3D space. As depicted in Fig. 3, the perception ranges and target positions in indoor and outdoor detection scenes vary significantly, which is challenging for traditional BEV 3D object detectors because of their fixed BEV feature resolutions.

The geometry property difference is identified as an essential reason causing the unstable convergence of BEV detectors [15]. For example, the target position distribution difference makes it challenging for Transformer-based detectors to learn how to update the query reference points gradually toward concerned objects. In fact, through visualization, we find the reference point updating in the 6 Transformer decoders is disordered. As a result, if we adopt the classical deformable DETR architecture [41] to build a 3D object detector, the training easily collapses due to the inaccurate positions of learned reference points, resulting in sudden gradient vanishing or exploding.

To overcome this challenge, we construct UniMODE in a two-stage detection fashion. In the first stage, we design a CenterNet [39] style head (the proposal head in Fig. 2) to produce detection proposals. Specifically, its predicted attributes include the 2D center Gaussian heatmap, the offsets from 2D centers to 3D centers, and the 3D center depths of targets. The 3D center coordinates of proposals can be derived from these predicted attributes. Then, the proposals with the top M confidences are selected and encoded as M proposal queries by an MLP layer. To account for any potentially missed targets, another N randomly initialized queries are concatenated with these proposal queries to perform information interaction in the 6 decoders of the second stage (the Transformer stage). In this way, the initial query reference points of the second detection stage are adjusted adaptively. Our experiments reveal that this two-stage architecture is essential for stable convergence.

Besides, since the positions of query reference points are not randomly initialized, the iterative bounding box refinement strategy proposed in deformable DETR [41] is abandoned, as it may lead to a deterioration of the reference points' quality. In fact, we observe that this iterative bounding box refinement strategy could result in convergence collapse.

Figure 3. Indoor and outdoor target position distributions in the BEV space. The brighter a point is, the more targets the corresponding BEV grid contains. The perception camera is located at the point with the coordinate (0, 0).

3.3. Uneven BEV Grid

A notable difference between indoor and outdoor 3D object detection lies in the geometry information (e.g., scale, proximity) of objects relative to the camera during data collection. Indoor environments typically feature smaller objects located closer to the camera, whereas outdoor environments involve larger objects positioned at greater distances. Furthermore, outdoor 3D object detectors must account for a wider perceptual range of the environment. Consequently, existing indoor 3D object detectors typically use smaller voxel or pillar sizes. For instance, the voxel size of CAGroup3D [31], a SOTA indoor 3D object detector, is 0.04 meters, and the maximum target depth in the SUN-RGBD dataset [29], a classic indoor dataset, is approximately 8 meters. In contrast, outdoor datasets exhibit much larger perception ranges. For example, the commonly used outdoor detection dataset KITTI [8] has a maximum depth range of 100 meters. Due to this vast perception range and limited computing resources, outdoor detectors employ larger BEV grid sizes, e.g., the BEV grid size in BEVDepth [11], a state-of-the-art outdoor 3D object detector, is 0.8 meters.

Therefore, the BEV grid sizes of current outdoor detectors are typically large to accommodate the vast perception range, while those of indoor detectors are small because of the intricate indoor scenes. However, since UniMODE aims to address both indoor and outdoor 3D object detection using a unified model structure and network weights, its BEV feature must cover a large perception area while still utilizing small BEV grids, which poses a massive challenge due to the limited GPU memory.

To overcome this challenge, we propose a solution that involves partitioning the BEV space into uneven grids, in contrast to the even grids utilized by existing detectors. As depicted in the bottom part of Fig. 2, we achieve this by employing smaller grids closer to the camera and larger grids for those farther away. This approach enables UniMODE to effectively perceive a wide range of objects while maintaining small grid sizes for objects in close proximity. Importantly, this does not increase the total number of grids, thereby avoiding any additional computational burden. Specifically, assuming there are $N_z$ grids along the depth axis and the depth range is $(z_{\min}, z_{\max})$, the depth boundary $z_i$ of the $i$-th grid is set to:

$$z_i = z_{\min} + \frac{z_{\max} - z_{\min}}{N_z (N_z + 1)} \cdot i(i + 1). \qquad (1)$$

Notably, the mathematical form of Eq. 1 is similar to the linear-increasing discretization of depth bins in CaDDN [26], while the essence is fundamentally different. In CaDDN, the feature projection distribution is adjusted to allocate more features to grids closer to the camera. In experiments, we observe that this adjustment results in a more imbalanced BEV feature, i.e., denser features in closer grids and more empty grids in farther grids. Since features in all grids are extracted by the same network, this imbalance degrades the performance. By contrast, our uneven BEV grid approach enhances detection precision by making the feature density more balanced.
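A minimal sketch of how the uneven grid boundaries defined by Eq. 1 could be generated, assuming $z_i$ denotes the cumulative depth boundary of the $i$-th grid so that grid widths grow linearly with distance. The function name, the use of PyTorch, and the choice of 80 depth grids for the (0, 80) m range are assumptions for illustration.

```python
import torch


def uneven_depth_grid(z_min: float, z_max: float, num_grids: int) -> torch.Tensor:
    """Return the N_z + 1 depth boundaries defined by Eq. 1.

    The i-th boundary is z_min + (z_max - z_min) * i * (i + 1) / (N_z * (N_z + 1)),
    so the width of each grid grows linearly with its distance from the camera.
    """
    i = torch.arange(num_grids + 1, dtype=torch.float32)
    return z_min + (z_max - z_min) * i * (i + 1) / (num_grids * (num_grids + 1))


# Example with the depth range stated in the implementation details (0 m to 80 m),
# assuming 80 grids along the depth axis:
bounds = uneven_depth_grid(0.0, 80.0, 80)
sizes = bounds[1:] - bounds[:-1]
print(sizes[0].item(), sizes[-1].item())  # ~0.025 m near the camera, ~2.0 m far away
```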

3.4. Sparse BEV Feature Projection

The step of transforming the camera-view feature into the BEV space is quite computationally expensive due to its numerous projection points. Specifically, considering the image feature $F_i \in \mathbb{R}^{C_i \times 1 \times H_f \times W_f}$ and the depth feature $F_d \in \mathbb{R}^{1 \times C_d \times H_f \times W_f}$, the projection feature $F_p \in \mathbb{R}^{C_i \times C_d \times H_f \times W_f}$ is obtained by multiplying $F_i$ and $F_d$. Therefore, the number of projection points, $C_i \times C_d \times H_f \times W_f$, increases dramatically with the growth of $C_d$. The heavy computational burden of this feature projection step restricts the BEV feature resolution, and thus hinders unifying indoor and outdoor 3D object detection.

In this work, we observe that most projection points in $F_p$ are unnecessary because their values are quite tiny. This is essentially because of the small corresponding values in $F_d$, which imply that the model predicts there is no target in these specific BEV grids. Hence, the time spent on projecting features to these unconcerned grids can be saved.

Based on the above insights, we propose to remove the unnecessary projection points based on a pre-defined threshold $\tau$. Specifically, we eliminate the projection points in $F_p$ whose corresponding depth confidence in $F_d$ is smaller than $\tau$. In this way, most projection points are eliminated. For instance, when setting $\tau$ to 0.001, about 82.6% of the projection points can be excluded.
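The thresholding idea can be illustrated with the following hypothetical implementation of the lift-style outer product between image and depth features, in which only the projection points whose depth confidence reaches τ are computed. This is a sketch under the paper's notation, not the authors' code; a real implementation would scatter the retained points directly into the BEV grids instead of materializing the dense tensor.

```python
import torch


def sparse_bev_projection(feat_img, feat_depth, tau=1e-3):
    """Outer-product ("lift") projection that skips low-confidence depth bins.

    feat_img:   (Ci, Hf, Wf) image features.
    feat_depth: (Cd, Hf, Wf) per-pixel depth distribution (e.g., after softmax).
    Returns a dense (Ci, Cd, Hf, Wf) tensor, but only the entries whose depth
    confidence is at least tau are actually computed.
    """
    ci, hf, wf = feat_img.shape
    cd = feat_depth.shape[0]
    keep = feat_depth >= tau                       # (Cd, Hf, Wf) boolean mask
    proj = feat_img.new_zeros(ci, cd, hf, wf)
    d_idx, y_idx, x_idx = keep.nonzero(as_tuple=True)
    # Multiply image features only at the retained projection points.
    proj[:, d_idx, y_idx, x_idx] = (
        feat_img[:, y_idx, x_idx] * feat_depth[d_idx, y_idx, x_idx]
    )
    return proj


# Toy example: with a peaked depth distribution, most points fall below tau.
img = torch.randn(64, 32, 56)
depth = torch.softmax(torch.randn(112, 32, 56) * 5.0, dim=0)
out = sparse_bev_projection(img, depth)
print((depth < 1e-3).float().mean())  # fraction of skipped projection points
```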

3.5. Unified Domain Alignment

Heterogeneous domain distributions exist in diverse scenarios, and we address this challenge from both the feature and loss views.

Domain adaptive layer normalization. For the feature view, we initialize domain-specific learnable parameters to address the variations observed in diverse training data domains. However, this strategy must adhere to two crucial requirements. Firstly, the detector should exhibit robust performance during inference, even when confronted with images from domains that are not encountered during training. Secondly, the introduction of these domain-specific parameters should incur minimal computational overhead.

Considering these two requirements, we propose the domain adaptive layer normalization (DALN) strategy. In this strategy, we first split the training data into $D$ domains. For the classic implementation of layer normalization (LN) [2], denoting the input sequence as $X_l \in \mathbb{R}^{B \times L \times C}$ and its element with the index $(b, l, c)$ as $x_l^{(b,l,c)}$, the corresponding output $\hat{x}_l^{(b,l,c)}$ of processing $x_l^{(b,l,c)}$ by LN is obtained as:

$$\hat{x}_l^{(b,l,c)} = \frac{x_l^{(b,l,c)} - \mu^{(b,l)}}{\sigma^{(b,l)}}, \qquad (2)$$

where

$$\mu^{(b,l)} = \frac{1}{C} \sum_{i=1}^{C} x_l^{(b,l,i)}, \qquad \sigma^{(b,l)} = \sqrt{\frac{1}{C} \sum_{i=1}^{C} \big( x_l^{(b,l,i)} - \mu^{(b,l)} \big)^2}. \qquad (3)$$

In DALN, we build a set of learnable domain-specific parameters, i.e., $\{(\alpha_i, \beta_i)\}_{i=1}^{D}$, where $(\alpha_i, \beta_i)$ are the parameters corresponding to the $i$-th domain. $\{\alpha_i\}_{i=1}^{D}$ are initialized as 1 and $\{\beta_i\}_{i=1}^{D}$ are set to 0. Then, we establish a domain head consisting of several convolutional layers. As shown in Fig. 2, the domain head takes the feature $F$ as input and predicts the confidence scores that the input image $I$ belongs to these $D$ domains. Denoting the confidences of the $b$-th image as $\{c_i\}_{i=1}^{D}$, the input-dependent parameters $(\alpha, \beta)$ are computed following:

$$\alpha = \sum_{i=1}^{D} c_i \cdot \alpha_i, \qquad \beta = \sum_{i=1}^{D} c_i \cdot \beta_i. \qquad (4)$$

Obtaining $(\alpha, \beta)$, we employ them to adjust the distribution of $\hat{x}_l^{(b,l,c)}$ via $\bar{x}_l^{(b,l,c)} = \alpha \cdot \hat{x}_l^{(b,l,c)} + \beta$, where $\bar{x}_l^{(b,l,c)}$ denotes the updated value. In this way, the feature distribution in UniMODE can be adjusted according to the input images self-adaptively, and the increase in parameters is negligible. Additionally, when an image unseen in the training set is input, DALN still works well, because the unseen image can still be classified as a weighted combination of these $D$ domains.

Although there exist a few previous techniques related to adaptive normalization, almost all of them are based on regressing input-dependent parameters directly [36]. So, they need to build a special regression head for every normalization layer. By contrast, DALN enables all layers to share the same domain head, so the computing burden is much smaller. Besides, DALN introduces domain-specific parameters, which are more stable to train.
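Below is a minimal PyTorch sketch of DALN following Eqs. 2–4: plain layer normalization without its own affine parameters, followed by a scale and shift mixed from the domain-specific $(\alpha_i, \beta_i)$ using the confidences produced by the shared domain head. Whether $(\alpha_i, \beta_i)$ are scalars or per-channel vectors is not specified above; the sketch assumes per-channel vectors, and the interface that passes the confidences explicitly is likewise an assumption.

```python
import torch
import torch.nn as nn


class DomainAdaptiveLayerNorm(nn.Module):
    """Layer normalization whose affine parameters are a confidence-weighted
    mixture of domain-specific (alpha_i, beta_i), following Eqs. 2-4."""

    def __init__(self, dim: int, num_domains: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # One (alpha, beta) pair per domain; alpha initialized to 1, beta to 0.
        self.alpha = nn.Parameter(torch.ones(num_domains, dim))
        self.beta = nn.Parameter(torch.zeros(num_domains, dim))

    def forward(self, x: torch.Tensor, domain_conf: torch.Tensor) -> torch.Tensor:
        # x: (B, L, C) token sequence; domain_conf: (B, D) confidences from the
        # shared domain head (e.g., a softmax over the D pre-defined domains).
        mu = x.mean(dim=-1, keepdim=True)                      # Eq. 3
        sigma = x.std(dim=-1, keepdim=True, unbiased=False)
        x_hat = (x - mu) / (sigma + self.eps)                  # Eq. 2
        alpha = domain_conf @ self.alpha                       # (B, C), Eq. 4
        beta = domain_conf @ self.beta
        return alpha.unsqueeze(1) * x_hat + beta.unsqueeze(1)  # scale and shift


# Usage: the same domain confidences are shared by every DALN layer in the decoder.
daln = DomainAdaptiveLayerNorm(dim=256, num_domains=6)
tokens = torch.randn(2, 4800, 256)
conf = torch.softmax(torch.randn(2, 6), dim=-1)
out = daln(tokens, conf)
```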
Figure 4. An example of heterogeneous label conflict among sub-datasets in Omni3D: (a) ARKitScenes, (b) Hypersim. As shown, “Window” is not labeled in ARKitScenes while it is labeled in Hypersim, so the unlabeled window in (a) could harm the convergence stability of detectors.

Class alignment loss. In the loss view, we aim to address the heterogeneous label conflict when combining multiple data sources. Specifically, there are 6 independently labeled sub-datasets in Omni3D, and their label spaces are different. For example, as presented in Fig. 4, the “Window” class is annotated in Hypersim while it is not labeled in ARKitScenes. As the label space of Omni3D is the union of all classes in all subsets, the unlabeled window in Fig. 4 (a) becomes a missing target that harms convergence stability.

The two-stage detection architecture described in Section 3.2 can alleviate the aforementioned problem to some extent, because it helps the detector concentrate on foreground objects, and the unlabeled objects are overlooked when computing the loss. To address this problem further, we devise a simple strategy, the class alignment loss. Specifically, denoting the label space of the $i$-th dataset as $\Omega_i$, we compute the loss on the $i$-th dataset as:

$$L_i = \begin{cases} \gamma \cdot l(y, \bar{y}), & (y \notin \Omega_i) \wedge (\bar{y} = B) \\ l(y, \bar{y}), & \text{otherwise} \end{cases} \qquad (5)$$

where $l(\cdot)$, $y$, $\bar{y}$, $B$ are the loss function, class prediction, class label, and background class, respectively. $\gamma$ is a factor for reducing the punishment on classes not included in the label space of this sample.

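To illustrate Eq. 5, the following sketch computes the class alignment loss for a batch of query predictions, assuming a cross-entropy classification loss and reading the down-weighting condition as: the predicted class lies outside the label space Ω_i of the current sample's sub-dataset while the assigned label is the background class. The helper name and the encoding of Ω_i as a boolean mask are hypothetical.

```python
import torch
import torch.nn.functional as F


def class_alignment_loss(logits, labels, label_space_mask, background_idx, gamma=0.2):
    """Down-weight the classification loss of queries that predict a class
    outside the current dataset's label space while being assigned to the
    background class (Eq. 5).

    logits:           (Q, num_classes + 1) class logits per query.
    labels:           (Q,) assigned class labels; background_idx for unmatched queries.
    label_space_mask: (num_classes + 1,) bool, True for classes annotated in the
                      sub-dataset this sample comes from.
    """
    per_query = F.cross_entropy(logits, labels, reduction="none")  # l(y, y_bar)
    pred = logits.argmax(dim=-1)
    outside = ~label_space_mask[pred]                              # y not in Omega_i
    is_bg = labels == background_idx                               # y_bar = B
    weight = torch.ones_like(per_query)
    weight[outside & is_bg] = gamma                                # reduce the punishment
    return (weight * per_query).mean()


# Toy usage: class 3 (e.g., "window") is not annotated in this sub-dataset.
num_classes, background_idx = 5, 5
mask = torch.ones(num_classes + 1, dtype=torch.bool)
mask[3] = False
logits = torch.randn(8, num_classes + 1)
labels = torch.full((8,), background_idx)
loss = class_alignment_loss(logits, labels, mask, background_idx)
```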
Method | AP3D^kit | AP3D^nus | AP3D^out | AP3D^sun | AP3D^in | AP3D^25 | AP3D^50 | AP3D^near | AP3D^med | AP3D^far | AP3D
M3D-RPN [4] | 10.4% | 17.9% | 13.7% | - | - | - | - | - | - | - | -
SMOKE [19] | 25.4% | 20.4% | 19.5% | - | - | - | - | - | - | - | 9.6%
FCOS3D [32] | 14.6% | 20.9% | 17.6% | - | - | - | - | - | - | - | 9.8%
PGD [33] | 21.4% | 26.3% | 22.9% | - | - | - | - | - | - | - | 11.2%
GUPNet [21] | 24.5% | 20.5% | 19.9% | - | - | - | - | - | - | - | -
ImVoxelNet [28] | 23.5% | 23.4% | 21.5% | 30.6% | - | - | - | - | - | - | 9.4%
BEVFormer [15] | 23.9% | 29.6% | 25.9% | - | - |  |  |  |  |  |  
PETR [18] | 30.2% | 30.1% | 27.8% | - | - |  |  |  |  |  |  
Cube RCNN [5] | 36.0% | 32.7% | 31.9% | 36.2% | 15.0% | 24.9% | 9.5% | 27.9% | 12.1% | 8.5% | 23.3%
UniMODE | 40.2% | 40.0% | 39.1% | 36.1% | 22.3% | 28.3% | 7.4% | 29.7% | 12.7% | 8.1% | 25.5%
UniMODE* | 41.3% | 43.6% | 41.0% | 39.8% | 26.9% | 30.2% | 10.6% | 31.1% | 14.9% | 8.7% | 28.2%

Table 1. Performance comparison between the proposed UniMODE and other 3D object detectors. In the 2nd ∼ 4th columns (OMNI3D_OUT), the detectors are trained using KITTI and nuScenes; these three columns reflect the detection precision on KITTI, on nuScenes, and the overall outdoor detection performance, respectively. The 5th ∼ 6th columns (OMNI3D_IN) correspond to indoor detection results. Among them, the 5th column is the performance when detectors are trained and validated on SUN-RGBD. In the 6th column, detectors are trained and evaluated by combining SUN-RGBD, ARKitScenes, and Hypersim. The 7th ∼ 12th columns (OMNI3D) represent the overall detection performance, where detectors are trained and validated utilizing all data in Omni3D. UniMODE and UniMODE* denote the proposed detectors taking DLA34 and ConvNext-Base as the backbones, respectively. An empty cell means that the model does not converge well and the obtained performance is quite poor. The "-" means this result is not reported in previous literature.

Backbone | AP3D^sun | AP3D^hyp | AP3D^ark | AP3D^obj | AP3D^kit | AP3D^nus
DLA34 | 21.0% | 6.7% | 42.3% | 52.5% | 27.8% | 31.7%
ConvNext | 23.0% | 8.1% | 48.0% | 66.1% | 29.2% | 36.0%

Table 2. Detailed performance of UniMODE on the various sub-datasets in Omni3D. The detectors are trained and evaluated using the whole Omni3D training and testing data. The results of adopting two different backbones are presented.

4. Experiment

Implementation details. The perception ranges in the X-axis, Y-axis, and Z-axis of the camera coordinate system are (−30, 30), (−40, 40), and (0, 80) meters, respectively. Unless stated otherwise, the BEV grid resolution is (60, 80). The factor γ defined in the class alignment loss is set to 0.2. M and N are set to 100. The adopted optimizer is AdamW, and the learning rate is set to 12e−4 for a batch size of 192. The experiments are primarily conducted on 4 A100 GPUs. The total loss includes two parts, the proposal head loss and the query FFN loss. The proposal head loss consists of the heatmap classification loss and the depth regression loss. The query FFN loss comprises the classification loss (a cross-entropy loss) and the regression loss (an L1 loss for predicting the 3D center, dimension, and orientation). The total loss is the weighted sum of these loss items. No special loss is set for the M proposal query generation.

Dataset. The experiments in this section are performed on Omni3D, the sole large-scale 3D object detection benchmark encompassing both indoor and outdoor scenes. Omni3D is built upon six well-known datasets including KITTI [8], SUN-RGBD [29], ARKitScenes [3], Objectron [1], nuScenes [6], and Hypersim [27]. Among these datasets, KITTI and nuScenes focus on urban driving scenes, which are real-world outdoor scenarios. SUN-RGBD, ARKitScenes, and Objectron primarily pertain to real-world indoor environments. Compared with outdoor datasets, the required perception ranges of indoor datasets are smaller and the object categories are more diverse. Hypersim, distinct from the aforementioned five datasets, is a virtually synthesized dataset. Thus, Hypersim allows for the annotation of object classes that are challenging to label in real scenes, such as transparent objects (e.g., windows) and very thin objects (e.g., carpets). The Omni3D dataset comprises a total of 98 object categories and 3 million 3D box annotations, spanning 234,000 images. The evaluation metric is AP3D, which reflects the 3D Intersection over Union (IoU) between 3D box predictions and labels.

Experimental settings. As Omni3D is a large-scale dataset, training models on it necessitates many GPUs. For example, the authors of Cube RCNN ran each experiment with 48 V100s for 4∼5 days. In this work, the experiments in Section 4.1 are performed in the high computing resource setting (the input image resolution is 1280 × 1024, the backbone is ConvNext-Base [20], and all training data is used). Since our computing resources are limited, unless explicitly stated otherwise, the remaining experiments are conducted in the low computing resource setting (the input resolution is 640 × 512, the backbone is DLA34 [37], and a fixed 20% of the training data sampled from all 6 sub-datasets is used).

4.1. Performance Comparison

In this part, we compare the performance of the proposed detector with previous methods. Among them, Cube RCNN is the sole detector that also explores unified detection. BEVFormer [15] and PETR [18] are two popular BEV detectors, and we reimplement them on the Omni3D benchmark to obtain their detection scores. The performance of the other compared detectors is obtained from [5]. All the results are given in Table 1. In addition, we present the detailed detection scores of UniMODE on the various sub-datasets in Omni3D in Table 2.

PH | UBG | SBFP | UDA | AP3D^in | AP3D^out | AP3D | Improvement
  |   |   |   | 10.9% | 14.3% | 12.3% | -
✓ |   |   |   | 13.4% | 22.2% | 15.9% | 3.6%↑
✓ | ✓ |   |   | 14.0% | 23.8% | 16.6% | 0.7%↑
✓ | ✓ | ✓ |   | 13.4% | 23.7% | 16.6% | 0.0%↑
✓ | ✓ | ✓ | ✓ | 14.8% | 24.5% | 17.4% | 0.8%↑

Table 3. Ablation study on the proposed strategies, which verifies the effects of the proposal head (PH), uneven BEV grid (UBG), sparse BEV feature projection (SBFP), and unified domain alignment (UDA). The last column presents the improvement of each row compared with the previous row. AP3D^in and AP3D^out reflect the indoor and outdoor detection performance, respectively. Notably, although SBFP does not boost the detection precision, it reduces the computational cost of the BEV feature projection by 82.6%.

Grid Size (m) | Depth Bin | AP3D^in | AP3D^out | AP3D
1 | Even | 14.8% | 24.5% | 17.4%
1 | Uneven | 12.1% | 22.6% | 15.3%
0.5 | Even | 15.4% | 25.5% | 18.1%
2 | Even | 14.0% | 23.9% | 16.5%

Table 4. Ablation study on the grid size and depth bin split strategy in the uneven BEV grid.

τ | Remove Ratio (%) | AP3D^in | AP3D^out | AP3D
0 | 0.0 | 14.9% | 24.7% | 17.4%
1e-3 | 82.6 | 14.8% | 24.5% | 17.4%
1e-2 | 94.3 | 12.1% | 21.9% | 15.0%
1e-1 | 98.3 | 4.7% | 3.6% | 4.7%

Table 5. Ablation study on τ in the sparse BEV feature projection.

According to the results, we can observe that UniMODE achieves the best results in all metrics. It surpasses the SOTA Cube RCNN by 4.9% given the primary metric AP3D. Besides DLA34, we also try another backbone, ConvNext-Base. This is because previous papers suggest that DLA34 is commonly used in camera-view detectors like Cube RCNN but is not suitable for BEV detectors [13]. Since UniMODE is a BEV detector, only testing the performance of UniMODE with DLA34 is unfair. Thus, we also test UniMODE with ConvNext-Base, and the result suggests that the performance is boosted significantly. Additionally, the speed of UniMODE is also promising. Tested on 1 A100 GPU, the inference speeds of UniMODE under the high and low computing resource settings are 21.41 FPS and 43.48 FPS, respectively.

In addition, it can be observed from Table 1 that BEVFormer and PETR do not converge well in the unified detection setting while behaving promisingly when trained with outdoor datasets. This phenomenon implies the difficulty of unifying indoor and outdoor 3D object detection. Through analysis, we find that BEVFormer obtains poor results when using data of all domains because its convergence is quite unstable, and the loss curve often jumps to a high value during training. PETR does not behave well since it implicitly learns the correspondence relation between 2D pixels and 3D voxels. When the camera parameters remain similar across all samples in a dataset like nuScenes [6], PETR converges smoothly. Nevertheless, when trained on a dataset with dramatically changing camera parameters like Omni3D, PETR becomes much more difficult to train.

4.2. Ablation Studies

Key component designs. We ablate the effectiveness of the proposed strategies in UniMODE, including the proposal head, uneven BEV grid, sparse BEV feature projection, and unified domain alignment. The experimental results are presented in Table 3. Notably, as mentioned before, the experiments in this part are conducted in the low computing resource setting due to the limited computing resources.

According to the results in Table 3, we can observe that all these strategies are very effective. Among them, the proposal head boosts the result by the most significant margin. Specifically, the proposal head enhances the overall detection performance metric AP3D by 3.6%. Meanwhile, the indoor and outdoor detection metrics AP3D^in and AP3D^out are boosted by 2.5% and 7.9%, respectively. As discussed in Section 3.2, the proposal head is quite effective because it stabilizes the convergence process of UniMODE and thus favors detection accuracy. The collapse does not happen after using the proposal head. In addition, although the sparse BEV feature projection strategy does not improve the detection precision, it reduces the projection cost by 82.6%.

Uneven BEV grid. We study the effect of the BEV feature grid size and the depth bin split strategy in the uneven BEV grid design, and the results are presented in Table 4. When the depth bin split is uneven, we split the depth bin range following Eq. 1. Comparing the 1st and 2nd rows of results in Table 4, we can find that the uneven depth bin deteriorates detection performance. We speculate this is because this strategy projects more points to closer BEV grids and fewer points to farther grids, which further increases the imbalanced distribution of projection features. Additionally, comparing the 1st, 3rd, and 4th rows of results in Table 4, it is observed that smaller BEV grids lead to better performance. We set the BEV grid size to 1 m rather than 0.5 m in all the other experiments due to limited computing resources and the vast training data volume of Omni3D, i.e., if we decreased the size of the BEV grids, the performance of UniMODE could be further boosted compared with the current performance.

Sparse BEV feature projection. As mentioned in Section 3.4, the BEV feature projection process is computationally expensive. To reduce this cost, we propose to remove unimportant projection points. Although this strategy enhances network efficiency significantly, it could deteriorate detection accuracy and convergence stability, which is exactly a trade-off. In this part, we study this trade-off through experiments.

Figure 5. Visualization of detection results on various sub-datasets in Omni3D (panels: ARKitScenes, Hypersim, nuScenes, KITTI, Objectron, SUN-RGBD).

Specifically, as introduced in Section 3.4, we remove unimportant projection points based on a pre-defined hyper-parameter τ. The value of τ is adjusted to analyze how the removed projection point ratio affects performance. The results are reported in Table 5.

It can be observed from Table 5 that when τ is 0, which means no feature is discarded, the best performance across all rows is achieved. When we set τ to 1e−3, about 82.6% of the feature is discarded while the performance of the detector remains very similar to the one with τ = 0. This phenomenon suggests that the discarded feature is unimportant for the final detection accuracy. Then, when we increase τ to 1e−2 and 1e−1, we can find that the corresponding performances drop dramatically. This observation indicates that when we discard more than the superfluous features, the detection precision and even training stability are influenced significantly. Combining all the observations, we set τ to 1e−3 and drop 82.6% of unimportant features in UniMODE, which reduces the computational cost by 82.6% while maintaining performance similar to the one without dropping features.

Effectiveness of DALN. In this experiment, we validate the effectiveness of DALN by comparing the performances of the naive baseline without any domain adaptive strategy, the baseline predicting dynamic parameters with direct regression (DR) [24], and the baseline with DALN (our proposed). All these models are trained using only ARKitScenes and evaluated on ARKitScenes (in-domain) and SUN-RGBD (out-of-domain) separately. The results are presented in Table 6. It can be observed that DR could degrade the detection accuracy while DALN boosts the performance significantly, which reveals the zero-shot out-of-domain effectiveness of DALN.

Method | AP3D^ark | AP3D^sun
None | 33.6% | 12.3%
DR | 33.9% | 12.1%
DALN | 35.0% | 13.0%

Table 6. Analysis of the effectiveness of DALN.

Train | Zero-Shot AP3D^hyp | AP3D^sun | AP3D^ark | δ-Tune AP3D^hyp | AP3D^sun | AP3D^ark
Hypersim | 14.7% | 5.6% | 3.6% | 14.7% | 18.5% | 18.9%
SUN-RGBD | 3.0% | 28.5% | 8.8% | 7.5% | 28.5% | 27.2%
ARKitScenes | 4.2% | 13.0% | 35.0% | 10.4% | 22.8% | 35.0%

Table 7. Cross-domain evaluation on indoor sub-datasets. In this experiment, the detector is first trained on one domain and then tested on the other domains in two settings, zero-shot and δ-tune.

4.3. Cross-domain Evaluation

We evaluate the generalization ability of UniMODE in this part by conducting a cross-domain evaluation. Specifically, we train a detector on one sub-dataset in Omni3D and test the performance of this detector on the other sub-datasets. The experiments are conducted in two settings. In the zero-shot setting, the test domain is completely unseen. In the δ-tune setting, 1% of the training set data from the test domain is used to fine-tune the query FFN in UniMODE for 1 epoch. The experimental results are presented in Table 7.

According to the results in the 2nd ∼ 4th columns of Table 7, we can find that when a detector is trained and validated on the same indoor sub-dataset, its performance is promising. However, when evaluated on another completely unseen sub-dataset, the accuracy is limited.
This is partly because monocular 3D depth estimation is an ill-posed problem. When the training and validation data belong to differing domains, predicting depth accurately is challenging, especially for the virtual dataset Hypersim.

Then, we introduce another testing setting, δ-tuning. In this setting, if a detector is tested on a domain different from the training domain, the query FFN is fine-tuned with 1% of the training data from the test domain. The results of this δ-tuning setting are reported in the 5th ∼ 7th columns of Table 7. We can observe that when fine-tuned with only a handful of data, the performance of UniMODE becomes much more promising. This result suggests the superiority of UniMODE in serving as a foundational model. It can benefit practical applications by incorporating only a little training data from the test domain.

4.4. Visualization

We visualize the detection results of UniMODE on various sub-datasets in Omni3D. The illustrated results are shown in Fig. 5, where UniMODE performs quite well on all the data samples and accurately captures the 3D object bounding boxes under both complex indoor and outdoor scenarios.

Besides, as mentioned before, training instability is the primary challenge in unifying diverse training domains. To explain the meaning of training instability more clearly, we present the loss curves of UniMODE and an unstable case of PETR in Fig. 6. It can be found that there exist a sudden loss boost and continuous gradient collapse in the training of PETR, while UniMODE converges smoothly.

Figure 6. The training loss curves of UniMODE and PETR (loss value versus training iteration). The PETR curve exhibits a sudden loss boost followed by NaN gradients.

5. Conclusion and Limitation

In this work, we have proposed a unified monocular 3D object detector named UniMODE, which contains several well-designed techniques to address many challenges observed in unified 3D object detection. The proposed detector has achieved SOTA performance on the Omni3D benchmark and presented high efficiency. Extensive experiments are conducted to verify the effectiveness of the proposed techniques. The limitation of the detector is that its zero-shot generalization ability on unseen data scenarios is still limited. In the future, we will continue to study how to boost the zero-shot generalization ability of UniMODE through strategies like scaling up the training data.

References

[1] Adel Ahmadyan, Liangkai Zhang, Artsiom Ablavatski, Jianing Wei, and Matthias Grundmann. Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. In CVPR, pages 7822–7831, 2021.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[3] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Yuri Feigin, Peter Fu, Thomas Gebauer, Daniel Kurz, Tal Dimry, Brandon Joffe, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. In NeurIPS, 2021.
[4] Garrick Brazil and Xiaoming Liu. M3d-rpn: Monocular 3d region proposal network for object detection. In ICCV, pages 9287–9296, 2019.
[5] Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson, and Georgia Gkioxari. Omni3d: A large benchmark and model for 3d object detection in the wild. In CVPR, pages 13154–13164, 2023.
[6] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, pages 11621–11631, 2020.
[7] Laurene Claussmann, Marc Revilloud, Dominique Gruyer, and Sébastien Glaser. A review of motion planning for highway autonomous driving. IEEE Transactions on Intelligent Transportation Systems, 21(5):1826–1848, 2019.
[8] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The kitti vision benchmark suite. In CVPR, pages 3354–3361, 2012.
[9] Peixuan Li and Huaici Zhao. Monocular 3d detection with geometric constraint embedding and semi-supervised training. IEEE Robotics and Automation Letters, 6(3):5565–5572, 2021.
[10] Yingyan Li, Yuntao Chen, Jiawei He, and Zhaoxiang Zhang. Densely constrained depth estimator for monocular 3d object detection. In ECCV, pages 718–734, 2022.
[11] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In AAAI, pages 1477–1485, 2023.
[12] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In AAAI, pages 1477–1485, 2023.
[13] Zhuoling Li, Zhan Qu, Yang Zhou, Jianzhuang Liu, Haoqian Wang, and Lihui Jiang. Diversity matters: Fully exploiting depth clues for reliable monocular 3d object detection. In CVPR, pages 2791–2800, 2022.
[14] Zhuoling Li, Haohan Wang, Tymosteusz Swistek, En Yu, and Haoqian Wang. Efficient few-shot classification via contrastive pre-training on web data. IEEE Transactions on Artificial Intelligence, 2022.
[15] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer:
Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, pages 1–18, 2022.
[16] Zhuoling Li, Chunrui Han, Zheng Ge, Jinrong Yang, En Yu, Haoqian Wang, Hengshuang Zhao, and Xiangyu Zhang. Grouplane: End-to-end 3d lane detection with channel-wise grouping. arXiv preprint arXiv:2307.09472, 2023.
[17] Zhuoling Li, Chuanrui Zhang, Wei-Chiu Ma, Yipin Zhou, Linyan Huang, Haoqian Wang, SerNam Lim, and Hengshuang Zhao. Voxelformer: Bird's-eye-view feature generation based on dual-view attention for multi-view 3d object detection. arXiv preprint arXiv:2304.01054, 2023.
[18] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. In ECCV, pages 531–548, 2022.
[19] Zechen Liu, Zizhang Wu, and Roland Tóth. Smoke: Single-stage monocular 3d object detection via keypoint estimation. In CVPR Workshops, pages 996–997, 2020.
[20] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, pages 11976–11986, 2022.
[21] Yan Lu, Xinzhu Ma, Lei Yang, Tianzhu Zhang, Yating Liu, Qi Chu, Junjie Yan, and Wanli Ouyang. Geometry uncertainty projection network for monocular 3d object detection. In ICCV, pages 3111–3121, 2021.
[22] Jiageng Mao, Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. 3d object detection for autonomous driving: A review and new outlooks. arXiv preprint arXiv:2206.09474, 2022.
[23] Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is pseudo-lidar needed for monocular 3d object detection? In ICCV, pages 3142–3152, 2021.
[24] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, pages 2337–2346, 2019.
[25] Liang Peng, Xiaopei Wu, Zheng Yang, Haifeng Liu, and Deng Cai. Did-m3d: Decoupling instance depth for monocular 3d object detection. In ECCV, pages 71–88, 2022.
[26] Cody Reading, Ali Harakeh, Julia Chae, and Steven L Waslander. Categorical depth distribution network for monocular 3d object detection. In CVPR, pages 8555–8564, 2021.
[27] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In ICCV, pages 10912–10922, 2021.
[28] Danila Rukhovich, Anna Vorontsova, and Anton Konushin. Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection. In WACV, pages 2397–2406, 2022.
[29] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In CVPR, pages 567–576, 2015.
[30] Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew Walter, Ashis Banerjee, Seth Teller, and Nicholas Roy. Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI, pages 1507–1514, 2011.
[31] Haiyang Wang, Shaocong Dong, Shaoshuai Shi, Aoxue Li, Jianan Li, Zhenguo Li, Liwei Wang, et al. Cagroup3d: Class-aware grouping for 3d object detection on point clouds. NeurIPS, 35:29975–29988, 2022.
[32] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In ICCV, pages 913–922, 2021.
[33] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Probabilistic and geometric depth: Detecting objects in perspective. In CoRL, pages 1475–1485, 2022.
[34] Xudong Wang, Zhaowei Cai, Dashan Gao, and Nuno Vasconcelos. Towards universal object detection by domain attention. In CVPR, pages 7289–7298, 2019.
[35] Zhenyu Wang, Ya-Li Li, Xi Chen, Hengshuang Zhao, and Shengjin Wang. Uni3detr: Unified 3d detection transformer. In NeurIPS, 2023.
[36] Xiaoyang Wu, Zhuotao Tian, Xin Wen, Bohao Peng, Xihui Liu, Kaicheng Yu, and Hengshuang Zhao. Towards large-scale 3d representation learning with multi-dataset point prompt training. arXiv preprint arXiv:2308.09718, 2023.
[37] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In CVPR, pages 2403–2412, 2018.
[38] Yunpeng Zhang, Jiwen Lu, and Jie Zhou. Objects are different: Flexible monocular 3d object detection. In CVPR, pages 3289–3298, 2021.
[39] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[40] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Simple multi-dataset detection. In CVPR, pages 7571–7580, 2022.
[41] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. In ICLR, 2020.
