
ODIN: A Single Model for 2D and 3D Segmentation

Ayush Jain1, Pushkal Katara1, Nikolaos Gkanatsios1, Adam W. Harley2,
Gabriel Sarch1, Kriti Aggarwal3, Vishrav Chaudhary3, Katerina Fragkiadaki1
1 Carnegie Mellon University, 2 Stanford University, 3 Microsoft
{ayushj2, pkatara, ngkanats, gsarch, kfragki2}@andrew.cmu.edu
aharley@cs.stanford.edu, {kragga, vchaudhary}@microsoft.com

arXiv:2401.02416v3 [cs.CV] 25 Jun 2024

Abstract

State-of-the-art models on contemporary 3D segmentation benchmarks like ScanNet consume and label dataset-provided 3D point clouds, obtained through post-processing of sensed multiview RGB-D images. They are typically trained in-domain, forego large-scale 2D pre-training and outperform alternatives that featurize the posed RGB-D multiview images instead. The gap in performance between methods that consume posed images versus post-processed 3D point clouds has fueled the belief that 2D and 3D perception require distinct model architectures. In this paper, we challenge this view and propose ODIN (Omni-Dimensional INstance segmentation), a model that can segment and label both 2D RGB images and 3D point clouds, using a transformer architecture that alternates between 2D within-view and 3D cross-view information fusion. Our model differentiates 2D and 3D feature operations through the positional encodings of the tokens involved, which capture pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art performance on the ScanNet200, Matterport3D and AI2THOR 3D instance segmentation benchmarks, and competitive performance on ScanNet, S3DIS and COCO. It outperforms all previous works by a wide margin when the sensed 3D point cloud is used in place of the point cloud sampled from the 3D mesh. When used as the 3D perception engine in an instructable embodied agent architecture, it sets a new state-of-the-art on the TEACh action-from-dialogue benchmark. Our code and checkpoints can be found at the project website https://odin-seg.github.io.

1. Introduction

There has been a surge of interest in porting 2D foundational image features to 3D scene understanding [8, 14, 21, 23, 37, 40, 46-48]. Some methods lift pre-trained 2D image features using sensed depth to 3D feature clouds [8, 37, 40, 47]. Others distill 2D backbones to differentiable parametric 3D models, e.g., NeRFs, by training them per scene to render 2D feature maps of pre-trained backbones [23, 46]. Despite this effort, and despite the ever-growing power of 2D backbones [4, 53], the state-of-the-art on established 3D segmentation benchmarks such as ScanNet [6] and ScanNet200 [41] still consists of models that operate directly in 3D, without any 2D pre-training stage [28, 44]. Given the obvious power of 2D pre-training, why is it so difficult to yield improvements in these 3D tasks?

We observe that part of the issue lies in a key implementation detail underlying these 3D benchmark evaluations. Benchmarks like ScanNet do not actually ask methods to use RGB-D images as input, even though this is the sensor data. Instead, these benchmarks first register all RGB-D frames into a single colored point cloud and reconstruct the scene as cleanly as possible, relying on manually tuned stages for bundle adjustment, outlier rejection and meshing, and ask models to label the output reconstruction. While it is certainly viable to scan and reconstruct a room before labelling any of the objects inside, this pipeline is perhaps inconsistent with the goals of embodied vision (and typical 2D vision), which involve dealing with actual sensor data and accounting for missing or partial observations. We therefore hypothesize that method rankings will change, and the impact of 2D pre-training will become evident, if we force the 3D models to take posed RGB-D frames as input rather than pre-computed mesh reconstructions. Our revised evaluation setting also opens the door to new methods, which can train and perform inference in either single-view or multi-view settings, with either RGB or RGB-D sensors.

We propose Omni-Dimensional INstance segmentation (ODIN)†, a model for 2D and 3D object segmentation and labelling that can parse single-view RGB images and/or multiview posed RGB-D images. As shown in Fig. 1, ODIN alternates between 2D and 3D stages in its architecture, fusing information in 2D within each image view, and in 3D across posed image views.

† The Norse god Odin sacrificed one of his eyes for wisdom, trading one mode of perception for a more important one. Our approach sacrifices perception on post-processed meshes for perception on posed RGB-D images.
Figure 1. Omni-Dimensional INstance segmentation (ODIN) is a model that can parse either a single RGB image or a multiview posed
RGB-D sequence into 2D or 3D labelled object segments respectively. Left: Given a posed RGB-D sequence as input, ODIN alternates
between a within-view 2D fusion and a cross-view 3D fusion. When the input is a single RGB image, the 3D fusion layers are skipped.
ODIN shares the majority of its parameters across both RGB and RGB-D inputs, enabling the use of pre-trained 2D backbones. Right: At
each 2D-to-3D transition, ODIN unprojects 2D feature tokens to their 3D locations using sensed depth and camera intrinsics and extrinsics.

At each 2D-to-3D transition, it unprojects 2D tokens to their 3D locations using the depth maps and camera parameters, and at each 3D-to-2D transition, it projects 3D tokens back to their image locations. Our model differentiates between 2D and 3D features through the positional encodings of the tokens involved, which capture pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens. When dealing with 2D single-view input, our architecture simply skips the 3D layers and makes a forward pass with the 2D layers alone.

We test ODIN on 2D and 3D instance segmentation and 3D semantic segmentation, on the 2D COCO object segmentation benchmark and the 3D benchmarks of ScanNet [6], ScanNet200 [41], Matterport3D [2], S3DIS [1] and AI2THOR [7, 25]. When compared to methods using the pre-computed mesh point cloud as input, our approach performs slightly worse than the state-of-the-art on ScanNet and S3DIS, but better on ScanNet200 and Matterport3D. When using real sensor data as input, with poses obtained from bundle reconstruction for all methods, our method performs even better, outperforming all prior work by a wide margin, on all datasets. We demonstrate that our model's ability to jointly train on 3D and 2D datasets results in a performance increase on 3D benchmarks, and also yields competitive segmentation accuracy on the 2D COCO benchmark. Our ablations show that interleaving 2D and 3D fusion operations outperforms designs where we first process in 2D and then move to 3D, or simply paint 3D points with 2D features. Stepping toward our broader goal of embodied vision, we also deploy ODIN as the 3D object segmentor of a SOTA embodied agent model [42] on the simulation benchmark TEACh [36], in the setup with access to RGB-D and pose information from the simulator, and demonstrate that our model sets a new state-of-the-art. We make our code publicly available at https://odin-seg.github.io.

2. Related Work

3D Instance Segmentation: Early methods in 3D instance segmentation [3, 15, 22, 30, 49, 58] group their semantic segmentation outputs into individual instances. Recently, Mask2Former [4] achieved state-of-the-art in 2D instance segmentation by instantiating object queries, each directly predicting an instance segmentation mask by doing a dot-product with the feature map of the input image. Inspired by it, Mask3D [44] abandons the grouping strategy of prior 3D models to use the simple decoder head of Mask2Former. MAFT [28] and QueryFormer [34] improve over Mask3D by incorporating better query initialization strategies and/or relative positional embeddings. While this shift to Mask2Former-like architectures brought the 3D instance segmentation architectures closer to their 2D counterparts, the inputs and backbones remain very different: 2D models use pre-trained backbones [16, 33], while 3D methods [44] operate over point clouds and use sparse-convolution-based backbones [5], trained from scratch on small-scale 3D datasets. In this work, we propose to directly use RGB-D input and design architectures that can leverage strong 2D backbones to achieve strong performance on 3D benchmarks.
3D Datasets and Benchmarks: Most 3D models primarily operate on point clouds, avoiding the use of image-based features partly due to the design of popular benchmarks. These benchmarks generate point clouds by processing raw RGB-D sensor data, involving manual and noisy steps that result in misalignments between the reconstructed point cloud and the sensor data. For instance, ScanNet [6] undergoes complex mesh reconstruction steps, including bundle reconstruction, implicit TSDF representation fitting, marching cubes, merging and deleting noisy mesh vertices, and finally manual removal of mesh reconstructions with high misalignments. Misalignments introduced by the mesh reconstruction process can cause methods processing sensor data directly to underperform compared to those trained and tested on the provided point clouds. Additionally, some datasets, like HM3D [54], lack access to raw RGB-D data. While mesh reconstruction has its applications, many real-time applications need to process sensor data directly.

2D-based 3D segmentation: Unlike the instance segmentation literature, several approaches for semantic segmentation, like MVPNet [20], BPNet [17] and DeepViewAgg [40], utilize the sensor point cloud directly instead of the mesh-sampled point cloud. Virtual Multiview Fusion [26] forgoes sensor RGB-D images in favour of rendering RGB-D images from the provided mesh to fight misalignments and the low field-of-view of ScanNet images. Similar to our approach, BPNet and DeepViewAgg integrate 2D-3D information at various feature scales and initialize their 2D streams with pre-trained features. Specifically, they employ separate 2D and 3D U-Nets for processing the respective modalities and fuse features from the two streams through a connection module. Rather than employing distinct streams for featurizing raw data, our architecture instantiates a single unified U-Net which interleaves 2D and 3D layers and can handle both 2D and 3D perception tasks. Notably, while these works focus solely on semantic segmentation, our single architecture excels in both semantic and instance segmentation tasks.

Recent advancements in 2D foundation models [24, 39] have spurred efforts to apply them to 3D tasks such as point cloud classification [38, 52, 56], zero-shot 3D semantic segmentation [14, 21, 37] and, more recently, zero-shot instance segmentation [47]. Commonly, these methods leverage 2D foundation models to featurize RGB images, project 3D point clouds onto these images, employ occlusion reasoning using depth, and integrate features from all views through simple techniques like mean-pooling. Notably, these approaches predominantly focus on semantic segmentation, emphasizing pixel-wise labeling, rather than instance labeling, which necessitates cross-view reasoning to associate the same object instance across multiple views. OpenMask3D [47] is the only method that we are aware of that attempts 3D instance segmentation using 2D foundation models, by training a class-agnostic 3D object segmentor on 3D point clouds and labelling it utilizing CLIP features. Despite their effectiveness in a zero-shot setting, such methods generally lag behind SOTA 3D supervised methods by 15-20%. Rather than relying on features from foundation models, certain works [10, 12] create 3D pseudo-labels using pre-trained 2D models. Another line of work involves fitting Neural Radiance Fields (NeRFs), incorporating features from CLIP [23, 48] or per-view instance segmentations from state-of-the-art 2D segmentors [46]. These approaches require expensive per-scene optimization that prohibits testing on all test scenes to compare against SOTA 3D discriminative models. Instead of repurposing 2D foundation models for 3D tasks, Omnivore [13] proposes to build a unified architecture that can handle multiple visual modalities like images, videos and single-view RGB-D images, but they only show results for classification tasks. We similarly propose a single unified model capable of performing both single-view 2D and multi-view 3D instance and semantic segmentation tasks while utilizing pre-trained weights for the majority of our architecture.

3. Method

ODIN's architecture is shown in Fig. 2. It takes either a single RGB image or a set of posed RGB-D images (i.e., RGB images associated with depth maps and camera parameters) and outputs the corresponding 2D or 3D instance segmentation masks and their semantic labels. To achieve this, ODIN alternates between a 2D within-view fusion and a 3D attention-based cross-view fusion, illustrated as blue and yellow blocks in Fig. 2. A segmentation decoding head predicts instance masks and semantic labels. Notably, ODIN shares the majority of its parameters across both RGB and multiview RGB-D inputs. We detail the components of our architecture below.

Within-view 2D fusion: We start from a 2D backbone, such as ResNet50 [16] or Swin Transformer [33], pre-trained for 2D COCO instance segmentation following Mask2Former [4], a state-of-the-art 2D segmentation model. When only a single RGB image is available, we pass it through the full backbone to obtain 2D features at multiple scales. When a posed RGB-D sequence is available, this 2D processing is interleaved with 3D stages, described next. By interleaving within-view and cross-view contextualization, we are able to utilize the pre-trained features from the 2D backbone while also fusing features across views, making them 3D-consistent.

Cross-view 3D fusion: The goal of cross-view fusion is to make the individual images' representations consistent across views. As we show in our ablations, cross-view feature consistency is essential for 3D instance segmentation: it enables the segmentation head to realize that a 3D object observed from multiple views is indeed a single instance, rather than a separate instance in each viewpoint.
Figure 2. ODIN Architecture: The input to our model is either a single RGB image or a multiview RGB-D posed sequence. We feed
them to ODIN’s backbone which interleaves 2D within-view fusion layers and 3D cross-view attention layers to extract feature maps of
different resolutions (scales). These feature maps exchange information through a multi-scale attention operation. Additional 3D fusion
layers are used to improve multiview consistency. Then, a mask decoder head is used to initialize and refine learnable slots that attend to
the multi-scale feature maps and predict object segments (masks and semantic classes).

1. 2D-to-3D Unprojection: We unproject each 2D feature map to 3D by lifting each feature vector to a corresponding 3D location, using nearest-neighbor depth and known camera intrinsic and extrinsic parameters, via a pinhole camera model. Subsequently, the resulting featurized point cloud undergoes voxelization, where the 3D space is discretized into a volumetric grid. Within each occupied grid cell (voxel), the features and XYZ coordinates are mean-pooled to derive new sets of 3D feature tokens and their respective 3D locations.
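As an illustration, the following is a minimal PyTorch sketch of this 2D-to-3D lifting and voxel mean-pooling step for a single view. It assumes the feature map has already been resized to the depth resolution; the function name, voxel size and handling of invalid depth are illustrative assumptions rather than details taken from the released implementation.

```python
import torch

def unproject_and_voxelize(feats, depth, K, cam_to_world, voxel_size=0.04):
    """Lift one view's 2D feature map to a voxelized 3D feature cloud (sketch).

    feats:        (H, W, C) per-pixel features, already at the depth resolution
    depth:        (H, W) metric depth
    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) camera extrinsics
    Returns mean-pooled voxel features (M, C) and voxel XYZ centroids (M, 3).
    """
    H, W, C = feats.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth.reshape(-1)
    # Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = torch.stack([x, y, z, torch.ones_like(z)], dim=-1)      # (H*W, 4)
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]                   # (H*W, 3)

    valid = z > 0                                                     # drop pixels with missing depth
    pts_world, flat_feats = pts_world[valid], feats.reshape(-1, C)[valid]

    # Voxelize: quantize coordinates and mean-pool features/coordinates per occupied voxel.
    vox = torch.floor(pts_world / voxel_size).long()
    uniq, inverse = torch.unique(vox, dim=0, return_inverse=True)
    M = uniq.shape[0]
    pooled_feats = torch.zeros(M, C).index_add_(0, inverse, flat_feats)
    pooled_xyz = torch.zeros(M, 3).index_add_(0, inverse, pts_world)
    counts = torch.zeros(M).index_add_(0, inverse, torch.ones_like(z[valid])).unsqueeze(-1)
    return pooled_feats / counts, pooled_xyz / counts
```

In the multiview case, the per-view featurized points would be concatenated before voxelization, so that points from different views observing the same surface fall into shared voxels.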
2. 3D k-NN Transformer with Relative Positions: We fuse information across 3D tokens using k-nearest-neighbor attention with relative 3D positional embeddings. This is similar to Point Transformers [51, 57], but we simply use vanilla cross-attention instead of the vector attention proposed in those works. Specifically, in our approach, each 3D token attends to its k nearest neighbors. The positional embeddings in this operation are relative to the query token's location. We achieve this by encoding the distance vector between a token and its neighbour with an MLP. The positional embedding for the query is simply the encoding of the zero vector. We therefore have

    query_pos = MLP(0);    key_pos = MLP(p_i - p_j),        (2)

where p_i represents the 3D tokens, shaped N × 1 × 3, and p_j represents the k nearest neighbors of each p_i, shaped N × k × 3. In this way, the attention operation is invariant to the absolute coordinates of the 3D tokens and only depends on their relative spatial arrangements. While each 3D token always attends to the same k neighbors, its effective receptive field grows across layers, as the neighbors' features get updated when they perform their own attention [11].
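A compact sketch of this k-nearest-neighbor attention with the relative positional embeddings of Eq. (2) is given below, assuming brute-force neighbor search and a standard multi-head attention layer. The MLP design, head count and residual/normalization placement are assumptions for illustration, not ODIN's exact configuration.

```python
import torch
import torch.nn as nn

class KNNRelPosAttention(nn.Module):
    """One cross-attention layer in which every 3D token attends to its k nearest
    neighbours, with query_pos = MLP(0) and key_pos = MLP(p_i - p_j) as in Eq. (2)."""

    def __init__(self, dim, k=8, num_heads=4):   # dim must be divisible by num_heads
        super().__init__()
        self.k = k
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats, xyz):
        # feats: (N, dim) voxel features, xyz: (N, 3) voxel coordinates.
        N, dim = feats.shape
        # Brute-force k-NN for clarity; the nearest neighbour of each token is itself.
        nbr_idx = torch.cdist(xyz, xyz).topk(self.k, largest=False).indices      # (N, k)
        rel = xyz[:, None, :] - xyz[nbr_idx]                                      # (N, k, 3) = p_i - p_j
        q = feats[:, None, :] + self.pos_mlp(torch.zeros(N, 1, 3, device=feats.device))
        kv = feats[nbr_idx] + self.pos_mlp(rel)                                   # keys carry relative offsets
        out, _ = self.attn(q, kv, feats[nbr_idx])                                 # vanilla cross-attention
        return self.norm(feats + out.squeeze(1))                                  # residual connection
```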
3. 3D-to-2D Projection: After contextualizing the tokens in 3D, we project the features back to their original 2D locations. We first copy the feature of each voxel to all points within that voxel. We then reshape these points back into multiview 2D feature maps, so that they may be processed by the next 2D module. The feature vectors are unchanged in this transition; the difference lies in their interpretation and shape. In 2D the features are shaped V × H × W × F, representing a feature map for each viewpoint, and in 3D they are shaped N × F, representing a unified feature cloud, where N = V · H · W.

Cross-scale fusion and upsampling: After multiple single-view and cross-view stages, we have access to multiple feature maps per image, at different resolutions. We merge these with the help of deformable 2D attention, akin to Mask2Former [4], operating on the three lowest-resolution scales (1/32, 1/16, 1/8). When we have 3D input, we apply an additional 3D fusion layer at each scale after the deformable attention, to restore 3D consistency. Finally, we use a simple upsampling layer on the 1/8 resolution feature map to bring it to 1/4 resolution and add it, with a skip connection, to the 1/4 feature map from the backbone.
Sensor depth to mesh point cloud feature transfer: For 3D benchmarks like ScanNet [6] and ScanNet200 [41], the objective is to label a point cloud derived from a mesh rather than from the sensor depth maps. Hence, on those benchmarks, instead of upsampling the 1/8 resolution feature map to 1/4, we trilinearly interpolate features from the 1/8 resolution feature map to the provided point cloud sampled from the mesh. This means that for each vertex in the mesh, we trilinearly interpolate from our computed 3D features to obtain interpolated features. We similarly interpolate from the unprojected 1/4 resolution feature map of the backbone to obtain an additive skip connection.

Shared 2D-3D segmentation mask decoder: Our segmentation decoder is a Transformer, similar to Mask2Former's decoder head, which takes as input the upsampled 2D or 3D feature maps and outputs corresponding 2D or 3D segmentation masks and their semantic classes. Specifically, we instantiate a set of N learnable object queries responsible for decoding individual instances. These queries are iteratively refined by a Query Refinement block, which consists of cross-attention to the upsampled features, followed by self-attention between the queries. Except for the positional embeddings, all attention and query weights are shared between 2D and 3D. We use Fourier positional encodings in 2D, while in 3D we encode the XYZ coordinates of the 3D tokens with an MLP. The refined queries are used to predict instance masks and semantic classes. For mask prediction, the queries do a token-wise dot product with the highest-resolution upsampled features. For semantic class prediction, we use an MLP over the queries, mapping them to class logits. We refer readers to Mask2Former [4] for further details.

Open vocabulary class decoder: Drawing inspiration from prior open-vocabulary detection methods [19, 29, 61], we introduce an alternative classification head capable of handling an arbitrary number of semantic classes. This modification is essential for joint training on multiple datasets. Similar to BUTD-DETR [19] and GLIP [29], we supply the model with a detection prompt formed by concatenating object categories into a sentence (e.g., "Chair. Table. Sofa.") and encode it using RoBERTa [32]. In the query-refinement block, queries additionally attend to these text tokens before attending to the upsampled feature maps. For semantic class prediction, we first perform a dot-product operation between queries and language tokens, generating one logit per token in the detection prompt. The logits corresponding to the prompt tokens of a specific object class are then averaged to derive per-class logits. This can handle multi-word noun phrases such as "shower curtain", where we average the logits corresponding to "shower" and "curtain". The segmentation masks are predicted by a pixel-/point-wise dot-product, in the same fashion as described earlier.
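To make the two classification modes concrete, the sketch below shows mask prediction as a query-feature dot product together with both the closed-vocabulary MLP head and the open-vocabulary prompt-token averaging. The module name, the extra no-object class (following the Mask2Former convention) and the class_token_spans bookkeeping input are illustrative assumptions, not ODIN's actual interface.

```python
import torch
import torch.nn as nn

class SharedSegmentationDecoder(nn.Module):
    """Minimal sketch of the decoding step shared between 2D and 3D: refined object
    queries predict masks via a dot product with the highest-resolution features
    (per-pixel in 2D, per-point in 3D), and class scores either through a small MLP
    (closed vocabulary) or against encoded prompt tokens (open vocabulary)."""

    def __init__(self, dim, num_classes):
        super().__init__()
        self.mask_proj = nn.Linear(dim, dim)
        # Closed-vocabulary head; "+ 1" adds a no-object class, as in Mask2Former (assumption).
        self.class_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                       nn.Linear(dim, num_classes + 1))

    def forward(self, queries, feats, text_tokens=None, class_token_spans=None):
        # queries: (Q, dim), feats: (P, dim) pixels or points, text_tokens: (T, dim) or None
        mask_logits = self.mask_proj(queries) @ feats.T          # (Q, P) segmentation heatmaps
        if text_tokens is None:
            return mask_logits, self.class_mlp(queries)          # (Q, num_classes + 1)
        # Open vocabulary: one logit per prompt token, averaged over each class's token span,
        # so multi-word phrases such as "shower curtain" average their word logits.
        token_logits = queries @ text_tokens.T                   # (Q, T)
        class_logits = torch.stack(
            [token_logits[:, s:e].mean(dim=1) for s, e in class_token_spans], dim=1)
        return mask_logits, class_logits                         # (Q, P), (Q, num_classes)
```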
Implementation details: We initialize our model with pre-trained weights from Mask2Former [4] trained on COCO [31]. Subsequently, we train all parameters end-to-end, including both the pre-trained parameters and the new parameters of the 3D fusion layers. During training in 3D scenes, our model processes a sequence of N consecutive frames, usually comprising 25 frames. At test time, we input all images in the scene to our model, with an average of 90 images per scene in ScanNet. We use the vanilla closed-vocabulary decoding head for all experiments except when training jointly on 2D and 3D datasets, where we use our open-vocabulary class decoder to handle the different label spaces of these datasets. Training continues until convergence on 2 NVIDIA A100s with 40 GB VRAM, with an effective batch size of 6 in 3D and 16 in 2D. For joint training on 2D and 3D datasets, we alternate sampling 2D and 3D batches with batch sizes of 3 and 8 per GPU, respectively. We adopt Mask2Former's strategy of Hungarian matching between queries and ground-truth instances, and its supervision losses. While our model is only trained for instance segmentation, it can perform semantic segmentation for free at test time, like Mask2Former. We refer to Mask2Former [4] for more details.

4. Experiments

4.1. Evaluation on 3D benchmarks

Datasets: First, we test our model on 3D instance and semantic segmentation in the ScanNet [6] and ScanNet200 [41] benchmarks. The goal of these benchmarks is to label the point cloud extracted from the 3D mesh of a scene reconstructed from raw sensor data. ScanNet evaluates on 20 common semantic classes, while ScanNet200 uses 200 classes, which is more representative of the long-tailed object distribution encountered in the real world. We report results on the official validation split of these datasets here, and on the official test split in the supplementary.

Evaluation metrics: We follow the standard evaluation metrics, namely mean Average Precision (mAP) for instance segmentation and mean Intersection over Union (mIoU) for semantic segmentation.

Baselines: In instance segmentation, our main baseline is the SOTA 3D method Mask3D [44]. For a thorough comparison, we train both Mask3D and our model with sensor RGB-D point cloud input and evaluate them on the benchmark-provided mesh-sampled point clouds. We also compare with the following recent and concurrent works: PBNet [58], QueryFormer [34] and MAFT [28]. QueryFormer and MAFT explore query initialization and refinement in a Mask3D-like architecture and thus have complementary advantages to ours. Unlike ODIN, these methods directly process 3D point clouds and initialize their weights from scratch.
Table 1. Evaluation on 3D Benchmarks (§ = trained by us using the official codebase).

(a) ScanNet Instance Segmentation Task.

Input                        Model                        mAP    mAP50   mAP25
Sensor RGB-D Point Cloud     Mask3D [44]§                 43.9   60.0    69.9
                             ODIN-ResNet50 (Ours)         47.8   69.8    83.6
                             ODIN-Swin-B (Ours)           50.0   71.0    83.6
Mesh Sampled Point Cloud     SoftGroup [49]               46.0   67.6    78.9
                             PBNet [58]                   54.3   70.5    78.9
                             Mask3D [44]                  55.2   73.7    83.5
                             QueryFormer [34]             56.5   74.2    83.3
                             MAFT [28]                    58.4   75.9    -

(b) ScanNet Semantic Segmentation Task.

Input                        Model                         mIoU
Sensor RGB-D Point Cloud     MVPNet [20]                   68.3
                             BPNet [17]                    69.7
                             DeepViewAgg [40]              71.0
                             ODIN-ResNet50 (Ours)          73.3
                             ODIN-Swin-B (Ours)            77.8
Rendered RGB-D Point Cloud   VMVF [26]                     76.4
Mesh Sampled Point Cloud     Point Transformer v2 [51]     75.4
                             Stratified Transformer [27]   74.3
                             OctFormer [50]                75.7
                             Swin3D-L [55]                 76.7
Zero-Shot                    OpenScene [37]                54.2

(c) ScanNet200 Instance Segmentation Task.

Input                        Model                        mAP    mAP50   mAP25
Sensor RGB-D Point Cloud     Mask3D [44]§                 15.5   21.4    24.3
                             ODIN-ResNet50 (Ours)         25.6   36.9    43.8
                             ODIN-Swin-B (Ours)           31.5   45.3    53.1
Mesh Sampled Point Cloud     Mask3D [44]                  27.4   37.0    42.3
                             QueryFormer [34]             28.1   37.1    43.4
                             MAFT [28]                    29.2   38.2    43.3
Zero-Shot                    OpenMask3D [47]              15.4   19.9    23.1

(d) ScanNet200 Semantic Segmentation Task.

Input                        Model                        mIoU
Sensor RGB-D Point Cloud     ODIN-ResNet50 (Ours)         35.8
                             ODIN-Swin-B (Ours)           40.5
Mesh Sampled Point Cloud     LGround [41]                 28.9
                             CeCo [60]                    32.0
                             OctFormer [50]               32.6

As motivated before, utilizing RGB-D input directly has several advantages, including avoiding costly mesh building processes, achieving closer integration of 2D and 3D perception, and leveraging pre-trained features and abundant 2D data.

In semantic segmentation, we compare with MVPNet [20], BPNet [17] and the state-of-the-art DeepViewAgg [40], which directly operate on sensor RGB or RGB-D images and point clouds. We also compare with VMVF [26], which operates over RGB-D images rendered from the provided mesh, with heuristics for camera view sampling to avoid occlusions, ensure balanced scene coverage and employ a wider field-of-view, though we note their code is not publicly available. Similar to ODIN, all of these methods utilize 2D pre-trained backbones. We also compare with Point Transformer v2 [51], Stratified Transformer [27], OctFormer [50] and Swin3D-L [55], which process the mesh-sampled point cloud directly, without using any 2D pre-training. On the ScanNet200 semantic segmentation benchmark, we compare with the SOTA OctFormer [50] and with CeCo [60], a method specially designed to fight class imbalance in ScanNet200. These methods directly process the point cloud and do not use 2D image pre-trained weights. We also compare with LGround [41], which uses 2D CLIP pre-training. Finally, we compare with the zero-shot 2D foundation model-based 3D models OpenScene [37] and OpenMask3D [47]. This comparison is unfair since they are not supervised within-domain, but we include them for completeness. The results are presented in Tab. 1. We draw the following conclusions:

Performance drops with sensor point cloud as input (Tab. 1a): Mask3D's performance drops from 55.2% mAP with mesh point cloud input to 43.9% mAP with sensor point cloud input. This is consistent with prior works [26, 40] in 3D semantic segmentation on ScanNet, which attribute the drop to misalignments caused by noise in camera poses, depth variations and post-processing steps.

ODIN outperforms SOTA 3D methods with sensor point cloud input and underperforms them when the baselines use mesh-sampled point clouds (Tab. 1a): Our model significantly outperforms the SOTA Mask3D model with sensor point cloud input, and achieves comparable performance to methods using mesh-sampled point cloud input on the mAP25 metric while lagging far behind on the mAP metric, due to misalignments between the 3D mesh and the sensor point cloud.

ODIN sets a new SOTA in semantic segmentation on ScanNet (Tab. 1b), outperforming all methods on all setups, including the models trained on the sensor, rendered and mesh-sampled point clouds.
Table 2. AI2THOR Semantic and Instance Segmentation.

Model                    mAP    mAP50   mAP25   mIoU
Mask3D [44]              60.6   70.8    76.6    -
ODIN-ResNet50 (Ours)     63.8   73.8    80.2    71.5
ODIN-Swin-B (Ours)       64.3   73.7    78.6    71.4

Table 3. Embodied Instruction Following. SR = success rate, GC = goal-condition success rate.

                          TEACh                      ALFRED
                          Unseen        Seen         Unseen        Seen
Model                     SR    GC      SR    GC     SR    GC      SR    GC
FILM [35]                 -     -       -     -      30.7  42.9    26.6  38.2
HELPER [42]               15.8  14.5    11.6  19.4   37.4  55.0    26.8  41.2
HELPER + ODIN (Ours)      18.6  18.6    13.8  26.6   47.7  61.6    33.5  47.1

Table 4. Joint training on sensor RGB-D point clouds from ScanNet and 2D RGB images from COCO.

                           ScanNet                    COCO
Model                      mAP    mAP50   mAP25      mAP
Mask3D [44]                43.9   60.0    69.9       ✗
Mask2Former [4]            ✗      ✗       ✗          43.7
ODIN (trained in 2D)       ✗      ✗       ✗          43.6
ODIN (trained in 3D)       47.8   69.8    83.6       ✗
ODIN (trained jointly)     49.1   70.1    83.1       41.2

ODIN sets a new instance segmentation SOTA on the long-tailed ScanNet200 dataset (Tab. 1c), outperforming SOTA 3D models on all setups, including the models trained on mesh-sampled point clouds, especially by a large margin on the mAP25 metric, while exclusively utilizing sensor RGB-D data. This highlights the contribution of 2D features, particularly in detecting the long tail of the class distribution where limited 3D data is available. We show more detailed results with performance on the head, common and tail classes in the appendix (Sec. A.3).

ODIN sets a new semantic segmentation SOTA on ScanNet200 (Tab. 1d), outperforming SOTA semantic segmentation models that use mesh point clouds.

4.2. Evaluation on multiview RGB-D in simulation

Using the AI2THOR [25] simulation environment with procedural homes from ProcTHOR [7], we collected RGB-D data for 1500 scenes (1200 training, 300 test) of similar size as ScanNet (more details in the appendix, Sec. B). We train and evaluate our model and the SOTA Mask3D [44] on the unprojected RGB-D images. As shown in Tab. 2, our model outperforms Mask3D by 3.7% mAP, showing strong performance in a directly comparable RGB-D setup. This suggests that current real-world benchmarks may restrain models that featurize RGB-D sensor point clouds due to misalignments. We hope this encourages the community to also focus on directly collecting, labeling, and benchmarking RGB-D sensor data.

4.3. Embodied Instruction Following

We apply ODIN in the embodied setups of TEACh [36] and ALFRED [45], where agents have access to RGB, depth and camera poses and need to infer and execute task and action plans from dialogue segments and instructions, respectively. These agents operate in dynamic home environments and cannot afford expensive mesh building steps. Detecting objects well is critical for task success in both cases. Prior SOTA methods [36, 42] run per-view 2D instance segmentation models [4, 9] and link the detected instances using simple temporal reasoning regarding spatial and appearance proximity. Instead, ODIN processes its last N egocentric views and segments object instances directly in 3D. We equip HELPER [42], a state-of-the-art embodied model, with ODIN as its 3D object detection engine. We evaluate using Task Success Rate (SR), which checks whether the entire task is executed successfully, and Goal-Conditioned Success Rate (GC), which checks the proportion of satisfied subgoals across all episodes [36, 45]. We perform evaluation on the "valid-seen" (houses similar to the training set) and "valid-unseen" (dissimilar houses) splits. In Tab. 3, we observe that HELPER with ODIN as its 3D object detector significantly outperforms HELPER with its original 2D detection plus linking perception pipeline.

4.4. Ablations and Variants

We conduct our ablation experiments on the ScanNet dataset in Tab. 4 and Tab. 5. Our conclusions are:

Joint 2D-3D training helps 3D perception: We compare joint training of ODIN on sensor RGB-D point clouds from ScanNet and 2D RGB images from COCO to variants trained independently on 2D and 3D data, all initialized from pre-trained COCO weights. Since there are different classes in ScanNet and COCO, we use our open-vocabulary semantic class-decoding head instead of the vanilla closed-vocabulary head. Results in Tab. 4 show that joint training yields a 1.3% absolute improvement in 3D, and causes a similar drop in 2D. This experiment indicates that a single architecture can perform well on both 2D and 3D tasks, suggesting that we may not need to design vastly different architectures for either domain. However, the drop in 2D performance indicates potential for further improvements in the architecture design to retain performance in the 2D domain. Nevertheless, this experiment highlights the benefits of joint training with 2D datasets for 3D segmentation in ODIN. Note that we do not jointly train on 2D and 3D datasets in any of our other experiments due to computational constraints.
Table 5. Ablations on the ScanNet Dataset.

(a) Cross-View Contextualization.

Model              mAP    mIoU
ODIN (Ours)        47.8   73.3
No 3D Fusion       39.3   73.2
No interleaving    41.7   73.6

(b) Effect of Pre-Trained Features.

Model                        mAP    mIoU
ODIN (Ours)                  47.8   73.3
Only pre-trained backbone    42.3   72.9
No pre-trained features      41.5   68.6

(c) Effect of Freezing the Backbone.

                         ResNet50          Swin-B
Model                    mAP    mIoU       mAP    mIoU
ODIN (Ours)              47.8   73.3       50.0   77.8
With frozen backbone     46.7   74.3       46.2   75.9
Cross-view fusion is crucial for instance segmentation but not for semantic segmentation (Tab. 5a): removing the 3D cross-view fusion layers results in an 8.5% mAP drop for instance segmentation, without any significant effect on semantic segmentation. Popular 2D-based 3D open-vocabulary works [21, 37] without strong cross-view fusion only focus on semantic segmentation and thus could not uncover this issue. Row 3 shows a 6.1% mAP drop when cross-view 3D fusion happens after all within-view 2D layers instead of interleaving the within-view and cross-view fusion.

2D pre-trained weight initialization helps (Tab. 5b): initializing only the image backbone with pre-trained weights, instead of all layers (except the 3D fusion layers), results in a 5.5% mAP drop (row 2). Starting the entire model from scratch leads to a larger drop of 6.3% mAP (row 3). This underscores the importance of sharing as many parameters as possible with the 2D models to leverage the maximum possible amount of 2D pre-trained weights.

Stronger 2D backbones help (Tab. 5c): using Swin-B over ResNet50 leads to significant performance gains, suggesting that ODIN can directly benefit from advancements in 2D computer vision.

Finetuning everything, including the pre-trained parameters, helps (Tab. 5c): ResNet50's and Swin's performance increases substantially when we fine-tune all parameters. Intuitively, unfreezing the backbone allows the 2D layers to better adapt to cross-view fused features. Thus, we keep our backbone unfrozen in all experiments.

Supplying 2D features directly to 3D models does not help: Concatenating 2D features with XYZ+RGB as input to Mask3D yields 53.8% mAP, comparable to the 53.3% of the baseline model with only XYZ+RGB as input.†

4.5. Additional Experiments

We show evaluations on the hidden test sets of ScanNet and ScanNet200 in Sec. A.1, results and comparisons with baselines on the S3DIS [1] and Matterport3D [2] datasets in Sec. A.2, and performance gains in 2D perception with increasing context views in Sec. A.4.

† We do not use the expensive DBSCAN post-processing of Mask3D, and hence it gets 53.3% mAP instead of the 55.2% reported in their paper.

4.6. Limitations

Our experiments reveal the following limitations of ODIN: Firstly, like other top-performing 3D models, it depends on accurate depth and camera poses. Inaccurate depth or camera poses cause a sharp decrease in performance (similar to other 3D models, like Mask3D). In future work, we aim to explore unifying depth and camera pose estimation with semantic scene parsing, thus making 3D models more resilient to noise. Secondly, in this paper, we limited our scope to exploring the design of a unified architecture, without scaling up 3D learning by training on diverse 2D and 3D datasets jointly. We aim to explore this in the future to achieve strong generalization to in-the-wild scenarios, akin to current foundational 2D perception systems. Our results suggest a competition between 2D and 3D segmentation performance when training ODIN jointly on both modalities. Exploring ways to make 2D and 3D training more synergistic is a promising avenue for future work.

5. Conclusion

We presented ODIN, a model for 2D and 3D instance segmentation that can parse 2D images and 3D point clouds alike. ODIN represents both 2D images and 3D feature clouds as a set of tokens that differ in their positional encodings, which represent 2D pixel coordinates for 2D tokens and 3D XYZ coordinates for 3D tokens. Our model alternates between within-image featurization and cross-view featurization. It achieves SOTA performance on the ScanNet200 and AI2THOR instance segmentation benchmarks, outperforms all methods operating on sensor point clouds, and achieves competitive performance with methods operating over mesh-sampled point clouds. Our experiments show that ODIN outperforms alternative models that simply augment 3D point cloud models with 2D image features, as well as ablative versions of our model that do not alternate between 2D and 3D information fusion, do not co-train across 2D and 3D, and do not pre-train the 2D backbone.
6. Acknowledgements

The authors express gratitude to Wen-Hsuan Chu, Mihir Prabhudesai, and Alexander Swerdlow for their valuable feedback on the early draft of this work. Special thanks to Tsung-Wei Ke for insightful discussions throughout the project. We thank the Microsoft Turing Team for providing us with GPU resources during the initial development phase of this project. This work is supported by ONR award N00014-23-1-2415, Sony AI, DARPA Machine Common Sense, an Amazon faculty award, and an NSF CAREER award.

Appendix A. Experiments

A.1. Evaluations on ScanNet and ScanNet200 Hidden Test Sets

We submit ODIN to the official test benchmarks of ScanNet [6] and ScanNet200 [41]. Following prior works, we train ODIN on a combination of train and validation scenes. Unlike some approaches that employ additional tricks like DBSCAN [44], model ensembling [27], additional specialized augmentations [51], additional pre-training on other datasets [55], finer grid sizes [50] and multiple forward passes through points belonging to the same voxel, our method avoids any such bells and whistles.

The results are shown in Tab. 6. All conclusions drawn from the validation-set results discussed in the main paper apply here. On the ScanNet benchmark, ODIN achieves close to SOTA performance on semantic segmentation and on the mAP25 metric of instance segmentation, while being significantly worse on the mAP metric due to misalignments between the sensor and mesh-sampled point clouds. On the ScanNet200 benchmark, ODIN sets a new SOTA on semantic segmentation and on the mAP50/mAP25 metrics of instance segmentation, while achieving close to SOTA performance on the mAP metric. Notably, ODIN is the first method that operates over sensor RGB-D data for instance segmentation and achieves competitive performance to models operating over mesh-sampled point clouds.

A.2. Evaluation on S3DIS and Matterport3D

We also benchmark ODIN on the Matterport3D [2] and S3DIS [1] datasets.

Matterport3D: Matterport3D comprises 90 building-scale scenes, further divided into individual rooms, with 1554 training rooms and 234 validation rooms. The dataset provides a mapping from each room to the camera IDs that captured images for that room. After discarding 158 training rooms and 18 validation rooms without a valid camera mapping, we are left with 1396 training rooms and 158 validation rooms. For instance segmentation results, we train the state-of-the-art Mask3D [44] model on the same data (reduced set after discarding invalid rooms). For semantic segmentation, we conduct training and testing on the reduced set, while baseline numbers are taken from the OpenScene [37] paper, trained and tested on the original data. Given the small size of the discarded data, we do not anticipate significant performance differences. The official Matterport3D benchmark tests on 21 classes; however, OpenScene also evaluates on 160 classes to compare with state-of-the-art models on long-tail distributions. We follow them and report results in both settings.

S3DIS: S3DIS comprises 6 building-scale scenes, typically divided into 5 for training and 1 for testing. The dataset provides raw RGB-D images, captured panorama images, and images rendered from the mesh obtained after reconstructing the original sensor data. Unlike Matterport3D, S3DIS does not provide undistorted raw images; thus, we use the provided rendered RGB-D images. Some rooms in S3DIS have major misalignments between the RGB-D images and point clouds, which we partially address by incorporating fixes from DeepViewAgg [40] and introducing our own adjustments. Despite these fixes, certain scenes still exhibit significantly low overlap between the RGB-D images and the provided mesh-sampled point cloud. To mitigate this, we query images from other rooms and verify their overlap with the provided point cloud for a room. This partially helps in addressing the low-overlap issue.

The official S3DIS benchmark evaluates 13 classes. Due to the dataset's small size, some models pre-train on additional datasets like ScanNet, as seen in SoftGroup [49], or on the Structured3D dataset [59], consisting of 21,835 rooms, as done by Swin3D-L [55]. Similar to Mask3D [44], we report results in both settings of training from scratch and starting from weights trained on ScanNet.

Like ScanNet and ScanNet200, both S3DIS and Matterport3D undergo post-processing of the collected RGB-D data to construct a mesh, from which a point cloud is sampled and labeled. Hence, we train both Mask3D [44] and our model using RGB-D sensor point cloud data and evaluate on the benchmark-provided point cloud. Additionally, we explore model variants by training and testing them on the mesh-sampled point cloud for comparative analysis.

The results are shown in Tab. 7. We draw the following conclusions:

ODIN outperforms SOTA 3D models on the Matterport3D instance segmentation benchmark across all settings (Tab. 7a).

ODIN sets a new state-of-the-art on the Matterport3D semantic segmentation benchmark (Tab. 7b): Our model achieves superior performance in both the 21 and 160 class settings. It also largely outperforms OpenScene [37] in both settings. OpenScene is a zero-shot method while ODIN is supervised in-domain, making this comparison unfair. However, OpenScene notes that their zero-shot model outperforms fully-supervised models in the 160-class setup, as their model is robust to rare classes, while supervised models can severely suffer in segmenting the long tail. ConceptFusion [21], another open-vocabulary 3D segmentation model, also draws a similar conclusion. With this result, we point to the possibility of supervising in 3D while also being robust to the long tail by simply utilizing strong 2D pre-trained weight initialization.
Table 6. Evaluation on the test sets of established 3D benchmarks.

(a) Comparison on ScanNet for the Instance Segmentation Task.

Input                        Model                        mAP    mAP50   mAP25
Sensor RGB-D Point Cloud     ODIN-Swin-B (Ours)           47.7   71.2    86.2
Mesh Sampled Point Cloud     SoftGroup [49]               50.4   76.1    86.5
                             PBNet [58]                   57.3   74.7    82.5
                             Mask3D [44]                  56.6   78.0    87.0
                             QueryFormer [34]             58.3   78.7    87.4
                             MAFT [28]                    59.6   78.6    86.0

(b) Comparison on ScanNet for the Semantic Segmentation Task.

Input                        Model                         mIoU
Sensor RGB-D Point Cloud     MVPNet [20]                   64.1
                             BPNet [17]                    74.9
                             DeepViewAgg [40]              -
                             ODIN-Swin-B (Ours)            74.4
Rendered RGB-D Point Cloud   VMVF [26]                     74.6
Mesh Sampled Point Cloud     Point Transformer v2 [51]     75.2
                             Stratified Transformer [27]   74.7
                             OctFormer [50]                76.6
                             Swin3D-L [55]                 77.9
Zero-Shot                    OpenScene [37]                -

(c) Comparison on ScanNet200 for the Instance Segmentation Task.

Input                        Model                        mAP    mAP50   mAP25
Sensor RGB-D Point Cloud     ODIN-Swin-B (Ours)           27.2   39.4    47.5
Mesh Sampled Point Cloud     Mask3D [44]                  27.8   38.8    44.5
                             QueryFormer [34]             -      -       -
                             MAFT [28]                    -      -       -
Zero-Shot                    OpenMask3D [47]              -      -       -

(d) Comparison on ScanNet200 for the Semantic Segmentation Task.

Input                        Model                        mIoU
Sensor RGB-D Point Cloud     ODIN-Swin-B (Ours)           36.8
Mesh Sampled Point Cloud     LGround [41]                 27.2
                             CeCo [60]                    34.0
                             OctFormer [50]               32.6

Table 7. Evaluation on the Matterport3D [2] and S3DIS [1] datasets.

(a) Comparison on Matterport3D for the Instance Segmentation Task.

                                                          21 classes       160 classes
Input                        Model                        mAP    mAP25     mAP    mAP25
Sensor RGB-D Point Cloud     Mask3D [44]                  7.2    16.8      2.5    10.9
                             ODIN-ResNet50 (Ours)         22.5   56.4      11.5   27.6
                             ODIN-Swin-B (Ours)           24.7   63.8      14.5   36.8
Mesh Sampled Point Cloud     Mask3D [44]                  22.9   55.9      11.3   23.9

(b) Comparison on Matterport3D for the Semantic Segmentation Task.

                                                          21 classes       160 classes
Input                        Model                        mIoU   mAcc      mIoU   mAcc
Sensor RGB-D Point Cloud     ODIN-ResNet50 (Ours)         54.5   65.8      22.4   28.5
                             ODIN-Swin-B (Ours)           57.3   69.4      28.6   38.2
Mesh Sampled Point Cloud     TextureNet [18]              -      63.0      -      -
                             DCM-Net [43]                 -      67.2      -      -
                             MinkowskiNet [5]             54.2   64.6      -      18.4
Zero-Shot                    OpenScene [37]               42.6   59.2      -      23.1

(c) Comparison on S3DIS Area 5 for the Instance Segmentation Task († = uses additional data).

Input                        Model                        mAP    mAP50   mAP25
RGB-D Point Cloud            Mask3D [44]                  40.7   54.6    64.2
                             Mask3D [44]†                 41.3   55.9    66.1
                             ODIN-ResNet50 (Ours)         36.3   48.0    61.2
                             ODIN-ResNet50† (Ours)        44.7   57.7    67.5
                             ODIN-Swin-B† (Ours)          43.0   56.4    70.0
Mesh Sampled Point Cloud     SoftGroup [49]†              51.6   66.1    -
                             Mask3D [44]                  56.6   68.4    75.2
                             Mask3D [44]†                 57.8   71.9    77.2
                             QueryFormer [34]             57.7   69.9    -
                             MAFT [28]                    -      69.1    75.7

(d) Comparison on S3DIS for the Semantic Segmentation Task († = uses additional data).

Input                        Model                         mIoU
RGB-D Point Cloud            MVPNet [20]                   62.4
                             VMVF [26]                     65.4
                             DeepViewAgg [40]              67.2
                             ODIN-ResNet50 (Ours)          59.7
                             ODIN-ResNet50† (Ours)         66.8
                             ODIN-Swin-B† (Ours)           68.6
Mesh Sampled Point Cloud     Point Transformer v2 [51]     71.6
                             Stratified Transformer [27]   72.0
                             Swin3D-L [55]†                74.5
On the S3DIS instance segmentation benchmark (Tab. 7c), in the setup where the baseline Mask3D starts from a ScanNet pre-trained checkpoint, our model outperforms it in the RGB-D point cloud setup, but obtains lower performance compared to mesh-sampled point cloud methods and in the setup where all models train from scratch.

On the S3DIS semantic segmentation benchmark (Tab. 7d), ODIN trained with ScanNet weight initialization outperforms all RGB-D point cloud based methods, while achieving competitive performance with mesh-sampled point cloud methods. When trained from scratch, it is much worse than the other baselines. Given the limited dataset size of S3DIS, with only 200 training scenes, we observe severe overfitting.

Figure 3. 2D mAP performance variation with increasing number of context views used (y-axis: 2D mAP, 40-70; x-axis: number of views, 1 to All).

A.3. ScanNet200 Detailed Results

ScanNet200 [41] categorizes its 200 object classes into three groups—Head, Common, and Tail—each comprising 66, 68, and 66 categories, respectively. In Tab. 8, we provide a detailed breakdown of the ScanNet200 results across these splits. We observe that, in comparison to the SOTA Mask3D model trained on mesh-sampled point clouds, ODIN achieves lower performance on Head classes and significantly better performance on Common and Tail classes. This highlights the contribution of effectively utilizing 2D pre-trained features, particularly in detecting the long tail of the class distribution where limited 3D data is available.

A.4. Variation of Performance with Number of Views

We examine the influence of the number of views on segmentation performance using the AI2THOR dataset, specifically focusing on the 2D mAP metric. The evaluation is conducted by varying the number of context images surrounding a given query RGB image. Starting from a single view without any context (N = 0), we increment N to 5, 10, 20, 40, 60, and finally consider all images in the scene as context. ODIN takes these N + 1 RGB-D images as input, predicts per-pixel instance segmentation for each image, and assesses the 2D mAP performance on the query image. The results, depicted in Fig. 3, show a continuous increase in 2D mAP with a growing number of views. This observation underscores the advantage of utilizing multiview RGB-D images over single-view RGB images whenever feasible.

A.5. Inference Time

We assess the inference time of Mask3D and ODIN by averaging the forward-pass time of each model across the entire validation set, utilizing a 40 GB VRAM A100. When fed the mesh-sampled point cloud directly, Mask3D achieves an inference time of 228 ms. When provided with the sensor point cloud as input, the inference time increases to 864 ms. Mask3D with the sensor point cloud is slower than with the mesh point cloud because, at the same voxel size (0.02 m), more voxels are occupied in the sensor point cloud (~110k on average) than in mesh point clouds (~64k on average), as mesh cleaning sometimes discards a large portion of the scene. The transfer of features from the sensor point cloud to the mesh point cloud adds an extra 7 ms. ODIN-Swin-B, which operates over the sensor point cloud, has an inference time of 960 ms.

Appendix B. Additional Implementation Details

The detailed components of our architecture and their descriptions are presented in Fig. 4. More implementation details are presented below:

Augmentations: For RGB image augmentation, we implement the Large Scale Jittering augmentation from Mask2Former [4], resizing images to a scale between 0.1 and 2.0. We adjust the intrinsics accordingly post-augmentation and apply color jittering to the RGB images. Training involves a consecutive set of N images, typically set to 25. With a 50% probability, we randomly sample k images from the range [1, N] instead of using all N images. Additionally, instead of consistently sampling N consecutive images, we randomly skip k images in between, where k ranges from 1 to 4.
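As a rough illustration of this sampling scheme (and only one plausible reading of it), the helper below picks a window of frames with random skips and optionally keeps a random subset; the exact indexing, window bounds and probabilities in the released code may differ.

```python
import random

def sample_training_frames(num_scene_frames, n=25):
    """Sketch of training-time frame sampling: a window of up to n frames with
    random skips of 1-4 frames between picks, and with 50% probability only a
    random subset of k <= n of those frames is kept."""
    start = random.randrange(max(1, num_scene_frames - 5 * n))
    frames, idx = [], start
    while len(frames) < n and idx < num_scene_frames:
        frames.append(idx)
        idx += 1 + random.randint(1, 4)          # skip 1-4 frames between consecutive picks
    if random.random() < 0.5:                    # with 50% probability, subsample k frames
        k = random.randint(1, len(frames))
        frames = sorted(random.sample(frames, k))
    return frames
```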
Table 8. Detailed ScanNet200 results for Instance Segmentation (§ = trained by us using the official codebase).

                                                         All                    Head                   Common                 Tail
Input                       Model                        mAP   mAP50  mAP25     mAP   mAP50  mAP25     mAP   mAP50  mAP25     mAP   mAP50  mAP25
Sensor RGB-D point cloud    Mask3D [44]§                 15.5  21.4   24.3      21.9  31.4   37.1      13.0  17.2   18.9      7.9   10.3   11.5
                            ODIN-ResNet50 (Ours)         25.6  36.9   43.8      34.8  51.1   63.9      23.4  33.4   37.9      17.8  24.9   28.1
                            ODIN-Swin-B (Ours)           31.5  45.3   53.1      37.5  54.2   66.1      31.6  43.9   50.2      24.1  36.6   41.2
Mesh Sampled point cloud    Mask3D [44]                  27.4  37.0   42.3      40.3  55.0   62.2      22.4  30.6   35.4      18.2  23.2   27.0

Figure 4. Detailed ODIN Architecture Components: On the Left is the 3D RelPos Attention module which takes as input the depth,
camera parameters and feature maps from all views, lifts the features to 3D to get 3D tokens. Each 3D token serves as a query. The
K-Nearest Neighbors of each 3D token become the corresponding keys and values. The 3D tokens attend to their neighbours for L layers
and update themselves. Finally, the 3D tokens are mapped back to the 2D feature map by simply reshaping the 3D feature cloud to 2D
multi-view feature maps. In the middle is the query refinement block where queries first attend to the text tokens, then to the visual tokens
and finally undergo self-attention. The text features are optional and are only used in the open-vocabulary decoder setup. On the Right
is the segmentation mask decoder head where the queries simply perform a dot-product with visual tokens to decode the segmentation
heatmap, which can be thresholded to obtain the segmentation mask. In the Open-Vocabulary decoding setup, the queries also perform a
dot-product with text tokens to decode a distribution over individual words. In a closed vocabulary decoding setup, queries simply pass
through an MLP to predict a distribution over classes.
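The layer sketched below mirrors the query-refinement block of Fig. 4 (optional text cross-attention, masked cross-attention to visual tokens, self-attention, then an MLP), assuming standard multi-head attention modules; hidden sizes, head counts and the feed-forward expansion factor are illustrative assumptions rather than ODIN's exact configuration.

```python
import torch
import torch.nn as nn

class QueryRefinementLayer(nn.Module):
    """Sketch of one query-refinement block: text cross-attention (open-vocabulary
    setup only), masked cross-attention to visual features, self-attention, and a
    feed-forward MLP, each with a residual connection and layer norm."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, queries, visual_feats, text_feats=None, attn_mask=None):
        # queries: (B, Q, dim); visual_feats: (B, P, dim); text_feats: (B, T, dim) or None
        if text_feats is not None:   # text attention is only used in the open-vocabulary setup
            queries = self.norms[0](queries + self.text_attn(queries, text_feats, text_feats)[0])
        queries = self.norms[1](queries + self.visual_attn(
            queries, visual_feats, visual_feats, attn_mask=attn_mask)[0])   # masked cross-attention
        queries = self.norms[2](queries + self.self_attn(queries, queries, queries)[0])
        return self.norms[3](queries + self.ffn(queries))
```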

For 3D augmentations, we adopt the Mask3D [44] approach, applying random 3D rotation, scaling, and jitter noise to the unprojected XYZs. Elastic distortion and random flipping augmentations from Mask3D are omitted due to a slight drop in performance observed in our initial experiments.

Image Resolutions: We use a resolution of 256 × 256 for ScanNet, and 512 × 512 for ScanNet200 and AI2THOR. In our AI2THOR experiments, we discovered that employing higher image resolutions enhances the detection of smaller objects, with no noticeable impact on the detection of larger, ScanNet-like objects. This observation was confirmed in ScanNet, where we experimented with 512 × 512 image resolutions and did not observe any discernible benefit.

Interpolation: Throughout our model, interpolations are employed in various instances, such as when upsampling the feature map from 1/8 resolution to 1/4. In cases involving depth, we unproject the feature maps to 3D and perform trilinear interpolation, as opposed to directly applying bilinear interpolation on the 2D feature maps. For upsampling/downsampling the depth maps, we use nearest-neighbor interpolation. Trilinear interpolation proves crucial for obtaining accurate feature maps, particularly at 2D object boundaries like table and floor edges. This is because nearest-neighbor depth interpolation may capture depth from either the table or the floor; utilizing trilinear upsampling of feature maps ensures that if the upsampled depth is derived from the floor, it interpolates features from floor points rather than table points.
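Since the source features live on a sparse set of 3D locations rather than a dense grid, a common stand-in for this kind of 3D feature transfer is inverse-distance weighting over a few nearest source points, sketched below. This is an approximation for illustration under that assumption, not necessarily the exact interpolation used in ODIN.

```python
import torch

def interpolate_point_features(src_xyz, src_feats, query_xyz, k=3, eps=1e-8):
    """Transfer features from a featurized 3D point set to query points (e.g. mesh
    vertices or upsampled pixel unprojections) by inverse-distance weighting over
    the k nearest source points.

    src_xyz: (N, 3), src_feats: (N, C), query_xyz: (M, 3)  ->  (M, C)
    """
    dist = torch.cdist(query_xyz, src_xyz)                  # (M, N) pairwise distances
    knn_dist, knn_idx = dist.topk(k, largest=False)         # (M, k) nearest source points
    weights = 1.0 / (knn_dist + eps)
    weights = weights / weights.sum(dim=1, keepdim=True)    # normalize weights per query
    return (src_feats[knn_idx] * weights.unsqueeze(-1)).sum(dim=1)
```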
maps ensures that if the upsampled depth is derived from the [7] into the AI2THOR simulator, and place an agent ran-
floor, it interpolates features from floor points rather than ta- domly at a navigable point provided by the simulator. The
ble points. agent performs a single random rotation around its initial
Use of Segments: Some datasets, such as ScanNet and location and captures an RGB-D frame. This process is re-
ScanNet200, provide supervoxelization of the point cloud, peated, with the agent spawning at another random location,
commonly referred to as segments. Rather than directly seg- until either all navigable points are exhausted or a maxi-
menting all input points, many 3D methods predict outputs mum of N = 120 frames is collected. While ProcTHOR
over these segments. Specifically, Mask3D [44] featurizes offers 10,000 scenes, we randomly select only 1,500 scenes
the input points and then conducts mean pooling over the to match the size of ScanNet. Additionally, we retain scenes
features of points belonging to a segment, resulting in one with fewer than 100 objects, as our model utilizes a maxi-
feature per segment. Following prior work, we also lever- mum of 100 object queries.
age segments in a similar manner. We observe that utilizing
segments is crucial for achieving good mAP performance, Appendix C. Qualitative Results
while it has no discernible impact on mAP25 performance.
Fig. 5 shows qualitative visualizations of ODIN for various
We suspect that this phenomenon may arise from the anno-
3D and 2D datasets.
tation process of these datasets. Humans were tasked with
labelling segments rather than individual points, ensuring
that all points within a segment share the same label. Uti-
lizing segments with our models guarantees that the entire
segment is labelled with the same class. It’s worth noting
that in AI2THOR, our method and the baselines do not uti-
lize these segments, as they are not available.
Post-hoc output transfer vs feature transfer: ODIN
takes the sensor point cloud as input and generates seg-
mentation output on the benchmark-provided point cloud.
In this process, we featurize the sensor point cloud and
transfer these features from the sensor point cloud to the
benchmark-provided point cloud. Subsequently, we predict
segmentation outputs on this benchmark-provided feature
cloud and supervise the model with the labels provided in
the dataset. An alternative approach involves segmenting
and supervising the sensor RGB-D point cloud and later
transferring the segmentation output to the benchmark point
cloud for evaluation. We experimented with both strate-
gies and found them to yield similar results. However, since many datasets provide ground-truth annotations only on the benchmark point cloud, the latter strategy requires transferring labels from that point cloud to the RGB-D images, which needs care: the benchmark point cloud is sparser than the RGB-D sensor point cloud, and depth noise and misalignments can lead to low-quality label transfer. We therefore opt for the former strategy in all our experiments.
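The feature transfer from the sensor point cloud to the benchmark point cloud can be sketched with a simple nearest-neighbor assignment, as below; the chunked brute-force search and the point-cloud sizes are illustrative, and ODIN's implementation may use a different or more efficient neighbor search.

```python
import torch


def transfer_features(sensor_xyz, sensor_feats, bench_xyz, chunk=2048):
    """sensor_xyz: (N, 3); sensor_feats: (N, C); bench_xyz: (M, 3) -> (M, C)."""
    out = []
    for start in range(0, bench_xyz.shape[0], chunk):
        q = bench_xyz[start:start + chunk]                 # (m, 3) query points
        nn_idx = torch.cdist(q, sensor_xyz).argmin(dim=1)  # nearest sensor point per benchmark point
        out.append(sensor_feats[nn_idx])
    return torch.cat(out, dim=0)


sensor_xyz, sensor_feats = torch.rand(20000, 3), torch.randn(20000, 256)
bench_xyz = torch.rand(10000, 3)
bench_feats = transfer_features(sensor_xyz, sensor_feats, bench_xyz)   # (10000, 256)
```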
Depth Hole-Infilling: The sensor-collected depth
maps usually have holes around object boundaries and
shiny/transparent surfaces. We perform simple OpenCV
depth inpainting to fill these holes. We tried using
neural-based depth completion methods and NeRF depth-
inpainting but did not observe significant benefits.
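A minimal sketch of this hole-infilling with OpenCV is given below; rescaling the depth to 8-bit for cv2.inpaint and the chosen inpainting radius are assumptions for illustration rather than ODIN's exact preprocessing.

```python
import cv2
import numpy as np


def inpaint_depth(depth_m, max_depth=10.0, radius=5):
    """depth_m: float32 (H, W) depth in meters, with holes encoded as 0."""
    hole_mask = (depth_m <= 0).astype(np.uint8)                         # non-zero marks pixels to fill
    depth_8u = np.clip(depth_m / max_depth * 255.0, 0, 255).astype(np.uint8)
    filled_8u = cv2.inpaint(depth_8u, hole_mask, radius, cv2.INPAINT_NS)
    filled = filled_8u.astype(np.float32) / 255.0 * max_depth
    out = depth_m.copy()
    out[hole_mask.astype(bool)] = filled[hole_mask.astype(bool)]        # only overwrite the holes
    return out


depth = np.random.uniform(0.5, 5.0, (480, 640)).astype(np.float32)
depth[200:220, 300:340] = 0.0                                           # synthetic hole
depth_filled = inpaint_depth(depth)
```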
AI2THOR Data Collection: AI2THOR [25] is an embodied simulator where an agent can navigate within a house, execute actions, and capture RGB-D images of the scene. We load the structurally generated houses from ProcTHOR [7] into the AI2THOR simulator and place an agent randomly at a navigable point provided by the simulator. The agent performs a single random rotation around its initial location and captures an RGB-D frame. This process is repeated, with the agent spawning at another random location, until either all navigable points are exhausted or a maximum of N = 120 frames is collected. While ProcTHOR offers 10,000 scenes, we randomly select only 1,500 scenes to match the size of ScanNet. Additionally, we retain scenes with fewer than 100 objects, as our model utilizes a maximum of 100 object queries.
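The collection loop can be roughly sketched as follows, assuming the standard ai2thor Controller interface and the prior package for loading ProcTHOR houses; the action names, arguments, and rotation granularity shown here are assumptions and may differ from ODIN's actual collection script.

```python
import random
import prior                               # assumed: provides the ProcTHOR-10K house specifications
from ai2thor.controller import Controller

MAX_FRAMES = 120

houses = prior.load_dataset("procthor-10k")["train"]
controller = Controller(scene=houses[0], renderDepthImage=True)

reachable = controller.step(action="GetReachablePositions").metadata["actionReturn"]
random.shuffle(reachable)

frames = []
for pos in reachable[:MAX_FRAMES]:
    yaw = random.choice(range(0, 360, 30))           # single random rotation at the spawn point
    event = controller.step(
        action="Teleport",
        position=pos,
        rotation=dict(x=0, y=yaw, z=0),
        horizon=0,
        standing=True,
    )
    if not event.metadata["lastActionSuccess"]:
        continue
    frames.append((event.frame, event.depth_frame))   # RGB and depth for this viewpoint
controller.stop()
```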
Appendix C. Qualitative Results

Fig. 5 shows qualitative visualizations of ODIN for various 3D and 2D datasets.

Figure 5. Qualitative Results on various 3D and 2D datasets (ScanNet, ScanNet200, Matterport3D, S3DIS, AI2THOR, and COCO).
References

[1] Iro Armeni, Sasha Sax, Amir R Zamir, and Silvio Savarese. Joint 2d-3d-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017. 2, 8, 9, 10
[2] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017. 2, 8, 9, 10
[3] Shaoyu Chen, Jiemin Fang, Qian Zhang, Wenyu Liu, and Xinggang Wang. Hierarchical aggregation for 3d instance segmentation. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 15447–15456, 2021. 2
[4] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. 2022. 1, 2, 3, 4, 5, 7, 11
[5] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3075–3084, 2019. 2, 10
[6] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 1, 2, 3, 5, 9
[7] Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation. Advances in Neural Information Processing Systems, 35:5982–5994, 2022. 2, 7, 13
[8] Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, and Xiaojuan Qi. Pla: Language-driven open-vocabulary 3d scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7010–7019, 2023. 1
[9] Bin Dong, Fangao Zeng, Tiancai Wang, Xiangyu Zhang, and Yichen Wei. Solq: Segmenting objects by learning queries. Advances in Neural Information Processing Systems, 34:21898–21909, 2021. 7
[10] Shichao Dong, Fayao Liu, and Guosheng Lin. Leveraging large-scale pretrained vision foundation models for label-efficient 3d point cloud segmentation. arXiv preprint arXiv:2311.01989, 2023. 3
[11] Vijay Prakash Dwivedi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Graph neural networks with learnable structural and positional representations. arXiv preprint arXiv:2110.07875, 2021. 4
[12] Kyle Genova, Xiaoqi Yin, Abhijit Kundu, Caroline Pantofaru, Forrester Cole, Avneesh Sud, Brian Brewington, Brian Shucker, and Thomas Funkhouser. Learning 3d semantic segmentation with only 2d image supervision. In 2021 International Conference on 3D Vision (3DV), pages 361–372. IEEE, 2021. 3
[13] Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, and Ishan Misra. Omnivore: A single model for many visual modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16102–16112, 2022. 3
[14] Huy Ha and Shuran Song. Semantic abstraction: Open-world 3d scene understanding from 2d vision-language models. In 6th Annual Conference on Robot Learning, 2022. 1, 3
[15] Lei Han, Tian Zheng, Lan Xu, and Lu Fang. Occuseg: Occupancy-aware 3d instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2940–2949, 2020. 2
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 2, 3
[17] Wenbo Hu, Hengshuang Zhao, Li Jiang, Jiaya Jia, and Tien-Tsin Wong. Bidirectional projection network for cross dimension scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14373–14382, 2021. 3, 6, 10
[18] Jingwei Huang, Haotian Zhang, Li Yi, Thomas Funkhouser, Matthias Nießner, and Leonidas J Guibas. Texturenet: Consistent local parametrizations for learning from high-resolution signals on meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4440–4449, 2019. 10
[19] Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, and Katerina Fragkiadaki. Bottom up top down detection transformers for language grounding in images and point clouds. In European Conference on Computer Vision, pages 417–433. Springer, 2022. 5
[20] Maximilian Jaritz, Jiayuan Gu, and Hao Su. Multi-view pointnet for 3d scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019. 3, 6, 10
[21] Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, Ayush Tewari, et al. Conceptfusion: Open-set multimodal 3d mapping. arXiv preprint arXiv:2302.07241, 2023. 1, 3, 8, 11
[22] Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi-Wing Fu, and Jiaya Jia. Pointgroup: Dual-set point grouping for 3d instance segmentation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4866–4875, 2020. 2
[23] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19729–19739, 2023. 1, 3
[24] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023. 3
[25] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474, 2017. 2, 7, 13
[26] Abhijit Kundu, Xiaoqi Yin, Alireza Fathi, David Ross, Brian Brewington, Thomas Funkhouser, and Caroline Pantofaru. Virtual multi-view fusion for 3d semantic segmentation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16, pages 518–535. Springer, 2020. 3, 6, 10
[27] Xin Lai, Jianhui Liu, Li Jiang, Liwei Wang, Hengshuang Zhao, Shu Liu, Xiaojuan Qi, and Jiaya Jia. Stratified transformer for 3d point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8500–8509, 2022. 6, 9, 10
[28] Xin Lai, Yuhui Yuan, Ruihang Chu, Yukang Chen, Han Hu, and Jiaya Jia. Mask-attention-free transformer for 3d instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3693–3703, 2023. 1, 2, 5, 6, 10
[29] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022. 5
[30] Zhihao Liang, Zhihao Li, Songcen Xu, Mingkui Tan, and Kui Jia. Instance segmentation in 3d scenes using semantic superpoint tree networks. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2763–2772, 2021. 2
[31] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 5
[32] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019. 5
[33] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 2, 3
[34] Jiahao Lu, Jiacheng Deng, Chuxin Wang, Jianfeng He, and Tianzhu Zhang. Query refinement transformer for 3d instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18516–18526, 2023. 2, 5, 6, 10
[35] So Yeon Min, Devendra Singh Chaplot, Pradeep Kumar Ravikumar, Yonatan Bisk, and Ruslan Salakhutdinov. Film: Following instructions in language with modular methods. In International Conference on Learning Representations, 2021. 7
[36] Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani-Tur. Teach: Task-driven embodied agents that chat. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2017–2025, 2022. 2, 7
[37] Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 815–824, 2023. 1, 3, 6, 8, 9, 10
[38] Guocheng Qian, Xingdi Zhang, Abdullah Hamdi, and Bernard Ghanem. Pix4point: Image pretrained transformers for 3d point cloud understanding. 2022. 3
[39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 3
[40] Damien Robert, Bruno Vallet, and Loic Landrieu. Learning multi-view aggregation in the wild for large-scale 3d semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5575–5584, 2022. 1, 3, 6, 9, 10
[41] David Rozenberszki, Or Litany, and Angela Dai. Language-grounded indoor 3d semantic segmentation in the wild. In European Conference on Computer Vision, pages 125–141. Springer, 2022. 1, 2, 5, 6, 9, 10, 11
[42] Gabriel Sarch, Yue Wu, Michael J Tarr, and Katerina Fragkiadaki. Open-ended instructable embodied agents with memory-augmented large language models. arXiv preprint arXiv:2310.15127, 2023. 2, 7
[43] Jonas Schult, Francis Engelmann, Theodora Kontogianni, and Bastian Leibe. Dualconvmesh-net: Joint geodesic and euclidean convolutions on 3d meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8612–8622, 2020. 10
[44] Jonas Schult, Francis Engelmann, Alexander Hermans, Or Litany, Siyu Tang, and Bastian Leibe. Mask3d: Mask transformer for 3d semantic instance segmentation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 8216–8223. IEEE, 2023. 1, 2, 5, 6, 7, 9, 10, 12, 13
[45] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10740–10749, 2020. 7
[46] Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulò, Norman Müller, Matthias Nießner, Angela Dai, and Peter Kontschieder. Panoptic lifting for 3d scene understanding with neural fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9043–9052, 2023. 1, 3
[47] Ayça Takmaz, Elisabetta Fedele, Robert W Sumner, Marc Pollefeys, Federico Tombari, and Francis Engelmann. Openmask3d: Open-vocabulary 3d instance segmentation. arXiv preprint arXiv:2306.13631, 2023. 1, 3, 6, 10
[48] Nikolaos Tsagkas, Oisin Mac Aodha, and Chris Xiaoxuan Lu. Vl-fields: Towards language-grounded neural implicit spatial representations. arXiv preprint arXiv:2305.12427, 2023. 1, 3
[49] Thang Vu, Kookhoi Kim, Tung M Luu, Thanh Nguyen, and Chang D Yoo. Softgroup for 3d instance segmentation on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2708–2717, 2022. 2, 6, 9, 10
[50] Peng-Shuai Wang. Octformer: Octree-based transformers for 3d point clouds. arXiv preprint arXiv:2305.03045, 2023. 6, 9, 10
[51] Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, and Hengshuang Zhao. Point transformer v2: Grouped vector attention and partition-based pooling. Advances in Neural Information Processing Systems, 35:33330–33342, 2022. 4, 6, 9, 10
[52] Chenfeng Xu, Shijia Yang, Tomer Galanti, Bichen Wu, Xiangyu Yue, Bohan Zhai, Wei Zhan, Peter Vajda, Kurt Keutzer, and Masayoshi Tomizuka. Image2point: 3d point-cloud understanding with 2d image pretrained models. arXiv preprint arXiv:2106.04180, 2021. 3
[53] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2955–2966, 2023. 1
[54] Karmesh Yadav, Ram Ramrakhya, Santhosh Kumar Ramakrishnan, Theo Gervet, John Turner, Aaron Gokaslan, Noah Maestre, Angel Xuan Chang, Dhruv Batra, Manolis Savva, et al. Habitat-matterport 3d semantics dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4927–4936, 2023. 3
[55] Yu-Qi Yang, Yu-Xiao Guo, Jian-Yu Xiong, Yang Liu, Hao Pan, Peng-Shuai Wang, Xin Tong, and Baining Guo. Swin3d: A pretrained transformer backbone for 3d indoor scene understanding. arXiv preprint arXiv:2304.06906, 2023. 6, 9, 10
[56] Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8552–8562, 2022. 3
[57] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021. 4
[58] Weiguang Zhao, Yuyao Yan, Chaolong Yang, Jianan Ye, Xi Yang, and Kaizhu Huang. Divide and conquer: 3d point cloud instance segmentation with point-wise binarization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 562–571, 2023. 2, 5, 6, 10
[59] Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3d: A large photo-realistic dataset for structured 3d modeling. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16, pages 519–535. Springer, 2020. 9
[60] Zhisheng Zhong, Jiequan Cui, Yibo Yang, Xiaoyang Wu, Xiaojuan Qi, Xiangyu Zhang, and Jiaya Jia. Understanding imbalanced semantic segmentation through neural collapse. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19550–19560, 2023. 6, 10
[61] Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, et al. Generalized decoding for pixel, image, and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15116–15127, 2023. 5