
ODIN: A Single Model for 2D and 3D Segmentation

Ayush Jain1, Pushkal Katara1, Nikolaos Gkanatsios1, Adam W. Harley2,
Gabriel Sarch1, Kriti Aggarwal3, Vishrav Chaudhary3, Katerina Fragkiadaki1

1Carnegie Mellon University, 2Stanford University, 3Microsoft
{ayushj2, pkatara, ngkanats, gsarch, kfragki2}@andrew.cmu.edu
aharley@cs.stanford.edu, {kragga, vchaudhary}@microsoft.com
arXiv:2401.02416v3 [cs.CV] 25 Jun 2024

Abstract

State-of-the-art models on contemporary 3D segmentation benchmarks like ScanNet consume and label dataset-provided 3D point clouds, obtained through post-processing of sensed multiview RGB-D images. They are typically trained in-domain, forego large-scale 2D pre-training and outperform alternatives that featurize the posed RGB-D multiview images instead. The gap in performance between methods that consume posed images versus post-processed 3D point clouds has fueled the belief that 2D and 3D perception require distinct model architectures. In this paper, we challenge this view and propose ODIN (Omni-Dimensional INstance segmentation), a model that can segment and label both 2D RGB images and 3D point clouds, using a transformer architecture that alternates between 2D within-view and 3D cross-view information fusion. Our model differentiates 2D and 3D feature operations through the positional encodings of the tokens involved, which capture pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art performance on ScanNet200, Matterport3D and AI2THOR 3D instance segmentation benchmarks, and competitive performance on ScanNet, S3DIS and COCO. It outperforms all previous works by a wide margin when the sensed 3D point cloud is used in place of the point cloud sampled from the 3D mesh. When used as the 3D perception engine in an instructable embodied agent architecture, it sets a new state-of-the-art on the TEACh action-from-dialogue benchmark. Our code and checkpoints can be found at the project website https://odin-seg.github.io.

1. Introduction

There has been a surge of interest in porting 2D foundational image features to 3D scene understanding [8, 14, 21, 23, 37, 40, 46-48]. Some methods lift pre-trained 2D image features using sensed depth to 3D feature clouds [8, 37, 40, 47]. Others distill 2D backbones to differentiable parametric 3D models, e.g., NeRFs, by training them per scene to render 2D feature maps of pre-trained backbones [23, 46]. Despite this effort, and despite the ever-growing power of 2D backbones [4, 53], the state-of-the-art on established 3D segmentation benchmarks such as ScanNet [6] and ScanNet200 [41] still consists of models that operate directly in 3D, without any 2D pre-training stage [28, 44]. Given the obvious power of 2D pre-training, why is it so difficult to yield improvements in these 3D tasks?

We observe that part of the issue lies in a key implementation detail underlying these 3D benchmark evaluations. Benchmarks like ScanNet do not actually ask methods to use RGB-D images as input, even though this is the sensor data. Instead, these benchmarks first register all RGB-D frames into a single colored point cloud and reconstruct the scene as cleanly as possible, relying on manually tuned stages for bundle adjustment, outlier rejection and meshing, and ask models to label the output reconstruction. While it is certainly viable to scan and reconstruct a room before labelling any of the objects inside, this pipeline is perhaps inconsistent with the goals of embodied vision (and typical 2D vision), which involve dealing with actual sensor data and accounting for missing or partial observations. We therefore hypothesize that method rankings will change, and the impact of 2D pre-training will become evident, if we force the 3D models to take posed RGB-D frames as input rather than pre-computed mesh reconstructions. Our revised evaluation setting also opens the door to new methods, which can train and perform inference in either single-view or multi-view settings, with either RGB or RGB-D sensors.

We propose Omni-Dimensional INstance segmentation (ODIN)†, a model for 2D and 3D object segmentation and labelling that can parse single-view RGB images and/or multiview posed RGB-D images. As shown in Fig. 1, ODIN alternates between 2D and 3D stages in its architecture,

† The Norse god Odin sacrificed one of his eyes for wisdom, trading one mode of perception for a more important one. Our approach sacrifices perception on post-processed meshes for perception on posed RGB-D images.
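The alternating 2D/3D fusion described in the abstract can be pictured with a small sketch: a 2D stage attends within each view using pixel-coordinate encodings, and a 3D stage attends across all views using 3D-coordinate encodings. The PyTorch-style block below is an illustrative assumption, not the authors' implementation; the class name, token dimensions, and the linear positional-encoding layers are placeholders.

# Minimal sketch (assumed, not ODIN's actual code): alternating 2D within-view
# and 3D cross-view fusion over per-view feature tokens.
import torch
import torch.nn as nn

class AlternatingFusionBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn_2d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_3d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pos_2d = nn.Linear(2, dim)   # pixel (u, v) -> embedding (placeholder encoding)
        self.pos_3d = nn.Linear(3, dim)   # world (x, y, z) -> embedding (placeholder encoding)

    def forward(self, feats, pix_xy, world_xyz):
        # feats:     (views, tokens, dim) per-view feature tokens
        # pix_xy:    (views, tokens, 2)   pixel coordinates of each token
        # world_xyz: (views, tokens, 3)   unprojected 3D coordinates of each token
        V, T, D = feats.shape

        # 2D stage: attention within each view, positions given by pixel coordinates.
        x = feats + self.pos_2d(pix_xy)
        x, _ = self.attn_2d(x, x, x)

        # 3D stage: flatten all views into one token set and attend across views,
        # positions given by 3D coordinates so geometry drives the fusion.
        y = (x + self.pos_3d(world_xyz)).reshape(1, V * T, D)
        y, _ = self.attn_3d(y, y, y)
        return y.reshape(V, T, D)

# Example: 4 posed RGB-D views with 1024 tokens each.
block = AlternatingFusionBlock()
out = block(torch.randn(4, 1024, 256), torch.rand(4, 1024, 2), torch.rand(4, 1024, 3))  # (4, 1024, 256)

The point of the split is that the 2D stage only lets each view attend to its own tokens, while the 3D stage pools every view into a single token set, so cross-view fusion is driven by 3D position rather than by view index.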
