Multimodal 3D Reasoning Segmentation with Complex Scenes
Figure 1. The proposed MORE3D enables multi-object reasoning segmentation for 3D scenarios. It can comprehend the intention behind
user questions, handle complex 3D scenes with multiple objects, and produce fine-grained explanations with 3D spatial relations among
objects, demonstrating strong reasoning and 3D segmentation capabilities.
among 3D objects in scenes and grasping the intention of human users have become essential for machines to interpret 3D scenes, interact with 3D objects in scenes, and achieve various complex and real-world missions while navigating within 3D environments.

On the other hand, most existing 3D scene understanding work does not possess reasoning and interpretation abilities for interacting with user textual inputs. For example, several works introduce foundation models such as large language models (LLMs) to empower 3D scene understanding on captioning [4, 7, 13], question answering [2, 11, 13, 26, 29], visual grounding [5, 10, 13, 17, 19, 44, 52], and referring [16, 32, 39, 48], but they are short of reasoning capabilities for deducing human intentions. Recently, several studies have attempted to introduce the reasoning ability of LLMs into 3D scene understanding tasks. However, they mostly focus on segmenting single or single-category objects and cannot handle complex scenes with multiple objects of different categories [5, 12, 15]. Hence, they cannot understand 3D spatial relations among objects or produce fine-grained textual explanations, hampering their applications in various real-world scenarios that often come with multiple objects of different categories.

We propose a multi-object 3D reasoning segmentation task that produces 3D segmentation masks and textual explanations with rich 3D spatial relations among objects in scenes, given 3D scenes and user questions as inputs. To this end, we create ReasonSeg3D, a large-scale and high-quality benchmark that can evaluate 3D reasoning segmentation with multiple 3D objects and rich spatial relations among them. Different from prior 3D reasoning segmentation benchmarks [12, 15], ReasonSeg3D expands the scope into multi-object space, which is well aligned with real-world tasks that often come with multiple objects. In addition, ReasonSeg3D integrates 3D spatial information into question-answer pairs, where the 3D spatial relations in textual answers clearly benefit 3D reasoning segmentation.

On top of ReasonSeg3D, we design MORE3D, a simple yet effective technique that enables multi-object reasoning segmentation with textual explanations to users' questions. MORE3D learns object-specific point cloud embeddings from the LLM, enabling precise prediction of 3D segmentation masks of multiple objects. In addition, it produces detailed textual explanations that capture 3D spatial relations among multiple objects, supporting accurate segmentation and comprehensive reasoning for complex 3D scenes. As illustrated in Figure 1, MORE3D demonstrates strong reasoning capability for comprehending the intention behind users' input questions and complex 3D scenes, producing accurate 3D segmentation and explanatory answers with respect to multiple 3D objects in scenes.

The major contributions of this work can be summarized in three aspects. First, we introduce a new multi-object 3D reasoning segmentation task, together with a large-scale and high-quality benchmark that incorporates 3D spatial relations for effective evaluations of multi-object 3D reasoning segmentation. Second, we design a reasoning segmentation technique that can handle multi-object 3D reasoning segmentation and produce fine-grained textual explanations with 3D spatial relations among objects in scenes. Third, extensive experiments demonstrate the superiority of our proposed multi-object reasoning segmentation technique as well as the validity of our created benchmark on multi-object 3D reasoning segmentation.

2. Related Work

2.1. Language-Instructed 3D Tasks

Integrating point clouds with natural language processing has widespread applications, drawing increasing interest in language-instructed 3D scene understanding. Existing language-instructed 3D scene understanding methods can be broadly grouped into two categories. The first category focuses on 3D segmentation [3, 9, 16, 18, 25, 28, 32, 39, 43, 45, 48], such as OpenScene [30], Openmask3D [35], and 3D-STMN [40], which produce segmentation masks but lack reasoning abilities and textual or conversational output. This limits their ability to comprehend the user's intention and interact with the user in real-world applications. The second category focuses on tasks such as 3D captioning [4, 7, 13], 3D question answering [2, 11, 13, 26, 29], and visual grounding [5, 10, 13, 17, 19, 44, 52]. They can generate textual outputs like phrases or conversational outputs, but leave the fine-grained segmentation task untouched and have no reasoning ability either. Different from existing methods, the proposed MORE3D predicts textual answers with explanations and accurate segmentation of multiple 3D objects within complex scenes, demonstrating strong reasoning capability in comprehending the intention behind the user's input questions.
2.2. Reasoning Segmentation

Reasoning Segmentation is first introduced by LISA [22] to generate segmentation masks from complex, implicit textual queries. Specifically, LISA integrates LLaVA [24] with SAM, enhancing segmentation through the vision-language model's reasoning capabilities. Following LISA, PixelLM [33] improves pixel-level reasoning segmentation using multimodal models with a lightweight decoder and segmentation codebook, LLM-Seg [37] bridges the Segmentation Anything Model and LLMs by selecting mask proposals, and LLaVASeg [46] incorporates query-focused segmentation into large language models where chain-of-thought prompting is adopted to preserve dialogue functions. In addition, VISA [42] extends reasoning segmentation to video by combining multimodal language models with a mask decoder, facilitating complex video segmentation from implicit text queries and world knowledge. In the 3D domain, PARIS3D [20] and Reasoning3D [6] focus on part segmentation with explanations for individual 3D objects, leaving 3D segmentation of complex scenes untouched. Recent studies like SegPoint [12] and Reason3D [15] integrate LLMs' reasoning ability into 3D segmentation, but they are limited to segmentation of single-category objects and have no textual explanations. In contrast, our approach targets multi-object 3D segmentation and provides textual explanations in the output answer, enhancing the model's understanding of 3D spatial relations and offering a more practical solution for real-world complex scenes.

2.3. Large Multimodal Models

Inspired by the remarkable learning ability of Large Language Models (LLMs), recent research has expanded into the visual domain and developed a series of Large Multimodal Models (LMMs) [1, 49]. The prevalent approach focuses on aligning visual representations with the linguistic embeddings of LLMs. For example, BLIP-2 [23] and mPLUG-OWL [47] encode image features with a visual encoder, integrating them into the LLMs with text embeddings. LLaVA [24] and MiniGPT4 [51] align image-text features followed by instruction tuning, and they also explore image retrieval for LLMs. Recent studies delve into the integration of multimodal LLMs with vision tasks. For example, VisionLLM [38] provides an interface for vision-centric tasks via instruction tuning, though it does not exploit LLMs for complex reasoning. VisionLLM-v2 [41] integrates visual perception, understanding, and generation within a unified framework by using a "super link" to connect the multimodal large model with task-specific decoders. DetGPT [31] introduces multimodal LLMs into open-vocabulary detectors for instruction-based detection tasks. GPT4RoI [50] introduces spatial boxes as inputs, training on region-text pairings. LISA [22] enhances segmentation in multimodal LLMs by introducing a <SEG> token. These existing studies primarily target downstream tasks in the 2D domain. In contrast, our approach extends into the 3D domain to enable multi-object segmentation of 3D point clouds and provides textual explanations to support reasoning in the segmentation process.

3. ReasonSeg3D Dataset

Most existing reasoning segmentation datasets are not suitable for studying the proposed 3D multi-object reasoning segmentation task. Specifically, existing reasoning segmentation datasets have two critical limitations. First, the existing 2D reasoning segmentation datasets [22, 33, 37, 42] lack 3D data, making them unsuitable for 3D tasks. Second, the existing 3D reasoning datasets [12, 15] focus on single-category objects as illustrated in Table 1, and they contain only a few hundred scenes, which are far fewer than standard 3D segmentation datasets [8, 34]. Additionally, these 3D reasoning datasets lack textual explanations with spatial information, hindering them from training MLLMs for better 3D spatial relation understanding. We bridge this gap by proposing a data generation pipeline as well as a new 3D reasoning segmentation dataset, ReasonSeg3D, with more details elaborated in the following subsections.

3.1. Dataset Definition

In the proposed ReasonSeg3D, each point cloud P is paired with multiple {xque, yans, M} triplets, where xque is a question targeting one or more objects in the scene, yans is the textual answer containing the explanation for the reasoning segmentation, and M represents the 3D segmentation masks corresponding to yans. The question xque is designed to require world knowledge and reasoning ability to accurately identify and segment multiple objects. For example, instead of directly asking “Where are the sofa and table?”, ReasonSeg3D formulates the question as “Where would be the most suitable place for reading a book in this layout?”. An answer yans in ReasonSeg3D is constructed to involve multiple objects, and it also includes explanations with 3D spatial relations. For example, the yans to the above question is formulated as “The corner with the sofa and a small table next to it can serve as a perfect reading nook, providing comfort and a quiet atmosphere.”.
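To make the triplet structure concrete, the following is a minimal sketch of how one {xque, yans, M} sample could be represented in code. It is an illustration only: the class and field names are hypothetical and do not reflect the actual ReasonSeg3D release format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ReasonSeg3DSample:
    """Illustrative container for one {x_que, y_ans, M} triplet (hypothetical field names)."""
    question: str            # x_que: implicit, reasoning-style question about the scene
    answer: str              # y_ans: explanation with 3D spatial relations; each target object is followed by <SEG>
    masks: list[np.ndarray]  # M: one boolean per-point mask per <SEG> token, each of shape (num_points,)

sample = ReasonSeg3DSample(
    question="Where would be the most suitable place for reading a book in this layout?",
    answer=("The corner with the sofa <SEG> and a small table <SEG> next to it can serve "
            "as a perfect reading nook, providing comfort and a quiet atmosphere."),
    masks=[np.zeros(50000, dtype=bool), np.zeros(50000, dtype=bool)],  # placeholder masks
)
assert sample.answer.count("<SEG>") == len(sample.masks)  # one mask per referenced object
```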
3.2. Dataset Generation Pipeline

Several reasoning segmentation datasets [22, 37] employ LLaVA [24] for image captioning and text-only GPT-4 to produce question-answer pairs according to the generated captions. However, GPT-4 cannot understand the image content well, and its generated question-answer pairs often lack crucial 3D spatial relations that are essential for 3D reasoning segmentation in various real-world tasks.

We design a novel data generation pipeline that introduces GPT-4o, which possesses superior visual content understanding capabilities. The pipeline allows generating more practical questions and answers by incorporating spatial relations among objects in scenes. Specifically, we employ both the scene image and its ground-truth segmentation to enhance GPT-4o's 3D spatial understanding capabilities. The prompt template is structured in two parts: the first part contains basic requirements for GPT-4o, and the second specifies detailed requirements for generating questions and answers with a focus on describing 3D spatial relations among objects. With this template, GPT-4o autonomously selects objects to form question-answer pairs that reflect the scene's content and spatial layout. Figure 2 shows one example of the designed prompt template and one question-answer pair generated by using the template.
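The sketch below illustrates the two-part prompting scheme described above. It assumes a hypothetical query_gpt4o() helper for sending text and images to GPT-4o; the prompt wording, helper signature, and JSON fields are illustrative assumptions rather than the actual pipeline.

```python
# Minimal sketch of the two-part prompt template for question-answer generation.
import json

BASIC_REQUIREMENTS = (
    "You are given a rendered view of an indoor 3D scene and its ground-truth "
    "segmentation. Generate question-answer pairs for reasoning segmentation."
)
DETAILED_REQUIREMENTS = (
    "Each question must require world knowledge to identify multiple objects. "
    "Each answer must name the selected objects and describe their 3D spatial "
    "relations (e.g., next to, behind, on top of). Return JSON with fields "
    "'question', 'answer', and 'objects'."
)

def generate_qa_pairs(scene_image_path: str, gt_segmentation_path: str, query_gpt4o):
    """Produce question-answer pairs for one scene with GPT-4o (hypothetical helper)."""
    prompt = BASIC_REQUIREMENTS + "\n" + DETAILED_REQUIREMENTS
    reply = query_gpt4o(prompt, images=[scene_image_path, gt_segmentation_path])
    return json.loads(reply)  # list of {"question", "answer", "objects"} dicts
```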
Figure 2. Illustration of the prompt template in our dataset generation. (a) One example prompt template in our dataset generation on 3D multi-object reasoning. (b) With a sample input image to GPT-4o and the corresponding ground-truth segmentation on the top, the two boxes below present one generated question-answer pair where text of different colors highlights different objects.

Table 1. Comparison of existing 3D scene-level reasoning segmentation datasets. ReasonSeg3D stands out for its large-scale, high-quality data, supporting segmenting multiple 3D objects across multiple categories and offering explanations with 3D spatial relations.

3.3. Dataset Statistics

ReasonSeg3D comprises 1513 scenes and 20,113 data samples in total, with point clouds sourced from ScanNetv2 [8]. Following [8], we divide ReasonSeg3D into training and validation sets comprising 1201 and 312 scenes, respectively. In the generated dataset, each scene has 13.3 questions on average. The dataset contains 20 different object categories for segmentation.

4. Method

4.1. Task Definition

Multi-object 3D reasoning segmentation takes a single point cloud P and user questions Xque as input, aiming to reason the implicit intention behind Xque and produce textual answers Ŷans that include explanations, as well as 3D segmentation masks M̂ of multiple objects in P.

4.2. Overall Framework

Figure 3 shows the framework of the proposed MORE3D, which enables Multi-Object 3D Reasoning segmentation and provides textual Explanations based on the user's input question. Given an input point cloud P, the 3D Encoder first extracts per-point features Fp and then projects Fp into sequential features Fs that can be processed by LLMs. The sequential features Fs and input text queries are processed by the LLM, generating textual answers Ŷans that include textual explanations Ŷtext and multi-object <SEG> tokens Ŷ<SEG>. The <SEG> tokens Ŷ<SEG> indicate the request for 3D segmentation masks, and the corresponding object embeddings Fseg are extracted from the LLM's embeddings. These object embeddings Fseg are further combined with the per-point features Fp within the 3D Decoder by dot product, ensuring that the object-specific information interacts with the overall point cloud features to generate the final 3D segmentation masks M̂ and classification predictions.
Figure 3. Overview of our proposed MORE3D: Given an input point cloud, the 3D Encoder first extracts per-point features Fp and projects
them into sequential features Fs . The sequential features Fs , together with the textual input Xque , are then fed into a multimodal LLM
to perform reasoning, producing textual answers Ŷans with both detailed explanations and descriptions of 3D spatial relationships among
multiple objects. Finally, embeddings for multiple <SEG> tokens and the per-point features are passed to the 3D Decoder to produce 3D
segmentation masks and classification results. The module marked with a snowflake icon is frozen during training, while those marked
with a flame icon are trainable.
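As a rough illustration of the decoding step described above, the sketch below combines per-point features and object embeddings by dot product to produce per-object soft masks. It assumes PyTorch and illustrative feature dimensions, and is not the actual MORE3D decoder.

```python
# Sketch of combining Fp and Fseg by dot product in the 3D Decoder.
import torch

def decode_masks(per_point_feats: torch.Tensor,  # Fp: (num_points, d) features from the 3D Encoder
                 object_embeds: torch.Tensor     # Fseg: (num_objects, d), one row per <SEG> token
                 ) -> torch.Tensor:
    # Dot product between each object embedding and every per-point feature yields
    # per-object point logits; a sigmoid turns them into soft segmentation masks.
    logits = object_embeds @ per_point_feats.T   # (num_objects, num_points)
    return torch.sigmoid(logits)                 # M̂: one soft mask per object

masks = decode_masks(torch.randn(50000, 256), torch.randn(3, 256))
print(masks.shape)  # torch.Size([3, 50000])
```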
4.3. Multi-Object 3D Reasoning Segmentation

We design an object-specific point embedding extraction approach to achieve multi-object 3D reasoning segmentation. The embedding extraction approach enables precise segmentation and reasoning for multiple objects of different categories in complex point clouds, integrating the LLM with a 3D point cloud decoder to generate segmentation masks and textual explanations.

The LLM takes point cloud features Fs and text queries Xque as input, generating textual answers Ŷans, which include textual explanations Ŷtext and multi-object <SEG> tokens Ŷ<SEG>. Specifically, each output textual answer consists of a paragraph of textual explanations accompanied by a set of <SEG> tokens, where the number of <SEG> tokens equals the number of 3D objects indicated in the input text query. In each answer, the 3D object's name is followed by a <SEG> token. For example, the LLM might generate the following answer: “You can use the spacious sofa <SEG> for seating, positioned near a central table <SEG> for drinks and snacks, while additional chairs <SEG> provide extra seating options for guests.” Each <SEG> token indicates a request for a point cloud segmentation mask.

Object-Specific Point Cloud Embeddings Extraction. After obtaining the multi-object <SEG> tokens Ŷ<SEG> in the output textual answer Ŷans, we extract the point cloud embeddings generated by the multimodal LLM for each <SEG> token and feed them into the 3D Decoder to generate the corresponding segmentation masks. As illustrated in Figure 4, each point cloud P is associated with N textual answers Ŷans, and we illustrate the process with the n-th textual answer ŷans for clarity. The LLM's textual answer includes multiple <SEG> tokens, and we use a multi-seg index list Iseg to record the positions of these tokens in the tokenized output. Iseg is obtained from the ground-truth answer yans during training and from the predicted answer ŷans during inference. The multi-seg index list Iseg is defined by:

Iseg = {id0, id1, ..., idS},    (1)

where S denotes the number of <SEG> tokens in the textual answer. Iseg is then used to retrieve the corresponding point cloud embeddings Fseg from the LLM's hidden states. The retrieval can be written as:

Fseg = {fi | i ∈ Iseg},    (2)

where fi is the i-th point cloud embedding in the LLM's hidden states. For the example in Figure 4, the 3rd and 5th hidden embeddings corresponding to the table and chair are extracted, matching the first and second <SEG> tokens in the answer. Each <SEG> token corresponds to a 3D object in ŷans, allowing us to obtain the point cloud embeddings from the LLM for subsequent segmentation.
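A minimal sketch of this extraction step is shown below, assuming PyTorch tensors, a flat list of token ids for one answer, and an arbitrary id reserved for the <SEG> token; it illustrates the mechanism rather than the authors' implementation.

```python
# Sketch of object-specific embedding extraction via the multi-seg index list.
import torch

def extract_seg_embeddings(hidden_states: torch.Tensor,  # (seq_len, d) last-layer LLM states for one answer
                           token_ids: list[int],
                           seg_token_id: int) -> torch.Tensor:
    # Multi-seg index list Iseg: positions of every <SEG> token in the tokenized answer.
    i_seg = [idx for idx, tok in enumerate(token_ids) if tok == seg_token_id]
    # Gather the hidden states at those positions -> Fseg, one embedding per target object.
    return hidden_states[torch.tensor(i_seg)]             # (num_seg_tokens, d)

# Toy usage: a 6-token answer whose 3rd and 5th tokens are <SEG> (assumed id 32000).
states = torch.randn(6, 256)
f_seg = extract_seg_embeddings(states, token_ids=[5, 9, 32000, 11, 32000, 2], seg_token_id=32000)
print(f_seg.shape)  # torch.Size([2, 256])
```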
4.4. Explainability

The LLM-generated textual answers Ŷans contain explanations of the implicit intention of the user questions. In addition, the answer includes descriptions of 3D spatial relations among multiple objects, which provide useful guidance for identifying these objects. Unlike most existing reasoning-based segmentation methods [12, 15, 37, 42] that just output statements like “It is <SEG>.” without further explanation, our generated answer integrates detailed explanations about 3D spatial relations to enhance segmentation. The ground truth answers Yans also incorporate these 3D spatial relations and are used to supervise the generated output textual answers, guiding the model to capture spatial information effectively. The textual answer loss is defined by:

Lans = CE(Yans, Ŷans),    (3)

where CE denotes the cross-entropy loss. The mask loss Lmask supervises the predicted 3D segmentation masks with the ground-truth masks M, and the classification loss Lcls supervises the classification prediction, where C is the ground truth classification label and Ĉ is the classification prediction.

The proposed model is trained end-to-end with the textual answer loss Lans for textual answer generation, the mask loss Lmask for point cloud segmentation mask prediction, and the classification loss Lcls for point cloud classification. The overall loss function is formulated as follows:

L = Lans + Lmask + Lcls.    (6)

5. Experiment

5.1. Experimental Settings

Evaluation Metrics. Following prior studies on reasoning segmentation [22, 33, 37], we adopt two evaluation metrics: cumulative IoU (cIoU) and generalized IoU (gIoU). cIoU is computed as the cumulative intersection over the cumulative union, while gIoU is the average Intersection-over-Union (IoU) across all samples.
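A simple sketch of the two metrics, under the assumption of boolean per-point masks, is given below; the official evaluation code may differ in details such as edge-case handling.

```python
# Sketch of cIoU (cumulative IoU) and gIoU (per-sample averaged IoU).
import numpy as np

def ciou(preds: list[np.ndarray], gts: list[np.ndarray]) -> float:
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return float(inter) / max(float(union), 1.0)   # cumulative intersection over cumulative union

def giou(preds: list[np.ndarray], gts: list[np.ndarray]) -> float:
    ious = [np.logical_and(p, g).sum() / max(np.logical_or(p, g).sum(), 1)
            for p, g in zip(preds, gts)]
    return float(np.mean(ious))                    # average IoU over all samples
```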
Table 2. Benchmarking with existing 3D segmentation methods on the ReasonSeg3D validation set.

Method          Venue    cIoU   gIoU
OpenScene [30]  CVPR 23  7.69   9.52
PLA [9]         CVPR 23  10.76  10.27
RegionPLC [45]  CVPR 24  10.82  11.06
MORE3D (Ours)   -        30.19  32.01

Implementation Details. We conduct experiments on one NVIDIA V100 GPU and train the framework for 100 epochs with a batch size of 1. We use the Adam [21] optimizer with an initial learning rate of 1 × 10−4. We adopt LLaMA-7B [36] as our multimodal LLM backbone and LoRA [14] to perform efficient fine-tuning. All experiments are conducted on the proposed ReasonSeg3D dataset.
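As a rough sketch of this setup, the snippet below wires LoRA into a causal LLM backbone with the Hugging Face peft library and the Adam settings stated above. The checkpoint path, target modules, and LoRA rank are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch of the Sec. 5.1 fine-tuning setup (Adam, lr 1e-4, LoRA on a LLaMA-7B backbone).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")   # placeholder checkpoint path
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,       # assumed LoRA hyperparameters
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)                            # only LoRA weights remain trainable
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)          # Adam with lr 1e-4 as in the paper
```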
Figure 5. Segmentation visualization over the ReasonSeg3D validation set. Each case presents a user input question, the corresponding
input point cloud, the ground-truth segmentation, and the prediction by the proposed MORE3D. Best viewed in color and zoom-in.
5.2. Benchmarking with Existing Methods

We benchmark MORE3D with state-of-the-art 3D segmentation methods on the ReasonSeg3D validation set. As Table 2 shows, our method achieves superior segmentation performance across all evaluation metrics. Specifically, OpenScene [30], PLA [9], and RegionPLC [45] are designed to comprehend input words or phrases based on CLIP-based language models. They lack the ability to reason user intention from input questions and to correctly generate answers targeting the corresponding multiple objects. Consequently, the 3D segmentation masks predicted by these methods cannot be aligned correctly with the ground-truth 3D segmentation masks, resulting in low cIoU and gIoU. In contrast, MORE3D can accurately reason the implicit intention of user questions, predicting 3D segmentation masks that are more consistent with the ground truth.

Qualitative Results. Figure 5 presents qualitative segmentation by the proposed MORE3D with two examples from the ReasonSeg3D validation set. Each example demonstrates the user question, input point cloud, ground-truth segmentation, and the 3D segmentation masks predicted by MORE3D. We can observe that MORE3D accurately predicts 3D segmentation masks that are highly aligned with the ground-truth segmentation masks. Specifically, in the example at the top, despite many chairs closely surrounding the central long table, MORE3D can correctly distinguish and segment them. For the example at the bottom, despite the similar appearance and size of tables and chairs, MORE3D can distinguish and segment them precisely. The visualization indicates that MORE3D can comprehend the implicit intention behind the user questions and segment multiple objects accurately.
Table 3. Ablation study on the approach of object-specific point cloud embeddings extraction. The best results are in bold.

Index  Approach   cIoU   gIoU
1      Random     20.30  21.23
2      with Iseg  30.19  32.01

Table 4. Ablation study of the loss functions on the ReasonSeg3D validation set. Lans and Lmask refer to the textual answer loss and the point cloud segmentation mask loss, respectively. The best results are in bold.

Index  Lans  Lmask  cIoU   gIoU
1      -     -      13.72  14.73
2      ✓     -      15.23  16.14
3      -     ✓      23.30  28.91
4      ✓     ✓      30.19  32.01

Table 5. Ablation study on the approach of point cloud decoding operation in the 3D Decoder. The best results are in bold.

Index  Operation      cIoU   gIoU
1      Addition       28.26  31.01
2      Concatenation  29.42  31.17
3      Dot Product    30.19  32.01

Table 6. Ablation study on different approaches of prediction in the 3D Decoder. The best results are in bold.

Index  Manner     cIoU   gIoU
1      Unified    29.06  30.43
2      Separated  30.19  32.01

5.3. Ablation Study

We conduct extensive ablation studies on the ReasonSeg3D validation set to evaluate our designs. Specifically, we examine MORE3D from the aspects of the approach for object-specific point cloud embeddings extraction, the loss design, the point cloud decoding operation, as well as the way of prediction.

Embedding Extraction. We examine the effectiveness of the designed multi-seg index list Iseg for the object-specific point cloud embeddings extraction in MORE3D. As Table 3 shows, using the multi-seg index list Iseg achieves better performance than adopting the random selection approach. This is largely because the LLM's embeddings corresponding to multiple objects are extracted precisely with the help of Iseg. In comparison, the random selection could extract embeddings unrelated to target objects like “table” or “chairs”, and it could also extract embeddings belonging to irrelevant non-object words, such as “around” and “suitable”. This explains why using the multi-seg index list Iseg is more effective for extracting object-specific point cloud embeddings corresponding to multiple target objects.

Loss Functions. We examine the impact of the textual answer loss Lans and the mask loss Lmask, where Lans (Equation 3) supervises the textual answers output by the multimodal LLM and Lmask supervises the 3D point cloud segmentation prediction. As Table 4 shows, without Lans and Lmask, the performance drops greatly due to the lack of supervision from both ground-truth textual answers and segmentation masks, which hinders the model from generating appropriate answers and accurate 3D segmentation masks. When either Lans or Lmask is used, the performance improves consistently due to partial supervision from the ground-truth textual answers or the segmentation masks. When both Lans and Lmask are employed, the reasoning ability for generating answers and the capability of predicting 3D segmentation masks of MORE3D are both improved by large margins, demonstrating the effectiveness and synergy of the two designed losses.

Decoding Operation. We examine how different point cloud decoding operations affect 3D reasoning segmentation. Specifically, we evaluate three point cloud decoding operations that integrate per-point features and object embeddings to predict the 3D segmentation masks. As Table 5 shows, the Addition and Concatenation operations achieve lower performance compared with the Dot Product operation, since they either add or concatenate the embeddings of each object with the per-point features of the global scene, introducing abundant features of irrelevant objects and leading to inferior performance. In contrast, the Dot Product operation selectively extracts only the relevant features from the per-point features of the global scene by identifying those highly correlated with the object embeddings. This leads to better performance thanks to its extracted object-specific features.

Prediction Approaches. We examine the effectiveness of different prediction approaches, including unified and separated approaches, for the prediction of 3D segmentation masks and classification. Table 6 shows the experimental results. For the unified approach, the 3D segmentation masks and classification results are predicted by a single network branch instead of two separate branches. This results in lower performance due to the increased learning burden when a single network branch must learn to perform segmentation and classification simultaneously. In contrast, the separated approach, as illustrated in the bottom right of Figure 3, assigns each task to an independent network branch to learn. This allows each branch to focus on its respective task, which lowers the learning burden and clearly improves performance.
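A minimal sketch of such a separated design is shown below: two independent heads operate on the object embeddings, one feeding the dot-product mask prediction and one predicting categories. The layer sizes are assumptions, and this is not the authors' implementation.

```python
# Sketch of the "separated" prediction manner favored in Table 6.
import torch.nn as nn

class SeparatedHeads(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 20):
        super().__init__()
        self.mask_proj = nn.Linear(dim, dim)         # refines object embeddings before the dot product with Fp
        self.cls_head = nn.Linear(dim, num_classes)  # independent branch for category prediction

    def forward(self, object_embeds, per_point_feats):
        mask_logits = self.mask_proj(object_embeds) @ per_point_feats.T  # (num_objects, num_points)
        class_logits = self.cls_head(object_embeds)                      # (num_objects, num_classes)
        return mask_logits, class_logits
```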
6. Conclusion

This paper presents a novel multi-object 3D reasoning segmentation task, producing both 3D segmentation masks and textual explanations with rich 3D spatial relations among multiple objects within complex 3D scenes. To this end, we develop ReasonSeg3D, a large-scale benchmark featuring rich 3D spatial relations integrated into question-answer pairs, designed to evaluate multi-object 3D reasoning segmentation effectively. On top of this, we propose MORE3D, a technique for multi-object 3D reasoning segmentation with textual explanations, demonstrating strong reasoning abilities in response to user questions. Extensive experiments validate the effectiveness of MORE3D. Future work will focus on generalizing our work to diverse and challenging 3D environments, such as outdoor scenes, to expand the scope of applications.

References

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.

[2] Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19129–19139, 2022.

[3] Mohamed El Amine Boudjoghra, Angela Dai, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, and Fahad Shahbaz Khan. Open-yolo 3d: Towards fast and accurate open-vocabulary 3d instance segmentation. arXiv preprint arXiv:2406.02548, 2024.

[4] Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Gang Yu, and Tao Chen. End-to-end 3d dense captioning with vote2cap-detr. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11124–11133, 2023.

[5] Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26428–26438, 2024.

[6] Tianrun Chen, Chunan Yu, Jing Li, Jianqi Zhang, Lanyun Zhu, Deyi Ji, Yong Zhang, Ying Zang, Zejian Li, and Lingyun Sun. Reasoning3d–grounding and reasoning in 3d: Fine-grained zero-shot open-vocabulary 3d reasoning part segmentation via large vision-language models. arXiv preprint arXiv:2405.19326, 2024.

[7] Zhenyu Chen, Ronghang Hu, Xinlei Chen, Matthias Nießner, and Angel X Chang. Unit3d: A unified transformer for 3d dense captioning and visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18109–18119, 2023.

[8] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.

[9] Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, and Xiaojuan Qi. Pla: Language-driven open-vocabulary 3d scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7010–7019, 2023.

[10] Zoey Guo, Yiwen Tang, Ray Zhang, Dong Wang, Zhigang Wang, Bin Zhao, and Xuelong Li. Viewrefer: Grasp the multi-view knowledge for 3d visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15372–15383, 2023.

[11] Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615, 2023.

[12] Shuting He, Henghui Ding, Xudong Jiang, and Bihan Wen. Segpoint: Segment any point cloud via large language model. Proceedings of the IEEE/CVF European Conference on Computer Vision, 2024.

[13] Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. Advances in Neural Information Processing Systems, 36:20482–20494, 2023.

[14] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

[15] Kuan-Chih Huang, Xiangtai Li, Lu Qi, Shuicheng Yan, and Ming-Hsuan Yang. Reason3d: Searching and reasoning 3d segmentation via large language model. arXiv preprint arXiv:2405.17427, 2024.

[16] Pin-Hao Huang, Han-Hung Lee, Hwann-Tzong Chen, and Tyng-Luh Liu. Text-guided graph neural networks for referring 3d instance segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1610–1618, 2021.

[17] Shijia Huang, Yilun Chen, Jiaya Jia, and Liwei Wang. Multi-view transformer for 3d visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15524–15533, 2022.

[18] Zhening Huang, Xiaoyang Wu, Xi Chen, Hengshuang Zhao, Lei Zhu, and Joan Lasenby. Openins3d: Snap and lookup for 3d open-vocabulary instance segmentation. arXiv preprint arXiv:2309.00616, 2023.
[19] Weitai Kang, Mengxue Qu, Jyoti Kini, Yunchao Wei, Mubarak Shah, and Yan Yan. Intent3d: 3d object detection in rgb-d scans based on human intention. arXiv preprint arXiv:2405.18295, 2024.

[20] Amrin Kareem, Jean Lahoud, and Hisham Cholakkal. Paris3d: Reasoning-based 3d part segmentation using large multimodal model. Proceedings of the IEEE/CVF European Conference on Computer Vision, 2024.

[21] Diederik P Kingma. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.

[22] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.

[23] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.

[24] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2023.

[25] Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, and Shijian Lu. Weakly supervised 3d open-vocabulary segmentation. Advances in Neural Information Processing Systems, 36:53433–53456, 2023.

[26] Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes. International Conference on Learning Representations, 2023.

[27] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision, pages 565–571. IEEE, 2016.

[28] Phuc DA Nguyen, Tuan Duc Ngo, Chuang Gan, Evangelos Kalogerakis, Anh Tran, Cuong Pham, and Khoi Nguyen. Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[29] Maria Parelli, Alexandros Delitzas, Nikolas Hars, Georgios Vlassis, Sotirios Anagnostidis, Gregor Bachmann, and Thomas Hofmann. Clip-guided vision-language pre-training for question answering in 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5607–5612, 2023.

[30] Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 815–824, 2023.

[31] Renjie Pi, Jiahui Gao, Shizhe Diao, Rui Pan, Hanze Dong, Jipeng Zhang, Lewei Yao, Jianhua Han, Hang Xu, Lingpeng Kong, et al. Detgpt: Detect what you need via reasoning. Conference on Empirical Methods in Natural Language Processing, 2023.

[32] Zhipeng Qian, Yiwei Ma, Jiayi Ji, and Xiaoshuai Sun. X-refseg3d: Enhancing referring 3d instance segmentation via structured cross-modal graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4551–4559, 2024.

[33] Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024.

[34] David Rozenberszki, Or Litany, and Angela Dai. Language-grounded indoor 3d semantic segmentation in the wild. In Proceedings of the IEEE/CVF European Conference on Computer Vision, pages 125–141. Springer, 2022.

[35] Ayça Takmaz, Elisabetta Fedele, Robert W Sumner, Marc Pollefeys, Federico Tombari, and Francis Engelmann. Openmask3d: Open-vocabulary 3d instance segmentation. Advances in Neural Information Processing Systems, 2023.

[36] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[37] Junchi Wang and Lei Ke. Llm-seg: Bridging image segmentation and large language model reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1765–1774, 2024.

[38] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. Advances in Neural Information Processing Systems, 36, 2023.

[39] Changli Wu, Yihang Liu, Jiayi Ji, Yiwei Ma, Haowei Wang, Gen Luo, Henghui Ding, Xiaoshuai Sun, and Rongrong Ji. 3d-gres: Generalized 3d referring expression segmentation. Proceedings of the ACM International Conference on Multimedia, 2024.

[40] Changli Wu, Yiwei Ma, Qi Chen, Haowei Wang, Gen Luo, Jiayi Ji, and Xiaoshuai Sun. 3d-stmn: Dependency-driven superpoint-text matching network for end-to-end 3d referring expression segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5940–5948, 2024.

[41] Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Wenhai Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, et al. Visionllm v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks. arXiv preprint arXiv:2406.08394, 2024.

[42] Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. Visa: Reasoning video object segmentation via large language models. Proceedings of the IEEE/CVF European Conference on Computer Vision, 2024.

[43] Mi Yan, Jiazhao Zhang, Yan Zhu, and He Wang. Maskclustering: View consensus based mask graph clustering for open-vocabulary 3d instance segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[44] Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F Fouhey, and Joyce Chai. Llm-grounder: Open-vocabulary 3d visual grounding with large language model as an agent. In International Conference on Robotics and Automation, pages 7694–7701. IEEE, 2024.

[45] Jihan Yang, Runyu Ding, Weipeng Deng, Zhe Wang, and Xiaojuan Qi. Regionplc: Regional point-language contrastive learning for open-world 3d scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19823–19832, 2024.

[46] Yuqi Yang, Peng-Tao Jiang, Jing Wang, Hao Zhang, Kai Zhao, Jinwei Chen, and Bo Li. Empowering segmentation ability to multi-modal large language models. arXiv preprint arXiv:2403.14141, 2024.

[47] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.

[48] Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Sheng Wang, Zhen Li, and Shuguang Cui. Instancerefer: Cooperative holistic understanding for visual grounding on point clouds through instance multi-level contextual referring. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1791–1800, 2021.

[49] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. Conference on Empirical Methods in Natural Language Processing, 2023.

[50] Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, and Ping Luo. Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601, 2023.

[51] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. International Conference on Learning Representations, 2024.

[52] Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, and Qing Li. 3d-vista: Pre-trained transformer for 3d vision and text alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2911–2921, 2023.