EDA: Explicit Text-Decoupling and Dense Alignment For 3D Visual Grounding
Yanmin Wu1 Xinhua Cheng1 Renrui Zhang2,3 Zesen Cheng1 Jian Zhang1∗
1 Shenzhen Graduate School, Peking University, China
2 The Chinese University of Hong Kong, China 3 Shanghai AI Laboratory, China
wuyanmin@stu.pku.edu.cn zhangjian.sz@pku.edu.cn
arXiv:2209.14941v3 [cs.CV] 24 Apr 2023
[Figure 1, panel (a) Regular 3D visual grounding. Example utterances: “This is a tall gray trash can. It is under the left side of the counter, to the left of the door when you enter.” “This is a black trash can. It is under a bathroom counter.” “The trash can is white and plastic. It is attached to the wall to the left of the toilet.”]
Figure 1. Text-decoupled, dense aligned 3D visual grounding. Different colours in the text correspond to different decoupled components. (a) Regular 3D visual grounding: locating objects requires comprehensively considering multiple semantic cues such as appearance attributes, object names, and spatial relationships. (b) Grounding without object name: not mentioning object names, avoiding shortcuts and forcing the model to predict the target based on other attributes.
tached much attention as an important 3D cross-modal task. Its objective is to find the target object in point cloud scenes by analyzing the descriptive query language, which requires understanding both 3D visual and linguistic context.

Language utterances typically involve words describing appearance attributes, object categories, spatial relationships and other characteristics, as shown by different colours in Fig. 1(a), requiring that the model integrate multiple cues to locate the mentioned object. Compared with 2D Visual Grounding [18,19,67,71], the sparseness and incompleteness of point clouds, and the diversity of language descriptions produced by 3D multi-view, make 3D VG more challenging. Existing works have made significant progress from the following perspectives: improving point cloud feature extraction by sparse convolution [70] or 2D image assistance [68]; generating more discriminative object candidates through instance segmentation [33] or language modulation [46]; identifying complex spatial relationships between entities via graph convolution [23] or attention [6].

However, we observe two issues that remain unexplored. 1) Imbalance: The object name can exclude most candidates, and in some cases there is only one name-matched object, such as the “door” and “refrigerator” in Fig. 1(b1, b2). This shortcut may lead to an inductive bias in the model that pays more attention to object names while weakening other properties such as appearance and relationships, resulting in imbalanced learning. 2) Ambiguity: Utterances frequently refer to multiple objects and attributes (such as “black object, tall shelf, fan” in Fig. 1(b4)), while the model’s objective is to identify only the main object, leading to an ambiguous understanding of language descriptions. These insufficiencies of existing works stem from their characteristic of coupling features and fusing them implicitly. They input a sentence with different attribute words but output only one globally coupled sentence-level feature that is subsequently matched against the visual features of candidate objects. The coupled feature is ambiguous because some words may not describe the main object (green text in Fig. 1) but other auxiliary objects (red text in Fig. 1). Alternatively, they use the cross-modal attention of the Transformer [21, 63] to fuse visual and text features automatically and implicitly. However, this may encourage the model to take shortcuts, such as focusing on object categories and ignoring other attributes, as previously discussed.

Instead, we propose a more intuitive decoupled and explicit strategy. First, we parse the input text to decouple different semantic components, including the main object word, pronoun, attributes, relations, and auxiliary object words. Then, performing dense alignment between point cloud objects and multiple related decoupled components achieves fine-grained feature matching, which avoids the inductive bias resulting from imbalanced learning of different textual components. As the final grounding result, we explicitly select the object with the highest similarity to the decoupled text components (instead of the entire sentence), avoiding ambiguity caused by irrelevant components. Additionally, to explore the limits of VG and examine the comprehensiveness and fine-graininess of the model’s visual-language perception, we suggest a challenging new task: Grounding without object name (VG-w/o-ON), where the name is replaced by “object” (see Fig. 1(b)), forcing the model to locate objects based on other attributes and relationships. This setting makes sense because utterances that do not mention object names are common expressions in daily life, and it additionally tests whether the model takes shortcuts. Benefiting from our text decoupling operation and the supervision of dense aligned losses, all text components are aligned with visual features, making it possible to locate objects independent of object names.

To sum up, the main contributions of this paper are as follows: 1) We propose a text decoupling module to parse linguistic descriptions into multiple semantic components, followed by two well-designed dense aligned losses for supervising fine-grained visual-language feature fusion and preventing imbalanced and ambiguous learning. 2) The challenging new 3D VG task of grounding without object names is proposed to comprehensively examine the model’s robust performance. 3) We achieve state-of-the-art performance on two datasets (ScanRefer and SR3D/NR3D) on the regular 3D VG task and absolute leadership on the new task evaluated by the same model without retraining.

2. Related Work

2.1. 3D Vision and Language

3D vision [52, 53, 75] and language are vital means for humans to understand the environment, and they are also important research topics for the evolution of machines to be like humans. Previously, the two fields evolved independently. Due to the advance of multimodality [25, 26, 54, 56, 57, 72–74, 76], many promising works across 3D vision and language have been introduced recently. In 3D visual grounding [1, 8–10, 12], the speaker (like a human) describes an object in language. The listener (such as a robot) needs to understand the language description and the 3D visual scene to ground the target object. Conversely, 3D dense captioning [9, 13, 14, 36, 65, 69] is analogous to an inverse process in which the input is a 3D scene, and the output is textual descriptions of each object. Language-modulated 3D detection or segmentation [7, 32, 34, 59, 78] enriches the diversity of text queries by matching visual-linguistic feature spaces rather than predicting the probability of a set number of categories. Furthermore, some studies explore the application of 3D visual language in agents such as robot perception [27, 61], vision-and-language navigation (VLN) [15, 31, 55], and embodied question answering
(EQA) [4,22,30,45,47,62]. In this paper, we focus on point cloud-based 3D visual grounding, which is the fundamental technology for many embodied AI [24, 38, 48] tasks.

2.2. 3D Visual Grounding

The majority of current mainstream techniques are two-stage. In the first stage, the features of the query language and candidate point cloud objects are obtained independently by a pre-trained language model [16, 20, 49] and a pre-trained 3D detector [44, 51] or segmenter [11, 35, 64]. In the second stage, researchers focus on fusing the two modal features and then selecting the best-matched object. 1) The most straightforward solution is to concatenate the two modal features and then treat the task as a binary classification problem [8, 39], which provides limited performance because the two features are not sufficiently fused. 2) Taking advantage of the Transformer’s attention mechanism, which is naturally suitable for multi-modal feature fusion, He et al. [28] and Zhao et al. [77] achieve remarkable performance by applying self-attention and cross-attention to the features. 3) In contrast, other studies view feature fusion as a matching problem rather than a classification problem. Yuan et al. [70] and Abdelreheem et al. [2], supervised by the contrastive loss [29], compute the cosine similarity of visual features and textual features. Inspired by [43], Feng et al. [23] parse the text to generate a text scene graph, simultaneously build a visual scene graph, and then perform graph node matching. 4) The sparsity, noise, incompleteness, and lack of detail of point clouds make learning objects’ semantic information challenging. Yang et al. [68] and Cai et al. [6] use 2D images to aid visual-textual feature fusion, but at the cost of additional 2D-3D alignment and 2D feature extraction.

However, the two-stage method has a substantial detection bottleneck: objects overlooked in the first stage cannot be matched in the second. In contrast, object detection and feature extraction in the single-stage method are modulated by the query text, making it easier to identify the text-concerned object. Liu et al. [41] suggest fusing visual and linguistic features at the bottom level and producing text-related visual heatmaps. Similarly, Luo et al. [46] present a single-stage approach that employs textual features to guide visual keypoint selection and progressively localizes objects. BUTD-DETR [34] is also a single-stage capable framework. More importantly, inspired by 2D image-language pre-training models (such as MDETR [37] and GLIP [40]), BUTD-DETR measures the similarity between each word and object and then selects the features of the word that corresponds to the object’s name to match the candidate object. However, there are two limitations: 1) Since multiple object names may be mentioned in a sentence, the ground truth annotation is needed to retrieve the target name, which limits its generalizability. Our text decoupling module separates text components and determines the target object name by grammatical analysis to avoid this restriction. 2) BUTD-DETR (and MDETR and GLIP in the 2D task) only consider the sparse alignment of main object words or noun phrases to visual features. Conversely, we align all object-related decoupled textual semantic components with visual features, which we refer to as dense alignment, significantly enhancing the discriminability of multimodal features.

[Figure 2 panels: (a) query text “It is a brown chair with brown armrests and four legs. It is directly under a blackboard”; (b) Dependency Trees; (c) Component Decoupling, with labels (Main obj.) chair, (Attributes) armrests, (Pronoun) It, (Relationship) under, (Auxi. obj.) blackboard.]
Figure 2. Text component decoupling: (a) The query text. (b) Dependency tree analysis. (c) Decoupled into five components.

3. Proposed Method

The framework is illustrated in Fig. 3. First, the input text description is decoupled into multiple semantic components, and their affiliated text positions and features are obtained (Sec. 3.1). Concurrently, the Transformer-based encoder extracts and modulates features from point clouds and text, then decodes the visual features of candidate objects (Sec. 3.2). Finally, the dense aligned losses are derived between the decoupled text features and the decoded visual features (Sec. 3.3). The grounding result is the object with visual features most similar to text features (Sec. 3.4).

3.1. Text Decoupling

The text features of the coupled strategy are ambiguous, since features from multiple objects and attributes are coupled, as in “a brown wooden chair next to the black table.” Among them, easy-to-learn clues (such as the category “chair” or the colour “brown”) may predominate, weakening other attributes (such as the material “wooden”); words of other objects (such as the “black table”) may cause interference. To produce more discriminative text features and fine-grained cross-modal feature fusion, we decouple the query text into different semantic components, each independently aligned with visual features, avoiding the ambiguity caused by feature coupling.

Text Component Decoupling. Analyzing grammatical dependencies between words is a fundamental task in NLP. We first use an off-the-shelf tool [60, 66] to parse the language description grammatically and generate the grammatical dependency trees, as shown in Fig. 2(b). Each sentence contains only one ROOT node, and each remaining word has a corresponding parent node. Then according to
[Figure 3 panels (a)-(c), Text Decouple:
Component   (a) Decoupled Components   (b) Decoupled Text Position     (c) Decoupled Text Feature
Main        chair                      L_Main   000010000000000...     t_Main
Attri.      brown, armrests, legs      L_Attri  000100100100000...     t_Attri
Auxi.       blackboard                 L_Auxi   00000000000...0001     t_Auxi
Pron.       It                         L_Pron   000000000001000...     t_Pron
Rel.        under                      L_Rel    00000000000...0100     t_Rel]
Figure 3. The system framework. (a-c): Decouple the input text into several components to acquire the position label L and features t of the decoupled text. (d-e): Transformer-based encoders for cross-modal visual-text feature extraction. (f): Decode proposal features O′ and linearly project them as object position labels L_pred and object features o, in addition to a box prediction head for regression of the bounding box. (g-h): Visual-text feature dense alignment. Note that the additional 3D object detection procedure is optional.
the words’ part-of-speech and dependencies, we decouple the long text into five semantic components (see Fig. 2(c)): Main object - the target object mentioned in the utterance; Auxiliary object - the one used to assist in locating the main object; Attributes - objects’ appearance, shape, etc.; Pronoun - the word that stands in for the main object; Relationship - the spatial relation between the main object and the auxiliary object. Note that attributes affiliated with pronouns are treated as attached to the main object, thus connecting two sentences in an utterance.

Text Position Decoupling. After decoupling each text component (Fig. 3(a)), we generate the position labels (similar to a mask) L_main, L_attri, L_auxi, L_pron, L_rel ∈ R^{1×l} for the component’s associated words (Fig. 3(b)), where l=256 is the maximum length of the text; each component’s word position is set to 1 and the rest to 0. The label will be used to construct the position alignment loss and supervise the classification of objects. The classification result is not one of a predetermined number of object categories but the position of the text with the highest semantic similarity.

Text Feature Decoupling. The feature of each word (token) is produced in the backbone of multimodal feature extraction (Fig. 3(d)). The text feature of a decoupled component can be derived by dot-multiplying all words’ features t with its position label L, as shown in Fig. 3(c). The decoupled text features and visual features will be independently aligned under the supervision of the semantic alignment loss. Note that in the decoupled text features, the semantics of the corresponding components absolutely predominate, but as a result of the Transformer’s attention mechanism, they also implicitly contain the global sentence’s information. In other words, feature decoupling produces individual features while keeping the global context.

3.2. Multimodal Feature Extraction

We employ BUTD-DETR’s encoder-decoder module for feature extraction and intermodulation of cross-modal features. We strongly recommend that the reader refer to Fig. 3.

Input Modal Tokenization. The input text and 3D point clouds are encoded by the pre-trained RoBERTa [42] and PointNet++ [53], producing text tokens T ∈ R^{l×d} and visual tokens V ∈ R^{n×d}. Additionally, the GroupFree [44] detector is used to detect 3D boxes, which are subsequently encoded as box tokens B ∈ R^{b×d}. Note that GroupFree is optional: the final predicted object of the network comes from the prediction head (see below), and the box tokens merely assist in better regression of the target object.

Encoder-Decoder. Self-attention and cross-attention are performed in the encoder to update both visual and text features, obtaining cross-modal features V′, T′ while keeping the dimensions. The top-k (k=256) visual features are selected, linearly projected as query proposal features O ∈ R^{k×d}, and updated as O′ in the decoder.

Prediction Head. 1) The decoded proposal features O′ ∈ R^{k×d} are fed into an MLP that outputs the predicted position labels L_pred ∈ R^{k×l}, which are then used to calculate the position alignment loss with the decoupled text position labels L ∈ R^{1×l}. 2) Additionally, the proposal features are linearly projected as object features o ∈ R^{k×64} by another MLP, which are then used to compute the semantic alignment loss with the similarly linearly projected text feature t ∈ R^{l×64}. 3) Lastly, a box prediction head [44] regresses the bounding box of the object.

3.3. Dense Aligned Loss

3.3.1 Dense Position Aligned Loss

The objective of position alignment is to ensure that the distribution of language-modulated visual features closely matches that of the query text description, as shown in Fig. 3(g). This process is similar to standard object detection’s one-hot label prediction. However, rather than being limited by the number of categories, we predict the position of text that is similar to objects.

The constructed ground truth text distribution of the mentioned main object is obtained by element-wise summing the position labels of the associated decoupled text components:

P_{main} = \lambda_1 L_{main} + \lambda_2 L_{attri} + \lambda_3 L_{pron} + \lambda_4 L_{rel},    (1)

where λ is the weight of different parts (refer to the parametric search in the Supplementary Material). P_auxi = L_auxi represents the text distribution of the auxiliary object. The remaining candidate objects’ text distribution is P_oth, with the final bit set to 1 (see ∅ in Fig. 3(g)). Therefore, the ground truth text distribution of all k candidate objects is P_text = {P_main, P_auxi, P_oth} ∈ R^{k×l}.

The predicted visual distribution of k objects is produced by applying softmax to the output L_pred ∈ R^{k×l} of the prediction head:

P_{obj} = \mathrm{Softmax}(L_{pred}).    (2)

Their KL divergence is defined as the position-aligned loss:

\mathcal{L}_{pos} = \sum_{i=1}^{k} [ P_{text}^{i} \log(P_{text}^{i}) - P_{text}^{i} \log(P_{obj}^{i}) ].    (3)

We highlight that “dense alignment” indicates that the target object is aligned with the positions of multiple components (Eq. (1)), significantly different from BUTD-DETR (and MDETR for 2D tasks), which only sparsely aligns with the object name’s position L_main.

3.3.2 Dense Semantic Aligned Loss

Semantic alignment aims to learn the similarity of visual-text multimodal features through contrastive learning. The object loss of semantic alignment is defined as follows:

\mathcal{L}_{sem\_o} = \sum_{i=1}^{k} \frac{1}{|\mathbf{T}_{i}^{+}|} \sum_{\boldsymbol{t}_i \in \mathbf{T}_{i}^{+}} -\log \left( \frac{\exp\left( w_{+} \ast (\boldsymbol{o}_{i}^{\top} \boldsymbol{t}_{i} / \tau) \right)}{\sum_{j=1}^{l} \exp\left( w_{-} \ast (\boldsymbol{o}_{i}^{\top} \boldsymbol{t}_{j} / \tau) \right)} \right),    (4)

where o and t are the object and text features after linear projection, and o^⊤ t/τ is their similarity, as shown in Fig. 3(h). k and l are the number of objects and words. t_i is the positive text feature of the ith candidate object. Taking the main object as an example, the positive text feature set T_i^+ corresponding to it is:

\boldsymbol{t}_i \in \mathbf{T}_i^{+} = \left\{ \boldsymbol{t}_{main}, \boldsymbol{t}_{attri}, \boldsymbol{t}_{pron}, \boldsymbol{t}_{rel} \right\},    (5)

and w_+ is the weight of each positive term. t_j is the feature of the jth text; note that the negative similarity weight w_− for the auxiliary object term t_auxi is 2, while the rest are weighted 1. The text loss of semantic alignment is defined similarly:

\mathcal{L}_{sem\_t} = \sum_{i=1}^{l} \frac{w_{+}}{|\mathbf{O}_{i}^{+}|} \sum_{\boldsymbol{o}_i \in \mathbf{O}_{i}^{+}} -\log \left( \frac{\exp\left( \boldsymbol{t}_{i}^{\top} \boldsymbol{o}_{i} / \tau \right)}{\sum_{j=1}^{k} \exp\left( \boldsymbol{t}_{i}^{\top} \boldsymbol{o}_{j} / \tau \right)} \right),    (6)

where o_i ∈ O_i^+ is the positive object feature of the ith text, and o_j is the feature of the jth object. The final semantic alignment loss is the mean of the two: L_sem = (L_sem_o + L_sem_t)/2.

Similarly, the semantic alignment of multiple text components (Eq. (5)) with visual features also illustrates our insight of “dense.” This is intuitive: in “It is a brown chair with legs under a blackboard,” the main object’s visual features should be not only similar to “chair” but also similar to “brown, legs” and as distinct from “blackboard” as possible.

The total loss for training also includes the box regression loss. Refer to the Supplementary Material for details.

3.4. Explicit Inference

Because of our text decoupling and dense alignment operations, object features fuse multiple related text component features, allowing for the computation of the similarity between individual text components and candidate objects. For instance, S_main = Softmax(o^⊤ t_main/τ) indicates the similarity between objects o and the main text component t_main. Therefore, the similarity of objects and related components can be explicitly combined to obtain the total score and select the candidate with the highest score:

S_{all} = S_{main} + S_{attri} + S_{pron} + S_{rel} - S_{auxi},    (7)

where the definitions of S_attri, S_pron, S_rel, and S_auxi are similar to that of S_main. If supervision of auxiliary objects is provided during training, the auxiliary object can be identified by solely computing the similarity between the object features and the auxiliary component’s text features: S_all = S_auxi. Being able to infer the object based on part of the text is a significant sign that the network has learned a well-aligned and fine-grained visual-text feature space.
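As a minimal sketch of the explicit inference in Eq. (7), the snippet below scores toy candidates against decoupled component features. It is an illustration under stated assumptions rather than the authors' implementation: the softmax is assumed to run over the candidate objects, τ = 0.1 is a hypothetical temperature, the function names (`component_scores`, `explicit_inference`) and the 3-dimensional toy features are made up for the example, and components missing from an utterance are simply skipped.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def component_scores(objects, text_feature, tau):
    """S_c = Softmax(o^T t_c / tau); softmax over candidates is an assumption."""
    return softmax([dot(o, text_feature) / tau for o in objects])


def explicit_inference(objects, components, tau=0.1):
    """Combine per-component scores as in Eq. (7).

    S_all = S_main + S_attri + S_pron + S_rel - S_auxi.
    Returns (index of the best candidate, list of S_all values).
    """
    signs = (("main", 1), ("attri", 1), ("pron", 1), ("rel", 1), ("auxi", -1))
    total = [0.0] * len(objects)
    for name, sign in signs:
        feature = components.get(name)
        if feature is None:  # this utterance has no such component
            continue
        scores = component_scores(objects, feature, tau)
        total = [t + sign * s for t, s in zip(total, scores)]
    best = max(range(len(objects)), key=lambda i: total[i])
    return best, total


# Toy scene (hypothetical features): candidate 0 resembles the main object,
# candidate 1 resembles the auxiliary object, candidate 2 is a distractor.
objects = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]
components = {
    "main": [1.0, 0.0, 0.0],
    "attri": [0.9, 0.1, 0.0],
    "auxi": [0.0, 1.0, 0.0],
}
best, scores = explicit_inference(objects, components)
print(best, [round(s, 3) for s in scores])
```

In this toy scene the first candidate wins because it matches the main and attribute components, while the candidate resembling the auxiliary object is pushed down by the subtracted S_auxi term.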
4. Experiments

First, we conduct comprehensive and fair comparisons with SOTA methods in the Regular 3D Visual Grounding setting in Sec. 4.1. Then, in Sec. 4.2, we introduce our proposed new task, Grounding without Object Name, and perform comparison and analysis. Implementation details, additional experiments and more qualitative results are detailed in the supplementary material.

4.1. Regular 3D Visual Grounding

4.1.1 Experiment settings

We keep the same settings as existing works, with ScanRefer [8] and SR3D/NR3D [3] as datasets and Acc@0.25IoU and Acc@0.5IoU as metrics. Based on the visual data of ScanNet [17], ScanRefer adds 51,583 manually annotated text descriptions about objects. These complex and free-form descriptions involve object categories and attributes such as colour, shape, size, and spatial relationships. SR3D/NR3D is also proposed based on ScanNet, with SR3D including 83,572 simple machine-generated descriptions and NR3D containing 41,503 descriptions similar to ScanRefer’s human annotation. The difference is that in the ScanRefer configuration, detecting and matching objects are required, while SR3D/NR3D is simpler: it supplies GT boxes for all candidate objects, and the model only needs to classify the boxes and choose the target object.

4.1.2 Comparison to the state of the art

ScanRefer. Table 1 reports the results on the ScanRefer dataset. i) Our method achieves state-of-the-art performance by a substantial margin, with an overall improvement of 4.2% and 3.7% to 54.59% and 42.26%. ii) Some studies [6, 8, 9, 46, 68, 77] showed that supplementary 2D images with detailed and dense semantics help learn better point cloud features. Surprisingly, we use only sparse 3D point cloud features and still outperform the 2D-assisted methods. This superiority illustrates that our decoupling and dense alignment strategies mine more efficient and meaningful visual-text co-representations. iii) Another finding is that the accuracy of most existing techniques is less than 40% and 30% in the “multiple” setting, because “multiple” means that the category of the target object mentioned in the language is not unique, so there are more interfering candidates of the same category. However, we reach a remarkable 49.13% and 37.64%. To identify similar objects, a finer-grained understanding of the text and vision is required in this complex setting. iv) The last three rows in Table 1 compare single-stage methods, where our method’s single-stage implementation omits the object detection step (B in Fig. 3) in training and inference. The result illustrates that, while not requiring an additional pre-trained 3D object detector, our approach can also achieve SOTA performance. v) The qualitative results are depicted in Fig. 4(a-c), which reveal that our method has an excellent perception of appearance attributes, spatial relationships, and even ordinal numbers.

SR3D/NR3D. Table 2 shows the accuracy on the SR3D/NR3D dataset, where we achieve the best performance of 68.1% and 52.1%. In SR3D, since the language descriptions are concise and the object is easy to identify, our method and [34, 46] reach an accuracy of over 60%. Conversely, in NR3D, descriptions are detailed and complex, causing additional challenges for text decoupling. However, we still achieve SOTA accuracy with 3D-only data, while other comparable methods [46,68] rely on additional 2D images for training. Some methods [6, 8, 9] compared in Table 1 are not discussed here because they are not evaluated on the SR3D/NR3D dataset. In addition, because GT boxes of candidate objects are provided in this setting, the single-stage methods are not applicable and are not discussed.

Method                            Venue     Modality  Unique (∼19%)   Multiple (∼81%) Overall
                                                      0.25    0.5     0.25    0.5     0.25            0.5
ScanRefer [8]                     ECCV2020  3D        67.64   46.19   32.06   21.26   38.97           26.10
ScanRefer [8]                     ECCV2020  3D+2D     76.33   53.51   32.73   21.11   41.19           27.40
ReferIt3D [3]                     ECCV2020  3D        53.8    37.5    21.0    12.8    26.4            16.9
TGNN [33]                         AAAI2021  3D        68.61   56.80   29.84   23.18   37.37           29.70
InstanceRefer [70]                ICCV2021  3D        77.45   66.83   31.27   24.77   40.23           32.93
SAT [68]                          ICCV2021  3D+2D     73.21   50.83   37.64   25.16   44.54           30.14
FFL-3DOG [23]                     ICCV2021  3D        78.80   67.94   35.19   25.70   41.33           34.01
3DVG-Transformer [77]             ICCV2021  3D        77.16   58.47   38.38   28.70   45.90           34.47
3DVG-Transformer [77]             ICCV2021  3D+2D     81.93   60.64   39.30   28.42   47.57           34.67
3D-SPS [46]                       CVPR2022  3D+2D     84.12   66.72   40.32   29.82   48.82           36.98
3DJCG [6]                         CVPR2022  3D        78.75   61.30   40.13   30.08   47.62           36.14
3DJCG [6]                         CVPR2022  3D+2D     83.47   64.34   41.39   30.82   49.56           37.33
BUTD-DETR [34] †                  ECCV2022  3D        82.88   64.98   44.73   33.97   50.42           38.60
D3Net [9]                         ECCV2022  3D+2D     -       70.35   -       30.50   -               37.87
EDA                               -         3D        85.76   68.57   49.13   37.64   54.59 (+4.2%)   42.26 (+3.7%)
3D-SPS (single-stage) [46]        CVPR2022  3D        81.63   64.77   39.48   29.61   47.65           36.43
BUTD-DETR (single-stage) [34] ‡   ECCV2022  3D        81.47   61.24   44.20   32.81   49.76           37.05
EDA (single-stage) §              -         3D        86.40   69.42   48.11   36.82   53.83           41.70

Table 1. The 3D visual grounding results on ScanRefer, accuracy evaluated by IoU 0.25 and IoU 0.5. † The accuracy is reevaluated using our parsed text labels because the performance reported by BUTD-DETR used ground truth text labels and ignored some challenging samples (see supplementary materials for more details). § Our single-stage implementation without the assistance of the additional 3D object detection step (dotted arrows in Fig. 3). ‡ BUTD-DETR did not provide single-stage results and we retrained the model.

4.1.3 Ablation studies

Loss ablation. The ablation of the position-aligned loss and the semantic-aligned loss is shown in Table 3. The performance of the semantic-aligned loss is marginally better because its contrastive loss not only shortens the distance between similar text-visual features but also enlarges the distance between mismatched features (for example, t_auxi is a negative term in Eq. (4)), whereas the position-aligned loss only considers object-related components (as in Eq. (1)). When both losses supervise together, the best accuracy is achieved, demonstrating that they are complementary.

Dense components ablation. To demonstrate our insight into dense alignment, we perform ablation analysis on the different decoupled text components, and the results are displayed in the “Regular VG” column in Table 4. Analysis: i) (a) is our baseline implementation of the sparse concept, using only the “Main object” component decoupled from the text. In contrast to BUTD-DETR, the text decoupling module (Sec. 3.1) is used to obtain text labels and features during training and inference instead of employing ground truth labels. ii) Dense-aligned sub-methods (b)-(h) outperform the sparse alignment (a) because of the finer-grained visual-linguistic feature fusion. iii) (b)-(e) indicate that adding any other component on top of the “Main object” improves performance, demonstrating the validity of each text component. The “Attribute” component aids in identifying characteristics such as colour and shape, which are frequently mentioned in language descriptions. Unexpectedly, “Pronoun” components such as “it, that, and which” have little meaning when used alone but still contribute in our method, indicating that the pronoun learned contextual information from the sentence. The “Relationship” component facilitates comprehension of spatial relationships between objects. The “Auxiliary object” component is a negative term in the loss (Eq. (4)). During inference (Eq. (7)), its similarity is subtracted in the hope that the predicted main object is as dissimilar to it as possible. iv) (f)-(h) integrate different components to make performance gains and reach the peak when all are involved, demonstrating that the functions of each component can be complementary, and there may be no overlap between the features of each one. The result reveals that our method effectively decouples and matches fine-grained multimodal features.

4.2. Grounding without Object Name (VG-w/o-ON)

4.2.1 Experiment settings

To evaluate the comprehensive reasoning ability of the model and avoid inductive biases about object names, we propose a new and more challenging task: grounding objects without mentioning object names (VG-w/o-ON). Specifically, we manually replace the object’s name with “object” in the ScanRefer validation set. For instance: “This is a brown wooden chair” becomes “this is a brown
[Figure 4 text panels:
(a) this is a black leather loveseat. if you were sitting in it, the long, short bookshelf would be on the right.
(b) wooden double bookcase filled with books. walking into the room it is in the right hand most corner next to the window.
(c) there is a chair with it is back to the wall. it is the fourth chair from the left.
(d) choose the first small circular object with a metal stand, on the right side. it has no red stool.
(e) the object is in the corner next to the door, below the whiteboard. it is gray, tall, and narrow.]
Figure 4. Qualitative results with ScanRefer texts. (a-c): Regular 3D visual grounding. (d-e): Grounding without object name.
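The Acc@0.25IoU and Acc@0.5IoU metrics named in Sec. 4.1.1 can be illustrated with a short sketch. It assumes axis-aligned boxes stored as (xmin, ymin, zmin, xmax, ymax, zmax) and counts a prediction as correct when its IoU with the ground-truth box is at least the threshold. The box format, the `>=` comparison, and the function names are assumptions for illustration, not the benchmarks' official evaluation code.

```python
def box_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:  # no overlap along this axis
            return 0.0
        intersection *= high - low
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union


def accuracy_at_iou(predictions, ground_truths, threshold):
    """Fraction of predictions whose IoU with the paired GT box >= threshold."""
    if not predictions:
        raise ValueError("need at least one prediction")
    hits = sum(
        1
        for pred, gt in zip(predictions, ground_truths)
        if box_iou_3d(pred, gt) >= threshold
    )
    return hits / len(predictions)


# Toy example: one half-shifted prediction (IoU = 1/3) and one exact match.
gt_box = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
shifted = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)
predictions = [shifted, gt_box]
ground_truths = [gt_box, gt_box]
acc_025 = accuracy_at_iou(predictions, ground_truths, 0.25)
acc_050 = accuracy_at_iou(predictions, ground_truths, 0.5)
print(acc_025, acc_050)
```

The shifted box overlaps the ground truth with IoU 1/3, so it counts as a hit at the 0.25 threshold but not at 0.5, which is why the stricter metric is always the lower of the two.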
Supplementary Material for
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
Section A of the supplementary material provides the implementation details of the individual modules and the network training details. In Section B, we supplement with additional experiments and quantitative analyses. Finally, in Section C, we present visualization results and qualitative analysis.
A. Implementation details
Text decoupling module. The maximum length of the text is l=256, and the absent positions of the position label L ∈ R^{1×l} are padded with 0. Not every sentence can be decoupled into five semantic components, but the most fundamental “main object” is required.
Encoder-Decoder. We keep hyperparameters consistent with BUTD-DETR [34]. The point cloud is tokenized as V ∈ R^{n×d} by the PointNet++ [53] pre-trained on ScanNet. The text is tokenized as T ∈ R^{l×d} by the pre-trained RoBERTa [42]. Following object detection, the position and category of the boxes are embedded separately and concatenated as the box token B ∈ R^{b×d}. The encoder, for visual-text feature extraction and modulation, has N_E=3 layers. The decoder with N_D=6 layers generates candidate object features Q ∈ R^{k×d}. Here, n=1024 denotes the number of seed points, l=256 the number of text tokens, b=132 the number of detection boxes, k=256 the number of candidate objects, and d=288 the feature dimension. Please refer to BUTD-DETR for more details.
Losses. 1) In the position-aligned loss L_pos, the weights of each component in Eq. (1) are as follows: λ1=0.6, λ2=λ3=0.2, λ4=0.1. These values give the “main object” component L_main the highest weight, as expected. The “relational” component L_rel has a lower weight because it affects both the main and auxiliary objects. See Sec. B.1.(4) for parameter searching. 2) In the semantic-aligned loss L_sem, the weight w+ follows a similar trend. The four features t_main, t_attri, t_pron, t_rel are weighted by 1.0, 0.2, 0.2, and 0.1, respectively. The weight w− acts on the negative item, where the feature weight of the auxiliary object is 2 and the remainder weighs 1. The purpose is to differentiate the features of the main object from the auxiliary objects. 3) We optimize the model with the following total loss:
\[ \mathcal{L} = (\alpha (\mathcal{L}_{pos} + \mathcal{L}_{sem}) + 5\mathcal{L}_{box} + \mathcal{L}_{iou})/(N_D+1) + 8\mathcal{L}_{pts}, \tag{8} \]
where L_pos and L_sem represent the visual-language alignment losses. L_box and L_iou indicate the object detection loss [44], with L_box representing the L1 regression loss of the object’s position and size and L_iou representing the object’s 3D IoU loss. N_D is the number of decoder layers. L_pts is the KPS point sampling loss [44]. α takes the value 1 in the SR3D/NR3D dataset and 0.5 in the ScanRefer dataset. Because the SR3D/NR3D dataset provides the bounding box of candidate objects, while the ScanRefer dataset requires detecting the bounding box, we give higher weights for the detection loss in the ScanRefer dataset (a smaller α raises the relative weight of the detection terms).
Training details. The code is implemented based on PyTorch. We set the batch size to 12 on four 24-GB NVIDIA RTX 3090 GPUs. For ScanRefer, we use a 2e−3 learning rate for the visual encoder and a 2e−4 learning rate for all other layers. It takes about 15 minutes per epoch, and the best model appears around epoch 60. The learning rates for SR3D are 1e−3 and 1e−4, with 25 minutes per epoch and around 45 epochs of training. The learning rates for NR3D are set at 1e−3 and 1e−4, with 15 minutes per epoch and around 180 epochs of training. Since SR3D is composed of brief machine-generated sentences, convergence is easier. ScanRefer and NR3D consist of human-annotated free-form complex descriptions and require more training time.
B. Additional experiments
B.1. Regular 3D Visual Grounding
(1) The explanation of the BUTD-DETR’s performance. Given a sentence, such as “It is a brown chair with armrests and four legs. It is directly under a blackboard”, our text decoupling module determines that “chair” is the main object based on grammatical analysis and thus obtains the position label L_main = 0000100.... However, the official implementation of BUTD-DETR requires an additional ground truth class for the target object, and its input is: “<object name> chair. <Description> It is a brown chair ...”. It then searches for the position where the object
name “chair” appears in the sentence as a position label. This operation presents some problems:
• i) It is unfair to use GT labels during inference;
• ii) Descriptions may employ synonyms for the category “chair,” such as “armchair, office-chair, and loveseat,” causing the search for a position label to fail;
• iii) Sometimes, the object name is not mentioned, such as when it is replaced by the word “object.” In the NR3D validation set, BUTD-DETR removed 800 such challenging samples, and about 5% did not participate in the evaluation.
To be fair, we re-evaluate it using the position labels obtained by the proposed text decoupling module, as displayed in the second row in Tab. 6.
Dataset   Easy   Hard   View-dep.   View-indep.   Overall
SR3D      70.3   62.9   54.1        68.7          68.1
NR3D      58.2   46.1   50.2        53.1          52.1
Table 8. Detailed performances of our method on the SR3D/NR3D dataset with the metric of Acc@0.25IoU.
existing methods, demonstrating the efficiency of our dense alignment. As seen in (a), it is not optimal to treat all components equally because their functions are not equivalent. When giving λ1 a higher weight (see (f, g)), it turns out that a weight that is too high would also lead to a decrease in performance, which may compromise the functionality of other components. λ1 takes 0.6 as the best option, and the other items take 0.1 or 0.2. We select option (d) for implementation.
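The loss weights from Sec. A and the total loss in Eq. (8) can be made concrete with a minimal sketch. This is not the released implementation. All function and variable names are hypothetical, the assignment of λ2 and λ3 to the "attri" and "pron" components is an assumption based on the order of the semantic features, and Eq. (1) is assumed to be a plain weighted sum of per-component losses.

```python
# Sketch of Eq. (8); all names here are hypothetical, not from the released code.

# Position-aligned loss weights lambda_1..lambda_4 from Sec. A. Assigning
# lambda_2 and lambda_3 to "attri" and "pron" is an assumption based on the
# order t_main, t_attri, t_pron, t_rel used for the semantic-aligned loss.
POS_WEIGHTS = {"main": 0.6, "attri": 0.2, "pron": 0.2, "rel": 0.1}

# alpha in Eq. (8): 1 for SR3D/NR3D, 0.5 for ScanRefer (Sec. A).
ALPHA = {"sr3d": 1.0, "nr3d": 1.0, "scanrefer": 0.5}


def weighted_component_loss(component_losses, weights=POS_WEIGHTS):
    """Weighted sum over the components present in a sentence.

    Assumes Eq. (1) is a plain weighted sum of per-component losses.
    """
    return sum(weights[name] * loss for name, loss in component_losses.items())


def total_loss(l_pos, l_sem, l_box, l_iou, l_pts, dataset, num_decoder_layers=6):
    """Eq. (8): (alpha*(L_pos + L_sem) + 5*L_box + L_iou)/(N_D + 1) + 8*L_pts."""
    alpha = ALPHA[dataset]
    alignment = alpha * (l_pos + l_sem)
    detection = 5.0 * l_box + l_iou
    return (alignment + detection) / (num_decoder_layers + 1) + 8.0 * l_pts


if __name__ == "__main__":
    # Illustrative loss values only; they are not taken from the paper.
    l_pos = weighted_component_loss({"main": 1.0, "attri": 0.5, "rel": 0.2})
    print(total_loss(l_pos, 1.0, 0.3, 0.4, 0.05, dataset="scanrefer"))
```

With every term set to 1, ScanRefer gives (0.5·2 + 5 + 1)/7 + 8 = 9, while SR3D/NR3D gives (2 + 5 + 1)/7 + 8 ≈ 9.14. The only difference is that the alignment terms count half as much on ScanRefer, so the detection terms carry more relative weight there.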
      Main   Random    One of {Attr, Pron, Auxi, Rel}   Attr + Pron +   Acc.
             4 words   (lowest)        (best)           Auxi + Rel
(a)   ✓                                                                 51.5
(b)   ✓      ✓                                                          52.0
(c)   ✓                ✓                                                52.8
(d)   ✓                                ✓                                53.1
(e)   ✓                                                 ✓               54.6
Table 10. Comparison with the alignment of four random words. The metric is Acc@0.25IoU. (a): Baseline, only aligned with the “Main Object” text component; (b): aligned with four random words; (c-d): aligned with one of our four decoupled components; (e): aligned with all four components.
in Tab. 11, our model after text modulation on the ScanRefer achieves 1.1% and 1.5% higher performance than BUTD-DETR, reaching 64.1% and 45.3%. Note that the proposed method is not specifically designed for object detection, and the performance evaluation uses the same model as the visual grounding task (Table 1 and Table 5).
Method                              mAP@0.25   mAP@0.5
DETR+KPS+iter †                     59.9       -
3DETR with PointNet++ †             61.7       -
BUTD-DETR trained on ScanRefer      63.0       43.8
EDA trained on ScanRefer (Ours)     64.1       45.3
C. Qualitative analysis
1) Regular 3D Visual Grounding. Qualitative results on the regular 3D visual grounding task are displayed in Figs. 9, 10, and 11. i) Fig. 9 indicates that compared to BUTD-DETR, our method has a superior perception of appearance, enabling the identification of objects based on their attributes among several candidates of the same class. This improvement is made possible by the alignment of our decoupled text attribute component with visual features. ii) Fig. 10 demonstrates that our method exhibits excellent spatial awareness, such as orientation and position relationships between objects. The alignment of our decoupled relational component with visual features and the positional encoding of Transformer may be advantageous to this capability. iii) Furthermore, we found, somewhat surprisingly, that our method also has a solid understanding of ordinal numbers, as shown in Fig. 11, probably because we parsed ordinal numbers as part of the attribute component of the object. These examples demonstrate that text decoupling and dense alignment enable fine-grained visual-linguistic matching.
2) Grounding without Object Name (VG-w/o-ON). The visualization results of this challenging task are shown in Fig. 13. Since the target object’s name is not provided, the model must make inferences based on appearance and positional relationships with auxiliary objects. In contrast, the compared methods perform weakly on this task because they rely heavily on object names to exclude interference candidates, weakening the learning of other attributes.
3) Failure Case Analysis. Although our method delivers state-of-the-art performance, there are still a significant number of failure cases, which we analyze visually. i) Many language descriptions are intrinsically ambiguous, as illustrated in Fig. 12(a-c). Especially in the “multiple” setting, the appearance attributes and spatial relationships of the target object are not unique, and multiple candidate objects match the requirements. ii) Text parsing errors may occur owing to the complexity and diversity of the language descriptions. For example, the GT object in Fig. 12(d) is a desk, but we parse it as a window; the GT object in Fig. 12(e) is a box, but we parse it as a piano. iii) There are also some cases where cross-modal feature matching fails even though the text parses well.
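Both the name-search labels criticized in Sec. B.1.(1) and the parsing failures in item ii) of the failure analysis come down to which token the position label marks. The sketch below illustrates only the name-search variant and why it breaks on synonyms or on the word "object". It is not the authors' text decoupling module, which relies on grammatical analysis, and it is not BUTD-DETR's code. The whitespace-style tokenizer stands in for RoBERTa tokenization, and all function names are hypothetical.

```python
import re

MAX_LEN = 256  # maximum text length l from Sec. A


def tokenize(text):
    """Toy lowercase word tokenizer; the paper tokenizes with RoBERTa instead."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def position_label(index, max_len=MAX_LEN):
    """0/1 label of length max_len with a single 1 at `index`; the rest is 0."""
    label = [0] * max_len
    if index is not None and index < max_len:
        label[index] = 1
    return label


def name_search_index(tokens, class_name):
    """Index of the first token equal to class_name, or None if absent."""
    for i, tok in enumerate(tokens):
        if tok == class_name:
            return i
    return None


if __name__ == "__main__":
    sentence = ("It is a brown chair with armrests and four legs. "
                "It is directly under a blackboard")
    idx = name_search_index(tokenize(sentence), "chair")
    # "chair" is the fifth word, so the label starts with 0000100.
    print(idx, "".join(map(str, position_label(idx)[:7])))

    # A synonym or the word "object" leaves the name search without a match.
    print(name_search_index(tokenize("this is a black leather loveseat"), "chair"))
    print(name_search_index(tokenize("the object is in the corner"), "chair"))
```

The first call reproduces the label prefix 0000100 quoted in Sec. B.1.(1). The last two calls return `None`, which corresponds to problems ii) and iii) listed there.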
Text: a brown and blue chair on the right side of the room. it is in front of a computer desk.
Text: a wooden chair with a cushioned seat and back sits in fronts of a wall with wood and glass doors.
Text: the curtain is on the opposite wall from the bathroom. the curtain is red and wavy.
Text: there is a gray and yellow office chair. it is at a table.
Text: the blue chair, it is placed in the middle. on the left is a white table and a sink, on the right is a staircase and a door.
Figure 9. Qualitative comparison of the regular 3D VG task. Our method has a superior perception of appearance attributes.
Text: the chair is facing the left corner of the room, and is at the desk. the monitor is to the left of the chair, and there are shelves in front of the chair.
Text: there is a chair sitting on the floor. it is to the right of another chair.
Text: there is a brown couch. it is placed between the wall and the coffee table.
Text: a loft bed sits to the left of a window. it is got a desk on it is left side.
Text: it is a leather type sofa between two end tables. it is sitting behind the coffee table.
Figure 10. Qualitative comparison of the regular 3D VG task. Our method has a superior perception of spatial relationships.
Figure 11. Qualitative comparison of the regular 3D VG task. Challenging cases with ordinal numbers.
Text (a): there is a dark brown wooden and leather chair. placed in the table of the kitchen.
Text (b): it is a long brown table. it is located opposite to the crossed table on other side.
Text (c): a brown chair with no arms. it is kept at the corner of one side of the table.
Text (d): there is a window with green curtains, to the left of the window with green curtains is a desk. a desk is the item we are looking for.
Text (e): in the corner there is piano. to the left of the piano there is two tool boxes, this is the red tool box behind the green tool box.
Figure 12. Failure cases, with the GT box and the predicted box shown in yellow and green, respectively. (a-c): Failure due to ambiguity of reference. (d-e): Failure due to text parsing error for complex and long sentences.
Text: it is a narrow wood console object. the object sits in the kitchen, along the wall that has the tv. it sits under the tv.
Text: it is a object. it is black with a large window with snack inside for purchase with a lamp on the right side of it.
Text: this is a black object. it is hung above the brown night stand, and is to the right of the wardrobe that is sitting in the corner of the room.
Text: black object perpendicularly to the right from the small brown bookshelf on the wall. the object is in front of the bigger bookcase.
Text: there is a blue square object. it is on the bed on top of a black object and next to a white nightstand.
Text: there is a rectangular gray object with a green lid. it is next to a door.
Text: the object has multicolored paint on it. it is located to the left of a similar object.
Text: there is a object at the foot of the bed. it has arms and wheels and is next to another object that has no wheels.
Figure 13. 3D Visual grounding without object name (VG-w/o-ON), where the word “object” replaces the target’s name.
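The name replacement that defines VG-w/o-ON can be approximated programmatically. The paper states that the replacement was done manually, so the sketch below is only an illustration of the idea under simple assumptions: the list of target names is supplied by the caller, matching is whole-word and case-insensitive, and articles are left untouched (the examples above likewise keep phrases such as "a object"). The function name is hypothetical.

```python
import re


def remove_object_name(description, target_names):
    """Replace each whole-word occurrence of a target name with 'object'."""
    if not target_names:
        return description
    # Longest names first so "office chair" is replaced before "chair".
    names = sorted(target_names, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b"
    return re.sub(pattern, "object", description, flags=re.IGNORECASE)


if __name__ == "__main__":
    # Sentence taken from Sec. B.1.(1); the rewritten form is illustrative.
    print(remove_object_name(
        "It is a brown chair with armrests and four legs.", ["chair"]))
```

Whole-word matching keeps a synonym such as "armchair" intact when only "chair" is listed, which is one reason a manual pass remains the safer way to build the evaluation set.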