Coarse Correspondences Elicit 3D Spacetime Understanding in Multimodal Language Model
Benlin Liu1∗, Yuhao Dong2,3∗, Yiqin Wang2∗, Yongming Rao3, Yansong Tang2, Wei-Chiu Ma4,5, Ranjay Krishna1,4
1University of Washington, 2Tsinghua University, 3Tencent, 4Allen Institute for AI, 5Cornell University
coarse-correspondence.github.io
Abstract
1 Introduction
Multimodal LLMs (MLLMs) today are part of larger applications in the physical world through
devices like smartphones, smart glasses, and robots [1, 2, 3]. Augmented with vision encoders [4, 5, 6],
proprietary models like GPT-4o [7] and Gemini-Pro [8] have recently been demoed to understand
users’ physical environment as well as reason about their actions over time. Despite their promise,
research benchmarks have continued to highlight that our community’s best models still lack sufficient
3D understanding of space and time [9]. Popular 3D (e.g. ScanQA [10] and OpenEQA [9]) and video
(e.g. EgoSchema [11]) benchmarks report that MLLMs struggle when asked to reason about spatial
relationships between objects (“Is the couch to the right of the door?”) or remember temporal events
in long videos (“How does [the person] ensure the completion of their primary goal in the video?”).
Many recent efforts have begun incorporating 3D or temporal information to improve reasoning. For
instance, new methods finetune MLLMs with specialized structures or representations [12, 13, 14, 15].
Unfortunately, these methods require an open-sourced model with weights that can be modified. In
parallel, visual prompting mechanisms (e.g. set-of-marks [16] and 3DAxiesPrompts [17]) augment
∗Equal contribution.
Figure 1: We combine lightweight video tracking models with multimodal LLMs to achieve a better
understanding of 3D spacetime. (a) We use a tracking model at a high frame rate to obtain instance
segmentation masks for each frame. (b) Then, we sequentially sparsify input frames, select prominent
coarse correspondences, and visualize the constructed coarse correspondences on the images. (c)
Finally, we enable MLLMs to better understand 3D spacetime from the prompted images.
image inputs with visualizations to guide the MLLM’s attention to specific 2D regions in an image.
Visual prompting methods so far show improvements for 2D understanding but not yet for 3D and
temporal reasoning. Worse, these efforts have focused either on 3D understanding or temporal
understanding—not both. Even if visual prompting methods could add useful information, encoding
that information meaningfully in a video for an MLLM is difficult: proprietary MLLMs are developed
primarily with image inputs rather than videos; when presented with a video, they sparsify and
preprocess it before passing the frames to the MLLM.
To address the above challenges, we introduce Coarse Correspondences, a simple,
effective, and general-purpose visual prompting method that evokes 3D and temporal understanding
in multimodal LLMs. Given a video or a set of images of a space, Coarse Correspondences
transforms the images/frames to help MLLMs reason. First, we use a lightweight tracking model [18]
to extract object correspondences for the same instance across multiple frames at the video's original
high frame rate. Next, we sparsify the video to reduce the number of frames we pass to the MLLM.
Finally, we visualize the top-k objects that appear in the most frames, sorted by the
number of pixels they occupy. We use marks and segmentation outlines with unique IDs to highlight
these objects (Figure 1). Notice that these images are modified independently of any underlying
questions or tasks. We then pass the modified images, along with the question, to an MLLM. In
our experiments, we apply our visual prompting method to GPT-4V, GPT-4O, and other MLLMs.
Our framework consistently achieves significant improvements over state-of-the-art results on
ScanQA [10] and OpenEQA [9], two 3D understanding benchmarks, and EgoSchema [11], a long
video understanding benchmark. On ScanQA, our framework allows GPT-4V [19] to surpass spe-
cially designed and fine-tuned methods [12] in a zero-shot manner with fewer views. Additionally,
the rapid advancement of MLLMs provides the potential for our method to elicit other emergent
understanding abilities. For instance, with GPT-4O, our method enables 3D spatial understanding
from extremely sparse views, reducing the computational cost and expense associated with MLLMs.
To further test for 3D spatial reasoning, we create a small diagnostic benchmark called SOT. Current
benchmarks assess a model’s 3D spatial awareness from the viewpoint of the camera-holding
observer. In cognitive science, the spatial orientation test (SOT) is commonly used to evaluate spatial
intelligence in children [20]. It measures spatial perspective-taking, which is the capacity to envision
how an object or scene looks from a different viewpoint [21, 22]. SOT contains 10 scenes with videos
from different perspectives. Using these scenes, we test whether MLLMs can reason about spatial
relationships from different perspectives. We show that Coarse Correspondences improves the
GPT-4 models' abilities, but overall we find that their performance remains at random chance.
In summary, we propose Coarse Correspondences, a general visual prompting method that
uses lightweight tracking models to induce 3D and temporal reasoning in any MLLM. Our
results across four benchmarks spanning 3D, temporal, and our new spatial-cognition evaluations verify that our method
works out-of-the-box. We also demonstrate that even the most advanced MLLMs have significant
shortcomings in spatial understanding that are not easily overcome. We hope this work can help
MLLMs better understand the physical world we live in.
2 Related work
Multimodal language models Multimodal LLMs [4, 5, 6] integrate vision encoders [23] into
LLMs [24, 25], allowing them to reason directly over visual input. Many proprietary models,
such as GPT-4 [19], Gemini [8], and Claude [26], as well as open-source models like the LLaVA
series [4] and BLIP series [27], have made significant progress on 2D vision-language tasks like
image captioning [28] and visual question answering (VQA) [29, 30]. Beyond these language-related
tasks, many newer efforts apply MLLMs to applications such as autonomous driving [31] and
robotics [32]. Many of these applications require understanding the 3D space in which the models are
deployed and reasoning about how things change over time. We improve the 3D space-time capabilities of
such models.
Visual prompting. Effective prompting has been widely shown to improve LLMs across multiple
domains. Methods such as chain-of-thought prompting [14, 33] force the model to reason before
answering a question. For multimodal LLMs, methods such as Red-circle prompting [34] and Set-of-
marks [16] can enhance the grounding abilities of CLIP [23] and GPT-4V. PIVOT [35] employs visual
prompting combined with iterative VQA to induce GPT-4V to generate outputs for robotics control.
3DAxies [17] enhances GPT-4V’s ability to use numerical expressions to describe 3D relationships
of objects in a single image by annotating a scaled 3D coordinate system on the image. Unlike these
works, Coarse Correspondences prompts MLLMs to understand the spatial relationships within
a complete 3D scene from an image sequence.
Video understanding. Videos carry rich information about both the 3D structure as well as temporal
changes in the physical world. To perform better long-horizon reasoning, work has begun incorporat-
ing video inputs into MLLMs. By replacing the image encoder in MLLMs with a video encoder [13],
recent work has improved performance on video dense captioning [36] and videoQA [11, 37, 38]. To
further advance the understanding of temporal relationships in videos, EgoSchema [11] introduced a
benchmark for long video understanding, which is more challenging than previous video-language
benchmarks. Meanwhile, understanding 3D spatial relationships in videos has received relatively less
attention. 3D-LLM [12] converts multiview images into 3D point clouds and then feeds them
into LLMs, demonstrating better results on the ScanQA [10] benchmark for 3D understanding.
OpenEQA [9] is also a benchmark dedicated to evaluating MLLM’s understanding of 3D physical
space, with outputs that are more open-vocabulary than ScanQA's. In this paper, we propose a
framework that does not require any training or modification of MLLMs; it extracts meaningful
information from videos using off-the-shelf tracking models and achieves state-of-the-art performance
on the benchmarks mentioned above.
Visual correspondences. Visual correspondences have been a vital area of research in computer
vision for decades. Applications such as Structure-from-Motion [39] utilize correspondences
to better reconstruct 3D scenes. In the past, researchers relied on handcrafted features like SIFT [40] or
SURF [41] to obtain good correspondences. Today, features extracted from deep models [42] provide
increasingly accurate correspondences. Prior work generally aims for precise geometric
and semantic correspondences at the pixel level. However, in this paper, we use coarse visual
correspondence to prompt MLLMs, which can be easily obtained from off-the-shelf video tracking
models [18].
3 Method
We introduce Coarse Correspondences, a visual prompting method that allows MLLMs to
reason about 3D space and time.
Problem formulation. Given a question Q and a sequence or set of observations in an environment
[I1 , . . . , In ], our aim is to design a visual prompt P(. . .) that modifies the input image set. These
image inputs don’t have to be a video. They can also represent a set of images of a scene from
multiple viewpoints. We evaluate the prompt by measuring its utility in prompting an MLLM M:
\begin{align}
[I_1^\prime, \dots, I_n^\prime] &= P([I_1, \dots, I_n]) \\
\hat{\mathcal{A}} &= \mathcal{M}([I_1^\prime, \dots, I_n^\prime], \mathcal{Q})
\end{align}
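To make the formulation concrete, here is a minimal sketch in Python; the helper names (prompt_fn standing in for P and mllm standing in for M) are placeholders for illustration, not part of any released API.

```python
from typing import Any, Callable, List

def answer_question(
    images: List[Any],                              # [I_1, ..., I_n]: frames or multi-view images
    question: str,                                  # Q
    prompt_fn: Callable[[List[Any]], List[Any]],    # P: task-agnostic visual prompting
    mllm: Callable[[List[Any], str], str],          # M: frozen multimodal LLM
) -> str:
    """Apply the visual prompt P to the images, then query the MLLM M zero-shot."""
    prompted_images = prompt_fn(images)             # [I'_1, ..., I'_n]
    return mllm(prompted_images, question)          # predicted answer A-hat
```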
Coarse Correspondences
Our prompting method, Coarse Correspondences, contains four steps: (1) tracking correspondences,
(2) sparsifying frames, (3) selecting coarse correspondences, and (4) visualizing them.
(1) Tracking correspondences. Given n input images [I1 , . . . , In ], we first apply an off-the-shelf
video object tracking model such as Track Anything [18]. This model extracts class-agnostic
instance segmentation masks (M1 , . . . , Mn ), one per image. Each Mi is an H × W matrix,
where H and W are the height and width of the input image Ii . Each pixel location in Mi
contains an instance ID, indicating which instance the pixel at that position belongs to within the
image sequence.
(2) Sparsify frames. Since most MLLMs contain a large number of parameters, directly using them
to process long image sequences is computationally intensive. Proprietary MLLMs like GPT-4O also
incur significant costs as the number of image tokens to be processed grows. However, naively reducing
the number of input images can discard information vital to the MLLM.
Coarse Correspondences strikes a balance in this tradeoff: it extracts video object
tracks (a relatively cheap operation) from the high-frame-rate image sequence, and then samples a
few image inputs along with the tracks, to retain, and even improve, performance while reducing
the MLLM's computation cost. From the extracted video object tracks, we perform temporal
downsampling, retaining only m ≪ n uniformly sampled images and their corresponding masks,
denoted as [Is1 , . . . , Ism ] and [Ms1 , . . . , Msm ], where si ∈ {1, . . . , n}. This downsampling reduces
the number of images we feed into M.
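The following sketch shows one way to implement this uniform temporal downsampling; the indexing convention is illustrative.

```python
def sparsify(frames, masks, m):
    """Step (2): keep m << n uniformly spaced frames and their masks.

    Tracking already ran on the full sequence, so instance IDs remain
    consistent across the retained frames even though most frames are dropped."""
    n = len(frames)
    # m uniformly spaced indices s_1 < ... < s_m over {0, ..., n-1}
    indices = [round(i * (n - 1) / max(m - 1, 1)) for i in range(m)]
    return [frames[i] for i in indices], [masks[i] for i in indices], indices
```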
(3) Selecting coarse correspondences. Prompting an MLLM with all detected correspondences
results in information overload. In fact, our ablations (discussed in Sec 5) find that adding all the
correspondences reduces the MLLM's performance. Therefore, we retain only a subset of prominent
instances: the top-k objects that appear in the most retained frames.
We first calculate the occurrence frequency and total area of each unique instance
ID in the retained m masks:
\begin{align}
\mathrm{Freq}(\text{ID}) &= \sum_{i=s_1}^{s_m} \mathbf{1}_{\{\text{ID} \in M_i\}}, \\
\mathrm{Area}(\text{ID}) &= \sum_{i=s_1}^{s_m} \sum_{p \in M_i} \mathbf{1}_{\{\text{ID} = p\}}.
\end{align}
We then sort all instance IDs in descending order of Freq(ID), breaking ties by Area(ID), and
retain the top-k instance IDs as tracklets, denoted [T1 , . . . , Tk ], to visualize for MLLMs.
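A short sketch of this selection rule, implementing Freq and Area exactly as defined above (instance ID 0 is assumed to be background).

```python
import numpy as np
from collections import defaultdict

def select_top_k_instances(masks, k):
    """Step (3): rank instance IDs by Freq (number of retained frames they
    appear in), break ties by Area (total pixels), and keep the top-k."""
    freq, area = defaultdict(int), defaultdict(int)
    for mask in masks:                                   # mask: (H, W) int array
        ids, counts = np.unique(mask, return_counts=True)
        for inst_id, count in zip(ids.tolist(), counts.tolist()):
            if inst_id == 0:                             # assumed background label
                continue
            freq[inst_id] += 1                           # appears in this frame
            area[inst_id] += count                       # pixels covered
    ranked = sorted(freq, key=lambda i: (freq[i], area[i]), reverse=True)
    return ranked[:k]                                    # tracklets [T_1, ..., T_k]
```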
(4) Visualizing coarse correspondences. We visualize each selected correspondence directly on the
images as a marker. Specifically, for each retained instance ID Ti , if it exists in the mask Msj of a
retained image Isj , we overlay a mark of fixed size and shape, labeled with Ti , at position (x̄ij , ȳij )
on Isj to produce I′sj . The placement position is the instance's centroid:
\[
(\bar{x}_{ij}, \bar{y}_{ij}) = \frac{\sum_{(x,y)} (x, y) \cdot \mathbf{1}_{\{M_{s_j}(x,y) = T_i\}}}{\sum_{(x,y)} \mathbf{1}_{\{M_{s_j}(x,y) = T_i\}}}
\]
Naturally, we can overlay not just the markers but also the segmentation outlines or even the
segmentation masks associated with each retained prominent instance. We explore these ablations
later. In the end, we obtain the prompted image sequence [I′1 , . . . , I′m ], which is then used as the
input to MLLMs.
We refer to our method as Coarse because of the following: first, we only visually prompt for
instance-level correspondences and not point-level correspondences. Second, the instance-level
correspondences are extracted using off-the-shelf tracking models. Despite not being perfectly
precise, they still help MLLMs build a better 3D model of the environment. Third, we only visualize
a handful of prominent corresponding instances.
4 Experiments
We evaluate the utility of C OARSE C ORRESPONDENCES across a number of tasks that require
understanding 3D space (ScanQA [10] in §4.1 and OpenEQA [9]) in §4.2) as well as temporal events
(EgoSchema [11] in §4.3). Across all three benchmarks, we augment GPT-4V and GPT-4O with
C OARSE C ORRESPONDENCES and evaluate its zero-shot performance. We show that C OARSE
C ORRESPONDENCES not only significantly improves the base GPT models but that the improvements
establish new state-of-the-art results across all our evaluations. All experiments are conducted with
A100 80G GPUs.
Existing benchmarks evaluate the 3D spatial awareness of a model from the perspective of the
camera-wielding observer. To further demonstrate the utility of Coarse Correspondences, we
introduce a new benchmark: SOT. Our benchmark tests whether models are able to reason about 3D
space from the perspective of an imaginary observer at another location in their field of view. Again,
we show that Coarse Correspondences significantly improves GPT-4{V,O}'s abilities.
4.1 ScanQA
The ScanQA dataset [10]. The ScanQA dataset consists of 41,363 questions about 800 scenes,
including 32,337 unique questions. Its validation set contains 4,675 questions about 71 scenes.
Because the dataset is derived from the original ScanNet dataset [44], it also contains 3D localization
of all objects in each scene. Questions in ScanQA require basic recognition, 3D localization, and 3D
embodied capabilities [45]. The validation set contains two ground-truth answers per question for
evaluation with models that produce free-form answers.
Baselines. We evaluate Coarse Correspondences by augmenting the GPT-4{V,O}, Gemini,
and Claude models. We also compare against other general-purpose multimodal LLMs such as
LLaVA [4], Flamingo [43], and BLIP-2 [27], and we consider 3D-LLM [12], which was trained with
3D awareness.
Metrics. Following prior work, we adopt BLEU [46], METEOR [47], ROUGE-L [48], and
CIDEr [49] as our evaluation metrics.
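A hedged sketch of this evaluation, assuming the pycocoevalcap package commonly used to compute these captioning-style metrics (ScanQA's validation set provides two reference answers per question); the exact preprocessing, e.g. tokenization, may differ from the official evaluation script.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def score_answers(predictions, references):
    """predictions: {qid: "model answer"}; references: {qid: [ref1, ref2]}."""
    res = {qid: [ans] for qid, ans in predictions.items()}
    gts = dict(references)
    scorers = [("BLEU", Bleu(4)), ("METEOR", Meteor()),
               ("ROUGE-L", Rouge()), ("CIDEr", Cider())]
    results = {}
    for name, scorer in scorers:
        value, _ = scorer.compute_score(gts, res)
        results[name] = value          # Bleu(4) returns a list of BLEU-1..4 scores
    return results
```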
Results. As shown in Table 1, compared with raw inputs to GPT-4V and GPT-4O, Coarse Correspondences
consistently improves the recently released GPT-4O's overall performance, by a
significant 5.7 BLEU-2, 3.2 METEOR, 6.5 ROUGE-L, and 15 CIDEr points. Compared with other
vision-language models, it is clear that Coarse Correspondences consistently improves 3D understanding
across the different GPT models; GPT-4V and GPT-4O both fail to achieve competitive performance
without Coarse Correspondences but surpass the state-of-the-art models with it.
4.2 OpenEQA
The OpenEQA dataset [9]. OpenEQA is an open-vocabulary benchmark for spatial environment
understanding and embodied reasoning. We evaluate on OpenEQA's EM-EQA split, which
contains over 1,600 high-quality human-generated questions collected from various real-world
environments. The subset tests the episodic memory of an agent moving through a 3D environment over
time. This task uses video inputs, from which we sample only K frames at equal intervals as input to
the GPT models. We use K = 8 frames for GPT-4V and only K = 4 for GPT-4O, which we
empirically find to give the best performance for the GPT models without Coarse Correspondences. We
adopt accuracy as our metric.
Baselines. We compare against language-only models to account for language bias (LLaMA2 [25]),
commonly used general-purpose multimodal LLMs (GPT-4 [50], Claude3 [26], Gemini-Pro [8]),
and GPT-4V with 15 and 50 frames.
Results. Coarse Correspondences again achieves state-of-the-art performance while using
significantly fewer frames than all previous works (Table 2). This substantial reduction in the number
of views highlights the efficiency of our approach at inference time. The improved performance,
coupled with reduced computational overhead, charts a potential future for using Coarse
Correspondences in embodied AI tasks. Although our marks are generated independently of the
questions, they still improve performance.
Table 2: Results on OpenEQA (EM-EQA split).
Models            Frames   Accuracy
LLaMA2 [25]       0        28.3
GPT-4 [50]        0        33.5
Claude3 [26]      20       36.3
Gemini-Pro [8]    15       44.9
GPT-4V [19]       15       54.6
GPT-4V [19]       50       55.3
Human             Full     86.8
GPT-4V            8        44.8
GPT-4V+CC         8        58.5
GPT-4O            4        49.4
GPT-4O+CC         4        59.1

Table 3: Results on the EgoSchema subset.
Models            Frames   Accuracy
LongViviT [51]    256      56.8
MC-ViT-L [52]     128+     62.6
LLoVi [53]        180      58.3
VideoAgent [54]   8.4      60.2
MVU [55]          16       60.3
VideoAgent [56]   -        62.8
LangRepo [57]     -        66.2
GPT-4V            8        64.2
GPT-4V+CC         8        67.4
GPT-4O            8        67.2
GPT-4O+CC         8        73.2
4.3 EgoSchema
The EgoSchema dataset [11]. EgoSchema benchmarks a model's long-video understanding
ability. Due to budget constraints, we limit this evaluation to 500 questions from the validation set.
For the GPT models' input, we sample K = 8 image frames uniformly from the video.
Results. Coarse Correspondences achieves state-of-the-art performance, significantly
outperforming existing approaches in a zero-shot manner (Table 3) with fewer frames. The reduction
in frame usage underscores Coarse Correspondences' ability to extract relevant temporal
information effectively, suggesting the possibility of using it for other video tasks in the future.
4.4 SOT: a Spatial Orientation Test benchmark
In cognitive science, the spatial orientation test (SOT) is a widely adopted examination for spatial
intelligence in children [20]. The SOT assesses spatial perspective-taking—the ability to imagine
how an object or scene would appear from a perspective different from the current camera viewpoint.
Numerous human studies [21, 22] have shown that this ability is closely related to the development
of spatial intelligence in children. In this section, we adapt this test to evaluate multimodal LLMs.
Figure 2: Illustration of our SOT dataset. It includes two types of questions: observer-perspective
understanding and spatial perspective-taking. Coarse Correspondences demonstrates superior
effectiveness on the dataset.
Data curation. We manually curated ten real-world scenes, both indoor and outdoor, captured with
different mobile devices at various viewpoints. We instructed 10 human participants to take two videos
of their environment from two viewpoints. At each viewpoint, they were asked to remain in place while
laterally panning their mobile device to scan the 3D environment. From the 20 collected scenes, we
filtered down to 10 scenes that satisfy the following four criteria. First, we could uniquely
describe one viewpoint from the perspective of the other and vice versa; for example, in Figure 2,
we describe the other viewpoint as 'a person stepping out of an elevator.' Second, we ensured that no
single frame captures the entire 3D space, so models cannot shortcut answers using any
single view. Third, all scans move the camera from left to right. Fourth, to avoid privacy concerns,
we ensured that no people appear in the videos. Each scene scan lasts between 3 and 5 seconds.
For each scene, we designed five carefully crafted questions, each asking the model to determine
whether one object is to the left or to the right of another from a specific viewpoint. The first three
questions are asked from the observer's (camera's) perspective, while the final two describe the viewpoint
in language, thereby testing a model's spatial perspective-taking ability. Human performance on
these questions is 100%. We design SOT questions with a bias towards asking about relationships
between objects that appear in the first and last frames of the scan, ensuring that the model has to use
multiple frames to answer. In total, across the 10 scenes, SOT has a modest 50 questions.
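For illustration only, one SOT item could be represented as below; the field names and values here are hypothetical placeholders (SOT does not ship a particular file format), but they capture the two question types: camera-perspective questions and language-described perspective-taking questions.

```python
# Hypothetical encoding of one SOT question (field names are ours, not a released schema).
sot_item = {
    "scene_id": "scene_01",
    "frames": ["frame_first.jpg", "frame_last.jpg"],      # sparse views, e.g. first and last frame
    "viewpoint": "a person stepping out of the elevator",  # None => the camera's own perspective
    "question": "Is the sofa to the left or to the right of the door?",
    "choices": ["left", "right"],                          # binary, so random guessing is 50%
    "answer": "left",                                      # placeholder ground truth
}

def is_perspective_taking(item):
    """The final two questions per scene describe an imagined viewpoint in language."""
    return item["viewpoint"] is not None
```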
Results. Coarse Correspondences improves GPT-4O's perspective-taking ability
by a significant 21.2% (Table 4). Notably, when using only the first and last frames, as illustrated
in Figure 2, our method enables GPT-4O to understand the 3D spatial structure represented by the
images from minimal overlap, whereas GPT-4O alone performs only slightly better than random
guessing.
We further investigate two striking findings. First, the GPT models exhibit a bias: they perform
better when the camera moves from left to right, and worse when the order is flipped. We
report both the L→R and R→L camera pans along with their harmonic mean. Second, we isolate
performance on the two perspective-taking questions per scene (Figure 3). While
Coarse Correspondences improves GPT-4O's perspective-taking capability, these results are
bittersweet: the models still perform worse than random guessing.
Table 4: SOT results. Origin denotes the original left-to-right camera pan and Reverse the flipped order;
random guessing corresponds to 50.0 on these binary questions.
Models       Frames   Origin   Reverse   Harmonic Mean
GPT-4O       2        58.2     50.0      53.8
GPT-4O+CC    4        71.2     71.2      71.2

[Figure 3: Accuracy on the two perspective-taking questions per scene, with and without Coarse
Correspondences, at 2 and 4 input frames; the annotated gains from CC are +4.2% and +5.3%.]
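As a quick arithmetic check of the harmonic-mean column, using the standard two-value harmonic mean:
\[
\mathrm{HM}(a, b) = \frac{2ab}{a + b}, \qquad
\mathrm{HM}(58.2,\,50.0) = \frac{2 \times 58.2 \times 50.0}{58.2 + 50.0} \approx 53.8, \qquad
\mathrm{HM}(71.2,\,71.2) = 71.2 .
\]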
5 Analysis
Here, we explore the various design decisions implicit in our method.
How does Coarse Correspondences differ from other visual prompting methods? Our
proposed method calculates and highlights correspondences between images, aiming to elicit 3D and
temporal understanding. Other visual prompting methods (namely Set-of-Marks, 3DAxiesPrompts,
and Chain-of-Thought [16, 17, 33]) can be viewed as alternatives. We compare
Coarse Correspondences against them on ScanQA qualitatively in Figure 4.
The orange part of Figure 4 shows that our Coarse Correspondences labels are recognized by GPT-4V. The
output answer provides evidence that our coarse correspondences help GPT-4V develop a mental 3D
model of the scene. Set-of-Marks [16] provides no spatial correspondence information and is therefore
unhelpful. The axis labels in 3DAxies [17] are easily misrecognized by GPT-4V, leading to
misleading spatial information. Though Chain-of-Thought [33] helps identify objects, it fails to
resolve the “spatial perspective-taking” issue.
Why use coarse instead of dense correspondences? Instead of filtering and retaining only a
handful of coarse correspondences, one ablation we considered is using all dense
correspondences. Unfortunately, we find that overlaying too many instance marks
degrades performance (Table 5), as they occlude the visual content in the images.
How large should the marks be? We inject the correspondences into MLLMs by overlaying
marks onto the images. We empirically find an optimal mark size (where 'px' denotes the mark's
diameter in pixels) in Table 5. Marks that are too small tend to be ignored, while those that are too
large occlude visual content.
What shape should the marks be? We further studied the appearance of the marks. In addition to
red circles with white text, we experimented with adding segmentation outlines and segmentation
masks. As shown in Table 5, using segmentation outlines enhances object grounding. However, full
segmentation masks occlude visual content and reduce performance.
6 Discussion
Limitations. We would like to point out two limitations of our method. First, our method relies on off-
the-shelf video tracking models to obtain instance-level correspondences. Although the performance
of tracking models has significantly improved with the advent of tools like SAM [58], achieving
good results on long-form in-the-wild videos remains challenging. This is particularly evident on
the 180-second EgoSchema benchmark, where Track-Anything often loses track of objects after 100
seconds, leading to inconsistent instance segmentation masks between the beginning and end of the
video clip. Despite observing consistent and significant improvements on EgoSchema, we believe
that accurate correspondence would further enhance the benefits of our approach. Second, we find
that our method does not yet improve the 3D understanding of open-source models like LLaVA [4].
System: You are an AI with the ability to analyze a series of images, each representing a different perspective of a single
scene. [Prompt-about-Marks]. Your task is to construct a 3D understanding based on these images.
User: You are sitting on the sofa and the electric fan is on your left. Describe the location of the room door from your
perspective: A. to the front left of you; B. to the front right of you; C. to the back left of you; D. to the back right of you.
Figure 4: Comparison of different prompting methods on ScanQA. Our proposed Coarse
Correspondences successfully guides GPT-4V to understand 3D spatial relationships and generate the
right answer. Other existing prompting methods, including image-based Set-of-Marks and 3DAxies and
text-based Chain-of-Thought, fail to answer correctly.
We speculate that there are two reasons for this: LLaVA may not recognize the marks we
overlay on the images, or it may not handle multi-image sequences effectively.
We find that LLaVA often focuses on the content of the last input image. There is a need for
open-source models that support visual prompting and multi-image inputs.
Conclusion. We propose a framework called Coarse Correspondences prompting. By using
an off-the-shelf video tracking model at a high frame rate to obtain class-agnostic instance-level
correspondences, we enable better recognition and reasoning with very few frames. Our method
is easy to use, requires no training, and performs zero-shot with strong models like GPT-4.
Coarse Correspondences enables GPT to reason about 3D structure from very sparse
views. We demonstrated consistent and significant improvements across multiple spatial and temporal
understanding benchmarks. We also identified that even GPT struggles with perspective-taking,
a fundamental component of human visual intelligence. Regardless, Coarse Correspondences'
results highlight its potential for robotics and embodied AI, which require 3D and
temporal understanding.
Acknowledgement
We thank Xiaojuan Wang, Luming Tang, Ruitao Zhang, and Yuntian Deng for helpful discussions,
feedback, and help with data collection. This project is partially funded by Amazon Science.
References
[1] Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei,
Xinyu Zhang, Bo Zhang, Xiaolin Wei, and Chunhua Shen. Mobilevlm : A fast, strong and open
vision language assistant for mobile devices, 2023.
[2] Anthony Brohan, Noah Brown, and et al. Rt-2: Vision-language-action models transfer web
knowledge to robotic control, 2023.
[3] Embodiment Collaboration and et al. Open x-embodiment: Robotic learning datasets and rt-x
models, 2024.
[4] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances
in neural information processing systems, 36, 2024.
[5] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng
Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: towards real-world vision-language
understanding. arXiv preprint arXiv:2403.05525, 2024.
[6] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang
Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding,
localization, text reading, and beyond, 2023.
[7] OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt-4o/, 2024. Accessed:
2024-05-22.
[8] Gemini Team, Rohan Anil, and et al. Gemini: A family of highly capable multimodal models,
2024.
[9] Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael
Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav,
Qiyang Li, Ben Newman, Mohit Sharma, Vincent Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan
Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Sasha Sax, and Aravind
Rajeswaran. Openeqa: Embodied question answering in the era of foundation models. In
Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[10] Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question
answering for spatial scene understanding. In proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 19129–19139, 2022.
[11] Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic
benchmark for very long-form video language understanding, 2023.
[12] Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and
Chuang Gan. 3d-llm: Injecting the 3d world into large language models. Advances in Neural
Information Processing Systems, 36, 2024.
[13] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united
visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
[14] Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, and Jiwen Lu. Chain-of-spot: Interactive
reasoning improves large vision-language models. arXiv preprint arXiv:2403.12966, 2024.
[15] Zuyan Liu, Benlin Liu, Jiahui Wang, Yuhao Dong, Guangyi Chen, Yongming Rao, Ranjay
Krishna, and Jiwen Lu. Efficient inference of vision instruction-following models with elastic
cache. arXiv preprint arXiv:2407.18121, 2024.
[16] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark
prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441,
2023.
[17] Dingning Liu, Xiaomeng Dong, Renrui Zhang, Xu Luo, Peng Gao, Xiaoshui Huang, Yongshun
Gong, and Zhihui Wang. 3daxiesprompts: Unleashing the 3d spatial task capabilities of gpt-4v.
arXiv preprint arXiv:2312.09738, 2023.
[18] Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, and Feng Zheng. Track anything:
Segment anything meets videos. arXiv preprint arXiv:2304.11968, 2023.
[19] OpenAI. Gpt-4v(ision) system card. OpenAI Blog, 2023.
[20] Alinda Friedman, Bernd Kohler, Peri Gunalp, Alexander P Boone, and Mary Hegarty. A
computerized spatial orientation test. Behavior research methods, 52:799–812, 2020.
[21] Nora Newcombe. The development of spatial perspective taking. Advances in child development
and behavior, 22:203–247, 1989.
[22] Barbara Tversky and Bridgette Martin Hard. Embodied and disembodied cognition: Spatial
perspective-taking. Cognition, 110(1):124–129, 2009.
[23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual
models from natural language supervision. In International conference on machine learning,
pages 8748–8763. PMLR, 2021.
[24] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng,
Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna:
An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
[25] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo-
thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open
and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[26] AI Anthropic. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 2024.
[27] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-
image pre-training with frozen image encoders and large language models. arXiv preprint
arXiv:2301.12597, 2023.
[28] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar,
and C. Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server, 2015.
[29] Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual
reasoning and compositional question answering, 2019.
[30] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v
in vqa matter: Elevating the role of image understanding in visual question answering, 2017.
[31] Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia,
Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large
vision-language models, 2024.
[32] Jingkang Yang, Yuhao Dong, Shuai Liu, Bo Li, Ziyue Wang, Chencheng Jiang, Haoran Tan,
Jiamu Kang, Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. Octopus: Embodied vision-
language programmer from environmental feedback, 2023.
[33] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi,
Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language
models, 2023.
[34] Aleksandar Shtedritski, Christian Rupprecht, and Andrea Vedaldi. What does clip know about a
red circle? visual prompt engineering for vlms. arXiv preprint arXiv:2304.06712, 2023.
[35] Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie,
Danny Driess, Ayzaan Wahid, Zhuo Xu, et al. Pivot: Iterative visual prompting elicits actionable
knowledge for vlms. arXiv preprint arXiv:2402.07872, 2024.
[36] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-
captioning events in videos, 2017.
[37] Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa:next phase of question-
answering to explaining temporal actions, 2021.
[38] Madeleine Grunde-McLaughlin, Ranjay Krishna, and Maneesh Agrawala. Agqa: A benchmark
for compositional spatio-temporal reasoning. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 11287–11297, 2021.
[39] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceed-
ings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113,
2016.
[40] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal
of computer vision, 60:91–110, 2004.
[41] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In
Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria,
May 7-13, 2006. Proceedings, Part I 9, pages 404–417. Springer, 2006.
[42] Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emer-
gent correspondence from image diffusion. Advances in Neural Information Processing Systems,
36:1363–1389, 2023.
[43] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson,
Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual
language model for few-shot learning. Advances in Neural Information Processing Systems,
35:23716–23736, 2022.
[44] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias
Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.
[45] Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. A survey of embodied
ai: From simulators to research tasks. IEEE Transactions on Emerging Topics in Computational
Intelligence, 6(2):230–244, 2022.
[46] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic
evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association
for Computational Linguistics, pages 311–318, 2002.
[47] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with
improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic
and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72,
2005.
[48] Zhaopeng Qiu, Xian Wu, and Wei Fan. Automatic distractor generation for multiple choice
questions in standard tests. arXiv preprint arXiv:2011.13100, 2020.
[49] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image
description evaluation. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 4566–4575, 2015.
[50] OpenAI, Josh Achiam, and et al. Gpt-4 technical report, 2024.
[51] Pinelopi Papalampidi, Skanda Koppula, Shreya Pathak, Justin Chiu, Joe Heyward, Viorica
Patraucean, Jiajun Shen, Antoine Miech, Andrew Zisserman, and Aida Nematzdeh. A simple
recipe for contrastively pre-training video-first encoders beyond 16 frames, 2023.
[52] Ivana Balažević, Yuge Shi, Pinelopi Papalampidi, Rahma Chaabouni, Skanda Koppula, and
Olivier J. Hénaff. Memory consolidation enables long-context video understanding, 2024.
[53] Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, and
Gedas Bertasius. A simple llm framework for long-range video question-answering, 2024.
[54] Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form
video understanding with large language model as agent, 2024.
[55] Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, and Michael S. Ryoo. Understanding
long videos in one multimodal language model pass, 2024.
[56] Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. Videoa-
gent: A memory-augmented multimodal agent for video understanding. arXiv preprint
arXiv:2403.11481, 2024.
[57] Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, and Michael S. Ryoo. Language
repository for long video understanding, 2024.
[58] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson,
Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick.
Segment anything. arXiv:2304.02643, 2023.
[59] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for
video recognition, 2019.
[60] Jathushan Rajasegaran, Georgios Pavlakos, Angjoo Kanazawa, Christoph Feichtenhofer, and
Jitendra Malik. On the benefits of 3d pose and tracking for human action recognition, 2023.
Appendix
A Broader Impact
Our method aims at improving the trustworthiness and reliability of MLLMs deployed in real-world
applications, including but not limited to Vision Pro, autonomous driving, and humanoid
robots. To have a virtual assistant like JARVIS in the Marvel films, it is necessary to align
vision-language models' understanding with human understanding, so that these systems can be
applied safely. Further, we are committed to reducing the carbon emissions produced by these models.
By employing our coarse correspondence prompting method, we use a much smaller tracking module
to reduce the number of frames used as input to large GPT models. We also improve the
speed and lower the cost of calling the OpenAI API to understand a 3D scene. This helps democratize
MLLMs so that more people and small companies can create their own real-world applications based
on GPT-4V. We hope our work can make large AI models more effective for social good.
Still, we would like to point out that with the development of MLLMs, increased reliance on advanced
MLLMs could also lead to a reduction in human skills, especially in interpreting and interacting
with visual content. Over-dependence on these models might erode critical thinking and analytical
abilities in the long term.
B More Discussions
Relation to SlowFast SlowFast [59] is a framework for video recognition that includes two parallel
pathways: a Slow pathway that captures semantic information at a low frame rate and a Fast pathway
that captures motion information at a high frame rate. The information from both pathways is
fused through lateral connections for downstream video recognition tasks. In a way, our coarse
correspondence prompting can be seen as another form of SlowFast. However, unlike SlowFast,
where the Slow and Fast pathways operate in parallel, our framework operates sequentially. First,
it captures low-level, class-agnostic motion information at a high frame rate using a lightweight
tracking model. Then, at a low frame rate, it performs recognition and reasoning requiring semantic
understanding using larger MLLMs. The two stages are bridged through visual prompting. Moreover,
while SlowFast learns a representation of the input video for pure vision recognition tasks such as
action classification and detection, our coarse correspondence framework aims to better understand the
3D spatial structure and temporal information contained in the input video to achieve spatiotemporal
perception and reasoning simultaneously.
Eulerian vs Lagrangian If deep learning-based methods represent camera or object motion in videos
from an Eulerian viewpoint—i.e., expressing how features at fixed locations evolve over time through
a multi-dimensional tensor—then our framework adds a Lagrangian viewpoint to this representation.
The Lagrangian viewpoint describes the trajectories of entities moving through space and time in the
video. Previously, the Lagrangian viewpoint in video descriptions has been shown to better aid human
action recognition [60]. Here, we demonstrate that it can more generally help MLLMs understand
the 4D spatiotemporal context represented in videos.
C Case Studies
To further demonstrate the effectiveness of our proposed Coarse Correspondences under sparse image
input, we define two challenging tasks and one qualitative case study for each task.
The results of these case studies are shown in Fig. 5, with detailed illustrations provided
in the figure captions. The first case study concerns Duplicate Objects Counting, where the
model needs to count the number of objects in a 3D scene. Only when equipped with coarse correspondences
can GPT-4V gain a comprehensive understanding of the 3D scene, exclude the duplicate objects,
and give the right answer. The second case study concerns Relative Location Modeling,
where the model needs to understand the relative location of objects in a 3D scene. Without
the correspondence markers, GPT-4V fails to respond from a 3D perspective given only
raw 2D images. These case studies demonstrate that our proposed Coarse Correspondences can elicit
3D scene understanding in MLLMs from sparse image inputs.
(a) Task: Duplicate Objects Counting. There are 2 brown sofas and 2 black sofas. The brown sofas in
Views 2 & 4 are duplicates of those in View 3. Only with the help of Coarse Correspondences can GPT-4V
understand duplicate objects between different views across a single 3D scene.
(b) Task: Relative Location Modeling. From Views 1 & 2 we can tell that the room door is on the
left-hand side when facing the washbasin. Only with the help of Coarse Correspondences can GPT-4V
understand the relative location between objects that appear in different views across a single 3D scene.
Figure 5: Two complicated tasks, i.e., Duplicate Objects Counting and Relative Location Modeling,
are chosen to demonstrate our method. Zoom in for a better view.
Figure 6: Hand-crafted coarse correspondence labels. Coarse correspondence still helps
spatial understanding when the visual prompts are drawn by hand.
D User-Friendly Interactions
We also show that our Coarse Correspondences method works well with hand-crafted correspondence
marks, as shown in Figure 6. This demonstrates that our method is highly user-friendly when using
proprietary multimodal language models, such as GPT-4O, through web interfaces: users can easily
complete prompts by marking correspondence relationships on images. Moreover, the marks can be
diverse and flexible. This also shows the robustness of our method, as the marks are style-agnostic
as long as they convey the visual correspondence information.