Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model
Abstract
Multimodal language models (MLLMs) are increasingly being applied in real-world environments, necessitating their ability to interpret 3D spaces and comprehend temporal dynamics. Current methods often rely on specialized architectural designs or task-specific fine-tuning to achieve this. We introduce Coarse Correspondences, a simple, lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input, without modifying the architecture or requiring task-specific fine-tuning. Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints, and then conveys this information to MLLMs through visual prompting. We demonstrate that this simple, training-free approach consistently brings substantial gains to GPT-4V/O on four benchmarks that require spatial-temporal reasoning, including a +20.5% improvement on ScanQA, +9.7% on OpenEQA's episodic memory subset, +6.0% on the long-form video benchmark EgoSchema, and +11% on the R2R navigation benchmark. Additionally, we show that Coarse Correspondences can also enhance open-source MLLMs' spatial reasoning (by +6.9% on ScanQA) when applied in both training and inference, and that the improvement can generalize to unseen datasets such as SQA3D (+3.1%). Taken together, we show that Coarse Correspondences effectively and efficiently boosts models' performance on downstream tasks requiring spatial-temporal reasoning.
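To make the visual-prompting step concrete, the following is a minimal sketch of the idea in Python. It assumes an off-the-shelf tracker has already produced per-frame bounding boxes with temporally consistent track IDs; the function names (select_coarse_tracks, draw_marks) and the top-K selection heuristic are illustrative stand-ins, not the paper's released implementation.

from collections import Counter

import cv2  # pip install opencv-python


def select_coarse_tracks(per_frame_tracks, k=4):
    """Keep only the k object tracks that appear in the most frames,
    so the visual prompt stays sparse ('coarse')."""
    counts = Counter(tid for frame in per_frame_tracks for tid in frame)
    return {tid for tid, _ in counts.most_common(k)}


def draw_marks(frames, per_frame_tracks, keep_ids):
    """Overlay a numbered marker at each kept object's box center; the
    same number appearing across frames is the correspondence cue that
    the MLLM reads off the images."""
    marked = []
    for img, tracks in zip(frames, per_frame_tracks):
        canvas = img.copy()
        for tid, (x1, y1, x2, y2) in tracks.items():
            if tid not in keep_ids:
                continue
            cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
            cv2.circle(canvas, (cx, cy), 18, (0, 0, 255), -1)  # filled disc
            cv2.putText(canvas, str(tid), (cx - 8, cy + 8),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 2)
        marked.append(canvas)
    return marked

Usage: frames is a list of HxWx3 uint8 arrays sampled from a video, and per_frame_tracks is a parallel list of dicts mapping track ID to an (x1, y1, x2, y2) box, as any lightweight tracker could produce. The marked frames are then passed, otherwise unchanged, to an MLLM such as GPT-4V alongside the question, which is what makes the approach training-free.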
Cite
Text
Liu et al. "Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00358
Markdown
[Liu et al. "Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/liu2025cvpr-coarse/) doi:10.1109/CVPR52734.2025.00358
BibTeX
@inproceedings{liu2025cvpr-coarse,
title = {{Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model}},
author = {Liu, Benlin and Dong, Yuhao and Wang, Yiqin and Ma, Zixian and Tang, Yansong and Tang, Luming and Rao, Yongming and Ma, Wei-Chiu and Krishna, Ranjay},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {3783--3792},
doi = {10.1109/CVPR52734.2025.00358},
url = {https://mlanthology.org/cvpr/2025/liu2025cvpr-coarse/}
}