Robust Cross-Modal Alignment Learning for Cross-Scene Spatial Reasoning and Grounding
Abstract
Grounding target objects in 3D environments via natural language is a fundamental capability for autonomous agents to successfully fulfill user requests. Existing works typically assume that the target object lies within a known scene and focus solely on in-scene localization. In practice, however, agents often encounter unknown or previously visited environments and need to search across a large archive of scenes to ground the described object, invalidating this assumption. To address this, we introduce a novel task called Cross-Scene Spatial Reasoning and Grounding (CSSRG), which aims to locate a described object anywhere across an entire collection of 3D scenes rather than within a single predetermined scene. Compared with existing 3D visual grounding, CSSRG poses two challenges: the prohibitive cost of exhaustively traversing all scenes and more complex cross-modal spatial alignment. To address these challenges, we propose a Cross-Scene 3D Object Reasoning Framework (CoRe), which adopts a matching-then-grounding pipeline to reduce computational overhead. Specifically, CoRe consists of i) a Robust Text-Scene Aligning (RTSA) module that learns global scene representations for robust alignment between object descriptions and the corresponding 3D scenes, enabling efficient retrieval of candidate scenes; and ii) a Tailored Word-Object Associating (TWOA) module that establishes fine-grained alignment between words and target objects to filter out redundant context, supporting precise object-level reasoning and alignment. Additionally, to benchmark CSSRG, we construct a new CrossScene-RETR dataset and an evaluation protocol tailored for cross-scene grounding. Extensive experiments across four multimodal datasets demonstrate that CoRe dramatically reduces computational overhead while achieving superior performance in both scene retrieval and object grounding. Code is available at https://github.com/Yangl1nFeng/CoRe.
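The matching-then-grounding pipeline described above can be illustrated with a minimal sketch: retrieve a few candidate scenes by global text-scene similarity, then run object-level grounding only inside those candidates instead of over the whole archive. The code below is a hypothetical illustration, not the authors' implementation; the encoders are placeholders, and the function names (encode_text, retrieve_scenes, ground_in_scene) are assumptions for exposition only.

```python
# Hypothetical sketch of a matching-then-grounding pipeline (not CoRe's actual code).
# Stage 1: retrieve candidate scenes by text-scene similarity (loosely mirroring RTSA).
# Stage 2: ground the described object only within the retrieved scenes (loosely mirroring TWOA).
import numpy as np

def encode_text(query: str, dim: int = 256) -> np.ndarray:
    """Placeholder text encoder; a real system would use a learned language model."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve_scenes(query_emb: np.ndarray, scene_embs: np.ndarray, top_k: int = 5):
    """Rank precomputed, unit-normalized global scene embeddings by cosine similarity."""
    sims = scene_embs @ query_emb
    return np.argsort(-sims)[:top_k], sims

def ground_in_scene(query_emb: np.ndarray, object_embs: np.ndarray):
    """Pick the candidate object whose embedding best matches the query within one scene."""
    scores = object_embs @ query_emb
    return int(np.argmax(scores)), float(np.max(scores))

# Toy usage: 1000 archived scenes, each with 20 candidate objects (random stand-in features).
rng = np.random.default_rng(0)
scene_embs = rng.standard_normal((1000, 256))
scene_embs /= np.linalg.norm(scene_embs, axis=1, keepdims=True)

q = encode_text("the red mug on the kitchen counter")
candidates, _ = retrieve_scenes(q, scene_embs, top_k=5)
for s in candidates:  # object-level grounding runs on 5 scenes, not all 1000
    obj_embs = rng.standard_normal((20, 256))
    obj_embs /= np.linalg.norm(obj_embs, axis=1, keepdims=True)
    best_obj, score = ground_in_scene(q, obj_embs)
```

The point of the two-stage split is the cost profile: the expensive fine-grained word-object alignment is paid only for the handful of retrieved scenes, while the archive-wide step is a cheap global similarity search.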
Cite
Text
Feng et al. "Robust Cross-Modal Alignment Learning for Cross-Scene Spatial Reasoning and Grounding." Advances in Neural Information Processing Systems, 2025.
Markdown
[Feng et al. "Robust Cross-Modal Alignment Learning for Cross-Scene Spatial Reasoning and Grounding." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/feng2025neurips-robust/)
BibTeX
@inproceedings{feng2025neurips-robust,
  title = {{Robust Cross-Modal Alignment Learning for Cross-Scene Spatial Reasoning and Grounding}},
  author = {Feng, Yanglin and Zhu, Hongyuan and Peng, Dezhong and Peng, Xi and Song, Xiaomin and Hu, Peng},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2025},
  url = {https://mlanthology.org/neurips/2025/feng2025neurips-robust/}
}