AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-Based Referring

Wang, Xinyi; Zhao, Na; Han, Zhiyuan; Guo, Dan; Yang, Xun

doi:10.1609/AAAI.V39I8.32863

AAAI 2025 pp. 8006-8014

Abstract

3D visual grounding (3DVG), which aims to correlate a natural language description with the target object within a 3D scene, is a significant yet challenging task. Despite recent advancements in this domain, existing approaches commonly encounter a shortage: a limited amount and diversity of text-3D pairs available for training. Moreover, they fall short in effectively leveraging different contextual clues (e.g., rich spatial relations within the 3D visual space) for grounding. To address these limitations, we propose AugRefer, a novel approach for advancing 3D visual grounding. AugRefer introduces cross-modal augmentation designed to extensively generate diverse text-3D pairs by placing objects into 3D scenes and creating accurate and semantically rich descriptions using foundation models. Notably, the resulting pairs can be utilized by any existing 3DVG methods for enriching their training data. Besides, AugRefer presents a language-spatial adaptive decoder that effectively adapts the potential referring objects based on the language description and various 3D spatial relations. Extensive experiments on three benchmark datasets clearly validate the effectiveness of AugRefer.

Cite

Text

Wang et al. "AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-Based Referring." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I8.32863

Markdown

[Wang et al. "AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-Based Referring." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/wang2025aaai-augrefer/) doi:10.1609/AAAI.V39I8.32863

BibTeX

@inproceedings{wang2025aaai-augrefer,
  title     = {{AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-Based Referring}},
  author    = {Wang, Xinyi and Zhao, Na and Han, Zhiyuan and Guo, Dan and Yang, Xun},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {8006-8014},
  doi       = {10.1609/AAAI.V39I8.32863},
  url       = {https://mlanthology.org/aaai/2025/wang2025aaai-augrefer/}
}