Look Around Before Locating: Considering Content and Structure Information for Visual Grounding
Abstract
As a long-standing challenge and fundamental requirement in vision-and-language tasks, visual grounding aims to localize the target referred to by a natural language query. Regional annotations form a superficial correlation between the subject of an expression and certain common visual entities, which hinders models from comprehending the linguistic content and structure. Moreover, current one-stage methods struggle to uniformly model the visual and linguistic structure due to the structural gap between continuous image patches and discrete text tokens. In this paper, we propose a semi-structured reasoning framework for visual grounding that gradually comprehends the linguistic content and structure. Specifically, we devise a cross-modal content alignment module to effectively align unlabeled contextual information into a stable semantic space corrected by token-level prior knowledge obtained with CLIP. A multi-branch modulated localization module is also established to obtain grounding modulated by the linguistic structure. Through a soft split mechanism, our method can destructure the expression into a fixed semi-structure (i.e., subject and context) while preserving the completeness of the linguistic content. Our method thereby builds a semi-structured reasoning system that effectively comprehends the linguistic content and structure via content alignment and structure-modulated grounding. Experimental results on five widely used datasets validate the performance improvements of our proposed method.
Cite
Text
Zheng et al. "Look Around Before Locating: Considering Content and Structure Information for Visual Grounding." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I2.32158
Markdown
[Zheng et al. "Look Around Before Locating: Considering Content and Structure Information for Visual Grounding." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zheng2025aaai-look/) doi:10.1609/AAAI.V39I2.32158
BibTeX
@inproceedings{zheng2025aaai-look,
title = {{Look Around Before Locating: Considering Content and Structure Information for Visual Grounding}},
author = {Zheng, Shiyi and Zhao, Peizhi and Zheng, Zhilong and He, Peihang and Cheng, Haonan and Cai, Yi and Huang, Qingbao},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {1656--1664},
doi = {10.1609/AAAI.V39I2.32158},
url = {https://mlanthology.org/aaai/2025/zheng2025aaai-look/}
}