Look Around Before Locating: Considering Content and Structure Information for Visual Grounding

Abstract

As a long-standing challenge and fundamental requirement in vision-and-language tasks, visual grounding aims to localize a target referred to by a natural language query. Regional annotations form a superficial correlation between the subject of an expression and certain common visual entities, which hinders models from comprehending linguistic content and structure. Moreover, current one-stage methods struggle to uniformly model visual and linguistic structure due to the structural gap between continuous image patches and discrete text tokens. In this paper, we propose a semi-structured reasoning framework for visual grounding that gradually comprehends linguistic content and structure. Specifically, we devise a cross-modal content alignment module to effectively align unlabeled contextual information into a stable semantic space corrected by token-level prior knowledge obtained with CLIP. We also establish a multi-branch modulated localization module to obtain grounding modulated by linguistic structure. Through a soft split mechanism, our method destructures the expression into a fixed semi-structure (i.e., subject and context) while preserving the completeness of the linguistic content. It thereby builds a semi-structured reasoning system that effectively comprehends linguistic content and structure via content alignment and structure-modulated grounding. Experimental results on five widely used datasets validate the performance improvements of our proposed method.

Cite

Text

Zheng et al. "Look Around Before Locating: Considering Content and Structure Information for Visual Grounding." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I2.32158

Markdown

[Zheng et al. "Look Around Before Locating: Considering Content and Structure Information for Visual Grounding." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zheng2025aaai-look/) doi:10.1609/AAAI.V39I2.32158

BibTeX

@inproceedings{zheng2025aaai-look,
  title     = {{Look Around Before Locating: Considering Content and Structure Information for Visual Grounding}},
  author    = {Zheng, Shiyi and Zhao, Peizhi and Zheng, Zhilong and He, Peihang and Cheng, Haonan and Cai, Yi and Huang, Qingbao},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {1656--1664},
  doi       = {10.1609/AAAI.V39I2.32158},
  url       = {https://mlanthology.org/aaai/2025/zheng2025aaai-look/}
}