Think to Ground: Improving Spatial Reasoning in LLMs for Better Visual Grounding

Abstract

Visual grounding tasks involve locating objects in an image that are referred to by a text query. A model must identify the objects and their relationships, and understand the image as a whole, to accurately ground the target. Specialized models such as OWL-ViT and Grounding DINO often fail on queries that involve complex spatial information. In this paper, we propose a Spatial Thinking and Reasoning Dataset for visual grounding, together with a framework that uses existing detection models to identify candidate objects. These detectors pass coordinates and other attributes to a large language model (LLM), which performs spatial reasoning to determine the correct target. Recent closed-source models like GPT-4o achieve approximately 86% accuracy, while open-source models perform significantly worse, reaching only about 60% accuracy in our experiments. To close this gap, we use reinforcement learning to fine-tune a 3B open-source model on our dataset, achieving 77% accuracy, comparable to closed-source models.
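
As a rough illustration of the detect-then-reason framework described in the abstract (not the authors' implementation), the sketch below shows how a pipeline like this could be wired up: a detector's candidate boxes are serialized into a prompt, and an LLM reasons over their coordinates to pick the target. The `Candidate` format, the prompt wording, and the `llm` callable (which would wrap GPT-4o or a fine-tuned 3B open-source model) are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """A candidate object proposed by an off-the-shelf detector (hypothetical format)."""
    label: str
    box: tuple   # (x_min, y_min, x_max, y_max) in pixels
    score: float


def build_prompt(query: str, candidates: List[Candidate]) -> str:
    """Serialize candidate boxes and attributes into a spatial-reasoning prompt."""
    lines = [
        "You are given candidate objects detected in an image.",
        f"Query: {query}",
        "Candidates:",
    ]
    for i, c in enumerate(candidates):
        lines.append(f"  [{i}] label={c.label} box={c.box} score={c.score:.2f}")
    lines.append(
        "Reason about the spatial relations and answer with the index "
        "of the candidate that best matches the query."
    )
    return "\n".join(lines)


def ground(query: str,
           candidates: List[Candidate],
           llm: Callable[[str], str]) -> Candidate:
    """Let the LLM reason over detector outputs and return the chosen box."""
    answer = llm(build_prompt(query, candidates))
    # Extract the first integer-like answer; fall back to candidate 0.
    index = int("".join(ch for ch in answer if ch.isdigit()) or 0)
    return candidates[min(index, len(candidates) - 1)]


# Example usage with a stubbed LLM that always answers "0".
cands = [Candidate("mug", (40, 120, 90, 180), 0.91),
         Candidate("mug", (300, 118, 355, 182), 0.88)]
target = ground("the mug on the left side of the table", cands, llm=lambda p: "0")
print(target.box)
```

The key design point this sketch tries to capture is that the detector only proposes candidates; the final grounding decision is delegated to the LLM, which reasons over the serialized coordinates rather than over pixels.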

Cite

Text

Sharma and Vats. "Think to Ground: Improving Spatial Reasoning in LLMs for Better Visual Grounding." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.

Markdown

[Sharma and Vats. "Think to Ground: Improving Spatial Reasoning in LLMs for Better Visual Grounding." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.](https://mlanthology.org/iclrw/2025/sharma2025iclrw-think/)

BibTeX

@inproceedings{sharma2025iclrw-think,
  title     = {{Think to Ground: Improving Spatial Reasoning in LLMs for Better Visual Grounding}},
  author    = {Sharma, Karun and Vats, Vidushee},
  booktitle = {ICLR 2025 Workshops: LLM_Reason_and_Plan},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/sharma2025iclrw-think/}
}