Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Abstract

Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g., BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs. right location. In this work, we explore how image-space coordinate-based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.
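To make the idea of a coordinate-based instruction objective concrete, the sketch below turns a bounding box into a textual instruction/response pair and derives a left-vs-right label from two boxes. The abstract does not state which coordinate representation or prompt wording the paper settles on, so everything here is an assumption made for illustration: the `Box` class, the normalized `[x1, y1, x2, y2]` text format, the prompt template, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical axis-aligned box in pixel coordinates."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def normalize_box(box: Box, width: int, height: int) -> tuple:
    """Scale pixel coordinates to [0, 1] (an assumed representation)."""
    return (
        round(box.x1 / width, 3),
        round(box.y1 / height, 3),
        round(box.x2 / width, 3),
        round(box.y2 / height, 3),
    )


def make_location_example(box: Box, width: int, height: int) -> dict:
    """Build one illustrative instruction/response pair for fine-tuning."""
    x1, y1, x2, y2 = normalize_box(box, width, height)
    return {
        # The prompt wording is a placeholder, not taken from the paper.
        "instruction": f"Where is the {box.label} located in the image?",
        "response": f"[{x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f}]",
    }


def relative_side(a: Box, b: Box) -> str:
    """Return 'left' if the center of a lies left of the center of b."""
    center_a = (a.x1 + a.x2) / 2
    center_b = (b.x1 + b.x2) / 2
    return "left" if center_a < center_b else "right"


if __name__ == "__main__":
    dog = Box("dog", 64, 120, 320, 360)
    cat = Box("cat", 400, 100, 600, 300)
    print(make_location_example(dog, width=640, height=480))
    print(f"The dog is to the {relative_side(dog, cat)} of the cat.")
```

Pairs like these could in principle be generated from existing detection annotations, which is one plausible reading of the pseudo-data generation the abstract mentions. The actual strategy is described in the paper itself.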

Cite

Text

Ranasinghe et al. "Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01233

Markdown

[Ranasinghe et al. "Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/ranasinghe2024cvpr-learning/) doi:10.1109/CVPR52733.2024.01233

BibTeX

@inproceedings{ranasinghe2024cvpr-learning,
  title     = {{Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs}},
  author    = {Ranasinghe, Kanchana and Shukla, Satya Narayan and Poursaeed, Omid and Ryoo, Michael S. and Lin, Tsung-Yu},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {12977--12987},
  doi       = {10.1109/CVPR52733.2024.01233},
  url       = {https://mlanthology.org/cvpr/2024/ranasinghe2024cvpr-learning/}
}