Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin

CVPR 2024 pp. 12977-12987

doi:10.1109/CVPR52733.2024.01233

Abstract

Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g., BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs right location. In this work, we explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.

Cite

Text

Ranasinghe et al. "Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01233

Markdown

[Ranasinghe et al. "Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/ranasinghe2024cvpr-learning/) doi:10.1109/CVPR52733.2024.01233

BibTeX

@inproceedings{ranasinghe2024cvpr-learning,
  title     = {{Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs}},
  author    = {Ranasinghe, Kanchana and Shukla, Satya Narayan and Poursaeed, Omid and Ryoo, Michael S. and Lin, Tsung-Yu},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {12977--12987},
  doi       = {10.1109/CVPR52733.2024.01233},
  url       = {https://mlanthology.org/cvpr/2024/ranasinghe2024cvpr-learning/}
}