Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
Abstract
Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g., BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs. right location. In this work, we explore how image-space coordinate-based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.
Cite
Text
Ranasinghe et al. "Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01233
Markdown
[Ranasinghe et al. "Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/ranasinghe2024cvpr-learning/) doi:10.1109/CVPR52733.2024.01233
BibTeX
@inproceedings{ranasinghe2024cvpr-learning,
title = {{Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs}},
author = {Ranasinghe, Kanchana and Shukla, Satya Narayan and Poursaeed, Omid and Ryoo, Michael S. and Lin, Tsung-Yu},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {12977-12987},
doi = {10.1109/CVPR52733.2024.01233},
url = {https://mlanthology.org/cvpr/2024/ranasinghe2024cvpr-learning/}
}