ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Abstract
While existing large vision-language multimodal models focus on whole-image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary (free-form) visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings, yet achieves state-of-the-art performance on region-understanding tasks like Visual7W, PointQA, and the Visual Commonsense Reasoning benchmark. Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain. Code, data, and model are publicly available.
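As a rough illustration of what "directly overlays visual markers onto the RGB image" can mean in practice, the sketch below draws a red bounding-box outline into a copy of an image before it would be passed to a model. It is not the authors' implementation: the function names `blend` and `overlay_box`, the nested-list image representation (a stand-in for a real image library), and the `alpha` transparency parameter are all assumptions made for this example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels (hypothetical stand-in for a real image type)


def blend(base: Pixel, marker: Pixel, alpha: float) -> Pixel:
    """Alpha-blend a marker colour over a base pixel (alpha=1.0 fully replaces it)."""
    return tuple(round(alpha * m + (1 - alpha) * b) for b, m in zip(base, marker))


def overlay_box(
    image: Image,
    box: Tuple[int, int, int, int],
    color: Pixel = (255, 0, 0),
    thickness: int = 1,
    alpha: float = 1.0,
) -> Image:
    """Return a copy of `image` with a rectangular outline drawn on it.

    `box` is (left, top, right, bottom) in inclusive pixel coordinates.
    The `alpha` parameter is an assumption of this sketch.
    """
    left, top, right, bottom = box
    out = [row[:] for row in image]  # leave the input image untouched
    for y in range(max(top, 0), min(bottom, len(out) - 1) + 1):
        for x in range(max(left, 0), min(right, len(out[y]) - 1) + 1):
            on_border = (
                x - left < thickness or right - x < thickness
                or y - top < thickness or bottom - y < thickness
            )
            if on_border:
                out[y][x] = blend(out[y][x], color, alpha)
    return out


# Usage: mark a region on a small black image with a "red bounding box".
image = [[(0, 0, 0)] * 8 for _ in range(6)]
marked = overlay_box(image, box=(1, 1, 6, 4))
```

The marker ends up in the pixel data itself, so in this sketch the region is conveyed without any separate coordinate or region encoding.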
Cite
Text
Cai et al. "ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01227
Markdown
[Cai et al. "ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/cai2024cvpr-vipllava/) doi:10.1109/CVPR52733.2024.01227
BibTeX
@inproceedings{cai2024cvpr-vipllava,
title = {{ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts}},
author = {Cai, Mu and Liu, Haotian and Mustikovela, Siva Karthik and Meyer, Gregory P. and Chai, Yuning and Park, Dennis and Lee, Yong Jae},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {12914--12923},
doi = {10.1109/CVPR52733.2024.01227},
url = {https://mlanthology.org/cvpr/2024/cai2024cvpr-vipllava/}
}