LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
Abstract
With the recent significant advancements in large multimodal models (LMMs), the importance of their grounding capability in visual chat is increasingly recognized. Despite recent efforts to enable LMMs to support grounding, their grounding and chat capabilities are usually separate, and their chat performance drops dramatically when they are asked to ground. We believe the problem is the lack of a dataset for grounded visual chat (GVC); existing grounding datasets contain only short captions. To address this issue, we have created GVC data that combines grounding and chat capabilities. To better evaluate GVC capabilities, we have introduced a benchmark called Grounding-Bench. Additionally, we have proposed a model design that supports GVC and various types of visual prompts by connecting segmentation models with language models. Experimental results demonstrate that our model outperforms other LMMs on Grounding-Bench. Furthermore, our model achieves competitive performance on classic grounding benchmarks such as RefCOCO/+/g and Flickr30K Entities.
Cite
Text
Zhang et al. "LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72775-7_2
Markdown
[Zhang et al. "LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/zhang2024eccv-llavagrounding/) doi:10.1007/978-3-031-72775-7_2
BibTeX
@inproceedings{zhang2024eccv-llavagrounding,
title = {{LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models}},
author = {Zhang, Hao and Li, Hongyang and Li, Feng and Ren, Tianhe and Zou, Xueyan and Liu, Shilong and Huang, Shijia and Gao, Jianfeng and Zhang, Lei and Li, Chunyuan and Yang, Jianwei},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-72775-7_2},
url = {https://mlanthology.org/eccv/2024/zhang2024eccv-llavagrounding/}
}