ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Abstract

Modality alignment serves as the cornerstone for large multi-modal models (LMMs). However, the impact of different attributes (e.g., data type, quality, and scale) of training data on facilitating effective alignment is still under-explored. In this paper, we delve into the influence of training data on LMMs, uncovering three pivotal findings: 1) Highly detailed captions enable more nuanced vision-language alignment, significantly boosting the performance of LMMs on diverse benchmarks and surpassing outcomes from brief captions or VQA data; 2) Cutting-edge LMMs approach the captioning capability of costly human annotators, and open-source LMMs can reach similar quality after lightweight fine-tuning; 3) The performance of LMMs scales with the number of detailed captions, exhibiting remarkable improvements as the corpus grows from thousands to millions of captions. Drawing from these findings, we introduce the ShareGPT4V series for advanced modality alignment. It includes ShareGPT4V, consisting of 100K high-quality captions curated from GPT4-Vision; ShareGPT4V-PT, containing 1.2M captions produced by our Share-Captioner, which approaches the captioning capability of GPT4-Vision; and ShareGPT4V-7B, a simple yet superior LMM that achieves stronger alignment through our large-scale, high-quality captions and excels on most multi-modal benchmarks. The project is available at https://sharegpt4v.github.io/.
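
To make the data side of the abstract concrete, below is a minimal sketch of how a detailed-caption corpus such as ShareGPT4V or ShareGPT4V-PT might be turned into (instruction, response) pairs for vision-language alignment training. The JSON schema, file name, and default prompt are illustrative assumptions, not the authors' released format.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CaptionRecord:
    image_path: str  # path or URL of the image
    caption: str     # detailed GPT4-Vision / Share-Captioner style caption


def load_caption_corpus(json_path: str) -> List[CaptionRecord]:
    """Load a list of {"image": ..., "caption": ...} records (hypothetical schema)."""
    with open(json_path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    return [CaptionRecord(r["image"], r["caption"]) for r in raw]


def to_alignment_pairs(
    records: List[CaptionRecord],
    prompt: str = "Describe this image in detail.",  # assumed instruction template
) -> List[Dict[str, str]]:
    """Convert caption records into instruction/response pairs suitable for
    caption-based vision-language alignment pre-training."""
    return [
        {"image": r.image_path, "instruction": prompt, "response": r.caption}
        for r in records
    ]


if __name__ == "__main__":
    # "sharegpt4v_captions.json" is a placeholder file name for illustration.
    records = load_caption_corpus("sharegpt4v_captions.json")
    pairs = to_alignment_pairs(records)
    print(f"Prepared {len(pairs)} caption pairs for alignment training")
```

In this framing, scaling the corpus (finding 3 in the abstract) simply means feeding more such pairs through the same alignment stage; the caption detail, rather than the pair format, is what changes between brief-caption data and ShareGPT4V-style data.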

Cite

Text

Chen et al. "ShareGPT4V: Improving Large Multi-Modal Models with Better Captions." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72643-9_22

Markdown

[Chen et al. "ShareGPT4V: Improving Large Multi-Modal Models with Better Captions." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/chen2024eccv-sharegpt4v/) doi:10.1007/978-3-031-72643-9_22

BibTeX

@inproceedings{chen2024eccv-sharegpt4v,
  title     = {{ShareGPT4V: Improving Large Multi-Modal Models with Better Captions}},
  author    = {Chen, Lin and Li, Jinsong and Dong, Xiaoyi and Zhang, Pan and He, Conghui and Wang, Jiaqi and Zhao, Feng and Lin, Dahua},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72643-9_22},
  url       = {https://mlanthology.org/eccv/2024/chen2024eccv-sharegpt4v/}
}