Improved Baselines with Visual Instruction Tuning

Abstract

Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art performance across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data samples, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and models will be publicly available.
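The "MLP projection" mentioned above refers to replacing LLaVA's original linear vision-language connector with a small multi-layer perceptron that maps vision-encoder patch features into the LLM's token embedding space. Below is a minimal PyTorch sketch of such a projector; the class name, argument names, and the specific dimensions (1024 for CLIP-ViT-L patch features, 5120 for a 13B LLM hidden size) are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Sketch of a two-layer MLP connector that maps vision-encoder patch
    features into the LLM embedding space (dimensions are assumptions)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim)
        # returns:         (batch, num_patches, llm_dim), ready to be
        #                  concatenated with the LLM's text token embeddings
        return self.proj(vision_features)


if __name__ == "__main__":
    projector = MLPProjector()
    # A 336px image with 14px patches yields 24 x 24 = 576 patch tokens.
    dummy_patches = torch.randn(2, 576, 1024)
    tokens = projector(dummy_patches)
    print(tokens.shape)  # torch.Size([2, 576, 5120])
```

Compared with a single linear layer, the extra hidden layer and nonlinearity give the connector more capacity while keeping the added parameter count and training cost small.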

Cite

Text

Liu et al. "Improved Baselines with Visual Instruction Tuning." NeurIPS 2023 Workshops: Instruction, 2023.

Markdown

[Liu et al. "Improved Baselines with Visual Instruction Tuning." NeurIPS 2023 Workshops: Instruction, 2023.](https://mlanthology.org/neuripsw/2023/liu2023neuripsw-improved/)

BibTeX

@inproceedings{liu2023neuripsw-improved,
  title     = {{Improved Baselines with Visual Instruction Tuning}},
  author    = {Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
  booktitle = {NeurIPS 2023 Workshops: Instruction},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/liu2023neuripsw-improved/}
}