InstructBLIP: Towards General-Purpose Vision-Language Models with Instruction Tuning
Abstract
Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-source.
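As a rough illustration of what "transforming a dataset into instruction tuning format" can mean, the sketch below wraps a raw VQA-style record in a natural-language instruction template. The template strings, the `InstructionSample` class, the `to_instruction_sample` helper, and the record field names are all hypothetical, chosen only for illustration; they are not taken from the paper or its released code.

```python
from dataclasses import dataclass

# Hypothetical templates for illustration only; not the paper's actual templates.
VQA_TEMPLATES = [
    "Question: {question} Short answer:",
    "Given the image, answer the following question briefly. {question}",
]


@dataclass
class InstructionSample:
    image_path: str   # path to the image the instruction refers to
    instruction: str  # text prompt given to the model
    target: str       # expected text output


def to_instruction_sample(record: dict, template_index: int = 0) -> InstructionSample:
    """Wrap a raw (image, question, answer) record in an instruction template."""
    template = VQA_TEMPLATES[template_index % len(VQA_TEMPLATES)]
    return InstructionSample(
        image_path=record["image"],
        instruction=template.format(question=record["question"]),
        target=record["answer"],
    )


# Assumed record layout: keys "image", "question", "answer".
record = {"image": "example.jpg", "question": "What color is the bus?", "answer": "red"}
sample = to_instruction_sample(record)
print(sample.instruction)
```

Keeping several templates per task and picking one per sample is one simple way to expose a model to varied phrasings of the same task; whether this matches the authors' exact procedure is an assumption here.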
Cite
Text
Dai et al. "InstructBLIP: Towards General-Purpose Vision-Language Models with Instruction Tuning." Neural Information Processing Systems, 2023.
Markdown
[Dai et al. "InstructBLIP: Towards General-Purpose Vision-Language Models with Instruction Tuning." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/dai2023neurips-instructblip/)
BibTeX
@inproceedings{dai2023neurips-instructblip,
  title = {{InstructBLIP: Towards General-Purpose Vision-Language Models with Instruction Tuning}},
  author = {Dai, Wenliang and Li, Junnan and Li, Dongxu and Tiong, Anthony and Zhao, Junqi and Wang, Weisheng and Li, Boyang and Fung, Pascale N and Hoi, Steven C.},
  booktitle = {Neural Information Processing Systems},
  year = {2023},
  url = {https://mlanthology.org/neurips/2023/dai2023neurips-instructblip/}
}