Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Abstract

Instruction tuning enhances the capability of Large Language Models (LLMs) to interact with humans. Recent instruction-following datasets also include images as visual input, collecting responses to image-based instructions. However, current visual instruction-tuned models struggle to comprehend textual details within images. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters and book covers). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. We then prompt text-only GPT-4 with the recognized text and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining our collected data with previous multimodal instruction-following data, our model, LLaVAR, substantially improves on the LLaVA model on text-based VQA datasets (up to 20% accuracy improvement). The GPT-4-based instruction-following evaluation also demonstrates the improvement of our model on both natural images and text-rich images. In qualitative analysis, LLaVAR shows promising interaction skills (e.g., reasoning, writing, and elaboration) with humans based on the latest real-world online content that combines text and images. We make our code/data/models publicly available.
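
The data-generation step described in the abstract (passing OCR output and an image caption to a text-only model) can be illustrated with a minimal sketch. This is not the authors' code: the `TextRichImage` record, its field names, the `build_prompt` helper, and the prompt wording are all hypothetical, and the call to the language model itself is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextRichImage:
    """One image record. Field names are hypothetical, not from the paper."""
    caption: str          # caption associated with the image
    ocr_lines: List[str]  # text lines returned by an OCR tool


def build_prompt(image: TextRichImage) -> str:
    """Assemble a text-only prompt from a caption and OCR lines.

    The wording below is an illustrative assumption; the paper's actual
    prompt is not given in the abstract.
    """
    # Drop empty OCR lines and strip surrounding whitespace.
    ocr_block = "\n".join(
        line.strip() for line in image.ocr_lines if line.strip()
    )
    return (
        "You are given a caption and OCR text of an image you cannot see.\n"
        f"Caption: {image.caption}\n"
        f"OCR text:\n{ocr_block}\n"
        "Write a conversation of question-answer pairs about the text in the image."
    )


if __name__ == "__main__":
    example = TextRichImage(
        caption="A movie poster",
        ocr_lines=["  THE ARRIVAL ", "", "Coming 2024"],
    )
    print(build_prompt(example))
```

Because the model receives only text, the image is represented entirely by its caption and OCR output, which is why noisy or empty OCR lines are filtered before the prompt is built.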

Cite

Text

Zhang et al. "Enhanced Visual Instruction Tuning for Text-Rich Image Understanding." NeurIPS 2023 Workshops: Instruction, 2023.

Markdown

[Zhang et al. "Enhanced Visual Instruction Tuning for Text-Rich Image Understanding." NeurIPS 2023 Workshops: Instruction, 2023.](https://mlanthology.org/neuripsw/2023/zhang2023neuripsw-enhanced/)

BibTeX

@inproceedings{zhang2023neuripsw-enhanced,
  title     = {{Enhanced Visual Instruction Tuning for Text-Rich Image Understanding}},
  author    = {Zhang, Yanzhe and Zhang, Ruiyi and Gu, Jiuxiang and Zhou, Yufan and Lipka, Nedim and Yang, Diyi and Sun, Tong},
  booktitle = {NeurIPS 2023 Workshops: Instruction},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/zhang2023neuripsw-enhanced/}
}