Leopard: A Vision Language Model for Text-Rich Multi-Image Tasks

Abstract

Text-rich images, where text serves as the central visual element guiding the overall understanding, are prevalent in real-world applications, such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but also reasoning about the inter-relationships and logical flow across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle to handle such tasks due to two key challenges: (1) the scarcity of high-quality instruction tuning datasets for text-rich multi-image scenarios, and (2) the difficulty in balancing image resolution with visual feature sequence length. To address these challenges, we propose Leopard, an MLLM designed specifically for handling vision-language tasks involving multiple text-rich images. First, we curate about one million high-quality multimodal instruction-tuning instances tailored to text-rich, multi-image scenarios. Second, we propose an adaptive high-resolution multi-image encoding module that dynamically optimizes the allocation of visual sequence length based on the original aspect ratios and resolutions of the input images. Experiments on a diverse set of benchmarks reveal that our model consistently outperforms state-of-the-art systems, such as Llama-3.2 and Qwen2-VL, in challenging text-rich, multi-image evaluations. Remarkably, our approach achieves outstanding performance using only 1.2M fully open-sourced training instances, outperforming models that rely on large-scale in-house data, highlighting its efficiency and effectiveness. Our code and data are available at https://anonymous.4open.science/r/Leopard-908F.
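To make the budgeting idea behind the adaptive encoding module concrete, below is a minimal Python sketch (not the authors' released code) of one way a fixed visual-token budget could be split across multiple images in proportion to their resolutions while each image's tile grid roughly preserves its aspect ratio. The names TILE_TOKENS, max_budget, and allocate_tiles are illustrative assumptions, not identifiers from the paper.

import math

TILE_TOKENS = 144  # tokens produced per image tile; an assumed value, not from the paper

def allocate_tiles(sizes, max_budget=4096):
    """sizes: list of (width, height) pairs; returns a (cols, rows) tile grid per image."""
    areas = [w * h for w, h in sizes]
    total = sum(areas)
    grids = []
    for (w, h), area in zip(sizes, areas):
        # Give each image a share of the total tile budget proportional to its pixel area.
        n_tiles = max(1, int(max_budget / TILE_TOKENS * area / total))
        # Pick a cols x rows grid whose shape approximates the image's aspect ratio.
        cols = max(1, round(math.sqrt(n_tiles * w / h)))
        rows = max(1, n_tiles // cols)
        grids.append((cols, rows))
    return grids

# A landscape slide gets a wide grid; a portrait scan gets a tall one.
print(allocate_tiles([(1920, 1080), (800, 1200)]))  # -> [(6, 3), (2, 4)]

This sketch only illustrates the proportional-allocation intuition; the module described in the paper operates on visual feature sequences inside the MLLM itself.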

Cite

Text

Jia et al. "Leopard: A Vision Language Model for Text-Rich Multi-Image Tasks." Transactions on Machine Learning Research, 2025.

Markdown

[Jia et al. "Leopard: A Vision Language Model for Text-Rich Multi-Image Tasks." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/jia2025tmlr-leopard/)

BibTeX

@article{jia2025tmlr-leopard,
  title     = {{Leopard: A Vision Language Model for Text-Rich Multi-Image Tasks}},
  author    = {Jia, Mengzhao and Yu, Wenhao and Ma, Kaixin and Fang, Tianqing and Zhang, Zhihan and Ouyang, Siru and Zhang, Hongming and Yu, Dong and Jiang, Meng},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/jia2025tmlr-leopard/}
}