mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Ye, Jiabo; Xu, Haiyang; Liu, Haowei; Hu, Anwen; Yan, Ming; Qian, Qi; Zhang, Ji; Huang, Fei; Zhou, Jingren

ICLR 2025

Abstract

Multi-modal Large Language Models have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce mPLUG-Owl3, a versatile multi-modal large language model that enhances long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, multimodal in-context examples, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. We conduct evaluations on 21 benchmarks that cover single-image, multi-image, short-video, and long-video understanding. mPLUG-Owl3 achieves performance competitive with state-of-the-art methods while reducing inference time and memory usage by 87.8% and 48.5% on average, respectively. Moreover, we propose a Distractor Resistance evaluation to assess the ability of models to maintain focus amidst distractions. mPLUG-Owl3 also demonstrates outstanding distractor resistance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.

Cite

Text

Ye et al. "mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models." International Conference on Learning Representations, 2025.

Markdown

[Ye et al. "mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/ye2025iclr-mplugowl3/)

BibTeX

@inproceedings{ye2025iclr-mplugowl3,
  title     = {{mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models}},
  author    = {Ye, Jiabo and Xu, Haiyang and Liu, Haowei and Hu, Anwen and Yan, Ming and Qian, Qi and Zhang, Ji and Huang, Fei and Zhou, Jingren},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/ye2025iclr-mplugowl3/}
}