mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Abstract
Multi-modal Large Language Models have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce mPLUG-Owl3, a versatile multi-modal large language model that enhances long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, multimodal in-context examples, and lengthy videos. Specifically, we propose novel hyper attention blocks that efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. We conduct evaluations on 21 benchmarks covering single-image, multi-image, short-video, and long-video understanding. mPLUG-Owl3 achieves performance competitive with state-of-the-art methods while reducing inference time and memory usage by 87.8% and 48.5% on average, respectively. Moreover, we propose a Distractor Resistance evaluation to assess the ability of models to maintain focus amidst distractions. mPLUG-Owl3 also demonstrates outstanding performance in distractor resistance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.
Cite
Text
Ye et al. "mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models." International Conference on Learning Representations, 2025.
Markdown
[Ye et al. "mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/ye2025iclr-mplugowl3/)
BibTeX
@inproceedings{ye2025iclr-mplugowl3,
title = {{mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models}},
author = {Ye, Jiabo and Xu, Haiyang and Liu, Haowei and Hu, Anwen and Yan, Ming and Qian, Qi and Zhang, Ji and Huang, Fei and Zhou, Jingren},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/ye2025iclr-mplugowl3/}
}