Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders

Abstract

Recent multimodal large language models (MLLMs) increasingly integrate multiple vision encoders to improve performance on various benchmarks, assuming that diverse pretraining objectives yield complementary visual signals. However, we show that this assumption often fails in practice. Through systematic encoder masking across representative multi-encoder MLLMs, we find that performance typically degrades gracefully, and sometimes even improves, when selected encoders are masked, revealing pervasive encoder redundancy. To quantify this effect, we introduce two principled metrics: the Conditional Utilization Rate (CUR), which measures an encoder’s marginal contribution in the presence of others, and the Information Gap (IG), which captures heterogeneity in encoder utility within a model. Using these tools, we observe: (i) strong specialization on tasks like OCR & Chart, where a single encoder can dominate with a CUR above 90%; (ii) high redundancy on general VQA and knowledge-based tasks, where encoders are largely interchangeable; and (iii) instances of detrimental encoders with negative CUR. Notably, masking specific encoders can yield up to 16% higher accuracy on a specific task category and a 3.6% overall performance boost compared to the full model. Furthermore, single- and dual-encoder variants recover over 90% of baseline performance on most non-OCR tasks. Our analysis challenges the “more encoders are better” heuristic in MLLMs and provides actionable diagnostics for developing more efficient and effective multimodal architectures.
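
The abstract describes CUR and IG only in words, so the following is a minimal sketch under stated assumptions rather than the paper's definitions. It assumes CUR is the relative accuracy drop when one encoder is masked, (full - masked) / full, and that IG is the spread between the largest and smallest CUR within a model. The function names, encoder names, and accuracy values are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple


def conditional_utilization_rate(full_acc: float, masked_acc: float) -> float:
    """Assumed CUR: relative accuracy lost when a single encoder is masked.

    A positive value means the encoder helps, a value near zero means it is
    redundant, and a negative value means masking it improves accuracy.
    """
    if full_acc <= 0:
        raise ValueError("full_acc must be positive")
    return (full_acc - masked_acc) / full_acc


def information_gap(curs: Dict[str, float]) -> float:
    """Assumed IG: spread between the most and least useful encoder."""
    if not curs:
        raise ValueError("curs must not be empty")
    return max(curs.values()) - min(curs.values())


def summarize(
    full_acc: float, masked_accs: Dict[str, float]
) -> Tuple[Dict[str, float], float]:
    """Compute the per-encoder CUR and the model-level IG for one task."""
    curs = {
        name: conditional_utilization_rate(full_acc, acc)
        for name, acc in masked_accs.items()
    }
    return curs, information_gap(curs)


# Hypothetical accuracies for one task category (not results from the paper).
full_accuracy = 0.80
accuracy_with_encoder_masked = {
    "encoder_a": 0.08,  # masking causes a large drop: a dominant encoder
    "encoder_b": 0.78,  # masking causes a small drop: a largely redundant encoder
    "encoder_c": 0.82,  # masking improves accuracy: a detrimental encoder
}

curs, ig = summarize(full_accuracy, accuracy_with_encoder_masked)
for name, cur in curs.items():
    print(f"{name}: CUR = {cur:+.3f}")
print(f"IG = {ig:.3f}")
```

Under these assumed definitions, the three hypothetical encoders illustrate the three regimes named in the abstract: a dominant encoder with high CUR, a redundant encoder with CUR near zero, and a detrimental encoder with negative CUR.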

Cite

Text

Wang et al. "Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders." International Conference on Learning Representations, 2026.

Markdown

[Wang et al. "Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wang2026iclr-investigating/)

BibTeX

@inproceedings{wang2026iclr-investigating,
  title     = {{Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders}},
  author    = {Wang, Yizhou and Mao, Song and Chen, Yang and Shen, Yufan and Cai, Pinlong and Wang, Ding and Yan, Guohang and Yu, Zhi and Yan, Yinqiao and Hu, Xuming and Shi, Botian},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wang2026iclr-investigating/}
}