Efficient Multi-Modal Large Language Models via Progressive Consistency Distillation

Abstract

Visual tokens consume substantial computational resources in multi-modal large language models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model's parameter space struggles to adapt quickly to the substantial perturbations that token compression induces in the feature space. In this work, we propose Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature-space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, which reduce training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.
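To make the core idea concrete, here is a minimal, hypothetical sketch of the progressive scheme described above: a compression schedule that starts gentle and gradually becomes more aggressive, with a consistency loss that matches a student operating on compressed tokens against a teacher operating on the full token set. All names (`progressive_keep_ratio`, the scoring rule, the toy feature function) are illustrative assumptions, not the paper's actual implementation.

```python
import math

def progressive_keep_ratio(step, total_steps, final_ratio=0.25):
    """Cosine schedule (assumed): the fraction of visual tokens kept decays
    from 1.0 (no compression) to final_ratio, so the feature-space
    perturbation grows gradually rather than all at once."""
    t = step / max(total_steps, 1)
    return final_ratio + (1.0 - final_ratio) * 0.5 * (1 + math.cos(math.pi * t))

def compress_tokens(tokens, keep_ratio):
    """Keep the top-scoring tokens; a stand-in for any token-pruning rule."""
    k = max(1, int(round(len(tokens) * keep_ratio)))
    ranked = sorted(tokens, key=lambda tok: tok["score"], reverse=True)
    return ranked[:k]

def model_feature(tokens):
    """Toy 'model output': mean of token values (placeholder for an MLLM's
    features over the visual-token sequence)."""
    return sum(t["value"] for t in tokens) / len(tokens)

def consistency_loss(student_tokens, teacher_tokens):
    """Squared difference between the student (compressed input) and the
    teacher (full input) outputs -- the consistency-distillation target."""
    return (model_feature(student_tokens) - model_feature(teacher_tokens)) ** 2

# Simulate a short training run: early steps see almost no compression
# (near-zero consistency loss), later steps see heavier compression.
tokens = [{"value": float(i), "score": float(i % 7)} for i in range(64)]
losses = []
for step in range(0, 101, 25):
    ratio = progressive_keep_ratio(step, 100)
    student = compress_tokens(tokens, ratio)
    losses.append(consistency_loss(student, tokens))
```

The schedule is the key design choice this sketch illustrates: because the keep ratio decays smoothly, the student is never asked to absorb a large, abrupt feature-space perturbation, which is the training difficulty the abstract attributes to prior compress-during-training approaches.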

Cite

Text

Wen et al. "Efficient Multi-Modal Large Language Models via Progressive Consistency Distillation." Advances in Neural Information Processing Systems, 2025.

Markdown

[Wen et al. "Efficient Multi-Modal Large Language Models via Progressive Consistency Distillation." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/wen2025neurips-efficient/)

BibTeX

@inproceedings{wen2025neurips-efficient,
  title     = {{Efficient Multi-Modal Large Language Models via Progressive Consistency Distillation}},
  author    = {Wen, Zichen and Wang, Shaobo and Zhou, Yufa and Zhang, Junyuan and Zhang, Qintong and Gao, Yifeng and Chen, Zhaorun and Wang, Bin and Li, Weijia and He, Conghui and Zhang, Linfeng},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/wen2025neurips-efficient/}
}