LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation

Abstract

We introduce LLaVA-MoD, a novel framework designed to enable the efficient training of small-scale Multimodal Language Models ($s$-MLLM) by distilling knowledge from a large-scale MLLM ($l$-MLLM). Our approach tackles two fundamental challenges in MLLM distillation. First, we optimize the network structure of $s$-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the language model, striking a balance between computational efficiency and model expressiveness. Second, we propose a progressive knowledge transfer strategy to ensure comprehensive knowledge transfer. This strategy begins with mimic distillation, where we minimize the Kullback-Leibler (KL) divergence between output distributions to enable $s$-MLLM to emulate $l$-MLLM's understanding. Following this, we introduce preference distillation via Preference Optimization (PO), where the key lies in treating $l$-MLLM as the reference model. During this phase, the $s$-MLLM's ability to discriminate between superior and inferior examples is significantly enhanced beyond $l$-MLLM, leading to a better $s$-MLLM that surpasses $l$-MLLM, particularly on hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD surpasses existing works across various benchmarks while maintaining minimal activated parameters and low computational costs. Remarkably, LLaVA-MoD-2B surpasses Qwen-VL-Chat-7B with an average gain of 8.8\%, using merely 0.3\% of the training data and 23\% of the trainable parameters. The results underscore LLaVA-MoD's ability to effectively distill comprehensive knowledge from its teacher model, paving the way for developing efficient MLLMs.
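
The two stages of the progressive transfer strategy lend themselves to a compact sketch. The snippet below illustrates (i) a mimic-distillation loss that minimizes the KL divergence between the teacher's and student's output token distributions, and (ii) a DPO-style preference-distillation loss in which the frozen $l$-MLLM serves as the reference model. This is a minimal PyTorch sketch, not the paper's implementation: the function names, the `temperature` and `beta` hyperparameters, the KL direction, and the assumption that per-response log-probabilities are pre-summed over tokens are all illustrative choices.

```python
# Minimal sketch of the two distillation objectives described in the abstract.
# All names and hyperparameters here are assumptions for illustration.
import torch
import torch.nn.functional as F


def mimic_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over output token distributions.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, -2)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1).flatten(0, -2)
    # "batchmean" averages the summed KL over the flattened token dimension;
    # the temperature**2 factor keeps gradient scale comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2


def preference_distillation_loss(
    student_chosen_logps, student_rejected_logps,
    teacher_chosen_logps, teacher_rejected_logps,
    beta=0.1,
):
    """DPO-style preference loss with the teacher (l-MLLM) as the reference model.

    Each argument: (batch,) log-probabilities of the chosen / rejected response,
    summed over response tokens.
    """
    chosen_margin = beta * (student_chosen_logps - teacher_chosen_logps)
    rejected_margin = beta * (student_rejected_logps - teacher_rejected_logps)
    # Push the student to prefer the chosen response more strongly than the
    # reference teacher does, which can let it surpass the teacher on
    # discriminating superior from inferior (e.g., hallucinated) responses.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```

In this reading, the teacher plays two roles: a target distribution during mimic distillation and a fixed reference policy during preference distillation, so only the MoE-based student is updated in both stages.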

Cite

Text

Shu et al. "LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation." International Conference on Learning Representations, 2025.

Markdown

[Shu et al. "LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/shu2025iclr-llavamod/)

BibTeX

@inproceedings{shu2025iclr-llavamod,
  title     = {{LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation}},
  author    = {Shu, Fangxun and Liao, Yue and Zhang, Lei and Zhuo, Le and Xu, Chenning and Zhang, Guanghao and Shi, Haonan and Chan, Long and Zhong, Tao and Yu, Zhelun and He, Wanggui and Fu, Siming and Li, Haoyuan and Liu, Si and Li, Hongsheng and Jiang, Hao},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/shu2025iclr-llavamod/}
}