CLIP-FMoE: Scalable CLIP via Fused Mixture-of-Experts with Enforced Specialization

Abstract

Mixture-of-Experts (MoE) architectures have emerged as a promising approach for scaling deep learning models while maintaining computational efficiency. However, existing MoE adaptations of Contrastive Language-Image Pre-training (CLIP) models incur significant computational overhead from sequential training and degrade zero-shot capabilities. To address these limitations, we propose CLIP-FMoE, a novel approach that integrates an MoE architecture into CLIP fine-tuning. Our method uses Isolated Constrained Contrastive Learning, a pipeline that trains specialized experts on cluster-based data partitions to accelerate expert specialization. Additionally, we introduce a Fusion Gate mechanism to mitigate catastrophic forgetting of pre-trained knowledge. Extensive experiments across multiple benchmarks demonstrate that our approach achieves consistent improvements on downstream tasks while preserving zero-shot capabilities. Furthermore, our method performs robustly across varying context lengths, making it suitable for diverse real-world applications.
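
The abstract does not specify how the data partitions or the Fusion Gate are computed, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes the partitions come from plain k-means over embeddings and that the gate is a sigmoid-weighted blend of frozen pre-trained features and expert output. The names `kmeans_partition`, `fusion_gate`, `w`, and `b` are hypothetical.

```python
import numpy as np

def kmeans_partition(x, k, iters=10, seed=0):
    """Assign each row of x to one of k clusters (plain Lloyd's algorithm).

    Assumption: the paper's cluster-based partitions may be built differently.
    """
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every point to every center: (n, k).
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def fusion_gate(pretrained, expert_out, w, b):
    """Hypothetical gate: blend frozen pre-trained features with expert output.

    g near 0 keeps the pre-trained features; g near 1 trusts the expert.
    """
    g = 1.0 / (1.0 + np.exp(-(pretrained @ w + b)))  # shape (n, 1), values in (0, 1)
    return g * expert_out + (1.0 - g) * pretrained

# Usage: partition toy embeddings, then fuse toy features with a neutral gate.
emb = np.random.default_rng(1).normal(size=(20, 4))
labels = kmeans_partition(emb, k=3)
fused = fusion_gate(emb, emb + 1.0, w=np.zeros((4, 1)), b=0.0)
```

In this sketch each expert would be trained only on the rows sharing one cluster label, and a gate that stays close to zero leaves the pre-trained representation largely intact, which is one plausible way a gate could limit forgetting.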

Cite

Text

Tran et al. "CLIP-FMoE: Scalable CLIP via Fused Mixture-of-Experts with Enforced Specialization." International Conference on Learning Representations, 2026.

Markdown

[Tran et al. "CLIP-FMoE: Scalable CLIP via Fused Mixture-of-Experts with Enforced Specialization." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/tran2026iclr-clipfmoe/)

BibTeX

@inproceedings{tran2026iclr-clipfmoe,
  title     = {{CLIP-FMoE: Scalable CLIP via Fused Mixture-of-Experts with Enforced Specialization}},
  author    = {Tran, Luong and Nguyen, Lan-Cuong and Nguyen, Huynh Dang and Cong, Dat Nguyen and Le, Dung D. and Nguyen, Van},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/tran2026iclr-clipfmoe/}
}