Retaining Knowledge and Enhancing Long-Text Representations in CLIP Through Dual-Teacher Distillation

Abstract

Contrastive language-image pretraining models such as CLIP have demonstrated remarkable performance in various text-image alignment tasks. However, the inherent 77-token input limit and a reliance on predominantly short-text training data restrict their ability to handle long-text tasks effectively. To overcome these constraints, we propose LongD-CLIP, a dual-teacher distillation framework designed to enhance long-text representation while mitigating knowledge forgetting. In our approach, a teacher model fine-tuned on long-text data distills rich representation knowledge into a student model, while the original CLIP serves as a secondary teacher that helps the student retain its foundational knowledge. Extensive experiments show that LongD-CLIP significantly outperforms existing models on long-text retrieval, short-text retrieval, and zero-shot image classification. For instance, on image-to-text retrieval over the ShareGPT4V test set, LongD-CLIP exceeds Long-CLIP by 2.5%, reaching an accuracy of 98.3%. Similarly, on the Urban-1k dataset it records a 9.2% improvement, reaching 91.9%, underscoring its robust generalization. Additionally, the text encoder of LongD-CLIP exhibits reduced latent-space drift and improved compatibility with existing generative models, effectively overcoming the 77-token input constraint.
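
As a concrete illustration of the dual-teacher objective described above, the following minimal PyTorch-style loss combines supervision from the long-text teacher with an anchoring term from the frozen original CLIP. This is a sketch under assumed design choices (cosine-based feature distillation and a single balancing weight alpha); the paper's exact losses, weights, and distillation targets may differ.

import torch.nn.functional as F

def dual_teacher_distillation_loss(student_long, student_short,
                                   long_teacher_feats, clip_teacher_feats,
                                   alpha=0.5):
    # Hypothetical dual-teacher objective (names and alpha are illustrative):
    # the long-text teacher supervises the student's long-text embeddings,
    # while the frozen original CLIP anchors its short-text embeddings to
    # mitigate knowledge forgetting. All inputs are (batch, dim) embeddings.
    long_loss = 1.0 - F.cosine_similarity(
        student_long, long_teacher_feats, dim=-1).mean()    # learn long-text representations
    retain_loss = 1.0 - F.cosine_similarity(
        student_short, clip_teacher_feats, dim=-1).mean()   # retain foundational knowledge
    return alpha * long_loss + (1.0 - alpha) * retain_loss

In such a setup both teachers would be kept frozen and only the student's text encoder updated, which is consistent with the reduced latent-space drift reported in the abstract.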

Cite

Text

Feng et al. "Retaining Knowledge and Enhancing Long-Text Representations in CLIP Through Dual-Teacher Distillation." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02318

Markdown

[Feng et al. "Retaining Knowledge and Enhancing Long-Text Representations in CLIP Through Dual-Teacher Distillation." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/feng2025cvpr-retaining/) doi:10.1109/CVPR52734.2025.02318

BibTeX

@inproceedings{feng2025cvpr-retaining,
  title     = {{Retaining Knowledge and Enhancing Long-Text Representations in CLIP Through Dual-Teacher Distillation}},
  author    = {Feng, Yuheng and Wen, Changsong and Peng, Zelin and Li, Jiaye and Zhu, Siyu},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {24895--24904},
  doi       = {10.1109/CVPR52734.2025.02318},
  url       = {https://mlanthology.org/cvpr/2025/feng2025cvpr-retaining/}
}