Retaining Knowledge and Enhancing Long-Text Representations in CLIP Through Dual-Teacher Distillation
Abstract
Contrastive language-image pretraining models such as CLIP have demonstrated remarkable performance in various text-image alignment tasks. However, the inherent 77-token input limitation and reliance on predominantly short-text training data restrict their ability to handle long-text tasks effectively. To overcome these constraints, we propose LongD-CLIP, a dual-teacher distillation framework designed to enhance long-text representation while mitigating knowledge forgetting. In our approach, a teacher model fine-tuned on long-text data distills rich representation knowledge into a student model, while the original CLIP serves as a secondary teacher to help the student retain its foundational knowledge. Extensive experiments reveal that LongD-CLIP significantly outperforms existing models across long-text retrieval, short-text retrieval, and zero-shot image classification tasks. For instance, in the image-to-text retrieval task on the ShareGPT4V test set, LongD-CLIP exceeds Long-CLIP's performance by 2.5%, achieving an accuracy of 98.3%. Similarly, on the Urban-1k dataset, it records a 9.2% improvement, reaching 91.9%, underscoring its robust generalization capabilities. Additionally, the text encoder of LongD-CLIP exhibits reduced latent-space drift and improved compatibility with existing generative models, effectively overcoming the 77-token input constraint.
Cite
Text
Feng et al. "Retaining Knowledge and Enhancing Long-Text Representations in CLIP Through Dual-Teacher Distillation." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02318
Markdown
[Feng et al. "Retaining Knowledge and Enhancing Long-Text Representations in CLIP Through Dual-Teacher Distillation." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/feng2025cvpr-retaining/) doi:10.1109/CVPR52734.2025.02318
BibTeX
@inproceedings{feng2025cvpr-retaining,
title = {{Retaining Knowledge and Enhancing Long-Text Representations in CLIP Through Dual-Teacher Distillation}},
author = {Feng, Yuheng and Wen, Changsong and Peng, Zelin and Li, Jiaye and Zhu, Siyu},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {24895-24904},
doi = {10.1109/CVPR52734.2025.02318},
url = {https://mlanthology.org/cvpr/2025/feng2025cvpr-retaining/}
}