Embracing Language Inclusivity and Diversity in CLIP Through Continual Language Learning
Abstract
While vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, their mastery of only a few languages, such as English, restricts their applicability to broader communities. Consequently, there is increasing interest in developing multilingual VL models via a joint-learning setup, which, however, can be impractical due to expensive costs and limited data availability. In this work, we propose to extend VL-PTMs' language capacity through continual language learning (CLL), where a model must update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF). We begin our study by introducing a model dubbed CLL-CLIP, which builds upon CLIP, a prevailing VL-PTM that has acquired image-English-text alignment. Specifically, CLL-CLIP contains an expandable token embedding layer to handle linguistic differences. It trains only the token embeddings to improve memory stability and is optimized under cross-modal and cross-lingual objectives to learn the alignment between images and multilingual texts. To alleviate CF caused by covariate shift and lexical overlap, we further propose a novel approach that ensures an identical distribution of all token embeddings during initialization and regularizes token embedding learning during training. We construct a CLL benchmark covering 36 languages based on the MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance. Extensive experiments verify the effectiveness of CLL-CLIP and show that our approach can boost CLL-CLIP, e.g., by 6.7% in text-to-image average Recall@1 on XM3600, and consistently improve various state-of-the-art methods. Our code and data are available at https://github.com/yangbang18/CLFM.
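The abstract's core mechanism, an expandable token embedding layer whose new entries are initialized to match the distribution of the existing ones while all other weights stay frozen, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name, the Gaussian-matching initialization, and the matrix shapes are assumptions for exposition.

```python
import numpy as np

def expand_token_embeddings(emb, num_new, seed=None):
    """Append rows for new-language tokens to an embedding matrix.

    To keep old and new token embeddings identically distributed (one of the
    paper's stated goals for mitigating catastrophic forgetting), the new
    rows are drawn from a Gaussian whose mean and standard deviation match
    the existing embeddings. Illustrative only; the paper's exact
    initialization may differ.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = emb.mean(), emb.std()
    new_rows = rng.normal(mu, sigma, size=(num_new, emb.shape[1]))
    return np.vstack([emb, new_rows])

# Hypothetical numbers: a 100-token vocabulary with dimension 8, extended
# by 50 tokens for a newly added language. In CLL-CLIP, only this table
# would be trainable; the rest of CLIP stays frozen.
old = np.random.default_rng(0).normal(0.0, 0.02, size=(100, 8))
expanded = expand_token_embeddings(old, num_new=50, seed=1)
print(expanded.shape)  # (150, 8)
```

Freezing everything except this table is what the abstract refers to as training token embeddings solely to improve memory stability; the distribution-matched initialization addresses the covariate shift between old and new vocabulary entries.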
Cite
Text
Yang et al. "Embracing Language Inclusivity and Diversity in CLIP Through Continual Language Learning." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I6.28466
Markdown
[Yang et al. "Embracing Language Inclusivity and Diversity in CLIP Through Continual Language Learning." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/yang2024aaai-embracing/) doi:10.1609/AAAI.V38I6.28466
BibTeX
@inproceedings{yang2024aaai-embracing,
title = {{Embracing Language Inclusivity and Diversity in CLIP Through Continual Language Learning}},
author = {Yang, Bang and Dai, Yong and Cheng, Xuxin and Li, Yaowei and Raza, Asif and Zou, Yuexian},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2024},
pages = {6458--6466},
doi = {10.1609/AAAI.V38I6.28466},
url = {https://mlanthology.org/aaai/2024/yang2024aaai-embracing/}
}