Towards Cross-Modal Backward-Compatible Representation Learning for Vision-Language Models
Abstract
Modern retrieval systems often struggle with upgrading to new and more powerful models due to the incompatibility of embeddings between the old and new models. This necessitates a costly process known as backfilling, which involves re-computing the embeddings for a large number of data samples. In vision, Backward-compatible Training (BT) has been proposed to ensure that the new model aligns with the old model's embeddings. This paper extends the concept of vision-only BT to the field of cross-modal retrieval, marking the first attempt to address Cross-modal BT (XBT). Our goal is to achieve backward-compatibility between Vision-Language Pretraining (VLP) models, such as CLIP, for the cross-modal retrieval task. To address XBT challenges, we propose an efficient solution: a projection module that maps the new model's embeddings to those of the old model. This module, pretrained solely with text data, significantly reduces the number of image-text pairs required for XBT learning, and, once it is pretrained, it avoids using the old model during training. Furthermore, we utilize parameter-efficient training strategies that improve efficiency and preserve the off-the-shelf new model's knowledge by avoiding any modifications. Experimental results on cross-modal retrieval datasets demonstrate the effectiveness of XBT and its potential to enable backfill-free upgrades when a new VLP model emerges.
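As a rough illustration of the projection idea described above, the following sketch fits a map from a new model's embedding space to an old model's using text embeddings only, then applies it to unseen new-model embeddings. It is a toy under stated assumptions, not the paper's module: the abstract does not specify the architecture or training objective, so a plain least-squares linear map stands in for it, the embeddings are synthetic random data, and all names (`fit_projection`, `W_true`, the dimensions) are hypothetical.

```python
import numpy as np

def fit_projection(new_text_emb, old_text_emb):
    """Least-squares linear map W with new_text_emb @ W ~= old_text_emb.

    Assumption: a linear map fit in closed form, standing in for the
    paper's (unspecified here) projection module and its pretraining.
    """
    W, *_ = np.linalg.lstsq(new_text_emb, old_text_emb, rcond=None)
    return W

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_new, d_old, n_text = 64, 32, 1000  # hypothetical embedding sizes

# Synthetic stand-ins for text embeddings of the same captions
# produced by the new and the old model (no real VLP model is used).
W_true = rng.normal(size=(d_new, d_old))
new_text = rng.normal(size=(n_text, d_new))
old_text = new_text @ W_true + 0.01 * rng.normal(size=(n_text, d_old))

# "Pretrain" the projection with text embeddings only.
W = fit_projection(new_text, old_text)

# Project unseen new-model embeddings into the old space, where they can
# be compared against an already-indexed gallery without backfilling.
new_query = rng.normal(size=(5, d_new))
projected = l2_normalize(new_query @ W)
old_target = l2_normalize(new_query @ W_true)
cosine = np.sum(projected * old_target, axis=1)
```

In this toy setting the old space is an exact linear function of the new one, so the projected vectors land almost on top of their old-space counterparts. Real models offer no such guarantee, which is presumably why the paper goes on to train with image-text pairs after the text-only pretraining stage.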
Cite
Text
Jang and Lim. "Towards Cross-Modal Backward-Compatible Representation Learning for Vision-Language Models." International Conference on Computer Vision, 2025.
Markdown
[Jang and Lim. "Towards Cross-Modal Backward-Compatible Representation Learning for Vision-Language Models." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/jang2025iccv-crossmodal/)
BibTeX
@inproceedings{jang2025iccv-crossmodal,
title = {{Towards Cross-Modal Backward-Compatible Representation Learning for Vision-Language Models}},
author = {Jang, Young Kyun and Lim, Ser-nam},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {1783-1792},
url = {https://mlanthology.org/iccv/2025/jang2025iccv-crossmodal/}
}