Towards Understanding the Modality Gap in CLIP
Abstract
This work examines the modality gap observed in CLIP-based multimodal learning methods. The modality gap in this context refers to the separation of image and text embeddings in the joint latent space. Some previous research has attributed the gap to the cone effect of neural network initialization and suggested that closing it may not be necessary. However, this study argues that the modality gap is associated with local minima in the CLIP loss function. Through a series of proof-of-concept experiments, we illustrate these local minima and the difficulty of avoiding them in practice. Overall, this work aims to provide better insight into the root cause of the modality gap.
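To make the two quantities in the abstract concrete, the following is a minimal NumPy sketch, not the authors' code. It assumes the standard symmetric contrastive (InfoNCE-style) form of the CLIP loss and measures the gap as the Euclidean distance between the centroids of the L2-normalized image and text embeddings, which is one common way to quantify it. The function names, the temperature value of 0.07, and the synthetic data are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x):
    """Project each row onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def clip_loss(img, txt, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of `img` and row i of `txt` are assumed to be a matching pair.
    The temperature is an illustrative default, not a value from the paper.
    """
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature  # pairwise cosine similarities, scaled

    def cross_entropy_diag(l):
        # Numerically stable log-softmax over each row; targets are the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


def modality_gap(img, txt):
    """Distance between the centroids of the normalized embeddings (assumed metric)."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))


# Synthetic example: text embeddings are noisy copies of the image
# embeddings, shifted by a constant offset to create a gap.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 16))
text_emb = image_emb + 0.1 * rng.normal(size=(8, 16)) + 2.0

print("loss:", clip_loss(image_emb, text_emb))
print("gap:", modality_gap(image_emb, text_emb))
```

With identical inputs for both modalities the gap measure is zero by construction, so a nonzero value reflects a systematic offset between the two embedding clouds rather than per-pair noise alone.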
Cite
Text
Shi et al. "Towards Understanding the Modality Gap in CLIP." ICLR 2023 Workshops: MRL, 2023.
Markdown
[Shi et al. "Towards Understanding the Modality Gap in CLIP." ICLR 2023 Workshops: MRL, 2023.](https://mlanthology.org/iclrw/2023/shi2023iclrw-understanding/)
BibTeX
@inproceedings{shi2023iclrw-understanding,
title = {{Towards Understanding the Modality Gap in CLIP}},
author = {Shi, Peiyang and Welle, Michael C. and Björkman, Mårten and Kragic, Danica},
booktitle = {ICLR 2023 Workshops: MRL},
year = {2023},
url = {https://mlanthology.org/iclrw/2023/shi2023iclrw-understanding/}
}