Emergence of Text Semantics in CLIP Image Encoders
Abstract
Certain self-supervised approaches to training image encoders, such as CLIP, align images with their text captions. However, these approaches have no a priori incentive to associate text appearing inside an image with the semantics of that text. Our work studies the semantics of text rendered in images. We present evidence suggesting that CLIP's image representations contain a subspace for textual semantics that abstracts away fonts. Furthermore, we show that rendered-text representations from the image encoder only slightly lag behind the text encoder's representations in preserving semantic relationships.
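As a rough illustration of the setup the abstract describes, the sketch below renders a caption as pixels, embeds it with a CLIP image encoder, and compares it with the text encoder's embedding of the same caption. The model checkpoint, rendering choices, and Hugging Face API are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch (assumed setup, not the authors' exact method): embed a
# rendered-text image with CLIP's image encoder and compare it with the
# text encoder's embedding of the same caption.
import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # illustrative checkpoint choice
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def render_text(text: str, size=(224, 224)) -> Image.Image:
    """Render a caption onto a blank white canvas with the default PIL font."""
    img = Image.new("RGB", size, "white")
    ImageDraw.Draw(img).text((10, 100), text, fill="black")
    return img

caption = "a photo of a dog playing in the snow"

with torch.no_grad():
    # Image-encoder embedding of the rendered caption.
    image_inputs = processor(images=render_text(caption), return_tensors="pt")
    img_emb = model.get_image_features(**image_inputs)

    # Text-encoder embedding of the same caption.
    text_inputs = processor(text=[caption], return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**text_inputs)

cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb)
print(f"cosine(rendered-text image embedding, caption embedding) = {cos.item():.3f}")
```

Repeating this over many captions and fonts, and comparing pairwise similarities of rendered-text embeddings against those of caption embeddings, is one way to probe how well the image encoder preserves textual semantics.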
Cite

Text
Vennam et al. "Emergence of Text Semantics in CLIP Image Encoders." NeurIPS 2024 Workshops: UniReps, 2024.

Markdown
[Vennam et al. "Emergence of Text Semantics in CLIP Image Encoders." NeurIPS 2024 Workshops: UniReps, 2024.](https://mlanthology.org/neuripsw/2024/vennam2024neuripsw-emergence/)

BibTeX
@inproceedings{vennam2024neuripsw-emergence,
  title = {{Emergence of Text Semantics in CLIP Image Encoders}},
  author = {Vennam, Sreeram and Singh, Shashwat and Govil, Anirudh and Kumaraguru, Ponnurangam},
  booktitle = {NeurIPS 2024 Workshops: UniReps},
  year = {2024},
  url = {https://mlanthology.org/neuripsw/2024/vennam2024neuripsw-emergence/}
}