Hyperbolic Image-Text Representations
Abstract
Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept "dog" entails all images that contain dogs. Although this hierarchy is intuitive, current large-scale vision and language models such as CLIP do not explicitly capture it. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have suitable geometric properties to embed tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while being competitive with CLIP's performance on standard multi-modal tasks like image classification and image-text retrieval.
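The core idea described in the abstract, lifting Euclidean encoder outputs onto a hyperbolic manifold and contrasting image-text pairs by geodesic distance, can be sketched as follows. This is a minimal illustrative sketch, not the authors' released implementation: the curvature c, temperature temp, the choice of the Lorentz model with the exponential map at the origin, and the dummy encoder outputs are assumptions, and the full MERU objective includes components beyond this contrastive term.

# Minimal sketch (assumptions noted above): lift Euclidean features onto the
# Lorentz model of hyperbolic space via the exponential map at the origin,
# then use negative Lorentzian geodesic distance as the similarity in a
# CLIP-style symmetric contrastive loss.

import torch
import torch.nn.functional as F


def exp_map_origin(v: torch.Tensor, c: float) -> torch.Tensor:
    """Map tangent vectors (0, v) at the hyperboloid origin onto the manifold.

    v: (batch, d) Euclidean vectors from an image or text encoder.
    Returns the space components of points on the hyperboloid of curvature -c;
    the time component is recovered from the hyperboloid constraint when needed.
    """
    sqrt_c = c ** 0.5
    v_norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return torch.sinh(sqrt_c * v_norm) * v / (sqrt_c * v_norm)


def lorentz_distance(x_space: torch.Tensor, y_space: torch.Tensor, c: float) -> torch.Tensor:
    """Pairwise geodesic distances between points given by space components."""
    x_time = torch.sqrt(1.0 / c + x_space.pow(2).sum(-1, keepdim=True))
    y_time = torch.sqrt(1.0 / c + y_space.pow(2).sum(-1, keepdim=True))
    # Lorentzian inner product: <x, y>_L = -x_0 * y_0 + <x_space, y_space>
    inner = x_space @ y_space.T - x_time @ y_time.T
    # d(x, y) = arccosh(-c * <x, y>_L) / sqrt(c); clamp for numerical safety.
    return torch.acosh(torch.clamp(-c * inner, min=1.0 + 1e-7)) / (c ** 0.5)


def contrastive_loss(img_feats: torch.Tensor, txt_feats: torch.Tensor,
                     c: float = 1.0, temp: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss with negative hyperbolic distance as similarity."""
    img_hyp = exp_map_origin(img_feats, c)
    txt_hyp = exp_map_origin(txt_feats, c)
    logits = -lorentz_distance(img_hyp, txt_hyp, c) / temp
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))


# Toy usage with random features standing in for the image and text towers.
if __name__ == "__main__":
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(contrastive_loss(img, txt).item())

Because distances from the hyperboloid origin grow with specificity in such embeddings, generic concepts (typically text) can sit nearer the origin while more specific images lie farther out, which is one way the hierarchy discussed in the abstract can surface in the learned space.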
Cite
Text
Desai et al. "Hyperbolic Image-Text Representations." International Conference on Machine Learning, 2023.

Markdown
[Desai et al. "Hyperbolic Image-Text Representations." International Conference on Machine Learning, 2023.](https://mlanthology.org/icml/2023/desai2023icml-hyperbolic/)

BibTeX
@inproceedings{desai2023icml-hyperbolic,
  title     = {{Hyperbolic Image-Text Representations}},
  author    = {Desai, Karan and Nickel, Maximilian and Rajpurohit, Tanmay and Johnson, Justin and Vedantam, Shanmukha Ramakrishna},
  booktitle = {International Conference on Machine Learning},
  year      = {2023},
  pages     = {7694--7731},
  volume    = {202},
  url       = {https://mlanthology.org/icml/2023/desai2023icml-hyperbolic/}
}