Hyperbolic Image-Text Representations

Abstract

Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept "dog" entails all images that contain dogs. Despite being intuitive, current large-scale vision and language models such as CLIP do not explicitly capture such hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have suitable geometric properties to embed tree-like data, so MERU can better capture the underlying hierarchy in image-text data. Our results show that MERU learns a highly interpretable representation space while being competitive with CLIP's performance on multi-modal tasks like image classification and image-text retrieval.

Cite

Text

Desai et al. "Hyperbolic Image-Text Representations." ICLR 2023 Workshops: MRL, 2023.

Markdown

[Desai et al. "Hyperbolic Image-Text Representations." ICLR 2023 Workshops: MRL, 2023.](https://mlanthology.org/iclrw/2023/desai2023iclrw-hyperbolic/)

BibTeX

@inproceedings{desai2023iclrw-hyperbolic,
  title     = {{Hyperbolic Image-Text Representations}},
  author    = {Desai, Karan and Nickel, Maximilian and Rajpurohit, Tanmay and Johnson, Justin and Vedantam, Shanmukha Ramakrishna},
  booktitle = {ICLR 2023 Workshops: MRL},
  year      = {2023},
  url       = {https://mlanthology.org/iclrw/2023/desai2023iclrw-hyperbolic/}
}