ImageBind: One Embedding Space to Bind Them All

Abstract

We present ImageBind, an approach to learn a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together. ImageBind can leverage recent large-scale vision-language models and extend their zero-shot capabilities to new modalities simply by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box', including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection, and generation. The emergent capabilities improve with the strength of the image encoder, and we set a new state of the art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results that outperform prior work, and that ImageBind serves as a new way to evaluate vision models on visual and non-visual tasks.
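To make the shared embedding space concrete, below is a minimal sketch of cross-modal retrieval and embedding arithmetic, following the usage example in the open-source release at github.com/facebookresearch/ImageBind. The file paths and prompts are placeholders, and the `imagebind` package layout is assumed from that repository; treat this as an illustration, not the paper's training code. Because all modalities are encoded into one space, retrieval reduces to a dot product between embeddings, and summing embeddings composes a query across modalities.

```python
# Sketch of ImageBind inference, adapted from the usage example in the
# open-source release (github.com/facebookresearch/ImageBind).
# File paths below are placeholders.
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pretrained ImageBind (huge) model.
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

text_list = ["a dog", "a car", "a bird"]
image_paths = ["dog.jpg", "car.jpg", "bird.jpg"]  # placeholder paths
audio_paths = ["dog.wav", "car.wav", "bird.wav"]  # placeholder paths

# Each modality has its own preprocessing transform, but all are encoded
# into the same embedding space by a single forward pass.
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}
with torch.no_grad():
    embeddings = model(inputs)

# Cross-modal retrieval: similarity between modalities is a plain
# dot product between their embeddings.
vision_x_text = torch.softmax(
    embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1
)
audio_x_text = torch.softmax(
    embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1
)
print("Vision x Text:\n", vision_x_text)
print("Audio x Text:\n", audio_x_text)

# Composing modalities with arithmetic: summing an image embedding and an
# audio embedding yields a query that favors content matching both.
composed = embeddings[ModalityType.VISION][0] + embeddings[ModalityType.AUDIO][1]
scores = composed @ embeddings[ModalityType.TEXT].T
print("Composed query x Text:", scores)
```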

Cite

Text

Girdhar et al. "ImageBind: One Embedding Space to Bind Them All." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01457

Markdown

[Girdhar et al. "ImageBind: One Embedding Space to Bind Them All." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/girdhar2023cvpr-imagebind/) doi:10.1109/CVPR52729.2023.01457

BibTeX

@inproceedings{girdhar2023cvpr-imagebind,
  title     = {{ImageBind: One Embedding Space to Bind Them All}},
  author    = {Girdhar, Rohit and El-Nouby, Alaaeldin and Liu, Zhuang and Singh, Mannat and Alwala, Kalyan Vasudev and Joulin, Armand and Misra, Ishan},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {15180--15190},
  doi       = {10.1109/CVPR52729.2023.01457},
  url       = {https://mlanthology.org/cvpr/2023/girdhar2023cvpr-imagebind/}
}