ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

Abstract

Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption given an image. In this work, we repurpose such models to generate descriptive text given an image at inference time, without any further training or tuning step. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot learning method, it is extremely flexible, and we demonstrate its ability to perform image arithmetic in which the inputs can be either images or text and the output is a sentence. This enables novel high-level vision capabilities such as comparing two images or solving visual analogy tests. Our code is available at: https://github.com/YoadTew/zero-shot-image-to-text.
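The image arithmetic mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that images and text share one embedding space, uses hand-written 3-dimensional vectors in place of real model embeddings, and replaces language-model generation with ranking a fixed set of candidate sentences. All names (`arithmetic`, `best_caption`, the candidate captions) are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def arithmetic(base, minus, plus):
    """Compute base - minus + plus in the shared embedding space."""
    return normalize([b - m + p for b, m, p in zip(base, minus, plus)])


def best_caption(target, candidates):
    """Return the candidate sentence whose embedding is closest to target."""
    return max(candidates, key=lambda text: cosine(target, candidates[text]))


# Toy embeddings (assumed values, standing in for real image/text encoders).
image_a = [1.0, 1.0, 0.0]  # embedding of a first image
image_b = [1.0, 0.0, 0.0]  # embedding of a second image
text_c = [0.0, 0.0, 1.0]   # embedding of a text prompt

# Toy embeddings of candidate output sentences.
candidates = {
    "caption x": [0.0, 1.0, 1.0],
    "caption y": [1.0, 0.0, 0.0],
    "caption z": [0.0, 1.0, 0.0],
}

target = arithmetic(image_a, image_b, text_c)
result = best_caption(target, candidates)
print(result)
```

In this sketch the inputs to the arithmetic can be image or text embeddings interchangeably, and the output is a sentence, which mirrors the interface the abstract describes. The method in the paper generates the sentence with a language model instead of choosing from a fixed list.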

Cite

Text

Tewel et al. "ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01739

Markdown

[Tewel et al. "ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/tewel2022cvpr-zerocap/) doi:10.1109/CVPR52688.2022.01739

BibTeX

@inproceedings{tewel2022cvpr-zerocap,
  title     = {{ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic}},
  author    = {Tewel, Yoad and Shalev, Yoav and Schwartz, Idan and Wolf, Lior},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {17918--17928},
  doi       = {10.1109/CVPR52688.2022.01739},
  url       = {https://mlanthology.org/cvpr/2022/tewel2022cvpr-zerocap/}
}