Large-Scale Bidirectional Training for Zero-Shot Image Captioning

Abstract

When trained on large-scale datasets, image captioning models can understand the content of general-domain images but often fail to generate accurate, detailed captions. To improve performance, pretraining-and-finetuning has been a key strategy for image captioning. However, we find that large-scale bidirectional training between image and text enables zero-shot image captioning. In this paper, we introduce Bidirectional Image Text Training in largER Scale, BITTERS, an efficient training and inference framework for zero-shot image captioning. We also propose a new evaluation benchmark which comprises high-quality datasets and an extensive set of metrics to properly evaluate zero-shot captioning accuracy and societal bias. We additionally provide an efficient finetuning approach for keyword extraction. We show that careful selection of the large-scale training set and model architecture is the key to achieving zero-shot image captioning.
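The abstract does not spell out what "bidirectional training between image and text" involves. The sketch below is a rough, assumed illustration of the general idea only: one paired image-caption batch supervises both an image-to-text captioning loss and a text-to-image reconstruction loss in a single update. All module names, shapes, and losses here are toy assumptions for illustration, not the BITTERS implementation.

# Illustrative sketch only (not the authors' code): a paired batch drives
# both an image-to-text captioning loss and a text-to-image reconstruction
# loss in the same optimizer step. Every module below is a toy stand-in.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyImageEncoder(nn.Module):
    # Patchifies an image into a sequence of visual embeddings.
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)
    def forward(self, images):                                # (B, 3, 64, 64)
        return self.proj(images).flatten(2).transpose(1, 2)   # (B, 16, dim)

class ToyTextDecoder(nn.Module):
    # Predicts caption tokens while cross-attending to visual embeddings.
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, vocab)
    def forward(self, tokens, visual):                        # tokens: (B, T)
        x = self.embed(tokens)
        x, _ = self.attn(x, visual, visual)                   # attend to image
        return self.head(x)                                   # (B, T, vocab)

class ToyImageDecoder(nn.Module):
    # Regresses a low-resolution image from pooled caption embeddings.
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, 3 * 16 * 16)
    def forward(self, tokens):
        return self.head(self.embed(tokens).mean(1)).view(-1, 3, 16, 16)

def bidirectional_step(images, captions, enc, txt_dec, img_dec, opt):
    # Image -> text: next-token prediction on the paired caption.
    visual = enc(images)
    logits = txt_dec(captions[:, :-1], visual)
    cap_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               captions[:, 1:].reshape(-1))
    # Text -> image: crude pixel regression against a downsampled target.
    target = F.interpolate(images, size=(16, 16))
    gen_loss = F.mse_loss(img_dec(captions), target)
    loss = cap_loss + gen_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

if __name__ == "__main__":
    enc, txt_dec, img_dec = ToyImageEncoder(), ToyTextDecoder(), ToyImageDecoder()
    params = list(enc.parameters()) + list(txt_dec.parameters()) + list(img_dec.parameters())
    opt = torch.optim.AdamW(params, lr=1e-4)
    images, captions = torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 12))
    print(bidirectional_step(images, captions, enc, txt_dec, img_dec, opt))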

Cite

Text

Kim et al. "Large-Scale Bidirectional Training for Zero-Shot Image Captioning." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00733

Markdown

[Kim et al. "Large-Scale Bidirectional Training for Zero-Shot Image Captioning." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/kim2024cvprw-largescale/) doi:10.1109/CVPRW63382.2024.00733

BibTeX

@inproceedings{kim2024cvprw-largescale,
  title     = {{Large-Scale Bidirectional Training for Zero-Shot Image Captioning}},
  author    = {Kim, Taehoon and Marsden, Mark and Ahn, Pyunghwan and Kim, Sangyun and Lee, Sihaeng and Sala, Alessandra and Kim, Seung Hwan},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {7373--7383},
  doi       = {10.1109/CVPRW63382.2024.00733},
  url       = {https://mlanthology.org/cvprw/2024/kim2024cvprw-largescale/}
}