CoBIT: A Contrastive Bi-Directional Image-Text Generation Model

Abstract

The field of Vision-and-Language (VL) has witnessed a proliferation of pretrained foundation models. Current techniques typically employ only one type of training objective: (1) contrastive objectives (as in CLIP), (2) image-to-text generative objectives (as in PaLI), or (3) text-to-image generative objectives (as in Parti). However, these three objectives are mutually relevant and are all based on image-text pairs. Intuitively, the two generative objectives can be considered as complementary projections between the two modalities, while contrastive learning preserves global alignment and generation facilitates fine-grained understanding. Inspired by this, we present a Contrastive Bi-directional Image-Text generation model (CoBIT), which for the first time unifies the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure consisting of an image unicoder, a text unicoder, and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefit both image-to-text and text-to-image generation. CoBIT achieves superior performance in image understanding, image-text understanding (retrieval, captioning, VQA, SNLI-VE), and text-based content creation, particularly in zero-shot scenarios.
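The unification described above amounts to training one model under a weighted sum of a contrastive term and two generative (token-prediction) terms. The sketch below is a minimal illustration of that idea with NumPy; the function names, weights, and shapes are assumptions for illustration, not the paper's actual implementation or hyperparameters.

```python
# Hypothetical sketch of a combined objective in the spirit of CoBIT:
# a symmetric contrastive (InfoNCE) term plus image-to-text and
# text-to-image cross-entropy terms. All weights/shapes are illustrative.
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # pairwise similarity matrix

    def xent(l):
        # Cross-entropy with the matched pair (the diagonal) as the target.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        idx = np.arange(len(l))
        return -logp[idx, idx].mean()

    return 0.5 * (xent(logits) + xent(logits.T))

def token_xent(logits, targets):
    """Cross-entropy for a generative (next-token prediction) head."""
    l = logits - logits.max(axis=1, keepdims=True)
    logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def combined_loss(img_emb, txt_emb, i2t_logits, i2t_targets,
                  t2i_logits, t2i_targets,
                  w_con=1.0, w_i2t=1.0, w_t2i=1.0):
    """Weighted sum of the three pre-training objectives."""
    return (w_con * info_nce(img_emb, txt_emb)
            + w_i2t * token_xent(i2t_logits, i2t_targets)
            + w_t2i * token_xent(t2i_logits, t2i_targets))
```

In this toy form the image-to-text head predicts text tokens and the text-to-image head predicts discrete image tokens; in the paper both directions are served by the shared unicoder-decoder structure rather than separate heads.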

Cite

Text

You et al. "CoBIT: A Contrastive Bi-Directional Image-Text Generation Model." International Conference on Learning Representations, 2024.

Markdown

[You et al. "CoBIT: A Contrastive Bi-Directional Image-Text Generation Model." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/you2024iclr-cobit/)

BibTeX

@inproceedings{you2024iclr-cobit,
  title     = {{CoBIT: A Contrastive Bi-Directional Image-Text Generation Model}},
  author    = {You, Haoxuan and Guo, Mandy and Wang, Zhecan and Chang, Kai-Wei and Baldridge, Jason Michael and Yu, Jiahui},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://mlanthology.org/iclr/2024/you2024iclr-cobit/}
}