TIME: Text and Image Mutual-Translation Adversarial Networks

Abstract

Focusing on text-to-image (T2I) generation, we propose Text and Image Mutual-Translation Adversarial Networks (TIME), a lightweight but effective model that jointly learns a T2I generator G and an image-captioning discriminator D under the Generative Adversarial Network framework. While previous methods tackle the T2I problem as a uni-directional task and rely on pre-trained language models to enforce image-text consistency, TIME requires neither extra modules nor pre-training. We show that the performance of G can be boosted substantially by training it jointly with D as a language model. Specifically, we adopt Transformers to model the cross-modal connections between image features and word embeddings, and design an annealing conditional hinge loss that dynamically balances the adversarial learning. In our experiments, TIME achieves state-of-the-art (SOTA) performance on the CUB dataset (Inception Score of 4.91 and Fréchet Inception Distance of 14.3), and shows promising results on the MS-COCO dataset for image captioning and downstream vision-language tasks.
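For a concrete picture of what a conditional hinge objective with an annealing weight can look like, here is a minimal PyTorch sketch. The function names, the linear annealing schedule, and the way the mismatched-caption term is weighted by `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def d_conditional_hinge_loss(real_logits, fake_logits, mismatched_logits, alpha):
    """Discriminator hinge loss with a conditional (wrong-caption) term.

    `alpha` is an annealed weight on the mismatched term; its schedule is an
    assumption for illustration, not TIME's published formulation.
    """
    loss_real = F.relu(1.0 - real_logits).mean()        # real image, matching caption
    loss_fake = F.relu(1.0 + fake_logits).mean()        # generated image, matching caption
    loss_mismatch = F.relu(1.0 + mismatched_logits).mean()  # real image, wrong caption
    return loss_real + loss_fake + alpha * loss_mismatch

def g_hinge_loss(fake_logits):
    # Standard hinge-style generator objective.
    return -fake_logits.mean()

def anneal(step, total_steps, start=1.0, end=0.0):
    # Hypothetical linear annealing of the conditional weight over training.
    t = min(step / max(total_steps, 1), 1.0)
    return start + t * (end - start)
```

In this sketch, annealing `alpha` shifts the discriminator's emphasis between unconditional realism and image-text matching as training progresses, which is one plausible way to "dynamically balance the adversarial learning" described in the abstract.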

Cite

Text

Liu et al. "TIME: Text and Image Mutual-Translation Adversarial Networks." AAAI Conference on Artificial Intelligence, 2021. doi:10.1609/AAAI.V35I3.16305

Markdown

[Liu et al. "TIME: Text and Image Mutual-Translation Adversarial Networks." AAAI Conference on Artificial Intelligence, 2021.](https://mlanthology.org/aaai/2021/liu2021aaai-time-a/) doi:10.1609/AAAI.V35I3.16305

BibTeX

@inproceedings{liu2021aaai-time-a,
  title     = {{TIME: Text and Image Mutual-Translation Adversarial Networks}},
  author    = {Liu, Bingchen and Song, Kunpeng and Zhu, Yizhe and de Melo, Gerard and Elgammal, Ahmed},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2021},
  pages     = {2082--2090},
  doi       = {10.1609/AAAI.V35I3.16305},
  url       = {https://mlanthology.org/aaai/2021/liu2021aaai-time-a/}
}