Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer

Abstract

Although deep generative models have received considerable attention, most existing work targets unimodal generation. In this paper, we explore a new method for unconditional image-text pair generation. We propose MXQ-VAE, a vector quantization method for multimodal image-text representation. MXQ-VAE takes a paired image and text as input and learns a joint quantized representation space, so that an image-text pair can be converted into a sequence of unified indices. An autoregressive generative model can then be trained over this joint representation, enabling unconditional image-text pair generation. Extensive experimental results demonstrate that our approach generates semantically consistent image-text pairs and enhances meaningful alignment between image and text.
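To make the two-stage idea in the abstract concrete, below is a minimal, hypothetical sketch of the first stage: encode an (image, text) pair, fuse the features, and quantize the fused sequence against a single shared codebook so the pair becomes one sequence of unified discrete indices. All module choices, sizes, and names (e.g. `JointVectorQuantizer`, `ToyMXQEncoder`, codebook size 1024) are assumptions for illustration, not the paper's actual MXQ-VAE implementation.

```python
# Sketch of a joint image-text vector quantizer in the spirit of the abstract.
# Assumed details: 64x64 RGB images, a toy token vocabulary, a small Transformer fuser.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointVectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantization over a single shared codebook."""

    def __init__(self, codebook_size=1024, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.codebook.weight.data.uniform_(-1.0 / codebook_size, 1.0 / codebook_size)
        self.beta = beta  # commitment loss weight

    def forward(self, z):                      # z: (batch, seq_len, dim)
        flat = z.reshape(-1, z.shape[-1])      # (batch * seq_len, dim)
        # Squared L2 distance from each fused feature to every codebook vector.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        indices = dist.argmin(dim=1)           # unified discrete indices
        z_q = self.codebook(indices).view_as(z)
        # Standard VQ-VAE codebook + commitment losses.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()           # straight-through estimator
        return z_q, indices.view(z.shape[0], -1), loss


class ToyMXQEncoder(nn.Module):
    """Hypothetical encoders mapping an image and a token sequence into one
    fused feature sequence prior to quantization."""

    def __init__(self, vocab_size=8192, dim=256):
        super().__init__()
        self.image_enc = nn.Sequential(        # 64x64 image -> 8x8 feature map
            nn.Conv2d(3, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1),
        )
        self.text_enc = nn.Embedding(vocab_size, dim)
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, image, tokens):
        img = self.image_enc(image).flatten(2).transpose(1, 2)  # (B, 64, dim)
        txt = self.text_enc(tokens)                             # (B, T, dim)
        return self.fuse(torch.cat([img, txt], dim=1))          # joint sequence


if __name__ == "__main__":
    enc, vq = ToyMXQEncoder(), JointVectorQuantizer()
    image = torch.randn(2, 3, 64, 64)
    tokens = torch.randint(0, 8192, (2, 16))
    z_q, indices, vq_loss = vq(enc(image, tokens))
    print(indices.shape)  # (2, 80): one unified index sequence per pair
```

In a second stage, an autoregressive model (e.g. a decoder-only Transformer) would be fit over these unified index sequences; sampling a sequence and decoding it back through the image and text decoders would yield an unconditional image-text pair.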

Cite

Text

Lee et al. "Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer." ICLR 2022 Workshops: DGM4HSD, 2022.

Markdown

[Lee et al. "Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer." ICLR 2022 Workshops: DGM4HSD, 2022.](https://mlanthology.org/iclrw/2022/lee2022iclrw-unconditional/)

BibTeX

@inproceedings{lee2022iclrw-unconditional,
  title     = {{Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer}},
  author    = {Lee, Hyungyung and Park, Sungjin and Choi, Edward},
  booktitle = {ICLR 2022 Workshops: DGM4HSD},
  year      = {2022},
  url       = {https://mlanthology.org/iclrw/2022/lee2022iclrw-unconditional/}
}