Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer
Abstract
Though deep generative models have gained a lot of attention, most existing works are designed for unimodal generation tasks. In this paper, we explore a new method for unconditional image-text pair generation. We propose MXQ-VAE, a vector quantization method for multimodal image-text representation. MXQ-VAE accepts a paired image and text as input and learns a joint quantized representation space, so that the image-text pair can be converted to a sequence of unified indices. We can then use autoregressive generative models to model the joint image-text representation, and even perform unconditional image-text pair generation. Extensive experimental results demonstrate that our approach effectively generates semantically consistent image-text pairs and also enhances meaningful alignment between image and text.
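To make the pipeline described above concrete, below is a minimal sketch of the joint quantization idea: a paired image and text are encoded, fused into one sequence, and mapped to indices of a shared codebook. The encoder architectures, codebook size, and fusion by concatenation are illustrative assumptions, not the paper's actual MXQ-VAE design; a real model would also include a decoder, reconstruction losses, and a straight-through estimator for training the codebook.

```python
import torch
import torch.nn as nn


class JointQuantizer(nn.Module):
    """Toy sketch: map a paired image and text to one sequence of shared codebook indices."""

    def __init__(self, vocab_size=10000, codebook_size=1024, dim=256):
        super().__init__()
        # Toy image encoder: downsample the image into a grid of latent vectors (assumption).
        self.image_enc = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=4, stride=4),
        )
        # Toy text encoder: embed tokens into the same latent dimension (assumption).
        self.text_enc = nn.Embedding(vocab_size, dim)
        # Shared codebook over which both modalities are quantized.
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, image, text_tokens):
        # image: (B, 3, H, W); text_tokens: (B, T)
        img_latents = self.image_enc(image)                   # (B, dim, h, w)
        img_latents = img_latents.flatten(2).transpose(1, 2)  # (B, h*w, dim)
        txt_latents = self.text_enc(text_tokens)              # (B, T, dim)

        # Fuse the two modalities into one joint sequence (simple concatenation here).
        joint = torch.cat([img_latents, txt_latents], dim=1)  # (B, h*w + T, dim)

        # Nearest-neighbor lookup against the shared codebook -> unified index sequence.
        dists = (joint.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, L, K)
        indices = dists.argmin(dim=-1)                         # (B, L)
        quantized = self.codebook(indices)                     # (B, L, dim)
        return quantized, indices


if __name__ == "__main__":
    model = JointQuantizer()
    image = torch.randn(2, 3, 64, 64)
    text = torch.randint(0, 10000, (2, 16))
    _, idx = model(image, text)
    print(idx.shape)  # one unified index sequence per image-text pair
```

The resulting index sequences could then be modeled with an autoregressive generative model (e.g., a Transformer over codebook indices), and sampling from that model would yield new index sequences that decode back into paired images and captions.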
Cite
Text
Lee et al. "Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer." ICLR 2022 Workshops: DGM4HSD, 2022.
Markdown
[Lee et al. "Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer." ICLR 2022 Workshops: DGM4HSD, 2022.](https://mlanthology.org/iclrw/2022/lee2022iclrw-unconditional/)
BibTeX
@inproceedings{lee2022iclrw-unconditional,
  title     = {{Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer}},
  author    = {Lee, Hyungyung and Park, Sungjin and Choi, Edward},
  booktitle = {ICLR 2022 Workshops: DGM4HSD},
  year      = {2022},
  url       = {https://mlanthology.org/iclrw/2022/lee2022iclrw-unconditional/}
}