L-Verse: Bidirectional Generation Between Image and Text
Abstract
Far beyond learning long-range interactions in natural language, transformers are becoming the de facto standard for many vision tasks thanks to their power and scalability. In cross-modal tasks between image and text in particular, vector quantized variational autoencoders (VQ-VAEs) are widely used to convert a raw RGB image into a sequence of feature vectors. To better leverage the correlation between image and text, we propose L-Verse, a novel architecture consisting of a feature-augmented variational autoencoder (AugVAE) and a bidirectional auto-regressive transformer (BiART) for image-to-text and text-to-image generation. Our AugVAE achieves state-of-the-art reconstruction performance on the ImageNet1K validation set, along with robustness to unseen images in the wild. Unlike other models, BiART can distinguish between an image (or text) used as a conditional reference and one used as a generation target. L-Verse can be used directly for image-to-text or text-to-image generation without any finetuning or extra object detection framework. In quantitative and qualitative experiments, L-Verse shows impressive results against previous methods in both image-to-text and text-to-image generation on MS-COCO Captions. We furthermore assess the scalability of the L-Verse architecture on Conceptual Captions and present an initial result of bidirectional vision-language representation learning in the general domain.
Cite
Text
Kim et al. "L-Verse: Bidirectional Generation Between Image and Text." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01603
Markdown
[Kim et al. "L-Verse: Bidirectional Generation Between Image and Text." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/kim2022cvpr-lverse/) doi:10.1109/CVPR52688.2022.01603
BibTeX
@inproceedings{kim2022cvpr-lverse,
title = {{L-Verse: Bidirectional Generation Between Image and Text}},
author = {Kim, Taehoon and Song, Gwangmo and Lee, Sihaeng and Kim, Sangyun and Seo, Yewon and Lee, Soonyoung and Kim, Seung Hwan and Lee, Honglak and Bae, Kyunghoon},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2022},
pages = {16526--16536},
doi = {10.1109/CVPR52688.2022.01603},
url = {https://mlanthology.org/cvpr/2022/kim2022cvpr-lverse/}
}