Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers
Abstract
Vector-quantized image modeling has shown great potential in synthesizing high-quality images. However, generating high-resolution images remains a challenging task due to the quadratic computational overhead of the self-attention process. In this study, we seek to explore a more efficient two-stage framework for high-resolution image generation, with improvements in the following three aspects. (1) Based on the observation that the first quantization stage exhibits strong locality, we employ a local attention-based quantization model instead of the global attention mechanism used in previous methods, leading to better efficiency and reconstruction quality. (2) We emphasize the importance of multi-grained feature interaction during image generation and introduce an efficient attention mechanism that combines global attention (long-range semantic consistency across the whole image) and local attention (fine-grained details). This approach results in faster generation speed, higher generation fidelity, and improved resolution. (3) We propose a new generation pipeline incorporating autoencoding training and an autoregressive generation strategy, demonstrating a better paradigm for image synthesis. Extensive experiments demonstrate the superiority of our approach in high-quality and high-resolution image reconstruction and generation.
Cite
Text

Cao et al. "Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00677

Markdown

[Cao et al. "Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/cao2023iccv-efficientvqgan/) doi:10.1109/ICCV51070.2023.00677

BibTeX
@inproceedings{cao2023iccv-efficientvqgan,
title = {{Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers}},
author = {Cao, Shiyue and Yin, Yueqin and Huang, Lianghua and Liu, Yu and Zhao, Xin and Zhao, Deli and Huang, Kaiqi},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {7368--7377},
doi = {10.1109/ICCV51070.2023.00677},
url = {https://mlanthology.org/iccv/2023/cao2023iccv-efficientvqgan/}
}