MaskBit: Embedding-Free Image Generation via Bit Tokens
Abstract
Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. Typically comprising two stages (an initial VQGAN model for transitioning between latent space and image space, and a subsequent Transformer model for image generation within the latent space), these frameworks offer promising avenues for image synthesis. In this study, we present two primary contributions: first, an empirical and systematic examination of VQGANs, leading to a modernized VQGAN; second, a novel embedding-free generation network operating directly on bit tokens, a binary quantized representation of tokens with rich semantics. The first contribution furnishes a transparent, reproducible, and high-performing VQGAN model, enhancing accessibility and matching the performance of current state-of-the-art methods while revealing previously undisclosed details. The second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet $256\times256$ benchmark, with a compact generator model of a mere 305M parameters. The code for this project is available at https://github.com/markweberdev/maskbit.
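For intuition, here is a minimal sketch of how bit tokens can be formed: each channel of a continuous latent is binarized by its sign, so a K-channel latent becomes a K-bit token. This follows the lookup-free quantization idea such tokenizers build on; the function name and tensor shapes are illustrative assumptions, not the authors' implementation.

import torch

def to_bit_tokens(latents):
    # latents: (batch, K, height, width) continuous encoder outputs.
    # Illustrative sign-based binarization (an assumption, not the
    # authors' exact code): each channel contributes one bit.
    bits = (latents > 0).long()                               # values in {0, 1}
    weights = 2 ** torch.arange(bits.shape[1])                # per-bit integer weights
    tokens = (bits * weights.view(1, -1, 1, 1)).sum(dim=1)    # (batch, height, width)
    return tokens                                             # token ids in [0, 2**K)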
Cite
Text
Weber et al. "MaskBit: Embedding-Free Image Generation via Bit Tokens." Transactions on Machine Learning Research, 2024.
Markdown
[Weber et al. "MaskBit: Embedding-Free Image Generation via Bit Tokens." Transactions on Machine Learning Research, 2024.](https://mlanthology.org/tmlr/2024/weber2024tmlr-maskbit/)
BibTeX
@article{weber2024tmlr-maskbit,
  title = {{MaskBit: Embedding-Free Image Generation via Bit Tokens}},
  author = {Weber, Mark and Yu, Lijun and Yu, Qihang and Deng, Xueqing and Shen, Xiaohui and Cremers, Daniel and Chen, Liang-Chieh},
  journal = {Transactions on Machine Learning Research},
  year = {2024},
  url = {https://mlanthology.org/tmlr/2024/weber2024tmlr-maskbit/}
}