Image and Video Tokenization with Binary Spherical Quantization
Abstract
We propose a new transformer-based image and video tokenizer with Binary Spherical Quantization (BSQ). BSQ projects the high-dimensional visual embedding onto a lower-dimensional hypersphere and then applies binary quantization. BSQ is (1) parameter-efficient, requiring no explicit codebook, (2) scalable to arbitrary token dimensions, and (3) compact, compressing visual data by up to 100× with minimal distortion. Our tokenizer uses a transformer encoder and decoder with simple block-wise causal masking to support variable-length videos as input. The resulting BSQ-ViT achieves state-of-the-art visual reconstruction quality on image and video reconstruction benchmarks at 2.4× the throughput of the best prior methods. Furthermore, by learning an autoregressive prior for adaptive arithmetic coding, BSQ-ViT achieves compression results comparable to commonly used compression standards, e.g. JPEG2000/WebP for images and H.264/H.265 for videos. BSQ-ViT also enables masked language models to achieve image synthesis quality competitive with GAN and diffusion approaches.
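The quantization step described in the abstract (project to a lower dimension, place the result on a unit hypersphere, binarize each coordinate) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `bsq_quantize`, the random projection matrix `proj`, the dimensions `d` and `L`, the 1/sqrt(L) rescaling, and the handling of exact zeros are all assumptions made for illustration, and training details such as gradient estimation are omitted.

```python
import math
import random


def bsq_quantize(z, proj):
    """Illustrative binary spherical quantization of one embedding.

    z    : list of d floats (the high-dimensional embedding)
    proj : L x d matrix as a list of rows (hypothetical linear projection;
           in a real tokenizer this would be learned)
    """
    # 1. Linear projection from d dimensions down to L dimensions.
    v = [sum(w * x for w, x in zip(row, z)) for row in proj]

    # 2. L2-normalize so the projected vector lies on the unit hypersphere.
    #    (Assumes v is not the all-zero vector.)
    norm = math.sqrt(sum(x * x for x in v))
    u = [x / norm for x in v]

    # 3. Binary quantization: keep only the sign of each coordinate.
    #    Mapping an exact zero to the positive side is an assumption.
    bits = [1 if x >= 0 else 0 for x in u]

    # Rescale by 1/sqrt(L) so the quantized vector also has unit norm.
    scale = 1.0 / math.sqrt(len(u))
    u_hat = [scale if b else -scale for b in bits]

    # The L bits read as a binary number give the token index, so no
    # explicit codebook table has to be stored.
    index = 0
    for b in bits:
        index = (index << 1) | b
    return u_hat, bits, index


# Example with hypothetical sizes: an 8-dimensional embedding, 4 bits per token.
rng = random.Random(0)
d, L = 8, 4
proj = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
z = [rng.gauss(0, 1) for _ in range(d)]
u_hat, bits, index = bsq_quantize(z, proj)
```

In this sketch the implicit codebook has 2^L entries (the corners of a hypercube scaled onto the unit sphere), which is one way to read the abstract's claim that no explicit codebook is needed: the token index is computed directly from the signs.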
Cite
Text
Zhao et al. "Image and Video Tokenization with Binary Spherical Quantization." International Conference on Learning Representations, 2025.
Markdown
[Zhao et al. "Image and Video Tokenization with Binary Spherical Quantization." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/zhao2025iclr-image/)
BibTeX
@inproceedings{zhao2025iclr-image,
title = {{Image and Video Tokenization with Binary Spherical Quantization}},
author = {Zhao, Yue and Xiong, Yuanjun and Kraehenbuehl, Philipp},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/zhao2025iclr-image/}
}