Differentiable Hierarchical Visual Tokenization
Abstract
Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.
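To build rough intuition for how an information criterion can drive hierarchical tokenization, the sketch below splits an image quadtree-style and accepts a split only when it lowers a BIC score under a simple Gaussian color model. This is a minimal illustration under assumed choices (quadtree layout, isotropic Gaussian fit, BIC), not the method introduced in the paper.

```python
# Illustrative sketch only: quadtree tokenization driven by the Bayesian
# Information Criterion (BIC). The region model, the criterion, and all
# thresholds are assumptions for demonstration purposes.
import numpy as np

def gaussian_bic(pixels: np.ndarray) -> float:
    """BIC of an isotropic Gaussian fit to an (N, C) block of pixel values."""
    n, c = pixels.shape
    var = pixels.var() + 1e-8                      # pooled MLE variance across channels
    log_lik = -0.5 * n * c * (np.log(2 * np.pi * var) + 1.0)
    k = c + 1                                      # one mean per channel + shared variance
    return k * np.log(n * c) - 2.0 * log_lik

def quadtree_tokens(img: np.ndarray, y=0, x=0, h=None, w=None, min_size=4):
    """Recursively partition `img` (H, W, C); split a region into four
    children only if the children's total BIC beats the parent's BIC."""
    if h is None:
        h, w = img.shape[:2]
    if min(h, w) < 2 * min_size:                   # children would get too small
        return [(y, x, h, w)]

    block = img[y:y + h, x:x + w].reshape(-1, img.shape[2])
    parent_bic = gaussian_bic(block)

    # Candidate children: the four quadrants of the current region.
    hh, hw = h // 2, w // 2
    children = [(y, x, hh, hw), (y, x + hw, hh, w - hw),
                (y + hh, x, h - hh, hw), (y + hh, x + hw, h - hh, w - hw)]
    child_bic = sum(
        gaussian_bic(img[cy:cy + ch, cx:cx + cw].reshape(-1, img.shape[2]))
        for cy, cx, ch, cw in children
    )

    if child_bic < parent_bic:                     # finer tokens justify the extra parameters
        tokens = []
        for cy, cx, ch, cw in children:
            tokens += quadtree_tokens(img, cy, cx, ch, cw, min_size)
        return tokens
    return [(y, x, h, w)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(64, 64, 3)).astype(np.float32)
    img[:32, :32] += 3.0                           # a distinct region that invites finer tokens
    print(f"{len(quadtree_tokens(img))} tokens for a 64x64 image")
```

In this toy version, homogeneous regions collapse into a single coarse token while structured regions are refined, which mirrors the content-adaptive, variable-granularity behavior the abstract describes; the paper's tokenizer additionally makes this selection differentiable and compatible with standard Vision Transformer inputs.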
Cite
Text
Aasan et al. "Differentiable Hierarchical Visual Tokenization." Advances in Neural Information Processing Systems, 2025.
Markdown
[Aasan et al. "Differentiable Hierarchical Visual Tokenization." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/aasan2025neurips-differentiable/)
BibTeX
@inproceedings{aasan2025neurips-differentiable,
title = {{Differentiable Hierarchical Visual Tokenization}},
author = {Aasan, Marius and Hjelkrem-Tan, Martine and Catalano, Nico and Choi, Changkyu and Rivera, Adín Ramírez},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/aasan2025neurips-differentiable/}
}