Differentiable Hierarchical Visual Tokenization

Abstract

Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures, allowing pretrained models to be retrofitted. Our method leverages hierarchical model selection with information criteria, achieves competitive performance on both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.
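The abstract does not spell out the hierarchy or the information criterion used, so the sketch below is only an illustrative analogue, not the paper's method: it assumes a quadtree hierarchy over pixels and uses the Bayesian Information Criterion (BIC) of a per-region Gaussian color model to decide whether a region should be split into finer tokens or kept coarse. All names (`gaussian_bic`, `quadtree_tokens`, `min_size`) are hypothetical.

```python
import numpy as np

def gaussian_bic(pixels):
    """BIC of an isotropic Gaussian fit to a set of RGB pixels.

    pixels: (N, 3) array. Free parameters: 3 channel means + 1 shared variance.
    """
    n = len(pixels)
    if n < 2:
        return 0.0
    var = pixels.var() + 1e-8                      # pooled MLE variance over all channels
    # log-likelihood of the N*3 scalar observations under N(mean, var)
    log_lik = -0.5 * pixels.size * (np.log(2 * np.pi * var) + 1.0)
    k = 4                                          # number of free parameters
    return k * np.log(n) - 2.0 * log_lik

def quadtree_tokens(img, y0, x0, h, w, min_size=4):
    """Recursively split a region while the finer model lowers total BIC."""
    region = img[y0:y0 + h, x0:x0 + w].reshape(-1, 3)
    parent_bic = gaussian_bic(region)
    if h <= min_size or w <= min_size:
        return [(y0, x0, h, w)]
    hh, hw = h // 2, w // 2
    quads = [(y0, x0, hh, hw), (y0, x0 + hw, hh, w - hw),
             (y0 + hh, x0, h - hh, hw), (y0 + hh, x0 + hw, h - hh, w - hw)]
    child_bic = sum(gaussian_bic(img[y:y + qh, x:x + qw].reshape(-1, 3))
                    for (y, x, qh, qw) in quads)
    if child_bic < parent_bic:                     # the finer model is justified
        tokens = []
        for (y, x, qh, qw) in quads:
            tokens += quadtree_tokens(img, y, x, qh, qw, min_size)
        return tokens
    return [(y0, x0, h, w)]                        # keep the coarse token

# Toy usage: a 64x64 image with one bright square gets fine tokens only there.
img = np.zeros((64, 64, 3))
img[16:32, 16:32] = 1.0
tokens = quadtree_tokens(img, 0, 0, 64, 64)
print(f"{len(tokens)} adaptive tokens (vs. {(64 // 16) ** 2} fixed 16x16 patches)")
```

In this toy run the homogeneous background stays coarse while the region containing structure is refined, which is the content-adaptive behavior the abstract describes; the actual tokenizer is additionally differentiable end-to-end, which this greedy sketch is not.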

Cite

Text

Aasan et al. "Differentiable Hierarchical Visual Tokenization." Advances in Neural Information Processing Systems, 2025.

Markdown

[Aasan et al. "Differentiable Hierarchical Visual Tokenization." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/aasan2025neurips-differentiable/)

BibTeX

@inproceedings{aasan2025neurips-differentiable,
  title     = {{Differentiable Hierarchical Visual Tokenization}},
  author    = {Aasan, Marius and Hjelkrem-Tan, Martine and Catalano, Nico and Choi, Changkyu and Rivera, Adín Ramírez},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/aasan2025neurips-differentiable/}
}