Native Segmentation Vision Transformers
Abstract
Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping layer that dynamically assigns tokens to a reduced set of output tokens based on image boundaries and semantic content. Stacking our grouping layer across consecutive backbone stages yields hierarchical segmentation that arises *natively* in the feature extraction process; we coin the resulting model the Native Segmentation Vision Transformer. We show that a careful design of our architecture enables strong segmentation masks to emerge solely from grouping layers, that is, without additional segmentation-specific heads. This sets the foundation for a new paradigm of *native*, backbone-level segmentation, which enables strong zero-shot results without mask supervision, as well as a minimal and efficient standalone model design for downstream segmentation tasks.
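To make the idea concrete, below is a minimal sketch of what a content-aware grouping layer that replaces uniform downsampling could look like in PyTorch. The class name, the cross-attention-style soft assignment, and the seed-pooling initialization are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGroupingLayer(nn.Module):
    """Illustrative content-aware grouping layer (a sketch, not the paper's code).

    Reduces N input tokens to M output tokens by softly assigning each input
    token to one of M seed tokens based on feature similarity, instead of
    strided (uniform) downsampling. The soft assignments double as
    segmentation masks, which is how masks can emerge without a dedicated head.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.to_q = nn.Linear(dim, dim)  # projects seed tokens to queries
        self.to_k = nn.Linear(dim, dim)  # projects input tokens to keys
        self.to_v = nn.Linear(dim, dim)  # projects input tokens to values
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, seeds: torch.Tensor):
        # x: (B, N, C) input tokens; seeds: (B, M, C) initial group tokens,
        # e.g. obtained by average-pooling x over a coarse grid.
        x = self.norm(x)
        q = self.to_q(seeds)                         # (B, M, C)
        k = self.to_k(x)                             # (B, N, C)
        v = self.to_v(x)                             # (B, N, C)
        sim = q @ k.transpose(-2, -1) * self.scale   # (B, M, N)
        # Softmax over the group axis: each input token distributes itself
        # across the M groups, so `assign` forms soft segmentation masks.
        assign = sim.softmax(dim=1)                  # (B, M, N)
        out = assign @ v                             # (B, M, C)
        # Normalize each group by its total assignment mass (weighted mean).
        out = out / assign.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return out, assign                           # reduced tokens + masks

# Usage: reduce a 32x32 token grid (N=1024) to M=64 group tokens.
layer = SpatialGroupingLayer(dim=256)
x = torch.randn(2, 1024, 256)
seeds = F.adaptive_avg_pool1d(x.transpose(1, 2), 64).transpose(1, 2)
tokens, masks = layer(x, seeds)  # tokens: (2, 64, 256), masks: (2, 64, 1024)
```

Stacking such layers across stages, as the abstract describes, would compose the per-stage assignments into a hierarchy of progressively coarser segmentations.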
Cite
Text
Braso et al. "Native Segmentation Vision Transformers." Advances in Neural Information Processing Systems, 2025.
Markdown
[Braso et al. "Native Segmentation Vision Transformers." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/braso2025neurips-native/)
BibTeX
@inproceedings{braso2025neurips-native,
title = {{Native Segmentation Vision Transformers}},
author = {Braso, Guillem and Osep, Aljosa and Leal-Taixé, Laura},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/braso2025neurips-native/}
}