Vision Transformers with Self-Distilled Registers
Abstract
Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, demonstrating excellent scalability with increased training data and model size. However, recent work has identified the emergence of artifact tokens in ViTs that are incongruous with local semantics. These anomalous tokens degrade ViT performance in tasks that require fine-grained localization or structural coherence. An effective mitigation of this issue is the addition of register tokens to ViTs, which implicitly "absorb" the artifact term during training. Given the availability of existing large-scale pre-trained ViTs, in this paper we seek to add register tokens to existing models without retraining them from scratch, which is infeasible given their size. Specifically, we propose Post Hoc Registers (**PH-Reg**), an efficient self-distillation method that integrates registers into an existing ViT without requiring additional labeled data or full retraining. PH-Reg initializes both teacher and student networks from the same pre-trained ViT. The teacher remains frozen and unmodified, while the student is augmented with randomly initialized register tokens. By applying test-time augmentation to the teacher's inputs, we generate denoised dense embeddings free of artifacts, which are then used to optimize only a small subset of unlocked student weights. We show that our approach effectively reduces the number of artifact tokens, improving the student ViT's segmentation and depth prediction under zero-shot and linear probing.
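The key ingredient described above is the test-time-augmentation step that turns the frozen teacher's artifact-laden feature maps into denoised distillation targets. A minimal sketch of that idea, under loudly labeled assumptions: `teacher_features` is a hypothetical stand-in for the frozen ViT that injects an artifact at a fixed feature-map position (real ViT artifacts are content-dependent, not fixed), and the augmentation is simple circular spatial shifts rather than the paper's actual augmentation set:

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_features(img):
    # Hypothetical frozen teacher: identity features plus a large
    # position-locked spike, standing in for ViT artifact tokens.
    feats = img.copy()
    feats[..., 5, 5] += 10.0
    return feats

def denoised_targets(img, n_aug=16, max_shift=3):
    # Test-time augmentation: randomly shift the input, run the frozen
    # teacher, undo the shift on the feature map, and average. Content
    # re-aligns across augmentations while position-locked artifacts
    # land at different places, so averaging suppresses them.
    acc = np.zeros_like(img)
    for _ in range(n_aug):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        shifted = np.roll(img, (dy, dx), axis=(-2, -1))
        feats = teacher_features(shifted)
        acc += np.roll(feats, (-dy, -dx), axis=(-2, -1))
    return acc / n_aug

# Toy dense feature map: (channels, height, width).
img = rng.normal(size=(3, 16, 16))
target = denoised_targets(img)
```

In this sketch the peak artifact magnitude in `target` is far smaller than in a single teacher pass; the paper then distills such denoised targets into a register-augmented student by updating only a small subset of unlocked weights.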
Cite
Text
Yan et al. "Vision Transformers with Self-Distilled Registers." Advances in Neural Information Processing Systems, 2025.
Markdown
[Yan et al. "Vision Transformers with Self-Distilled Registers." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/yan2025neurips-vision/)
BibTeX
@inproceedings{yan2025neurips-vision,
title = {{Vision Transformers with Self-Distilled Registers}},
author = {Yan, Zipeng and Chen, Yinjie and Zhou, Chong and Dai, Bo and Luo, Andrew},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/yan2025neurips-vision/}
}