Denoising Vision Transformers
Abstract
We study a crucial yet often overlooked issue inherent to Vision Transformers (ViTs): the feature maps of these models exhibit grid-like artifacts ("Original features" in fig:teaser), which hurt the performance of ViTs in downstream dense prediction tasks such as semantic segmentation, depth prediction, and object discovery. We trace this issue to the positional embeddings at the input stage. To mitigate it, we propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT). In the first stage, we separate clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean feature estimates for offline applications. In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, using the derived clean-feature estimates as supervision. Our method, DVT, does not require re-training the existing pre-trained ViTs and is immediately applicable to any Vision Transformer architecture. We evaluate our method on a variety of representative ViTs (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and demonstrate that DVT consistently improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets (fig:teaser, right; tab:dense_results, tab:obj_det, tab:obj_discovery). We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings.
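The second stage described above reduces to a small supervised regression problem: a lightweight transformer block maps raw (artifact-contaminated) ViT features to the clean-feature estimates recovered offline in stage one. The following is a minimal PyTorch sketch of that idea under our own assumptions; the module layout, feature dimensions, placeholder tensors, and MSE objective are illustrative, not the authors' released implementation.

import torch
import torch.nn as nn

class DenoiserBlock(nn.Module):
    """Hypothetical lightweight transformer block that predicts clean
    patch features from raw ViT outputs (a sketch of DVT's stage two)."""
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) raw features from a frozen ViT
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # self-attention
        x = x + self.mlp(self.norm2(x))                    # feed-forward
        return x

# Training-step sketch: regress toward stage-1 clean-feature estimates.
# Both tensors below are random stand-ins, not real model outputs.
denoiser = DenoiserBlock(dim=768)
optimizer = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)
raw_feats = torch.randn(2, 196, 768)      # stand-in for frozen-ViT features
clean_targets = torch.randn(2, 196, 768)  # stand-in for stage-1 estimates
optimizer.zero_grad()
loss = nn.functional.mse_loss(denoiser(raw_feats), clean_targets)
loss.backward()
optimizer.step()

Because the pre-trained ViT stays frozen and only this small block is trained, such a denoiser can be bolted onto any existing ViT backbone without re-training it, which is the property the abstract emphasizes.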
Cite

Text
Yang et al. "Denoising Vision Transformers." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73013-9_26

Markdown
[Yang et al. "Denoising Vision Transformers." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/yang2024eccv-denoising/) doi:10.1007/978-3-031-73013-9_26

BibTeX
@inproceedings{yang2024eccv-denoising,
title = {{Denoising Vision Transformers}},
author = {Yang, Jiawei and Luo, Katie Z and Li, Jiefeng and Deng, Congyue and Guibas, Leonidas and Krishnan, Dilip and Weinberger, Kilian and Tian, Yonglong and Wang, Yue},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-73013-9_26},
url = {https://mlanthology.org/eccv/2024/yang2024eccv-denoising/}
}