SkipPLUS: Skip the First Few Layers to Better Explain Vision Transformers

Abstract

Despite their remarkable performance, the explainability of Vision Transformers (ViTs) remains a challenge. While forward attention-based token attribution techniques have become popular in text processing, their suitability for ViTs has not been extensively explored. In this paper, we compare these methods against state-of-the-art input attribution methods from the vision literature, revealing their limitations, which stem from improper aggregation of information across layers. To address this, we introduce two general techniques, PLUS and SkipPLUS, that can be composed with any input attribution method to aggregate information across layers more effectively while handling noisy layers. Through comprehensive, quantitative evaluations of faithfulness and human interpretability on a variety of ViT architectures and datasets, we demonstrate the effectiveness of PLUS and SkipPLUS, establishing a new state of the art in white-box token attribution. We conclude with a comparative analysis highlighting the strengths and weaknesses of the best versions of all the studied methods. The code used in this paper is freely available at https://github.com/NightMachinery/SkipPLUS-CVPR-2024.
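To make the layer-aggregation idea concrete, below is a minimal sketch of how a SkipPLUS-style aggregator could be wired up. This is not the authors' implementation: the shape of the per-layer attribution maps, the number of skipped layers, and the elementwise reduction (`mean`/`prod`) are all illustrative assumptions; the actual method is defined in the paper and the linked repository.

```python
import torch

def skip_plus_aggregate(layer_attributions: torch.Tensor,
                        num_skip: int = 4,
                        reduce: str = "mean") -> torch.Tensor:
    """Aggregate per-layer token attributions, skipping early (noisy) layers.

    Illustrative sketch only, in the spirit of SkipPLUS; the exact
    aggregation rule is defined in the paper.

    Args:
        layer_attributions: tensor of shape (num_layers, num_tokens),
            one attribution score per token per layer, produced by any
            white-box attribution method.
        num_skip: how many early layers to discard before aggregating
            (an assumed hyperparameter).
        reduce: elementwise reduction over the kept layers, "mean" or
            "prod" (an assumption).

    Returns:
        Tensor of shape (num_tokens,) with aggregated attributions.
    """
    # Drop the first num_skip layers, which the paper identifies as noisy.
    kept = layer_attributions[num_skip:]
    if reduce == "mean":
        return kept.mean(dim=0)
    if reduce == "prod":
        # Normalize each layer so the elementwise product stays stable.
        kept = kept / (kept.abs().sum(dim=-1, keepdim=True) + 1e-12)
        return kept.prod(dim=0)
    raise ValueError(f"unknown reduction: {reduce}")
```

Because the function only consumes a stack of per-layer maps, it composes with any attribution method that can emit one map per layer, which mirrors the plug-and-play framing described in the abstract.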

Cite

Text

Mehri et al. "SkipPLUS: Skip the First Few Layers to Better Explain Vision Transformers." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00025

Markdown

[Mehri et al. "SkipPLUS: Skip the First Few Layers to Better Explain Vision Transformers." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/mehri2024cvprw-skipplus/) doi:10.1109/CVPRW63382.2024.00025

BibTeX

@inproceedings{mehri2024cvprw-skipplus,
  title     = {{SkipPLUS: Skip the First Few Layers to Better Explain Vision Transformers}},
  author    = {Mehri, Faridoun and Fayyaz, Mohsen and Baghshah, Mahdieh Soleymani and Pilehvar, Mohammad Taher},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {204--215},
  doi       = {10.1109/CVPRW63382.2024.00025},
  url       = {https://mlanthology.org/cvprw/2024/mehri2024cvprw-skipplus/}
}