Setting the Record Straight on Transformer Oversmoothing

Abstract

Recent work has argued that Transformers are inherently low-pass filters that gradually oversmooth their inputs, limiting generalization, especially as model depth increases. How can Transformers achieve their widespread empirical success given this shortcoming? In this work we show that, in fact, Transformers are not inherently low-pass filters. Instead, whether a Transformer oversmooths or not depends on the eigenspectrum of its update equations. Further, depending on the task, smoothing does not harm generalization as model depth increases.
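
The eigenspectrum argument can be made concrete with a small numerical check. The sketch below (illustrative only, not the authors' code; all variable names and values are hypothetical) builds a row-stochastic softmax attention matrix and inspects its eigenvalues: the leading eigenvalue is 1, and if all remaining eigenvalues lie strictly inside the unit circle, repeatedly applying the attention matrix alone collapses token representations toward a common value, i.e. it acts as a low-pass filter. The paper's point is that the spectrum of the full layer update, not attention in isolation, determines whether this smoothing actually occurs.

import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim = 8, 16

# Random queries/keys standing in for a single attention head (hypothetical values).
Q = rng.normal(size=(n_tokens, dim))
K = rng.normal(size=(n_tokens, dim))

scores = Q @ K.T / np.sqrt(dim)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)          # row-stochastic attention weights

eigvals = np.linalg.eigvals(A)
eigvals = eigvals[np.argsort(-np.abs(eigvals))]
print("leading eigenvalue:", eigvals[0])            # ~1 for a row-stochastic matrix
print("second-largest |eigenvalue|:", abs(eigvals[1]))
# If the second-largest magnitude is < 1, powers of A converge to a rank-one
# matrix, so the attention-only update smooths the tokens. Value projections
# and residual connections (omitted here) change the spectrum of the full update.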

Cite

Text

Dovonon et al. "Setting the Record Straight on Transformer Oversmoothing." ICLR 2024 Workshops: R2-FM, 2024.

Markdown

[Dovonon et al. "Setting the Record Straight on Transformer Oversmoothing." ICLR 2024 Workshops: R2-FM, 2024.](https://mlanthology.org/iclrw/2024/dovonon2024iclrw-setting/)

BibTeX

@inproceedings{dovonon2024iclrw-setting,
  title     = {{Setting the Record Straight on Transformer Oversmoothing}},
  author    = {Dovonon, Gbetondji Jean-Sebastien and Bronstein, Michael M. and Kusner, Matt},
  booktitle = {ICLR 2024 Workshops: R2-FM},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/dovonon2024iclrw-setting/}
}