An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L

Abstract

Prior work suggests that language models manage the limited bandwidth of the residual stream through a "memory management" mechanism, in which certain attention heads and MLP layers clear residual-stream directions written by earlier layers. Our study provides concrete evidence for this erasure phenomenon in a 4-layer transformer, identifying heads that consistently remove the output of earlier heads. We further demonstrate that direct logit attribution (DLA), a common technique for interpreting the output of intermediate transformer layers, can produce misleading results because it does not account for erasure.
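
For context, direct logit attribution scores an individual component by projecting the vector it writes to the residual stream onto the unembedding direction of a target token. The sketch below is a minimal, hedged illustration of that computation in plain PyTorch; the names component_out, W_U, token_id, and ln_scale are illustrative placeholders and are not taken from the paper's code.

import torch

def direct_logit_attribution(component_out: torch.Tensor,
                             W_U: torch.Tensor,
                             token_id: int,
                             ln_scale: torch.Tensor | None = None) -> torch.Tensor:
    # component_out: (d_model,) vector that one head or MLP writes to the residual stream
    # W_U:           (d_model, d_vocab) unembedding matrix
    # token_id:      vocabulary index of the token whose logit we attribute
    # ln_scale:      optional final-LayerNorm scale, approximately folded into the score
    direction = W_U[:, token_id]                      # (d_model,) unembedding direction
    if ln_scale is not None:
        component_out = component_out / ln_scale
    return component_out @ direction                  # scalar logit contribution

Because DLA scores each component's write in isolation, a direction that a later head subsequently erases still receives full credit, which is exactly the failure mode the abstract describes.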

Cite

Text

Janiak et al. "An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L." ICML 2024 Workshops: MI, 2024.

Markdown

[Janiak et al. "An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L." ICML 2024 Workshops: MI, 2024.](https://mlanthology.org/icmlw/2024/janiak2024icmlw-adversarial/)

BibTeX

@inproceedings{janiak2024icmlw-adversarial,
  title     = {{An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L}},
  author    = {Janiak, Jett and Rager, Can and Dao, James and Lau, Yeu-Tong},
  booktitle = {ICML 2024 Workshops: MI},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/janiak2024icmlw-adversarial/}
}