Explorations of Self-Repair in Language Models
Abstract
Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomenon in which, when components of large language models are ablated, later components change their behavior to compensate. Our work builds on this literature, demonstrating that self-repair occurs across a variety of model families and sizes when individual attention heads are ablated on the full training distribution. We further show that on the full training distribution self-repair is imperfect, as the original direct effect of the head is not fully restored, and noisy, since the degree of self-repair varies significantly across prompts (sometimes overcorrecting beyond the original effect). We explore how the final LayerNorm scaling factor can contribute to self-repair, and additionally discuss the implications of these results for interpretability practitioners.
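To make the ablation setup concrete, here is a minimal sketch (not the authors' code) of zero-ablating a single attention head with TransformerLens-style hooks and comparing the model's logit on a correct answer token before and after ablation; the model name, prompt, layer, and head index are all illustrative. If downstream components self-repair, the observed drop in the answer logit is smaller than the head's direct effect alone would predict.

```python
# Hedged sketch: zero-ablate one attention head and compare answer logits.
# All specifics (model, prompt, layer, head) are illustrative assumptions.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # any supported model
LAYER, HEAD = 9, 6  # hypothetical head to ablate

tokens = model.to_tokens("The Eiffel Tower is located in the city of")
answer = model.to_single_token(" Paris")

def zero_head(z, hook):
    # z has shape [batch, seq, n_heads, d_head]; zero out one head's output
    z[:, :, HEAD, :] = 0.0
    return z

with torch.no_grad():
    clean_logits = model(tokens)
    ablated_logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[(utils.get_act_name("z", LAYER), zero_head)],
    )

clean = clean_logits[0, -1, answer].item()
ablated = ablated_logits[0, -1, answer].item()
# A drop smaller than the head's direct effect indicates downstream compensation.
print(f"answer logit: clean={clean:.3f}, ablated={ablated:.3f}, drop={clean - ablated:.3f}")
```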
Cite
Text
Rushing and Nanda. "Explorations of Self-Repair in Language Models." ICLR 2024 Workshops: SeT_LLM, 2024.
Markdown
[Rushing and Nanda. "Explorations of Self-Repair in Language Models." ICLR 2024 Workshops: SeT_LLM, 2024.](https://mlanthology.org/iclrw/2024/rushing2024iclrw-explorations/)
BibTeX
@inproceedings{rushing2024iclrw-explorations,
title = {{Explorations of Self-Repair in Language Models}},
author = {Rushing, Cody and Nanda, Neel},
booktitle = {ICLR 2024 Workshops: SeT_LLM},
year = {2024},
url = {https://mlanthology.org/iclrw/2024/rushing2024iclrw-explorations/}
}