Discovering Variable Binding Circuitry with Desiderata

Abstract

Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach that extends causal mediation experiments to automatically identify the model components responsible for a specific subtask solely by specifying a set of $\textit{desiderata}$, or causal attributes of the model components executing that subtask. As a proof of concept, we apply our method to automatically discover shared variable binding circuitry in LLaMA-13B, which retrieves variable values for multiple arithmetic tasks. Our method successfully localizes variable binding to only 9 attention heads (of the model's 1.6k) and one MLP in the final token's residual stream.
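
The idea of localizing a subtask from desiderata can be illustrated on a toy model. The sketch below is not the authors' procedure and does not involve LLaMA-13B. It assumes a hypothetical three-component additive "model" and encodes each desideratum as a triple (original input, counterfactual input, target output). It then brute-forces the smallest set of components whose activations, when patched in from the counterfactual run, produce the target. All names (`COMPONENTS`, `run`, `smallest_satisfying_subset`) are illustrative assumptions.

```python
from itertools import combinations

# Hypothetical toy "model": each component adds a term to the output.
COMPONENTS = {
    "binder": lambda x: x["value"],      # retrieves the variable's value
    "offsetter": lambda x: x["offset"],  # handles an unrelated part of the task
    "inert": lambda x: 0,                # contributes nothing
}


def run(x, patch_from=None, patched=()):
    """Run the toy model on x; components named in `patched` instead
    read their input from the counterfactual `patch_from`."""
    total = 0
    for name, fn in COMPONENTS.items():
        source = patch_from if name in patched else x
        total += fn(source)
    return total


def satisfies(subset, desiderata):
    """True if patching `subset` yields the target for every desideratum."""
    return all(run(orig, cf, subset) == target for orig, cf, target in desiderata)


def smallest_satisfying_subset(desiderata):
    """Exhaustive search, smallest subsets first (assumption: feasible only
    for toy sizes; a real model would need a scalable search)."""
    names = list(COMPONENTS)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            if satisfies(subset, desiderata):
                return subset
    return None


# Each target is "counterfactual value + original offset": the output we
# expect if ONLY the variable-binding step were swapped.
desiderata = [
    ({"value": 3, "offset": 10}, {"value": 7, "offset": 20}, 17),
    ({"value": 1, "offset": 5}, {"value": 4, "offset": 0}, 9),
]

found = smallest_satisfying_subset(desiderata)
print(found)
```

In this toy setting only the `binder` component needs to be patched to meet both desiderata. Exhaustive search stands in for whatever search the method actually uses, and it would not scale to the roughly 1.6k attention heads mentioned above.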

Cite

Text

Davies et al. "Discovering Variable Binding Circuitry with Desiderata." ICML 2023 Workshops: DeployableGenerativeAI, 2023.

Markdown

[Davies et al. "Discovering Variable Binding Circuitry with Desiderata." ICML 2023 Workshops: DeployableGenerativeAI, 2023.](https://mlanthology.org/icmlw/2023/davies2023icmlw-discovering/)

BibTeX

@inproceedings{davies2023icmlw-discovering,
  title     = {{Discovering Variable Binding Circuitry with Desiderata}},
  author    = {Davies, Xander and Nadeau, Max and Prakash, Nikhil and Shaham, Tamar Rott and Bau, David},
  booktitle = {ICML 2023 Workshops: DeployableGenerativeAI},
  year      = {2023},
  url       = {https://mlanthology.org/icmlw/2023/davies2023icmlw-discovering/}
}