Discovering Variable Binding Circuitry with Desiderata

Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, David Bau

ICMLW 2023

Abstract

Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach that extends causal mediation experiments to automatically identify the model components responsible for a specific subtask solely by specifying a set of $\textit{desiderata}$, or causal attributes of the model components executing that subtask. As a proof of concept, we apply our method to automatically discover shared variable binding circuitry in LLaMA-13B, which retrieves variable values for multiple arithmetic tasks. Our method successfully localizes variable binding to only 9 attention heads (of the 1.6k) and one MLP in the final token's residual stream.

Cite

Text

Davies et al. "Discovering Variable Binding Circuitry with Desiderata." ICML 2023 Workshops: DeployableGenerativeAI, 2023.

Markdown

[Davies et al. "Discovering Variable Binding Circuitry with Desiderata." ICML 2023 Workshops: DeployableGenerativeAI, 2023.](https://mlanthology.org/icmlw/2023/davies2023icmlw-discovering/)

BibTeX

@inproceedings{davies2023icmlw-discovering,
  title     = {{Discovering Variable Binding Circuitry with Desiderata}},
  author    = {Davies, Xander and Nadeau, Max and Prakash, Nikhil and Shaham, Tamar Rott and Bau, David},
  booktitle = {ICML 2023 Workshops: DeployableGenerativeAI},
  year      = {2023},
  url       = {https://mlanthology.org/icmlw/2023/davies2023icmlw-discovering/}
}