Structural Inference: Interpreting Small Language Models with Susceptibilities

Abstract

We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical mechanical system. A small perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently with local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. We combine these susceptibilities into a response matrix whose low-rank structure separates functional modules such as multigram and induction heads in a 3M-parameter transformer.
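As an illustration of the linear response idea, the sketch below estimates a susceptibility on a one-parameter toy problem. It is not the paper's implementation. It assumes a tempered posterior proportional to exp(-n_beta * (L(w) + h * dL(w))), for which the first-order change of a posterior expectation E[phi] at h = 0 is the standard covariance identity -n_beta * Cov(phi, dL). Sign and normalization conventions vary and the paper's may differ. The names `susceptibility_estimate`, `n_beta`, and `token_slopes` are hypothetical, and exact Gaussian draws stand in for local SGLD samples.

```python
import numpy as np


def susceptibility_estimate(phi, delta_loss, n_beta):
    """Estimate d/dh E_h[phi] at h = 0 from posterior samples.

    Assumes p_h(w) is proportional to exp(-n_beta * (L(w) + h * dL(w))),
    so the derivative equals -n_beta * Cov(phi, dL).
    """
    phi = np.asarray(phi, dtype=float)
    delta_loss = np.asarray(delta_loss, dtype=float)
    cov = np.mean(phi * delta_loss) - np.mean(phi) * np.mean(delta_loss)
    return -n_beta * cov


rng = np.random.default_rng(0)
n_beta = 50.0  # hypothetical effective inverse temperature

# Toy loss L(w) = w^2 / 2, so the unperturbed posterior is N(0, 1 / n_beta).
# Assumption: exact Gaussian draws replace SGLD samples here.
w = rng.normal(0.0, 1.0 / np.sqrt(n_beta), size=200_000)

# Hypothetical per-token perturbation losses l_t(w) = a_t * w.
token_slopes = np.array([2.0, -1.0, 2.0])
token_losses = w[:, None] * token_slopes[None, :]  # shape (samples, tokens)
delta_loss = token_losses.mean(axis=1)             # dL(w) = mean_t l_t(w) = w

# Observable phi(w) = w. Analytically E_h[w] = -h, so the response is -1.
chi = susceptibility_estimate(w, delta_loss, n_beta)

# Covariance is linear in dL, so chi splits into signed per-token terms.
num_tokens = len(token_slopes)
per_token = np.array([
    susceptibility_estimate(w, token_losses[:, t], n_beta) / num_tokens
    for t in range(num_tokens)
])

print("susceptibility:", chi)
print("per-token contributions:", per_token)
```

In this toy setting the per-token terms sum to the total susceptibility by linearity of the covariance, and the token with a negative slope contributes with the opposite sign, which is the sense in which such contributions can act as signed attribution scores.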

Cite

Text

Baker et al. "Structural Inference: Interpreting Small Language Models with Susceptibilities." International Conference on Learning Representations, 2026.

Markdown

[Baker et al. "Structural Inference: Interpreting Small Language Models with Susceptibilities." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/baker2026iclr-structural/)

BibTeX

@inproceedings{baker2026iclr-structural,
  title     = {{Structural Inference: Interpreting Small Language Models with Susceptibilities}},
  author    = {Baker, Garrett and Wang, George and Hoogland, Jesse and Pathak, Vinayak and Murfet, Daniel},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/baker2026iclr-structural/}
}