Structural Inference: Interpreting Small Language Models with Susceptibilities
Abstract
We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical mechanical system. A small perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently with local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. We combine these susceptibilities into a response matrix whose low-rank structure separates functional modules such as multigram and induction heads in a 3M-parameter transformer.
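The linear response idea in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation. It assumes a tempered posterior proportional to exp(-nβ(L(w) + h·ΔL(w))) over a single scalar weight, and it replaces local SGLD sampling with exact quadrature on a grid. Under that assumption, the first-order change in a posterior expectation equals -nβ times the covariance between the observable and ΔL. Because ΔL is a sum of per-token terms, the covariance splits into signed per-token contributions. All names (`N_BETA`, `TOKEN_TERMS`, `chi_cov`, and so on), the quadratic loss, and the sign and scaling convention are hypothetical choices made for illustration.

```python
import numpy as np

# Toy setup (all values are illustrative assumptions, not from the paper).
N_BETA = 4.0                              # effective inverse temperature n * beta
GRID = np.linspace(-10.0, 10.0, 20001)    # quadrature grid over a scalar weight w


def loss(w):
    """Unperturbed loss L(w); a quadratic well chosen for simplicity."""
    return 0.5 * w**2


def observable(w):
    """Observable phi(w) whose posterior expectation we track."""
    return w


# Hypothetical per-token pieces of the loss perturbation; Delta L is their sum.
TOKEN_TERMS = [lambda w: 0.25 * w, lambda w: 0.75 * w]


def delta_loss(w):
    """Total perturbation Delta L(w) induced by shifting the data distribution."""
    return sum(term(w) for term in TOKEN_TERMS)


def posterior(h=0.0):
    """Normalized grid weights for p_h(w) ∝ exp(-n*beta*(L + h*Delta L))."""
    log_p = -N_BETA * (loss(GRID) + h * delta_loss(GRID))
    p = np.exp(log_p - log_p.max())       # subtract max for numerical stability
    return p / p.sum()


def expectation(values, p):
    return float(np.sum(p * values))


def covariance(a, b, p):
    return expectation(a * b, p) - expectation(a, p) * expectation(b, p)


p0 = posterior()
phi = observable(GRID)

# Susceptibility from the covariance formula at h = 0.
chi_cov = -N_BETA * covariance(phi, delta_loss(GRID), p0)

# Signed per-token contributions; they sum to chi_cov by linearity of covariance.
token_chis = [-N_BETA * covariance(phi, term(GRID), p0) for term in TOKEN_TERMS]

# Independent check: central finite difference of the perturbed expectation.
h = 1e-4
chi_fd = (expectation(phi, posterior(h)) - expectation(phi, posterior(-h))) / (2 * h)

print(chi_cov, chi_fd, token_chis)
```

In this toy model the perturbed posterior is a Gaussian whose mean shifts by -h, so both estimates should agree closely, and the per-token contributions should add up to the total. In a real network the quadrature step would be replaced by samples from a local sampler such as SGLD, and the observable would be restricted to one component of the network.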
Cite
Text
Baker et al. "Structural Inference: Interpreting Small Language Models with Susceptibilities." International Conference on Learning Representations, 2026.
Markdown
[Baker et al. "Structural Inference: Interpreting Small Language Models with Susceptibilities." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/baker2026iclr-structural/)
BibTeX
@inproceedings{baker2026iclr-structural,
title = {{Structural Inference: Interpreting Small Language Models with Susceptibilities}},
author = {Baker, Garrett and Wang, George and Hoogland, Jesse and Pathak, Vinayak and Murfet, Daniel},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/baker2026iclr-structural/}
}