From Neurons to Neutrons: A Case Study in Interpretability

Abstract

Mechanistic Interpretability (MI) proposes a path toward fully understanding how neural networks make their predictions. Prior work demonstrates that even when trained to perform simple arithmetic, models can implement a variety of algorithms (sometimes concurrently) depending on initialization and hyperparameters. Does this mean neuron-level interpretability techniques have limited applicability? Here, we argue that high-dimensional neural networks can learn useful low-dimensional representations of the data they were trained on, going beyond simply making good predictions: Such representations can be understood with the MI lens and provide insights that are surprisingly faithful to human-derived domain knowledge. This indicates that such approaches to interpretability can be useful for deriving a new understanding of a problem from models trained to solve it. As a case study, we extract nuclear physics concepts by studying models trained to reproduce nuclear data.
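To make the setup concrete, below is a minimal sketch (not the authors' code) of the kind of model the abstract describes: a small network with learned embeddings for proton and neutron numbers, trained to reproduce a nuclear observable, whose embeddings can then be projected to a few dimensions and inspected for structure. The class name, layer sizes, and the synthetic target are illustrative assumptions, not details taken from the paper.

# Sketch only: embeddings for (Z, N) -> scalar observable, then a PCA probe.
import torch
import torch.nn as nn

class NuclearModel(nn.Module):
    def __init__(self, max_z: int = 120, max_n: int = 180, d: int = 64):
        super().__init__()
        self.z_emb = nn.Embedding(max_z, d)   # learned representation of proton number Z
        self.n_emb = nn.Embedding(max_n, d)   # learned representation of neutron number N
        self.mlp = nn.Sequential(nn.Linear(2 * d, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, z: torch.Tensor, n: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.z_emb(z), self.n_emb(n)], dim=-1)
        return self.mlp(x).squeeze(-1)

# Toy usage: fit a synthetic stand-in target (a real study would use measured
# data such as binding energies), then look at the leading principal components
# of the learned embeddings, a common first step when probing representations
# for low-dimensional structure.
model = NuclearModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
z = torch.randint(1, 120, (256,))
n = torch.randint(1, 180, (256,))
target = (z + n).float()  # placeholder observable for illustration only
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(z, n), target)
    loss.backward()
    opt.step()

with torch.no_grad():
    emb = model.z_emb.weight
    _, _, v = torch.pca_lowrank(emb, q=2)   # top-2 principal directions
    pcs = emb @ v                           # 2-D projection of the Z embeddings
print(pcs.shape)  # torch.Size([120, 2])

Plotting the two components against Z is the sort of view in which interpretable, domain-relevant ordering of the learned representation could be checked.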

Cite

Text

Kitouni et al. "From Neurons to Neutrons: A Case Study in Interpretability." International Conference on Machine Learning, 2024.

Markdown

[Kitouni et al. "From Neurons to Neutrons: A Case Study in Interpretability." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/kitouni2024icml-neurons/)

BibTeX

@inproceedings{kitouni2024icml-neurons,
  title     = {{From Neurons to Neutrons: A Case Study in Interpretability}},
  author    = {Kitouni, Ouail and Nolte, Niklas and Pérez-Díaz, Víctor Samuel and Trifinopoulos, Sokratis and Williams, Mike},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {24726--24748},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/kitouni2024icml-neurons/}
}