Validating Mechanistic Interpretations: An Axiomatic Approach
Abstract
Mechanistic interpretability aims to reverse-engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on the mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad hoc. Inspired by abstract interpretation from the program analysis literature, which develops approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We demonstrate the applicability of these axioms for validating mechanistic interpretations on an existing, well-known interpretability study as well as on a new case study involving a Transformer-based model trained to solve the classic 2-SAT problem.
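For readers unfamiliar with the task in the second case study: a 2-SAT instance is a conjunction of clauses, each a disjunction of exactly two literals, and the model is trained to decide whether such a formula is satisfiable. The sketch below is an illustrative brute-force checker, not code from the paper; the clause encoding and function names are assumptions made for this example, and a practical solver would instead use the linear-time implication-graph algorithm.

```python
from itertools import product

def satisfies(assignment, clauses):
    """Check whether a truth assignment satisfies every clause.

    A clause is a pair of literals; a literal is (var_index, polarity),
    where polarity=True means the variable appears unnegated.
    """
    return all(
        any(assignment[v] == polarity for v, polarity in clause)
        for clause in clauses
    )

def is_satisfiable(n_vars, clauses):
    """Brute-force 2-SAT decision: try all 2**n_vars assignments."""
    return any(
        satisfies(assignment, clauses)
        for assignment in product([False, True], repeat=n_vars)
    )

# (x0 or x1) and (not x0 or x1) and (not x1 or x0): satisfiable by x0 = x1 = True
clauses = [((0, True), (1, True)), ((0, False), (1, True)), ((1, False), (0, True))]
print(is_satisfiable(2, clauses))  # True
```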
Cite
Text
Palumbo et al. "Validating Mechanistic Interpretations: An Axiomatic Approach." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Palumbo et al. "Validating Mechanistic Interpretations: An Axiomatic Approach." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/palumbo2025icml-validating/)
BibTeX
@inproceedings{palumbo2025icml-validating,
  title = {{Validating Mechanistic Interpretations: An Axiomatic Approach}},
  author = {Palumbo, Nils and Mangal, Ravi and Wang, Zifan and Vijayakumar, Saranya and Pasareanu, Corina S. and Jha, Somesh},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year = {2025},
  pages = {47509--47544},
  volume = {267},
  url = {https://mlanthology.org/icml/2025/palumbo2025icml-validating/}
}