Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability
Abstract
Causal abstraction provides a theoretical foundation for mechanistic interpretability, the field concerned with providing intelligible algorithms that are faithful simplifications of the known, but opaque, low-level details of black box AI models. Our contributions are (1) generalizing the theory of causal abstraction from mechanism replacement (i.e., hard and soft interventions) to arbitrary mechanism transformation (i.e., functionals from old mechanisms to new mechanisms), (2) providing a flexible, yet precise formalization for the core concepts of polysemantic neurons, the linear representation hypothesis, modular features, and graded faithfulness, and (3) unifying a variety of mechanistic interpretability methods in the common language of causal abstraction, namely, activation and path patching, causal mediation analysis, causal scrubbing, causal tracing, circuit analysis, concept erasure, sparse autoencoders, differential binary masking, distributed alignment search, and steering.
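To make the notion of mechanism replacement concrete, here is a minimal, hypothetical sketch of an interchange intervention (activation patching) on a toy PyTorch model. The model, names, and dimensions (ToyModel, interchange_intervention, a two-layer MLP) are illustrative assumptions and are not code from the paper.

# Minimal sketch (illustrative only) of an interchange intervention /
# activation patching experiment on a toy two-layer network.
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """A tiny MLP standing in for an opaque low-level model."""
    def __init__(self, d_in=4, d_hidden=8, d_out=2):
        super().__init__()
        self.layer1 = nn.Linear(d_in, d_hidden)
        self.layer2 = nn.Linear(d_hidden, d_out)

    def forward(self, x, patch=None):
        h = torch.relu(self.layer1(x))
        if patch is not None:
            # Hard intervention: replace the hidden mechanism's output
            # with activations cached from another (source) input.
            h = patch
        return self.layer2(h)

def interchange_intervention(model, base_x, source_x):
    """Run base_x, but with hidden activations taken from source_x.
    If the patched output matches the prediction of a hypothesized
    high-level algorithm, that is evidence the hidden layer realizes
    the corresponding high-level variable."""
    with torch.no_grad():
        source_h = torch.relu(model.layer1(source_x))  # cache source activations
        return model(base_x, patch=source_h)           # counterfactual (patched) run

model = ToyModel()
base, source = torch.randn(1, 4), torch.randn(1, 4)
print("original:", model(base))
print("patched: ", interchange_intervention(model, base, source))

A soft intervention would modify the cached activation (e.g., add a steering vector) rather than overwrite it entirely; the paper's mechanism transformations generalize both cases by allowing an arbitrary functional from the old mechanism to a new one.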
Cite
Text
Geiger et al. "Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability." Journal of Machine Learning Research, 2025.
Markdown
[Geiger et al. "Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability." Journal of Machine Learning Research, 2025.](https://mlanthology.org/jmlr/2025/geiger2025jmlr-causal/)
BibTeX
@article{geiger2025jmlr-causal,
  title   = {{Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability}},
  author  = {Geiger, Atticus and Ibeling, Duligur and Zur, Amir and Chaudhary, Maheep and Chauhan, Sonakshi and Huang, Jing and Arora, Aryaman and Wu, Zhengxuan and Goodman, Noah and Potts, Christopher and Icard, Thomas},
  journal = {Journal of Machine Learning Research},
  year    = {2025},
  volume  = {26},
  pages   = {1--64},
  url     = {https://mlanthology.org/jmlr/2025/geiger2025jmlr-causal/}
}