What's Your Use Case? A Taxonomy of Causal Evaluations of Post-Hoc Interpretability

Abstract

Post-hoc interpretability of neural network models, including Large Language Models (LLMs), often aims for mechanistic interpretations — detailed, causal descriptions of model behavior. However, human interpreters may lack the capacity or willingness to formulate intricate mechanistic models, let alone evaluate them. This paper addresses that challenge by introducing a taxonomy that dissects the overarching goal of mechanistic interpretability into constituent claims, each requiring its own evaluation method. In doing so, we turn these evaluation criteria into actionable learning objectives, providing a data-driven pathway to interpretability. This framework enables a methodologically rigorous yet pragmatic approach to evaluating the strengths and limitations of various interpretability tools.
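To make the abstract's central move concrete — treating a causal evaluation criterion as a learning objective — here is a minimal, hypothetical sketch (not taken from the paper). It scores how well a candidate explanation predicts a black-box model's response to do()-style interventions, then fits the explanation by minimizing that score. All names (`model`, `explanation`, `intervention_gap`) and the linear-surrogate setup are illustrative assumptions.

```python
# Hypothetical sketch: an intervention-based faithfulness score used as a
# training objective for a post-hoc explanation. Not the paper's method.
import numpy as np

rng = np.random.default_rng(0)

def model(x):
    # Stand-in "black box": a fixed nonlinear function of the input.
    return np.tanh(x @ np.array([0.9, -0.4, 0.1]))

def explanation(x, w):
    # Candidate interpretation: a linear surrogate with learnable weights w.
    return x @ w

def intervention_gap(x, w, feature, delta=1.0):
    # Causal evaluation criterion: does the surrogate predict the model's
    # output change under an intervention on one input feature?
    x_int = x.copy()
    x_int[:, feature] += delta
    true_effect = model(x_int) - model(x)
    predicted_effect = explanation(x_int, w) - explanation(x, w)
    return np.mean((true_effect - predicted_effect) ** 2)

# Turn the evaluation criterion into a learning objective: fit w to
# minimize the intervention gap, here via finite-difference descent.
x = rng.normal(size=(256, 3))
w = np.zeros(3)
eps, lr = 1e-4, 0.5
for _ in range(200):
    loss = sum(intervention_gap(x, w, f) for f in range(3))
    grad = np.zeros_like(w)
    for i in range(3):
        w_eps = w.copy()
        w_eps[i] += eps
        loss_eps = sum(intervention_gap(x, w_eps, f) for f in range(3))
        grad[i] = (loss_eps - loss) / eps
    w -= lr * grad

print("learned surrogate weights:", np.round(w, 3))
```

The point of the sketch is the shape of the pipeline, not the toy model: once a causal evaluation is written as a differentiable (or at least optimizable) gap, any explanation family can be trained against it directly.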

Cite

Text

Reber et al. "What's Your Use Case? A Taxonomy of Causal Evaluations of Post-Hoc Interpretability." NeurIPS 2023 Workshops: CRL, 2023.

Markdown

[Reber et al. "What's Your Use Case? A Taxonomy of Causal Evaluations of Post-Hoc Interpretability." NeurIPS 2023 Workshops: CRL, 2023.](https://mlanthology.org/neuripsw/2023/reber2023neuripsw-your/)

BibTeX

@inproceedings{reber2023neuripsw-your,
  title     = {{What's Your Use Case? A Taxonomy of Causal Evaluations of Post-Hoc Interpretability}},
  author    = {Reber, David and Garbacea, Cristina and Veitch, Victor},
  booktitle = {NeurIPS 2023 Workshops: CRL},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/reber2023neuripsw-your/}
}