What's Your Use Case? A Taxonomy of Causal Evaluations of Post-Hoc Interpretability
Abstract
Post-hoc interpretability of neural network models, including Large Language Models (LLMs), often aims for mechanistic interpretations: detailed, causal descriptions of model behavior. However, human interpreters may lack the capacity or willingness to formulate intricate mechanistic models, let alone evaluate them. This paper addresses this challenge by introducing a taxonomy that dissects the overarching goal of mechanistic interpretability into constituent claims, each requiring distinct evaluation methods. In doing so, we transform these evaluation criteria into actionable learning objectives, providing a data-driven pathway to interpretability. This framework enables a methodologically rigorous yet pragmatic approach to evaluating the strengths and limitations of various interpretability tools.
Cite
Text
Reber et al. "What's Your Use Case? A Taxonomy of Causal Evaluations of Post-Hoc Interpretability." NeurIPS 2023 Workshops: CRL, 2023.
Markdown
[Reber et al. "What's Your Use Case? A Taxonomy of Causal Evaluations of Post-Hoc Interpretability." NeurIPS 2023 Workshops: CRL, 2023.](https://mlanthology.org/neuripsw/2023/reber2023neuripsw-your/)
BibTeX
@inproceedings{reber2023neuripsw-your,
title = {{What's Your Use Case? A Taxonomy of Causal Evaluations of Post-Hoc Interpretability}},
author = {Reber, David and Garbacea, Cristina and Veitch, Victor},
booktitle = {NeurIPS 2023 Workshops: CRL},
year = {2023},
url = {https://mlanthology.org/neuripsw/2023/reber2023neuripsw-your/}
}