Evaluating Neuron Interpretation Methods of NLP Models

Abstract

Neuron interpretation offers valuable insights into how knowledge is structured within a deep neural network model. While a number of neuron interpretation methods have been proposed in the literature, the field lacks a comprehensive comparison among these methods. This gap hampers progress due to the absence of standardized metrics and benchmarks. The commonly used evaluation metric has limitations, and creating ground truth annotations for neurons is impractical. Addressing these challenges, we propose an evaluation framework based on voting theory. Our hypothesis posits that neurons consistently identified by different methods carry more significant information. We rigorously assess our framework across a diverse array of neuron interpretation methods. Notable findings include: i) despite the theoretical differences among the methods, neuron ranking methods share over 60% of their rankings when identifying salient neurons, ii) the neuron interpretation methods are most sensitive to the last layer representations, iii) Probeless neuron ranking emerges as the most consistent method.
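To make the idea of agreement between methods concrete, the following is a minimal sketch rather than the paper's actual framework. It assumes each method yields an ordered list of neuron indices, most salient first. It then measures the pairwise top-k overlap and counts how many methods "vote" for each neuron. The method names, the toy rankings, and the helper functions `top_k_overlap` and `consensus_votes` are hypothetical and used only for illustration.

```python
from collections import Counter
from itertools import combinations


def top_k_overlap(rank_a, rank_b, k):
    """Fraction of the top-k neurons that two rankings have in common."""
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k


def consensus_votes(rankings, k):
    """Count, for each neuron, how many methods place it in their top-k."""
    votes = Counter()
    for ranking in rankings.values():
        votes.update(ranking[:k])
    return votes


# Hypothetical toy rankings (neuron indices, most salient first).
rankings = {
    "method_a": [3, 7, 1, 9, 4, 0],
    "method_b": [7, 3, 4, 2, 1, 8],
    "method_c": [1, 3, 7, 5, 6, 4],
}
k = 3

# Pairwise agreement between every pair of methods.
pairwise = {
    (a, b): top_k_overlap(rankings[a], rankings[b], k)
    for a, b in combinations(rankings, 2)
}

# Neurons selected by a strict majority of methods.
votes = consensus_votes(rankings, k)
majority = sorted(n for n, v in votes.items() if v > len(rankings) / 2)

print(pairwise)
print(majority)
```

Under this simple majority rule, a neuron counts as consistently identified when more than half of the methods rank it in their top-k. Other voting schemes would weight rank positions differently.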

Cite

Text

Fan et al. "Evaluating Neuron Interpretation Methods of NLP Models." Neural Information Processing Systems, 2023.

Markdown

[Fan et al. "Evaluating Neuron Interpretation Methods of NLP Models." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/fan2023neurips-evaluating/)

BibTeX

@inproceedings{fan2023neurips-evaluating,
  title     = {{Evaluating Neuron Interpretation Methods of NLP Models}},
  author    = {Fan, Yimin and Dalvi, Fahim and Durrani, Nadir and Sajjad, Hassan},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/fan2023neurips-evaluating/}
}