Explaining Black Box Text Modules in Natural Language with Language Models

Abstract

Large language models (LLMs) have demonstrated remarkable prediction performance for a growing array of tasks. However, their rapid proliferation and increasing opacity have created a growing need for interpretability. Here, we ask whether we can automatically obtain natural language explanations for black box text modules. A *text module* is any function that maps text to a continuous scalar value, such as a submodule within an LLM or a fitted model of a brain region. *Black box* indicates that we only have access to the module's inputs and outputs. We introduce Summarize and Score (SASC), a method that takes in a text module and returns a natural language explanation of the module's selectivity along with a score for how reliable the explanation is. We study SASC in two contexts. First, we evaluate SASC on synthetic modules and find that it often recovers ground truth explanations. Second, we use SASC to explain modules found within a pre-trained BERT model, enabling inspection of the model's internals.
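To make the setup concrete, below is a minimal Python sketch of the input/output contract the abstract describes: a text module is any callable mapping text to a scalar, and an explainer returns an explanation string plus a reliability score. The function name `explain_module`, the probing and summarizing steps, and the placeholder score are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List, Tuple

def explain_module(
    module: Callable[[str], float],   # black-box text module: text -> scalar
    probe_texts: List[str],           # texts used to query the module
) -> Tuple[str, float]:
    """Sketch of the interface described in the abstract: given a black-box
    text module, return a natural-language explanation of its selectivity
    and a score for how reliable that explanation is."""
    # Query the module and keep the texts that elicit the largest responses.
    top_texts = sorted(probe_texts, key=module, reverse=True)[:10]
    # Placeholder for the "Summarize" step: an LLM would summarize what
    # `top_texts` have in common into a candidate explanation.
    explanation = "candidate summary of the top-responding texts"
    # Placeholder for the "Score" step: reliability would be estimated from
    # the module's responses to texts matching the explanation.
    score = 0.0
    return explanation, score

# Toy usage: a module that simply responds to mentions of "dog".
toy_module = lambda text: float(text.lower().count("dog"))
print(explain_module(toy_module, ["the dog barked", "stocks fell", "a dog and a cat"]))
```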

Cite

Text

Singh et al. "Explaining Black Box Text Modules in Natural Language with Language Models." NeurIPS 2023 Workshops: XAIA, 2023.

Markdown

[Singh et al. "Explaining Black Box Text Modules in Natural Language with Language Models." NeurIPS 2023 Workshops: XAIA, 2023.](https://mlanthology.org/neuripsw/2023/singh2023neuripsw-explaining/)

BibTeX

@inproceedings{singh2023neuripsw-explaining,
  title     = {{Explaining Black Box Text Modules in Natural Language with Language Models}},
  author    = {Singh, Chandan and Hsu, Aliyah and Antonello, Richard and Jain, Shailee and Huth, Alexander and Yu, Bin and Gao, Jianfeng},
  booktitle = {NeurIPS 2023 Workshops: XAIA},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/singh2023neuripsw-explaining/}
}