Explaining Black Box Text Modules in Natural Language with Language Models
Abstract
Large language models (LLMs) have demonstrated remarkable prediction performance for a growing array of tasks. However, their rapid proliferation and increasing opaqueness have created a growing need for interpretability. Here, we ask whether we can automatically obtain natural language explanations for black box text modules. A *text module* is any function that maps text to a scalar continuous value, such as a submodule within an LLM or a fitted model of a brain region. *Black box* indicates that we only have access to the module's inputs and outputs. We introduce Summarize and Score (SASC), a method that takes in a text module and returns a natural language explanation of the module's selectivity along with a score for how reliable the explanation is. We study SASC in 2 contexts. First, we evaluate SASC on synthetic modules and find that it often recovers ground truth explanations. Second, we use SASC to explain modules found within a pre-trained BERT model, enabling inspection of the model's internals.
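The abstract treats a text module simply as a function from text to a scalar, and SASC as something that pairs a candidate explanation with a reliability score. Below is a minimal, hypothetical sketch of that interface, not the paper's implementation: the names (`score_explanation`, `food_module`) and the particular scoring rule (mean response on explanation-related texts minus mean response on unrelated texts) are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import statistics

# A "text module" is any function mapping a string to a scalar value.
TextModule = Callable[[str], float]

@dataclass
class Explanation:
    text: str     # natural language description of the module's selectivity
    score: float  # how reliably the explanation predicts large module responses


def score_explanation(module: TextModule,
                      explanation_texts: Sequence[str],
                      baseline_texts: Sequence[str]) -> float:
    """Illustrative score: gap between the module's mean response on texts
    related to the explanation and on unrelated baseline texts."""
    related = statistics.mean(module(t) for t in explanation_texts)
    unrelated = statistics.mean(module(t) for t in baseline_texts)
    return related - unrelated


# Toy module that is selective for food-related words.
def food_module(text: str) -> float:
    food_words = {"pizza", "bread", "cheese", "soup"}
    return float(sum(w in food_words for w in text.lower().split()))


explanation = Explanation(
    text="responds to mentions of food",
    score=score_explanation(
        food_module,
        explanation_texts=["she baked fresh bread", "pizza with extra cheese"],
        baseline_texts=["the train arrived late", "he solved the equation"],
    ),
)
print(explanation)
```

A module that is well described by the explanation should score high under this kind of check, while an unrelated or vacuous explanation should score near zero; the paper's actual scoring procedure should be consulted for details.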
Cite
Text
Singh et al. "Explaining Black Box Text Modules in Natural Language with Language Models." NeurIPS 2023 Workshops: XAIA, 2023.

Markdown

[Singh et al. "Explaining Black Box Text Modules in Natural Language with Language Models." NeurIPS 2023 Workshops: XAIA, 2023.](https://mlanthology.org/neuripsw/2023/singh2023neuripsw-explaining/)

BibTeX
@inproceedings{singh2023neuripsw-explaining,
title = {{Explaining Black Box Text Modules in Natural Language with Language Models}},
author = {Singh, Chandan and Hsu, Aliyah and Antonello, Richard and Jain, Shailee and Huth, Alexander and Yu, Bin and Gao, Jianfeng},
booktitle = {NeurIPS 2023 Workshops: XAIA},
year = {2023},
url = {https://mlanthology.org/neuripsw/2023/singh2023neuripsw-explaining/}
}