Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations
Abstract
Large language models (LLMs) are capable of producing compelling explanations of their reasoning when answering questions. However, LLM explanations can be unfaithful to the model's true underlying behavior, potentially leading to over-trust and misuse. In this work, we introduce a new approach for measuring explanation faithfulness that is tailored to LLMs. Our first contribution is to translate an intuitive understanding of what it means for an LLM explanation to be faithful into a formal definition of faithfulness. Since LLM explanations mimic human explanations, they often reference high-level *concepts* in the input question that are influential in decision-making. We formalize faithfulness in terms of the difference between the set of concepts that the LLM *says* are influential and the set that *truly* are. We then present a novel method for quantifying faithfulness that is based on: (1) using an auxiliary LLM to edit, or perturb, the values of concepts within model inputs, and (2) using a hierarchical Bayesian model to quantify how changes to concepts affect model answers at both the example- and dataset-level. Through preliminary experiments on a question-answering dataset, we show that our method can be used to quantify and discover interpretable patterns of unfaithfulness, including cases where LLMs fail to admit their use of social biases.
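The set-difference view of faithfulness described above can be illustrated with a minimal sketch. This is not the authors' method: it assumes the perturbed inputs have already been generated and answered, it replaces the hierarchical Bayesian model with a plain answer-change frequency, and the names `empirical_effect`, `faithfulness_gap`, the `threshold` of 0.5, and the toy data are all hypothetical.

```python
from typing import Dict, List, Set


def empirical_effect(original_answer: str, perturbed_answers: List[str]) -> float:
    """Fraction of perturbations of one concept that changed the model's answer."""
    if not perturbed_answers:
        return 0.0
    changed = sum(1 for a in perturbed_answers if a != original_answer)
    return changed / len(perturbed_answers)


def faithfulness_gap(
    stated: Set[str],
    original_answer: str,
    answers_by_concept: Dict[str, List[str]],
    threshold: float = 0.5,  # assumed cutoff for calling a concept influential
) -> Dict[str, Set[str]]:
    """Compare the concepts an explanation cites with those that changed the answer."""
    influential = {
        concept
        for concept, answers in answers_by_concept.items()
        if empirical_effect(original_answer, answers) >= threshold
    }
    return {
        # Concepts that moved the answer but were not mentioned in the explanation.
        "unadmitted": influential - stated,
        # Concepts the explanation cited that did not move the answer.
        "overstated": stated - influential,
    }


# Hypothetical toy data: model answers after editing each concept's value.
answers_by_concept = {
    "qualifications": ["B", "B", "A", "B"],
    "gender": ["B", "B", "B", "A"],
    "age": ["A", "A", "A", "A"],
}
gap = faithfulness_gap({"qualifications", "age"}, "A", answers_by_concept)
print(gap)
```

In this toy case, the explanation cites `qualifications` and `age`, while the answer changes mostly under edits to `qualifications` and `gender`. The sketch therefore flags `gender` as unadmitted and `age` as overstated, which mirrors the kind of pattern the abstract describes for social biases.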
Cite
Text
Matton et al. "Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations." ICLR 2024 Workshops: R2-FM, 2024.
Markdown
[Matton et al. "Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations." ICLR 2024 Workshops: R2-FM, 2024.](https://mlanthology.org/iclrw/2024/matton2024iclrw-walk/)
BibTeX
@inproceedings{matton2024iclrw-walk,
  title = {{Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations}},
  author = {Matton, Katie and Ness, Robert and Kiciman, Emre},
  booktitle = {ICLR 2024 Workshops: R2-FM},
  year = {2024},
  url = {https://mlanthology.org/iclrw/2024/matton2024iclrw-walk/}
}