LLM-Generated Black-Box Explanations Can Be Adversarially Helpful

Abstract

Large language models (LLMs) are becoming vital tools that help us solve and understand complex problems. LLMs can generate convincing explanations even when given only the inputs and outputs of these problems, i.e., in a "black-box" approach. However, our research uncovers a hidden risk tied to this approach, which we call *adversarial helpfulness*: an LLM's explanation makes a wrong answer look correct, potentially leading people to trust faulty solutions. In this paper, we show that this issue affects not just humans but also LLM evaluators. Digging deeper, we identify and examine the key persuasive strategies these models employ, such as reframing questions, expressing an elevated level of confidence, and cherry-picking evidence that supports incorrect answers. We further create a symbolic graph reasoning task to analyze the mechanisms by which LLMs generate adversarially helpful explanations. Most LLMs cannot find alternative paths in simple graphs, indicating that mechanisms other than logical deduction might facilitate adversarial helpfulness. These findings shed light on the limitations of black-box explanations and lead to recommendations for the safer use of LLMs.
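
To make the graph probe concrete: below is a minimal sketch (not the authors' code) of what "finding an alternative path in a simple graph" can mean programmatically. The toy graph, node names, and the edge-disjointness criterion for "alternative" are illustrative assumptions; the paper's exact task format may differ.

```python
# Illustrative sketch of an alternative-path check on a simple directed graph.
# Assumption: an "alternative path" is a second path sharing no edge with the first.
from collections import deque

def find_path(adj, start, goal, banned_edges=frozenset()):
    """BFS for a path from start to goal, skipping any banned edges."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in adj.get(node, []):
            if nxt not in visited and (node, nxt) not in banned_edges:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # no path exists in the (possibly reduced) graph

def has_alternative_path(adj, start, goal):
    """True if a second, edge-disjoint path exists alongside the first one found."""
    first = find_path(adj, start, goal)
    if first is None:
        return False
    used = frozenset(zip(first, first[1:]))  # edges of the first path
    return find_path(adj, start, goal, banned_edges=used) is not None

# Toy graph with two edge-disjoint routes from A to D.
adj = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
print(find_path(adj, "A", "D"))             # e.g. ['A', 'B', 'D']
print(has_alternative_path(adj, "A", "D"))  # True: A -> C -> D
```

A task in this spirit asks the model to produce the second route given the edge list as symbolic facts; the abstract reports that most LLMs fail even on such simple instances.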

Cite

Text

Ajwani et al. "LLM-Generated Black-Box Explanations Can Be Adversarially Helpful." NeurIPS 2024 Workshops: RegML, 2024.

Markdown

[Ajwani et al. "LLM-Generated Black-Box Explanations Can Be Adversarially Helpful." NeurIPS 2024 Workshops: RegML, 2024.](https://mlanthology.org/neuripsw/2024/ajwani2024neuripsw-llmgenerated/)

BibTeX

@inproceedings{ajwani2024neuripsw-llmgenerated,
  title     = {{LLM-Generated Black-Box Explanations Can Be Adversarially Helpful}},
  author    = {Ajwani, Rohan Deepak and Javaji, Shashidhar Reddy and Rudzicz, Frank and Zhu, Zining},
  booktitle = {NeurIPS 2024 Workshops: RegML},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/ajwani2024neuripsw-llmgenerated/}
}