CALM: Curiosity-Driven Auditing for Large Language Models

Zheng, Xiang; Wang, Longxiang; Liu, Yi; Ma, Xingjun; Shen, Chao; Wang, Cong

doi:10.1609/AAAI.V39I26.34991

AAAI 2025 pp. 27757-27764

Abstract

Auditing Large Language Models (LLMs) is a crucial and challenging task. In this study, we focus on auditing black-box LLMs, where we have no access to the model parameters and can only query the provided service. We treat this type of auditing as a black-box optimization problem whose goal is to automatically uncover input-output pairs of the target LLM that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input to which the target LLM responds with a toxic output, or an input that induces a hallucinated response from the target LLM that mentions politically sensitive individuals. This black-box optimization is challenging because feasible points are scarce, the prompt space is discrete, and the search space is large. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to fine-tune an LLM as an auditor agent that uncovers potentially harmful and biased input-output pairs of the target LLM. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names under the black-box setting. This work offers a promising direction for auditing black-box LLMs.
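To make the idea of "intrinsically motivated" auditing concrete, the following is a minimal sketch of how a sparse audit signal might be combined with a novelty bonus. It is not the paper's method: the abstract does not specify CALM's intrinsic reward, so a simple count-based bonus is swapped in here as an assumption, and every name (`combined_reward`, `toy_target`, `toy_violation`, `bonus_weight`) is hypothetical.

```python
from collections import Counter
from typing import Callable


def combined_reward(
    prompt: str,
    target_llm: Callable[[str], str],
    is_violation: Callable[[str, str], bool],
    visit_counts: Counter,
    bonus_weight: float = 0.5,
) -> float:
    """Sparse audit reward plus a count-based novelty bonus (illustrative only)."""
    # Black-box access: we only see the response, never the parameters.
    response = target_llm(prompt)
    # Extrinsic signal: 1 if the input-output pair is a feasible point
    # (i.e., exhibits the audited behavior), else 0. This is usually sparse.
    extrinsic = 1.0 if is_violation(prompt, response) else 0.0
    # Intrinsic signal (assumed form): responses seen less often earn a
    # larger bonus, which nudges the auditor toward unexplored outputs.
    visit_counts[response] += 1
    intrinsic = visit_counts[response] ** -0.5
    return extrinsic + bonus_weight * intrinsic


# Toy stand-ins so the sketch runs without any real model.
def toy_target(prompt: str) -> str:
    return prompt.upper()


def toy_violation(prompt: str, response: str) -> bool:
    return "BAD" in response


counts = Counter()
first = combined_reward("hello", toy_target, toy_violation, counts)
repeat = combined_reward("hello", toy_target, toy_violation, counts)
hit = combined_reward("bad thing", toy_target, toy_violation, counts)
```

In this toy setup, repeating the same prompt yields a smaller reward than the first visit, while a prompt that triggers the violation check earns the extrinsic reward on top of the bonus. In a real audit, this scalar would be the training signal for the reinforcement-learning update of the auditor LLM, which is beyond the scope of this sketch.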

Cite

Text

Zheng et al. "CALM: Curiosity-Driven Auditing for Large Language Models." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I26.34991

Markdown

[Zheng et al. "CALM: Curiosity-Driven Auditing for Large Language Models." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/zheng2025aaai-calm/) doi:10.1609/AAAI.V39I26.34991

BibTeX

@inproceedings{zheng2025aaai-calm,
  title     = {{CALM: Curiosity-Driven Auditing for Large Language Models}},
  author    = {Zheng, Xiang and Wang, Longxiang and Liu, Yi and Ma, Xingjun and Shen, Chao and Wang, Cong},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {27757-27764},
  doi       = {10.1609/AAAI.V39I26.34991},
  url       = {https://mlanthology.org/aaai/2025/zheng2025aaai-calm/}
}