Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts

Wu, Zhaomin; Du, Mingzhe; Ng, See-Kiong; He, Bingsheng

ICLR 2026

Abstract

Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks, making their trustworthiness critical. A significant and underexplored risk is intentional deception, where an LLM deliberately fabricates or conceals information to serve a hidden objective. Existing studies typically induce deception by explicitly setting a hidden objective through prompting or fine-tuning, which may not reflect real-world human-LLM interactions. Moving beyond such human-induced deception, we investigate LLMs' self-initiated deception on benign prompts. To address the absence of ground truth, we propose a framework based on Contact Searching Questions (CSQ). This framework introduces two statistical metrics derived from psychological principles to quantify the likelihood of deception. The first, the *Deceptive Intention Score*, measures the model's bias toward a hidden objective. The second, the *Deceptive Behavior Score*, measures the inconsistency between the LLM's internal belief and its expressed output. Evaluating 16 leading LLMs, we find that both metrics rise in parallel and escalate with task difficulty for most models. Moreover, increasing model capacity does not always reduce deception, posing a significant challenge for future LLM development.

Cite

Text

Wu et al. "Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts." International Conference on Learning Representations, 2026.

Markdown

[Wu et al. "Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wu2026iclr-beyond-a/)

BibTeX

@inproceedings{wu2026iclr-beyond-a,
  title     = {{Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts}},
  author    = {Wu, Zhaomin and Du, Mingzhe and Ng, See-Kiong and He, Bingsheng},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wu2026iclr-beyond-a/}
}