Compromising Honesty and Harmlessness in Language Models via Covert Deception Attacks
Abstract
Recent research on large language models (LLMs) has demonstrated their ability to understand and employ deceptive behavior, even without explicit prompting. Additionally, research on AI alignment has made significant advancements in training models to refuse to generate misleading or toxic content. As a result, LLMs have generally become honest and harmless. In this study, we introduce “deception attacks” that undermine both of these traits while keeping models seemingly trustworthy, revealing a vulnerability that, if exploited, could have serious real-world consequences. We introduce fine-tuning methods that cause models to selectively deceive users on targeted topics while remaining accurate on others, in order to maintain high user trust. Through a series of experiments, we show that such targeted deception is effective even in high-stakes domains and on ideologically charged subjects. In addition, we find that deceptive fine-tuning often compromises other safety properties: deceptive models are more likely to produce toxic content, including hate speech and stereotypes. Finally, since self-consistent deception across turns gives users few cues to detect manipulation and thus can preserve trust, we test for multi-turn deception and observe mixed results. Given that millions of users interact with LLM-based chatbots, voice assistants, agents, and other interfaces where trustworthiness cannot be ensured, securing these models against covert deception attacks is critical.
Cite
Text
Vaugrante et al. "Compromising Honesty and Harmlessness in Language Models via Covert Deception Attacks." Transactions on Machine Learning Research, 2026.
Markdown
[Vaugrante et al. "Compromising Honesty and Harmlessness in Language Models via Covert Deception Attacks." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/vaugrante2026tmlr-compromising/)
BibTeX
@article{vaugrante2026tmlr-compromising,
title = {{Compromising Honesty and Harmlessness in Language Models via Covert Deception Attacks}},
author = {Vaugrante, Laurène and Carlon, Francesca and Menke, Maluna and Hagendorff, Thilo},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/vaugrante2026tmlr-compromising/}
}