An Information-Theoretic Study of Lying in LLMs
Abstract
This study investigates differences in information processing between lying and truth-telling in Large Language Models (LLMs). Taking inspiration from human cognition research showing that lying demands more cognitive resources than truth-telling, we apply information-theoretic measures to unembedded internal model activations to explore analogous phenomena in LLMs. Our analysis reveals that LLMs converge more quickly to the output distribution when telling the truth and exhibit higher entropy when constructing lies. These findings indicate that lying in LLMs may produce characteristic information-processing patterns, which could help us understand and detect deceptive behavior in LLMs.
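The abstract describes applying information-theoretic measures to unembedded intermediate activations and comparing them to the final output distribution. The following is a minimal sketch of one way such measurements can be made (logit-lens style), not the authors' implementation: the model choice (GPT-2), the prompt, and the use of the final layer norm plus the LM head as the unembedding are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): unembed each layer's
# hidden state at the last token, then compute the entropy of the resulting
# distribution and its KL divergence to the model's final output distribution.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumed stand-in model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The capital of France is"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Final-layer output distribution at the last token position.
final_logits = out.logits[0, -1]
final_log_probs = F.log_softmax(final_logits, dim=-1)

for layer_idx, hidden in enumerate(out.hidden_states):
    # Unembed the intermediate activation: final layer norm + LM head.
    h = model.transformer.ln_f(hidden[0, -1])
    logits = model.lm_head(h)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    # Entropy (in nats) of the layer's unembedded distribution.
    entropy = -(probs * log_probs).sum().item()

    # KL(layer || final): how far this layer still is from the model's
    # eventual output distribution, i.e. how much it has yet to converge.
    kl_to_final = F.kl_div(final_log_probs, probs, reduction="sum").item()

    print(f"layer {layer_idx:2d}  entropy={entropy:.3f}  KL_to_final={kl_to_final:.3f}")
```

Under this reading, faster convergence to the output distribution would appear as the KL term dropping earlier across layers, and "higher entropy when constructing lies" as larger per-layer entropy values for deceptive completions.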
Cite
Text
Dombrowski and Corlouer. "An Information-Theoretic Study of Lying in LLMs." ICML 2024 Workshops: LLMs_and_Cognition, 2024.

Markdown

[Dombrowski and Corlouer. "An Information-Theoretic Study of Lying in LLMs." ICML 2024 Workshops: LLMs_and_Cognition, 2024.](https://mlanthology.org/icmlw/2024/dombrowski2024icmlw-informationtheoretic/)

BibTeX
@inproceedings{dombrowski2024icmlw-informationtheoretic,
  title = {{An Information-Theoretic Study of Lying in LLMs}},
  author = {Dombrowski, Ann-Kathrin and Corlouer, Guillaume},
  booktitle = {ICML 2024 Workshops: LLMs_and_Cognition},
  year = {2024},
  url = {https://mlanthology.org/icmlw/2024/dombrowski2024icmlw-informationtheoretic/}
}