Obfuscated Activations Bypass LLM Latent-Space Defenses
Abstract
_Latent-space_ monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners to detect harmful activations before they lead to undesirable actions. This prompts the question: can models execute harmful behavior _via inconspicuous latent states_? Here, we study such _obfuscated activations_. Our results are nuanced. We show that state-of-the-art latent-space defenses, such as activation probes and latent out-of-distribution (OOD) detection, are vulnerable to obfuscated activations. For example, against probes trained to classify harmfulness, our obfuscation attacks can reduce monitor recall from 100% down to 0% while still achieving a 90% jailbreaking success rate. However, we also find that certain probe architectures are more robust than others, and we discover the existence of an _obfuscation tax_: on a complex task (writing SQL code), evading monitors reduces model performance. Together, our results demonstrate that white-box monitors are not robust to adversarial attacks, while also providing concrete suggestions to alleviate, but not completely fix, this weakness.
Cite
Text
Bailey et al. "Obfuscated Activations Bypass LLM Latent-Space Defenses." International Conference on Learning Representations, 2026.
Markdown
[Bailey et al. "Obfuscated Activations Bypass LLM Latent-Space Defenses." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/bailey2026iclr-obfuscated/)
BibTeX
@inproceedings{bailey2026iclr-obfuscated,
title = {{Obfuscated Activations Bypass LLM Latent-Space Defenses}},
author = {Bailey, Luke and Serrano, Alex and Sheshadri, Abhay and Seleznyov, Mikhail and Taylor, Jordan and Jenner, Erik and Hilton, Jacob and Casper, Stephen and Guestrin, Carlos and Emmons, Scott},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/bailey2026iclr-obfuscated/}
}