Optimizing Adaptive Attacks Against Content Watermarks for Language Models
Abstract
Large Language Models (LLMs) can be misused to spread online spam and misinformation. Content watermarking deters misuse by hiding a message in generated outputs, enabling detection using a secret \emph{watermarking key}. Robustness is a core security property, stating that evading detection requires (significant) degradation of the content's quality. Many LLM watermarking methods have been proposed, but their robustness is tested only against \emph{non-adaptive} attackers, who lack knowledge of the provider's watermarking method and can find only suboptimal attacks. We formulate the robustness of LLM watermarking as an objective function and use preference-based optimization to tune \emph{adaptive} attacks against the specific watermarking method. Our evaluation shows that: (i) adaptive attacks evade detection against all surveyed watermarking methods; (ii) even in a non-adaptive setting, attacks optimized against known watermarks remain effective when tested on unseen watermarks; and (iii) optimization-based attacks are scalable, requiring limited computational resources of less than seven GPU hours. Our findings underscore the need to test robustness against adaptive attacks.
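To make the abstract's approach concrete, the sketch below shows one way preference data for an adaptive attack could be assembled: candidate paraphrases of a watermarked text are ranked by an objective that trades off detector evasion against text quality, and the best and worst candidates form (chosen, rejected) pairs for a preference-based optimizer such as DPO. This is a minimal illustration, not the authors' implementation; the helpers `paraphrase`, `watermark_score`, and `quality_score`, the objective weights, and the pair format are all hypothetical placeholders.

```python
# Minimal sketch (assumptions, not the paper's code): building preference pairs
# for tuning a paraphrasing attack against a specific watermark detector.
import random

def paraphrase(text, temperature):
    # Hypothetical attack model; a real attack would query an LLM paraphraser.
    return text

def watermark_score(text):
    # Hypothetical detector score (lower = harder to detect); placeholder only.
    return random.random()

def quality_score(text):
    # Hypothetical quality judge (higher = better); placeholder only.
    return random.random()

def attack_objective(text):
    # Robustness objective from the abstract: evade detection while preserving
    # quality. The weight on the detection term is illustrative.
    return quality_score(text) - 1.0 * watermark_score(text)

def collect_preference_pairs(watermarked_texts, num_candidates=4):
    """Rank candidate paraphrases by the objective and keep (chosen, rejected)
    pairs, the data format consumed by preference-based optimizers like DPO."""
    pairs = []
    for text in watermarked_texts:
        candidates = [paraphrase(text, temperature=1.0) for _ in range(num_candidates)]
        ranked = sorted(candidates, key=attack_objective, reverse=True)
        pairs.append({"prompt": text, "chosen": ranked[0], "rejected": ranked[-1]})
    return pairs

# The resulting pairs would then be passed to a preference-optimization trainer
# to adapt the paraphraser to the watermarking method under attack.
pairs = collect_preference_pairs(["example watermarked output"])
print(pairs[0].keys())
```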
Cite
Text
Diaa et al. "Optimizing Adaptive Attacks Against Content Watermarks for Language Models." ICLR 2025 Workshops: WMARK, 2025.Markdown
[Diaa et al. "Optimizing Adaptive Attacks Against Content Watermarks for Language Models." ICLR 2025 Workshops: WMARK, 2025.](https://mlanthology.org/iclrw/2025/diaa2025iclrw-optimizing/)BibTeX
@inproceedings{diaa2025iclrw-optimizing,
  title     = {{Optimizing Adaptive Attacks Against Content Watermarks for Language Models}},
  author    = {Diaa, Abdulrahman and Aremu, Toluwani and Lukas, Nils},
  booktitle = {ICLR 2025 Workshops: WMARK},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/diaa2025iclrw-optimizing/}
}