Watermark Smoothing Attacks Against Language Models
Abstract
Watermarking is a key technique for detecting AI-generated text. In this work, we study its vulnerabilities and introduce the Smoothing Attack, a novel watermark removal method. By leveraging the relationship between the model's confidence and watermark detectability, our attack selectively smooths the watermarked content, erasing watermark traces while preserving text quality. We validate our attack on open-source models ranging from 1.3B to 30B parameters against 10 different watermarking schemes, demonstrating its effectiveness. Our findings expose critical weaknesses in existing watermarking schemes and highlight the need for stronger defenses.
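The abstract's core observation is that a watermark can only bias token choice where the model is unconfident, so that is where it leaves the strongest detectable trace. A minimal sketch of one way such confidence-gated smoothing could look, assuming token-level logit access to both the watermarked model and an unwatermarked reference model; the entropy threshold, mixing weight `alpha`, and function name are illustrative assumptions, not the paper's actual implementation:

```python
import torch

def smoothed_next_token(wm_logits: torch.Tensor,
                        ref_logits: torch.Tensor,
                        entropy_threshold: float = 2.0,
                        alpha: float = 0.3) -> int:
    """Sample the next token from a confidence-gated mixture.

    wm_logits:  next-token logits from the watermarked model.
    ref_logits: next-token logits from an unwatermarked reference model.
    """
    wm_probs = torch.softmax(wm_logits, dim=-1)
    ref_probs = torch.softmax(ref_logits, dim=-1)

    # Shannon entropy of the watermarked distribution: high entropy means
    # low confidence, which is where a watermark can bias token choice the
    # most, and hence where it leaves the strongest trace.
    entropy = -(wm_probs * torch.log(wm_probs.clamp_min(1e-12))).sum()

    if entropy > entropy_threshold:
        # Low-confidence position: smooth the watermarked distribution
        # toward the reference to wash out the watermark bias.
        probs = alpha * wm_probs + (1.0 - alpha) * ref_probs
    else:
        # High-confidence position: the watermark has little room to act,
        # so keep the original distribution and preserve text quality.
        probs = wm_probs

    return torch.multinomial(probs, num_samples=1).item()

# Toy demo with random logits over a 50k-token vocabulary.
vocab_size = 50_000
token_id = smoothed_next_token(torch.randn(vocab_size), torch.randn(vocab_size))
print(token_id)
```

Selectively smoothing only low-confidence positions, rather than all of them, is what lets an attack of this shape remove the watermark signal while leaving the model's confident (quality-critical) predictions untouched.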
Cite
Text
Chang et al. "Watermark Smoothing Attacks Against Language Models." ICLR 2025 Workshops: WMARK, 2025.
Markdown
[Chang et al. "Watermark Smoothing Attacks Against Language Models." ICLR 2025 Workshops: WMARK, 2025.](https://mlanthology.org/iclrw/2025/chang2025iclrw-watermark/)
BibTeX
@inproceedings{chang2025iclrw-watermark,
title = {{Watermark Smoothing Attacks Against Language Models}},
author = {Chang, Hongyan and Hassani, Hamed and Shokri, Reza},
booktitle = {ICLR 2025 Workshops: WMARK},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/chang2025iclrw-watermark/}
}