A Watermark for Black-Box Language Models

Abstract

Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require *white-box* access to the model's next-token probability distribution, which is typically not available to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e., *black-box* access), enjoys a *distortion-free* property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how the scheme can be leveraged when white-box access is available, and show, via comprehensive experiments, when it can outperform existing white-box schemes.
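To make the black-box setting concrete, below is a minimal illustrative sketch in Python of one way a sampling-only, keyed watermark can be built: draw several candidate sequences from the model and keep the one with the largest keyed pseudorandom score. This is a generic best-of-k construction for illustration only, not necessarily the paper's exact scheme; the names `sample_fn`, `keyed_score`, `watermark_generate`, and `detect` are hypothetical.

```python
import hashlib
import hmac
import random

def keyed_score(text: str, key: bytes) -> float:
    """Map a sequence to a pseudorandom score in [0, 1) via a keyed hash.
    Over a random key, the score of any fixed text is ~Uniform[0, 1)."""
    digest = hmac.new(key, text.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def watermark_generate(sample_fn, prompt: str, key: bytes, k: int = 8) -> str:
    """Black-box watermarking: draw k candidate sequences from the LLM
    (sampling-only access) and keep the one with the largest keyed score."""
    candidates = [sample_fn(prompt) for _ in range(k)]
    return max(candidates, key=lambda text: keyed_score(text, key))

def detect(text: str, key: bytes) -> float:
    """One-sided p-value. Without the key, scores are ~Uniform[0, 1], so
    P(score >= s) = 1 - s. Watermarked text is the argmax over k candidates,
    so its expected p-value is roughly 1 / (k + 1)."""
    return 1.0 - keyed_score(text, key)

# Toy usage with a stand-in sampler; replace with a real LLM API call.
rng = random.Random(0)
sample_fn = lambda p: p + " " + " ".join(rng.choice("abcdefgh") for _ in range(20))
key = b"secret-key"
out = watermark_generate(sample_fn, "Once upon a time", key, k=16)
print(detect(out, key))               # concentrated near 1/(k+1) ~= 0.06
print(detect("unrelated text", key))  # ~Uniform[0, 1]
```

In this sketch, averaged over the random draw of the key, the keyed scores are independent of the text content, so the selection step does not bias which sequences are emitted; that averaging argument is the usual intuition behind a distortion-free guarantee. Chaining with a second secret key would amount to repeating the selection step under a new score function.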

Cite

Text

Bahri and Wieting. "A Watermark for Black-Box Language Models." ICLR 2025 Workshops: WMARK, 2025.

Markdown

[Bahri and Wieting. "A Watermark for Black-Box Language Models." ICLR 2025 Workshops: WMARK, 2025.](https://mlanthology.org/iclrw/2025/bahri2025iclrw-watermark/)

BibTeX

@inproceedings{bahri2025iclrw-watermark,
  title     = {{A Watermark for Black-Box Language Models}},
  author    = {Bahri, Dara and Wieting, John Frederick},
  booktitle = {ICLR 2025 Workshops: WMARK},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/bahri2025iclrw-watermark/}
}