Detecting Benchmark Contamination Through Watermarking

Abstract

Benchmark contamination poses a significant challenge to the reliability of Large Language Model (LLM) evaluations, as it is difficult to determine whether a model has been trained on a test set. We introduce a solution to this problem by watermarking benchmarks before their release. The embedding involves reformulating the original questions with a watermarked LLM, in a way that preserves the benchmark's quality and utility. During evaluation, we can detect "radioactivity", i.e., traces that the text watermarks leave in the model during training, using a theoretically grounded statistical test. We test our method by pre-training 1B-parameter models from scratch on 10B tokens with controlled benchmark contamination, and validate its effectiveness in detecting contamination on ARC-Easy, ARC-Challenge, and MMLU. Results show similar benchmark utility after rephrasing and successful contamination detection whenever models are contaminated enough to enhance performance, e.g., a p-value of $10^{-3}$ for a +5% gain on ARC-Easy.
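To give intuition for the detection step, the sketch below shows a generic radioactivity-style test built on a red/green-list text watermark (in the spirit of Kirchenbauer et al.): tokens generated by the suspect model on watermarked benchmark prompts are scored as "green" or not, and an excess of green tokens yields a small p-value under a binomial null. This is a minimal illustration, not the paper's exact statistical test; the hashing scheme, `GAMMA`, and function names are assumptions.

```python
import hashlib
from scipy.stats import binom

GAMMA = 0.25  # assumed fraction of the vocabulary placed in the "green" list

def is_green(prev_token_id: int, token_id: int, vocab_size: int, key: int = 42) -> bool:
    """Pseudo-randomly assign `token_id` to the green list, seeded by the previous token.

    Hypothetical hashing scheme for illustration: the secret key and the previous
    token seed a permutation-like split of the vocabulary into a GAMMA fraction.
    """
    h = hashlib.sha256(f"{key}-{prev_token_id}".encode()).digest()
    seed = int.from_bytes(h[:8], "big")
    return (seed + token_id) % vocab_size < GAMMA * vocab_size

def radioactivity_pvalue(token_ids: list[int], vocab_size: int) -> float:
    """One-sided binomial test.

    Under H0 (the model never saw the watermarked benchmark), each scored token
    is green with probability GAMMA, so the green count follows Binomial(n, GAMMA).
    A contaminated model over-produces green tokens, giving a small p-value.
    """
    greens = sum(
        is_green(prev, tok, vocab_size)
        for prev, tok in zip(token_ids, token_ids[1:])
    )
    n = len(token_ids) - 1
    # P[Binomial(n, GAMMA) >= greens]
    return binom.sf(greens - 1, n, GAMMA)
```

In practice one would score many completions from the suspect model on the rephrased (watermarked) benchmark questions and aggregate the counts before computing the p-value; the threshold (e.g., $10^{-3}$) controls the false-positive rate of declaring contamination.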

Cite

Text

Sander et al. "Detecting Benchmark Contamination Through Watermarking." ICLR 2025 Workshops: WMARK, 2025.

Markdown

[Sander et al. "Detecting Benchmark Contamination Through Watermarking." ICLR 2025 Workshops: WMARK, 2025.](https://mlanthology.org/iclrw/2025/sander2025iclrw-detecting/)

BibTeX

@inproceedings{sander2025iclrw-detecting,
  title     = {{Detecting Benchmark Contamination Through Watermarking}},
  author    = {Sander, Tom and Fernandez, Pierre and Mahloujifar, Saeed and Durmus, Alain Oliviero and Guo, Chuan},
  booktitle = {ICLR 2025 Workshops: WMARK},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/sander2025iclrw-detecting/}
}