Detecting Benchmark Contamination Through Watermarking
Abstract
Benchmark contamination poses a significant challenge to the reliability of Large Language Model (LLM) evaluations, as it is difficult to ascertain whether a model has been trained on a test set. We introduce a solution to this problem by watermarking benchmarks before their release. The embedding involves reformulating the original questions with a watermarked LLM, in a way that does not alter the benchmark's quality and utility. During evaluation, we can detect "radioactivity", i.e. traces that the text watermarks leave in the model during training, using a theoretically grounded statistical test. We test our method by pre-training 1B-parameter models from scratch on 10B tokens with controlled benchmark contamination, and validate its effectiveness in detecting contamination on ARC-Easy, ARC-Challenge, and MMLU. Results show similar benchmark utility post-rephrasing and successful contamination detection when models are contaminated enough to enhance performance, e.g., p-value $= 10^{-3}$ for a +5% gain on ARC-Easy.
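To make the detection idea concrete, the following is a minimal sketch (not the authors' code) of how a "radioactivity" test can work for a green/red-list text watermark: score the suspect model's tokens on the watermarked benchmark text, count how many fall in the pseudo-random green list, and compute a p-value under the null hypothesis that green tokens appear at the base rate. The helper names (`green_list_for`, `radioactivity_pvalue`) and the green-list fraction are illustrative assumptions, not the paper's exact procedure.

```python
import hashlib
import random

from scipy.stats import binomtest

GAMMA = 0.25  # assumed fraction of the vocabulary marked "green" per context


def green_list_for(prev_token_id: int, vocab_size: int) -> set[int]:
    """Pseudo-randomly select a green list, seeded by the previous token."""
    seed = int(hashlib.sha256(str(prev_token_id).encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return set(rng.sample(range(vocab_size), int(GAMMA * vocab_size)))


def radioactivity_pvalue(token_ids: list[int], vocab_size: int) -> float:
    """Score a token sequence emitted by the suspect model.

    Under the null hypothesis (no contamination), each scored token lands in
    its context's green list with probability GAMMA; a significant excess of
    green tokens is evidence that the model was trained on the watermarked
    benchmark.
    """
    hits = sum(
        1
        for prev, cur in zip(token_ids[:-1], token_ids[1:])
        if cur in green_list_for(prev, vocab_size)
    )
    n_scored = len(token_ids) - 1
    return binomtest(hits, n_scored, GAMMA, alternative="greater").pvalue
```

A small p-value (e.g., below $10^{-3}$) would then be reported as evidence of contamination; in practice the scoring is done over many benchmark questions and the test must account for repeated contexts.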
Cite
Text
Sander et al. "Detecting Benchmark Contamination Through Watermarking." ICLR 2025 Workshops: WMARK, 2025.
Markdown
[Sander et al. "Detecting Benchmark Contamination Through Watermarking." ICLR 2025 Workshops: WMARK, 2025.](https://mlanthology.org/iclrw/2025/sander2025iclrw-detecting/)
BibTeX
@inproceedings{sander2025iclrw-detecting,
title = {{Detecting Benchmark Contamination Through Watermarking}},
author = {Sander, Tom and Fernandez, Pierre and Mahloujifar, Saeed and Durmus, Alain Oliviero and Guo, Chuan},
booktitle = {ICLR 2025 Workshops: WMARK},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/sander2025iclrw-detecting/}
}