Language Models May Verbatim Complete Text They Were Not Explicitly Trained on
Abstract
An important question today is whether a given text was used to train a large language model (LLM). A completion test is often employed: check whether the LLM completes a sufficiently complex text. This, however, requires a ground-truth definition of membership; most commonly, a text is defined as a member based on its n-gram overlap with any text in the training dataset. In this work, we demonstrate that this n-gram based membership definition can be effectively gamed. We study scenarios where sequences are non-members for a given n, and we find that completion tests still succeed. We find many natural occurrences of this phenomenon by retraining LLMs from scratch after removing all training samples that were completed; these cases include exact duplicates, near-duplicates, and even short overlaps, and they showcase that it is difficult to find a single viable choice of n for membership definitions. Using these insights, we design adversarial datasets that can cause a given target sequence to be completed without containing it, for any reasonable choice of n. Our findings highlight the inadequacy of n-gram membership, suggesting that membership definitions fail to account for auxiliary information available to the training algorithm.
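To make the n-gram overlap criterion discussed above concrete, here is a minimal, illustrative sketch, not taken from the paper: the helper names `ngrams` and `is_ngram_member`, the whitespace tokenization, and the toy data are all hypothetical, and the paper may use a different threshold or formulation. It shows one simple way such a membership test could be implemented and how its verdict can flip with the choice of n.

```python
def ngrams(tokens, n):
    """Return the set of all contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_ngram_member(target_tokens, dataset, n):
    """One simple n-gram membership rule: the target counts as a 'member'
    if it shares at least one n-gram with any training document.
    Larger n demands a longer exact overlap, so fewer texts qualify."""
    target_grams = ngrams(target_tokens, n)
    return any(target_grams & ngrams(doc, n) for doc in dataset)


# Toy example with whitespace tokenization (real pipelines use subword tokens).
dataset = ["the quick brown fox jumps over the lazy dog".split()]
target = "a quick brown fox jumps over a sleeping cat".split()

print(is_ngram_member(target, dataset, n=5))  # True: shares "quick brown fox jumps over"
print(is_ngram_member(target, dataset, n=6))  # False: no 6-token span is shared
```

The flip between n = 5 and n = 6 in this toy example mirrors the abstract's point that it is difficult to find a single viable choice of n for membership definitions.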
Cite
Text
Liu et al. "Language Models May Verbatim Complete Text They Were Not Explicitly Trained on." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Liu et al. "Language Models May Verbatim Complete Text They Were Not Explicitly Trained on." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/liu2025icml-language/)
BibTeX
@inproceedings{liu2025icml-language,
  title = {{Language Models May Verbatim Complete Text They Were Not Explicitly Trained on}},
  author = {Liu, Ken and Choquette-Choo, Christopher A. and Jagielski, Matthew and Kairouz, Peter and Koyejo, Sanmi and Liang, Percy and Papernot, Nicolas},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year = {2025},
  pages = {38210--38250},
  volume = {267},
  url = {https://mlanthology.org/icml/2025/liu2025icml-language/}
}