DIS-CO: Discovering Copyrighted Content in VLMs Training Data

Abstract

How can we verify whether copyrighted content was used to train a large vision-language model (VLM) without direct access to its training data? Motivated by the hypothesis that a VLM can recognize images from its training corpus, we propose DIS-CO, a novel approach to infer the inclusion of copyrighted content during the model’s development. By repeatedly querying a VLM with specific frames from targeted copyrighted material, DIS-CO extracts the content’s identity through free-form text completions. To assess its effectiveness, we introduce MovieTection, a benchmark comprising 14,000 frames paired with detailed captions, drawn from films released both before and after a model’s training cutoff. Our results show that DIS-CO significantly improves detection performance, nearly doubling the average AUC of the best prior method on models with logits available. Our findings also highlight a broader concern: all tested models appear to have been exposed, at least to some extent, to copyrighted content. We provide the code in the supplementary materials.

Cite

Text

Duarte et al. "DIS-CO: Discovering Copyrighted Content in VLMs Training Data." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Duarte et al. "DIS-CO: Discovering Copyrighted Content in VLMs Training Data." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/duarte2025icml-disco/)

BibTeX

@inproceedings{duarte2025icml-disco,
  title     = {{DIS-CO: Discovering Copyrighted Content in VLMs Training Data}},
  author    = {Duarte, André V. and Zhao, Xuandong and Oliveira, Arlindo L. and Li, Lei},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {14807--14832},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/duarte2025icml-disco/}
}