DIS-CO: Discovering Copyrighted Content in VLMs Training Data

Abstract

How can we verify whether copyrighted content was used to train a large vision-language model (VLM) without direct access to its training data? Motivated by the hypothesis that a VLM can recognize images from its training corpus, we propose DIS-CO, a novel approach to infer the inclusion of copyrighted content during the model’s development. By repeatedly querying a VLM with specific frames from targeted copyrighted material, DIS-CO extracts the content’s identity through free-form text completions. To assess its effectiveness, we introduce MovieTection, a benchmark comprising 14,000 frames paired with detailed captions, drawn from films released both before and after a model’s training cutoff. Our results show that DIS-CO significantly improves detection performance, nearly doubling the average AUC of the best prior method on models with logits available. Our findings also highlight a broader concern: all tested models appear to have been exposed, at least to some extent, to copyrighted content. We provide the code in the supplementary materials.

Cite

Text

Duarte et al. "DIS-CO: Discovering Copyrighted Content in VLMs Training Data." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Duarte et al. "DIS-CO: Discovering Copyrighted Content in VLMs Training Data." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/duarte2025icml-disco/)

BibTeX

@inproceedings{duarte2025icml-disco,
  title     = {{DIS-CO: Discovering Copyrighted Content in VLMs Training Data}},
  author    = {Duarte, André V. and Zhao, Xuandong and Oliveira, Arlindo L. and Li, Lei},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {14807--14832},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/duarte2025icml-disco/}
}