Consent in Crisis: The Rapid Decline of the AI Data Commons
Abstract
General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how codified data use preferences are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crises in data consent, for both developers and creators. The foreclosure of much of the open web will impact not only commercial AI, but also non-commercial AI and academic research.
Cite
Text
Longpre et al. "Consent in Crisis: The Rapid Decline of the AI Data Commons." Neural Information Processing Systems, 2024. doi:10.52202/079017-3431
Markdown
[Longpre et al. "Consent in Crisis: The Rapid Decline of the AI Data Commons." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/longpre2024neurips-consent/) doi:10.52202/079017-3431
BibTeX
@inproceedings{longpre2024neurips-consent,
title = {{Consent in Crisis: The Rapid Decline of the AI Data Commons}},
author = {Longpre, Shayne and Mahari, Robert and Lee, Ariel and Lund, Campbell and Oderinwale, Hamidah and Brannon, William and Saxena, Nayan and Obeng-Marnu, Naana and South, Tobin and Hunter, Cole and Klyman, Kevin and Klamm, Christopher and Schoelkopf, Hailey and Singh, Nikhil and Cherep, Manuel and Anis, Ahmad Mustafa and Dinh, An and Chitongo, Caroline and Yin, Da and Sileo, Damien and Mataciunas, Deividas and Misra, Diganta and Alghamdi, Emad and Shippole, Enrico and Zhang, Jianguo and Materzynska, Joanna and Qian, Kun and Tiwary, Kush and Miranda, Lester and Dey, Manan and Liang, Minnie and Hamdy, Mohammed and Muennighoff, Niklas and Ye, Seonghyeon and Kim, Seungone and Mohanty, Shrestha and Gupta, Vipul and Sharma, Vivek and Chien, Vu Minh and Zhou, Xuhui and Li, Yizhi and Xiong, Caiming and Villa, Luis and Biderman, Stella and Li, Hanlin and Ippolito, Daphne and Hooker, Sara and Kabbara, Jad and Pentland, Sandy},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-3431},
url = {https://mlanthology.org/neurips/2024/longpre2024neurips-consent/}
}