Paper | Longpre, Mahari, Lee, Lund et al: Consent in Crisis: The Rapid Decline of the AI Data Commons
slides-aicontrolws-consent-in-crisis-the-rapid-decline-of-the-ai-data-commons-00

Slides IAB Workshop on AI-CONTROL (aicontrolws) Team
Title Paper | Longpre, Mahari, Lee, Lund et al: Consent in Crisis: The Rapid Decline of the AI Data Commons
Abstract
General-purpose artificial intelligence (AI) systems are built on massive swathes of
public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To
our knowledge, we conduct the first, large-scale, longitudinal audit of the consent
protocols for the web domains underlying AI training corpora. Our audit of 14,000
web domains provides an expansive view of crawlable web data and how codified
data use preferences are changing over time. We observe a proliferation of AI-
specific clauses to limit use, acute differences in restrictions on AI developers, as
well as general inconsistencies between websites’ expressed intentions in their
Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective
web protocols, not designed to cope with the widespread re-purposing of the internet
for AI. Our longitudinal analyses show that in a single year (2023-2024) there has
been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of
all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4,
fully restricted from use. For Terms of Service crawling restrictions, a full 45% of
C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing
the diversity, freshness, and scaling laws for general-purpose AI systems. We hope
to illustrate the emerging crises in data consent, for both developers and creators.
The foreclosure of much of the open web will impact not only commercial AI, but
also non-commercial AI and academic research.
State Active
Other versions pdf
Last updated 2024-09-09
