
IAB Workshop on AI-CONTROL (aicontrolws)

Slides

Title | Abstract | Curr. rev. | Date
Paper | Jiménez, Arkko: AI, Robots.txt
Large Language Models (LLMs) and their use of Internet-sourced material present numerous technical, commercial, legal, societal, and ethical challenges. An emerging practice proposes extending the robots.txt file to enable website owners to declare whether they wish to "opt out" of having their site's content used in training AI models.

This paper explores the topic. We argue that the problem is much broader than a simple opt-out mechanism, given new applications on the horizon, the many different ways of accessing training material, the variety of AI techniques, and the need both to facilitate access to training material and to enable opting out of it.
00 2024-09-09
Paper | Gahnberg: AI-Control: Opt-Out Mechanisms From the View of a Governance Cycle
This paper seeks to inform discussions on an AI-Control mechanism by outlining considerations from a view of governance. Understood as a mechanism for content creators to opt out of having their content used as training data for the creation of large language models (LLMs), an AI-Control mechanism would offer a standard for signaling a content creator’s preferences to a web crawler.
00 2024-09-09
Paper | Prorock: Addressing the Limitations of Robots.txt in Controlling AI Crawlers
The emergence of Generative AI and the surrounding ecosystem has introduced new challenges for the internet, highlighting the limitations of the Robots Exclusion Protocol (RFC 9309). The current mechanisms for controlling automated access are inadequate for both AI system operators and content creators. This paper explores the deficiencies of the robots.txt approach and proposes considerations for a more robust solution.
00 2024-09-09
Paper | Doty, Null, Knodel: CDT Position paper
Publishers, authors and social media users want some way to prevent their content from being used to train large language models and other generative AI, because it hurts the market for their own work or because it could be invasive or otherwise harmful to them. Companies operating large AI models want to be able to access as much data as possible for training purposes, but may be willing to exclude content that the author doesn’t want involved, especially if it avoids either copyright suits or privacy complaints. Researchers want the ability to crawl the web for scientific and public interest purposes without getting caught up in ongoing copyright or AI training disputes. To achieve those ends for the stakeholders concerned, a standards-based solution for controlling use of online content for AI training seems possible and promising.
00 2024-09-09
Paper | Longpre, Mahari, Lee, Lund et al: Consent in Crisis: The Rapid Decline of the AI Data Commons
General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how codified data use preferences are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crises in data consent, for both developers and creators. The foreclosure of much of the open web will impact not only commercial AI, but also non-commercial AI and academic research.
00 2024-09-09
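As background for the kind of per-domain check such an audit automates, the sketch below uses Python's standard urllib.robotparser to test whether a site's robots.txt (RFC 9309) permits a given crawler token. This is a minimal sketch, not the paper's actual tooling, and the "GPTBot" token is an illustrative choice of AI crawler user agent.

# Minimal sketch of one step of a robots.txt consent audit;
# not the paper's actual tooling.
from urllib import robotparser

def ai_crawl_allowed(domain: str, agent: str = "GPTBot") -> bool:
    # Fetch and parse the domain's live robots.txt per RFC 9309.
    parser = robotparser.RobotFileParser()
    parser.set_url(f"https://{domain}/robots.txt")
    parser.read()
    # True if the agent may fetch the site root, False if disallowed.
    return parser.can_fetch(agent, f"https://{domain}/")

if __name__ == "__main__":
    # The paper's audit iterates over ~14,000 domains; one suffices here.
    print(ai_crawl_allowed("example.com"))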
Paper | Keller: Considerations for Opt-Out Compliance Policies by AI Model Developers
Article 53(1c) of the AI Act requires “providers of general-purpose AI models” to “put in place a policy to comply with Union copyright law, and in particular to identify and comply with, including through state of the art technologies, a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790.” This paper explores what such compliance policies could look like in practice and what technical standards and services are available to implement rightholder opt-outs in a way that is effective, scalable, and addresses the needs of diverse groups of rightholders and AI model developers.
00 2024-09-09
Paper | Cloudflare: Control starts with Transparency: Cloudflare's position on AI Crawlers and Bots
As a provider of Internet services, Cloudflare has a strong interest in assuring both that publishers’ rights are respected and that developers have access to innovative new features, including LLMs. Balancing the tensions that emerge requires either trust or, in its absence, transparency.
00 2024-09-09
Paper | Hardie: Crawlers, adversaries, and exclusions: thoughts on content ingest and creator's rights
This paper looks briefly at the history of information retrieval in the early web and explores the relationships between content creators and search engines that developed. It then examines the differences between those and the relationships that have emerged from the ingestion of content to train generative AI models. It concludes with some thoughts on the implications of those differences for evolving mechanisms like robots.txt to cover the new use case.
00 2024-09-09
Paper | Creative Commons: Creative Commons Position Paper on Preference Signals
When Creative Commons (CC) was founded over 20 years ago, sharing on the internet was broken. With the introduction of the CC licenses, the commons flourished. Licenses that enabled open sharing were perfectly aligned with the ideals of giving creators a choice over how their works were used. Those who embrace openly sharing their work have a myriad of motivations for doing so. Most could not have anticipated how their works might one day be used by machines.
00 2024-09-09
Paper | Elsevier: Elsevier Position paper
As one of the largest publishers of scientific, technical, and medical literature, Elsevier recognizes the positive potential of generative AI to enhance search and discovery platforms. We also recognize the role other forms of AI have played in improving the quality and relevancy of results thus far.
00 2024-09-09
Paper | Thomson, Eggert: Expressing preferences for data use
Machine learning systems depend on access to large amounts of data. A number of existing models have been created from data gathered from websites. This data is often obtained without the permission of the people who might have a stake in how that data is used. The debate about this practice is obviously difficult, as it needs to balance the societal benefits that come from the resulting models against the interests of people who hold a stake in the information that is used. Political debate on this topic is very important but quite complex. How this debate will ultimately resolve is unclear, but one point that is widely accepted is the need for a means for stakeholders to opt out of having their data ingested for model training. That is, there is value in giving people the means to express their preferences about the use of their data. We identify requirements for a mechanism and conclude that a simple textual signal is most appropriate.
00 2024-09-09
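The abstract above argues for a simple textual signal without fixing its syntax here. Purely as a hypothetical illustration of the robots.txt-style shape such a signal could take (the AI-Training field name below is invented for this sketch, not taken from the paper or from any standard):

User-Agent: *
AI-Training: disallowed

A real mechanism would also need an agreed vocabulary of uses (training, search indexing, and so on), which is exactly the requirements question the paper examines.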
Paper | Rogerson: Guardian News & Media Draft Paper on an AI.txt Protocol
Robots.txt (internet standard RFC 9309) is a text file which allows website owners to define how they are accessed, if at all, by automated clients. In the context of controlling access to scraping for AI training, we note that RFC 9309 refers to automated clients as crawlers.
00 2024-09-09
Paper | Ludwig, Desai: IBM Use-case, Experiences, and Position Statement
IBM is engaged in advancing the state-of-the-art in building generative AI capabilities in the open, anchored on the IBM Granite family of open-weight models as well as recipes used in preparing the training datasets in Data-prep-kit, both available under the Apache 2.0 license. A key part of this mission is to self-govern the acquisition, processing, and usage of public data sources that are crawled. The usage spans the use cases of pre-training, fine-tuning, instruction-tuning, and RAG.
00 2024-09-09
Paper | Quinn, Steidl, Riecks, Sedlik, Warren: IPTC and PLUS: The "Data Mining" Embedded Image/Video Metadata Property
This document describes the PLUS "Data Mining" property developed by the PLUS Coalition in partnership with the International Press Telecommunications Council (IPTC). This property provides a means for stakeholders to communicate essential data mining rights information via embedded Extensible Metadata Platform (XMP) metadata in digital image and video formats. This mechanism allows for clear communication of data mining permissions, prohibitions, and constraints, which can be readily accessed and interpreted by crawlers and AI platforms.
00 2024-09-09
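As an illustration of how such an embedded declaration travels inside an asset, the XMP packet below marks an image as off-limits for AI/ML training. The plus:DataMining property and the DMI vocabulary URI follow the published PLUS specification as best recalled here; treat this rendering as a sketch and consult the PLUS/IPTC specification for the authoritative form.

<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:plus="http://ns.useplus.org/ldf/xmp/1.0/">
      <!-- Sketch: prohibit AI/ML training; exact values are defined
           by the PLUS Data Mining vocabulary -->
      <plus:DataMining
          rdf:resource="http://ns.useplus.org/ldf/vocab/DMI-PROHIBITED-AIMLTRAINING"/>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>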
Paper | Hazaël-Massieux: Managing exposure of Web content to AI systems
As we describe in our report AI & the Web: Understanding and managing the impact of Machine Learning models on the Web, the scale of processes involved in the building and deployment of recent large Machine Learning (ML) models is such that they are now set to have a systemic impact on the Web as a shared information space.
00 2024-09-09
Paper | OpenAI, von Lohmann: OpenAI Position paper
OpenAI submits this position paper in response to the announcement by the program committee of the IETF’s interest in convening the IAB Workshop on AI-CONTROL in September 2024. OpenAI is dedicated to developing advanced AI technologies to benefit all of humanity. We want our AI models to learn from as many languages, cultures, subjects, and industries as possible so they can benefit as many people as possible. The more diverse datasets are, the more diverse the models’ knowledge, understanding, and languages become – like a person who has been exposed to a wide range of cultural perspectives and experiences – and the more people and countries AI can safely serve.
00 2024-09-09
Paper | Brachetti-Truskawa: PROPOSAL Multi-Level Approach to Managing AI Crawler Behavior and Content Protection
This document proposes a comprehensive, multi-layered strategy to protect website content from unauthorized use in AI training, particularly by Large Language Models (LLMs). The approach leverages existing web standards and proposals, and introduces new methods to communicate content usage restrictions effectively. We are not reinventing the wheel; rather, we suggest combining methods for better protection.
00 2024-09-09
Paper | Gropper: A Delegated Authorization standard for AI access control
Machine Learning and Artificial Intelligence (AI) is a vastly different use-case than Search, and a robots.txt approach would be inadequate.
00 2024-09-09
Paper | Ramamoorthy: Prapanch Ramamoorthy Position paper
This position paper has been written for the Internet Architecture Board (IAB) Workshop on AI-Control (https://datatracker.ietf.org/group/aicontrolws/about/). The paper has two broad parts: the first calls out use cases that need to be considered for data/content authors wanting to opt out of AI crawling; the second calls out the requirements that any solution we come up with must keep in consideration. All the information below applies both to existing data and to new data that could be created in the future.
00 2024-09-09
Paper | Sinha: Researcher perspective of OPT-OUT
This note presents concerns with potential OPT-OUT mechanisms on the internet. These include the adverse impact researchers would face; transparency in public processes; and inequitable access and accessibility and their impact on human technological advancement.
00 2024-09-09
Paper | Illyes: Robots Exclusion Protocol Extension for URI Level Control
This document extends RFC 9309 by specifying additional URI-level controls through application-level headers and HTML meta tags originally developed in 1996. Additionally, it moves the response header out of the experimental header space (i.e. "X-") and defines the combinability of multiple headers, which was previously not possible.
00 2024-09-09
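For context, the 1996-era mechanisms the draft formalizes are the robots meta tag and the experimental response header, which in current practice look like the lines below (directive support varies by crawler; the renamed, non-"X-" header is defined in the draft itself, not shown here):

<meta name="robots" content="noindex">

X-Robots-Tag: noindex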
Paper | Marti: Server Privacy Control: a server-to-client privacy opt-out preference signal
The most enduring and valuable uses of the Internet come from people who choose to communicate, tell stories and socialize online on their own sites, on their own terms, where they are the moderator and control what they choose to share and promote. While each individual blog or other personal site might be a labor of love for one person, collectively the independent web is a major source of educational and cultural wealth, as well as a significant commercial enterprise. Raptive, one of several providers of services to independent sites, is ranked among the top ten media companies by Comscore.
00 2024-09-09
Paper | Needham, O'Hanlon: Some suggestions to improve robots.txt
The BBC set out its approach to generative AI in [1]: "The emergence of generative AI is expected to herald a new wave of technology innovation that could impact almost every field of human activity. The new tools can generate text, images, speech, music and video in response to prompts from a user, producing new creative possibilities, and potential efficiency gains. Alongside these opportunities, it is clear that generative AI introduces new and significant risks if not harnessed properly. These include ethical issues, legal and copyright challenges, and significant risks around misinformation and bias. The BBC does not believe the current scraping of its content and data without permission in order to train generative AI models is in the public interest, and wants to agree a more structured and sustainable approach with technology companies."
00 2024-09-09
Paper | Posth, Richly: TDM-AI - Making Unit-Based Opt-out Declarations to Providers of Generative AI
TDM·AI is a protocol for creators and rightsholders to inseparably bind their machine-readable preferences for text and data mining (TDM) to digital media assets, specifically tailored for training models and applications of generative AI. TDM·AI addresses the main problem in controlling AI crawlers, namely the problem of metadata binding, by proposing a reliable method of soft-binding restrictions or permissions to use content for training models of generative AI to content-derived identifiers. The TDM·AI protocol utilises the International Standard Content Code (ISCC), a new ISO standard for the identification of digital media content (ISO 24138:2024), and Creator Credentials, based on the W3C recommendation for cryptographically verifiable credentials, to ensure that verifiable and machine-readable declarations include proper attribution of preferences and claims to the legitimate rightsholders. Although the protocol has its origins in the European DSM Directive on Copyright 2019/790, Article 4, it may in many cases also be applicable to content published by rightsholders outside the EU.
00 2024-09-09
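To make the soft-binding idea concrete, the sketch below pairs a usage preference with a content-derived identifier. It is a conceptual illustration only, using a SHA-256 digest as a stand-in for an ISCC code (ISO 24138) and a plain JSON object in place of a W3C Verifiable Credential; it is not the TDM·AI protocol itself.

# Conceptual sketch of soft-binding a TDM preference to a
# content-derived identifier; NOT the TDM-AI protocol itself.
import hashlib
import json

def content_identifier(data: bytes) -> str:
    # Stand-in for an ISCC code: an identifier derived from the
    # content itself survives copying and re-hosting of the asset.
    return "sha256:" + hashlib.sha256(data).hexdigest()

def tdm_declaration(data: bytes, rightsholder: str, ai_training_reserved: bool) -> str:
    # The declaration references the identifier rather than a URL,
    # so the preference stays resolvable wherever the asset appears.
    return json.dumps({
        "content_id": content_identifier(data),
        "rightsholder": rightsholder,
        "tdm_ai_training": "reserved" if ai_training_reserved else "allowed",
    })

print(tdm_declaration(b"example media bytes", "Example Press", True))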
Paper | Alliance for Responsible Data Collection (ARDC), Levy: Technical and Governance Guidelines for Responsible Data Collection
The Alliance for Responsible Data Collection (“ARDC”) is an inter-industry alliance of thought leaders from businesses, non-profits, and academia aligned on a mission to establish responsible data collection standards and guidelines that:
1. Provide data collectors with guidance on best practices on how collections are carried out.
2. Offer third parties a reliable means to assess the responsible sourcing of data.
3. Preserve open access to public internet data and prevent data monopolization.
Participants in ARDC represent diverse data usage models and share the common goal of ensuring open access to public data within a trusted framework. Our discussions have included contributions from Author’s Alliance, Bright Data, Common Crawl, OpenAI, Sequentum, Stanford CodeX, and others.
00 2024-09-09
Paper | Berjon: The Context of Scraping Control
While interest in scraping controls has mushroomed with mainstream concern about generative AI and LLMs, it is important to understand that the problem domain is not new. Should a technical solution be designed, it would be unfortunate if it were to solve only a narrow set of concerns that are prevalent today and fail to address pre-existing and, presumably, future issues that are structurally similar. With that in mind, this position paper offers a description of the wider context in which AI-related scraping emerged, in the hope of helping inform the discussion. Disclaimer: this paper in no way claims to represent the opinions of The New York Times, but it is heavily shaped by my experience there and by solutions I considered while there.
00 2024-09-09
Paper | Illyes: The case for the Robots Exclusion Protocol
In 1994, Martijn Koster (a webmaster himself) came up with the idea of robots.txt after automatic clients (crawlers) were overwhelming his site. With more input from other webmasters, the Robots Exclusion Protocol was born, and it was adopted by search engines and other crawler operators to help website owners manage their server resources more easily. It functioned as a de facto standard for over 20 years.
01 2024-09-09
Paper | Linsvayer, Reda: GitHub Submission
GitHub welcomes the IAB’s invitation to submit considerations regarding the suitability of the Robots Exclusion Protocol (RFC 9309) for communicating preferences regarding limits to AI training. The focus of this position paper is on the needs of software developers who wish to express such preferences.
00 2024-09-09