IAB Workshop on AI-CONTROL (aicontrolws)
Slides
Type | Title | Abstract | Curr. rev. | Date | Last presented | On agenda |
---|---|---|---|---|---|---|
Paper | Jiménez, Arkko: AI, Robots.txt | Large Language Models (LLMs) and their use of Internet-sourced material present numerous technical, commercial, legal, societal, and ethical challenges. An emerging practice proposes extending the robots.txt file to enable website owners to declare if they wish to "opt-out" from having their site’s content used in training AI models. This paper explores the topic. We argue that the problem is much broader than the simple opt-out mechanism, given the coming new applications, the many different ways to access training material, different AI techniques, and the need to both facilitate access to training material and enable opting out from it. | 00 | 2024-09-09 | | |
Paper | Gahnberg: AI-Control: Opt-Out Mechanisms From the View of a Governance Cycle | This paper seeks to inform discussions on an AI-Control mechanism by outlining considerations from a view of governance. Understood as a mechanism for content creators to opt out of having their content used as training data for the creation of large language models (LLMs), an AI-Control mechanism would offer a standard for signaling a content creator’s preferences to a web crawler. | 00 | 2024-09-09 | | |
Paper | Prorock: Addressing the Limitations of Robots.txt in Controlling AI Crawlers | The emergence of Generative AI and the surrounding ecosystem has introduced new challenges for the internet, highlighting the limitations of the Robots Exclusion Protocol (RFC 9309). The current mechanisms for controlling automated access are inadequate for both AI system operators and content creators. This paper explores the deficiencies of the robots.txt approach and proposes considerations for a more robust solution. | 00 | 2024-09-09 | | |
Paper | Doty, Null, Knodel: CDT Position paper | Publishers, authors and social media users want some way to prevent their content from being used to train large language models and other generative AI, because it hurts the market for their own work or because it could be invasive or otherwise harmful to them. Companies operating large AI models want to be able to access as much data as possible for training purposes, but may be willing to exclude content that the author doesn’t want involved, especially if it avoids either copyright suits or privacy complaints. Researchers want the ability to crawl the web for scientific and public interest purposes without getting caught up in ongoing copyright or AI training disputes. To achieve those ends for the stakeholders concerned, a standards-based solution for controlling use of online content for AI training seems possible and promising. | 00 | 2024-09-09 | | |
Paper | Longpre, Mahari, Lee, Lund et al: Consent in Crisis: The Rapid Decline of the AI Data Commons | General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how codified data use preferences are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites’ expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crises in data consent, for both developers and creators. The foreclosure of much of the open web will impact not only commercial AI, but also non-commercial AI and academic research. | 00 | 2024-09-09 | | |
Paper | Keller: Considerations for Opt-Out Compliance Policies by AI Model Developers | Article 53(1c) of the AI Act requires “providers of general-purpose AI models” to “put in place a policy to comply with Union copyright law, and in particular to identify and comply with, including through state of the art technologies, a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790.” This paper explores what such compliance policies could look like in practice and what technical standards and services are available to implement rightholder opt-outs in a way that is effective, scalable, and addresses the needs of diverse groups of rightholders and AI model developers. | 00 | 2024-09-09 | | |
Paper | Cloudflare: Control starts with Transparency: Cloudflare's position on AI Crawlers and Bots | As a provider of Internet services, Cloudflare has a strong interest in assuring both that publishers’ rights are respected and that developers have access to innovative new features, including LLMs. Balancing the tensions that emerge requires either trust or, in its absence, transparency. | 00 | 2024-09-09 | | |
Paper | Hardie: Crawlers, adversaries, and exclusions: thoughts on content ingest and creator's rights | This paper looks briefly at the history of information retrieval in the early web and explores the relationships between content creators and search engines that developed. It then examines the differences between those and the relationships that have emerged from the ingestion of content to train generative AI models. It concludes with some thoughts on the implications of those differences for evolving mechanisms like robots.txt to cover the new use case. | 00 | 2024-09-09 | | |
Paper | Creative Commons: Creative Commons Position Paper on Preference Signals | When Creative Commons (CC) was founded over 20 years ago, sharing on the internet was broken. With the introduction of the CC licenses, the commons flourished. Licenses that enabled open sharing were perfectly aligned with the ideals of giving creators a choice over how their works were used. Those who embrace openly sharing their work have a myriad of motivations for doing so. Most could not have anticipated how their works might one day be used by machines. | 00 | 2024-09-09 | | |
Paper | Elsevier: Elsevier Position paper | As one of the largest publishers of scientific, technical, and medical literature, Elsevier recognizes the positive potential of generative AI to enhance search and discovery platforms. We also recognize the role other forms of AI have played in improving the quality and relevancy of results thus far. | 00 | 2024-09-09 | | |
Paper | Thomson, Eggert: Expressing preferences for data use | Machine learning systems depend on access to large amounts of data. A number of existing models have been created based on data gathered from websites. This data is often obtained without the permission of the people who might have a stake in how that data is used. The debate about this practice is obviously difficult, as it needs to balance the societal benefits that come from the resulting models and the interests of people who hold a stake in the information that is used. Political debate on this topic is very important, but is quite complex. How this debate will ultimately resolve is unclear, but one point that is widely accepted is the need for a means for stakeholders to opt out of having their data ingested for model training. That is, there is value in giving people the means to express their preferences about the use of their data. We identify requirements for a mechanism and conclude that a simple textual signal is most appropriate. | 00 | 2024-09-09 | | |
Paper | Rogerson: Guardian News & Media Draft Paper on an AI.txt Protocol | Robots.txt (internet standard RFC 9309) is a text file which allows website owners to define how they are accessed, if at all, by automated … | 00 | 2024-09-09 | | |
Paper | Ludwig, Desai: IBM Use-case, Experiences, and Position Statement | IBM is engaged in advancing the state-of-the-art in building generative AI capabilities in the open, anchored on the IBM Granite family of open-weight models as well as recipes used in preparing the training datasets in Data-prep-kit, both available under Apache 2.0 license. A key part of this mission is to self-govern the acquisition, processing, and usage of public data sources that are crawled. The usage spans the use cases of pre-training, fine-tuning, instruction-tuning, and RAG. | 00 | 2024-09-09 | | |
Paper | Quinn, Steidl, Riecks, Sedlik, Warren: IPTC and PLUS: The "Data Mining" Embedded Image/Video Metadata Property | This document describes the PLUS "Data Mining" property developed by the PLUS Coalition in partnership with the International Press Telecommunications Council (IPTC). This property provides a means for stakeholders to communicate essential data mining rights information via embedded Extensible Metadata Platform (XMP) metadata in digital image and video formats. This mechanism allows for clear communication of data mining permissions, prohibitions, and constraints, which can be readily accessed and interpreted by crawlers and AI platforms. | 00 | 2024-09-09 | | |
Paper | Hazaël-Massieux: Managing exposure of Web content to AI systems | As we describe in our report AI & the Web: Understanding and managing the impact of Machine Learning models on the Web, the scale of processes involved in the building and deployment of recent large Machine Learning (ML) models is such that they are now set to have a systemic impact on the Web as a shared information space. | 00 | 2024-09-09 | | |
Paper | OpenAI, von Lohmann: OpenAI Position paper | OpenAI submits this position paper in response to the announcement by the program committee of the IETF’s interest in convening the IAB Workshop on AI-CONTROL in September 2024. OpenAI is dedicated to developing advanced AI technologies to benefit all of humanity. We want our AI models to learn from as many languages, cultures, subjects, and industries as possible so they can benefit as many people as possible. The more diverse datasets are, the more diverse the models’ knowledge, understanding, and languages become – like a person who has been exposed to a wide range of cultural perspectives and experiences – and the more people and countries AI can safely serve. | 00 | 2024-09-09 | | |
Paper | Brachetti-Truskawa: PROPOSAL Multi-Level Approach to Managing AI Crawler Behavior and Content Protection | This document proposes a comprehensive, multi-layered strategy to protect website content from unauthorized use in AI training, particularly by Large Language Models (LLMs). The approach leverages existing web standards and proposals, and introduces new methods to communicate content usage restrictions effectively. We are not reinventing the wheel, but suggesting a combination of methods for better protection. | 00 | 2024-09-09 | | |
Paper | Gropper: A Delegated Authorization standard for AI access control | Machine Learning and Artificial Intelligence (AI) is a vastly different use case from Search, and a robots.txt approach would be inadequate. | 00 | 2024-09-09 | | |
Paper | Ramamoorthy: Prapanch Ramamoorthy Position paper | This position paper has been written for the Internet Architecture Board (IAB) Workshop on AI-Control - https://datatracker.ietf.org/group/aicontrolws/about/. The paper has two broad parts – the first part calls out use cases which need to be considered for data/content authors wanting to opt out of AI crawling. The second part calls out the requirements that any solution we come up with must keep in consideration. All the information below should be consumed keeping in mind both existing data and new data that could be created in the future. | 00 | 2024-09-09 | | |
Paper | Sinha: Researcher perspective of OPT-OUT | This note presents concerns with potential OPT-OUT mechanisms on the internet. They include the adverse impact researchers would face; transparency in public processes; and inequitable access and accessibility and their impact on human technological advancement. | 00 | 2024-09-09 | | |
Paper | Illyes: Robots Exclusion Protocol Extension for URI Level Control | This document extends RFC 9309 by specifying additional URI-level controls through an application-level header and the HTML meta tags originally developed in 1996. Additionally, it moves the response header out of the experimental header space (i.e. "X-") and defines the combinability of multiple headers, which was previously not possible. | 00 | 2024-09-09 | | |
Paper | Marti: Server Privacy Control: a server-to-client privacy opt-out preference signal | The most enduring and valuable uses of the Internet come from people who choose to communicate, tell stories and socialize online on their own sites, on their own terms, where they are the moderator and control what they choose to share and promote. While each individual blog or other personal site might be a labor of love for one person, collectively the independent web is a major source of educational and cultural wealth, as well as a significant commercial enterprise. Raptive, one of several providers of services to independent sites, is ranked among the top ten media companies by Comscore. | 00 | 2024-09-09 | | |
Paper | Needham, O'Hanlon: Some suggestions to improve robots.txt | The BBC set out its approach to generative AI in [1]: "The emergence of generative AI is expected to herald a new wave of technology innovation that could impact almost every field of human activity. The new tools can generate text, images, speech, music and video in response to prompts from a user, producing new creative possibilities, and potential efficiency gains. Alongside these opportunities, it is clear that generative AI introduces new and significant risks if not harnessed properly. These include ethical issues, legal and copyright challenges, and significant risks around misinformation and bias. The BBC does not believe the current scraping of its content and data without permission in order to train generative AI models is in the public interest, and wants to agree a more structured and sustainable approach with technology companies." | 00 | 2024-09-09 | | |
Paper | Posth, Richly: TDM-AI - Making Unit-Based Opt-out Declarations to Providers of Generative AI | TDM·AI is a protocol for creators and rightsholders to inseparably bind their machine-readable preferences for text and data mining (TDM) to digital media assets, specifically tailored for training models and applications of generative AI. TDM·AI addresses the main problem in controlling AI crawlers, namely the problem of metadata binding, by proposing a reliable method of soft-binding restrictions or permissions to use content for training models of generative AI to content-derived identifiers. The TDM·AI protocol utilises the International Standard Content Code (ISCC), a new ISO standard for the identification of digital media content (ISO 24138:2024), and Creator Credentials, based on the W3C recommendation for cryptographically verifiable credentials, to ensure that verifiable and machine-readable declarations include proper attribution of preferences and claims to the legitimate rightsholders. Although the protocol has its origins in the European DSM Directive on Copyright 2019/790, Article 4, it may in many cases also be applicable to content published by rightsholders outside the EU. | 00 | 2024-09-09 | | |
Paper | Alliance for Responsible Data Collection (ARDC), Levy: Technical and Governance Guidelines for Responsible Data Collection | The Alliance for Responsible Data Collection (“ARDC”) is an inter-industry alliance of thought leaders from businesses, non-profits, and academia aligned on a mission to establish responsible data collection standards and guidelines that: 1. Provide data collectors with guidance on best practices on how collections are carried out. 2. Offer third parties a reliable means to assess the responsible sourcing of data. 3. Preserve open access to public internet data and prevent data monopolization. Participants in ARDC represent diverse data usage models and share the common goal of ensuring open access to public data within a trusted framework. Our discussions have included contributions from Author’s Alliance, Bright Data, Common Crawl, OpenAI, Sequentum, Stanford CodeX, and others. | 00 | 2024-09-09 | | |
Paper | Berjon: The Context of Scraping Control | While interest in scraping controls has mushroomed with mainstream concern about generative AI and LLMs, it is important to understand that the problem domain is not new. Should a technical solution be designed, it would be unfortunate if it were to solve only a narrow set of concerns that are prevalent today and fail to address pre-existing and, presumably, future issues that are structurally similar. With that in mind, this position paper offers a description of the wider context in which AI-related scraping emerged in the hope of helping inform the discussion. Disclaimer: this paper in no way claims to represent the opinions of The New York Times, but it is heavily shaped by my experience there and by solutions I considered while there. | 00 | 2024-09-09 | | |
Paper | Illyes: The case for the Robots Exclusion Protocol | In 1994, Martijn Koster (a webmaster himself) came up with the idea of robots.txt after automatic clients (crawlers) were overwhelming his site. With more input from other webmasters, the Robots Exclusion Protocol was born, and it was adopted by search engines and other crawler operators to help website owners manage their server resources more easily. It functioned as a de facto standard for over 20 years. | 01 | 2024-09-09 | | |
Paper | Linsvayer, Reda: GitHub Submission | GitHub welcomes the IAB’s invitation to submit considerations regarding the suitability of the Robots Exclusion Protocol (RFC 9309) for communicating preferences regarding limits to AI training. The focus of this position paper is on the needs of software developers who wish to express such preferences. | 00 | 2024-09-09 | | |
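
Several of the papers above (Jiménez/Arkko, Prorock, Needham/O'Hanlon, Illyes) turn on how RFC 9309 group rules are matched against a crawler's user-agent token. As a concrete anchor for that discussion, here is a minimal Python sketch of today's emerging opt-out practice, using only the standard library; the token "ExampleAIBot" and the sample robots.txt content are invented for illustration and do not come from any of the papers.

```python
# Minimal sketch: per-token opt-out via robots.txt, checked with the
# standard-library parser. "ExampleAIBot" is a hypothetical AI-training
# crawler token; real deployments use tokens published by each operator.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The AI-training crawler is opted out; other crawlers are unaffected.
print(parser.can_fetch("ExampleAIBot", "https://example.com/article"))    # False
print(parser.can_fetch("GenericCrawler", "https://example.com/article"))  # True
```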
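The Illyes extension paper above concerns per-URI directives carried in an HTTP response header rather than in robots.txt, and proposes moving that header out of the experimental "X-" space. The following sketch shows the client side of that mechanism against the existing experimental header name; the directive values a server might return are illustrative, not taken from the draft.

```python
# Sketch: reading URI-level directives from the (experimental) X-Robots-Tag
# response header using only the standard library.
import urllib.request

def robots_directives(url: str) -> list[str]:
    """Return all directives found in X-Robots-Tag headers for a URL."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        # getheaders() yields (name, value) pairs; a server may send the
        # header several times, so collect every occurrence.
        values = [v for k, v in resp.getheaders() if k.lower() == "x-robots-tag"]
    # Each header value may itself be a comma-separated directive list.
    return [d.strip() for v in values for d in v.split(",")]

# Usage (against a hypothetical server):
#   robots_directives("https://example.com/photo.jpg")
#   might return, e.g., ["noindex", "noarchive"]
```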
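The IPTC/PLUS paper above relies on directives embedded in the asset itself (XMP metadata) rather than served alongside it. The sketch below shows, crudely, how a crawler might probe a downloaded file for the PLUS "Data Mining" property; a production crawler would use a proper XMP parser, and the property spelling and vocabulary value shown are my reading of the PLUS specification and should be verified against it.

```python
# Crude illustration: scan a media file's raw bytes for an embedded XMP
# plus:DataMining value. Real code should parse the XMP packet properly.
import re

def data_mining_value(path: str) -> str | None:
    """Return the first plus:DataMining value found in a file, if any."""
    with open(path, "rb") as f:
        data = f.read()
    # XMP may serialize the property as an attribute or as an element.
    m = re.search(rb'plus:DataMining\s*(?:=\s*"([^"]+)"|>([^<]+)<)', data)
    if not m:
        return None
    return (m.group(1) or m.group(2)).decode("utf-8", "replace").strip()

# Usage (hypothetical file): a PLUS vocabulary URI such as
# "http://ns.useplus.org/ldf/vocab/DMI-PROHIBITED-AIMLTRAINING" (value name
# per my reading of the spec) would signal that AI/ML training is prohibited.
```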