slides-aicontrolws-crawlers-adversaries-and-exclusions-thoughts-on-content-ingest-and-creators-rights-00
Author:  Ted Hardie
Contact: ted.ietf@gmail.com
Title: Crawlers, adversaries, and exclusions: thoughts on content ingest and
creator's rights

Abstract:

This paper looks briefly at the history of information retrieval in the early
web and explores the relationships between content creators and search engines
that developed. It then examines the differences between those and the
relationships that have emerged from the ingestion of content to train
generative AI models.  It concludes with some thoughts on the implications of
those differences for evolving mechanisms like robots.txt to cover the new use
case.

1. Metadata on the early Web

Prior to the widespread deployment of the Web, many of the resources on the
Internet were in curated collections, maintained and interlinked by librarians
or their equivalents in research institutions and universities.  When the Web
overtook gopher,  the population making content available shifted to subject
matter experts for specific content, rather than those more used to providing
institutional context.  When I was the head of NASA's webmasters' working group
during the mid-1990s, we estimated that there were more than four thousand
servers inside the network, run by individuals or small teams intent on sharing
information on specialized topics with external collaborators or the public at
large.  It was in that highly decentralized era that many of the patterns
related to search and discovery on the web were set.

One consequence of this shift to the Web was that methods for finding and
describing information needed to be reconsidered as well.  Veronica, an early
search engine for gopher, relied on menu information; WAIS, which used free
text search, presumed the creation of index files for some critical
functionality.  One venue for exploring the difficulty of translating those
approaches to the web was the Dublin Core Workshop series (documented in RFC
2413 and updated in RFC 5013, after the creation of the Dublin Core Metadata
Initiative).  It is interesting to note that one of the core metadata elements
was "rights", described in section of RFC 5013 as "Information about rights
held in and over the resource. Typically, rights information includes a
statement about various property rights associated with the resource, including
intellectual property rights."  Had the Dublin Core become a bedrock part of
search and discovery, we might presume that the rights related to any piece of
content could be expressed to any type of retrieval system relatively easily.
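
As an illustrative sketch only (the element names follow the common convention
for embedding Dublin Core in HTML; the values are invented), such a rights
assertion might have looked like:

    <meta name="DC.title"   content="Instrument calibration notes">
    <meta name="DC.creator" content="Example Research Group">
    <meta name="DC.rights"  content="(c) 1998; reuse by permission only.">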

That is not, however, what happened.  Search engines which used metadata for
classification discovered early on that content creators could and did use
metadata tags which were completely unrelated to the actual content.  This set
up an adversarial relationship between content creators, who were trying to
draw attention to their sites by being very liberal in the descriptions of the
content, and search engine operators, who were trying to provide relevant
content. While an entire system was eventually specified for using metadata to
direct queries within cooperating systems (see RFC 2651, RFC 2655, and RFC
2656), it was a niche approach by the time the specifications appeared. 
Instead, sites were crawled by search engines which followed each hyperlink,
collected all the data available, and produced their own indexes.

As early as 1994, this occasionally created excessive load on the content
servers.  In response, Martijn Kosters created robots.txt as a mechanism for
specifying how crawlers would be permitted to behave at a particular site. 
This became a de facto standard (ultimately also published as RFC 9309, after
having been in use for nearly 30 years).  While it has expanded considerably
over time from simple rate-limiting, its core mechanism remains the same:  a
product token matching the user-agent string is used to identify a matching set
of rules, which the relevant crawler is expected to obey.  A default set of
behaviors is also generally included.
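
A minimal sketch of that mechanism (the product token and paths here are
invented for illustration):

    # Group applied to a crawler whose user-agent matches the
    # hypothetical product token "ExampleBot".
    User-agent: ExampleBot
    Disallow: /private/
    Allow: /

    # Default group for crawlers with no more specific match.
    User-agent: *
    Disallow: /drafts/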

While this has not been without flaws, this approach worked through a very long
period in part because there were implicitly other remedies available:  the IP
address ranges associated with a crawler could be blocked; a crawler failing to
obey the rules could be fingerprinted by the content server and null or garbage
data returned; a crawler that ignored limits related to confidential
information might find itself dealing with relevant regulatory authorities (in
the US, the California Consumer Privacy Act or the Computer Fraud and Abuse
Act; in Europe, the GDPR).  It also worked in part because the two parties had
an interest in the other's success; while there were adversarial elements, the
core of the relationship between content and search was cooperative, as web
sites wanted the traffic the search engines supplied.

2. Exclusion and use of data for model training

As we look at the newer use case, there are some obvious similarities.  There
is a set of crawlers and a set of content sites.  In some cases over-eager
crawlers have created excessive load on some content sites.  Re-using
robots.txt to handle over-eager crawlers seems like a return to the roots of
the standard and a natural extension to the existing, well-understood system.

But the relationship between the sites and crawlers looks quite different from
what it was during the distributed web era in which robots.txt emerged.  First, much
of the content is platform based, and in some cases the platform owners and the
AI model developers are the same entity.  Developing rules for all of the
content on a platform and sharing it via a site-wide file seems likely to
result either in serious scaling problems or a flattening of individual
preferences into site-wide norms, which may not match the desires of the
content creators.  Especially in the case where the platform owner and AI model
developer are the same, any conflict about the desired usage of content is
heavily weighted toward the platform owner. Related issues have already
occurred.  Updated terms of service for specific platforms have caused multiple
user communities concern; Slack, Discord, X, and others have each tried to
update their terms to grant rights for model development.

Second, web content creators historically collaborated with search
engines in order to garner attention.  Because that attention has value, the
search engines relied on a fundamental level of cooperation, even if they had
to deal with some adversarial behavior attempting to gather more attention than
the content actually warranted.

In the current situation, however, the adversarial relationship runs the other
way.  The content has value, and the content creator or publisher must deal
with adversarial behavior by the crawler or the developer of the AI system. 
Where web search engines bring attention to the content, generative AI models
are intended to synthesize new utterances in response to prompts; they do not
bring attention to the origins of the data.  In some cases, the new synthesized
utterances replace the work that would have been done by content creators,
lowering the potential earnings of the creator, rather than providing an avenue
of discovery or attention-based monetization.  The need for exclusion has thus
changed as well: it is now about protecting the rights of content creators
within a system where the content also remains available for its current use.

3. Evolution, scaling, and change

At the time of writing, a site manager or content creator can use robots.txt to
manage some crawlers with these two very different purposes, albeit at the cost
of significant additional work.  In order to do that within the confines of RFC
9309, each crawler's intent must be identified and the file updated to allow or
deny different parts of the site based on that intent.  The exact same content
for which an "allow" is appropriate for search engines may now need to be
marked as "deny" for AI model training crawlers.  This approach relies on each
crawler's intent being easy to identify, as well as the site being constructed
in a way that cleanly separates out data which should be permitted from that
which should be denied for this new purpose. As it always has, it also relies
on the crawler voluntarily following the robots.txt standard.  There are
reports that a number of these crawlers, including Anthropic's ClaudeBot, do
not.  Similarly, ClaudeBot apparently changes source IPs from within the AWS
cloud IP space so often that IP-level blocking does not remain effective.  This
is a breakdown in the basic level of cooperation that has existed between
crawlers and content creators in the past.
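
As a sketch of the resulting file (the tokens are those the crawlers in
question are generally reported to use; whether a given crawler honors them is,
as noted above, a separate question):

    # Search crawling: the site still wants the traffic.
    User-agent: Googlebot
    Allow: /

    # Crawlers understood to collect AI training data: denied.
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

Each newly identified training crawler means another group to add, and a
crawler that ignores the file is unaffected by any of it.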

This approach also fails when a search engine chooses to combine the crawlers
for search and for AI model training.  The primary example of this is Google,
which has used its search engine data to train AI models.  As a result, many
sites which previously opted in to its search have been opted in to its
training of models.  To allow sites to distinguish between the two use cases,
Google has indicated support for a new product token, "Google-Extended",
through which a site could opt out of LLM training and similar uses while
remaining available to search.  Which uses it covers is, however, subject to
Google's interpretation, and Google continued to use search data for the Search
Generative Experience at least for a time.  The penetration of the new token is
also limited; six months after its announcement, originality.ai found that only
10% of the top thousand websites were using it[1].
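
Where it is deployed, the result is a pair of groups along these lines (a
sketch; the paths and scope are illustrative):

    # Remain available to Google search.
    User-agent: Googlebot
    Allow: /

    # Opt out of the uses Google associates with the Google-Extended token.
    User-agent: Google-Extended
    Disallow: /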

These experiences indicate that there would likely be some value in creating a
standard extension to robots.txt for AI model uses, but that it would be
limited.  For large-scale content providers who wish to opt out completely, a
single, standard approach would remove the need to track whether AI model
training was within each crawler's intent.  It would also give evidence of
their preference when taking non-technical steps, such as a cease-and-desist
letter. As a next step, it makes sense.

There are, however, many cases for which this "opt-out completely" switch is
not sufficient.  The guidelines for using NASA images and media[2] provide an
interesting test case, because NASA content is generally not subject to
copyright in the United States.  It is free for most uses, including display,
the creation of simulations, and the Web.  While that would seem to make it
possible to assume a blanket permission for this new use case, a quick look at
the actual guidelines shows that things are not so simple.  NASA may host
third-party content which is subject to copyright; making that content
accessible within the NASA site will have been agreed to by the copyright
holder, but the other rights would need to be sought from that holder.  That,
in turn, means blanket
permission to access specific directories may be difficult to grant.  NASA also
forbids the use of its imagery when it is intended to imply an endorsement and
requires additional steps when it includes an identifiable person. Both of
these imply significant exclusions or additional steps as well.  Because much
of the site content changes quite rapidly, requiring NASA either to disallow
each individual exception separately or to re-architect the site to partition
content by permitted use would be an extremely daunting requirement.[3]

The NASA case is interesting in part because it is likely to be the inverse of
what would be typical of other sites, where the presence of material subject to
copyright might mean disallowing almost all content from crawlers but allowing
a few exceptions.  It highlights that robots.txt's use of longest-match rules and
exception lists to handle mixed content will be difficult to get right and even
harder to maintain.  Continuing to evolve along these lines may give some
relief, but the core issue is that the rights associated with content generally
adhere to the content itself rather than where it sits in the directory
structure of a website.  That tension is not resolved with extensions along
these lines.
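
To make the maintenance problem concrete, here is a sketch of the "typical"
pattern just described, with invented paths: a training crawler is denied by
default, and exceptions are carved out with longer, more specific rules, which
take precedence under RFC 9309's longest-match behavior.

    User-agent: GPTBot
    Disallow: /
    Allow: /public-domain/
    Allow: /press/2024/releases/

Every exception is expressed as a path, so each item whose rights differ from
its neighbors forces either another rule or a reorganization of the site, which
is the tension just described.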

4. The steps beyond the next step

That tension also raises the question that I hope this workshop can address: 
what's the step beyond extending robots.txt to make blanket exclusion easier?
What will let us have a richer vocabulary than "permit that directory" and
"deny that content in it"?

The answer, when it comes, will only be partly technical.  We need to start
from an agreement on what specific rights held in relation to a piece of
content mean for different use cases.  That's likely a question for legislators
and the courts, though good-faith efforts by those creating models will
definitely help.   Those answers can drive technical work on binding the new,
richer vocabulary to different types of media.  That will be especially
difficult for streaming media and other synthesized content, but it can be done
and, with time and care, done in ways that will be easy for new content to
use[4].

Which brings us to a question that is among the hardest:  what about content
for which there are no assertions, either in robots.txt or this new binding? 
Again, probably ultimately a question for legislators or the courts, but in
this author's opinion the only sensible starting point is to assume that, if
permission has not been explicitly granted, it is absent. That is,
ultimately, the nature of consent.  Unfortunately, that stance runs counter to
the business interests of almost everyone attempting to build generative AI
models from public data, and we run a real risk that courts or legislatures
will hold that opt-out is sufficient.

That would provide cover for non-compliant crawlers if robots.txt is the only
mechanism available, exactly because the generative AI systems do not bring
attention to the source of the content.  A piece of content that was opted out
at one site or in one instance may also be present elsewhere on the Web, so it
will be difficult or impossible to assess whether a crawler was not complying
or had sourced the content elsewhere, from a site with no opt-out.  The
inclusion of equivalents to the cartographer's "trap streets"[5] can help to
some degree, but the better long-term solution is based on an opt-in approach
in which affirmative consent for the use must be given. Then, if content is
shown to have been used by a generative model, the content creator can require
the model's maintainer to show the evidence of that consent, so that they can
proceed against the party which manufactured that consent.

5. Conclusion

There has been a fundamental change in the nature of the relationship between
content creator and crawler.  While it may be possible to extend existing
systems like robots.txt to manage some aspects of the new crawler behavior in
the short term, we must recognize that fundamental change in order to tackle
the long-term issues.  Much of the work needed to tackle those issues will not
be technical, as it will require a delineation of what uses may be assumed for
content made available via the Web and what rights may be asserted to limit
those uses.  The technical work of binding those assertions to the media is
secondary to that, and it will likely require fundamentally different
approaches than that of the existing robots exclusion standard.
________________
[1] https://www.businessinsider.com/google-extended-ai-crawler-bot-ai-training-data-2024-3
[2] https://www.nasa.gov/nasa-brand-center/images-and-media/
[3] Note that the author worked as a contractor for NASA many years ago but has
no current connection to the agency.  This conclusion is therefore speculative,
though based on the agency's reaction to the kidcode proposals
(https://datatracker.ietf.org/doc/draft-borenstein-kidcode/), which would have
created a similar unfunded mandate.
[4] It is technically feasible for HTML or JS now, by including Dublin
Core-style metadata in undisplayed portions of the content.  For other media
types, the binding might require either something similar to multipart/mixed or
an externalized assertion bound to a hash or content identifier.  In both
cases, these assertions might also be signed.  Getting this right requires
engineering work, but that work will fail if the assertions do not match the
needs.
[5] https://en.wikipedia.org/wiki/Trap_street