Robots.txt update proposal
draft-jimenez-tbd-robotstxt-update-00

Document Type Active Internet-Draft (individual)
Author Jaime Jimenez
Last updated 2024-11-06
ai-control                                                    J. Jimenez
Internet-Draft                                                  Ericsson
Intended status: Informational                           6 November 2024
Expires: 10 May 2025

                       Robots.txt update proposal
                 draft-jimenez-tbd-robotstxt-update-00

Abstract

   This document proposes updates to the robots.txt standard to
   accommodate AI-specific crawlers, introducing a syntax for user-agent
   identification and policy differentiation.  It aims to enhance the
   management of web content access by AI systems, distinguishing
   between training and inference activities.

About This Document

   This note is to be removed before publishing as an RFC.

   Status information for this document may be found at
   https://datatracker.ietf.org/doc/draft-jimenez-tbd-robotstxt-update/.

   Discussion of this document takes place on the ai-control Working
   Group mailing list (mailto:ai-control@ietf.org), which is archived at
   https://mailarchive.ietf.org/arch/browse/ai-control/.  Subscribe at
   https://www.ietf.org/mailman/listinfo/ai-control/.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 10 May 2025.

Jimenez                    Expires 10 May 2025                  [Page 1]
Internet-Draft               robots-proposal               November 2024

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
     1.1.  Terminology . . . . . . . . . . . . . . . . . . . . . . .   2
     1.2.  User-Agent Update . . . . . . . . . . . . . . . . . . . .   3
     1.3.  Robots.txt Update . . . . . . . . . . . . . . . . . . . .   4
   Acknowledgements  . . . . . . . . . . . . . . . . . . . . . . . .   4
   Normative References  . . . . . . . . . . . . . . . . . . . . . .   4
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . .   5

1.  Introduction

   The current robots.txt standard inadequately filters AI crawlers due
   to its reliance on a "user-agent name" based approach and limited
   syntax.  It is difficult to differentiate based on the intended use
   of data, such as storage, indexing, training, or inference.

   We submitted a proposal to the AI-Control WS:
   https://www.ietf.org/slides/slides-aicontrolws-ai-robotstxt-00.pdf.
   Based on further discussion, the text below describes a possible
   solution to the problems raised in the WS.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   This specification makes use of the following terminology:

   Crawler:
      A traditional web crawler.  This also includes crawlers operated
      by AI companies that do not use the gathered content to train any
      model, LLM or otherwise, because their purpose is purely real-time
      data integration for inference.

   AI Crawler:
      A specialized type of crawler employed by AI companies, which
      utilizes the gathered content exclusively for training purposes
      rather than for inference.

1.2.  User-Agent Update

   Crawlers are normally identified by the HTTP User-Agent request
   header, the source IP address of the request, or the reverse DNS
   hostname of that address.

   A draft that defines a syntax for user-agents would be necessary.
   The syntax has to be extensible, so that not only AI crawlers but
   potentially other crawlers can use it.  It should not be mandatory
   for clients to implement, so that it remains backwards compatible.

   An absolutely minimal syntax would be similar to what we see in the
   wild: most AI companies append the characters -ai to the user-agent
   name to indicate that the crawler is used for ingesting the content
   into an AI system, for example:

     User-agent: company1-ai
     User-agent: company2-ai

   Alternatively, we could reuse existing identifiers such as URN
   namespaces (https://www.iana.org/assignments/urn-namespaces/urn-
   namespaces.xhtml) (e.g., urn:rob:...), CRIs
   (https://datatracker.ietf.org/doc/html/draft-ietf-core-href-16), or
   cryptographically derived identifiers.  There are dozens of options
   within the IETF, so it is a matter of choosing the right one.

   The -ai syntax would indicate that the crawler using it is interested
   in training.  In this draft we treat inference as a separate process
   akin to normal web-crawling and thus already covered.
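
   As a rough illustration of the -ai convention, the following Python
   sketch classifies product tokens by their suffix.  The function name
   is_training_crawler and the token "examplebot" are hypothetical and
   not defined by this draft; case-insensitive comparison is assumed
   here because RFC 9309 matches user-agents case-insensitively.

```python
def is_training_crawler(product_token: str) -> bool:
    """Return True if the token follows the -ai suffix convention."""
    # Assumption: tokens are compared case-insensitively.
    return product_token.strip().lower().endswith("-ai")


# Hypothetical product tokens; only the first two use the -ai suffix.
tokens = ["company1-ai", "company2-ai", "examplebot"]
training = [t for t in tokens if is_training_crawler(t)]
```

   Under this convention, any token without the suffix would be handled
   as a regular crawler, which in this draft includes inference.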

   This approach differs from draft-canel-robots-ai-control in that it
   does not require a new field in the robots.txt ABNF, such as:

   User-Agent-Purpose: EXAMPLE-PURPOSE-1

1.3.  Robots.txt Update

   RFC9309 ABNF (https://datatracker.ietf.org/doc/html/rfc9309#name-
   formal-syntax) should be updated to address the new User-agent
   syntax.  If we continue with the -ai convention above, we could use
   regex to indicate different policies to AI crawlers.  For example:

   *  Disallow all AI-training

   User-Agent: .*?-ai$
   Disallow: /

   *  Allow all images for training but disallow training on /maps for
      all AI agents that do AI training.

   User-Agent: .*?-ai$
   Allow: /images
   Disallow: /maps*

   *  Allow /local for cohere-ai

   User-Agent: cohere-ai
   Allow: /local
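
   The regex-based matching in these examples could be evaluated
   roughly as in the Python sketch below.  It is not a definitive
   implementation: the draft does not specify how patterns are
   anchored, so the sketch assumes the whole product token must match,
   case-insensitively.  It also assumes the RFC 9309 behaviour of
   combining all matching groups and preferring the longest matching
   path, with Allow winning ties.  All function names are hypothetical.

```python
import re

ROBOTS_TXT = """\
User-Agent: .*?-ai$
Allow: /images
Disallow: /maps*

User-Agent: cohere-ai
Allow: /local
"""


def parse_groups(text):
    """Split robots.txt text into (user-agent patterns, rules) groups."""
    groups, agents, rules = [], [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if rules:  # a user-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif key in ("allow", "disallow"):
            rules.append((key, value))
    if agents:
        groups.append((agents, rules))
    return groups


def agent_matches(pattern, product_token):
    """Assumption: '*' keeps its RFC 9309 meaning; the rest is a regex."""
    if pattern == "*":
        return True
    return re.fullmatch(pattern, product_token, re.IGNORECASE) is not None


def path_regex(pattern):
    """Translate an RFC 9309 path pattern ('*' wildcard, '$' end anchor)."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_allowed(groups, product_token, path):
    """Combine all matching groups, then apply the longest matching rule."""
    rules = [
        rule
        for agents, group_rules in groups
        if any(agent_matches(a, product_token) for a in agents)
        for rule in group_rules
    ]
    best = ("allow", "")  # no matching rule means the path is allowed
    for verb, pattern in rules:
        if not pattern or not path_regex(pattern).match(path):
            continue
        longer = len(pattern) > len(best[1])
        tie_allow = len(pattern) == len(best[1]) and verb == "allow"
        if longer or tie_allow:
            best = (verb, pattern)
    return best[0] == "allow"


groups = parse_groups(ROBOTS_TXT)
maps_for_ai = is_allowed(groups, "company1-ai", "/maps/europe")
maps_for_other = is_allowed(groups, "examplebot", "/maps/europe")
```

   In this sketch a crawler named company1-ai is refused /maps/europe,
   while a crawler without the -ai suffix matches no group and is
   therefore unaffected by these rules.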

   This proposal also differs from the new control rules
   DisallowAITraining and AllowAITraining proposed by draft-canel-
   robots-ai-control (https://datatracker.ietf.org/doc/draft-canel-
   robots-ai-control/).  From a semantic perspective, it is problematic
   to create purpose-specific rules, such as DisallowThisProperty and
   DisallowAnotherProperty, that have the same meaning and effect as
   the existing verbs Disallow and Allow.

   In our proposal, the information about the agent's purpose is carried
   in the User-Agent itself, which makes it possible to filter out AI
   training agents using simple regex and the existing semantics.

Acknowledgements

   The author would like to thank Jari Arkko for his review and feedback
   on short notice.

Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/rfc/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.

Author's Address

   Jaime Jimenez
   Ericsson
   Email: jaime@iki.fi
