Agenda IETF123: maprg: Wed 07:30
agenda-123-maprg-sessa-02
| Meeting Agenda | Measurement and Analysis for Protocols (maprg) RG |
|---|---|
| Date and time | 2025-07-23 07:30 |
| Title | Agenda IETF123: maprg: Wed 07:30 |
| State | Active |
| Last updated | 2025-07-16 |
IRTF maprg agenda for IETF-123 (Madrid)
Special Session on AI Crawler Traffic Impacts
Date: Wednesday, 23 July 2025, Session I 9:30-11:00
Full client with Video: https://meetecho.ietf.org/conference/?group=maprg&short=maprg&item=1
Room: Auditorio
IRTF Note Well: https://irtf.org/policies/irtf-note-well-2019-11.pdf
Agenda
- Intro - Mirja/Dave (5 min)
- Web Crawl Refusals: Insights From Common Crawl - Mostafa Ansar (10 min) (remote)
- Somesite I Used To Crawl: Awareness, Agency and Efficacy in Protecting Content Creators From AI Crawlers - Elisa Luo (15 min) (remote)
- Automated traffic affects IETF services - Robert Sparks (5 min)
- Bot Traffic at Wikimedia: Measurement, Identification, and Response - Chris Petrillo and Birgit Müller (20 min) (remote)
- AI Crawlers - Insights from Cloudflare - Thibault Meunier (5 min)
- IndexNow: A Real-Time Protocol for Measurable Web Indexing Efficiency - Krishna Madhavan (20 min)
Abstracts
Web Crawl Refusals: Insights From Common Crawl
Authors: Mostafa Ansar (University of Twente), Anna Sperotto (University of Twente), Ralph Holz (University of Münster)
Abstract:
Web crawlers are an indispensable tool for collecting research data. However, they may be blocked by servers for various reasons. This can reduce their coverage. In this early-stage work, we investigate server-side blocks encountered by Common Crawl (CC). We analyze page contents to cover a broader range of refusals than previous work. We construct fine-grained regular expressions to identify refusal pages with precision, finding that at least 1.68% of sites in a CC snapshot exhibit a form of explicit refusal. Significant contributors include large hosters. Our analysis categorizes the forms of refusal messages, from straight blocks to challenges and rate-limiting responses. We are able to extract the reasons for nearly half of the refusals we identify. We find an inconsistent and even incorrect use of HTTP status codes to indicate refusals. Examining the temporal dynamics of refusals, we find that most blocks resolve within one hour, but also that 80% of refusing domains block every request by CC. Our results show that website blocks deserve more attention as they have a relevant impact on crawling projects. We also conclude that standardization to signal refusals would be beneficial for both site operators and web crawlers.
Publication: PAM 2025: https://doi.org/10.1007/978-3-031-85960-1_9
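As an illustrative sketch (not taken from the paper), refusal pages of the kind described above could be flagged with category-specific regular expressions; the phrases below are hypothetical stand-ins for the paper's fine-grained patterns:

```python
import re

# Hypothetical refusal phrases, one small pattern per category; the
# paper's actual fine-grained expressions are not reproduced here.
REFUSAL_PATTERNS = {
    "block": re.compile(r"access (denied|forbidden)|you have been blocked", re.I),
    "challenge": re.compile(r"verify (that )?you are (a )?human|captcha", re.I),
    "rate_limit": re.compile(r"too many requests|rate limit exceeded", re.I),
}

def classify_refusal(page_text: str):
    """Return the first refusal category whose pattern matches, else None."""
    for category, pattern in REFUSAL_PATTERNS.items():
        if pattern.search(page_text):
            return category
    return None
```

For example, `classify_refusal("Error 429: Too Many Requests")` returns `"rate_limit"`, illustrating how page contents can signal a refusal even when the HTTP status code is used inconsistently.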
Somesite I Used To Crawl: Awareness, Agency and Efficacy in Protecting Content Creators From AI Crawlers
Authors: Enze Liu (UC San Diego), Elisa Luo (UC San Diego), Shawn Shan (University of Chicago), Geoffrey M. Voelker (UC San Diego), Ben Y. Zhao (University of Chicago), Stefan Savage (UC San Diego)
Abstract:
The success of generative AI relies heavily on training on data scraped through extensive crawling of the Internet, a practice that has raised significant copyright, privacy, and ethical concerns. While few measures are designed to resist a resource-rich adversary determined to scrape a site, crawlers can be impacted by a range of existing tools such as robots.txt, NoAI meta tags, and active crawler blocking by reverse proxies.
In this work, we seek to understand the ability and efficacy of today's networking tools to protect content creators against AI-related crawling. For targeted populations like human artists, do they have the technical knowledge and agency to utilize crawler-blocking tools such as robots.txt, and can such tools be effective? Using large-scale measurements and a targeted user study of 182 professional artists, we find strong demand for tools like robots.txt, but one significantly constrained by hurdles in technical awareness, limited agency in deploying them, and limited efficacy against unresponsive crawlers. We further test and evaluate network-level crawler blocking by reverse proxies, and find that despite very limited deployment today, their reliable and comprehensive blocking of AI crawlers makes them the strongest protection for artists moving forward.
Publication: to be published at IMC 2025
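For context, a minimal sketch (not from the paper) of checking whether a site's robots.txt disallows particular crawler user agents, using Python's standard-library parser; the agent names and URL are illustrative placeholders:

```python
from urllib import robotparser

# Check which user agents a site's robots.txt allows to fetch a page.
# "GPTBot" and "CCBot" are commonly cited crawler names; example.com
# is a placeholder domain.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

for agent in ("GPTBot", "CCBot", "*"):
    allowed = rp.can_fetch(agent, "https://example.com/")
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

As the abstract notes, such directives are advisory: a crawler that never consults robots.txt is unaffected by them, which is why the study also examines blocking by reverse proxies.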
Bot Traffic at the IETF Datatracker
Speaker: Robert Sparks
Abstract:
Automated traffic levels towards our web services have increased dramatically within the last year. Some of that traffic self-identifies as AI; much of the rest is indistinguishable from it. Tooling options for mitigating the effect of this traffic while preserving access for human users are limited, particularly given our goal of making our work widely accessible. This is a quick snapshot and description of what we're seeing and what we're currently doing in reaction.
Bot Traffic at Wikimedia: Measurement, Identification, and Response
Speakers: Chris Petrillo and Birgit Müller
Abstract:
Wikimedia has observed significant changes in traffic behavior and volume due to the growth of large language models (LLMs) and associated technologies. Automated requests for our content have grown exponentially and cause recurring resource, abuse, and support issues. Managing this expansion has created unique challenges for the organization, given Wikimedia's universal free knowledge mission. Maintaining the sustainability of the platform and prioritizing human and mission-oriented access first has required nuanced approaches to identifying and responding to observed trends.
This talk provides examples of automated traffic observed on Wikimedia projects, highlighting traffic trends, bot behavior, and resource impacts. The team will discuss methods used to differentiate high-volume automated requests from good-faith community and partner interactions. We will then review current risk strategies aimed at reducing server load and mitigating potential abuse without impacting general service availability. There will be time for Q&A at the end of the presentation to discuss policies that could apply to the broader content-platform community.
AI Crawlers - Insights from Cloudflare
Speaker: Thibault Meunier
Abstract:
The rise of AI has changed the web crawling landscape and the ways origins manage this automated traffic. In this talk, Cloudflare shares insights drawn from the public data it publishes. The discussion also touches on how this challenges some existing conceptions of traffic filtering.
IndexNow: A Real-Time Protocol for Measurable Web Indexing Efficiency
Speaker: Krishna Madhavan
Abstract:
As the web continues to scale and diversify, traditional pull-based crawling approaches face growing challenges around latency, redundancy, and resource consumption. IndexNow offers a transformative shift toward a push-based paradigm - enabling site owners to proactively notify search engines of content changes (new, updated, or deleted URLs) in real time. This talk will present IndexNow as a modern web protocol designed to improve indexing freshness, reduce unnecessary crawl traffic, and better align with operational efficiency goals.
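To make the push model concrete, here is a minimal sketch of an IndexNow submission using Python's standard library; the host, key, and URL are placeholders, and the shared api.indexnow.org endpoint is assumed:

```python
import json
from urllib import request

# Notify participating search engines that a URL has changed.
# The key must also be hosted on the site (e.g. /your-indexnow-key.txt)
# so the receiving engine can verify ownership of the host.
payload = {
    "host": "example.com",
    "key": "your-indexnow-key",
    "urlList": ["https://example.com/updated-page"],
}
req = request.Request(
    "https://api.indexnow.org/indexnow",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json; charset=utf-8"},
)
with request.urlopen(req) as resp:
    print(resp.status)  # 200 or 202 indicates the submission was accepted
```

A single push like this replaces repeated pull-based recrawls of unchanged pages, which is the efficiency argument the talk develops.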