Date: November 4, 2025
Time: 14:30–16:30 UTC
Location: Meetecho (recorded session)
Chairs: Dirk Kutscher (HKUST(GZ)), Lixia Zhang (UCLA)
Notetaker: Saidu Sokoto
Summary: An in-depth examination of cloud failure patterns, focusing on cascading outages in complex distributed systems and the AWS October 2025 incident.
Main Points
- Human error (70–85%) remains the dominant cause of outages, but automation now introduces “machine error.”
- The AWS Oct 19–20, 2025 outage began with an empty DNS record for dynamodb.us-east-1.amazonaws.com, which caused cascading failures in DynamoDB, EC2, and Network Load Balancers.
- A race condition between asynchronous "enactors" led to data loss.
- Key lesson: “Asynchrony is a way of life” in distributed systems; lack of concurrency control in automated agents led to the failure.
- Proposed safety checks and “guardrails” for future automation and AI-driven infrastructure management.
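The race condition and the proposed guardrails in the points above can be sketched in a few lines. This is purely illustrative under assumed names (the table, the generation counter, and `apply_plan` are hypothetical), not AWS's actual implementation: two automated enactors perform read-modify-write on a DNS record, and mutual exclusion plus a staleness check plus a "never install an empty record set" guardrail prevent the failure mode discussed in the session.

```python
import threading

# Illustrative sketch only: names and structures are assumptions, not AWS internals.
dns_table = {"dynamodb.us-east-1.amazonaws.com": "1.2.3.4"}
lock = threading.Lock()

def apply_plan(record, addresses, plan_generation, current_generation):
    """Apply a DNS plan only if it is not stale; guard shared state with a lock."""
    with lock:  # mutual exclusion between concurrent enactors
        if plan_generation < current_generation[0]:
            return False          # stale plan: refuse to overwrite newer state
        current_generation[0] = plan_generation
        if addresses:             # guardrail: never install an empty record set
            dns_table[record] = addresses
        return True

gen = [1]
# A newer plan (generation 2) is applied; a delayed, stale plan (generation 1)
# arriving afterwards is rejected instead of emptying the record.
apply_plan("dynamodb.us-east-1.amazonaws.com", "5.6.7.8", 2, gen)
stale = apply_plan("dynamodb.us-east-1.amazonaws.com", "", 1, gen)
```

Without the generation check and the non-empty guardrail, the delayed plan would overwrite the record with an empty value, which is the shape of the failure discussed above.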
Discussion Highlights (from chat and live Q&A)
- Brett Carr (AWS) clarified that “droplets” are physical hosts, not EC2 instances, confirming the presentation’s accuracy.
- Christian Huitema: Compared this to a “thundering herd” problem — multiple agents overwhelming shared resources.
- Lixia Zhang: “The root cause really is complexity.” Suggested the need for complexity audits akin to security audits.
- Vicky Risk: Observed that automation enabled dangerous parallelism — when humans did tasks sequentially, these errors were less likely.
- Andrew Campling: Urged involvement of ops engineers early in system design to mitigate such risks.
- Lixia Zhang: Distinguished between root cause of local failure (AWS) and root cause of global disruption, noting universities and education systems worldwide were affected.
- Indranil Gupta: Emphasized fixing bad programming “habits” that propagate through infrastructure layers (DNS often being the visible symptom). Suggested built-in configuration checks and mutual exclusion mechanisms for DNS enactors.
- Vicky Risk: Proposed “velocity controls” for health-check systems to prevent cascade amplification.
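The "velocity controls" idea in the last point can be sketched as a simple action-rate budget: the health-check system may take at most N remediation actions (e.g., removing hosts) per time window, after which further actions are held for review. All names and thresholds below are illustrative assumptions.

```python
from collections import deque
import time

class VelocityControl:
    """Cap the rate of automated remediation actions so a faulty health-check
    signal cannot trigger a mass removal (a sketch, not a real system)."""

    def __init__(self, max_actions, window_seconds):
        self.max_actions = max_actions
        self.window = window_seconds
        self.actions = deque()  # timestamps of recent actions

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop actions that have aged out of the window.
        while self.actions and now - self.actions[0] > self.window:
            self.actions.popleft()
        if len(self.actions) >= self.max_actions:
            return False  # over budget: escalate to a human instead
        self.actions.append(now)
        return True

vc = VelocityControl(max_actions=3, window_seconds=60)
decisions = [vc.allow(now=t) for t in [0, 1, 2, 3]]  # fourth action is refused
```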
Action Points
- Explore research on complexity auditing and automated safety mechanisms for large-scale systems.
- Discussion to continue on DINRG mailing list regarding lessons from automation failures.
Research Question: How reliant is the Internet on a small number of organizations for DNS and web hosting?
Key Findings
- DNS and web hosting are highly concentrated: Cloudflare and Amazon each host >30% of domains; five companies (Cloudflare, Amazon, Akamai, Fastly, Google) host ~60% of top sites.
- Over 70% of domains use a single organization for authoritative DNS.
- Consolidation is global and consistent across vantage points.
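Concentration figures like the ones above can be computed in miniature from a domain-to-provider mapping. The data and helper names below are made-up illustrations, not the study's dataset or code:

```python
from collections import Counter

# Hypothetical toy dataset: domain -> hosting provider.
hosting = {
    "a.example": "Cloudflare", "b.example": "Cloudflare", "c.example": "Amazon",
    "d.example": "Amazon",     "e.example": "Akamai",     "f.example": "Fastly",
    "g.example": "Google",     "h.example": "SmallHost",
}

def provider_shares(mapping):
    """Fraction of domains hosted by each provider, largest first."""
    counts = Counter(mapping.values())
    total = sum(counts.values())
    return {p: n / total for p, n in counts.most_common()}

def top_n_share(mapping, n):
    """Combined share of the n largest providers."""
    shares = sorted(provider_shares(mapping).values(), reverse=True)
    return sum(shares[:n])

shares = provider_shares(hosting)
top5 = top_n_share(hosting, 5)
```

On real data, a periodic re-run of this kind of computation (per vantage point, per region) is what a standing re-measurement activity would track.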
Discussion Highlights
- Brian Trammell: Proposed analyzing consolidation within cloud providers by region (e.g., AWS us-east-1) via IP geo-location.
- Pete Resnick: Asked if the distinction between front-end (CDN) and back-end hosting was measured; Nick clarified that front-end dependencies were the focus, suggesting future work on backend metrics.
- Gianpaolo Scalone: Suggested extending analysis to include ECH-enabled domains, since they hide back-end structure.
- Christian Huitema: Shared that ICANN maintains monthly DNS concentration metrics (ICANN ITHI M9 graph) — potential collaboration.
- Andrew Campling: Warned of "digital colonialism": global power concentrated in a few hosts, who may ignore takedown requests and thereby shape global information flow.
- Dan Sexton: Expanded this into content control concerns—centralized hosts deciding what content remains online.
- Pete Resnick: Pointed out that decentralization complicates content moderation, creating a “whack-a-mole” scenario, which might not be entirely negative.
- Tom Newton: Suggested real-time tracking of website dependencies (“cloudiuse.com”) to expose such concentration dynamically.
Action Items
- Nick to release the measurement code and dataset publicly.
- DINRG to consider periodic re-measurement as a standing activity, potentially in collaboration with ICANN.
Motivation
- While Encrypted Client Hello (ECH) hides destinations, client-facing servers (CDNs) still see both client IP and target domain — allowing correlation and surveillance.
- This centralization introduces jurisdictional mismatch and privacy risks.
Proposal: Customer-Facing Relay (CFR)
- Lightweight relay operated at the ISP or enterprise edge.
- Randomizes or rotates source IPs, decoupling source identity from destination.
- Preserves TLS/ECH semantics without requiring DNS changes.
- Builds on customer–ISP trust relationship for accountability.
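A toy model of the relay behavior described above, under assumed names and addresses (this is an illustrative sketch, not the proposal's actual design): the relay forwards requests from a rotating pool of relay-owned source addresses, so the destination server sees the target domain but not a stable client IP, while the client-to-flow mapping stays at the ISP edge for accountability.

```python
import random

class CustomerFacingRelay:
    """Sketch of the CFR idea: rotate source identity, keep accountability local."""

    def __init__(self, address_pool, seed=None):
        self.pool = list(address_pool)       # relay-owned source addresses (assumed)
        self.rng = random.Random(seed)
        self.flows = {}                      # flow_id -> real client IP, ISP-side only

    def forward(self, client_ip, target_domain):
        relay_ip = self.rng.choice(self.pool)   # decouple source identity
        flow_id = len(self.flows)
        self.flows[flow_id] = client_ip         # retained at the ISP edge, not sent on
        return {"src": relay_ip, "dst": target_domain}  # what the CDN observes

relay = CustomerFacingRelay(["198.51.100.1", "198.51.100.2"], seed=7)
seen = relay.forward("203.0.113.9", "example.com")  # CDN never sees 203.0.113.9
```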
Discussion Highlights
- Christian Huitema: Compared CFR to Oblivious HTTP (OHTTP); Gianpaolo replied CFR is simpler and more deployable at the network edge.
- Andrew Campling: Praised CFR for balancing privacy with accountability through contractual ISP relationships.
- Fig: Argued that jurisdictional mismatch can be desirable (e.g., bypassing censorship); Gianpaolo clarified CFR complements ECH by adding source privacy, not removing freedom.
- Aldo: Linked CFR’s goals to the EU’s Digital Operational Resilience Act (DORA) on infrastructure independence.
- Christian Huitema: Noted privacy must protect against both big tech surveillance and state control by ISPs; CFR must address both adversaries.
- Andrew Campling: Quipped, “I can’t vote out big tech.”
- Dan Sexton: Highlighted vertical consolidation — browsers, devices, CDNs, and DNS often controlled by a single vendor, compounding privacy issues.
Action Points
- Continue technical analysis of CFR deployment risks on the mailing list.
- Coordinate with PEARG and HRPC for broader privacy and decentralization implications.
Overview
- Presented GFDS, a modular system for building decentralized applications with dynamic behavior and adaptive resource use.
- Architecture includes protocol, event, discovery, resource, timer, communication, configuration, and security managers.
- Reference implementation: Babel, supporting decentralized storage, ML, and messaging.
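A minimal sketch of the modular manager architecture, using the component names from the list above; the registration API itself is an assumption for illustration, not GFDS or Babel's actual interface:

```python
class Manager:
    """One pluggable component behind a common lifecycle interface (assumed API)."""

    def __init__(self, name):
        self.name = name
        self.started = False

    def start(self):
        self.started = True

class Framework:
    """Registry that wires named managers together and starts them."""

    def __init__(self):
        self.managers = {}

    def register(self, manager):
        self.managers[manager.name] = manager

    def start_all(self):
        for m in self.managers.values():
            m.start()

fw = Framework()
for name in ["protocol", "event", "discovery", "resource",
             "timer", "communication", "configuration", "security"]:
    fw.register(Manager(name))
fw.start_all()
running = all(m.started for m in fw.managers.values())
```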
Discussion Highlights
- Roland Bless (KIT): Mentioned “Vailet,” a Rust framework with similar goals. Diogo expressed interest.
- Dirk Kutscher: Requested elaboration on how GFDS supports decentralization properties and encouraged follow-up via the mailing list.
Action Points
- Further discussion of GFDS capabilities and evaluation use cases (swarm robotics, IoT, Web3) on the mailing list.
Panelists: Christine Lemmer-Webber (ActivityPub), Brian Truong (AT Protocol/Bluesky), Ted Hardie (moderator), with active participation from the audience.
- Phillip Hallam-Baker: Said that the complexity of choosing a server deterred adoption.
Decentralization vs. Scale:
- Christian Huitema: Observed that trust and defederation mechanics mirror email's evolution ("you have to be that tall to keep standing").
Privacy, Abuse, and Trust:
- Brian Truong: Responded that AT is mobile-first, supports hardware key storage (e.g., YubiKey), and prioritizes practical usability.
Ecosystem Maturity:
Closing Reflections
- Dirk Kutscher: Quoted Tocqueville — “Decentralization is really, really hard.” Called DINRG a natural home for this ongoing exploration.
- Lixia Zhang: Emphasized starting with clear problem definitions before standardization.
Action Points
- Draft a DINRG informational document summarizing architectural trade-offs (identity, portability, governance, moderation).
- Coordinate with AT Protocol BoF for follow-up.
| Topic | Action | Responsible |
|---|---|---|
| Cloud Outages | Explore complexity auditing and automation safety | Community |
| Infrastructure Consolidation | Publish dataset, continue periodic measurement | Nick Feamster & DINRG |
| Source Privacy (CFR) | Analyze deployment risks, coordinate with privacy RGs | Gianpaolo Scalone |
| GFDS Framework | Share implementation details, collect feedback | Diogo Jesus |
| Social Systems Panel | Prepare informational draft on decentralization trade-offs | Chairs & Panelists |
Compiled from manual notes, auto-generated minutes, and full chat log of the DINRG session at IETF-124.