
IETF Last Call Review of draft-ietf-cats-usecases-requirements-10
review-ietf-cats-usecases-requirements-10-artart-lc-bray-2025-12-10-00

Request Review of draft-ietf-cats-usecases-requirements
Requested revision No specific revision (document currently at 14)
Type IETF Last Call Review
Team ART Area Review Team (artart)
Deadline 2025-12-17
Requested 2025-12-03
Authors Kehan Yao, Luis M. Contreras, Hang Shi, Shuai Zhang, Qing An
I-D last updated 2026-05-20 (Latest revision 2026-02-02)
Completed reviews Rtgdir Early review of -07 by Ines Robles (diff)
Tsvart IETF Last Call review of -10 by Zaheduzzaman Sarker (diff)
Dnsdir IETF Last Call review of -10 by Jim Reid (diff)
Genart IETF Last Call review of -10 by Roni Even (diff)
Artart IETF Last Call review of -10 by Tim Bray (diff)
Secdir IETF Last Call review of -11 by Daniel Migault (diff)
Rtgdir IETF Last Call review of -10 by Linda Dunbar (diff)
Opsdir IETF Last Call review of -12 by Samier Barguil (diff)
Artart Telechat review of -12 by Tim Bray (diff)
Tsvart Telechat review of -12 by Zaheduzzaman Sarker (diff)
Assignment Reviewer Tim Bray
State Completed
Request IETF Last Call review on draft-ietf-cats-usecases-requirements by ART Area Review Team Assigned
Posted at https://mailarchive.ietf.org/arch/msg/art/9lz_X5zIiuetwnT0w-7WDez6bKY
Reviewed revision 10 (document currently at 14)
Result Not ready
Completed 2025-12-10
This is the ARTART review of draft-ietf-cats-usecases-requirements-10. It has
no special standing and is offered as input to further discussion of the
subject.

While I have never looked at ALTO, I spent 5+ years as an employee of AWS where
a central everyday concern was the design and operation of distributed systems,
so I feel I have some exposure to the issues being addressed.

I feel that this document is not suitable for publication as an RFC. Quoting
from the Shepherd Report:

  The WG milestones only explicitly say to adopt this document (not to publish
  as an RFC). However, the charter does not preclude this. The working group
  discussed this point and had strong consensus that publication as an
  Informational RFC would be helpful for future protocol work.

This document contains a lot of RFC 2119 language, which I don't think belongs
in an informational RFC.  After my review, I am left dubious of the claim that
this "would be helpful for future protocol work".  Perhaps this would be
suitable for leaving as a draft for guiding the work of the WG?

I found this draft difficult (and very time-consuming) to read and am not
convinced that it offers practical value.  Perhaps it is aimed at a class of
system or protocol designer who is working on problems different from those I
faced, so my experience is not relevant and the comments below are not helpful.
If so, sorry.

The draft is extremely verbose, at 11K words in length. I found it difficult to
read and understand because of this and because the language is often general
and nontechnical.  (The quality of the language also needs work; there are many
grammatical errors.)  It would benefit from the attention of an editor with the
goal of reducing its size and increasing its clarity.  For example, I think the
entirety of Section 1 could be replaced by the following without loss of value:
"It is often desirable to distribute compute workloads across multiple compute
resources.  These resources can include servers and load balancers in data
centers and compute capacity deployed in CDN POPs.  Routing requests for
service to such nodes with the goals of providing good response to variable
loads presents multiple complex problems."

2, 3.1 Edge computing could mean two different things: resources at CDN POPs,
or resources at infrastructure locations which are specialized for mediating
access between internal servers and the Internet. The latter offer functions
including load balancing and firewalling. The draft uses the term "edge" in a
very generalized way.

I am unconvinced that some of the scenarios offered are realistic:

4.1 "Cloud VR/AR introduces the concept of cloud computing to the rendering of
audiovisual assets in such applications. Here, the edge cloud helps
encode/decode and render content." I'm surprised. Rendering AR/VR requires
considerable compute cycles and typically would be accomplished either on
client hardware (mobile phone, AR/VR headset) or in a data center server, the
results being cached by the edge. But rendering on edge devices? I don't think
so. I haven't worked on AR in a few years so maybe I'm out of date, but this is
still surprising.

4.2 This section repeatedly discusses the same problem, which could be
summarized as “try to use the nearest edge PoP to reduce latency, unless it’s
overloaded, in which case fall back to somewhere else, while reporting the
problem”.
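
As an illustration only, the one-sentence policy summarized for 4.2 can be
sketched in a few lines. Nothing here comes from the draft: the PoP class, the
pick_pop function, and the 0.8 load threshold are hypothetical names and values
chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PoP:
    """Hypothetical edge PoP record; all fields are assumptions for this sketch."""
    name: str
    latency_ms: float  # measured network latency from the client
    load: float        # current utilization, from 0.0 to 1.0


def pick_pop(pops: List[PoP], max_load: float = 0.8) -> Tuple[Optional[PoP], List[str]]:
    """Return the lowest-latency PoP whose load is below max_load.

    Also returns the names of closer PoPs that were skipped because they were
    overloaded, so the caller can report the problem.
    """
    skipped: List[str] = []
    for pop in sorted(pops, key=lambda p: p.latency_ms):
        if pop.load < max_load:
            return pop, skipped
        skipped.append(pop.name)
    # Every PoP is overloaded: nothing to choose, everything to report.
    return None, skipped


if __name__ == "__main__":
    candidates = [
        PoP("near", latency_ms=5.0, load=0.95),
        PoP("mid", latency_ms=12.0, load=0.40),
        PoP("far", latency_ms=40.0, load=0.10),
    ]
    chosen, overloaded = pick_pop(candidates)
    print(chosen.name if chosen else None, overloaded)
```

With the nearest PoP at 95% load, the sketch falls back to the next-closest one
and returns the skipped PoP's name for reporting.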

4.5.2 “Distributed AI training” - Is this really a thing?  It’s not my
understanding of how model building/training is done in practice.  This and the
other use cases would benefit from citations to real-world research.

5.2, R5 “The Resource Model MUST be implementable in an interoperable manner.”
The use of RFC 2119 language on such a vague, general statement feels like
misuse to me.  This comment applies to a high proportion of the requirement
assertions.

R6: "The Resource Model MUST be executable in a scalable manner. That is, an
agent implementing the Resource Model MUST be able to execute it at the
required time scale and at an affordable cost (e.g., memory footprint, energy,
etc.)" The absence of discussion of scaling metrics such as “p99 latencies” is
striking. Note that 5.3 is about metrics, but provides no examples and does not
enumerate any specific metrics.
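
For readers unfamiliar with the term, a p99 latency is the value that 99% of
observed requests complete at or below. A minimal sketch follows, assuming the
nearest-rank definition of a percentile (one of several in common use); the
percentile function and the sample data are hypothetical, not taken from the
draft.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of all
    samples at or below it. p is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Invented data: 980 fast requests (10 ms) and 20 slow ones (500 ms).
    latencies_ms = [10.0] * 980 + [500.0] * 20
    print("p50:", percentile(latencies_ms, 50))
    print("p99:", percentile(latencies_ms, 99))
```

On this invented data the median is 10 ms while the p99 is 500 ms, which is
why tail percentiles rather than averages are the usual way to state a
"required time scale".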

R7: "The Resource Model MUST be useful." Once again, the 2119 language feels
inapplicable.

R18: "CATS systems MUST maintain instance affinity for stateful sessions and
transactions." This may be true in some service scenarios but in large-scale
distributed systems it can cause all sorts of problems.  I personally was
severely bitten by a misguided attempt to provide instance affinity in a
large-scale cloud application; see
https://www.tbray.org/ongoing/When/201x/2019/09/25/On-Sharding (also have a
look at some of the other issues discussed there, which feel like they ought to
be relevant to this subject matter).

There is no discussion of shuffle sharding, which is overwhelmingly seen as a
best practice to make systems resilient in the face of inevitable server
failures.  In fact, there is little discussion of resilience in the face of
server failures. That feels like one of the big and hard problems in operating
real-world distributed systems.
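
For context, shuffle sharding assigns each customer a small pseudo-random
subset of the available servers, so that a failure or overload tied to one
customer affects only the few others that happen to share its whole subset.
The sketch below is illustrative only: the shuffle_shard function, the node
names, and the choice of a SHA-256-derived seed are assumptions of this
example, not anything specified by the draft or by a particular product.

```python
import hashlib
import random
from typing import List, Sequence


def shuffle_shard(customer_id: str, nodes: Sequence[str], shard_size: int) -> List[str]:
    """Deterministically pick shard_size nodes for a customer.

    The customer ID seeds a private random generator, so the same customer
    always maps to the same subset while different customers get different,
    mostly non-identical subsets.
    """
    if not 0 < shard_size <= len(nodes):
        raise ValueError("shard_size must be between 1 and len(nodes)")
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Sort first so the result does not depend on the caller's node ordering.
    return sorted(rng.sample(sorted(nodes), shard_size))


if __name__ == "__main__":
    fleet = [f"node-{i}" for i in range(8)]  # hypothetical 8-node fleet
    for customer in ("alice", "bob", "carol"):
        print(customer, shuffle_shard(customer, fleet, shard_size=2))
```

With 8 nodes and shards of 2 there are 28 possible shards, so only a small
fraction of customers would share both nodes with any given misbehaving
customer; larger fleets make that fraction smaller still.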

The Security Considerations section seems short.  One of the functions required
of every system is authentication of its users, and not all classes of servers
can perform this task; how does authentication figure in the CATS ecosystem?