IETF 124 RTGWG Minutes

09:30-11:30 - Wednesday Session I, Nov 5, 2025

Chairs:
Jeff Tantsura (jefftant.ietf@gmail.com)
Yingzhen Qu (yingzhen.ietf@gmail.com)

WG Page: https://datatracker.ietf.org/group/rtgwg/about/
Materials: https://datatracker.ietf.org/meeting/124/session/rtgwg


  1. 9:30
    Meeting Administrivia and WG Update
    Chairs (5 mins)
  1. 9:40
    Multicast usage in LLM MoE
    https://datatracker.ietf.org/doc/draft-zhang-rtgwg-llmmoe-multicast/

    Sandy Zhang (10 mins)

  1. 9:50
    Distributed Inference Network (DIN) Problem Statement, Use Cases,
    and Requirements

    https://datatracker.ietf.org/doc/draft-song-rtgwg-din-usecases-requirements/

    Jian Song (10 mins)

  1. 10:00
    Agent networking use cases, requirements, and architecture
    https://datatracker.ietf.org/doc/draft-zl-agents-networking-architecture/

    Nan Geng / Li Zhang (10 mins)

  1. 10:10
    Scheduling Network Resources for Machine Learning Clusters
    https://datatracker.ietf.org/doc/draft-kompella-rtgwg-mlnwsched/
    Vishnu Pavan Beeram (10 mins)

==============================================================

FANTEL

  1. 10:20
    Fast Notification Problem Statement
    https://datatracker.ietf.org/doc/draft-dong-fantel-problem-statement/

    Jie Dong (15 mins)

  1. 10:35
    Advertising Router Information
    https://datatracker.ietf.org/doc/draft-zzhang-rtgwg-router-info/
    Jeffrey Zhang (15 mins)
  1. 10:50
    Proxy for Congestion Notification
    https://datatracker.ietf.org/doc/draft-xiao-rtgwg-proxy-congestion-notification/

    Xiao Min (10 mins)

  1. 11:00
    Congestion Control Based on SRv6 Path
    https://datatracker.ietf.org/doc/draft-liu-rtgwg-srv6-cc/
    Yisong Liu (10 mins)

No questions asked.

  1. 11:10
    Fast Notification for tunnel-based lossless RDMA transmission in
    WAN

    Credit-based Flow Control Based on RSVP for RDMA transmission in
    WAN

    https://www.ietf.org/archive/id/draft-hzh-fantel-wan-tunnel-01.txt
    https://datatracker.ietf.org/doc/draft-hu-rtgwg-cbfc-rsvp/
    Jiayuan Hu (10 mins)

===================================================================

  1. 11:20
    Kademlia-directed ID-based Routing Architecture (KIRA)
    https://datatracker.ietf.org/doc/draft-bless-rtgwg-kira/
    Roland Bless (10 mins)

No questions asked.


Chat History

Sasha Vainshtein
00:14:55

Do I understand correctly that actual solution will be discussed in the
BIER WG?
Sasha Vainshtein
00:15:55

That was about the multicast for LLM
Jeff Tantsura
00:18:52

@sasha - there was indeed a presentation and discussion in BIER
Weiqiang Cheng
00:18:53

Sorry for missing voice due to bad multicast network;)
Weiqiang Cheng
00:21:20

My question: The current switch chips for AIDC can't support BIER, do
you need the new asic for the solution?
Greg Shepherd
00:24:55

Yes. Hardware BIER support is ideal. Jericho2 and Jericho3 both support
BIER. I believe there are others as well
Alex Nichol
00:26:31

There will be latency implications if DNX is required
Jeff Tantsura
00:27:03

the big question if GPU/NIC will implement BIER, and need for proxy on
the FHR
Weiqiang Cheng
00:35:02

When considering Broadcom's chips, for AIDC applications, the XGS series
is primarily used rather than the DNX series. For example, the Tomahawk
5, to the best of my knowledge, does not support BIER.
Yingzhen Qu
00:39:46

@meetecho, camera to the presenter please
Sasha Vainshtein
00:40:04

Lots of thanks to David Black for asking the question about the previous
presentation.
Jeffrey Haas
00:42:48

As I'm noting in chat with someone else, rtgwg is the routing area's
dispatch group, so we get a lot of peculiar early work. That said, the
routing and forwarding discussion is only a portion of the work that was
presented. The minute you start having conversations, it starts being
more of an application protocol.
Jeffrey Haas
00:43:23

Also, no fair that there's another slurm than the one in RPKI. :-)
Jeff Tantsura
00:44:09

perhaps use of word "router" confuses people, either application router
or MoE router have nothing to do with packet routers :)
Adrian Farrel
00:46:37

@Jeff. Further, if there is only one hop (at the layer where the
"routing" is happening) then it is just a classification, steering, or
destination choice.
Yingzhen Qu
00:51:37

Please help with the minutes:
https://notes.ietf.org/notes-ietf-124-rtgwg?both
Adrian Farrel
00:53:09

I wonder whether we have a WG in RTG that could pick up new work related
to resource reservations and TE?
Adrian Farrel
00:53:23

Perhaps Pavan knows the chairs
Jeff Tantsura
00:53:33

@adrian - hmm....
Shaofu Peng
00:54:27

even using resource reservation, may still get congestion due to link
failure...
Boris Khasanov
00:55:06

@Adrian - CATS?
Sasha Vainshtein
00:55:22

@Adrian - I second Jeff's reaction😉
John Scudder
00:55:35

To @Tal Mizrahi's question, my impression from the talk and a very brief
scan of the draft (I was not involved in its development) is that the
word "scheduling" may be misleading; that this is traditional TE-style
resource reservation, not DETNET-style. (But maybe I've misunderstood
either the question or the draft...)
Adrian Farrel
00:56:26

I think my sarcasm comes around...
If you want to TE the network, go ahead. If you want to schedule server
resources (there are plenty of places doing that work). If you want to
coordinate this work then maybe CATS or NMRG
Jeffrey Haas
00:57:04

... we say as we're about to spend our time talking about congestion in
the routing area.
Jeffrey Haas
00:57:33

(drafts, I mean. not the plentiful other places we create congestion)
Adrian Farrel
01:00:56

Well, I am also a little puzzled as to whether Fantel is RTG, although
the consumer of notifications is routing/steering, not throttling. So
possibly right
Vishnu Beeram
01:03:36

@Adrian -- To TE or not to TE is no longer the question :) -- we didn't
take it to CATS because this is not traffic steering; the tools for
doing MPTE reservation are being discussed in TEAS and will continue to
get discussed there; the reason for presenting this draft here is
because it is related to the fantel conversation..
Tony Li
01:04:41

It would seem like 'scalable' is also a requirement. Is there any
evidence that all of these requirements can be concurrently satisfied?
Mike McBride
01:05:52

@Adrian - RTG because need to determine how fast notifications should be
delivered (new protocols, extensions, UDP-based...)
Alia Atlas
01:08:27

There does seem to be a wide extension around a basic problem. It'd be
interesting to understand how each is intended to work independently. I
do see the work making sense in RTG - for the impact on IP
forwarding/steering and for the fast-notification protocol.
Boris Khasanov
01:08:41

@Tony, yes - multidimensional scale is very good question
Tony Li
01:09:08

This seems like it's now getting into solutions. And losing
'lightweight'.
Jeff Tantsura
01:09:47

how fast is fast?
Mike McBride
01:10:17

millisecond, sub millisecond?
John Scudder
01:10:27

Same-day service. :-P
Adrian Farrel
01:10:35

FTL
Jeffrey Haas
01:10:36

The place where there's existing demonstration of some of the overlap is
the correlators in routing for the congestion notification. See
draft-ietf-idr-next-next-hop-nodes as one example. That said, where the
congestion mechanism lives is what we're talking about.
Jeff Tantsura
01:10:38

@john :)
Carolina Caeiro
01:11:01

Guys, I am a bit confused by process here. I understand that the result
of the Fantel BoF is that there wouldn’t be a new group, and that this
should be directed here. Now, we are discussing whether and what angles
of this work could be adopted here?
Tony Li
01:11:02

My understanding is that we need to prevent all congestion based packet
loss, so we need to respond faster than the buffer space of anything on
the path.
Reshad Rahman
01:11:14

The answer to the 2nd question (DP only?) depends on the answer to the
1st question (fast notifications only or notifications in general?)
Nitsan Dolev Elfassy
01:11:38

The desired notifications nature seems somehow clear; my problem is lack
of examples for expected realistic use of this information.
Adrian Farrel
01:11:55

@Carolina I believe the word was "incubated". That might result in: no
RFCs, a cluster of RFCs in RTGWG, or a future WG
Jeffrey Haas
01:11:58

+1 to tli. Multi-hop distribution of this stuff will have interesting
congestion, latency, loss, and jitter issues. That might be fine for
"slow" stuff like WAN use cases. For AI/DC?
Adrian Farrel
01:12:51

@Jeffrey it's like BFD congestion order n factorial
Jeffrey Haas
01:14:45

Yeah, I might have some sensitivities to this discussion for that
reason.
Reshad Rahman
01:17:17

@Zafar I think keeping the scope small and focusing on fast
notifications is preferable
Alia Atlas
01:17:49

For the questions on how data can be used, I did find that
draft-cheng-rtgwg-adaptive-routing-framework-04 was useful. I'm not sure
how that is perceived to fit into the potential broader scope.
Jeffrey Haas
01:18:05

Greg Mirsky doesn't appear to be present to discuss, but I find it
likely that we'll end up talking about a previously discussed solution
space - leveraging distribution machinery similar to BFD (but not using
BFD!) to carry streams of congestion info. the router-info draft does it
differently.
https://datatracker.ietf.org/doc/html/draft-mmm-rtgwg-integrated-oam-02

Jeffrey Haas
01:19:29

The general pacing of this sort of state will set a significant portion
of our requirements for the solution protocol. Steady? On demand? etc.
Tony Li
01:20:41

Steady would seem to contradict 'lightweight'
David Black
01:20:59

steady -> periodic ?
Alia Atlas
01:21:38

It's a trade-off - more accurate reporting of congestion speed vs.
processing
Jeffrey Haas
01:22:04

Yes, periodic. And to tli's point, the contents vs. rate sets a lot of
the discussion about bfd. "What do you want to tell the other side about
every < 30ms?"
Jeffrey Haas
01:22:54

The rate also may push the solution to TLV or to template based.
Jeffrey Haas
01:23:16

TLVs make routing people happy. templates make OAM people happier in
some circumstances.
Tony Li
01:23:39

The format seems like a trivial issue.
Jeffrey Haas
01:24:11

The implementation is "trivial". The choice tends to be important, if
contentious.
Alia Atlas
01:25:07

extensibility and ability to have changes
Jeffrey Haas
01:25:31

I nod towards ipfix as an example of template based that is extensible.

Jie Dong
01:26:58

One thing I forgot to mention is whether we want to cover both in-band
and out-of-band notifications, or only one of them?
David Black
01:30:39

Agree with Jeff - point is that the CCLs avoid the traffic pattern that
causes problems.
Jeffrey Haas
01:31:49

@jie Part of that depends on whether you're discussing single hop or
multi hop distribution of the state and whether multi-hop is congruent
with the forwarding path or not.
Alia Atlas
01:32:05

@Jeff - fair enough. I just think naturally in TLVs - but doing this
with low computation matters.
Adrian Farrel
01:32:14

@Jie Aaaaaaaaagh! You said "in-band" and "out-of-band"
Jeffrey Haas
01:32:35

He could use other scare words like iOAM if you like.
Adrian Farrel
01:32:37

Please be super-precise and not use these broken-in-a-packet network
terms
Adrian Farrel
01:33:23

https://datatracker.ietf.org/doc/draft-ietf-opsawg-oam-characterization/
:-)
Jeffrey Haas
01:33:25

To tli's point, GLB is just one use case enabled by such a message bus.

Jeffrey Haas
01:34:30

To Jie's point, what use cases get enabled by other similar message
buses depends on forwarding model for the messages and how they are
routed and with what correlators.
Jie Dong
01:34:32

@Adrian :)
Jeffrey Haas
01:34:54

and yes, many OAM folk would just call out "amateurs!"
Maria Matějka
01:37:38

joel: +1
Jeffrey Haas
01:43:52

For this srv6 presentation, I suspect the folk in tcp would share wisdom
about this sort of thing being solved in the application layer rather
than strictly forwarding.
Joel Halpern
01:47:06

It is unclear what "local traffic control" means, or how it could
possibly help the problem. If all it means is more aggressive discard,
then I can at least understand the question. If it meant that, it should
say that.
Jeffrey Haas
01:47:35

I interpreted one case of it as rate shaping.
David Black
01:47:38

@Jeff - +1, and generalize beyond TCP - this sort of separate
notification should complement what the transport protocols are doing
in-band.
Joel Halpern
01:50:02

Expecting P nodes to perform rate shaping (as implied in both this and a
likely reading of the previous presentation) seems like a very bad idea.

Jeffrey Haas
01:50:18

I intend to agree, Joel.
Jeffrey Haas
01:51:43

The rsvp style case moving the conversation toward distributed rate
shaping is likely where much of this needs to go. As noted to someone
else earlier this week, if we're not capable, we'll reinvent ATM.
Jeffrey Haas
01:52:07

s/capable/careful
Adrian Farrel
01:52:24

Oh, is that the date, already? Yes it is time to reinvent ATM
Jeffrey Haas
01:52:38

RFC 1925 s-what?
Donald Eastlake
01:52:47

I was thinking how much this sounded like ATM.
Joel Halpern
01:53:05

We could instead just re-invent X.25.
Boris Khasanov
01:53:17

:)
Tony Li
01:54:29

ICMP is frequently used for DoS attacks, thus it is rate limited by all
IP nodes.
Jeffrey Haas
01:54:47

My ICMP comment at the microphone was this: ICMP is often treated
negatively by all forwarding paths. it is rate limited, and often
treated poorly in precedence in the forwarding paths.
Adrian Farrel
01:54:55

I always missed two things in RFC 1925 Rule 3: there is no rule 3 "For
more information, please re-read this RFC."
Nitsan Dolev Elfassy
01:55:03

IMHO, this proposal is extremely non scalable.
Tony Li
01:55:44

@Nitsan Dolev Elfassy This seems to be the case starting with the
problem statement.
Adrian Farrel
01:55:45

What Tony says and all of the tools being discussed may easily become
DoS vectors
Jeffrey Haas
01:56:30

People are already unhappy with bfd multihop security considerations.
General network congestion mechanisms will have significantly scarier
security considerations when done multi-hop.
Sasha Vainshtein
01:56:37

Quoting (3) from RFC 1925:
Sasha Vainshtein
01:57:21

With sufficient thrust, pigs fly just fine. However, this is not
necessarily a good idea. It is hard to be sure where they are going to
land, and it could be dangerous sitting under them as they fly overhead.

David Black
01:57:26

Credit-based flow control (cbfc) comment - CBFC is complex-enough, doing
this per flow significantly increases complexity of an already-complex
RSVP implementation.
Jeffrey Haas
01:58:02

See also prior work why original RSVP (not RSVP-TE) didn't gain general
popularity.
Tony Li
01:58:16

@David Doing anything in RSVP would seem to contradict being 'fast'.
Sasha Vainshtein
01:58:21

@David Black - looks like (3) from RFC 1925 really applies...