IETF 119 RTGWG Minutes

Chairs:
Jeff Tantsura (jefftant.ietf@gmail.com)
Yingzhen Qu (yingzhen.ietf@gmail.com)

WG Page: https://datatracker.ietf.org/group/rtgwg/about/
Materials: https://datatracker.ietf.org/meeting/119/session/rtgwg

9:30-11:30 - Tuesday Session I, March 19, 2024
9:30
Meeting Administrivia and WG Update
Chairs (10 mins)

9:40

Multi-segment SD-WAN via Cloud DCs

https://datatracker.ietf.org/doc/draft-dmk-rtgwg-multisegment-sdwan/
Linda Dunbar (10 mins)

Will be taken to list.

9:50

Path-aware Remote Protection Framework

https://datatracker.ietf.org/doc/draft-liu-rtgwg-path-aware-remote-protection/

Yisong Liu / Changwang Lin (10 mins)

10:00

Destination/Source Routing

https://datatracker.ietf.org/doc/draft-llsyang-rtgwg-dst-src-routing/
Shu Yang (10 mins)

10:10

A Routing Architecture for Satellite Networks

https://datatracker.ietf.org/doc/draft-li-arch-sat/
Tony Li (20 mins)

(presentation continues @ "Off-Stripe Return Forwarding")

Jeff Tantsura/chair: little input from industry, if you're in the room
please help with requirements so we can build something that works.

Jeff Tantsura/chair: Considering interim meeting for this before IETF
120, if there is enough interest.

10:30

Extension of Application-aware Networking (APN) Framework for Application Side

https://datatracker.ietf.org/doc/draft-li-rtgwg-apn-app-side-framework/

Zhenbin Li/Shuping Peng (10 mins)

10:40

Application-aware Data Center Network (APDN) Use Cases and Requirements

https://datatracker.ietf.org/doc/draft-wh-rtgwg-application-aware-dc-network/

Hongyi Huang

10:50

Use Cases and Requirements for Implementing Lossless Techniques in Wide Area Networks

https://datatracker.ietf.org/doc/draft-huang-rtgwg-wan-lossless-uc/
Hongyi Huang

11:00

Use Cases-Standalone Service ID in Routing Network

https://datatracker.ietf.org/doc/draft-huang-rtgwg-us-standalone-sid/
Daniel Huang

============================================

Copied from the Chat

Ketan Talaulikar
00:00:44

Nice to see the chairs in the spotlight ;-)

Yingzhen Qu
00:02:20

please help with collective note taking:
https://notes.ietf.org/notes-ietf-119-rtgwg?both

Andrew Alston
00:14:56

Adoption calls still have to go to the list don't they?

Acee Lindem
00:15:45

Should go to INT Area

Yingzhen Qu
00:21:04

@Andrew, yes, the adoption call will go to the list.

Himanshu Shah
00:32:41

On Path aware remote protection - I believe the author is proposing a
scheme to handle the remote failure with a path aware backup path
already programmed in the FIB.

David Lamparter
00:32:56

@Antoine didn't catch your question for the notes either, sorry

Himanshu Shah
00:33:28

As soon as the notification arrives, switch over happens. The whole goal
is to reduce the service outage instead of waiting for BGP withdraw..

Himanshu Shah
00:33:44

The switchover scheme is not yet proposed.

Antoine Fressancourt
00:36:51

@David My question is about the selection of the remote repair node. Is
it a self-election mechanism triggered by receiving a failure notification?
If a node tries to repair a path, does it stop the upstream relay of the
failure notification? Can two remote nodes be repairing a path in
parallel?

Himanshu Shah
00:37:24

Sorry i meant "notification scheme" is not yet proposed.

John Scudder
00:37:49

Did the person at the mic really say this solution would provide
microsecond scale repair?

John Scudder
00:37:56

Ain’t no way.

Himanshu Shah
00:38:34

I agree - has to be milliseconds depending on what the notification
scheme is..

Weiqiang Cheng
00:43:15

https://datatracker.ietf.org/doc/draft-cheng-rtgwg-ai-network-reliability-problem/

Weiqiang Cheng
00:43:45

This draft gives some analysis on requirements of fast protection in AI
DC

Yingzhen Qu
00:45:39

@meetecho, please switch the camera to the presenter

Lorenzo Miniero
00:45:53

Done!

Yingzhen Qu
00:46:05

thank you

Antoine Fressancourt
00:47:43

@Weiqiang thanks for the link

David Lamparter
00:47:57

the queue is still stuck with people from the previous draft, can we
clear that?

Weiqiang Cheng
00:49:15

@John, I mentioned the requirement is sub-millisecond, even microseconds
(us). I don't think the solution will provide the us scale. But it is
valuable to look for ways to improve the recovery time.

Himanshu Shah
00:50:34

@weiqiang - the proposal reminds me of AIS/RDI type of scheme.. :-)

Adrian Farrel
00:53:41

Looks like a pretty picture to me

Dave Phelan
00:54:30

Wasn’t drawn on a napkin.

Christopher Hawker
00:55:53

Another pretty picture!

Jeff Tantsura
01:01:56

@Jeff Haas - on the fast recovery topic - you have mentioned other drafts
that are being progressed in other WGs - please share the draft names

Acee Lindem
01:06:53

Is it just me or doesn't the Meetecho private chat work?

Christian Hopps
01:07:14

is the presentation over? the questioners seem to be assuming that

Christian Hopps
01:07:33

can we finish the presentation first?

Shukri Abdallah
01:08:04

Do you specify an algorithm for selecting orbits that form a stripe?

Christopher Hawker
01:08:35

@Acee just tried sending you a private chat.

Lorenzo Miniero
01:08:53

Acee: what do you mean by doesn't work? I use it regularly when I need
to get in touch with people in rooms, e.g., to provide assistance to
remote speakers

Lorenzo Miniero
01:09:48

You can try contacting me privately here too, if you want to test

Adrian Farrel
01:10:25

@lorenzo, it is not immediately obvious how to do it from the chat
window without starting up zulip

Lorenzo Miniero
01:11:07

Adrian: you need to click on the name of the participant from the
participants list, and options will appear. The balloon icon will open a
private chat

Adrian Farrel
01:11:30

Yup, but not the participants name in the chat window :-)

Lorenzo Miniero
01:12:11

Ah no, that's correct: those names are not clickable

Dawei Fan
01:13:04

Off-stripe forwarding seems to find a short path. My question is about
propagation in this architecture: is it the same as in a current
IGP (IS-IS)?

Yisong Liu
01:13:23

@John, we hope to provide a solution for millisecond repair. but it may
need more work on the specific solution and we'll continue to do that

Andrew Stone
01:20:25

Fair to say TE goals are mainly for user stations, so the TE path from
gateway to satellite is what's critical and TE from satellite to gateway
is not?

Yisong Liu
01:20:43

@Jeff Haas, please confirm whether you are referring to this draft:
https://datatracker.ietf.org/doc/draft-wang-idr-next-next-hop-nodes/

Andrew Stone
01:20:47

** for user downstream traffic

Tony Przygienda
01:26:31

well, from experience, satellite guys do -really- like their proprietary
stuff ;-)

Adrian Farrel
01:27:31

Quote Jeff: The satellite work is operating in a vacuum

Christian Hopps
01:28:24

that's fantastic

Tom Hill
01:29:16

Tony, I had some queries about approach, but I'll pick it up in a coffee
break. More musing on past efforts.

Tony Przygienda
01:29:26

well, security through obscurity is a concept. Plus, interop is only
interesting if you look to shop vendors. Most router vendors are not
particularly good satellite builders and though building a router is not
easy, putting a satellite into orbit is of different scale ...

Tony Przygienda
01:30:51

I doubt culturally the term "router scientist" will ever achieve the
same level of awe when mentioned as "rocket scientist" ;-)

Christian Hopps
01:34:51

Why should this be in the network? I just don't get it. What servers a
client uses should not be part of the routing database.

Andrew Alston
01:35:26

This is also outside of charter - APN is a large topic - and for RTGWG
to take on any larger topic - it has to be explicitly chartered to do so
- as per the current charter

Dmitry Afanasiev
01:35:29

Also, there are probably going to be only a few LEO constellations operating
at any given time, very likely interconnected only via ground gateways -
so there is little incentive to bother with standardization

Tony Przygienda
01:36:25

@Dima, well, with the amount of space junk Elon and AMZN are shooting up
there and grid arrays I'm not sure that is true anymore ...

Changwang Lin
01:36:57

@Antoine,
There are two types of notification mechanisms:

Proactive notification. The suggestion is to protect only the two-level
network and to notify upwards only once.

Flow-triggered notification. Notifications are sent in the direction of
the flow when a node perceives that the flow cannot be forwarded. This
method can notify upstream. If there is no protection path upstream,
subsequent traffic will trigger notification to the higher-level device
again. Remote nodes can simultaneously repair a remote path fault.

The remote path-aware document needs to address several issues:

In a specific topology, convergence does not depend on the control
plane protocol.

Control plane protocols, such as BGP, extend support to add remote
path information on the next hop of the route.

Fault perception: perceived by the remote end, and then notified to
other protocols to quickly notify the remote end.

Switching process: it does not rely on the control plane and completes
fast switching on the forwarding plane, achieving microsecond-level
convergence.

Christian Hopps
01:37:14

Write a protocol for applications to choose servers don't try and wedge
this into routing.

Andrew Alston
01:37:34

+1 Christian

Dmitry Afanasiev
01:37:39

but problem is certainly interesting and it seems it can be solved with
available tools + reasonable amount of tweaking and without too big
sacrifices

David Lamparter
01:38:01

did the slides just disappear or is it just me?

Tony Przygienda
01:38:02

is there a preso after that? Otherwise it's 2aM+ here and bed would be
nice instead of suffering through this stuff ...

David Lamparter
01:38:23

(nevermind, back now)

Christian Hopps
01:38:36

Tony this stuff is progressing b/c lots of people dislike it so much
they are ignoring it.

David Lamparter
01:39:40

we are at →
10:30
Extension of Application-aware Networking (APN) Framework for
Application Side

(it's 11:10 now, so >30min over)
10:40
Application-aware Data Center Network (APDN) Use Cases and Requirements

10:50
Use Cases and Requirements for Implementing Lossless Techniques in Wide
Area Networks

11:00
Use Cases-Standalone Service ID in Routing Network

Tony Przygienda
01:39:44

well, RFCs are pretty clear what you do with stuff after 2 failed BOFs.
But of course some grownups have to apply the procedural framework ...

Andrew Alston
01:39:52

Well - it's failed 2 BOFs if I recall - and the IESG wouldn't approve
their proposed charter - and now it's being shopped - but as I said -
it's explicitly out of charter

Dmitry Afanasiev
01:40:27

@Tony - number of sats is big, no doubt, and it is growing fast, but as
for systems - it's just 2, maybe 3-4 more of comparable scale will come
up later, but that's it

Jeff Tantsura
01:40:43

I'm here

Tony Przygienda
01:40:54

@David: ok, looks like I get some snooze time back and couple hours
sleep before morning meetings get me up again ...

David Lamparter
01:41:38

:)

Tony Przygienda
01:42:03

@Jeff, yepp, probably but I doubt you'll ever grow up (and we love you
for it ;-)

David Lamparter
01:42:19

@Jeff: mic queue is still locked from previous draft btw

Yingzhen Qu
01:42:37

@David, thanks for the reminder

Jeff Tantsura
01:44:39

;-)

Tony Przygienda
01:49:54

well, I disagree with the premise that network substrate needs to
understand the application semantics ...

Andrew Alston
01:50:18

You are not alone in that Tony

Dmitry Afanasiev
01:50:25

+1

Tony Przygienda
01:51:12

as @Dima once said: it's all distributed linear algebra at the end ;-)
and this does not know whether you're computing tensor cross-section to
calculate static stability or train some generative parrot

Dmitry Afanasiev
01:51:21

but collective ops is a special beast, HPC interconnects historically
provided support for it, at least some of them

David Lamparter
01:51:39

anyone know how this relates to coinrg?

Tony Przygienda
01:51:49

nothing against in-substrate support for folding of course ...

Dmitry Afanasiev
01:54:12

@David collective operations - e.g all-reduce, used in ML training,
doing intermediate reduction in network can improve performance.
Definitely computation in the network.

Jeff Tantsura
01:56:06

SHARP is doing this today

Dmitry Afanasiev
01:56:31

@Jeff exactly, it's a very good example

Tony Przygienda
01:57:03

thinking that to the end you basically want S-I-PMSI capable on the
substrate setting up such hierarchical folding "trees" and that's one of
the things I tried to talk to folks about as BIER use case (think
generalized sharp distribution substrate ;-)

Dmitry Afanasiev
01:57:09

but no SHARP for Eth/IP .. at least yet :)

Tony Przygienda
01:57:33

unfortunately, the folks dealing with that are religiously against any
type of multicast (for which I have some understanding)

Kehan Yao
01:58:03

@Dmitry, agree. collective operations offloaded to the switch is a
common solution for AI/HPC. So for AI networking, maybe people shouldn't
wear glasses, and some in-network behavior may be helpful.

John Scudder
01:58:05

I’m not able to stay for this talk on “lossless techniques“ but I want
to point out to anyone who isn’t aware that it sure sounds the same as
what detnet works on. 

Tony Przygienda
01:58:18

@Dima: RIFT a natural for it once we manage to get the multicast folks
finish their work ;-) though of course the multicast here becomes not
multicast but really distributed folding operation

Yingzhen Qu
01:58:57

@John, noted.

Jeff Tantsura
01:58:57

multicast (or tree building at all) is the least of your problem, ASIC
doing reduction at line rate is though

Tony Przygienda
01:59:14

ASICs are easy, they build a new one every 6 months ;-)

Dmitry Afanasiev
01:59:22

@Tony yes, establish reduction topology - this is straightforward, but
also deal with data loss, reducer failures, latency bounds, maybe
agreement on quantization

Tony Przygienda
01:59:51

@Dima, you know me, I'm a control plane whinie, the larger the scale the
better ;-)

Tony Przygienda
02:00:21

but of course I agree, practically to get such distributed folding
deliver real gains is anything but easy at the volumes

Dmitry Afanasiev
02:00:24

vectors to be reduced can be quite large, so enough of buffer space on
reducing switches

Tony Przygienda
02:02:26

oh, people complained about private messaging not working because I just
discovered a long queue of little symbols at the top I ignored ;-) Cute
...

Dmitry Afanasiev
02:02:47

@Tony I'm with you wrt scale - the larger the better, also that's where
really interesting problems start to appear, and just throwing money is
not enough to make those problems go away :)

Tony Przygienda
02:03:36

@Dima, nah, money helps. You just start to throw it at smart people
rather than brute forcing it ;-)

Tony Przygienda
02:03:45

okey, bed now. fun session ;-)