Thanks to Yuehua Wei (ZTE) for taking the minutes.
Video recording: https://www.youtube.com/watch?v=dOclSQ9wJu0
------------------------------------------------------------
- Chairs' remarks - Jeff & Jeffrey, 5 minutes
- Updates from Jordan Head, 20 minutes
- Base spec: https://datatracker.ietf.org/doc/draft-ietf-rift-rift/

Jim [AD]: Hey Jordan, thanks for that, and thanks for taking care of all of that for me, very much appreciated. The only comment I had was the TTL thing: you're going to put some text in the applicability document on that, right? Is that something you're doing? Because I'd like to try to get that moved at the same time, so if we can kill two birds with one stone, that would be great.
Jordan: Yeah, I've been driving the discussion a bit and gotten some feedback, but I think we're at the point where I'll just write some text and propose it; ask for forgiveness rather than permission, so to speak, just to make sure we're covered. There were a couple of other points on the applicability side, but nothing that relates to something normative like this.
Jim: So that one's pretty much done as well, right?
Jordan: Minor stuff.
Jim: Okay, perfect, thanks.

- KV registry: https://datatracker.ietf.org/doc/draft-ietf-rift-kv-registry/

Jeffrey [Chair]: I have a question. I had thought this document was about the registry itself, but it seems we have now added this mechanism for the key targets and it's handling all those things. Should the draft be renamed and the title changed?
Jordan: That's probably not an awful idea. I think Tony is going to mention this, but we've had other mechanisms described there as well besides key targets, like the southbound tie-breaking and so forth, but go ahead...
Tony: The only thing that's changing in the RIFT document is this: first we thought that we basically have the key, which was always outside, and the content was just a blob, and we wanted to throw the target into this blob. But we decided to split it into the key, the target, and then the blob, which is really the value, right? So that's a schema change, and we have to register this code point for the content of the Key-Value TIE. But in terms of what this thing does, the RIFT spec doesn't say anything; that's all still farmed out to the key-value spec.
Jeffrey [Chair]: Right, so the KV registry spec is really not only about the registry but also about the behavior.
Tony: Right, it also specifies the behavior of this field. Since we defined the tie-breaking of the key-value store in the RIFT spec, you could argue that we should put the target text into the RIFT spec as well.
Jeffrey [Chair]: No, that's not my point. This KV registry document also defines the target behavior, so I think the document should be renamed.
Tony: Fair enough.
Jordan: Yes, I'll take that for next time, Jeffrey, that's fine. Since it's just a title change, it's not a big deal.
Sandy Zhang: In some scenarios, may I understand the key target as the route target in MP-BGP? Can it be used like that?
Tony: Yes, no, maybe. People are probably confused because I'm not sure everybody knows how a Bloom filter works. The idea is fairly simple: you take something fairly big, you generate multiple hashes of it, say three hash functions giving you three bits, and you flip those three bits on. That way you can put 100,000 targets into 64 bits, okay? Of course you will get false positives, but you don't get false negatives. So you may address more people than you intend, but you have a very small filter. Whereas with the route target in BGP you have, by policy, a perfect match, here you don't have a perfect match; you have something that statistically works incredibly well, but it delivers false positives and you have to deal with that. It's a well-known, often-used technique, research papers and all, but the equivalence with the route target breaks down here because it's not a perfect match, right?
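(For reference: a minimal sketch of the Bloom-filter mechanism Tony describes above. The hash construction, sizes, and target names are illustrative assumptions, not text from the KV registry draft.)

    # Bloom-filter sketch: k hashes set k bits in a small field; membership tests
    # can return false positives but never false negatives.
    import hashlib

    FILTER_BITS = 64   # matches the "64 bit" figure mentioned above (illustrative)
    NUM_HASHES = 3     # "three hash functions, three bits"

    def bit_positions(target: bytes) -> list[int]:
        # Derive NUM_HASHES bit positions from salted hashes of the target.
        return [
            int.from_bytes(hashlib.sha256(bytes([i]) + target).digest()[:8], "big") % FILTER_BITS
            for i in range(NUM_HASHES)
        ]

    def add_target(bloom: int, target: bytes) -> int:
        # Flip the target's bits on; many targets share one 64-bit filter.
        for pos in bit_positions(target):
            bloom |= 1 << pos
        return bloom

    def may_match(bloom: int, target: bytes) -> bool:
        # True means the target *may* be addressed (false positives possible);
        # False means it definitely is not (no false negatives).
        return all(bloom & (1 << pos) for pos in bit_positions(target))

    bloom = add_target(0, b"hypothetical-target-17")
    assert may_match(bloom, b"hypothetical-target-17")   # never a false negative
    print(may_match(bloom, b"some-other-target"))        # occasionally True: a false positive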
- Update on the interop testing in the Hackathon, Tony P., 10 minutes
- RIFT meets Dragonfly: https://datatracker.ietf.org/doc/draft-przygienda-rift-dragonfly/, Tony P., 25 minutes

Jeffrey [Chair]: Tony, do you want to take the questions now or later?
Rod Van Meter (Keio University): It's a clarification about this diagram.
Tony: Sure, yeah, I know it's tons of information and it gets worse.
Rod Van Meter (Keio University): So this diagram, you said eight links?
Tony: Eight edges. There's no node in the middle; it's an octagon, so it's a regular structure. The original Dragonfly was just a full mesh, plus these little wings which are all full meshes, and if you have four of them and align them correctly, it looks like a dragonfly. Think about it this way: those are the routers, and those are the two planes and how they are connected. Yes, I should probably have drawn big blocks, but I was too lazy, and there would be two more of these on the top, on the left, or rather, think of them behind. It doesn't matter, because those are Clos planes; those are the Clos fabrics in Dragonfly+, as far as I could figure out, because there's really no clean paper that explains in research terms what it actually is. That was as much as I could reconstruct from all kinds of ideas flying around.
Linda: I'm a little confused by this picture too. You're saying the red nodes...
Tony: There are no red nodes, it's just a red plane.
Linda: Red plane.
Tony: Yeah, you could say so; those would be the red nodes.
Linda: But you have a box connecting the green and the red; does that mean it's just one node?
Tony: That's an important concept: you can see it in different ways. You can see two completely disconnected planes, you can see half a full mesh, or you can see two planes that you can somehow connect together if you keep those links.
Linda: So I see the red plane is only one hop away from each node. Why do you say there are two hops?
Tony: What do we have in the middle, this cross? Nothing, nothing; it's just that if you start to draw things like that, the lines intersect. My bad. Only those things are nodes; it's an octagon.
Linda: Okay, so the middle one is not really a connection. There's nothing there.
Tony: Sorry, there are six red links in fact. There is no node in the middle, there's nothing there. So you've got shortest paths and non-shortest paths.
Linda: So you have one-hop and you have some two-hop paths.
Tony: Two hops and one hop, yes. Sorry, that's implicit. Okay, cool.
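(For reference: a toy model of the shortest vs. non-shortest path point in the exchange above. The four-group full mesh below is an illustrative stand-in, not the exact figure from the slides.)

    # Groups connected by inter-group links; the shortest path is the direct
    # one-hop link, the non-shortest paths detour through exactly one
    # intermediate group (two hops).
    INTER_GROUP_LINKS = {          # illustrative adjacency: 4 groups, full mesh
        "A": {"B", "C", "D"},
        "B": {"A", "C", "D"},
        "C": {"A", "B", "D"},
        "D": {"A", "B", "C"},
    }

    def paths(src: str, dst: str):
        # Return the direct path (if any) plus all two-hop detours.
        direct = [[src, dst]] if dst in INTER_GROUP_LINKS[src] else []
        detours = [
            [src, via, dst]
            for via in INTER_GROUP_LINKS[src]
            if via != dst and dst in INTER_GROUP_LINKS[via]
        ]
        return direct, sorted(detours)

    one_hop, two_hop = paths("A", "C")
    print(one_hop)   # [['A', 'C']]                        -> shortest path
    print(two_hop)   # [['A', 'B', 'C'], ['A', 'D', 'C']]  -> non-shortest alternatives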
Jeff [Chair]: I'll bring up a couple of points and we can start the discussion.
1) Why is this important? People have been trying Dragonfly-like topologies in the data center to save on interfaces; practically, the complexity doesn't justify the deployment. Where it becomes really interesting is when you use the inter-links to interconnect data centers, as Tony said, and there's a very important point: today it's pretty much impossible to get a data center over 50 megawatts; there's just no power to cool it and power it. So a lot of people in the US started building data centers in pairs of 50-60 megawatt data centers, and this naturally maps onto this kind of topology. I've got two 50-megawatt data centers, and within each data center you run whatever you like, most probably MP-BGP. That's number one; this is why this is so important.
2) Number two: this provides you loop-free routing. It doesn't explain how to get traffic onto the links, but practically the cheapest way is to go on the shortest link, which also gives you low latency. You also want to be able to use the longer links, but you need to understand that, again looking at the target, this is really a machine learning cluster. In collective operations you cannot afford to have parts of the collective experiencing different latency, because it's all about job completion time, so you need to make sure that whatever your GPUs are running follows the same path. How do you get traffic onto another link in case of congestion? Again, that is another problem to solve, not here. Practically you need to know when to switch from the shortest path to a non-shortest path, and that's not in the routing protocol, at least as of now. Adaptive routing has applicability here; again, if you try to do more granular load balancing than just per-flow, you end up in a case where some of your packets go on the shortest link and some don't, and performance goes down to 3%. So it's really important to understand, from an applicability perspective, how to deploy it and how to signal potential congestion, available bandwidth, or failure on the inter-fabric links. All of this will need to be worked out, at least some of it. This is where I think we should start the discussion.
Tony: Yeah, but the nice thing, if you start to look at this, at the direct path and the one alternative hop: this is an incredibly resilient structure. You have to kill tons of connectivity before this thing literally starts to become unreachable. When I was looking at the stuff, if you build, say, three Clos planes and then this thing in between, you have to nuke it before the stuff actually fails to have any path to get anywhere. I kind of hated Dragonfly; I thought it was too dense and nobody could figure out the routing. Now I'm starting to like them, of course, because I think I figured them out. All right, so I think that's it.
Jeff [Chair]: One more comment. There's a draft going through the routing working group that focuses on BGP in Dragonfly+. If you want it in terms you're more familiar with, VRFs and BGP policies, it explains how this can be done with BGP policies: rather than understanding whether a link comes from the fabric or is an inter-link, you just use different VRFs, and you can use the AS path to figure out where you are. So it will help you better understand the applicability of a regular routing protocol to this.
Dima: Thanks Tony, it's really impressive what you did with RIFT. I just want to comment on the computation scalability problem, because essentially, if we are trying to use the silicon to the maximum, then the number of groups will probably be half the radix of the top-of-fabric switches plus one, because we have half the interfaces going south and half the interfaces going north to other groups, and the plus one is our local group. So it could be 33 or 65 for the current generation of silicon, something like that. But I think there's no need to do full computations for every group, because the reason to do a full computation is if you're going to go through an intermediate group and reach the leaves in that group.
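(For reference: a back-of-the-envelope check of Dima's group-count estimate. The radix values are representative port counts for current switch silicon, not figures from the draft.)

    # Half the ToF ports face south into the local fabric, half face north with
    # one port per remote group, so the number of groups is radix/2 plus the
    # local group itself.
    def max_groups(tof_radix: int) -> int:
        return tof_radix // 2 + 1

    for radix in (64, 128):
        print(radix, "ports ->", max_groups(radix), "groups")
    # 64 ports -> 33 groups
    # 128 ports -> 65 groups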
Tony: No. The only reason to do the full computations, as I mentioned, is if you really want negative and positive disaggregation to tackle the cases where you have to start in the fabric on the direct plane, because you can only get to the leaf in the other fabric on this plane. So it's the cases where the other fabric breaks in a way that forces your fabric to disaggregate all the way down. You see my point; I don't even know how many links I would have to break, and how, to get there. That's the only reason to run the computations.
Dima: Yeah, my point is that it's probably possible to do fewer computations than a full computation for every member fabric.
Tony: Yeah, that's what I wrote. I said: leave out the inter-fabric links when you do positive and negative disaggregation; it's good enough, most likely.
Dima: Yeah, because negative disaggregation is needed if you cannot go through a particular top-of-fabric switch.
Tony: Yeah, totally right.
Dima: And I wanted to second what Jeff said: there is a power scalability problem for how much you can put in one particular data center, but this topology looks like a good fit for data center campuses, or any aggregation of data centers which are not too far from each other, where you want more or less uniform connectivity and a lot of bandwidth between them, because it scales better than trying to add yet another level to the Clos.
Tony: Right. So if anybody feels like a little mental exercise, especially the professors here: now imagine running this thing on an optical ring, counter-rotating. What happens if the ring gets cut in one place? What will this topology look like and what will happen? Because that's the next layer of problems in the network, right? Because you run this whole thing on lambdas over a ring.
Dima: Yeah, that's it for me, thanks.
Jeff [Chair]: Next question.
Jingyou (Fiberhome): I don't have any comments, just a minor suggestion. The figures look nice but are a little difficult for me to understand, so I suggest maybe we could add some formulas, or give some examples or use cases.
Tony: I used to be in academia; I don't do formulas anymore. I could write it beautifully in three formulas and talk to you about Banyan trees and Banyan tree formulas, and nobody would grok anything whatsoever. I reserve that for the journal paper.
Linda (Futurewei): I'm just curious: you have multiple planes and each plane has its own topology. Can you use different IS-IS areas to solve the problem? The plane could be area two and...
Tony: Look, you could run IS-IS in the core, right? I mean, we wouldn't have had to extend RIFT, but you could only do shortest path, one hop. So you don't get the bisectional bandwidth with IS-IS unless you hack IS-IS to the point where it's not IS-IS anymore. So...
Linda: But you can use some kind of policy on the side so that you can...
Tony: IS-IS doesn't have policies.
Jeff [Chair]: That's why we use BGP.
Linda: How about using BGP? We have a draft on that: basically some kind of metric to influence the path selection, so instead of choosing the shortest path we add some other weight, and with that other weight added, maybe the longer path will be chosen.
Tony: My comment would be that once your policy grows complicated enough, you may as well start carrying packets by hand; that may be more efficient.
Linda: Yeah, of course. But here we're talking about multiple paths, where the shortest path may not be the best path, and how we balance that.
Tony: Dima has a draft where he has shown, basically with a lot of VPNs, how you can solve that stuff. The horizon idea is actually Dima's idea, not mine; I was standing in front of it sucking my teeth: how do you properly know the shortest path here? It was Dima's idea that we can actually build a horizon, because he built the horizon using VPNs in BGP, because that's how you use them; they basically reflect the horizon. That's the BGP mechanism.
Linda: Okay, so do we have some ideas on how to do this?
Jeff [Chair]: Oh, we know exactly how to do it with BGP. That was presented at the last RIFT working group meeting.
Tony: Yeah, we talked about the BGP stuff, modulo little details like where the couple of hundred lines of BGP policy are and how you stitch that stuff properly so it doesn't break. Plus of course BGP will stitch with the VPNs, and you have to start thinking: okay, where are your tunnels? What happens there, because the tunnels start to develop their own logic, right, how to go from one place to another, and you have to control them so they take the path that you want. But it's all doable. Like I say, ultimately you can get enough people to carry packets by hand, and if you beat them enough, you will get what you want.
Jeff [Chair]: There's another level of complication when you start doing overlay, which is mandatory if you do multi-tenancy, right? If you do it on the switch, think about VXLAN and VPN, which is the common way to do it today: you're going to build a structure where there's an underlay VPN and another VPN that is the tenant, right? It becomes really complex from a management perspective.
Linda: It may not be VPN per se, but anyway, I'll just throw some ideas out here.
Tony: Yeah, it's solvable. I mean, this is trying to solve it in a very ZTP way with a very cheap forwarding plane; that was always RIFT, right?
Sandy Zhang (ZTE): I'd like to make sure I understand this right: how do the ToF nodes know if a flow is intra-fabric or inter-fabric?
Tony: That's a very justified question. That's where RIFT solves the problem and where BGP will have a hard time, right? We know the direction of the fabric, so we know who is south and who is north, and now we can differentiate whether it's inter-fabric or whether it's a horizontal link. So for the inter-fabric link, the adjacency will clearly tell you which horizon it is on.
Sandy Zhang: Yes, I think the FIB, the forwarding table in the ToF, will show whether the route is inter-fabric or intra-fabric, so when the ToF receives the flow it will know how to forward it.
Tony: Correct: which FIB to throw it to, precisely. And thanks to ZTE, because we spent a lot of time at the hackathon starting to ask these questions. I had only drawn a very simple figure, like a three-thingy and a four-thingy, and they asked about five, and I wasn't sure, so I actually had to draw the figure to work out this presentation; when I oversimplified with three, everything worked. It's kind of trivial, but this is exactly how it works: the incoming interface will tell you which FIB to go to. I was slightly skeptical whether you can demand that from the hardware, and I looked, and yes, even the cheapest silicon can do that these days, because it's actually a very common problem if you run any kind of VRF: you have to know that this is a VRF link, so it's a completely different FIB, otherwise it won't work.
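(For reference: a schematic of the per-interface FIB selection Tony and Sandy discuss above, analogous to how a VRF-attached interface selects its own table. The interface names, prefixes, and next hops are hypothetical; this is not draft text.)

    # The incoming interface alone decides which FIB is consulted; the lookup
    # is shown as exact-match for brevity (real FIBs do longest-prefix match).
    INTERFACE_CLASS = {                   # hypothetical interface classification
        "if_south_1": "intra_fabric",     # southbound link into the local Clos
        "if_ring_1": "inter_fabric",      # horizontal link toward another fabric
    }

    FIBS = {                              # hypothetical per-class FIBs
        "intra_fabric": {"10.1.0.0/16": "if_south_2"},
        "inter_fabric": {"10.2.0.0/16": "if_south_3"},
    }

    def forward(in_interface: str, prefix: str):
        # Pick the FIB from the incoming interface's class, then look up the prefix.
        fib = FIBS[INTERFACE_CLASS[in_interface]]
        return fib.get(prefix)

    print(forward("if_ring_1", "10.2.0.0/16"))   # resolved via the inter-fabric FIB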
Sandy Zhang: So maybe some flag could be added in the forwarding table to distinguish it?
Tony: How you solve it is over-specification. This thing tells you: look, this is the computation that you used to build this FIB...
Jeff [Chair]: So in BGP it's configuration logic: you have two different virtual routers to treat fabric and inter-fabric routes. Here, based on the fabric ID, you see whether it's you or it's not you. It's built into the protocol; you don't need an additional management task to identify a particular interface.
Tony: Okay, so please look over the stuff; maybe you'll find the whole thing is just made up, I don't know. I'm pretty confident this stuff holds up, but who knows; it's never been done before. I never saw any kind of dynamic routing for dragonflies where anybody explained how it's supposed to work. All this fancy stuff like dragonfly, hypercubes, or toroidal meshes was used in supercomputers where links never fail, so it's simple there; dynamic routing is overvalued. This is the first time I see something cooked up, except Dima's stuff, which is basically stitching BGP magic. So it's not really routing, it's more like carrying packets the right way by hand with a lot of policy magic, which is fine; a lot of people seem to consider that pretty good job security these days.
Jeff [Chair]: Okay, thanks Tony, great presentation. Academia has been trying to solve non-shortest-path routing probably for as long as routing has existed, and this is a very good example of how it can be done with the right protocol, simply and elegantly. We are exactly on time, so 30 minutes from now we are going to have an AIDC side meeting, which will talk in more detail about the workloads these kinds of topologies are dedicated to; it's really the machine learning application. Hopefully we can figure out how to record it and put it somewhere in the cloud. Thanks everyone, and we'll see you in Australia.