IETF 111 LSR Agenda

Chairs: Acee Lindem (acee@cisco.com), Chris Hopps (chopps@chopps.org)
Secretary: Yingzhen Qu (yingzhen.ietf@gmail.com)
WG Page: http://tools.ietf.org/wg/lsr/
Materials: https://datatracker.ietf.org/meeting/111/session/lsr

1. Meeting Administrivia and WG Update - Chairs (10 mins)

John Scudder: The flex-algo draft is in my queue, I'll get to it.

2. Flooding Speed (65 mins)
- Les Ginsberg (15 mins)
  https://datatracker.ietf.org/doc/draft-ginsberg-lsr-isis-flooding-scale/
- Bruno Decraene (20 mins)
  https://datatracker.ietf.org/doc/draft-decraene-lsr-isis-flooding-speed/
- Discussion (30 mins)

Chris H: Tony P mentioned in the chat that it should be LSPTxRate in the figure.
Les: Right. Apologies for that.
Acee: Are those attempts at the source?
Les: No, these are seconds.
Chris H: How much faster is it with RWin than with congestion control? Is there a comparison?
Guillaume: I don't think the comparison is relevant.
Chris: They have to be relevant so we can compare. How much faster can we flood? In Les's presentation there are numbers. Is there some number we can give? For example, can RWin get everything done in, say, 3 seconds?
Guillaume: The RWin algorithm should not be used alone. It should be used together with congestion control as an additional guarantee, so the congestion control doesn't lose packets due to CPU contention.
Chris: Is RWin equal to Les's making the transmit ack dynamic?
Guillaume: In a way, yes.
Bruno: It is difficult to compare directly because the hardware is different. If you just use RWin, the sender adapts to the receiver; it's the maximum the receiver can do. If the receiver pauses for 300ms, the sender will pause for 300ms. If anything changes in the receiver, the sender adapts quickly.
Chris: Les, is there a fixed target in your proposal? In RWin, it's a target set by the receiver.
Les: Agreed with Bruno, you can't compare raw numbers because of hardware and implementation differences. The more relevant question is which one is more adaptive. From the email communication, they're not expecting to modify RWin dynamically; it's chosen at startup. So to me, that means you have to pick a conservative number.
Chris: A good point. RWin reminds me of credits.
Les: I'll defer this to Bruno. My understanding is it's not adaptive; in order to be adaptive, there are numbers you have to get, and it's hard to get them.
Guillaume: There are two things, RWin and congestion control. RWin is the upper bound on what can be stored before IS-IS processing. We need both RWin and congestion control. In our case the congestion control is different from Les monitoring the ack rate: we monitor whether an LSP gets delayed for too long compared to the usual rate, so the dynamics of the sender come from the congestion control. The RWin is a guarantee on top.
Tony Li: Consider this as a control-loop problem. The point is we have feedback; we can do better. Les's slides show there is some time for the transmitter to react. Something to improve.
Chris H: I agree with Tony, it's a good starting point. Some people are interested in not adding too much info, rate-limiting, etc. I don't think we should be so averse to it.
Bruno: The window is static, but it's not the rate we're going to achieve. The rate is determined by how fast the receiver can process and ack. The window is static but the rate is dynamic, based on how fast the sender and the receiver can process.
Chris H: Is the TxMax value in Les's slides related to RWin?
Les: The answer is yes. But the Tx-based algorithm is built to adapt. RWin is based on a picked number, and it's not going to change or adapt. The behavior from RWin is: the receiver says "I can receive 10 LSPs", the magic number, and then I have to pause. In both cases, the testing done so far is artificial because we only run IS-IS on a limited topology, with no FIB updates, etc. In the real world, there will be data traffic, etc. The capability of the receiver is not going to be static.
Chris H: Did you simulate this?
Les: Yes, in a simplified way, to demonstrate that the algorithm does adapt. It doesn't adapt in the case of a slowdown with zero retransmissions, but it does adapt, and I think this is an important aspect. In the real world, we will be doing lots more than just IS-IS.
Guillaume: I agree you need congestion control. The value you advertise is not magic, it's the space where LSPs get stored before processing, and I don't think it's too much to ask. You know the buffer you have where LSPs get stored before processing. I agree that RWin can become a bottleneck, especially in the case of large RTTs. In the example I sent on the list, you have a burst of 10 LSPs and an RTT of 10ms, and you're able to reach 1000 LSPs per second. So likely, if you have a large buffer, you will never reach this bottleneck. So you actually need a congestion control algorithm; I agree with this.
Tony Li: Although RWin is static, let's not set it in stone. We don't want to pick A or B right now. The point we're trying to make is that feedback is helpful. In a perfect world we could tell the transmitter instantaneously what rx-max is all the time. We can't do that, so the question is how fast we can signal.
Bruno: Agree with Les. We want the sender to be dynamic and adaptive. If the receiver stops for 200ms, the sender will stop for 200ms after one RTT, without losing any LSPs. We can adapt to different numbers of neighbors without losing LSPs. In the CPU-bound case, as you can see in Les's slides, it adapts, but only after losing LSPs. So both adapt, with one being faster and not losing LSPs.
Ketan: RWin is like the TxMax rate, more or less. Whether it's static or constant, I don't see it as a max rate configured on a per-link basis. I see the challenge is that it should be a dynamic value, not static. Even if static, I don't know how it can be determined, socket or BGP, etc. There are additional requirements needed to implement this, backward compatibility, implementation assumptions, etc., and they should be documented.
Les: Agree with Tony Li. If we could adapt RWin dynamically, it would be useful, but we don't know how to do it and it's very difficult. Anything presented so far doesn't give a hint, and that's a significant issue. We need a practical solution. As Ketan just said, it's important to work with nodes that are not optimized; the RWin-based proposal is heavily dependent on PSNP response time being optimized. I just don't want to assume all routers are optimized.
Chris H: We haven't talked about where to derive the information; I personally don't believe it's so hard. The first thing that comes to my mind is the line-card queue depth to the RP. We need more information.
Acee: I agree feedback is good. These two proposals are using different feedback: one is RWin, and the other is the transmitter taking the actual behavior of the receiver as indicated by the acks sent. There are a lot of differences in how you implement it. The other thing is whether or not to have an interim for this.
Chris H: This is a fruitful discussion and we may have an interim for it.
Acee: Let's take to the list what we should do about experimentation.
Chris H: It will be great if we can get an apples-to-apples comparison, or close to it. Let's take the discussion to the list.
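To make the window/rate relationship discussed above concrete, a minimal illustrative sketch follows. It is not either draft's specified behavior: it only shows how a receiver-advertised window (RWin-style) can be combined with ack-based rate adaptation at the sender. All names (FloodSender, rwin, on_psnp_ack) and thresholds are hypothetical. The print at the end is Guillaume's arithmetic: a window of 10 LSPs with a 10 ms RTT bounds throughput at roughly 10 / 0.010 = 1000 LSPs per second.

    # Illustrative sketch only: a flooding sender that honors a receiver-advertised
    # window (RWin-style) and adapts its transmit rate from PSNP ack feedback.
    # Names and thresholds are hypothetical, not taken from either draft.
    import time
    from collections import deque

    class FloodSender:
        def __init__(self, rwin, tx_rate_init=100.0, tx_rate_max=2000.0):
            self.rwin = rwin                  # receiver-advertised window (LSPs in flight)
            self.tx_rate = tx_rate_init       # current pacing rate, LSPs/second
            self.tx_rate_max = tx_rate_max    # sanity upper bound ("TxMax"-like)
            self.in_flight = {}               # lsp_id -> send timestamp, awaiting PSNP ack
            self.queue = deque()              # LSPs waiting to be flooded

        def can_send(self):
            # RWin guarantee: never exceed the receiver's advertised buffer space.
            return len(self.in_flight) < self.rwin

        def send_ready(self, now, send_lsp):
            # Pace at self.tx_rate while respecting the window.
            while self.queue and self.can_send():
                lsp_id = self.queue.popleft()
                send_lsp(lsp_id)
                self.in_flight[lsp_id] = now
                time.sleep(1.0 / self.tx_rate)   # crude pacing, for illustration only

        def on_psnp_ack(self, lsp_id, now):
            sent_at = self.in_flight.pop(lsp_id, None)
            if sent_at is None:
                return
            ack_delay = now - sent_at
            # Hypothetical congestion-control-like rule: speed up while acks come
            # back promptly, back off when acks are delayed.
            if ack_delay < 0.05:
                self.tx_rate = min(self.tx_rate * 1.1, self.tx_rate_max)
            else:
                self.tx_rate = max(self.tx_rate / 2.0, 10.0)

    # Upper bound implied by the window alone: roughly rwin / RTT.
    print(10 / 0.010)   # Guillaume's example -> 1000.0 LSPs per second

The sketch is only meant to show how the two signals compose: the window caps the amount in flight regardless of rate, while the ack feedback moves the rate toward what the receiver can actually sustain.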
3. IS-IS Flood Reflection - Tony Przygienda (5 mins)
https://datatracker.ietf.org/doc/draft-ietf-lsr-isis-flood-reflection/

Acee: I'd like to see more discussion of this draft on the list.
Tony P: I'll work with you on the code point stuff.
Chris H: Can you do it without tunneling?
Tony P: It's possible, but you may not want to do it operationally. Will add some clarifications.
Les: The code point has been renewed.

4. Flexible Algorithms: Bandwidth, Delay, Metrics and Constraints - Shraddha Hegde (10 mins)
https://datatracker.ietf.org/doc/draft-ietf-lsr-flex-algo-bw-con/

Chris H: We're not going to make a consensus call now. If you have contention, better to take it to the WG since it's a WG doc. Just chair comments.
Ketan: Normally SR-TE is set up to the node, not to the prefix. If that's achieved through a generic metric on the link, I'm not sure. That applies to RSVP-TE as well. If a generic metric is needed at the prefix level, we can add it later.
Shraddha: Maybe I was not clear, I will clarify on the list.
Acee: To reiterate what I said on the list: we spent all this time on ASLAs, and now, in an existing WG doc that does something everybody agreed we needed, we introduce this generic metric which is not compatible. It's ambiguous: rather than using application-specific metrics you have different metric types for different applications. Maybe we should use a bandwidth metric for these bandwidth constraints and move the generic metric to a separate proposal. For example, you said you could put it in extended link attributes or the TE LSA; then we'd have to go back to correlating LSAs for flex-algo, and that would be a disaster.
Shraddha: I don't understand your concern. Are you saying we should not have it in the TE LSA? What's the point? There is no proposal to use it from the TE LSA for flex-algo.
Tony Li: There's no end run going on. In an earlier draft, we had it as a bandwidth metric and we had definitions of how bandwidth should be defined; later we concluded that it's purely a local definition. There is no reason to mandate that an operator use a particular algorithm, so it makes more sense to make it generic.
Chris H: In the draft, when you specifically talk about cases, are you moving those to use cases?
Tony Li: You could use the generic metric for bandwidth.
Chris: Maybe we didn't talk about it much in the draft.
Ron: We didn't violate the words written in RFC 8919; maybe the authors' intent.
Chris: If an erratum is needed, we can file one.
Ron: It's an update to the document. The community reviewed the text on the page, not the authors' minds.
John: If you open an erratum, I'll look at the consensus and how it's written down. If it got written down wrong, then it's an erratum; otherwise the erratum doesn't get confirmed.
Chris H: Thanks. Let's go for the low-hanging fruit first, and go for the heavier items if we have to.

5. IS-IS and OSPF Extension for Event Notification - Peter Psenak (15 mins)
https://datatracker.ietf.org/doc/draft-ppsenak-lsr-igp-event-notification/

Acee: It's like a best-effort delivery.
Peter: Yes. There is reliability, but it is limited.
Acee: It's based on when you got the component of the summary. There are two parts, the mechanism and the events triggered.
Peter: It can be any application.
Huaimo: It defines generic procedures and encodings for distributing events. Ten years ago, I had a draft using the traditional way. This way is much better.
Aijun: This is another approach for the scenario in the PUA draft, xxxxx (voice broken, will send it to the list).
Chris H: Interesting new work, let's discuss more on the list. Is this related to the next presentation?
Peter: Yes. One of the use cases is related to prefix unreachability, but we use a completely different mechanism. And this defines a generic mechanism.

6. Updates for PUA and Passive Interface Attributes - Gyan Mishra/Aijun Wang (10 mins)
https://datatracker.ietf.org/doc/draft-wang-lsr-prefix-unreachable-annoucement/
https://datatracker.ietf.org/doc/draft-wang-lsr-passive-interface-attribute/

Acee: I don't think we need this. We have links that are topologically significant to the IGPs, and we have prefixes for local addresses. You can take this stub link to carry info for applications, but the IGP doesn't need it. But to invent a new construct to do it, and to advertise the prefix separately from the prefix used for the route computation, that's what I think is wrong. I know we disagree on it, but that's my comment.
Chris H: Is this a WG doc?
Acee: They requested WG adoption.
Chris H: Let's have more discussion on the list.
Aijun: We had it in the prefix TLV, but after discussion on the list we changed it to a stub link. We will discuss more.
Acee: Once you add the address to the stub link, you're advertising it in two different ways; that's a good indication this is not the right way to encode it.

7. Meeting Closure - Chairs (5 mins)

Chris H: Discussions are good. We will look into an interim. Thanks for participating, see you next time.

From the Chat:

Ketan Talaulikar: @Acee the authors remove the reference to OSPFv3 SRv6 from the flex-algo; it will be covered in the OSPFv3 SRv6 draft. 16:05:52
s/remove/removed ... this was done in draft-ietf-lsr-flex-algo 16:06:21
Bruno Decraene: Actually, Guillaume will present 16:07:26
Tony Przygienda: just as clarification, the ack delay is not even necessary, it's just an optimization to prevent the algorithm from backing off too quickly on large ACK delays on the Rx 16:15:20
my observation to Les was actually that it's even simpler to watch the outstanding LSPs in the LSP-RETX queue and just start to back off when it starts to reach a certain % of the max rate (equivalent to not enough acks, really, for whatever reason). 16:16:19
Jeffrey Haas: ^ 16:16:53
I'm waiting to see if there will be a case simulating some % of packet loss and its behavior 16:17:11
Tony Przygienda: yeah, lots of graphs coming 16:17:24
ah, you mean loss on purpose? Les didn't do it but from experience it doesn't matter 16:17:46
Jeffrey Haas: I have different experience, but that's mostly in TCP timers. 16:18:11
Tony Przygienda: overload/loss/slow are all the same to the algorithm. it just builds up the LSP retx queue or not enough acks, and the hysteresis backs off 16:18:18
yeah, that's why we should NOT use TCP here ;-) 16:18:29
TCP collapses very quickly on losses 16:18:40
Jeffrey Haas: yep. but it does give you a strong sense of how various re-xmit algos play 16:18:46
Guillaume Solignac: There are TCP algorithms that work on bandwidth monitoring as well (Google's BBR) 16:19:11
Tony Przygienda: @Guillaume, correct, all the newer work 16:19:23
ultimately the problem is TCP is ordered and link state flooding is not 16:19:35
hence you don't have to back off since you don't get a big mbuf buildup 16:19:46
Graph _still_ incorrect, those are LSPTxRate! 16:20:13
Jeffrey Haas: A case of cadence matters. 16:22:35
Tony Przygienda: as a side note: even with a socket per peer in IS-IS, lots of platforms have single-queue bottlenecks in the whole chain from port to user space 16:31:10
Les Ginsberg: It is not possible for IS-IS to know (at either end) the difference between loss and delay. This is because the state of the queues/punt path from the dataplane to IS-IS in the control plane is not known by IS-IS - and is difficult to know. And because there is no ordering of LSPs - so receiving an update from Node A tells you nothing about whether you should have received an update from Node B. (The TCP analogy is not a good fit) 16:31:23
Tony Przygienda: yeah, we can go for quite a bit really. on lots of platforms buffer space is shared amongst sockets as well 16:32:59
Bruno Decraene: @TonyP that's not a problem. RWIN is used to control the rate between applications (IS-IS). For limitations on the path, a congestion control algo (roughly similar in both drafts) is used. 16:34:10
So RWIN is used in addition (not as a replacement) 16:34:37
Tony Przygienda: 2000 on all VLANs or just one? 16:38:17
looks like all VLANs. this is a very low rate to saturate even a small CPU with well-implemented flooding IME. I'm surprised 16:39:20
Bruno Decraene: total (200 LSP/s per neighbor) 16:39:30
Tony Przygienda: IME MaxTX is superfluous. The hysteresis will back off nicely 16:50:33
it's more of a "sanity upper bound" so the thing doesn't run away to eat all CPU/I/O possibly ;-) 16:51:00
Guillaume Solignac: You have to take care of TCP fairness though, your algorithm could get crushed by BGP 16:51:35
Tony Li: Take a look at Les' graphs and look at the latency to adapt. 16:51:51
Tony Przygienda: @Guillaume, that's solved differently on a good box 16:52:16
Tony Li: What would happen if the receiver could signal RXmax? 16:52:23
Guillaume Solignac: Does it even exist? 16:52:45
Tony Przygienda: @Tony: yepp, of course. if you signal every 50 msecs that will always be faster than waiting for backpressure by loss/queue overbuild ;-) 16:52:57
it could be even faster if you entangle the RX and TX ;-) 16:53:09
OK, tired of waiting for the meandering mike so I keep it very clip & practical 17:00:04
1. a fixed window ain't gonna cut it and you won't be able to compute it since every platform has so many variables you never get it right. And the load of the system changes on top 17:00:32
2. you can't signal that fast reliably with lots of peers, @ scale. Assuming very short, precise timers in user space on real systems is just that, an assumption. You can have very few or you can have them in the kernel; in user space, timer slips of 100s of msecs are normal fare. 17:01:25
Jeffrey Haas: lack of resources in whatever flavor will manifest as drop. 17:01:27
Guillaume Solignac: @Tony the real signal is not in the TLV, it is the PSNPs 17:03:39
A fixed window allows pacing the sender to the PSNPs 17:04:27
The PSNP rate is a dynamic signal that you use as well, but you have one piece of information less 17:04:47
So you have fewer guarantees 17:05:03
Tony Przygienda: yes, that's a reasonable way to see it, you are free to send fewer ACKs to backpressure. It's a "poor man's window" if you want. It kind of happens naturally when IS-IS gets busy and doesn't get to push PSNPs (modulo parallel implementations, which is a different kettle of pisces ;-) 17:07:26
Tony Li: You could estimate your own RXmax. 17:09:16
How many did you process in the last 1s? 17:09:35
Jeff Tantsura: hackathon? 17:11:05
great stuff! 17:11:57
Robert Raszuk: @Tony Doesn't it also depend on how much you got? 17:13:45
Tony Li: No, it doesn't need to. 17:14:02
If you managed to process 100 in the last second, you say that. 17:14:20
The transmitter can infer that you dropped a million. :-) 17:14:31
Robert Raszuk: Oh in that direction ... sure 17:14:53
I was looking at the max in maximum point 17:15:17
Tony Li: The point is that we need feedback. As we learn, we can be more sophisticated about the feedback. 17:16:48
And yes, the feedback is optional. We do have to work with legacy. 17:17:08
John Scudder: I missed precisely what it was Acee was calling an "end run"? 17:29:30
Bruno Decraene: @Les in slide 8 you have loss of LSPs when adapting/slowing down. Do you think that you could add a test in your implementation? If unacknowledged LSPs > 40, pause sending LSPs. And see if this can reduce or eliminate your loss of LSPs? 17:30:58
John Scudder: I mean, I can and will go back and listen to the replay to try to understand it, but I'd appreciate clarification from @Acee. 17:31:07
Robert Raszuk: Well, a Pulse would trigger BGP route calculation (best path run) - how do you communicate Pulses from IGP to BGP? Via the RIB? How, if it only contains the IGP summary ... 17:38:42
Also, is there no worry about Pulse-based DDoS to poor nodes when we have massive failures? I assume you are not planning on summarizing Pulses? 17:40:35
Les Ginsberg: @Bruno - we tried several different strategies - one of them was what you suggest. Performance was not as good. 17:41:46
Tony Przygienda: @Bruno. roughly what I said, as in "don't count ACKs, enough to look @ your outstanding queue". However, from direct implementation experience, you want to look @ % of queue as your flooding speed rather than a constant number. Yes, if the RX could somehow signal its window, that could be taken into account, but again, there is no timestamp on anything (or for that matter distributed time), you cannot tell losses from delays or not sending, etc. Especially assuming the very small timescales, given the burstiness of load on real-world systems. 17:42:04
Bruno Decraene: @Les performance is limited by the size of RWIN and the RTT (same as with TCP). So probably the PSNPs were not sent fast enough. How fast do you send PSNPs? 17:47:16
Les Ginsberg: @Bruno - we have tested with a variety of PSNP times - but for all the data shared today we were acking within 50 ms. 17:52:37
Bruno Decraene: @Les ok. With an RWIN of 40, that should give 800 LSP/s per neighbour. Below what you achieve with a single neighbor. Probably better starting with 3 neighbours.
Tony Li: IGP is not a dump truck.
Jeff Tantsura: we have got BGP for that... 18:01:25
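As a small illustration of the two quantitative points raised in the chat above: Tony P's suggestion to back off when the LSP retransmission queue fills beyond some fraction of its limit, and Bruno's observation that a fixed window bounds throughput at roughly RWIN divided by the ack delay (an RWIN of 40 acked within 50 ms gives at most about 800 LSP/s per neighbor). This is a hedged sketch only; the function names and the 0.8 threshold are hypothetical and not taken from either draft.

    # Illustrative sketch, not either draft's specified behavior.

    def backoff_needed(retx_queue_len, retx_queue_limit, threshold=0.8):
        # Paraphrase of Tony P's heuristic: once the outstanding LSP retransmission
        # queue reaches a certain fraction of its limit, treat that as missing acks
        # and slow the flooding rate. The 0.8 threshold is a made-up example value.
        return retx_queue_len >= threshold * retx_queue_limit

    def window_rate_bound(rwin, ack_delay_s):
        # With a fixed window, per-neighbor throughput is bounded by roughly the
        # window size divided by the time it takes to get the LSPs acknowledged.
        return rwin / ack_delay_s

    print(window_rate_bound(40, 0.050))   # Bruno's example: 40 / 50 ms -> 800.0 LSP/s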