IETF 111 LSR Agenda

Chairs: Acee Lindem (acee@cisco.com), Chris Hopps (chopps@chopps.org)
Secretary: Yingzhen Qu (yingzhen.ietf@gmail.com)
WG Page: http://tools.ietf.org/wg/lsr/
Materials: https://datatracker.ietf.org/meeting/111/session/lsr

1. Meeting Administrivia and WG Update - Chairs (10 mins)

John Scudder: The flex-algo draft is in my queue, I'll get to it.

2. Flooding Speed (65 mins)
- Les Ginsberg (15 mins)
  https://datatracker.ietf.org/doc/draft-ginsberg-lsr-isis-flooding-scale/
- Bruno Decraene (20 mins)
  https://datatracker.ietf.org/doc/draft-decraene-lsr-isis-flooding-speed/
- Discussion (30 mins)

Chris H: Tony P mentioned in the chat that it should be LSPTxRate in the figure.
Les: Right. Apologies for that.
Acee: Are those attempts at the source?
Les: No, these are seconds.
Chris H: How much faster is it with RWin than with congestion control? Is there a comparison?
Guillaume: I don't think the comparison is relevant.
Chris: They have to be relevant so we can compare. How much faster can we flood? In Les's presentation there are numbers. Is there some number we can give? For example, can RWin get everything done in, say, 3 seconds?
Guillaume: The RWin algorithm should not be used alone. It should be used together with congestion control as an additional guarantee, so the congestion control doesn't lose packets due to CPU contention.
Chris: Is RWin equal to Les's making the transmit ack dynamic?
Guillaume: In a way, yes.
Bruno: It is difficult to compare directly because the hardware is different. If you just use RWin, the sender adapts to the receiver; it's the maximum the receiver can do. If the receiver pauses for 300ms, the sender will pause for 300ms. If anything changes in the receiver, the sender adapts quickly.
Chris: Les, is there a fixed target in your proposal? In RWin, it's a target set by the receiver.
Les: Agreed with Bruno, you can't compare raw numbers because of hardware and implementation differences. The more relevant question is which one is more adaptive. From the email communication, they're not expecting to modify RWin dynamically; it's chosen at startup. So to me, that means you have to pick a conservative number.
Chris: A good point. RWin reminds me of credits.
Les: I'll defer this to Bruno. My understanding is it's not adaptive; in order to be adaptive, there are numbers you have to get, and it's hard to get them.
Guillaume: There are two things, RWin and congestion control. RWin is the upper bound on what can be stored before IS-IS processing. We need both RWin and congestion control. In our case the congestion control is different from Les monitoring the ack rate: we monitor whether an LSP gets delayed for too long compared to the usual rate, so the dynamics of the sender come from the congestion control. The RWin is a guarantee on top.
Tony Li: Consider this as a control-loop problem. The point is we have feedback; we can do better. Les's slides show there is some time for the transmitter to react. Something to improve.
Chris H: I agree with Tony, it's a good starting point. Some people are interested in not adding too much info, rate-limiting, etc. I don't think we should be so averse to it.
Bruno: The window is static, but it's not the rate we're going to achieve. The rate is determined by how fast the receiver can process and ack. The window is static but the rate is dynamic, based on how fast the sender and the receiver can process.
Chris H: Is the TxMax value in Les's slides related to RWin?
Les: The answer is yes. But the Tx-based algorithm is built to adapt. RWin is based on a picked number, and it's not going to change or adapt. The behavior from RWin is: the receiver says "I can receive 10 LSPs", the magic number, and then I have to pause. In both cases, the testing done so far is artificial because we only run IS-IS on a limited topology, with no FIB updates, etc. In the real world, there will be data traffic, etc. The capability of the receiver is not going to be static.
Chris H: Did you simulate this?
Les: Yes, in a simplified way, to demonstrate that the algorithm does adapt. It doesn't adapt in the case of a slowdown with zero retransmissions, but it does adapt, and I think this is an important aspect. In the real world, we will be doing lots more than just IS-IS.
Guillaume: I agree you need congestion control. The value you advertise is not magic, it's the space where LSPs get stored before processing, and I don't think it's too much to ask. You know the buffer you have where LSPs get stored before processing. I agree that RWin can become a bottleneck, especially in the case of large RTTs. In the example I sent on the list, you have a burst of 10 LSPs and an RTT of 10ms, and you're able to reach 1000 LSPs per second. So likely, if you have a large buffer, you will never reach this bottleneck. So you actually need a congestion control algorithm; I agree with this.
Tony Li: Although RWin is static, let's not set it in stone. We don't want to pick A or B right now. The point we're trying to make is that feedback is helpful. In a perfect world we could tell the transmitter instantaneously what rx-max is all the time. We can't do that, so the question is how fast we can signal.
Bruno: Agree with Les. We want the sender to be dynamic and adaptive. If the receiver stops for 200ms, the sender will stop for 200ms after one RTT, without losing any LSPs. We can adapt to different numbers of neighbors without losing LSPs. In the CPU-bound case, as you can see in Les's slides, it adapts, but only after losing LSPs. So both adapt, with one being faster and not losing LSPs.
Ketan: RWin is like the TxMax rate, more or less. Whether it's static or constant, I don't see it as a max rate configured on a per-link basis. I see the challenge is that it should be a dynamic value, not static. Even if static, I don't know how it can be determined, socket or BGP, etc. There are additional requirements needed to implement this, backward compatibility, implementation assumptions, etc., and they should be documented.
Les: Agree with Tony Li. If we could adapt RWin dynamically, it would be useful, but we don't know how to do it and it's very difficult. Anything presented so far doesn't give a hint, and that's a significant issue. We need a practical solution. As Ketan just said, it's important to work with nodes that are not optimized; the RWin-based proposal is heavily dependent on PSNP response time being optimized. I just don't want to assume all routers are optimized.
Chris H: We haven't talked about where to derive the information; I personally don't believe it's so hard. The first thing that comes to my mind is the line-card queue depth to the RP. We need more information.
Acee: I agree feedback is good. These two proposals are using different feedback: one is RWin, and the other is the transmitter taking the actual behavior of the receiver as indicated by the acks sent. There are a lot of differences in how you implement it. The other thing is whether or not to have an interim for this.
Chris H: This is a fruitful discussion and we may have an interim for it.
Acee: Let's take to the list what we should do about experimentation.
Chris H: It will be great if we can get an apples-to-apples comparison, or close to it. Let's take the discussion to the list.
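To make the window/rate relationship discussed above concrete, a minimal illustrative sketch follows. It is not either draft's specified behavior: it only shows how a receiver-advertised window (RWin-style) can be combined with ack-based rate adaptation at the sender. All names (FloodSender, rwin, on_psnp_ack) and thresholds are hypothetical. The print at the end is Guillaume's arithmetic: a window of 10 LSPs with a 10 ms RTT bounds throughput at roughly 10 / 0.010 = 1000 LSPs per second.

    # Illustrative sketch only: a flooding sender that honors a receiver-advertised
    # window (RWin-style) and adapts its transmit rate from PSNP ack feedback.
    # Names and thresholds are hypothetical, not taken from either draft.
    import time
    from collections import deque

    class FloodSender:
        def __init__(self, rwin, tx_rate_init=100.0, tx_rate_max=2000.0):
            self.rwin = rwin                  # receiver-advertised window (LSPs in flight)
            self.tx_rate = tx_rate_init       # current pacing rate, LSPs/second
            self.tx_rate_max = tx_rate_max    # sanity upper bound ("TxMax"-like)
            self.in_flight = {}               # lsp_id -> send timestamp, awaiting PSNP ack
            self.queue = deque()              # LSPs waiting to be flooded

        def can_send(self):
            # RWin guarantee: never exceed the receiver's advertised buffer space.
            return len(self.in_flight) < self.rwin

        def send_ready(self, now, send_lsp):
            # Pace at self.tx_rate while respecting the window.
            while self.queue and self.can_send():
                lsp_id = self.queue.popleft()
                send_lsp(lsp_id)
                self.in_flight[lsp_id] = now
                time.sleep(1.0 / self.tx_rate)   # crude pacing, for illustration only

        def on_psnp_ack(self, lsp_id, now):
            sent_at = self.in_flight.pop(lsp_id, None)
            if sent_at is None:
                return
            ack_delay = now - sent_at
            # Hypothetical congestion-control-like rule: speed up while acks come
            # back promptly, back off when acks are delayed.
            if ack_delay < 0.05:
                self.tx_rate = min(self.tx_rate * 1.1, self.tx_rate_max)
            else:
                self.tx_rate = max(self.tx_rate / 2.0, 10.0)

    # Upper bound implied by the window alone: roughly rwin / RTT.
    print(10 / 0.010)   # Guillaume's example -> 1000.0 LSPs per second

The sketch is only meant to show how the two signals compose: the window caps the amount in flight regardless of rate, while the ack feedback moves the rate toward what the receiver can actually sustain.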
3. IS-IS Flood Reflection - Tony Przygienda (5 mins)
https://datatracker.ietf.org/doc/draft-ietf-lsr-isis-flood-reflection/

Acee: I'd like to see more discussion of this draft on the list.
Tony P: I'll work with you on the code point stuff.
Chris H: Can you do it without tunneling?
Tony P: It's possible, but you may not want to do it operationally. Will add some clarifications.
Les: The code point has been renewed.

4. Flexible Algorithms: Bandwidth, Delay, Metrics and Constraints - Shraddha Hegde (10 mins)
https://datatracker.ietf.org/doc/draft-ietf-lsr-flex-algo-bw-con/

Chris H: We're not going to make a consensus call now. If you have contention, better to take it to the WG since it's a WG doc. Just chair comments.
Ketan: Normally SR-TE is set up to the node, not to the prefix. If that's achieved through a generic metric on the link, I'm not sure. That applies to RSVP-TE as well. If a generic metric is needed at the prefix level, we can add it later.
Shraddha: Maybe I was not clear, I will clarify on the list.
Acee: To reiterate what I said on the list: we spent all this time on ASLAs, and now, in an existing WG doc that does something everybody agreed we needed, we introduce this generic metric which is not compatible. It's ambiguous: rather than using application-specific metrics you have different metric types for different applications. Maybe we should use a bandwidth metric for these bandwidth constraints and move the generic metric to a separate proposal. For example, you said you could put it in extended link attributes or the TE LSA; then we'd have to go back to correlating LSAs for flex-algo, and that would be a disaster.
Shraddha: I don't understand your concern. Are you saying we should not have it in the TE LSA? What's the point? There is no proposal to use it from the TE LSA for flex-algo.
Tony Li: There's no end run going on. In an earlier draft, we had it as a bandwidth metric and we had definitions of how bandwidth should be defined; later we concluded that it's purely a local definition. There is no reason to mandate that an operator use a particular algorithm, so it makes more sense to make it generic.
Chris H: In the draft, when you specifically talk about cases, are you moving those to use cases?
Tony Li: You could use the generic metric for bandwidth.
Chris: Maybe we didn't talk about it much in the draft.
Ron: We didn't violate the words written in RFC 8919; maybe the authors' intent.
Chris: If an erratum is needed, we can file one.
Ron: It's an update to the document. The community reviewed the text on the page, not the authors' minds.
John: If you open an erratum, I'll look at the consensus and how it's written down. If it got written down wrong, then it's an erratum; otherwise the erratum doesn't get confirmed.
Chris H: Thanks. Let's go for the low-hanging fruit first, and go for the heavier items if we have to.

5. IS-IS and OSPF Extension for Event Notification - Peter Psenak (15 mins)
https://datatracker.ietf.org/doc/draft-ppsenak-lsr-igp-event-notification/

Acee: It's like a best-effort delivery.
Peter: Yes. There is reliability, but it is limited.
Acee: It's based on when you got the component of the summary. There are two parts, the mechanism and the events triggered.
Peter: It can be any application.
Huaimo: It defines generic procedures and encodings for distributing events. Ten years ago, I had a draft using the traditional way. This way is much better.
Aijun: This is another approach for the scenario in the PUA draft, xxxxx (voice broken, will send it to the list).
Chris H: Interesting new work, let's discuss more on the list. Is this related to the next presentation?
Peter: Yes. One of the use cases is related to prefix unreachability, but we use a completely different mechanism. And this defines a generic mechanism.

6. Updates for PUA and Passive Interface Attributes - Gyan Mishra/Aijun Wang (10 mins)
https://datatracker.ietf.org/doc/draft-wang-lsr-prefix-unreachable-annoucement/
https://datatracker.ietf.org/doc/draft-wang-lsr-passive-interface-attribute/

Acee: I don't think we need this. We have links that are topologically significant to the IGPs, and we have prefixes for local addresses. You can take this stub link to carry info for applications, but the IGP doesn't need it. But to invent a new construct to do it, and to advertise the prefix separately from the prefix used for the route computation, that's what I think is wrong. I know we disagree on it, but that's my comment.
Chris H: Is this a WG doc?
Acee: They requested WG adoption.
Chris H: Let's have more discussion on the list.
Aijun: We had it in the prefix TLV, but after discussion on the list we changed it to a stub link. We will discuss more.
Acee: Once you add the address to the stub link, you're advertising it in two different ways; that's a good indication this is not the right way to encode it.

7. Meeting Closure - Chairs (5 mins)

Chris H: Discussions are good. We will look into an interim. Thanks for participating, see you next time.

From the Chat:

Ketan Talaulikar: @Acee the authors remove the reference to OSPFv3 SRv6 from the flex-algo; it will be covered in the OSPFv3 SRv6 draft. 16:05:52
s/remove/removed ... this was done in draft-ietf-lsr-flex-algo 16:06:21
Bruno Decraene: Actually, Guillaume will present 16:07:26
Tony Przygienda: just as clarification, the ack delay is not even necessary, it's just an optimization to prevent the algorithm from backing off too quickly on large ACK delays on the Rx 16:15:20
my observation to Les was actually that it's even simpler to watch the outstanding LSPs in the LSP-RETX queue and just start to back off when it starts to reach a certain % of the max rate (equivalent to not enough acks, really, for whatever reason). 16:16:19
Jeffrey Haas: ^ 16:16:53
I'm waiting to see if there will be a case simulating some % of packet loss and its behavior 16:17:11
Tony Przygienda: yeah, lots of graphs coming 16:17:24
ah, you mean loss on purpose? Les didn't do it but from experience it doesn't matter 16:17:46
Jeffrey Haas: I have different experience, but that's mostly in TCP timers. 16:18:11
Tony Przygienda: overload/loss/slow are all the same to the algorithm. it just builds up the LSP retx queue or not enough acks, and the hysteresis backs off 16:18:18
yeah, that's why we should NOT use TCP here ;-) 16:18:29
TCP collapses very quickly on losses 16:18:40
Jeffrey Haas: yep. but it does give you a strong sense of how various re-xmit algos play 16:18:46
Guillaume Solignac: There are TCP algorithms that work on bandwidth monitoring as well (Google's BBR) 16:19:11
Tony Przygienda: @Guillaume, correct, all the newer work 16:19:23
ultimately the problem is TCP is ordered and link state flooding is not 16:19:35
hence you don't have to back off since you don't get a big mbuf buildup 16:19:46
Graph _still_ incorrect, those are LSPTxRate! 16:20:13
Jeffrey Haas: A case of cadence matters. 16:22:35
Tony Przygienda: as a side note: even with a socket per peer in IS-IS, lots of platforms have single-queue bottlenecks in the whole chain from port to user space 16:31:10
Les Ginsberg: It is not possible for IS-IS to know (at either end) the difference between loss and delay. This is because the state of the queues/punt path from the dataplane to IS-IS in the control plane is not known by IS-IS - and is difficult to know. And because there is no ordering of LSPs - so receiving an update from Node A tells you nothing about whether you should have received an update from Node B. (The TCP analogy is not a good fit) 16:31:23
Tony Przygienda: yeah, we can go for quite a bit really. on lots of platforms buffer space is shared amongst sockets as well 16:32:59
Bruno Decraene: @TonyP that's not a problem. RWIN is used to control the rate between applications (IS-IS). For limitations on the path, a congestion control algo (roughly similar in both drafts) is used. 16:34:10
So RWIN is used in addition (not as a replacement) 16:34:37
Tony Przygienda: 2000 on all VLANs or just one? 16:38:17
looks like all VLANs. this is a very low rate to saturate even a small CPU with well-implemented flooding IME. I'm surprised 16:39:20
Bruno Decraene: total (200 LSP/s per neighbor) 16:39:30
Tony Przygienda: IME MaxTX is superfluous. The hysteresis will back off nicely 16:50:33
it's more of a "sanity upper bound" so the thing doesn't run away to eat all CPU/I/O possibly ;-) 16:51:00
Guillaume Solignac: You have to take care of TCP fairness though, your algorithm could get crushed by BGP 16:51:35
Tony Li: Take a look at Les' graphs and look at the latency to adapt. 16:51:51
Tony Przygienda: @Guillaume, that's solved differently on a good box 16:52:16
Tony Li: What would happen if the receiver could signal RXmax? 16:52:23
Guillaume Solignac: Does it even exist? 16:52:45
Tony Przygienda: @Tony: yepp, of course. if you signal every 50 msecs that will always be faster than waiting for backpressure by loss/queue overbuild ;-) 16:52:57
it could be even faster if you entangle the RX and TX ;-) 16:53:09
OK, tired of waiting for the meandering mike so I keep it very clip & practical 17:00:04
1. a fixed window ain't gonna cut it and you won't be able to compute it since every platform has so many variables you never get it right. And the load of the system changes on top 17:00:32
2. you can't signal that fast reliably with lots of peers, @ scale. Assuming very short, precise timers in user space on real systems is just that, an assumption. You can have very few or you can have them in the kernel; in user space, timer slips of 100s of msecs are normal fare. 17:01:25
Jeffrey Haas: lack of resources in whatever flavor will manifest as drop. 17:01:27
Guillaume Solignac: @Tony the real signal is not in the TLV, it is the PSNPs 17:03:39
A fixed window allows pacing the sender to the PSNPs 17:04:27
The PSNP rate is a dynamic signal that you use as well, but you have one piece of information less 17:04:47
So you have fewer guarantees 17:05:03
Tony Przygienda: yes, that's a reasonable way to see it, you are free to send fewer ACKs to backpressure. It's a "poor man's window" if you want. It kind of happens naturally when IS-IS gets busy and doesn't get to push PSNPs (modulo parallel implementations, which is a different kettle of pisces ;-) 17:07:26
Tony Li: You could estimate your own RXmax. 17:09:16
How many did you process in the last 1s? 17:09:35
Jeff Tantsura: hackathon? 17:11:05
great stuff! 17:11:57
Robert Raszuk: @Tony Doesn't it also depend on how much you got? 17:13:45
Tony Li: No, it doesn't need to. 17:14:02
If you managed to process 100 in the last second, you say that. 17:14:20
The transmitter can infer that you dropped a million. :-) 17:14:31
Robert Raszuk: Oh in that direction ... sure 17:14:53
I was looking at the max in maximum point 17:15:17
Tony Li: The point is that we need feedback. As we learn, we can be more sophisticated about the feedback. 17:16:48
And yes, the feedback is optional. We do have to work with legacy. 17:17:08
John Scudder: I missed precisely what it was Acee was calling an "end run"? 17:29:30
Bruno Decraene: @Les in slide 8 you have loss of LSPs when adapting/slowing down. Do you think that you could add a test in your implementation? If unacknowledged LSPs > 40, pause sending LSPs. And see if this can reduce or eliminate your loss of LSPs? 17:30:58
John Scudder: I mean, I can and will go back and listen to the replay to try to understand it, but I'd appreciate clarification from @Acee. 17:31:07
Robert Raszuk: Well, a Pulse would trigger BGP route calculation (best path run) - how do you communicate Pulses from IGP to BGP? Via the RIB? How, if it only contains the IGP summary ... 17:38:42
Also, is there no worry about Pulse-based DDoS to poor nodes when we have massive failures? I assume you are not planning on summarizing Pulses? 17:40:35
Les Ginsberg: @Bruno - we tried several different strategies - one of them was what you suggest. Performance was not as good. 17:41:46
Tony Przygienda: @Bruno. roughly what I said, as in "don't count ACKs, enough to look @ your outstanding queue". However, from direct implementation experience, you want to look @ % of queue as your flooding speed rather than a constant number. Yes, if the RX could somehow signal its window, that could be taken into account, but again, there is no timestamp on anything (or for that matter distributed time), you cannot tell losses from delays or not sending, etc. Especially assuming the very small timescales, given the burstiness of load on real-world systems. 17:42:04
Bruno Decraene: @Les performance is limited by the size of RWIN and the RTT (same as with TCP). So probably the PSNPs were not sent fast enough. How fast do you send PSNPs? 17:47:16
Les Ginsberg: @Bruno - we have tested with a variety of PSNP times - but for all the data shared today we were acking within 50 ms. 17:52:37
Bruno Decraene: @Les ok. With an RWIN of 40, that should give 800 LSP/s per neighbour. Below what you achieve with a single neighbor. Probably better starting with 3 neighbours.
Tony Li: IGP is not a dump truck.
Jeff Tantsura: we have got BGP for that... 18:01:25
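As a small illustration of the two quantitative points raised in the chat above: Tony P's suggestion to back off when the LSP retransmission queue fills beyond some fraction of its limit, and Bruno's observation that a fixed window bounds throughput at roughly RWIN divided by the ack delay (an RWIN of 40 acked within 50 ms gives at most about 800 LSP/s per neighbor). This is a hedged sketch only; the function names and the 0.8 threshold are hypothetical and not taken from either draft.

    # Illustrative sketch, not either draft's specified behavior.

    def backoff_needed(retx_queue_len, retx_queue_limit, threshold=0.8):
        # Paraphrase of Tony P's heuristic: once the outstanding LSP retransmission
        # queue reaches a certain fraction of its limit, treat that as missing acks
        # and slow the flooding rate. The 0.8 threshold is a made-up example value.
        return retx_queue_len >= threshold * retx_queue_limit

    def window_rate_bound(rwin, ack_delay_s):
        # With a fixed window, per-neighbor throughput is bounded by roughly the
        # window size divided by the time it takes to get the LSPs acknowledged.
        return rwin / ack_delay_s

    print(window_rate_bound(40, 0.050))   # Bruno's example: 40 / 50 ms -> 800.0 LSP/s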