
Minutes IETF111: lsr
minutes-111-lsr-00

Meeting Minutes Link State Routing (lsr) WG
Date and time 2021-07-30 23:00
Title Minutes IETF111: lsr
State Active
Last updated 2021-08-03

IETF 111 LSR Agenda

Chairs:      Acee Lindem (acee@cisco.com)
             Chris Hopps (chopps@chopps.org)
Secretary:   Yingzhen Qu (yingzhen.ietf@gmail.com)

WG Page:     http://tools.ietf.org/wg/lsr/
Materials:   https://datatracker.ietf.org/meeting/111/session/lsr

1. Meeting Administrivia and WG Update
Chairs     (10 mins)

John Scudder: The flex-algo draft is in my queue, I'll get to it.

2. Flooding Speed   (65 mins)

    - Les Ginsberg (15 mins)
      https://datatracker.ietf.org/doc/draft-ginsberg-lsr-isis-flooding-scale/
    - Bruno Decraene  (20 mins)
      https://datatracker.ietf.org/doc/draft-decraene-lsr-isis-flooding-speed/
    - Discussion (30 mins)

Chris H:   Tony P mentioned in the chat that it should be LSPTxRate in
           the pic.
Les:       Right. Apologies for that.
Acee:      Are those attempts at the source?
Les:       No, these are seconds.

Chris H:   How much faster with RWin than with congestion control? Is
           there a comparison?
Guillaume: I don't think the comparison is relevant.
Chris:     They have to be relevant, so we can compare. How much faster
           can we flood? In Les's presentation, there are numbers.
           Is there some number we can say? For example, can RWin get
           everything done in, say, 3s?
Guillaume: The RWin algorithm should not be used alone. It should be used
           together with congestion control as an additional guarantee,
           so congestion control doesn't lose packets due to CPU
           contention.
Chris:     Is RWin equal to Les's making transmit ack dynamic?
Guillaume: In a way, yes.
Bruno:     It is difficult to compare directly because the hardware is
           different. If you just use RWin, the sender adapts to the
           receiver; it's the maximum the receiver can do. If the receiver
           pauses for 300ms, the sender will pause for 300ms. If anything
           changes in the receiver, the sender adapts quickly.
Chris:     Les, is there a fixed target in your proposal? In RWin, it's
           a target set by the receiver.
Les:       Agreed with Bruno, you can't compare raw numbers because of
           hardware and implementation differences. The more relevant
           question is which one is more adaptive. From the email
           communication, they're not expecting to modify RWin
           dynamically; it's chosen at startup. So to me, it means you
           have to pick a conservative number.
Chris:     A good point. RWin reminds me of credits.
Les:       I'll defer this to Bruno. My understanding is that it's not
           adaptive; in order to be adaptive, there are numbers you have
           to get and it's hard to get them.
Guillaume: There are two things, RWin and congestion control. RWin is
           the upper bound on what can be stored before ISIS processing.
           We need both RWin and congestion control. In our case the
           congestion control is different from Les's monitoring of the
           ack rate: we monitor whether an LSP gets delayed for too long
           compared to the usual rate, so the dynamics of the sender come
           from the congestion control. RWin is a guarantee on top.
Tony Li:   Consider this as a control loop problem. The point is that we
           have feedback, so we can do better. Les's slides show there is
           some time for the transmitter to react. Something to improve.
Chris H:   I agree with Tony, it's a good starting point. Some people
           are interested in not adding too much info, rate-limiting,
           etc. I don't think we should be so averse to it.
Bruno:     The window is static, but it's not the rate we're going to
           achieve. The rate is determined by how fast the receiver can
           process and ack. The window is static but the rate is dynamic
           based on how fast the sender and the receiver can process.
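Bruno's point that a static window still yields a dynamic rate can be illustrated with a toy model. The sketch below is not taken from either draft: it assumes one simulation tick equals one round trip, that the receiver can buffer a full window (so nothing is dropped), and the function and parameter names are hypothetical.

```python
def simulate_window_sender(rwin, receiver_capacity, backlog):
    """Toy model of a sender limited by a fixed receive window.

    rwin: fixed maximum number of unacknowledged LSPs (assumed static).
    receiver_capacity: LSPs the receiver can process and ack in each tick.
    backlog: number of LSPs the sender wants to flood.
    Returns the number of LSPs sent in each tick (one tick = one round trip).
    """
    if rwin <= 0:
        raise ValueError("rwin must be positive")
    unacked = 0
    sent_per_tick = []
    for capacity in receiver_capacity:
        # The sender fills whatever room is left in the window.
        to_send = min(backlog, rwin - unacked)
        backlog -= to_send
        unacked += to_send
        sent_per_tick.append(to_send)
        # The receiver acknowledges as many LSPs as it can process this tick.
        unacked -= min(unacked, capacity)
    return sent_per_tick


# The receiver handles 4 LSPs per tick, stalls for two ticks, then resumes.
print(simulate_window_sender(10, [4, 4, 0, 0, 4, 4], backlog=100))
# prints [10, 4, 4, 0, 0, 4]
```

After the initial burst of one full window, the number of LSPs sent per tick follows the receiver's capacity one tick later, including the pause, while the window itself never changes.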
Chris H:   How is TxMax value in Les's slides related to RWin?
Les:       The answer is yes. But the Tx-based algorithm is built to
           adapt. RWin is based on a picked number, and it's not going
           to change or adapt. The behavior from RWin is that the
           receiver says I can receive 10 LSPs, the magic number, then
           I have to pause. In both cases, the testing done so far is
           artificial because we only run ISIS on a limited topology,
           with no FIB update, etc. In the real world, there will be data
           traffic, etc. The capability of the receiver is not going
           to be static.
Chris H:   Did you simulate this?
Les:       Yes, in a simplified way, to demonstrate that the algorithm
           does adapt. It doesn't adapt to a slowdown with zero
           retransmissions, but it does adapt, and I think this is an
           important aspect. In the real world, we will be doing lots
           more than just ISIS.
Guillaume: I agree you need congestion control. The value you advertise
           is not magic, it's the space where LSPs wait to be processed,
           and I don't think it's too much to ask. You know the buffer
           you have where LSPs get stored before processing. I agree
           that RWin can become a bottleneck, especially in case of large
           RTTs. In the example I sent on the list, with a burst of 10
           LSPs and an RTT of 10ms, you're able to reach 1000 LSPs per
           second. So if you have a large buffer, you will likely never
           reach this bottleneck. So you actually need a congestion
           control algorithm. I agree with this.
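The arithmetic behind Guillaume's example is the usual window/RTT bound: with at most RWin LSPs outstanding per round trip, the rate cannot exceed RWin divided by the RTT. A minimal sketch, with a hypothetical function name and the assumption that the RTT includes the time the receiver takes to send its PSNP:

```python
def rwin_rate_ceiling(rwin_lsps, rtt_ms):
    """Upper bound, in LSPs per second, for a sender that may have at most
    `rwin_lsps` unacknowledged LSPs during one round trip of `rtt_ms`."""
    if rwin_lsps <= 0 or rtt_ms <= 0:
        raise ValueError("window and RTT must be positive")
    return rwin_lsps * 1000 / rtt_ms


# Guillaume's example: a burst of 10 LSPs and an RTT of 10 ms.
print(rwin_rate_ceiling(10, 10))  # prints 1000.0
```

The same relation gives 800 LSPs per second for a window of 40 acknowledged within 50 ms, which is the figure Bruno quotes in the chat.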
Tony Li:   Although RWin is static, let's not set it in stone. We don't
           want to pick A or B right now. The point we're trying to make
           is that feedback is helpful. In a perfect world we could
           instantaneously tell the transmitter what rx-max is all the
           time. We can't do that, so the question is how fast we can
           signal.
Bruno:     Agree with Les. We want the sender to be dynamic and adaptive.
           If the receiver stops for 200ms, the sender will stop for 200ms
           after one RTT without losing any LSPs. We can adapt to
           different numbers of neighbors without losing LSPs. When CPU
           bound, as you can see in Les's slides, it adapts but only after
           losing LSPs. So both adapt, but one is faster and does not lose
           LSPs.
Ketan:     RWin is like TxMax Rate, more or less. Whether it's static or
           constant, I don't see it as a max rate configured on a per
           link basis. The challenge I see is that it should be a dynamic
           value, not static. Even if static, I don't know how it can be
           determined, socket or BGP, etc. There are additional
           requirements needed to implement this (backward compatibility,
           implementation assumptions, etc.), and they should be
           documented.
Les:       Agree with Tony Li. If we could adapt RWin dynamically, it
           would be useful, but we don't know how to do it and it's very
           difficult. Nothing presented so far gives a hint, and that's a
           significant issue. We need a practical solution. As Ketan
           just said, it's important to work with nodes that are not
           optimized; the RWin-based proposal is heavily dependent on
           PSNP response time being optimized. I just don't want to
           assume all routers are optimized.
Chris H:   We haven't talked about where to derive the information from;
           I personally don't believe it's so hard. The first thing that
           comes to mind is the line-card queue depth to the RP. We need
           more information.
Acee:      I agree feedback is good. These two proposals are using
           different feedback: one is RWin and the other is the transmitter
           taking the actual behavior of the receiver as indicated by
           the acks sent. There are a lot of differences in how you
           implement it. The other thing is whether or not to have an
           interim for this.
Chris H:   This is a fruitful discussion and we may have an interim for
           it.
Acee:      Let's take it to the list what we should do about
           experimentation.
Chris H:   It will be great if we can get an apples-to-apples comparison
           or close to it. Let's take the discussion to the list.

3. IS-IS Flood Reflection
Tony Przygienda   (5 mins)

https://datatracker.ietf.org/doc/draft-ietf-lsr-isis-flood-reflection/

Acee:      I'd like to see more discussions on this draft on the list.
Tony P:    I'll work with you on the code point stuff.
Chris H:   Can you do without tunneling?
Tony P:    It's possible, but you may not want to do it operationally.
           Will add some clarifications.
Les:       The code point has been renewed.

4. Flexible Algorithms: Bandwidth, Delay, Metrics and Constraints
Shraddha Hegde    (10 mins)

https://datatracker.ietf.org/doc/draft-ietf-lsr-flex-algo-bw-con/

Chris H:   We're not going to make a consensus call now. If you have a
           contention, better go to WG since it's a WG doc. Just chair
           comments.
Ketan:     Normally SR-TE is set up to the node, not to the prefix. If
           that's achieved through generic metric on the link, I'm not
           sure. That applies to RSVP-TE as well. If a generic metric is
           needed at the prefix level, we can add it later.
Shraddha:  Maybe I was not clear, I will clarify on the list.
Acee:      To reiterate what I said on the list: we spent all this time on
           ASLAs, and now, in an existing WG doc that does something
           everybody agreed we needed, we introduce this generic metric,
           which is not compatible. It's ambiguous: rather than using
           application-specific metrics, you have different metric types
           for different applications. Maybe we should use a bandwidth
           metric for these bandwidth constraints and move generic metric
           to a separate proposal. For example, you said you could put it
           in extended link attributes or the TE LSA; then we have to go
           back to correlating LSAs for flex algo, and that would be a
           disaster.
Shraddha:  I don't understand your concern. Are you saying we should not
           have it in TE LSA? What's the point? There is no proposal to
           use it from TE LSA from flex-algo.
Tony Li:   There's no end run going on. In an earlier draft, we had it as
           a bandwidth metric and we had definitions of how bandwidth
           should be defined; later we concluded that it's purely a local
           definition. There is no reason to mandate that an operator use
           a particular algorithm. So it makes more sense to make it
           generic.
Chris H:   In the draft, when you specifically talk about cases, are you
           moving those to use cases?
Tony Li:   You could use generic metric for bandwidth.
Chris:     Maybe we didn't talk about it much in the draft.
Ron:       We didn't violate the words written in RFC8919. Maybe the
           author's intent.
Chris:     If errata is needed, we can file one.
Ron:       It's an update to the document. The community reviewed the
           text on the page, not the author's mind.
John:      If you open an errata, I'll look at the consensus and how
           it's written down. If it got written down wrong, then it's an
           errata, otherwise the errata doesn't get confirmed.
Chris H:   Thanks. Let's go with the low-hanging fruit, and go for the
           heavier options if we have to.

5. IS-IS and OSPF Extension for Event Notification
Peter Psenak      (15 mins)

https://datatracker.ietf.org/doc/draft-ppsenak-lsr-igp-event-notification/

Acee:      It's like a best-effort delivery.
Peter:     Yes. There is reliability but limited.
Acee:      It's based on when you got the component of the summary.
           There are two parts, the mechanism and the events triggered.
Peter:     It can be any application.
Huaimo:    It defines generic procedures and encodings for distributing
           events. Ten years ago, I had a draft using the traditional
           way. This way is much better.
Aijun:     This is another approach for the scenario in the PUA draft,
           xxxxx (voice broken, will send it to the list).
Chris H:   Interesting new work, let's discuss more on the list. Is this
           related to the next presentation?
Peter:     Yes. One of the use cases is related to prefix unreachable,
           but we use a completely different mechanism. And this defines
           a generic mechanism.

6. Updates for PUA and Passive Interface Attributes
Gyan Mishra/Aijun Wang    (10 mins)

https://datatracker.ietf.org/doc/draft-wang-lsr-prefix-unreachable-annoucement/
https://datatracker.ietf.org/doc/draft-wang-lsr-passive-interface-attribute/

Acee:      I don't think we need this. We have links topologically
           significant to the IGPs, and we have prefixes for local
           addresses. You can take this stub link to carry info for
           applications, but IGP doesn't need it. Inventing a new
           construct to do it, and advertising the prefix separately
           from the prefix used for the route computation, is what I
           think is wrong. I know we disagree on it, but that's my
           comment.
Chris H:   Is this a WG doc?
Acee:      They requested WG adoption.
Chris H:   Let's have more discussions on the list.
Aijun:     We had it in prefix TLV, but after discussion on the list we
           changed it to stub link. We will discuss more.
Acee:      Once you add an address to the stub link, you're advertising
           it two different ways; that's a good indication this is not
           the right way to encode it.

7. Meeting Closure
Chairs (5 mins)

Chris H:   Discussions are good. We will look into an interim. Thanks
           for participating, see you next time.

From the Chat:

Ketan Talaulikar:
  16:05:52  @Acee the authors remove the reference to OSPFv3 SRv6 from the
            flex-algo; it will be covered in the OSPFv3 SRv6 draft.
  16:06:21  s/remove/removed ... this was done in draft-ietf-lsr-flex-algo

Bruno Decraene:
  16:07:26  Actually, Guillaume will present

Tony Przygienda:
  16:15:20  just as clarification the ack delay is not even necessary, it's
            just an optimization to prevent the algorithm to back off too
            quickly on large ACK delays on the Rx
  16:16:19  my observation to Les was actually that it's even simpler to
            watch the outstanding LSPs in the LSP-RETX queue and just start
            to back off when it starts to reach certain % of the max rate
            (equivalent to not enough acks really for whatever reason).

Jeffrey Haas:
  16:16:53  ^
  16:17:11  I'm waiting to see if there will be a case simulating some % of
            packet loss and its behavior

Tony Przygienda:
  16:17:24  yeah, lots graphs coming
  16:17:46  ah, you mean loss on purpose? les didn't do it but from
            experience it doesn't matter

Jeffrey Haas:
  16:18:11  I have different experience, but that's mostly in tcp timers.

Tony Przygienda:
  16:18:18  overload/loss/slow all the same to algorithm. it just builds up
            lsp retx queue or not enough ack and hysteresis backs off
  16:18:29  yeah, that's why we should NOT use TCP here ;-)
  16:18:40  TCP collapses very quickly on losses

Jeffrey Haas:
  16:18:46  yep. but it does give you a strong sense how various re-xmit
            algos play

Guillaume Solignac:
  16:19:11  There are TCP algorithms that work on bandwidth monitoring as
            well (Google's BBR)

Tony Przygienda:
  16:19:23  @Guillaume, correct, all the newer work
  16:19:35  ultimately problem is TCP is ordered and link state flooding is
            not
  16:19:46  hence you don't have to back-off since you don't get a big mbuf
            buildup
  16:20:13  Graph _still_ incorrect, those are LSPTxRate!

Jeffrey Haas:
  16:22:35  A case of cadence matters.

Tony Przygienda:
  16:31:10  as side note: even with a socket per peer in ISIS lots of
            platforms have single queue bottlenecks in the whole chain from
            port to user space

Les Ginsberg:
  16:31:23  It is not possible for IS-IS to know (at either end) the
            difference between loss and delay. This is because the state of
            the queues/punt path from dataplane to IS-IS in control plane
            is not known by IS-IS - and is difficult to know. And because
            there is no ordering of LSPs - so receiving an update from
            Node A tells you nothing about whether you should have received
            an update from Node B. (The TCP analogy is not a good fit)

Tony Przygienda:
  16:32:59  yeah, we can go for quite a bit really. on lots platforms
            buffer space is shared amongst sockets as well

Bruno Decraene:
  16:34:10  @TonyP that's not a problem. RWIN is used to control rate
            between application (IS-IS). For limitations on the path, a
            congestion control algo (roughly similar in both drafts) is
            used.
  16:34:37  So RWIN is used in addition (not in replacement)

Tony Przygienda:
  16:38:17  2000 on all VLANs or just one?
  16:39:20  looks like all VLANs. this is a very low rate to saturate even
            a small CPU by well implemented flooding IME. I'm surprised

Bruno Decraene:
  16:39:30  total (200LSP/s per neighbor)

Tony Przygienda:
  16:50:33  IME MaxTX is superfluous. The hysteresis will back off nicely
  16:51:00  it's more of a "sanity upper bound" so the thing doesn't run
            away to eat all CPU/I/O possibly ;-)

Guillaume Solignac:
  16:51:35  You have to take care of TCP fairness though, your algorithm
            could get crushed by BGP

Tony Li:
  16:51:51  Take a look at Les' graphs and look at the latency to adapt.

Tony Przygienda:
  16:52:16  @Guillaume, that's solved differently on a good box

Tony Li:
  16:52:23  What would happen if the receiver could signal RXmax?

Guillaume Solignac:
  16:52:45  Does it even exist ?

Tony Przygienda:
  16:52:57  @Tony: yepp, of course. if you signal every 50 msecs that will
            be always faster than waiting for backpressure by loss/queue
            overbuild ;-)
  16:53:09  it could be even faster if you entangle the RX and TX ;-)
  17:00:04  OK, tired of waiting for the meandering mike so I keep it very
            clip & practical
  17:00:32  1. fix window ain't gonna cut it and you won't be able to
            compute it since every platform has so many variables you never
            get it right. And the load of the system changes on top
  17:01:25  2. you can't signal that fast reliably with lots peers, @
            scale. Assuming very short, precise timers in user space on
            real systems is just that, assumption. You can have very few or
            you can have them in the kernel, in user space timer slips in
            100s of msecs are normal fare.

Jeffrey Haas:
  17:01:27  lack of resources in whatever flavor will manifest as drop.

Guillaume Solignac:
  17:03:39  @Tony the real signal is not in the TLV, it is the PSNPs
  17:04:27  Fix window allows to pace the sender to the PSNPs
  17:04:47  The PSNPs rate is a dynamic signal that you use as well, but
            you have one information less
  17:05:03  So you have less guarantees

Tony Przygienda:
  17:07:26  yes, that's a reasonable way to see it, you are free to send
            less ACKs to backpressure. It's a "poor man's window" if you
            want. It kind of happens naturally when ISIS gets busy and
            doesn't get to push PSNPs (modulo parallel implementation which
            is a different kettle of pisces ;-)

Tony Li:
  17:09:16  You could estimate your own RXmax.
  17:09:35  How many did you process in the last 1s?

Jeff Tantsura:
  17:11:05  hackathon?
  17:11:57  great stuff!

Robert Raszuk:
  17:13:45  @Tony Doesn't it also depend on how much you got ?

Tony Li:
  17:14:02  No, it doesn't need to.
  17:14:20  If you managed to process 100 in the last second, you say that.
  17:14:31  Transmitter can infer that you dropped a million. :-)

Robert Raszuk:
  17:14:53  Oh in that direction ... sure
  17:15:17  I was looking at the max in maximum point

Tony Li:
  17:16:48  The point is that we need feedback. As we learn, we can be more
            sophisticated about the feedback.
  17:17:08  And yes, the feedback is optional. We do have to work with
            legacy.

John Scudder:
  17:29:30  I missed precisely what it was Acee was calling an "end run"?

Bruno Decraene:
  17:30:58  @Les in slide 8 you have loss of LSPs when adapting/slowing
            down. Do you think that you could add a test in your
            implementation? if UnAcknowledged LSP > 40, pause sending LSP.
            And see if this can reduce or eliminate your loss of LSPs?

John Scudder:
  17:31:07  I mean, I can and will go back and listen to the replay to try
            to understand it, but I'd appreciate clarification from @Acee.

Robert Raszuk:
  17:38:42  Well Pulse would trigger BGP route calculation (best path run)
            - how do you communicate Pulses from IGP to BGP ? via RIB ? how
            if it only contains IGP summary ...
  17:40:35  Also is there no worry about Pulse based DDoS to poor nodes
            when we have massive failures ? I assume you are not planning
            on summarizing Pulses ?

Les Ginsberg:
  17:41:46  @Bruno - we tried several different strategies - one of them
            was what you suggest. Performance was not as good.

Tony Przygienda:
  17:42:04  @Bruno. roughly what I said as in "don't count ACKs, enough to
            look @ your outstanding queue". However, from direct
            implementation experience, you want to look @ % of queue as
            your flooding speed rather than a constant number. Yes, if RX
            could somehow signal his window, that could be taken into
            account but again, there is no timestamp on anything (or for
            that matter distributed time), you cannot tell losses from
            delays or no sending etc. Especially assuming the very small
            timescales given the burstiness of load on real world systems.

Bruno Decraene:
  17:47:16  @Les performance is limited by size of RWIN and RTT (idem as
            with TCP). So probably, the PSNP were not sent fast enough. How
            fast do you send PSNP?

Les Ginsberg:
  17:52:37  @Bruno - we have tested w a variety of PSNP times - but all the
            data shared today we were acking within 50 ms.

Bruno Decraene:
  --:--:--  @Les ok. With a RWIN of 40, that should give 800LSP/s per
            neighbour. Below what you achieve with a single neighbor.
            Probably better starting with 3 neighbours.

Tony Li:
  --:--:--  IGP is not a dump truck.

Jeff Tantsura:
  18:01:25  we have got BGP for that...
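
Tony Li's chat suggestion, that a receiver could estimate its own RXmax by counting how many LSPs it processed in the last second, can be sketched as a sliding-window counter. This is only an illustration under stated assumptions: the class and method names are hypothetical, timestamps are integer milliseconds supplied by the caller, and how the estimate would be signalled to the transmitter is not covered.

```python
from collections import deque


class RxRateEstimator:
    """Counts LSPs processed during the most recent window (default 1 s)."""

    def __init__(self, window_ms=1000):
        if window_ms <= 0:
            raise ValueError("window_ms must be positive")
        self.window_ms = window_ms
        self._processed_at = deque()  # timestamps (ms) of processed LSPs

    def record_processed(self, now_ms):
        """Call once for each LSP the receiver has finished processing."""
        self._processed_at.append(now_ms)

    def estimate(self, now_ms):
        """Return LSPs per second processed over the last window."""
        cutoff = now_ms - self.window_ms
        # Drop entries that fell out of the window.
        while self._processed_at and self._processed_at[0] < cutoff:
            self._processed_at.popleft()
        return len(self._processed_at) * 1000 / self.window_ms


estimator = RxRateEstimator()
for i in range(100):
    estimator.record_processed(i * 10)  # one LSP every 10 ms
print(estimator.estimate(1000))  # prints 100.0
```

As in the chat exchange, the estimate depends only on what the receiver managed to process, not on how much it was sent.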