Session Initiation Protocol (SIP) Overload Control
draft-ietf-soc-overload-control-15
Yes
(Richard Barnes)
No Objection
(Adrian Farrel)
(Barry Leiba)
(Benoît Claise)
(Brian Haberman)
(Gonzalo Camarillo)
(Jari Arkko)
(Martin Stiemerling)
(Sean Turner)
(Stewart Bryant)
Note: This ballot was opened for revision 14 and is now closed.
Richard Barnes Former IESG member
Yes
(for -14)
Spencer Dawkins Former IESG member
(was Discuss)
Yes
(2014-02-05 for -14)
These are all non-blocking, but I'd appreciate it if you'd consider them along with any other comments you receive.

In 1. Introduction

   As with any network element, a Session Initiation Protocol (SIP) [RFC3261] server can suffer from overload when the number of SIP messages it receives exceeds the number of messages it can process. Overload can pose a serious problem for a network of SIP servers. During periods of overload, the throughput of a network of SIP servers can be significantly degraded. In fact, overload may lead to a situation in which the throughput drops down to a small fraction of the original processing capacity. This is often called congestion collapse.

   Overload is said to occur if a SIP server does not have sufficient resources to process all incoming SIP messages. These resources may include CPU processing capacity, memory, network bandwidth, input/output, or disk resources.

I'm struggling with including "network bandwidth" here, with no qualifications. That seems to conflate both SIP Overload and TSV congestion, thusly:

If UA A sends a request for UA B to a proxy:

   UA A -----------> Proxy -----------> UA B

But there's not enough bandwidth between the proxy and UA B, so:

   UA A -----------> Proxy ---->//      UA B

I'd be OK with characterizing that as "network bandwidth" SIP overload covered by this draft, but if the problem of insufficient bandwidth is between UA A and the proxy:

   UA A --->//       Proxy              UA B

the proxy never sees the request. That's not in scope for this draft, is it? If not, is there an obvious way to tighten this up a bit? (maybe something like "network bandwidth on the forwarding path"?)

In 2. Terminology

   Unless otherwise specified, all SIP entities described in this document are assumed to support this specification.

In 10.2. Backwards Compatibility, you lead me to think that if my path from conforming UAC to conforming UAS traverses some proxies that do support this specification and other proxies that do not, depending on where the non-conforming proxies are and what proxies are overloaded, you wouldn't do worse than today's behavior. It might be helpful to say that here.

In 5.10.1. Message prioritization at the hop before the overloaded server, there are several items listed that a client SHOULD take into account when deciding what requests to prioritize. Most of them make sense to me, but it's not obvious that they are 2119 SHOULDs ("good advice, but not required for interoperability").

In 7.1. Special parameter values for loss-based overload control

   The SIP client may use any algorithm that reduces the traffic it sends to the overloaded server by the amount indicated. Such an algorithm SHOULD honor the message prioritization discussion of Section 5.10.1.

since 5.10.1 is full of SHOULDs, this is saying that you SHOULD do what you SHOULD do. I'm only pointing that out because it struck me as funny ...

In 11. Security Considerations

   Attacks that indicate false overload control can be mitigated by using TCP or Websockets [RFC6455], or better yet, TLS in conjunction with applying BCP 38 [RFC2827].

If you're already pointing implementers to TCP for better resistance to attacks, would it make sense to recommend TLS more strongly (in 2014!)? But it's not obvious that on-path attackers that can modify traffic wouldn't just drop anything they find inconvenient, I guess.

Also in 11. Security Considerations

   A malicious SIP entity could gain an advantage by pretending to support this specification but never reducing the amount of traffic it forwards to the downstream neighbor. If its downstream neighbor receives traffic from multiple sources which correctly implement overload control, the malicious SIP entity would benefit since all other sources to its downstream neighbor would reduce load.

   The solution to this problem depends on the overload control method. For rate-based and window-based overload control, it is very easy for a downstream entity to monitor if the upstream neighbor throttles traffic forwarded as directed. For percentage throttling this is not always obvious since the load forwarded depends on the load received by the upstream neighbor.

   To prevent such attacks, servers should monitor client behavior to determine whether they are complying with overload control policies. If a client is not conforming to such policies, then the server should treat it as a non-supporting client (see Section 5.10.2).

Is this text coming close to saying "malicious SIP entities can game this specification, so you ought to monitor client behavior, but there are some overload control methods you can't monitor reliably"? If so ... is the only required overload control method one you can't monitor reliably?

In Appendix B. RFC5390 requirements

   REQ 4: The mechanism must be capable of dealing with elements that do not support it, so that a network can consist of a mix of elements that do and don't support it. In other words, the mechanism should not work only in environments where all elements support it. It is reasonable to assume that it works better in such environments, of course. Ideally, there should be incremental improvements in overall network throughput as increasing numbers of elements in the network support the mechanism.

   Meeting REQ 4: Partially. The mechanism is designed to reduce congestion when a pair of communicating entities support it. If a downstream overloaded SIP server does not respond to a request in time, a SIP client will attempt to reduce traffic destined towards the non-responsive server as outlined in Section 5.9.

I'm not understanding how this is "partially". What did you miss?

   REQ 5: The mechanism should not assume that it will only be deployed in environments with completely trusted elements. It should seek to operate as effectively as possible in environments where other elements are malicious; this includes preventing malicious elements from obtaining more than a fair share of service.

   Meeting REQ 5: Partially. Since overload control information is shared between a pair of communicating entities, a confidential and authenticated channel can be used for this communication. However, if such a channel is not available, then the security ramifications outlined in Section 11 apply.

Does your point about not being able to monitor loss-based overload control methods also apply here?

   REQ 12: The mechanism should work between servers in different domains.

   Meeting REQ 12: Yes, there are no inherent limitations on using overload control between domains.

I'm hearing a particular operator's voice saying "I'm not telling my competitors _anything_ about the topology of my network OR the traffic within my network". Perhaps it's worth mentioning that operators have to be willing to expose at least a little information about their network to other operators, in order for this mechanism to work well. ("Don't clear all the oc Via parameters at your SBCs and expect this to work across your interconnect points!")
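The Section 7.1 text quoted in this comment leaves the choice of throttling algorithm to the client. As a minimal sketch of one possible choice (an assumption for illustration, not the draft's algorithm), the Python function below forwards an evenly spaced subset of requests so that the fraction dropped matches the requested percentage. The function name and the credit-counter approach are hypothetical, and the message prioritization of Section 5.10.1 is deliberately left out.

```python
def loss_based_filter(requests, oc_percent):
    """Return the requests a client would still forward when asked to
    reduce traffic by oc_percent (an integer from 0 to 100).

    Hypothetical helper: spreads the dropped requests evenly with an
    integer credit counter instead of dropping at random.
    """
    if not 0 <= oc_percent <= 100:
        raise ValueError("oc_percent must be between 0 and 100")

    forwarded = []
    credit = 0
    for request in requests:
        # Each request earns (100 - oc_percent) credits; forwarding costs 100.
        credit += 100 - oc_percent
        if credit >= 100:
            credit -= 100
            forwarded.append(request)
    return forwarded


if __name__ == "__main__":
    sent = loss_based_filter(list(range(100)), 30)
    print(len(sent), "of 100 requests forwarded")
```

A random drop with probability oc_percent/100 would also satisfy "reduces the traffic by the amount indicated" on average; the deterministic counter is used here only because it makes the reduction easy to check.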
Adrian Farrel Former IESG member
No Objection
(for -14)
Barry Leiba Former IESG member
No Objection
(for -14)
Benoît Claise Former IESG member
No Objection
(for -14)
Brian Haberman Former IESG member
No Objection
(for -14)
Gonzalo Camarillo Former IESG member
No Objection
(for -14)
Jari Arkko Former IESG member
(was Discuss)
No Objection
(for -14)
Joel Jaeggli Former IESG member
No Objection
(2014-02-02 for -14)
Expect to see a -15 based on the secdir feedback proposal.
Martin Stiemerling Former IESG member
No Objection
(for -14)
Pete Resnick Former IESG member
No Objection
(2014-02-05 for -14)
1:

- First paragraph:

   In fact, overload may lead to a situation in which the throughput drops down to a small fraction of the original processing capacity. This is often called congestion collapse.

  I don't think that's right. My lower-layer comrades can correct me if I'm wrong, but my understanding is that congestion collapse is when the congestion control mechanism itself (for example, retransmission in the face of packet loss) adds additional traffic to the network, eventually swamping the network such that no (or exceedingly little) new traffic can get into the network. A network at maximum capacity where traffic is slow because there's simply too much traffic for that network to handle is not "collapsed", as far as I understand the term. So long as everyone is holding back their traffic such that the network is not being flooded with retransmissions or the like, that's just overload, not collapse.

  In any event, I don't think the use of the term adds anything to the discussion, so I would simply (a) strike the last two sentences of the first paragraph, (b) strike "and it cannot prevent congestion collapse" from the fourth paragraph, and (c) change "avoiding congestion collapse and" to "thereby" in the fifth paragraph.

- Please change "we only consider" to "this document only addresses".

- Please strike the last two sentences (the conformance sentences) of section 1. They add nothing to the document.

2, Last paragraph: I think normative language as described in this paragraph adds nothing but confusion to the text and is unnecessary. I suggest striking this paragraph and making the changes I note below to remove the unnecessary normative language. My suggestions below do not change the meaning of the protocol at all.

3: Please change "We now explain the" to "This section gives an" in the first sentence.

4.1: Change "MUST add" in paragraphs 2 & 3 to "adds".

4.2:

- Change "MUST add" in paragraph 2 to "adds".

- In the last sentence of paragraph 3, I suggest changing "must not assume" to "can not assume", to avoid confusion.

- In the fourth paragraph, change "it MUST choose one algorithm from the list and return the single selected algorithm" to "it chooses one algorithm from the list and MUST return the single selected algorithm".

4.3: Change "the client MUST behave as if overload control is not in effect between it and" to "overload control is not in effect between the client and".

4.4: Change "MUST be inserted" to "is inserted" in the first sentence.

5: Change "MUST determine" to "determines" in the second paragraph.

5.1:

- Change "MUST insert" (x2) to "inserts" in the first paragraph.

- Change "MUST determine" to "determines" and change "MUST follow" to "follows" in the third paragraph.

5.2:

- Change "MAY" to "can" in the third paragraph.

- Change the last paragraph as follows:

   This specification provides a good overload control mechanism that can protect a SIP server from overload. However, if a SIP server wanted to limit its overload control capability for privacy reasons, it might decide to perform overload control only for...

5.5: Change "MUST determine" (x2) to "determines" in paragraph 3.

5.7:

- Change "MUST set" to "sets" in the second paragraph.

- In the last paragraph, change the second sentence as follows:

   If the value of the "oc-validity" parameter is 0, this indicates to the client that overload control of messages destined to the server is no longer necessary and the traffic can flow without any reduction.

7.1:

- Change "MUST appear" to "appears" in the first paragraph.

- Change the third sentence of the second paragraph as follows:

   This value indicates to the client the percentage by which the client is to reduce the number of requests being forwarded to the overloaded server.

9: I suggest using the ABNF extension mechanism:

   via-params =/ oc / oc-validity / oc-seq / oc-algo

Much simpler.

10: I think this section should be moved to an appendix.
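The four names in the ABNF suggestion above (oc, oc-validity, oc-seq, oc-algo) are carried as Via header parameters. As a rough illustration of what a receiver would pick out of a Via value, here is a small Python sketch. The function name and the sample header value are invented for this example, and the naive split on ";" is an assumption that ignores semicolons inside quoted strings, so it is not a conforming SIP parser.

```python
OC_PARAM_NAMES = {"oc", "oc-validity", "oc-seq", "oc-algo"}


def extract_oc_params(via_value):
    """Return the overload-control parameters found in one Via value.

    Hypothetical helper: splits on ';' and keeps only the four names
    listed in the ABNF suggestion. Surrounding quotes are stripped.
    """
    found = {}
    # The first segment is the protocol and sent-by part, not a parameter.
    for segment in via_value.split(";")[1:]:
        name, _, value = segment.strip().partition("=")
        name = name.strip().lower()
        if name in OC_PARAM_NAMES:
            found[name] = value.strip().strip('"')
    return found


if __name__ == "__main__":
    # Invented example value, not taken from the draft.
    via = ('SIP/2.0/TLS p1.example.net;branch=z9hG4bK2d4790.1;'
           'oc=20;oc-algo="loss";oc-validity=500;oc-seq=1282321615.782')
    print(extract_oc_params(via))
```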
Sean Turner Former IESG member
No Objection
(for -14)
Stephen Farrell Former IESG member
No Objection
(2014-02-06 for -14)
Thanks for the good security considerations text on DoS!

I think the changes proposed from the secdir review [1] look good and assume they'll be incorporated into a revision.

[1] https://www.ietf.org/mail-archive/web/secdir/current/msg04521.html

In addition, presumably a DDoS attack could cause an honest server to start signalling an overload condition. If such a server had a long oc-validity time, then that validity time might act as an accelerator for the DDoS attack. Even the 500ms default might mean that a botnet could use this perhaps. Is that worth an additional bit of security consideration text?

I guess the attack pattern there would be that the botnet would pop up and try to overload the server every oc-validity milliseconds or so and go quiet in the intervals. I've no idea if that could be confused with nominal reactions to a real overload though.
Stewart Bryant Former IESG member
No Objection
(for -14)
Ted Lemon Former IESG member
No Objection
(2014-02-06 for -14)
Section 11, last paragraph:

   To prevent such attacks, servers should monitor client behavior to determine whether they are complying with overload control policies. If a client is not conforming to such policies, then the server should treat it as a non-supporting client (see Section 5.10.2).

This is probably just my failure of comprehension, but it is not at all clear to me how a server can monitor client behavior. How does the server distinguish between a situation where the client is not throttling, versus a situation where the client _is_ throttling, but load has increased proportionally to the amount of throttling requests, so that the observed load appears constant? Are you simply gambling that this will never happen, or did I misunderstand something?
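The ambiguity raised in this comment can be made concrete with a little arithmetic. The Python sketch below is only an illustration under simplifying assumptions: the function name is hypothetical, rates are in requests per second, and a compliant client is modeled as forwarding exactly (100 - oc)% of what it receives. It shows two cases the server cannot tell apart from the arrival rate alone.

```python
def rate_seen_by_server(offered_rate, oc_percent, client_complies):
    """Rate arriving at the server from one upstream client.

    Hypothetical model: a compliant client drops exactly oc_percent of
    its offered load; a non-compliant client forwards everything.
    """
    if client_complies:
        return offered_rate * (100 - oc_percent) / 100
    return offered_rate


if __name__ == "__main__":
    oc = 20  # server asked for a 20% reduction

    # Case A: client ignores the request; its offered load stays at 100.
    ignoring = rate_seen_by_server(100, oc, client_complies=False)

    # Case B: client throttles as asked, but its offered load grew to 125.
    throttling = rate_seen_by_server(125, oc, client_complies=True)

    # Both cases present the same rate to the server.
    print(ignoring, throttling, ignoring == throttling)
```

Under this model the server sees the same rate in both cases, which is the situation the question describes: without knowing the load the upstream neighbor received, the forwarded load alone does not reveal whether throttling took place.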