Session Initiation Protocol (SIP) Overload Control
draft-ietf-soc-overload-control-15

Yes

(Richard Barnes)

No Objection

(Adrian Farrel)
(Barry Leiba)
(Benoît Claise)
(Brian Haberman)
(Gonzalo Camarillo)
(Jari Arkko)
(Martin Stiemerling)
(Sean Turner)
(Stewart Bryant)

Note: This ballot was opened for revision 14 and is now closed.

Richard Barnes Former IESG member
Yes
Yes (for -14) Unknown

                            
Spencer Dawkins Former IESG member
(was Discuss) Yes
Yes (2014-02-05 for -14) Unknown
These are all non-blocking, but I'd appreciate it if you'd consider them along with any other comments you receive.

In 1.  Introduction

   As with any network element, a Session Initiation Protocol (SIP)
   [RFC3261] server can suffer from overload when the number of SIP
   messages it receives exceeds the number of messages it can process.
   Overload can pose a serious problem for a network of SIP servers.
   During periods of overload, the throughput of a network of SIP
   servers can be significantly degraded.  In fact, overload may lead to
   a situation in which the throughput drops down to a small fraction of
   the original processing capacity.  This is often called congestion
   collapse.

   Overload is said to occur if a SIP server does not have sufficient
   resources to process all incoming SIP messages.  These resources may
   include CPU processing capacity, memory, network bandwidth, input/
   output, or disk resources.

I'm struggling with including "network bandwidth" here, with no qualifications. That seems to conflate SIP overload with TSV congestion, as follows:

If UA A sends a request for UA B to a proxy:

   UA A -----------> Proxy -----------> UA B

But there's not enough bandwidth between the proxy and UA B, so:

   UA A -----------> Proxy ---->//      UA B

I'd be OK with characterizing that as "network bandwidth" SIP overload covered by this draft, but

if the problem of insufficient bandwidth is between UA A and the proxy:

   UA A --->//       Proxy               UA B 

the proxy never sees the request. That's not in scope for this draft, is it? 

If not, is there an obvious way to tighten this up a bit? (maybe something like "network bandwidth on the forwarding path"?)

In 2.  Terminology

   Unless otherwise specified, all SIP entities described in this
   document are assumed to support this specification.

In 10.2.  Backwards Compatibility, you lead me to think that if my path from conforming UAC to conforming UAS traverses some proxies that do support this specification and other proxies that do not, depending on where the non-conforming proxies are and what proxies are overloaded, you wouldn't do worse than today's behavior. It might be helpful to say that here.

In 5.10.1.  Message prioritization at the hop before the overloaded server, there are several items listed that a client SHOULD take into account when deciding what requests to prioritize. Most of them make sense to me, but it's not obvious that they are 2119 SHOULDs ("good advice, but not required for interoperability").

In 7.1.  Special parameter values for loss-based overload control

   The SIP client may use any
   algorithm that reduces the traffic it sends to the overloaded server
   by the amount indicated.  Such an algorithm SHOULD honor the message
   prioritization discussion of Section 5.10.1. 

since 5.10.1 is full of SHOULDs, this is saying that you SHOULD do what you SHOULD do. I'm only pointing that out because it struck me as funny ...
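
To illustrate what "any algorithm that reduces the traffic ... by the amount indicated" could look like, here is a minimal sketch of one possibility: a deterministic credit counter that drops exactly the indicated percentage of requests. This is an assumption for illustration, not the draft's algorithm. The class name `LossThrottle` and its method are hypothetical, and the sketch ignores the message prioritization of Section 5.10.1.

```python
class LossThrottle:
    """Hypothetical client-side throttle for loss-based overload control.

    oc_percent is the percentage by which traffic to the overloaded
    server should be reduced (0 = no reduction, 100 = send nothing).
    """

    def __init__(self, oc_percent: int) -> None:
        if not 0 <= oc_percent <= 100:
            raise ValueError("oc_percent must be between 0 and 100")
        self.oc_percent = oc_percent
        self._credit = 0  # accumulated "drop credit", in percent points

    def admit(self) -> bool:
        """Return True if the next request may be forwarded."""
        self._credit += self.oc_percent
        if self._credit >= 100:
            # Enough credit has built up to drop one request.
            self._credit -= 100
            return False
        return True


throttle = LossThrottle(oc_percent=20)
forwarded = sum(throttle.admit() for _ in range(100))
print(forwarded)  # 80 of 100 requests are forwarded
```

A real client would presumably pick which requests to drop according to priority rather than by arrival order, which is where the Section 5.10.1 guidance comes in.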

In 11.  Security Considerations

   Attacks that indicate false overload control can be mitigated by
   using TCP or Websockets [RFC6455], or better yet, TLS in conjunction
   with applying BCP 38 [RFC2827]. 

If you're already pointing implementers to TCP for better resistance to attacks, would it make sense to recommend TLS more strongly (in 2014!)? But it's not obvious that on-path attackers that can modify traffic wouldn't just drop anything they find inconvenient, I guess.

Also in 11.  Security Considerations

   A malicious SIP entity could gain an advantage by pretending to
   support this specification but never reducing the amount of traffic
   it forwards to the downstream neighbor.  If its downstream neighbor
   receives traffic from multiple sources which correctly implement
   overload control, the malicious SIP entity would benefit since all
   other sources to its downstream neighbor would reduce load.

      The solution to this problem depends on the overload control
      method.  For rate-based and window-based overload control, it is
      very easy for a downstream entity to monitor if the upstream
      neighbor throttles traffic forwarded as directed.  For percentage
      throttling this is not always obvious since the load forwarded
      depends on the load received by the upstream neighbor.

   To prevent such attacks, servers should monitor client behavior to
   determine whether they are complying with overload control policies.
   If a client is not conforming to such policies, then the server
   should treat it as a non-supporting client (see Section 5.10.2).

Is this text coming close to saying "malicious SIP entities can game this specification, so you ought to monitor client behavior, but there are some overload control methods you can't monitor reliably"? If so ... is the only required overload control method one you can't monitor reliably?
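
For contrast, the rate-based case the quoted text calls "very easy" to monitor can be sketched as a sliding-window counter per upstream neighbor. This is only an illustration under stated assumptions: `RateMonitor`, the one-second window, and the tolerance factor are hypothetical and do not come from the draft.

```python
from collections import deque


class RateMonitor:
    """Hypothetical per-neighbor check for a rate-based limit.

    The server told the neighbor to send at most allowed_per_second
    requests; the monitor counts arrivals in a sliding time window.
    """

    def __init__(self, allowed_per_second: float,
                 window_seconds: float = 1.0,
                 tolerance: float = 0.1) -> None:
        # Allow a small margin above the nominal limit (assumed value).
        self.limit = allowed_per_second * window_seconds * (1 + tolerance)
        self.window = window_seconds
        self.arrivals = deque()

    def record(self, now: float) -> bool:
        """Record an arrival at time `now` (seconds).

        Returns True while the neighbor stays within its limit.
        """
        self.arrivals.append(now)
        # Forget arrivals that have fallen out of the window.
        while self.arrivals and self.arrivals[0] <= now - self.window:
            self.arrivals.popleft()
        return len(self.arrivals) <= self.limit
```

A server that sees `record()` return False repeatedly could then treat the neighbor as a non-supporting client, as the quoted text suggests. No equally direct check exists for percentage throttling, because the server does not know the load offered to the neighbor.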

In Appendix B.  RFC5390 requirements

   REQ 4: The mechanism must be capable of dealing with elements that do
   not support it, so that a network can consist of a mix of elements
   that do and don't support it.  In other words, the mechanism should
   not work only in environments where all elements support it.  It is
   reasonable to assume that it works better in such environments, of
   course.  Ideally, there should be incremental improvements in overall
   network throughput as increasing numbers of elements in the network
   support the mechanism.

   Meeting REQ 4: Partially.  The mechanism is designed to reduce
   congestion when a pair of communicating entities support it.  If a
   downstream overloaded SIP server does not respond to a request in
   time, a SIP client will attempt to reduce traffic destined towards
   the non-responsive server as outlined in Section 5.9.

I'm not understanding how this is "partially". What did you miss?

   REQ 5: The mechanism should not assume that it will only be deployed
   in environments with completely trusted elements.  It should seek to
   operate as effectively as possible in environments where other
   elements are malicious; this includes preventing malicious elements
   from obtaining more than a fair share of service.

   Meeting REQ 5: Partially.  Since overload control information is
   shared between a pair of communicating entities, a confidential and
   authenticated channel can be used for this communication.  However,
   if such a channel is not available, then the security ramifications
   outlined in Section 11 apply.

Does your point about not being able to monitor loss-based overload control methods also apply here?

   REQ 12: The mechanism should work between servers in different
   domains.

   Meeting REQ 12: Yes, there are no inherent limitations on using
   overload control between domains.

I'm hearing a particular operator's voice saying "I'm not telling my competitors _anything_ about the topology of my network OR the traffic within my network". Perhaps it's worth mentioning that operators have to be willing to expose at least a little information about their network to other operators, in order for this mechanism to work well.

("Don't clear all the oc Via parameters at your SBCs and expect this to work across your interconnect points!")
Adrian Farrel Former IESG member
No Objection
No Objection (for -14) Unknown

                            
Barry Leiba Former IESG member
No Objection
No Objection (for -14) Unknown

                            
Benoît Claise Former IESG member
No Objection
No Objection (for -14) Unknown

                            
Brian Haberman Former IESG member
No Objection
No Objection (for -14) Unknown

                            
Gonzalo Camarillo Former IESG member
No Objection
No Objection (for -14) Unknown

                            
Jari Arkko Former IESG member
(was Discuss) No Objection
No Objection (for -14) Unknown

                            
Joel Jaeggli Former IESG member
No Objection
No Objection (2014-02-02 for -14) Unknown
I expect to see a -15 based on the secdir feedback proposal.
Martin Stiemerling Former IESG member
No Objection
No Objection (for -14) Unknown

                            
Pete Resnick Former IESG member
No Objection
No Objection (2014-02-05 for -14) Unknown
1:

- First paragraph:

   In fact, overload may lead to
   a situation in which the throughput drops down to a small fraction of
   the original processing capacity.  This is often called congestion
   collapse.

I don't think that's right. My lower-layer comrades can correct me if I'm wrong, but my understanding is that congestion collapse is when the congestion control mechanism itself (for example, retransmission in the face of packet loss) adds additional traffic to the network, eventually swamping the network such that no (or exceedingly little) new traffic can get into the network. A network at maximum capacity where traffic is slow because there's simply too much traffic for that network to handle is not "collapsed", as far as I understand the term. So long as everyone is holding back their traffic such that the network is not being flooded with retransmissions or the like, that's just overload, not collapse.

In any event, I don't think the use of the term adds anything to the discussion, so I would simply (a) strike the last two sentences of the first paragraph, (b) strike "and it cannot prevent congestion collapse" from the fourth paragraph, and (c) change "avoiding congestion collapse and" to "thereby" in the fifth paragraph.

- Please change "we only consider" to "this document only addresses".

- Please strike the last two sentences (the conformance sentences) of section 1. They add nothing to the document.

2, Last paragraph: I think normative language as described in this paragraph adds nothing but confusion to the text and is unnecessary. I suggest striking this paragraph and making the changes I note below to remove the unnecessary normative language. My suggestions below do not change the meaning of the protocol at all.

3: Please change "We now explain the" to "This section gives an" in the first sentence.

4.1: Change "MUST add" in paragraphs 2 & 3 to "adds".

4.2:

- Change "MUST add" in paragraph 2 to "adds".

- In the last sentence of paragraph 3, I suggest changing "must not assume" to "cannot assume", to avoid confusion.

- In the fourth paragraph, change "it MUST choose one algorithm from the list and return the single selected algorithm" to "it chooses one algorithm from the list and MUST return the single selected algorithm".

4.3: Change "the client MUST behave as if overload control is not in effect between it and" to "overload control is not in effect between the client and".

4.4: Change "MUST be inserted" to "is inserted" in the first sentence.

5: Change "MUST determine" to "determines" in the second paragraph.

5.1:

- Change "MUST insert" (x2) to "inserts" in the first paragraph.

- Change "MUST determine" to "determines" and change "MUST follow" to "follows" in the third paragraph.

5.2:

- Change "MAY" to "can" in the third paragraph.

- Change the last paragraph as follows:

   This specification provides a good overload control mechanism that
   can protect a SIP server from overload.  However, if a SIP server
   wanted to limit its overload control capability for privacy reasons,
   it might decide to perform overload control only for...

5.5: Change "MUST determine" (x2) to "determines" in paragraph 3.

5.7:

- Change "MUST set" to "sets" in the second paragraph.

- In the last paragraph, change the second sentence as follows:

   If the value of the "oc-validity" parameter is 0, this indicates to
   the client that overload control of messages destined to the server 
   is no longer necessary and the traffic can flow without any
   reduction.

7.1:

- Change "MUST appear" to "appears" in the first paragraph.

- Change the third sentence of the second paragraph as follows:

   This value indicates to the client the percentage by which the client
   is to reduce the number of requests being forwarded to the overloaded
   server.
   
9: I suggest using the ABNF extension mechanism:

       via-params =/ oc / oc-validity / oc-seq / oc-algo

Much simpler.

10: I think this section should be moved to an appendix.
Sean Turner Former IESG member
No Objection
No Objection (for -14) Unknown

                            
Stephen Farrell Former IESG member
No Objection
No Objection (2014-02-06 for -14) Unknown
Thanks for the good security considerations text on DoS!

I think the changes proposed from the secdir review [1]
look good and assume they'll be incorporated into a
revision.

   [1] https://www.ietf.org/mail-archive/web/secdir/current/msg04521.html

In addition, presumably a DDoS attack could cause an honest
server to start signalling an overload condition.  If such
a server had a long oc-validity time, then that validity
time might act as an accelerator for the DDoS attack. Even
the 500ms default might mean that a botnet could use this
perhaps. Is that worth an additional bit of security
consideration text? I guess the attack pattern there would
be that the botnet would pop up and try to overload the server
every oc-validity milliseconds or so and go quiet in the
intervals. I've no idea if that could be confused with
nominal reactions to a real overload though.
Stewart Bryant Former IESG member
No Objection
No Objection (for -14) Unknown

                            
Ted Lemon Former IESG member
No Objection
No Objection (2014-02-06 for -14) Unknown
Section 11, last paragraph:
   To prevent such attacks, servers should monitor client behavior to
   determine whether they are complying with overload control policies.
   If a client is not conforming to such policies, then the server
   should treat it as a non-supporting client (see Section 5.10.2).

This is probably just my failure of comprehension, but it is not at all clear to me how a server can monitor client behavior. How does the server distinguish between a situation where the client is not throttling, versus a situation where the client _is_ throttling, but load has increased proportionally to the amount of throttling requests, so that the observed load appears constant? Are you simply gambling that this will never happen, or did I misunderstand something?
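
The ambiguity can be shown with a small worked example. The numbers and the function name `forwarded_load` are hypothetical. The only assumption taken from the draft is that a loss-based client reduces what it forwards by the indicated percentage.

```python
def forwarded_load(offered: int, oc_percent: int) -> float:
    """Requests per second a client forwards after dropping oc_percent of its offered load."""
    return offered * (100 - oc_percent) / 100


# Client A honors a 20% reduction, but its own offered load grew to 125 req/s.
compliant = forwarded_load(offered=125, oc_percent=20)

# Client B ignores the directive and keeps forwarding its 100 req/s.
non_compliant = forwarded_load(offered=100, oc_percent=0)

# Both deliver the same rate to the server, so the arrival rate alone
# does not tell the two clients apart.
print(compliant, non_compliant, compliant == non_compliant)
```

Telling the two cases apart would need information the server does not have in this scheme, namely the load offered to each client.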