Internet Draft                                        Robert Hancock
                                                       Eleanor Hepworth
                                                        Andrew McDonald
                                            Siemens/Roke Manor Research
   Document: draft-hancock-nsis-overload-00.txt
   Expires: December 2003                                     June 2003


          Handling Overload Conditions in the NSIS Protocol Suite

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026 [1].

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
        http://www.ietf.org/ietf/1id-abstracts.txt
   The list of Internet-Draft Shadow Directories can be accessed at
        http://www.ietf.org/shadow.html.

Abstract

   The NSIS working group is considering protocols for signaling for
   resources for a traffic flow along its path in the network. The
   requirements for such signaling are being developed in [2] and a
   framework in [3]. The framework describes a 2-layer protocol
   architecture, with a common lower NSIS 'transport' layer protocol
   (NTLP) supporting a variety of upper layer NSIS signaling layer
   protocols (NSLPs).

   It is an open issue where within this architecture to place the
   responsibility for handling overload conditions. These conditions
   relate both to overload of the IP layer itself and to overload of
   buffer/processing resources within the NTLP/NSLPs. This note
   discusses the requirements and the implications of various
   approaches, and proposes a way forward.



Hancock et al.         Expires - December 2003                [Page 1]


                       NSIS: Overload Handling               June 2003

Table of Contents

   1. Introduction, Scope and Terminology
     1.1 Terminology; Flow and Congestion Control
   2. Requirements
   3. Implications of Doing Overload Handling within NSIS Protocols
   4. RSVP and Other Protocol Work
   5. Handling IP Overload ("Congestion Control")
   6. Handling NSIS Protocol Overload
   7. Security Considerations
   8. Conclusions
   Acknowledgments
   Authors' Addresses
   Full Copyright Statement


1. Introduction, Scope and Terminology

   The NSIS working group is considering protocols for signaling for
   resources for a traffic flow along its path in the network. The
   requirements for such signaling are being developed in [2] and a
   framework in [3]. The framework describes a 2-layer protocol
   architecture, with a common lower NSIS 'transport' layer protocol
   (NTLP) supporting a variety of upper layer NSIS signaling layer
   protocols (NSLPs).

   It is an open issue where within this architecture to place the
   responsibility for handling overload conditions; 'handling' includes
   detection as well as prevention and recovery. These conditions relate
   both to overload of the network (IP) layer itself and to overload of
   buffer/processing resources within the NTLP/NSLPs. This note
   discusses the requirements and the implications of various
   approaches, and proposes a way forward.

   These issues have been intermittently discussed on the NSIS mailing
   list [4], and noted in some of the design-related drafts [5, 6, 7].
   [8] provides authoritative guidance specifically on how the problem
   of congestion should be approached within Internet protocol
   standards, and includes many important references.

   Note that this draft is specifically not about resource signaling to
   manage congestion within the network when it actually occurs - for
   example, traffic engineering to route data flows around congested
   network areas. That is an important subject, but it concerns how
   resource management should be done, rather than how signaling
   protocols should work. This draft does, however, discuss how to
   prevent signaling protocols themselves from adding to the network
   congestion problem.


   After classifying the various types of signaling overload in section
   1.1, section 2 describes the potential causes of overload and the
   (proposed) requirements for how they should be dealt with. Section 3
   describes the basic implications for protocol design and
   implementation if they provide overload handling, and section 4
   briefly mentions how some other protocols related to network
   operation handle the problem. Section 5 discusses how to handle
   network (IP layer) overload, and section 6 discusses overload within
   the NSIS protocol suite itself. Security aspects are briefly
   mentioned in section 7, and section 8 concludes.

1.1 Terminology; Flow and Congestion Control

   Unless otherwise stated, this document follows the terminology given
   in the current NSIS framework [3].

   The overload problem is actually (at least) three problems:
   a) Overload in the IP layer, i.e. buffer congestion which causes IP
   packets to be dropped (affecting all flows, for signaling, data and
   other applications).
   b) Overload in the NTLP, meaning it cannot process incoming or
   outgoing packets fast enough. This might be caused by processor
   overload or by lower (IP) level congestion. It affects all NSIS
   signaling applications, but not the rest of the network - assuming
   (a) is already handled.
   c) Overload in an NSLP, meaning it cannot process incoming or
   outgoing packets fast enough. This might be caused by processor
   overload or by lower (NTLP/IP) level congestion. It affects only this
   signaling application - assuming that (a) and (b) are already
   handled.

   Traditionally, networking discussions draw a distinction between
   congestion control - protecting the infrastructure - and flow control
   - protecting the end systems. Making this distinction is somewhat
   subtle in the NSIS case, since the infrastructure includes end
   systems. For example, overload within the NTLP could be prevented by
   NTLP-level flow control; however, it would still be seen as
   equivalent to network congestion by NSLPs, and be invisible to the IP
   layer (as congestion or anything else). Therefore we work in terms of
   the more concrete concept of overload within particular protocol
   layers. No doubt even finer distinctions could be drawn.

2. Requirements

   This section summarises the potential sources of overload, and just
   how critical it is to deal with them as part of protocol design.



   Load/overload could originate from the following causes:
   NORMAL: 'Normal' operation, as user applications initiate signaling
   for their flows. (If this actually causes problems, the network or
   network elements probably just need re-engineering.)
   RETRY: Aggressive retry behaviour, as end-systems attempt to re-
   signal for failed or failing sessions, i.e. even if the flow itself
   is not active. (This sort of behaviour is felt to be a real problem
   in traditional telephony networks, where the worst excesses of such
   devices are curbed by regulation.)
   REFRESH: Signaling refresh messages generated within the network may
   cause overload, if the refresh period is not appropriately chosen.
   RXMIT: Message retransmission (e.g. to achieve reliability in the
   face of congestive loss) is itself a potential cause of overload, and
   particularly worrying as a source of instability, since the
   retransmissions themselves add to the overload.
   REPAIR: If there is a path change within the network, local repair
   actions could cause a flood of signaling traffic over the
   neighbouring links.

   While the sources of NORMAL and RETRY are end systems (or their
   proxies), the others are not. Therefore, it is not possible to rely
   only on end-to-end load control mechanisms, unless the other sources
   can be discounted.

   While NORMAL and REFRESH are roughly proportional to data traffic
   (and should be a small proportion of it) and hence should not usually
   be a source of IP-level overload, the others are not. Hence, both
   signaling element overload and general network overload should be
   handled within the protocol design.
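
   The classification in the two preceding paragraphs can be written
   down as a small table. The following Python sketch is illustrative
   only: the type and function names are assumptions introduced here,
   and the two boolean properties simply encode the statements above.

```python
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    NORMAL = "normal"
    RETRY = "retry"
    REFRESH = "refresh"
    RXMIT = "rxmit"
    REPAIR = "repair"


@dataclass(frozen=True)
class CauseProperties:
    end_system_sourced: bool    # originates at end systems (or proxies)
    tracks_data_traffic: bool   # roughly proportional to data traffic


# Encodes the two paragraphs above; nothing here is measured data.
PROPERTIES = {
    Cause.NORMAL: CauseProperties(True, True),
    Cause.RETRY: CauseProperties(True, False),
    Cause.REFRESH: CauseProperties(False, True),
    Cause.RXMIT: CauseProperties(False, False),
    Cause.REPAIR: CauseProperties(False, False),
}


def needs_in_network_control(cause):
    """End-to-end load control cannot reach causes arising in the network."""
    return not PROPERTIES[cause].end_system_sourced


def may_overload_ip_layer(cause):
    """Causes not bounded by data traffic volume can congest the IP layer."""
    return not PROPERTIES[cause].tracks_data_traffic
```

   Under this encoding, REFRESH, RXMIT and REPAIR need control inside
   the network, and RETRY, RXMIT and REPAIR are the potential sources of
   IP-level overload.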

   Any of these factors, especially RETRY and REPAIR, can lead to
   overload within the signaling protocol processing. The consequences
   of such overload would be reduced responsiveness within the network
   control plane, dropped signaling state for user sessions, and so on.
   Modified operation under these circumstances is mainly signaling-
   application specific; however, the signaling applications usually
   need support at the protocol level to detect the overload condition
   in the first place.

   In the case where all nodes in the network are NSIS-aware, the IP
   overload problem essentially becomes a node implementation issue
   (allocation of forwarding resources on outgoing links). However, a
   background assumption is that the NSIS protocols need to operate well
   over large-diameter NSIS-unaware clouds.

   A related point is that causes REFRESH and REPAIR mainly concern
   signaling generated in support of particular signaling applications,
   rather than 'protocol maintenance' signaling. Such signaling is
   therefore generated only at NSLP-aware nodes. (This is a consequence
   of the design decision that the NTLP only handles message forwarding,
   not state maintenance, and therefore cannot, for example, generate a
   flood of signaling application messages on a rerouting event.)

   While NSLP/NTLP overload failures are problems which are 'local' to
   the NSIS activity, there is no point in even attempting to
   standardise protocols which can contribute to network congestion (IP
   overload) in an uncontrolled way (see the warnings in [9]).

   The conclusion of this section is that overload both within the NSIS
   protocols and in the IP layer needs to be handled within the NSIS
   protocol designs, the latter with particular attention to robustness.

3. Implications of Doing Overload Handling within NSIS Protocols

   Overload handling generally implies having a feedback channel to
   complement the forward channel which carries the 'overload
   generating' traffic. The nodes at each end of the feedback channel
   have to be sensitive to the presence of the overload and be able to
   reduce it; generally, the closer to the location of the overload the
   better (e.g. end-to-end mechanisms will be inefficient at dealing
   with a local overload caused by a rerouting event).

   The implication of this is that an NSIS protocol that purports to
   deal with overloads has to be bi-directional, and have state
   information at each end which tracks the current load situation. The
   more direct the feedback in the reverse direction the better.

   Overload protection mechanisms are often associated with reliability
   mechanisms, but they don't have to be (e.g. DCCP [10]); they can be
   considered independently. Indeed, there may be a case for
   unreliability within the protocol (e.g. to delete aged messages),
   even though overload control is still needed.

   Avoidance of congestion (IP overload) generally has to be done by
   tracking packet drops at NSIS-unaware nodes. The mechanisms can vary
   from very simple to very complex. At one extreme, a simple stop-and-
   wait protocol will work; at the other end, the full (and growing)
   sophistication of TCP can be used. More sophistication is needed as
   the network length of the feedback channel and the desired throughput
   performance increase. This may be a situation where there is a case
   for different protocol options in different parts of the network.
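
   The need for more sophistication on longer feedback paths and at
   higher throughput follows from simple arithmetic: a sender allowed W
   unacknowledged messages can send at most W messages per round trip,
   and stop-and-wait is the case W = 1. The Python sketch below works
   this through; the function names and the example figures are
   assumptions chosen for illustration, not measurements.

```python
def max_rate_per_second(rtt_ms, window=1):
    """Upper bound on messages per second with `window` messages in flight.

    window=1 corresponds to a stop-and-wait protocol.
    """
    if rtt_ms <= 0 or window < 1:
        raise ValueError("rtt_ms must be positive and window at least 1")
    return window * 1000 / rtt_ms


def window_needed(target_rate_per_second, rtt_ms):
    """Smallest window that can sustain the target rate over this RTT."""
    if target_rate_per_second <= 0 or rtt_ms <= 0:
        raise ValueError("rate and rtt_ms must be positive")
    # Integer ceiling of rate * rtt, avoiding floating point rounding.
    return max(1, -(-target_rate_per_second * rtt_ms // 1000))
```

   With a 10 ms round trip, stop-and-wait permits up to 100 messages
   per second; with a 200 ms round trip it permits only 5, and a window
   of 100 messages would be needed to sustain 500 messages per second.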

4. RSVP and Other Protocol Work

   The base RSVP protocol as defined in [11] includes very limited
   overload detection and management capabilities. The main aspect is
   that refresh intervals can be locally adjusted, but this just
   allows management intervention rather than being an adaptive
   mechanism within the protocol itself. RSVP extensions for reliability
   were introduced in [12], accompanied by an exponential backoff
   procedure to address overload cause RXMIT.
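
   As a generic illustration of exponential backoff (not a rendering of
   the procedure in [12]), the Python sketch below computes the waits
   between successive retransmissions of an unacknowledged message. The
   function name, parameter names and example values are assumptions
   made for this sketch.

```python
def retransmit_schedule(initial_interval, retry_limit, factor=2.0):
    """Waits (in seconds) before each retransmission of one message.

    Each wait is `factor` times the previous one, so the load added by
    retransmissions falls off geometrically while the loss persists.
    """
    if initial_interval <= 0 or retry_limit < 0 or factor < 1:
        raise ValueError("invalid backoff parameters")
    waits = []
    interval = initial_interval
    for _ in range(retry_limit):
        waits.append(interval)
        interval *= factor
    return waits
```

   For example, retransmit_schedule(0.5, 3) yields waits of 0.5, 1.0
   and 2.0 seconds, after which the sender gives up or falls back to
   its normal refresh behaviour.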

   Most end-to-end application protocols, subject to causes NORMAL and
   RETRY, handle the overload control problem either by using TCP/SCTP
   as transports, or with a variety of ad hoc application level
   techniques applied over UDP.

   Within the network, the protocols which could be victims of causes
   REFRESH, RXMIT and REPAIR are non-trivial routing protocols. The most
   serious potential overload cause is a flood of routing messages as a
   new link is brought up. Here, OSPF uses a simple stop-and-wait
   protocol, while BGP uses TCP. The situation for the NSIS protocols is
   more severe, since the situation arises for any re-routing event
   (even one caused by link changes in a remote part of the network),
   and affects links which are already supposedly operational.

   In the Diameter Base protocol, which uses TCP/SCTP as a transport,
   higher layer overload is managed on a per-peer-connection basis by
   the explicit signaling of "busy" indications to the originating peer
   and the termination of the connection. The originating peer has the
   option to switch to an alternative next hop (load sharing), which is
   not possible within NSIS because the signaling has to be coupled to
   the data path.

5. Handling IP Overload ("Congestion Control")

   If the NTLP can generate its own messages for any of causes REFRESH,
   RXMIT or REPAIR, then it has to do so in a way which cannot cause IP
   layer overload; there is no other option. If this is the case, it
   would seem to make sense to rely on the same mechanism (whatever it
   is) to protect the IP layer from all NSIS overload causes.

   However, whether the NTLP generates such messages depends on other
   aspects of NTLP design and other decisions about NTLP functionality.
   One could imagine a situation where a very lightweight NTLP had no
   intelligence to generate messages independently of NSLP operation, in
   which case protection responsibility could be pushed up to the
   individual NSLPs. We can't tell whether this argument applies or not
   without more detail about the proposed NTLP design.

   Therefore, the question remains of whether it is sensible to allocate
   the problem to the NTLP in any case. The following arguments would
   seem to apply:



   *) There is no need for different sorts of congestion control for
   different signaling applications. (There may be different detailed
   reactions to congestion, i.e. how to generate fewer messages;
   however, detecting that fewer messages need to be sent is universal
   across all signaling applications.) Therefore, there is no need to
   solve this in a signaling-application sensitive manner.
   *) Detecting the problem may be easier with closer interaction with
   the lower layers. The NTLP is best placed to do this.
   *) Solving the problem is hard and important. Therefore, it is better
   to do it once and for all, and make life less burdensome for future
   NSLP developers.

   The conclusion of this set of arguments appears to be that congestion
   control, i.e. protection of the IP layer from overloads caused by
   NSIS protocol operation, should be an NTLP function.
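
   The division of labour argued for above (one congestion detector in
   the NTLP, application-specific reactions in the NSLPs) might look
   like the following Python sketch. It is a minimal illustration under
   assumed names: the class, the window adjustment rule, and the "qos"
   and "natfw" application labels are all hypothetical and do not
   describe any proposed NTLP design.

```python
class NtlpCongestionControl:
    """One congestion state per NTLP peer, shared by every NSLP above it."""

    def __init__(self, window=4):
        self.window = window    # messages allowed in flight to the peer
        self.in_flight = 0

    def try_send(self, nslp_name, message):
        # The decision ignores nslp_name and message on purpose:
        # detection is the same for all signaling applications.
        if self.in_flight >= self.window:
            return False
        self.in_flight += 1
        return True

    def on_ack(self):
        self.in_flight = max(0, self.in_flight - 1)
        self.window += 1                        # grow by one per ack

    def on_loss(self):
        self.in_flight = max(0, self.in_flight - 1)
        self.window = max(1, self.window // 2)  # halve on loss


def refresh_policy(cc, message):
    """Hypothetical reaction: a refused refresh is simply skipped."""
    return "sent" if cc.try_send("qos", message) else "skipped"


def setup_policy(cc, message, backlog):
    """Hypothetical reaction: a refused new request is kept for later."""
    if cc.try_send("natfw", message):
        return "sent"
    backlog.append(message)
    return "queued"
```

   Both policies receive the same refusal from the NTLP; only what they
   do with it differs, which is the point of the first argument above.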

6. Handling NSIS Protocol Overload

   The other question is related to handling overloads within the NSIS
   protocol layers themselves, i.e. when the internal resources of the
   NEs are constrained. It is clear that the NSLP should be in charge of
   adapting its own behaviour in response to overload situations, since
   the response will be specific to the signaling application. However,
   the method of detection and response depends on what overload
   detection and control features the NTLP provides, and what
   assumptions the NSLP can make about their presence (especially in
   remote nodes). Therefore, this section aims to identify the different
   options for how overload indications can be pushed up the protocol
   stack and/or out to the edge of the network (where the adaptation can
   take place) and how in particular the NTLP should support this.

   If the conclusion of section 5 is correct (i.e. NTLP enforcing IP
   layer congestion control), it is most likely that in any case there
   should be a flow-controlling API between the NSIS protocol layers.
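
   One possible shape for such a flow-controlling API is a bounded
   hand-off between the layers, where the NSLP is told immediately when
   the NTLP cannot accept more. The Python sketch below is only an
   illustration; the class and method names are assumptions and no
   particular NSIS API is implied.

```python
from collections import deque


class FlowControlledChannel:
    """Bounded hand-off between an NSLP (producer) and the NTLP (consumer)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._queue = deque()
        self._capacity = capacity

    def ready(self):
        """True if the NTLP can accept another message from the NSLP."""
        return len(self._queue) < self._capacity

    def offer(self, message):
        """Called by the NSLP; refuses rather than queueing without bound."""
        if not self.ready():
            return False
        self._queue.append(message)
        return True

    def take(self, budget):
        """Called by the NTLP when congestion control permits `budget` sends."""
        sent = []
        while budget > 0 and self._queue:
            sent.append(self._queue.popleft())
            budget -= 1
        return sent
```

   A refused offer() is the local overload indication: the NSLP learns
   of the condition without any message having been dropped, and can
   apply its own adaptation before trying again.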

   For providing overload indications towards the edge nodes, there seem
   to be three cases to consider. The argument depends on whether there
   are intermediate nodes which are unaware of the NSLPs in use (see
   Figure 1).

   1) The NTLP provides the equivalent of a highly granular flow
   controlled delivery service up to the next NSLP-aware node, with no
   assumed constraints on NSLP behaviour. The source is explicitly
   forced to throttle back the transmission of messages for the
   combination of source/destination/application. The NSLP only has to
   detect the condition locally; in fact, it can only send messages
   which the local NTLP is prepared to deliver. This makes life very
   easy for the NSLP, but NTLP design (in particular, buffer allocation
   and propagation of flow control information across nodes) is hard.

                                                +------+
                                                |  NE3 |
                                                |+----+|
                                                ||NSLP||
                                                |+----+|
               +------+    +------+             |  ||  |
               |  NE1 |    |  NE2 |             |+----+|
               |+----+|    |      |      |======||NTLP||===
               ||NSLP||    |      |      |      |+----+|
               |+----+|    |      |      |      +------+
               |  ||  |    |      |      |
               |+----+|    |+----+|  +------+   +------+
           ====||NTLP||====||NTLP||==|Router|   |  NE4 |
               |+----+|    |+----+|  +------+   |+----+|
               +------+    +------+      |      ||NSLP||
                                         |      |+----+|
                                         |      |  ||  |
                                         |      |+----+|
                                         |======||NTLP||====
                                                |+----+|
                                                +------+

                  Figure 1: Signaling with NTLP-only hops

   2) The NTLP provides a flow controlled delivery service (as above),
   but operates under assumptions about upper layer sending windows
   which allow buffer management to be simplified. For example, if only
   one message is allowed to be outstanding for a particular session at
   any time, the buffer requirements can be precisely calculated.
   3) The NTLP simply provides the service of delivery to the next NTLP
   node, e.g. NE1->NE2, NE2->NE3 in the figure. Overload at an NSLP-
   unaware intermediate node (NE2) is handled by dropping packets there
   (or by more sophisticated but still IP-like behaviour). The NSLPs in
   NE1 and NE3 have to detect this condition and somehow adapt
   accordingly (in particular, NE1 has to be able to detect that NE3 is
   overloaded while NE4 may not be).
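
   The remark under option (2), that buffer requirements can be
   precisely calculated once the sending window is bounded, amounts to
   a single multiplication. The Python sketch below shows it; the
   function name and the example figures (session count, message size)
   are assumptions for illustration only.

```python
def worst_case_buffer_bytes(sessions, max_message_bytes,
                            outstanding_per_session=1):
    """Upper bound on NTLP buffering when each session's window is fixed."""
    if sessions < 0 or max_message_bytes < 0 or outstanding_per_session < 1:
        raise ValueError("invalid sizing parameters")
    return sessions * outstanding_per_session * max_message_bytes
```

   For instance, 10,000 sessions with one outstanding message of at
   most 1500 bytes each bound the buffer at 15,000,000 bytes. Without
   the window assumption, as in option (1), no such bound exists.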

   Solutions (1) and (2) are both flow-control based, and require the
   maintenance of per-source-destination information in order to support
   flow control properly. For example, in figure 1, the NTLP at NE2
   would have to detect overload for the signaling application at NE3
   and throttle signaling messages for it from NE1, while not affecting
   NE1->NE2->NE4 communications. In addition, these solutions put
   complexity into the NTLP, and might infect it with knowledge about
   signaling flow topologies which it should really be ignorant of.
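
   The per-source-destination state that solutions (1) and (2) would
   require at a node such as NE2 can be sketched as a credit table. The
   Python below is a hedged illustration: the credit scheme, the class
   and its method names are assumptions, chosen only to show that NE2
   must key its state on flows it otherwise has no reason to know
   about.

```python
from collections import defaultdict


class TransitFlowControl:
    """Flow-control state an NSLP-unaware NTLP node (NE2) would hold."""

    def __init__(self, credits_per_flow=2):
        # One counter per (source, destination, application) triple.
        self._credits = defaultdict(lambda: credits_per_flow)

    def forward(self, source, destination, application):
        """Forward one message if this flow has credit; else throttle it."""
        key = (source, destination, application)
        if self._credits[key] == 0:
            return False
        self._credits[key] -= 1
        return True

    def grant(self, source, destination, application, credits=1):
        """The destination reports that it has consumed messages."""
        self._credits[(source, destination, application)] += credits

    def tracked_flows(self):
        return len(self._credits)
```

   If NE3 stops granting credit, NE1->NE3 traffic is throttled after
   two messages while NE1->NE4 continues as long as NE4 keeps granting.
   The table grows with the number of distinct flows crossing NE2.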


   Solution (3) puts some complexity into the NSLP behaviour which could
   be common to several applications; on the other hand, the flexibility
   to do it differently between different applications could be
   valuable. This option does not preclude the NTLP from doing flow
   control, but it does place a requirement on the NSLP to cope with
   lost messages at least as pathological events (although this would
   have to be the case anyway, e.g. to cope with intermediate node
   failure).
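
   Under solution (3), the NSLP itself has to infer overload from lost
   messages. One simple way to do so per peer is sketched below in
   Python. The threshold rule and all names are assumptions made for
   illustration; note that the estimator cannot tell whether a loss
   occurred at the peer or at an intermediate node such as NE2.

```python
class PeerLoadEstimator:
    """NSLP-side guess at which peers are overloaded, from missing replies."""

    def __init__(self, loss_threshold=3):
        if loss_threshold < 1:
            raise ValueError("loss_threshold must be at least 1")
        self.loss_threshold = loss_threshold
        self._consecutive_losses = {}

    def on_response(self, peer):
        """A reply arrived, so the path to this peer is working again."""
        self._consecutive_losses[peer] = 0

    def on_timeout(self, peer):
        """A request to this peer went unanswered."""
        count = self._consecutive_losses.get(peer, 0)
        self._consecutive_losses[peer] = count + 1

    def overloaded(self, peer):
        """True once enough consecutive requests have gone unanswered."""
        return self._consecutive_losses.get(peer, 0) >= self.loss_threshold
```

   Used at NE1, three unanswered requests towards NE3 mark NE3 as
   overloaded, while replies from NE4 keep NE4 in normal operation.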

   Note that these problems are mainly caused by the NSLP-unaware node,
   NE2, and the fact that the NTLP cannot bypass it. In contrast, for
   direct communication (e.g. NE3<->NE4) it would be very easy to
   implement solution (1). Flow-controlling solutions are also
   attractive because they can minimize the buffering taking place
   within the network and hence improve responsiveness.

   The conclusion of this argument appears to be that (3) is the
   preferred approach. This conclusion is mainly driven by complexity
   arguments about the NTLP, and the existence of NSLP-unaware nodes; if
   both of these concerns could be dealt with, the conclusion might
   well be reversed.

7. Security Considerations

   Malicious nodes can attack congestion control mechanisms to force
   nodes into a congestion avoidance state. The NTLP design should
   protect against this type of attack where the network is open to it.
   Also, both NSIS overload protection approaches have to make some
   assumptions about fairness at the NTLP level; however, this seems to
   be unavoidable.

8. Conclusions

   1. The NTLP needs to prevent network overload in the IP layer between
   NTLP peers.
   2. However, NSLPs need to detect and adapt to overload within the
   NSIS protocols themselves.
   3. Detection may take place by noting messages dropped by the NTLP,
   as well as any flow control imposed by the NTLP.

References

   1  Bradner, S., "The Internet Standards Process -- Revision 3", BCP
      9, RFC 2026, October 1996.

   2  Brunner, M., "Requirements for QoS Signaling Protocols",
      draft-ietf-nsis-req-07.txt (work in progress), March 2003.

   3  Freytsis, I., R. E. Hancock, G. Karagiannis, J. Loughney, S. van
      den Bosch, "Next Steps in Signaling: Framework",
      draft-ietf-nsis-fw-02.txt (work in progress), March 2003.

   4  Archive at: www.ietf.org/mail-archive/working-groups/nsis/

   5  Braden, R. and B. Lindell, "A Two-Level Architecture for Internet
      Signaling", draft-braden-2level-signal-arch-01.txt (work in
      progress), November 2002.

   6  Schulzrinne, H., H. Tschofenig, X. Fu, A. McDonald, "CASP -
      Cross-Application Signaling Protocol",
      draft-schulzrinne-nsis-casp-01.txt (work in progress), March 2003.

   7  McDonald, A., R. Hancock, E. Hepworth, "Design Considerations for
      an NSIS Transport Layer Protocol",
      draft-mcdonald-nsis-ntlp-considerations-00.txt (work in progress),
      January 2003.

   8  Floyd, S., "Congestion Control Principles", RFC 2914, September
      2000.

   9  http://www.ietf.org/ID-nits.html

   10 http://www.ietf.org/html.charters/dccp-charter.html

   11 Braden, R. et al., "Resource ReSerVation Protocol (RSVP) --
      Version 1 Functional Specification", RFC 2205, September 1997.

   12 Berger, L., Gan, D., Swallow, G., Pan, P., Tommasi, F. and S.
      Molendini, "RSVP Refresh Overhead Reduction Extensions", RFC 2961,
      April 2001.


Acknowledgments

   The authors would like to thank all their colleagues and fellow
   participants in the NSIS working group and internal protocol
   discussions for exposing the complexities and subtleties in this
   subject area. In particular, input was used from (in order of
   CRC{name}) Henning Schulzrinne, Xiaoming Fu, John Loughney, Melinda
   Shore, Hannes Tschofenig, Georgios Karagiannis, Ping Pan, Bob Braden,
   Sven Van den Bosch, Lars Westberg, Marcus Brunner, and Ruediger Geib.
   Henning in particular provided valuable education on flow control in
   signaling protocols. Needless to say, the interpretation and
   conclusions should be blamed only on the authors.

Authors' Addresses

   {Robert Hancock, Eleanor Hepworth, Andrew McDonald}
   Roke Manor Research
   Old Salisbury Lane
   Romsey, Hampshire
   SO51 0ZN
   United Kingdom
   email: {robert.hancock|eleanor.hepworth|andrew.mcdonald}@roke.co.uk

Full Copyright Statement

   Copyright (C) The Internet Society (2003). All Rights Reserved. This
   document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns. This
   document and the information contained herein is provided on an "AS
   IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK
   FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT
   LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL
   NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY
   OR FITNESS FOR A PARTICULAR PURPOSE.











