Inter-Domain Routing Working Group            Kevin Fang, Cisco Systems
Internet Draft                                  Feng Cai, Cisco Systems
Document: draft-zhiyfang-fecai-bgp-over-sctp-00.txt         May.10 2009
Expires: November 2009



                  BGP-4 message transport over SCTP


Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on November 10, 2009.


Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.


Abstract

   This memo defines the use of SCTP as a transport for BGP-4 routing
   messages.  SCTP offers several benefits for signaling and message
   transport; carrying BGP-4 over SCTP is intended to improve link
   stability and efficiency.


Fang/Cai                   Expires November 10, 2009            [Page 1]


Internet-Draft       BGP-4 message transport over SCTP         May  2009


Conventions used in this document
   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

Table of Contents


   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . . .3
       1.1.  Motivation. . . . . . . . . . . . . . . . . . . . . . . .3
       1.2.  Potential Benefits  . . . . . . . . . . . . . . . . . . .3
             1.2.1.  Fast Retransmission. . . . . . . . . . . . . . . .3
             1.2.2.  SCTP Multi-Streaming. . . . . . . . . . . . . . .3
             1.2.3.  SCTP Multi-Homing . . . . . . . . . . . . . . . .4
       1.3.  Key Terms . . . . . . . . . . . . . . . . . . . . . . . .4
   2.  Using SCTP multistreaming to avoid HOL blocking. . . . . . . .4
       2.1.  Classify Route information. . . . . . . . . . . . . . . .5
       2.2.  Classification Analysis . . . . . . . . . . . . . . . . .6
             2.2.1.  Classify by AFI/SAFI. . . . . . . . . . . . . . .6
             2.2.2.  Classify by AS_PATH . . . . . . . . . . . . . . .6
             2.2.3.  Classify by Route Distinguisher(RD) . . . . . . .6
   3.  Using SCTP multihoming for BGP connection . . . . . . . . . . .7
       3.1.  BGP link via TCP limitation . . . . . . . . . . . . . . .7
       3.2.  Which links need BGP multihoming. . . . . . . . . . . . .7
       3.3.  Init multihoming link for BGP connection. . . . . . . . .8
       3.4.  Link failure detection and switchover procedure . . . . .8
   4.  BGP-4 Stack modification to support SCTP. . . . . . . . . . . .9
       4.1.  Neighbor connection FSM modification. . . . . . . . . . .9
       4.2.  New BGP Capability Advertisement. . . . . . . . . . . . .9
       4.3.  New NOTIFICATION Subcodes . . . . . . . . . . . . . . . .10
   5.  Security Considerations . . . . . . . . . . . . . . . . . . . .10
   6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . . .10
   7.  References  . . . . . . . . . . . . . . . . . . . . . . . . . .10
       7.1.  Normative References  . . . . . . . . . . . . . . . . . .10
       7.2.  Informative References  . . . . . . . . . . . . . . . . .11

















1.  Introduction

   This section explains the reasoning for using the Stream Control
   Transmission Protocol (SCTP) to transport Border Gateway Protocol 4
   (BGP-4) messages.

1.1.  Motivation
   SCTP is a transport protocol defined in [RFC4960].  SCTP was designed
   to transport Public Switched Telephone Network (PSTN) signaling
   messages over IP networks, but is capable of broader applications.

   We have observed that many of the SCTP features designed to support
   the transport of NGN signaling protocols (SIGTRAN, SIP, H.248, ...)
   are also useful for the transport of BGP.

   BGP now supports a four-octet AS number space [RFC4893].  As a
   result, more and more service providers and enterprises will obtain
   AS numbers, and BGP will run across a larger network that exchanges
   a large number of messages.  As BGP-4 is transport independent,
   supporting SCTP is a relatively straightforward process, nearly
   identical to supporting TCP.

1.2.  Potential Benefits
   Coene et al. present some of the key benefits of SCTP [1].  We
   summarize the benefits that can enhance BGP-4 transport.

1.2.1.  Fast Retransmission
   SCTP can quickly determine the loss of a packet because it uses SACK
   and a mechanism that sends SACK messages faster than normal when
   losses are detected.

   When a router works in a hub-and-spoke environment (for example, as
   a BGP Route Reflector) and BGP-4 runs over TCP, the RR receives a
   large number of TCP ACKs, which may overflow its input queue.  The
   overflow may in turn cause many TCP retransmissions and the loss of
   peering nodes.  Because SCTP uses SACK, it may reduce the input-queue
   length compared with TCP.

   When a message is lost, the SACK mechanism is expected to detect the
   loss faster than TCP.

1.2.2.  SCTP Multi-Streaming
   SCTP supports the delivery of multiple independent user message
   streams within a single SCTP association.  This capability, when
   properly used, can alleviate the so-called head-of-line-blocking
   problem caused by the strict sequence delivery constraint imposed
   on the user data by TCP.

   This can be particularly useful for applications that need to
   exchange multiple, logically separate message streams between two
   endpoints.



   MPLS VPNs are expected to be widely used in future networks.  They
   require BGP-4 to carry more and more routing information, which
   means a large number of messages.  When BGP runs over TCP, any
   message that a peer fails to receive triggers a TCP retransmission,
   which causes head-of-line blocking (HOL blocking).  The router then
   cannot send messages to other peering nodes.  Multi-streaming is a
   good mechanism to avoid such HOL blocking.


1.2.3.  SCTP Multi-Homing
   SCTP provides transparent support for communications between two
   endpoints of which one or both is multi-homed.

   SCTP provides monitoring of the reachability of the addresses on the
   remote endpoint and in the case of failure can transparently failover
   from the primary address to an alternate address, without upper layer
   intervention.

   BGP-4 over TCP typically uses a loopback interface to survive a link
   failure, but in some scenarios BGP-4 messages are still sent over a
   broken link.  Although BGP-4 can use Bidirectional Forwarding
   Detection [BFD], it still cannot provide a multi-link solution.

   If BGP-4 runs over SCTP, routers can use multi-homing to protect
   against a single link failure.

1.3.  Key Terms
   Using SCTP to transport BGP-4 messages offers the following services:

     --  SCTP multihoming provides a better redundancy solution.
     --  SCTP multistreaming avoids HOL blocking.

   See the BGP-4 specification [RFC4271] and Multiprotocol Extensions
   for BGP-4 [RFC4760] for an introduction to the concepts these textual
   conventions cover.


2.  Using SCTP multistreaming to avoid HOL blocking
   BGP-4 now supports 4-byte AS numbers, and Multiprotocol BGP
   [RFC4760] extends BGP so that information for multiple NLRI families
   and sub-families can be carried in BGP.  Current implementations
   carry all routes in a single BGP session.

   In fact, a single malformed message may cause HOL blocking on the
   session and then terminate it.  It would therefore be desirable to
   terminate only the session related to the affected family while
   leaving the other AFI/SAFIs unaffected.  As BGP is commonly deployed,
   this is not possible.

   Multisession BGP [3] tries to carry the AFI/SAFIs over multiple
   sessions, but this is not an efficient approach.  If BGP-4 messages
   are carried over SCTP, the SCTP multi-streaming feature can easily
   be used to avoid HOL blocking.

   Multi-streaming operates at the transport layer.  At the application
   layer, BGP-4 sees only one SCTP association to the peer node, while
   the messages are actually carried over many streams.

   BGP-4 multi-streaming transport over SCTP is shown below:

       _____________                                      _____________
      |    BGP-4    |                                    |    BGP-4    |
      | Application |                                    | Application |
      |-------------|                                    |-------------|
      |    SCTP     |<-------------Stream 1------------->|    SCTP     |
      |  Transport  |<-------------Stream 2------------->|  Transport  |
      |   Service   |               ....                 |   Service   |
      |             |<-------------Stream N------------->|             |
      |-------------|                                    |-------------|
      |             |One or more    ----      One or more|             |
      | IP Network  |IP address      \/        IP address| IP Network  |
      |   Service   |appearances     /\       appearances|   Service   |
      |_____________|               ----                 |_____________|

        SCTP Node A |<-------- Network transport ------->| SCTP Node B


2.1.  Classify Route information
   When BGP-4 supports SCTP multi-streaming, a way is needed to assign
   each message to a stream.  Messages can be classified by one of the
   following methods:
      --  Classify by AFI/SAFI
      --  Classify by AS_PATH
      --  Classify by Route Distinguisher(RD)

        The following format MUST be used for the SCTP DATA chunk:

        0                   1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |   Type = 0    | Reserved|U|B|E|    Length                     |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                              TSN                              |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


Fang/Cai                   Expires November 10, 2009            [Page 5]


Internet-Draft       BGP-4 message transport over SCTP         May  2009



       |      Stream Identifier S      |   Stream Sequence Number n    |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                  Payload Protocol Identifier                  |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       \                                                               \
       /                 User Data (seq n of Stream S)                 /
       \                                                               \
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
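
   As an illustration of the fields above, the following sketch packs
   and parses the fixed part of a DATA chunk with Python's struct
   module.  The function names are hypothetical, and the Payload
   Protocol Identifier value 0 (unspecified) is an assumption, since
   this document does not define one for BGP.  A deployed speaker would
   normally leave chunk handling to the SCTP stack.

```python
import struct

# Fixed DATA chunk header: Type, flags, Length, TSN, Stream Identifier,
# Stream Sequence Number, Payload Protocol Identifier (16 octets).
DATA_HEADER = struct.Struct("!BBHIHHI")
FLAG_E, FLAG_B, FLAG_U = 0x01, 0x02, 0x04


def build_data_chunk(tsn, stream_id, stream_seq, user_data, ppid=0):
    """Build one unfragmented, ordered DATA chunk (B and E set, U clear)."""
    length = DATA_HEADER.size + len(user_data)    # Length excludes padding
    header = DATA_HEADER.pack(0, FLAG_B | FLAG_E, length,
                              tsn, stream_id, stream_seq, ppid)
    padding = b"\x00" * (-len(user_data) % 4)     # pad to a 4-octet boundary
    return header + user_data + padding


def parse_data_chunk(chunk):
    """Return the header fields and user data of a DATA chunk as a dict."""
    ctype, flags, length, tsn, sid, ssn, ppid = DATA_HEADER.unpack_from(chunk)
    if ctype != 0:
        raise ValueError("not a DATA chunk")
    return {
        "unordered": bool(flags & FLAG_U),
        "tsn": tsn,
        "stream_id": sid,
        "stream_seq": ssn,
        "ppid": ppid,
        "user_data": chunk[DATA_HEADER.size:length],
    }


# A BGP KEEPALIVE: 16-octet marker, 2-octet length (19), type 4.
keepalive = b"\xff" * 16 + struct.pack("!HB", 19, 4)
chunk = build_data_chunk(tsn=1000, stream_id=7, stream_seq=0,
                         user_data=keepalive)
fields = parse_data_chunk(chunk)
print(fields["stream_id"], fields["stream_seq"], len(chunk))
```

   The example carries a 19-octet KEEPALIVE on stream 7; one octet of
   padding brings the chunk to a 4-octet boundary.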

   In the SCTP DATA chunk format, the "Stream Identifier S" field is a
   16-bit unsigned integer, so a single SCTP association can support up
   to 65535 streams.  A hash algorithm is needed to classify the
   messages, as follows:

                                   _____________
                _____________     |             |---Stream 1---->
               | Hash based  |    |    SCTP     |---Stream 2---->
               | Classifier  |--->|  Transport  |    ....
               |_____________|    |   Service   |
                                  |_____________|---Stream N---->

   The receiver will simply ignore the stream id.
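
   A minimal sketch of such a hash-based classifier is given below.
   This document does not name a hash algorithm; CRC-32 is used here
   purely as an assumption because it is deterministic, and the
   function names and the 3-octet AFI/SAFI key encoding are
   hypothetical.

```python
import zlib

MAX_STREAMS = 65535  # limit stated above for a single association


def classify(key: bytes, num_streams: int) -> int:
    """Map classification key bytes to a stream ID in [0, num_streams)."""
    if not 1 <= num_streams <= MAX_STREAMS:
        raise ValueError("num_streams must be between 1 and 65535")
    # CRC-32 gives the same result on every run and host, unlike
    # Python's built-in hash(), which is randomized per process for bytes.
    return zlib.crc32(key) % num_streams


def afi_safi_key(afi: int, safi: int) -> bytes:
    """Encode AFI (2 octets) and SAFI (1 octet) as the hash source data."""
    return afi.to_bytes(2, "big") + safi.to_bytes(1, "big")


ipv4_unicast = classify(afi_safi_key(1, 1), num_streams=8)
vpnv4_unicast = classify(afi_safi_key(1, 128), num_streams=8)
print(ipv4_unicast, vpnv4_unicast)
```

   Because the receiver ignores the stream ID, the two peers do not
   need to agree on the hash function; only the sender's choice
   matters.  Distinct keys can still collide on the same stream when
   the number of streams is small.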

2.2.  Classification Analysis
2.2.1.  Classify by AFI/SAFI
   The classifier uses the AFI/SAFI as the hash source data.  However,
   if one router marks all AFI/SAFIs with a malformed community or
   other attribute, all the stream queues will be blocked.

2.2.2.  Classify by AS_PATH
   The classifier uses the first and last AS numbers in the AS_PATH
   sequence as the hash source data.  This mechanism avoids the
   HOL-blocking scenario described in Section 2.2.1, but it may require
   a two-level hash classifier, as follows:

                                             _____________
      _____________      ______________     |             |-Stream 1--->
     | Src.AS Hash |    | Dest.AS Hash |    |    SCTP     |-Stream 2--->
     | Classifier  |--->| Classifier   |--->|  Transport  |    ....
     |_____________|    |______________|    |   Service   |
                                            |_____________|-Stream N--->
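
   The two-level arrangement in the figure can be sketched as follows.
   The figure does not say which end of the AS_PATH feeds which level,
   so this sketch arbitrarily hashes the first AS at level 1 and the
   last AS at level 2.  The function names, the CRC-32 hash, and the
   split into groups of streams are all assumptions made for
   illustration.

```python
import zlib


def _hash_asn(asn: int) -> int:
    """Deterministic hash of one AS number, encoded as 4 octets."""
    return zlib.crc32(asn.to_bytes(4, "big"))


def classify_by_as_path(as_path, groups, streams_per_group):
    """Two-level classifier over the first and last AS of an AS_PATH."""
    if not as_path:
        raise ValueError("AS_PATH sequence is empty")
    if groups * streams_per_group > 65535:
        raise ValueError("too many streams for one association")
    first_as, last_as = as_path[0], as_path[-1]
    group = _hash_asn(first_as) % groups               # level 1
    offset = _hash_asn(last_as) % streams_per_group    # level 2
    return group * streams_per_group + offset


a = classify_by_as_path([64496, 64500, 64511], groups=4,
                        streams_per_group=4)
b = classify_by_as_path([64496, 64505, 64506, 64511], groups=4,
                        streams_per_group=4)
print(a, b, a == b)
```

   Paths that share the same first and last AS map to the same stream
   regardless of the intermediate hops, so their UPDATEs stay in order
   relative to each other.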


2.2.3.  Classify by Route Distinguisher(RD)
   The classifier uses the Route Distinguisher (RD) as the hash source
   data.  This mechanism can keep large UPDATE messages in one VPN from
   delaying other VPNs.  All malformed messages from a VPN are sent on
   a single stream, which leaves the other VPNs unaffected.


3.  Using SCTP multihoming for BGP connection

   An article [4] describes BGP multihoming over TCP.  Multihoming is a
   desirable feature for enhancing BGP redundancy and reliability.
   Using the SCTP multihoming feature is more reasonable than
   multihoming over TCP.

3.1.  BGP link via TCP limitation
   Multihoming over TCP has some limitations.  With TCP, a loopback
   interface is often used as the update source to avoid a single link
   failure.  Consider, however, the eBGP multihop scenario shown below:

            _____________        _________        ______________
           |             |--a1--|IP Cloud1|--b1--|              |
           | RtrA(lo 0)  |      |_________|      | RtrB(lo 0)   |
           | eBGP to RtrB|       _________       | eBGP to RtrA |
           |_____________|--a2--|IP Cloud2|--b2--|______________|
                                |_________|

   RtrA uses interface loopback 0 to establish a TCP session to RtrB's
   interface loopback 0 across an IP cloud.  If link b1 goes down, RtrA
   detects the link failure only after the IGP in IP Cloud1 converges.
   If RtrA runs BFD, it can detect the link failure faster and will
   then advertise that peer RtrB is lost, even though RtrA can still
   use link a2 to communicate with RtrB.

   This happens because there is only one TCP session between the two
   routers.  The neighbor recovery time depends on the IGP convergence
   speed.  When the link recovers, the neighbor relationship is
   established again and the UPDATE messages are transmitted to all
   networks again, which causes route flapping and network instability.

3.2.  Which links need BGP multihoming
   SCTP provides transparent support for communications between two
   endpoints of which one or both is multi-homed.

   An iBGP link often has only one hop to the peering node, so a link
   failure is detected much faster.  Multihoming is not required; an
   SCTP association between the two routers' loopback interfaces is
   enough.  SCTP transport is still needed to enhance transport
   reliability: on iBGP connections to an RR, SACK will improve the
   RR's performance, and multistreaming will avoid HOL blocking.

   An eBGP link connects to another AS.  Inter-AS links are not very
   stable and may be congested during some periods, so a backup link
   to the peering node is necessary.




3.3.  Init multihoming link for BGP connection

   An SCTP association needs to determine its primary address.  Link
   load, reliability, and bandwidth can be used as the preference
   value; a pre-configured value can also be used.
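
   One possible way to turn these inputs into a choice of primary
   address is sketched below.  This document does not define a scoring
   formula; the product used here, the rule that a configured value
   always wins, and all names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CandidatePath:
    address: str
    load: float                       # 0.0 (idle) .. 1.0 (saturated)
    reliability: float                # 0.0 .. 1.0
    bandwidth_mbps: float
    configured: Optional[int] = None  # operator preference, if any


def preference(path: CandidatePath):
    """Sort key: a configured value always outranks a measured score."""
    if path.configured is not None:
        return (1, float(path.configured))
    score = path.reliability * path.bandwidth_mbps * (1.0 - path.load)
    return (0, score)


def choose_primary(paths):
    """Return the candidate with the highest preference."""
    return max(paths, key=preference)


paths = [
    CandidatePath("ip.b1", load=0.7, reliability=0.99, bandwidth_mbps=1000),
    CandidatePath("ip.b2", load=0.1, reliability=0.95, bandwidth_mbps=1000),
]
print(choose_primary(paths).address)
```

   With the measured values above the lightly loaded address ip.b2 is
   chosen; setting a configured value on ip.b1 would override that.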


   An eBGP multihoming link between two routers is shown below:
     _________________________             _______________________
    |                         |           | AS Y                  |
    |     __________          |           |          __________   |
    |    |          |--ip.a1--+-----------+-ip.b1---|          |  |
    |    |   RtrA   |         |           |         |   RtrB   |  |
    |    |__________|--ip.a2--+-----------+-ip.b2---|__________|  |
    | AS X                    |           |                       |
    |_________________________|           |_______________________|


   SCTP multihoming can also be initiated toward a different router,
   but this requires RtrA to configure a route that sends packets for
   ip.b2 via link *ip.a2--ip.c2*, as follows:

     _________________________             _______________________
    |                         |           | AS Y                  |
    |     __________          |           |          __________   |
    |    |          |         |           |         |  RtrB    |  |
    |    |          |--ip.a1--+-----------+-ip.b1---|   (ip.b2)|  |
    |    |          |         |           |         |_____|____|  |
    |    |   RtrA   |         |           |               |       |
    |    |          |         |           |          _____|____   |
    |    |          |         |           |         |     |    |  |
    |    |          |--ip.a2--+-----------+-ip.c2---|   RtrC   |  |
    |    |__________|         |           |         |__________|  |
    | AS X                    |           |                       |
    |_________________________|           |_______________________|



3.4.  Link failure detection and switchover procedure
   SCTP provides monitoring of the reachability of the addresses on
   the remote endpoint and in the case of failure can transparently
   failover from the primary address to an alternate address, without
   upper layer intervention.

   In a BGP-4 multihoming implementation, however, when the primary
   link fails the RIB/FIB MUST be notified so that other packets are
   forwarded over the alternate link.  A withdraw message needs to be
   sent out.
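
   The switchover described in this section can be sketched as a small
   path monitor.  The sketch is a simplified model, not an SCTP
   implementation: the class and callback names are hypothetical, and
   the threshold of 5 follows the suggested Path.Max.Retrans value in
   [RFC4960].  The on_switch callback stands in for the RIB/FIB
   notification required above.

```python
PATH_MAX_RETRANS = 5  # suggested per-destination value in RFC 4960


class MultihomedPeer:
    """Tracks heartbeat failures per remote address and fails over."""

    def __init__(self, primary, alternates, on_switch):
        self.primary = primary
        self.errors = {addr: 0 for addr in [primary, *alternates]}
        self.active = {addr: True for addr in self.errors}
        self.on_switch = on_switch        # hook for the RIB/FIB update

    def heartbeat_acked(self, addr):
        self.errors[addr] = 0
        self.active[addr] = True

    def heartbeat_timeout(self, addr):
        self.errors[addr] += 1
        if self.errors[addr] > PATH_MAX_RETRANS:
            self.active[addr] = False
            if addr == self.primary:
                self._switch_primary()

    def _switch_primary(self):
        # Pick the first alternate address that is still reachable.
        for addr, reachable in self.active.items():
            if reachable:
                old, self.primary = self.primary, addr
                self.on_switch(old, addr)
                return


events = []
peer = MultihomedPeer("ip.b1", ["ip.b2"],
                      on_switch=lambda old, new: events.append((old, new)))
for _ in range(PATH_MAX_RETRANS + 1):
    peer.heartbeat_timeout("ip.b1")
print(peer.primary, events)
```

   The primary address changes only once the error counter exceeds the
   threshold, at which point the callback fires a single time.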



4.  BGP-4 Stack modification to support SCTP
   Transporting BGP-4 over SCTP requires modifications to the BGP-4
   stack.  The key changes are:

     -- modify the neighbor FSM to initiate the SCTP association and to
        provide backward compatibility by falling back to a TCP
        connection.

     -- modify the BGP Capability Advertisement to support the SCTP
        multistreaming transport methods.

     -- modify the NOTIFICATION subcodes to notify the neighbor of a
        failure to initiate the SCTP association or of a
        Primary/Alternate link failure.

4.1.  Neighbor connection FSM modification
   Two states are added to support BGP-4 over SCTP:

     o  CONNECT-SCTP
     o  CONNECT-TCP

   When the BGP-4 process starts, the neighbor state changes from IDLE
   to CONNECT-SCTP.  In this state, the BGP speaker tries to initiate
   an SCTP association to the peering node.  An SCTP-ConnectRetry timer
   is added to monitor the SCTP connection attempt.  If this timer
   expires, BGP retries the SCTP connection.

   If the SCTP-ConnectRetry timer expires again, BGP-4 falls back to
   initiating a TCP connection, and the FSM changes from CONNECT-SCTP
   to CONNECT-TCP.  A NOTIFICATION message is sent later to inform the
   remote peer that an error occurred when initiating the SCTP
   association and that the speaker fell back to TCP.

   If the TCP connection also times out, the neighbor state changes to
   ACTIVE, and the BGP speaker listens on the configured interface.

   If the SCTP or TCP connection is successfully established, an OPEN
   message is sent and the neighbor state changes to OPENSENT.
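
   The connection procedure above can be sketched as a small state
   walk.  In this sketch, which is an illustration rather than a
   complete FSM, the try_sctp and try_tcp callables are hypothetical
   stand-ins for real connection attempts: returning False models the
   expiry of the corresponding connect-retry timer.

```python
from enum import Enum


class State(Enum):
    IDLE = "Idle"
    CONNECT_SCTP = "Connect-SCTP"
    CONNECT_TCP = "Connect-TCP"
    ACTIVE = "Active"
    OPENSENT = "OpenSent"


def run_connect_phase(try_sctp, try_tcp, sctp_attempts=2):
    """Return (final_state, transport, visited_states, notify_fallback)."""
    visited = [State.IDLE, State.CONNECT_SCTP]
    for _ in range(sctp_attempts):
        if try_sctp():                  # association established
            visited.append(State.OPENSENT)
            return State.OPENSENT, "sctp", visited, False
    # SCTP-ConnectRetry expired on every attempt: fall back to TCP.
    visited.append(State.CONNECT_TCP)
    if try_tcp():
        visited.append(State.OPENSENT)
        # True: a NOTIFICATION about the fallback should be sent later.
        return State.OPENSENT, "tcp", visited, True
    visited.append(State.ACTIVE)        # listen for the peer instead
    return State.ACTIVE, None, visited, False


final, transport, visited, notify = run_connect_phase(
    try_sctp=lambda: False, try_tcp=lambda: True)
print(final.value, transport, [s.value for s in visited], notify)
```

   The example models a peer that does not answer SCTP: after two
   failed attempts the speaker reaches OPENSENT over TCP and records
   that the fallback NOTIFICATION is due.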


4.2.  New BGP Capability Advertisement
   This specification defines SCTP transport capability:

      Capability code (1 octet): TBD (requested value: 69)
      Capability length (1 octet): fixed, 2 bits
      Capability value (2 bits):
        0 -- Do not use Multistreaming
        1 -- Use MultiStreaming and classify by AFI/SAFI
        2 -- Use MultiStreaming and classify by AS_PATH
        3 -- Use MultiStreaming and classify by Route Distinguisher(RD)
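
   The following sketch shows one way the capability could be encoded
   and decoded.  It rests on two assumptions that this document does
   not settle: the 2-bit value is carried in the low-order bits of a
   single value octet (so the length field is 1), and the code point is
   the requested, not yet assigned, value 69.  All names are
   hypothetical.

```python
from enum import IntEnum

SCTP_CAPABILITY_CODE = 69   # requested value only; not assigned


class StreamMode(IntEnum):
    NO_MULTISTREAMING = 0
    BY_AFI_SAFI = 1
    BY_AS_PATH = 2
    BY_ROUTE_DISTINGUISHER = 3


def encode_capability(mode: StreamMode) -> bytes:
    """Encode code, length, and one value octet (assumed layout)."""
    return bytes([SCTP_CAPABILITY_CODE, 1, int(mode) & 0x03])


def decode_capability(data: bytes) -> StreamMode:
    """Decode the assumed 3-octet encoding back into a StreamMode."""
    if len(data) < 3 or data[0] != SCTP_CAPABILITY_CODE or data[1] != 1:
        raise ValueError("not an SCTP transport capability")
    return StreamMode(data[2] & 0x03)


wire = encode_capability(StreamMode.BY_AS_PATH)
print(wire.hex(), decode_capability(wire).name)
```

   Under these assumptions the six high-order bits of the value octet
   are unused and are ignored on receipt.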



4.3.  New NOTIFICATION Subcodes
   This specification introduces three new subcodes:

     o  TBD -- Init SCTP association failed, fallback to TCP connection.
     o  TBD -- Primary SCTP link failure.
     o  TBD -- Alternate SCTP link failure.


5.  Security Considerations

   From RFC 3257 [1]:

   "SCTP has been designed with the experiences made with TCP in mind.
   To make it hard for blind attackers (i.e., attackers that are not
   man-in-the-middle) to inject forged SCTP datagrams into existing
   associations, each side of an SCTP association uses a 32 bit value
   called "Verification Tag" to ensure that a datagram really belongs to
   the existing association.  So in addition to a combination of source
   and destination transport addresses that belong to an established
   association, a valid SCTP datagram must also have the correct tag to
   be accepted by the recipient.

   Unlike in TCP, usage of cookie in association establishment is made
   mandatory in SCTP.  For the server, a new association is fully
   established after three messages (containing INIT, INIT-ACK, COOKIE-
   ECHO chunks) have been exchanged.  The cookie is a variable length
   parameter that contains all relevant data to initialize the TCB on
   the server side, plus a HMAC used to secure it.  This HMAC (MD5 as
   per [RFC1321] or SHA-1 [SHA1]) is computed over the cookie and a
   secret, server-owned key."



6.  IANA Considerations

   This document defines a new BGP capability, the BGP transport over
   SCTP Capability.  The Capability Code for the BGP transport over
   SCTP Capability is TBD (requested value: 69).  The capability codes
   currently in use are listed at:

      http://www.iana.org/assignments/capability-codes/


7.  References

7.1.  Normative References

   [1]   Coene, L., "Stream Control Transmission Protocol Applicability
         Statement", RFC 3257, April 2002.


   [2]   Jones, M. T., "Better networking with SCTP: the Stream Control
         Transmission Protocol combines advantages from both TCP and
         UDP".

   [3]   Scudder, J. and C. Appanna, "Multisession BGP",
         draft-ietf-idr-bgp-multisession-03.txt, January 2007.

   [4]   Philip, S. and Gaurab, U., "BGP Multihoming and Internet
         Exchange Points", SANOG 7,
         http://www.sanog.org/resources/sanog7/pfs-bgp-multihoming.pdf

   [RFC2119]   Bradner, S., "Key words for use in RFCs to Indicate
               Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC4271]   Rekhter, Y., Li, T., and S. Hares, "A Border Gateway
               Protocol 4 (BGP-4)", RFC 4271, January 2006.

   [RFC4760]   Bates, T., Chandra, R., Katz, D., and Y. Rekhter,
               "Multiprotocol Extensions for BGP-4", RFC 4760,
               January 2007.

7.2.  Informative References

   [BFD]       Katz, D. and D. Ward, "Bidirectional Forwarding
               Detection", Work in Progress.


Authors' Addresses

   Kevin Fang
   Cisco Systems, Inc.
   Edge Routing Business Unit

   EMail: zhiyfang&cisco.com


   Feng Cai
   Cisco Systems, Inc.
   Edge Routing Business Unit

   EMail: fecai&cisco.com



