Internet Draft                                              August 2006


   Network Working Group                                   Manav Bhatia
   Internet Draft                                   Lucent Technologies
                                                        Joel M. Halpern
                                                             Paul Jakma
   Expires: January 2007                               Sun Microsystems

                Advertising Multiple NextHop Routes in BGP

                draft-bhatia-bgp-multiple-next-hops-01.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire in January 2007.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   This document describes an extensible mechanism that allows a BGP
   speaker to advertise multiple BGP paths for a destination to its
   peers, by describing a new BGP capability, termed "Multiple-Hop
   Capability".

   The mechanisms described in this document are applicable to all
   routers, both those with the ability to inject multiple routing
   entries in their forwarding table and those without.


Bhatia, Halpern and Jakma                                      [Page 1]



Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Table of Contents

   1. Introduction...................................................2
   2. Multiple-Hop Capability........................................3
      2.1 Multiple-Hop attribute - MULTIPLE_HOP......................5
   3. Operation when both peers are Multiple-Hop capable.............6
      3.1 Advertisement of Multiple-Hop BGP routes...................7
      3.2 Withdrawal Procedures......................................7
      3.3 Procedures for the Receiving Speaker.......................8
      3.4 Working with Multiple-Hop capable IBGP peers...............8
      3.5 Implicit Withdrawal for one of the Next-Hops...............9
   4. Multiprotocol Extensions to BGP................................9
   5. Security Considerations.......................................10
   6. Acknowledgements..............................................10
   7. IANA Considerations...........................................10
   8. References....................................................10
      8.1 Normative References......................................10
      8.2 Informative References....................................11
   9. Appendix A....................................................11
      9.1 Suboptimal Routing in Route Reflector clients.............11
      9.2 Avoiding Persistent Route Oscillations....................12
      9.3 eBGP mesh scaling at IXes via Route Servers...............15
      9.4 Advertising a subset of routes in BGP.....................15
      9.5 Equal Cost Multiple Path BGP..............................16
   10. Author’s Address.............................................16
   11. Intellectual Property Statement..............................17

1. Introduction

   Currently, BGP [BGP4] speakers cannot announce multiple paths for a
   destination, even in scenarios where this would be desirable.  This
   is because the BGP specification allows only one "best" route to be
   inserted into the Loc-RIB and announced to other BGP speakers.  If a
   speaker sends a new route for a destination that it has previously
   announced to a peer, the receiver "implicitly withdraws" the former
   route and replaces it with the new one.

   Because of this behavior, BGP speakers are never able to advertise
   multiple paths for the same destination to their peers.

   Lifting this restriction would benefit at least the following
   scenarios in BGP:


   o Persistent route-oscillation conditions in BGP [MED]

   o eBGP mesh scaling at Internet Exchanges

   o Interaction between ECMP capable BGP speakers

   The first concerns route-reflectors [RR], where in certain
   topologies, persistent route-oscillation conditions can arise
   because the clients of route-reflectors are never fully informed of
   each other's best paths, particularly where MED/Router ID values are
   considered as part of the best-path selection.  If BGP were to
   provide a means to allow route-reflectors to share all the collective
   best-paths with their clients, then these conditions could be
   alleviated, as shown in the Appendix.

   The second concerns scaling of eBGP meshes at Internet Exchanges
   (referred to as an IX from now on, or IXes in the plural).  IX
   operators have deployed eBGP route-servers, in a variety of guises,
   in order to reduce the need for customers to establish direct
   sessions with other customers.  These route-servers however have
   severe limitations because of the single-path restriction in BGP.
   Removing this limitation would allow for efficient deployment of IX
   route-servers.

   The third concerns BGP implementations which are capable of
   considering multiple routes for inclusion into their RIB, and hence
   likely their FIB, but do not have a way to relay the full resulting
   state of their BGP RIB to their peers.

   This document specifies the mechanism by which Multiple-Hop operates;
   however it will not attempt to fully describe the usages.  In
   particular this document anticipates that the ECMP scenario will be
   described fully in another document, as it would have to be even if
   documented without consideration of the Multiple-Hop capability.

   It is anticipated however that any speaker implementing the
   functionality described in this document would be able to
   interoperate with Multiple-Hop capable route-servers and route-
   reflectors, just as BGP speakers interoperate with Route-Reflectors
   in the absence of the Multiple-Hop capability.

2. Multiple-Hop Capability

   The Multiple-Hop capability is a new BGP capability [BGP-CAP] that a
   BGP speaker can use to indicate its ability to understand
   Multiple-Hop UPDATE messages from a remote peer.

   This capability is defined as follows:


       Capability Code: TBD

       Capability Length: Variable

       Capability Values: Consists of one or more of the tuples <AFI,
       SAFI, Flags for the address family> as follows:

             +--------------------------------------------------+
             |  Address Family Identifier (16 bits)             |
             +--------------------------------------------------+
             |  Subsequent Address Family Identifier (8 bits)   |
             +--------------------------------------------------+
             |  Flags for the Address Family (8 bits)           |
             +--------------------------------------------------+

                                 Figure 1

   The use and meaning of the fields are as follows:

   Address Family Identifier

       This field carries the identity of the Network Layer protocol
       for which Multiple-Hop support is advertised.  Presently
       defined values for this field are specified in [IANA-AFI].

   Subsequent Address Family Identifier (SAFI):

       This field provides additional information about the type of
       the Network Layer Reachability Information carried in the
       attribute. Presently defined values for this field are specified
       in [IANA-SAFI].

   Flags for Address Family:

       This field contains bit flags for the <AFI, SAFI>.

                0 1 2 3 4 5 6 7
               +-+-+-+-+-+-+-+--+
               |R|R|R|R|R|R|R|RM|
               +-+-+-+-+-+-+-+--+

       R  Reserved:

       MUST be set to zero by the sender and ignored by the receiver.

       RM Receive Multiple

       Indicates that the speaker is interested in receiving additional
       BGP paths, beyond just the best path, from the peer that
       receives this capability.

       A speaker sets this bit in its Multiple-Hop capability to
       indicate that it is prepared to receive additional path
       advertisements, beyond just the best path, by way of the
       Multiple-Hop capability.

       As such, speakers implementing the Multiple-Hop capability
       MUST NOT send additional paths, beyond the single best path
       allowed by BGP-4 [BGP4], unless the remote speaker has
       indicated its preparedness with the RM bit.
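
   The capability value of Figure 1 can be sketched in code.  The
   following is a minimal, non-normative Python illustration.  It
   assumes that bit 7 of the flags diagram (RM) is the least
   significant bit of the octet, i.e. 0x01; the function names are
   hypothetical and are not defined by this document.

```python
import struct

# Assumption: bit 0 of the diagram is the most significant bit, so the
# RM flag (bit 7) is the least significant bit of the flags octet.
RM_FLAG = 0x01


def encode_capability_value(families):
    """Pack (afi, safi, receive_multiple) tuples into the capability value."""
    out = b""
    for afi, safi, receive_multiple in families:
        flags = RM_FLAG if receive_multiple else 0  # reserved bits stay zero
        out += struct.pack("!HBB", afi, safi, flags)
    return out


def decode_capability_value(data):
    """Unpack the capability value; reserved flag bits are ignored."""
    if len(data) % 4 != 0:
        raise ValueError("capability value must be a multiple of 4 octets")
    families = []
    for offset in range(0, len(data), 4):
        afi, safi, flags = struct.unpack_from("!HBB", data, offset)
        families.append((afi, safi, bool(flags & RM_FLAG)))
    return families


def may_send_additional_paths(peer_families, afi, safi):
    """True only if the peer set the RM bit for this <AFI, SAFI>."""
    return (afi, safi, True) in peer_families
```

   For example, a speaker that wants additional paths for IPv4 unicast
   (AFI 1, SAFI 1) but not for IPv6 unicast (AFI 2, SAFI 1) would
   encode [(1, 1, True), (2, 1, False)], and its peer would send
   additional paths only for the first family.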

2.1 Multiple-Hop attribute - MULTIPLE_HOP

   This attribute is an optional, non-transitive attribute that can be
   used for advertising multiple next-hops associated with an NLRI.

   The attribute data contains one or more tuples of (AFI, SAFI, List
   of Next Hop Information), where each tuple is encoded as shown
   below:


               +------------------------------------------------+
               |      Address Family Identifier (2 octets)      |
               +------------------------------------------------+
               | Subsequent Address Family Identifier (1 octet) |
               +------------------------------------------------+
               |          Number of Next Hops (1 octet)         |
               +------------------------------------------------+
               |     Length of the First Next Hop (1 octet)     |
               +------------------------------------------------+
               |  Network Address of First Next Hop (variable)  |
               +------------------------------------------------+
               |     Length of the Second Next Hop (1 octet)    |
               +------------------------------------------------+
               |  Network Address of Second Next Hop (variable) |
               +------------------------------------------------+
               |                      . . .                     |
               |                      . . .                     |
               +------------------------------------------------+
               |      Length of the Nth Next Hop (1 octet)      |
               +------------------------------------------------+
               |   Network Address of Nth Next Hop (variable)   |
               +------------------------------------------------+

                                     Figure 2

   The various fields are defined as follows:



   Address Family Identifier: The AFI field carries the identity of
   the Network Layer protocol associated with the Network Address
   that follows.

   Subsequent Address Family Identifier: The SAFI field in
   combination with the Address Family Identifier field identifies
   the Network Layer context associated with the Network Address of
   the Next Hop(s).

   Number of Next Hops: A 1 octet field carrying the number of next
   hops encoded in this tuple, that is, the number of Multiple-Hop BGP
   routes it advertises for the given NLRI.

   Length of Nth Next Hop Network Address: A 1 octet field whose value
   expresses the length of the "Network Address of Next Hop" field as
   measured in octets.  For IPv6 routes the value shall be set to 16,
   when only a global address is present, or 32 if a link-local
   address is also included in the Next Hop field [BGP-IPv6].

   Network Address of Nth Next Hop: This is a variable length field that
   contains the Network Address of the next router on the path to the
   destination.

   The N next-hops listed in the MULTIPLE_HOP path attribute define the
   Network Layer addresses of the routers that should be used as
   next-hops to the destinations listed in the UPDATE message.
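
   The tuple layout of Figure 2 can be illustrated with a short,
   non-normative Python sketch.  It handles only the attribute data
   (not the attribute flags, type code or length), treats next-hop
   addresses as opaque octet strings, and uses hypothetical function
   names.

```python
import struct


def encode_multiple_hop_tuple(afi, safi, next_hops):
    """Pack one (AFI, SAFI, next-hop list) tuple; next_hops are bytes."""
    if len(next_hops) > 255:
        raise ValueError("Number of Next Hops is a 1-octet field")
    out = struct.pack("!HBB", afi, safi, len(next_hops))
    for hop in next_hops:
        if len(hop) > 255:
            raise ValueError("next-hop length is a 1-octet field")
        out += struct.pack("!B", len(hop)) + hop
    return out


def decode_multiple_hop(data):
    """Parse attribute data into a list of (afi, safi, [next_hop_bytes])."""
    tuples = []
    pos = 0
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError("truncated tuple header")
        afi, safi, count = struct.unpack_from("!HBB", data, pos)
        pos += 4
        hops = []
        for _ in range(count):
            if pos >= len(data):
                raise ValueError("truncated next-hop length")
            length = data[pos]
            pos += 1
            if pos + length > len(data):
                raise ValueError("truncated next-hop address")
            hops.append(data[pos:pos + length])
            pos += length
        tuples.append((afi, safi, hops))
    return tuples
```

   Encoding two IPv4 next-hops, 192.0.2.1 and 192.0.2.2, for AFI 1 and
   SAFI 1 yields a 14-octet tuple: a 4-octet header followed by two
   length-prefixed 4-octet addresses.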

3. Operation when both peers are Multiple-Hop capable

   In the following sections, "Local speaker" refers to a router which
   is advertising the BGP Multiple-Hop routes, and the "Receiving
   Speaker" refers to a router that peers with the former to accept
   multiple BGP routes for a destination.

   Consider that the Multiple-Hop Capability has been exchanged between
   the Local speaker and the Receiving speaker, and a BGP session
   between them is established.  The following sections detail the
   procedures that shall be followed by the Local speaker as well as the
   Receiving speaker once the Multiple-Hop capability has been
   exchanged, and the local speaker wants to advertise some BGP
   Multiple-Hop routes.

   Note that for operation within the confines of this document and BGP,
   the local speaker almost certainly will be acting as an eBGP route-
   server or iBGP route-reflector, with the receiver asserting the RM
   bit in the Multiple-Hop capability, and therefore acting as a client
   of that speaker.

   Other uses, such as ECMP speakers exchanging Multiple-Hop routes,
   will require further consideration.  As stated previously, those
   considerations are not per se related to the Multiple-Hop capability
   itself and are not addressed in this document.

3.1 Advertisement of Multiple-Hop BGP routes

   The extensions proposed in this draft allow BGP paths to be
   identified by their NLRI and next-hop address, rather than just by
   their NLRI.  This extended identification is indicated by the
   presence of the MULTIPLE_HOP attribute. Given that this is used when
   there are multiple paths sharing NLRI, this attribute allows for the
   representation of multiple such paths in a single advertisement.

   Thus between Multiple-Hop capable speakers, the MULTIPLE_HOP
   attribute MUST be used in addition to the existing NEXT_HOP in order
   to announce multiple next-hops for the destinations listed in the
   NLRI field of the UPDATE message.

   All prefixes announced using this attribute MUST NOT replace the
   previous advertisements and thus, multiple BGP paths for a prefix can
   be advertised by the Local Speaker. If the same prefix is later
   announced with ONLY the NEXT_HOP attribute then it MUST be taken as
   an implicit withdraw for all the previous paths advertised by that
   peer for that destination.

   Note that multiple paths may be transmitted for the same NLRI only
   when they differ in their next-hop.

   An UPDATE message which contains feasible routes and carries
   MULTIPLE_HOP and no NEXT_HOP attribute MUST NOT be considered as an
   implicit withdrawal.  The Receiving Speaker MUST append these
   routes to its Adj-RIBs-In [BGP4], as additional paths to that
   destination.

   When advertising multiple paths which do not have identical path
   attributes, separate BGP UPDATE messages MUST be sent, each with a
   MULTIPLE_HOP attribute even if there is only one next-hop in each
   MULTIPLE_HOP attribute. Presence of MULTIPLE_HOP suppresses route
   replacement at the receiving end.
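
   The rule that paths with differing path attributes go into separate
   UPDATE messages amounts to grouping next-hops by attribute set.  The
   following non-normative Python sketch shows one way to do this; the
   function name and the representation of path attributes as a
   hashable tuple are assumptions made for illustration.

```python
def group_next_hops_by_attributes(paths):
    """Group (next_hop, attrs) pairs into one next-hop list per UPDATE.

    attrs must be hashable (e.g. a tuple of path attribute values).
    Returns a list of (attrs, [next_hops]) in first-seen order; each
    entry corresponds to one UPDATE carrying one MULTIPLE_HOP attribute.
    """
    groups = {}
    for next_hop, attrs in paths:
        groups.setdefault(attrs, []).append(next_hop)
    return list(groups.items())
```

   Three paths for the same NLRI, two of which share identical
   attributes, would therefore produce two UPDATE messages: one listing
   two next-hops and one listing a single next-hop.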

3.2 Withdrawal Procedures

   An UPDATE message which contains an IP address prefix in the
   WITHDRAWN ROUTES marks all the associated routes as being no longer
   available for use.

   An UPDATE message consisting of an IP address prefix in the NLRI
   field and only the NEXT_HOP attribute implicitly withdraws all the
   routes to that address prefix and replaces them with the one
   advertised by the NEXT_HOP.


   An UPDATE message which contains an IP address prefix in the
   WITHDRAWN ROUTES field and the MULTIPLE_HOP attribute removes only
   the paths associated with the next-hops listed in that attribute.

   An UPDATE message announced with a MULTIPLE_HOP attribute for a given
   IP address prefix implicitly withdraws any previous route announced
   with the same next-hop.

3.3 Procedures for the Receiving Speaker

   The Receiving Speaker upon receiving the MULTIPLE_HOP attribute will
   understand that the Local Speaker has advertised Multiple-Hop BGP
   routes.  Within a single UPDATE message all the prefixes will have
   identical attributes, except for the next-hops, which will be carried
   in the MULTIPLE_HOP attribute.

   A series of further UPDATE messages for the same NLRI, with or
   without the same set of attributes and containing the MULTIPLE_HOP
   attribute, will be understood to be additive.  Each UPDATE message
   appends these additional feasible routes to the appropriate
   Adj-RIBs-In, after which the receiving speaker may run its normal
   decision process to select the best path to install in its Loc-RIB.

   Upon receiving an UPDATE message for the same NLRI, without the
   MULTIPLE_HOP attribute, the receiver will consider this as a
   replacement route for all the previously announced routes to that
   destination.

   If the BGP Speaker wants to withdraw all the BGP routes for a
   particular address prefix then it can send a normal BGP UPDATE
   message listing the IP address prefix in the WITHDRAWN ROUTES field.
   The Receiving Speaker upon receiving this message MUST remove all the
   routes associated with that destination.

   If the Receiving Speaker receives an UPDATE message with the
   MULTIPLE_HOP attribute listing both feasible and unfeasible
   routes, then it MUST consider the path attributes as applying to
   the feasible routes.  All the destinations listed in the WITHDRAWN
   ROUTES field MUST be removed as per [BGP4].

3.4 Working with Multiple-Hop capable IBGP peers

   This section explains how the Multiple-Hop feature works in normal
   scenarios.

   Assume that the two IBGP speakers A and B exchange this capability.
   Consider a case where A receives multiple UPDATE messages for NLRI X
   with next-hops Nj, Nk and Nl.  Assume that all these routes are valid
   and A wants to pass on this set to B.  Also assume that Nj and Nk
   share the same path attributes (Origin, AS Path, Local Pref, etc.) and
   can thus be advertised in a single UPDATE message.

   A makes an UPDATE message and uses the MULTIPLE_HOP path attribute.
   It puts the AFI, SAFI, number of next-hops as 2, length of the first
   next-hop Nj, network address of Nj, length of Nk and the network
   address of Nk.

   When this UPDATE message reaches B, it looks at the MULTIPLE_HOP
   attribute and understands that there are multiple routes to reach X.
   It inserts the two routes for X with the next-hops Nj and Nk in its
   Adj-RIBs-In.

   A also needs to announce the remaining route to X with next-hop Nl.
   It makes an UPDATE message, fills the path attributes, and uses the
   MULTIPLE_HOP attribute to encode next-hop information about Nl. This
   UPDATE message is sent to B.

   When B receives this UPDATE message, it knows that this is not a
   replacement route for X as it comes with the MULTIPLE_HOP
   attribute.  It simply appends this new route to its Adj-RIBs-In,
   runs the decision process, and proceeds as normal.

   Assume that at some point later, A needs to withdraw the route
   associated with the tuple [X, nexthop Nk]. It makes an UPDATE
   message, puts X in the WITHDRAWN ROUTES and inserts the MULTIPLE_HOP
   attribute, encoding the next-hop Nk inside.

   When B receives this UPDATE message it understands that A wants to
   remove one (or more) of the routes associated with X. To determine
   which exact route(s) needs to be removed, it looks at the
   MULTIPLE_HOP attribute and goes about removing all the routes
   associated with the next-hops listed therein.

3.5 Implicit Withdrawal for one of the Next-Hops

   In the same scenario, to replace the route associated with the tuple
   [X, next-hop Nk], A can advertise a fresh route with a new set of
   path attributes, carrying Nk in the MULTIPLE_HOP attribute.  B would
   consider the new advertisement as an implicit withdrawal of the
   previously announced route for the tuple [X, next-hop Nk].
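
   The receiver behaviour walked through in Sections 3.2 to 3.5 can be
   modelled as a table keyed by prefix and next-hop.  The following
   Python sketch is illustrative only: the class name, its methods and
   the use of plain strings for prefixes, next-hops and attributes are
   assumptions, and a real implementation would operate on parsed
   UPDATE messages.

```python
class AdjRibIn:
    """Routes learnt from one peer, keyed by prefix and then next-hop."""

    def __init__(self):
        self.routes = {}  # prefix -> {next_hop: path_attributes}

    def announce(self, prefix, attrs, next_hop=None, multiple_hops=None):
        if multiple_hops is None:
            # Only NEXT_HOP present: implicit withdraw of every earlier
            # path for this prefix, replaced by the single new route.
            self.routes[prefix] = {next_hop: attrs}
            return
        # MULTIPLE_HOP present: additive, except that a path with the
        # same next-hop is implicitly replaced.
        paths = self.routes.setdefault(prefix, {})
        for hop in multiple_hops:
            paths[hop] = attrs

    def withdraw(self, prefix, multiple_hops=None):
        if multiple_hops is None:
            # Plain withdrawal removes all routes for the prefix.
            self.routes.pop(prefix, None)
            return
        # Withdrawal with MULTIPLE_HOP removes only the listed next-hops.
        paths = self.routes.get(prefix, {})
        for hop in multiple_hops:
            paths.pop(hop, None)
        if not paths:
            self.routes.pop(prefix, None)
```

   Replaying the example with this model: announcing X with next-hops
   Nj and Nk, then X with a third next-hop, leaves three paths; a
   withdrawal of X carrying Nk leaves two; and a later announcement of
   X with only NEXT_HOP replaces whatever remains with a single path.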

4. Multiprotocol Extensions to BGP

   Since the MULTIPLE_HOP attribute includes both the AFI and SAFI, it
   is possible to advertise multiple MPBGP routes.  In this case, the
   MP_REACH_NLRI [MBGP] attribute shall carry the NLRI information and
   MULTIPLE_HOP the information about the additional next-hops.


   To suppress route replacement the additional routes must be
   advertised by keeping the length of the next-hop as 0 in the
   MP_REACH_NLRI attribute. The same should be encoded in the
   MULTIPLE_HOP attribute.

5. Security Considerations

   This extension to BGP does not change the underlying security issues
   inherent in the existing BGP.

6. Acknowledgements

   The authors would like to thank Tony Li, Arnold Nipper and Curtis
   Villamizar for their valuable comments and suggestions on the earlier
   versions of this draft from which the current work has been derived.

7. IANA Considerations

   IANA needs to assign a capability code to the Multiple-Hop capability.

8. References

8.1 Normative References

   [BGP-CAP]  Chandra, R. and J. Scudder, "Capabilities Advertisement
              with BGP-4", RFC 3392, November 2002

   [BGP4]     Rekhter, Y., Li, T., and S. Hares, "A Border Gateway
              Protocol 4 (BGP-4)", RFC 4271, January 2006.

   [RR]       Bates, T., Chen, E., and R. Chandra, "BGP Route Reflection
              - An Alternative to Full Mesh Internal BGP (IBGP)", RFC
              4456, April 2006.

   [BGP-IPv6] Marques, P. and F. Dupont, "Use of BGP-4 Multiprotocol
              Extensions for IPv6 Inter-Domain Routing", RFC 2545,
              March 1999.

   [MBGP]     Chandra, R., Rekhter, Y., Bates, T., and D. Katz,
              "Multiprotocol Extension for BGP-4",
              draft-ietf-idr-rfc2858bis-10.txt (work in progress)

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [IANA-AFI] http://www.iana.org/assignments/address-family-numbers

   [IANA-SAFI] http://www.iana.org/assignments/safi-namespace



8.2 Informative References

   [MED]      Retana, A., Walton, D., McPherson, D., and V. Gill,
              "Border Gateway Protocol (BGP) Persistent Route
              Oscillation Condition", RFC 3345, August 2002.

   [COMM]     Chandra, R., Traina, P., and T. Li, "BGP Communities
              Attribute", RFC 1997, August 1996.

9. Appendix A

   This section explains some scenarios where advertising multiple BGP
   paths may prove to be useful.

9.1 Suboptimal Routing in Route Reflector clients

   Route Reflection can result in suboptimal routing due to the client
   not having full visibility of all the BGP paths in the AS.  This is
   because the RR selects the best path and reflects only that best path
   to its clients.  In case the RR has equal cost BGP routes, then it
   shall select the one based on the lower Router ID.  As a result, the
   clients do not receive the full view of the available paths, or at
   least the paths that are equidistant from the RR.  This can result in
   suboptimal routing from the client's perspective.  A client may have
   selected a different best path if more paths had been made visible to
   it.  With Multiple-hop BGP, the RR can advertise all the equal cost
   BGP routes that it has to its client, giving the client more options
   to choose from.

   The extensions proposed in this draft allow the RR to reflect all
   the routes to its clients.

9.2 Avoiding Persistent Route Oscillations


              ----------------------------------
           /                            AS X   \
          |              -----                  |
          |            /       \                |
          |           |         |               |
          |           |   RR    |               |
          |            \       /                |
          |              -/+\-                  |
          |           c1 /   \ c2               |
          |     ----    /     \    ----         |
          |   /      \ /       \ /      \       |
          |  (  Ra    )         (   Rb   )      |
          |   \      /           \      /       |
          |     -/\--             ------        |
          |     /  \                   \        |
          |    /    \                   \       |
          \   /     \                    \      /
            --/------\--------------------\----
             /        \                    \
            /          ---------------------------
            /        /  \                 --\--    \
         --/-       |   \               /       \  |
       //    \\     |    \             |         | |
      |   R2   |    |    \             |   R3    | |
      |        |    |    -\--           \       /  |
       \\    //     |  /      \           -----    |
         ----       | |        |                   |
         AS Y       | |   R1   |                   |
                    |  \      /                    |
                    |    ----                      |
                    \                    AS Z      /
                     -----------------------------

                          Figure 3

   Consider the topology shown in Figure 3.  AS X consists of a Route
   Reflector (RR) and two clients, Ra and Rb.  Ra is connected to R2 in
   AS Y and R1 in AS Z.  Rb is connected to R3 in AS Z.  Assume that the
   Router ID of R1 is lower than that of R2 and that IGP cost c1 < c2.
   The dashed lines between the routers show BGP peering.  Assume that
   the BGP speakers in AS Y and AS Z receive a BGP UPDATE for 10.0.0.0/8
   from AS W, and that they advertise the following path attributes to
   the BGP speakers in AS X:

   R2: NLRI 10.0.0.0/8, AS_PATH Y W, MED 100, NEXT_HOP R2

   R1: NLRI 10.0.0.0/8, AS_PATH Z W, MED 300, NEXT_HOP R1

   R3: NLRI 10.0.0.0/8, AS_PATH Z W, MED 200, NEXT_HOP R3

   Scenario 1: Traditional BGP in AS X

   The following events happen:

   1. Ra receives UPDATE messages from R2 and R1.  Since they are from
      different ASes, MEDs are not compared and the tie breaks on the
      lower Router ID.  Since R1 < R2, route from R1 is selected and
      advertised to the RR.  Ra thus has the following path as the
      best one for 10.0.0.0/8:

      AS_PATH Z W, MED 300, NEXT_HOP R1

   2. Rb receives the UPDATE from R3, installs this and advertises the
      same to the RR.  Rb thus has the following path for 10.0.0.0/8:

      AS_PATH Z W, MED 200, NEXT_HOP R3

   3. RR receives two UPDATE messages from its clients. Since the
      neighboring AS is the same in both of them, the tie breaks on the
      route having the lower value of MED.  It thus selects the route it
      learns from Rb as the best one and advertises this to Ra.

   4. Ra now has all the three paths.  Route learnt from Rb wins over
      the route learnt from R1 (lower MED) and the route learnt from
      R2 wins over the route learnt from Rb (EBGP > IBGP).

   5. Ra thus sends an implicit WITHDRAW to the RR, replacing the
      earlier announcement with the route learnt from R2.

   6. RR thus has the following paths for 10.0.0.0/8:

      AS_PATH Y W, MED 100, NEXT_HOP R2
      AS_PATH Z W, MED 200, NEXT_HOP R3


      It selects the first path because the IGP cost to reach its
      NEXT_HOP (R2) is lower.  It thus advertises this path to Rb and
      sends a WITHDRAW message to Ra, removing the path it had
      initially announced (the one learnt from Rb).

   7. Ra receives the WITHDRAW message from the RR and removes the path.
      Nothing is done as it is currently not the best path.

   8. Rb receives the advertisement from RR, but doesn't do anything, as
      the path learnt from R3 is better (EBGP > IBGP).


   9. Ra at this time has only two routes.  One, learnt from R1 and the
      other learnt from R2:

      AS_PATH Z W, MED 300, NEXT_HOP R1

      AS_PATH Y W, MED 100, NEXT_HOP R2

      It has selected the route learnt from R2.  After some time, this
      router runs its scanner process for validating the NEXT_HOPs.
      There it runs the best path algorithm and finds that the route
      learnt from R1 is better than the route learnt from R2, because
      of the lower Router ID.

   10.Ra sends an implicit WITHDRAW to RR, replacing the earlier
      announcement with the route learnt from R1.

   11...

   The cycle then repeats indefinitely.

   Scenario 2: Multiple-Hop BGP is implemented in AS X

   1. If everything happens the same as in the preceding example then
      Ra will have two paths to reach 10.0.0.0/8.  Since everything
      else is the same, it will advertise both these routes to the RR.
      Note that Ra will not look at the Router ID, etc. for tie
      breaking if Multiple-Hop capabilities are implemented.

   2. RR will now have three paths for 10.0.0.0/8.  Path 3, from Rb and
      Paths 1 and 2 from Ra.

      Path 1: AS_PATH Y W, MED 100, NEXT_HOP R2

      Path 2: AS_PATH Z W, MED 300, NEXT_HOP R1

      Path 3: AS_PATH Z W, MED 200, NEXT_HOP R3

      Out of Path 2 and Path 3, it will select Path 3 (lower MED).  From
      Path 1 and Path 3, it will select Path 1, based on the lower
      IGP cost.  RR thus selects Path 1 as the best route.

   3. RR will advertise the new path to Rb. Rb will thus have the
      following two paths:

      Path 1: AS_PATH Y W, MED 100, NEXT_HOP R2

      Path 2: AS_PATH Z W, MED 200, NEXT_HOP R3



      Path 2 will win because of the EBGP > IBGP rule, and Rb will
      continue using R3.  There is thus no change on Rb.

   4. The network is stable and there are no route oscillations.
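
   The selection RR performs in step 2 can be sketched as follows.
   This is only an illustration and not part of the mechanism itself:
   the IGP cost values are hypothetical (the example states only that
   Path 1 has a lower IGP cost than Path 3 at RR), the names Path and
   select_best are invented for this sketch, and all attributes not
   shown are assumed to tie.  MED is compared only among paths
   received from the same neighbouring AS, and the survivors are then
   compared on IGP cost.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    as_path: tuple   # e.g. ("Z", "W"); first element is the neighbouring AS
    med: int
    next_hop: str
    igp_cost: int    # hypothetical cost from RR to the NEXT_HOP


def select_best(paths):
    """Pick one path: lowest MED per neighbouring AS, then lowest IGP cost."""
    by_neighbour = defaultdict(list)
    for p in paths:
        by_neighbour[p.as_path[0]].append(p)
    # MED is only comparable between paths from the same neighbouring AS.
    survivors = [min(group, key=lambda p: p.med)
                 for group in by_neighbour.values()]
    return min(survivors, key=lambda p: p.igp_cost)


paths_at_rr = [
    Path("Path 1", ("Y", "W"), 100, "R2", igp_cost=10),  # assumed cost
    Path("Path 2", ("Z", "W"), 300, "R1", igp_cost=5),   # assumed cost
    Path("Path 3", ("Z", "W"), 200, "R3", igp_cost=20),  # assumed cost
]

best = select_best(paths_at_rr)
print(best.name, best.next_hop)
```

   Because RR evaluates all three paths together in this sketch, the
   result does not depend on the order in which the paths arrived.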

9.3 eBGP mesh scaling at IXes via Route Servers

   IXes today sometimes offer their customers the facility to peer with
   a neutral IX route-server as a means to reduce the direct peering
   requirements for their customers.  The peering overhead may be
   considerable given the many hundreds of ASes which may be present at
   some of the larger IXes today, and it is quite plausible that IXes
   will continue to grow in terms of attached customers and ASes.

   However, the single-path limitation of BGP makes it operationally
   difficult for such a route-server to be effective.

   There are typically two kinds of route-server: one which is a normal
   BGP speaker and simply provides a single-best-path-for-all service,
   and one which is configured with each customer's policies and
   calculates the best-path separately for each.  Both approaches have
   their limitations:

   o  Route-servers which simply advertise the current best known IX
      path according to normal BGP procedures, without applying any
      customer-specific policy, often still require customers to
      establish direct sessions with each other wherever they wish to
      apply policy.  Much of the scaling benefit is never realised.

   o  Route-servers which apply policy on their customers' behalf,
      selecting the best-path on a per-customer basis and then
      advertising to each customer a tailor-made best-path, require
      extensive co-ordination of policy between the IX operators and
      each of their customers.  Further, it may be difficult for
      customers to keep their policies private due to the operational
      requirements of policy co-ordination between IX and customer.

   If there were a mechanism in BGP to allow an IX route-server to pass
   all other advertisements to a customer peer, without performing any
   path selection or applying any policy, then this would remove the
   need for policy co-ordination between each customer and the IX, and
   address the other shortcomings listed above.  Such a mechanism would
   be easy for both the IX operator and each customer to deploy and
   maintain.
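
   The behaviour described above can be sketched as follows.  This is
   a minimal illustration under stated assumptions, not a description
   of any existing route-server: the function advertisements_for, the
   AS numbers and the tuple representation of a path are all
   hypothetical.  The point is only that the route-server forwards
   every path learnt from the other customers and makes no selection
   of its own.

```python
def advertisements_for(customer, received):
    """Return every path learnt from the other customers, unfiltered.

    `received` maps a customer name to the list of paths it announced.
    No best-path selection and no policy is applied here.
    """
    return [
        path
        for origin, paths in received.items()
        if origin != customer
        for path in paths
    ]


# Hypothetical announcements from three IX customers: (prefix, AS_PATH).
received = {
    "AS100": [("10.0.0.0/8", "AS100 AS7")],
    "AS200": [("10.0.0.0/8", "AS200 AS7")],
    "AS300": [("192.0.2.0/24", "AS300")],
}

for path in advertisements_for("AS300", received):
    print(path)
```

   In this sketch AS300 receives both paths for 10.0.0.0/8 and is
   free to apply its own policy to choose between them.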

9.4 Advertising a subset of routes in BGP

   Providers can tag some selected routes with certain communities
   [COMM]. An administrator could write a policy that would advertise
   all the paths carrying a known community within that AS to another
   router capable of understanding the Multiple-Hop extensions.  This is
   a form of policy implementation and a detailed study of what could be
   achieved using such techniques is beyond the scope of this draft.
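
   As a rough illustration of such a policy, the sketch below selects
   every path tagged with a given community.  The dictionary layout,
   the function name paths_with_community and the community values
   are assumptions made for this example only; a real implementation
   would express this in its own policy language.

```python
def paths_with_community(paths, community):
    """Keep only the paths tagged with the given community."""
    return [p for p in paths if community in p["communities"]]


# Hypothetical paths for one prefix, with hypothetical community values.
paths = [
    {"prefix": "10.0.0.0/8", "next_hop": "R1",
     "communities": {"65000:100"}},
    {"prefix": "10.0.0.0/8", "next_hop": "R2",
     "communities": {"65000:100", "65000:200"}},
    {"prefix": "10.0.0.0/8", "next_hop": "R3",
     "communities": set()},
]

selected = paths_with_community(paths, "65000:100")
print([p["next_hop"] for p in selected])
```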

9.5 Equal Cost Multiple Path BGP

   Currently some implementations, when they receive multiple equal cost
   BGP routes from different peers, are able to insert all of them (or a
   subset of them, based on local policy) in their forwarding table to
   locally split the load for the destination, while announcing only
   one "best" BGP path to their other peers.  This however has
   implications for those other peers which receive such an announcement
   from the ECMP capable BGP speaker.  As with route aggregation, these
   other peers potentially will not possess the full path information,
   which can lead to loops.  Hence, such an ECMP capable BGP speaker
   can enable this feature only if great care is taken, if at all, or
   must act as if it had aggregated the set of routes concerned.
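
   The gap between what such a speaker forwards on and what it
   announces can be sketched as follows.  This is a simplified
   illustration: the single "cost" value stands in for the whole
   best-path comparison, the Router ID tie-break and all field names
   are assumptions of this sketch, and ecmp_split is a hypothetical
   helper rather than part of any implementation.

```python
import ipaddress


def ecmp_split(paths):
    """Return (forwarding next hops, the single path announced to peers)."""
    lowest = min(p["cost"] for p in paths)
    equal_cost = [p for p in paths if p["cost"] == lowest]
    # All equal-cost next hops are installed in the forwarding table.
    fib_next_hops = [p["next_hop"] for p in equal_cost]
    # Only one path is announced; this sketch breaks the tie on the
    # lowest Router ID (an assumption).
    announced = min(
        equal_cost, key=lambda p: ipaddress.IPv4Address(p["router_id"])
    )
    return fib_next_hops, announced


# Hypothetical paths for one destination.
paths = [
    {"next_hop": "R1", "cost": 10, "router_id": "192.0.2.2"},
    {"next_hop": "R2", "cost": 10, "router_id": "192.0.2.1"},
    {"next_hop": "R3", "cost": 20, "router_id": "192.0.2.3"},
]

fib_next_hops, announced = ecmp_split(paths)
print(fib_next_hops, announced["next_hop"])
```

   Here traffic is split over R1 and R2, yet peers learn only of the
   path via R2 and so have no knowledge of the path via R1.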

   While this document does not directly address the question of ECMP,
   the mechanism introduced can be built upon in order to do so.  It
   would be feasible to introduce additional semantics on top of the
   Multiple-Nexthop Capability so as to allow the ECMP BGP speaker to
   fully communicate the details of all the paths it is forwarding on,
   and hence allow those other peers to have full visibility of path
   information and be able to avoid selecting paths which would
   otherwise loop, while still maintaining compatibility with speakers
   not implementing ECMP and Multiple-Hop.

10. Authors' Addresses

   Manav Bhatia
   Lucent Technologies

   Email: manav@lucent.com

   Joel M. Halpern

   Email: joel@stevecrocker.com

   Paul Jakma
   Sun Microsystems

   Email: paul.jakma@sun.com

11. Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.


   Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

   Copyright Statement

   Copyright (C) The Internet Society (2006).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

   Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.





