Network Working Group                                         Alex Zinin
Internet Draft                                                Mike Shand
Expiration Date: April 2001                                Cisco Systems
File name: draft-zinin-flood-opt-01.txt                October 2000



                         Flooding optimizations
                    in link-state routing protocols

                   draft-ietf-zinin-flood-opt-01.txt


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet Drafts are working documents of the Internet Engineering
   Task Force (IETF), its Areas, and its Working Groups. Note that other
   groups may also distribute working documents as Internet Drafts.

   Internet Drafts are draft documents valid for a maximum of six
   months. Internet Drafts may be updated, replaced, or obsoleted by
   other documents at any time. It is not appropriate to use Internet
   Drafts as reference material or to cite them other than as a "working
   draft" or "work in progress".

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.


Abstract

   The flooding algorithm is one of the most important parts of any link
   state routing protocol. It ensures that all routers within a link
   state domain converge on the same topological information within a
   finite period of time. To ensure reliability, typical implementations
   of the flooding algorithm send new information via all interfaces
   other than the one the new piece of information was received on. This
   redundancy is necessary to guarantee that flooding is performed
   reliably, but implies considerable overhead of utilized bandwidth and
   CPU time if neighboring routers are connected with more than one
   link.  This document describes a method that reduces this overhead.




Zinin, Shand                                                    [Page 1]


INTERNET DRAFT           Flooding optimizations             October 2000


1 Introduction

   In order to guarantee convergence of a link state routing protocol,
   it is vital to ensure that link state PDUs (LSAs in the case of OSPF
   [Ref1] or LSPs in the case of ISIS [Ref2]) that are originated after
   the initial LSDB synchronization between neighbors is completed are
   delivered to all routers within the flooding scope limits (an area or
   the whole AS depending on the protocol and the type of the link state
   PDU).

   The method used by link state protocols to achieve this implies that
   a) PDUs are transmitted reliably between any pair of routers, and b)
   whenever a new PDU is received, it is sent across all interfaces
   other than the one it was received on (except for the case when the
   router is the DR in OSPF, where the LSA is sent back over the same
   interfaces, see [Ref1]).

   To fulfil the first requirement, link state routing protocols keep
   retransmitting new PDUs to the neighbors that have not acknowledged
   reception (the only exception is flooding performed on broadcast
   links in ISIS, see [Ref2]). As an example, in OSPF, a link state
   retransmission list is maintained for every neighbor data structure
   on every interface. When an LSA is sent through an interface, it is
   put on the retransmission list of every neighbor associated with this
   interface and is removed from it only after the neighbor has
   acknowledged reception of the LSA. Similarly, ISIS implementations
   use SRMflag and SSNflag that are interface-specific, as well as
   periodical CSNP announcements on broadcast links to ensure
   reliability of flooding.

2 Motivation

   If multiple links connect two routers, flooding of new information
   will cause considerable overhead of link bandwidth and CPU time spent
   by the protocol. Let's consider an example shown in Figure 1.


                            |                |
                            |   _____ 1 ___  |
                      -|   +--+/  .      . \+--+  |-
                       |---|R1|   :      :  |R2|--|
                      -|   +--+\_____ N ___/+--+  |-
                            |                |
                            |                |
                         Figure 1. Sample topology


   When R1 receives a new PDU from its LAN segment, it installs it in



Zinin, Shand                                                    [Page 2]


INTERNET DRAFT           Flooding optimizations             October 2000


   its LSDB and submits for flooding through all of its interfaces.
   Since flooding presumes sending the new PDU over all interfaces
   except for the one it was received on routers end up doing the
   following.

      1) R1 sends not one, but N copies of the new PDU to R2.

      2) Only the first copy of the PDU is actually installed
         in R2's LSDB, but link bandwidth and CPU cycles are
         spend to transmit and process all N copies.

      3) Furthermore, when R2 receives the first copy of the LSA
         and installs it, it floods back to R1 N-1 copies of it,
         again spending extra bandwidth and CPU time.

      4) If R1 receives an acknowledgment from R2 on some links,
         but not from others, it will keep retransmitting unacknowledged
         LSAs though they are already in R2's LSDB.

   The solution described in this document provides a technique to
   minimize the overhead that link state routing protocols cause in the
   described situation and use link bandwidth more efficiently.

   While the described optimization is generic for OSPF and ISIS, it
   becomes very important in the contents of MPLambdaS [Ref3] where OXCs
   may be connected with a large number of links. The problem is
   partially addresses by LMP [Ref4] where a single control channel is
   used for a bundle of optical links or lambdas. However, OXCs may
   still have a big amount of control channels between each other, and
   some type of OXCs may not be running LMP at all. The flooding
   optimizations described in this document ensure scalability of the
   IGP flooding algorithm in the presence of multiple links between
   neighboring routers and increasing amount of traffic engineering
   information flooded in optical and MPLS networks.


3 Proposed solution

   3.1 Introduction

   The main idea of the technique described in this document is to move
   the flooding algorithm from the per-interface to per-neighbor basis
   in a backward-compatible manner.

   The technique is generic for all protocols utilizing reliable
   flooding and is based on the observation that the ultimate goal of
   the flooding algorithm is not to send link state PDUs over all
   interfaces, but to deliver them to all routers in the network.



Zinin, Shand                                                    [Page 3]


INTERNET DRAFT           Flooding optimizations             October 2000


   To implement this optimization, it is necessary to maintain a list of
   neighbors within an area. Whenever a new neighbor is discovered on an
   interface belonging to the area, the corresponding interface neighbor
   data structure is linked to the corresponding element in the list of
   neighbors. Based on the information in the list of neighbors, as well
   as on the type of interfaces they use, interfaces within the area are
   marked either flooding-active or flooding-passive. The process of
   election of flooding-active interfaces takes into consideration the
   costs of interfaces, giving preference to faster interfaces.
   Multiaccess interfaces need special treatment, since they may be
   (usually are) associated with more than one neighbor.  However, if
   such an interface connects only two routers, it still may be marked
   as flooding-passive. Whenever the number of entries in the list or
   state of the adjacency in the list changes, the interface election
   algorithm is rerun.

   Note that since the flooding paradigm is changed from the per-
   interface to per-neighbor basis, PDU retransmission is not performed
   for a specific neighbor on a specific interface, but is instead done
   for a specific neighbor in general, and it is enough to receive a
   single acknowledgment on any interface for sending router to stop
   retransmitting.

   The asynchronous flooding algorithm is changed to first consider the
   area neighbor list and then use available physical interfaces to
   reliably deliver link state PDUs to the neighbors. Note that if more
   than one interface to a particular neighbor is marked as flooding-
   active, the flooding algorithm may perform equal- or unequal-cost
   load sharing, flooding different PDUs through different links.

   Flooding is also changed not to send PDUs to the sending neighbor via
   other links.

   The initial process of LSDB synchronization is also changed to take
   advantage of multiple links. If a new adjacency is coming up and the
   router can be sure that its LSDB is already synchronized with the
   remote router over other links, the router can speed up the adjacency
   establishment process by sending a empty (or limited-size) database
   description to the remote neighbor. Note that this speeds up the
   announcement of links that come up, since OSPF announces an adjacency
   only when it reaches Full state, i.e., when routers have synchronized
   their LSDBs. Skipping the LSDB synchronization part in ISIS does not
   speed link announcement (since ISIS announces adjacency as soon as
   two-way connectivity has been ensured), but it reduces the amount of
   time CPU spends on processing of CSNPs and LSPs.

   To illustrate the benefits of the described method, consider the
   situation where R1 in Figure 1 has 100 PDUs to flood to R2, and N



Zinin, Shand                                                    [Page 4]


INTERNET DRAFT           Flooding optimizations             October 2000


   equals 3. Without the described optimization, we would have:


       o    300 copies of PDUs going from R1 to R2

       o    300 LSDB lookups performed by R2

       o    300 acknowledgements coming back from R2

       o    300 lookups on the retransmit list or LSDB by R1 to remove
            acknowledged PDUs

       o    200 copies of PDUs coming back from R2 to R1

       o    200 LSDB lookups performed by R1

       o    200 ackowledgments going from R1 to R2

       o    200 lookups on the retransmit list or LSDB by R2 to remove
            acknowledged PDUs

   If described technique is implemented, we would have:

       o    100 copies of PDUs going from R1 to R2 possibly over dif-
            ferent interfaces for faster transmission

       o    100 LSDB lookups performed by R2

       o    100 acknowledgements coming back from R2

       o    100 lookups on the retransmit list or LSDB by R1 to remove
            acknowledged PDUs

       o    no copies of PDUs coming back from R2 to R1

       o    no LSDB lookups performed by R1

       o    no acknowledgments going from R1 to R2

       o    no lookups on the retransmit list or LSDB by R2 to remove
            acknowledged PDUs

3.2 Changes to OSPF

   3.2.1 Data structures

   Some basic modifications to OSPF data structures are necessary to
   implement the described solution.



Zinin, Shand                                                    [Page 5]


INTERNET DRAFT           Flooding optimizations             October 2000


   First of all, a new field is introduced to the area data structure,
   called NeighborList. NeighborList is a list of entries, each contain-
   ing the following fields. Note that the NeighborList entry is created
   when the first neighbor data structure for the neighbor with a par-
   ticular router ID is created within an area.

       o    NeighborID---the router ID of the neighbor connected with
            the calculating router with one or more interfaces.

       o    P2pIntList---list of interfaces that have only one fully
            established adjacency and it is established with the neigh-
            bor identified by the NeighborID (point-to-point and virtual
            links, as well as broadcast and NBMA interfaces connecting
            only two routers).

       o    P2mpIntList---list of interfaces that have more than one
            fully established adjacency and one of them is established
            with the neighbor identified by the NeighborID (apparently
            NBMA and broadcast networks).

       o    Retransmission list---list of LSAs that must be delivered to
            the remote router using available physical interfaces.

   Note that when a neighbor is reachable over multiple interfaces,
   there will be more than one entry in the above lists of interfaces.

   A new field is introduced to the interface-specific neighbor data
   structure---NeighborEntry. When an instance of the interface-specific
   neighbor data structure is created, its NeighborEntry is set to
   reference the corresponding entry in the area neighbor list.  Note
   that whenever a neighbor data structure is created for an interface,
   the P2pIntList and P2mpIntList of area neighbor data structures
   corresponding to the neighbors reachable through the same interface
   are modified. If only one neighbor data structure is available for
   the interface, the interface is put on the P2pIntList for that neigh-
   bor.  Otherwise, if more than one neighbor is known (regardless of
   the state of the neighbors), the interface is placed on the
   P2mpIntLists of all neighbors reachable through that interface. Note
   that point-to-point interfaces may temporarily have more than one
   neighbor linked to the interface data structure when the router-ID of
   the neighbor is changing. The algorithm handles such transition
   states by temporarily putting the interface on the P2mpIntList.

   Also, a new field is introduced to the interface data structure,
   called FloodingActive. If the value of this field is TRUE, the inter-
   face is used for flooding. Otherwise the interface is flooding- pas-
   sive and no LSAs are sent over it when asynchronous flooding is per-
   formed. Note that whenever an interface is put on a P2mpIntList of



Zinin, Shand                                                    [Page 6]


INTERNET DRAFT           Flooding optimizations             October 2000


   any area neighbor data structure, its FloodingActive field is always
   set to TRUE. Actually, FloodingActive field is consulted only if the
   interface is in a P2pIntList.

   Another field introduced to the interface data structure is LSASent,
   that is used by the flooding procedure (see Section 3.2.3 for more
   details).

   Whenever there is a change in the contents of the P2pIntList or
   P2mpIntList of an area neighbor data structure, the router performs
   election of flooding-active interfaces among the interfaces listed in
   the P2pIntList field.

   Below follows the algorithm describing the election process.  Note
   that this algorithm produces the minimal set of active interfaces.
   Implementations may use different algorithms, but these algorithms
   must not produce a smaller set of interfaces.

   For every entry in the area neighbor list, do the following.

       1.   If the P2mpIntList is not empty, go through all interfaces
            in the P2pIntList and mark them flooding-passive by setting
            the FloodingActive interface field to FALSE. (We always
            prefer sending LSAs to multiple neighbors simultaneously).

       2.   Otherwise, among the interfaces in the P2pIntList, set
            FloodingActive field to TRUE for those interfaces that have
            the best interface cost. Set it to FALSE for all other
            interfaces in the list.

3.2.2 Initial LSDB synchronization

   Whenever a neighbor's FSM reaches state ExStart, routers exchange
   database description packets to inform each other about the contents
   of their LSDBs. If at this moment there is at least one full adja-
   cency with the same neighbor going through another interface, the
   router can speed up establishment of the new adjacency by including
   in the DBD packets only those LSAs that are currently in the
   neighbor's retransmission list (if any).

   The implementation may also consider ignoring the contents of
   received database description packets. However, there is some proba-
   bility that already established adjacencies are not quite reliable
   and the other router is trying to deliver some LSAs and the router in
   question misses its chance to get them since it is ignoring the DBD
   contents.

   Implementations may also decide to maintain a single link state



Zinin, Shand                                                    [Page 7]


INTERNET DRAFT           Flooding optimizations             October 2000


   request list per neighbor in an area. This may be used to split the
   Loading process among several links when more than one adjacency is
   coming up simultaneously. Note that in this case, whenever the link
   state request list for a particular neighbor becomes empty, a
   LoadingDone event should be generated for all adjacencies with this
   neighbor that are currently in the Loading state.

3.2.3 Asynchronous Flooding

   Asynchronous flooding algorithm is changed as follows. Note that
   changes described below do not affect flooding back to a multiaccess
   interface if the router is the DR. The changes are only in the part
   where LSA is sent over other interfaces.

       If the flooding scope is domain-wide, perform the following for
       all areas. If the flooding scope is area-wide, do the following
       steps only for the area the interface on which the LSA was
       received belongs to.

       Consider every neighbor element in the area neighbor list as fol-
       lows.

       1)   If the value of the NeighborID field is equal to the router
            ID of neighbor that sent the LSA to the router, consider the
            next neighbor element (there is no need to send the LSA back
            to the sending router, except for the case when the receiv-
            ing router is the DR and the LSA is flooded back to a
            mutliaccess interface).

       2)   If P2mpIntList is empty, go to step 3. Otherwise do the fol-
            lowing steps

            a)   Put the LSA on the neighbor's retransmission list

            b)   Go through every interface on the P2mpIntList and do
                 the following:

                 o    Compare the LSA being flooded and the one identi-
                      fied by the LSASent field of the interface data
                      structure. If the LSAs are the same, the LSA has
                      already been sent on this interfaces and next
                      interface in P2mpIntList must be considered.

                 o    Send the LSA in a link state update packets set-
                      ting the destination address according to the
                      rules in Section 13 of [Ref1]

                 o    Set LSASent field of the interface data structure



Zinin, Shand                                                    [Page 8]


INTERNET DRAFT           Flooding optimizations             October 2000


                      to the LSA that has just been sent.

                 c)   Consider the next neighbor element in the area
                      neighbor list. (That is, skip flooding over inter-
                      faces in the P2pIntList.)

       3)   If P2pIntList is empty, consider the next neighbor element.
            Otherwise:

            a)   Put the LSA on the neighbor's retransmission list

            b)   Go through every interface on the P2pIntList and do the
                 following:

                 o    If the interface FloodingActive flag is clear,
                      skip this interface and consider the next inter-
                      face in the list.

                 o    Send the LSA in a link state update packets set-
                      ting the destination address according to the
                      rules in Section 13 of [Ref1]

            c)   Consider the next neighbor element in the area neighbor
                 list.

   Reception of OSPF acknowledgements is modified as follows.

     Whenever a link state acknowledgement is received from a neighbor,
     the corresponding entry in the area neighbor list is located and
     corresponding LSA is removed from the retransmission list.

3.2.4 Retransmitting LSAs

   The OSPF implementation should also be modified to perform
   retransmission of LSAs on a per-neighbor basis.

   Normally, the interfaces for LSA retransmission should be selected
   according to the rules used for asynchronous LSA flooding.  However,
   implementations may consider retransmitting LSAs over a bigger set of
   interfaces leading to the neighbor if the minimal interface set is
   suspected to be not sufficient (because of link load, or packet
   drops) to complete LSDB synchronization within a reasonable period of
   time.

3.2.5 Opaque LSA Support

   Described optimizations can be applied to all LSAs, including Opaque
   LSAs that have area and domain wide flooding scope (type-10 and



Zinin, Shand                                                    [Page 9]


INTERNET DRAFT           Flooding optimizations             October 2000


   type-11).  Note however, that transmission of type-9 LSAs (that have
   link-local flooding scope) should remain intact.

3.2.6 Compatibility

   The optimization described above is designed to be backward-
   compatible.  No software modification is necessary for the neighbor-
   ing routers. However, if both routers support the described modifica-
   tions, the advantages will be greater.

3.3 Changes to ISIS

   The changes for IS-IS are similar to those for OSPF.

   Each non-broadcast circuit has associated with it the system ID of
   the neighbor that is adjacent over that circuit.  At each of level 1
   and level 2, the set of one or more circuits with an adjacency at
   that level and a common neighbor is identified as a group.  SRMflags
   are associated with groups rather than circuit.  SSNflags remain
   associated with circuits.

   ISO/IEC 10589 describes the setting or clearing of SRMflag or SSNflag
   on a non-broadcast circuit for the following reasons.

     1.   SSNflag is cleared after the transmission of a PSNP over cir-
          cuit C.

     2.   A packet has been received on circuit C and a flag on circuit
          C set or cleared as a result.

     3.   A packet has been received on circuit C and the flags on all
          other circuits set or cleared as a result.

     4.   The flags on all circuits are set or cleared.

   These actions are modified as described below. In these descriptions,
   the term Sxxflag refers to either SSNflag or SRMflag.

     1.   SSNflag is cleared after the transmission of a PSNP over cir-
          cuit C.

             Clear the flag on circuit C.

     2.   A packet has been received on circuit C and an Sxxflag on cir-
          cuit C is to be set or cleared as a result. Circuit C is a
          member of group G.

          a.   If an SSNflag is to be cleared, clear ALL SSNflags for



Zinin, Shand                                                   [Page 10]


INTERNET DRAFT           Flooding optimizations             October 2000


               circuits in group G.

          b.   If an SRMflag is to be cleared, clear the SRMflag for
               group G.

          c.   If an SSNflag is to be set, set the SSNflag for circuit C
               only.

          d.   If an SRMflag is to be set, set the SRMflag for group G.

     3.   A packet has been received on circuit C and the Sxxflags on
          all other circuits are to be set or cleared as a result. Cir-
          cuit C is a member of group G.

          a.   If an SSNflag is to be cleared, clear ALL SSNflags for
               all circuits belonging to groups other than G.

          b.   If an SRMflag is to be cleared, clear ALL SRMflags for
               groups other than G.

          c.   If an SRMflag is to be set, set ALL SRMflags for groups
               other than G.

     4.   The flags on all circuits are set or cleared

          a.   If an SSNflag is to be cleared, clear ALL SSNflags for
               all circuits.

          b.   If an SRMflag is to be cleared, clear SRMflags for all
               groups.

          c.   If an SRMflag is to be set, set SRMflags for all groups.

     5    Transmitting an LSP as a result of SRMflags being set on group
          G.

               Choose ONE circuit from group G, and transmit the LSP
               over that circuit.

   Where a circuit is required to be chosen from within a group, the
   choice made is implementation dependant and may be based on any cri-
   teria, such as bandwidth or management control. The result of the
   choice MAY be different on each occasion.  Implementations may also
   decide to choose no point-to-point links if a neighboring system is
   available via a broadcast circuit, since LSPs need to be flooded
   throught it anyway. It is also possible to treat broadcast circuits
   with only two routers attached as point-to-point circuits (inlcuding
   Hello PDUs and LSDB synchronization pocess), note however that the



Zinin, Shand                                                   [Page 11]


INTERNET DRAFT           Flooding optimizations             October 2000


   routers should be explicitly configured to do so.

4 Security issues

   This document does not introduce any new security issues to ISIS or
   OSPF.

5 Acknowledgements

   The authors would like to acknowledge Tony Przygienda and Yakov
   Rekhter for their comments, and Jeff Learman for reviewing this docu-
   ment.

6 References

   [Ref1] J. Moy. OSPF version 2. Technical Report RFC 2328, Internet
          Engineering Task Force, 1998. ftp://ftp.isi.edu/in-
          notes/rfc2328.txt.

   [Ref2] ISO, "Intermediate system to Intermediate system routeing
          information exchange protocol for use in conjunction with the
          Protocol for providing the Connectionless-mode Network Service
          (ISO 8473)," ISO/IEC 10589:1992.

   [Ref3] D. Awduche, Y. Rekhter, J. Drake, R. Coltun,
          "Multi-Protocol Lambda Switching:  Combining MPLS Traffic
          Engineering Control With Optical Crossconnects", draft-
          awduche-mpls-te-optical-02.txt. Work in progress.

   [Ref4] J. Lang, et al. "Link Management Protocol (LMP)",
          draft-ietf-mpls-lmp-00.txt. Work in progress.

7 Authors' addresses

   Alex Zinin                    Mike Shand
   Cisco Systems                 Cisco Systems
   150 West Tasman Dr.           4, The Square
   San Jose,CA                   Stockley Park
   95134                         UXBRIDGE
   E-mail: azinin@cisco.com      Middlesex
                                 UB11 1BN, UK
                                 E-mail: mshand@cisco.com









Zinin, Shand                                                   [Page 12]