draft-ietf-idmr-cbt-arch-03

Inter-Domain Multicast Routing (IDMR)                    A. J. Ballardie
INTERNET-DRAFT                                 University College London
                                                      February 9th, 1996


             Core Based Trees (CBT) Multicast Architecture
                   <draft-ietf-idmr-cbt-arch-03.txt>

Status of this Memo

   This document is an Internet Draft.  Internet Drafts are working do-
   cuments of the Internet Engineering Task Force (IETF), its Areas, and
   its Working Groups. Note that other groups may also distribute work-
   ing documents as Internet Drafts).

   Internet Drafts are draft documents valid for a maximum of six
   months. Internet Drafts may be updated, replaced, or obsoleted by
   other documents at any time.  It is not appropriate to use Internet
   Drafts as reference material or to cite them other than as a "working
   draft" or "work in progress."

   Please check the I-D abstract listing contained in each Internet
   Draft directory to learn the current status of this or any other In-
   ternet Draft.


Abstract

   CBT is a new multicast architecture that builds a single, shared
   delivery tree. Traditional multicast algorithms build a source-based
   tree per sender subnetwork, as does DVMRP, the protocol currently de-
   ployed in the Internet's multicast backbone (MBONE) [1].  The primary
   advantage of the shared tree approach is that it typically offers
   more favourable scaling characteristics than do existing multicast
   algorithms.

   The CBT protocol [2] is a network layer multicast protocol that
   builds a shared delivery tree. The CBT protocol also allows for the
   incorporation of security features [2, 3], and the potential to
   reserve network resources as part of delivery tree set-up [14].

   The sending and receiving of multicast data by hosts on a subnetwork
   conforms to the traditional IP multicast service model [4]; multicast
   data on a subnetwork appears no different to that of existing
   schemes.



Expires August 9th, 1996                                        [Page 1]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


   CBT is progressing through the IDMR working group of the IETF. The
   CBT protocol is described in an accompanying document [2]. For this,
   and all IDMR-related documents, see http://www.cs.ucl.ac.uk/ietf/idmr


TABLE OF CONTENTS

  1. Background................................................... 3

  2. Introduction................................................. 4

  3. Source Based Tree Algorithms &
     the Motivation for Shared Trees.............................. 5

     3.1 Distance-Vector Multicast Algorithm...................... 5

     3.2 Link State Multicast Algorithm........................... 6

     3.3 The Motivation for Shared Trees.......................... 7

  4. CBT - The New Architecture................................... 8

     4.1 Design Requirements...................................... 8

     4.2 Architectural Overview................................... 11

     4.3 Core Selection, Placement, and Management................ 13

  5. Architectural Justification:
     Comparisons & Simulation Results............................. 15

     5.1 Traffic Concentration.................................... 19

     5.2 Shared Trees and Load Balancing.......................... 19

  6. Conclusions.................................................. 19

  Acknowledgements................................................ 20

  APPENDIX........................................................ 21








Expires August 9th, 1996                                        [Page 2]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


1.  Background

   Centre based forwarding was first described in the early 1980s by
   Wall in his PhD thesis on broadcast and selective broadcast.  At this
   time, the benefits and uses of multicast were not fully understood.
   It wasn't until much later that the IP multicast address space was
   defined (class D space), and Deering defined the IP multicast service
   model [4]. Deering's work was pioneering in that he invented algo-
   rithms that allowed hosts to arbitrarily join a multicast group (IGMP
   [5]), and enable a local multicast-capable router to become part of a
   receiver-initiated wide-area delivery tree. The latter consituted
   algorithms that build sourced-based delivery trees, with one delivery
   tree per sender subnetwork. These algorithms are documented in [4].

   The multicast-capable part of the Internet (MBONE [1]) primarily
   implements an a protocol instance of the distance-vector multicast
   algorithm [4] called DVMRP.

   Now we have several years practical experience with multicast, and a
   diversity of multicast applications, from shared workspace tools, to
   audio- and video-conferencing tools, with new applications emerging
   all the time (e.g. distributed video gaming), some of which have
   wide-ranging multicast requirements.  Furthermore, the MBONE has been
   constantly growing since the first public audiocast (audio multicast)
   of a 1992 IETF meeting [6] took place.  Then, the MBONE intercon-
   nected 40 subnets in 4 different countries.  Statistics suggest that
   the MBONE has doubled in size over the past eight months, and as of
   July 1995 comprises over 2,500 subnetworks spanning 25 countries [1].

   The obvious popularity and growth of multicast means that the scaling
   aspects of wide-area multicasting cannot be overlooked; some predic-
   tions talk of thousands of groups being present at any one time in
   the Internet.  Distributed Interactive Simulation (DIS) applications
   [15] have real-time requirements (in terms of join latency, group
   membership, group sender populations) that far exceed the require-
   ments of most applications currently in use.

   Scalability is measured in terms of network state maintenance,
   bandwidth efficiency, and protocol overhead. Other factors that can
   affect these parameters include sender set size, and wide-area dis-
   tribution of group members.







Expires August 9th, 1996                                        [Page 3]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


2.  Introduction

   Deering's algorithms build source-based multicast delivery trees,
   that is, the delivery tree is rooted at a multicast-capable router on
   the subnetwork directly attached to a sender. The IGMP protocol [5]
   operates between hosts and multicast router(s) on a subnetwork, ena-
   bling multicast routers to monitor group presence on attached sub-
   nets, so that on receipt of a multicast packet, a multicast router
   knows over which interfaces group members are located.

   Multicasting on the local subnetwork does not require either the
   presence of a multicast router or the implementation of a multicast
   algorithm; on most shared media (e.g. Ethernet), a host, which need
   not necessarily be a member, simply sends a multicast data packet,
   which is received by any member hosts. For multicasts to extend
   beyond the scope of the local subnetwork, the subnet must have a
   multicast-capable router attached, which itself is attached (possibly
   virtually) to another multicast-capable router, and so on. The col-
   lection of these (virtually) connected multicast routers forms the
   Internet's MBONE [1]. Each multicast router must implement the same
   multicast routing protocol to ensure full multicast connectivity
   (footnote).  For wide-area multicasting then, multicast routers "con-
   spire" to get multicast data to the group's receivers, wherever they
   be located.  This is the job of the multicast routing algorithm.

   Different algorithms use different techniques for establishing a dis-
   tribution tree. If we classify these algorithms into source-based
   tree algorithms and shared tree algorithms, we'll see that the dif-
   ferent classes have considerably different scaling characteristics,
   and the characteristics of the resulting trees differ too, for exam-
   ple, average delay. Let's look at source-based tree algorithms first.







_________________________
Domains that deploy different multicast  protocols  may
be  interconnected using a common multicast protocol at
domain boundaries (border routers). A special  protocol
interface  needs  to  be  implemented  in  these border
routers.




Expires August 9th, 1996                                        [Page 4]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


3.  Source-Based Tree Algorithms & the Motivation for Shared Trees

   The strategy we'll use for motivating (CBT) shared tree multicast is
   based, in part, in explaining the characteristics of source-based
   tree multicast, in particular its scalability. We then summarize the
   primary motivation for shared trees in section 3.3.

   Source-based tree multicast algorithms are often referred to as
   "dense-mode" algorithms; they assume that the receiver population
   densely populates the domain of operation, and therefore the accom-
   panying overhead (in terms of state, bandwidth usage, and/or process-
   ing costs) is justified.  Whilst this might be true of a local
   environment, it is almost never true of the wide-area environment,
   especially since wide-area group membership tends to be sparsely dis-
   tributed throughout the Internet.  There may be "pockets" of dense-
   ness, but if one views the global picture, wide-area groups tend to
   be sparsely distributed.

   Source-based multicast trees are either built by a distance-vector
   style algorithm, which may be implemented separately from the unicast
   routing algorithm (as is the case with DVMRP), or the multicast tree
   may be built using the information present in the underlying unicast
   routing table (as is the case with PIM-DM [7]). The other algorithm
   used for building source-based trees is the link-state algorithm (a
   protocol instance being M-OSPF [8]).


3.1.  Distance-Vector Multicast Algorithm

   The distance-vector multicast algorithm builds a multicast delivery
   tree using a variant of the Reverse-Path Forwarding technique [9].
   The technique basically is as follows: when a multicast router
   receives a multicast data packet, if the packet arrives on the inter-
   face used to reach the source of the packet, the packet is forwarded
   over all outgoing interfaces, except leaf subnets (footnote) with no
   members attached. Otherwise the packet is discarded.  This consti-
   tutes a "broadcast & prune" approach. Subsequently, outgoing inter-
   faces may be "pruned"; when a data packet reaches a leaf router, if
   that router has no membership registered on its directly attached
   subnetworks, the router may send a prune message one hop back towards
_________________________
A "leaf" subnet is one with no downstream  routers  at-
tached.  A  "leaf"  router is one which no other router
uses to reach the source.




Expires August 9th, 1996                                        [Page 5]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


   the source. The receiving router then checks its leaf subnets for
   group membership, and checks whether it has received a prune from all
   of its downstream routers. If the checks succeed, the router itself
   can send a prune upstream over the interface leading to the source.
   Thus, prune messages prune the tree branches not leading to group
   members. Prune messages are stored per <source, group> pair, and are
   subject to timeout, after which data flows over the branch again.  If
   there are still no members present, the pruning process repeats
   itself.  State that "times out" in this way is referred to as "soft
   state".

   It can be inferred from the above algorithm that multicast routers
   must maintain state per group per active source. This source-specific
   state manifests itself in the multicast forwarding table, where, for
   each active source, the outgoing interface list is dependent on the
   prunes received, and on the time since they were received (because
   they time out).

   As we said, most wide-area groups are likely to be sparsely distri-
   buted.  As a result, it is fair to assume that some potentially large
   number of routers in the internetwork, i.e. those not leading to
   downstream receivers, are incurred the cost of storing <source,
   group> state.

   The "broadcast & prune" nature of the distance-vector multicast algo-
   rithm, the sparseness of receivers, combined with the fact that many
   wide-area links have limited bandwidth, demonstrates that such an
   algorithm cannot scale to a large internetwork that supports large
   numbers of groups.


3.2.  Link-State Multicast Algorithm

   Routers implementing a link state algorithm periodically collect
   reachability information to their directly attached neighbours, then
   flood this throughout the routing domain in so-called link state
   update packets. Deering extended the link state algorithm for multi-
   casting by having a router additionally detect group membership
   changes on its incident links before flooding this information in
   link state packets.

   Each router then, has a complete, up-to-date image of a domain's
   topology and group membership. On receiving a multicast data packet,
   each router uses its membership and topology information to calculate
   a shortest-path tree rooted at the sender subnetwork. Provided the



Expires August 9th, 1996                                        [Page 6]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


   calculating router falls within the computed tree, it forwards the
   data packet over the interfaces defined by its calculation. Hence,
   multicast data packets only ever traverse routers leading to members,
   either directly attached, or further downstream. That is, the
   delivery tree is a true multicast tree right from the start.

   However, the flooding (reliable broadcasting) of group membership
   information is the predominant factor preventing the link state mul-
   ticast algorithm being applicable over the wide-area.  The other lim-
   iting factor is the processing cost of the Dijkstra calculation to
   compute the shortest-path tree for each active source.


3.3.  The Motivation for Shared Trees

   The algorithms described in the previous sections clearly motivate
   the need for a multicast algorithm(s) that is more scalable. CBT was
   designed primarily to address the topic of scalability; a shared tree
   architecture offers an improvement in scalability over source tree
   architectures by a factor of the number of active sources (where
   source is a subnetwork aggregate). Source trees scale O(S * G), since
   all sources use the same shared tree; shared trees eliminate the
   source (S) scaling factor, and hence a shared tree scales O(G). The
   implication of this is that applications with many active senders,
   such as distributed interactive simulation applications, and distri-
   buted video-gaming (where most receivers are also senders), scale
   considerably better if shared trees are used.

   In the table below we compare the amount of state required by CBT and
   DVMRP for different group sizes with different numbers of active
   sources:

















Expires August 9th, 1996                                        [Page 7]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996




     |--------------|---------------------------------------------------|
     |  Number of   |                |                |                 |
     |    groups    |        10      |       100      |        1000     |
     ====================================================================
     |  Group size  |                |                |                 |
     | (# members)  |        20      |       40       |         60      |
     -------------------------------------------------------------------|
     | No. of srcs  |    |     |     |    |     |     |    |     |      |
     |  per group   |10% | 50% |100% |10% | 50% |100% |10% | 50% | 100% |
     --------------------------------------------------------------------
     | No. of DVMRP |    |     |     |    |     |     |    |     |      |
     |    router    |    |     |     |    |     |     |    |     |      |
     |   entries    | 20 | 100 | 200 |400 | 2K  | 4K  | 6K | 30K | 60K  |
     --------------------------------------------------------------------
     | No. of CBT   |                |                |                 |
     |  router      |                |                |                 |
     |  entries     |       10       |       100      |       1000      |
     |------------------------------------------------------------------|

            Figure 1: Comparison of DVMRP and CBT Router State

   There are also additional significant bandwidth and state savings
   with shared trees in contrast to source trees; firstly, the tree only
   spans a group's receivers (including links/routers leading to
   receivers) -- there is no cost to routers/links in other parts of the
   network. Secondly, routers between a non-member sender and the
   delivery tree are not incurred any cost pertaining to multicast, and
   indeed, these routers need not even be multicast-capable -- packets
   from non-member senders are encapsulated and unicast towards a core
   on the tree.


4.  CBT - The New Architecture



4.1.  Design Requirements

   The CBT shared tree design was geared towards several design objec-
   tives:






Expires August 9th, 1996                                        [Page 8]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


   +    scalability

   +    routing protocol independence

   +    robustness

   +    interoperability

   As we have mentioned, a shared multicast tree improves scalability by
   an order of magnitude.

   A facet of existing source tree algorithms is that they are all tied
   inextricably, one way or another, to a particular routing protocol.
   For example, DVMRP is based heavily on RIP, and relies for its
   correct operation on some of the features of RIP (e.g. poison
   reverse). Similarly, with M-OSPF.

   With the multicast infrastructure fast converging on the unicast
   infrastructure, which is heterogeneous in nature, the extent to which
   a particular multicast algorithm can be deployed is severely limited
   if deployment is dependent on the presence of a particular unicast
   algorithm. Therefore, it makes good sense to separate out these
   dependencies from the multicast algorithm; this is exactly what CBT
   does, as does PIM [7]. The CBT protocol makes use of the underlying
   unicast routing table when building its delivery tree, independent of
   how the unicast table(s) is computed.

   Source-based tree algorithms are clearly robust; a sender simply
   sends its data, and intervening routers "conspire" to get the data
   where it needs to, creating state along the way. This is the so-
   called "data driven" approach -- there is no set-up protocol
   involved.

   It is not as easy to achieve the same degree of robustness in shared
   tree algorithms, since there is network entity involved which is the
   focal point of the tree, which effectively, keeps a shared tree fully
   connected. This entity is just a pre-nominated router, to which all
   receivers (their directly-attached routers) must explicitly join. In
   actual fact, any shared tree is likely to have several such so-called
   "cores", or "rendevous points (RPs)" to increase robustness.

   CBT and PIM diverge in their approaches to robustness; PIM attempts
   to maintain a data-driven approach, with only one RP active at any
   one time. In CBT, all cores are active. PIM's desire to emulate the
   robustness of source trees comes at a cost, especially in terms of



Expires August 9th, 1996                                        [Page 9]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


   protocol complexity, and an overall increased bandwidth requirement
   [10]. CBT, on the other hand, adopts the "hard state" approach,
   whereby a tree branch is created reflecting underlying unicast at the
   instant it is created; thereafter, the same tree branch does not
   necessarily change to reflect underlying unicast changes. This has
   positive implications for the case where a branch is created together
   with a resource reservation. The reservation remains fixed unless a
   neighbouring link/router fails, in which case there is no option but
   to rejoin the tree by means of explicit protocol, and request the
   resources again. However, the router to which a branch is rejoined
   may not be able to honour the same reservation request.

   CBT branches are torn down by means of explicit protocol messages,
   whereas PIM branches time out. PIM incurs protocol complexity to
   manage its various timers (to keep state where it is required), and
   protocol "refresh" messages consume bandwidth. CBT implements a
   "keepalive" mechanism between adjacent routers on a tree; these are
   CBT's only periodic tree maintenance messages, and they are aggre-
   gated such that there is one per link, not one per group per link.
   Both protocols reduce the frequency of tree maintenance messages as
   the number of messages increases.

   A comparison of protocol costs/state/scalability etc. is summarised
   in section 5, and is presented in detail in [10].

   With regards to interoperability, the type of multicast delivery tree
   interconnecting member hosts (subnets) over the wide-area is tran-
   sparent to those hosts; hosts send/receive multicast data in tradi-
   tional IP format, and hosts are not involved explicitly in any way
   with tree set-up. This is the sole responsibility of local multicast
   routers.

   CBT also accommodates interoperability between "clouds" or domains
   operating different multicast protocols. It is expected that intero-
   perability between different multicast protocols only be relevant at
   domain (or region, or "cloud") boundaries; inside a particular domain
   only a single multicast protocol is expected to be present. One
   method of how different trees can be "spliced" together at a domain
   boundary is presented in [11].

   Homogeneous clouds, as described in the previous paragraph, means
   that multicast data can travel in what we call "native mode", i.e.
   no encapsulation is required that keeps data, from different groups
   using different multicast protocols, separate. CBT also specifies a
   "CBT mode", where a particular cloud is heterogeneous, i.e. multiple



Expires August 9th, 1996                                       [Page 10]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


   multicast protocols are present possibly on the same subnet; CBT mode
   data is encapsulated with a CBT header to distinguish it from any
   other multicast traffic, for example, traffic that is using a source
   based tree. These two "modes" are described in the CBT specification
   document [2].


4.2.  Architectural Overview

   Shared trees were first described by Wall in his investigation into
   low-delay approaches to broadcast and selective broadcast [12]. Wall
   concluded that delay will not be minimal, as with shortest-path
   trees, but the delay can be kept within bounds that may be accept-
   able.  Simulations have recently been carried out to compare various
   scaling characteristics of centre-based and shortest-path trees.  A
   summary of these simulations can be found in section 5, and the
   details are presented in [10].

   A CBT shared tree involves having a small set of pre-nominated cores
   (routers), to which routers connected to member hosts join by means
   of an explicit protocol control message. Any core is eligible for
   joining, although only a single core is targetted by a join message.
   On receipt of a join, the core router replies with an acknowledge-
   ment, which traverses the reverse-path of the corresponding join (the
   join sets up transient state along the way). As such, tree branches
   are created, and parent-child relationships set up between adjacent
   CBT routers on the tree, the parent being the router nearer the core.
   When the ack is received, the router creates a CBT forwarding infor-
   mation base (FIB) entry, listing interfaces corresponding to a par-
   ticular group. Due to the way the tree is formed, each tree link is
   symmetric, and the tree reflects an undirected graph.

   Tree branches are made up of so-called non-core routers, which form a
   shortest forward path between a member-host's directly attached
   router, and the core. A router at the end of a branch is known as a
   leaf router on the tree.

   A diagram showing a single-core CBT tree is shown in the figure
   below. Only one core is shown to demonstrate the principle.









Expires August 9th, 1996                                       [Page 11]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996




           b      b     b-----b
            \     |     |
             \    |     |
              b---b     b------b
             /     \  /                   KEY....
            /       \/
           b         X---b-----b          X = Core
                    / \                   b = non-core router
                   /   \
                  /     \
                  b      b------b
                 / \     |
                /   \    |
               b     b   b

                      Figure 2: Single-Core CBT Tree

   Join messages do not necessarily travel all the way to the core
   before being acknowledged; if a join message hits a router on the
   tree (on-tree router) on its journey towards a core, that on-tree
   router terminates the join and acknowledges it. A router that is
   itself in the process of joining the tree (i.e. one that has not yet
   received an ack itself) can terminate and cache any subsequent
   received joins, but cannot acknowledge them until its own join has
   been successfully acknowledged.

   For simplicity, part of the CBT protocol [2] is data driven; one of a
   set of cores for a group is elected (by a group initiator) as the
   PRIMARY core, all others are termed SECONDARY cores. The ordering of
   the secondary cores is irrelevant, however, the primary must be uni-
   form across the whole group. Whenever a secondary core is joined, it
   first acks the join then proceeds to join the primary core -- the
   primary core simply listens for incoming joins, it need never send a
   join itself. In this way, a CBT tree becomes fully connected in
   response to members joining.

   The tree joining process is triggered when a subnet's CBT-capable
   router receives an IGMP group membership report [5]. If more than one
   CBT router is present on a subnetwork, only one router, the subnet's
   designated router (DR), generates a join message. A subnet's DR is
   implicitly elected (i.e. no CBT protocol involvement) as a "side-
   effect" of IGMP.




Expires August 9th, 1996                                       [Page 12]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


   CBT builds a loop-free shared tree. In some scenarios it is necessary
   to take explicit precautions against loops, in others it is not.  For
   example, a loop cannot form between a router that wishes to join the
   tree, if that router has no child tree branches (interfaces). There
   is a potential for loops if a joining router has at least one child
   attached. This scenario would constitute a rejoin. Once the rejoining
   router has received an ack, it must generate a loop detection which
   is sent over its parent interface. Receiving routers must forward
   this packet over their parent interface only, unless it is received
   by its originator, in which case a loop is present, or it is recieved
   by the primary core, in which case no loop is present. In the former
   case, the loop is broken by the originator sending a quit packet to
   its new parent, and the latter case solicits an ack from the primary
   core, confirming the absence of a loop.

   Tree maintenance takes place between adjacent on-tree routers in the
   form of protocol "keepalive" messages. Only one of these need be sent
   per link (not per group), and this message is sent in the direction
   child -> parent. The parent sends a corresponding acknowledgement.
   This is how tree connectivity is monitored, and breakages/failures
   detected.

   When a CBT router receives a multicast data packet, it simply for-
   wards it over all outgoing interfaces, as specified in its FIB entry
   for the group. Protocol mechanisms help ensure that data packets
   never loop.

   The CBT protocol is simple and lightweight. The CBT protocol specifi-
   cation is described in an separate document [2].




4.3.  Core Selection, Placement, and Management

   The issues of core (RP) selection, placement, and management are
   still under review, although several recent advancements have been
   made [13].

   Until recently, it was thought that hosts would need to be explicitly
   involved in discovering core addresses corresponding to a particular
   group. This would require host system changes, and modifications to
   applications, such that an application could request the host to
   establish a <core, group> mapping for a given group, as well as some
   sort of core advertisement protocol in which hosts and core routers



Expires August 9th, 1996                                       [Page 13]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


   participate.

   A dependency on host or application involvement with core (RP)
   discovery is therefore very undesirable. Indeed, the type of multi-
   cast tree to be joined should be invisible to hosts (and even appli-
   cations).

   A new proposal has recently emerged called Hierarchical PIM [13],
   which proposes having a multi-level hierarchy of RPs (cores), but
   which are known only to multicast routers; cores (RPs) remain invisi-
   ble to hosts. Whilst its name suggests it is specific to PIM, its
   principles are not.

   Each level of hierarchy corresponds to a multicast "scope level",
   with a certain multicast address range corresponding to each scope.
   This therefore, requires the multicast address space to be officially
   "administratively scoped"; a group initiator chooses a multicast
   address based on the perceived scope of the group. On receipt of an
   IGMP group membership report from a host, a local router (the
   lowest-level of the core (RP) hierarchy) knows, based on the group
   address, whether it has to join a core at the next level up in the
   hierarchy; each candidate RP at a particular level advertises itself
   within its own level (using a well-known scoped multicast address),
   and to the level below. Hence, an RP should always know how to reach
   an RP at the next level up in the hierarchy. A next-level RP is
   chosen by using a hash function across the group address.

   The lower the hierarchy level, the more densely RPs populate that
   level. However, at the top level (for globally-scoped groups) it is
   expected there will be sufficient RP distribution so that shared tree
   paths (i.e. delay) is reasonably efficient.

   It is expected that there will probably be six or seven levels of
   hierarchy (local, region, country, continent, etc.), with the candi-
   date RPs changing relatively infrequently (say every 24 hrs), with
   the candidate RP announcements being made more frequently, e.g.
   every few minutes.

   The finer details of this scheme have still to be worked out, but due
   to the significant advantage of being host-transparent, it is highly
   likely that this RP/Core selection/placement/management scheme will
   be adopted (footnote).
_________________________
This work is currently progressing under  the  auspices
of the IDMR working group of the IETF.



Expires August 9th, 1996                                       [Page 14]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


5.  Architectural Justification: Comparisons & Simulation Results

   This majority of this section summarises the results of in-depth
   simulations carried out by Harris Corp., USA, investigating the per-
   formance and resource cost comparisons of multicast algorithms for
   distributed interactive simulation applications [10].  More pre-
   cisely, the report summarises the work on the Real-time Information
   Transfer & Networking (RITN) program, comparing the cost and perfor-
   mance of PIM [7] and CBT [2] in a DIS environment. As we said ear-
   lier, DIS applications have wide-ranging requirements. We feel it is
   important to take into account a wide range of requirements so that
   future applications can be accommodated with ease; all other multi-
   cast architectures are tailored to the requirements of applications
   in the current Internet, particularly audio and video applications.
   Figure 3 shows a comparison of application requirements [10].

   We also present results into the study of whether source-based trees
   or shared trees are the best choice for multicasting in the RITN pro-
   gram. In the study of shortest-path trees (SPTs) vs. shared trees,
   PIM Dense-Mode and PIM-SM with SPTs were used as SPTs, with CBT and
   PIM-SM used as shared trees. This section assumes the reader is fami-
   liar with the different modes of PIM [7].


























Expires August 9th, 1996                                       [Page 15]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996




     |--------------|----------------------------------------------------|
     |              |                    Paradigm                        |
     |              |----------------------------------------------------|
     | Requirement  |             |  audio/video    |    audio/video     |
     |              |    DIS      | lecture dist'n  |     conference     |
     =====================================================================
     |   senders    |    many     |  small number   |   small number     |
     ---------------------------------------------------------------------
     |  receivers   |also senders |  many recvrs    |  also senders      |
     ---------------------------------------------------------------------
     | no. of grps  |             |                 |                    |
     | per appl'n   |    many     | one or few      |  one or few        |
     ---------------------------------------------------------------------
     | Data tx      |  real time  |  real time      |  real time         |
     ---------------------------------------------------------------------
     | e2e delay    |    small    |  non-critical   |   moderate         |
     ---------------------------------------------------------------------
     |  set up      |  real time  | non real time   |   non real time    |
     ---------------------------------------------------------------------
     | join/leave   | rapid for   | rapid for       | can be rapid for   |
     | dynamics     | participants| receivers       |  participants      |
     ---------------------------------------------------------------------
     |              | must be     | must be scalable|                    |
     | scalability  | scalable to | to large n/ws   |   need scalability |
     |              | large n/ws &| and large nos   |   large n/ws       |
     |              | large grps, | of recvrs per   |                    |
     |              | with large  | group           |                    |
     |              | nos senders |                 |                    |
     |              | & recvrs per|                 |                    |
     |              | group       |                 |                    |
     ---------------------------------------------------------------------
     |              | based upon  |                 |                    |
     | multicast    | the DIS     |  rooted on src  | incl participants  |
     |   tree       | virtual     |  and includes   | and can slowly move|
     |              | space, with |  current recvrs | over phys topology |
     |              | rapid mvmt  |                 |                    |
     |              | over phys   |                 |                    |
     |              | topology    |                 |                    |
     ---------------------------------------------------------------------

              Figure 3: Comparison of Multicast Requirements





Expires August 9th, 1996                                       [Page 16]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


   The following criteria were considered in the simulations:

   +    end-to-end delay

   +    group join time

   +    scalability (all participants are both senders & receivers)

   +    bandwidth utilization

   +    overhead traffic

   +    protocol complexity

   A brief summary of the the results of the evaluation follow. For a
   detailed description of the simulations and test environments, refer
   to [10].

   +    End-to-end delay. It was shown that PIM-DM and PIM-SM with SPTs
        deliver packets between 13% and 31% faster than CBT. PIM-SM has
        about the same delay characteristics as CBT. Processing delay
        was not taken into account here, and this stands in PIM's
        favour, since PIM routers are likely to have much larger routing
        tables, and thus, much greater search times.

   +    Join time. There was very little difference between any of the
        algorithms.

   +    Bandwidth efficiency. It was shown that PIM-SM with shared
        trees, and PIM-SM with SPTs both required only about 4% more
        bandwidth in total, to deliver data to hosts. PIM-DM however, is
        very bandwidth inefficient, but this improves greatly as the
        network becomes densely populated with group receivers.

   +    Overhead comparisons. CBT exhibited the lowest overhead percen-
        tage, even less than PIM-SM with shared trees. PIM-DM was shown
        to have more than double the overhead of PIM-SM with SPTs due to
        the periodic broadcasting & pruning.

        The Harris simulations [10] measured the average number of rout-
        ing table entries required at each router for CBT, PIM-DM, PIM-
        SM with SPTs, and PIM-SM with shared trees. The following param-
        eters were used in the simulations:

         +  N = number of active multicast groups in the network.



Expires August 9th, 1996                                       [Page 17]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


         +  n = average number of senders to a group.

         +  b = fraction of groups moving to source tree in PIM-SM.

         +  c = average percentage of routers on a shared tree for a
         group (or source tree for a particular sender).

         +  s = average percentage of routers on a source-based tree for
         a group (but not on shared tree).

         +  d = average number of hops from a host to the RP.

         +  r = total number of routers in the network.

         The following results were calculated, given b = 1, c = 0.5,
         s = 0.25, and d/r = 0.05.


           |--------------|----------------------------------------------------|
           |              |               Group size parameters                |
           |              |----------------------------------------------------|
           |   Protocol   |   N = 1000  | N = 1000  | N = 20,000  | N = 20,000 |
           |              |    n = 10   |  n = 200  |   n = 10    |   n = 200  |
           =====================================================================
           |              |             |           |             |            |
           |     CBT      |     500     |    500    |   10,000    |   10,000   |
           ---------------------------------------------------------------------
           |              |             |           |             |            |
           |  PIM-Dense   |   10,000    |  200,000  |   200,000   |  4,000,000 |
           ---------------------------------------------------------------------
           |  PIM-Sparse  |             |           |             |            |
           |   with SPT   |    8000     |  150,500  |   160,000   |  3,010,000 |
           ---------------------------------------------------------------------
           | PIM-Sparse,  |             |           |             |            |
           | shared tree  |    1000     |   1,500   |   20,000    |   210,000  |
           ---------------------------------------------------------------------

               Figure 4: Comparison of Router State Requirements

    +    Complexity comparisons. Protocol complexity, protocol traffic
         overhead, and routing table size were considered here. CBT was
         found to be considerably simpler than all other schemes, on all
         counts (footnote).
_________________________
The comparisons carried out were subjective.



Expires August 9th, 1996                                       [Page 18]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


5.1.  Traffic Concentration

   One of the criticisms leveled at shared trees is that traffic is
   aggregated onto a smaller subset of network links, than with source
   tree protocols.

   It has been shown that if shared trees, spanning a large number of
   receivers, with some (not insignificant proportion of these being
   also senders), if such a shared tree has only a small number of cores
   (RPs), traffic concentration is a problem on the core incident links,
   and for the core itself. The Harris simulation results [10] suggest
   increasing the number of cores as group size increases, thereby
   largely eliminating the problem.

5.2.  Shared Trees and Load Balancing

   An important question raised in the SPT vs. CBT debate is: how effec-
   tively can load sharing be achieved by the different schemes? The
   scope of ability for DVMRP to do load-sharing is limited primarily by
   its forwarding algorithm (RPF); this allows for only single-path
   routing.

   With shared tree schemes however, each receiver can choose which of
   the small selection of cores it wishes to join. Cores and on-tree
   nodes can be configured to accept only a certain number of joins,
   forcing a receiver to join via a different path. This flexibility
   gives shared tree schemes the ability to achieve load balancing.

   In general, spread over all groups, CBT has the ability to randomize
   the group set over different trees (spanning different links around
   the centre of the network), something that would not seem possible
   under SPT schemes.

6.  Conclusions

   In comparing CBT with the other shared tree architecture, PIM, CBT
   was found to be more favourable in terms of scalability, complexity,
   and overhead. Other characteristics were found to be very similar.

   When comparing SPTs with shared trees, we find that the end-to-end
   delays of shared trees are usually acceptable, and can be improved by
   strategic core placement. Routing table size is another important
   factor in support of shared trees, as figures 1 and 4 clearly illus-
   trate.




Expires August 9th, 1996                                       [Page 19]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


   Shared trees also have more potential to load balance traffic across
   different links or trees.  In the absence of multi-path multicast
   routing, this does not seem possible with source-based tree tech-
   niques.



   Acknowledgements

   Special thanks goes to Paul Francis, NTT Japan, for the original
   brainstorming sessions that brought about this work.

   Thanks also to Bibb Cain et al. (Harris Corporation) for allowing the
   publication of their simulation results, and the duplication of fig-
   ures 3 and 4.

   Ken Carlberg (SAIC) reviewed the text, and provided useful feedback
   and suggestions.

   I would also like to thank the participants of the IETF IDMR working
   group meetings for their general constructive comments and sugges-
   tions since the inception of CBT.


























Expires August 9th, 1996                                       [Page 20]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


   APPENDIX

   The following formulae were used by Harris Corp. in calculating the
   values in figure 4. The meaning of the formulae arguments precedes
   figure 4.

   +    average no. of entries in each PIM-DM router is approximated by:

        T(PIM-DM) ~ N * n

   +    average no. of entries a router maintains is the likelihood, c,
        that the router will be on a shared tree, times the total
        number, N, of shared trees:

        T(CBT) = N * c

   +    average no. of entries a router maintains due to each src based
        tree is the average no. of hops, d, from a host to the RP,
        divided by the number of routers, r, in the network:

        T(PIM-SM, shared tree) = N * c + N * n * d/r

   +    average no. of routing table entries for PIM-SM with some groups
        setting up source-based trees:

        T(PIM, SPT) = N * [B * n + 1] * c + b * n * s

   For more detailed information on these formulae, refer to [10].


References

  [1] MBONE, The Multicast BackbONE; M. Macedonia and D. Brutzman;
  available from http://www.cs.ucl.ac.uk/mice/mbone_review.html.

  [2] Core Based Trees (CBT) Multicast: Protocol Specification; A. J.
  Ballardie, N. Jain, and S. Reeve;
  ftp://cs.ucl.ac.uk/darpa/IDMR/draft-ietf-idmr-cbt-spec-04.txt.

  [3] A. J. Ballardie. Scalable Multicast Key Distribution
  (ftp://cs.ucl.ac.uk/darpa/IDMR/draft-ietf-idmr-mkd-01.{ps,txt}). Work-
  ing draft, 1995.

  [4] DVMRP. Described in "Multicast Routing in a Datagram Internet-
  work", S. Deering, PhD Thesis, 1990. Available via anonymous ftp from:



Expires August 9th, 1996                                       [Page 21]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


  gregorio.stanford.edu:vmtp/sd-thesis.ps.

  [5] W. Fenner. Internet Group Management Protocol, version 2 (IGMPv2),
  (draft-idmr-igmp-v2-01.txt). Working draft.

  [6] First IETF Internet Audiocast; S. Casner, S. Deering; In proceed-
  ings of ACM Sigcomm'92.

  [7] D. Estrin et al. USC/ISI, Work in progress.
  (http://netweb.usc.edu/pim/).

  [8] J. Moy. Multicast Routing Extensions to OSPF. Communications of
  the ACM, 37(8): 61-66, August 1994.

  [9] Y.K. Dalal and R.M. Metcalfe. Reverse path forwarding of  broad-
  cast packets. Communications of the ACM, 21(12):1040--1048, 1978.

  [10] T. Billhartz, J. Bibb Cain, E. Farrey-Goudreau, and D. Feig.
  Performance and Resource Cost Comparisons of Multicast Routing Algo-
  rithms for Distributed Interactive Simulation Applications; a report
  prepared for NRL ****, July 1995.

  [11] A. J. Ballardie. Defining a Level-1/Level-2 Interface in the
  Presence of Multiple Protocols. Working draft, January 1996.
  ftp://cs.ucl.ac.uk/darpa/IDMR/draft-ietf-idmr-hier-intf-00.txt

  [12] D. Wall. Mechanisms for Broadcast and Selective Broadcast. PhD
  thesis, Stanford University, June 1980. Technical Report #90.

  [13] M. Handley, J. Crowcroft, I. Wakeman. Hierarchical Rendezvous
  Point proposal, work in progress.
  (http://www.cs.ucl.ac.uk/staff/M.Handley/hpim.ps) and
  (ftp://cs.ucl.ac.uk/darpa/IDMR/IETF-DEC95/hpim-slides.ps).

  [14] A. J. Ballardie. "A New Approach to Multicast Communication in a
  Datagram Internetwork", PhD Thesis, 1995. Available via anonymous ftp
  from: cs.ucl.ac.uk:darpa/IDMR/ballardie-thesis.ps.Z.

  [15] D. J. Hook, J. 0. Calvin, M. K. Newton, D. A. Fuscio. "An
  Approach to DIS Scalability", Eleventh DIS Workshop, Orlando, Florida,
  Sept. 26th-30th, 1994, pp 94-141.







Expires August 9th, 1996                                       [Page 22]


INTERNET-DRAFT         CBT Multicast Architecture         February 1996


Author's Address:

   Tony Ballardie,
   Department of Computer Science,
   University College London,
   Gower Street,
   London, WC1E 6BT,
   ENGLAND, U.K.

   Tel: ++44 (0)71 419 3462
   e-mail: A.Ballardie@cs.ucl.ac.uk





































Expires August 9th, 1996                                       [Page 23]

Document	Document type	This is an older version of an Internet-Draft that was ultimately published as RFC 2201. Expired & archived
	Select version	03 04 RFC 2201
	Compare versions
	Author
	RFC stream
	Other formats	bibtex bibxml
	Additional resources	Mailing list discussion