RMT Working Group Dah Ming Chiu, CUHK
Internet Engineering Task Force Seok Joo Koh, ETRI
Category: Informational Miriam Kadansky, Sun Microsystems
December 2003 Brian Whetten, Consultant
Expires June 2004 Gursel Taskale, Tibco
Tree Auto-Configuration (TREE) Building Block
for Reliable Multicast Transport
<draft-sjkoh-rmt-bb-tree-config-03.txt>
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC 2026.
Internet-Drafts are valid for a maximum of six months and may be
updated, replaced, or obsoleted by other documents at any time. It
is inappropriate to use Internet-Drafts as reference material or to
cite them other than as a "work in progress".
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
This document defines the TREE building block (BB) for tree
configuration within a reliable multicast transport (RMT) protocol.
As an RMT building block, the TREE BB is a coarse-grained modular
component that may be common to multiple RMT protocols. The TREE
BB provides automatic configuration of a logical ACK-tree for RMT
protocols in the tree-based ACK family. A companion document
[RFCxxxx] describes the TRACK (Tree-based ACK) building block,
which assumes that the logical ACK tree has been configured by the
TREE BB described here.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 1]
RMT BB: Tree Auto-Configuration (TREE) December 2003
Table of Contents
1. Introduction..................................................3
2. Terminology...................................................4
3. TREE BB Rationale.............................................5
4. Functionality of TREE BB......................................5
4.1 Assumptions and Requirements..............................5
4.2 Functional Overview.......................................6
5. BB Details: Session Announcement..............................9
6. BB Details: Repair Head Discovery and Selection...............9
6.1 Repair Head Discovery Algorithms..........................9
6.2 Distance Measurement.....................................16
6.3 Repair Head Selection....................................18
7. BB Details: Tree Formation...................................19
8. BB Details: Tree Maintenance.................................21
8.1 Monitoring Parents and Children..........................21
8.2 Optimizing the Tree......................................21
9. External Interfaces..........................................22
9.1 Interfaces to this BB....................................22
9.2 Interfaces from this BB to the PI or a higher-level BB...23
10. Applicability Statement.....................................24
11. Messages for TREE BB........................................25
12. Requirements from other Building Blocks.....................26
13. Security Considerations.....................................26
14. References..................................................27
Acknowledgments.................................................27
Author's Addresses..............................................28
Chiu, Koh, Kadansky, Whetten, Taskale [Page 2]
RMT BB: Tree Auto-Configuration (TREE) December 2003
1. Introduction
The Reliable Multicast Transport (RMT) working group was chartered
to standardize IP multicast transport services [RFC2887]. Rather
than create a set of monolithic protocol specifications, the RMT WG
has chosen to break the reliable multicast protocols into Building
Blocks (BB) and Protocol Instantiations (PI). A Building Block is
a specification of the algorithms of a single component, with an
abstract interface to other BBs and PIs. A PI combines a set of
BBs, adds in the additional required functionality not specified in
any BB, and specifies the specific instantiation of the protocol.
This document describes the Tree auto-configuration building block,
or TREE BB. The function of the TREE BB is to construct a session
tree, comprised of a single sender, repair heads, and receivers,
for use by a tree-based ACK reliable delivery protocol, for example
the Tree-Based ACK (TRACK) BB [RFCxxxx]. The design goals of the
TREE BB are motivated by the TRACK BB, but trees it constructs
could be useful for other functions.
The process of session tree construction for IP multicast is
difficult. The best session trees match the underlying multicast
routing tree topology; however, the multicast service model does
not provide explicit support for discovering routing tree topology.
There are several viable solutions for constructing a tree,
depending on network conditions. The optimality of a tree may
depend on a variety of factors, such as the need for load balancing
and the need to minimize the tree depth used for collecting
feedback information.
The TREE building block specifies a distributed procedure for
automatically constructing a tree that is loop-free and as
efficient as possible, given the information available. It
provides a unified solution for tree construction in the presence
of different multicast service models and routing protocols. In
particular, it specifies a single procedure which may be used with
various techniques for repair head discovery and distance
measurements, several of which are specified within this document.
The difference in these techniques primarily affects the optimality
of the tree.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 3]
RMT BB: Tree Auto-Configuration (TREE) December 2003
2. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
this document are to be interpreted as described in RFC 2119.
In addition, the following terms are applied in this document as
well as the TRACK BB document [RFCxxxx].
Session
A session is used to distribute data over a multicast address. A
Session Tree is used to provide reliability and feedback
services for a session.
Session Identifier
A fixed-size number, chosen either by the application that
creates the session or by the transport. Senders and receivers
use the session Identifier to distinguish sessions.
Repair Head (RH)
A node within the tree which receives and retransmits data, and
aggregates and forwards control information toward the sender.
The sender operates as the root repair head in any session tree.
An RH that has accepted children for a session is called a
parent.
Session Tree (ST)
The session tree is a tree spanning all receivers of a multicast
session. It is rooted at the sender, consisting of zero of more
repair heads as interior nodes, and zero or more receivers as
leaf nodes.
Parent
A parent is an RH or receiver's predecessor in the ST on the
path toward the sender. Every RH or receiver on the tree except
the sender itself has a parent. Each parent communicates with
its children using either an assigned multicast address or
through unicast.
Children
The set of receivers and RHs for which an RH or the sender is
providing repair and feedback services.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 4]
RMT BB: Tree Auto-Configuration (TREE) December 2003
3. TREE BB Rationale
In order to accommodate various multicast deployments, this
document divides the tree building process into the following major
components:
A. Several techniques for discovering neighboring repair heads:
static, expanding ring search, and mesh. Discovering
neighboring repair heads is a necessary condition for getting
connected, so each node in the session must implement at least
one of the above techniques.
B. Several algorithms for selecting neighboring repair heads: the
measurement and use of neighbor distance and sender distance
are described. These repair head selection algorithms help
produce a good tree.
C. A single distributed procedure for construction and
maintenance of loop-free session Trees.
4. Functionality of TREE BB
4.1 Assumptions and Requirements
This TREE BB is designed based on the recommendations for tree
configuration (Section 4.5 of RFC 3048, as also shown below), along
with RFC 2357, RFC 2887 and RFC 3269.
It has been shown that the scalability of RM protocols can be
greatly enhanced by the insertion of some kind of retransmission
or feedback aggregation agents between the sender and receivers.
These agents are then used to form a tree with the sender at the
root, the receivers at the leaves of the tree, and the
aggregation/local repair nodes in the middle. The internal
nodes can either be dedicated software for this task, or they
may be receivers that are performing dual duty.
The effectiveness of these agents to assist in the delivery of
data is highly dependent upon how well the logical tree they use
to communicate matches the underlying routing topology. The
purpose of this building block would be to construct and manage
the logical tree connecting the agents. Ideally, this building
block would perform these functions in a manner that adapts to
changes in session membership, routing topology, and network
availability.
This TREE BB document describes how to build trees under the
following conditions:
Chiu, Koh, Kadansky, Whetten, Taskale [Page 5]
RMT BB: Tree Auto-Configuration (TREE) December 2003
a. The multicast group has only a single sender.
b. A single RH can serve multiple sessions.
c. Sessions can take advantage of a pre-deployed infrastructure
of RHs (ones that are not necessarily aware of a session
before the receivers), or recruit receivers to be RHs.
d. Generic Router Assist and Expanding Ring Search are not
required of the network infrastructure, but if available they
should be able to be utilized.
For tree configuration, the following are specifically required:
a. While the tree building process may take advantage of
information from the routing layer, the mechanisms described
are independent of the routing protocol(s) used by the
underlying multicast tree.
b. All trees constructed must be loop-free
c. These mechanisms must support late joiners and tree
optimization
4.2 Functional Overview
Session trees are spanning trees rooted at the sender. Session
trees can be used for forwarding control information (i.e. ACKs)
towards the root, or for forwarding data (i.e. repairs) towards the
leaf nodes.
Figure 1 illustrates overall procedures for tree configuration.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 6]
RMT BB: Tree Auto-Configuration (TREE) December 2003
+--------------------+
| 1. Session |
| Advertisement |
+--------------------+
|
| Node receives tree-building parameters
V
+--------------------+
| 2. Measurements |<-------------------------|
| to the sender | |
| (optional) | |
+--------------------+ |
| |
| |
V |
+--------------------+ |
| 3. Neighbor | |
| Discovery | |
+--------------------+ |
| |
| |
V |
+--------------------+ |
| 4. Repair Head | |
| Selection | |
+--------------------+ |
| |
| Node picks best neighbor |
V |
+--------------------+ |
| 5. Binding to |---------------------------
| Repair Head | 6. Optimization (optional)
| | and fault recovery
+--------------------+
Figure 1. Generic procedures for tree configuration
Session trees are constructed per sender; each node wishing to join
the tree discovers its neighboring RHs and then selects its best
parent based on locally available information, such as the relative
sender distances and neighbor distances. This document specifies
several techniques for measuring distances.
It is important to note that RHs may be actual receivers (e.g.
receivers willing and able to also function as RHs) or pre-deployed
"specialized" servers that are signaled to join the tree by
Chiu, Koh, Kadansky, Whetten, Taskale [Page 7]
RMT BB: Tree Auto-Configuration (TREE) December 2003
receivers. We use the term repair head to refer to either a
receiver or server that is participating as part of the logical
tree formation.
Tree construction, regardless of RH discovery and selection
algorithm, proceeds generically as follows.
4.2.1 Session Announcement
Receivers of a session use standard out-of-band mechanisms for
discovery of a session's existence (e.g. session advertisement in
RFC 2974). In this way, a receiver discovers the multicast group
address, the sender's address, and other information necessary for
logical tree construction.
Sessions may be announced in two parts, the first part containing
generic information about the session, such as the multicast
address, and the second part, announced on the multicast address,
containing additional information.
4.2.2 Measurements to the sender (optional)
All RHs and receivers that know about the session optionally
determine their distance to the sender.
4.2.3 Neighbor Discovery
Meanwhile, each receiver discovers nearby RHs (candidate parents)
for the session using the neighbor discovery algorithm(s).
4.2.4 Repair Head Selection
Once a receiver (or RH needing to join the tree) discovers a nearby
RH, it obtains the RH's distance to the sender as well as the RH's
distance to the receiver and other suitability values, if available.
After discovery is complete, the best RH is selected.
4.2.5 Binding to Repair Head
The receiver or RH then binds to the chosen RH. If a bind is
unsuccessful, the receiver or RH retries with another nearby RH, or
starts the discovery process all over again. Once an RH receives
a bind from a child, that RH then joins the tree if it has not
already, discovering an RH of its own, possibly using a different
method than leaf receivers.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 8]
RMT BB: Tree Auto-Configuration (TREE) December 2003
4.2.6 Optimization (optional) and Fault Recovery
During a session, a receiver or RH may change to a different RH for
a number of reasons described below, including fault tolerance.
The session Tree is maintained and optimized over time.
5. BB Details: Session Announcement
The first step in the tree-building process is for a node to be
informed of the session's existence. This can be done using some
out-of-band method, such as Session Advertisement [RFC2974], URL,
e-mail, etc.
RHs do not necessarily receive these advertisements. If an RH is
not a receiver, it obtains the advertisement information once it is
contacted by a receiver.
The advertisement includes the multicast address being used, the
sender's address, the session Identifier, any specific port numbers
to be used, and any global information useful for tree construction.
The advertisement may also contain information about one or more
repair head Discovery options (such as Static, ERS, and Mesh) that
can possibly be used by receivers in the session.
6. BB Details: Repair Head Discovery and Selection
Discovery is the process by which a node determines a suitable
repair head. During the discovery process, suitable neighbors are
found, sender distances are optionally exchanged, and the best RH
is selected.
6.1 Repair Head Discovery Algorithms
This draft describes three algorithms for discovering RHs: Static,
Expanding Ring Search (ERS), and Mesh.
Multiple algorithms may be used within a single session. For
example, RHs may use the Mesh algorithm, while the receivers use
static configuration to discover the RHs; alternatively, some
receivers may use static configuration while other receivers depend
on ERS (in an intranet where ERS is available). Each receiver may
pre-configure which algorithm to use before it starts.
The transport protocols request the following information from this
BB using the getRHs interface.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 9]
RMT BB: Tree Auto-Configuration (TREE) December 2003
Repair Heads:
1. parentAddress: the address and port of the parent node to
which the node should connect
2. UDPListenPort: the UDP port number on which the node will
listen for its children's control messages
3. RepairAddr: the multicast address, UDP port, and TTL on
which this node sends control messages to its
children.
Receivers:
1. parentAddress.
Senders:
1. UDPListenPort
2. RepairAddr
After the above information is obtained from tree-configuration,
the transport protocol may perform the necessary Bind operations
for participating in the session tree.
Tree configuration may rely on a functional entity, named Tree
Configurator (TC), which is pre-configured by sender for tree
configuration. TC may be simply implemented by a program and thus
installed at sender or any other host. It is not necessarily
specialized infrastructure.
6.1.1 Static
In the static scheme, TC is used to govern the tree building based
on its own (session-specific) tree configuration and RH(s)
selection rules for the new joiners.
If a TC is used for tree building, its address and port are
included with the session advertisement. Receivers and RHs will
realize there is a TC for the session via session Announcement, and
they can contact with the TC to get a list of candidate RHs by
sending a unicast Query message.
In response to a Query message, TC replies with an advertisement
message that contains a list of candidate RHs available to the new
joiner. The rule of determining such candidate RHs may depend on
the pre-configured mechanism taken by TC. For example, TC may
determine the candidate RH list for a node among the possible RHs
(it contains at that time) by considering which RHs are in the same
Chiu, Koh, Kadansky, Whetten, Taskale [Page 10]
RMT BB: Tree Auto-Configuration (TREE) December 2003
network domain with the node (i.e., via comparing their network
prefix), or by considering the load balancing for the tree topology
it has configured until then. For this purpose, TC maintains a pool
of active RHs for the session. The list of candidate RHs carried by
Advertise message is ordered in decreasing levels of preference, in
which a lower number represents a higher preference.
When a receiver node receives the responding advertisement message
from TC, the node MAY proceed to try to bind to a candidate RH
following the given order by sending a BindRequestmessage, and then
waits for the responding message such as BindConfirm or BindReject
from TC. These binding steps will be done according to the TRACK
mechanism, as described in the TRACK BB.
In the static algorithm, RH discovery with a TC proceeds as
follows:
1. The node joins the multicast session and learns of the
location of TC (from the session announcement).
2. The node optionally discovers its distance from the sender.
Any metric described in Section 4.2 may be used.
3. The node sends a Query message to the TC for a parent. The
request includes (optionally) the node's distance to the
sender and whether the node functions as an RH or not.
4. The TC chooses one or more candidate parents (RHs) for the
node from the active RHs by its own tree configuration rule.
The selection of candidate parents may be done by comparing
the network prefix or by referring to any other information
such as the number of currently attached children, etc.
5. The TC responds to the Query message with an Advertise
message, which include the candidate parent list. In the
list, each entry contains the corresponding IP address and
port of an RH.
All the entries in the list need to be arranged in the
decreasing order of preference levels.
In the rejection case, the Advertise message does not include
any candidate parent. In this case, the node may resort to
the other mechanism such as ERS and Mesh.
In the success case, the node will be enrolled as an active
RH by the TC, if it functions as an RH in the session. Each
active RH (functioning as a parent for the session Tree)
sends the TC the periodic Query messages with a flag
Chiu, Koh, Kadansky, Whetten, Taskale [Page 11]
RMT BB: Tree Auto-Configuration (TREE) December 2003
indicating that it is active over a specific time interval.
Based on the Query messages, the TC updates a pool of active
RHs in the session. In response to the Query message, the TC
sends an Advertise with a flag simply indicating that the
Query message is received.
6. After receiving a successful Advertise message from the TC,
the node will try to connect to its parent by sending
BindRequest messages based on the candidate parent list.
Depending on implementation, one or more TCs may be locally
deployed, in which each receiver can locally access some configured
list of the prospective RHs.
6.1.2 ERS
ERS is used only in the network environments where IP multicast
transmissions are allowed by receivers and RHs.
In ERS algorithm, the RH discovery with a TC proceeds as follows:
1. The nodes first look for Neighbors using a multicast Query
message. The initial TTL value in the Query message,
TTLNeighborInit, is specified in the session announcement
and may be as large as the session TTL (TTLSession).
An RH that is able to handle additional children responds to
a Query message by multicasting an Advertise message.
If the RH is not yet a parent, the TTL used in this response
is the same TTL used in the Query message. If the RH is a
parent, the TTL used is the greater of the Query TTL and the
parent's current Advertise TTL.
2. The node listens for Advertise messages after sending the
Query message. If one or more Advertise messages are
received during a SolicitPeriod, the best RH among them is
selected.
3. If no Advertise messages are received, the node sends
another multicast Query message with a TTL that is
incremented by TTLIncrement. The process of sending the
multicast Query message with an increasing TTL value
continues until a response is received.
The TTL value is limited by a value, TTLMax, also specified
in the session announcement. TTLMax defaults to the value
of TTLSession.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 12]
RMT BB: Tree Auto-Configuration (TREE) December 2003
If the TTL value required to reach the soliciting node is
greater than the TTL used to reach the RH, an Advertise
message may not reach the node. However, if future Query
messages have increased TTL values, the TTL may eventually
be large enough for the Advertise message to reach the node.
However, it is possible that the node will not locate any
RHs using Expanding Ring Search. It is advisable that a
backup method, such as static, be available.
4. RHs need to suppress sending Advertise messages in response
to Query messages if one was sent with at least the Query's
TTL within the last SolicitPeriod.
After multicasting a Query message, a node waits for an
interval, BetweenQuery, before sending another Query message.
Nodes will suppress sending Query messages for BetweenQuery
seconds when they first start in order to collect
information from Advertise messages already solicited by
other nodes.
5. Getting a successful Advertise message via ERS, the node
will try to connect to the parent by sending BindRequest
messages.
The following variables are used in the Expanding Ring Search
algorithm.
- TTLNeighborInit: This is the initial TTL value to be used by the
ERS if no other TTL value is specified by the algorithm.
- TTLIncrement: This is the periodic increment for the TTL used in
ERS.
- TTLMax: This is a configured maximum TTL value to be used by
either Query or Advertise messages.
- TTLSession: This is the session TTL value for the multicast
session.
- SolicitPeriod: Each receiver MUST not send more than one QUERY
message per SolicitPeriod. When a RH responds to QUERY messages,
it also suppresses its ADVERTISE message if one has been sent
less than SolicitPeriod ago. This parameter is used to control
the amount of control traffic during tree construction if ERS is
used.
- BetweenQuery: This is the time interval a node waits before
sending successive Query messages.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 13]
RMT BB: Tree Auto-Configuration (TREE) December 2003
- MaxBindAttempts: This variable is an integer used to control how
many times a receiver tries to bind to a RH before giving up and
try another one.
- BindPeriod: In order to prevent loops, sometimes a RH may reject
a BindRequest (for example, when the RH is not on the tree yet
and has a BindRequest outstanding itself) from a receiver. In
this case, if the receiver needs to retry binding to the same RH
again (perhaps because the receiver does not discover any
alternative RHs), then it waits for BindPeriod seconds.
6.1.3 Mesh
The mesh approach relies on a set of RHs already deployed as
infrastructure servers. These RHs are on-line, but are not
necessarily aware of any particular session unless informed by the
following mechanisms.
RHs in the mesh are configured to know who their neighbor RHs are,
and exchange reachability information with their neighbors in a way
analogous to routers in a network. The actually protocol used by
RHs to exchange such reachability information is outside the scope
of this BB.
Instead, this BB specifies the following properties that the mesh
of RHs MUST satisfy:
a) Each RH knows a subset of RHs in the mesh as its immediate
neighbors.
b) Each RH has a "forwarding table", such that given any other
(destination) RH in the mesh, the forwarding table gives a
"next-hop" RH that can be used to reach the destination RH,
plus the distance to the destination RH from the local RH.
c) A given RH in the mesh can "broadcast" information to all
other RHs in the mesh (in the sense of having a means of
sending the same information to all other RHs, but not
necessarily simultaneously).
d) All potential sender and receivers of a multicast session
can discover a "neighboring" RH in the mesh, using the
neighbor discovery mechanisms described in Section 4.1.
The reason for running a routing-like algorithm to maintain the
forwarding tables in each RH is to provide fault tolerance when
some RHs in the mesh fail. When that happens, the remaining RHs
exchange information with each other to update the forwarding
Chiu, Koh, Kadansky, Whetten, Taskale [Page 14]
RMT BB: Tree Auto-Configuration (TREE) December 2003
tables. In the steady state, the mesh of RHs still needs to satisfy
the above four properties.
In the simplest form, each RH in the mesh has a forwarding table
that contains all other RHs in the mesh. This is called a fully-
connected mesh.
As mentioned earlier, the mesh scheme assumes that there is a pre-
deployed infrastructure of RH servers. That is, the RH-RH bindings
and sender-RH (Sender's RHs) will be performed internally by the
provider's own policy. The only thing done by sender is to inform
its RH's that a session starts. The relationships between sender
and its RH's (i.e., sender-RH bindings) MUST also be pre-configured.
For example, the internal bindings between sender and RHs MAY be
done as follows:
1. The sender locates a neighbor RH in the mesh by a pre-
configured mechanism. This RH is referred to as the sender's
RH.
2. The sender sends the multicast session id, address and port
(all these can be set as a abbreviated session announcement
message) to the sender's RH.
3. The sender's RH in turn "broadcasts" the session information
to all RHs in the mesh; since RHs can support multiple
sessions simultaneously, they keep the information about each
session in an entry in a session table.
4. After the sender-RH bindings, sender multicasts its session
announcement to the multicast receivers. Then, the sender's RH
binds to the sender by sending a BindRequest message.
During the internal processing of sender-RH binding described up to
now, no Query and Advertise messages are used.
When a session starts, the bindings between RH and receivers will
be done. The tree binding of RH-receiver is done with the Tree
Configurator (TC), which was used in the Static algorithm. The main
difference between Static algorithm with TC and Mesh algorithm with
TC is that the active RHs are considered as candidate RHs in the
Static scheme, while the pre-deployed RHs are considered as
candidates in the Mesh scheme. That is, in the Mesh scheme, the TC
is required to already get the information on the locations of the
pre-deployed RHs.
Then the tree buildings is done as follows:
Chiu, Koh, Kadansky, Whetten, Taskale [Page 15]
RMT BB: Tree Auto-Configuration (TREE) December 2003
1. When a session starts, a receiver sends a Query message to
the TC. The receiver may optionally discover its distance from
the sender. Any metric described in Section 4.2 may be used.
2. The TC chooses one or more candidate parents RHs for the
receiver from the pre-deployed RHs by its own tree
configuration rule, as described in Section 4.1.1.
3. TC responds to the Query message with an Advertise message,
which includes the candidate parent list. In the list, each
entry contains the corresponding IP address and port of the
parent.
4. Receiving a successful Advertise message from the TC, the
node will try to connect to its parent by sending BindRequest
messages based on the candidate parent list.
Once a receiver is connected to a parent RH, the RH-RH bindings
will also be done internally by the pre-configured provider's
policy. For example, each RH in the mesh tries to bind to its
"next-hop" RH. If the "next-hop" RH is not reachable for some
reason, an RH may also try to bind to any neighbor RH as a back-up
alternative. These procedures will be devised to ensure that the
loop freedom is guaranteed.
6.2 Distance Measurement
Different techniques can be used to determine distances between
nodes in a session. The distances of interest are:
Sender distance - distance from a RH to sender
Neighbor distance - distance from a receiver to a neighboring RH
These distances can be used in selecting a repair head if several
are discovered.
These techniques quantify distance differently. Each specific way
of quantifying distance is called a metric. Different Metrics are
not necessarily comparable. For example, if distance between A and
B is X using metric m1, and distance between A and C is Y using
metric m2, then X > Y does not necessarily imply B is farther from
A than C.
Only distances of the same metric will be compared and ranked.
Therefore, a receiver will only rank two RHs based on their
respective sender distance if those distances are based on the same
metric.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 16]
RMT BB: Tree Auto-Configuration (TREE) December 2003
On the other hand, it is not necessary for all receivers to use the
same metric to select their neighboring RH to connect to. Suppose
the receivers use neighbor distance as a selection criterion. One
receiver may determine neighbor distances to RHs based on hop count,
whereas another receiver may determine neighbor distances to its
neighboring RHs based on delay.
6.2.1 TTL Hop-Count
If this metric is used, a node periodically sends a Beacon message
on the session's multicast address. The Beacon message includes
the original time-to-live value set by the node. The distance to
the node is then calculated as
Beacon's original TTL - Beacon's current TTL
One node is closer than another if its distance is a lower number.
Note that the TTL value may not be available in some implementation
environments.
6.2.2 Number of Levels
One metric a receiver can use for RH selection is the number of
levels the RH is from the sender. For example, given two RHs in
close proximity to a receiver, if one RH is n levels from the
sender and the other is m levels, where m<n, the receiver selects
the RH with m levels. This is because a shallower tree allows
faster propagation of feedback information to the sender. (Note,
we assumed the choice is between two RHs equally close to the
receiver. The receiver to RH distance is another consideration).
The number of levels metric is not generally available, as the tree
may be constructed bottom up. If the mesh approach is used,
however, the distance in the RH's forwarding tables can be
implemented as an estimate of the number of levels from the sender.
6.2.3 Delay
Another metric is the delay from an RH to the sender. If the RH is
directly connected to the sender, then the delay would simply be
the time to send feedback from the RH to the sender. If the RH is
several levels down from the sender, then the delay would be the
sum of the delays for each level (with some jitter time added in
each level). For example, given two RHs in close proximity to a
receiver, the receiver selects the RH with a smaller delay to the
sender. This is again for the purpose of minimizing the feedback
time.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 17]
RMT BB: Tree Auto-Configuration (TREE) December 2003
If the tree is built bottom up, this metric cannot be used. If the
mesh approach is used, this metric can be implemented, although it
requires the RHs in the mesh to exchange distance information based
on the delay metric.
6.2.4 Address
For IPv6 addresses, the distance can be approximately determined by
the number of aggregation levels one address has in common with
another. For this metric, one node is closer than another if its
address has more aggregation levels in common with the querying
node's address.
6.2.5 Static
The node's distance to other nodes may be made available in some
well-known location. One node's is closer than another if its
distance is a lower number.
6.2.6 GRA
The node's distance to the sender may be determined with help of a
Generic Router Assist (GRA) message that lists the set of GRA
routers on the path from the source.
6.3 Repair Head Selection
Once Neighbors have been discovered, a node selects the best one
using whatever distance information is available.
If there is no sender distance information to compare, the best RH
is simply the one that is closest to the node, without a loop being
formed by binding the node to the RH.
If sender distances are available, there are two cases:
Leaf nodes: For leaf nodes, the goal is to use the closest RH
possible.
RHs: For RHs joining the tree, it is important to pick an RH
that is closer to the sender; neighbor distance is a
secondary factor.
Once an RH has been selected, the node tries to bind to it. Loop
prevention is done during the bind process using only Tree Level
information.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 18]
RMT BB: Tree Auto-Configuration (TREE) December 2003
This algorithm is recommended because it assigns each node the
closest RH, and does not require all nodes to measure their sender
distance at the start of the session. Depending on the selected
metric, multiple nodes measuring sender distance could cause
message implosion, and delay tree construction. On the other hand,
the RHs selected may actually be further from the sender than their
children are. However, it may be necessary to assign nodes to non-
optimal RHs in order to get them on the tree, since it is possible
that no RH closer to the sender can accept any more children.
Alternatively, nodes may be required to measure sender distance
before selecting an RH in order to ensure that each parent is
closer to the sender than its children. Presumably, this results
in a tree in which parents detect message loss before their
children, minimizing repair requests.
7. BB Details: Tree Formation
The following is a detailed description of the tree formation
process. All tree construction follows this pattern. The ONLY
differences between instantiations of this building block lie in
how nodes discover and select neighbors.
An architectural constant, tree level, is used to represent the
tree height and its value is limited to 127. That is, a node with
tree level of 128 means that it has not been connected to the tree
yet.
Once an RH has been selected, the node sends a BindRequest message
to the RH. If the RH has an outstanding request to bind to another
RH, it must refuse the incoming bind request in case it would form
a loop.
Otherwise, it MAY accept the node as a child as long as selecting
it would not cause a loop in the tree. Loop freedom is guaranteed
by these two rules:
1. If the requesting node does not have children, the RH can
accept it as a child as long as the RH has no outstanding
bind requests. If it does have an outstanding bind request,
the RH can accept the node as a child if its address is less
than the child's address.
2. If the requesting node has children, the RH can accept it as
a child if
a. RH's level is 128, i.e., it is the top of a sub-tree not
yet connected to the sender, or
Chiu, Koh, Kadansky, Whetten, Taskale [Page 19]
RMT BB: Tree Auto-Configuration (TREE) December 2003
b. RH's level is less than 128, i.e., it is connected to the
sender.
The second rule prevents a node from selecting one of its own
children as its parent. Two nodes at level 128 are prevented from
selecting each other using tie-breaking criteria described step 1
above.
If the RH accepts the node as a child, it returns a BindConfirm
message. If it does not accept the node, it sends a BindReject
message. If the node does not receive a response after
MaxBindAttempts tries every BindPeriod seconds, it MAY select the
next best Neighbor from its cached list, or else run the repair
head Discovery process again to determine an alternate RH to try.
The BindReject message contains a reason code. If the code
indicates that the node was rejected because the RH was not yet on
the tree, the node MAY choose to retry that RH after BindPeriod
seconds, or select a different available RH.
The BindReject message may also include a list of alternative RHs
for the node to try.
The BindConfirm message includes the parent's current Tree Level.
The node sets its Tree Level to one more than the parent's level.
The BindConfirm message also indicates the starting sequence number
of the message from which data reliability is assured. This
information is included in the BindConfirm message to enable
receivers to meet the PI's late join requirements. If nodes join
the tree after the sender has started to send data, it is possible
that some of the data is no longer available within the tree. nodes
may need to have specific information about repair availability
before selecting a parent.
Repair Heads MAY limit the number of children they support
depending on their capacity. Once an RH has accepted its maximum
number of children, it stops accepting new children until a change
in membership causes its count of children to go below this limit.
If an RH limits the number of children it supports, it reserves at
least one child slot for other RHs. This guarantees the growth of
the repair tree.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 20]
RMT BB: Tree Auto-Configuration (TREE) December 2003
8. BB Details: Tree Maintenance
Tree maintenance is an ongoing process active in every node.
Because the tree is based on the operation of RHs, as well as the
various underlying metrics that may change over time, it is
important that these dependencies be monitored for changes. Nodes
monitor the parents for liveness and changes in tree level, and
continue to run the Neighbor Discovery and Selection process in
order to optimize their choice of RH. Parents also monitor children
for liveness.
8.1 Monitoring Parents and Children
The upper Building Block or Protocol Instantiation is responsible
for monitoring parents and children. Monitoring messages from
parents to children MUST contain the parent's current Tree Level.
Children MUST set their Tree Level to one more than their parent's
level.
If a child loses contact with its parent for a period of time, it
reports it using the lostRH interface, and attempt to bind with an
alternate RH. A child that is leaving a session sends a unicast
UnbindRequest message to its parent. The parent responds with an
UnbindConfirm message.
A parent that is leaving the session sends an EjectRequest message
to its children indicating that they need to bind with an alternate
RH. If possible the EjectRequest message is multicasted, but the
EjectRequest message can also be sent via unicast to each child
individually. Upon receiving an EjectRequest message from its
parent, a receiver sets its Tree Level to 128 again. Using the
heartbeat mechanism, the Tree Level for all receivers in the
affected subtree will be updated (to a value higher than 128).
If a parent does not hear from a child for a period of time, or it
receives a UnbindRequest message from a child, it removes that
child from its list of children, and reports the loss using the
removeChild interface.
8.2 Optimizing the Tree
Implementations of this building block should continue to run the
Neighbor Discovery and Selection process in order to optimize the
choice of RH. This continuous process also keeps the distance
information for the current parent up-to-date. Whenever the
process returns a better RH than the current one, the node MAY bind
to the new RH. Once the new RH is bound to, the node sends a
UnbindRequest message to the original parent. A parent with no
children MAY leave the session.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 21]
RMT BB: Tree Auto-Configuration (TREE) December 2003
9. External Interfaces
This section describes external interfaces for the building block.
9.1 Interfaces to this BB.
These may be used by a PI or by a higher-level BB.
(1) start(boolean RH, advertisement)
start notifies the BB to begin operation. If the RH parameter is
set to TRUE, the BB also starts RH operation.
(2) end()
end notifies the BB to end operation.
(3) incomingMessage(message)
This interface is used to pass an incoming message down from the PI.
(4) getStatistics
getStatistics returns current BB statistics to the upper BB or PI.
(5) getRHs
getRHs instructs the BB to start the process of finding RH
candidates for this node. getRHs may return immediately with a
list of candidate RH, or may use the RHlist interface to return the
list at a later time.
(6) setRH(RH)
setRH informs the BB that the PI or upper BB has successfully bound
to an RH.
(7) acceptChild(node)
acceptChild asks the BB to accept or reject the node as a member.
The BB returns a boolean value in response.
(8) removeChild(node)
removeChild is called to inform the BB that the child is no longer
a member.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 22]
RMT BB: Tree Auto-Configuration (TREE) December 2003
(9) treeLevelUpdate(newLevel)
This interface is used to pass down any update to the node's tree
level that the upper BB or the PI has learned. newLevel replaces
the BB's current tree level.
(10) lostRH
lostRH notifies the BB that the connection to the RH was lost.
(11) setOptimization(boolean)
setOptimization tells the BB to start or stop the tree optimization
process. The upper BB or PI may want to control when tree
optimization takes place.
(12) recordRHs(destination)
recordRHs tells the BB to record the current RH, plus the list of
alternates, to the indicated destination.
9.2 Interfaces from this BB to the PI or a higher-level BB.
(1) outgoingMessage(message)
outgoingMessage instructs the PI to send the message.
(2) RHlist(list)
RHlist returns to the upper PI or BB a list of RH candidates. The
list may contain additional information for each candidate, such as
details about the data packets that it has available for repair.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 23]
RMT BB: Tree Auto-Configuration (TREE) December 2003
10. Applicability Statement
The authors recognize that automatic tree construction is a very
difficult problem. Nonetheless, complete reliance on manual
configuration is very user-unfriendly and error prone as well.
This building block describes a procedure for constructing loop-
free trees when there is minimal manual configured information
available.
This is analogous to providing a system with default configurations
that allow the system to work correctly, but not necessarily
optimally.
There are many possible criteria for tree optimality. This BB does
not attempt to define a single optimality criterion, nor does it
try to produce an optimal tree. It is, however, the goal of the BB
to construct better trees as more configuration and measurement
data are introduced to the procedure.
This BB describes only a subset of the possible parameters for
constructing optimal trees, in particular sender distance and
neighbor distance. There are many techniques for measuring these
distances. Some of the techniques may not be applicable globally.
Expanding ring search (ERS) is an effective technique in a local
subnet or intranet (especially when the IP multicast routing
protocol is dense-mode based). On the other hand, it is not
practical in a multi-domain network; it is not effective when the
routing protocol is sparse-mode based; and it can add significant
control traffic overhead.
Generic Router Assist (GRA) can provide measurement hooks to
determine RHs that are located along the path for multicast data
distribution. However, such facilities may not be available in all
networks.
The tree construction procedure does allow manual configuration and
various distance measurement techniques to be selectively and
independently applied for different subgroups of receivers and RHs,
to achieve incremental improvement to the quality of the tree.
There are many other criteria for tree-building than what is
described in this document, for instance, methods based on load
balancing and minimizing feedback latency.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 24]
RMT BB: Tree Auto-Configuration (TREE) December 2003
11. Messages for TREE BB
These messages are required for implementations of this BB. The
list below indicates which message contents are required by
implementations. Implementations may also include other protocol-
specific information in these messages. Note that these messages
are parts of packets specified in PI's that use this BB.
Table 1. Messages for Tree Configuration
+----------------+-----------------------------+---------------------+
| Message Name | Description | |
| m- or u-cast | | Contents |
+----------------+-----------------------------+---------------------+
| Query | A message used to discover | sender distance |
| both | neighbors. The unicast | (optional), |
| | is sent to TC; the multicast| TTL (m-cast only) |
| | is used in ERS. | |
+----------------+-----------------------------+---------------------+
| Advertise | A message used to advertise | IP address, port, |
| both | an RH. The unicast is sent | sender distance |
| | from the TC; the multicast | (optional) |
| | is sent by RHs themselves | |
+----------------+-----------------------------+---------------------+
| BindRequest | Request to RH to join tree | Current Tree Level |
| unicast | | |
+----------------+-----------------------------+---------------------+
| BindConfirm | RH accepts BindRequest | Current Tree Level |
| unicast | | |
+----------------+-----------------------------+---------------------+
| BindReject | RH rejects BindRequest | Reject reason, |
| unicast | | alternate RH list |
+----------------+-----------------------------+---------------------+
| UnbindRequest | child leaving parent(u-cast)| Unbind reason |
| unicast | parent leaving tree (m-cast)| |
+----------------+-----------------------------+---------------------+
| UnbindConfirm | Acknowledgement of | |
| unicast | UnbindRequest message | |
+----------------+-----------------------------+---------------------+
| EjectRequest | parent refusing or leaving | Eject reason |
| both | service to children | |
+----------------+-----------------------------+---------------------+
| EjectConfirm | Acknowledgement of | |
| unicast | EjectRequest message | |
+----------------+-----------------------------+---------------------+
Chiu, Koh, Kadansky, Whetten, Taskale [Page 25]
RMT BB: Tree Auto-Configuration (TREE) December 2003
12. Requirements from other Building Blocks
This TREE BB can be interfaced to any other BB or PI wishing to use
a tree structure. To actually use this BB's features, the PI needs
to include the messages described in this BB in its packets. An
example of how to use this Tree Configuration BB can be found in
the TRACK BB [RFCxxxx].
13. Security Considerations
Basically, this document is for informational and security issues
are not applied. The following considerations are given just for
information:
a. The primary security requirement for a TRACK protocol is
protection of the transport infrastructure. This is
accomplished through the use of lightweight group
authentication of the control and, optionally, the data
messages sent to the group. These algorithms use IPsec and
shared symmetric keys.
b. For TRACK, it is recommended that there be one shared key for
the Data session and one for each Local Control Channel.
These keys are distributed through a separate key manager
component, which may be either centralized or distributed.
Each member of the group is responsible for contacting the
key manager, establishing a pair-wise security association
with the key manager, and obtaining the appropriate keys.
c. The exact algorithms for this BB are presently the subject of
research within the IRTF Secure Multicast Group (SMuG) and/or
standardization within the IETF Multicast Security (MSEC)
working group.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 26]
RMT BB: Tree Auto-Configuration (TREE) December 2003
14. References
Normative:
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels," BCP 14, RFC 2119, March 1997
[RFC3048] Whetten, B., Vicisano, L., Kermode, R., Handley, M., Floyd,
S. and M. Luby, "Reliable Multicast Transport Building
Blocks for One-to-Many Bulk-Data Transfer," RFC 3048,
January 2001.
[RFCxxxx] Whetten, B., Chiu, D., Kadansky, M., Koh, S. and G. Taskale,
"Tree-based ACK (TRACK) Building Block for Reliable
Multicast Transport", RFC xxxx, 2004.
Informative:
[RFC3269] Kermode, R., Vicisano, L., "Author Guidelines for Reliable
Multicast Transport (RMT) Building Blocks and Protocol
Instantiation documents," RFC 3269, April 2002.
[RFC2887] Handley, M., Whetten, B., Kermode, R., Floyd, S., Vicisano,
L., and Luby, M., "The Reliable Multicast Design Space for
Bulk Data Transfer," RFC 2887, August 2000.
[RFC2357] Mankin, A., Romanow, A., Bradner, S. and V. Paxson, "IETF
Criteria for Evaluating Reliable Multicast Transport and
Application Protocols," RFC 2357, June 1998.
[RFC2974] Handley, M., Perkins, C., Whelan, E., "Session Announcement
Protocol," RFC 2974, October 2000.
Acknowledgments
The authors would like to give special thanks to Brian Neil Levine,
Dave Tahler, Joe Wesley and Juyoung Park for their valuable comments.
Chiu, Koh, Kadansky, Whetten, Taskale [Page 27]
RMT BB: Tree Auto-Configuration (TREE) December 2003
Author's Addresses
Dah Ming Chiu
dmchiu@ie.cuhk.edu.hk
Information Engineering Department,
The Chinese University of Hong Kong Shatin, N.T. Hong Kong
Seok Joo Koh
sjkoh@etri.re.kr
Protocol Engineering Center,
ETRI, 161 Kajung-Dong Yusong-Gu, TAEJON, 305-350, KOREA
Miriam Kadansky
miriam.kadansky@sun.com
Sun Microsystems Laboratories
1 Network Drive Burlington, MA 01803
Brian Whetten
brian@whetten.net
2430 20th Street #D, Santa Monica, CA 90405
Gursel Taskale
gursel@tibco.com
TIBCO
3303 Hillview Ave. Palo Alto, CA. 94304-1213
hiu, Koh, Kadansky, Whetten, Taskale [Page 28]
RMT BB: Tree Auto-Configuration (TREE) December 2003
Full Copyright Statement
Copyright (C) The Internet Society (2003). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet languages other than English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE."
hiu, Koh, Kadansky, Whetten, Taskale [Page 29]