IETF Draft Vishal Sharma
Multi-Protocol Label Switching Ben-Mack Crane
Expires: May 2001 Srinivas Makam
Ken Owens
Tellabs Operations, Inc.
Changcheng Huang
Carleton University
Fiffi Hellstrand
Jon Weil
Loa Andersson
Bilel Jamoussi
Nortel Networks
Brad Cain
Mirror Image Internet
Seyhan Civanlar
Coreon Networks
Angela Chiu
AT&T Labs
November 2000
Framework for MPLS-based Recovery
<draft-ietf-mpls-recovery-frmwrk-01.txt>
Status of this memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
Multi-protocol label switching (MPLS) [1] integrates the label
swapping forwarding paradigm with network layer routing. To deliver
reliable service, MPLS requires a set of procedures to provide
protection of the traffic carried on different paths. This requires
Makam, et al. Expires May 2001 [Page 1]
Internet Draft draft-ietf-mpls-recovery-frmwrk-01.txt November 2000
that the label switched routers (LSRs) support fault detection, fault
notification, and fault recovery mechanisms, and that MPLS signaling
[2] [3] [4] [5] [6] support the configuration of recovery. With these
objectives in mind, this document specifies a framework for MPLS
based recovery.
Table of Contents
1.0 Introduction
1.1 Background
1.2 Motivations for MPLS-Based Recovery
1.3 Objectives
2.0 Overview
2.1 Recovery Models
2.2 Recovery Cycles
2.2.1 MPLS Recovery Cycle Model
2.2.2 MPLS Reversion Cycle Model
2.2.3 Dynamic Reroute Cycle Model
2.3 Definitions and Terminology
2.4 Abbreviations
3.0 MPLS Recovery Principles
3.1 Configuration of Recovery
3.2 Initiation of Path Setup
3.3 Initiation of Resource Allocation
3.4 Scope of Recovery
3.4.1 Topology
3.4.1.1 Local Repair
3.4.1.2 Global Repair
3.4.1.3 Alternate Egress Repair
3.4.1.4 Multi-Layer Repair
3.4.1.5 Concatenated Protection Domains
3.4.2 Path Mapping
3.4.3 Bypass Tunnels
3.4.4 Recovery Granularity
3.4.4.1 Selective Traffic Recovery
3.4.4.2 Bundling
3.4.5 Recovery Path Resource Use
3.5 Fault Detection
3.6 Fault Notification
3.7 Switch Over Operation
3.7.1 Recovery Trigger
3.7.2 Recovery Action
3.8 Switch Back Operation
3.8.1 Fixed Protection Counterparts
3.8.2 Dynamic Protection Counterparts
3.8.3 Restoration and Notification
3.8.4 Reverting to Preferred Path
3.9 Performance
4.0 Recovery Requirements
5.0 MPLS Recovery Options
6.0 Comparison Criteria
7.0 Security Considerations
8.0 Intellectual Property Considerations
9.0 Acknowledgements
10.0 Author's Addresses
11.0 References
1.0 Introduction
This memo describes a framework for MPLS-based recovery. We provide a
detailed taxonomy of recovery terminology, and discuss the motivation
for, the objectives of, and the requirements for MPLS-based recovery.
We outline principles for MPLS-based recovery, and also provide
comparison criteria that may serve as a basis for comparing and
evaluating different recovery schemes.
1.1 Background
Network routing deployed today is focussed primarily on connectivity
and typically supports only one class of service, the best effort
class. Multi-protocol label switching, on the other hand, integrates
forwarding based on label-swapping of a link local label with network
layer routing, which allows flexibility in the delivery of new
routing services. MPLS allows for using such media specific
forwarding mechanisms as label swapping. This enables more
sophisticated features such as quality-of-service (QoS) and traffic
engineering [7] to be implemented more effectively. An important
component of providing QoS, however, is the ability to transport data
reliably and efficiently. Although the current routing algorithms are
very robust and survivable, the amount of time they take to recover
from a fault can be significant, on the order of several seconds or
minutes, causing serious disruption of service for some applications
in the interim. This is unacceptable to many organizations that aim
to provide a highly reliable service, and thus require recovery times
on the order of tens of milliseconds, as specified, for example, in
the GR253 specification for SONET.
MPLS recovery may be motivated by the notion that there are inherent
limitations to improving the recovery times of current routing
algorithms. Additional improvement not obtainable by other means can
be obtained by augmenting these algorithms with MPLS recovery
mechanisms. Since MPLS is likely to be the technology of choice in
the future IP-based transport network, it is useful that MPLS be able
to provide protection and restoration of traffic. MPLS may
facilitate the convergence of network functionality on a common
control and management plane. Further, a protection priority could be
used as a differentiating mechanism for premium services that require
high reliability. The remainder of this document provides a framework
for MPLS based recovery. It is focused at a conceptual level and is
meant to address motivation, objectives and requirements. Issues of
mechanism, policy, routing plans and characteristics of traffic
carried by recovery paths are beyond the scope of this document.
1.2 Motivation for MPLS-Based Recovery
MPLS based protection of traffic (called MPLS-based Recovery) is
useful for a number of reasons. The most important is its ability to
increase network reliability by enabling a faster response to faults
than is possible with traditional Layer 3 (or IP layer) approaches
alone while still providing the visibility of the network afforded by
Layer 3. Furthermore, a protection mechanism using MPLS could enable
IP traffic to be put directly over WDM optical channels, without an
intervening SONET layer. This would facilitate the construction of
IP-over-WDM networks.
The need for MPLS-based recovery arises because of the following:
I. Layer 3 or IP rerouting may be too slow for a core MPLS network
that needs to support high reliability/availability.
II. Layer 0 (for example, optical layer) or Layer 1 (for example,
SONET) mechanisms may not be deployed in topologies that meet
carriers' protection goals.
III. The granularity at which the lower layers may be able to protect
traffic may be too coarse for traffic that is switched using MPLS-
based mechanisms.
IV. Layer 0 or Layer 1 mechanisms may have no visibility into higher
layer operations. Thus, while they may provide, for example, link
protection, they cannot easily provide node protection or protection
of traffic transported at layer 3.
V. MPLS has desirable attributes when applied to the purpose of
recovery for connectionless networks. Specifically, an LSP is
source routed and a forwarding path for recovery can be "pinned", so
it is not affected by transient instability in SPF routing brought on
by failure scenarios.
Furthermore, there is a need for open standards.
VI. Establishing interoperability of protection mechanisms between
routers/LSRs from different vendors in IP or MPLS networks is
urgently required to enable the adoption of MPLS as a viable core
transport and traffic engineering technology.
1.3 Objectives/Goals
We lay down the following objectives for MPLS-based recovery.
I. MPLS-based recovery mechanisms should facilitate fast (tens of ms)
recovery times.
II. MPLS-based recovery should maximize network reliability and
availability. MPLS-based recovery of traffic should minimize the
number of single points of failure in the MPLS protected domain.
III. MPLS-based recovery should enhance the reliability of the
protected traffic while minimally or predictably degrading the
traffic carried by the diverted resources.
IV. MPLS-based recovery techniques should be applicable for
protection of traffic at various granularities. For example, it
should be possible to specify MPLS-based recovery for a portion of
the traffic on an individual path, for all traffic on an individual
path, or for all traffic on a group of paths. Note that a path is
used as a general term and includes the notion of a link, IP route or
LSP.
V. MPLS-based recovery techniques may be applicable for an entire
end-to-end path or for segments of an end-to-end path.
VI. MPLS-based recovery actions should not adversely affect other
network operations.
VII. MPLS-based recovery actions in one MPLS protection domain
(defined in Section 2.3.1) should not adversely affect the recovery
actions in other MPLS protection domains.
VIII. MPLS-based recovery mechanisms should be able to take into
consideration the recovery actions of lower layers.
IX. MPLS-based recovery actions should avoid network-layering
violations. That is, defects in MPLS-based mechanisms should not
trigger lower layer protection switching.
X. MPLS-based recovery mechanisms should minimize the loss of data
and packet reordering during recovery operations. (The current MPLS
specification itself has no explicit requirement on reordering.)
XI. MPLS-based recovery mechanisms should minimize the state overhead
incurred for each recovery path maintained.
XII. MPLS-based recovery mechanisms should be able to preserve the
constraints on traffic after switchover, if desired. That is, if
desired, the recovery path should meet the resource requirements of,
and achieve the same performance characteristics as, the working path.
2.0 Overview
There are several options for providing protection of traffic using
MPLS. The most generic requirement is the specification of whether
recovery should be via Layer 3 (or IP) rerouting or via MPLS
protection switching or rerouting actions.
Generally network operators aim to provide the fastest and the best
protection mechanism that can be provided at a reasonable cost. The
higher the level of protection, the more resources are consumed.
Therefore it is expected that network operators will offer a spectrum
of service levels. MPLS-based recovery should give the flexibility to
select the recovery mechanism, to choose the granularity at which
traffic is protected, and to choose the specific types of
traffic that are protected, in order to give operators more control
over that tradeoff. With MPLS-based recovery, it can be possible to
provide different levels of protection for different classes of
service, based on their service requirements. For example, using
approaches outlined below, a VLL service that supports real-time
applications like VoIP may be supported using link/node protection
together with pre-established, pre-reserved path protection, while
best effort traffic may use established-on-demand path protection or
simply rely on IP re-route or higher layer recovery mechanisms. As
another example of their range of application, MPLS-based recovery
strategies may be used to protect traffic not originally flowing on
label switched paths, such as IP traffic that is normally routed hop-
by-hop, as well as traffic forwarded on label switched paths.
2.1 Recovery Models
There are two basic models for path recovery: rerouting and
protection switching.
Protection switching and rerouting, as defined below, may be used
together. For example, protection switching to a recovery path may
be used for rapid restoration of connectivity while rerouting
determines a new optimal network configuration, rearranging paths, as
needed, at a later time [8] [9].
2.1.1 Rerouting
Recovery by rerouting is defined as establishing new paths or path
segments on demand for restoring traffic after the occurrence of a
fault. The new paths may be based upon fault information, network
routing policies, pre-defined configurations and network topology
information. Thus, upon detecting a fault, paths or path segments to
bypass the fault are established using signaling. Reroute mechanisms
are inherently slower than protection switching mechanisms, since
more must be done following the detection of a fault. However, reroute
mechanisms are simpler and more frugal as no resources are committed
until after the fault occurs and the location of the fault is known.
Once the network routing algorithms have converged after a fault, it
may be preferable, in some cases, to reoptimize the network by
performing a reroute based on the current state of the network and
network policies. This is discussed further in Section 3.8.
In terms of the principles defined in section 3, reroute recovery
employs paths established-on-demand with resources reserved-on-
demand.
2.1.2 Protection Switching
Protection switching recovery mechanisms pre-establish a recovery
path or path segment, based upon network routing policies, the
restoration requirements of the traffic on the working path, and
administrative considerations. The recovery path may or may not be
link and node disjoint with the working path [10]. However, if the
recovery path shares sources of failure with the working path, the
overall reliability of the construct is degraded. When a fault is
detected, the protected traffic is switched over to the recovery
path(s) and restored.
In terms of the principles in section 3, protection switching employs
pre-established recovery paths, and if resource reservation is
required on the recovery path, pre-reserved resources.
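The remark above that a recovery path may or may not be link and node disjoint with the working path can be made concrete with a small check. The following is only a sketch under stated assumptions: paths are modelled as ordered lists of hypothetical LSR names, both paths start at the same PSL and end at the same PML, and links are treated as undirected. None of these names or conventions come from this document.

```python
def shared_resources(working, recovery):
    """Return (shared_nodes, shared_links) for two paths given as node lists.

    Assumption: both lists start at the same PSL and end at the same PML,
    so the end points are excluded from the node comparison.
    """
    shared_nodes = set(working[1:-1]) & set(recovery[1:-1])

    def links(path):
        # A link is modelled as an unordered pair of adjacent nodes.
        return {frozenset(pair) for pair in zip(path, path[1:])}

    shared_links = links(working) & links(recovery)
    return shared_nodes, shared_links


def is_disjoint(working, recovery):
    """True if the two paths share no interior node and no link."""
    nodes, shared = shared_resources(working, recovery)
    return not nodes and not shared


# Hypothetical topology: A is the PSL, D is the PML.
working_path = ["A", "B", "C", "D"]
disjoint_recovery = ["A", "E", "F", "D"]
overlapping_recovery = ["A", "E", "C", "D"]
```

A recovery path such as `overlapping_recovery` still provides protection against a fault at B, but it shares node C and link C-D with the working path, which is the reliability degradation described above.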
2.1.2.1. Subtypes of Protection Switching
The resources (bandwidth, buffers, processing) on the recovery path
may be used to carry either a copy of the working path traffic or
extra traffic that is displaced when a protection switch occurs.
This leads to two subtypes of protection switching.
In 1+1 ("one plus one") protection, the resources (bandwidth,
buffers, processing capacity) on the recovery path are fully
reserved, and carry the same traffic as the working path. Selection
between the traffic on the working and recovery paths is made at the
path merge LSR (PML). In effect, the function of the path switch LSR
(PSL) is reduced to establishment of the working and recovery paths and a simple
replication function. The recovery intelligence is delegated to the
PML.
In 1:1 ("one for one") protection, the resources (if any) allocated
on the recovery path are fully available to preemptible low priority
traffic except when the recovery path is in use due to a fault on the
working path. In other words, in 1:1 protection, the protected
traffic normally travels only on the working path, and is switched to
the recovery path only when the working path has a fault. Once the
protection switch is initiated, the low priority traffic being
carried on the recovery path may be displaced by the protected
traffic. This method affords a way to make efficient use of the
recovery path resources.
This concept can be extended to 1:n (one for n) and m:n (m for n)
protection.
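The difference between the two subtypes can be illustrated with a toy model. This is a minimal sketch, not a forwarding implementation: the function names, the use of `None` for "nothing delivered", and the traffic class strings are all assumptions made for illustration.

```python
def one_plus_one(packet, working_ok, recovery_ok):
    """1+1: the PSL replicates the packet, the PML selects one copy."""
    # PSL side: the same packet is sent on both paths.
    # A failed path delivers nothing, modelled here as None.
    on_working = packet if working_ok else None
    on_recovery = packet if recovery_ok else None
    # PML side: prefer the working copy, fall back to the recovery copy.
    return on_working if on_working is not None else on_recovery


def one_for_one(working_ok):
    """1:1: return which traffic class each path carries."""
    if working_ok:
        # Recovery path resources are available to preemptible extra traffic.
        return {"working": "protected", "recovery": "extra"}
    # After the protection switch, protected traffic displaces extra traffic.
    return {"working": None, "recovery": "protected"}
```

In the 1+1 case no action is needed at the PSL when a fault occurs, whereas in the 1:1 case the PSL must perform the switch and the extra traffic is displaced.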
2.2 The Recovery Cycles
There are three defined recovery cycles: the MPLS Recovery Cycle, the
MPLS Reversion Cycle and the Dynamic Re-routing Cycle. The first
cycle detects a fault and restores traffic onto MPLS-based recovery
paths. If the recovery path is non-optimal, the cycle may be followed
by either of the latter two to achieve an optimized network again. The
reversion cycle applies to explicitly routed traffic that does
not rely on any dynamic routing protocols to be converged. The
dynamic re-routing cycle applies to traffic that is forwarded based
on hop-by-hop routing.
2.2.1 MPLS Recovery Cycle Model
The MPLS recovery cycle model is illustrated in Figure 1.
Definitions and a key to abbreviations follow.
--Network Impairment
|    --Fault Detected
|    |    --Start of Notification
|    |    |    --Start of Recovery Operation
|    |    |    |    --Recovery Operation Complete
|    |    |    |    |    --Path Traffic Restored
|    |    |    |    |    |
|    |    |    |    |    |
v    v    v    v    v    v
----------------------------------------------------------------
| T1 | T2 | T3 | T4 | T5 |
Figure 1. MPLS Recovery Cycle Model
The various timing measures used in the model are described below.
T1 Fault Detection Time
T2 Hold-off Time
T3 Notification Time
T4 Recovery Operation Time
T5 Traffic Restoration Time
Definitions of the recovery cycle times are as follows:
Fault Detection Time
The time between the occurrence of a network impairment and the
moment the fault is detected by MPLS-based recovery mechanisms. This
time may be highly dependent on lower layer protocols.
Hold-Off Time
The configured waiting time between the detection of a fault and
taking MPLS-based recovery action, to allow time for lower layer
protection to take effect. The Hold-off Time may be zero.
Note: The Hold-Off Time may occur after the Notification Time
interval if the node responsible for the switchover, the Path Switch
LSR (PSL), rather than the detecting LSR, is configured to wait.
Notification Time
The time between initiation of a fault indication signal (FIS) by the
LSR detecting the fault and the time at which the Path Switch LSR
(PSL) begins the recovery operation. This is zero if the PSL detects
the fault itself or infers a fault from such events as an adjacency
failure.
Note: If the PSL detects the fault itself, there still may be a Hold-
Off Time period between detection and the start of the recovery
operation.
Recovery Operation Time
The time between the first and last recovery actions. This may
include message exchanges between the PSL and PML to coordinate
recovery actions.
Traffic Restoration Time
The time between the last recovery action and the time that the
traffic (if present) is completely recovered. This interval is
intended to account for the time required for traffic to once again
arrive at the point in the network that experienced disrupted or
degraded service due to the occurrence of the fault (e.g. the PML).
This time may depend on the location of the fault, the recovery
mechanism, and the propagation delay along the recovery path.
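Since the five intervals T1 to T5 are consecutive, the time from network impairment to restored traffic is their sum. The sketch below simply records that arithmetic. The class name, field names, and the millisecond values in the example are assumptions chosen for illustration; they are not measurements or targets from this document.

```python
from dataclasses import dataclass


@dataclass
class RecoveryCycle:
    """Intervals of the MPLS recovery cycle, all in the same time unit."""
    fault_detection: int      # T1
    hold_off: int             # T2, may be zero
    notification: int         # T3, zero if the PSL detects the fault itself
    recovery_operation: int   # T4
    traffic_restoration: int  # T5

    def total(self) -> int:
        """Time from network impairment until path traffic is restored."""
        return (self.fault_detection + self.hold_off + self.notification
                + self.recovery_operation + self.traffic_restoration)


# Illustrative values in milliseconds (hypothetical).
example = RecoveryCycle(fault_detection=10, hold_off=0, notification=15,
                        recovery_operation=5, traffic_restoration=10)
```

With these made-up values `example.total()` is 40 ms. A non-zero hold-off time adds directly to the total, which is the cost of waiting for lower layer protection.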
2.2.2 MPLS Reversion Cycle Model
Protection switching in revertive mode requires the traffic to be
switched back to a preferred path when the fault on that path is
cleared. The MPLS reversion cycle model is illustrated in Figure 2.
Note that the cycle shown below comes after the recovery cycle shown
in Fig. 1.
--Network Impairment Repaired
|    --Fault Cleared
|    |    --Path Available
|    |    |    --Start of Reversion Operation
|    |    |    |    --Reversion Operation Complete
|    |    |    |    |    --Traffic Restored on Preferred Path
|    |    |    |    |    |
|    |    |    |    |    |
v    v    v    v    v    v
-----------------------------------------------------------------
| T7 | T8 | T9 | T10| T11|
Figure 2. MPLS Reversion Cycle Model
The various timing measures used in the model are described below.
T7 Fault Clearing Time
T8 Wait-to-Restore Time
T9 Notification Time
T10 Reversion Operation Time
T11 Traffic Restoration Time
Note that time T6 (not shown above) is the time for which the network
impairment is not repaired and traffic is flowing on the recovery
path.
Definitions of the reversion cycle times are as follows:
Fault Clearing Time
The time between the repair of a network impairment and the time that
MPLS-based mechanisms learn that the fault has been cleared. This
time may be highly dependent on lower layer protocols.
Wait-to-Restore Time
The configured waiting time between the clearing of a fault and MPLS-
based recovery action(s). Waiting time may be needed to ensure the
path is stable and to avoid flapping in cases where a fault is
intermittent. The Wait-to-Restore Time may be zero.
Note: The Wait-to-Restore Time may occur after the Notification Time
interval if the PSL is configured to wait.
Notification Time
The time between initiation of a fault recovery signal (FRS) by the
LSR clearing the fault and the time at which the path switch LSR (PSL)
begins the reversion operation. This is zero if the PSL clears the
fault itself.
Note: If the PSL clears the fault itself, there still may be a Wait-
to-Restore Time period between fault clearing and the start of the
reversion operation.
Reversion Operation Time
The time between the first and last reversion actions. This may
include message exchanges between the PSL and PML to coordinate
reversion actions.
Traffic Restoration Time
The time between the last reversion action and the time that traffic
(if present) is completely restored on the preferred path. This
interval is expected to be quite small since both paths are working
and care may be taken to limit the traffic disruption (e.g., using
"make before break" techniques and synchronous switch-over).
In practice, the only interesting times in the reversion cycle are
the Wait-to-Restore Time and the Traffic Restoration Time (or some
other measure of traffic disruption). Given that both paths are
available, there is no need for rapid operation, and a well-
controlled switch-back with minimal disruption is desirable.
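The role of the Wait-to-Restore Time in avoiding flapping can be sketched as a simple timer that is restarted whenever the fault reappears. The class and method names, and the use of plain numbers for time, are assumptions for illustration only.

```python
class WaitToRestore:
    """Permit reversion only after the path has stayed fault-free for wtr."""

    def __init__(self, wtr):
        self.wtr = wtr            # configured Wait-to-Restore Time, may be zero
        self.cleared_at = None    # time the fault was last reported cleared

    def fault_cleared(self, now):
        # Start (or restart) the wait-to-restore interval.
        self.cleared_at = now

    def fault_detected(self, now):
        # The fault reappeared: cancel any pending reversion.
        self.cleared_at = None

    def may_revert(self, now):
        return self.cleared_at is not None and now - self.cleared_at >= self.wtr
```

For an intermittent fault that clears at time 0, returns at 150 and clears again at 200, a timer configured with `wtr=300` would not permit reversion until time 500, so traffic stays on the recovery path throughout.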
2.2.3 Dynamic Re-routing Cycle Model
Dynamic rerouting aims to bring the IP network to a stable state
after a network impairment has occurred. A re-optimized network is
achieved after the routing protocols have converged, and the traffic
is moved from a recovery path to a (possibly) new working path. The
steps involved in this mode are illustrated in Figure 3.
Note that the cycle shown below may be overlaid on the recovery
cycle shown in Fig. 1 or the reversion cycle shown in Fig. 2, or both
(in the event that both the recovery cycle and the reversion cycle
take place before the routing protocols converge, and after the
convergence of the routing protocols it is determined (based on on-
line algorithms or off-line traffic engineering tools, network
configuration, or a variety of other possible criteria) that there is
a better route for the working path).
--Network Enters a Semi-stable State after an Impairment
|     --Dynamic Routing Protocols Converge
|     |     --Initiate Setup of New Working Path between PSL
|     |     |   and PML
|     |     |     --Switchover Operation Complete
|     |     |     |     --Traffic Moved to New Working Path
|     |     |     |     |
|     |     |     |     |
v     v     v     v     v
-----------------------------------------------------------------
| T12 | T13 | T14 | T15 |
Figure 3. Dynamic Rerouting Cycle Model
The various timing measures used in the model are described below.
T12 Network Route Convergence Time
T13 Hold-down Time (optional)
T14 Switchover Operation Time
T15 Traffic Restoration Time
Network Route Convergence Time
We define the network route convergence time as the time taken for
the network routing protocols to converge and for the network to
reach a stable state.
Hold-down Time
We define the hold-down period as a bounded time for which a recovery
path must be used. In some scenarios it may be difficult to determine
if the working path is stable. In these cases a hold-down time may be
used to prevent excess flapping of traffic between a working and a
recovery path.
Switchover Operation Time
The time between the first and last switchover actions. This may
include message exchanges between the PSL and PML to coordinate the
switchover actions.
As an example of these cycles, we present the sequence of events
that follows a network impairment when a protection switch is
followed by dynamic rerouting.
I. Link or path fault occurs
II. Signaling initiated (FIS) for the fault detected
III. FIS arrives at the PSL
IV. The PSL initiates a protection switch to a pre-configured
recovery path
V. The PSL switches over the traffic from the working path to the
recovery path
VI. The network enters a semi-stable state
VII. Dynamic routing protocols converge after the fault, and a new
working path is calculated (based, for example, on some of the
criteria mentioned earlier in Section 2.1.1).
VIII. A new working path is established between the PSL and the PML
(assumption is that PSL and PML have not changed)
IX. Traffic is switched over to the new working path.
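The sequence above can be read as a small state machine that tracks which path carries the protected traffic. The sketch below is one possible reading only; the event names and state names are invented for illustration and do not correspond to any signaling message defined here.

```python
# (current path, event) -> path carrying the protected traffic afterwards
TRANSITIONS = {
    ("working", "fis_at_psl"): "recovery",                         # steps III-V
    ("recovery", "routing_converged"): "recovery",                 # step VII
    ("recovery", "new_working_path_established"): "new_working",   # steps VIII-IX
}


def path_in_use(events, state="working"):
    """Replay a list of events and return the path carrying the traffic."""
    for event in events:
        # Events that do not apply in the current state are ignored.
        state = TRANSITIONS.get((state, event), state)
    return state


example_events = ["fault", "fis_at_psl", "routing_converged",
                  "new_working_path_established"]
```

Replaying `example_events` ends on the new working path. Stopping after `fis_at_psl` leaves the traffic on the recovery path, which corresponds to the semi-stable state in step VI.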
2.3 Definitions and Terminology
This document assumes the terminology given in [11], and, in
addition, introduces the following new terms.
2.3.1 General Recovery Terminology
Rerouting
A recovery mechanism in which the recovery path or path segments are
created dynamically after the detection of a fault on the working
path. In other words, a recovery mechanism in which the recovery path
is not pre-established.
Protection Switching
A recovery mechanism in which the recovery path or path segments are
created prior to the detection of a fault on the working path. In
other words, a recovery mechanism in which the recovery path is pre-
established.
Working Path
The protected path that carries traffic before the occurrence of a
fault. The working path exists between a PSL and PML. The working
path can be of different kinds: a hop-by-hop routed path, a trunk, a
link, an LSP or part of a multipoint-to-point LSP.
Synonyms for a working path are primary path and active path.
Recovery Path
The path by which traffic is restored after the occurrence of a
fault. In other words, the path on which the traffic is directed by
the recovery mechanism. The recovery path is established by MPLS
means. The recovery path can either be an equivalent recovery path
and ensure no reduction in quality of service, or be a limited
recovery path and thereby not guarantee the same quality of service
(or some other criteria of performance) as the working path. A
limited recovery path is not expected to be used for an extended
period of time.
Synonyms for a recovery path are: back-up path, alternative path, and
protection path.
Protection Counterpart
The "other" path when discussing pre-planned protection switching
schemes. The protection counterpart for the working path is the
recovery path and vice-versa.
Path Group (PG)
A logical bundling of multiple working paths, each of which is routed
identically between a Path Switch LSR and a Path Merge LSR.
Protected Path Group (PPG)
A path group that requires protection.
Protected Traffic Portion (PTP)
The portion of the traffic on an individual path that requires
protection. For example, code points in the EXP bits of the shim
header may identify a protected portion.
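As an illustration of identifying a protected traffic portion from the EXP bits, the sketch below decodes a 32-bit label stack entry using the standard MPLS shim layout (20-bit label, 3-bit EXP, 1-bit bottom-of-stack flag, 8-bit TTL). The set of protected code points is a hypothetical policy chosen for the example; this document does not assign any.

```python
def parse_shim(entry: int) -> dict:
    """Decode one 32-bit MPLS label stack entry."""
    return {
        "label": (entry >> 12) & 0xFFFFF,  # bits 31..12
        "exp": (entry >> 9) & 0x7,         # bits 11..9
        "s": (entry >> 8) & 0x1,           # bit 8, bottom of stack
        "ttl": entry & 0xFF,               # bits 7..0
    }


# Hypothetical policy: EXP code points 5 and 6 mark protected traffic.
PROTECTED_EXP = {5, 6}


def is_protected(entry: int) -> bool:
    """True if the entry's EXP value belongs to the protected portion."""
    return parse_shim(entry)["exp"] in PROTECTED_EXP


# Label 100, EXP 5, bottom of stack, TTL 64.
sample_entry = (100 << 12) | (5 << 9) | (1 << 8) | 64
```

With such a policy, packets on the same LSP can be treated differently at the PSL: only those whose EXP value is in the protected set would be switched to the recovery path.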
Path Switch LSR (PSL)
The PSL is responsible for switching or replicating the traffic
between the working path and the recovery path.
Path Merge LSR (PML)
An LSR that receives both working path traffic and its corresponding
recovery path traffic, and either merges their traffic into a single
outgoing path, or, if it is itself the destination, passes the
traffic on to the higher layer protocols.
Intermediate LSR
An LSR on a working or recovery path that is neither a PSL nor a PML
for that path.
Bypass Tunnel
A path that serves to back up a set of working paths using the label
stacking approach [1]. The working paths and the bypass tunnel must
all share the same path switch LSR (PSL) and the path merge LSR
(PML).
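The label stacking idea behind a bypass tunnel can be sketched with label stacks modelled as Python lists, top of stack first. This is a simplification under stated assumptions: the label values are hypothetical, and any label swap needed so that the PML recognizes the inner label is left out.

```python
def enter_bypass(label_stack, tunnel_label):
    """At the PSL: push the bypass tunnel label on top of the existing stack."""
    return [tunnel_label] + list(label_stack)


def leave_bypass(label_stack):
    """At the PML: pop the tunnel label, exposing the original stack."""
    return list(label_stack[1:])


BYPASS_LABEL = 900  # hypothetical label for the bypass tunnel
working_lsps = {"lsp-a": [17], "lsp-b": [42, 9]}

# Every working path sharing the PSL and PML reuses the same tunnel label.
in_tunnel = {name: enter_bypass(stack, BYPASS_LABEL)
             for name, stack in working_lsps.items()}
```

Because all protected paths carry the same outer label inside the tunnel, intermediate LSRs on the bypass keep state for one path rather than one per working path.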
Switch-Over
The process of switching the traffic from the path that the traffic
is flowing on onto one or more alternate path(s). This may involve
moving traffic from a working path onto one or more recovery paths,
or may involve moving traffic from a recovery path(s) on to a more
optimal working path(s).
Switch-Back
The process of returning the traffic from one or more recovery paths
back to the working path(s).
Revertive Mode
A recovery mode in which traffic is automatically switched back from
the recovery path to the original working path upon the restoration
of the working path to a fault-free condition. This assumes a failed
working path does not automatically surrender resources to the
network.
Non-revertive Mode
A recovery mode in which traffic is not automatically switched back
to the original working path after this path is restored to a fault-
free condition. (Depending on the configuration, the original working
path may, upon moving to a fault-free condition, become the recovery
path, or it may be used for new working traffic, and be no longer
associated with its original recovery path).
MPLS Protection Domain
The set of LSRs over which a working path and its corresponding
recovery path are routed.
MPLS Protection Plan
The set of all LSP protection paths and the mapping from working to
protection paths deployed in an MPLS protection domain at a given
time.
Liveness Message
A message exchanged periodically between two adjacent LSRs that
serves as a link probing mechanism. It provides an integrity check of
the forward and the backward directions of the link between the two
LSRs as well as a check of neighbor aliveness.
Path Continuity Test
A test that verifies the integrity and continuity of a path or path
segment. The details of such a test are beyond the scope of this
draft. (This could be accomplished, for example, by transmitting a
control message along the same links and nodes as the data traffic or
similarly could be measured by the absence of traffic and by
providing feedback.)
2.3.2 Failure Terminology
Path Failure (PF)
Path failure is a fault detected by MPLS-based recovery mechanisms,
which is defined as the failure of the liveness message test or a path
continuity test, and which indicates that path connectivity is lost.
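One way a liveness message test could declare a path failure is by counting consecutive missed messages. The sketch below assumes a fixed sending interval and a miss threshold; both parameters and all names are hypothetical, since this document does not specify how the test is performed.

```python
class LivenessMonitor:
    """Declare path failure after max_missed consecutive missed messages."""

    def __init__(self, interval, max_missed):
        self.interval = interval      # expected spacing of liveness messages
        self.max_missed = max_missed  # misses tolerated before declaring PF
        self.last_seen = 0            # time of the last message received

    def message_received(self, now):
        self.last_seen = now

    def path_failed(self, now):
        # Number of whole intervals elapsed without a message.
        missed = int((now - self.last_seen) // self.interval)
        return missed >= self.max_missed
```

The product of the interval and the threshold bounds the Fault Detection Time (T1) for faults detected this way, so smaller values detect faults faster at the cost of more liveness traffic and a higher risk of false alarms.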
Path Degraded (PD)
Path degraded is a fault detected by MPLS-based recovery mechanisms
that indicates that the quality of the path is unacceptable.
Link Failure (LF)
A lower layer fault indicating that link continuity is lost. This may
be communicated to the MPLS-based recovery mechanisms by the lower
layer.
Link Degraded (LD)
A lower layer indication to MPLS-based recovery mechanisms that the
link is performing below an acceptable level.
Fault Indication Signal (FIS)
A signal that indicates that a fault along a path has occurred. It is
relayed by each intermediate LSR to its upstream or downstream
neighbor, until it reaches an LSR that is set up to perform MPLS
recovery.
Fault Recovery Signal (FRS)
A signal that indicates a fault along a working path has been
repaired. Again, like the FIS, it is relayed by each intermediate LSR
to its upstream or downstream neighbor, until it reaches the LSR that
performs recovery of the original path.
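The hop-by-hop relaying of an FIS (or an FRS) can be sketched as follows. The example assumes the signal travels upstream along the working path from the detecting LSR to the PSL; the node names and the function are illustrative only.

```python
def relay_upstream(path, detecting_lsr, psl):
    """Return the LSRs that handle the signal, from the detector to the PSL.

    path is the working path as an ordered list, upstream node first.
    """
    start = path.index(detecting_lsr)
    end = path.index(psl)
    if end > start:
        raise ValueError("the PSL must be upstream of the detecting LSR")
    # Walk the path backwards from the detecting LSR to the PSL.
    return path[end:start + 1][::-1]


working = ["PSL", "B", "C", "D", "PML"]  # hypothetical LSR names
```

For a fault detected at C the signal visits C, B and then the PSL. If the PSL itself detects the fault the list contains only the PSL, which matches the case where the Notification Time is zero.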
2.4 Abbreviations
FIS: Fault Indication Signal.
FRS: Fault Recovery Signal.
LD: Link Degraded.
LF: Link Failure.
PD: Path Degraded.
PF: Path Failure.
PML: Path Merge LSR.
PG: Path Group.
PPG: Protected Path Group.
PTP: Protected Traffic Portion.
PSL: Path Switch LSR.
3.0 MPLS-based Recovery Principles
MPLS-based recovery refers to the ability to effect quick and
complete restoration of traffic affected by a fault in an MPLS-
enabled network. The fault may be detected on the IP layer or in
lower layers over which IP traffic is transported. The fastest MPLS
recovery is assumed to be achieved with protection switching, whose
switch completion time at the MPLS LSR may be viewed as comparable
to, or equivalent to, the 50 ms switch-over completion time of the
SONET layer. This section provides a discussion of the concepts and
principles of MPLS-based recovery. The concepts are presented as
atomic or primitive terms that may be combined to specify
recovery approaches. We do not make any assumptions about the
underlying layer 1 or layer 2 transport mechanisms or their recovery
mechanisms.
3.1 Configuration of Recovery
An LSR should allow for configuration of the following recovery
options:
Default-recovery (No MPLS-based recovery enabled):
Traffic on the working path is recovered only via Layer 3 or IP
rerouting or by some lower layer mechanism such as SONET APS. This
is equivalent to having no MPLS-based recovery. This option may be
used for low priority traffic or for traffic that is recovered in
another way (for example, load-shared traffic on parallel working
paths may be automatically recovered upon a fault along one of the
working paths by distributing it among the remaining working paths).
Recoverable (MPLS-based recovery enabled):
This working path is recovered using one or more recovery paths,
either via rerouting or via protection switching.
3.2 Initiation of Path Setup
There are three options for the initiation of the recovery path
setup.
Pre-established:
This is the same as the protection switching option. Here, one or more
recovery paths are established prior to any failure on the working path. The
path selection can either be determined by an administrative
centralized tool (online or offline), or chosen based on some
algorithm implemented at the PSL and possibly intermediate nodes. To
guard against the situation when the pre-established recovery path
fails before or at the same time as the working path, the recovery
path should have secondary configuration options as explained in
Section 3.3 below.
Pre-qualified:
A pre-established path need not be created; it may instead be
pre-qualified.
A pre-qualified recovery path is not created expressly for protecting
the working path, but instead is a path created for other purposes
that is designated as a recovery path after determination that it is
an acceptable alternative for carrying the working path traffic.
Variants include the case where an optical path or trail is
configured, but no switches are set.
Established-on-Demand:
This is the same as the rerouting option. Here, a recovery path is
established after a failure on its working path has been detected and
notified to the PSL.
3.3 Initiation of Resource Allocation
A recovery path may support the same traffic contract as the working
path, or it may not. We will distinguish these two situations by
using different additive terms. If the recovery path is capable of
replacing the working path without degrading service, it will be
called an equivalent recovery path. If the recovery path lacks the
resources (or resource reservations) to replace the working path
without degrading service, it will be called a limited recovery path.
Based on this, there are two options for the initiation of resource
allocation:
Pre-reserved:
This option applies only to protection switching. Here a pre-
established recovery path reserves required resources on all hops
along its route during its establishment. Although the reserved
resources (e.g., bandwidth and/or buffers) at each node cannot be
used to admit more working paths, they are available to be used by
all traffic that is present at the node before a failure occurs.
Reserved-on-Demand:
This option may apply either to rerouting or to protection switching.
Here a recovery path reserves the required resources after a failure
on the working path has been detected and notified to the PSL and
before the traffic on the working path is switched over to the
recovery path.
Note that under both the options above, depending on the amount of
resources reserved on the recovery path, it could either be an
equivalent recovery path or a limited recovery path.
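The options of Sections 3.1 through 3.3 combine into a per-path
configuration. The following is a minimal sketch of such a record and
a consistency check, not a definition made by this draft. The class
and field names (RecoveryConfig, problems, and so on) are hypothetical,
and rejecting pre-reserved resources for anything other than a
pre-established path is one reading of the statement that pre-reserved
"applies only to protection switching".

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class PathSetup(Enum):
    PRE_ESTABLISHED = "pre-established"              # protection switching
    PRE_QUALIFIED = "pre-qualified"
    ESTABLISHED_ON_DEMAND = "established-on-demand"  # rerouting


class ResourceAllocation(Enum):
    PRE_RESERVED = "pre-reserved"
    RESERVED_ON_DEMAND = "reserved-on-demand"


@dataclass(frozen=True)
class RecoveryConfig:
    """Hypothetical per-path record; the names are illustrative only."""
    recoverable: bool                                # False: default-recovery
    path_setup: Optional[PathSetup] = None
    allocation: Optional[ResourceAllocation] = None

    def problems(self) -> List[str]:
        """Return inconsistencies; an empty list means the options combine."""
        found = []
        if not self.recoverable:
            if self.path_setup is not None or self.allocation is not None:
                found.append("default-recovery takes no MPLS recovery options")
            return found
        if self.path_setup is None or self.allocation is None:
            found.append("recoverable paths need setup and allocation options")
        elif (self.allocation is ResourceAllocation.PRE_RESERVED
              and self.path_setup is not PathSetup.PRE_ESTABLISHED):
            # Assumption: pre-reserved is tied to a pre-established path.
            found.append("pre-reserved applies only to pre-established paths")
        return found
```

Whether the resulting recovery path is equivalent or limited is a
separate question that depends on the amount of resources reserved, so
the sketch deliberately does not encode it.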
3.4 Scope of Recovery
3.4.1 Topology
3.4.1.1 Local Repair
The intent of local repair is to protect against a link or neighbor
node fault and to minimize the amount of time required for failure
propagation. In local repair (also known as local recovery [12] [9]),
the node immediately upstream of the fault is the one to initiate
recovery (either rerouting or protection switching). Local repair can
be of two types:
Link Recovery/Restoration
In this case, the recovery path may be configured to route around a
certain link deemed to be unreliable. If protection switching is
used, several recovery paths may be configured for one working path,
depending on the specific faulty link that each protects against.
Alternatively, if rerouting is used, upon the occurrence of a fault
on the specified link each path is rebuilt such that it detours
around the faulty link.
In this case, the recovery path need only be disjoint from its
working path at a particular link on the working path, and may have
overlapping segments with the working path. Traffic on the working
path is switched over to an alternate path at the upstream LSR that
connects to the failed link. This method is potentially the fastest
to perform the switchover, and can be effective in situations where
certain path components are much more unreliable than others.
Node Recovery/Restoration
In this case, the recovery path may be configured to route around a
neighbor node deemed to be unreliable. Thus the recovery path is
disjoint from the working path only at a particular node and at links
associated with the working path at that node. Once again, the
traffic on the working path is switched over to the recovery path at
the upstream LSR that directly connects to the failed node, and the
recovery path shares overlapping portions with the working path.
3.4.1.2 Global Repair
The intent of global repair is to protect against any link or node
fault on a path or on a segment of a path, with the obvious exception
of the faults occurring at the ingress node of the protected path
segment. In global repair the PSL is usually distant from the failure
and needs to be notified by a FIS.
End-to-end path recovery/restoration also applies in global repair.
In many cases, the recovery path can be made completely link and node
disjoint with its working path. This has the advantage of protecting
against all link and node fault(s) on the working path (end-to-end
path or path segment).
However, it is in some cases slower than local repair since it takes
longer for the fault notification message to get to the PSL to
trigger the recovery action.
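The disjointness conditions that distinguish local repair from global
repair can be checked mechanically. The sketch below assumes that a
path is represented as an ordered list of LSR names and that a link is
an unordered pair of adjacent LSRs; both the representation and the
function names are illustrative assumptions, not part of this draft.

```python
def links(path):
    """Set of links on a path, each link an unordered pair of LSR names."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def avoids_link(recovery, a, b):
    """Link recovery: the recovery path must not use the protected link."""
    return frozenset((a, b)) not in links(recovery)


def avoids_node(recovery, node):
    """Node recovery: the recovery path must not traverse the protected node."""
    return node not in recovery


def fully_disjoint(working, recovery):
    """Global repair: no shared link and no shared intermediate node.

    The first and last LSRs (PSL and PML) are shared by construction and
    are therefore excluded from the node comparison.
    """
    shared_links = links(working) & links(recovery)
    shared_nodes = set(working[1:-1]) & set(recovery[1:-1])
    return not shared_links and not shared_nodes


working = ["A", "B", "C", "D"]
local_detour = ["A", "B", "E", "C", "D"]   # detours around link B-C only
global_backup = ["A", "F", "G", "D"]       # disjoint between PSL A and PML D
```

In this example the local detour avoids link B-C but overlaps the
working path elsewhere, whereas the global backup is link and node
disjoint from the working path.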
3.4.1.3 Alternate Egress Repair
It is possible to restore service without specifically recovering the
faulted path.
For example, for best effort IP service it is possible to select a
recovery path that has a different egress point from the working path
(i.e., there is no PML). The recovery path egress must simply be a
router that is acceptable for forwarding the FEC carried by the
working path (without creating looping). In an engineering context,
specific alternative FEC/LSP mappings with alternate egresses can be
formed.
This may simplify enhancing the reliability of implicitly constructed
MPLS topologies. A PSL may qualify LSP/FEC bindings as candidate
recovery paths simply on the basis that they are link and node
disjoint from the immediate downstream LSR of the working path.
3.4.1.4 Multi-Layer Repair
Multi-layer repair broadens the network designer's tool set for those
cases where multiple network layers can be managed together to
achieve overall network goals. Specific criteria for determining
when multi-layer repair is appropriate are beyond the scope of this
draft.
3.4.1.5 Concatenated Protection Domains
A given service may cross multiple networks and these may employ
different recovery mechanisms. It is possible to concatenate
protection domains so that service recovery can be provided end-to-
end. It is considered that the recovery mechanisms in different
domains may operate autonomously, and that multiple points of
attachment may be used between domains (to ensure there is no single
point of failure). Alternate egress repair requires management of
concatenated domains in that an explicit MPLS point of failure (the
PML) is by definition excluded. Details of concatenated protection
domains are beyond the scope of this draft.
3.4.2 Path Mapping
Path mapping refers to the methods of mapping traffic from a faulty
working path onto the recovery path. There are several options for
this, as described below. Note that the options below should be
viewed as atomic terms that only describe how the working and
protection paths are mapped to each other. The issues of resource
reservation along these paths, and how switchover is actually
performed lead to the more commonly used composite terms, such as 1+1
and 1:1 protection, which were described in Section 2.1.
1-to-1 Protection
In 1-to-1 protection the working path has a designated recovery path
that is only to be used to recover that specific working path.
n-to-1 Protection
In n-to-1 protection, up to n working paths are protected using only
one recovery path. If the intent is to protect against any single
fault on any of the working paths, the n working paths should be
diversely routed between the same PSL and PML. In some cases,
handshaking between PSL and PML may be required to complete the
recovery, the details of which are beyond the scope of this draft.
n-to-m Protection
In n-to-m protection, up to n working paths are protected using m
recovery paths. Once again, if the intent is to protect against any
single fault on any of the n working paths, the n working paths and
the m recovery paths should be diversely routed between the same PSL
and PML. In some cases, handshaking between PSL and PML may be
required to complete the recovery, the details of which are beyond
the scope of this draft. N-to-m protection is for further study.
Split Path Protection
In split path protection, multiple recovery paths are allowed to
carry the traffic of a working path based on a certain configurable
load splitting ratio. This is especially useful when no single
recovery path can be found that can carry the entire traffic of the
working path in case of a fault. Split path protection may require
handshaking between the PSL and the PML(s), and may require the
PML(s) to correlate the traffic arriving on multiple recovery paths
with the working path. Although this is an attractive option, the
details of split path protection are beyond the scope of this draft,
and are for further study.
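The 1-to-1, n-to-1, and n-to-m mappings differ only in how many
working paths share how many recovery paths. A minimal sketch follows,
assuming a first-come, first-served assignment at the PSL. This draft
does not specify an assignment policy, and the RecoveryPool name and
its methods are hypothetical.

```python
class RecoveryPool:
    """n working paths sharing m recovery paths (1-to-1 when n == m == 1)."""

    def __init__(self, working_paths, recovery_paths):
        self.working = set(working_paths)
        self.free = list(recovery_paths)   # recovery paths not carrying traffic
        self.in_use = {}                   # working path -> recovery path

    def fault(self, working_path):
        """Map a faulty working path onto a recovery path, if one is free."""
        if working_path not in self.working:
            raise KeyError(working_path)
        if working_path in self.in_use:
            return self.in_use[working_path]
        if not self.free:
            return None                    # no recovery path left: unprotected
        self.in_use[working_path] = self.free.pop(0)
        return self.in_use[working_path]

    def repaired(self, working_path):
        """Release the recovery path once traffic has left it."""
        recovery = self.in_use.pop(working_path, None)
        if recovery is not None:
            self.free.append(recovery)
```

With a single recovery path (n-to-1), a second simultaneous fault finds
the pool empty, which is why the text above speaks of protection
against any single fault.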
3.4.3 Bypass Tunnels
It may be convenient, in some cases, to create a "bypass tunnel" for
a PPG between a PSL and PML, thereby allowing multiple recovery paths
to be transparent to intervening LSRs [8]. In this case, one LSP
(the tunnel) is established between the PSL and PML following an
acceptable route and a number of recovery paths are supported through
the tunnel via label stacking. A bypass tunnel can be used with any
of the path mapping options discussed in the previous section.
As with recovery paths, the bypass tunnel may or may not have
resource reservations sufficient to provide recovery without service
degradation. It is possible that the bypass tunnel may have
sufficient resources to recover some number of working paths, but not
all at the same time. If the number of recovery paths carrying
traffic in the tunnel at any given time is restricted, this is
similar to the n-to-1 or n-to-m protection cases described in Section
3.4.2.
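The label stacking used by a bypass tunnel can be illustrated as
follows. This is a sketch under stated assumptions: the label stack is
modeled as a Python list with the top of the stack first, the label
values are arbitrary, and the function names are hypothetical. Where
the tunnel label is actually removed (at the PML or one hop earlier) is
not addressed by this draft.

```python
def psl_push(recovery_label, tunnel_label):
    """At the PSL: place the recovery path label under the tunnel label."""
    return [tunnel_label, recovery_label]


def transit_swap(stack, new_tunnel_label):
    """At an intervening LSR: only the top (tunnel) label is examined and swapped."""
    return [new_tunnel_label] + stack[1:]


def tunnel_pop(stack):
    """At the tunnel end: remove the tunnel label, exposing the recovery path label."""
    return stack[1:]


# Two recovery paths (labels 301 and 302) share one tunnel (label 9000).
stack_a = transit_swap(psl_push(301, 9000), 9001)
stack_b = transit_swap(psl_push(302, 9000), 9001)
```

Because intervening LSRs act only on the tunnel label, they hold state
for one LSP regardless of how many recovery paths the tunnel carries.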
3.4.4 Recovery Granularity
Another dimension of recovery considers the amount of traffic
requiring protection. This may range from a fraction of a path to a
bundle of paths.
3.4.4.1 Selective Traffic Recovery
This option allows for the protection of a fraction of traffic within
the same path. The portion of the traffic on an individual path that
requires protection is called a protected traffic portion (PTP). A
single path may carry different classes of traffic, with different
protection requirements. The protected portion of this traffic may be
identified by its class, for example, via the EXP bits in the MPLS
shim header or via the priority bit in the ATM header.
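As an illustration of identifying a PTP by the EXP bits, the sketch
below assumes the 32-bit label stack entry layout of the MPLS label
stack encoding (20-bit label, 3-bit EXP, 1-bit bottom-of-stack flag,
8-bit TTL). The set of protected EXP values and the function names are
hypothetical and would in practice be provisioned.

```python
import struct

PROTECTED_EXP = {5, 6, 7}   # hypothetical: classes that require protection


def build_shim(label, exp, bottom, ttl):
    """Encode one label stack entry in network byte order."""
    word = (label << 12) | (exp << 9) | (bottom << 8) | ttl
    return struct.pack("!I", word)


def parse_shim(entry):
    """Decode the first label stack entry of a packet."""
    (word,) = struct.unpack("!I", entry[:4])
    return {
        "label": word >> 12,
        "exp": (word >> 9) & 0x7,
        "bottom": (word >> 8) & 0x1,
        "ttl": word & 0xFF,
    }


def in_protected_portion(entry):
    """True if the packet belongs to the protected traffic portion."""
    return parse_shim(entry)["exp"] in PROTECTED_EXP
```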
3.4.4.2 Bundling
Bundling is a technique used to group multiple working paths together
in order to recover them simultaneously. The logical bundling of
multiple working paths requiring protection, each of which is routed
identically between a PSL and a PML, is called a protected path group
(PPG). When a fault occurs on the working path carrying the PPG, the
PPG as a whole can be protected either by being switched to a bypass
tunnel or by being switched to a recovery path.
3.4.5 Recovery Path Resource Use
In the case of pre-reserved recovery paths, there is the question of
what use these resources may be put to when the recovery path is not
in use. There are two options:
Dedicated-resource:
If the recovery path resources are dedicated, they may not be used
for anything except carrying the working traffic. For example, in
the case of 1+1 protection, the working traffic is always carried on
the recovery path. Even if the recovery path is not always carrying
the working traffic, it may not be possible or desirable to allow
other traffic to use these resources.
Extra-traffic-allowed:
If the recovery path only carries the working traffic when the
working path fails, then it is possible to allow extra traffic to use
the reserved resources at other times. Extra traffic is, by
definition, traffic that can be displaced (without violating service
agreements) whenever the recovery path resources are needed for
carrying the working path traffic.
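The difference between the two options can be sketched as an admission
rule on the reserved resources. The class below is illustrative only;
its name, the bandwidth units, and the policy of displacing all extra
traffic at once are assumptions not made by this draft.

```python
class RecoveryPathResources:
    """Reserved resources of a pre-reserved recovery path (hypothetical model)."""

    def __init__(self, reserved_bw, extra_traffic_allowed):
        self.reserved_bw = reserved_bw
        self.extra_traffic_allowed = extra_traffic_allowed
        self.extra = {}          # extra-traffic flow -> bandwidth in use
        self.active = False      # True while carrying working path traffic

    def admit_extra(self, flow, bw):
        """Admit extra traffic only when allowed, idle, and within the reservation."""
        if not self.extra_traffic_allowed or self.active:
            return False
        if sum(self.extra.values()) + bw > self.reserved_bw:
            return False
        self.extra[flow] = bw
        return True

    def activate(self):
        """Working path failed: displace extra traffic and return the displaced flows."""
        displaced = sorted(self.extra)
        self.extra.clear()
        self.active = True
        return displaced
```

With extra_traffic_allowed set to False the object models the
dedicated-resource option, in which admit_extra always refuses.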
3.5 Fault Detection
MPLS recovery is initiated after the detection of either a lower
layer fault or a fault at the IP layer or in the operation of MPLS-
based mechanisms. We consider four classes of impairments: Path
Failure, Path Degraded, Link Failure, and Link Degraded.
Path Failure (PF) is a fault that indicates to an MPLS-based recovery
scheme that the connectivity of the path is lost. This may be
detected by a path continuity test between the PSL and PML. Some,
and perhaps the most common, path failures may be detected using a
link probing mechanism between neighbor LSRs. An example of a probing
mechanism is a liveness message that is exchanged periodically along
the working path between peer LSRs. For either a link probing
mechanism or path continuity test to be effective, the test message
must be guaranteed to follow the same route as the working or
recovery path, over the segment being tested. In addition, the path
continuity test must take the path merge points into consideration.
In the case of a bi-directional link implemented as two
unidirectional links, path failure could mean that either one or both
unidirectional links are damaged.
Path Degraded (PD) is a fault that indicates to MPLS-based recovery
schemes/mechanisms that the path has connectivity, but that the
quality of the connection is unacceptable. This may be detected by a
path performance monitoring mechanism, or some other mechanism for
determining the error rate on the path or some portion of the path.
A fault of this kind is local to the LSR and consists of excessive
discarding of packets at an interface, due either to label mismatch
or to TTL errors, for example.
Link Failure (LF) is an indication from a lower layer that the link
over which the path is carried has failed. If the lower layer
supports detection and reporting of this fault (that is, any fault
that indicates link failure, e.g., SONET LOS), this may be used by the
MPLS recovery mechanism. In some cases, using LF indications may
provide faster fault detection than using only MPLS-based fault
detection mechanisms.
Link Degraded (LD) is an indication from a lower layer that the link
over which the path is carried is performing below an acceptable
level. If the lower layer supports detection and reporting of this
fault, it may be used by the MPLS recovery mechanism. In some cases,
using LD indications may provide faster fault detection than using
only MPLS-based fault detection mechanisms.
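A liveness-message probe of the kind mentioned above can be reduced to
a timeout check. The sketch below is one possible arrangement, not a
mechanism defined by this draft: the sending interval, the number of
consecutive missed messages tolerated (max_missed), and the class name
are all assumptions. Times are passed in explicitly so that the logic
is independent of any particular clock.

```python
class LivenessMonitor:
    """Declare Path Failure when liveness messages stop arriving from a peer."""

    def __init__(self, interval, max_missed=3):
        self.interval = interval        # expected spacing of messages, seconds
        self.max_missed = max_missed    # hypothetical tolerance before declaring PF
        self.last_seen = 0.0

    def received(self, now):
        """Record the arrival of a liveness message at time `now`."""
        self.last_seen = now

    def path_failed(self, now):
        """True once more than max_missed intervals pass without a message."""
        return now - self.last_seen > self.interval * self.max_missed
```

The product of interval and max_missed bounds the fault detection time
of this probe, which is why a lower layer LF indication, when
available, may detect the same fault sooner.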
3.6 Fault Notification
MPLS-based recovery relies on rapid and reliable notification of
faults. Once a fault is detected, the node that detected the fault
must determine if the fault is severe enough to require path
recovery. If the node is not capable of initiating direct action
(e.g., as a PSL), the node should send out a notification of the fault
by transmitting a FIS to those of its upstream LSRs that were sending
traffic on the working path that is affected by the fault. This
notification is relayed hop-by-hop by each subsequent LSR to its
upstream neighbor, until it eventually reaches a PSL. A PSL is the
only LSR that can terminate the FIS and initiate a protection switch
of the working path to a recovery path.
Since the FIS is a control message, it should be transmitted with
high priority to ensure that it propagates rapidly towards the
affected PSL(s). Depending on how fault notification is configured in
the LSRs of an MPLS domain, the FIS could be sent either as a Layer 2
or Layer 3 packet [13]. The use of a Layer 2-based notification
requires a Layer 2 path direct to the PSL. An example of a FIS could
be the liveness message sent by a downstream LSR to its upstream
neighbor with an optional fault notification field set; a FIS can also
be implicitly denoted by a teardown message. Alternatively, it could be
a separate fault notification packet. The intermediate LSR should
identify which of its incoming links (upstream LSRs) to propagate the
FIS on. In the case of 1+1 protection, the FIS should also be sent
downstream to the PML where the recovery action is taken.
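The hop-by-hop relay of the FIS towards the PSL can be sketched as
follows. The sketch assumes a single working path given as a list of
LSR names ordered from upstream to downstream, and a known set of LSRs
configured as PSLs; the function name and return shape are
hypothetical. Merged paths, where an LSR has several upstream
neighbors, are not modeled.

```python
def relay_fis(working_path, detecting_lsr, psls):
    """Relay a FIS upstream from the detecting LSR until a PSL terminates it.

    Returns (psl, hops), where hops lists the LSRs that handled the fault
    indication, starting with the detecting LSR. If the detecting LSR is
    itself a PSL it acts directly and no FIS is sent. psl is None if no
    PSL exists upstream of the fault.
    """
    idx = working_path.index(detecting_lsr)
    hops = []
    for lsr in reversed(working_path[:idx + 1]):
        hops.append(lsr)
        if lsr in psls:
            return lsr, hops
    return None, hops


path = ["A", "B", "C", "D", "E"]
```

For a fault detected at D with A as the only PSL, the FIS traverses C
and B before A terminates it; configuring C as a PSL as well shortens
the notification path to one hop.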
3.7 Switch-Over Operation
3.7.1 Recovery Trigger
The activation of an MPLS protection switch following the detection
or notification of a fault requires a trigger mechanism at the PSL.
MPLS protection switching may be initiated due to automatic inputs or
external commands. The automatic activation of an MPLS protection
switch results from a response to a defect or fault condition
detected at the PSL or to fault notifications received at the PSL. It
is possible that the fault detection and trigger mechanisms may be
combined, as is the case when a PF, PD, LF, or LD is detected at a
PSL and triggers a protection switch to the recovery path. In most
cases, however, the detection and trigger mechanisms are distinct,
involving the detection of a fault at some intermediate LSR followed by
the propagation of a fault notification back to the PSL via the FIS,
which serves as the protection switch trigger at the PSL. MPLS
protection switching in response to external commands results when
the operator initiates a protection switch by a command to a PSL (or
alternatively by a configuration command to an intermediate LSR,
which transmits the FIS towards the PSL).
Note that the PF fault applies to hard failures (fiber cuts,
transmitter failures, or LSR fabric failures), as does the LF fault,
with the difference that the LF is a lower layer impairment that may
be communicated to MPLS-based recovery mechanisms. The PD (or LD)
fault, on the other hand, applies to soft defects (excessive errors
due to noise on the link, for instance). The PD (or LD) results in a
fault declaration only when the percentage of lost packets exceeds a
given threshold, which is provisioned and may be set based on the
service level agreement(s) in effect between a service provider and a
customer.
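The threshold rule for soft defects can be written down directly. In
the sketch below, the packet counters and the way they are obtained are
assumptions, as is the function name; the only element taken from the
text above is that a PD (or LD) fault is declared when the percentage
of lost packets exceeds a provisioned threshold.

```python
def path_degraded(sent, received, loss_threshold_pct):
    """Declare a PD fault when the measured loss exceeds the provisioned threshold.

    `sent` and `received` are packet counts over one measurement window;
    `loss_threshold_pct` is the provisioned threshold in percent.
    """
    if sent == 0:
        return False                      # nothing measured in this window
    loss_pct = 100.0 * (sent - received) / sent
    return loss_pct > loss_threshold_pct
```

For example, with a 1 percent threshold, losing 20 packets out of 1000
in a window declares the fault, while losing 5 does not.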
3.7.2 Recovery Action
After a fault is detected or a FIS is received by the PSL, the recovery
action involves either a rerouting or protection switching operation.
In both scenarios, the next hop label forwarding entry for a recovery
path is bound to the working path.
3.8 Switch-Back Operation
When traffic is flowing on the recovery path, a decision can be made
whether to let the traffic remain on the recovery path and consider it
a new working path, or to switch to the old or a new working
path. This switch-back operation has two styles: one in which the
protection counterparts, i.e., the working and recovery paths, are
fixed or "pinned" to their routes, and one in which the PSL or another
network entity with real-time knowledge of the failure dynamically
performs re-establishment or controlled rearrangement of the paths
comprising the protected service.
3.8.1 Fixed Protection Counterparts
For fixed protection counterparts the PSL will be pre-configured with
the appropriate behavior to take when the original fixed path is
restored to service. The choices are revertive and non-revertive
mode. The choice will typically depend on the relative costs of the
working and protection paths, and on the tolerance of the service to
the effects of switching paths yet again. These protection modes indicate
whether or not there is a preferred path for the protected traffic.
3.8.1.1 Revertive Mode
If the working path is always the preferred path, it will be
used whenever it is available. Thus, in the event of a fault on this
path, its unused resources will not be reclaimed by the network.
If the working path has a fault, traffic is switched to the
recovery path. In the revertive mode of operation, when the
preferred path is restored, the traffic is automatically switched back
to it.
There are a number of implications to pinned working and recovery
paths:
- upon failure, once traffic has moved to the recovery path, the
traffic is unprotected until the path defect in the original
working path is repaired and that path is restored to service.
- upon failure, once traffic has moved to the recovery path, the
resources associated with the original path remain reserved.
3.8.1.2 Non-revertive Mode
In the non-revertive mode of operation, there is no preferred path or
it may be desirable to minimize further disruption of the service
brought on by a revertive switching operation. A switch-back to the
original working path is not desired or not possible since the
original path may no longer exist after the occurrence of a fault on
that path.
If there is a fault on the working path, traffic is switched to the
recovery path. When or if the faulty path (the original working
path) is restored, it may become the recovery path (either by
configuration, or, if desired, by management actions).
In the non-revertive mode of operation, the working traffic may or
may not be restored to a new optimal working path or to the original
working path. This is because, in some cases, it might be: (a) useful
to administratively perform a protection switch
back to the original working path after gaining further assurances
about the integrity of the path, (b) acceptable to
continue operation on the recovery path, or (c) desirable
to move the traffic to a new optimal working path that is calculated
based on network topology and network policies.
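For fixed protection counterparts, the difference between the two
modes can be sketched as the PSL's reaction to a repair notification.
The class and method names below are hypothetical, and the
non-revertive branch shows only one of the possibilities described
above, namely that the repaired path becomes the recovery path.

```python
class ProtectionSwitch:
    """PSL view of one working/recovery pair with fixed counterparts."""

    def __init__(self, working, recovery, revertive):
        self.working = working
        self.recovery = recovery
        self.revertive = revertive
        self.active = working            # path currently carrying the traffic

    def on_fault(self):
        """Fault detected or notified on the working path: switch over."""
        self.active = self.recovery

    def on_repair(self):
        """The faulty path has been restored to service."""
        if self.revertive:
            # Revertive: the preferred path is used whenever it is available.
            self.active = self.working
        else:
            # Non-revertive: traffic stays where it is, and the repaired
            # path takes over the role of recovery path.
            self.working, self.recovery = self.recovery, self.working
```

In the non-revertive case the traffic is switched only once per fault,
at the cost of no longer running on the originally preferred route.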
3.8.2 Dynamic Protection Counterparts
For dynamic protection counterparts, when the traffic is switched over
to a recovery path, the association between the original working path
and the recovery path may no longer exist, since the original path
itself may no longer exist after the fault. Instead, when the network
reaches a stable state following routing convergence, the traffic on
the recovery path may be switched over to a different preferred path,
chosen either by optimization based on the new network topology and
associated information or based on pre-configured information.
Dynamic protection counterparts assume that upon failure, the PSL or
other network entity will establish new working paths if a switch-
back is to be performed.
3.8.3 Restoration and Notification
MPLS restoration deals with returning the working traffic from the
recovery path to the original or a new working path. Reversion is
performed by the PSL either upon receiving notification, via FRS,
that the working path is repaired, or upon receiving notification
that a new working path is established.
For fixed counterparts in revertive mode, an LSR that detected the
fault on the working path also detects the restoration of the working
path. If the working path had experienced an LF defect, the LSR
detects a return to normal operation via the receipt of a liveness
message from its peer. If the working path had experienced an LD
defect at an LSR interface, the LSR could detect a return to normal
operation via the resumption of error-free packet reception on that
interface. Alternatively, a lower layer that no longer detects an LF
defect may inform the MPLS-based recovery mechanisms at the LSR that
the link to its peer LSR is operational.
The LSR then transmits an FRS to its upstream LSR(s) that were
transmitting traffic on the working path. When the PSL
receives the FRS, it switches the working traffic back to the
original working path.
A similar scheme applies to dynamic counterparts, where, for example,
an update of topology and/or network convergence may trigger the
installation or setup of new working paths and the sending of a
notification to the PSL to perform a switch-over.
We note that if there is a way to transmit fault information back
along a recovery path towards a PSL and if the recovery path is an
equivalent working path, it is possible for the working path and its
recovery path to exchange roles once the original working path is
repaired following a fault. This is because, in that case, the
recovery path effectively becomes the working path, and the restored
working path functions as a recovery path for the original recovery
path. This is important, since it affords the benefits of non-
revertive switch operation outlined in Section 3.8.1, without leaving
the recovery path unprotected.
3.8.4 Reverting to Preferred Path (or Controlled Rearrangement)
In the revertive mode, "make before break" restoration switching
can be used, which is less disruptive than performing protection
switching upon the occurrence of network impairments. This will
minimize both packet loss and packet reordering. The controlled
rearrangement of paths can also be used to satisfy traffic
engineering requirements for load balancing across an MPLS domain.
3.9 Performance
Resource/performance requirements for recovery paths should be
specified in terms of the following attributes:
I. Resource class attribute:
Equivalent Recovery Class: The recovery path has the same resource
reservations and performance guarantees as the working path. In other
words, the recovery path meets the same SLAs as the working path.
Limited Recovery Class: The recovery path does not have the same
resource reservations and performance guarantees as the working path.
A. Lower Class: The recovery path has lower resource requirements or
less stringent performance requirements than the working path.
B. Best Effort Class: The recovery path is best effort.
II. Priority Attribute:
The recovery path has a priority attribute just like the working path
(i.e., the priority attribute of the associated traffic trunks). It
can have the same priority as the working path or lower priority.
III. Preemption Attribute:
The recovery path can have the same preemption attribute as the
working path or a lower one.
4.0 MPLS Recovery Requirement
The following are the MPLS recovery requirements:
I. MPLS recovery SHALL provide an option to identify protection
groups (PPGs) and protection portions (PTPs).
II. Each PSL SHALL be capable of performing MPLS recovery upon the
detection of the impairments or upon receipt of notifications of
impairments.
III. An MPLS recovery method SHALL NOT preclude manual protection
switching commands. This implies that it would be possible under
administrative commands to transfer traffic from a working path to a
recovery path, or to transfer traffic from a recovery path to a
working path, once the working path becomes operational following a
fault.
IV. A PSL SHALL be capable of performing either a switch back to the
original working path after the fault is corrected or a switchover to
a new working path, upon the discovery or establishment of a more
optimal working path.
V. The recovery model should take into consideration path merging at
intermediate LSRs. If a fault affects the merged segment, all the
paths sharing that merged segment should be able to recover.
Similarly, if a fault affects a non-merged segment, only the path
that is affected by the fault should be recovered.
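Requirement V can be illustrated by computing which paths traverse a
faulty link. The sketch assumes that each path is an ordered list of
LSR names and that a link is identified by its (upstream, downstream)
pair; the function name and the example topology are hypothetical.

```python
def affected_paths(paths, faulty_link):
    """Names of the paths that traverse the faulty link, in sorted order.

    `paths` maps a path name to its ordered list of LSRs; `faulty_link`
    is an (upstream, downstream) pair of adjacent LSRs.
    """
    return sorted(
        name for name, hops in paths.items()
        if faulty_link in zip(hops, hops[1:])
    )


# Two paths that merge at LSR C and share the segment C-D-E.
merged = {
    "lsp-1": ["A", "C", "D", "E"],
    "lsp-2": ["B", "C", "D", "E"],
}
```

A fault on the merged segment (for example, link C-D) affects both
paths, whereas a fault on link A-C affects only lsp-1.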
5.0 MPLS Recovery Options
There SHOULD be an option for:
I. Configuration of the recovery path as excess or reserved, with
excess as the default. The recovery path that is configured as excess
SHALL provide lower priority preemptable traffic access to the
protection bandwidth, while the recovery path configured as reserved
SHALL NOT provide any other traffic access to the protection
bandwidth.
II. Configuring the protection alternatives as either rerouting or
protection switching.
III. Enabling restoration as either non-revertive or revertive, with
non-revertive as the default if fixed protection counterparts are
used.
6.0 Comparison Criteria
Possible criteria to use for comparison of MPLS-based recovery
schemes are as follows:
Recovery Time
We define recovery time as the time required for a recovery path to
be activated (and traffic flowing) after a fault. Recovery Time is
the sum of the Fault Detection Time, Hold-off Time, Notification
Time, Recovery Operation Time, and the Traffic Restoration Time. In
other words, it is the time between a failure of a node or link in
the network and the time at which a recovery path is installed and the
traffic starts flowing on it.
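As a worked illustration of this sum, the sketch below adds the five
components and compares the result with a target. The component values
in the example are invented for illustration and are not measurements;
the 50 ms figure is the SONET switch-over time mentioned in Section
3.0, used here only as an example target.

```python
def recovery_time_ms(detection, hold_off, notification, operation,
                     traffic_restoration):
    """Recovery Time as the sum of its five components, in milliseconds."""
    return (detection + hold_off + notification + operation
            + traffic_restoration)


# Hypothetical component values, in milliseconds.
total = recovery_time_ms(detection=10, hold_off=0, notification=15,
                         operation=5, traffic_restoration=5)
meets_target = total <= 50
```

The decomposition shows where a scheme spends its time: local repair
mainly reduces the notification component, while pre-established
recovery paths mainly reduce the recovery operation component.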
Full Restoration Time
We define full restoration time as the time required for a permanent
restoration. This is the time required for traffic to be routed onto
links that are capable of handling, or have been engineered
sufficiently to handle, traffic in recovery scenarios. Note that this
time may or may
not be different from the "Recovery Time" depending on whether
equivalent or limited recovery paths are used.
Setup vulnerability
The amount of time that a working path or a set of working paths is
left unprotected during such tasks as recovery path computation and
recovery path setup may be used to compare schemes. The nature of
this vulnerability should be taken into account. For example,
end-to-end schemes correlate the vulnerability with working paths,
local repair schemes have a topological correlation that cuts across
working paths, and network plan approaches have a correlation that
impacts the entire network.
Backup Capacity
Recovery schemes may require differing amounts of "backup capacity"
in the event of a fault. This capacity will be dependent on the
traffic characteristics of the network. However, it may also be
dependent on the particular protection plan selection algorithms as
well as the signaling and re-routing methods.
Additive Latency
Recovery schemes may introduce additive latency to traffic. For
example, a recovery path may take many more hops than the working
path. This may be dependent on the recovery path selection
algorithms.
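One way to quantify additive latency, sketched under the assumption
that each path is described by a list of per-link delays, is to compare
the hop count and total delay of the recovery path with those of the
working path. The function name and the per-link delay model are
illustrative and are not part of this framework.

```python
from typing import Sequence, Tuple


def additive_latency(
    working_link_delays_ms: Sequence[float],
    recovery_link_delays_ms: Sequence[float],
) -> Tuple[int, float]:
    """Return (extra hops, extra delay in ms) of the recovery path
    relative to the working path.

    Each argument lists the delay of every link on the path, so its
    length is the hop count. Negative results mean the recovery path
    is shorter or faster than the working path.
    """
    extra_hops = len(recovery_link_delays_ms) - len(working_link_delays_ms)
    extra_delay_ms = sum(recovery_link_delays_ms) - sum(working_link_delays_ms)
    return extra_hops, extra_delay_ms


# Example with made-up values: a 2-hop working path and a 4-hop recovery path.
extra_hops, extra_delay_ms = additive_latency([2, 3], [2, 2, 2, 4])
```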
Quality of Protection
Recovery schemes can be considered to encompass a spectrum of "packet
survivability" which may range from "relative" to "absolute".
Relative survivability may mean that the packet is on an equal
footing with other traffic of, as an example, the same diff-serv code
point (DSCP) in contending for the surviving network resources.
Absolute survivability may mean that the survivability of the
protected traffic has explicit guarantees.
Re-ordering
Recovery schemes may introduce re-ordering of packets. In addition,
the action of putting traffic back on preferred paths might cause
packet re-ordering.
State Overhead
As the number of recovery paths in a protection plan grows, the state
required to maintain them also grows. Schemes may require differing
numbers of paths to maintain certain levels of coverage, etc. The
state required may also depend on the particular scheme used to
recover. In many cases the state overhead will be in proportion to
the number of recovery paths.
Loss
Recovery schemes may introduce a certain amount of packet loss during
switchover to a recovery path. For schemes that introduce loss during
recovery, this loss can be estimated by evaluating the recovery time
in proportion to the link speed.
In case of link or node failure, a certain amount of packet loss is
inevitable.
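A rough estimate of this loss can be sketched as the product of the
link speed and the recovery time. The function names, the utilization
factor, and the fixed average packet size are assumptions introduced
for illustration. The result is only an upper-bound style estimate,
since it assumes that all traffic offered during the recovery time is
lost.

```python
def estimated_loss_bits(
    link_speed_bps: float,
    recovery_time_s: float,
    utilization: float = 1.0,
) -> float:
    """Estimate the bits lost while traffic is not yet on the recovery path.

    utilization is the fraction of the link speed actually carrying
    traffic (an assumed parameter; 1.0 means a fully loaded link).
    """
    if link_speed_bps < 0 or recovery_time_s < 0:
        raise ValueError("link speed and recovery time must be non-negative")
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return link_speed_bps * recovery_time_s * utilization


def estimated_loss_packets(loss_bits: float, avg_packet_bytes: float) -> float:
    """Convert a bit-loss estimate to packets for an assumed average packet size."""
    if avg_packet_bytes <= 0:
        raise ValueError("average packet size must be positive")
    return loss_bits / (8 * avg_packet_bytes)


# Example with made-up values: a 1 Gbit/s link and a 50 ms recovery time.
bits = estimated_loss_bits(1_000_000_000, 0.05)
packets = estimated_loss_packets(bits, 500)
```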
Coverage
Recovery schemes may offer various types of failover coverage. The
total coverage may be defined in terms of several metrics:
I. Fault types: recovery schemes may account for only link faults,
for both node and link faults, or also for degraded service. For
example, a scheme may require more recovery paths to take node faults
into account.
II. Number of concurrent faults: depending on the layout of recovery
paths in the protection plan, it may be possible to restore
multiple-fault scenarios.
III. Number of recovery paths: for a given fault, there may be one or
more recovery paths.
IV. Percentage of coverage: dependent on a scheme and its
implementation, a certain percentage of faults may be covered. This
may be subdivided into percentage of link faults and percentage of
node faults.
V. Number of protected paths: the number of protected paths may
affect how fast the total set of paths affected by a fault can be
recovered. The ratio of protected paths is n/N, where n is the number
of protected paths and N is the total number of paths.
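Two of the coverage metrics above reduce to simple ratios, which can be
sketched as follows. The function names are hypothetical; only the
ratio n/N and the notion of a percentage of covered faults come from
the text.

```python
def protected_ratio(protected_paths: int, total_paths: int) -> float:
    """Ratio of protected paths, n/N."""
    if total_paths <= 0:
        raise ValueError("total number of paths must be positive")
    if not 0 <= protected_paths <= total_paths:
        raise ValueError("protected paths must be between 0 and the total")
    return protected_paths / total_paths


def coverage_percent(covered_faults: int, total_faults: int) -> float:
    """Percentage of faults covered; may be applied separately to
    link faults and to node faults."""
    if total_faults <= 0:
        raise ValueError("total number of faults must be positive")
    if not 0 <= covered_faults <= total_faults:
        raise ValueError("covered faults must be between 0 and the total")
    return 100.0 * covered_faults / total_faults


# Example with made-up values: 3 of 4 paths protected, 9 of 10 link faults covered.
ratio = protected_ratio(3, 4)
link_coverage = coverage_percent(9, 10)
```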
7.0 Security Considerations
The MPLS recovery that is specified herein does not raise any
security issues that are not already present in the MPLS
architecture.
8.0 Intellectual Property Considerations
The IETF has been notified of intellectual property rights claimed in
regard to some or all of the specification contained in this
document. For more information consult the online list of claimed
rights.
9.0 Acknowledgements
We would like to thank members of the MPLS WG mailing list for their
suggestions on the earlier version of this draft. In particular, we
thank Bora Akyol, Dave Allan, and Neil Harrisson, whose suggestions
and comments were very helpful in revising the document.
10.0 Authors' Addresses
Vishal Sharma                        Ben Mack-Crane
Tellabs Research Center              Tellabs Operations, Inc.
One Kendall Square                   4951 Indiana Avenue
Bldg. 100, Ste. 121                  Lisle, IL 60532
Cambridge, MA 02139-1562             Phone: 630-512-7255
Phone: 617-577-8760                  Ben.Mack-Crane@tellabs.com
Vishal.Sharma@tellabs.com

Srinivas Makam                       Ken Owens
Tellabs Operations, Inc.             Tellabs Operations, Inc.
4951 Indiana Avenue                  1106 Fourth Street
Lisle, IL 60532                      St. Louis, MO 63126
Phone: 630-512-7217                  Phone: 314-918-1579
Srinivas.Makam@tellabs.com           Ken.Owens@tellabs.com

Changcheng Huang                     Fiffi Hellstrand
Dept. of Systems & Computer Engg.    Nortel Networks
Carleton University                  St Eriksgatan 115
Minto Center, Rm. 3082               PO Box 6701
1125 Colonial By Drive               113 85 Stockholm, Sweden
Ottawa, Ontario K1S 5B6, Canada      Phone: +46 8 5088 3687
Phone: 613 520-2600 x2477            Fiffi@nortelnetworks.com
Changcheng.Huang@sce.carleton.ca
Jon Weil                             Brad Cain
Nortel Networks                      Mirror Image Internet
Harlow Laboratories London Road      49 Dragon Ct.
Harlow Essex CM17 9NA, UK            Woburn, MA 01801, USA
Phone: +44 (0)1279 403935            bcain@mirror-image.com
jonweil@nortelnetworks.com

Loa Andersson                        Bilel Jamoussi
Nortel Networks                      Nortel Networks
St Eriksgatan 115, PO Box 6701       3 Federal Street, BL3-03
113 85 Stockholm, Sweden             Billerica, MA 01821, USA
Phone: +46 8 50 88 36 34             Phone: (978) 288-4506
loa.andersson@nortelnetworks.com     jamoussi@nortelnetworks.com

Seyhan Civanlar                      Angela Chiu
Coreon, Inc.                         AT&T Labs, Rm. 4-204
1200 South Avenue, Suite 103         100 Schulz Drive
Staten Island, NY 10314              Red Bank, NJ 07701
Phone: (718) 889 4203                Phone: (732) 345-3441
scivanlar@coreon.net                 alchiu@att.com
11.0 References
[1] Rosen, E., Viswanathan, A., and Callon, R., "Multiprotocol Label
Switching Architecture", Internet Draft draft-ietf-mpls-arch-07.txt,
Work in Progress, July 2000.
[2] Andersson, L., Doolan, P., Feldman, N., Fredette, A., and Thomas, B.,
"LDP Specification", Internet Draft draft-ietf-mpls-ldp-11.txt, Work in
Progress, August 2000.
[3] Awduche, D., Hannan, A., and Xiao, X., "Applicability Statement for
Extensions to RSVP for LSP-Tunnels", Internet Draft
draft-ietf-mpls-rsvp-tunnel-applicability-01.txt, Work in Progress,
April 2000.
[4] Jamoussi, B., et al., "Constraint-Based LSP Setup using LDP", Internet
Draft draft-ietf-mpls-cr-ldp-04.txt, Work in Progress, July 2000.
[5] Braden, R., Zhang, L., Berson, S., Herzog, S., and Jamin, S.,
"Resource ReSerVation Protocol (RSVP) -- Version 1 Functional
Specification", RFC 2205, September 1997.
[6] Awduche, D., et al., "Extensions to RSVP for LSP Tunnels", Internet
Draft draft-ietf-mpls-rsvp-lsp-tunnel-07.txt, Work in Progress, August
2000.
[7] Awduche, D., Malcolm, J., Agogbua, J., O'Dell, M., and McManus, J.,
"Requirements for Traffic Engineering Over MPLS", RFC 2702, September
1999.
[8] Andersson, L., Cain, B., and Jamoussi, B., "Requirement Framework for
Fast Re-route with MPLS", Internet Draft
draft-andersson-reroute-frmwrk-00.txt, Work in Progress, October 1999.
[9] Goguen, R. and Swallow, G., "RSVP Label Allocation for Backup
Tunnels", Internet Draft draft-swallow-rsvp-bypass-label-00.txt, Work in
Progress, October 1999.
[10] Makam, S., Sharma, V., Owens, K., and Huang, C.,
"Protection/restoration of MPLS Networks", Internet Draft
draft-makam-mpls-protection-00.txt, Work in Progress, October 1999.
[11] Callon, R., Doolan, P., Feldman, N., Fredette, A., Swallow, G., and
Viswanathan, A., "A Framework for Multiprotocol Label Switching",
Internet Draft draft-ietf-mpls-framework-05.txt, Work in Progress,
September 1999.
[12] Haskin, D. and Krishnan, R., "A Method for Setting an Alternative
Label Switched Path to Handle Fast Reroute", Internet Draft
draft-haskin-mpls-fast-reroute-05.txt, Work in Progress, November 2000.
[13] Owens, K., Makam, S., Sharma, V., Mack-Crane, B., and Huang, C., "A
Path Protection/Restoration Mechanism for MPLS Networks", Internet
Draft draft-chang-mpls-path-protection-02.txt, Work in Progress,
November 2000.