Network Working Group A. Farrel
Request for Comments: 3612 Old Dog Consulting
Category: Informational September 2003
Applicability Statement for Restart Mechanisms
for the Label Distribution Protocol (LDP)
Status of this Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2003). All Rights Reserved.
Abstract
This document provides guidance on when it is advisable to implement
some form of Label Distribution Protocol (LDP) restart mechanism and
which approach might be more suitable. The issues and extensions
described in this document are equally applicable to RFC 3212,
"Constraint-Based LSP Setup Using LDP".
1. Introduction
Multiprotocol Label Switching (MPLS) systems are used in core
networks where system downtime must be kept to a minimum. Similarly,
where MPLS is at the network edges (e.g., in Provider Edge (PE)
routers) [RFC2547], system downtime must also be kept to a minimum.
Many MPLS Label Switching Routers (LSRs) may, therefore, exploit
Fault Tolerant (FT) hardware or software to provide high availability
of the core networks.
The details of how FT is achieved for the various components of an FT
LSR, including the switching hardware and the TCP stack, are
implementation specific. How the software module itself chooses to
implement FT for the state created by the LDP is also implementation
specific. However, there are several issues in the LDP specification
[RFC3036] that make it difficult to implement an FT LSR using the LDP
protocols without some extensions to those protocols.
Proposals have been made in [RFC3478] and [RFC3479] to address these
issues.
2. Requirements of an LDP FT System
Many MPLS LSRs may exploit FT hardware or software to provide high
availability (HA) of core networks. In order to provide HA, an MPLS
system needs to be able to survive a variety of faults with minimal
disruption to the Data Plane, including the following fault types:
- failure/hot-swap of the switching fabric in an LSR,
- failure/hot-swap of a physical connection between LSRs,
- failure of the TCP or LDP stack in an LSR,
- software upgrade to the TCP or LDP stacks in an LSR.
The first two examples of faults listed above may be confined to the
Data Plane. Such faults can be handled by providing redundancy in
the Data Plane which is transparent to LDP operating in the Control
Plane. However, the failure of the switching fabric or a physical
link may have repercussions in the Control Plane since signaling may
be disrupted.
The third example may be caused by a variety of events including
processor or other hardware failure, and software failure.
Any of the last three examples may impact the Control Plane and will
require action in the Control Plane to recover. Such action should
be designed to avoid disrupting traffic in the Data Plane. Since
many recent router architectures can separate the Control and Data
Planes, it is possible that forwarding can continue unaffected by
recovery action in the Control Plane.
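The fault classification above can be sketched as a small lookup
table. This is a minimal illustration of one reading of the text, not
part of any LDP specification: the names Plane, Fault, MAY_IMPACT and
needs_control_plane_recovery are hypothetical, and the rule that a
fabric or link fault needs Control Plane action only when the Control
Plane is actually affected is an assumption made for the example.

```python
from enum import Enum, Flag, auto


class Plane(Flag):
    """Planes of an LSR that a fault may touch (hypothetical model)."""
    DATA = auto()
    CONTROL = auto()


class Fault(Enum):
    """The four fault types listed in this section."""
    FABRIC = "failure/hot-swap of the switching fabric"
    LINK = "failure/hot-swap of a physical connection"
    STACK_FAILURE = "failure of the TCP or LDP stack"
    STACK_UPGRADE = "software upgrade to the TCP or LDP stacks"


# Planes each fault MAY impact, following the paragraphs above.
# Fabric and link faults may be confined to the Data Plane, but may
# also have repercussions in the Control Plane.
MAY_IMPACT = {
    Fault.FABRIC: Plane.DATA | Plane.CONTROL,
    Fault.LINK: Plane.DATA | Plane.CONTROL,
    Fault.STACK_FAILURE: Plane.CONTROL,
    Fault.STACK_UPGRADE: Plane.CONTROL,
}


def needs_control_plane_recovery(fault, control_plane_affected=False):
    """Return True if Control Plane recovery action is needed.

    Assumption for this sketch: faults that can only touch the
    Control Plane always need such action; faults that may be
    confined to the Data Plane need it only when the caller reports
    that the Control Plane was in fact affected.
    """
    if Plane.DATA not in MAY_IMPACT[fault]:
        return True
    return control_plane_affected
```

Under these assumptions, a fabric hot-swap that is masked by Data
Plane redundancy needs no Control Plane action, whereas an LDP stack
failure always does.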
In other scenarios, the Data and Control Planes may be impacted by a
fault, but the needs of HA require the coordinated recovery of the
Data and Control Planes to a state that existed before the fault.
The provision of protection paths for MPLS LSPs, and the protection
of links, IP routes or tunnels through the use of protection LSPs,
are outside the scope of this document. See [RFC3469] for further
details.
3. General Considerations
In order for the Data and Control Plane states to be successfully
recovered after a fault, procedures are required to ensure that the
state held on a pair of LDP peers (at least one of which was affected
directly by the fault) is synchronized. Such procedures must be
implemented in the Control Plane software modules on the peers using
Control Plane protocols.
The required actions may operate fully after the failure (reactive
recovery) or may contain elements that operate before the fault in
order to minimize the actions taken after the fault (proactive