Network Working Group                                           M. Shand
Internet-Draft                                                 S. Bryant
Intended status: Informational                             Cisco Systems
Expires: May 3, 2009                                         P. Francois
                                        Universite catholique de Louvain
                                                        October 30, 2008
Mechanisms for safely abandoning loop-free convergence (AAH)
draft-bryant-francois-shand-ipfrr-aah-01
Status of this Memo
By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on May 3, 2009.
Abstract
IPFRR and loop-free convergence techniques can deal with single
topology change events, multiple correlated change events, and in
some cases even certain uncorrelated events. However, in all cases
there are events which cannot be dealt with and the mechanism needs
to quickly revert to normal convergence. This is known as
"Abandoning All Hope" (AAH). This document describes the nature of
the problem, and various proposed mechanisms to deal with it.
Shand, et al. Expires May 3, 2009 [Page 1]
Internet-Draft Abandon All Hope (AAH) October 2008
Table of Contents
1. Conventions used in this document
2. Introduction
3. Possible Solutions
  3.1. Hold-down timer only
  3.2. Basic per event AAH messages
  3.3. AAH messages
    3.3.1. Per Router State Machine
    3.3.2. Per Neighbor State Machine
4. Management Considerations
5. Scope and applicability
6. IANA Considerations
7. Security Considerations
8. Acknowledgements
9. References
  9.1. Normative References
  9.2. Informative References
Authors' Addresses
Intellectual Property and Copyright Statements
1. Conventions used in this document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1].
2. Introduction
IPFRR[2] and loop-free convergence techniques[3] can deal with single
topology change events, multiple correlated change events, and in
some cases even certain uncorrelated events. However, in all cases
there are events which cannot be dealt with and the mechanism needs
to quickly revert to normal convergence. This is known as
"Abandoning All Hope" (AAH).
A good example is the ordered FIB loop-free convergence technique
(oFIB)[4]. However, the problem and the mechanisms described here
for its resolution are equally applicable to any loop-free
convergence mechanism, such as PLSN[5]. All the routers performing
the calculation must have an identical view of the set of topology
changes under consideration. One technique to ensure this is to
start a hold-down timer on reception of the first event in the hope
that all subsequent events related to the same root cause will arrive
before the timer expires. If this is the case, then all routers in
the network will have acquired an identical set of changes and
processing can continue correctly. However, in some cases the timer
value will be too short to ensure that all the related events have
arrived at all routers (perhaps because there was some unexpected
propagation delay, or one or more of the events are slow in being
detected). In other cases, a completely unrelated event may occur
after the timer has expired, but before the processing is complete.
In either case it is necessary to "Abandon all Hope" and revert to
traditional convergence.
There are a number of problems with this naive approach. Firstly,
since the timer is started at each router on reception of the first
LSP announcing a topology change, the actual starting time is
dependent upon the propagation time of the first LSP. So, for a
subsequent event occurring around the time of the timer expiry,
because of variations in propagation delay it may reach some routers
before the timer expires and others after it has expired. In the
former case this LSP will be included in the set of changes to be
considered, while in the latter it will be excluded and would invoke
an AAH in the routers receiving it. Clearly this would be a
dangerous condition, and it is therefore necessary to arrange that an
AAH invoked anywhere in the network causes ALL routers to AAH. This
can be achieved by reliably propagating an AAH message throughout the
network. However, this raises a second problem, the need to
synchronize the exit from AAH state throughout the network.
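The first problem can be made concrete with a small worked example.
The sketch below is illustrative only: the function name, the
2000 ms hold-down value and the arrival times are assumptions chosen
for the example and do not come from any measurement. It assumes that
each router starts its timer when it receives the first LSP.

```python
def in_same_batch(first_lsp_ms, later_lsp_ms, hold_down_ms):
    """True if the later LSP arrives before the local hold-down timer expires.

    The timer is assumed to start when the first LSP is received locally.
    """
    return later_lsp_ms < first_lsp_ms + hold_down_ms


HOLD_DOWN_MS = 2000  # hypothetical hold-down value shared by both routers

# Arrival times (ms) of the first LSP and of a later LSP at two routers.
# Router B receives the first LSP 40 ms after router A, and the later
# LSP 55 ms after router A, because of differing propagation delays.
router_a = in_same_batch(first_lsp_ms=10, later_lsp_ms=2005,
                         hold_down_ms=HOLD_DOWN_MS)
router_b = in_same_batch(first_lsp_ms=50, later_lsp_ms=2060,
                         hold_down_ms=HOLD_DOWN_MS)

# Router A batches the later LSP with the first one; router B has already
# begun controlled convergence and would have to abandon it.
print(router_a, router_b)  # prints: True False
```

With these numbers the two routers disagree about the set of changes
under consideration, which is the condition that requires an AAH to
be propagated to every router.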
While in AAH state any topology changes previously received, or which
are subsequently received, should be processed immediately using the
traditional convergence algorithms, i.e., without invoking controlled
convergence. If the exit from the AAH state is not correctly
synchronized, a new event may be processed by some routers
immediately (as AAH), while those which have already left AAH state
will treat it as the first of a new batch of changes and attempt
controlled convergence.
3. Possible Solutions
A number of approaches to this problem have been proposed, in
increasing order of complexity:
1. Hold-down timer only. This is the solution proposed in PLSN.
2. Basic per event AAH messages
3. Synchronization of AAH state using AAH messages.
These are described below. The purpose of this draft is to trigger
discussion on the trade-offs between complexity and robustness in the
AAH solution-space.
3.1. Hold-down timer only
This method uses a hold-down to acquire a set of LSPs which should be
processed together. On expiry of the local hold-down timer, the
router begins processing the batch of LSPs according to the loop-free
convergence algorithm.
3.2. Basic per event AAH messages
This method uses signaling between neighbors to announce the
abandoning of controlled convergence.
A router individually decides when it should abandon controlled
convergence for a given (set of) LSP(s). It bases this decision on
the LSP reception timings and the hold down timers defined for the
controlled convergence mechanism used.
When a router makes a decision to abandon controlled convergence for
an LSP, it sends an AAH message to a selected subset of its
neighbors. The message identifies the LSPs for which controlled
convergence was abandoned.
The reception of such a message MUST trigger the decision to abandon
controlled convergence for this LSP by the receiver. The receiver
SHOULD also abandon controlled convergence for the other pending
LSPs.
A router is only allowed to send AAH messages for a given event once.
This can be achieved, for example, with a one-bit flag attached to the
LSP in the LSDB, stating whether convergence has been abandoned and
signaled for this LSP. It can also be achieved by storing the
identification of the LSPs for which convergence was abandoned for a
time that is an order of magnitude longer than a typical IGP
convergence (e.g., 10 seconds). The subset of neighbors to which an
AAH message must be sent by a router R depends on the controlled
convergence mechanism. It may be, but need not be, the set of all
the neighbors of R.
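The second option for the send-once rule, remembering abandoned LSPs
for a bounded time, could be sketched roughly as follows. The class
name AahSentCache, the method should_send and the explicit "now"
argument are hypothetical names introduced for the example. The 10
second default follows the figure suggested in the text, and the LSP
identifiers are made up.

```python
class AahSentCache:
    """Remembers, for a limited time, which LSPs have already caused an AAH."""

    def __init__(self, retention=10.0):
        self.retention = retention  # seconds; illustrative default
        self._sent_at = {}          # LSP identifier -> time the AAH was sent

    def should_send(self, lsp_id, now):
        """Return True only the first time lsp_id is seen within the window."""
        # Forget entries that are older than the retention period.
        self._sent_at = {key: t for key, t in self._sent_at.items()
                         if now - t < self.retention}
        if lsp_id in self._sent_at:
            return False
        self._sent_at[lsp_id] = now
        return True


cache = AahSentCache(retention=10.0)
first = cache.should_send("R1.00-00", now=0.0)    # first AAH for this LSP
repeat = cache.should_send("R1.00-00", now=3.0)   # suppressed as a duplicate
other = cache.should_send("R2.00-00", now=3.0)    # a different LSP
```

A flag held in the LSDB entry would serve the same purpose without a
separate table, at the cost of touching the LSDB data structure.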
For any controlled convergence mechanism, the selection of this
subset MUST be such that if a router R abandons controlled
convergence, all the routers that could create a forwarding loop with
R by not abandoning controlled convergence will eventually abandon
controlled convergence.
For the case of controlled convergence using ordered FIB:
o In the case of a link up / node up / metric decrease event, the
set MUST include the neighbors of R that are on the shortest paths
between R and the originator of the LSP for which controlled
convergence is abandoned.
o In the case of a link down / node down / metric increase event,
the set MUST include the neighbors of R that are upstream of R on
the paths towards the originator of the LSP for which controlled
convergence is abandoned.
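One possible reading of the two ordered FIB rules above is sketched
below. It is not taken from the oFIB specification. The sketch
assumes a connected topology with symmetric integer link costs,
supplied as a dictionary, and it evaluates shortest paths on that
single topology; this document does not say whether the topology
before or after the change is meant. The function names and the
example graph are hypothetical.

```python
import heapq


def distances_to(graph, target):
    """Dijkstra on {node: {neighbor: cost}}; returns cost from each node to target."""
    dist = {target: 0}
    heap = [(0, target)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, cost in graph[u].items():
            nd = d + cost
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def aah_targets(graph, r, originator, event):
    """Neighbors of r that would receive the AAH under the assumed reading.

    event == "up":   neighbors on a shortest path from r to the originator.
    event == "down": neighbors whose shortest path to the originator uses r.
    """
    dist = distances_to(graph, originator)
    if event == "up":
        return {n for n, c in graph[r].items() if c + dist[n] == dist[r]}
    if event == "down":
        return {n for n, c in graph[r].items() if c + dist[r] == dist[n]}
    raise ValueError("event must be 'up' or 'down'")


# Example topology: O-B-R in a line, with A and C hanging off R,
# and an expensive direct link between C and O.
graph = {
    "O": {"B": 1, "C": 5},
    "B": {"O": 1, "R": 1},
    "R": {"B": 1, "A": 1, "C": 1},
    "A": {"R": 1},
    "C": {"R": 1, "O": 5},
}
up_set = aah_targets(graph, "R", "O", "up")      # {"B"}
down_set = aah_targets(graph, "R", "O", "down")  # {"A", "C"}
```

In this example R reaches O through B, so B is the only target for an
"up" event, while A and C both reach O through R and are therefore
the targets for a "down" event.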
3.3. AAH messages
Like the others, this method uses a hold-down to acquire a set of
LSPs which should be processed together. On expiry of the local
hold-down timer, the router begins processing the batch of LSPs
according to the loop-free convergence algorithm. This is the same
behaviour as the hold-down timer only method. However, if any
router, having started the loop-free convergence process, receives an
LSP which would trigger a topology change, it locally abandons the
controlled convergence process, and sends an AAH message to all its
neighbors. This eventually triggers all routers to abandon the
controlled convergence. The routers remain in AAH state (i.e.
processing topology changes using normal "fast" convergence), until a
period of quiescence has elapsed. The exit from AAH state is
synchronized by using a two step process.
To achieve the required synchronization, two additional messages are
required: AAH and AAH ACK. The AAH message is reliably exchanged
between neighbours using the AAH ACK message. These could be
implemented as a new message within the routing protocol or carried
in existing routing hello messages.
Two types of state machine are needed: a per-router AAH state machine
and a per-neighbour AAH state machine (PNSM). These are described
below.
3.3.1. Per Router State Machine
Per Router State Table
+-------------+-----------+---------+--------+------------+----------+
| EVENT | Q | Hold | CC | AAH | AAH-hold |
+=============+===========+=========+========+============+==========+
| RX LSP | Start | - | TX-AAH | Re-start | TX-AAH |
| triggering | hold-down | | Start | AAH timer. | Start |
| change | timer | | AAH | [AAH] | AAH |
| | [Hold] | | timer. | | timer. |
| | | | [AAH] | | [AAH] |
+-------------+-----------+---------+--------+------------+----------+
| RX AAH | TX-AAH | TX-AAH | TX-AAH | [AAH] | TX-AAH |
| (Neighbor's | Start AAH | Start | Start | | Start |
| PNSM | timer. | AAH | AAH | | AAH |
| processes | [AAH] | timer | timer. | | timer. |
| RX AAH.) | | [AAH] | [AAH] | | [AAH] |
+-------------+-----------+---------+--------+------------+----------+
| Timer | - | Trigger | - | Start | [Q] |
| expiry | | CC. | | AAH-hold | |
| | | [CC] | | timer. | |
| | | | | [AAH-hold] | |
+-------------+-----------+---------+--------+------------+----------+
| Controlled | - | - | [Q] | - | - |
| convergence | | | | | |
| completed | | | | | |
+-------------+-----------+---------+--------+------------+----------+
TX-AAH = Send "goto TX-AAH" to all other PNSMs.
Operation of the per-router state machine is as follows.
Under normal topology change, only three states are involved:
Quiescent (Q), Hold-down (Hold) and Controlled
Convergence (CC). The remaining states are associated with an AAH
event.
The resting state is Quiescent. When the router in the Quiescent
state receives an LSP indicating a topology change, which would
normally trigger an SPF, it starts the Hold-down timer and changes
state to Hold-down. It normally remains in this state, collecting
additional LSPs until the Hold-down timer expires. Note that all
routers MUST use a common value for the Hold-down timer. When the
Hold-down timer expires the router then enters Controlled Convergence
(CC) state and executes the CC mechanism to re-converge the topology.
When the CC process has completed on the router, the router re-enters
the Quiescent state.
If this router receives a topology changing LSP whilst it is in the
CC state, it enters AAH state, and sends a "goto TX-AAH" command to
all per neighbour state machines which causes each per-neighbour
state machine to signal this state change to its neighbour.
Alternatively, if this router receives an AAH message from any of its
neighbors whilst in any state except AAH, it starts the AAH timer and
enters the AAH state. The per neighbor state machine corresponding
to the neighbor from which the AAH was received executes the RX AAH
action (which causes it to send an AAH ACK), while the remainder are
sent the "goto TX-AAH" command. The result is that the AAH is
acknowledged to the neighbor from which it was received and
propagated to all other neighbors. On entering AAH state, all CC
timers are expired and normal convergence takes place.
Whilst in the AAH state, LSPs are processed in the traditional
manner. Each time an LSP is received, the AAH timer is restarted.
In an unstable network ALL routers will remain in this state for some
time and the network will behave in the traditional uncontrolled
convergence manner.
When the AAH timer expires, the router enters AAH-hold state and
starts the AAH hold timer. The purpose of the AAH-hold state is to
synchronize the transition of the network from AAH to Quiescent. The
additional state ensures that the network cannot contain a mixture of
routers in both AAH and Quiescent states. If, whilst in AAH-Hold
state the router receives a topology changing LSP, it re-enters AAH
state and commands all per neighbour state machines to "goto TX-AAH".
If, whilst in AAH-Hold state the router receives an AAH message from
one of its neighbours, it re-enters the AAH state and commands all
other per neighbour state machines to "goto TX-AAH". Note that the
per-neighbor state machine receiving the AAH message will
autonomously acknowledge receipt of the AAH message. Commanding the
per-neighbour state machine to "goto TX-AAH" is necessary, because
routers may be in a mixture of Quiescent, Hold-down and AAH-hold
state, and it is necessary to rendezvous the entire network back to
AAH state.
When the AAH Hold timer expires the router changes to state Quiescent
and is ready for loop free convergence.
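As an aid to discussion, the Per Router State Table can be
transcribed into a small lookup table. This is a sketch rather than
an implementation: the event and action names are invented labels for
the table entries, timers and message transmission are represented
only as action names, and "-" entries are assumed to mean "no action,
no state change", which the table does not state explicitly.

```python
# (state, event) -> (actions, next state), transcribed from the state table.
# Pairs that are absent correspond to "-" entries (assumed to be no-ops).
ENTER_AAH = (("tx_aah", "start_aah_timer"), "AAH")

TRANSITIONS = {
    ("Q", "rx_lsp"): (("start_hold_down_timer",), "HOLD"),
    ("CC", "rx_lsp"): ENTER_AAH,
    ("AAH", "rx_lsp"): (("restart_aah_timer",), "AAH"),
    ("AAH_HOLD", "rx_lsp"): ENTER_AAH,
    ("Q", "rx_aah"): ENTER_AAH,
    ("HOLD", "rx_aah"): ENTER_AAH,
    ("CC", "rx_aah"): ENTER_AAH,
    ("AAH", "rx_aah"): ((), "AAH"),
    ("AAH_HOLD", "rx_aah"): ENTER_AAH,
    ("HOLD", "timer_expiry"): (("trigger_cc",), "CC"),
    ("AAH", "timer_expiry"): (("start_aah_hold_timer",), "AAH_HOLD"),
    ("AAH_HOLD", "timer_expiry"): ((), "Q"),
    ("CC", "cc_completed"): ((), "Q"),
}


def step(state, event):
    """Return (actions, next_state) for a single event."""
    return TRANSITIONS.get((state, event), ((), state))


def run(events, state="Q"):
    """Feed a sequence of events and return the list of states visited."""
    visited = [state]
    for event in events:
        _actions, state = step(state, event)
        visited.append(state)
    return visited


# Normal topology change: Q -> HOLD -> CC -> Q.
normal = run(["rx_lsp", "timer_expiry", "cc_completed"])

# A further LSP arrives during controlled convergence, after which the
# network is quiet long enough for both AAH timers to expire.
abandoned = run(["rx_lsp", "timer_expiry", "rx_lsp",
                 "timer_expiry", "timer_expiry"])
```

The second trace visits Q, HOLD, CC, AAH, AAH_HOLD and Q in that
order, matching the two-step exit from AAH state described above.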
3.3.2. Per Neighbor State Machine
Per Neighbor State Table
+----------------------------+--------------+------------------------+
| EVENT | Idle | TX-AAH |
+============================+==============+========================+
| RX AAH | Send ACK. | Send ACK. |
| | | Cancel timer. |
| | [IDLE] | [IDLE] |
+----------------------------+--------------+------------------------+
| RX ACK | ignore | Cancel timer. |
| | | [IDLE] |
+----------------------------+--------------+------------------------+
| RX "goto TX-AAH" from | Send AAH | ignore |
| Router State Machine | [TX-AAH] | |
+----------------------------+--------------+------------------------+
| Timer expires | impossible | Send AAH |
| | | Restart timer. |
| | | [TX-AAH] |
+----------------------------+--------------+------------------------+
There is one instance of the per-neighbour (PN) state machine for
each neighbour within the convergence control domain.
The normal state is IDLE.
On command ("goto TX-AAH") from the router state machine, the state
machine enters TX-AAH state, transmits an AAH message to its
neighbour and starts a timer.
On receipt of an AAH ACK in state TX-AAH the state machine cancels
the timer and enters IDLE state.
In state IDLE, any AAH ACK message received is ignored.
On expiry of the timer in state TX-AAH the state machine transmits an
AAH message to the neighbour and restarts the timer. (The timer
cannot expire in any other state.)
In any state, receipt of an AAH causes the state machine to transmit
an AAH ACK and enter the IDLE state.
Note that for correct operation the state machine MUST remain in
state TX-AAH, until an AAH ACK or an AAH is received, or the state
machine is deleted. Deletion of the per neighbor state machine
occurs when routing determines that the neighbour has gone away, or
when the interface goes away.
When routing detects a new neighbour it creates a new instance of the
per-neighbour state machine in state Idle. The consequent generation
of the router's own LSP will then cause the router state machine to
execute the LSP receipt actions, which will if necessary result in
the new per-neighbour state machine receiving a "goto TX-AAH" command
and transitioning to TX-AAH state.
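The per-neighbour behaviour described in this section could be
sketched as follows. The class name, the method names and the "send"
callback are assumptions made for the example. The retransmission
timer is modelled only as a boolean, since no timer value is given in
this document.

```python
class PerNeighborSM:
    """Sketch of one per-neighbour state machine (states IDLE and TX_AAH)."""

    def __init__(self, send):
        self.state = "IDLE"
        self.timer_running = False
        self.send = send  # callable taking "AAH" or "ACK" (hypothetical hook)

    def rx_aah(self):
        # In any state: acknowledge, stop retransmitting, return to IDLE.
        self.send("ACK")
        self.timer_running = False
        self.state = "IDLE"

    def rx_ack(self):
        # An ACK only matters while an AAH is outstanding; ignored in IDLE.
        if self.state == "TX_AAH":
            self.timer_running = False
            self.state = "IDLE"

    def goto_tx_aah(self):
        # Command from the router state machine; ignored if already sending.
        if self.state == "IDLE":
            self.send("AAH")
            self.timer_running = True
            self.state = "TX_AAH"

    def timer_expired(self):
        # Retransmit until an ACK or an AAH arrives from the neighbour.
        if self.state == "TX_AAH":
            self.send("AAH")
            self.timer_running = True


sent = []
pnsm = PerNeighborSM(sent.append)
pnsm.goto_tx_aah()    # sends the first AAH
pnsm.timer_expired()  # no ACK yet, so the AAH is retransmitted
pnsm.rx_ack()         # neighbour acknowledged; back to IDLE
```

After this sequence, "sent" holds two AAH messages and the state
machine is back in IDLE with its timer cancelled.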
4. Management Considerations
The management requirements will depend upon the solution adopted,
but at the very least there needs to be reporting of the current
state.
5. Scope and applicability
The initial scope of this work is in the context of link state IGPs.
6. IANA Considerations
There are no IANA considerations that arise from this document.
7. Security Considerations
This document does not itself introduce any security issues, but
attention must be paid to the security implications of any proposed
solutions to the problem.
8. Acknowledgements
The authors would like to acknowledge contributions made by Les
Ginsberg.
9. References
9.1. Normative References
[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", BCP 14, RFC 2119, March 1997.
9.2. Informative References
[2] Shand, M. and S. Bryant, "IP Fast Reroute Framework",
draft-ietf-rtgwg-ipfrr-framework-09 (work in progress),
October 2008.
[3] Shand, M. and S. Bryant, "A Framework for Loop-free
Convergence", draft-ietf-rtgwg-lf-conv-frmwk-02 (work in
progress), February 2008.
[4] Francois, P., "Loop-free convergence using oFIB",
draft-ietf-rtgwg-ordered-fib-02 (work in progress),
February 2008.
[5] Zinin, A., "Analysis and Minimization of Microloops in Link-
state Routing Protocols", draft-ietf-rtgwg-microloop-analysis-01
(work in progress), October 2005.
Authors' Addresses
Mike Shand
Cisco Systems
250, Longwater Avenue.
Reading, Berks RG2 6GB
UK
Email: mshand@cisco.com
Stewart Bryant
Cisco Systems
250, Longwater Avenue.
Reading, Berks RG2 6GB
UK
Email: stbryant@cisco.com
Pierre Francois
Universite catholique de Louvain
Email: pierre.francois@uclouvain.be
URI: http://inl.info.ucl.ac.be/pfr
Full Copyright Statement
Copyright (C) The IETF Trust (2008).
This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.
This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.