
Last Call Review of draft-ietf-pals-endpoint-fast-protection-04
review-ietf-pals-endpoint-fast-protection-04-genart-lc-worley-2016-12-06-00

Request Review of draft-ietf-pals-endpoint-fast-protection
Requested revision No specific revision (document currently at 05)
Type Last Call Review
Team General Area Review Team (Gen-ART) (genart)
Deadline 2016-12-06
Requested 2016-11-23
Authors Yimin Shen, Rahul Aggarwal, Wim Henderickx, Yuanlong Jiang
I-D last updated 2016-12-06
Completed reviews Genart Last Call review of -04 by Dale R. Worley
Secdir Last Call review of -04 by Chris M. Lonvick
Opsdir Last Call review of -04 by Susan Hares
Rtgdir Early review of -00 by Mach Chen
Rtgdir Early review of -00 by John Drake
Tsvart Last Call review of -04 by David L. Black
Assignment Reviewer Dale R. Worley
State Completed
Request Last Call review on draft-ietf-pals-endpoint-fast-protection by General Area Review Team (Gen-ART) Assigned
Reviewed revision 04 (document currently at 05)
Result Ready w/nits
Completed 2016-12-06
I am the assigned Gen-ART reviewer for this draft.  The General Area
Review Team (Gen-ART) reviews all IETF documents being processed
by the IESG for the IETF Chair.  Please treat these comments just
like any other last call comments.

For more information, please see the FAQ at
<http://wiki.tools.ietf.org/area/gen/trac/wiki/GenArtfaq>.

Document: draft-ietf-pals-endpoint-fast-protection-04
Reviewer: Dale R. Worley
Review Date: 2016-12-06
IETF LC End Date: 2016-12-06
IESG Telechat date: (not set)

Before reviewing this document, I knew nothing about pseudowire
routing, so my review does not fully assess the technical aspects of
the document.  I suspect that almost all of these items are editorial.
I hope that they are simply places where the description can be made
clearer to the naive reader by making it more exact.  Some of them may
be places where I don't fully understand the technology.  It's
possible that some of them are technical issues.

Summary:

This draft is basically ready for publication, but has nits that
should be fixed before publication.

These items seem to have technical content:

In sections 4.2 and 4.3, it is not clear to me whether all of the PWs
being carried by a particular tunnel must have the same protector.
Section 4.2 says that it is so, but section 4.3 suggests that the PWs
can be divided into subsets which have different protectors, without
mentioning any correspondence between the subsets and the
tunnel-groups of PWs.

What are the mechanics of context identifiers in the context of
protecting egress ACs?  It seems like this should all be analogous to
the described cases (protecting PEs), but as this is a specification,
how the analogy works should be made explicit.

In section 6.4, does there need to be a specification for the
"Reserved" fields of the Protection FEC Element TLV?  Usually, there
is a specification "must be sent with zeros/must be ignored on
receipt".  (Or is there a global understanding of this in LDP?)

In section 6.4, there is a definition of "PW Information".  Are there
previous definitions of "PW information" in PW/LDP/MPLS usage?  It
looks a lot like what's defined in RFC 4447 sections 5.2 and 5.3.  If
they are semantically the same, the same format should be used and
simply referred to here.  If the format is different because they
aren't semantically the same ... perhaps there should be a note
explaining how/why.

Nits/editorial comments:

There are quite a number of places where "a" or "the" seems to be
omitted before nouns.  I'll let the RFC Editor identify/correct those.

There is a general issue regarding what aspects of local repair are
configured by some external means (e.g., the protectors that PLRs use,
the context identifiers) and what aspects are established
automagically by the defined mechanisms (the bypass tunnels).  The
document would have been clearer to me if these were separated
explicitly, but I suspect that it is common usage in routing
specifications for the reader to sort it out.

Abstract

IMO it would help if the Abstract mentioned that only IP/MPLS
transport tunnels are protected.  I say this because the only part of
this technology that I'd previously heard of was MPLS; such a
specification in the abstract would tell a large body of potential
readers that the document is *not* relevant to their situation.

1.  Introduction

   The mechanism is applicable to LDP signaled PWs.  It is relevant to
   networks with redundant PWs and multi-homed CEs.  It is designed on
   the basis of MPLS upstream label assignment and context-specific
   label switching [RFC5331].

It might be more informative to say "Fast failure protection for
pseudowire endpoints" rather than "The mechanism".

Which of the factors listed are prerequisites for the use of this
mechanism and which are factors which this mechanism additionally
supports?

Whichever of these are prerequisites for the solution should be
mentioned as such in the Abstract, including particularly that the PW
must be carried by IP/MPLS transport tunnels (as described in section
4.1 paragraph 2).

   Fast protection refers to its ability to
   restore traffic in the order of tens of milliseconds.  Compared with
   global repair and control plane repair, this mechanism can provide
   faster service restoration.

What is the time scale of "global repair and control plane repair"?
Given that you give the time scale of "fast protection", it would be
informative to have comparable values for other repair techniques.

   However, it is intended to complement
   those mechanisms, rather than replacing them.

It might be useful to explain why global/control plane repair is still
needed.  (Then again, that might be obvious to anyone in the field.)

3.1.  Single-Segment PW

   For either mode, when considering the traffic flowing in a given
   direction over an active path, this document views the ACs, PEs and
   PWs to serve primary or backup roles.  In particular, the ACs, PEs
   and PW along this active path are primary, while those along the
   other path are backup.  Note that in the active-active mode, the
   backup path is an active path by itself, carrying its own share of
   traffic while protecting the other active path.

The wording here doesn't seem to be quite right.  I think you need to
phrase it more like this:

   For either mode, when considering the traffic flowing in a given
   direction over an active path, this document views the ACs, PEs and
   PWs to serve primary or backup roles or both.  In particular, given
   an active path, the ACs, PEs and PW along that active path have
   primary roles, while those along the other path have backup roles.
   Note that in the active-active mode, each AC, PE, and PW has a
   primary role (due to being on an active path) and also a backup
   role protecting the other path (which is also active).

--

   For clarity, primary egress AC, primary egress PE, backup egress AC,
   and backup egress PE may simply be referred to as primary AC, primary
   PE, backup AC, and backup PE, respectively, when the context of a
   discussion is egress endpoint.

This is correct, but it seems to overlook the use of "primary PE" to
mean "the PE that is being protected (when we are considering a
particular PLR and protector)".  That usage is common in the rest of
the document, but difficult for the naive reader to abstract from the
above text.

4.1.  Applicability

   In a network where
   transport tunnels may provide ECMP to primary PEs, care should be
   taken to prevent misordered packet delivery during local repair.

Should "ECMP" get a reference?  Or is it so common that any reader of
this document can be assumed to know already?

   In a network where
   transport tunnels may provide ECMP to primary PEs, care should be
   taken to prevent misordered packet delivery during local repair.

Perhaps this should be qualified with "if the PW or some flows
within the PW are sensitive to packet misordering", as is mentioned
later in the paragraph.

Naively, it seems that with ECMP it is impossible to prevent packet
reordering.  (E.g., it's hard to imagine using ECMP with ATM circuits
unless the destination performed SAR separately for each path.)
However, the scenario developed in the paragraph could cause a
dramatic increase in the "distance" that packets are reordered, and
that increase could be a problem.  Perhaps the text could emphasize
that the problem isn't so much the absolute presence of reordering but
its magnitude.
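
As a rough illustration of what "distance" could mean here, the sketch
below measures how far any packet ends up from its sent position.  It
is not taken from the draft; the function name and the sample arrival
orders are made up for illustration.

```python
def max_reorder_distance(arrivals):
    """Largest gap between a packet's sent position and arrival position.

    `arrivals` lists sequence numbers (0..n-1, each exactly once) in the
    order the packets arrived.
    """
    return max((abs(pos - seq) for pos, seq in enumerate(arrivals)), default=0)


# Hypothetical arrival orders, for illustration only.
mild = [0, 2, 1, 3, 5, 4]    # neighbouring packets swapped
severe = [4, 5, 0, 1, 2, 3]  # two packets overtake four earlier ones

print(max_reorder_distance(mild), max_reorder_distance(severe))
```

Both sequences are "reordered", but a receiver that tolerates small
swaps may still fail on the second one, which is the distinction the
text could draw.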

4.2.  Local Repair and Protector

I find this title peculiarly hard to parse.  I think the problem is
the ambiguity whether "local" modifies "protector" or not, as well as
the fact that "local repair" is a concept whereas "protector" is a
singular device.  Perhaps "Local Repair and Protectors" ("protectors"
is a class of devices, with similar abstractness as "local repair") or
"Local Repair, Bypass Tunnels, and Protectors" (since bypass tunnels
seem to be a similarly critical part of the solution).

   In anticipation of the failure, the PLR MUST
   also pre-establish a bypass tunnel to a "protector", and pre-install
   a bypass route in the data plane.

It might be clearer to say "a bypass route for the bypass tunnel in
the data plane".

It would be useful to mention here that the protector is either a
backup PE which, in a sense, functions as an alternative terminus for
the PW segment, or a router which will forward traffic to such a
backup PE.

This raises the point that, in a way, the bypass tunnel is an
alternative continuation of the protected transport tunnel.  E.g., it
has as the destination endpoint the same virtual interface address
(the context identifier).  Or does the PLR, when repairing, act as a
proper downstream PE for the primary PW segment, and a proper upstream
PE for the PW segment in the bypass tunnel?  (I'm likely not using the
terminology correctly here, but I suspect that the terminology is not
used in a strictly correct manner in section 4.2.)

   Upon
   detecting the failure, the PLR invokes the bypass route in the data
   plane, and reroutes PW traffic to the protector through the bypass
   tunnel.

Does "invokes the bypass route" have a clear meaning, i.e., does this
use of "invoke" have a proper definition?  I think you can elide that
phrase and just say "the PLR reroutes PW traffic to the protector
...", with better clarity.

   In this document, the PLR
   simply computes and establishes a node protection bypass tunnel in
   the same fashion as the normal IP/MPLS node protection, except that
   with the notion of context identifier, the bypass tunnel will be
   established from the PLR to the protector (Section 4.6).

Is there a reference for "the normal IP/MPLS node protection"?
Without the context identifier, what would happen that contrasts with
"from the PLR to the protector"?

   On the other
   hand, this imposes a requirement on the protector that it MUST be
   able to forward the packets based on a PW label that is assigned by
   the primary PE, and ensure that the traffic MUST eventually reach the
   target CE.

More strictly, the protector must forward the packets *along the
backup path*.  Of course, that's implied by the rest of the document,
but it might be worth including that fact in this statement.  It also
implies that all of the PEs along the backup path must be able to
forward the packets based on the PW label of the primary path.  That's
also implied, but I don't think it's stated anywhere.

4.3.  Context Identifier

   Likewise, the PWs terminated on a primary PE may be protected by
   multiple protectors, each for a subset of the PWs.

I think this is actually "PW segments" rather than "PWs" (since a PW
only terminates at the egress PE).  There seems to be a general issue
in the document of using "PW" in places where the strictly correct term
is "PW segment".

But compare with section 4.2, "This requires that the protector given
by the bypass tunnel MUST be intended for all the PWs carried by the
[primary] transport tunnel."  That means that all PWs in one tunnel
must be within one of these subsets of PWs.  Is that really correct?

It seems to me that the context identifier is used as the address of
the virtual interface that is the downstream end of the protected tunnel.
If so, it would have helped me make a mental model of the process by
stating that here in section 4.3 -- 4.3 states that a context
identifier is an address, but does not specify what it is the address
of.

What are the mechanics of context identifiers in the context of
protecting egress ACs?  Since there is a bypass tunnel, there must be
an address of the virtual interface within the protector that is the
terminus of the bypass tunnel.  It seems like this address should be
the context identifier used by the PLR, but in this situation there is
no downstream "primary PE" from which to make a {primary PE,
protector} pair.  I think that what works is to use the destination CE
in place of the primary PE.  That's straightforward, since the pairs
are never directly represented in the protocol, so most of the
mechanics should be unchanged.  In this situation, the PLR can't learn
the context identifier from the primary PE, but I suppose it can
invent/be configured with the context identifier itself, since it
doesn't need to establish a tunnel to the primary PE.  It seems like
this should all be analogous to the described cases, but as this is a
specification, how the analogy works should be made explicit.

4.3.1.  Semantics

      A distinct context identifier MUST be assigned to
      the primary PE and each protector.

This sentence is not correct as written since it states that context
identifiers are assigned to primary PEs and also are assigned to
protectors.  I think you meant to say

      A distinct context identifier MUST be assigned to
      each pair of primary PE and protector that it uses.

or 

      A distinct context identifier MUST be assigned to
      each {primary PE, protector} pair.
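
To make the intended reading concrete, here is a minimal sketch of a
check that no context identifier is shared between two {primary PE,
protector} pairs.  It assumes my reading of the requirement is the
right one; the node names and addresses are hypothetical and do not
come from the draft.

```python
from collections import Counter


def duplicated_context_ids(assignments):
    """Return context identifiers that are assigned to more than one pair."""
    counts = Counter(assignments.values())
    return sorted(cid for cid, n in counts.items() if n > 1)


# Hypothetical node names and addresses (documentation range 192.0.2.0/24).
assignments = {
    ("PE1", "PE2"): "192.0.2.1",
    ("PE1", "PE3"): "192.0.2.2",
    ("PE4", "PE2"): "192.0.2.1",  # reuses the PE1/PE2 identifier: not distinct
}

print(duplicated_context_ids(assignments))
```

Note that PE1 appears in two pairs with two different identifiers,
which the original wording ("assigned to the primary PE") does not
obviously allow.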

4.3.2.  FEC

   In an MPLS network, a context identifier represents a FEC (Forwarding
   Equivalence Class) for transport tunnels and bypass tunnels destined
   for it.

This doesn't seem to match the definition of "context identifier"
given previously, as a single context identifier can be used by
several PW segments ending at a PE, and those PWs carry packets of
different FECs.  Is this perhaps "context label"?  Or are there
multiple meanings of "context identifier" which should be explicitly
disambiguated?  Or am I misreading this?

4.5.  Transport Tunnel

   In egress PE node protection and S-PE node protection,
   when a node failure is detected on any ECMP branch, the penultimate
   hop router SHOULD act as a PLR to reroute all the traffic of the ECMP
   set to the protector.

I think "node failure" should be "link failure" here -- if there is
*node* failure (of the PE), then none of the ECMP paths will work,
and the PLR would have to reroute all the traffic anyway.

4.6.  Bypass Tunnel

   A bypass tunnel SHOULD NOT need to be further protected against a
   transit link failure, transit node failure, or egress node failure.

I think this is an incorrect use of "SHOULD NOT", as it is not a
qualifier of a requirement on a device.  I think that you mean

   A bypass tunnel SHOULD NOT be protected against a
   transit link failure, transit node failure, or egress node failure.

But you might want to explain it with

   There is little or no benefit from protecting a bypass tunnel, so
   a bypass tunnel SHOULD NOT ...

4.7.  Examples of Forwarding State

   In the forwarding plane, it is indicated by
   the bypass tunnel(s) destined for the context identifier.

What is "it"?  I think it refers to "each label space", but in that
case I think "they" would be the preferred usage.  Better, replace
"it" with a noun phrase.

5.  Revertive Behavior

      Possible triggers of
      global repair include PW status notification, VCCV, BFD, end-to-
      end OAM between CEs, etc.

I see that "VCCV" is the topic of RFC 5085, but it is not the name of
a page in Wikipedia, which suggests that giving a reference for it
would be useful.

   Particularly in the
   case of egress PE and S-PE node failures, if the ingress PE or the
   protector loses communication with the (S-)PE for an extensive period
   of time, LDP session may go down.

"the (S-)PE" isn't unique in this context, as there is an ingress PE
in the discourse.  I think you mean "the primary (S-)PE" (i.e., the
protected PE).

6.  LDP Extensions

   To facilitate the procedures,

Usually "facilitate" only applies to making easier something that
could be done already.  In this case, the new TLV is necessary to
enable these procedures, so I suggest s/facilitate/support/.

   The procedures in this section are only applicable, if the protector
   advertises the Egress Protection Capability TLV, the primary PE
   supports the advertisement of the Protection FEC Element TLV, and in
   the centralized protector model, the backup PE also supports the
   advertisement of the Protection FEC Element TLV.

My understanding is that "the procedures in this section" means
"establishing endpoint fast failure protection via LDP".  That seems
to be sufficiently important that it is probably worth expanding this
sentence into a bullet list:

   The procedures in this section are only applicable if:

       o the protector advertises the Egress Protection Capability
         TLV,

       o the primary PE supports the advertisement of the Protection
         FEC Element TLV, and

       o in the centralized protector model, the backup PE also
         supports the advertisement of the Protection FEC Element TLV.

Subtly, this seems to be the list of preconditions under which the PLR
can establish the needed tunnels and perform the local repair -- this
list does not list that the PLR must implement the needed behaviors,
which is obviously a precondition of local repair.  Or perhaps it is a
list of LDP capabilities that are required to set up local repair via
LDP.  In either case, it suggests the first clause could be revised
to replace "the procedures ..." with a phrase that is more specific as
to what, precisely, is enabled by the conjunction of these three
items.

6.1.  Egress Protection Capability TLV

   The TLV Code Point is TBD.  It needs to be assigned by IANA.

It seems like this should be flagged with a note so the RFC Editor can
put in the assigned value.

   The "Capability Data" is encoded with the context identifier of the
   {primary PE, protector}.

This should be something like:

   The "Capability Data" is encoded with the context identifiers for
   the {primary PE, protector} pairs for which the advertiser is the
   protector.

6.4.  Protection FEC Element TLV

Does there need to be a specification for the "Reserved" fields?
Usually, there is a specification "must be sent with zeros/must be
ignored on receipt".  (Or is there a global specification of this in
LDP?)
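
For reference, the usual rule looks like the sketch below.  The
layout used here (one type byte, one reserved byte, a two-byte length)
is hypothetical and chosen only to keep the example short; it is not
the layout of the Protection FEC Element TLV in the draft.

```python
import struct

# Hypothetical header layout: type (1 byte), reserved (1 byte), length (2 bytes).
HEADER = struct.Struct("!BBH")


def encode(elem_type, payload):
    # The reserved byte is always sent as zero.
    return HEADER.pack(elem_type, 0, len(payload)) + payload


def decode(data):
    elem_type, _reserved, length = HEADER.unpack_from(data)
    # The reserved byte is read but deliberately ignored on receipt.
    return elem_type, data[HEADER.size:HEADER.size + length]
```

A receiver written this way accepts a message whose reserved byte is
non-zero, which is what lets a later revision assign meaning to those
bits without breaking older implementations.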

     ~                         PW Information                        ~

Are there previous definitions of "PW information" in PW/LDP/MPLS
usage?  It looks a lot like what's defined in RFC 4447 sections 5.2
and 5.3.  If they are semantically the same, the same format should be
used and simply referred to here.  If they aren't semantically the
same ... perhaps there should be a note explaining why.

7.  IANA Considerations

The assignment for the Egress Protection Capability TLV should be
described more definitively.  My impression is that it should be
assigned out of the "TLV Type Name Space" in the LDP parameters page,
but that doesn't seem to be stated explicitly.  Also, there seem to be
three blocks of values (0x0001-0x07FF, 0x0800-0x08FF, 0x0900-0x3DFF)
that are all marked as "IETF consensus", but which presumably differ
in some manner.  Which block should the value be in?

Perhaps for clarity, an appropriate skeleton protocol assignment table
row should be shown.

   Value  Hex   Name                    Label Advertisement Discipline
   -------------------------------------------------------------------
   131    0x83  Protection FEC Element  DU

Section 6.2 calls the new TLV in the Label Mapping message "a
Protection FEC Element TLV", but section 7 calls it an "LDP Protection
FEC Type Name Space value".  The latter phrase consists of 7
successive nouns and is (IMO) unparsable by a reader who doesn't
already know what it means.  I suggest changing it to "the Protection
FEC Element value in the LDP FEC Type Name Space" (which aligns with
the titles used in the IANA protocol assignment web page).

It appears that the values in the FEC Type Name Space are 8 bits but
one place in section 7 gives the value as "0x083", which should
presumably be changed to "0x83".
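
A small sketch of the formatting I would expect for an 8-bit value
follows; the helper name is made up, and the only fact taken from the
draft is the value 131.

```python
def fec_type_hex(value):
    """Format an 8-bit FEC type value as two hex digits."""
    if not 0 <= value <= 0xFF:
        raise ValueError("FEC type must fit in 8 bits")
    return f"0x{value:02X}"


print(fec_type_hex(131))
```

Two hex digits cover exactly 8 bits, so a three-digit form such as
"0x083" suggests a wider field than the name space has.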

8.  Security Considerations

   In all scenarios, the role of
   protector is entirely managed by network operator, and backup PEs can
   be used anyway to host PWs and LDP sessions.

Perhaps "anyway" could be clarified as:

   ... and, regardless of local repair, backup PEs can
   be used to host PWs and LDP sessions. 

--

   In general, [RFC5920] describes the security framework for MPLS
   networks.  [RFC3209] describes the security considerations for RSVP
   LSPs.  [RFC5036] describes the security considerations for the base
   LDP specification.  [RFC5561] describes the security considerations
   which apply when using the LDP capability mechanism.  All these
   security framework and considerations apply to this document as well.

Is there a reference for security considerations for "IP tunnels" (as
mentioned in sections 4.3.1 and 4.6)?

Dale