Packetization Layer Path MTU Discovery for Datagram Transports
draft-ietf-tsvwg-datagram-plpmtud-22

Note: This ballot was opened for revision 17 and is now closed.

Martin Duke Yes

Comment (2020-03-30 for -17)
I'm excited for this to go to RFC once the QUIC reference clears.

Normative issue:

Sec 3, item #9 "An update to the PLPMTU (or MPS) MUST NOT modify the congestion window". Would it not be OK to round down the congestion window to remain a multiple of the MPS? Probably not a big issue in practice, but I would hate to constrain an implementer in this way. I would suggest "MUST NOT increase".
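As a rough illustration of the rounding suggested above, a minimal Python sketch follows. The function name and the choice to leave a window smaller than one MPS untouched are assumptions made for illustration; neither comes from the draft.

```python
def round_cwnd_down_to_mps(cwnd: int, mps: int) -> int:
    """Round a congestion window (bytes) down to a whole multiple of the MPS.

    Hypothetical helper: the result is never larger than the input, which is
    the property a "MUST NOT increase" rule would preserve.
    """
    if mps <= 0:
        raise ValueError("mps must be positive")
    if cwnd < mps:
        # Assumption: a window below one MPS is left unchanged.
        return cwnd
    return (cwnd // mps) * mps
```

For example, a 14600-byte window with a 1400-byte MPS would become 14000 bytes, which shrinks the window but never grows it.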

Nits:

Sec 2. In the definition for EMTU_R, put "the largest datagram size that can be reassembled" in quotes, and delete "by EMTU_R (Effective MTU to receive)"

Sec 4.1. delete word in brackets:  "A PL that uses a probe packet carrying [an] application data..."

Sec 4.6.2 I think this list of inequalities would be easier to read if it was ranked in order of increasing PL_PTB_SIZE.

5.1.1. s/up to data/up to date

Erik Kline Yes

Comment (2020-04-04 for -19)

[nits]

S1.1

* It's not clear to me that "stateful" in the phrase "stateful firewall"
  matters here. I imagine a "stateless" firewall could just as easily block
  incoming ICMP messages.

S4.5

* The "[RFC8201]" didn't render as a clickable link. Source XML weirdness?
  Same in S4.6.1.

S5.

* First sentence of second paragraph seems like two sentences, one sentence
  with a parenthetical comment, or a run-on sentence.  Also, I think "only be
  performed once" here means "only performed at one layer" rather than "only
  performed at one point in time"?

S5.1.1

* The "section 3.1.1" link seems to be internal to the draft rather than linked
  into RFC 8085, as I would expect from the text. (this happens twice in this
  section)

S5.3.3

* Is there any additional detail worth including here?  It asserts that a PL
  sender is able to detect inconsistencies, but I wonder whether more guidance
  (or an example) might be helpful to implementors.

S6.2.2

* s/is to be added//

S6.2.3.4

* Maybe check the XML for the "[RFC4960]" reference, since it doesn't seem to
  have been converted into a link.

Magnus Westerlund Yes

Deborah Brungard No Objection

Alissa Cooper No Objection

Roman Danyliw No Objection

Comment (2020-04-07 for -19)
Thank you for your updates in response to the SECDIR review by Stephen Farrell (and thanks for the review Stephen!)

** As previously mentioned, please consider the stability of the [I-D.ietf-quic-transport] when advancing this document

** Section 1.  Per “ICMP messages can be filtered by middleboxes (including firewalls) [RFC4890].  A stateful firewall could be configured …”, couldn’t all firewalls (stateless and stateful) also do this?

** Editorial Nits:
Section 4.1.  Typo. s/retransmited/retransmitted/

Section 4.1. s/… which could need the PL to to use a smaller packet size…/…which could force the PL to use a smaller packet size …/

Section 5.1.2.  Typo. s/valugreater/value greater/

Benjamin Kaduk No Objection

Comment (2020-04-08 for -19)
[I'm reviewing the native HTML, so line breaks in quoted text are
locally generated]

We could perhaps benefit from more clarity in several places about
whether a given size is measured at the IP layer or the PL (some
locations mentioned in the per-section comments).

Should RFC 8085 be cited as BCP 145?

It would be nice to have a bit more detail about how to effectuate
things like robustness to inconsistent paths, but I don't think the lack
of such concrete guidance should hold up publication of this document.

I mention in the per-section comments a few instances of apparent
internal inconsistency; they don't quite rise to DISCUSS-level, but
please take a look.

Abstract

    This document updates RFC 4821 to specify the method for datagram PLs,
    and updates RFC 8085 as the method to use in place of RFC 4821 with UDP
    datagrams. [...]

nit(?): the wording here is a bit odd; I think we want to say that it
"updates RFC 8085 to refer to the method specified in this document
instead of the method in RFC 4821 for use with UDP datagrams".

Section 1.1

    Classical PMTUD is subject to protocol failures. One failure arises when
    traffic using a packet size larger than the actual PMTU is black-holed
    (all datagrams sent with this size, or larger, are discarded). 

nit: if "this size" is "the actual PMTU", shouldn't this just be
"strictly larger than this size", since packets of exactly that size
will pass?

    When the router issuing the ICMP message drops a tunneled packet, the
    resulting ICMP message will be directed to the tunnel ingress. This
    tunnel endpoint is responsible for forwarding the ICMP message and also
    processing the quoted packet within the payload field to remove the
    effect of the tunnel, and return a correctly formatted ICMP message to
    the sender [I-D.ietf-intarea-tunnels]. Failure to do this prevents the
    PTB message reaching the original sender.

Is it supposed to be implied that this is a little bit complicated and
many existing tunnels don't do it properly?

    In this case, when a packet sent by the server encounters a problem
    after the ECMP router, then any resulting ICMP message also needs to be
    directed by the ECMP router towards the original sender.

Similarly, is it implicit that it's complicated for the ECMP router to
do this, it's not currently implemented well, etc.?

    When an ICMP message is generated by a router in a network segment that
    has inserted a header into a packet, the quoted packet could contain
    additional protocol header information that was not included in the
    original sent packet, and which the PL sender does not process or may
    not know how to process. This could disrupt the ability of the sender to
    validate this PTB message.

Should/can the entity that added the extra header be inspecting ICMP to
remove the header from the quoted packet in the reverse direction?  (We
could perhaps try to normalize the language between this point and the
following one on NAT behavior.)

    Section 10.2 of [RFC4821] recommended a PLPMTUD probing method for
    the Stream Control Transport Protocol (SCTP). SCTP utilizes probe
    packets consisting of a minimal sized HEARTBEAT chunk bundled with a
    PAD chunk as defined in [RFC4820]. However, RFC 4821 did not provide
    a complete specification. The present document replaces this by
    providing a complete specification.

nit(?): s/replaces this/replaces that description/?

Section 2

    Packetization Layer (PL):
    The PL is a layer of the network stack that places data into packets
    and performs transport protocol functions. Examples of a PL include:
    TCP, SCTP, SCTP over DTLS or QUIC.

nit: Serial comma, please!

    PTB_SIZE:
    The PTB_SIZE is a value reported in a validated PTB message that
    indicates next hop link MTU of a router along the path

nit: missing article ("the next hop link MTU"?)

Section 3

    A PTB message MUST NOT be used to increase the PLPMTU [RFC8201], but
    could trigger a probe to test for a larger PLPMTU. A PL_PTB_SIZE
    that is greater than that currently probed MUST be ignored. A valid
    PTB_SIZE is converted to a PL_PTB_SIZE before it is to be used in
    the DPLPMTUD state machine.

Perhaps these last two sentences should be swapped, so we talk about how
the PL_PTB_SIZE is computed before we talk about ignoring invalid
values?

    The decision about when to send a probe packet does not need to be
    limited by the congestion controller.

Does the congestion controller need to account for the size/bandwidth of
the probe packets in its pacing mechanism, though?

    Path validation: It is RECOMMENDED that methods are robust to path
    changes that could have occurred since the path characteristics were
    last confirmed, and to the possibility of inconsistent path
    information being received.

What would such robustness look like?  (Why is it not REQUIRED?)

Section 4.1

    A PL that uses a probe packet carrying application data and needs
    protection from the loss of this probe packet could perform
    transport-layer retransmission/repair of the data block (e.g., by
    retransmission after loss is detected or by duplicating the data
    block in a datagram without the padding data). This retransmitted
    data block might possibly need to be sent using a smaller PLPMTU,
    which could need the PL to to use a smaller packet size to traverse
    the end-to-end path. (This could utilize endpoint network-layer or a
    PL that can re-segment the data block into multiple datagrams).

nit: check the grammar around "utilize endpoint network-layer"?

Section 4.2

    Transport protocols can include end-to-end methods that detect and
    report reception of specific datagrams that they send (e.g., DCCP
    and SCTP provide keep-alive/heartbeat features).

Does a packet-level ACK mechanism (e.g., like QUIC but not TCP) suffice?

Section 4.3

    When the method detects the current PLPMTU is not supported,
    DPLPMTUD sets a lower PLPMTU, and sets a lower MPS. The PL then
    confirms that the new PLPMTU can be successfully used across the
    path. A probe packet could need to have a size less than the size of
    the data block generated by the application.

Why is "less than half the size" noteworthy?

Section 4.4

    Operational experience reveals that IP fragmentation can reduce the
    reliability of Internet communication
    [I-D.ietf-intarea-frag-fragile], which may reduce the success of
    retransmission.

nit: if "success" is a binary yes/no, then this could be "reduce the
probability of success of the retransmission".  (I add "the" in "the
retransmission" as well, since we refer to the specific retransmission
after the initial loss and MPS change, not retransmission in general.)

Section 4.6.1

    A PL that receives a PTB message from a router or middlebox performs
    ICMP validation as specified in Section 5.2 of [RFC8085][RFC8201].

If we're going to have a section number, we need only one RFC reference!

    PTB messages that have been validated MAY be utilized by the
    DPLPMTUD algorithm, but MUST NOT be used directly to set the PLPMTU.
    The PL_PTB_SIZE is smaller than the PTB_SIZE because it is reduced
    by headers below the PL including any IP options or extensions added
    to the PL packet.

nit: this transition is pretty abrupt.

Section 4.6.2

    Before using the size reported in the PTB message it must first be
    converted to a PL_PTB_SIZE. A set of checks are intended to provide
    protection from a router that reports an unexpected PTB_SIZE. The PL
    also needs to check that the indicated PL_PTB_SIZE is less than the
    size used by probe packets and at least the minimum size accepted.

How do these "checks" relate to the "validation" that was discussed in
the previous section?
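To make the distinction concrete, the conversion and the size checks quoted above could be sketched as follows. This is only an illustration: the function names are hypothetical, and it assumes both bounds are expressed as PL sizes, which the quoted text does not state.

```python
def to_pl_ptb_size(ptb_size: int, lower_layer_overhead: int) -> int:
    """Convert a reported PTB_SIZE to a PL_PTB_SIZE.

    lower_layer_overhead is the number of bytes of headers below the PL
    (IP header plus any options or extensions), per the quoted 4.6.1 text.
    """
    return ptb_size - lower_layer_overhead


def pl_ptb_size_passes_checks(pl_ptb_size: int, min_pl_size: int,
                              probed_size: int) -> bool:
    """Apply the two size checks from the quoted 4.6.2 text.

    The value must be at least the minimum size accepted and strictly less
    than the size used by probe packets. Assumption: both bounds are PL sizes.
    """
    return min_pl_size <= pl_ptb_size < probed_size
```

These checks operate only on the reported size; validating that the PTB message itself is genuine (the quoted-packet check) would be a separate step that runs first.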

Section 5

    DPLPMTUD SHOULD NOT be used by an upper PL or application if it is
    already used in a lower layer DPLPMTUD SHOULD only be performed once
    between a pair of endpoints.

This seems to imply an expectation of a communications channel between
PLs on the same endpoint.  I don't see anything in this document
discussing such a channel; is there anything (more) that's useful that
we can say?

Section 5.1.1

    DPLPMTUD MAY inhibit sending probe packets when no application data
    has been sent since the previous probe packet. A PL preferring to
    use an up-to-date PMTU once user data is sent again, can choose to
    continue PMTU discovery for each path. However, this could result in
    sending additional packets.

How is this "could" and not "will"/"would"?  (Twice)

    The various timers could be implemented using a single timer

Is this an incomplete thought?  (There's no full stop, and it doesn't
give much clarity for me.)

Section 5.1.2

    MIN_PLPMTU:
    The MIN_PLPMTU is the smallest allowed probe packet size. For IPv6,
    this value is 1280 bytes, as specified in [RFC8200]. For IPv4, the
    minimum value is 68 bytes.

This feels like slightly curious naming, as the "PL" infix suggests that
it is the size usable at the packetization layer, but the quoted
constants include the IP headers.

    BASE_PLPMTU:
    The BASE_PLPMTU is a configured size expected to work for most
    paths. The size is equal to or larger than the MIN_PLPMTU and
    smaller than the MAX_PLPMTU. In the case of IPv6, this value is
    derived from the IPv6 minimum link MTU of 1280 bytes 

"Derived from" but using what derivation mechanism and/or arriving at
what value?

Section 5.1.3

    PROBED_SIZE:
    The PROBED_SIZE is the size of the current probe packet. This is a
    tentative value for the PLPMTU, which is awaiting confirmation by an
    acknowledgment.

Size at which layer?

Section 5.1.4

    The Base Phase confirms connectivity to the remote peer using
    packets of the BASE_PLPMTU. This phase is implicit for a
    connection-oriented PL (where it can be performed in a PL connection
    handshake).

This seems slightly internally inconsistent, in that it seems to say
that BASE_PLPMTU size packets are always used, but the implicit
connection-handshake method seems unlikely to actually do so.
(Also, if we always did so, then the following paragraph about
confirming BASE_PLPMTU support would be redundant.)

    Search Complete:
    The Search Complete Phase is entered when the PLPMTU is supported
    across the network path

It seems like this would be trivially true for (e.g.) the initial PLPMTU
set to BASE_PLPMTU on the Base->Search transition, yet I don't expect we
terminate Search immediately in that case!  Presumably this is intended
to say also that the PLPMTU corresponds to the actual PMTU as well?

Section 5.2

nit(?): the diagram shows PTB: PLPTB_SIZE < BASE_PLPMTU as cause for a
transition from BASE to ERROR, though of course this would only occur
after the PTB is validated.  Perhaps that doesn't need to be in the
figure, though.

Section 5.3.1

    The PROBE_COUNT is initialized to zero when the first probe with a
    size greater than or equal to PLPMTUD is sent. A timer is used to
    trigger the sending of probe packets of size PROBED_SIZE, larger
    than the PLPMTU.

This seems inconsistent about whether we use PROBE_COUNT when probing
with the current PLPMTU size -- we initialize the count to zero when we
first send one, but the timer only triggers sending of larger packets.
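One possible reading of the intended behavior is sketched below. The class and method names are hypothetical, the MAX_PROBES value of 3 is assumed for illustration, and the reset-on-new-size rule is an interpretation rather than text from the draft.

```python
MAX_PROBES = 3  # assumed value, for illustration only


class ProbeTracker:
    """Tracks PROBE_COUNT for the probe size currently being tested."""

    def __init__(self) -> None:
        self.probed_size = None
        self.probe_count = 0

    def start_probe(self, size: int) -> None:
        # Interpretation: the count is reset when a new size is first probed.
        self.probed_size = size
        self.probe_count = 0

    def on_timer_expiry(self) -> bool:
        """Called when the probe timer fires without an acknowledgment.

        Returns True if the probe should be resent, or False once
        PROBE_COUNT has reached MAX_PROBES and the size is treated as failed.
        """
        self.probe_count += 1
        return self.probe_count < MAX_PROBES
```

Under this reading the same counter would apply whether the probe is at the current PLPMTU or at a larger PROBED_SIZE, which is the point the text leaves ambiguous.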

Section 5.3.2

I confess I was expecting a more detailed specification of how to pick
probe sizes, here.
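As an example of the kind of detail that could be given, a simple binary search between the largest confirmed size and the largest candidate size is sketched below. This is an assumed strategy, not one taken from the draft; the function name and the path_accepts callback (standing in for "the probe was acknowledged") are hypothetical.

```python
from typing import Callable


def search_plpmtu(confirmed: int, max_size: int,
                  path_accepts: Callable[[int], bool]) -> int:
    """Binary-search for the largest probe size the path accepts.

    confirmed: a size already known to work (e.g. the base size).
    max_size: the largest size worth probing.
    path_accepts: hypothetical stand-in for sending a probe of the given
    size and learning whether it was acknowledged.
    """
    lo, hi = confirmed, max_size
    while lo < hi:
        probe = (lo + hi + 1) // 2  # round up so the loop always advances
        if path_accepts(probe):
            lo = probe        # probe acknowledged: raise the confirmed size
        else:
            hi = probe - 1    # probe failed: rule out this size and larger
    return lo
```

A real implementation would also have to decide how many lost probes count as a failure, since a single loss could be caused by congestion rather than by the probe size.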

Section 6.1

    To use common method for managing the PLPMTU has benefits, both in
    the ability to share state between different processes and
    opportunities to coordinate probing.

Nit: singular/plural mismatch "common"/"method"

Section 6.2

    Section 10.2 of [RFC4821] specified a recommended PLPMTUD probing
    method for SCTP and Section 7.3 of [RFC4960] and recommended an
    endpoint apply the techniques in RFC4821 on a
    per-destination-address basis.

nit: spurious "and" after "[RFC4960]".

    Section 6.9 of [RFC4960] describes dividing the user messages into
    data chunks sent by the PL when using SCTP. This notes that once an
    SCTP message has been sent, it cannot be re-segmented. [RFC4960]
    describes the method to retransmit data chunks when the MPS has
    reduced, and the use of IP fragmentation for this case.

Should we say something about "such behavior is unchanged by this
document"?

Section 6.2.1.2

    The HEARTBEAT chunk carries a Heartbeat Information parameter which
    includes, besides the information suggested in [RFC4960], the probe
    size, which is the size of the complete datagram. 

IIUC the Heartbeat Information layout is entirely at the sender's
discretion, so the implementation will have to pick a way to convey the
probe size in it; should we be more explicit about this?

    Probing starts directly after the PL handshake, before data is sent.
    Assuming this behavior (i.e., the PMTU is smaller than or equal to
    the interface MTU), this process will take several round trip time
    periods, dependent on the number of DPLPMTUD probes sent. The
    Heartbeat timer can be used to implement the PROBE_TIMER.

(But sending data doesn't have to wait for DPLPMTUD to complete, right?)

Section 6.2.2.2

    Packet probing can be performed as specified in Section 6.2.1.2. The
    maximum payload is reduced by 8 bytes, which has to be considered
    when filling the PAD chunk.

(8 bytes of UDP header?)
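If the 8 bytes are indeed the UDP encapsulation header, the arithmetic for the PAD chunk might look like the sketch below. The 12-byte SCTP common header and 8-byte UDP header are the standard sizes; the function name, the heartbeat_chunk_len parameter, and ignoring 4-byte chunk alignment are assumptions made for illustration.

```python
SCTP_COMMON_HEADER = 12  # bytes
UDP_HEADER = 8           # bytes


def pad_chunk_bytes(probe_size: int, ip_header_len: int,
                    heartbeat_chunk_len: int, udp_encaps: bool) -> int:
    """Bytes left for the PAD chunk (including its own chunk header).

    probe_size is the size of the complete datagram. Chunk alignment is
    ignored in this simplified sketch.
    """
    overhead = ip_header_len + SCTP_COMMON_HEADER + heartbeat_chunk_len
    if udp_encaps:
        overhead += UDP_HEADER  # the 8-byte reduction noted in the text
    remaining = probe_size - overhead
    if remaining < 0:
        raise ValueError("probe size too small for the headers")
    return remaining
```

With a 1400-byte probe, a 20-byte IPv4 header, and a 40-byte HEARTBEAT chunk, this leaves 1328 bytes for the PAD chunk natively and 1320 bytes when encapsulated in UDP.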

Section 6.2.3.2

(Interesting that we don't have a note about payload reduction to be
considered when filling the PAD chunk, as we did in Section 6.2.2.2.)

Section 6.2.3.4

    [RFC4960] does not specify a way to validate SCTP/DTLS ICMP message
    payload.

("and neither does this document"?)

Section 6.3.2

    QUIC provides an acknowledged PL, a sender can therefore enter the
    BASE state as soon as connectivity has been confirmed.

[This seems entirely redundant with the content of Section 6.3.1.]

Section 9

    An on-path attacker able to create a PTB message could forge PTB
    messages that include a valid quoted IP packet. Such an attack could
    be used to drive down the PLPMTU. 

Are there cases where such an on-path attacker would not also be able to
actually drop traffic larger than the forged PTB PMTU value?

    It is possible that the information about a path is not stable. This
    could be a result of forwarding across more than one path that has a
    different actual PMTU or a single path presents a varying PMTU. The
    design of a PLPMTUD implementation SHOULD consider how to mitigate
    the effects of varying path information. One possible mitigation is
    to provide robustness (see Section 5.4) in the method that avoids
    oscillation in the MPS.

This document specifies some PLPMTUD mechanics; should we say what we do
(and don't) say about this topic for those protocols we do cover?

    A node performing DPLPMTUD could experience conflicting information
    about the size of supported probe packets. This could occur when
    multiple paths are concurrently in use and these exhibit a different
    PMTU. If not considered, this could result in packets not being
    delivered (black holed) when the PLPMTU results in a packet larger
    than the smallest actual PMTU.

It feels like this content has high overlap with the previous paragraph;
is it reasonable to try consolidating them?

Murray Kucherawy No Objection

Comment (2020-03-31 for -17)
Kudos on a very approachable document.  The thin air up in the ART layers had me fearing something that said "TSVWG" on it.

Section 2:
* The definition for EMTU_R appears to double up on itself.

Section 3:
* In bullet 2, "On request, a DPLPMTUD sender is REQUIRED to be able to transmit a packet  ..." -- I read this as "If I ask you to do X, you MUST be able to do X", versus "you MUST do X".  Was that the intent?
* In bullet 3, there's reference to a "feedback method" that is REQUIRED, but this method is unspecified.  Is that defined elsewhere, or is out of scope here?

Section 4.1:
Nits:
* "... uses a probe packet carrying an application data ..." -- s/an//
* "... this probe packet, could perform ..." -- s/,//
* "This retransmited data block ..." --  typo: "retransmitted"

Section 4.3:
* PROBE_COUNT and MAX_PROBES are first used here, though they are not defined until later in the document in the DPLPMTUD section.
Nit:
* "... data is sent again, MAY choose ..." -- s/,//

Section 4.4:
Nit:
* "... over clearing the DF-bit in the IPv4 header ..." -- s/-/ /

Section 4.6.1:
* "For example, by checking the value ..." -- Suggest: "For example, it could check the value..."
Nit:
* "... from a router or middlebox, performs ..." -- s/,//

Section 5:
Nit:
* "... in a lower layer, DPLPMTUD SHOULD only ..." -- s/,/./

Section 5.1.1:
Nit:
* "... use an up-to-data PMTU once ..." -- I think you mean "up-to-date"

Section 5.1.2:
Nit:
* "... from a MAX_PROBES valugreater than 1 because ..." -- s/valugreater/value greater/

Section 5.3.3:
Nit:
* "... inconsistent, when, for example, ..." -- remove the first comma

Section 6.1.1:
Nit:
* "... from off-path insertion of data [RFC8085], suitable methods include ..." -- s/, s/. S/

Section 6.3.2:
Nit:
* "... packets of the required size, this sets the ..." -- either s/, t/. T/, or s/this/which/

Section 9:
"Parallel forwarding paths SHOULD be considered."  What is the specific action being recommended here?
Nits:
* "This protection if provided ..." -- s/if/is/
* "... design (see Section 1.1), this method therefore ..."  -- I think "this" should begin a new sentence.
* "An on-path attacker, able to create ..." -- remove comma
* "This could occur when there are multiple paths are concurrently in use." -- s/there are//

Warren Kumari No Objection

Barry Leiba No Objection

Alvaro Retana No Objection

Martin Vigoureux No Objection

Éric Vyncke No Objection

Comment (2020-04-09 for -19)
Thank you for the work put into this document. The document is clear, easy to read and quite useful (perhaps mixing too many PL protocols though).

Please find below one non-blocking NIT.

I hope that this helps to improve the document,

Nice to read about the outcomes of the NEAT project

Regards,

-éric

==NIT ==

-- Section 2 --
Regarding "A Packet is the IP header plus the IP payload.", is it worth mentioning the IPv6 extension headers or using "IP headers" (plural form)?

Robert Wilton No Objection

Comment (2020-04-09 for -19)
Firstly I would like to say thank you for writing this document.

I have a general concern, not specifically with this document, but with the overall complexity of the solution. The algorithm that is described has to contend with a lot of unreliable information (e.g., are packets dropped because of congestion or some other routing failure? Can PTB messages be relied upon?), and there are also several different ways that the probes can be sent.

Not the subject of this document, but it feels like life could be significantly simpler if there were somehow a mechanism to agree on a set of defined minimum PLPMTU values that a network may support, e.g. perhaps 1280, 2000, 4000, 8000, 16000 octets. My assumption here is that the underlying core link MTUs would be higher to cope with header overheads.
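For illustration only, selecting from such a fixed set could be as simple as the sketch below. The tier values are the examples given above; the function name and the idea of passing in a known path limit are hypothetical.

```python
import bisect
from typing import Optional

# Example tier values from the comment above, in octets.
PLPMTU_TIERS = [1280, 2000, 4000, 8000, 16000]


def largest_tier_within(path_limit: int) -> Optional[int]:
    """Return the largest defined tier that fits within path_limit.

    Returns None when even the smallest tier does not fit.
    """
    idx = bisect.bisect_right(PLPMTU_TIERS, path_limit)
    return PLPMTU_TIERS[idx - 1] if idx > 0 else None
```

With agreed tiers, a sender would need to test only a handful of sizes rather than search the whole range.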

A few comments on specific sections of the document:

1) I would include PTB in the terminology section.

3. Features Required to Provide Datagram PLPMTUD

Most of the requirements in this section use RFC 2119 language, but a few don't:

   7.   Probing and congestion control: The decision about when to send
        a probe packet does not need to be limited by the congestion
        controller.  When not controlled by the congestion controller,
        the interval between probe packets MUST be at least one RTT.  If
        transmission of probe packets is limited by the congestion
        controller, this could result in transmission of probe packets
        being delayed or suspended during congestion.

Rather than "does not need to be limited", would this be better stated as  "SHOULD NOT be limited" or "MAY not be limited"?
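Whichever keyword is chosen, the quoted "at least one RTT" rule for probes that bypass the congestion controller could be enforced with a check like the sketch below. The function name and the use of a caller-supplied clock value are assumptions made for illustration.

```python
from typing import Optional


def may_send_probe(now: float, last_probe_time: Optional[float],
                   rtt: float) -> bool:
    """Return True if a probe may be sent without congestion control.

    Enforces a minimum spacing of one RTT between probe packets. All three
    arguments must use the same time unit.
    """
    if last_probe_time is None:
        return True  # no probe has been sent yet
    return (now - last_probe_time) >= rtt
```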


   11.  Probing and flow control: Flow control at the PL concerns the
        end-to-end flow of data using the PL service.  This does not
        apply to DPLPMTU when probe packets use a design that does not
        carry user data to the remote application.
        
The second sentence could be stated something like:

"Flow control MUST NOT apply to DPLPMTU when probe packets use a design that does not carry user data to the remote application"


5.1.2.  Constants

   MIN_PLPMTU:  The MIN_PLPMTU is the smallest allowed probe packet
      size.  For IPv6, this value is 1280 bytes, as specified in
      [RFC8200].  For IPv4, the minimum value is 68 bytes.

Does it really make sense to probe for a path MTU of 68 bytes? Would it make sense for applications to be able to control the MIN_PLPMTU that they find acceptable? E.g., if IPv6 has this at 1280 and QUIC uses 1280, perhaps many applications/implementations would like to use 1280 as a reasonable lower bound rather than 68 bytes?
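A minimal sketch of such an application-controlled lower bound follows. The protocol minimums are the values quoted from Section 5.1.2; the function name and the app_min parameter are hypothetical and not part of the draft.

```python
from typing import Optional

# Protocol minimums quoted in Section 5.1.2 (bytes).
PROTOCOL_MIN_PLPMTU = {"ipv4": 68, "ipv6": 1280}


def effective_min_plpmtu(ip_version: str,
                         app_min: Optional[int] = None) -> int:
    """Return the lower bound to use for probing.

    An application may raise the bound above the protocol minimum but,
    in this sketch, can never lower it.
    """
    protocol_min = PROTOCOL_MIN_PLPMTU[ip_version]
    if app_min is None:
        return protocol_min
    return max(protocol_min, app_min)
```

An IPv4 application that considers anything under 1280 bytes unusable would then call effective_min_plpmtu("ipv4", 1280) and never probe below that size.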