Network Working Group                                          M. Mathis
Internet-Draft                                                J. Heffner
Expires: November 30, 2004                                           PSC
                                                                K. Lahey
                                                               June 2004

                           Path MTU Discovery

Status of this Memo

   By submitting this Internet-Draft, I certify that any applicable
   patent or other IPR claims of which I am aware have been disclosed,
   and any of which I become aware will be disclosed, in accordance with
   RFC 3668.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://

   The list of Internet-Draft Shadow Directories can be accessed at

   This Internet-Draft will expire on November 30, 2004.

Copyright Notice

   Copyright (C) The Internet Society (2004). All Rights Reserved.


Abstract

   This document describes a robust new method for Path MTU Discovery
   that relies on TCP or other Packetization Layer to probe an Internet
   path with progressively larger packets. This method is described as
   an extension to RFC 1191 and RFC 1981, which specify ICMP based Path
   MTU Discovery for IP versions 4 and 6, respectively. This document
   does not define a protocol, but rather a method to use features of
   existing protocols to discover the path MTU.

Mathis, et al.         Expires November 30, 2004                [Page 1]

Internet-Draft             Path MTU Discovery                  June 2004

   The general strategy of the new algorithm is to start with a small
   MTU and probe upward, testing successively larger MTUs by probing
   with single packets.  If the probe is successfully delivered, then
   the MTU is raised.  If the probe is lost, it is treated as an MTU
   limitation and not as a congestion signal.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Overview
   4.  Requirements
   5.  Implementation Issues
     5.1   Layering
       5.1.1   Accounting for Header Sizes
       5.1.2   Storing PMTU information
     5.2   Lower Layers
       5.2.1   Generating Probes
       5.2.2   Selecting the initial MTU
       5.2.3   Normal sequence of events to raise the MTU
       5.2.4   Processing MTU Indications
       5.2.5   Probing Intervals
       5.2.6   Host fragmentation
       5.2.7   Multicast
     5.3   Search Strategy
       5.3.1   Search
       5.3.2   Monitor
       5.3.3   Suspend
     5.4   Specific Packetization Layers
       5.4.1   Probing method using TCP
       5.4.2   Probing method using SCTP
       5.4.3   Probing Method for IP Fragmentation
       5.4.4   Issues for other transport protocols
     5.5   Operational Integration
       5.5.1   Interoperation with prior algorithms
       5.5.2   Interoperation over subnets with dissimilar MTUs
       5.5.3   Interoperation with tunnels
       5.5.4   Diagnostic tools
       5.5.5   Management interface
   6.  References
     6.1   Normative References
     6.2   Informative References
       Authors' Addresses
   A.  Security Considerations
   B.  IANA considerations
   C.  Acknowledgements
       Intellectual Property and Copyright Statements

1.  Introduction

   This document describes a method for Packetization Layer Path MTU
   Discovery (PLPMTUD) which is an extension to existing Path MTU
   discovery methods as described in RFC 1191 [2] and RFC 1981 [3].  The
   proper MTU is determined by starting with small packets and probing
   with successively larger packets.  The bulk of the algorithm is
   implemented above IP, in the transport layer (e.g. TCP) or other
   "Packetization Protocol" that is responsible for determining packet
   boundaries.

   This document draws heavily on RFC 1191 [2] and RFC 1981 [3] for
   terminology, ideas and some of the text.

   The methods described in this document apply to both IPv4 and IPv6,
   and to many transport protocols.  This document does not define a
   protocol, but rather a method to use features of existing protocols
   to discover the path MTU.  It does not require cooperation from the
   lower layers (except that they are consistent about what packet sizes
   are acceptable) or the far node.  Variants in implementations will
   not cause interoperability problems.

   The methods described in this document are carefully designed to
   maximize robustness in the presence of less than ideal
   implementations of other protocols or Internet components.

   For the sake of clarity we uniformly prefer TCP and IPv6 terminology.
   In the terminology section we also present the analogous IPv4 terms
   and concepts for the IPv6 terminology.  In a few situations we
   describe specific details that are different between IPv4 and IPv6.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [4].

   This draft is a product of the Path MTU Discovery (pmtud) working
   group of the IETF.  Please send comments and suggestions to   Interim drafts and other useful information will be
   posted at .

2.  Terminology
   IP Either IPv4 [1] or IPv6 [7].

   node A device that implements IP.

   router A node that forwards IP packets not explicitly addressed to
      itself.

   host Any node that is not a router.

   upper layer A protocol layer immediately above IP. Examples are
      transport protocols such as TCP and UDP, control protocols such as
      ICMP, routing protocols such as OSPF, and Internet or lower-layer
      protocols being "tunneled" over (i.e., encapsulated in) IP such as
      IPX, AppleTalk, IP itself.

   link A communication facility or medium over which nodes can
      communicate at the link layer, i.e., the layer immediately below
      IP. Examples are Ethernets (simple or bridged); PPP links; X.25,
      Frame Relay, or ATM networks; and Internet (or higher) layer
      "tunnels", such as tunnels over IPv4 or IPv6. Occasionally we use
      the slightly more general term "lower layer" for this concept.

   interface A node's attachment to a link.

   address An IP-layer identifier for an interface or a set of
      interfaces.

   packet An IP header plus payload.

   MTU Maximum Transmission Unit, the size in bytes of the largest IP
      packet, including the IP header and payload, that can be
      transmitted on a link or path. Note that this could more properly
      be called the IP MTU, to be consistent with how other standards
      organizations use the acronym MTU.

   link MTU The Maximum Transmission Unit, i.e., maximum IP packet size
      in bytes, that can be conveyed in one piece over a link. Beware
      that this definition differs from the definition used by other
      standards organizations.

      For IETF documents, link MTU is uniformly defined as the IP MTU
      over the link. This includes the IP header, but excludes link
      layer headers and other framing which is not part of IP or the IP
      payload.

      Beware that other standards organizations generally define link
      MTU to include the link layer headers.

   path The set of links traversed by a packet between a source node and
      a destination node.

   PMTU, path MTU The minimum link MTU of all the links in a path
      between a source node and a destination node.

   classical PMTU discovery The process described in RFC 1191 and RFC
      1981, in which nodes rely on ICMP "Packet Too Big" messages to
      learn the MTU of a path.

   PL, packetization layer The layer of the network stack which segments
      data into packets.

   PLPMTUD Packetization Layer Path MTU Discovery, the method described
      in this document, which is an extension to classical PMTU
      discovery.

   Packet Too Big message An ICMP message reporting that an IP packet is
      too large to forward. This is the IPv6 term that corresponds to
      the IPv4 "ICMP Can't fragment" message.

   flow A context in which MTU discovery is applied. This is naturally
      an instance of the packetization protocol, e.g. one side of a TCP
      connection.

   MPS The maximum IP payload size available over a specific path. This
      is typically the path MTU minus the IP header. As an example, this
      is the maximum TCP packet size, including TCP payload and headers
      but not including IP headers. This has also been called the "L3

   MSS The TCP Maximum Segment Size, the maximum payload size available
      to the TCP layer. This is typically the path MPS minus the size of
      the TCP header.

   probe packet A packet which is being used to test a path for a larger
      MTU.

   probe size The size of a packet being used to probe for a larger MTU.

   successful probe The probe packet was delivered through the network
      and acknowledged by the Packetization Layer on the far node.

   inconclusive probe The probe packet was not delivered, but there were
      other lost packets close enough to the probe that it cannot be
      presumed that the probe was lost because it was larger than the
      path MTU. By implication the probe might have been lost due to
      something other than MTU (such as congestion), so the results are
      inconclusive.  Inconclusive probes are generally repeated at the
      same probe size, after a suitable delay.

   failed probe The probe packet was not delivered and there were no
      other lost packets close to the probe. This is taken as an
      indication that the probe was larger than the path MTU, and future
      probes should generally be at smaller sizes.

   errored probe There were losses or timeouts during the verification
      phase which suggest a potentially disruptive failure or network
      condition. These are generally retried only after substantially
      longer intervals.

   probe gap The payload data that will be lost and need to be
      retransmitted if the probe is not delivered.

   probe phase The interval (time or protocol events) between when a
      probe is sent, and when it is determined that the probe
      succeeded, failed or was inconclusive.

   verification phase An additional interval during which the new path
      MTU is considered provisional. Packet losses or timeouts are
      treated as an indication that there may be a problem with the
      provisional MTU.

   transition phase The interval between the probe phase and the
      verification phase, during which packets using the new MTU
      propagate to the far node and the acknowledgment propagates back.

   full stop timeout A timeout where none of the packets transmitted
      after some event are acknowledged by the receiver, including any
      retransmissions. This is taken as an indication of some failure
      condition in the network, such as a routing change onto a link
      with a smaller MTU.   For the sake of PLPMTUD we suggest the
      following definition of a full stop timeout:  the loss of one full
      window of data and at least one retransmission or at least 6
      consecutive packets including at least 2 retransmissions (along
      with two retransmission timer expirations).   [@@@ This probably
      needs some experimentation.]

   search strategy The heuristics used to choose successive probe sizes
      to converge to the proper path MTU, as described in Section 5.3.

3.  Overview

   This document describes a method for TCP or other packetization
   protocols to dynamically discover the MTU of a path without relying
   on explicit signals from the network. These procedures are applicable
   to TCP and other transport- or application-level packetization
   protocols in which the receiver always reports to the sender complete
   information about which packets were lost in the network.

   The general strategy of the new procedure is for the packetization
   layer to find the proper MTU by probing with progressively larger
   packets, without disrupting its normal protocol operation. If a probe
   packet is successfully delivered, then the path MTU is provisionally
   raised. If there are no additional losses during the subsequent
   verification phase, then the path MTU is confirmed (verified) to be
   at least as large as the provisional MTU. PLPMTUD can then probe
   again with an even larger MTU, according to the MTU search strategy
   described in Section 5.3.

   The verification phase is used to detect some situations where
   raising the MTU raises the packet loss rate.  For example if a link
   is striped across multiple physical channels with inconsistent MTUs,
   it is possible that a probe will be delivered even if it is too large
   for some of the physical channels. In such cases raising the path MTU
   to the probe size will cause severe periodic loss and abysmal
   performance.  The verification phase is designed to prevent the path
   MTU from being raised if doing so causes excessive packet losses.

   A conservative implementation of PLPMTUD would use a full round trip
   time for the verification phase.  In this case each time PLPMTUD
   raises the MTU it takes three full round trip times to do so. It
   takes one round trip for the probe phase, during which the probe
   propagates to the far node and an acknowledgment is returned.   The
   second round trip is the transition phase, during which data
   packets using the provisional MTU propagate to the far node and are
   acknowledged. During the third and final round trip time, it is
   verified that raising the MTU does not cause excessive loss.
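
   As a rough illustration only, the three round trips described above
   can be modeled as a small state machine.  The class and method names
   below are hypothetical and are not defined by this memo.  The sketch
   assumes the conservative case in which each phase lasts exactly one
   round trip, and it simply abandons the attempt on any loss.

```python
from enum import Enum, auto


class Phase(Enum):
    IDLE = auto()
    PROBE = auto()         # probe in flight, awaiting acknowledgment
    TRANSITION = auto()    # first packets at the provisional MTU in flight
    VERIFICATION = auto()  # watching for losses at the provisional MTU


class MtuRaiser:
    """Hypothetical tracker for one attempt to raise the path MTU."""

    def __init__(self, mtu):
        self.mtu = mtu           # confirmed path MTU
        self.provisional = None  # candidate MTU for the current attempt
        self.phase = Phase.IDLE

    def start_probe(self, probe_size):
        assert self.phase is Phase.IDLE and probe_size > self.mtu
        self.provisional = probe_size
        self.phase = Phase.PROBE

    def round_trip_elapsed(self, loss_seen):
        """Advance one round trip; return True once the MTU is confirmed."""
        if self.phase is Phase.IDLE:
            return False
        if loss_seen:
            # Assumption: any loss ends the attempt and leaves the
            # confirmed MTU unchanged.
            self.provisional = None
            self.phase = Phase.IDLE
            return False
        if self.phase is Phase.PROBE:
            self.phase = Phase.TRANSITION
        elif self.phase is Phase.TRANSITION:
            self.phase = Phase.VERIFICATION
        else:
            # Verification finished without loss: confirm the new MTU.
            self.mtu = self.provisional
            self.provisional = None
            self.phase = Phase.IDLE
            return True
        return False
```

   In this model a caller invokes round_trip_elapsed() once per round
   trip, so a clean attempt confirms the new MTU on the third call.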

   The isolated loss of a probe packet (with or without a Packet Too Big
   message) is treated as an indication of an MTU limit, and not as a
   congestion indicator. In this case alone, the packetization protocol
   is permitted to retransmit the probe gap without adjusting the
   congestion window.

   If there is a timeout or any additional lost packets during any of
   the three phases, the loss is treated as a congestion indication as
   well as an indication of some sort of failure of the PLPMTUD process.
   The congestion indication is treated like any other congestion
   indication: window or rate adjustments are mandatory per the relevant
   congestion control standards [8].   Probing can resume with some new
   probe size after a delay which is determined by the nature of the
   indicated failure.
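
   The distinction drawn in the two preceding paragraphs might be
   sketched as follows.  The function name, its arguments and the
   string labels are hypothetical.  As a simplifying assumption, any
   timeout or additional loss is labeled "inconclusive" here, even if
   the probe itself was acknowledged.

```python
def classify_probe(probe_acked, other_losses, timeout):
    """Return (outcome, reduce_congestion_window) for one probe.

    probe_acked  -- the far node acknowledged the probe packet
    other_losses -- number of non-probe packets lost near the probe
    timeout      -- a retransmission timeout occurred
    """
    if timeout or other_losses > 0:
        # Treated as congestion and as a failure of the probing process;
        # normal window or rate reduction applies.
        return ("inconclusive", True)
    if probe_acked:
        return ("successful", False)
    # Isolated loss of the probe: an MTU limit, not a congestion signal,
    # so the probe gap may be retransmitted without a window reduction.
    return ("failed", False)
```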

   The most likely (and least serious) PLPMTUD failure is the link
   experiencing legitimate congestion related losses at about the same
   time as the probe.   In this case, it is appropriate to retry the
   probe (with the same probe size) as soon as the packetization layer
   has fully adapted to the congestion and recovered from the losses.

   In other cases, additional losses or timeouts indicate problems with
   the link or packetization layer, and that probes may be disruptive.
   In these situations it is desirable to use progressively longer
   delays depending on the severity of the failure and whether it
   persists.

   PLPMTUD can optionally process Packet Too Big messages to select the
   provisional MTU for faster convergence in exchange for a slight
   decrease in robustness.  Processing malicious or erroneous Packet Too
   Big messages can cause PLPMTUD to arrive at the incorrect MTU for a
   path, which is likely to reduce protocol performance. There are
   several different options for processing Packet Too Big messages: in
   one extreme they could be completely ignored, in the other extreme,
   accept all of them (fully implementing classic PMTUD within PLPMTUD).
   We advocate a compromise, where Packet Too Big messages are only
   processed in conjunction with probes (described in Section,
   and Packetization Layer timeouts (described in Section

   Relatively few details of this procedure affect interoperability with
   other standards or Internet protocols.  These details are specified
   in RFC2119 standards language in Section 4.

   Most of the difficulty in implementing PLPMTUD arises because it
   needs to be implemented in several different places within a single
   node.  In general each packetization protocol needs to have its own
   implementation of PLPMTUD. Furthermore, the natural mechanism to
   share path MTU information between concurrent or subsequent
   connections over the same path is a path information cache in the IP
   layer.  The various packetization protocols need to have the means to
   access and update the shared cache in the IP layer. This memo
   describes PLPMTUD in terms of its primary subsystems without fully
   describing how they are assembled into a complete implementation.
   Section 5 describes: the separation into layers; the mechanics of
   probing from the point of view of the lower layers; Maximum Payload
   Size search heuristics; implementation in specific Packetization
   Layers; and operational integration issues.

   The vast majority of the implementation details are recommendations
   based on experiences with earlier versions of path MTU discovery.
   These are motivated by a desire to maximize robustness of PLPMTUD in
   the presence of less than ideal implementations as they exist in the
   field.

4.  Requirements

   All Internet nodes SHOULD implement PLPMTUD in order to discover and
   take advantage of the largest MTU supported along the Internet path.

   Links MUST NOT deliver packets that are larger than their MTU. Links
   that have parametric limitations (e.g. MTU bounds due to limited
   clock stability) MUST include explicit mechanisms to consistently
   reject packets that might otherwise be nondeterministically
   delivered.

   All hosts SHOULD use IPv4 fragmentation in a mode that mimics IPv6
   functionality.  All fragmentation SHOULD be done on the host, and all
   IPv4 packets, including fragments, SHOULD have the DF bit set such
   that they will not be fragmented (again) in the network.  See Section

   The requirements below only apply to those implementations that
   include PLPMTUD.

   If the Packetization Layer uses application data to implement PLPMTUD
   it MUST use a loss reporting mechanism (e.g. TCP SACK)
   which avoids spurious retransmission of other data when a probe
   packet is lost.

   A Packetization Layer using application data for probes MUST NOT send
   a probe unless it has sufficient following data available to send
   such that a lost probe will trigger Fast Retransmit or similar data
   recovery algorithm.

   A Packetization Layer using application data for probes SHOULD NOT
   send a probe packet unless the flow is expected to have at least the
   3 round trips worth of data needed to successfully complete the
   probe, transition and verification phases.
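
   The two preconditions above could be checked along the following
   lines.  This is only a sketch: the function and parameter names are
   hypothetical, a duplicate acknowledgment threshold of 3 is assumed
   for Fast Retransmit, and one round trip of data is approximated by
   the current congestion window.

```python
DUPACK_THRESHOLD = 3  # assumed Fast Retransmit threshold


def may_send_probe(queued_bytes, probe_payload, mss, cwnd_bytes,
                   expected_flow_bytes):
    """Decide whether a probe built from application data may be sent."""
    # MUST NOT: enough data has to follow the probe so that its loss
    # produces DUPACK_THRESHOLD duplicate acknowledgments.
    following = queued_bytes - probe_payload
    if following < DUPACK_THRESHOLD * mss:
        return False
    # SHOULD NOT: the flow is expected to last through the probe,
    # transition and verification phases (about 3 round trips of data,
    # approximated here as 3 congestion windows).
    return expected_flow_bytes >= 3 * cwnd_bytes
```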

   Normal congestion control algorithms MUST remain in effect under all
   conditions except when only an isolated probe packet is detected to
   be lost. In this case alone the normal congestion (window or data
   rate) reduction can be suppressed.  If any other lost data is
   detected, all normal congestion control MUST take place.

   When a probe is lost and normal congestion control is suppressed as
   permitted above, then the Packetization Layer MUST NOT probe again
   until at least an interval equal to the normal congestion control
   cycle.  For TCP and TCP friendly protocols this generally means one
   round trip of elapsed time for each packet permitted under the
   current congestion window.
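
   For a window based protocol, the minimum interval described above
   might be computed as in this sketch.  The function name and its
   parameters are hypothetical, and the window is assumed to be tracked
   in bytes.

```python
def earliest_next_probe(now, rtt, cwnd_bytes, mss):
    """Earliest time to probe again after a suppressed reduction.

    One round trip is charged for each packet permitted under the
    current congestion window.
    """
    cwnd_packets = max(1, cwnd_bytes // mss)
    return now + cwnd_packets * rtt
```

   For example, with a window of 10 packets and a round trip time of
   125 ms, the next probe would be deferred by at least 1.25 seconds.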

   If PLPMTUD updates the MTU for a particular path, all Packetization
   Layer sessions that share the flow (path) must be notified.

   Whenever the MTU is raised, the congestion state variables must be
   rescaled so as not to raise the window size in bytes (or data rate in
   bytes per second).

   Whenever the MTU is reduced (e.g. when unconditionally processing
   ICMP Packet Too Big messages) the congestion state variables must be
   rescaled so as not to raise the window size in packets.
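
   One possible reading of the two rescaling rules above is sketched
   below for a sender that keeps its congestion window in bytes.  The
   function name is hypothetical, and the segment size is used as a
   stand-in for the MTU.

```python
def rescale_cwnd(cwnd_bytes, old_mss, new_mss):
    """Return the congestion window in bytes after a segment size change."""
    if new_mss >= old_mss:
        # MTU raised: the window in bytes must not grow, so it is kept
        # as is and simply holds fewer, larger packets.
        return cwnd_bytes
    # MTU reduced: the window in packets must not grow, so the byte
    # count shrinks in proportion to the segment size.
    packets = max(1, cwnd_bytes // old_mss)
    return packets * new_mss
```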

   All implementations MUST include a mechanism to implement diagnostic
   tools that do not rely on the operating system's implementation of
   path MTU discovery.   This specifically requires the ability to send
   packets that are larger than the known MTU for the path, and to
   collect any resultant ICMP error messages. See Section 5.5.4.

5.  Implementation Issues

   This section discusses a number of issues related to the
   implementation of Path MTU Discovery.  This is not a specification,
   but rather a set of notes provided as an aid for implementers.

   The issues include:
   o  The separation into layers.
   o  The mechanics of probing, as seen by IP and below.
   o  Search Strategy.
   o  How to implement PLPMTUD in specific Packetization Layers.
   o  How to improve Operational Integration and deployment.

5.1  Layering

5.1.1  Accounting for Header Sizes

   Packetization Layer Path MTU Discovery is most easily implemented by
   splitting its functions between layers.  The IP layer is in the best
   place to keep shared state, collect the ICMP messages, track IP
   header sizes and manage MTU information from the link layer
   interfaces.  However the procedures that PLPMTUD uses for probing,
   verifications and scanning for the path MTU are very tightly coupled
   to the data recovery and congestion control state machines in the
   Packetization Layer.   The most difficult part of implementing
   PLPMTUD is properly splitting the implementation between the layers.

   Note that this layering is consistent with the advice in the current
   PMTUD specifications [2][3]. Today, many implementations of classical
   PMTU Discovery are already split along these same layers.

   Early implementation of PLPMTUD revealed that it is critically
   important to have a good clean mechanism for accounting for header
   sizes at all layers.  This is because each Packetization Layer does
   its calculations in its own natural data units, which are almost
   always a reflection of the service that the Packetization Layer
   provides to the application or other upper layers.  For example, TCP
   naturally performs all of its calculations in terms of sequence
   numbers and segment sizes.   The size of the probe gap is the size of
   the data segment that was carried by the probe packet. However, the
   MTU size being probed, ICMP MTU, etc. are measures of full packets,
   which not only include the TCP data (measured in sequence space) but
   also include fixed TCP and IP headers, and may include IPv6 extension
   headers or IPv4 options, TCP options and even IPsec AH or ESP headers
   as well.

   PLPMTUD requires frequent translation between these two domains: the
   Packetization Layer's natural data unit and full IP packet sizes.
   While there are a number of possible ways to accurately implement
   dual size measures, our experience has been that it is best if the
   IP layer and the Packetization Layer communicate across their
   boundary in terms of the IP Maximum Payload Size or MPS.  The MPS is
   the only size measure that is common to both the IP and Packetization
   Layers, because it exactly matches the boundary between the layers.
   The IP Layer is responsible for adding or deducting its own headers
   when translating between MTU and MPS.  Likewise the Packetization
   Layer is responsible for adding or deducting its own headers when
   doing calculations in its natural data units.
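
   The division of labor described above might look like the following
   sketch, in which each layer only ever adds or deducts its own
   headers.  The function names are hypothetical.  The constants are
   the minimum IPv4 and TCP header sizes and the fixed IPv6 header
   size; any options or extension headers are passed in explicitly.

```python
IPV4_HEADER = 20  # bytes, without options
IPV6_HEADER = 40  # bytes, without extension headers
TCP_HEADER = 20   # bytes, without options


def mtu_to_mps(mtu, ip_header=IPV6_HEADER, ip_extras=0):
    """IP layer: path MTU to Maximum Payload Size."""
    return mtu - ip_header - ip_extras


def mps_to_mtu(mps, ip_header=IPV6_HEADER, ip_extras=0):
    """IP layer: Maximum Payload Size back to a full packet size."""
    return mps + ip_header + ip_extras


def mps_to_mss(mps, tcp_options=0):
    """Packetization Layer (TCP): MPS to maximum segment payload."""
    return mps - TCP_HEADER - tcp_options


def mss_to_mps(mss, tcp_options=0):
    """Packetization Layer (TCP): segment payload back to MPS."""
    return mss + TCP_HEADER + tcp_options
```

   With a 1500 byte path MTU over IPv6, the MPS is 1460 bytes, and a
   TCP connection carrying 12 bytes of options has 1428 bytes left for
   data in each segment.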

   This document does not take a stance on the placement of IPsec, which
   logically sits between IP and the Packetization Layer. As far as
   PLPMTUD is concerned IPsec can be treated either as part of IP or as
   part of the Packetization Layer, as long as the accounting is
   consistent within any given implementation.  If IPsec is treated as
   part of the IP layer, then each security association to a remote node
   may need to be treated as a separate flow for PLPMTUD, if they have
   different length security headers. If IPsec is treated as part of the
   packetization layer, the IPsec header size has to be included in the
   Packetization Layer's header size calculations.

5.1.2  Storing PMTU information

   This memo uses the concept of a "flow" to define the scope in which
   path MTU information is used.  Each flow locally stores its maximum
   payload size (MPS), which is used for packetizing data.
   Packetization Layers may communicate with the IP layer to store or
   access cached MPS values, providing a means by which similar flows
   may share information. The IP layer also stores PMTU and derived MPS
   information when it receives Packet Too Big messages.

   Ideally, a PMTU value should be associated with a specific path
   traversed by packets exchanged between the source and destination
   nodes.  However, in most cases a node will not have enough
   information to completely and accurately identify such a path.
   Rather, a node must associate a PMTU value with some local
   representation of a path.  It is left to the implementation to select
   the local representation of a path.

   An implementation could use the destination address as the local
   representation of a path.  The PMTU value associated with a
   destination would be the minimum PMTU learned across the set of all
   paths in use to that destination.  The set of paths in use to a
   particular destination is expected to be small, in many cases
   consisting of a single path.  This approach will result in the use of
   optimally sized packets on a per-destination basis.  This approach
   integrates nicely with the conceptual model of a host as described in
   [ND@@@@]: a PMTU value could be stored with the corresponding entry
   in the destination cache.   However, NAT and other forms of middle
   boxes may exhibit differing MTUs at a single IP address.

   If IPv6 flows are in use, an implementation could use the IPv6 flow
   id [7][14] as the local representation of a path.  Packets sent to a
   particular destination but belonging to different flows may use
   different paths, with the choice of path depending on the flow id.
   This approach will result in the use of optimally sized packets on a
   per-flow basis, providing finer granularity than PMTU values
   maintained on a per-destination basis.

   For source routed packets (i.e. packets containing an IPv6 Routing
   header, or IPv4 LSRR or SSRR options), the source route may further
   qualify the local representation of a path.    An implementation
   could use source route information in the local representation of a
   path.

   If IPsec is in use, the security association can also be used to
   represent a path.
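
   To make the choices above concrete, the sketch below shows a path
   information cache whose key can be qualified by any of the items
   discussed in this section.  The class and its methods are
   hypothetical, and a real IP layer would also need entry aging and
   locking, which are omitted here.

```python
class PathMtuCache:
    """Hypothetical IP-layer cache keyed by a local path representation."""

    def __init__(self, default_mtu):
        self.default_mtu = default_mtu
        self._entries = {}

    @staticmethod
    def key(dst, flow_id=None, source_route=None, spi=None):
        # Unused qualifiers stay None, so a plain destination key and a
        # per-flow key for the same address are distinct entries.
        route = tuple(source_route) if source_route else None
        return (dst, flow_id, route, spi)

    def lookup(self, key):
        return self._entries.get(key, self.default_mtu)

    def store(self, key, mtu):
        """Record a new value, e.g. after a verified probe."""
        self._entries[key] = mtu

    def lower(self, key, mtu):
        """Keep the minimum of the stored value and a newly learned one."""
        self._entries[key] = min(self.lookup(key), mtu)
```

   A per-destination implementation would call key() with only the
   destination address, while a per-flow implementation would also
   pass the IPv6 flow id.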

5.2  Lower Layers

5.2.1  Generating Probes

   A new candidate MTU is tested by sending one "probe packet", which is
   larger than the current MTU.  In this section we present a couple of
   possible ways to alter packetization layers to generate probe
   packets.   The different techniques incur different overheads in
   three areas: difficulty in generating the probe packet (in terms of
   packetization layer implementation complexity and computational
   overhead), possible additional network capacity consumed by the probes,
   and the overhead of recovering from failed probes (both network and
   protocol overhead).

   For example some protocols might be extended to allow padding with
   dummy data within their packets.  This would greatly simplify the
   implementation because the probing can be performed without
   participation from the application and if the probe fails, the
   missing data (the "probe gap") is assured to fit within the current
   MTU when it is retransmitted. However, the padding does consume
   network capacity without carrying any useful payload.

   This technique does not work for TCP, because there is not a separate
   length field or other mechanism to differentiate between padding and
   real payload data. With TCP the natural approach is to send
   additional payload data in an over-sized segment.   There are several
   variants which have different tradeoffs.

   In one method, after a TCP probe segment has been sent the subsequent
   segment(s) may be sent as though the probe segment was not
   over-sized.  Thus if the probe segment is lost, it will leave a gap
   in the sequence space that is exactly the correct size to be filled
   by one segment at the current MTU.   Since this method generates
   overlapping data, it will cause duplicate acknowledgments if the
   probe is successfully delivered.  The sender must be capable of
   ignoring these expected duplicate acknowledgments in a manner which
   will not cause unnecessary retransmission or congestion window
   adjustment.

   In the second method, after a TCP probe segment has been sent,
   subsequent TCP segments are sent in a non-overlapping manner.  If the
   probe segment is lost, it will leave a gap which will require
   retransmission of multiple segments to fill. This method has lower
   overhead for successful probes, but it requires more complexity in
   the retransmit logic to correctly retransmit the missing data (the
   "probe gap") with multiple segments that fit into the old MTU, while
   properly suppressing the congestion adjustments for this one
   situation and no others.

   Several packetization protocols may be best served by using an
   adjunct protocol for MTU probing: a separate protocol (or protocol
   feature) that does not carry any real application data.  This greatly
   simplifies implementation because nothing needs to be retransmitted
   when the probe is lost, but it does consume network capacity without
   delivering any useful payload.

   Two important examples of this come to mind: SCTP [9], which might
   use its existing HEARTBEAT facility padded with dummy data to fill
   out the probe packet; and IP fragmentation, which is sometimes used
   as a Packetization Layer for carrying oversized datagrams as
   described in Section 5.2.6.  In the case of IP fragmentation an
   entirely separate protocol is needed, one that has to use the
   diagnostic interface described in Section 5.5.4.

   It should be clear that nearly all packetization layers can be
   adapted to support PLPMTUD, possibly in more than one way.

5.2.2  Selecting the initial MTU

   When the PLPMTUD process is started the initial MTU should normally
   be set such that the Packetization Layer can carry 1 kByte data
   segments.  This initial MTU should be 1 kByte plus space for IP and
   Packetization Layer headers (see Section 5.1 on accounting for
   headers).  With this MTU, RFC 2414 [6] allows TCP and other
   transport protocols to start with an initial window of 4 packets.
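
   As a rough illustration of this arithmetic, the following Python
   sketch computes the initial MTU for a 1 kByte segment and the
   RFC 2414 upper bound on the initial window.  The function names and
   the header sizes (20-byte IPv4 and 20-byte TCP headers without
   options) are assumptions made for the example only.

```python
# Hypothetical helpers; header sizes assume IPv4 and TCP without options.
IPV4_HEADER = 20
TCP_HEADER = 20


def initial_mtu(segment_bytes=1024, ip_header=IPV4_HEADER,
                pl_header=TCP_HEADER):
    """Initial MTU: one data segment plus IP and Packetization Layer headers."""
    return segment_bytes + ip_header + pl_header


def rfc2414_initial_window(mss):
    """RFC 2414 upper bound on the initial window, in bytes:
    min(4*MSS, max(2*MSS, 4380))."""
    return min(4 * mss, max(2 * mss, 4380))


def initial_window_segments(mss):
    """Whole segments that fit in the RFC 2414 initial window."""
    return rfc2414_initial_window(mss) // mss
```

   Under these assumptions a 1024-byte segment gives an initial MTU of
   1064 bytes and an initial window of 4 segments, while a 1460-byte
   segment is limited to 3 segments by the 4380-byte term.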

   We suspect, but have not confirmed, that TCP actually starts faster
   (and completes sooner for small packets) with 1kB packets rather than
   1500 byte packets because the 2nd data ACK occurs one round trip

   This initial MTU should also be configurable.  One of the
   configuration options should be to set it to default to the
   interface's MTU, to mimic classical PMTUD behavior.  (See Section
   5.5.1.)

5.2.3  Normal sequence of events to raise the MTU

   If the probe size is smaller than the actual path MTU and there are
   no other losses, the normal sequence of events to probe and raise the
   MTU will be:
   1.  The probe is sent, followed by more packets at the current MTU.
       By definition PLPMTUD enters the probe phase.   The probe
       propagates through the network and the far node acknowledges it
       (or possibly later data, if acknowledgements are cumulative and
       delayed acknowledgement is in effect).

   2.  The acknowledgement for the probe reaches the data sender.   By
       definition, this ends the probe phase.

   3.  The packetization layer provisionally raises the MTU to the probe
       size. PLPMTUD enters the transitional phase when it starts
       sending data using the provisional MTU.

       Note that implementations that use packet counts for congestion
       accounting (e.g. keep cwnd in units of packets) must re-scale
       their congestion accounting such that raising the MTU does not
       raise the data rate (bytes/second) or the total congestion window
       in bytes.

       If the implementation packetizes the data at the application
       programming interface, it may transmit already queued data at the
       current MTU before raising the MTU. In this case this data is not
       part of either the probing or transition phases, because all of
       the packets in flight fit within the current MTU.

   4.  Once the first packet of the transitional phase is acknowledged,
       PLPMTUD enters the verification phase.  In principle the
       verification phase can be of arbitrary duration; however, at
       this time we are recommending one full window of data (i.e., one
       full round trip time) for most Packetization Layers.

   5.  Once sufficient data has been delivered and acknowledged in the
       verification phase, the provisional MTU is considered verified
       and the path MTU is updated.  PLPMTUD can then probe for an even
       larger MTU, as described in the searching strategy in Section
       5.3.

   Other events described in the next section are treated as exceptions
   and alter or cancel some of the steps above.
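
   The loss-free sequence above can be sketched as a small state
   machine.  This is only an illustration: the class, method and phase
   names are hypothetical, verification is measured in acknowledged
   bytes against one window, and none of the exception cases from the
   next section are handled.

```python
from enum import Enum, auto


class Phase(Enum):
    IDLE = auto()          # no probe outstanding
    PROBE = auto()         # step 1: probe sent, not yet acknowledged
    TRANSITION = auto()    # step 3: sending at the provisional MTU
    VERIFICATION = auto()  # step 4: provisional MTU being verified


class MtuRaiser:
    """Loss-free path through steps 1 to 5 (illustrative only)."""

    def __init__(self, current_mtu):
        self.mtu = current_mtu        # verified path MTU
        self.provisional_mtu = None
        self.probe_size = None
        self.phase = Phase.IDLE
        self.verified_bytes = 0

    def send_probe(self, probe_size):
        """Step 1: a probe larger than the current MTU is sent."""
        if self.phase is not Phase.IDLE or probe_size <= self.mtu:
            raise ValueError("probe must start from idle and exceed the MTU")
        self.probe_size = probe_size
        self.phase = Phase.PROBE

    def probe_acked(self):
        """Steps 2 and 3: the probe is acknowledged, MTU raised provisionally."""
        if self.phase is not Phase.PROBE:
            raise ValueError("no probe outstanding")
        self.provisional_mtu = self.probe_size
        self.phase = Phase.TRANSITION

    def provisional_data_acked(self, nbytes, window_bytes):
        """Steps 4 and 5: acknowledgement of data sent at the provisional MTU."""
        if self.phase is Phase.TRANSITION:
            # First acknowledgement of provisional-MTU data.
            self.phase = Phase.VERIFICATION
            self.verified_bytes = 0
        elif self.phase is Phase.VERIFICATION:
            self.verified_bytes += nbytes
            if self.verified_bytes >= window_bytes:
                self.mtu = self.provisional_mtu   # path MTU is updated
                self.provisional_mtu = None
                self.phase = Phase.IDLE
```

   In this sketch the first acknowledgement of provisional-MTU data
   only moves the state to verification; the window of verified data is
   counted from the acknowledgements that follow it.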

5.2.4  Processing MTU Indications

   The descriptions below assume that the Packetization Layer protocol
   has a TCP fast retransmit style mechanism to synchronously detect
   the loss of a probe packet and trigger retransmission, without loss
   of the protocol's self clock.  If this fails, then some sort of
   retransmission timeout will serve to catch the loss.  They also
   assume that there is some mechanism to detect full-stop timeouts.

   If any of these events (or the receipt of an ICMP Packet Too Big
   message) occurs during the above process to raise the MTU, then it
   is processed as indicated in the following sections.

Processing Packet Too Big Messages

   Classical PMTU discovery specifies the generation of Packet Too Big
   Messages if an over-sized packet (e.g. a probe) encounters a link
   that has a smaller MTU.  Since these messages cannot be
   authenticated, they introduce a number of well documented attacks
   against classical PMTUD [5].

   With PLPMTUD these messages are not required for correct operation,
   and in principle can be summarily ignored at the expense of slower
   convergence to the proper MTU.  However, we believe that a slightly
   better compromise is to process Packet Too Big messages in two
   specific contexts: in conjunction with a PLPMTUD probe or a full-stop
   timeout.

   Every Packet Too Big Message should be subjected to the following
   checks:
   o  If globally forbidden then discard the message.

   o  If forbidden by the application then discard the message.

   o  If this path has been tagged "bogus ICMP messages" then discard
      the message.

   o  If the reported MTU fails consistency checks then set the "bogus
      ICMP messages" flag for this path and discard the message.  These
      consistency checks include:
      *  unrecognized or unparseable enclosed header,
      *  reported MTU is larger than the size indicated by the enclosed
         header or
      *  larger than the current MTU, provisional MTU or probe size as
      *  or fails an ICMP consistency check specific to the
         Packetization Layer (e.g., the SCTP Verification-Tag
         mechanism).
      To ease migration, it is suggested that implementations may
      include global controls to suppress some or all of the consistency
      checks.
   If the Packet Too Big Message is acceptable under all of these
   checks, do one of two things depending on a global configuration
   switch: emulate classical path MTU discovery by processing the
   message immediately (i.e., set the path MTU to the size indicated in
   the message), or save the "ICMP MTU" pending another PLPMTUD event.
   In the latter case the saved ICMP MTU will only be acted upon under
   appropriate conditions: if there are lost probes, lost verification
   packets or a full stop timeout.  This greatly reduces the impact of
   fraudulent ICMP Packet Too Big messages.
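
   One possible shape for this filtering is sketched below.  All of the
   type, field and function names are hypothetical, the Packetization
   Layer specific check is omitted, and the upper bound on the reported
   MTU is taken to be the largest size currently in use (current MTU or
   outstanding probe), which is one reading of the checks above rather
   than a definitive rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PathState:
    current_mtu: int
    probe_size: Optional[int] = None   # outstanding probe, if any
    bogus_icmp: bool = False           # the "bogus ICMP messages" tag
    icmp_mtu: Optional[int] = None     # saved ICMP MTU, not yet acted upon


@dataclass
class PacketTooBig:
    reported_mtu: int
    enclosed_len: Optional[int]        # None if enclosed header is unparseable


def accept_ptb(path, msg, globally_forbidden=False, app_forbidden=False):
    """Apply the discard rules; tag the path on a consistency failure."""
    if globally_forbidden or app_forbidden or path.bogus_icmp:
        return False
    # Assumption: compare against the largest size currently in use.
    limit = max(path.current_mtu, path.probe_size or 0)
    consistent = (msg.enclosed_len is not None
                  and msg.reported_mtu <= msg.enclosed_len
                  and msg.reported_mtu <= limit)
    if not consistent:
        path.bogus_icmp = True
        return False
    return True


def handle_ptb(path, msg, emulate_classical=False, **policy):
    """Either act on the message immediately or save the ICMP MTU."""
    if not accept_ptb(path, msg, **policy):
        return False
    if emulate_classical:
        path.current_mtu = msg.reported_mtu   # classical PMTUD behavior
    else:
        path.icmp_mtu = msg.reported_mtu      # wait for a PLPMTUD event
    return True
```

   With emulate_classical left off, an accepted message changes nothing
   until a lost probe, a lost verification packet or a full stop
   timeout gives PLPMTUD a reason to consult the saved value.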

   In either case, if the Packetization Layer calls for specific
   actions in response to a Packet Too Big message, those actions
   should be invoked only at the point when the path MTU is updated
   from the ICMP MTU.

Packetization Layer Detects Lost Packets

   Each packetization protocol has its own mechanism to detect lost
   packets and request the retransmission of missing data.  The primary
   signals used by the packetization layer are these protocol specific
   loss indications.  The packetization layer is responsible for
   retransmitting the lost data and notifying PLPMTUD that there was a
   loss.
   o  If the probe itself was lost, and there were no other losses
      during the probe phase (the RTT between when the probe was sent
      and when the loss was detected), then it is taken as an
      indication that the path MTU is smaller than the probe size.  In
      this situation alone the Packetization Layer is permitted to
      retransmit the missing data (the "probe gap") without adjusting
      its congestion window or data transmission rate.

      If an accepted Packet Too Big Message was received after the probe
      was sent, and it passes the additional checks that the ICMP MTU is
      greater than the current MTU and less than the probe size, then
      set the probe size to the ICMP MTU, and restart the probe process
      from step 1 in Section 5.2.3.

      If there was not an accepted Packet Too Big Message, then the
      indicated event is a "probe failure", which can be retried with a
      smaller probe size after a suitable delay for a probe_fail_event.
      See Section for more complete descriptions of failure

   o  If there are losses during the probe phase and the probe was not
      lost, then the probe was successful.  However, since additional
      losses have the potential to spoil the verification phase, it is
      important that PLPMTUD not progress into the transition phase
      (step 3 above) until after the Packetization Layer has fully
      recovered from the losses and completed the congestion window (or
      rate) adjustment.

   o  If there are losses during the probe phase and the probe was also
      lost, the outcome depends on the presence of an ICMP MTU set by an
      acceptable Packet Too Big Message.

      If there was an accepted Packet Too Big Message received since the
      probe was sent, and it passes the additional checks that the ICMP
      MTU is greater than the current MTU and less than the probe size,
      then set the probe size to the ICMP MTU, and once the
      Packetization Layer completes the recovery from the losses then
      restart the probe process from step 1 in Section 5.2.3.

      If there was not an accepted Packet Too Big Message, then the
      probe is inconclusive because the lost probe might have been
      caused by congestion.  The probe can be retried after a suitable
      delay for a probe_inconclusive_event.

   o  It is unlikely that losses during the transition phase are caused
      by PLPMTUD; however, they do potentially complicate the
      verification phase.  Note that we are referring to losses that are
      followed by acknowledgement of packets that were sent at the old
      MTU, while the transition to the provisional MTU is still
      propagating through the network.  The first acknowledgement of
      the provisional MTU (and the transition to the verification phase)
      is most likely going to occur during the recovery of the losses in
      the transition phase.  It is important that the Packetization
      Layer retransmission machinery distinguish between losses at the
      old MTU (transition phase) and the provisional MTU (the
      verification phase, discussed next).

   o  Losses during the verification phase are taken as an indication
      that the path may have a non-uniform MTU or some other problem
      such that raising the MTU substantially raises the loss rate.  If
      so, this is potentially a very serious problem, so the provisional
      MTU is considered to have failed and the path MTU is set back to
      the previously verified MTU (the previously current MTU).

      Packet loss during the verification phase might also be due to
      coincidental congestion on the path, unrelated to the probe, so it
      would seem to be desirable to re-probe the path.  The risk is that
      this effectively raises the tolerated loss threshold because even
      though raising the MTU seemed to cause additional loss, there is a
      statistical chance that repeated attempts to verify a new MTU may
      yield a false pass.  The compromise is to re-probe once with
      the same probe size (after delay probe_inconclusive_event), and if
      this also fails, then the probe may not be retried until after a
      suitable delay for a verification_error_event, which exponentially
      increases on each successive failure.

Packetization Layer Retransmission Timeout

   Note that we do not make distinctions between the various methods
   that different Packetization Layers might use for detecting and
   retransmitting lost packets.  It is preferable that the
   Packetization Layer uses a recovery mechanism similar to TCP SACK or
   fast retransmit (or other "synchronous" loss recovery mechanism) to
   detect losses and recover as quickly as possible.

   Under some conditions the Packetization Layer may have to rely on
   retransmission timeouts or other fairly disruptive techniques to
   recover from losses.  Since these greatly increase the cost of
   failed probes, it is recommended that PLPMTUD use even longer delays
   before re-probing.  In these situations replace probe_fail_event with
   probe_timeout_event.
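
   The probe phase loss cases discussed so far can be condensed into a
   single decision function, sketched here.  The event names
   probe_fail_event, probe_inconclusive_event and probe_timeout_event
   are the ones listed under Probing Intervals; the function name, its
   parameters and the remaining action strings are hypothetical.

```python
def classify_probe_phase(probe_lost, other_losses, icmp_mtu, current_mtu,
                         probe_size, detected_by_timeout=False):
    """Return (action, next_probe_size) for the end of a probe phase.

    icmp_mtu is the MTU saved from an accepted Packet Too Big message
    received since the probe was sent, or None if there was none.
    next_probe_size is None when the search strategy must pick the size.
    """
    if not probe_lost:
        # Probe succeeded; with other losses, wait for recovery to
        # complete before entering the transition phase.
        if other_losses:
            return ("probe_ok_after_recovery", None)
        return ("probe_ok", None)

    # Probe lost: a usable ICMP MTU restarts the probe at that size.
    if icmp_mtu is not None and current_mtu < icmp_mtu < probe_size:
        return ("restart_probe", icmp_mtu)

    if other_losses:
        # Possibly congestion: retry later with the same probe size.
        return ("probe_inconclusive_event", probe_size)

    if detected_by_timeout:
        # Loss found by a retransmission timeout: use the longer delay.
        return ("probe_timeout_event", None)

    # Only the probe was lost: retry later with a smaller probe size.
    return ("probe_fail_event", None)
```

   Only the last two outcomes permit the "probe gap" to be retransmitted
   without a congestion adjustment, since they are the cases in which
   the probe was the sole loss.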

Packetization Layer Full Stop Timeout

   Under all conditions (not just during MTU probing) a full stop
   timeout should be taken as an indication of some significantly
   disruptive event in the network, such as a router failure or a
   routing change to a path with a smaller MTU.

   If the ICMP MTU is set, and it is less than the current MTU (or
   provisional MTU during the transitional phase), then the path MTU can
   be reduced to the ICMP MTU.  This is the only situation (a full stop
   timeout) outside of a probe in which we recommend that the path MTU
   be set from the ICMP MTU.  (In Section 5.5.1 we relax this
   recommendation to facilitate migration to PLPMTUD in exchange for
   slightly less protection from corrupt Packet Too Big messages.)

   Note that whenever a problem with the path causes a full-stop
   timeout (also known as a "persistent timeout" in other documents),
   several different path restart/recovery algorithms may be invoked at
   different layers in the stack.  Some device drivers may be restarted
   [@@], router discovery [@@], ES-IS [@@] and so forth.  We recommend
   that in most situations the first action should be to set the path
   MTU down.  Note that this recommendation is really beyond the scope
   of this document, and may require substantial additional research.

   Therefore, if there is a full stop timeout and there was not an ICMP
   message indicating a reason (Packet Too Big, Net unreachable, etc.),
   or the ICMP message was ignored for some reason, we suggest that the
   first recovery action should be to set the path MTU down to a safe
   minimum "restart MTU" value and to reset the PLPMTUD search state,
   so that PLPMTUD will start over again searching for the proper MTU.
   The default restart_MTU should be the minimum MTU as specified by
   IPv4 (576) [1] or IPv6 (1280) [7] as appropriate, unless overridden
   by some global control (see Section 5.5.5).
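
   A minimal sketch of this recovery rule follows.  The function and
   parameter names are hypothetical, and leaving the search state
   untouched when a saved ICMP MTU is applied is an assumption of the
   example, not something the text above prescribes.

```python
# Minimum MTUs quoted above for IPv4 and IPv6.
MIN_MTU = {4: 576, 6: 1280}


def on_full_stop_timeout(mtu_in_use, icmp_mtu, ip_version=4,
                         restart_mtu=None):
    """Return (new_path_mtu, reset_search_state).

    mtu_in_use is the current MTU, or the provisional MTU during the
    transitional phase.  icmp_mtu is the saved ICMP MTU, or None if no
    acceptable Packet Too Big message was recorded.
    """
    if icmp_mtu is not None and icmp_mtu < mtu_in_use:
        # The saved ICMP MTU explains the timeout: drop to it.
        return icmp_mtu, False
    if restart_mtu is None:
        # No global override: fall back to the protocol minimum.
        restart_mtu = MIN_MTU[ip_version]
    return restart_mtu, True
```

   For example, a timeout at an MTU of 4352 with a saved ICMP MTU of
   1500 falls back to 1500, while the same timeout with no ICMP MTU
   falls back to 576 on IPv4 and restarts the search.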

   If and only if the full stop timeout happens during the probe or
   transition phases (e.g. after sending data using the provisional
   MTU but before any of it is acknowledged) is it considered likely
   that raising the MTU caused the full stop timeout.  If so, this
   situation is likely to be cyclic, because resetting the PLPMTUD
   search state is likely to eventually cause re-probing the same
   problematic MTU.

   It is tempting to define additional states to detect recurrent full
   stop timeouts.  However, in today's hostile network environment,
   there is little tolerance for nodes that are so fragile that they
   can be disrupted by something as simple as oversized packets.
   Therefore we do not feel that it is worth the overhead of specifying
   a state machine that is capable of automatically detecting these
   situations and disabling PLPMTUD.  However, it is important that
   there be a manual way to disable or limit probing on specific paths.
   See Section

5.2.5  Probing Intervals

   Section describes a number of probe failure events.   In all
   cases the basic response is the same: to wait some time interval
   (dependent on the specific event and possibly the history) and then
   to probe again.  For events that are "inconclusive", it is generally
   appropriate to re-probe with the same probe size.   For events that
   are identified as "failed probes" it is generally appropriate to
   re-probe with a smaller probe size.   The search strategy described
   in Section 5.3 is used to select probe sizes.

   Many of the intervals below are specified in terms of elapsed round
   trips relative to the current congestion window.   This is because
   TCP and other Packetization Layer protocols tend to exhibit periodic
   losses which cause periodic variations of the congestion window and
   possibly the data rate.  It is preferable that the PLPMTUD probes are
   scheduled near the low point of these cycles to minimize ambiguities
   caused by congestion losses.

   In order from least to most serious:
   probe_inconclusive_event Other lost packets near the lost probe made
      the probe result ambiguous.   Since the loss of non-probe packets
      requires a window (or data rate) reduction, it is desirable to
      schedule the re-probe (at the same probe size) at one round trip
      time after the end of the loss recovery.   This will be almost the
      minimum congestion window size, with a small cushion to minimize
      the chances that correlated losses caused by some other bursty
      connection spoil another probe.

   probe_fail_event A probe fail event is the one situation under which
      the Packetization layer is permitted not to treat loss as a
      congestion signal.  Because there is some small risk that
      suppressing congestion control might have unanticipated
      consequences (even for one isolated loss), we require that probe
      fail events be less frequent than the normal period for losses
      under standard congestion control.  Specifically after a probe
      fail event and suppressed congestion control, PLPMTUD may not
      probe again until an interval which is comparable to the expected
      interval between congestion control events. See Section 4.

      The simplest estimate of the interval to the next congestion event
      is the same number of round trips as the current congestion window
      in packets.

   probe_timeout_event Since this event was detected by a timeout, it is
      relatively disruptive to protocol operation.   Furthermore, since
      the event indirectly includes a window adjustment that may have
      been caused by the MTU probe, it is important that the probe not
      be repeated until congestion has more than recovered from the
      loss.   Therefore we recommend five times the probe_fail_event
      interval.   I.e. five times as many round trips as the current
      congestion window in packets.

   verification_error_event A verification error event indicates that a
      probe was delivered and the verification phase failed twice,
      separated by a congestion adjustment (so the second verification
      phase was at a low point in the congestion control cycle).  This
      is an indication that one of the following three things might
      have happened: repeated losses unrelated to PLPMTUD; the path is
      striped across links with dissimilar MTUs; or the link layer has
      some parametric limitation such that raising the MTU greatly
      increases the random error rate.

      The optimal method of responding to this situation is an open
      research question.  We believe that the correct response is some
      combination of exponentially lengthening backoffs (e.g., starting
      at 1 minute and quadrupling on each repeat) and implicitly
      treating the situation as a probe fail (and choosing a smaller
      probe size) after some threshold number of repeated failures.
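
   The intervals in this section can be summarized in a few lines of
   code.  This is a sketch only: the function names are hypothetical,
   delays for the first three events are expressed in round trips, and
   the verification backoff uses the example values given above
   (starting at 1 minute and quadrupling), which the text offers as an
   illustration rather than a requirement.

```python
def reprobe_delay_rtts(event, cwnd_packets):
    """Round trips to wait before probing again after an event."""
    if event == "probe_inconclusive_event":
        # One round trip after the end of loss recovery.
        return 1
    if event == "probe_fail_event":
        # As many round trips as the congestion window in packets.
        return cwnd_packets
    if event == "probe_timeout_event":
        # Five times the probe_fail_event interval.
        return 5 * cwnd_packets
    raise ValueError("unknown event: %r" % (event,))


def verification_backoff_seconds(repeat, base=60, factor=4):
    """Delay after the repeat-th verification_error_event (0-based)."""
    return base * factor ** repeat
```

   With a congestion window of 20 packets, a probe_fail_event would
   postpone the next probe by 20 round trips and a probe_timeout_event
   by 100; successive verification errors would wait 60, 240 and 960
   seconds.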

5.2.6  Host fragmentation

   Packetization layers are encouraged to avoid sending messages that
   will require fragmentation (for the case against fragmentation, see
   [17][18]).  However, this is not always possible.  Some
   packetization layers, such as a UDP application outside the kernel,
   may be unable to change the size of the messages they send.  This
   may result in packet sizes that exceed the Path MTU.

   IPv4 permitted such applications to send packets without DF set.
   Oversized packets without DF would be fragmented in the network or
   sending host when they encountered a link with a small MTU.  In some
   cases, packets could be fragmented more than once if there were
   cascaded links with progressively smaller MTUs.

   This approach is no longer recommended.  We now recommend that IPv4
   implementations use a strategy that mimics IPv6 functionality.  When
   an application sends datagrams that are larger than the known path
   MTU they should be fragmented to the path MTU in the host IP layer
   even if they are smaller than the link MTU of the first hop networks
   directly attached to the host.  The DF bit should be set on the
   fragments, so they will not be fragmented again in the network.
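
   To make the recommendation concrete, the sketch below computes how a
   host might split an IPv4 datagram to fit the path MTU.  It relies on
   the standard IPv4 rule that every fragment except the last carries a
   payload that is a multiple of 8 bytes; the function name, the
   returned tuple layout and the 20-byte header (no IP options) are
   assumptions of the example.

```python
def fragment_sizes(datagram_len, path_mtu, ip_header=20):
    """Split an IPv4 datagram so that each fragment fits path_mtu.

    datagram_len is the total length including the IP header.  Returns
    a list of (payload_offset, payload_len, more_fragments, df) tuples.
    """
    payload = datagram_len - ip_header
    if datagram_len <= path_mtu:
        return [(0, payload, False, True)]
    # Payload per fragment, rounded down to a multiple of 8 bytes.
    per_fragment = (path_mtu - ip_header) // 8 * 8
    if per_fragment <= 0:
        raise ValueError("path MTU too small to carry any payload")
    fragments = []
    offset = 0
    while offset < payload:
        size = min(per_fragment, payload - offset)
        more = offset + size < payload
        # DF is set on every fragment so the network will not
        # fragment it again.
        fragments.append((offset, size, more, True))
        offset += size
    return fragments
```

   For instance, a 4020-byte datagram sent over a path MTU of 1500
   yields payloads of 1480, 1480 and 1040 bytes, even if the first hop
   link could have carried the whole datagram unfragmented.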

   This technique will minimize future surprises as the Internet
   migrates to IPv6.  Otherwise there is the potential for widely
   deployed applications or services to rely on IPv4 fragmentation in a
   way that cannot be implemented in IPv6.  At least one major operating
   system already uses this strategy.

   Note that in principle the IP fragmentation layer is an example of a
   Packetization Layer, so it could implement full PLPMTUD in the
   fragmentation process.

5.2.7  Multicast

   In the case of a multicast destination address, copies of a packet
   may traverse many different paths to reach many different nodes.  The
   local representation of the "path" to a multicast destination must in
   fact represent a potentially large set of paths.

   Minimally, an implementation could maintain a single MPS value to be
   used for all packets originated from the node.  This MPS value would
   be the minimum MPS learned across the set of all paths in use by the
   node.  This approach is likely to result in the use of smaller
   packets than is necessary for many paths.

   Alternatively, if the application using multicast gets complete
   delivery reports (unlikely because this requirement has poor scaling
   properties), PLPMTUD could be implemented in multicast protocols.

5.3  Search Strategy

   The search strategy described here is only a guide for implementors.
   A standard algorithm is not specified because the strategy can
   include many heuristics to optimize MPS selection for a given path.
   Particularly, it may be appropriate for different protocols to follow
   different strategies.  There is opportunity for future improvements
   to this algorithm.

   The search strategy uses three variables:
      SEARCH_MAX is the largest MPS that a flow might be able to use.
      It is determined by such considerations as interface MTU, widths
      of protocol length fields, and possibly other protocol-dependent
      values, such as the TCP MSS option.  In many cases it would be
      the same as the classical MTU discovery initial MSS, minus the IP
      layer headers.
      SEARCH_LOW is the largest validated MPS, and should be used as the
      effective MPS by the packetization layer.   It is the same as the
      current validated MTU minus the IP layer headers.  The initial
      value for SEARCH_LOW should be a parameter, but a value of 1024
      may be a reasonable default.
      SEARCH_HIGH is the least invalidated MPS.  In most cases it will
      be the most recent failed probe size minus the IP layer headers.
      When PLPMTUD is initialized SEARCH_HIGH should be set to
      SEARCH_MAX.

   There are three major states: Search, Monitor and Suspend.  In the
   Search state, the algorithm incrementally searches for the largest
   MPS that the path can support, narrowing the difference between
   SEARCH_LOW and SEARCH_HIGH.  Once this gap is sufficiently narrow,
   the probing algorithm enters the Monitor state, where it probes
   infrequently to detect if the path MPS has become larger.

   If the MPS probing is determined to be harmful, perhaps by
   persistent probe failures, the flow may enter the Suspend state,
   completely disabling MPS probing.

5.3.1  Search

   In the Search state, the strategy follows a multi-phase scan.  If
   SEARCH_HIGH >= SEARCH_MAX, a coarse scan is used.  In this mode, each
   probe's payload size should be MIN(2 * SEARCH_LOW, SEARCH_MAX).  If
   SEARCH_HIGH < SEARCH_MAX, the fine scan mode should be used.
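
   The doubling scan can be simulated against a hypothetical path as
   follows.  The function name and the path_mps parameter (the largest
   MPS the simulated path actually carries) are inventions of the
   example, and stopping after the first failed probe, even one of size
   SEARCH_MAX, is an assumption made so that the sketch always
   terminates.

```python
def coarse_scan(search_low, search_max, path_mps):
    """Double the probe size until a probe fails or SEARCH_MAX is reached.

    Returns (search_low, search_high, probes_sent).
    """
    search_high = search_max          # initial value of SEARCH_HIGH
    probes = []
    while search_low < search_max:
        probe = min(2 * search_low, search_max)
        probes.append(probe)
        if probe <= path_mps:
            search_low = probe        # probe validated
        else:
            search_high = probe       # probe invalidated: scan ends
            break
    return search_low, search_high, probes
```

   Starting from SEARCH_LOW = 1024 with SEARCH_MAX = 9000 on a path
   that carries 4500 bytes, the probes are 2048, 4096 and 8192; the
   last one fails, leaving the range 4096 to 8192 for the fine scan.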

   The fine scan algorithm may pursue a number of different methods for
   choosing probe sizes.  It may be useful to choose probe sizes so that
   the final IP packet will fit common link MTUs, for example 1500,
   4352, 9000, 17914.  Optionally, probes smaller than these values by
   common tunnel header sizes may be used.
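
   One way to build such a candidate list is sketched below, working in
   terms of final IP packet size.  The link MTUs are the ones named
   above; the tunnel overheads (20 bytes for IPv4-in-IPv4 and 24 bytes
   for GRE over IPv4) are example values chosen for illustration, and
   the function name is hypothetical.

```python
COMMON_LINK_MTUS = (1500, 4352, 9000, 17914)
# Example encapsulation overheads in bytes: none, IPv4-in-IPv4,
# and GRE over IPv4 (4-byte GRE header plus 20-byte IPv4 header).
TUNNEL_OVERHEADS = (0, 20, 24)


def candidate_probe_sizes(low_mtu, high_mtu):
    """IP packet sizes worth probing strictly between the two bounds."""
    sizes = {mtu - overhead
             for mtu in COMMON_LINK_MTUS
             for overhead in TUNNEL_OVERHEADS}
    return sorted(s for s in sizes if low_mtu < s < high_mtu)
```

   Between a validated size of 1064 and an invalidated size of 4400,
   this yields 1476, 1480, 1500, 4328, 4332 and 4352 as candidates.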

   When using some protocols, the cost for a failed probe may be
   significantly higher than the cost of a successful probe due to
   retransmission and consequent delay jitter as seen by the
   application.  For this reason, one possible approach to the fine scan
   could be to use probes of size SEARCH_LOW + d, for some increment d.
   It should enter the Monitor state when SEARCH_LOW + d >= SEARCH_HIGH.
   This will result in at most one additional probe failure.

   Another approach may be to use a simple binary search where each
   probe size is (SEARCH_LOW + SEARCH_HIGH) / 2, entering the Monitor
   state when SEARCH_LOW + s >= SEARCH_HIGH for some threshold s.  This
   will converge quickly, but may have a higher number of probe
   failures.  It is more appropriate for a protocol whose probes consist
   entirely of padding.
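
   The two fine scan variants differ only in how the next probe size is
   chosen and in the stopping threshold, which the following sketch
   makes explicit.  The function and parameter names are hypothetical,
   and path_mps stands for the largest MPS that a simulated path
   carries.

```python
def fine_scan(low, high, path_mps, next_probe, gap):
    """Narrow [low, high) until low + gap >= high.

    Returns (search_low, search_high, probes_sent, failed_probes).
    """
    probes, failures = [], 0
    while low + gap < high:
        probe = next_probe(low, high)
        probes.append(probe)
        if probe <= path_mps:
            low = probe               # validated: raise SEARCH_LOW
        else:
            high = probe              # invalidated: lower SEARCH_HIGH
            failures += 1
    return low, high, probes, failures


def incremental_scan(low, high, path_mps, d):
    """Probe SEARCH_LOW + d; stops after at most one failure."""
    return fine_scan(low, high, path_mps, lambda lo, hi: lo + d, d)


def binary_scan(low, high, path_mps, s):
    """Probe the midpoint until the gap is no larger than s."""
    return fine_scan(low, high, path_mps, lambda lo, hi: (lo + hi) // 2, s)
```

   On a simulated path that carries 4500 bytes, starting from the range
   4096 to 8192, the incremental scan with d = 512 sends one probe and
   sees one failure, ending with the range 4096 to 4608.  The binary
   scan with s = 64 narrows the range to 4480 to 4544 in six probes,
   four of which fail.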

5.3.2  Monitor

   In the Monitor state, a probe of size SEARCH_HIGH should be sent at
   most once every MONITOR_INTERVAL seconds.  If the probe succeeds,
   then SEARCH_HIGH should be set to SEARCH_MAX, and the state should be
   set to Search.

   If there is evidence that no flow traffic is reaching its
   destination, such as repeated timeouts with no acknowledgements in
   TCP, it may be that the connection was re-routed to a path with a
   smaller MTU, and the Packet Too Big messages are ignored or filtered.
   In this case, SEARCH_LOW and SEARCH_HIGH should be set to initial
   values, and the Search state should be entered.

5.3.3  Suspend

   In the Suspend state, probing is entirely disabled, and the MPS
   should be set to 512 bytes.  The Suspend state should only be used if
   it is heuristically determined that probing is causing harmful

5.4  Specific Packetization Layers

   In this section we discuss specific implementation issues for
   different Packetization Layer protocols.

5.4.1  Probing method using TCP

   TCP has no mechanism that could be used to distinguish between real
   application data and some other form of padding that might be used to
   fill out probe packets.  Therefore, TCP must generate probes by
   sending oversized segments that are carrying real data from upper
   layers.  As previously mentioned there are two approaches that TCP
   might use to minimize the overheads associated with the probing

   A TCP implementation of PLPMTUD can elect to send subsequent segments
   overlapping the probe as though the probe segment was not oversized.
   This has the advantage that TCP only needs to retransmit one segment
   at the current MTU to recover from failed probes.  However, the
   duplicate data in the probe does consume network resources and will
   cause duplicate acknowledgments.  It is important that these extra
   duplicate acknowledgments not trigger Fast Retransmit.  This can be
   guaranteed by limiting the largest probe segment size to twice the
   current segment size (causing at most 1 duplicate acknowledgment) or
   three times the current segment size (causing at most 2 duplicate
   acknowledgments).
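
   The relationship between probe size and duplicate acknowledgments
   can be checked with a simplified model: after a delivered probe,
   each following segment that lies wholly inside the probe carries
   only duplicate data and draws one duplicate acknowledgment.  The
   function names are hypothetical, and the model ignores delayed
   acknowledgments and reordering.  The threshold of three duplicate
   acknowledgments is the usual Fast Retransmit trigger.

```python
FAST_RETRANSMIT_DUPACKS = 3   # usual Fast Retransmit threshold


def expected_dup_acks(probe_bytes, segment_bytes):
    """Duplicate ACKs caused by segments wholly covered by the probe."""
    return max(probe_bytes // segment_bytes - 1, 0)


def largest_safe_probe(segment_bytes, max_dup_acks=2):
    """Largest probe payload causing at most max_dup_acks duplicates."""
    return (max_dup_acks + 1) * segment_bytes
```

   In this model a probe of twice the segment size causes one duplicate
   acknowledgment and a probe of three times the segment size causes
   two, both below the Fast Retransmit threshold.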

   The other approach is to send non-overlapping segments following the
   probe.  Although this is cleaner from a protocol architecture
   standpoint, it clashes with many of the optimizations used to improve
   the efficiency of data motion within many operating systems.  In
   particular, many implementations divide the data into segments and
   pre-compute checksums as the data is copied out of user space.  In
   these implementations it can be very expensive to adjust segment
   boundaries after the data is already queued.

   If TCP is using SACK or any other variable length headers, the
   headers on the probe and verification packets should be padded to the
   maximum possible length. Otherwise, future options may cause delivery
   problems if they cause IP packets that are larger than the MTU.

   Note that the header size and overhead calculations described in
   Section 5.1 apply here.  TCP's natural data accounting units are
   sequence space and Maximum Segment Size.  However, the PLPMTUD
   process is described in terms of total packet size, which is larger
   than the MSS by all fixed and optional headers.

   At the point when TCP is ready to start the verification phase, it is
   permitted to transmit already queued data at the old MTU rather than
   re-packetize it.  This postpones the verification process by the time
   required to send the queued data.

   If the verification phase experiences any segment losses, TCP is
   required to pull back to the prior MSS.  Since failing the
   verification phase should be an infrequent error condition, it is
   less important that this be as efficient as probing.

Window management

   Some TCP implementations keep the congestion window in units of
   segments. When segment size is increased during a connection, a
   conservative implementation should scale cwnd so that, in units of
   bytes, it will remain unchanged.

   It is recommended that TCP should not probe a new MPS if that MPS
   will likely result in a cwnd of less than 5 segments.
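
   The recommendation above can be sketched as a simple gate in front of
   the prober.  The sketch assumes cwnd is tracked in bytes and that the
   candidate MPS is the segment size that would be used if the probe
   succeeds; both names are hypothetical.

```python
MIN_CWND_SEGMENTS = 5  # threshold taken from the recommendation above


def may_probe(cwnd_bytes, candidate_mps):
    """Return True if cwnd would still hold at least 5 segments."""
    return cwnd_bytes // candidate_mps >= MIN_CWND_SEGMENTS
```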

   If the network becomes too congested, it is recommended that the MPS
   be reduced to a smaller size as determined by a heuristic.  The
   recommended heuristic is to reduce the MPS by half if ssthresh is
   reduced to 5 segments or smaller, with a minimum MPS of 512 bytes.
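
   One possible reading of this heuristic is sketched below.  It assumes
   ssthresh is available in units of segments and that an MPS already at
   or below 512 bytes is left alone; the function name is illustrative.

```python
MIN_MPS = 512                # bytes, minimum MPS from the text above
SSTHRESH_LIMIT_SEGMENTS = 5  # halve the MPS at or below this ssthresh


def reduce_mps_on_congestion(mps, ssthresh_segments):
    """Halve the MPS when ssthresh has fallen to 5 segments or fewer."""
    if ssthresh_segments > SSTHRESH_LIMIT_SEGMENTS or mps <= MIN_MPS:
        return mps
    return max(mps // 2, MIN_MPS)
```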

5.4.2  Probing method using SCTP

   In the SCTP protocol, packetization is the responsibility of the
   application or protocol above SCTP.  The application writes a set
   message to SCTP and SCTP will "chunkify" it into appropriately sized
   pieces.  Some implementations MAY bundle multiple data chunks
   together, but this is NOT required implementation behavior.  By
   implication, not all SCTP implementations can easily generate probes
   by sending additional application data.  In particular, any
   implementation that does not implement data chunk bundling would not
   be able to implement a probe in this way.

   For SCTP the recommended method for generating probes is to pad SCTP
   HEARTBEAT messages to the desired probe size.  A successful probe
   will be acknowledged without delay by the peer SCTP implementation
   returning the same HEARTBEAT as a HEARTBEAT-ACK.  This assures that
   both directions will support the probed MTU size. [@@@@@ note that
   both sides of the path are tested]

   The verification phase is entered after a successful probe.  For
   implementations that can bundle multiple DATA chunks, the
   verification phase completes when a window's worth of bundled DATA
   chunks are exchanged at the new MTU value.  An SCTP implementation
   SHOULD arrange its fragmentation point so that a whole number of
   bundled DATA chunks fills a packet of the new MTU size.  For example,
   if the MTU size is 1500 bytes in IPv4, then a fragmentation point of
   718 bytes might be selected during the verification phase.  This
   would allow two bundled DATA chunks to be put together to exactly
   equal the proposed new PMTU.  After verification is complete, the
   fragmentation point can then be set to the actual PMTU, assuming that
   this new value is the smallest MTU of all of the SCTP paths.  An SCTP
   implementation is allowed to transmit previously queued, already
   fragmented DATA chunks that cannot be bundled together at the new MTU
   value.  For implementations that do not allow DATA chunk bundling,
   three subsequent HEARTBEAT messages should be sent over the next XX@@
   RTTs, padded to the new proposed MTU value.  If all of the HEARTBEATs
   are successful, then the new PMTU should be adopted for the path.
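
   The 718 byte figure can be reproduced with the arithmetic sketched
   below, assuming a 20 byte IPv4 header without options, the 12 byte
   SCTP common header, and the 16 byte DATA chunk header from RFC 2960
   [9].  The function name and the optional "align" flag are
   illustrative.  SCTP also requires each chunk to be padded to a
   multiple of 4 bytes, which the 718 byte figure appears not to account
   for; the flag shows the effect of rounding down to allow for that
   padding.

```python
IPV4_HEADER = 20         # bytes, assuming no IP options
SCTP_COMMON_HEADER = 12  # bytes: ports, Verification Tag, checksum
DATA_CHUNK_HEADER = 16   # bytes: chunk header plus TSN, stream, PPID


def fragmentation_point(mtu, chunks=2, align=False):
    """Return user data bytes per DATA chunk so `chunks` fill one packet."""
    per_chunk = (mtu - IPV4_HEADER - SCTP_COMMON_HEADER) // chunks
    if align:
        per_chunk -= per_chunk % 4  # leave room for 4 byte chunk padding
    return per_chunk - DATA_CHUNK_HEADER
```

   For a 1500 byte MTU this yields 718 bytes per chunk without the
   alignment adjustment, and 716 bytes with it.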

   [@@@@NOTE: it might be simpler to always use multiple HB's to prove
   in a PMTU during verification, I leave this up to you. One thing to
   keep in mind is that SCTP normally fragments its messages to the
   SMALLEST PMTU of all paths... since SCTP is multi-homed this makes it
   so any data chunk can fit on ANY path. Most implementations DO bundle
   data chunks for this very reason... its easy to do and it allows
   larger PMTU's on different paths to be utilized. So using the HB may
   be more efficient... its definitely simpler... I leave it to you to
   choose. We may also want to mention the ICMP issue with SCTP since a
   validated ICMP message with SCTP can always be trusted].

   The SCTP Verification Tag mechanism is designed to increase SCTP's
   robustness in the presence of a number of attacks, including forged
   ICMP messages.  It relies on a 32 bit Verification Tag which is
   initialized to a random value during connection establishment and
   placed in the first 64 bits of all SCTP messages.  All subsequent
   messages (including ICMP messages, which copy at least the first 64
   bits of the message) must match the original Verification Tag, or
   they are rejected as being likely attacks against the connection
   [9][16].
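
   As an illustration of the check on ICMP messages, the sketch below
   extracts the Verification Tag from the SCTP bytes quoted in an ICMP
   message and compares it with the tag the local endpoint places in the
   packets it sends.  It assumes the caller has already stripped the
   ICMP header and the quoted IP header; the function and parameter
   names are hypothetical.

```python
import struct


def icmp_quote_matches_tag(quoted_sctp, expected_tag):
    """Check the Verification Tag in the SCTP bytes quoted by ICMP.

    quoted_sctp is assumed to start at the SCTP common header:
    source port (16 bits), destination port (16 bits), tag (32 bits).
    """
    if len(quoted_sctp) < 8:
        return False  # too short to contain the tag; treat as invalid
    _src_port, _dst_port, tag = struct.unpack("!HHI", quoted_sctp[:8])
    return tag == expected_tag
```

   Because the tag occupies the second 32 bit word of the common header,
   the first 64 bits quoted by an ICMP message are enough to perform
   this comparison.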

   It is believed that the Verification Tag mechanism is strong enough
   that SCTP could unconditionally process Packet Too Large messages
   that would reduce the path MTU at arbitrary times.  As written, this
   document does not encourage this method.  The PLPMTUD ICMP validity
   checks are cascaded with the SCTP checks, such that the messages are
   processed only if they meet all consistency checks.  In particular,
   PLPMTUD only uses the ICMP MTU value following a probe, during MTU
   verification, or following a hard stop timeout.
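
   The cascading of checks described above might be sketched as follows.
   The phase names and the two boolean inputs are hypothetical; they
   stand in for whatever state and consistency checks an implementation
   actually keeps.

```python
from enum import Enum, auto


class PmtudPhase(Enum):
    IDLE = auto()
    AFTER_PROBE = auto()        # a probe was sent and is unresolved
    VERIFYING = auto()          # MTU verification is in progress
    HARD_STOP_TIMEOUT = auto()  # a hard stop timeout has occurred


# Phases in which the MTU value from an ICMP message may be used.
ICMP_MTU_PHASES = {
    PmtudPhase.AFTER_PROBE,
    PmtudPhase.VERIFYING,
    PmtudPhase.HARD_STOP_TIMEOUT,
}


def accept_icmp_mtu(phase, sctp_tag_ok, plpmtud_checks_ok):
    """Use the ICMP MTU only if every check passes in a permitted phase."""
    return sctp_tag_ok and plpmtud_checks_ok and phase in ICMP_MTU_PHASES
```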

   To change this, an implementation would have to suppress some of the
   checks in Section for SCTP.

5.4.3  Probing Method for IP Fragmentation

   As mentioned in Section 5.2.6, datagram protocols (such as UDP) can
   rely on IP fragmentation as a packetization layer.  Since the IP
   layer does not have any way to determine if the fragments were
   delivered, it cannot do the probing directly.  The probing has to
   be done with an adjunct protocol that uses the diagnostic API
   (Section 5.5.4) to send oversized probes, and some other API to
   update the MPS stored in the IP layer.

5.4.4  Issues for other transport protocols

   Some transport protocols (such as ISO TP4 [ISOTP]) are not allowed to
   repacketize when doing a retransmission.  That is, once an attempt is
   made to transmit a segment of a certain size, the transport cannot
   split the contents of the segment into smaller segments for
   retransmission.  In such a case, the original segment can be
   fragmented by the IP layer during retransmission.  Subsequent
   segments, when transmitted for the first time, should be no larger
   than allowed by the Path MTU.

5.5  Operational Integration

5.5.1  Interoperation with prior algorithms

   Properly functioning Path MTU discovery is critical to the robust and
   efficient operation of the Internet.  Any major change (as described
   in this document) has the potential to be very disruptive if it
   contains any errors or oversights.  Therefore, we offer a deployment
   strategy in which classical PMTUD operation as described in RFC 1191
   and RFC 1981 is unmodified and PLPMTUD is only invoked following a
   full stop timeout, presumably due to an "ICMP black hole".  To do
   this:
   o  Relax the ICMP checks in Section specifically to allow an
      ICMP Packet Too Large message to reduce the MTU at arbitrary
      times.
   o  When there is no cached MTU, use the Interface MTU as specified by
      classical PMTU discovery, rather than the initial MTU as specified
      in Section 5.2.2.
   o  MTU searching as described in Section 5.3 is disabled entirely or
      starts in the monitor state.
   o  A full stop timeout is processed as described in Section
      This becomes the only mechanism to invoke the rest of PLPMTUD.

   When configured in this manner, PLPMTUD will increase the robustness
   of classical PMTU discovery in the presence of ICMP black holes and
   other ICMP problems, with minimal exposure to unanticipated problems
   during deployment.  Since this configuration does not help robustness
   in the presence of malicious or erroneous ICMP messages, it is not
   recommended for the long term.

5.5.2  Interoperation over subnets with dissimilar MTUs

   With classical PMTUD, the ingress router to a subnet is responsible
   for knowing what size packets can be delivered to every node attached
   to that subnet.  For most subnet types, this requires that the
   entire subnet has a single MTU which is common to every attached
   node.  (For a few subnet types, e.g. ATM [12], the nodes on a subnet
   can negotiate the MTU on a pairwise basis, and the ingress router
   is responsible for knowing the MTU to each of its peers.)

   This requirement has proven to be a major impediment to deploying
   larger MTUs in the operational Internet.  Often one single node which
   does not support a larger MTU effectively vetoes raising the MTU on a
   subnet, because the ingress router does not have a mechanism to
   generate the proper Packet Too Big message for the one attached node
   with a smaller MTU.

   With PLPMTUD, this requirement is completely relaxed.  As long as
   oversized packets addressed to the nodes with the smaller MTU are
   reliably discarded, PLPMTUD will find the proper MTU for these nodes.

5.5.3  Interoperation with tunnels

   PLPMTUD is specifically designed to solve many of the problems that
   people are experiencing today due to poor interactions between
   classical MTU discovery, IPsec, and various sorts of tunnels [5].
   As long as the tunnel reliably discards packets that are too large,
   PLPMTUD will discover an appropriate MTU for the path.

   Unfortunately, due to the pervasive problems with classical PMTU
   discovery, many manufacturers of various types of VPN/tunneling
   equipment have resorted to ignoring the DF bit.  This not only
   violates the IP standard and many recommendations to the contrary
   [17][18], it also violates the only requirement that PLPMTUD places
   on the link layer: that oversized packets are reliably discarded.
   It is imperative that people understand the impact of ignoring the DF
   bit both to applications and to PLPMTUD.

   We do understand the reality of the situation.  It is important that
   vendors who are building devices that violate the DF specification
   understand that PLPMTUD requires that probe packets be discarded, and
   that sending ICMP Packet Too Big messages alone is insufficient to
   prevent wholesale fragmentation if the probe packets are delivered.

   Therefore, it is imperative that devices that do not honor DF include
   packet size history caches and other heuristics to robustly detect
   and discard probe packets, if delivering them would require
   fragmentation.

5.5.4  Diagnostic tools

   All implementations MUST include facilities for MTU discovery
   diagnostic tools that implement PLPMTUD or other MTU discovery
   algorithms in user mode without help or interference by the PMTUD
   algorithm present in the operating system.  This requires a
   mechanism where a diagnostic application can send packets that are
   larger than the operating system's notion of the current path MTU and
   collect any resulting Packet Too Big Messages or other ICMP messages.
   For IPv4 the diagnostic application must be able to set the DF bit.

   At this time nearly all operating systems support two modes for
   sending UDP datagrams: one which silently fragments packets that are
   too large, and another that rejects packets that are too large.
   Neither of these modes is suitable for efficiently diagnosing
   problems with MTU discovery, such as routers that return Packet
   Too Big messages containing incorrect size information.

5.5.5  Management interface

   It is suggested that an implementation provide a way for a system
   utility program to:
   o  Globally disable all ICMP Packet Too Large message processing.
   o  Globally suppress some or all ICMP consistency checks described in
      Section  Setting this option forgoes some possible
      security improvements, in exchange for making PLPMTUD behave more
      like classical PMTU discovery.  (See Section 5.5.1)
   o  Globally permit ICMP Packet Too Large messages to unconditionally
      reduce the MTU, even if there were no lost packets.
      Setting this option forgoes some possible security improvements,
      in exchange for making PLPMTUD behave more like classical PMTU
      discovery.  (See Section 5.5.1)
   o  Globally adjust timer intervals for specific classes of probe

   In addition, it is important that there be a mechanism to permit per
   path controls to override specific parts of the PLPMTUD algorithm.
   All of these per path controls can be preset from similar global
   controls.
   o  Disable MTU searching on a given path, such that new MTU values
      are never probed.
   o  Set the initial MTU for a given path.  This could be used to
      speed convergence in relatively static environments.  There
      should be an option to cause PLPMTUD to choose the same initial
      value as would be chosen by classical PMTU discovery, i.e.
      typically the Interface MTU.  This is used in the mode described
      in Section 5.5.1 where PLPMTUD is used only for black hole
      detection in classical PMTU discovery.
   o  Limit the maximum probed MTU for a given path.   This permits a
      manual configuration to work around a link that spuriously
      delivers packets that are larger than the useful path MTU.
   o  Per path and per application controls to disable ICMP processing,
      to further limit possible damage from malicious Packet Too Big
      messages (in addition to the global controls).
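
   One way to model the relationship between global and per path
   controls is sketched below.  The class, its field names, and the
   helper functions are hypothetical and cover only some of the controls
   listed above; the sketch is meant to show per path settings starting
   from the global values and overriding selected ones.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PlpmtudControls:
    icmp_processing: bool = True          # process Packet Too Big messages
    icmp_checks: bool = True              # apply ICMP consistency checks
    mtu_searching: bool = True            # probe for larger MTU values
    initial_mtu: Optional[int] = None     # None: let the algorithm choose
    max_probed_mtu: Optional[int] = None  # None: no manual ceiling


def controls_for_path(global_controls, **overrides):
    """Preset a path from the global controls, then apply overrides."""
    return replace(global_controls, **overrides)


def limit_probe_size(controls, candidate_mtu):
    """Return the probe size permitted for a path, or None if disabled."""
    if not controls.mtu_searching:
        return None
    if controls.max_probed_mtu is not None:
        return min(candidate_mtu, controls.max_probed_mtu)
    return candidate_mtu
```

   For example, controls_for_path(PlpmtudControls(), max_probed_mtu=4352)
   would leave the global settings untouched while capping probes on one
   path at 4352 bytes.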

6.  References

6.1  Normative References

   [1]  Postel, J., "Internet Protocol", STD 5, RFC 791, September 1981.

   [2]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
        November 1990.

   [3]  McCann, J., Deering, S. and J. Mogul, "Path MTU Discovery for IP
        version 6", RFC 1981, August 1996.

   [4]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

   [5]  Kent, S. and R. Atkinson, "Security Architecture for the
        Internet Protocol", RFC 2401, November 1998.

   [6]  Allman, M., Floyd, S. and C. Partridge, "Increasing TCP's
        Initial Window", RFC 2414, September 1998.

   [7]  Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6)
        Specification", RFC 2460, December 1998.

   [8]  Floyd, S., "Congestion Control Principles", BCP 41, RFC 2914,
        September 2000.

   [9]  Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer,
        H., Taylor, T., Rytina, I., Kalla, M., Zhang, L. and V. Paxson,
        "Stream Control Transmission Protocol", RFC 2960, October 2000.

6.2  Informative References

   [10]  Mogul, J., Kent, C., Partridge, C. and K. McCloghrie, "IP MTU
         discovery options", RFC 1063, July 1988.

   [11]  Knowles, S., "IESG Advice from Experience with Path MTU
         Discovery", RFC 1435, March 1993.

   [12]  Atkinson, R., "Default IP MTU for use over ATM AAL5", RFC 1626,
         May 1994.

   [13]  Sung, T., "TCP And UDP Over IPX Networks With Fixed Path MTU",
         RFC 1791, April 1995.

   [14]  Partridge, C., "Using the Flow Label Field in IPv6", RFC 1809,
         June 1995.

   [15]  Lahey, K., "TCP Problems with Path MTU Discovery", RFC 2923,
         September 2000.

   [16]  Stewart, R., "Stream Control Transmission Protocol (SCTP)
         Implementors Guide", draft-ietf-tsvwg-sctpimpguide-10 (work in
         progress), December 2003.

   [17]  Kent, C. and J. Mogul, "Fragmentation considered harmful",
         Proc. SIGCOMM '87 vol. 17, No. 5, October 1987.

   [18]  Mathis, M., Heffner, J. and B. Chandler, "Fragmentation
         Considered Very Harmful", draft-mathis-frag-harmful-00 (work in
         progress), July 2004.

Authors' Addresses

   Matt Mathis
   Pittsburgh Supercomputing Center
   4400 Fifth Avenue
   Pittsburgh, PA  15213

   Phone: 412-268-3319

   John W. Heffner
   Pittsburgh Supercomputing Center
   4400 Fifth Avenue
   Pittsburgh, PA  15213

   Phone: 412-268-2329

   Kevin Lahey


Appendix A.  Security Considerations

   Under all conditions the PLPMTUD procedure described in this document
   is at least as secure as the current standard path MTU discovery
   procedures described in RFC 1191 [2] and RFC 1981 [3].

   In the recommended configuration, PLPMTUD is significantly harder to
   attack than current procedures, because ICMP messages are cached and
   only processed in connection with lost packets.  This effectively
   prevents blind attacks on the path MTU discovery system.

   Furthermore, since this algorithm is designed for robust operation
   without any ICMP (or other messages from the network), it can be
   configured to ignore all ICMP messages (globally or on a per
   application basis).  In this configuration it cannot be attacked,
   unless the attacker can identify and selectively cause probe packets
   to be lost.

Appendix B.  IANA considerations


Appendix C.  Acknowledgements

   Most of the SCTP text was contributed by Randall Stewart.

   Matt Mathis and John Heffner are supported in this work by a grant
   from Cisco Systems, Inc.

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights. Information
   on the IETF's procedures with respect to rights in IETF Documents can
   be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at

Disclaimer of Validity

   This document and the information contained herein are provided on an

Copyright Statement

   Copyright (C) The Internet Society (2004). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.


   Funding for the RFC Editor function is currently provided by the
   Internet Society.
