Internet-Draft Matt Mathis
John Heffner
PSC
Kevin Lahey
Freelance
Oct 19, 2003
Path MTU Discovery
draft-ietf-pmtud-method-00.txt
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
[@@ To be rewritten]
This document describes Path MTU Discovery for the Internet. It is
largely derived from RFC 1191 and RFC 1981, which describe ICMP based
Path MTU Discovery for IP versions 4 and 6, plus a robust new
algorithm.
The general strategy of the new algorithm is to start with a small
MTU and probe upward, testing successively larger MTUs by probing
with single packets. If the probe is successfully delivered, then
Mathis, et al [Page 1]
Internet-Draft Expires April 2004 Oct 19, 2003
the MTU is raised. If the probe is lost, it is treated as an MTU
limitation and not as a congestion signal.
Table of Contents
TBD
1. Introduction
When one Internet node has a large amount of data to send to another
node, the data is transmitted in a series of IP packets. It is
usually preferable that these packets be of the largest size that can
successfully traverse the path from the source node to the
destination node. This packet size is referred to as the Path MTU
(PMTU), and it is equal to the minimum link MTU of all the links in a
path.
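As a non-normative illustration of this definition, the following Python sketch computes the PMTU of a path whose link MTUs are already known. The function name and the example link MTUs are assumptions chosen for illustration only. A real node cannot read the MTUs of remote links directly, which is why a discovery procedure is needed.

```python
def path_mtu(link_mtus):
    """Return the path MTU: the smallest link MTU along the path."""
    if not link_mtus:
        raise ValueError("a path must contain at least one link")
    return min(link_mtus)


# Hypothetical three-link path: Ethernet, PPPoE, FDDI.
example = path_mtu([1500, 1492, 4352])  # limited by the 1492-byte link
```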
This document describes a path MTU discovery (PMTUD) method based on
the earlier methods described in the standards track documents,
RFC1191 and RFC1981, with the addition of a new algorithm that
searches for the proper MTU by probing with successively larger
packets. Large sections of this document are taken directly from
RFC1191 and RFC1981.
The methods described in this document apply to IPv4, IPv6, TCP, and
other transport protocols. This document does not define a
protocol, but rather a method to use features of existing protocols
to discover the path MTU. It does not require cooperation from the
lower layers (except that they are consistent about what packet sizes
are acceptable) or the far node. Variants in implementations will
not cause problems with interoperability.
For the sake of clarity we uniformly prefer TCP and IPv6 terminology. In
the terminology section we also present the analogous IPv4 terms and
concepts for the IPv6 terminology. In a few situations we describe
specific details that are different between IPv4 and IPv6.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC 2119].
[[This document still bears markup notes, indicated with square
brackets [] or @@@@ signs.]]
2. Terminology
IP - Either IPv4 [IPv4-SPEC] or IPv6 [IPv6-SPEC].
node - A device that implements IP.
router - A node that forwards IP packets not explicitly
addressed to itself.
host - Any node that is not a router.
upper layer - A protocol layer immediately above IP. Examples are
transport protocols such as TCP and UDP, control
protocols such as ICMP, routing protocols such as OSPF,
and Internet or lower-layer protocols being "tunneled"
over (i.e., encapsulated in) IP such as IPX,
AppleTalk, IP itself.
link - A communication facility or medium over which nodes can
communicate at the link layer, i.e., the layer
immediately below IPv6. Examples are Ethernets (simple
or bridged); PPP links; X.25, Frame Relay, or ATM
networks; and Internet (or higher) layer "tunnels",
such as tunnels over IPv4 or IPv6 itself.
interface - A node's attachment to a link.
address - An IP-layer identifier for an interface or a set of
interfaces.
packet - An IP header plus payload.
MTU - Maximum Transmission Unit, the size in bytes of the
largest packet that can be transmitted on a link or
path. Note that this could more properly be called
the IP MTU, to be consistent with how other standards
organizations use the term. Beware that the definition
used in this and other IETF documents is not the same
as the definition used in other contexts.
link MTU - The Maximum Transmission Unit, i.e., maximum packet
size in octets, that can be conveyed in one piece over
a link.
path - The set of links traversed by a packet between a source
node and a destination node
path MTU - The minimum link MTU of all the links in a path between
a source node and a destination node.
PMTU - Path MTU
Path MTU Discovery,
PMTUD - Process by which a node learns the PMTU of a path
Packet Too Big message
- An ICMP message reporting that an IP packet is too
large to forward. This is the IPv6 term that
corresponds to the IPv4 "ICMP Can't fragment" message.
flow id - A combination of a source address and a non-zero
IPv6 flow label.
packetization protocol
- The layer of the network stack which segments data into
packets.
flow - A context in which MTU discovery is applied. This is
naturally an instance of the packetization protocol, e.g.
half of a TCP connection.
MPS - The maximum payload size available to a flow, usually
over a specific path. As an example, this is the maximum
TCP segment size, including TCP headers but not including
IP headers.
probe packet - A packet which is being used to test for a larger MTU.
probe size - The size of a packet being used to probe for a larger MTU.
successful probe
- The probe packet was delivered through the network.
inconclusive probe
- The probe packet was not delivered, but there were other lost
packets too close to the probe. By implication the probe
might have been lost due to something other than MTU, so the
results are inconclusive.
failed probe
- The probe packet was not delivered and there were not other
lost packets close to the probe.
probe gap - The L3 payload data that will need to be retransmitted if the
probe is not delivered.
[[Deprecated terms - these terms should only appear in very specific parts of
the document.
ICMP
Can't fragment messages
lower layers
@@@ remove as the document matures]]
3. Overview
This document describes a technique to dynamically discover the MTU
of a path. These procedures are applicable to TCP and other
transport- or application-level packetization protocols which
implement similar features.
The general strategy of the new procedure is to find the proper MTU
by starting a connection using relatively small packets and then
probing with progressively larger packets (containing application
data). If a probe packet is successfully delivered, then the path
MTU is raised. The isolated loss of a probe packet (with or without
a Packet Too Big message) is treated as an indication of an MTU
limit, and not as a congestion indicator.
PMTUD can optionally process Packet Too Big messages for faster
convergence in exchange for a slight decrease in robustness.
Processing malicious or erroneous Packet Too Big messages can cause
PMTU discovery to arrive at the incorrect MTU for a path, which is
likely to reduce protocol performance. The document describes three
options for processing Packet Too Big messages: ignore them
completely, accept them only in response to probes, or accept all
Packet Too Big messages (the previous approach).
In addition, PMTUD can be extended with heuristics to use alternate
criteria to select PMTU. For example, on a path that is so congested
that the fair share window is too small (smaller than 5 kB), TCP may
be better behaved with 512-byte packets than with 1500-byte packets
since with the larger packets the window would be too small to
trigger Fast Retransmit.
Relatively few details of this procedure affect interoperability with
other standards or Internet protocols. These details are specified
in RFC2119 standards language in the requirements section. The vast
majority of the implementation details are recommendations based on
experiences with earlier versions of path MTU discovery. These are
motivated by a desire to maximize robustness in the presence of less
than ideal implementations as they exist in the field.
4. Requirements
All Internet nodes SHOULD implement Path MTU Discovery in order to
discover and take advantage of the largest MTU supported along the
Internet path.
Nodes not implementing Path MTU Discovery must use a default MTU as
specified by the respective IP protocols. For IPv6 the default MTU
is 1280 bytes, the minimum link MTU as defined in [IPv6-SPEC]. For
IPv4 it is 576 bytes, as specified in [IPv4-SPEC].
Links MUST NOT deliver packets that are larger than their true MTU.
Links that have parametric limitations (e.g. MTU bounds due to
limited clock stability) MUST include explicit mechanisms to
consistently reject packets that might otherwise be
nondeterministically delivered.
When a packet is too large to traverse a link, the attached router,
if any, SHOULD send a Packet Too Big message (IPv6) or an ICMP "Can't
fragment" message (IPv4 with DF set), as appropriate.
The requirements below only apply to those implementations that
include Path MTU Discovery.
A flow MUST NOT send a probe packet until at least one packet of its
full current MPS is acknowledged. This implicitly limits successful
probes to once per two round trips. To make the algorithm more
robust in the presence of multi-path routing, a flow SHOULD NOT send
a probe packet until at least a full window or an appropriately large
quantity of packets have been successfully acknowledged.
Before a probe can be sent, the flow MUST be able to produce a packet
containing a payload of at least the candidate MPS. That is, it must
have enough data or be able to pad the packet to the full desired
size. If the flow is able to send a probe with the exception of
having enough data to
Failed and inconclusive probes MUST NOT be sent more frequently than
the normal congestion interval for the current average window size.
A packetization protocol which does loss recovery MUST use a loss
detection mechanism which does not result in spurious retransmission
of any additional data when a probe packet is lost.
During the probe, the normal congestion control machinery should
remain in effect except when only the probe gap is detected as lost.
In this case the normal multiplicative congestion window reduction is
suppressed. If any other data is detected as lost, all normal
congestion control MUST take place.
If the probe is successful, the current MPS is updated to the
candidate MPS. If window and other congestion state variables are
kept in units of packets, they MUST be rescaled to preserve the
current window size in bytes.
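The rescaling requirement above can be sketched as follows. This is an informal illustration, not part of the requirements: the function name is hypothetical, and rounding down with a floor of one packet is an assumption that this document does not specify.

```python
def rescale_window(cwnd_packets, old_mps, new_mps):
    """Convert a window counted in old-MPS packets into new-MPS packets.

    The window size in bytes is preserved as closely as integer packet
    counts allow. Rounding down and never returning fewer than one
    packet are assumptions of this sketch.
    """
    if old_mps <= 0 or new_mps <= 0:
        raise ValueError("MPS values must be positive")
    window_bytes = cwnd_packets * old_mps
    return max(1, window_bytes // new_mps)
```

For example, a window of 20 packets at an MPS of 1024 bytes becomes 10 packets once the MPS is raised to 2048 bytes, so the amount of data in flight does not change.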
5. Implementation Issues
This section discusses a number of issues related to the
implementation of Path MTU Discovery. This is not a specification,
but rather a set of notes provided as an aid for implementers.
The issues include:
- What layer or layers implement Path MTU Discovery?
- Accounting for headers
- How is the PMTU information cached?
- How are ICMP messages processed?
- How is stale PMTU information removed?
- How to implement PMTUD with TCP?
- What should other transport and higher layers do?
- What should tunnels above IP do?
5.1. Layering
In the IP architecture, the choice of what size packet to send is
made by a protocol at a layer above IP. This memo refers to such a
protocol as a "packetization protocol". Packetization protocols are
usually transport protocols (for example, TCP) but can also be
higher-layer protocols (for example, protocols built on top of UDP).
This memo uses the concept of a "flow" to define the scope in which
path MTU information is used. Each flow locally stores its maximum
payload size (MPS), which is used for packetizing data. Flows may
communicate with the IP layer to store or access cached PMTU values,
providing a means by which similar flows may share information. To
do so, the flow must convert between these two values by adding or
subtracting the size of the IP header plus any additional
intermediate headers. The IP layer also stores PMTU information from
the ICMP layer when it receives Packet Too Big messages.
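The conversion between MPS and PMTU described above amounts to simple header arithmetic. The following sketch is illustrative only. The constant and function names are hypothetical, the header sizes assume an IPv4 header without options and the fixed IPv6 header without extension headers, and the "intermediate" parameter stands for any additional headers between IP and the packetization protocol.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption)
IPV6_HEADER_BYTES = 40  # fixed IPv6 header, no extension headers (assumption)


def pmtu_to_mps(pmtu, ip_header, intermediate=0):
    """MPS available to a flow for a given cached PMTU."""
    return pmtu - ip_header - intermediate


def mps_to_pmtu(mps, ip_header, intermediate=0):
    """PMTU value to report to the IP layer for a given flow MPS."""
    return mps + ip_header + intermediate
```

With these assumptions, a cached PMTU of 1500 bytes corresponds to an MPS of 1480 bytes over IPv4 and 1460 bytes over IPv6.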
It is possible that a packetization layer, perhaps a UDP application
outside the kernel, is unable to change the size of messages it
sends. This may result in a packet size that exceeds the Path MTU.
In such situations, the packets must be fragmented by the IP layer.
To accommodate this, IPv6 defines a mechanism that allows large
payloads to be divided into fragments, with each fragment sent in a
separate packet (see [IPv6-SPEC] section "Fragment Header"). It is
also recommended that IPv4 fragment the packets at the end system.
@@@ Should it also set the DF flag to mimic IPv6? @@@
However, packetization layers are encouraged to avoid sending
messages that will require fragmentation (for the case against
fragmentation, see [FRAG]).
5.2. Accounting for headers
Packetization is done at or near the top of the protocol stack,
while the final packet size, which is determined only at the bottom of
the stack, is what determines the link's ability to transmit the
packet. As
such, it is necessary for the lower layers to deterministically
accept all payloads of a uniform size, or for these layers to
communicate their header sizes to the upper layer prior to
packetization.
This document does not take a position on the layering boundaries of
IPsec, which logically sits between IP and TCP or another
packetization layer. IPsec can be treated either as part of IP or as
part of the packetization layer, as long as the accounting is
consistent within any given implementation. If IPsec is treated as
part of the IP layer, then each security association that contributes
a different-length security header may need to be treated as a
separate path. If IPsec is treated as part of the packetization
layer, then the MPS to PMTU calculation must include the IPsec header
size for that flow.
5.3. Storing PMTU information
Ideally, a PMTU value should be associated with a specific path
traversed by packets exchanged between the source and destination
nodes. However, in most cases a node will not have enough
information to completely and accurately identify such a path.
Rather, a node must associate a PMTU value with some local
representation of a path. It is left to the implementation to select
the local representation of a path.
In the case of a multicast destination address, copies of a packet
may traverse many different paths to reach many different nodes. The
local representation of the "path" to a multicast destination must in
fact represent a potentially large set of paths.
Minimally, an implementation could maintain a single PMTU value to be
used for all packets originated from the node. This PMTU value would
be the minimum PMTU learned across the set of all paths in use by the
node. This approach is likely to result in the use of smaller
packets than is necessary for many paths.
An implementation could use the destination address as the local
representation of a path. The PMTU value associated with a
destination would be the minimum PMTU learned across the set of all
paths in use to that destination. The set of paths in use to a
particular destination is expected to be small, in many cases
consisting of a single path. This approach will result in the use of
optimally sized packets on a per-destination basis. This approach
integrates nicely with the conceptual model of a host as described in
[ND]: a PMTU value could be stored with the corresponding entry in
the destination cache.
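A per-destination cache of this kind might be sketched as follows. The class and method names are hypothetical, and defaulting an unknown destination to the first-hop link MTU is an assumption of this sketch rather than a requirement of this section.

```python
class DestinationPmtuCache:
    """PMTU cache that uses the destination address as the path key."""

    def __init__(self, first_hop_mtu):
        self.first_hop_mtu = first_hop_mtu
        self._pmtu = {}

    def lookup(self, destination):
        # Destinations with no cached value fall back to the first-hop MTU.
        return self._pmtu.get(destination, self.first_hop_mtu)

    def learn(self, destination, pmtu):
        # Keep the minimum PMTU learned across all paths in use
        # to this destination.
        self._pmtu[destination] = min(self.lookup(destination), pmtu)
        return self._pmtu[destination]
```

Because the minimum is kept, a later and larger report for the same destination does not raise the cached value.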
If IPv6 flows [IPv6-SPEC] are in use, an implementation could use the
flow id as the local representation of a path. Packets sent to a
particular destination but belonging to different flows may use
different paths, with the choice of path depending on the flow id.
This approach will result in the use of optimally sized packets on a
per-flow basis, providing finer granularity than PMTU values
maintained on a per-destination basis.
For source routed packets (i.e. packets containing an IPv6 Routing
header [IPv6-SPEC]), the source route may further qualify the local
representation of a path. In particular, a packet containing a type
0 Routing header in which all bits in the Strict/Loose Bit Map are
equal to 1 contains a complete path specification. An implementation
could use source route information in the local representation of a
path.
Note: Some paths may be further distinguished by different security
classifications. The details of such classifications are beyond the
scope of this memo. @@@ this should be in scope
5.4. Probing method using TCP
A new candidate MPS is tested by sending one "probe segment", which
is larger than the current MPS. We present here two possible probing
methods for TCP.
In the first method, after a probe segment has been sent (of size
candidate MPS), the subsequent segment(s) may be sent as though the
probe segment were not oversized. Thus, if the probe segment is lost,
it will leave a gap in the sequence space that is exactly one current
MPS minus the TCP header size. We refer to this potential hole as
the probe gap. Note that the length of the probe segment is
determined by the candidate MPS under consideration, but the length
of the probe gap by the current MPS. If the probe segment is lost,
this gap can be filled by a single retransmitted segment.
This method will create duplicate acknowledgements if the probe is
successful. The sender must be capable of dealing with these
expected duplicate acknowledgements in a manner which will not cause
unnecessary retransmission or congestion window reduction.
In the second method, after a probe segment has been sent, subsequent
segments are sent in a non-overlapping manner. If the probe segment
is lost, it will leave a gap which will require retransmission of
multiple segments to fill.
The probe is completed when the acknowledgment sequence advances past
the probe gap. If, when the probe is complete, the probe gap was not
retransmitted, the probe was successful. If the probe gap was
retransmitted and there were no other retransmissions, the candidate
MPS failed. If there were any other retransmissions the probe was
inconclusive.
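The three outcomes can be summarized in a short sketch. The function and parameter names are hypothetical. The sketch reads the rules in the order given above, so a probe whose gap was never retransmitted counts as successful even if other data was retransmitted; that reading is an assumption.

```python
def classify_probe(gap_retransmitted, other_retransmissions):
    """Classify a completed probe.

    gap_retransmitted: True if the probe gap had to be retransmitted.
    other_retransmissions: number of retransmissions other than the gap.
    """
    if not gap_retransmitted:
        return "successful"
    if other_retransmissions == 0:
        return "failed"
    return "inconclusive"
```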
If the probe was successful, the current MPS is updated to the
candidate MPS. @@@ add robustness language re: more losses
If the probe failed or was inconclusive the probe countdown is set to
COUNTDOWN_SCALE times the square of the current window size in
packets.
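As an informal sketch of this rule, the countdown can be computed as shown below. This document does not give a value for COUNTDOWN_SCALE, so it is passed in as a parameter, and the function name is hypothetical.

```python
def probe_countdown(cwnd_packets, countdown_scale):
    """Countdown set after a failed or inconclusive probe.

    The countdown grows with the square of the current window size in
    packets. The scale factor is an input because no value for it is
    specified here.
    """
    return countdown_scale * cwnd_packets ** 2
```

The quadratic growth means that a flow with a larger window waits for more packets before it probes again.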
If a Packet Too Big message is received, it can be used to compute
an MPS limit by deducting the IP header size from the MTU reported in
the ICMP message. If the MPS limit is between the current MPS and
candidate MPS, the current MPS is updated from the MPS limit,
otherwise the message is ignored. If the current MPS is updated,
then the probe strategy is forced into the Monitor state described
below.
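The handling of a Packet Too Big message during a probe can be sketched as follows. The names are hypothetical, and treating "between" as strictly greater than the current MPS and strictly less than the candidate MPS is an assumption of this sketch.

```python
def apply_packet_too_big(current_mps, candidate_mps, reported_mtu, ip_header):
    """Return (new_mps, forced_state) after a Packet Too Big message.

    forced_state is "Monitor" when the current MPS was updated, and
    None when the message was ignored.
    """
    mps_limit = reported_mtu - ip_header
    if current_mps < mps_limit < candidate_mps:
        return mps_limit, "Monitor"
    return current_mps, None
```

For example, with a current MPS of 1004, a candidate MPS of 2028, and a 20-byte IP header, a message reporting an MTU of 1500 raises the current MPS to 1480, while a message reporting an MTU of 9000 is ignored.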
5.5. Probing method using SCTP
@@@@ to be written
5.6. General probing methods
@@@@ to be written
5.7. Probe strategy
The probe strategy described here is a recommended baseline
algorithm. It is not presented in formal standards language because
the probe strategy can include heuristics to help select an optimal
MSS for a given path. As a consequence there is opportunity for
future improvements to this algorithm.
The probing strategy has three major states: Search, Monitor and
Suspend. In the Search state, it sequentially searches for the
largest MSS that the path can support. Once the appropriate MPS has
been discovered, the probing algorithm enters the Monitor state where
it probes infrequently to detect if the path MPS has become larger.
If the MPS probing persistently fails it may be desirable to suspend
MPS probing and heuristically select one of the common default MSSs:
576, 1240, or 1460 Bytes.
5.7.1. Search
The recommended search strategy is a multi-phase scan: First, a
coarse scan for the approximate MTU using factor of 2 steps starting
at 1024 Bytes until a probe fails, followed by successively finer
scans between the largest previously successful and unsuccessful
probes. The TCP should use its best knowledge of the lower layer
header sizes to appropriately determine the MPS from the MTUs listed
in the table below.
Table 1: Recommended MTU scanning sequence
(Coarse scan down column 1, fine scan across each row)
512, [Use only after repeated timeouts]
1024, 1492, 1500, 2002
2048
4096, 4352
8192, 9000
16384, 17914
32768
64512
((Additional values needed))
During the scan it is recommended that the MPS not be raised if cwnd
is too small as determined by a heuristic. The recommended heuristic
is that the MPS is only raised when the cwnd is larger than 20
segments. @@@ This may be too high.
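The multi-phase scan over Table 1 can be sketched as follows. This is a simplified illustration under several assumptions: the probe is modeled as a hypothetical callback that returns True when a packet of the given MTU is delivered, the whole search runs in one call rather than one probe at a time over many round trips, the fine scan covers only the row of the last successful coarse step, and the cwnd heuristic is omitted.

```python
# Coarse steps (column 1 of Table 1) with the fine steps of each row.
# The 512-byte entry is omitted because the table reserves it for use
# after repeated timeouts.
SCAN_TABLE = [
    (1024, [1492, 1500, 2002]),
    (2048, []),
    (4096, [4352]),
    (8192, [9000]),
    (16384, [17914]),
    (32768, []),
    (64512, []),
]


def search_mtu(probe_ok, current_mtu):
    """Return the largest MTU from the table for which probe_ok succeeds.

    current_mtu is the MTU already in use; it is returned unchanged if
    even the first coarse probe fails.
    """
    best = current_mtu
    fine_row = []
    # Coarse scan: factor-of-two steps until a probe fails.
    for coarse, fine in SCAN_TABLE:
        if not probe_ok(coarse):
            break
        best, fine_row = coarse, fine
    # Fine scan: across the row of the last successful coarse step.
    for mtu in fine_row:
        if not probe_ok(mtu):
            break
        best = mtu
    return best
```

On a path whose true MTU is 1500 bytes, the coarse scan succeeds at 1024 and fails at 2048, and the fine scan then succeeds at 1492 and 1500 before failing at 2002.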
5.7.2. Monitor
Once the scan has found an appropriate MPS, the probe strategy enters
the Monitor state, where it re-probes the most recent failed MTU,
once every MONITOR_INTERVAL seconds. If the probe fails, it remains
in the Monitor state. If it succeeds, it re-enters the Search state.
If the network becomes too congested during either the Search or the
Monitor states, it is recommended that the MPS be reduced to a
smaller size as determined by a heuristic. The recommended heuristic
is to reduce the MSS if ssthresh is reduced to 5 segments or smaller.
The recommended reduction is to the next smaller coarse step in Table
1.
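The congestion heuristic above can be sketched as follows. The function name is hypothetical, the sketch works on MTU values from Table 1 and leaves the conversion to MSS aside, and excluding the 512-byte entry (which the table reserves for repeated timeouts) is an assumption.

```python
# Coarse steps from column 1 of Table 1, without the 512-byte entry.
COARSE_STEPS = [1024, 2048, 4096, 8192, 16384, 32768, 64512]


def reduce_on_congestion(current_mtu, ssthresh_segments):
    """Step down to the next smaller coarse MTU when ssthresh is small."""
    if ssthresh_segments > 5:
        return current_mtu
    smaller = [step for step in COARSE_STEPS if step < current_mtu]
    # Stay at the current value if there is no smaller coarse step.
    return smaller[-1] if smaller else current_mtu
```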
When there are repeated timeouts (MAX_TIMO or more retransmissions,
without any received ACKs), it is presumed that the connection was
re-routed onto a link with a smaller MSS, and that ICMP messages are
not being delivered. The MSS probing algorithm is reset by pulling
back the MSS to 1024 Bytes, rescaling the congestion control
variables and reentering the Search state.
5.7.3. Suspend
If there is a timeout, and cwnd prior to the timeout was smaller than
6 packets, then the probe strategy can enter the Suspend state and
set the MSS to 512 or 1240 Bytes. This has the effect of reducing
the minimum data rate that TCP can stably manage.
5.8. Processing Packet Too Big messages
@@@ Add language re: optional processing
When a Packet Too Big message is received, the node determines which
path the message applies to based on the contents of the Packet Too
Big message. For example, if the destination address is used as the
local representation of a path, the destination address from the
original packet would be used to determine which path the message
applies to.
Note: if the original packet contained an IPv6 Routing header, the
Routing header should be used to determine the location of the
destination address within the original packet. If Segments Left
is equal to zero, the destination address is in the Destination
Address field in the IPv6 header. If Segments Left is greater
than zero, the destination address is the last address
(Address[n]) in the Routing header.
If the original packet contained an IPv4 Source Route Option .....
@@@@ write
The node then uses the value in the MTU field in the Packet Too Big
message as a tentative PMTU value, and compares the tentative PMTU to
the existing PMTU. If the tentative PMTU is less than the existing
PMTU estimate, the tentative PMTU replaces the existing PMTU as the
PMTU value for the path.
The packetization layers must be notified about decreases in the
PMTU. Any packetization layer instance (for example, a TCP
connection) that is actively using the path must be notified if the
PMTU estimate is decreased.
Note: even if the Packet Too Big message contains an Original
Packet Header that refers to a UDP packet, the TCP layer must be
notified if any of its connections use the given path.
Also, the instance that sent the packet that elicited the Packet Too
Big message should be notified that its packet has been dropped, even
if the PMTU estimate has not changed, so that it may retransmit the
dropped data.
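The update and the two kinds of notification described above can be sketched together. All names are hypothetical, flows are represented by plain identifiers, and the notifications are returned as a list of (flow, event) pairs rather than delivered through any real interface.

```python
def process_packet_too_big(current_pmtu, reported_mtu, flows_on_path, sender):
    """Return (new_pmtu, events) for one Packet Too Big message.

    flows_on_path: identifiers of all flows actively using the path.
    sender: identifier of the flow whose packet elicited the message.
    """
    events = []
    new_pmtu = current_pmtu
    if reported_mtu < current_pmtu:
        # The tentative PMTU is smaller: adopt it and tell every flow
        # on the path, whichever protocol elicited the message.
        new_pmtu = reported_mtu
        events.extend((flow, "pmtu_decreased") for flow in flows_on_path)
    # The eliciting flow always learns that its packet was dropped,
    # even when the PMTU estimate did not change.
    events.append((sender, "packet_dropped"))
    return new_pmtu, events
```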
Note: An implementation can avoid the use of an asynchronous
notification mechanism for PMTU decreases by postponing
notification until the next attempt to send a packet larger than
the PMTU estimate. In this approach, when an attempt is made to
SEND a packet that is larger than the PMTU estimate, the SEND
function should fail and return a suitable error indication. This
approach may be more suitable to a connectionless packetization
layer (such as one using UDP), which (in some implementations) may
be hard to "notify" from the ICMP layer. In this case, the normal
timeout-based retransmission mechanisms would be used to recover
from the dropped packets. @@@@ why "SEND"?
It is important to understand that the notification of the
packetization layer instances using the path about the change in the
PMTU is distinct from the notification of a specific instance that a
packet has been dropped. The latter should be done as soon as
practical (i.e., asynchronously from the point of view of the
packetization layer instance), while the former may be delayed until
a packetization layer instance wants to create a packet.
Retransmission should be done for only those packets that are known
to be dropped, as indicated by a Packet Too Big message.
5.9. Purging stale PMTU information
@@@ update
Internetwork topology is dynamic; routes change over time. While the
local representation of a path may remain constant, the actual
path(s) in use may change. Thus, PMTU information cached by a node
can become stale.
If the stale PMTU value is too large, this will be discovered almost
immediately once a large enough packet is sent on the path. No such
mechanism exists for realizing that a stale PMTU value is too small,
so an implementation should "age" cached values. When a PMTU value
has not been decreased for a while (on the order of 10 minutes), the
PMTU estimate should be set to the MTU of the first-hop link, and the
packetization layers should be notified of the change. This will
cause the complete Path MTU Discovery process to take place again.
Note: an implementation should provide a means for changing the
timeout duration, including setting it to "infinity". For
example, nodes attached to an FDDI link which is then attached to
the rest of the Internet via a small MTU serial line are never
going to discover a new non-local PMTU, so they should not have to
put up with dropped packets every 10 minutes.
An upper layer must not retransmit data in response to an increase in
the PMTU estimate, since this increase never comes in response to an
indication of a dropped packet.
One approach to implementing PMTU aging is to associate a timestamp
field with a PMTU value. This field is initialized to a "reserved"
value, indicating that the PMTU is equal to the MTU of the first hop
link. Whenever the PMTU is decreased in response to a Packet Too Big
message, the timestamp is set to the current time.
Once a minute, a timer-driven procedure runs through all cached PMTU
values, and for each PMTU whose timestamp is not "reserved" and is
older than the timeout interval:
- The PMTU estimate is set to the MTU of the first hop link.
- The timestamp is set to the "reserved" value.
- Packetization layers using this path are notified of the increase.
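The timestamp approach above can be sketched as follows. The names are hypothetical, the "reserved" value is represented by None, the once-a-minute timer is left to the caller, and passing a timeout of None to mean "infinity" is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional

RESERVED = None  # marks a PMTU equal to the first-hop link MTU


@dataclass
class PmtuEntry:
    pmtu: int
    timestamp: Optional[float] = RESERVED


def record_decrease(entry, new_pmtu, now):
    """Apply a PMTU decrease and remember when it happened."""
    if new_pmtu < entry.pmtu:
        entry.pmtu = new_pmtu
        entry.timestamp = now


def age_cache(cache, first_hop_mtu, now, timeout=600.0):
    """Reset stale entries; return the paths whose PMTU was increased.

    Intended to be called periodically. The returned paths are the ones
    whose packetization layers should be notified of the increase.
    """
    increased = []
    if timeout is None:  # aging disabled ("infinity")
        return increased
    for path, entry in cache.items():
        if entry.timestamp is not RESERVED and now - entry.timestamp > timeout:
            entry.pmtu = first_hop_mtu
            entry.timestamp = RESERVED
            increased.append(path)
    return increased
```

The default of 600 seconds follows the "on the order of 10 minutes" guidance in this section.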
5.10. TCP layer actions
The TCP layer must track the PMTU for the path(s) in use by a
connection; it should not send segments that would result in packets
larger than the PMTU except to probe the path MTU. A simple
implementation could ask the IP layer for this value each time it
created a new segment, but this could be inefficient. Moreover, TCP
implementations that follow the "slow-start" congestion-avoidance
algorithm [CONG] typically calculate and cache several other values
derived from the PMTU. It may be simpler to receive asynchronous
notification when the PMTU changes, so that these variables may be
updated.
A TCP implementation must also store the MSS value received from its
peer, and must not send any segment larger than this MSS, regardless
of the PMTU. In 4.xBSD-derived implementations, this may require
adding an additional field to the TCP state record.
The value sent in the TCP MSS option is independent of the PMTU.
This MSS option value is used by the other end of the connection,
which may be using an unrelated PMTU value. See [IPv6-SPEC] sections
"Packet Size Issues" and "Maximum Upper-Layer Payload Size" for
information on selecting a value for the TCP MSS option. When a
Packet Too Big message is received, it implies that a packet was
dropped by the node that sent the ICMP message. It is sufficient to
treat this as any other dropped segment, and wait until the
retransmission timer expires to cause retransmission of the segment.
If the Path MTU Discovery process requires several steps to find the
PMTU of the full path, this could delay the connection by many round-
trip times.
@@@ Add IPv4 text
[@@@deprecate? Alternatively, the retransmission could be done in
immediate response to a notification that the Path MTU has changed,
but only for the specific connection specified by the Packet Too Big
message. The packet size used in the retransmission should be no
larger than the new PMTU. ]
Note: A packetization layer must not retransmit in response to
every Packet Too Big message, since a burst of several oversized
segments will give rise to several such messages and hence several
retransmissions of the same data. If the new estimated PMTU is
still wrong, the process repeats, and there is an exponential
growth in the number of superfluous segments sent.
This means that the TCP layer must be able to recognize when a
Packet Too Big notification actually decreases the PMTU that it
has already used to send a packet on the given connection, and
should ignore any other notifications.
Many TCP implementations incorporate "congestion avoidance" and
"slow-start" algorithms to improve performance [CONG]. Unlike a
retransmission caused by a TCP retransmission timeout, a
retransmission caused by a Packet Too Big message should not change
the congestion window. It should, however, trigger the slow-start
mechanism (i.e., only one segment should be retransmitted until
acknowledgments begin to arrive again).
TCP performance can be reduced if the sender's maximum window size is
not an exact multiple of the segment size in use (this is not the
congestion window size, which is always a multiple of the segment
size). In many systems (such as those derived from 4.2BSD), the
segment size is often set to 1024 octets, and the maximum window size
(the "send space") is usually a multiple of 1024 octets, so the
proper relationship holds by default. If Path MTU Discovery is used,
however, the segment size may not be a sub-multiple of the send
space, and it may change during a connection; this means that the TCP
layer may need to change the transmission window size when Path MTU
Discovery changes the PMTU value. The maximum window size should be
set to the greatest multiple of the segment size that is less than or
equal to the sender's buffer space size.
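The window adjustment described above is a rounding-down operation. The following sketch shows the arithmetic; the function name max_window_size and the example sizes are assumptions chosen for illustration.

```python
def max_window_size(buffer_space: int, segment_size: int) -> int:
    """Greatest multiple of segment_size that is <= buffer_space."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return (buffer_space // segment_size) * segment_size


# With 1024-octet segments the send space is already an exact multiple.
print(max_window_size(16384, 1024))  # 16384
# After Path MTU Discovery changes the segment size, the window is
# rounded down to a whole number of segments (11 * 1460).
print(max_window_size(16384, 1460))  # 16060
```

An implementation would recompute this value each time Path MTU Discovery changes the segment size in use.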
5.11. Issues for other transport protocols
Some transport protocols (such as ISO TP4 [ISOTP]) are not allowed to
repacketize when doing a retransmission. That is, once an attempt is
made to transmit a segment of a certain size, the transport cannot
split the contents of the segment into smaller segments for
retransmission. In such a case, the original segment can be
fragmented by the IP layer during retransmission. Subsequent
segments, when transmitted for the first time, should be no larger
than allowed by the Path MTU.
The Sun Network File System (NFS) uses a Remote Procedure Call (RPC)
protocol [RPC] that, when used over UDP, in many cases will generate
payloads that must be fragmented even for the first-hop link. This
might improve performance in certain cases, but it is known to cause
reliability and performance problems, especially when the client and
server are separated by routers.
It is recommended that NFS implementations use Path MTU Discovery
whenever routers are involved. Most NFS implementations allow the
RPC datagram size to be changed at mount-time (indirectly, by
changing the effective file system block size), but might require
some modification to support changes later on.
Also, since a single NFS operation cannot be split across several UDP
datagrams, certain operations (primarily, those operating on file
names and directories) require a minimum payload size that, if sent in
a single packet, would exceed the PMTU. NFS implementations should
not reduce the payload size below this threshold, even if Path MTU
Discovery suggests a lower value. In this case the payload will be
fragmented by the IP layer.
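The payload rule above can be sketched as a floor on the datagram size. This is an illustration only: the function name choose_rpc_payload and its parameters are hypothetical, and the default header overhead of 28 octets assumes an IPv4 header without options (20 octets) plus a UDP header (8 octets).

```python
def choose_rpc_payload(pmtu: int, op_min_payload: int,
                       header_overhead: int = 28) -> tuple[int, bool]:
    """Return (payload size, whether the IP layer must fragment)."""
    # Largest payload that fits in a single unfragmented packet.
    fits = pmtu - header_overhead
    # Never go below the minimum that the operation requires.
    payload = max(fits, op_min_payload)
    return payload, payload > fits


print(choose_rpc_payload(1500, 512))   # (1472, False)
print(choose_rpc_payload(576, 1024))   # (1024, True)
```

In the second call the operation's minimum exceeds what the path can carry in one packet, so the payload is kept at the minimum and left to the IP layer to fragment.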
5.12. Issues for tunnels
@@@ to be written
5.13. Diagnostic tools
All implementations MUST include a mechanism to implement diagnostic
tools that do not rely on the operating system's implementation of
path MTU discovery. This requires a mechanism by which an application
can send oversized packets that are not subjected to the operating
system's notion of the current path MTU, up to the physical MTU limit
supported by the network interface, as well as a mechanism to
collect any Packet Too Big messages.
5.14. Management interface
It is suggested that an implementation provide a way for a system
utility program to:
- Specify that Path MTU Discovery not be done on a given path.
- Change the PMTU value associated with a given path.
- Set global controls on ICMP processing.
- Set per-connection or per-application controls on ICMP processing.
The first of these can be accomplished by associating a flag with the
path; when a packet is sent on a path with this flag set, the IP layer
does not send packets larger than the IPv6 minimum link MTU.
These features might be used to work around an anomalous situation,
or by a routing protocol implementation that is able to obtain Path
MTU values.
The implementation should also provide a way to change the timeout
period for aging stale PMTU information.
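The management controls in this section can be sketched as a small per-path table. Everything named here (PathTable, PathEntry, the method names, and the 600-second default timeout) is an assumption made for illustration. The sketch also assumes that a stale entry is reset to the first-hop MTU, and it takes the current time as an explicit argument rather than reading a clock.

```python
from dataclasses import dataclass

IPV6_MIN_LINK_MTU = 1280  # octets


@dataclass
class PathEntry:
    pmtu: int
    updated_at: float
    discovery_disabled: bool = False


class PathTable:
    def __init__(self, first_hop_mtu: int, aging_timeout: float = 600.0):
        self.first_hop_mtu = first_hop_mtu
        self.aging_timeout = aging_timeout  # seconds; may be changed later
        self.paths = {}

    def _entry(self, dest: str, now: float) -> PathEntry:
        if dest not in self.paths:
            self.paths[dest] = PathEntry(self.first_hop_mtu, now)
        return self.paths[dest]

    def disable_discovery(self, dest: str, now: float) -> None:
        # Flag the path so that discovery is not done on it.
        self._entry(dest, now).discovery_disabled = True

    def set_pmtu(self, dest: str, pmtu: int, now: float) -> None:
        # Manual change of the PMTU value associated with a path.
        entry = self._entry(dest, now)
        entry.pmtu = pmtu
        entry.updated_at = now

    def send_limit(self, dest: str, now: float) -> int:
        entry = self._entry(dest, now)
        if entry.discovery_disabled:
            return min(self.first_hop_mtu, IPV6_MIN_LINK_MTU)
        if now - entry.updated_at > self.aging_timeout:
            # Stale information: fall back to the first-hop MTU.
            entry.pmtu = self.first_hop_mtu
            entry.updated_at = now
        return entry.pmtu


table = PathTable(first_hop_mtu=1500)
table.set_pmtu("2001:db8::1", 1400, now=0.0)
print(table.send_limit("2001:db8::1", now=100.0))  # 1400
print(table.send_limit("2001:db8::1", now=700.0))  # 1500
table.disable_discovery("2001:db8::1", now=700.0)
print(table.send_limit("2001:db8::1", now=701.0))  # 1280
```

Changing the aging period amounts to assigning a new value to aging_timeout; the global and per-connection ICMP controls are not modeled here.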
6. Normative references
[RFC1191] Path MTU discovery. J.C. Mogul, S.E. Deering. November 1990.
(Obsoletes RFC1063) (Status: DRAFT STANDARD)
[RFC1981] Path MTU Discovery for IP version 6. J. McCann, S. Deering,
J. Mogul. August 1996. (Status: PROPOSED STANDARD)
[RFC2119] Key words for use in RFCs to Indicate Requirement Levels. S.
Bradner. March 1997. (Status: BEST CURRENT PRACTICE)
7. Informative references
[RFC1063] IP MTU discovery options. J.C. Mogul, C.A. Kent,
C. Partridge, K. McCloghrie. July 1988. (Obsoleted by RFC1191)
[RFC1435] IESG Advice from Experience with Path MTU Discovery. S.
Knowles. March 1993. (Status: INFORMATIONAL)
[RFC1626] Default IP MTU for use over ATM AAL5. R. Atkinson. May 1994.
(Status: PROPOSED STANDARD)
[RFC1791] TCP And UDP Over IPX Networks With Fixed Path MTU. T. Sung.
April 1995. (Status: EXPERIMENTAL)
[RFC2923] TCP Problems with Path MTU Discovery. K. Lahey. September
2000. (Status: INFORMATIONAL)
8. Security considerations
Since the MTU reported in the ICMP messages is constrained to be
between the old MTU and the candidate MTU, this algorithm is more
difficult to attack through fraudulent ICMP messages.
Furthermore, since this algorithm can function properly without ICMP
messages, that part of the algorithm can be disabled for additional
robustness in hostile environments.
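The range constraint on ICMP-reported MTUs can be sketched as a simple bounds check. The function name and parameters are hypothetical. Treating the lower bound as inclusive and the upper bound as exclusive is an assumption of this sketch: a probe of the candidate size was reported too big, so a credible report is smaller than the candidate but no smaller than the MTU already known to work.

```python
def icmp_mtu_is_plausible(reported_mtu: int, old_mtu: int,
                          candidate_mtu: int) -> bool:
    """Accept a reported MTU only if it lies in [old_mtu, candidate_mtu)."""
    return old_mtu <= reported_mtu < candidate_mtu


# Probing upward from a working MTU of 1024 with a 1500-octet probe.
print(icmp_mtu_is_plausible(1400, 1024, 1500))  # True
# A forged report trying to force a very small MTU is rejected.
print(icmp_mtu_is_plausible(68, 1024, 1500))    # False
```

A report outside the range is ignored, and the algorithm falls back on treating the lost probe as an MTU limitation without using the ICMP value.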
9. IANA considerations
10. Contributors
11. Acknowledgements
Matt Mathis and John Heffner are supported by a grant from Cisco
Systems, Inc.
12. Authors' addresses
Please send comments and suggestions to mtu@psc.edu.
Matt Mathis and John Heffner
Pittsburgh Supercomputing Center
4400 Fifth Ave.
Pittsburgh, PA 15213
mathis@psc.edu
jheffner@psc.edu
Kevin Lahey
Freelance
kml@patheticgeek.net
13. Intellectual Property
The IETF takes no position regarding the validity or scope of any
intellectual property or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; neither does it represent that it
has made any effort to identify any such rights. Information on the
IETF's procedures with respect to rights in standards-track and
standards-related documentation can be found in BCP-11. Copies of claims of
rights made available for publication and any assurances of licenses
to be made available, or the result of an attempt made to obtain a
general license or permission for the use of such proprietary rights
by implementers or users of this specification can be obtained from
the IETF Secretariat.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights which may cover technology that may be required to practice
this standard. Please address the information to the IETF Executive
Director.
14. Full copyright statement
Copyright (C) The Internet Society Oct 19, 2003. All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this document
itself may not be modified in any way, such as by removing the
copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of developing
Internet standards in which case the procedures for copyrights
defined in the Internet Standards process must be followed, or as
required to translate it into languages other than English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.