Internet Engineering Task Force                                P. Savola
Internet-Draft                                                 CSC/FUNET
Expires: November 10, 2005                                   May 9, 2005


       MTU and Fragmentation Issues with In-the-Network Tunneling
             draft-savola-mtufrag-network-tunneling-03.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on November 10, 2005.

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

   Tunneling techniques such as IP-in-IP when deployed in the middle of
   the network, typically between routers, have certain issues regarding
   how large packets can be handled: whether such packets would be
   fragmented and reassembled (and how), whether Path MTU Discovery
   would be used, or how this scenario could be operationally avoided.
   This memo justifies why this is a common, non-trivial problem, and
   goes on to describe the different solutions and their characteristics
   at some length.




Savola                  Expires November 10, 2005               [Page 1]


Internet-Draft    Packet Size Issues in Network Tunnels         May 2005


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Problem Statement  . . . . . . . . . . . . . . . . . . . . . .  4
   3.  Description of Solutions . . . . . . . . . . . . . . . . . . .  5
     3.1   Fragmentation and Reassembly by the Tunnel Endpoints . . .  5
     3.2   Signalling the Lower MTU to the Sources  . . . . . . . . .  6
     3.3   Encapsulate Only When There is Free MTU  . . . . . . . . .  6
     3.4   Fragmentation of the Inner Packet  . . . . . . . . . . . .  8
   4.  Conclusions  . . . . . . . . . . . . . . . . . . . . . . . . .  8
   5.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . .  9
   6.  Security Considerations  . . . . . . . . . . . . . . . . . . .  9
   7.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 10
   8.  References . . . . . . . . . . . . . . . . . . . . . . . . . . 10
     8.1   Normative References . . . . . . . . . . . . . . . . . . . 10
     8.2   Informative References . . . . . . . . . . . . . . . . . . 11
       Author's Address . . . . . . . . . . . . . . . . . . . . . . . 11
   A.  MTU of the Tunnel  . . . . . . . . . . . . . . . . . . . . . . 11
       Intellectual Property and Copyright Statements . . . . . . . . 12

1.  Introduction

   A large number of ways to encapsulate datagrams in other packets,
   i.e., tunneling mechanisms, have been specified over the years: for
   example, IP-in-IP (e.g., [1], [2]), GRE [3], L2TP [4], or IPsec [5]
   in tunnel mode -- any of which might run on top of IPv4, IPv6, or
   some other protocol and carry the same or a different protocol.

   All of these can be run so that the endpoints of the inner protocol
   are co-located with the endpoints of the outer protocol; in a typical
   scenario, this would correspond to "host-to-host" tunneling.  It is
   also possible to have one set of endpoints co-located, i.e., host-to-
   router or router-to-host tunneling.  Finally, many of these
   mechanisms are also employed between the routers for all or a part of
   the traffic that passes between them, resulting in router-to-router
   tunneling.

   All these protocols and scenarios have one issue in common: how does
   the source select the maximum packet size so that the packets will
   fit, even encapsulated, in the smallest Maximum Transmission Unit
   (MTU) of the traversed path in the network; and if the packet sizes
   cannot be affected, how can the packets still be encapsulated?  The
   four main solutions are (these will be elaborated in Section 3):

   1.  Fragmenting all encapsulated packets that are too big to fit in
       the paths, and reassembling them at the tunnel endpoints.

   2.  Signalling to all the sources whose traffic must be encapsulated,
       and is larger than what fits, to send smaller packets, e.g.,
       using Path MTU Discovery [6] [7].

   3.  Ensuring that in the specific environment, the encapsulated
       packets will fit in all the paths in the network, e.g., by using
       an MTU bigger than 1500 in the backbone used for encapsulation.

   4.  Fragmenting the original packets that are too big so that their
       fragments will fit, even encapsulated, in the paths, and
       reassembling them at the destination nodes.  Note that this
       approach is only available for IPv4 under certain assumptions
       (see Section 3.4).

   The tunneling packet size issues are relatively straightforward in
   host-to-host tunneling or host-to-router tunneling where Path MTU
   Discovery only needs to signal to one source node.  The issues are
   significantly more difficult in router-to-router and certain router-
   to-host scenarios, which are the focus of this memo.

   It is worth noting that most of this discussion applies to a more
   generic case, where there exists a link with lower MTU in the path.
   A concrete and widely deployed example of this is the usage of PPP
   over Ethernet (PPPoE) [9] at the customers' access link.  These
   lower-MTU links, and particularly PPPoE links, are typically not
   deployed in topologies where fragmentation and reassembly might be
   unfeasible (e.g., a backbone), so this may be a slightly easier
   problem.  However, this more generic case is considered out of scope
   of this memo.

   There are also known challenges in specifying and implementing a
   mechanism which would be used at the tunnel end-point to obtain the
   best suitable packet size to use for encapsulation; if a static value
   is chosen, a lot of fragmentation might end up being performed; if
   PMTUD is used, the implementation would need to use or relay the
   received Packet Too Big messages, and assume that sufficient data has
   been piggybacked on the ICMP messages (beyond the required 64 bits
   for ICMPv4) to make this possible.  However, this problem is
   described elsewhere (e.g., in [3] and [1]) and is out of scope of
   this memo.

   Section 2 includes a problem statement, Section 3 describes the
   different solutions with their drawbacks and advantages, and Section
   4 presents conclusions.

2.  Problem Statement

   It is worth considering why exactly this is considered a problem.

   It is possible to fix all the packet size issues using solution 1:
   fragmenting the resulting encapsulated packet and reassembling it at
   the tunnel endpoint.  However, this is considered problematic for
   at least three reasons, as described in Section 3.1.

   Therefore it is desirable to avoid fragmentation and reassembly if
   possible.  On the other hand, the other solutions may not be
   practical either: especially in router-to-router or router-to-host
   tunneling, Path MTU Discovery might be very disadvantageous --
   consider the case where a backbone router would send ICMP Packet Too
   Big messages to every source that tries to send packets through it.
   Fragmenting before encapsulation is also not available in IPv6, and
   not available in IPv4 when the Don't Fragment (DF) bit has been set
   (unless the implementation ignores the DF bit).  Ensuring a high
   enough MTU so that encapsulation is always possible is of course a
   valid approach, but requires careful operational planning, and may
   not be a feasible assumption for implementors.

   Thus there is no trivial solution to this problem, and the tradeoffs
   need to be explored further, as is done in this memo.

3.  Description of Solutions

   This section describes the potential solutions in a bit more detail.

3.1  Fragmentation and Reassembly by the Tunnel Endpoints

   The seemingly simplest solution to tunneling packet size issues is
   fragmentation of the outer packet by the encapsulator, and reassembly
   by the decapsulator.  However, this is highly problematic for at
   least three reasons:

   o  Fragmentation causes overhead: every fragment requires the IP
      header (20 or 40 bytes), and with IPv6, an additional 8 bytes for
      the Fragment Header.

   o  Fragmentation and reassembly require computation: splitting
      datagrams to fragments is a non-trivial procedure, and so is their
      reassembly.  For example, software router forwarding
      implementations may not be able to perform these operations at
      line rate.

   o  Reassembling requires buffers: fragments might get lost, be
      reordered or delayed; when that happens, the reassembly engine has
      to wait with the partial packet for some time.  When this would
      have to be done at the line rate, with e.g., 10 Gbit/s speed, the
      size of the buffers that reassembly might require, especially in
      the worst case, might be considerable.
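
   As a rough illustration of the first and third points, the sketch
   below estimates how many outer fragments one tunneled packet
   produces, how many bytes they occupy on the wire, and a crude upper
   bound on reassembly buffering.  The function names, the example
   sizes, and the 30-second reassembly timeout are assumptions chosen
   for illustration; they are not taken from any specification cited
   in this memo.

```python
import math

def outer_fragments(inner_len, encap_overhead, path_mtu, ip_version=4):
    """Return (fragment_count, bytes_on_wire) for one encapsulated packet.

    encap_overhead includes the outer IP header.  IP options and extension
    headers other than the IPv6 Fragment Header are ignored (assumption).
    """
    ip_hdr = 20 if ip_version == 4 else 40
    frag_hdr = 0 if ip_version == 4 else 8   # IPv6 Fragment Header
    outer_len = inner_len + encap_overhead
    if outer_len <= path_mtu:
        return 1, outer_len                  # fits without fragmentation
    payload = outer_len - ip_hdr             # bytes to spread over fragments
    # Every fragment except the last carries a multiple of 8 payload bytes.
    per_fragment = (path_mtu - ip_hdr - frag_hdr) // 8 * 8
    count = math.ceil(payload / per_fragment)
    return count, payload + count * (ip_hdr + frag_hdr)

def worst_case_buffer_bytes(line_rate_bps, timeout_s):
    """Crude bound: every arriving byte waits for a fragment that never comes."""
    return line_rate_bps * timeout_s // 8

if __name__ == "__main__":
    print(outer_fragments(1500, 20, 1500, ip_version=4))   # IPv4-in-IPv4
    print(outer_fragments(1500, 40, 1500, ip_version=6))   # IPv6-in-IPv6
    print(worst_case_buffer_bytes(10_000_000_000, 30))     # 10 Gbit/s, 30 s
```

   Under these assumptions, a 1500-byte packet encapsulated in IPv4
   over a 1500-byte path becomes two fragments totalling 1540 bytes,
   and a 10 Gbit/s interface could in the worst case hold 37.5
   gigabytes of partial packets before the timeout expires.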

   When examining router-to-router tunneling, the third problem is
   likely the worst; certainly, a hardware computation and
   implementation requirement would also be significant, but not all
   that difficult in the end -- and the link capacity wasted in the
   backbones by additional overhead might not be a huge problem either.

   However, the IPv4 Identification field is only 16 bits long (compared
   to 32 bits in IPv6), and if a large number of packets are being
   tunneled between two IP addresses, the ID is very likely to wrap and
   cause data misassociation.  Reassembly that wrongly combines data
   from two unrelated packets violates data integrity and potentially
   confidentiality.  This problem is further described in [10].
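
   To give a feel for when the identification space wraps, the sketch
   below computes the sustained data rate between one pair of
   addresses at which every ID value is consumed within a given
   lifetime.  The 1500-byte packet size, the 120-second lifetime, and
   the premise that every tunneled packet consumes one ID value are
   assumptions made for illustration only.

```python
ASSUMED_PACKET_SIZE = 1500   # bytes per tunneled packet (assumption)
ASSUMED_LIFETIME_S = 120     # seconds a stale fragment may linger (assumption)

def wrap_rate_bytes_per_s(id_bits, packet_size, lifetime_s):
    """Rate at which an ID space of id_bits is exhausted within lifetime_s,
    assuming each packet of packet_size bytes consumes one ID value."""
    return (2 ** id_bits) * packet_size // lifetime_s

ipv4_rate = wrap_rate_bytes_per_s(16, ASSUMED_PACKET_SIZE, ASSUMED_LIFETIME_S)
ipv6_rate = wrap_rate_bytes_per_s(32, ASSUMED_PACKET_SIZE, ASSUMED_LIFETIME_S)

if __name__ == "__main__":
    print(f"IPv4: {ipv4_rate / 1e6:.2f} MB/s")
    print(f"IPv6: {ipv6_rate / 1e9:.2f} GB/s")
```

   With these assumed values, the IPv4 figure comes out just under
   1 MB/s and the IPv6 figure at roughly 54 GB/s; shorter lifetimes
   or larger packets raise both thresholds proportionally.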

   So, if reassembly could be made to work sufficiently reliably, this
   would be one acceptable fallback solution but only for IPv6.

3.2  Signalling the Lower MTU to the Sources

   Another approach is to use techniques like Path MTU Discovery (or
   potentially a future derivative [11]) to signal to the sources whose
   packets will be encapsulated in the network to send smaller packets
   so that they can be encapsulated.

   Note that this only works if the MTU of the tunnel is of reasonable
   size, and not e.g., 64 kilobytes: see Appendix A for more.
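
   A minimal sketch of the signalling decision at an encapsulator
   follows.  It assumes the encapsulator knows the path MTU between
   the tunnel endpoints and the per-packet encapsulation overhead; the
   function names are hypothetical, and clamping the reported value to
   the protocol minimum MTU is a simplification of this sketch.

```python
IPV4_MIN_MTU = 68      # minimum MTU every IPv4 link must support
IPV6_MIN_MTU = 1280    # IPv6 minimum link MTU [8]

def needs_signal(inner_len, tunnel_path_mtu, encap_overhead):
    """True if the inner packet no longer fits once it is encapsulated."""
    return inner_len + encap_overhead > tunnel_path_mtu

def mtu_to_signal(tunnel_path_mtu, encap_overhead, ip_version=4):
    """MTU value an encapsulator could report in a Packet Too Big message.

    Never reports less than the protocol minimum (assumption of this
    sketch); below that, fragmentation is the only remaining option.
    """
    floor = IPV4_MIN_MTU if ip_version == 4 else IPV6_MIN_MTU
    return max(tunnel_path_mtu - encap_overhead, floor)

if __name__ == "__main__":
    if needs_signal(1500, 1500, 20):
        print("report MTU", mtu_to_signal(1500, 20, ip_version=4))
```

   If the tunnel were instead treated as having a 64-kilobyte MTU,
   needs_signal() would never be true for ordinary packets and the
   sources would never learn to send smaller ones.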

   This approach would presuppose that PMTUD works.  While it is
   currently working for IPv6, and critical for its operation, there is
   ample evidence that in IPv4, PMTUD is far from reliable due to e.g.,
   firewalls and other boxes being configured to inappropriately drop
   all the ICMP packets [12], or software bugs rendering PMTUD
   inoperable.

   Further, there are two scenarios where signalling from the network
   would be highly undesirable: when the encapsulation would be done in
   such a prominent place in the network that a very large number of
   sources would need to be signalled with this information (possibly
   even multiple times, depending on how long they keep their PMTUD
   state), or when the encapsulation is done for passive monitoring
   purposes (network management, lawful interception, etc.) -- when it's
   critical that the sources whose traffic is being encapsulated are not
   aware of this happening.

   A related technique, which works with TCP under specific scenarios
   only, is so-called "MSS clamping".  With that technique (or rather,
   "hack"), the TCP packets' Maximum Segment Size (MSS) is reduced by
   tunnel endpoints so that the TCP connection automatically restricts
   itself to the maximum available packet size.  Obviously this does not
   work for UDP or other protocols which have no MSS.  This approach is
   most applicable and used with PPPoE, but could be applied otherwise
   as well; the approach also assumes that all the traffic goes through
   tunnel endpoints which do MSS clamping -- this is trivial for the
   single-homed access links, but could be a challenge otherwise.
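
   The arithmetic behind MSS clamping can be sketched as follows.  The
   function name is hypothetical, and the sketch assumes base IP and
   TCP headers without options; a real tunnel endpoint would rewrite
   the MSS option in the TCP SYN segments it forwards.

```python
TCP_HEADER = 20   # bytes, no TCP options counted (assumption)

def clamp_mss(advertised_mss, tunnel_mtu, ip_version=4):
    """Return the MSS to write into a forwarded SYN so that full-sized
    segments still fit in tunnel_mtu."""
    ip_header = 20 if ip_version == 4 else 40
    largest_fit = tunnel_mtu - ip_header - TCP_HEADER
    # Only ever lower the value; a smaller advertised MSS is left alone.
    return min(advertised_mss, largest_fit)

if __name__ == "__main__":
    # PPPoE leaves 1492 bytes of the 1500-byte Ethernet payload [9].
    print(clamp_mss(1460, 1492, ip_version=4))
    print(clamp_mss(1440, 1492, ip_version=6))
```

   For a PPPoE link with an MTU of 1492 bytes, an advertised IPv4 MSS
   of 1460 would be lowered to 1452 under these assumptions.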

   A new approach to PMTUD is in the works [11], but it is uncertain
   whether that would fix the problems -- at least not the passive
   monitoring requirements.

3.3  Encapsulate Only When There is Free MTU

   The third approach is an operational one, depending on the
   environment where encapsulation and decapsulation is being performed.
   That is, if an ISP would deploy tunneling in its backbone, which
   would consist only of links supporting high MTUs (e.g., Gigabit
   Ethernet or SDH/SONET), but all its customers and peers would have a
   lower MTU (e.g., 1500, or the backbone MTU minus the encapsulation
   overhead), this would imply that no packet (with the encapsulation
   overhead added) would be larger than the "backbone MTU", and all the
   encapsulated packets would always fit MTU-wise in the backbone links.
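
   An operator could check this condition with a simple audit along
   the following lines.  The link names, MTU values, and the 24-byte
   overhead are hypothetical examples, not measurements or
   recommendations.

```python
def undersized_links(link_mtus, edge_mtu, encap_overhead):
    """Return the backbone links whose MTU cannot carry an edge-sized
    packet once the encapsulation overhead has been added."""
    required = edge_mtu + encap_overhead
    return {name: mtu for name, mtu in link_mtus.items() if mtu < required}

if __name__ == "__main__":
    # Hypothetical backbone inventory: link name -> configured MTU.
    backbone = {"core-a": 4470, "core-b": 9000, "legacy-c": 1500}
    problems = undersized_links(backbone, edge_mtu=1500, encap_overhead=24)
    for name, mtu in problems.items():
        print(f"{name}: MTU {mtu} is below the required 1524")
```

   An empty result means that, for the assumed edge MTU and overhead,
   every encapsulated packet fits in the listed links; any link that
   is reported needs a higher MTU or one of the other solutions.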

   This approach depends heavily on the deployment scenario.  It
   may be desirable to build a tunnel to/from another ISP (for example),
   where this might no longer hold; or there might be links in the
   network which cannot support the higher MTUs to satisfy the tunneling
   requirements; or customers themselves might try to tunnel fragmented
   packets to the ISP, requiring the reassembly capability from the
   ISP's equipment (in this last case, it might be possible to get the
   MTU at the customer's end lowered, eliminating the fragmentation, but
   it might not always be an option).

   To restate, this approach can only be considered when tunneling is
   done inside a part of a specific kind of ISP's own network, not when,
   e.g., transiting an ISP.

   Another, related approach might be having the sources use only a low
   enough MTU which would fit in all the physical MTUs; for example,
   IPv6 specifies the minimum MTU of 1280 bytes.  If all the sources
   whose traffic would be encapsulated used this as the maximum packet
   size, there would probably always be enough free MTU for
   encapsulation in the network.  However, this is not the case today,
   and it would be completely unrealistic to assume that this kind of
   approach could be made to work in general.

   It is worth remembering that while the IPv6 minimum MTU is 1280 bytes
   [8], there are scenarios where the tunnel implementation must
   implement fragmentation and reassembly [2]: for example, when having
   an IPv6-in-IPv6 tunnel on top of a physical interface with MTU of
   1280 bytes, or when having two layers of IPv6 tunneling.  This can
   only be avoided by ensuring that links on top of which IPv6 is being
   tunneled have an MTU somewhat larger (e.g., by 40 bytes) than 1280
   bytes.  This conclusion can be generalized: because IP can be
   tunneled on top of IP, no single minimum or maximum MTU can be found
   such that fragmentation or signalling to the sources would never be
   needed.
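
   The layering argument can be checked with a few lines of
   arithmetic.  The sketch assumes plain IPv6-in-IPv6 encapsulation
   with a 40-byte header per layer and no extension headers; the
   function names are hypothetical.

```python
IPV6_MIN_MTU = 1280   # IPv6 minimum link MTU [8]
IPV6_HEADER = 40      # bytes added per IPv6-in-IPv6 layer (no extensions)

def tunnel_mtu(link_mtu, layers):
    """MTU left for the innermost packet after `layers` of tunneling."""
    return link_mtu - layers * IPV6_HEADER

def must_fragment_minimum_packet(link_mtu, layers):
    """True if even a minimum-MTU (1280-byte) inner packet no longer fits,
    so the tunnel has to fragment and reassemble."""
    return tunnel_mtu(link_mtu, layers) < IPV6_MIN_MTU

if __name__ == "__main__":
    for link_mtu, layers in [(1280, 1), (1320, 1), (1320, 2), (1360, 2)]:
        print(link_mtu, layers, must_fragment_minimum_packet(link_mtu, layers))
```

   Each additional layer raises the required link MTU by another 40
   bytes, which is why no fixed value can rule out fragmentation for
   an arbitrary nesting depth.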

   All in all, while in certain operational environments it might be
   possible to avoid any problems by deployment choices, or limiting the
   MTU that the sources use, this is probably not a sufficiently good
   general solution for the equipment vendors, and other solutions must
   also be provided.

3.4  Fragmentation of the Inner Packet

   A final possibility is fragmenting the inner packet, before
   encapsulation, in such a manner that the encapsulated packet fits in
   the path MTU.  However, one should note that only IPv4 supports this
   "in-flight" fragmentation; further, it isn't allowed for packets
   where the Don't Fragment (DF) bit has been set.  Even if one could
   ignore IPv6 completely, so many IPv4 host stacks send packets with
   the DF bit set that this would seem unfeasible.

   Regardless of what the specifications say, there are implementations
   that perform fragmentation if required regardless of the DF bit:
   either ignoring the DF bit completely, for all or for specified
   interfaces, or clearing the DF bit on egress of the specified
   interfaces.  This is non-compliant behaviour, but there are certainly
   uses for it, especially in certain tightly controlled passive
   monitoring scenarios, and it has potential for more generic
   applicability as well, to work around PMTUD issues.

   As this is an implemented and desired (by some) behaviour, its full
   impact, e.g., on the functioning of PMTUD, should be analyzed, and
   the use of fragmentation-related IPv4 bits should be re-evaluated.

   In summary, this approach provides a relatively easy fix for IPv4
   problems, with potential for causing problems for PMTUD; as this
   would not work with IPv6, it could not be considered a generic
   solution.
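
   The constraints on this approach can be summarized in a short
   sketch.  The function names and the ignore_df switch are
   hypothetical; the switch models the non-compliant behaviour
   described above and is not an endorsement of it.  Base IPv4
   headers without options are assumed.

```python
IPV4_HEADER = 20   # bytes, no IP options (assumption)

def may_fragment_inner(ip_version, df_set, ignore_df=False):
    """True if the encapsulator may fragment the inner packet in flight."""
    if ip_version != 4:
        return False                 # IPv6 has no in-flight fragmentation
    return (not df_set) or ignore_df # ignore_df=True is non-compliant

def max_inner_fragment_len(path_mtu, encap_overhead):
    """Largest inner IPv4 fragment that still fits after encapsulation.

    The fragment payload is rounded down to a multiple of 8 bytes, as
    required for every fragment except the last.
    """
    payload = (path_mtu - encap_overhead - IPV4_HEADER) // 8 * 8
    return IPV4_HEADER + payload

if __name__ == "__main__":
    print(may_fragment_inner(4, df_set=False))
    print(may_fragment_inner(4, df_set=True))
    print(may_fragment_inner(6, df_set=False))
    print(max_inner_fragment_len(1500, 20))
```

   With a 1500-byte path MTU and 20 bytes of encapsulation overhead,
   the largest inner fragment under these assumptions is 1476 bytes,
   which becomes a 1496-byte packet once encapsulated.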

4.  Conclusions

   Fragmentation and reassembly by the tunnel endpoints is a clear
   solution to the problem, but hardware reassembly when packets get
   lost may face significant implementation challenges.  Whether
   these challenges are practically insurmountable or not should be
   evaluated.  This approach does not seem feasible especially for IPv4
   with high data rates due to problems with wrapping of the fragment
   identification field [10].  Constant wrapping may occur when the data
   rate is on the order of MB/s for IPv4 and on the order of dozens of
   GB/s for IPv6.  However, this reassembly approach is probably not a
   problem for passive monitoring applications.

   PMTUD techniques, at least at the moment and especially for IPv4,
   appear to be too unreliable or unscalable to be used in the
   backbones.  It is an open question whether a future solution might
   work better in this aspect.

   It is clear that in some environments the operational approach to the
   problem, ensuring that fragmentation is never necessary by keeping
   higher MTUs in the networks that encapsulated packets traverse, is
   sufficient.  But this is unlikely to be enough in general,
   particularly for vendors that may not be able to make assumptions
   about the operators' deployments.

   Fragmentation of the inner packet is only possible with IPv4, and is
   sufficient only if non-compliant behaviour, with potential for bad
   side-effects (e.g., for PMTUD), is adopted.  It should not be
   used if there are alternatives; fragmentation of the outer packet
   seems a better option for passive monitoring.

   An interesting thing to explicitly note is that when tunneling is
   done in a high-speed backbone, typically one may be able to make
   assumptions on the environment; however, when reassembly is not
   performed in such a network, it might be done in software or with
   lower requirements, and one can rely on either a reassembly
   implementation, PMTUD, or a separate approach for passive
   monitoring -- so this might not be a real problem.

   In consequence, the critical questions at this point appear to be 1)
   whether a higher MTU can be assumed in the high-speed networks that
   deploy tunneling, and 2) whether "slower-speed" networks could cope
   with a software-based reassembly, a less capable hardware-based
   reassembly, or the other workarounds.  An important future task would
   be analyzing the observed non-compliant DF-bit behaviour to note
   whether it has any unanticipated drawbacks.

5.  IANA Considerations

   This document makes no request of IANA.

   Note to RFC Editor: this section may be removed on publication as an
   RFC.

6.  Security Considerations

   This document describes different issues with packet sizes and in-
   the-network tunneling; this does not have security considerations on
   its own.

   However, different solutions might have characteristics which may
   make them more susceptible to attacks -- for example, router-based
   fragment reassembly could easily lead to (reassembly) buffer memory
   exhaustion if an attacker sends a sufficient number of fragments
   without sending all of them, so that the reassembly is stalled until
   a timeout; these and other fragment attacks (e.g., [13]) have already
   been used against e.g., firewalls and host stacks, and need to be
   taken into consideration in the implementations.

   It is worth considering the cryptographic expense (which is typically
   more significant than the reassembly, if done in software) with
   fragmentation of the inner or outer packet.  If an outer fragment
   goes missing, no cryptographic operations have yet been performed; if
   an inner fragment goes missing, cryptographic operations have already
   been performed.  Therefore, which of these approaches is preferable
   also depends on whether cryptography or reassembly are already
   provided in hardware; for high-speed routers, at least, one should be
   able to assume that if it is performing relatively heavy
   cryptography, hardware support is already required.

7.  Acknowledgements

   While the topic is far from new, recent discussions with W. Mark
   Townsley on L2TP fragmentation issues caused the author to sit down
   and write up the issues more generally.  Michael Richardson and Mika
   Joutsenvirta provided useful feedback on the first draft.  When
   comments were solicited from the NANOG list, Carsten Bormann, Kevin
   Miller, Warren Kumari, Iljitsch van Beijnum, Alok Dube, and Stephen
   J. Wilcox provided useful feedback.

8.  References

8.1  Normative References

   [1]  Nordmark, E. and R. Gilligan, "Basic Transition Mechanisms for
        IPv6 Hosts and Routers", draft-ietf-v6ops-mech-v2-07 (work in
        progress), March 2005.

   [2]  Conta, A. and S. Deering, "Generic Packet Tunneling in IPv6
        Specification", RFC 2473, December 1998.

   [3]  Farinacci, D., Li, T., Hanks, S., Meyer, D., and P. Traina,
        "Generic Routing Encapsulation (GRE)", RFC 2784, March 2000.

   [4]  Lau, J., Townsley, M., and I. Goyret, "Layer Two Tunneling
        Protocol - Version 3 (L2TPv3)", RFC 3931, March 2005.

   [5]  Kent, S. and R. Atkinson, "Security Architecture for the
        Internet Protocol", RFC 2401, November 1998.

   [6]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
        November 1990.

   [7]  McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery for
        IP version 6", RFC 1981, August 1996.

   [8]  Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6)
        Specification", RFC 2460, December 1998.

8.2  Informative References

   [9]   Mamakos, L., Lidl, K., Evarts, J., Carrel, D., Simone, D., and
         R. Wheeler, "A Method for Transmitting PPP Over Ethernet
         (PPPoE)", RFC 2516, February 1999.

   [10]  Mathis, M., "Fragmentation Considered Very Harmful",
         draft-mathis-frag-harmful-00 (work in progress), July 2004.

   [11]  Mathis, M., "Path MTU Discovery", draft-ietf-pmtud-method-04
         (work in progress), February 2005.

   [12]  Medina, A., Allman, M., and S. Floyd, "Measuring the Evolution
         of Transport Protocols in the Internet", Computer
         Communications Review, Apr 2005, <http://www.icir.org/tbit/>.

   [13]  Miller, I., "Protection Against a Variant of the Tiny Fragment
         Attack (RFC 1858)", RFC 3128, June 2001.


Author's Address

   Pekka Savola
   CSC/FUNET
   Espoo
   Finland

   Email: psavola@funet.fi

Appendix A.  MTU of the Tunnel

   Different tunneling mechanisms may treat the tunnel links as having
   different kinds of MTU values.  Some might use the same default MTU as
   for other interfaces; some others might use the default MTU minus the
   expected IP overhead (e.g., 20, 28, or 40 bytes); some others might
   even treat the tunnel as having "infinite MTU", e.g., 64 kilobytes.

   As [1] describes, having an infinite MTU, i.e., fragmenting the outer
   packet (and never the inner packet) and never performing PMTUD is a
   very bad idea, especially in host-to-router scenarios.  (It could be
   argued that if the nodes are sure that this is a host-to-host tunnel,
   a larger MTU might make sense if fragmentation and reassembly are
   more efficient than just sending properly sized packets -- but this
   seems like a stretch.)

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.


Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Copyright Statement

   Copyright (C) The Internet Society (2005).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.


Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.