INTERNET-DRAFT                                      Tathagata Nandy
Intended Status: Proposed Standard                  HPE
                                                    Nitin Singla
                                                    HPE
                                                    Utkarsh Srivastava
                                                    HPE
Expires: 19 October 2020                            April 19, 2020

                          Multicast Path MTU
        draft-nandy-singla-utkarsh-pim-mcast-path-mtu-00

Abstract
   Path MTU discovery (rfc1191) is a standard technique to determine
   the supported MTU between two Internet Protocol (IP) hosts to avoid
   any fragmentation. In a multicast distribution tree, source will
   not know where the receivers are located. So the technique used to
   compute the path MTU for a unicast stream does not work in a
   multicast network.  This document describes a method to discover
   multicast path MTU with the goal to avoid traffic loss. This
   solution also aims to solve the problem of traffic loss in for
   multicast streams because of incorrect MTU setting and no path MTU
   support for multicast networks.

Status of This Memo
   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on 12 October 2020.

 Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this
   document.  Please review these documents carefully, as they
   describe your rights and restrictions with respect to this
   document. Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Tathagata, et al.        Expires 12 October 2020                [Page 1]


Internet-Draft            PIM Multicast Path MTU            April 2020

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Conventions used in this document . . . . . . . . . . . . .  3
   3.  Problem Statement . . . . . . . . . . . . . . . . . . . . .  4
   4.  Multicast Data Path . . . . . . . . . . . . . . . . . . . .  5
     4.1. FHR to RP  . . . . . . . . . . . . . . . . . . . . . . .  5
     4.2.  Generic Routing . . . . . . . . . . . . . . . . . . . .  5
     4.3.  LHR to Host . . . . . . . . . . . . . . . . . . . . . .  6
   5.  Security Considerations . . . . . . . . . . . . . . . . . .  6
   6. IANA considerations . . . . . . . . . . . . . . . . . . . .   6
   7. References  . . . . . . . . . . . . . . . . . . . . . . . .   7
     7.1.  Normative References . . . . . . . . . . . . . . . . .   7
     7.2.  Informative References . . . . . . . . . . . . . . . .   7
   8. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 8
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . .  8

Tathagata, et al.        Expires 12 October 2020                [Page 2]


Internet-Draft            Multicast Path MTU                April 2020

1.  Introduction
   When one IP host has a large amount of data to send to another
   host, the data is transmitted as a series of IP datagrams. It is
   usually preferable that these datagrams be of the largest size that
   does not require fragmentation anywhere along the path from the
   source to the destination.  (For the case against fragmentation,
   see [5].) This datagram size is referred to as the Path MTU (PMTU),
   and it is equal to the minimum of the MTUs of each hop in the path.
   A shortcoming of the current Internet protocol suite is the lack of
   a standard mechanism for a host to discover the PMTU of an
   arbitrary path.  Note: The Path MTU is what in [1] is called the
   "Effective MTU for sending" (EMTU_S). A PMTU is associated with a
   path, which is a particular combination of IP source and
   destination address and perhaps a Type-of-service (TOS).  The
   current practice [1] is to use the lesser of 576 and the first-hop
   MTU as the PMTU for any destination that is not connected to the
   same network or subnet as the source.  In computer networking,
   multicast is group communication where data transmission is
   addressed to a group of destination computers simultaneously.
   Multicast can be one-to-many or many-to-many distribution.
   Multicast should not be confused with physical layer
   point-to-multipoint communication. Ethernet frames with a value of
   1 in the least-significant bit of the first octet of the
   destination address are treated as multicast frames and are flooded
   to all points on the network. This mechanism constitutes multicast
   at the data link layer. This mechanism is used by IP multicast to
   achieve one-to-many transmission for IP on Ethernet networks.
   Modern Ethernet controllers filter received packets to reduce CPU
   load, by looking up the hash of a multicast destination address in
   a table, initialized by software, which controls whether a
   multicast packet is dropped or fully received.  IP multicast is a
   technique for one-to-many communication over an IP network. The
   destination nodes send Internet Group Management Protocol join and
   leave messages, for example in the case of IPTV when the user
   changes from one TV channel to another. Multicast uses network
   infrastructure efficiently by requiring the source to send a packet
   only once, even if it needs to be delivered to a large number of
   receivers. The nodes in the network take care of replicating the
   packet to reach multiple receivers only when necessary.

2.  Conventions used in this document
2.1.  Terminology
   The reader is assumed to be familiar with the terminology,
   reference models, and taxonomy defined in [RFC4664] and [RFC4665].
   For readability purposes, we repeat some of the terms here.
   Moreover, we also propose some other terms needed when IP multicast
   support is discussed.

Tathagata, et al.        Expires 12 October 2020            [Page 3]


Internet-Draft            PIM Multicast Path MTU          April 2020

   Multicast domain
      An area in which multicast data is transmitted. In this
      document, this term has a generic meaning that can refer to
      Layer-2 and Layer-3.  Generally, the Layer-3 multicast domain is
      determined by the Layer-3 multicast protocol used to establish
      reachability between all potential receivers in the
      corresponding domain. The Layer-2 multicast domain can be the
      same as the Layer-2 broadcast domain (i.e., VLAN), but it may be
      restricted to being smaller than the Layer-2 broadcast domain if
      an additional control protocol is used.

   PIM-SM
      Protocol Independent Multicast Sparse Mode (PIM-SM) is a family
      of multicast routing protocols for Internet Protocol (IP)
      networks that provide one-to-many and many-to-many distribution
      of data over a LAN, WAN or the Internet.  It explicitly builds
      unidirectional shared trees rooted at a rendezvous point (RP)
      per group, and optionally creates shortest-path trees per
      source. PIM-SM uses shared trees by default and implements
      source-based trees for efficiency; it assumes that no hosts want
      the multicast traffic unless they specifically ask for it.
      Senders first send the multicast data to the RP, which in turn
      sends the data down the shared tree to the receivers.

   RP
      Rendezvous Point (RP) is a router in a multicast network domain
      that acts as a shared root for a multicast shared tree. Any
      number of routers can be configured to work as RPs and they can
      be configured to cover different group ranges. An RP acts as the
      meeting place for sources and receivers of multicast data. In a
      PIM-SM network, sources must send their traffic to the RP.  This
      traffic is then forwarded to receivers down a shared
      distribution tree.

2.2.  Conventions
   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

3.  Problem Statement
3.1.  Motivation
   Path MTU discovery computes the lowest MTU supported between two
   hosts to avoid IP fragmentation. For a unicast packet, source
   device sends out a packet with Don't Fragment (DF) flag bit set in
   the IP header [1]. Any device along the path whose MTU is
   smaller than the packet will drop the packet and send back an ICMP
   Packet Too Big (Type 2) message containing its MTU, allowing the
   source host to reduce its Path MTU appropriately. The process is
   repeated until the MTU is small enough to traverse the entire path
   without fragmentation.  In a multicast distribution tree, the
   source does not know the host for a multicast group till the
   complete multicast tree is built.  Hosts in different branches of
   the tree use IGMP/MLD followed by PIM to become part of the
   multicast tree. Generally the process starts at the host where it

Tathagata, et al.        Expires 12 October 2020            [Page 4]


Internet-Draft            PIM Multicast Path MTU          April 2020

   sends a request to become part of a multicast tree through IGMP
   joins.  The same request is sent to the RP and there by source and
   group develop a common path.  So the technique mentioned above may
   not work for multicast flows.

3.2.  Scalability
   Most routers doesn't send ICMP (unreachable; fragmentation needed)
   messages in response to too-big IPv4 multicast packets with DF-bit
   set. They're just dropping these packets silently, breaking PMTUD.
   This is a case of as-per-design feature and is updated in section
   7.2 of RFC 1112 that an ICMP error message (Destination
   Unreachable, Time Exceeded, Parameter Problem, Source Quench, or
   Redirect) is never generated in response to a datagram destined to
   an IP host group.  The same document also describes why RFC 1112
   prohibits sending ICMP error messages in response to multicast
   datagrams. The processing done on ICMP error replies by the *nix
   socket API might block the sender socket if an error comes back
   from a single receiver or if TTL expires when traversing a
   particularly long branch of the multicast tree, not exactly a good
   idea in multicast environment.


4.  Multicast Data Path
   The multicast Stream between a Source and a Host for a particular
   Group uses the following path.

   1.  Source Router sends PIM Register Packets to the Rendezvous
       Point (RP) Router with the Source encapsulated in it. This is a
       Unicast Packet.

   2.  Host Router Sends PIM Joins to the RP and from there the
       Source and the Core based tree is built.

4.1  First hop Source router and rendezvous point pre-Registration
   For the network segment between the first hop router and the PIM
   Rendezvous point (RP), multicast data packets are encapsulated into
   PIM register messages.  PIM Register messages are unicast messages
   and the standard Path MTU discovery technique will work for this
   segment.

4.2  Multicast Flow and PMTU
   For other segments in the network, data will be sent as multicast
   packets and the following sequence is used to determine the path
   MTU for different branches in the multicast tree:

   1.  A new multicast flow received on any router will not have any
       match in the multicast routing table and hence it is treated
       as unknown multicast flow. Such streams are copied to CPU to
       program the flows in HW.

   2.  When the Packet is processed by multicast process to program an
       unknown flow it computes the Outgoing interfaces list (Olist)
       for the flow based on IGMP/MLD joins or PIM joins from
       downstream Routers.

Tathagata, et al.        Expires 12 October 2020            [Page 5]


Internet-Draft            PIM Multicast Path MTU          April 2020

   3.  The proposal is for each interfaces in the Olist, an additional
       check is performed where the MTU supported on the interface is
       compared with the size of the multicast data packet. If the
       packet size is greater than the supported MTU, an ICMP
       Fragmentation Needed (Type 3, Code 4) message containing its
       MTU, allowing the source DR to re-compute MTU appropriately.
       This is done irrespective of whether DF bit is set or not.

   4.  An error message will be logged in each of the Routers
       performing this check.  Optionally an SNMP trap can also be
       send. This would lead the admin to either change the MTU of the
       Interfaces for the Multicast Data to go through or the Source
       DR to fragment and send the Data.

   5.  Optionally as per implementation, some routers can program the
       Mroute Entry with Error displaying that the packets might be
       dropped because of large size.  This could be implementation
       specific.


   6.  Optionally, in all the Routers where this check is performed,
       the unknown Multicast Data packet can be programmed as a bridge
       entry in Hardware such that no further packets reach the CPU.

   7.  This computation is done at the Connection establishment phase
       itself for the PIM-SM network such that the Mroute Entry is
       never programmed in Hardware without the MTU computation.

4.3  Last Hop Router to the Host MTU
   The Host sends IGMP Joins to join a particular group and when
   unknown multicast is received at the router, it would compute the
   MTU for those joined paths and would send an ICMP error packet back
   to the source if there is a violation.

   1.  Source host will learn about the lowest MTU supported among all
       the branches of the multicast tree and uses the updates the
       size of the datagrams accordingly.

   2.  This path is same as the previous section only, the only
       difference is that Joins are not PIM Joins but IGMP Joins.

5  IANA Considerations
   This memo includes no request to IANA.

6  Security Considerations
   This Path MTU Discovery mechanism makes possible two
   denial-of-service attacks, both based on a malicious party sending
   false Datagram Too Big messages to an Internet host.  In the first
   attack, the false message indicates a PMTU much smaller than
   reality. This should not entirely stop data flow, since the victim
   host should never set its PMTU estimate below the absolute minimum,
   but at 8 octets of IP data per datagram, progress could be slow.
   In the other attack, the false message indicates a PMTU greater
   than reality.  If believed, this could cause temporary blockage as

Tathagata, et al.        Expires 12 October 2020            [Page 6]


Internet-Draft            PIM Multicast Path MTU          April 2020

   the victim sends datagrams that will be dropped by some router.
   Within one round-trip time, the host would discover its mistake
   (receiving Datagram Too Big messages from that router), but
   frequent repetition of this attack could cause lots of datagrams to
   be dropped. A host, however, should never raise its estimate of the
   PMTU based on a Datagram Too Big message, so should not be
   vulnerable to this attack.  A malicious party could also cause
   problems if it could stop a victim from receiving legitimate
   Datagram Too Big messages, but in this case there are simpler
   denial-of-service attacks available.  In another case if the
   packets are always rejected because of higher MTU and the sender
   does not change the packet size or the admin does not adjust the
   MTU, there is a risk of a DOS attack on the Switch sending the ICMP
   Error packet. Multicast packet send at high rate can consume the
   CPU resources of all the Routers implementing the PMTU for
   Multicast.

7  References
7.1  Normative References
   [1]  J. Mogul, S. Deering. Path MTU Discovery. RFC 1191, DECWRL
        and Stanford University, November, 1990.
   [2]  J. Postel, INTERNET CONTROL MESSAGE PROTOCOL. RFC 791,
        ISI, September 1981.
7.2  Informative References
   [3]  <https://blog.ipspace.net/2015/09/
        path-mtu-discovery-doesnt-work-with-ip.html>
   [4]  <https://en.wikipedia.org/wiki/Multicast>
   [5]  <https://www.cisco.com/c/en/us/products/collateral/
        ios-nx-os-software/ip-multicast/whitepaper_c11-508498.html>

Tathagata, et al.        Expires 12 October 2020            [Page 7]


Internet-Draft            PIM Multicast Path MTU          April 2020

8  Acknowledgments
   The authors thank the contributors of [RFC1191] and RFC{5501] since
   the structure and content of this document were, for some sections,
   largely inspired from it.  The authors also thank Mark Pearson and
   others for their valuable reviews and feedback.  THIS SOFTWARE IS
   PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY
   EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
   PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
   CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
   USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
   ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
   OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
   SUCH DAMAGE.

9  Authors' Addresses
   Tathagata Nandy
   Hewlett Packard India Software Operations Pvt. Ltd.
   Survey # 192, Whitefield Road,
   Mahadevapura Post, Bangalore 560048. India
   Phone: (+91) 9611895857
   EMail: tathagata.nandy@hpe.com

   Nitin Singla
   Hewlett Packard India Software Operations Pvt. Ltd.
   Survey # 192, Whitefield Road,
   Mahadevapura Post, Bangalore 560048. India
   Phone: (+91)7411937209
   EMail: singla@hpe.com

   Utkarsh Srivasta
   Hewlett Packard India Software Operations Pvt. Ltd.
   Survey # 192, Whitefield Road,
   Mahadevapura Post, Bangalore 560048. India
   Phone: (+91)7411937209
   EMail: usrivastava@hpe.com

Tathagata, et al.        Expires 12 October 2020            [Page 8]