The Requirements of a Unified Transport Protocol for In-Network Computing in Support of RPC-based Applications
draft-song-inc-transport-protocol-req-01

Document Type Active Internet-Draft (individual)
Authors Haoyu Song , Wenfei Wu , Dirk Kutscher
Last updated 2024-01-24
Network Working Group                                            H. Song
Internet-Draft                                    Futurewei Technologies
Intended status: Informational                                     W. Wu
Expires: 27 July 2024                                  Peking University
                                                             D. Kutscher
          The Hong Kong University of Science and Technology (Guangzhou)
                                                         24 January 2024

    The Requirements of a Unified Transport Protocol for In-Network
             Computing in Support of RPC-based Applications
                draft-song-inc-transport-protocol-req-01

Abstract

   In-network computing breaks the end-to-end principle and introduces
   new challenges to the transport layer functionalities.  This draft
   provides the background of a suite of RPC-based applications which
   can take advantage of INC support, surveys the existing transport
   protocols to show that they are insufficient or unsuitable for use in
   this context, and lays out the requirements for developing a general
   transport protocol tailored to such applications.  The purpose of
   this draft is to help readers understand the problem domain and to
   inspire the design and development of a unified INC transport
   protocol.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 27 July 2024.

Song, et al.              Expires 27 July 2024                  [Page 1]
Internet-Draft                 TP for INC                   January 2024

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Motivation
   2.  INC Application RPCs
   3.  Existing Transport Protocols
   4.  Requirements
   5.  IANA Considerations
   6.  Security Considerations
   7.  References
     7.1.  Normative References
     7.2.  Informative References
   Authors' Addresses

1.  Motivation

   In a broader sense, COmputing-In-Network (COIN) covers many distinct
   types of applications which rely on networks to do more than packet
   forwarding (e.g., active networking, edge computing, and service
   function chaining).  However, the emerging term In-Network Computing
   (INC) [inc] in particular refers to a narrower scope which applies
   on-path programmable networking devices (e.g., switches and routers
   between clients and servers) as an accelerator or function offloader
   to boost throughput, reduce server load, or improve latency,
   typically in a well-controlled data center network environment.

   Some INC implementations evolved from programmable data plane systems
   and align with the trend of network programmability at large.  In
   recent years, INC has been shown to support many promising
   applications (e.g., caching, aggregation, and agreement).  For
   example, in distributed machine learning (DML), training nodes
   produce data (gradients) that need to be aggregated or reduced, and
   the result may be distributed to one or multiple consumers.  As
   another example, the NetClone system [netclone] uses an in-network
   forwarder to replicate RPC invocation messages and to perform more
   informed forwarding based on observed latencies, thereby accelerating
   RPC communication.

   While it is possible to achieve this kind of operation purely with
   end-to-end communication between worker nodes, performance can be
   dramatically improved by offloading both the operation processing and
   the data dissemination to nodes in the network.  These in-network
   processors are often conceived as semi-transparent performance
   enhancing on-path elements, i.e., they are not the actual endpoints
   in transport protocol sessions and would intercept packets with
   application data and potentially generate new data that they would
   have to transmit.

   The intended INC behavior thus cannot be achieved with existing
   end-to-end transport protocols such as TCP and QUIC.  Conventionally,
   network devices are only supposed to process packets up to the
   network layer and leave the upper layers (i.e., transport layer and
   application layer) intact for the end hosts to process.  INC,
   however, requires the network devices to participate in the
   application logic, so they inevitably need to process the related
   packets up to the application layer, as shown in Figure 1.

                          /-------------------\
                         /     INC devices     \
        +-----------+   /     +-----------+     \    +-----------+
        |application|   |     |application|     |    |application|
        +-----------+   |     +-----------+     |    +-----------+
        | transport |   |     | transport |     |    | transport |
        +-----------+   |     +-----------+     |    +-----------+
        |  network  |<--+---->|  network  |<----+--->|  network  |
        +-----------+   \     +-----------+     /    +-----------+
           client        \---------------------/         server
                                 network

                  Figure 1: Network Protocol Stack in INC

   In the context of the INC systems we refer to here, the computing
   functions need to be performed in the data plane fast path.  There
   may be other use cases where a network device needs to direct the
   application packets to the slow path (e.g., a local CPU or a remote
   server) for processing; we do not consider them here.

   Programmable data plane devices use different programming languages
   (e.g., P4 and HDL) and have different chip architectures (e.g., RMT
   pipeline, RTC, and FPGA).  These devices are optimized for simple
   packet processing and forwarding with limited hardware resources.

   Specifically, it is difficult for these devices to support complex
   stateful operations and mathematical calculations beyond integer
   addition and shift.  Unsurprisingly, the in-network computing
   functions for the supported applications are all relatively simple
   (e.g., resorting to lookup tables or counters).  However,
   programmable switch chip technology is progressing fast, with better
   stateful operation support and computing capabilities.  It is
   conceivable that future programmable switches could undertake more
   computing tasks, albeit still in a facilitating role.

   To correctly handle the computing tasks, however, a reliable
   transport layer must be present.  The transport layer provides
   common services such as connection maintenance, reliability, flow
   control, and multiplexing.  The existing INC applications either make
   oversimplified assumptions to sidestep this problem (e.g., assume the
   use of UDP as the transport layer protocol, or ignore the transport
   layer altogether) or provide an ad hoc solution dedicated to a
   particular application, which entangles the transport and application
   functions (e.g., ATP).  A general transport layer protocol is needed
   for INC to take care of the common transport issues.  It can free
   application developers from worrying about the transport issues and
   help them focus on the application logic itself.

   This draft provides the background of a suite of RPC-based
   applications which can take advantage of INC support, surveys the
   existing transport protocols to show that they are insufficient or
   unsuitable for use in this context, and lays out the requirements
   for developing a general transport protocol tailored to such
   applications.  The purpose of this draft is to help readers
   understand the problem domain and to inspire the design and
   development of a unified INC transport protocol.

2.  INC Application RPCs

   The INC applications concerned in this draft all follow the
   communication paradigm of idempotent Remote Procedure Call (RPC): A
   client sends a message with arguments to a server and gets a response
   back which reflects the computation result based on the arguments.
   On the one hand, this is unlike TCP, which is mainly used for
   transferring byte streams; on the other hand, it requires a reliable
   datagram service, which goes beyond what UDP provides.
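
   The role of idempotence can be illustrated with a minimal sketch.
   It assumes a simulated datagram path that loses responses; the
   client simply retransmits the same request, and because the
   procedure is idempotent, the repeated executions are harmless.  The
   names ResponseLossChannel and call_with_retry are hypothetical and
   are not defined by this draft.

```python
from typing import Callable, Optional, Tuple


class ResponseLossChannel:
    """Hypothetical datagram path that loses the first few responses."""

    def __init__(self, procedure: Callable[[Tuple[int, ...]], int],
                 lost_responses: int) -> None:
        self.procedure = procedure
        self.lost_responses = lost_responses
        self.executions = 0  # how many times the procedure actually ran

    def send(self, args: Tuple[int, ...]) -> Optional[int]:
        # In this sketch the request always arrives and is executed.
        result = self.procedure(args)
        self.executions += 1
        if self.lost_responses > 0:
            self.lost_responses -= 1
            return None  # response lost; the client observes a timeout
        return result


def add(args: Tuple[int, ...]) -> int:
    """Idempotent procedure: the result depends only on the arguments."""
    return sum(args)


def call_with_retry(channel: ResponseLossChannel, args: Tuple[int, ...],
                    max_attempts: int = 5) -> int:
    """Retransmit the same request until a response arrives."""
    for _ in range(max_attempts):
        response = channel.send(args)
        if response is not None:
            return response
    raise TimeoutError("no response after %d attempts" % max_attempts)


channel = ResponseLossChannel(add, lost_responses=2)
result = call_with_retry(channel, (3, 4))
print(result, channel.executions)  # prints: 7 3
```

   The procedure runs three times but the client sees a single,
   correct result, which is the property a reliable datagram service
   for idempotent RPCs can rely on.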

   We can classify these INC applications into three service models:

   Synchronous Collaboration (SC):  Each client in a set sends a piece
      of data to a server at roughly the same time.  The result can be
      computed and sent back to the clients once all the data pieces
      have been received.  A notable example is AllReduce (one operation
      in the class of Collective Communication
      [I-D.yao-tsvwg-cco-problem-statement-and-usecases]).  Quite often
      there is one result that needs to be transmitted back to all
      clients, i.e., a multi-destination delivery service could be
      applied.

   Asynchronous Collaboration (AC):  Each client in a set sends multiple
      data items to a server.  The result can be computed once all the
      data items have been received.  An example of such an application
      is MapReduce.

   Individual Request (IR):  A client sends individual requests to a
      server and gets a response for each request.  An example of such
      an application is NetCache [netcache].
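
   As an illustration of the Synchronous Collaboration model, the
   following minimal sketch shows an aggregation point that sums one
   integer vector per client and releases the result only when every
   client has contributed.  The class SyncAggregator and its method
   names are hypothetical; the sketch uses only integer addition, and
   it assumes that a retransmitted packet carries the same client
   identifier so that it is not counted twice.

```python
from typing import List, Optional, Set


class SyncAggregator:
    """Hypothetical AllReduce-style sum over one vector per client."""

    def __init__(self, num_clients: int, vector_len: int) -> None:
        self.num_clients = num_clients
        self.total: List[int] = [0] * vector_len
        self.seen: Set[int] = set()  # clients that already contributed

    def on_packet(self, client_id: int,
                  values: List[int]) -> Optional[List[int]]:
        # Ignore duplicates so retransmissions do not corrupt the sum.
        if client_id not in self.seen:
            self.seen.add(client_id)
            for i, value in enumerate(values):
                self.total[i] += value
        # The result is available only once all clients have been seen.
        if len(self.seen) == self.num_clients:
            return list(self.total)
        return None


agg = SyncAggregator(num_clients=3, vector_len=2)
first = agg.on_packet(0, [1, 2])
duplicate = agg.on_packet(0, [1, 2])
second = agg.on_packet(1, [10, 20])
final = agg.on_packet(2, [100, 200])
print(first, duplicate, second, final)
# prints: None None None [111, 222]
```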

   From a different perspective, we can observe that there are three
   basic communication modes depending on the applications, as shown in
   Figure 2.  From the client's perspective, the INC support is
   transparent: the client sends a message, such as an RPC, and if there
   is an on-path INC device, that device may execute the operation as an
   optimization.  If there is no such on-path INC device, the message is
   transmitted to a specified endpoint.  Depending on the actual network
   configuration, capabilities, and load situation, one of the following
   modes can be selected:

   Device Only Mode (DO):  the INC network devices alone can completely
      finish a computing task.  Therefore a client can choose to send a
      task to the INC network devices instead of a server and the final
      result is directly returned to the client from the INC network
      devices.

   Device+Server Mode (DS):  the INC network devices can only partially
      finish a computing task and the intermediate result still needs to
      be sent to a server to finalize.  The final result must be
      returned to the client from a server.

   Hybrid Mode (HM):  the INC network devices may or may not finish a
      computing task, therefore the final result may be returned by the
      INC network devices or by a server.

   Each mode has its dominant benefits: Using DO mainly aims to reduce
   the latency and using DS mainly aims to reduce the traffic bandwidth
   and server load.  Using HM may achieve both benefits, albeit with
   more implementation complexity.
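
   The Hybrid Mode can be sketched for an Individual Request
   application such as an in-network key-value cache.  In this minimal
   sketch, the device answers directly when it holds the requested key
   and otherwise forwards the request to the server.  The classes
   Server and HybridDevice are hypothetical and only illustrate the
   control flow; they do not describe NetCache or any other existing
   system.

```python
from typing import Dict, Tuple


class Server:
    """Hypothetical back-end server holding the full key-value store."""

    def __init__(self, store: Dict[str, str]) -> None:
        self.store = store
        self.requests_served = 0

    def handle(self, key: str) -> str:
        self.requests_served += 1
        return self.store[key]


class HybridDevice:
    """Hypothetical on-path device holding a small subset of the keys."""

    def __init__(self, server: Server, cache: Dict[str, str]) -> None:
        self.server = server
        self.cache = cache

    def handle(self, key: str) -> Tuple[str, str]:
        """Return (responder, value) for a read request."""
        if key in self.cache:
            # The device finishes the task itself, as in DO mode.
            return ("device", self.cache[key])
        # The device cannot finish the task; the server responds.
        return ("server", self.server.handle(key))


server = Server({"a": "1", "b": "2"})
device = HybridDevice(server, cache={"a": "1"})
hit = device.handle("a")
miss = device.handle("b")
print(hit, miss, server.requests_served)
# prints: ('device', '1') ('server', '2') 1
```

   Only the miss reaches the server, which reflects the combination of
   lower latency for requests the device can serve and reduced server
   load overall.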

                                  +-------+
               +------+         +-------+ |        +------+
               |      |         |network| |        |      |
               |client|<------->|devices| |        |server|
               |      |         |       |-+        |      |
               +--^---+         +-------+          +---^--+
                  |                                    |
                  +------------------------------------+
                              Device Only Mode (DO)

                                  +-------+
               +------+         +-------+ |        +------+
               |      |         |network| |        |      |
               |client+-------->|devices+-+------->|server|
               |      |         |       |-+        |      |
               +--^---+         +-------+          +--+---+
                  |                                   |
                  +-----------------------------------+
                             Device+Server Mode (DS)

                                  +-------+
               +------+         +-------+ |        +------+
               |      |         |network| |        |      |
               |client+-------->|devices+.........>|server|
               |      |<--------|       |-+        |      |
               +--^---+         +-------+          +--.---+
                  :                                   :
                  .....................................
                             Hybrid Mode (HM)

                Figure 2: In Network Computing Working Modes

   Figure 3 shows the dominant combinations of service model and
   communication model.  Since AC may require more resources than a
   network device can provide, it is less often used with the DO mode.
   IR usually aims to optimize the response latency, so the DS mode is
   less helpful, yet HM may provide a fallback mechanism for requests
   that the INC devices cannot satisfy.

                +-----------------------+-----+-----+-----+
                |                       | DO  | DS  | HM  |
                +-----------------------+-----+-----+-----+
                |Sync Collaboration(SC) |  x  |  x  |  x  |
                +-----------------------+-----+-----+-----+
                |Async Collaboration(AC)|     |  x  |     |
                +-----------------------+-----+-----+-----+
                |Individual Request(IR) |  x  |     |  x  |
                +-----------------------+-----+-----+-----+

              Figure 3: Service Model and Communication Model
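
   For implementers, the combinations in Figure 3 can be captured as a
   simple lookup structure.  The following sketch merely transcribes
   the figure; the names ServiceModel, Mode, and is_dominant are
   hypothetical.

```python
from enum import Enum


class ServiceModel(Enum):
    SC = "Synchronous Collaboration"
    AC = "Asynchronous Collaboration"
    IR = "Individual Request"


class Mode(Enum):
    DO = "Device Only"
    DS = "Device+Server"
    HM = "Hybrid"


# Transcription of the "x" marks in Figure 3.
DOMINANT = {
    ServiceModel.SC: {Mode.DO, Mode.DS, Mode.HM},
    ServiceModel.AC: {Mode.DS},
    ServiceModel.IR: {Mode.DO, Mode.HM},
}


def is_dominant(service: ServiceModel, mode: Mode) -> bool:
    """Return True if Figure 3 marks this combination."""
    return mode in DOMINANT[service]


print(is_dominant(ServiceModel.AC, Mode.DO))  # prints: False
print(is_dominant(ServiceModel.IR, Mode.HM))  # prints: True
```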

3.  Existing Transport Protocols

   We argue that the existing transport protocols are not suitable for
   INC.

   TCP:  As the most widely used transport protocol, TCP (as well as its
      variants such as DCTCP and MPTCP) is ruled out because of its
      end-to-end streaming semantics.  Any mutation of the TCP packet
      payload is considered a break in the stream, but the INC
      applications that require network device collaboration do need to
      modify the packet payload.  Also, any dropped packet in a TCP
      stream sensed by the receiver must be retransmitted; this rules
      out INC applications in which a device terminates a packet and
      returns the computing result directly.  While it is theoretically
      possible to make the network device maintain two separate TCP
      connections with the two communicating end hosts, the
      implementation cost is prohibitively high.  Due to its handshake
      overhead and its longer startup times, TCP is also not a good
      protocol for high-performance RPC communication [davie].  More
      issues with TCP in the data center are discussed in [homa].

   UDP:  As another common transport protocol, UDP is unreliable and
      lacks mechanisms for flow control.  Some previous INC applications
      assume the use of UDP as the transport layer for simplicity, but
      this provisional measure cannot meet production-level requirements
      or provide enough transport layer support for all the concerned
      INC applications.  While these features could be implemented on
      top of UDP, this would shift complexity to applications and INC
      implementations.

   QUIC:  In general, QUIC provides a better platform for efficient RPC
      communication than TCP [davie].  However, it is designed for wide
      area networks, and part of the packet header and the payload are
      encrypted, which prevents network devices from processing
      application layer data and, potentially, from adding metadata.

   MTP:  MTP [mtp] is the first transport protocol dedicated to INC.
      It captures some core requirements for INC and is open to
      different congestion control algorithms.  However, it is inspired
      by pathlet routing and mainly focuses on pathlet-based congestion
      control support.  It lacks efficient support for all the
      application types mentioned above.

   RDMA:  RDMA allows two end hosts to exchange data quickly.  Whether
      natively supported (i.e., InfiniBand) or carried over UDP or TCP,
      it requires in-order and immutable transport, which poses
      challenges for INC applications similar to those of TCP.

   HOMA:  HOMA [homa] is proposed to be a transport protocol in data
      center to replace TCP.  However, HOMA is not designed with INC in
      mind either.

   Information-Centric Networking (ICN):  ICN provides a
      receiver-driven, data-oriented communication service and has
      features such as address-less operation due to the named-data
      access principle.  It also provides intrinsic multi-destination
      delivery and has been demonstrated in remote method invocation and
      distributed computing scenarios [icndiscomp], albeit not yet in
      the particular INC scenarios presented here.

   Ad Hoc Protocols:  Several INC applications (e.g., ATP and ASK)
      provide a customized transport layer.  However, each of these
      protocols only works for a particular application.  Moreover,
      there is no clear separation between the transport layer and the
      application layer: some application layer functions leak into the
      transport layer, further limiting their generality.

4.  Requirements

   The premise of the E2E principle is that it is more costly to
   guarantee the required level of reliability by relying on the
   network than by relying on the end hosts.  INC introduces multiple
   endpoints in the communication, one of which resides in the network,
   effectively changing the communication paradigm from E2E to E2I2E
   (where I stands for the intermediate nodes that perform transport
   layer functions).  Therefore, we need to revisit the E2E principle
   to see if we can break it or adapt to it in the new context.  We can
   observe several properties of the covered INC applications.

   *  In principle, INC protocols should run over existing networks, and
      not make any assumptions on the type of environment they are used
      in, such as data center or access network.  However, for
      performance reasons, some optimizations may be needed that would
      limit the deployment to such specific domains.

   *  When deployed in a data center for use cases such as DML, an INC
      system needs to provide High-Performance-Computing (HPC) levels of
      performance.  In such communication scenarios, exact timing and
      scheduling may be required.

   *  Multiple applications with the same or different service models,
      or multiple jobs for the same applications can be active at the
      same time.

   *  INC should be seen as an optional performance enhancement that can
      be added to a network if needed, but the overall system should
      still work without such INC systems in the network.

   Based on these observations, a new transport layer protocol for INC
   in support of RPC-based applications can be designed.  The protocol
   only works in a limited domain, and it virtualizes the network as a
   single logical middle point.  That is, if multiple network devices
   collaborate on a computing task, they are considered one device.
   Packet forwarding among these devices needs to be handled by the
   network layer using techniques such as Segment Routing (SR) and
   Service Function Chaining (SFC), depending on the overall system
   design.

   From the previous discussion, we lay out the design requirements of a
   transport protocol dedicated for INC:

   Simplicity:  Due to the limited resources and capabilities of the
      programmable network devices, the transport layer functions in
      them cannot be complex.  For example, per-flow state machines and
      congestion control algorithms are difficult to implement in the
      programmable network devices.  The protocol should aim to leave
      the complexity to the end hosts and require only simple processing
      in the programmable network devices.

   Generality:  The different service models and communication models
      should all be supported.  The protocol should also be independent
      of the underlying network layer protocol.

   Openness:  Since the performance requirements of the applications may
      vary, the flow control and reliability mechanism of the protocol
      should be open to different algorithms.

   Compatibility:  The protocol should be able to coexist with the other
      transport protocols.
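
   The Openness requirement can be illustrated with a minimal sketch
   in which the end host's sending window is governed by a pluggable
   policy.  The interface WindowPolicy and the classes FixedWindow,
   AimdWindow, and Sender are hypothetical; the two policies are
   deliberately simplistic stand-ins and are not algorithms proposed by
   this draft.  In line with the Simplicity requirement, all of this
   logic runs on the end host.

```python
from typing import List, Protocol


class WindowPolicy(Protocol):
    """Hypothetical interface for a pluggable flow control algorithm."""

    def on_ack(self, window: int) -> int: ...

    def on_loss(self, window: int) -> int: ...


class FixedWindow:
    """Keeps the window constant regardless of feedback."""

    def on_ack(self, window: int) -> int:
        return window

    def on_loss(self, window: int) -> int:
        return window


class AimdWindow:
    """Additive increase on acknowledgment, halving on loss."""

    def on_ack(self, window: int) -> int:
        return window + 1

    def on_loss(self, window: int) -> int:
        return max(1, window // 2)


class Sender:
    """End-host sender whose window is driven by the chosen policy."""

    def __init__(self, policy: WindowPolicy, window: int = 4) -> None:
        self.policy = policy
        self.window = window

    def feedback(self, acked: bool) -> int:
        if acked:
            self.window = self.policy.on_ack(self.window)
        else:
            self.window = self.policy.on_loss(self.window)
        return self.window


events: List[bool] = [True, True, False, True]
fixed = Sender(FixedWindow())
aimd = Sender(AimdWindow())
fixed_trace = [fixed.feedback(e) for e in events]
aimd_trace = [aimd.feedback(e) for e in events]
print(fixed_trace, aimd_trace)
# prints: [4, 4, 4, 4] [5, 6, 3, 4]
```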

5.  IANA Considerations

   This document includes no request to IANA.

6.  Security Considerations

   tbd

7.  References

7.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

7.2.  Informative References

   [davie]    Davie, B., "QUIC is not a TCP Replacement",
              https://systemsapproach.substack.com/p/quic-is-not-a-tcp-
              replacement, 26 September 2022.

   [homa]     Ousterhout, J., "It's Time to Replace TCP in the
              Datacenter", 2023,
              <http://dx.doi.org/10.48550/arXiv.2210.00714>.

   [I-D.yao-tsvwg-cco-problem-statement-and-usecases]
              Yao, K., Shiping, X., Li, Y., Huang, H., and D. KUTSCHER,
              "Collective Communication Optimization: Problem Statement
              and Use cases", Work in Progress, Internet-Draft, draft-
              yao-tsvwg-cco-problem-statement-and-usecases-00, 23
              October 2023, <https://datatracker.ietf.org/doc/html/
              draft-yao-tsvwg-cco-problem-statement-and-usecases-00>.

   [icndiscomp]
              Geng, W., Zhang, Y., Kutscher, D., Kumar, A., Tarkoma, S.,
              and P. Hui, "SoK: Distributed Computing in ICN", In
              Proceedings of the 10th ACM Conference on Information-
              Centric Networking (ACM ICN '23). Association for
              Computing Machinery, New York, NY, USA, 88-100.
              https://doi.org/10.1145/3623565.3623712, 2023.

   [inc]      Klenk, B., et al., "An In-Network Architecture for
              Accelerating Shared-Memory Multiprocessor Collectives",
              ACM/IEEE 47th Annual International Symposium on Computer
              Architecture (ISCA), 2020,
              <https://dx.doi.org/10.1109/ISCA45697.2020.00085>.

   [mtp]      Stephens, B., Grassi, D., Almasi, H., Ji, T., Vamanan, B.,
              and A. Akella, "TCP is Harmful to In-Network Computing:
              Designing a Message Transport Protocol (MTP)", 2021,
              <http://dx.doi.org/10.1145/3484266.3487382>.

   [netcache] Jin, X., Li, X., Zhang, H., Soule, R., Lee, J., Foster,
              N., Kim, C., and I. Stoica, "NetCache: Balancing Key-Value
              Stores with Fast In-Network Caching", In Proceedings of
              the 26th Symposium on Operating Systems Principles (SOSP
              '17). Association for Computing Machinery, New York, NY,
              USA, 121-136. https://doi.org/10.1145/3132747.3132764,
              2017.

   [netclone] Kim, G., "NetClone: Fast, Scalable, and Dynamic Request
              Cloning for Microsecond-Scale RPCs", In Proceedings of the
              ACM SIGCOMM 2023 Conference (ACM SIGCOMM '23). Association
              for Computing Machinery, New York, NY, USA, 195-207, 2023,
              <https://dl.acm.org/doi/10.1145/3603269.3604820>.

Authors' Addresses

   Haoyu Song
   Futurewei Technologies
   Santa Clara, CA
   United States of America
   Email: haoyu.song@futurewei.com

   Wenfei Wu
   Peking University
   Beijing
   China
   Email: wenfeiwu@pku.edu.cn

   Dirk Kutscher
   The Hong Kong University of Science and Technology (Guangzhou)
   Guangzhou
   China
   Email: ietf@dkutscher.net
