Network Working Group                                          N. Elkins
Internet-Draft                                     Inside Products, Inc.
Intended status: Informational                                 C. Sharma
Expires: 23 August 2024                                         A. Umesh
                                                                    B. V
                                                         M. P. Tahiliani
                                                          NITK Surathkal
                                                        20 February 2024

      Implementation and Performance Evaluation of PDM using eBPF
                   draft-elkins-v6ops-bpf-pdm-ebpf-00

Abstract

   RFC 8250 describes an optional Destination Option (DO) header
   embedded in each packet to provide sequence numbers and timing
   information as a basis for measurements.  As a kernel implementation
   can be complex and time-consuming, this document describes the
   implementation of the Performance and Diagnostic Metrics (PDM)
   extension header using eBPF in the Linux kernel's Traffic Control
   (TC) subsystem.  The document also provides a performance analysis
   of the eBPF implementation in comparison to the traditional kernel
   implementation.

About This Document

   This note is to be removed before publishing as an RFC.

   The latest revision of this draft can be found at
   https://ChinmayaSharma-hue.github.io/pdm-ebpf-draft/draft-elkins-
   ebpf-pdm-ebpf.html.  Status information for this document may be
   found at https://datatracker.ietf.org/doc/draft-elkins-v6ops-bpf-pdm-
   ebpf/.

   Source for this draft and an issue tracker can be found at
   https://github.com/ChinmayaSharma-hue/pdm-ebpf-draft.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 23 August 2024.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Background
       1.1.1.  PDM
       1.1.2.  eBPF
   2.  Using tc-bpf to add IPv6 extension headers
     2.1.  tc-bpf
     2.2.  Adding IPv6 extension headers in tc
       2.2.1.  Ingress tc-bpf program
       2.2.2.  Egress tc-bpf program
   3.  Implementation of PDM extension header in tc-bpf
     3.1.  Egress tc-bpf program for PDM
     3.2.  Ingress tc-bpf program for PDM
     3.3.  Implementation of PDM initiation
     3.4.  Implementation of PDM termination
   4.  Advantages of using eBPF to add extension headers
   5.  Performance Analysis
     5.1.  Experiment Setup
     5.2.  CPU Performance
       5.2.1.  CPU Usage in cycles
       5.2.2.  CPU usage as a percentage of total CPU cycles
     5.3.  Memory Usage
     5.4.  Network Throughput
     5.5.  Packet Processing Latency
   6.  Security Considerations
   7.  IANA Considerations
   8.  Normative References
   Acknowledgments
   Authors' Addresses

1.  Introduction

1.1.  Background

1.1.1.  PDM

   The Performance and Diagnostic Metrics (PDM) Extension Header,
   defined in [RFC8250], introduces a method to distinguish server
   processing delays from round-trip network delays within IPv6
   networks.  This extension is a type of Destination Options header, a
   component of the IPv6 protocol.

   The PDM header incorporates several fields, notably Packet Sequence
   Number This Packet (PSNTP), Packet Sequence Number Last Received
   (PSNLR), Delta Time Last Received (DTLR), Delta Time Last Sent
   (DTLS), and scaling factors for these delta times.  These elements,
   when correlated with a unique 5-tuple identifier, facilitate the
   precise measurement of network and server delays.  The PDM header's
   utility lies in its ability to provide concrete data on network and
   server performance.  By differentiating between the delays caused by
   network round trips and server processing, it enables quick
   identification of performance bottlenecks.

   Implementations of the PDM header must keep track of sequence numbers
   and timestamps for both incoming and outgoing packets, associated
   with each 5-tuple.  The header's design emphasizes flexibility in its
   activation, accuracy in timestamp recording, and configurable
   parameters for information lifespan and memory allocation as detailed
   in Section 3.5 of RFC 8250.
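
   As an illustration of the fields above, the following is a minimal C
   sketch of the 12-byte PDM option layout; it is not taken from an
   existing implementation, and the field names are illustrative.
   Multi-byte fields are carried in network byte order.

   #include <linux/types.h>

   struct pdm_option {
       __u8   opt_type;     /* PDM option type (0x0F)                 */
       __u8   opt_len;      /* option data length: 10                 */
       __u8   scale_dtlr;   /* scale factor for DTLR                  */
       __u8   scale_dtls;   /* scale factor for DTLS                  */
       __be16 psn_tp;       /* Packet Sequence Number This Packet     */
       __be16 psn_lr;       /* Packet Sequence Number Last Received   */
       __be16 delta_tlr;    /* Delta Time Last Received (scaled)      */
       __be16 delta_tls;    /* Delta Time Last Sent (scaled)          */
   } __attribute__((packed));   /* 12 bytes in total */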

1.1.2.  eBPF

   eBPF, an extensible programming framework within the Linux kernel,
   operates as a virtual machine allowing users to run isolated programs
   in kernel space, thereby customizing network processing, monitoring,
   and security without needing kernel recompilation.  These user-
   defined programs are first compiled into eBPF bytecode, followed by a
   verification process that assures termination and checks for
   potential errors such as invalid pointers or array bounds, adding an
   extra layer of security.  Due to their optimized bytecode, eBPF
   programs run efficiently within the kernel's virtual machine. eBPF
   offers various hook points within the kernel, such as in the
   networking stack, enabling users to attach their programs based on
   specific requirements, like network monitoring or packet
   modification.  This flexibility allows for a tailored kernel behavior
   to suit different use cases, enhancing the system's functionality and
   security.

2.  Using tc-bpf to add IPv6 extension headers

2.1.  tc-bpf

   The cls_bpf component within tc is a classifier that uses BPF,
   including both classic BPF (cBPF) and extended BPF (eBPF), for packet
   filtering and classification. eBPF can be used to directly perform
   actions on the socket buffer (skb), such as packet mangling or
   updating checksums.  One of the features of the cls_bpf classifier is
   its ability to facilitate efficient, non-linear classification.  Unlike
   traditional tc classifiers that may require multiple parsing passes
   (one each per classifier), cls_bpf, with the help of eBPF, can tailor
   a single program for diverse skb types, avoiding redundant parsing.

   cls_bpf operates in two distinct modes: the original mode, which
   calls into the full tc action engine via tcf_exts_exec, and a more
   efficient 'direct action' (da) mode, which returns immediately after
   the BPF program runs.  The da mode allows cls_bpf to simply return a
   tc opcode and perform tc actions without traversing multiple layers
   in the tc action engine.

   In direct-action (da) mode, the eBPF program can store a class
   identifier (classid) in skb->tc_classid and return the action
   opcode, which is suitable even for simple cBPF operations like drop
   actions.  cls_bpf's flexibility also allows administrators to use
   multiple classifiers in mixed modes (da and non-da) based on
   specific use cases.  However, for high-performance workloads, a
   single tc eBPF cls_bpf classifier in da mode is generally sufficient
   and recommended due to its efficiency.
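
   For illustration, the sketch below shows one way to attach such a
   classifier from user space with libbpf's TC APIs, which attach
   cls_bpf in direct-action mode; the object file name, program name,
   and interface name are placeholders, not part of this draft.

   #include <bpf/libbpf.h>
   #include <errno.h>
   #include <net/if.h>

   int main(void)
   {
       struct bpf_object *obj = bpf_object__open_file("tc_prog.bpf.o",
                                                      NULL);
       if (!obj || bpf_object__load(obj))
           return 1;

       struct bpf_program *prog =
           bpf_object__find_program_by_name(obj, "tc_prog");
       if (!prog)
           return 1;

       LIBBPF_OPTS(bpf_tc_hook, hook,
                   .ifindex = if_nametoindex("eth0"),
                   .attach_point = BPF_TC_EGRESS);
       LIBBPF_OPTS(bpf_tc_opts, opts,
                   .prog_fd = bpf_program__fd(prog));

       /* Create the clsact qdisc if it does not exist yet. */
       int err = bpf_tc_hook_create(&hook);
       if (err && err != -EEXIST)
           return 1;

       /* Attach the program as a cls_bpf classifier at egress. */
       return bpf_tc_attach(&hook, &opts) ? 1 : 0;
   }

   The same attachment can also be performed with the tc command-line
   tool by passing the "da" flag when adding the bpf filter.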

2.2.  Adding IPv6 extension headers in tc

   Adding an extension header to the packet requires creating space for
   the header followed by inserting the data and padding.  This task
   utilizes eBPF helper functions specific to packet manipulation with
   skb, such as bpf_skb_adjust_room for creating space,
   bpf_skb_load_bytes for loading data from skb, and bpf_skb_store_bytes
   for storing bytes in the adjusted skb.

   The tc-bpf hook point caters to both ingress and egress traffic,
   which is vital in scenarios where measurements in ingress are needed
   or when packet data in ingress is used for calculating extension
   headers in egress.

   The traffic control subsystem is located in the lower levels of the
   network stack, which implies minimal packet processing after this
   stage.  Adding an extension header after the packet is fully formed
   can result in the packet exceeding the Maximum Transmission Unit
   (MTU), leading to potential packet drops.  It is important to check
   that the packet size does not exceed the MTU once the extension
   header is added.  The packet size can be checked against the MTU of
   the net device (identified by its ifindex) using the bpf_check_mtu
   helper function.
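
   The fragment below is a minimal, self-contained sketch (not the
   draft's code) of this check on egress: it verifies that growing the
   packet by 16 bytes stays within the device MTU before reserving room
   for a destination options header.

   #include <linux/bpf.h>
   #include <linux/pkt_cls.h>
   #include <bpf/bpf_helpers.h>

   #define PDM_ROOM 16   /* dest-opts header + option + padding */

   SEC("tc")
   int egress_room(struct __sk_buff *skb)
   {
       __u32 mtu_len = 0;

       /* Would adding PDM_ROOM bytes exceed the device MTU? */
       if (bpf_check_mtu(skb, skb->ifindex, &mtu_len, PDM_ROOM, 0))
           return TC_ACT_OK;      /* yes: pass the packet unchanged */

       /* Reserve space directly after the fixed IPv6 header. */
       if (bpf_skb_adjust_room(skb, PDM_ROOM, BPF_ADJ_ROOM_NET, 0))
           return TC_ACT_OK;

       /* ... build and store the extension header here ... */
       return TC_ACT_OK;
   }

   char _license[] SEC("license") = "GPL";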

   tc-bpf programs can also utilize the bpf_redirect helper to redirect
   packets to the ingress or egress TC hook points of any interface in
   the host, useful for routing purposes.  An additional benefit of
   using TC or any other eBPF hook point is the simplicity in exporting
   data received in extension headers for logging and monitoring.  This
   is facilitated through eBPF maps, accessible from both kernel and
   user space.  BPF maps like BPF_MAP_TYPE_PERF_EVENT_ARRAY and
   BPF_MAP_TYPE_RINGBUF are used for streaming real-time data from the
   extension headers, providing precise control over poll/epoll
   notifications to userspace about new data in the buffers.
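
   The following is a sketch (with an assumed event layout and map
   name) of exporting data from a received extension header to user
   space through a ring buffer.

   #include <linux/bpf.h>
   #include <bpf/bpf_helpers.h>

   struct pdm_event {
       __u64 rcv_time_ns;            /* when the packet was seen      */
       __u16 psn_this_pkt;           /* PSNTP from the received PDM   */
       __u16 psn_last_recv;          /* PSNLR from the received PDM   */
   };

   struct {
       __uint(type, BPF_MAP_TYPE_RINGBUF);
       __uint(max_entries, 1 << 20); /* 1 MiB buffer (assumption)     */
   } pdm_events SEC(".maps");

   static __always_inline void export_event(__u64 now, __u16 psntp,
                                            __u16 psnlr)
   {
       struct pdm_event ev = {
           .rcv_time_ns   = now,
           .psn_this_pkt  = psntp,
           .psn_last_recv = psnlr,
       };
       /* Wakes up user-space consumers polling the ring buffer. */
       bpf_ringbuf_output(&pdm_events, &ev, sizeof(ev), 0);
   }

   A user-space consumer created with libbpf's ring_buffer__new() can
   then poll the buffer and log each event.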

2.2.1.  Ingress tc-bpf program

   A BPF program can be attached to the ingress of the clsact qdisc for
   a specific network interface.  This program executes for every packet
   received on this interface.  The purpose of attaching a BPF program
   at the ingress is to conduct specific measurements necessary for
   calculating certain fields in the extension header.  Should the need
   arise to categorize information from incoming packets based on the
   5-tuple, a hashmap BPF map can be employed.  The ability to access
   BPF maps across different eBPF programs is beneficial, particularly
   for utilizing data recorded in the ingress BPF program within the
   egress BPF program.

   It's possible to define actions at ingress based on data from
   incoming packets in direct action mode.  For instance, the ingress
   BPF program might decide to drop a packet based on its received
   extension header, returning TC_ACT_SHOT, or to forward the packet by
   returning TC_ACT_OK.  Additional actions in the classifier-action
   subsystem, like TC_ACT_REDIRECT, are available for use with
   bpf_redirect and other relevant functions.

2.2.2.  Egress tc-bpf program

   A BPF program is attachable to the egress point of the clsact qdisc
   designated for a specific network interface, functioning for every
   packet exiting this interface.  The role of this egress BPF program
   includes preparing space for the extension header in the skb,
   assembling the extension header tailored for the particular outbound
   packet, and appending the extension header to the packet.

   In cases where the extension header is stateless, an egress BPF
   program alone might be adequate, as no flow-related measurements are
   required.  The data to be integrated into the extension header solely
   depends on the current outgoing packet.  If the extension header
   fields depend on the data from incoming packets or previously sent
   packets, utilizing BPF maps becomes necessary to store and
   subsequently utilize this data for computing specific fields in the
   extension headers.

   The egress BPF program also has access to a similar set of actions.
   For instance, if a packet is discovered to be malformed, the program
   has the capacity to drop the packet using TC_ACT_SHOT before it is
   transmitted.  Successful addition of the extension header
   necessitates the return of TC_ACT_OK, propelling the packet to the
   subsequent phase in the network stack.

   As noted in Section 2.2, if the data received in the extension
   headers is of interest for logging and monitoring, it can easily be
   exported through eBPF maps, which are accessible from both kernel
   space and user space; BPF maps of types
   BPF_MAP_TYPE_PERF_EVENT_ARRAY and BPF_MAP_TYPE_RINGBUF can stream
   this data to user-space consumers.

3.  Implementation of PDM extension header in tc-bpf

   PDM is implemented using both ingress and egress tc-bpf programs.
   The ingress program is chiefly responsible for interpreting incoming
   packets that carry the PDM extension header and recording the time
   at which these packets are received.  The egress program appends the
   extension header, using the ingress timestamp to compute the time
   elapsed since the last packet in the same flow was received and
   sent.  These timestamps are shared between the two programs via a
   BPF hash map (the BPF_MAP_TYPE_LRU_HASH variant; see Section 5.3).
   The map key is the 5-tuple of the flow: the IPv6 source and
   destination addresses, the TCP/UDP source and destination ports, and
   the transport layer protocol.  For ICMP packets, the source and
   destination ports are set to zero.
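
   The following is a minimal sketch (not the authors' code) of this
   shared flow-state map; the field names are illustrative, and the map
   type and entry limit follow Section 5.3.

   #include <linux/bpf.h>
   #include <linux/in6.h>
   #include <bpf/bpf_helpers.h>

   struct flow_key {
       struct in6_addr saddr, daddr; /* IPv6 source and destination   */
       __u16 sport, dport;           /* zero for ICMP packets         */
       __u8  proto;                  /* transport layer protocol      */
   };

   struct flow_state {
       __u16 psn_last_sent;          /* PSNLS                         */
       __u16 psn_last_recv;          /* PSNLR                         */
       __u64 time_last_sent;         /* TLS, from bpf_ktime_get_ns()  */
       __u64 time_last_recv;         /* TLR, from bpf_ktime_get_ns()  */
   };

   struct {
       __uint(type, BPF_MAP_TYPE_LRU_HASH);
       __uint(max_entries, 65536);
       __type(key, struct flow_key);
       __type(value, struct flow_state);
   } pdm_flows SEC(".maps");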

3.1.  Egress tc-bpf program for PDM

   The egress eBPF program should first conduct essential validations on
   the sizes of the ethernet and IP headers, and ascertain whether the
   packet in question is IPv6.  Should the packet be non-IPv6, it
   returns with the action TC_ACT_OK and the packet proceeds unaltered.

   The program subsequently examines if the packet's next header field
   indicates the presence of an extension header.  In instances where
   any form of extension header exists, the addition of PDM is withheld.
   This restraint stems from the complexity involved in integrating an
   extension header, requiring the parsing of existing ones and
   accurately positioning the PDM header.  The challenge is compounded
   by the limitation of bpf_skb_adjust_room, which permits augmenting
   the packet size only subsequent to the fixed-length IPv6 header, thus
   necessitating a reorganization of the other extension headers within
   the eBPF program.

   The egress eBPF program extracts the IPv6 source and destination
   addresses, and in cases involving TCP/UDP, it also parses the source
   and destination ports from the transport layer.  This data is used in
   the formulation of a 5-tuple key utilized for accessing the eBPF Map.
   The program retrieves timestamps and packet sequence number of the
   last received packet and last sent packet from the eBPF map.

   The extension header fields are then computed using the current
   timestamp, acquired through bpf_ktime_get_ns.  This current timestamp
   is then stored back in the eBPF map in the Time Last Sent field,
   for future reference.  The Delta Time Last Received (DTLR) field is
   calculated by determining the difference between the Time Last Sent
   and Time Last Received of the latest entry.  The Delta Time Last Sent
   (DTLS) is computed as the difference between the Time Last Received
   of the latest entry and the Time Last Sent of the preceding entry.

   The Packet Sequence Number This Packet (PSNTP) is calculated by
   incrementing the sequence number of the last sent packet.  The Packet
   Sequence Number Last Received (PSNLR) is taken directly from the map.
   These methodologies are in accordance with Section 3.2.1 of RFC 8250.

   Given that PDM is categorized as a destination options extension
   header, the next header is set accordingly.  The space requirement
   for storing PDM stands at 12 bytes, with an additional 2 bytes for
   the destination options header and another 2 bytes for padding.
   Following the execution of bpf_skb_adjust_room to augment the skb
   size by 16 bytes, the program employs bpf_skb_store_bytes to store
   the structured destination options header and the PDM header.  Upon
   successful insertion of the header, the egress BPF program finishes
   its operation by returning TC_ACT_OK.
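
   The following condensed sketch illustrates this egress logic.  It is
   not the authors' code: it reuses struct pdm_option (Section 1.1.1)
   and the flow_key, flow_state, and pdm_flows definitions (Section 3),
   omits TCP/UDP port parsing and the delta-time scaling, and uses
   illustrative names throughout.

   #include <linux/bpf.h>
   #include <linux/if_ether.h>
   #include <linux/ipv6.h>
   #include <linux/in.h>
   #include <linux/pkt_cls.h>
   #include <bpf/bpf_helpers.h>
   #include <bpf/bpf_endian.h>

   #define NEXTHDR_DEST   60          /* destination options header   */
   #define NEXTHDR_ICMPV6 58

   struct pdm_dstopt {                /* 2 + 12 + 2 = 16 bytes        */
       __u8 nexthdr;                  /* original transport protocol  */
       __u8 hdrlen;                   /* 1, i.e. (16 / 8) - 1         */
       struct pdm_option opt;         /* see the Section 1.1.1 sketch */
       __u8 pad_type, pad_len;        /* 2-byte PadN option           */
   } __attribute__((packed));

   SEC("tc")
   int pdm_egress(struct __sk_buff *skb)
   {
       void *data = (void *)(long)skb->data;
       void *data_end = (void *)(long)skb->data_end;
       struct ethhdr *eth = data;

       if ((void *)(eth + 1) > data_end ||
           eth->h_proto != bpf_htons(ETH_P_IPV6))
           return TC_ACT_OK;
       struct ipv6hdr *ip6 = (void *)(eth + 1);
       if ((void *)(ip6 + 1) > data_end)
           return TC_ACT_OK;
       /* Add PDM only when no other extension header is present. */
       if (ip6->nexthdr != IPPROTO_TCP && ip6->nexthdr != IPPROTO_UDP &&
           ip6->nexthdr != NEXTHDR_ICMPV6)
           return TC_ACT_OK;

       struct flow_key key = { .saddr = ip6->saddr, .daddr = ip6->daddr,
                               .proto = ip6->nexthdr };

       __u64 now = bpf_ktime_get_ns();
       struct pdm_dstopt h = {
           .nexthdr = ip6->nexthdr, .hdrlen = 1,
           .opt = { .opt_type = 0x0F, .opt_len = 10 },
           .pad_type = 0x01, .pad_len = 0,
       };

       struct flow_state *st = bpf_map_lookup_elem(&pdm_flows, &key);
       if (st) {
           h.opt.psn_tp = bpf_htons(st->psn_last_sent + 1);
           h.opt.psn_lr = bpf_htons(st->psn_last_recv);
           /* Truncation stands in for the RFC 8250 scaling. */
           h.opt.delta_tlr = bpf_htons((__u16)(now - st->time_last_recv));
           h.opt.delta_tls = bpf_htons((__u16)(st->time_last_recv -
                                               st->time_last_sent));
           st->psn_last_sent += 1;
           st->time_last_sent = now;
       }
       /* A missing entry would be created here; see Section 3.3. */

       __u16 new_plen = bpf_htons(bpf_ntohs(ip6->payload_len) +
                                  sizeof(h));
       __u8  dstopts  = NEXTHDR_DEST;

       /* Grow the packet directly after the fixed IPv6 header. */
       if (bpf_skb_adjust_room(skb, sizeof(h), BPF_ADJ_ROOM_NET, 0))
           return TC_ACT_OK;

       /* Fix up the IPv6 header, then write the new header. */
       bpf_skb_store_bytes(skb,
                           ETH_HLEN + offsetof(struct ipv6hdr, payload_len),
                           &new_plen, sizeof(new_plen), 0);
       bpf_skb_store_bytes(skb,
                           ETH_HLEN + offsetof(struct ipv6hdr, nexthdr),
                           &dstopts, sizeof(dstopts), 0);
       bpf_skb_store_bytes(skb, ETH_HLEN + sizeof(struct ipv6hdr),
                           &h, sizeof(h), 0);
       return TC_ACT_OK;
   }

   char _license[] SEC("license") = "GPL";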

3.2.  Ingress tc-bpf program for PDM

   The ingress eBPF program should first conduct essential validations
   on the sizes of the ethernet and IP headers, and ascertain whether
   the packet in question is IPv6.  Should the packet be non-IPv6, it
   returns with the action TC_ACT_OK and the packet proceeds unaltered.
   It also checks if the packet has a destination options header and if
   it does, it checks if the header is a PDM header.

   The calculation of the fields "Delta Time Last Sent" and "Delta Time
   Last Received," along with their respective scaling factors, is
   contingent on the "Time Last Received" field located in the BPF map,
   pertaining to the relevant 5-tuple.  The ingress BPF program is
   responsible for capturing the timestamp when a packet, corresponding
   to a specific 5-tuple, is received.  This capture is executed using
   the function bpf_ktime_get_ns, and the result is subsequently stored
   in the map.

   The "Packet Sequence Number Last Received" used for outgoing packets
   on egress is taken from the "Packet Sequence Number This Packet"
   field in the PDM header of the received packet.  After both of these
   values have been stored in the BPF map, the ingress BPF program
   finishes its operation by returning TC_ACT_OK.
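
   A condensed ingress sketch follows; it reuses the definitions from
   the egress sketch above (including struct pdm_dstopt and the
   pdm_flows map), assumes PDM is the first option in the destination
   options header, and builds the lookup key with source and
   destination swapped so that it matches the entry used on egress for
   the same flow.  These are assumptions for illustration, not the
   authors' code.

   SEC("tc")
   int pdm_ingress(struct __sk_buff *skb)
   {
       void *data = (void *)(long)skb->data;
       void *data_end = (void *)(long)skb->data_end;
       struct ethhdr *eth = data;

       if ((void *)(eth + 1) > data_end ||
           eth->h_proto != bpf_htons(ETH_P_IPV6))
           return TC_ACT_OK;
       struct ipv6hdr *ip6 = (void *)(eth + 1);
       if ((void *)(ip6 + 1) > data_end || ip6->nexthdr != NEXTHDR_DEST)
           return TC_ACT_OK;

       struct pdm_dstopt *dst = (void *)(ip6 + 1);
       if ((void *)(dst + 1) > data_end || dst->opt.opt_type != 0x0F)
           return TC_ACT_OK;

       struct flow_key key = { .saddr = ip6->daddr, .daddr = ip6->saddr,
                               .proto = dst->nexthdr };

       struct flow_state *st = bpf_map_lookup_elem(&pdm_flows, &key);
       if (st) {
           /* Record the reception time and the sender's PSNTP. */
           st->time_last_recv = bpf_ktime_get_ns();
           st->psn_last_recv  = bpf_ntohs(dst->opt.psn_tp);
       }
       return TC_ACT_OK;
   }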

3.3.  Implementation of PDM initiation

   The process of adding Performance and Diagnostic Metrics (PDM)
   involves verifying the existence of an entry for the corresponding
   5-tuple within the BPF map.  If no such entry exists, the program
   initiates PDM for this flow by creating a new entry.  This action is
   prompted each time an IPv6 packet is either received or transmitted.

   The structure of the entries in the BPF map consists of the 5-tuple
   serving as the key and the value encompassing various elements such
   as the Packet Sequence Number Last Sent (PSNLS), Packet Sequence
   Number Last Received (PSNLR), Time Last Received (TLR), and Time Last
   Sent (TLS).

   During the initial phase, the Packet Sequence Number Last Sent
   (PSNLS) is assigned a random value, achieved through the use of the
   helper function bpf_get_prandom_u32, which generates a random 32-bit
   integer.  Additionally, for the first packet, the Packet Sequence
   Number Last Received (PSNLR) and Time Last Received (TLR) are set to
   zero, as the ingress BPF program has not yet been executed for the
   specific 5-tuple.
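
   A small sketch of this initialization, reusing the definitions from
   the earlier sketches, is shown below; the helper name is
   illustrative.

   static __always_inline void pdm_init_flow(struct flow_key *key)
   {
       struct flow_state st = {
           /* Random initial sequence number, as described above. */
           .psn_last_sent  = (__u16)bpf_get_prandom_u32(),
           .psn_last_recv  = 0,
           .time_last_sent = 0,
           .time_last_recv = 0,
       };
       /* BPF_NOEXIST keeps an existing entry from being overwritten. */
       bpf_map_update_elem(&pdm_flows, key, &st, BPF_NOEXIST);
   }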

3.4.  Implementation of PDM termination

   Stale entries corresponding to a flow are to be removed after a
   certain amount of time; otherwise, a new flow with the same 5-tuple
   could pick up stale data stored for that 5-tuple long ago.  This
   should be done through a configurable maximum lifetime for the
   entries.

   One way to remove stale entries is to poll the map periodically for
   entries that have not been updated within the configured period,
   which identifies them as stale.  This can be done from a userspace
   program, as BPF maps are accessible from both kernel space and user
   space.  All entries in the map are checked, and stale entries are
   removed using bpf_map_delete_elem.
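
   The sketch below illustrates such a userspace cleanup pass using
   libbpf; the pin path, the maximum age, and the mirrored flow_key and
   flow_state layouts are assumptions, not part of this draft.

   #include <bpf/bpf.h>
   #include <netinet/in.h>
   #include <stdint.h>
   #include <time.h>

   #define MAX_AGE_NS (60ULL * 1000000000ULL)   /* example: 60 s */

   struct flow_key {                  /* mirrors the BPF-side layout  */
       struct in6_addr saddr, daddr;
       uint16_t sport, dport;
       uint8_t  proto;
   };
   struct flow_state {
       uint16_t psn_last_sent, psn_last_recv;
       uint64_t time_last_sent, time_last_recv;
   };

   static void expire_stale(int map_fd)
   {
       struct flow_key cur, next, stale[1024];
       struct flow_state st;
       struct timespec ts;
       void *prev = NULL;
       int n = 0;

       /* CLOCK_MONOTONIC matches bpf_ktime_get_ns(). */
       clock_gettime(CLOCK_MONOTONIC, &ts);
       uint64_t now = (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;

       /* Collect stale keys first, then delete them. */
       while (n < 1024 && bpf_map_get_next_key(map_fd, prev, &next) == 0) {
           if (bpf_map_lookup_elem(map_fd, &next, &st) == 0 &&
               now - st.time_last_sent > MAX_AGE_NS &&
               now - st.time_last_recv > MAX_AGE_NS)
               stale[n++] = next;
           cur = next;
           prev = &cur;
       }
       for (int i = 0; i < n; i++)
           bpf_map_delete_elem(map_fd, &stale[i]);
   }

   int main(void)
   {
       /* Hypothetical pin path; the real path depends on the loader. */
       int fd = bpf_obj_get("/sys/fs/bpf/pdm_flows");
       if (fd >= 0)
           expire_stale(fd);
       return 0;
   }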

   Another way is to handle this mechanism entirely in eBPF: for every
   packet, in both ingress and egress, the differences between the
   current timestamp and the Time Last Sent (TLS) and Time Last
   Received (TLR) are calculated, and if both differences exceed a
   configured maximum limit, the map entry fields are reset and the PDM
   flow for that 5-tuple is reinitialized.
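
   A fragment of this in-kernel approach, reusing the definitions from
   Section 3, might look as follows; the age limit is an example value.

   #define PDM_MAX_AGE_NS (60ULL * 1000000000ULL)   /* example: 60 s */

   static __always_inline void pdm_refresh(struct flow_state *st,
                                           __u64 now)
   {
       if (now - st->time_last_sent > PDM_MAX_AGE_NS &&
           now - st->time_last_recv > PDM_MAX_AGE_NS) {
           /* Reset the entry and restart PDM for this 5-tuple. */
           st->psn_last_sent  = (__u16)bpf_get_prandom_u32();
           st->psn_last_recv  = 0;
           st->time_last_sent = 0;
           st->time_last_recv = 0;
       }
   }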

4.  Advantages of using eBPF to add extension headers

   eBPF offers the capability for dynamic loading and unloading of BPF
   programs, facilitating the ease of activating or deactivating the
   insertion of extension headers into outgoing packets.  The
   utilization of tc and xdp hook points enhances the precision of
   timestamps for wire arrival time, due to their location at the lower
   layers of the network stack.  Additionally, eBPF simplifies memory
   management in high traffic scenarios, as it allows for the
   configuration of the maximum number of entries in eBPF maps via its
   API.

   eBPF programs are also portable and can be used across different
   kernel versions, as long as the features they depend on are
   available.  This allows the PDM implementation to be migrated easily
   and to remain consistent across kernel versions.

   Implementing extension header insertion within the kernel can
   introduce development challenges, such as potential memory leaks due
   to inadequate memory deallocation processes.  The configurability of
   the maximum number of entries in a BPF map addresses this issue,
   preventing memory overflow.  The presence of the BPF verifier is
   instrumental in ensuring both security and simplicity of
   implementation.  It conducts essential checks, including pointer
   validation, buffer overflow prevention, and loop avoidance in the
   code, thereby mitigating the risks of crashes or security
   vulnerabilities.  To safeguard against misuse, eBPF imposes resource
   constraints on programs, such as limits on the number of executable
   instructions, thereby upholding system stability and integrity.

5.  Performance Analysis

5.1.  Experiment Setup

   Two virtual machines with 8 cores, 16 GB of RAM, and 64 GB of disk
   space were used to run the following tests.  The virtual machines
   run the Ubuntu 22.04 server operating system with Linux kernel
   version 5.15.148, compiled using the same kernel configuration as
   the prepackaged kernel 5.15.94.  Both VMs run on the same physical
   server using QEMU/KVM as the hypervisor.  We compared the
   performance of the eBPF implementation of PDM with a traditional
   kernel implementation of PDM (add reference).  The performance
   metrics used for comparison are CPU performance, memory usage,
   network throughput, and packet processing latency.

5.2.  CPU Performance

   Profiling of the CPU cycles consumed by the eBPF programs and by the
   kernel implementation was performed to evaluate the computational
   overhead introduced by these functions.  The perf tool was used to
   capture CPU cycle events and was configured with a sampling
   frequency of 10,000 Hz.

   Each experiment was structured to run an iperf3 server session using
   TCP for a duration of 600 seconds (10 minutes), simulating a
   consistent and controlled traffic load.  iperf3 was also configured
   to use an MSS value of 1000 bytes across all tests, while the MTU of
   the interface and path was 1500 bytes.  This avoided the need to
   account for the packet size exceeding the MTU in the eBPF program.

   This procedure was replicated across fifty individual trials per
   implementation.  The repetition of these trials under uniform
   conditions and for a long duration allowed for the collection of a
   comprehensive profile of CPU cycle usage, which is useful for
   evaluating the efficiency and scalability of the eBPF processing in
   real-world networking scenarios.

   For the eBPF implementation, perf is able to record data for the
   egress and ingress programs separately.  For the kernel
   implementation, the duration of the pdm_insert function call was
   measured for each iperf3 server session.  This represents the egress
   overhead of the kernel implementation.

5.2.1.  CPU Usage in cycles

     +===================+==============+==============+=============+
     | CPU Usage(cycles) | Mean         | Median       | St. Dev.    |
     +===================+==============+==============+=============+
     | eBPF Egress       | 8.60e10 cyc. | 8.54e10 cyc. | 9.08e9 cyc. |
     +-------------------+--------------+--------------+-------------+
     | eBPF Ingress      | 1.53e10 cyc. | 1.57e10 cyc. | 8.71e9 cyc. |
     +-------------------+--------------+--------------+-------------+
     | PDM Kernel Egress | 2.29e9 cyc.  | 2.13e9 cyc.  | 6.49e8 cyc. |
     +-------------------+--------------+--------------+-------------+

                                  Table 1

5.2.2.  CPU usage as a percentage of total CPU cycles

           +===================+=========+=========+==========+
           | CPU Usage(%)      | Mean    | Median  | St. Dev. |
           +===================+=========+=========+==========+
           | eBPF Egress       | 0.41%   | 0.40%   | 0.10%    |
           +-------------------+---------+---------+----------+
            | eBPF Ingress      | 0.07%   | 0.07%   | 0.03%    |
           +-------------------+---------+---------+----------+
           | PDM Kernel Egress | 0.0110% | 0.0100% | 0.0030%  |
           +-------------------+---------+---------+----------+

                                 Table 2

   The CPU cycles consumed by the PDM kernel implementation are lower
   than those of the eBPF counterpart, which denotes a measurably
   higher computational demand for the eBPF operations.  The kernel
   approach, despite its limited flexibility compared to eBPF, thus
   demonstrates a lower overhead.

   On a test run with call stacks enabled in perf, the percentage
   overheads of some of the symbols invoked by the eBPF egress function
   were obtained.  The major portion of the egress overhead comes from
   BPF map read/write operations and from the memcpy operations that
   copy packet data to and from kernel memory.

   It would be interesting to examine the effect of reducing the number
   of bpf_skb_store_bytes and bpf_skb_load_bytes calls by loading the
   entire packet into the eBPF program, modifying it there, and then
   storing the modified packet back into the skb.  The current
   implementation invokes bpf_skb_store_bytes and bpf_skb_load_bytes
   many times for disjoint parts of the packet, so this could be a
   potential optimization for the eBPF program.

5.3.  Memory Usage

   This PDM implementation using eBPF uses memory to store the state of
   the 5-tuple flows.  The memory management is handled by eBPF maps.
   Each map entry stores a value of 20 bytes: 2 bytes each for the
   Packet Sequence Number Last Sent (PSNLS) and Packet Sequence Number
   Last Received (PSNLR), and 8 bytes each for the Time Last Sent (TLS)
   and Time Last Received (TLR).

   The BPF maps have been configured with a maximum limit of 65,536
   entries, which means the implementation can handle 65,536 flows at
   once.  When handling this maximum number of flows, the values stored
   in the eBPF map amount to 1,310,720 bytes, or about 1.3 MB.  The
   eBPF map structures themselves add some overhead, but the effect on
   this total is not very large.

   If more than 65,536 flows are encountered then new flows replace
   older entries in the maps.  The BPF_MAP_TYPE_LRU_HASH variant of the
   BPF Hash Map is used in the implementation so the older flows are
   replaced in a least recently used fashion.

5.4.  Network Throughput

    +===========================+============+============+===========+
    | Network Throughput        | Mean       | Median     | St. Dev.  |
    +===========================+============+============+===========+
    | Without PDM               | 18.80 Gbps | 18.58 Gbps | 2.19 Gbps |
    +---------------------------+------------+------------+-----------+
    | PDM Kernel Implementation | 18.52 Gbps | 18.33 Gbps | 2.21 Gbps |
    +---------------------------+------------+------------+-----------+
    | eBPF Implementation       | 18.03 Gbps | 17.22 Gbps | 2.51 Gbps |
    +---------------------------+------------+------------+-----------+

                                  Table 3

   Profiling of the network throughput overhead introduced by attaching
   the PDM extension header was performed.  Each experiment was
   structured to run an iperf3 server session using TCP for a duration
   of 600 seconds (10 minutes), simulating a consistent and controlled
   traffic load.  perf was not running during any of these tests.

   This procedure was replicated across twenty-five individual trials
   conducted under uniform conditions.  The network throughput was
   measured with no PDM attached, with PDM attached using the kernel
   implementation, and with PDM attached using the eBPF implementation.

   When PDM is not attached, the network throughput is the highest, as
   expected.  A slight decrease is observed with the kernel
   implementation, and a further decrease with the eBPF implementation.
   This indicates that while both methods impact network performance,
   the eBPF implementation has a slightly more pronounced effect.  The
   standard deviation across these measurements suggests some
   variability in the test network conditions, which should be taken
   into account when implementing extension headers in eBPF.

        +===========================+========+========+==========+
        | TCP Retransmits           | Mean   | Median | St. Dev. |
        +===========================+========+========+==========+
        | Without PDM               | 2.125  | 2.0    | 1.832    |
        +---------------------------+--------+--------+----------+
        | PDM Kernel Implementation | 44.125 | 41.5   | 13.531   |
        +---------------------------+--------+--------+----------+
        | eBPF Implementation       | 37.565 | 36.0   | 10.133   |
        +---------------------------+--------+--------+----------+

                                 Table 4

   The TCP retransmit counts were extracted from the test runs
   conducted for network throughput.  The number of TCP retransmits is
   higher when PDM is attached, with either the kernel implementation
   or the eBPF implementation.  This might be due to a fault in the
   implementations themselves or to packet drops caused by the added
   extension header.

5.5.  Packet Processing Latency

    +===============================+==========+==========+==========+
    | Packet Processing Latency     | Mean     | Median   | St. Dev. |
    +===============================+==========+==========+==========+
    | PDM Kernel Implementation     | 0.707 µs | 0.641 µs | 0.414 µs |
    +-------------------------------+----------+----------+----------+
    | eBPF Egress Program Attached  | 5.808 µs | 6.142 µs | 0.986 µs |
    +-------------------------------+----------+----------+----------+
    | eBPF Egress Program Detached  | 4.528 µs | 4.668 µs | 0.785 µs |
    +-------------------------------+----------+----------+----------+
    | eBPF Ingress Program Attached | 3.634 µs | 3.977 µs | 0.906 µs |
    +-------------------------------+----------+----------+----------+
    | eBPF Ingress Program Detached | 3.082 µs | 3.321 µs | 1.246 µs |
    +-------------------------------+----------+----------+----------+

                                 Table 5

   Functions within the kernel involved in packet processing can be
   profiled using ftrace to determine the exact duration spent
   processing packets.  The call duration of the PDM insertion function
   (which is part of the PDM kernel implementation) was measured for 15
   minutes while running an iperf3 server session.

   For the egress eBPF program, the duration of the dev_queue_xmit()
   function call in the kernel was measured with and without the eBPF
   egress program attached, for a duration of 15 minutes while running
   an iperf3 server session.  Similarly, for the ingress eBPF program,
   the duration of the netif_receive_skb_list_internal() function call
   in the kernel was measured with and without the eBPF ingress program
   attached, for a duration of 15 minutes while running an iperf3
   server session.

   The eBPF egress program is profiled with respect to packet
   processing latency by calculating the difference in the duration of
   the dev_queue_xmit() function call in the kernel with and without
   the eBPF egress program attached.  The difference indicates that the
   eBPF egress program introduces a latency of approximately 1.280 µs.

   The eBPF ingress program is profiled with respect to packet
   processing latency by calculating the difference in the duration of
   the netif_receive_skb_list_internal() function call in the kernel
   with and without the eBPF ingress program attached.  The difference
   indicates that the eBPF ingress program introduces a latency of
   approximately 0.552 µs.  It should be noted, however, that ftrace is
   affected by context switches and scheduling latencies in the kernel,
   as well as by the scheduling of the VM itself on the host.

6.  Security Considerations

   BPF utilizes maps to store various data elements, including 5-tuple
   information about network flows.  These maps have a configurable
   limit on the number of entries they can hold, which is crucial for
   efficient memory usage and performance optimization.  However, this
   characteristic also opens up a potential vulnerability to resource
   exhaustion attacks.

   An attacker could overrun the BPF maps by intentionally sending
   packets with numerous distinct 5-tuples.  As these maps reach their
   maximum capacity, legitimate new entries either cannot be added or
   cause existing entries to be replaced by the new flows, potentially
   leading to incorrect packet processing or denial of service as
   critical flows go untracked or are misclassified.  This scenario is
   particularly concerning in high-throughput environments where the
   rate of new flow creation is significant.

   To mitigate such attacks, it is essential to implement a robust
   mechanism that not only monitors the usage of BPF maps but also
   employs intelligent strategies to handle map overruns.  This could
   include techniques like early eviction of least-recently-used
   entries, dynamic resizing of maps based on traffic patterns, or even
   alert mechanisms for anomalous growth in map entries.

   Additionally, rate-limiting strategies could be enforced at the
   network edge to prevent an overwhelming number of new flows from
   entering the network, thus offering a first line of defense against
   such resource exhaustion attacks.

7.  IANA Considerations

   This document has no IANA actions.

8.  Normative References

   [RFC8250]  Elkins, N., Hamilton, R., and M. Ackermann, "IPv6
              Performance and Diagnostic Metrics (PDM) Destination
              Option", RFC 8250, DOI 10.17487/RFC8250, September 2017,
              <https://www.rfc-editor.org/rfc/rfc8250>.

Acknowledgments

   The Authors extend their gratitude to Ameya Deshpande for providing
   the kernel implementation of PDM, which served as a basis for
   comparison with the eBPF implementation.

Authors' Addresses

   Nalini Elkins
   Inside Products, Inc.
   United States
   Email: nalini.elkins@insidethestack.com

   Chinmaya Sharma
   NITK Surathkal
   India
   Email: chinmaysharma1020@gmail.com

   Amogh Umesh
   NITK Surathkal
   India
   Email: amoghumesh02@gmail.com

   Balajinaidu V
   NITK Surathkal
   India
   Email: balajinaiduhanur@gmail.com

   Mohit P. Tahiliani
   NITK Surathkal
   India
   Email: tahiliani@nitk.edu.in
