Internet-Draft IPv6 Packet Marking June 2023
Carder, et al. Expires 31 December 2023
Workgroup: Internet Engineering Task Force
Internet-Draft: draft-cc-v6ops-wlcg-flow-label-marking-00
Published: June 2023
Intended Status: Informational
Expires: 31 December 2023
Authors: D. Carder (Energy Sciences Network), T. Chown (Jisc), S. McKee (University of Michigan)

Use of the IPv6 Flow Label for WLCG Packet Marking

Abstract

This document describes an experimentally deployed approach currently used within the Worldwide Large Hadron Collider Computing Grid (WLCG) to mark packets with their project (experiment) and application. The marking uses the 20-bit IPv6 Flow Label in each packet, with 15 bits used for semantics and 5 bits for entropy. Alternatives, in particular the use of IPv6 Extension Headers (EH), were considered but found not to be practical. The WLCG is one of the largest worldwide research communities and has adopted IPv6 heavily for the movement of many tens of PB of data annually, with the ultimate goal of running IPv6-only.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 31 December 2023.

1. Introduction

High Energy Physics (HEP) experiments such as those using the Large Hadron Collider, as well as many similar data-intensive global science domains, rely on networks as one of the critical components of their infrastructure, both within the laboratories and globally, to interconnect participating sites, data centers and experiment instrumentation.

The Worldwide Large Hadron Collider Computing Grid (WLCG), a specific (and very large) example of HEP research infrastructure, supports multiple CERN experiments, with a reported 200 PB of data generated annually and distributed to computing centers in over 40 countries. IPv6 is used heavily by the WLCG, with over 90% of the main storage facilities now supporting it, and a significant percentage of traffic flows being IPv6. The ultimate goal is to run the WLCG IPv6-only.

Analyzing the pattern of traffic flows in detail is critical for understanding how the various complex systems developed are actually using the network. The motivation for the use of packet marking is to label traffic to indicate the user community and application workflow it is a part of so that the purpose of data transfers may be understood. This capability is especially important for sites which support many simultaneous experiments' workflows where any worker node or storage system may quickly change between different users. With a standardized way of marking traffic, any intermediate network or end-site could quickly provide detailed visibility into the nature of the HEP traffic running to and from their site.

Backbone networks may also use this metadata to summarize traffic as belonging to certain science experiments and their applications. HEP user communities may then use the data provided by participating backbone networks to characterize the scientific workloads running at global scale, for example by measuring the impact of tradeoffs between storage and workload placement, or by verifying that scarce resources such as undersea cables are used efficiently.

While the initial rationale for the packet marking was better understanding of the flow of traffic belonging to certain experiments around Research and Education (R&E) networks, there is also the potential for traffic to be steered by its Flow Label value.

This document describes a packet marking scheme currently being applied and tested within the WLCG community, but the approach is extensible (given the number of bits available to mark experiments and experiment owners) to other HEP and R&E communities.

A network flow is defined as a five tuple, i.e., source IP, destination IP, source port, destination port and protocol (TCP, UDP, ...). The packet marking is intended to complement the five tuple by denoting the packet owner (experiment/community) and the traffic type (application). One application may source multiple network flows, for example from multiple source ports or to multiple destination IPs, but for accounting purposes they may all be of the same application "type" of traffic, correspond to the same owner, and inherently ask to be treated the same by the network. The applications would have, as part of their configuration, the owner and the type of traffic marking to set. A given host may be running multiple such applications.

Summarization of this data is expected to be coarse. A set of applications working on the same task on different hosts would likely all use the same packet marking. Traffic "type" needs to be defined and agreed upon within a specific user community, and the set of application owners, or users, needs to be agreed upon within a limited domain. But it would be considered normal for multiple network flows (in the five tuple sense) to share a common marking if they belong to the same experiment and application.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

2. Use of the IPv6 Flow Label

The format of the IPv6 packet header is described in Section 3 of [RFC8200], and includes the 20 bit IPv6 Flow Label field.

2.1. Setting the Flow Label bits

The packet marking approach uses all 20 bits of the flow label field available in the IPv6 header.

The packet marking encodes the application and owner identifiers, together with a small number of entropy bits, as subfields of the Flow Label in the following way:

  • The application identifier uses 6 bits, and is encoded in bits 13-18
  • Entropy bits are 5 bits in positions 1, 2, 12, 19 and 20, and are set at random once per network flow for the duration of its lifetime.
  • Owner identifier uses 9 bits and is encoded in bits 3-11, and these bits are used in reversed order to allow for possible future adjustments of the bit boundary.

The flow label is set on each packet that is sent by the given application. Network flows belonging to the same experiment and application may thus have 32 different flow label values.
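The layout above can be sketched in a few lines of Python. This is an illustrative sketch only, not the deployed implementation. It rests on two assumptions that this document does not state: that bits are numbered 1 to 20 starting from the most significant bit of the Flow Label, and that "reversed order" means the 9-bit owner identifier is bit-reversed within its field. The function names are hypothetical.

```python
import secrets

# Assumption: bit 1 is the most significant bit of the 20-bit Flow Label,
# so bit n corresponds to a left shift of (20 - n).
ENTROPY_POSITIONS = (1, 2, 12, 19, 20)
OWNER_WIDTH, OWNER_SHIFT = 9, 20 - 11   # bits 3-11
APP_WIDTH, APP_SHIFT = 6, 20 - 18       # bits 13-18


def reverse_bits(value: int, width: int) -> int:
    """Reverse the order of the lowest `width` bits of `value`."""
    out = 0
    for _ in range(width):
        out = (out << 1) | (value & 1)
        value >>= 1
    return out


def encode_flow_label(owner_id: int, app_id: int, entropy=None) -> int:
    """Pack owner, application and 5 entropy bits into a 20-bit label."""
    if not 0 <= owner_id < 2 ** OWNER_WIDTH:
        raise ValueError("owner identifier must fit in 9 bits")
    if not 0 <= app_id < 2 ** APP_WIDTH:
        raise ValueError("application identifier must fit in 6 bits")
    if entropy is None:
        # Chosen once per network flow and kept for its lifetime.
        entropy = secrets.randbits(len(ENTROPY_POSITIONS))
    label = reverse_bits(owner_id, OWNER_WIDTH) << OWNER_SHIFT
    label |= app_id << APP_SHIFT
    for i, bit in enumerate(ENTROPY_POSITIONS):
        if (entropy >> (len(ENTROPY_POSITIONS) - 1 - i)) & 1:
            label |= 1 << (20 - bit)
    return label


def decode_flow_label(label: int):
    """Return (owner_id, app_id, entropy) from a 20-bit label."""
    owner = reverse_bits((label >> OWNER_SHIFT) & 0x1FF, OWNER_WIDTH)
    app = (label >> APP_SHIFT) & 0x3F
    entropy = 0
    for bit in ENTROPY_POSITIONS:
        entropy = (entropy << 1) | ((label >> (20 - bit)) & 1)
    return owner, app, entropy
```

Iterating the entropy argument over all 32 values for a fixed owner and application yields the 32 distinct labels mentioned above, each of which decodes back to the same owner and application.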

As the initial work is to be applicable within the global R&E user community, the majority of the bits available are used to indicate the science domain owner and, therefore, fewer bits are available to denote traffic type.

2.2. Deviation from IPv6 Specifications

Part of the reason for documenting this use of the IPv6 Flow Label was to note that, at least in the domain of certain HEP research networks, the IPv6 Flow Label is not being used exactly as specified, and to record the reason why.

Section 6 of [RFC8200] states that "the 20-bit Flow Label field in the IPv6 header is used by a source to label sequences of packets to be treated in the network as a single flow".

Section 3 of [RFC6437] states that "It is therefore RECOMMENDED that source hosts support the flow label by setting the flow label field for all packets of a given flow to the same value chosen from an approximation to a discrete uniform distribution" and that the algorithm (for setting the Flow Label value) "SHOULD ensure that the resulting flow label values are unique with high probability."

Section 1 of [RFC6437] further adds that "a specific goal is to enable and encourage the use of the flow label for various forms of stateless load distribution, especially across Equal Cost Multi-Path (ECMP) and/or Link Aggregation Group (LAG) paths."

In this packet marking scheme, all traffic belonging to the same experiment and application, e.g., "ALICE" and "XRootD", will carry a flow label with 15 fixed, common bits and 5 varying (entropy) bits. The Flow Label usage described above assumes 20 bits of entropy with a uniform distribution, so the flow label values here will not be unique with such high probability: two network flows of the same experiment and application will in principle share a flow label with a probability of 1 in 32, rather than around 1 in a million.
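To make the difference concrete, the following sketch computes the probability that at least two of a number of concurrent flows share a label, for 5 and for 20 entropy bits. It assumes labels are drawn independently and uniformly at random (the usual "birthday" calculation); the function name is hypothetical and the flow counts are arbitrary examples.

```python
def collision_probability(flows: int, entropy_bits: int) -> float:
    """Probability that at least two of `flows` random labels coincide."""
    space = 2 ** entropy_bits
    if flows > space:
        return 1.0  # more flows than labels: a collision is certain
    p_all_unique = 1.0
    for i in range(flows):
        p_all_unique *= (space - i) / space
    return 1.0 - p_all_unique


if __name__ == "__main__":
    for flows in (2, 8, 32):
        print(flows,
              collision_probability(flows, 5),
              collision_probability(flows, 20))
```

With only two flows the values reduce to 1/32 and 1/2^20 respectively, matching the figures in the text. The probability for 5 entropy bits grows quickly as more flows of the same experiment and application run in parallel.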

The 5 entropy bits are used to still support a level of conformance with the requirement stated in RFC 6437 to support traffic distribution in ECMP and LAG scenarios. The number of bits chosen is a tradeoff between the number of bits available for the experiment and application labeling and the number of entropy bits.

2.3. Traffic inspection and collection on path

As packets are marked using the IPv6 flow label, it is possible for intermediate routers to sample traffic in the forwarding hardware and send this data off to central collectors for analysis. In many network environments, the standard approach is for a hardware-specific implementation on a router to sample the traffic and use the IPFIX protocol [RFC7011] to send the sampled data to a collector. Section 5.4.21 of [RFC5102] defines field #31 for carrying the IPv6 Flow Label information, and major router hardware and collector software implementations are known to support this.

Some hardware platforms, primarily those with a lineage more firmly rooted in switching than in routing, support traffic sampling via sFlow [RFC3176]. Unlike IPFIX, the traffic is not summarized by the router/switch; instead, a significant part of the sampled raw packet is encapsulated and sent to the collector for analysis. sFlow datagrams include one or more packet flow records, which in turn include the original datagram header. Individual fields such as the IPv6 flow label can therefore be collected.
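When a collector has access to the sampled IPv6 header, as with sFlow or a traffic mirror, the flow label can be read from the first 32-bit word of the header (4 bits of Version, 8 bits of Traffic Class, 20 bits of Flow Label, per [RFC8200]). A minimal sketch, assuming the collector has already stripped any link-layer and sampling encapsulation so that the buffer starts at the IPv6 header; the function name is hypothetical:

```python
import struct


def parse_ipv6_first_word(header: bytes):
    """Return (traffic_class, flow_label) from the start of an IPv6 header."""
    if len(header) < 4:
        raise ValueError("need at least the first 4 bytes of the header")
    (word,) = struct.unpack("!I", header[:4])  # network byte order
    version = word >> 28
    if version != 6:
        raise ValueError("not an IPv6 header")
    traffic_class = (word >> 20) & 0xFF
    flow_label = word & 0xFFFFF
    return traffic_class, flow_label
```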

Traffic mirroring and/or optical taps can also be used to copy raw traffic to a server for analysis. The data rates, number of links, and power availability to run servers for large scale collection may make traditional packet capture and analysis impractical in many environments such as REN (Research and Education Network) environments, though there has been initial success with the P4 implementation running on the FPGA platform deployed throughout ESnet.

2.4. Implications for traffic analysis

Corresponding to the expectations in Section 4 of [RFC6436], a brief, unscientific sampling of non-MPLS encapsulated traffic collected via IPFIX on ESnet5 does show that there is a mix of [RFC3697] compliant hosts where all-zero flow labels are used, as well as updated [RFC6437] compliant hosts that by default choose uniformly distributed labels between 1 and 0xFFFFF. A traffic analysis system may need to know which specific endpoints are using the packet marking meaning of the flow label and that the field's values are relevant. As the deployments are for rather narrow accounting use cases within specific user communities, it has been practical to match for known flow labels vs trying to keep the accounting state for 2^20 possible labels in use for each link of the network.
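Matching on known labels can be sketched as a masked lookup: the entropy bits are cleared and the remaining 15 semantic bits are compared against a small table of registered patterns. The sketch below is illustrative only. The mask assumes that bit 1 of the layout in Section 2.1 is the most significant bit of the Flow Label, and the table contents and function names are hypothetical.

```python
from collections import Counter

# Assumption: bits 3-11 (owner) and 13-18 (application), counted from the
# most significant bit, carry the semantics; the other 5 bits are entropy.
SEMANTIC_MASK = (0x1FF << 9) | (0x3F << 2)  # 0x3FEFC


def build_table(known_labels: dict) -> dict:
    """Map the semantic bits of each known label to a descriptive name."""
    return {label & SEMANTIC_MASK: name for name, label in known_labels.items()}


def account(samples, table: dict) -> Counter:
    """Sum octets per known marking; unknown labels are ignored."""
    totals = Counter()
    for flow_label, octets in samples:
        name = table.get(flow_label & SEMANTIC_MASK)
        if name is not None:
            totals[name] += octets
    return totals
```

Because only registered patterns are tracked, the state kept per link is proportional to the number of known markings rather than to 2^20. Note that a uniformly random label from an unrelated [RFC6437] host still matches any one registered pattern with probability 2^-15, which is one reason a collector may also need to know which endpoints participate.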

2.5. Use of a registry

A registry, known as [SciTags], is used to track known users/applications. For the current experimental use, this is effectively operating as a centralized resource and API. Future work may include a more complex system for broader distribution.

A given intermediate network capturing the flow data doesn't necessarily need to decode this information, as it's only truly relevant to the end sites using this scheme.

2.6. Additional Considerations

If there are concerns about preserving entropy and reducing possible collisions with the standard use of the IPv6 Flow Label, the "entropy" bits defined above could instead be used to carry a Hamming code. A Hamming code calculates a set of parity bits that extend a set of message (data) bits so as to increase the minimum number of bits that differ between "valid" messages. This may better support existing use of the flow label for ECMP as described in [RFC6437].
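As a rough illustration of this idea, the sketch below computes 5 parity bits over 15 data bits using the classical Hamming construction, shortened to a 20-bit codeword. It is a sketch under stated assumptions, not a proposal for the actual bit layout: parity bits sit at the textbook positions 1, 2, 4, 8 and 16 rather than at the entropy positions of Section 2.1, and how the 15 semantic bits would map to data positions is left open. All names are hypothetical.

```python
PARITY_POSITIONS = (1, 2, 4, 8, 16)
# The 15 remaining positions in 1..20 carry the data bits.
DATA_POSITIONS = [p for p in range(1, 21) if p not in PARITY_POSITIONS]


def hamming_encode(data: int) -> list:
    """Return codeword bits indexed 1..20 (index 0 is unused)."""
    if not 0 <= data < 2 ** 15:
        raise ValueError("data must fit in 15 bits")
    code = [0] * 21
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (data >> i) & 1
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, 21):
            # Parity bit p covers every position whose index has bit p set.
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    return code


def syndrome(code: list) -> int:
    """0 for a valid codeword; otherwise the index of a single flipped bit."""
    s = 0
    for pos in range(1, 21):
        if code[pos]:
            s ^= pos
    return s
```

Any two valid codewords produced this way differ in at least 3 bit positions, whereas two markings that differ in a single semantic bit would otherwise differ in only 1. The cost is that the 5 bits are then fully determined by the marking and no longer vary per flow.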

3. Alternative packet marking approaches considered

3.1. IPv6 Hop-by-Hop or Destination Options

Extension headers are known to be problematic in that they have a history of being filtered or dropped in transit, as measured in [RFC7872] with substantial further discussion in [RFC9098]. It might be that such issues are less common in Research and Education networks. As an example, [RFC9343] defines an alternate marking encoding for use in either hop-by-hop or destination options headers. [RFC7837] defines a marking in support of congestion control (ConEx), and [RFC8250] is a Standards Track document that defines a destination option for Performance and Diagnostic Metrics (PDM) for IPv6. In addition, [I-D.ietf-ippm-ioam-ipv6-options] defines an option for the use case of carrying OAM information.

The Destination Options header could therefore be a logical choice to place application-specific telemetry identifiers, as there is less of a constraint on space than the IPv6 Flow Label, less history of defined pre-existing intentions from the standards body, and low deployed usage on the Internet. However, at present, the Linux implementation in particular requires either setuid 0 or the CAP_NET_RAW capability to be able to call setsockopt(s, IPPROTO_IPV6, IPV6_DSTOPTS, ext_hdr_p, ext_hdr_size), making it unusable by typical userspace applications. A set of patches has been proposed that could address this as well as extend the functionality, though they have not been met with support from the Linux network maintainers. Additionally, extraction of that field by intermediate routers and its export via IPFIX may be less widely supported than for the flow label, which is a fixed field at a known position.

While it is possible in principle, using a Hop-by-Hop option is less practical, for the reasons discussed in [I-D.krishnan-ipv6-hopbyhop]. There is, however, a recent example of its use in [RFC9268], where a host can signal this option. Routers will not process it unless configured to do so, and those that are not configured may well drop the packet, according to Section 4.8 of [RFC8200].
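For illustration, the buffer passed as ext_hdr_p in such a call could be built as follows. This is a hedged sketch, not something used in the WLCG deployment: it encodes a single TLV option inside a Destination Options header as laid out in [RFC8200], padded to a multiple of 8 octets. The option type 0x1E is one of the values that IANA lists for RFC 3692-style experiments; the payload and the function name are hypothetical.

```python
EXPERIMENTAL_OPTION_TYPE = 0x1E  # experimental value; skipped if unrecognized


def build_dstopts(option_data: bytes, option_type: int = EXPERIMENTAL_OPTION_TYPE) -> bytes:
    """Return a Destination Options header carrying one TLV option."""
    if len(option_data) > 255:
        raise ValueError("option data is limited to 255 octets")
    # Next Header is left as 0 here; this sketch assumes the stack fills it in.
    body = bytes([option_type, len(option_data)]) + option_data
    unpadded = 2 + len(body)          # 2 octets: Next Header + Hdr Ext Len
    pad = (-unpadded) % 8
    if pad == 1:
        body += b"\x00"                               # Pad1
    elif pad >= 2:
        body += bytes([0x01, pad - 2]) + bytes(pad - 2)  # PadN
    total = 2 + len(body)
    hdr_ext_len = total // 8 - 1      # in 8-octet units, not counting the first
    return bytes([0, hdr_ext_len]) + body

# On Linux, and only with the privileges discussed above, the buffer would be
# applied with something like:
#   sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_DSTOPTS, build_dstopts(b"..."))
```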

3.2. IPv6 Addresses as identifiers

Given the size of IPv6 addresses, it is possible to mark or "color" packets by using specific site network prefixes (within a site /64) or values in (a part of) the host identifier part of an address (typically 64 bits). Hosts already currently use multiple IPv6 source addresses. Applications would need to bind sockets to the correct source address, per flow, corresponding to the accounting details to be conveyed. Dispatching computation jobs into a high-throughput computational cluster along with network-specific metadata has for example been explored in [Lark].
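A minimal sketch of the address-based approach follows. The encoding of owner and application into the low bits of the interface identifier is purely hypothetical (no such layout is defined by this document or the registry), the prefix is a documentation prefix, and the function name is illustrative.

```python
import ipaddress


def marked_source_address(prefix: str, owner_id: int, app_id: int) -> ipaddress.IPv6Address:
    """Derive a source address whose interface identifier encodes the marking."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("expected a /64 prefix")
    if not 0 <= owner_id < 2 ** 9 or not 0 <= app_id < 2 ** 6:
        raise ValueError("identifier out of range")
    # Hypothetical layout: owner in the second-lowest 16-bit group,
    # application in the lowest 16-bit group.
    interface_id = (owner_id << 16) | app_id
    return net.network_address + interface_id

# An application would then bind its socket to that address before connecting,
# e.g. sock.bind((str(address), 0)), provided the address has already been
# configured on the host.
```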

Hosts serving different users/applications would need multiple addresses, one for each possible owner and application combination, configured in advance of the application requiring it. Adding an IP address to a host requires root-level access to the system and is typically not available as a dynamic function to userspace. There may also be limits on the number of source addresses that can be configured concurrently, so a garbage collection process may need to deprovision addresses no longer in use. This dynamic use of source addresses may also cause operational issues around access-control list management and security implementations at a site.

The use of marked source and destination addresses in communications could facilitate the routing of packets in different routing domains (or VPNs), if needed. Unfortunately, depending on the position of the marking in the address, it may not be possible to use it for policy routing, since very little network hardware implements bitmask packet matching for IPv6, leaving this likely feasible for host-initiated tunnels.

3.3. Marking in the Payload

Marking in the payload has been considered to be out of scope given the prevalence of TLS/SSL/etc, which means that payloads cannot be inspected on path.

3.4. Network Tokens

A recently published IETF personal draft documents the concept of "Network Tokens", see [I-D.yiakoumis-network-tokens].

"A network token is a small piece of data that end users attach to their packets. As packets flow through the network, intermediate nodes MAY detect tokens, interpret them, and apply the desired service to the packets that carry them (and possibly to all other packets from the same flow). For example, a token might just state the name of the application that a packet originates from." The draft proposes a 28-bit token ID field. It discusses multiple mechanisms for tokens to be conveyed; some may be applicable to IPv4. [I-D.iab-path-signals-collaboration] puts this work into a broader context.

3.5. IPv4 considerations

While this document is targeted at the IPv6 Operations WG and describes an approach used to mark IPv6 traffic, the reality is that some HEP and WLCG traffic is IPv4, and thus there is interest, in the short term at least, in marking IPv4 traffic.

IPv4 does not have a dedicated flow label, but provides a way to extend headers by means of header options. An overview of the IPv4 RFCs related to the header options can be found at [IPv4-parameters]. The IPv4 option method is likely to be effectively unusable because unknown options lead to packet drops on many paths (due to firewalls, etc.). There is an IETF draft that attempts to add IPv6 flow labels to IPv4 via options [I-D.herbert-ipv4-eh]; however, there appears to be no existing implementation readily available.

IPv4 header options can also be set via setsockopt; however, actual application support for the different standards is more complex. As an example, while there are many existing IPv4 header options, only some of them are implemented in the Linux kernel. The Stream Identifier option would appear to align with our use of the IPv6 flow label, but its usage was obsoleted in Section 3.2.1.8 of [RFC1122].

[I-D.ietf-cipso-ipsecurity] (Commercial IP Security Option) has an implementation in the kernel, allowing an operator to run an audit on packets and discard anything that is not labeled correctly. However, labeling does require additional configuration of the node to indicate which label the node is part of. That seems rather complex to set up, and it appears to be designed to support system-level tagging rather than application-level tagging.

As usage of IPv4 has been superseded by IPv6, it was determined that further effort for IPv4 was unwarranted.

3.6. Firefly Packets for marking Network Flows

Separate packets are sent alongside application traffic from the source to the same destination node, but always to a specific destination port. These packets are large enough to contain rich metadata about the flow, formatted as a JSON payload carried in syslog. These telemetry packets can be collected en route by participating networks, by the end host, or sent to a central collector.
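The general shape of such a packet can be sketched as below. This is not the firefly format used in deployment: the JSON field names, the syslog application name, and the UDP port are all placeholders chosen for illustration, and only the overall idea (a JSON document carried in a syslog-style message sent over UDP to the flow's destination) is taken from the description above.

```python
import json
import socket
from datetime import datetime, timezone

FIREFLY_PORT = 10514  # placeholder; the real port is agreed by the community


def build_firefly(owner, application, src, dst, sport, dport,
                  protocol="tcp", now=None, hostname="host.example"):
    """Return a syslog-style line with a JSON body describing one flow."""
    now = now or datetime.now(timezone.utc)
    payload = {  # hypothetical field names
        "owner": owner,
        "application": application,
        "flow": {"src": src, "dst": dst, "sport": sport,
                 "dport": dport, "protocol": protocol},
    }
    body = json.dumps(payload, separators=(",", ":"))
    # <134> = facility local0 (16 * 8) + severity informational (6)
    return f"<134>1 {now.isoformat()} {hostname} firefly - - - {body}"


def send_firefly(message: str, destination: str, port: int = FIREFLY_PORT) -> None:
    """Send the message over UDP/IPv6 to the flow's destination node."""
    with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), (destination, port))
```

Because the metadata travels in its own packet, it is not constrained to 15 bits, at the cost of requiring collectors to correlate the telemetry packet with the flow it describes.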

4. Implementation Status

  • [XRootD] data transfer software suite, implemented in C++.
  • [FlowD] software accounting service, containing a backend plugin with a Linux kernel eBPF implementation.
  • [Iperf3] throughput testing as invoked from the [PerfSONAR] suite of globally deployed test endpoints.

5. IANA Considerations

This memo includes no request to IANA.

6. Security Considerations

The security considerations in Section 6 of [RFC6437] still apply. It states that "third parties should be unlikely to be able to guess the next value that a source of flow labels will choose", but this use case specifically requires common marking for the majority of the bits for a specific pairing of experiment and application.

A related consideration is that well-known flow labels could further encourage pervasive monitoring attacks described in [RFC7258], but our use case for the flow labels is to intentionally permit monitoring use cases. This use of the flow label is directly controlled by the end hosts choosing to participate.

7. References

7.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/info/rfc8174>.
[RFC8200]
Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", STD 86, RFC 8200, DOI 10.17487/RFC8200, <https://www.rfc-editor.org/info/rfc8200>.
[RFC6437]
Amante, S., Carpenter, B., Jiang, S., and J. Rajahalme, "IPv6 Flow Label Specification", RFC 6437, DOI 10.17487/RFC6437, <https://www.rfc-editor.org/info/rfc6437>.

7.2. Informative References

[RFC1122]
Braden, R., Ed., "Requirements for Internet Hosts - Communication Layers", STD 3, RFC 1122, DOI 10.17487/RFC1122, <https://www.rfc-editor.org/info/rfc1122>.
[RFC3176]
Phaal, P., Panchen, S., and N. McKee, "InMon Corporation's sFlow: A Method for Monitoring Traffic in Switched and Routed Networks", RFC 3176, DOI 10.17487/RFC3176, <https://www.rfc-editor.org/info/rfc3176>.
[RFC3697]
Rajahalme, J., Conta, A., Carpenter, B., and S. Deering, "IPv6 Flow Label Specification", RFC 3697, DOI 10.17487/RFC3697, <https://www.rfc-editor.org/info/rfc3697>.
[RFC5102]
Quittek, J., Bryant, S., Claise, B., Aitken, P., and J. Meyer, "Information Model for IP Flow Information Export", RFC 5102, DOI 10.17487/RFC5102, <https://www.rfc-editor.org/info/rfc5102>.
[RFC6436]
Amante, S., Carpenter, B., and S. Jiang, "Rationale for Update to the IPv6 Flow Label Specification", RFC 6436, DOI 10.17487/RFC6436, <https://www.rfc-editor.org/info/rfc6436>.
[RFC7011]
Claise, B., Ed., Trammell, B., Ed., and P. Aitken, "Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of Flow Information", STD 77, RFC 7011, DOI 10.17487/RFC7011, <https://www.rfc-editor.org/info/rfc7011>.
[RFC7258]
Farrell, S. and H. Tschofenig, "Pervasive Monitoring Is an Attack", BCP 188, RFC 7258, DOI 10.17487/RFC7258, <https://www.rfc-editor.org/info/rfc7258>.
[RFC7837]
Krishnan, S., Kuehlewind, M., Briscoe, B., and C. Ralli, "IPv6 Destination Option for Congestion Exposure (ConEx)", RFC 7837, DOI 10.17487/RFC7837, <https://www.rfc-editor.org/info/rfc7837>.
[RFC7872]
Gont, F., Linkova, J., Chown, T., and W. Liu, "Observations on the Dropping of Packets with IPv6 Extension Headers in the Real World", RFC 7872, DOI 10.17487/RFC7872, <https://www.rfc-editor.org/info/rfc7872>.
[RFC8250]
Elkins, N., Hamilton, R., and M. Ackermann, "IPv6 Performance and Diagnostic Metrics (PDM) Destination Option", RFC 8250, DOI 10.17487/RFC8250, <https://www.rfc-editor.org/info/rfc8250>.
[RFC9098]
Gont, F., Hilliard, N., Doering, G., Kumari, W., Huston, G., and W. Liu, "Operational Implications of IPv6 Packets with Extension Headers", RFC 9098, DOI 10.17487/RFC9098, <https://www.rfc-editor.org/info/rfc9098>.
[RFC9268]
Hinden, R. and G. Fairhurst, "IPv6 Minimum Path MTU Hop-by-Hop Option", RFC 9268, DOI 10.17487/RFC9268, <https://www.rfc-editor.org/info/rfc9268>.
[RFC9343]
Fioccola, G., Zhou, T., Cociglio, M., Qin, F., and R. Pang, "IPv6 Application of the Alternate-Marking Method", RFC 9343, DOI 10.17487/RFC9343, <https://www.rfc-editor.org/info/rfc9343>.
[IPv4-parameters]
"Internet Protocol Version 4 (IPv4) Parameters", <https://www.iana.org/assignments/ip-parameters/ip-parameters.xhtml>.
[I-D.ietf-ippm-ioam-ipv6-options]
Bhandari, S. and F. Brockners, "In-situ OAM IPv6 Options", Work in Progress, Internet-Draft, draft-ietf-ippm-ioam-ipv6-options-12, <https://datatracker.ietf.org/doc/html/draft-ietf-ippm-ioam-ipv6-options-12>.
[I-D.yiakoumis-network-tokens]
Yiakoumis, Y., McKeown, N., and F. Sorensen, "Network Tokens", Work in Progress, Internet-Draft, draft-yiakoumis-network-tokens-02, <https://datatracker.ietf.org/doc/html/draft-yiakoumis-network-tokens-02>.
[I-D.iab-path-signals-collaboration]
Arkko, J., Hardie, T., Pauly, T., and M. Kühlewind, "Considerations on Application - Network Collaboration Using Path Signals", Work in Progress, Internet-Draft, draft-iab-path-signals-collaboration-03, <https://datatracker.ietf.org/doc/html/draft-iab-path-signals-collaboration-03>.
[I-D.herbert-ipv4-eh]
Herbert, T., "IPv4 Extension Headers and Flow Label", Work in Progress, Internet-Draft, draft-herbert-ipv4-eh-01, <https://datatracker.ietf.org/doc/html/draft-herbert-ipv4-eh-01>.
[I-D.krishnan-ipv6-hopbyhop]
Krishnan, S., "The case against Hop-by-Hop options", Work in Progress, Internet-Draft, draft-krishnan-ipv6-hopbyhop-05, <https://datatracker.ietf.org/doc/html/draft-krishnan-ipv6-hopbyhop-05>.
[I-D.ietf-cipso-ipsecurity]
"COMMERCIAL IP SECURITY OPTION (CIPSO 2.2)", Work in Progress, Internet-Draft, draft-ietf-cipso-ipsecurity-01, <https://datatracker.ietf.org/doc/html/draft-ietf-cipso-ipsecurity-01>.
[Lark]
Zhang, Z., Bockelman, B., Carder, D., and T. Tannenbaum, "Lark: An effective approach for software-defined networking in high throughput computing clusters", Future Generation Computer Systems, Volume 72, Pages 105-117, ISSN 0167-739X, DOI 10.1016/j.future.2016.03.010, <https://doi.org/10.1016/j.future.2016.03.010>.
[SciTags]
"Scientific network tags (scitags) website and accompanying registry", <https://www.scitags.org/>.
[XRootD]
"XRootD software framework", <https://xrootd.slac.stanford.edu/>.
[FlowD]
"FlowD software", <https://github.com/scitags/flowd>.
[Iperf3]
"Iperf3 software", <https://github.com/esnet/iperf>.
[PerfSONAR]
"PerfSONAR performance Service-Oriented Network monitoring ARchitecture", <https://www.perfsonar.net/>.

Acknowledgements

Members of the Worldwide LHC Computing Grid (WLCG) Research Networking Technical Working Group:

Marian Babik (CERN), Shawn McKee (Michigan), Dale W. Carder (LBNL/ESnet), Fatema Bannat Wala (LBNL/ESnet), Eli Dart (LBNL/ESnet), Mariam Kiran (ESnet), Edoardo Martelli (CERN), Pavlo Svirin (CERN/UTA), Tim Chown (Jisc), Marcos Schwarz (RNP), Joe Breen (Univ of Utah), Alexandr Zaytsev (BNL), Jason Lomonaco (Internet2), Karl Newell (Internet2), Casey Russell (KanREN), Joe Mambretti (StarLight, iCAIR NU, MREN), Eric Brown (Virginia Tech), Mario Lassnig (CERN), Michael Lambert (PSC), Garhan Attebury (UNL)

Authors' Addresses

Dale W. Carder
Energy Sciences Network
Lawrence Berkeley National Laboratory
1 Cyclotron Road
M/S 59R3101
Berkeley, CA 94720
United States of America
Tim Chown
Jisc
United Kingdom
Shawn McKee
University of Michigan
367D West Hall
450 Church St
Ann Arbor, MI 48109
United States of America