Current State of the Art for High Performance Wide Area Networks
draft-kcrh-hpwan-state-of-art-03
This document is an Internet-Draft (I-D).
Anyone may submit an I-D to the IETF.
This I-D is not endorsed by the IETF and has no formal standing in the
IETF standards process.
| Field | Value |
|---|---|
| Document type | Active Internet-Draft (individual) |
| Authors | Daniel King, Tim Chown, Chris Rapier, Daniel Huang, Kehan Yao |
| Last updated | 2025-10-20 |
| Replaces | draft-kcrh-state-of-art-hp-wan, draft-kcrh-state-of-art-hpwan |
| RFC stream | (None) |
| Intended RFC status | (None) |
| Stream state | (No stream defined) |
| Consensus boilerplate | Unknown |
| RFC Editor Note | (None) |
| IESG state | I-D Exists |
| Telechat date | (None) |
| Responsible AD | (None) |
| Send notices to | (None) |
Network Working Group D. King
Internet-Draft Lancaster University
Intended status: Informational T. Chown
Expires: 23 April 2026 Jisc
C. Rapier
Pittsburgh Supercomputing Center
D. Huang
ZTE Corporation
K. Yao
China Mobile
20 October 2025
Current State of the Art for High Performance Wide Area Networks
draft-kcrh-hpwan-state-of-art-03
Abstract
High Performance Wide Area Networks (HP-WANs) represent a critical
infrastructure for the modern global Research and Education (R&E)
community, facilitating collaboration across national and
international boundaries. These networks include global R&E
networks such as GÉANT, Internet2, Janet, ESnet, CANARIE, CERNET,
and others, as well as large-scale commercial dedicated networks
built by hyperscalers and operators. They are designed to support
the ever-growing transmission of vast amounts of data generated by
scientific research, high-performance computing, distributed AI
training, and large-scale simulations.
This document provides an overview of the terminology and techniques
used for existing HP-WANs. It also explores the technological
advancements, operational tools, and future directions for HP-WANs,
emphasising their role in enabling cutting-edge scientific research,
AI training and massive R&E data analysis.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
King, et al. Expires 23 April 2026 [Page 1]
Internet-Draft HP-WAN STATE OF ART October 2025
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 23 April 2026.
Copyright Notice
Copyright (c) 2025 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction
1.1. Background
2. Terminology
3. Example Use Cases for HP-WANs
4. Current Technologies Used in HP-WANs: Key Components
4.1. Architectural Elements
4.2. Topology
4.3. Bandwidth and Latency
4.4. Localised Data Movement
4.5. Forwarding Optimisation
4.6. Reliability and High Availability
4.7. Quality of Service
4.8. Congestion Control
4.9. Performance Monitoring
4.10. Scalability
4.11. Sustainability and Energy Efficiency
4.12. Resource Scheduling
5. Examples of HP-WANs
5.1. GÉANT
5.2. Janet
5.3. Google Effingo
5.4. Energy Sciences Network
5.4.1. Practical Examples of Dynamic Network Management
5.5. Internet2
5.6. CANARIE
5.7. Asia-Pacific Advanced Network
5.7.1. CERNET
5.7.2. China Mobile Cloud Dedicated Network
6. Emerging Trends and Future Directions
6.1. Integrated Resource and Network Control
6.2. Intent-Based Networking and Automation
6.3. Network Signalling
7. IANA Considerations
8. Security Considerations
9. Acknowledgements
Contributors
Normative References
Informative References
Authors' Addresses
1. Introduction
High Performance Wide Area Networks (HP-WANs) are the backbone of
global Research and Education (R&E) infrastructure, enabling the
seamless transfer of vast amounts of data and supporting advanced
scientific collaborations worldwide. These networks are designed to
meet the demanding requirements of data-intensive research fields,
including high-energy physics, climate modeling, genomics, and
artificial intelligence.
The evolution of HP-WANs is deeply intertwined with the growing need
for advanced scientific research and the increasing globalisation of
collaboration. Traditional WANs, which were sufficient for general
business and communication needs, quickly became inadequate for the
specialised requirements of research institutions. As scientific
endeavours began to generate larger datasets, ranging from terabytes
to petabytes, there arose a need for networks capable of transferring
these massive volumes of data reliably and securely across long
distances.
Specialised (inter)National R&E Networks (NRENs), such as ESnet in
the United States, GÉANT in Europe, Janet in the UK, and CERNET in
China, have evolved to support the unique needs of the scientific
community, while also carrying more generalised and less demanding
research and education traffic. These networks are designed by their
operators to provide high bandwidth and ensure low latency, high
reliability, and robust security, critical for applications like
real-time data analysis, distributed computing, and remote
instrumentation.
This evolution has been made possible by the design decisions of
both the NREN operators and the campus operators who build the local
network and systems infrastructures. HP-WANs require an end-to-end
engineering approach, in which the NREN operators provide capacity
while the local campus operators engineer their network architectures
and their data transfer and compute systems to ensure optimal use of
that capacity, e.g., by adopting the Science DMZ approach described
in this document.
Today, HP-WANs are foundational to the research community and are
leading the way in demonstrating how advanced networking technologies
can be applied to other sectors. They serve as testbeds for
innovations in networking that eventually trickle down to broader
commercial applications. As we look towards the future, HP-WANs will
continue to play a critical role in enabling scientific discoveries
and fostering international collaboration, particularly as emerging
technologies such as quantum computing and the Internet of Agents
(IoA) push the boundaries of what these networks must support.
This document explores the current state of the art in HP-WANs,
examining the technological advancements, operational challenges, and
emerging trends shaping the future of networks built for research,
education, massive data analysis and collaborative AI training at
scale and speed. Through this exploration, we aim to give readers a
clearer picture of how data-intensive, high performance computing is
supported across wide area networks.
1.1. Background
High Performance Wide Area Networks (HP-WANs) evolved as specialised
networks initially designed to facilitate scientific research
requiring high-speed data transfer, high reliability, and minimal
latency. ESnet, Janet, GÉANT and CERNET emerged in response to the
increasing data volumes generated by scientific and educational
institutions, transforming traditional WAN capabilities.
An HP-WAN is an end-to-end capability, architected through
collaboration between the NREN operators, campus IT operators, and
the science communities. HP-WANs have since become integral to
research and educational communities, supporting distributed
scientific collaborations, large-scale simulations, and intensive
data analysis. Their capabilities have been continually enhanced to
meet rising demands, laying foundations for future networking
technologies.
2. Terminology
This document provides a lexicon of terminology relating to high
performance WANs.
CERN: The European Organization for Nuclear Research, housing the
Large Hadron Collider (LHC).
High Performance Computing (HPC): A general term for computing
with a high level of performance. High performance computing often
refers specifically to running highly parallel jobs, often on
hundreds or even thousands of cores.
High Performance Wide Area Network (HP-WAN): A type of Wide Area
Network (WAN) designed specifically to meet the high-speed, low-
latency, and high-capacity needs of scientific research,
education, and data-intensive applications. These networks
connect research institutions, universities, and data centers
across large geographical areas.
InfiniBand: Traditionally, a localised data interconnect used by
many high performance computing (HPC) systems, providing high
bandwidth and low latency.
National Research and Education Network (NREN): A specialised
network, run by a dedicated operator, that supports the research and
education community within a specific country or region. NRENs
provide high-speed connectivity and other services tailored to the
needs of academic and research institutions.
Remote direct memory access (RDMA): Enables one networked node to
access another networked node's memory without involving either
computer's operating system or interrupting either node's
processing. This helps minimise latency and maximise throughput,
reducing memory bandwidth bottlenecks.
RDMA over Converged Ethernet (RoCE): Traditionally, a network
protocol which allows remote direct memory access (RDMA) over a
local Ethernet network. There are multiple RoCE versions. RoCE
v1 [UEC] is an Ethernet link layer protocol and hence allows
communication between any two hosts in the same Ethernet broadcast
domain. RoCE v2 is an internet layer protocol which means that
RoCE v2 packets can be routed.
Worldwide LHC Computing Grid (WLCG): A global network of over 170
computing centres across more than 40 countries, designed to
process, store, and analyse the vast amounts of data generated by
the Large Hadron Collider (LHC) at CERN.
Performance Service Oriented Network monitoring Architecture
(perfSONAR): A network performance monitoring toolkit designed to
provide end-to-end performance measurement and monitoring across
multi-domain network infrastructures.
Science DMZ: A model for deployment of infrastructure at a site
(campus) to optimise the performance of data transfers in and out
of data transfer nodes (DTNs) at the site – see
https://fasterdata.es.net/science-dmz/. Elements of the model
include the local network architecture, tuning of DTNs, selection
of data transfer software, efficient implementation of security
policies, and persistent monitoring.
3. Example Use Cases for HP-WANs
HP-WAN development has become synonymous with large-scale research
and experimentation, big data, and AI. HPC, and therefore HP-WAN
development, is driven by the continuous need to move large volumes
of data between HPC facilities, supporting the following fields:
* High-Energy Physics Research, e.g., the Large Hadron Collider
(LHC)
* Climate Modeling
* Radioastronomy, e.g., the Square Kilometre Array (SKA) project
* Healthcare, Genomics and Life Sciences
* AI training
* Data Backup
The data rates required by HPC applications vary significantly based
on the application type and data scale.
Scientific simulations, such as climate modeling and molecular
dynamics, typically demand data rates from 10 Gbps to over 100 Gbps
due to the large volumes of data processed and moved between nodes
and storage systems.
In high-energy physics, such as experiments at CERN, data rates can
reach hundreds of gigabits per second. Aggregate peaks between sites
currently exceed 1 Tbps during intensive data processing and are
predicted to rise to 10 Tbps.
Healthcare, Genomics, and Life Sciences might typically operate at
rates between 1 Gbps and 40 Gbps. These applications require high
throughput to handle large datasets efficiently, often through
parallel data streams.
AI training and related tasks, particularly those involving deep
learning, require data rates ranging from 10 Gbps to 100 Gbps to
ensure efficient data movement, keeping GPUs and other accelerators
fully utilised.
These varying data rates underscore the high demands of HPC
applications. Demand is expected to grow as the field evolves,
datasets become larger, and more large datasets need to be moved
between sites (including data centers).
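To put these rates in perspective, the following minimal sketch
estimates the ideal time needed to move a dataset at a given sustained
rate. The function name, the 1 PB example size, and the `efficiency`
parameter are illustrative assumptions, not figures taken from any of
the networks described in this document; real transfers also depend on
storage performance, host tuning, and protocol overhead.

```python
def transfer_time_seconds(size_bytes: float, rate_bps: float,
                          efficiency: float = 1.0) -> float:
    """Ideal time to move size_bytes over a link of rate_bps (bits/s).

    efficiency is an assumed fraction (0 < efficiency <= 1) of the
    nominal rate that the transfer actually sustains.
    """
    if rate_bps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("rate must be positive and 0 < efficiency <= 1")
    return size_bytes * 8 / (rate_bps * efficiency)


if __name__ == "__main__":
    PETABYTE = 10**15  # decimal petabyte, chosen only as an example size
    for gbps in (1, 10, 100, 1000):
        hours = transfer_time_seconds(PETABYTE, gbps * 1e9) / 3600
        print(f"{gbps:>5} Gbps: {hours:10.1f} hours")
```

Under these assumptions, moving 1 PB takes about 22 hours at a
sustained 100 Gbps, which illustrates why multi-hundred-gigabit paths
matter for petabyte-scale science.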
4. Current Technologies Used in HP-WANs: Key Components
High Performance Computing (HPC) networks are specialised networks
designed to connect supercomputers and other high-performance
computing resources, enabling them to collaborate on computational
tasks that require significant processing power, memory, and data
storage. These networks facilitate large-scale scientific research,
complex simulations, and data-intensive tasks that exceed the
capabilities of standard computing systems.
The following sub-sections outline typical characteristics and
requirements for HP-WANs. These technical requirements ensure that
wide-area interconnects can meet the demanding needs of distributed
HPC environments, enabling researchers and scientists to collaborate
effectively globally.
4.1. Architectural Elements
Not all HP-WAN deployments rely on backbone controllers. In many
NRENs, capacity is provisioned ahead of demand on private, dedicated
infrastructure, with no QoS or bandwidth-on-demand systems in the
core.
Some HP-WAN network providers may choose, or need, to use specific
resource and/or network controllers to deliver such functionality.
Resource Controllers provide detailed control over individual network
resources, such as routers and switches, ensuring efficient usage and
reliable network performance through comprehensive monitoring and
configuration.
Network Controllers maintain global visibility of network topology,
resource availability, and status, essential for path computation,
resource reservation, and dynamic reconfiguration to meet stringent
performance demands.
End-to-End Orchestration translates user and application requirements
into actionable network operations, enabling automated, policy-driven
management and significantly improving resource responsiveness and
optimisation.
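As an illustration of the path computation role described above, the
following sketch shows one common approach: prune links that cannot
satisfy a bandwidth request, then run a shortest-path search on
latency. It is a hypothetical example only. The function name, the
topology, and the link figures are invented for illustration and do
not describe how any particular NREN or controller product works.

```python
import heapq


def constrained_shortest_path(links, src, dst, min_gbps):
    """Lowest-latency path using only links with enough spare capacity.

    links maps (node_a, node_b) -> (latency_ms, available_gbps) and is
    treated as bidirectional. Returns (total_latency_ms, [nodes]) or
    None when no feasible path exists.
    """
    # Step 1: prune links that cannot carry the requested bandwidth.
    graph = {}
    for (a, b), (latency, available) in links.items():
        if available >= min_gbps:
            graph.setdefault(a, []).append((b, latency))
            graph.setdefault(b, []).append((a, latency))

    # Step 2: Dijkstra on the pruned graph, using latency as the cost.
    best = {src: 0.0}
    previous = {}
    heap = [(0.0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            path = [dst]
            while path[-1] != src:
                path.append(previous[path[-1]])
            return cost, path[::-1]
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbour, latency in graph.get(node, []):
            new_cost = cost + latency
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(heap, (new_cost, neighbour))
    return None


if __name__ == "__main__":
    # Hypothetical four-node topology: (latency in ms, spare Gbps).
    topology = {
        ("A", "B"): (5, 100),
        ("B", "C"): (5, 40),
        ("A", "D"): (8, 400),
        ("D", "C"): (8, 400),
    }
    for request in (10, 100, 500):
        print(request, constrained_shortest_path(topology, "A", "C", request))
```

In this invented topology a 10 Gbps request takes the shorter A-B-C
path, a 100 Gbps request is steered onto A-D-C because the B-C link
lacks capacity, and a 500 Gbps request cannot be satisfied at all.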
4.2. Topology
HPC networks can be broadly categorised into intra-site networks,
which connect components within a single HPC site, such as a data
centre, and inter-site networks, which link multiple HPC sites across
different geographical locations. Intra-site networks typically use
high-speed, low-latency non-Internet interconnects like InfiniBand or
high-speed Ethernet. In contrast, inter-site (HP-WAN) networks rely
on dedicated high-capacity wide area networks (WANs) to facilitate
distributed computing and data sharing on a regional and global
scale.
Each NREN operator, e.g., Jisc in the case of Janet in the UK, will
build and operate the NREN infrastructure for its research and
education users. This may typically take the form of a well-
provisioned backbone, with regional access networks extending to the
end sites (campuses, research organisations, etc). The NREN
demarcation is typically at the campus edge. In some countries the
regional networks are operated separately.
The NRENs then typically have interconnects to other NRENs, forming a
worldwide R&E network infrastructure. In Europe, GÉANT provides
connectivity between the European NRENs and wider connectivity to the
rest of the world. NRENs also have interconnects to non-R&E networks,
e.g., via one or more national IXs, direct peerings with content
providers (including the big cloud providers), and "catch-all"
commodity connectivity via one or more Tier 1 ISPs.
Dedicated infrastructure is commonly used in HPC environments where
performance, security, and reliability are paramount. In these
cases, the network infrastructure is built exclusively for HPC
applications, including dedicated fibre-optic connections, private
data centres, and specialised network transport like RDMA over
Converged Ethernet (RoCE) and InfiniBand nodes. The primary benefits
of dedicated infrastructure are its ability to provide optimised
performance for HPC tasks, ensure high levels of security by
preventing unauthorised access, and maintain consistent reliability
by avoiding congestion or performance issues caused by other network
traffic.
Usually, the responsibility for networking within an end site or
campus lies with that organisation, e.g., a university IT department,
while the operation of an HPC facility may have dedicated (separate)
staff. With the additional administrative domains of the NRENs and
inter-NREN backbones like GÉANT, end-to-end traffic may pass through
many networks operated by different organisations. To achieve
optimal end-to-end performance, every operator on the path needs to
implement best practices.
4.3. Bandwidth and Latency
The technical requirements for wide area interconnects between HPC
sites are stringent, given the unique demands of distributed high-
performance computing. High bandwidth is a primary requirement, as
these interconnects must support the rapid transfer of large datasets
between sites, ensuring that data movement does not become a
bottleneck in computational workflows. HPC data flows might
typically consume from 1 Gbit/s to beyond 400 Gbit/s.
Low latency is equally critical for many HPC applications. Latency
requirements between data centre locations will be in the
low-millisecond range. This low latency is essential for
applications that require real-time or near-real-time data
processing.
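Bandwidth and latency interact through the bandwidth-delay product
(BDP): the amount of data that must be in flight to keep a path full.
The sketch below computes the BDP and the throughput ceiling imposed
by a fixed window. The function names and the example figures
(100 Gbit/s, 20 ms round-trip time, 64 KiB window) are assumptions
chosen for illustration.

```python
def bdp_bytes(rate_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to fill a path of rate_bps and rtt_s."""
    return rate_bps * rtt_s / 8


def window_limited_rate_bps(window_bytes: float, rtt_s: float) -> float:
    """Maximum rate a single flow can reach with a fixed window size."""
    return window_bytes * 8 / rtt_s


if __name__ == "__main__":
    rate = 100e9   # assumed 100 Gbit/s path
    rtt = 0.020    # assumed 20 ms round-trip time
    print(f"BDP: {bdp_bytes(rate, rtt) / 1e6:.0f} MB")
    small_window = 64 * 1024  # 64 KiB, an untuned window used as an example
    ceiling = window_limited_rate_bps(small_window, rtt)
    print(f"64 KiB window ceiling: {ceiling / 1e6:.1f} Mbit/s")
```

With these example values the BDP is 250 MB, while a 64 KiB window
caps a single flow at roughly 26 Mbit/s. This is one reason host and
buffer tuning on data transfer nodes matters as much as raw link
capacity.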
4.4. Localised Data Movement
Network-intensive applications like networked storage or cluster
computing need a network infrastructure with high bandwidth and low
latency.
These interconnects may need to support specialised communication
protocols designed for HPC environments, such as Remote Direct Memory
Access (RDMA) [RFC5040] and [RFC7306], which optimises the
performance of distributed HPC applications by reducing overhead and
improving data transfer efficiency.
InfiniBand (IB) is another computer networking communications
standard used in high-performance computing that features very high
throughput and very low latency. InfiniBand is also used as either a
direct or switched interconnect between servers and storage systems,
as well as an interconnect between storage systems.
The advantages of RDMA and IB over other network application
programming interfaces are lower latency, lower CPU load, and higher
bandwidth. The downside of these specialised protocols is that all
interfaces and nodes on the end-to-end path must support them.
iWARP is a computer networking protocol that implements remote direct
memory access (RDMA) for efficient data transfer over Internet
Protocol networks. Several IETF techniques are used for iWARP:
* [RFC5040] A Remote Direct Memory Access Protocol Specification is
layered over Direct Data Placement Protocol (DDP). It defines how
RDMA Send, Read, and Write operations are encoded using DDP into
headers on the network.
* [RFC5041] Direct Data Placement over Reliable Transports is
layered over MPA/TCP or SCTP. It defines how received data can be
directly placed into an upper-layer protocol's receive buffer without
intermediate buffers.
* [RFC5042] Direct Data Placement Protocol (DDP) / Remote Direct
Memory Access Protocol (RDMAP) Security analyzes security issues
related to iWARP DDP and RDMAP protocol layers.
* [RFC5043] Stream Control Transmission Protocol (SCTP) Direct Data
Placement (DDP) Adaptation defines an adaptation layer that
enables DDP over SCTP.
* [RFC5044] Marker PDU Aligned Framing for TCP Specification defines
an adaptation layer that enables preservation of DDP-level
protocol record boundaries layered over the TCP reliable connected
byte stream.
* [RFC6580] IANA Registries for the Remote Direct Data Placement
(RDDP) Protocol defines IANA registries for Remote Direct Data
Placement (RDDP) error codes, operation codes, and function codes.
* [RFC6581] Enhanced Remote Direct Memory Access (RDMA) Connection
Establishment fixes shortcomings with iWARP connection setup.
* [RFC7306] Remote Direct Memory Access (RDMA) Protocol Extensions
extends [RFC5040] with atomic operations and RDMA Write with
Immediate Data.
4.5. Forwarding Optimisation
The scaling of HPC applications, especially across a WAN between
multiple sites, requires the ability to route massive volumes of
traffic. Specifically, the network infrastructure must accommodate
several routing and forwarding characteristics, which are detailed
below.
* Low entropy: Compared to traditional data center workloads, the
number and diversity of flows are low, and flow patterns are usually
repetitive and predictable.
* Burstiness: Flows usually exhibit an "on and off" nature at a time
granularity of milliseconds.
* Jumbo frames: Ethernet frames larger than the standard maximum
transmission unit (MTU) size of 1,500 bytes, typically carrying
payloads of up to 9,000 bytes. Using jumbo frames can
significantly enhance network efficiency and reduce CPU overhead.
* Elephant flows: For each burst, the intensity of each flow could
reach up to the line rate of NICs.
It should be noted that efficiently handling these elephant flows is
crucial in HPC as they can otherwise saturate network links, leading
to congestion and reduced performance for other network traffic.
Strategies to manage elephant flows effectively, such as prioritising
these flows or segmenting network traffic, help maintain overall
network performance and ensure that large data transfers do not
hinder the execution of other critical tasks within the HPC
environment.
HPC transport options include UDP and TCP over IP, and emerging
mechanisms such as QUIC. Each transport technology has its own
strengths and weaknesses. In all cases, the primary goal is to ensure
the effective transmission of massive data sets with high throughput,
low latency and jitter, and a low packet loss ratio.
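The efficiency gain from jumbo frames noted in the list above can be
made concrete with a short calculation. The sketch assumes standard
Ethernet framing overhead (preamble, MAC header, FCS, and inter-frame
gap) and IPv4 plus TCP headers without options; the function names and
the 100 Gbit/s link rate are illustrative.

```python
# Per-frame Ethernet overhead in bytes: preamble + start-of-frame
# delimiter (8), MAC header (14), frame check sequence (4), and the
# minimum inter-frame gap (12).
ETH_OVERHEAD = 8 + 14 + 4 + 12
# Assumption: IPv4 (20 bytes) and TCP (20 bytes) headers, no options.
IP_TCP_HEADERS = 20 + 20


def goodput_fraction(mtu: int) -> float:
    """Fraction of wire capacity that carries TCP payload at this MTU."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)


def frames_per_second(link_bps: float, mtu: int) -> float:
    """Full-size frames per second needed to saturate the link."""
    return link_bps / (8 * (mtu + ETH_OVERHEAD))


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: goodput {goodput_fraction(mtu):.1%}, "
              f"{frames_per_second(100e9, mtu) / 1e6:.2f} M frames/s "
              f"at 100 Gbit/s")
```

Under these assumptions the payload share rises from about 95% to
about 99%, but the larger effect is the per-packet work: saturating a
100 Gbit/s link needs nearly six times fewer frames per second with a
9,000-byte MTU, which is where much of the CPU saving comes from.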
4.6. Reliability and High Availability
In HPC networks, the resilience of the data stream is important due
to the critical need for precise, high-speed data transfer. These
networks must maintain continuous data flow to support large-scale
computations, where even minor interruptions or packet loss can
severely impact performance, causing delays or incorrect results.
Therefore, resilience must be implemented to ensure the network can
recover from disruptions without compromising speed or integrity.
For retransmission and lossless data transfer, HPC networks must have
mechanisms to handle data loss efficiently. They must quickly
retransmit lost or corrupted packets while maintaining a seamless
data flow to avoid performance degradation. The requirement for
lossless communication is essential to meet the needs of scientific
computations, simulations, and data-intensive tasks.
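One widely cited way to see why even small loss rates matter on long
paths is the approximation by Mathis et al. for loss-based
(Reno-style) TCP, which bounds throughput by
(MSS / RTT) * (C / sqrt(p)), with C approximately sqrt(3/2). The
sketch below evaluates it. It is an approximation for one class of
congestion control only and does not describe BBR or RDMA transports;
the function name and example values are assumptions.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float,
                          loss_rate: float) -> float:
    """Approximate upper bound on loss-based TCP throughput in bits/s.

    Uses the Mathis et al. model: (MSS / RTT) * (C / sqrt(p)),
    with C = sqrt(3/2).
    """
    if rtt_s <= 0 or not 0 < loss_rate < 1:
        raise ValueError("rtt must be positive and 0 < loss_rate < 1")
    constant = math.sqrt(1.5)
    return (mss_bytes * 8 / rtt_s) * (constant / math.sqrt(loss_rate))


if __name__ == "__main__":
    # Assumed example: 50 ms RTT and a range of packet loss rates.
    for mss in (1460, 8960):
        for loss in (1e-3, 1e-5, 1e-7):
            rate = mathis_throughput_bps(mss, 0.050, loss)
            print(f"MSS {mss}, loss {loss:.0e}: {rate / 1e6:10.1f} Mbit/s")
```

In this model, a single 1,460-byte-MSS flow over a 50 ms path with a
loss rate of 0.001% is limited to roughly 90 Mbit/s, far below the
link rates discussed in this document, which is why lossless or
near-lossless paths are emphasised.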
High availability and redundancy are also essential to prevent data
loss and ensure continuous operation, especially given that HPC tasks
often run for extended periods and involve critical research. These
networks must also incorporate advanced security measures, including
encryption and secure access controls, to protect the often sensitive
or classified data being transmitted.
4.7. Quality of Service
Depending on the provisioning and contention for resources, the
network may need to support Quality of Service (QoS) mechanisms to
prioritise traffic, ensuring that critical HPC tasks receive the
necessary bandwidth and low-latency performance.
An approach may be needed to enable applications to request specific
bandwidth or latency guarantees, ensuring that high-priority tasks
receive the required resources. Such guarantees are hard to provide
over shared multi-domain infrastructure.
Differentiated Services (Diffserv) offers a flexible method to manage
traffic prioritization without the need for an explicit request-and-
grant process. Diffserv operates by marking packets with different
priority levels, allowing the network to prioritize and protect
access to capacity for critical tasks. This approach may be useful
in HPC environments where dynamic traffic patterns require adaptive
resource management.
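As a minimal sketch of the packet marking described above, an
application on a POSIX-style host can set the DSCP field through the
standard `IP_TOS` socket option. The choice of AF41 (DSCP 34) and the
helper names are assumptions for illustration; whether a marking is
honoured, remarked, or ignored depends entirely on the policies of
each network on the path, and `IP_TOS` behaviour differs between
operating systems.

```python
import socket

DSCP_AF41 = 34  # example code point only; agree markings with operators


def dscp_to_tos(dscp: int) -> int:
    """Place a 6-bit DSCP value in the upper bits of the IPv4 TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp << 2  # the low two bits are reserved for ECN


def open_marked_udp_socket(dscp: int) -> socket.socket:
    """Create an IPv4 UDP socket whose outgoing packets carry dscp."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock


if __name__ == "__main__":
    sock = open_marked_udp_socket(DSCP_AF41)
    print("TOS byte set to", sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS))
    sock.close()
```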
4.8. Congestion Control
Congestion control mechanisms ensure that data transfers between
nodes and across networks are efficient and do not overwhelm the HPC
network infrastructure. By managing and regulating the flow of data,
congestion control mechanisms help prevent bottlenecks, reduce
latency, and maintain high throughput, which are essential for the
performance and reliability of HPC applications that require the
rapid movement of large volumes of data across distributed systems.
Depending on the transport technology used in the HPC environment,
several congestion control schemes may be used:
* InfiniBand Congestion Control
* RDMA-based Data Center Quantized Congestion Notification (DCQCN)
* TCP-based Bottleneck Bandwidth and Round-Trip Time (BBRv3)
* eXplicit Control Protocol (XCP)
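For the TCP-based case, the sketch below shows how an application on
Linux might request a specific congestion control algorithm per
socket using the `TCP_CONGESTION` socket option. This is a hedged
illustration: the helper name is hypothetical, the option is
Linux-specific, and the algorithm name "bbr" selects whichever BBR
version the running kernel provides, which is not necessarily BBRv3.

```python
import socket


def set_congestion_control(sock: socket.socket, algorithm: str) -> bool:
    """Try to select a TCP congestion control algorithm for one socket.

    Returns True on success, or False when the platform lacks
    TCP_CONGESTION or the kernel does not offer the algorithm.
    """
    if not hasattr(socket, "TCP_CONGESTION"):
        return False  # option is not exposed on this platform
    try:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION,
                        algorithm.encode("ascii"))
    except OSError:
        return False  # algorithm unavailable or not permitted
    return True


if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Prefer BBR when the kernel offers it, otherwise keep the default.
    if set_congestion_control(sock, "bbr"):
        print("using bbr")
    else:
        print("bbr unavailable; keeping the system default")
    sock.close()
```

Selecting the algorithm per socket lets a data transfer tool opt in
without changing the host-wide default, although many sites set the
default system-wide on their data transfer nodes instead.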
4.9. Performance Monitoring
End-to-end performance measurement and monitoring across
multi-domain network infrastructures are important in HPC
environments. They provide a method to diagnose and troubleshoot
network performance issues that can affect data-intensive
applications and distributed computing tasks commonly found in HPC.
perfSONAR is a commonly used network measurement toolkit. It is
designed to provide federated coverage of network paths. It provides
an interface for scheduling measurements, storing data, and
generating visualisations.
Data transfer applications should log their transfers so that
monitoring tools can assess their end-to-end, disk-to-disk
performance; for example, FTS does this for WLCG data transfers.
4.10. Scalability
Scalability is another crucial aspect, allowing the network to expand
efficiently as computational needs grow, accommodating additional
sites or increased capacity without significant reconfiguration.
Interoperability is also necessary, ensuring that the network can
communicate seamlessly across different types of hardware, software,
and protocols used at various HPC sites.
4.11. Sustainability and Energy Efficiency
As HP-WANs continue to expand, sustainability and energy efficiency
are becoming critical considerations. The operational scale of these
networks, which span global infrastructures and serve data-intensive
applications, poses significant environmental and economic
challenges. Future HP-WAN deployments will increasingly prioritise
energy-efficient network components, smart power management systems,
and sustainable operational practices.
Emerging approaches include adaptive network management strategies
designed to reduce energy consumption during periods of lower
utilisation and leveraging advanced technologies such as optical
networking and energy-aware routing protocols. Furthermore,
industry-wide initiatives are focusing on measuring and reducing the
carbon footprint of data transfers and network operations,
contributing to broader climate goals.
4.12. Resource Scheduling
[Editor's Note - Do we need to discuss service and resource
scheduling?]
5. Examples of HP-WANs
The following sub-sections highlight examples of HP-WANs and their
technical specifications.
5.1. GÉANT
The GÉANT network is a pan-European data network dedicated to
research and education, providing high-speed, high-capacity
connectivity across Europe, between European NRENs and to other
worldwide NRENs. It is an essential infrastructure for HPC
applications, enabling collaboration and data sharing among research
institutions, universities, and HPC centers across the continent and
beyond.
The core of GÉANT operates at speeds of up to 600 Gbps, using Dense
Wavelength Division Multiplexing (DWDM) technology. This provides
connectivity suitable for HPC applications, particularly those
involving large-scale simulations, scientific research, and real-time
data processing. Reliability is provided by using multiple optical
underlay paths for data to travel between GÉANT nodes. This design
ensures high availability and reliability, which is crucial for the
continuous operation of HPC environments.
The GÉANT network integrates perfSONAR for real-time network
performance monitoring and reporting of IP performance metrics
[RFC6703], allowing HPC users to detect and troubleshoot potential
issues that could impact data transfer and overall performance. This
ensures that the high-performance requirements of HPC applications
are met consistently across the network.
GÉANT provides specialised services for specific HPC projects, such
as the LHC Optical Private Network (LHCOPN) and LHC Open Network
Environment (LHCONE), which are critical for supporting the data-
intensive needs of the Large Hadron Collider (LHC) at CERN. These
services offer dedicated, high-bandwidth connections that are
optimised for the massive data flows generated by LHC experiments.
The GÉANT network connects over 50 million users across more than
10,000 institutions in 40 countries. This extensive reach supports a
wide range of HPC applications by enabling seamless collaboration
between geographically dispersed research facilities. Beyond Europe,
GÉANT connects to other major research and education networks,
including Internet2 in the United States and CANARIE in Canada,
allowing for global HPC collaborations and data exchanges.
5.2. Janet
The Janet network is the UK NREN, operated by Jisc. First
established in 1984, Janet now has backbone links running at up to
800 Gbps, with a growing number of sites connected at 100 Gbps, in
some cases with multiple 100 Gbps links. A typical university site
will have multiple 10 Gbps links.
Janet connects to other R&E networks via a 400 Gbps resilient link to
GÉANT. It has a presence in multiple IXes, predominantly LINX,
connects/peers directly to many content and cloud providers, and has
commodity connectivity via Tier 1 ISPs. The total aggregate external
capacity is around 4-5 Tbit/s.
Some private, dedicated links are used by Janet sites, e.g., the CERN
to RAL (UK Tier 1 site) LHCOPN link, which is a dedicated 200 Gbps
path. LHCONE, by contrast, is an L3VPN that may run on its own
circuit or as an overlay on general R&E paths. Otherwise, Jisc seeks
to provide sufficient capacity to its science communities ahead of
demand, and works with them (via its network performance team) to
provide advice and guidance on how sites can optimise their use of
the Janet network, e.g., by following Science DMZ principles.
5.3. Google Effingo
Google Effingo is a state-of-the-art, high-performance infrastructure
designed to meet the demanding data processing and storage needs of
large-scale machine learning (ML), artificial intelligence (AI), and
computational workloads. As part of Google's cloud offering, Effingo
is an example of how WAN infrastructure supports high-performance
computing applications across diverse industries and research areas.
Effingo leverages a global network of data centers interconnected
with high-capacity, low-latency WAN links. These links facilitate
rapid data exchange and provide the performance required to handle
real-time AI model training, complex simulations, and large-scale
data analytics. The network is optimised for high-throughput
workloads, where low latency and reliability are critical for
processing large datasets across vast geographical areas and more
than 100 data center sites.
Effingo utilises a private global network of high-capacity fiber
links, combined with packet-layer protocols to deliver low-latency,
high-speed data transfer across continents. This connectivity
enables global collaboration between research centers, universities,
and data-driven enterprises, allowing them to share large datasets
and results.
Currently, Effingo's daily data transfers exceed 1 exabyte.
5.4. Energy Sciences Network
The Energy Sciences Network (ESnet) is a high-performance network
dedicated to supporting scientific research within the United States,
operated by the U.S. Department of Energy (DOE). Established in
1986, ESnet interconnects national laboratories, supercomputing
centres, universities, and research institutions, enabling
collaborative scientific projects, data-intensive applications, and
high-performance computing (HPC) tasks across multiple geographical
locations.
ESnet delivers high-capacity, low-latency connectivity through its
robust fibre-optic backbone, employing advanced optical networking
technologies and dynamic circuit provisioning services. It supports
data transfer rates ranging from tens of gigabits per second up to
multi-hundred gigabit per second capacities, essential for demanding
scientific workflows such as high-energy physics experiments, climate
modelling, and large-scale genomic research.
A key feature of ESnet is its use of specialised services such as the
On-Demand Secure Circuits and Advance Reservation System (OSCARS),
providing dynamic, guaranteed-bandwidth paths that allow researchers
to reserve network capacity tailored specifically to their project's
needs. Additionally, the network incorporates advanced orchestration
platforms like SENSE, offering intent-driven, automated management to
ensure optimal network resource utilisation and agile response to
evolving scientific requirements.
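The advance-reservation idea can be illustrated with a minimal
admission check for a single link. This is only a sketch under
stated assumptions: the Reservation and Link classes, their field
names, and the use of seconds and Gbps are hypothetical, and the code
does not reflect the OSCARS implementation or its API. A production
system would also compute paths across many links and handle
authentication, policy, and failure cases.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Reservation:
    start: int    # inclusive start time, in seconds (hypothetical unit)
    end: int      # exclusive end time, in seconds
    gbps: float   # guaranteed bandwidth requested


@dataclass
class Link:
    capacity_gbps: float
    booked: List[Reservation] = field(default_factory=list)

    def load_at(self, t: int) -> float:
        """Bandwidth already committed on this link at instant t."""
        return sum(r.gbps for r in self.booked if r.start <= t < r.end)

    def try_reserve(self, req: Reservation) -> bool:
        """Admit req only if capacity holds for its whole time window."""
        if req.start >= req.end or req.gbps <= 0:
            return False
        # Committed load can only increase at the start of a booking, so
        # it is enough to check the request's own start and every booked
        # start that falls inside the requested window.
        instants = {req.start} | {
            r.start for r in self.booked if req.start < r.start < req.end
        }
        if any(self.load_at(t) + req.gbps > self.capacity_gbps
               for t in instants):
            return False
        self.booked.append(req)
        return True


link = Link(capacity_gbps=100.0)
print(link.try_reserve(Reservation(0, 3600, 60.0)))     # empty link
print(link.try_reserve(Reservation(1800, 5400, 60.0)))  # overlaps first
print(link.try_reserve(Reservation(3600, 7200, 60.0)))  # starts as first ends
```

In this sketch the second request is refused because it would overlap
the first and exceed the 100 Gbps capacity, while the third is
admitted because its window begins when the first one ends.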
ESnet’s infrastructure integrates comprehensive monitoring and
diagnostic tools such as perfSONAR, ensuring end-to-end network
visibility and performance analysis across institutional boundaries.
This facilitates proactive identification and resolution of
performance bottlenecks, maintaining the reliability and efficiency
necessary for HPC operations.
With interconnections to international research networks, including
GÉANT, Janet, Internet2, and CANARIE, ESnet provides global reach,
facilitating extensive international collaboration and enabling the
seamless exchange of data among scientific communities worldwide.
5.4.1. Practical Examples of Dynamic Network Management
ESnet's OSCARS system exemplifies dynamic circuit provisioning with
advance reservation, demonstrating the practical application of
HP-WAN capabilities in operational scientific networks.
The SENSE platform further illustrates how intent-based networking
and automation can simplify complex resource allocation processes,
significantly improving network agility and scalability.
5.5. Internet2
Internet2 is a high-performance networking consortium serving the
United States research and education community. Established in 1996,
Internet2 provides advanced networking infrastructure specifically
designed to support collaborative research, scientific discovery, and
innovation among educational institutions, government laboratories,
and industry partners.
Internet2 operates an advanced optical backbone network capable of
multi-terabit speeds, delivering exceptionally high-capacity and
low-latency connections. As with the aforementioned networks, it
supports dynamic bandwidth allocation, advanced monitoring tools, and
federated identity management.
5.6. CANARIE
CANARIE is Canada's national research and education network,
established in 1993, dedicated to providing robust, high-performance
connectivity for research, education, and innovation. It
interconnects universities, research centres, healthcare
institutions, and government laboratories across Canada, as well as
facilitating international collaboration through global
interconnections with networks such as GÉANT, Internet2, and ESnet.
As with networks in other regions, the CANARIE network operates using
a high-capacity fibre-optic backbone, delivering advanced networking
services tailored specifically for demanding scientific and research
applications. The network provides dynamic, software-driven
capabilities, including dedicated high-speed links, automated
resource allocation, and integrated identity and access management
solutions. Additionally, CANARIE supports advanced services like the
Digital Accelerator for Innovation and Research (DAIR), enabling
cloud-based research and development.
5.7. Asia-Pacific Advanced Network
5.7.1. CERNET
The CERNET (China Education and Research Network) is a national high-
performance network serving China’s education, research and
innovation sectors. It delivers high-speed, secure connectivity
across the country—linking universities, research institutions and
HPC centers—and enables connections to global research networks,
playing a key role in supporting collaborative research, large-scale
data sharing and advanced HPC applications.
CERNET’s backbone operates at up to 100 Gbps (with ongoing upgrades)
using technologies like DWDM and SDN, handling HPC demands such as
scientific simulations, climate modeling and high-throughput data
processing. A redundant topology with alternate paths ensures ultra-
high reliability, minimizing downtime for critical HPC operations.
CERNET offers customized services for major national projects,
including dedicated high-bandwidth connections for the China
Spallation Neutron Source (CSNS) and national supercomputing centers
(e.g., Sunway TaihuLight). These support data-intensive workloads
and seamless data exchange.
CERNET covers 31 Chinese provinces and connects over 200 million
users across 2,000+ universities, institutes and schools, fostering
cross-regional research collaboration. Internationally, it peers
with global networks like GÉANT (Europe) and Internet2 (U.S.),
enabling global HPC cooperation.
5.7.2. China Mobile Cloud Dedicated Network
China Mobile Cloud Dedicated Network is a high-performance dedicated
service for enterprises, governments and industrial users. It has
built a large-scale SRv6 Policy-based cloud dedicated network,
offering secure, low-latency connectivity between on-premises data
centers, branches and its public cloud—supporting mission-critical
tasks like real-time industrial control and large-scale data
analytics.
It covers 31 provinces and over 300 cities, with nearly 50 Tbps of
inter-provincial bandwidth and over 800 SRv6 Policy-enabled devices.
Its backbone uses Segment Routing (SR); some backbone links support
400 Gbps OTN to handle high-bandwidth demands. A multi-path
redundant architecture ensures 99.99% availability, reducing business
disruptions.
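To put the 99.99% figure in context, the arithmetic below converts an
availability target into a yearly downtime budget and shows how
redundant paths can combine to reach such a target. It is a sketch
only: the per-path value of 99% is an illustrative assumption rather
than a figure reported for this network, and the combination formula
assumes that paths fail independently, which may not hold where
paths share ducts, power, or equipment.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def parallel_availability(path_availabilities) -> float:
    """Availability of redundant paths, assuming independent failures.

    Service is unavailable only when every path is down at once.
    """
    all_down = 1.0
    for a in path_availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


# 99.99% availability leaves a budget of roughly 52.6 minutes per year.
print(f"{downtime_minutes_per_year(0.9999):.1f} minutes per year")

# Two hypothetical paths, each 99% available, with independent failures.
print(f"{parallel_availability([0.99, 0.99]):.4%}")
```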
It integrates an intelligent management platform for real-time
monitoring of latency, jitter and bandwidth, and connects tens of
thousands of enterprise users domestically to ease inter-office
collaboration and cloud migration. Internationally, it links to
global cloud providers and overseas data centers for secure cross-
border data transmission.
6. Emerging Trends and Future Directions
As HP-WANs continue to evolve, driven by emerging requirements from
scientific research, high-performance computing, distributed
artificial intelligence, and industrial data analytics, several key
trends and future directions are shaping the next generation of HP-
WANs.
While these capabilities may not yet be widely deployed on R&E
infrastructures, there is clear interest in the NREN operator
community in tracking their development and readiness for production
deployment.
6.1. Integrated Resource and Network Control
Enhanced integration between resource controllers and network
controllers for scheduled services can maximise network efficiency.
This tighter integration aims to deliver more granular and efficient
control over network resources, enabling dynamic, on-demand bandwidth
allocation and optimised resource allocation decisions. Such
integration facilitates more effective orchestration of network
resources, aligning network performance closely with application
requirements.
For example, a typical trend for HP-WANs is to satisfy the needs of
distributed or decentralized AI through so-called "scale-across"
networking between multiple large data centers over long distances.
The integration of network resources and applications should reach
the underlying resources, such as the optical transport layer
[HIC-OTN] and the routing policies configured by network controllers,
as well as service requests. This integration requires coordination
between the optical resource management platform, the network
controller, and the service orchestrator.
6.2. Intent-Based Networking and Automation
Intent-based networking (IBN) and automation technologies play an
increasingly important role in the management and orchestration of
HP-WANs. IBN allows network administrators to define desired network
states or outcomes, with automated systems translating these intents
into actionable network configurations. As discussed earlier,
platforms such as ESnet's SENSE provide valuable practical
demonstrations of how intent-driven orchestration can significantly
enhance agility, scalability, and operational efficiency.
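The translation step from intent to configuration can be sketched as
follows. Everything here is hypothetical and chosen for
illustration: the Intent and Path classes, their fields, the
lowest-latency selection rule, and the shape of the returned
configuration are assumptions, and none of it represents the SENSE
API or data model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Intent:
    src: str
    dst: str
    min_gbps: float         # bandwidth the service must receive
    max_latency_ms: float   # latency the service can tolerate


@dataclass(frozen=True)
class Path:
    hops: Tuple[str, ...]   # ordered node names, source first
    free_gbps: float        # currently unreserved bandwidth
    latency_ms: float       # end-to-end latency of the path


def translate(intent: Intent, candidates: List[Path]) -> Optional[dict]:
    """Map a declarative intent onto one concrete path configuration.

    Returns None when no candidate path satisfies the intent, so the
    caller can report the intent as unsatisfiable.
    """
    feasible = [
        p for p in candidates
        if p.hops
        and p.hops[0] == intent.src
        and p.hops[-1] == intent.dst
        and p.free_gbps >= intent.min_gbps
        and p.latency_ms <= intent.max_latency_ms
    ]
    if not feasible:
        return None
    # Assumed policy: prefer the lowest-latency feasible path.
    best = min(feasible, key=lambda p: p.latency_ms)
    return {
        "explicit_route": list(best.hops),
        "reserved_gbps": intent.min_gbps,
    }


paths = [
    Path(("A", "B", "C"), free_gbps=50.0, latency_ms=20.0),
    Path(("A", "D", "C"), free_gbps=200.0, latency_ms=35.0),
    Path(("A", "E", "C"), free_gbps=200.0, latency_ms=80.0),
]
print(translate(Intent("A", "C", min_gbps=100.0, max_latency_ms=40.0), paths))
```

The operator states only the outcome (100 Gbps from A to C within
40 ms); the choice of the route through D follows from the candidate
paths rather than from anything the operator specified.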
6.3. Network Signalling
As the scale and complexity of HP-WAN deployments grow, efficient
signalling mechanisms become increasingly critical, especially when
running HP-WAN services over shared public infrastructure.
Applications may want to signal their desired bandwidth to the
network, enabling more precise rate negotiation and collaborative
congestion control, to achieve a targeted completion time for the
data transfer.
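The relationship between a targeted completion time and the bandwidth
an application might signal is simple arithmetic, sketched below.
The function names, the use of decimal units (1 TB = 10^12 bytes),
and the example figures are assumptions made for illustration; no
particular signalling protocol is implied, and protocol overhead and
retransmissions are ignored.

```python
def required_rate_gbps(remaining_bytes: float, remaining_s: float) -> float:
    """Average rate needed to finish the transfer within remaining_s."""
    if remaining_s <= 0:
        raise ValueError("deadline has already passed")
    return remaining_bytes * 8 / remaining_s / 1e9


def completion_time_s(remaining_bytes: float, rate_gbps: float) -> float:
    """Time to finish the transfer at a sustained rate of rate_gbps."""
    if rate_gbps <= 0:
        raise ValueError("rate must be positive")
    return remaining_bytes * 8 / (rate_gbps * 1e9)


# Hypothetical transfer: 500 TB to be delivered within 6 hours.
size_bytes = 500e12
deadline_s = 6 * 3600

wanted = required_rate_gbps(size_bytes, deadline_s)
print(f"rate to request: {wanted:.1f} Gbps")

# If the network can only offer 100 Gbps, the application can work out
# the completion time it would get instead and decide whether to accept.
offered_gbps = 100.0
print(f"completion at offer: {completion_time_s(size_bytes, offered_gbps):.0f} s")
```

Repeating the first calculation with the bytes and time still
remaining lets an application adjust its request as the transfer
progresses.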
Therefore, efficient and scalable signalling approaches are vital for
dynamic resource allocation in HP-WAN environments. Effective
protocols must support rapid dissemination of resource states and
swift propagation of requests between network components, minimising
latency and overhead.
Desirable properties of signalling mechanisms in HP-WANs include
extensibility, low overhead, real-time responsiveness, and
robustness, so that they can support diverse technologies and ensure
reliable, high-performance communication.
7. IANA Considerations
This document makes no requests for action by IANA.
8. Security Considerations
The security requirements for HPC networks, particularly in inter-
data center scenarios, are crucial to ensuring the integrity,
confidentiality, and availability of sensitive data and computational
resources. These requirements are stringent due to the high-value
and often sensitive nature of the data processed within HPC systems,
such as research data in fields like national defense,
pharmaceuticals, and climate science.
9. Acknowledgements
This document was partly motivated by the discussion occurring on the
IETF hp-wan@ietf.org mailing list.
The authors would like to thank Gorry Fairhurst and Zahed Sarker for
their reviews and suggestions.
Contributors
The following authors contributed significantly to this document:
Nicholas Race
Lancaster University
United Kingdom
Email: n.race@lancaster.ac.uk
Normative References
Informative References
[HIC-OTN] Sun, J., "Decentralized Training over 100km Based on
Optical Transport Network for Artificial Intelligence",
2024,
<https://ieeexplore.ieee.org/abstract/document/10926085>.
[RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
Garcia, "A Remote Direct Memory Access Protocol
Specification", RFC 5040, DOI 10.17487/RFC5040, October
2007, <https://www.rfc-editor.org/info/rfc5040>.
[RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
Data Placement over Reliable Transports", RFC 5041,
DOI 10.17487/RFC5041, October 2007,
<https://www.rfc-editor.org/info/rfc5041>.
[RFC5042] Pinkerton, J. and E. Deleganes, "Direct Data Placement
Protocol (DDP) / Remote Direct Memory Access Protocol
(RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October
2007, <https://www.rfc-editor.org/info/rfc5042>.
[RFC5043] Bestler, C., Ed. and R. Stewart, Ed., "Stream Control
Transmission Protocol (SCTP) Direct Data Placement (DDP)
Adaptation", RFC 5043, DOI 10.17487/RFC5043, October 2007,
<https://www.rfc-editor.org/info/rfc5043>.
[RFC5044] Culley, P., Elzur, U., Recio, R., Bailey, S., and J.
Carrier, "Marker PDU Aligned Framing for TCP
Specification", RFC 5044, DOI 10.17487/RFC5044, October
2007, <https://www.rfc-editor.org/info/rfc5044>.
[RFC6580] Ko, M. and D. Black, "IANA Registries for the Remote
Direct Data Placement (RDDP) Protocols", RFC 6580,
DOI 10.17487/RFC6580, April 2012,
<https://www.rfc-editor.org/info/rfc6580>.
[RFC6581] Kanevsky, A., Ed., Bestler, C., Ed., Sharp, R., and S.
Wise, "Enhanced Remote Direct Memory Access (RDMA)
Connection Establishment", RFC 6581, DOI 10.17487/RFC6581,
April 2012, <https://www.rfc-editor.org/info/rfc6581>.
[RFC6703] Morton, A., Ramachandran, G., and G. Maguluri, "Reporting
IP Network Performance Metrics: Different Points of View",
RFC 6703, DOI 10.17487/RFC6703, August 2012,
<https://www.rfc-editor.org/info/rfc6703>.
[RFC7306] Shah, H., Marti, F., Noureddine, W., Eiriksson, A., and R.
Sharp, "Remote Direct Memory Access (RDMA) Protocol
Extensions", RFC 7306, DOI 10.17487/RFC7306, June 2014,
<https://www.rfc-editor.org/info/rfc7306>.
[UEC] Ultra Ethernet Consortium, "Ultra Ethernet Specification
v1.0.1", 2025, <https://ultraethernet.org/uec-1-0-spec>.
Authors' Addresses
Daniel King
Lancaster University
Email: d.king@lancaster.ac.uk
Tim Chown
Jisc
Email: tim.chown@jisc.ac.uk
Chris Rapier
Pittsburgh Supercomputing Center
Email: rapier@psc.edu
Daniel Huang
ZTE Corporation
Email: huang.guangping@zte.com.cn
Kehan Yao
China Mobile
Email: yaokehan@chinamobile.com