Dynamic Networks to Hybrid Cloud DCs: Problem Statement and Mitigation Practices
draft-ietf-rtgwg-net2cloud-problem-statement-28
The information below is for an old version of the document.
Document | Type | This is an older version of an Internet-Draft whose latest revision state is "Active". | |
---|---|---|---|
Authors | Linda Dunbar , Andrew G. Malis , Christian Jacquenet , Mehmet Toy , Kausik Majumdar | ||
Last updated | 2023-08-08 (Latest revision 2023-07-26) | ||
Replaces | draft-dm-net2cloud-problem-statement | ||
RFC stream | Internet Engineering Task Force (IETF) | ||
Reviews | ARTART Last Call review (of -41) by Rich Salz: Ready w/nits; TSVART Last Call review (of -32) by Magnus Westerlund: Ready w/issues; INTDIR Early review (of -26) by Benson Muite: Ready w/nits; RTGDIR Early review (of -22) by Ines Robles: Has issues; OPSDIR Early review (of -22) by Susan Hares: Has issues; SECDIR Early review (of -22) by Deb Cooley: Has issues; DNSDIR Early review (of -22) by Florian Obser: Ready w/nits; GENART Early review (of -21) by Paul Kyzivat: Ready w/nits | ||
Additional resources | Mailing list discussion | ||
Stream | WG state | WG Document | |
Document shepherd | (None) | ||
IESG | IESG state | I-D Exists | |
Consensus boilerplate | Unknown | ||
Telechat date | (None) | ||
Responsible AD | (None) | ||
Send notices to | (None) |
Network Working Group                                        L. Dunbar
Internet Draft                                               Futurewei
Intended status: Informational                                A. Malis
Expires: February 8, 2024                             Malis Consulting
                                                          C. Jacquenet
                                                                Orange
                                                                M. Toy
                                                               Verizon
                                                           K. Majumdar
                                                             Microsoft
                                                        August 8, 2023

Dynamic Networks to Hybrid Cloud DCs: Problem Statement and Mitigation Practices
draft-ietf-rtgwg-net2cloud-problem-statement-28

Abstract

   This document describes the network-related problems that enterprises face, at the time of writing, when interconnecting their branch offices with dynamic workloads in third-party data centers (DCs), a.k.a. Cloud DCs. The Net2Cloud problem statements are mainly for enterprises with traditional VPN services who want to leverage those networks (instead of altogether abandoning them). Other problems are out of the scope of this document. This document also describes the mitigation practices for getting around the identified problems.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

Dunbar, et al.                                                [Page 1]
Internet-Draft          Net2Cloud Problem Statement

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

   This Internet-Draft will expire on February 8, 2024.

Copyright Notice

   Copyright (c) 2023 IETF Trust and the persons identified as the document authors. All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
   2. Definition of terms
   3. Issues and Mitigation Methods of Connecting to Cloud DCs
      3.1. Increased BGP Peering Errors and Mitigation Methods
      3.2. Site failures and Methods to Minimize Impacts
      3.3. Optimal Paths to Cloud DC locations
      3.4. Network Issues for 5G Edge Clouds and Mitigation Methods
      3.5. DNS Practices for Hybrid Workloads
      3.6. NAT Practice for Accessing Cloud Services
      3.7. Cloud Discovery Practices
   4. Dynamic Connecting Enterprise Sites with Cloud DCs
      4.1. Sites to Cloud DC
      4.2. Inter-Cloud Connection
      4.3. Extending Private VPNs to Hybrid Cloud DCs
   5. Methods to Scale IPsec tunnels to Cloud DCs
      5.1. Improvement IPsec Tunnels Management
      5.2. Improving performance Over the Public Internet
   6. Requirements for Dynamic Cloud Data Center VPNs
   7. Security Considerations
   8. IANA Considerations
   9. References
      9.1. Normative References
      9.2. Informative References
   10. Acknowledgments

1. Introduction

   With the advent of widely available Cloud data centers (DCs) providing services in diverse geographic locations and advanced tools for monitoring and predicting application behaviors, it is desirable for enterprises to instantiate applications and workloads in locations close to their end users. The proximity can improve end-to-end latency and overall user experience. In addition, applications and workloads can be shut down or moved along with end users in motion (thereby modifying the networking connection of subsequently relocated applications and workloads).

   Key characteristics of Cloud services are on-demand provisioning, scalability, high availability, and usage-based billing. Most Cloud operators provide Cloud network functions, such as virtual firewall services, virtual private network services, and virtual PBX services including voice and video conferencing systems. A Cloud DC is shared infrastructure that hosts services for many customers.

   This document describes the network-related problems that enterprises face, at the time of writing, when interconnecting their branch offices with dynamic workloads in Cloud DCs, and the mitigation practices.

2. Definition of terms

   Cloud DC: Third-party data centers that usually host applications and workloads owned by different organizations or tenants.

   Controller: Used interchangeably with SD-WAN controller, which manages SD-WAN overlay path creation/deletion and monitors the path conditions between two or more sites.

   Heterogeneous Cloud: Applications and workloads split among Cloud DCs owned or managed by different operators.

   Hybrid Clouds: An enterprise using its own on-premises DCs in addition to Cloud services provided by one or more Cloud operators (e.g., AWS, Azure, Google, Salesforce, SAP, etc.).

   IXP: Internet eXchange Point, a physical location at which Network/Internet Service Providers, Cloud operators, CDNs, etc., co-locate equipment to exchange traffic.

   VPC: Virtual Private Cloud, a virtual network dedicated to one client account. It is logically isolated from other virtual networks in a Cloud DC. Each client can launch its desired resources, such as compute, storage, or network functions, into its VPC. At the time of writing, most Cloud operators' VPCs only support private addresses; some support IPv4 only, others support IPv4/IPv6 dual stack.

3. Issues and Mitigation Methods of Connecting to Cloud DCs

   This section identifies some of the high-level problems that the IETF can address, especially the Routing Area. Other Cloud DC problems are out of the scope of this document; for example, managing cloud spending is not discussed here.

3.1. Increased BGP Peering Errors and Mitigation Methods

   Unlike traditional network service providers, who usually have pre-negotiated peering policies with their BGP peers over fixed interfaces, Cloud GWs (Gateways) need to peer with a much wider variety of parties via private circuits, VPNs, and the public Internet. Many of those peering parties may not be traditional network service providers. Their BGP configuration practices might not be consistent, and some configurations are done by less experienced personnel. All of this can contribute to increased BGP peering errors such as capability mismatch, unwanted route leaks, missing Keepalives, and errors causing BGP ceases. Capability mismatch can prevent BGP sessions from being established properly.

   Here are the recommended mitigation practices:

   - If a Cloud GW (BGP speaker) receives from its peer a capability that it does not itself support or recognize, it must ignore that capability, and the BGP session must not be terminated, per RFC 5492. When receiving a BGP UPDATE with a malformed attribute, the revised BGP error handling procedure [RFC7606] should be followed instead of resetting the session.

   - When a Cloud DC doesn't support multi-hop eBGP peering with external devices (as many don't), enterprise GWs must establish tunnels (e.g., IPsec) to the Cloud GWs to form the IP adjacency.

   - When a Cloud DC eBGP session supports only a limited number of routes from external entities, the on-premises DCs need to set up default routes to minimize the number of routes exchanged with the Cloud DC eBGP peers.

   - When a Cloud GW receives inbound routes exceeding the maximum routes threshold for a peer, the current common practice is to generate out-of-band alerts (e.g., Syslog) via the management system or to terminate the BGP session (with Cease notification messages [RFC4486] being sent). More discussion is needed in the IETF IDR WG on potential in-band or autonomous notification directly to the peers when the inbound routes exceed the maximum routes threshold.

3.2. Site failures and Methods to Minimize Impacts

   Failures within a Cloud site, which can be a building, a floor, a POD, or a server rack, include capacity degradation or complete out-of-service. Here are some events that can trigger a site failure: a) fiber cut for links connecting to the site or among pods within the site; b) cooling failures; c) insufficient backup power; d) cyber attacks; e) too many changes outside of the maintenance window; etc.

   Fiber cuts are not uncommon in a Cloud site or between sites. As described in RFC 7938, a Cloud DC might not have an IGP to route around link/node failures within its domain.
   When a site failure happens, the Cloud DC GW visible to clients is running fine; therefore, the site failure is not detectable by the clients using Bidirectional Forwarding Detection (BFD).

   When a site's capacity degrades or goes to zero, a massive number of routes are impacted. In addition, the routes (IP prefixes) in a Cloud DC cannot be aggregated nicely, triggering a very large number of BGP UPDATE messages when a site failure occurs.

   [RFC7432] specifies a mass withdrawal mechanism for EVPN to signal a large number of routes being changed to remote PE nodes as quickly as possible.

   [METADATA-PATH] specifies a Metadata Path Attribute added to the BGP UPDATE message for a Cloud GW to notify the relevant ingress routers of the mass of impacted routes with a single BGP UPDATE message.

3.3. Optimal Paths to Cloud DC locations

   Many applications have multiple instances instantiated in different Cloud DCs. A commonly deployed solution has DNS server(s) responding to an FQDN (Fully Qualified Domain Name) inquiry with an IP address of the closest or lowest-cost DC that can reach the instance. Here are some problems associated with DNS-based solutions:

   - Dependent on client behavior
      - A misbehaving client can cache results indefinitely.
      - A client may not receive service even though there are servers available in other Cloud DCs, because the failing IP address is still cached in the DNS resolver and has not expired yet.

   - No inherent leverage of proximity information present in the network (routing) layer, resulting in loss of performance.

   - Inflexible traffic control: The local DNS resolver becomes the unit of traffic management. This requires DNS to receive periodic updates of the network condition, which is difficult.
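
   The caching problem in the list above can be illustrated with a toy resolver cache. This is only a sketch under stated assumptions: the class and function names, the 300-second TTL, and the documentation-range addresses are all hypothetical, and no real DNS library is involved.

```python
from dataclasses import dataclass


@dataclass
class CachedAnswer:
    address: str
    expires_at: float  # absolute time (seconds) at which the TTL runs out


class StubResolverCache:
    """Toy resolver cache (hypothetical): serves from cache until TTL expiry."""

    def __init__(self, authoritative):
        # authoritative is a callable: name -> (address, ttl_seconds)
        self._authoritative = authoritative
        self._cache = {}

    def resolve(self, name, now):
        entry = self._cache.get(name)
        if entry is not None and now < entry.expires_at:
            # Cached answer is returned even if that DC has since failed.
            return entry.address
        address, ttl = self._authoritative(name)
        self._cache[name] = CachedAnswer(address, now + ttl)
        return address


def demo():
    current = {"dc": "192.0.2.10"}  # address the authoritative side hands out

    def authoritative(name):
        return current["dc"], 300  # assumed TTL of 300 seconds

    cache = StubResolverCache(authoritative)
    first = cache.resolve("app.example.com", now=0)
    # The first DC fails; DNS is updated to point at another Cloud DC.
    current["dc"] = "198.51.100.20"
    during_ttl = cache.resolve("app.example.com", now=100)
    after_ttl = cache.resolve("app.example.com", now=301)
    return first, during_ttl, after_ttl
```

   In this sketch the client keeps receiving the failed DC's address until the cached record expires, even though the authoritative data already points elsewhere.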
   One method to mitigate the problems listed above is to use anycast [RFC4786] for the services so that network proximity and conditions can be inherently considered in optimal path selection.

   [SERVICE-METRICS] identifies the metrics that can be utilized by the ingress routers to make path selections based not only on the routing cost but also on the running environment of the edge services.

3.4. Network Issues for 5G Edge Clouds and Mitigation Methods

   The 5G Edge Clouds [3GPP-5G-Edge] may host edge computing applications for ultra-low latency services on virtual or physical servers. Those edge computing applications have low latency connections to the UEs (User Equipment) and might have other connections to backend servers or databases in other locations.

   The low latency traffic to/from the UEs is transported through the 5G Core (gNB (Next Generation Node B) <-> UPFs (User Plane Functions)) and the 5G Local Data Networks (LDN) to the edge Cloud DCs. The LDN's ingress routers connected to the UPFs might be co-located with 5G Core functions in the edge Clouds. The 5G Core functions include Radio Control Functions, Session Management Functions (SMF), Access and Mobility Management Functions (AMF), User Plane Functions (UPF), and others.

   Here are some network problems with connecting the services in the 5G Edge Clouds:

   1) The difference in routing distances to multiple server instances in different edge Clouds is relatively small. Therefore, the Edge Cloud with the shortest routing distance might not be the best in providing the overall latency.

   2) Capacity status at the Edge Cloud might play a more significant role in end-to-end performance.

   3) Sources (UEs) can ingress from different LDN ingress routers due to mobility.

   [METADATA-PATH] describes a mechanism to get around those problems.
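
   Problems 1) and 2) above can be illustrated with a small sketch in which an ingress router ranks edge sites by more than routing distance. Everything here is an assumption made for illustration: the load field, the additive scoring formula, and the weight are hypothetical and are not taken from [METADATA-PATH] or [SERVICE-METRICS].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeSite:
    name: str
    routing_cost: int  # routing distance as seen from the ingress router
    load: float        # hypothetical metric: fraction of capacity in use, 0.0 to 1.0


def combined_score(site, load_weight=100.0):
    """Lower is better. The weight is an illustrative knob, not a standard value."""
    return site.routing_cost + load_weight * site.load


def pick_by_routing_cost(sites):
    """Classic selection: shortest routing distance wins."""
    return min(sites, key=lambda s: s.routing_cost)


def pick_by_combined_score(sites, load_weight=100.0):
    """Skip sites with no spare capacity, then weigh distance against load."""
    usable = [s for s in sites if s.load < 1.0]
    if not usable:
        raise ValueError("no edge site has spare capacity")
    return min(usable, key=lambda s: combined_score(s, load_weight))


SITES = [
    EdgeSite("edge-a", routing_cost=10, load=0.9),
    EdgeSite("edge-b", routing_cost=12, load=0.2),
    EdgeSite("edge-c", routing_cost=11, load=1.0),
]
```

   With these sample values the routing distances differ only slightly, so the shortest-distance choice (edge-a) lands on a heavily loaded site, while the combined score prefers edge-b.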
   [METADATA-PATH] extends the BGP UPDATE messages for a Cloud GW to propagate the edge service-related metrics so that the ingress routers can incorporate the destination site's capabilities with the routing distance in computing the optimal paths.

3.5. DNS Practices for Hybrid Workloads

   DNS name resolution is essential for on-premises and cloud-based resources. For customers with hybrid workloads, which include on-premises and cloud-based resources, extra steps are necessary to configure DNS to work seamlessly across both environments.

   Cloud operators have their own DNS to resolve resources within their Cloud DCs and well-known public domains. A Cloud's DNS can be configured to forward queries to customer-managed authoritative DNS servers hosted on-premises and to respond to DNS queries forwarded by on-premises DNS servers.

   For enterprises utilizing Cloud services from different Cloud operators, it is necessary to establish policies and rules on how and where to forward DNS queries. When applications in one Cloud need to communicate with applications hosted in another Cloud, DNS queries from one Cloud DC could be forwarded to the enterprise's on-premises DNS, which in turn forwards them to the DNS service in the other Cloud. Configuration can be complex depending on the application communication patterns.

   However, collisions can still occur even with carefully managed policies and configurations. If an organization uses an internal name like .internal and wants its services to be available via or within some other Cloud provider that also uses .internal, collisions might occur. Therefore, using globally unique domain names is better even when an organization does not make all its namespace globally resolvable.
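
   The forwarding rules and the .internal collision described above can be sketched as follows. This is a minimal illustration, assuming a longest-suffix-match rule table; the rule names, forwarder labels, and example domains are hypothetical and do not represent any Cloud operator's DNS configuration format.

```python
def normalize(name):
    """Lower-case a DNS name and drop any trailing dot."""
    return name.rstrip(".").lower()


def pick_forwarder(qname, rules, default):
    """Return the forwarder for the longest suffix of qname found in rules."""
    labels = normalize(qname).split(".")
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in rules:
            return rules[suffix]
    return default


def colliding_suffixes(zones_by_environment):
    """Map each suffix claimed by more than one environment to its claimants."""
    claimed = {}
    for environment, suffixes in zones_by_environment.items():
        for suffix in suffixes:
            claimed.setdefault(normalize(suffix), set()).add(environment)
    return {s: sorted(envs) for s, envs in claimed.items() if len(envs) > 1}


# Hypothetical on-premises forwarding policy built on a registered domain.
ON_PREM_RULES = {
    "corp.example.com": "on-prem-dns",
    "cloud-a.corp.example.com": "cloud-a-dns",
    "cloud-b.corp.example.com": "cloud-b-dns",
}

# Hypothetical private zones: two Clouds both use ".internal".
PRIVATE_ZONES = {
    "cloud-a": ["internal"],
    "cloud-b": ["internal"],
    "on-prem": ["corp.example.com"],
}
```

   With subdomains of a registered name, each query has exactly one matching forwarder, whereas the shared .internal suffix is reported as claimed by both Clouds and cannot be forwarded unambiguously.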
   An organization's globally unique DNS can include subdomains that cannot be resolved outside certain restricted paths, zones that resolve differently based on the origin of the query, and zones that resolve the same globally for all queries from any source.

   Globally unique names do not equate to globally resolvable names or even global names that resolve the same way from every perspective. Globally unique names can prevent any possibility of collisions at present or in the future, and they make DNSSEC trust manageable. Consider using a registered and fully qualified domain name (FQDN) from global DNS as the root for enterprise and other internal namespaces.

3.6. NAT Practice for Accessing Cloud Services

   Cloud resources, such as VMs (Virtual Machines) or application instances, are usually assigned private IP addresses. By configuration, some private subnets can have the NAT function to reach out to external networks, and some private subnets are internal to the Cloud only.

   Different Cloud operators support different levels of NAT functions. For example, AWS NAT Gateway does not currently support connections towards or from VPC Endpoints, VPN, AWS Direct Connect, or VPC Peering [AWS-NAT]. AWS Direct Connect/VPN/VPC Peering does not currently support any NAT functionality.

   Google's Cloud NAT [Google-NAT] allows Google Cloud VM instances without external IP addresses and private Google Kubernetes Engine (GKE) clusters to connect to the Internet. Cloud NAT implements outbound NAT in conjunction with a default route to allow instances to reach the Internet. It does not implement inbound NAT. Hosts outside the VPC network can only respond to established connections initiated by instances inside the Google Cloud; they cannot initiate new connections to Cloud instances via NAT.
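
   The outbound-only behavior described above can be modeled with a small translation table. This is a simplified sketch, not any Cloud operator's implementation: the class name, the port allocation scheme, and the documentation-range addresses are assumptions, and port exhaustion and mapping timeouts are ignored.

```python
class OutboundOnlyNat:
    """Toy outbound NAT: mappings are created only by inside-initiated flows."""

    def __init__(self, public_ip, first_port=1024):
        self.public_ip = public_ip
        self._next_port = first_port
        self._outbound = {}  # (priv_ip, priv_port, remote_ip, remote_port) -> public port
        self._inbound = {}   # (public port, remote_ip, remote_port) -> (priv_ip, priv_port)

    def outbound(self, private_ip, private_port, remote_ip, remote_port):
        """Translate a packet leaving the VPC; create a mapping if needed."""
        key = (private_ip, private_port, remote_ip, remote_port)
        if key not in self._outbound:
            port = self._next_port
            self._next_port += 1
            self._outbound[key] = port
            self._inbound[(port, remote_ip, remote_port)] = (private_ip, private_port)
        return self.public_ip, self._outbound[key]

    def inbound(self, public_port, remote_ip, remote_port):
        """Translate a packet arriving from outside.

        Returns the private endpoint for replies on an established mapping,
        or None when no inside host initiated the flow (packet is dropped).
        """
        return self._inbound.get((public_port, remote_ip, remote_port))
```

   In this model a reply from the remote endpoint that an instance contacted is translated back to the instance, while a packet from any other external host finds no mapping and is dropped.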
   For enterprises with applications running in different Cloud DCs, proper configuration of NAT must be performed in the Cloud DCs and in their on-premises DC.

3.7. Cloud Discovery Practices

   One of the concerns of using Cloud services is not knowing where a resource is located, as Cloud operators can move service instances from one place to another. When applications in the Cloud communicate with on-premises applications, it may not be clear where the Cloud applications are located or to which VPCs they belong.

   Being able to detect the location of Cloud services can help on-premises gateways (routers) connect to the services at a more optimal site when the enterprise's end users or policies change.

   For enterprises that instantiate virtual routers in Cloud DCs, metadata can be attached (e.g., in a GENEVE header or an IPv6 extension header) to indicate the geo-location of the Cloud DCs.

4. Dynamic Connecting Enterprise Sites with Cloud DCs

   For many enterprises with established private VPNs (e.g., private circuits, MPLS-based L2VPN/L3VPN) interconnecting branch offices and on-premises data centers, connecting to Cloud services will involve a mix of different types of networks. When an enterprise's existing VPN service providers do not have direct connections to the Cloud DCs that the enterprise prefers to use, the enterprise faces additional infrastructure and operational costs to utilize the Cloud services.

   This section describes some mechanisms for enterprises with private VPNs to connect to Cloud services dynamically.

4.1. Sites to Cloud DC

   Most Cloud operators offer some type of network gateway through which an enterprise can reach its workloads hosted in the Cloud DCs. For example, AWS (Amazon Web Services) offers the following options to reach workloads in AWS Cloud DCs [AWS-Cloud-WAN]:

   - AWS Internet gateway allows communication between instances in an AWS VPC and the Internet.
   - AWS Virtual gateway (vGW), where IPsec tunnels [RFC6071] are established between an enterprise's own gateway and the AWS vGW, so that the communications between those gateways can be secured from the underlay (which might be the public Internet).

   - AWS Direct Connect, which allows enterprises to purchase direct connections from network service providers to get a private leased line interconnecting the enterprise's gateway(s) and the AWS Direct Connect routers. In addition, an AWS Transit Gateway can be used to interconnect multiple VPCs in different Availability Zones. The AWS Transit Gateway acts as a hub that controls how traffic is forwarded among all the connected networks, which act like spokes.

   Microsoft Azure's Virtual WAN [Azure-SD-WAN] allows extension of a private network to any of the Microsoft Cloud services, including Azure and Office365. ExpressRoute is configured using Layer 3 routing. Customers can opt for redundancy by provisioning dual links from their location to two Microsoft Enterprise edge routers (MSEEs) located within a third-party ExpressRoute peering location. BGP is then set up over the WAN links to provide redundancy to the cloud. This redundancy is maintained from the peering data center into Microsoft's cloud network.

   Google's Cloud Dedicated Interconnect offers network connectivity options similar to those of AWS and Microsoft. One distinct difference, however, is that Google's service allows customers access to the entire global Cloud network by default. It does this by connecting the on-premises network with the Google Cloud using BGP and Google Cloud Routers to provide optimal paths to the different regions of the global cloud infrastructure.

   Figure 1 below shows an example in which some of a tenant's workloads are accessible via a virtual router connected by the AWS Internet Gateway, some are accessible via the AWS vGW, and others are accessible via AWS Direct Connect.
   Different types of access require different levels of security functions. Sometimes it is not visible to end customers which type of network access is used for a specific application instance. To get better visibility, separate virtual routers (e.g., vR1 and vR2) can be deployed to differentiate traffic to/from different Cloud GWs. It is important for some enterprises to be able to observe the specific behaviors when connected by different connections.

   A Customer Gateway can be a customer-owned router or ports physically connected to the AWS Direct Connect GW.

   +------------------------+
   |    ,---.         ,---. |
   |   (TN-1 )       ( TN-2)|
   |    `-+-'  +---+  `-+-' |
   |      +----|vR1|----+   |
   |           ++--+        |
   |            |         +-+----+
   |            |        /Internet\ For external customers
   |            +-------+ Gateway  +----------------------
   |                     \        / to reach via Internet
   |                      +-+----+
   |                        |
   |    ,---.         ,---. |
   |   (TN-1 )       ( TN-2)|
   |    `-+-'  +---+  `-+-' |
   |      +----|vR2|----+   |
   |           ++--+        |
   |            |         +-+----+
   |            |        / virtual\ For IPsec Tunnel
   |            +-------+ Gateway  +----------------------
   |            |        \        / termination
   |            |         +-+----+
   |            |           |
   |            |    + - - - - - - - - - - - - - - - --+
   |            |    |    +-+----+          +----+     |
   |            |        /        \ Direct /      \    |
   |            +----|--+ Gateway  +------+ Fabric|--VPN-- CPE
   |                     \        / Connect\ edge /    |
   |                 |    +-+----+          +----+     |
   |                                 IXP               |
   |                 + - - - - - - - - - - - - - - - --+
   +------------------------+

   TN: Tenant Network.

   Figure 1: Examples of Multiple Cloud DC connections.

4.2. Inter-Cloud Connection

   The connectivity options to Cloud DCs described in the previous section are for reaching Cloud providers' DCs, but not for connecting between Cloud DCs. When applications in the AWS Cloud need to communicate with applications in Azure, today's practice requires a third-party gateway (physical or virtual) to interconnect AWS's Layer 2 DirectConnect path with Azure's Layer 3 ExpressRoute.

   Enterprises can also instantiate their virtual routers in different Cloud DCs and administer IPsec tunnels among them.
   In summary, here are some approaches available to interconnect workloads among different Cloud DCs:

   a) Utilize Cloud DC provided inter/intra-cloud connectivity services (e.g., AWS Transit Gateway) to connect workloads instantiated in multiple VPCs. Such services are provided with the Cloud gateway to connect to external networks (e.g., AWS DirectConnect Gateway).

   b) Hairpin all traffic through the customer gateway, meaning all workloads are directly connected to the customer gateway, so that communications among workloads within one Cloud DC must traverse the customer gateway.

   c) Establish direct tunnels among different VPCs (AWS's Virtual Private Clouds) and VNETs (Azure's Virtual Networks) via the client's own virtual routers instantiated within Cloud DCs. NHRP (Next Hop Resolution Protocol) [RFC2735] based multipoint techniques can be used to establish direct multipoint-to-point or multipoint-to-multipoint tunnels among the client's own virtual routers.

   Approach a) usually does not work if the Cloud DCs are owned and managed by different Cloud providers.

   Approach b) adds transmission delay and incurs cost when traffic exits the Cloud DCs.

   For Approach c), [SDWAN-EDGE-DISCOVERY] describes a mechanism for virtual routers to advertise their properties for establishing proper IPsec tunnels among them.

4.3. Extending Private VPNs to Hybrid Cloud DCs

   Traditional private VPNs, including private circuits or MPLS-based L2/L3 VPNs, have been widely deployed as an effective way to support businesses and organizations that require network performance and reliability. Connecting an enterprise's on-prem CPEs to a Cloud DC via a private VPN requires the private VPN provider to have a direct path to the Cloud GW. When the user base changes, the enterprise might want to migrate its workloads/applications to a new Cloud DC location closer to the new user base.
   The existing private VPN provider might not have circuits at the new location. Deploying PE routers at new locations takes a long time (weeks, if not months).

   When the private VPN network can't reach the desired Cloud DCs, IPsec tunnels can dynamically connect the private VPN's PEs with the desired Cloud DC GWs. As private VPNs provide a higher quality of service, choosing the PE closest to the Cloud GW for the IPsec tunnel is desirable to minimize the IPsec tunnel distance over the public Internet.

   In order to support Explicit Congestion Notification (ECN) [RFC3168] usage by private VPN traffic, the PEs that establish the IPsec tunnels with the Cloud GW need to comply with the ECN behavior specified by [RFC6040].

   An enterprise can connect to multiple Cloud DC locations and establish BGP sessions with Cloud GW routers at different locations. As multiple Cloud DCs are interconnected by the Cloud provider's own internal network, its topology and routing policies are not transparent or even visible to the enterprise customer's on-prem routers. One Cloud GW BGP session might advertise all of the prefixes of the enterprise's VPC, regardless of which Cloud DC a given prefix resides in, which can cause suboptimal path selection by on-prem routers. To get around this problem, virtual routers in Cloud DCs can be used to attach metadata (e.g., in a GENEVE header or an IPv6 extension header) to indicate the geo-location of the Cloud DC, the delay measurement, or other relevant data.

5. Methods to Scale IPsec tunnels to Cloud DCs

   As described in Section 4.3, IPsec tunnels can be used to dynamically establish connections between private VPN PEs and Cloud GWs. Enterprises can also instantiate virtual routers within Cloud DCs to connect to their on-premises devices via IPsec tunnels.

   As described in [Int-tunnels], IPsec tunnels can introduce MTU problems.
   This document assumes that endpoints manage the appropriate MTU sizes, so VPN PEs are not required to perform fragmentation when encapsulating user payloads in IPsec packets.

5.1. Improvement IPsec Tunnels Management

   IPsec tunnels are a very convenient solution for an enterprise with a limited number of locations to reach a Cloud DC. However, for a medium-to-large enterprise with multiple sites and data centers connecting to multiple Cloud DCs, there are N*N IPsec tunnels among the Cloud DC gateways and all those sites. Each of those IPsec tunnels requires pair-wise periodic key refreshment. For a company with hundreds or thousands of locations, managing hundreds (or even thousands) of IPsec tunnels can be very processing intensive. That is why many Cloud operators only allow a limited number of (IPsec) tunnels and limited bandwidth for each customer.

   To scale IPsec key management, a solution like group encryption can be considered. The drawback of group encryption is the higher security risk of key distribution and the maintenance of a key server.

   [SECURE-EVPN] leverages BGP point-to-multipoint signaling to create private pair-wise IPsec Security Associations among peers without IKEv2 point-to-point signaling or any other direct peer-to-peer session establishment messages.

5.2. Improving performance Over the Public Internet

   IPsec encapsulation and decapsulation are very processing intensive, which can degrade router performance. NAT also adds to the performance burden.

   When enterprise CPEs or gateways are far away from Cloud DC gateways or across country/continent boundaries, performance of IPsec tunnels over the public Internet can be problematic and unpredictable. Even though there are many monitoring tools available to measure delay and various performance characteristics of the network, the measurement for paths over the Internet is passive, and past measurements may not represent future performance.

   [MULTI-SEG-SDWAN] describes some methods to utilize the Cloud backbone to interconnect enterprise CPEs in dispersed geographic locations without requiring the Cloud GW to decrypt and re-encrypt the traffic from the CPEs.

6. Requirements for Dynamic Cloud Data Center VPNs

   To address the issues identified in this document, any solution for enterprise VPNs that includes connectivity to dynamic workloads or applications in Cloud DCs should satisfy a set of requirements:

   - Scalable policy management: apply the appropriate policies to newly instantiated application instances at any Cloud DC location.

   - The solution should allow enterprises to take advantage of the current state-of-the-art private VPN technologies, including traditional circuit-based VPNs, MPLS-based VPNs, or IPsec-based VPNs (or any combination thereof) that run over the public Internet.

   - The solution should support scalable IPsec key management among all nodes involved in DC interconnect schemes.

   - The solution needs to support easy and fast, on-the-fly VPN connections to dynamic workloads and applications in Cloud DCs, and easily allow these workloads to migrate both within a data center and between data centers.

   - Traffic engineering to distribute loads across regions/AZs based on the performance/availability of workloads, as well as for connecting to other Cloud DCs.

   - Network traffic traceability, logging, and diagnostics.

7. Security Considerations

   The security issues in terms of networking to clouds include:

   - Service instances in Cloud DCs are connected to users (enterprises) via public IP ports, which are exposed to the following security risks:

      a) Potential DDoS attacks on the ports facing the untrusted network (e.g., the public Internet), which may propagate to the cloud edge resources.
        To mitigate this risk, anti-DDoS features need to be enabled
        on the ports facing the Internet.

     b) Potential risk of augmenting the attack surface with
        inter-Cloud DC connections by means of identity spoofing,
        man-in-the-middle, eavesdropping, or DDoS attacks. One example
        of mitigating such attacks is using DTLS to authenticate and
        encrypt MPLS-in-UDP encapsulation (RFC 7510).

   - Potential attacks from service instances within the cloud.
     Examples include data breaches, compromised credentials, broken
     authentication, hacked interfaces and APIs, and account
     hijacking.

   - When IPsec tunnels established from enterprise on-premises CPEs
     are terminated at the Cloud DC gateway where the workloads or
     applications are hosted, traffic to/from an enterprise's workload
     can be exposed to others behind the data center gateway (e.g.,
     to other organizations that have workloads in the same data
     center). To ensure that traffic to/from workloads is not exposed
     to unwanted entities, IPsec tunnels may go all the way to the
     workload (servers or VMs) within the DC.

   Many Cloud operators offer monitoring services for data stored in
   Clouds, such as AWS CloudTrail and Azure Monitor, and many
   third-party monitoring tools are available to improve visibility
   into data stored in Clouds.

   Solution drafts resulting from this work will address security
   concerns inherent to the solution(s), including both protocol
   aspects and the importance (for example) of securing workloads in
   Cloud DCs and the use of secure interconnection mechanisms.

8. IANA Considerations

   This document requires no IANA actions. RFC Editor: Please remove
   this section before publication.

9. References

9.1. Normative References

   [RFC2735] B. Fox, et al., "NHRP Support for Virtual Private
             Networks", RFC 2735, Dec. 1999.

   [RFC3168] K. Ramakrishnan, et al., "The Addition of Explicit
             Congestion Notification (ECN) to IP", RFC 3168, Sept.
             2001.

   [RFC4486] E. Chen and V.
             Gillet, "Subcodes for BGP Cease Notification Message",
             RFC 4486, April 2006.

   [RFC4786] J. Abley and K. Lindqvist, "Operation of Anycast
             Services", RFC 4786, Dec. 2006.

   [RFC6040] B. Briscoe, "Tunnelling of Explicit Congestion
             Notification", RFC 6040, Nov. 2010.

   [RFC7606] E. Chen, et al., "Revised Error Handling for BGP UPDATE
             Messages", RFC 7606, Aug. 2015.

   [RFC7432] A. Sajassi, et al., "BGP MPLS-Based Ethernet VPN",
             RFC 7432, Feb. 2015.

9.2. Informative References

   [RFC6071] S. Frankel and S. Krishnan, "IP Security (IPsec) and
             Internet Key Exchange (IKE) Document Roadmap", RFC 6071,
             Feb. 2011.

   [3GPP-5G-Edge] 3GPP TS 23.548 v18.1.1, "5G System Enhancements for
             Edge Computing", April 2023.

   [SDWAN-EDGE-DISCOVERY] L. Dunbar, S. Hares, R. Raszuk, K. Majumdar,
             G. Mishra, V. Kasiviswanathan, "BGP UPDATE for SD-WAN
             Edge Discovery", draft-ietf-idr-sdwan-edge-discovery-10,
             work-in-progress, June 2023.

   [AWS-NAT] NAT gateways - Amazon Virtual Private Cloud.

   [AWS-Cloud-WAN] Introducing AWS Cloud WAN (Preview) | Networking &
             Content Delivery (amazon.com).

   [Azure-SD-WAN] Architecture: Virtual WAN and SD-WAN connectivity -
             Azure Virtual WAN | Microsoft Learn.

   [Google-NAT] Cloud NAT overview | Google Cloud.

   [Int-tunnels] J. Touch and W. Townsley, "IP Tunnels in the Internet
             Architecture", draft-ietf-intarea-tunnels-13,
             work-in-progress, March 2023.

   [METADATA-PATH] L. Dunbar, et al., "BGP Extension for 5G Edge
             Service Metadata",
             draft-ietf-idr-5g-edge-service-metadata-06,
             work-in-progress, July 2023.

   [MULTI-SEG-SDWAN] K. Majumdar, et al., "Multi-segment SD-WAN via
             Cloud DCs", draft-dmk-rtgwg-multisegment-sdwan-00,
             work-in-progress, May 2023.

   [SECURE-EVPN] A. Sajassi, et al., "Secure EVPN",
             draft-ietf-bess-secure-evpn-00, work-in-progress, June
             2023.

   [SERVICE-METRICS] L. Dunbar, et al., "5G Edge Services Use Cases
             and Metrics", draft-dunbar-cats-edge-service-metrics-01,
             work-in-progress, July 2023.

10. Acknowledgments

   Many thanks to Adrian Farrel, Alia Atlas, Chris Bowers, Paul Vixie,
   Paul Ebersman, Timothy Morizot, Ignas Bagdonas, Donald Eastlake,
   Michael Huang, Liu Yuan Jiao, Katherine Zhao, and Jim Guichard for
   the discussion and contributions.

Authors' Addresses

   Linda Dunbar
   Futurewei
   Email: Linda.Dunbar@futurewei.com

   Andrew G. Malis
   Malis Consulting
   Email: agmalis@gmail.com

   Christian Jacquenet
   Orange
   Rennes, 35000
   France
   Email: Christian.jacquenet@orange.com

   Mehmet Toy
   Verizon
   One Verizon Way
   Basking Ridge, NJ 07920
   Email: mehmet.toy@verizon.com

   Kausik Majumdar
   Microsoft Azure
   Email: kmajumdar@microsoft.com