Dynamic Networks to Hybrid Cloud DCs: Problems and Mitigation Practices
draft-ietf-rtgwg-net2cloud-problem-statement-41
Discuss
Yes
Jim Guichard
No Objection
Abstain
No Record
Deb Cooley
Francesca Palombini
Mahesh Jethanandani
Warren Kumari
Summary: Has 4 DISCUSSes. Has enough positions to pass once DISCUSS positions are resolved.
Gunter Van de Velde
Discuss
Discuss
(2024-09-19)
Sent
# Gunter Van de Velde, RTG AD, comments for draft-ietf-rtgwg-net2cloud-problem-statement-41

# Thanks for writing up this work to make cloud DCs more mainstream and connected to enterprises and to open the discussion on routing aspects.

# Please find the following blocking DISCUSS observations when processing the draft and some non-blocking comments.

#DISCUSS
#=======

# I support the DISCUSS from John and Paul.
(1) requirements do not belong in a use-case document
(2) there is information in the document which will not age well

# [DISCUSS1] Section 3 is not complete from a Connecting to Cloud DC Routing issues perspective.

Connecting to cloud data centers presents various routing challenges, including scalability, security, latency, routing policy consistency, and multi-cloud complexity. Enterprises need to carefully plan and manage their routing architecture to ensure reliable, efficient, and secure connections between on-premises infrastructure and cloud data centers. Solutions like dedicated connections, BGP security enhancements, and dynamic routing policies can help mitigate some of these challenges, but they also add complexity to the overall network architecture.

I believe that a use-case document should address or at least position all of these. When focused on a small subset then the bigger picture may be lost. Connecting to and between Cloud DCs is a multi-dimensional complex routing aware problem space.

See my note [DISCUSS1] below

# [DISCUSS2] Cloud DC implications on security considerations is not complete. There are many aspects to consider. See note [DISCUSS2].

Topics are for example Encryption of Data in Transit, Authentication and Access Control, Secure Routing Protocols, Network Segmentation, Data Encryption at Rest, Visibility and Monitoring, DDoS Protection, Firewalls and Security Groups, Zero Trust Security Model, Compliance and Regulatory Considerations, Network Access Control, Patch Management and Vulnerability Scanning, Distributed Workloads and Traffic Control and an Incident Response Plan

# [DISCUSS3] Vendor specific Cloud DC products and explicit behaviors are documented in this document. IETF documents should be vendor agnostic, especially when very specific behaviors are documented. Vendor behavior will change over time making the information provided in the draft stale, outdated and potentially harmful to the referenced (unaware) cloud DC vendors
Comment
(2024-09-19)
Sent
#DETAILED COMMENTS
#=================
## classified as [minor] and [major]

72   3. Issues and Mitigation Methods of Connecting to Cloud DCs.......4
73      3.1. Increased BGP Peering Errors and Mitigation Methods.......4
74      3.2. Site Failures and Methods to Minimize Impacts.............6
75      3.3. Limitations of DNS-based Cloud DC Location Selection......6
76      3.4. Network Issues for 5G Edge Clouds and Mitigation Methods..7
77      3.5. DNS Practices for Hybrid Workloads........................8
78      3.6. NAT Practices for Accessing Cloud Services................9
79      3.7. Cloud Discovery Practices................................10
80   4. Dynamic Connecting Enterprise Sites with Cloud DCs............10
81      4.1. Sites to Cloud DC........................................11
82      4.2. Inter-Cloud Connection...................................13
83      4.3. Extending Private VPNs to Hybrid Cloud DCs...............14
84   5. Methods to Scale IPsec Tunnels to Cloud DCs...................15
85      5.1. Scale IPsec Tunnels Management...........................16
86      5.2. CPEs Interconnection Over the Public Internet............16
.....
98   1. Introduction

99   With the advent of widely available Cloud data centers (DCs)
100  providing services in various geographic locations and advanced
101  tools for monitoring and predicting application behaviors, it is
102  tempting for enterprises to instantiate applications and workloads
103  in Cloud DCs. Some enterprises prefer specific applications to be
104  located close to the end users accessing these services, as the
105  proximity can improve end-to-end latency. In addition, applications
106  and workloads in Cloud DCs can be shut down or moved along with end
107  users in motion thereby modifying the networking connection of
108  subsequently relocated applications and workloads.

109  Cloud services are typically on-demand and designed to be scalable,
110  highly available, and billed based on usage. Most Cloud Operators
111  offer various network functions, such as virtual Firewall services,
112  virtual private clouds services, and virtual Private Branch eXchange
113  (PBX) services, including voice and video conferencing systems. A
114  Cloud DC is a shared infrastructure that hosts services for multiple
115  customers.

116  This document describes the network-related problems enterprises
117  face at the time of writing this document when interconnecting their
118  branch offices with dynamic workloads in Cloud DCs and the
119  mitigation practices to get around those problems.

[major]
Cloud data centers offer numerous benefits, but they also have several downsides or challenges that organizations need to consider. While cloud data centers offer scalability, flexibility, and cost-efficiency, organizations must weigh these benefits against potential downsides such as security risks, unpredictable costs, limited control, and regulatory compliance challenges. It is not just about network access to Cloud DCs.

Some of the suggested key downsides are not strictly of a routing technical nature, while others are, and these have not been adequately addressed in Section 3. Including these issues in the document, along with an explicit indication of which are within scope and which are outside scope, will provide greater clarity to readers and enhance their understanding of the problem space being discussed:

1. Security and Privacy Concerns:
   * Data Breaches: Storing sensitive data in cloud environments increases the risk of data breaches, as cloud data centers are prime targets for cyberattacks. Organizations may face issues of unauthorized access if cloud security is compromised.
   * Shared Responsibility: In cloud environments, security is a shared responsibility between the cloud provider and the customer. Misconfigurations or failures on either side can lead to vulnerabilities.
   * Data Sovereignty: Data stored in cloud data centers may be subject to the laws and regulations of the country where the data center is located, which can lead to compliance issues regarding data privacy.

2. Downtime and Availability:
   * Service Outages: Even the most reliable cloud providers can experience downtime, which can lead to service disruptions for organizations relying on cloud infrastructure. High availability is typically guaranteed, but 100% uptime is rarely achieved.
   * Network Latency: Cloud data centers are remote, so applications that require low-latency performance might face challenges, especially if the data center is far from end-users.

3. Cost Management:
   * Unpredictable Costs: While cloud services are often marketed as cost-effective, costs can quickly add up if resources are not properly managed. Unexpected charges for data egress, scaling, or additional services can lead to budget overruns.
   * Long-Term Costs: Over time, running workloads in the cloud might be more expensive than on-premises solutions, particularly for organizations with steady and predictable workloads.

4. Lack of Control:
   * Limited Customization: Cloud services typically offer standardized environments, which may limit an organization’s ability to customize infrastructure or configurations to meet specific needs. This lack of control can be problematic for highly specialized applications.
   * Vendor Lock-In: Many cloud providers offer proprietary services that can make it difficult or costly to migrate to another provider or move workloads back on-premises.

5. Data Transfer and Performance Issues:
   * Data Transfer Costs: Transferring large volumes of data to and from the cloud can be expensive and time-consuming, particularly when dealing with bandwidth limitations or the cost of data egress.
   * Performance Variability: In multi-tenant cloud environments, performance can fluctuate depending on the overall usage of resources by other clients. This can impact critical workloads if performance varies unexpectedly.

6. Compliance and Legal Issues:
   * Regulatory Compliance: Organizations in highly regulated industries (e.g., healthcare, finance) must ensure that their use of cloud services complies with specific regulations such as GDPR, HIPAA, or PCI-DSS. Ensuring compliance can be complicated by the global nature of cloud data centers.
   * Data Jurisdiction: Storing data in foreign cloud data centers might expose organizations to jurisdictional issues, where the data becomes subject to foreign laws and regulations.

7. Dependence on Internet Connectivity:
   * Connectivity Issues: Cloud services require reliable internet access. If an organization experiences internet outages or slow connectivity, access to cloud-hosted applications and data may be compromised, impacting productivity.

8. Complexity in Hybrid Environments:
   * Integration Challenges: Managing hybrid cloud environments (where some resources are in the cloud and others on-premises) can be complex, especially when it comes to data synchronization, security policies, and monitoring.

150  SD-WAN An overlay connectivity service that optimizes transport
151  of IP Packets over one or more Underlay Connectivity
152  Services by recognizing applications (Application Flows)
153  and determining forwarding behavior by applying Policies
154  to them. [MEF-70.1]

[major]
fails to say that SD-WAN stands for "Software-Defined Wide Area Network"

Maybe the following could be added to describe what SD-WAN is:
"
SD-WAN (Software-Defined Wide Area Network) is a networking technology that simplifies the management and operation of a wide area network (WAN) by decoupling the network hardware from its control mechanism. It allows enterprises to securely and efficiently connect users to applications, particularly across multiple branch locations, data centers, and cloud environments.
"

156  VPC: A Virtual Private Cloud is a virtual network dedicated
157  to one client account. It is logically isolated from
158  other virtual networks in a Cloud DC. Each client can
159  launch his/her desired resources, such as compute,
160  storage, or network functions into his/her VPC. At the
161  time of writing this document, most Cloud operators'
162  VPCs only support private addresses, some support IPv4
163  only, others support IPv4/IPv6 dual stack.

[minor]
A simpler proposal to describe VPC
"
A VPC (Virtual Private Cloud) is a secure, isolated segment of a public cloud, where users can deploy and manage resources such as virtual machines, databases, and applications. VPCs offer the flexibility of using the public cloud's infrastructure while providing more control over networking and security.
"

165  3. Issues and Mitigation Methods of Connecting to Cloud DCs

167  This section identifies some high-level problems that the IETF could
168  address, especially within the Routing area. Other Cloud DC problems
169  (e.g., managing cloud spending) are out of the scope of this
170  document.

[DISCUSS1]
Connecting to cloud data centers presents various routing challenges, including scalability, security, latency, routing policy consistency, and multi-cloud complexity. Enterprises need to carefully plan and manage their routing architecture to ensure reliable, efficient, and secure connections between on-premises infrastructure and cloud data centers. Solutions like dedicated connections, BGP security enhancements, and dynamic routing policies can help mitigate some of these challenges, but they also add complexity to the overall network architecture.

Not all of these high level enterprise related concerns are addressed in draft-ietf-rtgwg-net2cloud-problem-statement-41

Key Routing Issues of interest by enterprises when Connecting to Cloud Data Centers:

1. Latency and Path Optimization:
   * Suboptimal Routing: Traffic between on-premises data centers and cloud providers may traverse multiple ISPs or intermediary networks, leading to increased latency. Default internet paths may not always be the most optimal, which can negatively impact performance for latency-sensitive applications.
   * Traffic Engineering: Enterprises may struggle to optimize routes for specific applications. This can be critical when performance demands, such as low latency for real-time applications, are high.

2. Multi-Cloud and Hybrid Cloud Connectivity:
   * Inter-Cloud Routing Complexity: Routing between multiple cloud providers (multi-cloud) or between on-premises environments and the cloud (hybrid cloud) is challenging. Each cloud provider may use different routing policies, protocols, and architectures, complicating consistent policy enforcement and efficient routing across different environments.
   * Vendor-Specific Routing Mechanisms: Cloud providers like AWS, Microsoft Azure, and Google Cloud have their own proprietary routing mechanisms, such as AWS Transit Gateway or Azure Virtual WAN. Managing routing across different clouds requires expertise in each platform’s unique setup.

3. BGP Complexity:
   * BGP Configuration: Enterprises often use Border Gateway Protocol (BGP) to connect their on-premises networks with cloud DCs. However, configuring BGP for efficient and secure communication can be complex, especially when dealing with cloud providers’ route limitations, filtering, and peering configurations.
   * BGP Route Convergence: If there is a network topology change, BGP may take time to converge on a new optimal route, which could cause temporary routing loops or black holes, leading to downtime or degraded performance.
   * BGP Security: Routing security issues like BGP hijacking can be a concern. If not properly secured, attackers can manipulate routes, potentially intercepting or redirecting traffic between an enterprise and a cloud data center.

4. Overlapping IP Addresses:
   * IP Address Conflicts: When connecting multiple cloud environments or when integrating with on-premises networks, organizations may encounter overlapping private IP address spaces (e.g., two networks using the same RFC1918 address space). This creates routing conflicts and requires address translation (e.g., NAT) or careful IP planning.
   * NAT Complexity: Network Address Translation (NAT) is often used to resolve overlapping IPs, but it adds complexity to routing, and troubleshooting connectivity issues can become more difficult.

5. Routing Scalability:
   * Large Route Tables: Cloud environments often host a large number of subnets, virtual machines (VMs), and applications, which results in significant route table growth. On-premises routers may struggle to handle the large number of routes advertised by cloud data centers.
   * Route Aggregation: To manage large routing tables, route aggregation is essential, but improper aggregation can lead to suboptimal routing or create security issues by allowing unintended access to broader network segments.

6. East-West Traffic Optimization:
   * East-West Traffic Challenges: Modern cloud workloads often involve significant east-west traffic (i.e., traffic between different applications or services within the cloud). Efficiently routing this traffic between cloud regions or between an on-premises data center and the cloud can be challenging, especially if cross-region bandwidth or routing constraints exist.

7. Latency and Bandwidth Considerations:
   * Performance Over Public Internet: Connecting to a cloud DC over the public internet introduces unpredictable latency and limited control over the routing path. Enterprises may use dedicated connectivity solutions like AWS Direct Connect or Azure ExpressRoute to avoid the public internet and achieve more predictable performance, but these solutions come with additional cost and complexity.
   * Bandwidth Costs: Cloud providers often charge for egress traffic (traffic leaving the cloud data center). Suboptimal routing can increase data transfer costs if traffic is unnecessarily routed through expensive pathways.

8. Route Propagation and Policy Enforcement:
   * Consistent Route Propagation: Propagating routes between an on-premises network and a cloud data center can be inconsistent, especially when using complex routing policies. Enterprises need to carefully manage route redistribution between different routing domains (e.g., BGP on-premises and cloud provider proprietary routing).
   * Policy Control: Implementing consistent routing policies (e.g., security, load balancing, and traffic engineering policies) across cloud and on-premises environments can be challenging due to the different tools and mechanisms used by cloud providers.

9. Routing Security:
   * Securing Routing Information: When using BGP to connect to cloud data centers, securing routing information is crucial. BGP hijacking and route leaks can lead to malicious traffic redirection. Organizations need to implement security measures like BGP authentication, RPKI (Resource Public Key Infrastructure), and route filtering to prevent unauthorized route advertisements.
   * Encryption and Privacy: Data traveling between an enterprise and the cloud may need encryption to protect against eavesdropping. Implementing encrypted tunnels (e.g., IPSec VPN) can add complexity to the routing setup.

10. Failover and High Availability:
   * Redundancy and Failover: Ensuring high availability in cloud connectivity involves setting up redundant links and implementing fast failover mechanisms to ensure traffic is re-routed quickly in the event of a link failure. However, configuring effective failover paths that meet performance and cost requirements can be complex, especially across different clouds or between cloud and on-premises environments.
   * Dynamic Failover: In hybrid environments, ensuring that routes dynamically change during failover scenarios can be difficult due to the different routing protocols or static routes used in cloud environments.

11. Geographic Routing and Data Residency:
   * Compliance and Regulation: Enterprises may face legal and regulatory challenges regarding where data is routed. For instance, data residency requirements (e.g., GDPR) may mandate that certain data be routed or stored only within specific geographical regions. Ensuring that routing policies comply with these regulations across cloud and on-premises environments can be a complex issue.
   * Geographic Load Balancing: Routing traffic to cloud data centers in different regions to optimize for performance or compliance requires careful planning and monitoring.

743  7. Security Considerations
744
745  The security issues in terms of networking to Cloud DCs include

[DISCUSS2]
Enterprises connecting to cloud data centers must address a wide range of security concerns, from ensuring encrypted communications and controlling access, to securing routing protocols and complying with regulatory requirements. By employing robust encryption, strong access controls, comprehensive monitoring, and segmentation strategies, organizations can mitigate risks and securely connect their on-premises infrastructure to cloud environments. Additionally, leveraging the security tools and services provided by cloud vendors can help ensure that the network and data remain protected.

A security section should investigate these to provide an holistic security overview. While not all of these have direct impact upon routing, or should even be standardized, it is important for enterprises to have a secure and robust cloud DC experience.

1. Encryption of Data in Transit:
   * End-to-End Encryption: Data traveling between on-premises infrastructure and the cloud should be encrypted to protect against interception and eavesdropping. Common methods include using IPsec VPNs, SSL/TLS, or private connectivity options like AWS Direct Connect or Azure ExpressRoute, which provide secure, dedicated connections to the cloud.
   * Encrypted Tunnels: Secure tunnels (IPsec, SSL, or GRE) can be used to ensure data confidentiality and integrity during transmission. Encryption helps mitigate man-in-the-middle attacks.

2. Authentication and Access Control:
   * Strong Authentication Mechanisms: Employ strong, multi-factor authentication (MFA) for accessing both on-premises and cloud resources. Implement VPN access control to ensure only authorized users and devices can establish connections to cloud environments.
   * Identity and Access Management (IAM): Use IAM policies to control who can access resources in the cloud. Ensure that IAM roles are tightly controlled and that users and applications only have the minimum permissions they need (principle of least privilege).

3. Secure Routing Protocols:
   * BGP Security: If using Border Gateway Protocol (BGP) to connect to cloud services, protect the routing protocol by implementing BGP authentication (using for example TCP-AO) and route filtering to prevent unauthorized or incorrect routing information from being accepted.
   * Route Filtering: Control which routes are propagated between on-premises networks and the cloud to prevent route leaks, which could expose sensitive routes to external parties or misdirect traffic.
   * RPKI (Resource Public Key Infrastructure): Consider using RPKI to prevent BGP hijacking, ensuring that the routes being advertised are valid and have not been tampered with.

4. Network Segmentation:
   * Isolating Traffic: Use Virtual Private Clouds (VPCs) and subnet segmentation to isolate traffic between different departments, workloads, or tenants. This ensures that sensitive data is not exposed to unauthorized users within the same cloud environment.
   * Private Connectivity: Use private connectivity options (e.g., AWS Direct Connect, Azure ExpressRoute) to avoid sending sensitive data over the public internet, reducing the risk of exposure to attacks.

5. Data Encryption at Rest:
   * Cloud Data Encryption: Ensure that data stored in the cloud is encrypted at rest. Many cloud providers offer encryption services (e.g., AWS Key Management Service, Azure Key Vault) to manage encryption keys securely. Consider using customer-managed keys for additional control over encryption processes.
   * Compliance with Encryption Standards: Ensure that encryption protocols comply with industry standards and regulatory requirements (e.g., AES-256 encryption for sensitive data).

6. Visibility and Monitoring:
   * Traffic Monitoring: Use tools like cloud network traffic analyzers or intrusion detection systems to monitor traffic between on-premises infrastructure and cloud environments. Detect anomalous behavior or unauthorized access attempts by maintaining visibility into network traffic.
   * Logging and Auditing: Enable comprehensive logging of all access and configuration changes in both on-premises and cloud environments. Cloud providers often offer logging services like AWS CloudTrail or Azure Monitor to track user activity and help detect security breaches.
   * Threat Detection and Response: Deploy security tools that offer threat detection, real-time monitoring, and automated response. Solutions like SIEM (Security Information and Event Management) systems can help correlate events across the hybrid cloud to detect security incidents.

7. DDoS Protection:
   * Distributed Denial of Service (DDoS) Protection: Cloud data centers can be targets of DDoS attacks, which can disrupt network services. Cloud providers offer DDoS mitigation services (e.g., AWS Shield, Azure DDoS Protection) that can protect both the cloud environment and the connection to on-premises infrastructure.
   * Rate Limiting: Implement rate limiting and other traffic control mechanisms to prevent network saturation during potential attacks.

8. Firewalls and Security Groups:
   * Network Firewalls: Use firewalls to control traffic flowing between on-premises networks and cloud environments. Cloud providers offer virtual firewalls that can be configured to enforce strict access controls.
   * Security Groups: Implement security groups and network ACLs (Access Control Lists) to control inbound and outbound traffic at the VPC or subnet level. These mechanisms should be used to restrict access to only those IP addresses or protocols that are necessary.

9. Zero Trust Security Model:
   * Zero Trust: Adopt a Zero Trust model that assumes no network (internal or external) is automatically trusted. Every access request should be verified, and users, devices, and applications should be authenticated before being allowed access to resources.
   * Microsegmentation: Use microsegmentation to further isolate workloads within the cloud, ensuring that even if an attacker gains access to one part of the network, they cannot easily move laterally.

10. Compliance and Regulatory Considerations:
   * Data Sovereignty and Residency: Ensure compliance with data sovereignty laws (e.g., GDPR) by enforcing routing policies that keep sensitive data within specified geographical regions.
   * Encryption for Compliance: Encrypt sensitive data both in transit and at rest to meet regulatory requirements like HIPAA, PCI-DSS, or GDPR. Cloud providers often offer compliance certification, but it's important to ensure the proper configurations are in place.
   * Auditing and Reporting: Regularly audit the security posture of the hybrid cloud environment to ensure ongoing compliance with security standards and regulations.

11. Network Access Control:
   * VPN Access: Use VPN gateways to securely connect on-premises networks to cloud environments, encrypting traffic between the two endpoints.
   * Multi-Factor Authentication (MFA): Implement MFA for users and administrators accessing cloud resources remotely to add an extra layer of security.

12. Patch Management and Vulnerability Scanning:
   * Patch Cloud Resources: Ensure that virtual machines, containers, and other cloud resources are regularly patched to protect against vulnerabilities. Leverage automated tools for patch management across both on-premises and cloud environments.
   * Vulnerability Scanning: Regularly scan cloud environments for vulnerabilities and misconfigurations that could be exploited by attackers.

13. Distributed Workloads and Traffic Control:
   * Load Balancing: Use cloud-based load balancers to evenly distribute traffic across multiple servers and data centers, reducing the risk of congestion or single points of failure.
   * Content Delivery Networks (CDNs): Use CDNs to distribute content closer to users, reducing latency and improving performance while also offering security benefits such as DDoS protection and content encryption.

14. Incident Response Plan:
   * Develop a Cloud-Specific Incident Response Plan: Ensure that the organization's incident response plan accounts for both on-premises and cloud environments. This includes identifying responsibilities, communication channels, and the tools needed to detect, investigate, and respond to security incidents.
   * Automated Responses: Consider automating certain responses, such as shutting down suspicious instances, revoking access, or blocking traffic, based on pre-defined security rules.
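
The "Overlapping IP Addresses" item in the routing list above can be made concrete with a small check. The following is only a minimal sketch, assuming that the on-premises and cloud VPC prefixes are available as plain CIDR strings; the function name `find_overlaps` and the example prefixes are hypothetical and are not taken from the draft.

```python
import ipaddress
from itertools import product


def find_overlaps(on_prem, cloud):
    """Return (on_prem_prefix, cloud_prefix) pairs whose address ranges overlap."""
    on_prem_nets = [ipaddress.ip_network(p) for p in on_prem]
    cloud_nets = [ipaddress.ip_network(p) for p in cloud]
    return [
        (str(a), str(b))
        for a, b in product(on_prem_nets, cloud_nets)
        # Only compare prefixes of the same address family.
        if a.version == b.version and a.overlaps(b)
    ]


# Hypothetical example: both sides drew from the same RFC 1918 block.
on_prem_prefixes = ["10.0.0.0/16", "192.168.10.0/24"]
cloud_vpc_prefixes = ["10.0.128.0/20", "172.16.0.0/16"]

for on_prem_prefix, cloud_prefix in find_overlaps(on_prem_prefixes, cloud_vpc_prefixes):
    print(f"conflict: {on_prem_prefix} overlaps {cloud_prefix}")
```

Any pair reported by such a check would need renumbering or NAT before the two routing domains can exchange those prefixes, which is the planning burden the list item describes.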
John Scudder
Discuss
Discuss
(2024-09-18)
Sent
Much of this document seems to be a high-level outline of particular commercial offerings, which among other problems, will not age well. Other parts outline challenges that are already solved, using existing IETF technologies or general remarks about best practices for operating networks. Yet other parts provide brief sketches of other SDOs' technologies or architectures. Overall, I don't think this is a valuable document for the IETF to be publishing as part of the RFC series, and as such I expect to eventually ballot Abstain.

I do, however, have a few concerns about the document which warrant a DISCUSSion, first.

## DISCUSS

### This isn't a requirements document; I think that should be made clearer

Sometimes the IETF publishes requirements documents, which when issued as RFCs are seen as having some standing to establish that a given technology must be developed or advanced. The present document introduces itself as a problem statement document, but Section 6 is called "requirements".

My concern arises because throughout the document there are pointers to places in the IETF (WGs, drafts) where there is work in progress. I would prefer to avoid any ambiguity down the road, as to whether these citations are just for the information of the reader as examples, or something more. I'm open to solutions, but perhaps something like this, as a final paragraph of the introduction?

NEW:
   This document provides references to IETF working groups and Internet Drafts that relate to the subject. These references are provided as examples and for the information of the reader, and should not be interpreted as requiring the adoption or implementation of any particular solution. Certain high-level requirements are presented in Section 6; these requirements are agnostic as to what solutions should fulfill them.

To be clear, my concern is that the document can easily be read as privileging a certain set of solutions. Those might be the best solutions, I don't know, but I don't think it is the place of a problem statement or requirements document to mandate solutions.

### Inscrutable paragraph in Section 3.1

2. Section 3.1 includes the following paragraph:

   - A Cloud DC GW typically has multiple eBGP sessions with various clients and sets a route limit for each one. Therefore, on-premises data center gateways with eBGP sessions to the Cloud DC GW should configure default routes and filter out as many routes as possible, replacing them with a default route in their eBGP advertisements. This approach minimizes the number of routes exchanged with the Cloud DC eBGP peers.

I simply can't understand what this paragraph is telling me to do. This would be partly remedied -- and the document improved overall -- if there were an earlier section providing a reference model and defining terms such as "Cloud DC GW", and illustrating the flow of routing information between elements. Since there is no such model, and since the prose quoted isn't clear, the reader is left to use their imagination, which is the opposite of what we strive for in our RFCs.

I would suggest a rewrite but I can't discern even enough of your intent to offer one, I'm sorry. I guess my imagination has failed me.

### Section 3.2, no IGP

   As described in [RFC7938], a Cloud DC might not have an IGP to route around link/node failures within its domain.

Are you saying that because there's no IGP the Cloud DC can't route around failures? Surely not, this is the opposite of what RFC 7938 describes. But it's sure what it sounds like.

   When a site failure happens, the Cloud DC GW visible to clients is running fine; therefore, the site failure is not detectable by the clients using Bidirectional Forwarding Detection (BFD)[RFC5880].

This doesn't make any sense to me. Again, perhaps a reference model showing the relationship of a "Cloud DC GW", a "site", where BFD would be running, etc, might have helped.
Comment
(2024-09-18)
Sent
## COMMENT

### Section 3.1, Capability Mismatch

I don't understand what this means:

> Capability mismatch can cause BGP sessions not being adequately established.

The "mitigation practices" basically amount to "follow the relevant standards". Is the quoted text trying to say something like "implementations that have bugs or don't follow the standards may not work right"? Generally, we don't need an RFC to say that, it's akin to the classic "MUST NOT write bugs".

### Section 3.2, Huge number... problem

> When a site failure occurs, many services can be impacted. When the impacted services' IP prefixes in a Cloud DC are not aggregated nicely, which is common, one single site failure can trigger a huge number of BGP UPDATE messages. There are proposals, such as [METADATA-PATH], to enhance BGP advertisements to address this problem.

Is there some supporting evidence that the O(N) nature of BGP convergence is a "problem" in this context? I mean, sure, O(1) is nicer than O(N), but there are many O(N) operations we choose not to optimize because they don't need optimizing. I haven't seen evidence presented that convinces me this needs optimizing.

Rather than debate this point, one possible way to address it would be to reword in some more factual way, such as,

NEW:

> When a site failure occurs, many services can be impacted. When the impacted services' IP prefixes in a Cloud DC are not aggregated nicely, which is common, one single site failure can trigger multiple BGP UPDATE messages. There are proposals, such as [METADATA-PATH], to enhance BGP advertisements to reduce the number of messages required.

### Section 3.4, UEs can move

> Here are some network problems with connecting to the services in the 5G Edge Clouds:
> ...
> 3) Source (UEs) can ingress from different LDN Ingress routers due to mobility.

How is that a "problem"?

### Section 6, IPSec requirement

> - Should support scalable IPsec key management among all nodes involved in DC interconnect schemes.

But you don't say that it's a requirement for a solution to be IPSec-based at all. For a solution that isn't IPSec-based, this requirement is moot. Perhaps,

NEW:

> - Should support scalable IPsec key management among all nodes involved in DC interconnect schemes, if IPSec is used as a VPN technology.

### Section 6, AZ

> - Should support traffic steering to distribute loads across regions/AZs based on performance/availability of workloads in

You've never defined "AZ". Please do, or remove.

### Section 7, anti-DDoS

> a) Potential DDoS (Distributed Denial of Service) attack to the ports facing the untrusted network (e.g., the public internet), which may propagate to the cloud edge resources. To mitigate such security risk, it is necessary for the ports facing internet to enable Anti-DDoS features.

Can you be specific about what "anti-DDoS features" are? You make it sound as though there's some way to configure "port xyz1/2 no ddos" and the problem goes away. To my knowledge, such "anti-DDoS features" don't exist. If they do, please cite examples. If they don't, something about this needs to change; minimally, delete the "to mitigate" sentence.
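As a rough illustration of the Section 3.2 point about aggregation, the sketch below compares how many prefixes a site failure would have to withdraw when a site's prefixes can be aggregated versus when they cannot. It is only a sketch under stated assumptions: the function name and the example prefixes (taken from documentation address ranges) are hypothetical, and the prefix count is only a proxy for message count, since a single BGP UPDATE can carry several withdrawn routes.

```python
import ipaddress


def prefixes_to_withdraw(prefixes):
    """Return (raw_count, aggregated_count) for a site's prefixes.

    The aggregated count is what would need withdrawing if the site
    advertised the smallest covering set of prefixes instead.
    """
    nets = [ipaddress.ip_network(p) for p in prefixes]
    aggregated = list(ipaddress.collapse_addresses(nets))
    return len(nets), len(aggregated)


# Hypothetical example data: four contiguous /26s that collapse into one /24 ...
contiguous = ["192.0.2.0/26", "192.0.2.64/26", "192.0.2.128/26", "192.0.2.192/26"]
# ... versus four /26s that share no common covering prefix.
scattered = ["192.0.2.0/26", "198.51.100.64/26", "203.0.113.128/26", "192.0.2.192/26"]

print(prefixes_to_withdraw(contiguous))  # (4, 1)
print(prefixes_to_withdraw(scattered))   # (4, 4)
```

Whether the difference matters in practice is exactly the question raised above; the sketch shows only that the count grows linearly with the number of prefixes that cannot be aggregated.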
Paul Wouters
Discuss
Discuss
(2024-09-18)
Sent
I support John's DISCUSS, and also lean towards balloting Abstain, for much the same reasons John already mentioned. While this might be a useful document for certain people, I do not think this is an IETF document.

My view of the world might be different from the authors' with respect to requiring BGP to interact with many cloud services. It seems quite common to use VPNs and NATs to tie things together from on-premise to cloud services using VPCs without ever needing BGP at all.

On the IPsec part, I find it strange that RFC4535 is mentioned, as there is no requirement for shared group keys or multicast support. One would expect each IPsec tunnel to have security properties independent of the other IPsec tunnels between Cloud DCs, on-premise DCs and branch locations. I do not believe one overarching IPsec management solution could tie these various networks together via IPsec. IPsec (IKE) key management for site-to-site connections, which is what all these kinds of IPsec connections are, is a "set up and forget" type of deployment requiring no further key management.

There is clearly some interesting operational and commercial advice in the document, but it is very much a snapshot that will not age well. I don't think the IETF is where this should be published.
Éric Vyncke
Discuss
Discuss
(2024-09-19)
Sent
# Éric Vyncke, INT AD, comments for draft-ietf-rtgwg-net2cloud-problem-statement-41

Thank you for the work put into this document, I can imagine the amount of work with 41 versions :-) but after balloting several DISCUSS points, I stopped reviewing it after section 3.7. I am also supportive of John's DISCUSS position.

Please find below several DISCUSS points, some non-blocking COMMENT points (but replies would be appreciated even if only for my own education).

Special thanks to Joel M. Halpern for the shepherd's detailed write-up including the WG consensus (`While there was not widespread support`) and the justification of the intended status.

I hope that this review helps to improve the document,

Regards,

-éric

# DISCUSS (blocking)

As noted in https://www.ietf.org/blog/handling-iesg-ballot-positions/, a DISCUSS ballot is a request to have a discussion on the following topics:

## Copyright

The wrong template is used, as `Simplified BSD License` should be "Revised BSD License".

## Section 3.5

AFAIK, ".internal" is not a special-use domain name listed by IANA (and AFAIK not yet approved by ICANN), so please be clear about this status (i.e., squatting a domain) in the document.

## Section 3.6

In 2024, such an I-D must care about IPv6 and not too much about IPv4 NAT.

If this document insists on using some commercial cloud references, then please also use non-US-based cloud offerings.

## Section 3.7

What is `IPv6 optional header`?
Comment
(2024-09-19)
Sent
# COMMENTS (non-blocking)

## Abstract

As already noticed by other ADs, the date 2023 in the abstract is seriously outdated; the document should be refreshed for content and the date moved to 2024. But, better having a date than no date at all...

## Section 2

`Third party Data Centers` I am not sure whether "3rd party" applies when the cloud DC is used only by the employees.

`private clouds` should probably be defined to avoid any ambiguity.

Should "SD-WAN" be expanded?

## Section 3

`This section identifies some high-level problems that the IETF could address` sounds like the IETF has failed, please rephrase (e.g., using "IETF Technologies").

## Section 3.1

Should `Public Cloud DCs` be explicitly defined?

`to form an IP adjacency` I am unsure what IP adjacency means; the term adjacency is often used for a layer-2 link between layer-3 nodes (e.g., OSPF), suggest using iBGP.

More generally, the (valid) recommendations seem to apply to any peering and are not really related to cloud DC.

## Section 3.2

Please define "pod".

I would welcome explanations about the 2nd paragraph. I fail to see how EVPN is related to Cloud DC.

## Section 3.3

s/with an IP address/with IP addresses/ ?

Should the multiple interfaces (3GPP & Wi-Fi) issues be cited?

## Section 3.4

Suggest introducing the acronyms in section 2.

## Section 3.5

Please note that draft-ietf-add-split-horizon-authority is not about split horizon DNS (even if it is related), please use another reference.

## Section 3.6

What is "AWS Direct Connect"? Or even what is "AWS"? (To be honest, I know, but I do not expect every IETF reader to know those commercial terms.)
Jim Guichard
Yes
Erik Kline
No Objection
Comment
(2024-09-14)
Sent
# Internet AD comments for draft-ietf-rtgwg-net2cloud-problem-statement-41

CC @ekline

* comment syntax:
  - https://github.com/mnot/ietf-comments/blob/main/format.md

* "Handling Ballot Positions":
  - https://ietf.org/about/groups/iesg/statements/handling-ballot-positions/

## Comments

### S2

* Based on a quick search for the various keywords, and the fact that this is an Informational doc, I think you can delete the keywords text and the associated bibliographic entries.

### S3.*

* Should there be some text about managing end-to-end connectivity when NAT is not being used? For IPv6 in particular, it can be surprising to some that nodes might be reachable.

  Broadly, the issue of managing firewalls and permitted/denied communications might be an area of concern on its own.

  Developing proper text, though, probably requires going back to the working group and/or through IETF LC (which I assume folks don't want to do).

### S3.7

* Any solution that involves attaching metadata must also specify how that metadata can be authenticated/validated. I would suggest just saying that there's a security requirement for this for any solution in this area.
Orie Steele
No Objection
Comment
(2024-09-18)
Not sent
Thanks to Rich Salz for the ART ART review.
Roman Danyliw
No Objection
Comment
(2024-09-17)
Sent
Thank you to Paul Kyzivat for the GENART review.

** Section 3.

   This section identifies some high-level problems that the IETF could address, especially within the Routing area. Other Cloud DC problems (e.g., managing cloud spending) are out of the scope of this document.

The scope of what collection of problems will be described in this document is made less clear by this paragraph. What makes these IETF problems to solve?

** Section 3

   Here are the recommended mitigation practices:

   - If a Cloud Gateway (GW), a BGP speaker, receives from its BGP peer a capability that it does not itself support or recognize, it MUST ignore that capability, and the BGP session MUST NOT be terminated per [RFC5492].

   - When receiving a BGP UPDATE with a malformed attribute, the revised BGP error handling procedure in [RFC7606] should be followed instead of session resetting.

Does this amount to saying, “just follow RFC7606”? I’m having trouble understanding how this guidance differs from existing PS documents.

** Section 3. Speculative text about IETF WG.

-- Section 3.1

   Although this is beyond the scope of this document, further discussion in the IETF Inter-Domain Routing (IDR) Working Group is needed. This could lead to the addition of new subcodes in RFC4486 Section 3 and corresponding descriptions in RFC4486 Section 4 to facilitate this more efficient approach.

-- Section 3.4

   The IETF CATS (Computing-Aware Traffic Steering) working group is examining general aspects of this space, and may come up with protocol recommendations for this information exchange.

Is this text needed? Why speculate on the activity of other WGs?

** Section 3.3.

-- Which subset of the problems described here (i.e., the bulleted list after “Here are some problems associated with DNS-based solutions”) is addressed by the mitigations described later in this section?

-- What protocol behavior is envisioned to address a “misbehaving client”?

** Section 3.4

   [METADATA-PATH] describes a mechanism to get around those problem.

If [METADATA-PATH] is a solution, why is this section describing a solved problem? Is this section needed?

** Section 3.7

   One of the concerns of enterprises using Cloud services is the lack of awareness of the locations of their services hosted in the Cloud, as Cloud operators can move the service instances from one place to another. While the geographic locations are usually exposed to the enterprises, such as Availability Zones or Regions, the topological location is usually hidden.

Is this a technical problem or a contractual problem? Could the problem statement be refined? The customer is buying a PaaS or some other kind of *aaS, but they don’t understand where in the service provider’s network it is being run beyond high-level descriptions like Zone or Region? How would the customer get that information? What is topological information in this context?

** Section 3.7

   For enterprises that instantiate virtual routers in Cloud DCs, metadata can be attached (e.g., GENEVE [RFC8926] header or IPv6 optional header) to indicate additional properties, including useful information about the sites where they are instantiated.

What remains unsolved after using IPv6 optional headers (is there a specific mechanism?) or GENEVE?

** Section 4 and 5. What problem are these sections describing? In particular, Section 4 seems to be referencing vendor-specific procedures. Will these statements age well?
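For reference, the RFC 5492 behavior quoted in the Section 3 mitigation practices above can be sketched in a few lines. This is a minimal illustration, not an implementation of any BGP stack: the function name and return structure are hypothetical, and the value 200 stands in for a capability code the local speaker does not recognize. Codes 1, 2, 64 and 65 are the registered codes for Multiprotocol Extensions, Route Refresh, Graceful Restart and 4-octet AS numbers.

```python
def negotiate_capabilities(local_supported, peer_advertised):
    """Apply the RFC 5492 rule for capabilities received in a BGP OPEN.

    Capabilities the local speaker does not support or recognize are
    ignored; they never cause the session to be torn down.
    """
    local = set(local_supported)
    peer = set(peer_advertised)
    return {
        "active": sorted(local & peer),   # usable on this session
        "ignored": sorted(peer - local),  # unknown to us, silently skipped
        "session_up": True,               # unknown capabilities do not reset the session
    }


# Hypothetical example: the peer advertises code 200, which we do not know.
LOCAL = {1, 2, 64, 65}
PEER = {1, 2, 65, 200}

result = negotiate_capabilities(LOCAL, PEER)
print(result["active"])   # [1, 2, 65]
print(result["ignored"])  # [200]
```

The sketch adds nothing beyond what RFC 5492 already requires, which is the concern raised in the comment above.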
Zaheduzzaman Sarker
No Objection
Comment
(2024-09-18)
Sent
Thanks for working on this document. I don't have comments from the transport protocol perspective; however, I have the following observations, which I believe will improve the document when addressed:

# Section 3.4: this section needs to relate the listed issues to networking problems. It says:

   1) The difference in routing distances to server instances in different edge Clouds is relatively small. Therefore, the instance in the Edge Cloud with the shortest routing distance from a 5G UPF might not be the best in providing the overall low latency service.

So this becomes mainly a resource selection issue, as routing distance is not an issue.

   2) Capacity status at the Edge Cloud might play a more significant role in end-to-end performance.

Again, I am not sure how this becomes a networking issue.

   3) Source (UEs) can ingress from different LDN Ingress routers due to mobility.

Does the routing distance become an issue now?

# Section 3.4: I also think the text is speculative regarding the CATS WG and may be unnecessary, as [METADATA-PATH] is already solving the issues.

# Section 3.6: is there any generic way to solve the NAT-related issue listed, or will it always be Cloud operator specific configuration?
Murray Kucherawy
Abstain
Comment
(2024-09-18)
Sent
John's and Paul's primary DISCUSS positions nicely articulate the reason for this ABSTAIN. You might be able to make this more palatable by removing the specific product references and instead finding a way to be more generic (some providers do X, others do X with variant Y, still another does Z), but it's going to take a bit of work if you want to go that route. You got all the way to Section 3.6 without them, for example.

Other comments:

None of the authors claim to represent the cloud operators referenced in the document, and I'm not familiar with all of the names in Section 10, so I'll ask: Did this document get feedback or input from any of those operators?

The Abstract says 2023 but at this point we're pretty far into 2024. Is this material still timely?

As this document is Informational, I don't think you should use BCP 14. (It's barely used anyway.)

Section 2 defines "Heterogeneous Cloud", but this term is not used elsewhere in the document.

In Section 3.5, I recommend quoting ".internal". It looks like a syntax error without the quoting, at least as rendered in HTML.

Shouldn't references like [AWS-NAT] have links or some other actual followable reference?
Deb Cooley
No Record
Francesca Palombini
No Record
Mahesh Jethanandani
No Record
Warren Kumari
No Record