IDR - IETF 84, Vancouver
Chairs: Sue Hares, John Scudder
Note takers: Jeff Haas, John Scudder
Note editor: John Scudder

Time stamps are local wall clock time, provided for review of the audio file.
Audio: http://www.ietf.org/audio/ietf84/ietf84-georgiab-20120730-1300-pm1.mp3

=====================================================
Agenda: Interdomain Routing (IDR) WG
MONDAY, July 30, 2012, 1300 - 1530, Afternoon Session I, Georgia B
=====================================================
CHAIR(s): Susan Hares
          John Scudder

o Administrivia - Chairs, 10 minutes
  - Note Well
  - Scribe
  - Blue Sheets
  - Document Status

o BGP Flow-Spec Extended Community for Traffic Redirect to IP Next Hop
  http://tools.ietf.org/html/draft-simpson-idr-flowspec-redirect-01
  Adam Simpson, 10 minutes

o BGP Tunnel Address Prefix Attribute and Tunnel Address Prefix
  Extended Community
  http://tools.ietf.org/html/draft-xu-idr-tunnel-address-prefix-01
  Xu Xiaohu, 15 minutes

o Transitive BGP Graceful Restart
  http://tools.ietf.org/html/draft-zhang-idr-transitive-gr-00
  Alvaro Retana, 10 minutes

o North-Bound Distribution of Link-State and TE Information using BGP
  http://datatracker.ietf.org/doc/draft-gredler-idr-ls-distribution/
  Stefano Previdi, 5 minutes

o Using BGP for routing in large-scale data centers
  http://tools.ietf.org/html/draft-lapukhov-bgp-routing-large-dc-01
  Petr Lapukhov, 20 minutes

o Autonomous System (AS) Reservation for Private Use
  http://tools.ietf.org/html/draft-mitchell-idr-as-private-reservation-00
  Jon Mitchell, 15 minutes

Speaker shuffling time: 5 minutes
Total: 1 hour 30 minutes

=====================================================
Meeting Begins ~ 1:05

Administrivia - Document Status:
- Published RFC 6608, Subcodes for BGP Finite State Machine Error.
- AS0 draft in the RFC Editor queue.
- RFC 4893bis in AD writeup.
- BGP MIBv2 informal review in progress.
- Added route flap damping usage as a WG item.
- Adoption call closed for draft-uttaro-idr-bgp-persistence; the work
  was not adopted.
There are a significant number of WG documents currently waiting on
implementations in order to advance. (See the chairs' slides in the
meeting proceedings for the full list.) If anyone is aware of
implementations, please notify the working group.

~ 1:11
Alvaro Retana: How should one notify the working group that an
  implementation exists?
Sue: We're flexible.
John: We may just want to hear that it exists - or more detail. Send
  mail to the chairs.

----- 1:15

BGP Flow-Spec Extended Community for Traffic Redirect to IP Next Hop
draft-simpson-idr-flowspec-redirect
Presented by Adam Simpson

Deals with traffic redirection.

Motivation: When flowspec is used for mitigation, the ability to
redirect traffic toward an alternate destination is very useful.
Unfortunately RFC 5575 only provides "redirect to VRF", which is very
cumbersome: you may not want a VRF per mitigation, and if you want to
transport this across the network, you end up effectively setting up
an L3VPN network.

The proposal is a new extended community, "redirect to IP nexthop",
with a proposed code point in the draft. The MP_REACH_NLRI nexthop is
the target; the nexthop is currently not really useful in flowspec
routes and is effectively ignored.

Right now tunnel selection is just "best match". Will consider Robert
Raszuk's proposal for picking the tunnel; requires some additional
thought. May need to figure out how to deal with inconsistent
interactions, i.e. redirect to VRF will win if you have a tunnel as
well.

Inter-AS considerations: The new extended community is transitive
across ASes. A problem is that the nexthop would normally be reset at
an AS boundary. A new validation procedure should be applied by
default to the "redirect to IP" extended community received from an
eBGP peer:
- Discard the extended community if the last AS in the path of the
  longest-prefix-match route for the nexthop doesn't match. (Did the
  unicast route match the flowspec route?)
- Must it be possible to disable this check?

1:24 Feedback:
- Use cases for redirect to IP *and* also redirect to VRF?
  (mirror)
- Should the new validation check result in discarding the entire
  flowspec route, and not just the community?

1:25
John: You have a suggested code point in the doc. It's already
  allocated for experimental-use code points. There is a history of
  using them this way, but it's wrong; let's not continue it. Please
  take something out of FCFS.
Adam: Will do that.

----- 1:26

BGP Tunnel Address Prefix Attribute and Tunnel Address Prefix
Extended Community
Presented by Xiaohu Xu (Huawei)

Problem statement: There are some MPLS-based L2/L3VPN scenarios where
the underlying networks are IP-enabled rather than MPLS (e.g.
multi-tenant cloud data center networks).
- Load balancing is very desirable there.
- However, since distinct customer traffic flows between a given PE
  pair are encapsulated in the same IP/GRE tunnel per the normal
  procedure, core routers can't optimally load-balance.

The existing procedures in RFC 5640 require a change to the data
plane of the core routers:
- They must do the hash calculation on the specific load-balancing
  field contained in the L2TPv3 or GRE tunnel header.
- This can't be done in some cases; e.g. some deployed core routers
  can only do hashes on TCP/UDP packets.

Solution overview: A given egress PE router could tell ingress PE
routers more than one tunnel destination address (as a prefix) to be
used when tunneled traffic flows to it.

Tunnel address prefix attribute: alternatively, use an extended
community (for IPv6 this requires an IPv6-specific extended
community).

Applicability: Useful for L3VPN, 6PE, softwire mesh, BGP-free core,
and L2VPN including VPLS.

1:30 Comments?
Kireeti Kompella: I work on entropy labels. Two things we learned:
  first, you want to do as much load balancing as you can without
  dropping state into the core - your loopbacks are injecting extra
  state into the core. Second, small numbers like 16 don't cut it. If
  you had 15 ECMP paths and distributed 15 loopbacks, your spreading
  will be poor; you need to spread over a much larger number of
  values.
  With entropy label we have 20 bits; that's not feasible with
  loopbacks. The entropy label is unsignalled - it's 20 bits' worth
  of things. Doing this in IP would suck. IPv6 *might* work (flow
  label). This may scale in the core.
1:32
Robert Raszuk: This is an IP core. This goes by prefix.
KK: You still need bigger numbers, and on the endpoint you still need
  a lot of loopbacks. Are the operators willing to put that much
  extra config in the network?
1:32
RR: You had the draft on MPLS over UDP (referring to a draft
  presented elsewhere in IETF). That is very similar to this one. How
  do you compare the two?
Xiaohu: They cover related scenarios. This one covers GRE.
1:34
Randy Bush: Authentication? Security?
Russ White: If you're defining some new attributes to get this stuff
  to work - did you already look at the softwires work covering
  similar stuff? Just put the new attributes there? It seems to me to
  be simpler to use.

----- 1:35

Transitive BGP Graceful Restart
Presented by Alvaro Retana

Forwarding characteristics in the network are transitive - the same
thing we rely on when we do graceful restart. We forward toward the
restarting router even when there's no actual routing info, because
in GR we've been told that there's still state. We extend this to
more than one restarting router. Covers adjacent real routers, but
also virtualized routers.

The GR RFC says you must clean your RIB after you've gotten
End-of-RIB from all non-restarting routers. The proposal gives the
restarting router the ability to send the result of its best path
calculation to other restarting routers, and to wait for End-of-RIB
from the restarting routers. Not changing the timers.

Results:
1. Less churn.
2. It will take a little longer for non-restarting routers to get
   *our* End-of-RIB.
3. We're running best path selection and announcing routes, but not
   updating the RIB.

Routing info not used:
1. Routes that are already in the RIB. In GR, this is mostly routes
   that are in the RIB already.
2. Routes that are new.
3. Routes that are in the RIB pointing to *other* restarting routers.

1:43 Feedback:
RR: I'm looking at RFC 4724; I'm not seeing anything that precludes
  an implementation from waiting on End-of-RIBs.
AR: "The forwarding state of the speaker MUST be updated and any
  previously marked stale state must be removed" (from slide citing
  the RFC).
1:44
John: Basically you want to extend GR to cover double faults? For
  some reason that's not just for fun?
AR: Lots of virtualized routers.
John: I haven't quite gotten my brain around this. It kind of looks
  like it works in the double fault case, but in the triple fault
  case it doesn't help: you end up with a round of convergence with
  the healthy peers, then with the other guy. Three faults probably
  doesn't work?
1:46
AR: We think it does - we need a whiteboard.
John: Let's work on this more offline.
Sue: Does it work in implementation?
AR: We've got prototypes. Works for the single fault case at least.
Kireeti: The normal/legacy case is one control plane per device. Once
  we're talking virtualized, we're talking more than two. We need to
  really do N.
AR: OK.

----- 1:47

North-Bound Distribution of Link-State and TE Information using BGP
draft-gredler-idr-ls-distribution-02
Presented by Stefano Previdi

The list of authors is growing (huge).

Changes:
- Code point cleanup.
- Moved link state attributes into BGP path attributes.
- NLRI: type 1, link descriptor; type 2, node descriptor.
- BGP path attributes: link attribute TLVs.
- No node attribute TLVs.

History:
- Initially for TE sub-TLVs.
- Extended to be generic to topology. Used for link state, but
  nothing prevents this mechanism from advertising any sort of
  topology. Lets you do topology hiding.

Use cases:
- Not really for routing [not meant to be deployed on a router].
- ALTO, multi-domain PCE, etc.

Status:
- Three implementations exist.
- A wireshark implementation is available.
- Addressed initial concerns:
  - The PCE use case wasn't clear enough - don't step on PCE WG work.
  - Isolation from BGP-4.

1:52 Next steps:
- Looking for working group adoption.
- Starting the implementation report and interoperability draft.

John: Seems like it's mature. We'll take it to the list.
Sue: I'm still a bit concerned about isolation. We can take that
  offline.
Stefano: The draft covers a lot more of the use case about how this
  shouldn't be run in the routing layer, or even on the same router.
1:53
Sue: Implementation reports should help clarify that.

----- 1:54

Using BGP for routing in large-scale data centers
Presented by Petr Lapukhov

Goal: informational RFC. Not sure if IDR is the appropriate group for
this.

Requirements, from the server perspective: 100K+ servers with 10G
NICs. Distributed applications:
- Aware of the network.
- Explicit parallelism.
- Example: web index computation.
"Network as a computer" concept.

Slide showing DC example: lots of east-west traffic; need to
accommodate this in the designs. Two types of traffic: query vs.
background.

Design requirements:
- Build a topology providing horizontal bandwidth scalability.
- Minimize the feature/protocol set.
- Select the simplest, most common protocols.
- The protocol must support some traffic engineering.

Topology choice: Clos.
- Lets you scale things horizontally.
- Provides "full bisection bandwidth" from spine to leaf if the
  number of leaf-to-spine links M >= the number of leaf-to-ToR
  links N.
- Requires some sort of ECMP solution - normally routing based.

Scaling the Clos topology:
- Think multiple parallel Clos topologies.
- Lower port density on switches.
- Horizontal capacity scaling at every layer above the ToR.
- Two parallel Clos topologies:
  + Add capacity by adding boxes.
  + Downside is link scaling.
  + Capex is smaller for boxes, but opex is higher for link-level
    issues.

Routing design for parallel Clos:
- BGP all the way down to the ToR - no IGPs. BGP at every layer.
- Separate BGP AS number per ToR.
- Single AS for the spine layer.
- This is all eBGP, no iBGP. All sessions use physical addresses on
  links.

Design specifics, default routing:
- Don't use a "default route only" model.
- Don't hide specific prefixes.
- Otherwise you get route blackholing on links.
- Can't do summarization: every subnet on a ToR will show up on every
  box in the topology. Currently only ~8k prefixes in the routing
  tables in their network right now; not concerned currently about
  its scaling.
- Summarizing p2p links is OK.
- Default route only for stuff outside of the data center.

Why BGP over an IGP?
- BGP simplicity:
  - Simpler protocol design concepts compared to IGPs.
  - Better vendor interop.
  - Fewer state machines, data structures, etc.
  - Compared mostly against OSPF (used Quagga as a reference for
    complexity).
- BGP allows per-hop traffic engineering:
  - This way we can inject prefixes at any layer.
  - Used for unequal anycast load balancing.
- BGP simplicity:
  - Troubleshooting the BGP RIB is simpler.
  - Clear picture of what is sent/received.
  - There's no LSDB to troubleshoot.
  - Don't have to troubleshoot via the forwarding table.
- Event propagation is more constrained in BGP. An IGP event hits the
  entire network. The rate of change just isn't all that high.

Common arguments against BGP:
- What about config complexity? Neighbors?
  - Automated config generation.
- What about convergence properties?
  - Not our primary goal; a few seconds are OK.

BGP-specific features:
- Requires BGP AS_PATH "multipath relax":
  - Rely on ECMP for routing.
  - Needed for anycast prefixes.
  - Perform ECMP where the neighbor AS numbers differ.
- Only using 16-bit private BGP ASNs:
  - Simplifies path hiding at the WAN edge.
  - But we only have 1022 private ASNs.
  - The "remove private AS" feature works to hide stuff at the edge.
- Allowas-in (in Cisco speak; "loops" in JunOS):
  - Used on ToR uplink eBGP sessions.
  - Effectively ToR numbering is local to the container.
  - Requires vendor support.

2:09 Features that would benefit from standardization:
- ECMP programming
- AS_PATH multipath relax
- Allowas-in
- Fast eBGP fall-over
- Remove private AS
- Unequal cost load balancing
- 32-bit private ASNs

2:12 Questions?
Paul Unbehagen: Does the topology in the diagrams follow the peering
  topology?
Petr: Yes.
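The "multipath relax" behavior in the feature list above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the `Path` type and addresses are invented, and the real BGP multipath decision also compares local-pref, MED, IGP cost, and more; only the AS_PATH conditions relevant to the Clos design are modeled here.

```python
# Hedged sketch: BGP ECMP candidate selection with and without
# "multipath relax". The Path type and example routes are
# illustrative only.
from collections import namedtuple

Path = namedtuple("Path", ["nexthop", "as_path"])

def multipath_candidates(paths, relax=False):
    """Return paths eligible for ECMP alongside the best path.

    Simplified standard rule: equal AS_PATH length AND identical
    first (neighbor) AS. With relax=True only the length must match,
    which is what a Clos fabric with a distinct AS per tier device
    needs for anycast prefixes.
    """
    best = paths[0]          # assume paths[0] already won best-path
    out = [best]
    for p in paths[1:]:
        if len(p.as_path) != len(best.as_path):
            continue
        if not relax and p.as_path[0] != best.as_path[0]:
            continue
        out.append(p)
    return out

# Same prefix learned via two neighbors in different ASes at the same
# tier, with equal AS_PATH length.
paths = [
    Path("10.0.0.1", [65001, 64512]),
    Path("10.0.0.2", [65002, 64512]),
]

print(len(multipath_candidates(paths)))              # strict: 1 path
print(len(multipath_candidates(paths, relax=True)))  # relaxed: 2 paths
```

Without the relax knob, the differing neighbor AS (65001 vs. 65002) disqualifies the second path, so the fabric would never spread traffic across both uplinks.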
P: There's been a lot of research over the years showing that when
  you interconnect ToRs and rows between the various tiers you get
  better flow.
Petr: That needs core density, with lots of line-card boxes. We're
  trying to keep this cheap, which pushes us to the Clos topology.
P: Some of those studies show different density. On convergence
  speeds - how do the app guys take the BGP speed?
Petr: The app guys are happy with 10s, but we got it down to ~1s. The
  rich ECMP fanout gives us fast failover.
P: An IGP would converge much faster.
Petr: The OSPF code base is much more complicated. Vendor problems
  there? New vendors tend to get OSPF wrong.
P: I would expand on the sections that talk about traffic patterns
  and also the convergence speed. A lot of applications will die at
  1s convergence.
2:17
Arif (coauthor): OSPF vs. BGP - BGP has natural loop suppression; an
  IGP has a natural "bounce" effect on route propagation. One of the
  points of the design was to suppress that.
Pedro Marques: I suspect you'll get better convergence with this than
  with an IGP.
Petr: IS-IS mesh groups do this too.
Ashad(?): In the topology, you don't accept routes from anything in
  the same layer; layers shouldn't carry cross-traffic. In this
  topology you get better convergence than an IGP?
John: When I read the draft, most of it was easy to understand, but
  the third-party ECMP stuff sort of lost me. Please expand on that
  in the draft.
John: I went back and looked at RFC 4271, sec. 6.3: in eBGP you
  shouldn't accept a third-party nexthop.
Jeff Haas: A third-party nexthop is fine; a non-directly-connected
  one isn't.
John: In terms of the draft, we'd like to see it published, but we're
  not usually the right place for this. We'll talk to the AD offline
  about it. Maybe the AD could sponsor it?
Pedro Marques: How many uplinks? We're only allowed 2 uplinks.

----- 2:23

Autonomous System (AS) Reservation for Private Use
draft-mitchell-idr-as-private-reservation
Presented by Jon Mitchell

Microsoft's motivation is not the previous presentation. WAN
connectivity comes in via dedicated leaves; they use one AS per leaf.
The AS space is also shared with a large internal "enterprise"
network.

Wants:
- A bigger range of private ASes in the 4-byte space.
- Clarify the end of the 2-byte private space: the IANA registry
  conflicts with RFC 1930.

Closed issues:
- Fix the end of the existing range at the existing IANA range end.
  (Some vendors end the range one ASN higher than that.)
- Not going to use the last ASN in the 16- or 32-bit space.
- Not going to discuss whether private ASNs are a good idea or not.

Open issues:
- Range size: proposed ~1M ASNs. The existing range was 1.56% of the
  original 2-byte space; the proposed new range would represent
  ~0.02% of the 4-byte space. Most feedback was that this size is
  big enough.

Show of hands: Too many? (not many)

2:28
Ruediger Volk: I have the feeling that we may want *more* bits in the
  private space. Let's change this once, rather than doing this
  again.
John: Ruediger, I think you said 1M isn't enough. That would carry
  more weight if you can justify it.
RV: I haven't done the math yet. Private AS is a flavor of reserving
  number space - knowing that it wouldn't cause conflict down the
  path. Thinking of things in 16-bit blocks: 8 blocks maybe not
  enough? 20 bits? (Editor's note: the draft does propose 20 bits.)

Other open issues: Range structure? The draft has a decimal-based
range; an alternative proposal would be easy to troubleshoot in
asdot notation.

2:32
Jeff Haas: Have the RIRs weighed in?
Jon: My understanding is that this is not a proposal to the RIRs; I
  was advised of this by a member of the ASO.
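The range-size percentages quoted in the open-issues discussion above can be checked with a few lines of arithmetic. This sketch uses the IANA 16-bit private range boundary (64512-65534) discussed in the meeting and the ~1M figure from the room; the draft's own proposal is a 20-bit range (1,048,576 ASNs), which gives essentially the same percentage.

```python
# Quick check of the range-size percentages quoted above.
PRIVATE_16BIT = 65534 - 64512 + 1   # 1023 ASNs in the IANA 16-bit range
TOTAL_16BIT = 2 ** 16
PROPOSED = 1_000_000                # ~1M ASNs, the figure discussed
TOTAL_32BIT = 2 ** 32

pct16 = 100.0 * PRIVATE_16BIT / TOTAL_16BIT
pct32 = 100.0 * PROPOSED / TOTAL_32BIT

print(f"16-bit private range: {pct16:.2f}% of the 2-byte space")  # 1.56%
print(f"proposed range:       {pct32:.2f}% of the 4-byte space")  # 0.02%
```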