IETF NVO3 WG
Virtual Interim Meeting Agenda
2015-05-22 10:00-11:30 EDT

Chairs: Benson Schliesser, Matthew Bocci
Secretary: Sam Aldrin

Note-takers: Sue Hares, Matthew, Sam, Ignas

Topic: OAM

1. Meeting Introduction and Agenda Bashing
   Chairs (5 min)
   Discussion: 
The chair reviewed the Note Well and pointed to the etherpad and virtual blue sheets.
The chair indicated that the topic was OAM and coming to consensus on the OAM mechanisms. Agenda item #6 was already presented in Dallas.
We need to agree on how to mark the bits, get an understanding of how to proceed, and reach consensus on the OAM approach.
 
       
2. draft-rtg-dt-encap
   Erik Nordmark (15 min)
   Discussion: 


Erik Nordmark: This was covered in the interim before Dallas, now refreshing. There is an update to the document; -02 has some minor changes based on feedback. One operational consideration has to do with fragmentation and MTU. The way some of these technologies are deployed assumes that the underlay MTU is well managed. This shows in the VXLAN RFC too - it says you should not do fragmentation. The same is seen in SP networks. We can also expect other new encapsulations to continue along the lines of requiring a large MTU. However, it still seems practical to have a mechanism to detect and report misconfigurations, and for diagnostics.
The easy way is to generate an ICMP message back and include as much of the payload as possible, but not everyone implements that. Another aspect - if you define another encapsulation that is not constrained to a well-designed datacenter, it may need fragmentation and reassembly. In terms of OAM itself, the design team looked at OAM and realized that we do not want to redo all of the OAM discussion, but rather look at what we could reuse for an OAM packet format. Inband OAM for measurements, and out of band for maintenance. Fundamental property - OAM packets should follow the same path as data packets, which places some constraints on entropy and on the forwarding operations of devices. The other thing that matters is to have some way of marking the OAM messages in the data plane. Measurements would require adding counters, timestamps, and other bits.
An error reporting mechanism is also required, and this is largely not done in NVO3. ICMP is not always reliable; it may get filtered out for good operational reasons. There is also a question of coordination between the different groups working on OAM.
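
Illustration of the MTU misconfiguration point (a minimal sketch, not presented at the meeting): the check below assumes VXLAN over an IPv4 underlay with an untagged inner Ethernet frame, which adds 50 bytes to the overlay IP packet. The function names are hypothetical, and other encapsulations or an IPv6 underlay would need different constants.

```python
# Per-packet overhead for VXLAN over IPv4 (assumed encapsulation for this sketch).
INNER_ETHERNET = 14  # inner Ethernet header, no VLAN tag
VXLAN_HEADER = 8
UDP_HEADER = 8
OUTER_IPV4 = 20      # outer IPv4 header without options


def required_underlay_mtu(overlay_mtu: int) -> int:
    """Smallest underlay IP MTU that carries a full-size overlay packet unfragmented."""
    return overlay_mtu + INNER_ETHERNET + VXLAN_HEADER + UDP_HEADER + OUTER_IPV4


def mtu_misconfigured(overlay_mtu: int, underlay_mtu: int) -> bool:
    """True when a full-size overlay packet would not fit in the underlay."""
    return required_underlay_mtu(overlay_mtu) > underlay_mtu


if __name__ == "__main__":
    # A 1500-byte overlay MTU over a 1500-byte underlay is the classic misconfiguration.
    print(required_underlay_mtu(1500), mtu_misconfigured(1500, 1500))
```

A diagnostic of this kind only flags the mismatch; reporting it back to the sender is the part that depends on ICMP or on another error reporting mechanism.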


Matthew Bocci: Are you talking about RDI (functionally)?
Erik: It may be in the underlay, which may use ICMP, or it could be in the overlay.
In order to avoid sending OAM to end stations, it is better to drop after decapsulation rather than overload a bit in the header.

Alia Atlas: The goal of the design team was to research the options. It is important to have similar mechanisms and not invent everything new.

Benson: You said this was reflected in the draft that was posted last night. 
Erik: This is an updated draft based on feedback in Dallas. The next step is to request this to become a WG Draft. 
Benson: We should talk further at the end of the call about whether this is a complete set of OAM requirements.

3. draft-ashwood-nvo3-oam-requirements
   Chenhao Philips (15 min)
   Discussion: 
Erik: Section 5.4 on the number of available paths states "Must provide a mechanism to exercise/trace all data paths that result due to the ECMP/LAG hops in the underlay."  We ran into several pieces of this problem during TRILL discussions.  The problem is that NVO3 does not know how wide the ECMP/LAG is in the underlay.  If you do not have insight into what the underlay protocol is, you cannot really find out. Say you have 4 paths on node 1 and 4 paths on node 2.  You would assume that you have 16 paths.  But the packet hash may be identical on both nodes.  You attempt to avoid this, but it can occur.  Therefore, you might only have 4 paths (as the hash calculation is the same).  It would be good to understand that this is the limitation.
Deepak: Is the requirement to discover all the paths? Or is it just to travel all the paths?
Erik: Whether all data paths are being exercised - is this your question?
Sam Aldrin: Seems like a reasonable requirement.
Deepak: You should have the mechanism to trace the paths used. 
Erik: I did not understand what you indicated?
Deepak: It is difficult to get all the paths in MPLS and to discover the hash path.  Discovering all paths is a difficult requirement.  We should be able to handle all paths to query.  The mechanism should be there.  The protocol may not automatically do this query, but 
Erik: You can finesse this by stating that if you know the paths, then you can use the OAM to exercise/trace the paths.
Benson: This seems like good advice. I suggest the authors talk to Erik offline to revise the phrasing. Any other comments?
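
Illustration of the ECMP limitation Erik describes (a minimal sketch, not presented at the meeting): two underlay hops each offer 4 equal-cost next hops, and each hop picks one by hashing the flow entropy. The use of the UDP source port as entropy, the per-node seed, and the SHA-256 hash are all assumptions made for this example; real devices use their own hash functions.

```python
import hashlib


def pick_next_hop(entropy: int, seed: int, width: int) -> int:
    """Hypothetical ECMP choice: hash the flow entropy with a per-node seed."""
    data = seed.to_bytes(4, "big") + entropy.to_bytes(2, "big")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:4], "big") % width


def reachable_paths(seed1: int, seed2: int, width: int = 4) -> set:
    """Set of (hop 1 choice, hop 2 choice) pairs seen over all ephemeral source ports."""
    ports = range(49152, 65536)
    return {
        (pick_next_hop(p, seed1, width), pick_next_hop(p, seed2, width))
        for p in ports
    }


# Both nodes compute the same hash: the second choice always equals the first,
# so at most 4 of the 16 nominal paths can ever be exercised.
same_hash = reachable_paths(seed1=7, seed2=7)

# Nodes hash independently: many more of the 16 combinations become reachable.
independent_hash = reachable_paths(seed1=7, seed2=8)

if __name__ == "__main__":
    print(len(same_hash), len(independent_hash))
```

An overlay endpoint varying only its entropy field cannot tell these two cases apart without knowledge of the underlay, which is why the requirement is easier to meet when phrased in terms of paths that are already known.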

4. draft-tissa-nvo3-oam-fm
   Deepak Kumar (15 min)

Shows the OAM frame - presented some time ago. 
NVO3 shim layer, followed by optional payload fragment. Uses TRILL OAM - Ethertype followed by OAM message channel.
Explains bits in OAM frame.
OAM message channel. Various OAM frames can be sent on this.
CCM, Loopback, PathTrace, LinkTrace, LM/DM, etc. All based on TRILL and 802.1ag.
Will add link trace to the draft. Shows details in slides.
Claims existing deployments.

Discussion: 
Note: Link trace will be added in the updated draft. 
Erik: Question - you said punt and forward. This assumes TRILL-like topology. In NVO3 we do not have intermediate NVEs. This assumes a TRILL topology.
Deepak: This is based on a real deployment with a large customer. It is optional. It is not for all topologies, true. You have to have hardware that understands underlay OAM.
Erik: The place where you think it will be used is where the red boxes are routers running NVO3.  You want to augment the OAM to do more than the IP ping or traceroute. It would help to explain the picture. 
Deepak: I will explain this prior to uploading the draft.  It is useful for hardware to look deeper in the draft.  I will upload the draft and look for comments. 
Benson: I am still confused.  There is a question of how coupled the underlay and overlay are.  The requirement is to trace paths where intermediate nodes may have an interaction.  There are two methods:  a) where there are no interactions with underlay and interactions 
Deepak: Think about the VTEP termination happening on the gateway and then going to the ToR. The customers want end-to-end OAM.  The only input the customer will give is the VMs at the endpoints.  The customer asked for the full path. The rest of the OAM occurs in the forwarding.  The packet is injected as if it comes from the VM. It gets injected as if it comes with the VXLAN header.  All the routers in the path that understand the "o" bit will provide an OAM response.
Benson: I understand what you just said.  For the WG, does this violate the overlay and underlay separation?
Erik: I'm not sure we are understanding this picture yet. We have three things: a) decap, b) route, c) encap.  When these red boxes are routers, then the traceroute from VM to VM - these are regular routers. I can see having the red routers participate in the OAM (if they are IP routers).  If they are not OAM aware in the underlay, this looks different.
Benson:  If the red boxes are underlay, orange boxes are NVO3 encapsulation, and red are underlay. If red boxes are underlay, the boxes participating in the OAM is different. 
Deepak: I think if the red boxes are in the overlay layer, then the responses are normal. The customer wants the underlay boxes to optionally respond to OAM. I am assuming that the MIP level is configured throughout the network - in both 
Chandra: Are you describing the scenarios described in the EVPN drafts where there are L3 Gateways and L2 NVEs devices that default route to the L3 Gateways?
Deepak: Yes. 
Chandra: Even though there might be devices in the path that are not NVEs, they are still functioning in the overlay and seem to be relaying overlay packet passing through back to an overlay endpoint - right?
Deepak: The messages can be sent back to the controller.  The request is to give the exact path that goes overlay and underlay. 
Chandra: I understand that not all devices might be capable.  I am just trying to understand your model. One possibility is that this model maps to the DCI function within EVPN. I need you to explain the other cases that are not in the overlay, perhaps in your draft.
Deepak: I will create a new section based on 802.1AG using the link-trace.  A single packet gets sent in and at every point gets punted up to the OAM function. 
Lucy Yong: I agree with Benson's point - that the overlay and underlay are separate. I echo Erik's point that if the red boxes are underlay boxes, the boxes should not be available to OAM.  If the red boxes are special gateways (e.g. tunnel stitching), then these can be served by OAM - as this is an overlay.  If you feed back the path that links to the underlay path.  If the datacenter wants to trace through a gateway, this is possible.
Deepak: I will describe the situation in the draft from the customer requirement. 
Lucy: If you are using this for tunnel stitches, this is reasonable.  If it is another draft, I do not see where this is valuable. 
Deepak: I will try to document both cases, i.e. the DC-GW scenario and the underlay scenario.
Erik: It would be useful to show the 3 layers (overlay, underlay, stitching layer).  It may be that stitching occurs in the routed overlay, L2 devices, or specific stitching devices.  Having a picture will make this easy to talk about. 
Benson: We need to move on, and we have 

5. draft-singh-nvo3-vxlan-router-alert
   Diego Garcia del Rio (15 min)
Skipped as presenter and slides are missing.

6. draft-nordmark-nvo3-transcending-traceroute
   Erik Nordmark (15 min)

Presenting this again to see if there are any additional requirements. Trace route shows the path, even though the user can’t control the path. Overlay networks might hide the underlay path, which is a useful policy, but should separate policy from mechanism. This is easy to rectify, and details are specified for VXLAN.
Draws comparison with the Internet trace route case - useful for a trouble ticket. Policy could be used to filter (or not) ICMPs.
Next steps: is there a requirement for mechanism to see underlay errors? Path MTU mismatch between overlay and underlay. Is it useful to require a Uniform TTL bit in NVO3 header?

Discussion: 
Lucy: Is this an NVO3 requirement or is this an OAM requirement?
Erik: It depends on what the scope of OAM is.  If we see breakage in the underlay, what is the impact in the overlay?  From a practical perspective, if the underlay provides indications of breakage. 
Lucy: This depends that you have a correlation between overlay and underlay to do this OAM properly. 
Erik: If you have a gateway stitching device that gets OAM errors, should the gateway stitching device send back the OAM indications to the gateway?
Lucy: This is something to research and decide on.  Should we use OAM or something else? We need to look at the pro/cons.
Deepak Kumar: This is something the customer is interested in.
Benson: On the OAM reaching into the underlay: in the NVA architecture with a well-managed data center network, there may be a controller that can usefully reach into the network - there this OAM has functionality.  In a less controlled network, managing the underlay nodes may be difficult if the underlay nodes are invisible.
Erik: (missed portion of the discussion).  If I am running some OAM pathtrace that looks at the devices visible in the overlay, and then I get some underlay information on error - this is one place to report the underlay information.   The second case is when you generically see underlay errors reported in the response.
Benson: This is a good response to my topic.  My question earlier was how much overlay/underlay do we use.  
Erik: I think we should integrate what the underlay 
Lucy: We should not require underlay to support things, but should leverage what the underlay supports. 
Chandra: There is no additional support in the underlay but in fact this uses what the underlay already supports. It is just bundling and connecting that information. 
Lucy: The architecture makes the underlay invisible.  I can see that the underlay information may be valuable.
Chandra: Yes, in theory the underlay should be invisible, but when there is an issue, most organizations would like to get some information about what happened when the customer had the problem. What we observed is that being able to provide information that is already generated when the endpoint issued the traceroute makes the entire process very efficient.
Lucy: I agree that this is a problem that operations must address. It is not clear what the right architecture is to address this problem.  Typically this overlay is separate from the L3VPN. I do not think we should tie the underlay and overlay together except at the stitching point.  We should separate the tunnel trace (stitching) and the end-to-end path in the overlay. The operator troubleshooting the underlay path is a different thing.
Benson: Lucy, you are capturing the point we need to discuss.  Perhaps you could continue this discussion on the mail list.
Lucy: I would be glad to do this. 
Benson: Deepak's text regarding his customer use case would be useful.  This discussion is important context that prepares us to talk about the bits for the solution.  We should discuss the context and potential bits on the mail list. 

7. Discussion
   All (10 min)

Benson: Any final comments before we close the meeting today?
Benson: Thank you.