How Ossified is the Protocol Stack? Proposed Research Group (HOPSRG)
====================================================================

DRAFT Minutes of the July 24, 2015 meeting at IETF93, Prague

Thanks to Tommy Pauly for the detailed transcript!

Intro & Overview
----------------

Brian Trammell and Mirja Kühlewind, Chairs

Mirja: Welcome to the first meeting of HOPS. Aaron Falk is on the jabber channel. Tommy Pauly is note taker. How did we get here? From a BarBOF held by Aaron Falk in Dallas. How can we evolve transport stacks when there are all of these middle boxes which may block or bleach packets? There was discussion around middle boxes; we know it's happening, but how much? Is it one box, or tons? Should we design protocols around this? Many application protocols have this knowledge, since they do fallback. It would be useful to harvest this data.

Mirja: BarBOF was successful, and people were interested in both passive and active measurements. We started writing a charter for the potential RG. Not much discussion on the list, but not because of lack of interest; more that the need is clear. Probably will have a couple more meetings before becoming a formal RG.

Brian: Quick look at charter. Note well. See https://datatracker.ietf.org/rg/hopsrg/charter/. Mailing list is hosted as hops@ietf.org. Note that this is *not* under irtf.org yet; we'll move this over when and if we form an RG.

Brian: We are trying to deploy new tech and protocols, and NATs and firewalls are used as reasons to not deploy them. Anecdotal data is not enough, so let's get some real data! Gather data from multiple, aggregated studies to better understand the issues on the live internet.

Objectives:

1. Forum for discussion and exchange.
2. Define a common format for reporting middle box impairments.
3. Specify methods for analyzing middle box interference and active measurement strategies.

Aaron: I've been focused on middle boxes previously, but I have some slides on routers that don't participate in trace route.
Unless we want to call routers middle boxes, we may want a better term.

Brian: I consider routers middleboxes. Anything on path that might mess with packets.

Gorry: Network devices is a term we could use for all the boxes.

Jana: People have opinions without even reading the charter! I was hoping that we would have end-to-end measurements as well, not just for middleboxes. Google has lots of measurements here.

Brian: I think that is in scope. We can do A/B testing over different paths end-to-end.

Mirja: These may be the same measurements. Throughput is throughput. However, we're looking at the middlebox's influence on the measurements.

Brian: There is a line between traffic analysis and path impairment analysis. We may be being too careful to stay on one side of the line.

Brian: To the third point, as we develop more methods, that may become more of an IETF effort.

Brian: Overview of agenda. Any bashing?

Mirja: We didn't have a lot of open call because we had a lot of people already wanting to present, so we had a full schedule. Hopefully we'll have more time to present stuff at the next meeting.

Brian: If anyone has recent interesting results, please come up at the end.

Results from deployment of QUIC, a UDP-based transport
------------------------------------------------------

Jana Iyengar (Google)

Jana: Good morning!! (crowd groans, since it is Friday).

Mirja: Are you not standing in the pink box on purpose?

Jana: I haven't stood in the pink box this whole IETF and am still alive. At the QUIC BarBOF (on Wednesday evening), we shared data internal to companies--Mozilla, Google, others. We've done various experimentation, and we want a forum to discuss this data regularly. To ask for more data, but also to inform protocol design. Hopefully the results here will spur on new designs.

What is QUIC? It is a new transport from Google that is reliable, multiplexed on UDP.
The most important part is that it is always encrypted, so that middle boxes cannot do anything other than play with the shape of the traffic. They cannot try to accelerate, etc, however. I'll leave the rest out, but ask questions later. Designed to reduce web latency. TCP + TLS + SPDY over UDP.

Quick QUIC overview! Controlled experiments have been done with the Chrome browser connecting back into Google. Measuring latency/bandwidth/quality/errors on client, latency/bandwidth/success on server. Lots of the talk has been about the fact that it is over UDP, and what the performance of UDP on the internet is.

What is the scale? The amount cannot be disclosed--we've scaled up tremendously from 0 to some undisclosed number, especially in the last 6 months (first half of 2015). This is all between Chrome and Google, across all devices (some graphs are only desktop). It is on the order of millions of users. If UDP was not reachable, then none of this would work. But it does work 92% of the time! In 7% of cases QUIC cannot be used (UDP is blocked or maybe something else), and in 1% UDP is rate limited (or at least the performance of QUIC is poor). UDP is generally not getting any worse treatment than TCP. This data is from massive scale. This data includes Google's CDN, but other networks as well.

Impact of 0-RTT. About 75% of QUIC connections happen with zero RTTs with secure establishment. This accounts for 50 to 80% of median latency improvements. That reduction of about 2 RTTs is a huge win for startup latency, but doesn't affect some of the other metrics.

Connection pooling. Pooling is something we've thought about a lot in transport. Share multiple streams from one app over one connection. Shared connections give 10% latency improvement.

Pacing packets. Similar to fq qdisc. Paces out entire congestion window. Improves payload latency by 25%. Does not change median latency or QoE.

Karen Nielsen: When you speak about latency, is this because it has such a large initial window?

Jana: I don't think so, though we do use a large initial window. This helps loads that are short. This is intuitive. What is not obvious is that a larger window with pacing (32) has a better loss rate than a smaller window (10) without pacing. Pacing is a very important part of choosing window size, and makes a big difference.

Also tested CC: Reno vs. Cubic. Uses Cubic so that the comparison with Linux stays like for like. The quality of the experience is almost the same between Reno and Cubic. This may not be surprising, since Cubic is often in Reno mode. We saw that Reno has about 20% lower retransmission rate. This is interesting, and we think this may be because of buffer bloat. Thinking about switching to Reno.

QUIC defaults to 2-connection emulation. Has slight improvement in tail latency.

Michael Welzl: Is this both the increase and decrease? Is it all paced?

Jana: Yes. And it is all paced.

Tail loss probe improves latency. Also does time based loss detection, using FACK with threshold of 3. We noticed that the network does not do a lot of packet reordering.

Mirja: On the QUIC pie chart, you said it was better than TCP. Do you have measurements to compare to TCP?

Jana: No, it's hard to get aggregated measurements to compare along the same path. Any given user is doing one or the other (QUIC/TCP), so no.

Mirja: Do you have numbers for users who use TCP almost always due to having a problem on their access point?

Jana: Yes, a 7% failure rate for flows. But we don't know if they are strongly correlated to how many users always have problems. Chrome will try to adapt and switch to QUIC when it can, but we don't have numbers on that.

Mirja: Where we are rate-limited, how do you detect it?

Jana: It is a coarse judgment. If the connection is established, we look for extremely high loss rates, but we're still looking into how to solve this problem.

Mirja: Does TCP actually give you better results when you fall back?

Jana: Yes, we do know that.
It is something at the ASN level.

Karen: Is this representative of the whole internet?

Jana: This is all over the world, for all users of Chrome. Not special environments.

Robert Kisteleki: At the end of the day, this is UDP, and the payload is 'magical'. How often are failures because of content, or just because of UDP?

Jana: We only know when QUIC doesn't work, not UDP in general. It is an interesting question that we have not pursued. Anecdotal data does seem to have similar failure rates for generic UDP.

Report on prevalence of NAT and forwarding of traceroute
--------------------------------------------------------

Aaron Falk (Akamai)

Aaron: I'm presenting someone else's work, since colleagues are in conflicting meetings. Data was collected by Arthur Berger and Dave Plonka. In contrast to Jana's talk, the data were collected without the intention of answering a specific question. First set is analysis of connection status data taken from Akamai's network, looking for prevalence of NAT port translation. Second set is about trace route of different types (ICMP, TCP, UDP).

NAT port translation: How often are ports translated? Gut feeling is almost always, but let's measure. There is a TURN service in Akamai used for a peer to peer app using a fixed UDP port mapping. So this is good to detect port translation. On one day, July 14, 2015, looked at numbers for 3 million clients over 231 countries: 68% of clients had the right port. So 32% had port translation.

Mike Fischer: I'd rather see the percentage of clients in which the port was never changed for any session.

Aaron: It's hard to say anything about never. If we look at the pairs of connections, we see that 55% of pairs did involve port translation at least on one side.

Mike: I'm not sure that's the same thing.

Aaron: Hm, yes, I think we can get that question from that data too. Get back to you later.

This particular application tends to be skewed towards use in the US. We had at least 20,000 samples of IP-port pairs.

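As a rough consistency check on the two figures quoted above (68% of clients keeping the expected port, 55% of pairs seeing translation on at least one side), the sketch below computes what the pair figure would be if the two sides of each pair were statistically independent. That independence is an assumption made here for illustration only, not something stated in the talk, and the function name is hypothetical.

```python
def share_of_pairs_with_translation(p_preserved: float) -> float:
    """Share of two-client pairs with port translation on at least one side.

    Assumes (for illustration only) that the two sides of a pair are
    independent and each keeps its port with probability p_preserved.
    """
    return 1.0 - p_preserved ** 2


if __name__ == "__main__":
    # 0.68 is the per-client share reported in the talk.
    expected = share_of_pairs_with_translation(0.68)
    print(f"{expected:.1%}")  # prints 53.8%
```

Under that assumption the result is close to the 55% reported for pairs, which suggests the two numbers are broadly consistent with each other; it does not answer Mike's question about clients whose port is never changed.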
Most port re-mappings were in AU, other country breakdowns. The port number distribution is pretty much smooth across all ports. Bump around 16000, which is near the application's port number. Also spikes around 56000. It turns out that there are also clusters 47 ports apart that are more common. There is a trend towards ports that are divisible by 47. Could be a legal intercept thing; there's an RFC about mapping to a smaller set of ports to make legal intercept easier. Top ASNs with this behavior are almost all mobile/carrier networks. Could be a carrier NAT thing.

We also did ICMP, TCP/SYN, and UDP probes for trace route, over v4 and v6. Not a significant difference between protocols for how often the trace route worked. There is a significant difference between ASes. Targets across the world (226 countries). Chart of networks that have 100% for one protocol, with a very low % for another protocol. Interesting that sometimes one of the protocols does get penalized on a network. But no one approach always works on all; it is somewhat random all over the world.

Jana: For the first (HP) ICMP has 100%, and both TCP and UDP have the same (58%). Is this because of a similar loss rate?

Aaron: Don't know. Let's take it offline. Tools to recommend: intrace, DASU, and ALICE. With that, open up to questions.

Overview of RIPE Atlas
----------------------

Robert Kisteleki (RIPE)

Robert: I'm representing the RIPE Atlas network. Since it is a significant measurement network, wanted to share results. Not going to talk about what RIPE Atlas is--you should know. Supported measurement types: ping, trace route, DNS, NTP, SSL/TLS cert checks. Mainly doing router level, not apps. Working on HTTP, SSL/TLS version/cipher tests, and WiFi.

Richard Scheffenegger: There has been deployment by one large vendor of ECN. Are you planning on testing ECN, for TCP and IP?

Robert: I'll talk about this later. Short answer, no.

Mainly deployed in homes, so must be small and not take much energy. Also don't take too much bandwidth. Very resource constrained. The capacity does increase over time, so expect more in the future! But working now in just MB of memory.

Ken Calvert: Are your boxes the home router, or are they behind it?

Robert: They are just on the home network. We deployed RIPE Atlas anchors about two years ago in the core, to have more stable measurement points.

Our measurement code has to be very efficient. It is a no-fork model with many threads, uses libevent. New protocol testing is hard, and exotic protocol measurement is even harder. Not much benefit to experiments with such a constrained environment. Also, because the probe is headless, it is expensive if they die. So, if you want to involve RIPE Atlas in HOPS measurement, it may be better to put these in the anchors, since they can have more resources. There are about 140 anchors, so we could get coverage of the core.

On the positive side, it does have a lot of devices. Covers 3000 ASNs in v4, 1000 ASNs in v6, in 172 countries. Trace route measurements can be used with various options (PMTU); TCP trace routes could be used for middle box detection, but it is not perfect. Measurement code does nothing magical to take NATs into account. They just use the network--no UPnP, etc.

Brian: Thank you very much. I've been running RIPE Atlas for a long time. We had a talk with Robert about a month ago. You have two measurement networks--anchors and probes.

Robert: Not exactly. Same code on both.

Brian: You said the anchors are more extensible, but not done yet. Is the system amenable to expanding the anchors?

Robert: I imagine we will have more divergence in what they measure.

Brian: Even the more powerful devices are not very demanding. Can we put these more powerful devices (anchors) elsewhere, outside of the core?

Robert: I'm not sure how many people would run this.

Brian: How much do they use?

Robert: 200Mbit/s.

Brian: I'd be interested.

Overview of the MONROE measurement testbed
------------------------------------------

Anna Brunström (Karlstads University)

Anna: I'm part of a consortium building MONROE. Measurements and experiments on MBB networks, coverage in Norway, Sweden, Spain, Italy. Deploying 450 nodes. Fixed and mobile nodes. Mobile nodes on buses, trucks, trains. Will have access to WiFi and broadband operators. In comparison to crowdsourced approaches, will have fewer measurements, but we have complete control of both client and server for these measurements.

Will run a number of different experiments on this platform. Trying to see how new technologies can be deployed: ECN, TFO, MPTCP, as well as performance evaluation. Also getting basic performance metrics of the network and apps on the network. Will also visualize results in near real-time. Could include middle box related info.

Explains system architecture. Most important part is that the nodes are based on x86 devices, normal Linux nodes. Very easy to run many of these tools, since the nodes can run any Linux tools.

Brian: Anything that runs on Linux will run on the nodes--do you get privilege to run anything within the container (raw sockets)? How about kernel?

Anna: We will allow kernel modifications, but not open to everyone.

External users and open data--the goal is to be open to external users, as a resource for the whole community. Software will be released as open source to deploy elsewhere. Data will be available as open data.

Status--currently building the platform and doing proof of concept. Should be ready by March 2016. Open to all users by March 2017. This type of platform is complementary to other data sources, and we are interested to run these measurements and develop them with feedback from the community.

Joachim Fabini: Can you time-synchronize these probes?

Anna: Yes, we will use GPS synchronization.

Brian: Any other measurement platforms people want to talk about?

Aaron: There's a lot of infrastructure out there. It might be interesting to have a survey or database or wiki of these platforms.

Joachim: Hard to differentiate access networks from middle boxes. Especially with mobile networks.

Brian: You're saying it requires another measurement infrastructure.

Mirja: We may need different methodology for mobile clients to not confuse data.

Tracking Middleboxes with Tracebox
----------------------------------

Korian Edeline (Université de Liège)

Korian: Middleboxes--common knowledge they are widely deployed. The total number is around the same as routers. Security oriented box market is around $10B. Shows fields that may be modified by normal routers vs NATs vs ALGs, to the point where essentially every field may be modified.

One tool that can be used is TBIT. Uses raw sockets to send TCP probes. User-level, user-controlled test without kernel changes. Detects if ECN, IP options, and TCP options can be safely used.

Another tool is TCPExposure. Runs a stateless Python server/client. Uses forged TCP over raw IP. Client can compare with server what was modified by the network. Differentiates between modification in different directions.

TCP HICCUPS. A lightweight TCP extension that overloads 3 header fields to seal/hash the header to see what was changed.

These measurements can detect middle boxes if you own both the client and server, but not if you don't control the server. tracebox tries to solve this. Send TTL limited TCP probes, and inspect ICMP time-exceeded responses to see the current state of the header. A one-sided probe: good, in that no server is needed; but it only sees modifications in one direction. Can also detect multiple modifications. Example of how to detect two middle boxes that change different fields.

The major limitation of this tool is that ICMP responses are limited in how much of the original packet is returned. Different RFCs have different recommendations about how much of the datagram ICMP should return.

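The comparison step described above can be sketched as follows. This is a minimal illustration, not tracebox's actual code: it works on already-parsed header fields rather than on the wire, and the function name, field names, and sample values are all hypothetical. A field missing from a hop's quote stands for an ICMP reply that was too short to include it.

```python
from typing import Dict, List, Tuple

Header = Dict[str, int]  # header field name -> value


def locate_changes(sent: Header, quotes: List[Header]) -> Dict[str, Tuple[int, int]]:
    """For each modified field, return (last hop seen unchanged, first hop seen changed).

    quotes[i] holds the fields quoted in the ICMP time-exceeded reply for
    TTL i + 1; fields the reply did not quote are simply absent.
    Hop 0 stands for the sender itself.
    """
    last_ok = {field: 0 for field in sent}
    found: Dict[str, Tuple[int, int]] = {}
    for ttl, quote in enumerate(quotes, start=1):
        for field, original in sent.items():
            if field in found or field not in quote:
                continue  # already located, or this reply was too short
            if quote[field] == original:
                last_ok[field] = ttl
            else:
                found[field] = (last_ok[field], ttl)
    return found


if __name__ == "__main__":
    sent = {"ip_id": 4242, "tcp_seq": 1000, "tcp_mss": 1460}
    quotes = [
        {"ip_id": 4242, "tcp_seq": 1000, "tcp_mss": 1460},  # hop 1: full quote
        {"ip_id": 4242, "tcp_seq": 1000},                   # hop 2: short quote
        {"ip_id": 4242, "tcp_seq": 7777, "tcp_mss": 1380},  # hop 3: full quote
    ]
    print(locate_changes(sent, quotes))
```

In this made-up example the sequence number change is pinned between hops 2 and 3, while the MSS change can only be placed somewhere between hops 1 and 3, because hop 2 did not quote the TCP options. Returning a range rather than a single hop keeps that uncertainty visible.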
At least 80% of paths have a router that sends back full-response ICMP, in tests with the Alexa top 5000. This is good, but leads to uncertainty over which box actually changed the options for later fields.

Uses: testing new protocols (MPTCP, TFO) or new hardware, or triaging a network problem. A good network management tool. Developed an Android version for testing cell networks. Requires rooted version, but working on a non-rooted version.

Two implementations: standalone tracebox for flexibility and scripting, in C++. Runs on Mac OS X and Linux. See slides for details. Scamper version for wider deployment. Supports BSD, Linux, Solaris, Windows.

Michio Honda: Regarding router hopping--what happens if the middle box sits between the last-hop router and the server? If the middle box does not decrease the TTL, what do you do?

Korian: Then you can't detect those.

Michio: I think those are common.

Korian: We need an ICMP message from after the middle box.

Alex Zimmerman: We could induce another ICMP error message to get back the packet from the end server.

Lessons Learnt from Middlebox Measurement
-----------------------------------------

Michio Honda (NetApp)

Michio: Is it still possible to extend TCP? Identified whether potential extensions work or not. "Works" means that the extension can go through the internet, or can fall back to traditional TCP. It doesn't work if the extension messes up future connections. Measured 140 networks. Lots of paths using port 80 cleared options, but other ports were not affected. Ran custom server in middlebox-free network, and ran clients on many networks to start TCP connections with strange options.

The measurement was difficult since re-running experiments was very hard. If we had to change the tools because of new info or bugs, then people may not re-run the tests. Gets less effective over time. How can we avoid this? First, carefully define objectives, so as to define experiments and methodology.
Also carefully design and implement tools. Generating raw packets may need root privileges, and cannot work in a smartphone app. Forcing people to install something is hard. Wish we had supported Windows. Stateless servers also make it hard to do some measurements. Also important to reward contributors for their help. We should combine our tool with tracebox for best coverage.

Brian: This may belong in the next section, but: one thing in the proposed charter is to define how to share data. Since experiments are hard, we need a way to share the data. How much of this is a format problem, and how much is the regulatory/legal/technical environment?

Michio: Data format is useful.

Brian: I took that away from other talks. Specifically with Akamai data, there was very dense raw data, but it was hard to distill the insights out of it. How do we explain what the actual impairment is (such as having ports mod 47 be distorted)?

Aaron: Before you started down this road, Michio, did you try to use passive data? If you can look at what is already there, you have a wider net.

Michio: Hard to get info about new TCP options/bits on existing data.

Aaron: Thinking about Akamai, it is easier to look at existing data for other purposes.

Michio: But we are using nonexistent TCP data. We can get info from passive data for deployment status, but we cannot learn what will work or not.

Discussion of Next Steps
------------------------

Brian Trammell and Mirja Kühlewind, Chairs

Mirja: We have one additional slide of questions to ask what the interest of the group is and how we should go forward for next meetings. We had three groups of presentations in this meeting. Are people interested in these equally, or do we want to focus on one or the other?

Matt Mathis: I have a new issue to bring up. As IPv6 is rolling out, people don't remember that NATs are not original. And people will start getting IPv6 firewalls.
That probably falls into scope here as a new class of middle boxes that will exist.

Mirja: I would say that is in scope. What are people's impressions?

Brian: Show of hands for venue for presentation of data? Lots of hands. Seems to be interested. How about methodology: significant interest, but less than data.

Lars Eggert: I can't stop you, but I don't think we should be figuring out the charter details in this phase of the meeting. We want an advertisement for people coming here.

Brian: I want to be able to focus the advertisements for the next meeting.

Lars: There's a South Park episode. Gnomes are stealing underpants. They have a business plan: step 1 collect underpants, have ?? as step 2, and have profit as step 3. We have a step 1, and a step 3, but no step 2. We need to work out how we get from our interest to our answer. Lots of interesting data shown today, but I wonder how much new data we'll have (from new or same places)? It would be nice to agree that data shown here is available for use by others. Does the group need to collect data, or just use data from other bodies?

Mirja: I would like to see people bring data, but also people start asking questions and bring them to the people measuring data.

Lars: Also, middle boxes are not the only thing that is ossifying the network. Even socket APIs may be the problem.

Mirja: We are asking 'how' not 'is', so we should focus on measurements.

Brian: Interesting question. Are we leading from the name or the charter? I think for now it is the name. We'll see what we have energy to continue into and work on.

Jana: It may not make sense to ask about 'the' protocol stack. It's not clear to me that the question has a definitive answer. My sense is that we are talking about data sharing. It's the underpants that are interesting here, more than the business plan. There are a lot of people with interest in sharing data. With a space like this, even if we don't meet every IETF, people will bring data.
We often operate on anecdotal data, at the mic, not on disputable measurements. I hope the scope is around data sharing, with methodology perhaps, but mainly data.

Ken: I want to put a plug in for reproducibility, which is key to science. Can we get operators to come and talk to us as well?

Natasha Rooney: I'm an operator who can help.

Kevin Fall: It would be interesting to have the tooling and methodology defined. How to perform measurements and experiments within the landscape today would be useful.

Aaron: Rolling back to the BarBOF. We had a workshop in which we talked about the IETF making significant protocol changes, and there were papers explaining that there was a problem, without much data on what and where the problem really was. I saw the data about informing the IETF with ground truth. That's the motivating reason to me. This is not research, it's data to inform engineering. We need to ask the question about ossification very clearly. This is a forum to bring together different communities with expertise to have a conversation about what the question is, and how to collaborate to answer the question. If this is motivated by a clear engineering goal that people can bring back to their management, we'll get more buy-in to share data.

Lars: +1 for data-driven engineering. We used to have an internet measurement group, but that died down. If the group is about data, it will live longer, and can change focus. If it is a bring-your-data research group, then the IRTF is a good place for it. The IRTF can have closed, non-public meetings, to share more private data, with NDAs, etc. We'd want the group to be open, but this can be done.

Aaron: I think we'll need to build trust as a community. The data may be initially heavily distilled, but hopefully we'll be able to share raw data as we go on. We can develop this over time.

Mirja: It is easier to get data if you have a specific question. If we have so much random data, that's great, but we should be focused.

Anna: Tying back to what Kevin said, the methodology will make it easier to share the data, since we will know how to compare and interpret the data relevantly.

Michael: As a researcher, I found it helpful to look at the peer-to-peer research group. They had a large taxonomy of the field. It helped clear the chaos. It could be helpful to have a place for people in the IETF to go to get information about certain topics.

Mat Ford: Supporting what Lars said about getting details about the Y axis, real numbers from companies. Not everyone is willing to share the volume of traffic they see, and providing a platform for anonymized data could help to make real change. For example, if 50% of carriers clear the TOS field for ECN, how do we go forward from that?

Jana: I'd like to see operators show up in these talks, since they run the networks we are trying to measure. They can provide feedback on the data we collect.

Brian: Another thing that might help is to have a HOPS meeting at an operator community event.

Lars: I agree with Mirja. Maybe I swung too far towards bring-your-data. I think it is helpful to have a current question that the data is focused on bringing to the engineering teams. I think having a completely open data group might not work for other reasons. What we could do is have a list (maybe just one) of items that we are currently looking into, but it could always change going forward.

Matt Mathis: As a result of QUIC, ISPs are worried about UDP. There should be a conversation about opening up more protocol numbers for IPv6 networks.

Natasha: Last note, I'll bring data to the Yokohama meeting.

Meeting adjourned.