How Ossified is the Protocol Stack? Proposed Research Group (HOPSRG)
====================================================================
DRAFT Minutes of the July 24, 2015 meeting at IETF93, Prague

Thanks to Tommy Pauly for the detailed transcript!

Intro & Overview
----------------
Brian Trammell and Mirja Kühlewind, Chairs

Mirja: Welcome to the first meeting of HOPS. Aaron Falk is on the jabber
channel. Tommy Pauly is note taker. How did we get here? From the BarBOF by
Aaron Falk in Dallas. How can we evolve transport stacks when there are all of
these middle boxes which may block or bleach packets? Discussion around middle
boxes, and we know it's happening, but how much? Is it one box, or tons? Should
we design protocols around this? Many application protocols have this
knowledge, since they do fallback. It would be useful to harvest this data.

Mirja: BarBOF was successful, and people were interested in both passive and
active measurements. We started writing a charter for the potential RG. Not
much discussion on the list, but not because of lack of interest; more that
the need is clear. Probably will have a couple more meetings before becoming a
formal RG.

Brian: Quick look at charter. Note well. See
https://datatracker.ietf.org/rg/hopsrg/charter/. Mailing list is hosted as
hops@ietf.org. Note that this is *not* under irtf.org yet; we'll move this
over when and if we form an RG.

Brian: We are trying to deploy new tech and protocols, and NATs and firewalls are used as reasons to not deploy them. Anecdotal data is not enough, so let's get some real data! Gather data from multiple, aggregated studies to better understand the issues on the live internet.

Objectives:

1. Forum for discussion and exchange.
2. Define a common format for reporting middle box impairments.
3. Specify methods for analyzing middle box interference and active measurement strategies.

Aaron: I've been focused on middle boxes previously, but I have some slides on routers that don't participate in trace route. Unless we want to call routers middle boxes, we may want a better term.

Brian: I consider routers middleboxes. Anything on path that might mess with packets.

Gorry: Network devices is a term we could use for all the boxes.

Jana: People have opinions without even reading the charter! I was hoping that we would have end-to-end measurements as well, not just for middleboxes. Google has lots of measurements here.

Brian: I think that is in scope. We can do A/B testing over different paths end-to-end.

Mirja: These may be the same measurements. Throughput is throughput. However, we're looking at the middlebox's influence on the measurements.

Brian: There is a line between traffic analysis and path impairment analysis. We may be being too careful to stay on one side of the line.

Brian: To the third point, as we develop more methods, that may become more of an IETF effort.

Brian: Overview of agenda. Any bashing?

Mirja: We didn't have a lot of open call because we had a lot of people already wanting to present, so we had a full schedule. Hopefully we'll have more time to present stuff at the next meeting.

Brian: If anyone has recent interesting results, please come up at the end.

Results from deployment of QUIC, a UDP-based transport
------------------------------------------------------
Jana Iyengar (Google)

Jana: Good morning!! (crowd groans, since it is Friday).

Mirja: Are you not standing in the pink box on purpose?

Jana: I haven't stood in the pink box this whole IETF and am still alive.

At the QUIC BarBOF (on Wednesday evening), we shared internal data from
several companies--Mozilla, Google, others. We've done various experimentation,
and we want a forum to discuss this data regularly: to ask for more data, but
also to inform protocol design. Hopefully the results here will spur on new
designs.

What is QUIC? It is a new transport from Google that is reliable, multiplexed
on UDP. The most important part is that it is always encrypted, so that middle
boxes cannot do anything other than play with the shape of the traffic. They
cannot try to accelerate, etc, however. I'll leave the rest out, but ask
questions later.

Designed to reduce web latency. TCP + TLS + SPDY over UDP. Quick QUIC overview!

Controlled experiments have been done with Chrome browser connecting back into
Google. Measuring latency/bandwidth/quality/errors on client,
latency/bandwidth/success on server. Lots of the talk has been about the fact
that it is over UDP, and what the performance of UDP on the internet is.

What is the scale? The amount cannot be disclosed--we've scaled up
tremendously from 0 to some undisclosed number, especially in the last 6
months (first half of 2015). This is all between Chrome and Google, across all
devices (some graphs are only desktop). It is in the order of millions of
users.

If UDP was not reachable, then none of this would work. But it does work 92%
of the time! In 7% of cases QUIC cannot be used (UDP is blocked or maybe
something else), and in 1% UDP is rate limited (or at least the performance of
QUIC is poor). UDP is generally not getting any worse treatment than TCP. This
data is from massive scale.

This data includes Google's CDN, but other networks as well.

Impact of 0-RTT. About 75% of QUIC connections happen with zero RTTs with
secure establishment. This accounts for 50 to 80% of median latency
improvements. That reduction of about 2 RTTs is a huge win for startup
latency, but doesn't affect some of the other metrics.

Connection pooling. Pooling is something we've thought about a lot in
transport. Share multiple streams in one app for one connection. Shared
connections give 10% latency improvement.

Pacing packets. Similar to fq qdisc. Paces out entire congestion window.
Improves payload latency by 25%. Does not change median latency or QoE.
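
As a rough illustration of "pacing out the entire congestion window", the
sketch below spreads one window of packets evenly across one round-trip time.
This is a minimal sketch under that assumption, not QUIC's actual pacer; the
function names and parameters are hypothetical.

```python
def pacing_interval(rtt_seconds, cwnd_packets):
    """Gap between packet departures when one congestion window
    is spread evenly over one round-trip time."""
    if cwnd_packets <= 0:
        raise ValueError("cwnd_packets must be positive")
    return rtt_seconds / cwnd_packets


def departure_times(rtt_seconds, cwnd_packets, start=0.0):
    """Send times (in seconds) for each packet of one paced window."""
    gap = pacing_interval(rtt_seconds, cwnd_packets)
    return [start + i * gap for i in range(cwnd_packets)]
```

With a 100 ms RTT and a window of 32 packets, this gives one packet roughly
every 3.1 ms instead of a single burst of 32.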

Karen Nielsen: When you speak about latency, is this because it has such a large initial window?

Jana: I don't think so. Though we do use a large initial window. This helps
loads that are short. This is intuitive. What is not obvious is that larger
window with pacing (32) has a better loss rate than a smaller window (10)
without pacing. Pacing is a very important part of choosing window size, and
makes a big difference.

Also tested CC: Reno vs. Cubic. Uses Cubic so that comparisons with Linux stay
like for like. The quality of the experience is almost the same between Reno
and Cubic. This may not be surprising, since Cubic is often in Reno mode. We
saw that Reno has about 20% lower retransmission rate. This is interesting,
and we think this may be because of buffer bloat. Thinking about switching to
Reno.

QUIC defaults to 2-connection emulation. Has slight improvement in tail
latency.

Michael Welzl: Is this both the increase and decrease? Is it all paced?

Jana: Yes. And it is all paced. Tail loss probe improves latency. Also does
time based loss detection, using FACK with threshold of 3. We noticed that the
network does not do a lot of packet reordering.

Mirja: On the QUIC pie chart, you said it was better than TCP. Do you have
measurements to compare to TCP?

Jana: No, it's hard to get aggregated measurements to compare along the same
path. Any given user is doing one or the other (QUIC/TCP), so no.

Mirja: Do you have numbers for users who use TCP almost always due to having a
problem on their access point?

Jana: Yes, about 7% of flows fail. But we don't know how strongly that is
correlated with how many users always have problems. Chrome will try to adapt
and switch to QUIC when it can, but we don't have numbers on that.

Mirja: Where we are rate-limited, how do you detect it?

Jana: It is a coarse judgment. If the connection is established, we look for
extremely high loss rates, but we're still looking into how to solve this
problem.

Mirja: Does TCP actually give you better results when you fallback?

Jana: Yes, we do know that. It is something at the ASN level.

Karen: Is this representative of the whole internet?

Jana: This is all over the world, for all users of Chrome. Not special
environments.

Robert Kisteleki: At the end of the day, this is UDP, and the payload is
'magical'. How often are failures because of content, or just because of UDP?

Jana: We only know when QUIC doesn't work, not UDP in general. It is an
interesting question that we have not pursued. Anecdotal data does seem to
have similar failure rates for generic UDP.


Report on prevalence of NAT and forwarding of traceroute
--------------------------------------------------------
Aaron Falk (Akamai)

Aaron: I'm presenting someone else's work, since colleagues are in conflicting
meetings. Data was collected by Arthur Berger and Dave Plonka. In contrast to
Jana's talk, the data were collected without the intention of answering a
specific question.

First set is analysis of connection status data taken from Akamai's network,
looking for prevalence of NAT port translation. Second set is about trace
route of different types (ICMP, TCP, UDP).

NAT port translation: How often are ports translated? Gut feeling is almost
always, but let's measure. There is a TURN service in Akamai used for a
peer-to-peer app using a fixed UDP port mapping, which makes it a good way to
detect port translation.

On one day, July 14, 2015, looked at numbers for 3 million clients over 231
countries: 68% of clients had the right port. So 32% had port translation.
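
The check described here can be sketched as follows: because the application
uses a fixed UDP port, any client whose observed source port differs from that
port has passed through port translation. This is a minimal sketch; the
function name and its inputs are hypothetical, not the analysis code used for
the talk.

```python
def translated_fraction(observed_ports, expected_port):
    """Fraction of clients whose observed source port differs from
    the fixed port the application is known to use."""
    if not observed_ports:
        raise ValueError("need at least one observation")
    translated = sum(1 for port in observed_ports if port != expected_port)
    return translated / len(observed_ports)
```

For example, if one of four clients arrives from a port other than the
expected one, the function returns 0.25.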

Mike Fischer: I'd rather see the percentage of clients in which the port was
never changed for any session.

Aaron: It's hard to say anything about never. If we look at the pairs of
connections, we see that 55% of pairs did involve port translation at least on
one side.

Mike: I'm not sure that's the same thing.

Aaron: Hm, yes, I think we can get that question from that data too. Get back
to you later.
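
For scale, one way to relate the per-client and per-pair figures: if each end
of a pair were translated independently at the 32% per-client rate, the share
of pairs with translation on at least one side would be 1 - 0.68^2.
Independence is an assumption made here for illustration, not a claim from the
talk, and the function name is hypothetical.

```python
def pair_translation_probability(p_client):
    """Probability that at least one of two independently chosen
    clients is behind port translation."""
    if not 0.0 <= p_client <= 1.0:
        raise ValueError("p_client must be a probability")
    return 1.0 - (1.0 - p_client) ** 2
```

With p_client = 0.32 this comes to about 0.54, close to the 55% of pairs
reported.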

This particular application tends to be skewed towards use in the US. We had
at least 20,000 samples of IP-port pairs. Most port re-mappings were in AU,
other country breakdowns.

The port number distribution is pretty much smooth across all ports. Bump
around 16000, which is near the application's port number. Also spikes around
56000. It turns out that there are also clusters 47 ports apart that are more
common: a trend of ports that are divisible by 47. Could be a legal intercept
thing; there's an RFC about mapping to a smaller set of ports to make legal
intercept easier. Top ASNs with this behavior are almost all mobile/carrier
networks. Could be a carrier NAT thing.
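
One way to look for a pattern like this in a set of observed ports is to count
residues modulo 47. The sketch below is illustrative only; the function names
are hypothetical and this is not the analysis used for the talk. If ports were
spread uniformly, about 1 in 47 (roughly 2%) would be divisible by 47, so a
much larger share stands out.

```python
from collections import Counter


def residue_histogram(ports, modulus=47):
    """Count how many observed ports fall into each residue class."""
    return Counter(port % modulus for port in ports)


def share_divisible(ports, modulus=47):
    """Fraction of observed ports that are exact multiples of modulus."""
    if not ports:
        raise ValueError("need at least one port")
    return sum(1 for port in ports if port % modulus == 0) / len(ports)
```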

We also did ICMP, TCP/SYN, and UDP probes for trace route, over v4 and v6. Not
a significant difference between protocols for how often the trace route
worked. There is a significant difference for different AS's. Targets across
the world (226 countries). Chart of networks that have 100% for one protocol,
with a very low % for another protocol. Interesting that sometimes one of the
protocols does get penalized on a network. But no one approach always works on
all, it is somewhat random all over the world.

Jana: For the first (HP) ICMP has 100%, and both TCP and UDP have the same
(58%). Is this because of a similar loss rate?

Aaron: Don't know. Let's take it offline.

Tools to recommend: intrace, DASU, and ALICE. With that, open up to questions.

Overview of RIPE Atlas
----------------------
Robert Kisteleki (RIPE)

Robert: I'm representing RIPE Atlas network. Since it is a significant measurement network, wanted to share results. Not going to talk about what RIPE Atlas is--you should know.

Supported measurement types: ping, trace route, DNS, NTP, SSL/TLS cert checks. Mainly doing router level, not apps. Working on HTTP, SSL/TLS version/cipher tests, and WiFi.

Richard Scheffenegger: There has been deployment by one large vendor of ECN.
Are you planning on testing ECN, for TCP and IP?

Robert: I'll talk about this later. Short answer, no.

Mainly deployed in homes, so must be small and not take much energy. Also
don't take too much bandwidth. Very resource constrained. The capacity does
increase over time, so expect more in the future! But working now in just MB
of memory.

Ken Calvert: Your boxes are the home router, or are behind?

Robert: They are just on the home network.

We deployed RIPE Atlas anchors about two years ago in the core, to have more
stable measurement points.

Our measurement code has to be very efficient. It is a no-fork model with many
threads, uses libevent. New protocol testing is hard, and exotic protocol
measurement is even harder. Not much benefit to experiments with such a
constrained environment.

Also, because the probe is headless, it is expensive if they die.

So, if you want to involve RIPE Atlas in HOPS measurement, it may be better to
put these in the anchors, since they can have more resources. There are about
140 anchors, so we could get coverage of the core.

On the positive side, it does have a lot of devices. Covers 3000 ASNs in v4,
1000 ASNs in v6, in 172 countries. Trace route measurements can be used with
various options (PMTU); TCP trace routes could be used for middle box
detection, but it is not perfect. Measurement code does nothing magical to
take NATs into account. They just use the network--no UPnP, etc.

Brian: Thank you very much. I've been running RIPE Atlas for a long time. We
had a talk with Robert about a month ago.

You have two measurement networks--anchors and probes.

Robert: Not exactly. Same code on both.

Brian: You said the anchors are more extensible, but not done yet. Is the
system amenable to expanding the anchors?

Robert: I imagine we will have more divergence in what they measure.

Brian: Even the more powerful devices are not very demanding. Can we put these
more powerful devices (anchors) elsewhere, outside of the core?

Robert: I'm not sure how many people would run this.

Brian: How much do they use?

Robert: 200Mbit/s.

Brian: I'd be interested.


Overview of the MONROE measurement testbed
------------------------------------------
Anna Brunström (Karlstads University)

Anna: I'm part of a consortium building MONROE. Measurements and experiments
on MBB networks, coverage in Norway, Sweden, Spain, Italy. Deploying 450
nodes. Fixed and mobile nodes. Mobile nodes on buses, trucks, trains. Will
have access to wifi and broadband operators.

In comparison to crowdsourced approaches, will have fewer measurements, but we
have complete control of both client and server for these measurements.

Will run a number of different experiments on this platform. Trying to see how
new technologies can be deployed: ECN, TFO, MPTCP, as well as performance
evaluation. Also getting basic performance metrics of the network and apps on
the network.

Will also visualize results in near real-time. Could include middle box
related info.

Explains system architecture. Most important part is that the nodes are based
on x86 devices, normal Linux nodes. Very easy to run many of these tools,
since it can do any Linux tools.

Brian: Anything that runs on Linux will run on the nodes--do you get privilege
to run anything within the container (raw sockets)? How about kernel?

Anna: We will allow kernel modifications, but not open to everyone.

External users and open data--the goal is meant to be open to external users,
as a resource for the whole community. Software will be released as open
source to deploy elsewhere. Data will be available as open data.

Status--currently building the platform and doing proof of concept. Should be
ready in March 2016. Open to all users by March 2017. This type of platform
is complementary to other data sources, and we are interested in running these
measurements and developing them with feedback from the community.

Joachim Fabini: Can you time-synchronize these probes?

Anna: Yes, we will use GPS synchronization.

Brian: Any other measurement platforms people want to talk about?

Aaron: There's a lot of infrastructure out there. It might be interesting to
have a survey or database or wiki of these platforms.

Joachim: Hard to differentiate access networks from middle boxes. Especially
with mobile networks.

Brian: You're saying it requires another measurement infrastructure.

Mirja: We may need different methodology for mobile clients to not confuse
data.


Tracking Middleboxes with Tracebox
----------------------------------
Korian Edeline (Université de Liège)

Korian: Middleboxes--common knowledge they are widely deployed. The total
number is around the same as routers. Security oriented box market is around
$10B. Shows fields that may be modified by normal routers vs NATs vs
ALGs, to the point where essentially every field may be modified.

One tool that can be used is TBIT. Use raw sockets to send TCP probes.
User-level user controlled test without kernel changes. Detect if ECN, IP
options, and TCP options can be safely used.

Another tool is TCPExposure. Run stateless python server/client. Uses forged
TCP over raw IP. Client can compare with server what was modified by the
network. Differentiate between modification in different directions.

TCP HICCUPS. A lightweight TCP extension to overload 3 header fields to
seal/hash the header to see what was changed.

These measurements can detect middle boxes if you own both the client and
server, but not if you don't control the server.

tracebox tries to solve this. Send TTL limited TCP probes, and inspect ICMP
time-exceeded responses to see the current state of the header. A one-sided
probe: good, that no server is needed; but only sees modifications in one
direction. Can also detect multiple modifications.
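
The comparison step can be sketched as follows: each ICMP time-exceeded
response quotes the probe as the replying hop saw it, so diffing the quoted
fields against the last known values shows where along the path a field first
appears changed. This is a minimal sketch, not tracebox's actual code; the
function name and the dictionary representation of headers are hypothetical.

```python
def locate_modifications(sent, quoted_by_hop):
    """Report header changes and the hop where each first became visible.

    sent: dict mapping field name to the value in the probe as sent.
    quoted_by_hop: iterable of (hop_number, dict) pairs, where each dict
        holds the fields quoted back by that hop (possibly only some).
    Returns a list of (hop_number, field, old_value, new_value).
    """
    changes = []
    last_seen = dict(sent)
    for hop, quoted in sorted(quoted_by_hop, key=lambda item: item[0]):
        for field, value in quoted.items():
            if field in last_seen and last_seen[field] != value:
                changes.append((hop, field, last_seen[field], value))
            # Remember the latest value so later hops are compared
            # against what was already observed upstream.
            last_seen[field] = value
    return changes
```

A change first visible at hop N was made somewhere between the previous hop
that quoted that field and hop N, which is why partial quotes reduce how
precisely a middle box can be located.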

Example of how to detect two middle boxes that change different fields.

The major limitation of this tool is that ICMP responses do have limitations
of how much is returned. Different RFCs have different recommendations about
how much of the datagram ICMP should return.
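
To make the limitation concrete: RFC 792 has the ICMP error carry the original
IP header plus the first 64 bits (8 bytes) of the datagram's data, while RFC
1812 recommends quoting as much of the original datagram as fits without the
ICMP datagram exceeding 576 bytes. The sketch below lists which fixed TCP
header fields are fully covered by a quote of a given length; the function and
table names are hypothetical.

```python
# (field name, start offset, end offset) within the fixed 20-byte TCP header
TCP_FIELDS = [
    ("source_port", 0, 2),
    ("dest_port", 2, 4),
    ("sequence_number", 4, 8),
    ("ack_number", 8, 12),
    ("offset_and_flags", 12, 14),
    ("window", 14, 16),
    ("checksum", 16, 18),
    ("urgent_pointer", 18, 20),
]


def visible_tcp_fields(quoted_transport_bytes):
    """TCP header fields fully contained in the quoted transport bytes."""
    return [name for name, _start, end in TCP_FIELDS
            if end <= quoted_transport_bytes]
```

With only 8 quoted bytes, just the ports and the sequence number are visible;
TCP options start after byte 20 and can only be checked when a hop quotes
more of the packet.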

At least 80% of paths have a router that sends back full-response ICMP, in
tests with the Alexa top 5000. This is good, but leads to uncertainty over
which box actually changed the options for later fields. Use cases: testing
new protocols (MPTCP, TFO), new hardware, or triaging a network problem. A good
network management tool.

Developed an Android version for testing cell networks. Currently requires root, but working on a non-rooted version.

Two implementations: standalone tracebox for flexibility and scripting, in
C++. Runs on Mac OS X and Linux. See slides for details. Scamper version for
wider deployment. Supports BSD, Linux, Solaris, Windows.

Michio Honda: Regarding router hopping--what happens if the middle box sits
between the last-hop router and the server? If the middle box does not
decrease the TTL, what do you do?

Korian: Then you can't detect those.

Michio: I think those are common.

Korian: We need an ICMP message from after the middle box.

Alex Zimmerman: We could induce another ICMP error message to get back the packet from the end server.

Lessons Learnt from Middlebox Measurement
-----------------------------------------
Michio Honda (NetApp)

Michio: Is it still possible to extend TCP? Identified whether potential
extensions work or not. This means that it can go through the internet, or can
fall back to traditional TCP. It doesn't work if the extension messes up
future connections.

Measured 140 networks. Lots of paths using port 80 cleared options, but other
ports were not affected.

Ran custom server in middlebox-free network, and ran clients on many networks
to start TCP connections with strange options. The measurement was difficult
since re-running experiments was very hard. If we had to change the tools
because of new info or bugs, then people may not re-run the tests. Gets less
effective over time. How can we avoid this?

First, carefully define objectives, so as to define experiments and
methodology. Also carefully design and implement tools. Generating raw packets
may need root privileges, and cannot work on a smartphone app. Forcing people
to install something is hard. Wish we had supported Windows. Stateless servers
also make it hard to do some measurements. Also important to reward
contributors for their help.

We should combine our tool with tracebox for best coverage.

Brian: This may belong in the next section, but: one thing in the proposed
charter is to define how data is shared. Since experiments are hard, we need a
way to share the data. How much of this is a format problem, and how much is
the regulatory/legal/technical environment?

Michio: Data format is useful.

Brian: I took that away from other talks. Specifically with Akamai data, there
was very dense raw data, but hard to distill the insights out of it. How do we
explain what the actual impairment is (such as having ports mod 47 be
distorted).

Aaron: Before you started down this road, Michio, did you try to use passive
data? If you can look at what is already there, you have a wider net.

Michio: Hard to get info about new TCP options/bits on existing data.

Aaron: Thinking about Akamai, it is easier to look at existing data for other
purposes.

Michio: But we are using nonexistent TCP data. We can get info from passive
data for deployment status, but we cannot learn what will work or not.

Discussion of Next Steps
------------------------
Brian Trammell and Mirja Kühlewind, Chairs

Mirja: We have one additional slide of questions to ask what the interest of the group is and how we should go forward for next meetings.

We had three groups of presentations in this meeting. Are people interested in these equally, or do we want to focus on one or the other?

Matt Mathis: I have a new issue to bring up. As IPv6 is rolling out, people
don't remember that NATs are not original. And people will start getting IPv6
firewalls. That probably falls into scope here as a new class of middle boxes
that will exist.

Mirja: I would say that is in scope. What are people's impressions?

Brian: Show of hands for venue for presentation of data?

Lots of hands. The room seems to be interested.

How about methodology: significant interest, but less than data.

Lars Eggert: I can't stop you, but I don't think we should be figuring out
the charter details in this phase of the meeting. We want an advertisement for
people coming here.

Brian: I want to be able to focus the advertisements for the next meeting.

Lars: There's a South Park episode. Gnomes are stealing underpants. They have
a business plan: step 1 is collect underpants, step 2 is ??, and step 3 is
profit. We have a step 1, and a step 3, but no step 2. We need to work out how
we get from our interest to our answer. Lots of interesting data shown today,
but I wonder how much new data we'll have (from new or same places)? It would
be nice to agree that data shown here is available for use by others. Does the
group need to collect data, or just use data from other bodies?

Mirja: I would like to see people bring data, but also see people start
asking questions and bring them to the people measuring data.

Lars: Also, middle boxes are not the only thing that are ossifying the
network. Socket APIs even may be the problem.

Mirja: We are asking 'how' not 'is', so we should focus on measurements.

Brian: Interesting question. Are we leading from the name or the charter? I
think for now it is the name. We'll see what we have energy to continue into
and work on.

Jana: It may not make sense to ask about 'the' protocol stack. It's not clear
to me that the question has a definitive answer. My sense is that we are
talking about data sharing. It's the underpants that are interesting here,
more than the business plan. There are a lot of people with interest in
sharing data. With a space like this, even if we don't meet every IETF, people
will bring data. We often operate on anecdotal data, at the mic, not on
disputable measurements. I hope the scope is around data sharing, with
methodology perhaps, but mainly data.

Ken: I want to put a plug in for reproducibility, which is key to science. Can
we get operators to come and talk to us as well?

Natasha Rooney: I'm an operator who can help.

Kevin Fall: It would be interesting to have the tooling and methodology
defined. How to perform measurements and experiments within the landscape
today would be useful.

Aaron: Rolling back to the BarBOF. We had a workshop in which we talked about
the IETF making significant protocol changes, and there were papers explaining
that there was a problem, without much data on what and where the problem
really was. I saw the data about informing the IETF with ground truth. That's
the motivating reason to me. This is not research, it's data to inform
engineering. We need to ask the question about ossification very clearly. This
is a forum to bring together different communities with expertise to have a
conversation about what the question is, and how to collaborate to answer the
question. If this is motivated by a clear engineering goal, to bring back to
their management, we'll get more buy in to share data.

Lars: +1 for data-driven engineering. We used to have an internet measurement
group, but that died down. If the group is about data, it will live longer,
and can change focus. If it is a bring-your-data research group, then the IRTF
is a good place for it. The IRTF can have closed, non-public meetings, to
share more private data, with NDAs, etc. We'd want the group to be open, but
this can be done.

Aaron: I think we'll need to build trust as a community. The data may be
initially heavily distilled, but hopefully we'll be able to share raw data as
we go on. We can develop this over time.

Mirja: It is easier to get data if you have a specific question. If we have so
much random data, that's great, but we should be focused.

Anna: Tying back to what Kevin said, the methodology will make it easier to
share the data, since we will know how to compare and interpret the data
relevantly.

Michael: As a researcher, I found it helpful to look at the peer-to-peer
research group. They had a large taxonomy of the field. It helped clear the
chaos. It could be helpful to have a place for people in the IETF to go to get
information about certain topics.

Mat Ford: Supporting what Lars said about getting details about the Y axis,
real numbers from companies. Not everyone is willing to share the volume of
traffic they see, and providing a platform for anonymized data could help to
make real change. For example, if 50% of carriers clear the TOS field for ECN,
how do we go forward from that?

Jana: I'd like to see operators show up in these talks, since they run the
networks we are trying to measure. They can provide feedback to the data we
collect.

Brian: Another thing that might help is to have a HOPS meeting at an operator
community event.

Lars: I agree with Mirja. Maybe I swung too far toward "bring your data". I
think it is helpful to have a current question that the data is focused on
bringing to the engineering teams. I think having a completely open data group
might not work for other reasons. What we could do is have a list (maybe just
one) of items that we are currently looking into, but it could always change
going forward.

Matt Mathis: As a result of QUIC, ISPs are worried about UDP. There should be
a conversation about opening up more protocol numbers for IPv6 networks.

Natasha: Last note, I'll bring data to the Yokohama meeting.

Meeting adjourned.