# PEARG interim 19th Jan 2021
- [Agenda](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/agenda.md)
- [Participation details](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/participation.md)
- [Chair slides](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/interim-202101-pearg-chairs-slides.pdf)

## Proceedings

### Administrivia (5 min)
  * Scribes, Blue Sheets, [Note Well](https://www.ietf.org/about/note-well/)
  * Purpose of the meeting
    * Discuss uses, privacy implications, and privacy mitigations for IP
    addresses

### Presentations (105 mins)

* IP address use cases
    * [Anti-fraud and
    abuse](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/Anti-abuse_applications_of_IP.pdf):
    Dimitris Theodorakis (WhiteOps), Philipp Pfeiffenberger (YouTube), David
    Turner (Google) (20 mins)

Notes (excluding slide contents):

David Turner: Browsers are shifting to emphasize privacy.  With third-party
cookies being blocked, browsers need to identify covert tracking, and IP
addresses are an example of this covert tracking.  But how do we avoid ruining
the trust and safety of the internet while getting rid of IP tracking?

Dimitris Theodorakis: In the past 5 years, >500 million passwords have been
stolen.  We've encountered botnets that are designed to listen to music in
order to get a bigger cut of copyright licensing fees.

Philipp Pfeiffenberger: Most of the use cases we talk about for IP addresses are
about stopping bad folks from doing bad stuff, but it's also useful to help
users access their accounts more easily.  For example, we can use IPs to have
variable friction, reducing challenges (e.g. 2FA) when the IP address is
low-risk.
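
A minimal sketch of this variable-friction idea, assuming a simple per-IP risk
score; the thresholds, scores, and names below are illustrative, not any
provider's actual policy:

```python
from enum import Enum

class Challenge(Enum):
    NONE = "password only"
    TWO_FACTOR = "password + 2FA"
    BLOCK = "deny and alert"

def challenge_for(ip: str, reputation: dict[str, float]) -> Challenge:
    # Unknown IPs get a mid-range default risk rather than full trust.
    risk = reputation.get(ip, 0.5)
    if risk < 0.2:       # e.g. an IP seen on many previous good logins
        return Challenge.NONE
    if risk < 0.8:       # unfamiliar but not known-bad
        return Challenge.TWO_FACTOR
    return Challenge.BLOCK   # known credential-stuffing source

reputation = {"198.51.100.7": 0.05, "203.0.113.99": 0.95}
print(challenge_for("198.51.100.7", reputation))  # Challenge.NONE
print(challenge_for("192.0.2.1", reputation))     # Challenge.TWO_FACTOR
```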

Bad actors like having a large number of accounts (to push ideologies or
products). 
There's no account at the time of account creation by definition, so IPs are
one of the few signals we can use to identify hotspots of account creation. 
All of the PII in my bank account is only as secure as my account, and we rely
on IPs to identify hotspots of attempted authentication activity.

"contextual integrity": Clearview AI scraped a large number of photos from
across the internet and reused them for face recognition.  There have been many
similar instances.  Service providers use IPs to identify hotspots of crawling
and other read-only activity to prevent recontextualization of their content.

Low-latency interfaces: IPs are very powerful when a split-second decision is
needed on reputation.

Dimitris: The 3ve botnet caused tens of millions of dollars in losses.  Its Kovter
component has also been used for ransomware.  The US government issued
extradition requests on the basis of this attack, and the criminals are now
behind bars.  These consequences would have been impossible if we didn't have
access to the IPs.

Philipp: IPs are also important for real-world crimes, not just cybercrime. 
For CSAM, the crime occurs in the physical world, and evidence of it is
distributed online.  We share evidence with NCMEC about how the material
reached the platform, including IP addresses, which has resulted in important
law enforcement actions and victims being freed.

The protections we have described have been built up over 20-30 years.  We
can't take fenceposts off the security fence to build the privacy fence, nor
vice-versa.  We must do both.

Questions:
Mirja: Isn't this just the tip of the iceberg, if you only catch those
criminals that are too stupid to use Tor?

Philipp: I can't speak directly to Tor, but in aggregate the success that law
enforcement has had is significant.  Some proxies may be able to assist law
enforcement.  Overall, IP addresses have been important in those trials.

Christian Huitema: There is a classic tension between privacy and
accountability, in general.  If everything you do is exposed, then there is
accountability because you can be followed for what you did.  In all those
discussions we are often reminded of the tension between privacy and child
abuse.  I would like the speakers to tease out the difference between
protecting the user and protecting the service from the user.  For example, in
the account protection case, the user and the service are clearly working
together, and there are a range of things that can be done there.  There are
other cases, like ad protection, where you want to make sure that the ad
produces correct revenues.  In that case the user is not on board; they're not
even supposed to be identified.

Dimitris: In the case of 3ve, the user is actually a victim.  The user's device
is infected with malware that is dropping a variety of payloads, including ad
fraud, ransomware, stealing personal information, etc.  The user also wants to
cooperate, it's just that they don't have a good way of cooperating.

Christian: In the Windows world, Windows security is designed to protect the
user, and the user cooperates.  If you compare that to Xbox, Xbox security is
designed to prevent users from cheating by enhancing their devices.  So you are
working against the user, not for the user.  This is the kind of tension that I
would like to show.  I think this should be part of the analysis of the various
solutions.

Philipp: Often you have multiple parties. For example, I might not want my
LinkedIn profile recontextualized to identify me in a political context, but
there is a consumer on the offensive side who does want that.

Tommy Pauly: If we assume that IP address privacy is going to happen, what do
you need to protect these use cases?

Dimitris: We need a signal from a device that we can guarantee isn't spoofed
until it meets our sensors.  For example, a modified browser that is designed
to automatically commit a credential-stuffing attack will often deliberately
spoof those signals to prevent us from doing that detection.

Fernando Gont: How does all this change when you switch to IPv6?

Philipp: IPv6 doesn't appear to be less identifiable.  The upper 64 bits are
stable in the context of a typical residential deployment.

Fernando: For example, you might identify the user by the /64, but you
actually don't know whether the user has a /56 or a /48.  So while in IPv4 you
can reasonably assume the user has a single address, in IPv6 you can guess
that the user has a /64, but they could have much more, depending on the ISP.
The specific granularity is hard to tell.

Dimitris: Unfortunately, most fraud detection companies will force an IPv4
connection instead so that they get that IPv4 address.

Fernando: And what if they can't?

Dimitris: We would fall back to IPv6, but we would lose entropy in the signal.
Our ability to use this as a high-information signal would be degraded.  The
same applies to IPs behind huge NATs.  Our ability to use them as a stable
identifier is degraded in several scenarios, but it's still extremely
important for detecting botnets.
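
A minimal sketch of the /64 heuristic under discussion: group observed IPv6
addresses by their upper 64 bits and treat each group as one subscriber.  The
addresses and the `prefix_key` helper are illustrative; as Fernando notes, the
true delegation might be a /56 or a /48.

```python
import ipaddress
from collections import Counter

def prefix_key(addr: str, prefix_len: int = 64) -> str:
    # Mask the address down to its routing prefix (the guessed "user" key).
    return str(ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False))

sightings = [
    "2001:db8:1:2:aaaa::1",
    "2001:db8:1:2:bbbb::2",   # same /64 -> guessed to be the same subscriber
    "2001:db8:9:9::1",
]
print(Counter(prefix_key(a) for a in sightings))
# Counter({'2001:db8:1:2::/64': 2, '2001:db8:9:9::/64': 1})
```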

Matthew Finkel: I'm a Tor developer.  This idea that IP addresses are so
fundamental to abuse tracking and detection ... it seems like you've built
tracking and surveillance of abusers as a mechanism for identifying abuse.  I
wonder how you build a signal into a product and protocol where we're trying
to provide privacy, while still revealing some information that lets you
identify when a user is considered legitimate vs. abusive or botnet.  What
does that signal look like?  It sounds like you want remote attestation that
the device is being used by a natural person.

Dimitris: That would be one piece of the architecture, which I think is pretty
fundamental, but I don't think that would be enough for all the different use
cases.  I agree that it's a high-entropy signal, but what we are tracking is a
botnet.  Our goal is never to track humans; in fact, we don't care about
tracking humans.  We care about tracking the sources of abusive activity.

Philipp: The place where this is really at odds is when a small number of
users do a bad thing and the shared identifier gets a really bad reputation.
Without a way to carry that reputation forward, it has a bad effect on all the
users.  Without a way to say that some subset of Tor users are troublemakers
... how do you persist that marking that these users are troublemakers so the
rest of the population can stay safe?

* [DDoS](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/ddos_theory_and_practice.pdf):
  Damian Menscher (Google) (10 mins)

Damian Menscher
The previous presentation gave an overview of abuse concerns.  This is a deep
dive into DDoS and botnets.

What if abuse just goes completely off the charts?  Then you're not just
dealing with abuse that costs a company money, this is abuse that puts the
company out of business.  And not just companies.  Bloggers, anyone who puts
information online is at risk of a DDoS attack taking them off the internet. 
Here's an article by someone who got kicked off the internet because a teenager
didn't like his post.

Do we want teenagers to be enacting global censorship?

And also banks, which are critical infrastructure.

This isn't something that ramps up over a day and you have time to react and
add capacity.

Even with ML, there isn't always a simple answer, or an answer at all, to
identify a signature.  Sometimes you just need to identify who is attacking you
and block them.  We need extremely high-fidelity signals, and IP addresses are
one.

Frequently requests are "GET /", so not a lot of signal there.  Botnets often
spoof user-agent, referer, and other attributes.  How do you know the request
rate if you don't know the user's IP?  You need some ability to look at two
completely new users, and know whether they are actually the same user, in
order to apply something like a token-bucket scheme for rate-limiting.
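
A minimal sketch of the token-bucket scheme mentioned here, keyed on a stable
client identifier (today, the source IP); the rate and burst parameters are
illustrative:

```python
import time

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst          # tokens/sec, bucket size
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def admit(client_key: str) -> bool:
    # client_key is the stable per-client identifier.  Without it, two "new"
    # requests can't be attributed to one client, and per-client rate
    # limiting breaks down -- Damian's point above.
    bucket = buckets.setdefault(client_key, TokenBucket(rate=5.0, burst=10.0))
    return bucket.allow()

print(all(admit("192.0.2.1") for _ in range(10)))  # burst allowed: True
print(admit("192.0.2.1"))                          # 11th request: False
```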

We need a signature that they can't easily change.  Blocking on headers doesn't
work; they'll just be changed.

I want to re-emphasize how big these DDoS attacks are.  In 2016, a SYN flood
knocked out a DNS provider, which took down a lot of sites, resulting in a
congressional inquiry.  The same botnet took down basically all of Liberia.
We're seeing an increase due to the pandemic.  In the case of Mirai and Dyn in
2016, we were able to identify which botnet it was and bring the attackers to
justice because it had previously attacked Krebs on Security, which was
protected by Google.  We recorded the attacker IPs and, with Krebs's
permission, shared them with the FBI, which allowed them to track down the
perpetrator.

In another case, we were able to find malware that routed all user traffic
through a proxy.  We were able to use the proxy's IP to identify infected users
and notify them.

We're not just using the IP information to protect ourselves; we're also using
it to help those who have compromised machines.  If your machine has been
compromised, you don't have privacy to start with.

* Privacy implications of IP Addresses:
    * [Overview of privacy
    implications](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/pearg-jan-2021-fgont-ip-addresses-final.pdf):
    Fernando Gont (SI6 Networks) (15 mins)

Fernando Gont

I will mostly focus on IPv6 but at the end I'll also give an overview of how
this applies to IPv4.

SLAAC relies on the host autoconfiguring its own address.  DHCPv6 is more like
IPv4, relying on a server to lease addresses to the hosts.  In some cases, a
single host can have both kinds of addresses.

If you have Interface Identifiers that are constant across networks, this will
allow for correlating activity as you move from one network to another (i.e. a
supercookie).  Conversely, if you use one address for each TCP connection,
network activity correlation is mitigated to some extent.
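
A minimal sketch of the cross-network correlation described above, assuming a
constant 64-bit Interface Identifier; the addresses are illustrative:

```python
import ipaddress

def iid(addr: str) -> int:
    # The Interface Identifier is the low 64 bits of the 128-bit address.
    return int(ipaddress.IPv6Address(addr)) & ((1 << 64) - 1)

home   = "2001:db8:aaaa:1:211:22ff:fe33:4455"   # EUI-64-style IID at home
office = "2001:db8:bbbb:9:211:22ff:fe33:4455"   # same IID on another network
print(iid(home) == iid(office))  # True -> both sightings link to one host
```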

With stable IDs, there is also a possible active attack: if I can predict what
suffix a user would use in each network, I can send a probe packet and
potentially locate the user.

RFC 4291-style addresses embed the MAC address in the Interface Identifier
(Modified EUI-64), disclosing the manufacturer of your network interface card
in the IPv6 address.

The revised specification of RFC 4941 will mitigate most of the tracking
vectors, but it will not make addresses completely single-use, so there is
still some correlation possible within a subnet.

Sometimes people only focus on the temporary address, but normally you get the
union of the vulnerabilities of all the address types you employ, and you may
have both a temporary and a persistent address.

If there's only a single host on the subnet, rotating the IP within the subnet
doesn't provide mitigation.

Consider how this interacts with MAC address randomization.  For example,
tracking by the hotspot based on MAC address is possible in some threat models.

Caleb Raitto: How common are those stable prefixes?  How many users share a
prefix?  I'm concerned the prefix can identify users.

Fernando: About 35% of users are using stable addresses, at least in one
survey.  Even if they're not using stable addresses, the prefix may not change
very often; for example, it's common for ISPs to rotate the prefix once a day.
Whether that's stable or not depends on the point of view.

* [Server-side address
privacy](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/shallow-anonymity-pools.pdf):
Christian Huitema (Private Octopus) (15 mins)

Christian Huitema:

Anonymity pools are a feature of many privacy designs, which try to hide
someone's activity by running it through a sharing pool.  An attacker that
observes what is coming in cannot make heads or tails of it, because there is
too much activity to match the inputs and outputs.  Tor has some extra tricks,
but it essentially relies on this.  DNS over HTTPS and DNS over TLS rely on
this.  Oblivious DNS relies on this.  Encrypted SNI relies on this, when there
is a fronting server in place.

If your anonymity pool is shallow, then the pool doesn't provide much privacy. 
Suppose you only have 1-2 customers coming to the proxy or server.  Then it is
easy for the attacker to correlate what comes in and what goes out.

Do we have a problem in practice?  I believe we do; this affects many of our
designs.

If most websites use a single IP address, then ESNI is not going to protect
your privacy.  By just looking at the IP address, you will learn as much
information as the SNI would give you.  More than 90% of servers share their
IP address with fewer than 10 other domains.
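
A back-of-the-envelope version of this point: hiding the SNI among k domains
co-hosted on one IP yields at most log2(k) bits of anonymity, so a pool of
fewer than 10 domains provides under about 3.3 bits.  A small illustrative
calculation:

```python
import math

def anonymity_bits(k_domains: int) -> float:
    # Best case: the attacker is left with a uniform guess among k names.
    return math.log2(k_domains)

for k in (1, 2, 10, 10_000):
    print(f"{k:>6} co-hosted domains -> {anonymity_bits(k):5.2f} bits")
# 1 domain -> 0.00 bits: the IP address alone reveals the site.
```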

What should we do?  As a general principle: we have to rely on choice not
chance.  For example, in Oblivious DNS, there are three layers of servers.  The
user has chosen an Oblivious DNS server to provide a privacy service for them. 
I think we need to have something similar for Encrypted SNI.  Right now, in
Encrypted SNI, we only rely on the choice made by the fronting server.  The
pool of backend servers supported by the fronting server can be small.

We have examples of big tech companies implementing large mixing services like
Google DNS.  Relying on large tech companies is nice, but it means you have to
trust those companies.  Those companies are making assurances, but since they
rely on a surveillance-capitalism business model, it's hard to be sure that
those assurances will remain the same into the future.  Regardless, this is a
variation on the walled-garden model, which has its own problems.

If a volunteer runs a service in a university, and it starts to cost a lot of
money, and investigation shows that it's being used for unsavory things, it
tends to get closed.  This results in cascading failure, as more traffic falls
onto the remaining volunteers.

Open access invites fraud, but a service you pay for has to identify that you
are a valid user.  A privacy service that requires you to identify yourself is
almost a contradiction.

We have to take the question of how to fund privacy services very seriously. 
Relying on funding as a side effect of something else will always have an
issue.  I think we will need an anonymous micropayments system like Chaum's.

There are a lot of unsavory VPNs out there.  If you choose a VPN for privacy,
you may get the opposite of privacy.

If we want users to use proxies for privacy, the proxy business needs to be
sustainable.  I would like to see PEARG consider this question directly.

Damian: Do you worry that proxies will become a central aggregation point for a
given user's traffic, and therefore a target for law enforcement or other
adversaries?

Christian: Yes, the proxy can be a fat target.  Even with blinded tokens it
still knows the user's IP address.  It's well known that in Tor you use at
least 3 layers of proxies to avoid this case, but we have to look at the
architecture of proxies.  If we try to use natural anonymity pools, it won't
work.

* Techniques for hiding IP addresses
    * [Anonymity networks and
    tokens](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/IETF_PEARG_Tor.pdf):
    George Kadianakis (Tor) (15 mins)

George Kadianakis, engineer on Tor

Tor can protect not only the identity of users, but also the IP addresses of
services like websites.

It's hard to count the number of users of an anonymity network.  Other
researchers have counted up to 8 million users.

The relays are all run by volunteers.

We are constantly working on improving Tor in all directions: performance
improvements (improving congestion control to utilize all our capacity
effectively), load balancing, security (protocol and binary, e.g. considering
moving the core daemon from C to Rust).  Also working on UX improvements in the
browser, better API for mobile integration, and better censorship
circumvention, because there are many users who use Tor to circumvent
censorship.

DDoS attacks are especially hard to defend against on a decentralized network
with anonymity.  We don't have IPs or reputations.  For example, in the
beginning of January, there was an attack, but it was hard to tell whether it
was malicious or a client doing something stupid.

DoS attacks can hit various places: attacking the network by shutting down the
relays or the directory authorities, or attacking the onion services.

Right now, onion services are being attacked by DoS adversaries exploiting the
asymmetries of the Tor protocol for onion services.  A client sends a message
to the onion service using work X, but the service needs to perform `10*X` work
to respond.  We've been trying to defend by reducing these asymmetries.  For
example, onionbalance does geographical load-balancing of onion services.
This is a defense, but it only buys a multiplier (the number of nodes), not an
order-of-magnitude change.  For motivated adversaries and a protocol with a
big asymmetry, this does not address the core of the problem.  We've been
working on giving operators more options to cut down certain connections, but
... these are all countermeasures, not affecting the core of the problem.

A deeper defense is based on proof-of-work.  In my opinion it is extremely
effective against low- to medium-strength adversaries, but not high-strength
adversaries.  Before contacting the service, clients have to solve a puzzle.
There are PoW systems that are quick to verify, have small proof size, and are
GPU-resistant, so that a botnet competing with regular clients has to spend
much more CPU time.  The system we designed has a "proof of state" mechanism,
so it can prioritize requests depending on how much work is proved.  This has
been suggested previously in the TLS working group ("Client puzzles for
TLS"), because ServerHello is much more expensive than ClientHello.
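
A minimal hashcash-style sketch of such a puzzle scheme (not Tor's actual
design): the client grinds for a nonce, the service verifies with a single
hash, and `difficulty_bits` tunes the client's expected effort.

```python
import hashlib
import os

def pow_hash(challenge: bytes, nonce: int) -> int:
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")

def solve(challenge: bytes, difficulty_bits: int) -> int:
    # Client side: grind nonces until the hash falls below the target,
    # costing ~2**difficulty_bits hash evaluations on average.
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while pow_hash(challenge, nonce) >= target:
        nonce += 1
    return nonce

def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    # Service side: a single hash to verify -- the asymmetry now favors the
    # service, and the effort proved can drive request prioritization.
    return pow_hash(challenge, nonce) < (1 << (256 - difficulty_bits))

challenge = os.urandom(16)
nonce = solve(challenge, difficulty_bits=16)         # ~65k hashes on average
print(verify(challenge, nonce, difficulty_bits=16))  # True
```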

I don't think big services can ask their users to wait 10 minutes, or even 2
seconds, for Proof of Work, so a more relevant defense here might be an
anonymous credential or token.  This is like a train ticket.  It's unlinkable
to your identity.  If it's redeemed multiple times, the redemptions cannot be
linked to you.  If you drop it on the street, it can't be linked to you.  This
has been used before in Privacy Pass and Signal, and we think it will be very
useful for DDoS defense.  It also has a lot to do with the reputation of exit
nodes.  We can control which users go to the exit nodes, to improve their
reputation without identity leakage.
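
A toy RSA blind-signature sketch of the unlinkability ("train ticket")
property described here; Privacy Pass and Signal use different, stronger
constructions, and the key size below is deliberately tiny and insecure.

```python
import hashlib
import math
import secrets

def is_probable_prime(n: int, rounds: int = 20) -> bool:
    # Miller-Rabin, good enough for a toy demo.
    if n < 4:
        return n in (2, 3)
    if n % 2 == 0:
        return False
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for _ in range(rounds):
        x = pow(secrets.randbelow(n - 3) + 2, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def random_prime(bits: int) -> int:
    while True:
        c = secrets.randbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(c):
            return c

# Issuer key (deliberately tiny; real deployments use a vetted library).
e = 65537
while True:
    p, q = random_prime(128), random_prime(128)
    if p != q and math.gcd(e, (p - 1) * (q - 1)) == 1:
        break
n, d = p * q, pow(e, -1, (p - 1) * (q - 1))

def h(msg: bytes) -> int:
    return int.from_bytes(hashlib.sha256(msg).digest(), "big") % n

# Client: create a token and blind it before the issuer ever sees it.
token = secrets.token_bytes(16)
while True:
    r = secrets.randbelow(n - 2) + 2
    if math.gcd(r, n) == 1:
        break
blinded = (h(token) * pow(r, e, n)) % n

# Issuer: signs the blinded value, learning nothing about `token`.
blind_sig = pow(blinded, d, n)

# Client: unblind. (token, sig) verifies against the issuer's public key,
# but the issuer cannot link the redemption back to this issuance -- the
# "train ticket" property.
sig = (blind_sig * pow(r, -1, n)) % n
print(pow(sig, e, n) == h(token))  # True
```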

The anonymous credential scene is just getting started, and there are lots of
different schemes with different crypto, delegation, etc.

We are looking forward to more standardization work on both proof of work and
anonymous credential schemes.

* [Using Multicast DNS to protect privacy when exposing ICE
candidates](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/WebRTC%2C_mDNS%2C_and_IP_privacy.pdf)
([draft-ietf-mmusic-mdns-ice-candidates](https://datatracker.ietf.org/doc/draft-ietf-mmusic-mdns-ice-candidates/)):
Justin Uberti (Google) (15 mins)

Justin Uberti

In WebRTC, we've had to come up with technologies that let users connect to
each other peer-to-peer without leaking their IP addresses.

When WebRTC users want to connect to each other, they exchange ICE candidates,
which are essentially IP:port pairs.  Often the most important candidate is the
IP address of the local network interface.

Each browser will open up a UDP socket, get that IP:port tuple of their local
network interface, and provide this to the app.  Then the app will use its own
mechanism (maybe XHR or websocket) to communicate this to the remote peer. 
That's out of scope for WebRTC and gives the app lots of flexibility, but it
means the app can see the raw IP address.  The recipient gets the candidate and
can establish a direct peer-to-peer connection.

Typically, the web server can see the client's NATed address, but with WebRTC,
the web server can also see the local interface IP address.  We've seen this
used in ad networks, we assume as fraud protection or some kind of supercookie.
One option would be to not provide apps with this information, and only reveal
the public IP, but that would prevent direct LAN connections and reduce the
value of peer-to-peer connections.

Some folks at Apple came up with some ideas about using mDNS to wrap these
addresses.  Instead of passing the IP to the application, you register an mDNS
name that maps 1:1 to the IP address.  You pass the mDNS address up to the
application, and then the remote peer will be able to resolve that name and
connect.  In this case the app never learns the raw IP address.
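
A minimal sketch of this wrapping, assuming a UUID-based name as in the draft;
the in-process dict stands in for a real mDNS responder, which is elided.

```python
import uuid

mdns_registrations: dict[str, str] = {}   # name -> private IP, kept on-host

def wrap_candidate(local_ip: str) -> str:
    # Hand the app a fresh "<uuid>.local" name instead of the raw IP; the
    # browser would also register this name with its mDNS responder.
    name = f"{uuid.uuid4()}.local"
    mdns_registrations[name] = local_ip
    return name

def resolve_on_lan(name: str) -> str | None:
    # Stand-in for an mDNS query: only peers on the same LAN can do this,
    # so the web app and server never see the raw interface address.
    return mdns_registrations.get(name)

candidate_name = wrap_candidate("192.168.1.23")
print(candidate_name)                  # e.g. '9b2f....local'
print(resolve_on_lan(candidate_name))  # '192.168.1.23' (LAN peers only)
```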

One opportunity here is, if mDNS isn't applicable, to do this wrapping using
encryption with a key distributed through Chrome enterprise policy.

Dave Oran: Aren't these mDNS names snoopable on the LAN, and hence the IP
addresses known to anyone on the LAN?

Justin: Yes, but you would need collusion between the website and an entity
that is already on the LAN.  Generally, if you can do the mDNS resolution
directly, the obfuscation benefits are gone.

Fernando: I understand there might be challenges when using mDNS if it's not
available, but in general I think WebRTC should deal with domain names for
architectural reasons.

Justin: It's nice when we can use domain names, but if peers don't have domain
names, we would like to be able to establish peer-to-peer connections.

Fernando: I understand that there's some DNS stuff that is missing here, but
maybe that means that we should be doing something that we are not.  Probably
we should find a way to get a domain name for each host to use.  I've seen
this problem in other cases.

Justin: I would be interested in seeing other cases where this pattern came up.
But adding a registry has the danger of creating a single point of failure
during connection setup.

    * [Willful IP
    Blindness](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/IP_Address_Privacy_%26_Gnatcatcher.pdf):
    Brad Lassey (Google) (15 mins)

Christian Huitema: Isn't there a tension between the desire to have your proxy
close to the user for performance reasons and the size of the anonymity pool
that you get?

Brad Lassey: Yes, but since there's a cost associated with each CDN server,
they typically serve some significant number of users.

Christian: I would trust the answer better if I saw statistics.

Fernando: I have some concerns with this idea of trusted servers and trusted
providers.  There is an assumption that the user who wants privacy has to
trust a company or an organization.  That is not my strategy.  My strategy is
to avoid trusting even the parties that I usually trust.

### Discussion and follow up (10 mins)

* Open discussion
* Potential RG work from this?

## Blue Sheets

(~65 people on the call)
- Shivan Sahib (Salesforce)
- Chris Wood (Cloudflare)
- Tommy Pauly (Apple)
- Steven Valdez (Google)
- Ben Schwartz (Google)
- Eric Orth (Google)
- Brad Lassey (Google)
- Paul Jensen (Google)
- Caleb Raitto (Google)
- Damian Menscher (Google)
- Mirja Kühlewind (Ericsson)
- Pawel Jurczyk (Google)
- Alex Chernyakhovsky (Google)
- Jade Kessler (Google)
- Ira McDonald (High North/Toyota)
- Stephen Farrell (Trinity College Dublin)
- Kaustubha Govind (Google)
- Per Bjorke (Google)
- Petr Marchenko (Facebook)
- Sam Weiler (W3C/MIT)
- Ian Swett (Google)
- Fernando Gont (SI6 Networks)
- Justin Uberti (Google)
- David Van Cleve (Google)
- Christian Huitema (Private Octopus Inc.)
- Colin Perkins (University of Glasgow)
- Dave Oran (ICNRG, Network systems Research & Design)
- Josh Frank (DuckDuckGo)
- Luigi Iannone (Huawei)
- Francesco Amorosa (AFA Systems)
- Antoine Fressancourt (Huawei)
- David Turner (Google)
- David Benjamin (Google)
- Paul Oliver (Google)