Minutes interim-2021-pearg-01: Tue 12:00
minutes-interim-2021-pearg-01-202101191200-01
| Meeting Minutes | Privacy Enhancements and Assessments Research Group (pearg) RG |
| --- | --- |
| Date and time | 2021-01-19 20:00 |
| Title | Minutes interim-2021-pearg-01: Tue 12:00 |
| State | Active |
| Last updated | 2021-01-21 |
# PEARG interim 19th Jan 2021

- [Agenda](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/agenda.md)
- [Participation details](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/participation.md)
- [Chair slides](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/interim-202101-pearg-chairs-slides.pdf)

## Proceedings

### Administrivia (5 min)

* Scribes, Blue Sheets, [Note Well](https://www.ietf.org/about/note-well/)
* Purpose of the meeting
  * Discuss uses, privacy implications, and privacy mitigations for IP addresses

### Presentations (105 mins)

* IP address use cases
  * [Anti-fraud and abuse](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/Anti-abuse_applications_of_IP.pdf): Dimitris Theodorakis (WhiteOps), Philipp Pfeiffenberger (YouTube), David Turner (Google) (20 mins)

Notes (excluding slide contents):

David Turner: Browsers are shifting to emphasize privacy. With third-party cookies being blocked, browsers need to identify covert tracking, and IP addresses are an example of such covert tracking. But how do we avoid ruining the trust and safety of the internet while getting rid of IP tracking?

Dimitris Theodorakis: In the past 5 years, more than 500 million passwords have been stolen. We've encountered botnets that are designed to listen to music in order to get a bigger cut of copyright licensing fees.

Philipp Pfeiffenberger: Most of the use cases we talk about for IP addresses are about stopping bad folks from doing bad stuff, but IP addresses are also useful for helping users access their accounts more easily. For example, we can use IPs to apply variable friction, reducing challenges (e.g. 2FA) when the IP address is low-risk. Bad actors like having a large number of accounts (to push ideologies or products). By definition there is no account at the time of account creation, so IPs are one of the few signals we can use to identify hotspots of account creation.
All of the PII in my bank account is only as secure as my account, and we rely on IPs to identify hotspots of attempted authentication activity.

"Contextual integrity": Clearview AI scraped a large number of photos from across the internet and reused them for face recognition. There have been many similar instances. Service providers use IPs to identify hotspots of crawling and other read-only activity to prevent recontextualization of their content.

Low-latency interfaces: IPs are very powerful when a split-second decision is needed on reputation.

Dimitris: The 3ve botnet caused tens of millions of dollars in losses. Its Kovter component has also been used for ransomware. The US government issued extradition requests on the basis of this attack, and the criminals are now behind bars. These consequences would have been impossible if we didn't have access to the IPs.

Philipp: IPs are also important for real-world crimes, not just cybercrime. For CSAM, the crime occurs in the physical world, and evidence of it is distributed online. We share evidence with NCMEC about how the material reached the platform, including IP addresses, which has resulted in important law enforcement actions and in victims being freed. The protections we have described have been built up over 20-30 years. We can't take fenceposts off the security fence to build the privacy fence, nor vice versa. We must do both.

Questions:

Mirja: Isn't this just the tip of the iceberg, only catching those criminals that are too stupid to use Tor?

Philipp: I can't speak directly to Tor, but in aggregate the success that law enforcement has had is significant. Some proxies may be able to assist law enforcement. Overall, IP addresses have been important in those trials.

Christian Huitema: There is a classic tension between privacy and accountability in general. If everything you do is exposed, then there is accountability, because you can be followed for what you did.
In all those discussions we are often reminded of the tension between privacy and child abuse. I would like the speakers to tease out the difference between protecting the user and protecting the service from the user. For example, in the account protection case, the user and the service are clearly working together, and there is a range of things that can be done there. There are other cases, like ad protection, where you want to make sure that the ad produces correct revenues. In that case the user is not on board; they're not even supposed to be identified.

Dimitris: In the case of 3ve, the user is actually a victim. The user's device is infected with malware that is dropping a variety of payloads, including ad fraud, ransomware, stealing personal information, etc. The user also wants to cooperate; they just don't have a good way of cooperating.

Christian: In the Windows world, Windows security is designed to protect the user, and the user cooperates. Compare that to Xbox: Xbox security is designed to prevent users from cheating by enhancing their devices. There you are working against the user, not for the user. This is the kind of tension that I would like to surface. I think this should be part of the analysis of the various solutions.

Philipp: Often you have multiple parties. For example, I might not want my LinkedIn profile recontextualized to identify me in a political context, but there is a consumer on the offensive side who does want that.

Tommy Pauly: If we assume that IP address privacy is going to happen, what do you need to protect these use cases?

Dimitris: We need a signal from a device that we can guarantee isn't spoofed before it meets our sensors. For example, a modified browser that is designed to automatically commit a credential-stuffing attack is often deliberately spoofing those signals to prevent us from doing that detection.

Fernando Gont: How does all this change when you switch to IPv6?
Philipp: IPv6 doesn't appear to be less identifiable. The upper 64 bits are stable in the context of a typical residential deployment.

Fernando: You might identify the user by the /64, but you actually don't know whether the user has a /56 or a /48. So while in IPv4 you can reasonably assume the user has a single address, in IPv6 you can guess that the user has a /64, but they could have much more, depending on the ISP. The specific granularity is hard to tell.

Dimitris: Unfortunately, most fraud detection companies will force an IPv4 connection instead so that they get that IPv4 address.

Fernando: And what if they can't?

Dimitris: We would fall back to IPv6, but we would lose entropy in the signal. Our ability to use this as a high-information signal would be degraded. The same applies to IPs behind huge NATs. Our ability to use them as a stable identifier is degraded in several scenarios, but it's still extremely important for detecting botnets.

Matthew Finkel: I'm a Tor developer. This idea that IP addresses are so fundamental to abuse tracking and detection ... it seems like you've built tracking and surveillance of abusers as a mechanism for identifying abuse. I wonder how you build, into a product and protocol that is trying to provide privacy, a signal that still reveals enough information to identify when a user is legitimate vs. abusive or a botnet. What does that signal look like? It sounds like you want remote attestation that the device is being used by a natural person.

Dimitris: That would be one piece of the architecture, which I think is pretty fundamental, but I don't think it would be enough for all the different use cases. I agree that it's a high-entropy signal, but what we are tracking is a botnet. Our goal is never to track humans; in fact, we don't care about tracking humans. We care about tracking the sources of abusive activity.
Philipp: The place where this is really at odds is when a small number of users doing a bad thing gives an address a really bad reputation. Without a way to carry that reputation forward, it has a bad effect on all the users. Without a way to say that some subset of Tor users are troublemakers ... how do you persist the marking that these users are troublemakers so the rest of the population can stay safe?

* [DDoS](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/ddos_theory_and_practice.pdf): Damian Menscher (Google) (10 mins)

Damian Menscher: The previous presentation gave an overview of abuse concerns. This is a deep dive into DDoS and botnets. What if abuse just goes completely off the charts? Then you're not just dealing with abuse that costs a company money; this is abuse that puts the company out of business. And not just companies: bloggers, anyone who puts information online, is at risk of a DDoS attack taking them off the internet. Here's an article by someone who got kicked off the internet because a teenager didn't like his post. Do we want teenagers to be enacting global censorship? And also banks, which are critical infrastructure. This isn't something that ramps up over a day, giving you time to react and add capacity. Even with ML, there isn't always a simple answer, or an answer at all, to identify a signature. Sometimes you just need to identify who is attacking you and block them. We need extremely high-fidelity signals, and IP addresses are one. Frequently the requests are just "GET /", so there's not a lot of signal there. Botnets often spoof user-agent, referer, and other attributes. How do you know the request rate if you don't know the user's IP? You need some ability to look at two completely new users and know whether they are actually the same user, in order to apply something like a token-bucket scheme for rate-limiting. We need a signature that they can't easily change. Blocking on headers doesn't work; they'll just be changed.
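The rate-limiting point above can be illustrated with a minimal token-bucket sketch. This is not from the talk; the rates and the idea of keying buckets by source IP are illustrative assumptions, showing why a stable per-client key is needed at all:

```python
import time

class TokenBucket:
    """Allow sustained `rate` requests/sec with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Replenish tokens for the time elapsed since the last request.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# One bucket per client key (e.g. source IP). Without a stable key,
# an attacker simply gets a fresh bucket for every request, and the
# scheme degenerates to no limit at all.
buckets: dict[str, TokenBucket] = {}

def check(client_key: str) -> bool:
    bucket = buckets.setdefault(client_key, TokenBucket(rate=5.0, capacity=10.0))
    return bucket.allow()
```

The scheme only works if `client_key` is something the attacker cannot cheaply rotate, which is exactly the role the IP address plays in the talk.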
I want to re-emphasize how big these DDoS attacks are. In 2016, a SYN flood knocked out a DNS provider, which took down a lot of sites and resulted in a congressional inquiry. The same botnet took down basically all of Liberia. We're seeing an increase due to the pandemic. In the case of Mirai and Dyn in 2016, we were able to identify which botnet it was and bring the attackers to justice, because it had previously attacked Krebs on Security, which was protected by Google. We recorded the attacker IPs and, with Krebs's permission, shared them with the FBI, which allowed them to track down the perpetrators. In another case, we found malware that routed all user traffic through a proxy. We were able to use the proxy's IP to identify infected users and notify them. We're not just using the IP information to protect ourselves; we're also using it to help those who have compromised machines. If your machine has been compromised, you don't have privacy to start with.

* Privacy implications of IP addresses
  * [Overview of privacy implications](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/pearg-jan-2021-fgont-ip-addresses-final.pdf): Fernando Gont (SI6 Networks) (15 mins)

Fernando Gont: I will mostly focus on IPv6, but at the end I'll also give an overview of how this applies to IPv4. SLAAC relies on the host autoconfiguring its own address. DHCPv6 is more like IPv4, relying on a server to lease addresses to the hosts. In some cases, a single host can have both kinds of addresses. If you have Interface Identifiers that are constant across networks, this allows for correlating activity as you move from one network to another (i.e. a supercookie). Conversely, if you use one address for each TCP connection, network activity correlation is mitigated to some extent. With stable IDs, there is also a possible active attack: if I can predict what suffix a user would use in each network, I can send a probe packet and potentially locate the user.
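The supercookie point can be sketched in a few lines. The prefixes and the EUI-64-style identifier below are invented for illustration; the point is that a constant interface identifier survives a change of network prefix, while a randomized one does not:

```python
import ipaddress
import secrets

def make_addr(prefix: str, iid: int) -> ipaddress.IPv6Address:
    """Combine a /64 routing prefix with a 64-bit interface identifier."""
    net = ipaddress.IPv6Network(prefix)
    assert net.prefixlen == 64
    return ipaddress.IPv6Address(int(net.network_address) | iid)

# A constant interface identifier acts as a supercookie across networks
# (this value plays the role of an EUI-64 identifier derived from the MAC):
stable_iid = 0x0226_bbff_fe11_2233
home = make_addr("2001:db8:aaaa:1::/64", stable_iid)
cafe = make_addr("2001:db8:bbbb:9::/64", stable_iid)
# The lower 64 bits of both addresses are identical: trackable.

# Temporary-address schemes randomize the IID instead:
temp_home = make_addr("2001:db8:aaaa:1::/64", secrets.randbits(64))
temp_cafe = make_addr("2001:db8:bbbb:9::/64", secrets.randbits(64))
# The lower 64 bits no longer correlate across networks, though the /64
# prefix still identifies the subnet (and the host, if it is alone there).
```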
RFC 4291 discloses the manufacturer of your network interface card in the IPv6 address. The revised specification of RFC 4941 will mitigate most of the tracking vectors, but it will not make addresses completely single-use, so some correlation is still possible within a subnet. Sometimes people only focus on the temporary address, but normally you get the union of the vulnerabilities of all the address types you employ, and you may have both a temporary and a persistent address. If there's only a single host on the subnet, rotating the address within the subnet doesn't provide mitigation. Consider how this interacts with MAC address randomization. For example, tracking by the hotspot based on MAC address is possible in some threat models.

Caleb Raitto: How common are those stable prefixes? How many users share a prefix? I'm concerned the prefix can identify users.

Fernando: About 35% of users are using stable addresses, at least in one survey. Even if they're not using stable addresses, the prefix may still be fairly stable; for example, it's common for ISPs to rotate the prefix once a day. Whether that counts as stable or not depends on the point of view.

  * [Server-side address privacy](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/shallow-anonymity-pools.pdf): Christian Huitema (Private Octopus) (15 mins)

Christian Huitema: Anonymity pools are a feature of many privacy designs. Many privacy designs try to hide someone's activity by running it through a sharing pool. An attacker that observes what is coming in cannot make heads or tails of it, because there is too much activity to match the inputs and outputs. Tor has some extra tricks, but it essentially relies on this. DNS over HTTPS and DNS over TLS rely on this. Oblivious DNS relies on this. Encrypted SNI relies on this, when there is a fronting server in place. If your anonymity pool is shallow, then the pool doesn't provide much privacy. Suppose you only have 1-2 customers coming to the proxy or server.
Then it is easy for the attacker to correlate what comes in and what goes out. Do we have a problem in practice? I believe we do; this affects many of our designs. If most websites use a single IP address, then ESNI is not going to protect your privacy: by just looking at the IP address, you will learn as much as the SNI would tell you. More than 90% of servers have fewer than 10 other domains.

What should we do? As a general principle, we have to rely on choice, not chance. For example, in Oblivious DNS there are three layers of servers, and the user has chosen an Oblivious DNS server to provide a privacy service for them. I think we need something similar for Encrypted SNI. Right now, in Encrypted SNI, we only rely on the choice made by the fronting server, and the pool of backend servers supported by the fronting server can be small. We have examples of big tech companies implementing large mixing services, like Google DNS. Relying on large tech companies is nice, but it means you have to trust those companies. Those companies are making assurances, but since they rely on a surveillance-capitalism business model, it's hard to be sure those assurances will remain the same into the future. Regardless, this is a variation on the walled-garden model, which has its own problems.

If a volunteer runs a service in a university, and it starts to cost a lot of money, and investigation shows that it's being used for unsavory things, it tends to get closed. This results in cascading failure, as more traffic falls onto the remaining volunteers. Open access invites fraud, but a service you pay for has to identify that you are a valid user, and a privacy service that requires you to identify yourself is almost a contradiction. We have to take the question of how to fund privacy services very seriously. Relying on funding as a side effect of something else will always have an issue. I think we will need an anonymous micropayments system like Chaum's.
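The shallow-pool observation reduces to a simple count of domains per server IP. The mapping below is invented for illustration; in practice it would come from DNS resolutions of a domain list:

```python
from collections import defaultdict

# Hypothetical domain -> server IP mapping, as obtained from DNS.
dns_records = {
    "alpha.example": "192.0.2.1",
    "beta.example":  "192.0.2.1",
    "gamma.example": "192.0.2.7",  # alone on its IP: a shallow pool
}

# Invert the mapping: each server IP's anonymity set is the set of
# domains it fronts. If that set has size 1, encrypting the SNI gains
# nothing, because the destination IP already reveals the domain.
pools: defaultdict = defaultdict(set)
for domain, ip in dns_records.items():
    pools[ip].add(domain)

shallow = {ip for ip, domains in pools.items() if len(domains) < 2}
```

Here `shallow` contains `192.0.2.7`: an observer who sees a connection to that IP knows the domain regardless of ESNI.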
There are a lot of unsavory VPNs out there. If you choose a VPN for privacy, you may get the opposite of privacy. If we want users to use proxies for privacy, the proxy business needs to be sustainable. I would like to see PEARG consider this question directly.

Damian: Do you worry that proxies will become a central aggregation point for a given user's traffic, and therefore a target for law enforcement or other adversaries?

Christian: Yes, the proxy can be a fat target. Even with blinded tokens, it still knows the user's IP address. It's well known that in Tor you use at least 3 layers of proxies to avoid this case, but we have to look at the architecture of proxies. If we try to rely on natural anonymity pools, it won't work.

* Techniques for hiding IP addresses
  * [Anonymity networks and tokens](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/IETF_PEARG_Tor.pdf): George Kadianakis (Tor) (15 mins)

George Kadianakis (engineer on Tor): Tor can protect not only the identity of users, but also the IP addresses of services like websites. It's hard to count the number of users of an anonymity network; some researchers have counted up to 8 million users. The relays are all run by volunteers. We are constantly working on improving Tor in all directions: performance (improving congestion control to utilize all our capacity effectively), load balancing, and security (protocol and binary, e.g. considering moving the core daemon from C to Rust). We are also working on UX improvements in the browser, a better API for mobile integration, and better censorship circumvention, because many users use Tor to circumvent censorship. DDoS attacks are especially hard to defend against on a decentralized network with anonymity: we don't have IPs or reputations. For example, at the beginning of January there was an attack, but it was hard to tell whether it was malicious or a client doing something stupid.
DoS attacks occur in various places: attacking the network by shutting down the relays or the directory authorities, or attacking the onion services. Right now, onion services are being attacked by DoS adversaries exploiting the asymmetries of the Tor protocol for onion services: a client sends a message to the onion service using work X, but the service needs to perform `10*X` work to respond. We've been trying to defend by reducing these asymmetries. For example, onionbalance does geographical load-balancing of onion services. This is a defense, but it just removes a multiplier proportional to the number of nodes; it's not an order-of-magnitude change. Against motivated adversaries, and a protocol with a big asymmetry, this does not address the core of the problem. We've been working on giving operators more options to cut down certain connections, but ... these are all countermeasures that don't touch the core of the problem.

A deeper defense is based on proof-of-work. In my opinion it is extremely effective against low- to medium-strength adversaries, but not high-strength adversaries. Before contacting the service, clients have to solve a puzzle. There are PoW systems that are quick to verify, have small proof sizes, and offer GPU resistance, so a botnet competing with regular clients has to spend much more CPU time. The system we designed has a "proof of state" mechanism, so it can prioritize requests depending on how much work is proved. This has been suggested previously in the TLS working group ("Client puzzles for TLS"), because ServerHello is much more expensive than ClientHello.

I don't think big services can ask their users to wait 10 minutes, or even 2 seconds, for proof of work, so a more relevant defense here might be an anonymous credential or token. This is like a train ticket: it's unlinkable to your identity. If it's redeemed multiple times, the redemptions cannot be linked to you. If you drop it on the street, it can't be linked to you.
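The train-ticket analogy can be sketched at the interface level. This is a deliberately naive toy, not a real scheme: real designs such as Privacy Pass add a blind-signature step so that even the issuer cannot link a token it issued to the redemption it later sees, which this sketch omits:

```python
import secrets

# Toy single-use token service. The token is a random value, so it
# carries no user identity; the `redeemed` set enforces one-time use.
# NOTE: because the issuer sees the token value at issuance, it could
# link issuance to redemption here; blind signatures fix that.
issued: set = set()
redeemed: set = set()

def issue() -> bytes:
    """Hand the client an opaque, random, single-use token."""
    token = secrets.token_bytes(32)
    issued.add(token)
    return token

def redeem(token: bytes) -> bool:
    """Accept a token once; reject forgeries and double-spends."""
    if token in issued and token not in redeemed:
        redeemed.add(token)
        return True
    return False
```

A client presents a token instead of an identity; the service learns only "this request was vouched for once", which is the property the talk wants for DDoS defense.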
This has been used before in Privacy Pass and Signal, and we think it will be very useful for DDoS defense. It also has a lot to do with the reputation of exit nodes: we can control which users go to the exit nodes, to improve their reputation without identity leakage. The anonymous credential scene is just getting started, and there are lots of different schemes with different crypto, delegation properties, etc. We are looking forward to more standardization work on both proof-of-work and anonymous credential schemes.

  * [Using Multicast DNS to protect privacy when exposing ICE candidates](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/WebRTC%2C_mDNS%2C_and_IP_privacy.pdf) ([draft-ietf-mmusic-mdns-ice-candidates](https://datatracker.ietf.org/doc/draft-ietf-mmusic-mdns-ice-candidates/)): Justin Uberti (Google) (15 mins)

Justin Uberti: In WebRTC, we've had to come up with technologies that let users connect to each other peer-to-peer without leaking their IP addresses. When WebRTC users want to connect to each other, they exchange ICE candidates, which are essentially IP:port pairs. Often the most important candidate is the IP address of the local network interface. Each browser will open up a UDP socket, get the IP:port tuple of its local network interface, and provide this to the app. The app will then use its own mechanism (maybe XHR or WebSocket) to communicate this to the remote peer. That's out of scope for WebRTC and gives the app lots of flexibility, but it means the app can see the raw IP address. The recipient gets the candidate and can establish a direct peer-to-peer connection. Typically, the web server can only see the client's NATed address, but with WebRTC, the web server can also see the local interface IP address. We've seen this used in ad networks, we assume as fraud protection or as some kind of supercookie.
One option would be to not provide apps with this information and only reveal the public IP, but that would prevent direct LAN connections and reduce the value of peer-to-peer connections. Some folks at Apple came up with ideas for using mDNS to wrap these addresses: instead of passing the IP to the application, you register an mDNS name that maps 1:1 to the IP address. You pass the mDNS name up to the application, and the remote peer can then resolve that name and connect. In this case the app never learns the raw IP address. One opportunity here, if mDNS isn't applicable, is to do this wrapping using encryption, with a key distributed through Chrome enterprise policy.

Dave Oran: Aren't these mDNS names snoopable on the LAN, and hence the IP addresses known to anyone on the LAN?

Justin: Yes, but you would need collusion between the website and an entity that is already on the LAN. Generally, if you can do the mDNS resolution directly, the obfuscation benefits are gone.

Fernando: I understand there might be challenges when mDNS is not available, but in general I think WebRTC should deal with domain names, for architectural reasons.

Justin: It's nice when we can use domain names, but if peers don't have domain names, we would still like to be able to establish peer-to-peer connections.

Fernando: I understand that there's some DNS functionality missing here, but maybe that means we should be doing something we are not. Probably we should find a way to get a domain name for each host. I've seen this problem in other cases.

Justin: I would be interested in seeing other cases where this pattern came up. But adding a registry has the danger of creating a single point of failure during connection setup.
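The wrapping idea above can be sketched as follows. This is a toy model of the mechanism in the draft, with a plain dictionary standing in for the browser's actual mDNS responder; the names and IPs are invented:

```python
import uuid

# name -> local IP, held inside the browser and answered only via
# mDNS on the local network; the web application never sees the IP.
registrations = {}

def wrap_candidate(local_ip: str) -> str:
    """Register a fresh, random '.local' name standing in for the IP."""
    name = "{}.local".format(uuid.uuid4())
    registrations[name] = local_ip
    return name  # this opaque name is what goes into the ICE candidate

def mdns_resolve(name: str):
    """Only peers on the same LAN can actually perform this resolution."""
    return registrations.get(name)
```

The candidate handed to the app is a random UUID-based name, so the app (and the web server relaying it) learns nothing about the interface address, while a LAN peer can still resolve it and connect directly.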
  * [Willful IP Blindness](https://github.com/IRTF-PEARG/wg-materials/blob/master/interim-21-01/IP_Address_Privacy_%26_Gnatcatcher.pdf): Brad Lassey (Google) (15 mins)

Christian Huitema: Isn't there a tension between the desire to have your proxy close to the user for performance reasons and the size of the anonymity pool that you get?

Brad Lassey: Yes, but since there's a cost associated with each CDN server, they typically serve some significant number of users.

Christian: I would trust the answer more if I saw statistics.

Fernando: I have some concerns with this idea of trusted servers and trusted providers. There is an assumption that the user who wants privacy has to trust a company or an organization. That is not my strategy. My strategy is to avoid trusting even the parties that I usually have to trust.

### Discussion and follow up (10 mins)

* Open discussion
* Potential RG work from this?

## Blue Sheets (~65 people on the call)

- Shivan Sahib (Salesforce)
- Chris Wood (Cloudflare)
- Tommy Pauly (Apple)
- Steven Valdez (Google)
- Ben Schwartz (Google)
- Eric Orth (Google)
- Brad Lassey (Google)
- Paul Jensen (Google)
- Caleb Raitto (Google)
- Damian Menscher (Google)
- Mirja Kühlewind (Ericsson)
- Pawel Jurczyk (Google)
- Alex Chernyakhovsky (Google)
- Jade Kessler (Google)
- Ira McDonald (High North/Toyota)
- Stephen Farrell (Trinity College Dublin)
- Kaustubha Govind (Google)
- Per Bjorke (Google)
- Petr Marchenko (Facebook)
- Sam Weiler (W3C/MIT)
- Ian Swett (Google)
- Fernando Gont (SI6 Networks)
- Justin Uberti (Google)
- David Van Cleve (Google)
- Christian Huitema (Private Octopus Inc.)
- Colin Perkins (University of Glasgow)
- Dave Oran (ICNRG, Network Systems Research & Design)
- Josh Frank (DuckDuckGo)
- Luigi Iannone (Huawei)
- Francesco Amorosa (AFA Systems)
- Antoine Fressancourt (Huawei)
- David Turner (Google)
- David Benjamin (Google)
- Paul Oliver (Google)