PEARG notes
=====

Administrivia (5 minutes)
==

Blue sheets / scribe selection / NOTE WELL
==

* Agenda Bash

# Draft updates (20 minutes)

* IP address privacy draft missed the cutoff; came out of the interim, still pretty skeletal
* More content coming, throw some in, particularly use cases. Lots in the interim. If interested, let chairs and authors know.

## Safe measurement update - Mallory Knodel

* A number of prior contributions, a few small changes
* Fills a major need
* Goal of draft: describe guidelines for academia and industry on measurements that don't violate user privacy
* Interesting things around scope. Important. Strengthened in last version
* Not a substitute for ethics review. Complements, not replaces
* Tries to define better what Internet measurement scope means. Interested in a definition.
* Identifies the user and who it is safe for.
* In three parts
* Consent
* Safety
* Isolate risk with a dedicated testbed
* Respect other infrastructure
* Data minimization
* Masking
* Risk analysis
* Alas, TOC not coming out of XML yet!
* Changes since -04: disclosure issues
* Recent research, IP addresses
* Safety != Ethics?
* Since -05: nits
* Things still open in GitHub:
* Responsible disclosure
* Availability
* IP addresses
* Future computing capability
* Look at CAIDA
* Want to bring in learnings, add to the table of contents.
* Please open issues, better yet PRs
* Very hard to explain what consent would mean at lower layers

### QA:

* Stephen (from chat): two things on consent 1) be good to include examples of when that was handled well and when badly (or controversially) and 2) I think this document might end up being a model for other IETF docs that mention consent so it should be done carefully
* Mallory: Good idea on adding examples; depends on "carefully". User consent shouldn't give free rein. Not clear this goes beyond this context. Internet traffic is different.

## IP address privacy update

# New Work / Presentations (1hr 35 mins)

## Website Fingerprinting in the Age of QUIC - Jean-Pierre Smith (20 mins)

* From PETS two weeks ago
* In the early days, if an adversary wanted to see content they could use Wireshark
* Now things are encrypted. But certs and SNI reveal what the user is viewing.
* Users use PETs like Tor and VPNs
* Most information no longer there
* In an ideal world, end of story
* But... packet sizes, timings, directions.
* Creates fingerprinting
* Attacker sees features from websites, constructs a model
* Usually interested in a particular set of pages vs. not
* Quite a lot of work on Tor and websites
* Mostly TCP focused
* QUIC has multiple streams, vs. one flow per connection
* Traces can mix TCP and QUIC for the same navigation
* Collect a large dataset by scanning webpages
* 100 pages are the target, 16000 the rest
* A number of classifiers trained, many deep learning. Identify the webpage!
* Evaluate recall and precision
* TCP-trained identifying QUIC?
* Doesn't work: TCP-trained worked well on TCP, not QUIC
* Now train on QUIC and test on QUIC
* They work just as well QUIC-QUIC as TCP-TCP
* Some classifiers a bit better on QUIC than on TCP
* Could be due to QUIC server variability being limited, so variation across web pages differs. Reduced middlebox interference?
* Mixed classification / split ensemble
* Mixed: do both at once
* Split: detect QUIC or TCP, send to a dedicated classifier for each case
* Mixed: slight decrease in performance
* Split: very simple trace distinguishing, 99% accuracy (see the sketch at the end of this section)
* Due to handshake differences
* QUIC initial ClientHello quite large vs. SYN-ACK handshake
* Real easy!
* Ensemble: QUIC vs TCP provides weight to predictions of TCP and QUIC
* Not as good as Mixed for the same sample budget
* Conclusions: QUIC not more difficult than TCP, some problems when using a TCP-only classifier
* Joint possible, at a cost to the adversary

### QA

* Antoine: How large is training vs test? Is there drift over time with pages, or is all the data gathered at once?
* A: Split was 90% training to 10% holdout.
* A: Didn't evaluate drift. Has been evaluated before by a number of people ("Realistic Website Fingerprinting"). Running collection of the webpages could maintain high precision and recall. Quite a number of recent works. Triplet fingerprinting: use 10 new samples to get it back up to high precision despite very old samples.
* Q: Any other research in the area?
* A: Four or five other groups doing it, expect more stuff coming out over the months and year
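A minimal sketch of the split ensemble's QUIC-vs-TCP trace-distinguishing step described above. The trace format, the 1200-byte threshold, and the model plumbing are illustrative assumptions, not the paper's exact features:

```python
# Sketch: distinguish QUIC from TCP traces by handshake shape. QUIC's Initial
# packet carrying the ClientHello is padded to a large datagram, while a TCP
# connection opens with a small SYN. Threshold and trace format are assumed.

from typing import List, Tuple

# A trace is a list of (direction, size) pairs: +1 = client->server, -1 = server->client
Trace = List[Tuple[int, int]]

def looks_like_quic(trace: Trace, threshold: int = 1200) -> bool:
    """Label a trace as QUIC if the first client-sent packet is already large."""
    for direction, size in trace:
        if direction == +1:
            return size >= threshold  # large QUIC Initial vs. small TCP SYN
    return False

def split_classify(trace: Trace, quic_model, tcp_model):
    """Split ensemble: route the trace to a protocol-specific classifier."""
    model = quic_model if looks_like_quic(trace) else tcp_model
    return model(trace)

# Example: a padded QUIC Initial vs. a TCP handshake
quic_trace = [(+1, 1252), (-1, 1252), (+1, 1400)]
tcp_trace = [(+1, 60), (-1, 60), (+1, 52), (+1, 571)]
print(looks_like_quic(quic_trace), looks_like_quic(tcp_trace))  # True False
```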
## ShorTor: Improving Tor Network Latency through Multi-Hop Overlay Routing - Kyle Hogan (20 mins)

* Presented last meeting, excited to be back
* Work in progress
* Questions can be answered as we go
* ShorTor is an overlay that reduces latency between relays by making better routing decisions
* Design, Evaluation, Integration, Security
* We don't change the client
* Interested in feedback
* Overlay routing: sometimes the fastest path from A to B goes through a server C (see the sketch at the end of this section)
* Tor already has lots of forwarding
* Can we take advantage of it?
* Go via another relay if faster
* To evaluate, need latencies between things in Tor
* Ethics interlude: we've worked with Tor, let operators opt out, our relays are pretty restrictive, don't record any connection we didn't make ourselves.
* Q: Since the whole path is unknown, per hop?
* A: Right now an additional control plane to indicate that this route should go via.
* Only thing skipped is onion encryption. Queuing not skipped.
* Currently focus on top consensus weights
* Top 125K relays
* Big, enduring, most circuits, stay up
* Challenging to get small relays into an all-pairs dataset
* Future: all*
* Churn will complicate: relay stops existing midway through
* Graph gets presented!
* Simulated circuits through selection (limited to measured relays) to see what the latency would be one way or the other
* Some ridiculous round-trip delays
* High round-trip times seem to be low-BW relays, but not selected
* Circuit selection is unchanged. Via looks at latency. No via should ever accept traffic in excess of its BW
* Using scheduling to make via lower priority than direct.
* If via slows down, stop using it. Doesn't impact the circuit.
* Integration: as Tor relays adopt, it can be used.
* Get good speedups even with a small number supporting.
* Can get 1500ms speedups sometimes, with just a few hundred big relays supporting
* MATor framework and network-traffic-share-based security analysis
* All circuit selections supported, including AS diversity
* Next steps: finish analysis: only have 1M, not 50M pairs
* Finish security analysis with a representative dataset
* Dataset a touchy subject
* Tor latency: find the exit relay that was chosen
* Client will never see it, relays decide
* Don't want clients to change behavior
* Q: Watson: Sounds pretty intrusive
* A: Yeah, need two new fields so the intermediate will be able to learn how to forward; the next relay will use the previous relay instead of the connection to disambiguate the circuit ID
* Q: Antoine: does latency deanonymize?
* A: Yes, possibly: fast = short, so selecting relays on speed means close relays. We have less correlation because the circuit is the same. But reducing latency gets closer to geo-distance.
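A minimal sketch of the via-relay idea from the ShorTor discussion above: use an intermediate relay whenever the measured two-leg latency beats the direct path. The latency table and relay names are made up for illustration, and the actual design also weighs bandwidth and scheduling priority, which this sketch ignores:

```python
# Triangle routing over an all-pairs latency table: for a relay pair (a, b),
# pick a via relay c when latency(a,c) + latency(c,b) < latency(a,b).

# all-pairs round-trip latencies in milliseconds (illustrative values)
latency = {
    ("A", "B"): 180.0, ("B", "A"): 180.0,
    ("A", "C"): 40.0,  ("C", "A"): 40.0,
    ("C", "B"): 60.0,  ("B", "C"): 60.0,
}

def best_via(a, b, relays):
    """Return (via_relay, latency_ms); via_relay is None when direct is fastest."""
    best_relay, best_latency = None, latency[(a, b)]
    for c in relays:
        if c in (a, b):
            continue
        two_leg = latency.get((a, c), float("inf")) + latency.get((c, b), float("inf"))
        if two_leg < best_latency:
            best_relay, best_latency = c, two_leg
    return best_relay, best_latency

print(best_via("A", "B", {"A", "B", "C"}))  # ('C', 100.0): via relay C beats 180 ms direct
```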
## Private Relay - Tommy Pauly (20 mins)

* Given a lot of talk about IP privacy, let's look at a deployment.
* Would love feedback on how this could evolve
* Private Relay uses several pieces of IETF work
* Separates IP addresses from the origin servers accessed
* Not the full Tor threat model; this seemed to be a very common linkage used by many parties for tracking and hurting privacy
* MASQUE, Oblivious DoH
* QUIC + TLS 1.3 to connect to proxies
* Access authorized using RSA blind signatures
* Scope: iOS 15 and macOS Monterey (both beta)
* All Safari browsing
* All DNS traffic
* All unencrypted HTTP traffic
* Covers the highest-vulnerability traffic without covering all of it
* Underlying tech to protect against pixel trackers in Mail
* Privacy goals:
* No entity can connect who you are and what you are looking at
* Performance good enough for generic web browsing
* Left on, not flipped on and off
* Two hop minimum!
* Ingress and egress proxies sitting between the access network and the origin
* Ingress forwards the encrypted connection to the egress
* Operated by different entities
* Clients control which; nested encryption (see the sketch at the end of this section)
* Client gets a manifest with hops and how to combine them
* In order to track, the two would need to collude. Policy enforced contractually
* Q: Jonathan: (more of a comment) A global passive adversary can identify through both hops. Impossible to prevent
* A: Yup
* Privacy, not slow
* Aggressively use QUIC and MASQUE features to accelerate. A lot has to do with deployment, routing, and global coverage
* Lots of fast open. Proxying at the stream level
* If talking to a normal TLS/TCP origin, forward a QUIC datagram through the ingress that is the request to the egress plus the TLS ClientHello. The egress does the rest, without waiting
* Fast open, QUIC on the last mile regardless of server, IPv6 everywhere. Web on par, sometimes faster
* No v4 re-encapsulation in the middle
* Break as little as possible
* No impact on local routes
* Failover for private hostnames and addresses
* Off if a VPN or another proxy is being used
* Rough GeoIP preserved
* Hint to the egress to use particular geolocation data
* Long term, need to move away from this use of IP addresses.
* More standards on geolocation and how that's shared, and fraud prevention with IP privacy
* Future:
* Expand MASQUE
* Open interoperable network
* Ingress into carrier networks
* Egress within content providers
* Client selection of policies, route selection

### Questions:

* Victor: Authentication of origin still end to end?
* A: Yes. TCP is proxied, TLS is not, QUIC origins fully.
* Q: Stephen: how would you characterise the longer term complexity trade-offs between this approach and trying to eventually move to something simpler and more generic but harder to get deployed like "all over Tor" or an equivalent?
* A: Would like to see consensus around deployment. MASQUE proxies a good start. Tor compat interesting
* Q: Matthew: DNS for key management? Hard-coded relays?
* A: Right now public keys all come from the iCloud control plane. Short-term decision for this feature. Long term, more open and discoverable and extensible.
* Q: Andrew: Using the CFRG draft for RSA?
* A: Yes, that's why we want that draft
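A conceptual sketch of the two-hop layering described above, only to illustrate why neither proxy alone can link the client to the origin. Private Relay actually nests QUIC/TLS tunnels negotiated via MASQUE; the symmetric-key wrapping below (Python `cryptography` package, hypothetical hop names) is just a stand-in:

```python
# Toy two-hop layering: the ingress learns the client address but not the
# destination; the egress learns the destination but not the client address.
# This is NOT the Private Relay protocol, only the linkage-splitting idea.

import json
from cryptography.fernet import Fernet

ingress_key, egress_key = Fernet.generate_key(), Fernet.generate_key()

def client_wrap(origin: str, request: bytes) -> bytes:
    # Inner layer: only the egress can read the origin and request.
    inner = Fernet(egress_key).encrypt(json.dumps(
        {"origin": origin, "request": request.decode()}).encode())
    # Outer layer: the ingress only learns "forward this opaque blob to the egress".
    return Fernet(ingress_key).encrypt(json.dumps(
        {"next_hop": "egress.example", "payload": inner.decode()}).encode())

def ingress_handle(blob: bytes, client_ip: str):
    msg = json.loads(Fernet(ingress_key).decrypt(blob))
    # Ingress sees client_ip and next_hop; the payload stays opaque.
    return msg["next_hop"], msg["payload"].encode()

def egress_handle(blob: bytes):
    msg = json.loads(Fernet(egress_key).decrypt(blob))
    # Egress sees the origin and request, but never the client's IP.
    return msg["origin"], msg["request"]

blob = client_wrap("https://origin.example", b"GET / HTTP/1.1")
_, inner = ingress_handle(blob, client_ip="203.0.113.7")
print(egress_handle(inner))
```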
## FLoC - Josh Karlin (remaining time)

* Tech team lead on Privacy Sandbox
* Ran the trial, chewing on feedback, thinking about what's next
* Web partitioning on top-level site
* Lots of companies observe browsing. Would like to stop that
* First build walls: break third-party cookies. Same-origin policy
* Partition everything...
* Important use cases: SSO
* Personalized advertising, fraud prevention, logouts with federated login
* Lots of work here to provide them
* FLoC focused on interest-based advertising
* Target an array of interests, not just the context on the page
* Goals: support interest-based ads while making it hard to track individuals
* Today a script runs in browsers
* Backend ties contextual cues to the user, sends a profile back
* With FLoC, ad tech is given a cohort of similar users; predictive models find ads
* History exposed as a group; ad-tech backend needs few changes
* API rejects for reasons: sensitive cohort, incognito, history cleared
* Cohort: computed client side (see the sketch at the end of this section)
* No new data.
* Only part used is domains; no path or contents
* Thousands of users, no sensitive info, no fingerprinting service
* Encode user history by taking domains, hashing into 64 bits. Sparse vector
* Random projection onto a 50-dimensional space
* Apply grouping from a Chrome server down to 16 bits, capturing thousands of users in each group
* Pages eligible: only pages using the API, only without private IPs and not opted out
* Origin trial: concern that early adopters are not representative
* Now use sites with ads on them
* k-anonymous: 2,000 Chrome sync users per cohort
* Prevent transmission of cohorts with sensitive sites in them by revoking if correlated
* Dropped 4% of cohorts
* Origin trial: page and user opt-outs, a bunch of other things in slides
* Feedback:
* Got lots
* Especially Mozilla and privacy analysis
* Improvements:
* No auto opt-in: done
* Cohorts hard for users to understand: use topics to make clear what is revealed
* Topics would be curated, have curated lists
* Users can understand what they are indicating
* Users opt in or out
* Ad Topics Hint also related
* New fingerprinting surface
* Reduce it? Can we use 8 bits? Privacy Sandbox tackles all the tracking
* Random topics with some probability?
* Give sites different topics?
* Taken together, can drop cross-site fingerprinting issues
* Sensitivity
* Human curation and t-topic analysis
* Scope
* Right now global browsing history
* Per-third-party topic based on where the third party is
* 100% subsetting
* Disadvantages if multiple parties can work together at once
* Q: What happens when it's off?
* A: Nothing vs. random, we're thinking about it. Training is easier if we understand cohorts have meaning. 5% random right now. Cohort fairly high
* Q: Watson: User justification for participation?
* A: Sites get money, particularly tail sites. Big money from personalization
* Q followup: Need to see data; open and free doesn't mean making money from slimy people
* Q: Matthew: Recomputation of the map?
* A: Yes. Has to be done regularly as the web changes. Sensitivities change. Full intention of sharing
* Q: Wes: Do Not Track vs. this?
* A: Not sure the question follows. What we offer is Privacy Sandbox. Still offering third-party blocking and customization. Users still get to set up the barriers.
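A sketch of the client-side cohort encoding described above (hash domains, random-project into a 50-dimensional space, keep a locality-sensitive fingerprint). The hashing and projection details are assumptions rather than Chrome's actual code, and the server-supplied grouping step that merges fingerprints into 16-bit cohorts of thousands of users is omitted:

```python
# SimHash-style sketch of the FLoC cohort input: hash each visited domain,
# sum pseudo-random +/-1 projection rows, and keep only the signs.
# Illustrative only; not the Chrome implementation.

import hashlib
import random

DIMS = 50  # size of the projected space mentioned in the talk

def domain_hash(domain: str) -> int:
    """64-bit hash of a visited domain (registrable domain only, no paths)."""
    return int.from_bytes(hashlib.sha256(domain.encode()).digest()[:8], "big")

def projection_row(seed: int):
    """Deterministic pseudo-random +/-1 row for one domain's hash."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(DIMS)]

def simhash_fingerprint(history):
    """Project the sparse domain vector and keep only the signs (50 bits)."""
    totals = [0.0] * DIMS
    for domain in set(history):
        row = projection_row(domain_hash(domain))
        totals = [t + r for t, r in zip(totals, row)]
    return sum(1 << i for i, t in enumerate(totals) if t > 0)

history = ["news.example", "recipes.example", "shoes.example"]
print(f"{simhash_fingerprint(history):050b}")  # similar histories -> similar bit patterns
```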