Chair welcome ("PearG"): Note well / Wear masks, in person. ## Draft updates (5 mins) {#draft-updates-5-mins} * RG draft statuses * IP Address Privacy Considerations: * No recent updates since the last meeting, but updates coming soon * Censorship: * Recent update * Numeric IDs * Sent to RFC editor * Safe Internet measurements: * Review * Maybe interesting for PPM, as well ## Presentations (100 mins) {#presentations--100-mins} * Interoperable Private Attribution (Martin Thomson) - 30 mins * Attribtion: important piece of the ad industry * Trains! * Let's talk about the Tokyo subway system * Actually, let's talk about identifiers, like access cards (e.g., PASMO) * Using passenger tracking for the purpose of capacity planning, performance, etc. * Specifically, for systems that track when a person enters the system and when the person exits * But logs are a privacy risk and can be used for other purposes, even if they are inherently pseudonymous - identities could be linked. * Can we create a design that aggregates the data that's interesting, and provides individual privacy? * One design is using tokens with buckets * Tokens need to be: * anonymous * authenticated * time-delayed "opening"/redemption * ephemeral * Moving on to advertising * Attribution: information from one context and linking it in a different context * Answer a question: "How many people saw the ad, then came to the show?" * Understanding whether certain advertising is working: * good placment * creatives * how much to spend * how long to run campaigns * Current, cross-context attribution allows linking people across contexts * With advertising, the context is everything: * Whether an ad was shown, and if that ad was clicked * Was a product puchased, or not * where was the ad shown * Interoperable Private Attribtion (IPA) * People have an identifier (significant protections against revealing the identifier) * Sites can request an encrypted and secret-share of that identifier * Sites have a view of the identifier, but it's not linkable cross-site * Attribution in MPC (multi-party computation) * sites gather events * MPC decrypts identifiers and performs attribution * aggregated results are the output (histogram) * MPC does not, itself, see the original query * MPC: * Any computation if you only need addition and multiplication * It can be expensive * IPA uses a three-party, honest-majority threat model * Differential Privacy * (epsilon, delta)-DP for hiding individual contributions * Every site gets a query budget that renews each epoch (e.g., week) * This does provide leakage across time (epochs), more research needed in this area * Parameters are not fixed yet * Client's encrypted identifiers are bound to a site, they are bound to: * the site that requested them * the epoch/week they are requested * the type of event: source (ad), trigger (purchase) * IPA: advances and challenges * IPA's flexibility provides somewhat of a drop-in replacment for current anti-fraud systems * IPA's flexibility hurts accountability * Existing challenge in making the system auditable * MPC performance is a challenge, especially at the scale of 10s of billions * Status: Good progress, overall, but still requires research in some areas * Currently running some synthetic trials * Ongoing work in W3C working groups, protocol may come to PPM in the future * Brian Trammel: MPC performance is a challenge. Computation or communication complexity? 
    * MT: A lot is algorithmic (linear), and some of that will likely improve, but much of it is communication cost. Originally, records were on the order of ~40GB; it's still multiple gigabytes in size
    * Chris Wood: 1) What was the MPC functionality you needed (as defined by the existing adtech industry)? 2) Now that the functionality is defined, how do you implement it? How did you reach this design?
    * MT: Need more time. Lots of people took the steps to get here. Apple's PCM took an initial approach. This is mostly about understanding how the advertising industry uses measurement as a core part of its processes. There is a "need" vs. "want" difference of perspective between the parties, and those discussions are ongoing. If you add cross-device attribution, it gets more complicated.
    * CW: There is an academic research community that has spent a lot of time designing MPC protocols. There seems to be some overlap and collaboration opportunity here.
    * Shivan: Who would run the servers in the MPC protocol?
    * MT: We need to trust them not to collude - to be determined
    * Jonathan Hoyland: If it's run by a third party that is running an auction, what are the guarantees that they're actually running the MPC protocol?
    * MT: Currently leaning on oversight/auditing.
    * JH: Can the response include a proof?
    * MT: Recently asked whether verifiable MPC was considered - but VMPC is not ready yet. So "trust and verify" is the current approach
* Secure Partitioning Protocols (Phillipp Schoppmann) - 20 mins
    * Let's go into more detail on scaling aggregation computations
        * Billions of impressions from billions of clients
        * All clients submit their reports to the MPC cluster
        * The MPC outputs the aggregate results
    * Goals
        * When sharding the MPC cluster, every client must use the same shard
        * We need a private mechanism that maps a given client to the same shard every time
        * This should have low communication cost
        * Correctness must not be affected
    * Assumptions:
        * Bound on the number of contributions
        * Many clients, fewer shards
    * Blueprint: partitioning from distributed OPRFs
        * The client has an index (i) and a payload (v)
        * One server (server 1) holds an OPRF key
        * The other server (server 2) learns the result of the OPRF computation
        * Server 1 must add some padding queries
        * Server 2's OPRF output is used to map the client to a target partition
    * Dense partitioning: OPRF output = shard ID
        * If there is only a small set of shards, this is reasonable
    * Sparse partitioning: OPRF output = random client ID
        * Can the client's reports be aggregated before the MPC computation?
        * This doesn't create a client identifier, because server 1 pads the set of known client identifiers with dummy values, so server 2 can't distinguish real users from fake users
    * How can the sparse histogram be private without seeing the actual histogram?
        * View the output of the OPRF as a histogram
        * Make sure frequency can't be linked to specific users
        * Choose a threshold; below the threshold, add dummy values; above the threshold, \[..\] (?)
    * Conclusion: efficient for these use cases
    * Next steps: Is there general interest? Are there other protocols where this might be useful? Are there other properties that are needed?
    * Chris Patton: Definitely interesting, but maybe not as an independent draft
    * PS: So, add this into individual drafts instead of making a general-purpose protocol?
    * CP: Yes
    * Martin Thomson: The bounds seem to be fundamental. How confident are you that these are required costs?
    * PS: The numbers are not an absolute lower bound; they are based on the current design described in this presentation
    * MT: IPA may not be able to set an upper bound on the number of contributions, for example due to a Sybil attack
    * PS: Any party can create reports, but fraudulent reports may be able to be filtered out downstream
* DP3T: Deploying decentralized, privacy-preserving proximity tracing (Wouter Lueks) - 25 mins
    * DP-3T started in March 2020; first draft in May 2020; from September 2020 to summer 2021, work on presence tracing
    * Non-traditional academic environment - scaling to millions of users on a short timescale
    * Relying on existing infrastructure had a large impact
    * The systems were designed to be purpose-built and not re-usable for other purposes
    * Risks associated with digital contact tracing:
        * embeds the social contact graph
        * location tracing
        * medical information
        * social interactions
        * social control risk
    * Time has shown what can go wrong with designs/deployments like this:
        * police departments using the data in crime solving
        * data leaks
        * harassment of specific subgroups
    * It is very important that systems be designed with purpose limitations in mind, so they can't be easily abused in other ways
    * Relying on existing infrastructure: phones send beacons over Bluetooth Low Energy (BLE)
    * Proximity can be derived from the beacons a phone saw
    * Exposure notification works via a set intersection between the identifiers the person who tested positive broadcast and the beacons another person observed (see the sketch after this section)
    * The design of these beacon broadcasts required that the OS vendor be involved
    * While the design was relatively simple, relying on existing hardware made the situation more difficult/complicated
    * The result of the collaboration with Google/Apple was the Google/Apple Exposure Notification (GAEN) Framework/API
    * For full effect, you need privacy at all layers of the stack, including the Bluetooth protocol stack
        * The MAC address must rotate at the same time as the beacons
        * Similarly, at the network layer, a network adversary can detect the upload of the report of seen beacon identifiers (when reporting a positive test) - Switzerland (CH) used dummy uploads to hide this
    * Lessons learned:
        * purpose limitations
        * context matters (how/where systems are deployed)
        * privacy at all layers
    * Tommy Pauly: More a comment than a question: for privacy at all layers, Apple is routing the upload report through iCloud Private Relay
    * WL: While this is great, there might be other side channels we need to look at
    * XXX: How do you authenticate IDs?
    * WL: There isn't any binding, but the upload requires knowing the underlying seed from which the beacon was derived
    * Chris Wood: What would an ideal interface have looked like, and how would you have designed it differently?
    * WL: The strictness provided protections, but it introduced challenges as well. There isn't an easy answer.
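A minimal sketch of the seed-derived beacons and local set-intersection matching described above, not the actual DP-3T or GAEN key schedule; the `derive_beacons`/`exposure_check` helpers, the SHA-256-based derivation, and the 15-minute epoch count are illustrative assumptions:

```python
import hashlib
import secrets

EPOCHS_PER_DAY = 96  # assumed: one ephemeral beacon per 15-minute epoch

def derive_beacons(seed: bytes, day: int) -> set[bytes]:
    """Derive a day's ephemeral beacon identifiers from a secret seed.

    Illustrative PRF construction (SHA-256 over seed || day || epoch);
    the real DP-3T/GAEN schemes define their own key schedules.
    """
    return {
        hashlib.sha256(
            seed + day.to_bytes(4, "big") + epoch.to_bytes(2, "big")
        ).digest()[:16]
        for epoch in range(EPOCHS_PER_DAY)
    }

def exposure_check(observed: set[bytes],
                   uploaded_seeds: list[tuple[bytes, int]]) -> bool:
    """Check locally whether any beacon this phone observed matches a beacon
    derived from the (seed, day) pairs uploaded after a positive test."""
    return any(derive_beacons(seed, day) & observed
               for seed, day in uploaded_seeds)

# Usage: Alice broadcasts beacons derived from her daily seed; Bob's phone
# records a few of them. After Alice tests positive and uploads her seed,
# Bob's phone detects the exposure without any party linking the two directly.
alice_seed, day = secrets.token_bytes(16), 712
bob_observed = set(list(derive_beacons(alice_seed, day))[:3])
assert exposure_check(bob_observed, [(alice_seed, day)])
```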
* LogPicker: Strengthening Certificate Transparency Against Covert Adversaries (Alexandra Dirksen) - 25 mins
    * HTTPS is mostly a default now (90%+ of all page loads in Chrome are HTTPS)
    * CAs are the trust anchors of the Web PKI
    * There have been recent illicit certificate issuances, and they seem to be increasing:
        * WoSign
        * DigiCert
        * DigiNotar
        * Comodo
        * TurkTrust
    * A rogue certificate is a certificate for a domain that you don't own (e.g., for HTTPS interception)
    * In the attack scenario, a covert attacker obtains a rogue certificate
    * Certificate Transparency (CT) overview
        * CT is still vulnerable to this attack
        * All logs belong to a CA vendor
        * The first compromise was in 2020
        * vulnerable to collaboration attacks
        * vulnerable to split-view attacks
        * Gossip is proposed as a mitigation for split-view attacks
    * LogPicker: a decentralized approach
        * The CA contacts one log (the leader) from a large set of logs (the log pool)
        * The leader then contacts the other logs in the pool
        * The pool then selects one log at random (see the sketch after these notes)
        * The selected log includes the certificate in its Merkle tree
        * The logs that participated in choosing the log create a proof, which is aggregated and sent back to the CA for inclusion in the certificate
        * This design meets the goals
    * Chris Wood: Does the log pool use an election protocol?
    * AD: Yes, two protocols
    * CW: Have you looked at alternative solutions that use threshold signing?
    * AD: The aggregated signature uses BLS, but which signature scheme is used is not strictly defined
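A minimal sketch of one way a pool of logs could jointly pick a single log at random (commit-then-reveal over per-log randomness); this is not the actual LogPicker election or its BLS-based proof aggregation, and the pool member names and the `select_log` helper are hypothetical:

```python
import hashlib
import secrets

def commit(value: bytes) -> bytes:
    # Hash commitment so a log cannot change its contribution after
    # seeing the other logs' values.
    return hashlib.sha256(value).digest()

def select_log(pool: list[str]) -> str:
    # 1) Each log in the pool draws randomness and publishes a commitment.
    contributions = {log: secrets.token_bytes(32) for log in pool}
    commitments = {log: commit(r) for log, r in contributions.items()}

    # 2) Each log reveals; everyone checks the reveal against the commitment.
    for log, r in contributions.items():
        assert commit(r) == commitments[log], f"{log} equivocated"

    # 3) Combine all contributions into a shared seed and map it to an index,
    #    so no single log controls which log is selected.
    members = sorted(pool)
    seed = hashlib.sha256(b"".join(contributions[m] for m in members)).digest()
    return members[int.from_bytes(seed, "big") % len(members)]

pool = ["log-a.example", "log-b.example", "log-c.example"]
print("selected log:", select_log(pool))
```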