PRIV BOF Notes - IETF 112
Attendees: Automatically collected bluesheets
Scribe: Peter Saint-Andre
Administrivia
Note well: https://www.ietf.org/about/note-well/
Agenda Bashing
None.
Goals
- This is a WG forming BoF
- Let's understand the problem and use cases
- Not getting deep into solutions at this point
- Charter and general discussion at the end
PPM / PRIV overview
- Examples include public research, product development
- We might want to know aggregate information without knowing personal information, which could be sensitive
- Historically people gather lots of personal information and then generate aggregates
- But this isn't always necessary or useful in order to generate aggregates
- And of course it's dangerous to gather that much personal data
- "Just trust us" isn't a good basis
Output measurements we might want:
- Simple aggregates (mean, median, etc.)
- Relationships between multiple values (correlation etc.)
- Common strings (aka "heavy hitters")
Example: User Interests
- What websites do people visit and for how long?
- This is interesting in aggregate
Example: Web Site Issues
- Web compatibility issues / breakage
- Fingerprinting - detect it on the client side and enable reporting of it without identifying the user
More Use Cases (to be discussed)
- Privacy-preserving ads
- COVID exposure notification
Privacy Threats
- Tying sensitive data directly to identifying information
What About Anonymized Data Collection?
- Strip out identifiers like IP addresses
- cf MASQUE, OHAI, ODoH, etc.
- There are cases where this doesn't work well:
- high dimensional data (statistical queries / correlation)
- common values / heavy hitters
Crypto to the Rescue
- Multi-party Computation
- Client generates shares
- Sends them to multiple servers
- Each server has only partial shares
- Together, the servers compute aggregates
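A minimal sketch of the share-splitting idea above, assuming additive sharing modulo a prime and a simple sum as the aggregate. The modulus, function names, and sample measurements are illustrative assumptions, not parameters taken from Prio or any draft.

```python
import secrets

# Illustrative prime modulus (2**61 - 1); an assumption, not a Prio parameter.
MODULUS = 2**61 - 1

def split_into_shares(value, num_servers):
    """Split value into additive shares that sum to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(num_servers - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]

def aggregate(per_server_shares):
    """Each server sums the shares it holds; the partial sums are then combined."""
    partial_sums = [sum(column) % MODULUS for column in per_server_shares]
    return sum(partial_sums) % MODULUS

# Hypothetical client measurements.
measurements = [3, 5, 10]
client_shares = [split_into_shares(m, 2) for m in measurements]
# Transpose so that server i holds the i-th share from every client.
server_inputs = list(zip(*client_shares))
total = aggregate(server_inputs)  # 18, the sum of the measurements
```

Each individual share is uniformly random on its own, so a single server learns nothing about a client's value unless it colludes with the other server.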
Trust Model
- Client requirement: servers don't collude
- This is hard to verify
- There could be side channels
- Point in time audits are possible
- Collector's requirements: servers execute the protocol correctly so that results aren't distorted
Example protocol: Prio [CBG17]
- See https://crypto.stanford.edu/prio/ for a link to the paper
- Prio includes ways to compute various outputs; the trick is to get the encoding right, and there are lots of papers on this
What About Bogus Data?
- Data could be plausible but false
- Data could be completely ridiculous
- Each submission comes with a zero-knowledge proof of validity (e.g., that the value lies in a certain range)
Heavy Hitters [BBCG+21]
- Clients report the N most frequent strings
- Servers jointly compute the most common values
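To make the target output concrete, here is a sketch that computes heavy hitters in the clear. This is only a reference for what the servers' joint result looks like; the protocol in [BBCG+21] obtains it without any server seeing individual strings, which this sketch does not attempt. The function name, threshold, and sample strings are assumptions.

```python
from collections import Counter

def heavy_hitters(strings, threshold):
    """Return every string that occurs at least `threshold` times, with its count."""
    counts = Counter(strings)
    return {s: n for s, n in counts.items() if n >= threshold}

# Hypothetical client reports.
reports = ["a.example", "b.example", "a.example",
           "c.example", "a.example", "b.example"]
common = heavy_hitters(reports, threshold=2)
# common == {"a.example": 3, "b.example": 2}
```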
Subset Queries
- Submissions could be tagged with demographic data
- Notionally safe
- But repeated queries on subsets can be used to determine individual values
- A WG will need to work on this
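The subset-query concern can be illustrated with a small differencing example. The names and values are hypothetical, and `subset_sum` stands in for any aggregate query restricted to a tagged subset.

```python
# Hypothetical per-user values; in a real deployment no single party sees these.
values = {"alice": 4, "bob": 7, "carol": 1}

def subset_sum(members):
    """Stand-in for an aggregate query over the submissions in a subset."""
    return sum(values[m] for m in members)

# Two aggregate queries whose subsets differ by exactly one person.
with_carol = subset_sum({"alice", "bob", "carol"})
without_carol = subset_sum({"alice", "bob"})

# The difference of the two aggregates is Carol's individual value.
recovered = with_carol - without_carol
```

Both queries return only aggregates, yet together they reveal one individual's value.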
Where We Are
- draft-gpew-priv-ppm-00 is a start on a protocol
- System architecture: client should know who the leaders and helpers are
PPM and OHAI
- These technologies complement each other
- OHAI is better for semi-sensitive data
- PPM is better for very sensitive data
- Can use OHAI to talk to a PPM server
Questions?
Wes Hardaker
- I'd like to hear more about the deployment scenario
- Would application authors make use of this in a similar way to OHAI or DoH?
- Ekr: most likely we'd build this into existing applications like browser telemetry, make it automatic
- Ekr: could envision using it for surveys and such
- Ekr: but not something the user has to manually enable
- Wes: use cases where OHAI doesn't work?
- Ekr: e.g. we don't want to know exact web sites or URLs that people visit
- Ekr: also bucketized counters with high dimensionality
Andrew Campling
- Complementary: there's no dependency on OHAI, right?
- Ekr: correct
ISRG and PPM (Tim Geoghegan, ISRG)
- ISRG is parent org of Let's Encrypt
- We're interested in improving privacy on the net
- Conventional telemetry is a privacy risk
- All that data is a liability for data collectors
- You need an external, trusted party to run at least one of the servers
- Envision running an aggregator-as-a-service based on an open-source implementation
- Aggregators would include ISRG's service as one of the servers
- Also planning to work on client-side libraries
Example: Exposure Notifications Private Analytics (ENPA)
- More servers lead to greater risk of failure
- There is compute and network overhead
- Can't make arbitrary post-hoc queries
- But we do have large-scale experience with early versions of these technologies
- 13 U.S. states and DC have deployed
- 2.1 million measurements / hour
- 12 billion measurements so far
- About to deploy internationally as well
Conversion Measurement (Martin Thomson)
- Advertisers want to measure how many people buy things after seeing their ads
- Various combinations of measurements per campaign
- Today, users are assigned identifiers; systems log when and where ads are shown, log purchases, and use the logs to answer questions about what people have done (e.g., how many users who saw the ad bought the product?)
- Goal: produce aggregate statistics about conversions without relying on user-specific logging (counts rather than individual data)
- There are lots of ideas and proposals in this space, currently gathering requirements
- Many of the most promising approaches include some kind of multi-party computation system like Prio
PPM for Ads Measurement on the Web (Charlie Harrison)
Web Privacy
- Third-party cookies are bad for user privacy
- Unfortunately, web monetization depends on them
- Can we build an alternative that gives good user privacy while still supporting advertising use cases?
- Many use cases can be done with aggregate data only
Attribution Measurement
- Now, third-party cookies are used as a key to identify the user
- With PPM, browser could join two events and generate contributions to some aggregate measure (e.g., histogram)
- This wouldn't reveal user data
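A sketch of how a browser's contribution to a histogram could be encoded and split between two servers, assuming a one-hot vector per conversion and additive sharing modulo a prime. The bucket labels, modulus, and function names are illustrative assumptions, not part of any proposal discussed here.

```python
import secrets

MODULUS = 2**61 - 1  # illustrative prime modulus (assumption)
CAMPAIGNS = ["campaign-a", "campaign-b", "campaign-c"]  # hypothetical buckets

def one_hot(campaign):
    """Encode one attributed conversion as a 1 in its campaign's bucket."""
    return [1 if c == campaign else 0 for c in CAMPAIGNS]

def share_vector(vector):
    """Split a vector into two additive share vectors."""
    first = [secrets.randbelow(MODULUS) for _ in vector]
    second = [(v - r) % MODULUS for v, r in zip(vector, first)]
    return first, second

def add_vectors(a, b):
    return [(x + y) % MODULUS for x, y in zip(a, b)]

# Hypothetical conversions, each joined to an ad view inside a browser.
conversions = ["campaign-a", "campaign-c", "campaign-a"]
server_totals = [[0] * len(CAMPAIGNS), [0] * len(CAMPAIGNS)]
for campaign in conversions:
    shares = share_vector(one_hot(campaign))
    for i in range(2):
        server_totals[i] = add_vectors(server_totals[i], shares[i])

# Only the combined totals reveal the histogram.
histogram = add_vectors(server_totals[0], server_totals[1])  # [2, 0, 1]
```

A real system would also need each browser to prove that its vector is a valid one-hot encoding, which is the role of the validity proofs mentioned earlier.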
Other Things People Are Thinking About
- Differential privacy
- Reporting very large data, sparse domains
- Training machine learning models
- Reach measurement
- Cross-site measurement
Verifiable Distributed Aggregation Functions / VDAF (Richard Barnes)
See draft-patton-cfrg-vdaf
- To support the protocol, we need underlying crypto primitives
- TLS/DH, MLS/HPKE, PPM/VDAF
- A common API for multiple instantiations, such as Prio and hits, that can be used inside the PPM protocol
What Does a VDAF Do?
- Aggregation over individual measurements
- Computation is distributed across non-colluding aggregators
- Methods for verification to prevent data corruption
API
- Communication is needed among aggregators during the preparation phase
- PPM's job is the plumbing for communication among the aggregators
- So far: Prio and hits
- Can we support things like STAR?
- That's the idea, make the API extensible
Running Code
- A few implementations so far in C, C++, and Rust
- Some deployment experience as well
Questions
Erik Taubeneck
- Could the helpers communicate with each other as in other MPC protocols?
- RLB: If we have a leader, we can have a star topology that emulates broadcast
- EKR: Assumption that we have a leader is embedded into PPM, not the API
- EKR: What we have so far is a strawman protocol
Expressions of Interest
Tommy Pauly
- From my perspective, we're already using something like this and seeing it standardized would be positive
EKR
- See the Jabber chat for some more expressions
Robin Wilton
- I'd be interested in following the work and seeing how it can be used to support values-based design for innovation with ethical characteristics
Jari Arkko
- I think this is exciting and we should work on it
- Networks could use this to collect information for debugging
- The only concern I have is the advertising use case where browsers collaborate with advertisers
- But I like this and it should go ahead
Ted Hardie
- I think the problem statement is good
- One thing that worries me is that we're talking about a set of problems; we should scope things to allow for the best answer to one problem being different from the best answer to another
- Only a partial use of this system might be needed
Phillip Hallam-Baker
- I don't think we understand the problem but we should do the work
- I do have a concern that perhaps we don't need such fancy crypto - could we just encrypt logfiles?
- It might be more useful in certain cases to work on data outside the network context (e.g., pre-processing of offline data sets)
Wes Hardaker
- In general I think this is interesting
- One oddity: this is designed to help protect users from good entities, but it doesn't protect users from bad actors
Chris Wood
- I wanted to echo something that Richard said in the chat
- Stepping back from the advertising use case, I think there are many use cases where we need aggregate statistics to answer important questions
- Those could be improved by a more privacy-preserving technology
Eric Rescorla
- To Ted's point, this is a toolbox (PPM) with a bunch of tools (VDAFs) and individual solutions will be layered on top
- I see some over-indexing on the ads use case, but there's nothing ad-specific about this work and that kind of thing will happen elsewhere (e.g., PATCG at W3C)
- As to good vs. bad entities, this does enable good entities to force themselves to do the right thing
- As to ads, the idea is that something like this will remove the need for tracking technologies that exist today
Charlie Harrison
- We need a privacy-preserving way to replace less private technologies so that we can remove the bad stuff from the web platform
- This is our strategy on the Chrome team
Wendy Seltzer
Charter Discussion
Florence D
- Nothing in the charter or initial draft about use cases
- It would be good to document these
Ekr
- I think most of the use cases were motivational
- There was a question in the chat about extensibility, I think that's definitely of interest (e.g., if we can find something more efficient than "hits" for common value calculation)
Shivan Sahib
- I think the charter would need some tweaking along those lines
- I'm happy to send a pull request
Jim Reid
- I do think the charter should say something about use cases and the problems we're trying to solve
Erik Taubeneck
- There was a comment in the chat about Sybil attacks and differential privacy
- Should we talk about what a "private value" is in this charter or in VDAF?
Charlie Harrison
- We might want to more clearly define what we mean by "aggregate"
- For instance, do machine learning models fit in?
Andrew Campling
- Agreed on use cases
- Also consider adding abuse cases / malicious practices and mitigations
Watson Ladd
- Channeling Wendy Seltzer, W3C fully supports this work
Stephen Farrell
- Effort on abuse cases and mitigations would be more helpful than use cases
Nick Doty
- Many of the privacy properties depend on non-collusion and client configuration - we should be discussing that
- Adam: concrete suggestions would be helpful
- I think "PRIV" isn't a good name
Chris Patton
- Regarding the abuse cases, we're trying to rule out the ability of a client to inject invalid inputs into the system, this is part of the VDAF API
Stephen Farrell
- Could this technology be used to measure small aggregates that could expose personal information?
- [something else that the scribe missed]
Jari Arkko
- +1 on the name
- +1 on the abuse cases
- I think there's room for discussion about opt-in vs. opt-out
- e.g., automatically exclude some set of users without being targeted as opting out
Eric Rescorla
- In general these are integration points in the client and how that happens in the UI is usually out of scope for IETF protocols
- There's definitely a need to address abuse factors; we have a bit of text in the initial draft about that, and I agree that we should have text about that
Martin Thomson
- A lot of the systems that we're describing enable users to opt out - appear to be contributing data but they're not actually doing so
- But I'm not sure we need to put that in a charter
- However, we should say something generic and perhaps, for instance, mention Sybil attacks
Wes Hardaker
- What kind of ecosystem do we want to create here? For instance just a small set of helpers or can anyone stand up a service?
Jari Arkko
- Good to hear about Martin's point
Eric Rescorla
- The collector has to trust the aggregator, but so does the client
- I expect there will be a non-gigantic number of aggregators
- This is compatible with a large number of aggregators, there are plenty of trustworthy entities that can run services
Chris Patton
- +1 to Ekr
- Something we've been working on in the protocol is lowering the bar to running an aggregator
- The helper should be inexpensive to run, the leader is another story because more processing / storage is necessary
- Definitely think this is important to keep in mind
Chair evaluation: while not unanimous, there was very clear support for formation, with very little opposition. Roughly half of those present expressed an opinion.
Poll: Do we think the problem statement is clear, well-scoped, solvable, and useful to solve?
Ted Hardie: will this include conversation in the chat?
Roman Danyliw (?): Yes
Nick Doty: I think we have an open question whether use and abuse cases will be in the charter or an actual work item, until we figure that out we can't settle whether the problem statement is clear.
Robin Wilton: Discussion in the chat raises questions about the scope and intent of the group, specifically related to values.
Eric Rescorla: Let's not get worked up about the name, we just needed something to run the BoF.
Chair evaluation: Roughly three times as many people answered "yes" as answered "no", with approximately half of those present responding.
In the chat: Who is willing to review documents?
20+ responses so far -- see Jabber Logs for those responding "review"
In the chat: Who is willing to be a document editor?
About half a dozen responses -- see Jabber Logs for those responding "edit"
Overall it appears that there is a critical mass of interest.
A tension with the abuses that could be enabled by the ad use case was repeatedly voiced, as was the complementary recognition that this technology could enable a broader set of use cases.
The charter discussion surfaced the need to revise the text in the following ways to address community feedback:
- Generalize text to make it clear that flexibility for alternatives and additions to approach is possible (i.e., more than Prio, heavy hitters)
- Refine the definition of aggregations so as to not restrict alternatives
- Add a work item(s) to document the abuse cases and associated mitigations from the perspective of all participants in the architecture
- Revise the WG name to convey a narrower scope
Thanks for the robust discussion in Jabber too.
I think we have a way ahead for this work and we'll be moving forward with it.
FINIS