Event Data Privacy
Differential Privacy
By
1. Parthasarathy - 19PD25
2. Sriram Sidhartha - 19PD35
3. Shania Job - 19PD32
4. Pragathi - 19PD17
User-level Databases
Suppose a data scientist wants to release the visit count to four websites. They
have access to the following dataset that contains browser logs of 1234
employees of a company, from August to December 2020. Sample of the data:
In this database, each row represents an event described by the following fields:
● Employee Id
● Date of the event
● Time the event occurred
● The address of the domain visited.
The data scientist needs to compute a differentially private count of visits to each
of the following websites:
● mail.com
● bank.com
● social.com
● games.com
● The data scientist uses the Laplace mechanism to privatize the counts. Let's
set the total budget of the data release to ε = 0.1.
● The data scientist makes the queries and budget calculations the same way
they would with a user-level database. If the same budget is spent on each
query, each query gets ε = 0.005 (0.1 split evenly over 4 websites × 5
months = 20 queries). The data scientist feels confident that this data
release will not leak any information about individuals, only overall
browsing patterns. The following data is published:
● When visit counts are released, the data scientist notices a significant drop in
the number of visits from October to November for games.com and
social.com.
● If you knew your co-worker took a leave of absence during this time, then you
could surmise your co-worker’s browsing habits, even though the data
scientist made a release utilizing the Laplace mechanism, i.e. with noise
added.
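The naive calculation above can be sketched as follows. This is only an illustration, not the presenters' code: the helper names are hypothetical, the 20-query split (4 websites × 5 months) is an assumption that reproduces the stated ε = 0.005, and the sampler is a textbook Laplace draw rather than a hardened implementation.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def split_budget(total_epsilon, n_queries):
    """Divide the total budget evenly across all queries."""
    return total_epsilon / n_queries

def naive_private_count(true_count, epsilon, rng):
    """Laplace count calibrated for sensitivity 1, i.e. one row per user.

    On event-level data one user can own many rows, so this calibration
    is exactly what the naive approach gets wrong.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
websites = ["mail.com", "bank.com", "social.com", "games.com"]
months = ["Aug", "Sep", "Oct", "Nov", "Dec"]
eps_per_query = split_budget(0.1, len(websites) * len(months))
```

With noise of scale 1/ε = 200 per query, a shift of thousands of visits caused by a single heavy user remains clearly visible, which is why the drop during the leave of absence shows through.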
Privacy Issues When Using the Naive Approach
According to the definition of ε-differential privacy, the likelihood of observing any
given output of the mechanism is almost the same for every pair of neighboring
databases.
In this example a user contributes to multiple rows, so two databases that differ
in one user can differ in many rows. We will need to bound this distance between
the databases.
Because of this we will need to use another metric to measure the distance
between two databases and calibrate the differential privacy mechanism
accordingly.
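For reference, the standard definition can be written as below. The symbols M (mechanism), D and D' (neighboring databases) and S (set of outputs) are notation chosen here, not taken from the slides.

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S]
\quad \text{for all neighboring } D, D' \text{ and all output sets } S
```

If each user is limited to at most k rows, then adding or removing one user changes a count by at most k, and the Laplace scale b must grow accordingly:

```latex
\Delta_{\text{count}} = k,
\qquad
b = \frac{\Delta_{\text{count}}}{\varepsilon} = \frac{k}{\varepsilon}
```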
Defining “Neighboring”: Event-level Databases
Histogram
     Events     Mechanism.laplace
0    0 - 10     273449
1    10 - 20    118348
2    20 - 30    43548
3    30 - 40    15783
4    40 - 50    5960
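A histogram of this kind could be produced along the following lines. This is a sketch under assumptions: that the histogram bins users by their number of events, that bins are 10 wide, and that the input is a flat list of user ids with one entry per event. The function names are hypothetical.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_events_histogram(event_user_ids, epsilon, bin_width=10, n_bins=5, rng=None):
    """Noisy histogram of the number of events per user.

    Each user falls in exactly one bin, so adding or removing a user
    changes a single bin count by 1 (sensitivity 1).
    """
    rng = rng or random.Random()
    events_per_user = Counter(event_user_ids)
    bins = [0] * n_bins
    for n_events in events_per_user.values():
        idx = min(n_events // bin_width, n_bins - 1)  # last bin absorbs overflow
        bins[idx] += 1
    return [count + laplace_noise(1.0 / epsilon, rng) for count in bins]
```

Because each user lands in one bin regardless of how many events they have, this query does not need a bound on events per user, which makes it usable for choosing that bound.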
Reservoir Sampling
● When looking at quantiles on the non-private data, 92% of users have fewer
than 26 events and 98% of users have fewer than 40 events. This shows that
histogram analysis, even with a small ε, can give powerful insights when the
data distribution is unknown.
● We therefore take k = 40 as the per-user bound for reservoir sampling.
Reservoir Sampling Code
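A minimal sketch of per-user reservoir sampling (Algorithm R), not the presenters' original code. It assumes events arrive as (user_id, payload) tuples; the function name and the input shape are hypothetical.

```python
import random
from collections import defaultdict

def reservoir_sample_per_user(events, k=40, seed=None):
    """Keep at most k events per user, chosen uniformly at random.

    `events` is an iterable of (user_id, payload) tuples read in one pass.
    """
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # user_id -> kept events
    seen = defaultdict(int)         # user_id -> events seen so far
    for user_id, payload in events:
        seen[user_id] += 1
        bucket = reservoirs[user_id]
        if len(bucket) < k:
            bucket.append(payload)  # reservoir not full yet
        else:
            j = rng.randrange(seen[user_id])  # uniform in [0, seen)
            if j < k:
                bucket[j] = payload  # replace a kept event
    return dict(reservoirs)
```

Users with k or fewer events keep all of them; heavier users are cut down to exactly k, so every later query can be calibrated to at most k rows per user.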
Count Visit
● The utility of a count vector is measured in two ways:
○ 1) the ranking order of the domains, and
○ 2) the counts of visits for each domain.
● The first observation is that the visit counts in the differentially
private vector are relatively close to those in the non-private vector.
● The second observation concerns the ranking order of the domains. The top 2
domains are the same in both rankings, and 4 of the top 5 domains appear in
both the differentially private top 5 and the non-private top 5.
Count Visit code
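A sketch of a visit-count query on the bounded data, not the presenters' original code. It assumes the data is a mapping from user id to at most k visited domains and that the list of domains is fixed in advance; the function names are hypothetical.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_visit_counts(bounded_events, domains, k, epsilon, rng=None):
    """Noisy visit count for each domain in the fixed list `domains`.

    `bounded_events` maps user_id -> list of visited domains, at most k each.
    One user changes the count vector by at most k in L1 norm, so each
    count receives Laplace noise of scale k / epsilon.
    """
    if any(len(visits) > k for visits in bounded_events.values()):
        raise ValueError("every user must have at most k events")
    rng = rng or random.Random()
    true_counts = Counter(
        domain for visits in bounded_events.values() for domain in visits
    )
    return {d: true_counts[d] + laplace_noise(k / epsilon, rng) for d in domains}

def top_n(counts, n):
    """Domains ranked by (noisy) count, highest first."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [domain for domain, _ in ranked[:n]]
```

Comparing `top_n` on the noisy and the exact counts gives the ranking overlap used as the first utility measure.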
Results
Average Visits per user for each day of the Week
● The most reliable way to compute means is to make two separate queries:
○ one query for the numerator (total number of visits per day), and
○ one query for the denominator (total number of unique users per day).
● The average is then computed as a post-processing step on the two
differentially private results.
Average Visits per user for each day of the Week Code
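A sketch of the two-query approach, not the presenters' original code. It assumes the bounded data maps each user id to at most k weekday labels and that the budget is split evenly between the two queries; the names and the even split are assumptions.

```python
import random
from collections import Counter

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean_visits(bounded_events, k, epsilon, rng=None):
    """Noisy average visits per user for each weekday.

    `bounded_events` maps user_id -> list of weekday labels, at most k each.
    """
    rng = rng or random.Random()
    eps_each = epsilon / 2  # half the budget per query
    visits, users = Counter(), Counter()
    for days in bounded_events.values():
        visits.update(days)      # numerator: every visit counts
        users.update(set(days))  # denominator: each user once per weekday
    # A user touches at most min(k, 7) weekdays in the unique-user counts.
    user_sensitivity = min(k, len(WEEKDAYS))
    means = {}
    for day in WEEKDAYS:
        noisy_visits = visits[day] + laplace_noise(k / eps_each, rng)
        noisy_users = users[day] + laplace_noise(user_sensitivity / eps_each, rng)
        # Post-processing only: no further privacy cost.
        means[day] = noisy_visits / max(noisy_users, 1.0)
    return means
```

Clamping the denominator to at least 1 avoids division by a zero or negative noisy count; like the division itself, it operates only on already-private values.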
Results
Summary
1. identify the browser logs data as an event-level dataset
2. recognize that the number of events per user is unbounded
3. estimate, in a privacy-preserving manner, a bound k on the number of events
per user without prior knowledge of the data distribution
4. pre-process the database in order to make it ready for a differential privacy
analysis
5. make differentially private queries to the database taking into consideration
the necessary code changes to account for multiple events per user
6. evaluate the results of different queries
7. post-process the results.