Event Data Privacy

1) The document discusses applying differential privacy to event-level databases where individuals can contribute multiple rows of data. 2) A naive approach of treating each row independently can violate privacy, so the concept of "k-neighboring" databases is introduced to bound the distance between datasets. 3) The sensitivity of queries must account for the maximum number of rows an individual can contribute.

Uploaded by

pragathisai0912
Event-level Data and Differential Privacy
By
1. Parthasarathy - 19PD25
2. Sriram Sidhartha - 19PD35
3. Shania Job - 19PD32
4. Pragathi - 19PD17
User-level Databases

A user-level database is a database in which each row corresponds to a unique individual.

For such databases, neighboring databases are understood as databases that differ by a single row.
Event-level Databases
An event-level database is a database where a single individual can contribute multiple events. Usually, each row in the database corresponds to a single event.

For this type of database, neighboring databases are understood as databases that differ by a single individual.

In event-level databases, a single individual can be represented by a fixed number of rows or by a variable number of rows.
Applying DP to Event-level Databases: A Naive Approach
Example:

Suppose a data scientist wants to release the visit count to four websites. They
have access to the following dataset that contains browser logs of 1234
employees of a company, from August to December 2020. Sample of the data:
In this database, each row represents an event described by the following fields:
● Employee Id
● Date of the event
● Time the event occurred
● The address of the domain visited.

The data scientist needs to compute a differentially private count of visits to each
of the following websites:
● mail.com
● bank.com
● social.com
● games.com

in each of the following months: August, September, October, November, December.
● To proceed with the task, the data scientist naively sets the sensitivity of the
COUNT query to 1 and starts making queries to the database to get the
counts of visits to each website per month.

● The data scientist uses the Laplace mechanism to privatize the counts. Let's
set the total budget of the data release to ε = 0.1.

● The data scientist makes the queries and the budget calculations the same
way as they would for a user-level database. If the same budget is spent on
each of the 4 × 5 = 20 queries, each query receives ε = 0.1 / 20 = 0.005. The
data scientist feels confident this data release will not leak any information
about individuals, only overall browsing patterns. The following data is
published:
● When visit counts are released, the data scientist notices a significant drop in
the number of visits from October to November for games.com and
social.com.

● If you knew your co-worker took a leave of absence during this time, then you
could surmise your co-worker’s browsing habits, even though the data
scientist made a release utilizing the Laplace mechanism, i.e. with noise
added.
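The arithmetic behind the naive release can be checked with a short sketch. The helper name `laplace_noise` and the figure of 3000 visits for one employee are illustrative assumptions, not values taken from the dataset; the point is only that a noise scale calibrated to a sensitivity of 1 is far too small to hide a single heavy user.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

total_epsilon = 0.1
n_queries = 4 * 5                      # 4 websites x 5 months
epsilon_per_query = total_epsilon / n_queries

naive_sensitivity = 1                  # wrongly treats one row as one individual
noise_scale = naive_sensitivity / epsilon_per_query

# Hypothetical: one employee accounts for 3000 visits to a site in a month.
one_user_visits = 3000

rng = random.Random(0)
sample_noise = laplace_noise(noise_scale, rng)

print("epsilon per query:", epsilon_per_query)
print("Laplace scale with sensitivity 1:", noise_scale)
print("one user's visits / noise scale:", one_user_visits / noise_scale)
```

When one person's contribution is many times the noise scale, their absence shows up as a visible drop in the released count, which is what happens in the leave-of-absence scenario.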
Privacy Issues When Using the Naive Approach
According to the definition of ε-differential privacy, the likelihood of observing any
given output of the mechanism is almost the same for every neighboring
database.

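When ε = 0.1, the definition of ε-differential privacy ensures that, for any two neighboring databases x and y and any set of outputs S, Pr[M(x) ∈ S] ≤ e^0.1 · Pr[M(y) ∈ S] ≈ 1.105 · Pr[M(y) ∈ S].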
In the above example, it is easy to identify the contributions of an individual user
because the distance between neighboring datasets is unbounded when we
have event-level data.

In this example a user contributes to multiple rows, so we will need to bound the
distance between the databases.

Because of this we will need to use another metric to measure the distance
between two databases and calibrate the differential privacy mechanism
accordingly.
Defining “Neighboring”: Event-level Databases

● In general, two datasets are neighboring if their distance is 1.


● However, this does not carry over to event-level databases.
● A generalized notion of adjacency is required.
● For two event-level datasets X and Y, a distance metric M, where d_M(X, Y)
is the distance between X and Y relative to M, and an integer k, we
say that X and Y are "k-neighboring" provided that d_M(X, Y) ≤ k.
Example: consider a sample dataset of student marks.
Suppose a student drops out of the class.
The two datasets then differ by 2, so they are 2-neighbors.
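A minimal sketch of the k-neighboring check, assuming the distance metric is the number of rows that must be added or removed to turn one dataset into the other (the symmetric difference of the two multisets of rows). The slides do not name a specific metric, and the marks table below is made up for illustration.

```python
from collections import Counter

def dataset_distance(x, y):
    """Number of rows to add or remove to turn x into y (multiset symmetric difference)."""
    cx, cy = Counter(x), Counter(y)
    return sum(((cx - cy) + (cy - cx)).values())

def are_k_neighboring(x, y, k):
    """True when the two datasets are at distance at most k."""
    return dataset_distance(x, y) <= k

# Hypothetical marks table: (student, subject, mark)
with_student = [
    ("A", "Maths", 81), ("A", "Physics", 74),
    ("B", "Maths", 66), ("B", "Physics", 90),
]
# Student B drops out, so both of B's rows disappear.
without_student = [row for row in with_student if row[0] != "B"]

distance = dataset_distance(with_student, without_student)
print("distance:", distance)
print("2-neighboring:", are_k_neighboring(with_student, without_student, 2))
```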
Defining “Sensitivity”: Event-level Databases

● Multiple rows may correspond to the actions of a single user.


● We need to understand exactly how many rows correspond to each individual.
● The dataset distance is the maximum number of rows that a single
user can contribute.
● For two event-level datasets X and Y that are k-neighboring, we say
that our function f is d_out-sensitive with respect to some output distance
metric M_O, provided that d_{M_O}(f(X), f(Y)) ≤ d_out.
● In English, this just says that the greatest amount by which the function may
change is d_out.
Example
● Given two neighboring databases, x and y, the sensitivity of a COUNT query
is given by max_{x,y} ||COUNT(x) − COUNT(y)||_1. Given that the maximum number
of website visits an individual may contribute is 4000, the sensitivity of
COUNT queries, when applied to the browser logs database, is 4000.
● We can then use the Laplace mechanism, with the correct parameters, for the data
release.
● M_Lap(x) = Lap(shift = x, scale = Δ/ε) = Lap(shift = x, scale = 4000/ε)
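The corrected calibration can be sketched as follows. The function name `laplace_mechanism`, the per-query ε of 0.005 carried over from the naive example, and the true count of 120000 are assumptions made for illustration.

```python
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return true_value plus Laplace noise with scale = sensitivity / epsilon."""
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_value + noise

max_visits_per_user = 4000   # sensitivity of COUNT on the browser logs
epsilon = 0.005              # assumed per-query budget
scale = max_visits_per_user / epsilon

rng = random.Random(0)
true_count = 120000          # hypothetical monthly visit count
noisy_count = laplace_mechanism(true_count, max_visits_per_user, epsilon, rng)

print("Laplace scale:", scale)
print("noisy count:", noisy_count)
```

With a scale this large the released counts are private but carry little signal, which motivates bounding the number of events each user may contribute.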
Result
Making Queries to a Database of Browser Visits to Top 500 Domains
Dataset
Queries
● Which are the top 5 most-visited domains?
● How many visits are there to the top 5 domains?
● How many visits are there for each day of the week?
● Suppose the data scientist wants to find the k that will give the best utility for
the counts of visits per user. Doing so in a non-privacy-preserving manner
would consist of looking at the distribution of events per user and choosing a k
that would include 90% or 95% of events.
● However, this process is not differentially private. In this case, the data
scientist reveals the number of events at the 90th or 95th percentile.
Generate Histogram
● One way to make the above analysis differentially private is to use part of the
privacy budget to generate a differentially private histogram of the number of
events.
● The data scientist can use a small ε and a very large k. Ideally, the k chosen for
this analysis should be a value that is much larger than what would be
expected as the 95th percentile of events per user.
● For the browsing logs dataset, let’s choose k = 50 visits.
● To make this preliminary analysis, the data scientist will bound each individual
in the dataset to 50 visits.
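The bounding step and the preliminary histogram can be sketched as below. This is not the code from the slides: the function names, the toy events, and ε = 0.01 are assumptions. The sketch also assumes the histogram counts users per events-per-user bin, so that adding or removing one user changes a single bin by 1 and the Laplace noise is calibrated to a sensitivity of 1.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def bound_contributions(events, k):
    """Keep at most k events per user (the first k in input order)."""
    kept, seen = [], Counter()
    for user_id, domain in events:
        if seen[user_id] < k:
            seen[user_id] += 1
            kept.append((user_id, domain))
    return kept

def events_per_user_histogram(events, k, bin_width):
    """Count users per bin of (bounded) events-per-user; the last bin absorbs count == k."""
    per_user = Counter(user_id for user_id, _ in bound_contributions(events, k))
    n_bins = -(-k // bin_width)  # ceiling division
    bins = [0] * n_bins
    for count in per_user.values():
        bins[min(count // bin_width, n_bins - 1)] += 1
    return bins

def dp_histogram(bins, epsilon, rng):
    """Add Laplace noise to every bin, assuming each user affects one bin by 1."""
    scale = 1.0 / epsilon
    return [b + laplace_noise(scale, rng) for b in bins]

# Toy events: u1 has 60 visits (will be cut to 50), u2 has 5.
events = [("u1", "mail.com")] * 60 + [("u2", "bank.com")] * 5
bins = events_per_user_histogram(events, k=50, bin_width=10)
noisy_bins = dp_histogram(bins, epsilon=0.01, rng=random.Random(0))
print(bins)
print(noisy_bins)
```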
Code
Results
Running query with Laplace mechanism for count:

Mechanism.laplace

    Events        e
0   0 - 10   273449
1  10 - 20   118348
2  20 - 30    43548
3  30 - 40    15783
4  40 - 50     5960
Histogram
Reservoir Sampling
● When looking at quantiles on the non-private data, 92% of users have fewer
than 26 events and 98% of users have fewer than 40 events. This shows that
histogram analysis, even with a small ε, can give powerful insights when the
data distribution is unknown.
● Take k = 40 now for reservoir sampling
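One way to bound each user to k = 40 events is per-user reservoir sampling, sketched below with the classic Algorithm R applied separately to each user's stream. The function name and the toy data are assumptions, and this is not the code from the slides.

```python
import random
from collections import defaultdict

def reservoir_sample_per_user(events, k, rng):
    """Keep a uniform random sample of at most k events for every user."""
    reservoirs = defaultdict(list)
    seen = defaultdict(int)
    for user_id, event in events:
        seen[user_id] += 1
        reservoir = reservoirs[user_id]
        if len(reservoir) < k:
            # The first k events of a user always enter the reservoir.
            reservoir.append(event)
        else:
            # The i-th event replaces a random slot with probability k / i.
            j = rng.randrange(seen[user_id])
            if j < k:
                reservoir[j] = event
    return dict(reservoirs)

# Toy data: u1 has 100 distinct events, u2 has 3.
events = [("u1", f"visit-{i}") for i in range(100)]
events += [("u2", f"visit-{i}") for i in range(3)]

sampled = reservoir_sample_per_user(events, k=40, rng=random.Random(0))
print({user: len(rows) for user, rows in sampled.items()})
```

Unlike keeping the first k rows, the reservoir gives every event of a user the same chance of being retained, so the bounded data is not biased toward early events.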
Reservoir Sampling Code
Count Visit
● Consider that the utility of a count vector is measured as follows:
○ 1) order of the domains, and
○ 2) counts of visits for each domain.
● From this perspective, the visit counts in the differentially
private vector are relatively close to those in the non-private vector.
● The second observation concerns the ranking order of the domains. The top 2
domains are the same in both rankings, and when analyzing the top 5 domains,
there are 4 domains in the intersection of the dp-top5 and non-private top5.
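The comparison between the private and non-private rankings can be sketched as follows. It assumes the list of candidate domains is public, that each user has already been bounded to k events (so the L1 sensitivity of the count vector is k), and it uses made-up counts and ε = 1.0; none of these values come from the slides.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_domain_counts(events, domains, k, epsilon, rng):
    """Noisy visit count for every public domain, with noise scale k / epsilon."""
    true_counts = Counter(domain for _, domain in events)
    scale = k / epsilon
    return {d: true_counts.get(d, 0) + laplace_noise(scale, rng) for d in domains}

def top_n(counts, n):
    """Domains with the n largest counts, in descending order."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [domain for domain, _ in ranked[:n]]

# Hypothetical bounded events: (user_id, domain)
domains = ["mail.com", "bank.com", "social.com", "games.com", "news.com", "shop.com"]
events = [(f"u{i}", domains[i % 3]) for i in range(3000)]
events += [(f"u{i}", domains[3 + i % 3]) for i in range(900)]

true_counts = dict(Counter(domain for _, domain in events))
noisy_counts = dp_domain_counts(events, domains, k=40, epsilon=1.0, rng=random.Random(0))

overlap = set(top_n(true_counts, 5)) & set(top_n(noisy_counts, 5))
print("domains shared by both top-5 lists:", len(overlap))
```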
Count Visit code
Results
Average Visits per user for each day of the Week
● The most reliable way to compute means is by making two separate queries:
● One query for the numerator (total number of visits per day) and
● one query for the denominator (total number of unique users per day).
● The calculation of the average becomes a post-processing step of two
differentially private functions.
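The two-query approach to the mean can be sketched as below for a single day. The helper names, the true totals, and the ε split are assumptions. The numerator uses a sensitivity of k = 40 because a bounded user contributes at most 40 visits; the denominator uses a sensitivity of 1 because a user is counted at most once on a given day.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_release(true_value, sensitivity, epsilon, rng):
    """Laplace mechanism: noise scale is sensitivity / epsilon."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)

def post_process_average(noisy_total, noisy_users):
    """Ratio of two private releases; the denominator is clamped to at least 1."""
    return noisy_total / max(noisy_users, 1.0)

rng = random.Random(0)
k = 40                    # per-user bound on events
visits_monday = 5200      # hypothetical true number of visits on Monday
users_monday = 310        # hypothetical true number of unique users on Monday

noisy_visits = dp_release(visits_monday, sensitivity=k, epsilon=0.05, rng=rng)
noisy_users = dp_release(users_monday, sensitivity=1, epsilon=0.05, rng=rng)
average = post_process_average(noisy_visits, noisy_users)
print("DP average visits per user on Monday:", average)
```

Because the division only touches already-private values, it consumes no additional privacy budget. Repeating the two queries for all seven days spends budget on each day's pair of queries.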
Average Visits per user for each day of the Week Code
Results
Summary
1. identify the browser logs data as an event-level dataset
2. recognize that the number of events per user is unbounded
3. estimate, in a privacy-preserving manner, a bound k of events per user
without previous knowledge on the data distribution
4. pre-process the database in order to make it ready for a differential privacy
analysis
5. make differentially private queries to the database taking into consideration
the necessary code changes to account for multiple events per user
6. evaluate the results of different queries
7. post-process the results.
