
Aim

The document presents AIM, an adaptive and iterative mechanism for generating differentially private synthetic data that preserves statistical properties of sensitive datasets. It outlines the methodology, including the select-measure-generate paradigm and the importance of judiciously selecting marginal queries while considering privacy budgets and workload requirements. The document also discusses theoretical analysis, experimental results, and open problems for future research.


AIM: An Adaptive and Iterative Mechanism for

Differentially Private Synthetic Data

Presented by: Xizixiang Wei

University of Virginia
[email protected]

November 2, 2023

Presented by: Xizixiang Wei (UVA) AIM for Synthetic Data November 2, 2023 1 / 27
Agenda

1 Motivation and Method Overview

2 Concepts and Tools

3 Technical Details on AIM

4 Theoretical Analysis

5 Experiments

6 Open Problems

Motivation

Private synthetic data


Given sensitive data about individuals, construct a synthetic dataset
that preserves important statistical properties of the original dataset,
while offering formal privacy guarantees to the individuals in the dataset.

Pros
Has the same form as the original dataset, making it easy to work with.

Can be used in place of the original dataset for any downstream task.

Problem formulation

Given:
A sensitive dataset D
Privacy parameters (ϵ, δ)
A workload W

Problem: Design an (ϵ, δ)-differentially private mechanism M that
generates a synthetic dataset D̂ = M(D) such that W(D̂) ≈ W(D).

Workload: In this work (and a series of works), we focus on the special
(but common) case where the workload consists of a collection of
weighted marginal queries.

The select-measure-generate paradigm

Select a set of marginal queries to measure.

Measure marginals privately using a noise addition mechanism.

Generate synthetic data that best explains the noisy marginals.

Iterative select-measure-generate paradigm

Initialize estimate of data distribution

Repeat
  Select marginal query poorly approximated by current estimate
  Measure selected marginal using noise-addition mechanism
  Update estimate of data distribution from measured info
Generate synthetic data from the estimated data distribution
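
The loop above can be sketched on a tiny domain. This is only an illustration under stated assumptions: the update step uses a multiplicative-weights rule (in the style of MWEM) as a stand-in for a real estimation algorithm, the number of rounds and the per-round budget are fixed rather than adaptive, and the names `marginal_of`, `iterative_smg`, and `generate` are hypothetical.

```python
import numpy as np

def marginal_of(p, r):
    """Marginal of the joint array p onto the axes in r (r sorted ascending)."""
    other = tuple(i for i in range(p.ndim) if i not in r)
    return p.sum(axis=other)

def iterative_smg(data_hist, candidates, rounds, eps_select, sigma, rng):
    """Toy select-measure-update loop over a full-domain histogram."""
    N = data_hist.sum()
    # Initialize: uniform estimate, scaled so its entries sum to N.
    p = np.full(data_hist.shape, N / data_hist.size)
    for _ in range(rounds):
        # Select: exponential mechanism on the L1 error of each candidate.
        # Adding or removing one record changes each score by at most 1.
        errors = np.array([
            np.abs(marginal_of(data_hist, r) - marginal_of(p, r)).sum()
            for r in candidates
        ])
        logits = eps_select / 2.0 * errors
        probs = np.exp(logits - logits.max())
        r = candidates[rng.choice(len(candidates), p=probs / probs.sum())]
        # Measure: Gaussian noise on the selected marginal.
        true_marg = marginal_of(data_hist, r)
        y = true_marg + rng.normal(0.0, sigma, size=true_marg.shape)
        # Update: multiplicative-weights step (assumed stand-in estimator).
        diff = (y - marginal_of(p, r)) / (2.0 * N)
        expand = tuple(slice(None) if i in r else None for i in range(p.ndim))
        p = p * np.exp(diff[expand])
        p *= N / p.sum()
    return p

def generate(p, num_records, rng):
    """Generate: sample records from the estimated distribution."""
    flat = rng.choice(p.size, size=num_records, p=p.ravel() / p.sum())
    return np.stack(np.unravel_index(flat, p.shape), axis=1)

rng = np.random.default_rng(0)
records = rng.integers(0, 2, size=(200, 3))
records[:, 1] = records[:, 0]          # attributes 0 and 1 perfectly correlated
hist = np.zeros((2, 2, 2))
np.add.at(hist, tuple(records.T), 1.0)

p_hat = iterative_smg(hist, [(0, 1), (0, 2), (1, 2)],
                      rounds=5, eps_select=0.5, sigma=2.0, rng=rng)
synthetic = generate(p_hat, 200, rng)
```

Each round of this sketch spends a fixed amount of budget on one selection and one measurement, so the total cost grows linearly with the number of rounds.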

Iterative select-measure-generate paradigm

Initialize estimate of data distribution
  How to initialize?

Repeat
  How many rounds to run?
  How much budget to spend per round?
  Select marginal query poorly approximated by current estimate
    What set of candidates to select from?
    What quality score function to use?
    What selection mechanism to use?
  Measure selected marginal using noise-addition mechanism
    What noise addition mechanism to use?
    What privacy accounting method to use?
  Update estimate of data distribution from measured info
    What estimation algorithm to use?
    What measured information to incorporate?
Generate synthetic data from the estimated data distribution

Main considerations

Marginal queries must be selected judiciously:

Budget-aware: should intelligently adapt to the available privacy budget.
Workload-aware: should help answer the workload.
  Marginal selection that is independent of the workload is necessarily
  sub-optimal for a specific workload.
Data-aware: should exploit knowledge of the domain and the data distribution.
  Select marginal queries from a set of candidates based on the data.
Efficiency-aware: should enable tractable post-processing.
  Mechanisms that build on top of Private-PGM must ensure JT-SIZE
  remains sufficiently small for computational tractability.

Comparison with existing methods

Budget-aware: should intelligently adapt to the available privacy budget
Workload-aware: should help answer the workload
Data-aware: should exploit knowledge of domain and data distribution
Efficiency-aware: should enable tractable post-processing

AIM: method overview

Initialize estimate of data distribution
  [New] Initialization method

Repeat
  [New] Adaptive rounds + budget split (hyper-parameter free)
  Select marginal query poorly approximated by current estimate
    [New] Workload- and efficiency-aware candidate set
    [New] Budget- and data-aware quality score function
  Measure selected marginal using noise-addition mechanism
    [Prior work] Gaussian noise, zCDP accounting
  Update estimate of data distribution from measured info
    [Prior work] Private-PGM, ICML 2019
Generate synthetic data from the estimated data distribution
  [Prior work] Private-PGM

Data

A dataset $D$ is a multiset of $N$ records.
Each record $x \in D$ is a $d$-tuple $(x_1, \dots, x_d)$.
The domain of possible values for $x_i$ is denoted by $\Omega_i$, with size $|\Omega_i| = n_i$.
The full domain of $x$ is $\Omega = \Omega_1 \times \cdots \times \Omega_d$, with size $n \triangleq |\Omega| = \prod_i n_i$.
The set of all possible datasets is denoted by $\mathcal{D}$.
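
As a small illustration of this notation, the following sketch builds a toy schema and computes the per-attribute sizes and the full domain size. The attribute names and values are hypothetical and chosen only for the example.

```python
import math
from collections import Counter

# Hypothetical toy schema: attribute name -> list of possible values (Omega_i).
domain = {
    "age_group": ["young", "middle", "senior"],
    "sex": ["F", "M"],
    "income": ["low", "high"],
}

# n_i = |Omega_i| for each attribute, and n = |Omega| = product of the n_i.
sizes = {attr: len(values) for attr, values in domain.items()}
n = math.prod(sizes.values())

# A dataset is a multiset of records, so repeated records are allowed.
dataset = [
    ("young", "F", "low"),
    ("young", "F", "low"),
    ("senior", "M", "high"),
]
N = len(dataset)
multiset = Counter(dataset)
```

The full domain grows multiplicatively with the number of attributes, which is why the methods in these slides work with low-dimensional marginals rather than the full histogram.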

Marginals

Definition (Marginal)
Let $r \subseteq [d]$ be a subset of attributes, $\Omega_r = \prod_{i \in r} \Omega_i$, $n_r = |\Omega_r|$, and
$x_r = (x_i)_{i \in r}$. The marginal on $r$ is a vector $\mu \in \mathbb{R}^{n_r}$, indexed by domain
elements $t \in \Omega_r$, such that each entry is a count, i.e.,

$$\mu[t] = \sum_{x \in D} \mathbb{1}[x_r = t].$$

We let $M_r : \mathcal{D} \to \mathbb{R}^{n_r}$ denote the function that
computes the marginal on $r$, i.e., $\mu = M_r(D)$.

It is easy to verify that the $\ell_2$ sensitivity of any marginal query $M_r$ is 1.
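
A minimal sketch of the definition, assuming records are integer-coded (attribute $i$ takes values in $0, \dots, n_i - 1$) and that neighbouring datasets differ by adding or removing one record. The helper name `marginal` is hypothetical.

```python
import numpy as np

def marginal(data, shape, r):
    """Count vector M_r(D): one entry per t in Omega_r, flattened.

    data  : integer array of shape (N, d)
    shape : tuple (n_1, ..., n_d) of attribute domain sizes
    r     : non-empty tuple of attribute indices
    """
    sub_shape = tuple(shape[i] for i in r)
    counts = np.zeros(sub_shape, dtype=float)
    if len(data) > 0:
        # Add 1 to the cell x_r of every record x in the dataset.
        np.add.at(counts, tuple(data[:, i] for i in r), 1.0)
    return counts.ravel()

shape = (3, 2, 2)
D = np.array([[0, 0, 0], [0, 0, 0], [2, 1, 1], [1, 0, 1]])
mu = marginal(D, shape, (0, 1))        # 3 * 2 = 6 entries

# Neighbouring dataset: D with its last record removed.
D_prime = D[:-1]
l2_gap = np.linalg.norm(mu - marginal(D_prime, shape, (0, 1)))
```

Removing one record changes exactly one count by 1, so the $\ell_2$ distance between the two marginals is 1, matching the sensitivity claim.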

Workload

This work focuses on the special (but common) case where the
workload consists of a collection of weighted marginal queries.
Utility measure: workload error

Definition (Workload error)
A workload $W$ consists of a list of marginal queries $r_1, \dots, r_k$ where
$r_i \subseteq [d]$, together with associated weights $c_i \ge 0$. The error of a synthetic
dataset $\hat{D}$ is defined as:

$$\mathrm{Error}(D, \hat{D}) = \frac{1}{k \cdot |D|} \sum_{i=1}^{k} c_i \left\| M_{r_i}(D) - M_{r_i}(\hat{D}) \right\|_1$$
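
The workload error can be computed directly from the definition. The sketch below assumes integer-coded records and represents the workload as a list of `(r, c)` pairs; the helper names are hypothetical.

```python
import numpy as np

def marginal(data, shape, r):
    """Flattened count vector M_r(D) for a non-empty attribute tuple r."""
    counts = np.zeros(tuple(shape[i] for i in r), dtype=float)
    if len(data) > 0:
        np.add.at(counts, tuple(data[:, i] for i in r), 1.0)
    return counts.ravel()

def workload_error(data, synth, shape, workload):
    """Error(D, D_hat) = 1 / (k |D|) * sum_i c_i * ||M_ri(D) - M_ri(D_hat)||_1."""
    k = len(workload)
    total = sum(
        c * np.abs(marginal(data, shape, r) - marginal(synth, shape, r)).sum()
        for r, c in workload
    )
    return total / (k * len(data))

shape = (3, 2, 2)
D = np.array([[0, 0, 0], [0, 0, 0], [2, 1, 1], [1, 0, 1]])
D_hat = np.array([[0, 0, 0], [0, 0, 0], [2, 1, 1], [1, 1, 1]])  # last record differs
workload = [((0, 1), 1.0), ((2,), 2.0)]

err = workload_error(D, D_hat, shape, workload)
```

Here only the (0, 1) marginal differs, by one count moved between two cells, so the weighted sum is 2 and the error is 2 / (2 * 4) = 0.25.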

Differential Privacy

Definitions of (ϵ, δ)-DP, sensitivity, and the Gaussian mechanism...

Definition (Exponential Mechanism)
Let $q_r : \mathcal{D} \to \mathbb{R}$ be a quality score function defined for all $r \in \mathcal{R}$ and let
$\epsilon \ge 0$ be a real number. Then the exponential mechanism outputs a
candidate $r \in \mathcal{R}$ according to the following distribution:

$$\Pr[\mathcal{M}(D) = r] \propto \exp\left( \frac{\epsilon}{2\Delta} \cdot q_r(D) \right),$$

where $\Delta = \max_{r \in \mathcal{R}} \Delta(q_r)$ is the sensitivity.
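
A minimal sketch of sampling from this distribution, assuming the quality scores have already been computed on the sensitive data. The function name and the example scores are hypothetical.

```python
import numpy as np

def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Sample an index with probability proportional to exp(eps * q / (2 * Delta))."""
    scores = np.asarray(scores, dtype=float)
    logits = epsilon / (2.0 * sensitivity) * scores
    logits -= logits.max()     # numerical stability; leaves the distribution unchanged
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(scores), p=probs)), probs

rng = np.random.default_rng(0)
idx, probs = exponential_mechanism([10.0, 12.0, 3.0],
                                   epsilon=1.0, sensitivity=1.0, rng=rng)
```

Subtracting the maximum logit before exponentiating avoids overflow for large scores without changing the sampling probabilities.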

zero-Concentrated Differential Privacy (zCDP)

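Definition (zCDP)
A randomized mechanism $\mathcal{M}$ is $\rho$-zCDP if for any two neighboring
datasets $D$ and $D'$, and all $\alpha \in (1, \infty)$, we have:

$$D_\alpha(\mathcal{M}(D) \,\|\, \mathcal{M}(D')) \le \rho \alpha,$$

where $D_\alpha$ is the Rényi divergence of order $\alpha$.

The Gaussian mechanism with noise scale $\sigma$, applied to a query with $\ell_2$ sensitivity 1, satisfies $\frac{1}{2\sigma^2}$-zCDP;
The exponential mechanism satisfies $\frac{\epsilon^2}{8}$-zCDP;
The composition of a $\rho_1$-zCDP mechanism and a $\rho_2$-zCDP mechanism satisfies $(\rho_1 + \rho_2)$-zCDP;
If a mechanism $\mathcal{M}$ satisfies $\rho$-zCDP, it also satisfies $(\epsilon, \delta)$-DP for all $\epsilon \ge 0$ and

$$\delta = \min_{\alpha > 1} \frac{\exp\big((\alpha - 1)(\alpha\rho - \epsilon)\big)}{\alpha - 1} \left(1 - \frac{1}{\alpha}\right)^{\alpha}.$$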
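
These accounting rules can be sketched in a few lines. The conversion below approximates the minimum over $\alpha$ by a grid search, which is an assumption of this sketch; since the bound holds for every $\alpha > 1$, the grid result is still a valid, slightly conservative $\delta$. All function names are hypothetical.

```python
import math

def gaussian_rho(sigma, sensitivity=1.0):
    """zCDP cost of the Gaussian mechanism: Delta^2 / (2 sigma^2)."""
    return sensitivity ** 2 / (2.0 * sigma ** 2)

def exponential_rho(epsilon):
    """zCDP cost of the exponential mechanism: eps^2 / 8."""
    return epsilon ** 2 / 8.0

def zcdp_to_delta(rho, epsilon, grid=10_000, alpha_max=200.0):
    """Grid-search approximation of
    delta = min_{a>1} exp((a-1)(a*rho - eps)) / (a-1) * (1 - 1/a)^a."""
    best_log = 0.0   # delta never needs to exceed 1, i.e. log(delta) <= 0
    for j in range(1, grid + 1):
        alpha = 1.0 + (alpha_max - 1.0) * j / grid
        log_delta = ((alpha - 1.0) * (alpha * rho - epsilon)
                     - math.log(alpha - 1.0)
                     + alpha * math.log1p(-1.0 / alpha))
        best_log = min(best_log, log_delta)   # compare in log space to avoid overflow
    return math.exp(best_log)

# Composition: one Gaussian measurement plus one exponential-mechanism selection.
rho_total = gaussian_rho(sigma=2.0) + exponential_rho(epsilon=1.0)
delta = zcdp_to_delta(rho_total, epsilon=3.0)
```

In this example the Gaussian measurement and the selection each cost 0.125, so the composed mechanism is 0.25-zCDP.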

Private-PGM

The heart of Private-PGM is an optimization problem to find a
distribution $\hat{p}$ that "best explains" the noisy observations $\tilde{\mu}_i$:

$$\hat{p} := \arg\min_{p \in S} \sum_{i=1}^{k} \frac{1}{\sigma_i} \left\| M_{r_i}(p) - \tilde{\mu}_i \right\|_2^2,$$

where $S = \{ p \mid p(x) \ge 0 \text{ and } \sum_{x \in \Omega} p(x) = n \}$ is the set of (scaled) probability
distributions over the domain $\Omega$.
Junction tree size: Private-PGM exposes a callable function
JT-SIZE$(r_1, \dots, r_k)$ that can be invoked to check how large a junction
tree is.
The runtime of distribution estimation is roughly proportional to
JT-SIZE.
If arbitrary marginals are measured, JT-SIZE can grow out of control,
no longer fitting in memory and leading to unacceptable runtime.

Technical Details on AIM

Initialization: line 7
Iteration: line 10
Select: line 14
Measure: line 15
Generate: line 19

Intelligent initialization

Spend a small fraction of the privacy budget to measure 1-way marginals;
Estimate p̂ as an independent model in which all 1-way marginals are
preserved well;
This provides a far better initialization than the default uniform
distribution.
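
The initialization idea can be sketched as follows. This is an illustration only: it replaces the estimation step with a simple clip-and-normalise rule instead of Private-PGM, assumes integer-coded records, and uses hypothetical helper names and a hypothetical noise scale.

```python
import numpy as np

def measure_one_way(data, shape, sigma, rng):
    """Noisy 1-way marginals: per-attribute counts plus N(0, sigma^2) noise."""
    noisy = []
    for i, n_i in enumerate(shape):
        counts = np.bincount(data[:, i], minlength=n_i).astype(float)
        noisy.append(counts + rng.normal(0.0, sigma, size=n_i))
    return noisy

def independent_model(noisy_marginals):
    """Clip and normalise each noisy marginal, then form the product distribution."""
    probs = []
    for y in noisy_marginals:
        p = np.clip(y, 0.0, None)
        total = p.sum()
        p = p / total if total > 0 else np.full(len(y), 1.0 / len(y))
        probs.append(p)
    joint = probs[0]
    for p in probs[1:]:
        joint = np.multiply.outer(joint, p)   # appends one axis per attribute
    return probs, joint

rng = np.random.default_rng(0)
shape = (3, 2, 2)
D = np.array([[0, 0, 0], [0, 0, 0], [2, 1, 1], [1, 0, 1]])
sigma = 1.0

noisy = measure_one_way(D, shape, sigma, rng)
probs, joint = independent_model(noisy)

# Each of the d Gaussian measurements costs 1 / (2 sigma^2) under zCDP.
rho_spent = len(shape) / (2.0 * sigma ** 2)
```

The resulting product distribution matches the noisy 1-way marginals by construction but carries no information about correlations between attributes, which the later rounds are meant to add.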

New Candidates

Which candidates can be selected? Marginal queries in the downward
closure of the workload $W$.
The downward closure: $W_+ = \{ r \mid r \subseteq s,\ s \in W \}$;
Lower-dimensional marginals are given priority in selection.
The set will only consist of candidates with JT-SIZE below a
prespecified limit.
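
The downward closure is straightforward to enumerate. The sketch below assumes the empty set is excluded from the candidates, which the definition above does not say explicitly, and it does not apply the JT-SIZE filter; the function name is hypothetical.

```python
from itertools import combinations

def downward_closure(workload):
    """All non-empty subsets r of some s in the workload, smallest first."""
    closure = set()
    for s in workload:
        s = tuple(sorted(s))
        for size in range(1, len(s) + 1):
            closure.update(combinations(s, size))
    # Sort by size, then lexicographically, for a stable candidate order.
    return sorted(closure, key=lambda r: (len(r), r))

workload = [("a", "b", "c"), ("c", "d")]
candidates = downward_closure(workload)
```

For this workload the closure contains the four 1-way marginals, the four 2-way marginals that are subsets of a workload query, and the 3-way marginal itself. A pair such as ("a", "d") is excluded because no workload query contains both attributes.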

Better Selection Criteria

New quality score function in line 14

$\|M_r(D) - M_r(\hat{p}_{t-1})\|_1 - \sqrt{2/\pi}\,\sigma_t n_r$: the $\ell_1$ error under the current model
minus the expected $\ell_1$ error if it is measured at the current noise level.
Weight $w_r = \sum_{s \in W} c_s |r \cap s|$: captures the degree to which the marginal
queries in the workload overlap with $r$.
In general, put more weight on marginals with more attributes.
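
A minimal sketch of this score, assuming the weight multiplies the inner expression and that the true and model marginals are given as flat count vectors. The function names and the example numbers are hypothetical.

```python
import math
import numpy as np

def weight(r, workload):
    """w_r = sum over workload entries (s, c_s) of c_s * |r ∩ s|."""
    return sum(c * len(set(r) & set(s)) for s, c in workload)

def quality_score(true_marginal, model_marginal, sigma, r, workload):
    """w_r * (L1 error under the model - expected L1 error of the noise)."""
    true_marginal = np.asarray(true_marginal, dtype=float)
    model_marginal = np.asarray(model_marginal, dtype=float)
    n_r = len(true_marginal)
    l1_error = np.abs(true_marginal - model_marginal).sum()
    # E|N(0, sigma^2)| = sqrt(2/pi) * sigma, summed over the n_r cells.
    expected_noise = math.sqrt(2.0 / math.pi) * sigma * n_r
    return weight(r, workload) * (l1_error - expected_noise)

workload = [((0, 1), 1.0), ((1, 2), 1.0)]

def scores(sigma):
    one_way = quality_score([30, 70], [50, 50], sigma, (1,), workload)
    two_way = quality_score([10, 20, 20, 50], [25, 25, 25, 25], sigma, (0, 1), workload)
    return one_way, two_way

low_noise = scores(sigma=5.0)     # ample budget per round
high_noise = scores(sigma=20.0)   # tight budget per round
```

With little noise the 2-way marginal scores higher, while with heavy noise its larger penalty makes the inner expression negative and the 1-way marginal is preferred.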

Better Selection Criteria

Trade-off in quality score function

The penalty term $\sqrt{2/\pi}\,\sigma_t n_r$ discourages marginals with more
attributes.
The weight $w_r$ favors marginals with more attributes.
However, if the inner expression is negative, then the larger weight will
make it more negative, and much less likely to be selected.

Adaptive Rounds and Budget Split

The annealing condition is activated if the difference between
$M_{r_t}(\hat{p}_t)$ and $M_{r_t}(\hat{p}_{t-1})$ is small, which indicates that
not much information was learned in the previous round.
We initialize $\epsilon_t$ and $\sigma_t$ conservatively.
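
One way such an annealing rule could look is sketched below. The slide only says the difference is "small"; the specific threshold (the expected noise level from the quality score) and the update (halve the noise scale, double the selection budget) are assumptions made for illustration, and the function name is hypothetical.

```python
import math
import numpy as np

def anneal(sigma_t, eps_t, new_marginal, old_marginal):
    """Return (sigma, eps) for the next round."""
    new_marginal = np.asarray(new_marginal, dtype=float)
    old_marginal = np.asarray(old_marginal, dtype=float)
    n_r = len(new_marginal)
    change = np.abs(new_marginal - old_marginal).sum()
    # Assumed threshold: expected L1 norm of the measurement noise.
    threshold = math.sqrt(2.0 / math.pi) * sigma_t * n_r
    if change <= threshold:
        # Assumed update: little was learned, so spend more budget per round.
        return sigma_t / 2.0, eps_t * 2.0
    return sigma_t, eps_t

# Model barely moved after the last measurement: annealing triggers.
triggered = anneal(10.0, 0.1, [50, 50], [49, 51])
# Model changed substantially: keep the current noise level.
unchanged = anneal(10.0, 0.1, [80, 20], [50, 50])
```

Starting conservatively and increasing the per-round budget only when a round is uninformative lets the number of rounds adapt to the total budget instead of being fixed in advance.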

Theoretical analysis: uncertainty quantification

Provide a probability bound for $\|M_r(D) - M_r(\hat{D})\|_1$.

Guarantees are given only for marginals in the workload $W$.
Two cases:
The easy case: marginal $r$ has been selected, so we have an unbiased
estimate of $M_r(D)$ from $y_t$.
The hard case: marginal $r$ has not been selected, so there is no unbiased
estimate of $M_r(D)$.

Theoretical analysis: easy case

Unbiased estimates

Probability bound of Gaussian vector

Triangle inequality

Theoretical analysis: hard case

Key insight is that marginal queries not selected have relatively low error
compared to the marginal queries that were selected. We can easily bound
the error of selected queries and relate that to non-selected queries by
utilizing the guarantees of the exponential mechanism.

Triangle inequality

Experiments

Open Problems

Handling more general workloads.


Handling mixed data types.
Utilizing public data: design synthetic data mechanisms that
incorporate public data.
