
Missing Value Imputation via Gaussian Copula

Yuxuan Zhao

ORIE 4741, Nov 11 2021

Why am I here today?

▶ In your course project, you are very likely to run into missing data.
▶ The missing data imputation method I introduce today can be used
without selecting any hyperparameters.
▶ The software is easy to install and use.
▶ We want to know whether our method works well for your problem!
Table of Contents

1 Motivation

2 Gaussian copula model

3 Demo

Motivation

Let's first look at a General Social Survey (GSS) dataset.

Figure 1: 2538 participants and 9 questions. 18.2% of entries are missing in total.
Example variables

▶ Subjective class identification: If you were asked to use one of four names
for your social class, which would you say you belong in: the lower class,
the working class, the middle class, or the upper class?
▶ General happiness: Taken all together, how would you say things are these
days–would you say that you are very happy, pretty happy, or not too happy?
▶ Respondent's income: In which of these groups did your earnings from
(OCCUPATION IN OCC) for last year–[the previous year]–fall? That is, before
taxes or other deductions. Just tell me the letter.
▶ Weeks r. worked last year: In [the previous year] how many weeks did you
work either full-time or part-time, not counting work around the house–include
paid vacations and sick leave?
Recap: GLRM imputes mixed data better than PCA

Generalized low rank model: find low rank matrices X ∈ R^{n×k} and
W ∈ R^{k×p} such that XW approximates Y ∈ R^{n×p} well:

$$\text{minimize} \sum_{(i,j)\in\Omega} \ell_j\left(Y_{ij},\, x_i^\top w_j\right) + \sum_{i=1}^{n} r_i(x_i) + \sum_{j=1}^{p} \tilde{r}_j(w_j)$$

▶ The loss ℓ_j can vary for different columns j.
▶ The row regularizers r_i and column regularizers r̃_j can vary.

Great flexibility usually means many choices to make...
GLRM: practical considerations

Generalized low rank model: find low rank matrices X ∈ R^{n×k} and
W ∈ R^{k×p} such that XW approximates Y ∈ R^{n×p} well:

$$\text{minimize} \sum_{(i,j)\in\Omega} \ell_j\left(Y_{ij},\, x_i^\top w_j\right) + \sum_{i=1}^{n} r_i(x_i) + \sum_{j=1}^{p} \tilde{r}_j(w_j)$$

▶ What loss ℓ_j to choose?
▶ How to assign weights to the ℓ_j when columns have different scales?
▶ What regularizers r_i, r̃_j to use?

And there are tuning parameters...
GLRM: practical considerations

Generalized low rank model: find low rank matrices X ∈ R^{n×k} and
W ∈ R^{k×p} such that XW approximates Y ∈ R^{n×p} well:

$$\text{minimize} \sum_{(i,j)\in\Omega} \ell_j\left(Y_{ij},\, x_i^\top w_j\right) + \sum_{i=1}^{n} r_i(x_i) + \sum_{j=1}^{p} \tilde{r}_j(w_j)$$

▶ How to choose the rank k?
▶ If r_i and r̃_j are set to quadratic regularization with parameter λ, how
to choose λ?
▶ This requires a search over a two-dimensional grid; see the sketch below.

Is the problem just about computation?
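To make the search cost concrete, here is a minimal sketch of that two-dimensional grid search, assuming quadratic loss and quadratic regularization throughout (the full GLRM allows a different ℓ_j per column); als_impute and grid_search are small helpers written only for this illustration, not part of any package.

import numpy as np

def als_impute(Y, mask, k, lam, iters=50):
    """Fit a quadratic-loss GLRM (rank k, ridge penalty lam) by alternating
    least squares and return the completed matrix XW."""
    n, p = Y.shape
    rng = np.random.default_rng(0)
    X = rng.normal(scale=0.1, size=(n, k))
    W = rng.normal(scale=0.1, size=(k, p))
    Y0 = np.where(mask, Y, 0.0)  # zero out missing entries
    for _ in range(iters):
        for i in range(n):       # update each row factor x_i
            obs = mask[i]
            A = W[:, obs] @ W[:, obs].T + lam * np.eye(k)
            X[i] = np.linalg.solve(A, W[:, obs] @ Y0[i, obs])
        for j in range(p):       # update each column factor w_j
            obs = mask[:, j]
            A = X[obs].T @ X[obs] + lam * np.eye(k)
            W[:, j] = np.linalg.solve(A, X[obs].T @ Y0[obs, j])
    return X @ W

def grid_search(Y, mask, ks, lams):
    """Hold out 10% of the observed entries and pick (k, lam) minimizing
    the held-out squared imputation error."""
    rng = np.random.default_rng(1)
    holdout = mask & (rng.random(Y.shape) < 0.1)
    train = mask & ~holdout
    best = None
    for k in ks:
        for lam in lams:
            Yhat = als_impute(Y, train, k, lam)
            err = np.mean((Yhat[holdout] - Y[holdout]) ** 2)
            if best is None or err < best[0]:
                best = (err, k, lam)
    return best  # (validation error, chosen rank, chosen penalty)

Every (k, λ) pair requires a full refit, which is exactly the computational burden the slide points at.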
GLRM: low rank assumption

Generalized low rank model: find low rank matrices X ∈ R^{n×k} and
W ∈ R^{k×p} such that XW approximates Y ∈ R^{n×p} well:

$$\text{minimize} \sum_{(i,j)\in\Omega} \ell_j\left(Y_{ij},\, x_i^\top w_j\right) + \sum_{i=1}^{n} r_i(x_i) + \sum_{j=1}^{p} \tilde{r}_j(w_j)$$

▶ Only works well when Y can be approximated by a low rank matrix.
▶ Big data (large n and large p) usually has low rank structure.
  ▶ Movie rating datasets: many movies and many users.
▶ Long skinny data (large n and small p) usually does not have low rank
structure.
  ▶ Social survey data: many participants, few questions.
Get over the low rank assumption

▶ Large n allows learning more complex variable dependence than the low
rank structure.
▶ Statistical dependence structure: model the joint distribution.
▶ Gaussian distribution for a quantitative vector:
  1 All 1-dimensional marginals are Gaussian.
  2 The joint p-dimensional distribution is multivariate Gaussian.

First, can we use a 1-dimensional Gaussian to model an ordinal/binary variable?
Histograms for some GSS variables

[Figure: histograms of four GSS variables: General happiness (from Very happy to Not too happy), Respondent's income (from Less than $1000 to More than $25000), How many people in contact in a typical weekday (from 0−4 persons to 50 or more), and Weeks r. worked last year (from 0 to 52).]
Generate ordinal data by thresholding a Gaussian variable

[Figure: the real line of normal z values, cut at thresholds −1 and 1, maps to ordinal x values 1, 2, 3 on the intervals (−∞, −1), [−1, 1), and [1, ∞).]

▶ Select thresholds to ensure the desired class proportions; a sketch follows below.
▶ This defines a mapping between ordinal levels and intervals:
$f(z) = x$ for $z \in [a_x, a_{x+1})$, or equivalently $f^{-1}(x) = [a_x, a_{x+1})$.
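One standard way to pick the cutpoints is to invert the standard normal CDF at the empirical cumulative class proportions. A minimal sketch (estimate_thresholds is our illustrative helper, not a package function):

import numpy as np
from scipy.stats import norm

def estimate_thresholds(x):
    """Cutpoints a_1 < ... < a_{K-1} chosen so that thresholding a standard
    normal z at these values reproduces the empirical class proportions of
    the ordinal variable x (missing entries ignored)."""
    vals = x[~np.isnan(x)]
    _, counts = np.unique(vals, return_counts=True)
    cum = np.cumsum(counts) / counts.sum()
    return norm.ppf(cum[:-1])  # the last boundary is +infinity

# example: a 3-level "general happiness"-style variable
x = np.random.default_rng(0).choice([1.0, 2.0, 3.0], size=2538, p=[0.31, 0.55, 0.14])
print(estimate_thresholds(x))  # two interior cutpoints on the z scale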
Estimated thresholds for some GSS variables

[Figure 2: histograms of General happiness (Very happy to Not too happy) and How many people in contact in a typical weekday (0−4 persons to 50 or more), each paired with a standard normal density. Red vertical lines indicate estimated thresholds.]

Table of Contents

1 Motivation

2 Gaussian copula model

3 Demo





Gaussian copula model for mixed data

We say x = (x_1, . . . , x_p) follows the Gaussian copula model if

▶ marginals: x = f(z) for f = (f_1, . . . , f_p) entrywise monotonic,

  x_j = f_j(z_j), j = 1, . . . , p

▶ copula: z ∼ N(0, Σ) with correlation matrix Σ

▶ Estimate f_j to match the observed empirical distribution.
▶ Estimate Σ through an EM algorithm.

A sampling sketch of the model follows below.
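The generative direction of the model is easy to code. A minimal sketch of drawing mixed data from a two-variable Gaussian copula, where the particular marginals (lognormal and a 3-level ordinal) are illustrative assumptions:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])  # copula correlation matrix

# copula: latent z ~ N(0, Sigma)
z = rng.multivariate_normal(np.zeros(2), Sigma, size=1000)

# marginals: entrywise monotone f = (f1, f2)
x1 = np.exp(z[:, 0])  # continuous variable with a lognormal marginal
# 3-level ordinal via thresholds at class proportions 30% / 56% / 14%
x2 = np.digitize(z[:, 1], norm.ppf([0.3, 0.86])) + 1

# x = (x1, x2) follows a Gaussian copula model: the marginals are arbitrary,
# while the dependence is inherited from the latent Gaussian correlation.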


Given parameter estimates, imputation is easy

Figure 3: Curves indicate density and dots mark the observations; question marks mark the missing entries.


Given parameter estimates, imputation is easy

Figure 4: Curves indicate density and crosses mark the predictions.


Given parameter estimates, imputation is easy

Figure 5: Curves indicate density and crosses mark the predictions.




Given parameters, imputation is easy

▶ observed entries x_O of a new row x ∈ R^p, O ⊂ {1, . . . , p}
▶ missing entries M = {1, . . . , p} \ O
▶ marginals f = (f_O, f_M) and copula correlation matrix Σ
▶ the truncated region: $z_O \in f_O^{-1}(x_O) := \prod_{j \in O} f_j^{-1}(x_j)$

Impute missing entries using the normality of z_M:

▶ latent missing z_M are normal given z_O:

$$z_M \mid z_O \sim \mathcal{N}\left(\Sigma_{M,O}\Sigma_{O,O}^{-1} z_O,\ \Sigma_{M,M} - \Sigma_{M,O}\Sigma_{O,O}^{-1}\Sigma_{O,M}\right)$$

▶ predict with the conditional mean

$$\hat{z}_M = \Sigma_{M,O}\Sigma_{O,O}^{-1}\, \mathbb{E}\left[z_O \mid z_O \in f_O^{-1}(x_O)\right]$$

▶ map back to the observed space: x̂_M = f_M(ẑ_M)

A numpy sketch of this computation follows below.
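A minimal numpy sketch of the conditional-mean step on the latent scale, assuming continuous marginals so that z_O = f_O^{-1}(x_O) is a single point (for ordinal marginals the truncated mean E[z_O | z_O ∈ f_O^{-1}(x_O)] would be needed, which this sketch omits); impute_conditional_mean is our illustrative helper:

import numpy as np

def impute_conditional_mean(z_obs, O, M, Sigma):
    """Impute latent missing entries with the mean of z_M | z_O."""
    S_OO = Sigma[np.ix_(O, O)]
    S_MO = Sigma[np.ix_(M, O)]
    return S_MO @ np.linalg.solve(S_OO, z_obs)

# example: p = 3, third coordinate missing
Sigma = np.array([[1.0, 0.5, 0.3],
                  [0.5, 1.0, 0.4],
                  [0.3, 0.4, 1.0]])
z_hat_M = impute_conditional_mean(np.array([0.8, -0.2]), [0, 1], [2], Sigma)
# map back through the marginal to get x_hat_M = f_M(z_hat_M)
print(z_hat_M)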
Multiple imputation

When imputation is an intermediate step to learn some parameter θ, e.g.,
linear coefficients, on the imputed complete dataset:

1 Generate m different imputed datasets X^{(1)}, . . . , X^{(m)}.
2 For each imputed dataset X^{(j)}, learn the desired model parameter θ̂^{(j)},
for j = 1, . . . , m.
3 Combine all estimates into one: $\hat{\theta} = \frac{1}{m}\sum_{j=1}^{m} \hat{\theta}^{(j)}$.

A pooling sketch follows below.
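A minimal sketch of the pooling recipe, taking least-squares linear coefficients as the downstream parameter θ (pool_estimates is our illustrative helper):

import numpy as np

def pool_estimates(imputed_Xs, y):
    """Steps 2 and 3 of multiple imputation: fit the downstream model (here,
    least-squares linear coefficients) on each completed dataset X^(j) and
    average the m estimates into one."""
    thetas = [np.linalg.lstsq(X, y, rcond=None)[0] for X in imputed_Xs]
    return np.mean(thetas, axis=0)

Note that this averages only the point estimates; Rubin's rules would additionally combine within- and between-imputation variances to produce standard errors.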


Given parameters, imputation is easy

▶ observed entries x_O of a new row x ∈ R^p, O ⊂ {1, . . . , p}
▶ missing entries M = {1, . . . , p} \ O
▶ marginals f = (f_O, f_M) and copula correlation matrix Σ
▶ the truncated region: $z_O \in f_O^{-1}(x_O) := \prod_{j \in O} f_j^{-1}(x_j)$

Impute missing entries using the normality of z_M:

▶ latent missing z_M are normal given z_O:

$$z_M \mid z_O \sim \mathcal{N}\left(\Sigma_{M,O}\Sigma_{O,O}^{-1} z_O,\ \Sigma_{M,M} - \Sigma_{M,O}\Sigma_{O,O}^{-1}\Sigma_{O,M}\right)$$

▶ Sample z_M^{(i)} from the above distribution for i = 1, . . . , m.
▶ Map back to the observed space: x̂_M^{(i)} = f_M(z_M^{(i)}) for i = 1, . . . , m.

A sampling sketch follows below.
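The sampling counterpart of the earlier conditional-mean sketch, again assuming continuous marginals so that z_O is a single point; each draw would then be mapped back through f_M (sample_imputations is our illustrative helper):

import numpy as np

def sample_imputations(z_obs, O, M, Sigma, m=5, seed=0):
    """Draw m samples z_M^(i) from the conditional normal z_M | z_O; each is
    then mapped back through the marginals, x_M^(i) = f_M(z_M^(i))."""
    rng = np.random.default_rng(seed)
    S_OO = Sigma[np.ix_(O, O)]
    S_MO = Sigma[np.ix_(M, O)]
    mean = S_MO @ np.linalg.solve(S_OO, z_obs)
    cov = Sigma[np.ix_(M, M)] - S_MO @ np.linalg.solve(S_OO, S_MO.T)
    return rng.multivariate_normal(mean, cov, size=m)  # shape (m, |M|)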



Table of Contents

1 Motivation

2 Gaussian copula model

3 Demo

Check out our GitHub page

▶ Python package: https://github.com/udellgroup/GaussianCopulaImp
▶ One-line installation: pip install GaussianCopulaImp
▶ More tutorials on multiple imputation, accelerating the algorithm for
large datasets, etc.

A hypothetical usage sketch follows below.
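A hypothetical usage sketch. The import path and method names below are our assumptions for illustration (flagged in the comments), so consult the repository README for the actual interface:

import numpy as np
# NOTE: the import and method names below are assumptions for illustration;
# check https://github.com/udellgroup/GaussianCopulaImp for the actual API.
from GaussianCopulaImp import GaussianCopula  # hypothetical import path

X = np.array([[1.0,    2.0, np.nan],
              [np.nan, 1.0, 0.0],
              [3.0, np.nan, 1.0]])

model = GaussianCopula()            # no hyperparameters to select
X_imputed = model.fit_transform(X)  # hypothetical: fills in the NaNs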
