
Introduction to Deep Generative Modeling Lecture #3

HY-673 – Computer Science Department, University of Crete
Professors: Yannis Pantazis & Yannis Stylianou
TAs: Michail Raptakis & Michail Spanakis
Taxonomy of Deep Generative Models Lecture #3
According to the Likelihood Function

Generative models (GMs) fall into three groups, according to how they handle the likelihood function:

• Exact likelihood
  – Autoregressive models (ARMs): (R)NADE, WaveNet, WaveRNN, GPT, …
  – Normalizing flows (NFs): planar, coupling, MAFs/IAFs, …
• Approximate likelihood
  – Variational autoencoders (VAEs): vanilla, β-VAE, VQ-VAE, …
  – Energy-based models (EBMs): belief nets, Boltzmann machines, …
  – Diffusion probabilistic models (DPMs): diffusion, denoising, score-based, …
• Implicit likelihood
  – Generative adversarial networks (GANs): vanilla, WGAN, f-GAN, (f, Γ)-GAN
  – GGFs: KALE, Lipschitz-regularized, …

Introduction to Estimator Theory Lecture #3

Let D = {x1, ..., xn} be a set of data drawn from pd(x), and let pθ(x) be a
family of models with θ ∈ Θ. A point estimator θ̂ = θ̂(D) is a random variable
(it is a function of the random sample D) for which we want:

pθ̂(x) ≈ pd(x)
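As a minimal numerical sketch (the Gaussian family, sample size, and seed are assumptions, not from the slides), the sample mean acts as a point estimator θ̂(D) of a Gaussian mean, and redrawing D shows that θ̂ is itself a random variable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: p_d(x) = N(mu_star, 1); the model family is N(theta, 1).
mu_star = 2.0
D = rng.normal(loc=mu_star, scale=1.0, size=100)  # dataset D = {x_1, ..., x_n}

# A point estimator is any function of the data; here, the sample mean.
theta_hat = D.mean()
print(f"theta_hat(D)  = {theta_hat:.3f}  (true parameter = {mu_star})")

# theta_hat is a random variable: redrawing D gives a different estimate.
D2 = rng.normal(loc=mu_star, scale=1.0, size=100)
print(f"theta_hat(D') = {D2.mean():.3f}")
```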


Introduction to Estimator Theory Lecture #3

• How to construct an estimator?


– Maximum Likelihood Estimation (MLE)
– Maximum A Posteriori (MAP) Estimation
– Based on a probability distance or a divergence (implicit)
– Bayesian inference (learns a distribution over the estimator's parameters)
Maximum Likelihood Estimator Lecture #3

The maximum likelihood estimator picks the parameter value under which the
observed data are most probable:

θ̂MLE = arg maxθ Ln(θ; D), where Ln(θ; D) = (1/n) Σi log pθ(xi)

is the (average) log-likelihood of the data under the model.

− Ln(θ̂1) > Ln(θ̂2) implies that θ̂1 is more likely than θ̂2 to have generated
the observed samples x1, ..., xn.

− Thus, the likelihood provides a ranking of the models' fit to the data.
MLE Example #1 Lecture #3

At the maximum, the first-order optimality condition holds:

(d/dθ) L(θ; D) |θ=θ̂ = 0
MLE Example #2 Lecture #3
MLE Example #3 Lecture #3

Partial derivative or gradient vector:

∇θ L(θ) = ( ∂L/∂θ1, ..., ∂L/∂θd )⊤
MLE Example #3 Lecture #3

Maximizing L(θ) is equivalent to minimizing the Sum of Squares (Least
Squares), so the MLE gives exactly the same solution as LS!
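A quick numerical check of this equivalence (the synthetic data and dimensions are assumptions): maximizing the Gaussian log-likelihood with a generic optimizer and solving the least-squares problem return the same θ̂.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Linear-Gaussian model: y = X @ theta_star + eps, eps ~ N(0, 1).
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star + rng.normal(size=n)

# Least-squares solution via the normal equations.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# MLE: maximizing the Gaussian log-likelihood = minimizing the sum of squares.
neg_log_lik = lambda th: 0.5 * np.sum((y - X @ th) ** 2)
theta_mle = minimize(neg_log_lik, x0=np.zeros(d)).x

print(np.allclose(theta_ls, theta_mle, atol=1e-4))  # True: identical solutions
```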


MLE Example #4 Lecture #3

• Logistic regression with a sigmoid, a.k.a. binary classification.

Dataset: D = {(x1, y1), ..., (xn, yn)} with xi ∈ Rd and yi ∈ {0, 1}.
Model family: pθ(yi = 1|xi) = σ(θ⊤xi), pθ(yi = 0|xi) = 1 − pθ(yi = 1|xi),
with θ ∈ Rd and σ(z) = 1/(1 + e−z) the sigmoid function.
MLE Example #4 Lecture #3

There is no closed-form maximizer, so the MLE is computed by gradient ascent:
θ ← θ + η ∇θ L(θ), where η > 0 is the learning rate.
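A minimal sketch of this gradient-ascent recipe for logistic regression (the synthetic data, step size η, and iteration count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary labels drawn from the model itself.
n, d = 500, 2
X = rng.normal(size=(n, d))
theta_star = np.array([2.0, -1.0])
y = rng.binomial(1, sigmoid(X @ theta_star))

# Gradient ascent on the average log-likelihood:
# grad L(theta) = (1/n) * X^T (y - sigmoid(X theta)).
theta, eta = np.zeros(d), 1.0  # eta: learning rate (assumed value)
for _ in range(2000):
    theta += eta * X.T @ (y - sigmoid(X @ theta)) / n

print(theta)  # approaches theta_star as n grows
```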
Maximum Likelihood Estimator Lecture #3
Kullback-Leibler Divergence (KLD) Lecture #3

• Geometric interpretation:
MLE is equivalent to minimizing the KLD of pd (x) w.r.t. pθ (x).
Maximum Likelihood Estimator Lecture #3

where the cross entropy of a probability P with PDF p(x) with respect to a
probability Q with PDF q(x) is defined as

H×(P||Q) := − ∫ log q(x) p(x) dx
Kullback-Leibler Divergence Lecture #3

• MLE is also equivalent to minimizing the KLD of pd(x) w.r.t. pθ(x):

arg maxθ L(θ; pd) = arg minθ DKL(pd || pθ)

• The Kullback-Leibler divergence (KLD) of P w.r.t. Q is defined as:

DKL(P||Q) := ∫ log( p(x)/q(x) ) p(x) dx
           = ∫ log p(x) p(x) dx − ∫ log q(x) p(x) dx

so that DKL(P||Q) = −H(P) + H×(P||Q), i.e., negative Entropy plus Cross Entropy.


Kullback-Leibler Divergence Lecture #3

• DKL(P||Q) ≥ 0, with equality if and only if P = Q.

This follows from Jensen's inequality (−log is convex):
DKL(P||Q) = EP[−log(q(x)/p(x))] ≥ −log EP[q(x)/p(x)] = −log ∫ q(x) dx = −log 1 = 0.
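As an illustration (the Gaussian pair and sample size are assumptions), the KLD between two univariate Gaussians can be computed in closed form and checked with a Monte Carlo average of log p(x) − log q(x) under P:

```python
import numpy as np

rng = np.random.default_rng(4)

# KLD between univariate Gaussians P = N(mu1, s1^2) and Q = N(mu2, s2^2).
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0

# Closed form: log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2.
kl_exact = np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

# Monte Carlo: DKL(P||Q) = E_P[log p(x) - log q(x)], x ~ P.
# The common -0.5*log(2*pi) term cancels in the difference and is omitted.
x = rng.normal(mu1, s1, size=200_000)
log_p = -0.5 * ((x - mu1) / s1) ** 2 - np.log(s1)
log_q = -0.5 * ((x - mu2) / s2) ** 2 - np.log(s2)
print(kl_exact, np.mean(log_p - log_q))  # the two estimates agree closely
```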
Maximum A Posteriori Estimator Lecture #3

The MAP estimator maximizes the posterior of the parameters given the data:

θ̂MAP = arg maxθ p(θ|D) = arg maxθ p(D|θ) p(θ)

(by Bayes' rule; the evidence p(D) does not depend on θ).
Maximum A Posteriori Estimator Lecture #3

• Linear model: D = {(x1, y1), ..., (xn, yn)}, xi ∈ Rd, yi ∈ R, model:

yi = θ⊤xi + ϵi, ϵi ∼ N(0, 1)

− p(θ) = N(0, λ−1 Id) ⇒ ridge regression, a.k.a. (Tikhonov-)regularized
Least Squares.
− p(θ) = Laplace(0, λ−1) ⇒ lasso regression (least absolute shrinkage and
selection operator).
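A sketch of the Gaussian-prior case (the synthetic data and the value of λ are assumptions): the MAP estimate solves the ridge-regression normal equations and shrinks θ̂ relative to the MLE.

```python
import numpy as np

rng = np.random.default_rng(5)

# Linear-Gaussian data; lam plays the role of the prior precision lambda.
n, d, lam = 100, 5, 10.0
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star + rng.normal(size=n)

# MAP with prior N(0, lam^-1 I_d): maximize log p(D|theta) + log p(theta),
# i.e. minimize 0.5*||y - X theta||^2 + 0.5*lam*||theta||^2  (ridge).
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# The unregularized MLE / least-squares solution, for comparison.
theta_mle, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(theta_map), np.linalg.norm(theta_mle))  # MAP is shrunk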
Estimator Assessment Lecture #3

• Basic toolkit to assess an estimator: bias, variance (standard error),
mean squared error (MSE), consistency, and efficiency.


Estimator Assessment Lecture #3

Chebyshev's inequality, P(|θ̂ − E[θ̂]| ≥ ε) ≤ Var(θ̂)/ε², shows that an
unbiased estimator whose variance vanishes as n → ∞ is consistent.
Estimator Assessment Lecture #3

• Let θ̂1 and θ̂2 be two unbiased estimators of θ∗ . θ̂1 is more efficient than
θ̂2 if and only if Var(θ̂1 ) < Var(θ̂2 ).
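A simulation sketch of this comparison (the Gaussian setup is an assumption): for estimating a Gaussian mean, the sample mean and the sample median are both unbiased, but the mean has the smaller variance and is therefore the more efficient estimator.

```python
import numpy as np

rng = np.random.default_rng(6)

# Two unbiased estimators of a Gaussian mean: sample mean vs. sample median
# (both unbiased by symmetry). Estimate their variances by simulation.
theta_star, n, trials = 0.0, 100, 10_000
samples = rng.normal(theta_star, 1.0, size=(trials, n))

means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

print("bias:", means.mean(), medians.mean())  # both approximately 0
print("var :", means.var(), medians.var())
# Var(mean) ~ 1/n < Var(median) ~ pi/(2n): the sample mean is more efficient.
```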
References Lecture #3

1. Larry Wasserman, "All of Statistics: A Concise Course in Statistical
Inference" (Chapters 6 & 9), Springer (2004).

3. Matrix calculus:
https://en.wikipedia.org/wiki/Matrix_calculus
https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
