Approximate Inference: Sargur Srihari (srihari@cedar.buffalo.edu)

This document discusses approximate inference methods for probabilistic models in machine learning. It introduces the need for approximate inference when exact posterior distributions and expectations are intractable due to high dimensionality or complex integrals and sums. It describes two main types of approximations: stochastic approximations using Markov chain Monte Carlo methods, and deterministic approximations using variational inference. Variational inference finds a tractable distribution that approximates the true posterior by maximizing an evidence lower bound derived using the Kullback-Leibler divergence. The document provides examples of variational inference for univariate Gaussians and for Gaussian mixture models.

Machine Learning

Srihari

Approximate Inference
Sargur Srihari, srihari@cedar.buffalo.edu

Central Tasks of Probabilistic Models


1. Evaluation of posterior distribution p(Z|X)
latent variables Z, observed data X
In classification: Z = class, X = features. In regression: Z = w (parameter vector), X = t (target vector)

2. Evaluation of expectations with respect to p(Z|X)


In EM: evaluate the expectation of the complete-data log-likelihood with respect to the posterior distribution of the latent variables

Need for Approximate Inference


It is often infeasible to evaluate posterior distributions, or expectations with respect to them
Reasons: high dimensionality of the latent space; complex and intractable expectations

For continuous variables


The required integrations have no closed-form solutions
The dimensionality of the space and the complexity of the integrand prohibit numerical integration

For discrete variables


Summation in marginalization: exponential number of states

Types of Approximations
1. Stochastic
Markov chain Monte Carlo
Have allowed the use of Bayesian methods across many domains

Computationally demanding
Given enough computation, they can generate exact results

2. Deterministic
Variational Inference (or Variational Bayes): based on analytical approximations to the posterior
e.g., a particular factorization, or a specific parametric form such as a Gaussian

Scale well to large applications
Can never generate exact results



Variational Inference
Based on Calculus of Variations
Invented by Euler
Standard Calculus concerns derivatives of functions
Function takes variable as input and returns value of function

Functional is a mapping with function as input


Returns value of functional as output

An example of a functional is the entropy H[p] = -∫ p(x) ln p(x) dx
The quantity maximized in variational inference is a functional
Functional derivative: how does the value of the functional change in response to small changes in the input function

Leonhard Euler, Swiss mathematician (1707-1783)
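To make the idea of a functional concrete, here is a minimal Python sketch (added here, not part of the slides) that evaluates the entropy functional H[p] = -∫ p(x) ln p(x) dx: the input is a function (a Gaussian density with an assumed σ = 1.5) and the output is a single number, checked against the known closed form for a Gaussian.

```python
# A minimal sketch: evaluating the entropy functional numerically for a Gaussian
# and comparing with the closed form 0.5 * ln(2 * pi * e * sigma^2).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

sigma = 1.5
p = norm(loc=0.0, scale=sigma).pdf

# The functional maps the whole function p(x) to one number H[p].
H_numeric, _ = quad(lambda x: -p(x) * np.log(p(x)), -12 * sigma, 12 * sigma)
H_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(H_numeric, H_closed)  # the two values agree closely
```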

Function versus Functional


Function takes a value of a variable as input and returns the function value as output
Derivative describes how output varies as we make infinitesimal changes in input value

Functional takes a function as input and returns the functional value as output
Derivative of functional describes how value of functional changes with infinitesimal changes in input function

Recall that Gaussian processes dealt with distributions over functions. Now we deal with how to find a function that maximizes a functional, e.g., entropy

Inference Problem
Observed Variables
X = {x_1, ..., x_N}: N i.i.d. data points

Latent Variables and Parameters


Z = {z_1, ..., z_N}

Joint distribution p(X,Z) is specified


E.g., full set of tables or pdfs given

Goal is to find approximation for posterior distribution p(Z|X)


Which is given by p(Z|X) = p(X,Z)/p(X)
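As a minimal illustration (not from the original deck), the sketch below assumes a tiny made-up joint table p(X, Z) over binary variables and computes the posterior via p(Z|X) = p(X,Z)/p(X). With many latent variables such tables grow exponentially, which is exactly the intractability discussed earlier.

```python
# A minimal sketch: exact posterior from a fully specified joint table.
import numpy as np

# joint[z, x] = p(Z = z, X = x) for binary Z and X (values made up for illustration)
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

x_observed = 1
p_x = joint[:, x_observed].sum()            # p(X = x) by marginalizing over Z
posterior = joint[:, x_observed] / p_x      # p(Z | X = x)
print(posterior)                            # sums to 1 over the states of Z
```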

Approach of Variational Methods


Find a tractable distribution q to approximate p. Use two ideas:
1. KL Divergence
Which is zero when p =q
KL(p || q) = -∫ p(x) ln{ q(x)/p(x) } dx

(Figure: ln p(X) decomposed into the lower bound L(q) and KL(q||p))

2. For any choice of q(Z)

ln p(X) can be decomposed into two terms


The first is a functional L(q) with arguments q(Z) and X; the second is KL(q||p), where p is the posterior p(Z|X)

We wish to maximize L(q)


Which has the effect of minimizing KL
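A small example (added here, not from the slides): computing KL(p||q) for two univariate Gaussians, once by numerical integration of the definition above and once from the standard closed form, and confirming it vanishes when p = q. The particular means and variances are arbitrary.

```python
# A minimal sketch: KL divergence between two univariate Gaussians,
# numerically and in closed form; it is zero exactly when p = q.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kl_numeric(p, q, lo=-20.0, hi=20.0):
    val, _ = quad(lambda x: p.pdf(x) * (p.logpdf(x) - q.logpdf(x)), lo, hi)
    return val

def kl_gauss(m1, s1, m2, s2):
    # closed form for KL( N(m1, s1^2) || N(m2, s2^2) )
    return np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

p = norm(0.0, 1.0)
q = norm(1.0, 2.0)
print(kl_numeric(p, q), kl_gauss(0.0, 1.0, 1.0, 2.0))  # agree, and > 0
print(kl_numeric(p, p))                                 # ~0 when p = q
```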

Decomposition of Log Marginal Probability


ln p(X) = L(q) + KL(q || p)
where the functional L(q) = ∫ q(Z) ln{ p(X,Z)/q(Z) } dZ
and the Kullback-Leibler divergence KL(q || p) = -∫ q(Z) ln{ p(Z|X)/q(Z) } dZ
Also applicable to discrete distributions, by replacing integrations with summations

Some Observations:

L(q) is a lower bound on ln p(X)
Maximizing the lower bound L(q) with respect to the distribution q(Z) is equivalent to minimizing the KL divergence
When the KL divergence vanishes, q(Z) equals the posterior p(Z|X)

Plan:
We seek the distribution q(Z) for which L(q) is largest
Since the true posterior is intractable, we consider a restricted family of distributions q(Z)
We then seek the member of this family for which the KL divergence is minimized
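The decomposition can be checked numerically. The sketch below (an illustration added here, not part of the slides) uses a small made-up discrete joint table, so the integrals become sums, and verifies that ln p(X) = L(q) + KL(q||p) for an arbitrary q(Z).

```python
# A minimal sketch: verifying ln p(X) = L(q) + KL(q || p(Z|X)) on a discrete model.
import numpy as np

# joint[z, x] = p(Z = z, X = x) over 3 latent states and 2 observed states
joint = np.array([[0.10, 0.15],
                  [0.25, 0.05],
                  [0.20, 0.25]])

x = 0                                   # the observed value
p_x = joint[:, x].sum()                 # evidence p(X = x)
post = joint[:, x] / p_x                # exact posterior p(Z | X = x)

q = np.array([0.5, 0.2, 0.3])           # any valid distribution over Z

L = np.sum(q * np.log(joint[:, x] / q))  # lower bound  L(q) = sum_Z q ln(p(X,Z)/q)
KL = np.sum(q * np.log(q / post))        # KL(q || p(Z|X))

print(np.log(p_x), L + KL)              # identical: ln p(X) = L(q) + KL(q||p)
print(L <= np.log(p_x))                 # True: L(q) is a lower bound
```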

Variational Approximation Example


Use a parametric distribution q(Z|ω) governed by a set of parameters ω. The lower bound L(q) is then a function of ω, and standard nonlinear optimization can be used to determine the optimal values of the parameters.

(Figure: a non-Gaussian distribution approximated by a Laplace approximation and by a variational Gaussian optimized with respect to its mean and variance; the right panel shows the negative logarithms of the distributions.)
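A minimal sketch of this idea (not from the slides): the variational family is q(z | m, s) = N(z | m, s²), the lower bound is evaluated with Gauss-Hermite quadrature, and a standard nonlinear optimizer adjusts (m, s). The unnormalized target used here, a Gaussian multiplied by a logistic sigmoid, is an assumption chosen only for illustration.

```python
# A minimal sketch: maximize the lower bound L(q) over the parameters of a
# Gaussian variational distribution by nonlinear optimization.
import numpy as np
from scipy.optimize import minimize

def log_p_tilde(z):
    # ln of an unnormalized target: Gaussian times sigmoid, computed stably
    return -0.5 * z**2 - np.logaddexp(0.0, -(20.0 * z + 4.0))

nodes, weights = np.polynomial.hermite.hermgauss(60)   # Gauss-Hermite rule

def neg_lower_bound(params):
    m, log_s = params
    s = np.exp(log_s)
    z = m + np.sqrt(2.0) * s * nodes                    # change of variables
    e_log_p = np.sum(weights * log_p_tilde(z)) / np.sqrt(np.pi)  # E_q[ln p~(z)]
    entropy = 0.5 * np.log(2.0 * np.pi * np.e * s**2)   # H[q] for a Gaussian
    return -(e_log_p + entropy)                         # maximize L(q)

result = minimize(neg_lower_bound, x0=np.array([0.0, 0.0]))
m_opt, s_opt = result.x[0], np.exp(result.x[1])
print(m_opt, s_opt)   # mean and std of the optimized variational Gaussian
```

Because the target is unnormalized, the maximized bound differs from ln p(X) only by a constant, so the optimal (m, s) are unaffected.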


Factorized Approximations
Restricting the family of distributions: partition the elements of Z into M disjoint groups
q(Z) = ∏_{i=1}^{M} q_i(Z_i)

Among all distributions q(Z) having this form, we seek the one for which the lower bound L(q) is largest
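For reference, the standard mean-field result that this maximization leads to (stated in the variational inference literature, e.g. Bishop's PRML Section 10.1.1; it is not shown on the slide itself) is that the optimal j-th factor satisfies:

```latex
\ln q_j^{*}(\mathbf{Z}_j) \;=\; \mathbb{E}_{i \neq j}\!\left[\, \ln p(\mathbf{X}, \mathbf{Z}) \,\right] \;+\; \text{const}
```

That is, the log of the optimal factor is the expectation of the log joint taken with respect to all the other factors, which gives rise to the iterative update schemes used in the examples later in the deck.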

Two alternative forms of KL Divergence


Green: contours of a correlated Gaussian distribution p(z) at 1, 2 and 3 standard deviations
Red: contours of q(z) over the same variables, given by the product of two independent univariate Gaussians

Minimization based on KL Divergence KL(q||p)

Minimization based on Reverse KL Divergence KL(p||q)
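A numerical companion to this figure (added here, not from the slides): for a strongly correlated 2-D Gaussian p and a factorized Gaussian q with the same mean, the sketch compares the two known minimizers, marginal-variance matching for KL(p||q) and precision-diagonal matching for KL(q||p), and evaluates both divergences in closed form. The covariance values are assumptions for illustration.

```python
# A minimal sketch: the two KL directions give different factorized Gaussians.
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    # KL( N(mu0, S0) || N(mu1, S1) ) for multivariate Gaussians
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff
                  - d + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])                              # strongly correlated p

q_forward = np.diag(np.diag(Sigma))                         # minimizes KL(p||q): marginal variances
q_reverse = np.diag(1.0 / np.diag(np.linalg.inv(Sigma)))    # minimizes KL(q||p): diagonal of the precision

print(np.diag(q_forward), np.diag(q_reverse))               # reverse-KL variances are much smaller

# Each approximation is better under the criterion it minimizes:
print(kl_gauss(mu, Sigma, mu, q_forward), kl_gauss(mu, Sigma, mu, q_reverse))  # KL(p||q)
print(kl_gauss(mu, q_forward, mu, Sigma), kl_gauss(mu, q_reverse, mu, Sigma))  # KL(q||p)
```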



Two alternative forms of KL divergence


Approximating a multimodal distribution by a unimodal one (panels a, b, c)
Blue contours: the multimodal distribution p(Z)
Red contours in (a): a single Gaussian q(Z) that best approximates p(Z)
in (b): a single Gaussian q(Z) that best approximates p(Z) in the sense of minimizing KL(p||q)
in (c): a different local minimum of the KL divergence

Alpha Family of Divergences


The two forms of divergence are members of the alpha family of divergences
D_α(p || q) = (4 / (1 - α²)) ( 1 - ∫ p(x)^{(1+α)/2} q(x)^{(1-α)/2} dx )
where -∞ < α < ∞
KL(p||q) corresponds to the limit α → 1
KL(q||p) corresponds to the limit α → -1
For all α, D_α(p||q) ≥ 0, with equality iff p(x) = q(x)
When α = 0 we get a symmetric divergence that is linearly related to the Hellinger distance (a valid distance measure)
D_H(p || q) = ∫ ( p(x)^{1/2} - q(x)^{1/2} )² dx
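A small numerical check of these statements (added here, not from the slides) for two made-up discrete distributions: the alpha-divergence approaches KL(p||q) as α → 1 and KL(q||p) as α → -1, and at α = 0 it equals twice the squared Hellinger distance defined above.

```python
# A minimal sketch: the alpha family of divergences for discrete distributions.
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

def alpha_div(p, q, alpha):
    return (4.0 / (1.0 - alpha**2)) * (1.0 - np.sum(p**((1 + alpha) / 2) * q**((1 - alpha) / 2)))

kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))
hellinger = np.sum((np.sqrt(p) - np.sqrt(q))**2)

print(alpha_div(p, q, 0.999), kl_pq)     # alpha near +1 approximates KL(p||q)
print(alpha_div(p, q, -0.999), kl_qp)    # alpha near -1 approximates KL(q||p)
print(alpha_div(p, q, 0.0), 2 * hellinger)  # D_0 = 2 * D_H (linear relation)
```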


Variational Inference of Univariate Gaussian


Variational inference of the mean and precision of a Gaussian
Green: contours of the true posterior
The iterative scheme converges to the red contours
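A minimal sketch of such an iterative scheme (added here, not the original figure's code): a factorized q(μ, τ) = q(μ) q(τ) for a Gaussian with unknown mean μ and precision τ, alternating the standard Gaussian and Gamma updates. The synthetic data and the prior hyperparameters (μ0, λ0, a0, b0) are assumptions chosen for illustration.

```python
# A minimal sketch: mean-field variational inference for a univariate Gaussian
# with a conjugate Gaussian-Gamma prior, q(mu, tau) = q(mu) q(tau).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=100)     # synthetic data
N, x_bar = len(X), X.mean()

# prior: p(mu | tau) = N(mu | mu0, (lambda0 * tau)^-1),  p(tau) = Gamma(tau | a0, b0)
mu0, lambda0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3

E_tau = 1.0                                      # initial guess for E[tau]
for _ in range(50):
    # q(mu) = N(mu | mu_N, 1 / lambda_N)
    mu_N = (lambda0 * mu0 + N * x_bar) / (lambda0 + N)
    lambda_N = (lambda0 + N) * E_tau
    E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lambda_N

    # q(tau) = Gamma(tau | a_N, b_N)
    a_N = a0 + (N + 1) / 2.0
    b_N = b0 + 0.5 * (np.sum(X**2) - 2 * E_mu * np.sum(X) + N * E_mu2
                      + lambda0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2))
    E_tau = a_N / b_N                            # feeds back into q(mu)

print(mu_N, 1.0 / np.sqrt(E_tau))                # close to the sample mean and std
```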


Variational Mixture of Gaussians


Demonstrates how the Bayesian treatment elegantly resolves maximum-likelihood issues
Conditional distribution of Z given the mixing coefficients π:
p(Z | π) = ∏_{n=1}^{N} ∏_{k=1}^{K} π_k^{z_nk}
Conditional distribution of the observed data:
p(X | Z, μ, Λ) = ∏_{n=1}^{N} ∏_{k=1}^{K} N(x_n | μ_k, Λ_k^{-1})^{z_nk}
Priors over the parameters:
Dirichlet distribution over π:
p(π) = Dir(π | α_0) = C(α_0) ∏_{k=1}^{K} π_k^{α_0 - 1}
Gaussian-Wishart over the means and precisions:
p(μ, Λ) = p(μ | Λ) p(Λ) = ∏_{k=1}^{K} N(μ_k | m_0, (β_0 Λ_k)^{-1}) W(Λ_k | W_0, ν_0)
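In practice a variational Gaussian mixture with priors of this form can be fitted with scikit-learn's BayesianGaussianMixture. The sketch below (added here, not from the slides; the two-cluster synthetic data set is an assumption) deliberately uses too many components so that the Dirichlet prior prunes the surplus ones, as the final slide describes.

```python
# A minimal sketch: variational Bayesian mixture of Gaussians via scikit-learn.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, size=(200, 2)),
               rng.normal([4, 4], 0.8, size=(200, 2))])

vb_gmm = BayesianGaussianMixture(
    n_components=6,                                   # deliberately too many
    weight_concentration_prior_type="dirichlet_distribution",
    weight_concentration_prior=1e-3,                  # small alpha_0 prunes components
    max_iter=500,
    random_state=0,
).fit(X)

# Most mixing coefficients collapse toward zero; only ~2 remain significant.
print(np.round(vb_gmm.weights_, 3))
print(vb_gmm.means_[vb_gmm.weights_ > 0.05])
```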


Variational Distribution
Joint distribution of all of the random variables:
p(X, Z, π, μ, Λ) = p(X | Z, μ, Λ) p(Z | π) p(π) p(μ | Λ) p(Λ)
(Figure: directed acyclic graph representing the mixture model, with nodes for the mixing coefficients π, the latent variables Z, the observed data X, and the means μ and precisions Λ.)


Variational Bayesian Mixture


The model is fitted with K = 6 components; after convergence there are effectively only two components
The density of red ink shows the mixing coefficient of each component

