BDA_lecture_10b

Chapter 4 covers the normal approximation (Laplace's method) and large-sample theory, emphasizing the convergence of posterior distributions to normality as the sample size increases. It also presents counterexamples, including posteriors that are difficult for MCMC, and touches on frequency evaluation and Taylor series approximations of the posterior distribution.


Chapter 4

• 4.1 Normal approximation (Laplace’s method)


• 4.2 Large-sample theory
• 4.3 Counterexamples
• includes examples of posteriors that are difficult for MCMC, too
• 4.4 Frequency evaluation*
• 4.5 Other statistical methods*

1 / 30
Normal approximation (Laplace approximation)
• Often the posterior converges to a normal distribution when n → ∞
• bounded, non-singular, the number of parameters doesn't grow with n
• we can then approximate p(θ|y) with a normal distribution
• Laplace used this (before Gauss) to approximate the posterior of a binomial model to infer the ratio of girls to boys born

[Figure: scatter plots of posterior draws of (alpha, beta) for the bioassay example with histograms of LD50 = −alpha/beta; in the second version the LD50 histogram is computed only for draws with beta > 0.]
2 / 30
Taylor series

• We can approximate p(θ|y) with a normal distribution

  $$p(\theta \mid y) \approx \frac{1}{\sqrt{2\pi}\,\sigma_\theta} \exp\!\left(-\frac{1}{2\sigma_\theta^2}(\theta-\hat\theta)^2\right)$$

• i.e. the log posterior log p(θ|y) can be approximated with a quadratic function

  $$\log p(\theta \mid y) \approx \alpha(\theta-\hat\theta)^2 + C$$

• Corresponds to a Taylor series expansion around θ = θ̂

  $$f(\theta) = f(\hat\theta) + f'(\hat\theta)(\theta-\hat\theta) + \frac{f''(\hat\theta)}{2!}(\theta-\hat\theta)^2 + \frac{f^{(3)}(\hat\theta)}{3!}(\theta-\hat\theta)^3 + \ldots$$

• if θ̂ is at the mode, then f′(θ̂) = 0
• often when n → ∞, the remainder terms $\frac{f^{(3)}(\hat\theta)}{3!}(\theta-\hat\theta)^3 + \ldots$ are small

3 / 30
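To spell out the connection: take f(θ) = log p(θ|y); at the mode f′(θ̂) = 0, so dropping the higher-order terms gives

$$\log p(\theta \mid y) \approx \log p(\hat\theta \mid y) + \frac{f''(\hat\theta)}{2}(\theta-\hat\theta)^2,$$

which is the quadratic above with $\alpha = f''(\hat\theta)/2$ and corresponds to a normal distribution with $\sigma_\theta^2 = -1/f''(\hat\theta)$.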
Multivariate Taylor series

• Multivariate series expansion

$$f(\theta) = f(\hat\theta) + \left.\frac{df(\theta')}{d\theta'}\right|_{\theta'=\hat\theta}(\theta-\hat\theta) + \frac{1}{2!}(\theta-\hat\theta)^T \left.\frac{d^2 f(\theta')}{d\theta'^2}\right|_{\theta'=\hat\theta}(\theta-\hat\theta) + \ldots$$

4 / 30
Normal approximation

• Taylor series expansion of the log posterior around the posterior mode θ̂

  $$\log p(\theta \mid y) = \log p(\hat\theta \mid y) + \frac{1}{2}(\theta-\hat\theta)^T \left[\frac{d^2}{d\theta^2}\log p(\theta' \mid y)\right]_{\theta'=\hat\theta}(\theta-\hat\theta) + \ldots$$

• Multivariate normal: $\propto |\Sigma|^{-1/2}\exp\!\left(-\frac{1}{2}(\theta-\hat\theta)^T\Sigma^{-1}(\theta-\hat\theta)\right)$

[Figure: posterior draws of (alpha, beta) for the bioassay example with the conditional densities p(alpha | beta=7.5) and p(beta | alpha=0.82), shown both as densities and as log densities.]
5 / 30
Normal approximation

• Normal approximation

  $$p(\theta \mid y) \approx \mathrm{N}(\hat\theta, [I(\hat\theta)]^{-1})$$

  where I(θ) is called the observed information,

  $$I(\theta) = -\frac{d^2}{d\theta^2}\log p(\theta \mid y)$$

  and the Hessian of the log posterior is H(θ) = −I(θ)

6 / 30
Normal approximation

• I(θ) is called the observed information

  $$I(\theta) = -\frac{d^2}{d\theta^2}\log p(\theta \mid y)$$

• I(θ̂) is the second derivative at the mode and thus describes the curvature at the mode
• if the mode is inside the parameter space, I(θ̂) is positive
• if θ is a vector, then I(θ) is a matrix

7 / 30
Normal approximation

• BDA3 Ch 4 has an example where it is easy to compute the first and second derivatives and there is an easy analytic solution for where the first derivative is zero (see the illustration below)

8 / 30
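For instance, for a binomial model with a uniform prior (a simple illustration of such an analytic computation, not necessarily the exact example in BDA3):

$$\log p(\theta \mid y) = \mathrm{const} + y\log\theta + (n-y)\log(1-\theta), \qquad \hat\theta = \frac{y}{n}, \qquad I(\hat\theta) = \frac{n}{\hat\theta(1-\hat\theta)},$$

so the normal approximation is $p(\theta \mid y) \approx \mathrm{N}\!\left(\hat\theta,\ \hat\theta(1-\hat\theta)/n\right)$.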
Normal approximation – numerically
• Normal approximation can be computed numerically
• iterative optimization to find a mode (may use gradients)
• autodiff or finite-difference for gradients and Hessian
• e.g. in R, demo4_1.R:
bioassayfun <- function(w, df) {
  # negative log posterior of the bioassay logistic regression (uniform prior)
  z <- w[1] + w[2]*df$x
  -sum(df$y*z - df$n*log1p(exp(z)))
}

# df1 is a data frame with the bioassay data (x, n, y)
theta0 <- c(0, 0)
optimres <- optim(theta0, bioassayfun, gr=NULL, df1, hessian=TRUE)
thetahat <- optimres$par
Sigma <- solve(optimres$hessian)

9 / 30
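A possible continuation (a sketch, not part of demo4_1.R): draw from the resulting normal approximation, here using the mvtnorm package.

library(mvtnorm)
# draw from the normal approximation N(thetahat, Sigma)
draws <- rmvnorm(4000, mean = thetahat, sigma = Sigma)
colnames(draws) <- c("alpha", "beta")
# e.g. summarize LD50 = -alpha/beta for draws with beta > 0
ld50 <- -draws[, "alpha"] / draws[, "beta"]
quantile(ld50[draws[, "beta"] > 0], probs = c(0.05, 0.5, 0.95))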
Normal approximation – numerically
• Normal approximation can be computed numerically
• iterative optimization to find a mode (may use gradients)
• autodiff or finite-difference for gradients and Hessian
• CmdStan(R) has a Laplace algorithm (a sketch follows below)
• uses the L-BFGS quasi-Newton optimization algorithm for finding the mode
• uses autodiff for gradients
• uses finite differences of gradients to compute the Hessian
• second-order autodiff is in progress

10 / 30
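With CmdStanR the same steps could look roughly like this (a sketch: bioassay.stan and standata are hypothetical names, and the exact method arguments depend on the CmdStanR/CmdStan version):

library(cmdstanr)
mod <- cmdstan_model("bioassay.stan")   # hypothetical Stan file for the bioassay model
# Laplace: L-BFGS optimization to the mode, Hessian via finite differences of gradients
fit_laplace <- mod$laplace(data = standata)
fit_laplace$summary()
# Pathfinder is available similarly
fit_pathfinder <- mod$pathfinder(data = standata)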
Normal approximation

• Optimization and computation of the Hessian usually require far fewer density evaluations than MCMC
• In some cases accuracy is sufficient
• In some cases accuracy for a conditional distribution is
sufficient (Ch 13)
• e.g. Gaussian latent variable models, such as Gaussian
processes (Ch 21) and Gaussian Markov random fields
• Rasmussen & Williams: Gaussian Processes for Machine
Learning
• CS-E4895 - Gaussian Processes (in spring)
• Accuracy can be improved by importance sampling (Ch 10)

11 / 30
Example: Importance sampling in Bioassay

[Figure: three rows of panels (Grid, Normal, IS), each showing posterior draws of (α, β) and a histogram of LD50 = −α/β; for the normal approximation the LD50 histogram is restricted to β > 0.]

But the normal approximation is not that good here:
Grid sd(LD50) ≈ 0.1, Normal sd(LD50) ≈ 0.75!

Importance sampling, with the normal approximation as the proposal, fixes this:
Grid sd(LD50) ≈ 0.1, IS sd(LD50) ≈ 0.1

12 / 30
Normal approximation

• Accuracy can be improved by importance sampling
• the Pareto-k diagnostic of the importance sampling weights can be used to assess the reliability of the correction
• in the Bioassay example k = 0.57, which is ok
• CmdStan(R) has a Laplace algorithm
• since version 2.33 (2023)
+ Pareto-k diagnostic via the posterior package
+ importance resampling (IR) via the posterior package (see the sketch below)

13 / 30
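A hand-rolled sketch of the same idea using the loo and mvtnorm packages (the slide refers to the posterior package; this sketch only illustrates the computation, reusing bioassayfun, df1, thetahat, Sigma, and draws from the earlier sketches):

library(loo)
library(mvtnorm)
# log target (unnormalized posterior) and log proposal (normal approximation)
# evaluated at the draws from the normal approximation
log_p <- apply(draws, 1, function(w) -bioassayfun(w, df1))
log_q <- dmvnorm(draws, mean = thetahat, sigma = Sigma, log = TRUE)

psis_fit <- psis(log_p - log_q)        # Pareto smoothed importance sampling
psis_fit$diagnostics$pareto_k          # Pareto-k diagnostic

# importance resampling with the smoothed weights
w <- as.vector(weights(psis_fit, log = FALSE))
draws_ir <- draws[sample(nrow(draws), 4000, replace = TRUE, prob = w), ]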
Normal approximation and parameter transformations
• Normal approximation is not good for parameters with bounded or half-bounded support
• e.g. θ ∈ [0, 1] representing a probability
• Stan code can include constraints
  real<lower=0, upper=1> theta;
• for this, Stan does the inference in the unconstrained space using the logit transformation
• the density of the transformed parameter needs to include the Jacobian of the transformation (BDA3 p. 21)

14 / 30
Normal approximation and parameter transformations
Binomial model y ∼ Bin(θ, N), with data y = 9, N = 10
With Beta(1, 1) prior, the posterior is Beta(9 + 1, 1 + 1)

15 / 30
Normal approximation and parameter transformations
With Beta(1, 1) prior, the posterior is Beta(9 + 1, 1 + 1)
Stan computes only the unnormalized posterior q(θ|y)

16 / 30
Normal approximation and parameter transformations
With Beta(1, 1) prior, the posterior is Beta(9 + 1, 1 + 1)
For illustration purposes we normalize the Stan result q(θ|y)

17 / 30
Normal approximation and parameter transformations
With Beta(1, 1) prior, the posterior is Beta(9 + 1, 1 + 1)
Beta(9 + 1, 1 + 1), but x-axis shows the unconstrained logit(θ)

18 / 30
Normal approximation and parameter transformations
...but we need to take into account the absolute value of the
determinant of the Jacobian of the transformation θ(1 − θ)

19 / 30
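A rough R sketch of this computation (illustrative only; the function and variable names are made up for this Beta(10, 2) example, not taken from the course code):

# log posterior of theta ~ Beta(10, 2), written on the unconstrained scale phi = logit(theta)
# the Jacobian of theta = inv_logit(phi) is theta*(1 - theta), added on the log scale
log_post_unconstrained <- function(phi) {
  theta <- plogis(phi)  # inverse logit
  dbeta(theta, 10, 2, log = TRUE) + log(theta) + log1p(-theta)
}

# Laplace approximation on the unconstrained scale
opt <- optimize(log_post_unconstrained, interval = c(-10, 10), maximum = TRUE)
phihat <- opt$maximum
h <- 1e-4  # finite-difference second derivative for the curvature at the mode
d2 <- (log_post_unconstrained(phihat + h) - 2*log_post_unconstrained(phihat) +
       log_post_unconstrained(phihat - h)) / h^2
sigma_phi <- sqrt(-1/d2)

# draws on the unconstrained scale, inverse-transformed back to theta
phi_draws <- rnorm(4000, phihat, sigma_phi)
theta_draws <- plogis(phi_draws)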
Normal approximation and parameter transformations
...but we need to take into account the Jacobian θ(1 − θ)
Let’s compare a wrong normal approximation...

20 / 30
Normal approximation and parameter transformations
...but we need to take into account the Jacobian θ(1 − θ)
Let’s compare a wrong normal approximation and a correct one

21 / 30
Normal approximation and parameter transformations
Let’s compare a wrong normal approximation and a correct one
Sample from both approximations and show KDEs for the draws

22 / 30
Normal approximation and parameter transformations
Let’s compare a wrong normal approximation and a correct one
Inverse transform the draws and show KDEs

23 / 30
Normal approximation and parameter transformations
Laplace approximation can be further improved with importance
resampling

24 / 30
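Continuing the sketch above, the importance resampling correction against the exact (unnormalized) log posterior could look like this (again with illustrative names only):

# importance weights: target log density minus approximation log density
log_q <- dnorm(phi_draws, phihat, sigma_phi, log = TRUE)
log_p <- log_post_unconstrained(phi_draws)
lw <- log_p - log_q
w <- exp(lw - max(lw))  # stabilized, unnormalized weights

# resample with the importance weights
idx <- sample(seq_along(w), size = 4000, replace = TRUE, prob = w)
theta_ir <- plogis(phi_draws[idx])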
Other distributional approximations

• Higher-order derivatives at the mode can be used
• Split-normal and split-t by Geweke (1989) use additional scaling along different principal axes
• Other distributions can be used (e.g. t-distribution)
• Instead of the mode and the Hessian at the mode, e.g.
• variational inference (Ch 13)
• CS-E4820 - Machine Learning: Advanced Probabilistic Methods
• CS-E4895 - Gaussian Processes
• Stan has the ADVI algorithm (the implementation is not very good)
• Stan has the Pathfinder algorithm (CmdStanR, brms)
• instead of a normal, methods with flexible flow transformations
• expectation propagation (Ch 13)
• the speed of these is usually between optimization and MCMC
• stochastic variational inference can be even slower than MCMC

25 / 30
Pathfinder: Parallel quasi-Newton variational inference.

[Figure: the sequence of normal approximations formed along the quasi-Newton optimization path, with the estimated ELBO at selected iterations, shown for two example target densities; in one the estimated ELBO stabilizes around −329.6, in the other it fluctuates strongly between iterations.]

Zhang, Carpenter, Gelman, and Vehtari (2022). Pathfinder: Parallel quasi-Newton variational inference. Journal of Machine Learning Research, 23(306):1–49.

26 / 30
Pathfinder: Parallel quasi-Newton variational inference.
The Birthdays case study uses Pathfinder to speed up the workflow:
https://users.aalto.fi/~ave/casestudies/Birthdays/birthdays.html

[Figure: components of the Birthdays model as relative number of births: long-term trend, seasonal pattern over the year, day-of-week effect (e.g. Mon, Tue, Sat, Sun), and day-of-year effects such as New year, Valentine's day, Leap day, April 1st, Memorial day, Independence day, Labor day, Halloween, Thanksgiving, and Christmas.]

27 / 30
Distributional approximations
Exact, Normal at mode, Normal with variational inference

[Figure: posterior draws of (alpha, beta) for the bioassay example from the exact (grid) posterior, the normal approximation at the mode, and the variational-inference normal approximation, together with the conditional densities p(alpha | beta=7.5) and p(beta | alpha=0.82).]

Grid sd(LD50) ≈ 0.090,
Normal sd(LD50) ≈ 0.75, Normal + IR sd(LD50) ≈ 0.096 (Pareto-k = 0.57)
VI sd(LD50) ≈ 0.13, VI + IR sd(LD50) ≈ 0.095 (Pareto-k = 0.17)

28 / 30
Variational inference

• Variational inference includes a large number of methods
• For a restricted set of models, it is possible to derive deterministic algorithms
• can be fast and can be relatively accurate
• Using stochastic (Monte Carlo) estimation of the divergence, it is possible to derive generic black box algorithms
• it is also possible to use mini-batching
• can be fast and provide a better predictive distribution than the Laplace approximation if the posterior is far from normal
• in general, unlikely to achieve the accuracy of HMC with the same computation cost
• with an increasing number of posterior dimensions, the obtained approximation gets worse (Dhaka, Catalina, Andersen, Magnusson, Huggins, and Vehtari, 2020)
• with an increasing number of posterior dimensions, the stochastic divergence estimate gets worse and flows have problems, too (Dhaka, Catalina, Andersen, Welandawe, Huggins, and Vehtari, 2021)
29 / 30
brms supports Laplace / Pathfinder / ADVI
These might be useful for initializing MCMC or for big data. The ADVI implementation is not very good.

fit1 <- brm(..., algorithm = "laplace")

fit1 <- brm(..., algorithm = "pathfinder")

fit1 <- brm(..., algorithm = "meanfield")

fit1 <- brm(..., algorithm = "fullrank")

30 / 30
