BDA_lecture_10b

Chapter 4 covers the normal approximation (Laplace's method) and large-sample theory, emphasizing the convergence of posterior distributions to normality as the sample size increases. It also presents counterexamples, including posteriors that are difficult for MCMC, and touches on frequency evaluation and Taylor series approximations of the posterior distribution.


Chapter 4

• 4.1 Normal approximation (Laplace’s method)


• 4.2 Large-sample theory
• 4.3 Counterexamples
• includes examples of posteriors that are difficult for MCMC, too
• 4.4 Frequency evaluation*
• 4.5 Other statistical methods*

1 / 30
Normal approximation (Laplace approximation)
• Often the posterior converges to a normal distribution when n → ∞
• bounded, non-singular, the number of parameters doesn't grow with n
• we can then approximate p(θ|y) with a normal distribution
• Laplace used this (before Gauss) to approximate the posterior of a binomial model to infer the ratio of girls to boys born

[Figure: scatter plots of posterior draws of (alpha, beta) for the bioassay example with histograms of LD50 = −alpha/beta; in the second version the LD50 histogram is computed only for draws with beta > 0.]
2 / 30
Taylor series

• We can approximate p(θ|y) with a normal distribution

  $$p(\theta \mid y) \approx \frac{1}{\sqrt{2\pi}\,\sigma_\theta} \exp\!\left(-\frac{1}{2\sigma_\theta^2}(\theta-\hat\theta)^2\right)$$

• i.e. the log posterior log p(θ|y) can be approximated with a quadratic function

  $$\log p(\theta \mid y) \approx \alpha(\theta-\hat\theta)^2 + C$$

• Corresponds to a Taylor series expansion around θ = θ̂

  $$f(\theta) = f(\hat\theta) + f'(\hat\theta)(\theta-\hat\theta) + \frac{f''(\hat\theta)}{2!}(\theta-\hat\theta)^2 + \frac{f^{(3)}(\hat\theta)}{3!}(\theta-\hat\theta)^3 + \ldots$$

• if θ̂ is at the mode, then f′(θ̂) = 0
• often when n → ∞, the remainder terms $\frac{f^{(3)}(\hat\theta)}{3!}(\theta-\hat\theta)^3 + \ldots$ are small

3 / 30
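To spell out the connection: take f(θ) = log p(θ|y); at the mode f′(θ̂) = 0, so dropping the higher-order terms gives

$$\log p(\theta \mid y) \approx \log p(\hat\theta \mid y) + \frac{f''(\hat\theta)}{2}(\theta-\hat\theta)^2,$$

which is the quadratic above with $\alpha = f''(\hat\theta)/2$ and corresponds to a normal distribution with $\sigma_\theta^2 = -1/f''(\hat\theta)$.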
Multivariate Taylor series

• Multivariate series expansion

$$f(\theta) = f(\hat\theta) + \left.\frac{df(\theta')}{d\theta'}\right|_{\theta'=\hat\theta}(\theta-\hat\theta) + \frac{1}{2!}(\theta-\hat\theta)^T \left.\frac{d^2 f(\theta')}{d\theta'^2}\right|_{\theta'=\hat\theta}(\theta-\hat\theta) + \ldots$$

4 / 30
Normal approximation

• Taylor series expansion of the log posterior around the posterior mode θ̂

  $$\log p(\theta \mid y) = \log p(\hat\theta \mid y) + \frac{1}{2}(\theta-\hat\theta)^T \left[\frac{d^2}{d\theta^2}\log p(\theta' \mid y)\right]_{\theta'=\hat\theta}(\theta-\hat\theta) + \ldots$$

• Multivariate normal: $\propto |\Sigma|^{-1/2}\exp\!\left(-\frac{1}{2}(\theta-\hat\theta)^T\Sigma^{-1}(\theta-\hat\theta)\right)$

[Figure: posterior draws of (alpha, beta) for the bioassay example with the conditional densities p(alpha | beta=7.5) and p(beta | alpha=0.82), shown both as densities and as log densities.]
5 / 30
Normal approximation

• Normal approximation

  $$p(\theta \mid y) \approx \mathrm{N}(\hat\theta, [I(\hat\theta)]^{-1})$$

  where I(θ) is called the observed information,

  $$I(\theta) = -\frac{d^2}{d\theta^2}\log p(\theta \mid y)$$

  and the Hessian of the log posterior is H(θ) = −I(θ)

6 / 30
Normal approximation

• I(θ) is called the observed information

  $$I(\theta) = -\frac{d^2}{d\theta^2}\log p(\theta \mid y)$$

• I(θ̂) is the second derivative at the mode and thus describes the curvature at the mode
• if the mode is inside the parameter space, I(θ̂) is positive
• if θ is a vector, then I(θ) is a matrix

7 / 30
Normal approximation

• BDA3 Ch 4 has an example where it is easy to compute the first and second derivatives and there is an easy analytic solution for where the first derivative is zero (see the illustration below)

8 / 30
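For instance, for a binomial model with a uniform prior (a simple illustration of such an analytic computation, not necessarily the exact example in BDA3):

$$\log p(\theta \mid y) = \mathrm{const} + y\log\theta + (n-y)\log(1-\theta), \qquad \hat\theta = \frac{y}{n}, \qquad I(\hat\theta) = \frac{n}{\hat\theta(1-\hat\theta)},$$

so the normal approximation is $p(\theta \mid y) \approx \mathrm{N}\!\left(\hat\theta,\ \hat\theta(1-\hat\theta)/n\right)$.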
Normal approximation – numerically
• Normal approximation can be computed numerically
• iterative optimization to find a mode (may use gradients)
• autodiff or finite-difference for gradients and Hessian
• e.g. in R, demo4_1.R:
bioassayfun <- function(w, df) {
  # negative log posterior of the bioassay logistic regression (uniform prior)
  z <- w[1] + w[2]*df$x
  -sum(df$y*z - df$n*log1p(exp(z)))
}

# df1 is a data frame with the bioassay data (x, n, y)
theta0 <- c(0, 0)
optimres <- optim(theta0, bioassayfun, gr=NULL, df1, hessian=TRUE)
thetahat <- optimres$par
Sigma <- solve(optimres$hessian)

9 / 30
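A possible continuation (a sketch, not part of demo4_1.R): draw from the resulting normal approximation, here using the mvtnorm package.

library(mvtnorm)
# draw from the normal approximation N(thetahat, Sigma)
draws <- rmvnorm(4000, mean = thetahat, sigma = Sigma)
colnames(draws) <- c("alpha", "beta")
# e.g. summarize LD50 = -alpha/beta for draws with beta > 0
ld50 <- -draws[, "alpha"] / draws[, "beta"]
quantile(ld50[draws[, "beta"] > 0], probs = c(0.05, 0.5, 0.95))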
Normal approximation – numerically
• Normal approximation can be computed numerically
• iterative optimization to find a mode (may use gradients)
• autodiff or finite-difference for gradients and Hessian
• CmdStan(R) has a Laplace algorithm (a sketch follows below)
• uses the L-BFGS quasi-Newton optimization algorithm for finding the mode
• uses autodiff for gradients
• uses finite differences of gradients to compute the Hessian
• second-order autodiff is in progress

10 / 30
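With CmdStanR the same steps could look roughly like this (a sketch: bioassay.stan and standata are hypothetical names, and the exact method arguments depend on the CmdStanR/CmdStan version):

library(cmdstanr)
mod <- cmdstan_model("bioassay.stan")   # hypothetical Stan file for the bioassay model
# Laplace: L-BFGS optimization to the mode, Hessian via finite differences of gradients
fit_laplace <- mod$laplace(data = standata)
fit_laplace$summary()
# Pathfinder is available similarly
fit_pathfinder <- mod$pathfinder(data = standata)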
Normal approximation

• Optimization and computation of the Hessian usually require far fewer density evaluations than MCMC
• In some cases accuracy is sufficient
• In some cases accuracy for a conditional distribution is
sufficient (Ch 13)
• e.g. Gaussian latent variable models, such as Gaussian
processes (Ch 21) and Gaussian Markov random fields
• Rasmussen & Williams: Gaussian Processes for Machine
Learning
• CS-E4895 - Gaussian Processes (in spring)
• Accuracy can be improved by importance sampling (Ch 10)

11 / 30
Example: Importance sampling in Bioassay

[Figure: three rows of panels (Grid, Normal, IS), each showing posterior draws of (α, β) and a histogram of LD50 = −α/β; for the normal approximation the LD50 histogram is restricted to β > 0.]

But the normal approximation is not that good here:
Grid sd(LD50) ≈ 0.1, Normal sd(LD50) ≈ 0.75!

Importance sampling, with the normal approximation as the proposal, fixes this:
Grid sd(LD50) ≈ 0.1, IS sd(LD50) ≈ 0.1

12 / 30
Normal approximation

• Accuracy can be improved by importance sampling
• the Pareto-k diagnostic of the importance sampling weights can be used to assess the reliability of the correction
• in the Bioassay example k = 0.57, which is ok
• CmdStan(R) has a Laplace algorithm
• since version 2.33 (2023)
+ Pareto-k diagnostic via the posterior package
+ importance resampling (IR) via the posterior package (see the sketch below)

13 / 30
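A hand-rolled sketch of the same idea using the loo and mvtnorm packages (the slide refers to the posterior package; this sketch only illustrates the computation, reusing bioassayfun, df1, thetahat, Sigma, and draws from the earlier sketches):

library(loo)
library(mvtnorm)
# log target (unnormalized posterior) and log proposal (normal approximation)
# evaluated at the draws from the normal approximation
log_p <- apply(draws, 1, function(w) -bioassayfun(w, df1))
log_q <- dmvnorm(draws, mean = thetahat, sigma = Sigma, log = TRUE)

psis_fit <- psis(log_p - log_q)        # Pareto smoothed importance sampling
psis_fit$diagnostics$pareto_k          # Pareto-k diagnostic

# importance resampling with the smoothed weights
w <- as.vector(weights(psis_fit, log = FALSE))
draws_ir <- draws[sample(nrow(draws), 4000, replace = TRUE, prob = w), ]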
Normal approximation and parameter transformations
• Normal approximation is not good for parameters with bounded or half-bounded support
• e.g. θ ∈ [0, 1] representing a probability
• Stan code can include constraints
  real<lower=0, upper=1> theta;
• for this, Stan does the inference in the unconstrained space using the logit transformation
• the density of the transformed parameter needs to include the Jacobian of the transformation (BDA3 p. 21)

14 / 30
Normal approximation and parameter transformations
Binomial model y ∼ Bin(θ, N), with data y = 9, N = 10
With Beta(1, 1) prior, the posterior is Beta(9 + 1, 1 + 1)

15 / 30
Normal approximation and parameter transformations
With Beta(1, 1) prior, the posterior is Beta(9 + 1, 1 + 1)
Stan computes only the unnormalized posterior q(θ|y)

16 / 30
Normal approximation and parameter transformations
With Beta(1, 1) prior, the posterior is Beta(9 + 1, 1 + 1)
For illustration purposes we normalize the Stan result q(θ|y)

17 / 30
Normal approximation and parameter transformations
With Beta(1, 1) prior, the posterior is Beta(9 + 1, 1 + 1)
Beta(9 + 1, 1 + 1), but x-axis shows the unconstrained logit(θ)

18 / 30
Normal approximation and parameter transformations
...but we need to take into account the absolute value of the
determinant of the Jacobian of the transformation θ(1 − θ)

19 / 30
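A rough R sketch of this computation (illustrative only; the function and variable names are made up for this Beta(10, 2) example, not taken from the course code):

# log posterior of theta ~ Beta(10, 2), written on the unconstrained scale phi = logit(theta)
# the Jacobian of theta = inv_logit(phi) is theta*(1 - theta), added on the log scale
log_post_unconstrained <- function(phi) {
  theta <- plogis(phi)  # inverse logit
  dbeta(theta, 10, 2, log = TRUE) + log(theta) + log1p(-theta)
}

# Laplace approximation on the unconstrained scale
opt <- optimize(log_post_unconstrained, interval = c(-10, 10), maximum = TRUE)
phihat <- opt$maximum
h <- 1e-4  # finite-difference second derivative for the curvature at the mode
d2 <- (log_post_unconstrained(phihat + h) - 2*log_post_unconstrained(phihat) +
       log_post_unconstrained(phihat - h)) / h^2
sigma_phi <- sqrt(-1/d2)

# draws on the unconstrained scale, inverse-transformed back to theta
phi_draws <- rnorm(4000, phihat, sigma_phi)
theta_draws <- plogis(phi_draws)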
Normal approximation and parameter transformations
...but we need to take into account the Jacobian θ(1 − θ)
Let’s compare a wrong normal approximation...

20 / 30
Normal approximation and parameter transformations
...but we need to take into account the Jacobian θ(1 − θ)
Let’s compare a wrong normal approximation and a correct one

21 / 30
Normal approximation and parameter transformations
Let’s compare a wrong normal approximation and a correct one
Sample from both approximations and show KDEs for the draws

22 / 30
Normal approximation and parameter transformations
Let’s compare a wrong normal approximation and a correct one
Inverse transform the draws and show KDEs

23 / 30
Normal approximation and parameter transformations
Laplace approximation can be further improved with importance
resampling

24 / 30
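Continuing the sketch above, the importance resampling correction against the exact (unnormalized) log posterior could look like this (again with illustrative names only):

# importance weights: target log density minus approximation log density
log_q <- dnorm(phi_draws, phihat, sigma_phi, log = TRUE)
log_p <- log_post_unconstrained(phi_draws)
lw <- log_p - log_q
w <- exp(lw - max(lw))  # stabilized, unnormalized weights

# resample with the importance weights
idx <- sample(seq_along(w), size = 4000, replace = TRUE, prob = w)
theta_ir <- plogis(phi_draws[idx])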
Other distributional approximations

• Higher-order derivatives at the mode can be used
• Split-normal and split-t by Geweke (1989) use additional scaling along different principal axes
• Other distributions can be used (e.g. t-distribution)
• Instead of the mode and the Hessian at the mode, e.g.
• variational inference (Ch 13)
• CS-E4820 - Machine Learning: Advanced Probabilistic Methods
• CS-E4895 - Gaussian Processes
• Stan has the ADVI algorithm (the implementation is not very good)
• Stan has the Pathfinder algorithm (CmdStanR, brms)
• instead of a normal, methods with flexible flow transformations
• expectation propagation (Ch 13)
• the speed of these is usually between optimization and MCMC
• stochastic variational inference can be even slower than MCMC

25 / 30
Pathfinder: Parallel quasi-Newton variational inference.

[Figure: the sequence of normal approximations formed along the quasi-Newton optimization path, with the estimated ELBO at selected iterations, shown for two example target densities; in one the estimated ELBO stabilizes around −329.6, in the other it fluctuates strongly between iterations.]

Zhang, Carpenter, Gelman, and Vehtari (2022). Pathfinder: Parallel quasi-Newton variational inference. Journal of Machine Learning Research, 23(306):1–49.

26 / 30
Pathfinder: Parallel quasi-Newton variational inference.
The Birthdays case study uses Pathfinder to speed up the workflow:
https://users.aalto.fi/~ave/casestudies/Birthdays/birthdays.html

[Figure: components of the Birthdays model as relative number of births: long-term trend, seasonal pattern over the year, day-of-week effect (e.g. Mon, Tue, Sat, Sun), and day-of-year effects such as New year, Valentine's day, Leap day, April 1st, Memorial day, Independence day, Labor day, Halloween, Thanksgiving, and Christmas.]

27 / 30
Distributional approximations
Exact, Normal at mode, Normal with variational inference

[Figure: posterior draws of (alpha, beta) for the bioassay example from the exact (grid) posterior, the normal approximation at the mode, and the variational-inference normal approximation, together with the conditional densities p(alpha | beta=7.5) and p(beta | alpha=0.82).]

Grid sd(LD50) ≈ 0.090,
Normal sd(LD50) ≈ 0.75, Normal + IR sd(LD50) ≈ 0.096 (Pareto-k = 0.57)
VI sd(LD50) ≈ 0.13, VI + IR sd(LD50) ≈ 0.095 (Pareto-k = 0.17)

28 / 30
Variational inference

• Variational inference includes a large number of methods
• For a restricted set of models, it is possible to derive deterministic algorithms
• can be fast and can be relatively accurate
• Using stochastic (Monte Carlo) estimation of the divergence, it is possible to derive generic black box algorithms
• it is also possible to use mini-batching
• can be fast and provide a better predictive distribution than the Laplace approximation if the posterior is far from normal
• in general, unlikely to achieve the accuracy of HMC with the same computation cost
• with an increasing number of posterior dimensions, the obtained approximation gets worse (Dhaka, Catalina, Andersen, Magnusson, Huggins, and Vehtari, 2020)
• with an increasing number of posterior dimensions, the stochastic divergence estimate gets worse and flows have problems, too (Dhaka, Catalina, Andersen, Welandawe, Huggins, and Vehtari, 2021)
29 / 30
brms supports Laplace / Pathfinder / ADVI
These might be useful for initializing MCMC or for big data. The ADVI implementation is not very good.

fit1 <- brm(..., algorithm = "laplace")

fit1 <- brm(..., algorithm = "pathfinder")

fit1 <- brm(..., algorithm = "meanfield")

fit1 <- brm(..., algorithm = "fullrank")

30 / 30
