BDA_lecture_10b
Normal approximation (Laplace approximation)
• Often the posterior converges to a normal distribution as n → ∞
  • bounded, non-singular, and the number of parameters does not grow with n
  • we can then approximate p(θ|y) with a normal distribution
• Laplace used this (before Gauss) to approximate the posterior of a binomial model to infer the ratio of girls and boys born
[Figure: scatter plots of posterior draws of (alpha, beta) for the bioassay example and a histogram of LD50 = −alpha/beta]
[Figure: the same draws under the normal approximation, with LD50 = −alpha/beta computed for beta > 0]
Taylor series
f(θ) = f(θ̂) + [df(θ′)/dθ′]|_{θ′=θ̂} (θ − θ̂) + (1/2!) (θ − θ̂)ᵀ [d²f(θ′)/dθ′²]|_{θ′=θ̂} (θ − θ̂) + …
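As a quick numerical sanity check (an illustrative sketch, not part of the lecture code; the Beta(10, 2) density is the posterior of the binomial example later in this lecture), the second-order expansion of a log posterior around its mode is accurate near the mode:

```python
import math

# Unnormalized log posterior of Beta(10, 2): 9*log(theta) + log(1 - theta)
def f(t):
    return 9 * math.log(t) + math.log(1 - t)

mode = 0.9                      # analytic mode: 9 / (9 + 1)
h = 1e-5
# First and second derivatives at the mode by central finite differences
d1 = (f(mode + h) - f(mode - h)) / (2 * h)               # ~0 at the mode
d2 = (f(mode + h) - 2 * f(mode) + f(mode - h)) / h**2    # negative curvature

# Second-order Taylor expansion around the mode
def taylor2(t):
    return f(mode) + d1 * (t - mode) + 0.5 * d2 * (t - mode) ** 2

# Near the mode the expansion is accurate
print(abs(f(0.92) - taylor2(0.92)))   # small (third-order remainder)
```

At the mode the first-order term vanishes, so what remains is exactly the quadratic form that the normal approximation uses.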
Normal approximation
• Multivariate normal ∝ |Σ|^(−1/2) exp(−½ (θ − θ̂)ᵀ Σ⁻¹ (θ − θ̂))
[Figure: bioassay posterior contours of (alpha, beta) with the normal approximation overlaid; lower panels show the conditional densities p(alpha | beta=7.5) and p(beta | alpha=0.82)]
[Figure: as previous, with the corresponding conditional log densities in the lower panels]
Normal approximation
• Normal approximation: p(θ|y) ≈ N(θ̂, I(θ̂)⁻¹), where θ̂ is the posterior mode and

      I(θ) = − d²/dθ² log p(θ|y)

  is the observed information
• Hessian H(θ) = −I(θ)
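These definitions translate directly to code. A minimal sketch (illustrative, not the lecture's code) for the Beta(10, 2) posterior used later in this lecture: find the mode with Newton's method and set the approximate variance to the inverse of the observed information:

```python
import math

# Unnormalized log posterior of Beta(10, 2)
def log_post(t):
    return 9 * math.log(t) + math.log(1 - t)

# Newton's method on the score to find the posterior mode
theta = 0.5
for _ in range(20):
    g  = 9 / theta - 1 / (1 - theta)             # d/dtheta log p
    gp = -9 / theta**2 - 1 / (1 - theta)**2      # d^2/dtheta^2 log p
    theta = theta - g / gp

# Observed information I(theta) = -d^2/dtheta^2 log p(theta|y)
info = 9 / theta**2 + 1 / (1 - theta)**2

# Normal approximation N(mode, I(mode)^{-1})
mode, sd = theta, 1 / math.sqrt(info)
print(mode, sd)   # mode 0.9, sd ~0.095
```

For this density the true posterior mean is 10/12 ≈ 0.83, so the approximation centered at the mode 0.9 is visibly skew-blind — the point made by the transformation slides below.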
Normal approximation – numerically
• The normal approximation can be computed numerically
  • iterative optimization to find a mode (may use gradients)
  • autodiff or finite differences for the gradients and the Hessian
• e.g. in R, demo4_1.R:

      # negative unnormalized log posterior of the bioassay model
      bioassayfun <- function(w, df) {
        z <- w[1] + w[2]*df$x               # linear predictor alpha + beta*x
        -sum(df$y*z - df$n*log1p(exp(z)))   # minus the binomial log likelihood
      }
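A Python counterpart of the R demo, as an illustrative sketch (the dose–response data are the standard BDA3 bioassay values; Newton's method with analytic derivatives stands in for R's optimizer):

```python
import numpy as np

# Bioassay data (BDA3): dose, number of animals, number of deaths
x = np.array([-0.86, -0.30, -0.05, 0.73])
n = np.array([5, 5, 5, 5])
y = np.array([0, 1, 3, 5])

def neg_log_post(w):
    # negative unnormalized log posterior with a flat prior on (alpha, beta)
    z = w[0] + w[1] * x
    return -np.sum(y * z - n * np.log1p(np.exp(z)))

# Newton's method with analytic gradient and Hessian of the log posterior
w = np.zeros(2)
for _ in range(50):
    p = 1 / (1 + np.exp(-(w[0] + w[1] * x)))       # fitted death probabilities
    grad = np.array([np.sum(y - n * p), np.sum(x * (y - n * p))])
    wgt = n * p * (1 - p)
    hess = -np.array([[np.sum(wgt),     np.sum(x * wgt)],
                      [np.sum(x * wgt), np.sum(x**2 * wgt)]])
    w = w - np.linalg.solve(hess, grad)

# Normal approximation: mean = mode, covariance = inverse observed information
cov = np.linalg.inv(-hess)
print(w)                 # posterior mode (alpha, beta)
print(neg_log_post(w))   # value of the objective at the mode
print(cov)               # covariance of the normal approximation
```

The Hessian here is computed analytically; an autodiff or finite-difference Hessian at the found mode gives the same covariance, which is what the L-BFGS-based tools below do.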
• CmdStan(R) has a Laplace algorithm
  • uses the L-BFGS quasi-Newton optimization algorithm to find the mode
  • uses autodiff for gradients
  • uses finite differences of the gradients to compute the Hessian
  • second-order autodiff is in progress
Example: Importance sampling in Bioassay
[Figure: posterior draws of (α, β) and histograms of LD50 = −α/β from the grid (top row), from the normal approximation with LD50 restricted to β > 0 (middle row), and after importance sampling (IS, bottom row)]
Normal approximation and parameter transformations
• The normal approximation is not good for parameters with bounded or half-bounded support
  • e.g. θ ∈ [0, 1] representing a probability
• Stan code can include constraints

      real<lower=0, upper=1> theta;
• for such constraints, Stan does the inference in the unconstrained space using the logit transformation
• the density of the transformed parameter needs to include the Jacobian of the transformation (BDA3 p. 21)
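The Jacobian adjustment can be checked numerically. An illustrative sketch using the Beta(9 + 1, 1 + 1) posterior of the binomial example below: with the factor θ(1 − θ) the transformed density integrates to one over logit(θ); without it, it does not:

```python
import numpy as np

# Posterior Beta(10, 2); density p(theta) = 110 * theta^9 * (1 - theta),
# since 1/B(10, 2) = 110
def p_theta(theta):
    return 110 * theta**9 * (1 - theta)

# Unconstraining transform eta = logit(theta); inverse theta = logistic(eta)
eta = np.linspace(-12, 12, 48001)
theta = 1 / (1 + np.exp(-eta))

# Density in the unconstrained space must include the Jacobian
# |d theta / d eta| = theta * (1 - theta)
q_correct = p_theta(theta) * theta * (1 - theta)
q_wrong = p_theta(theta)            # Jacobian forgotten

d = eta[1] - eta[0]
Z_correct = np.sum(q_correct) * d
Z_wrong = np.sum(q_wrong) * d
print(Z_correct)  # ~1.0: a proper density in eta
print(Z_wrong)    # ~12.2 (= 110/9): not a density at all
```

Forgetting the Jacobian is exactly the "wrong normal approximation" compared on the following slides.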
Normal approximation and parameter transformations
Binomial model y ∼ Bin(N, θ), with data y = 9, N = 10
With a Beta(1, 1) prior, the posterior is Beta(9 + 1, 1 + 1)
Stan computes only the unnormalized posterior q(θ|y); for illustration purposes we normalize the Stan result
The same posterior can be plotted with the x-axis showing the unconstrained logit(θ), but then we need to take into account the absolute value of the determinant of the Jacobian of the transformation, θ(1 − θ)
Let's compare a wrong normal approximation (made without the Jacobian) and the correct one: sample from both approximations, inverse transform the draws, and show KDEs of the draws
Normal approximation and parameter transformations
Laplace approximation can be further improved with importance
resampling
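A sketch of the importance-resampling idea on a one-dimensional example (illustrative code, not the lecture's): draw from the Laplace approximation of the Beta(10, 2) posterior, weight the draws by the ratio of target to approximation, and resample in proportion to the weights:

```python
import numpy as np

rng = np.random.default_rng(42)

# Target: unnormalized Beta(10, 2) posterior density (zero outside (0, 1))
def q_target(t):
    return np.where((t > 0) & (t < 1), t**9 * (1 - t), 0.0)

# Laplace approximation N(0.9, 0.0949^2): mode and curvature of Beta(10, 2)
mu, sd = 0.9, 0.0949
draws = rng.normal(mu, sd, size=100_000)

# Importance weights: target / approximation (unnormalized densities are
# fine because the weights are self-normalized below)
q_approx = np.exp(-0.5 * ((draws - mu) / sd) ** 2)
w = q_target(draws) / q_approx
w = w / w.sum()

# Importance resampling: draw indices with probabilities given by the weights
idx = rng.choice(draws.size, size=10_000, p=w)
resampled = draws[idx]

print(draws.mean())      # ~0.9, the Laplace mean
print(resampled.mean())  # ~0.833, close to the true posterior mean 10/12
```

The resampled draws also respect the support: draws outside (0, 1) get zero weight and disappear, which the plain Laplace approximation cannot do.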
Other distributional approximations
Pathfinder: Parallel quasi-Newton variational inference
Zhang, Carpenter, Gelman, and Vehtari (2022). Pathfinder: Parallel quasi-Newton variational inference. Journal of Machine Learning Research, 23(306):1–49.
Pathfinder: Parallel quasi-Newton variational inference.
Birthdays case study uses Pathfinder to speed up workflow
https://users.aalto.fi/~ave/casestudies/Birthdays/birthdays.html
[Figure: relative number of births by day of year and over time, with weekday effects (Mon–Sun) and special days such as Valentine's day, Leap day, April 1st, Memorial day, Independence day, Labor day, Halloween, Thanksgiving, Christmas, and New year]
Distributional approximations
Exact, Normal at mode, Normal with variational inference
[Figure: posterior draws of (alpha, beta) from the exact posterior, the normal approximation at the mode, and the variational normal approximation; lower panels show the conditional densities p(alpha | beta=7.5) and p(beta | alpha=0.82)]
Grid sd(LD50) ≈ 0.090, Normal sd(LD50) ≈ 0.75, Normal + IR sd(LD50) ≈ 0.096 (Pareto-k = 0.57)
VI sd(LD50) ≈ 0.13, VI + IR sd(LD50) ≈ 0.095 (Pareto-k = 0.17)
Variational inference