
Bayesian Methods: Part 1

A. Colin Cameron
Univ. of Calif. - Davis

May 2021


1. Introduction

Bayesian methods provide an alternative approach to computation and statistical inference, relative to ML estimation.
- Some researchers use a fully Bayesian approach to inference.
- Other researchers use Bayesian computational methods (with a diffuse or uninformative prior) as a tool to obtain the MLE, and then interpret the results as they would classical ML results.


Outline

1 Introduction
2 Bayesian Approach
3 Normal-normal Example
4 MCMC Example using Stata command bayes:
5 Markov Chain Monte Carlo Methods
6 Further discussion
7 Appendix: Accept/reject method
8 Some references


2. Bayesian Methods: Basic Idea

Bayesian methods for inference on θ obtain information on θ from two sources:
- the data, via the likelihood function
  - for regression this is usually L(y|θ, X)
- prior beliefs on θ
  - the prior density π(θ)
  - this is what is new.


Bayesian Methods: The posterior density


Recall Bayes' theorem: Pr[A|B] = Pr[A ∩ B] / Pr[B].
Applying this here, the posterior density for θ given data y, X is

    p(θ|y, X) = p(θ, y, X) / p(y, X)

So the posterior density of θ is

    p(θ|y, X) = L(y|θ, X) π(θ) / m(y|X)

- m(y|X) = ∫ L(y|θ, X) π(θ) dθ is called the marginal likelihood
  - problem: there is usually no tractable expression for m(y|X).

In general,

    Posterior ∝ Likelihood × Prior

Bayesian Methods: The prior density

The prior can be informative, so that it does affect p(θ|y, X)
- do this if you have strong prior information on θ.
In some simple settings, such as a doctor interpreting a medical test (see the sketch after this list),
- θ is a scalar
- there are no regressors, so the likelihood is L(y|θ)
- there can be strong prior beliefs π(θ).
The prior can be uninformative, so that it has little effect on p(θ|y, X)
- e.g. θ can take a very wide range of values (large variance).
For econometric regressions, prior beliefs are typically uninformative over all parameters, or over all but a subset of the parameters.
As N → ∞ the prior has little effect, as the likelihood dominates.
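
For instance, a minimal sketch of the medical-test calculation in Stata, with made-up prevalence and test-accuracy numbers (illustrative values, not from these slides):

* Hypothetical numbers: prior Pr(D) = 0.01, sensitivity Pr(+|D) = 0.95, Pr(+|not D) = 0.05
scalar prior = 0.01
scalar sens  = 0.95
scalar fpos  = 0.05
* Bayes theorem: posterior = likelihood x prior / marginal probability of a positive test
scalar post  = sens*prior/(sens*prior + fpos*(1 - prior))
display "Pr(disease | positive test) = " post     // roughly 0.16

Even an accurate test leaves the posterior probability well below one when the prior (the prevalence) is low, which is exactly the likelihood-times-prior logic of the preceding slide.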


Bayesian Methods: Inference

Bayesian analysis bases inference on the posterior distribution.
- the best estimate of θ is the mean or the mode of the posterior distribution
- a 95% credible interval (or "Bayesian confidence interval") for θ runs from the 2.5th to the 97.5th percentile of the posterior distribution
- no need for asymptotic theory!
Classical statisticians interpret results in the usual MLE way
- the mode or mean of the posterior is viewed as an estimate θ̂ of θ.
Until recently only very simple Bayesian models could be computed
- due to the inability to compute m(y|X) = ∫ L(y|θ, X) π(θ) dθ
  - including Bayes' (1765) original example
- MCMC methods have changed this.


3. Normal-normal Bayesian example

Suppose y|θ ~ N[θ, 100] (σ² = 100 is known from other studies).
And we have an independent sample of size N = 50 with ȳ = 10.
Classical analysis uses ȳ|θ ~ N[θ, 100/N] = N[θ, 2].
Reinterpreted as a likelihood in θ, this is proportional to a N[ȳ, 2] = N[10, 2] density.
Then the MLE is θ̂ = ȳ = 10.
Bayesian analysis introduces a prior, say θ ~ N[5, 3].
We combine the likelihood and the prior to get the posterior.
We expect
- posterior mean: between the prior mean 5 and the sample mean 10
- posterior variance: less than 2, as the prior information reduces noise
- posterior distribution: ? Generally intractable.
But here one can show that the posterior for θ is N[8, 1.2].


Prior N[5, 3] and likelihood N[10, 2] yield posterior N[8, 1.2] for θ.

[Figure: prior N[5, 3], likelihood N[10, 2], and posterior N[8, 1.2] densities plotted against θ over the range 0 to 20.]

Classical inference: θ̂ = ȳ = 10 ~ N[10, 2]
- a 95% confidence interval for θ is 10 ± 1.96√2 = (7.23, 12.77)
- i.e. if we sampled many times, then 95% of the time a similarly constructed confidence interval would include the unknown constant θ.
Bayesian inference: posterior θ ~ N[8, 1.2]
- a 95% posterior interval for θ is 8 ± 1.96√1.2 = (5.85, 10.15)
- i.e. with probability 0.95 the true value of θ lies in this interval.
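
A quick check of the two intervals, as a sketch using Stata's display calculator:

* Classical 95% confidence interval and Bayesian 95% credible interval
display "classical: " 10 - 1.96*sqrt(2)   " to " 10 + 1.96*sqrt(2)
display "credible:  "  8 - 1.96*sqrt(1.2) " to "  8 + 1.96*sqrt(1.2)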


Role of the prior and the sample size

For the normal-normal model, if y_i|μ ~ N[μ, σ²] with σ² known, and the prior is μ ~ N[μ₀, s₀²], then the posterior is μ|y ~ N[μ₁, s₁²], where
- μ₁ = s₁² [ (σ²/N)⁻¹ ȳ + (s₀²)⁻¹ μ₀ ] is the posterior mean
- s₁² = [ (σ²/N)⁻¹ + (s₀²)⁻¹ ]⁻¹ is the posterior variance
  - the inverse of a variance is called the precision.

Consider variations of the preceding example, where the posterior was μ ~ N[8, 1.2].
- with a "diffuse" prior, the Bayesian analysis gives a numerical result similar to the classical one
  - if the prior is μ ~ N[5, 100] then the posterior is μ ~ N[9.903, 1.961]
- with a large sample we get a result close to the classical result
  - if N = 5,000 then ȳ = 10 with ȳ|μ ~ N[μ, 0.02], and the posterior is μ ~ N[9.961, 0.01987].
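
A short sketch that reproduces the baseline posterior N[8, 1.2] from these precision-weighting formulas (the scalar names are my own labels):

* Posterior mean and variance for the normal-normal example
scalar sig2 = 100                // known variance of y
scalar nobs = 50
scalar ybar = 10
scalar mu0  = 5                  // prior mean
scalar s0sq = 3                  // prior variance
scalar s1sq = 1/( nobs/sig2 + 1/s0sq )                  // posterior variance = 1.2
scalar mu1  = s1sq*( (nobs/sig2)*ybar + mu0/s0sq )      // posterior mean = 8
display "posterior mean = " mu1 "   posterior variance = " s1sq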


Tractable results are rare


The tractable result for the normal-normal model (known variance) carries over to the exponential family using a conjugate prior.

    Likelihood                 Prior     Posterior
    Normal (mean μ)            Normal    Normal
    Normal (precision 1/σ²)    Gamma     Gamma
    Binomial (p)               Beta      Beta
    Poisson (μ)                Gamma     Gamma

- using a conjugate prior is like augmenting the data with a sample from the same distribution
- for the multivariate normal with precision matrix Σ⁻¹, the gamma generalizes to the Wishart.
But in general tractable results are not available
- so use numerical methods, notably MCMC
- using tractable results in subcomponents of MCMC can speed up computation.
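
As an illustration of the Binomial-Beta row, a sketch of the conjugate update; the prior parameters and data counts are made up:

* Prior p ~ Beta(a,b); observe k successes in n Bernoulli trials
* Conjugacy: the posterior is p ~ Beta(a + k, b + n - k)
scalar a = 2
scalar b = 2
scalar n = 20
scalar k = 15
display "posterior mean of p = " (a + k)/(a + b + n)    // here 17/24, about 0.71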

4. MCMC Example using Stata command bayes:


Consider a linear regression of log earnings on schooling
- men and women full-time workers in 2010.

. * Read in and summarize earnings - schooling data


. qui use mus229acs.dta, clear

. describe earnings lnearnings age education

Variable        Storage   Display    Value
    name           type    format    label      Variable label

earnings          float    %9.0g                Annual earnings in $
lnearnings        float    %9.0g                Natural logarithm of earnings
age                 int   %36.0g                Age in years
education         float    %9.0g                Educational attainment: years of schooling

. qui keep if _n <= 100

. summarize earnings lnearnings age education

    Variable         Obs        Mean    Std. dev.       Min        Max

    earnings         100       60244    46513.19       4000     318000
  lnearnings         100    10.76058    .7273709   8.294049   12.66981
         age         100       43.33     10.9342         25         65
   education         100       13.69    3.158106          0         20


MLE (equals OLS) for Comparison

Concentrate on the coefficient of education
- the MLE is 0.0852 with standard error 0.0221 and 95% CI (0.041, 0.129).

. * ML linear regression (same as OLS with iid errors)


. regress lnearnings education age, noheader

  lnearnings   Coefficient   Std. err.      t    P>|t|     [95% conf. interval]

   education      .0852959    .0221804    3.85   0.000      .0412739    .1293178
         age      .0079952    .0064063    1.25   0.215     -.0047195      .02071
       _cons      9.246449    .4546021   20.34   0.000       8.34419    10.14871


MCMC Simple overview


Markov chain Monte Carlo (MCMC) methods are a way to make draws of θ from the posterior, each draw generated from the previous draw of θ.
Metropolis-Hastings iterative procedure:
- at round s, draw a candidate θ* from a candidate distribution that depends on θ^(s-1) and possibly the data y, X
- use a rule (Metropolis or Metropolis-Hastings) to either set θ^(s) = θ* or set θ^(s) = θ^(s-1)
- thus some draws from the candidate distribution are accepted and some are not.
The initial θ^(s) draws are not draws from the posterior
- so discard the first several thousand draws (the burn-in).
Hopefully after that we have (correlated) draws from the posterior.
Given draws from the posterior we can do almost anything.


MCMC Example: Linear Regression


The Stata prefix command bayes: is simple to use
- e.g. bayes: regress y x1 x2
The default sets the following priors:
- the βj are independently N(0, 100²)
- σ² is inverse-gamma(0.01, 0.01)
  - so 1/σ² is gamma(0.01, 0.01).
The default also sets:
- 12,500 MCMC iterations
- the first 2,500 of these are discarded ("burn-in").
The defaults can be changed, as sketched below.
The command bayesmh is more flexible
- e.g. for nonstandard models you can provide the likelihood.
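
For instance, a sketch of overriding some of these defaults for the earnings regression of this example; the particular prior variances and MCMC settings are illustrative choices, not recommendations:

* Change the priors, the number of MCMC iterations, and the burn-in
bayes, prior({lnearnings:education age _cons}, normal(0, 25)) ///
    prior({sigma2}, igamma(1, 0.5))                           ///
    mcmcsize(20000) burnin(5000) rseed(10101):                ///
    regress lnearnings education age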


MCMC Example

First part of output

. * Bayesian linear regression with uninformative prior and Stata defaults


. bayes, rseed(10101): regress lnearnings education age

Burn-in ...
Simulation ...

Model summary

Likelihood:
lnearnings ~ regress(xb_lnearnings,{sigma2})

Priors:
{lnearnings:education age _cons} ~ normal(0,10000) (1)
{sigma2} ~ igamma(.01,.01)

(1) Parameters are elements of the linear form xb_lnearnings.


MCMC Example (continued)

Second part of output
- Efficiency: the 10,000 correlated draws are equivalent on average to 929.9 independent draws (average efficiency 0.09299).
- Acceptance rate: 3,071 of the 10,000 candidate draws were accepted.

Bayesian linear regression MCMC iterations = 12,500


Random-walk Metropolis–Hastings sampling Burn-in = 2,500
MCMC sample size = 10,000
Number of obs = 100
Acceptance rate = .3071
Efficiency: min = .07066
avg = .09299
Log marginal-likelihood = -133.37046 max = .1512


MCMC Example (continued)


Third part of output, for the regressor education:
- the posterior mean is 0.0872 with posterior std. dev. 0.0218 and 95% credible interval (0.047, 0.131)
- the MLE is 0.0852 with se 0.0221 and 95% CI (0.041, 0.129).

Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]

lnearnings
education .0871874 .0217776 .000819 .0868041 .0471493 .1312628
age .008496 .0062873 .000231 .0089316 -.0037933 .0208249
_cons 9.198406 .4482471 .016292 9.196124 8.319206 10.09851

sigma2 .4774248 .0711248 .001829 .4702676 .3587335 .6308758

Note: Default priors are used for model parameters.


Note: Adaptation tolerance is not met in at least one of the blocks.


MCMC Example: Diagnostics


For β_education, several graphical diagnostics are shown
- produced by: bayesgraph diagnostics {lnearnings:education}
[Figure: bayesgraph diagnostics for {lnearnings:education} - trace plot, histogram, autocorrelation function, and kernel density estimates (all draws, first half, second half).]


Convergence of Chain
There is no formal test.
One can run multiple independent chains and check whether the variability of the posterior mean of θ across chains is small relative to the variation of the draws of θ within each chain.
Consider the jth of m chains
- θ̂_j = posterior mean and s²_j = posterior variance of θ in chain j.
B measures variation between chains
- B = (1/(m-1)) Σ_{j=1}^{m} (θ̂_j - θ̂)², where θ̂ = (1/m) Σ_{j=1}^{m} θ̂_j.
W measures variation in θ within chains
- W = (1/m) Σ_{j=1}^{m} s²_j.
The Gelman-Rubin statistic is Rc ≈ (W + B)/W
- the actual statistic uses an adjustment for the finite number of chains
- a common threshold is Rc < 1.1 (equivalently B/W < 0.1).
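
A sketch of the Rc calculation with made-up chain summaries (three chains; bayesstats grubin additionally applies the finite-chain adjustment mentioned above):

* Posterior means of theta from m = 3 chains (made-up values)
scalar t1 = 0.085
scalar t2 = 0.087
scalar t3 = 0.084
scalar tbar = (t1 + t2 + t3)/3
scalar B = ((t1-tbar)^2 + (t2-tbar)^2 + (t3-tbar)^2)/(3 - 1)   // between-chain variation
* Posterior variances of theta within each chain (made-up values)
scalar W = (0.00047 + 0.00049 + 0.00046)/3                     // within-chain variation
display "Rc = " (W + B)/W                                      // values near 1 suggest convergence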


Convergence of Chain (continued)


* Check convergence using multiple chains
bayes, rseed(10101) nchains(5): regress lnearnings education age
Bayesian linear regression Number of chains = 5
Random-walk Metropolis–Hastings sampling Per MCMC chain:
Iterations = 12,500
Burn-in = 2,500
Sample size = 10,000
Number of obs = 100
Avg acceptance rate = .3402
Avg efficiency: min = .07201
avg = .1053
max = .1815
Avg log marginal-likelihood = -133.35288 Max Gelman–Rubin Rc = 1.002

Equal-tailed
Mean Std. dev. MCSE Median [95% cred. interval]

lnearnings
education .085597 .0222416 .000371 .0855127 .0416117 .12877
age .0079981 .0063156 .000096 .0081201 -.0044435 .0202879
_cons 9.241303 .4537841 .007116 9.23721 8.355778 10.14552

sigma2 .4763385 .0699901 .000735 .4693347 .3578036 .6313855


Convergence of Chain (continued)


The preceding output reported the maximum Rc across the four parameters, 1.002 < 1.1.
Now get Rc for each parameter.

. * Give Gelman-Rubin Rc statistic for each parameter


. bayesstats grubin

Gelman–Rubin convergence diagnostic

Number of chains = 5
MCMC size, per chain = 10,000
Max Gelman–Rubin Rc = 1.002092

Rc

lnearnings
education 1.00161
age 1.001305
_cons 1.002092

sigma2 1.000309

Convergence rule: Rc < 1.1


MCMC Example: Some bayes: code

* Estimation
bayes, rseed(10101): regress y x
* Summary statistics for model parameters
bayesstats summary {y:x}
* Probability that slope is in range 0.4 to 0.6
bayestest interval {y:x}, lower(0.4) upper(0.6)
* Effective sample size
bayesstats ess
* Graphical Diagnostics
bayesgraph diagnostics {y:x}
* Convergence diagnostics
bayes, rseed(10101) nchains(5): regress y x
bayesstats grubin


5. Markov chain Monte Carlo (MCMC)

The challenge is to compute the posterior p(θ|y, X)
- analytical results are available only in special cases
- early numerical methods used importance sampling to estimate posterior moments.
Instead, use Markov chain Monte Carlo methods:
- make sequential random draws θ^(1), θ^(2), ...
- where θ^(s) depends in part on θ^(s-1)
  - but not on θ^(s-2) once we condition on θ^(s-1) (so a Markov chain)
- in such a way that, after an initial burn-in (discard those draws), the θ^(s) are (correlated) draws from the posterior p(θ|y, X)
  - the Markov chain converges to a stationary marginal distribution which is the posterior.

Markov Chains

A Markov chain is a stochastic sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
Under suitable assumptions the chain converges to a stationary marginal distribution.
Here the MCMC method is set up so that this stationary distribution is the desired posterior.
The one caveat is that, while in theory the chain converges,
- in practice it can take many rounds to converge
- and there is no formal test of whether convergence has occurred.


Leading MCMC methods


1. Metropolis algorithm
- Nicholas Metropolis, Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller and Edward Teller (1953), "Equation of State Calculations by Fast Computing Machines", Journal of Chemical Physics.
2. Metropolis-Hastings algorithm
- relaxes the Metropolis requirement that the candidate distribution be symmetric
- W.K. Hastings (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications", Biometrika.
3. Gibbs sampler
- special case where the conditional posteriors are known
- A.E. Gelfand and A.F.M. Smith (1990), JASA, is a key statistical paper for the Gibbs sampler and, more generally, for the use of MCMC methods in statistics.


Metropolis Algorithm
We want to draw from the posterior p(·) but usually cannot do so directly.
Metropolis draws from a candidate distribution g(θ^(s)|θ^(s-1))
- these draws are sometimes accepted and sometimes not
- like the accept-reject method, but without requiring p(·) ≤ k g(·).
Metropolis algorithm at the sth round:
- draw a candidate θ* from the candidate distribution g(·)
- the candidate distribution g(θ^(s)|θ^(s-1)) needs to be symmetric
  - so it must satisfy g(θa|θb) = g(θb|θa)
- draw u from uniform[0, 1] and set

    θ^(s) = θ*          if u < p(θ*) / p(θ^(s-1))
          = θ^(s-1)     otherwise.


Metropolis Algorithm (continued)


Because we only use a ratio of posteriors, the difficult normalizing constant (the marginal likelihood) does not have to be computed:

    p(θ*|y, X) / p(θ^(s-1)|y, X)
        = [L(y|θ*, X) π(θ*) / m(y|X)] / [L(y|θ^(s-1), X) π(θ^(s-1)) / m(y|X)]
        = L(y|θ*, X) π(θ*) / [L(y|θ^(s-1), X) π(θ^(s-1))]

For a proof that the Markov chain converges to the desired distribution see, for example, Cameron and Trivedi (2005), p. 451
- the proof requires that the candidate distribution is symmetric.
Taking logs, the rule becomes

    θ^(s) = θ*          if ln u < ln p(θ*) - ln p(θ^(s-1))
          = θ^(s-1)     otherwise.

Random-walk Metropolis draws θ^(s) ~ N[θ^(s-1), V] for fixed V
- ideally V is chosen so that 25-50% of candidate draws are accepted.
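
A minimal sketch of random-walk Metropolis for the earlier normal-normal example, whose posterior is N[8, 1.2]; the proposal standard deviation of 1, the chain length, and the burn-in are illustrative choices:

* Target: posterior of theta when ybar = 10, N = 50, y|theta ~ N[theta, 100], prior theta ~ N[5, 3]
clear
set seed 10101
set obs 10000
gen double theta = .
scalar cur = 5                                    // starting value
forvalues s = 1/10000 {
    scalar cand = cur + 1*rnormal()               // candidate from N[cur, 1]
    * log posterior (up to a constant) = log likelihood of ybar + log prior
    scalar lcur  = -0.5*(10 - cur)^2/(100/50)  - 0.5*(cur  - 5)^2/3
    scalar lcand = -0.5*(10 - cand)^2/(100/50) - 0.5*(cand - 5)^2/3
    if ln(runiform()) < lcand - lcur {
        scalar cur = cand                         // accept the candidate
    }
    quietly replace theta = cur in `s'
}
summarize theta if _n > 2000                      // discard burn-in; mean should be near 8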

Metropolis-Hastings Algorithm
Metropolis-Hastings is a generalization
- the candidate distribution g(θ^(s)|θ^(s-1)) need not be symmetric
- the acceptance rule is then

    u < [p(θ*) g(θ^(s-1)|θ*)] / [p(θ^(s-1)) g(θ*|θ^(s-1))]

- the Metropolis algorithm itself is often called Metropolis-Hastings.
Independence-chain MH uses a g(θ^(s)) not depending on θ^(s-1), where g(·) is a good approximation to p(·)
- e.g. do ML for p(θ) and then let g(θ) be multivariate t with mean θ̂ and variance V̂[θ̂]
- multivariate t rather than normal as it has fatter tails.
M and MH are called Markov chain Monte Carlo
- because θ^(s) given θ^(s-1) is a first-order Markov chain
- Markov chain theory proves convergence to draws from p(·) as s → ∞
- a poor choice of candidate distribution leads to a chain that gets stuck in place.


Gibbs sampler
The Gibbs sampler (a general method for making draws)
- draws (Y1, Y2) by alternating draws from f(y1|y2) and f(y2|y1)
- after many draws this gives draws from f(y1, y2), even though

    f(y1, y2) = f(y1|y2) f(y2) ≠ f(y1|y2) f(y2|y1).

Suppose the posterior is partitioned, e.g. p(θ) = p(θ1, θ2)
- and we can make draws from p(θ1|θ2) and p(θ2|θ1).
Gibbs is a special case of MH
- usually quicker than the usual MH
- if MH is needed to draw from p(θ1|θ2) and/or p(θ2|θ1), this is called MH within Gibbs
- extends to e.g. p(θ1, θ2, θ3): make sequential draws from p(θ1|θ2, θ3), p(θ2|θ1, θ3) and p(θ3|θ1, θ2)
- requires knowledge of all of the full conditionals.
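
A minimal sketch of a Gibbs sampler: draws from a bivariate standard normal with correlation 0.8, made by alternating between the two conditional distributions (the correlation and chain length are illustrative):

* Conditionals: y1|y2 ~ N[rho*y2, 1-rho^2] and y2|y1 ~ N[rho*y1, 1-rho^2]
clear
set seed 10101
set obs 10000
gen double y1 = .
gen double y2 = .
scalar rho = 0.8
scalar c1 = 0
scalar c2 = 0
forvalues s = 1/10000 {
    scalar c1 = rho*c2 + sqrt(1 - rho^2)*rnormal()   // draw y1 given current y2
    scalar c2 = rho*c1 + sqrt(1 - rho^2)*rnormal()   // draw y2 given the new y1
    quietly replace y1 = c1 in `s'
    quietly replace y2 = c2 in `s'
}
correlate y1 y2 if _n > 1000                         // sample correlation should be near 0.8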


Correlated Draws

M, MH and Gibbs yield correlated draws of θ^(s).
They still give a correct estimate of the marginal posterior distribution of θ (once the burn-in draws are discarded)
- e.g. estimate the posterior mean by (1/S) Σ_{s=1}^{S} θ^(s).
The precision of this estimate will, however, decline with greater correlation of the draws
- the efficiency statistic measures this
- if the efficiency statistic is low, make more draws (after the burn-in).


Stata bayes: and bayesmh commands

The bayes: prefix command can be applied to over 50 estimation commands, including regress, xtreg, logit, mlogit, ologit and xtlogit. Defaults such as the priors can be changed.
The bayesmh command is more flexible and allows one to program one's own models.
The default version of bayesmh can give somewhat different results from bayes:, because bayes: takes advantage of knowledge of the particular model used, such as blocking of model parameters to improve the efficiency of the sampling algorithm.


bayesmh command equivalent to the earlier bayes: regress command

The following bayesmh command gives exactly the same results as the earlier
bayes, rseed(10101): regress lnearnings education age

bayesmh lnearnings education age, likelihood(normal({sigma2})) ///
    prior({lnearnings:education}, normal(0,10000)) ///
    prior({lnearnings:age}, normal(0,10000)) ///
    prior({lnearnings:_cons}, normal(0,10000)) ///
    prior({sigma2}, igamma(0.01,0.01)) rseed(10101) ///
    block({lnearnings: education age _cons}) block({sigma2})

If the last line (the blocking) is dropped, the results differ
- blocking can greatly speed up computation.


6. Further discussion: Specification of the prior

As N → ∞ the data dominate the prior π(θ),
and then the posterior is approximately θ|y ~ N[θ̂_ML, I(θ̂_ML)⁻¹]
- but in finite samples the prior can make a difference.
Noninformative and improper priors
- have little effect on the posterior
- a uniform or flat prior (with all values equally likely) is a frequent choice
- this is an improper prior if θ is unbounded
- but usually the posterior is still proper
  - if π(θ) = c we need ∫ L(y|θ, X) π(θ) dθ = c ∫ L(y|θ, X) dθ to be finite
- not invariant to transformations of θ (e.g. θ → e^θ).
The Jeffreys prior sets π(θ) ∝ det[I(θ)]^(1/2), where I(θ) = -E[∂² ln L/∂θ∂θ′]
- invariant to transformation
- for linear regression under normality this is a uniform prior for β
- also an improper prior.


Proper prior (informative or uninformative)
- an informative prior becomes uninformative as the prior variance becomes large
- use a conjugate prior if available, as it is tractable
- hierarchical (multi-level) priors are often used
  - the Bayesian analog of random coefficients
  - let π(θ) depend on unknown parameters τ which in turn have a completely specified distribution
  - p(θ, τ|y) ∝ L(y|θ) π(θ|τ) π(τ), so p(θ|y) ∝ ∫ p(θ, τ|y) dτ.

Poisson example with yi ~ Poisson[μi = exp(xi′β)]
- p(β, μ|y, X) ∝ L(y|μ) π1(μ|X, β) π2(β)
- where π1(μi|β) is gamma with mean exp(xi′β)
- and π2(β) is a normal prior, β ~ N[β₀, V]
  - this works better than p(β|y, X) ∝ L(y|X, β) π(β).


Informative Prior Example


Consider lnearnings regressed on an intercept, education and age.
Education: a N[0.06, 0.01²] prior means we are 95% sure that earnings increase proportionately by between 0.04 and 0.08 (so between 4% and 8%) with one more year of education.
Age: a N[0.02, 0.01²] prior means we are 95% sure that earnings increase by between 0% and 4% with one more year of aging.
Intercept: not clear, so choose a diffuse N[10, 10] prior
- one needs to be very careful with the prior for the intercept
- the N[10, 10] prior is very informative for earnings rather than lnearnings.
sigma2 (σ²): difficult to explain, so choose a reasonably diffuse prior.

* bayesmh example with informative priors
bayesmh lnearnings education age, likelihood(normal({var})) ///
    prior({lnearnings:education}, normal(0.06,0.0001)) ///
    prior({lnearnings:age}, normal(0.02,0.0001)) ///
    prior({lnearnings:_cons}, normal(10,100)) ///
    prior({var}, igamma(1,0.5)) rseed(10101)

Convergence of MCMC

Theory says the chain converges as s → ∞
- there could still be a problem even after one million draws.
Checks for convergence of the chain (after discarding the burn-in):
- graphical: plot θ^(s) to see that it is moving around
- correlations: the correlation of θ^(s) and θ^(s-k) should go to 0 as k gets large
- plot the posterior density: multimodality could indicate a problem
- break into pieces: expect each 1,000 draws to have similar properties
- run several independent chains with different starting values
  - Gelman-Rubin statistic.
But it is not possible to be 100% sure that the chain has converged.


Bayesian model selection


Bayesians use the marginal likelihood
- m(y|X) = ∫ L(y|θ, X) π(θ) dθ
- this weights the likelihood (used in ML analysis) by the prior.
The Bayes factor is the analog of the likelihood ratio:

    B = m1(y|X) / m2(y|X) = (marginal likelihood of model 1) / (marginal likelihood of model 2)

- one rule of thumb is that the evidence against model 2 is
  - weak if 1 < B < 3 (or approximately 0 < 2 ln B < 2)
  - positive if 3 < B < 20 (or approximately 2 < 2 ln B < 6)
  - strong if 20 < B < 150 (or approximately 6 < 2 ln B < 10)
  - very strong if B > 150 (or approximately 2 ln B > 10).
This can be used to "test" H0: θ = θ1 against Ha: θ = θ2.
The posterior odds ratio weights B by the priors on models 1 and 2
- so now priors are used on both θ and the model.

Problem: MCMC methods to obtain the posterior avoid computing the marginal likelihood
- computing the marginal likelihood can be difficult
- see Chib (1995), JASA, and Chib and Jeliazkov (2001), JASA.
An asymptotic approximation to the Bayes factor is

    B12 ≈ [L1(y|θ̂1, X) / L2(y|θ̂2, X)] × N^((k2-k1)/2)

- here model 1 is nested in model 2 and, due to the asymptotics, the prior has no influence (so the ratio of posteriors is the ratio of likelihoods)
- this is the Bayesian information criterion (BIC) or Schwarz criterion.
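
In Stata, the log marginal-likelihood reported in the earlier bayes output can be used for such comparisons; a hedged sketch, assuming the MCMC results are saved with saving() so that the estimates can be stored (the model and file names are mine):

* Compare a nested model (education only) with the fuller model (education and age)
quietly bayes, rseed(10101) saving(sim1, replace): regress lnearnings education
estimates store m1
quietly bayes, rseed(10101) saving(sim2, replace): regress lnearnings education age
estimates store m2
bayesstats ic m1 m2        // DIC and log marginal-likelihoods, with log Bayes factors
bayestest model m1 m2      // posterior model probabilities under equal model priors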


What does it mean to be a Bayesian?

Modern Bayesian methods (Markov chain Monte Carlo)
- make it much easier to compute the posterior distribution than to maximize the log-likelihood.
So classical statisticians:
- use Bayesian methods to compute the posterior
- use an uninformative prior, so that p(θ|y, X) ∝ L(y|θ, X)
- so the θ that maximizes the posterior is also the MLE.
Others go all the way and are Bayesian:
- they give a Bayesian interpretation
  - e.g. use credible intervals
  - e.g. given draws of θ, one can easily do inference on transformations of θ
- if possible they use an informative prior that embodies previous knowledge.


7. Appendix: Accept-reject method


There are many ways to make random draws from a distribution, such as the inverse-transformation method.
The accept-reject method can be used when
- we want to draw from a density f(x) but this is difficult
- we have a candidate density g(x) that we can make draws from
- for any value of x we can compute f(x) and g(x)
- key: g(x) covers f(x), with f(x) ≤ k g(x) for some constant k and all x
  - this is often not possible, especially in the tails, e.g. for -∞ < x < ∞
  - Metropolis and Metropolis-Hastings do not have this restriction.
The accept-reject method to get draws from f(x):
- draw x from g(x)
- draw u from uniform(0,1) and accept the draw x if

    u ≤ f(x) / (k g(x))
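
A minimal sketch in Stata: accept-reject draws from f(x) = 6x(1-x), the Beta(2,2) density on (0,1), using the uniform candidate g(x) = 1 and bound k = 1.5, which satisfies f(x) ≤ k g(x) for all x in (0,1):

* Accept-reject draws from f(x) = 6x(1-x) using a uniform candidate
clear
set seed 10101
set obs 10000
gen double x = runiform()               // draws from g(x) = 1
gen double u = runiform()
keep if u <= 6*x*(1-x)/1.5              // accept if u <= f(x)/(k g(x)); about 2/3 are kept
summarize x                             // mean should be near 0.5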


Accept-reject method: proof

Let Y denote the random variable generated by the accept-reject method, let X denote a random variable with density g(x), and let U denote a uniform(0,1) draw. Then Y has c.d.f.

    Pr[Y ≤ y] = Pr[X ≤ y | U ≤ f(X)/(k g(X))]
              = Pr[X ≤ y, U ≤ f(X)/(k g(X))] / Pr[U ≤ f(X)/(k g(X))]
              = ∫_{-∞}^{y} { ∫_{0}^{f(x)/(k g(x))} du } g(x) dx / ∫_{-∞}^{∞} { ∫_{0}^{f(x)/(k g(x))} du } g(x) dx
              = ∫_{-∞}^{y} [f(x)/(k g(x))] g(x) dx / ∫_{-∞}^{∞} [f(x)/(k g(x))] g(x) dx
              = ∫_{-∞}^{y} [f(x)/k] dx / ∫_{-∞}^{∞} [f(x)/k] dx
              = ∫_{-∞}^{y} f(x) dx

8. Some References
Chapter 13 "Bayesian Methods" in A. Colin Cameron and Pravin K. Trivedi (2005), Microeconometrics: Methods and Applications, Cambridge University Press.
Chapter 29 "Bayesian Methods: Basics" in A. Colin Cameron and Pravin K. Trivedi, Microeconometrics Using Stata, Second Edition, forthcoming.
Bayesian books by econometricians that feature MCMC:
- Geweke, J. (2003), Contemporary Bayesian Econometrics and Statistics, Wiley.
- Koop, G., D.J. Poirier and J.L. Tobias (2007), Bayesian Econometric Methods, Cambridge University Press.
- Koop, G. (2003), Bayesian Econometrics, Wiley.
- Lancaster, T. (2004), Introduction to Modern Bayesian Econometrics, Wiley.
Most useful (for me) book by statisticians:
- Gelman, A., J.B. Carlin, H.S. Stern, D.B. Dunson, A. Vehtari and D.B. Rubin (2013), Bayesian Data Analysis, Third Edition, Chapman & Hall/CRC.

