Bayesian Analysis Notes
Hyon-Jung Kim, 2015
Bayesian statistics
Bayesian: named after Thomas Bayes (1702-1761)
It provides a natural, intuitively plausible way to think about statistical problems by revising prior information in the light of observed data.
Probability
Randomness
- Most statistical modeling is based on an assumption of random observations from
some probability distribution
- The objective of statistical modeling is to discover the systematic component and
filter out the random noise.
1. Estimate parameters of the distribution generating the data
2. Test hypotheses about the parameters
3. Predict future occurrences of data of the same type
4. Choose appropriate action given what we can learn about the phenomenon generating the data
- It may be difficult or impossible to distinguish between very complex or chaotic deterministic phenomena and truly random phenomena. (The more data we have, the more complex our models tend to grow.)
Definition: Probability P (A) is a measure of the chance that an event A will happen.
Axioms of Probability
1. P(A) ≥ 0 for any event A
2. P(S) = 1, where S is the universal set (sample space).
3. P(Aᶜ) = 1 − P(A)
4. If A and B have no outcomes in common, then P(A ∪ B) = P(A) + P(B).
Interpretations of probability
1. Classical: Probability is a ratio of favorable cases to total (equipossible) cases
The fundamental assumption is that the game is fair (based on the game theory) and
all outcomes are equally likely.
2. Frequentist: Probability is the limiting value of the frequency of some event as the
number of trials becomes infinite.
It can be legitimately applied only to repeatable problems and is believed as an objective property in the real world.
3. Subjectivist: Probabilities are numerical values representing degrees of personal belief. Most events in life are not repeatable. Probabilities are essentially conditional, and there is no single correct probability.
Frequency probability inference:
- Data are drawn from a distribution of known form but with an unknown parameter
and often this distribution arises from explicit randomization.
- Inferences regard the data as random and the parameter as fixed (even though the
data are known and the parameter is unknown)
Subjective probability inference:
- Probability distributions are assumed for the unknown parameters and for the observations (i.e. both parameters and observations are random quantities).
- Inferences are based on the prior distribution and the observed data.
Comparison/Generality
- Frequentists are disturbed by the dependence of the posterior results on the subjective
prior distribution
- Bayesians say that the prior distribution is not the only subjective element in an
analysis. The assumptions about the sampling distributions are also subjective.
- Whose probability distribution should be used? When there are enough data, a good Bayesian analysis and a good frequentist analysis will tend to agree. If the results are sensitive to the prior information, a Bayesian analyst is obligated to report this sensitivity and to present the range of results obtained over a wide range of prior information.
- Bayesians can often handle problems the frequentist approach cannot. Bayesians
often apply frequentist techniques but with a Bayesian interpretation. Most untrained
people interpret results in the Bayesian way more easily. (Often the Bayesian answer
is what the decision maker really wants to hear.)
Conditional Probability: the conditional probability of B given A is
P(B|A) = P(A ∩ B) / P(A),
where P(A ∩ B) = P(AB) is the joint probability that both A and B occur.
Independence of events:
A and B are independent if P (A|B) = P (A) or P (B|A) = P (B).
Multiplication Rule:
P (AB) = P (A)P (B|A).
Then, by definition:
P (A|B) = P (AB)/P (B) = P (B|A)P (A)/P (B)
*** This is the Bayes theorem.
Applying the Law of Total Probability:
P(B) = P(B|A)P(A) + P(B|Aᶜ)P(Aᶜ)
So
P(A|B) = P(B|A)P(A) / ( P(B|A)P(A) + P(B|Aᶜ)P(Aᶜ) )
This result is referred to as the expanded form of the Bayes theorem.
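As a small numerical illustration (not from the notes), the expanded form can be evaluated directly in R; the probabilities below are made up, with A standing for "condition present" and B for "test positive":

  p_A      <- 0.01                    # P(A)
  p_B_A    <- 0.95                    # P(B | A)
  p_B_notA <- 0.05                    # P(B | not A)
  p_A_B <- p_B_A * p_A / (p_B_A * p_A + p_B_notA * (1 - p_A))
  p_A_B                               # P(A | B), about 0.16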
Bayesian Inference
P(H|D) = P(D|H)P(H) / P(D)

In parallel,

[θ|X] = [X|θ] [θ] / [X]

and, for continuous θ,

g(θ|X) = f(X|θ) g(θ) / f(X) = f(X|θ) g(θ) / ∫ f(X|θ) g(θ) dθ
Likelihood
The problem of statistical inference is to use observed data to learn about unknown
features of the process that generated those data.
In order to make inference, it is essential to describe the link between X and θ, and this is done through a statistical model. The purpose of the statistical model is to describe this relationship by deriving the probability P(X|θ) with which we would have obtained the observed data X for any value of the unknown parameter vector θ.
Definition: The likelihood function L(θ : X) is defined as any function of θ such that
L(θ : X) = c · P(X|θ) for some constant c.
Likelihood may not be enough for inference. The Bayesian approach is based also on
some prior information.
Prior distribution ( g(θ) or P(θ) ):
formulates your prior beliefs about the parameters.
Note that frequency probability is not able to represent such beliefs, since parameters are regarded as unknown but not random. The prior distribution is the major source of disagreement between the two approaches, Bayesian and frequentist.
Posterior distribution ( g(θ|X) or P(θ|X) ):
presents the probability distribution of the unknown parameter after we take the prior information into account and learn from the data.
Note again that the posterior distribution has no meaning in the frequentist theory.
Bayesian Methods for Inference
i) Model a set of observations with a probability distribution with unknown parameters.
ii) Specify prior distributions for the unknown parameters.
iii) Use the Bayes theorem to combine these two parts into the posterior distribution.
iv) Use the posterior distribution to draw inferences about the unknown parameters of
interest.
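As an illustration of steps i)-iv) (a minimal sketch, not an example from the notes), consider binomial data with a Beta prior in R; the counts and prior values are made up:

  y <- 12; n <- 20                              # i) model: y | theta ~ Binomial(n, theta)
  a <- 2; b <- 2                                # ii) prior: theta ~ Beta(a, b)
  a1 <- a + y; b1 <- b + n - y                  # iii) Bayes theorem: theta | y ~ Beta(a1, b1)
  post_mean <- a1 / (a1 + b1)                   # iv) posterior summaries for inference
  cred_int  <- qbeta(c(0.025, 0.975), a1, b1)   # central 95% posterior interval
  list(post_mean = post_mean, cred_int = cred_int)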
Prior probabilities P(Ai) and likelihoods P(Dj | Ai) for three events A1, A2, A3:

        D1     D2     Prior
  A1    0.0    1.0    0.3
  A2    0.7    0.3    0.5
  A3    0.2    0.8    0.2
Example:
A black male mouse is mated with a female black mouse whose mother had a brown
coat.
B and b are alleles of the gene for coat color. The gene for black fur is given the letter
B and the gene for brown fur is given the letter b where B is the dominant allele to b.
The male and female have a litter with 5 pups that are all black. We want to determine the male's genotype. The prior information suggests P(BB) = 1/3 and P(Bb) = 2/3.
What is the posterior probability that the male's genotype is BB?
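One way to carry out the computation in R (note that the female must be Bb, since her mother had a brown coat):

  # If the male is BB, every pup is black; if the male is Bb (female Bb),
  # each pup is black with probability 3/4.
  prior <- c(BB = 1/3, Bb = 2/3)
  lik   <- c(BB = 1^5, Bb = (3/4)^5)   # probability of 5 black pups under each genotype
  post  <- prior * lik / sum(prior * lik)
  post                                 # P(BB | 5 black pups) is about 0.68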
Probability Distributions:

Yi ~ Poisson(λ) distribution
f(yi : λ) = λ^yi e^(−λ) / yi!,   yi = 0, 1, 2, ...

Yi ~ Exponential(1/θ) distribution
f(yi : θ) = (1/θ) e^(−yi/θ),   yi ≥ 0

Yi ~ Gamma(α, β) distribution
f(yi : α, β) = 1 / (Γ(α) β^α) · yi^(α−1) exp(−yi/β),   yi > 0, α > 0, β > 0

Yi ~ Inverse-Gamma(α, β) distribution
f(yi : α, β) = 1 / (Γ(α) β^α) · yi^(−(α+1)) exp(−1/(yi β)),   yi > 0, α > 0, β > 0

Yi ~ Normal(μ, σ²) distribution
f(yi : μ, σ²) = 1/√(2πσ²) · exp( −(yi − μ)² / (2σ²) )

Yi ~ Binomial(n, p) distribution
f(yi : n, p) = C(n, yi) p^yi (1 − p)^(n−yi),   yi = 0, 1, ..., n

Yi ~ Beta(α, β) distribution
f(yi : α, β) = Γ(α + β) / (Γ(α)Γ(β)) · yi^(α−1) (1 − yi)^(β−1),   0 < yi < 1, α > 0, β > 0
Posterior Inference
In general, frequentist methods are always based on the idea of repeated sampling, and
their properties are all long-run average properties obtained from repeated sampling.
The Bayesian approach is not restricted to just these standard kinds of inference that the frequentist theory considers. We can use the posterior distribution to derive whatever kinds of statements seem appropriate to answer the questions that the investigator may have.
Prior sensitivity
If the posterior is insensitive to the prior, where the prior is varied over a range that
is reasonable, believable and comprehensive, then we can be fairly confident of our
results. This usually happens when there is a lot of data or data of good quality.
Choice of Prior
It should be emphasized that if you have some real prior information you should use
it and not one of those uninformative or automatic priors.
The most important principle of prior selection is that your prior should represent the
best knowledge that you have about the parameters of the problem before looking at
the data.
Conjugacy
When the posterior distribution is in the same family of distributions as the prior
distribution, we have conjugate pairs of distributions. (We also say that the family
of distributions is closed under sampling).
Note that there are several other cases that we will take a look at later.
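As an illustration (not from the notes), a minimal R sketch of one conjugate pair: exponential data with a Gamma prior on the rate λ = 1/θ (the notes parameterize the exponential by its mean θ); all numbers are made up:

  # y_i ~ Exponential(rate = lambda), lambda ~ Gamma(a, b)
  # => lambda | y ~ Gamma(a + n, b + sum(y)) -- the same family as the prior.
  y <- c(0.8, 2.1, 0.4, 1.7, 3.0)          # made-up observations
  a <- 2; b <- 1                           # prior shape and rate
  a1 <- a + length(y); b1 <- b + sum(y)    # posterior shape and rate
  c(prior_mean = a / b, post_mean = a1 / b1)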
Informative priors: subjective, opinion-based priors; they should be chosen with care in practice.
Noninformative priors (or reference priors)
There are several ways to formulate priors that can be used in the case of no substantial
information beforehand.
- Vague priors
Sometimes one may have real information that can lead to a prior but the prior will
still be vague or spread out.
- Jeffreys priors: H. Jeffreys (1961) proposed a general way of choosing priors:
p(θ) ∝ |I(θ : X)|^(1/2)
where I(θ : X) is the Fisher information for p(x|θ). Note that θ may be a vector of parameters. Recall
I(θ : X) = E[ ( ∂ log f(X|θ) / ∂θ )² ] = −E[ ∂² log f(X|θ) / ∂θ² ]
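As a small check (not from the notes), for a single Bernoulli(θ) observation the Fisher information is 1/(θ(1 − θ)), so the Jeffreys prior is proportional to θ^(−1/2) (1 − θ)^(−1/2), i.e. a Beta(1/2, 1/2) density. A short R verification:

  # Fisher information for one observation x ~ Bernoulli(theta)
  fisher_info <- function(theta) {
    d2 <- function(x) -(x / theta^2 + (1 - x) / (1 - theta)^2)   # second derivative of the log-likelihood
    -sum(d2(c(0, 1)) * dbinom(c(0, 1), 1, theta))                # expectation over x = 0, 1
  }
  theta <- 0.3
  sqrt(fisher_info(theta))          # Jeffreys prior (unnormalized) at theta = 0.3
  1 / sqrt(theta * (1 - theta))     # matches the closed form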
Likelihood principle
Bayesian inference is based on the likelihood function, combined with some prior information about the parameter.
The likelihood principle can be stated as follows: in making inferences or decisions about the parameters of a problem after observing data, all relevant information is contained in the likelihood function for the observed data.
Furthermore, two likelihood functions contain the same information about the parameter if they are proportional to each other as a function of the parameter.
Example: Binomial data
A clinical trial of the new treatment protocol is carried out. Out of 70 patients in the
trial, 34 survive beyond six months from diagnosis. Should the new treatment protocol
replace the old one?
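The survival rate under the old protocol is stated earlier in the notes; as a hedged sketch, the R code below simply assumes a hypothetical historical rate θ0 = 0.4 and a uniform Beta(1, 1) prior for illustration:

  y <- 34; n <- 70
  theta0 <- 0.4                              # assumed historical six-month survival rate (hypothetical)
  a_post <- 1 + y; b_post <- 1 + n - y       # Beta posterior under a Beta(1, 1) prior
  prob_better <- 1 - pbeta(theta0, a_post, b_post)   # P(theta > theta0 | data)
  c(post_mean = a_post / (a_post + b_post), prob_better = prob_better)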
Normal Samples
Most widely-used models in statistics
simple analytical formulas, good first cut, easy integrations, good approximation for
many models
Assume that x1, ..., xn are independent samples from a normal distribution with mean μ and variance σ²: N(μ, σ²).

The likelihood function is

p(x1, ..., xn | μ, σ²) = ∏_{i=1}^{n} 1/√(2πσ²) · exp( −(xi − μ)² / (2σ²) )

Note that with the normal example, the completing-the-square technique will enable us to track down analytic formulas for the posterior distributions.

With σ² known and a N(μ0, σ0²) prior for μ, the posterior for μ is normal with

mean = ( μ0/σ0² + n x̄/σ² ) / ( 1/σ0² + n/σ² )
variance = 1 / ( 1/σ0² + n/σ² )
The British physicist Henry Cavendish (1731-1810) made 23 observations of the Earth's density:
5.36, 5.29, 5.58, 5.65, 5.57, 5.53, 5.62, 5.29, 5.44, 5.34, 5.79, 5.10,
5.27, 5.39, 5.42, 5.47, 5.63, 5.34, 5.46, 5.30, 5.78, 5.68, 5.85

Min. 5.100   Mean 5.485   Max. 5.850   std. 0.192

Suppose that we model these as a sample from a N(μ, 0.25) distribution, where μ is the true density of the Earth. Consider that someone has a prior belief that μ ~ N(5, 4).
A specific interest is to evaluate the hypothesis that μ > 5.5.
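A direct computation in R, using the known-variance posterior formulas from the previous section (my own worked sketch of this exercise):

  x <- c(5.36, 5.29, 5.58, 5.65, 5.57, 5.53, 5.62, 5.29, 5.44, 5.34, 5.79, 5.10,
         5.27, 5.39, 5.42, 5.47, 5.63, 5.34, 5.46, 5.30, 5.78, 5.68, 5.85)
  n <- length(x); sigma2 <- 0.25             # known data variance
  mu0 <- 5; tau2 <- 4                        # prior mean and variance for mu
  post_var  <- 1 / (1 / tau2 + n / sigma2)
  post_mean <- post_var * (mu0 / tau2 + n * mean(x) / sigma2)
  p_hyp <- 1 - pnorm(5.5, post_mean, sqrt(post_var))   # P(mu > 5.5 | data), roughly 0.44
  c(post_mean = post_mean, post_sd = sqrt(post_var), p_hyp = p_hyp)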
Predictive Inference
It is not always the parameters that are of primary interest. Often, the real problem
is to predict what will be observed in the future.
Predictive inference consists of making inference statements about future observations. There are obvious difficulties in trying to fit predictive inference into the frequentist framework.
Bayesian inference embraces predictive inference naturally. Parameters and future
observations are both random variables and all we need to do is to find the relevant
posterior distribution.
We wish to predict a future observation, say y given that we have observed X. Then,
the Bayesian inference about y would be based on its posterior distribution, P (y|X).
P(y|X) = ∫ P(y, θ|X) dθ = ∫ P(y|θ) P(θ|X) dθ
Examples:
- Prediction of binary data
- Poisson data
- Normal model
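As an illustration of the Poisson case (my own sketch, not an example worked in the notes): with a conjugate Gamma(a, b) prior the posterior predictive for a new count is negative binomial, which can also be checked by simulation. All numbers are made up:

  y <- c(2, 0, 3, 1, 2)                    # observed counts
  a <- 1; b <- 1                           # Gamma prior: shape a, rate b
  a1 <- a + sum(y); b1 <- b + length(y)    # Gamma posterior parameters
  p_exact <- dnbinom(0:6, size = a1, prob = b1 / (b1 + 1))     # exact predictive P(y_new = k | data)
  lam <- rgamma(1e5, a1, b1)                                   # draw lambda from the posterior ...
  p_sim <- table(factor(rpois(1e5, lam), levels = 0:6)) / 1e5  # ... then a new count given lambda
  rbind(exact = round(p_exact, 3), simulated = round(as.numeric(p_sim), 3))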
8
8.1
p(x) = Σ_{j=1}^{n} a_j p_j(x),   where Σ_{j=1}^{n} a_j = 1.

A prior for θ that is a mixture distribution of several conjugate priors has the form:

p(θ) = Σ_{j=1}^{n} a_j p_j(θ)
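As an illustration (not from the notes): for binomial data with a mixture of Beta priors, the posterior is again a mixture of Beta distributions, with the weights updated by each component's marginal likelihood. A short R sketch with made-up numbers:

  y <- 7; n <- 20                                   # observed successes out of n trials
  a_w <- c(0.5, 0.5)                                # prior mixture weights a_j
  al <- c(2, 10); be <- c(10, 2)                    # Beta(alpha_j, beta_j) components
  # marginal likelihood of each component: choose(n, y) B(alpha_j + y, beta_j + n - y) / B(alpha_j, beta_j)
  m <- choose(n, y) * beta(al + y, be + n - y) / beta(al, be)
  w_post <- a_w * m / sum(a_w * m)                  # updated mixture weights
  post_mean <- sum(w_post * (al + y) / (al + be + n))
  c(w_post = w_post, post_mean = post_mean)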
8.2
Conditional conjugacy
8.3
In the cases where we are obliged to use non-conjugate, or otherwise non-simple priors, we often end up with a posterior distribution that is not of any known, standard form. Then there are no formulas to apply to draw posterior inference, and one needs to rely on numerical methods. There are various numerical methods that can be used, and those that are most commonly used will be presented in detail later in the course with R/WinBUGS.
- Numerical Integration:
Integrals can be computed numerically. Simple methods of numerical integration are generally available, at least in one dimension.
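A one-dimensional sketch in R (my own example, not from the notes): a Cauchy likelihood with a normal prior has no conjugate form, but the normalizing constant and posterior mean are easy to obtain with integrate():

  x <- 2                                              # a single made-up observation
  unnorm_post <- function(theta) dcauchy(x, location = theta) * dnorm(theta, 0, 1)
  const <- integrate(unnorm_post, -Inf, Inf)$value    # normalizing constant f(x)
  post_mean <- integrate(function(t) t * unnorm_post(t), -Inf, Inf)$value / const
  c(const = const, post_mean = post_mean)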
Curse of dimensionality
The major development in the 1990s has been a completely different computational approach known as Markov chain Monte Carlo (MCMC). We will cover this topic in detail later in the course.
- Modes:
Computing a mode in general requires maximization, a process that can be computationally efficient even in quite high dimensions.
Often, to find a mode, we need to obtain the marginal posterior density, which means integrating out any other parameters. So integration is not avoided completely.
- Approximation: there is a general theorem to the effect that as the sample size gets large, the posterior distribution tends to (multivariate) normality. Therefore, whenever there is a reasonably substantial quantity of data, we may be able to compute inferences by approximating the posterior distribution with a (multivariate) normal distribution.
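One common way to construct such a normal approximation is to centre it at the posterior mode with variance taken from the curvature there (a Laplace-type sketch reusing the Cauchy-likelihood example above; not a method prescribed by the notes):

  x <- 2
  log_post <- function(theta) dcauchy(x, location = theta, log = TRUE) + dnorm(theta, 0, 1, log = TRUE)
  fit <- optim(par = 0, fn = function(t) -log_post(t), method = "BFGS", hessian = TRUE)
  post_mode <- fit$par                          # posterior mode
  post_sd   <- sqrt(1 / fit$hessian[1, 1])      # curvature at the mode gives the approximate sd
  c(mode = post_mode, sd = post_sd)             # approximate posterior: N(post_mode, post_sd^2)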
The (scaled) inverse-chi-squared and normal-inverse-chi-squared (NIC) distributions:

Yi ~ Inverse-chi-squared(p, q) distribution
f(yi : p, q) = (q/2)^(p/2) / Γ(p/2) · yi^(−(p/2 + 1)) exp( −q/(2yi) ),   yi > 0,  p > 0, q > 0
E[Yi] = q/(p − 2),   Var[Yi] = 2q² / ( (p − 2)²(p − 4) )  if p > 4

(μ, φ) ~ NIC(p, q, m, v) distribution
f(μ, φ : p, q, m, v) = (q/2)^(p/2) / ( √(2πv) Γ(p/2) ) · φ^(−(p+3)/2) exp{ −[ v⁻¹(μ − m)² + q ] / (2φ) },   φ > 0
E[μ] = m,   Var[μ] = vq/(p − 2),
E[φ] = q/(p − 2),   Var[φ] = 2q² / ( (p − 2)²(p − 4) )

For a normal sample x1, ..., xn from N(μ, φ) with prior (μ, φ) ~ NIC(p, q, m, v), the posterior is (μ, φ) | X ~ NIC(p1, q1, m1, v1), where, with S = Σ_{i=1}^{n} (xi − x̄)²,

p1 = p + n,   q1 = q + S + (x̄ − m)² / (v + 1/n),
m1 = ( v⁻¹ m + n x̄ ) / ( v⁻¹ + n )   and   v1 = ( v⁻¹ + n )⁻¹.

Posterior summaries:
E[μ|X] = ( v⁻¹ m + n x̄ ) / ( v⁻¹ + n ),   Var[μ|X] = q1 v1 / (p1 − 2),
E[φ|X] = q1 / (p1 − 2),   Var[φ|X] = 2q1² / ( (p1 − 2)²(p1 − 4) )

Suppose that we wish to predict the mean of k future observations, i.e. Ȳ = (1/k) Σ_{i=1}^{k} Yi.
10
We often want to compare two populations on the basis of two samples: Y11 , ..., Y1n1
and Y21 , ..., Y2n2
Assume Y1i ~ N(μ1, φ1), i = 1, ..., n1, and Y2j ~ N(μ2, φ2), j = 1, ..., n2, independent from the Y1i's.
Quantity of interest: δ = μ1 − μ2
i) Case 1. φ1 and φ2 are assumed to be known
ii) Case 2. φ1 and φ2 are unknown, but assume that φ1 = φ2
iii) Case 3. φ1 and φ2 are unknown and φ1 ≠ φ2
Frequentist approach
i)
z = ( Ȳ1 − Ȳ2 − (μ1 − μ2) ) / √( φ1/n1 + φ2/n2 )
ii)
t(pooled) = ( Ȳ1 − Ȳ2 − (μ1 − μ2) ) / ( S_pooled √(1/n1 + 1/n2) ),
where S²_pooled = (S1 + S2) / (n1 + n2 − 2)
iii)
t = ( Ȳ1 − Ȳ2 − (μ1 − μ2) ) / √( S1/n1 + S2/n2 )
Bayesian approach
i) φ1 and φ2 are assumed to be known
Can take independent reference priors for μ1 and μ2: p(μ1, μ2) ∝ 1
Posterior: μ1 − μ2 | Y ~ N( Ȳ1 − Ȳ2, φ1/n1 + φ2/n2 )

ii) φ1 = φ2 = φ unknown
likelihood: p(Y | μ1, μ2, φ) ∝ φ^(−n1/2) exp{ −(1/(2φ)) [ n1(μ1 − Ȳ1)² + S1 ] } · φ^(−n2/2) exp{ −(1/(2φ)) [ n2(μ2 − Ȳ2)² + S2 ] }

a. With the reference prior p(μ1, μ2, φ) ∝ 1/φ, the joint posterior is proportional to
φ^(−(n1+n2+2)/2) exp{ −(1/(2φ)) [ n1(μ1 − Ȳ1)² + n2(μ2 − Ȳ2)² + S1 + S2 ] },
which gives
δ | Y ~ t_{n1+n2−2}( Ȳ1 − Ȳ2, (S1 + S2)/(n1 + n2 − 2) · (1/n1 + 1/n2) ).

b. With independent conjugate priors μi | φ ~ N(μi0, φ/ni0),
then μi | Y, φ ~ N( (ni0 μi0 + ni Ȳi)/(ni0 + ni), φ/(ni0 + ni) ).

c. Consider the linear model setting with an NIC joint prior for μ1, μ2, φ: (later, after we discuss the linear model in the Bayesian approach)
iii) φ1 ≠ φ2 unknown
Take the conjugate family of NIC distributions independently for (μ1, φ1) and (μ2, φ2).
Posterior: (μi, φi) | Y ~ NIC(pi, qi, mi, vi)
As before, μi | Y ~ t_{pi}( mi, qi vi / pi )
11
Linear Models

Likelihood: Y | β, φ ~ N(Xβ, φI)
Prior: β, φ ~ multivariate NIC(p, q, m, V), i.e.

p(β, φ) ∝ φ^(−(p+r+2)/2) exp( −(1/(2φ)) { q + (β − m)ᵀ V⁻¹ (β − m) } )

(Here r is the number of regression coefficients, and β̂ = (XᵀX)⁻¹XᵀY denotes the least-squares estimate below.)

Posterior: β, φ | Y ~ NIC(p*, q*, m*, V*), and β | Y ~ t_{p*}( m*, (q*/p*) V* ), where

p* = p + n
q* = q + S + (β̂ − m)ᵀ (V + (XᵀX)⁻¹)⁻¹ (β̂ − m),   with S = (Y − Xβ̂)ᵀ(Y − Xβ̂)
m* = V* ( V⁻¹ m + (XᵀX) β̂ )
V* = ( V⁻¹ + XᵀX )⁻¹

E[φ|Y] = q*/(p* − 2),   Var[φ|Y] = 2q*² / ( (p* − 2)²(p* − 4) )

For predictive inference, consider predicting a single observation y0 when the vector of covariates is x0. We find that

y0 | Y ~ t_{p*}( x0ᵀ m*, (q*/p*)(1 + x0ᵀ V* x0) )

11.1
Simple regression

The posterior distribution can be obtained using the general formulae in a straightforward way. However, the Bayesian analysis is complex enough that no simple formulae can be found for the posterior means of the intercept and slope, as there are for the frequentist estimates.
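As an illustration (not from the notes), a minimal R sketch that applies the general update formulas above to a small simulated regression; the data, the prior settings p, q, m, V, and all variable names are made up:

  set.seed(1)
  n <- 30
  X <- cbind(1, runif(n))                      # design matrix: intercept and one covariate
  Y <- X %*% c(2, 3) + rnorm(n, sd = 0.5)      # simulated response
  p <- 1; q <- 1                               # prior NIC(p, q, m, V), chosen for illustration
  m <- c(0, 0); V <- diag(100, 2)
  betahat <- solve(t(X) %*% X, t(X) %*% Y)     # least-squares estimate
  S <- sum((Y - X %*% betahat)^2)
  Vstar <- solve(solve(V) + t(X) %*% X)
  mstar <- Vstar %*% (solve(V) %*% m + t(X) %*% X %*% betahat)
  pstar <- p + n
  qstar <- q + S + t(betahat - m) %*% solve(V + solve(t(X) %*% X)) %*% (betahat - m)
  list(post_mean_beta = drop(mstar), post_mean_phi = drop(qstar) / (pstar - 2))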
11.2

( v11  v12 )
( v21  v22 )

Then,
EXAMPLE: Cuckoo eggs found in the nests of two different host species were examined. The lengths of the eggs (in mm) are
Hedge-sparrow: 22.0, 23.9, 20.9, 23.8, 25.0, 24.0, 21.7, 23.8, 22.8, 23.1, 23.1, 23.5, 23.0, 23.0
Wren: 19.8, 22.1, 21.5, 20.9, 22.0, 21.0, 22.3, 21.0, 20.3, 20.9, 22.0, 20.0, 20.8, 21.2, 21.0
We would like to compare the two species of cuckoos and find out which species lays larger eggs on average.
i) Allowing unequal population variances and assuming independent weak NIC(−1, 0, m, 0⁻¹) priors (i.e. v⁻¹ = 0), obtain the posterior distribution for the mean difference δ = μ1 − μ2 given the data.
ii) Assuming a weak NIC(−2, 0, m, 0⁻¹) prior and equal population variances for both species, obtain the posterior distribution for the mean difference δ = μ1 − μ2 given the data.
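A rough Monte Carlo sketch in R for part i) (my own illustration, not a full solution from the notes): with q = 0 and v⁻¹ = 0 these weak priors reduce to p(μi, φi) ∝ 1/φi, under which each mean has a scaled-and-shifted t posterior, μi | Y ~ t_{ni−1}(ȳi, si²/ni):

  hs   <- c(22.0, 23.9, 20.9, 23.8, 25.0, 24.0, 21.7, 23.8, 22.8, 23.1, 23.1, 23.5, 23.0, 23.0)
  wren <- c(19.8, 22.1, 21.5, 20.9, 22.0, 21.0, 22.3, 21.0, 20.3, 20.9, 22.0, 20.0, 20.8, 21.2, 21.0)
  N <- 1e5
  mu1 <- mean(hs)   + sd(hs)   / sqrt(length(hs))   * rt(N, df = length(hs) - 1)
  mu2 <- mean(wren) + sd(wren) / sqrt(length(wren)) * rt(N, df = length(wren) - 1)
  delta <- mu1 - mu2
  c(post_mean = mean(delta), prob_positive = mean(delta > 0))   # P(mu1 > mu2 | data)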
11.3
Theorem (Lindley and Smith, 1972, JRSS B): General Bayesian Linear Model
Likelihood: Y | θ1 ~ N(A1 θ1, C1), where A1 and C1 are known.
Prior: θ1 ~ N(A2 θ2, C2), where A2, θ2, and C2 are known.
Posterior: θ1 | Y ~ N(Bb, B),
where B = ( A1ᵀ C1⁻¹ A1 + C2⁻¹ )⁻¹ and b = A1ᵀ C1⁻¹ Y + C2⁻¹ A2 θ2.
Marginal distribution of Y: Y ~ N( A1 A2 θ2, C1 + A1 C2 A1ᵀ )
Linear Model with known variance (revisited)
Likelihood: Y | β ~ N(Xβ, φI), with φ known
Prior: β ~ N(β0, V)

Multivariate Normal data
Likelihood: Y | μ ~ N(μ, Σ), where Σ is known.
Prior: μ ~ N(μ0, Σ0)

Multinomial data
Yj : number of observations for the jth outcome category, j = 1, ..., k
Likelihood: Y | θ ~ Multinomial(n, θ), where n = Σ_j Yj
Prior: θ ~ Dirichlet(α1, ..., αk), i.e.
p(θ) = Γ(α1 + ... + αk) / ( Γ(α1) ··· Γ(αk) ) · θ1^(α1 − 1) ··· θk^(αk − 1),   with Σ_j θj = 1
12
Hierarchical Modeling
Exchangeability :
Random variables X1, ..., Xn are exchangeable if the joint distribution of any selection of m distinct Xi's is the same as the joint distribution of any other selection of m distinct Xi's, and this holds for all m ≤ n.
Whenever we model observations as being a sample from a distribution with some unknown parameters, the frequentist model assumes that they are independent and identically distributed (i.i.d.). In the Bayesian approach, they are i.i.d. conditional on the parameters, and the idea of exchangeability is needed to construct a probability model for the data.
The relevance of the concept of exchangeability is due to the fact that it allows one to
state a very powerful result known as the representation theorem of De Finetti (1937).
The theory is complicated and here we only consider the implication of his theorem
without stating it:
When X1 , ..., Xn are part of an infinite sequence of exchangeable random variables, we
can consider them as being a random sample from a distribution with some unknown
parameters.
Prior modeling: how is prior information structured in more complex models?
When we have many parameters, it can be a very large task to think about the joint
prior distribution of all these parameters (especially if our prior knowledge is such that
the parameters are not independent). In such situations, it can be very helpful to think
about modeling the parameters, in the same way as we construct models for data.
Hierarchical models
In the simplest hierarchical model, we first state the likelihood with the usual statistical model that expresses the distribution of the data X conditional on unknown parameters θ. We then have a prior model that expresses the (prior) distribution of the parameters θ conditional on further unknown parameters (hyperparameters) φ, which in turn are given their own prior distribution.
How do we analyze hierarchical models? There are several possibilities, and one of the advantages of the hierarchical formulation is the flexibility it offers for inference. We wish to obtain the posterior distribution of θ and φ.
When we set up hierarchical models, we model the data by writing it conditional on θ as p(X|θ), but the implication is that the distribution of X depends on the parameters θ but not on the parameters φ that are introduced at the next level of the hierarchy. So p(X|θ) is also p(X|θ, φ). It is simple to obtain p(θ, φ) = p(θ|φ)p(φ), and so
p(θ, φ|X) ∝ p(X|θ) p(θ|φ) p(φ)
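As an illustration of this structure (not from the notes), a minimal R sketch of a two-level normal model with known variances, where everything is available in closed form; the data values, s2 and t2 are made up:

  # X_j | theta_j ~ N(theta_j, s2),  theta_j | phi ~ N(phi, t2),  flat prior on phi
  x  <- c(28, 8, -3, 7, -1, 1, 18, 12)     # one observation per group
  s2 <- 15^2; t2 <- 10^2; J <- length(x)
  # Marginally X_j | phi ~ N(phi, s2 + t2), so phi | x ~ N(mean(x), (s2 + t2) / J)
  phi_mean <- mean(x); phi_var <- (s2 + t2) / J
  w <- (1 / s2) / (1 / s2 + 1 / t2)        # weight on each group's own observation
  theta_mean <- w * x + (1 - w) * phi_mean # posterior means E[theta_j | x]: shrinkage toward phi
  round(theta_mean, 1)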
EXAMPLE: Twenty-three similar houses took part in an experiment into the effect of ceiling insulation on heating requirements. Five different levels of ceiling insulation were installed, and the amount of heat used (in kilowatt-hours) to heat each house over a given month was measured.
Insulation (inches)   Heat required (kWh)
[data table of the 23 measurements; only the insulation levels 4, 10, and 12 inches are recoverable here]