Lecture 1
§1. INTRODUCTION.
Bayes theorem describes the conditional probability of an event based on data as well
as prior information or beliefs about the event or conditions related to the event. For
example, in Bayesian inference Bayes theorem can be used to estimate the parameters of
a probability distribution or statistical model. Since Bayesian statistics treats probability
as a degree of belief, Bayes theorem can directly assign a probability distribution that
quantifies the belief to the parameter or set of parameters.
In the frequentist view, the true value of a population parameter is regarded as a fixed but
unknown constant; as a consequence, no probability statements can be made about its value. In the Bayesian
view, in contrast, the true value of a population parameter is conceived as uncertain and
is therefore considered a random variable. According to the Bayesian approach, the un-
known, random population parameter should be described by a probability distribution.
Bayesian philosophy states that θ cannot be determined exactly, and uncertainty about
the parameter is expressed through probability statements and distributions. You can
say that θ follows a normal distribution with mean 0 and variance 1, if it is believed that
this distribution best describes the uncertainty associated with the parameter.
Unlike the frequentist approach, the Bayesian counterpart allows a probability statement
to be made about the value of an unknown parameter. Both approaches also differ in
their notion of probability. Frequentist procedures are based on a concept of probability
that is associated with the idea of long-run frequency (e.g., a coin toss). Frequentist
inference, which employs sampling distributions based on infinite repeated sampling, is
focused on the performance over all possible random samples. Therefore, a frequentist
probability statement does not relate to a particular random sample that was obtained.
Rather, the sampling distribution, which describes the probability distribution of the
sample statistic over all possible random samples from the population, is used to make
a confidence statement about the unknown population parameter. The name confidence
statement is chosen because the inference probability is based on all possible datasets
that could have occurred for the fixed but unknown population parameter.
In the Bayesian framework, Bayes theorem is used to update beliefs about
an unknown random parameter in the light of new evidence, such as empirical (sample)
data. The data can be expressed in terms of a likelihood function, sometimes simply
called the likelihood. Using Bayes theorem as a formal rule to weigh the likelihood of
the actually observed data with the beliefs held before observing the data gives the pos-
terior distribution. The posterior distribution allows researchers to make probability
statements concerning the unknown parameter of interest.
Bayes theorem contains three essential elements. It combines a prior state of knowledge
with the data likelihood into a more informed posterior distribution, that is:
Posterior information = prior information + data information
In the Bayesian view, the probabilities attached to unknown quantities are an
encoding of our state of knowledge about these variables. This view has far-reaching
consequences when it comes to data analysis, since Bayesians can assign probabilities to
propositions, or hypotheses, while frequentists cannot.
The classical methods of estimation that you have studied are based solely on in-
formation provided by the random sample. These methods essentially interpret proba-
bilities as relative frequencies. For example, in arriving at a 95% confidence interval for
the mean, we interpret the statement P(−1.96 < Z < 1.96) = 0.95 to
mean that 95 percent of the time in repeated experiments Z (a standard normal
random variable) will fall between -1.96 and 1.96. Probabilities of this type that can
be interpreted in the frequency sense will be referred to as objective probabilities. The
Bayesian approach to statistical methods of estimation combines sample information
with other available prior information that may appear to be pertinent. The probabilities
associated with this prior information are called subjective probabilities, in that they
measure a person’s degree of belief in a proposition. The person uses his own experience
and knowledge as the basis for arriving at a subjective probability.
For example, let θ = 1 indicate that an individual has a certain disease and θ = 0 that he does
not, and suppose the prior beliefs are P(θ = 1) = γ, P(θ = 0) = 1 − γ.
A diagnostic test gives a result Y, whose distribution function is F1(y) for a diseased
individual, and F0 (y) otherwise. The most common type of test declares that a person
is diseased if Y > y0 , where y0 is fixed on the basis of past data.
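As a small numerical illustration (not part of the lecture text; the values of γ and y0 and the
normal choices for F0 and F1 below are assumptions made only for this sketch), the posterior
probability of disease given a positive test result Y > y0 follows directly from Bayes theorem:

    # Sketch: posterior probability of disease given Y > y0 (all numbers assumed).
    from scipy.stats import norm

    gamma_prior = 0.01               # assumed prior probability P(theta = 1)
    y0 = 1.5                         # assumed decision threshold
    F0 = norm(loc=0.0, scale=1.0)    # assumed test distribution for a healthy individual
    F1 = norm(loc=2.0, scale=1.0)    # assumed test distribution for a diseased individual

    # P(Y > y0 | theta) = 1 - F_theta(y0)
    p_pos_given_diseased = 1.0 - F1.cdf(y0)
    p_pos_given_healthy = 1.0 - F0.cdf(y0)

    # Bayes theorem: P(theta = 1 | Y > y0)
    numerator = gamma_prior * p_pos_given_diseased
    posterior = numerator / (numerator + (1.0 - gamma_prior) * p_pos_given_healthy)
    print(posterior)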
In the more general case, θ can take a finite number of values, labeled 1, 2, ..., k. We can
assign to these values probabilities p1 , p2 , ..., pk which express our beliefs about θ before
we have access to the data. The data y are assumed to be the observed value of a
(multidimensional) random variable Y, and p(y | θ) is the density of y given θ (the likelihood
function).
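A minimal sketch of this discrete updating rule (the three candidate values, the prior weights
and the Poisson likelihood below are assumed purely for illustration):

    # Sketch: posterior over a parameter taking finitely many values (assumed example).
    from scipy.stats import poisson

    thetas = [1.0, 2.0, 3.0]     # assumed candidate values of theta (Poisson rates)
    prior = [0.2, 0.5, 0.3]      # p_1, ..., p_k: prior beliefs about theta
    y = 4                        # assumed observed count

    likelihood = [poisson.pmf(y, mu=t) for t in thetas]     # p(y | theta) for each value

    # posterior_i is proportional to p_i * p(y | theta_i); normalise over all values
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalised)
    posterior = [u / total for u in unnormalised]
    print(posterior)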
In Bayesian statistical inference, a prior probability distribution, often simply called the
prior, of an uncertain quantity is the probability distribution that would express one’s
beliefs about this quantity before some evidence is taken into account. For example, the
prior could be the probability distribution representing the relative proportions of voters
who will vote for a particular politician in a future election. The unknown quantity may
be a parameter of the model or a latent variable rather than an observable variable.
Prior probability, in Bayesian statistics, is the probability of an event before new data
is collected. This is the best rational assessment of the probability of an outcome based
on the current knowledge before an experiment is performed.
When θ takes values continuously on some interval, we can express our beliefs about
it with a prior density p(θ). After we have obtained the data y, our beliefs about θ are
contained in the conditional density

    p(θ | y) = p(θ) f(y | θ) / ∫ p(θ) f(y | θ) dθ.        (1)
Since θ is integrated out in the denominator, the denominator can be treated as a constant
with respect to θ. Therefore, Bayes’ formula in (1) is often written as

    p(θ | y) = c · p(θ) f(y | θ),

where

    c = 1 / f(y).

We may also write

    p(θ | y) ∝ p(θ) L(y1, y2, ..., yn | θ),

where L(y1, y2, ..., yn | θ) is the likelihood function (defined as the model density
f(y1, y2, ..., yn | θ) multiplied by any constant with respect to θ, and viewed as a function of
θ rather than of (y1, y2, ..., yn)).
The last equation may also be stated in words as:
The posterior is proportional to the prior times the likelihood.
These observations indicate a shortcut method for determining the required poste-
rior distribution which obviates the need for calculating f (y1 , y2 , ..., yn ) (which may be
difficult).
This method is to multiply the prior density (or the kernel of that density) by the
likelihood function and try to identify the resulting function of θ as the density of a
well-known or common distribution. Once the posterior distribution has been identified,
f (y1 , y2 , ..., yn ) may then be obtained easily as the associated normalising constant.
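As a hedged illustration of this shortcut (the Beta prior and Binomial likelihood below are an
assumed example, not taken from the lecture): a Beta(a, b) prior times a Binomial likelihood is
recognised, up to a constant, as the kernel of a Beta(a + y, b + n − y) density, so the normalising
constant never has to be computed directly.

    # Sketch: the "identify the kernel" shortcut for a Beta prior and Binomial data (assumed example).
    import numpy as np
    from scipy.stats import beta

    a, b = 2.0, 3.0      # assumed Beta prior parameters
    n, y = 10, 7         # assumed data: y successes in n trials

    # prior kernel times likelihood kernel, as a function of theta (constants dropped)
    theta = np.linspace(0.0005, 0.9995, 1000)
    kernel = theta**(a - 1) * (1 - theta)**(b - 1) * theta**y * (1 - theta)**(n - y)

    # normalising the kernel on the grid reproduces the Beta(a + y, b + n - y) density
    numeric_posterior = kernel / (kernel.sum() * (theta[1] - theta[0]))
    exact_posterior = beta(a + y, b + n - y).pdf(theta)
    print(np.max(np.abs(numeric_posterior - exact_posterior)))   # small, up to grid error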
A random variable η(ω) is a real-valued function defined on the space of outcomes Ω,
i. e. for every outcome ω ∈ Ω there is a real number, denoted by η(ω), which is called the
value of η(·) at ω. Its distribution function is defined by

    F(x) = P{ω : η(ω) ≤ x},    x ∈ IR1.

In words, F(x) denotes the probability that the random variable η(ω) takes on a value
that is less than or equal to x.
Some properties of the distribution function are the following:
Property 4. F(x) is right continuous. That is, for any x and any decreasing sequence
xn that converges to x,

    lim_{n→∞} F(xn) = F(x).
Theorem 1 (about Distribution Function). Let a function G(x), x ∈ IR1, satisfy
Properties 1 – 4. Then there exist a probability space (Ω, P) and a random variable η(ω) whose
distribution function coincides with the given function G(x), i. e.

    P(ω : η(ω) ≤ x) = G(x).
Therefore, to give an example of a random variable it is enough to exhibit a function which
satisfies Properties 1 – 4.
We want to stress that in the theorem about the distribution function the random variable η(ω)
is not determined in a unique way (see Appendix-2).
Definition 2. Two random variables η1(ω) and η2(ω) are said to be Identically Distributed
if their distribution functions are equal, that is, Fη1(x) = Fη2(x) for all x ∈ IR1.
The normal distribution plays a central role in probability and statistics. This distribu-
tion is also called the Gaussian distribution after Carl Friedrich Gauss, who proposed it
as a model for measurement errors.
The Poisson-distributed random variable with parameter λ > 0 is discrete, with

    p(n) = P{ω : η(ω) = n} = (λ^n / n!) e^(−λ),    n = 0, 1, 2, ...
The exponentially distributed random variable with parameter λ > 0 has the distribution function

    F(x) = 0 for x ≤ 0,   F(x) = 1 − e^(−λx) for x > 0.        (7)
The random variable identically equal to a constant c has the distribution function

    F(x) = 0 for x < c,   F(x) = 1 for x ≥ c.
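A quick numerical check (the value of λ is assumed) that formula (7) agrees with a standard
library implementation of the exponential distribution:

    # Sketch: the distribution function (7) versus scipy's exponential CDF (lambda assumed).
    import numpy as np
    from scipy.stats import expon

    lam = 2.0
    x = np.linspace(-1.0, 5.0, 13)
    F = np.where(x <= 0, 0.0, 1.0 - np.exp(-lam * x))     # formula (7)
    print(np.allclose(F, expon(scale=1.0 / lam).cdf(x)))  # True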
Consider the experiment of flipping a symmetrical coin once. The two possible outcomes
are “heads” (outcome ω1 ) and “tails” (outcome ω2 ), that is, Ω = {ω1 , ω2 }. Suppose η(ω) is
defined by putting η(ω1) = 1 and η(ω2) = −1. We may think of it as the earnings of a player
who receives or loses a dollar according as the outcome is heads or tails. The corresponding
distribution function has the form

    F(x) = 0 for x < −1,   F(x) = 1/2 for −1 ≤ x < 1,   F(x) = 1 for x ≥ 1.
§6. CONTINUOUS RANDOM VARIABLES
We say that η(ω) is an absolutely continuous random variable if there exists a function f(x),
defined for all real numbers, such that the distribution function F(x) of the random variable
η(ω) is represented in the form

    F(x) = ∫_{−∞}^{x} f(y) dy.        (8)
A function f(x) must have certain properties in order to be a density function. Since
F(x) → 1 as x → +∞, we obtain
Property 1.

    ∫_{−∞}^{+∞} f(x) dx = 1.        (9)

In addition, a density function must satisfy
Property 2. f(x) ≥ 0 for all x ∈ IR1.
Remarkably, these two properties are also sufficient for a function g(x) to be a density
function.
Theorem 2 (About Density Function). Let a function g(x), x ∈ IR1, satisfy (9) and,
in addition, the condition

    g(x) ≥ 0 for all x ∈ IR1.

Then there exist a probability space (Ω, P) and an absolutely continuous random variable η(ω)
whose density function coincides with the given function g(x).
Therefore, to give an example of an absolutely continuous random variable it is enough to
exhibit a nonnegative function which satisfies (9).
The normally distributed random variable is absolutely continuous and its density func-
tion has the form
    f(x) = (1 / (σ√(2π))) exp(−(x − a)^2 / (2σ^2)),        (11)
where a and σ are constants, with a ∈ IR1 and σ > 0.
The uniformly distributed random variable over the interval (a, b) (see Example 4) is
absolutely continuous and its density function has the form
    f(x) = 1/(b − a) for x ∈ (a, b) and f(x) = 0 for x ∉ (a, b).        (12)
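Both densities can be checked numerically against Property 1 (equation (9)); the parameter
values below are assumptions made only for the check:

    # Sketch: numerical check that the normal density (11) and the uniform density (12) integrate to 1.
    import numpy as np
    from scipy.integrate import quad

    a_loc, sigma = 1.0, 2.0       # assumed parameters of the normal density (11)
    normal_pdf = lambda x: np.exp(-(x - a_loc)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

    a, b = -1.0, 3.0              # assumed endpoints of the interval in (12)
    uniform_pdf = lambda x: 1.0 / (b - a)

    print(quad(normal_pdf, -np.inf, np.inf)[0])   # approximately 1
    print(quad(uniform_pdf, a, b)[0])             # exactly 1 over the support (a, b)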
A somewhat more intuitive interpretation of the density function may be obtained from
(14). If η(ω) is an absolutely continuous random variable having density function f (x),
then for small dx
P (ω : x ≤ η(ω) ≤ x + dx) = f (x) dx + o(dx).
Lemma 1. Let F (x) be a distribution function of a random variable η(ω). Then for any
real number x we have
    P{ω : η(ω) = x} = F(x) − F(x − 0),

where F(x − 0) denotes the left-hand limit of F at the point x.
Since the distribution function of an absolutely continuous random variable is continuous at
all points, we have

    P{ω : η(ω) = x} = 0.
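A small sketch of the lemma, using the coin-toss distribution function from the example above
and the exponential distribution function (7) (the value of λ and the small ε are assumed): the
probability of a single point equals the jump F(x) − F(x − 0), which vanishes wherever F is
continuous.

    # Sketch: P{eta = x} as the jump F(x) - F(x - 0) of the distribution function.
    import math

    def coin_cdf(x):
        # distribution function of the coin-toss variable taking the values -1 and 1
        if x < -1:
            return 0.0
        if x < 1:
            return 0.5
        return 1.0

    lam = 2.0
    expon_cdf = lambda x: 0.0 if x <= 0 else 1.0 - math.exp(-lam * x)   # formula (7)

    eps = 1e-9
    print(coin_cdf(1.0) - coin_cdf(1.0 - eps))     # 0.5: an atom at x = 1
    print(expon_cdf(1.0) - expon_cdf(1.0 - eps))   # about 0: a continuous F has no atoms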
For α > 0, set Γ(α) = ∫_0^{+∞} x^(α−1) e^(−x) dx. The symbol Γ (Greek uppercase gamma) is
reserved for this function. Integration by parts yields

    Γ(α + 1) = α Γ(α).

In particular, Γ(1) = 1, and hence for every natural number k,

    Γ(k + 1) = k!.
An important class involves values with halves. We have

    Γ(1/2) = √π,

    Γ(k + 1/2) = ((2k − 1)!! / 2^k) √π,    k = 1, 2, ...
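These identities are easy to confirm numerically with scipy's gamma function (the particular α
and k below are assumptions made only for the check):

    # Sketch: numerical check of the gamma-function identities.
    import math
    from scipy.special import gamma

    alpha, k = 2.7, 4
    print(math.isclose(gamma(alpha + 1), alpha * gamma(alpha)))    # Gamma(a + 1) = a * Gamma(a)
    print(math.isclose(gamma(k + 1), math.factorial(k)))           # Gamma(k + 1) = k!
    print(math.isclose(gamma(0.5), math.sqrt(math.pi)))            # Gamma(1/2) = sqrt(pi)

    double_factorial = math.prod(range(1, 2 * k, 2))               # (2k - 1)!! = 1 * 3 * ... * (2k - 1)
    print(math.isclose(gamma(k + 0.5), double_factorial / 2**k * math.sqrt(math.pi)))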
The following expression gives the density function for a gamma distribution.
    f(x) = 0 for x ≤ 0,   f(x) = (λ^α / Γ(α)) x^(α−1) e^(−λx) for x > 0.
The two parameters λ and α may be any positive values (λ > 0 and α > 0).
A special case of this function occurs when α = 1. We have

    f(x) = 0 for x ≤ 0,   f(x) = λ e^(−λx) for x > 0,

i.e. the exponential distribution with parameter λ. For a gamma-distributed random variable η,

    Eη = α/λ  and  Var(η) = α/λ^2.
When α is natural, say α = n, the gamma distribution with parameters (n, λ) often arises
in practice:
    f(x) = 0 for x ≤ 0,   f(x) = (λ^n / (n − 1)!) x^(n−1) e^(−λx) for x > 0.
This distribution is often referred to in the literature as the n-Erlang distribution. Note
that when n = 1, this distribution reduces to the exponential.
The gamma distribution with λ = 1/2 and α = n/2 (n a natural number) is called the χ²_n (read
“chi-squared”) distribution with n degrees of freedom:

    f(x) = 0 for x ≤ 0,   f(x) = (1 / (2^(n/2) Γ(n/2))) x^(n/2−1) e^(−x/2) for x > 0.
We have

    Eχ²_n = n  and  Var χ²_n = 2n.
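A hedged check of these moments with scipy (the parameter values are assumed); note that
scipy parameterises the gamma distribution by the shape α and the scale 1/λ:

    # Sketch: mean and variance of the gamma and chi-squared distributions via scipy.
    from scipy.stats import gamma, chi2

    alpha, lam = 3.0, 2.0                   # assumed gamma parameters
    g = gamma(a=alpha, scale=1.0 / lam)     # scipy uses shape a and scale = 1/lambda
    print(g.mean(), alpha / lam)            # both 1.5
    print(g.var(), alpha / lam**2)          # both 0.75

    n = 5                                   # assumed degrees of freedom
    print(chi2(n).mean(), chi2(n).var())    # 5 and 10, i.e. n and 2n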
In Bayesian inference, the beta distribution is the conjugate prior probability distribution
for the Bernoulli, binomial, negative binomial and geometric distributions.
The beta density with parameters a > 0 and b > 0 is

    f(x) = x^(a−1) (1 − x)^(b−1) / B(a, b) for 0 < x < 1, and f(x) = 0 otherwise,

where

    B(a, b) = ∫_0^1 x^(a−1) (1 − x)^(b−1) dx.
Note that when a = b = 1 the beta density is the uniform over the interval [0, 1]. When
a and b are greater than 1 the density is bell-shaped, but when they are less than 1
it is U-shaped. When a = b, the beta density is symmetric about 1/2. When b > a, the
density is skewed to the left (in the sense that smaller values become more likely), and
it is skewed to the right when a > b. The following relationship exists between beta and
gamma functions:
    B(a, b) = B(b, a) = ∫_0^1 x^(a−1) (1 − x)^(b−1) dx

(if we make the change of variable x = y/(1 + y))

    = ∫_0^{+∞} y^(a−1) / (1 + y)^(a+b) dy = Γ(a) Γ(b) / Γ(a + b).

For a beta-distributed random variable η,

    Eη = a/(a + b)  and  Var(η) = ab / ((a + b)^2 (a + b + 1)).
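The relation B(a, b) = Γ(a) Γ(b) / Γ(a + b) and the beta moments above are easy to check
numerically (the values of a and b are assumed):

    # Sketch: numerical check of B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b) and the beta moments.
    import math
    from scipy.integrate import quad
    from scipy.special import gamma
    from scipy.stats import beta

    a, b = 2.5, 4.0
    B_integral = quad(lambda x: x**(a - 1) * (1 - x)**(b - 1), 0.0, 1.0)[0]
    print(math.isclose(B_integral, gamma(a) * gamma(b) / gamma(a + b), rel_tol=1e-6))

    eta = beta(a, b)
    print(math.isclose(eta.mean(), a / (a + b)))
    print(math.isclose(eta.var(), a * b / ((a + b)**2 * (a + b + 1))))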
APPENDIX-1:
The correctness of Example 2: Indeed, the function (5), viewed as a function of its upper limit, is continuous.
Therefore Properties 4 and 3 are satisfied. Since the integrand is positive we also have Property 1.
Thus it is left to prove that
    (1 / (σ√(2π))) ∫_{−∞}^{+∞} exp(−(y − a)^2 / (2σ^2)) dy = 1.
Let us make a change of variable:
    x = (y − a)/σ,    dy = σ dx.
Therefore, it suffices to prove that

    (1/√(2π)) ∫_{−∞}^{+∞} exp(−x^2/2) dx = 1.
To prove that F(x) is indeed a distribution function, we need to show that

    A = ∫_{−∞}^{+∞} exp(−x^2/2) dx = √(2π).
We have

    A^2 = ∫_{−∞}^{+∞} exp(−x^2/2) dx · ∫_{−∞}^{+∞} exp(−y^2/2) dy
        = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} exp(−(x^2 + y^2)/2) dx dy.
We now evaluate the double integral by means of a change of variables to polar coordinates. That is, let
    x = r cos ϕ,    y = r sin ϕ,    dx dy = r dr dϕ.
Thus
    A^2 = ∫_0^{2π} ∫_0^{∞} exp(−r^2/2) r dr dϕ = 2π ∫_0^{∞} r exp(−r^2/2) dr
        = −2π exp(−r^2/2) |_0^{∞} = 2π.
Hence A = √(2π) and the result is proved. Therefore, by the theorem about the distribution
function there exists a random variable whose distribution function has the form (5).
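The key step of the appendix, A = ∫ exp(−x^2/2) dx = √(2π), can also be confirmed numerically:

    # Sketch: numerical confirmation that the Gaussian integral equals sqrt(2*pi).
    import math
    from scipy.integrate import quad

    A = quad(lambda x: math.exp(-x**2 / 2.0), -math.inf, math.inf)[0]
    print(A, math.sqrt(2.0 * math.pi))    # both approximately 2.5066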
APPENDIX-2:
Let us prove the following relation between the beta and gamma functions:

    B(a, b) = B(b, a) = ∫_0^1 x^(a−1) (1 − x)^(b−1) dx = Γ(a) Γ(b) / Γ(a + b).
Writing Γ(a) = ∫_0^{+∞} e^(−u) u^(a−1) du and Γ(b) = ∫_0^{+∞} e^(−v) v^(b−1) dv, we make the
substitutions u = t^2, v = z^2 and then pass to polar coordinates

    t = ρ cos ϕ,    z = ρ sin ϕ,    dt dz = ρ dρ dϕ;

we obtain

    Γ(a) Γ(b) = ∫_0^{2π} ∫_0^{+∞} e^(−ρ^2) ρ^(2(a+b)−2) |cos ϕ|^(2a−1) |sin ϕ|^(2b−1) ρ dρ dϕ.
Now making the change of variable ρ^2 = x, dx = 2ρ dρ, we get

    Γ(a) Γ(b) = (1/2) ∫_0^{2π} |cos ϕ|^(2a−1) |sin ϕ|^(2b−1) dϕ · ∫_0^{+∞} e^(−x) x^((a+b)−1) dx =
    = Γ(a + b) · 2 ∫_0^{π/2} (cos ϕ)^(2a−1) (sin ϕ)^(2b−1) dϕ.
Denoting cos^2 ϕ = s, ds = −2 cos ϕ sin ϕ dϕ, we get

    Γ(a) Γ(b) = Γ(a + b) ∫_0^1 s^(a−1) (1 − s)^(b−1) ds.
Finally we get

    Γ(a) Γ(b) = B(a, b) Γ(a + b),

which was required to prove.