All Lectures, Fall 2018, 201A

Aditya Guntuboyina

Contents

1.2 Variance
3 Common Distributions
7 Continuous Distributions
7.2 Uniform Distribution
8 Variable Transformations
10 Joint Densities
16 Convergence in Distribution
21 More on the Weak Law and Convergence in Probability
23 Delta Method
26 Conditioning
26.1 Basics
26.2 Conditional Distributions, Law of Total Probability and Bayes Rule for Discrete Random Variables
31 Law of Total Probability and Bayes Rule for Continuous Random Variables
32 Law of Total Probability and Bayes Rule for Continuous Random Variables
34 Conditional Joint Densities
36 Conditional Expectation
37 Conditional Variance
38 Random Vectors
39 Random Vectors
40.3.2 Whitening
43 Residual
45 Partial Correlation
49 Moment Generating Functions of Random Vectors
We will start the course with a review of undergraduate probability material. The review will be fairly
quick and should be completed in about six lectures.
Probability theory is invoked in situations that are (or can be treated as) chance or random experiments. In
such a random experiment, the sample space is the set of all possible outcomes and we shall denote it by Ω.
For example, suppose we are tossing a coin 2 times. Then a reasonable sample space is {hh, ht, th, tt}.
Subsets of the sample space are called Events. For example, in the above example, {hh, ht} is an Event
and it represents the event that the first of the two tosses results in a heads. Similarly, the event that at least
one of the two tosses results in a heads is represented by {hh, ht, th}.
Given events A_1, A_2, . . . , we use the following notation:

1. A_1^c denotes the event that A_1 does not happen. We say that A_1^c is the complement of the event A_1.

2. ∪_{i≥1} A_i denotes the event that at least one of A_1, A_2, . . . happens.
Probability is defined as a function that maps (or associates) events to real numbers between 0 and 1 and
which satisfies certain natural consistency properties. Specifically P is a probability provided:
3. For the whole sample space (= the “certain event”), P(Ω) = 1.
4. If an event A is a disjoint union of a sequence of events A_1, A_2, . . . (this means that every point in A
belongs to exactly one of the sets A_1, A_2, . . . ), then P(A) = ∑_{i≥1} P(A_i).
Example 0.1 (Nontransitive Dice). Consider the following set of dice:
What is the probability that A rolls a higher number than B? What is the probability that B rolls higher than
C? What is the probability that C rolls higher than A? Assume that, in any roll of dice, all outcomes are
equally likely.
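The dice themselves are not reproduced in these notes, so here is a hedged R illustration using one classic nontransitive set (the face values below are an assumption of ours, not necessarily the dice used in lecture). It shows how such probabilities can be computed by enumerating all 36 equally likely face pairs.

    # Hypothetical nontransitive dice (assumed faces, for illustration only)
    A <- c(2, 2, 4, 4, 9, 9)
    B <- c(1, 1, 6, 6, 8, 8)
    C <- c(3, 3, 5, 5, 7, 7)
    # P(die x rolls strictly higher than die y), all 36 face pairs equally likely
    beats <- function(x, y) mean(outer(x, y, ">"))
    beats(A, B)   # 5/9
    beats(B, C)   # 5/9
    beats(C, A)   # 5/9

For this assumed set, each die beats the next in the cycle with probability 5/9, which is what makes the dice "nontransitive".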
Consider a probability P and an event B for which P(B) > 0. We can then define P(A|B) for every event A
as
P(A|B) := P(A ∩ B) / P(B).    (1)
P(A|B) is called the conditional probability of A given B.
Here are two very interesting problems from Mosteller's delightful book (titled Fifty Challenging Problems in Probability) illustrating the use of conditional probabilities.
Example 0.2 (From Mosteller’s book (Problem 13; The Prisoner’s Dilemma)). Three prisoners, A, B, and
C, with apparently equally good records have applied for parole. The parole board has decided to release two
of the three, and the prisoners know this but not which two. A warder friend of prisoner A knows who are
to be released. Prisoner A realizes that it would be unethical to ask the warder if he, A, is to be released,
but thinks of asking for the name of one prisoner other than himself who is to be released. He thinks that
before he asks, his chances of release are 2/3. He thinks that if the warder says ”B will be released,” his own
chances have now gone down to 1/2, because either A and B or B and C are to be released. And so A decides
not to reduce his chances by asking. However, A is mistaken in his calculations. Explain.
Example 0.3 (From Mosteller’s book (Problem 20: The Three-Cornered Duel)). A, B, and C are to fight
a three-cornered pistol duel. All know that A’s chance of hitting his target is 0.3, C’s is 0.5, and B never
misses. They are to fire at their choice of target in succession in the order A, B, C, cyclically (but a hit man
loses further turns and is no longer shot at) until only one man is left unhit. What should A’s strategy be?
We say that two events A and B are independent (under the probability P) if
P(A|B) = P(A).
A very important formula involving conditional probabilities is the Bayes’ rule. This says that
P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|A^c)P(A^c)].    (2)
Can you derive this from the definition (1) of conditional probability?
Example 0.4. Consider a clinical test for cancer that can yield either a positive (+) or negative (-) result.
Suppose that a patient who truly has cancer has a 1% chance of slipping past the test undetected. On the
other hand, suppose that a cancer-free patient has a 5% probability of getting a positive test result. Suppose
also that 2% of the population has cancer. Assuming that a patient who has been given the test got a positive
test result, what is the probability that they have cancer?
Suppose C and C^c are the events that the patient has cancer and does not have cancer respectively. Also
suppose that + and − are the events that the test yields a positive and negative result respectively. By the
information given, we have P(−|C) = 0.01 (so P(+|C) = 0.99), P(+|C^c) = 0.05 and P(C) = 0.02. Bayes' rule (2) then gives

P(C|+) = (0.99)(0.02) / [(0.99)(0.02) + (0.05)(0.98)] = 0.0198/0.0688 ≈ 0.288.

Therefore the probability that this patient has cancer (given that the test gave a positive result) is about 29%.
This means, in particular, that it is still unlikely that they have cancer even though the test gave a positive
result (note though that the probability of cancer increased from 2% to 29%).
This also means that the test will yield a positive result about 7% of the time, since P(+) = 0.0688 (note that only 2% of the population has cancer).

Suppose now that P(C) = 0.001 (as opposed to P(C) = 0.02) and assume that P(−|C) and P(+|C^c) stay
at 0.01 and 0.05 as before. Then P(+) = (0.99)(0.001) + (0.05)(0.999) ≈ 0.051 and P(C|+) ≈ 0.019.
Here the true cancer rate of 0.001 has yielded an apparent rate of about 0.05 (which is an increase by a factor
of 50). Think about this in the setting where the National Rifle Association is taking a survey by asking a
sample of citizens whether they used a gun in self-defense during the past year. Take C to be true usage and
+ to be reported usage. If only one person in a thousand had truly used a gun in self-defense, it will appear
that one in twenty did. These examples are taken from the amazing book titled “Understanding Uncertainty”
by Dennis Lindley (I feel that every student of probability and statistics should read this book).
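As a quick sanity check of the numbers above, the Bayes rule computation can be done in a few lines of R (a minimal sketch; the helper function posterior is ours):

    # P(C | +) via Bayes rule (2): sens = P(+|C), fp = P(+|C^c)
    posterior <- function(prior, sens = 0.99, fp = 0.05) {
      sens * prior / (sens * prior + fp * (1 - prior))
    }
    posterior(0.02)               # about 0.29
    posterior(0.001)              # about 0.02
    0.99 * 0.02  + 0.05 * 0.98    # P(+) when P(C) = 0.02,  about 0.07
    0.99 * 0.001 + 0.05 * 0.999   # P(+) when P(C) = 0.001, about 0.05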
A random variable is a function that attaches a number to each element of the sample space. In other words,
it is a function mapping the sample space to real numbers.
For example, in the chance experiment of tossing a coin 50 times, the number of heads is a random
variable. Another random variable is the number of heads before the first tail. Another random variable is
the number of times the pattern hththt is seen.
Many real-life quantities such as (a) The average temperature in Berkeley tomorrow, (b) The height of a
randomly chosen student in this room, (c) the number of phone calls that I will receive tomorrow, (d) the
number of accidents that will occur on Hearst avenue in September, etc. can be treated as random variables.
For every event A (recall that events are subsets of the sample space), one can associate a random variable
which takes the value 1 if A occurs and 0 if A does not occur. This is called the indicator random variable
corresponding to the event A and is denoted by I(A).
The distribution of a random variable is, informally, a description of the set of values that the random
variable takes and the probabilities with which it takes those values.
If a random variable X takes a finite or countably infinite set of possible values (in this case, we say that
X is a discrete random variable), its distribution is described by a listing of the values a1 , a2 , . . . that it takes
together with a specification of the probabilities:
P{X = ai } for i = 1, 2, . . . .
The function which maps ai to P{X = ai } is called the probability mass function (pmf) of the discrete random
variable X.
If a random variable X takes a continuous set of values, its distribution is often described by a function
called the probability density function (pdf). The pdf is a function f on R that satisfies f (x) ≥ 0 for every
x ∈ ℝ and

∫_{−∞}^{∞} f(x) dx = 1.
The pdf f of a random variable can be used to calculate P{X ∈ A} for every set A via
P{X ∈ A} = ∫_A f(x) dx.
It is important to remember that a density function f (x) of a random variable does not represent probability
(in particular, it is quite common for f (x) to take values much larger than one). Instead, the value f (x) can
be thought of as a constant of proportionality. This is because usually (as long as f is continuous at x):
lim_{δ↓0} (1/δ) P{x ≤ X ≤ x + δ} = f(x).
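As a concrete illustration of this point (our own example, in R): a normal density with a small standard deviation exceeds 1 near its mean, even though every probability it produces is at most 1.

    dnorm(0, mean = 0, sd = 0.1)                 # about 3.99: the density value exceeds 1
    delta <- 1e-4
    pnorm(delta, sd = 0.1) - pnorm(0, sd = 0.1)  # about 3.99e-4, i.e. roughly f(0) * delta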
The cumulative distribution function (cdf) of a random variable X is the function defined as F(x) := P{X ≤ x}.
This is defined for all random variables, discrete or continuous. If the random variable X has a density f, then
its cdf is given by

F(x) = ∫_{−∞}^{x} f(t) dt.
The cdf has the following properties: (a) It is non-decreasing, (b) right-continuous, (c) limx↓−∞ F (x) = 0
and limx↑+∞ F (x) = 1.
Let X be a discrete random variable and let g be a real-valued function on the range of X. We then say that
g(X) has finite expectation provided

∑_x |g(x)| P{X = x} < ∞,

where the summation is over all possible values x of X. Note the presence of the absolute value on g(x)
above. When g(X) has finite expectation in this discrete case, we define

Eg(X) = ∑_x g(x) P{X = x}.    (3)
Analogously, if X is a continuous random variable with density (pdf) f , then we say that g(X) has finite
expectation provided

∫_{−∞}^{∞} |g(x)| f(x) dx < ∞.
When g(X) has finite expectation, we define Eg(X) as
Eg(X) = ∫_{−∞}^{∞} g(x) f(x) dx.    (4)
Why do we need to ensure finiteness of the absolute sums (or integrals) before defining expectation? Because
otherwise the sum in (3) (or the integral in (4)) might be ill-defined. For example, suppose X is the discrete
random variable which takes the values . . . , −3, −2, −1, 1, 2, 3, . . . with probabilities

P{X = i} = 3/(π² i²)    for i ∈ ℤ, i ≠ 0.

Then the sum in (3) for g(x) = x becomes

EX = (3/π²) ∑_{i∈ℤ: i≠0} 1/i,

which cannot be made any sense of. It is easy to see here that ∑_x |x| P{X = x} = ∞.
If A is an event, then recall that I(A) denotes the corresponding indicator random variable that equals 1
when A holds and 0 when A does not hold. It is convenient to note that the expectation of I(A) precisely
equals P(A).
Expectation is linear: E(aX + bY) = a E(X) + b E(Y) for any two random variables X and Y with finite expectations and real numbers a and b.
1.2 Variance
A random variable X is said to have finite variance if X² has finite expectation (do you know that when
X² has finite expectation, X also will have finite expectation? How will you prove this?). In that case, the
variance of X is defined as

Var(X) := E(X − E(X))².

It is clear from the definition that the variance of a random variable X measures the average variability in the
values taken by X around its mean E(X).

Suppose X is a discrete random variable taking finitely many values x_1, . . . , x_n with equal probabilities.
Then what is the variance of X?
The square root of the variance of X is called the standard deviation of X and is often denoted by SD(X).
The Expectation of a random variable X has the following variational property: it is the value of a that
minimizes the quantity E(X − a)2 over all real numbers a. Do you know how to prove this?
If the variance of a random variable X is small, then X cannot deviate much from its mean (= E(X) = µ).
This can be made precise by Chebyshev’s inequality which states the following.
Chebyshev's Inequality: Let X be a random variable with finite variance and mean µ. Then for every
ε > 0, the following inequality holds:

P{|X − µ| ≥ ε} ≤ Var(X)/ε².

In other words, the probability that X deviates by more than ε from its mean is bounded from above by
Var(X)/ε².

Proof of Chebyshev's inequality: Just argue that

I{|X − µ| ≥ ε} ≤ (X − µ)²/ε²

and take expectations on both sides (on the left hand side, we have the indicator random variable that takes
the value 1 when |X − µ| ≥ ε and 0 otherwise).
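A quick Monte Carlo check of Chebyshev's inequality in R (an illustrative sketch of ours, using Exp(1) data, which has mean 1 and variance 1):

    set.seed(1)
    x   <- rexp(1e5, rate = 1)   # mean 1, variance 1
    eps <- 2
    mean(abs(x - 1) >= eps)      # empirical P{|X - mu| >= eps}, about 0.05
    1 / eps^2                    # Chebyshev bound Var(X)/eps^2 = 0.25 (valid, but crude)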
We say that two random variables X and Y are independent if conditioning on any event involving Y does
not change the probability of any event involving X, i.e.,

P{X ∈ A | Y ∈ B} = P{X ∈ A}

for all events A and B (with P{Y ∈ B} > 0). When X and Y are independent, it can be shown that:

1. g(X) and h(Y) are independent for every pair of functions g and h.

2. E(g(X)h(Y)) = E g(X) E h(Y) for every pair of functions g and h (for which these expectations exist).

More generally, we say that random variables X_1, . . . , X_k are (mutually) independent if, for every 1 ≤ i ≤
k, conditioning on any event involving the X_j, j ≠ i, does not change the probability of any event involving X_i.
From here one can easily derive properties of mutual independence analogous to the two listed above.
3 Common Distributions
A random variable X is said to have the Ber(p) (Bernoulli with parameter p) distribution if it takes the two
values 0 and 1 with P{X = 1} = p.
Note then that EX = p and Var(X) = p(1 − p). For what value of p is X most variable? Least variable?
A random variable X is said to have the Binomial distribution with parameters n and p (n is a positive
integer and p ∈ [0, 1]) if it takes the values 0, 1, . . . , n with pmf given by
P{X = k} = C(n, k) p^k (1 − p)^{n−k}    for every k = 0, 1, . . . , n.

Here C(n, k) is the binomial coefficient:

C(n, k) := n!/(k!(n − k)!).
The main example of a Bin(n, p) random variable is the number of heads obtained in n independent tosses
of a coin with probability of heads equalling p.
Here is an interesting problem about the Binomial distribution from Mosteller’s book (you can easily
calculate these probabilities in R).
Example 3.1 (From Mosteller's book (Problem 19: Isaac Newton helps Samuel Pepys)). Pepys wrote Newton
to ask which of three events is more likely: that a person gets (a) at least 1 six when 6 dice are rolled, (b) at
least 2 sixes when 12 dice are rolled, or (c) at least 3 sixes when 18 dice are rolled. What is the answer?
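As suggested above, these probabilities are one-liners in R with the Binomial cdf pbinom:

    1 - pbinom(0, size = 6,  prob = 1/6)   # at least 1 six in 6 rolls:    about 0.665
    1 - pbinom(1, size = 12, prob = 1/6)   # at least 2 sixes in 12 rolls: about 0.619
    1 - pbinom(2, size = 18, prob = 1/6)   # at least 3 sixes in 18 rolls: about 0.597

So, among the three, event (a) is the most likely.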
Let X denote the number of heads in n independent tosses of a coin with probability of heads being p.
Then we know that X ∼ Bin(n, p). If, now, Xi denotes the binary random variable that takes 1 if the ith
toss is a heads and 0 if the ith toss is a tail, then it should be clear that
X = X1 + · · · + Xn .
Note that each Xi is a Ber(p) random variable and that X1 , . . . , Xn are independent. Therefore Bin(n, p)
random variables can be viewed as sums of n independent Ber(p) random variables. The Central Limit
Theorem (which we will study in detail later in the class) implies that the sum of a large number of i.i.d
(what is i.i.d?) random variables is approximately normal. This means that when n is large and p is held fixed,
the Bin(n, p) distribution looks like a normal distribution. We shall make this precise later. In particular,
this means that Binomial probabilities can be approximately calculated via normal probabilities for n large
and p fixed. From this point of view, what is the probability of getting k or more sixes from 6k rolls of a die
when k is large?
What is the mean of the Bin(n, p) distribution? What is an unbiased estimate of the probability of heads
based on n independent tosses of a coin? What is the variance of Bin(n, p)?
A random variable X is said to have the Poisson distribution with parameter λ > 0 (denoted by Poi(λ)) if
X takes the values 0, 1, 2, . . . with pmf given by

P{X = k} = e^{−λ} λ^k / k!    for k = 0, 1, 2, . . . .
The main utility of the Poisson distribution comes from the following fact:
Fact: The binomial distribution Bin(n, p) is well-approximated by the Poisson distribution Poi(np)
provided that the quantity np² is small.

To see why, note that P{Bin(n, p) = 0} = (1 − p)^n = exp(n log(1 − p)) = exp(−np − np²/2 − · · · ).
Now because np² is small, we can ignore the second (and higher) terms in the exponent to obtain that P{Bin(n, p) = 0} is ap-
proximated by exp(−np), which is precisely equal to P{Poi(np) = 0}. One can similarly approximate
P{Bin(n, p) = k} by P{Poi(np) = k} for every fixed k = 0, 1, 2, . . . .
There is a formal theorem (known as Le Cam's theorem) which rigorously proves that Bin(n, p) ≈ Poi(np)
when np² is small. This is stated without proof below (its proof is beyond the scope of this class).

Theorem 3.2 (Le Cam's Theorem). Suppose X_1, . . . , X_n are independent random variables such that X_i ∼
Ber(p_i) for some p_i ∈ [0, 1] for i = 1, . . . , n. Let X = X_1 + · · · + X_n and λ = p_1 + · · · + p_n. Then

∑_{k=0}^{∞} |P{X = k} − P{Poi(λ) = k}| < 2 ∑_{i=1}^{n} p_i².

When p_1 = · · · = p_n = p, the right hand side equals 2np², and thus when np² is small, the probability P{Bin(n, p) = k} is close to P{Poi(np) = k} for each k = 0, 1, . . . .
This is because when p = λ/n, we have np² = λ²/n, which will be small when n is large.
This approximation property of the Poisson distribution is the reason why the Poisson distribution is
used to model counts of rare events. For example, the number of phone calls a telephone operator receives
in a day, the number of accidents in a particular street in a day, the number of typos found in a book, the
number of goals scored in a football game can all be modelled as Poi(λ) for some λ > 0. Can you justify
why these real-life random quantities can be modeled by the Poisson distribution?
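As a numerical illustration of the approximation (a sketch with arbitrarily chosen n and p such that np² is small):

    n <- 1000; p <- 0.003                   # np = 3, np^2 = 0.009
    k <- 0:8
    round(rbind(binomial = dbinom(k, n, p),
                poisson  = dpois(k, n * p)), 4)      # the two pmfs agree to about 3 decimals
    sum(abs(dbinom(0:n, n, p) - dpois(0:n, n * p)))  # well below Le Cam's bound 2*n*p^2 = 0.018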
The following example presents another situation where the Poisson distribution provides a good approx-
imation.
Example 3.3. Consider n letters numbered 1, . . . , n and n envelopes numbered 1, . . . , n. The right envelope
for letter i is the envelope i. Suppose that I take a random permutation σ1 , . . . , σn of 1, . . . , n and then place
the letter σi in the envelope i. Let X denote the number of letters which are in their right envelopes. What
is the distribution of X?
Let Xi be the random variable which takes the value 1 when the ith letter is in the ith envelope and 0
otherwise. Then clearly X = X1 + · · · + Xn . Note that
P{X_i = 1} = 1/n    for each i = 1, . . . , n.
This is because the ith letter is equally likely to be in any of the n envelopes. This means therefore that
Xi ∼ Ber(1/n) for i = 1, . . . , n.
If the X_i's were also independent, then X = X_1 + · · · + X_n would be Bin(n, 1/n), which is very close to Poi(1)
for large n. But the X_i's are not independent here because for i ≠ j,

P{X_i = 1 | X_j = 1} = 1/(n − 1) ≠ 1/n = P{X_i = 1}.

However, the dependence is relatively weak and it turns out that the distribution of X is quite close to Poi(1).
We shall demonstrate this by showing that P{X = 0} is approximately equal to P{Poi(1) = 0} = e^{−1}. I will
leave it as an exercise to show that P{X = k} ≈ P{Poi(1) = k} for every fixed k. To compute P{X = 0}, we
can write

P{X = 0} = P{∏_{i=1}^{n} (1 − X_i) = 1}
= E ∏_{i=1}^{n} (1 − X_i)
= E[1 − ∑_{i=1}^{n} X_i + ∑_{i<j} X_i X_j − ∑_{i<j<k} X_i X_j X_k + · · · + (−1)^n X_1 ⋯ X_n]
= 1 − ∑_i E(X_i) + ∑_{i<j} E(X_i X_j) − ∑_{i<j<k} E(X_i X_j X_k) + · · · + (−1)^n E(X_1 ⋯ X_n).

Since any specified set of m letters all land in their own envelopes with probability (n − m)!/n!, each sum
∑_{i_1<···<i_m} E(X_{i_1} ⋯ X_{i_m}) equals C(n, m) (n − m)!/n! = 1/m!. Therefore P{X = 0} = ∑_{m=0}^{n} (−1)^m/m!,
which converges to e^{−1} as n → ∞.
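A short simulation (our own check, not part of the notes) confirms both that P{X = 0} ≈ e^{−1} and that the whole distribution of X is close to Poi(1):

    set.seed(1)
    n <- 20
    X <- replicate(1e5, sum(sample(n) == 1:n))  # number of matches in a random permutation
    mean(X == 0)                                # about exp(-1) = 0.368
    round(table(X)[1:5] / 1e5, 3)               # empirical P{X = 0}, ..., P{X = 4}
    round(dpois(0:4, lambda = 1), 3)            # Poisson(1) probabilities for comparison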
It is an easy exercise to show that the expectation and variance of a Poi(λ) random variable are both
equal to λ. This also makes sense because of the connection Poi(λ) ≈ Bin(n, λ/n), as

E(Bin(n, λ/n)) = λ   and   Var(Bin(n, λ/n)) = n · (λ/n) · (1 − λ/n) → λ   as n → ∞.
When modeling count data via the Poisson distribution, it is possible to empirically check the assumption
that the variance is equal to the mean. If the empirical variance seems much higher than the mean, then it
is said that there is overdispersion in which case Poisson may not be a good model for the data.
Given two random variables X and Y (such that X 2 and Y 2 have finite expectation), the covariance between
X and Y is denoted by Cov(X, Y ) and is defined as
Cov(X, Y ) := E [(X − µX )(Y − µY )] (5)
where µX := E(X) and µY := E(Y ). In other words, Cov(X, Y ) is defined as the Expectation of the random
variable (X − µ_X)(Y − µ_Y) (but does this random variable have finite expectation? Can you verify that this
is a consequence of the assumption that X² and Y² have finite expectation?).
Can you prove this as a consequence of the definition (5) of Covariance and the linearity of the Expectation
operator?
When X = Y , it is easy to see that Cov(X, X) is simply the Variance of X. Using this connection between
Covariance and Variance and (6), can you deduce the following standard properties of Variance:
1. Var(aX + b) = a² Var(X).

2. Var(∑_i a_i X_i) = ∑_i a_i² Var(X_i) + ∑_{i≠j} a_i a_j Cov(X_i, X_j).
The correlation between two random variables X and Y (which are such that X² and Y² have finite expectation)
is defined as:

ρ_{X,Y} := Cov(X, Y)/(SD(X) SD(Y)) = Cov(X, Y)/√(Var(X) Var(Y)).
If ρX,Y = 0, we say that X and Y are uncorrelated.
Proposition 4.1. Two facts about correlation:
Proof. Write

ρ_{X,Y} = Cov(X, Y)/√(Var(X) Var(Y)) = E[ (X − µ_X)/√Var(X) · (Y − µ_Y)/√Var(Y) ].

Use the standard inequality

ab ≤ (a² + b²)/2    (7)

with a = (X − µ_X)/√Var(X) and b = (Y − µ_Y)/√Var(Y) to obtain

ρ_{X,Y} ≤ E[ (X − µ_X)²/(2 Var(X)) + (Y − µ_Y)²/(2 Var(Y)) ] = Var(X)/(2 Var(X)) + Var(Y)/(2 Var(Y)) = 1.

This proves that ρ_{X,Y} ≤ 1. To prove that ρ_{X,Y} ≥ −1, argue similarly by using

ab ≥ −(a² + b²)/2.    (8)
Cauchy-Schwarz Inequality: The fact that the correlation ρ_{X,Y} lies between −1 and 1 is sometimes proved
via the Cauchy-Schwarz inequality, which states the following: For every pair of random variables Z_1 and
Z_2, we have

|E(Z_1 Z_2)| ≤ √(E(Z_1²)) √(E(Z_2²)).    (9)

The fact that |ρ_{X,Y}| ≤ 1 is deduced from the above inequality by taking Z_1 = X − µ_X and Z_2 = Y − µ_Y.
Can you prove the Cauchy-Schwarz inequality (9) using (7) and (8)?
Uncorrelatedness and Independence: The following summarizes the relation between uncorrelatedness
and independence: if X and Y are independent (and their correlation is defined), then they are uncorrelated;
however, uncorrelated random variables need not be independent.
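A standard R illustration of the one-way implication (our own example): with X standard normal and Y = X², the pair is uncorrelated, since Cov(X, X²) = E X³ = 0, yet X and Y are clearly dependent.

    set.seed(1)
    x <- rnorm(1e5)
    y <- x^2
    cor(x, y)                     # near 0: uncorrelated
    mean(y[abs(x) > 1]); mean(y)  # conditional and unconditional means of Y differ: dependent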
5 Correlation and Regression
An important property of ρX,Y is that it measures the strength of linear association between X and Y .
This is explained in this section. Consider the problem of approximating the random variable Y by a linear
function β0 + β1 X of X. For given numbers β0 and β1 , let us measure the accuracy of approximation of Y
by β_0 + β_1 X by the mean-squared error:

L(β_0, β_1) := E(Y − β_0 − β_1 X)².

If β_0 + β_1 X is a good approximation of Y, then L(β_0, β_1) should be low. Conversely, if β_0 + β_1 X is a poor
approximation of Y, then L(β_0, β_1) should be high. What is the smallest possible value of L(β_0, β_1) as β_0
and β_1 vary over all real numbers? It can be shown that

min_{β_0, β_1} L(β_0, β_1) = Var(Y) (1 − ρ²_{X,Y}).    (10)

The fact (10) precisely captures the interpretation that correlation measures the strength of linear associa-
tion between Y and X. This is because min_{β_0,β_1} L(β_0, β_1) represents the smallest possible mean squared error
in approximating Y by a linear function of X, and (10) says that it is directly related to the correlation
between Y and X.
Can you explicitly write down the values of β0 and β1 which minimize L(β0 , β1 )?
Does any of the above remind you of linear regression? In what way?
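For reference, the minimizers are β_1 = Cov(X, Y)/Var(X) = ρ_{X,Y} SD(Y)/SD(X) and β_0 = E(Y) − β_1 E(X), which are exactly the population versions of the least squares coefficients. A small R sketch with simulated data (the model below is our own choice) makes the connection explicit:

    set.seed(1)
    x <- rnorm(500)
    y <- 2 + 3 * x + rnorm(500, sd = 2)
    fit <- lm(y ~ x)
    coef(fit)[["x"]]                       # fitted slope
    cor(x, y) * sd(y) / sd(x)              # rho * SD(Y)/SD(X): the same number
    mean(residuals(fit)^2)                 # minimized mean squared error
    var(y) * (1 - cor(x, y)^2)             # close to Var(Y) * (1 - rho^2), as in (10)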
In the last class, we looked at the Binomial and Poisson distributions. Both of these are discrete distributions
that can be motivated by coin tossing (Bin(n, p) is the distribution of the number of heads in n independent
tosses of a coin with probability of heads p and Poi(λ) ≈ Bin(n, λ/n)). We shall now revise two more
discrete distributions which arise from coin tossing: the geometric distribution and the negative binomial
distribution.
We say that X has the Geometric distribution with parameter p ∈ [0, 1] (written as X ∼ Geo(p)) if X takes
the values 1, 2, . . . with the probabilities:
P{X = k} = (1 − p)^{k−1} p    for k = 1, 2, . . . .
It is easy to see that the number of independent tosses (of a coin with probability of heads p) required to get
the first head has the Geo(p) distribution.
The Geo(p) distribution has the interesting property of memorylessness i.e., if X ∼ Geo(p), then
P {X > m + n|X > n} = P {X > m} . (11)
This is easy to check as P{X > m} = (1 − p)^m. It is also interesting that the Geometric distribution is
the only distribution on {1, 2, . . . } which satisfies the memorylessness property (11). To see this, suppose
that X is a random variable satisfying (11) which takes values in {1, 2, . . . }. Let G(m) := P{X > m} for
m = 1, 2, . . . . Then (11) is the same as
G(m + n) = G(m)G(n).
This clearly gives G(m) = (G(1))^m for each m = 1, 2, . . . . Now G(1) = P{X > 1} = 1 − P{X = 1}. If
p := P{X = 1}, then

G(m) = (1 − p)^m,

which means that P{X = k} = P{X > k − 1} − P{X > k} = p(1 − p)^{k−1} for every k ≥ 1, meaning that X is
Geo(p).
Let X denote the number of tosses (of a coin with probability of heads p) required to get the kth head. The
distribution of X is then given by the following. X takes the values k, k + 1, . . . and

P{X = k + i} = C(k + i − 1, i) p^k (1 − p)^i
= [(k + i − 1)(k + i − 2) ⋯ (k + 1)k / i!] p^k (1 − p)^i
= (−1)^i [(−k)(−k − 1)(−k − 2) ⋯ (−k − i + 1) / i!] p^k (1 − p)^i
= (−1)^i C(−k, i) p^k (1 − p)^i    for i = 0, 1, 2, . . . .

This is called the Negative Binomial distribution with parameters k and p (denoted by NB(k, p)). If
G_1, . . . , G_k are independent Geo(p) random variables, then G_1 + · · · + G_k ∼ NB(k, p) (can you prove this?).
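This fact can be checked by simulation in R; note that R's rgeom and rnbinom count failures rather than tosses, so both are shifted below to match the conventions used here (these shifts are ours):

    set.seed(1)
    k <- 3; p <- 0.4
    geo_sum <- replicate(1e5, sum(rgeom(k, p) + 1))    # sum of k Geo(p) toss counts
    nb      <- rnbinom(1e5, size = k, prob = p) + k    # tosses needed for the k-th head
    round(rbind(table(geo_sum)[1:6], table(nb)[1:6]) / 1e5, 3)  # empirical pmfs on {3,...,8} agree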
7 Continuous Distributions
A random variable X has the normal distribution with mean µ and variance σ² > 0 if it has the following
pdf:

φ(x; µ, σ²) := (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).

We write X ∼ N(µ, σ²). When µ = 0 and σ² = 1, we say that X has the standard normal distribution and
the standard normal pdf is simply denoted by φ(·):

φ(x) = (1/√(2π)) exp(−x²/2).

Do you know why φ(·) is a valid density, i.e., why ∫ e^{−x²/2} dx = √(2π)?
If X ∼ N(µ, σ²), then E(X) = µ and Var(X) = σ². See the corresponding Wikipedia page for a list of
numerous properties of the normal distribution. The Central Limit Theorem is the main reason why the
normal distribution is the most prominent distribution in statistics.
7.2 Uniform Distribution
A random variable U is said to have the uniform distribution on (0, 1) if it has the following pdf:

f(x) = 1 for 0 < x < 1, and f(x) = 0 for all other x.
We write U ∼ U [0, 1]. What is the mean of U ? What is the variance of U ? Where do uniform distributions
arise in statistics? The p-values under the null distribution are usually distributed according to the U [0, 1]
distribution (more on this later).
More generally, given an interval (a, b), we say that a random variable U has the uniform distribution on
(a, b) if it has the following pdf:

f(x) = 1/(b − a) for a < x < b, and f(x) = 0 for all other x.
We write this as U ∼ U (a, b).
The exponential density with rate parameter λ > 0 (denoted by Exp(λ)) is given by f(x) = λ e^{−λx} I{x > 0}.
It is arguably the simplest density for modeling random quantities that are constrained to be nonnegative.
It is used to model things such as the time of the first phone call that a telephone operator receives starting
from now. More generally, it arises as the distribution of inter-arrival times in a Poisson process (more on
this later when we study the Poisson Process).
The exponential density has the memorylessness property (just like the Geometric distribution). Indeed,
if X ∼ Exp(λ), then P{X > t} = e^{−λt} for t > 0, so that P{X > s + t | X > s} = e^{−λ(s+t)}/e^{−λs} = e^{−λt} = P{X > t} for all s, t > 0.
In fact, the exponential density is the only density on (0, ∞) that has the memorylessness property (proof
left as exercise). In this sense, the Exponential distribution can be treated as the continuous analogue of
the Geometric distribution.
It is customary to talk about the Gamma density after the exponential density. The Gamma density with
shape parameter α > 0 and rate parameter λ > 0 is given by

f(x) ∝ x^{α−1} e^{−λx} I{x > 0}.    (12)

Now the function

Γ(α) := ∫₀^∞ u^{α−1} e^{−u} du    for α > 0

is called the Gamma function in mathematics. So the constant of proportionality in (12) is given by
λ^α/Γ(α), so that the Gamma density has the formula:

f(x) = [λ^α/Γ(α)] x^{α−1} e^{−λx} I{x > 0}.
We shall refer to this as the Gamma(α, λ) density.
Note that the Gamma(α, λ) density reduces to the Exp(λ) density when α = 1. Therefore, Gamma
densities can be treated as a generalization of the Exponential density. In fact, the Gamma density can be
seen as the continuous analogue of the negative binomial distribution because if X_1, . . . , X_k are independent
Exp(λ) random variables, then X_1 + · · · + X_k ∼ Gamma(k, λ) (thus the Gamma distribution arises as the
sum of i.i.d exponentials just as the Negative Binomial distribution arises as the sum of i.i.d Geometric
random variables).
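A quick visual check of this fact in R (simulation parameters below are arbitrary choices of ours):

    set.seed(1)
    k <- 5; lambda <- 2
    s <- replicate(1e4, sum(rexp(k, rate = lambda)))
    hist(s, breaks = 50, probability = TRUE,
         main = "Sum of 5 independent Exp(2) variables")
    curve(dgamma(x, shape = k, rate = lambda), add = TRUE)  # Gamma(5, 2) density overlays the histogram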
Here are some elementary properties of the Gamma function that will be useful to us later. The Gamma
function does not have a closed form expression for arbitrary α > 0. However, when α is a positive integer,
it can be shown that

Γ(n) = (n − 1)!    for n ≥ 1.    (13)

The above identity is a consequence of the property

Γ(α + 1) = α Γ(α)    for every α > 0,    (14)

and the trivial fact that Γ(1) = 1. You can easily verify (14) by integration by parts.
Another easy fact about the Gamma function is that Γ(1/2) = √π (this is a consequence of the fact that
∫ e^{−x²/2} dx = √(2π)).
8 Variable Transformations
It is common to take functions or transformations of random variables. Consider a random variable X
and apply a function u(·) to X to transform X into another random variable Y = u(X). How does one find
the distribution of Y = u(X) from the distribution of X?
If X is a discrete random variable, then Y = u(X) will also be discrete and then the pmf of Y can be
written directly in terms of the pmf of X:
P{Y = y} = P{u(X) = y} = ∑_{x: u(x) = y} P{X = x}.
If X is a continuous random variable with density f and u(·) is a smooth function, then it is fairly
straightforward to write down the density of Y = u(X) in terms of f . There are some general formulae for
doing this but it is better to learn how to do it from first principles. I will illustrate the general idea using
the following two examples.
Example 8.1. Suppose X ∼ U(−π/2, π/2). What is the density of Y = tan(X)? Here is the method for doing
this from first principles. Note that the range of tan(x) as x ranges over (−π/2, π/2) is ℝ, so fix y ∈ ℝ and
we shall find below the density g of Y at y.

The formula for g(y) is

g(y) = lim_{δ↓0} (1/δ) P{y < Y < y + δ},

so that

P{y < Y < y + δ} ≈ g(y) δ    when δ is small.    (15)

Now for small δ,

P{y < Y < y + δ} = P{arctan(y) < X < arctan(y + δ)} = (1/π)(arctan(y + δ) − arctan(y)) ≈ δ/(π(1 + y²)),

since X is uniform on (−π/2, π/2) (an interval of length π) and the derivative of arctan(y) is 1/(1 + y²).
Comparing with (15), we conclude that g(y) = 1/(π(1 + y²)), the standard Cauchy density.
Recall that the cdf F(x) = P{X ≤ x} of a random variable X has the following properties:
1. It is non-decreasing i.e., F (x) ≤ F (y) whenever x ≤ y.
2. It takes values between 0 and 1.
3. For every x, we have

lim_{y↓x} F(y) = F(x) = P{X ≤ x}   and   lim_{y↑x} F(y) = P{X < x} = F(x) − P{X = x}.
This implies, in particular, that F is right continuous and that continuity of F is equivalent to P{X =
x} = 0 for every x.
4. The function F (x) approaches 0 as x → −∞ and approaches 1 as x → +∞.
The above properties characterize cdfs in the sense that every function F on (−∞, ∞) that satisfies these
four properties equals the cdf of some random variable. One way to prove this is via the Quantile Transform.
Given a function F satisfying the four properties listed above, the corresponding Quantile Transform (or
Quantile Function) q_F is a real-valued function on (0, 1) defined as

q_F(u) := inf{x ∈ ℝ : F(x) ≥ u}    for 0 < u < 1.
The quantile transform can be seen as some kind of an inverse of the cdf F . Indeed, when the cdf F is
continuous and strictly increasing, the quantile function qF is exactly equal to F −1 .
The fundamental property of the quantile transform is the following. For x ∈ ℝ and 0 < u < 1:

q_F(u) ≤ x   if and only if   u ≤ F(x).    (17)
Here is the proof of (17). When F (x) ≥ u, then it is obvious from the definition of qF that x ≥ qF (u).
On the other hand, again by the definition of q_F(u) and the fact that F is non-decreasing, we have
that F(q_F(u) + ε) ≥ u for every ε > 0. Letting ε ↓ 0 and using the right-continuity of F, we deduce that
F(q_F(u)) ≥ u. This implies that when x ≥ q_F(u), we have F(x) ≥ F(q_F(u)) ≥ u. This proves (17).
Therefore, qF (u) is a quantile of X (when u = 0.5, it follows that qF (0.5) is a median of X). Hence the name
quantile transform. The first inequality above is a consequence of
The following result is an important application of the use of the quantile transform.
Proposition 9.1. The following two statements are true.
1. Suppose U is a random variable distributed according to the uniform distribution on (0, 1). Then qF (U )
has cdf F .
2. Suppose X is a random variable with a continuous cdf F . Then F (X) has the uniform distribution
on (0, 1).
Proof. For the first part, note first that the cdf F_U of the uniform random variable U satisfies F_U(u) = u for
0 ≤ u ≤ 1. Thus the cdf F_X of X = q_F(U) is given by

F_X(x) = P{q_F(U) ≤ x} = P{U ≤ F(x)} = F(x),

where the middle equality follows from (17).
For the second part, assume that F is continuous. Note that for every ε > 0, the definition of q_F implies
that F(q_F(u) − ε) < u. Letting ε → 0 and using the continuity of F, we deduce that F(q_F(u)) ≤ u. Combining
with (17), this gives F(q_F(u)) = u. Therefore for every 0 < u < 1, we have
P{F(X) ≥ u} = P{X ≥ q_F(u)} = 1 − P{X < q_F(u)} = 1 − P{X ≤ q_F(u)} = 1 − F(q_F(u)) = 1 − u,

where we have used the continuity of F (which implies that P{X = x} = 0 for every x). The fact that P{F(X) ≥
u} = 1 − u for every 0 < u < 1 implies that F(X) ∼ U(0, 1).
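Both parts of Proposition 9.1 can be checked quickly in R; as a sketch, take F to be the Exp(1) cdf (our choice), for which q_F(u) = −log(1 − u).

    set.seed(1)
    u <- runif(1e5)
    x <- -log(1 - u)                  # q_F(U) for F(x) = 1 - exp(-x); part 1 says x ~ Exp(1)
    ks.test(x, "pexp")                # large p-value: consistent with Exp(1)
    x2 <- rexp(1e5)                   # fresh Exp(1) draws
    ks.test(1 - exp(-x2), "punif")    # part 2: F(X) is uniform on (0, 1)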
Example 9.2 (p-values corresponding to test statistics having continuous distributions have uniform distri-
butions under the null hypothesis). Statistical hypothesis testing problems are usually formed by calculating
a relevant test statistic based on data. Suppose Tobs is the observed value of the statistic calculated from
the data. The p-value corresponding to the test is defined as the probability, under the null hypothesis, of
observing a value for the statistic that is more extreme compared to Tobs . Usually this is calculated as
p = 1 − F0 (Tobs )
where F0 is the cdf of the test statistic under the null hypothesis. If F0 is a continuous cdf, then it should be
clear that p is distributed according to U (0, 1) when Tobs ∼ F0 . In other words, under the null distribution
(i.e., Tobs ∼ F0 ), the p-value has the standard uniform distribution.
10 Joint Densities
Joint densities are used to describe the distribution of a finite set of continuous random variables. We focus
on bivariate joint densities (i.e., when there are two continuous variables X and Y ). The ideas are the same
for the case of more than two variables.
The following are the main points to remember about joint densities:

1. A joint density is a function f(x, y) that satisfies f(x, y) ≥ 0 for all x, y and ∫∫ f(x, y) dx dy = 1.

2. We say that two random variables X and Y have joint density f(·, ·) if

P{(X, Y) ∈ B} = ∫∫_B f(x, y) dx dy = ∫∫ I{(x, y) ∈ B} f(x, y) dx dy

for every subset B of ℝ². We shall often denote the joint density of (X, Y) by f_{X,Y}.
3. If ∆ is a small region in ℝ² around a point (x_0, y_0), we have (under some regularity condition on the
behavior of f_{X,Y} at (x_0, y_0))

P{(X, Y) ∈ ∆} ≈ f_{X,Y}(x_0, y_0) × (area of ∆).

More formally,

lim_{∆↓(x_0,y_0)} P{(X, Y) ∈ ∆} / (area of ∆) = f_{X,Y}(x_0, y_0),

where the limit is taken as ∆ shrinks to (x_0, y_0).
4. If (X, Y ) have joint density fX,Y , then the density of X is given by fX and the density of Y is fY where
f_X(x) = ∫ f_{X,Y}(x, y) dy   and   f_Y(y) = ∫ f_{X,Y}(x, y) dx.
The densities fX and fY are referred to as the marginal densities of X and Y respectively.
5. Independence and Joint Densities: The following statements are equivalent:
(a) The random variables X and Y are independent.
(b) The joint density f_{X,Y}(x, y) factorizes into the product of a function depending on x alone and a
function depending on y alone.
(c) fX,Y (x, y) = fX (x)fY (y) for all x, y.
Example 10.1. Consider the function
f(x, y) = 1 if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and f(x, y) = 0 otherwise.
Check that this is indeed a density function. This density takes the value 1 on the unit square. If the random
variables X and Y have this density f , then we say that they are uniformly distributed on the unit square.
Using indicator functions, we can also write this density as f(x, y) = I{0 ≤ x ≤ 1} I{0 ≤ y ≤ 1}.
The factorization above immediately says that if f = f_{X,Y}, then X and Y are independent. The marginal
densities of X and Y are uniform densities on [0, 1].
We address the following general question. Suppose X and Y have the joint density fX,Y . Suppose now that
we consider two new random variables defined by
(U, V ) := T (X, Y )
where T : R2 → R2 is a differentiable and invertible function. What is the joint density fU,V of U, V in terms
of fX,Y ?
The following simple example will nicely motivate the general ideas.
Example 11.1. Suppose X, Y have joint density fX,Y . What is the joint density of U and V where U = X
and V = X + Y ?
We see that (U, V ) = T (X, Y ) where T (x, y) = (x, x + y). This transformation T is clearly invertible and
its inverse is given by S(u, v) = T −1 (u, v) = (u, v − u). In order to determine the joint density of (U, V ) at
a point (u, v), let us consider

P{u ≤ U ≤ u + δ, v ≤ V ≤ v + ε}

for small δ and ε. Let R denote the rectangle joining the points (u, v), (u + δ, v), (u, v + ε) and (u + δ, v + ε). Then the above
probability is the same as

P{(U, V) ∈ R} = P{(X, Y) ∈ S(R)}.

What is the region S(R)? It is easy to see that this is the parallelogram joining the points (u, v − u), (u +
δ, v − u − δ), (u, v − u + ε) and (u + δ, v − u + ε − δ). When δ and ε are small, S(R) is clearly a small region
around (u, v − u), which allows us to write

P{(X, Y) ∈ S(R)} ≈ f_{X,Y}(u, v − u) × (area of S(R)).

The area of the parallelogram S(R) can be computed to be δε (using the formula that the area of a parallelogram
equals base times height), so that

f_{U,V}(u, v) δε ≈ P{(U, V) ∈ R} ≈ f_{X,Y}(u, v − u) δε,   i.e.,   f_{U,V}(u, v) = f_{X,Y}(u, v − u).
We shall come back to the general problem of finding densities of transformations after taking a short detour
to convolutions.
Recall that we showed above that f_{U,V}(u, v) = f_{X,Y}(u, v − u), where f_{X,Y} is the joint density of (X, Y). As a consequence, we see that the density of V = X + Y is given
by

f_{X+Y}(v) = ∫_{−∞}^{∞} f_{U,V}(u, v) du = ∫_{−∞}^{∞} f_{X,Y}(u, v − u) du.
Suppose now that X and Y are independent. Then fX,Y (x, y) = fX (x)fY (y) and consequently
f_{X+Y}(v) = ∫_{−∞}^{∞} f_X(u) f_Y(v − u) du = ∫_{−∞}^{∞} f_X(v − w) f_Y(w) dw.    (20)
The equation (20) therefore says, in words, that the density of X + Y , where X ∼ fX and Y ∼ fY are
independent, equals the convolution of fX and fY .
Example 11.3. Suppose X and Y are independent random variables which are exponentially distributed with
rate parameter λ. What is the distribution of X + Y? By (20), for v > 0,

f_{X+Y}(v) = ∫₀^v λ e^{−λu} λ e^{−λ(v−u)} du = λ² v e^{−λv}.

This shows that X + Y has the Gamma distribution with shape parameter 2 and rate parameter λ.
Example 11.4. Suppose X and Y are independent random variables that are uniformly distributed on [0, 1].
What is the density of X + Y? By (20),

f_{X+Y}(v) = ∫_{−∞}^{∞} I{0 ≤ u ≤ 1} I{0 ≤ v − u ≤ 1} du = ∫_{max(v−1, 0)}^{min(v, 1)} du.

This integral is non-zero only when max(v − 1, 0) ≤ min(v, 1), which is easily seen to be equivalent to 0 ≤ v ≤ 2.
When 0 ≤ v ≤ 2, we have
fX+Y (v) = min(v, 1) − max(v − 1, 0)
which can be simplified as
f_{X+Y}(v) = v for 0 ≤ v ≤ 1,   f_{X+Y}(v) = 2 − v for 1 ≤ v ≤ 2,   and f_{X+Y}(v) = 0 otherwise.
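The triangular shape is easy to confirm by simulation in R (a sketch of ours):

    set.seed(1)
    v <- runif(1e5) + runif(1e5)
    hist(v, breaks = 60, probability = TRUE, main = "X + Y for independent U(0,1) variables")
    curve(ifelse(x <= 1, x, 2 - x), from = 0, to = 2, add = TRUE)  # the triangular density above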
In the last class, we calculated the joint density of (X, X + Y ) in terms of the joint density of (X, Y ). In
this lecture, we generalize the idea behind that calculation by first calculating the joint density of a linear
and invertible transformation of a pair of random variables. We also deal with the case of a non-linear and
invertible transformation.
In the next subsection, we shall recall some standard properties of linear transformations.
A linear transformation L : ℝ² → ℝ² is a map of the form

L(x, y) = M (x, y)^T + c,    (21)

where M is a 2 × 2 matrix and c is a 2 × 1 vector. The first term on the right hand side above involves
multiplication of the 2 × 2 matrix M with the 2 × 1 vector with components x and y.
We shall refer to the 2 × 2 matrix M as the matrix corresponding to the linear transformation L and
often write ML for the matrix M .
The linear transformation L in (21) is invertible if and only if the matrix M is invertible. We shall only
deal with invertible linear transformations in the sequel. The following are two standard properties of linear
transformations that you need to be familiar with for the sequel.
For any region P in ℝ²,

(area of L(P)) / (area of P) = |det(M_L)|.

In other words, the ratio of the areas of L(P) to that of P is given by the absolute value of the
determinant of the matrix M_L.
Suppose X, Y have joint density f_{X,Y} and let (U, V) = T(X, Y) for a linear and invertible transformation
T : ℝ² → ℝ². Let the inverse transformation of T be denoted by S. In the example of the previous lecture,
we had T(x, y) = (x, x + y) and S(u, v) = (u, v − u). The fact that T is assumed to be linear and invertible
means that S is also linear and invertible.
To determine the joint density of (U, V) at a point (u, v), consider

P{u ≤ U ≤ u + δ, v ≤ V ≤ v + ε}

for small δ and ε. Let R denote the rectangle joining the points (u, v), (u + δ, v), (u, v + ε) and (u + δ, v + ε).
Then the above probability is the same as P{(X, Y) ∈ S(R)}.

What is the region S(R)? Clearly now S(R) is a small region (as δ and ε are small) around the point S(u, v),
so that

P{(U, V) ∈ R} = P{(X, Y) ∈ S(R)} ≈ f_{X,Y}(S(u, v)) × (area of S(R)).

By the facts mentioned in the previous subsection, we now note that S(R) is a parallelogram whose area
equals |det(M_S)| multiplied by the area of R (note that the area of R equals δε). We thus have

f_{U,V}(u, v) δε ≈ P{(U, V) ∈ R} = P{(X, Y) ∈ S(R)} ≈ f_{X,Y}(S(u, v)) |det(M_S)| δε.
Letting δ and ε tend to zero then gives

f_{U,V}(u, v) = f_{X,Y}(S(u, v)) |det(M_S)|.

As an example, suppose X and Y are independent standard normal random variables and (U, V) = T(X, Y)
with T(x, y) = (x + y, x − y), so that S(u, v) = T^{−1}(u, v) = ((u + v)/2, (u − v)/2). The formula above then gives
f_{U,V}(u, v) = f_{X,Y}((u + v)/2, (u − v)/2) |det M_S|
= f_{X,Y}((u + v)/2, (u − v)/2) |det [ [1/2, 1/2], [1/2, −1/2] ]| = (1/2) f_{X,Y}((u + v)/2, (u − v)/2).
Because X and Y are independent standard normals, we have

f_{X,Y}(x, y) = (1/(2π)) exp(−(x² + y²)/2),

so that

f_{U,V}(u, v) = (1/(4π)) e^{−u²/4} e^{−v²/4}.
This implies that U and V are independent N (0, 2) random variables.
Example 12.2. Suppose X and Y are independent standard normal random variables. Then what is the
distribution of (U, V) = T(X, Y) where

T(x, y) = (x cos θ − y sin θ, x sin θ + y cos θ)?

Geometrically the transformation T corresponds to rotating the point (x, y) by an angle θ in the counter-
clockwise direction. The inverse transformation S := T^{−1} of T is given by

S(u, v) = (u cos θ + v sin θ, −u sin θ + v cos θ),

and this corresponds to rotating the point (u, v) clockwise by an angle θ. The matrix corresponding to S is

M_S = [ [cos θ, sin θ], [−sin θ, cos θ] ],

whose determinant equals 1. Since f_{X,Y}(x, y) = (1/(2π)) exp(−(x² + y²)/2) depends on (x, y) only through
x² + y², and rotations preserve x² + y², the formula above gives f_{U,V} = f_{X,Y}; that is, U and V are again
independent standard normals.
We shall next study the problem of obtaining the joint densities under differentiable and invertible trans-
formations that are not necessarily linear.
Let (X, Y ) have joint density fX,Y . We transform (X, Y ) to two new random variables (U, V ) via (U, V ) =
T (X, Y ). What is the joint density fU,V ? Suppose that T is invertible (having an inverse S = T −1 ) and
differentiable. Note that S and T are not necessarily linear transformations.
To find the joint density f_{U,V}(u, v), consider as before

P{u ≤ U ≤ u + δ, v ≤ V ≤ v + ε}

for small δ and ε. Let R denote the rectangle joining the points (u, v), (u + δ, v), (u, v + ε) and (u + δ, v + ε).
Then the above probability is the same as

P{(U, V) ∈ R} = P{(X, Y) ∈ S(R)}.
What is the region S(R)? If S is linear then S(R) (as we have seen previously) will be a parallelogram.
For general S, the main idea is that, as long as δ and ε are small, the region S(R) can be approximated by
a parallelogram. This is because S itself can be approximated by a linear transformation on the region R.
To see this, let us write the function S(a, b) as (S_1(a, b), S_2(a, b)) where S_1 and S_2 map points in ℝ² to ℝ.
Assuming that S_1 and S_2 are differentiable, we can approximate S_1(a, b) for (a, b) near (u, v) by

S_1(a, b) ≈ S_1(u, v) + (a − u) ∂S_1/∂u (u, v) + (b − v) ∂S_1/∂v (u, v),

and similarly for S_2(a, b). Putting the above two approximations together, we obtain that, for (a, b) close to (u, v),

S(a, b) ≈ S(u, v) + J_S(u, v) (a − u, b − v)^T,   where   J_S(u, v) := [ [∂S_1/∂u (u, v), ∂S_1/∂v (u, v)], [∂S_2/∂u (u, v), ∂S_2/∂v (u, v)] ].

Note that, in particular, when δ and ε are small, this linear approximation for S is valid over the region R.
The matrix J_S(u, v) is called the Jacobian matrix of S(u, v) = (S_1(u, v), S_2(u, v)) at the point (u, v).
Arguing exactly as in the linear case, with J_S(u, v) playing the role of M_S, we obtain

f_{U,V}(u, v) = f_{X,Y}(S(u, v)) |det J_S(u, v)|.    (23)
Example 12.3. Suppose X and Y have joint density fX,Y . What is the joint density of U = X/Y and
V =Y?
We need to compute the joint density of (U, V ) = T (X, Y ) where T (x, y) = (x/y, y). The inverse of this
transformation is S(u, v) = (uv, v). Then formula (23) gives
f_{U,V}(u, v) = f_{X,Y}(uv, v) |det [ [v, u], [0, 1] ]| = f_{X,Y}(uv, v) |v|.
In the special case when X and Y are independent standard normal random variables, the density of U = X/Y
is given by
f_U(u) = ∫_{−∞}^{∞} (1/(2π)) exp(−(1 + u²)v²/2) |v| dv = 2 ∫₀^∞ (1/(2π)) exp(−(1 + u²)v²/2) v dv = 1/(π(1 + u²)).
This is the standard Cauchy density.
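This can be checked numerically in R by simulating the ratio and comparing with the Cauchy law (a sketch):

    set.seed(1)
    u <- rnorm(1e5) / rnorm(1e5)
    quantile(u, c(0.25, 0.5, 0.75))    # close to the Cauchy quartiles -1, 0, 1
    qcauchy(c(0.25, 0.5, 0.75))
    ks.test(u, "pcauchy")              # consistent with the standard Cauchy distribution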
Example 12.4. Suppose X and Y are independent standard normal random variables. Let R := √(X² + Y²)
and let Θ denote the angle made by the vector (X, Y) with the positive X-axis in the plane. What is the joint
density of (R, Θ)?
Clearly (R, Θ) = T(X, Y) where the inverse of T is given by S(r, θ) = (r cos θ, r sin θ). The density f_{R,Θ} of
(R, Θ) at (r, θ) is zero unless r > 0 and 0 < θ < 2π. The formula (23) then gives

f_{R,Θ}(r, θ) = f_{X,Y}(r cos θ, r sin θ) |det [ [cos θ, −r sin θ], [sin θ, r cos θ] ]| = (1/(2π)) e^{−r²/2} r I{r > 0} I{0 < θ < 2π}.
It is easy to see from here that Θ is uniformly distributed on (0, 2π) and R has the density

f_R(r) = r e^{−r²/2} I{r > 0}.
Moreover R and Θ are independent. The density of R is called the Rayleigh density.
Example 12.5. Here is an important fact about Gamma distributions: Suppose X ∼ Gamma(α_1, λ) and
Y ∼ Gamma(α_2, λ) are independent; then X + Y ∼ Gamma(α_1 + α_2, λ). This can be proved using the
convolution formula for densities of sums of independent random variables. A different approach uses the
Jacobian formula to derive the joint density of U = X + Y and V = X/(X + Y). The relevant inverse
transformation here is S(u, v) = (uv, u − uv), so that the Jacobian formula gives

f_{U,V}(u, v) = f_X(uv) f_Y(u − uv) |det [ [v, u], [1 − v, −u] ]| = f_X(uv) f_Y(u(1 − v)) u   for u > 0 and 0 < v < 1.
Plugging in the relevant Gamma densities for fX and fY , we can deduce that
f_{U,V}(u, v) = [λ^{α_1+α_2}/Γ(α_1 + α_2)] u^{α_1+α_2−1} e^{−λu} I{u > 0} × [Γ(α_1 + α_2)/(Γ(α_1)Γ(α_2))] v^{α_1−1} (1 − v)^{α_2−1} I{0 < v < 1}.
This implies that U ∼ Gamma(α_1 + α_2, λ). It also implies that V ∼ Beta(α_1, α_2), that U and V are
independent, as well as

B(α_1, α_2) = Γ(α_1)Γ(α_2)/Γ(α_1 + α_2),

where B(α_1, α_2) denotes the Beta function. Note that because Γ(n) = (n − 1)! when
n is an integer, the above formula gives us a way to calculate the Beta function B(α_1, α_2) when α_1 and α_2
are positive integers.
In the last class, we looked at the Jacobian formula for calculating the joint density of a transformed set
of continuous random variables in terms of the joint density of the original random variables. This formula
assumed that the transformation is invertible. In other words, the formula does not work if the transformation
is non-invertible. However, the general method based on first principles (that we used to derive the Jacobian
formula) works fine. This is illustrated in the following example.
Example 13.1 (Order Statistics). Suppose X and Y have joint density fX,Y . What is the joint density of
U = min(X, Y ) and V = max(X, Y )?
Let us find the joint density of (U, V ) at (u, v). Since U < V , the density fU,V (u, v) will be zero when
u ≥ v. So let u < v. For δ and ε small, let us consider

P{u ≤ U ≤ u + δ, v ≤ V ≤ v + ε}.

If δ and ε are much smaller compared to v − u, then the above probability equals

P{u ≤ X ≤ u + δ, v ≤ Y ≤ v + ε} + P{u ≤ Y ≤ u + δ, v ≤ X ≤ v + ε},

which is further approximately equal to

(f_{X,Y}(u, v) + f_{X,Y}(v, u)) δε.

Dividing by δε and letting δ, ε ↓ 0, we conclude that f_{U,V}(u, v) = f_{X,Y}(u, v) + f_{X,Y}(v, u) for u < v.
We can generalize this to the case of more than two random variables. Suppose X1 , . . . , Xn are random
variables having a joint density fX1 ,...,Xn (x1 , . . . , xn ). Let X(1) ≤ · · · ≤ X(n) denote the order statistics of
X1 , . . . , Xn i.e., X(1) is the smallest value among X1 , . . . , Xn , X(2) is the next smallest value and so on with
X_(n) denoting the largest value. What then is the joint distribution of X_(1), . . . , X_(n)? The calculation above
for the case of the two variables can be easily generalized to obtain
f_{X_(1),...,X_(n)}(u_1, . . . , u_n) = ∑_π f_{X_1,...,X_n}(u_{π_1}, . . . , u_{π_n})  if u_1 < u_2 < · · · < u_n,  and 0 otherwise,

where the sum is over all permutations π (i.e., one-to-one and onto functions mapping {1, . . . , n} to {1, . . . , n}).
When the variables X1 , . . . , Xn are i.i.d (independent and identically distributed), then it follows from the
above that
f_{X_(1),...,X_(n)}(u_1, . . . , u_n) = n! f_{X_1}(u_1) ⋯ f_{X_n}(u_n)  if u_1 < u_2 < · · · < u_n,  and 0 otherwise.
Assume now that X1 , . . . , Xn are i.i.d random variables with a common density f . In the previous section, we
derived the joint density of the order statistics X(1) , . . . , X(n) . Here we focus on the problem of determining
the density of X(i) for a fixed i. The answer is given by
f_{X_(i)}(u) = [n!/((n − i)!(i − 1)!)] (F(u))^{i−1} (1 − F(u))^{n−i} f(u).    (24)
The first method integrates the joint density fX(1) ,...,X(n) (u1 , . . . , ui−1 , u, ui+1 , . . . un ) over u1 , . . . , ui−1 , ui+1 , . . . , un
to obtain fX(i) (u). More precisely,
f_{X_(i)}(u) = ∫ ⋯ ∫ n! f(u_1) ⋯ f(u_{i−1}) f(u) f(u_{i+1}) ⋯ f(u_n) I{u_1 < · · · < u_{i−1} < u < u_{i+1} < · · · < u_n} du_1 ⋯ du_{i−1} du_{i+1} ⋯ du_n.
Integrate the above first with respect to u1 (in the range (−∞, u2 )), then with respect to u2 (in the range of
(−∞, u3 )) and all the way up to the integral with respect to ui−1 . Then integrate with respect to un , then
with respect to un−1 and all the way to ui+1 . This will lead to (24).
This method uses multinomial probabilities. Suppose that we repeat an experiment n times and that the
outcomes of the n repetitions are independent. Suppose that each individual experiment has k outcomes
which we denote by O1 , . . . , Ok and let the probabilities of these outcomes be given by p1 , . . . , pk (note that
these are nonnegative numbers which sum to one).
Now let Ni denote the number of times (over the n repetitions) that the outcome Oi appeared (note that
N1 , . . . , Nk are nonnegative integers which sum to n). The joint distribution of (N1 , . . . , Nk ) is known as the
multinomial distribution with parameters n and p1 , . . . , pk . It is an exercise to show that
P{N_1 = n_1, N_2 = n_2, . . . , N_k = n_k} = [n!/(n_1! ⋯ n_k!)] p_1^{n_1} ⋯ p_k^{n_k}    (25)
whenever n1 , . . . , nk are nonnegative integers which sum to n.
Let us now get back to the problem of obtaining the density of X(i) . Consider the probability
P{u ≤ X_(i) ≤ u + δ}
for a fixed u and small δ. If δ is small, then this probability can be approximated by the probability of the
event E where E is defined as follows. E is the event where (i−1) observations among X1 , . . . , Xn are strictly
smaller than u, one observation among X1 , . . . , Xn lies in [u, u + δ] and n − i observations among X1 , . . . , Xn
are strictly larger than u + δ. This latter probability is a special case of the multinomial probability formula
(25) and when δ is small, we get that this probability equals
[n!/((n − i)!(i − 1)!)] (F(u))^{i−1} (f(u)δ) (1 − F(u))^{n−i},
where F is the cdf corresponding to f . The formula (24) then immediately follows.
Here we first compute the cdf of X(i) and then differentiate it to get the pdf. Note that
F_{X_(i)}(x) = P{X_(i) ≤ x}
= P{at least i of X_1, . . . , X_n are ≤ x}
= ∑_{r=i}^{n} P{exactly r of X_1, . . . , X_n are ≤ x}
= ∑_{r=i}^{n} P{Bin(n, F(x)) = r} = ∑_{r=i}^{n} C(n, r) (F(x))^r (1 − F(x))^{n−r}.
To compute the density, we have to differentiate FX(i) with respect to x. This gives (note that the derivative
of F is f )
f_{X_(i)}(x) = ∑_{r=i}^{n} C(n, r) [ r (F(x))^{r−1} f(x) (1 − F(x))^{n−r} − (F(x))^r (n − r)(1 − F(x))^{n−r−1} f(x) ]
= ∑_{r=i}^{n} [n!/((n − r)!(r − 1)!)] (F(x))^{r−1} (1 − F(x))^{n−r} f(x) − ∑_{r=i}^{n−1} [n!/((n − r − 1)! r!)] (F(x))^r (1 − F(x))^{n−r−1} f(x)
= ∑_{r=i}^{n} [n!/((n − r)!(r − 1)!)] (F(x))^{r−1} (1 − F(x))^{n−r} f(x) − ∑_{s=i+1}^{n} [n!/((n − s)!(s − 1)!)] (F(x))^{s−1} (1 − F(x))^{n−s} f(x)
= [n!/((n − i)!(i − 1)!)] (F(x))^{i−1} (1 − F(x))^{n−i} f(x),
and thus we again get the formula (24).
Next we look at some special instances of the formula (24) for the density of individual order statistics.
14.4 Uniform Order Statistics
Suppose X1 , . . . , Xn are i.i.d having the uniform density on (0, 1). Then the formula (24) (by plugging in
f (u) = 1 and F (u) = u for 0 < u < 1) gives the following density for X(i) :
f_{X_(i)}(u) = [n!/((n − i)!(i − 1)!)] u^{i−1} (1 − u)^{n−i}    for 0 < u < 1.    (26)
This is a Beta density with parameters i and n − i + 1. Generally, a Beta density with parameters α > 0 and
β > 0 is given by
f(u) = [u^{α−1} (1 − u)^{β−1} / ∫₀¹ x^{α−1} (1 − x)^{β−1} dx] I{0 < u < 1}.
The integral in the denominator above is called the Beta function:
B(α, β) := ∫₀¹ x^{α−1} (1 − x)^{β−1} dx    for α > 0, β > 0.
The Beta function does not usually have a closed form expression, but we can write it in terms of the Gamma
function via the formula

B(α, β) = Γ(α)Γ(β)/Γ(α + β)

that we saw in the last class. This formula allows us to write B(α, β) in closed form when α and β are
integers (note that Γ(n) = (n − 1)!). This gives, for example,

B(i, n − i + 1) = (i − 1)!(n − i)!/n!,

which matches the normalizing constant in (26).
Suppose X1 , . . . , Xn are independent random variables having the uniform distribution on the interval (0, θ)
for some θ > 0. It turns out that the maximum order statistic, X(n) , is the maximum likelihood estimate of
θ. The density of X(n) is easily seen to be (as a consequence of (24)):
f_{X_(n)}(u) = (n u^{n−1}/θ^n) I{0 < u < θ}.
What then is E(X_(n))? Using the formula for the density above,

E X_(n) = ∫₀^θ u · (n u^{n−1}/θ^n) du = nθ/(n + 1).
This means therefore that X(n) has a slight negative bias of −θ/(n + 1) as an estimator for θ and that
((n + 1)/n)X(n) is an unbiased estimator of θ.
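A simulation sketch of this bias calculation in R (the values of θ and n below are arbitrary choices of ours):

    set.seed(1)
    theta <- 5; n <- 10
    m <- replicate(1e5, max(runif(n, min = 0, max = theta)))
    mean(m)                    # about n * theta / (n + 1) = 4.545
    mean((n + 1) / n * m)      # about theta = 5: the bias-corrected estimator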
Recall that the exponential density with rate parameter λ > 0 (denoted by Exp(λ)) is given by f(x) = λ e^{−λx} I{x > 0}.
It is arguably the simplest density for modeling random quantities that are constrained to be nonnegative,
and it arises as the distribution of inter-arrival times in a Poisson process (more on this later when we study the
Poisson Process).
The cdf of Exp(λ) is easily seen to be

F(x) = ∫₀^x λ e^{−λt} dt = 1 − e^{−λx}    for x > 0.
Suppose now that X_1, . . . , X_n are i.i.d observations from Exp(λ). What is the density of X_(1)? From the
formula (24) with i = 1,

f_{X_(1)}(u) = n (1 − F(u))^{n−1} f(u) = n e^{−(n−1)λu} λ e^{−λu} = nλ e^{−nλu}    for u > 0.

Thus X_(1) has the Exponential density with rate parameter nλ.
I am using the second chapter of the book Elements of Large Sample Theory by Erich Lehmann as the
reference for our treatment of the CLT.
The Central Limit Theorem (CLT) is not a single theorem but encompasses a variety of results concerned
with the sum of a large number of random variables which, suitably normalized, has a normal limit distribu-
tion. The following is the simplest version of the CLT and this is the version that we shall mostly deal with
in this class.
Theorem 15.1 (Central Limit Theorem). Suppose X_i, i = 1, 2, . . . are i.i.d with E(X_i) = µ and Var(X_i) =
σ² < ∞. Then

√n (X̄_n − µ) / σ

converges in distribution to N(0, 1), where X̄_n = (X_1 + · · · + X_n)/n.
Informally, the CLT says that for i.i.d observations X_1, . . . , X_n with finite mean µ and variance σ², the
quantity √n(X̄_n − µ)/σ is approximately (or asymptotically) N(0, 1). Informally, the CLT also implies that

1. √n(X̄_n − µ) is approximately N(0, σ²).
2. X̄_n is approximately N(µ, σ²/n).
3. S_n = X_1 + · · · + X_n is approximately N(nµ, nσ²).
4. S_n − nµ is approximately N(0, nσ²).
5. (S_n − nµ)/(√n σ) is approximately N(0, 1).

and also E(S_n) = nµ and Var(S_n) = nσ².
The most remarkable feature of the CLT is that it holds regardless of the distribution of Xi (as long as
they are i.i.d from a distribution F that has a finite mean and variance). Therefore the CLT is, in this
sense, distribution-free. This makes it possible to derive, using the CLT, statistical procedures which are
asymptotically valid without specific distributional assumptions. To illustrate the fact that the distribution
of Xi can be arbitrary, let us consider the following examples.
5. Cauchy. Suppose Xi are i.i.d standard Cauchy random variables. Then Xi ’s do not have finite mean
and variance. Thus the CLT does not apply here. In fact, it can be proved here that (X1 + · · · + Xn )/n
has the Cauchy distribution for every n.
16 Convergence in Distribution
In order to understand the precise meaning of the CLT, we need to understand the notion of convergence in
distribution.
Definition 16.1 (Convergence in Distribution). Suppose Y1 , Y2 , . . . are random variables and F is a cdf.
We say that Yn converges in distribution to F (or that Yn converges in Law to F ) as n → ∞ if
P{Yn ≤ y} → F(y) as n → ∞
for every y at which the cdf F is continuous. We denote this by Yn →L F.
Put another way, if Fn denotes the cdf of Yn, then Yn →L F if and only if
Fn(y) → F(y) as n → ∞
for every y that is a continuity point of F.
We shall use the following conventions when talking about convergence in distribution.
1. If F is the cdf of a standard distribution such as N(0, 1), then we shall write Yn →L N(0, 1) to mean that Yn converges in distribution to that cdf.
Note that convergence in distribution is defined in terms of cdfs which makes it possible to talk about
a sequence of discrete random variables converging to a continuous distribution. For example, if Yn has
the discrete uniform distribution on the finite set {1/n, 2/n, . . . , 1}, then according to the above definition
Yn →L Unif(0, 1). Note however that Yn is discrete while Unif(0, 1) is a continuous distribution.
Note that the definition of Yn →L Y only requires that Fn(y) converges to F(y) at every y which is a
continuity point of F (here Fn and F are the cdfs of Yn and Y respectively). If F is a continuous cdf (such
as a normal or a uniform cdf), then every point is a continuity point and then Yn →L F is the same as saying
that
P{Yn ≤ y} → F(y) for every y.
But when F is a discrete cdf, then, for Yn →L F, we do not insist on P{Yn ≤ y} converging to F(y) at points y
where F is discontinuous. This is advantageous in a situation such as the following. Suppose that Yn = 1/n
for every n ≥ 1 and Y = 0. Then it is easy to see that
P{Yn ≤ y} → P{Y ≤ y} for every y ≠ 0.
However, the convergence above does not hold for y = 0 as P{Yn ≤ 0} = 0 for every n while P{Y ≤ 0} = 1.
Thus if we insisted on P{Yn ≤ y} converging to P{Y ≤ y} at all points y (as opposed to only continuity
points), then Yn = 1/n would not converge in distribution to Y = 0 (which would be quite unnatural). This
is one justification for restricting attention to continuity points of F in the definition of convergence in
distribution.
Let us now isolate the following two special cases of Yn →L Y.
1. Y has a continuous cdf F: In this case, Yn →L Y if P{Yn ≤ y} converges to P{Y ≤ y} for every y.
This actually implies that
P{Yn < y} → P{Y ≤ y} for every y
as well as
P{a ≤ Yn ≤ b} → P{a ≤ Y ≤ b} for every a and b.
2. Y is equal to a constant. Suppose that the limit random variable Y equals a constant c. The cdf
F(y) of Y is then easily seen to be equal to 0 for y < c and 1 for y > c. The definition of →L then
implies that Yn →L c if and only if P{Yn ≤ y} converges to 0 for y < c and converges to 1 for y > c. This
is easily seen to be equivalent to:
P{|Yn − c| ≥ ε} → 0 as n → ∞ (27)
for every ε > 0. In this case when Y is a constant, we actually write Yn →L c as Yn →P c and say that Yn
converges in probability to c. Alternatively, you can take (27) as the definition of Yn →P c.
The main goal for today is to introduce Moment Generating Functions and use them to prove the Central
Limit Theorem. Let us first start by recapping the statement of the CLT.
Theorem 16.2 (Central Limit Theorem). Suppose Xi, i = 1, 2, . . . are i.i.d with E(Xi) = µ and var(Xi) = σ² < ∞. Then
√n (X̄n − µ) / σ →L N(0, 1)
where X̄n = (X1 + · · · + Xn)/n.
To understand the statement of the above theorem, we first need to know what the symbol →L means.
This is the notion of convergence in distribution which is defined as follows. We say that a sequence of
random variables Y1, Y2, . . . converges in distribution to a cdf F (written as Yn →L F) as n → ∞ if P{Yn ≤ y}
converges as n → ∞ to F(y) for every y at which the function F is continuous. Also, we say that Yn →L Y if
P{Yn ≤ y} converges (as n → ∞) to P{Y ≤ y} at every y where the cdf of Y is continuous.
The statement Yn →L Y might suggest that Yn is close to Y for large n. This is actually not true. Yn →L Y
only says that the distribution of Yn is close to that of Y. It is actually more appropriate to write Yn →L F
where F is the cdf of Y. For example, suppose that Y ∼ Unif(0, 1) and let Yn be equal to Y for odd values
of n and equal to (1 − Y) for even values of n. Then, clearly each Yn ∼ Unif(0, 1) so that both Yn →L Y as
well as Yn →L 1 − Y are true. But obviously Yn is not close to Y for even n and Yn is not close to 1 − Y for
odd n.
When F is a continuous cdf (which is the case when F is, for example, the cdf of N(0, 1)), the statement
Yn →L F is equivalent to
P{Yn ≤ y} → F(y) for every y.
In this case (i.e., when F is continuous), it also follows that
P{Yn < y} → F(y) for every y
and also that
P{a ≤ Yn ≤ b} → P{a ≤ Y ≤ b} for every a and b.
17 The Weak Law of Large Numbers
The CLT states that √n(X̄n − µ)/σ converges in distribution to N(0, 1), which informally means that √n(X̄n −
µ)/σ is approximately N(0, 1). This means that X̄n is approximately N(µ, σ²/n). Because the N(µ, σ²/n)
becomes more and more concentrated about the single point µ, it makes sense to conjecture that X̄n converges
to the single point µ as n → ∞. This is made precise in the following result which is called the Weak Law
of Large Numbers.
Theorem 17.1 (Weak Law of Large Numbers). Suppose X1 , X2 , . . . are independent and identically dis-
tributed random variables. Suppose that E|Xi | < ∞ so that EXi is well-defined. Let EXi = µ. Then
X̄n := (X1 + · · · + Xn)/n →P µ    as n → ∞.
Recall from last class that Yn →P c (here c is a constant) means that P{|Yn − c| > ε} converges to zero as
n → ∞ for every ε > 0. Equivalently, if F is the cdf of the constant random variable which is always equal
to c, then Yn →P c is the same as Yn →L F.
The Weak Law of Large Numbers as stated above is non-trivial to prove. However an easy proof can be
given if we make the additional assumption that the Xi ’s have a finite variance. In this case, we can simply
use the Chebyshev inequality. Indeed, Chebyshev’s inequality gives
P{|X̄n − µ| > ε} ≤ \frac{var(X̄n)}{ε²} = \frac{var(X1)}{n ε²}
which converges to zero as n → ∞ (because of the n in the denominator). Note that this proof does not
work when var(X1) = ∞.
We shall next attempt to prove the CLT. Our main tool for the proof is the Moment Generating Function
which is introduced now.
The Moment Generating Function (MGF) of a random variable X is defined as the function:
MX(t) := E(e^{tX})
for all t ∈ R for which E(e^{tX}) < ∞. Note that MX(0) = 1. There exist random variables (such as those that
are distributed according to the standard Cauchy distribution) for which MX(t) is infinite for every t ≠ 0.
Example 18.1 (MGF of Standard Gaussian). If X ∼ N(0, 1), then its MGF can be easily computed as
follows:
E(e^{tX}) = \frac{1}{\sqrt{2π}} \int_{−∞}^{∞} e^{tx} e^{−x²/2} dx = \frac{1}{\sqrt{2π}} \int_{−∞}^{∞} \exp\left(\frac{−(x − t)²}{2}\right) \exp(t²/2) dx = e^{t²/2}.
Thus MX(t) = e^{t²/2} for all t ∈ R.
1) MGFs and sums of independent random variables: If X and Y are independent, then
M_{X+Y}(t) = E(e^{t(X+Y)}) = E(e^{tX} e^{tY}) = E(e^{tX}) E(e^{tY}) = MX(t) MY(t),
the last equality being a consequence of independence.
2) Scaling: M_{a+bX}(t) = e^{at} MX(bt) for all t (a and b are constants here). This is easy to prove.
3) MGFs determine distributions: If two random variables have MGFs that are finite and equal in an
open interval containing 0, then they have the same distribution (i.e., same cdf everywhere). An implication
of this is that N(0, 1) is the only distribution whose MGF equals e^{t²/2} for all t.
4) MGFs provide information on moments: For k ≥ 1, the number E(X k ) is called the k th moment
of the random variable X. Knowledge of the MGF allows one to easily read off the moments of X. Indeed,
the power series expansion of the MGF is:
MX(t) = E(e^{tX}) = \sum_{k=0}^{∞} \frac{t^k}{k!} E(X^k).
Therefore the k-th moment of X is simply the coefficient of t^k in the power series expansion of MX(t)
multiplied by k!.
Alternatively, one can derive the moments E(X k ) as derivatives of the MGF at 0. Indeed, it is easy to
see that
M_X^{(k)}(t) = \frac{d^k}{dt^k} E(e^{tX}) = E(X^k e^{tX})
so that
M_X^{(k)}(0) = E(X^k).
In words, E(X^k) equals the k-th derivative of MX at 0. Therefore
M_X′(0) = E(X) and M_X″(0) = E(X²)
and so on.
As an example, we can deduce the moments of the standard normal distribution from the fact that its
MGF equals e^{t²/2}. Indeed, because
e^{t²/2} = \sum_{i=0}^{∞} \frac{t^{2i}}{2^i i!},
it immediately follows that the k-th moment of N(0, 1) equals 0 when k is odd and equals
\frac{(2j)!}{2^j j!} when k = 2j.
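A quick numerical check of this formula in Python (the sample size and seed are arbitrary choices): the even moments (2j)!/(2^j j!) equal 1, 3 and 15 for k = 2, 4, 6.

    import math
    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.standard_normal(2_000_000)
    for j in (1, 2, 3):
        empirical = (x ** (2 * j)).mean()
        exact = math.factorial(2 * j) / (2 ** j * math.factorial(j))
        print(2 * j, round(empirical, 2), exact)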
Let us recall the basic setting. We have i.i.d random variables X1 , X2 , . . . which have mean µ and finite
variance σ 2 .
Let Yn := √n (X̄n − µ)/σ. We need to show that Yn →L N(0, 1). From the discussion on MGFs in the
previous section, it is clear that it is enough to show that
MYn(t) → e^{t²/2}
for every t ∈ (−∞, ∞).
Note that
Yn = √n (X̄n − µ)/σ = \frac{1}{√n} \sum_{i=1}^{n} \frac{Xi − µ}{σ}.
As a result,
MYn(t) = M_{\sum_i (Xi−µ)/(√n σ)}(t) = M_{\sum_i (Xi−µ)/σ}(t n^{−1/2}) = \prod_{i=1}^{n} M_{(Xi−µ)/σ}(t n^{−1/2}) = (M(t n^{−1/2}))^n
where M (·) is the MGF of (X1 − µ)/σ. We now use Taylor’s theorem to expand M (tn−1/2 ) up to a quadratic
polynomial around 0.
Let us first quickly recap Taylor's theorem. This says that for a function f and two points x and p in the
domain of f, we can write
f(x) = \sum_{k=0}^{r} \frac{f^{(k)}(p)}{k!} (x − p)^k + \frac{f^{(r+1)}(ξ)}{(r+1)!} (x − p)^{r+1}
where ξ is some point that lies between x and p. This formula requires that f has (r + 1) derivatives in an
open interval containing p and x. Applying this to M with p = 0 and a second order remainder gives
M(t n^{−1/2}) = M(0) + \frac{t}{√n} M′(0) + \frac{t²}{2n} M″(sn)
for some sn that lies between 0 and tn−1/2 . This implies therefore that sn → 0 as n → ∞. Note now that
M(0) = 1 and M′(0) = E((X1 − µ)/σ) = 0. We therefore deduce that
MYn(t) = \left(1 + \frac{t²}{2n} M″(sn)\right)^n.
We now use the fact that M″(sn) → M″(0) = E((X1 − µ)²/σ²) = 1 as n → ∞ (since sn → 0), together with the fact that
\left(1 + \frac{an}{n}\right)^n → e^a whenever an → a as n → ∞, (28)
to deduce that
MYn(t) = \left(1 + \frac{t²}{2n} M″(sn)\right)^n → e^{t²/2} = M_{N(0,1)}(t).
This completes the proof of the CLT assuming the fact (28). It remains to prove (28). There exist many
proofs for this. Here is one. Write
\left(1 + \frac{an}{n}\right)^n = \exp\left(n \log\left(1 + \frac{an}{n}\right)\right).
Let ℓ(x) := log(1 + x). Taylor's theorem for ℓ (expanded to second order around p = 0) gives
ℓ(x) = ℓ(0) + ℓ′(0) x + ℓ″(ξ) \frac{x²}{2} = x − \frac{x²}{2(1 + ξ)²}
for some ξ between 0 and x. Applying this with x = an/n,
ℓ(an/n) = log(1 + (an/n)) = \frac{an}{n} − \frac{an²}{2n²(1 + ξn)²}
for some ξn that lies between 0 and an/n (and hence ξn → 0 as n → ∞). As a result,
\left(1 + \frac{an}{n}\right)^n = \exp\left(n \log\left(1 + \frac{an}{n}\right)\right) = \exp\left(an − \frac{an²}{2n(1 + ξn)²}\right) → e^a
whenever an → a as n → ∞ (the correction term an²/(2n(1 + ξn)²) converges to zero).
This completes the proof of the CLT. Note that we have tacitly assumed that the moment generating
function of X1 , . . . , Xn exists for all t. This is much stronger than the existence of the variance of Xi . This
proof does not work if the MGF is not finite. There exist more advanced proofs (for example, which work
with Characteristic functions as opposed to MGFs) which work only under the assumption of finite variance.
These are beyond the scope of this class.
A natural question with respect to the CLT is: why is N (0, 1) (or N (0, σ 2 )) arising as the limit for sums of
independent random variables (and not some other distribution)?
This can be explained in many ways. I will mention two common explanations below.
1. With Yn := √n (X̄n − µ)/σ as above, note that
Y_{2n} = \frac{Yn + Yn′}{√2}
where Yn′ is an independent copy of Yn (independent copy means that Yn′ and Yn are independent and
have the same distribution; here Yn′ is computed from X_{n+1}, . . . , X_{2n}). Thus if Yn →L Y for a random variable Y, then it must hold that
Y =d \frac{Y + Y′}{√2}
where =d means "equality in distribution", i.e., Y and (Y + Y′)/√2 have the same distribution.
It is easy to see that if Y ∼ N(0, τ²), then Y and (Y + Y′)/√2 have the same distribution. Conversely
and remarkably, the N(0, τ²) distribution is the only distribution which has this property (harder to
prove). This, and the fact that var(Yn) = 1 for all n, implies that N(0, 1) is the only possible limiting
distribution of Yn.
2. Another interesting interpretation and explanation for the CLT comes from information theoretic con-
siderations. Note that the random variables Yn have variance equal to 1 for each n. However, as
n increases, more variables Xi are involved in the formula for Yn . One can say therefore that the
“entropy” of Yn is increasing with n while the variance stays the same at 1. Now there is a way of
formalizing this notion of entropy and it is possible to show that the N (0, 1) is the distribution that
maximizes entropy subject to variance being equal to 1. This therefore says that the entropy of Yn
increases with n (as more variables Xi are involved in computing Yn ) and eventually as n → ∞, one
gets the maximally entropic distribution, N (0, 1), as the limit. There is a way of making these precise.
21 More on the Weak Law and Convergence in Probability
In the last couple of classes, we studied the Weak Law of Large Numbers and the Central Limit Theorem.
The Weak Law of Large Numbers is the following:
Theorem 21.1 (Weak Law of Large Numbers). Suppose X1 , X2 , . . . are independent and identically dis-
tributed random variables. Suppose that E|Xi | < ∞ so that EXi is well-defined. Let EXi = µ. Then
X̄n := (X1 + · · · + Xn)/n →P µ    as n → ∞.
Recall, from last lecture, that →P is defined as follows: Yn →P c if P{|Yn − c| > ε} converges to zero as
n → ∞ for every ε > 0. The following result presents an intuitively obvious simple fact about convergence in
probability. However, this result is slightly tricky to prove (you are welcome to try proving this; the result
itself is useful for us but not the proof).
Lemma 21.2. If X1, X2, . . . and Y1, Y2, . . . are two sequences of random variables satisfying Xn →P c and
Yn →P d for two constants c and d, then
1. Xn + Yn →P c + d
2. Xn − Yn →P c − d
3. Xn Yn →P cd
4. Xn / Yn →P c/d provided d ≠ 0.
Let us now get back to the Weak Law of Large Numbers. Note that Theorem 21.1 holds without any distributional
assumptions on the random variables X1, X2, . . . (independence, identical distributions and the existence of
the expectation are the only requirements). The weak law is easy to prove under the
additional assumption that the random variables have finite variances. This proof, which we have already
seen in the last class, is based on the Chebyshev inequality which says that
P{|X̄n − µ| > ε} ≤ \frac{var(X̄n)}{ε²}. (29)
Because
var(X̄n) = var\left(\frac{X1 + · · · + Xn}{n}\right) = \frac{1}{n²} var(X1 + · · · + Xn) = \frac{1}{n²} · n · var(X1) = \frac{σ²}{n} → 0
as n → ∞, the left hand side of (29) converges to 0, which means that X̄n →P µ.
It follows more generally that if Y1, Y2, . . . is a sequence of random variables for which EYn converges to
some parameter θ and for which var(Yn) converges to zero, then Yn →P θ. This is given in the following result.
Lemma 21.3. Suppose Y1, Y2, . . . are random variables such that
1. EYn → θ as n → ∞
2. var(Yn) → 0 as n → ∞.
Then Yn →P θ as n → ∞.
Proof. Write Yn = EYn + (Yn − EYn). Chebyshev's inequality (and the fact that var(Yn) → 0) gives
P{|Yn − EYn| > ε} ≤ \frac{var(Yn)}{ε²} → 0
for every ε > 0 so that Yn − EYn →P 0. This and EYn → θ imply (via the first assertion of Lemma 21.2) that
Yn = EYn + (Yn − EYn) →P θ.
In mathematical statistics, when Yn →P θ, we say that Yn is a consistent estimator for θ or simply that Yn
is consistent for θ. The Weak Law of Large Numbers simply says that X̄n is consistent for E(X1). More
generally, Lemma 21.3 states that Yn is consistent for θ if E(Yn) → θ and var(Yn) → 0. The following
examples present two more situations where consistency holds.
Example 21.4. Suppose X1 , X2 , . . . are i.i.d having the uniform distribution on (0, θ) for a fixed θ > 0.
Then the maximum order statistic X(n) := max(X1, . . . , Xn) is a consistent estimator for θ, i.e., X(n) →P θ.
We can see this in two ways. The first way is to use the Result (Lemma 21.3) above and compute the mean
and variance of X(n) . X(n) /θ is the largest order statistic from an i.i.d sample of size n from U nif (0, 1)
and, as we have seen in the last class, X(n) /θ has the Beta(n, 1) distribution. Therefore, using the mean and
variance formulae for the Beta distribution (see wikipedia for these formulae), we have
E(X(n)/θ) = \frac{n}{n + 1}
and
var(X(n)/θ) = \frac{n}{(n + 1)²(n + 2)},
which gives
EX(n) = \frac{nθ}{n + 1}
and
var(X(n)) = \frac{nθ²}{(n + 1)²(n + 2)}.
It is clear from these that EX(n) converges to θ and var(X(n) ) converges to 0 respectively as n → ∞ which
implies (via Lemma 21.3) that X(n) converges in probability to θ.
There is a second (more direct) way to see that X(n) →P θ. This involves writing (for 0 < ε < θ)
P{|X(n) − θ| ≥ ε} = P{X(n) ≤ θ − ε} = P{Xi ≤ θ − ε for all i} = \left(1 − \frac{ε}{θ}\right)^n
which clearly goes to zero as n → ∞ (note that ε and θ are fixed). This, by the definition of convergence in
probability, shows that X(n) →P θ.
Example 21.5. Suppose X1, X2, . . . are i.i.d observations with mean µ and finite variance σ². Then
σ̂n² := \frac{1}{n} \sum_{i=1}^{n} (Xi − X̄n)²
converges in probability to σ² as n → ∞. To see this, consider first
σ̃n² := \frac{1}{n} \sum_{i=1}^{n} (Xi − µ)²,
which converges in probability to σ² by the Weak Law of Large Numbers. This is because σ̃n² is the average
of the i.i.d random variables Yi = (Xi − µ)² for i = 1, . . . , n. The Weak Law therefore says that σ̃n² converges in
probability to EY1 = E(X1 − µ)² = σ².
Now to argue that σ̂n² →P σ², the idea is simply to relate σ̂n² to σ̃n². This can be done as follows:
σ̂n² = \frac{1}{n} \sum_{i=1}^{n} (Xi − µ − (X̄n − µ))²
    = \frac{1}{n} \sum_{i=1}^{n} (Xi − µ)² + (X̄n − µ)² − 2 \left(\frac{1}{n} \sum_{i=1}^{n} (Xi − µ)\right)(X̄n − µ)
    = \frac{1}{n} \sum_{i=1}^{n} (Xi − µ)² − (X̄n − µ)².
The first term on the right hand side above converges to σ² by the Weak Law of Large Numbers (note that
σ² = E(X1 − µ)²). The second term converges to zero because X̄n →P µ (and Lemma 21.2). We use Lemma
21.2 again to conclude that σ̂n² →P σ². Note that we have not made any distributional assumptions on
X1, X2, . . . , Xn (the only requirement is that they have mean µ and finite variance σ²).
The sample variance is often defined instead as
\frac{1}{n − 1} \sum_{i=1}^{n} (Xi − X̄n)²,
with the factor of 1/(n − 1) as opposed to 1/n. This will also converge in probability to σ² simply because
\frac{1}{n − 1} \sum_{i=1}^{n} (Xi − X̄n)² = \left(\frac{1}{n} \sum_{i=1}^{n} (Xi − X̄n)²\right) \frac{n}{n − 1}.
Since the first factor above converges in probability to σ² and the second factor converges to one, the product
converges in probability to σ² (by Lemma 21.2).
We also looked at the statement and proof of the Central Limit Theorem which is the following.
Theorem 22.1 (Central Limit Theorem). Suppose Xi, i = 1, 2, . . . are i.i.d with E(Xi) = µ and var(Xi) = σ² < ∞. Then
√n (X̄n − µ) / σ →L N(0, 1)
where X̄n = (X1 + · · · + Xn)/n.
Here convergence in distribution (→L) is defined as follows: A sequence Y1, Y2, . . . of random variables is
said to converge in distribution to F if P{Yn ≤ y} converges to F (y) for every y which is a point of continuity
of F . Although convergence in distribution is defined in terms of cdfs, the CLT was proved via moment
generating functions because cdfs of sums of independent random variables are not so easy to work with.
An important consequence of the CLT from the statistical point of view is that it gives asymptotically
valid confidence intervals for µ. Indeed, as a consequence of the CLT, we have
P\left\{a ≤ \frac{√n (X̄n − µ)}{σ} ≤ b\right\} → Φ(b) − Φ(a) as n → ∞
for every a ≤ b. This is the same as
P\left\{X̄n − \frac{bσ}{√n} ≤ µ ≤ X̄n − \frac{aσ}{√n}\right\} → Φ(b) − Φ(a) as n → ∞.
Suppose now that zα/2 > 0 is the point on the real line such that Φ(zα/2 ) = 1 − α/2 for 0 < α < 1. Then
taking a = −zα/2 and b = zα/2 , we deduce that
P\left\{X̄n − \frac{zα/2 σ}{√n} ≤ µ ≤ X̄n + \frac{zα/2 σ}{√n}\right\} → Φ(zα/2) − Φ(−zα/2) = 1 − α as n → ∞.
This means that
\left[X̄n − \frac{zα/2 σ}{√n}, X̄n + \frac{zα/2 σ}{√n}\right] (30)
is an asymptotic 100(1 − α) % confidence interval for µ (assuming that σ is known). The application of the
CLT ensures that no distributional assumptions on X1 , X2 , . . . are required for this result.
The problem with the interval (30) is that it depends on σ which will be unknown in a statistical setting
(the only available data will be X1, . . . , Xn). A natural idea is to replace σ by a natural estimate such as σ̂n
from Example 21.5, taken here with the 1/(n − 1) factor (either factor works for what follows):
σ̂n² := \frac{1}{n − 1} \sum_{i=1}^{n} (Xi − X̄n)². (31)
This will result in the interval:
\left[X̄n − \frac{zα/2 σ̂n}{√n}, X̄n + \frac{zα/2 σ̂n}{√n}\right] (32)
Slutsky’s theorem stated next will imply that
\frac{√n (X̄n − µ)}{σ̂n} →L N(0, 1) (33)
which will mean that (32) is also an asymptotic 100(1 − α)% confidence interval for µ.
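As a concrete illustration, here is a short Python sketch that computes the interval (32) and checks its coverage by simulation; the exponential data, sample size and nominal level are arbitrary choices for the illustration.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    alpha, n, reps, mu = 0.05, 200, 5_000, 1.0
    z = norm.ppf(1 - alpha / 2)

    covered = 0
    for _ in range(reps):
        x = rng.exponential(scale=mu, size=n)      # non-Gaussian data with mean mu
        xbar, sd = x.mean(), x.std(ddof=1)         # ddof=1 gives the 1/(n-1) estimate (31)
        lo, hi = xbar - z * sd / np.sqrt(n), xbar + z * sd / np.sqrt(n)
        covered += (lo <= mu <= hi)
    print(covered / reps)                          # roughly 0.95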
Theorem 22.2 (Slutsky's theorem). If Yn →L Y, An →P a and Bn →P b, then
An + Bn Yn →L a + bY.
Another useful result that we shall often use is the continuous mapping theorem:
Theorem 22.3 (Continuous Mapping Theorem). 1. Suppose Yn →L Y and f is a function that is continuous
on the range of values of Y; then f(Yn) →L f(Y).
2. Suppose Yn →P c and f is continuous at c; then f(Yn) →P f(c).
To prove (33), write
Tn := \frac{√n (X̄n − µ)}{σ̂n} = \frac{√n (X̄n − µ)}{σ} · \frac{σ}{σ̂n}. (34)
The first term on the right hand side above converges in distribution to N(0, 1) by the usual CLT. For the second
term, note that σ̂n² →P σ² (as proved in Example 21.5) and so applying the continuous mapping theorem with
f(x) = \sqrt{σ²/x} implies that f(σ̂n²) →P 1. This gives that the second term above converges in probability to 1. We
can thus use Slutsky's theorem to observe that, since the first term above converges to N(0, 1) in distribution
and the second term converges in probability to 1, the random variable Tn converges in distribution to N(0, 1).
As a result,
\left[X̄n − \frac{zα/2 σ̂n}{√n}, X̄n + \frac{zα/2 σ̂n}{√n}\right]
is still a 100(1 − α)% asymptotically valid C.I. for µ. Note that we have not made any distributional
assumptions on X1, . . . , Xn. In particular, the data can be non-Gaussian.
The random variable Tn in (34) is called the sample t-statistic. The name comes from the t-distribution
or t-density. For a given integer k ≥ 1, the t-density with k degrees of freedom is the density of the random
variable
\frac{Z}{\sqrt{A/k}}
where Z ∼ N(0, 1), A has the chi-squared density with k degrees of freedom (i.e., A ∼ χ²_k) and Z and A are
independent random variables.
Now when X1 , . . . , Xn are i.i.d N (µ, σ 2 ), it can be shown (we will see how to do this later) that
\frac{√n (X̄n − µ)}{σ} ∼ N(0, 1) and \frac{\sum_{i=1}^{n} (Xi − X̄n)²}{σ²} ∼ χ²_{n−1}
and moreover the above two random variables are independent. As a result, the t-statistic Tn has the t-
distribution with n − 1 degrees of freedom when X1 , . . . , Xn are i.i.d N (µ, σ 2 ).
Therefore
1. When X1 , . . . , Xn are i.i.d N (µ, σ 2 ), the sample t-statistic Tn has the t-distribution with n − 1 degrees
of freedom.
2. When X1 , . . . , Xn are i.i.d with mean µ and finite variance σ 2 (no distributional assumption), the
t-statistic, Tn converges in distribution to N (0, 1).
It may be helpful to note in connection with the above that the t-distribution with k degrees of freedom itself
converges in distribution to N (0, 1) as k → ∞.
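This convergence of the t-distribution to N(0, 1) is easy to see numerically; here is a tiny Python check (the degrees of freedom listed are arbitrary).

    from scipy.stats import t, norm

    for k in (2, 5, 30, 200):
        print(k, round(t.ppf(0.975, df=k), 3), round(norm.ppf(0.975), 3))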
Example 22.5 (Bernoulli Parameter Estimation). Suppose X1, X2, . . . , Xn are i.i.d having the Ber(p) dis-
tribution. The CLT then gives
\frac{\sum_i Xi − np}{\sqrt{np(1 − p)}} →L N(0, 1)
which gives
P\left\{−zα/2 ≤ \frac{\sum_i Xi − np}{\sqrt{np(1 − p)}} ≤ zα/2\right\} → 1 − α as n → ∞.
This will not directly lead to a C.I. for p. To do this, it is natural to replace p in the denominator by X̄n. This
can be done because
\frac{\sum_i Xi − np}{\sqrt{n X̄n(1 − X̄n)}} = \frac{\sum_i Xi − np}{\sqrt{np(1 − p)}} \sqrt{\frac{p(1 − p)}{X̄n(1 − X̄n)}}
and by Slutsky's theorem, the above random variables converge in distribution to N(0, 1). To give more details,
we are using the fact that the first random variable above converges in distribution to N(0, 1) by the CLT
and the second random variable converges in probability to 1 (basically X̄n →P p and then use the continuous
mapping theorem). This allows us to deduce that
P\left\{−zα/2 ≤ \frac{\sum_i Xi − np}{\sqrt{n X̄n(1 − X̄n)}} ≤ zα/2\right\} → 1 − α as n → ∞,
so that
\left[X̄n − zα/2 \sqrt{\frac{X̄n(1 − X̄n)}{n}}, X̄n + zα/2 \sqrt{\frac{X̄n(1 − X̄n)}{n}}\right]
is an approximate 100(1 − α)% C.I. for p.
A similar argument applies to the Poisson distribution. Suppose X1, . . . , Xn are i.i.d having the Poi(λ)
distribution. The CLT gives
\frac{\sum_i Xi − nλ}{\sqrt{nλ}} →L N(0, 1),
and replacing λ in the denominator by X̄n, we write
\frac{\sum_i Xi − nλ}{\sqrt{n X̄n}} = \frac{\sum_i Xi − nλ}{\sqrt{nλ}} \sqrt{\frac{λ}{X̄n}}
and by Slutsky's theorem, the above random variables converge in distribution to N(0, 1) (we are using here
that X̄n →P λ which is a consequence of the Weak Law of Large Numbers). This allows us to deduce that
P\left\{−zα/2 ≤ \frac{\sum_i Xi − nλ}{\sqrt{n X̄n}} ≤ zα/2\right\} → 1 − α as n → ∞,
so that
\left[X̄n − zα/2 \sqrt{\frac{X̄n}{n}}, X̄n + zα/2 \sqrt{\frac{X̄n}{n}}\right]
is an approximate 100(1 − α)% C.I. for λ.
As another application of Slutsky's theorem, let us derive the limiting distribution of the sample variance
σ̂n² = (1/n) \sum_{i=1}^{n} (Xi − X̄n)² from Example 21.5, i.e., the limit of √n (σ̂n² − σ²). To do this, write
√n (σ̂n² − σ²) = √n \left(\frac{1}{n} \sum_{i=1}^{n} (Xi − X̄n)² − σ²\right)
             = √n \left(\frac{1}{n} \sum_{i=1}^{n} (Xi − µ − (X̄n − µ))² − σ²\right)
             = √n \left(\frac{1}{n} \sum_{i=1}^{n} (Xi − µ)² − σ²\right) − √n (X̄n − µ)².
Now by the CLT,
√n \left(\frac{1}{n} \sum_{i=1}^{n} (Xi − µ)² − σ²\right) →L N(0, τ²)
where τ² = var((X1 − µ)²) (we are assuming, of course, that τ² < ∞) and, by Slutsky's theorem,
√n (X̄n − µ)² = (√n (X̄n − µ)) (X̄n − µ) →L N(0, σ²) · 0 = 0,
so that √n (X̄n − µ)² →P 0. Thus by Slutsky's theorem again, we obtain
√n (σ̂n² − σ²) →L N(0, τ²).
Here is a simple consequence of the CLT and the continuous mapping theorem. Suppose X1 , X2 , . . . are
i.i.d random variables with mean µ and finite variance σ 2 . Then the CLT says that
\frac{√n (X̄n − µ)}{σ} →L N(0, 1).
The continuous mapping theorem then gives
\frac{n (X̄n − µ)²}{σ²} →L χ²_1.
23 Delta Method
Delta Method is another general statement about convergence in distribution that has interesting applications
when used in conjunction with the CLT.
Theorem 23.1 (Delta Method). If √n (Tn − θ) →L N(0, τ²), then
√n (g(Tn) − g(θ)) →L N(0, τ² (g′(θ))²)
provided g′(θ) exists and is non-zero.
Informally, the Delta method states that if Tn has a limiting Normal distribution, then g(Tn ) also has
a limiting normal distribution and also gives an explicit formula for the asymptotic variance of g(Tn ). This
is surprising because g can be linear or non-linear. In general, non-linear functions of normal random
variables do not have a normal distribution. But the Delta method works because under the assumption
that √n (Tn − θ) →L N(0, τ²), it follows that Tn →P θ so that Tn will be close to θ at least for large n. In a
neighborhood of θ, the non-linear function g can be approximated by a linear function which means that g
effectively behaves like a linear function. Indeed, the Delta method is a consequence of the approximation:
g(Tn) − g(θ) ≈ g′(θ) (Tn − θ).
Example 23.2. Suppose 0 ≤ p ≤ 1 is a fixed parameter and suppose that we want to estimate p2 . Let us
assume that we have two choices for estimating p2 :
1. We can estimate p2 by X/n where X is the number of successes in n binomial trials with probability p2
of success.
2. We can estimate p2 by (Y /n)2 where Y is the number of successes in n binomial trials with probability
p of success.
Which of the above is a better estimator of p2 and why? The Delta method provides a simple answer to this
question. Note that, by the CLT, we have
√n \left(\frac{X}{n} − p²\right) →L N(0, p²(1 − p²))
and that
√n \left(\frac{Y}{n} − p\right) →L N(0, p(1 − p)).
The Delta method (with g(θ) = θ²) can now be used to convert the second limiting statement into an accuracy
statement for (Y/n)² as:
√n \left(\left(\frac{Y}{n}\right)² − p²\right) →L N(0, 4p(1 − p)p²).
Therefore X/n is the more accurate estimator of p² precisely when its asymptotic variance is smaller, i.e.,
when p²(1 − p²) < 4p(1 − p)p², which is equivalent to p > 1/3. Thus when p > 1/3, X/n is a better estimator
of p² compared to (Y/n)² and when p < 1/3, (Y/n)² is the better estimator.
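The comparison is easy to confirm by simulation; the following Python sketch uses an arbitrary choice p = 0.6 (so X/n should win) and an arbitrary n.

    import numpy as np

    rng = np.random.default_rng(4)
    p, n, reps = 0.6, 500, 50_000
    est1 = rng.binomial(n, p ** 2, size=reps) / n       # X/n with success probability p^2
    est2 = (rng.binomial(n, p, size=reps) / n) ** 2     # (Y/n)^2 with success probability p

    print(np.mean((est1 - p ** 2) ** 2))                # ≈ p^2(1 - p^2)/n
    print(np.mean((est2 - p ** 2) ** 2))                # ≈ 4 p^3 (1 - p)/n, larger since p > 1/3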
The Delta method can be applied to variance stabilizing transformations. For example, consider the example
where we observe data X1 , X2 , . . . , Xn that are i.i.d having the Ber(p) distribution. The CLT then states
that
√n (X̄n − p) →L N(0, p(1 − p)). (35)
It is inconvenient that p also appears in the variance term. This presents an annoyance while finding confi-
dence intervals for p. One way around this problem is to observe that, by Slutsky’s theorem,
\frac{√n (X̄n − p)}{\sqrt{X̄n(1 − X̄n)}} →L N(0, 1).
This was done in the last class. While this method is okay, one might still wonder if it is possible to obtain
a function f having the property that
√n (f(X̄n) − f(p)) →L N(0, c²)
where the variance c2 does not depend on p. Such a function f would be called a variance stabilizing
transformation.
For another example, consider the case where we observe data X1 , . . . , Xn that are i.i.d having the P oi(λ)
distribution. The CLT then states that
√n (X̄n − λ) →L N(0, λ). (36)
The fact that λ appears in the variance term above presents an annoyance while finding confidence intervals
for λ. As done in last class, we can get around this by observing (via Slutsky’s theorem) that
\frac{√n (X̄n − λ)}{\sqrt{X̄n}} →L N(0, 1).
While this method is okay, one might still wonder if it is possible to obtain a function f having the property
that
√n (f(X̄n) − f(λ)) →L N(0, c²)
where the variance c2 does not depend on λ. If one could indeed find such an f , it will be referred to as a
variance stabilizing transformation.
More generally, suppose that an estimator Tn of a parameter θ satisfies
√n (Tn − θ) →L N(0, τ²(θ)) (37)
where the asymptotic variance τ²(θ) depends on θ. We would like to find a function f for which
√n (f(Tn) − f(θ)) →L N(0, c²) (38)
where the variance c² does not depend on θ. We would then say that the function f is a variance stabilizing
transformation.
This is possible to do via the Delta method. Indeed, the Delta method states that
√n (f(Tn) − f(θ)) →L N(0, (f′(θ))² τ²(θ)),
so (38) holds as soon as f is chosen so that
f′(θ) = \frac{c}{τ(θ)}. (39)
Here we have X1, . . . , Xn which are i.i.d having the Ber(p) distribution so that by the CLT
√n (X̄n − p) →L N(0, p(1 − p)).
Therefore (37) holds with Tn = X̄n, θ = p and τ²(θ) = θ(1 − θ). The formula (39) says therefore that we
choose f with
f′(θ) = \frac{c}{\sqrt{θ(1 − θ)}},
which means that f(θ) = 2c arcsin(√θ). The Delta method then guarantees that
2√n (arcsin(\sqrt{X̄n}) − arcsin(√p)) →L N(0, 1).
This implies that
P\left\{\left|\arcsin(\sqrt{X̄n}) − \arcsin(\sqrt{p})\right| ≤ \frac{zα/2}{2√n}\right\} → 1 − α as n → ∞
so that
\left[\sin\left(\arcsin(\sqrt{X̄n}) − \frac{zα/2}{2√n}\right), \sin\left(\arcsin(\sqrt{X̄n}) + \frac{zα/2}{2√n}\right)\right]
is an approximate 100(1 − α)% C.I. for √p. The lower end point of the above interval can be negative (note
that arcsin(√X̄n) takes values between 0 and π/2 but arcsin(√X̄n) − zα/2/(2√n) can be negative) while
√p is always nonnegative. So we can replace the lower end point by 0 if it turns out to be negative. Using the
notation x+ = max(x, 0), we see that
\left[\left(\sin\left(\arcsin(\sqrt{X̄n}) − \frac{zα/2}{2√n}\right)\right)_+, \sin\left(\arcsin(\sqrt{X̄n}) + \frac{zα/2}{2√n}\right)\right]
is an approximate 100(1 − α)% C.I. for √p. To get a confidence interval for p, we can simply square the two
end points of the above interval. This allows us to deduce that
\left[\left(\sin\left(\arcsin(\sqrt{X̄n}) − \frac{zα/2}{2√n}\right)\right)_+^2, \sin^2\left(\arcsin(\sqrt{X̄n}) + \frac{zα/2}{2√n}\right)\right]
is an approximate 100(1 − α)% C.I. for p.
Let us now get back to the Poisson distribution where X1, . . . , Xn are i.i.d Poi(λ) and the CLT gives
(36). Therefore Tn = X̄n, θ = λ and τ²(θ) = θ. The equation (39) suggests that we choose f with
f′(θ) = \frac{c}{\sqrt{θ}},
which means that f(θ) = 2c√θ. The Delta method then guarantees that
2√n (\sqrt{X̄n} − \sqrt{λ}) →L N(0, 1). (40)
Therefore the square-root transformation applied to X̄n ensures that the resulting variance (of √X̄n) does
not depend on λ (in a limiting sense).
The fact (40) will lead to approximate confidence intervals for λ. Indeed, (40) immediately implies that
P\left\{\left|\sqrt{X̄n} − \sqrt{λ}\right| ≤ \frac{zα/2}{2√n}\right\} → 1 − α as n → ∞
so that
\left[\sqrt{X̄n} − \frac{zα/2}{2√n}, \sqrt{X̄n} + \frac{zα/2}{2√n}\right]
is an approximate 100(1 − α)% C.I. for √λ. Note that the lower end point of the above interval can be
negative while √λ is always nonnegative. So we can replace the lower end point by 0 if it turns out to be
negative. Again using the notation x+ := max(x, 0), we see that
\left[\left(\sqrt{X̄n} − \frac{zα/2}{2√n}\right)_+, \sqrt{X̄n} + \frac{zα/2}{2√n}\right]
is an approximate 100(1 − α)% C.I. for √λ. To get a confidence interval for λ, we can simply square the two
end points of the above interval. This allows us to deduce that
\left[\left(\sqrt{X̄n} − \frac{zα/2}{2√n}\right)_+^2, \left(\sqrt{X̄n} + \frac{zα/2}{2√n}\right)^2\right] (41)
is an approximate 100(1 − α)% C.I. for λ.
This interval can be compared with the interval that was obtained in the previous lecture using Slutsky’s
theorem. That interval was
\left[X̄n − zα/2 \sqrt{\frac{X̄n}{n}}, X̄n + zα/2 \sqrt{\frac{X̄n}{n}}\right]. (42)
The intervals (41) and (42) may look different but they are actually quite close to each other for large n. To
see this, note that the difference between the upper bounds of these two intervals is at most z²α/2/(4n), which
is very small when n is large (the same is true of the lower bounds).
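The following quick Python sketch (with a made-up sample size and sample mean) computes both intervals and the bound z²/(4n) on the difference of their endpoints.

    import numpy as np
    from scipy.stats import norm

    z = norm.ppf(0.975)
    n, xbar = 100, 3.2                         # assumed sample size and sample mean

    vst = (max(np.sqrt(xbar) - z / (2 * np.sqrt(n)), 0.0) ** 2,
           (np.sqrt(xbar) + z / (2 * np.sqrt(n))) ** 2)               # interval (41)
    wald = (xbar - z * np.sqrt(xbar / n), xbar + z * np.sqrt(xbar / n))  # interval (42)

    print(vst, wald, z ** 2 / (4 * n))         # endpoints differ by at most z^2/(4n)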
Let us now look at another example where the variance stabilizing transformation is the log function.
Suppose X1 , X2 , . . . are i.i.d such that Xi /σ 2 has the chi-squared distribution with one degree of freedom.
In other words,
Xi ∼ σ 2 χ21 .
Because E(X1 ) = σ 2 and var(X1 ) = 2σ 4 , the CLT says that
√n (X̄n − σ²) →L N(0, 2σ^4). (43)
Can we now find a function f such that f(X̄n) has a limiting variance that is independent of σ²? Because
(43) has the form (37) with Tn = X̄n, θ = σ² and τ²(θ) = 2θ², we can use (39) which suggests taking f so
that f′(θ) = c/τ(θ) = c/(√2 θ). This gives
f(θ) = \frac{c}{√2} \log θ,
allowing us to conclude that
√n \left(\frac{1}{√2} \log X̄n − \frac{1}{√2} \log(σ²)\right) →L N(0, 1).
Square-roots and logarithms are common transformations that are applied to data when there is varying
variance (see, for example, https://en.wikipedia.org/wiki/Variance-stabilizing_transformation).
Suppose X1 , X2 , . . . are i.i.d having the Geometric distribution with parameter p. Recall that X has the
Geo(p) distribution if X takes the values 1, 2, . . . with the probabilities
P{X = k} = (1 − p)^{k−1} p for k = 1, 2, . . . .
The number of independent tosses (of a coin with probability of heads p) required to get the first head has
the Geo(p) distribution.
I leave it as an easy exercise to verify that for X ∼ Geo(p),
EX = \frac{1}{p} and var(X) = \frac{1 − p}{p²}.
The CLT therefore states that for i.i.d observations X1 , X2 , . . . having the Geo(p) distribution, we have
√n (X̄n − (1/p)) →L N\left(0, \frac{1 − p}{p²}\right).
p
What is the variance stabilizing transformation for X̄n i.e., what is the transformation f for which f (X̄n )
has constant asymptotic variance? To answer this, note that the above displayed equation is of the same
form as (37) with Tn = X̄n , θ = 1/p and τ 2 (θ) = (1 − p)/p2 . We then write τ (θ) in terms of θ as (note that
p = 1/θ)
τ(θ) = \sqrt{\frac{1 − p}{p²}} = \sqrt{\frac{1 − (1/θ)}{1/θ²}} = \sqrt{θ(θ − 1)}.
The variance stabilizing transformation is therefore given by
f(θ) = \int \frac{c}{τ(θ)} dθ = \int \frac{c}{\sqrt{θ(θ − 1)}} dθ = 2c \log(\sqrt{θ} + \sqrt{θ − 1}).
Therefore f(θ) = 2c log(√θ + √(θ − 1)) is the variance stabilizing transformation here and
√n (f(X̄n) − f(1/p)) →L N(0, c²).
Recall the Delta method from the last class: if √n (Tn − θ) →L N(0, τ²), then
√n (g(Tn) − g(θ)) →L N(0, τ² (g′(θ))²). (44)
This is essentially a consequence of the Taylor approximation: g(Tn) − g(θ) ≈ g′(θ)(Tn − θ). What would
happen if g′(θ) = 0? In this case, the statement (44) will still be correct if the right hand side is interpreted
as the constant 0, i.e., when g′(θ) = 0, the following holds:
√n (g(Tn) − g(θ)) →P 0.
However this only states that g(Tn) − g(θ) is of a smaller order compared to n^{−1/2} but does not precisely
say what the exact order is and what the limiting distribution is when scaled by the correct order. To figure
out these, we need to consider the higher order terms in the Taylor expansion for g(Tn) around θ. Assume,
in the sequel, that g′(θ) = 0 and that g″(θ) ≠ 0.
Indeed, the second order Taylor expansion gives g(Tn) − g(θ) ≈ (1/2) g″(θ)(Tn − θ)², and since
n (Tn − θ)² →L τ² χ²_1 (by the continuous mapping theorem), we have
n (g(Tn) − g(θ)) →L \frac{1}{2} g″(θ) τ² χ²_1. (45)
Therefore when g′(θ) = 0 and g″(θ) ≠ 0, the right scaling factor is n and the limiting distribution is a scaled
multiple of χ²_1 (note that the limit is not a normal distribution).
Example 25.1. Suppose X1 , X2 , . . . , Xn are i.i.d Ber(p) random variables. Suppose we want to estimate
p(1 − p). The natural estimator is X̄n (1 − X̄n ). What is the limiting behavior of this estimator?
This can be answered by the Delta method by taking g(θ) = θ(1 − θ). Note first that by the usual CLT,
√n (X̄n − p) →L N(0, p(1 − p)).
When p ≠ 1/2, we have g′(p) = 1 − 2p ≠ 0 and (44) gives
√n (X̄n(1 − X̄n) − p(1 − p)) →L N(0, p(1 − p)(1 − 2p)²).
But when p = 1/2, we have to use (45) instead of (44) and this gives (note that g″(p) = −2)
n (g(X̄n) − g(1/2)) →L \frac{1}{2}(−2) p(1 − p) χ²_1 = −\frac{1}{4} χ²_1.
26 Conditioning
Our next main topic is conditioning. This is a very important topic for statistics classes.
26.1 Basics
Let us first start by looking at the definition of conditional probability. Given two events A and B with
P(A) > 0, we define the conditional probability of B given A as
P(B|A) = \frac{P(A ∩ B)}{P(A)}. (46)
See Section 1.1 of Lecture 10 of Jim Pitman’s 2016 notes for 201A to get some intuitive justification for this
definition of conditional probability.
An immediate consequence of this definition is the law of total probability: for events A and B with 0 < P(A) < 1,
P(B) = P(B|A) P(A) + P(B|Ac) P(Ac).
Note here that A and Ac are disjoint events whose union is the entire space of outcomes Ω. More generally,
if A1, A2, . . . are disjoint events whose union is Ω, we have
P(B) = \sum_{i≥1} P(B|Ai) P(Ai). (47)
Combining these with the definition (46) gives Bayes rule:
P(A|B) = \frac{P(B|A) P(A)}{P(B|A) P(A) + P(B|Ac) P(Ac)}.
26.2 Conditional Distributions, Law of Total Probability and Bayes Rule for
Discrete Random Variables
Consider two random variables X and Θ. Assume that both are discrete random variables. One can then
define the conditional distribution of X given Θ = θ simply by defining the conditional probabilities:
P{X = x|Θ = θ} = \frac{P{X = x, Θ = θ}}{P{Θ = θ}} (48)
assuming that P{Θ = θ} > 0. If P{Θ = θ} = 0, we would not attempt to define P{X = x|Θ = θ}.
As x varies over all values that the random variable X takes, the probabilities (48) determine the con-
ditional distribution of X given Θ = θ. Note that the conditional probability P{X = x|Θ = θ} always lies
between 0 and 1 and we have \sum_x P{X = x|Θ = θ} = 1.
Example 26.1. Suppose X and Y are independent random variables having the P oi(λ1 ) and P oi(λ2 ) dis-
tributions respectively. For n ≥ 0, what is the conditional distribution of X given X + Y = n?
We need to compute
P {X = i|X + Y = n}
for various values of i. It is clear that the above probability is non-zero only when i is an integer between 0
and n. Let us therefore assume that i is an integer between 0 and n. By definition
P{X = i|X + Y = n} = \frac{P{X = i, X + Y = n}}{P{X + Y = n}}
                   = \frac{P{X = i, Y = n − i}}{P{X + Y = n}}
                   = \frac{P{X = i} P{Y = n − i}}{P{X + Y = n}}
The numerator above can be evaluated directly as X and Y are independently distributed as P oi(λ1 ) and
P oi(λ2 ) respectively. For the denominator, we use the fact that X + Y is P oi(λ1 + λ2 ) (the proof of this fact
is left as exercise). We thus have
P{X = i|X + Y = n} = \frac{P{X = i} P{Y = n − i}}{P{X + Y = n}}
                   = \frac{(e^{−λ1} λ1^i / i!)(e^{−λ2} λ2^{n−i} / (n − i)!)}{e^{−(λ1+λ2)} (λ1 + λ2)^n / n!}
                   = \frac{n!}{i!(n − i)!} \left(\frac{λ1}{λ1 + λ2}\right)^i \left(\frac{λ2}{λ1 + λ2}\right)^{n−i},
which means that the conditional distribution of X given X + Y = n is the Binomial distribution with
parameters n and p = λ1 /(λ1 + λ2 ).
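Here is a quick Python simulation sketch of this conclusion (the values of λ1, λ2 and n are arbitrary): restricting to draws with X + Y = n, the values of X should behave like Binomial(n, λ1/(λ1 + λ2)).

    import numpy as np

    rng = np.random.default_rng(6)
    lam1, lam2, n, reps = 2.0, 3.0, 5, 500_000
    x = rng.poisson(lam1, size=reps)
    y = rng.poisson(lam2, size=reps)

    cond = x[x + y == n]                           # values of X on the event {X + Y = n}
    print(cond.mean(), n * lam1 / (lam1 + lam2))   # conditional mean ≈ n p with p = λ1/(λ1+λ2)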
Let us now look at the law of total probability and Bayes rule for discrete random variables X and Θ.
As a consequence of (47), we have
P{X = x} = \sum_θ P{X = x|Θ = θ} P{Θ = θ} (49)
where the summation is over all values of θ that are taken by the random variable Θ. This formula allows
one to calculate P{X = x} using knowledge of P{X = x|Θ = θ} and P{Θ = θ}. We shall refer to (49) as the
Law of Total Probability for discrete random variables.
The Bayes rule is
P{Θ = θ|X = x} = \frac{P{X = x|Θ = θ} P{Θ = θ}}{\sum_{θ′} P{X = x|Θ = θ′} P{Θ = θ′}}. (50)
The Bayes rule allows one to compute the conditional probabilities of Θ given X using knowledge of the
conditional probabilities of X given Θ as well as the marginal probabilities of Θ. We shall refer to (50) as
the Bayes Rule for discrete random variables.
Example 26.2. Suppose N is a random variable having the P oi(λ) distribution. Also suppose that, con-
ditional on N = n, the random variable X has the Bin(n, p) distribution. This setting is known as the
Poissonization of the Binomial. Find the marginal distribution of X. Also what is the conditional
distribution of N given X = i?
To find the marginal distribution of X, we need to find P{X = i} for every integer i ≥ 0. For this, we
use the law of total probability which states that
P{X = i} = \sum_{n=0}^{∞} P{X = i|N = n} P{N = n}.
Because X|N = n is Bin(n, p), the probability P{X = i|N = n} is non-zero only when 0 ≤ i ≤ n. Therefore
the terms in the sum above are non-zero only when n ≥ i and we obtain
P{X = i} = \sum_{n=i}^{∞} P{X = i|N = n} P{N = n}
         = \sum_{n=i}^{∞} \binom{n}{i} p^i (1 − p)^{n−i} e^{−λ} \frac{λ^n}{n!}
         = \frac{e^{−λ} p^i}{i!} \sum_{n=i}^{∞} \frac{(1 − p)^{n−i}}{(n − i)!} λ^n
         = \frac{e^{−λ} (λp)^i}{i!} \sum_{n=i}^{∞} \frac{(1 − p)^{n−i}}{(n − i)!} λ^{n−i}
         = \frac{e^{−λ} (λp)^i}{i!} \sum_{n=i}^{∞} \frac{(λ(1 − p))^{n−i}}{(n − i)!} = \frac{e^{−λ} (λp)^i}{i!} e^{λ(1−p)} = \frac{e^{−λp} (λp)^i}{i!},
which means that X ∼ Poi(λp). To find the conditional distribution of N given X = i, we need to use the
Bayes rule, which states that
P{N = n|X = i} = \frac{P{X = i|N = n} P{N = n}}{P{X = i}}.
This is only nonzero when n ≥ i (otherwise P{X = i|N = n} will be zero). And when n ≥ i, we have
P{N = n|X = i} = \frac{\binom{n}{i} p^i (1 − p)^{n−i} e^{−λ} λ^n / n!}{e^{−λp} (λp)^i / i!} = e^{−λ(1−p)} \frac{(λ(1 − p))^{n−i}}{(n − i)!}.
This means that conditional on X = i, the random variable N is distributed as i + Poi(λ(1 − p)).
What is the joint distribution of X and N − X in this example? To compute this, note that
P{X = i, N − X = j} = P{X = i, N = i + j}
                    = P{X = i|N = i + j} P{N = i + j}
                    = \binom{i + j}{i} p^i (1 − p)^j e^{−λ} \frac{λ^{i+j}}{(i + j)!}
                    = e^{−λ} \frac{(λp)^i}{i!} \frac{(λ(1 − p))^j}{j!}.
Note that this factorizes into a term involving only i and a term involving only j. This means therefore
that X and N − X are independent. Also from the expression above, it is easy to deduce that the marginal
distribution of X is P oi(λp) (which we have already derived via the law of total probability) and that N − X
is P oi(λ(1 − p)).
The setting of this example arises when one tosses a coin with probability of heads p independently a
P oi(λ) number of times. Then N denotes the total number of tosses, X denotes the number of heads and
N − X denotes the number of tails. We have thus shown that X and N − X are independent and are
distributed according to P oi(λp) and P oi(λ(1 − p)) respectively. Independence of X and N − X here is
especially interesting. When a coin is tossed a fixed number n of times, the number of heads and tails are
obviously not independent (as they have to sum to n). But when the number of tosses is itself random and
has the Poisson distribution, then the number of heads and tails become independent random variables.
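This Poissonization fact is easy to check by simulation; in the following Python sketch the values of λ, p and the number of repetitions are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(7)
    lam, p, reps = 4.0, 0.3, 500_000
    N = rng.poisson(lam, size=reps)
    X = rng.binomial(N, p)                     # heads among a Poisson number of tosses
    tails = N - X

    print(X.mean(), lam * p)                   # ≈ λp
    print(tails.mean(), lam * (1 - p))         # ≈ λ(1−p)
    print(np.corrcoef(X, tails)[0, 1])         # ≈ 0, consistent with independence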
In the last lecture, we studied conditioning for discrete random variables. Given two discrete random variables
X and Θ, one can then define the conditional distribution of X given Θ = θ simply by defining the conditional
probabilities:
P{X = x|Θ = θ} = \frac{P{X = x, Θ = θ}}{P{Θ = θ}} (51)
assuming that P{Θ = θ} > 0. If P{Θ = θ} = 0, we would not attempt to define P{X = x|Θ = θ}.
As x varies over all values that the random variable X takes, the probabilities (51) determine the con-
ditional distribution of X given Θ = θ. Note that the conditional probability P{X = x|Θ = θ} always lies
between 0 and 1 and we have \sum_x P{X = x|Θ = θ} = 1.
We also looked at the law of total probability and the Bayes rule. The Law of Total Probability is
P{X = x} = \sum_θ P{X = x|Θ = θ} P{Θ = θ} (52)
where the summation is over all values of θ that are taken by the random variable Θ. This formula allows
one to calculate P{X = x} using knowledge of P{X = x|Θ = θ} and P{Θ = θ}.
The Bayes rule is
P{Θ = θ|X = x} = \frac{P{X = x|Θ = θ} P{Θ = θ}}{\sum_{θ′} P{X = x|Θ = θ′} P{Θ = θ′}}. (53)
The Bayes rule allows one to compute the conditional probabilities of Θ given X using knowledge of the
conditional probabilities of X given Θ as well as the marginal probabilities of Θ.
The goal of this lecture is to extend all of the above to the case when X and Θ are continuous random
variables.
28 Conditional Densities for Continuous Random Variables
We shall now define the conditional density of X given Θ = θ for a fixed value of θ. In order to define
this conditional density at a point x, we need to consider
P{x ≤ X ≤ x + δ | Θ = θ}
for a small δ > 0. Because P{Θ = θ} = 0 (note that Θ is a continuous random variable), we cannot define
this conditional probability using the definition P(B|A) := P(B ∩ A)/P(A). But, intuitively, conditioning on
Θ = θ should be equivalent to conditioning on θ ≤ Θ ≤ θ + ε for small ε. Therefore we can write
P{x ≤ X ≤ x + δ | Θ = θ} ≈ P{x ≤ X ≤ x + δ | θ ≤ Θ ≤ θ + ε}
for small ε. For the probability on the right hand side above, we can use P(B|A) := P(B ∩ A)/P(A) to obtain
f_{X|Θ=θ}(x) := \frac{f_{X,Θ}(x, θ)}{f_Θ(θ)} (55)
for the conditional density of X given Θ = θ. This definition makes sense as long as fΘ (θ) > 0. If fΘ (θ) = 0,
we do not attempt to define fX|Θ=θ .
Example 28.1. Suppose X and Θ are independent random variables having the Gamma(α, λ) and Gamma(β, λ)
distributions respectively. What then is the conditional density of X given X + Θ = 1? Note first that
f_{X,X+Θ}(x, 1) = f_{X,Θ}(x, 1 − x) = f_X(x) f_Θ(1 − x) = \frac{λ^{α+β}}{Γ(α)Γ(β)} x^{α−1} (1 − x)^{β−1} e^{−λ}
for 0 < x < 1. We have also seen previously that X + Θ is distributed as Γ(α + β, λ). Thus
f_{X+Θ}(1) = \frac{λ^{α+β}}{Γ(α + β)} e^{−λ}.
Therefore
f_{X|X+Θ=1}(x) = \frac{f_{X,X+Θ}(x, 1)}{f_{X+Θ}(1)} = \frac{Γ(α + β)}{Γ(α)Γ(β)} x^{α−1} (1 − x)^{β−1} for 0 < x < 1.
This means therefore that
X|(X + Θ = 1) ∼ Beta(α, β).
Example 28.2. Suppose X and Y are independent U nif (0, 1) random variables. What is fU |V =v where
U = min(X, Y ) and V = max(X, Y ) and 0 < v < 1?
For fixed θ, the denominator in (55) does not depend on x, so, viewed as a function of x,
f_{X|Θ=θ}(x) ∝ f_{X,Θ}(x, θ). (57)
The proportionality statement (57) often makes calculations involving conditional densities much simpler.
To illustrate this, let us revisit the calculations in Examples 28.1 and 28.2 respectively.
Example 29.1 (Example 28.1 revisited). Suppose X and Θ are independent random variables having the
Gamma(α, λ) and Gamma(β, λ) distributions respectively. What then is the conditional density of X given
X + Θ = 1? By (57),
f_{X|X+Θ=1}(x) ∝ f_{X,X+Θ}(x, 1)
              = f_{X,Θ}(x, 1 − x)
              = f_X(x) f_Θ(1 − x)
              ∝ e^{−λx} x^{α−1} I{x > 0} e^{−λ(1−x)} (1 − x)^{β−1} I{1 − x > 0}
              ∝ x^{α−1} (1 − x)^{β−1} I{0 < x < 1},
which immediately implies that X|X + Θ = 1 has the Beta distribution with parameters α and β.
Example 29.2 (Example 28.2 revisited). Suppose X and Y are independent U nif (0, 1) random variables.
What is fU |V =v where U = min(X, Y ) and V = max(X, Y ) and 0 < v < 1?
Write
f_{U|V=v}(u) ∝ f_{U,V}(u, v)
            = 2 f_X(u) f_Y(v) I{u < v}
            ∝ f_X(u) I{u < v}
            = I{0 < u < 1} I{u < v} = I{0 < u < min(v, 1)}.
Thus for 0 < v < 1, the conditional density of U given V = v is the uniform density on [0, v]. For v > 1, the
conditional density of U given V = v is not defined as the density of V at v > 1 equals 0.
X and Θ are independent if and only if fX|Θ=θ = fX for every value of θ. This latter statement is precisely
equivalent to fX,Θ (x, θ) = fX (x)fΘ (θ). By switching roles of X and Θ, it also follows that X and Θ are
independent if and only if fΘ|X=x = fΘ for every x.
It is also not hard to see that X and Θ are independent if and only if the conditional density of X given
Θ = θ is the same for all values of θ for which fΘ (θ) > 0.
Example 30.1 (Back to the Gamma example). We have previously seen that when X ∼ Gamma(α, λ) and
Θ ∼ Gamma(β, λ) are independent, then
X|(X + Θ = 1) ∼ Beta(α, β).
This can also be directly seen (using the observation that X/(X + Θ) is distributed as Beta(α, β) and that
X/(X + Θ) is independent of X + Θ) as follows:
X|(X + Θ = 1) =d \frac{X}{1}\Big|(X + Θ = 1) =d \frac{X}{X + Θ}\Big|(X + Θ = 1) =d \frac{X}{X + Θ} ∼ Beta(α, β).
Note that we removed the conditioning on X + Θ = 1 in the last step above because X/(X + Θ) is independent
of X + Θ.
2. Law of Total Probability for Continuous Random Variables:
f_X(x) = \int f_{X|Θ=θ}(x) f_Θ(θ) dθ.
This follows directly from the definition of f_{X|Θ=θ}(x). This formula allows us to deduce the marginal
density of X using knowledge of the conditional density of X given Θ and the marginal density of Θ.
3. Bayes Rule for Continuous Random Variables: Recall the Bayes rule for discrete random variables
in (53). The analogous statement for continuous random variables is
f_{Θ|X=x}(θ) = \frac{f_{X|Θ=θ}(x) f_Θ(θ)}{\int f_{X|Θ=θ′}(x) f_Θ(θ′) dθ′}.
This allows us to deduce the conditional density of Θ given X using knowledge of the conditional density
of X given Θ and the marginal density of Θ.
Suppose X and Θ are continuous random variables having a joint density fX,Θ . We have then seen that
f_{X|Θ=θ}(x) = \frac{f_{X,Θ}(x, θ)}{f_Θ(θ)}.
Note that the denominator in the right hand side above does not involve x, so we can write, as a function of x,
f_{X|Θ=θ}(x) ∝ f_{X,Θ}(x, θ).
1. We have
fX,Θ (x, θ) = fX|Θ=θ (x)fΘ (θ).
Now consider the normal prior, normal data example: Θ ∼ N(µ, τ²) and X|Θ = θ ∼ N(θ, σ²). What are the
marginal density of X and the conditional density of Θ given X = x? We have
f_{X|Θ=θ}(x) f_Θ(θ) = \frac{1}{2πτσ} \exp\left(−\frac{1}{2}\left[\frac{(θ − µ)²}{τ²} + \frac{(x − θ)²}{σ²}\right]\right).
The term in the exponent above can be simplified as
\frac{(θ − µ)²}{τ²} + \frac{(x − θ)²}{σ²} = \left(\frac{1}{τ²} + \frac{1}{σ²}\right)\left(θ − \frac{x/σ² + µ/τ²}{1/σ² + 1/τ²}\right)² + \frac{(x − µ)²}{τ² + σ²},
where I skipped a few steps to get to the last equality (complete the square and simplify the resulting expres-
sions).
As a result,
f_{X|Θ=θ}(x) f_Θ(θ) = \frac{1}{2πτσ} \exp\left(−\frac{1}{2}\left(\frac{1}{τ²} + \frac{1}{σ²}\right)\left(θ − \frac{x/σ² + µ/τ²}{1/σ² + 1/τ²}\right)²\right) \exp\left(−\frac{(x − µ)²}{2(τ² + σ²)}\right).
Consequently,
f_X(x) = \int \frac{1}{2πτσ} \exp\left(−\frac{1}{2}\left(\frac{1}{τ²} + \frac{1}{σ²}\right)\left(θ − \frac{x/σ² + µ/τ²}{1/σ² + 1/τ²}\right)²\right) \exp\left(−\frac{(x − µ)²}{2(τ² + σ²)}\right) dθ
       = \frac{1}{2πτσ} \exp\left(−\frac{(x − µ)²}{2(τ² + σ²)}\right) \int \exp\left(−\frac{1}{2}\left(\frac{1}{τ²} + \frac{1}{σ²}\right)\left(θ − \frac{x/σ² + µ/τ²}{1/σ² + 1/τ²}\right)²\right) dθ
       = \frac{1}{2πτσ} \exp\left(−\frac{(x − µ)²}{2(τ² + σ²)}\right) \sqrt{2π} \left(\frac{1}{τ²} + \frac{1}{σ²}\right)^{−1/2}
       = \frac{1}{\sqrt{2π(τ² + σ²)}} \exp\left(−\frac{(x − µ)²}{2(τ² + σ²)}\right),
which gives
X ∼ N(µ, τ² + σ²).
Moreover, viewed as a function of θ, f_{Θ|X=x}(θ) ∝ f_{X|Θ=θ}(x) f_Θ(θ), and the calculation above shows that this
is proportional to a normal density in θ, so that
Θ|X = x ∼ N\left(\frac{x/σ² + µ/τ²}{1/σ² + 1/τ²}, \frac{1}{1/σ² + 1/τ²}\right).
For a normal density with mean m and variance v², the inverse of the variance 1/v² is called the precision.
The above calculation therefore reveals that the precision of the conditional distribution of Θ given X equals
the sum of the precision of the distribution of Θ and the precision of the conditional distribution of X given Θ.
In this particular example, the mean of the posterior distribution is a weighted linear combination of the prior
mean as well as the data where the weights are proportional to the precisions. Also, posterior precision equals
the sum of the prior precision and the data precision which informally means, in particular, that the posterior
is more precise than the prior.
Example 32.2. Suppose Θ ∼ Gamma(α, λ) and X|Θ = θ ∼ Exp(θ). What is the marginal density of X
and the conditional density of Θ given X = x? By the law of total probability,
f_X(x) = \int_0^∞ θ e^{−θx} \frac{λ^α θ^{α−1} e^{−λθ}}{Γ(α)} dθ = \frac{λ^α}{Γ(α)} \int_0^∞ θ^α e^{−θ(λ+x)} dθ = \frac{α λ^α}{(λ + x)^{α+1}}    for x > 0.
This is called the Lomax distribution with shape parameter α > 0 and scale/rate parameter λ > 0 (see
https://en.wikipedia.org/wiki/Lomax_distribution). Moreover, by the Bayes rule,
f_{Θ|X=x}(θ) ∝ θ e^{−θx} θ^{α−1} e^{−λθ} ∝ θ^α e^{−θ(λ+x)}, so that Θ|X = x ∼ Gamma(α + 1, λ + x).
The LTP describes how to compute the distribution of X based on knowledge of the conditional distribution
of X given Θ = θ as well as the marginal distribution of Θ. The Bayes rule describes how to compute the
conditional distribution of Θ given X = x based on the same knowledge of the conditional distribution of X
given Θ = θ as well as the marginal distribution of Θ. We have so far looked at the LTP and Bayes rule
when X and Θ are both discrete or when they are both continuous. Now we shall also consider the cases
when one of them is discrete and the other is continuous.
33.2 X and Θ are both continuous
Here LTP is
f_X(x) = \int f_{X|Θ=θ}(x) f_Θ(θ) dθ
and Bayes rule is
f_{Θ|X=x}(θ) = \frac{f_{X|Θ=θ}(x) f_Θ(θ)}{\int f_{X|Θ=θ′}(x) f_Θ(θ′) dθ′}.
When X is discrete and Θ is continuous, LTP is
P{X = x} = \int P{X = x|Θ = θ} f_Θ(θ) dθ
and Bayes rule is
f_{Θ|X=x}(θ) = \frac{P{X = x|Θ = θ} f_Θ(θ)}{\int P{X = x|Θ = θ′} f_Θ(θ′) dθ′}.
When X is continuous and Θ is discrete, LTP is
f_X(x) = \sum_θ f_{X|Θ=θ}(x) P{Θ = θ}
and Bayes rule is
P{Θ = θ|X = x} = \frac{f_{X|Θ=θ}(x) P{Θ = θ}}{f_X(x)} = \frac{f_{X|Θ=θ}(x) P{Θ = θ}}{\sum_{θ′} f_{X|Θ=θ′}(x) P{Θ = θ′}}.
These formulae are useful when the conditional distribution of X given Θ = θ as well as the marginal
distribution of Θ are easy to determine (or are given as part of the model specification) and the goal is to
determine the marginal distribution of X as well as the conditional distribution of Θ given X = x.
We shall now look at two applications of the LTP and Bayes Rule to when one of X and Θ is discrete
and the other is continuous.
Example 33.1. Suppose that Θ is uniformly distributed on (0, 1) and that X|Θ = θ has the binomial
distribution with parameters n and θ (i.e., conditioned on Θ = θ, the random variable X is distributed as
the number of successes in n independent tosses of a coin with probability of success θ). What then is the
marginal distribution of X as well as the conditional distribution of Θ given X = x?
Note that this is a situation where X is discrete (taking values in 0, 1, . . . , n) and Θ is continuous (taking
values in the interval (0, 1)). To compute the marginal distribution of X, we use the appropriate LTP to
write (for x = 0, 1, . . . , n)
P{X = x} = \int P{X = x|Θ = θ} f_Θ(θ) dθ
         = \int_0^1 \binom{n}{x} θ^x (1 − θ)^{n−x} dθ
         = \binom{n}{x} B(x + 1, n − x + 1)
         = \binom{n}{x} \frac{Γ(x + 1) Γ(n − x + 1)}{Γ(n + 2)} = \frac{n!}{(n − x)! x!} \frac{x!(n − x)!}{(n + 1)!} = \frac{1}{n + 1},
which means that X is (discrete) uniformly distributed on the finite set {0, 1, . . . , n}.
Let us now calculate the posterior distribution of Θ given X = x. Using the Bayes rule, we obtain
f_{Θ|X=x}(θ) = \frac{P{X = x|Θ = θ} f_Θ(θ)}{P{X = x}}
            = \frac{\binom{n}{x} θ^x (1 − θ)^{n−x}}{1/(n + 1)} ∝ θ^x (1 − θ)^{n−x}
for 0 < θ < 1. From here, it immediately follows that
Θ|X = x ∼ Beta(x + 1, n − x + 1).
The mean of the Beta(α, β) distribution is α/(α + β). Therefore the mean of the conditional distribution of
Θ given X = x (also known as the posterior mean) equals
E(Θ|X = x) = \frac{x + 1}{n + 2}.
As the prior mean equals 1/2 and we can write
\frac{x + 1}{n + 2} = \frac{n}{n + 2} · \frac{x}{n} + \frac{2}{n + 2} · \frac{1}{2},
it follows that the posterior mean falls between the prior mean and x/n. As n becomes large, the posterior
mean approaches x/n.
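Here is a tiny Python sketch of this posterior computation (the values of n and x are made up): the posterior is Beta(x + 1, n − x + 1), its mean is (x + 1)/(n + 2), and scipy can also give a credible interval.

    from scipy.stats import beta

    n, x = 20, 14                              # assumed data
    posterior = beta(x + 1, n - x + 1)

    print(posterior.mean(), (x + 1) / (n + 2))         # identical
    print(posterior.interval(0.95))                    # a 95% credible interval for Θ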
Example 33.2 (Statistical Classification). In a statistical classification problem, the random variable Θ is
discrete and X is usually continuous. The simplest situation is when Θ is binary. Let us say that
P{Θ = 1} = p and P{Θ = 0} = 1 − p.
Also assume that the conditional density of X given Θ = 0 is f0 and that the conditional density of X given
Θ = 1 is f1 i.e.,
X|Θ = 0 ∼ f0 and X|Θ = 1 ∼ f1 .
Using the LTP, we see that the marginal density of X equals
fX = (1 − p)f0 + pf1 .
In other words, fX is a mixture of f0 and f1 with the mixing weights being equal to the marginal probabilities
of Θ.
Given random variables X1, . . . , Xm and Y1, . . . , Yk with a joint density, the conditional joint density of
Y1, . . . , Yk given X1 = x1, . . . , Xm = xm is defined as
f_{Y1,...,Yk|X1=x1,...,Xm=xm}(y1, . . . , yk) = \frac{f_{X1,...,Xm,Y1,...,Yk}(x1, . . . , xm, y1, . . . , yk)}{f_{X1,...,Xm}(x1, . . . , xm)},
provided x1, . . . , xm are such that f_{X1,...,Xm}(x1, . . . , xm) > 0.
Here are some simple but important properties of conditional joint densities.
fX1 ,...,Xm ,Y1 ,...,Yk (x1 , . . . , xm , y1 , . . . , yk ) = fY1 ,...,Yk |X1 =x1 ,...,Xm =xm (y1 , . . . , yk )fX1 ,...,Xm (x1 , . . . , xm ).
2. The joint density of every set of random variables Y1 , . . . , Yn satisfies the following:
fY1 ,...,Yn (y1 , . . . , yn ) = fYn |Y1 =y1 ,...,Yn−1 =yn−1 (yn )fYn−1 |Y1 =y1 ,...,Yn−1 =yn−2 (yn−1 ) . . . fY2 |Y1 =y1 (y2 )fY1 (y1 ).
4. This can be viewed as a law of total conditional probability: for random variables Y1, . . . , Yk, X1, . . . , Xm
and Θ, we have
f_{Y1,...,Yk|X1=x1,...,Xm=xm}(y1, . . . , yk) = \int f_{Y1,...,Yk|Θ=θ,X1=x1,...,Xm=xm}(y1, . . . , yk) f_{Θ|X1=x1,...,Xm=xm}(θ) dθ.
We shall look at some applications of the above facts in the next class.
As an illustration, suppose U1, . . . , Un are i.i.d Unif(0, 1) random variables and let U(1) ≤ · · · ≤ U(n) denote
their order statistics. What is the conditional joint density of U(1), . . . , U(n−1) given U(n) = u (0 < u < 1)?
By definition,
f_{U(1),...,U(n−1)|U(n)=u}(u1, . . . , un−1) = \frac{f_{U(1),...,U(n)}(u1, . . . , un−1, u)}{f_{U(n)}(u)}.
By the joint distribution of order statistics that we worked out previously, it follows first that the above
quantity is non-zero only when 0 < u1 < · · · < un−1 < u < 1 and it is then equal to
f_{U(1),...,U(n−1)|U(n)=u}(u1, . . . , un−1) = \frac{n!}{n u^{n−1}}.
For the denominator above, we used the fact that U(n) ∼ Beta(n, 1). We have thus proved that
f_{U(1),...,U(n−1)|U(n)=u}(u1, . . . , un−1) = (n − 1)! \left(\frac{1}{u}\right)^{n−1}    for 0 < u1 < · · · < un−1 < u < 1.
Note that the right hand side above is the joint density of the order statistics of (n − 1) i.i.d observations
drawn from the uniform distribution on the interval (0, u). We have therefore proved that, conditioned on
U(n) = u, the joint density of U(1) , . . . , U(n−1) is the same as the joint density of the order statistics of (n − 1)
i.i.d observations drawn from the uniform distribution on (0, u).
Here are some simple but important properties of conditional joint densities.
fX1 ,...,Xm ,Y1 ,...,Yk (x1 , . . . , xm , y1 , . . . , yk ) = fY1 ,...,Yk |X1 =x1 ,...,Xm =xm (y1 , . . . , yk )fX1 ,...,Xm (x1 , . . . , xm ).
2. The joint density of every set of random variables Y1 , . . . , Yn satisfies the following:
fY1 ,...,Yn (y1 , . . . , yn ) = fYn |Y1 =y1 ,...,Yn−1 =yn−1 (yn )fYn−1 |Y1 =y1 ,...,Yn−1 =yn−2 (yn−1 ) . . . fY2 |Y1 =y1 (y2 )fY1 (y1 ).
4. This can be viewed as a law of total conditional probability: for random variables Y1, . . . , Yk, X1, . . . , Xm
and Θ, we have
f_{Y1,...,Yk|X1=x1,...,Xm=xm}(y1, . . . , yk) = \int f_{Y1,...,Yk|Θ=θ,X1=x1,...,Xm=xm}(y1, . . . , yk) f_{Θ|X1=x1,...,Xm=xm}(θ) dθ.
Suppose X1 is a random variable with density f_{X1} and Z2, . . . , Zn are i.i.d N(0, σ²) random variables
independent of X1, and define
Xi = φ X_{i−1} + Zi    for i = 2, . . . , n
where φ is some real number. The process X1, . . . , Xn is called an autoregressive process of order 1. What is
the conditional joint density of X2, . . . , Xn given X1 = x1? What is the joint density of X1, . . . , Xn?
Let us first calculate the conditional joint density of X2 , . . . , Xn given X1 = x1 . For this, write
f_{X2,...,Xn|X1=x1}(x2, . . . , xn) = \prod_{i=2}^{n} f_{Xi|X1=x1,...,Xi−1=xi−1}(xi) (58)
For each i, conditional on X1 = x1, . . . , Xi−1 = xi−1, we have Xi = φ xi−1 + Zi, and Zi is independent of
(X1, . . . , Xi−1), so this conditional distribution is simply the distribution of φ xi−1 + Zi. We were able to remove
the conditioning on X1 = x1, . . . , Xi−1 = xi−1 here because X1, . . . , Xi−1 only depend on X1, Z2, . . . , Zi−1 and
hence are independent of Zi.
From the above chain of assertions, we deduce that
f_{Xi|X1=x1,...,Xi−1=xi−1}(xi) = \frac{1}{\sqrt{2π}σ} \exp\left(−\frac{(xi − φx_{i−1})²}{2σ²}\right)    for i = 2, . . . , n.
Combining with (58), we obtain
f_{X2,...,Xn|X1=x1}(x2, . . . , xn) = \prod_{i=2}^{n} \frac{1}{\sqrt{2π}σ} \exp\left(−\frac{(xi − φx_{i−1})²}{2σ²}\right)
                                 = \left(\frac{1}{\sqrt{2π}σ}\right)^{n−1} \exp\left(−\frac{1}{2σ²} \sum_{i=2}^{n} (xi − φx_{i−1})²\right).
The joint density of X1, . . . , Xn is then
f_{X1,...,Xn}(x1, . . . , xn) = f_{X1}(x1) f_{X2,...,Xn|X1=x1}(x2, . . . , xn)
                            = f_{X1}(x1) \left(\frac{1}{\sqrt{2π}σ}\right)^{n−1} \exp\left(−\frac{1}{2σ²} \sum_{i=2}^{n} (xi − φx_{i−1})²\right).
In a statistical setting, this joint density is used to estimate the parameters φ and σ 2 via maximum likelihood
estimation. For this model however, it is easier to work with the conditional density of X2 , . . . , Xn given
X1 = x1 instead of the full joint density of X1 , . . . , Xn .
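As a small illustration of this, the Python sketch below simulates an AR(1) series (the true values of φ and σ are arbitrary) and maximizes the conditional likelihood of X2, . . . , Xn given X1; maximizing over φ reduces to least squares of Xi on X_{i−1}.

    import numpy as np

    rng = np.random.default_rng(8)
    phi_true, sigma, n = 0.7, 1.0, 5_000
    x = np.empty(n)
    x[0] = rng.normal()
    for i in range(1, n):
        x[i] = phi_true * x[i - 1] + sigma * rng.normal()

    # Conditional MLE of phi: least squares regression of x_i on x_{i-1}.
    phi_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
    sigma2_hat = np.mean((x[1:] - phi_hat * x[:-1]) ** 2)
    print(phi_hat, sigma2_hat)                 # close to 0.7 and 1.0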
Let us now look at the application of the conditional density formulae for the normal prior-normal data
model. Here we first have a random variable Θ that has the N (µ, τ 2 ) distribution. We also have random
variables X1, . . . , Xn+1 such that
X1, . . . , Xn+1 | Θ = θ are i.i.d N(θ, σ²).
In other words, conditional on Θ = θ, the random variables X1, . . . , Xn+1 are i.i.d N(θ, σ²).
Let us first find the conditional distribution of Θ given X1 = x1 , . . . , Xn = xn . The answer to this turns
out to be
Θ|X1 = x1, . . . , Xn = xn ∼ N\left(\frac{n x̄n/σ² + µ/τ²}{n/σ² + 1/τ²}, \frac{1}{n/σ² + 1/τ²}\right) (59)
where x̄n := (x1 + · · · + xn )/n. Let us see why this is true below. Note first that we had solved this problem
for n = 1 in the last class where we proved the following:
Θ ∼ N(µ, τ²), X|Θ = θ ∼ N(θ, σ²) =⇒ Θ|X = x ∼ N\left(\frac{x/σ² + µ/τ²}{1/σ² + 1/τ²}, \frac{1}{1/σ² + 1/τ²}\right), X ∼ N(µ, σ² + τ²).
The result (59) for general n ≥ 1 can actually be deduced from the above result for n = 1. There are two
ways of seeing this.
Method One: We use mathematical induction on n ≥ 1. We already know that (59) is true for n = 1.
Assume that it is true for n and we shall try to prove it for n + 1. The key to this is to note that
(Θ|X1 = x1, . . . , Xn+1 = xn+1) =d (Θ̃|Y = xn+1) (60)
where
Y|Θ̃ = θ ∼ N(θ, σ²) and Θ̃ ∼ (Θ|X1 = x1, . . . , Xn = xn).
In words, (60) states that the posterior of Θ after observing (n+1) observations X1 = x1 , . . . , Xn+1 = xn+1 is
the same as the posterior after observing one observation Y = xn+1 under the prior Θ|X1 = x1 , . . . , Xn = xn .
To formally see why (60) is true, just note that
fΘ|X1 =x1 ,...,Xn =xn ,Xn+1 =xn+1 (θ) ∝ fXn+1 |Θ=θ,X1 =x1 ,...,Xn =xn (xn+1 )fΘ|X1 =x1 ,...,Xn =xn (θ)
= fXn+1 |Θ=θ (xn+1 )fΘ|X1 =x1 ,...,Xn =xn (θ).
The first equality is a consequence of the properties of conditional densities. The second equality above is a
consequence of the fact that Xn+1 is independent of X1 , . . . , Xn conditional on Θ.
The statement (60) allows us to use the result for n = 1 and the induction hypothesis that (59) holds for
n. Indeed, using the n = 1 result for
µ = (nx̄n/σ² + µ/τ²)/(n/σ² + 1/τ²)   and   τ² = 1/(n/σ² + 1/τ²)
and x = xn+1 , we deduce that Θ|X1 = x1 , . . . , Xn+1 = xn+1 is a normal distribution with mean
( xn+1/σ² + [(nx̄n/σ² + µ/τ²)/(n/σ² + 1/τ²)]·(n/σ² + 1/τ²) ) / ( 1/σ² + n/σ² + 1/τ² )
= ( (x1 + · · · + xn+1)/σ² + µ/τ² ) / ( (n+1)/σ² + 1/τ² )
= ( (n+1)x̄n+1/σ² + µ/τ² ) / ( (n+1)/σ² + 1/τ² )
and variance
1/( 1/σ² + n/σ² + 1/τ² ) = 1/( (n+1)/σ² + 1/τ² ).
This proves (59) for n + 1. The proof of (59) is complete by induction.
Method Two. The second method for proving (59) proceeds more directly by writing:
fΘ|X1 =x1 ,...,Xn =xn (θ) ∝ fX1 ,...,Xn |Θ=θ (x1 , . . . , xn )fΘ (θ)
= fX1 |Θ=θ (x1 ) . . . fXn |Θ=θ (xn )fΘ (θ)
∝ exp( −(1/(2σ²)) ∑_{i=1}^n (xi − θ)² ) exp( −(1/(2τ²)) (θ − µ)² )
= exp( −(1/(2σ²)) [ ∑_{i=1}^n (xi − x̄n)² + n(x̄n − θ)² ] ) exp( −(1/(2τ²)) (θ − µ)² )
∝ exp( −(n/(2σ²)) (x̄n − θ)² ) exp( −(1/(2τ²)) (θ − µ)² )
= exp( −(1/(2(σ²/n))) (x̄n − θ)² ) exp( −(1/(2τ²)) (θ − µ)² ).
This now resembles the calculation we did previously for n = 1. The only difference is that x is now replaced by x̄n and σ² is replaced by σ²/n. Therefore the n = 1 result applied with x → x̄n and σ² → σ²/n yields (59). This proves (59).
Let us now compute the conditional density of Xn+1 given X1 = x1 , . . . , Xn = xn . For this, we can use
the law of total conditional probability to write
fXn+1|X1=x1,...,Xn=xn(x) = ∫ fXn+1|Θ=θ,X1=x1,...,Xn=xn(x) fΘ|X1=x1,...,Xn=xn(θ) dθ
= ∫ fXn+1|Θ=θ(x) fΘ|X1=x1,...,Xn=xn(θ) dθ.
This again resembles the calculation of the marginal density of X in the n = 1 problem (where the answer is
X ∼ N (µ, τ 2 + σ 2 )). The only difference is that the prior N (µ, τ 2 ) is now replaced by the posterior density
which is given by (59). We therefore obtain that
(Xn+1 | X1 = x1, . . . , Xn = xn) ∼ N( (nx̄n/σ² + µ/τ²)/(n/σ² + 1/τ²), σ² + 1/(n/σ² + 1/τ²) ).
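As a quick numerical illustration (added here, not taken from the notes), the following Python sketch computes the posterior (59) and the posterior predictive parameters, and checks that updating one observation at a time (the idea behind Method One) agrees with using all observations at once; the prior parameters, σ² and the sample are arbitrary choices.

import numpy as np

def posterior_params(x, mu, tau2, sigma2):
    # Posterior of Theta given X_1 = x_1, ..., X_n = x_n, as in (59): returns (mean, variance)
    n = len(x)
    prec = n / sigma2 + 1 / tau2
    return (n * np.mean(x) / sigma2 + mu / tau2) / prec, 1 / prec

def predictive_params(x, mu, tau2, sigma2):
    # Mean and variance of X_{n+1} given X_1, ..., X_n
    m, v = posterior_params(x, mu, tau2, sigma2)
    return m, sigma2 + v

rng = np.random.default_rng(1)
mu, tau2, sigma2 = 0.0, 4.0, 1.0
theta = rng.normal(mu, np.sqrt(tau2))
x = rng.normal(theta, np.sqrt(sigma2), size=20)

# Sequential updating: treat the current posterior as the prior for the next observation.
m, v = mu, tau2
for xi in x:
    prec = 1 / sigma2 + 1 / v
    m, v = (xi / sigma2 + m / v) / prec, 1 / prec
print(np.allclose((m, v), posterior_params(x, mu, tau2, sigma2)))  # True
print(predictive_params(x, mu, tau2, sigma2))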
36 Conditional Expectation
Given two random variables X and Y , the conditional expectation (or conditional mean) of Y given X = x
is denoted by
E (Y |X = x)
and is defined as the expectation of the conditional distribution of Y given X = x.
We can write
E(Y |X = x) = ∫ y fY|X=x(y) dy if Y is continuous, and E(Y |X = x) = ∑_y y P{Y = y|X = x} if Y is discrete.
More generally,
E(g(Y)|X = x) = ∫ g(y) fY|X=x(y) dy if Y is continuous, and E(g(Y)|X = x) = ∑_y g(y) P{Y = y|X = x} if Y is discrete,
and also
E(g(X, Y)|X = x) = E(g(x, Y)|X = x), which equals ∫ g(x, y) fY|X=x(y) dy if Y is continuous and ∑_y g(x, y) P{Y = y|X = x} if Y is discrete.
The most important fact about conditional expectation is the Law of Iterated Expectation (also known
as the Law of Total Expectation). We shall see this next.
Basically, the law of total expectation tells us how to compute the expectation E(Y) using knowledge of the conditional expectation of Y given X = x:
E(Y) = ∫ E(Y |X = x) fX(x) dx when X is continuous, and E(Y) = ∑_x E(Y |X = x) P{X = x} when X is discrete.
Note the similarity to the law of total probability, which specifies how to compute the marginal distribution of Y using knowledge of the conditional distribution of Y given X = x.
The law of total expectation can be proved as a consequence of the law of total probability. The proof
when Y and X are continuous is given below. The proof in other cases (when one or both of Y and X are
discrete) is similar and left as an exercise.
Proof of the law of total expectation: Assume that Y and X are both continuous. Then
E(Y) = ∫ y fY(y) dy = ∫ y ( ∫ fY|X=x(y) fX(x) dx ) dy = ∫ ( ∫ y fY|X=x(y) dy ) fX(x) dx = ∫ E(Y |X = x) fX(x) dx,
where the second equality used the law of total probability fY(y) = ∫ fY|X=x(y) fX(x) dx.
There is an alternate more succinct form of stating the law of total expectation which justifies calling the
law of iterated expectation. We shall see this next. Note that E(Y |X = x) depends on x. In other words,
E(Y |X = x) is a function of x. Let us denote this function by h(·):
h(x) := E(Y |X = x).
If we now apply this function to the random variable X, we obtain a new random variable h(X). This random variable is denoted simply by E(Y |X) i.e.,
E(Y |X) := h(X).
Note that when X is discrete, the expectation of this random variable E(Y |X) becomes
E(E(Y |X)) = E(h(X)) = ∑_x h(x) P{X = x} = ∑_x E(Y |X = x) P{X = x}.
Observe that the right hand side in these expectations is precisely the right hand side of the law of total expectation. Therefore the law of total expectation can be rephrased as
E(Y) = E(E(Y |X)).
Because there are two expectations on the right hand side, the law of total expectation is also known as the Law of Iterated Expectation.
The law of iterated expectation has many applications. A couple of simple examples are given below
following which we shall explore applications to risk minimization.
Example 36.1. Consider a stick of length ℓ. Break it at a point X chosen uniformly at random across the length of the stick, and keep the piece (0, X). Then break this piece again at a point Y chosen uniformly at random across its length. The final piece has length Y, and we are required to calculate E(Y). Note first that Y |X = x ∼ Unif(0, x), so E(Y |X = x) = x/2 for every x, which means that E(Y |X) = X/2. Hence by the Law of Iterated Expectation,
E(Y) = E(E(Y |X)) = E(X/2) = ℓ/4.
In the last class, we looked at the law of iterated expectation (or the Law of Total Expectation) which stated
that
E(Y ) = E(E(Y |X)).
On the right hand side above, E(Y |X) is the random variable obtained by applying the function h(x) :=
E(Y |X = x) to the random variable X (i.e., E(Y |X) = h(X)).
The law of iterated expectation has many applications. A couple of simple examples are given below
following which we shall explore applications to risk minimization.
Example 36.2. Suppose X, Y, Z are i.i.d Unif(0, 1) random variables. What is the value of P{X ≤ Y Z}?
P{X ≤ Y Z} = E(I{X ≤ Y Z}) = E[E(I{X ≤ Y Z}|Y Z)] = E(Y Z) = E(Y)E(Z) = 1/4,
where we used E(I{X ≤ Y Z}|Y Z) = Y Z, which holds because X ∼ Unif(0, 1) is independent of Y Z.
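A short Monte Carlo check of this calculation (a sketch I am adding; the seed and sample size are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
n = 10**6
x, y, z = rng.uniform(size=(3, n))
print(np.mean(x <= y * z))   # close to 0.25
# Inner conditional expectation: E(I{X <= YZ} | YZ) = YZ because X ~ Unif(0, 1) is independent of YZ.
print(np.mean(y * z))        # also close to E(Y)E(Z) = 0.25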
Example 36.3 (Sum of a random number of i.i.d random variables). Suppose X1, X2, . . . are i.i.d random variables with E(Xi) = µ. Suppose also that N is a discrete random variable that takes values in {1, 2, . . . } and that is independent of X1, X2, . . . . Define
S := X1 + X2 + · · · + XN.
In other words, S is the sum of a random number (N) of the random variables Xi. The law of iterated expectation can be used to compute the expectation of S as follows: E(S|N = n) = E(X1 + · · · + Xn) = nµ, so E(S|N) = Nµ and hence
E(S) = E(E(S|N)) = E(Nµ) = µE(N).
This fact is actually a special case of a general result called Wald's identity.
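The identity E(S) = µE(N) is easy to check by simulation. In the sketch below (my own illustrative choices), N ∼ 1 + Poisson(4), so that N takes values in {1, 2, . . .}, and the Xi are Exponential with mean µ = 2, so µE(N) = 10.

import numpy as np

rng = np.random.default_rng(3)
mu, trials = 2.0, 10**5
n_vals = 1 + rng.poisson(4, size=trials)                                  # N, independent of the X_i
s_vals = np.array([rng.exponential(mu, size=n).sum() for n in n_vals])    # S = X_1 + ... + X_N
print(s_vals.mean(), mu * n_vals.mean())                                  # both close to 10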
The law of iterated expectation has important applications to statistical risk minimization problems. The simplest of these problems is the following.
Problem 1: Given two random variables X and Y , what is the function g∗(X) of X that minimizes
R(g) := E(g(X) − Y)²
over all functions g? The resulting random variable g∗(X) can be called the Best Predictor of Y as a function of X in terms of expected squared error.
To solve this, use the law of iterated expectation to write R(g) = E[E((g(X) − Y)² | X)]. In the inner expectation we can treat X as a constant, so for each x we need the value c = g(x) that minimizes E[(c − Y)² | X = x], and this value
is simply
g∗(x) = E(Y |X = x).
This is because E(Z − c)² is minimized as c varies over R at c∗ = E(Z). We have thus proved that the function g∗(X) which minimizes R(g) over all functions g is given by
g∗(X) = E(Y |X).
Thus the function of X which is closest to Y in terms of expected squared error is given by the conditional mean E(Y |X).
Problem 2: Given two random variables X and Y , what is the function g∗(X) of X that minimizes
R(g) := E|g(X) − Y|
over all functions g? The resulting random variable g∗(X) can be called the Best Predictor of Y as a function of X in terms of expected absolute error.
Arguing as before, write R(g) = E[E(|Y − g(X)| | X)]. For each x, the value c = g(x) that minimizes
E[|Y − c| |X = x]
is simply given by any conditional median of Y given X = x. This is because E|Z − c| is minimized as c varies over R at any median of Z. To see this, assume that Z has a density f and write
E|Z − c| = ∫ |z − c| f(z) dz = ∫_{−∞}^{c} (c − z) f(z) dz + ∫_{c}^{∞} (z − c) f(z) dz
= c ∫_{−∞}^{c} f(z) dz − ∫_{−∞}^{c} z f(z) dz + ∫_{c}^{∞} z f(z) dz − c ∫_{c}^{∞} f(z) dz.
Differentiating with respect to c gives
(d/dc) E|Z − c| = ∫_{−∞}^{c} f(z) dz − ∫_{c}^{∞} f(z) dz = P{Z ≤ c} − P{Z > c}.
Therefore, when c is a median, the derivative of E|Z − c| equals zero. This shows that c ↦ E|Z − c| is minimized when c is a median of Z.
We have thus shown that the function g∗(x) which minimizes R(g) over all functions g is given by any conditional median of Y given X = x. Thus a conditional median of Y given X = x is the function of X that is closest to Y in terms of expected absolute error.
Problem 3: Suppose Y is a binary random variable taking the values 0 and 1 and let X be an arbitrary random variable. What is the function g∗(X) of X that minimizes
R(g) := P{Y ≠ g(X)}
over all functions g? To solve this, again use the law of iterated expectation to write
R(g) = E(I{Y ≠ g(X)}) = E[E(I{Y ≠ g(X)} |X)].
In the inner expectation above, we can treat X as a constant so that the problem is similar to minimizing
P{Z 6= c} over c ∈ R for a binary random variable Z. It is easy to see that P{Z 6= c} is minimized at c∗
where
c∗ = 1 if P{Z = 1} > P{Z = 0}, and c∗ = 0 if P{Z = 1} < P{Z = 0}.
In case P{Z = 1} = P{Z = 0}, we can take c∗ to be either 0 or 1. From here it can be deduced (via the law of iterated expectation) that the function g∗(X) which minimizes P{Y ≠ g(X)} is given by
g∗(x) = 1 if P{Y = 1|X = x} > P{Y = 0|X = x}, and g∗(x) = 0 if P{Y = 1|X = x} < P{Y = 0|X = x}.
Problem 4: Suppose again that Y is binary taking the values 0 and 1 and let X be an arbitrary random variable. Let W1 and W0 be two positive weights. What is the function g∗(X) of X that minimizes
R(g) := W1 P{Y = 1, g(X) = 0} + W0 P{Y = 0, g(X) = 1}
over all functions g? Using an argument similar to the previous problems, one can deduce that the following function minimizes R(g):
g∗(x) = 1 if W1 P{Y = 1|X = x} > W0 P{Y = 0|X = x}, and g∗(x) = 0 if W1 P{Y = 1|X = x} < W0 P{Y = 0|X = x}.
The argument (via the law of iterated expectation) used in the above four problems can be summarized as follows. The function g∗ which minimizes R(g) = E[L(Y, g(X))] for a loss function L is obtained by setting g∗(x) equal to the value c that minimizes the inner conditional expectation E[L(Y, c) |X = x], separately for each x.
37 Conditional Variance
Given two random variables Y and X, the conditional variance of Y given X = x is defined as the variance
of the conditional distribution of Y given X = x. More formally,
Var(Y |X = x) := E[ (Y − E(Y |X = x))² | X = x ] = E(Y²|X = x) − (E(Y |X = x))².
Like conditional expectation, the conditional variance V ar(Y |X = x) is also a function of x. We can apply
this function to the random variable X to obtain a new random variable which we denote by V ar(Y |X).
Note that
Var(Y |X) = E(Y²|X) − (E(Y |X))².    (61)
Analogous to the Law of Total Expectation, there is a Law of Total Variance as well. This formula says that
Var(Y) = E(Var(Y |X)) + Var(E(Y |X)).
For example, in the normal prior-normal data model where Θ ∼ N(µ, τ²) and X|Θ = θ ∼ N(θ, σ²), these two laws give
E(X) = E(E(X|Θ)) = E(Θ) = µ
and
Var(X) = E(Var(X|Θ)) + Var(E(X|Θ)) = E(σ²) + Var(Θ) = σ² + τ².
Example 37.2 (Sum of a random number of i.i.d random variables). Suppose X1, X2, . . . are i.i.d random variables with E(Xi) = µ and Var(Xi) = σ² < ∞. Suppose also that N is a discrete random variable that takes values in {1, 2, . . . } and that is independent of X1, X2, . . . . Define
S := X1 + X2 + · · · + XN.
We have seen previously that E(S) = µE(N). The law of total variance can be used to compute Var(S): since E(S|N) = Nµ and Var(S|N) = Nσ², we get
Var(S) = E(Var(S|N)) + Var(E(S|N)) = σ²E(N) + µ²Var(N).
38 Random Vectors
In this section, we view a finite number of random variables as a random vector and go over some basic
formulae for the mean and covariance of random vectors.
A random vector is a vector whose entries are random variables. Let Y = (Y1 , . . . , Yn )T be a random
vector. Its expectation EY is defined as a vector whose ith entry is the expectation of Yi i.e., EY =
(EY1 , EY2 , . . . , EYn )T .
The covariance matrix of Y , denoted by Cov(Y ), is an n × n matrix whose (i, j)th entry is the covariance
between Yi and Yj .
1. The diagonal entries of Cov(Y ) are the variances of Y1 , . . . , Yn . More specifically the (i, i)th entry of
the matrix Cov(Y ) equals var(Yi ).
2. Cov(Y ) is a symmetric matrix i.e., the (i, j)th entry of Cov(Y ) equals the (j, i) entry. This follows
because Cov(Yi , Yj ) = Cov(Yj , Yi ).
39 Random Vectors
In the last class, we started looking at random vectors. A p × 1 random vector Y simply consists of p random
variables Y1 , . . . , Yp i.e., Y = (Y1 , . . . , Yp )T .
The mean vector Y is given by EY = (EY1 , . . . , EYp )T and the covariance matrix of Y is given by the
p × p matrix whose (i, j)th entry equals the covariance between Yi and Yj . Note that the diagonal entries of
Cov(Y ) are the variances of Y1 , . . . , Yp . Also note that Cov(Y ) is a symmetric matrix.
We also looked at the following two important formulae in the last class:
1. E(AY + c) = AE(Y ) + c for every deterministic matrix A and every deterministic vector c.
2. Cov(AY + c) = ACov(Y )AT for every deterministic matrix A and every deterministic vector c.
Example 39.1 (White Noise). Random variables Z1 , . . . , Zp are said to form white noise if they have mean
zero, variance one and if they are uncorrelated. Let Z be the random vector with components Z1 , . . . , Zp .
Then it is clear that the components of Z are white noise if and only if EZ = 0 and Cov(Z) = Ip (here Ip is
the p × p identity matrix).
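The two formulae above and the white noise example are easy to verify numerically. The following sketch (with an arbitrary matrix A, vector c and a large sample, all my own choices) checks that the sample covariance of AZ + c is close to A Cov(Z) Aᵀ = AAᵀ when Z is white noise.

import numpy as np

rng = np.random.default_rng(4)
p, n = 3, 200_000
Z = rng.standard_normal((p, n))           # columns are draws of a white noise vector: EZ = 0, Cov(Z) = I_p
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
c = np.array([[5.0], [7.0]])
Y = A @ Z + c
print(np.round(np.cov(Y), 2))             # close to A @ A.T
print(A @ A.T)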
Because variance is always nonnegative, this means that Cov(Y ) satisfies the following property:
aT Cov(Y )a = var(aT Y ) ≥ 0 for every p × 1 vector a. (62)
Now recall the following definition from linear algebra:
Definition 39.2. Let Σ denote a p × p symmetric matrix. We say that Σ is positive semi-definite if aᵀΣa ≥ 0 for every p × 1 vector a, and positive definite if aᵀΣa > 0 for every p × 1 vector a ≠ 0.
From this definition and the fact (62), it follows that the covariance matrix Cov(Y ) of every random
vector Y is symmetric and positive semi-definite.
However Cov(Y) is not necessarily positive definite. To see this, just take p = 2 and Y = (Y1, −Y1)ᵀ for a random variable Y1. Then with a = (1, 1)ᵀ, it is easy to see that aᵀCov(Y)a = Var(aᵀY) = Var(Y1 + Y2) = 0.
If Cov(Y) is not positive definite, then there exists a ≠ 0 such that Var(aᵀY) = aᵀCov(Y)a = 0. This necessarily means that aᵀ(Y − µ) = 0 with probability one, where µ = E(Y). In other words, the random variables Y1, . . . , Yp satisfy a linear equation. We can therefore say that Cov(Y) is positive definite if and only if the random variables Y1, . . . , Yp do not satisfy a (non-trivial) linear equation with probability one.
Since Cov(Y ) is a symmetric and positive semi-definite matrix, some standard facts about such matrices
are useful while working with covariance matrices. In particular, we shall make some use of the spectral
theorem for symmetric matrices. Before looking at the spectral theorem, we need to recall the notion of an
orthonormal basis.
A collection of vectors u1, . . . , up in Rᵖ is called an orthonormal basis of Rᵖ if ⟨ui, uj⟩ = 0 whenever i ≠ j and ⟨ui, ui⟩ = 1 for every i. The simplest example of an orthonormal basis is e1, . . . , ep where ei is the vector that has 1 in the ith position and 0 at all other positions. Orthonormal bases satisfy the following properties.
1. u1 , . . . , up are linearly independent and therefore form a basis of Rp (this explains the presence of the
word “basis” in the definition of orthonormal basis).
To see why this is true, suppose that
α1 u1 + · · · + αp up = 0 for some α1 , . . . , αp . (63)
Taking the dot product of both sides of the above equality with uj (for a fixed j), we get
0 = ⟨uj, ∑_{i=1}^p αi ui⟩ = ∑_{i=1}^p αi ⟨uj, ui⟩ = αj
because huj , ui i is non-zero only when i = j and huj , uj i = 1. Thus (63) implies that αj = 0 for every
j = 1, . . . , p and thus u1 , . . . , up are linearly independent and consequently form a basis of Rp .
2. The following formula holds for every vector x ∈ Rp :
x = ∑_{i=1}^p ⟨x, ui⟩ ui.    (64)
To see why this is true, note first that the previous property implies that u1 , . . . , up form a basis of Rp
so that every x ∈ Rp can be written as a linear combination
x = β1 u1 + · · · + βp up
of u1 , . . . , up . Now take dot product with uj on both sides to prove that βj = hx, uj i.
3. The formula
u1 uT1 + · · · + up uTp = Ip (65)
holds where Ip is the p × p identity matrix. To see why this is true, note that (64) can be rewritten as
x = ∑_{i=1}^p ui ⟨x, ui⟩ = ∑_{i=1}^p ui uiᵀ x = ( ∑_{i=1}^p ui uiᵀ ) x.
Since both sides of the above identity are equal for every x, we must have (65).
4. Suppose U is the p × p matrix whose columns are the vectors u1 , . . . , up . Then
U T U = U U T = Ip .
To see why this is true, note that the (i, j)th entry of U T U equals uTi uj which (by definition of
orthonormal basis) is zero when i 6= j and 1 otherwise. On the other hand, the statement U U T = Ip is
the same as (65).
5. For every vector x ∈ Rᵖ, the formula
‖x‖² = ∑_{i=1}^p ⟨x, ui⟩²
holds. This follows by taking the squared norm of both sides of (64) and using the orthonormality of u1, . . . , up.
Theorem 40.2 (Spectral Theorem). Suppose Σ is a p × p symmetric matrix. Then there exists an orthonormal basis u1, . . . , up of Rᵖ and real numbers λ1, . . . , λp such that
Σ = λ1 u1u1ᵀ + λ2 u2u2ᵀ + · · · + λp upupᵀ.    (66)
The spectral theorem is also usually written in the following alternative form. Suppose U is the p × p
matrix whose columns are the vectors u1 , . . . , up . Also suppose that Λ is the p×p diagonal matrix (a diagonal
matrix is a matrix whose off-diagonal entries are all zero) whose diagonal entries are λ1 , . . . , λp . Then (66)
is equivalent to
Σ = U ΛU T and U T ΣU = Λ.
1. For every 1 ≤ j ≤ p, we have the identities
Σuj = λj uj   and   ujᵀΣuj = λj.
These follow directly from (66). The first identity above implies that each λj is an eigenvalue of Σ with eigenvector uj.
2. In the representation (66), the eigenvalues λ1 , . . . , λp are unique while the eigenvectors u1 , . . . , up are
not necessarily unique (for every uj can be replaced by −uj and if λ1 = λ2 , then u1 and u2 can be
replaced by any pair ũ1 , ũ2 of orthogonal unit norm vectors in the span of u1 and u2 ).
3. The rank of Σ precisely equals the number of λj's that are non-zero.
4. If all of λ1, . . . , λp are non-zero, then Σ has full rank and is hence invertible. Moreover, we can then write
Σ⁻¹ = λ1⁻¹ u1u1ᵀ + λ2⁻¹ u2u2ᵀ + · · · + λp⁻¹ upupᵀ.
5. If Σ is positive semi-definite, then every λj in (66) is nonnegative (this is a consequence of λj = uTj Σuj ≥
0).
6. Square Root of a Positive Semi-definite Matrix: If Σ is positive semi-definite, then we can define a new matrix
Σ^{1/2} := √λ1 u1u1ᵀ + · · · + √λp upupᵀ.
It is easy to see that Σ^{1/2} is symmetric, positive semi-definite and satisfies (Σ^{1/2})(Σ^{1/2}) = Σ. We shall refer to Σ^{1/2} as the square root of Σ (a short numerical sketch of this construction is given after this list).
7. If Σ is positive definite, then every λj in (66) is strictly positive (this is a consequence of λj = uTj Σuj >
0).
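Here is a minimal numerical sketch of the square-root construction via the spectral decomposition (np.linalg.eigh); the matrix Σ below is an arbitrary positive definite example and the helper name is mine.

import numpy as np

def sqrtm_psd(sigma):
    # Sigma^{1/2} = sum_j sqrt(lambda_j) u_j u_j^T from the spectral decomposition of a symmetric PSD matrix
    lam, U = np.linalg.eigh(sigma)
    lam = np.clip(lam, 0.0, None)         # guard against tiny negative eigenvalues from rounding
    return U @ np.diag(np.sqrt(lam)) @ U.T

sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])
root = sqrtm_psd(sigma)
print(np.allclose(root @ root, sigma))    # True: (Sigma^{1/2})(Sigma^{1/2}) = Sigma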
We have seen previously that the covariance matrix Cov(Y ) of every random vector Y is symmetric and
positive semi-definite. It turns out that the converse of this statement is also true i.e., it is also true that
every symmetric and positive semi-definite matrix equals Cov(Y ) for some random vector Y . To see why this
is true, suppose that Σ is an arbitrary p × p symmetric and positive semi-definite matrix. Recall that, via
the spectral theorem, we have defined Σ1/2 (square-root of Σ) which is a symmetric and nonnegative definite
matrix such that Σ1/2 Σ1/2 = Σ.
Now suppose that Z1 , . . . Zp are uncorrelated random variables all having unit variance and let Z =
(Z1 , . . . , Zp )T be the corresponding random vector. Because Z1 , . . . , Zp are uncorrelated and have unit
variance, it is easy to see that Cov(Z) = Ip. Suppose now that Y = Σ^{1/2}Z. Then
Cov(Y) = Cov(Σ^{1/2}Z) = Σ^{1/2}Cov(Z)(Σ^{1/2})ᵀ = Σ^{1/2}Σ^{1/2} = Σ.
We have thus started with an arbitrary positive semi-definite matrix Σ and proved that it equals Cov(Y) for some random vector Y.
40.3.2 Whitening
Given a p × 1 random vector Y , how can we transform it into a p × 1 white noise vector Z (recall that Z being white noise means that EZ = 0 and Cov(Z) = Ip)? This transformation is known as Whitening. Whitening
can be done if Cov(Y ) is positive definite. Indeed suppose that Σ := Cov(Y ) is positive definite with spectral
representation:
Σ = λ1 u1 uT1 + λ2 u2 uT2 + · · · + λp up uTp .
The fact that Σ is positive definite implies that λi > 0 for every i = 1, . . . , p. In that case, it is easy to see that Σ^{1/2} is invertible and
Σ^{−1/2} := (Σ^{1/2})⁻¹ = λ1^{−1/2} u1u1ᵀ + · · · + λp^{−1/2} upupᵀ.
Moreover it is easy to check that Σ^{−1/2}ΣΣ^{−1/2} = Ip. Using this, it is straightforward to check that Z = Σ^{−1/2}(Y − EY) is white noise. Indeed EZ = 0 and
Cov(Z) = Cov( Σ^{−1/2}(Y − EY) ) = Σ^{−1/2}Cov(Y)Σ^{−1/2} = Σ^{−1/2}ΣΣ^{−1/2} = Ip.
Therefore the spectral theorem is used to define the matrix (Cov(Y ))−1/2 which can be used to whiten the
given random vector Y .
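A hedged numerical sketch of whitening follows (the mixing matrix B, the mean vector and the sample size are arbitrary choices of mine; the covariance is estimated from the sample rather than assumed known):

import numpy as np

rng = np.random.default_rng(5)
n = 100_000
B = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.3, 0.7]])
Y = B @ rng.standard_normal((3, n)) + np.array([[1.0], [2.0], [3.0]])   # correlated, non-centered data

sigma = np.cov(Y)
lam, U = np.linalg.eigh(sigma)
sigma_inv_half = U @ np.diag(lam ** -0.5) @ U.T                # (Cov Y)^{-1/2}
Z = sigma_inv_half @ (Y - Y.mean(axis=1, keepdims=True))       # whitened data
print(np.round(np.cov(Z), 2))                                  # close to the identity matrix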
Let Y be a p × 1 random vector. We say that a unit vector a ∈ Rᵖ (unit vectors are vectors with norm equal to one) is a first principal component of Y if
Var(aᵀY) ≥ Var(bᵀY) for every unit vector b ∈ Rᵖ.
In other words, the unit vector a maximizes the variance of bᵀY over all unit vectors b.
Suppose that Σ := Cov(Y) has the spectral representation (66). Assume, without loss of generality, that the eigenvalues λ1, . . . , λp appearing in (66) are arranged in decreasing order i.e.,
λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0.
It then turns out that the vector u1 is a first principal component of Y. To see this, simply note that, for every unit vector b,
Var(bᵀY) = bᵀΣb = ∑_{i=1}^p λi ⟨b, ui⟩² ≤ λ1 ∑_{i=1}^p ⟨b, ui⟩² = λ1 ‖b‖² = λ1 = Var(u1ᵀY).
Consider random variables Y, X1 , . . . , Xp that have finite variance. We want to predict Y on the basis of
X1 , . . . , Xp . Given a predictor g(X1 , . . . , Xp ) of Y based on X1 , . . . , Xp , we measure the accuracy of prediction
by
R(g) := E(Y − g(X1, . . . , Xp))².
We have seen in the last class that the best predictor (i.e., the function g ∗ which minimizes R(g)) is given
by the conditional expectation:
g ∗ (x1 , . . . , xp ) := E (Y |X1 = x1 , . . . , Xp = xp ) .
This conditional expectation is often quite a complicated quantity. For example, in practice to estimate it,
one would need quite a lot of data on the variables X1 , . . . , Xp , Y .
We now consider a related problem of predicting Y based only on linear functions of X1 , . . . , Xp . Specif-
ically, we consider predictions of the form β0 + β1 X1 + · · · + βp Xp = β0 + β T X (where β := (β1 , . . . , βp )T and
X = (X1 , . . . , Xp )T ). The Best Linear Predictor (BLP) of Y in terms of X1 , . . . , Xp is the linear function
β0∗ + β1∗ X1 + · · · + βp∗ Xp = β0∗ + (β ∗ )T X with β ∗ := (β1∗ , . . . , βp∗ )T
where β0∗ , . . . , βp∗ minimize
L(β0, . . . , βp) = E(Y − β0 − β1X1 − · · · − βpXp)²
over β0 , β1 , . . . , βp .
One can get an explicit formula for β0∗ and β ∗ by minimizing L directly via calculus. Taking partial
derivatives with respect to β0 , β1 , . . . , βp and setting them equal to zero, we obtain the following equations:
E(Y − β0∗ − β1∗ X1 − · · · − βp∗ Xp ) = 0 (67)
and
E(Y − β0∗ − β1∗ X1 − · · · − βp∗ Xp )Xi = 0 for i = 1, . . . , p. (68)
The first equation above implies that Y − β0∗ − β1∗ X1 − ··· − βp∗ Xp is a mean zero random variable. Using
this, we can rewrite the second equation as
Cov(Y − β0∗ − β1∗ X1 − · · · − βp∗ Xp , Xi ) = 0 for i = 1, . . . , p
which is same as
Cov(Y − β1∗ X1 − · · · − βp∗ Xp , Xi ) = 0 for i = 1, . . . , p. (69)
Rearranging the above, we obtain
∑_{j=1}^p βj∗ Cov(Xj, Xi) = Cov(Y, Xi) for i = 1, . . . , p.
As mentioned previously, the Best Predictor (BP) of Y in terms of X1 , . . . Xp is the function g ∗ (X1 , . . . , Xp )
of X1 , . . . , Xp which minimizes
L(g) := E(Y − g(X1, . . . , Xp))²
over all functions g, and we have seen that
g∗(X1, . . . , Xp) = E(Y |X1, . . . , Xp).
In other words, the best predictor is the conditional expectation. In general, the BP and BLP will differ and
the BP will be a more accurate predictor of Y compared to BLP. Note that the BLP only depends on the
variances and covariances between the random variables Y, X1 , . . . , Xp while the BP depends potentially on
the entire joint distribution. As a result, the BLP is usually much easier to estimate based on data compared
to the BP.
In general, we shall refer to any quantity involving the distribution of Y, X1 , . . . , Xp that depends only
on the mean, variances and covariances of Y, X1 , . . . , Xp as a second order property. Note that the BLP is a
second order quantity while the BP is not.
In the last class, we looked at the problem of predicting a random variable Y based on linear functions of
other random variables X1 , . . . , Xp . Specifically, we consider predictions of the form β0 +β1 X1 +· · ·+βp Xp =
β0 + β T X (where β := (β1 , . . . , βp )T and X = (X1 , . . . , Xp )T ). The Best Linear Predictor (BLP) of Y in
terms of X1, . . . , Xp is the linear function β0∗ + β1∗X1 + · · · + βp∗Xp where β0∗, . . . , βp∗ minimize
L(β0, β1, . . . , βp) := E(Y − β0 − β1X1 − · · · − βpXp)²    (71)
over β0, β1, . . . , βp.
Taking derivatives of (71) with respect to β0, . . . , βp and setting them equal to zero, we observed that β0∗, . . . , βp∗ satisfy the equations:
E(Y − β0∗ − β1∗X1 − · · · − βp∗Xp) = 0    (72)
and
Cov(Y − β1∗ X1 − · · · − βp∗ Xp , Xi ) = 0 for i = 1, . . . , p. (73)
The equations in (73) can be written succinctly in matrix notation as
Cov(X) β∗ = Cov(X, Y).
Here Cov(X, Y) is the p × 1 vector with entries Cov(X1, Y), . . . , Cov(Xp, Y). The above equation gives
β∗ = (Cov(X))⁻¹ Cov(X, Y).
This determines β1∗, . . . , βp∗. We can then use (72) to write β0∗ as
β0∗ = E(Y) − β1∗E(X1) − · · · − βp∗E(Xp) = E(Y) − Cov(Y, X)(Cov(X))⁻¹E(X).
Note that the term Cov(Y, X) appearing above is the transpose of Cov(X, Y ). More generally, given two
random vectors W = (W1 , . . . , Wp ) and Z = (Z1 , . . . , Zq ), we define Cov(W, Z) to be the p × q matrix whose
(i, j)th entry is the covariance between Wi and Zj .
The Best Linear Predictor (BLP) of Y in terms of X1, . . . , Xp then equals
BLP = β0∗ + β1∗X1 + · · · + βp∗Xp = E(Y) + Cov(Y, X)(Cov(X))⁻¹(X − E(X)).    (74)
Note the following important properties of the BLP.
1. The BLP solves equations (72) and (73). These equations are called normal equations.
2. If Cov(X) is invertible (equivalently, positive definite), then the BLP is uniquely given by (74).
3. Y − BLP has mean zero (because of (72)) and Y − BLP is uncorrelated with each Xi , i = 1, . . . , p
(because of (73)). In fact, this property characterizes the BLP (see next).
4. If Cov(X) is invertible, then it is clear from the form of the normal equations that the BLP is the
unique linear combination of X1 , . . . , Xp such that Y − BLP has mean zero and is uncorrelated with
X1 , . . . , Xp .
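These formulae translate directly into code. The sketch below (the second order quantities are made up for illustration and blp_coefficients is my own helper name) computes β∗ and β0∗ and checks the normal equations.

import numpy as np

def blp_coefficients(mean_x, cov_x, mean_y, cov_xy):
    # Best linear predictor of Y given X: BLP = beta0 + beta^T X (assumes Cov(X) invertible)
    beta = np.linalg.solve(cov_x, cov_xy)      # beta* = Cov(X)^{-1} Cov(X, Y)
    beta0 = mean_y - beta @ mean_x             # from E(Y - beta0 - beta^T X) = 0
    return beta0, beta

mean_x = np.array([1.0, -2.0])
cov_x = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
cov_xy = np.array([1.0, 0.3])                  # entries Cov(X_i, Y)
mean_y = 4.0
beta0, beta = blp_coefficients(mean_x, cov_x, mean_y, cov_xy)
print(np.allclose(cov_x @ beta, cov_xy), beta0, beta)   # normal equations hold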
Example 42.1 (The case p = 1). When p = 1, the random vector X has only element X1 so that Cov(X)
is just equal to the number V ar(X1 ). In that case, the BLP of Y in terms of X1 is given by
BLP = E(Y) + (Cov(Y, X1)/Var(X1)) (X1 − E(X1)).
In the further special case when Var(Y) = Var(X1) and E(Y) = E(X1) = 0, we have
β1∗ = ρY,X1 and β0∗ = 0, so that BLP = ρY,X1 X1.
Example 42.2. Suppose X1, X2, Z3, . . . , Zn, Zn+1 are uncorrelated random variables having mean zero. Define random variables X3, . . . , Xn+1 as
Xt = φ1Xt−1 + φ2Xt−2 + Zt for t = 3, . . . , n + 1.
By definition,
Xn+1 = φ1Xn + φ2Xn−1 + Zn+1
which means that Xn+1 − φ1Xn − φ2Xn−1 = Zn+1. It is now easy to see that each Xt depends only on X1, X2, Z3, . . . , Zt for t ≥ 3, which implies that Zn+1 is uncorrelated with all of X1, . . . , Xn. Since Xn+1 − φ1Xn − φ2Xn−1 also has mean zero, the characterization of the BLP given above implies that the BLP of Xn+1 in terms of X1, . . . , Xn equals φ1Xn + φ2Xn−1.
As mentioned previously, the Best Predictor (BP) of Y in terms of X1 , . . . Xp is the function g ∗ (X1 , . . . , Xp )
of X1 , . . . , Xp which minimizes
L(g) := E(Y − g(X1, . . . , Xp))²
over all functions g, and we have seen that
g∗(X1, . . . , Xp) = E(Y |X1, . . . , Xp).
In other words, the best predictor is the conditional expectation. In general, the BP and BLP will differ and
the BP will be a more accurate predictor of Y compared to BLP. Note that the BLP only depends on the
variances and covariances between the random variables Y, X1 , . . . , Xp while the BP depends potentially on
the entire joint distribution. As a result, the BLP is usually much easier to estimate based on data compared
to the BP.
In general, we shall refer to any quantity involving the distribution of Y, X1 , . . . , Xp that depends only
on the mean, variances and covariances of Y, X1 , . . . , Xp as a second order property. Note that the BLP is a
second order quantity while the BP is not.
43 Residual
The residual of a random variable Y in terms of X1 , . . . , Xp will be denoted by rY |X1 ,...,Xp and defined as
the difference between Y and the BLP of Y in terms of X1, . . . , Xp. In other words,
rY|X1,...,Xp := Y − BLP.
Using the formula for the BLP, we can write down the following formula for the residual:
rY|X1,...,Xp = Y − E(Y) − Cov(Y, X)(Cov(X))⁻¹(X − E(X)).    (75)
The residual has mean zero and is uncorrelated with each of X1 , . . . , Xp . This can be proved either
directly from the formula (75) or from the properties of the BLP.
The variance of the residual can be calculated from the formula (75) as follows:
Var(rY|X1,...,Xp) = Var(Y) − Cov(Y, X)(Cov(X))⁻¹Cov(X, Y).
In other words, Var(rY|X1,...,Xp) equals the Schur complement (recalled next) of Var(Y) in the covariance matrix
[ Cov(X)   Cov(X, Y) ; Cov(Y, X)   Var(Y) ].
Consider an n × n matrix A partitioned into blocks as
A = [ E   F ; G   H ]
where E is p × p, F is p × q, G is q × p and H is q × q (p and q are such that p + q = n).
We define
E S := E − F H −1 G and H S := H − GE −1 F
assuming that H −1 and E −1 exist. We shall refer to E S and H S as the Schur complements of E and H
respectively (Warning: This is not standard terminology; it is more common to refer to E S as the Schur
complement of H and to H S as the Schur complement of E. I find it more natural to think of E S as the
Schur complement of E and H S as the Schur complement of H).
Schur complements arise in many formulas of matrix algebra and statistics. Feel free to see the monograph titled Schur Complements and Statistics by Diane Ouellette for proofs and an exposition of such results (this is not really necessary for this course).
But one very important property of Schur Complements for our purpose is the fact that they arise
naturally in inverses of partitioned matrices. A standard formula for the inverse of a partitioned matrix (see,
for example, https://en.wikipedia.org/wiki/Block_matrix#Block_matrix_inversion) is
A⁻¹ = [ (E^S)⁻¹   −E⁻¹F(H^S)⁻¹ ; −(H^S)⁻¹GE⁻¹   (H^S)⁻¹ ].    (76)
It must be noted from this formula that the first (or (1, 1)th ) block of A−1 equals the inverse of the
Schur complement of the first block of A. Similarly, the last (or (2, 2)th ) block of A−1 equals
the inverse of the Schur complement of the last block of A.
We shall use the expression (76) for the inverse of the partitioned matrix A but we will not see how to
prove (76). You can find many proofs of this fact elsewhere (just google something like “inverse of partitioned
matrices”).
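The block-inversion formula (76) is easy to verify numerically. In the sketch below A is a randomly generated positive definite matrix (so that E, H and A are all invertible); the partition sizes are arbitrary choices of mine.

import numpy as np

rng = np.random.default_rng(6)
p, q = 3, 2
M = rng.standard_normal((p + q, p + q))
A = M @ M.T + (p + q) * np.eye(p + q)          # a positive definite (hence invertible) matrix
E, F = A[:p, :p], A[:p, p:]
G, H = A[p:, :p], A[p:, p:]

E_S = E - F @ np.linalg.inv(H) @ G             # Schur complement E^S
H_S = H - G @ np.linalg.inv(E) @ F             # Schur complement H^S
A_inv = np.linalg.inv(A)
print(np.allclose(A_inv[:p, :p], np.linalg.inv(E_S)))   # (1,1) block of A^{-1} is (E^S)^{-1}
print(np.allclose(A_inv[p:, p:], np.linalg.inv(H_S)))   # (2,2) block of A^{-1} is (H^S)^{-1}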
45 Partial Correlation
Given random variables Y1 , Y2 and X1 , . . . , Xp , the partial correlation between Y1 and Y2 given X1 , . . . , Xp
is denoted by ρY1 ,Y2 |X1 ,...,Xp and defined as
ρY1,Y2|X1,...,Xp := Corr( rY1|X1,...,Xp , rY2|X1,...,Xp ).
In other words, ρY1 ,Y2 |X1 ,...,Xp is defined as the correlation between the residual of Y1 given X1 , . . . , Xp and
the residual of Y2 given X1 , . . . , Xp .
ρY1 ,Y2 |X1 ,...,Xp is also termed the partial correlation of Y1 and Y2 after controlling for X1 , . . . , Xp . Since
residuals are second order quantities, it follows that the partial correlation is a second order quantity as well.
We shall now see how to explicitly write the partial correlation in terms of the covariances of Y1 , Y2 and X.
As
rY1 |X1 ,...,Xp = Y1 − E(Y1 ) − Cov(Y1 , X)(CovX)−1 (X − E(X))
and
rY2 |X1 ,...,Xp = Y2 − E(Y2 ) − Cov(Y2 , X)(CovX)−1 (X − E(X)),
it can be checked (left as an exercise) that
Cov(rY1 |X1 ,...,Xp , rY2 |X1 ,...,Xp ) = Cov(Y1 , Y2 ) − Cov(Y1 , X)(CovX)−1 Cov(X, Y2 ).
This, along with the formula for the variance of the residuals from the previous section, gives the following formula for the partial correlation ρY1,Y2|X1,...,Xp:
ρY1,Y2|X1,...,Xp = [ Cov(Y1, Y2) − Cov(Y1, X)(Cov(X))⁻¹Cov(X, Y2) ] / √( [Var(Y1) − Cov(Y1, X)(Cov(X))⁻¹Cov(X, Y1)] [Var(Y2) − Cov(Y2, X)(Cov(X))⁻¹Cov(X, Y2)] ).
When p = 1 so that X equals the scalar random variable X1, the above formula simplifies to (check this):
ρY1,Y2|X1 = ( ρY1,Y2 − ρY1,X1 ρY2,X1 ) / ( √(1 − ρ²Y1,X1) √(1 − ρ²Y2,X1) ).
It is instructive to put the variances of the residuals rY1|X1,...,Xp and rY2|X1,...,Xp and their covariance in a matrix. Let RY1,Y2|X1,...,Xp denote the 2 × 1 random vector with components rY1|X1,...,Xp and rY2|X1,...,Xp. The formulae above can be summarized as
Cov(RY1,Y2|X1,...,Xp) = Cov(Y) − Cov(Y, X)(Cov(X))⁻¹Cov(X, Y)
where Y = (Y1, Y2)ᵀ and X = (X1, X2, . . . , Xp)ᵀ.
The right hand side in the formula for Cov(RY1 ,Y2 |X1 ,...,Xp ) equals precisely the Schur complement of Cov(Y )
in the matrix
[ Cov(X)   Cov(X, Y) ; Cov(Y, X)   Cov(Y) ] = Cov( (Xᵀ, Yᵀ)ᵀ ) =: Σ.
Thus if Σ denotes the covariance matrix of the (p + 2) × 1 random vector (X1 , . . . , Xp , Y1 , Y2 )T , then
Cov(RY1 ,Y2 |X1 ,...,Xp ) equals precisely the Schur complement of Cov(Y ) in Σ. We shall come back to this
fact in the next class and use it to describe an expression for the partial correlation ρY1 ,Y2 |X1 ,...,Xp involving
Σ−1 .
We defined partial correlation in the last lecture. Given random variables Y1 , Y2 and X1 , . . . , Xp , the partial
correlation between Y1 and Y2 given X1 , . . . , Xp is denoted by ρY1 ,Y2 |X1 ,...,Xp and defined as
ρY1,Y2|X1,...,Xp := Corr( rY1|X1,...,Xp , rY2|X1,...,Xp ).
In other words, ρY1 ,Y2 |X1 ,...,Xp is defined as the correlation between the residual of Y1 given X1 , . . . , Xp and
the residual of Y2 given X1 , . . . , Xp .
Recall that the residuals rY1|X1,...,Xp and rY2|X1,...,Xp have the following expressions:
rY1|X1,...,Xp = Y1 − E(Y1) − Cov(Y1, X)(Cov(X))⁻¹(X − E(X))
and
rY2|X1,...,Xp = Y2 − E(Y2) − Cov(Y2, X)(Cov(X))⁻¹(X − E(X)).
In the last class, we computed the variances of rY1|X1,...,Xp and rY2|X1,...,Xp as well as the covariance between them. This gave us the formulae:
Var(rYi|X1,...,Xp) = Var(Yi) − Cov(Yi, X)(Cov(X))⁻¹Cov(X, Yi) for i = 1, 2
and
Cov(rY1|X1,...,Xp, rY2|X1,...,Xp) = Cov(Y1, Y2) − Cov(Y1, X)(Cov(X))⁻¹Cov(X, Y2).
We shall now describe the connection between partial correlations and the inverse of the Covariance matrix.
Let RY1 ,Y2 |X1 ,...,Xp denote the 2 × 1 random vector consisting of the residuals rY1 |X1 ,...,Xp and rY2 |X1 ,...,Xp .
The formulae for the variances and covariances of the residuals allows us then to write the 2 × 2 covariance
matrix of RY1 ,Y2 |X1 ,...,Xp as
Cov(RY1,Y2|X1,...,Xp) = Cov( (Y1, Y2)ᵀ ) − [ Cov(Y1, X) ; Cov(Y2, X) ] (Cov(X))⁻¹ [ Cov(X, Y1)   Cov(X, Y2) ]
= Cov(Y) − Cov(Y, X)(Cov(X))⁻¹Cov(X, Y)
where Y = (Y1, Y2)ᵀ and X = (X1, X2, . . . , Xp)ᵀ.
The right hand side in the formula for Cov(RY1 ,Y2 |X1 ,...,Xp ) equals precisely the Schur complement of Cov(Y )
in the matrix
[ Cov(X)   Cov(X, Y) ; Cov(Y, X)   Cov(Y) ] = Cov( (Xᵀ, Yᵀ)ᵀ ) =: Σ.
Thus if Σ denotes the covariance matrix of the (p + 2) × 1 random vector (X1 , . . . , Xp , Y1 , Y2 )T , then
Cov(RY1 ,Y2 |X1 ,...,Xp ) equals precisely the Schur complement of Cov(Y ) in Σ.
But we know that if we invert Σ, then the last diagonal block (or the (2, 2)th block) of Σ⁻¹ equals the inverse of the Schur complement of the (2, 2)th block of Σ. This and the above connection between the Schur complement and the covariance of RY1,Y2|X1,...,Xp allow us to deduce that if Σ⁻¹ is partitioned in the same way as Σ, with (Σ⁻¹)22 denoting its last 2 × 2 diagonal block, then
(Σ⁻¹)22 = ( Cov(RY1,Y2|X1,...,Xp) )⁻¹ = [ Var(rY1|X1,...,Xp)   Cov(rY1|X1,...,Xp, rY2|X1,...,Xp) ; Cov(rY1|X1,...,Xp, rY2|X1,...,Xp)   Var(rY2|X1,...,Xp) ]⁻¹.
From here it follows (using the formula for the inverse of a 2 × 2 matrix) that the partial correlation ρY1,Y2|X1,...,Xp has the alternative expression:
ρY1,Y2|X1,...,Xp = −(Σ⁻¹)(p + 1, p + 2) / √( (Σ⁻¹)(p + 1, p + 1) (Σ⁻¹)(p + 2, p + 2) ),
where (Σ⁻¹)(i, j) denotes the (i, j)th entry of Σ⁻¹. This shows the connection between partial correlation and inverse covariance matrices.
More generally, suppose Y1, . . . , Yn are random variables (no distributional assumptions are needed here) with covariance matrix Σ. Then the partial correlation between Yi and Yj given Yk, k ≠ i, k ≠ j can be written in terms of Σ⁻¹ as
ρYi,Yj|Yk,k≠i,k≠j = −(Σ⁻¹)(i, j) / √( (Σ⁻¹)(i, i) (Σ⁻¹)(j, j) ).
This implies, in particular, that
(Σ⁻¹)(i, j) = 0 ⟺ ρYi,Yj|Yk,k≠i,k≠j = 0.
In other words, (Σ⁻¹)(i, j) = 0 is equivalent to the partial correlation between Yi and Yj given Yk, k ≠ i, k ≠ j being zero.
Also
(Σ⁻¹)(i, j) ≤ 0 ⟺ ρYi,Yj|Yk,k≠i,k≠j ≥ 0   and   (Σ⁻¹)(i, j) ≥ 0 ⟺ ρYi,Yj|Yk,k≠i,k≠j ≤ 0.
In other words, Σ⁻¹(i, j) being nonpositive is equivalent to the partial correlation between Yi and Yj given Yk, k ≠ i, k ≠ j being nonnegative. Similarly, Σ⁻¹(i, j) being nonnegative is equivalent to the partial correlation between Yi and Yj given Yk, k ≠ i, k ≠ j being nonpositive.
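The two routes to the partial correlation (through Σ⁻¹ and through the Schur complement) can be compared numerically. In this sketch the covariance matrix is an arbitrary positive definite example and the helper names are mine.

import numpy as np

sigma = np.array([[1.0, 0.5, 0.3, 0.2],
                  [0.5, 1.0, 0.4, 0.1],
                  [0.3, 0.4, 1.0, 0.6],
                  [0.2, 0.1, 0.6, 1.0]])     # Cov(Y_1, ..., Y_4)
K = np.linalg.inv(sigma)

def partial_corr_from_inverse(K, i, j):
    # rho_{Y_i, Y_j | rest} = -K[i, j] / sqrt(K[i, i] * K[j, j])
    return -K[i, j] / np.sqrt(K[i, i] * K[j, j])

def partial_corr_from_schur(sigma, i, j):
    # Same quantity via the Schur complement of Cov((Y_i, Y_j)) in sigma
    rest = [k for k in range(len(sigma)) if k not in (i, j)]
    S = (sigma[np.ix_([i, j], [i, j])]
         - sigma[np.ix_([i, j], rest)]
         @ np.linalg.solve(sigma[np.ix_(rest, rest)], sigma[np.ix_(rest, [i, j])]))
    return S[0, 1] / np.sqrt(S[0, 0] * S[1, 1])

print(partial_corr_from_inverse(K, 0, 1), partial_corr_from_schur(sigma, 0, 1))   # equal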
Consider random variables Y and X1 , . . . , Xp . Let β0∗ + β1∗ X1 + · · · + βp∗ Xp denote the BLP of Y in terms of
X1 , . . . , Xp .
We have seen before that if p = 1, then X is equal to the scalar random variable X1 and the BLP then has the expression:
BLP = E(Y) + (Cov(Y, X1)/Var(X1)) (X1 − E(X1)).
In other words, when p = 1, the slope coefficient of the BLP is given by
β1∗ = Cov(Y, X1)/Var(X1) = ρY,X1 √( Var(Y)/Var(X1) ).    (77)
When p ≥ 1, we have p “slope” coefficients β1∗, . . . , βp∗ corresponding to X1, . . . , Xp. In this case, one can write a formula analogous to (77) as follows:
βi∗ = ρY,Xi|Xk,k≠i √( Var(rY|Xk,k≠i) / Var(rXi|Xk,k≠i) ).    (78)
In other words, βi∗ equals the slope coefficient of the BLP of rY|Xk,k≠i in terms of rXi|Xk,k≠i.
We shall prove this fact now. We can assume without loss of generality that i = p; the proof for other i can be completed by rearranging X1, . . . , Xp so that Xi appears at the last position. The formula for β∗ = (β1∗, . . . , βp∗)ᵀ is
β∗ = (Cov(X))⁻¹Cov(X, Y).
Let us write X = (X−pᵀ, Xp)ᵀ where X−p := (X1, . . . , Xp−1)ᵀ consists of all the X's except Xp. We can partition Cov(X) as
Cov(X) = [ Cov(X−p)   Cov(X−p, Xp) ; Cov(Xp, X−p)   Var(Xp) ].
The formula for β∗ then becomes
β∗ = (Cov(X))⁻¹Cov(X, Y) = [ Cov(X−p)   Cov(X−p, Xp) ; Cov(Xp, X−p)   Var(Xp) ]⁻¹ [ Cov(X−p, Y) ; Cov(Xp, Y) ].
In order to derive an explicit formula for βp∗ from this expression, we need to figure out the last row of (Cov(X))⁻¹. The standard formula (76) for the inverse of a partitioned matrix A = [ E   F ; G   H ] tells us that the last row of A⁻¹ equals [ −(H^S)⁻¹GE⁻¹   (H^S)⁻¹ ]. Applying this with E = Cov(X−p), F = Cov(X−p, Xp), G = Cov(Xp, X−p) and H = Var(Xp), so that H^S = Var(Xp) − Cov(Xp, X−p)(Cov(X−p))⁻¹Cov(X−p, Xp) = Var(rXp|Xk,k≠p), we obtain
βp∗ = [ Cov(Xp, Y) − Cov(Xp, X−p)(Cov(X−p))⁻¹Cov(X−p, Y) ] / Var(rXp|Xk,k≠p) = Cov(rY|Xk,k≠p, rXp|Xk,k≠p) / Var(rXp|Xk,k≠p),    (79)
which is the slope coefficient of the BLP of rY|Xk,k≠p in terms of rXp|Xk,k≠p and agrees with (78) for i = p. This proves the result for i = p. One can prove it for other i by simply rearranging X1, . . . , Xp so that Xi appears as the last variable.
Example 47.1. Suppose X1, X2, Z3, . . . , Zn, Zn+1 with n ≥ 2 are uncorrelated random variables having mean zero. Define random variables X3, . . . , Xn+1 as
Xt = φ1Xt−1 + φ2Xt−2 + Zt for t = 3, . . . , n + 1.
We have seen in the last class that the BLP of Xn+1 in terms of X1, . . . , Xn equals φ1Xn + φ2Xn−1. This means that the coefficient of Xi in the BLP of Xn+1 in terms of X1, . . . , Xn equals 0 for i = 1, . . . , n − 2. As a consequence of (79), we then deduce that
ρXn+1,Xi|Xk,k≤n,k≠i = 0 for i = 1, . . . , n − 2.
Using the connection between partial correlation and inverse covariance, we can further deduce that if Σ denotes the (n + 1) × (n + 1) covariance matrix of X1, . . . , Xn+1, then
Σ⁻¹(n + 1, i) = 0 for i = 1, . . . , n − 2.
Let us first quickly recap the BLP. Given random variables Y and X1 , . . . , Xp , a linear predictor of Y in
terms of X1 , . . . , Xp is a random variable of the form β0 + β1 X1 + · · · + βp Xp . The BLP is then given by
β0∗ + β1∗ X1 + · · · + βp∗ Xp where β0∗ , . . . , βp∗ minimize:
L(β0, β1, . . . , βp) := E(Y − β0 − β1X1 − · · · − βpXp)²
over β0, . . . , βp. We have seen that β0∗, . . . , βp∗ can be figured out using calculus and this gives the formula:
BLP = E(Y) + Cov(Y, X)(Cov(X))⁻¹(X − E(X)),
where X stands for the p × 1 random vector with components X1, . . . , Xp. The residual rY|X1,...,Xp simply equals Y − BLP and we have seen that the variance of rY|X1,...,Xp equals:
Var(rY|X1,...,Xp) = Var(Y) − Cov(Y, X)(Cov(X))⁻¹Cov(X, Y).
Now suppose that Y = (Y1, Y2)ᵀ is a 2 × 1 random vector and that we want to predict Y by a linear function AX + c of X, where A is a 2 × p matrix with entries aij and c = (c1, c2)ᵀ, so that
AX + c = ( a11X1 + a12X2 + · · · + a1pXp + c1 ,  a21X1 + a22X2 + · · · + a2pXp + c2 )ᵀ
and
L(A, c) = E‖Y − AX − c‖² = E(Y1 − a11X1 − a12X2 − · · · − a1pXp − c1)² + E(Y2 − a21X1 − a22X2 − · · · − a2pXp − c2)².
From here it is clear that to minimize the above with respect to A and c, we can minimize the first term over a11, a12, . . . , a1p, c1 and then minimize the second term over a21, a22, . . . , a2p, c2. From here, it is easy to see that the BLP of Y = (Y1, Y2)ᵀ in terms of X is given by
BLP = E(Y) + Cov(Y, X)(Cov(X))⁻¹(X − E(X)).
Thus the same formula EY + Cov(Y, X)(CovX)−1 (X − EX) gives the BLP for Y in terms of X even when
Y is a 2 × 1 random vector. It is straightforward now to see that this holds when Y is a k × 1 random vector
for every k ≥ 1 (not just k = 1 or k = 2). One can define the residual of Y in terms of X1, . . . , Xp as
RY|X := Y − BLP = Y − E(Y) − Cov(Y, X)(Cov(X))⁻¹(X − E(X)),
and this is exactly the vector whose ith component is the residual of Yi in terms of X1, . . . , Xp. It is also straightforward to check that the covariance matrix of RY|X is given by
Cov(RY|X) = Cov(Y) − Cov(Y, X)(Cov(X))⁻¹Cov(X, Y).
We shall next move to the last topic of the class: the multivariate normal distribution. For this, it is helpful
to know about moment generating functions of random vectors.
The moment generating function (MGF) of an n × 1 random vector Y is defined as
MY(a) := E exp(aᵀY) = E exp(a1Y1 + · · · + anYn)
for every a ∈ Rⁿ for which the expectation exists. Note that when a = (0, . . . , 0)ᵀ is the zero vector, it is
easy to see that MY (a) = 1.
Just like in the univariate case, Moment Generating Functions determine distributions when they exist
in a neighbourhood of a = 0.
Moment Generating Functions behave very nicely in the presence of independence. Suppose Y(1) and Y(2) are two random vectors and let Y = (Y(1)ᵀ, Y(2)ᵀ)ᵀ be the vector obtained by putting Y(1) and Y(2) together in a single column vector. Then Y(1) and Y(2) are independent if and only if
MY(a) = MY(1)(a(1)) MY(2)(a(2))   for every a = (a(1)ᵀ, a(2)ᵀ)ᵀ (partitioned in the same way as Y) for which the MGFs exist.
Thus under independence, the MGF factorizes and conversely, when the MGF factorizes, we have independence.
50 The Multivariate Normal Distribution
Definition 50.1. A random vector Y = (Y1 , . . . , Yn )T is said to have the multivariate normal distribution if
every linear function aT Y of Y has the univariate normal distribution.
Remark 50.1. It is important to emphasize that for Y = (Y1 , . . . , Yn )T to be multivariate normal, every
linear function aᵀY = a1Y1 + · · · + anYn needs to be univariate normal. It is not enough for example to just
have each Yi to be univariate normal. It is very easy to construct examples where each Yi is univariate normal
but a1 Y1 + · · · + an Yn is not univariate normal for many vectors (a1 , . . . , an )T . For example, suppose that
Y1 ∼ N (0, 1) and that Y2 = ξY1 where ξ is a discrete random variable taking the two values 1 and −1 with
probability 1/2 each, with ξ independent of Y1. Then it is easy to see that
P{Y2 ≤ t} = P{ξY1 ≤ t | ξ = 1}P{ξ = 1} + P{ξY1 ≤ t | ξ = −1}P{ξ = −1} = (1/2)P{Y1 ≤ t} + (1/2)P{−Y1 ≤ t} = P{Y1 ≤ t}.
This means therefore that Y2 ∼ N(0, 1) (the two conditional probabilities above both equal P{Y1 ≤ t}, which also shows that Y2 is independent of ξ). Note however that Y1 + Y2 is not normal as
P{Y1 + Y2 = 0} = P{ξ = −1} = 1/2.
Thus, in this example, even though Y1 and Y2 are both N(0, 1), the vector (Y1, Y2)ᵀ is not multivariate normal.
Example 50.2. We have seen earlier in the class that if Z1 , . . . , Zn are independent and univariate normal,
then a1Z1 + · · · + anZn is normal for every a1, . . . , an. Therefore a random vector Z = (Z1, . . . , Zn)ᵀ that is
made up of independent Normal random variables has the multivariate normal distribution. In fact, we
shall show below that if Y has a multivariate normal distribution, then it should necessarily be the case that
Y is a linear function of a random vector Z that is made of independent univariate normal random variables.
Suppose Y = (Y1 , . . . , Yn )T is multivariate normal. Let µ = E(Y ) and Σ = Cov(Y ) be the mean vector and
covariance matrix of Y respectively. Then, as a direct consequence of the definition of multivariate normality,
it follows that the MGF of Y equals
MY(a) = E(e^{aᵀY}) = exp( aᵀµ + (1/2)aᵀΣa ).    (80)
To see why this is true, note that by definition of multivariate normality, aT Y is univariate normal. Now the
mean and variance of aᵀY are given by
E(aᵀY) = aᵀµ   and   Var(aᵀY) = aᵀCov(Y)a = aᵀΣa,
so that
aT Y ∼ N (aT µ, aT Σa) for every a ∈ Rn .
Then (80) directly follows from the formula for the MGF of a univariate normal.
Note that the MGF of Y (given by (80)) only depends on the mean vector µ and the covariance matrix Σ
of Y . Thus the distribution of every multivariate normal vector Y is characterized by the mean vector µ and
covariance Σ. We therefore use the notation Nn (µ, Σ) for the multivariate normal distribution with mean µ
and covariance Σ.
50.2 Connection to i.i.d N (0, 1) random variables
Suppose that the covariance matrix Σ of Y is positive definite so that Σ−1/2 is well-defined. Let Z :=
Σ−1/2 (Y − µ). The formula (80) allows the computation of the MGF of Z as follows:
MZ(a) = E e^{aᵀZ} = E exp( aᵀΣ^{−1/2}(Y − µ) )
= exp( −aᵀΣ^{−1/2}µ ) E exp( aᵀΣ^{−1/2}Y )
= exp( −aᵀΣ^{−1/2}µ ) MY( Σ^{−1/2}a )
= exp( −aᵀΣ^{−1/2}µ ) exp( aᵀΣ^{−1/2}µ + (1/2)(aᵀΣ^{−1/2})Σ(Σ^{−1/2}a) ) = exp( (1/2)aᵀa ) = ∏_{i=1}^n exp(ai²/2).
The right hand side above is clearly the MGF of a random vector having n i.i.d standard normal random
variables. Thus because MGFs uniquely determine distributions, we conclude that Z = (Z1 , . . . , Zn )T has
independent standard normal random variables. We have thus proved that: If Y ∼ Nn (µ, Σ) and Σ is p.d,
then the components Z1 , . . . Zn of Z = Σ−1/2 (Y − µ) are independent standard normal random
variables.
Suppose Y = (Y1 , . . . , Yn )T is a random vector that has the multivariate normal distribution. What then is
the joint density of Y1 , . . . , Yn ?
Let µ = E(Y ) and Σ = Cov(Y ) be the mean vector and covariance matrix of Y respectively. For Y to
have a joint density, we need to assume that Σ is positive definite. We have then seen in the previous section
that the components Z1 , . . . , Zn of Z are independent standard normal random variables where
Z = Σ^{−1/2}(Y − µ).
The joint density of Z1, . . . , Zn is therefore
fZ(z) = ∏_{i=1}^n (1/√(2π)) exp(−zi²/2) = (2π)^{−n/2} exp( −‖z‖²/2 )
where z = (z1, . . . , zn)ᵀ.
Using the above formula and the fact that Y = µ + Σ1/2 Z, we can compute the joint density of Y1 , . . . , Yn
via the Jacobian formula. This gives
fY(y) = (2π)^{−n/2} (det Σ)^{−1/2} exp( −(1/2)(y − µ)ᵀΣ⁻¹(y − µ) )
where y = (y1, . . . , yn)ᵀ.
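As a sanity check (not part of the notes), the joint density formula above can be compared against scipy.stats.multivariate_normal for an arbitrary choice of µ and Σ.

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0])
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

def mvn_density(y, mu, sigma):
    # (2 pi)^{-n/2} det(Sigma)^{-1/2} exp(-(y - mu)^T Sigma^{-1} (y - mu) / 2)
    n = len(mu)
    diff = y - mu
    quad = diff @ np.linalg.solve(sigma, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))

y = np.array([0.5, 0.2])
print(mvn_density(y, mu, sigma), multivariate_normal(mean=mu, cov=sigma).pdf(y))   # agree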
Suppose Y = (Y1 , . . . , Yn )T ∼ Nn (µ, Σ). Note then that µ is the mean vector E(Y ) of Y and Σ is the
covariance matrix Cov(Y ). The following properties are very important.
1. Linear Functions of Y are also multivariate normal: If A is an m × n deterministic matrix and
c is an m × 1 deterministic vector, then AY + c ∼ Nm (Aµ + c, AΣAT ).
Reason: Every linear function of AY + c is obviously also a linear function of Y and, thus, this fact
follows from the definition of the multivariate normal distribution.
2. If Y is multivariate normal, then every random vector formed by taking a subset of the components of
Y is also multivariate normal.
Reason: Follows from the previous fact.
3. Independence is the same as Uncorrelatedness: Suppose Y(1) and Y(2) are two random vectors such that Y = (Y(1)ᵀ, Y(2)ᵀ)ᵀ is multivariate normal. Then Y(1) and Y(2) are independent if and only if Cov(Y(1), Y(2)) = 0.
Reason: The fact that independence implies Cov(Y(1) , Y(2) ) = 0 is obvious and does not require any
normality. The key is the other implication that zero covariance implies independence. For this, it is
enough to show that the MGF of Y equals the product of the MGFs of Y(1) and Y(2) . The MGF of Y
equals
MY(a) = exp( aᵀµ + (1/2)aᵀΣa )
where Σ = Cov(Y ).
Note that Y(1) and Y(2) are also multivariate normal so that
MY(i)(a(i)) = exp( a(i)ᵀµ(i) + (1/2)a(i)ᵀΣii a(i) )   for i = 1, 2
where
µ(i) := E(Y(i) ) and Σii := Cov(Y(i) ).
Now if Σ12 := Cov(Y(1) , Y(2) ) and Σ21 = Cov(Y(2) , Y(1) ) = ΣT12 , then observe that
Σ = [ Σ11   Σ12 ; Σ21   Σ22 ] = [ Σ11   Σ12 ; Σ12ᵀ   Σ22 ].
When Σ12 = Cov(Y(1), Y(2)) = 0, writing a = (a(1)ᵀ, a(2)ᵀ)ᵀ gives aᵀµ = a(1)ᵀµ(1) + a(2)ᵀµ(2) and aᵀΣa = a(1)ᵀΣ11a(1) + a(2)ᵀΣ22a(2), so that MY(a) = MY(1)(a(1)) MY(2)(a(2)). Since the MGF factorizes, Y(1) and Y(2) are independent.
53 Properties of Multivariate Normal Random Variables
Suppose Y = (Y1 , . . . , Yn )T ∼ Nn (µ, Σ). Note then that µ is the mean vector E(Y ) of Y and Σ is the
covariance matrix Cov(Y ). In the last class, we looked at the following properties.
We shall next prove that quadratic forms of multivariate normal random variables with identity covariance
have chi-squared distributions provided the symmetric matrix defining the quadratic form is idempotent.
A square matrix A is said to be idempotent if A² = A. An important fact about idempotent matrices is the following:
every eigenvalue of a symmetric idempotent matrix is either 0 or 1.    (81)
To prove this fact, note first that if A is symmetric, then by the spectral theorem
A = λ1u1u1ᵀ + · · · + λnununᵀ
for an orthonormal basis u1, . . . , un and real numbers λ1, . . . , λn. The rank of A precisely equals the number of λi's that are non-zero. If r is the rank of A, we can therefore write (assuming without loss of generality that λ1, . . . , λr are non-zero and λr+1 = · · · = λn = 0)
A = λ1u1u1ᵀ + · · · + λrururᵀ.
Using orthonormality, A² = λ1²u1u1ᵀ + · · · + λr²ururᵀ, so A² = A implies that λi² = λi, which gives λi = 1 (note that we have assumed that λi ≠ 0). This proves (81).
The following result states that quadratic forms of multivariate normal random vectors with identity
covariance are chi-squared provided the underlying matrix is idempotent.
Theorem 54.1. Suppose Y ∼ Nn(µ, In) and let A be an n × n symmetric and idempotent matrix with rank r. Then
(Y − µ)ᵀA(Y − µ) ∼ χ²r.
Proof. Because A is symmetric and idempotent and has rank r, we can write A as
A = u1u1ᵀ + · · · + ururᵀ
for orthonormal vectors u1, . . . , ur. Therefore
(Y − µ)ᵀA(Y − µ) = ∑_{i=1}^r ( uiᵀ(Y − µ) )² = V1² + · · · + Vr²
where Vi := uiᵀ(Y − µ). Note now that each Vi ∼ N(0, 1) and that V1, . . . , Vr are uncorrelated and hence independent (because of normality). This proves that (Y − µ)ᵀA(Y − µ) ∼ χ²r.
Example 54.2. Suppose X1, . . . , Xn are i.i.d N(0, 1). Then X̄ ∼ N(0, 1/n), S ∼ χ²n−1 and X̄ and S are independent, where
S := ∑_{i=1}^n (Xi − X̄)².
The fact that X̄ ∼ N(0, 1/n) is easy. To prove that S ∼ χ²n−1 and that S and X̄ are independent, we shall show two methods.
Method One: Let 1 denote the n × 1 vector of ones and note that S = XᵀAX where A := I − (1/n)11ᵀ. The matrix A is symmetric, and it is idempotent because 1ᵀ1 = n so that
A² = ( I − (1/n)11ᵀ )( I − (1/n)11ᵀ ) = I − (2/n)11ᵀ + (1/n²)1(1ᵀ1)1ᵀ = I − (1/n)11ᵀ = A.
Also the rank of A equals n − 1. Thus by Theorem 54.1 (note that X = (X1 , . . . , Xn )T ∼ Nn (0, In )), we have
S = X T AX ∼ χ2n−1 .
In order to prove that S and X̄ are independent, we only need to observe that
X̄ = (1/n)1ᵀX   and   ( I − (1/n)11ᵀ )X    (82)
are independent, because S is a function of (I − (1/n)11ᵀ)X. The two quantities in (82) are jointly multivariate normal (both are linear functions of X), and they are uncorrelated since
Cov( (1/n)1ᵀX, (I − (1/n)11ᵀ)X ) = (1/n)1ᵀ( I − (1/n)11ᵀ ) = (1/n)( 1ᵀ − 1ᵀ ) = 0,
so they are independent.
Method Two: Let u1, . . . , un be an orthonormal basis for Rⁿ with u1 = 1/√n (check that u1 has unit
norm). Let U be the matrix with columns u1 , . . . , un i.e.,
U = [u1 : · · · : un ].
Note that UUᵀ = UᵀU = In (by the properties of an orthonormal basis). Now let Y = UᵀX. Then Y is a linear function of X (and X ∼ Nn(0, In)) so that
Y ∼ Nn(0, UᵀInU) = Nn(0, In).
Note also that Y1 = u1ᵀX = √n X̄ and that ∑_{i=1}^n Yi² = ‖Y‖² = ‖UᵀX‖² = ‖X‖² = ∑_{i=1}^n Xi², so that
∑_{i=2}^n Yi² = ∑_{i=1}^n Xi² − nX̄² = ∑_{i=1}^n (Xi − X̄)² = S.
This and the fact that Y ∼ Nn (0, In ) (which is same as saying that Y1 , . . . , Yn are i.i.d N (0, 1)) imply that
S ∼ χ²n−1. Also note that S depends only on Y2, . . . , Yn so that it is independent of Y1, and thus S and X̄ are independent (note that X̄ = Y1/√n).
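A quick simulation sketch of this example (the sample sizes and seed are arbitrary): the empirical mean and variance of S should be close to n − 1 and 2(n − 1), the moments of the χ²n−1 distribution, and S should be essentially uncorrelated with X̄.

import numpy as np

rng = np.random.default_rng(8)
n, reps = 10, 200_000
X = rng.standard_normal((reps, n))
xbar = X.mean(axis=1)
S = ((X - xbar[:, None]) ** 2).sum(axis=1)
print(S.mean(), S.var(), np.corrcoef(S, xbar)[0, 1])   # roughly 9, 18 and 0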
Example 54.3. Suppose that X ∼ Nn(0, Σ) where Σ is an n × n matrix with 1 on the diagonal and ρ on the off-diagonal. Σ can also be represented as
Σ = (1 − ρ)In + ρ11ᵀ.
What are the distributions of X̄ and of S := ∑_{i=1}^n (Xi − X̄)² in this case?
X̄ is a linear function of X and so it will be normal. We then just have to find its mean and variance. Clearly EX̄ = 0 (as each EXi = 0) and
var(X̄) = (1/n²) var(X1 + · · · + Xn) = (1/n²) ( ∑_i var(Xi) + ∑_{i≠j} Cov(Xi, Xj) ) = (1/n²)( n + n(n − 1)ρ ) = (1 + (n − 1)ρ)/n.
Thus
X̄ ∼ N( 0, (1 + (n − 1)ρ)/n ).
Observe that this implies that 1 + (n − 1)ρ ≥ 0 or ρ ≥ −1/(n − 1). In other words, if ρ < −1/(n − 1), then
Σ will not be positive semi-definite.
Let us now turn to the distribution of S. As in the previous example, S = XᵀAX with A = I − (1/n)11ᵀ, but we unfortunately cannot use Theorem 54.1 as X does not have identity covariance (Theorem 54.1 only applies to multivariate normal random vectors with identity covariance). It turns out that the second method described in the previous example works here and gives the distribution of S. This is explained below.
Let u1, . . . , un be an orthonormal basis for Rⁿ with u1 = 1/√n and let U be the matrix with columns u1, . . . , un so that UᵀU = UUᵀ = In. Let Y = UᵀX and note (as in the previous example) that
Y1 = √n X̄   and   S = ∑_{i=1}^n (Xi − X̄)² = Y2² + · · · + Yn².
Because Σ = (1 − ρ)In + ρ11ᵀ, we have Σu1 = (1 − ρ + nρ)u1 and Σuj = (1 − ρ)uj for j ≥ 2 (as 1ᵀuj = √n u1ᵀuj = 0). This means that UᵀΣU is a diagonal matrix with diagonal entries (1 − ρ + nρ), 1 − ρ, 1 − ρ, . . . , 1 − ρ. Therefore Y ∼ Nn(0, UᵀΣU) implies that Y1, . . . , Yn are independent with
Y1 ∼ N(0, 1 − ρ + nρ)   and   Yj ∼ N(0, 1 − ρ) for j = 2, . . . , n.
Thus
S/(1 − ρ) = ∑_{i=2}^n ( Yi/√(1 − ρ) )² ∼ χ²n−1
or S ∼ (1 − ρ)χ2n−1 . Also because X̄ only depends on Y1 and S depends only on Y2 , . . . , Yn , we have that S
and X̄ are independent.
Suppose Y ∼ Nn(µ, Σ). Let us partition Y into two parts Y(1) and Y(2) where Y(1) = (Y1, . . . , Yp)ᵀ consists of the first p components of Y and Y(2) = (Yp+1, . . . , Yn)ᵀ consists of the last q := n − p components of Y. Partition µ and Σ accordingly as µ = (µ(1)ᵀ, µ(2)ᵀ)ᵀ and Σ = [ Σ11   Σ12 ; Σ21   Σ22 ]. The question we address now is the following: What is the conditional distribution of Y(2) given Y(1) = y1? The answer is given below:
Y(2) | Y(1) = y1 ∼ Nq( µ(2) + Σ21Σ11⁻¹(y1 − µ(1)), Σ22 − Σ21Σ11⁻¹Σ12 ).    (90)
In words, the conditional distribution of Y(2) given Y(1) = y1 is also multivariate normal, with mean vector given by
E(Y(2) | Y(1) = y1) = µ(2) + Σ21Σ11⁻¹(y1 − µ(1))
and covariance matrix given by
Cov(Y(2) | Y(1) = y1) = Σ22 − Σ21Σ11⁻¹Σ12.
We shall go over the proof of (90) below. By the properties of the multivariate normal distribution, proving (90) reduces to verifying the following three equations:
E( Y(2) − µ(2) − Σ21Σ11⁻¹(Y(1) − µ(1)) ) = 0,
Cov( Y(2) − µ(2) − Σ21Σ11⁻¹(Y(1) − µ(1)) ) = Σ22 − Σ21Σ11⁻¹Σ12,
Cov( Y(2) − µ(2) − Σ21Σ11⁻¹(Y(1) − µ(1)), Y(1) ) = 0.
Indeed, the residual Y(2) − µ(2) − Σ21Σ11⁻¹(Y(1) − µ(1)) and Y(1) are jointly multivariate normal, so the third equation (uncorrelatedness) makes them independent; conditional on Y(1) = y1, the vector Y(2) then equals µ(2) + Σ21Σ11⁻¹(y1 − µ(1)) plus an independent mean-zero multivariate normal vector with covariance Σ22 − Σ21Σ11⁻¹Σ12, which is exactly the statement (90).
In other words, proving the above three equations is equivalent to proving (90). We now complete the proof of (90) by proving the three equations above. We actually have already proved these three facts: the first simply says that the residual of Y(2) given Y(1) has zero expectation, the second says that the covariance matrix of the residual equals the Schur complement, and the third says that the residual of Y(2) given Y(1) is uncorrelated with Y(1). For completeness, let us rederive these quickly as follows.
E( Y(2) − µ(2) − Σ21Σ11⁻¹(Y(1) − µ(1)) ) = µ(2) − µ(2) − Σ21Σ11⁻¹(µ(1) − µ(1)) = 0,    (86)
Cov( Y(2) − µ(2) − Σ21Σ11⁻¹(Y(1) − µ(1)) ) = Cov( [ −Σ21Σ11⁻¹   I ] ( (Y(1) − µ(1))ᵀ, (Y(2) − µ(2))ᵀ )ᵀ )
= [ −Σ21Σ11⁻¹   I ] [ Σ11   Σ12 ; Σ21   Σ22 ] [ −Σ11⁻¹Σ12 ; I ]
= [ 0   Σ22 − Σ21Σ11⁻¹Σ12 ] [ −Σ11⁻¹Σ12 ; I ]
= Σ22 − Σ21Σ11⁻¹Σ12    (87)
and finally
Cov( Y(2) − µ(2) − Σ21Σ11⁻¹(Y(1) − µ(1)), Y(1) ) = Cov(Y(2), Y(1)) − Σ21Σ11⁻¹Cov(Y(1), Y(1)) = Σ21 − Σ21Σ11⁻¹Σ11 = 0.    (88)
This completes the proof of (90).
Let us reiterate that all the three calculations (86), (87) and (88) do not require any distributional
assumptions on Y(1) and Y(2) . They hold for all random variables Y(1) and Y(2) . The multivariate normality
assumption allows us to deduce (90) from these second order (i.e., mean and covariance) calculations.
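The conditional mean and covariance formulae translate directly into code. The sketch below (with an arbitrary µ and Σ, and my own helper name conditional_mvn) computes them via the Schur complement.

import numpy as np

def conditional_mvn(mu, sigma, p, y1):
    # Mean and covariance of Y_(2) | Y_(1) = y1 when Y ~ N_n(mu, sigma) and Y_(1) is the first p components
    mu1, mu2 = mu[:p], mu[p:]
    S11, S12 = sigma[:p, :p], sigma[:p, p:]
    S21, S22 = sigma[p:, :p], sigma[p:, p:]
    cond_mean = mu2 + S21 @ np.linalg.solve(S11, y1 - mu1)
    cond_cov = S22 - S21 @ np.linalg.solve(S11, S12)     # Schur complement corresponding to the (2,2) block
    return cond_mean, cond_cov

mu = np.array([0.0, 1.0, 2.0])
sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.4],
                  [0.3, 0.4, 1.5]])
print(conditional_mvn(mu, sigma, p=1, y1=np.array([1.0])))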
It turns out that the converse of this result is also true and we have
Theorem 56.2. Suppose Z ∼ Nn (0, In ) and A is a symmetric matrix. Then Z T AZ ∼ χ2r if and only if A
is an idempotent matrix with rank r.
In other words, the only way in which Z T AZ has the chi-squared distribution is when A is idempotent.
Thus being idempotent is both necessary and sufficient for Z T AZ to be distributed as chi-squared. The fact
that Z T AZ ∼ χ2r implies the idempotence of A can be proved via moment generating functions but we shall
skip this argument.
When the covariance is not the identity matrix, Theorem 56.1 needs to be modified as demonstrated below. Suppose now that Y ∼ Nn(0, Σ) and we are interested in seeing when Q := YᵀAY has a chi-squared distribution (here A is a symmetric matrix). We know that Z := Σ^{−1/2}Y ∼ Nn(0, In) so we can write (as Y = Σ^{1/2}Z)
Q = YᵀAY = ZᵀΣ^{1/2}AΣ^{1/2}Z.
Thus Q is chi-squared distributed if and only if Σ^{1/2}AΣ^{1/2} is idempotent, which is equivalent to
(Σ^{1/2}AΣ^{1/2})(Σ^{1/2}AΣ^{1/2}) = Σ^{1/2}AΣ^{1/2} ⟺ AΣA = A.
We thus have
Theorem 56.3. Suppose Y ∼ Nn (0, Σ) and A is a symmetric matrix. Then Y T AY ∼ χ2r if and only if
AΣA = A and r = rank(Σ1/2 AΣ1/2 ) = rank(A).
Example 56.5. Suppose that X ∼ Nn(0, Σ) where Σ is an n × n matrix with 1 on the diagonal and ρ on the off-diagonal. Σ can also be represented as
Σ = (1 − ρ)In + ρ11ᵀ.
We saw in the earlier example on this model that
S/(1 − ρ) ∼ χ²n−1   where S = ∑_{i=1}^n (Xi − X̄)².    (89)
We shall show this here using Theorem 56.3. Note first that
S = Xᵀ( I − (1/n)11ᵀ )X
so that
S/(1 − ρ) = XᵀAX   where A := (1/(1 − ρ))( I − (1/n)11ᵀ ).
Thus to show (89), we only need to prove that AΣA = A (note that the rank of A equals n − 1). For this, see that
ΣA = ( (1 − ρ)In + ρ11ᵀ ) · (1/(1 − ρ))( I − (1/n)11ᵀ )
= ( I − (1/n)11ᵀ ) + (ρ/(1 − ρ)) 11ᵀ( I − (1/n)11ᵀ )
= ( I − (1/n)11ᵀ ) + (ρ/(1 − ρ))( 11ᵀ − (1/n)1(1ᵀ1)1ᵀ )
= ( I − (1/n)11ᵀ ) + (ρ/(1 − ρ))( 11ᵀ − 11ᵀ ) = I − (1/n)11ᵀ,
and therefore AΣA = A(ΣA) = (1/(1 − ρ))( I − (1/n)11ᵀ )² = (1/(1 − ρ))( I − (1/n)11ᵀ ) = A.
Finally let us mention that when Z ∼ Nn(µ, In) with µ ≠ 0 and A is symmetric and idempotent, then ZᵀAZ will have a non-central chi-squared distribution. We will not study these in this class.
Suppose that Y(1) and Y(2) are two random vectors (with possibly different lengths) such that the vector (Y(1)ᵀ, Y(2)ᵀ)ᵀ is multivariate normal. Then, as we saw in the last class, we have
Y(2) | Y(1) = y1 ∼ N( E(Y(2)) + Cov(Y(2), Y(1))(Cov Y(1))⁻¹(y1 − E(Y(1))),  Cov(Y(2)) − Cov(Y(2), Y(1))(Cov Y(1))⁻¹Cov(Y(1), Y(2)) ).    (90)
In words, the conditional distribution of Y(2) given Y(1) = y1 is also multivariate normal with mean vector
given by:
E(Y(2) | Y(1) = y1) = E(Y(2)) + Cov(Y(2), Y(1))(Cov Y(1))⁻¹( y1 − E(Y(1)) ).
1. E(Y(2) |Y(1) ) is exactly equal to the BLP of Y(2) in terms of Y(1) . Thus the BP and BLP coincide.
2. The conditional covariance matrix Cov(Y(2) |Y(1) = y1 ) does not depend on y1 (this can be viewed as
some kind of homoscedasticity).
3. The conditional covariance matrix Cov(Y(2) |Y(1) = y1 ) is precisely equal to the Covariance Matrix of
the Residual of Y(2) given Y(1) .
We can look at the following special case of this result. Fix two components Yi and Yj of Y. Let Y(2) := (Yi, Yj)ᵀ and let Y(1) denote the vector obtained from all the other components Yk, k ≠ i, k ≠ j. Then
Cov( Y(2) | Y(1) = y1 ) = Cov(Y(2)) − Cov(Y(2), Y(1))(Cov Y(1))⁻¹Cov(Y(1), Y(2)) for every y1.
Now let rYi|Yk,k≠i,k≠j and rYj|Yk,k≠i,k≠j denote the residuals of Yi in terms of Yk, k ≠ i, k ≠ j and Yj in terms
of Yk, k ≠ i, k ≠ j; then we have seen previously that
Cov( (rYi|Yk,k≠i,k≠j, rYj|Yk,k≠i,k≠j)ᵀ ) = Cov(Y(2)) − Cov(Y(2), Y(1))(Cov Y(1))⁻¹Cov(Y(1), Y(2)).
We thus have
Cov( (Yi, Yj)ᵀ | Yk = yk, k ≠ i, k ≠ j ) = Cov( (rYi|Yk,k≠i,k≠j, rYj|Yk,k≠i,k≠j)ᵀ ) for every yk, k ≠ i, k ≠ j.
This means, in particular, that the conditional correlation between Yi and Yj given Yk = yk, k ≠ i, k ≠ j is precisely equal to the partial correlation ρYi,Yj|Yk,k≠i,k≠j (recall that ρYi,Yj|Yk,k≠i,k≠j is the correlation between rYi|Yk,k≠i,k≠j and rYj|Yk,k≠i,k≠j).
Now recall the following connection between partial correlation and entries of Σ−1 that we have seen
earlier:
ρYi,Yj|Yk,k≠i,k≠j = −Σ⁻¹(i, j) / √( Σ⁻¹(i, i) Σ⁻¹(j, j) ).
Putting the above observations together, we can deduce that the following are equivalent when Y is multi-
variate normal with covariance matrix Σ:
1. Σ⁻¹(i, j) = 0.
2. ρYi,Yj|Yk,k≠i,k≠j = 0.
3. The conditional correlation between Yi and Yj given Yk = yk, k ≠ i, k ≠ j equals 0 for every choice of yk, k ≠ i, k ≠ j.
4. Yi and Yj are conditionally independent given Yk = yk, k ≠ i, k ≠ j, for every choice of yk, k ≠ i, k ≠ j.