
E1 222 Stochastic Models and Applications

P.S. Sastry
[email protected]

Reference Material

◮ V.K. Rohatgi and A.K.Md.E. Saleh, An Introduction to Probability and Statistics, Wiley, 2nd edition, 2018
◮ S. Ross, Introduction to Probability Models, Elsevier, 12th edition, 2019
◮ P.G. Hoel, S. Port and C. Stone, Introduction to Probability Theory, 1971
◮ P.G. Hoel, S. Port and C. Stone, Introduction to Stochastic Processes, 1971

Course Prerequisites / Background needed

◮ Calculus
  ◮ continuity, differentiability, derivatives, functions of several variables, partial derivatives, integration, multiple integrals or integration over ℜ^n, convergence of sequences and series, Taylor series
◮ Matrix theory
  ◮ vector spaces, linear independence, linear transformations, matrices, rank, determinant, eigenvalues and eigenvectors
◮ In addition, knowledge of basic probability is assumed. I assume all students are familiar with the following:
  Random experiment, sample space, events, conditional probability, independent events, simple combinatorial probability computations
  But we will review basic probability in the first two classes.

Course grading

◮ Mid-Term Tests and Assignments: 70%; Final Exam: 30%
◮ Three mid-term tests for 20 marks each. We will have 2-3 assignments for 10 marks. (Tentative)
◮ Please remember this is essentially a Maths course
Probability Theory

◮ Probability Theory – branch of mathematics that deals with modeling and analysis of random phenomena.
◮ Random Phenomena – "individually not predictable but have a lot of regularity at a population level"
  ◮ E.g., Recommender systems are useful for Amazon or Netflix because at a population level customer behaviour can be predicted well.
◮ Example random phenomena: tossing a coin, rolling a dice etc – familiar to you all
◮ It is also useful in many engineering systems, e.g., for taking care of noise.
◮ Probability theory is also needed for Statistics, which deals with making inferences from data.

◮ In many engineering problems one needs to deal with random inputs where probability models are useful
  ◮ Dealing with dynamical systems subjected to noise (e.g., Kalman filter)
  ◮ Policies for decision making under uncertainty
  ◮ Pattern Recognition, prediction from data
  ...
◮ We may use probability models for analysing algorithms (e.g., average case complexity of algorithms)
◮ We may deliberately introduce randomness in an algorithm (e.g., ALOHA protocol, primality testing)
  ...
This is only a 'sample' of possible application scenarios!

Review of basic probability

We assume all of you are familiar with the terms: random experiment, outcomes of a random experiment, sample space, events etc.
We use the following notation:
◮ Sample space – Ω
  Elements of Ω are the outcomes of the random experiment.
  We write Ω = {ω_1, ω_2, · · · } when it is countable.
◮ An event is, by definition, a subset of Ω
◮ Set of all possible events – F ⊂ 2^Ω (power set of Ω)
  Each event is a subset of Ω

Probability axioms

Probability (or probability measure) is a function that assigns a number to each event and satisfies some properties. Formally, P : F → ℜ satisfying
A1 Non-negativity: P(A) ≥ 0, ∀A ∈ F
A2 Normalization: P(Ω) = 1
A3 σ-additivity: If A_1, A_2, · · · ∈ F satisfy A_i ∩ A_j = φ, ∀i ≠ j, then
   P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i)
Events satisfying A_i ∩ A_j = φ, ∀i ≠ j are said to be mutually exclusive.
Probability axioms

P : F → ℜ, F ⊂ 2^Ω (events are subsets of Ω)
A1 P(A) ≥ 0, ∀A ∈ F
A2 P(Ω) = 1
A3 If A_i ∩ A_j = φ, ∀i ≠ j, then P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i)
◮ For these axioms to make sense, we are assuming
  (i) Ω ∈ F and (ii) A_1, A_2, · · · ∈ F ⇒ (∪_i A_i) ∈ F
  When F = 2^Ω this is true.

Simple consequences of the axioms

◮ Notation: A^c is the complement of A.
  C = A + B implies A, B are mutually exclusive and C is their union.
◮ Let A ⊂ B be events. Then B = A + (B − A).
  Now we can show P(A) ≤ P(B):
  P(B) = P(A + (B − A)) = P(A) + P(B − A) ≥ P(A)
  This also shows P(B − A) = P(B) − P(A) when A ⊂ B.

◮ There are many such properties (I assume familiar to you) that can be derived from the axioms.
◮ Here are a few important ones. (Proof is left to you as an exercise!)

Case of finite Ω – Example

◮ Let Ω = {ω_1, · · · , ω_n}, F = 2^Ω, and P is specified through the 'equally likely' assumption.
◮ That is, P({ω_i}) = 1/n. (Note the notation)
◮ Suppose A = {ω_1, ω_2, ω_3}. Then
  P(A) = P({ω_1} ∪ {ω_2} ∪ {ω_3}) = Σ_{i=1}^3 P({ω_i}) = 3/n = |A|/|Ω|
◮ We can easily see this to be true for any event, A.
◮ This is the usual familiar formula: number of favourable outcomes divided by total number of outcomes.
◮ Thus, 'equally likely' is one way of specifying the probability function (in case of finite Ω).
◮ An obvious point worth remembering: specifying P for singleton events fixes it for all other events.
Review of basic probability

We use the following notation:
◮ Sample space – Ω
  Elements of Ω are the outcomes of the random experiment.
  We write Ω = {ω_1, ω_2, · · · } when it is countable.
◮ An event is, by definition, a subset of Ω
◮ Set of all possible events – F ⊂ 2^Ω (power set of Ω)
  Each event is a subset of Ω

Probability axioms

Probability (or probability measure) is a function that assigns a number to each event and satisfies some properties.
P : F → ℜ, F ⊂ 2^Ω
A1 P(A) ≥ 0, ∀A ∈ F
A2 P(Ω) = 1
A3 If A_i ∩ A_j = φ, ∀i ≠ j, then P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i)
For now, we take F = 2^Ω (power set of Ω)

Some consequences of the axioms

◮ 0 ≤ P(A) ≤ 1
◮ P(A^c) = 1 − P(A)
◮ If A ⊂ B then P(A) ≤ P(B)
◮ If A ⊂ B then P(B − A) = P(B) − P(A)
◮ P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Case of finite Ω – Example

◮ Let Ω = {ω_1, · · · , ω_n}, F = 2^Ω, and P is specified through the 'equally likely' assumption.
◮ That is, P({ω_i}) = 1/n. (Note the notation)
◮ Suppose A = {ω_1, ω_2, ω_3}. Then
  P(A) = P({ω_1} ∪ {ω_2} ∪ {ω_3}) = Σ_{i=1}^3 P({ω_i}) = 3/n = |A|/|Ω|
◮ We can easily see this to be true for any event, A.
◮ This is the usual familiar formula: number of favourable outcomes divided by total number of outcomes.
◮ Thus, 'equally likely' is one way of specifying the probability function (in case of finite Ω).
◮ An obvious point worth remembering: specifying P for singleton events fixes it for all other events.
Case of countably infinite Ω

◮ Let Ω = {ω_1, ω_2, · · · }.
◮ Once again, any A ⊂ Ω can be written as a mutually exclusive union of singleton sets.
◮ Let q_i, i = 1, 2, · · · be numbers such that q_i ≥ 0 and Σ_i q_i = 1.
◮ We can now set P({ω_i}) = q_i, i = 1, 2, · · · .
  (These assumptions on q_i are needed to satisfy P(A) ≥ 0 and P(Ω) = 1.)
◮ This fixes P for all events: P(A) = Σ_{ω∈A} P({ω})
◮ This is how we normally define a probability measure on a countably infinite Ω.
◮ This can be done for finite Ω too.

Example: countably infinite Ω

◮ Consider the random experiment of tossing a biased coin repeatedly till we get a head. We take the outcome of the experiment to be the number of tails we had before the first head.
◮ Here we have Ω = {0, 1, 2, · · · }.
◮ A (reasonable) probability assignment is:
  P({k}) = (1 − p)^k p, k = 0, 1, · · ·
  where p is the probability of head and 0 < p < 1.
  (We assume you understand the idea of 'independent' tosses here.)
◮ In the notation of the previous slide, q_i = (1 − p)^i p.
◮ Easy to see we have q_i ≥ 0 and Σ_{i=0}^∞ q_i = 1.
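As a quick numerical check (not from the slides), we can verify that the partial sums of q_k = (1 − p)^k p approach 1, and that a simulation of "number of tails before the first head" matches the assignment. The value p = 0.3 is an arbitrary illustrative choice.

```python
import random

p = 0.3  # probability of head; an arbitrary illustrative value

# partial sum of q_k = (1-p)^k * p should approach 1
partial = sum((1 - p) ** k * p for k in range(200))

def tails_before_head(rng):
    """Simulate one run: number of tails before the first head."""
    n = 0
    while rng.random() >= p:  # this toss is a tail
        n += 1
    return n

rng = random.Random(0)
samples = [tails_before_head(rng) for _ in range(100_000)]
frac_zero = samples.count(0) / len(samples)  # estimates P({0}) = p
```

With 100,000 runs, `frac_zero` lands close to p = 0.3, as the assignment P({0}) = p predicts.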

Case of uncountably infinite Ω

◮ We would mostly be considering only the cases where Ω is a subset of ℜ^d for some d.
◮ Note that now an event need not be a countable union of singleton sets.
◮ For now we would only consider a simple intuitive extension of the 'equally likely' idea.
◮ Suppose Ω is a finite interval of ℜ. Then we will take P(A) = m(A)/m(Ω), where m(A) is the length of the set A.
◮ We can use this in higher dimensions also by taking m(·) to be an appropriate 'measure' of a set.
◮ For example, in ℜ^2, m(A) denotes the area of A; in ℜ^3 it would be volume, and so on.
  (There are many issues that need more attention here.)

Example: Uncountably infinite Ω

Problem: A rod of unit length is broken at two random points. What is the probability that the three pieces so formed would make a triangle?
◮ Let us take the left end of the rod as origin and let x, y denote the two successive points where the rod is broken.
◮ Then the random experiment is picking two numbers x, y with 0 < x < y < 1.
◮ We can take Ω = {(x, y) : 0 < x < y < 1} ⊂ ℜ^2.
◮ For the pieces to make a triangle, the sum of the lengths of any two should be more than the third.
◮ The lengths are: x, (y − x), (1 − y). So we need
  x + (y − x) > (1 − y) ⇒ y > 0.5
  x + (1 − y) > (y − x) ⇒ y < x + 0.5
  (y − x) + (1 − y) > x ⇒ x < 0.5
◮ So the event of interest is:
  A = {(x, y) : y > 0.5, x < 0.5, y < x + 0.5}
◮ We can visualize Ω = {(x, y) : 0 < x < y < 1} as a triangular region in the plane, with A the smaller triangle cut out of it by the three constraints.
◮ The required probability is the area of A divided by the area of Ω, which gives the answer as 0.25.
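The answer 0.25 can be sanity-checked by Monte Carlo (a rough sketch, not part of the slides): draw two uniform break points, sort them, and test the three triangle inequalities.

```python
import random

rng = random.Random(42)
trials = 200_000
hits = 0
for _ in range(trials):
    u, v = rng.random(), rng.random()
    x, y = min(u, v), max(u, v)      # successive break points, 0 < x < y < 1
    a, b, c = x, y - x, 1 - y        # lengths of the three pieces
    if a + b > c and a + c > b and b + c > a:
        hits += 1
# hits / trials should be close to the exact answer 0.25
```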

◮ Everything we do in probability theory is always in reference to an underlying probability space (Ω, F, P), where
  ◮ Ω is the sample space
  ◮ F ⊂ 2^Ω is the set of events; each event is a subset of Ω
  ◮ P : F → [0, 1] is a probability (measure) that assigns a number between 0 and 1 to every event (satisfying the three axioms).

Conditional Probability

◮ Let B be an event with P(B) > 0. We define the conditional probability, conditioned on B, of any event A as
  P(A | B) = P(A ∩ B)/P(B) = P(AB)/P(B)
◮ The above is a notation. "A | B" does not represent any set operation! (This is an abuse of notation!)
◮ Given B, conditional probability is a new probability assignment to any event.
◮ That is, given B with P(B) > 0, we define a new probability P_B : F → [0, 1] by
  P_B(A) = P(AB)/P(B)
◮ Conditional probability is a probability. What does this mean?
◮ The new function we defined, P_B : F → [0, 1], P_B(A) = P(AB)/P(B), satisfies the three axioms of probability.
◮ P_B(A) ≥ 0 and P_B(Ω) = 1.
◮ If A_1, A_2 are mutually exclusive then A_1B and A_2B are also mutually exclusive and hence
  P_B(A_1 + A_2) = P((A_1 + A_2)B)/P(B) = P(A_1B + A_2B)/P(B) = (P(A_1B) + P(A_2B))/P(B) = P_B(A_1) + P_B(A_2)
◮ Once we understand that conditional probability is a new probability assignment, we go back to the 'standard notation'

P(A | B) = P(AB)/P(B)

◮ Note P(B|B) = 1 and P(A|B) > 0 only if P(AB) > 0.
◮ Now the 'new' probability of each event is determined by what it has in common with B.
◮ If we know the event B has occurred, then based on this knowledge we can readjust the probabilities of all events; that is given by the conditional probability.
◮ Intuitively it is as if the sample space is now reduced to B because we are given the information that B has occurred.
◮ This is a useful intuition as long as we understand it properly.
◮ It is not as if we talk about conditional probability only for subsets of B. Conditional probability is also with respect to the original probability space. Every element of F has a conditional probability defined.

P(A | B) = P(AB)/P(B)

◮ Suppose P(A | B) > P(A). Does it mean "B causes A"?
  P(A | B) > P(A) ⇒ P(AB) > P(A)P(B) ⇒ P(AB)/P(A) > P(B) ⇒ P(B | A) > P(B)
◮ Hence, conditional probabilities cannot actually capture causal influences.
◮ There are probabilistic methods to capture causation (but far beyond the scope of this course!)

◮ In a conditional probability, the conditioning event can be any event (with positive probability).
◮ In particular, it could be an intersection of events.
◮ We think of that as conditioning on multiple events:
  P(A | B, C) = P(A | BC) = P(ABC)/P(BC)
◮ The conditional probability is defined by
  P(A | B) = P(AB)/P(B)
◮ This gives us a useful identity:
  P(AB) = P(A | B)P(B)
◮ We can iterate this for multiple events:
  P(ABC) = P(A | BC)P(BC) = P(A | BC)P(B | C)P(C)

◮ Let B_1, · · · , B_m be events such that ∪_{i=1}^m B_i = Ω and B_iB_j = φ, ∀i ≠ j.
◮ Such a collection of events is said to be a partition of Ω. (They are also sometimes said to be mutually exclusive and collectively exhaustive.)
◮ Given this partition, any other event can be represented as a mutually exclusive union as
  A = AB_1 + · · · + AB_m
  To explain the notation again:
  A = A ∩ Ω = A ∩ (B_1 ∪ · · · ∪ B_m) = (A ∩ B_1) ∪ · · · ∪ (A ∩ B_m)
  Hence, A = AB_1 + · · · + AB_m

Total Probability rule

◮ Let B_1, · · · , B_m be a partition of Ω.
◮ Then, for any event A, we have
  P(A) = P(AB_1 + · · · + AB_m)
       = P(AB_1) + · · · + P(AB_m)
       = P(A|B_1)P(B_1) + · · · + P(A|B_m)P(B_m)
◮ The formula (where the B_i form a partition)
  P(A) = Σ_i P(A | B_i)P(B_i)
  is known as the total probability rule or total probability law or total probability formula.
◮ This is very useful in many situations. ("arguing by cases")

Example: Polya's Urn

An urn contains r red balls and b black balls. We draw a ball at random, note its color, and put back that ball along with c balls of the same color. We keep repeating this process. Let R_n (B_n) denote the event of drawing a red (black) ball at the nth draw. We want to calculate the probabilities of all these events.
◮ It is easy to see that P(R_1) = r/(r+b) and P(B_1) = b/(r+b).
◮ For R_2 we have, using the total probability rule,
  P(R_2) = P(R_2 | R_1)P(R_1) + P(R_2 | B_1)P(B_1)
         = (r+c)/(r+c+b) · r/(r+b) + r/(r+b+c) · b/(r+b)
         = r(r+c+b)/((r+c+b)(r+b)) = r/(r+b) = P(R_1)
◮ Similarly we can show that P(B_2) = P(B_1).
◮ One can show by mathematical induction that P(R_n) = P(R_1) and P(B_n) = P(B_1) for all n. (Left as an exercise for you!)
◮ This does not depend on the value of c!

Bayes Rule

◮ Another important formula based on conditional probability is Bayes rule:
  P(A | B) = P(AB)/P(B) = P(B | A)P(A)/P(B)
◮ This allows one to calculate P(A | B) if we know P(B | A).
◮ Useful in many applications because one conditional probability may be easier to obtain (or estimate) than the other.
◮ Often one uses the total probability rule to calculate the denominator in the RHS above:
  P(A | B) = P(B | A)P(A) / (P(B|A)P(A) + P(B|A^c)P(A^c))
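The claim that P(R_n) = r/(r+b) for every n, whatever the value of c, can also be checked by simulation (a sketch, not from the slides; r = 3, b = 2, c = 5 are arbitrary illustrative values):

```python
import random

def polya_draw_red(n, r, b, c, rng):
    """Run n draws of Polya's urn; return True if the n-th draw is red."""
    red, black = r, b
    is_red = False
    for _ in range(n):
        is_red = rng.random() < red / (red + black)
        if is_red:
            red += c
        else:
            black += c
    return is_red

rng = random.Random(1)
r, b, c = 3, 2, 5          # arbitrary illustrative values
trials = 50_000
freqs = {}
for n in (1, 2, 5):
    freqs[n] = sum(polya_draw_red(n, r, b, c, rng) for _ in range(trials)) / trials
    # each frequency should be near r/(r+b) = 0.6, for any c
```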

Example: Bayes Rule

Let D and D^c denote the events of someone having a disease or not having it. Let T_+ and T_− denote the events of a test for it being positive or negative. (Note that T_+^c = T_−.) We want to calculate P(D|T_+).
◮ We have, by Bayes rule,
  P(D|T_+) = P(T_+|D)P(D) / (P(T_+|D)P(D) + P(T_+|D^c)P(D^c))
◮ The probabilities P(T_+|D) and P(T_+|D^c) can be obtained through, for example, laboratory experiments.
◮ P(T_+|D) is called the true positive rate and P(T_+|D^c) is called the false positive rate.
◮ We also need P(D), the probability of a random person having the disease.

◮ Let us take some specific numbers.
◮ Let P(D) = 0.5, P(T_+|D) = 0.99, P(T_+|D^c) = 0.05. Then
  P(D|T_+) = (0.99 × 0.5) / (0.99 × 0.5 + 0.05 × 0.5) ≈ 0.95
  That is pretty good.
◮ But taking P(D) = 0.5 is not realistic. Let us take P(D) = 0.1.
  P(D|T_+) = (0.99 × 0.1) / (0.99 × 0.1 + 0.05 × 0.9) ≈ 0.69
◮ Now suppose we can improve the test so that P(T_+|D^c) = 0.01:
  P(D|T_+) = (0.99 × 0.1) / (0.99 × 0.1 + 0.01 × 0.9) ≈ 0.92
◮ These different cases are important in understanding the role of the false positive rate.
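All three numerical cases come from the same formula; a small helper (a sketch, not part of the slides) makes that explicit:

```python
def posterior(prior, tpr, fpr):
    """P(D | T+) by Bayes rule; tpr = P(T+|D), fpr = P(T+|D^c)."""
    num = tpr * prior
    return num / (num + fpr * (1 - prior))

print(round(posterior(0.5, 0.99, 0.05), 2))  # 0.95
print(round(posterior(0.1, 0.99, 0.05), 2))  # 0.69
print(round(posterior(0.1, 0.99, 0.01), 2))  # 0.92
```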
◮ In many applications of Bayes rule the same generic situation exists.
◮ P(D) is the probability that a random person has the disease. We call it the prior probability.
◮ P(D|T_+) is the probability of the random person having the disease once we do a test and it comes out positive. We call it the posterior probability.
◮ Bayes rule essentially transforms the prior probability into the posterior probability.

◮ Based on a measurement we want to predict (what may be called) the state of nature.
◮ For another example, take a simple communication system.
  ◮ D can represent the event that the transmitter sent bit 1.
  ◮ T_+ can represent an event about the measurement we made at the receiver.
  ◮ We want the probability that bit 1 was sent based on the measurement.
  ◮ The knowledge we need is P(T_+|D), P(T_+|D^c), which can be determined through experiment or modelling of the channel.

P(D|T_+) = P(T_+|D)P(D) / (P(T_+|D)P(D) + P(T_+|D^c)P(D^c))

◮ Not all applications of Bayes rule involve a 'binary' situation.
◮ Suppose D_1, D_2, D_3 are the (exclusive) possibilities and T is an event about a measurement. Then
  P(D_1|T) = P(T|D_1)P(D_1)/P(T)
           = P(T|D_1)P(D_1) / (P(T|D_1)P(D_1) + P(T|D_2)P(D_2) + P(T|D_3)P(D_3))
           = P(T|D_1)P(D_1) / Σ_i P(T|D_i)P(D_i)

◮ In the binary situation we can think of Bayes rule in a slightly modified form too:
  P(D|T_+)/P(D^c|T_+) = (P(T_+|D)/P(T_+|D^c)) × (P(D)/P(D^c))
◮ This is called the odds-likelihood form of Bayes rule.
  (The ratio of P(A) to P(A^c) is called the odds for A.)
Independent Events

◮ Two events A, B are said to be independent if
  P(AB) = P(A)P(B)
◮ Note that this is a definition. Two events are independent if and only if they satisfy the above.
◮ Suppose P(A), P(B) > 0. Then, if they are independent,
  P(A|B) = P(AB)/P(B) = P(A); similarly P(B|A) = P(B)
◮ This gives an intuitive feel for independence.
◮ Independence is an important (often confusing!) concept.

Example: Independence

A class has 20 female and 30 male course (MTech) students and 6 female and 9 male research (PhD) students. Are gender and degree independent?
◮ Let F, M, C, R denote the events of female, male, course and research students.
◮ From the given numbers, we can easily calculate the following:
  P(F) = 26/65 = 2/5; P(C) = 50/65 = 10/13; P(FC) = 20/65 = 4/13
◮ Hence we can verify
  P(F)P(C) = (2/5)(10/13) = 4/13 = P(FC)
  and conclude that F and C are independent. Similarly we can show for others.
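The class example is just counting; enumerating the four categories checks independence for every gender/degree pair at once (a small sketch, not part of the slides):

```python
from fractions import Fraction

counts = {("F", "C"): 20, ("M", "C"): 30, ("F", "R"): 6, ("M", "R"): 9}
total = sum(counts.values())  # 65 students in all

def prob(pred):
    """Probability of the set of students satisfying pred (equally likely)."""
    return Fraction(sum(v for k, v in counts.items() if pred(k)), total)

for g in ("F", "M"):
    for d in ("C", "R"):
        p_joint = prob(lambda k: k == (g, d))
        p_gender = prob(lambda k: k[0] == g)
        p_degree = prob(lambda k: k[1] == d)
        assert p_joint == p_gender * p_degree  # holds for all four pairs
print("gender and degree are independent for all four combinations")
```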

◮ If A and B are independent then so are A and B^c.
◮ Using A = AB + AB^c, we have
  P(AB^c) = P(A) − P(AB) = P(A)(1 − P(B)) = P(A)P(B^c)
◮ This also shows that A^c and B are independent and so are A^c and B^c.
◮ For example, in the previous problem, once we saw that F and C are independent, we can conclude that M and C are also independent (because in this example we are taking F^c = M).

◮ In this example, if we keep all other numbers the same but change the number of male research students to, say, 12, then the independence no longer holds. (26/68 × 50/68 ≠ 20/68)
◮ One needs to be careful about independence!
◮ We always have an underlying probability space (Ω, F, P).
◮ Once that is given, the probabilities of all events are fixed.
◮ Hence whether or not two events are independent is a matter of 'calculation'.
◮ Consider the random experiment of tossing two fair coins (or tossing a coin twice).
◮ Ω = {HH, HT, TH, TT}. Suppose we employ the 'equally likely' idea.
◮ That is, P({HH}) = 1/4, P({HT}) = 1/4 and so on.
◮ Let A = 'H on 1st toss' = {HH, HT} (P(A) = 1/2)
  Let B = 'T on second toss' = {HT, TT} (P(B) = 1/2)
◮ We have P(AB) = P({HT}) = 0.25
◮ Since P(A)P(B) = (1/2)(1/2) = 1/4 = P(AB), A, B are independent.
◮ Hence, in multiple tosses, assuming all outcomes are equally likely implies that the outcome of one toss is independent of another.

◮ In multiple tosses, assuming all outcomes are equally likely is alright if the coin is fair.
◮ Suppose we toss a biased coin two times.
◮ Then the four outcomes are, obviously, not 'equally likely'.
◮ How should we then assign these probabilities?
◮ If we assume tosses are independent then we can assign probabilities easily.

◮ Consider a toss of a biased coin: Ω_1 = {H, T}, P({H}) = p and P({T}) = 1 − p.
◮ If we toss this twice then Ω_2 = {HH, HT, TH, TT} and we assign
  P({HH}) = p^2, P({HT}) = p(1 − p), P({TH}) = (1 − p)p, P({TT}) = (1 − p)^2.
◮ P({HH, HT}) = p^2 + p(1 − p) = p
◮ This assignment ensures that P({HH}) equals the product of the probability of H on the 1st toss and H on the second toss.
◮ Essentially, Ω_2 is a cartesian product of Ω_1 with itself and we used products of the corresponding probabilities.
◮ For any independent repetitions of a random experiment we follow this.
  (We will look at it more formally when we consider multiple random variables.)

◮ In many situations calculating probabilities of intersections of events is difficult.
◮ One often assumes A and B are independent to calculate P(AB).
◮ As we saw, if A and B are independent, then P(A|B) = P(A).
◮ This is often used, at an intuitive level, to justify the assumption of independence.
Independence of multiple events

◮ Events A_1, A_2, · · · , A_n are said to be (totally) independent if for any k, 1 ≤ k ≤ n, and any indices i_1, · · · , i_k, we have
  P(A_{i_1} · · · A_{i_k}) = P(A_{i_1}) · · · P(A_{i_k})
◮ For example, A, B, C are independent if
  P(AB) = P(A)P(B); P(AC) = P(A)P(C); P(BC) = P(B)P(C); P(ABC) = P(A)P(B)P(C)

Pair-wise independence

◮ Events A_1, A_2, · · · , A_n are said to be pair-wise independent if
  P(A_iA_j) = P(A_i)P(A_j), ∀i ≠ j
◮ Events may be pair-wise independent but not (totally) independent.
◮ Example: Four balls in a box inscribed with '1', '2', '3' and '123'. Let E_i be the event that number 'i' appears on a randomly drawn ball, i = 1, 2, 3.
  ◮ Easy to see: P(E_i) = 0.5, i = 1, 2, 3.
  ◮ P(E_iE_j) = 0.25 (i ≠ j) ⇒ pairwise independent
  ◮ But P(E_1E_2E_3) = 0.25 ≠ (0.5)^3
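The four-ball example can be checked by direct enumeration over the equally likely draws (a sketch, not part of the slides):

```python
from fractions import Fraction

balls = ["1", "2", "3", "123"]          # the four equally likely balls

def P(event):
    """Probability of drawing a ball in `event` (a set of balls)."""
    return Fraction(len(event), len(balls))

E = {i: {b for b in balls if str(i) in b} for i in (1, 2, 3)}

for i in (1, 2, 3):
    assert P(E[i]) == Fraction(1, 2)
for i in (1, 2, 3):
    for j in (1, 2, 3):
        if i != j:
            assert P(E[i] & E[j]) == P(E[i]) * P(E[j])   # pairwise holds
assert P(E[1] & E[2] & E[3]) == Fraction(1, 4)           # but the triple
assert Fraction(1, 4) != Fraction(1, 2) ** 3             # breaks total indep.
print("pairwise but not totally independent")
```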

Recap

◮ Everything we do in probability theory is always in reference to an underlying probability space (Ω, F, P), where
  ◮ Ω is the sample space
  ◮ F ⊂ 2^Ω is the set of events; each event is a subset of Ω
  ◮ P : F → [0, 1] is a probability (measure) that satisfies the three axioms:
    A1 P(A) ≥ 0, ∀A ∈ F
    A2 P(Ω) = 1
    A3 If A_i ∩ A_j = φ, ∀i ≠ j, then P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i)

Recap

◮ When Ω = {ω_1, ω_2, · · · } (is countable), then probability is generally assigned by
  P({ω_i}) = q_i, i = 1, 2, · · · , with q_i ≥ 0, Σ_i q_i = 1
◮ When Ω is finite with n elements, a special case is q_i = 1/n, ∀i. (All outcomes equally likely)

PS Sastry, IISc, Bangalore, 2020
Recap

◮ Conditional probability of A given (or conditioned on) B is
  P(A|B) = P(AB)/P(B)
◮ This gives us the identity: P(AB) = P(A|B)P(B)
◮ This holds for multiple events, e.g., P(ABC) = P(A|BC)P(B|C)P(C)
◮ Given a partition, Ω = B_1 + B_2 + · · · + B_m, for any event A,
  P(A) = Σ_{i=1}^m P(A|B_i)P(B_i) (Total Probability rule)

Recap

◮ Bayes Rule:
  P(D|T) = P(T|D)P(D) / (P(T|D)P(D) + P(T|D^c)P(D^c))
◮ Bayes rule can be viewed as transforming a prior probability into a posterior probability.

Recap: Independence

◮ Two events A, B are said to be independent if
  P(AB) = P(A)P(B)
◮ Suppose P(A), P(B) > 0. Then, if they are independent,
  P(A|B) = P(AB)/P(B) = P(A); similarly P(B|A) = P(B)
◮ If A, B are independent then so are A & B^c, A^c & B and A^c & B^c.

Independence of multiple events

◮ Events A_1, A_2, · · · , A_n are said to be (totally) independent if for any k, 1 ≤ k ≤ n, and any indices i_1, · · · , i_k, we have
  P(A_{i_1} · · · A_{i_k}) = P(A_{i_1}) · · · P(A_{i_k})
◮ For example, A, B, C are independent if
  P(AB) = P(A)P(B); P(AC) = P(A)P(C); P(BC) = P(B)P(C); P(ABC) = P(A)P(B)P(C)
Pair-wise independence

◮ Events A_1, A_2, · · · , A_n are said to be pair-wise independent if
  P(A_iA_j) = P(A_i)P(A_j), ∀i ≠ j
◮ Events may be pair-wise independent but not (totally) independent.
◮ Example: Four balls in a box inscribed with '1', '2', '3' and '123'. Let E_i be the event that number 'i' appears on a randomly drawn ball, i = 1, 2, 3.
  ◮ Easy to see: P(E_i) = 0.5, i = 1, 2, 3.
  ◮ P(E_iE_j) = 0.25 (i ≠ j) ⇒ pairwise independent
  ◮ But P(E_1E_2E_3) = 0.25 ≠ (0.5)^3

Conditional Independence

◮ Events A, B are said to be (conditionally) independent given C if
  P(AB|C) = P(A|C)P(B|C)
◮ If the above holds,
  P(A|BC) = P(ABC)/P(BC) = P(AB|C)P(C)/P(BC) = P(A|C)P(B|C)P(C)/P(BC) = P(A|C)
◮ Events may be conditionally independent but not independent. (e.g., 'independent' multiple tests for confirming a disease)
◮ It is also possible that A, B are independent but are not conditionally independent given some other event C.

Use of conditional independence in Bayes rule

◮ We can write Bayes rule with multiple conditioning events:
  P(A|BC) = P(BC|A)P(A) / (P(BC|A)P(A) + P(BC|A^c)P(A^c))
◮ The above gets simplified if we assume
  P(BC|A) = P(B|A)P(C|A), P(BC|A^c) = P(B|A^c)P(C|A^c)
◮ Consider the old example, where now we repeat the test for the disease.
◮ Take: A = D, B = T_+^1, C = T_+^2.
◮ Assuming conditional independence we can calculate the new posterior probability using the same information we had about the true positive and false positive rates.

◮ Let us consider the example with P(T_+|D) = 0.99, P(T_+|D^c) = 0.05, P(D) = 0.1.
◮ Recall that we got P(D|T_+) ≈ 0.69.
◮ Let us suppose the same test is repeated.
  P(D | T_+^1 T_+^2) = P(T_+^1 T_+^2 | D)P(D) / (P(T_+^1 T_+^2 | D)P(D) + P(T_+^1 T_+^2 | D^c)P(D^c))
                     = P(T_+^1|D)P(T_+^2|D)P(D) / (P(T_+^1|D)P(T_+^2|D)P(D) + P(T_+^1|D^c)P(T_+^2|D^c)P(D^c))
                     = (0.99 × 0.99 × 0.1) / (0.99 × 0.99 × 0.1 + 0.05 × 0.05 × 0.9) ≈ 0.98
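Under the conditional-independence assumption, the repeated-test posterior is the same Bayes computation with products of likelihoods (a sketch, not part of the slides):

```python
def posterior_two_tests(prior, tpr, fpr):
    """P(D | T+1, T+2), assuming the two tests are conditionally
    independent given the disease status; tpr = P(T+|D), fpr = P(T+|D^c)."""
    num = tpr * tpr * prior
    return num / (num + fpr * fpr * (1 - prior))

p = posterior_two_tests(0.1, 0.99, 0.05)
# p is about 0.98, up from about 0.69 after a single positive test
```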
Sequential Continuity of Probability

◮ A function f : ℜ → ℜ is continuous at x if and only if x_n → x implies f(x_n) → f(x).
◮ Thus, for continuous functions,
  f(lim_{n→∞} x_n) = lim_{n→∞} f(x_n)
◮ We want to ask whether probability, which is a function whose domain is F, is also continuous like this.
◮ That is, we want to ask the question
  P(lim_{n→∞} A_n) =? lim_{n→∞} P(A_n)
◮ For this, we need to first define the limit of a sequence of sets.

◮ For now we define limits of only monotone sequences. (We will look at the general case later in the course.)
◮ A sequence A_1, A_2, · · · is said to be monotone decreasing if
  A_{n+1} ⊂ A_n, ∀n (denoted as A_n ↓)
◮ A sequence A_1, A_2, · · · is said to be monotone increasing if
  A_n ⊂ A_{n+1}, ∀n (denoted as A_n ↑)
◮ Let A_n ↓. Then we define its limit as
  lim_{n→∞} A_n = ∩_{k=1}^∞ A_k
◮ This is reasonable because, when A_n ↓, we have A_n ⊂ A_{n−1} ⊂ A_{n−2} · · · and hence A_n = ∩_{k=1}^n A_k.
◮ Similarly, when A_n ↑, we define the limit as
  lim_{n→∞} A_n = ∪_{k=1}^∞ A_k
◮ Let us look at simple examples of monotone sequences of subsets of ℜ.
◮ Consider a sequence of intervals A_n = [a, b + 1/n), n = 1, 2, · · · with a, b ∈ ℜ, a < b.
◮ We have A_n ↓ and lim A_n = ∩_i A_i = [a, b]
◮ Why? – because
  ◮ b ∈ A_n, ∀n ⇒ b ∈ ∩_i A_i, and
  ◮ ∀ε > 0, b + ε ∉ A_n after some n ⇒ b + ε ∉ ∩_i A_i.
    For example, b + 0.01 ∉ A_101 = [a, b + 1/101).
◮ We have shown that ∩_n [a, b + 1/n) = [a, b]
◮ Similarly we can get ∩_n (a − 1/n, b] = [a, b]
◮ Now consider A_n = [a, b − 1/n].
◮ Now A_n ↑ and lim A_n = ∪_n A_n = [a, b).
◮ Why? – because
  ◮ ∀ε > 0, ∃n s.t. b − ε ∈ A_n ⇒ b − ε ∈ ∪_n A_n;
  ◮ but b ∉ A_n, ∀n ⇒ b ∉ ∪_n A_n.
◮ These examples also show how, using countable unions or intersections, we can convert one end of an interval from 'open' to 'closed' or vice versa.

◮ To summarize, limits of monotone sequences of events are defined as follows:
  A_n ↓: lim_{n→∞} A_n = ∩_{k=1}^∞ A_k
  A_n ↑: lim_{n→∞} A_n = ∪_{k=1}^∞ A_k
◮ Having defined the limits, we now ask the question
  P(lim_{n→∞} A_n) =? lim_{n→∞} P(A_n)
  where we assume the sequence is monotone.

Theorem: Let A_n ↑. Then P(lim_n A_n) = lim_n P(A_n)
◮ Since A_n ↑, A_n ⊂ A_{n+1}.
◮ Define sets B_i, i = 1, 2, · · · , by
  B_1 = A_1, B_k = A_k − A_{k−1}, k = 2, 3, · · ·
  (Figure: nested sets A_1 ⊂ A_2 ⊂ A_3, with B_2, B_3 the successive rings between them.)
◮ Note that the B_k are mutually exclusive. Also note that
  A_n = ∪_{k=1}^n B_k and hence P(A_n) = Σ_{k=1}^n P(B_k)
◮ We also have
  ∪_{k=1}^n A_k = ∪_{k=1}^n B_k, ∀n and hence ∪_{k=1}^∞ A_k = ∪_{k=1}^∞ B_k
◮ Thus we get
  P(lim A_n) = P(∪_{k=1}^∞ A_k) = P(∪_{k=1}^∞ B_k) = Σ_{k=1}^∞ P(B_k) = lim_n Σ_{k=1}^n P(B_k) = lim_n P(A_n)

◮ We showed that when A_n ↑, P(lim_n A_n) = lim_n P(A_n)
◮ We can show this for the case A_n ↓ also.
◮ Note that if A_n ↓, then A_n^c ↑. Using this and the theorem we can show it. (Left as an exercise)
◮ This property is known as monotone sequential continuity of the probability measure.

◮ We can think of a simple example to use this theorem.
◮ We keep tossing a fair coin (we take tosses to be independent). We want to show that never getting a head has probability zero.
◮ The basic idea is simple: (0.5)^n → 0.
◮ But to formalize this we need to specify our probability space and then specify the event (of never getting a head).
◮ If we toss the coin any fixed N times then we know the sample space can be {0, 1}^N.
◮ But for our problem, we cannot put any fixed limit on the number of tosses and hence our sample space should be for infinite tosses of a coin.
◮ We take Ω as the set of all infinite sequences of 0's and 1's:
  Ω = {(ω_1, ω_2, · · · ) : ω_i ∈ {0, 1}, ∀i}
◮ This would be uncountably infinite.
◮ We would not specify F fully. But assume that any subset of Ω specifiable through outcomes of finitely many coin tosses would be an event.
◮ Thus "no head in the first n tosses" would be an event.
◮ What P should we consider for this uncountable Ω? We are not sure what to take.
◮ So, let us ask only for some consistency: any subset of this Ω that is specified only through the outcomes of the first n tosses should have the same probability as in the finite probability space corresponding to n tosses.
◮ Consider an event here:
  A = {(ω_1, ω_2, · · · ) : ω_1 = ω_2 = 0} ⊂ Ω
  A is the event of tails on the first two tosses.
◮ We are saying we must have P(A) = (0.5)^2.

◮ For n = 1, 2, · · · , define
  A_n = {(ω_1, ω_2, · · · ) : ω_i = 0, i = 1, · · · , n}
◮ A_n is the event of no head in the first n tosses and we know P(A_n) = (0.5)^n.
◮ Note that ∩_{k=1}^∞ A_k is the event we want.
◮ Note that A_n ↓ because A_{n+1} ⊂ A_n.
◮ Hence we get
  P(∩_{k=1}^∞ A_k) = P(lim_n A_n) = lim_n P(A_n) = lim_n (0.5)^n = 0
◮ Now we can complete the problem.
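The limiting argument can be illustrated numerically (a sketch, not from the slides): the exact value P(A_n) = (0.5)^n shrinks toward 0, and simulated runs of n fair tosses agree.

```python
import random

rng = random.Random(7)
trials = 100_000
sims = {}
for n in (1, 5, 10):
    # fraction of runs of n fair tosses with no head (all 'tails')
    sims[n] = sum(all(rng.random() < 0.5 for _ in range(n))
                  for _ in range(trials)) / trials
    # exact value is 0.5**n, which tends to 0 as n grows
```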

◮ The Ω we considered can be corresponded with the
  interval [0, 1].
◮ Each element of Ω is an infinite sequence of 0’s and 1’s
  ω = (ω1 , ω2 , · · · ), ωi ∈ {0, 1} ∀i
◮ We can put a ‘binary point’ in front and thus consider ω
  to be a real number between 0 and 1.
◮ That is, we correspond ω with the real number:
  ω1 2^{-1} + ω2 2^{-2} + · · ·
◮ For example, the sequence (0, 1, 0, 1, 0, 0, 0, · · · ) would be
  the number: 2^{-2} + 2^{-4} = 5/16.
◮ Essentially, every number in [0, 1] can be represented by a
  binary sequence like this and every binary sequence
  corresponds to a real number between 0 and 1.
◮ Thus, our Ω can be thought of as the interval [0, 1].

◮ The P we considered would be such that the probability of
  an interval would be its length.
◮ Consider the example event we considered earlier
  A = {(ω1 , ω2 , ....) : ω1 = ω2 = 0} ⊂ Ω
◮ When we view Ω as the interval [0, 1], the above is
  the set of all binary numbers of the form 0.00xxxxxx · · · .
◮ What is this set of numbers?
◮ It ranges from 0.000000 · · · to 0.0011111 · · · .
◮ That is the interval [0, 0.25].
◮ As we already saw, the probability of this event is (0.5)^2,
  which is the length of this interval.
◮ So, uncountable Ω arise naturally if we want to consider
  infinite repetitions of a random experiment.
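The correspondence between toss sequences and points of [0, 1] can be checked numerically. A minimal sketch (the helper name `to_real` is ours, not from the slides), using exact rational arithmetic:

```python
from fractions import Fraction

def to_real(bits):
    # Map a finite prefix (w1, w2, ...) of a toss sequence to the
    # number w1*2^-1 + w2*2^-2 + ... in [0, 1].
    return sum(Fraction(b, 2 ** (i + 1)) for i, b in enumerate(bits))

# The sequence (0, 1, 0, 1, 0, 0, 0, ...) from the slide:
print(to_real([0, 1, 0, 1]))  # 5/16
```

Trailing zeros do not change the value, so the finite prefix suffices for this example.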
◮ We looked at this probability space only for an example
  where we could use monotone sequential continuity of
  probability.
◮ But this probability space is important and has a lot of
  interesting properties.
  Ω = {(ω1 , ω2 , · · · ) : ωi ∈ {0, 1}, ∀i}
◮ Here, (1/n) Σ_{i=1}^n ωi is the fraction of heads in the
  first n tosses.
◮ Since we are tossing a fair coin repeatedly, we should
  expect
  lim_{n→∞} (1/n) Σ_{i=1}^n ωi = 1/2
◮ We expect this to be true for ‘almost all’ sequences in Ω.
◮ That means ‘almost all’ numbers in [0, 1], when expanded
  as infinite binary fractions, satisfy this property.
◮ This is called Borel’s normal number theorem and is an
  interesting result about real numbers.

Probability Models

◮ As mentioned earlier, everything in probability theory is
  with reference to an underlying probability space:
  (Ω, F, P).
◮ Probability theory starts with (Ω, F, P).
◮ We can say that different P correspond to different
  models.
◮ Theory does not tell you how to get the P.
◮ The modeller has to decide what P she wants.
◮ The theory allows one to derive consequences or
  properties of the model.
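The convergence of the fraction of heads can be illustrated (not proved) by simulating one long toss sequence; the variable names and the seed are our own choices:

```python
import random

random.seed(1)
n = 100_000
tosses = [random.randint(0, 1) for _ in range(n)]  # fair coin: 0 = T, 1 = H
frac_heads = sum(tosses) / n
# For a 'typical' sequence the fraction of heads settles near 1/2
print(round(frac_heads, 2))
```

A single simulated sequence behaves like a ‘typical’ point of [0, 1]; Borel’s theorem is the statement that the exceptional sequences form a set of probability zero.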

◮ Consider the random experiment of tossing a fair coin
  three times.
◮ We can take Ω = {0, 1}^3 and can use the following P1 .

  ω     P1 ({ω})
  000   1/8
  001   1/8
  010   1/8
  011   1/8
  100   1/8
  101   1/8
  110   1/8
  111   1/8

◮ Now probability theory can derive many consequences:
  ◮ The tosses are independent
  ◮ Probability of 0 or 3 heads is 1/8 while that of 1 or 2
    heads is 3/8

◮ Now consider a P2 (different from P1 ) on the same Ω

  ω     P2 ({ω})
  000   1/4
  001   1/12
  010   1/12
  011   1/12
  100   1/12
  101   1/12
  110   1/12
  111   1/4

◮ The consequences now change
◮ The probabilities that the number of heads is 0, 1, 2 or 3
  are all the same and all equal 1/4.
◮ The tosses are not independent
◮ We cannot ask which is the ‘correct’ probability model
  here.
◮ Such a question is meaningless as far as probability theory
  is concerned.
◮ One chooses a model based on the application.
◮ If we think tosses are independent then we choose P1 .
  But if we need to model some dependence among tosses,
  we choose a model like P2 .

◮ The model P2 accommodates some dependence among
  tosses.
◮ Outcomes of previous tosses affect the current toss.

  ω     P2 ({ω})
  000   1/4  (= (1/2)(2/3)(3/4))
  001   1/12 (= (1/2)(2/3)(1/4))
  010   1/12 (= (1/2)(1/3)(2/4))
  011   1/12
  100   1/12
  101   1/12
  110   1/12
  111   1/4

◮ It is also a useful model.
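The consequences claimed for the two models can be recomputed by brute force over the eight outcomes; this is a sketch with our own helper names:

```python
from itertools import product

outcomes = list(product([0, 1], repeat=3))          # 1 = heads
P1 = {w: 1 / 8 for w in outcomes}
P2 = {w: 1 / 4 if sum(w) in (0, 3) else 1 / 12 for w in outcomes}

def prob(P, event):
    # P(event), where the event is given as a set of outcomes
    return sum(p for w, p in P.items() if w in event)

def heads(k):
    return {w for w in outcomes if sum(w) == k}

print([round(prob(P1, heads(k)), 3) for k in range(4)])  # [0.125, 0.375, 0.375, 0.125]
print([round(prob(P2, heads(k)), 3) for k in range(4)])  # [0.25, 0.25, 0.25, 0.25]

# Under P2 the tosses are not independent:
A = {w for w in outcomes if w[0] == 1}               # head on first toss
B = {w for w in outcomes if w[1] == 1}               # head on second toss
print(prob(P2, A & B), prob(P2, A) * prob(P2, B))    # 1/3 vs 1/4
```

The last line exhibits a pair of events whose joint probability under P2 differs from the product of their probabilities, which is exactly the failure of independence.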

We next consider the concept of random variables. These
allow one to specify and analyze different probability models.

This entire course can be considered as studying different
random variables.

Random Variable

◮ A random variable is a real-valued function on Ω:
  X : Ω → ℜ
◮ For example, Ω = {H, T }, X (H) = 1, X (T ) = 0.
◮ Another example: Ω = {H, T }^3 , X (ω) is the number of H’s.
◮ A random variable maps each outcome to a real number.
◮ It essentially means we can treat all outcomes as real
  numbers.
◮ We can effectively work with ℜ as the sample space in all
  probability models.
Recap: Monotone Sequences of Sets

◮ A sequence, A1 , A2 , · · · , is said to be monotone
  decreasing if
  An+1 ⊂ An , ∀n (denoted as An ↓)
◮ Limit of a monotone decreasing sequence is
  An ↓: lim_{n→∞} An = ∩_{k=1}^∞ Ak
◮ A sequence, A1 , A2 , · · · , is said to be monotone
  increasing if
  An ⊂ An+1 , ∀n (denoted as An ↑)
◮ Limit of a monotone increasing sequence is
  An ↑: lim_{n→∞} An = ∪_{k=1}^∞ Ak

Recap: Monotone Sequential Continuity

◮ We showed that
  P (lim_{n→∞} An ) = lim_{n→∞} P (An )
  when An ↓ or An ↑

Random Variable

◮ A random variable is a real-valued function on Ω:
  X : Ω → ℜ
◮ For example, Ω = {H, T }, X(H) = 1, X(T ) = 0.
◮ Another example: Ω = {H, T }^3 , X(ω) is the number of H’s.
◮ A random variable maps each outcome to a real number.
◮ It essentially means we can treat all outcomes as real
  numbers.
◮ We can effectively work with ℜ as the sample space in all
  probability models.

◮ Let (Ω, F, P ) be our probability space and let X be a
  random variable defined in this probability space.
◮ We know X maps Ω into ℜ.
◮ This random variable results in a new probability space:
  (Ω, F, P ) −X→ (ℜ, B, PX )
  where ℜ is the new sample space, B ⊂ 2^ℜ is the new
  set of events and PX is a probability defined on B.
◮ For now we will assume that any subset of ℜ that we want
  would be in B and hence is an event.
◮ PX is a new probability measure (which depends on P
  and X) that assigns probability to different subsets of ℜ.
◮ Given a probability space (Ω, F, P ), a random variable X
  (Ω, F, P ) −X→ (ℜ, B, PX )
◮ We define PX :
  PX (B) = P ({ω ∈ Ω : X(ω) ∈ B}) , B ∈ B

  (Figure: X maps the sample space Ω to the real line; the
  event B is a subset of the real line.)

◮ We use the notation
  [X ∈ B] = {ω ∈ Ω : X(ω) ∈ B}
◮ So, now we can write
  PX (B) = P ([X ∈ B]) = P [X ∈ B]
◮ For the definition of PX to be proper, for each B ∈ B, we
  must have [X ∈ B] ∈ F.
  We will assume that. (This is trivially true if F = 2^Ω ).
◮ We can easily verify PX is a probability measure. It
  satisfies the axioms.

◮ Given a probability space (Ω, F, P ), a random variable X
◮ We define PX :
  PX (B) = P [X ∈ B] = P ({ω ∈ Ω : X(ω) ∈ B})
◮ Easy to see: PX (B) ≥ 0, ∀B and PX (ℜ) = 1
◮ If B1 ∩ B2 = φ then
  PX (B1 ∪ B2 ) = P [X ∈ B1 ∪ B2 ]
               = P [X ∈ B1 ] + P [X ∈ B2 ] = PX (B1 ) + PX (B2 )

◮ Let us look at a couple of simple examples.
◮ Let Ω = {H, T } and P (H) = p.
  Let X(H) = 1; X(T ) = 0.
  [X ∈ {0}] = {ω : X(ω) = 0} = {T }
  [X ∈ [−3.14, 0.552] ] = {ω : −3.14 ≤ X(ω) ≤ 0.552} = {T }
  [X ∈ (0.62, 15.5)] = {ω : 0.62 < X(ω) < 15.5} = {H}
  [X ∈ [−2, 2) ] = Ω
◮ Hence we get
  PX ({0}) = (1 − p) = PX ([−3.14, 0.552])
  PX ((0.62, 15.5)) = p;  PX ([−2, 2)) = 1
◮ Let Ω = {H, T }^3 = {HHH, HHT, · · · , T T T }.
  Let P be specified through the ‘equally likely’ assignment.
  Let X(ω) be the number of H’s in ω. Thus, X(T HT ) = 1.
  (X takes one of the values: 0, 1, 2, or 3)
◮ We can once again write down [X ∈ B] for different
  B ⊂ ℜ
  [X ∈ (0, 1] ] = {HT T, T HT, T T H}
  [X ∈ (−1.2, 2.78) ] = Ω − {HHH}
◮ Hence
  PX ((0, 1]) = 3/8 ;  PX ((−1.2, 2.78)) = 7/8

◮ A random variable defined on (Ω, F, P ) results in a new
  or induced probability space (ℜ, B, PX ).
◮ The Ω may be countable or uncountable (even though we
  looked at only examples of finite Ω).
◮ Thus, we can study probability models by taking ℜ as
  the sample space through the use of random variables.
◮ However there are some technical issues regarding what B
  we should consider.
◮ We briefly consider this and then move on to studying
  random variables.
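The induced measure PX for this three-toss example can be computed by enumerating Ω; a sketch with our own helper names, where an event B ⊂ ℜ is encoded as a membership test:

```python
from itertools import product

Omega = [''.join(t) for t in product('HT', repeat=3)]    # equally likely, P({w}) = 1/8
X = {w: w.count('H') for w in Omega}                     # X(w) = number of H's in w

def P_X(B):
    # P_X(B) = P({w : X(w) in B}), with B given as a predicate on reals
    return sum(1 / 8 for w in Omega if B(X[w]))

print(P_X(lambda x: 0 < x <= 1))       # P_X((0,1])        = 3/8
print(P_X(lambda x: -1.2 < x < 2.78))  # P_X((-1.2, 2.78)) = 7/8
```

Representing B as a predicate keeps the sketch close to the definition PX(B) = P[X ∈ B] without having to enumerate Borel sets.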

◮ We want to look at the probability space (ℜ, B, PX ).
◮ If we could take B = 2^ℜ then everything would be simple.
  But that is not feasible.
◮ What this means is that if we want every subset of the real
  line to be an event, we cannot construct a probability
  measure (to satisfy the axioms).

◮ Let us consider Ω = [0, 1].
◮ This is the simplest example of uncountable Ω we
  considered.
◮ We also saw that this sample space comes up when we
  consider infinite tosses of a coin.
◮ The simplest extension of the idea of ‘equally likely’ is to
  say the probability of an event (subset of Ω) is the length of
  the event (subset).
◮ But not all subsets of [0, 1] are intervals and length is
  defined only for intervals.
◮ We can define the length of a countable union of disjoint
  intervals to be the sum of the lengths of the individual
  intervals.
◮ But what about subsets that may not be countable
  unions of disjoint intervals?
◮ Well, we say those can be assigned probability by using
  the axioms.
◮ Thus the question is the following:
◮ Can we construct a function
  m : 2^[0,1] → [0, 1] such that
  1. m(A) = length(A) if A ⊂ [0, 1] is an interval
  2. m(∪i Ai ) = Σi m(Ai ) where Ai ∩ Aj = φ whenever
     i ≠ j, (A1 , A2 , · · · ⊂ [0, 1])
◮ The surprising answer is ‘NO’
◮ This is a fundamental result in real analysis.
◮ Hence for the probability space (ℜ, B, PX ) we cannot
  take B = 2^ℜ .
  (Recall that for countable Ω we can take F = 2^Ω ).
◮ Now the question is what is the best B we can have?

σ-algebra

◮ An F ⊂ 2^Ω is called a σ-algebra (also called σ-field) on Ω
  if it satisfies the following:
  1. Ω ∈ F
  2. A ∈ F ⇒ A^c ∈ F
  3. A1 , A2 , · · · ∈ F ⇒ ∪i Ai ∈ F
◮ Thus a σ-algebra is a collection of subsets of Ω that is
  closed under complements and countable unions (and
  hence countable intersections because ∩i Ai = (∪i Ai^c )^c ).
◮ Note that 2^Ω is obviously a σ-algebra
◮ In a probability space (Ω, F, P ), if F ≠ 2^Ω then we want
  it to be a σ-algebra. (Why?)

◮ Easy to construct examples of σ-algebras
  Let A ⊂ Ω.
  F = {Ω, φ, A, A^c } is a σ-algebra
◮ For example, with Ω = {1, 2, 3, 4, 5, 6},
  F = {Ω, φ, {1, 3, 5}, {2, 4, 6}} is a σ-algebra
◮ Suppose on this Ω we want to make a σ-algebra
  containing {1, 2} and {3, 4}:
  {Ω, φ, {1, 2}, {3, 4}, {3, 4, 5, 6}, {1, 2, 5, 6}, {1, 2, 3, 4}, {5, 6}}
◮ This is the ‘smallest’ σ-algebra containing {1, 2}, {3, 4}

◮ Let F1 , F2 be σ-algebras on Ω.
◮ Then, so is F1 ∩ F2 .
◮ It is simple to show.
  (E.g., A ∈ F1 ∩ F2 ⇒ A ∈ F1 , A ∈ F2 ⇒ A^c ∈ F1 ,
  A^c ∈ F2 ⇒ A^c ∈ F1 ∩ F2 )
◮ Let G ⊂ 2^Ω . We denote by σ(G) the smallest σ-algebra
  containing G.
◮ It is defined as the intersection of all σ-algebras
  containing G (and hence is well defined).
◮ Let us get back to the question we started with.
◮ In the probability space (ℜ, B, P ), what is the B we
  should choose?
◮ We can choose it to be the smallest σ-algebra containing
  all intervals.
◮ That is called the Borel σ-algebra, B.
◮ It contains all intervals, all complements, countable
  unions and intersections of intervals and all sets that can
  be obtained through complements, countable unions
  and/or intersections of such sets and so on.

Borel σ-algebra

◮ Let G = {(−∞, x] : x ∈ ℜ}
◮ We can define the Borel σ-algebra, B, as the smallest
  σ-algebra containing G.
◮ We can see that B would contain all intervals.
  1. (−∞, x) ∈ B because (−∞, x) = ∪n (−∞, x − 1/n]
  2. (x, ∞) ∈ B because (x, ∞) = (−∞, x]^c
  3. [x, ∞) ∈ B because [x, ∞) = ∩n (x − 1/n, ∞)
  4. (x, y] ∈ B because (x, y] = (−∞, y] ∩ (x, ∞)
  5. [x, y] ∈ B because [x, y] = ∩n (x − 1/n, y]
  6. [x, y), (x, y) ∈ B, similarly
◮ Thus, σ(G) is also the smallest σ-algebra containing all
  intervals.

Borel σ-algebra

◮ We have defined B as
  B = σ ({(−∞, x] : x ∈ ℜ})
◮ It is also the smallest σ-algebra containing all intervals.
◮ Elements of B are called Borel sets
◮ Intervals (including singleton sets), complements of
  intervals, countable unions and intersections of intervals,
  countable unions and intersections of such sets and so on
  are all Borel sets.
◮ The Borel σ-algebra contains enough sets for our purposes.
◮ Are there any subsets of the real line that are not Borel?
◮ YES!! Infinitely many non-Borel sets would be there!

Random Variables

◮ Given a probability space (Ω, F, P ), a random variable is
  a real-valued function on Ω.
◮ It essentially results in an induced probability space
  (Ω, F, P ) −X→ (ℜ, B, PX )
  where B is the Borel σ-algebra.
◮ We define PX as: for all Borel sets, B ⊂ ℜ,
  PX (B) = P [X ∈ B] = P ({ω ∈ Ω : X(ω) ∈ B})
◮ For X to be a random variable, the following should also
  hold
  [X ∈ B] = {ω ∈ Ω : X(ω) ∈ B} ∈ F, ∀B ∈ B
◮ We always assume this.
◮ Let X be a random variable.
◮ It represents a probability model with ℜ as the sample
  space.
◮ The probability assigned to different events (Borel subsets
  of ℜ) is
  PX (B) = P [X ∈ B] = P ({ω ∈ Ω : X(ω) ∈ B})
◮ How does one represent this probability measure?

Distribution function of a random variable

◮ Let X be a random variable. Its distribution function is
  FX : ℜ → ℜ defined by
  FX (x) = P [X ∈ (−∞, x] ] = P ({ω ∈ Ω : X(ω) ≤ x})
◮ We write the event {ω : X(ω) ≤ x} as [X ≤ x].
  We follow this notation with any such relation statement
  involving X;
  e.g., [X ≠ 3] represents the event {ω ∈ Ω : X(ω) ≠ 3}.
◮ Thus we have
  FX (x) = P [X ≤ x] = P ({ω ∈ Ω : X(ω) ≤ x}) = PX ( (−∞, x] )
◮ The distribution function, FX , completely specifies the
  probability measure, PX .

◮ The distribution function of X is given by
  FX (x) = P [X ≤ x] = P ({ω ∈ Ω : X(ω) ≤ x})
◮ This is also sometimes called the cumulative distribution
  function.
◮ FX is a real-valued function of a real variable.
◮ Let us look at a simple example.

◮ Consider tossing of a fair coin: Ω = {T, H},
  P ({T }) = P ({H}) = 0.5.
◮ Let X(T ) = 0 and X(H) = 1. We want to calculate FX .
◮ For this we want the event [X ≤ x], for different x.
◮ Let us first look at some examples:
  [X ≤ −0.5] = {ω : X(ω) ≤ −0.5} = φ
  [X ≤ 0.25] = {ω : X(ω) ≤ 0.25} = {T }
  [X ≤ 1.3] = {ω : X(ω) ≤ 1.3} = Ω
◮ Thus we get
  [X ≤ x] = {ω : X(ω) ≤ x}
          = φ    if x < 0
            {T } if 0 ≤ x < 1
            Ω    if x ≥ 1
◮ We are considering: Ω = {T, H},
  P ({T }) = P ({H}) = 0.5.
◮ X(T ) = 0 and X(H) = 1. We want to calculate FX .
◮ We showed
  [X ≤ x] = {ω : X(ω) ≤ x}
          = φ    if x < 0
            {T } if 0 ≤ x < 1
            Ω    if x ≥ 1
◮ Hence FX (x) = P [X ≤ x] is given by
  FX (x) = 0    if x < 0
           0.5  if 0 ≤ x < 1
           1    if x ≥ 1
◮ The same function can be written as
  FX (y) = 0    if y < 0
           0.5  if 0 ≤ y < 1
           1    if y ≥ 1
  Please note that x is a ‘dummy variable’.
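The case-by-case df above translates directly into a function of a real variable; a minimal sketch:

```python
def F_X(x):
    # Df of X with X(T) = 0, X(H) = 1 on a fair coin:
    # 0 for x < 0, 0.5 for 0 <= x < 1, 1 for x >= 1
    if x < 0:
        return 0.0
    if x < 1:
        return 0.5
    return 1.0

print([F_X(v) for v in (-0.5, 0.25, 1.3)])  # [0.0, 0.5, 1.0]
```

Note that F_X(0) = 0.5 and F_X(1) = 1: the df takes the upper value at each jump point, which is exactly right-continuity.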

◮ A plot of this distribution function:
  (Figure: staircase plot of FX with a jump of 0.5 at x = 0
  and a jump of 0.5 at x = 1.)

◮ Let us look at another example.
◮ Let Ω = [0, 1] and take events to be Borel subsets of
  [0, 1]. (That is, F = {B ∩ [0, 1] : B ∈ B}).
◮ We take P to be such that the probability of an interval is
  its length.
◮ This is the ‘usual’ probability space whenever we take
  Ω = [0, 1].
◮ Let X(ω) = ω.
◮ We want to find the distribution function of X.
◮ Once again we need to find the event [X ≤ x] for
  different values of x.
◮ Note that the function X takes values in [0, 1] and
  X(ω) = ω.
  [X ≤ x] = {ω ∈ Ω : X(ω) ≤ x} = {ω ∈ [0, 1] : ω ≤ x}
          = φ      if x < 0
            [0, x] if 0 ≤ x < 1
            Ω      if x ≥ 1
◮ Hence FX (x) = P [X ≤ x] is given by
  FX (x) = 0 if x < 0
           x if 0 ≤ x < 1
           1 if x ≥ 1

◮ The plot of this distribution function:
  (Figure: FX rises linearly from 0 at x = 0 to 1 at x = 1
  and is constant outside [0, 1].)

Properties of Distribution Functions

◮ The distribution function of a random variable X is given by
  FX (x) = P [X ≤ x] = P ({ω : X(ω) ≤ x})
◮ Any distribution function should satisfy the following:
  1. 0 ≤ FX (x) ≤ 1, ∀x
  2. FX (−∞) = 0; FX (∞) = 1
  3. FX is non-decreasing: x1 ≤ x2 ⇒ FX (x1 ) ≤ FX (x2 )
     x1 ≤ x2 ⇒ (−∞, x1 ] ⊂ (−∞, x2 ] ⇒
     PX ((−∞, x1 ]) ≤ PX ((−∞, x2 ]) ⇒ FX (x1 ) ≤ FX (x2 )
  4. FX is right continuous and has left-hand limits.

◮ Right continuity of FX : xn ↓ x ⇒ FX (xn ) → FX (x)
◮ xn ↓ x implies the sequence of events (−∞, xn ] is
  monotone decreasing.
◮ Also, lim_n (−∞, xn ] = ∩n (−∞, xn ] = (−∞, x]
◮ This implies
  lim_n PX ((−∞, xn ]) = PX (lim_n (−∞, xn ]) = PX ((−∞, x])
◮ This in turn implies
  lim_{xn ↓x} FX (xn ) = FX (x)
◮ Using the usual notation for the right limit of a function, we
  can write FX (x+ ) = FX (x), ∀x.
◮ FX is right-continuous at all x.
◮ Next, let us look at the left-hand limits: lim_{xn ↑x} FX (xn )
◮ When xn ↑ x, the sequence of events (−∞, xn ] is
  monotone increasing and
  lim_n (−∞, xn ] = ∪n (−∞, xn ] = (−∞, x)
◮ By sequential continuity of probability, we have
  lim_n PX ((−∞, xn ]) = PX (lim_n (−∞, xn ] ) = PX ( (−∞, x) )
◮ Hence we get
  FX (x− ) = lim_{xn ↑x} FX (xn ) = lim_n PX ((−∞, xn ]) = PX ( (−∞, x) )
◮ Thus, at every x the left limit of FX exists.

◮ FX is right-continuous:
  FX (x+ ) = FX (x) = PX ( (−∞, x] )
◮ It has left limits: FX (x− ) = PX ( (−∞, x) )
◮ If A ⊂ B then P (B − A) = P (B) − P (A)
◮ We have (−∞, x] − (−∞, x) = {x}. Hence
  PX ( (−∞, x] ) − PX ( (−∞, x) ) = PX ({x}) = P ({ω : X(ω) = x})
◮ Thus we get
  FX (x+ ) − FX (x− ) = P [X = x] = P ({ω : X(ω) = x})
◮ When FX is discontinuous at x, the height of the
  discontinuity is the probability that X takes that value.
◮ And, if FX is continuous at x then P [X = x] = 0

Distribution Functions

◮ Let X be a random variable.
◮ Its distribution function, FX : ℜ → ℜ, is given by
  FX (x) = P [X ≤ x]
◮ The distribution function satisfies
  1. 0 ≤ FX (x) ≤ 1, ∀x
  2. FX (−∞) = 0; FX (∞) = 1
  3. FX is non-decreasing: x1 ≤ x2 ⇒ FX (x1 ) ≤ FX (x2 )
  4. FX is right continuous and has left-hand limits.
◮ We also have FX (x+ ) − FX (x− ) = P [X = x]
◮ Any real-valued function of a real variable satisfying the
  above four properties would be a distribution function of
  some random variable.

◮ FX (x) = P [X ≤ x] = P [X ∈ (−∞, x] ]
◮ Given FX , we can, in principle, find P [X ∈ B] for all
  Borel sets.
◮ In particular, for a < b,
  P [a < X ≤ b] = P [X ∈ (a, b] ]
               = P [X ∈ ( (−∞, b] − (−∞, a] ) ]
               = P [X ∈ (−∞, b] ] − P [X ∈ (−∞, a] ]
               = FX (b) − FX (a)
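The identity P[a < X ≤ b] = FX(b) − FX(a) can be checked on the earlier example with Ω = [0, 1] and X(ω) = ω, where the df is FX(x) = x on [0, 1); a sketch (the helper name is ours):

```python
def F_uniform(x):
    # Df of X(w) = w on Omega = [0,1], where P(interval) = length:
    # 0 for x < 0, x for 0 <= x < 1, 1 for x >= 1
    return 0.0 if x < 0 else (x if x < 1 else 1.0)

# P[a < X <= b] = F_X(b) - F_X(a); for this X it is just the length b - a
a, b = 0.2, 0.5
print(round(F_uniform(b) - F_uniform(a), 10))  # 0.3
```

For this continuous rv the df has no jumps, so P[a < X ≤ b], P[a ≤ X ≤ b], etc., all coincide.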
◮ There are two classes of random variables that we would
  study here.
◮ These are called discrete and continuous random
  variables.
◮ There can be random variables that are neither discrete
  nor continuous.
◮ But these two are important classes of random variables
  that we deal with in this course.
◮ Note that the distribution function is defined for all
  random variables.

Discrete Random Variables

◮ A random variable X is said to be discrete if it takes only
  countably many distinct values.
◮ Countably many means finite or countably infinite.
◮ If X : Ω → ℜ is discrete, its (strict) range is countable.
◮ Any random variable that is defined on a finite or countable
  Ω would be discrete.
◮ Thus the family of discrete random variables includes all
  probability models on finite or countably infinite sample
  spaces.

Discrete Random Variable Example

◮ Consider three independent tosses of a fair coin.
  Ω = {H, T }^3 and X(ω) is the number of H’s in ω.
◮ This rv takes four distinct values, namely, 0, 1, 2, 3.
  We denote this as X ∈ {0, 1, 2, 3}
◮ Let us find the distribution function of this rv.
◮ Let us take some examples of [X ≤ x]
  [X ≤ 0.72] = {ω : X(ω) ≤ 0.72} = {ω : X(ω) = 0} = [X = 0]
  [X ≤ 1.57] = {ω : X(ω) ≤ 1.57}
             = {ω : X(ω) = 0} ∪ {ω : X(ω) = 1} = [X = 0 or 1]

◮ FX (x) = P [X ≤ x] (Recall X ∈ {0, 1, 2, 3})
◮ The event [X ≤ x] for different x can be seen to be
  [X ≤ x] = φ                           if x < 0
            {T T T }                    if 0 ≤ x < 1
            {T T T, HT T, T HT, T T H}  if 1 ≤ x < 2
            Ω − {HHH}                   if 2 ≤ x < 3
            Ω                           if x ≥ 3
◮ So, we get the distribution function as
  FX (x) = 0    if x < 0
           1/8  if 0 ≤ x < 1
           4/8  if 1 ≤ x < 2
           7/8  if 2 ≤ x < 3
           1    if x ≥ 3
◮ The plot of this distribution function is:
  (Figure: staircase plot of FX with jumps at x = 0, 1, 2, 3.)
◮ This is a stair-case function.
◮ It has jumps at x = 0, 1, 2, 3, which are the values that X
  takes. In between these it is constant.
◮ The jump at, e.g., x = 2 is 3/8 which is the probability of
  X taking that value.

Recap: Random Variables

◮ Given a probability space (Ω, F, P ), a random variable is
  a real-valued function on Ω.
◮ It essentially results in an induced probability space
  (Ω, F, P ) −X→ (ℜ, B, PX )
  where B is the Borel σ-algebra and
  PX (B) = P [X ∈ B] = P ({ω ∈ Ω : X(ω) ∈ B})

Recap: σ-algebra

◮ An F ⊂ 2^Ω is called a σ-algebra (also called σ-field) on Ω
  if it satisfies
  1. Ω ∈ F
  2. A ∈ F ⇒ A^c ∈ F
  3. A1 , A2 , · · · ∈ F ⇒ ∪i Ai ∈ F
◮ Thus a σ-algebra is a collection of subsets of Ω that is
  closed under complements and countable unions (and
  hence countable intersections).
◮ The Borel σ-algebra (on ℜ), B, is the smallest σ-algebra
  containing all intervals.
◮ We also have B = σ({(−∞, x] : x ∈ ℜ})

Recap: Distribution function of a random variable

◮ Let X be a random variable. Its distribution function,
  FX : ℜ → ℜ, is defined by
  FX (x) = P [X ≤ x] = P ({ω ∈ Ω : X(ω) ≤ x})
◮ The distribution function, FX , completely specifies the
  probability measure, PX .
Recap: Properties of distribution function

◮ The distribution function satisfies
  1. 0 ≤ FX (x) ≤ 1, ∀x
  2. FX (−∞) = 0; FX (∞) = 1
  3. FX is non-decreasing: x1 ≤ x2 ⇒ FX (x1 ) ≤ FX (x2 )
  4. FX is right continuous and has left-hand limits.
◮ Any real-valued function of a real variable satisfying the
  above four properties would be a distribution function of
  some random variable.
◮ We also have
  FX (x+ ) − FX (x− ) = FX (x) − FX (x− ) = P [X = x]
  P [a < X ≤ b] = FX (b) − FX (a).

◮ There are two classes of random variables that we would
  study here.
◮ These are called discrete and continuous random
  variables.
◮ Note that the distribution function is defined for all
  random variables.

Discrete Random Variables

◮ A random variable X is said to be discrete if it takes only
  countably many distinct values.
◮ Countably many means finite or countably infinite.

Discrete Random Variable Example

◮ Consider three independent tosses of a fair coin.
◮ Ω = {H, T }^3 and X(ω) is the number of H’s in ω.
◮ This rv takes four distinct values, namely, 0, 1, 2, 3.
◮ We denote this as X ∈ {0, 1, 2, 3}
◮ Let us find the distribution function of this rv.
◮ Let us take some examples of [X ≤ x]
  [X ≤ 0.72] = {ω : X(ω) ≤ 0.72} = {ω : X(ω) = 0} = [X = 0]
  [X ≤ 1.57] = {ω : X(ω) ≤ 1.57}
             = {ω : X(ω) = 0} ∪ {ω : X(ω) = 1} = [X = 0 or 1]
◮ FX (x) = P [X ≤ x] (Recall X ∈ {0, 1, 2, 3})
◮ The event [X ≤ x] for different x can be seen to be
  [X ≤ x] = φ                           if x < 0
            {T T T }                    if 0 ≤ x < 1
            {T T T, HT T, T HT, T T H}  if 1 ≤ x < 2
            Ω − {HHH}                   if 2 ≤ x < 3
            Ω                           if x ≥ 3
◮ So, we get the distribution function as
  FX (x) = 0    if x < 0
           1/8  if 0 ≤ x < 1
           4/8  if 1 ≤ x < 2
           7/8  if 2 ≤ x < 3
           1    if x ≥ 3

◮ The plot of this distribution function is:
  (Figure: staircase plot of FX with jumps at x = 0, 1, 2, 3.)
◮ This is a stair-case function.
◮ It has jumps at x = 0, 1, 2, 3, which are the values that X
  takes. In between these it is constant.
◮ The jump at, e.g., x = 2 is 3/8 which is the probability of
  X taking that value.
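The staircase df of this example can be regenerated from the value/probability pairs; a minimal sketch:

```python
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}   # number of heads in 3 fair tosses

def F_X(x):
    # F_X(x) = sum of the probabilities of the values of X that are <= x
    return sum(p for v, p in pmf.items() if v <= x)

print([F_X(v) for v in (-1, 0.72, 1.57, 2.5, 3)])  # [0, 0.125, 0.5, 0.875, 1.0]
```

The printed values match the slide: F_X(0.72) = 1/8, F_X(1.57) = 4/8, and so on; the function is constant between the jump points 0, 1, 2, 3.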

◮ We know that FX (x) − FX (x− ) = P [X = x].
◮ For example,
  FX (2) − FX (2− ) = P [X = 2] = P ({ω : X(ω) = 2})
                   = P ({T HH, HT H, HHT }) = 3/8
◮ The FX is a stair-case function.
◮ It has jumps at each value assumed by X (and is
  constant in between).
◮ The height of the jump is equal to the probability of X
  taking that value.
◮ All discrete random variables would have this general form
  of distribution function.

◮ Let X be a discrete rv and let X ∈ {a1 , a2 , · · · , an }
◮ As a notation we assume: a1 < a2 < · · · < an
◮ Let [X = ai ] = {ω : X(ω) = ai } = Bi and let
  P (Bi ) = qi .
◮ Since X is a function on Ω, B1 , · · · , Bn form a partition
  of Ω.
◮ Note that qi ≥ 0 and Σ_{i=1}^n qi = 1.
◮ If x < a1 then [X ≤ x] = φ.
◮ If a1 ≤ x < a2 then [X ≤ x] = [X = a1 ] = B1
◮ If a2 ≤ x < a3 then
  [X ≤ x] = [X = a1 ] ∪ [X = a2 ] = B1 + B2
◮ Hence we can write the distribution function as
  FX (x) = 0                  if x < a1
           P (B1 )            if a1 ≤ x < a2
           P (B1 ) + P (B2 )  if a2 ≤ x < a3
           ...
           Σ_{i=1}^k P (Bi )  if ak ≤ x < ak+1
           ...
           1                  if x ≥ an
◮ We can write this compactly as
  FX (x) = Σ_{k: ak ≤ x} qk
◮ Note that all this holds even when X takes countably
  infinitely many values.

◮ Let X be a discrete rv with X ∈ {x1 , x2 , · · · }.
◮ Let qi = P [X = xi ] (= P ({ω : X(ω) = xi }) )
◮ We have qi ≥ 0 and Σi qi = 1.
◮ If X is discrete then there is a countable set E such that
  P [X ∈ E] = 1.
◮ The distribution function of X is specified completely by
  these qi .
probability mass function, fX Properties of pmf


◮ Let X be a discrete rv with X ∈ {x1 , x2 , · · · }.
◮ The probability mass function (pmf) of X is defined by ◮ The probability mass function of a discrete random
variable X ∈ {x1 , x2 , · · · } satisfies
fX (xi ) = P [X = xi ]; fX (x) = 0, for all other x
X (x) ≥ 0, ∀x and fX (x) = 0 if x 6= xi for some i
1. fP
◮ fX is also a real-valued function of a real variable. 2. i fX (xi ) = 1
◮ We can write the definition compactly as ◮ Any function satisfying the above two would be a pmf of
fX (x) = P [X = x] some discrete random variable.
◮ The distribution function (df) and the pmf are related as ◮ We can specify a discrete random variable by giving either
X FX or fX .
FX (x) = fX (xi ) ◮ Please remember that we have defined distribution
i:xi ≤x function for any random variable. But pmf is defined only
for discrete random variables
fX (x) = FX (x) − FX (x− )
◮ We can get pmf from df and df from pmf.
PS Sastry, IISc, Bangalore, 2020 14/52 PS Sastry, IISc, Bangalore, 2020 15/52
Computations of Probabilities for discrete rv’s
◮ A discrete random variable is specified by giving either df
or pmf. One can be obtained from the other.
◮ Any discrete random variable can be specified by ◮ We normally specify it through the pmf.
◮ giving the set of values of X, {x1 , x2 , · · · }, and ◮ Given X ∈ {x1 , x2 , · · · } and fX , we can (in principle)
◮ numbers qi such that qi = P [X = xi ] = fX (xi ) compute probability of any event
P
◮ Note that we must have qi ≥ 0 and i qi = 1. X
◮ As we saw this is how we can specify a probability P [X ∈ B] = fX (xi )
i:
assignment on any countable sample space. xi ∈B

◮ For example, if X ∈ {0, 1, 2, 3} then

P [X ∈ [0.5, 1.32] ∪ [2.75, 5.2] ] = fX (1) + fX (3)

◮ We next look at some standard discrete random variable


models
PS Sastry, IISc, Bangalore, 2020 16/52 PS Sastry, IISc, Bangalore, 2020 17/52
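The sum P[X ∈ B] = Σ_{i: xi ∈ B} fX(xi) is a one-liner once B is encoded as a membership test; a sketch for the slide's example:

```python
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

def P(B):
    # P[X in B] = sum of f_X over the values of X lying in B
    return sum(p for x, p in pmf.items() if B(x))

# The event from the slide: B = [0.5, 1.32] U [2.75, 5.2]
print(P(lambda x: 0.5 <= x <= 1.32 or 2.75 <= x <= 5.2))  # f_X(1) + f_X(3) = 0.5
```

Only the values 1 and 3 fall inside B, so the answer is 3/8 + 1/8 = 1/2.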

Bernoulli Distribution

◮ Bernoulli random variable: X ∈ {0, 1} with
  fX (1) = p;  fX (0) = 1 − p;  where 0 < p < 1 is a parameter
◮ This fX is easily seen to be a pmf.
◮ Consider (Ω, F, P ) with B ∈ F. (The Ω here may be
  uncountable).
◮ Consider the random variable
  IB (ω) = 0 if ω ∉ B
           1 if ω ∈ B
◮ It is called the indicator (random variable) of B.
◮ P [IB = 1] = P ({ω : IB (ω) = 1}) = P (B)
◮ Thus, this indicator rv has the Bernoulli distribution with
  p = P (B)

One of the df examples we saw earlier is that of the Bernoulli:
(Figure: staircase df with a jump of 1 − p at x = 0 and a
jump of p at x = 1.)
One of the df examples we considered was that of the Binomial.

Binomial Distribution

◮ X ∈ {0, 1, · · · , n} with pmf
  fX (k) = nCk p^k (1 − p)^{n−k} , k = 0, 1, · · · , n
  where n, p are parameters (n is a +ve integer and
  0 < p < 1).
◮ This is easily seen to be a pmf
  Σ_{k=0}^n nCk p^k (1 − p)^{n−k} = (p + 1 − p)^n = 1
◮ Consider n independent tosses of a coin whose probability
  of heads is p. If X is the number of heads then X has
  the above binomial distribution.
  (Number of successes in n Bernoulli trials)
◮ Any one outcome (a seq of length n) with k heads would
  have probability p^k (1 − p)^{n−k} . There are nCk outcomes
  with exactly k heads.
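The binomial pmf and its normalization via the binomial theorem can be verified directly; a sketch (the parameter values are arbitrary choices of ours):

```python
from math import comb

def binom_pmf(k, n, p):
    # f_X(k) = nCk p^k (1-p)^(n-k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.3
total = sum(binom_pmf(k, n, p) for k in range(n + 1))
print(round(total, 12))  # 1.0
```

The sum collapses to (p + 1 − p)^n = 1 exactly; the rounding only absorbs floating-point noise.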

Poisson Distribution

◮ X ∈ {0, 1, 2, · · · } with pmf
  fX (k) = λ^k e^{−λ} / k! , k = 0, 1, 2, · · ·
  where λ > 0 is a parameter.
◮ We can see this to be a pmf by
  Σ_{k=0}^∞ (λ^k / k!) e^{−λ} = e^λ e^{−λ} = 1
◮ The Poisson distribution is also useful in many applications.

Geometric Distribution

◮ X ∈ {1, 2, · · · } with pmf
  fX (k) = (1 − p)^{k−1} p, k = 1, 2, · · ·
  where 0 < p < 1 is a parameter.
◮ Consider tossing a coin (with prob of H being p)
  repeatedly till we get a head. X is the toss number on
  which we got the first head.
◮ In general, waiting for ‘success’ in independent Bernoulli
  trials.
Memoryless property of geometric distribution

◮ Suppose X is a geometric rv. Let m, n be positive
  integers.
◮ We want to calculate P ( [X > m + n] | [X > m] )
  (Remember that [X > m] etc. are events)
◮ Let us first calculate P [X > n] for any positive integer n
  P [X > n] = Σ_{k=n+1}^∞ P [X = k] = Σ_{k=n+1}^∞ (1 − p)^{k−1} p
            = ( (1 − p)^n / (1 − (1 − p)) ) p = (1 − p)^n
  (Does this also tell us what is the df of a geometric rv?)

◮ Now we can compute the required conditional probability
  P [X > m + n|X > m] = P [X > m + n, X > m] / P [X > m]
                      = P [X > m + n] / P [X > m]
                      = (1 − p)^{m+n} / (1 − p)^m = (1 − p)^n
  ⇒ P [X > m + n|X > m] = P [X > n]
◮ This is known as the memoryless property of the geometric
  distribution.
◮ Same as
  P [X > m + n] = P [X > m]P [X > n]
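The identity P[X > m + n | X > m] = P[X > n] can be checked numerically against the tail formula (1 − p)^n; the names and the truncation point are ours:

```python
p = 0.3

def tail(n, N=2000):
    # P[X > n] for the geometric pmf, by summing f_X(k) for k = n+1 ... N-1
    return sum((1 - p) ** (k - 1) * p for k in range(n + 1, N))

m, n = 4, 7
cond = tail(m + n) / tail(m)     # P[X > m+n | X > m]
# All three numbers below agree (up to truncation/rounding error):
print(round(cond, 8), round(tail(n), 8), round((1 - p) ** n, 8))
```

This is just the cancellation (1 − p)^{m+n} / (1 − p)^m = (1 − p)^n carried out numerically.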

◮ If X is a geometric random variable, it satisfies
  P [X > m + n|X > m] = P [X > n]
◮ This is same as
  P [X > m + n] = P [X > m]P [X > n]
◮ Does it say that [X > m] is independent of [X > n]?
◮ NO!
  Because [X > m + n] is not equal to the intersection of
  [X > m] and [X > n]

Memoryless property defines geometric rv

◮ Suppose X ∈ {0, 1, · · · } is a discrete rv satisfying, for all
  non-negative integers, m, n
  P [X > m + n] = P [X > m]P [X > n]
◮ We will show that X has the geometric distribution.
◮ First, note that
  P [X > 0] = P [X > 0 + 0] = (P [X > 0])^2
  ⇒ P [X > 0] is either 1 or 0.
◮ Let us take P [X > 0] = 1 (and hence P [X = 0] = 0).
◮ We have, for any m,
  P [X > m] = P [X > (m − 1) + 1]
            = P [X > m − 1]P [X > 1]
            = P [X > m − 2] (P [X > 1])^2
◮ Let q = P [X > 1]. Iterating on the above, we get
  P [X > m] = P [X > 0] (P [X > 1])^m = q^m
◮ Using this, we can get the pmf of X as
  P [X = m] = P [X > m − 1] − P [X > m] = q^{m−1} − q^m
            = q^{m−1} (1 − q)
◮ This is the pmf of the geometric distribution (with q = (1 − p))

Continuous Random Variables

◮ A rv, X, is said to be continuous (or of continuous type)
  if its distribution function, FX , is absolutely continuous.

Absolute Continuity
◮ A function g : ℜ → ℜ is absolutely continuous on an interval, I, if given any ǫ > 0 there is a δ > 0 such that for any finite sequence of pair-wise disjoint subintervals, (xk, yk), with xk, yk ∈ I, ∀k, satisfying Σk (yk − xk) < δ, we have Σk |g(yk) − g(xk)| < ǫ
◮ A function that is absolutely continuous on a (finite) closed interval is uniformly continuous.
◮ If g is absolutely continuous on [a, b] then there exists an integrable function h such that
g(x) = g(a) + ∫_a^x h(t) dt, ∀x ∈ [a, b]
◮ In the above, g would be differentiable almost everywhere and h would be its derivative (wherever g is differentiable).

Continuous Random Variables
◮ A rv, X, is said to be continuous (or of continuous type) if its distribution function, FX, is absolutely continuous.
◮ That is, if there exists a function fX : ℜ → ℜ such that
FX(x) = ∫_{−∞}^x fX(t) dt, ∀x
◮ fX is called the probability density function (pdf) of X.
◮ Note that FX here is continuous
◮ By the fundamental theorem of calculus, we have
dFX(x)/dx = fX(x) at all x where fX is continuous
Continuous Random Variables
◮ If X is a continuous rv then its distribution function, FX, is continuous.
◮ Hence a discrete random variable is not a continuous rv!
◮ If a rv takes countably many values then it is discrete.
◮ However, if a rv takes uncountably infinitely many distinct values, it does not necessarily imply it is of continuous type.
◮ As mentioned earlier, there would be many random variables that are neither discrete nor continuous.

Continuous Random Variables
◮ The df of a continuous rv is continuous. This implies
FX(x) = FX(x+) = FX(x−)
◮ Hence, if X is a continuous random variable then
P [X = x] = FX(x) − FX(x−) = 0, ∀x


Continuous Random Variables
◮ A rv, X, is said to be continuous (or of continuous type) if its distribution function, FX, is absolutely continuous.
◮ The df of a continuous random variable can be written as
FX(x) = ∫_{−∞}^x fX(t) dt, ∀x
◮ This fX is the probability density function (pdf) of X.
dFX(x)/dx = fX(x) at all x where fX is continuous

Probability Density Function
◮ The pdf of a continuous rv is defined to be the fX that satisfies
FX(x) = ∫_{−∞}^x fX(t) dt, ∀x
◮ Since FX(∞) = 1, we must have ∫_{−∞}^∞ fX(t) dt = 1
◮ For x1 ≤ x2 we need FX(x1) ≤ FX(x2) and hence we need
∫_{−∞}^{x1} fX(t) dt ≤ ∫_{−∞}^{x2} fX(t) dt ⇒ ∫_{x1}^{x2} fX(t) dt ≥ 0, ∀x1 < x2 ⇒ fX(x) ≥ 0, ∀x

Properties of pdf
◮ The pdf, fX : ℜ → ℜ, of a continuous rv satisfies
A1. fX(x) ≥ 0, ∀x
A2. ∫_{−∞}^∞ fX(t) dt = 1
◮ Any fX that satisfies the above two would be the probability density function of a continuous rv
◮ Given fX satisfying the above two, define
FX(x) = ∫_{−∞}^x fX(t) dt, ∀x
This FX satisfies
1. FX(−∞) = 0; FX(∞) = 1
2. FX is non-decreasing.
3. FX is continuous (and hence right continuous with left limits)
◮ This shows that this FX is a df and hence fX is a pdf

Continuous rv – example
◮ Consider a probability space with Ω = [0, 1] and with the ‘usual’ probability assignment (where probability of an interval is its length)
◮ Earlier we considered the rv X(ω) = ω on this probability space.
◮ We found that the df for this is
FX(x) = 0 if x < 0; x if 0 ≤ x < 1; 1 if x ≥ 1
This is absolutely continuous and we can get the pdf as
fX(x) = 1 if 0 < x < 1 (fX(x) = 0, otherwise)
◮ On the same probability space, consider rv Y(ω) = 1 − ω. Let us find FY and fY.

◮ Y(ω) = 1 − ω. Then
[Y ≤ y] = {ω : Y(ω) ≤ y} = {ω ∈ [0, 1] : 1 − ω ≤ y} = {ω ∈ [0, 1] : ω ≥ 1 − y}
= φ if y < 0; Ω if y ≥ 1; [1 − y, 1] if 0 ≤ y < 1
◮ Hence the df of Y is
FY(y) = 0 if y < 0; y if 0 ≤ y < 1; 1 if y ≥ 1
◮ We have FX = FY and thus fX = fY. (However, note that X(ω) ≠ Y(ω) except at ω = 0.5).

◮ Let X be a continuous rv. It can be specified by giving either FX or the pdf, fX.
◮ We can, in principle, compute probability of any event as
P [X ∈ B] = ∫_B fX(t) dt, ∀B ∈ B
◮ In particular, we have
P [X ∈ [a, b]] = P [a ≤ X ≤ b] = ∫_a^b fX(t) dt = FX(b) − FX(a)
◮ Since the integral over the open or closed intervals is the same, we have, for continuous rv,
P [a ≤ X ≤ b] = P [a < X ≤ b] = P [a ≤ X < b] etc.
◮ Recall that for a general rv
FX(b) − FX(a) = P [a < X ≤ b]

◮ If X is a continuous rv, we have
P [a ≤ X ≤ b] = ∫_a^b fX(t) dt
◮ Thus
P [x ≤ X ≤ x + ∆x] = ∫_x^{x+∆x} fX(t) dt ≈ fX(x) ∆x
◮ That is why fX is called probability density function.
◮ Note that the value of pdf is not a probability.
◮ We can say fX(x) dx ≈ P [x ≤ X ≤ x + dx]

◮ For any random variable, the df is defined and it is given by
FX(x) = P [X ≤ x] = P [X ∈ (−∞, x] ]
◮ The value of FX(x) at any x is probability of some event.
◮ The pmf is defined only for discrete random variables as fX(x) = P [X = x]
◮ The value of pmf is also a probability
◮ We use the same symbol for pdf (as for pmf), defined by
FX(x) = ∫_{−∞}^x fX(t) dt


A note on notation
◮ The df, FX, and the pmf or pdf, fX, are all functions defined on ℜ.
◮ Hence you should not write FX(X ≤ 5). You should write FX(5) to denote P [X ≤ 5].
◮ For a discrete rv, X, one should not write fX(X = 5). It is fX(5) which gives P [X = 5].
◮ Writing fX(X = 5) when fX is a pdf is particularly bad. Note that for a continuous rv, P [X = 5] = 0 and fX(5) ≠ P [X = 5].

◮ A continuous random variable is a probability model on an uncountably infinite Ω.
◮ For this, we take ℜ as our sample space.
◮ We can specify a continuous rv either through the df or through the pdf.
◮ The df, FX, of a cont rv allows you to (consistently) assign probabilities to all Borel subsets of the real line.
◮ We next consider a few standard continuous random variables.

PS Sastry, IISc, Bangalore, 2020 42/52 PS Sastry, IISc, Bangalore, 2020 43/52
Uniform distribution
◮ X is uniform over [a, b] when its pdf is
fX(x) = 1/(b − a), a ≤ x ≤ b
(fX(x) = 0 for all other values of x).
◮ Uniform distribution over open or closed interval is essentially the same.
◮ When X has this distribution, we say X ∼ U[a, b]
◮ By integrating the above, we can see the df as
FX(x) = 0 if x < a; (x − a)/(b − a) if a ≤ x < b; 1 if x ≥ b
◮ A plot of density and distribution functions of a uniform rv is given below (figure not reproduced here)


◮ Let X ∼ U[a, b]. Then fX(x) = 1/(b − a), a ≤ x ≤ b
◮ Let [c, d] ⊂ [a, b]. Then P [X ∈ [c, d]] = ∫_c^d fX(t) dt = (d − c)/(b − a)
◮ Probability of an interval is proportional to its length.
◮ The earlier examples we considered are uniform over [0, 1].

Exponential distribution
◮ The pdf of exponential distribution is
fX(x) = λ e^(−λx), x > 0, (λ > 0 is a parameter)
(By our notation, fX(x) = 0 for x ≤ 0)
◮ It is easy to verify ∫_0^∞ fX(x) dx = 1.
◮ It is easy to see that FX(x) = 0 for x ≤ 0.
◮ For x > 0 we can compute FX by integrating fX:
FX(x) = ∫_0^x λ e^(−λt) dt = 1 − e^(−λx)
◮ This also gives us: P [X > x] = 1 − FX(x) = e^(−λx) for x > 0.
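The closed form FX(x) = 1 − e^(−λx) can be cross-checked by numerically integrating the pdf (a sketch; the value of λ and the grid size are arbitrary choices):

```python
import math

lam = 2.0  # rate parameter λ (arbitrary choice)

def pdf(x):
    # f_X(x) = λ e^(−λx) for x > 0, and 0 otherwise
    return lam * math.exp(-lam * x) if x > 0 else 0.0

def cdf_numeric(x, steps=20000):
    # midpoint Riemann sum of the pdf over (0, x]
    h = x / steps
    return sum(pdf((i + 0.5) * h) for i in range(steps)) * h

for x in [0.5, 1.0, 2.0]:
    assert abs(cdf_numeric(x) - (1 - math.exp(-lam * x))) < 1e-6
```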
◮ A plot of density and distribution functions of an exponential rv is given below (figure not reproduced here)

exponential distribution is memoryless
◮ If X has exponential distribution, then, for t, s > 0,
P [X > t + s] = e^(−λ(t+s)) = e^(−λt) e^(−λs) = P [X > t] P [X > s]
◮ This gives us the memoryless property
P [X > t + s | X > t] = P [X > t + s] / P [X > t] = P [X > s]
◮ Exponential distribution is a useful model for, e.g., life-time of components.
◮ If the distribution of a non-negative continuous random variable is memoryless then it must be exponential.


Gaussian Distribution
◮ The pdf of Gaussian distribution is given by
fX(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)), −∞ < x < ∞
where σ > 0 and µ ∈ ℜ are parameters.
◮ We write X ∼ N(µ, σ²) to denote that X has Gaussian density with parameters µ and σ.
◮ This is also called the Normal distribution.
◮ The special case where µ = 0 and σ² = 1 is called standard Gaussian (or standard Normal) distribution.
◮ A plot of Gaussian density functions is given below (figure not reproduced here)

PS Sastry, IISc, Bangalore, 2020 50/52 PS Sastry, IISc, Bangalore, 2020 51/52
◮ fX(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)), −∞ < x < ∞
◮ Showing that the density integrates to 1 is not trivial.
◮ Take µ = 0, σ = 1. Let I = ∫_{−∞}^∞ fX(x) dx. Then
I² = ∫_{−∞}^∞ (1/√(2π)) e^(−0.5x²) dx ∫_{−∞}^∞ (1/√(2π)) e^(−0.5y²) dy = ∫_{−∞}^∞ ∫_{−∞}^∞ (1/(2π)) e^(−0.5(x²+y²)) dx dy
◮ Now converting the above integral into polar coordinates would allow you to show I = 1.
(Left as an exercise for you!)

Recap: Random Variable
◮ Given a probability space (Ω, F, P), a random variable is a real-valued function on Ω.
◮ It essentially results in an induced probability space
(Ω, F, P) → (ℜ, B, PX)
where B is the Borel σ-algebra and
PX(B) = P [X ∈ B] = P ({ω ∈ Ω : X(ω) ∈ B})
◮ For X to be a random variable
{ω ∈ Ω : X(ω) ∈ B} ∈ F, ∀B ∈ B
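The normalization I = 1 claimed for the Gaussian density above can also be confirmed numerically (a sketch; the truncation at ±10 and the step size are arbitrary choices, and the tail mass beyond ±10 is negligible):

```python
import math

# Riemann sum of the standard normal density over [-10, 10]
h = 0.001
total = sum(math.exp(-0.5 * (k * h) ** 2) for k in range(-10000, 10001))
total *= h / math.sqrt(2 * math.pi)
assert abs(total - 1.0) < 1e-6
```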


Recap: Distribution Function
◮ Let X be a random variable. Its distribution function, FX : ℜ → ℜ, is defined by
FX(x) = P [X ≤ x] = P ({ω ∈ Ω : X(ω) ≤ x})
◮ The distribution function, FX, completely specifies the probability measure, PX.
◮ The distribution function satisfies
1. 0 ≤ FX(x) ≤ 1, ∀x
2. FX(−∞) = 0; FX(∞) = 1
3. FX is non-decreasing: x1 ≤ x2 ⇒ FX(x1) ≤ FX(x2)
4. FX is right continuous and has left-hand limits.
◮ We also have
FX(x+) − FX(x−) = FX(x) − FX(x−) = P [X = x]
P [a < X ≤ b] = FX(b) − FX(a).

Recap: Discrete Random Variable
◮ A random variable X is said to be discrete if it takes only finitely many or countably infinitely many distinct values.
◮ Let X ∈ {x1, x2, · · · }
◮ Its distribution function, FX, is a stair-case function with jump discontinuities at each xi and the magnitude of the jump at xi is equal to P [X = xi]
Recap: probability mass function
◮ Let X ∈ {x1, x2, · · · }.
◮ The probability mass function (pmf) of X is defined by
fX(xi) = P [X = xi]; fX(x) = 0, for all other x
◮ It satisfies
1. fX(x) ≥ 0, ∀x and fX(x) = 0 if x is not one of the xi
2. Σi fX(xi) = 1
◮ We have
FX(x) = Σ_{i: xi ≤ x} fX(xi)
fX(x) = FX(x) − FX(x−)
◮ We can calculate the probability of any event as
P [X ∈ B] = Σ_{i: xi ∈ B} fX(xi)

Recap: continuous random variable
◮ X is said to be a continuous random variable if there exists a function fX : ℜ → ℜ satisfying
FX(x) = ∫_{−∞}^x fX(t) dt
The fX is called the probability density function.
◮ Same as saying FX is absolutely continuous.
◮ Since FX is continuous here, we have
P [X = x] = FX(x) − FX(x−) = 0, ∀x
◮ A continuous rv takes uncountably many distinct values. However, not every rv that takes uncountably many values is a continuous rv

Recap: probability density function
◮ The pdf of a continuous rv is defined to be the fX that satisfies
FX(x) = ∫_{−∞}^x fX(t) dt, ∀x
◮ It satisfies
1. fX(x) ≥ 0, ∀x
2. ∫_{−∞}^∞ fX(t) dt = 1
◮ We can, in principle, compute probability of any event as
P [X ∈ B] = ∫_B fX(t) dt, ∀B ∈ B
◮ In particular,
P [a ≤ X ≤ b] = ∫_a^b fX(t) dt

Recap: some discrete random variables
◮ Bernoulli: X ∈ {0, 1}; parameter: p, 0 < p < 1
fX(1) = p; fX(0) = 1 − p
◮ Binomial: X ∈ {0, 1, · · · , n}; Parameters: n, p
fX(x) = nCx p^x (1 − p)^(n−x), x = 0, · · · , n
◮ Poisson: X ∈ {0, 1, · · · }; Parameter: λ > 0.
fX(x) = e^(−λ) λ^x / x!, x = 0, 1, · · ·
◮ Geometric: X ∈ {1, 2, · · · }; Parameter: p, 0 < p < 1.
fX(x) = p (1 − p)^(x−1), x = 1, 2, · · ·

Recap: Some continuous random variables
◮ Uniform over [a, b]: Parameters: a, b, b > a.
fX(x) = 1/(b − a), a ≤ x ≤ b.
◮ exponential: Parameter: λ > 0.
fX(x) = λ e^(−λx), x ≥ 0.
◮ Gaussian (Normal): Parameters: σ > 0, µ.
fX(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)), −∞ < x < ∞

Functions of a random variable
◮ We next look at random variables defined in terms of other random variables.


Functions of a Random Variable
◮ Let X be a rv on some probability space (Ω, F, P). Recall that X : Ω → ℜ.
◮ Also recall that
[X ∈ B] ≜ {ω : X(ω) ∈ B} ∈ F, ∀B ∈ B

◮ Let X be a rv on some probability space (Ω, F, P). (Recall X : Ω → ℜ)
◮ Consider a function g : ℜ → ℜ
◮ Let Y = g(X). Then Y also maps Ω into the real line.
◮ If g is a ‘nice’ function, Y would also be a random variable
◮ We need: g⁻¹(B) ≜ {z ∈ ℜ : g(z) ∈ B} ∈ B, ∀B ∈ B

◮ Let X be a rv and let Y = g(X).
◮ The distribution function of Y is given by
FY(y) = P [Y ≤ y] = P [g(X) ≤ y] = P [g(X) ∈ (−∞, y] ] = P [X ∈ {z : g(z) ≤ y}]
This probability can be obtained from distribution of X.
◮ Thus, in principle, we can find the distribution of Y if we know that of X

Example
◮ Let Y = aX + b, a > 0. Then we have
FY(y) = P [Y ≤ y] = P [aX + b ≤ y] = P [aX ≤ y − b] = P [X ≤ (y − b)/a], since a > 0
= FX((y − b)/a)
◮ This tells us how to find the df of Y when it is an affine function of X.
◮ If X is a continuous rv, then fY(y) = (1/a) fX((y − b)/a)


◮ In many examples we would be using uniform random ◮ Let X ∼ U [−1, 1]. The pdf would be
variables. fX (x) = 0.5, −1 ≤ x ≤ 1.
1+x
◮ Let X ∼ U [0, 1]. Its pdf is fX (x) = 1, 0 ≤ x ≤ 1. ◮ Integrating this, we get the df: FX (x) = 2
for
◮ Integrating this we get the df: FX (x) = x, 0 ≤ x ≤ 1 −1 ≤ x ≤ 1.
◮ These are plotted below

◮ Suppose X ∼ U[0, 1] and Y = aX + b
◮ The df for Y would be
FY(y) = FX((y − b)/a) = 0 if (y − b)/a ≤ 0; (y − b)/a if 0 ≤ (y − b)/a ≤ 1; 1 if (y − b)/a ≥ 1
◮ Thus we get the df for Y as
FY(y) = 0 if y ≤ b; (y − b)/a if b ≤ y ≤ a + b; 1 if y ≥ a + b
◮ Hence fY(y) = 1/a, y ∈ [b, a + b] and Y ∼ U[b, a + b].
◮ If X ∼ U[0, 1] then Y = aX + b, (a > 0), is uniform over [b, a + b].

◮ Recall that Gaussian density is f(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)). We denote this as N(µ, σ²)
◮ Let Y = aX + b where X ∼ N(0, 1). The df of Y is
FY(y) = FX((y − b)/a) = ∫_{−∞}^{(y−b)/a} (1/√(2π)) e^(−x²/2) dx
We make a substitution: t = ax + b ⇒ x = (t − b)/a, and dx = (1/a) dt
FY(y) = ∫_{−∞}^y (1/(a√(2π))) e^(−(t−b)²/(2a²)) dt
◮ This shows that Y ∼ N(b, a²)
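A quick simulation check of the conclusion Y = aX + b ∼ U[b, a + b] (a sketch; a, b, the seed and the tolerance are arbitrary choices):

```python
import random

random.seed(0)
a, b = 3.0, 1.0  # arbitrary a > 0 and b
n = 100_000
ys = [a * random.random() + b for _ in range(n)]

# empirical df at a few points vs the uniform df (y - b)/a on [b, a + b]
for y in [1.5, 2.5, 3.5]:
    emp = sum(s <= y for s in ys) / n
    assert abs(emp - (y - b) / a) < 0.01
```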


◮ Suppose X is a discrete rv with X ∈ {x1, x2, · · · }. Suppose Y = g(X).
◮ Then Y is also discrete and Y ∈ {g(x1), g(x2), · · · }.
◮ Though we use this notation, we should note:
1. these values may not be distinct (it is possible that g(xi) = g(xj));
2. g(x1) may not be the smallest value of Y and so on.
◮ We can find the pmf of Y as
fY(y) = P [Y = y] = P [g(X) = y] = P [X ∈ {xi : g(xi) = y}] = Σ_{i: g(xi) = y} fX(xi)

◮ Let X ∈ {1, 2, · · · , N} with fX(k) = 1/N, 1 ≤ k ≤ N
◮ Let Y = aX + b, (a > 0). Then Y ∈ {b + a, b + 2a, · · · , b + Na}.
◮ We get the pmf of Y as
fY(b + ka) = fX(k) = 1/N, 1 ≤ k ≤ N

◮ Suppose X is geometric: fX(k) = (1 − p)^(k−1) p, k = 1, 2, · · · .
◮ Let Y = X − 1
◮ We get the pmf of Y as
fY(j) = P [X − 1 = j] = P [X = j + 1] = (1 − p)^j p, j = 0, 1, · · ·

◮ Suppose X is geometric. (fX(k) = (1 − p)^(k−1) p)
◮ Let Y = max(X, 5) ⇒ Y ∈ {5, 6, · · · }
◮ We can calculate the pmf of Y as
fY(5) = P [max(X, 5) = 5] = Σ_{k=1}^5 fX(k) = 1 − (1 − p)^5
fY(k) = P [max(X, 5) = k] = P [X = k] = (1 − p)^(k−1) p, k = 6, 7, · · ·

◮ Let X ∼ U[−1, 1]: FX(x) = (1 + x)/2 for −1 ≤ x ≤ 1.
◮ We next consider Y = h(X) where
h(x) = x if x > 0; 0 otherwise
◮ This is written as Y = X⁺ to indicate the function only keeps the positive part. That is,
Y = X⁺ = X if X > 0; 0 otherwise
◮ For y < 0, FY(y) = P [Y ≤ y] = 0 because Y ≥ 0.
◮ FY(0) = P [Y ≤ 0] = P [X ≤ 0] = 0.5.
◮ For 0 < y < 1, FY(y) = P [Y ≤ y] = P [X ≤ y] = (1 + y)/2
◮ For y ≥ 1, FY(y) = 1.
◮ Thus, the df of Y is
FY(y) = 0 if y < 0; 0.5 if y = 0; (1 + y)/2 if 0 < y < 1; 1 if y ≥ 1

◮ The df of Y is
FY(y) = 0 if y < 0; (1 + y)/2 if 0 ≤ y < 1; 1 if y ≥ 1
◮ This is plotted below (figure not reproduced here)
◮ This is neither a continuous rv nor a discrete rv.

◮ Let Y = X².
◮ For y < 0, FY(y) = P [Y ≤ y] = 0 (since Y ≥ 0)
◮ For y ≥ 0, we can get FY(y) as
FY(y) = P [Y ≤ y] = P [X² ≤ y] = P [−√y ≤ X ≤ √y] = FX(√y) − FX(−√y) + P [X = −√y]
◮ If X is a continuous random variable, then we get
fY(y) = d/dy (FX(√y) − FX(−√y)) = (1/(2√y)) [fX(√y) + fX(−√y)]
◮ This is the general formula for density of X² when X is a continuous rv.

◮ Let X ∼ N(0, 1): fX(x) = (1/√(2π)) e^(−x²/2)
◮ Let Y = X². Then we know fY(y) = 0 for y < 0. For y ≥ 0,
fY(y) = (1/(2√y)) [fX(√y) + fX(−√y)] = (1/(2√y)) (2/√(2π)) e^(−y/2) = (1/√π) (1/2)^0.5 y^(−0.5) e^(−y/2)
◮ This is an example of gamma density.

Gamma density
◮ The Gamma function is given by
Γ(α) = ∫_0^∞ x^(α−1) e^(−x) dx
It can be easily verified that Γ(α + 1) = αΓ(α).
◮ The Gamma density is given by
f(x) = (1/Γ(α)) λ^α x^(α−1) e^(−λx) = (1/Γ(α)) (λx)^(α−1) λ e^(−λx), x > 0
◮ Here α, λ > 0 are parameters.
◮ The earlier density we saw corresponds to α = λ = 0.5:
fY(y) = (1/√π) (1/2)^0.5 y^(−0.5) e^(−y/2), y > 0
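This density can be checked by simulation: for X ∼ N(0, 1), P [X² ≤ y] = P [|X| ≤ √y] = erf(√(y/2)). A sketch (the seed and tolerance are arbitrary choices):

```python
import math
import random

random.seed(1)
n = 100_000
ys = [random.gauss(0.0, 1.0) ** 2 for _ in range(n)]

# empirical df of Y = X^2 vs the closed form erf(sqrt(y/2))
for y in [0.5, 1.0, 2.0]:
    emp = sum(s <= y for s in ys) / n
    assert abs(emp - math.erf(math.sqrt(y / 2.0))) < 0.01
```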

◮ The gamma density with parameters α, λ > 0 is given by
f(x) = (1/Γ(α)) λ^α x^(α−1) e^(−λx), x > 0
◮ If X ∼ N(0, 1) then X² has gamma density with parameters α = λ = 0.5.
◮ When α is a positive integer then the gamma density is known as the Erlang density.
◮ If α = 1, gamma density becomes exponential density.

◮ Let X ∼ U(0, 1). Let Y = −(1/λ) ln(1 − X), where λ > 0.
◮ Note that Y ≥ 0. We can find its df:
FY(y) = P [Y ≤ y] = P [−(1/λ) ln(1 − X) ≤ y] = P [− ln(1 − X) ≤ λy] = P [ln(1 − X) ≥ −λy] = P [1 − X ≥ e^(−λy)] = P [X ≤ 1 − e^(−λy)] = 1 − e^(−λy), y ≥ 0 (since X ∼ U(0, 1))
◮ Thus Y has exponential density
◮ If X ∼ U(0, 1), −(1/λ) ln(1 − X) has exponential density


◮ If X ∼ U(0, 1), −(1/λ) ln(1 − X) has exponential density
◮ This is actually a special case of a general result.
◮ The exponential distribution function is F(x) = 1 − e^(−λx).
◮ This is continuous, strictly monotone and hence is invertible. The inverse function maps [0, 1] to ℜ⁺. We derive its inverse:
z = 1 − e^(−λx) ⇒ e^(−λx) = 1 − z ⇒ x = −(1/λ) ln(1 − z)
◮ Thus, the inverse of F is F⁻¹(z) = −(1/λ) ln(1 − z)
◮ So, we had Y = F⁻¹(X) and the df of Y was F
◮ We can visualize this as shown below (figure not reproduced here)

◮ Let G be a continuous invertible distribution function.
◮ Let X ∼ U[0, 1] and let Y = G⁻¹(X).
◮ We can get the df of Y as
FY(y) = P [Y ≤ y] = P [G⁻¹(X) ≤ y] = P [X ≤ G(y)] = G(y)
◮ Thus, starting with a uniform rv, we can generate a rv with a desired distribution.
◮ Very useful in random number generation. Known as the inverse function method.
◮ Can be generalized to handle discrete rv also. It only involves defining an ‘inverse’ when F is a stair-case function. (Left as an exercise!)

◮ Let X be a cont rv with an invertible distribution function, say, F. Define Y = F(X).
◮ Since range of F is [0, 1], we know 0 ≤ Y ≤ 1.
◮ For 0 ≤ y ≤ 1 we can obtain FY(y) as
FY(y) = P [Y ≤ y] = P [F(X) ≤ y] = P [X ≤ F⁻¹(y)] = F(F⁻¹(y)) = y
◮ This means Y has uniform density.
◮ Has interesting applications. E.g., histogram equalization in image processing

◮ Let us sum up the last two examples
◮ If X ∼ U[0, 1] and Y = F⁻¹(X), then Y has df F.
◮ If the df of X is F and Y = F(X) then Y is uniform over [0, 1].

◮ If Y = g(X), we can compute the distribution of Y, knowing the function g and the distribution of X.
◮ We have seen a number of examples.
◮ Finally, we look at a theorem that gives a formula for the pdf of Y in certain special cases

◮ Let g : ℜ → ℜ be differentiable with g′(x) > 0, ∀x.
◮ Let X be a continuous rv with pdf fX. Let Y = g(X)
◮ Theorem: With the above, Y is a continuous rv with pdf
fY(y) = fX(g⁻¹(y)) (d/dy) g⁻¹(y), g(−∞) ≤ y ≤ g(∞)
◮ Proof: Since g′(x) > 0, g is strictly monotonically increasing and hence is invertible and g⁻¹ would also be monotone and differentiable.
◮ So, range of Y is [g(−∞), g(∞)].
◮ Now we have
FY(y) = P [Y ≤ y] = P [g(X) ≤ y] = P [X ≤ g⁻¹(y)] = FX(g⁻¹(y))
◮ Since g⁻¹ is differentiable, so is FY and we get the pdf as
fY(y) = d/dy (FX(g⁻¹(y))) = fX(g⁻¹(y)) (d/dy) g⁻¹(y)
◮ This completes the proof.

◮ Now, suppose g′(x) < 0, ∀x. Even then the theorem essentially holds.
◮ Now, g is strictly monotonically decreasing. So, we get
FY(y) = P [g(X) ≤ y] = P [X ≥ g⁻¹(y)] = 1 − FX(g⁻¹(y))
◮ Once again, by differentiating,
fY(y) = −fX(g⁻¹(y)) (d/dy) g⁻¹(y) = fX(g⁻¹(y)) |(d/dy) g⁻¹(y)|
because g⁻¹ is also monotone decreasing.
◮ The range of Y here is [g(∞), g(−∞)]
◮ We can combine both cases into one result.

◮ Let g : ℜ → ℜ be differentiable with g′(x) > 0, ∀x or g′(x) < 0, ∀x.
◮ Let X be a continuous rv and let Y = g(X).
◮ Then Y is a continuous rv with pdf
fY(y) = fX(g⁻¹(y)) |(d/dy) g⁻¹(y)|, a ≤ y ≤ b
where a = min(g(∞), g(−∞)) and b = max(g(∞), g(−∞))

◮ For an example, take g(x) = ax + b.
◮ This satisfies the conditions and g⁻¹(y) = (y − b)/a
◮ Hence we get
fY(y) = fX(g⁻¹(y)) |(d/dy) g⁻¹(y)| = fX((y − b)/a) (1/|a|)
◮ This is an example we saw earlier.
◮ We need to find the range of Y based on range of X.

◮ The function g(x) = x² does not satisfy the conditions of the theorem.
◮ The utility of the theorem is somewhat limited. However, we can extend the theorem.
◮ Essentially, what we need is that for any y, the equation g(x) = y has finitely many solutions and the derivative of g is not zero at any of these points.
◮ There are multiple ‘g⁻¹(y)’ and we can get the density of Y by summing over all the terms.

◮ If Y = g(X) and g is monotone,
fY(y) = fX(g⁻¹(y)) |(d/dy) g⁻¹(y)|
◮ Let xo(y) be the solution of g(x) = y; then g⁻¹(y) = xo(y).
◮ Also, the derivative of g⁻¹ is the reciprocal of the derivative of g.
◮ Hence, we can also write the above as
fY(y) = fX(xo(y)) |g′(xo(y))|⁻¹
◮ However, the notation in the above may be confusing.


◮ We can now extend the theorem as follows.
◮ Suppose, for a given y, g(x) = y has multiple solutions. Call them x1(y), · · · , xm(y). Assume the derivative of g is not zero at any of these points.
◮ Then we have
fY(y) = Σ_{k=1}^m fX(xk(y)) |g′(xk(y))|⁻¹
◮ If g(x) = y has no solution (or no solution satisfying g′(x) ≠ 0), then at that y, fY(y) = 0.

◮ Consider the old example g(x) = x².
◮ For y > 0, x² = y has two solutions: √y and −√y.
◮ At both these points, the absolute value of the derivative of g is 2√y, which is non-zero.
◮ Hence we get
fY(y) = (2√y)⁻¹ (fX(√y) + fX(−√y))
◮ This is same as what we derived from first principles earlier.

Recap: Function of a random variable
◮ If X is a random variable and g : ℜ → ℜ is a function, then Y = g(X) is a random variable.
◮ More formally, Y is a random variable if g is a Borel measurable function.
◮ We can determine the distribution of Y given the function g and the distribution of X

Recap
◮ Let X be a rv and let Y = g(X).
◮ The distribution function of Y is given by
FY(y) = P [g(X) ≤ y] = P [X ∈ {z : g(z) ≤ y}]
◮ This probability can be obtained from the distribution of X.
◮ We have seen many specific examples of this.

Recap
◮ Suppose X is a discrete rv with X ∈ {x1, x2, · · · }. Suppose Y = g(X).
◮ Then Y is also discrete and Y ∈ {g(x1), g(x2), · · · }.
◮ We can find the pmf of Y as
fY(y) = P [Y = y] = P [g(X) = y] = P [X ∈ {xi : g(xi) = y}] = Σ_{i: g(xi) = y} fX(xi)

Recap
◮ Let g : ℜ → ℜ be differentiable with g′(x) > 0, ∀x or g′(x) < 0, ∀x.
◮ Let X be a continuous rv and let Y = g(X). Then Y is a continuous rv with pdf
fY(y) = fX(g⁻¹(y)) |(d/dy) g⁻¹(y)|, a ≤ y ≤ b
where a = min(g(∞), g(−∞)) and b = max(g(∞), g(−∞))
◮ This theorem is useful in some cases to find the densities of functions of continuous random variables

Expectation and Moments of a random variable
◮ We next consider the important notion of expectation of a random variable

Expectation of a discrete rv
◮ Let X be a discrete rv with X ∈ {x1, x2, · · · }
◮ We define its expectation by
E[X] = Σi xi fX(xi)
◮ Expectation is essentially a weighted average.
◮ To make the above finite and well defined, we can stipulate the following as condition for existence of expectation
Σi |xi| fX(xi) < ∞


Expectation of a Continuous rv
◮ If X is a continuous random variable with pdf, fX, we define its expectation as
E[X] = ∫_{−∞}^∞ x fX(x) dx
◮ Once again we can use the following as condition for existence of expectation
∫_{−∞}^∞ |x| fX(x) dx < ∞
◮ Sometimes we use the following notation to denote expectation of both kinds of rv
E[X] = ∫_{−∞}^∞ x dFX(x)
◮ Though we consider only discrete or continuous rv’s, expectation is defined for all random variables.

◮ Let us look at a couple of simple examples.
◮ Let X ∈ {1, 2, 3, 4, 5, 6} and fX(k) = 1/6, 1 ≤ k ≤ 6.
EX = (1 + 2 + 3 + 4 + 5 + 6)/6 = 21/6 = 3.5
◮ Let X ∼ U[0, 1]
EX = ∫_{−∞}^∞ x fX(x) dx = ∫_0^1 x dx = 0.5
◮ When an rv takes only finitely many values or when the pdf is non-zero only on a bounded set, the expectation is always finite.
◮ The way we have defined existence of expectation implies that expectation is always finite (when it exists).
◮ This may be needlessly restrictive in some situations. We redefine it as follows.
◮ Let X be a non-negative (discrete or continuous) random variable. We define its expectation by
EX = Σi xi fX(xi) or EX = ∫_{−∞}^∞ x fX(x) dx
depending on whether it is discrete or continuous (In this course we will consider only discrete or continuous rv’s)
◮ Note that the expectation may be infinite.
◮ But it always exists for non-negative random variables.

◮ Now let X be a rv that may not be non-negative.
◮ We define positive and negative parts of X by
X⁺ = X if X > 0; 0 otherwise
X⁻ = −X if X < 0; 0 otherwise
◮ Note that both X⁺ and X⁻ are non-negative. Hence their expectations exist. (Also, X(ω) = X⁺(ω) − X⁻(ω), ∀ω).
◮ Now we define expectation of X by
EX = EX⁺ − EX⁻, if at least one of them is finite
Otherwise EX does not exist.
◮ Now, expectation does not exist only when EX⁺ = EX⁻ = ∞

◮ This is the formal way of defining expectation of a random variable.
◮ We first note that if Σi |xi| fX(xi) < ∞ then both EX⁺ and EX⁻ would be finite and we can simply take the expectation as EX = Σi xi fX(xi).
◮ Also note that if X takes only finitely many values, the above always holds.
◮ Similar comments apply for a continuous random variable.
◮ This is what we do in this course because we deal with only discrete and continuous rv’s.
◮ But to get a feel for the more formal definition, we look at a couple of examples.

◮ Let X ∈ {1, 2, · · · }. Suppose fX(k) = C/k².
◮ Since Σk 1/k² < ∞, we can find C so that Σk fX(k) = 1.
(Σk 1/k² = π²/6 and hence C = 6/π²).
◮ Hence we get
Σk |xk| fX(xk) = Σk xk fX(xk) = Σk k C/k² = Σk C/k = ∞
◮ Here the expectation is infinity.
◮ But by the formal definition it exists. (Note that here X⁺ = X and X⁻ = 0).
◮ Now suppose X takes values 1, −2, 3, −4, · · · with probabilities C/1², C/2², C/3² and so on.
◮ Once again Σk |xk| fX(xk) = ∞. But Σk xk fX(xk) is an alternating series.
◮ Here X⁺ would take values 2k − 1 with probability C/(2k − 1)², k = 1, 2, · · · (and the value 0 with remaining probability).
◮ Similarly, X⁻ would take values 2k with probability C/(2k)², k = 1, 2, · · · (and the value 0 with remaining probability).
EX⁺ = Σk C/(2k − 1) = ∞, and EX⁻ = Σk C/(2k) = ∞
◮ Hence EX does not exist.

◮ Consider a continuous random variable X with pdf
fX(x) = (1/π) 1/(1 + x²), −∞ < x < ∞
◮ This is called (standard) Cauchy density. We can verify it integrates to 1:
∫_{−∞}^∞ (1/π) 1/(1 + x²) dx = (1/π) [tan⁻¹(x)]_{−∞}^∞ = (1/π)(π/2 − (−π/2)) = 1
◮ What would be EX?
EX = ∫_{−∞}^∞ x (1/π) 1/(1 + x²) dx ?= 0 because ∫_{−a}^a x/(1 + x²) dx = 0?

◮ The question was
EX = ∫_{−∞}^∞ x (1/π) 1/(1 + x²) dx ?= 0
◮ This depends on the definition of infinite integrals:
∫_{−∞}^∞ g(x) dx ≜ lim_{c→∞, d→∞} ∫_{−c}^d g(x) dx = lim_{c→∞} ∫_{−c}^0 g(x) dx + lim_{d→∞} ∫_0^d g(x) dx
This is not same as lim_{a→∞} ∫_{−a}^a g(x) dx, which is known as the Cauchy principal value

◮ Here we have
lim_{c→∞} ∫_{−c}^0 x/(1 + x²) dx = −∞; lim_{d→∞} ∫_0^d x/(1 + x²) dx = ∞
◮ Hence EX = ∫_{−∞}^∞ x (1/π) 1/(1 + x²) dx does not exist.
◮ Essentially, both halves of the integral are infinite and hence we get an ∞ − ∞ type expression which is undefined.
◮ However, lim_{a→∞} ∫_{−a}^a x (1/π) 1/(1 + x²) dx = 0.
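The divergence of each half is explicit: ∫_0^A x/(π(1 + x²)) dx = ln(1 + A²)/(2π), which grows without bound as A → ∞. A small numerical sketch of that growth:

```python
import math

def half_integral(A):
    # closed form of the positive half: ln(1 + A^2) / (2π)
    return math.log(1.0 + A * A) / (2.0 * math.pi)

# the positive half keeps growing with the upper limit: no finite limit
vals = [half_integral(10.0 ** k) for k in range(1, 7)]
assert all(later > earlier for earlier, later in zip(vals, vals[1:]))
```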

Expectation of a random variable
◮ Let X be a discrete rv with X ∈ {x1, x2, · · · }. Then
E[X] = Σi xi fX(xi)
◮ If X is a continuous random variable with pdf, fX,
E[X] = ∫_{−∞}^∞ x fX(x) dx
◮ Sometimes we use the following notation to denote expectation of both kinds of rv
E[X] = ∫_{−∞}^∞ x dFX(x)
◮ We take the expectation to exist when the sum or integral above is absolutely convergent
◮ Note that expectation is defined for all random variables
◮ Let us calculate expectations of some of the standard distributions.

Binary random variable
◮ Expectation of a binary rv (e.g., Bernoulli):
EX = 0 × fX(0) + 1 × fX(1) = P [X = 1]
◮ Expectation of a binary random variable is same as the probability of the rv taking value 1.
◮ Thus, for example, E IA = P(A).

Expectation of Binomial rv
◮ Let fX(k) = nCk p^k (1 − p)^(n−k), k = 0, 1, · · · , n.
EX = Σ_{k=0}^n k [n! / (k!(n − k)!)] p^k (1 − p)^(n−k)
= Σ_{k=1}^n [n! / ((k − 1)!(n − k)!)] p^k (1 − p)^(n−k)
= np Σ_{k=1}^n [(n − 1)! / ((k − 1)!((n − 1) − (k − 1))!)] p^(k−1) (1 − p)^((n−1)−(k−1))
= np Σ_{k′=0}^{n−1} [(n − 1)! / (k′!((n − 1) − k′)!)] p^(k′) (1 − p)^((n−1)−k′) = np

Expectation of Poisson rv
◮ fX(k) = (λ^k / k!) e^(−λ), k = 0, 1, · · ·
EX = Σ_{k=0}^∞ k (λ^k / k!) e^(−λ) = λ
(Left as an exercise for you!)
Expectation of Geometric rv
◮ fX(k) = (1 − p)^(k−1) p, k = 1, 2, · · ·
EX = Σ_{k=1}^∞ k (1 − p)^(k−1) p
◮ We have
Σ_{k=1}^∞ (1 − p)^k = (1 − p)/p = 1/p − 1
◮ Term-wise differentiation of the above (with respect to p) gives
Σ_{k=1}^∞ k (1 − p)^(k−1) = 1/p²
◮ This gives us EX = 1/p

Expectation of uniform density
◮ Let X ∼ U[a, b]. fX(x) = 1/(b − a), a ≤ x ≤ b
EX = ∫ x fX(x) dx = ∫_a^b x/(b − a) dx = (1/(b − a)) [x²/2]_a^b = (b² − a²)/(2(b − a)) = (b + a)/2
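The series manipulation for the geometric mean can be confirmed numerically (p is an arbitrary choice; the truncated tail is negligible):

```python
p = 0.3  # arbitrary parameter in (0, 1)
ex = sum(k * (1 - p) ** (k - 1) * p for k in range(1, 500))
assert abs(ex - 1.0 / p) < 1e-9
```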

Expectation of exponential density
◮ fX(x) = λ e^(−λx), x > 0.
EX = ∫_0^∞ x λ e^(−λx) dx = [x λ e^(−λx)/(−λ)]_0^∞ − ∫_0^∞ λ e^(−λx)/(−λ) dx = ∫_0^∞ e^(−λx) dx = [e^(−λx)/(−λ)]_0^∞ = 1/λ

Expectation of Gaussian density
◮ fX(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)), −∞ < x < ∞
EX = ∫_{−∞}^∞ x (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)) dx
Make a change of variable y = (x − µ)/σ:
EX = ∫_{−∞}^∞ (1/√(2π)) (σy + µ) e^(−y²/2) dy = σ ∫_{−∞}^∞ (1/√(2π)) y e^(−y²/2) dy + µ ∫_{−∞}^∞ (1/√(2π)) e^(−y²/2) dy = µ

PS Sastry, IISc, Bangalore, 2020 23/45 PS Sastry, IISc, Bangalore, 2020 24/45
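A quick Monte Carlo check of both results, using the standard library's samplers (a sketch; the sample size and seed are arbitrary choices):

```python
import random

random.seed(0)
n = 200_000
lam, mu, sigma = 2.0, 1.5, 0.7

# sample means should approach 1/lambda and mu respectively
exp_mean   = sum(random.expovariate(lam) for _ in range(n)) / n
gauss_mean = sum(random.gauss(mu, sigma) for _ in range(n)) / n

print(exp_mean)    # close to 1/lambda = 0.5
print(gauss_mean)  # close to mu = 1.5
```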
Expectation of a function of a random variable

◮ Let X be a rv and let Y = g(X).
◮ Theorem: EY = ∫ y dFY(y) = ∫ g(x) dFX(x)
◮ That is, if X is discrete, then
  EY = Σ_j yj fY(yj) = Σ_i g(xi) fX(xi)
◮ If X and Y are continuous
  EY = ∫ y fY(y) dy = ∫ g(x) fX(x) dx
◮ This theorem is true for all rv's. But we will prove it in only some special cases.

◮ Theorem: Let X ∈ {x1, x2, · · · , xn} and let Y = g(X). Then
  EY = Σ_i g(xi) fX(xi)
◮ Proof: Let Y ∈ {y1, y2, · · · , ym}. Each yj would be equal to g(xi) for one or more i.
◮ Let Bj = {xi : g(xi) = yj}. Thus,
  fY(yj) = P[Y = yj] = P[X ∈ Bj] = Σ_{i: xi∈Bj} fX(xi)
◮ Note that
  ◮ the Bj are disjoint
  ◮ each xi would be in one (and only one) of the Bj

◮ Now we have

EY = Σ_{j=1}^{m} yj fY(yj)
   = Σ_{j=1}^{m} yj Σ_{i: xi∈Bj} fX(xi)
   = Σ_{j=1}^{m} Σ_{i: xi∈Bj} g(xi) fX(xi)
   = Σ_{i=1}^{n} g(xi) fX(xi)

That completes the proof.
◮ The proof goes through even when X (and Y) take countably infinitely many values (because we assume the expectation sum is absolutely convergent).

◮ Suppose X is a continuous rv and suppose g is a differentiable function with g′(x) > 0, ∀x. Let Y = g(X)
◮ Once again we can show EY = ∫ g(x) fX(x) dx

EY = ∫_{−∞}^{∞} y fY(y) dy
   = ∫_{g(−∞)}^{g(∞)} y fX(g⁻¹(y)) (d/dy) g⁻¹(y) dy,
change the variable to x = g⁻¹(y) ⇒ dx = (d/dy) g⁻¹(y) dy
   = ∫_{−∞}^{∞} g(x) fX(x) dx

◮ We can similarly show this for the case where g′(x) < 0, ∀x
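The discrete case of the theorem can be illustrated concretely: computing EY from the pmf of Y = g(X) and computing Σ g(xi) fX(xi) directly give the same answer (a sketch with an arbitrarily chosen pmf):

```python
from fractions import Fraction

# X uniform on {-1, 0, 1, 2}; Y = g(X) with g(x) = x^2
fX = {-1: Fraction(1, 4), 0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 4)}
g = lambda x: x * x

# RHS of the theorem: sum g(x) fX(x) over the pmf of X directly
rhs = sum(g(x) * p for x, p in fX.items())

# LHS: build the pmf of Y by collapsing the sets Bj = {xi : g(xi) = yj}
fY = {}
for x, p in fX.items():
    fY[g(x)] = fY.get(g(x), Fraction(0)) + p
lhs = sum(y * p for y, p in fY.items())

print(lhs, rhs)  # both 3/2
```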
◮ We proved the theorem only for discrete rv's and for some restricted case of continuous rv's.
◮ However, this theorem is true for all random variables.
◮ Now, for any function, g, we can write
  E[g(X)] = Σ_i g(xi) fX(xi)  or  E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx

Some Properties of Expectation

◮ If X ≥ 0 then EX ≥ 0
◮ E[b] = b where b is a constant
◮ E[ag(X)] = aE[g(X)] where a is a constant
◮ E[aX + b] = aE[X] + b where a, b are constants.
◮ E[ag1(X) + bg2(X)] = aE[g1(X)] + bE[g2(X)]

◮ Consider the problem: min_c E[(X − c)²]
◮ We are asking what is the best constant to approximate a rv with
◮ We are trying to minimize the (weighted) average, over all values X can take, of the square of the error
◮ We are interested in the best mean-square approximation of X by a constant.

E[(X − c)²] = E[X² + c² − 2cX] = E[X²] + c² − 2cE[X]

◮ We differentiate this and equate to zero to get the best c:
  2c* = 2E[X] ⇒ c* = E[X]

◮ We can derive this in an alternate manner too

E[(X − c)²] = E[(X − EX + EX − c)²]
 = E[(X − EX)² + (EX − c)² + 2(EX − c)(X − EX)]
 = E[(X − EX)²] + (EX − c)² + 2(EX − c)E[(X − EX)]
 = E[(X − EX)²] + (EX − c)² + 2(EX − c)(EX − EX)
 = E[(X − EX)²] + (EX − c)²
 ≥ E[(X − EX)²]

◮ Thus E[(X − c)²] ≥ E[(X − EX)²], ∀c
◮ So, E[(X − c)²] is minimized when c = EX and the minimum value is E[(X − EX)²]
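A small numerical illustration of this fact, scanning candidate constants c over a grid for an arbitrarily chosen pmf:

```python
pmf = {0: 0.2, 1: 0.5, 3: 0.3}
ex = sum(x * p for x, p in pmf.items())  # EX = 1.4

def mse(c):
    # E[(X - c)^2] for the pmf above
    return sum(p * (x - c) ** 2 for x, p in pmf.items())

# c = EX should beat every constant on the grid
assert all(mse(ex) <= mse(c / 100) + 1e-12 for c in range(-500, 501))
print(ex, mse(ex))  # EX and the minimum value E[(X - EX)^2]
```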
Variance of a Random variable

◮ We define the variance of X as E[(X − EX)²] and denote it as Var(X). By definition, Var(X) ≥ 0.

Var(X) = E[(X − EX)²]
 = E[X² + (EX)² − 2X(EX)]
 = E[X²] + (EX)² − 2(EX)E[X]
 = E[X²] − (EX)²

◮ This also implies: E[X²] ≥ (EX)²

Some properties of variance

◮ Var(X + c) = Var(X) where c is a constant
  Var(X + c) = E[{(X + c) − E[X + c]}²] = E[(X − EX)²] = Var(X)
◮ Var(cX) = c² Var(X) where c is a constant
  Var(cX) = E[(cX − E[cX])²] = E[(cX − cE[X])²] = c² Var(X)
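Both properties are easy to confirm numerically for a sample pmf (a sketch; the pmf and the helper name are my choices):

```python
pmf = {0: 0.25, 2: 0.5, 5: 0.25}

def var(pm):
    # Var = E[(X - EX)^2] computed directly from a pmf
    ex = sum(x * p for x, p in pm.items())
    return sum(p * (x - ex) ** 2 for x, p in pm.items())

c = 3.0
shifted = {x + c: p for x, p in pmf.items()}  # pmf of X + c
scaled  = {x * c: p for x, p in pmf.items()}  # pmf of cX

assert abs(var(shifted) - var(pmf)) < 1e-9         # Var(X + c) = Var(X)
assert abs(var(scaled) - c**2 * var(pmf)) < 1e-9   # Var(cX) = c^2 Var(X)
print(var(pmf), var(shifted), var(scaled))
```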

Variance of uniform rv

◮ fX(x) = 1/(b−a), a ≤ x ≤ b

E[X²] = ∫_a^b x²/(b−a) dx
 = (1/(b−a)) [x³/3]_a^b
 = (b³ − a³)/(3(b−a))
 = (b² + ab + a²)/3

◮ We got E[X²] = (b² + ab + a²)/3. Earlier we showed EX = (b+a)/2
◮ Now we can calculate Var(X) as

Var(X) = EX² − (EX)²
 = (b² + ab + a²)/3 − (b + a)²/4
 = [4(b² + ab + a²) − 3(b² + 2ab + a²)]/12
 = (b² − 2ab + a²)/12
 = (b − a)²/12
Variance of exponential rv

◮ fX(x) = λ e^(−λx), x > 0

E[X²] = ∫_0^∞ x² λ e^(−λx) dx
 = [x² λ e^(−λx)/(−λ)]_0^∞ − ∫_0^∞ 2x λ e^(−λx)/(−λ) dx
 = (2/λ) ∫_0^∞ x λ e^(−λx) dx
 = 2/λ²

◮ Hence the variance is now given by
  Var(X) = 2/λ² − (1/λ)² = 1/λ²

Variance of Gaussian rv

◮ Let X ∼ N(0, 1). That is, fX(x) = (1/√(2π)) e^(−x²/2), −∞ < x < ∞.
◮ We know EX = 0. Hence Var(X) = EX².

Var(X) = EX² = ∫_{−∞}^{∞} x² (1/√(2π)) e^(−x²/2) dx
 = ∫_{−∞}^{∞} x · (x (1/√(2π)) e^(−x²/2)) dx
 = [−x (1/√(2π)) e^(−x²/2)]_{−∞}^{∞} + ∫_{−∞}^{∞} (1/√(2π)) e^(−x²/2) dx
 = 1

◮ Let fX(x) = (1/√(2π)) e^(−x²/2), −∞ < x < ∞.
◮ Let g(x) = σx + µ and hence g⁻¹(y) = (y − µ)/σ.
◮ Take σ > 0 and Y = g(X). By the theorem,

fY(y) = (d/dy) g⁻¹(y) · fX(g⁻¹(y)) = (1/(σ√(2π))) e^(−(y−µ)²/(2σ²))

◮ Since Y = σX + µ, we get
  ◮ EY = σEX + µ = µ
  ◮ Var(Y) = σ² Var(X) = σ²
◮ When Y ∼ N(µ, σ²), EY = µ and Var(Y) = σ².

◮ Here is a plot of Gaussian densities with different variances [plot not reproduced here]
Variance of Binomial rv

◮ fX(k) = n!/(k!(n−k)!) p^k (1−p)^(n−k), k = 0, 1, · · · , n
◮ Here we use the identity EX² = E[X(X−1)] + EX

E[X(X−1)] = Σ_{k=0}^{n} k(k−1) · n!/(k!(n−k)!) · p^k (1−p)^(n−k)
 = Σ_{k=2}^{n} k(k−1) · n!/(k!(n−k)!) · p^k (1−p)^(n−k)
 = n(n−1)p² Σ_{k=2}^{n} (n−2)!/((k−2)!((n−2)−(k−2))!) · p^(k−2) (1−p)^((n−2)−(k−2))
 = n(n−1)p² Σ_{k′=0}^{n−2} (n−2)!/(k′!((n−2)−k′)!) · p^(k′) (1−p)^((n−2)−k′)
 = n(n−1)p²

◮ When X is a binomial rv, we showed
  E[X(X−1)] = n(n−1)p²
◮ Hence,
  EX² = E[X(X−1)] + EX = n(n−1)p² + np = n²p² + np(1−p)
◮ Now we can calculate the variance
  Var(X) = EX² − (EX)² = n²p² + np(1−p) − (np)² = np(1−p)
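The identity-based route can be replayed numerically (a sketch; the values of n and p are arbitrary):

```python
from math import comb

n, p = 12, 0.4
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

ex  = sum(k * q for k, q in pmf.items())            # EX = np
exx = sum(k * (k - 1) * q for k, q in pmf.items())  # E[X(X-1)] = n(n-1)p^2
var = exx + ex - ex**2                              # EX^2 - (EX)^2

print(var)  # close to n*p*(1-p) = 2.88
```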

Variance of a geometric random variable

◮ X ∈ {1, 2, · · · } and fX(k) = (1−p)^(k−1) p, k = 1, 2, · · ·
◮ Here also, it is easier to calculate E[X(X−1)]

E[X(X−1)] = Σ_{k=1}^{∞} k(k−1) (1−p)^(k−1) p = p(1−p) Σ_{k=1}^{∞} k(k−1) (1−p)^(k−2)

◮ We know
  Σ_{k=1}^{∞} (1−p)^k = (1−p)/p  ⇒  Σ_{k=1}^{∞} k(k−1) (1−p)^(k−2) = (d²/dp²) [(1−p)/p]

◮ Now you can compute E[X(X−1)] and hence E[X²] and hence Var(X) and show it to be equal to (1−p)/p². (Left as an exercise)

moments of a random variable

◮ We define the k-th order moment of a rv, X, by
  mk = E[X^k] = ∫ x^k dFX(x)
◮ m1 = EX and m2 = EX² and so on
◮ We define the k-th central moment of X by
  sk = E[(X − EX)^k] = ∫ (x − EX)^k dFX(x)
◮ s1 = 0 and s2 = Var(X).
◮ Not all moments may exist for a given random variable. (For example, m1 does not exist for the Cauchy rv)
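The geometric-variance exercise left above can be checked by truncated summation (a sketch; the answer should match (1−p)/p²):

```python
p = 0.3
ks = range(1, 20_000)  # the series tail beyond this is negligible

ex  = sum(k * (1 - p)**(k - 1) * p for k in ks)                # EX = 1/p
exx = sum(k * (k - 1) * (1 - p)**(k - 1) * p for k in ks)      # E[X(X-1)]
var = exx + ex - ex**2

print(var)  # close to (1-p)/p^2 = 7.777...
```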
 
◮ Theorem: If E[|X|^k] < ∞ then E[|X|^s] < ∞ for 0 < s < k.
◮ For example, if the third order moment exists then so do the first and second order moments
◮ Proof: We prove it when X is a continuous rv. The proof for the discrete case is similar.

E[|X|^s] = ∫_{−∞}^{∞} |x|^s fX(x) dx
 = ∫_{|x|<1} |x|^s fX(x) dx + ∫_{|x|≥1} |x|^s fX(x) dx
 ≤ ∫_{|x|<1} fX(x) dx + ∫_{|x|≥1} |x|^s fX(x) dx
 ≤ P[|X| < 1] + ∫_{|x|≥1} |x|^k fX(x) dx
   (since for |x| ≥ 1, |x|^s ≤ |x|^k when s < k)
 < ∞   (because E[|X|^k] = ∫ |x|^k fX(x) dx < ∞)

Recap: Expectation

◮ Let X be a discrete rv with X ∈ {x1, x2, · · · }. Then
  E[X] = Σ_i xi fX(xi)
◮ If X is a continuous random variable with pdf, fX,
  E[X] = ∫_{−∞}^{∞} x fX(x) dx
◮ Sometimes we use the following notation to denote expectation of both kinds of rv:
  E[X] = ∫_{−∞}^{∞} x dFX(x)
◮ We take the expectation to exist when the sum or integral above is absolutely convergent
◮ Note that expectation is defined for all random variables

Recap: Expectation of a function of a random variable

◮ Let X be a rv and let Y = g(X). Then,
  EY = ∫ y dFY(y) = ∫ g(x) dFX(x)
◮ That is, if X is discrete, then
  EY = Σ_j yj fY(yj) = Σ_i g(xi) fX(xi)
◮ If X and Y are continuous
  EY = ∫ y fY(y) dy = ∫ g(x) fX(x) dx
◮ This is true for all rv's.

Recap: Properties of Expectation

E[g(X)] = Σ_i g(xi) fX(xi)  or  E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx

◮ If X ≥ 0 then EX ≥ 0
◮ E[b] = b where b is a constant
◮ E[ag(X)] = aE[g(X)] where a is a constant
◮ E[aX + b] = aE[X] + b where a, b are constants.
◮ E[ag1(X) + bg2(X)] = aE[g1(X)] + bE[g2(X)]
◮ E[(X − c)²] ≥ E[(X − EX)²], ∀c
Recap: Variance of random variable

Var(X) = E[(X − EX)²] = E[X²] − (EX)²

◮ Properties of Variance:
  ◮ Var(X) ≥ 0
  ◮ Var(X + c) = Var(X)
  ◮ Var(cX) = c² Var(X)

Recap: Moments of a random variable

◮ The k-th (order) moment of X is
  mk = E[X^k] = ∫ x^k dFX(x)
◮ The k-th central moment of X is
  sk = E[(X − EX)^k] = ∫ (x − EX)^k dFX(x)
◮ If the moment of order k is finite then so is the moment of order s for s < k.

Moment generating function

◮ The moment generating function (mgf) of rv X, MX : ℜ → ℜ, is defined by
  MX(t) = E[e^(tX)] = Σ_i e^(t xi) fX(xi)  or  ∫ e^(tx) fX(x) dx,  t ∈ ℜ
◮ We say the mgf exists if E[e^(tX)] < ∞ for t in some interval around zero
◮ The mgf may not exist for some random variables.

◮ The mgf of X is: MX(t) = E[e^(tX)].
◮ If MX(t) exists (for t ∈ [−a, a] for some a > 0) then all its derivatives also exist.
◮ Then we can get the moments of X by successive differentiation of MX(t).

dMX(t)/dt |_(t=0) = (d/dt) E[e^(tX)] |_(t=0) = E[X e^(tX)] |_(t=0) = EX

◮ In general
  d^k MX(t)/dt^k |_(t=0) = E[X^k]
◮ We can easily see this by expanding e^(tX) in Taylor series:

MX(t) = E[e^(tX)] = E[1 + tX/1! + t²X²/2! + t³X³/3! + t⁴X⁴/4! + · · ·]
 = 1 + (t/1!) EX + (t²/2!) EX² + (t³/3!) EX³ + (t⁴/4!) EX⁴ + · · ·

◮ Now we can do term-wise differentiation. For example

d³MX(t)/dt³ = 0 + 0 + 0 + (3·2·1·t⁰/3!) EX³ + (4·3·2·t/4!) EX⁴ + · · ·

◮ Hence we get
  d³MX(t)/dt³ |_(t=0) = E[X³]

Example – Moment generating function for Poisson

◮ fX(k) = (λ^k/k!) e^(−λ), k = 0, 1, · · ·

MX(t) = E[e^(tX)] = Σ_{k=0}^{∞} e^(tk) (λ^k/k!) e^(−λ)
 = e^(−λ) Σ_{k=0}^{∞} (λe^t)^k / k!
 = e^(−λ) e^(λe^t) = e^(λ(e^t − 1))

◮ Now, by differentiating it we can find EX

EX = dMX(t)/dt |_(t=0) = e^(λ(e^t − 1)) λe^t |_(t=0) = λ

(Exercise: Differentiate it twice to find EX² and hence show that the variance is λ.)
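Numerical differentiation of the Poisson mgf M(t) = e^(λ(e^t − 1)) at t = 0 recovers EX = λ, and the second derivative gives EX², which also confirms the variance asked for in the exercise (a sketch with an arbitrary λ):

```python
from math import exp

lam = 3.0
M = lambda t: exp(lam * (exp(t) - 1))  # Poisson mgf

h = 1e-4
m1 = (M(h) - M(-h)) / (2 * h)          # central difference ~ M'(0)  = EX
m2 = (M(h) - 2 * M(0) + M(-h)) / h**2  # central difference ~ M''(0) = EX^2

print(m1)          # close to lambda = 3.0
print(m2 - m1**2)  # variance, close to lambda = 3.0
```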

mgf of exponential rv

◮ fX(x) = λe^(−λx), x > 0

MX(t) = E[e^(tX)] = ∫_0^∞ e^(tx) λe^(−λx) dx = ∫_0^∞ λe^(−x(λ−t)) dx

This is finite if t < λ

 = [λe^(−x(λ−t))/(−(λ−t))]_0^∞ = λ/(λ−t),  t < λ

◮ We can use this to compute EX

EX = dMX(t)/dt |_(t=0) = (d/dt) [λ/(λ−t)] |_(t=0) = λ/(λ−t)² |_(t=0) = 1/λ

◮ For the mgf to exist we need E[e^(tX)] < ∞ for t ∈ [−a, a] for some a > 0.
◮ If MX(t) exists then all moments of X are finite.
◮ However, all moments may be finite but the mgf may not exist.
◮ When the mgf exists, it uniquely determines the df
◮ We are not saying moments uniquely determine the distribution; we are saying the mgf uniquely determines the distribution
Characteristic Function

◮ The characteristic function of X is defined by
  φX(t) = E[e^(itX)] = ∫ e^(itx) dFX(x)   (i = √−1)
◮ If X is a continuous rv,
  φX(t) = E[e^(itX)] = ∫_{−∞}^{∞} e^(itx) fX(x) dx
◮ The characteristic function always exists because |e^(itx)| = 1, ∀t, x
◮ For example,
  |∫ e^(itx) fX(x) dx| ≤ ∫ |e^(itx)| fX(x) dx = ∫ fX(x) dx = 1
◮ We would consider φX later in the course

Generating function

◮ Let X ∈ {0, 1, 2, · · · }
◮ The (probability) generating function of X is defined by
  PX(s) = Σ_{k=0}^{∞} fX(k) s^k,  s ∈ ℜ
◮ This infinite sum converges (absolutely) for |s| ≤ 1.
◮ We have
  PX(s) = fX(0) + fX(1)s + fX(2)s² + fX(3)s³ + · · ·
◮ The pmf can be obtained from the generating function

◮ PX(s) = fX(0) + fX(1)s + fX(2)s² + fX(3)s³ + · · ·
◮ Let P′X(s) ≜ dPX(s)/ds and so on
◮ We get
  P′X(s) = 0 + fX(1) + fX(2) 2s + fX(3) 3s² + · · ·
  P′′X(s) = 0 + 0 + fX(2) 2·1 + fX(3) 3·2 s + · · ·
◮ Hence, we get
  fX(0) = PX(0);  fX(1) = P′X(0)/1!;  fX(2) = P′′X(0)/2!

◮ The moments (when they exist) can be obtained from the generating function: PX(s) = Σ_{k=0}^{∞} fX(k) s^k

P′X(s) = Σ_{k=0}^{∞} k fX(k) s^(k−1)  ⇒  P′X(1) = EX

P′′X(s) = Σ_{k=0}^{∞} k(k−1) fX(k) s^(k−2)  ⇒  P′′X(1) = E[X(X−1)]

◮ For (positive integer valued) discrete random variables, it is more convenient to deal with generating functions than the mgf.
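These identities can be checked numerically for any small pmf on {0, 1, 2, · · · } by differentiating PX(s) with finite differences (a sketch; the pmf values here are arbitrary):

```python
f = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

def P(s):
    # generating function PX(s) = sum_k fX(k) s^k
    return sum(p * s**k for k, p in f.items())

h = 1e-6
dP1  = (P(1 + h) - P(1 - h)) / (2 * h)          # ~ P'(1)  = EX
ddP1 = (P(1 + h) - 2 * P(1) + P(1 - h)) / h**2  # ~ P''(1) = E[X(X-1)]

ex  = sum(k * p for k, p in f.items())           # 1.7
exx = sum(k * (k - 1) * p for k, p in f.items()) # 2.0
print(dP1, ex, ddP1, exx)
```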
Example – Generating function for binomial rv

◮ fX(k) = n!/(k!(n−k)!) p^k (1−p)^(n−k), k = 0, 1, · · · , n

PX(s) = Σ_{k=0}^{n} n!/(k!(n−k)!) p^k (1−p)^(n−k) s^k
 = Σ_{k=0}^{n} n!/(k!(n−k)!) (sp)^k (1−p)^(n−k)
 = (sp + (1−p))^n = (1 + p(s−1))^n

◮ From the above, we get P′X(s) = n(sp + (1−p))^(n−1) p
◮ Thus,
  EX = P′X(1) = np;  fX(1) = P′X(0) = n(1−p)^(n−1) p

◮ Let p ∈ (0, 1). The number x ∈ ℜ that satisfies
  P[X ≤ x] ≥ p and P[X ≥ x] ≥ 1 − p
is called the quantile of order p or the 100p-th percentile of rv X.
◮ Suppose x is a quantile of order p. Then we have
  ◮ p ≤ P[X ≤ x] = FX(x)
  ◮ 1 − p ≤ 1 − P[X < x] = 1 − (P[X ≤ x] − P[X = x])
    ⇒ 1 − p ≤ 1 − FX(x) + P[X = x]
    ⇒ FX(x) ≤ p + P[X = x]
◮ Thus, x satisfies (if it is a quantile of order p)
  p ≤ FX(x) ≤ p + P[X = x]
◮ Note that for a given p there can be multiple values for x to satisfy the above.

◮ If x is a quantile of order p then
  p ≤ FX(x) ≤ p + P[X = x]
◮ If X is a continuous rv, we need to satisfy p = FX(x).
◮ In general, for a given p, there may be multiple x that satisfy the above.
◮ Let us see some examples.

◮ Let X be a continuous rv.
◮ If the df is strictly monotone then FX(x) = p would have a unique solution.
◮ For a continuous rv, X, FX need not be strictly monotone.
◮ Consider a pdf: fX(x) = 0.5, x ∈ [1, 2] ∪ [3, 4]
◮ The pdf and the corresponding df are: [plots not reproduced here]
◮ For this df, for p = 0.5, the quantile of order p is not unique because there are many x with FX(x) = 0.5. But for p = 0.75 it is unique.

◮ Let X ∈ {x1, x2, · · · }
◮ Given a p we want to calculate the quantile of order p
◮ Suppose there is an xi such that FX(xi) = p.
◮ Then, for xi ≤ x < xi+1, FX(x) = p
◮ For xi ≤ x ≤ xi+1, we have p ≤ FX(x) ≤ p + P[X = x]
◮ So, the quantile of order p is not unique and all such x qualify.
◮ This situation is illustrated below [figure not reproduced here]
◮ Now suppose p is such that FX(xi−1) < p < FX(xi).
◮ Let FX(xi−1) = p − δ1 and FX(xi) = p + δ2. (Note that δ1, δ2 > 0)
◮ Then P[X = xi] = FX(xi) − FX(xi−1) = δ2 + δ1
◮ Hence we have
  p < p + δ2 = FX(xi) < p + δ2 + δ1 = p + P[X = xi]
◮ Hence, xi is a quantile of order p.
◮ For any x < xi we would have FX(x) ≤ FX(xi−1) < p.
◮ For any x, with xi < x < xi+1, we have
  p + P[X = x] = p < FX(x) = p + δ2.
◮ Similarly, for x ≥ xi+1 we have FX(x) > p + P[X = x].
◮ Thus the quantile of order p is unique here.
◮ This situation is illustrated below [figure not reproduced here]

Median of a distribution

◮ For p = 0.5 the quantile of order p is called the median.
◮ For a continuous rv, the median, x, satisfies: FX(x) = 0.5.
◮ For a discrete rv, it satisfies:
  0.5 ≤ FX(x) ≤ 0.5 + P[X = x].
◮ As we saw, the median need not be unique.
◮ Recall that the (standard) Cauchy density is given by
  fX(x) = (1/π) · 1/(1 + x²),  −∞ < x < ∞
◮ One can show that ∫_{−∞}^0 fX(x) dx = 0.5 and hence the median is at the origin.

◮ If we want to find c to minimize E[(X − c)²] then the solution is c = EX.
◮ We saw this earlier.
◮ Suppose we want to find c to minimize E[|X − c|]
◮ Then we would get c to be the median.
  (Exercise: Show this for discrete and continuous rv)
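The claim about E[|X − c|] can be made plausible numerically: for an arbitrarily chosen pmf, a grid search over c lands on a median (a sketch, not a proof):

```python
pmf = {0: 0.2, 1: 0.4, 5: 0.4}  # FX(1) = 0.6 >= 0.5 and P[X >= 1] = 0.8 >= 0.5,
                                # so x = 1 is a median of this pmf

def mad(c):
    # E[|X - c|] for the pmf above
    return sum(p * abs(x - c) for x, p in pmf.items())

best_c = min((c / 10 for c in range(-20, 81)), key=mad)
print(best_c)  # 1.0, the median
```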
Markov Inequality

◮ Let g : ℜ → ℜ be a non-negative function. Then
  P[g(X) > c] ≤ E[g(X)]/c,  (c > 0)
◮ Proof: We prove it for a continuous rv. The proof is similar for a discrete rv.

E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx
 = ∫_{g(x)≤c} g(x) fX(x) dx + ∫_{g(x)>c} g(x) fX(x) dx
 ≥ ∫_{g(x)>c} g(x) fX(x) dx   (because g(x) ≥ 0)
 ≥ ∫_{g(x)>c} c fX(x) dx = c P[g(X) > c]

Thus, P[g(X) > c] ≤ E[g(X)]/c

◮ In all such results an underlying assumption is that the expectation is finite.
◮ Let g(x) = |x|^k where k is a positive integer. We have g(x) ≥ 0, ∀x. Let c > 0.
◮ We know that |x| > c ⇒ |x|^k > c^k and vice versa. Now we get,
  P[|X| > c] = P[|X|^k > c^k] ≤ E[|X|^k]/c^k
◮ The Markov inequality is often used in this form.

◮ Markov Inequality:
  P[|X| > c] ≤ E[|X|^k]/c^k
◮ Take |X| as |X − EX| and take k = 2:
  P[|X − EX| > c] ≤ E[|X − EX|²]/c² = Var(X)/c²
◮ This is known as the Chebyshev inequality.

Chebyshev Inequality

◮ The Chebyshev inequality is
  P[|X − EX| > c] ≤ Var(X)/c²
◮ Let EX = µ and let Var(X) = σ². Take c = kσ.
◮ We call σ, the square root of the variance, the standard deviation.
◮ Now, the Chebyshev inequality gives us
  P[|X − µ| > kσ] ≤ σ²/(k²σ²) = 1/k²
◮ This is true for all random variables and the RHS above does not depend on the distribution of X.
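The bound can be observed directly for, say, a binomial rv: the exact tail probability never exceeds 1/k² (a sketch; n, p, and the k values are arbitrary):

```python
from math import comb, sqrt

n, p = 30, 0.5
pmf = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}
mu, sigma = n * p, sqrt(n * p * (1 - p))

for k in (1.5, 2.0, 3.0):
    tail = sum(q for x, q in pmf.items() if abs(x - mu) > k * sigma)
    assert tail <= 1 / k**2  # Chebyshev: P[|X - mu| > k*sigma] <= 1/k^2
    print(k, tail, 1 / k**2)
```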
◮ Markov inequality: For a non-negative function, g,
  P[g(X) > c] ≤ E[g(X)]/c
◮ A specific instance of this is
  P[|X| > c] ≤ E[|X|^k]/c^k
◮ Chebyshev inequality
  P[|X − EX| > c] ≤ Var(X)/c²
◮ With EX = µ and Var(X) = σ², we get
  P[|X − µ| > kσ] ≤ 1/k²

A pair of random variables

◮ Let X, Y be random variables on the same probability space (Ω, F, P)
◮ Each of X, Y maps Ω to ℜ.
◮ We can think of the pair of random variables as a vector-valued function (X, Y) that maps Ω (the sample space) to ℜ².

◮ Just as in the case of a single rv, we can think of the induced probability space for the case of a pair of rv's too.
◮ That is, by defining the pair of random variables, we essentially create a new probability space with sample space ℜ².
◮ The events now would be the Borel subsets of ℜ².
◮ Recall that ℜ² is the cartesian product of ℜ with itself.
◮ So, we can create Borel subsets of ℜ² by cartesian products of Borel subsets of ℜ:
  B² = σ({B1 × B2 : B1, B2 ∈ B})
where B is the Borel σ-algebra we considered earlier, and B² is the set of Borel sets of ℜ².
◮ Recall that B is the smallest σ-algebra containing all intervals.
◮ Let I1, I2 ⊂ ℜ be intervals. Then I1 × I2 ⊂ ℜ² is known as a cylindrical set (e.g., the rectangle [a, b] × [c, d]).
◮ B² is the smallest σ-algebra containing all cylindrical sets.
◮ We saw that B is also the smallest σ-algebra containing all intervals of the form (−∞, x].
◮ Similarly, B² is the smallest σ-algebra containing cylindrical sets of the form (−∞, x] × (−∞, y].

◮ Let X, Y be random variables on the probability space (Ω, F, P)
◮ This gives rise to a new probability space (ℜ², B², PXY) with PXY given by
  PXY(B) = P[(X, Y) ∈ B], ∀B ∈ B²
         = P({ω : (X(ω), Y(ω)) ∈ B})
◮ Recall that for a single rv, the resulting probability space is (ℜ, B, PX) with
  PX(B) = P[X ∈ B] = P({ω : X(ω) ∈ B})

◮ In the case of a single rv, we define a distribution function, FX, which essentially assigns probability to all intervals of the form (−∞, x].
◮ This FX uniquely determines PX(B) for all Borel sets, B.
◮ In a similar manner we define a joint distribution function FXY for a pair of random variables.
◮ FXY(x, y) would be PXY((−∞, x] × (−∞, y]).
◮ FXY fixes the probability of all cylindrical sets of the form (−∞, x] × (−∞, y] and hence uniquely determines the probability of all Borel sets of ℜ².

Joint distribution of a pair of random variables

◮ Let X, Y be random variables on the same probability space (Ω, F, P)
◮ The joint distribution function of X, Y is FXY : ℜ² → ℜ, defined by
  FXY(x, y) = P[X ≤ x, Y ≤ y]  (= PXY((−∞, x] × (−∞, y]))
            = P({ω : X(ω) ≤ x} ∩ {ω : Y(ω) ≤ y})
◮ The joint distribution function is the probability of the intersection of the events [X ≤ x] and [Y ≤ y].
Properties of Joint Distribution Function

◮ Joint distribution function:
  FXY(x, y) = P[X ≤ x, Y ≤ y]
◮ FXY(−∞, y) = FXY(x, −∞) = 0, ∀x, y;  FXY(∞, ∞) = 1
  (These are actually limits: lim_{x→−∞} FXY(x, y) = 0, ∀y)
◮ FXY is non-decreasing in each of its arguments
◮ FXY is right continuous and has left-hand limits in each of its arguments
◮ These are straight-forward extensions of the single rv case
◮ But there is another crucial property satisfied by FXY.

◮ Recall that, for the case of a single rv, the probability of X being in any interval is given by the difference of FX values at the end points of the interval.
◮ Let x1 < x2. Then
  P[x1 < X ≤ x2] = FX(x2) − FX(x1)
◮ The LHS above is a probability. The RHS is non-negative because FX is non-decreasing.
◮ We will now derive a similar expression in the case of two random variables.
◮ Here, the probability we want is that of the pair of rv's being in a cylindrical set.

◮ Let x1 < x2 and y1 < y2. We want P[x1 < X ≤ x2, y1 < Y ≤ y2].
◮ Consider the Borel set B = (−∞, x2] × (−∞, y2]. It can be split as (figure not reproduced here):

B ≜ (−∞, x2] × (−∞, y2] = B1 + (B2 ∪ B3), where
  B1 = (x1, x2] × (y1, y2]
  B2 = (−∞, x2] × (−∞, y1]
  B3 = (−∞, x1] × (−∞, y2]
  B2 ∩ B3 = (−∞, x1] × (−∞, y1]

P[(X, Y) ∈ B] = P[X ≤ x2, Y ≤ y2] = FXY(x2, y2)
 = P[(X, Y) ∈ B1 + (B2 ∪ B3)]
 = P[(X, Y) ∈ B1] + P[(X, Y) ∈ (B2 ∪ B3)]

P[(X, Y) ∈ B2] = P[X ≤ x2, Y ≤ y1] = FXY(x2, y1)
P[(X, Y) ∈ B3] = P[X ≤ x1, Y ≤ y2] = FXY(x1, y2)
P[(X, Y) ∈ B2 ∩ B3] = P[X ≤ x1, Y ≤ y1] = FXY(x1, y1)

P[(X, Y) ∈ B1] = FXY(x2, y2) − P[(X, Y) ∈ (B2 ∪ B3)]
 = FXY(x2, y2) − FXY(x2, y1) − FXY(x1, y2) + FXY(x1, y1)
◮ What we showed is the following.
◮ For x1 < x2 and y1 < y2

P[x1 < X ≤ x2, y1 < Y ≤ y2] = FXY(x2, y2) − FXY(x2, y1) − FXY(x1, y2) + FXY(x1, y1)

◮ This means FXY should satisfy
  FXY(x2, y2) − FXY(x2, y1) − FXY(x1, y2) + FXY(x1, y1) ≥ 0
for all x1 < x2 and y1 < y2
◮ This is an additional condition that a function has to satisfy to be the joint distribution function of a pair of random variables

Properties of Joint Distribution Function

◮ Joint distribution function: FXY : ℜ² → ℜ
  FXY(x, y) = P[X ≤ x, Y ≤ y]
◮ It satisfies
  1. FXY(−∞, y) = FXY(x, −∞) = 0, ∀x, y; FXY(∞, ∞) = 1
  2. FXY is non-decreasing in each of its arguments
  3. FXY is right continuous and has left-hand limits in each of its arguments
  4. For all x1 < x2 and y1 < y2
     FXY(x2, y2) − FXY(x2, y1) − FXY(x1, y2) + FXY(x1, y1) ≥ 0
◮ Any F : ℜ² → ℜ satisfying the above would be a joint distribution function.

Recap: Random Variables

◮ Given a probability space (Ω, F, P), a random variable is a real-valued function on Ω.
◮ It essentially results in an induced probability space
  (Ω, F, P) → (ℜ, B, PX)
where B is the Borel σ-algebra and
  PX(B) = P[X ∈ B] = P({ω ∈ Ω : X(ω) ∈ B})

Recap: Distribution function of a random variable

◮ Let X be a random variable. Its distribution function, FX : ℜ → ℜ, is defined by
  FX(x) = P[X ≤ x] = P({ω ∈ Ω : X(ω) ≤ x})
◮ The distribution function, FX, completely specifies the probability measure, PX.
Recap: Properties of distribution function

◮ The distribution function satisfies
  1. 0 ≤ FX(x) ≤ 1, ∀x
  2. FX(−∞) = 0; FX(∞) = 1
  3. FX is non-decreasing: x1 ≤ x2 ⇒ FX(x1) ≤ FX(x2)
  4. FX is right continuous and has left-hand limits.
◮ Any real-valued function of a real variable satisfying the above four properties would be a distribution function of some random variable.
◮ We also have
  FX(x⁺) − FX(x⁻) = FX(x) − FX(x⁻) = P[X = x]
  P[a < X ≤ b] = FX(b) − FX(a).

Recap: Discrete Random Variable

◮ A random variable X is said to be discrete if it takes only finitely many or countably infinitely many distinct values.
◮ Let X ∈ {x1, x2, · · · }
◮ Its distribution function, FX, is a stair-case function with jump discontinuities at each xi, and the magnitude of the jump at xi is equal to P[X = xi]

Recap: probability mass function

◮ Let X ∈ {x1, x2, · · · }.
◮ The probability mass function (pmf) of X is defined by
  fX(xi) = P[X = xi];  fX(x) = 0 for all other x
◮ It satisfies
  1. fX(x) ≥ 0, ∀x, and fX(x) = 0 unless x = xi for some i
  2. Σ_i fX(xi) = 1
◮ We have
  FX(x) = Σ_{i: xi≤x} fX(xi);  fX(x) = FX(x) − FX(x⁻)
◮ We can calculate the probability of any event as
  P[X ∈ B] = Σ_{i: xi∈B} fX(xi)

Recap: continuous random variable

◮ X is said to be a continuous random variable if there exists a function fX : ℜ → ℜ satisfying
  FX(x) = ∫_{−∞}^{x} fX(t) dt
The fX is called the probability density function.
◮ Same as saying FX is absolutely continuous.
◮ Since FX is continuous here, we have
  P[X = x] = FX(x) − FX(x⁻) = 0, ∀x
◮ A continuous rv takes uncountably many distinct values. However, not every rv that takes uncountably many values is a continuous rv
Recap: probability density function

◮ The pdf of a continuous rv is defined to be the fX that satisfies
  FX(x) = ∫_{−∞}^{x} fX(t) dt, ∀x
◮ It satisfies
  1. fX(x) ≥ 0, ∀x
  2. ∫_{−∞}^{∞} fX(t) dt = 1
◮ We can, in principle, compute the probability of any event as
  P[X ∈ B] = ∫_B fX(t) dt, ∀B ∈ B
◮ In particular,
  P[a ≤ X ≤ b] = ∫_a^b fX(t) dt

Recap: Function of a random variable

◮ If X is a random variable and g : ℜ → ℜ is a function, then Y = g(X) is a random variable.
◮ More formally, Y is a random variable if g is a Borel measurable function.
◮ We can determine the distribution of Y given the function g and the distribution of X

Recap

◮ Let X be a rv and let Y = g(X).
◮ The distribution function of Y is given by
  FY(y) = P[g(X) ≤ y] = P[X ∈ {z : g(z) ≤ y}]
◮ This probability can be obtained from the distribution of X.

Recap

◮ Suppose X is a discrete rv with X ∈ {x1, x2, · · · }. Suppose Y = g(X).
◮ Then Y is also discrete and Y ∈ {g(x1), g(x2), · · · }.
◮ We can find the pmf of Y as
  fY(y) = P[Y = y] = P[g(X) = y] = P[X ∈ {xi : g(xi) = y}] = Σ_{i: g(xi)=y} fX(xi)
Recap

◮ Let g : ℜ → ℜ be differentiable with g′(x) > 0, ∀x, or g′(x) < 0, ∀x.
◮ Let X be a continuous rv and let Y = g(X).
◮ Then Y is a continuous rv with pdf
  fY(y) = fX(g⁻¹(y)) |(d/dy) g⁻¹(y)|,  a ≤ y ≤ b
where a = min(g(∞), g(−∞)) and b = max(g(∞), g(−∞))
◮ This theorem is useful in some cases to find the densities of functions of continuous random variables

Recap: Expectation

◮ Let X be a discrete rv with X ∈ {x1, x2, · · · }. Then
  E[X] = Σ_i xi fX(xi)
◮ If X is a continuous random variable with pdf, fX,
  E[X] = ∫_{−∞}^{∞} x fX(x) dx
◮ Sometimes we use the following notation to denote expectation of both kinds of rv:
  E[X] = ∫_{−∞}^{∞} x dFX(x)
◮ We take the expectation to exist when the sum or integral above is absolutely convergent
◮ Note that expectation is defined for all random variables

Recap: Expectation of a function of a random variable

◮ Let X be a rv and let Y = g(X). Then,
  EY = ∫ y dFY(y) = ∫ g(x) dFX(x)
◮ That is, if X is discrete, then
  EY = Σ_j yj fY(yj) = Σ_i g(xi) fX(xi)
◮ If X and Y are continuous
  EY = ∫ y fY(y) dy = ∫ g(x) fX(x) dx
◮ This is true for all rv's.

Recap: Properties of Expectation

E[g(X)] = Σ_i g(xi) fX(xi)  or  E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx

◮ If X ≥ 0 then EX ≥ 0
◮ E[b] = b where b is a constant
◮ E[ag(X)] = aE[g(X)] where a is a constant
◮ E[aX + b] = aE[X] + b where a, b are constants.
◮ E[ag1(X) + bg2(X)] = aE[g1(X)] + bE[g2(X)]
◮ E[(X − c)²] ≥ E[(X − EX)²], ∀c
Recap: Variance of random variable

Var(X) = E[(X − EX)²] = E[X²] − (EX)²

◮ Properties of Variance:
  ◮ Var(X) ≥ 0
  ◮ Var(X + c) = Var(X)
  ◮ Var(cX) = c² Var(X)

Recap: Moments of a random variable

◮ The k-th (order) moment of X is
  mk = E[X^k] = ∫ x^k dFX(x)
◮ The k-th central moment of X is
  sk = E[(X − EX)^k] = ∫ (x − EX)^k dFX(x)
◮ If the moment of order k is finite then so is the moment of order s for s < k.

Recap: Moment Generating function

◮ The moment generating function – MX : ℜ → ℜ:
  MX(t) = E[e^(tX)] = Σ_i e^(t xi) fX(xi)  or  ∫ e^(tx) fX(x) dx,  t ∈ ℜ
◮ We say the mgf exists if E[e^(tX)] < ∞ for t in some interval around zero
◮ If MX(t) exists (for t ∈ [−a, a] for some a > 0) then all its derivatives also exist and
  d^k MX(t)/dt^k |_(t=0) = E[X^k]

Generating function

◮ For X ∈ {0, 1, 2, · · · } the (probability) generating function of X is defined by
  PX(s) = Σ_{k=0}^{∞} fX(k) s^k,  s ∈ ℜ
◮ We get the pmf from it as
  fX(0) = PX(0);  fX(1) = P′X(0)/1!;  fX(2) = P′′X(0)/2!
◮ We can also get the moments:
  P′X(1) = EX,  P′′X(1) = E[X(X−1)]
quantiles of a distribution

◮ Let p ∈ (0, 1). The number x ∈ ℜ that satisfies
  P[X ≤ x] ≥ p and P[X ≥ x] ≥ 1 − p
is called the quantile of order p or the 100p-th percentile of rv X.
◮ If x is a quantile of order p, it satisfies
  p ≤ FX(x) ≤ p + P[X = x]
◮ For a given p there can be multiple values for x to satisfy the above.
◮ For p = 0.5, it is called the median.

Recap: some moment inequalities

◮ Markov inequality: For a non-negative function, g,
  P[g(X) > c] ≤ E[g(X)]/c
◮ A specific instance of this is
  P[|X| > c] ≤ E[|X|^k]/c^k
◮ Chebyshev inequality
  P[|X − EX| > c] ≤ Var(X)/c²
◮ With EX = µ and Var(X) = σ², we get
  P[|X − µ| > kσ] ≤ 1/k²

Recap: A pair of random variables

◮ Let X, Y be random variables on the probability space (Ω, F, P)
◮ We can think of X, Y together as a vector-valued function mapping Ω to ℜ².
◮ This gives rise to a new probability space (ℜ², B², PXY) with PXY given by
  PXY(B) = P[(X, Y) ∈ B], ∀B ∈ B²
         = P({ω : (X(ω), Y(ω)) ∈ B})
Recap: Joint distribution function

◮ Let X, Y be random variables on the same probability space (Ω, F, P)
◮ The joint distribution function of X, Y is FXY : ℜ² → ℜ, defined by
  FXY(x, y) = P[X ≤ x, Y ≤ y]  (= PXY((−∞, x] × (−∞, y]))
            = P({ω : X(ω) ≤ x} ∩ {ω : Y(ω) ≤ y})
◮ The joint distribution function is the probability of the intersection of the events [X ≤ x] and [Y ≤ y].

Recap: Properties of Joint Distribution Function

◮ Joint distribution function: FXY : ℜ² → ℜ
  FXY(x, y) = P[X ≤ x, Y ≤ y]
◮ It satisfies
  1. FXY(−∞, y) = FXY(x, −∞) = 0, ∀x, y; FXY(∞, ∞) = 1
  2. FXY is non-decreasing in each of its arguments
  3. FXY is right continuous and has left-hand limits in each of its arguments
  4. For all x1 < x2 and y1 < y2
     FXY(x2, y2) − FXY(x2, y1) − FXY(x1, y2) + FXY(x1, y1) ≥ 0
◮ Any F : ℜ² → ℜ satisfying the above would be a joint distribution function.
◮ Let X, Y be two discrete random variables (defined on the same probability space).
◮ Let X ∈ {x1 , · · · , xn } and Y ∈ {y1 , · · · , ym }.
◮ We define the joint probability mass function of X and Y as

  fXY (xi , yj ) = P [X = xi , Y = yj ]

  (fXY (x, y) is zero for all other values of x, y)
◮ The fXY would satisfy
  fXY (x, y) ≥ 0, ∀x, y and Σ_{i} Σ_{j} fXY (xi , yj ) = 1
◮ This is a straightforward extension of the pmf of a single discrete rv.

Example
◮ Let Ω = (0, 1) with the ‘usual’ probability.
◮ So, each ω is a real number between 0 and 1
◮ Let X(ω) be the digit in the first decimal place in ω and let Y (ω) be the digit in the second decimal place.
◮ If ω = 0.2576 then X(ω) = 2 and Y (ω) = 5
◮ Easy to see that X, Y ∈ {0, 1, · · · , 9}.
◮ We want to calculate the joint pmf of X and Y
Example
◮ What is the event [X = 4]?

  [X = 4] = {ω : X(ω) = 4} = [0.4, 0.5)

◮ What is the event [Y = 3]?

  [Y = 3] = [0.03, 0.04) ∪ [0.13, 0.14) ∪ · · · ∪ [0.93, 0.94)

◮ What is the event [X = 4, Y = 3]?
  It is the intersection of the above

  [X = 4, Y = 3] = [0.43, 0.44)

◮ Hence the joint pmf of X and Y is

  fXY (x, y) = P [X = x, Y = y] = 0.01, x, y ∈ {0, 1, · · · , 9}

Example
◮ Consider the random experiment of rolling two dice.
  Ω = {(ω1 , ω2 ) : ω1 , ω2 ∈ {1, 2, · · · , 6}}
◮ Let X be the maximum of the two numbers and let Y be the sum of the two numbers.
◮ Easy to see X ∈ {1, 2, · · · , 6} and Y ∈ {2, 3, · · · , 12}
◮ What is the event [X = m, Y = n]? (We assume m, n are in the correct range)

  [X = m, Y = n] = {(ω1 , ω2 ) ∈ Ω : max(ω1 , ω2 ) = m, ω1 + ω2 = n}

◮ For this to be a non-empty set, we must have m < n ≤ 2m
◮ Then [X = m, Y = n] = {(m, n − m), (n − m, m)}
◮ Is this always true? No! What if n = 2m?
  [X = 3, Y = 6] = {(3, 3)}, [X = 4, Y = 6] = {(4, 2), (2, 4)}
◮ So, P [X = m, Y = n] is either 2/36 or 1/36 (assuming m, n satisfy the other requirements)
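The claim that every digit pair is equally likely (probability 0.01 each) can be checked by simulation; this is a sketch, with the sample size and seed chosen arbitrarily:

```python
import random

random.seed(0)
N = 200_000
counts = {}
for _ in range(N):
    w = random.random()        # omega drawn uniformly from (0, 1)
    x = int(10 * w) % 10       # first decimal digit: X(omega)
    y = int(100 * w) % 10      # second decimal digit: Y(omega)
    counts[(x, y)] = counts.get((x, y), 0) + 1

# each of the 100 digit pairs should occur with relative frequency near 0.01
est = counts[(2, 5)] / N
```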
Example
◮ We can now write the joint pmf.
◮ Assume 1 ≤ m ≤ 6 and 2 ≤ n ≤ 12. Then

  fXY (m, n) = 2/36 if m < n < 2m ;  1/36 if n = 2m

  (fXY (m, n) is zero in all other cases)
◮ Does this satisfy the requirements of a joint pmf?

  Σ_{m,n} fXY (m, n) = Σ_{m=1}^{6} Σ_{n=m+1}^{2m−1} (2/36) + Σ_{m=1}^{6} (1/36)
                     = (2/36) Σ_{m=1}^{6} (m − 1) + 6/36
                     = (2/36)(21 − 6) + 6/36 = 1

Joint Probability mass function
◮ Let X ∈ {x1 , x2 , · · · } and Y ∈ {y1 , y2 , · · · } be discrete random variables.
◮ The joint pmf: fXY (x, y) = P [X = x, Y = y].
◮ The joint pmf satisfies:
  fXY (x, y) ≥ 0, ∀x, y and Σ_{i} Σ_{j} fXY (xi , yj ) = 1
◮ Given the joint pmf, we can get the joint df as

  FXY (x, y) = Σ_{i: xi ≤ x} Σ_{j: yj ≤ y} fXY (xi , yj )
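The joint pmf above can also be built by brute-force enumeration of the 36 equally likely outcomes; a sketch using exact rational arithmetic:

```python
from fractions import Fraction
from itertools import product

# enumerate all 36 outcomes of rolling two dice and accumulate the pmf
pmf = {}
for w1, w2 in product(range(1, 7), repeat=2):
    m, n = max(w1, w2), w1 + w2      # X = max, Y = sum
    pmf[(m, n)] = pmf.get((m, n), Fraction(0)) + Fraction(1, 36)

total = sum(pmf.values())            # must be exactly 1
```

For instance the enumeration gives pmf[(4, 6)] = 2/36 (outcomes (4, 2) and (2, 4)) and pmf[(3, 6)] = 1/36 (only (3, 3)), matching the two cases in the formula.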
◮ Given sets {x1 , x2 , · · · } and {y1 , y2 , · · · }.
◮ Suppose fXY : ℜ2 → [0, 1] is such that
  ◮ fXY (x, y) = 0 unless x = xi for some i and y = yj for some j, and
  ◮ Σ_{i} Σ_{j} fXY (xi , yj ) = 1
◮ Then fXY is a joint pmf.
◮ This is because, if we define

  FXY (x, y) = Σ_{i: xi ≤ x} Σ_{j: yj ≤ y} fXY (xi , yj )

  then FXY satisfies all properties of a df.
◮ We normally specify a pair of discrete random variables by giving the joint pmf

◮ Given the joint pmf, we can (in principle) compute the probability of any event involving the two discrete random variables.

  P [(X, Y ) ∈ B] = Σ_{i,j: (xi ,yj )∈B} fXY (xi , yj )

◮ Now, events can be specified in terms of relations between the two rv’s too

  [X < Y + 2] = {ω : X(ω) < Y (ω) + 2}

◮ Thus,

  P [X < Y + 2] = Σ_{i,j: xi < yj +2} fXY (xi , yj )
◮ Take the example: 2 dice, X is max and Y is sum
◮ fXY (m, n) = 0 unless m = 1, · · · , 6 and n = 2, · · · , 12. For this range

  fXY (m, n) = 2/36 if m < n < 2m ;  1/36 if n = 2m

◮ Suppose we want P [Y = X + 2].

  P [Y = X + 2] = Σ_{m,n: n=m+2} fXY (m, n) = Σ_{m=1}^{6} fXY (m, m + 2)
                = Σ_{m=2}^{6} fXY (m, m + 2)   (since we need m + 2 ≤ 2m)
                = 1/36 + 4 × (2/36) = 9/36

Joint density function
◮ Let X, Y be two continuous rv’s with df FXY .
◮ If there exists a function fXY that satisfies

  FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y ′ ) dy ′ dx′ , ∀x, y

  then we say that X, Y have a joint probability density function, which is fXY
◮ Please note the difference in the definition of joint pmf and joint pdf.
◮ When X, Y are discrete we defined a joint pmf
◮ We are not saying that if X, Y are continuous rv’s then a joint density exists.
◮ We use joint density to mean joint pdf
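The value 9/36 can be verified directly by enumeration; note that, since sum = max + min, the event [Y = X + 2] is exactly "the smaller die shows 2" (a sketch, exact arithmetic):

```python
from fractions import Fraction
from itertools import product

p = Fraction(0)
for w1, w2 in product(range(1, 7), repeat=2):
    if w1 + w2 == max(w1, w2) + 2:   # equivalently min(w1, w2) == 2
        p += Fraction(1, 36)
```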
properties of joint density
◮ The joint density (or joint pdf) of X, Y is fXY that satisfies

  FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y ′ ) dy ′ dx′ , ∀x, y

◮ Since FXY is non-decreasing in each argument, we must have fXY (x, y) ≥ 0.
◮ ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY (x′ , y ′ ) dy ′ dx′ = 1 is needed to ensure FXY (∞, ∞) = 1.

properties of joint density
◮ The joint density fXY satisfies the following
  1. fXY (x, y) ≥ 0, ∀x, y
  2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY (x′ , y ′ ) dy ′ dx′ = 1
◮ These are very similar to the properties of the density of a single rv
Example: Joint Density
◮ Consider the function

  f (x, y) = 2, 0 < x < y < 1 (f (x, y) = 0, otherwise)

◮ Let us show this is a density

  ∫_{−∞}^{∞} ∫_{−∞}^{∞} f (x, y) dx dy = ∫_{0}^{1} ∫_{0}^{y} 2 dx dy = ∫_{0}^{1} 2 x|_{0}^{y} dy = ∫_{0}^{1} 2y dy = 1

◮ We can say this density is uniform over the region
  (figure: the triangular region {(x, y) : 0 < x < y < 1} in the unit square)

properties of joint density
◮ The joint density fXY satisfies the following
  1. fXY (x, y) ≥ 0, ∀x, y
  2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY (x′ , y ′ ) dy ′ dx′ = 1
◮ Any function fXY : ℜ2 → ℜ satisfying the above two is a joint density function.
◮ Given fXY satisfying the above, define

  FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y ′ ) dy ′ dx′ , ∀x, y

◮ Then we can show FXY is a joint distribution.
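That the function integrates to 1 can also be confirmed numerically with a midpoint Riemann sum over the unit square (a sketch; the grid size is an arbitrary choice):

```python
# midpoint rule on an N x N grid over the unit square,
# integrating f(x, y) = 2 on the region 0 < x < y < 1
N = 400
h = 1.0 / N
total = 0.0
for i in range(N):
    x = (i + 0.5) * h
    for j in range(N):
        y = (j + 0.5) * h
        if x < y:
            total += 2.0 * h * h
```

The sum comes out to (N − 1)/N, approaching 1 as the grid is refined.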
◮ fXY (x, y) ≥ 0 and ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY (x′ , y ′ ) dy ′ dx′ = 1
◮ Define

  FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y ′ ) dy ′ dx′ , ∀x, y

◮ Then, FXY (−∞, y) = FXY (x, −∞) = 0, ∀x, y and FXY (∞, ∞) = 1
◮ Since fXY (x, y) ≥ 0, FXY is non-decreasing in each argument.
◮ Since it is given as an integral, the above also shows that FXY is continuous in each argument.
◮ The only property left is the special property of FXY we mentioned earlier.

◮ Let ∆ := FXY (x2 , y2 ) − FXY (x1 , y2 ) − FXY (x2 , y1 ) + FXY (x1 , y1 ).
◮ We need to show ∆ ≥ 0 if x1 < x2 and y1 < y2 .
◮ We have

  ∆ = ∫_{−∞}^{x2} ∫_{−∞}^{y2} fXY dy dx − ∫_{−∞}^{x1} ∫_{−∞}^{y2} fXY dy dx
      − ∫_{−∞}^{x2} ∫_{−∞}^{y1} fXY dy dx + ∫_{−∞}^{x1} ∫_{−∞}^{y1} fXY dy dx
    = ∫_{−∞}^{x2} ( ∫_{−∞}^{y2} fXY dy − ∫_{−∞}^{y1} fXY dy ) dx
      − ∫_{−∞}^{x1} ( ∫_{−∞}^{y2} fXY dy − ∫_{−∞}^{y1} fXY dy ) dx

◮ Thus we have

  ∆ = ∫_{−∞}^{x2} ∫_{y1}^{y2} fXY dy dx − ∫_{−∞}^{x1} ∫_{y1}^{y2} fXY dy dx
    = ∫_{x1}^{x2} ∫_{y1}^{y2} fXY dy dx ≥ 0

◮ This actually shows

  P [x1 ≤ X ≤ x2 , y1 ≤ Y ≤ y2 ] = ∫_{x1}^{x2} ∫_{y1}^{y2} fXY dy dx

◮ What we showed is the following
◮ Any function fXY : ℜ2 → ℜ that satisfies
  ◮ fXY (x, y) ≥ 0, ∀x, y
  ◮ ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY (x, y) dx dy = 1
  is a joint density function. This is because now
  FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y ′ ) dy ′ dx′
  would satisfy all conditions for a df.
◮ Convenient to specify joint density (when it exists)
◮ We also showed

  P [x1 ≤ X ≤ x2 , y1 ≤ Y ≤ y2 ] = ∫_{x1}^{x2} ∫_{y1}^{y2} fXY dy dx

◮ In general

  P [(X, Y ) ∈ B] = ∫_{B} fXY (x, y) dx dy, ∀B ∈ B 2
◮ Let us consider the example

  f (x, y) = 2, 0 < x < y < 1

◮ Suppose we want the probability of [Y > X + 0.5]

  P [Y > X + 0.5] = P [(X, Y ) ∈ {(x, y) : y > x + 0.5}]
                  = ∫∫_{{(x,y) : y>x+0.5}} fXY (x, y) dx dy
                  = ∫_{0.5}^{1} ∫_{0}^{y−0.5} 2 dx dy
                  = ∫_{0.5}^{1} 2(y − 0.5) dy
                  = (y^2 − y)|_{0.5}^{1} = 1 − 0.25 − 1 + 0.5 = 0.25

◮ We can look at it geometrically
  (figure: the region 0 < x < y < 1 with the sub-triangle y > x + 0.5)
◮ The probability of the event we want is the area of the small triangle divided by that of the big triangle.
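A Monte Carlo check is also easy (a sketch; seed and sample size arbitrary). It uses the standard order-statistics fact that the min and max of two independent Uniform(0, 1) draws have joint density 2 on 0 < x < y < 1, so sampling (X, Y) that way matches the density in the example:

```python
import random

random.seed(0)
N = 200_000
hits = 0
for _ in range(N):
    u, v = random.random(), random.random()
    x, y = min(u, v), max(u, v)   # (X, Y) ~ density 2 on 0 < x < y < 1
    if y > x + 0.5:
        hits += 1
est = hits / N                    # should be close to 0.25
```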
Marginal Distributions
◮ Let X, Y be random variables with joint distribution function FXY .
◮ We know FXY (x, y) = P [X ≤ x, Y ≤ y].
◮ Hence

  FXY (x, ∞) = P [X ≤ x, Y ≤ ∞] = P [X ≤ x] = FX (x)

◮ We define the marginal distribution functions of X, Y by

  FX (x) = FXY (x, ∞); FY (y) = FXY (∞, y)

◮ These are simply distribution functions of X and Y obtained from the joint distribution function.

Marginal mass functions
◮ Let X ∈ {x1 , x2 , · · · } and Y ∈ {y1 , y2 , · · · }
◮ Let fXY be their joint mass function.
◮ Then

  P [X = xi ] = Σ_{j} P [X = xi , Y = yj ] = Σ_{j} fXY (xi , yj )

  (This is because the events [Y = yj ], j = 1, · · · , form a partition and P (A) = Σ_{i} P (A Bi ) when {Bi } is a partition)
◮ We define the marginal mass functions of X and Y as

  fX (xi ) = Σ_{j} fXY (xi , yj ); fY (yj ) = Σ_{i} fXY (xi , yj )

◮ These are mass functions of X and Y obtained from the joint mass function
marginal density functions
◮ Let X, Y be continuous rv with joint density fXY .
◮ Then we know FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y ′ ) dy ′ dx′
◮ Hence, we have

  FX (x) = FXY (x, ∞) = ∫_{−∞}^{x} ∫_{−∞}^{∞} fXY (x′ , y ′ ) dy ′ dx′
         = ∫_{−∞}^{x} ( ∫_{−∞}^{∞} fXY (x′ , y ′ ) dy ′ ) dx′

◮ Since X is a continuous rv, this means

  fX (x) = ∫_{−∞}^{∞} fXY (x, y) dy

  We call this the marginal density of X.
◮ Similarly, marginal density of Y is

  fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx

◮ These are pdf’s of X and Y obtained from the joint density.

Example
◮ Rolling two dice, X is max, Y is sum
◮ We had, for 1 ≤ m ≤ 6 and 2 ≤ n ≤ 12,

  fXY (m, n) = 2/36 if m < n < 2m ;  1/36 if n = 2m

◮ We know, fX (m) = Σ_{n} fXY (m, n), m = 1, · · · , 6.
◮ Given m, for what values of n is fXY (m, n) > 0 ?
  We can only have n = m + 1, · · · , 2m.
◮ Hence we get

  fX (m) = Σ_{n=m+1}^{2m} fXY (m, n) = Σ_{n=m+1}^{2m−1} (2/36) + 1/36 = (2/36)(m − 1) + 1/36 = (2m − 1)/36
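Summing out n in the joint pmf reproduces fX (m) = (2m − 1)/36, which can be checked by enumeration (a sketch using exact rational arithmetic):

```python
from fractions import Fraction
from itertools import product

# build the joint pmf of (max, sum) for two dice by enumeration
joint = {}
for w1, w2 in product(range(1, 7), repeat=2):
    key = (max(w1, w2), w1 + w2)
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 36)

# marginal pmf of X = max: sum the joint pmf over all values of Y
fX = {m: sum(p for (mm, _), p in joint.items() if mm == m)
      for m in range(1, 7)}
```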
Example
◮ Consider the joint density

  fXY (x, y) = 2, 0 < x < y < 1

◮ The marginal density of X is: for 0 < x < 1,

  fX (x) = ∫_{−∞}^{∞} fXY (x, y) dy = ∫_{x}^{1} 2 dy = 2(1 − x)

  Thus, fX (x) = 2(1 − x), 0 < x < 1
◮ We can easily verify this is a density

  ∫_{−∞}^{∞} fX (x) dx = ∫_{0}^{1} 2(1 − x) dx = (2x − x^2)|_{0}^{1} = 1

◮ We can similarly find the density of Y .
◮ For 0 < y < 1,

  fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx = ∫_{0}^{y} 2 dx = 2y

◮ Thus, fY (y) = 2y, 0 < y < 1 and

  ∫_{0}^{1} 2y dy = y^2 |_{0}^{1} = 1
◮ If we are given the joint df or joint pmf/joint density of X, Y , then the individual df or pmf/pdf are uniquely determined.
◮ However, given the individual pdf of X and Y , we cannot determine the joint density. (The same is true of pmf or df)
◮ There can be many different joint density functions all having the same marginals

Conditional distributions
◮ Let X, Y be rv’s on the same probability space
◮ We define the conditional distribution of X given Y by

  FX|Y (x|y) = P [X ≤ x|Y = y]

  (For now ignore the case of P [Y = y] = 0).
◮ Note that FX|Y : ℜ2 → ℜ
◮ FX|Y (x|y) is a notation. We could write FX|Y (x, y).
◮ Conditional distribution of X given Y is

  FX|Y (x|y) = P [X ≤ x|Y = y]

  It is the conditional probability of [X ≤ x] given (or conditioned on) [Y = y].
◮ Consider the example: rolling 2 dice, X is max, Y is sum

  P [X ≤ 4|Y = 3] = 1; P [X ≤ 4|Y = 9] = 0

◮ This is what conditional distribution captures.
◮ For every value of y, FX|Y (x|y) is a distribution function in the variable x.
◮ It defines a new distribution for X based on knowing the value of Y .

◮ Let X ∈ {x1 , x2 , · · · } and Y ∈ {y1 , y2 , · · · }. Then

  FX|Y (x|yj ) = P [X ≤ x|Y = yj ] = P [X ≤ x, Y = yj ] / P [Y = yj ]

  (We define FX|Y (x|y) only when y = yj for some j).
◮ For each yj , FX|Y (x|yj ) is a df of a discrete rv in x.
◮ Since X is a discrete rv, we can write the above as

  FX|Y (x|yj ) = Σ_{i: xi ≤ x} P [X = xi , Y = yj ] / P [Y = yj ]
              = Σ_{i: xi ≤ x} ( fXY (xi , yj ) / fY (yj ) )
Conditional mass function
◮ We got

  FX|Y (x|yj ) = Σ_{i: xi ≤ x} ( fXY (xi , yj ) / fY (yj ) )

◮ Since X is a discrete rv, what is inside the summation above is the pmf corresponding to the df FX|Y .
◮ We define the conditional mass function of X given Y as

  fX|Y (xi |yj ) = fXY (xi , yj ) / fY (yj ) = P [X = xi |Y = yj ]

◮ This gives us the useful identity

  fXY (xi , yj ) = fX|Y (xi |yj ) fY (yj )

  ( P [X = xi , Y = yj ] = P [X = xi |Y = yj ] P [Y = yj ] )
◮ This gives us the total probability rule for rv’s

  fX (xi ) = Σ_{j} fXY (xi , yj ) = Σ_{j} fX|Y (xi |yj ) fY (yj )

◮ This is the same as

  P [X = xi ] = Σ_{j} P [X = xi |Y = yj ] P [Y = yj ]

  (P (A) = Σ_{j} P (A|Bj )P (Bj ) when B1 , · · · form a partition)
◮ We have

  fXY (xi , yj ) = fX|Y (xi |yj )fY (yj ) = fY |X (yj |xi )fX (xi )

◮ This gives us Bayes rule for discrete rv’s

  fX|Y (xi |yj ) = fY |X (yj |xi )fX (xi ) / fY (yj )
               = fY |X (yj |xi )fX (xi ) / Σ_{i} fXY (xi , yj )
               = fY |X (yj |xi )fX (xi ) / Σ_{i} fY |X (yj |xi )fX (xi )

Recap: Joint Distribution Function
◮ Given X, Y rv on same probability space, joint distribution function: FXY : ℜ2 → ℜ

  FXY (x, y) = P [X ≤ x, Y ≤ y]

◮ It satisfies
  1. FXY (−∞, y) = FXY (x, −∞) = 0, ∀x, y; FXY (∞, ∞) = 1
  2. FXY is non-decreasing in each of its arguments
  3. FXY is right continuous and has left-hand limits in each of its arguments
  4. For all x1 < x2 and y1 < y2 ,
     FXY (x2 , y2 ) − FXY (x2 , y1 ) − FXY (x1 , y2 ) + FXY (x1 , y1 ) ≥ 0
◮ Any F : ℜ2 → ℜ satisfying the above would be a joint distribution function.
Recap: Joint Probability mass function
◮ X ∈ {x1 , x2 , · · · }, Y ∈ {y1 , y2 , · · · }
◮ The joint pmf: fXY (x, y) = P [X = x, Y = y].
◮ The joint pmf satisfies:
  A1 fXY (x, y) ≥ 0, ∀x, y and non-zero only for xi , yj pairs
  A2 Σ_{i} Σ_{j} fXY (xi , yj ) = 1
◮ Given the joint pmf, we can get the joint df as

  FXY (x, y) = Σ_{i: xi ≤ x} Σ_{j: yj ≤ y} fXY (xi , yj )

◮ Any fXY : ℜ2 → [0, 1] satisfying A1 and A2 above is a joint pmf. (The FXY satisfies all properties of df).
◮ Given the joint pmf, we can (in principle) compute the probability of any event involving the two discrete random variables.

  P [(X, Y ) ∈ B] = Σ_{i,j: (xi ,yj )∈B} fXY (xi , yj )

Recap: joint density
◮ Two cont rv X, Y have a joint density fXY if

  FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y ′ ) dy ′ dx′ , ∀x, y

◮ The joint density fXY satisfies the following
  1. fXY (x, y) ≥ 0, ∀x, y
  2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY (x′ , y ′ ) dy ′ dx′ = 1
◮ Any function fXY : ℜ2 → ℜ satisfying the above two is a joint density function. (Then the above FXY can be shown to be a joint df).
◮ We also have

  P [x1 ≤ X ≤ x2 , y1 ≤ Y ≤ y2 ] = ∫_{x1}^{x2} ∫_{y1}^{y2} fXY dy dx

  and, in general,

  P [(X, Y ) ∈ B] = ∫_{B} fXY (x, y) dx dy, ∀B ∈ B 2
Recap: Marginals
◮ Marginal distribution functions of X, Y are

  FX (x) = FXY (x, ∞); FY (y) = FXY (∞, y)

◮ X, Y discrete with joint pmf fXY . The marginal pmfs are

  fX (x) = Σ_{y} fXY (x, y); fY (y) = Σ_{x} fXY (x, y)

◮ If X, Y have joint pdf fXY then the marginal pdfs are

  fX (x) = ∫_{−∞}^{∞} fXY (x, y) dy;  fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx

Recap: Conditional distribution
◮ Let X ∈ {x1 , x2 , · · · } and Y ∈ {y1 , y2 , · · · }. Then

  FX|Y (x|yj ) = P [X ≤ x|Y = yj ] = P [X ≤ x, Y = yj ] / P [Y = yj ]

  (We define FX|Y (x|y) only when y = yj for some j).
◮ For each yj , FX|Y (x|yj ) is a df of a discrete rv in x.
◮ The pmf corresponding to this df is called the conditional pmf

  fX|Y (xi |yj ) = P [X = xi |Y = yj ] = fXY (xi , yj ) / fY (yj )
Recap: Bayes rule for discrete rv’s
◮ The conditional mass function is

  fX|Y (xi |yj ) = P [X = xi |Y = yj ] = fXY (xi , yj ) / fY (yj )

◮ This gives us the useful identity

  fXY (xi , yj ) = fX|Y (xi |yj )fY (yj )

◮ This gives us the total probability rule for rv’s

  fX (xi ) = Σ_{j} fXY (xi , yj ) = Σ_{j} fX|Y (xi |yj )fY (yj )

◮ Also gives us Bayes rule for discrete rv

  fX|Y (xi |yj ) = fY |X (yj |xi )fX (xi ) / Σ_{i} fY |X (yj |xi )fX (xi )

Example: Conditional pmf
◮ Consider the random experiment of tossing a coin n times.
◮ Let X denote the number of heads and let Y denote the toss number on which the first head comes.
◮ For 1 ≤ k ≤ n

  fY |X (k|1) = P [Y = k|X = 1] = P [Y = k, X = 1] / P [X = 1]
             = p(1 − p)^{n−1} / ( nC1 p(1 − p)^{n−1} )
             = 1/n

◮ Given there is only one head, it is equally likely to occur on any toss.
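The conclusion fY|X (k|1) = 1/n can be verified by enumerating all 2^n toss sequences (a sketch; the values of n and p are arbitrary, and exact rationals keep the check free of rounding):

```python
from fractions import Fraction
from itertools import product

n, p = 5, Fraction(1, 3)
num = {}                               # num[k] = P[Y = k, X = 1]
for seq in product([0, 1], repeat=n):  # 1 = head, 0 = tail
    if sum(seq) == 1:                  # exactly one head in the sequence
        k = seq.index(1) + 1           # toss number of that (first) head
        num[k] = num.get(k, Fraction(0)) + p * (1 - p) ** (n - 1)

pX1 = sum(num.values())                # P[X = 1] = nC1 p (1-p)^(n-1)
cond = {k: num[k] / pX1 for k in num}  # fY|X(k|1)
```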
◮ Let X, Y be continuous rv’s with joint density, fXY .
◮ We once again want to define the conditional df

  FX|Y (x|y) = P [X ≤ x|Y = y]

◮ But the conditioning event, [Y = y], has zero probability.
◮ Hence we define the conditional df as follows

  FX|Y (x|y) = lim_{δ→0} P [X ≤ x|Y ∈ [y, y + δ]]

◮ This is well defined if the limit exists.
◮ The limit exists for all y where fY (y) > 0 (and for all x)

◮ The conditional df is given by (assuming fY (y) > 0)

  FX|Y (x|y) = lim_{δ→0} P [X ≤ x|Y ∈ [y, y + δ]]
             = lim_{δ→0} P [X ≤ x, Y ∈ [y, y + δ]] / P [Y ∈ [y, y + δ]]
             = lim_{δ→0} ( ∫_{−∞}^{x} ∫_{y}^{y+δ} fXY (x′ , y ′ ) dy ′ dx′ ) / ( ∫_{y}^{y+δ} fY (y ′ ) dy ′ )
             = lim_{δ→0} ( ∫_{−∞}^{x} fXY (x′ , y) δ dx′ ) / ( fY (y) δ )
             = ∫_{−∞}^{x} ( fXY (x′ , y) / fY (y) ) dx′

◮ We define the conditional density of X given Y as

  fX|Y (x|y) = fXY (x, y) / fY (y)
◮ Let X, Y have joint density fXY .
◮ The conditional df of X given Y is

  FX|Y (x|y) = lim_{δ→0} P [X ≤ x|Y ∈ [y, y + δ]]

◮ This exists if fY (y) > 0 and then it has a density:

  FX|Y (x|y) = ∫_{−∞}^{x} fX|Y (x′ |y) dx′

◮ This conditional density is given by

  fX|Y (x|y) = fXY (x, y) / fY (y)

◮ We (once again) have the useful identity

  fXY (x, y) = fX|Y (x|y) fY (y) = fY |X (y|x)fX (x)

Example

  fXY (x, y) = 2, 0 < x < y < 1

◮ We saw that the marginal densities are
  fX (x) = 2(1 − x), 0 < x < 1; fY (y) = 2y, 0 < y < 1
◮ Hence the conditional densities are given by

  fX|Y (x|y) = fXY (x, y) / fY (y) = 1/y, 0 < x < y < 1
  fY |X (y|x) = fXY (x, y) / fX (x) = 1/(1 − x), 0 < x < y < 1

◮ We can see this intuitively like this
  (figure: the region 0 < x < y < 1; for a fixed y, X is uniform over (0, y))
◮ The identity fXY (x, y) = fX|Y (x|y)fY (y) can be used to specify the joint density of two continuous rv’s
◮ We can specify the marginal density of one and the conditional density of the other given the first.
◮ This may actually be the model of how the rv’s are generated.

Example
◮ Let X be uniform over (0, 1) and let Y be uniform over 0 to X. Find the density of Y .
◮ What we are given is

  fX (x) = 1, 0 < x < 1; fY |X (y|x) = 1/x, 0 < y < x < 1

◮ Hence the joint density is: fXY (x, y) = 1/x, 0 < y < x < 1.
◮ Hence the density of Y is

  fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx = ∫_{y}^{1} (1/x) dx = − ln(y), 0 < y < 1

◮ We can verify it to be a density

  ∫_{0}^{1} − ln(y) dy = −y ln(y)|_{0}^{1} + ∫_{0}^{1} y (1/y) dy = 1
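The two-stage generative description can be simulated literally and compared against the derived density: under fY (y) = − ln(y), P [Y < 0.5] = ∫_{0}^{0.5} − ln(y) dy = 0.5 − 0.5 ln(0.5). A simulation sketch (seed and sample size arbitrary):

```python
import math
import random

random.seed(0)
N = 200_000
count = 0
for _ in range(N):
    x = random.random()          # X ~ Uniform(0, 1)
    y = random.uniform(0, x)     # Y | X = x ~ Uniform(0, x)
    if y < 0.5:
        count += 1

est = count / N
exact = 0.5 - 0.5 * math.log(0.5)   # integral of -ln(y) over (0, 0.5)
```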
◮ We have the identity

  fXY (x, y) = fX|Y (x|y) fY (y)

◮ By integrating both sides

  fX (x) = ∫_{−∞}^{∞} fXY (x, y) dy = ∫_{−∞}^{∞} fX|Y (x|y) fY (y) dy

◮ This is a continuous analogue of the total probability rule.
◮ But note that, since X is a continuous rv, fX (x) is NOT P [X = x]
◮ In case of discrete rv, the mass function value fX (x) is equal to P [X = x] and we had

  fX (x) = Σ_{y} fX|Y (x|y)fY (y)

◮ It is as if one can simply replace pmf by pdf and summation by integration!!
◮ While often that gives the right result, one needs to be very careful

◮ We have the identity

  fXY (x, y) = fX|Y (x|y) fY (y) = fY |X (y|x)fX (x)

◮ This gives rise to Bayes rule for continuous rv

  fX|Y (x|y) = fY |X (y|x)fX (x) / fY (y)
             = fY |X (y|x)fX (x) / ∫_{−∞}^{∞} fY |X (y|x)fX (x) dx

◮ This is essentially identical to Bayes rule for discrete rv’s. We have essentially put the pdf wherever there was pmf
◮ To recap, we started by defining the conditional distribution function

  FX|Y (x|y) = P [X ≤ x|Y = y]

◮ When X, Y are discrete, we define this only for y = yj . That is, we define it only for all values that Y can take.
◮ When X, Y have a joint density, we defined it by

  FX|Y (x|y) = lim_{δ→0} P [X ≤ x|Y ∈ [y, y + δ]]

  This limit exists and FX|Y is well defined if fY (y) > 0. That is, essentially again for all values that Y can take.
◮ In the discrete case, we define fX|Y as the pmf corresponding to FX|Y . This conditional pmf can also be defined as a conditional probability
◮ In the continuous case fX|Y is the density corresponding to FX|Y .
◮ In both cases we have: fXY (x, y) = fX|Y (x|y)fY (y)
◮ This gives the total probability rule and Bayes rule for random variables

◮ Now, let X be a continuous rv and let Y be a discrete rv.
◮ We can define FX|Y as

  FX|Y (x|y) = P [X ≤ x|Y = y]

  This is well defined for all values that Y takes. (We consider only those y)
◮ Since X is a continuous rv, this df would have a density

  FX|Y (x|y) = ∫_{−∞}^{x} fX|Y (x′ |y) dx′

◮ Hence we can write

  P [X ≤ x, Y = y] = FX|Y (x|y)P [Y = y] = ∫_{−∞}^{x} fX|Y (x′ |y) fY (y) dx′
◮ We now get

  FX (x) = P [X ≤ x] = Σ_{y} P [X ≤ x, Y = y]
         = Σ_{y} ∫_{−∞}^{x} fX|Y (x′ |y) fY (y) dx′
         = ∫_{−∞}^{x} ( Σ_{y} fX|Y (x′ |y) fY (y) ) dx′

◮ This gives us

  fX (x) = Σ_{y} fX|Y (x|y)fY (y)

◮ This is another version of the total probability rule.
◮ Earlier we derived this when X, Y are discrete.
◮ The formula is true even when X is continuous. The only difference is we need to take fX as the density of X.

◮ When X, Y are discrete we have

  fX (x) = Σ_{y} fX|Y (x|y)fY (y)   ( P [X = x] = Σ_{y} P [X = x|Y = y]P [Y = y] )

◮ When X is continuous and Y is discrete, we defined fX|Y (x|y) to be the density corresponding to FX|Y (x|y) = P [X ≤ x|Y = y]
◮ Then we once again get

  fX (x) = Σ_{y} fX|Y (x|y)fY (y)

  Now, fX is a density (and not a mass function).
◮ Suppose Y ∈ {1, 2, 3} and fY (i) = λi ; let fX|Y (x|i) = fi (x). Then

  fX (x) = λ1 f1 (x) + λ2 f2 (x) + λ3 f3 (x)

  This is called a mixture density model
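Sampling from such a mixture follows the generative model literally: draw the discrete label Y first, then draw X from the chosen component density. A sketch with three hypothetical uniform components (the weights and supports are made-up illustration values, not from the slides):

```python
import random

random.seed(0)
weights = [0.5, 0.3, 0.2]             # lambda_1, lambda_2, lambda_3
supports = [(0, 1), (1, 2), (2, 4)]   # each f_i is uniform on its support

def sample_mixture():
    y = random.choices([0, 1, 2], weights=weights)[0]  # draw the label Y
    a, b = supports[y]
    return random.uniform(a, b)                        # X | Y = y ~ f_y

xs = [sample_mixture() for _ in range(100_000)]
# only the first component puts mass below 1, so P[X < 1] = lambda_1 = 0.5
frac_below_1 = sum(x < 1 for x in xs) / len(xs)
```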
◮ Continuing with X continuous rv and Y discrete. We have

  FX|Y (x|y) = P [X ≤ x|Y = y] = ∫_{−∞}^{x} fX|Y (x′ |y) dx′

◮ We also have

  P [X ≤ x, Y = y] = ∫_{−∞}^{x} fX|Y (x′ |y) fY (y) dx′

◮ Hence we can define a ‘joint density’

  fXY (x, y) = fX|Y (x|y)fY (y)

◮ This is a kind of mixed density and mass function.
◮ We will not be using such ‘joint densities’ here

◮ Continuing with X continuous rv and Y discrete
◮ Can we define fY |X (y|x)?
◮ Since Y is discrete, this (conditional) mass function is

  fY |X (y|x) = P [Y = y|X = x]

  But the conditioning event has zero probability. We now know how to handle it:

  fY |X (y|x) = lim_{δ→0} P [Y = y|X ∈ [x, x + δ]]

◮ For simplifying this we note the following:

  P [X ≤ x, Y = y] = ∫_{−∞}^{x} fX|Y (x′ |y) fY (y) dx′
  ⇒ P [X ∈ [x, x + δ], Y = y] = ∫_{x}^{x+δ} fX|Y (x′ |y) fY (y) dx′
◮ We have

  fY |X (y|x) = lim_{δ→0} P [Y = y|X ∈ [x, x + δ]]
             = lim_{δ→0} P [Y = y, X ∈ [x, x + δ]] / P [X ∈ [x, x + δ]]
             = lim_{δ→0} ( ∫_{x}^{x+δ} fX|Y (x′ |y) fY (y) dx′ ) / ( ∫_{x}^{x+δ} fX (x′ ) dx′ )
             = lim_{δ→0} ( fX|Y (x|y) δ fY (y) ) / ( fX (x) δ )
             = fX|Y (x|y) fY (y) / fX (x)

◮ This gives us further versions of the total probability rule and Bayes rule.

◮ First let us look at the total probability rule possibilities
◮ When X is continuous rv and Y is discrete rv, we derived

  fY |X (y|x)fX (x) = fX|Y (x|y) fY (y)

  Note that fY is a mass fn, fX is a density, and so on.
◮ Since fX|Y is a density (corresponding to FX|Y ),

  ∫_{−∞}^{∞} fX|Y (x|y) dx = 1

◮ Hence we get

  fY (y) = ∫_{−∞}^{∞} fY |X (y|x)fX (x) dx

◮ Earlier we derived the same formula when X, Y have a joint density.
◮ Let us review all the total probability formulas

  1. fX (x) = Σ_{y} fX|Y (x|y)fY (y)

◮ We first derived this when X, Y are discrete.
◮ But now we proved this holds whenever Y is discrete. If X is continuous, the fX , fX|Y are densities; if X is also discrete they are mass functions

  2. fY (y) = ∫_{−∞}^{∞} fY |X (y|x)fX (x) dx

◮ We first proved it when X, Y have a joint density. We now know it holds also when X is cont and Y is discrete. In that case fY is a mass function

◮ When X is continuous rv and Y is discrete rv, we derived

  fY |X (y|x)fX (x) = fX|Y (x|y) fY (y)

◮ This once again gives rise to Bayes rule:

  fY |X (y|x) = fX|Y (x|y) fY (y) / fX (x);   fX|Y (x|y) = fY |X (y|x)fX (x) / fY (y)

◮ Earlier we showed this holds when X, Y are both discrete or both continuous.
◮ Thus Bayes rule holds in all four possible scenarios
◮ Only difference is we need to interpret fX or fX|Y as mass functions when X is discrete and as densities when X is a continuous rv
◮ In general, one refers to these always as densities since the actual meaning would be clear from context.
Example
◮ Consider a communication system. The transmitter puts out 0 or 5 volts for the bits 0 and 1, and the voltage measured by the receiver is the sent voltage plus noise added by the channel.
◮ We assume the noise has Gaussian density with mean zero and variance σ^2.
◮ We may want the probability that the sent bit is 1 when the measured voltage at the receiver is x, to decide what is sent.
◮ Let X be the measured voltage and let Y be the sent bit.
◮ We want to calculate fY |X (1|x).
◮ We want to use the Bayes rule to calculate this
◮ We need fX|Y . What does our model say?
◮ fX|Y (x|1) is Gaussian with mean 5 and variance σ^2 and fX|Y (x|0) is Gaussian with mean zero and variance σ^2

  P [Y = 1|X = x] = fY |X (1|x) = fX|Y (x|1) fY (1) / fX (x)

◮ We need fY (1), fY (0). Let us take them to be the same.
◮ In practice we only want to know whether fY |X (1|x) > fY |X (0|x)
◮ Then we do not need to calculate fX (x). We only need the ratio of fY |X (1|x) and fY |X (0|x).
◮ The ratio of the two probabilities is

  fY |X (1|x) / fY |X (0|x) = ( fX|Y (x|1) fY (1) ) / ( fX|Y (x|0) fY (0) )
  = [ (1/(σ√2π)) exp(−(x − 5)^2 /(2σ^2)) ] / [ (1/(σ√2π)) exp(−(x − 0)^2 /(2σ^2)) ]
  = exp( −0.5 σ^{−2} (x^2 − 10x + 25 − x^2) )

◮ We are only interested in whether the above is greater than 1 or not.
◮ The ratio is greater than 1 if 10x > 25, i.e., x > 2.5
◮ So, if X > 2.5 we will conclude bit 1 is sent. Intuitively obvious!

◮ We did not calculate fX (x) in the above.
◮ We can calculate it if we want.
◮ Using the total probability rule

  fX (x) = Σ_{y} fX|Y (x|y)fY (y)
         = fX|Y (x|1)fY (1) + fX|Y (x|0)fY (0)
         = (1/2) (1/(σ√2π)) exp(−(x − 5)^2 /(2σ^2)) + (1/2) (1/(σ√2π)) exp(−x^2 /(2σ^2))

◮ It is a mixture density
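The threshold rule can be written out directly. Working with the log of the likelihood ratio avoids the Gaussian normalizing constants, which cancel (a sketch; equal priors fY (1) = fY (0) are assumed, as in the slides):

```python
def decide_bit(x, sigma=1.0):
    # log of f(x | Y=1) / f(x | Y=0) for N(5, sigma^2) vs N(0, sigma^2);
    # the 1/(sigma*sqrt(2*pi)) factors cancel in the ratio
    log_ratio = (x ** 2 - (x - 5) ** 2) / (2 * sigma ** 2)
    return 1 if log_ratio > 0 else 0
```

Here log_ratio simplifies to (10x − 25)/(2σ^2), which is positive exactly when x > 2.5, for every value of σ.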
◮ As we saw, given the joint distribution we can calculate all the marginals.
◮ However, there can be many joint distributions with the same marginals.
◮ Let F1 , F2 be one dimensional df’s of continuous rv’s with f1 , f2 being the corresponding densities. Define a function f : ℜ2 → ℜ by

  f (x, y) = f1 (x)f2 (y) [1 + α(2F1 (x) − 1)(2F2 (y) − 1)]

  where α ∈ (−1, 1). For different α we get different functions.
◮ First note that f (x, y) ≥ 0, ∀α ∈ (−1, 1).
◮ We first show that f (x, y) is a joint density.
◮ For this, we note the following

  ∫_{−∞}^{∞} f1 (x) F1 (x) dx = (F1 (x))^2 / 2 |_{−∞}^{∞} = 1/2

◮ Now

  ∫∫ f (x, y) dx dy = ∫_{−∞}^{∞} f1 (x) dx ∫_{−∞}^{∞} f2 (y) dy
    + α ∫_{−∞}^{∞} (2f1 (x)F1 (x) − f1 (x)) dx ∫_{−∞}^{∞} (2f2 (y)F2 (y) − f2 (y)) dy
    = 1

  because ∫ 2f1 (x) F1 (x) dx = 1, so each of the two α-term integrals is zero. This also shows

  ∫_{−∞}^{∞} f (x, y) dx = f2 (y);  ∫_{−∞}^{∞} f (x, y) dy = f1 (x)
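With Uniform(0, 1) marginals (f1 = f2 = 1, F1 (x) = x) the construction reduces to f (x, y) = 1 + α(2x − 1)(2y − 1). A numerical sketch confirms that integrating out y returns the marginal f1 (x) = 1 regardless of α (the grid size and test points are arbitrary):

```python
def f(x, y, alpha=0.7):
    # the constructed joint density with Uniform(0,1) marginals
    return 1.0 + alpha * (2 * x - 1) * (2 * y - 1)

N = 1000
h = 1.0 / N

def marginal_in_x(x, alpha=0.7):
    # integrate out y over (0,1) by the midpoint rule; should be ~1 = f1(x)
    return sum(f(x, (j + 0.5) * h, alpha) for j in range(N)) * h

m = marginal_in_x(0.3)
```

So every α in (−1, 1) gives a different joint density with the same uniform marginals.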
◮ Thus infinitely many joint distributions can all have the same marginals.
◮ So, in general, the marginals cannot determine the joint distribution.
◮ An important special case where this is possible is that of independent random variables

Independent Random Variables
◮ Two random variables X, Y are said to be independent if for all Borel sets B1 , B2 , the events [X ∈ B1 ] and [Y ∈ B2 ] are independent.
◮ If X, Y are independent then

  P [X ∈ B1 , Y ∈ B2 ] = P [X ∈ B1 ] P [Y ∈ B2 ], ∀B1 , B2 ∈ B

◮ In particular

  FXY (x, y) = P [X ≤ x, Y ≤ y] = P [X ≤ x]P [Y ≤ y] = FX (x) FY (y)

◮ Theorem: X, Y are independent if and only if FXY (x, y) = FX (x)FY (y).
◮ Suppose X, Y are independent discrete rv’s

  fXY (x, y) = P [X = x, Y = y] = P [X = x]P [Y = y] = fX (x)fY (y)

  The joint mass function is a product of marginals.
◮ Suppose fXY (x, y) = fX (x)fY (y). Then

  FXY (x, y) = Σ_{xi ≤ x, yj ≤ y} fXY (xi , yj ) = Σ_{xi ≤ x, yj ≤ y} fX (xi )fY (yj )
             = Σ_{xi ≤ x} fX (xi ) Σ_{yj ≤ y} fY (yj ) = FX (x)FY (y)

◮ So, X, Y are independent if and only if fXY (x, y) = fX (x)fY (y)

◮ Let X, Y be independent continuous rv

  FXY (x, y) = FX (x)FY (y) = ∫_{−∞}^{x} fX (x′ ) dx′ ∫_{−∞}^{y} fY (y ′ ) dy ′
             = ∫_{−∞}^{y} ∫_{−∞}^{x} (fX (x′ )fY (y ′ )) dx′ dy ′

◮ This implies the joint density is the product of marginals.
◮ Now, suppose fXY (x, y) = fX (x)fY (y)

  FXY (x, y) = ∫_{−∞}^{y} ∫_{−∞}^{x} fXY (x′ , y ′ ) dx′ dy ′
             = ∫_{−∞}^{y} ∫_{−∞}^{x} fX (x′ )fY (y ′ ) dx′ dy ′
             = ∫_{−∞}^{x} fX (x′ ) dx′ ∫_{−∞}^{y} fY (y ′ ) dy ′ = FX (x)FY (y)

◮ So, X, Y are independent if and only if fXY (x, y) = fX (x)fY (y)
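For the running example fXY (x, y) = 2 on 0 < x < y < 1, the product of the marginals does not equal the joint density, so X and Y are not independent; exhibiting a single point where the factorization fails is enough (a sketch):

```python
def f_joint(x, y):
    # joint density: uniform over the triangle 0 < x < y < 1
    return 2.0 if 0 < x < y < 1 else 0.0

def f_X(x):
    return 2 * (1 - x) if 0 < x < 1 else 0.0   # marginal of X

def f_Y(y):
    return 2 * y if 0 < y < 1 else 0.0          # marginal of Y

x0, y0 = 0.25, 0.75
prod = f_X(x0) * f_Y(y0)   # 2(1 - 0.25) * 2(0.75) = 2.25
joint = f_joint(x0, y0)    # 2.0, so the product form fails
```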
◮ Let X, Y be independent.
◮ Then P [X ∈ B1 |Y ∈ B2 ] = P [X ∈ B1 ].
◮ Hence, we get FX|Y (x|y) = FX (x).
◮ This also implies fX|Y (x|y) = fX (x).
◮ This is true for all the four possibilities of X, Y being continuous/discrete.

Recap: Joint Distribution Function
◮ Given X, Y rv’s on same probability space, joint distribution function: FXY : ℜ2 → ℜ

  FXY (x, y) = P [X ≤ x, Y ≤ y]

◮ It satisfies
  1. FXY (−∞, y) = FXY (x, −∞) = 0, ∀x, y; FXY (∞, ∞) = 1
  2. FXY is non-decreasing in each of its arguments
  3. FXY is right continuous and has left-hand limits in each of its arguments
  4. For all x1 < x2 and y1 < y2 ,
     FXY (x2 , y2 ) − FXY (x2 , y1 ) − FXY (x1 , y2 ) + FXY (x1 , y1 ) ≥ 0
◮ Any F : ℜ2 → ℜ satisfying the above would be a joint distribution function.
Recap: Joint Probability mass function
◮ X ∈ {x1 , x2 , · · · }, Y ∈ {y1 , y2 , · · · }
◮ The joint pmf: fXY (x, y) = P [X = x, Y = y].
◮ The joint pmf satisfies:
  A1 fXY (x, y) ≥ 0, ∀x, y and non-zero only for xi , yj pairs
  A2 Σ_{i} Σ_{j} fXY (xi , yj ) = 1
◮ Given the joint pmf, we can get the joint df as

  FXY (x, y) = Σ_{i: xi ≤ x} Σ_{j: yj ≤ y} fXY (xi , yj )

◮ Any fXY : ℜ2 → [0, 1] satisfying A1 and A2 above is a joint pmf. (The FXY satisfies all properties of df).
◮ Given the joint pmf, we can (in principle) compute the probability of any event involving the two discrete random variables.

  P [(X, Y ) ∈ B] = Σ_{i,j: (xi ,yj )∈B} fXY (xi , yj )

Recap: joint density
◮ Two cont rv X, Y have a joint density fXY if

  FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y ′ ) dy ′ dx′ , ∀x, y

◮ The joint density fXY satisfies the following
  1. fXY (x, y) ≥ 0, ∀x, y
  2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY (x′ , y ′ ) dy ′ dx′ = 1
◮ Any function fXY : ℜ2 → ℜ satisfying the above two is a joint density function. (Then the above FXY can be shown to be a joint df).
◮ We also have

  P [x1 ≤ X ≤ x2 , y1 ≤ Y ≤ y2 ] = ∫_{x1}^{x2} ∫_{y1}^{y2} fXY dy dx

  and, in general,

  P [(X, Y ) ∈ B] = ∫_{B} fXY (x, y) dx dy, ∀B ∈ B 2

Recap: Marginals
◮ Marginal distribution functions of X, Y are
      FX (x) = FXY (x, ∞); FY (y) = FXY (∞, y)
◮ X, Y discrete with joint pmf fXY . The marginal pmfs are
      fX (x) = Σ_y fXY (x, y); fY (y) = Σ_x fXY (x, y)
◮ If X, Y have joint pdf fXY then the marginal pdfs are
      fX (x) = ∫_{−∞}^{∞} fXY (x, y) dy; fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx

Recap: Conditional distributions
◮ Let X, Y be continuous or discrete random variables.
      FX|Y (x|y) = lim_{δ→0} P [X ≤ x|Y ∈ [y, y + δ]]
   (= P [X ≤ x|Y = y] when Y is discrete)
◮ This is well defined for all values that Y can assume.
◮ For each y, FX|Y (x|y) is a df in x.
◮ If X, Y have a joint density, or if X is continuous and Y is discrete, FX|Y would be absolutely continuous and would have a density.
Recap: Conditional density (or mass) fn
◮ Let X be a discrete random variable. Then
      fX|Y (x|y) = lim_{δ→0} P [X = x|Y ∈ [y, y + δ]]
   (= P [X = x|Y = y] if Y is discrete)
◮ This will be the mass function corresponding to the df FX|Y .
◮ Let X be a continuous rv. Then we define the conditional density fX|Y by
      FX|Y (x|y) = ∫_{−∞}^{x} fX|Y (x′ |y) dx′
   This exists if X, Y have a joint density or when Y is discrete.

Recap
◮ When X, Y are both discrete or they have a joint density,
      fXY (x, y) = fX|Y (x|y)fY (y) = fY |X (y|x)fX (x)
◮ When X, Y are discrete or continuous (all four possibilities),
      fX|Y (x|y)fY (y) = fY |X (y|x)fX (x)
   Here fX|Y , fX are densities when X is continuous and mass functions when X is discrete. Similarly for fY |X , fY .
◮ The above relation gives rise to the total probability rule and Bayes rule for rv's.

Recap
◮ If Y is discrete,
      fX (x) = Σ_y fX|Y (x|y)fY (y)
◮ If X is continuous, fX and fX|Y are densities; if X is discrete, they are mass functions.
◮ If Y is continuous,
      fX (x) = ∫_{−∞}^{∞} fX|Y (x|y)fY (y) dy
◮ Again, if X is continuous, fX and fX|Y are densities; if X is discrete, they are mass functions. (Where needed we assume the conditional density exists.)

Recap: Bayes rule
◮ When X, Y are continuous or discrete (all four possibilities),
      fY |X (y|x)fX (x) = fX|Y (x|y) fY (y)
◮ This gives rise to Bayes rule:
      fY |X (y|x) = fX|Y (x|y) fY (y)/fX (x)
      fX|Y (x|y) = fY |X (y|x)fX (x)/fY (y)
◮ We need to interpret fX or fX|Y as mass functions when X is discrete and as densities when X is continuous, and so on.
Recap: Independent Random variables
◮ X and Y are said to be independent if the events [X ∈ B1 ], [Y ∈ B2 ] are independent for all B1 , B2 ∈ B.
◮ X and Y are independent if and only if
   1. FXY (x, y) = FX (x) FY (y)
   2. fXY (x, y) = fX (x) fY (y)
◮ This also implies FX|Y (x|y) = FX (x) and fX|Y (x|y) = fX (x)

More than two rv
◮ Everything we have done so far is easily extended to multiple random variables.
◮ Let X, Y, Z be rv on the same probability space. We define the joint distribution function by
      FXY Z (x, y, z) = P [X ≤ x, Y ≤ y, Z ≤ z]
◮ If all three are discrete then the joint mass function is
      fXY Z (x, y, z) = P [X = x, Y = y, Z = z]
◮ If they are continuous, they have a joint density if
      FXY Z (x, y, z) = ∫_{−∞}^{z} ∫_{−∞}^{y} ∫_{−∞}^{x} fXY Z (x′ , y′ , z′ ) dx′ dy′ dz′

◮ Easy to see that the joint mass function satisfies
   1. fXY Z (x, y, z) ≥ 0 and is non-zero only for countably many tuples.
   2. Σ_{x,y,z} fXY Z (x, y, z) = 1
◮ Similarly the joint density satisfies
   1. fXY Z (x, y, z) ≥ 0
   2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY Z (x, y, z) dx dy dz = 1
◮ These are straight-forward generalizations.
◮ The properties of the joint distribution function, such as it being non-decreasing in each argument etc., are easily seen to hold here too.
◮ Generalizing the special property of the df (relating to probability of cylindrical sets) is a little more complicated. (An exercise for you!)

◮ Now we get many different marginals:
      FXY (x, y) = FXY Z (x, y, ∞); FZ (z) = FXY Z (∞, ∞, z), and so on
◮ Similarly we get
      fY Z (y, z) = ∫_{−∞}^{∞} fXY Z (x, y, z) dx
      fX (x) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY Z (x, y, z) dy dz
◮ Any marginal is a joint density of a subset of these rv's and we obtain it by integrating the (full) joint density with respect to the remaining variables.
◮ We obtain the marginal mass functions for a subset of the rv's similarly, where we sum over the remaining variables.
◮ Like in case of marginals, there are different types of
◮ We have to be a little careful in dealing with these when conditional distributions now.
some random variables are discrete and others are ◮ We can always define conditional distribution functions
continuous.
like
◮ Suppose X is continuous and Y, Z are discrete. We do
not have any joint density or mass function as such. FXY |Z (x, y|z) = P [X ≤ x, Y ≤ y|Z = z]
◮ However, the joint df is always well defined. FX|Y Z (x|y, z) = P [X ≤ x|Y = y, Z = z]
◮ Suppose we want marginal joint distribution of X, Y . We
know how to get FXY by marginalization. ◮ In all such cases, if the conditioning random variables are
continuous, we define the above as a limit.
◮ Then we can get fX (a density), fY (a mass fn), fX|Y
(conditinal density) and fY |X (conditional mass fn) ◮ For example when Z is continuous
◮ With these we can generally calculate most quantities of FXY |Z (x, y|z) = lim P [X ≤ x, Y ≤ y|Z ∈ [z, z + δ]]
interest. δ→0

PS Sastry, IISc, Bangalore, 2020 14/41 PS Sastry, IISc, Bangalore, 2020 15/41

◮ If X, Y, Z are all discrete then all conditional mass functions are defined by appropriate conditional probabilities. For example,
      fX|Y Z (x|y, z) = P [X = x|Y = y, Z = z]
◮ Thus the following are obvious:
      fXY |Z (x, y|z) = fXY Z (x, y, z)/fZ (z)
      fX|Y Z (x|y, z) = fXY Z (x, y, z)/fY Z (y, z)
      fXY Z (x, y, z) = fZ|Y X (z|y, x)fY |X (y|x)fX (x)
◮ For example, the first one above follows from
      P [X = x, Y = y|Z = z] = P [X = x, Y = y, Z = z]/P [Z = z]

◮ When X, Y, Z have a joint density, all such relations hold for the appropriate (conditional) densities. For example,
      FZ|XY (z|x, y) = lim_{δ→0} P [Z ≤ z, X ∈ [x, x + δ], Y ∈ [y, y + δ]] / P [X ∈ [x, x + δ], Y ∈ [y, y + δ]]
                    = lim_{δ→0} ( ∫_{−∞}^{z} ∫_{x}^{x+δ} ∫_{y}^{y+δ} fXY Z (x′ , y′ , z′ ) dy′ dx′ dz′ ) / ( ∫_{x}^{x+δ} ∫_{y}^{y+δ} fXY (x′ , y′ ) dy′ dx′ )
                    = ∫_{−∞}^{z} ( fXY Z (x, y, z′ )/fXY (x, y) ) dz′
◮ Thus we get
      fXY Z (x, y, z) = fZ|XY (z|x, y)fXY (x, y) = fZ|XY (z|x, y)fY |X (y|x)fX (x)
◮ We can similarly talk about the joint distribution of any finite number of rv's.
◮ Let X1 , X2 , · · · , Xn be rv's on the same probability space.
◮ We denote them as a vector X. We can think of it as a mapping, X : Ω → ℜⁿ.
◮ We can write the joint distribution as
      FX (x) = P [X ≤ x] = P [Xi ≤ xi , i = 1, · · · , n]
◮ We represent by fX (x) the joint density or mass function. Sometimes we also write it as fX1 ···Xn (x1 , · · · , xn ).
◮ We use similar notation for marginal and conditional distributions.

Independence of multiple random variables
◮ Random variables X1 , X2 , · · · , Xn are said to be independent if the events [Xi ∈ Bi ], i = 1, · · · , n are independent. (Recall the definition of independence of a set of events.)
◮ Independence implies that the marginals would determine the joint distribution.

Example
◮ Let a joint density be given by
      fXY Z (x, y, z) = K, 0 < z < y < x < 1
   First let us determine K:
      ∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY Z (x, y, z) dz dy dx = K ∫_{x=0}^{1} ∫_{y=0}^{x} ∫_{z=0}^{y} dz dy dx
                                                               = K ∫_{x=0}^{1} ∫_{y=0}^{x} y dy dx
                                                               = K ∫_{0}^{1} (x²/2) dx
                                                               = K/6 ⇒ K = 6
◮ We got the joint density as
      fXY Z (x, y, z) = K, 0 < z < y < x < 1, with K = 6
◮ Suppose we want to find the (marginal) joint distribution of X and Z:
      fXZ (x, z) = ∫_{−∞}^{∞} fXY Z (x, y, z) dy
                 = ∫_{z}^{x} K dy, 0 < z < x < 1
                 = 6(x − z), 0 < z < x < 1

◮ So, fXZ (x, z) = 6(x − z), 0 < z < x < 1.
◮ We can verify this is a joint density:
      ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXZ (x, z) dz dx = ∫_{0}^{1} ∫_{0}^{x} 6(x − z) dz dx
                                             = ∫_{0}^{1} ( 6x · z|₀ˣ − 6 (z²/2)|₀ˣ ) dx
                                             = ∫_{0}^{1} ( 6x² − 3x² ) dx
                                             = 3 (x³/3)|₀¹ = 1

◮ The joint density of X, Y, Z is
      fXY Z (x, y, z) = 6, 0 < z < y < x < 1
◮ The joint density of X, Z is
      fXZ (x, z) = 6(x − z), 0 < z < x < 1
◮ Hence,
      fY |XZ (y|x, z) = fXY Z (x, y, z)/fXZ (x, z) = 1/(x − z), 0 < z < y < x < 1

Functions of multiple random variables
◮ Let X, Y be random variables on the same probability space.
◮ Let g : ℜ² → ℜ.
◮ Let Z = g(X, Y ). Then Z is a rv.
◮ This is analogous to functions of a single rv.
   [Figure: (X, Y ) maps the sample space into ℜ² and g maps ℜ² into ℜ; B′ ⊂ Ω is the pre-image under (X, Y ) of g⁻¹(B) for B ∈ B]
◮ Let Z = g(X, Y ).
◮ We can determine the distribution of Z from the joint distribution of X, Y :
      FZ (z) = P [Z ≤ z] = P [g(X, Y ) ≤ z]
◮ For example, if X, Y are discrete, then
      fZ (z) = P [Z = z] = P [g(X, Y ) = z] = Σ_{xi ,yj : g(xi ,yj )=z} fXY (xi , yj )

◮ Let X, Y be discrete rv's. Let Z = min(X, Y ).
      fZ (z) = P [min(X, Y ) = z]
             = P [X = z, Y > z] + P [Y = z, X > z] + P [X = Y = z]
             = Σ_{y>z} P [X = z, Y = y] + Σ_{x>z} P [X = x, Y = z] + P [X = z, Y = z]
             = Σ_{y>z} fXY (z, y) + Σ_{x>z} fXY (x, z) + fXY (z, z)
◮ Now suppose X, Y are independent and both of them have the geometric distribution with the same parameter, p.
◮ Such random variables are called independent and identically distributed or iid random variables.

◮ Now we can get the pmf of Z as (note Z ∈ {1, 2, · · · })
      fZ (z) = P [X = z, Y > z] + P [Y = z, X > z] + P [X = Y = z]
             = P [X = z]P [Y > z] + P [Y = z]P [X > z] + P [X = z]P [Y = z]
             = 2 p(1 − p)^{z−1} (1 − p)^z + ( p(1 − p)^{z−1} )²
             = 2p(1 − p)^{2z−1} + p²(1 − p)^{2z−2}
             = p(1 − p)^{2z−2} (2(1 − p) + p)
             = (2 − p)p(1 − p)^{2z−2}

◮ We can show this is a pmf:
      Σ_{z=1}^{∞} fZ (z) = Σ_{z=1}^{∞} (2 − p)p(1 − p)^{2z−2}
                        = (2 − p)p Σ_{z=1}^{∞} (1 − p)^{2z−2}
                        = (2 − p)p · 1/(1 − (1 − p)²)
                        = (2 − p)p · 1/(2p − p²) = 1
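The closed form above can be cross-checked numerically. The sketch below (not from the slides; the value p = 0.3 and the tail truncation length are arbitrary choices) computes P [min(X, Y ) = z] directly from the geometric pmfs, using independence, and compares it with (2 − p)p(1 − p)^{2z−2}.

```python
def geom_pmf(k, p):
    # P[X = k] = p (1 - p)^(k - 1), k = 1, 2, ...
    return p * (1 - p) ** (k - 1)

def min_pmf_direct(z, p, tail=200):
    # P[min(X,Y) = z] = 2 P[X = z] P[X > z] + P[X = z]^2 for iid geometrics;
    # P[X > z] is approximated here by a long truncated tail sum
    p_gt = sum(geom_pmf(k, p) for k in range(z + 1, z + tail))
    return 2 * geom_pmf(z, p) * p_gt + geom_pmf(z, p) ** 2

p = 0.3
direct = [min_pmf_direct(z, p) for z in range(1, 8)]
closed = [(2 - p) * p * (1 - p) ** (2 * z - 2) for z in range(1, 8)]
```

The truncated tail is a stand-in for the exact P [X > z] = (1 − p)^z, so the agreement also confirms that identity.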
◮ Let us consider the max and min functions, in general.
◮ Let Z = max(X, Y ). Then we have
      FZ (z) = P [Z ≤ z] = P [max(X, Y ) ≤ z]
             = P [X ≤ z, Y ≤ z]
             = FXY (z, z)
             = FX (z)FY (z), if X, Y are independent
             = (FX (z))², if they are iid
◮ This is true for all random variables.
◮ Suppose X, Y are iid continuous rv. Then the density of Z is
      fZ (z) = 2FX (z)fX (z)

◮ Suppose X, Y are iid uniform over (0, 1).
◮ Then we get the df and pdf of Z = max(X, Y ) as
      FZ (z) = z², 0 < z < 1; and fZ (z) = 2z, 0 < z < 1
   with FZ (z) = 0 for z ≤ 0, FZ (z) = 1 for z ≥ 1, and fZ (z) = 0 outside (0, 1).
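A quick Monte Carlo sanity check of FZ (z) = z² for the max of two iid uniforms (not part of the slides; the seed, sample size and test point z0 = 0.7 are arbitrary):

```python
import random

random.seed(0)
n = 200_000
z0 = 0.7
# Z = max(X, Y) with X, Y iid U(0,1); estimate F_Z(z0) = P[Z <= z0]
hits = sum(1 for _ in range(n)
           if max(random.random(), random.random()) <= z0)
est = hits / n   # should be close to z0**2 = 0.49
```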

◮ This is easily generalized to n random variables.
◮ Let Z = max(X1 , · · · , Xn ). Then
      FZ (z) = P [Z ≤ z] = P [max(X1 , X2 , · · · , Xn ) ≤ z]
             = P [X1 ≤ z, X2 ≤ z, · · · , Xn ≤ z]
             = FX1 ···Xn (z, · · · , z)
             = FX1 (z) · · · FXn (z), if they are independent
             = (FX (z))ⁿ, if they are iid, where we take FX as the common df
◮ For example, if all Xi are uniform over (0, 1) and independent, then FZ (z) = zⁿ, 0 < z < 1.

◮ Consider Z = min(X, Y ) with X, Y independent:
      FZ (z) = P [Z ≤ z] = P [min(X, Y ) ≤ z]
◮ It is difficult to write this in terms of the joint df of X, Y . So, we consider the following:
      P [Z > z] = P [min(X, Y ) > z]
                = P [X > z, Y > z]
                = P [X > z]P [Y > z], using independence
                = (1 − FX (z))(1 − FY (z))
                = (1 − FX (z))², if they are iid
   Hence, FZ (z) = 1 − (1 − FX (z))(1 − FY (z)).
◮ We can once again find the density of Z if X, Y are continuous.
◮ Suppose X, Y are iid uniform (0, 1) and Z = min(X, Y ):
      FZ (z) = 1 − (1 − FX (z))² = 1 − (1 − z)², 0 < z < 1
◮ Notice that P [X > z] = (1 − z).
◮ We get the density of Z as
      fZ (z) = 2(1 − z), 0 < z < 1

◮ The min fn is also easily generalized to n random variables.
◮ Let Z = min(X1 , X2 , · · · , Xn ):
      P [Z > z] = P [min(X1 , X2 , · · · , Xn ) > z]
                = P [X1 > z, · · · , Xn > z]
                = P [X1 > z] · · · P [Xn > z], using independence
                = (1 − FX1 (z)) · · · (1 − FXn (z))
                = (1 − FX (z))ⁿ, if they are iid
◮ Hence, when the Xi are iid, the df of Z is
      FZ (z) = 1 − (1 − FX (z))ⁿ, where FX is the common df

◮ Let X, Y be independent.
◮ Let Z = max(X, Y ) and W = min(X, Y ).
◮ We want the joint distribution function of Z and W :
      FZW (z, w) = P [Z ≤ z, W ≤ w]
◮ This is difficult to find directly. But we can easily find
      P [max(X, Y ) ≤ z, min(X, Y ) > w]
◮ Remaining details are left as an exercise for you!!

◮ Let X, Y ∈ {0, 1, · · · } and Z = X + Y . Then we have
      fZ (z) = P [X + Y = z] = Σ_{x,y: x+y=z} P [X = x, Y = y]
             = Σ_{k=0}^{z} P [X = k, Y = z − k]
             = Σ_{k=0}^{z} fXY (k, z − k)
◮ Now suppose X, Y are independent. Then
      fZ (z) = Σ_{k=0}^{z} fX (k)fY (z − k)
◮ Now suppose X, Y are independent Poisson with parameters λ1 , λ2 . And, Z = X + Y .
      fZ (z) = Σ_{k=0}^{z} fX (k)fY (z − k)
             = Σ_{k=0}^{z} (λ1^k /k!) e^{−λ1 } (λ2^{z−k} /(z − k)!) e^{−λ2 }
             = e^{−(λ1 +λ2 )} (1/z!) Σ_{k=0}^{z} ( z!/(k!(z − k)!) ) λ1^k λ2^{z−k}
             = e^{−(λ1 +λ2 )} (λ1 + λ2 )^z /z!
◮ Z is Poisson with parameter λ1 + λ2 .

◮ Let X, Y have a joint density fXY . Let Z = X + Y .
      FZ (z) = P [Z ≤ z] = P [X + Y ≤ z]
             = ∫∫_{{(x,y): x+y≤z}} fXY (x, y) dy dx
             = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{z−x} fXY (x, y) dy dx
   change of variable: t = x + y, so dt = dy and, when y = z − x, t = z:
             = ∫_{x=−∞}^{∞} ∫_{t=−∞}^{z} fXY (x, t − x) dt dx
             = ∫_{−∞}^{z} ( ∫_{−∞}^{∞} fXY (x, t − x) dx ) dt
◮ This gives us
      fZ (z) = ∫_{−∞}^{∞} fXY (x, z − x) dx
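The Poisson result can be verified exactly by computing the discrete convolution term by term. A minimal sketch (not from the slides; λ1 = 1.5 and λ2 = 2.5 are arbitrary):

```python
from math import exp, factorial

def pois(k, lam):
    # Poisson pmf with parameter lam
    return exp(-lam) * lam ** k / factorial(k)

lam1, lam2 = 1.5, 2.5
# convolution sum_{k=0}^{z} f_X(k) f_Y(z-k) vs the Poisson(lam1+lam2) pmf
conv = [sum(pois(k, lam1) * pois(z - k, lam2) for k in range(z + 1))
        for z in range(12)]
target = [pois(z, lam1 + lam2) for z in range(12)]
```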

◮ X, Y have joint density fXY and Z = X + Y . Then
      fZ (z) = ∫_{−∞}^{∞} fXY (x, z − x) dx
◮ Now suppose X and Y are independent. Then
      fZ (z) = ∫_{−∞}^{∞} fX (x) fY (z − x) dx
   The density of a sum of independent random variables is the convolution of their densities:
      fX+Y = fX ∗ fY (Convolution)

◮ Suppose X, Y are iid exponential rv's:
      fX (x) = λ e^{−λx} , x > 0
◮ Let Z = X + Y . Then the density of Z is
      fZ (z) = ∫_{−∞}^{∞} fX (x) fY (z − x) dx
             = ∫_{0}^{z} λ e^{−λx} λ e^{−λ(z−x)} dx
             = λ² e^{−λz} ∫_{0}^{z} dx = λ² z e^{−λz}
◮ Thus, the sum of independent exponential random variables has a gamma distribution:
      fZ (z) = λz λe^{−λz} , z > 0
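The convolution integral here can also be evaluated numerically and compared against the closed form λ² z e^{−λz}. A sketch under arbitrary assumptions (λ = 2, a midpoint Riemann sum with a fixed number of panels):

```python
from math import exp

lam = 2.0

def f(x):
    # exponential density with rate lam
    return lam * exp(-lam * x) if x > 0 else 0.0

def conv_at(z, n=20000):
    # midpoint-rule approximation of (f * f)(z) = integral_0^z f(x) f(z - x) dx
    h = z / n
    return sum(f((i + 0.5) * h) * f(z - (i + 0.5) * h) for i in range(n)) * h

vals = {z: conv_at(z) for z in (0.5, 1.0, 2.0)}
expected = {z: lam * lam * z * exp(-lam * z) for z in vals}
```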
Recap
◮ Given X1 , · · · , Xn , random variables on the same probability space, Z = g(X1 , · · · , Xn ) is a rv (if g : ℜⁿ → ℜ is Borel measurable).
   [Figure: X maps the sample space into ℜⁿ and g maps ℜⁿ into ℜ; B′ is the pre-image under X of g⁻¹(B)]
◮ We can determine the distribution of Z from the joint distribution of all the Xi :
      FZ (z) = P [Z ≤ z] = P [g(X1 , · · · , Xn ) ≤ z]

Recap
◮ X1 , · · · , Xn are said to be independent if the events [X1 ∈ B1 ], · · · , [Xn ∈ Bn ] are independent.
◮ If X1 , · · · , Xn are independent and all of them have the same distribution function then they are said to be iid (independent and identically distributed).

Recap
◮ Let X1 , · · · , Xn be independent and Z = max(X1 , · · · , Xn ). Then
      FZ (z) = Π_{i=1}^{n} FXi (z)
             = (F (z))ⁿ, if they are iid

Recap
◮ Let X1 , · · · , Xn be independent and Z = min(X1 , · · · , Xn ). Then
      FZ (z) = 1 − Π_{i=1}^{n} (1 − FXi (z))
             = 1 − (1 − F (z))ⁿ, if they are iid
Recap
◮ Let X, Y be random variables with joint density fXY and Z = X + Y . Then
      fZ (z) = ∫_{−∞}^{∞} fXY (t, z − t) dt
◮ If X, Y are independent,
      fZ (z) = ∫_{−∞}^{∞} fX (t) fY (z − t) dt
   The density of a sum of independent random variables is the convolution of their densities.
◮ The sum of independent exponential random variables has a gamma density.

Recall problem from last class
◮ Let X, Y be independent. Let Z = max(X, Y ) and W = min(X, Y ).
◮ We want the joint distribution function of Z and W :
      FZW (z, w) = P [Z ≤ z, W ≤ w]
◮ This is difficult to find directly. But we can easily find
      P [max(X, Y ) ≤ z, min(X, Y ) > w]
◮ Remaining details are left as an exercise for you!!

◮ X, Y iid with df F and density f . Z = max(X, Y ) and W = min(X, Y ).
◮ We want the joint distribution function of Z and W .
◮ We can use the following:
      P [Z ≤ z] = P [Z ≤ z, W ≤ w] + P [Z ≤ z, W > w]
      P [Z ≤ z, W > w] = P [w < X, Y ≤ z] = (F (z) − F (w))²
      P [Z ≤ z] = P [X ≤ z, Y ≤ z] = (F (z))²
◮ So, we get FZW as
      FZW (z, w) = P [Z ≤ z, W ≤ w]
                 = P [Z ≤ z] − P [Z ≤ z, W > w]
                 = (F (z))² − (F (z) − F (w))²
◮ Is this correct for all values of z, w?

◮ We have P [w < X, Y ≤ z] = (F (z) − F (w))² only when w ≤ z. Otherwise it is zero.
◮ Hence we get FZW as
      FZW (z, w) = (F (z))², if w > z
      FZW (z, w) = (F (z))² − (F (z) − F (w))², if w ≤ z
◮ We can get the joint density of Z, W as
      fZW (z, w) = ∂²FZW (z, w)/∂z ∂w = 2f (z)f (w), w ≤ z
Order Statistics
◮ Let X1 , · · · , Xn be iid with density f .
◮ Let X(k) denote the k-th smallest of these.
◮ That is, X(k) = gk (X1 , · · · , Xn ) where gk : ℜⁿ → ℜ and the value of gk (x1 , · · · , xn ) is the k-th smallest of the numbers x1 , · · · , xn .
◮ X(1) = min(X1 , · · · , Xn ), X(n) = max(X1 , · · · , Xn ).
◮ The joint distribution of X(1) , · · · , X(n) is called the order statistics.
◮ We calculated the order statistics for the case n = 2.
◮ It can be shown that
      fX(1) ···X(n) (x1 , · · · , xn ) = n! Π_{i=1}^{n} f (xi ), x1 < x2 < · · · < xn

◮ Let X1 , · · · , Xn be iid with df F and density f . P [Xi ≤ y] = F (y) for any i and y.
◮ Since they are independent, we have, e.g.,
      P [X1 ≤ y, X2 > y, X3 ≤ y] = (F (y))²(1 − F (y))
◮ Hence, the probability that exactly k of these n random variables are less than or equal to y is
      ⁿCk (F (y))^k (1 − F (y))^{n−k}
◮ Now the event [X(k) ≤ y] is the same as the event “at least k of these are less than or equal to y”.
◮ Hence we get
      FX(k) (y) = Σ_{j=k}^{n} ⁿCj (F (y))^j (1 − F (y))^{n−j}
   We can get the density by differentiating this.
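The df of X(k) above can be checked by simulation for uniforms, where F (y) = y. A sketch (not from the slides; n = 5, k = 3, y = 0.6, the seed and the number of trials are arbitrary choices):

```python
import random
from math import comb

random.seed(1)
n, k, y = 5, 3, 0.6
trials = 100_000
hits = 0
for _ in range(trials):
    sample = sorted(random.random() for _ in range(n))  # n iid U(0,1)
    if sample[k - 1] <= y:                              # k-th smallest <= y
        hits += 1
mc = hits / trials
# F_{X(k)}(y) = sum_{j=k}^{n} C(n,j) y^j (1-y)^(n-j)
exact = sum(comb(n, j) * y ** j * (1 - y) ** (n - j) for j in range(k, n + 1))
```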

Distribution of sums of independent rv
◮ Suppose X, Y are iid uniform over (−1, 1).
◮ Let Z = X + Y . We want fZ .
◮ The density of X, Y is fX (x) = 0.5, −1 < x < 1; fY is the same.
   [Figure: the box-shaped uniform density on (−1, 1)]
◮ fZ is the convolution of this density with itself.

◮ Note that Z takes values in [−2, 2] and
      fZ (z) = ∫_{−∞}^{∞} fX (t) fY (z − t) dt
◮ For the integrand to be non-zero we need
   ◮ −1 < t < 1 ⇒ t < 1, t > −1
   ◮ −1 < z − t < 1 ⇒ t < z + 1, t > z − 1
◮ Hence we need: max(−1, z − 1) < t < min(1, z + 1).
◮ Hence, for z < 0, we need −1 < t < z + 1, and, for z ≥ 0, we need z − 1 < t < 1.
◮ Thus we get
      fZ (z) = ∫_{−1}^{z+1} (1/4) dt = (z + 2)/4, if −2 ≤ z < 0
      fZ (z) = ∫_{z−1}^{1} (1/4) dt = (2 − z)/4, if 0 ≤ z ≤ 2
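A simulation-based check of this triangular density (not part of the slides; seed and sample size arbitrary). It tests the df at two points: FZ (0) = 1/2 by symmetry, and FZ (1) = 1 − ∫_1^2 (2 − z)/4 dz = 7/8.

```python
import random

random.seed(2)
n = 200_000
count_zero = count_one = 0
for _ in range(n):
    # X + Y with X, Y iid U(-1, 1)
    z = (2 * random.random() - 1) + (2 * random.random() - 1)
    if z <= 0:
        count_zero += 1
    if z <= 1:
        count_one += 1
p0 = count_zero / n   # estimates F_Z(0) = 0.5
p1 = count_one / n    # estimates F_Z(1) = 0.875
```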
◮ Thus, the density of the sum of two ind rv's that are uniform over (−1, 1) is
      fZ (z) = (z + 2)/4, if −2 < z < 0
      fZ (z) = (2 − z)/4, if 0 < z < 2
◮ This is a triangle with vertices (−2, 0), (0, 0.5), (2, 0).

Independence of functions of random variable
◮ Suppose X and Y are independent.
◮ Then g(X) and h(Y ) are independent.
◮ This is because [g(X) ∈ B1 ] = [X ∈ B̃1 ] for some Borel set B̃1 , and similarly [h(Y ) ∈ B2 ] = [Y ∈ B̃2 ].
◮ Hence, [g(X) ∈ B1 ] and [h(Y ) ∈ B2 ] are independent.

Independence of functions of random variable
◮ This is easily generalized to functions of multiple random variables.
◮ If X, Y are vector random variables (or random vectors), independence implies [X ∈ B1 ] is independent of [Y ∈ B2 ] for all Borel sets B1 , B2 (in the appropriate spaces).
◮ Then g(X) would be independent of h(Y).
◮ That is, suppose X1 , · · · , Xm , Y1 , · · · , Yn are independent.
◮ Then, g(X1 , · · · , Xm ) is independent of h(Y1 , · · · , Yn ).

◮ Let X1 , X2 , X3 be independent continuous rv and Z = X1 + X2 + X3 . Can we find the density of Z?
◮ Let W = X1 + X2 . Then Z = W + X3 , and W and X3 are independent.
◮ Exercise for you: Find the density of X1 + X2 + X3 where X1 , X2 , X3 are iid uniform over (0, 1).
Sum of independent gamma rv
◮ The Gamma density with parameters α > 0 and λ > 0 is given by
      f (x) = (1/Γ(α)) λ^α x^{α−1} e^{−λx} , x > 0
   We will call this Gamma(α, λ).
◮ The α is called the shape parameter and λ is called the rate parameter.
◮ For α = 1 this is the exponential density.
◮ Let X ∼ Gamma(α1 , λ), Y ∼ Gamma(α2 , λ). Suppose X, Y are independent.
◮ Let Z = X + Y . Then Z ∼ Gamma(α1 + α2 , λ):
      fZ (z) = ∫_{−∞}^{∞} fX (x) fY (z − x) dx
             = ∫_{0}^{z} (1/Γ(α1 )) λ^{α1 } x^{α1 −1} e^{−λx} (1/Γ(α2 )) λ^{α2 } (z − x)^{α2 −1} e^{−λ(z−x)} dx
             = ( λ^{α1 +α2 } e^{−λz} /(Γ(α1 )Γ(α2 )) ) ∫_{0}^{z} x^{α1 −1} (z − x)^{α2 −1} dx
   change the variable: t = x/z (⇒ z^{−1} dx = dt)
             = ( λ^{α1 +α2 } e^{−λz} /(Γ(α1 )Γ(α2 )) ) z^{α1 +α2 −1} ∫_{0}^{1} t^{α1 −1} (1 − t)^{α2 −1} dt
             = (1/Γ(α1 + α2 )) λ^{α1 +α2 } z^{α1 +α2 −1} e^{−λz}
   because
      ∫_{0}^{1} t^{α1 −1} (1 − t)^{α2 −1} dt = Γ(α1 )Γ(α2 )/Γ(α1 + α2 )
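A partial check by simulation (not from the slides; parameter values, seed and sample size are arbitrary, and this only checks the mean and variance, not the whole distribution): a Gamma(α, λ) rv has mean α/λ and variance α/λ², so the sum of Gamma(α1, λ) and Gamma(α2, λ) samples should match Gamma(α1 + α2, λ) in both.

```python
import random

random.seed(3)
a1, a2, lam = 2.0, 3.0, 1.5
n = 100_000
# random.gammavariate(alpha, beta) uses the scale parameter beta = 1/lambda
zs = [random.gammavariate(a1, 1 / lam) + random.gammavariate(a2, 1 / lam)
      for _ in range(n)]
mean = sum(zs) / n
var = sum((z - mean) ** 2 for z in zs) / n
# Gamma(a1 + a2, lam): mean = (a1 + a2)/lam, variance = (a1 + a2)/lam^2
```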

◮ If X, Y are independent gamma random variables then X + Y also has a gamma distribution.
◮ If X ∼ Gamma(α1 , λ) and Y ∼ Gamma(α2 , λ), then X + Y ∼ Gamma(α1 + α2 , λ).
◮ Exercise for you: show that the sum of independent Gaussian random variables has a Gaussian density.
◮ The algebra is a little involved. First take the two Gaussians to be zero-mean.
◮ There is a calculation trick that is often useful with the Gaussian density.

A Calculation Trick
      I = ∫_{−∞}^{∞} exp( −(x² − 2bx + c)/(2K) ) dx
        = ∫_{−∞}^{∞} exp( −((x − b)² + c − b²)/(2K) ) dx
        = exp( −(c − b²)/(2K) ) ∫_{−∞}^{∞} exp( −(x − b)²/(2K) ) dx
        = exp( −(c − b²)/(2K) ) √(2πK)
   because
      (1/√(2πK)) ∫_{−∞}^{∞} exp( −(x − b)²/(2K) ) dx = 1
◮ We next look at a general theorem that is quite useful in dealing with functions of multiple random variables.
◮ This result is only for continuous random variables.

◮ Let X1 , · · · , Xn be continuous random variables with joint density fX1 ···Xn . We define Y1 , · · · , Yn by
      Y1 = g1 (X1 , · · · , Xn ) · · · Yn = gn (X1 , · · · , Xn )
   We think of the gi as components of g : ℜⁿ → ℜⁿ.
◮ We assume g is continuous with continuous first partials and is invertible.
◮ Let h be the inverse of g. That is,
      X1 = h1 (Y1 , · · · , Yn ) · · · Xn = hn (Y1 , · · · , Yn )
◮ Each of the gi , hi are ℜⁿ → ℜ functions and we can write them as
      yi = gi (x1 , · · · , xn ); xi = hi (y1 , · · · , yn )
   We denote the partial derivatives of these functions by ∂xi /∂yj etc.

◮ The Jacobian of the inverse transformation is
      J = ∂(x1 , · · · , xn )/∂(y1 , · · · , yn ) = det [ ∂xi /∂yj ]
   the determinant of the n × n matrix whose (i, j)-th entry is ∂xi /∂yj .
◮ We assume that J is non-zero in the range of the transformation.
◮ Theorem: Under the above conditions, we have
      fY1 ···Yn (y1 , · · · , yn ) = |J| fX1 ···Xn (h1 (y1 , · · · , yn ), · · · , hn (y1 , · · · , yn ))
   Or, more compactly, fY (y) = |J| fX (h(y)).

Proof of Theorem
◮ Let B = (−∞, y1 ] × · · · × (−∞, yn ] ⊂ ℜⁿ. Then
      FY (y) = FY1 ···Yn (y1 , · · · , yn ) = P [Yi ≤ yi , i = 1, · · · , n]
             = ∫_B fY1 ···Yn (y1′ , · · · , yn′ ) dy1′ · · · dyn′
◮ Define
      g⁻¹(B) = {(x1 , · · · , xn ) ∈ ℜⁿ : g(x1 , · · · , xn ) ∈ B}
             = {(x1 , · · · , xn ) ∈ ℜⁿ : gi (x1 , · · · , xn ) ≤ yi , i = 1, · · · , n}
◮ Then we have
      FY1 ···Yn (y1 , · · · , yn ) = P [gi (X1 , · · · , Xn ) ≤ yi , i = 1, · · · , n]
             = ∫_{g⁻¹(B)} fX1 ···Xn (x1′ , · · · , xn′ ) dx1′ · · · dxn′
Proof of Theorem (continued)
◮ B = (−∞, y1 ] × · · · × (−∞, yn ] and g⁻¹(B) = {(x1 , · · · , xn ) ∈ ℜⁿ : g(x1 , · · · , xn ) ∈ B}.
      FY (y1 , · · · , yn ) = P [gi (X1 , · · · , Xn ) ≤ yi , i = 1, · · · , n]
                          = ∫_{g⁻¹(B)} fX1 ···Xn (x1′ , · · · , xn′ ) dx1′ · · · dxn′
   change variables: yi′ = gi (x1′ , · · · , xn′ ), i = 1, · · · , n;
   (x1′ , · · · , xn′ ) ∈ g⁻¹(B) ⇒ (y1′ , · · · , yn′ ) ∈ B;
   xi′ = hi (y1′ , · · · , yn′ ), dx1′ · · · dxn′ = |J| dy1′ · · · dyn′
      FY (y1 , · · · , yn ) = ∫_B fX1 ···Xn (h1 (y′ ), · · · , hn (y′ )) |J| dy1′ · · · dyn′
      ⇒ fY1 ···Yn (y1 , · · · , yn ) = fX1 ···Xn (h1 (y), · · · , hn (y)) |J|

◮ To summarize: X1 , · · · , Xn are continuous rv with joint density and
      Y1 = g1 (X1 , · · · , Xn ) · · · Yn = gn (X1 , · · · , Xn )
◮ The transformation is continuous with continuous first partials and is invertible, with
      X1 = h1 (Y1 , · · · , Yn ) · · · Xn = hn (Y1 , · · · , Yn )
◮ We assume the Jacobian of the inverse transform, J, is non-zero.
◮ Then the density of Y is
      fY1 ···Yn (y1 , · · · , yn ) = |J| fX1 ···Xn (h1 (y1 , · · · , yn ), · · · , hn (y1 , · · · , yn ))
◮ This is called the multidimensional change of variable formula.
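The formula can be exercised concretely. A sketch (not from the slides; it assumes X, Y iid U(0, 1) and the transform Z = X + Y, W = X − Y, for which |J| = 1/2): it builds fZW from the formula, marginalizes W out numerically, and compares the result with the known triangular density of X + Y on (0, 2).

```python
def f_xy(x, y):
    # joint density of X, Y iid U(0,1)
    return 1.0 if 0 < x < 1 and 0 < y < 1 else 0.0

def f_zw(z, w):
    # change-of-variable formula with inverse x = (z+w)/2, y = (z-w)/2, |J| = 1/2
    return 0.5 * f_xy((z + w) / 2, (z - w) / 2)

def f_z(z, n=4000):
    # marginalize w out by a midpoint Riemann sum over [-2, 2]
    h = 4.0 / n
    return sum(f_zw(z, -2 + (i + 0.5) * h) for i in range(n)) * h

checks = {z: (f_z(z), 1 - abs(1 - z)) for z in (0.25, 0.5, 1.0, 1.5)}
```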

◮ Let X, Y have joint density fXY . Let Z = X + Y .
◮ We want fZ . For the theorem we need two functions.
◮ To use the theorem, we need an invertible transformation of ℜ² onto ℜ² of which one component is x + y.
◮ Take Z = X + Y and W = X − Y . This is invertible:
   X = (Z + W )/2 and Y = (Z − W )/2. The Jacobian is
      J = det [ 1/2 1/2 ; 1/2 −1/2 ] = −1/2
◮ Hence we get
      fZW (z, w) = (1/2) fXY ( (z + w)/2, (z − w)/2 )
◮ Now we get the density of Z as
      fZ (z) = ∫_{−∞}^{∞} (1/2) fXY ( (z + w)/2, (z − w)/2 ) dw

◮ With Z = X + Y and W = X − Y , then,
      fZ (z) = ∫_{−∞}^{∞} (1/2) fXY ( (z + w)/2, (z − w)/2 ) dw
   change the variable: t = (z + w)/2 ⇒ dt = (1/2) dw, w = 2t − z, z − w = 2z − 2t
      fZ (z) = ∫_{−∞}^{∞} fXY (t, z − t) dt
             = ∫_{−∞}^{∞} fXY (z − t, t) dt, by instead using t = (z − w)/2 above
◮ We get the same result as earlier. If X, Y are independent,
      fZ (z) = ∫_{−∞}^{∞} fX (t) fY (z − t) dt
◮ With Z = X + Y and W = X − Y we got
      fZW (z, w) = (1/2) fXY ( (z + w)/2, (z − w)/2 )
◮ Now we can calculate fW also:
      fW (w) = ∫_{−∞}^{∞} (1/2) fXY ( (z + w)/2, (z − w)/2 ) dz
   change the variable: t = (z + w)/2 ⇒ dt = (1/2) dz, z = 2t − w, z − w = 2t − 2w
      fW (w) = ∫_{−∞}^{∞} fXY (t, t − w) dt
             = ∫_{−∞}^{∞} fXY (t + w, t) dt, using t = (z − w)/2 above

Example
◮ Let X, Y be iid U (0, 1). Let Z = X − Y .
      fZ (z) = ∫_{−∞}^{∞} fX (t) fY (t − z) dt
◮ For the integrand to be non-zero (note Z ∈ (−1, 1)):
   ◮ 0 < t < 1 ⇒ t > 0, t < 1
   ◮ 0 < t − z < 1 ⇒ t > z, t < 1 + z
   ◮ ⇒ max(0, z) < t < min(1, 1 + z)
◮ Thus, we get the density as
      fZ (z) = ∫_{0}^{1+z} dt = 1 + z, if −1 < z < 0
      fZ (z) = ∫_{z}^{1} dt = 1 − z, if 0 < z < 1
◮ Thus, when X, Y ∼ U (0, 1) iid,
      fX−Y (z) = 1 − |z|, −1 < z < 1

◮ We showed that
      fX+Y (z) = ∫_{−∞}^{∞} fXY (t, z − t) dt = ∫_{−∞}^{∞} fXY (z − t, t) dt
      fX−Y (w) = ∫_{−∞}^{∞} fXY (t, t − w) dt = ∫_{−∞}^{∞} fXY (t + w, t) dt
◮ Suppose X, Y are discrete. Then we have
      fX+Y (z) = P [X + Y = z] = Σ_k P [X = k, Y = z − k] = Σ_k fXY (k, z − k)
      fX−Y (w) = P [X − Y = w] = Σ_k P [X = k, Y = k − w] = Σ_k fXY (k, k − w)

Distribution of product of random variables
◮ We want the density of Z = XY .
◮ We need one more function to make an invertible transformation.
◮ A possible choice: Z = XY, W = Y .
◮ This is invertible: X = Z/W, Y = W , with
      J = det [ 1/w −z/w² ; 0 1 ] = 1/w
◮ Hence we get
      fZW (z, w) = (1/|w|) fXY (z/w, w)
◮ Thus we get the density of the product as
      fZ (z) = ∫_{−∞}^{∞} (1/|w|) fXY (z/w, w) dw
Example
◮ Let X, Y be iid U (0, 1). Let Z = XY .
      fZ (z) = ∫_{−∞}^{∞} (1/|w|) fX (z/w) fY (w) dw
◮ We need: 0 < w < 1 and 0 < z/w < 1. Hence z < w < 1.
      fZ (z) = ∫_{z}^{1} (1/w) dw = − ln(z), 0 < z < 1

◮ X, Y have a joint density and Z = XY . Then
      fZ (z) = ∫_{−∞}^{∞} (1/|w|) fXY (z/w, w) dw
◮ Suppose X, Y are discrete and Z = XY . Then
      fZ (0) = P [X = 0 or Y = 0] = Σ_x fXY (x, 0) + Σ_y fXY (0, y) − fXY (0, 0)
      fZ (k) = P [XY = k] = Σ_{y≠0} P [X = k/y, Y = y] = Σ_{y≠0} fXY (k/y, y), k ≠ 0
◮ We cannot always interchange density and mass functions!!
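The density − ln(z) of the product of two iid uniforms can be checked via its df: FZ (z0) = ∫_0^{z0} (− ln z) dz = z0 − z0 ln z0. A Monte Carlo sketch (not from the slides; seed, sample size and z0 = 0.3 are arbitrary):

```python
import random
from math import log

random.seed(4)
n = 200_000
z0 = 0.3
# Z = X * Y with X, Y iid U(0,1); estimate P[Z <= z0]
hits = sum(1 for _ in range(n)
           if random.random() * random.random() <= z0)
mc = hits / n
exact = z0 - z0 * log(z0)   # integral of (-ln z) from 0 to z0
```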

◮ We wanted the density of Z = XY .
◮ We used: Z = XY and W = Y .
◮ We could have used: Z = XY and W = X.
◮ This is invertible: X = W and Y = Z/W , with
      J = det [ 0 1 ; 1/w −z/w² ] = −1/w
◮ This gives
      fZW (z, w) = (1/|w|) fXY (w, z/w)
      fZ (z) = ∫_{−∞}^{∞} (1/|w|) fXY (w, z/w) dw
◮ The fZ should be same in both cases.

Distributions of quotients
◮ X, Y have a joint density and Z = X/Y .
◮ We can take: Z = X/Y, W = Y .
◮ This is invertible: X = ZW, Y = W , with
      J = det [ w z ; 0 1 ] = w
◮ Hence we get
      fZW (z, w) = |w| fXY (zw, w)
◮ Thus we get the density of the quotient as
      fZ (z) = ∫_{−∞}^{∞} |w| fXY (zw, w) dw
Example
◮ Let X, Y be iid U (0, 1). Let Z = X/Y . Note Z ∈ (0, ∞).
      fZ (z) = ∫_{−∞}^{∞} |w| fX (zw) fY (w) dw
◮ We need 0 < w < 1 and 0 < zw < 1 ⇒ w < 1/z.
◮ So, when z ≤ 1, w goes from 0 to 1; when z > 1, w goes from 0 to 1/z.
◮ Hence we get the density as
      fZ (z) = ∫_{0}^{1} w dw = 1/2, if 0 < z ≤ 1
      fZ (z) = ∫_{0}^{1/z} w dw = 1/(2z²), if 1 < z < ∞

◮ X, Y have a joint density and Z = X/Y :
      fZ (z) = ∫_{−∞}^{∞} |w| fXY (zw, w) dw
◮ Suppose X, Y are discrete and Z = X/Y :
      fZ (z) = P [Z = z] = P [X/Y = z] = Σ_y P [X = yz, Y = y] = Σ_y fXY (yz, y)
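A simulation check of this quotient density (not from the slides; seed and sample size arbitrary). From the density, FZ (1) = ∫_0^1 (1/2) dz = 1/2 and FZ (2) = 1/2 + ∫_1^2 1/(2z²) dz = 3/4. The denominator below uses 1 − random() so it lies in (0, 1], avoiding division by zero.

```python
import random

random.seed(5)
n = 200_000
c1 = c2 = 0
for _ in range(n):
    x = 1 - random.random()   # in (0, 1]
    y = 1 - random.random()   # in (0, 1], safe denominator
    z = x / y
    if z <= 1:
        c1 += 1
    if z <= 2:
        c2 += 1
p1 = c1 / n   # estimates F_Z(1) = 0.5
p2 = c2 / n   # estimates F_Z(2) = 0.75
```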

◮ We chose: Z = X/Y and W = Y .
◮ We could have taken: Z = X/Y and W = X.
◮ The inverse is: X = W and Y = W/Z, with
      J = det [ 0 1 ; −w/z² 1/z ] = w/z²
◮ Thus we get the density of the quotient as
      fZ (z) = ∫_{−∞}^{∞} (|w|/z²) fXY (w, w/z) dw
   put t = w/z ⇒ dt = dw/z, w = tz:
      fZ (z) = ∫_{−∞}^{∞} |t| fXY (tz, t) dt
◮ We can show that the density of the quotient is same in both these approaches.

Exchangeable Random Variables
◮ X1 , X2 , · · · , Xn are said to be exchangeable if their joint distribution is same as that of any permutation of them.
◮ Let (i1 , · · · , in ) be a permutation of (1, 2, · · · , n). Then the joint df of (Xi1 , · · · , Xin ) should be same as that of (X1 , · · · , Xn ).
◮ Take n = 3. Suppose FX1 X2 X3 (a, b, c) = g(a, b, c). If they are exchangeable, then
      FX2 X3 X1 (a, b, c) = P [X2 ≤ a, X3 ≤ b, X1 ≤ c]
                         = P [X1 ≤ c, X2 ≤ a, X3 ≤ b]
                         = g(c, a, b) = g(a, b, c)
◮ The df or density should be “symmetric” in its variables if the random variables are exchangeable.
◮ Consider the density of three random variables
      f (x, y, z) = (2/3)(x + y + z), 0 < x, y, z < 1
◮ They are exchangeable (because, e.g., f (x, y, z) = f (y, x, z)).
◮ If random variables are exchangeable then they are identically distributed:
      FXY Z (a, ∞, ∞) = FXY Z (∞, ∞, a) ⇒ FX (a) = FZ (a)
◮ The above example shows that exchangeable random variables need not be independent. The joint density is not factorizable:
      fX (x) = ∫_{0}^{1} ∫_{0}^{1} (2/3)(x + y + z) dy dz = 2(x + 1)/3
◮ So, the joint density is not the product of marginals.

Expectation of functions of multiple rv
◮ Theorem: Let Z = g(X1 , · · · , Xn ) = g(X). Then
      E[Z] = ∫_{ℜⁿ} g(x) dFX (x)
◮ That is, if they have a joint density, then
      E[Z] = ∫_{ℜⁿ} g(x) fX (x) dx
◮ Similarly, if all Xi are discrete,
      E[Z] = Σ_x g(x) fX (x)

◮ Let Z = X + Y , where X, Y have joint density fXY .
      E[X + Y ] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x + y) fXY (x, y) dx dy
                = ∫_{−∞}^{∞} x ( ∫_{−∞}^{∞} fXY (x, y) dy ) dx + ∫_{−∞}^{∞} y ( ∫_{−∞}^{∞} fXY (x, y) dx ) dy
                = ∫_{−∞}^{∞} x fX (x) dx + ∫_{−∞}^{∞} y fY (y) dy
                = E[X] + E[Y ]
◮ Expectation is a linear operator.
◮ This is true for all random variables.

Recap
◮ X1 , · · · , Xn are continuous rv with joint density and
      Y1 = g1 (X1 , · · · , Xn ) · · · Yn = gn (X1 , · · · , Xn )
◮ The transformation is continuous with continuous first partials and is invertible, with
      X1 = h1 (Y1 , · · · , Yn ) · · · Xn = hn (Y1 , · · · , Yn )
◮ We assume the Jacobian of the inverse transform, J, is non-zero.
◮ Then the density of Y is
      fY1 ···Yn (y1 , · · · , yn ) = |J| fX1 ···Xn (h1 (y1 , · · · , yn ), · · · , hn (y1 , · · · , yn ))
◮ This is called the multidimensional change of variable formula.
Recap
◮ One can use the theorem to find densities of the sum, difference, product and quotient of random variables.
  fX+Y(z) = ∫_{−∞}^{∞} fXY(t, z − t) dt = ∫_{−∞}^{∞} fXY(z − t, t) dt
  fX−Y(z) = ∫_{−∞}^{∞} fXY(t, t − z) dt = ∫_{−∞}^{∞} fXY(t + z, t) dt
  fX∗Y(z) = ∫_{−∞}^{∞} (1/|t|) fXY(z/t, t) dt = ∫_{−∞}^{∞} (1/|t|) fXY(t, z/t) dt
  fX/Y(z) = ∫_{−∞}^{∞} |t| fXY(zt, t) dt = ∫_{−∞}^{∞} (|t|/z²) fXY(t, t/z) dt

Recap
◮ X1, X2, · · · , Xn are said to be exchangeable if their joint distribution is the same as that of any permutation of them.
◮ If the random variables are exchangeable then the joint distribution function remains the same on permutation of its arguments.
◮ Exchangeable random variables are identically distributed but they may not be independent.
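The convolution formula above can be checked numerically. This is an illustrative sketch (not part of the slides): for X, Y independent Exp(1), fX+Y(z) = ∫ fXY(t, z − t) dt should equal the Gamma(2, 1) density z e^{−z}.

```python
import math

# f_XY(x, y) = e^{-x} e^{-y} for x, y > 0 (independent Exp(1) rvs)
def f_xy(x, y):
    return math.exp(-x - y) if x > 0 and y > 0 else 0.0

def f_sum(z, steps=10000):
    # midpoint-rule integration of f_XY(t, z - t) over t in (0, z)
    h = z / steps
    return sum(f_xy((i + 0.5) * h, z - (i + 0.5) * h) for i in range(steps)) * h

for z in (0.5, 1.0, 2.0):
    exact = z * math.exp(-z)      # Gamma(2,1) density
    assert abs(f_sum(z) - exact) < 1e-6
```

The same numerical pattern works for the difference, product and quotient formulas, with the appropriate integrand.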

Recap
◮ Let Z = g(X1, · · · , Xn) = g(X). Then
  E[Z] = ∫_{ℜⁿ} g(x) dFX(x)
◮ For example, if they have a joint density, then
  E[Z] = ∫_{ℜⁿ} g(x) fX(x) dx
◮ This gives us: E[X + Y] = E[X] + E[Y]
◮ In general, E[g1(X) + g2(X)] = E[g1(X)] + E[g2(X)]

◮ We saw E[X + Y] = E[X] + E[Y].
◮ Let us calculate Var(X + Y).
  Var(X + Y) = E[((X + Y) − E[X + Y])²]
             = E[((X − EX) + (Y − EY))²]
             = E[(X − EX)²] + E[(Y − EY)²] + 2E[(X − EX)(Y − EY)]
             = Var(X) + Var(Y) + 2Cov(X, Y)
  where we define the covariance between X, Y as
  Cov(X, Y) = E[(X − EX)(Y − EY)]
◮ We define covariance between X and Y by
  Cov(X, Y) = E[(X − EX)(Y − EY)]
            = E[XY − X(EY) − Y(EX) + EX EY]
            = E[XY] − EX EY
◮ Note that Cov(X, Y) can be positive or negative
◮ X and Y are said to be uncorrelated if Cov(X, Y) = 0
◮ If X and Y are uncorrelated then
  Var(X + Y) = Var(X) + Var(Y)
◮ Note that E[X + Y] = E[X] + E[Y] for all random variables.

Example
◮ Consider the joint density
  fXY(x, y) = 2, 0 < x < y < 1
◮ We want to calculate Cov(X, Y)
  EX = ∫_0^1 ∫_x^1 x 2 dy dx = 2 ∫_0^1 x(1 − x) dx = 1/3
  EY = ∫_0^1 ∫_0^y y 2 dx dy = 2 ∫_0^1 y² dy = 2/3
  E[XY] = ∫_0^1 ∫_0^y xy 2 dx dy = 2 ∫_0^1 y (y²/2) dy = 1/4
◮ Hence, Cov(X, Y) = E[XY] − EX EY = 1/4 − 2/9 = 1/36
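A Monte Carlo sketch of this example (not from the slides): the pair (min(U1, U2), max(U1, U2)) of two independent Uniform(0, 1) draws has exactly the density 2 on 0 < x < y < 1, so the sample covariance should be close to 1/36.

```python
import random

random.seed(0)
n = 200000
sx = sy = sxy = 0.0
for _ in range(n):
    u, v = random.random(), random.random()
    x, y = min(u, v), max(u, v)   # uniform on the triangle 0 < x < y < 1
    sx += x; sy += y; sxy += x * y
ex, ey, exy = sx / n, sy / n, sxy / n
cov = exy - ex * ey
# expect a value near 1/36 ≈ 0.0278
assert abs(cov - 1.0 / 36.0) < 0.005
```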

Independent random variables are uncorrelated
◮ Suppose X, Y are independent. Then
  E[XY] = ∫∫ x y fXY(x, y) dx dy
        = ∫∫ x y fX(x) fY(y) dx dy
        = ∫ x fX(x) dx ∫ y fY(y) dy = EX EY
◮ Then, Cov(X, Y) = E[XY] − EX EY = 0.
◮ X, Y independent ⇒ X, Y uncorrelated

Uncorrelated random variables may not be independent
◮ Suppose X ∼ N(0, 1). Then, EX = EX³ = 0
◮ Let Y = X². Then,
  E[XY] = EX³ = 0 = EX EY
◮ Thus X, Y are uncorrelated.
◮ Are they independent? No
  e.g., P[X > 2 | Y < 1] = 0 ≠ P[X > 2]
◮ X, Y uncorrelated does not imply they are independent.
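A simulation sketch of this example (not from the slides): X ∼ N(0, 1) and Y = X² have (near-)zero sample covariance, yet knowing Y < 1 forces |X| < 1.

```python
import random

random.seed(1)
n = 100000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [x * x for x in xs]
ex = sum(xs) / n
ey = sum(ys) / n
cov = sum(x * y for x, y in zip(xs, ys)) / n - ex * ey
# Cov(X, X^2) = E[X^3] = 0 for a standard normal
assert abs(cov) < 0.1
# but the pair is clearly dependent: Y < 1 forces |X| < 1
assert all(abs(x) < 1 for x, y in zip(xs, ys) if y < 1)
```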
◮ We define the correlation coefficient of X, Y by
  ρXY = Cov(X, Y) / √(Var(X) Var(Y))
◮ If X, Y are uncorrelated then ρXY = 0.
◮ We will show that |ρXY| ≤ 1
◮ Hence −1 ≤ ρXY ≤ 1, ∀X, Y

◮ We have E[(αX + βY)²] ≥ 0, ∀α, β ∈ ℜ
  α²E[X²] + β²E[Y²] + 2αβE[XY] ≥ 0, ∀α, β ∈ ℜ
  Take α = −E[XY]/E[X²]:
  (E[XY])²/E[X²] + β²E[Y²] − 2β(E[XY])²/E[X²] ≥ 0, ∀β ∈ ℜ
  Since this quadratic in β is non-negative, its discriminant is non-positive:
  4((E[XY])²/E[X²])² − 4E[Y²](E[XY])²/E[X²] ≤ 0
  ⇒ (E[XY])² ≤ E[X²]E[Y²]

◮ We showed that
  (E[XY])² ≤ E[X²]E[Y²]
◮ Take X − EX in place of X and Y − EY in place of Y in the above algebra.
◮ This gives us
  (E[(X − EX)(Y − EY)])² ≤ E[(X − EX)²] E[(Y − EY)²]
  ⇒ (Cov(X, Y))² ≤ Var(X)Var(Y)
◮ Hence we get
  ρ²XY = (Cov(X, Y) / √(Var(X)Var(Y)))² ≤ 1
◮ The equality holds here only if E[(αX + βY)²] = 0
  Thus, |ρXY| = 1 only if αX + βY = 0
◮ The correlation coefficient of X, Y is ±1 only when Y is a linear function of X

Linear Least Squares Estimation
◮ Suppose we want to approximate Y as an affine function of X.
◮ We want a, b to minimize E[(Y − (aX + b))²]
◮ For a fixed a, what is the b that minimizes E[((Y − aX) − b)²]?
◮ We know the best b here is:
  b = E[Y − aX] = EY − aEX.
◮ So, we want to find the best a to minimize
  J(a) = E[(Y − aX − (EY − aEX))²]
◮ We want to find a to minimize
  J(a) = E[(Y − aX − (EY − aEX))²]
       = E[((Y − EY) − a(X − EX))²]
       = E[(Y − EY)² + a²(X − EX)² − 2a(Y − EY)(X − EX)]
       = Var(Y) + a²Var(X) − 2aCov(X, Y)
◮ So, the optimal a satisfies
  2aVar(X) − 2Cov(X, Y) = 0 ⇒ a = Cov(X, Y)/Var(X)

◮ The final mean square error, say, J*, is
  J* = Var(Y) + a²Var(X) − 2aCov(X, Y)
     = Var(Y) + (Cov(X, Y)/Var(X))² Var(X) − 2 (Cov(X, Y)/Var(X)) Cov(X, Y)
     = Var(Y) − (Cov(X, Y))²/Var(X)
     = Var(Y) (1 − (Cov(X, Y))²/(Var(Y)Var(X)))
     = Var(Y) (1 − ρ²XY)

◮ The best mean-square approximation of Y as a ‘linear’ function of X is
  Y = (Cov(X, Y)/Var(X)) X + EY − (Cov(X, Y)/Var(X)) EX
◮ Called the line of regression of Y on X.
◮ If Cov(X, Y) = 0 then this reduces to approximating Y by a constant, EY.
◮ The final mean square error is
  Var(Y) (1 − ρ²XY)
◮ If ρXY = ±1 then the error is zero
◮ If ρXY = 0 the final error is Var(Y)

◮ The covariance of X, Y is
  Cov(X, Y) = E[(X − EX)(Y − EY)] = E[XY] − EX EY
  Note that Cov(X, X) = Var(X)
◮ X, Y are called uncorrelated if Cov(X, Y) = 0.
◮ X, Y independent ⇒ X, Y uncorrelated.
◮ Uncorrelated random variables need not necessarily be independent
◮ Covariance plays an important role in linear least squares estimation.
◮ Informally, covariance captures the ‘linear dependence’ between the two random variables.
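The regression-line formulas can be exercised on samples. A sketch (not from the slides), using simulated data with a known line Y = 2X + 1 + noise: a and b are estimated from sample moments, and the empirical MSE matches Var(Y)(1 − ρ²) exactly by the algebra above.

```python
import random

random.seed(2)
n = 100000
data = []
for _ in range(n):
    x = random.gauss(0, 1)
    y = 2 * x + 1 + random.gauss(0, 0.5)   # true line: a = 2, b = 1
    data.append((x, y))
ex = sum(x for x, _ in data) / n
ey = sum(y for _, y in data) / n
vx = sum((x - ex) ** 2 for x, _ in data) / n
vy = sum((y - ey) ** 2 for _, y in data) / n
cxy = sum((x - ex) * (y - ey) for x, y in data) / n
a = cxy / vx                    # Cov(X,Y)/Var(X)
b = ey - a * ex                 # EY - a*EX
mse = sum((y - (a * x + b)) ** 2 for x, y in data) / n
rho2 = cxy ** 2 / (vx * vy)
assert abs(a - 2) < 0.05 and abs(b - 1) < 0.05
assert abs(mse - vy * (1 - rho2)) < 1e-6   # J* = Var(Y)(1 - rho^2)
```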
Covariance Matrix
◮ Let X1, · · · , Xn be random variables (on the same probability space)
◮ We represent them as a vector X.
◮ As a notation, all vectors are column vectors: X = (X1, · · · , Xn)^T
◮ We denote E[X] = (EX1, · · · , EXn)^T
◮ The n × n matrix whose (i, j)th element is Cov(Xi, Xj) is called the covariance matrix (or variance-covariance matrix) of X. Denoted as ΣX.
  ΣX = [ Cov(X1, X1)  Cov(X1, X2)  · · ·  Cov(X1, Xn)
         Cov(X2, X1)  Cov(X2, X2)  · · ·  Cov(X2, Xn)
            ...          ...       ...      ...
         Cov(Xn, X1)  Cov(Xn, X2)  · · ·  Cov(Xn, Xn) ]

Covariance matrix
◮ If a = (a1, · · · , an)^T then a a^T is an n × n matrix whose (i, j)th element is ai aj.
◮ Hence we get
  ΣX = E[(X − EX)(X − EX)^T]
◮ This is because
  ((X − EX)(X − EX)^T)_ij = (Xi − EXi)(Xj − EXj)
  and (ΣX)_ij = E[(Xi − EXi)(Xj − EXj)]

◮ Recall the following about vectors and matrices
◮ Let a, b ∈ ℜⁿ be column vectors. Then
  a^T b = b^T a,  (a^T b)² = (a^T b)(b^T a) = a^T b b^T a
◮ Let A be an n × n matrix with elements aij. Then
  b^T A b = Σ_{i,j=1}^n bi bj aij,  where b = (b1, · · · , bn)^T
◮ A is said to be positive semidefinite if b^T A b ≥ 0, ∀b

◮ ΣX is a real symmetric matrix
◮ It is positive semidefinite.
◮ Let a ∈ ℜⁿ and let Y = a^T X.
◮ Then, EY = a^T EX. We get the variance of Y as
  Var(Y) = E[(Y − EY)²] = E[(a^T X − a^T EX)²]
         = E[(a^T (X − EX))²]
         = E[a^T (X − EX)(X − EX)^T a]
         = a^T E[(X − EX)(X − EX)^T] a
         = a^T ΣX a
◮ This gives a^T ΣX a ≥ 0, ∀a
◮ This shows ΣX is positive semidefinite
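A small sketch (not from the slides) that estimates a covariance matrix from samples and checks the two properties just shown: symmetry, and a^T ΣX a = Var(a^T X) ≥ 0.

```python
import random

random.seed(3)
n, d = 20000, 3
samples = []
for _ in range(n):
    z1, z2, z3 = (random.gauss(0, 1) for _ in range(3))
    samples.append((z1, z1 + z2, z2 - z3))     # deliberately correlated coordinates
mean = [sum(s[i] for s in samples) / n for i in range(d)]
sigma = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / n
          for j in range(d)] for i in range(d)]
# symmetry: Cov(Xi, Xj) = Cov(Xj, Xi)
assert all(abs(sigma[i][j] - sigma[j][i]) < 1e-9 for i in range(d) for j in range(d))
# positive semidefiniteness via the quadratic form a^T Sigma a
for _ in range(100):
    a = [random.uniform(-1, 1) for _ in range(d)]
    q = sum(a[i] * sigma[i][j] * a[j] for i in range(d) for j in range(d))
    assert q >= -1e-9
```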
◮ Y = a^T X = Σ_i ai Xi – a linear combination of the Xi’s.
◮ We know how to find its mean and variance:
  EY = a^T EX = Σ_i ai EXi;  Var(Y) = a^T ΣX a = Σ_{i,j} ai aj Cov(Xi, Xj)
◮ Specifically, by taking all components of a to be 1, we get
  Var(Σ_{i=1}^n Xi) = Σ_{i,j=1}^n Cov(Xi, Xj) = Σ_{i=1}^n Var(Xi) + Σ_{i=1}^n Σ_{j≠i} Cov(Xi, Xj)
◮ If the Xi are independent, the variance of the sum is the sum of the variances.

◮ The covariance matrix ΣX is positive semidefinite because
  a^T ΣX a = Var(a^T X) ≥ 0
◮ ΣX would be positive definite if a^T ΣX a > 0, ∀a ≠ 0
◮ It would fail to be positive definite if Var(a^T X) = 0 for some nonzero a.
◮ Var(Z) = E[(Z − EZ)²] = 0 implies Z = EZ, a constant.
◮ Hence, ΣX fails to be positive definite only if there is a non-zero linear combination of the Xi’s that is a constant.

◮ The covariance matrix is a real symmetric positive semidefinite matrix
◮ It has real and non-negative eigen values.
◮ It would have n linearly independent eigen vectors.
◮ These also have some interesting roles.
◮ We consider one simple example.

◮ Let Y = a^T X and assume ||a|| = 1
◮ Y is the projection of X along the direction a.
◮ Suppose we want to find a direction along which the variance is maximized
◮ We want to maximize a^T ΣX a subject to a^T a = 1
◮ The Lagrangian is a^T ΣX a + η(1 − a^T a)
◮ Equating the gradient to zero, we get
  ΣX a = ηa
◮ So, a should be an eigen vector (with eigen value η).
◮ Then the variance would be a^T ΣX a = η a^T a = η
◮ Hence the direction is the eigen vector corresponding to the highest eigen value.
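A sketch of this idea (not from the slides): the maximum-variance direction can be found by power iteration on a small symmetric matrix; the Rayleigh quotient converges to the largest eigenvalue.

```python
import math

sigma = [[3.0, 1.0], [1.0, 2.0]]    # a 2x2 symmetric positive definite "covariance" matrix
a = [1.0, 0.0]
for _ in range(200):                 # power iteration: a <- sigma a / ||sigma a||
    a = [sigma[0][0] * a[0] + sigma[0][1] * a[1],
         sigma[1][0] * a[0] + sigma[1][1] * a[1]]
    norm = math.hypot(a[0], a[1])
    a = [a[0] / norm, a[1] / norm]
# eta = a^T sigma a, the variance along the limiting direction
eta = sum(a[i] * sigma[i][j] * a[j] for i in range(2) for j in range(2))
# eigenvalues of [[3,1],[1,2]] are (5 ± sqrt(5))/2; eta should be the larger one
assert abs(eta - (5 + math.sqrt(5)) / 2) < 1e-9
```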
Joint moments
◮ Given two random variables, X, Y
◮ The joint moment of order (i, j) is defined by
  mij = E[X^i Y^j]
  m10 = EX, m01 = EY, m11 = E[XY] and so on
◮ Similarly, joint central moments of order (i, j) are defined by
  sij = E[(X − EX)^i (Y − EY)^j]
  s10 = s01 = 0, s11 = Cov(X, Y), s20 = Var(X) and so on
◮ We can similarly define joint moments of multiple random variables

◮ We can define the moment generating function of X, Y by
  MXY(s, t) = E[e^{sX+tY}], s, t ∈ ℜ
◮ This is easily generalized to n random variables:
  MX(s) = E[e^{s^T X}], s ∈ ℜⁿ
◮ Once again, we can get all the moments by differentiating the moment generating function:
  (∂/∂si) MX(s) |_{s=0} = EXi
◮ More generally,
  (∂^{m+n}/∂si^n ∂sj^m) MX(s) |_{s=0} = E[Xi^n Xj^m]

Conditional Expectation
◮ Suppose X, Y have a joint density fXY
◮ Consider the conditional density fX|Y(x|y). This is a density in x for every value of y.
◮ Since it is a density, we can use it in an expectation integral: ∫ g(x) fX|Y(x|y) dx
◮ This is like the expectation of g(X) since fX|Y(x|y) is a density in x.
◮ However, its value would be a function of y.
◮ That is, this is a kind of expectation that is a function of Y (and hence is a random variable)
◮ It is called conditional expectation.
◮ We will now define it formally

◮ Let X, Y be discrete random variables (on the same probability space).
◮ The conditional expectation of h(X) conditioned on Y is a function of Y, and its value for any y is defined by
  E[h(X)|Y = y] = Σ_x h(x) fX|Y(x|y) = Σ_x h(x) P[X = x|Y = y]
◮ What this means is that we define E[h(X)|Y] = g(Y) where
  g(y) = Σ_x h(x) fX|Y(x|y)
◮ Thus, E[h(X)|Y] is a random variable
◮ Let X, Y have joint density fXY.
◮ The conditional expectation of h(X) conditioned on Y is a function of Y, and its value for any y is defined by
  E[h(X)|Y = y] = ∫_{−∞}^{∞} h(x) fX|Y(x|y) dx
◮ Once again, what this means is that E[h(X)|Y] = g(Y) where
  g(y) = ∫_{−∞}^{∞} h(x) fX|Y(x|y) dx

A simple example
◮ Consider the joint density
  fXY(x, y) = 2, 0 < x < y < 1
◮ We calculated the conditional densities earlier:
  fX|Y(x|y) = 1/y,  fY|X(y|x) = 1/(1 − x),  0 < x < y < 1
◮ Now we can calculate the conditional expectation
  E[X|Y = y] = ∫_{−∞}^{∞} x fX|Y(x|y) dx = ∫_0^y x (1/y) dx = (1/y)(y²/2) = y/2
◮ This gives: E[X|Y] = Y/2
◮ We can show E[Y|X] = (1 + X)/2
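A simulation sketch of E[X|Y] = Y/2 (not from the slides): sample from the triangle density and average X over draws whose Y lands in a narrow band around a chosen y0.

```python
import random

random.seed(4)
y0, eps = 0.8, 0.01
vals = []
while len(vals) < 5000:
    u, v = random.random(), random.random()
    x, y = min(u, v), max(u, v)        # density 2 on 0 < x < y < 1
    if abs(y - y0) < eps:
        vals.append(x)
cond_mean = sum(vals) / len(vals)
# E[X | Y = y0] = y0 / 2 = 0.4
assert abs(cond_mean - y0 / 2) < 0.02
```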

◮ The conditional expectation is defined by
  E[h(X)|Y = y] = Σ_x h(x) fX|Y(x|y), X, Y are discrete
  E[h(X)|Y = y] = ∫_{−∞}^{∞} h(x) fX|Y(x|y) dx, X, Y have joint density
◮ We can actually define E[h(X, Y)|Y] also as above. That is,
  E[h(X, Y)|Y = y] = ∫_{−∞}^{∞} h(x, y) fX|Y(x|y) dx
◮ It has all the properties of expectation:
  1. E[a|Y] = a where a is a constant
  2. E[ah1(X) + bh2(X)|Y] = aE[h1(X)|Y] + bE[h2(X)|Y]
  3. h1(X) ≥ h2(X) ⇒ E[h1(X)|Y] ≥ E[h2(X)|Y]

◮ Conditional expectation also has some extra properties which are very important:
  ◮ E[E[h(X)|Y]] = E[h(X)]
  ◮ E[h1(X)h2(Y)|Y] = h2(Y)E[h1(X)|Y]
  ◮ E[h(X, Y)|Y = y] = E[h(X, y)|Y = y]
◮ We will justify each of these.
◮ The last property above follows directly from the definition.
◮ Expectation of a conditional expectation is the unconditional expectation:
  E[E[h(X)|Y]] = E[h(X)]
  In the above, the LHS is the expectation of a function of Y.
◮ Let us denote g(Y) = E[h(X)|Y]. Then
  E[E[h(X)|Y]] = E[g(Y)]
    = ∫_{−∞}^{∞} g(y) fY(y) dy
    = ∫_{−∞}^{∞} (∫_{−∞}^{∞} h(x) fX|Y(x|y) dx) fY(y) dy
    = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x) fXY(x, y) dy dx
    = ∫_{−∞}^{∞} h(x) fX(x) dx
    = E[h(X)]

◮ Any factor that depends only on the conditioning variable behaves like a constant inside a conditional expectation:
  E[h1(X) h2(Y)|Y] = h2(Y)E[h1(X)|Y]
◮ Let us denote g(Y) = E[h1(X) h2(Y)|Y]
  g(y) = E[h1(X) h2(Y)|Y = y]
       = ∫_{−∞}^{∞} h1(x)h2(y) fX|Y(x|y) dx
       = h2(y) ∫_{−∞}^{∞} h1(x) fX|Y(x|y) dx
       = h2(y) E[h1(X)|Y = y]

◮ A very useful property of conditional expectation is
  E[E[X|Y]] = E[X] (assuming all expectations exist)
◮ We can see this in our earlier example.
  fXY(x, y) = 2, 0 < x < y < 1
◮ We calculated: EX = 1/3 and EY = 2/3
◮ We also showed E[X|Y] = Y/2
  E[E[X|Y]] = E[Y/2] = 1/3 = E[X]
◮ Similarly
  E[E[Y|X]] = E[(1 + X)/2] = 2/3 = E[Y]

◮ We have
  E[E[X|Y]] = E[X], ∀X, Y
◮ This is a useful technique to find EX.
◮ We can choose a Y that is useful.
Density of XY
◮ Let X, Y have joint density fXY.
◮ Let Z = XY. We want to find the density of XY directly
◮ Let Az = {(x, y) ∈ ℜ² : xy ≤ z} ⊂ ℜ².
  FZ(z) = P[XY ≤ z] = P[(X, Y) ∈ Az] = ∫∫_{Az} fXY(x, y) dy dx
◮ We need to find limits for integrating over Az
◮ If x > 0, then xy ≤ z ⇒ y ≤ z/x
  If x < 0, then xy ≤ z ⇒ y ≥ z/x
  FZ(z) = ∫_{−∞}^0 ∫_{z/x}^{∞} fXY(x, y) dy dx + ∫_0^{∞} ∫_{−∞}^{z/x} fXY(x, y) dy dx

◮ Change variable from y to t using t = xy:
  y = t/x; dy = (1/x) dt; y = z/x ⇒ t = z
  FZ(z) = ∫_{−∞}^0 ∫_z^{−∞} (1/x) fXY(x, t/x) dt dx + ∫_0^{∞} ∫_{−∞}^z (1/x) fXY(x, t/x) dt dx
        = ∫_{−∞}^0 ∫_{−∞}^z (−1/x) fXY(x, t/x) dt dx + ∫_0^{∞} ∫_{−∞}^z (1/x) fXY(x, t/x) dt dx
        = ∫_{−∞}^{∞} ∫_{−∞}^z (1/|x|) fXY(x, t/x) dt dx
        = ∫_{−∞}^z (∫_{−∞}^{∞} (1/|x|) fXY(x, t/x) dx) dt
  This shows: fZ(z) = ∫_{−∞}^{∞} (1/|x|) fXY(x, z/x) dx
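A numerical sketch of the product-density formula (not from the slides): for X, Y independent Uniform(0, 1), fZ(z) = ∫ (1/|x|) fX(x) fY(z/x) dx = ∫_z^1 (1/x) dx = −ln z for 0 < z < 1.

```python
import math

def f_prod(z, steps=100000):
    # integrand is 1/x for z < x < 1 (both x and z/x must lie in (0,1))
    h = (1.0 - z) / steps
    return sum(1.0 / (z + (i + 0.5) * h) for i in range(steps)) * h

for z in (0.1, 0.5, 0.9):
    assert abs(f_prod(z) - (-math.log(z))) < 1e-6
```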

Recap: Covariance Recap: Correlation coefficient

◮ The covariance of X, Y is

Cov(X, Y ) = E[(X−EX) (Y −EY )] = E[XY ]−EX EY


◮ The correlation coefficient of X, Y is

Cov(X, Y )
Note that Cov(X, X) = Var(X) ρXY = p
Var(X) Var(Y )
◮ Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y )
◮ X, Y are called uncorrelated if Cov(X, Y ) = 0. ◮ If X, Y are uncorrelated then ρXY = 0.
◮ If X, Y are uncorrelated, Var(X + Y ) = Var(X) + Var(Y ) ◮ −1 ≤ ρXY ≤ 1, ∀X, Y
◮ X, Y independent ⇒ X, Y uncorrelated. ◮ |ρXY | = 1 iff X = aY
◮ Uncorrelated random variables need not necessarily be
independent

PS Sastry, IISc, Bangalore, 2020 3/32 PS Sastry, IISc, Bangalore, 2020 4/32
Recap: mean square estimation
◮ The best mean-square approximation of Y as a ‘linear’ function of X is
  Y = (Cov(X, Y)/Var(X)) X + EY − (Cov(X, Y)/Var(X)) EX
◮ Called the line of regression of Y on X.
◮ If Cov(X, Y) = 0 then this reduces to approximating Y by a constant, EY.
◮ The final mean square error is
  Var(Y) (1 − ρ²XY)
◮ If ρXY = ±1 then the error is zero
◮ If ρXY = 0 the final error is Var(Y)

Recap: Covariance matrix
◮ For a random vector, X = (X1, · · · , Xn)^T, the covariance matrix is
  ΣX = E[(X − EX)(X − EX)^T]
  (ΣX)_ij = E[(Xi − EXi)(Xj − EXj)]
◮ Var(a^T X) = a^T ΣX a
◮ ΣX is a real symmetric and positive semidefinite matrix.

Recap: Moment generating function
◮ For a pair of rv, the joint moment of order (i, j) is defined by
  mij = E[X^i Y^j]
◮ The moment generating function of X, Y is
  MXY(s, t) = E[e^{sX+tY}], s, t ∈ ℜ
◮ For n rv, the joint moments are
  m_{i1 i2 ··· in} = E[X1^{i1} X2^{i2} · · · Xn^{in}]
◮ The moment generating function of X is
  MX(s) = E[e^{s^T X}], s ∈ ℜⁿ

Recap: Conditional Expectation
◮ The conditional expectation of h(X) conditioned on Y is
  E[h(X)|Y = y] = Σ_x h(x) fX|Y(x|y), X, Y are discrete
  E[h(X)|Y = y] = ∫_{−∞}^{∞} h(x) fX|Y(x|y) dx, X, Y have joint density
◮ The conditional expectation of h(X) conditioned on Y is a function of Y: E[h(X)|Y] = g(Y);
  the above specify the value of g(y).
◮ We define E[h(X, Y)|Y] also as above:
  E[h(X, Y)|Y = y] = ∫_{−∞}^{∞} h(x, y) fX|Y(x|y) dx
◮ If X, Y are independent, E[h(X)|Y] = E[h(X)]
Recap: Properties of Conditional Expectation
◮ It has all the properties of expectation:
  ◮ E[a|Y] = a where a is a constant
  ◮ E[ah1(X) + bh2(X)|Y] = aE[h1(X)|Y] + bE[h2(X)|Y]
  ◮ h1(X) ≥ h2(X) ⇒ E[h1(X)|Y] ≥ E[h2(X)|Y]
◮ Conditional expectation also has some extra properties which are very important:
  ◮ E[E[h(X)|Y]] = E[h(X)]
  ◮ E[h1(X)h2(Y)|Y] = h2(Y)E[h1(X)|Y]
  ◮ E[h(X, Y)|Y = y] = E[h(X, y)|Y = y]

◮ Expectation of a conditional expectation is the unconditional expectation:
  E[E[h(X)|Y]] = E[h(X)]
  In the above, the LHS is the expectation of a function of Y.
◮ Let us denote g(Y) = E[h(X)|Y]. Then
  E[E[h(X)|Y]] = E[g(Y)]
    = ∫_{−∞}^{∞} g(y) fY(y) dy
    = ∫_{−∞}^{∞} (∫_{−∞}^{∞} h(x) fX|Y(x|y) dx) fY(y) dy
    = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x) fXY(x, y) dy dx
    = ∫_{−∞}^{∞} h(x) fX(x) dx
    = E[h(X)]

Example
◮ Any factor that depends only on the conditioning variable
behaves like a constant inside a conditional expectation ◮ Let X, Y be random variables with joint density given by

E[h1 (X) h2 (Y )|Y ] = h2 (Y )E[h1 (X)|Y ] fXY (x, y) = e−y , 0 < x < y < ∞
◮ The marginal densities are:
◮ Let us denote g(Y ) = E[h1 (X) h2 (Y )|Y ] Z ∞ Z ∞

g(y) = E[h1 (X) h2 (Y )|Y = y] fX (x) = fXY (x, y) dy = e−y dy = e−x , x > 0
Z ∞ −∞ x

= h1 (x)h2 (y) fX|Y (x|y) dx Z ∞ Z y


−∞
Z ∞ fY (y) = fXY (x, y) dx = e−y dx = y e−y , y > 0
−∞ 0
= h2 (y) h1 (x) fX|Y (x|y) dx
−∞ Thus, X is exponential and Y is gamma.
= h2 (y) E[h1 (X)|Y = y] ◮ Hence we have

EX = 1; Var(X) = 1; EY = 2; Var(Y ) = 2

PS Sastry, IISc, Bangalore, 2020 11/32 PS Sastry, IISc, Bangalore, 2020 12/32
  fXY(x, y) = e^{−y}, 0 < x < y < ∞
◮ Let us calculate the covariance of X and Y:
  E[XY] = ∫∫ xy fXY(x, y) dx dy = ∫_0^{∞} ∫_0^y xy e^{−y} dx dy = ∫_0^{∞} (1/2) y³ e^{−y} dy = 3
  Hence, Cov(X, Y) = E[XY] − EX EY = 3 − 2 = 1.
◮ ρXY = 1/√2

◮ Recall the joint and marginal densities:
  fXY(x, y) = e^{−y}, 0 < x < y < ∞
  fX(x) = e^{−x}, x > 0;  fY(y) = ye^{−y}, y > 0
◮ The conditional densities will be
  fX|Y(x|y) = fXY(x, y)/fY(y) = e^{−y}/(ye^{−y}) = 1/y, 0 < x < y < ∞
  fY|X(y|x) = fXY(x, y)/fX(x) = e^{−y}/e^{−x} = e^{−(y−x)}, 0 < x < y < ∞
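A simulation sketch of this example (not from the slides): since fY|X(y|x) = e^{−(y−x)}, we can sample the pair as X ∼ Exp(1), Y = X + Exp(1), and check Cov(X, Y) = 1 and ρ = 1/√2.

```python
import random, math

random.seed(5)
n = 200000
xs, ys = [], []
for _ in range(n):
    x = random.expovariate(1.0)
    y = x + random.expovariate(1.0)   # matches f_{Y|X}(y|x) = e^{-(y-x)}
    xs.append(x); ys.append(y)
ex, ey = sum(xs) / n, sum(ys) / n
cov = sum(x * y for x, y in zip(xs, ys)) / n - ex * ey
vx = sum(x * x for x in xs) / n - ex * ex
vy = sum(y * y for y in ys) / n - ey * ey
rho = cov / math.sqrt(vx * vy)
assert abs(cov - 1) < 0.1
assert abs(rho - 1 / math.sqrt(2)) < 0.03
```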

◮ The conditional densities are
  fX|Y(x|y) = 1/y;  fY|X(y|x) = e^{−(y−x)}, 0 < x < y < ∞
◮ We can now calculate the conditional expectations:
  E[X|Y = y] = ∫ x fX|Y(x|y) dx = ∫_0^y x (1/y) dx = y/2
  Thus E[X|Y] = Y/2
  E[Y|X = x] = ∫ y fY|X(y|x) dy = ∫_x^{∞} y e^{−(y−x)} dy
             = e^x (−ye^{−y} |_x^{∞} + ∫_x^{∞} e^{−y} dy)
             = e^x (xe^{−x} + e^{−x}) = 1 + x
  Thus, E[Y|X] = 1 + X

◮ We got
  E[X|Y] = Y/2;  E[Y|X] = 1 + X
◮ Using this we can verify:
  E[E[X|Y]] = E[Y/2] = EY/2 = 2/2 = 1 = EX
  E[E[Y|X]] = E[1 + X] = 1 + 1 = 2 = EY
◮ A property of conditional expectation is
  E[E[X|Y]] = E[X]
◮ We assume that all three expectations exist.
◮ Very useful in calculating expectations:
  EX = Σ_y E[X|Y = y] fY(y)  or  ∫ E[X|Y = y] fY(y) dy
◮ Can be used to calculate probabilities of events too:
  P(A) = E[IA] = E[E[IA|Y]]

◮ Let X be geometric and we want EX.
◮ X is the number of tosses needed to get a head
◮ Let Y ∈ {0, 1} be the outcome of the first toss. (1 for head)
  E[X] = E[E[X|Y]]
       = E[X|Y = 1] P[Y = 1] + E[X|Y = 0] P[Y = 0]
       = E[X|Y = 1] p + E[X|Y = 0] (1 − p)
       = 1 p + (1 + EX)(1 − p)
  ⇒ EX (1 − (1 − p)) = p + (1 − p) = 1
  ⇒ EX = 1/p

◮ P[X = k|Y = 1] = 1 if k = 1 (otherwise it is zero) and hence E[X|Y = 1] = 1
  P[X = k|Y = 0] = 0 if k = 1, and (1 − p)^{k−2} p if k ≥ 2
  Hence
  E[X|Y = 0] = Σ_{k=2}^{∞} k (1 − p)^{k−2} p
             = Σ_{k=2}^{∞} (k − 1)(1 − p)^{k−2} p + Σ_{k=2}^{∞} (1 − p)^{k−2} p
             = Σ_{k′=1}^{∞} k′ (1 − p)^{k′−1} p + Σ_{k′=1}^{∞} (1 − p)^{k′−1} p
             = EX + 1

Another example
◮ Example: multiple rounds of the party game
◮ Let Rn denote the number of rounds when you start with n people.
◮ We want R̄n = E[Rn].
◮ We want to use E[Rn] = E[E[Rn|Xn]]
◮ We need to think of a useful Xn.
◮ Let Xn be the number of people who got their own hat in the first round with n people.
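A quick simulation sketch (not from the slides) of the geometric result EX = 1/p derived above:

```python
import random

random.seed(6)
p = 0.3
n = 200000
total = 0
for _ in range(n):
    tosses = 1
    while random.random() >= p:   # keep tossing until the first head
        tosses += 1
    total += tosses
# expect the sample mean to be close to 1/p
assert abs(total / n - 1 / p) < 0.05
```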
◮ Rn – number of rounds when you start with n people.
◮ Xn – number of people who got their own hat in the first round
  E[Rn] = E[E[Rn|Xn]]
        = Σ_{i=0}^n E[Rn|Xn = i] P[Xn = i]
        = Σ_{i=0}^n (1 + E[Rn−i]) P[Xn = i]
        = Σ_{i=0}^n P[Xn = i] + Σ_{i=0}^n E[Rn−i] P[Xn = i]
◮ If we can guess the value of E[Rn] then we can prove it using mathematical induction

◮ What would be E[Xn]?
◮ Let Yi ∈ {0, 1} denote whether or not the ith person got his own hat.
◮ We know
  E[Yi] = P[Yi = 1] = (n − 1)!/n! = 1/n
◮ Now, Xn = Σ_{i=1}^n Yi and hence EXn = Σ_{i=1}^n E[Yi] = 1
◮ Hence a good guess is E[Rn] = n.
◮ We verify it using mathematical induction. We know E[R1] = 1

◮ Assume: E[Rk] = k, 1 ≤ k ≤ n − 1
  E[Rn] = Σ_{i=0}^n P[Xn = i] + Σ_{i=0}^n E[Rn−i] P[Xn = i]
        = 1 + E[Rn] P[Xn = 0] + Σ_{i=1}^n E[Rn−i] P[Xn = i]
        = 1 + E[Rn] P[Xn = 0] + Σ_{i=1}^n (n − i) P[Xn = i]
  E[Rn](1 − P[Xn = 0]) = 1 + n(1 − P[Xn = 0]) − Σ_{i=1}^n i P[Xn = i]
                       = 1 + n(1 − P[Xn = 0]) − E[Xn]
                       = 1 + n(1 − P[Xn = 0]) − 1
  ⇒ E[Rn] = n

Analysis of Quicksort
◮ Given n numbers we want to sort them. Many algorithms.
◮ Complexity – order of the number of comparisons needed
◮ Quicksort: Choose a pivot. Separate the numbers into two parts – less than and greater than the pivot; do this recursively
◮ Separating into two parts takes n − 1 comparisons.
◮ Suppose the two parts contain m and n − m − 1 numbers. Separating each of them into two parts takes m − 1 and n − m − 2 comparisons respectively
◮ So, the final number of comparisons depends on the ‘number of rounds’
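A simulation sketch of the party game (not from the slides): in each round the remaining people's hats are shuffled among them and anyone who gets his own hat leaves; the average number of rounds should be close to n.

```python
import random

random.seed(7)

def rounds(n):
    people = list(range(n))
    r = 0
    while people:
        r += 1
        hats = people[:]            # remaining people's own hats
        random.shuffle(hats)
        # those who got their own hat leave (with their hat)
        people = [p for p, h in zip(people, hats) if p != h]
    return r

n_trials = 30000
avg = sum(rounds(5) for _ in range(n_trials)) / n_trials
# expect the average to be close to E[R_5] = 5
assert abs(avg - 5) < 0.3
```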
quicksort details
◮ Given {x1, · · · , xn}.
◮ Choose the first as pivot:
  {xj1, xj2, · · · , xjm} x1 {xk1, xk2, · · · , xk(n−1−m)}
◮ Suppose rn is the number of comparisons. If we get (roughly) equal parts, then
  rn ≈ n + 2r_{n/2} = n + 2(n/2 + 2r_{n/4}) = n + n + 4r_{n/4} = · · · = n log₂(n)
◮ If all the rest go into one part, then
  rn = n + r_{n−1} = n + (n − 1) + r_{n−2} = · · · = n(n + 1)/2
◮ If you are lucky, O(n log(n)) comparisons.
◮ If unlucky, in the worst case, O(n²) comparisons
◮ Question: ‘on the average’ how many comparisons?

Average case complexity of quicksort
◮ Assume the pivot is equally likely to be the smallest or second smallest or mth smallest.
◮ Mn – number of comparisons.
◮ Define: X = j if the pivot is the jth smallest
◮ Given X = j we know Mn = (n − 1) + M_{j−1} + M_{n−j}.
  E[Mn] = E[E[Mn|X]] = Σ_{j=1}^n E[Mn|X = j] P[X = j]
        = Σ_{j=1}^n (1/n) E[(n − 1) + M_{j−1} + M_{n−j}]
        = (n − 1) + (2/n) Σ_{k=1}^{n−1} E[Mk], (taking M0 = 0)
◮ This is a recurrence relation. (A little complicated to solve)
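The recurrence can be evaluated numerically without solving it in closed form. A sketch (not from the slides) that computes E[Mn] from the recurrence and compares it with simulated quicksort comparison counts:

```python
import random

def avg_comparisons(n_max):
    # E[M_n] = (n-1) + (2/n) * sum_{k=1}^{n-1} E[M_k], with M_0 = 0
    m = [0.0] * (n_max + 1)
    running = 0.0                      # running sum of m[1..n-1]
    for n in range(1, n_max + 1):
        m[n] = (n - 1) + 2.0 * running / n
        running += m[n]
    return m

def quicksort_count(a):
    # count comparisons of quicksort with first element as pivot
    if len(a) <= 1:
        return 0
    pivot, rest = a[0], a[1:]
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    return len(rest) + quicksort_count(left) + quicksort_count(right)

random.seed(8)
n = 50
exact = avg_comparisons(n)[n]
trials = 2000
sim = sum(quicksort_count(random.sample(range(1000), n)) for _ in range(trials)) / trials
assert abs(sim - exact) / exact < 0.05
```

A quick sanity check: the recurrence gives E[M3] = 2 + (2/3)(0 + 1) = 8/3, which matches averaging (n − 1) + M_{j−1} + M_{n−j} over the three pivot positions.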

Least squares estimation
◮ We want to estimate Y as a function of X.
◮ We want an estimate with minimum mean square error.
◮ We want to solve (the min is over all functions g)
  min_g E[(Y − g(X))²]
◮ Earlier we considered linear functions: g(X) = aX + b
◮ The solution now turns out to be
  g*(X) = E[Y|X]
◮ Let us prove this.

◮ We want to show that for all g
  E[(E[Y|X] − Y)²] ≤ E[(g(X) − Y)²]
◮ We have
  (g(X) − Y)² = ((g(X) − E[Y|X]) + (E[Y|X] − Y))²
              = (g(X) − E[Y|X])² + (E[Y|X] − Y)² + 2(g(X) − E[Y|X])(E[Y|X] − Y)
◮ Now we can take expectation on both sides.
◮ We first show that the expectation of the last term on the RHS above is zero.
◮ First consider the last term
  E[(g(X) − E[Y|X])(E[Y|X] − Y)]
  = E[E[(g(X) − E[Y|X])(E[Y|X] − Y) | X]]
    because E[Z] = E[E[Z|X]]
  = E[(g(X) − E[Y|X]) E[(E[Y|X] − Y) | X]]
    because E[h1(X)h2(Z)|X] = h1(X) E[h2(Z)|X]
  = E[(g(X) − E[Y|X]) (E[Y|X] − E[Y|X])]
  = 0

◮ We earlier got
  (g(X) − Y)² = (g(X) − E[Y|X])² + (E[Y|X] − Y)² + 2(g(X) − E[Y|X])(E[Y|X] − Y)
◮ Hence we get
  E[(g(X) − Y)²] = E[(g(X) − E[Y|X])²] + E[(E[Y|X] − Y)²]
                 ≥ E[(E[Y|X] − Y)²]
◮ Since the above is true for all functions g, we get
  g*(X) = E[Y|X]
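A sketch illustrating this result (not from the slides): for Y = X² + noise with X ∼ N(0, 1), the best linear predictor is just the constant EY (since Cov(X, X²) = 0), while the conditional mean g*(X) = E[Y|X] = X² achieves a much smaller MSE.

```python
import random

random.seed(9)
n = 100000
data = [(x, x * x + random.gauss(0, 0.1))
        for x in (random.gauss(0, 1) for _ in range(n))]
ey = sum(y for _, y in data) / n
# best linear predictor here is the constant EY (Cov(X, X^2) = 0)
mse_linear = sum((y - ey) ** 2 for _, y in data) / n
# g*(X) = E[Y|X] = X^2
mse_cond = sum((y - x * x) ** 2 for x, y in data) / n
assert mse_cond < mse_linear
```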

Sum of random number of random variables
◮ Let X1, X2, · · · be iid rv on the same probability space. Suppose EXi = µ, ∀i.
◮ Let N be a positive integer valued rv that is independent of all Xi.
◮ Let S = Σ_{i=1}^N Xi.
◮ We want to calculate ES. We can use
  E[S] = E[E[S|N]]

◮ We have
  E[S|N = n] = E[Σ_{i=1}^N Xi | N = n]
             = E[Σ_{i=1}^n Xi | N = n]
               since E[h(X, Y)|Y = y] = E[h(X, y)|Y = y]
             = Σ_{i=1}^n E[Xi | N = n] = Σ_{i=1}^n E[Xi] = nµ
◮ Hence we get
  E[S|N] = Nµ ⇒ E[S] = E[N]E[X1]
◮ Actually, we did not use independence of the Xi.
Recap: Conditional Expectation
◮ The conditional expectation of h(X) conditioned on Y is defined by
  E[h(X)|Y = y] = Σ_x h(x) fX|Y(x|y), X, Y are discrete
  E[h(X)|Y = y] = ∫_{−∞}^{∞} h(x) fX|Y(x|y) dx, X, Y have joint density
◮ The conditional expectation of h(X) conditioned on Y is a function of Y: E[h(X)|Y] = g(Y);
  the above specify the value of g(y).
◮ We define E[h(X, Y)|Y] also as above:
  E[h(X, Y)|Y = y] = ∫_{−∞}^{∞} h(x, y) fX|Y(x|y) dx
◮ If X, Y are independent, E[h(X)|Y] = E[h(X)]

Recap: Properties of Conditional Expectation
◮ It has all the properties of expectation:
  ◮ E[a|Y] = a where a is a constant
  ◮ E[ah1(X) + bh2(X)|Y] = aE[h1(X)|Y] + bE[h2(X)|Y]
  ◮ h1(X) ≥ h2(X) ⇒ E[h1(X)|Y] ≥ E[h2(X)|Y]
◮ Conditional expectation also has some extra properties which are very important:
  ◮ E[E[h(X)|Y]] = E[h(X)]
  ◮ E[h1(X)h2(Y)|Y] = h2(Y)E[h1(X)|Y]
  ◮ E[h(X, Y)|Y = y] = E[h(X, y)|Y = y]

◮ The property of conditional expectation
  E[E[X|Y]] = E[X]
  is very useful in calculating expectations:
  EX = Σ_y E[X|Y = y] fY(y)  or  ∫ E[X|Y = y] fY(y) dy
  We saw many examples.
◮ Can be used to calculate probabilities of events too:
  P(A) = E[IA] = E[E[IA|Y]]

Sum of random number of random variables
◮ Let X1, X2, · · · be iid rv on the same probability space. Suppose EXi = µ, ∀i.
◮ Let N be a positive integer valued rv that is independent of all Xi.
◮ Let S = Σ_{i=1}^N Xi.
◮ We want to calculate ES. We can use
  E[S] = E[E[S|N]]
◮ We have
  E[S|N = n] = E[Σ_{i=1}^N Xi | N = n]
             = E[Σ_{i=1}^n Xi | N = n]
               since E[h(X, Y)|Y = y] = E[h(X, y)|Y = y]
             = Σ_{i=1}^n E[Xi | N = n] = Σ_{i=1}^n E[Xi] = nµ
◮ Hence we get
  E[S|N] = Nµ ⇒ E[S] = E[N]E[X1]
◮ Actually, we did not use independence of the Xi.

Variance of random sum
◮ S = Σ_{i=1}^N Xi, Xi iid, independent of N. We want Var(S)
  E[S²] = E[(Σ_{i=1}^N Xi)²] = E[E[(Σ_{i=1}^N Xi)² | N]]
◮ As earlier, we have
  E[(Σ_{i=1}^N Xi)² | N = n] = E[(Σ_{i=1}^n Xi)² | N = n] = E[(Σ_{i=1}^n Xi)²]

◮ Let Y = Σ_{i=1}^n Xi, Xi iid
◮ Then, Var(Y) = n Var(X1)
◮ Hence we have
  E[Y²] = Var(Y) + (EY)² = n Var(X1) + (nEX1)²
◮ Using this,
  E[(Σ_{i=1}^N Xi)² | N = n] = E[(Σ_{i=1}^n Xi)²] = n Var(X1) + (nEX1)²
◮ Hence
  E[(Σ_{i=1}^N Xi)² | N] = N Var(X1) + N²(EX1)²

◮ S = Σ_{i=1}^N Xi (Xi iid). We got
  E[S²] = E[E[S²|N]] = EN Var(X1) + E[N²](EX1)²
◮ Now we can calculate the variance of S as
  Var(S) = E[S²] − (ES)²
         = EN Var(X1) + E[N²](EX1)² − (EN EX1)²
         = EN Var(X1) + (EX1)²(E[N²] − (EN)²)
         = EN Var(X1) + Var(N)(EX1)²
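A simulation sketch of the two random-sum formulas (not from the slides), with Xi ∼ Exp(1) and N geometric on {1, 2, . . . } with p = 0.5, independent of the Xi; then ES = EN EX1 = 2 and Var(S) = EN Var(X1) + Var(N)(EX1)² = 2 + 2 = 4.

```python
import random

random.seed(10)
p = 0.5                    # N geometric: EN = 1/p = 2, Var(N) = (1-p)/p^2 = 2
n_trials = 200000
sums = []
for _ in range(n_trials):
    n = 1
    while random.random() >= p:     # draw N
        n += 1
    sums.append(sum(random.expovariate(1.0) for _ in range(n)))
mean_s = sum(sums) / n_trials
var_s = sum((s - mean_s) ** 2 for s in sums) / n_trials
assert abs(mean_s - 2) < 0.05      # EN * EX1
assert abs(var_s - 4) < 0.2        # EN*Var(X1) + Var(N)*(EX1)^2
```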
Wald’s formula
◮ We considered S = Σ_{i=1}^N Xi with N independent of all Xi.
◮ With iid Xi, the formula ES = EN EX1 is valid even under some dependence between N and the Xi.
◮ Here is one version of Wald’s formula. We assume
  1. E[|Xi|] < ∞, ∀i and EN < ∞.
  2. E[Xn I[N≥n]] = E[Xn] P[N ≥ n], ∀n
  Let SN = Σ_{i=1}^N Xi and let TN = Σ_{i=1}^N E[Xi].
◮ Then, ESN = ETN. If E[Xi] is the same for all i, ESN = EX1 EN.
◮ Assume the Xi are iid. Suppose the event [N ≤ n − 1] depends only on X1, · · · , Xn−1.
◮ Then the event [N ≤ n − 1] and hence its complement [N ≥ n] is independent of Xn and the assumption above is satisfied.
◮ Such an N is an example of what is called a stopping time.

Another Example
◮ We toss a (biased) coin till we get k consecutive heads. Let Nk denote the number of tosses needed.
◮ N1 would be geometric.
◮ We want E[Nk]. What rv should we condition on?
◮ A useful rv here is Nk−1:
  E[Nk | Nk−1 = n] = (n + 1)p + (1 − p)(n + 1 + E[Nk])
◮ Thus we get the recurrence relation
  E[Nk] = E[E[Nk | Nk−1]]
        = E[(Nk−1 + 1)p + (1 − p)(Nk−1 + 1 + E[Nk])]

◮ We have
  E[Nk] = E[(Nk−1 + 1)p + (1 − p)(Nk−1 + 1 + E[Nk])]
◮ Denoting Mk = E[Nk], we get
  Mk = pMk−1 + p + (1 − p)Mk−1 + (1 − p) + (1 − p)Mk
  pMk = Mk−1 + 1
  Mk = (1/p)Mk−1 + 1/p
     = (1/p)((1/p)Mk−2 + 1/p) + 1/p = (1/p)²Mk−2 + (1/p)² + 1/p
     = (1/p)^{k−1} M1 + Σ_{j=1}^{k−1} (1/p)^j = Σ_{j=1}^{k} (1/p)^j  (since M1 = 1/p)
     = (1/p)((1/p)^k − 1)/((1/p) − 1) = (1 − p^k)/((1 − p)p^k)

◮ As mentioned earlier, we can use the conditional expectation to calculate probabilities of events also.
  P(A) = E[IA] = E[E[IA|Y]]
  E[IA|Y = y] = P[IA = 1|Y = y] = P(A|Y = y)
◮ Thus, we get
  P(A) = E[IA] = E[E[IA|Y]]
       = Σ_y P(A|Y = y) P[Y = y], when Y is discrete
       = ∫ P(A|Y = y) fY(y) dy, when Y is continuous
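A simulation sketch of the closed form E[Nk] = (1 − p^k)/((1 − p)p^k) (not from the slides); for a fair coin and k = 3 this gives 14.

```python
import random

random.seed(11)

def tosses_until_run(k, p):
    run = count = 0
    while run < k:
        count += 1
        run = run + 1 if random.random() < p else 0   # extend or reset the head run
    return count

p, k = 0.5, 3
exact = (1 - p ** k) / ((1 - p) * p ** k)   # = 14 for p = 0.5, k = 3
trials = 100000
avg = sum(tosses_until_run(k, p) for _ in range(trials)) / trials
assert abs(avg - exact) < 0.3
```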
Example
◮ Let X, Y be independent continuous rv
◮ We want to calculate P[X ≤ Y]
◮ We can calculate it by integrating the joint density over A = {(x, y) : x ≤ y}
  P[X ≤ Y] = ∫∫_A fX(x) fY(y) dx dy
           = ∫_{−∞}^{∞} fY(y) (∫_{−∞}^y fX(x) dx) dy
           = ∫_{−∞}^{∞} FX(y) fY(y) dy

◮ We can also use the conditional expectation method here:
  P[X ≤ Y] = ∫_{−∞}^{∞} P[X ≤ Y | Y = y] fY(y) dy
           = ∫_{−∞}^{∞} P[X ≤ y | Y = y] fY(y) dy
           = ∫_{−∞}^{∞} P[X ≤ y] fY(y) dy
           = ∫_{−∞}^{∞} FX(y) fY(y) dy
◮ If X, Y are iid then P[X < Y] = 0.5

◮ Consider a sequence of Bernoulli trials where p, the probability of success, is random.
◮ We first choose p uniformly over (0, 1) and then perform n tosses.
◮ Let X be the number of heads.
◮ Conditioned on knowledge of p, we know the distribution of X:

    P[X = k | p] = nCk p^k (1 − p)^{n−k}

◮ Now we can calculate P[X = k] using the conditioning argument.
◮ Assuming p is chosen uniformly from (0, 1), we get

    P[X = k] = ∫ P[X = k | p] f(p) dp
             = ∫_0^1 nCk p^k (1 − p)^{n−k} · 1 dp
             = nCk · k!(n − k)!/(n + 1)!
               (because ∫_0^1 p^k (1 − p)^{n−k} dp = Γ(k+1)Γ(n−k+1)/Γ(n+2))
             = 1/(n + 1)

◮ So, we get: P[X = k] = 1/(n+1), k = 0, 1, · · · , n
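The Beta-integral computation above can be verified exactly in rational arithmetic (a sketch of my own, not part of the slides): nCk · k!(n−k)!/(n+1)! collapses to 1/(n+1) for every k.

```python
from fractions import Fraction
from math import comb, factorial

def p_X_eq_k(n, k):
    """Exact P[X = k] = C(n,k) * k!(n-k)!/(n+1)!  (the Beta integral result)."""
    return Fraction(comb(n, k)) * Fraction(factorial(k) * factorial(n - k),
                                           factorial(n + 1))

n = 10
probs = [p_X_eq_k(n, k) for k in range(n + 1)]
print(probs)   # every entry is 1/11
```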
Gaussian or Normal distribution
◮ The Gaussian or normal density is given by

    f(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)},  −∞ < x < ∞

◮ If X has this density, we denote it as X ∼ N(µ, σ²).
  We showed EX = µ and Var(X) = σ².
◮ The density is a 'bell-shaped' curve.

◮ Standard Normal rv — X ∼ N(0, 1)
◮ The distribution function of the standard normal is

    Φ(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−t²/2} dt

◮ Suppose X ∼ N(µ, σ²). Then

    P[a ≤ X ≤ b] = ∫_a^b (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} dx

    take y = (x − µ)/σ ⇒ dy = (1/σ) dx

                 = ∫_{(a−µ)/σ}^{(b−µ)/σ} (1/√(2π)) e^{−y²/2} dy
                 = Φ((b−µ)/σ) − Φ((a−µ)/σ)

◮ We can express probabilities of events involving all normal rvs using Φ.
◮ X ∼ N(0, 1). Then its mgf is

    MX(t) = E[e^{tX}] = ∫_{−∞}^{∞} e^{tx} (1/√(2π)) e^{−x²/2} dx
          = (1/√(2π)) ∫_{−∞}^{∞} e^{−(x² − 2tx)/2} dx
          = (1/√(2π)) ∫_{−∞}^{∞} e^{−((x−t)² − t²)/2} dx
          = e^{t²/2} (1/√(2π)) ∫_{−∞}^{∞} e^{−(x−t)²/2} dx
          = e^{t²/2}

◮ Now let Y = σX + µ. Then Y ∼ N(µ, σ²). The mgf of Y is

    MY(t) = E[e^{t(σX+µ)}] = e^{tµ} E[e^{(tσ)X}] = e^{tµ} MX(tσ) = e^{µt + σ²t²/2}
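Φ has no closed form, but it can be computed from the error function: Φ(x) = (1 + erf(x/√2))/2. The sketch below (my own, not from the slides) uses this together with the substitution derived above to compute P[a ≤ X ≤ b] for any N(µ, σ²):

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal cdf: Phi(x) = (1 + erf(x/sqrt(2)))/2."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def prob_interval(a, b, mu=0.0, sigma=1.0):
    """P[a <= X <= b] for X ~ N(mu, sigma^2), via Phi((b-mu)/sigma) - Phi((a-mu)/sigma)."""
    return Phi((b - mu) / sigma) - Phi((a - mu) / sigma)

print(prob_interval(-1.96, 1.96))            # about 0.95 for N(0,1)
print(prob_interval(1.0, 5.0, mu=3.0, sigma=2.0))   # = Phi(1) - Phi(-1), about 0.683
```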
Multi-dimensional Gaussian Distribution
◮ The n-dimensional Gaussian density is given by

    fX(x) = (1 / ((2π)^{n/2} |Σ|^{1/2})) e^{−(1/2)(x−µ)ᵀ Σ⁻¹ (x−µ)},  x ∈ ℜⁿ

◮ µ ∈ ℜⁿ and Σ ∈ ℜ^{n×n} are parameters of the density; Σ is symmetric and positive definite.
◮ If X1, · · · , Xn have the above joint density, they are said to be jointly Gaussian.
◮ We denote this by X ∼ N(µ, Σ).
◮ We will now show that this is a joint density function.
◮ We begin by showing the following is a density (when M is symmetric positive definite):

    fY(y) = C e^{−(1/2) yᵀMy}

    Let I = ∫_{ℜⁿ} C e^{−(1/2) yᵀMy} dy

◮ Since M is real symmetric, there exists an orthogonal transform L with L⁻¹ = Lᵀ, |L| = 1, such that LᵀML is diagonal.
◮ Let LᵀML = diag(m1, · · · , mn).
◮ Then for any z ∈ ℜⁿ,

    zᵀLᵀMLz = Σ_i mi zi²

◮ We now get

    I = C ∫_{ℜⁿ} e^{−(1/2) yᵀMy} dy

    change variable: z = L⁻¹y = Lᵀy ⇒ y = Lz

      = C ∫_{ℜⁿ} e^{−(1/2) zᵀLᵀMLz} dz        (note that |L| = 1)
      = C ∫_{ℜⁿ} e^{−(1/2) Σ_i mi zi²} dz
      = C Π_{i=1}^{n} ∫_ℜ e^{−(1/2) mi zi²} dzi
      = C Π_{i=1}^{n} √(2π/mi)

◮ We will first relate m1 · · · mn to the matrix M.
◮ By definition, LᵀML = diag(m1, · · · , mn). Hence

    diag(1/m1, · · · , 1/mn) = (LᵀML)⁻¹ = L⁻¹M⁻¹(Lᵀ)⁻¹ = LᵀM⁻¹L

◮ Since |L| = 1, we get

    |LᵀM⁻¹L| = |M⁻¹| = 1/(m1 · · · mn)

◮ Putting all this together

    ∫_{ℜⁿ} C e^{−(1/2) yᵀMy} dy = C Π_{i=1}^{n} √(2π/mi) = C (2π)^{n/2} |M⁻¹|^{1/2}

    ⇒ (1 / ((2π)^{n/2} |M⁻¹|^{1/2})) ∫_{ℜⁿ} e^{−(1/2) yᵀMy} dy = 1
◮ We showed the following is a density (taking M⁻¹ = Σ):

    fY(y) = (1 / ((2π)^{n/2} |Σ|^{1/2})) e^{−(1/2) yᵀΣ⁻¹y},  y ∈ ℜⁿ

◮ Let X = Y + µ. Then

    fX(x) = fY(x − µ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) e^{−(1/2)(x−µ)ᵀΣ⁻¹(x−µ)}

◮ This is the multidimensional Gaussian distribution.
◮ Consider Y with the joint density above.
◮ As earlier let M = Σ⁻¹ and let LᵀML = diag(m1, · · · , mn).
◮ Define Z = (Z1, · · · , Zn)ᵀ = LᵀY. Then Y = LZ.
◮ Recall |L| = 1, |M⁻¹| = (m1 · · · mn)⁻¹.
◮ Then the density of Z is

    fZ(z) = (1 / ((2π)^{n/2} |M⁻¹|^{1/2})) e^{−(1/2) zᵀLᵀMLz}
          = (1 / ((2π)^{n/2} (1/(m1 · · · mn))^{1/2})) e^{−(1/2) Σ_i mi zi²}
          = Π_{i=1}^{n} (1/√(2π(1/mi))) e^{−(1/2) zi²/(1/mi)}

◮ This shows that Zi ∼ N(0, 1/mi) and the Zi are independent.

◮ If Y has density fY and Z = LᵀY then Zi ∼ N(0, 1/mi) and the Zi are independent. Hence

    ΣZ = diag(1/m1, · · · , 1/mn) = LᵀM⁻¹L

◮ Also, since EZi = 0, ΣZ = E[ZZᵀ].
◮ Since Y = LZ, E[Y] = 0 and

    ΣY = E[YYᵀ] = E[LZZᵀLᵀ] = L E[ZZᵀ] Lᵀ = L(LᵀM⁻¹L)Lᵀ = M⁻¹

◮ Thus, if Y has density

    fY(y) = (1 / ((2π)^{n/2} |Σ|^{1/2})) e^{−(1/2) yᵀΣ⁻¹y},  y ∈ ℜⁿ

  then EY = 0 and ΣY = M⁻¹ = Σ.
◮ Let X = Y + µ. Then

    fX(x) = (1 / ((2π)^{n/2} |Σ|^{1/2})) e^{−(1/2)(x−µ)ᵀΣ⁻¹(x−µ)}

◮ We have

    EX = E[Y + µ] = µ
    ΣX = E[(X − µ)(X − µ)ᵀ] = E[YYᵀ] = Σ
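The claim EX = µ, ΣX = Σ can be checked by sampling (a sketch of my own, not from the slides; the particular µ and Σ below are arbitrary illustrations):

```python
import numpy as np

# Hypothetical parameters: any mean vector and symmetric positive definite Sigma.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mu, Sigma, size=200000)

print(X.mean(axis=0))   # close to mu
print(np.cov(X.T))      # close to Sigma
```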
Multi-dimensional Gaussian density
◮ X = (X1, · · · , Xn)ᵀ are said to be jointly Gaussian if

    fX(x) = (1 / ((2π)^{n/2} |Σ|^{1/2})) e^{−(1/2)(x−µ)ᵀΣ⁻¹(x−µ)}

◮ EX = µ and ΣX = Σ.
◮ Suppose Cov(Xi, Xj) = 0, ∀i ≠ j.
◮ Then Σij = 0, ∀i ≠ j. Let Σ = diag(σ1², · · · , σn²). Then

    fX(x) = (1 / ((2π)^{n/2} σ1 · · · σn)) e^{−(1/2) Σ_{i=1}^{n} ((xi−µi)/σi)²}
          = Π_{i=1}^{n} (1/(σi√(2π))) e^{−(1/2)((xi−µi)/σi)²}

◮ This implies the Xi are independent.
◮ If X1, · · · , Xn are jointly Gaussian then uncorrelatedness implies independence.
◮ Let X = (X1, · · · , Xn)ᵀ be jointly Gaussian. Let Y = X − µ.
◮ Let M = Σ⁻¹ and let L be such that LᵀML = diag(m1, · · · , mn).
◮ Let Z = (Z1, · · · , Zn)ᵀ = LᵀY.
  Then we saw that Zi ∼ N(0, 1/mi) and the Zi are independent.
◮ If X1, · · · , Xn are jointly Gaussian then there is a 'linear' transform that transforms them into independent random variables.
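The decorrelating transform Z = Lᵀ(X − µ) can be demonstrated numerically. The slides diagonalize M = Σ⁻¹; equivalently one can eigendecompose Σ itself (its eigenvalues are the 1/mi). A sketch of my own, with an arbitrary illustrative Σ:

```python
import numpy as np

# Hypothetical symmetric positive definite covariance matrix.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Columns of L are orthonormal eigenvectors of Sigma, so L^T Sigma L is diagonal
# and Z = L^T X has uncorrelated (hence independent) Gaussian coordinates.
eigvals, L = np.linalg.eigh(Sigma)

rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(2), Sigma, size=200000)
Z = X @ L                 # each row is (L^T x)^T

C = np.cov(Z.T)
print(np.round(C, 3))     # approximately diag(eigvals): off-diagonals near 0
```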

Moment generating function
◮ Let X = (X1, · · · , Xn)ᵀ be jointly Gaussian.
◮ Let Y = X − µ and Z = (Z1, · · · , Zn)ᵀ = LᵀY as earlier.
◮ The moment generating function of X is given by

    MX(s) = E[e^{sᵀX}]
          = E[e^{sᵀ(Y+µ)}] = e^{sᵀµ} E[e^{sᵀY}]
          = e^{sᵀµ} E[e^{sᵀLZ}]
          = e^{sᵀµ} E[e^{uᵀZ}],  where u = Lᵀs
          = e^{sᵀµ} MZ(u)

◮ Since the Zi are independent, it is easy to get MZ.
◮ We know Zi ∼ N(0, 1/mi). Hence

    MZi(ui) = e^{(1/2)(1/mi)ui²} = e^{ui²/(2mi)}

    MZ(u) = E[e^{uᵀZ}] = Π_{i=1}^{n} E[e^{ui Zi}] = Π_{i=1}^{n} e^{ui²/(2mi)} = e^{Σ_i ui²/(2mi)}

◮ We derived earlier

    MX(s) = e^{sᵀµ} MZ(u),  where u = Lᵀs
◮ We got

    MX(s) = e^{sᵀµ} MZ(u);  u = Lᵀs;  MZ(u) = e^{Σ_i ui²/(2mi)}

◮ Earlier we have shown LᵀM⁻¹L = diag(1/m1, · · · , 1/mn) where M⁻¹ = Σ. Now we get

    (1/2) Σ_i ui²/mi = (1/2) uᵀ(LᵀM⁻¹L)u = (1/2) sᵀM⁻¹s = (1/2) sᵀΣs

◮ Hence we get

    MX(s) = e^{sᵀµ + (1/2) sᵀΣs}

◮ This is the moment generating function of the multi-dimensional normal density.
◮ Let X, Y be jointly Gaussian. For simplicity let EX = EY = 0.
◮ Let Var(X) = σx², Var(Y) = σy² and ρXY = ρ.
  ⇒ Cov(X, Y) = ρσxσy.
◮ Now the covariance matrix and its inverse are given by

    Σ = [ σx²     ρσxσy ]        Σ⁻¹ = (1/(σx²σy²(1−ρ²))) [ σy²     −ρσxσy ]
        [ ρσxσy   σy²   ]                                  [ −ρσxσy  σx²    ]

◮ The joint density of X, Y is given by

    fXY(x, y) = (1/(2πσxσy√(1−ρ²))) exp{ −(1/(2(1−ρ²))) ( x²/σx² + y²/σy² − 2ρxy/(σxσy) ) }

◮ This is the bivariate Gaussian density.

◮ Suppose X, Y are jointly Gaussian (with the density above).
◮ Then all the marginals and conditionals would be Gaussian.
◮ X ∼ N(0, σx²), and Y ∼ N(0, σy²).
◮ fX|Y(x|y) would be a Gaussian density with mean yρσx/σy and variance σx²(1 − ρ²).
◮ Exercise for you — show all this starting with the joint density we have.
◮ Note that X, Y being individually Gaussian does not mean they are jointly Gaussian (unless they are independent).
◮ The multi-dimensional Gaussian density has some important properties.
◮ If X1, · · · , Xn are jointly Gaussian then they are independent if they are uncorrelated.
◮ Suppose X1, · · · , Xn are jointly Gaussian and have zero means. Then there is an orthogonal transform Y = AX such that Y1, · · · , Yn are jointly Gaussian and independent.
◮ X1, · · · , Xn are jointly Gaussian if and only if tᵀX is Gaussian for all non-zero t ∈ ℜⁿ.
◮ We will prove this using moment generating functions.
Recap: Multi-dimensional Gaussian density
◮ X = (X1, · · · , Xn)ᵀ are said to be jointly Gaussian if

    fX(x) = (1 / ((2π)^{n/2} |Σ|^{1/2})) e^{−(1/2)(x−µ)ᵀΣ⁻¹(x−µ)}

◮ EX = µ and ΣX = Σ.
◮ The moment generating function is given by

    MX(s) = e^{sᵀµ + (1/2) sᵀΣs}

◮ When X, Y are jointly Gaussian, the joint density is given by

    fXY(x, y) = (1/(2πσxσy√(1−ρ²))) exp{ −(1/(2(1−ρ²))) ( (x−µx)²/σx² + (y−µy)²/σy² − 2ρ(x−µx)(y−µy)/(σxσy) ) }

◮ The multi-dimensional Gaussian density has some important properties.
◮ If X1, · · · , Xn are jointly Gaussian then they are independent if they are uncorrelated.
◮ Suppose X1, · · · , Xn are jointly Gaussian and have zero means. Then there is an orthogonal transform Y = AX such that Y1, · · · , Yn are jointly Gaussian and independent.
◮ X1, · · · , Xn are jointly Gaussian if and only if tᵀX is Gaussian for all non-zero t ∈ ℜⁿ.
◮ We will prove this using moment generating functions.

◮ Suppose X = (X1, · · · , Xn)ᵀ is jointly Gaussian and let W = tᵀX.
◮ Let µX and ΣX denote the mean vector and covariance matrix of X. Then

    µw ≜ EW = tᵀµX;    σw² ≜ Var(W) = tᵀΣXt

◮ The mgf of W is given by

    MW(u) = E[e^{uW}] = E[e^{u tᵀX}]
          = MX(ut) = e^{u tᵀµX + (1/2) u² tᵀΣXt}
          = e^{uµw + (1/2) u²σw²}

  showing that W is Gaussian.
◮ This shows the density of Xi is Gaussian for each i. For example, if we take t = (1, 0, 0, · · · , 0)ᵀ then W above would be X1.
◮ Now suppose W = tᵀX is Gaussian for all t. Then

    MW(u) = e^{uµw + (1/2) u²σw²} = e^{u tᵀµX + (1/2) u² tᵀΣXt}

◮ This implies

    E[e^{u tᵀX}] = e^{u tᵀµX + (1/2) u² tᵀΣXt},  ∀u ∈ ℜ, ∀t ∈ ℜⁿ, t ≠ 0
    ⇒ E[e^{tᵀX}] = e^{tᵀµX + (1/2) tᵀΣXt},  ∀t

◮ This implies X is jointly Gaussian.
◮ This is a defining property of the multidimensional Gaussian density.
◮ Let X = (X1, · · · , Xn)ᵀ be jointly Gaussian.
◮ Let A be a k × n matrix with rank k.
◮ Then Y = AX is jointly Gaussian.
◮ We will once again show this using the moment generating function.
◮ Let µx and Σx denote the mean vector and covariance matrix of X. Similarly µy and Σy for Y.
◮ We have µy = Aµx and

    Σy = E[(Y − µy)(Y − µy)ᵀ]
       = E[(A(X − µx))(A(X − µx))ᵀ]
       = E[A(X − µx)(X − µx)ᵀAᵀ]
       = A E[(X − µx)(X − µx)ᵀ] Aᵀ = AΣxAᵀ

◮ The mgf of Y is

    MY(s) = E[e^{sᵀY}]        (s ∈ ℜᵏ)
          = E[e^{sᵀAX}]
          = MX(Aᵀs)           (Recall MX(t) = e^{tᵀµx + (1/2) tᵀΣxt})
          = e^{sᵀAµx + (1/2) sᵀAΣxAᵀs}
          = e^{sᵀµy + (1/2) sᵀΣys}

◮ This shows Y is jointly Gaussian.
◮ This shows all marginals of X are Gaussian.
◮ For example, if you take A to be

    A = [ 1 0 0 · · · 0 ]
        [ 0 1 0 · · · 0 ]

  then Y = (X1, X2)ᵀ.
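The relations µy = Aµx and Σy = AΣxAᵀ can be checked by propagating samples through A (a sketch of my own, not from the slides; the particular µ, Σ and A below are arbitrary illustrations):

```python
import numpy as np

# Hypothetical example: X ~ N(mu, Sigma) in R^3, A a 2x3 matrix of rank 2.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 2.0, 0.5],
                  [0.0, 0.5, 1.5]])
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0]])

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mu, Sigma, size=200000)
Y = X @ A.T               # each row is (A x)^T

print(Y.mean(axis=0))     # close to A mu
print(np.cov(Y.T))        # close to A Sigma A^T
```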
Jensen's Inequality
◮ Let g : ℜ → ℜ be a convex function. Then

    g(EX) ≤ E[g(X)]

◮ For example, (EX)² ≤ E[X²].
◮ Function g is convex if

    g(αx + (1−α)y) ≤ αg(x) + (1−α)g(y),  ∀x, y, ∀0 ≤ α ≤ 1

◮ If g is convex, then, given any x0, there exists λ(x0) such that

    g(x) ≥ g(x0) + λ(x0)(x − x0),  ∀x

Jensen's Inequality: Proof
◮ We have g(x) ≥ g(x0) + λ(x0)(x − x0), ∀x.
◮ Take x0 = EX and x = X(ω). Then

    g(X(ω)) ≥ g(EX) + λ(EX)(X(ω) − EX),  ∀ω

◮ Y(ω) ≥ Z(ω), ∀ω ⇒ Y ≥ Z ⇒ EY ≥ EZ.
◮ Hence we get

    g(X) ≥ g(EX) + λ(EX)(X − EX)
    ⇒ E[g(X)] ≥ g(EX) + λ(EX) E[X − EX] = g(EX)

◮ This completes the proof.
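A tiny numerical illustration (my own, not from the slides): for any finite sample, viewed as a discrete distribution, g(mean) ≤ mean of g for convex g such as x² and e^x.

```python
import math

# Arbitrary sample playing the role of the distribution of X.
xs = [0.5, 1.0, 2.0, 3.5, 4.0]
mean = sum(xs) / len(xs)

for g in (lambda x: x * x, math.exp):
    # Jensen: g(EX) <= E[g(X)]
    assert g(mean) <= sum(g(x) for x in xs) / len(xs)

print(mean**2, sum(x * x for x in xs) / len(xs))   # (EX)^2 vs E[X^2]
```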

◮ Consider the set of all mean-zero random variables.
◮ It is closed under addition and scalar (real number) multiplication.
◮ Cov(X, Y) = E[XY] satisfies
  1. Cov(X, Y) = Cov(Y, X)
  2. Cov(X, X) = Var(X) ≥ 0 and is zero only if X = 0
  3. Cov(aX, Y) = aCov(X, Y)
  4. Cov(X1 + X2, Y) = Cov(X1, Y) + Cov(X2, Y)
◮ Thus Cov(X, Y) is an inner product here.
◮ The Cauchy-Schwarz inequality (|xᵀy| ≤ ||x|| ||y||) gives

    |Cov(X, Y)| ≤ √(Cov(X, X)) √(Cov(Y, Y)) = √(Var(X)) √(Var(Y))

◮ This is the same as |ρXY| ≤ 1.
◮ A generalization of the Cauchy-Schwarz inequality is the Holder inequality.

Holder Inequality
◮ For all p, q with p, q > 1 and 1/p + 1/q = 1,

    E[|XY|] ≤ (E|X|^p)^{1/p} (E|Y|^q)^{1/q}

  (We assume all the expectations are finite.)
◮ If we take p = q = 2,

    E[|XY|] ≤ √(E[X²]) √(E[Y²])

◮ This is the same as the Cauchy-Schwarz inequality. We once again get

    Cov(X, Y) = E[(X − EX)(Y − EY)]
             ≤ E|(X − EX)(Y − EY)|
             ≤ √(E[(X − EX)²]) √(E[(Y − EY)²])
             = √(Var(X)) √(Var(Y))
Proof
◮ First we will show, for p, q > 1 and 1/p + 1/q = 1,

    |xy| ≤ |x|^p/p + |y|^q/q,  ∀x, y ∈ ℜ

◮ For x > 0, g(x) = −log(x) is convex because g″(x) = 1/x² ≥ 0, ∀x.
◮ Hence, for all x1, x2 > 0 and 0 ≤ t ≤ 1,

    −log(tx1 + (1−t)x2) ≤ −t log(x1) − (1−t) log(x2)
    ⇒ log(tx1 + (1−t)x2) ≥ log( x1^t x2^{1−t} )
    ⇒ tx1 + (1−t)x2 ≥ x1^t x2^{1−t}

◮ Take x1 = |x|^p, x2 = |y|^q, t = 1/p (and hence 1 − t = 1/q):

    (|x|^p)^{1/p} (|y|^q)^{1/q} ≤ (1/p)|x|^p + (1/q)|y|^q
    ⇒ |xy| ≤ |x|^p/p + |y|^q/q,  ∀x, y

◮ We have

    |xy| ≤ |x|^p/p + |y|^q/q,  ∀x, y

◮ Take x = X(ω)(E|X|^p)^{−1/p} and y = Y(ω)(E|Y|^q)^{−1/q}:

    |X(ω)Y(ω)| / ((E|X|^p)^{1/p} (E|Y|^q)^{1/q}) ≤ |X(ω)|^p (E|X|^p)^{−1}/p + |Y(ω)|^q (E|Y|^q)^{−1}/q

    ⇒ |XY| / ((E|X|^p)^{1/p} (E|Y|^q)^{1/q}) ≤ |X|^p (E|X|^p)^{−1}/p + |Y|^q (E|Y|^q)^{−1}/q

  Taking expectations,

    E|XY| / ((E|X|^p)^{1/p} (E|Y|^q)^{1/q}) ≤ 1/p + 1/q = 1
    ⇒ E|XY| ≤ (E|X|^p)^{1/p} (E|Y|^q)^{1/q}

◮ Jensen's Inequality: If g is convex and EX and E[g(X)] exist,

    g(EX) ≤ E[g(X)]

◮ Holder Inequality: For p, q > 1 and 1/p + 1/q = 1,

    E|XY| ≤ (E|X|^p)^{1/p} (E|Y|^q)^{1/q}

  (assuming all expectations exist)
◮ For p = q = 2, the above is the Cauchy-Schwarz inequality.
◮ This implies |ρXY| ≤ 1.
◮ Minkowski's Inequality: for r ≥ 1,

    (E|X + Y|^r)^{1/r} ≤ (E|X|^r)^{1/r} + (E|Y|^r)^{1/r}
Chernoff Bounds
◮ Recall the Markov inequality. If h is positive and strictly increasing,

    P[X > a] = P[h(X) > h(a)] ≤ E[h(X)]/h(a)

◮ Take h(x) = e^{sx}, s > 0. Then

    P[X > a] ≤ E[e^{sX}]/e^{sa} = MX(s)/e^{sa},  ∀s > 0

◮ The RHS is a function of s. We can get a tight bound by using a value of s which minimizes the RHS.

Hoeffding Inequality
◮ Often we need to deal with sums of iid random variables.
◮ Here is a simple version of an inequality very useful in such situations.
◮ Let Xi be iid with Xi ∈ [a, b], ∀i, and let EXi = µ. Then

    P[ |Σ_{i=1}^{n} Xi − nµ| ≥ ǫ ] ≤ 2e^{−2ǫ²/(n(b−a)²)},  ǫ > 0

◮ Note that, beyond the mean µ, we do not need knowledge of any moments of Xi (e.g., the variance) to calculate the bound.
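As a worked instance of optimizing the Chernoff bound (my own sketch, not from the slides): for X ∼ N(0, 1), MX(s) = e^{s²/2}, so the bound is e^{s²/2 − sa}, minimized at s = a, giving P[X > a] ≤ e^{−a²/2}.

```python
from math import erf, exp, sqrt

def normal_tail(a):
    """Exact P[X > a] for X ~ N(0,1), via the error function."""
    return 0.5 * (1.0 - erf(a / sqrt(2.0)))

def chernoff_bound(a):
    """min over s > 0 of e^{s^2/2 - s a}, attained at s = a: e^{-a^2/2}."""
    return exp(-a * a / 2.0)

for a in (1.0, 2.0, 3.0):
    print(a, normal_tail(a), chernoff_bound(a))   # tail <= bound in every row
```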

◮ Let X1, X2, · · · be iid random variables.
◮ Let EXi = µ and let Var(Xi) = σ².
◮ Define Sn = Σ_{i=1}^{n} Xi. Then

    ESn = Σ_{i=1}^{n} EXi = nµ;  and  Var(Sn) = Σ_{i=1}^{n} Var(Xi) = nσ²

◮ We are interested in Sn/n, the average of X1, · · · , Xn:

    E[Sn/n] = (1/n) ESn = µ,  ∀n
    Var(Sn/n) = (1/n²) Var(Sn) = nσ²/n² = σ²/n,  ∀n
Weak Law of large numbers
◮ Xi are iid, EXi = µ, Var(Xi) = σ², Sn = Σ_{i=1}^{n} Xi.

    E[Sn/n] = µ;  and  Var(Sn/n) = σ²/n

◮ As n becomes large, the variance of Sn/n becomes close to zero; Sn/n 'converges' to its expectation, µ, as n → ∞.
◮ By the Chebyshev inequality

    P[ |Sn/n − µ| ≥ ǫ ] ≤ Var(Sn/n)/ǫ² = σ²/(nǫ²),  ∀ǫ > 0

◮ Thus, we get

    lim_{n→∞} P[ |Sn/n − µ| ≥ ǫ ] = 0,  ∀ǫ > 0

◮ Known as the weak law of large numbers.
◮ Suppose we are tossing a (biased) coin repeatedly.
◮ Xi = 1 if the ith toss came up heads and is zero otherwise.
◮ EXi = p where p is the probability of heads. The variance of Xi is p(1 − p).
◮ Sn = Σ_{i=1}^{n} Xi is the number of heads in n tosses.
◮ Sn/n is the fraction of heads in n tosses.
◮ We are saying Sn/n 'converges' to p.
◮ The probability of heads is the limiting fraction of heads when you toss the coin infinitely many times:

    lim_{n→∞} P[ |Sn/n − p| ≥ ǫ ] = 0,  ∀ǫ > 0
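The coin-tossing reading of the weak law is easy to see numerically (a sketch of my own, not from the slides): the fraction of heads settles down to p as n grows.

```python
import random

def head_fraction(n, p, seed=0):
    """Fraction of heads in n independent tosses of a coin with P(head) = p."""
    rng = random.Random(seed)
    return sum(rng.random() < p for _ in range(n)) / n

p = 0.3
for n in (100, 10000, 1000000):
    print(n, head_fraction(n, p))   # approaches 0.3 as n grows
```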

◮ This is true of any event.
◮ Consider repeatedly performing a random experiment and let Xi be the indicator of event A on the ith repetition.
◮ Then EXi = P(A), ∀i, and Sn/n is the fraction of times the event A occurred.
◮ The fraction of times an event occurs 'converges' to its probability as you repeat the experiment infinitely many times.
◮ X is a random variable and we want to find EX.
◮ Make multiple independent observations of X. Call them X1, · · · , Xn. These are called samples of X. Sn = Σ_{i=1}^{n} Xi.
◮ Sn/n is the sample mean — the average of all samples.
◮ Sn/n has the same expectation as X but has much smaller variance.
◮ The sample mean 'converges' to the expectation (the 'population mean').
◮ This is the principle of sample surveys.
◮ In general, one can get an approximate value of the expectation of X through simulations/experiments.
◮ Known as Monte Carlo simulation.
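A minimal Monte Carlo sketch (my own, not from the slides): to approximate E[g(X)], average g over many independent samples of X. Here E[U²] for U uniform on (0, 1), whose exact value is 1/3, serves as a check.

```python
import random

def mc_mean(g, sample, n=100000, seed=42):
    """Monte Carlo estimate of E[g(X)]: average of g over n independent samples."""
    rng = random.Random(seed)
    return sum(g(sample(rng)) for _ in range(n)) / n

# Estimate E[U^2] for U ~ Uniform(0,1); the exact value is 1/3.
est = mc_mean(lambda x: x * x, lambda rng: rng.random())
print(est)
```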
◮ Xi are iid, EXi = µ, Var(Xi) = σ², Sn = Σ_{i=1}^{n} Xi.

    E[Sn/n] = µ;  and  Var(Sn/n) = σ²/n

◮ As n becomes large, the variance of Sn/n becomes close to zero.
◮ We would like to say Sn/n → µ.
◮ We need to properly define convergence of a sequence of random variables.
◮ One way of looking at this convergence is

    lim_{n→∞} P[ |Sn/n − µ| ≥ ǫ ] = 0,  ∀ǫ > 0

◮ There are other ways of defining convergence of random variables.
◮ Recall convergence of real number sequences. A sequence of real numbers xn is said to converge to x0, xn → x0, if

    ∀ǫ > 0, ∃N < ∞, s.t. |xn − x0| ≤ ǫ, ∀n ≥ N

◮ To show a sequence converges using this definition, we need to know (or guess) the limit.
◮ Convergent sequences of real numbers satisfy the Cauchy criterion:

    ∀ǫ > 0, ∃N < ∞, s.t. |xn − xm| ≤ ǫ, ∀n, m ≥ N

◮ Now consider defining a sequence of random variables Xn converging to X0.
◮ These are not numbers. They are, in fact, functions.
◮ We know that [|Xn − X0| ≤ ǫ] is an event. We can define convergence in terms of the probability of that event becoming 1.
◮ Or we can look at different notions of convergence of a sequence of functions to a function.
◮ Consider a sequence of functions gn mapping ℜ to ℜ.
◮ We can say gn → g0 if gn(x) → g0(x), ∀x. This is known as point-wise convergence.
◮ Or we can ask for ∫ |gn(x) − g0(x)|² dx → 0.
◮ There are multiple notions of convergence that are reasonable for a sequence of functions.
◮ Thus there would be multiple ways to define convergence of a sequence of random variables.

Convergence in Probability
◮ A sequence of random variables, Xn, is said to converge in probability to a random variable X0 if

    lim_{n→∞} P[ |Xn − X0| > ǫ ] = 0,  ∀ǫ > 0

◮ This is denoted as Xn →P X0.
◮ We would mostly be considering convergence to a constant.
◮ By the definition of limit, the above means

    ∀δ > 0, ∃N < ∞, s.t. P[ |Xn − X0| > ǫ ] < δ, ∀n > N

◮ We only need the marginal distributions of the individual Xn to decide whether a sequence converges to a constant in probability.
Example: Partial sums of iid random variables
◮ Xi are iid, EXi = µ, Var(Xi) = σ², Sn = Σ_{i=1}^{n} Xi.
◮ Then we saw

    P[ |Sn/n − µ| ≥ ǫ ] ≤ σ²/(nǫ²),  ∀ǫ > 0

◮ Hence we have Sn/n →P µ.
◮ The weak law of large numbers says that the sample mean converges in probability to the expectation.

Example
◮ Let Ω = [0, 1] with the usual probability measure and let Xn = I[0, 1/n].
◮ P[Xn = 1] = 1/n = 1 − P[Xn = 0].
◮ The probability of Xn taking value 1 is decreasing with n.
◮ A good guess is that it converges to zero:

    P[ |Xn − 0| > ǫ ] = P[Xn = 1] = 1/n

  which goes to zero as n → ∞.
◮ Hence, Xn →P 0.

Example
◮ Let X1, X2, · · · be a sequence of iid random variables which are uniform over (0, 1).
◮ Let Mn = max(X1, X2, · · · , Xn).
◮ Does Mn converge in probability?
◮ A reasonable guess for the limit is 1:

    P[ |Mn − 1| ≥ ǫ ] = P[Mn ≤ 1 − ǫ] = (1 − ǫ)ⁿ

◮ This implies Mn →P 1.
◮ Suppose Zn = min(X1, X2, · · · , Xn). Then Zn →P 0.

Some properties of convergence in probability
◮ Xn →P X and Xn →P Y ⇒ P[X = Y] = 1.
◮ Xn →P X ⇒ P[ |Xn − Xm| > ǫ ] → 0 as n, m → ∞.
◮ Suppose Xn →P X and Yn →P Y. Then the following hold:
  1. aXn →P aX
  2. Xn + Yn →P X + Y
  3. XnYn →P XY
◮ We omit the proofs.
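The max-of-uniforms example can be checked directly (a sketch of my own, not from the slides): the exact tail P[Mn ≤ 1 − ǫ] = (1 − ǫ)ⁿ vanishes as n grows, and the empirical frequency agrees.

```python
import random

def p_below(n, eps, trials=5000, seed=123):
    """Empirical P[Mn <= 1 - eps] where Mn is the max of n iid Uniform(0,1)."""
    rng = random.Random(seed)
    return sum(max(rng.random() for _ in range(n)) <= 1 - eps
               for _ in range(trials)) / trials

eps = 0.05
for n in (10, 100, 1000):
    print(n, (1 - eps) ** n, p_below(n, eps))   # exact vs empirical, both shrinking
```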
Recap: Multi-dimensional Gaussian density
◮ X = (X1, · · · , Xn)ᵀ are said to be jointly Gaussian if

    fX(x) = (1 / ((2π)^{n/2} |Σ|^{1/2})) e^{−(1/2)(x−µ)ᵀΣ⁻¹(x−µ)}

◮ EX = µ and ΣX = Σ.
◮ The moment generating function is given by

    MX(s) = e^{sᵀµ + (1/2) sᵀΣs}

◮ When X, Y are jointly Gaussian, the joint density is given by

    fXY(x, y) = (1/(2πσxσy√(1−ρ²))) exp{ −(1/(2(1−ρ²))) ( (x−µx)²/σx² + (y−µy)²/σy² − 2ρ(x−µx)(y−µy)/(σxσy) ) }

Recap
◮ If X1, · · · , Xn are jointly Gaussian then they are independent if they are uncorrelated.
◮ When X1, · · · , Xn are jointly Gaussian (with zero means), there is an orthogonal transform Y = AX such that Y1, · · · , Yn are jointly Gaussian and independent.
◮ X1, · · · , Xn are jointly Gaussian if and only if tᵀX is Gaussian for all non-zero t ∈ ℜⁿ.
◮ If X1, · · · , Xn are jointly Gaussian and A is a k × n matrix of rank k, then Y = AX is jointly Gaussian.

Recap: Moment inequalities
◮ Jensen's Inequality: If g is convex and EX and E[g(X)] exist,

    g(EX) ≤ E[g(X)]

◮ Holder Inequality: For p, q > 1 and 1/p + 1/q = 1,

    E|XY| ≤ (E|X|^p)^{1/p} (E|Y|^q)^{1/q}

  (assuming all expectations exist)
◮ For p = q = 2, the above is the Cauchy-Schwarz inequality.
◮ This implies |ρXY| ≤ 1.
◮ Minkowski's Inequality: for r ≥ 1,

    (E|X + Y|^r)^{1/r} ≤ (E|X|^r)^{1/r} + (E|Y|^r)^{1/r}

Recap
◮ Chernoff Bounds:

    P[X > a] ≤ E[e^{sX}]/e^{sa} = MX(s)/e^{sa},  ∀s > 0

◮ Hoeffding Inequality: Xi iid, Xi ∈ [a, b], ∀i, and EXi = µ:

    P[ |Σ_{i=1}^{n} Xi − nµ| ≥ ǫ ] ≤ 2e^{−2ǫ²/(n(b−a)²)},  ǫ > 0
Recap: Weak Law of large numbers
◮ Xi are iid, EXi = µ, Var(Xi) = σ², Sn = Σ_{i=1}^{n} Xi.

    E[Sn/n] = µ;  and  Var(Sn/n) = σ²/n

◮ By the Chebyshev inequality

    P[ |Sn/n − µ| ≥ ǫ ] ≤ Var(Sn/n)/ǫ² = σ²/(nǫ²),  ∀ǫ > 0
    ⇒ lim_{n→∞} P[ |Sn/n − µ| ≥ ǫ ] = 0,  ∀ǫ > 0

Recap: Convergence in Probability
◮ A sequence of random variables, Xn, is said to converge in probability to a random variable X0 if

    lim_{n→∞} P[ |Xn − X0| > ǫ ] = 0,  ∀ǫ > 0

◮ This is denoted as Xn →P X0.
◮ By the definition of limit, the above means

    ∀δ > 0, ∃N < ∞, s.t. P[ |Xn − X0| > ǫ ] < δ, ∀n > N

◮ We only need the marginal distributions of the individual Xn to decide whether a sequence converges to a constant in probability.

◮ We mentioned point-wise convergence of a sequence of functions: gn → g0 if gn(x) → g0(x), ∀x.
◮ Since random variables are also functions, we can define convergence like this.
◮ We can demand Xn(ω) → X0(ω), ∀ω.
◮ Such pointwise convergence is too restrictive.
◮ But we can demand that it be satisfied for almost all ω.
◮ A sequence of random variables, Xn, is said to converge almost surely or with probability one to X if

    P({ω : Xn(ω) → X(ω)}) = 1

  or equivalently

    P({ω : Xn(ω) ↛ X(ω)}) = 0

◮ Denoted as Xn →a.s. X or Xn →w.p.1 X or Xn → X (w.p.1).
◮ We are saying that for 'almost all' ω, Xn(ω) converges to X(ω).
◮ We will first try to write the event {ω : Xn(ω) ↛ X(ω)} in a proper form.
◮ Recall convergence of real number sequences. A sequence of real numbers xn is said to converge to x0, xn → x0, if

    ∀ǫ > 0, ∃N < ∞, s.t. |xn − x0| < ǫ, ∀n ≥ N

  This is equivalent to

    ∀ǫ > 0, ∃N < ∞, ∀k ≥ 0, |x_{N+k} − x0| < ǫ

◮ So, xn ↛ x0 means

    ∃ǫ ∀N ∃k |x_{N+k} − x0| ≥ ǫ

◮ Note that given any ω, Xn(ω) is a real number sequence.
◮ Hence Xn(ω) → X(ω) is the same as

    ∀ǫ > 0 ∃N < ∞ ∀k ≥ 0 |X_{N+k}(ω) − X(ω)| < ǫ

  This is equivalent to

    ∀ integer r > 0 ∃N < ∞ ∀k ≥ 0 |X_{N+k}(ω) − X(ω)| < 1/r

◮ Hence, Xn(ω) ↛ X(ω) is the same as

    ∃r ∀N ∃k |X_{N+k}(ω) − X(ω)| ≥ 1/r

◮ Hence we can write this event as

    ∪_{r=1}^{∞} ∩_{N=1}^{∞} ∪_{k=0}^{∞} {ω : |X_{N+k}(ω) − X(ω)| ≥ 1/r}

◮ So the event {ω : Xn(ω) ↛ X(ω)} can be expressed as

    ∪_{r=1}^{∞} ∩_{N=1}^{∞} ∪_{k=0}^{∞} [ |X_{N+k} − X| ≥ 1/r ]

◮ Hence Xn converges almost surely to X iff

    P( ∪_{r=1}^{∞} ∩_{N=1}^{∞} ∪_{k=0}^{∞} [ |X_{N+k} − X| ≥ 1/r ] ) = 0

◮ This is the same as

    P( ∩_{N=1}^{∞} ∪_{k=0}^{∞} [ |X_{N+k} − X| ≥ 1/r ] ) = 0,  for every integer r > 0

◮ Same as

    P( ∩_{N=1}^{∞} ∪_{k=0}^{∞} [ |X_{N+k} − X| ≥ ǫ ] ) = 0,  ∀ǫ > 0

◮ A sequence Xn is said to converge almost surely or with probability one to X if P({ω : Xn(ω) → X(ω)}) = 1, i.e., P[Xn → X] = 1.
◮ We showed that this is equivalent to the condition above, which is the same as

    P( ∩_{N=1}^{∞} ∪_{k=N}^{∞} [ |Xk − X| ≥ ǫ ] ) = 0,  ∀ǫ > 0
◮ Let Ak = [ |Xk − X| ≥ ǫ ] and let BN = ∪_{k=N}^{∞} Ak.
◮ Then B_{N+1} ⊂ BN and hence BN ↓. Hence, lim BN = ∩_{N=1}^{∞} BN.
◮ We saw that Xn →a.s. X is the same as

    P( ∩_{N=1}^{∞} ∪_{k=N}^{∞} [ |Xk − X| ≥ ǫ ] ) = 0,  ∀ǫ > 0
    ⇔ P( lim_{N→∞} ∪_{k=N}^{∞} [ |Xk − X| ≥ ǫ ] ) = 0,  ∀ǫ > 0
    ⇔ lim_{N→∞} P( ∪_{k=N}^{∞} [ |Xk − X| ≥ ǫ ] ) = 0,  ∀ǫ > 0

◮ Thus Xn converges to X almost surely iff

    lim_{n→∞} P( ∪_{k=n}^{∞} [ |Xk − X| ≥ ǫ ] ) = 0,  ∀ǫ > 0

◮ To show convergence with probability one using this, one needs to know the joint distribution of Xn, X_{n+1}, · · ·
◮ Contrast this with Xn →P X, which is

    lim_{n→∞} P[ |Xn − X| > ǫ ] = 0,  ∀ǫ > 0

◮ This also shows that

    Xn →a.s. X ⇒ Xn →P X

◮ Almost sure convergence is a stronger mode of convergence.

simple example: almost sure convergence
◮ Let Ω = [0, 1] with the usual probability measure and let Xn = I[0, 1/n]:

    Xn(ω) = 1 if ω ≤ 1/n;  0 otherwise

◮ Since Xn →P 0, zero is the only candidate for the limit.
◮ Xn(ω) = 1 only when n ≤ 1/ω.
◮ Given any ω > 0, for all n > 1/ω, Xn(ω) = 0.
◮ Hence, {ω : Xn(ω) → 0} = (0, 1] and

    P[Xn → 0] = P({ω : Xn(ω) → 0}) = P((0, 1]) = 1

◮ Hence Xn →a.s. 0.
◮ Recall: Xn converges to X almost surely iff

    P( ∩_{N=1}^{∞} ∪_{k=N}^{∞} [ |Xk − X| ≥ ǫ ] ) = 0, ∀ǫ > 0
    ⇔ lim_{n→∞} P( ∪_{k=n}^{∞} [ |Xk − X| ≥ ǫ ] ) = 0, ∀ǫ > 0

◮ We normally do not specify Xn as functions over Ω; we are only given the distributions.
◮ How do we then establish convergence almost surely?
◮ Let A1, A2, · · · be a sequence of events.
◮ How do we define the limit of this sequence?
◮ Define sequences

    Bn = ∪_{k=n}^{∞} Ak        Cn = ∩_{k=n}^{∞} Ak

◮ These are monotone: Bn ↓, Cn ↑. They have limits.
◮ Define

    lim sup An ≜ lim Bn = ∩_{n=1}^{∞} ∪_{k=n}^{∞} Ak
    lim inf An ≜ lim Cn = ∪_{n=1}^{∞} ∩_{k=n}^{∞} Ak

◮ If lim sup An = lim inf An then we define that as lim An. Otherwise we say the sequence does not have a limit.
◮ Note that lim sup An and lim inf An are events.
◮ Note that Xn →a.s. X iff

    P( ∩_{N=1}^{∞} ∪_{k=N}^{∞} [ |Xk − X| ≥ ǫ ] ) = 0, ∀ǫ > 0
    ⇔ P( lim sup [ |Xn − X| ≥ ǫ ] ) = 0, ∀ǫ > 0

◮ We can show lim inf An ⊂ lim sup An:

    ω ∈ lim inf An ⇒ ω ∈ ∪_{n=1}^{∞} ∩_{k=n}^{∞} Ak
                   ⇒ ∃m, ω ∈ Ak, ∀k ≥ m
                   ⇒ ω ∈ ∪_{j=n}^{∞} Aj, ∀n
                   ⇒ ω ∈ ∩_{n=1}^{∞} ∪_{j=n}^{∞} Aj
                   ⇒ ω ∈ lim sup An

◮ We can characterize lim inf An as follows:

    ω ∈ lim inf An ⇒ ω ∈ ∪_{n=1}^{∞} ∩_{k=n}^{∞} Ak
                   ⇒ ∃m, ω ∈ Ak, ∀k ≥ m
                   ⇒ ω belongs to all but finitely many of the An

◮ Thus, lim inf An consists of all points that are in all but finitely many An.
◮ We can characterize lim sup An as follows:

    ω ∈ lim sup An ⇒ ω ∈ ∩_{n=1}^{∞} ∪_{k=n}^{∞} Ak
                   ⇒ ω ∈ ∪_{k=n}^{∞} Ak, ∀n
                   ⇒ ω belongs to infinitely many of the An

◮ Thus lim sup An consists of points that are in infinitely many An.
◮ One refers to lim sup An also as 'An infinitely often' or 'An i.o.'
◮ What is the difference between points that belong to all but finitely many An and points that belong to infinitely many An?
◮ There can be ω that are in infinitely many of the An while also being absent from infinitely many of the An; such ω are in lim sup An but not in lim inf An.
Example
◮ Consider the following sequence of sets: A, B, A, B, · · ·
◮ Recall

    lim sup An = ∩_{n=1}^{∞} ∪_{k=n}^{∞} Ak        lim inf An = ∪_{n=1}^{∞} ∩_{k=n}^{∞} Ak

◮ Here

    ∪_{k=n}^{∞} Ak = A ∪ B, ∀n ⇒ lim sup An = A ∪ B
    ∩_{k=n}^{∞} Ak = A ∩ B, ∀n ⇒ lim inf An = A ∩ B

Example
◮ Consider the sets An = [0, 1 + (−1)ⁿ/n). The sequence is

    [0, 0), [0, 1 + 1/2), [0, 1 − 1/3), [0, 1 + 1/4), · · ·

◮ Guess: lim sup An = [0, 1] and lim inf An = [0, 1).
◮ First note that [0, 1 + 1/(n+1)) ⊂ ∪_{k=n}^{∞} Ak ⊂ [0, 1 + 1/n). Hence

    x ∈ [0, 1] ⇒ x ∈ ∪_{k=n}^{∞} Ak, ∀n ⇒ x ∈ ∩_{n=1}^{∞} ∪_{k=n}^{∞} Ak ⇒ x ∈ lim sup An

◮ Given any ǫ > 0, 1 + ǫ ∉ [0, 1 + 1/n) if ǫ > 1/n, i.e., if n > 1/ǫ.
◮ Hence, given any ǫ > 0, ∃n such that 1 + ǫ ∉ ∪_{k=n}^{∞} Ak.
◮ This proves lim sup An = [0, 1].

◮ Now let us consider lim inf An = ∪_{n=1}^{∞} ∩_{k=n}^{∞} Ak.
◮ Recall An = [0, 1 + (−1)ⁿ/n).
◮ First note that

    [0, 1 − 1/n) ⊂ ∩_{k=n}^{∞} Ak ⊂ [0, 1 − 1/(n+1))

◮ Given any ǫ > 0, 1 − ǫ < 1 − 1/n if n > 1/ǫ.
◮ Hence, given any ǫ > 0, ∃n such that 1 − ǫ ∈ ∩_{k=n}^{∞} Ak.
◮ Hence 1 − ǫ ∈ lim inf An.
◮ It is easy to see 1 ∉ ∩_{k=n}^{∞} Ak for any n. Hence 1 ∉ lim inf An.
◮ This proves lim inf An = [0, 1).
◮ Since lim sup An ≠ lim inf An, this sequence does not have a limit.

◮ Recall Xn →a.s. X iff

    P( ∩_{N=1}^{∞} ∪_{k=N}^{∞} [ |Xk − X| ≥ ǫ ] ) = 0,  ∀ǫ > 0

◮ Let Aǫn = [ |Xn − X| ≥ ǫ ].
◮ Then Xn →a.s. X iff

    P( lim sup Aǫn ) = 0,  ∀ǫ > 0

◮ The question now is: can we get the probability of lim sup An?
◮ We look at an important result that allows us to do this.
Borel-Cantelli Lemma
◮ Borel-Cantelli lemma: Given a sequence of events A1, A2, · · ·
  1. If Σ_{i=1}^{∞} P(Ai) < ∞, then P(lim sup An) = 0.
  2. If Σ_{i=1}^{∞} P(Ai) = ∞ and the Ai are independent, then P(lim sup An) = 1.

Proof:
◮ We will first show: P(∪_{i=n}^{∞} Ai) ≤ Σ_{i=n}^{∞} P(Ai), ∀n.
◮ We have the result: P(∪_{i=n}^{N} Ai) ≤ Σ_{i=n}^{N} P(Ai), n ≤ N.
◮ For any n, let BN = ∪_{i=n}^{N} Ai. Then BN ⊂ B_{N+1} and lim_{N→∞} BN = ∪_{k=n}^{∞} Ak. Hence

    P(∪_{i=n}^{∞} Ai) = P( lim_{N→∞} ∪_{i=n}^{N} Ai ) = lim_{N→∞} P(∪_{i=n}^{N} Ai)
                      ≤ lim_{N→∞} Σ_{i=n}^{N} P(Ai) = Σ_{i=n}^{∞} P(Ai)

◮ If Σ_{k=1}^{∞} P(Ak) < ∞, then lim_{n→∞} Σ_{k=n}^{∞} P(Ak) = 0. So

    0 ≤ P(lim sup An) = P( ∩_{n=1}^{∞} ∪_{k=n}^{∞} Ak )
                      = P( lim_{n→∞} ∪_{k=n}^{∞} Ak )
                      = lim_{n→∞} P(∪_{k=n}^{∞} Ak)
                      ≤ lim_{n→∞} Σ_{k=n}^{∞} P(Ak)
                      = 0,  if Σ_{k=1}^{∞} P(Ak) < ∞

◮ This completes the proof of the first part of the Borel-Cantelli lemma.

◮ (Why does summability give the vanishing tail?) Let Σ_{k=1}^{∞} P(Ak) = C < ∞.
◮ It means given any ǫ > 0, ∃n with

    C − Σ_{k=1}^{n} P(Ak) < ǫ  ⇒  Σ_{k=n+1}^{∞} P(Ak) < ǫ

◮ This implies

    lim_{n→∞} Σ_{k=n}^{∞} P(Ak) = 0

◮ For the second part of the lemma:

    P(lim sup An) = P( ∩_{n=1}^{∞} ∪_{k=n}^{∞} Ak )
                  = P( lim_{n→∞} ∪_{k=n}^{∞} Ak )
                  = lim_{n→∞} P(∪_{k=n}^{∞} Ak)
                  = lim_{n→∞} ( 1 − P( ∩_{k=n}^{∞} Ak^c ) )
                  = lim_{n→∞} ( 1 − Π_{k=n}^{∞} (1 − P(Ak)) )
                    (because the Ak are independent)
                  = 1 − lim_{n→∞} Π_{k=n}^{∞} (1 − P(Ak))
◮ We can compute that limit as follows:

    lim_{n→∞} Π_{k=n}^{∞} (1 − P(Ak)) ≤ lim_{n→∞} Π_{k=n}^{∞} e^{−P(Ak)},  since 1 − x ≤ e^{−x}
                                      = lim_{n→∞} e^{−Σ_{k=n}^{∞} P(Ak)}
                                      = 0

  because Σ_{k=1}^{∞} P(Ak) = ∞ ⇒ lim_{n→∞} Σ_{k=n}^{∞} P(Ak) = ∞.
◮ This finally gives us

    P(lim sup An) = 1 − lim_{n→∞} Π_{k=n}^{∞} (1 − P(Ak)) = 1

◮ Given a sequence Xn, we want to know whether it converges to X.
◮ Let Aǫk = [ |Xk − X| ≥ ǫ ].
◮ Xn →P X if

    lim_{k→∞} P[ |Xk − X| ≥ ǫ ] = 0,  same as  lim_{k→∞} P(Aǫk) = 0, ∀ǫ > 0

◮ By the Borel-Cantelli lemma

    Σ_{k=1}^{∞} P(Aǫk) < ∞ ⇒ P(lim sup Aǫk) = 0 ⇒ Xk →a.s. X
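The dichotomy in the Borel-Cantelli lemma can be seen in a truncated simulation (a sketch of my own, not from the slides): with independent events of probability 1/n² (summable) only finitely many occur, while with probability 1/n (divergent) the count keeps growing with the truncation length.

```python
import random

def count_occurrences(probs, seed=7):
    """Simulate independent events A_n with P(A_n) = probs[n-1]; count how many occur."""
    rng = random.Random(seed)
    return sum(rng.random() < p for p in probs)

N = 100000
summable = [1.0 / (n * n) for n in range(1, N + 1)]   # sum ~ pi^2/6, finite
divergent = [1.0 / n for n in range(1, N + 1)]        # harmonic series, diverges

print(count_occurrences(summable))    # small; stabilizes as N grows (finitely many A_n)
print(count_occurrences(divergent))   # ~ log N; keeps growing (A_n infinitely often)
```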

◮ Consider a sequence X_n with

  P[X_n = 0] = 1 − a_n;  P[X_n = c_n] = a_n

◮ We want to investigate convergence to 0.
◮ A_n^ε = [|X_n − 0| > ε] = [X_n = c_n], ∀ε > 0 (for ε < c_n)
◮ Hence P(A_n^ε) = a_n, ∀ε > 0.
◮ If a_n → 0 then X_n →P 0. (e.g., a_n = 1/n, 1/n²)
◮ If Σ_n a_n < ∞, then X_n →a.s. 0 (e.g., a_n = 1/n²)

◮ Consider a sequence X_n with

  P[X_n = 0] = 1 − 1/n;  P[X_n = 1] = 1/n

◮ We can easily conclude X_n →P 0.
◮ But since Σ_n 1/n = ∞, the Borel-Cantelli lemma is not really useful here
◮ We saw one example where such X_n converge almost surely.
◮ But, if the X_n are independent, then by the Borel-Cantelli lemma, they do not converge almost surely
◮ Convergence (to a constant) in probability depends only on the distributions of the individual X_n.
◮ Convergence almost surely depends on the joint distribution
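The two regimes above can be checked numerically (an illustration, not from the slides): for independent X_k, the probability that X_k = 0 for every k from n up to N is the product ∏_{k=n}^N (1 − a_k); with a_n = 1/n² the product stays bounded away from 0 (the events occur only finitely often), while with a_n = 1/n it goes to 0. The bounds n = 2, N = 100000 below are arbitrary choices for the demonstration.

```python
def tail_no_occurrence(a, n, N):
    """P[X_k = 0 for all n <= k <= N] = prod_{k=n}^{N} (1 - a_k)
    for independent X_k with P[X_k != 0] = a_k."""
    prod = 1.0
    for k in range(n, N + 1):
        prod *= 1.0 - a(k)
    return prod

# a_n = 1/n^2: sum is finite, product telescopes to (N+1)/(2N) -> 1/2 > 0
p_sq = tail_no_occurrence(lambda k: 1.0 / k**2, 2, 100000)
# a_n = 1/n: sum diverges, product telescopes to 1/N -> 0
p_lin = tail_no_occurrence(lambda k: 1.0 / k, 2, 100000)
print(p_sq, p_lin)
```

The first product converges to 1/2, consistent with P(lim sup A_n) = 0; the second vanishes, consistent with the events occurring infinitely often.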
Strong Law of Large Numbers

◮ Let X_n be iid, EX_n = μ, Var(X_n) = σ², S_n = Σ_{i=1}^n X_i
◮ We saw the weak law of large numbers:

  S_n/n →P μ

◮ The strong law of large numbers says:

  S_n/n →a.s. μ

◮ Let A_n^ε = [|S_n/n − μ| > ε]
◮ As we saw, by the Chebyshev inequality

  P[|S_n/n − μ| > ε] ≤ σ²/(nε²)

◮ This shows P(A_n^ε) → 0 and thus we get the weak law
◮ To prove the strong law using the Borel-Cantelli lemma, we need Σ_n P(A_n^ε) < ∞
◮ Since Σ_n σ²/(nε²) = ∞, the Chebyshev bound is not useful
◮ We need a bound P[|S_n/n − μ| > ε] ≤ c_n such that Σ_n c_n < ∞.

Recap: Convergence in Probability

◮ A sequence of random variables, X_n, is said to converge in probability to a random variable X_0 if

  lim_{n→∞} P[|X_n − X_0| > ε] = 0, ∀ε > 0

  This is denoted as X_n →P X_0
◮ By the definition of limit, the above means

  ∀δ > 0, ∃N < ∞, s.t. P[|X_n − X_0| > ε] < δ, ∀n > N

◮ We only need the marginal distributions of the individual X_n to decide whether a sequence converges to a constant in probability

Recap: Weak Law of Large Numbers

◮ X_i iid, EX_i = μ, Var(X_i) = σ², S_n = Σ_{i=1}^n X_i

  E[S_n/n] = μ;  Var(S_n/n) = σ²/n

◮ The weak law of large numbers states

  S_n/n →P μ
Recap: Almost sure convergence

◮ A sequence of random variables, X_n, is said to converge almost surely or with probability one to X if

  P({ω : X_n(ω) → X(ω)}) = 1

  or equivalently

  P({ω : X_n(ω) ↛ X(ω)}) = 0

◮ Denoted as X_n →a.s. X or X_n →w.p.1 X
◮ We can also write it as P[X_n → X] = 1

Recap

◮ The sequence X_n converges to X almost surely iff

  P(∩_{N=1}^∞ ∪_{k=N}^∞ [|X_k − X| ≥ ε]) = 0, ∀ε > 0

◮ Equivalently

  lim_{n→∞} P(∪_{k=n}^∞ [|X_k − X| ≥ ε]) = 0, ∀ε > 0

◮ X_n →P X iff lim_{n→∞} P[|X_n − X| > ε] = 0, ∀ε > 0
◮ Hence X_n →a.s. X ⇒ X_n →P X
◮ Almost sure convergence is a stronger mode of convergence

Recap: lim sup and lim inf

◮ Let A_1, A_2, · · · be a sequence of events.
◮ We define

  lim sup A_n ≜ ∩_{n=1}^∞ ∪_{k=n}^∞ A_k
  lim inf A_n ≜ ∪_{n=1}^∞ ∩_{k=n}^∞ A_k

◮ If lim sup A_n = lim inf A_n then that is lim A_n. Otherwise the sequence does not have a limit
◮ lim sup A_n and lim inf A_n are events
◮ lim inf A_n ⊂ lim sup A_n

Recap

◮ X_n →a.s. X iff

  P(∩_{N=1}^∞ ∪_{k=N}^∞ [|X_k − X| ≥ ε]) = 0, ∀ε > 0

◮ Let A_k^ε = [|X_k − X| ≥ ε].
◮ Hence, X_n →a.s. X iff

  P(lim sup A_n^ε) = 0, ∀ε > 0
Recall: Borel-Cantelli Lemma

◮ Borel-Cantelli lemma: Given a sequence of events, A_1, A_2, · · ·
  1. If Σ_{i=1}^∞ P(A_i) < ∞, then P(lim sup A_n) = 0
  2. If Σ_{i=1}^∞ P(A_i) = ∞ and the A_i are independent, then P(lim sup A_n) = 1

◮ Given a sequence X_n we want to know whether it converges to X
◮ Let A_k^ε = [|X_k − X| ≥ ε]
◮ X_n →a.s. X if P(lim sup A_n^ε) = 0, ∀ε > 0
◮ By the Borel-Cantelli lemma

  Σ_{k=1}^∞ P(A_k^ε) < ∞, ∀ε > 0  ⇒  P(lim sup A_k^ε) = 0  ⇒  X_k →a.s. X

◮ If the A_k^ε are independent,

  Σ_{k=1}^∞ P(A_k^ε) = ∞  ⇒  P(lim sup A_k^ε) = 1  ⇒  X_k does not converge to X almost surely

Strong Law of Large Numbers

◮ Let X_n be iid, EX_n = μ, Var(X_n) = σ², S_n = Σ_{i=1}^n X_i
◮ We saw the weak law of large numbers:

  S_n/n →P μ

◮ The strong law of large numbers says:

  S_n/n →a.s. μ

◮ Let A_n^ε = [|S_n/n − μ| > ε]
◮ As we saw, by the Chebyshev inequality

  P[|S_n/n − μ| > ε] ≤ σ²/(nε²)

◮ This shows P(A_n^ε) → 0 and thus we get the weak law
◮ To prove the strong law using the Borel-Cantelli lemma, we need Σ_n P(A_n^ε) < ∞
◮ Since Σ_n σ²/(nε²) = ∞, the Chebyshev bound is not useful
◮ We need a bound P[|S_n/n − μ| > ε] ≤ c_n such that Σ_n c_n < ∞.
◮ Let us assume the X_i have a finite fourth moment
◮ Expanding,

  (Σ_{i=1}^n (X_i − μ))⁴ = Σ_{i=1}^n (X_i − μ)⁴ + (4!/(2!2!)) Σ_i Σ_{j>i} (X_i − μ)²(X_j − μ)² + T

  where T represents a number of terms such that every term in it contains a factor like (X_i − μ) to an odd power
◮ Note that E[(X_i − μ)(X_j − μ)³] = 0 etc. because the X_i are independent.
◮ Hence we get

  E[(Σ_{i=1}^n (X_i − μ))⁴] = nE[(X_i − μ)⁴] + 3n(n − 1)σ⁴ ≤ C′n²

◮ Now we can get, using the Markov inequality,

  P[|S_n/n − μ| > ε] = P[|S_n − nμ| > nε]
                     = P[|Σ_{i=1}^n (X_i − μ)| > nε]
                     ≤ E[(Σ_{i=1}^n (X_i − μ))⁴] / (nε)⁴
                     ≤ C′n²/(n⁴ε⁴) = C/n²

◮ Since Σ_n C/n² < ∞, we get S_n/n →a.s. μ
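The fourth-moment identity E[(Σ(X_i − μ))⁴] = nE[(X_i − μ)⁴] + 3n(n − 1)σ⁴ can be verified exactly in a toy case (my illustration, not from the slides): zero-mean X_i = ±1 with probability 1/2 each, so that E[X⁴] = σ⁴ = 1, by enumerating all equally likely sign patterns.

```python
from itertools import product

# E[(X_1+...+X_n)^4] for iid zero-mean X_i = +-1, computed by exact
# enumeration of all 2^n equally likely outcomes, vs the closed form.
n = 5
lhs = sum(sum(x)**4 for x in product((-1, 1), repeat=n)) / 2**n
rhs = n * 1 + 3 * n * (n - 1) * 1   # n*E[X^4] + 3n(n-1)*sigma^4
print(lhs, rhs)   # both 65 for n = 5
```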

◮ The strong law of large numbers says

  S_n/n →a.s. μ,  where S_n = Σ_{i=1}^n X_i, X_i iid, EX_i = μ

◮ We proved it assuming a finite fourth moment of the X_i.
◮ This is only for illustration
◮ The strong law holds without any such assumptions on moments
◮ The strong law of large numbers says that the sample mean converges to the expectation with probability one.

Convergence in rth mean

◮ We say that a sequence X_n converges in rth mean to X if E[|X_n|^r] < ∞, ∀n, E[|X|^r] < ∞ and

  E[|X_n − X|^r] → 0 as n → ∞

◮ Denoted as X_n →r X
◮ Consider our old example of binary random variables

  P[X_n = 1] = 1/n;  P[X_n = 0] = 1 − 1/n

◮ All moments of X_n are finite and we have

  E[|X_n − 0|²] = 1/n → 0

◮ Hence X_n →2 0.
◮ In this example X_n converges in rth mean for all r
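As a quick sanity check of the strong law (an illustration, not part of the slides; the seed and sample size are arbitrary choices), the running sample mean of iid uniforms on (0, 1) settles near μ = 0.5:

```python
import random

random.seed(0)          # fixed seed: deterministic illustration
n = 100_000
# X_i ~ U(0,1), so mu = 0.5 and sigma^2 = 1/12
sample_mean = sum(random.random() for _ in range(n)) / n
print(sample_mean)
```

The standard deviation of the sample mean here is about 0.0009, so a deviation from 0.5 of more than a couple of hundredths would be astronomically unlikely.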
◮ Suppose X_n →r X. Then, by the Markov inequality,

  P[|X_n − X| > ε] ≤ E[|X_n − X|^r]/ε^r → 0

◮ Hence

  X_n →r X ⇒ X_n →P X

◮ In general, neither of convergence almost surely and in rth mean implies the other.
◮ We can generate counter examples for this easily.
◮ However, if all X_n take values in a bounded interval, then almost sure convergence implies rth mean convergence

◮ Consider a sequence X_n where the X_n are independent with

  P[X_n = 0] = 1 − a_n;  P[X_n = c_n] = a_n

◮ Assume a_n → 0 so that X_n →P 0
◮ By the Borel-Cantelli lemma

  X_n →a.s. 0 ⇔ Σ_n a_n < ∞

◮ For convergence in rth mean we need

  E[|X_n − 0|^r] = (c_n)^r a_n → 0

◮ Take a_n = 1/n and c_n = 1. Then X_n →r 0 but the sequence does not converge almost surely.
◮ Take a_n = 1/n² and c_n = e^n. Then X_n →a.s. 0 but the sequence does not converge in rth mean for any r.
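The two counterexamples can be made concrete by evaluating E[|X_n|^r] = (c_n)^r a_n at a moderately large n (a sketch; the particular n and r values are illustrative):

```python
import math

def rth_moment(a_n, c_n, n, r):
    """E|X_n|^r = c_n^r * a_n for the two-point distribution above."""
    return c_n(n)**r * a_n(n)

# a_n = 1/n, c_n = 1: E|X_n|^r = 1/n -> 0, so rth-mean convergence holds
m1 = rth_moment(lambda n: 1 / n, lambda n: 1.0, 1000, r=2)
# a_n = 1/n^2, c_n = e^n: E|X_n|^r = e^{rn}/n^2 blows up, no rth-mean convergence
m2 = rth_moment(lambda n: 1 / n**2, lambda n: math.exp(n), 50, r=1)
print(m1, m2)
```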

◮ Let X_n →r X. Then
  1. E[|X_n|^r] → E[|X|^r]
  2. X_n →s X, ∀s < r
◮ The proofs are straightforward but we omit them

Convergence in distribution

◮ Let F_n be the df of X_n, n = 1, 2, · · · . Let X be a rv with df F.
◮ The sequence X_n is said to converge to X in distribution if

  F_n(x) → F(x), ∀x where F is continuous

◮ We denote this as X_n →d X, or X_n →L X, or F_n →w F
◮ This is also known as convergence in law or weak convergence
◮ Note that here we are essentially talking about convergence of distribution functions.
◮ Convergence in probability implies convergence in distribution
◮ The converse is not true. (e.g., a sequence of iid random variables)
Examples

◮ Let X_1, X_2, · · · be iid, uniform over (0, 1)
◮ Let N_n = min(X_1, · · · , X_n), Y_n = nN_n. Does Y_n converge in distribution?

  P[N_n > a] = (P[X_i > a])^n = (1 − a)^n, 0 < a < 1

  P[Y_n > y] = P[N_n > y/n] = (1 − y/n)^n, if n > y

◮ Hence for any y > 0

  lim_{n→∞} P[Y_n > y] = lim_{n→∞} (1 − y/n)^n = e^{−y}

◮ The sequence converges in distribution to an exponential rv

◮ Let {X_n} be iid with density f(x) = e^{−x+θ}, x > θ > 0.
◮ Let N_n = min(X_1, · · · , X_n). Does N_n converge in probability?
◮ Guess for the limit: θ

  P[|N_n − θ| > ε] = P[N_n > θ + ε] = (P[X_i > θ + ε])^n

  P[X_i > θ + ε] = ∫_{θ+ε}^∞ e^{−x+θ} dx = e^{−ε}

  P[N_n > θ + ε] = (e^{−ε})^n → 0, as n → ∞, ∀ε > 0

◮ Hence N_n →P θ
◮ Does it converge almost surely?
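The convergence of the exact tail (1 − y/n)^n to the exponential tail e^{−y} is easy to check numerically (an illustration; y = 2 and n = 10⁶ are arbitrary choices):

```python
import math

# P[Y_n > y] = (1 - y/n)^n should approach e^{-y}
y, n = 2.0, 10**6
exact_tail = (1 - y / n)**n
limit_tail = math.exp(-y)
print(exact_tail, limit_tail)
```

The discrepancy is of order y²e^{−y}/(2n), i.e. well below 10⁻⁶ here.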

Examples

◮ Let EX_n = m_n and Var(X_n) = σ_n², n = 1, 2, · · ·
◮ We want a sufficient condition for X_n − m_n to converge to 0 in probability
◮ Note that E[X_n − m_n] = 0, and Var(X_n − m_n) = σ_n², ∀n
◮ By the Chebyshev inequality,

  P[|X_n − m_n| > ε] ≤ σ_n²/ε²

◮ Hence, a sufficient condition is σ_n² → 0.
◮ What is a sufficient condition for convergence almost surely?

◮ We have seen different modes of convergence
◮ X_n →d X iff F_n(x) → F(x), ∀x where F is continuous
◮ X_n →P X iff lim_{n→∞} P[|X_n − X| > ε] = 0, ∀ε > 0
◮ X_n →r X iff E[|X_n − X|^r] → 0 as n → ∞
◮ X_n →a.s. X iff P[X_n → X] = 1, or P(lim sup [|X_n − X| > ε]) = 0, ∀ε > 0
◮ We have the following relations among the different modes of convergence

  X_n →r X ⇒ X_n →P X ⇒ X_n →d X

  X_n →a.s. X ⇒ X_n →P X ⇒ X_n →d X

◮ All the implications are one-way and we have seen counter examples
◮ In general, almost sure convergence does not imply convergence in rth mean and vice versa

◮ Strong and weak laws of large numbers are very useful examples of convergence of sequences of random variables.
◮ Given X_i iid, EX_i = μ, Var(X_i) = σ², S_n = Σ_{i=1}^n X_i
◮ Weak law of large numbers: S_n/n →P μ
◮ Strong law of large numbers: S_n/n →a.s. μ
◮ Another useful result is the Central Limit Theorem (CLT)
◮ The CLT is about (normalized) sums of independent random variables converging to the Gaussian distribution

Central Limit Theorem

◮ Given X_i iid, EX_i = μ, Var(X_i) = σ²,

  S_n = Σ_{i=1}^n X_i ⇒ ES_n = nμ, Var(S_n) = nσ²

◮ Given any rv Y, let Z = (Y − EY)/√Var(Y)
◮ Then EZ = 0 and Var(Z) = 1.
◮ Define S̃_n = (S_n − nμ)/(σ√n). Then ES̃_n = 0, Var(S̃_n) = 1, ∀n
◮ The Central Limit Theorem states: S̃_n →d N(0, 1)

  lim_{n→∞} P[S̃_n ≤ a] = Φ(a) ≜ ∫_{−∞}^a (1/√(2π)) e^{−x²/2} dx

◮ Take X_i iid, EX_i = 0, Var(X_i) = 1, S_n = Σ_{i=1}^n X_i
◮ The strong law of large numbers implies

  S_n/n →a.s. 0

◮ The Central Limit Theorem implies

  S_n/√n →d N(0, 1)
Central Limit Theorem

◮ Given X_i iid, EX_i = μ, Var(X_i) = σ²,

  S_n = Σ_{i=1}^n X_i,  S̃_n = (S_n − nμ)/(σ√n)

◮ The Central Limit Theorem states: S̃_n →d N(0, 1)

Characteristic Function

◮ Given a rv X, its characteristic function, φ_X, is defined by

  φ_X(u) = E[e^{iuX}] = ∫ e^{iux} dF_X(x)   (i = √−1)

◮ Since |e^{iux}| ≤ 1, φ_X exists for all random variables
◮ We use characteristic functions for proving the CLT

Properties of characteristic function

◮ φ_X(u) = E[e^{iuX}] = ∫ e^{iux} dF_X(x)   (i = √−1)
◮ φ is continuous; |φ(u)| ≤ φ(0) = 1; φ(−u) = φ*(u)
◮ If Y = aX + b, φ_Y(u) = e^{iub} φ_X(ua)
◮ If E|X|^r < ∞, φ is differentiable r times and

  φ^(r)(u) = E[(iX)^r e^{iuX}]

◮ Let μ_r = E[X^r] and let ν_r = E[|X|^r]
◮ If ν_r is finite, then

  φ_X(u) = Σ_{s=0}^{r−1} ((iu)^s/s!) μ_s + ρ(u) ((iu)^r/r!) μ_r

  where |ρ(u)| ≤ 1 and ρ(u) → 1 as u → 0
◮ If all moments exist, then

  φ_X(u) = Σ_{s=0}^∞ ((iu)^s/s!) μ_s
◮ We denote by φ_F the characteristic function of the df F
◮ Let F_n be a sequence of distribution functions
◮ Continuity theorem:
  ◮ If F_n → F then φ_{F_n} → φ_F
  ◮ If φ_{F_n} → ψ and ψ is continuous at zero, then ψ would be the characteristic function of some df, say, F, and F_n → F

Characteristic function example

◮ Let X be a binomial rv

  φ_X(u) = E[e^{iuX}] = Σ_{k=0}^n (n choose k) p^k (1 − p)^{n−k} e^{iuk}
         = Σ_{k=0}^n (n choose k) (pe^{iu})^k (1 − p)^{n−k}
         = (pe^{iu} + (1 − p))^n
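The binomial characteristic function can be checked numerically by comparing the direct expectation with the closed form (an illustration; n, p and u below are arbitrary values):

```python
import cmath
from math import comb

# phi_X(u) for X ~ Binomial(n, p): direct sum vs (p e^{iu} + 1 - p)^n
n, p, u = 12, 0.3, 0.7
direct = sum(comb(n, k) * p**k * (1 - p)**(n - k) * cmath.exp(1j * u * k)
             for k in range(n + 1))
closed = (p * cmath.exp(1j * u) + (1 - p))**n
print(direct, closed)
```

The two agree to floating-point precision, as the binomial theorem guarantees.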

◮ Let X ∼ N(0, 1)

  φ_X(u) = E[e^{iuX}] = ∫_{−∞}^∞ (1/√(2π)) e^{iux} e^{−x²/2} dx
         = ∫_{−∞}^∞ (1/√(2π)) e^{−((x−iu)² − i²u²)/2} dx
         = e^{−u²/2} ∫_{−∞}^∞ (1/√(2π)) e^{−(x−iu)²/2} dx
         = e^{−u²/2}

Recap: Modes of convergence

◮ X_n →d X iff F_n(x) → F(x), ∀x where F is continuous
◮ X_n →P X iff lim_{n→∞} P[|X_n − X| > ε] = 0, ∀ε > 0
◮ X_n →r X iff E[|X_n − X|^r] → 0 as n → ∞
◮ X_n →a.s. X iff P[X_n → X] = 1, or P(lim sup [|X_n − X| > ε]) = 0, ∀ε > 0
Recap

◮ We have the following relations among the different modes of convergence

  X_n →r X ⇒ X_n →P X ⇒ X_n →d X

  X_n →a.s. X ⇒ X_n →P X ⇒ X_n →d X

◮ All the implications are one-way and we have seen counter examples
◮ In general, almost sure convergence does not imply convergence in rth mean and vice versa

Recap

◮ Given X_i iid, EX_i = μ, Var(X_i) = σ², S_n = Σ_{i=1}^n X_i
◮ Weak law of large numbers: S_n/n →P μ
◮ Strong law of large numbers: S_n/n →a.s. μ
◮ Central Limit Theorem: (S_n − nμ)/(σ√n) →d N(0, 1)

Recap

◮ Take X_i iid, EX_i = 0, Var(X_i) = 1, S_n = Σ_{i=1}^n X_i
◮ The strong law of large numbers implies

  S_n/n →a.s. 0

◮ The Central Limit Theorem implies

  S_n/√n →d N(0, 1)

Recap: Characteristic Function

◮ Given a rv X, its characteristic function, φ_X, is defined by

  φ_X(u) = E[e^{iuX}] = ∫ e^{iux} dF_X(x)   (i = √−1)

◮ Since |e^{iux}| ≤ 1, φ_X exists for all random variables
◮ φ is continuous; |φ(u)| ≤ φ(0) = 1; φ(−u) = φ*(u)
◮ If Y = aX + b, φ_Y(u) = e^{iub} φ_X(ua)
◮ If E|X|^r < ∞, φ is differentiable r times and φ^(r)(u) = E[(iX)^r e^{iuX}]
Recap

◮ Let μ_r = E[X^r] and let ν_r = E[|X|^r]
◮ If ν_r is finite, then

  φ_X(u) = Σ_{s=0}^{r−1} ((iu)^s/s!) μ_s + ρ(u) ((iu)^r/r!) μ_r

  where |ρ(u)| ≤ 1 and ρ(u) → 1 as u → 0
◮ If all moments exist, then

  φ_X(u) = Σ_{s=0}^∞ ((iu)^s/s!) μ_s

Recap

◮ We denote by φ_F the characteristic function of the df F
◮ Let F_n be a sequence of distribution functions
◮ Continuity theorem:
  ◮ If F_n → F then φ_{F_n} → φ_F
  ◮ If φ_{F_n} → ψ and ψ is continuous at zero, then ψ would be the characteristic function of some df, say, F, and F_n → F

◮ Given X_i iid, EX_i = μ, Var(X_i) = σ², S_n = Σ_{i=1}^n X_i
◮ Let S̃_n = (S_n − ES_n)/√Var(S_n) = (S_n − nμ)/(σ√n)
◮ (Lindberg-Levy) Central Limit Theorem:

  lim_{n→∞} P[S̃_n ≤ x] = lim_{n→∞} P[(S_n − nμ)/(σ√n) ≤ x] = ∫_{−∞}^x (1/√(2π)) e^{−t²/2} dt, ∀x

Proof:
◮ Without loss of generality let us assume μ = 0.
◮ We use the characteristic function of S̃_n for the proof.
◮ Let φ be the characteristic function of the X_i. Then

  φ_{S_n}(t) = (φ(t))^n  and  φ_{S̃_n}(t) = (φ(t/(σ√n)))^n

◮ Recall that we can expand φ in a Taylor series

  φ(u) = Σ_{s=0}^{r−1} ((iu)^s/s!) μ_s + ρ(u) ((iu)^r/r!) μ_r,  ρ(u) → 1 as u → 0

◮ Here we assume EX_i = 0 and EX_i² = σ². Taking r = 2,

  φ(t) = 1 + 0 − ρ(t) (1/2) σ² t²

◮ Hence

  φ(t/(σ√n)) = 1 − (1/2) ρ(t/(σ√n)) σ² t²/(σ²n)
             = 1 − (1/n)(t²/2) + (1/n)(t²/2)(1 − ρ(t/(σ√n)))
             = 1 − (1/n)(t²/2) + o(1/n)
◮ Hence we get

  lim_{n→∞} φ_{S̃_n}(t) = lim_{n→∞} (φ(t/(σ√n)))^n
                        = lim_{n→∞} (1 − (1/n)(t²/2) + o(1/n))^n
                        = e^{−t²/2}

  which is the characteristic function of the standard normal
◮ By the continuity theorem, the distribution function of S̃_n converges to that of a standard normal rv

  lim_{n→∞} P[S̃_n ≤ x] = ∫_{−∞}^x (1/√(2π)) e^{−t²/2} dt, ∀x

◮ What the CLT says is that sums of iid random variables, when appropriately normalized, would always approach the Gaussian distribution.
◮ It allows one to approximate the distribution of sums of independent rv's
◮ Let X_i be iid and S_n = Σ_{i=1}^n X_i

  P[S_n ≤ x] = P[(S_n − nμ)/(σ√n) ≤ (x − nμ)/(σ√n)] ≈ Φ((x − nμ)/(σ√n))

◮ Thus, S_n is well approximated by a normal rv with mean nμ and variance nσ², if n is large
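For Bernoulli(p) summands the characteristic function is known in closed form, so φ_{S̃_n}(t) = e^{−itnμ/(σ√n)}(φ_X(t/(σ√n)))^n can be evaluated directly and compared with the limit e^{−t²/2} (an illustration of the convergence in the proof; p, n and t below are arbitrary values):

```python
import cmath, math

# phi of a single Bernoulli(p): E[e^{iuX}] = (1-p) + p e^{iu}
p = 0.3
mu, sigma = p, math.sqrt(p * (1 - p))
n, t = 10**4, 1.0

u = t / (sigma * math.sqrt(n))
# phi of the standardized sum (S_n - n mu)/(sigma sqrt(n))
phi_std = cmath.exp(-1j * u * n * mu) * ((1 - p) + p * cmath.exp(1j * u))**n
err = abs(phi_std - math.exp(-t**2 / 2))
print(err)
```

The error is of order 1/√n (driven by the third moment), so it is already around 10⁻³ at n = 10⁴.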

Example

◮ Twenty numbers are rounded off to the nearest integer and added. What is the probability that the sum obtained differs from the true sum by more than 3?
◮ A reasonable assumption is that the round-off errors are independent and uniform over [−0.5, 0.5]
◮ Take Z = Σ_{i=1}^{20} X_i, X_i ∼ U[−0.5, 0.5], X_i iid.
◮ Then Z represents the error in the sum.
◮ EX_i = 0 and Var(X_i) = 1/12.
◮ Hence, EZ = 0 and Var(Z) = 20/12 = 5/3

  P[|Z| ≤ 3] = P[−3 ≤ Z ≤ 3]
             = P[−3/√(5/3) ≤ (Z − EZ)/√Var(Z) ≤ 3/√(5/3)]
             ≈ Φ(3/√(5/3)) − Φ(−3/√(5/3))
             ≈ Φ(2.3) − Φ(−2.3)
             = 0.9893 − 0.0107 ≈ 0.98

◮ Hence the probability that the sum differs from the true sum by more than 3 is about 0.02
◮ We can approximate a binomial rv with a Gaussian for large n
◮ A binomial random variable with parameters n, p is a sum of n independent Bernoulli variables:

  S_n = Σ_{i=1}^n X_i;  X_i ∈ {0, 1}, P[X_i = 1] = p, X_i independent

◮ Hence we can approximate the distribution of S_n by

  P[S_n ≤ x] = P[(S_n − np)/√(np(1 − p)) ≤ (x − np)/√(np(1 − p))]
             ≈ Φ((x − np)/√(np(1 − p)))

◮ For large n, a binomial rv is like a Gaussian rv with mean np and variance np(1 − p)
◮ The approximation is quite good in practice
◮ For example, with p = 0.95,

  P[S_110 ≤ 100] ≈ Φ((100 − 110 × 0.95)/√(110 × 0.05 × 0.95)) ≈ Φ(−1.97) = 0.025

◮ Since S_n is integer-valued, the LHS above is the same for all x between two consecutive integers; but the RHS changes
◮ To get a good approximation, to calculate P[S_n ≤ m] one uses P[S_n ≤ m + 0.5] in the above approximation formula
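The slide's example can be checked against the exact binomial probability (a sketch using only the standard library; Φ is computed from the error function):

```python
from math import comb, erf, sqrt

def Phi(x):
    """Standard normal df via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

n, p, m = 110, 0.95, 100
# exact P[S_n <= m] for S_n ~ Binomial(n, p)
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m + 1))
z = lambda x: (x - n * p) / sqrt(n * p * (1 - p))
plain = Phi(z(m))            # without continuity correction
corrected = Phi(z(m + 0.5))  # with continuity correction
print(exact, plain, corrected)
```

The continuity-corrected value is noticeably closer to the exact probability than the plain approximation, which is the point of using m + 0.5.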

◮ The CLT allows one to get the rate of convergence of the law of large numbers
◮ Let X_i be iid, EX_i = μ, Var(X_i) = σ², S_n = Σ_{i=1}^n X_i
◮ By the law of large numbers, S_n/n → μ.
◮ Now, by the CLT,

  P[|S_n/n − μ| > ε] = P[|S_n − nμ| > nε]
                     = P[|S_n − nμ|/(σ√n) > nε/(σ√n)]
                     ≈ 1 − (Φ(√n ε/σ) − Φ(−√n ε/σ))
                     = 2(1 − Φ(√n ε/σ))

  (because Φ(−x) = 1 − Φ(x))

Example: Opinion Polls

◮ Let p denote the fraction of the population that prefers product A to product B
◮ We want to estimate p
◮ We conduct a sample survey by asking n people
◮ We want to make a statement such as: p = 0.34 ± 0.07 with a confidence of 95%
◮ Here, the 0.34 would be the sample mean. The other two numbers can be fixed using the CLT
◮ X_i ∈ {0, 1} iid, EX_i = p, S_n = Σ_{i=1}^n X_i
◮ Now, by the CLT, we have

  P[|S_n/n − p| > ε] = P[|S_n − np| > nε] = 2(1 − Φ(nε/√(np(1 − p))))

◮ Suppose we want to satisfy

  P[|S_n/n − p| > ε] = δ

◮ We can calculate any one of ε, δ or n given the other two using the above equation.
◮ But we need the value of p for it!

◮ Fortunately, √(p(1 − p)) does not change too much with p
◮ It attains its maximum value of 0.5 at p = 0.5
◮ It is 0.458 at p = 0.3 and is 0.4 at p = 0.2
◮ One normally fixes it as 0.5 or 0.45 to calculate the sample size, n.
◮ There are other ways of handling it
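Solving 2(1 − Φ(ε√n/sd)) = δ for n gives n = (z_{1−δ/2} · sd/ε)², which can be computed with the standard library's `statistics.NormalDist` (a sketch; the particular ε, δ values mirror the slide's example):

```python
from math import ceil
from statistics import NormalDist

def sample_size(eps, delta, sd=0.5):
    """Smallest n with 2(1 - Phi(eps*sqrt(n)/sd)) <= delta,
    i.e. n >= (z_{1-delta/2} * sd / eps)^2."""
    z = NormalDist().inv_cdf(1 - delta / 2)
    return ceil((z * sd / eps)**2)

# eps = 0.025, delta = 0.1, with sqrt(p(1-p)) taken as 0.45
n = sample_size(0.025, 0.1, sd=0.45)
print(n)
```

This lands just below 900, consistent with the slide's observation that n = 900 gives a probability of about 0.1.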

◮ We have

  P[|S_n/n − p| > ε] = 2(1 − Φ(ε√n/√(p(1 − p))))

◮ Suppose n = 900 and ε = 0.025. Let us approximate √(p(1 − p)) = 0.45. Then

  2(1 − Φ(0.025 × 30/0.45)) = 2(1 − Φ(1.66)) ≈ 0.1

◮ If we took √(p(1 − p)) = 0.5 we get the value as 0.14
◮ If we use the Chebyshev inequality with the variance taken as 0.5 we get the bound as 0.8
◮ If we change ε to 0.05, then with the variance equal to 0.5 the probability becomes about 0.02 while the Chebyshev bound would be about 0.2

Confidence intervals

◮ Let X_i be iid, EX_i = μ, Var(X_i) = σ², S_n = Σ_{i=1}^n X_i.
◮ Using the CLT, we get

  P[|S_n/n − μ| > c] = 2(1 − Φ(c√n/σ))

◮ If the RHS above is δ, then we can say that S_n/n ∈ [μ − c, μ + c] with probability (1 − δ)
◮ This interval is called the 100(1 − δ)% confidence interval.
◮ We have

  P[|S_n/n − μ| > c] = 2(1 − Φ(c√n/σ))

◮ Suppose c = 1.96σ/√n
◮ Then

  P[|S_n/n − μ| > 1.96σ/√n] = 2(1 − Φ(1.96)) = 0.05

◮ Denoting X̄ = S_n/n, the 95% confidence interval is [X̄ − 1.96σ/√n, X̄ + 1.96σ/√n]
◮ One generally uses an estimate for σ obtained from the X_i
◮ In analyzing any experimental data the confidence intervals or the variance term is important

Central limit theorem

◮ The CLT essentially states that a sum of many independent random variables behaves like a Gaussian random variable
◮ It is very useful in many statistics applications.
◮ We stated the CLT for iid random variables.
◮ While independence is important, all the rv's need not have the same distribution.
◮ Essentially, the variances should not die out.
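The 95% interval X̄ ± 1.96 s/√n is straightforward to compute from data; a minimal sketch on simulated uniforms (the seed, sample size, and distribution are illustrative choices, and s is the usual sample estimate of σ):

```python
import random
from math import sqrt

random.seed(1)  # deterministic illustration
xs = [random.random() for _ in range(10_000)]   # X_i ~ U(0,1), mu = 0.5
n = len(xs)
xbar = sum(xs) / n
s = sqrt(sum((x - xbar)**2 for x in xs) / (n - 1))  # estimate of sigma
half = 1.96 * s / sqrt(n)
ci = (xbar - half, xbar + half)
print(ci)
```

With σ ≈ 1/√12 ≈ 0.289, the half-width is about 0.0057 here; the interval should contain μ = 0.5 in roughly 95% of repetitions.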

Markov Chains

◮ We have been considering sequences X_n, n = 1, 2, · · ·
◮ We have so far considered only the asymptotic properties or limits of such sequences.
◮ Any such sequence is an example of what is called a random process or stochastic process
◮ Given n rv's, they are completely characterized by their joint distribution.
◮ How do we specify or characterize an infinite collection of random variables?
◮ We need the joint distribution of every finite subcollection of them.

◮ Let X_n, n = 0, 1, · · · be a sequence of discrete random variables taking values in S. Note that S would be countable
◮ We say it is a Markov chain if

  P[X_{n+1} = x_{n+1} | X_n = x_n, X_{n−1} = x_{n−1}, · · · , X_0 = x_0] = P[X_{n+1} = x_{n+1} | X_n = x_n], ∀x_i

◮ Conditioned on X_n, X_{n+1} is independent of X_{n−1}, X_{n−2}, · · ·
◮ We think of X_n as the state at time n
◮ For a Markov chain, given the current state, the future evolution is independent of the history of how you reached the current state
Example

◮ Let the X_i be iid discrete rv's taking integer values. Let Y_0 = 0 and Y_n = Σ_{i=1}^n X_i
◮ Y_n, n = 0, 1, · · · is a Markov chain with state space the integers
◮ Note that Y_{n+1} = Y_n + X_{n+1} and X_{n+1} is independent of Y_0, · · · , Y_n:

  P[Y_{n+1} = y | Y_n = x, Y_{n−1}, · · · ] = P[X_{n+1} = y − x]

◮ Thus, Y_{n+1} is conditionally independent of Y_{n−1}, · · · conditioned on Y_n

◮ In this example, we can think of X_n as the number of people or things arriving at a facility in the nth time interval.
◮ Then Y_n would be the total arrivals till the end of the nth time interval.
◮ The number of packets coming into a network switch, the number of people joining the queue in a bank, the number of infections till date are all Markov chains.
◮ This is a useful model for many dynamic systems or processes

◮ The Markov property is: given the current state, the future evolution is independent of the history of how we came to the current state.
◮ It essentially means the current state contains all needed information about the history
◮ We are considering the case where states as well as time are discrete.
◮ It can be more general and we discuss some of that later

Transition Probabilities

◮ Let {X_n, n = 0, 1, · · · } be a Markov Chain with (countable) state space S

  Pr[X_{n+1} = x_{n+1} | X_n = x_n, X_{n−1}, · · · , X_0] = Pr[X_{n+1} = x_{n+1} | X_n = x_n], ∀x_i

  (Notice the change of notation)
◮ Define the function P : S × S → [0, 1] by

  P(x, y) = Pr[X_{n+1} = y | X_n = x]

◮ P is called the state transition probability function. It satisfies
  ◮ P(x, y) ≥ 0, ∀x, y ∈ S
  ◮ Σ_{y∈S} P(x, y) = 1, ∀x ∈ S
◮ If S is finite then P can be represented as a matrix
◮ The state transition probability function is given by

  P(x, y) = Pr[X_{n+1} = y | X_n = x]

◮ In general, this can depend on n though our notation does not show it
◮ If the value of that probability does not depend on n then the chain is called homogeneous
◮ For a homogeneous chain we have

  Pr[X_{n+1} = y | X_n = x] = Pr[X_1 = y | X_0 = x], ∀n

◮ In this course we will consider only homogeneous chains

Initial State Probabilities

◮ Let {X_n} be a Markov Chain with state space S
◮ Define the function π_0 : S → [0, 1] by

  π_0(x) = Pr[X_0 = x]

◮ It is the pmf of the rv X_0
◮ Hence it satisfies
  ◮ π_0(x) ≥ 0, ∀x ∈ S
  ◮ Σ_{x∈S} π_0(x) = 1
◮ From now on, without loss of generality, we take S = {0, 1, · · · }

◮ Let X_n be a (homogeneous) Markov chain
◮ Then we have

  Pr[X_0 = x_0, X_1 = x_1] = Pr[X_1 = x_1 | X_0 = x_0] Pr[X_0 = x_0] = π_0(x_0)P(x_0, x_1), ∀x_0, x_1

◮ Now we can extend this as

  Pr[X_0 = x_0, X_1 = x_1, X_2 = x_2] = Pr[X_2 = x_2 | X_1 = x_1, X_0 = x_0] Pr[X_0 = x_0, X_1 = x_1]
                                      = Pr[X_2 = x_2 | X_1 = x_1] Pr[X_0 = x_0, X_1 = x_1]
                                      = π_0(x_0)P(x_0, x_1)P(x_1, x_2)

◮ This calculation is easily generalized to any number of time steps

  Pr[X_0 = x_0, · · · , X_n = x_n] = Pr[X_n = x_n | X_{n−1} = x_{n−1}, · · · , X_0 = x_0] Pr[X_{n−1} = x_{n−1}, · · · , X_0 = x_0]
                                  = Pr[X_n = x_n | X_{n−1} = x_{n−1}] Pr[X_{n−1} = x_{n−1}, · · · , X_0 = x_0]
                                  = P(x_{n−1}, x_n) Pr[X_{n−1} = x_{n−1}, · · · , X_0 = x_0]
                                  ...
                                  = π_0(x_0)P(x_0, x_1) · · · P(x_{n−1}, x_n)
◮ We showed

  Pr[X_0 = x_0, · · · , X_n = x_n] = π_0(x_0)P(x_0, x_1) · · · P(x_{n−1}, x_n)

◮ This shows that the transition probabilities, P, and the initial state probabilities, π_0, completely specify the chain.
◮ They give us the joint distribution of any finite subcollection of the rv's
◮ Suppose you want the joint distribution of X_{i_1}, · · · , X_{i_k}
◮ Let m = max(i_1, · · · , i_k)
◮ We know how to get the joint distribution of X_0, · · · , X_m.
◮ The joint distribution of X_{i_1}, · · · , X_{i_k} is now calculated as a marginal distribution from the joint distribution of X_0, · · · , X_m

Recap: Central Limit Theorem

◮ Given X_i iid, EX_i = μ, Var(X_i) = σ², S_n = Σ_{i=1}^n X_i
◮ Let S̃_n = (S_n − ES_n)/√Var(S_n) = (S_n − nμ)/(σ√n)
◮ (Lindberg-Levy) Central Limit Theorem:

  lim_{n→∞} P[S̃_n ≤ x] = lim_{n→∞} P[(S_n − nμ)/(σ√n) ≤ x] = ∫_{−∞}^x (1/√(2π)) e^{−t²/2} dt, ∀x

◮ It allows us to approximate distributions of sums of independent random variables

  P[S_n ≤ x] ≈ Φ((x − nμ)/(σ√n))

◮ For example, a binomial rv is well approximated by a normal for large n
◮ The CLT is also important to get information on the rate of convergence of the law of large numbers.

Recap: Markov Chain

◮ Let X_n, n = 0, 1, · · · be a sequence of discrete random variables taking values in S.
◮ We say it is a Markov chain if

  Pr[X_{n+1} = x_{n+1} | X_n = x_n, X_{n−1} = x_{n−1}, · · · , X_0 = x_0] = Pr[X_{n+1} = x_{n+1} | X_n = x_n], ∀x_i

◮ We can write it as

  f_{X_{n+1}|X_n,··· ,X_0}(x_{n+1}|x_n, · · · , x_0) = f_{X_{n+1}|X_n}(x_{n+1}|x_n), ∀x_i

◮ For a Markov chain, given the current state, the future evolution is independent of the history of how you reached the current state

Recap: Transition Probabilities

◮ Let {X_n, n = 0, 1, · · · } be a Markov Chain with (countable) state space S
◮ The transition probability function is P : S × S → [0, 1]

  P(x, y) = Pr[X_{n+1} = y | X_n = x]

  The chain is said to be homogeneous when this is not a function of time.
◮ It satisfies
  ◮ P(x, y) ≥ 0, ∀x, y ∈ S
  ◮ Σ_{y∈S} P(x, y) = 1, ∀x ∈ S
◮ If S is finite then P can be represented as a matrix
Recap: Initial State Probabilities

◮ Let {X_n} be a Markov Chain with state space S
◮ Initial state probabilities π_0 : S → [0, 1]

  π_0(x) = Pr[X_0 = x]

◮ It satisfies
  ◮ π_0(x) ≥ 0, ∀x ∈ S
  ◮ Σ_{x∈S} π_0(x) = 1

Recap

◮ The P and π_0 determine all joint distributions

  Pr[X_0 = x_0, · · · , X_n = x_n] = Pr[X_n = x_n | X_{n−1} = x_{n−1}, · · · , X_0 = x_0] Pr[X_{n−1} = x_{n−1}, · · · , X_0 = x_0]
                                  = Pr[X_n = x_n | X_{n−1} = x_{n−1}] Pr[X_{n−1} = x_{n−1}, · · · , X_0 = x_0]
                                  = P(x_{n−1}, x_n) Pr[X_{n−1} = x_{n−1}, · · · , X_0 = x_0]
                                  ...
                                  = π_0(x_0)P(x_0, x_1) · · · P(x_{n−1}, x_n)

◮ We showed

  Pr[X_0 = x_0, · · · , X_n = x_n] = π_0(x_0)P(x_0, x_1) · · · P(x_{n−1}, x_n)

◮ This shows P and π_0 determine the joint distribution of X_0, · · · , X_m for any m
◮ Suppose you want the joint distribution of X_{i_1}, · · · , X_{i_k}
◮ Let m = max(i_1, · · · , i_k)
◮ We know how to get the joint distribution of X_0, · · · , X_m.
◮ The joint distribution of X_{i_1}, · · · , X_{i_k} is now calculated as a marginal from the joint distribution of X_0, · · · , X_m
◮ This shows that the transition probabilities, P, and the initial state probabilities, π_0, completely specify the chain.

Example: 2-state chain

◮ Let S = {0, 1}.
◮ We can write the transition probabilities as a matrix

  P = [ P(0,0)  P(0,1) ]  =  [ 1−p    p  ]
      [ P(1,0)  P(1,1) ]     [  q    1−q ]

◮ Now we can calculate the joint distribution, e.g., of X_1, X_2 as

  Pr[X_1 = 0, X_2 = 1] = Σ_{x=0}^1 Pr[X_0 = x, X_1 = 0, X_2 = 1]
                       = Σ_{x=0}^1 π_0(x)P(x, 0)P(0, 1)
                       = π_0(0)(1 − p)p + π_0(1)qp
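The 2-state calculation can be replicated numerically (an illustration; the values of p, q and π_0 are arbitrary) by summing over the unobserved initial state and comparing with the closed form:

```python
p, q = 0.2, 0.4
P = [[1 - p, p], [q, 1 - q]]   # transition matrix of the 2-state chain
pi0 = [0.3, 0.7]               # initial distribution (illustrative values)

# Pr[X1 = 0, X2 = 1]: sum over x0, using pi0(x0) P(x0,0) P(0,1)
brute = sum(pi0[x0] * P[x0][0] * P[0][1] for x0 in (0, 1))
closed = pi0[0] * (1 - p) * p + pi0[1] * q * p
print(brute, closed)
```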
◮ We can similarly calculate the probabilities of any events involving these random variables, e.g.,

  Pr[X_2 ≠ X_0] = Pr[X_2 = 0, X_0 = 1] + Pr[X_2 = 1, X_0 = 0]
                = Σ_{x=0}^1 (π_0(1)P(1, x)P(x, 0) + π_0(0)P(0, x)P(x, 1))

◮ We have the formula

  Pr[X_0 = x_0, · · · , X_n = x_n] = π_0(x_0)P(x_0, x_1) · · · P(x_{n−1}, x_n)

◮ This can easily be seen through a graphical notation.

◮ Consider the 2-state chain with S = {0, 1} and

  P = [ 1−p    p  ]
      [  q    1−q ]

◮ We can represent the chain through a graph: the nodes represent the states, and the edges show the possible transitions and their probabilities. (Here: a self-loop at 0 with probability 1−p, an edge 0 → 1 with probability p, a self-loop at 1 with probability 1−q, and an edge 1 → 0 with probability q.)
◮ For example,

  Pr[X_0 = 0, X_1 = 1, X_2 = 1, X_3 = 0] = π_0(0)p(1 − q)q

An example

◮ A man has 4 umbrellas. He carries them from home to office and back when needed. The probability of rain in the morning and in the evening is the same, namely, p.
◮ What should be the state?
◮ Take the state to be the number of umbrellas at the current location: S = {0, 1, · · · , 4}. The transition probabilities are

          0     1     2     3     4
    0  (  0     0     0     0     1  )
    1  (  0     0     0    1−p    p  )
    2  (  0     0    1−p    p     0  )
    3  (  0    1−p    p     0     0  )
    4  ( 1−p    p     0     0     0  )

Birth-Death chain

◮ The following Markov chain is known as a birth-death chain: from each state i the chain moves to i+1 with probability p and to i−1 with probability q (the figure shows states · · · , i−1, i, i+1, · · · with probability p on the rightward edges and q on the leftward edges)
◮ In general, birth-death chains may have self-loops on states
◮ Random walk: X_i ∈ {−1, +1}, iid, S_n = Σ_{i=1}^n X_i
◮ We can have a ‘reflecting boundary’ at 0
◮ Queuing chains can also be birth-death chains
◮ We can have birth-death chains with a finite state space also: states 0, 1, · · · , N, with probability p of moving right and q of moving left (the figure shows such a chain with reflecting boundaries)
◮ Such a chain keeps visiting all the states again and again

Gambler's Ruin chain

◮ The following chain is called the Gambler's ruin chain: states 0, 1, · · · , N; from an interior state i the chain moves to i+1 with probability p and to i−1 with probability 1−p; states 0 and N are absorbing (self-loop with probability 1)
◮ Here, the chain is ultimately absorbed either in 0 or in N
◮ Here the state can be the current funds that the gambler has

◮ The transition probabilities we defined earlier are also called one step transition probabilities

  P(x, y) = Pr[X_{n+1} = y | X_n = x] = Pr[X_1 = y | X_0 = x]

◮ We can define transition probabilities for multiple steps, that is, Pr[X_n = y | X_0 = x]
◮ We first look at one consequence of the Markov property
◮ The Markov property implies that it is only the most recent past that matters. For example

  Pr[X_{n+m} = y | X_n = x, X_0] = Pr[X_{n+m} = y | X_n = x]

◮ We consider a simple case:

  Pr[X_3 = y | X_1 = x, X_0 = z] = Pr[X_3 = y, X_1 = x, X_0 = z] / Pr[X_1 = x, X_0 = z]
                                 = Σ_w π_0(z)P(z, x)P(x, w)P(w, y) / (π_0(z)P(z, x))
                                 = Σ_w P(x, w)P(w, y)

◮ We also have

  Pr[X_3 = y | X_1 = x] = Pr[X_2 = y | X_0 = x] = Σ_w π_0(x)P(x, w)P(w, y) / π_0(x) = Σ_w P(x, w)P(w, y)

◮ Thus we get

  Pr[X_3 = y | X_1 = x, X_0 = z] = Pr[X_3 = y | X_1 = x]
◮ Using similar algebra, we can show that

  Pr[X_{m+n} = y | X_m = x, X_0 = z] = Pr[X_{m+n} = y | X_m = x] = Pr[X_n = y | X_0 = x]

◮ Or, in general,

  f_{X_{m+n}|X_m,··· ,X_0}(y|x, · · · ) = f_{X_{m+n}|X_m}(y|x)

◮ Using the same algebra, we can show

  Pr[X_{m+n} = y | X_m = x, X_{m−k} ∈ A_k, k = 1, · · · , m] = Pr[X_{m+n} = y | X_m = x]

  Pr[X_{m+n+r} ∈ B_r, r = 0, · · · , s | X_m = x, X_{m−k} ∈ A_k, k = 1, · · · , m]
    = Pr[X_{m+n+r} ∈ B_r, r = 0, · · · , s | X_m = x]

◮ Now we get

  Pr[X_{m+n} = y | X_0 = x] = Σ_z Pr[X_{m+n} = y, X_m = z | X_0 = x]
                            = Σ_z Pr[X_{m+n} = y | X_m = z, X_0 = x] Pr[X_m = z | X_0 = x]
                            = Σ_z Pr[X_{m+n} = y | X_m = z] Pr[X_m = z | X_0 = x]
                            = Σ_z Pr[X_n = y | X_0 = z] Pr[X_m = z | X_0 = x]

Chapman-Kolmogorov Equations

◮ Define: P^n(x, y) = Pr[X_n = y | X_0 = x]
◮ These are called n-step transition probabilities.
◮ From what we showed, n-step transition probabilities satisfy

      P^{m+n}(x, y) = Σ_z P^m(x, z) P^n(z, y)

◮ These are known as Chapman-Kolmogorov equations
◮ This relationship is intuitively clear
◮ Specifically, using Chapman-Kolmogorov equations,

      P^2(x, y) = Σ_z P(x, z) P(z, y)

◮ For a finite chain, P is a matrix
◮ Thus P^2(x, y) is the (x, y)th element of the matrix P × P
◮ That is why we use P^n for n-step transition probabilities
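The Chapman-Kolmogorov relation is exactly matrix multiplication, which is easy to check numerically. The sketch below is not part of the slides; the 3-state matrix is just an illustrative chain.

```python
import numpy as np
from numpy.linalg import matrix_power

# an illustrative 3-state transition matrix (rows sum to 1)
P = np.array([[0.75, 0.25, 0.0],
              [0.5,  0.0,  0.5],
              [0.0,  0.75, 0.25]])

m, n = 3, 4
# Chapman-Kolmogorov: P^(m+n)(x,y) = sum_z P^m(x,z) P^n(z,y)
lhs = matrix_power(P, m + n)
rhs = matrix_power(P, m) @ matrix_power(P, n)
print(np.allclose(lhs, rhs))            # True
print(np.allclose(lhs.sum(axis=1), 1))  # each row of P^n is again a distribution
```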
◮ Define: π_n(x) = Pr[X_n = x]
◮ Then we get

      π_n(y) = Σ_x Pr[X_n = y | X_0 = x] Pr[X_0 = x] = Σ_x π_0(x) P^n(x, y)

◮ In particular

      π_{n+1}(y) = Σ_x Pr[X_{n+1} = y | X_n = x] Pr[X_n = x] = Σ_x π_n(x) P(x, y)

Hitting times

◮ Let y be a state.
◮ We define the hitting time for y as the random variable

      T_y = min{n > 0 : X_n = y}

◮ T_y is the first time that the chain is in state y (after t = 0 when the chain is initiated).
◮ It is easy to see that Pr[T_y = 1 | X_0 = x] = P(x, y).
◮ We often need conditional probability conditioned on the initial state.
◮ Notation: P_z(A) = Pr[A | X_0 = z]
◮ We write the above as P_x(T_y = 1) = P(x, y)

      T_y = min{n > 0 : X_n = y}

◮ We can now get

      P_x(T_y = 2) = Σ_{z≠y} P(x, z) P(z, y) = Σ_{z≠y} P(x, z) P_z(T_y = 1)

      P_x(T_y = m) = Pr[T_y = m | X_0 = x]
                   = Σ_{z≠y} Pr[T_y = m | X_1 = z, X_0 = x] Pr[X_1 = z | X_0 = x]
                   = Σ_{z≠y} P(x, z) Pr[T_y = m | X_1 = z]
                   = Σ_{z≠y} P(x, z) P_z(T_y = m − 1)

◮ Similarly we can get:

      P^n(x, y) = Σ_{m=1}^{n} P_x(T_y = m) P^{n−m}(y, y)

◮ We can derive this as

      P^n(x, y) = Pr[X_n = y | X_0 = x]
                = Σ_{m=1}^{n} Pr[T_y = m, X_n = y | X_0 = x]
                = Σ_{m=1}^{n} Pr[X_n = y | T_y = m, X_0 = x] Pr[T_y = m | X_0 = x]
                = Σ_{m=1}^{n} Pr[X_n = y | X_m = y] Pr[T_y = m | X_0 = x]
                = Σ_{m=1}^{n} P^{n−m}(y, y) P_x(T_y = m)
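The recursion for P_x(T_y = m) and this first-passage decomposition can be verified numerically. The sketch below (not from the slides) implements the recursion by zeroing out the column for y, on an arbitrary 3-state chain.

```python
import numpy as np
from numpy.linalg import matrix_power

P = np.array([[0.75, 0.25, 0.0],   # an illustrative 3-state chain
              [0.5,  0.0,  0.5],
              [0.0,  0.75, 0.25]])
y = 2
n = 6

# f[m, x] = Px(Ty = m), built from the recursion in the slides
f = np.zeros((n + 1, 3))
f[1] = P[:, y]
Q = P.copy()
Q[:, y] = 0.0                       # transitions that avoid y
for m in range(2, n + 1):
    f[m] = Q @ f[m - 1]             # sum over z != y of P(x,z) Pz(Ty = m-1)

# first-passage decomposition: P^n(x,y) = sum_m Px(Ty=m) P^{n-m}(y,y)
lhs = matrix_power(P, n)[:, y]
rhs = sum(f[m] * matrix_power(P, n - m)[y, y] for m in range(1, n + 1))
print(np.allclose(lhs, rhs))  # True
```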
transient and recurrent states

◮ Define ρ_xy = P_x(T_y < ∞).
◮ It is the probability that starting in x you will visit y
◮ Note that

      ρ_xy = lim_{n→∞} P_x(T_y < n) = Σ_{n=1}^{∞} P_x(T_y = n)

◮ Definition: A state y is called transient if ρ_yy < 1; it is called recurrent if ρ_yy = 1.
◮ Intuitively, all transient states would be visited only finitely many times while recurrent states are visited infinitely often.
◮ For any state y define

      I_y(X_n) = 1 if X_n = y;  0 otherwise

◮ Now, the total number of visits to y is given by

      N_y = Σ_{n=1}^{∞} I_y(X_n)

◮ We can get the distribution of N_y as

      P_x(N_y ≥ 1) = P_x(T_y < ∞) = ρ_xy
      P_x(N_y ≥ 2) = Σ_m P_x(T_y = m) P_y(T_y < ∞) = ρ_yy Σ_m P_x(T_y = m) = ρ_yy ρ_xy
      P_x(N_y ≥ m) = ρ_xy ρ_yy^{m−1}
      P_x(N_y = m) = P_x(N_y ≥ m) − P_x(N_y ≥ m+1) = ρ_xy ρ_yy^{m−1} − ρ_xy ρ_yy^{m} = ρ_xy ρ_yy^{m−1}(1 − ρ_yy)
      P_x(N_y = 0) = 1 − P_x(N_y ≥ 1) = 1 − ρ_xy

◮ Notation: E_x[Z] = E[Z | X_0 = x]
◮ Define

      G(x, y) ≜ E_x[N_y] = E_x[ Σ_{n=1}^{∞} I_y(X_n) ] = Σ_{n=1}^{∞} E_x[I_y(X_n)] = Σ_{n=1}^{∞} P^n(x, y)

◮ G(x, y) is the expected number of visits to y for a chain that is started in x.

Theorem:
(i) Let y be transient. Then

      P_x(N_y < ∞) = 1, ∀x  and  G(x, y) = ρ_xy / (1 − ρ_yy) < ∞, ∀x

(ii) Let y be recurrent. Then

      P_y[N_y = ∞] = 1,  and  G(y, y) = E_y[N_y] = ∞

      P_x[N_y = ∞] = ρ_xy,  and  G(x, y) = 0 if ρ_xy = 0;  ∞ if ρ_xy > 0
Recap: Markov Chain Recap: Transition Probabilities
◮ Transition probability function is P : S × S → [0, 1]
◮ Let Xn , n = 0, 1, · · · be a sequence of discrete random
variables taking values in S. P (x, y) = P r[Xn+1 = y|Xn = x]
◮ We say it is a Markov chain if
The chain is said to be homogeneous when this is not a
P r[Xn+1 = xn+1 |Xn = xn , Xn−1 = xn−1 · · · X0 = x0 ] = P r[Xn+1 = xn+1 |Xn = xn ], function of time.
◮ We can write it as ◮ For a homogeneous chain

fXn+1 |Xn ,···X0 (xn+1 |xn , · · · , x0 ) = fXn+1 |Xn (xn+1 |xn ), ∀xi P r[Xn+1 = y|Xn = x] = P r[X1 = y|X0 = x], ∀n

◮ For a Markov chain, given the current state, the future ◮ P satisfies
evolution is independent of the history of how you ◮ P
P(x, y) ≥ 0, ∀x, y ∈ S
reached the current state y∈S P (x, y) = 1, ∀x ∈ S

◮ If S is finite then P can be represented as a matrix


Recap: Initial State Probabilities Recap


◮ The Markov property implies
P r[Xm+n = y|Xm = x, X0 = z] = P r[Xm+n = y|Xm = x]
= P r[Xn = y|X0 = x]
◮ Initial state probabilities π0 : S → [0, 1]
◮ Or, in general,
π0 (x) = P r[X0 = x] fXm+n |Xm ,··· ,X0 (y|x, · · · ) = fXm+n |Xm (y|x)
It satisfies ◮ Further, we can show
◮ π
P0 (x) ≥ 0, ∀x ∈ S Pr[Xm+n = y|Xm = x, Xm−k ∈ Ak , k = 1, · · · , m] =
x∈S π0 (x) = 1

P r[Xm+n = y|Xm = x]
◮ The P and π0 together determine all joint distributions

P r[Xm+n+r ∈ Br , r = 0, · · · , s | Xm = x, Xm−k ∈ Ak , k = 1, · · · , m]
= P r[Xm+n+r ∈ Br , r = 0, · · · , s | Xm = x]

Recap: Chapman-Kolmogorov Equations Recap: Hitting times
◮ The n-step transition probabilities are defined by
◮ We define hitting time for y as the random variable
n
P (x, y) = P r[Xn = y|X0 = x]
Ty = min{n > 0 : Xn = y}
◮ These n-step transition probabilities satisfy
X ◮ Using this defintion, we can derive
P m+n (x, y) = P m (x, z)P n (z, y) X
z Px (Ty = m) = P (x, z)Pz (Ty = m − 1)
z6=y
◮ These are known as Chapman-Kolmogorov equations
◮ For a finite chain, the n-step transition probability matrix (Notation: Pz (A) = P r[A|X0 = z])
is n-fold product of the transition probability matrix
n
◮ We also have n
X
P (x, y) = Px (Ty = m)P n−m (y, y)
X
πn (x) , P r[Xn = x] = π0 (x)P n (x, y) m=1


Recap: transient and recurrent states Recap


◮ For any state y define
◮ Define ρxy = Px (Ty < ∞). 
◮ It is the probability that starting in x you will visit y 1 if Xn = y
Iy (Xn ) =
◮ Note that 0 otherwise

X ◮ The total number of visits to y is given by
ρxy = lim Px (Ty < n) = Px (Ty = n)
n→∞ ∞
n=1 X
N (y) = Iy (Xn )
Definition: A state y is called transient if ρyy < 1; it is n=1
called recurrent if ρyy = 1.
◮ We can get distribution of N (y) as
◮ Intuitively, all transient states would be visited only
finitely many times while recurrent states are visited m−1
Px (N (y) = m) = ρxy ρyy (1 − ρyy ), m ≥ 1
infinitely often.
and Px (N (y) = 0) = 1 − ρxy

Recap
Theorem:
(i). Let y be transient. Then
◮ Notation: Ex [Z] = E[Z|X0 = x]
◮ Define ρxy
Px (N (y) < ∞) = 1, ∀x and G(x, y) = < ∞, ∀x
1 − ρyy
G(x, y) , Ex [N (y)]
X∞ (ii) Let y be recurrent. Then
= Ex [Iy (Xn )]
n=1 Py [N (y) = ∞] = 1, and G(y, y) = Ey [N (y)] = ∞
X∞
= P n (x, y) 
n=1 0 if ρxy = 0
Px [N (y) = ∞] = ρxy , and G(x, y) =
∞ if ρxy > 0
◮ G(x, y) is the expected number of visits to y for a chain
that is started in x.


Proof of (ii):
Proof of (i): y is transient; ρ_yy < 1

      G(x, y) = E_x[N(y)] = Σ_m m P_x[N(y) = m]
              = Σ_m m ρ_xy ρ_yy^{m−1}(1 − ρ_yy)
              = ρ_xy Σ_m m ρ_yy^{m−1}(1 − ρ_yy)
              = ρ_xy · 1/(1 − ρ_yy) < ∞, because ρ_yy < 1

      ⇒ P_x[N(y) < ∞] = 1

Proof of (ii): y recurrent ⇒ ρ_yy = 1. Hence

      P_y[N(y) ≥ m] = ρ_yy^m = 1, ∀m
      ⇒ P_y[N(y) = ∞] = lim_{m→∞} P_y[N(y) ≥ m] = 1
      ⇒ G(y, y) = E_y[N(y)] = ∞

      P_x[N(y) ≥ m] = ρ_xy ρ_yy^{m−1} = ρ_xy, ∀m
      Hence P_x[N(y) = ∞] = ρ_xy
      ρ_xy = 0 ⇒ P_x[N(y) ≥ m] = 0, ∀m > 0 ⇒ G(x, y) = 0
      ρ_xy > 0 ⇒ P_x[N(y) = ∞] > 0 ⇒ G(x, y) = ∞
◮ Transient states are visited only finitely many times while recurrent states are visited infinitely often
◮ If S is finite, it should have at least one recurrent state
◮ If y is transient, then, for all x,

      G(x, y) = Σ_{n=1}^{∞} P^n(x, y) < ∞  ⇒  lim_{n→∞} P^n(x, y) = 0

◮ However, Σ_y P^n(x, y) = 1, ∀n, ∀x
◮ If all y ∈ S are transient (and S is finite, so the limit can be taken inside the sum), then we get a contradiction

      1 = lim_{n→∞} Σ_{y∈S} P^n(x, y) = Σ_{y∈S} lim_{n→∞} P^n(x, y) = 0

◮ A finite chain has to have at least one recurrent state
◮ An infinite chain can have only transient states

◮ We say, x leads to y if ρ_xy > 0
◮ Theorem: If x is recurrent and x leads to y then y is recurrent and ρ_xy = ρ_yx = 1.
Proof:
◮ Take x ≠ y, wlog. Since ρ_xy > 0, ∃n s.t. P^n(x, y) > 0
◮ Take the least such n. Then we have states y_1, · · · , y_{n−1}, none of which is x (or y), such that

      P(x, y_1) P(y_1, y_2) · · · P(y_{n−1}, y) > 0

◮ Now suppose ρ_yx < 1. Then

      P(x, y_1) P(y_1, y_2) · · · P(y_{n−1}, y)(1 − ρ_yx) > 0

  is the probability of starting in x but not returning to x.
◮ But this cannot be because x is recurrent and hence ρ_xx = 1
◮ Hence, if x is recurrent and x leads to y then ρ_yx = 1

◮ Now, ∃n_0, n_1 s.t. P^{n_0}(x, y) > 0, P^{n_1}(y, x) > 0.

      P^{n_1+n+n_0}(y, y) = P_y[X_{n_1+n+n_0} = y]
                          ≥ P_y[X_{n_1} = x, X_{n_1+n} = x, X_{n_1+n+n_0} = y]
                          = P^{n_1}(y, x) P^n(x, x) P^{n_0}(x, y), ∀n

◮ We know G(x, x) = Σ_{m=1}^{∞} P^m(x, x) = ∞

      Σ_{m=1}^{∞} P^m(y, y) ≥ Σ_{m=n_0+n_1+1}^{∞} P^m(y, y) = Σ_{n=1}^{∞} P^{n_1+n+n_0}(y, y)
                            ≥ Σ_{n=1}^{∞} P^{n_1}(y, x) P^n(x, x) P^{n_0}(x, y)
                            = ∞, because x is recurrent

      ⇒ y is recurrent

◮ What we showed so far is: if x leads to y and x is recurrent, then ρ_yx = 1 and y is recurrent.
◮ Now, y is recurrent and y leads to x and hence ρ_xy = 1.
◮ This completes the proof of the theorem
equivalence relation

◮ Let R be a relation on a set A. Note R ⊂ A × A
◮ R is called an equivalence relation if it is
  1. reflexive, i.e., (x, x) ∈ R, ∀x ∈ A
  2. symmetric, i.e., (x, y) ∈ R ⇒ (y, x) ∈ R
  3. transitive, i.e., (x, y), (y, z) ∈ R ⇒ (x, z) ∈ R

example

◮ Let A = { m/n | m, n are integers }
◮ Define the relation R by

      (m/n, p/q) ∈ R if mq = np

◮ This is the usual equality of fractions
◮ Easy to check it is an equivalence relation.

Equivalence classes

◮ Let R be an equivalence relation on A.
◮ Then, A can be partitioned as A = C_1 + C_2 + · · · where the C_i satisfy
  ◮ x, y ∈ C_i ⇒ (x, y) ∈ R, ∀i
  ◮ x ∈ C_i, y ∈ C_j, i ≠ j ⇒ (x, y) ∉ R
◮ In our example, each equivalence class corresponds to a rational number.
◮ Here, C_i contains all fractions that are equal to that rational number

◮ The state space of any Markov chain can be partitioned into the transient and recurrent states: S = S_T + S_R:

      S_T = {y ∈ S : ρ_yy < 1}    S_R = {y ∈ S : ρ_yy = 1}

◮ On S_R, consider the relation: ‘x leads to y’ (i.e., x is related to y if ρ_xy > 0)
◮ This is an equivalence relation
  ◮ ρ_xx > 0, ∀x ∈ S_R
  ◮ ρ_xy > 0 ⇒ ρ_yx > 0, ∀x, y ∈ S_R
  ◮ ρ_xy > 0, ρ_yz > 0 ⇒ ρ_xz > 0
◮ Hence we get a partition: S_R = C_1 + C_2 + · · · where the C_i are equivalence classes.
◮ On S_R, “x leads to y” is an equivalence relation.
◮ This gives rise to the partition S_R = C_1 + C_2 + · · ·
◮ Since the C_i are equivalence classes, they satisfy:
  ◮ x, y ∈ C_i ⇒ x leads to y
  ◮ x ∈ C_i, y ∈ C_j, i ≠ j ⇒ ρ_xy = 0
◮ All states in any C_i lead to each other or communicate with each other
◮ If i ≠ j and x ∈ C_i and y ∈ C_j, then ρ_xy = ρ_yx = 0. x and y do not communicate with each other.

◮ A set of states, C ⊂ S, is said to be irreducible if x leads to y for all x, y ∈ C
◮ An irreducible set is also called a communicating class
◮ A set of states, C ⊂ S, is said to be closed if x ∈ C, y ∉ C implies ρ_xy = 0.
◮ Once the chain visits a state in a closed set, it cannot leave that set.
◮ We get a partition of recurrent states

      S_R = C_1 + C_2 + · · ·

  where each C_i is a closed and irreducible set of states.
◮ If S is irreducible then the chain is said to be irreducible. (Note that S is trivially closed)

◮ In an irreducible set of states, if one state is recurrent, then all states are recurrent.
◮ We saw that a finite chain has to have at least one recurrent state.
◮ Thus, a finite irreducible chain is recurrent.
◮ For example, in the umbrellas problem, the chain is irreducible and hence all states are recurrent.
◮ An infinite irreducible chain may be wholly transient
◮ Here is a trivial example of a non-irreducible transient chain:

      0 → 1 → 2 → 3 → · · ·

◮ The state space of any Markov chain can be partitioned into transient and recurrent states.
◮ We need not calculate ρ_xx to do this partition.
◮ By looking at the structure of the transition probability matrix we can get this partition
Example

  [Figure: transition diagram on states 0, 1, 2, 3, 4, 5]

  The pattern of positive entries of P (+ positive, − zero):

          0 1 2 3 4 5
      0 [ + − − − − − ]
      1 [ + + + − − − ]
      2 [ − + + + − + ]
      3 [ − − − + + − ]
      4 [ − − − − + + ]
      5 [ − − − + − + ]

◮ State 0 is called an absorbing state. {0} is a closed irreducible set.
◮ 1, 2 are transient states.
◮ We get: S_T = {1, 2} and S_R = {0} + {3, 4, 5}

◮ If you start the chain in a recurrent state it will stay in the corresponding closed irreducible set
◮ If you start in one of the transient states, it would eventually get ‘absorbed’ in one of the closed irreducible sets of recurrent states.
◮ We want to know the probabilities of ending up in different sets.
◮ We want to know how long you stay in transient states
◮ We want to know what is the ‘steady state’?
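The partition in this example can be computed mechanically from the positivity pattern of P via mutual reachability. This sketch is not from the slides; `communicating_classes` is my own helper, and deciding which classes are closed (hence recurrent) would need a further check.

```python
import numpy as np

def communicating_classes(P):
    """Partition states by mutual reachability (x leads to y AND y leads to x)."""
    n = len(P)
    reach = (P > 0) | np.eye(n, dtype=bool)
    # transitive closure of the one-step reachability relation (Warshall)
    for k in range(n):
        reach |= reach[:, k:k+1] & reach[k:k+1, :]
    classes, seen = [], set()
    for x in range(n):
        if x in seen:
            continue
        cls = {y for y in range(n) if reach[x, y] and reach[y, x]}
        seen |= cls
        classes.append(sorted(cls))
    return classes

# the 6-state example: '+' entries of P encoded as 1
P = np.array([[1, 0, 0, 0, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 1],
              [0, 0, 0, 1, 1, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1]], dtype=float)
print(communicating_classes(P))  # [[0], [1, 2], [3, 4, 5]]
```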

◮ Let C be a closed irreducible set of recurrent states
◮ T_C – the hitting time for C:

      T_C = min{n > 0 : X_n ∈ C}

  It is the first time instant when the chain is in C
◮ Define ρ_C(x) = P_x[T_C < ∞]
◮ If x is recurrent, ρ_C(x) = 1 if x ∈ C and ρ_C(x) = 0 if x ∉ C, because each recurrent x is in a closed irreducible set
◮ Suppose x is transient. Then

      ρ_C(x) = Σ_{y∈C} P(x, y) + Σ_{y∈S_T} P(x, y) ρ_C(y)

◮ By solving this set of linear equations we can get ρ_C(x), x ∈ S_T

Example: Absorption probabilities

  [Figure: transition diagram on states 0–5; the entries relevant below are P(0,0) = 1, P(1,0) = 0.25, P(1,1) = 0.5, P(1,2) = 0.25, P(2,1) = 0.2, P(2,2) = 0.4, with the remaining probability 0.4 from state 2 going into {3, 4, 5}]

◮ S_T = {1, 2} and C_1 = {0}, C_2 = {3, 4, 5}

      ρ_{C_1}(1) = P(1, 0) + P(1, 1) ρ_{C_1}(1) + P(1, 2) ρ_{C_1}(2)
                 = 0.25 + 0.5 ρ_{C_1}(1) + 0.25 ρ_{C_1}(2)
      ρ_{C_1}(2) = 0 + 0.2 ρ_{C_1}(1) + 0.4 ρ_{C_1}(2)

◮ Solving these, we get ρ_{C_1}(1) = 0.6, ρ_{C_1}(2) = 0.2
◮ What would be ρ_{C_2}(1)?
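The two absorption equations form a small linear system (I − P_T)ρ = b, where P_T is the transient-to-transient block of P and b holds the one-step probabilities into C_1. A sketch (not from the slides) with the numbers from this example:

```python
import numpy as np

# transient-to-transient block for states {1, 2} (from the example)
PT = np.array([[0.5, 0.25],
               [0.2, 0.4]])
# one-step probabilities of jumping from {1, 2} directly into C1 = {0}
b = np.array([0.25, 0.0])

# rho = b + PT @ rho  =>  (I - PT) rho = b
rho = np.linalg.solve(np.eye(2) - PT, b)
print(rho)  # [0.6 0.2]
```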
Expected time in transient states

◮ We consider a simple method to get the time spent in transient states for finite chains
◮ Let states 1, 2, · · · , t be the transient states
◮ b_ij – the expected number of time instants spent in state j when started in i.
◮ Then we get

      b_ij = δ_ij + Σ_{k=1}^{t} P(i, k) b_kj

  where δ_ij = 1 if i = j and is zero otherwise
◮ Let B be the t × t matrix of b_ij, I be the t × t identity matrix and P_T be the submatrix (corresponding to the transient states) of P.
◮ Then the above in matrix notation is

      B = I + P_T B   or   B = (I − P_T)^{−1}

stationary distributions

◮ π : S → [0, 1] is a probability distribution (mass function) over S if π(x) ≥ 0, ∀x and Σ_{x∈S} π(x) = 1
◮ A probability distribution π over S is said to be a stationary distribution for the Markov chain with transition probabilities P if

      π(y) = Σ_{x∈S} π(x) P(x, y), ∀y ∈ S

◮ Suppose S is finite. Then π can be represented by a vector.
◮ The π is stationary if

      π^T = π^T P   or   P^T π = π
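For the absorption example above, the transient block P_T over states {1, 2} gives B = (I − P_T)^{−1} directly. This numerical sketch is not part of the slides.

```python
import numpy as np

PT = np.array([[0.5, 0.25],   # transient block over states {1, 2}
               [0.2, 0.4]])

# B(i,j) = expected number of time instants spent in j starting from i
B = np.linalg.inv(np.eye(2) - PT)
print(B)              # [[2.4 1. ] [0.8 2. ]]
print(B.sum(axis=1))  # expected total time spent among the transient states
```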

◮ π is a stationary distribution if
X
π(y) = π(x)P (x, y), ∀y ∈ S ◮ If the chain is started in stationary distribution then the
x∈S distribution of Xn is not a function of time, as we saw.
◮ Recall πn (x) , P r[Xn = x] satisfies
◮ Suppose for a chain, distribution of Xn is not dependent
on n. Then the chain must be in a stationary distribution.
X X
πn+1 (y) = P r[Xn+1 = y|Xn = x]P r[Xn = x] = πn (x)P (x, y) ◮ Suppose π = π0 = π1 = · · · = πn = · · · . Then
x∈S x∈S X X
π(y) = π1 (y) = π0 (x)P (x, y) = π(x)P (x, y)
x∈S x∈S
◮ Hence, if π0 = π then π1 = π
and hence πn = π, ∀n which shows π is a stationary distribution
◮ Hence the name, stationary distribution.
◮ It is also called the invariant distribution or the invariant
measure

PS Sastry, IISc, Bangalore, 2020 31/40 PS Sastry, IISc, Bangalore, 2020 32/40
◮ Suppose S is finite.
◮ Then π is a stationary distribution if

      P^T π = π   or   (P^T − I) π = 0

◮ Note that each column of P^T sums to 1.
  Hence, P^T − I would be singular
  (1 is always an eigen value for a column stochastic matrix)
◮ A stationary distribution always exists for a finite chain.
◮ But it may or may not be unique.
◮ What about infinite chains?

Example

  [Figure: 3-state chain with P(0,0) = 0.75, P(0,1) = 0.25, P(1,0) = 0.5, P(1,2) = 0.5, P(2,1) = 0.75, P(2,2) = 0.25]

◮ The stationary distribution has to satisfy

      π(y) = Σ_{x∈S} π(x) P(x, y), ∀y ∈ S

◮ Thus we get the following linear equations

      0.75 π(0) + 0.5 π(1) = π(0)
      0.25 π(0) + 0.75 π(2) = π(1)
      0.5 π(1) + 0.25 π(2) = π(2)

  in addition, π(0) + π(1) + π(2) = 1

◮ We can also write the equations for π as

                         [ 0.75  0.25  0    ]
      [π(0) π(1) π(2)]   [ 0.5   0     0.5  ]   =   [π(0) π(1) π(2)]
                         [ 0     0.75  0.25 ]

  that is,

      0.75 π(0) + 0.5 π(1) = π(0)
      0.25 π(0) + 0.75 π(2) = π(1)
      0.5 π(1) + 0.25 π(2) = π(2)

◮ We have to solve these along with π(0) + π(1) + π(2) = 1
◮ Solving,

      0.75 π(0) + 0.5 π(1) = π(0)   ⇒   π(1) = (1/2) π(0)
      0.25 π(0) + 0.75 π(2) = π(1)  ⇒   π(2) = (1/3) π(0)
      π(0) + π(1) + π(2) = 1        ⇒   π(0) (1 + 1/2 + 1/3) = 1

◮ Now, π(0) (1 + 1/2 + 1/3) = 1 gives π(0) = 6/11
◮ We get a unique solution: [ 6/11  3/11  2/11 ]
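The same answer can be obtained numerically: replace one (redundant) balance equation by the normalization constraint and solve the resulting linear system. This sketch is not from the slides.

```python
import numpy as np

P = np.array([[0.75, 0.25, 0.0],
              [0.5,  0.0,  0.5],
              [0.0,  0.75, 0.25]])

# Solve (P^T - I) pi = 0 together with sum(pi) = 1 by replacing the
# last (redundant) balance equation with the normalization equation.
A = P.T - np.eye(3)
A[-1] = 1.0                      # last row now encodes pi(0)+pi(1)+pi(2) = 1
pi = np.linalg.solve(A, np.array([0.0, 0.0, 1.0]))
print(pi)  # [6/11, 3/11, 2/11] ≈ [0.5455, 0.2727, 0.1818]
```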
Example2

  [Figure: 3-state chain with P(0,0) = 1, P(1,0) = 0.5, P(1,2) = 0.5, P(2,2) = 1]

◮ The stationary distribution has to satisfy

      π(y) = Σ_{x∈S} π(x) P(x, y), ∀y ∈ S

  i.e.,
                         [ 1.0  0  0   ]
      [π(0) π(1) π(2)]   [ 0.5  0  0.5 ]   =   [π(0) π(1) π(2)]
                         [ 0    0  1.0 ]

◮ We get the following linear equations

      π(0) + 0.5 π(1) = π(0)   ⇒   π(1) = 0
      0.5 π(1) + π(2) = π(2)   ⇒   π(1) = 0

◮ We also have to add the equation π(0) + π(1) + π(2) = 1, which gives π(0) = 1 − π(2)
◮ Now there are infinitely many solutions.
◮ We now do not have a unique stationary distribution
◮ Any distribution [a 0 1−a] with 0 ≤ a ≤ 1 is a stationary distribution
◮ This chain is not irreducible; the previous one is irreducible
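A quick numerical check (not in the slides) that every [a, 0, 1 − a] is stationary for this chain:

```python
import numpy as np

P = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0]])

for a in (0.0, 0.3, 1.0):
    pi = np.array([a, 0.0, 1.0 - a])
    print(np.allclose(pi @ P, pi))  # True for every a in [0, 1]
```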

◮ We now explore conditions for existence and uniqueness of stationary distributions
◮ For finite chains a stationary distribution always exists.
◮ For finite irreducible chains it is unique.
◮ But for infinite chains, it is possible that a stationary distribution does not exist.
◮ When the stationary distribution is unique, we also want to know if the chain converges to that distribution
◮ The stationary distribution, when it exists, is related to the expected fraction of time spent in different states.

◮ Let I_y(X_n) be the indicator of [X_n = y]
◮ Number of visits to y till n: N_n(y) = Σ_{m=1}^{n} I_y(X_m)

      G_n(x, y) ≜ E_x[N_n(y)] = Σ_{m=1}^{n} E_x[I_y(X_m)] = Σ_{m=1}^{n} P^m(x, y)

◮ The expected fraction of time spent in y till n is

      G_n(x, y)/n = (1/n) Σ_{m=1}^{n} P^m(x, y)

◮ We will first establish a limit for the above as n → ∞
Recap: Markov Chain Recap: Transition Probabilities

◮ Let Xn , n = 0, 1, · · · be a sequence of discrete random


variables taking values in S.
◮ We say it is a Markov chain if
◮ Transition probabilities: P (x, y) = P r[Xn+1 = y|Xn = x]
P r[Xn+1 = xn+1 |Xn = xn , Xn−1 = xn−1 · · · X0 = x0 ] = P r[Xn+1 = xn+1 |Xn = xn ], Chain is homogeneous:
P r[Xn+1 = y|Xn = x] = P r[X1 = y|X0 = x], ∀n
◮ We can write it as
◮ Initial probabilities π0 (x) = P r[X0 = x]
fXn+1 |Xn ,···X0 (xn+1 |xn , · · · , x0 ) = fXn+1 |Xn (xn+1 |xn ), ∀xi ◮ Similarly, πn (x) = P r[Xn = x]

◮ For a Markov chain, given the current state, the future


evolution is independent of the history of how you
reached the current state


Recap: Chapman-Kolmogorov Equations Recap: transient and recurrent states

◮ n-step transition probabilities:


P n (x, y) = P r[Xn = y|X0 = x] ◮ Hitting time for y: Ty = min{n > 0 : Xn = y}
◮ These satisfy Chapman-Kolmogorov equations: ◮ ρxy = Px (Ty < ∞).
X ◮ A state y is called transient if ρyy < 1; it is called
P m+n (x, y) = P m (x, z)P n (z, y) recurrent if ρyy = 1.
z ◮ N (y) – total number of visits to y
◮ For a finite chain, the n-step transition probability matrix ◮ G(x, y) = Ex [N (y)]
is n-fold product of the transition probability matrix

Recap Recap

Theorem:
(i). Let y be transient. Then
ρxy ◮ Transient states are visited only finitely many times while
Px (N (y) < ∞) = 1, ∀x and G(x, y) = < ∞, ∀x recurrent states are visited infinitely often
1 − ρyy
◮ A finite chain should have at least one recurrent state
(ii) Let y be recurrent. Then
◮ We say, x leads to y if ρxy > 0
Py [N (y) = ∞] = 1, and G(y, y) = Ey [N (y)] = ∞ Theorem: If x is recurrent and x leads to y then y is
recurrent and ρxy = ρyx = 1.

0 if ρxy = 0
Px [N (y) = ∞] = ρxy , and G(x, y) =
∞ if ρxy > 0


Recap: closed and irreducible sets Recap: Partition of state space

◮ S = ST + SR , transient and recurrent states and


◮ A set of states, C ⊂ S is said to be irreducible if x leads
to y for all x, y ∈ C SR = C 1 + C 2 + · · ·
◮ An irreducible set is also called a communicating class
where Ci are closed and irreducible
◮ A set of states, C ⊂ S, is said to be closed if
◮ We can calculate absorption probabilities for Ci using
x ∈ C, y ∈ / C implies ρxy = 0.
◮ Once the chain visits a state in a closed set, it cannot
X X
ρC (x) = P (x, y) + P (x, y) ρC (y)
leave that set. y∈C y∈ST

Recap: Stationary distribution
◮ Let Iy (Xn ) be indicator of [Xn = y]
◮ π is said to be a stationary distribution for the Markov
◮ Number of visits to y till n: Nn (y) = nm=1 Iy (Xm )
P
chain with transition probabilities P if
X n
X n
X
π(y) = π(x)P (x, y), ∀y ∈ S Gn (x, y) , Ex [Nn (y)] = Ex [Iy (Xm )] = P m (x, y)
x∈S m=1 m=1

◮ For finite chains, P T π = π


◮ Expected fraction of time spent in y till n is
◮ When π is stationary distribution,
n
π0 = π ⇒ πn = π, ∀n Gn (x, y) 1 X m
= P (x, y)
◮ If πn = π, ∀n then π is a stationary distribution n n m=1
◮ For a finite chain, a stationary distribution always exists.
◮ We will first establish a limit for the above as n → ∞
◮ The stationary distribution, when it exists, is related to
expected fraction of time spent in different states.

PS Sastry, IISc, Bangalore, 2020 9/39 PS Sastry, IISc, Bangalore, 2020 10/39

◮ Suppose y is transient. Then we have

      lim_{n→∞} N_n(y) = N(y)  with  Pr[N(y) < ∞] = 1,  and  lim_{n→∞} G_n(x, y) = G(x, y) < ∞

      ⇒ lim_{n→∞} N_n(y)/n = 0 (w.p.1)  and  lim_{n→∞} G_n(x, y)/n = 0

◮ The expected fraction of time spent in a transient state is zero.
◮ This is intuitively obvious

◮ Now, let y be recurrent
◮ Then, P_y[T_y < ∞] = 1
◮ Define m_y = E_y[T_y]
◮ m_y is the mean return time to y
◮ We will show that N_n(y)/n converges to 1/m_y if the chain starts in y.
◮ Convergence would be with probability one.
◮ Consider a chain started in y
◮ Let T_y^r be the time of the rth visit to y, r ≥ 1:

      T_y^r = min{n ≥ 1 : N_n(y) = r}

◮ Define W_y^1 = T_y^1 = T_y and W_y^r = T_y^r − T_y^{r−1}, r > 1
◮ Note that E_y[W_y^1] = E_y[T_y] = m_y
◮ Also, T_y^r = W_y^1 + · · · + W_y^r
◮ The W_y^r are the “waiting times”
◮ By the Markovian property we should expect them to be iid
◮ We will prove this.
◮ Then T_y^r / r converges to m_y by the law of large numbers

◮ We have

      Pr[W_y^3 = k_3 | W_y^2 = k_2, W_y^1 = k_1]
        = Pr[X_{k_1+k_2+j} ≠ y, 1 ≤ j ≤ k_3 − 1, X_{k_1+k_2+k_3} = y | B]
      where B = [X_{k_1+k_2} = y, X_{k_1} = y, X_j ≠ y, j < k_1 + k_2, j ≠ k_1]

◮ Using the Markovian property, we get

      Pr[W_y^3 = k_3 | W_y^2 = k_2, W_y^1 = k_1]
        = Pr[X_{k_1+k_2+j} ≠ y, 1 ≤ j ≤ k_3 − 1, X_{k_1+k_2+k_3} = y | X_{k_1+k_2} = y]
        = Pr[X_j ≠ y, 1 ≤ j ≤ k_3 − 1, X_{k_3} = y | X_0 = y]
        = P_y[W_y^1 = k_3]

◮ In general, we get

      Pr[W_y^r = k_r | W_y^{r−1} = k_{r−1}, · · · , W_y^1 = k_1] = P_y[W_y^1 = k_r]

◮ This shows the waiting times are identically distributed:

      P_y[W_y^2 = k_2] = Σ_{k_1} P_y[W_y^2 = k_2 | W_y^1 = k_1] P_y[W_y^1 = k_1]
                       = Σ_{k_1} P_y[W_y^1 = k_2] P_y[W_y^1 = k_1]
                       = P_y[W_y^1 = k_2]

  and independent:

      P_y[W_y^2 = k_2, W_y^1 = k_1] = P_y[W_y^2 = k_2 | W_y^1 = k_1] P_y[W_y^1 = k_1]
                                    = P_y[W_y^1 = k_2] P_y[W_y^1 = k_1]
                                    = P_y[W_y^2 = k_2] P_y[W_y^1 = k_1]

◮ We have shown W_y^r, r = 1, 2, · · · are iid
◮ Since E[W_y^1] = m_y, by the strong law of large numbers,

      lim_{k→∞} T_y^k / k = lim_{k→∞} (1/k) Σ_{r=1}^{k} W_y^r = m_y, (w.p.1)

◮ Note that this is true even if m_y = ∞
◮ For all n such that N_n(y) ≥ 1, we have

      T_y^{N_n(y)} ≤ n < T_y^{N_n(y)+1}

◮ N_n(y) is the number of visits to y till time step n
◮ Suppose N_50(y) = 8 – visited y 8 times till time 50.
  ◮ So, the 8th visit occurred at or before time 50.
  ◮ The 9th visit has not occurred till 50.
  ◮ So, the time of the 9th visit is beyond 50.
◮ Now we have

      T_y^{N_n(y)} / N_n(y) ≤ n / N_n(y) < T_y^{N_n(y)+1} / N_n(y)

◮ We know that
  ◮ As n → ∞, N_n(y) → ∞, w.p.1
  ◮ As n → ∞, T_y^n / n → m_y, w.p.1
◮ Hence we get

      lim_{n→∞} n / N_n(y) = m_y, w.p.1    or    lim_{n→∞} N_n(y)/n = 1/m_y, w.p.1

◮ All this is true if the chain started in y.
◮ That means it is true if the chain visits y once.
◮ So, we get

      lim_{n→∞} N_n(y)/n = I_{[T_y < ∞]} / m_y, w.p.1

◮ Since 0 ≤ N_n(y)/n ≤ 1, almost sure convergence implies convergence in mean:

      lim_{n→∞} G_n(x, y)/n = lim_{n→∞} E_x[N_n(y)/n] = P_x[T_y < ∞] / m_y = ρ_xy / m_y

◮ Thus we have proved the following theorem
◮ Theorem: Let y be recurrent. Then
  1.  lim_{n→∞} N_n(y)/n = I_{[T_y < ∞]} / m_y, w.p.1
  2.  lim_{n→∞} G_n(x, y)/n = ρ_xy / m_y

◮ The fraction of time spent in each recurrent state is inversely proportional to the mean recurrence time
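A numerical illustration (not in the slides) of the limit G_n(x, y)/n → ρ_xy/m_y: for the irreducible 3-state example whose stationary distribution (6/11, 3/11, 2/11) was computed earlier, ρ_xy = 1 and, as the later slides show, 1/m_y = π(y). So the averaged powers of P should converge to a matrix whose rows all equal π.

```python
import numpy as np

P = np.array([[0.75, 0.25, 0.0],
              [0.5,  0.0,  0.5],
              [0.0,  0.75, 0.25]])

n = 20000
Pm, avg = np.eye(3), np.zeros((3, 3))
for _ in range(n):
    Pm = Pm @ P          # P^m
    avg += Pm
avg /= n                 # (1/n) * sum_{m=1}^{n} P^m

pi = np.array([6/11, 3/11, 2/11])
print(np.allclose(avg, np.tile(pi, (3, 1)), atol=1e-3))  # True
```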
◮ The limiting fraction of time spent in a state is inversely proportional to m_y, the mean return time.
◮ Intuitively, the stationary probability of a state could be the limiting fraction of time spent in that state.
◮ Thus π(y) = 1/m_y is a good candidate for a stationary distribution.
◮ We first note that we can have m_y = ∞.
  Though P_y[T_y < ∞] = 1, we can have E_y[T_y] = ∞.
◮ What if m_y = ∞, ∀y?
◮ Does not seem reasonable for a finite chain.
◮ But for infinite chains??
◮ Let us characterize y for which m_y = ∞

◮ A recurrent state y is called null recurrent if m_y = ∞.
◮ y is called positive recurrent if m_y < ∞
◮ We earlier saw that the fraction of time spent in a transient state is zero.
◮ Suppose y is null recurrent. Then

      lim_{n→∞} N_n(y)/n = 1/m_y = 0

◮ Thus the limiting fraction of time spent by the chain in transient and null recurrent states is zero.

◮ Theorem: Let x be positive recurrent and let x lead to y. Then y is positive recurrent.
Proof
◮ Since x is recurrent and x leads to y we know ∃n_0, n_1 s.t. P^{n_0}(x, y) > 0, P^{n_1}(y, x) > 0 and

      P^{n_1+m+n_0}(y, y) ≥ P^{n_1}(y, x) P^m(x, x) P^{n_0}(x, y), ∀m

  Summing the above for m = 1, 2, · · · , n and dividing by n:

      (1/n) Σ_{m=1}^{n} P^{n_1+m+n_0}(y, y) ≥ P^{n_1}(y, x) [(1/n) Σ_{m=1}^{n} P^m(x, x)] P^{n_0}(x, y), ∀n

  If we now let n → ∞, the RHS goes to P^{n_1}(y, x) (1/m_x) P^{n_0}(x, y) > 0.
◮ We can write the LHS of the above as

      (1/n) Σ_{m=1}^{n} P^{n_1+m+n_0}(y, y)
        = (1/n) Σ_{m=1}^{n_1+n+n_0} P^m(y, y) − (1/n) Σ_{m=1}^{n_1+n_0} P^m(y, y)
        = [(n_1+n+n_0)/n] · (1/(n_1+n+n_0)) Σ_{m=1}^{n_1+n+n_0} P^m(y, y) − (1/n) Σ_{m=1}^{n_1+n_0} P^m(y, y)

      ⇒ lim_{n→∞} (1/n) Σ_{m=1}^{n} P^{n_1+m+n_0}(y, y) = 1/m_y

      ⇒ 1/m_y ≥ P^{n_1}(y, x) (1/m_x) P^{n_0}(x, y) > 0

  which implies y is positive recurrent
◮ Thus, in a closed irreducible set of recurrent states, if one state is positive recurrent then all are positive recurrent.
◮ Hence, in the partition: S_R = C_1 + C_2 + · · · , each C_i is either wholly positive recurrent or wholly null recurrent.
◮ We next show that a finite chain cannot have any null recurrent states.

◮ Let C be a finite closed set of recurrent states.
◮ Suppose all states in C are null recurrent. Then

      lim_{n→∞} (1/n) Σ_{m=1}^{n} P^m(x, y) = 0, ∀x, y ∈ C

◮ Since C is closed, Σ_{y∈C} P^m(x, y) = 1, ∀m, ∀x ∈ C.
◮ Thus we get

      1 = (1/n) Σ_{m=1}^{n} Σ_{y∈C} P^m(x, y) = Σ_{y∈C} (1/n) Σ_{m=1}^{n} P^m(x, y), ∀n

      ⇒ 1 = lim_{n→∞} Σ_{y∈C} (1/n) Σ_{m=1}^{n} P^m(x, y) = Σ_{y∈C} lim_{n→∞} (1/n) Σ_{m=1}^{n} P^m(x, y) = 0

  where we could take the limit inside the sum because C is finite.
◮ This is a contradiction, so C must contain a positive recurrent state.

◮ If C is a finite closed set of recurrent states then all states in it cannot be null recurrent.
◮ Actually what we showed is that any closed finite set must have at least one positive recurrent state.
◮ Hence, in a finite chain, every closed irreducible set of recurrent states contains only positive recurrent states.
◮ Hence, a finite chain cannot have a null recurrent state.

Example of null recurrent chain

◮ Consider the chain with state space {0, 1, · · · } given by

  [Figure: from state 0 the chain jumps to state k with probability q(k); from every state i ≥ 1 it moves to i − 1 with probability 1]

◮ Here, q(k) ≥ 0, ∀k and Σ_{k=1}^{∞} q(k) = 1. We have

      P_0[T_0 = j + 1] = q(j)   ⇒   m_0 = Σ_{j=2}^{∞} j P_0[T_0 = j] = Σ_{j=2}^{∞} j q(j − 1)

  (Note that P_0[T_0 = 1] = 0)
◮ So, m_0 = ∞ if the q(·) distribution has infinite expectation. For example, if q(k) = c/k²
◮ Then state 0 is null recurrent. This implies the chain is null recurrent
◮ Suppose π is a stationary distribution.
◮ Then π(y) = 0 if y is transient or null recurrent
◮ We prove this as follows:

      π(y) = Σ_x π(x) P^m(x, y), ∀m

      ⇒ π(y) = (1/n) Σ_{m=1}^{n} Σ_x π(x) P^m(x, y) = Σ_x π(x) (1/n) Σ_{m=1}^{n} P^m(x, y)

      ⇒ π(y) = lim_{n→∞} Σ_x π(x) (1/n) Σ_{m=1}^{n} P^m(x, y)

◮ The proof is complete if we can take the limit inside the sum
◮ Bounded Convergence Theorem: Suppose a(x) ≥ 0, ∀x ∈ S and Σ_x a(x) < ∞. Let b_n(x), x ∈ S be such that |b_n(x)| ≤ K, ∀x, n and suppose lim_{n→∞} b_n(x) = b(x), ∀x ∈ S. Then

      lim_{n→∞} Σ_{x∈S} a(x) b_n(x) = Σ_{x∈S} a(x) lim_{n→∞} b_n(x) = Σ_{x∈S} a(x) b(x)

◮ We had

      π(y) = lim_{n→∞} Σ_x π(x) (1/n) Σ_{m=1}^{n} P^m(x, y)

◮ We have

      π(x) ≥ 0;   Σ_x π(x) = 1;   0 ≤ (1/n) Σ_{m=1}^{n} P^m(x, y) ≤ 1, ∀x

◮ Hence, if y is transient or null recurrent, then

      π(y) = Σ_x π(x) lim_{n→∞} (1/n) Σ_{m=1}^{n} P^m(x, y) = 0

◮ Theorem: An irreducible positive recurrent chain has a unique stationary distribution given by

      π(y) = 1/m_y, ∀y ∈ S

◮ In any stationary distribution π, we would have π(y) = 0 if y is transient or null recurrent.
◮ Hence an irreducible transient or null recurrent chain would not have a stationary distribution.

◮ Suppose ∃ π such that π(y) = Σ_x π(x) P(x, y). Then

      π(y) = Σ_x π(x) P^m(x, y), ∀m

      ⇒ π(y) = Σ_x π(x) (1/n) Σ_{m=1}^{n} P^m(x, y), ∀n

      ⇒ π(y) = lim_{n→∞} Σ_x π(x) (1/n) Σ_{m=1}^{n} P^m(x, y)

      ⇒ π(y) = Σ_x π(x) lim_{n→∞} (1/n) Σ_{m=1}^{n} P^m(x, y) = Σ_x π(x) (1/m_y) = 1/m_y
◮ To complete the proof, we need to show Σ_y 1/m_y = 1.
◮ We also need to show 1/m_y = Σ_x (1/m_x) P(x, y)
◮ We skip these steps in the proof.
◮ The theorem shows that an irreducible positive recurrent chain has a unique stationary distribution
◮ Corollary: An irreducible chain has a stationary distribution if and only if it is positive recurrent
◮ An irreducible finite chain has a unique stationary distribution

◮ If π¹ and π² are stationary distributions, then so is απ¹ + (1 − α)π² (easily verified)
◮ Let C be a closed irreducible set of positive recurrent states.
  Then there is a unique stationary distribution π that satisfies π(y) = 0, ∀y ∉ C.
◮ Any other stationary distribution of the chain is a convex combination of the stationary distributions concentrated on each of the closed irreducible sets of positive recurrent states.
◮ This answers all questions about existence and uniqueness of stationary distributions
◮ Consider an irreducible positive recurrent chain.
◮ It has a unique stationary distribution and (1/n) Σ_{m=1}^n P^m(x, y) converges to π(y).
◮ The next question is convergence of π_n:

  lim_{n→∞} π_n(y) = lim_{n→∞} Σ_x π_0(x) P^n(x, y) = ?

◮ If P^n(x, y) converges to g(y) then that would be the stationary distribution and π_n converges to it
◮ But, (1/n) Σ_{m=1}^n a_m may have a limit though lim_{n→∞} a_n may not exist. For example, a_n = (−1)^n

◮ Consider a chain with transition probabilities

  P =
      0    1    0    0
      1/3  0    2/3  0
      0    2/3  0    1/3
      0    0    1    0

◮ One can show π^T = (1/8, 3/8, 3/8, 1/8)
◮ However, P^n goes to different limits based on whether n is even or odd
◮ The chain is the following

  [state diagram: states 0, 1, 2, 3; 0→1 w.p. 1, 1→0 w.p. 1/3, 1→2 w.p. 2/3, 2→1 w.p. 2/3, 2→3 w.p. 1/3, 3→2 w.p. 1]

  P =
      0    1    0    0
      1/3  0    2/3  0
      0    2/3  0    1/3
      0    0    1    0

◮ We can return to a state only after an even number of time steps
◮ That is why P^n does not go to a limit
◮ Such a chain is called a periodic chain

◮ We define the period of a state x as

  d_x = gcd{n ≥ 1 : P^n(x, x) > 0}

◮ If P(x, x) > 0 then d_x = 1
◮ If x leads to y and y leads to x, then d_x = d_y
◮ Let P^{n_1}(x, y) > 0, P^{n_2}(y, x) > 0. Then P^{n_1+n_2}(x, x) > 0 ⇒ d_x divides n_1 + n_2.
◮ For any n s.t. P^n(y, y) > 0, we get P^{n_1+n+n_2}(x, x) > 0
◮ Hence, d_x divides n for all n s.t. P^n(y, y) > 0 ⇒ d_x ≤ d_y
◮ Similarly, d_y ≤ d_x and hence d_y = d_x
◮ All states in an irreducible chain have the same period.
◮ If the period is 1 then the chain is called aperiodic
◮ The extra condition we need for convergence of π_n is aperiodicity
◮ For an aperiodic, irreducible, positive recurrent chain, there is a unique stationary distribution and π_n converges to it irrespective of what π_0 is.
◮ An aperiodic, irreducible, positive recurrent chain is called an ergodic chain

Recap: Stationary Distribution
◮ π is said to be a stationary distribution for the Markov chain with transition probabilities P if

  π(y) = Σ_{x∈S} π(x)P(x, y), ∀y ∈ S

◮ When π is a stationary distribution, π_0 = π ⇒ π_n = π, ∀n
◮ If π_n = π, ∀n then π is a stationary distribution
◮ For a finite chain: P^T π = π
◮ A stationary distribution always exists for a finite chain
Recap
◮ N_n(y) – number of visits to y till n
◮ G_n(x, y) = E_x[N_n(y)] = Σ_{m=1}^n P^m(x, y) – expected number of visits to y till n
◮ m_y = E_y[T_y] – mean return time to y

  lim_{n→∞} N_n(y)/n = I_{[T_y < ∞]}/m_y, w.p.1

  lim_{n→∞} G_n(x, y)/n = lim_{n→∞} (1/n) Σ_{m=1}^n P^m(x, y) = ρ_xy/m_y

Recap: positive and null recurrent states
◮ y is positive recurrent if m_y < ∞
◮ y is null recurrent if m_y = ∞
◮ If x is positive recurrent and x leads to y, then y is positive recurrent
◮ In a closed irreducible set of recurrent states either all states are positive recurrent or all states are null recurrent
◮ A finite closed set has to have at least one positive recurrent state
◮ A finite chain cannot have null recurrent states
Recap: Existence of stationary distribution
◮ In any stationary distribution π, π(y) = 0 if y is transient or null recurrent
◮ An irreducible transient or null recurrent chain does not have a stationary distribution
◮ An irreducible positive recurrent chain has a unique stationary distribution: π(y) = 1/m_y
◮ An irreducible chain has a stationary distribution iff it is positive recurrent
◮ For a non-irreducible chain, for each closed irreducible set of positive recurrent states, there is a unique stationary distribution concentrated on that set.
◮ All stationary distributions of the chain are convex combinations of these

Recap: Periodic chains
◮ The period of a state x is d_x = gcd{n ≥ 1 : P^n(x, x) > 0}
◮ If x and y lead to each other, d_x = d_y
◮ In an irreducible chain, all states have the same period
◮ An irreducible chain is called aperiodic if the period is 1
◮ For an irreducible aperiodic positive recurrent chain, π_n converges to π, the unique stationary distribution, irrespective of what π_0 is.
◮ Also, for an irreducible, aperiodic, positive recurrent chain, P^n(x, y) converges to 1/m_y
Example
◮ Consider the umbrella problem

  [state transition diagram over states 0–4]

         0     1     2     3     4
   0 [   0     0     0     0     1   ]
   1 [   0     0     0    1−p    p   ]
  P =  2 [   0     0    1−p    p     0   ]
   3 [   0    1−p    p     0     0   ]
   4 [  1−p    p     0     0     0   ]

◮ This is an irreducible, aperiodic positive recurrent chain

◮ We want to calculate the probability of getting caught in the rain without an umbrella.
◮ This would be the steady state probability of state 0 multiplied by p
◮ We are using the fact that this chain converges to the stationary distribution starting with any initial probabilities.
         0     1     2     3     4
   0 [   0     0     0     0     1   ]
   1 [   0     0     0    1−p    p   ]
  P =  2 [   0     0    1−p    p     0   ]
   3 [   0    1−p    p     0     0   ]
   4 [  1−p    p     0     0     0   ]

The stationary distribution satisfies π^T P = π^T:

  π(0) = (1 − p)π(4)
  π(1) = (1 − p)π(3) + pπ(4)
  π(2) = (1 − p)π(2) + pπ(3)  ⇒  π(2) = π(3)
  π(3) = (1 − p)π(1) + pπ(2)  ⇒  π(3) = π(1), and hence π(2) = π(1)
  π(4) = π(0) + pπ(1)         ⇒  π(4) = π(1)

This gives 4π(1) + (1 − p)π(1) = 1 and hence

  π(i) = 1/(5 − p), i = 1, 2, 3, 4, and π(0) = (1 − p)/(5 − p)
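As a quick numerical sanity check (a sketch, not part of the lecture; the value p = 0.4 is just an illustrative choice), one can build the umbrella chain's P and iterate π_{n+1}^T = π_n^T P from an arbitrary start; the fixed point should match π(0) = (1 − p)/(5 − p) and π(i) = 1/(5 − p):

```python
# Sketch: verify the umbrella-chain stationary distribution numerically.
# The value of p here is only illustrative.
p = 0.4
P = [
    [0,     0,     0,     0,     1],
    [0,     0,     0,     1 - p, p],
    [0,     0,     1 - p, p,     0],
    [0,     1 - p, p,     0,     0],
    [1 - p, p,     0,     0,     0],
]

pi = [0.2] * 5                      # arbitrary initial distribution
for _ in range(2000):               # iterate pi <- pi P
    pi = [sum(pi[x] * P[x][y] for x in range(5)) for y in range(5)]

# Compare with the closed form pi(0) = (1-p)/(5-p), pi(i) = 1/(5-p)
print(round(pi[0], 6), round((1 - p) / (5 - p), 6))
```

Because the chain is irreducible and aperiodic (state 2 has a self-loop), the iteration converges regardless of the initial distribution, which is exactly the convergence fact the slide appeals to.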
Birth-Death chains
◮ The following is a finite birth-death chain

  [state diagram: states 0, 1, ···, N; from state i, move to i+1 w.p. p_i, to i−1 w.p. q_i, stay w.p. r_i = 1 − p_i − q_i]

◮ We assume p_i, q_i > 0, ∀i.
◮ Then the chain is irreducible, positive recurrent
◮ If we assume r_i > 0 at least for one i, it is also aperiodic
◮ We can derive a general form for its stationary probabilities

birth-death chains – stationary distribution
◮ π(y) = Σ_x π(x)P(x, y) gives

  π(0) = π(0)(1 − p_0) + π(1)q_1
    ⇒ π(1)q_1 − π(0)p_0 = 0
  π(1) = π(0)p_0 + π(1)(1 − p_1 − q_1) + π(2)q_2
    ⇒ π(1)q_1 − π(0)p_0 = π(2)q_2 − π(1)p_1
    ⇒ π(2)q_2 − π(1)p_1 = 0
  π(2) = π(1)p_1 + π(2)(1 − p_2 − q_2) + π(3)q_3
    ⇒ π(2)q_2 − π(1)p_1 = π(3)q_3 − π(2)p_2 ⇒ π(3)q_3 − π(2)p_2 = 0
◮ Thus we get

  π(1)q_1 − π(0)p_0 = 0 ⇒ π(1) = (p_0/q_1) π(0)

  π(2)q_2 − π(1)p_1 = 0 ⇒ π(2) = (p_1/q_2) π(1) = (p_0 p_1)/(q_1 q_2) π(0)

◮ Iterating like this, we get

  π(n) = η_n π(0), where η_n = (p_0 p_1 ··· p_{n−1})/(q_1 q_2 ··· q_n), n = 1, 2, ···, N

◮ With η_0 = 1, we get π(0) Σ_{j=0}^N η_j = 1 and hence

  π(0) = 1/Σ_{j=0}^N η_j and π(n) = η_n π(0), n = 1, ···, N

◮ Note that this process is applicable even for infinite chains with state space {0, 1, 2, ···} (but there may not be a solution)

◮ Consider a birth-death chain

  [state diagram: from state x, go to x+1 w.p. p_x, to x−1 w.p. q_x, stay w.p. r_x]

◮ The chain may be infinite or finite
◮ Let a, b ∈ S with a < b. Assume p_x, q_x > 0, a < x < b.
◮ Define

  U(x) = P_x[T_a < T_b], a < x < b, U(a) = 1, U(b) = 0

◮ We want to derive a formula for U(x)
◮ This can be useful, e.g., in the gambler's ruin chain
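The closed form for π via the η_n is easy to evaluate. The sketch below (the p_i, q_i values are illustrative, not from the lecture) computes η_n and π for a small finite birth-death chain and then checks π(y) = Σ_x π(x)P(x, y) directly against the full transition matrix:

```python
# Sketch: stationary distribution of a finite birth-death chain on {0,...,N}
# via eta_n = (p_0...p_{n-1})/(q_1...q_n); the p_i, q_i here are examples.
N = 5
p = [0.6] * N            # p[i] is p_i, the up-move probability (p_N = 0)
q = [0.3] * N            # q[i] is q_{i+1}, the down-move probability (q_0 = 0)

eta = [1.0]              # eta_0 = 1
for n in range(1, N + 1):
    eta.append(eta[-1] * p[n - 1] / q[n - 1])   # eta_n = eta_{n-1} p_{n-1}/q_n

Z = sum(eta)
pi = [e / Z for e in eta]

# Check pi(y) = sum_x pi(x) P(x,y) by building P explicitly
P = [[0.0] * (N + 1) for _ in range(N + 1)]
for i in range(N + 1):
    up = p[i] if i < N else 0.0
    dn = q[i - 1] if i > 0 else 0.0
    P[i][i] = 1.0 - up - dn          # r_i = holding probability
    if i < N:
        P[i][i + 1] = up
    if i > 0:
        P[i][i - 1] = dn

balance = [sum(pi[x] * P[x][y] for x in range(N + 1)) for y in range(N + 1)]
print(all(abs(balance[y] - pi[y]) < 1e-12 for y in range(N + 1)))
```

The check succeeds because a birth-death stationary distribution in fact satisfies the stronger detailed-balance relation π(i)p_i = π(i+1)q_{i+1}, which is what the η_n product encodes.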
[state diagram: from state x, go to x+1 w.p. p_x, to x−1 w.p. q_x, stay w.p. r_x]

  U(x) = P_x[T_a < T_b] = Pr[T_a < T_b | X_0 = x]
       = Σ_{y=x−1}^{x+1} Pr[T_a < T_b | X_1 = y] Pr[X_1 = y | X_0 = x]
       = U(x − 1)q_x + U(x)r_x + U(x + 1)p_x
       = U(x − 1)q_x + U(x)(1 − p_x − q_x) + U(x + 1)p_x
  ⇒ q_x[U(x) − U(x − 1)] = p_x[U(x + 1) − U(x)]
  ⇒ U(x + 1) − U(x) = (q_x/p_x)[U(x) − U(x − 1)]

◮ Iterating,

  U(x + 1) − U(x) = (q_x/p_x)[U(x) − U(x − 1)]
                 = (q_x q_{x−1})/(p_x p_{x−1}) [U(x − 1) − U(x − 2)]
                 = (q_x q_{x−1} ··· q_{a+1})/(p_x p_{x−1} ··· p_{a+1}) [U(a + 1) − U(a)]

Let

  γ_y = (q_y q_{y−1} ··· q_{a+1})/(p_y p_{y−1} ··· p_{a+1}), a < y < b, γ_a = 1

Now we get

  U(x + 1) − U(x) = (γ_x/γ_a)[U(a + 1) − U(a)]
  U(x + 1) − U(x) = (γ_x/γ_a)[U(a + 1) − U(a)]

◮ By taking x = b − 1, b − 2, ···, a + 1, a,

  U(b) − U(b − 1) = (γ_{b−1}/γ_a)[U(a + 1) − U(a)]
  U(b − 1) − U(b − 2) = (γ_{b−2}/γ_a)[U(a + 1) − U(a)]
  ...
  U(a + 1) − U(a) = (γ_a/γ_a)[U(a + 1) − U(a)]

◮ Adding all these we get

  (1/γ_a)[U(a + 1) − U(a)] Σ_{x=a}^{b−1} γ_x = U(b) − U(a) = 0 − 1

  ⇒ U(a) − U(a + 1) = γ_a / Σ_{x=a}^{b−1} γ_x

◮ Using these, we get

  U(x) − U(x + 1) = (γ_x/γ_a)[U(a) − U(a + 1)]
                 = (γ_x/γ_a) · γ_a/Σ_{z=a}^{b−1} γ_z = γ_x / Σ_{z=a}^{b−1} γ_z

◮ Putting x = b − 1, b − 2, ···, y in the above

  U(b − 1) − U(b) = γ_{b−1} / Σ_{z=a}^{b−1} γ_z
  U(b − 2) − U(b − 1) = γ_{b−2} / Σ_{z=a}^{b−1} γ_z
  ...
  U(y) − U(y + 1) = γ_y / Σ_{z=a}^{b−1} γ_z

◮ Adding these we get

  U(y) − U(b) = U(y) = Σ_{x=y}^{b−1} γ_x / Σ_{x=a}^{b−1} γ_x, a < y < b
◮ We are considering birth-death chains

  [state diagram: from state x, go to x+1 w.p. p_x, to x−1 w.p. q_x, stay w.p. r_x]

◮ We have derived, for a < y < b,

  U(y) = P_y[T_a < T_b] = Σ_{x=y}^{b−1} γ_x / Σ_{x=a}^{b−1} γ_x,
  where γ_x = (q_x q_{x−1} ··· q_{a+1})/(p_x p_{x−1} ··· p_{a+1})

◮ Hence we also get

  P_y[T_b < T_a] = Σ_{x=a}^{y−1} γ_x / Σ_{x=a}^{b−1} γ_x

◮ Suppose this is a Gambler's ruin chain: p_x = p, q_x = q, ∀x

  [state diagram: states 0, 1, ···, N with 0 and N absorbing; interior states move right w.p. p, left w.p. q]

◮ Then, γ_x = (q/p)^x
◮ Hence, for a Gambler's ruin chain we get, e.g.,

  P_i[T_N < T_0] = Σ_{x=0}^{i−1} γ_x / Σ_{x=0}^{N−1} γ_x = ((q/p)^i − 1)/((q/p)^N − 1)

◮ This is the probability of the gambler being successful
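The gambler's-ruin formula can be checked against direct simulation. A sketch (the values p = 0.6, N = 10, i = 4 are illustrative): run the walk from i until it is absorbed at 0 or N, and compare the success frequency with ((q/p)^i − 1)/((q/p)^N − 1):

```python
# Sketch: check P_i[T_N < T_0] = ((q/p)^i - 1)/((q/p)^N - 1) for the
# gambler's ruin chain by direct simulation (illustrative p, N, i).
import random

random.seed(0)
p, N, i = 0.6, 10, 4
q = 1 - p

r = q / p
formula = (r**i - 1) / (r**N - 1)

wins = 0
trials = 100_000
for _ in range(trials):
    x = i
    while 0 < x < N:                 # walk until absorbed at 0 or N
        x += 1 if random.random() < p else -1
    wins += (x == N)

print(abs(wins / trials - formula) < 0.01)
```

Note the formula above needs p ≠ q; for p = q = 1/2 all γ_x = 1 and the general expression reduces to P_i[T_N < T_0] = i/N.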
◮ Consider the following chain over {0, 1, ···}

  [state diagram: birth-death chain on {0, 1, 2, ···}; from i > 0 go to i+1 w.p. p_i, to i−1 w.p. q_i, stay w.p. r_i; from 0 go to 1 w.p. p_0, stay w.p. 1 − p_0]

◮ This is an infinite irreducible birth-death chain
◮ We want to know whether the chain is transient or recurrent etc.
◮ We can use the earlier analysis for this too (taking a = 0, b = n):

  P_1[T_0 < T_n] = Σ_{x=1}^{n−1} γ_x / Σ_{x=0}^{n−1} γ_x, ∀n > 1
               = (Σ_{x=0}^{n−1} γ_x − γ_0) / Σ_{x=0}^{n−1} γ_x
               = 1 − 1/Σ_{x=0}^{n−1} γ_x

◮ Consider this chain started in state 1.

  [T_0 < T_n] ⊂ [T_0 < T_{n+1}], n = 2, 3, ···

  since the chain cannot hit n + 1 without hitting n.
◮ Also, 1 ≤ T_2 < T_3 < ··· < T_n and T_n ≥ n.
◮ Hence [T_0 < ∞] is same as [T_0 < T_n, for some n]
◮ Consider this chain started in state 1.
◮ [T_0 < T_n] ⊂ [T_0 < T_{n+1}], n = 2, 3, ··· since the chain cannot hit n + 1 without hitting n.
◮ Also, 1 ≤ T_2 < T_3 < ··· < T_n and T_n ≥ n.
◮ Hence [T_0 < ∞] is same as [T_0 < T_n, for some n]

  P_1[T_0 < T_n, for some n] = P_1(∪_{n>1} [T_0 < T_n])
    = P_1(lim_{n→∞} [T_0 < T_n])
    = lim_{n→∞} P_1([T_0 < T_n])
    = lim_{n→∞} 1 − 1/Σ_{x=0}^{n−1} γ_x

  ⇒ P_1[T_0 < ∞] = 1 − 1/Σ_{x=0}^{∞} γ_x

◮ Theorem: The chain is recurrent iff Σ_{x=0}^{∞} γ_x = ∞
◮ Proof
◮ Suppose the chain is recurrent. Since it is irreducible,

  P_1[T_0 < ∞] = 1 ⇒ Σ_{x=0}^{∞} γ_x = ∞

◮ Suppose Σ_{x=0}^{∞} γ_x = ∞ ⇒ P_1[T_0 < ∞] = 1. Then

  P_0[T_0 < ∞] = P(0, 0) + P(0, 1) P_1[T_0 < ∞] = P(0, 0) + P(0, 1) = 1

◮ This implies state 0 is recurrent and hence the chain is recurrent because it is irreducible.
◮ Note that we have used the fact that the chain is infinite only to the right.
◮ The chain is transient if Σ_{x=0}^{∞} γ_x < ∞
◮ Let p_x = p, q_x = q ⇒ γ_x = (q/p)^x

  Transient if Σ_{x=0}^{∞} (q/p)^x < ∞ ⇔ q < p
  Recurrent if Σ_{x=0}^{∞} (q/p)^x = ∞ ⇔ q ≥ p

◮ Intuitively clear
◮ This chain with q < p is an example of an irreducible chain that is wholly transient

◮ We know the chain is recurrent if Σ_{x=0}^{∞} (q/p)^x = ∞
◮ When will this chain be positive recurrent?
◮ We know that an irreducible chain is positive recurrent if and only if it has a stationary distribution.
◮ We can check if it has a stationary distribution
◮ The equations that we derived earlier hold for this infinite case also.
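In the transient case q < p the formula above is quantitative: Σ_{x=0}^{∞}(q/p)^x = p/(p − q), so P_1[T_0 < ∞] = 1 − (p − q)/p = q/p. A simulation sketch (p = 0.7 is illustrative; runs are truncated at a large step count, which is an approximation justified by the upward drift):

```python
# Sketch: for the birth-death chain on {0,1,...} with p_x = p, q_x = q and
# q < p, P_1[T_0 < infinity] = q/p.  We estimate it by truncating each run
# at a large number of steps (illustrative p; truncation is an approximation).
import random

random.seed(1)
p = 0.7
q = 1 - p

hits = 0
trials = 50_000
for _ in range(trials):
    x = 1
    for _ in range(2000):            # truncate: unreturned paths drift away
        x += 1 if random.random() < p else -1
        if x == 0:
            hits += 1
            break

print(abs(hits / trials - q / p) < 0.01)
```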
◮ We derived earlier the equations that a stationary distribution of this chain (if it exists) has to satisfy

  π(n) = η_n π(0), where η_n = (p_0 p_1 ··· p_{n−1})/(q_1 q_2 ··· q_n), n = 1, 2, ···

◮ Setting η_0 = 1, we get π(0) Σ_{j=0}^{∞} η_j = 1
◮ Hence a stationary distribution exists iff Σ_{j=0}^{∞} η_j < ∞
◮ Let p_x = p, q_x = q. Then

  Σ_{j=0}^{∞} η_j = Σ_{j=0}^{∞} (p/q)^j < ∞ ⇔ p < q

◮ Thus in this special case, the chain is
  ◮ transient if p > q; recurrent if p ≤ q
  ◮ positive recurrent if p < q
  ◮ null recurrent if p = q

◮ This analysis can handle chains which are infinite in one direction
◮ Consider the following random walk chain

  [state diagram: states ···, −1, 0, +1, ···; move right w.p. p, left w.p. 1 − p]

◮ The state space here is {···, −1, 0, +1, ···}
◮ The chain is irreducible and periodic with period 2
◮ P^{2n}(0, 0) = (2n choose n) p^n (1 − p)^n
◮ We can look at the limit of (1/n) Σ_n P^{2n}(0, 0)
◮ We can show that the chain is transient if p ≠ 0.5 and is recurrent if p = 0.5.
◮ In general, determining when an infinite chain is positive recurrent is difficult.
◮ The method we had works only for birth-death chains over non-negative integers.
◮ There is a useful general theorem.

Foster's Theorem: Let P be the transition probabilities of a homogeneous irreducible Markov chain with state space S. Let h : S → ℜ with h(x) ≥ 0 and

  Σ_{k∈S} P(i, k)h(k) < ∞, ∀i ∈ F, and
  Σ_{k∈S} P(i, k)h(k) ≤ h(i) − ε, ∀i ∉ F

for some finite set F and some ε > 0. Then the Markov chain is positive recurrent.

◮ The h here is called a Lyapunov function.
◮ We will not prove this theorem

◮ Let {X_n, n ≥ 0} be an irreducible Markov chain on a finite state space S with stationary distribution π.
◮ Let r : S → ℜ be a bounded function.
◮ Suppose we want E[r(X)] with respect to the stationary distribution π (E[r(X)] = Σ_{j∈S} r(j)π(j))
◮ Let N_n(j) be as earlier. Then

  (1/n) Σ_{m=1}^n r(X_m) = (1/n) Σ_{j∈S} N_n(j) r(j)

  ⇒ lim_{n→∞} (1/n) Σ_{m=1}^n r(X_m) = Σ_{j∈S} r(j) lim_{n→∞} N_n(j)/n = Σ_{j∈S} r(j)π(j)

◮ For this to be true for infinite S, we need some extra conditions
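A small simulation sketch of this time-average result for a two-state chain (all numbers illustrative): with P(0,1) = a and P(1,0) = b, the stationary distribution is π = (b/(a+b), a/(a+b)), and the running average of r(X_m) should approach Σ_j r(j)π(j):

```python
# Sketch: time-average of r(X_m) vs the ensemble average under pi,
# for a 2-state chain with P(0,1)=a, P(1,0)=b (illustrative values).
import random

random.seed(2)
a, b = 0.3, 0.1
pi = (b / (a + b), a / (a + b))     # stationary distribution of this chain
r = (5.0, -2.0)                     # a bounded function on S = {0, 1}

x, total, n = 0, 0.0, 200_000
for _ in range(n):
    total += r[x]
    if x == 0:
        x = 1 if random.random() < a else 0
    else:
        x = 0 if random.random() < b else 1

ensemble = r[0] * pi[0] + r[1] * pi[1]
print(abs(total / n - ensemble) < 0.1)
```

The successive r(X_m) are correlated, so the average converges more slowly than an i.i.d. mean would, but it converges to the same ensemble value; this is exactly the point the slide makes.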
MCMC Sampling
◮ Consider a distribution over (finite) S: π(x) = b(x)/Z
◮ Since this is a distribution, Z = Σ_{x∈S} b(x)
◮ We assume we can efficiently calculate b(x) for any x but computation of Z is intractable or computationally expensive
◮ E.g., the Boltzmann distribution: b(x) = e^{−E(x)/KT}
◮ We want E[g(X)] w.r.t. distribution π (for any g)

  E[g(X)] = Σ_x g(x) π(x) ≈ (1/n) Σ_{i=1}^n g(X_i), X_1, ···, X_n ∼ π

◮ One way to generate samples is to design an ergodic Markov chain with stationary distribution π
  – MCMC sampling

◮ Suppose {X_n} is an irreducible, aperiodic, positive recurrent Markov chain with stationary dist π(x) = b(x)/Z
◮ Then we have

  lim_{n→∞} (1/n) Σ_{m=1}^n g(X_m) = Σ_x g(x)π(x)

◮ Hence, if we can design a Markov chain with a given stationary distribution, we can use that to calculate the expectation.
◮ We can also use the chain to generate samples from distribution π
◮ We can approximate the expectation as

  Σ_x g(x)π(x) ≈ (1/n) Σ_{i=1}^n g(X_{M+i})

  where M is large enough to assume the chain is in steady state
◮ When we take a sample mean, (1/n) Σ_{i=1}^n Z_i, we want the Z_i to be uncorrelated
◮ We can, for example, use

  Σ_x g(x)π(x) ≈ (1/n) Σ_{i=1}^n g(X_{M+Ki})

◮ For all these, we need to design a Markov chain with π as stationary distribution

◮ Let Q = [q(i, j)] be the transition probability matrix of an irreducible Markov chain over S.
◮ Q is called the proposal distribution
◮ We start with arbitrary X_0 and generate X_{n+1}, n = 0, 1, 2, ···, iteratively as follows
◮ If X_n = i, we generate Y with Pr[Y = k] = q(i, k)
◮ Let the generated value for Y be j. Set

  X_{n+1} = j with probability α(i, j)
          = X_n with probability 1 − α(i, j)

◮ α(i, j) is called the acceptance probability
◮ We want to choose α(i, j) to make {X_n} an ergodic Markov chain with stationary probabilities π
◮ The stationary distribution π satisfies (with transition probabilities P)

  π(y) = Σ_x π(x) P(x, y), ∀y ∈ S

◮ Suppose there is a distribution g(·) that satisfies

  g(y) P(y, x) = g(x) P(x, y), ∀x, y ∈ S

  This is called detailed balance
◮ Summing both sides above over x gives

  g(y) = Σ_x g(y) P(y, x) = Σ_x g(x) P(x, y), ∀y

◮ Thus if g(·) satisfies detailed balance, then it must be the stationary distribution
◮ Note that it is not necessary for a stationary distribution to satisfy detailed balance

◮ Any stationary distribution has to satisfy

  π(y) = Σ_x π(x) P(x, y), ∀y ∈ S

◮ If I can find a π that satisfies

  π(x)P(x, y) = π(y)P(y, x), ∀x, y ∈ S, x ≠ y

  that would be the stationary distribution
◮ This is called detailed balance
◮ Recall our algorithm for generating X_n, n = 0, 1, ···
◮ Start with arbitrary X_0 and generate X_{n+1} from X_n
◮ If X_n = i, we generate Y with Pr[Y = k] = q(i, k)
◮ Let the generated value for Y be j. Set

  X_{n+1} = j with probability α(i, j)
          = X_n with probability 1 − α(i, j)

◮ Hence the transition probabilities for X_n are

  P(i, j) = q(i, j) α(i, j), i ≠ j
  P(i, i) = q(i, i) + Σ_{j≠i} q(i, j) (1 − α(i, j))

◮ π(i) = b(i)/Z is the desired stationary distribution
◮ So, we can try to satisfy

  π(i) P(i, j) = π(j) P(j, i), ∀i, j, i ≠ j

  that is, b(i)q(i, j) α(i, j) = b(j)q(j, i) α(j, i)
◮ We want to satisfy

  b(i)q(i, j) α(i, j) = b(j)q(j, i) α(j, i)

◮ Choose

  α(i, j) = min{ π(j)q(j, i) / (π(i)q(i, j)), 1 } = min{ b(j)q(j, i) / (b(i)q(i, j)), 1 }

◮ Note that one of α(i, j), α(j, i) is 1. Suppose

  α(i, j) = π(j)q(j, i) / (π(i)q(i, j)) < 1
  ⇒ π(i) q(i, j) α(i, j) = π(j) q(j, i) = π(j) q(j, i) α(j, i)

◮ Note that π(i) above can be replaced by b(i)

Metropolis-Hastings Algorithm
◮ Start with arbitrary X_0 and generate X_{n+1} from X_n
◮ If X_n = i, we generate Y with Pr[Y = k] = q(i, k)
◮ Let the generated value for Y be j. Set

  X_{n+1} = j with probability α(i, j)
          = X_n with probability 1 − α(i, j)

  where Q = [q(i, j)] is the transition probabilities of an irreducible chain and

  α(i, j) = min{ π(j)q(j, i) / (π(i)q(i, j)), 1 }

◮ Then {X_n} would be an irreducible, aperiodic chain with stationary distribution π.
◮ Q is called the proposal chain and the α(i, j) are called acceptance probabilities
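The algorithm is only a few lines of code. A sketch (the unnormalized weights b(i) are illustrative) running Metropolis-Hastings on a small state space with a uniform, hence symmetric, proposal, for which α(i, j) = min{b(j)/b(i), 1}; the empirical visit frequencies should approach b(i)/Z without Z ever being used by the sampler:

```python
# Sketch: Metropolis-Hastings with a uniform proposal over S (symmetric),
# so alpha(i,j) = min(b(j)/b(i), 1).  The weights b are illustrative.
import random

random.seed(3)
b = [1.0, 2.0, 3.0, 4.0]             # unnormalized target weights
S = len(b)
Z = sum(b)                           # used only to check the answer below

counts = [0] * S
x = 0
n = 200_000
for _ in range(n):
    y = random.randrange(S)          # propose Y ~ q(x, .), uniform over S
    if random.random() < min(b[y] / b[x], 1.0):
        x = y                        # accept with probability alpha(x, y)
    counts[x] += 1

freq = [c / n for c in counts]
print(all(abs(freq[i] - b[i] / Z) < 0.01 for i in range(S)))
```

Note the chain state is counted every step, including rejected steps where X_{n+1} = X_n; dropping the repeats would bias the frequencies away from π.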
◮ Consider the Boltzmann distribution: b(x) = e^{−E(x)/KT}
◮ Take the proposal to be uniform: from any state, we go to all other states with equal probabilities
◮ Then,

  α(x, y) = min{ b(y)/b(x), 1 } = min{ e^{−(E(y)−E(x))/KT}, 1 }

◮ In state x you generate a random new state y.
  If E(y) ≤ E(x) you always go there;
  if E(y) > E(x), you accept with probability e^{−(E(y)−E(x))/KT}
◮ An interesting way to simulate the Boltzmann distribution
◮ We could have chosen Q to be 'uniform over neighbours'

◮ Suppose E : S → ℜ is some function.
◮ We want to find x ∈ S where E is globally minimized.
◮ A gradient descent type method tries to find a locally minimizing direction and hence gives only a 'local' minimum.
◮ The Metropolis-Hastings algorithm gives another viewpoint on how such optimization problems can be handled.
◮ We can think of E as the energy function in a Boltzmann distribution
◮ Let b(x) = e^{−E(x)/T} where T is a parameter called 'temperature'
◮ Let {X_n} be a Markov chain with stationary dist π(x) = b(x)/Z
◮ We can find the relative occupation of different states by the chain by collecting statistics during steady state
◮ We know

  π(x_1)/π(x_2) = b(x_1)/b(x_2) = e^{−(E(x_1)−E(x_2))/T}

◮ We spend more time in the global minimum
◮ We can increase the relative fraction of time spent in the global minimum by decreasing T (There is a price to pay!)
◮ Gives rise to an interesting optimization technique called simulated annealing

◮ In most applications of MCMC, x ∈ S is a vector.
◮ One normally changes one component at a time. That is how neighbours can be defined
◮ A special case of proposal distribution is the conditional distribution.
◮ Suppose X = (X_1, ···, X_N). To propose a value for X_i, we use f_{X_i|X_{−i}}
◮ Here the conditional distribution is calculated using the target π as the joint distribution.
◮ With such a proposal distribution, one can show that α(i, j) is always 1
◮ This is known as Gibbs sampling
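A minimal Gibbs-sampling sketch for a two-component binary vector X = (X_1, X_2), with an illustrative unnormalized table b(x_1, x_2): each step resamples one component from its conditional under π (computed from b alone), and every proposal is accepted:

```python
# Sketch: Gibbs sampling for X = (X1, X2) in {0,1}^2 with illustrative
# unnormalized weights b; the conditional f_{Xi | X_{-i}} is built from b.
import random

random.seed(4)
b = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 6.0}
Z = sum(b.values())                  # used only to check the answer below

x = [0, 0]
counts = {k: 0 for k in b}
n = 200_000
for step in range(n):
    i = step % 2                     # sweep the components in turn
    other = x[1 - i]
    # weights of Xi = 0 and Xi = 1 given the other component
    w0 = b[(0, other)] if i == 0 else b[(other, 0)]
    w1 = b[(1, other)] if i == 0 else b[(other, 1)]
    x[i] = 1 if random.random() < w1 / (w0 + w1) else 0
    counts[tuple(x)] += 1

ok = all(abs(counts[k] / n - b[k] / Z) < 0.01 for k in b)
print(ok)
```

Only ratios of b-values enter the conditional, so, as with Metropolis-Hastings, the normalizing constant Z is never needed by the sampler itself.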
Random process
◮ A random process or a stochastic process is a collection of random variables: {X_t, t ∈ T}
◮ A Markov chain is an example. Here T = {0, 1, ···}
◮ We call T the index set.
◮ Normally, T is either (a subset of) the set of integers or an interval on the real line.
◮ We think of the index t as time
◮ Thus a random process can represent the time-evolution of the state of a system
◮ We assume T is infinite
◮ The index need not necessarily represent time. It can represent, for example, space coordinates.

◮ A random process: {X_t, t ∈ T}
◮ The set T can be countable, e.g., T = {0, 1, 2, ···}
◮ Or, T can be continuous, e.g., T = [0, ∞)
◮ These are termed discrete-time or continuous-time processes
◮ The random variables, X_t, may be discrete or continuous
◮ These are termed discrete-state or continuous-state processes
◮ The Markov chain we considered is a discrete-time discrete-state process
◮ A random process: {X_t, t ∈ T}
◮ We can think of this as a mapping: X : Ω × T → ℜ
◮ Thus, X(ω, ·) is a real-valued function over T.
◮ So, we can think of the process also as a collection of time functions.
◮ X can be thought of as a map that associates with each ω ∈ Ω a real-valued function on T.
◮ These functions are called sample paths or paths of the process
◮ We can view the random process as a collection of random variables, or as a collection of functions
◮ We will denote the random variables as X_t or X(t)

◮ A finite collection of random variables is completely specified by its joint distribution
◮ How do we characterize a random process?
◮ We need to specify the joint distribution of X_{t_1}, X_{t_2}, ···, X_{t_n} for all n and all t_1, t_2, ···, t_n ∈ T.
◮ One can show this completely specifies the process.
◮ As we saw, for a Markov chain, π_0 and P together specify all such joint distributions
Distributions of a random process
◮ A random process: {X_t, t ∈ T} or X : Ω × T → ℜ
◮ The first order distribution function of X is

  F_X(x; t) = Pr[X_t ≤ x] = F_{X_t}(x)

◮ The second order distribution function of X is

  F_X(x_1, x_2; t_1, t_2) = Pr[X_{t_1} ≤ x_1, X_{t_2} ≤ x_2]

◮ The nth order distribution function of X is

  F_X(x_1, ···, x_n; t_1, ···, t_n) = Pr[X_{t_i} ≤ x_i, i = 1, ···, n]

◮ When it is a discrete-state process, all X_t would be discrete random variables
◮ We can specify distributions through mass functions:

  f_X(x; t) = Pr[X_t = x] = f_{X_t}(x)
  f_X(x_1, x_2; t_1, t_2) = Pr[X_{t_1} = x_1, X_{t_2} = x_2]
  f_X(x_1, ···, x_n; t_1, ···, t_n) = Pr[X_{t_i} = x_i, i = 1, ···, n]

◮ If all X_t are continuous random variables and if all distributions have density functions, then we denote the joint density of X_{t_1}, ···, X_{t_n} by f_X(x_1, ···, x_n; t_1, ···, t_n)
◮ Specifying the nth order distributions for all n separately is not feasible.
◮ Hence one needs some assumptions on the model so that these are specified implicitly.
◮ One example is the Markovian assumption.
◮ As we saw, in a Markov chain, the transition probabilities and initial state probabilities would determine all the distributions
◮ Another such useful assumption is what is called a process with independent increments

◮ A random process {X(t), t ∈ T} is said to be a process with independent increments if
  for all t_1 < t_2 ≤ t_3 < t_4, the random variables X(t_2) − X(t_1) and X(t_4) − X(t_3) are independent
◮ Note that this also implies, e.g., X(t_1) is independent of X(t_2) − X(t_1) for all t_1 < t_2.
◮ Now suppose this is a discrete-state process.
◮ Then we can write the nth order pmf's as

  Pr[X(t_1) = x_1, X(t_2) = x_2, ···, X(t_n) = x_n]
    = Pr[X(t_1) = x_1, X(t_2) − X(t_1) = x_2 − x_1, ···]
    = Pr[X(t_1) = x_1] Pr[X(t_2) − X(t_1) = x_2 − x_1] ··· Pr[X(t_n) − X(t_{n−1}) = x_n − x_{n−1}]

◮ We only need up to second order distributions
◮ Let {X(t), t ∈ T} be a discrete-state process with independent increments
◮ Then we specify f_X(x; t) and another function

  g(x_1, x_2; t_1, t_2) = Pr[X(t_2) − X(t_1) = x_2 − x_1]

◮ Now we can get all distributions as

  f_X(x_1, ···, x_n; t_1, ···, t_n) = Pr[X(t_i) = x_i, i = 1, ···, n]
    = f_X(x_1; t_1) Π_{i=1}^{n−1} Pr[X(t_{i+1}) − X(t_i) = x_{i+1} − x_i]
    = f_X(x_1; t_1) Π_{i=1}^{n−1} g(x_i, x_{i+1}; t_i, t_{i+1})

◮ Given a random process {X(t), t ∈ T}
◮ Its mean or mean function is defined by

  η_X(t) = E[X(t)], t ∈ T

◮ We define the autocorrelation of the process by

  R_X(t_1, t_2) = E[X(t_1)X(t_2)]

◮ We define the autocovariance of the process by

  C_X(t_1, t_2) = E[(X(t_1) − E[X(t_1)])(X(t_2) − E[X(t_2)])]
               = R_X(t_1, t_2) − η_X(t_1)η_X(t_2)
Stationary Processes
◮ A random process {X(t), t ∈ T} is said to be stationary if
  for all n, for all t_1, ···, t_n, for all x_1, ···, x_n and for all τ we have

  F_X(x_1, ···, x_n; t_1, ···, t_n) = F_X(x_1, ···, x_n; t_1 + τ, ···, t_n + τ)

◮ For a stationary process, the distributions are unaffected by translation of the time axis.
◮ This is a rather stringent condition and is often referred to as strict-sense stationarity

◮ A homogeneous Markov chain started in its stationary distribution is a stationary process
◮ As we know, if π_0 is the stationary distribution then π_n is the same for all n.
◮ This, along with the Markov condition, would imply that a shift of time origin does not affect the distributions

  Pr[X_n = x_0, X_{n+1} = x_1, ···, X_{n+m} = x_m]
    = π_n(x_0)P(x_0, x_1) ··· P(x_{m−1}, x_m)
    = π_0(x_0)P(x_0, x_1) ··· P(x_{m−1}, x_m)
    = Pr[X_0 = x_0, X_1 = x_1, ···, X_m = x_m]
◮ Suppose {X(t), t ∈ T} is (strict-sense) stationary
◮ Then the first order distribution is independent of time

  F_X(x; t) = F_X(x; t + τ), ∀x, t, τ ⇒ e.g., F_X(x; t) = F_X(x; 0)

◮ This implies η_X(t) = η_X, a constant
◮ The second order distribution has to satisfy

  F_X(x_1, x_2; t, t + τ) = F_X(x_1, x_2; 0, τ), ∀x_1, x_2, t, τ

  Hence F_X(x_1, x_2; t_1, t_2) can depend only on t_1 − t_2
◮ This implies

  R_X(t, t + τ) = E[X(t)X(t + τ)] = R_X(τ)

  Autocorrelation depends only on the time difference

◮ The process {X(t), t ∈ T} is said to be wide-sense stationary if

  F_X(x; t) = F_X(x; t + τ), ∀x, t, τ
  F_X(x_1, x_2; t_1, t_2) = F_X(x_1, x_2; t_1 + τ, t_2 + τ)

◮ The process is wide-sense stationary if the first and second order distributions are invariant to translation of the time origin
◮ Let {X(t), t ∈ T} be wide-sense stationary. Then
  1. η_X(t) = η_X, a constant
  2. R_X(t_1, t_2) depends only on t_1 − t_2
◮ In many engineering applications, we call a process wide-sense stationary if the above two hold.
◮ In this course we take the above as the definition of a wide-sense stationary process
◮ When the process is wide-sense stationary, we write the autocorrelation as

  R_X(τ) = E[X(t)X(t + τ)]

Ergodicity
◮ Suppose X(n) is a discrete-time discrete-state process (like a Markov chain)
◮ Suppose it is wide-sense stationary. Then E[X(n)] does not depend on n
◮ Ergodicity is the question of

  lim_{n→∞} (1/n) Σ_{i=1}^n X(i) =? E[X(n)] = η_X

◮ We proved that this is true for an irreducible, aperiodic, positive recurrent Markov chain (with a finite state space)
◮ The question is: do 'time-averages' converge to 'ensemble-averages'
◮ The process is wide-sense stationary and hence all X(n) have the same distribution; but they need not be independent or uncorrelated (e.g., Markov chain)
◮ Ergodicity is a question of whether time-averages converge to ensemble-averages

  lim_{n→∞} (1/n) Σ_{i=1}^n X(i) =? E[X(n)] = η_X

  Or, more generally,

  lim_{n→∞} (1/n) Σ_{i=1}^n g(X(i)) =? E[g(X(n))]

  For a continuous time process we can write this as

  lim_{τ→∞} (1/2τ) ∫_{−τ}^{τ} X(t) dt =? E[X(t)] = η_X

◮ Essentially, if there is no long-term correlation in the process this may hold.
◮ One sufficient condition could be that the covariance between X(t) and X(t + τ) decreases fast with increasing τ.

◮ Define

  η_τ = (1/2τ) ∫_{−τ}^{τ} X(t) dt, (τ > 0)

◮ For each τ, η_τ is a rv. We write η for η_X.
◮ We say the process is mean-ergodic if η_τ → η in probability, as τ → ∞
◮ That is, if

  lim_{τ→∞} Pr[|η_τ − η| > ε] = 0, ∀ε > 0

◮ Note that E[η_τ] = η, ∀τ.
◮ Hence, by Chebyshev's inequality, it is enough if we show

  σ_τ² := E[(η_τ − η)²] → 0, as τ → ∞

◮ Let CX (t1 , t2 ) be the autocovariance of the process Z τ Z τ


Let I = CX (t1 − t2 ) dt2 dt1
−τ −τ
CX (t1 , t2 ) = E[(X(t1 ) − η)(X(t2 ) − η)]

◮ Assuming wide-sense stationarity, ◮ Let z = t1 − t2 . We want to change the integration to be


CX (t1 , t2 ) = CX (t1 − t2 ) over t2 and z
t1

◮ We can get στ2 as


T

στ2 = E (ητ − η)2


 
 Z τ Z τ 
1 1 ′ ′
= E (X(t) − η) dt (X(t ) − η) dt -T T t2
2τ −τ 2τ −τ
Z τZ τ
1
= 2
E[(X(t) − η)(X(t′ ) − η)] dt dt′
4τ −τ −τ -T
Z τZ τ
1
= CX (t − t′ ) dt dt′ ◮ Easy to see z goes from −2τ to 2τ
2
4τ −τ −τ When z ≥ 0, for a given z, t2 goes from −τ to τ − z
When z < 0, for a given z, t2 goes from −τ − z to τ
PS Sastry, IISc, Bangalore, 2020 32/35 PS Sastry, IISc, Bangalore, 2020 33/35
◮ Now we get

I = ∫_{−τ}^{τ} ∫_{−τ}^{τ} CX(t1 − t2) dt2 dt1
  = ∫_{−2τ}^{0} CX(z) ∫_{−τ−z}^{τ} dt2 dz + ∫_{0}^{2τ} CX(z) ∫_{−τ}^{τ−z} dt2 dz
  = ∫_{−2τ}^{0} CX(z) (τ − (−τ − z)) dz + ∫_{0}^{2τ} CX(z) (τ − z − (−τ)) dz
  = ∫_{−2τ}^{0} CX(z) (2τ + z) dz + ∫_{0}^{2τ} CX(z) (2τ − z) dz
  = ∫_{−2τ}^{2τ} CX(z) (2τ − |z|) dz

◮ Now we get στ² as

στ² = (1/4τ²) ∫_{−τ}^{τ} ∫_{−τ}^{τ} CX(t − t′) dt dt′
    = (1/4τ²) ∫_{−2τ}^{2τ} CX(z) (2τ − |z|) dz
    = (1/2τ) ∫_{−2τ}^{2τ} CX(z) (1 − |z|/2τ) dz

◮ Hence, a sufficient condition for στ² → 0 is

∫_{−∞}^{∞} |CX(z)| dz < ∞

◮ This is a sufficient condition for the process being mean-ergodic

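As a numerical illustration of this sufficient condition (a sketch, not part of the slides): an AR(1) sequence X(k+1) = aX(k) + W(k) with |a| < 1 has covariance decaying like a^|z|, which is absolutely summable, so its time average should converge to the ensemble mean 0. The parameter values below are arbitrary choices.

```python
import random

def ar1_path(a, sigma, n, rng):
    """Generate an AR(1) sequence X(k+1) = a*X(k) + W(k) with W(k)
    iid N(0, sigma^2); its covariance decays like a**|z|."""
    x, path = 0.0, []
    for _ in range(n):
        x = a * x + rng.gauss(0.0, sigma)
        path.append(x)
    return path

rng = random.Random(0)
path = ar1_path(a=0.5, sigma=1.0, n=200000, rng=rng)
time_average = sum(path) / len(path)   # (1/n) sum_i X(i)
print(time_average)                    # should be close to the ensemble mean 0
```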

Poisson Process
◮ This is the next process we study
◮ This is a discrete-state continuous-time process
◮ The index set is the interval [0, ∞) and all random variables are discrete and take non-negative integer values.

◮ A random process {N(t), t ≥ 0} is called a counting process if
1. N(t) ≥ 0 and is integer-valued
2. If s < t then N(s) ≤ N(t)
N(t) represents number of ‘events’ till t
◮ The counting process has independent increments if for all t1 < t2 ≤ t3 < t4, N(t2) − N(t1) is independent of N(t4) − N(t3)
◮ In particular, for all s > t, N(s) − N(t) is independent of N(t) − N(0)
◮ The process is said to have stationary increments if N(t2) − N(t1) has the same distribution as N(t2 + τ) − N(t1 + τ), ∀τ, ∀t2 > t1
◮ We start with two definitions of Poisson process
◮ Definition 1 A counting process {N(t), t ≥ 0} is said to be a Poisson process with rate λ > 0 if
1. N(0) = 0
2. The process has stationary and independent increments
3. Pr[N(t) = n] = e^{−λt} (λt)^n / n!, n = 0, 1, · · ·
◮ N(t) is Poisson with parameter λt
◮ E[N(t)] = λt and hence λ is called rate
◮ Since the process has stationary increments and N(0) = 0, (N(t + s) − N(s)) would be Poisson with parameter λt for all s, t > 0.

◮ Definition 2 A counting process {N(t), t ≥ 0} is said to be a Poisson process with rate λ > 0 if
1. N(0) = 0
2. The process has stationary and independent increments
3. Pr[N(h) = 1] = λh + o(h) and Pr[N(h) ≥ 2] = o(h)
◮ We say g(h) is o(h) if

lim_{h→0} g(h)/h = 0

◮ This definition tells us when Poisson process may be a good model
◮ We will show that both definitions are equivalent

◮ We first show Definition 2 ⇒ Definition 1
◮ For this we need to calculate the distribution of N(t)
◮ Let Pn(t) = Pr[N(t) = n]

P0(t + h) = Pr[N(t + h) = 0]
          = Pr[N(t) = 0, N(t + h) − N(t) = 0]
          = Pr[N(t) = 0] Pr[N(t + h) − N(t) = 0]   (because of independent increments)
          = Pr[N(t) = 0] Pr[N(h) = 0]   (stationary increments)
          = P0(t)(1 − λh + o(h))

⇒ (P0(t + h) − P0(t))/h = −λP0(t) + o(h)/h
⇒ (d/dt) P0(t) = −λP0(t)

◮ Now we can solve this differential equation to get P0(t)

(d/dt) P0(t) = −λP0(t)
⇒ (1/P0(t)) (d/dt) P0(t) = −λ
⇒ ln(P0(t)) = −λt + c
⇒ P0(t) = Ke^{−λt}

◮ Since P0(0) = Pr[N(0) = 0] = 1, we get K = 1 and hence

P0(t) = Pr[N(t) = 0] = e^{−λt}

◮ Next we consider Pn(t) for n > 0
Pn(t + h) = Pr[N(t + h) = n]
          = Pr[N(t) = n, N(t + h) − N(t) = 0] + Pr[N(t) = n − 1, N(t + h) − N(t) = 1]
            + Σ_{k=2}^{n} Pr[N(t) = n − k, N(t + h) − N(t) = k]
          = Pn(t)P0(h) + Pn−1(t)P1(h) + o(h)
          = Pn(t)(1 − λh + o(h)) + Pn−1(t)(λh + o(h)) + o(h)

⇒ (Pn(t + h) − Pn(t))/h = −λPn(t) + λPn−1(t) + o(h)/h
⇒ (d/dt) Pn(t) = −λPn(t) + λPn−1(t)

◮ We need to solve this linear ODE to obtain Pn:

(d/dt) Pn(t) + λPn(t) = λPn−1(t)

◮ The integrating factor is e^{λt}. Let Pn′(t) = (d/dt) Pn(t)

e^{λt} (Pn′(t) + λPn(t)) = e^{λt} λPn−1(t)
⇒ (d/dt) (Pn(t) e^{λt}) = λ e^{λt} Pn−1(t)

◮ We need Pn−1 to solve for Pn. Take n = 1

(d/dt) (P1(t) e^{λt}) = λ e^{λt} P0(t) = λ e^{λt} e^{−λt} = λ
⇒ e^{λt} P1(t) = λt + c ⇒ P1(t) = e^{−λt} (λt + c)

◮ Since P1(0) = Pr[N(0) = 1] = 0, c = 0
Hence P1(t) = λt e^{−λt}

◮ We showed: P0(t) = e^{−λt} and P1(t) = λt e^{−λt}
◮ We need to show: Pk(t) = e^{−λt} (λt)^k / k!
◮ Assume it is true till k = n − 1

(d/dt) (Pn(t) e^{λt}) = λ e^{λt} Pn−1(t) = λ e^{λt} e^{−λt} (λt)^{n−1}/(n − 1)! = λ^n t^{n−1}/(n − 1)!
⇒ e^{λt} Pn(t) = λ^n (t^n/n) (1/(n − 1)!) + c ⇒ Pn(t) = e^{−λt} (λt)^n / n!

where c = 0 because Pn(0) = 0.
◮ This completes the proof that Definition 2 implies Definition 1
◮ Now we prove Definition 1 implies Definition 2
◮ We need to only show point (3) of Definition 2 using point (3) of Definition 1

Let Pr[N(t) = k] = e^{−λt} (λt)^k / k!

Pr[N(h) = 1] = λh e^{−λh} = λh + λh (e^{−λh} − 1) = λh + o(h)

because

lim_{h→0} λh (e^{−λh} − 1) / h = lim_{h→0} λ (e^{−λh} − 1) = 0

◮ We showed Pr[N(h) = 1] = λh + o(h)
◮ Now we need to show Pr[N(h) ≥ 2] = o(h)

Pr[N(h) ≥ 2] = 1 − Pr[N(h) = 0] − Pr[N(h) = 1] = 1 − e^{−λh} − λh e^{−λh}

◮ This goes to zero faster than h as h → 0
◮ We can use L’Hospital rule:

lim_{h→0} (1 − e^{−λh} − λh e^{−λh}) / h = lim_{h→0} (λe^{−λh} − λe^{−λh} + λ²h e^{−λh}) = 0

◮ This completes the proof that Definition 1 implies Definition 2

◮ These two definitions are equivalent

◮ Since the process has stationary increments, for t2 > t1,

Pr[N(t2) − N(t1) = k] = Pr[N(t2 − t1) − N(0) = k] = e^{−λ(t2 − t1)} (λ(t2 − t1))^k / k!

◮ The first order distribution of the process is: N(t) ∼ Poisson(λt)
◮ This, along with the stationary and independent increments property, determines all distributions

Pr[N(t1) = n1, N(t2) = n2, N(t3) = n3]
= Pr[N(t1) = n1] Pr[N(t2) − N(t1) = n2 − n1] Pr[N(t3) − N(t2) = n3 − n2]
= Pr[N(t1) = n1] Pr[N(t2 − t1) = n2 − n1] Pr[N(t3 − t2) = n3 − n2]

where we assumed t1 < t2 < t3
◮ We can easily calculate mean and autocorrelation of the process

ηN(t) = E[N(t)] = λt ⇒ not stationary

With t2 > t1, we have

RN(t1, t2) = E[N(t2)N(t1)]
           = E[N(t1)(N(t2) − N(t1) + N(t1))]
           = E[N(t1)(N(t2) − N(t1))] + E[N(t1)²]
           = E[N(t1)] E[N(t2) − N(t1)] + E[N(t1)²]
           = E[N(t1)] E[N(t2 − t1)] + E[N(t1)²]
           = λt1 (λ(t2 − t1)) + (λt1 + λ²t1²)
           = λt1 + λ²t1t2

⇒ RN(t1, t2) = λ²t1t2 + λ min(t1, t2)

Inter-arrival or waiting times
◮ Let T1 denote the time of first event and let Tn denote the time between the (n − 1)st and nth events.
Let Sn = Σ_{i=1}^{n} Ti – time of nth event

Pr[T1 > t] = Pr[N(t) = 0] = e^{−λt} ⇒ T1 ∼ exponential(λ)

Pr[T2 > t | T1 = s] = Pr[0 events in (s, s + t] | T1 = s] = Pr[0 events in (s, s + t]] = e^{−λt}
⇒ Pr[T2 > t] = ∫ Pr[T2 > t | T1 = s] fT1(s) ds = e^{−λt}

◮ Tn are iid exponential with parameter λ

◮ The time of nth event is

Sn = Σ_{i=1}^{n} Ti

Since Ti are iid exponential, Sn is Gamma with parameters n, λ
◮ Let s < t.

Pr[T1 < s | N(t) = 1] = Pr[T1 < s, N(t) = 1] / Pr[N(t) = 1]
                      = Pr[1 event in (0, s), 0 in [s, t]] / Pr[N(t) = 1]
                      = λs e^{−λs} e^{−λ(t−s)} / (λt e^{−λt})
                      = s/t

◮ Conditioned on N(t) = 1, T1 is uniform over [0, t]

◮ This can be used, e.g., in simulating Poisson process
◮ We can cut time axis into small intervals of length h.
◮ In each interval we can decide whether or not there is an event, with prob λh.
◮ If there is an event, we choose its time uniformly in the interval.
◮ Called Bernoulli approximation of Poisson process
◮ We could also generate Poisson process by generating independent exponential random variables
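The second method can be sketched in code (an illustration, not from the slides): summing iid exponential(λ) inter-arrival times gives the event times, and the resulting count N(t) has mean close to λt.

```python
import random

def poisson_event_times(lam, t_end, rng):
    """Event times of a rate-lam Poisson process on [0, t_end], built by
    summing iid exponential(lam) inter-arrival times T1, T2, ..."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(lam)      # next inter-arrival time
        if t > t_end:
            return times
        times.append(t)

rng = random.Random(0)
lam, t_end = 2.0, 10.0
counts = [len(poisson_event_times(lam, t_end, rng)) for _ in range(5000)]
mean_count = sum(counts) / len(counts)
print(mean_count)                      # should be close to lam * t_end = 20
```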
Examples
◮ We look at a few simple example problems using Poisson process.

E[N(4) − N(2) | N(1) = 3] = E[N(4) − N(2)] = E[N(2) − 0] = 2λ

◮ Another example:

E[S4] = E[Σ_{i=1}^{4} Ti] = 4/λ

◮ The memoryless property of exponential rv can be useful

Pr[S3 > t | N(1) = 2] = 1 if t < 1;  e^{−λ(t−1)} if t ≥ 1

◮ We can explicitly derive this (taking t > 1)

Pr[S3 > t | N(1) = 2] = Pr[S3 > t, N(1) = 2] / Pr[N(1) = 2]
                      = Pr[2 events in (0, 1], 0 in (1, t)] / Pr[N(1) = 2]
                      = Pr[2 events in (0, 1]] Pr[0 in (1, t)] / Pr[2 events in (0, 1]]
                      = e^{−λ(t−1)}

◮ Here is another example

E[S4 | N(1) = 2] = 1 + E[S2] = 1 + 2/λ

Exercise for you: calculate Pr[S4 > t | N(1) = 2] and use it to find the above expectation

◮ Given a specific T0 we want to guess which is the last event before T0.
◮ Consider a strategy: we will wait till T0 − τ and pick the next event as the last one before T0.
◮ The probability of winning for this is

Pr[exactly 1 event in (T0 − τ, T0)] = λτ e^{−λτ}

◮ We pick τ to maximize this

λe^{−λτ} − λ²τ e^{−λτ} = 0 ⇒ τ = 1/λ

◮ Intuitively reasonable because the expected inter-arrival time is 1/λ

Example
◮ Let {N(t), t ≥ 0} be a Poisson process with rate λ
◮ Suppose each event can be one of two types – Type-I or Type-II
◮ N1(t) = number of Type-I events till t
◮ N2(t) = number of Type-II events till t
◮ Note that N(t) = N1(t) + N2(t), ∀t
◮ Suppose that, independently of everything else, an event is of Type-I with probability p and Type-II with probability (1 − p)
Theorem: {N1(t), t ≥ 0} and {N2(t), t ≥ 0} are Poisson processes with rate λp and λ(1 − p) respectively, and they are independent
Pr[N1(t) = n, N2(t) = m]
= Σ_k Pr[N1(t) = n, N2(t) = m | N(t) = k] Pr[N(t) = k]
= Pr[N1(t) = n, N2(t) = m | N(t) = m + n] Pr[N(t) = m + n]
= C(m + n, n) p^n (1 − p)^m e^{−λt} (λt)^{m+n} / (m + n)!
= ((m + n)!/(m! n!)) p^n (1 − p)^m e^{−λ(p+1−p)t} (λt)^m (λt)^n / (m + n)!
= e^{−λpt} (λpt)^n / n! · e^{−λ(1−p)t} (λ(1 − p)t)^m / m!

◮ This shows that N1(t) and N2(t) are independent Poisson

◮ The interesting issue here is that N1(t) and N2(t) are independent.
◮ Suppose customers arrive at a bank as a Poisson process with rate 12 per hour.
◮ Independently of everything, an arriving customer is male or female with equal probability.
◮ Q: Given that on some day 6 male customers came in the first half hour, what is the expected number of female customers in that half hour?
◮ The answer is 3 because the two processes are independent

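A quick simulation of the bank example (a sketch; the trial count and tolerances are arbitrary choices): splitting each arrival independently with p = 1/2 gives N1(t) with mean λpt = 3, and the average of N2(t) over runs with N1(t) = 6 stays near its unconditional mean, illustrating independence.

```python
import random

def split_counts(lam, p, t, n_trials, rng):
    """Simulate N(t) for a rate-lam Poisson process, mark each event
    independently as Type-I (prob p) or Type-II; return (N1, N2) pairs."""
    pairs = []
    for _ in range(n_trials):
        n, s = 0, rng.expovariate(lam)
        while s <= t:                   # count events in [0, t]
            n += 1
            s += rng.expovariate(lam)
        n1 = sum(1 for _ in range(n) if rng.random() < p)
        pairs.append((n1, n - n1))
    return pairs

rng = random.Random(0)
pairs = split_counts(lam=12.0, p=0.5, t=0.5, n_trials=20000, rng=rng)
mean_n1 = sum(a for a, _ in pairs) / len(pairs)        # near lam*p*t = 3
n2_given_6 = [b for a, b in pairs if a == 6]
cond_mean_n2 = sum(n2_given_6) / len(n2_given_6)       # near 3 by independence
print(mean_n1, cond_mean_n2)
```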

◮ The theorem is easily generalized to multiple types of events
◮ Consider a Poisson process with rate λ
◮ Suppose, independently of everything, an event is Type-i with probability pi, i = 1, · · · , K.
Note we have Σ_{i=1}^{K} pi = 1
◮ Let Ni(t) be the number of Type-i events till t
◮ Then, these are independent Poisson processes with rates λpi, i = 1, · · · , K

◮ Superposition of independent Poisson processes also gives a Poisson process.
◮ If N1 and N2 are independent Poisson processes with rates λ1 and λ2 then N(t) = N1(t) + N2(t) is a Poisson process with rate λ1 + λ2
◮ We know that the sum of independent Poisson rv’s is Poisson
◮ Suppose the number of radioactive particles emitted is Poisson with rate λ.
◮ We are counting particles using a sensor
◮ Suppose (independent of everything) an emitted particle is detected by our sensor with probability p
◮ Given that we detected K particles till t, what is the expected number of particles emitted?
◮ Let these processes be N(t), N1(t), N2(t)

E[N(t) | N1(t) = K] = E[N1(t) + N2(t) | N1(t) = K] = K + E[N2(t)] = K + λ(1 − p)t

where we have used independence of N1 and N2

◮ There is an interesting generalization of this.
◮ Events are of different types
◮ The type of an event can depend on the time of occurrence but it is independent of everything else.
◮ Suppose an event occurring at time t is Type-i with probability pi(t).
pi(t) ≥ 0, ∀i, t and Σ_{i=1}^{K} pi(t) = 1, ∀t
◮ Ni(t) is the number of Type-i events till t
Theorem: Then, at any t, Ni(t), i = 1, · · · , K are independent Poisson random variables with

E[Ni(t)] = λ ∫_0^t pi(s) ds

Example: Tracking infections
◮ We use a simple model
◮ Individuals get infected as a Poisson process with rate λ
◮ Time between getting infected and showing symptoms is a random variable with known distribution function G
An individual infected at s would show symptoms by t with probability G(t − s)
◮ The incubation times of different infected individuals are iid
◮ Define
◮ N(t) – total number infected till t
◮ N1(t) – number showing symptoms by t
◮ N2(t) – infected by t but not showing symptoms

◮ Define two types of events. We take t as current time and treat it as fixed
◮ An event occurring at s is Type-1 with probability G(t − s)
◮ It is Type-2 with probability 1 − G(t − s)
◮ Then, Type-1 individuals are those showing symptoms by t
◮ From our theorem,

E[N1(t)] = λ ∫_0^t G(t − s) ds = λ ∫_0^t G(y) dy
E[N2(t)] = λ ∫_0^t (1 − G(t − s)) ds = λ ∫_0^t (1 − G(y)) dy
◮ Suppose we have n1 people showing symptoms at t
◮ We can approximate

n1 ≈ E[N1(t)] = λ ∫_0^t G(y) dy

◮ Hence we can estimate

λ̂ = n1 / ∫_0^t G(y) dy

◮ Using this we can approximate

E[N2(t)] ≈ λ̂ ∫_0^t (1 − G(y)) dy

◮ The Poisson process we considered is called homogeneous because the rate is constant.
◮ For a non-homogeneous Poisson process the rate can be changing with time.
◮ But we can still use a definition similar to Definition 2

Pr[N(t + h) − N(t) = 1] = λ(t)h + o(h)

◮ We still stipulate independent increments though we cannot have stationary increments now
One can show that N(t + s) − N(t) is Poisson with parameter m(t + s) − m(t) where m(τ) = ∫_0^τ λ(s) ds
◮ Suppose Yi are iid and independent of N(t). Then

X(t) = Σ_{i=1}^{N(t)} Yi

is called a compound Poisson process


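A standard way to generate such a non-homogeneous process, when λ(t) ≤ λmax, is thinning: generate a homogeneous rate-λmax process and keep an event at time s with probability λ(s)/λmax. A sketch (the particular rate function λ(s) = 1 + sin²(s) is my assumption for illustration):

```python
import math
import random

def nonhomogeneous_events(lam_fn, lam_max, t_end, rng):
    """Thinning: generate a homogeneous rate-lam_max process and keep an
    event at time s with probability lam_fn(s)/lam_max."""
    events, s = [], 0.0
    while True:
        s += rng.expovariate(lam_max)
        if s > t_end:
            return events
        if rng.random() < lam_fn(s) / lam_max:
            events.append(s)

def lam_fn(s):
    return 1.0 + math.sin(s) ** 2          # assumed rate function, bounded by 2

rng = random.Random(0)
counts = [len(nonhomogeneous_events(lam_fn, 2.0, 10.0, rng)) for _ in range(4000)]
mean_count = sum(counts) / len(counts)
# E[N(10)] = m(10) = integral of lam_fn over [0, 10] = 15 - sin(20)/4, about 14.77
print(mean_count)
```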

Random Walk
◮ Let Zi be iid with Pr[Zi = +s] = Pr[Zi = −s] = 0.5
◮ Define a continuous-time process X(t) by

X(nT) = Z1 + Z2 + · · · + Zn
X(t) = X(nT), for nT ≤ t < (n + 1)T

◮ Viewed as a discrete-time process, X(nT) is a Markov chain.
◮ Called a (one dimensional) random walk
◮ It is the position after n random steps
◮ We defined X(t) by piece-wise constant interpolation of X(nT)
◮ We could have also used piece-wise linear interpolation
◮ We have EZi = 0 and E[Zi²] = s²
◮ Hence, E[X(nT)] = 0 and E[X²(nT)] = ns²
◮ For large n, X(nT)/(s√n) would be Gaussian

Pr[X(nT)/(s√n) ≤ y] ≈ Φ(y)

where Φ is the distribution function of standard Normal
◮ For any t, X(t) is X(nT) for n = [t/T].
Large n would mean large t. Hence

Pr[X(t) ≤ ms] = Pr[X(t)/(s√n) ≤ ms/(s√n)] ≈ Φ(m/√n), for large t

◮ We are interested in the limit of this process as T → 0

◮ Consider t = nT

E[X²(t)] = ns² = s² t/T

◮ If we let T → 0 then the variance goes to infinity (the process goes to infinity) unless we let s also go to zero.
◮ We actually need s² to go to zero at the same rate as T.
◮ So, we keep s² = αT and let T go to zero.
◮ Define

W(t) = lim_{T→0, s²=αT} X(t)

This is called the Wiener Process or Brownian motion.
This result is known as Donsker’s theorem
◮ Let us intuitively see some properties of W(t)

◮ We have seen that for n = [t/T],

Pr[X(t) ≤ ms] ≈ Φ(m/√n)

◮ Let w = ms and t = nT. Then

m/√n = (w/s)/√(t/T) = (w/√t) √(T/s²) = w/√(αt)

◮ W(t) is the limit of X(t) as T goes to zero
◮ As T goes to zero, any t is ‘large n’.
◮ Hence we can expect

Pr[W(t) ≤ w] = Φ(w/√(αt)) ⇒ W(t) ∼ N(0, αt)

◮ We had Zi iid and defined

X(nT) = Z1 + Z2 + · · · + Zn

◮ Hence we get

X((m + n)T) − X(nT) = Zn+1 + · · · + Zn+m

Thus, X(nT) is independent of X((m + n)T) − X(nT).
◮ Hence the X(nT) process has independent increments
◮ Hence, we can expect W(t) to be a process with independent increments
◮ X((m + n + k)T) − X((n + k)T) and X((m + n)T) − X(nT) both are sums of m of the Zi’s
◮ Hence both would have the same distribution
◮ Thus X(nT) would also have stationary increments.
◮ Hence we also expect W(t) to have stationary increments
◮ Thus, W(t) should be a process with stationary and independent increments and, for each t, W(t) is Gaussian with zero mean and variance proportional to t
◮ We will now formally define Brownian motion using these properties.

◮ Let {X(t), t ≥ 0} be a continuous-state continuous-time process.
This process is called a Brownian motion if
1. X(0) = 0
2. The process has stationary and independent increments
3. For every t > 0, X(t) is Gaussian with mean 0 and variance σ²t
◮ Let B(t) = X(t)/σ. Then, the variance of B(t) is t
◮ {B(t), t ≥ 0} is called standard Brownian Motion
◮ Let Y(t) = X(t) + µt. Then Y(t) has non-zero mean
◮ The mean can be a function of time
◮ {Y(t), t ≥ 0} is called Brownian motion with a drift

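The limiting construction above can be illustrated numerically (a sketch): simulate the random walk with a small step duration T and step size s = √(αT); the value at time t should then be approximately N(0, αt).

```python
import random

def walk_value_at(alpha, T, t, rng):
    """Value X(t) of the scaled random walk: steps of size s = sqrt(alpha*T)
    every T time units; as T -> 0 this approaches Brownian motion."""
    s = (alpha * T) ** 0.5
    x = 0.0
    for _ in range(int(t / T)):
        x += s if rng.random() < 0.5 else -s
    return x

rng = random.Random(0)
alpha, T, t = 1.0, 0.001, 1.0
samples = [walk_value_at(alpha, T, t, rng) for _ in range(4000)]
var_est = sum(x * x for x in samples) / len(samples)
print(var_est)                  # should be close to alpha * t = 1
```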

◮ Let {X(t), t ≥ 0} be a Brownian motion
◮ The process has stationary increments.
◮ Hence for t2 > t1, X(t2) − X(t1) has the same distribution as X(t2 − t1)
◮ Thus, X(t2) − X(t1) is Gaussian with zero mean and variance σ²(t2 − t1)
◮ Since increments are also independent, we can show that all nth order distributions are Gaussian

◮ We can calculate the autocorrelation function

RX(t1, t2) = E[X(t1)X(t2)]
           = E[X(t1)(X(t2) − X(t1) + X(t1))], (take t1 < t2)
           = E[X(t1)(X(t2) − X(t1))] + E[X²(t1)]
           = E[X(t1)] E[X(t2) − X(t1)] + E[X²(t1)]
           = E[X²(t1)]
           = σ²t1

◮ Since E[X(t)] = 0, ∀t, we have

Cov(X(t1), X(t2)) = E[X(t1)X(t2)] = σ² min(t1, t2)
◮ Suppose we want the joint distribution of X(t1), X(t2), · · · , X(tn)
◮ Let t1 < t2 < · · · < tn
◮ Define random variables Y1, · · · , Yn by

Y1 = X(t1), Yi = X(ti) − X(ti−1), i = 2, · · · , n

◮ We know Yi are independent because the process has independent increments
◮ The transformation is invertible

X(t1) = Y1
X(t2) = Y1 + Y2
X(t3) = Y1 + Y2 + Y3
...
X(tn) = Y1 + Y2 + · · · + Yn

◮ Hence we can get the joint density of X(t1), · · · , X(tn) in terms of the joint density of Y1, · · · , Yn
◮ This is how we can get the nth order density for any continuous-state process with independent increments
◮ Y1, · · · , Yn are independent and Gaussian and hence are jointly Gaussian
◮ Hence X(t1), · · · , X(tn) are jointly Gaussian
◮ Thus all nth order distributions are Gaussian

◮ X(t1), X(t2), · · · , X(tn) are jointly Gaussian.
◮ We can write their joint density because we know the means, variances and covariances
◮ We can also write the density using the transformation considered earlier.
Let t1 < t2 < · · · < tn

fX(x1, · · · , xn; t1, · · · , tn) = fY1(x1) fY2(x2 − x1) · · · fYn(xn − xn−1)

◮ Note that Yi = X(ti) − X(ti−1) is Gaussian with mean zero and variance σ²(ti − ti−1), i = 1, · · · , n (Take t0 = 0)

◮ Since all joint densities are Gaussian and are easy to write, we can also calculate conditional densities

fX(s)|X(t)(x|b) = fX(s)X(t)(x, b) / fX(t)(b)   (s < t)
               = fX(s)(x) fX(t)−X(s)(b − x) / fX(t)(b)
               ∝ e^{−x²/2s} e^{−(b−x)²/2(t−s)}   (taking σ² = 1)
               ∝ exp( −x² (1/2s + 1/2(t−s)) + bx/(t−s) )
               ∝ exp( −(t/2s(t−s)) (x² − 2(sb/t)x) )
               ∝ exp( −(x − bs/t)² / (2s(t−s)/t) )

◮ Hence the conditional density is Gaussian with mean bs/t and variance s(t − s)/t
◮ An important result is that Brownian motion paths are continuous
◮ Brownian motion is the limit of random walk where both s and T tend to zero
◮ Intuitively the paths should be continuous.
◮ The paths are continuous but non-differentiable everywhere
◮ This is a deep result

Hitting Times
◮ Let Ta denote the first time Brownian motion hits a. We take a > 0.

Pr[X(t) ≥ a] = Pr[X(t) ≥ a | Ta ≤ t] Pr[Ta ≤ t] + Pr[X(t) ≥ a | Ta > t] Pr[Ta > t]

◮ Since Brownian motion paths are continuous, Pr[X(t) ≥ a | Ta > t] = 0
◮ Brownian motion is a limit of symmetric random walk. Hence if we had already hit a sometime back, then now we are as likely to be above a as below it.

⇒ Pr[X(t) ≥ a | Ta ≤ t] = 1/2

Thus

Pr[X(t) ≥ a] = 0.5 Pr[Ta ≤ t]

◮ Hence we get

Pr[Ta ≤ t] = 2 Pr[X(t) ≥ a]
           = (2/√(2πt)) ∫_a^∞ e^{−x²/2t} dx
           = (2/√(2π)) ∫_{a/√t}^∞ e^{−y²/2} dy

◮ Here we have assumed a > 0. For a < 0 the situation is similar. Hence the above is true even for a < 0 except that the lower limit becomes |a|/√t
◮ Another interesting consequence is the following

Pr[max_{0≤s≤t} X(s) ≥ a] = Pr[Ta ≤ t], by continuity of paths
                         = 2 Pr[X(t) ≥ a]

Geometric Brownian Motion
◮ Let {Y(t), t ≥ 0} be a Brownian motion with drift. Define

X(t) = e^{Y(t)}

◮ Then, {X(t), t ≥ 0} is called geometric Brownian motion. It is useful in mathematical finance
◮ Let X0, X1, · · · be a time series of prices of a stock.
◮ Let Yn = Xn/Xn−1 and assume Yi are iid

Xn = Yn Xn−1 = Yn Yn−1 Xn−2 = · · · = Yn Yn−1 · · · Y1 X0
⇒ ln(Xn) = Σ_{i=1}^{n} ln(Yi) + ln(X0)

◮ Since ln(Yi) are iid, with suitable normalization, the interpolated process ln(X(t)) would be Brownian motion and X(t) would be geometric Brownian motion
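The identity Pr[max_{0≤s≤t} X(s) ≥ a] = 2 Pr[X(t) ≥ a] derived above can be checked on a discretized standard Brownian motion (a sketch; discretization slightly under-counts barrier crossings, so the simulated ratio sits a little below 2).

```python
import random

def bm_path(n_steps, t_end, rng):
    """Discretized standard Brownian motion on a grid over [0, t_end]."""
    dt = t_end / n_steps
    x, path = 0.0, [0.0]
    for _ in range(n_steps):
        x += rng.gauss(0.0, dt ** 0.5)
        path.append(x)
    return path

rng = random.Random(0)
a, trials = 1.0, 20000
hit = end_above = 0
for _ in range(trials):
    p = bm_path(200, 1.0, rng)
    hit += max(p) >= a              # event {T_a <= t}, up to discretization
    end_above += p[-1] >= a         # event {X(t) >= a}
ratio = hit / end_above
print(ratio)                        # reflection principle predicts about 2
```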
Gaussian Processes
◮ A continuous-time continuous-state process {X(t), t ≥ 0} is said to be a Gaussian process if for all n and all t1, t2, · · · , tn, we have that X(t1), · · · , X(tn) are jointly Gaussian.
◮ The Brownian motion is an example of a Gaussian process
◮ The Brownian motion is a Gaussian process with

E[X(t)] = 0, Var(X(t)) = σ²t, Cov(X(s), X(t)) = σ² min(s, t)

◮ Recall that the multivariate Gaussian density is specified by the marginal means, variances and the covariances of the random variables
◮ Hence, a general Gaussian process is specified by the mean function and the variance and covariance functions

◮ Consider the statistics of the Brownian motion process for 0 < t < 1 under the condition that X(1) = 0
◮ Consider standard Brownian motion (σ² = 1)

E[X(t) | X(1) = 0] = (t/1)·0 = 0

Recall that, for s < t, the conditional density of X(s) conditioned on X(t) = b is Gaussian with mean bs/t and variance s(t − s)/t

Now, for s < t < 1, since E[X(s) | X(1) = 0] = 0 for s < 1,

Cov(X(s), X(t) | X(1) = 0) ≜ E[X(s)X(t) | X(1) = 0]
 = E[ E[X(s)X(t) | X(t), X(1) = 0] | X(1) = 0]
 = E[ X(t) E[X(s) | X(t)] | X(1) = 0]
 = E[ X(t) (s/t) X(t) | X(1) = 0]
 = (s/t) E[X²(t) | X(1) = 0]
 = (s/t) t(1 − t)
 = s(1 − t)

Thus, for 0 < t < 1, conditioned on X(1) = 0, this process has mean 0 and covariance function s(1 − t), s < t

◮ Consider a process {Z(t), 0 ≤ t ≤ 1}.
◮ It is called a Brownian Bridge process if it is a Gaussian process with mean zero and covariance function Cov(Z(s), Z(t)) = s(1 − t) when s ≤ t.
◮ Let X(t) be a standard Brownian motion process.
◮ Then, Z(t) = X(t) − tX(1), 0 ≤ t ≤ 1, is a Brownian Bridge
◮ Easy to see it is a Gaussian process with mean zero. For s < t

Cov(Z(s), Z(t)) = Cov(X(s) − sX(1), X(t) − tX(1))
 = Cov(X(s), X(t)) − t Cov(X(s), X(1)) − s Cov(X(1), X(t)) + st Cov(X(1), X(1))
 = s − st − st + st = s(1 − t)

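The construction Z(t) = X(t) − tX(1) can be checked numerically (a sketch; grid size, trial count and the pair (s, t) = (0.25, 0.75) are arbitrary choices): the sample covariance of (Z(s), Z(t)) should be near s(1 − t).

```python
import random

def brownian_bridge(n_steps, rng):
    """Z(t) = X(t) - t*X(1) on a grid over [0, 1], X a standard BM path."""
    dt = 1.0 / n_steps
    x, path = 0.0, [0.0]
    for _ in range(n_steps):
        x += rng.gauss(0.0, dt ** 0.5)
        path.append(x)
    x1 = path[-1]
    return [xi - (i * dt) * x1 for i, xi in enumerate(path)]

rng = random.Random(0)
n, trials = 100, 20000
s_idx, t_idx = 25, 75               # s = 0.25, t = 0.75 on the grid
prods = []
for _ in range(trials):
    z = brownian_bridge(n, rng)
    prods.append(z[s_idx] * z[t_idx])
cov_est = sum(prods) / trials       # means are zero, so this estimates the covariance
print(cov_est)                      # theory: s(1 - t) = 0.25 * 0.25 = 0.0625
```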
White Noise
◮ Consider a process {V(t), t ≥ 0} with

E[V(t)] = 0; Var(V(t)) = σ²; Cov(V(t), V(s)) = 0, s ≠ t

◮ This is a kind of generalization of a sequence of iid random variables to continuous time
◮ It is an example of what is called White Noise.
◮ The actual concept involved is rather deep

◮ Assume V(t) is Gaussian. Let

X(t) = ∫_0^t V(τ) dτ

◮ Then we get E[X(t)] = 0 and

E[X²(t)] = ∫_0^t ∫_0^t E[V(t1)V(t2)] dt1 dt2 = σ² ∫_0^t dt1 = σ²t

E[X(t1)(X(t2) − X(t1))] = ∫_0^{t1} ∫_{t1}^{t2} E[V(t)V(t′)] dt dt′ = 0

◮ We see that X(t) is a process with mean zero, variance proportional to t and having uncorrelated increments.
◮ One can show that it would be a Brownian motion

◮ We have considered three random processes
◮ Markov Chain – Example of discrete-time discrete-state process
◮ Poisson Process – Example of continuous-time discrete-state process
◮ Brownian Motion – Example of continuous-time continuous-state process
◮ We need an example of discrete-time continuous-state process!
◮ Any sequence of continuous random variables would be a discrete-time continuous-state process

◮ Let {Xn, n = 0, 1, · · · } be a discrete-time continuous-state process.
◮ It is called a martingale if E|Xn| < ∞, ∀n and

E[Xn+1 | Xn, · · · , X0] = Xn, ∀n

◮ Suppose Zi are iid with Pr[Zi = +1] = Pr[Zi = −1] = 0.5. Let

Xn = Σ_{i=1}^{n} Zi ⇒ Xn+1 = Xn + Zn+1

◮ Since EZi = 0, ∀i,

E[Xn+1 | Xn, · · · , X0] = E[Xn + Zn+1 | Xn] = Xn + E[Zn+1 | Xn] = Xn

◮ Hence, Xn is a martingale.
◮ When Xn is a martingale, we have E[Xn+1] = E[Xn], ∀n

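A small simulation of the symmetric random walk martingale (a sketch): the stated consequence E[Xn] = E[X0] = 0 can be checked at several values of n.

```python
import random

def walk(n, rng):
    """Symmetric random walk X_n = Z_1 + ... + Z_n with Z_i = +-1 w.p. 1/2."""
    x = 0
    for _ in range(n):
        x += 1 if rng.random() < 0.5 else -1
    return x

rng = random.Random(0)
trials = 50000
means = [sum(walk(n, rng) for _ in range(trials)) / trials for n in (5, 20, 50)]
print(means)     # each entry should be close to E[X_0] = 0
```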
◮ Let {Xn, n = 0, 1, · · · } be such that E|Xn| < ∞, ∀n
◮ It is called a martingale if E[Xn+1 | Xn, · · · , X0] = Xn, ∀n
◮ It is called a supermartingale if E[Xn+1 | Xn, · · · , X0] ≤ Xn, ∀n
◮ It is called a submartingale if E[Xn+1 | Xn, · · · , X0] ≥ Xn, ∀n
Please note that these are ‘simplified’ definitions. In the above, the conditioning random variables can be another sequence Yi if Y1, · · · , Yn determine X1, · · · , Xn

◮ Martingales are useful because of the martingale convergence theorem.
Martingale convergence theorem: If Xn is a martingale with sup_n E|Xn| < ∞ then Xn converges almost surely to a rv X which will have finite expectation.
A positive supermartingale also converges almost surely

◮ Consider the 2-armed bandit problem in problem sheet 3.7
◮ The algorithm is

p(k + 1) = p(k) + λ(1 − p(k)) if arm 1 chosen and b(k) = 1
         = p(k) − λp(k) if arm 2 is chosen and b(k) = 1
         = p(k) if b(k) = 0

◮ We get

E[p(k + 1) − p(k) | p(k)]
 = λ(1 − p(k)) Pr[b(k) = 1, arm 1 | p(k)] − λp(k) Pr[b(k) = 1, arm 2 | p(k)]
 = λ(1 − p(k)) Pr[b(k) = 1 | arm 1, p(k)] Pr[arm 1 | p(k)] − λp(k) Pr[b(k) = 1 | arm 2, p(k)] Pr[arm 2 | p(k)]

◮ This gives us

E[p(k + 1) − p(k) | p(k)] = λ(1 − p(k)) d1 p(k) − λp(k) d2 (1 − p(k))
                          = λ p(k)(1 − p(k)) (d1 − d2)
                          ≥ 0, if d1 > d2

⇒ E[p(k + 1) | p(k)] ≥ p(k) ⇒ E[p(k + 1)] ≥ E[p(k)], ∀k

◮ This also shows p(k) is a submartingale.
◮ Here, p(k) is bounded and 1 − p(k) is a positive supermartingale.
◮ So, we can conclude that the algorithm converges almost surely

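A sketch of this algorithm in code (the success probabilities d1, d2 of the two arms, the step size λ and the run lengths are assumed values for illustration; b(k) = 1 is taken to occur with probability d1 or d2 for the chosen arm, consistent with the drift computation above):

```python
import random

def linear_reward_inaction(d1, d2, lam, n_steps, rng):
    """The update from the slides: choose arm 1 w.p. p(k); if the chosen
    arm yields b(k) = 1 (here, with assumed probability d1 or d2), move
    p toward that arm; if b(k) = 0, leave p unchanged."""
    p = 0.5
    for _ in range(n_steps):
        if rng.random() < p:                 # arm 1 chosen
            if rng.random() < d1:            # b(k) = 1
                p = p + lam * (1.0 - p)
        else:                                # arm 2 chosen
            if rng.random() < d2:            # b(k) = 1
                p = p - lam * p
    return p

rng = random.Random(0)
finals = [linear_reward_inaction(0.8, 0.3, 0.05, 3000, rng) for _ in range(200)]
frac_at_arm1 = sum(1 for p in finals if p > 0.95) / len(finals)
print(frac_at_arm1)      # most runs should converge toward the better arm
```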
◮ We have mentioned martingales as an example of discrete-time continuous-state processes
◮ A stochastic iterative algorithm essentially generates a discrete-time continuous-state process.
◮ Martingales are very useful in analyzing convergence of many stochastic algorithms
◮ While we mentioned only discrete-time martingales, one can similarly have continuous-time martingales

Continuous-Time Markov Chains
◮ Let {X(t), t ≥ 0} be a continuous-time discrete-state process
◮ Let X(t) take non-negative integer values
◮ It is called a continuous-time Markov chain if

Pr[X(t + s) = j | X(s) = i, X(u) ∈ Au, 0 ≤ u < s] = Pr[X(t + s) = j | X(s) = i]

◮ Only most recent past matters
◮ It is called a homogeneous chain if

Pr[X(t + s) = j | X(s) = i] = Pr[X(t) = j | X(0) = i], ∀s

◮ Define

Pij(t) = Pr[X(t) = j | X(0) = i] = Pr[X(t + s) = j | X(s) = i]

It is the probability of going from i to j in time t
◮ Analogous to transition probabilities in the discrete case
◮ Like in the discrete case, we can show that the Markov condition implies

Pr[X(s) ∈ Bs, s ∈ (t, t + τ] | X(t) = i, X(s′), 0 ≤ s′ < t] = Pr[X(s) ∈ Bs, s ∈ (t, t + τ] | X(t) = i]

◮ Next we consider the distribution of time spent in a state before leaving it

◮ By the Markov property and homogeneity we have

Pr[X(s) = i, s ∈ [t, t + τ] | X(s′) = i, 0 ≤ s′ ≤ t]
 = Pr[X(s) = i, s ∈ [t, t + τ] | X(t) = i]
 = Pr[X(s) = i, s ∈ [0, τ] | X(0) = i]

◮ Let X(0) = i and let Ti be the time spent in i before leaving it for the first time

Pr[X(s) = i, s ∈ [t, t + τ] | X(s′) = i, 0 ≤ s′ ≤ t] = Pr[Ti > t + τ | Ti > t]
Pr[X(s) = i, s ∈ [0, τ] | X(0) = i] = Pr[Ti > τ]
⇒ Pr[Ti > t + τ | Ti > t] = Pr[Ti > τ]
⇒ Ti is memoryless and hence exponential
◮ Once you transit into a state, the time spent in it is exponentially distributed.
◮ So, the chain can be viewed as follows
◮ Once you transit to a state i, it spends time, say, Ti ∼ exponential(νi) in it.
◮ Then, when it leaves i, it transits to state j with probability, say, zij
◮ We would have zij ≥ 0, Σ_j zij = 1. Also, zii = 0
◮ Note that Pij(t) is different from these zij

Example: Birth-Death process
◮ This is a generalization of the birth-death chains we saw earlier to continuous time
◮ From i the process can only go to i + 1 or i − 1
◮ A birth event takes it to i + 1 and a death event takes it to i − 1
◮ An example would be: X(t) is the number of people in a queuing system.
◮ A birth event would be a new person joining the queue.
◮ A death event would be a person leaving after finishing service

◮ Suppose, in state n, the time till the next arrival or birth event is exponential(λn).
◮ Let the time till the next departure or death event be exponential(µn). We assume that these two are independent
◮ Now, these λn and µn completely determine νn and zij and hence completely specify the chain
◮ zi,i+1 is the probability that when the system changes state it goes to i + 1
◮ Hence it is the probability that a birth event occurs before a death event.
◮ Let W1 ∼ exponential(λi) and W2 ∼ exponential(µi) be independent. Then

zi,i+1 = Pr[W1 < W2] = λi/(λi + µi) ; ⇒ zi,i−1 = µi/(λi + µi)

◮ The time spent in state i, Ti, is exponential(νi)
◮ The chain would be in state i till either a birth event or a death event occurs
◮ Hence, Ti = min(W1, W2)
◮ Hence, Ti ∼ exponential(λi + µi). Thus, νi = λi + µi
◮ We have taken the state space to be the non-negative integers. Hence, µ0 = 0, ν0 = λ0 and z01 = 1
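The two facts above — Ti = min(W1, W2) is exponential(λi + µi), and zi,i+1 = λi/(λi + µi) — are easy to sanity-check by Monte Carlo. A quick sketch with arbitrarily chosen rates λ = 2, µ = 3 (so the targets are 0.4 and 1/(λ + µ) = 0.2):

```python
import random

rng = random.Random(42)
lam, mu = 2.0, 3.0
n = 200_000
wins = 0          # count of birth-before-death events
total_min = 0.0   # accumulates min(W1, W2) to estimate the mean holding time
for _ in range(n):
    w1 = rng.expovariate(lam)   # time to the next birth event
    w2 = rng.expovariate(mu)    # time to the next death event
    wins += (w1 < w2)
    total_min += min(w1, w2)

p_birth_first = wins / n    # should be close to lam/(lam+mu) = 0.4
mean_hold = total_min / n   # should be close to 1/(lam+mu) = 0.2
```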
◮ Suppose λn = λ, ∀n and µn = 0, ∀n
◮ It is called a pure birth process
◮ The process spends time Ti ∼ exponential(λ) in state i and then moves to state i + 1
◮ This is the Poisson process

◮ Consider a queuing system
◮ Suppose people joining the queue form a Poisson process with rate λ
◮ Suppose the time to service each customer is independent and exponential with parameter µ.
◮ We assume that the arrival and service processes are independent.
◮ Then this is a birth-death process with λn = λ, n ≥ 0 and µn = µ, n ≥ 1
◮ This is known as an M/M/1 queue
◮ A variation: the M/M/K queue, with

λn = λ, n ≥ 0 and µn = nµ for 1 ≤ n ≤ K, µn = Kµ for n > K
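The M/M/1 and M/M/K rate structures can be written down as small helper functions; a sketch (the function names are mine, not standard):

```python
def mm1_rates(lam, mu):
    """Birth/death rates of an M/M/1 queue; state n = number in the system."""
    birth = lambda n: lam                     # arrivals occur at rate lam in every state
    death = lambda n: mu if n >= 1 else 0.0   # service completes only when n >= 1
    return birth, death

def mmk_rates(lam, mu, K):
    """Birth/death rates of an M/M/K queue: with n customers present,
    min(n, K) servers are busy, so mu_n = min(n, K) * mu."""
    birth = lambda n: lam
    death = lambda n: min(n, K) * mu
    return birth, death

# arbitrary example parameters
birth, death = mmk_rates(1.0, 2.0, 3)
b1, d1 = mm1_rates(1.5, 2.5)
```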

◮ Consider an example of some calculations with continuous-time Markov chains
◮ Consider a Birth-Death process. Let Yi be the time that a chain currently in i takes to reach state i + 1 for the first time.
◮ We want to calculate E[Yi]. (Note that E[Y0] = 1/λ0)
◮ The chain may directly go to i + 1 or it may go to i − 1 and then to i and then to i + 1 or ...
◮ Define Ii = 1 if the first transition out of i is to i + 1, and Ii = 0 if the first transition out of i is to i − 1
◮ We can find E[Yi] by conditioning on Ii.
◮ The time spent in i is exponential with rate λi + µi. Hence, the expected time till the transition out of i is 1/(λi + µi)
◮ If this transition is to i + 1 then that is the expected time to reach i + 1:

E[Yi | Ii = 1] = 1/(λi + µi)

◮ Suppose this transition is to i − 1. Then the expected time to reach i + 1 is this time plus the expected time to reach i from i − 1 plus the expected time to reach i + 1 from i:

E[Yi | Ii = 0] = 1/(λi + µi) + E[Yi−1] + E[Yi]
◮ We also have

Pr[Ii = 1] = zi,i+1 = λi/(λi + µi) ; Pr[Ii = 0] = µi/(λi + µi)

◮ Now we can calculate E[Yi] as

E[Yi] = Pr[Ii = 1] E[Yi | Ii = 1] + Pr[Ii = 0] E[Yi | Ii = 0]
= (λi/(λi + µi)) (1/(λi + µi)) + (µi/(λi + µi)) (1/(λi + µi) + E[Yi−1] + E[Yi])
= 1/(λi + µi) + (µi/(λi + µi)) (E[Yi−1] + E[Yi])

E[Yi] (1 − µi/(λi + µi)) = 1/(λi + µi) + (µi/(λi + µi)) E[Yi−1]

E[Yi] = 1/λi + (µi/λi) E[Yi−1]

◮ Thus we get

E[Yi] = 1/λi + (µi/λi) E[Yi−1], i ≥ 1

◮ Since E[Y0] = 1/λ0, we have a formula for E[Yi]
◮ For example,

E[Y1] = 1/λ1 + µ1/(λ1 λ0) ; E[Y2] = 1/λ2 + µ2/(λ2 λ1) + µ2 µ1/(λ2 λ1 λ0)

◮ The expected time to go from i to j, i < j, can now be computed as

E[Yi] + E[Yi+1] + · · · + E[Yj−1]

◮ Note that these are only for birth-death processes
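The recursion E[Yi] = 1/λi + (µi/λi) E[Yi−1] with E[Y0] = 1/λ0 is straightforward to evaluate numerically. A sketch (the rates and function names are my own, for illustration):

```python
def expected_up_times(lam, mu):
    """E[Y_i] for i = 0..len(lam)-1 via the recursion
    E[Y_i] = 1/lam_i + (mu_i/lam_i) * E[Y_{i-1}], with E[Y_0] = 1/lam_0."""
    EY = [1.0 / lam[0]]            # mu_0 = 0, so E[Y_0] = 1/lam_0
    for i in range(1, len(lam)):
        EY.append(1.0 / lam[i] + (mu[i] / lam[i]) * EY[i - 1])
    return EY

def expected_passage(lam, mu, i, j):
    """Expected time to go from i to j (i < j) = E[Y_i] + ... + E[Y_{j-1}]."""
    return sum(expected_up_times(lam, mu)[i:j])

# hypothetical rates: lam = (1, 2, 2), mu = (0, 1, 1)
# E[Y0] = 1, E[Y1] = 1/2 + (1/2)*1 = 1, E[Y2] = 1/2 + (1/2)*1 = 1
EY = expected_up_times([1.0, 2.0, 2.0], [0.0, 1.0, 1.0])
t03 = expected_passage([1.0, 2.0, 2.0], [0.0, 1.0, 1.0], 0, 3)
```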

◮ Consider the transition probabilities, Pij(t)
◮ These satisfy the Chapman-Kolmogorov equation:

Pij(t + s) = Pr[X(t + s) = j | X(0) = i]
= Σk Pr[X(t + s) = j | X(s) = k, X(0) = i] Pr[X(s) = k | X(0) = i]
= Σk Pr[X(t + s) = j | X(s) = k] Pr[X(s) = k | X(0) = i]
= Σk Pr[X(t) = j | X(0) = k] Pr[X(s) = k | X(0) = i]
= Σk Pkj(t) Pik(s)

◮ For a finite chain, P is a matrix and P(t + s) = P(t) P(s)
◮ The Chapman-Kolmogorov equation gives

Pij(t + s) = Σk Pik(s) Pkj(t)

◮ Hence we get

Pij(t + h) − Pij(t) = Σk Pik(h) Pkj(t) − Pij(t) = Σ_{k≠i} Pik(h) Pkj(t) − (1 − Pii(h)) Pij(t)

◮ Define

qik = lim_{h→0} Pik(h)/h, i ≠ k, and qii = lim_{h→0} (1 − Pii(h))/h

◮ Then, assuming limit and sum can be interchanged,

lim_{h→0} (Pij(t + h) − Pij(t))/h = Σ_{k≠i} qik Pkj(t) − qii Pij(t)
◮ By definition, 1 − Pii(h) is the probability that the chain that started in i is not in i at h.
◮ This is equivalent to there being a transition in the time h, and transitions out of i occur at the rate νi. Also, the probability of two or more transitions in h is o(h)
◮ Hence

1 − Pii(h) = νi h + o(h)

◮ Thus qii = νi. It is the rate of transition out of i
◮ We also have

νi = qii = lim_{h→0} (1 − Pii(h))/h = lim_{h→0} Σ_{j≠i} Pij(h)/h = Σ_{j≠i} qij

◮ By definition, Pij(h) = qij h + o(h), i ≠ j
◮ Hence qij is the rate at which transitions out of i into j occur.
◮ Transitions out of i occur with rate νi and a fraction zij of these are into j
◮ Hence, qij = νi zij, i ≠ j
◮ Thus, we get

νi = Σ_{j≠i} qij , zij = qij / Σ_{j≠i} qij , qii = Σ_{j≠i} qij

◮ The {qij} are called the infinitesimal generator of the process.
◮ A continuous-time Markov chain is specified by these qij
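Assembling the qij from νi and zij is one line per entry. A minimal sketch, writing the diagonal entry as −qii = −νi so that each row sums to zero (the convention examined below for the matrix Q̄); the rates here are made up for illustration:

```python
def generator_from_rates(nu, Z):
    """Build the generator matrix: q_ij = nu_i * z_ij for j != i, with
    diagonal entry -q_ii = -nu_i, so every row sums to zero."""
    n = len(nu)
    return [[nu[i] * Z[i][j] if j != i else -nu[i] for j in range(n)]
            for i in range(n)]

# hypothetical two-state rates
nu = [1.0, 2.0]
Z = [[0.0, 1.0],
     [1.0, 0.0]]
Q = generator_from_rates(nu, Z)   # [[-1.0, 1.0], [2.0, -2.0]]
```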

◮ Consider a Birth-Death process. We got earlier

νi = λi + µi , zi,i+1 = λi/(λi + µi) , zi,i−1 = µi/(λi + µi)

◮ Now we can calculate qij:

qi,i+1 = (λi + µi) λi/(λi + µi) = λi , qi,i−1 = (λi + µi) µi/(λi + µi) = µi

◮ This is intuitively obvious
◮ We specify a birth-death chain by the birth rate λi (rate of transition from i to i + 1) and the death rate µi (rate of transition from i to i − 1).
◮ The Chapman-Kolmogorov equations give us

Pij(t + h) − Pij(t) = Σ_{k≠i} Pik(h) Pkj(t) − (1 − Pii(h)) Pij(t)

◮ Using this we derived

Pij′(t) = Σ_{k≠i} qik Pkj(t) − qii Pij(t)

This is called the Kolmogorov backward equation
◮ We can solve these ODEs to get Pij(t)
◮ For a birth-death chain the equation becomes

Pij′(t) = λi Pi+1,j(t) + µi Pi−1,j(t) − (λi + µi) Pij(t)
Poisson process as a special case

◮ Consider the case: λi = λ and µi = 0, ∀i.
◮ This would be a Poisson process with rate λ.
◮ Taking i = 0, the differential equation becomes

P0j′(t) = λ P1j(t) − λ P0j(t)

◮ P0j(t) is the probability of j events in an interval of length t, which is the same as what we had called Pj(t).
◮ Similarly, P1j(t) is the same as what we called Pj−1(t) there
◮ Now one can see that the above ODE is what we got for the Poisson process.

◮ Consider a two-state Birth-Death chain.
◮ We would have µ0 = λ1 = 0. Let λ0 = λ and µ1 = µ
◮ The two states can be a machine working (state 0) or failed (state 1).
◮ λ is the rate of failure. The time till the next failure is exponential(λ)
◮ µ is the rate of repair. The time for repair is exponential(µ)
◮ We may want to calculate P00(T), the probability that the machine would be working at a time T units later given it is in working condition now
◮ We can calculate it by solving the ODEs

◮ For a birth-death chain,

Pij′(t) = λi Pi+1,j(t) + µi Pi−1,j(t) − (λi + µi) Pij(t)

◮ For the two-state chain, these equations are

P00′(t) = λ0 P10(t) − λ0 P00(t)
P01′(t) = λ0 P11(t) − λ0 P01(t)
P10′(t) = µ1 P00(t) − µ1 P10(t)
P11′(t) = µ1 P01(t) − µ1 P11(t)

◮ As is easy to see, we get a system of equations like this for any finite chain.
◮ Solving these we can show

P00(t) = (λ/(λ + µ)) e^{−(λ+µ)t} + µ/(λ + µ)

◮ Consider a finite chain. Then the transition probabilities can be represented as a matrix
◮ The Chapman-Kolmogorov equation gives

P(t + s) = P(t) P(s)

◮ Differentiating the above with respect to t,

P′(t + s) = P′(t) P(s)

◮ Putting t = 0 in the above we get

P′(s) = P′(0) P(s) = Q̄ P(s), where Q̄ = P′(0)

◮ The solution for this is

P(t) = e^{tQ̄}, because P(0) = I

◮ This is the expression for calculating Pij(t) for any t and i, j
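For a small chain, P(t) = e^{tQ̄} can be checked numerically against the closed-form P00(t) above. A truncated Taylor-series sketch in pure Python (in practice one would use a library routine such as scipy.linalg.expm), for the two-state machine chain with arbitrarily chosen λ = 1, µ = 2:

```python
import math

def expm(Q, t, terms=60):
    """Matrix exponential e^{tQ} via a truncated Taylor series; adequate
    for a small matrix with modest t*Q, as in this sketch."""
    n = len(Q)
    A = [[t * Q[i][j] for j in range(n)] for i in range(n)]
    P = [[float(i == j) for j in range(n)] for i in range(n)]   # identity = A^0/0!
    term = [row[:] for row in P]
    for k in range(1, terms):
        # term <- term . A / k, accumulating A^k / k!
        term = [[sum(term[i][l] * A[l][j] for l in range(n)) / k
                 for j in range(n)] for i in range(n)]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return P

lam, mu, t = 1.0, 2.0, 0.7
Q = [[-lam, lam],    # state 0: working, fails at rate lam
     [mu, -mu]]      # state 1: failed, repaired at rate mu
P = expm(Q, t)
# closed-form P00(t) from the slide
closed = (lam / (lam + mu)) * math.exp(-(lam + mu) * t) + mu / (lam + mu)
```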
◮ Let us examine the matrix Q̄ = [q̄ij]

Q̄ = P′(0) = lim_{h↓0} (P(h) − P(0))/h = lim_{h↓0} (P(h) − I)/h

◮ This gives us

for k ≠ j, q̄kj = lim_{h↓0} (Pkj(h) − 0)/h = qkj
q̄jj = lim_{h↓0} (Pjj(h) − 1)/h = −qjj = −νj

◮ Thus this Q̄ matrix has qik as off-diagonal entries and −qjj as diagonal entries
◮ So, each row here sums to zero
◮ We normally write it as Q and call it the infinitesimal generator of the process
◮ The Kolmogorov backward equation is

Pij′(t) = Σ_{k≠i} qik Pkj(t) − qii Pij(t)

◮ The above can be written in matrix form:

P′(t) = Q P(t)

◮ The off-diagonal entries of Q are the qik and the diagonal entries are −qii
◮ From the above equation, P′(0) = Q
◮ So, what we did is to write the backward equation in matrix form

◮ For the backward equation, we started with

Pij(t + h) = Σk Pik(h) Pkj(t)

◮ The Chapman-Kolmogorov equation also gives us

Pij(t + h) = Σk Pik(t) Pkj(h)

◮ Similar algebra as earlier gives us

Pij′(t) = Σ_{k≠j} Pik(t) qkj − qjj Pij(t)

(under some assumptions about interchanging limit and summation)
◮ This is known as the Kolmogorov forward equation
◮ For finite chains, both the forward and backward equations are the same
◮ For infinite chains there are some differences

◮ We can define transient and recurrent states as in the discrete case.
◮ However, we need to be careful about defining hitting times or first passage times
◮ We define

Ti = min{t > 0 : X(t) ≠ i} , fi = min{t : t > Ti, X(t) = i}

◮ For a chain started in i we take fi as the first return time to i
◮ A state i is said to be
Transient if Pr[fi < ∞ | X(0) = i] < 1
Recurrent if Pr[fi < ∞ | X(0) = i] = 1
◮ Most of the other definitions are also similar to the case of discrete chains
◮ The chain is said to be irreducible if for all i, j there is a positive probability of going from i to j in some finite time: Pij(t) > 0 for some t
◮ A recurrent state is positive recurrent if the mean time to return is finite: E[fi | X(0) = i] < ∞. Otherwise it is null recurrent
◮ An irreducible positive recurrent chain would have a unique stationary distribution
◮ There is no concept of periodicity in the continuous time case
◮ An irreducible positive recurrent chain would be called an ergodic chain

◮ Define

πj(t) = Pr[X(t) = j] = Σi πi(0) Pij(t)

This is also analogous to the discrete case
◮ The above equation is true for general infinite chains.
◮ In the finite case, we can get a more compact expression
◮ For a finite chain, taking π as a row vector,

π(t) = π(0) P(t) = π(0) e^{Qt}

◮ We say π is a stationary distribution if

π(0) = π ⇒ π(t) = π, ∀t

◮ Hence, if we start the chain in the stationary distribution, π′(t) = 0
◮ We get from the earlier equation

πj(t) = Σi πi(0) Pij(t) and hence πj′(t) = Σi πi(0) Pij′(t)

◮ Using the forward equation for Pij′(t),

Σi πi(0) ( Σ_{k≠j} qkj Pik(t) − qjj Pij(t) ) = 0
⇒ Σ_{k≠j} qkj πk − πj Σ_{k≠j} qjk = 0

when π is a stationary distribution and π(0) = π
◮ What we showed is that any stationary distribution π has to satisfy

Σ_{k≠j} qkj πk = πj Σ_{k≠j} qjk

◮ We can interpret this (as we did in the discrete case)
◮ qkj is the rate of transition from k to j and πk is the fraction present in k.
◮ Hence Σ_{k≠j} qkj πk is the net flow into j
◮ πj Σ_{k≠j} qjk is the net flow out of j
◮ At steady state the flows have to be balanced
◮ The above equation is known as a balance equation
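For a finite irreducible chain, the balance equations together with the normalization Σj πj = 1 determine π. A Gaussian-elimination sketch (a library solver such as numpy.linalg.solve would normally be used; the two-state generator below is the machine example with λ = 1, µ = 2, so the expected answer is π = (2/3, 1/3)):

```python
def stationary_distribution(Q):
    """Solve pi Q = 0 with sum(pi) = 1 for a finite generator matrix Q
    (rows summing to zero), assuming an irreducible chain so the system
    is nonsingular. One balance equation is replaced by the normalization."""
    n = len(Q)
    # Row j (j = 0..n-2): sum_i pi_i Q[i][j] = 0; last row: sum_i pi_i = 1
    A = [[Q[i][j] for i in range(n)] for j in range(n - 1)] + [[1.0] * n]
    b = [0.0] * (n - 1) + [1.0]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        pi[r] = (b[r] - sum(A[r][c] * pi[c] for c in range(r + 1, n))) / A[r][r]
    return pi

pi = stationary_distribution([[-1.0, 1.0], [2.0, -2.0]])
```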
