Lecture Slides - 230809 - 154641
E1 222 Stochastic Models and Applications

P.S. Sastry
[email protected]

References:
◮ V.K. Rohatgi and A.K.Md.E. Saleh, An Introduction to Probability and Statistics, Wiley, 2nd edition, 2018.
◮ S. Ross, Introduction to Probability Models, Elsevier, 12th edition, 2019.
◮ P.G. Hoel, S. Port and C. Stone, Introduction to Probability Theory, 1971.
◮ P.G. Hoel, S. Port and C. Stone, Introduction to Stochastic Processes, 1971.
Probability axioms

◮ P : F → ℜ, F ⊂ 2^Ω (Events are subsets of Ω)
  A1 P(A) ≥ 0, ∀A ∈ F
  A2 P(Ω) = 1
  A3 If Ai ∩ Aj = φ, ∀i ≠ j, then P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai)
◮ For these axioms to make sense, we are assuming
  (i). Ω ∈ F and (ii). A1, A2, · · · ∈ F ⇒ (∪i Ai) ∈ F

Simple consequences of the axioms

◮ Notation: A^c is the complement of A.
  C = A + B implies A, B are mutually exclusive and C is their union.
◮ Let A ⊂ B be events. Then B = A + (B − A).
  Now we can show P(A) ≤ P(B):
  P(B) = P(A + (B − A)) = P(A) + P(B − A) ≥ P(A)
  This also shows P(B − A) = P(B) − P(A) when A ⊂ B.
◮ We have Ω = {(x, y) : 0 < x < y < 1}
◮ The lengths are: x, (y − x), (1 − y). So we need
  x + (y − x) > (1 − y) ⇒ y > 0.5
◮ A = {(x, y) : y > 0.5; x < 0.5; y < x + 0.5}
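◮ (Added illustration; not on the original slide.) A minimal Monte Carlo sketch in Python, assuming NumPy, that estimates P(A) by sampling (x, y) uniformly from Ω and checking the three conditions defining A:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Sorting two independent uniforms gives a uniform sample on {(x, y): 0 < x < y < 1}
u = rng.random((n, 2))
x, y = np.sort(u, axis=1).T

# The three conditions defining A
A = (y > 0.5) & (x < 0.5) & (y < x + 0.5)

print(A.mean())   # should be close to 1/4 (area of A divided by area of Omega)
```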
◮ Everything we do in probability theory is always in reference to an underlying probability space: (Ω, F, P) where
  ◮ Ω is the sample space
  ◮ F ⊂ 2^Ω is the set of events; each event is a subset of Ω
  ◮ P : F → [0, 1] is a probability (measure) that assigns a number between 0 and 1 to every event (satisfying the three axioms).

Conditional Probability

◮ Let B be an event with P(B) > 0. We define conditional probability, conditioned on B, of any event, A, as
  P(A | B) = P(A ∩ B)/P(B) = P(AB)/P(B)
◮ The above is a notation. "A | B" does not represent any set operation! (This is an abuse of notation!)
◮ Given a B, conditional probability is a new probability assignment to any event.
◮ That is, given B with P(B) > 0, we define a new probability P_B : F → [0, 1] by
  P_B(A) = P(AB)/P(B)
◮ Conditional probability is a probability. What does this mean?
◮ The new function we defined, P_B : F → [0, 1], P_B(A) = P(AB)/P(B), satisfies the three axioms of probability.
  ◮ P_B(A) ≥ 0 and P_B(Ω) = 1.
  ◮ If A1, A2 are mutually exclusive then A1B and A2B are also mutually exclusive and hence
    P_B(A1 + A2) = P((A1 + A2)B)/P(B) = P(A1B + A2B)/P(B) = (P(A1B) + P(A2B))/P(B) = P_B(A1) + P_B(A2)
◮ Once we understand conditional probability is a new probability assignment, we go back to the 'standard notation'
  P(A | B) = P(AB)/P(B)
◮ Note P(B|B) = 1 and P(A|B) > 0 only if P(AB) > 0.
◮ Now the 'new' probability of each event is determined by what it has in common with B.
◮ If we know the event B has occurred, then based on this knowledge we can readjust probabilities of all events and that is given by the conditional probability.
◮ Intuitively it is as if the sample space is now reduced to B because we are given the information that B has occurred.
◮ This is a useful intuition as long as we understand it properly.
◮ It is not as if we talk about conditional probability only for subsets of B. Conditional probability is also with respect to the original probability space. Every element of F has conditional probability defined.
P(A | B) = P(AB)/P(B)

◮ Suppose P(A | B) > P(A). Does it mean "B causes A"?
  P(A | B) > P(A) ⇒ P(AB) > P(A)P(B)
                  ⇒ P(AB)/P(A) > P(B)
                  ⇒ P(B | A) > P(B)
◮ In a conditional probability, the conditioning event can be any event (with positive probability)
◮ In particular, it could be an intersection of events.
◮ We think of that as conditioning on multiple events.
  P(A | B, C) = P(A | BC) = P(ABC)/P(BC)
◮ The conditional probability is defined by P(A | B) = P(AB)/P(B)
◮ This gives us a useful identity
  P(AB) = P(A | B)P(B)
◮ We can iterate this for multiple events
  P(ABC) = P(A | BC)P(BC) = P(A | BC)P(B | C)P(C)

◮ Let B1, · · · , Bm be events such that ∪_{i=1}^m Bi = Ω and Bi Bj = φ, ∀i ≠ j.
◮ Such a collection of events is said to be a partition of Ω. (They are also sometimes said to be mutually exclusive and collectively exhaustive).
◮ Given this partition, any other event can be represented as a mutually exclusive union as
  A = AB1 + · · · + ABm
  To explain the notation again
  A = A ∩ Ω = A ∩ (B1 ∪ · · · ∪ Bm) = (A ∩ B1) ∪ · · · ∪ (A ∩ Bm)
◮ The probabilities P(T+|D) and P(T+|D^c) can be obtained through, for example, laboratory experiments.
◮ P(T+|D) is called the true positive rate and P(T+|D^c) is called the false positive rate.
◮ We also need P(D), the probability of a random person having the disease.

◮ Now suppose we can improve the test so that P(T+|D^c) = 0.01
  P(D|T+) = (0.99 × 0.1)/(0.99 × 0.1 + 0.01 × 0.9) ≈ 0.92
◮ These different cases are important in understanding the role of the false positive rate.
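◮ (Added illustration.) A small Python sketch of the same Bayes-rule computation; the prior 0.1 and the rate 0.99 are the numbers used above, while the second false positive rate 0.1 is a hypothetical value for comparison:

```python
def posterior(p_d, tpr, fpr):
    """P(D | T+) via Bayes rule: prior p_d, true positive rate tpr, false positive rate fpr."""
    num = tpr * p_d
    return num / (num + fpr * (1.0 - p_d))

# Numbers used on the slide: P(D) = 0.1, P(T+|D) = 0.99, P(T+|D^c) = 0.01
print(posterior(0.1, 0.99, 0.01))   # about 0.92
# A hypothetical poorer test with false positive rate 0.1, for comparison
print(posterior(0.1, 0.99, 0.10))   # posterior drops to about 0.52
```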
◮ P(D) is the probability that a random person has the disease. We call it the prior probability.
◮ P(D|T+) is the probability of the random person having the disease once we do a test and it came out positive. We call it the posterior probability.
◮ Bayes rule essentially transforms the prior probability to the posterior probability.

◮ In many applications of Bayes rule the same generic situation exists
◮ Based on a measurement we want to predict (what may be called) the state of nature.
◮ For another example, take a simple communication system.
◮ D can represent the event that the transmitter sent bit 1.
◮ T+ can represent an event about the measurement we made at the receiver.
◮ We want the probability that bit 1 is sent based on the measurement.
◮ The knowledge we need is P(T+|D), P(T+|D^c) which can be determined through experiment or modelling of the channel.
P(D|T+) = P(T+|D)P(D) / (P(T+|D)P(D) + P(T+|D^c)P(D^c))

◮ In the binary situation we can think of Bayes rule in a slightly modified form too.
◮ Not all applications of Bayes rule involve a 'binary' situation
◮ Suppose D1, D2, D3 are the (exclusive) possibilities and T is an event about a measurement.
Independent Events

◮ Two events A, B are said to be independent if
  P(AB) = P(A)P(B)
◮ Note that this is a definition. Two events are independent if and only if they satisfy the above.
◮ Suppose P(A), P(B) > 0. Then, if they are independent
  P(A|B) = P(AB)/P(B) = P(A); similarly P(B|A) = P(B)
◮ This gives an intuitive feel for independence.
◮ Independence is an important (often confusing!) concept.

Example: Independence

A class has 20 female and 30 male course (MTech) students and 6 female and 9 male research (PhD) students. Are gender and degree independent?
◮ Let F, M, C, R denote the events of female, male, course, research students
◮ From the given numbers, we can easily calculate the following:
  P(F) = 26/65 = 2/5;  P(C) = 50/65 = 10/13;  P(FC) = 20/65 = 4/13
◮ Hence we can verify
  P(F)P(C) = (2/5)(10/13) = 4/13 = P(FC)
  and conclude that F and C are independent.
◮ Similarly we can show for others.
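◮ (Added illustration.) The same check done by enumeration in Python, using the class counts given above:

```python
# Class composition from the example: (gender, degree) counts
counts = {('F', 'C'): 20, ('M', 'C'): 30, ('F', 'R'): 6, ('M', 'R'): 9}
total = sum(counts.values())                                 # 65

P_F  = (counts[('F', 'C')] + counts[('F', 'R')]) / total    # 26/65
P_C  = (counts[('F', 'C')] + counts[('M', 'C')]) / total    # 50/65
P_FC = counts[('F', 'C')] / total                            # 20/65

print(P_F * P_C, P_FC)   # both equal 4/13, so F and C are independent
```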
◮ Consider the random experiment of tossing two fair coins (or tossing a coin twice).
◮ Ω = {HH, HT, TH, TT}. Suppose we employ the 'equally likely idea'.
◮ That is, P({HH}) = 1/4, P({HT}) = 1/4 and so on
◮ Let A = 'H on 1st toss' = {HH, HT} (P(A) = 1/2)
  Let B = 'T on second toss' = {HT, TT} (P(B) = 1/2)
◮ We have P(AB) = P({HT}) = 0.25
◮ Since P(A)P(B) = (1/2)(1/2) = 1/4 = P(AB), A, B are independent.
◮ Hence, in multiple tosses, assuming all outcomes are equally likely implies the outcome of one toss is independent of another.

◮ In multiple tosses, assuming all outcomes are equally likely is alright if the coin is fair.
◮ Suppose we toss a biased coin two times.
◮ Then the four outcomes are, obviously, not 'equally likely'
◮ How should we then assign these probabilities?
◮ If we assume tosses are independent then we can assign probabilities easily.
Recap
◮ Conditional probability of A given (or conditioned on) B is
  P(A|B) = P(AB)/P(B)
◮ This gives us the identity: P(AB) = P(A|B)P(B)
◮ This holds for multiple events, e.g.,
  P(ABC) = P(A|BC)P(B|C)P(C)
◮ Given a partition, Ω = B1 + B2 + · · · + Bm, for any event, A,
  P(A) = Σ_{i=1}^m P(A|Bi)P(Bi)   (Total Probability rule)

Recap

◮ Bayes Rule
  P(D|T) = P(T|D)P(D) / (P(T|D)P(D) + P(T|D^c)P(D^c))
◮ Bayes rule can be viewed as transforming a prior probability into a posterior probability.
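◮ (Added illustration.) A tiny Python sketch of the Total Probability rule recapped above; the partition probabilities and conditional probabilities are hypothetical values chosen just for the example:

```python
# Hypothetical partition B1, B2, B3 with P(Bi) and P(A|Bi)
P_B          = [0.5, 0.3, 0.2]
P_A_given_B  = [0.1, 0.4, 0.8]

# Total probability rule: P(A) = sum_i P(A|Bi) P(Bi)
P_A = sum(pa * pb for pa, pb in zip(P_A_given_B, P_B))
print(P_A)   # 0.05 + 0.12 + 0.16 = 0.33
```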
Pair-wise independence

◮ Events A1, A2, · · · , An are said to be pair-wise independent if
  P(Ai Aj) = P(Ai)P(Aj), ∀i ≠ j
◮ Events may be pair-wise independent but not (totally) independent.
◮ Example: Four balls in a box inscribed with '1', '2', '3' and '123'. Let Ei be the event that number 'i' appears on a randomly drawn ball, i = 1, 2, 3.
◮ Easy to see: P(Ei) = 0.5, i = 1, 2, 3.
◮ P(Ei Ej) = 0.25 (i ≠ j) ⇒ pairwise independent
◮ But, P(E1 E2 E3) = 0.25 ≠ (0.5)^3

Conditional Independence

◮ Events A, B are said to be (conditionally) independent given C if
  P(AB|C) = P(A|C)P(B|C)
◮ If the above holds
  P(A|BC) = P(ABC)/P(BC) = P(AB|C)P(C)/P(BC) = P(A|C)P(B|C)P(C)/P(BC) = P(A|C)
◮ Events may be conditionally independent but not independent. (e.g., 'independent' multiple tests for confirming a disease)
◮ It is also possible that A, B are independent but are not conditionally independent given some other event C.
An ↑ : lim_{n→∞} An = ∪_{k=1}^∞ Ak   (similarly, An ↓ : lim_{n→∞} An = ∩_{k=1}^∞ Ak)
◮ Let us look at simple examples of monotone sequences of subsets of ℜ.
◮ Consider a sequence of intervals:
  An = [a, b + 1/n), n = 1, 2, · · · with a, b ∈ ℜ, a < b.
  [Figure: the intervals [a, b + 1), [a, b + 0.5), . . . on the real line]
◮ We have An ↓ and lim An = ∩_i Ai = [a, b]
◮ Why? – because
  ◮ b ∈ An, ∀n ⇒ b ∈ ∩_i Ai, and
  ◮ ∀ǫ > 0, b + ǫ ∉ An after some n ⇒ b + ǫ ∉ ∩_i Ai.
    For example, b + 0.01 ∉ A101 = [a, b + 1/101).

◮ We have shown that ∩_n [a, b + 1/n) = [a, b]
◮ Similarly we can get ∩_n (a − 1/n, b] = [a, b]
◮ Now consider An = [a, b − 1/n].
  [Figure: the intervals [a, b − 1], [a, b − 0.5], . . . on the real line]
◮ Now, An ↑ and lim An = ∪_n An = [a, b).
◮ Why? – because
  ◮ ∀ǫ > 0, ∃n s.t. b − ǫ ∈ An ⇒ b − ǫ ∈ ∪_n An;
  ◮ but b ∉ An, ∀n ⇒ b ∉ ∪_n An.
◮ These examples also show how using countable unions or intersections we can convert one end of an interval from 'open' to 'closed' or vice versa.
An ↑ : lim_{n→∞} An = ∪_{k=1}^∞ Ak
[Figure: nested sets A1 ⊂ A2 ⊂ A3 with B2, B3 the successive differences]

◮ Having defined the limits, we now ask the question
  P(lim_{n→∞} An) =? lim_{n→∞} P(An)
Theorem: Let An ↑. Then P(lim_n An) = lim_n P(An)
◮ Since An ↑, An ⊂ An+1.
◮ Define sets Bi, i = 1, 2, · · · , by
  B1 = A1,  Bk = Ak − Ak−1, k = 2, 3, · · ·
◮ Note that the Bk are mutually exclusive. Also note that
  An = ∪_{k=1}^n Bk and hence P(An) = Σ_{k=1}^n P(Bk)
◮ We also have
  ∪_{k=1}^n Ak = ∪_{k=1}^n Bk, ∀n and hence ∪_{k=1}^∞ Ak = ∪_{k=1}^∞ Bk
◮ Thus we get
  P(lim_n An) = P(∪_{k=1}^∞ Ak) = P(∪_{k=1}^∞ Bk) = Σ_{k=1}^∞ P(Bk) = lim_n Σ_{k=1}^n P(Bk) = lim_n P(An)

◮ We showed that when An ↑, P(lim_n An) = lim_n P(An)
◮ We can show this for the case An ↓ also.
◮ Note that if An ↓, then A^c_n ↑. Using this and the theorem we can show it. (Left as an exercise)
◮ This property is known as monotone sequential continuity of the probability measure.
◮ What P should we consider for this uncountable Ω? We are not sure what to take.
◮ So, let us ask only for some consistency. For any subset of this Ω that is specified only through outcomes of the first n tosses, that event should have the same probability as in the finite probability space corresponding to n tosses.
◮ Consider an event here;
  A = {(ω1, ω2, ....) : ω1 = ω2 = 0} ⊂ Ω
  A is the event of tails on the first two tosses.
◮ We are saying we must have P(A) = (0.5)^2.

◮ For n = 1, 2, · · · , define
  An = {(ω1, ω2, ....) : ωi = 0, i = 1, · · · , n}
◮ An is the event of no head in the first n tosses and we know P(An) = (0.5)^n.
◮ Note that ∩_{k=1}^∞ Ak is the event we want.
◮ Note that An ↓ because An+1 ⊂ An.
◮ Hence we get
  P(∩_{k=1}^∞ Ak) = P(lim_n An) = lim_n P(An) = lim_n (0.5)^n = 0
◮ Now we can complete the problem
Random Variable
Recap: Monotone Sequences of Sets

◮ A sequence, A1, A2, · · · , is said to be monotone decreasing if
  An+1 ⊂ An, ∀n (denoted as An ↓)
◮ The limit of a monotone decreasing sequence is
  An ↓ : lim_{n→∞} An = ∩_{k=1}^∞ Ak
◮ A sequence, A1, A2, · · · , is said to be monotone increasing if
  An ⊂ An+1, ∀n (denoted as An ↑)
◮ The limit of a monotone increasing sequence is
  An ↑ : lim_{n→∞} An = ∪_{k=1}^∞ Ak

Recap: Monotone Sequential Continuity

◮ We showed that
  P(lim_{n→∞} An) = lim_{n→∞} P(An)
  when An ↓ or An ↑
Random Variable

◮ A random variable is a real-valued function on Ω:
  X : Ω → ℜ
◮ For example, Ω = {H, T}, X(H) = 1, X(T) = 0.
◮ Another example: Ω = {H, T}^3, X(ω) is the number of H's.
◮ A random variable maps each outcome to a real number.
◮ It essentially means we can treat all outcomes as real numbers.
◮ We can effectively work with ℜ as sample space in all probability models

◮ Let (Ω, F, P) be our probability space and let X be a random variable defined in this probability space.
◮ We know X maps Ω into ℜ.
◮ This random variable results in a new probability space:
  (Ω, F, P) --X--> (ℜ, B, PX)
  where ℜ is the new sample space, B ⊂ 2^ℜ is the new set of events and PX is a probability defined on B.
◮ For now we will assume that any set of ℜ that we want would be in B and hence is an event.
◮ PX is a new probability measure (which depends on P and X) that assigns probability to different subsets of ℜ.
◮ Given a probability space (Ω, F, P), a random variable X:
  (Ω, F, P) --X--> (ℜ, B, PX)
◮ We define PX:
  PX(B) = P({ω ∈ Ω : X(ω) ∈ B}), B ∈ B
  [Figure: X maps the sample space to the real line; B is a subset of the real line]
◮ We use the notation
  [X ∈ B] = {ω ∈ Ω : X(ω) ∈ B}
◮ So, now we can write
  PX(B) = P([X ∈ B]) = P[X ∈ B]
◮ For the definition of PX to be proper, for each B ∈ B, we must have [X ∈ B] ∈ F.
  We will assume that. (This is trivially true if F = 2^Ω).
◮ We can easily verify PX is a probability measure. It satisfies the axioms.
◮ Given a probability space (Ω, F, P), a random variable X
◮ We define PX:
  PX(B) = P[X ∈ B] = P({ω ∈ Ω : X(ω) ∈ B})
◮ Easy to see: PX(B) ≥ 0, ∀B and PX(ℜ) = 1
◮ If B1 ∩ B2 = φ then PX(B1 ∪ B2) = P[X ∈ B1 ∪ B2] = ?

◮ Let us look at a couple of simple examples.
◮ Let Ω = {H, T} and P(H) = p.
  Let X(H) = 1; X(T) = 0.
  [X ∈ {0}] = {ω : X(ω) = 0} = {T}
  [X ∈ [−3.14, 0.552]] = {ω : −3.14 ≤ X(ω) ≤ 0.552} = {T}
  [X ∈ (0.62, 15.5)] = {ω : 0.62 < X(ω) < 15.5} = {H}
  [X ∈ [−2, 2)] = Ω
◮ Hence we get, for example, PX({0}) = P({T}) = 1 − p and PX((0.62, 15.5)) = P({H}) = p.
◮ Let Ω = {H, T}^3 = {HHH, HHT, · · · , TTT}.
  Let P be specified through the 'equally likely' assignment.
  Let X(ω) be the number of H's in ω. Thus, X(THT) = 1.
  (X takes one of the values: 0, 1, 2, or 3)
◮ We can once again write down [X ∈ B] for different B ⊂ ℜ
  [X ∈ (0, 1]] = {HTT, THT, TTH};
  [X ∈ (−1.2, 2.78)] = Ω − {HHH}
◮ Hence
  PX((0, 1]) = 3/8;  PX((−1.2, 2.78)) = 7/8

◮ A random variable defined on (Ω, F, P) results in a new or induced probability space (ℜ, B, PX).
◮ The Ω may be countable or uncountable (even though we looked at only examples of finite Ω).
◮ Thus, we can study probability models by taking ℜ as sample space through the use of random variables.
◮ However there are some technical issues regarding what B we should consider.
◮ We briefly consider this and then move on to studying random variables.
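◮ (Added illustration.) The three-coin example above, computed by brute-force enumeration in Python:

```python
from itertools import product

# Omega = {H,T}^3 with the equally likely assignment; X = number of H's
omega = list(product('HT', repeat=3))
p = {w: 1 / len(omega) for w in omega}
X = {w: w.count('H') for w in omega}

def P_X(pred):
    """PX(B) = P[X in B], where B is described by a predicate on real numbers."""
    return sum(p[w] for w in omega if pred(X[w]))

print(P_X(lambda v: 0 < v <= 1))          # 3/8
print(P_X(lambda v: -1.2 < v < 2.78))     # 7/8
```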
Borel σ-algebra

◮ Let us get back to the question we started with.
◮ In the probability space (ℜ, B, P), what is the B we should choose?
◮ We can choose it to be the smallest σ-algebra containing all intervals
◮ That is called the Borel σ-algebra, B.
◮ It contains all intervals, all complements, countable unions and intersections of intervals and all sets that can be obtained through complements, countable unions and/or intersections of such sets and so on.

◮ Let G = {(−∞, x] : x ∈ ℜ}
◮ We can define the Borel σ-algebra, B, as the smallest σ-algebra containing G.
◮ We can see that B would contain all intervals.
  1. (−∞, x) ∈ B because (−∞, x) = ∪_n (−∞, x − 1/n]
  2. (x, ∞) ∈ B because (x, ∞) = (−∞, x]^c
  3. [x, ∞) ∈ B because [x, ∞) = ∩_n (x − 1/n, ∞)
  4. (x, y] ∈ B because (x, y] = (−∞, y] ∩ (x, ∞)
  5. [x, y] ∈ B because [x, y] = ∩_n (x − 1/n, y]
  6. [x, y), (x, y) ∈ B, similarly
◮ Thus, σ(G) is also the smallest σ-algebra containing all intervals.
[X ≤ x] = {ω : X(ω) ≤ x} = { φ if x < 0;  {T} if 0 ≤ x < 1;  Ω if x ≥ 1 }
◮ We are considering: Ω = {T, H}, P({T}) = P({H}) = 0.5.
◮ X(T) = 0 and X(H) = 1. We want to calculate FX
◮ We showed
  [X ≤ x] = {ω : X(ω) ≤ x} = { φ if x < 0;  {T} if 0 ≤ x < 1;  Ω if x ≥ 1 }
◮ Hence
  FX(x) = P[X ≤ x] = { 0 if x < 0;  0.5 if 0 ≤ x < 1;  1 if x ≥ 1 }
◮ Once again we need to find the event [X ≤ x] for different values of x.
◮ Note that the function X takes values in [0, 1] and X(ω) = ω.
  [X ≤ x] = {ω ∈ Ω : X(ω) ≤ x} = {ω ∈ [0, 1] : ω ≤ x}
          = { φ if x < 0;  [0, x] if 0 ≤ x < 1;  Ω if x ≥ 1 }
◮ The plot of this distribution function: [figure]
◮ FX is right-continuous at all x:
  FX(x+) = FX(x) = PX((−∞, x])
◮ Next, let us look at the left-hand limits: lim_{xn↑x} FX(xn)
◮ When xn ↑ x, the sequence of events (−∞, xn] is monotone increasing and
  lim_n (−∞, xn] = ∪_n (−∞, xn] = (−∞, x)
◮ So FX has left limits: FX(x−) = PX((−∞, x))
◮ If A ⊂ B then P(B − A) = P(B) − P(A)
◮ We have (−∞, x] − (−∞, x) = {x}. Hence FX(x) − FX(x−) = P[X = x].
Distribution Functions

◮ Let X be a random variable.
◮ Its distribution function, FX : ℜ → ℜ, is given by
  FX(x) = P[X ≤ x]
◮ The distribution function satisfies
  1. 0 ≤ FX(x) ≤ 1, ∀x
  2. FX(−∞) = 0; FX(∞) = 1
  3. FX is non-decreasing: x1 ≤ x2 ⇒ FX(x1) ≤ FX(x2)
  4. FX is right continuous and has left-hand limits.
◮ We also have FX(x+) − FX(x−) = P[X = x]
◮ Any real-valued function of a real variable satisfying the above four properties would be a distribution function of some random variable.

◮ FX(x) = P[X ≤ x] = P[X ∈ (−∞, x]]
◮ Given FX, we can, in principle, find P[X ∈ B] for all Borel sets.
◮ In particular, for a < b,
  P[a < X ≤ b] = P[X ∈ (a, b]]
               = P[X ∈ ((−∞, b] − (−∞, a])]
               = P[X ∈ (−∞, b]] − P[X ∈ (−∞, a]]
               = FX(b) − FX(a)
◮ There are two classes of random variables that we would study here.
◮ These are called discrete and continuous random variables.
◮ There can be random variables that are neither discrete nor continuous.
◮ But these two are important classes of random variables that we deal with in this course.
◮ Note that the distribution function is defined for all random variables.

Discrete Random Variables

◮ A random variable X is said to be discrete if it takes only countably many distinct values.
◮ Countably many means finite or countably infinite.
◮ If X : Ω → ℜ is discrete, its (strict) range is countable
◮ Any random variable that is defined on a finite or countable Ω would be discrete.
◮ Thus the family of discrete random variables includes all probability models on finite or countably infinite sample spaces.
◮ The plot of this distribution function is: [figure]

Recap: Random Variables
Recap: Properties of distribution function
◮ FX(x) = P[X ≤ x]   (Recall X ∈ {0, 1, 2, 3})
◮ The event [X ≤ x] for different x can be seen to be
  [X ≤ x] = { φ, x < 0;  {TTT}, 0 ≤ x < 1;  {TTT, HTT, THT, TTH}, 1 ≤ x < 2;  Ω − {HHH}, 2 ≤ x < 3;  Ω, x ≥ 3 }
◮ The plot of this distribution function is: [figure]
◮ Hence we can write the distribution function as
  FX(x) = { 0, x < a1;  P(B1), a1 ≤ x < a2;  P(B1) + P(B2), a2 ≤ x < a3;  · · · ;  Σ_{i=1}^k P(Bi), ak ≤ x < ak+1;  · · · ;  1, x ≥ an }
◮ We can write this compactly as
  FX(x) = Σ_{k: ak ≤ x} qk

◮ Let X be a discrete rv with X ∈ {x1, x2, · · · }.
◮ Let qi = P[X = xi] (= P({ω : X(ω) = xi}))
◮ We have qi ≥ 0 and Σ_i qi = 1.
◮ If X is discrete then there is a countable set E such that P[X ∈ E] = 1.
◮ The distribution function of X is specified completely by these qi
Memoryless property of geometric distribution

◮ Suppose X is a geometric rv. Let m, n be positive integers.
◮ We want to calculate P([X > m + n] | [X > m])
  (Remember that [X > m] etc are events)
◮ Let us first calculate P[X > n] for any positive integer n
  (Does this also tell us what is the df of a geometric rv?)

◮ Now we can compute the required conditional probability
  P[X > m + n | X > m] = P[X > m + n, X > m]/P[X > m]
                       = P[X > m + n]/P[X > m]
                       = (1 − p)^{m+n}/(1 − p)^m = (1 − p)^n
◮ Thus P[X > m + n] = P[X > m] P[X > n]
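◮ (Added illustration.) An exact numerical check of the memoryless property in Python; p, m, n are arbitrary illustrative values:

```python
p, m, n = 0.3, 4, 6

def P_gt(k):
    """P[X > k] for a geometric rv with pmf (1-p)^(k-1) p, k = 1, 2, ..."""
    return (1 - p) ** k

lhs = P_gt(m + n) / P_gt(m)   # P[X > m+n | X > m]
rhs = P_gt(n)                 # P[X > n]
print(lhs, rhs)               # both equal (1-p)^n
```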
P[X > m + n | X > m] = P[X > n]

◮ This is the same as
  P[X > m + n] = P[X > m] P[X > n]
◮ Does it say that [X > m] is independent of [X > n]?
◮ NO! Because [X > m + n] is not equal to the intersection of [X > m] and [X > n]

◮ Suppose X ∈ {0, 1, · · · } is a discrete rv satisfying, for all non-negative integers, m, n,
  P[X > m + n] = P[X > m] P[X > n]
◮ We will show that X has geometric distribution
◮ First, note that
  P[X > 0] = P[X > 0 + 0] = (P[X > 0])^2 ⇒ P[X > 0] is either 1 or 0.
◮ Let us take P[X > 0] = 1 (and hence P[X = 0] = 0).
◮ We have, for any m,
  P[X > m] = P[X > (m − 1) + 1]
           = P[X > m − 1] P[X > 1]
           = P[X > m − 2] (P[X > 1])^2
           = · · ·

Continuous Random Variables
Properties of pdf

◮ The pdf, fX : ℜ → ℜ, of a continuous rv satisfies
  A1. fX(x) ≥ 0, ∀x
  A2. ∫_{−∞}^∞ fX(t) dt = 1
◮ Any fX that satisfies the above two would be the probability density function of a continuous rv
◮ Given fX satisfying the above two, define
  FX(x) = ∫_{−∞}^x fX(t) dt, ∀x
  This FX satisfies
  1. FX(−∞) = 0; FX(∞) = 1
  2. FX is non-decreasing.
  3. FX is continuous (and hence right continuous with left limits)
◮ This shows that the FX is a df and hence fX is a pdf

Continuous rv – example

◮ Consider a probability space with Ω = [0, 1] and with the 'usual' probability assignment (where the probability of an interval is its length)
◮ Earlier we considered the rv X(ω) = ω on this probability space.
◮ We found that the df for this is
  FX(x) = { 0 if x < 0;  x if 0 ≤ x < 1;  1 if x ≥ 1 }
  This is absolutely continuous and we can get the pdf as
  fX(x) = 1 if 0 < x < 1;  (fX(x) = 0, otherwise)
◮ On the same probability space, consider the rv Y(ω) = 1 − ω.
◮ Let us find FY and fY.
◮ If X is a continuous rv, we have
  P[a ≤ X ≤ b] = ∫_a^b fX(t) dt
◮ Thus
  P[x ≤ X ≤ x + ∆x] = ∫_x^{x+∆x} fX(t) dt ≈ fX(x) ∆x
◮ That is why fX is called the probability density function.

◮ For any random variable, the df is defined and it is given by
  FX(x) = P[X ≤ x] = P[X ∈ (−∞, x]]
◮ The value of FX(x) at any x is the probability of some event.
◮ The pmf is defined only for discrete random variables as
  fX(x) = P[X = x]
◮ The value of the pmf is also a probability
◮ We use the same symbol for the pdf (as for the pmf), defined by
  FX(x) = ∫_{−∞}^x fX(t) dt
A note on notation
Uniform distribution

◮ X is uniform over [a, b] when its pdf is
  fX(x) = 1/(b − a), a ≤ x ≤ b
  (fX(x) = 0 for all other values of x).
◮ Uniform distribution over an open or closed interval is essentially the same.
◮ When X has this distribution, we say X ∼ U[a, b]
◮ By integrating the above, we can see the df is
  FX(x) = { ∫_{−∞}^x 0 dt = 0                                      if x < a;
            ∫_{−∞}^a 0 dt + ∫_a^x 1/(b − a) dt = (x − a)/(b − a)   if a ≤ x < b;
            0 + ∫_a^b 1/(b − a) dt + 0 = 1                          if x ≥ b }
◮ A plot of the density and distribution functions of a uniform rv is given below [figure]
Exponential distribution

◮ The pdf of the exponential distribution is
  fX(x) = λ e^{−λx}, x > 0   (λ > 0)
◮ The exponential distribution also has the memoryless property:
  P[X > t + s | X > t] = P[X > t + s]/P[X > t] = P[X > s]
Gaussian Distribution
◮ A plot of Gaussian density functions is given below
◮ fX(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}, −∞ < x < ∞
◮ Showing that the density integrates to 1 is not trivial.
◮ Take µ = 0, σ = 1. Let I = ∫_{−∞}^∞ fX(x) dx. Then
  I² = ∫_{−∞}^∞ (1/√(2π)) e^{−0.5x²} dx ∫_{−∞}^∞ (1/√(2π)) e^{−0.5y²} dy
     = ∫_{−∞}^∞ ∫_{−∞}^∞ (1/(2π)) e^{−0.5(x²+y²)} dx dy
◮ Now converting the above integral into polar coordinates would allow you to show I = 1.
  (Left as an exercise for you!)

Recap: Random Variable

◮ Given a probability space (Ω, F, P), a random variable is a real-valued function on Ω.
◮ It essentially results in an induced probability space
  (Ω, F, P) --X--> (ℜ, B, PX)
  where B is the Borel σ-algebra and
  PX(B) = P[X ∈ B] = P({ω ∈ Ω : X(ω) ∈ B})
◮ For X to be a random variable
  {ω ∈ Ω : X(ω) ∈ B} ∈ F, ∀B ∈ B
Recap: Some continuous random variables

Functions of a random variable
◮ Let X be a rv and let Y = g(X).
◮ The distribution function of Y is given by
  FY(y) = P[Y ≤ y]
        = P[g(X) ≤ y]
        = P[g(X) ∈ (−∞, y]]
        = P[X ∈ {z : g(z) ≤ y}]
◮ This probability can be obtained from the distribution of X.
◮ Thus, in principle, we can find the distribution of Y if we know that of X

Example

◮ Let Y = aX + b, a > 0.
◮ Then we have
  FY(y) = P[Y ≤ y] = P[aX + b ≤ y] = P[aX ≤ y − b]
        = P[X ≤ (y − b)/a], since a > 0
        = FX((y − b)/a)
◮ This tells us how to find the df of Y when it is an affine function of X.
◮ If X is a continuous rv, then fY(y) = (1/a) fX((y − b)/a)
◮ In many examples we would be using uniform random variables.
◮ Let X ∼ U[0, 1]. Its pdf is fX(x) = 1, 0 ≤ x ≤ 1.
◮ Integrating this we get the df: FX(x) = x, 0 ≤ x ≤ 1

◮ Let X ∼ U[−1, 1]. The pdf would be fX(x) = 0.5, −1 ≤ x ≤ 1.
◮ Integrating this, we get the df: FX(x) = (1 + x)/2 for −1 ≤ x ≤ 1.
◮ These are plotted below [figure]
◮ Suppose X ∼ U[0, 1] and Y = aX + b
◮ The df for Y would be
  FY(y) = FX((y − b)/a) = { 0 if (y − b)/a ≤ 0;  (y − b)/a if 0 ≤ (y − b)/a ≤ 1;  1 if (y − b)/a ≥ 1 }
◮ Thus we get the df for Y as
  FY(y) = { 0, y ≤ b;  (y − b)/a, b ≤ y ≤ a + b;  1, y ≥ a + b }
◮ Hence fY(y) = 1/a, y ∈ [b, a + b] and Y ∼ U[b, a + b].

◮ Recall that the Gaussian density is f(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}
◮ We denote this as N(µ, σ²)
◮ Let Y = aX + b where X ∼ N(0, 1). The df of Y is
  FY(y) = FX((y − b)/a) = ∫_{−∞}^{(y−b)/a} (1/√(2π)) e^{−x²/2} dx
  We make the substitution t = ax + b ⇒ x = (t − b)/a and dx = (1/a) dt:
  FY(y) = ∫_{−∞}^y (1/(a√(2π))) e^{−(t−b)²/(2a²)} dt
  which is the df of N(b, a²).
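◮ (Added illustration.) A quick simulation check, assuming NumPy and SciPy, that Y = aX + b with X ∼ N(0, 1) has the df derived above, namely that of N(b, a²); a and b are illustrative values:

```python
import numpy as np
from scipy.stats import norm

a, b = 2.0, 1.0
rng = np.random.default_rng(1)
y = a * rng.standard_normal(500_000) + b

# Compare the empirical df of Y with the N(b, a^2) df at a few points
for t in (-2.0, 0.0, 1.0, 3.0):
    print(t, (y <= t).mean(), norm.cdf(t, loc=b, scale=a))
```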
◮ Suppose X is geometric:
  fX(k) = (1 − p)^{k−1} p, k = 1, 2, · · · .
◮ Let Y = X − 1
◮ We get the pmf of Y as
  fY(j) = P[X − 1 = j] = P[X = j + 1]

◮ Suppose X is geometric. (fX(k) = (1 − p)^{k−1} p)
◮ Let Y = max(X, 5) ⇒ Y ∈ {5, 6, · · · }
◮ We can calculate the pmf of Y as
  fY(5) = P[max(X, 5) = 5] = Σ_{k=1}^5 fX(k) = 1 − (1 − p)^5
◮ The df of Y is
  FY(y) = { 0 if y < 0;  (1 + y)/2 if 0 ≤ y < 1;  1 if y ≥ 1 }
◮ This is plotted below [figure]
◮ This is neither a continuous rv nor a discrete rv.

◮ Let Y = X².
◮ For y < 0, FY(y) = P[Y ≤ y] = 0 (since Y ≥ 0)
◮ For y ≥ 0, we can get FY(y) as
  FY(y) = P[Y ≤ y] = P[X² ≤ y]
        = P[−√y ≤ X ≤ √y]
        = FX(√y) − FX(−√y) + P[X = −√y]
◮ This is the general formula for the distribution of X²; when X is a continuous rv, P[X = −√y] = 0 and differentiating gives the density.
◮ Let X ∼ N(0, 1): fX(x) = (1/√(2π)) e^{−x²/2}
◮ Let Y = X². Then we know fY(y) = 0 for y < 0. For y ≥ 0,
  fY(y) = (1/(2√y)) [fX(√y) + fX(−√y)]
        = (1/(2√y)) [(1/√(2π)) e^{−y/2} + (1/√(2π)) e^{−y/2}]
        = (1/√y) (1/√(2π)) e^{−y/2}
        = (1/√π) (1/2)^{0.5} y^{−0.5} e^{−y/2}
◮ This is an example of a gamma density.

Gamma density

◮ The Gamma function is given by
  Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx
  It can be easily verified that Γ(α + 1) = αΓ(α).
◮ The Gamma density is given by
  f(x) = (1/Γ(α)) λ^α x^{α−1} e^{−λx} = (1/Γ(α)) (λx)^{α−1} λ e^{−λx}, x > 0
◮ Here α, λ > 0 are parameters.
◮ The earlier density we saw corresponds to α = λ = 0.5:
  fY(y) = (1/√π) (1/2)^{0.5} y^{−0.5} e^{−y/2}, y > 0
◮ The gamma density with parameters α, λ > 0 is given by
  f(x) = (1/Γ(α)) λ^α x^{α−1} e^{−λx}, x > 0
◮ If X ∼ N(0, 1) then X² has gamma density with parameters α = λ = 0.5.
◮ When α is a positive integer, the gamma density is known as the Erlang density.
◮ If α = 1, the gamma density becomes the exponential density.

◮ Let X ∼ U(0, 1).
◮ Let Y = (−1/λ) ln(1 − X), where λ > 0.
◮ Note that Y ≥ 0. We can find its df:
  FY(y) = P[Y ≤ y] = P[(−1/λ) ln(1 − X) ≤ y]
        = P[−ln(1 − X) ≤ λy]
        = P[ln(1 − X) ≥ −λy]
        = P[1 − X ≥ e^{−λy}]
        = P[X ≤ 1 − e^{−λy}]
        = 1 − e^{−λy}, y ≥ 0   (since X ∼ U(0, 1))
◮ Let G be a continuous invertible distribution function.
◮ Let X ∼ U[0, 1] and let Y = G^{−1}(X).
◮ We can get the df of Y as
  FY(y) = P[Y ≤ y] = P[G^{−1}(X) ≤ y] = P[X ≤ G(y)] = G(y)
◮ Thus, starting with a uniform rv, we can generate a rv with a desired distribution.
◮ Very useful in random number generation. Known as the inverse function method.
◮ Can be generalized to handle discrete rv also. It only involves defining an 'inverse' when F is a stair-case function. (Left as an exercise!)

◮ Let X be a cont rv with an invertible distribution function, say, F.
◮ Define Y = F(X).
◮ Since the range of F is [0, 1], we know 0 ≤ Y ≤ 1.
◮ For 0 ≤ y ≤ 1 we can obtain FY(y) as
  FY(y) = P[Y ≤ y] = P[F(X) ≤ y] = P[X ≤ F^{−1}(y)] = F(F^{−1}(y)) = y
◮ This means Y has uniform density.
◮ Has interesting applications. E.g., histogram equalization in image processing
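◮ (Added illustration.) A minimal sketch of the inverse function method, assuming NumPy, using the exponential example worked out earlier (Y = −(1/λ) ln(1 − X)):

```python
import numpy as np

lam = 2.0
rng = np.random.default_rng(2)
u = rng.random(500_000)              # U(0,1) samples

y = -np.log(1.0 - u) / lam           # G^{-1}(u) for G(y) = 1 - exp(-lam*y)

# Empirical df vs 1 - exp(-lam*y) at a few points
for t in (0.2, 0.5, 1.0):
    print(t, (y <= t).mean(), 1 - np.exp(-lam * t))
```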
◮ Let g : ℜ → ℜ be differentiable with g′(x) > 0, ∀x.
◮ Let X be a continuous rv with pdf fX.
◮ Let Y = g(X)
◮ Theorem: With the above, Y is a continuous rv with pdf
  fY(y) = fX(g^{−1}(y)) (d/dy) g^{−1}(y), g(−∞) ≤ y ≤ g(∞)
◮ Proof: Since g′(x) > 0, g is strictly monotonically increasing and hence is invertible and g^{−1} would also be monotone and differentiable.
◮ So, the range of Y is [g(−∞), g(∞)].
◮ Now we have
  FY(y) = P[Y ≤ y] = P[g(X) ≤ y] = P[X ≤ g^{−1}(y)] = FX(g^{−1}(y))
◮ Since g^{−1} is differentiable, so is FY and we get the pdf as
  fY(y) = (d/dy)(FX(g^{−1}(y))) = fX(g^{−1}(y)) (d/dy) g^{−1}(y)
◮ This completes the proof.

◮ Now, suppose g′(x) < 0, ∀x. Even then the theorem essentially holds.
◮ Now, g is strictly monotonically decreasing. So, we get
  FY(y) = P[g(X) ≤ y] = P[X ≥ g^{−1}(y)] = 1 − FX(g^{−1}(y))
◮ Once again, by differentiating,
  fY(y) = −fX(g^{−1}(y)) (d/dy) g^{−1}(y) = fX(g^{−1}(y)) |(d/dy) g^{−1}(y)|
  because g^{−1} is also monotone decreasing.
◮ The range of Y here is [g(∞), g(−∞)]
◮ We can combine both cases into one result.
◮ If Y = g(X) and g is monotone,
  fY(y) = fX(g^{−1}(y)) |(d/dy) g^{−1}(y)|
◮ Let xo(y) be the solution of g(x) = y; then g^{−1}(y) = xo(y).
◮ Also, the derivative of g^{−1} is the reciprocal of the derivative of g.
◮ Hence, we can also write the above as
  fY(y) = fX(xo(y)) |g′(xo(y))|^{−1}

◮ The function g(x) = x² does not satisfy the conditions of the theorem.
◮ The utility of the theorem is somewhat limited.
◮ However, we can extend the theorem.
◮ Essentially, what we need is that for any y, the equation g(x) = y has finitely many solutions and the derivative of g is not zero at any of these points.
◮ There are then multiple 'g^{−1}(y)' and we can get the density of Y by summing all the terms.
Recap: Function of a random variable
Expectation and Moments of a random variable

Expectation of a discrete rv
Expectation of a Continuous rv

◮ If X is a continuous random variable with pdf, fX, we define its expectation as
  E[X] = ∫_{−∞}^∞ x fX(x) dx
◮ Once again we can use the following as a condition for existence of the expectation:
  ∫_{−∞}^∞ |x| fX(x) dx < ∞
◮ Sometimes we use the following notation to denote expectation of both kinds of rv
  E[X] = ∫_{−∞}^∞ x dFX(x)
◮ Though we consider only discrete or continuous rv's, expectation is defined for all random variables.

◮ Let us look at a couple of simple examples.
◮ Let X ∈ {1, 2, 3, 4, 5, 6} and fX(k) = 1/6, 1 ≤ k ≤ 6.
  EX = (1/6)(1 + 2 + 3 + 4 + 5 + 6) = 21/6 = 3.5
◮ Let X ∼ U[0, 1]
  EX = ∫_{−∞}^∞ x fX(x) dx = ∫_0^1 x dx = 0.5
◮ When an rv takes only finitely many values or when the pdf is non-zero only on a bounded set, the expectation is always finite.
◮ The way we have defined existence of expectation implies that the expectation is always finite (when it exists).
◮ This may be needlessly restrictive in some situations. We redefine it as follows.
◮ Let X be a non-negative (discrete or continuous) random variable.
◮ We define its expectation by the same sum or integral, now allowing the value +∞.

◮ Now let X be a rv that may not be non-negative.
◮ We define positive and negative parts of X by
  X⁺ = X if X > 0, 0 otherwise;   X⁻ = −X if X < 0, 0 otherwise
◮ Now suppose X takes values 1, −2, 3, −4, · · · with probabilities C/1², C/2², C/3² and so on.
◮ Once again Σ_k |xk| fX(xk) = ∞.
◮ But Σ_k xk fX(xk) is an alternating series.
◮ Here X⁺ would take values 2k − 1 with probability C/(2k − 1)², k = 1, 2, · · · (and the value 0 with the remaining probability).
◮ Similarly, X⁻ would take values 2k with probability C/(2k)², k = 1, 2, · · · (and the value 0 with the remaining probability).
  EX⁺ = Σ_k C/(2k − 1) = ∞,  and  EX⁻ = Σ_k C/(2k) = ∞
◮ Hence EX does not exist.

◮ Consider a continuous random variable X with pdf
  fX(x) = (1/π) · 1/(1 + x²), −∞ < x < ∞
◮ This is called the (standard) Cauchy density. We can verify it integrates to 1:
  ∫_{−∞}^∞ (1/π) · 1/(1 + x²) dx = (1/π) [tan^{−1}(x)]_{−∞}^∞ = (1/π)(π/2 − (−π/2)) = 1
◮ What would be EX?
  EX = ∫_{−∞}^∞ x (1/π) · 1/(1 + x²) dx =? 0 because ∫_{−a}^a x/(1 + x²) dx = 0?
Expectation of a random variable

◮ Let X be a discrete rv with X ∈ {x1, x2, · · · }. Then
  E[X] = Σ_i xi fX(xi)
◮ If X is a continuous random variable with pdf, fX,
  E[X] = ∫_{−∞}^∞ x fX(x) dx
◮ Sometimes we use the following notation to denote expectation of both kinds of rv
  E[X] = ∫_{−∞}^∞ x dFX(x)
◮ We take the expectation to exist when the sum or integral above is absolutely convergent
◮ Note that expectation is defined for all random variables
◮ Let us calculate expectations of some of the standard distributions.

Binary random variable

◮ Expectation of a binary rv (e.g., Bernoulli):
  EX = 0 × fX(0) + 1 × fX(1) = P[X = 1]
◮ The expectation of a binary random variable is the same as the probability of the rv taking value 1.
◮ Thus, for example, E[IA] = P(A).
Expectation of Geometric rv

◮ fX(k) = (1 − p)^{k−1} p, k = 1, 2, · · ·
  EX = Σ_{k=1}^∞ k (1 − p)^{k−1} p
◮ We have
  Σ_{k=1}^∞ (1 − p)^k = (1 − p)/p = 1/p − 1
◮ Term-wise differentiation of the above gives
  Σ_{k=1}^∞ k (1 − p)^{k−1} = 1/p²
◮ This gives us EX = 1/p

Expectation of uniform density

◮ Let X ∼ U[a, b]. fX(x) = 1/(b − a), a ≤ x ≤ b
  EX = ∫_{−∞}^∞ x fX(x) dx = ∫_a^b x/(b − a) dx = (1/(b − a)) [x²/2]_a^b = (b² − a²)/(2(b − a)) = (b + a)/2
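◮ (Added illustration.) A simulation check of both expectations, assuming NumPy; p, a, b are illustrative values:

```python
import numpy as np

rng = np.random.default_rng(3)

p = 0.25
x_geom = rng.geometric(p, size=1_000_000)      # pmf (1-p)^(k-1) p, k >= 1
print(x_geom.mean(), 1 / p)                    # sample mean vs 1/p

a, b = 2.0, 5.0
x_unif = rng.uniform(a, b, size=1_000_000)
print(x_unif.mean(), (a + b) / 2)              # sample mean vs (a+b)/2
```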
Expectation of a function of a random variable

◮ Let X be a rv and let Y = g(X).
◮ Theorem: EY = ∫ y dFY(y) = ∫ g(x) dFX(x)

◮ Theorem: Let X ∈ {x1, x2, · · · , xn} and let Y = g(X). Then
  EY = Σ_i g(xi) fX(xi)
◮ Now we have
  EY = Σ_{j=1}^m yj fY(yj)
     = Σ_{j=1}^m yj Σ_{i: xi ∈ Bj} fX(xi)
     = Σ_{j=1}^m Σ_{i: xi ∈ Bj} g(xi) fX(xi)
     = Σ_{i=1}^n g(xi) fX(xi)
  That completes the proof.
◮ The proof goes through even when X (and Y) take countably infinitely many values (because we assume the expectation sum is absolutely convergent).

◮ Suppose X is a continuous rv and suppose g is a differentiable function with g′(x) > 0, ∀x. Let Y = g(X)
◮ Once again we can show EY = ∫ g(x) fX(x) dx
  EY = ∫_{−∞}^∞ y fY(y) dy
     = ∫_{g(−∞)}^{g(∞)} y fX(g^{−1}(y)) (d/dy) g^{−1}(y) dy
  Change the variable to x = g^{−1}(y) ⇒ dx = (d/dy) g^{−1}(y) dy:
     = ∫_{−∞}^∞ g(x) fX(x) dx
◮ We can similarly show this for the case where g′(x) < 0, ∀x
◮ We proved the theorem only for discrete rv's and for some restricted cases of continuous rv's.
◮ However, this theorem is true for all random variables.
◮ Now, for any function, g, we can write
  E[g(X)] = Σ_i g(xi) fX(xi)  or  E[g(X)] = ∫_{−∞}^∞ g(x) fX(x) dx

Some Properties of Expectation

◮ If X ≥ 0 then EX ≥ 0
◮ E[b] = b where b is a constant
◮ E[a g(X)] = a E[g(X)] where a is a constant
◮ E[aX + b] = a E[X] + b where a, b are constants.
◮ E[a g1(X) + b g2(X)] = a E[g1(X)] + b E[g2(X)]
Variance of a Random variable

◮ We define the variance of X as E[(X − EX)²] and denote it as Var(X).
◮ By definition, Var(X) ≥ 0.

Some properties of variance

◮ Var(X + c) = Var(X) where c is a constant
  Var(X + c) = E[{(X + c) − E[X + c]}²] = E[(X − EX)²] = Var(X)
Variance of exponential rv

◮ fX(x) = λ e^{−λx}, x > 0
  E[X²] = ∫_0^∞ x² λ e^{−λx} dx
        = [x² λ e^{−λx}/(−λ)]_0^∞ − ∫_0^∞ (e^{−λx}/(−λ)) λ 2x dx
        = (2/λ) ∫_0^∞ x λ e^{−λx} dx
        = 2/λ²
◮ Hence the variance is now given by
  Var(X) = 2/λ² − (1/λ)² = 1/λ²

Variance of Gaussian rv

◮ Let X ∼ N(0, 1). That is, fX(x) = (1/√(2π)) e^{−x²/2}, −∞ < x < ∞.
◮ We know EX = 0. Hence Var(X) = EX².
  Var(X) = EX² = ∫_{−∞}^∞ x² (1/√(2π)) e^{−x²/2} dx
         = ∫_{−∞}^∞ x · x (1/√(2π)) e^{−x²/2} dx
         = [x (−1/√(2π)) e^{−x²/2}]_{−∞}^∞ + ∫_{−∞}^∞ (1/√(2π)) e^{−x²/2} dx
         = 1
◮ Let fX(x) = (1/√(2π)) e^{−x²/2}, −∞ < x < ∞.
◮ Let g(x) = σx + µ and hence g^{−1}(y) = (y − µ)/σ.
◮ Take σ > 0 and Y = g(X). By the theorem,
  fY(y) = (d/dy) g^{−1}(y) fX(g^{−1}(y)) = (1/(σ√(2π))) e^{−(y−µ)²/(2σ²)}
◮ Since Y = σX + µ, we get
  ◮ EY = σ EX + µ = µ
  ◮ Var(Y) = σ² Var(X) = σ²
◮ When Y ∼ N(µ, σ²), EY = µ and Var(Y) = σ².
◮ Here is a plot of Gaussian densities with different variances [figure]
Variance of Binomial rv

◮ fX(k) = n!/(k!(n − k)!) p^k (1 − p)^{n−k}, k = 0, 1, · · · , n
◮ Here we use the identity EX² = E[X(X − 1)] + EX
  E[X(X − 1)] = Σ_{k=0}^n k(k − 1) n!/(k!(n − k)!) p^k (1 − p)^{n−k}
              = Σ_{k=2}^n k(k − 1) n!/(k!(n − k)!) p^k (1 − p)^{n−k}
              = Σ_{k=2}^n n(n − 1)(n − 2)!/((k − 2)!((n − 2) − (k − 2))!) p² p^{k−2} (1 − p)^{(n−2)−(k−2)}
              = n(n − 1)p² Σ_{k′=0}^{n−2} (n − 2)!/(k′!((n − 2) − k′)!) p^{k′} (1 − p)^{(n−2)−k′}
              = n(n − 1)p²

◮ When X is a binomial rv, we showed
  E[X(X − 1)] = n(n − 1)p²
◮ Hence,
  EX² = E[X(X − 1)] + EX = n(n − 1)p² + np = n²p² + np(1 − p)
◮ Now we can calculate the variance
  Var(X) = EX² − (EX)² = n²p² + np(1 − p) − (np)² = np(1 − p)
◮ EY = ∫ y dFY(y) = ∫ g(x) dFX(x)
◮ That is, if X is discrete, then
  EY = Σ_j yj fY(yj) = Σ_i g(xi) fX(xi)
◮ If X and Y are continuous
  EY = ∫ y fY(y) dy = ∫ g(x) fX(x) dx
◮ This is true for all rv's.

◮ E[g(X)] = Σ_i g(xi) fX(xi)  or  E[g(X)] = ∫_{−∞}^∞ g(x) fX(x) dx
◮ If X ≥ 0 then EX ≥ 0
◮ E[b] = b where b is a constant
◮ E[a g(X)] = a E[g(X)] where a is a constant
◮ E[aX + b] = a E[X] + b where a, b are constants.
◮ E[a g1(X) + b g2(X)] = a E[g1(X)] + b E[g2(X)]
◮ E[(X − c)²] ≥ E[(X − EX)²], ∀c
Recap: Variance of random variable

◮ Var(cX) = c² Var(X)

Recap: Moments of a random variable

◮ If the moment of order k is finite then so is the moment of order s for s < k.
◮ We can easily see this by expanding e^{tX} in a Taylor series:
  MX(t) = E[e^{tX}] = E[1 + tX/1! + t²X²/2! + t³X³/3! + t⁴X⁴/4! + · · ·]
        = 1 + (t/1!) EX + (t²/2!) EX² + (t³/3!) EX³ + (t⁴/4!) EX⁴ + · · ·
◮ Now we can do term-wise differentiation. For example
  d³MX(t)/dt³ = 0 + 0 + 0 + (3·2·1·t⁰/3!) EX³ + (4·3·2·t/4!) EX⁴ + · · ·
◮ Hence we get
  d³MX(t)/dt³ |_{t=0} = E[X³]

Example – Moment generating function for Poisson

◮ fX(k) = (λ^k/k!) e^{−λ}, k = 0, 1, · · ·
  MX(t) = E[e^{tX}] = Σ_{k=0}^∞ e^{tk} (λ^k/k!) e^{−λ}
        = e^{−λ} Σ_{k=0}^∞ (1/k!)(λe^t)^k
        = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}
◮ Now, by differentiating it we can find EX
  EX = dMX(t)/dt |_{t=0} = e^{λ(e^t − 1)} λ e^t |_{t=0} = λ
  (Exercise: Differentiate it twice to find EX² and hence show that the variance is λ).
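◮ (Added illustration.) The same differentiation done symbolically, assuming SymPy is available:

```python
import sympy as sp

t, lam = sp.symbols('t lambda', positive=True)
M = sp.exp(lam * (sp.exp(t) - 1))          # mgf of Poisson(lambda)

EX  = sp.diff(M, t, 1).subs(t, 0)          # first moment
EX2 = sp.diff(M, t, 2).subs(t, 0)          # second moment
print(sp.simplify(EX))                     # lambda
print(sp.simplify(EX2 - EX**2))            # variance = lambda
```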
mgf of exponential rv

◮ fX(x) = λe^{−λx}, x > 0
  MX(t) = E[e^{tX}] = ∫_0^∞ e^{tx} λe^{−λx} dx = ∫_0^∞ λe^{−x(λ−t)} dx
  This is finite if t < λ:
  = [λe^{−x(λ−t)}/(−(λ − t))]_0^∞ = λ/(λ − t), t < λ
◮ We can use this to compute EX
  EX = dMX(t)/dt |_{t=0} = d/dt [λ/(λ − t)] |_{t=0} = λ/(λ − t)² |_{t=0} = 1/λ

◮ For the mgf to exist we need E[e^{tX}] < ∞ for t ∈ [−a, a] for some a > 0.
◮ If MX(t) exists then all moments of X are finite.
◮ However, all moments may be finite but the mgf may not exist.
◮ When the mgf exists, it uniquely determines the df
◮ We are not saying moments uniquely determine the distribution; we are saying the mgf uniquely determines the distribution
Characteristic Function

◮ The characteristic function of X is defined by
  φX(t) = E[e^{itX}] = ∫ e^{itx} dFX(x)   (i = √(−1))
◮ If X is a continuous rv,
  φX(t) = E[e^{itX}] = ∫_{−∞}^∞ e^{itx} fX(x) dx
◮ The characteristic function always exists because |e^{itx}| = 1, ∀t, x
◮ For example,
  |∫_{−∞}^∞ e^{itx} fX(x) dx| ≤ ∫_{−∞}^∞ |e^{itx}| |fX(x)| dx = ∫_{−∞}^∞ fX(x) dx = 1

Generating function

◮ Let X ∈ {0, 1, 2, · · · }
◮ The (probability) generating function of X is defined by
  PX(s) = Σ_{k=0}^∞ fX(k) s^k, s ∈ ℜ
◮ This infinite sum converges (absolutely) for |s| ≤ 1.
◮ We have
  PX(s) = fX(0) + fX(1)s + fX(2)s² + fX(3)s³ + · · ·
◮ The pmf can be obtained from the generating function
◮ Let P′X(s) ≜ dPX(s)/ds and so on
◮ We get
  P′X(s) = 0 + fX(1) + fX(2) 2s + fX(3) 3s² + · · ·
  P″X(s) = 0 + 0 + fX(2) 2·1 + fX(3) 3·2 s + · · ·
  Hence, we get fX(k) from the k-th derivative of PX at s = 0.
◮ The moments (when they exist) can also be obtained from the generating function:
  P′X(s) = Σ_{k=0}^∞ k fX(k) s^{k−1} ⇒ P′X(1) = EX
  P″X(s) = Σ_{k=0}^∞ k(k − 1) fX(k) s^{k−2} ⇒ P″X(1) = E[X(X − 1)]
Example – Generating function for binomial rv

◮ fX(k) = n!/(k!(n − k)!) p^k (1 − p)^{n−k}, k = 0, 1, · · · , n
  PX(s) = Σ_{k=0}^n n!/(k!(n − k)!) p^k (1 − p)^{n−k} s^k
        = Σ_{k=0}^n n!/(k!(n − k)!) (sp)^k (1 − p)^{n−k}
        = (sp + (1 − p))^n = (1 + p(s − 1))^n
◮ From the above, we get P′X(s) = n(sp + (1 − p))^{n−1} p
◮ Thus,
  EX = P′X(1) = np;   fX(1) = P′X(0) = n(1 − p)^{n−1} p

◮ Let p ∈ (0, 1). The number x ∈ ℜ that satisfies
  P[X ≤ x] ≥ p and P[X ≥ x] ≥ 1 − p
  is called the quantile of order p or the 100p-th percentile of the rv X.
◮ Suppose x is a quantile of order p. Then we have
  ◮ p ≤ P[X ≤ x] = FX(x)
  ◮ 1 − p ≤ 1 − P[X < x] = 1 − (P[X ≤ x] − P[X = x])
    ⇒ 1 − p ≤ 1 − FX(x) + P[X = x]
    ⇒ FX(x) ≤ p + P[X = x]
◮ Thus, x satisfies (if it is a quantile of order p)
  p ≤ FX(x) ≤ p + P[X = x]
◮ Note that for a given p there can be multiple values of x satisfying the above.
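◮ (Added illustration.) A symbolic check of the binomial generating-function results above, assuming SymPy; n = 5 is an illustrative value:

```python
import sympy as sp

s, p = sp.symbols('s p', positive=True)
n = 5                                          # a small illustrative n

P = (1 + p * (s - 1)) ** n                     # pgf of binomial(n, p)

print(sp.simplify(sp.diff(P, s).subs(s, 1)))   # EX = n*p
print(sp.simplify(sp.diff(P, s).subs(s, 0)))   # fX(1) = n*(1-p)^(n-1)*p
```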
p ≤ FX (x) ≤ p + P [X = x]
◮ For a continuous rv X, FX need not be strictly monotone.
◮ Consider a pdf: fX(x) = 0.5, x ∈ [1, 2] ∪ [3, 4]
◮ The pdf and the corresponding df are: [figure]
◮ For this df, for p = 0.5, the quantile of order p is not unique because there are many x with FX(x) = 0.5.
  But for p = 0.75 it is unique.
◮ Let X ∈ {x1 , x2 , · · · }
◮ Given a p we want to calculate quantile of order p
◮ Suppose there is a xi such that FX (xi ) = p.
◮ Then, for xi ≤ x < xi+1 , FX (x) = p
◮ For xi ≤ x ≤ xi+1 , we have p ≤ FX (x) ≤ p + P [X = x]
◮ So, quantile of order p is not unique and all such x qualify.
◮ Now suppose p is such that FX(xi−1) < p < FX(xi).
◮ Let FX(xi−1) = p − δ1 and FX(xi) = p + δ2. (Note that δ1, δ2 > 0)
◮ Then P[X = xi] = FX(xi) − FX(xi−1) = δ2 + δ1
◮ Hence we have p ≤ FX(xi) = p + δ2 ≤ p + P[X = xi], so xi is the quantile of order p.
◮ This situation is illustrated below [figure]
Median of a distribution
Markov Inequality

◮ Let g : ℜ → ℜ be a non-negative function. Then
  P[g(X) > c] ≤ E[g(X)]/c,  (c > 0)
◮ Proof: We prove it for a continuous rv. The proof is similar for a discrete rv.
  E[g(X)] = ∫_{−∞}^∞ g(x) fX(x) dx
          = ∫_{g(x)≤c} g(x) fX(x) dx + ∫_{g(x)>c} g(x) fX(x) dx
          ≥ ∫_{g(x)>c} g(x) fX(x) dx   (because g(x) ≥ 0)
          ≥ ∫_{g(x)>c} c fX(x) dx = c P[g(X) > c]
  Thus, P[g(X) > c] ≤ E[g(X)]/c

Markov Inequality

  P[g(X) > c] ≤ E[g(X)]/c,  (c > 0)
◮ In all such results an underlying assumption is that the expectation is finite.
◮ Let g(x) = |x|^k where k is a positive integer. We have g(x) ≥ 0, ∀x. Let c > 0.
◮ We know that |x| > c ⇒ |x|^k > c^k and vice versa.
◮ Now we get,
  P[|X| > c] = P[|X|^k > c^k] ≤ E|X|^k / c^k
◮ Markov inequality is often used in this form.
Chebyshev Inequality

◮ The Chebyshev inequality is
  P[|X − EX| > c] ≤ Var(X)/c²
◮ Markov inequality: For a non-negative function, g,
  P[g(X) > c] ≤ E[g(X)]/c
◮ A specific instance of this is
  P[|X| > c] ≤ E|X|^k / c^k
◮ Chebyshev inequality
  P[|X − EX| > c] ≤ Var(X)/c²
◮ With EX = µ and Var(X) = σ², we get
  P[|X − µ| > kσ] ≤ 1/k²
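◮ (Added illustration.) An empirical look at the Chebyshev bound, assuming NumPy; the exponential distribution here is just an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=1.0, size=1_000_000)   # EX = 1, Var(X) = 1

mu, var = x.mean(), x.var()
for k in (2, 3, 4):
    lhs = (np.abs(x - mu) > k * np.sqrt(var)).mean()
    print(k, lhs, 1 / k**2)   # empirical P[|X - mu| > k*sigma] vs the bound 1/k^2
```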
B² = σ({B1 × B2 : B1, B2 ∈ B})
◮ Recall that B is the smallest σ-algebra containing all intervals.
◮ Let I1, I2 ⊂ ℜ be intervals. Then I1 × I2 ⊂ ℜ² is known as a cylindrical set.
  [Figure: the cylindrical set [a, b] × [c, d] in the plane]
◮ B² is the smallest σ-algebra containing all cylindrical sets.
◮ We saw that B is also the smallest σ-algebra containing all intervals of the form (−∞, x].
◮ Similarly B² is the smallest σ-algebra containing cylindrical sets of the form (−∞, x] × (−∞, y].

◮ Let X, Y be random variables on the probability space (Ω, F, P)
◮ This gives rise to a new probability space (ℜ², B², PXY) with PXY given by
  PXY(B) = P[(X, Y) ∈ B] = P({ω : (X(ω), Y(ω)) ∈ B}), B ∈ B²
Properties of Joint Distribution Function

◮ Joint distribution function:
  FXY(x, y) = P[X ≤ x, Y ≤ y]

◮ Recall that, for the case of a single rv, the probability of X being in any interval is given by the difference of FX values at the end points of the interval.
◮ Let x1 < x2. Then P[x1 < X ≤ x2] = FX(x2) − FX(x1).
[Figure: the plane divided into regions B1 = (x1, x2] × (y1, y2], B2 and B3, using the points x1 < x2 and y1 < y2]

◮ With B = (−∞, x2] × (−∞, y2],
  P[(X, Y) ∈ B] = P[X ≤ x2, Y ≤ y2] = FXY(x2, y2)
                = P[(X, Y) ∈ B1 + (B2 ∪ B3)]
                = P[(X, Y) ∈ B1] + P[(X, Y) ∈ (B2 ∪ B3)]
◮ What we showed is the following.
◮ For x1 < x2 and y1 < y2
  P[x1 < X ≤ x2, y1 < Y ≤ y2] = FXY(x2, y2) − FXY(x2, y1) − FXY(x1, y2) + FXY(x1, y1)
◮ This means FXY should satisfy
  FXY(x2, y2) − FXY(x2, y1) − FXY(x1, y2) + FXY(x1, y1) ≥ 0
  for all x1 < x2 and y1 < y2
◮ This is an additional condition that a function has to satisfy to be the joint distribution function of a pair of random variables

Properties of Joint Distribution Function

◮ Joint distribution function: FXY : ℜ² → ℜ
  FXY(x, y) = P[X ≤ x, Y ≤ y]
◮ It satisfies
  1. FXY(−∞, y) = FXY(x, −∞) = 0, ∀x, y;  FXY(∞, ∞) = 1
  2. FXY is non-decreasing in each of its arguments
  3. FXY is right continuous and has left-hand limits in each of its arguments
  4. For all x1 < x2 and y1 < y2
     FXY(x2, y2) − FXY(x2, y1) − FXY(x1, y2) + FXY(x1, y1) ≥ 0
◮ Any F : ℜ² → ℜ satisfying the above would be a joint distribution function.
Recap: Properties of distribution function

Recap: Discrete Random Variable
◮ In particular,
  P[a ≤ X ≤ b] = ∫_a^b fX(t) dt
Recap

◮ Let g : ℜ → ℜ be differentiable with g′(x) > 0, ∀x or g′(x) < 0, ∀x.
◮ Let X be a continuous rv and let Y = g(X).
◮ Then Y is a continuous rv with pdf
  fY(y) = fX(g^{−1}(y)) |(d/dy) g^{−1}(y)|, a ≤ y ≤ b
  where a = min(g(∞), g(−∞)) and b = max(g(∞), g(−∞))
◮ This theorem is useful in some cases to find the densities of functions of continuous random variables

Recap: Expectation

◮ Let X be a discrete rv with X ∈ {x1, x2, · · · }. Then
  E[X] = Σ_i xi fX(xi)
◮ If X is a continuous random variable with pdf, fX,
  E[X] = ∫_{−∞}^∞ x fX(x) dx
◮ Sometimes we use the following notation to denote expectation of both kinds of rv
  E[X] = ∫_{−∞}^∞ x dFX(x)
◮ We take the expectation to exist when the sum or integral above is absolutely convergent
◮ Note that expectation is defined for all random variables
◮ EY = ∫ y dFY(y) = ∫ g(x) dFX(x)
◮ That is, if X is discrete, then
  EY = Σ_j yj fY(yj) = Σ_i g(xi) fX(xi)
◮ If X and Y are continuous
  EY = ∫ y fY(y) dy = ∫ g(x) fX(x) dx
◮ This is true for all rv's.

◮ E[g(X)] = Σ_i g(xi) fX(xi)  or  E[g(X)] = ∫_{−∞}^∞ g(x) fX(x) dx
◮ If X ≥ 0 then EX ≥ 0
◮ E[b] = b where b is a constant
◮ E[a g(X)] = a E[g(X)] where a is a constant
◮ E[aX + b] = a E[X] + b where a, b are constants.
◮ E[a g1(X) + b g2(X)] = a E[g1(X)] + b E[g2(X)]
◮ E[(X − c)²] ≥ E[(X − EX)²], ∀c
Recap: Variance of random variable

Recap: Moments of a random variable
quantiles of a distribution

◮ Let p ∈ (0, 1). The number x ∈ ℜ that satisfies
  P[X ≤ x] ≥ p and P[X ≥ x] ≥ 1 − p
  is called the quantile of order p or the 100p-th percentile of the rv X.
◮ If x is a quantile of order p, it satisfies
  p ≤ FX(x) ≤ p + P[X = x]
◮ For a given p there can be multiple values of x satisfying the above.
◮ For p = 0.5, it is called the median.

Recap: some moment inequalities

◮ Markov inequality: For a non-negative function, g,
  P[g(X) > c] ≤ E[g(X)]/c
◮ A specific instance of this is
  P[|X| > c] ≤ E|X|^k / c^k
◮ Chebyshev inequality
  P[|X − EX| > c] ≤ Var(X)/c²
◮ With EX = µ and Var(X) = σ², we get
  P[|X − µ| > kσ] ≤ 1/k²
Recap: Joint distribution function

Recap: Properties of Joint Distribution Function

◮ Joint distribution function: FXY : ℜ² → ℜ
◮ Let X, Y be two discrete random variables (defined on the same probability space).
◮ Let X ∈ {x1, · · · , xn} and Y ∈ {y1, · · · , ym}.
◮ We define the joint probability mass function of X and Y as
  fXY(xi, yj) = P[X = xi, Y = yj]
  (fXY(x, y) is zero for all other values of x, y)
◮ The fXY would satisfy
  fXY(x, y) ≥ 0, ∀x, y and Σ_i Σ_j fXY(xi, yj) = 1
◮ This is a straight-forward extension of the pmf of a single discrete rv.

Example

◮ Let Ω = (0, 1) with the 'usual' probability.
◮ So, each ω is a real number between 0 and 1
◮ Let X(ω) be the digit in the first decimal place in ω and let Y(ω) be the digit in the second decimal place.
◮ If ω = 0.2576 then X(ω) = 2 and Y(ω) = 5
◮ Easy to see that X, Y ∈ {0, 1, · · · , 9}.
◮ We want to calculate the joint pmf of X and Y
Example
◮ What is the event [X = 4]?
  [X = 4] = {ω : X(ω) = 4} = [0.4, 0.5)
◮ What is the event [Y = 3]?
  [Y = 3] = [0.03, 0.04) ∪ [0.13, 0.14) ∪ · · · ∪ [0.93, 0.94)
◮ What is the event [X = 4, Y = 3]? It is the intersection of the above:
  [X = 4, Y = 3] = [0.43, 0.44)
◮ Hence the joint pmf of X and Y is
  fXY(x, y) = P[X = x, Y = y] = 0.01,  x, y ∈ {0, 1, · · ·, 9}

Example
◮ Consider the random experiment of rolling two dice.
  Ω = {(ω1, ω2) : ω1, ω2 ∈ {1, 2, · · ·, 6}}
◮ Let X be the maximum of the two numbers and let Y be the sum of the two numbers.
◮ Easy to see X ∈ {1, 2, · · ·, 6} and Y ∈ {2, 3, · · ·, 12}
◮ What is the event [X = m, Y = n]? (We assume m, n are in the correct range)
  [X = m, Y = n] = {(ω1, ω2) ∈ Ω : max(ω1, ω2) = m, ω1 + ω2 = n}
◮ For this to be a non-empty set, we must have m < n ≤ 2m
◮ Then [X = m, Y = n] = {(m, n − m), (n − m, m)}
◮ Is this always true? No! What if n = 2m?
  [X = 3, Y = 6] = {(3, 3)},  [X = 4, Y = 6] = {(4, 2), (2, 4)}
◮ So, P[X = m, Y = n] is either 2/36 or 1/36 (assuming m, n satisfy other requirements)
◮ Take the example: 2 dice, X is max and Y is sum
◮ fXY(m, n) = 0 unless m = 1, · · ·, 6 and n = 2, · · ·, 12. For this range
  fXY(m, n) = 2/36 if m < n < 2m,  fXY(m, n) = 1/36 if n = 2m
◮ Suppose we want P[Y = X + 2].
  P[Y = X + 2] = Σ_{m,n: n=m+2} fXY(m, n) = Σ_{m=1}^{6} fXY(m, m + 2)
   = Σ_{m=2}^{6} fXY(m, m + 2)  (since we need m + 2 ≤ 2m)
   = 1/36 + 4 · (2/36) = 9/36

Joint density function
◮ Let X, Y be two continuous rv's with df FXY.
◮ If there exists a function fXY that satisfies
  FXY(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY(x′, y′) dy′ dx′, ∀x, y
  then we say that X, Y have a joint probability density function which is fXY
◮ Please note the difference in the definition of joint pmf and joint pdf.
◮ When X, Y are discrete we defined a joint pmf
◮ We are not saying that if X, Y are continuous rv's then a joint density exists.
◮ We use joint density to mean joint pdf
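◮ (Aside, not from the slides) The dice example can be verified by brute-force enumeration; this small sketch reproduces the 1/36 and 2/36 values and P[Y = X + 2] = 9/36.

from collections import Counter
from fractions import Fraction

# Enumerate the 36 equally likely outcomes; X = max, Y = sum.
pmf = Counter()
for w1 in range(1, 7):
    for w2 in range(1, 7):
        pmf[(max(w1, w2), w1 + w2)] += Fraction(1, 36)

print(pmf[(3, 6)], pmf[(4, 6)])    # 1/36 and 1/18 (= 2/36), as on the slide
print(sum(p for (m, n), p in pmf.items() if n == m + 2))   # 1/4 (= 9/36)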
Properties of joint density
◮ Then we can show FXY is a joint distribution.
[figure omitted: sketch in the (x, y) plane with axis marks at 0.5 and 1.0]
◮ fXY(x, y) ≥ 0 and ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY(x′, y′) dy′ dx′ = 1
◮ Define
  FXY(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY(x′, y′) dy′ dx′, ∀x, y
◮ Then, FXY(−∞, y) = FXY(x, −∞) = 0, ∀x, y and FXY(∞, ∞) = 1
◮ Since fXY(x, y) ≥ 0, FXY is non-decreasing in each argument.
◮ Since it is given as an integral, the above also shows that FXY is continuous in each argument.
◮ The only property left is the special property of FXY we mentioned earlier.

∆ ≜ FXY(x2, y2) − FXY(x1, y2) − FXY(x2, y1) + FXY(x1, y1)
◮ We need to show ∆ ≥ 0 if x1 < x2 and y1 < y2.
◮ We have
  ∆ = ∫_{−∞}^{x2} ∫_{−∞}^{y2} fXY dy dx − ∫_{−∞}^{x1} ∫_{−∞}^{y2} fXY dy dx − ∫_{−∞}^{x2} ∫_{−∞}^{y1} fXY dy dx + ∫_{−∞}^{x1} ∫_{−∞}^{y1} fXY dy dx
   = ∫_{−∞}^{x2} ( ∫_{−∞}^{y2} fXY dy − ∫_{−∞}^{y1} fXY dy ) dx − ∫_{−∞}^{x1} ( ∫_{−∞}^{y2} fXY dy − ∫_{−∞}^{y1} fXY dy ) dx
◮ Let us consider the example: the event is {(x, y) : y > x + 0.5}
  P[{(x, y) : y > x + 0.5}] = ∫_{0.5}^{1} ∫_{0}^{y−0.5} 2 dx dy
   = ∫_{0.5}^{1} 2(y − 0.5) dy
   = [y² − y]_{0.5}^{1} = 1 − 0.25 − 1 + 0.5 = 0.25
◮ The probability of the event we want is the area of the small triangle divided by that of the big triangle.
[figure omitted: the two triangles in the unit square, axis marks at 0.5 and 1.0]
Example
◮ Consider the joint density: fXY(x, y) = 2, 0 < x < y < 1
◮ The marginal density of X is: for 0 < x < 1,
  fX(x) = ∫_{−∞}^{∞} fXY(x, y) dy = ∫_{x}^{1} 2 dy = 2(1 − x)
◮ We can similarly find the density of Y:
  fY(y) = ∫_{−∞}^{∞} fXY(x, y) dx = ∫_{0}^{y} 2 dx = 2y, 0 < y < 1
Conditional distributions
◮ The conditional mass function is
Conditional mass function
fXY (xi , yj )
fX|Y (xi |yj ) = P [X = xi |Y = yj ] =
fY (yj )
◮ We got ◮ This gives us the useful identity
X fXY (xi , yj ) fXY (xi , yj ) = fX|Y (xi |yj )fY (yj )
FX|Y (x|yj ) =
fY (yj )
i:x ≤x
i ( P [X = xi , Y = yj ] = P [X = xi |Y = yj ]P [Y = yj ])
◮ This gives us the total probability rule for rv’s
◮ Since X is a discrete rv, what is inside the summation X X
above is the pmf corresponding to the df, FX|Y . fX (xi ) = fXY (xi , yj ) = fX|Y (xi |yj )fY (yj )
◮ We define the conditional mass function of X given Y as j j
◮ This is same as
fXY (xi , yj )
fX|Y (xi |yj ) = = P [X = xi |Y = yj ] X
fY (yj ) P [X = xi ] = P [X = xi |Y = yj ]P [Y = yj ]
j
P
(P (A) = j P (A|Bj )P (Bj ) when B1 , · · · form a
partition)
Recap Bayes rule for discrete rv’s Example: Conditional pmf
◮ The conditional mass function is ◮ Consider the random experiment of tossing a coin n
fXY (xi , yj ) times.
fX|Y (xi |yj ) = P [X = xi |Y = yj ] =
fY (yj ) ◮ Let X denote the number of heads and let Y denote the
toss number on which the first head comes.
◮ This gives us the useful identity
◮ For 1 ≤ k ≤ n
fXY (xi , yj ) = fX|Y (xi |yj )fY (yj )
P [Y = k, X = 1]
◮ This gives us the total probability rule for rv’s fY |X (k|1) = P [Y = k|X = 1] =
P [X = 1]
p(1 − p)n−1
X X
fX (xi ) = fXY (xi , yj ) = fX|Y (xi |yj )fY (yj ) =
j j n C p(1 − p)n−1
1
1
◮ Also gives us Bayes rule for discrete rv =
n
fY |X (yj |xi )fX (xi )
fX|Y (xi |yj ) = P ◮ Given there is only one head, it is equally likely to occur
i fY |X (yj |xi )fX (xi )
on any toss.
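◮ (Aside, not from the slides) A small simulation of this fact: condition on exactly one head in n tosses and look at where it occurred. The values of n and p below are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
n, p, trials = 5, 0.3, 200_000
tosses = rng.random((trials, n)) < p             # each row is one experiment
one_head = tosses.sum(axis=1) == 1               # keep runs with exactly one head
positions = tosses[one_head].argmax(axis=1) + 1  # toss number of that head
# conditional pmf of the position: all entries close to 1/n = 0.2
print(np.bincount(positions, minlength=n + 1)[1:] / one_head.sum())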
◮ Let X, Y have joint density fXY . Example
◮ The conditional df of X given Y is
FX|Y (x|y) = lim P [X ≤ x|Y ∈ [y, y + δ]] fXY (x, y) = 2, 0 < x < y < 1
δ→0
◮ We saw that the marginal densities are
◮ This exists if fY (y) > 0 and then it has a density: fX (x) = 2(1 − x), 0 < x < 1; fY (y) = 2y, 0 < y < 1
Z x
FX|Y (x|y) = fX|Y (x′ |y) dx′
◮ Hence the conditional densities are given by
−∞ fXY (x, y) 1
fX|Y (x|y) = = , 0<x<y<1
◮ This conditional density is given by fY (y) y
fXY (x, y) 1
fXY (x, y) fY |X (y|x) = = , 0<x<y<1
fX|Y (x|y) = fX (x) 1−x
fY (y)
◮ We can see this intuitively like this
◮ We (once again) have the useful identity
[figure omitted: the region 0 < x < y < 1, with axis marks at 0.5 and 1.0]
Example
◮ Let X be uniform over (0, 1) and let Y be uniform over
0 to X. Find the density of Y .
◮ What we are given is
◮ The identity fXY (x, y) = fX|Y (x|y)fY (y) can be used to
1
specify the joint density of two continuous rv’s fX (x) = 1, 0 < x < 1; fY |X (y|x) = , 0 < y < x < 1
x
◮ We can specify the marginal density of one and the ◮ Hence the joint density is:
conditional density of the other given the first. fXY (x, y) = x1 , 0 < y < x < 1.
◮ This may actually be the model of how the the rv’s are ◮ Hence the density of Y is
generated. Z ∞ Z 1
1
fY (y) = fXY (x, y) dx = dx = − ln(y), 0 < y < 1
−∞ y x
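◮ (Aside, not from the slides) The density fY(y) = −ln(y) just derived can be checked by sampling X ~ U(0, 1) and then Y uniform on (0, X):

import numpy as np

rng = np.random.default_rng(2)
x = rng.random(1_000_000)
y = x * rng.random(1_000_000)          # Y | X = x is uniform on (0, x)
hist, edges = np.histogram(y, bins=20, range=(0, 1), density=True)
mid = 0.5 * (edges[:-1] + edges[1:])
# empirical density vs the derived density -ln(y) (first few bins shown)
print(np.c_[mid, hist, -np.log(mid)][:5])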
◮ To recap, we started by defining conditional distribution ◮ Now, let X be a continuous rv and let Y be discrete rv.
function. ◮ We can define FX|Y as
FX|Y (x|y) = P [X ≤ x|Y = y]
◮ When X, Y are discrete, we define this only for y = yj . FX|Y (x|y) = P [X ≤ x|Y = y]
That is, we define it only for all values that Y can take.
◮ When X, Y have joint density, we defined it by This is well defined for all values that y takes. (We
consider only those y)
FX|Y (x|y) = lim P [X ≤ x|Y ∈ [y, y + δ]]
δ→0 ◮ Since X is continuous rv, this df would have a density
This limit exists and FX|Y is well defined if fY (y) > 0. Z x
That is, essentially again for all values that Y can take. FX|Y (x|y) = fX|Y (x′ |y) dx′
◮ In the discrete case, we define fX|Y as the pmf −∞
corresponding to FX|Y . This conditional pmf can also be ◮ Hence we can write
defined as a conditional probability
◮ In the continuous case fX|Y is the density corresponding P [X ≤ x, Y = y] = FX|Y (x|y)P [Y = y]
to FX|Y .
Z x
◮ In both cases we have: fXY (x, y) = fX|Y (x|y)fY (y) = fX|Y (x′ |y) fY (y) dx′
−∞
◮ This gives total probability rule and Bayes rule for random
variables PS Sastry, IISc, Bangalore, 2020 16/36 PS Sastry, IISc, Bangalore, 2020 17/36
◮ We now get ◮ When X, Y are discrete we have
X X X
FX (x) = P [X ≤ x] = P [X ≤ x, Y = y] fX (x) = fX|Y (x|y)fY (y) (P [X = x] = P [X = x|Y = y]P [Y = y]
y y y
X Z x
= fX|Y (x′ |y) fY (y) dx′ ◮ When X is continuous and Y is discrete, we defined
−∞
y fX|Y (x|y) to be the density corresponding to
Z x X FX|Y (x|y) = P [X ≤ x|Y = y]
= fX|Y (x′ |y) fY (y) dx′
−∞ y
◮ Then we once again get
X
◮ This gives us fX (x) = fX|Y (x|y)fY (y)
X y
fX (x) = fX|Y (x|y)fY (y)
Now, fX is density (and not a mass function).
y
◮ Suppose Y ∈ {1, 2, 3} and fY (i) = λi ; let
◮ This is another version of total probability rule. fX|Y (x|i) = fi (x)
◮ Earlier we derived this when X, Y are discrete.
fX (x) = λ1 f1 (x) + λ2 f2 (x) + λ3 f3 (x)
◮ The formula is true even when X is continuous
Only difference is we need to take fX as the density of X. Called a mixture density model
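◮ (Aside, not from the slides) A mixture density is also a recipe for sampling: draw the discrete label Y first and then X from the corresponding component. The Gaussian components and weights below are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(3)
lam = np.array([0.2, 0.5, 0.3])                   # mixing probabilities
means, sds = np.array([-2.0, 0.0, 3.0]), np.array([0.5, 1.0, 0.8])
y = rng.choice(3, size=100_000, p=lam)            # label Y with P[Y = i] = lam_i
x = rng.normal(means[y], sds[y])                  # X | Y = i  has density f_i
print(x.mean(), (lam * means).sum())              # E[X] = sum_i lam_i E[X | Y = i]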
◮ First let us look at the total probability rule possibilities
◮ We have
◮ When X is continuous rv and Y is discrete rv, we derived
fY |X (y|x) = lim P [Y = y|X ∈ [x, x + δ]]
δ→0 fY |X (y|x)fX (x) = fX|Y (x|y) fY (y)
P [Y = y, X ∈ [x, x + δ]]
= lim Note that fY is mass fn, fX is density and so on.
δ→0 P [X ∈ [x, x + δ]]
R x+δ
fX|Y (x′ |y) fY (y) dx′
◮ Since fX|Y is a density (corresponding to FX|Y ),
x
= lim R x+δ Z ∞
δ→0
x
fX (x′ ) dx′
fX|Y (x|y) dx = 1
fX|Y (x|y)δ fY (y) −∞
= lim
δ→0 fX (x) δ ◮ Hence we get
fX|Y (x|y) fY (y)
= Z ∞
fX (x)
fY (y) = fY |X (y|x)fX (x) dx
−∞
◮ This gives us further versions of total probability rule and
Bayes rule. ◮ Earlier we derived the same formula when X, Y have a
joint density.
◮ As we saw, given the joint distribution we can calculate
all the marginals.
◮ However, there can be many joint distributions with the
f (x, y) = f1 (x)f2 (y) [1 + α(2F1 (x) − 1)(2F2 (y) − 1)]
same marginals.
◮ Let F1 , F2 be one dimensional df’s of continuous rv’s with
∞ ∞ ∞ ∞
f1 , f2 being the corresponding densities.
Z Z Z Z
Define a function f : ℜ2 → ℜ by f (x, y) dx dy = f1 (x) dx f2 (y) dy
−∞ −∞ −∞ −∞
Z ∞ Z ∞
f (x, y) = f1 (x)f2 (y) [1 + α(2F1 (x) − 1)(2F2 (y) − 1)] +α (2f1 (x)F1 (x) − f1 (x)) dx (2f2 (y)F2 (y) − f2 (y))
−∞ −∞
where α ∈ (−1, 1).
= 1
◮ First note that f (x, y) ≥ 0, ∀α ∈ (−1, 1).
For different α we get different functions. because 2
R∞
f1 (x) F1 (x) dx = 1. This also shows
−∞
◮ We first show that f (x, y) is a joint density.
◮ For this, we note the following
Z ∞ Z ∞
◮ Let X, Y be independent continuous rv
Z x Z y
◮ Suppose X, Y are independent discrete rv’s FXY (x, y) = FX (x)FY (y) = ′
fX (x ) dx ′
fY (y ′ ) dy ′
−∞ −∞
fXY (x, y) = P [X = x, Y = y] = P [X = x]P [Y = y] = fX (x)fY (y) Z y Z x
= (fX (x′ )fY (y ′ )) dx′ dy ′
−∞ −∞
The joint mass function is a product of marginals.
◮ Suppose fXY (x, y) = fX (x)fY (y). Then ◮ This implies joint density is product of marginals.
X X
◮ Now, suppose fXY (x, y) = fX (x)fY (y)
FXY (x, y) = fXY (xi , yj ) = fX (xi )fY (yj ) Z y Z x
xi ≤x,yj ≤y xi ≤x,yj ≤y FXY (x, y) = fXY (x′ , y ′ ) dx′ dy ′
X X Z−∞ −∞
y Z x
= fX (xi ) fY (yj ) = FX (x)FY (y)
xi ≤x yj ≤y
= fX (x′ )fY (y ′ ) dx′ dy ′
Z−∞
x
−∞
Z y
◮ So, X, Y are independent if and only if = ′
fX (x ) dx ′
fY (y ′ ) dy ′ = FX (x)FY (y)
fXY (x, y) = fX (x)fY (y) −∞ −∞
PS Sastry, IISc, Bangalore, 2020 4/41 PS Sastry, IISc, Bangalore, 2020 5/41
Recap: Conditional density (or mass) fn
Recap
◮ Let X be a discrete random variable. Then
◮ When X, Y are both discrete or they have a joint density
fX|Y (x|y) = lim P [X = x|Y ∈ [y, y + δ] ]
δ→0
fXY (x, y) = fX|Y (x|y)fY (y) = fY |X (y|x)fX (x)
(= P [X = x|Y = y] if Y is discrete)
◮ When X, Y are discrete or continuous (all four
◮ This will be the mass function corresponding to the df possibilities)
FX|Y .
◮ Let X be a continuous rv. Then we define conditional fX|Y (x|y)fY (y) = fY |X (y|x)fX (x)
density fX|Y by
Z x Here fX|Y , fX are densities when X is continuous and
mass functions when X is discrete. Similarly for fY |X , fY
FX|Y (x|y) = fX|Y (x′ |y) dx′
−∞ ◮ The above relation gives rise to the total probability rules
and Bayes rule for rv’s
This exists if X, Y have a joint density or when Y is
discrete.
Recap Independent Random variables More than two rv
◮ Everything we have done so far is easily extended to
multiple random variables.
◮ Let X, Y, Z be rv on the same probability space.
◮ X and Y are said to be independent if events [X ∈ B1 ], ◮ We define joint distribution function by
[Y ∈ B2 ] are independent for all B1 , B2 ∈ B.
◮ X and Y are independent if and only if FXY Z (x, y, z) = P [X ≤ x, Y ≤ y, Z ≤ z]
1. FXY (x, y) = FX (x) FY (y)
2. fXY (x, y) = fX (x) fY (y)
◮ If all three are discrete then the joint mass function is
◮ This also implies FX|Y (x|y) = FX (x) and fXY Z (x, y, z) = P [X = x, Y = y, Z = z]
fX|Y (x|y) = fX (x)
◮ If they are continuous , they have a joint density if
Z z Z y Z x
FXY Z (x, y, z) = fXY Z (x′ , y ′ , z ′ ) dx′ dy ′ dz ′
−∞ −∞ −∞
Independence of multiple random variables
◮ We can similarly talk about the joint distribution of any
finite number of rv’s
◮ Let X1 , X2 , · · · , Xn be rv’s on the same probability space.
◮ We denote it as a vector X or X. We can think of it as a
◮ Random variables X1 , X2 , · · · , Xn are said to be
mapping, X : Ω → ℜn .
independent if the the events [Xi ∈ Bi ], i = 1, · · · , n are
◮ We can write the joint distribution as independent.
(Recall definition of independence of a set of events)
FX (x) = P [X ≤ x] = P [Xi ≤ xi , i = 1, · · · , n]
◮ Independence implies that the marginals would determine
◮ We represent by fX (x) the joint density or mass function. the joint distribution.
Sometimes we also write it as fX1 ···Xn (x1 , · · · , xn )
◮ We use similar notation for marginal and conditional
distributions
PS Sastry, IISc, Bangalore, 2020 18/41 PS Sastry, IISc, Bangalore, 2020 19/41
Example Example
◮ Let a joint density be given by ◮ Let a joint density be given by
fXY Z (x, y, z) = K, 0<z<y<x<1 fXY Z (x, y, z) = K, 0<z<y<x<1
First let us determine K. First let us determine K.
Z ∞ Z ∞ Z ∞ Z 1 Z x Z y Z ∞ Z ∞ Z ∞ Z 1 Z x Z y
fXY Z (x, y, z) dz dy dx = K dz dy dx fXY Z (x, y, z) dz dy dx = K dz dy dx
−∞ −∞ −∞ 0 0 0 −∞ −∞ −∞ x=0 y=0 z=0
Z 1 Z x Z 1 Z x
= K y dy dx = K y dy dx
0 0
x=0 y=0
Z 1 2
x Z 1 2
= K dx x
0 2 = K dx
0 2
1
= K ⇒K=6 1
6 = K ⇒K=6
6
PS Sastry, IISc, Bangalore, 2020 20/41 PS Sastry, IISc, Bangalore, 2020 21/41
◮ We got the joint density as
PS Sastry, IISc, Bangalore, 2020 22/41 PS Sastry, IISc, Bangalore, 2020 23/41
◮ Hence,
  fY|XZ(y|x, z) = fXYZ(x, y, z) / fXZ(x, z) = 1/(x − z), 0 < z < y < x < 1
[figure omitted: the sample space mapped by (X, Y) into ℜ² and by g into ℜ, with an event B and its pre-image B′]
PS Sastry, IISc, Bangalore, 2020 24/41 PS Sastry, IISc, Bangalore, 2020 25/41
◮ Let X, Y be discrete rv’s. Let Z = min(X, Y ).
◮ let Z = g(X, Y )
fZ (z) = P [min(X, Y ) = z]
◮ We can determine distribution of Z from the joint
= P [X = z, Y > z] + P [Y = z, X > z] + P [X = Y = z]
distribution of X, Y X X
= P [X = z, Y = y] + P [X = x, Y = z]
FZ (z) = P [Z ≤ z] = P [g(X, Y ) ≤ z] y>z x>z
+P [X = z, Y = z]
◮ For example, if X, Y are discrete, then X X
= fXY (z, y) + fXY (x, z) + fXY (z, z)
X y>z x>z
fZ (z) = P [Z = z] = P [g(X, Y ) = z] = fXY (xi , yj )
xi ,yj :
g(xi ,yj )=z
◮ Now suppose X, Y are independent and both of them
have geometric distribution with the same parameter, p.
◮ Such random variables are called independent and
identically distributed or iid random variables.
◮ Now we can get the pmf of Z as (note Z ∈ {1, 2, · · · })
  fZ(z) = P[X = z, Y > z] + P[Y = z, X > z] + P[X = Y = z]
   = P[X = z]P[Y > z] + P[Y = z]P[X > z] + P[X = z]P[Y = z]
   = 2 p(1 − p)^(z−1) (1 − p)^z + (p(1 − p)^(z−1))²
   = 2p(1 − p)^(2z−1) + p²(1 − p)^(2z−2)
   = p(1 − p)^(2z−2) (2(1 − p) + p)
   = (2 − p) p (1 − p)^(2z−2)

◮ We can show this is a pmf:
  Σ_{z=1}^{∞} fZ(z) = Σ_{z=1}^{∞} (2 − p) p (1 − p)^(2z−2)
   = (2 − p) p Σ_{z=1}^{∞} (1 − p)^(2z−2)
   = (2 − p) p · 1/(1 − (1 − p)²)
   = (2 − p) p · 1/(2p − p²) = 1
◮ Let us consider the max and min functions, in general.
◮ Let Z = max(X, Y ). Then we have
PS Sastry, IISc, Bangalore, 2020 30/41 PS Sastry, IISc, Bangalore, 2020 31/41
FZ (z) = 1 − (1 − FX (z))n
PS Sastry, IISc, Bangalore, 2020 34/41 PS Sastry, IISc, Bangalore, 2020 35/41
◮ Let X, Y ∈ {0, 1, · · · }
◮ Let Z = X + Y . Then we have
◮ Let X, Y be independent X
fZ (z) = P [X + Y = z] = P [X = x, Y = y]
◮ Let Z = max(X, Y ) and W = min(X, Y ). x,y:
x+y=z
◮ We want joint distribution function of Z and W . z
X
= P [X = k, Y = z − k]
FZW (z, w) = P [Z ≤ z, W ≤ w]
k=0
z
◮ This is difficult to find. But we can easily find X
= fXY (k, z − k)
k=0
P [max(X, Y ) ≤ z, min(X, Y ) > w]
◮ Now suppose X, Y are independent. Then
◮ Remaining details are left as an exercise for you!!
z
X
fZ (z) = fX (k)fY (z − k)
k=0
PS Sastry, IISc, Bangalore, 2020 36/41 PS Sastry, IISc, Bangalore, 2020 37/41
◮ Let X, Y have a joint density fXY . Let Z = X + Y
◮ Now suppose X, Y are independent Poisson with FZ (z) = P [Z ≤ z] = P [X + Y ≤ z]
Z Z
parameters λ1 , λ2 . And, Z = X + Y .
= fXY (x, y) dy dx
z {(x,y):x+y≤z}
∞ Z z−x
X Z
fZ (z) = fX (k)fY (z − k)
= fXY (x, y) dy dx
k=0 x=−∞ y=−∞
z
X λk1 −λ1 λz−k
2 change of variable: t = x + y
= e e−λ2
k=0
k! (z − k)! dt = dy; when (y = z − x), t = z
Z ∞ Z z
z
−(λ1 +λ2 ) 1 X z! = fXY (x, t − x) dt dx
= e λk1 λz−k
2
z! k=0 k!(z − k)! x=−∞ t=−∞
Z z Z ∞
1 = fXY (x, t − x) dx dt
= e−(λ1 +λ2 ) (λ1 + λ2 )z −∞ −∞
z!
◮ This gives us
◮ Z is Poisson with parameter λ1 + λ2 Z ∞
fZ (z) = fXY (x, z − x) dx
−∞
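◮ (Aside, not from the slides) The Poisson result derived above is easy to confirm by simulation: the distribution of a sum of independent Poisson(λ1) and Poisson(λ2) samples matches that of Poisson(λ1 + λ2). The parameter values below are arbitrary.

import numpy as np

rng = np.random.default_rng(5)
l1, l2 = 2.0, 3.5
z = rng.poisson(l1, 1_000_000) + rng.poisson(l2, 1_000_000)
w = rng.poisson(l1 + l2, 1_000_000)
for k in range(4, 8):
    print(k, np.mean(z == k), np.mean(w == k))   # the two estimates agree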
Recap Recap
◮ Given X1 , · · · , Xn , random variables on the same
probability space, Z = g(X1 , · · · , Xn ) is a rv
(if g : ℜn → ℜ is borel measurable).
Z = g(X)
[figure omitted: the sample space mapped by X into ℜⁿ and by g into ℜ, with an event B and its pre-image B′]
◮ X1 , · · · , Xn are said to be independent if events [X1 ∈ B1 ], · · · , [Xn ∈ Bn ] are independent.
◮ If X1 , · · · , Xn are independent and all of them have the
same distribution function then they are said to be iid –
independent and identically distributed
◮ We can determine distribution of Z from the joint
distribution of all Xi
FZ (z) = P [Z ≤ z] = P [g(X1 , · · · , Xn ) ≤ z]
PS Sastry, IISc, Bangalore, 2020 1/43 PS Sastry, IISc, Bangalore, 2020 2/43
Recap Recap
PS Sastry, IISc, Bangalore, 2020 3/43 PS Sastry, IISc, Bangalore, 2020 4/43
Recap Recall problem from last class
◮ Let X, Y be random variables with joint density fXY
◮ Z =X +Y ◮ Let X, Y be independent
Let Z = max(X, Y ) and W = min(X, Y ).
Z ∞ ◮
fZ (z) = fXY (t, z − t) dt
−∞
◮ We want joint distribution function of Z and W .
◮ Thus, the density of sum of two ind rv’s that are uniform
Independence of functions of random variable
over (−1, 1) is
z+2
4
if − 2 < z < 0
fZ (z) = 2−z
4
if 0 < z < 2
◮ Suppose X and Y are independent.
◮ This is a triangle with vertices (−2, 0), (0, 0.5), (2, 0) ◮ Then g(X) and h(Y ) are independent
◮ This is because [g(X) ∈ B1 ] = [X ∈ B̃1 ] for some Borel
set, B̃1 and similarly [h(Y ) ∈ B2 ] = [Y ∈ B̃2 ]
◮ Hence, [g(X) ∈ B1 ] and [h(Y ) ∈ B2 ] are independent.
Sum of independent gamma rv Z ∞
fZ (z) = fX (x) fY (z − x) dx
−∞
Z z
Gamma density with parameters α > 0 and λ > 0 is given 1 1
◮
= λα1 xα1 −1 e−λx λα2 (z − x)α2 −1 e−λ(z−x) dx
by 0 Γ(α1 ) Γ(α2 )
1 λ α1 +α2 −λz Z z
e x α1 −1 α2 −1 x α2 −1
λα xα−1 e−λx , x > 0
f (x) = = z α1 −1 z 1− dx
Γ(α) Γ(α1 )Γ(α2 ) 0 z z
We will call this Gamma(α, λ). x
change the variable: t = (⇒ z −1 dx = dt)
◮ The α is called the shape parameter and λ is called the z
λα1 +α2 e−λz α+ α2 −1 1 α1 −1
Z
rate parameter. = z t (1 − t)α2 −1 dt
Γ(α1 )Γ(α2 ) 0
◮ For α = 1 this is the exponential density.
1
◮ Let X ∼ Gamma(α1 , λ), Y ∼ Gamma(α2 , λ). = λα1 +α2 z α1 +α2 −1 e−λz
Γ(α1 + α2 )
Suppose X, Y are independent.
Because
◮ Let Z = X + Y . Then Z ∼ Gamma(α1 + α2 , λ). Z 1
Γ(α1 )Γ(α2 )
tα1 −1 (1 − t)α2 −1 dt =
0 Γ(α1 + α2 )
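◮ (Aside, not from the slides) A quick Monte Carlo check of the Gamma result. Note that numpy parameterises the Gamma density by shape and scale = 1/λ, so the rate λ enters as its reciprocal; the parameter values below are arbitrary.

import numpy as np

rng = np.random.default_rng(6)
a1, a2, lam = 2.0, 3.0, 1.5
z = rng.gamma(a1, 1 / lam, 1_000_000) + rng.gamma(a2, 1 / lam, 1_000_000)
print(z.mean(), (a1 + a2) / lam)        # mean of Gamma(a1 + a2, lam)
print(z.var(), (a1 + a2) / lam ** 2)    # variance of Gamma(a1 + a2, lam)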
A Calculation Trick
◮ If X, Y are independent gamma random variables then ∞
1 2
Z
X + Y also has gamma distribution.
I = exp − x − 2bx + c dx
−∞ 2K
◮ If X ∼ Gamma(α1 , λ), and Y ∼ Gamma(α2 , λ), then Z ∞
X + Y ∼ Gamma(α1 + α2 , λ). 1 2 2
= exp − (x − b) + c − b dx
−∞ 2K
◮ Exercise for you: Show that sum of independent Gaussian Z ∞
(x − b)2 (c − b2 )
random variables has gaussian density. = exp − exp − dx
◮ The algebra is a little involved. −∞ 2K 2K
(c − b2 ) √
◮ First take the two gaussians to be zero-mean. = exp − 2πK
2K
◮ There is a calculation trick that is often useful with
Gaussian density because
∞
(x − b)2
1
Z
√ exp − dx = 1
2πK −∞ 2K
PS Sastry, IISc, Bangalore, 2020 19/43 PS Sastry, IISc, Bangalore, 2020 20/43
◮ Let X1 , · · · , Xn be continuous random variables with
joint density fX1 ···Xn . We define Y1 , · · · Yn by
Y1 = g1 (X1 , · · · , Xn ) ··· Yn = gn (X1 , · · · , Xn )
We think of gi as components of g : ℜn → ℜn .
◮ We next look at a general theorem that is quite useful in ◮ We assume g is continuous with continuous first partials
dealing with functions of multiple random variables. and is invertible.
◮ This result is only for continuous random variables.
◮ Let h be the inverse of g. That is
X1 = h1 (Y1 , · · · , Yn ) ··· Xn = hn (Y1 , · · · , Yn )
◮ Each of gi , hi are ℜn → ℜ functions and we can write
them as
yi = gi (x1 , · · · , xn ); ··· xi = hi (y1 , · · · , yn )
We denote the partial derivatives of these functions by
∂xi
∂yj
etc.
Proof of Theorem ◮ X1 , · · · Xn are continuous rv with joint density
◮ B = (−∞, y1 ] × · · · × (−∞, yn ].
Y1 = g1 (X1 , · · · , Xn ) ··· Yn = gn (X1 , · · · , Xn )
◮ g −1 (B) = {(x1 , · · · , xn ) ∈ ℜn : g(x1 , · · · , xn ) ∈ B}
◮ The transformation is continuous with continuous first
FY (y1 , · · · , yn ) = P [gi (X1 , · · · , Xn ) ≤ yi , i = 1, · · · , n] partials and is invertible and
Z
= fX1 ···Xn (x′1 , · · · , x′n ) dx′1 · · · dx′n
g −1 (B) X1 = h1 (Y1 , · · · , Yn ) ··· Xn = hn (Y1 , · · · , Yn )
change variables: yi′ = gi (x′1 , · · ·
, x′n ), i = 1, · · · n
◮ We assume the Jacobian of the inverse transform, J, is
(x′1 , · · · x′n ) ∈ g (B) ⇒ (y1′ , · · · , yn′ ) ∈ B
−1
non-zero
x′i = hi (y1′ , · · · , yn′ ), dx′1 · · · dx′n = |J|dy1′ · · · dyn′ ◮ Then the density of Y is
Z
FY (y1 , · · · , yn ) = fX1 ···Xn (h1 (y′ ), · · · , hn (y′ )) |J|dy1′ · · · dyn′
B fY1 ···Yn (y1 , · · · , yn ) = |J|fX1 ···Xn (h1 (y1 , · · · , yn ), · · · , hn (y1 , · · · , yn ))
⇒ fY1 ···Yn (y1 , · · · , yn ) = fX1 ···Xn (h1 (y), · · · , hn (y)) |J|
◮ Called multidimensional change of variable formula
◮ let Z = X + Y and W = X − Y . We got Example
1
z+w z−w
◮ Let X, Y be iid U (0, 1). Let Z = X − Y .
fZW (z, w) = fXY , Z ∞
2 2 2 fZ (z) = fX (t) fY (t − z) dt
−∞
◮ Now we can calculate fW also.
◮ For the integrand to be non-zero (note Z ∈ (−1, 1))
Z ∞ 0 < t < 1 ⇒ t > 0, t < 1
1 z+w z−w ◮
PS Sastry, IISc, Bangalore, 2020 29/43 PS Sastry, IISc, Bangalore, 2020 30/43
◮ For Z = XY with X, Y iid U(0, 1):
  fZ(z) = ∫_{−∞}^{∞} (1/w) fX(z/w) fY(w) dw
◮ We need: 0 < w < 1 and 0 < z/w < 1. Hence
  fZ(z) = ∫_{z}^{1} (1/w) dw = −ln(z), 0 < z < 1

◮ Suppose X, Y are discrete and Z = XY
  fZ(0) = P[X = 0 or Y = 0] = Σ_x fXY(x, 0) + Σ_y fXY(0, y) − fXY(0, 0)
  fZ(k) = Σ_{y≠0} P[X = k/y, Y = y] = Σ_{y≠0} fXY(k/y, y), k ≠ 0
Recap
◮ Let Z = X + Y . Let X, Y have joint density fXY
Z ∞Z ∞ ◮ X1 , · · · Xn are continuous rv with joint density
E[X + Y ] = (x + y) fXY (x, y) dx dy
−∞ −∞
Y1 = g1 (X1 , · · · , Xn ) ··· Yn = gn (X1 , · · · , Xn )
Z ∞ Z ∞
= x fXY (x, y) dy dx ◮ The transformation is continuous with continuous first
−∞ −∞
Z ∞ Z ∞ partials and is invertible and
+ y fXY (x, y) dx dy
−∞ −∞ X1 = h1 (Y1 , · · · , Yn ) ··· Xn = hn (Y1 , · · · , Yn )
Z ∞ Z ∞
= x fX (x) dx + y fY (y) dy ◮ We assume the Jacobian of the inverse transform, J, is
−∞ −∞
non-zero
= E[X] + E[Y ]
◮ Then the density of Y is
◮ Expectation is a linear operator.
fY1 ···Yn (y1 , · · · , yn ) = |J|fX1 ···Xn (h1 (y1 , · · · , yn ), · · · , hn (y1 , · · · , yn ))
◮ This is true for all random variables.
◮ Called multidimensional change of variable formula
PS Sastry, IISc, Bangalore, 2020 43/43 PS Sastry, IISc, Bangalore, 2020 1/37
Recap Recap
PS Sastry, IISc, Bangalore, 2020 2/37 PS Sastry, IISc, Bangalore, 2020 3/37
Recap
◮ We saw E[X + Y ] = E[X] + E[Y ].
◮ Let us calculate Var(X + Y ).
◮ Let Z = g(X1 , · · · Xn ) = g(X). Then
Var(X + Y ) = E ((X + Y ) − E[X + Y ])2
Z
= E ((X − EX) + (Y − EY ))2
E[Z] = g(x) dFX (x)
= E (X − EX)2 + E (Y − EY )2
ℜn
◮ For example, if they have a joint density, then +2E [(X − EX)(Y − EY )]
Z = Var(X) + Var(Y ) + 2Cov(X, Y )
E[Z] = g(x) fX (x) dx
ℜn where we define covariance between X, Y as
◮ This gives us: E[X + Y ] = E[X] + E[Y ] Cov(X, Y ) = E [(X − EX)(Y − EY )]
◮ In general, E [g1 (X) + g2 (X)] = E [g1 (X] + E [g2 (X)]
PS Sastry, IISc, Bangalore, 2020 4/37 PS Sastry, IISc, Bangalore, 2020 5/37
◮ We define covariance between X and Y by
  Cov(X, Y) = E[(X − EX)(Y − EY)]
   = E[XY − X(EY) − Y(EX) + EX EY]
   = E[XY] − EX EY
◮ Note that Cov(X, Y) can be positive or negative
◮ X and Y are said to be uncorrelated if Cov(X, Y) = 0
◮ If X and Y are uncorrelated then
  Var(X + Y) = Var(X) + Var(Y)
◮ Note that E[X + Y] = E[X] + E[Y] for all random variables.

Example
◮ Consider the joint density fXY(x, y) = 2, 0 < x < y < 1
◮ We want to calculate Cov(X, Y)
  EX = ∫_{0}^{1} ∫_{x}^{1} x · 2 dy dx = 2 ∫_{0}^{1} x(1 − x) dx = 1/3
  EY = ∫_{0}^{1} ∫_{0}^{y} y · 2 dx dy = 2 ∫_{0}^{1} y² dy = 2/3
  E[XY] = ∫_{0}^{1} ∫_{0}^{y} xy · 2 dx dy = 2 ∫_{0}^{1} y (y²/2) dy = 1/4
◮ Hence, Cov(X, Y) = E[XY] − EX EY = 1/4 − 2/9 = 1/36
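◮ (Aside, not from the slides) A Monte Carlo check of this example. The pair (min(U1, U2), max(U1, U2)) of two independent U(0, 1) variables has exactly the density 2 on 0 < x < y < 1, so we can sample from it directly:

import numpy as np

rng = np.random.default_rng(7)
u = rng.random((1_000_000, 2))
x, y = u.min(axis=1), u.max(axis=1)                  # joint density 2 on 0 < x < y < 1
print(x.mean(), y.mean())                            # approx 1/3 and 2/3
print(np.mean(x * y) - x.mean() * y.mean(), 1 / 36)  # Cov(X, Y) approx 1/36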
Independent random variables are uncorrelated Uncorrelated random variables may not be
independent
◮ Suppose X, Y are independent. Then ◮ Suppose X ∼ N (0, 1) Then, EX = EX 3 = 0
Z Z ◮ Let Y = X 2 Then,
E[XY ] = x y fXY (x, y) dx dy
E[XY ] = EX 3 = 0 = EX EY
Z Z
= x y fX (x) fY (y) dx dy
Thus X, Y are uncorrelated.
Z Z ◮
= xfX (x) dx yfY (y) dy = EX EY
◮ Are they independent? No
e.g.,
◮ Then, Cov(X, Y ) = E[XY ] − EX EY = 0. P [X > 2 |Y < 1] = 0 6= P [X > 2]
◮ X, Y independent ⇒ X, Y uncorrelated
◮ X, Y being uncorrelated does not imply that they are independent.
PS Sastry, IISc, Bangalore, 2020 8/37 PS Sastry, IISc, Bangalore, 2020 9/37
◮ We have E [(αX + βY )2 ] ≥ 0, ∀α, β ∈ ℜ
PS Sastry, IISc, Bangalore, 2020 10/37 PS Sastry, IISc, Bangalore, 2020 11/37
◮ We showed that
Linear Least Squares Estimation
(E[XY ])2 ≤ E[X 2 ]E[Y 2 ]
◮ Take X − EX in place of X and Y − EY in place of Y
in the above algebra.
◮ This gives us
◮ Suppose we want to approximate Y as an affine function
of X.
(E[(X − EX)(Y − EY )])2 ≤ E[(X−EX)2 ]E[(Y −EY )2 ] ◮ We want a, b to minimize E [(Y − (aX + b))2 ]
◮ For a fixed a, what is the b that minimizes
⇒ (Cov(X, Y ))2 ≤ Var(X)Var(Y ) E [((Y − aX) − b)2 ] ?
◮ Hence we get ◮ We know the best b here is:
!2
Cov(X, Y ) b = E[Y − aX] = EY − aEX.
ρ2XY = p ≤1 ◮ So, we want to find the best a to minimize
Var(X)Var(Y )
J(a) = E [(Y − aX − (EY − aEX))2 ]
◮ The equality holds here only if E [(αX + βY )2 ] = 0
Thus, |ρXY | = 1 only if αX + βY = 0
◮ Correlation coefficient of X, Y is ±1 only when Y is a
linear function of X PS Sastry, IISc, Bangalore, 2020 12/37 PS Sastry, IISc, Bangalore, 2020 13/37
◮ We want to find a to minimize ◮ The final mean square error, say, J ∗ is
Covariance Matrix Covariance matrix
◮ Let X1 , · · · , Xn be random variables (on the same
probability space)
◮ We represent them as a vector X. ◮ If a = (a1 , · · · , an )T then
◮ As a notation, all vectors are column vectors: a aT is a n × n matrix whose (i, j)th element is ai aj .
X = (X1 , · · · , Xn )T
◮ Hence we get
◮ We denote E[X] = (EX1 , · · · , EXn )T
The n × n matrix whose (i, j)th element is Cov(Xi , Xj ) is ΣX = E (X − EX) (X − EX)T
◮
called the covariance matrix (or variance-covariance
matrix) of X. Denoted as ΣX or ΣX ◮ This is because
(X − EX) (X − EX)T ij = (Xi − EXi )(Xj − EXj )
Cov(X1 , X1 ) Cov(X1 , X2 ) · · · Cov(X1 , Xn ) and (ΣX )ij = E[(Xi − EXi )(Xj − EXj )]
Cov(X2 , X1 ) Cov(X2 , X2 ) · · · Cov(X2 , Xn )
ΣX = .. .. .. ..
. . . .
Cov(Xn , X1 ) Cov(Xn , X2 ) · · · Cov(Xn , Xn )
PS Sastry, IISc, Bangalore, 2020 18/37 PS Sastry, IISc, Bangalore, 2020 19/37
T
= aT ΣX a
where b = (b1 , · · · , bn )
◮ A is said to be positive semidefinite if bT Ab ≥ 0, ∀b ◮ This gives aT ΣX a ≥ 0, ∀a
◮ This shows ΣX is positive semidefinite
PS Sastry, IISc, Bangalore, 2020 20/37 PS Sastry, IISc, Bangalore, 2020 21/37
Y = aT X = i ai Xi – linear combination of Xi ’s.
P
◮
◮ We know how to find its mean and variance ◮ Covariance matrix ΣX positive semidefinite because
X
EY = aT EX = ai EXi ;
aT ΣX a = Var(aT X) ≥ 0
i
X
Var(Y ) = aT ΣX a = ai aj Cov(Xi , Xj ) ◮ ΣX would be positive definite if aT ΣX a > 0, ∀a 6= 0
i,j
◮ It would fail to be positive definite if Var(aT X) = 0 for
Specifically, by taking all components of a to be 1, we get
◮ some nonzero a.
◮ Var(Z) = E[(Z − EZ)2 ] = 0 implies Z = EZ, a
n
! n n n X
X X X X constant.
Var Xi = Cov(Xi , Xj ) = Var(Xi )+ Cov(Xi , Xj )
i=1 i,j=1 i=1 i=1 j6=i
◮ Hence, ΣX fails to be positive definite only if there is a
non-zero linear combination of Xi ’s that is a constant.
◮ If Xi are independent, variance of sum is sum of
variances.
Joint moments ◮ We can define moment generating function of X, Y by
PS Sastry, IISc, Bangalore, 2020 26/37 PS Sastry, IISc, Bangalore, 2020 27/37
◮ However, its value would be a function of y. ◮ What this means is that we define E[h(X)|Y ] = g(Y )
◮ That is, this is a kind of expectation that is a function of where X
Y (and hence is a random variable) g(y) = h(x) fX|Y (x|y)
◮ It is called conditional expectation. x
PS Sastry, IISc, Bangalore, 2020 28/37 PS Sastry, IISc, Bangalore, 2020 29/37
A simple example
◮ Consider the joint density
◮ Let X, Y have joint density fXY .
◮ The conditinal expectation of h(X) conditioned on Y is a fXY (x, y) = 2, 0 < x < y < 1
function of Y , and its value for any y is defined by ◮ We calculated the conditional densities earlier
Z ∞ 1 1
E[h(X)|Y = y] = h(x) fX|Y (x|y) dx fX|Y (x|y) = , fY |X (y|x) = , 0<x<y<1
−∞
y 1−x
◮ Now we can calculate the conditional expectation
◮ Once again, what this means is that E[h(X)|Y ] = g(Y ) Z ∞
where Z ∞ E[X|Y = y] = x fX|Y (x|y) dx
g(y) = h(x) fX|Y (x|y) dx −∞
Z y y
−∞ 1 1 x2 y
= x dx = =
0 y y 2 0 2
◮ This gives: E[X|Y ] = Y2
1+X
◮ We can show E[Y |X] = 2
◮ Expectation of a conditional expectation is the
unconditional expectation ◮ Any factor that depends only on the conditioning variable
E [ E[h(X)|Y ] ] = E[h(X)] behaves like a constant inside a conditional expectation
In the above, LHS is expectation of a function of Y .
E[h1 (X) h2 (Y )|Y ] = h2 (Y )E[h1 (X)|Y ]
◮ Let us denote g(Y ) = E[h(X)|Y ]. Then
E [ E[h(X)|Y ] ] = E[g(Y )] ◮ Let us denote g(Y ) = E[h1 (X) h2 (Y )|Y ]
Z ∞
= g(y) fY (y) dy g(y) = E[h1 (X) h2 (Y )|Y = y]
−∞ Z ∞
Z ∞ Z ∞
= h1 (x)h2 (y) fX|Y (x|y) dx
= h(x) fX|Y (x|y) dx fY (y) dy −∞
−∞ −∞ Z ∞
Z ∞Z ∞
= h2 (y) h1 (x) fX|Y (x|y) dx
= h(x) fXY (x, y) dy dx −∞
−∞ −∞
Z ∞ = h2 (y) E[h1 (X)|Y = y]
= h(x) fX (x) dx
−∞
= E[h(X)]
PS Sastry, IISc, Bangalore, 2020 34/37 PS Sastry, IISc, Bangalore, 2020 35/37
PS Sastry, IISc, Bangalore, 2020 36/37 PS Sastry, IISc, Bangalore, 2020 37/37
Density of XY Z 0 Z ∞ Z ∞ Z z/x
◮ Let X, Y have joint density fXY . FZ (z) = fXY (x, y) dy dx + fXY (x, y) dy dx
−∞ z/x 0 −∞
◮ Let Z = XY . We want to find density of XY directly
◮ Let Az = {(x, y) ∈ ℜ2 : xy ≤ z} ⊂ ℜ2 . ◮ Change variable from y to t using t = xy
y = t/x; dy = x1 dt; y = z/x ⇒ t = z
FZ (z) = P [XY ≤ z] = P [(X, Y ) ∈ Az ] Z 0 Z −∞ Z ∞Z z
Z Z 1 t 1 t
= fXY (x, y) dy dx FZ (z) = fXY (x, ) dt dx + fXY (x, ) dt d
−∞ z x x 0 −∞ x x
Az Z 0 Z z Z ∞Z z
1 t 1 t
◮ We need to find limits for integrating over Az = fXY (x, ) dt dx + fXY (x, ) d
−∞ x x −∞ x x
◮ If x > 0, then xy ≤ z ⇒ y ≤ z/x Z−∞
∞ Z z 0
1 t
If x < 0, then xy ≤ z ⇒ y ≥ z/x = fXY x, dt dx
−∞ −∞ x x
Z z Z ∞
Z 0 Z ∞ Z ∞ Z z/x 1 t
FZ (z) = fXY (x, y) dy dx+ fXY (x, y) dy dx = fXY x, dx dt
−∞ −∞ x x
−∞ z/x 0 −∞
R∞
This shows: fZ (z) = −∞ x1 fXY x, xz dx
PS Sastry, IISc, Bangalore, 2020 1/32 PS Sastry, IISc, Bangalore, 2020 2/32
◮ The covariance of X, Y is
Cov(X, Y )
Note that Cov(X, X) = Var(X) ρXY = p
Var(X) Var(Y )
◮ Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y )
◮ X, Y are called uncorrelated if Cov(X, Y ) = 0. ◮ If X, Y are uncorrelated then ρXY = 0.
◮ If X, Y are uncorrelated, Var(X + Y ) = Var(X) + Var(Y ) ◮ −1 ≤ ρXY ≤ 1, ∀X, Y
◮ X, Y independent ⇒ X, Y uncorrelated. ◮ |ρXY | = 1 iff X = aY
◮ Uncorrelated random variables need not necessarily be
independent
PS Sastry, IISc, Bangalore, 2020 3/32 PS Sastry, IISc, Bangalore, 2020 4/32
Recap: mean square estimation Recap: Covariance matrix
◮ The best mean-square approximation of Y as a ‘linear’
function of X is
For a random vector, X = (X1 , · · · , Xn )T , the covariance
Cov(X, Y ) Cov(X, Y ) ◮
Y = X + EY − EX
Var(X) Var(X) matrix is
ΣX = E (X − EX) (X − EX)T
◮ Called the line of regression of Y on X.
◮ If cov(X, Y ) = 0 then this reduces to approximating Y by
a constant, EY . (ΣX )ij = E[(Xi − EXi )(Xj − EXj )]
◮ The final mean square error is ◮ Var(aT X) = aT ΣX a
◮ ΣX is a real symmetric and positive semidefinite matrix.
Var(Y ) 1 − ρ2XY
mi1 i2 ···in = E X1i1 X2i2 · · · Xnin The conditional expectation of h(X) conditioned on Y is
◮
a function of Y : E[h(X)|Y ] = g(Y )
◮ The moment generating function of X is the above specify the value of g(y).
h T i ◮ We define E[h(X, Y )|Y ] also as above:
MX (s) = E es X , s ∈ ℜn Z ∞
E[h(X, Y )|Y = y] = h(x, y) fX|Y (x|y) dx
−∞
Example
◮ Any factor that depends only on the conditioning variable
behaves like a constant inside a conditional expectation ◮ Let X, Y be random variables with joint density given by
E[h1 (X) h2 (Y )|Y ] = h2 (Y )E[h1 (X)|Y ] fXY (x, y) = e−y , 0 < x < y < ∞
◮ The marginal densities are:
◮ Let us denote g(Y ) = E[h1 (X) h2 (Y )|Y ] Z ∞ Z ∞
g(y) = E[h1 (X) h2 (Y )|Y = y] fX (x) = fXY (x, y) dy = e−y dy = e−x , x > 0
Z ∞ −∞ x
EX = 1; Var(X) = 1; EY = 2; Var(Y ) = 2
PS Sastry, IISc, Bangalore, 2020 11/32 PS Sastry, IISc, Bangalore, 2020 12/32
◮ Recall the joint and marginal densities
fXY (x, y) = e−y , 0 < x < y < ∞
fXY (x, y) = e−y , 0 < x < y < ∞
◮ Let us calculate covariance of X and Y fX (x) = e−x , x > 0; fY (y) = ye−y , y > 0
Z ∞Z ∞
E[XY ] = xy fXY (x, y) dx dy
◮ The conditional densities will be
−∞ −∞
Z ∞Z y Z ∞ fXY (x, y) e−y 1
−y 1 3 −y fX|Y (x|y) = = −y = , 0 < x < y < ∞
= xye dx dy = y e dy = 3 fY (y) ye y
0 0 0 2
◮ Rn – number of rounds when you start with n people. ◮ What would be E[Xn ]?
◮ Xn – number of people who got their own hat in the first ◮ Let Yi ∈ {0, 1} denote whether or not ith person got his
round own hat.
E [Rn ] = E[ E [Rn |Xn ] ] ◮ We know
n
X (n − 1)! 1
= E [Rn |Xn = i] P [Xn = i] E[Yi ] = P [Yi = 1] = =
n! n
i=0
n
n n
X
= (1 + E [Rn−i ]) P [Xn = i] X X
Now, Xn = Yi and hence EXn = E[Yi ] = 1
i=0
n n i=1 i=1
X X
= P [Xn = i] + E [Rn−i ] P [Xn = i]
i=0 i=0
◮ Hence a good guess is E[Rn ] = n.
◮ We verify it using mathematical induction. We know
If we can guess value of E[Rn ] then we can prove it using E[R1 ] = 1
mathematical induction
PS Sastry, IISc, Bangalore, 2020 21/32 PS Sastry, IISc, Bangalore, 2020 22/32
◮ Assume: E [Rk ] = k, 1 ≤ k ≤ n − 1
Analysis of Quicksort
n
X n
X
E [Rn ] = P [Xn = i] + E [Rn−i ] P [Xn = i]
i=0 i=0
n
X ◮ Given n numbers we want to sort them. Many algorithms.
= 1 + E [Rn ] P [Xn = 0] + E [Rn−i ] P [Xn = i] ◮ Complexity – order of the number of comparisons needed
i=1
n
◮ Quicksort: Choose a pivot. Separate the numbers into two
parts – less and greater than pivot, do recursively
X
= 1 + E [Rn ] P [Xn = 0] + (n − i) P [Xn = i]
i=1 ◮ Separating into two parts takes n − 1 comparisons.
n
◮ Suppose the two parts contain m and n − m − 1.
Separating both of them into two parts each takes
X
E [Rn ] (1 − P [Xn = 0]) = 1 + n(1 − P [Xn = 0]) − i P [Xn = i]
i=1 m + n − m − 1 comparisons
= 1 + n (1 − P [Xn = 0]) − E[Xn ] ◮ So, final number of comparisons depends on the ‘number
= 1 + n (1 − P [Xn = 0]) − 1 of rounds’
⇒ E [Rn ] = n
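◮ (Aside, not from the slides) The result E[Rn] = n can also be checked by simulating the rounds directly; n and the number of trials below are arbitrary.

import numpy as np

rng = np.random.default_rng(8)

def rounds(n):
    # People who receive their own hat leave; the rest get a fresh random
    # assignment of their own hats in the next round.
    remaining, r = n, 0
    while remaining > 0:
        r += 1
        perm = rng.permutation(remaining)
        remaining -= int(np.sum(perm == np.arange(remaining)))
    return r

n = 6
print(np.mean([rounds(n) for _ in range(20_000)]), n)   # both approximately 6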
quicksort details Average case complexity of quicksort
◮ Given {x1 , · · · , xn }. ◮ Assume pivot is equally likely to be the smallest or second
◮ Choose first as pivot smallest or mth smallest.
◮ Mn – number of comparisons.
{xj1 , xj2 , · · · , xjm }x1 {xk1 , xk2 , · · · , xkn−1−m } ◮ Define: X = j if pivot is j th smallest
◮ Given X = j we know Mn = (n − 1) + Mj−1 + Mn−j .
◮ Suppose rn is the number of comparisons. If we get
n
(roughly) equal parts, then X
E[Mn ] = E[ E[Mn |X] ] = E[Mn |X = j] P [X = j]
rn ≈ n+2rn/2 = n+2(n/2+2rn/4 ) = n+n+4rn/4 = · · · = n log2 (n) j=1
n
If all the rest go into one part, then
X 1
◮ = E[(n − 1) + Mj−1 + Mn−j ]
j=1
n
n(n + 1)
rn = n + rn−1 = n + (n − 1) + rn−2 = · · · = n−1
2 2X
= (n − 1) + E[Mk ], (taking M0 = 0)
◮ If you are lucky, O(n log(n)) comparisons. n k=1
◮ If unlucky, in the worst case, O(n2 ) comparisons ◮ This is a recurrence relation. (A little complicated to
◮ Question: ‘on the average’ how many comparisons? solve)
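◮ (Aside, not from the slides) A sketch that counts the comparisons made by quicksort with a random pivot on distinct numbers; the simulated average grows like n log n, as the recurrence above suggests. The values of n and the number of trials are arbitrary.

import numpy as np

rng = np.random.default_rng(9)

def comparisons(a):
    # n - 1 comparisons to split around the pivot, plus the two recursive parts.
    if len(a) <= 1:
        return 0
    pivot = a[rng.integers(len(a))]
    left = [v for v in a if v < pivot]
    right = [v for v in a if v > pivot]
    return (len(a) - 1) + comparisons(left) + comparisons(right)

n = 500
avg = np.mean([comparisons(list(rng.permutation(n))) for _ in range(200)])
print(avg, 2 * n * np.log(n))   # same order of growth, Theta(n log n)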
◮ We earlier got
First consider the last term 2 2
(g(X) − Y )2 = g(X) − E[Y | X] + E[Y | X] − Y
E (g(X) − E[Y | X])(E[Y | X] − Y )
+ 2 g(X) − E[Y | X] E[Y | X] − Y
= E E (g(X) − E[Y | X])(E[Y | X] − Y ) | X
because E[Z] = E[ E[Z|X] ] ◮ Hence we get
= E (g(X) − E[Y | X]) E (E[Y | X] − Y ) | X E (g(X) − Y )2
= E (g(X) − E[Y | X])2
because E[h1 (X)h2 (Z)|X] = h1 (X) E[h2 (Z)|X] + E (E[Y | X] − Y )2
= E (g(X) − E[Y | X]) E (E[Y | X])|X − E{Y | X}) ≥ E (E[Y | X] − Y )2
= E (g(X) − E[Y | X]) (E[Y | X] − E[Y | X))
= 0 ◮ Since the above is true for all functions g, we get
g ∗ (X) = E [Y | X]
PS Sastry, IISc, Bangalore, 2020 29/32 PS Sastry, IISc, Bangalore, 2020 30/32
E[ E[X|Y ] ] = E[X]
◮ Let X1 , X2 , · · · be iid rv on the same probability space.
is very useful in calculating expectations Suppose EXi = µ, ∀i.
X Z ◮ Let N be a positive integer valued rv that is independent
EX = E[X|Y = y] fY (y) or E[X|Y = y] fY (y) dy of all Xi .
Let S = N
y
P
i=1 Xi .
◮
PS Sastry, IISc, Bangalore, 2020 3/36 PS Sastry, IISc, Bangalore, 2020 4/36
◮ We have Variance of random sum
" N
# PN
X ◮ S= Xi , Xi iid, ind of N . Want Var(S)
E[S|N = n] = E Xi | N = n i=1
" i=1
!2 !2
# N N
n X X
= E
X
Xi | N = n E[S 2 ] = E Xi = E E Xi | N
i=1 i=1
i=1
since E[h(X, Y )|Y = y] = E[h(X, y)|Y = y]
n
X Xn ◮ As earlier, we have
= E[Xi | N = n] = E[Xi ] = nµ
N
!2 n
!2
i=1 i=1 X X
E Xi | N = n = E Xi | N = n
◮ Hence we get i=1 i=1
!2
E[S|N ] = N µ ⇒ E[S] = E[N ]E[X1 ] n
X
= E Xi
i=1
◮ Actually, we did not use independence of Xi .
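◮ (Aside, not from the slides) A simulation of the random sum S = X1 + · · · + XN with N independent of the Xi. The distributions below are arbitrary illustrative choices (N ~ Poisson(4), Xi exponential with mean 2), so E[S] = E[N]E[X1] = 8.

import numpy as np

rng = np.random.default_rng(10)
trials = 100_000
N = rng.poisson(4.0, trials)
S = np.array([rng.exponential(2.0, n).sum() for n in N])   # one random sum per trial
print(S.mean(), 4.0 * 2.0)                                  # both approximately 8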
E[Y 2 ] = Var(Y ) + (EY )2 = n Var(X1 ) + (nEX1 )2 E[S 2 ] = E[ E[S 2 |N ] ] = EN Var(X1 ) + E[N 2 ](EX1 )2
PS Sastry, IISc, Bangalore, 2020 7/36 PS Sastry, IISc, Bangalore, 2020 8/36
Wald’s formula Another Example
PN
◮ Considered S = i=1 Xi with N independent of all Xi .
◮ With iid Xi , the formula ES = EN EX1 is valid even ◮ We toss a (biased) coin till we get k consecutive heads.
under some dependence between N and Xi . Let Nk denote the number of tosses needed.
◮ Here is one version of Wald’s formula. We assume ◮ N1 would be geometric.
i |] < ∞, ∀i and EN < ∞.
1. E[|X ◮ We want E[Nk ]. What rv should we condition on?
2. E Xn I[N ≥n] = E[Xn ]P [N ≥ n], ∀n
◮ Useful rv here is Nk−1
Let SN = N
P PN
i=1 Xi and let TN = i=1 E[Xi ].
◮
◮ Then, ESN = ETN . E[Nk | Nk−1 = n] = (n + 1)p + (1 − p)(n + 1 + E[Nk ])
If E[Xi ] is same for all i, ESN = EX1 EN .
◮ Assume Xi are iid. Suppose the event [N ≤ n − 1] ◮ Thus we get the recurrence relation
depends only on X1 , · · · , Xn−1 .
◮ Then the event [N ≤ n − 1] and hence its complement E[Nk ] = E[ E[Nk | Nk−1 ] ]
[N ≥ n] is independent of Xn and the assumption above = E [ (Nk−1 + 1)p + (1 − p)(Nk−1 + 1 + E[Nk ]) ]
is satisfied.
◮ Such an N is an example of what is called a stopping
time.
Gaussian or Normal distribution
◮ The Gaussian or normal density is given by
1 (x−µ)2
f (x) = √ e− 2σ2 , −∞ < x < ∞
σ 2π
◮ If X has this density, we denote it as X ∼ N (µ, σ 2 ).
We showed EX = µ and Var(X) = σ 2
◮ The density is a ‘bell-shaped’ curve
PS Sastry, IISc, Bangalore, 2020 17/36 PS Sastry, IISc, Bangalore, 2020 18/36
1 1 T Σ−1 (x− 1 T
e− 2 (x−µ) µ ) , x ∈ ℜn Let I = ℜn C e− 2 y M y dy
R
fX (x) = 1 n
◮
|Σ| (2π)
2 2
◮ Since M is real symmetric, there exists an orthogonal
◮ µ ∈ ℜn and Σ ∈ ℜn×n are parameters of the density and transform, L with L−1 = LT , |L| = 1 and LT M L is
Σ is symmetric and positive definite. diagonal
◮ If X1 , · · · , Xn have the above joint density, they are said ◮ Let LT M L = diag(m1 , · · · , mn ).
to be jointly Gaussian. ◮ Then for any z ∈ ℜn ,
◮ We denote this by X ∼ N (µ, Σ) X
zT LT M Lz = mi zi2
◮ We will now show that this is a joint density function. i
PS Sastry, IISc, Bangalore, 2020 21/36 PS Sastry, IISc, Bangalore, 2020 22/36
n r ℜn i=1
mi
Y 1
= C 2π
i=1
mi 1
Z
1 T My
⇒ n 1 e− 2 y dy = 1
(2π) |M −1 |
2 2 ℜn
PS Sastry, IISc, Bangalore, 2020 23/36 PS Sastry, IISc, Bangalore, 2020 24/36
◮ Consider Y with joint density
1 1 T Σ−1 y
fY (y) = n 1 e− 2 y , y ∈ ℜn
◮ We showed the following is a density (taking M −1
= Σ) (2π) |Σ|
2 2
1 − 21 yT Σ−1 y n
◮ As earlier let M = Σ−1 . Let LT M L = diag(m1 , · · · , mn )
fY (y) = e , y∈ℜ
n
(2π) 2 |Σ| 2
1
◮ Define Z = (Z1 , · · · , Zn )T = LT Y. Then Y = LZ.
◮ Recall |L| = 1, |M −1 | = (m1 · · · mn )−1
◮ Let X = Y + µ. Then ◮ Then density of Z is
1 1 T Σ−1 (x− 1 1
fX (x) = fY (x − µ) = e− 2 (x−µ) µ) 1 T T 1
mi zi2
P
n 1 fZ (z) = n 1 e− 2 z L M Lz
= n 1 1 e
−2 i
(2π) |Σ|
2 2
(2π) |M −1 |
2 2 (2π) 2 ( m1 ···m n
)2
n n z2
r r
◮ This is the multidimensional Gaussian distribution Y 1 1 1 2
Y 1 1 − 21 1i
= q e − 2 mi zi = q e mi
i=1
2π 1
i=1
2π 1
mi mi
◮ Let X = Y + µ. Then
◮ Also, since Zi = 0, ΣZ = E[ZZT ].
◮ Since Y = LZ, E[Y] = 0 and 1 1 T Σ−1 (x−
fX (x) = n 1 e− 2 (x−µ) µ)
(2π) |Σ|
2 2
ΣY = E[YYT ] = E[LZZT LT ] = LE[ZZT ]LT = L(LT M −1 L)LT = M −1
◮ We have
EX = E[Y + µ] = µ
◮ Thus, if Y has density
1 1 T Σ−1 y
ΣX = E[(X − µ)(X − µ)T ] = E[YYT ] = Σ
fY (y) = n 1 e− 2 y , y ∈ ℜn
(2π) 2 |Σ| 2
then EY = 0 and ΣY = M −1 = Σ
PS Sastry, IISc, Bangalore, 2020 27/36 PS Sastry, IISc, Bangalore, 2020 28/36
Multi-dimensional Gaussian density
◮ X = (X1 , · · · , Xn )T are said to be jointly Gaussian if ◮ Let X = (X1 , · · · , Xn )T be jointly Gaussian:
1 1 1 T Σ−1 (x−
e− 2 (x−µ) µ)
1 T Σ−1 (x−
fX (x) = 1 e− 2 (x−µ) µ) fX (x) = n 1
n
(2π) |Σ| 2 2 (2π) |Σ|2 2
◮ EX = µ and ΣX = Σ. ◮ Let Y = X − µ.
◮ Suppose Cov(Xi , Xj ) = 0, ∀i 6= j. ◮ Let M = Σ−1 and L be such that
◮ Then Σij = 0, ∀i 6= j. Let Σ = diag(σ12 , · · · , σn2 ). LT M L = diag(m1 , · · · , mn )
n
◮ Let Z = (Z1 , · · · , Zn )T = LT Y .
Pn xi −µi 2 2
1 1
xi −µi
− 21 − 12 Then we saw that Zi ∼ N (0, m1i ) and Zi are independent.
Y
fX (x) = n e i=1 σi
= √ e σi ◮
(2π) σ1 · · · σn
2
i=1
σi 2π ◮ If X1 , · · · , Xn are jointly Gaussian then there is a ‘linear’
transform that transforms them into independent random
◮ This implies Xi are independent. variables.
◮ If X1 , · · · , Xn are jointly Gaussian then uncorrelatedness
implies independence.
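◮ (Aside, not from the slides) A numerical illustration of the decorrelating transform. The slides diagonalise M = Σ⁻¹; diagonalising Σ itself uses the same orthogonal matrix L, so below L is taken from the eigendecomposition of Σ. The covariance values are arbitrary.

import numpy as np

rng = np.random.default_rng(11)
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=200_000)
eigvals, L = np.linalg.eigh(Sigma)   # Sigma = L diag(eigvals) L^T, L orthogonal
Z = X @ L                            # each row of Z is L^T x
# The covariance of Z is (approximately) diagonal, so the jointly Gaussian
# components Z_i are uncorrelated and hence independent.
print(np.cov(Z.T))
print(eigvals)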
h T i i=1 i=1
sT µ
= e E es LZ
T
h T i ◮ We derived earlier
= es µ E eu Z
T
T MX (s) = es µ MZ (u), where u = LT s
where u = L s
sT µ
= e MZ (u)
PS Sastry, IISc, Bangalore, 2020 31/36 PS Sastry, IISc, Bangalore, 2020 32/36
◮ We got ◮Let X, Y be jointly Gaussian. For simplicity let
P u2
i
EX = EY = 0.
sT µ T
MX (s) = e MZ (u); u = L s; MZ (u) = e i 2mi
◮ Let Var(X) = σ 2 , Var(Y ) = σ 2 and ρXY = ρ.
x y
⇒ Cov(X, Y ) = ρσx σy .
◮ Earlier we have shown LT M −1 L = diag( m11 , · · · , m1n )
◮ Now, the covariance matrix and its inverse are given by
where M −1 = Σ. Now we get
σx2 σy2
ρσx σy −1 1 −ρσx σy
1 X u2i 1 1 1 Σ= ; Σ = 2 2
= uT (LT M −1 L)u = sT M −1 s = sT Σs ρσx σy σy2 σx σy (1 − ρ2 ) −ρσx σy σx2
2 i mi 2 2 2
MX (s) = e µ + s Σs 2(1−ρ2 ) 2 2 σx σy
2 fXY (x, y) = p e σx σy
2πσx σy 1 − ρ2
◮ This is the moment generating function of
multi-dimensional Normal density
◮ This is the bivariate Gaussian density
PS Sastry, IISc, Bangalore, 2020 33/36 PS Sastry, IISc, Bangalore, 2020 34/36
◮ Suppose X, Y are jointly Gaussian (with the density ◮ The multi-dimensional Gaussian density has some
above) important properties.
◮ Then, all the marginals and conditionals would be ◮ If X1 , · · · , Xn are jointly Gaussian then they are
Gaussian. independent if they are uncorrelated.
◮ X ∼ N (0, σx2 ), and Y ∼ N (0, σy2 ) ◮ Suppose X1 , · · · , Xn be jointly Gaussian and have zero
◮ fX|Y (x|y) would be a Gaussian density with mean yρ σσxy means. Then there is an orthogonal transform Y = AX
and variance σx2 (1 − ρ2 ). such that Y1 , · · · , Yn are jointly Gaussian and
◮ Exercise for you – show all this starting with the joint independent.
density we have ◮ X1 , · · · , Xn are jointly Gaussian if and only if tT X is
◮ Note that X, Y are individually Gaussian does not mean Gaussian for all non-zero t ∈ ℜn .
they are jointly Gaussian (unless they are independent) ◮ We will prove this using moment generating functions
PS Sastry, IISc, Bangalore, 2020 35/36 PS Sastry, IISc, Bangalore, 2020 36/36
Recap: Multi-dimensional Gaussian density
◮ X = (X1 , · · · , Xn )T are said to be jointly Gaussian if ◮ The multi-dimensional Gaussian density has some
1 1 T Σ−1 (x−
important properties.
fX (x) = n 1 e− 2 (x−µ) µ)
◮ If X1 , · · · , Xn are jointly Gaussian then they are
(2π) |Σ|
2 2
independent if they are uncorrelated.
◮ EX = µ and ΣX = Σ. ◮ Suppose X1 , · · · , Xn be jointly Gaussian and have zero
◮ The moment generating function is given by means. Then there is an orthogonal transform Y = AX
T 1 T
such that Y1 , · · · , Yn are jointly Gaussian and
MX (s) = es µ + 2 s Σ s independent.
◮ When X, Y are jointly Gaussian, the joint density is given
◮ X1 , · · · , Xn are jointly Gaussian if and only if tT X is
by Gaussian for all non-zero t ∈ ℜn .
◮ We will prove this using moment generating functions
(y−µy )2
(x−µx )2 2ρ(x−µx )(y−µy )
1 − 1
2(1−ρ2 ) 2 + 2 − σx σy
fXY (x, y) = p e σx σy
2πσx σy 1 − ρ2
PS Sastry, IISc, Bangalore, 2020 1/32 PS Sastry, IISc, Bangalore, 2020 2/32
µw , EW = tT µX ; σw2 , Var(W ) = tT ΣX t
◮This implies
◮ The mgf of W is given by h T i T 1 2 T
uW h T i E eu t X = eu t µX + 2 u t ΣX t , ∀u ∈ ℜ, ∀t ∈ ℜn , t 6= 0
MW (u) = E e = E eu t X h T i T 1 T
Tµ 1 2 T E et X = et µX + 2 t ΣX t , ∀t
= MX (ut) = eut x + 2 u t Σx t
1 2 σ2
= euµw + 2 u w
This implies X is jointly Gaussian.
showing that W is Gaussian ◮ This is a defining property of multidimensional Gaussian
◮ Shows density of Xi is Gaussian for each i. For example, density
if we take t = (1, 0, 0, · · · , 0)T then W above would be
X1 .
PS Sastry, IISc, Bangalore, 2020 3/32 PS Sastry, IISc, Bangalore, 2020 4/32
◮ Let X = (X1 , · · · , Xn )T be jointly Gaussian.
◮ Let A be a k × n matrix with rank k.
◮ The mgf of Y is
◮ Then Y = AX is jointly Gaussian.
h T i
◮ We will once again show this using the moment MY (s) = E es Y (s ∈ ℜk )
generating function. h T i
◮ Let µx and Σx denote mean vector and covariance matrix = E es A X
of X. Similarly µy and Σy for Y = MX (AT s)
◮ We have µy = Aµx and (Recall MX (t) = et
Tµ 1 T
x + 2 t Σx t
)
1 T
sT Aµ x+ 2 s A
Σx AT s
Σy = E (Y − µy )(Y − µy )T = e
T 1 T
= E (A(X − µx ))(A(X − µx ))T = e s µy + 2 s Σ y s
= E A(X − µx )(X − µx )T AT
This shows Y is jointly Gaussian
= A E (X − µx )(X − µx )T AT = AΣx AT
PS Sastry, IISc, Bangalore, 2020 5/32 PS Sastry, IISc, Bangalore, 2020 6/32
then Y = (X1 , X2 )T
PS Sastry, IISc, Bangalore, 2020 7/32 PS Sastry, IISc, Bangalore, 2020 8/32
Jensen’s Inequality Jensen’s Inequality: Proof
◮ Let g : ℜ → ℜ be a convex function. Then
◮ We have
g(EX) ≤ E[g(X)]
g(x) ≥ g(x0 ) + λ(x0 )(x − x0 ), ∀x
◮ For example, (EX)2 ≤ E [X 2 ]
◮ Function g is convex if ◮ Take x0 = EX and x = X(ω). Then
g(αx+(1−α)y) ≤ αg(x)+(1−α)g(y), ∀x, y, ∀0 ≤ α ≤ 1
g(X(ω)) ≥ g(EX) + λ(EX)(X(ω) − EX), ∀ω
◮ If g is convex, then, given any x0 , exists λ(x0 ) such that
◮ Y (ω) ≥ Z(ω), ∀ω ⇒ Y ≥ Z ⇒ EY ≥ EZ
g(x) ≥ g(x0 ) + λ(x0 )(x − x0 ), ∀x
◮ Hence we get
Chernoff Bounds Hoeffding Inequality
PS Sastry, IISc, Bangalore, 2020 17/32 PS Sastry, IISc, Bangalore, 2020 18/32
n
X n
X
ESn = EXi = nµ; and Var(Sn ) = Var(Xi ) = nσ 2
i=1 i=1
PS Sastry, IISc, Bangalore, 2020 19/32 PS Sastry, IISc, Bangalore, 2020 20/32
Weak Law of large numbers
◮ Suppose we are tossing a (biased) coin repeatedly
Xi are iid, EXi = µ, Var(Xi ) = σ 2 , Sn = ni=1 Xi
P
◮
◮ Xi = 1 if ith toss came up head and is zero otherwise.
σ2
Sn Sn ◮ EXi = p where p is the probability of heads. Variance of
E = µ; and Var =
n n n Xi is p(1 − p)
Sn = ni=1 Xi is the number of heads in n tosses
P
◮ As n becomes large, variance of Snn becomes close to zero ◮
Sn Sn
◮
n
‘converges’ to its expectation, µ, as n → ∞ ◮
n
is the fraction of heads in n tosses.
◮ By Chebyshev Inequality ◮ We are saying Snn ‘converges’ to p
Sn
Var( Snn ) σ2 ◮ The probability of head is the limiting fraction of heads
P −µ ≥ǫ ≤ = , ∀ǫ > 0 when you toss the coin infinite times
n ǫ2 nǫ2
◮ Thus, we get Sn
lim P − p ≥ ǫ = 0, ∀ǫ > 0
n→∞ n
Sn
lim P − µ ≥ ǫ = 0, ∀ǫ > 0
n→∞ n
◮ Known as weak law of large numbers
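◮ (Aside, not from the slides) A simulation of the weak law for coin tossing: the probability that the fraction of heads deviates from p by more than ǫ shrinks as n grows. The values of p, ǫ and the sample sizes below are arbitrary.

import numpy as np

rng = np.random.default_rng(12)
p, eps = 0.3, 0.02
for n in (100, 1_000, 10_000):
    frac = (rng.random((2_000, n)) < p).mean(axis=1)   # 2000 realisations of S_n / n
    print(n, np.mean(np.abs(frac - p) >= eps))          # P[|S_n/n - p| >= eps] -> 0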
Pn ◮ Recall convergence of real number sequences.
2
◮ Xi are iid, EXi = µ, Var(Xi ) = σ , Sn = i=1 Xi ◮ A sequence of real numbers xn is said to converge to x0 ,
xn → x0 , if
σ2
Sn Sn
E = µ; and Var = ∀ǫ > 0, ∃N < ∞, s.t. |xn − x0 | ≤ ǫ, ∀n ≥ N
n n n
◮ To show a sequence converges using this definition, we
Sn
◮ As n becomes large, variance ofn
becomes close to zero need to know (or guess) the limit.
◮ Sn
We would like to say n → µ. ◮ Convergent sequences of real numbers satisfy the Cauchy
◮ We need to properly define convergence of a sequence of criterion
random variables ∀ǫ > 0, ∃N < ∞, s.t.|xn − xm | ≤ ǫ, ∀n, m ≥ N
◮ One way of looking at this convergence is ◮ Now consider defining sequence of random variables Xn
converging to X0
Sn
lim P − µ ≥ ǫ = 0, ∀ǫ > 0 ◮ These are not numbers. They are, in fact functions.
n→∞ n ◮ We know that |Xn − X0 | ≤ ǫ is an event. We can define
◮ There are other ways of defining convergence of random convergence in terms of probability of that event
variables becoming 1.
◮ Or we can look at different notions of convergence of a
PS Sastry, IISc, Bangalore, 2020 25/32
sequence of functions to a function. PS Sastry, IISc, Bangalore, 2020 26/32
Convergence in Probability
◮ A sequence of random variables, Xn , is said to converge
in probability to a random variable X0 if
◮ Consider a sequence of functions gn mapping ℜ to ℜ.
◮ We can say gn → g0 if gn (x) → g0 (x), ∀x. lim P [|Xn − X0 | > ǫ] = 0, ∀ǫ > 0
n→∞
◮ This is known as point-wise convergence
Or we can ask for |gn (x) − g0 (x)|2 dx → 0.
R P
◮ This is denoted as Xn → X0
◮ There are multiple notions of convergence that are ◮ We would mostly be considering convergence to a
reasonable for a sequence of functions. constant.
◮ Thus there would be multiple ways to define convergence ◮ By the definition of limit, the above means
of sequence of random variables.
∀δ > 0, ∃N < ∞, s.t. P [|Xn − X0 | > ǫ] < δ, ∀n > N
Recap: Multi-dimensional Gaussian density Recap
◮ X = (X1 , · · · , Xn )T are said to be jointly Gaussian if
1 1 T Σ−1 (x−
fX (x) = n 1 e− 2 (x−µ) µ)
◮ If X1 , · · · , Xn are jointly Gaussian then they are
(2π) |Σ|
2 2
independent if they are uncorrelated.
◮ EX = µ and ΣX = Σ. ◮ When X1 , · · · , Xn be jointly Gaussian (with zero means),
◮ The moment generating function is given by there is an orthogonal transform Y = AX such that
Y1 , · · · , Yn are jointly Gaussian and independent.
T 1 T
MX (s) = es µ + 2 s Σ s ◮ X1 , · · · , Xn are jointly Gaussian if and only if tT X is
Gaussian for all non-zero t ∈ ℜn .
◮ When X, Y are jointly Gaussian, the joint density is given
by
◮ If X1 , · · · , Xn are jointly Gaussian and A is a k × n
matrix of rank k, then, Y = AX is jointly gaussian
(y−µy )2
(x−µx )2 2ρ(x−µx )(y−µy )
1 − 1
2 2 + 2 − σx σy
fXY (x, y) = p e 2(1−ρ ) σx σy
2πσx σy 1 − ρ2
PS Sastry, IISc, Bangalore, 2020 1/34 PS Sastry, IISc, Bangalore, 2020 2/34
1 1 1
(E|X + Y |r ) r ≤ (E|X|r ) r + (E|Y |r ) r
PS Sastry, IISc, Bangalore, 2020 3/34 PS Sastry, IISc, Bangalore, 2020 4/34
Recap: Weak Law of large numbers Recap: Convergence in Probability
◮ By Chebyshev Inequality P
This is denoted as Xn → X0
Sn
Var( Snn ) σ2 ◮ By the definition of limit, the above means
P −µ ≥ǫ ≤ = 2 , ∀ǫ > 0
n ǫ2 nǫ
∀δ > 0, ∃N < ∞, s.t. P [|Xn − X0 | > ǫ] < δ, ∀n > N
Sn ◮ We only need marginal distributions of individual Xn to
⇒ lim P − µ ≥ ǫ = 0, ∀ǫ > 0
n→∞ n decide whether a sequence converges to a constant in
probability
◮ Recall convergence of real number sequences.
◮ A sequence of real numbers xn is said to converge to x0, xn → x0, if
  ∀ǫ > 0, ∃N < ∞, s.t. |xn − x0| < ǫ, ∀n ≥ N
◮ Note that given any ω, Xn(ω) is a real number sequence.
◮ Hence Xn(ω) → X(ω) is the same as
  ∀ǫ > 0, ∃N < ∞, s.t. |X_{N+k}(ω) − X(ω)| < ǫ, ∀k ≥ 0
◮ This is equivalent to
  for every integer r > 0, ∃N < ∞, s.t. |X_{N+k}(ω) − X(ω)| < 1/r, ∀k ≥ 0
◮ A sequence Xn is said to converge almost surely or with probability one to X if
  P({ω : Xn(ω) → X(ω)}) = 1
◮ We can also write it as
  P[Xn → X] = 1
◮ The event {ω : Xn(ω) ↛ X(ω)} can be expressed as
  ∪_{r=1}^∞ ∩_{N=1}^∞ ∪_{k=0}^∞ [ |X_{N+k} − X| ≥ 1/r ]
◮ Hence Xn converges almost surely to X iff
  P( ∪_{r=1}^∞ ∩_{N=1}^∞ ∪_{k=0}^∞ [ |X_{N+k} − X| ≥ 1/r ] ) = 0
◮ This is the same as
  P( ∩_{N=1}^∞ ∪_{k=0}^∞ [ |X_{N+k} − X| ≥ 1/r ] ) = 0, for every integer r > 0
◮ Same as
  P( ∩_{N=1}^∞ ∪_{k=0}^∞ [ |X_{N+k} − X| ≥ ǫ ] ) = 0, ∀ǫ > 0
◮ Same as
  P( ∩_{N=1}^∞ ∪_{k=N}^∞ [ |Xk − X| ≥ ǫ ] ) = 0, ∀ǫ > 0
◮ Let Ak = [ |Xk − X| ≥ ǫ ]
◮ Let B_N = ∪_{k=N}^∞ Ak.
◮ Then B_{N+1} ⊂ B_N and hence B_N ↓. Hence, lim B_N = ∩_{N=1}^∞ B_N.
◮ We saw that Xn →a.s. X is the same as
  P( ∩_{N=1}^∞ ∪_{k=N}^∞ [ |Xk − X| ≥ ǫ ] ) = 0, ∀ǫ > 0
  ⇔ P( lim_{N→∞} ∪_{k=N}^∞ [ |Xk − X| ≥ ǫ ] ) = 0, ∀ǫ > 0
  ⇔ lim_{N→∞} P( ∪_{k=N}^∞ [ |Xk − X| ≥ ǫ ] ) = 0, ∀ǫ > 0
◮ Thus, Xn converges to X almost surely iff
  lim_{n→∞} P( ∪_{k=n}^∞ [ |Xk − X| ≥ ǫ ] ) = 0, ∀ǫ > 0
◮ To show convergence with probability one using this, one needs to know the joint distribution of Xn, X_{n+1}, · · ·
◮ Contrast this with Xn →P X, which is
  lim_{n→∞} P[ |Xn − X| > ǫ ] = 0, ∀ǫ > 0
◮ This also shows that
  Xn →a.s. X ⇒ Xn →P X
◮ Almost sure convergence is a stronger mode of convergence
◮ Example: on Ω = (0, 1] (with the uniform probability), let
  Xn(ω) = 1 if ω ≤ 1/n, and Xn(ω) = 0 otherwise
◮ Since Xn →P 0, zero is the only candidate for the limit
◮ Xn(ω) = 1 only when n ≤ 1/ω.
◮ Given any ω, for all n > 1/ω, Xn(ω) = 0
◮ Hence, {ω : Xn(ω) → 0} = (0, 1], and
  P[Xn → 0] = P({ω : Xn(ω) → 0}) = P((0, 1]) = 1
◮ Hence Xn →a.s. 0
◮ In general we do not specify Xn as functions over Ω; we are only given the distributions
◮ How do we then establish convergence almost surely? We use the characterization
  P( ∩_{N=1}^∞ ∪_{k=N}^∞ [ |Xk − X| ≥ ǫ ] ) = 0, ∀ǫ > 0
  ⇔ lim_{n→∞} P( ∪_{k=n}^∞ [ |Xk − X| ≥ ǫ ] ) = 0, ∀ǫ > 0
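A quick numerical illustration of this example (a sketch, not part of the slides): pick a few ω in (0, 1] and print the start of the sequence Xn(ω), which becomes identically 0 once n > 1/ω.

    import numpy as np

    def X(n, omega):
        # X_n(omega) = 1 if omega <= 1/n, else 0
        return 1 if omega <= 1.0 / n else 0

    rng = np.random.default_rng(1)
    for omega in rng.uniform(0.0, 1.0, size=3):
        seq = [X(n, omega) for n in range(1, 21)]
        # For every omega > 0, X_n(omega) = 0 for all n > 1/omega, so X_n(omega) -> 0
        print(f"omega={omega:.3f}  first 20 terms: {seq}")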
◮ Let A1, A2, · · · be a sequence of events.
◮ How do we define the limit of this sequence?
◮ Define the sequences
  Bn = ∪_{k=n}^∞ Ak,  Cn = ∩_{k=n}^∞ Ak
◮ These are monotone: Bn ↓, Cn ↑. Hence they have limits.
◮ Define
  lim sup An ≜ lim Bn = ∩_{n=1}^∞ ∪_{k=n}^∞ Ak
  lim inf An ≜ lim Cn = ∪_{n=1}^∞ ∩_{k=n}^∞ Ak
◮ If lim sup An = lim inf An then we define that as lim An. Otherwise we say the sequence does not have a limit.
◮ Note that lim sup An and lim inf An are events.
◮ Note that Xn →a.s. X iff
  P( ∩_{N=1}^∞ ∪_{k=N}^∞ [ |Xk − X| ≥ ǫ ] ) = 0, ∀ǫ > 0
◮ We can show lim inf An ⊂ lim sup An:
  ω ∈ lim inf An ⇒ ω ∈ ∪_{n=1}^∞ ∩_{k=n}^∞ Ak
  ⇒ ∃m, ω ∈ Ak, ∀k ≥ m
  ⇒ ω ∈ ∪_{j=n}^∞ Aj, ∀n
  ⇒ ω ∈ ∩_{n=1}^∞ ∪_{j=n}^∞ Aj
  ⇒ ω ∈ lim sup An
◮ We can characterize lim inf An as follows:
  ω ∈ lim inf An ⇒ ω ∈ ∪_{n=1}^∞ ∩_{k=n}^∞ Ak
  ⇒ ∃m, ω ∈ Ak, ∀k ≥ m
  ⇒ ω belongs to all but finitely many of the An
  Thus, lim inf An consists of all points that are in all but finitely many An.
◮ Similarly,
  ω ∈ lim sup An ⇒ ω ∈ ∩_{n=1}^∞ ∪_{k=n}^∞ Ak
  ⇒ ω ∈ ∪_{k=n}^∞ Ak, ∀n
  ⇒ ω belongs to infinitely many of the An
  Thus lim sup An consists of points that are in infinitely many An.
  One refers to lim sup An also as ‘An infinitely often’ or ‘An i.o.’
◮ What is the difference between points that belong to all but finitely many An and points that belong to infinitely many An?
◮ There can be ω that are in infinitely many of the An and are also not in infinitely many of the An.
Example
◮ Consider the following sequence of sets: A, B, A, B, · · ·
◮ Recall
  lim sup An = ∩_{n=1}^∞ ∪_{k=n}^∞ Ak,  lim inf An = ∪_{n=1}^∞ ∩_{k=n}^∞ Ak
◮ ∪_{k=n}^∞ Ak = A ∪ B, ∀n ⇒ lim sup An = A ∪ B
◮ ∩_{k=n}^∞ Ak = A ∩ B, ∀n ⇒ lim inf An = A ∩ B

Example
◮ Consider the sets An = [0, 1 + (−1)ⁿ/n). The sequence is
  [0, 0), [0, 1 + 1/2), [0, 1 − 1/3), [0, 1 + 1/4), · · ·
◮ Guess: lim sup An = [0, 1] and lim inf An = [0, 1)
◮ First note that [0, 1 + 1/(n+1)) ⊂ ∪_{k=n}^∞ Ak ⊂ [0, 1 + 1/n). Hence
  x ∈ [0, 1] ⇒ x ∈ ∪_{k=n}^∞ Ak, ∀n ⇒ x ∈ ∩_{n=1}^∞ ∪_{k=n}^∞ Ak ⇒ x ∈ lim sup An
◮ For odd n, 1 ∉ An = [0, 1 − 1/n); hence 1 ∉ lim inf An
◮ This proves lim inf An = [0, 1)
◮ Since lim sup An ≠ lim inf An, this sequence does not have a limit
◮ The question now is: can we get the probability of lim sup An?
◮ We look at an important result that allows us to do this
Borel-Cantelli Lemma
◮ Borel-Cantelli lemma: Given a sequence of events A1, A2, · · ·
  1. If Σ_{i=1}^∞ P(Ai) < ∞, then P(lim sup An) = 0
  2. If Σ_{i=1}^∞ P(Ai) = ∞ and the Ai are independent, then P(lim sup An) = 1
Proof:
◮ We will first show: P(∪_{i=n}^∞ Ai) ≤ Σ_{i=n}^∞ P(Ai), ∀n
◮ We have the result: P(∪_{i=n}^N Ai) ≤ Σ_{i=n}^N P(Ai), n ≤ N
◮ For any n, let B_N = ∪_{i=n}^N Ai. Then B_N ⊂ B_{N+1}.
◮ lim_{N→∞} B_N = ∪_{k=n}^∞ Ak. Hence
  P(∪_{i=n}^∞ Ai) = P( lim_{N→∞} ∪_{i=n}^N Ai ) = lim_{N→∞} P(∪_{i=n}^N Ai)
  ≤ lim_{N→∞} Σ_{i=n}^N P(Ai) = Σ_{i=n}^∞ P(Ai)
◮ If Σ_{k=1}^∞ P(Ak) < ∞, then lim_{n→∞} Σ_{k=n}^∞ P(Ak) = 0. Hence
  0 ≤ P(lim sup An) = P( ∩_{n=1}^∞ ∪_{k=n}^∞ Ak )
  = P( lim_{n→∞} ∪_{k=n}^∞ Ak )
  = lim_{n→∞} P( ∪_{k=n}^∞ Ak )
  ≤ lim_{n→∞} Σ_{k=n}^∞ P(Ak)
  = 0, if Σ_{k=1}^∞ P(Ak) < ∞
◮ This completes the proof of the first part of the Borel-Cantelli lemma
◮ We can compute that limit as follows:
  lim_{n→∞} Π_{k=n}^∞ (1 − P(Ak)) ≤ lim_{n→∞} Π_{k=n}^∞ e^{−P(Ak)}, since 1 − x ≤ e^{−x}
  = lim_{n→∞} e^{−Σ_{k=n}^∞ P(Ak)}
  = 0
  because Σ_{k=1}^∞ P(Ak) = ∞ ⇒ lim_{n→∞} Σ_{k=n}^∞ P(Ak) = ∞
◮ This finally gives us
  P(lim sup An) = 1 − lim_{n→∞} Π_{k=n}^∞ (1 − P(Ak)) = 1

◮ Given a sequence Xn we want to know whether it converges to X
◮ Let Aǫ_k = [ |Xk − X| ≥ ǫ ]
◮ Xn →P X if
  lim_{k→∞} P[ |Xk − X| ≥ ǫ ] = 0, i.e., lim_{k→∞} P(Aǫ_k) = 0, ∀ǫ > 0
◮ By the Borel-Cantelli lemma,
  Σ_{k=1}^∞ P(Aǫ_k) < ∞, ∀ǫ > 0 ⇒ P(lim sup Aǫ_k) = 0, ∀ǫ > 0 ⇒ Xk →a.s. X
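The first part of the Borel-Cantelli lemma can be illustrated numerically. Below is a small Python sketch (my own illustration, not from the slides): the events Ak are taken to be independent with P(Ak) = 1/k², a summable choice, and we check that only finitely many of them occur in a simulation.

    import numpy as np

    rng = np.random.default_rng(2)
    K = 100000                           # number of events simulated
    p = 1.0 / np.arange(1, K + 1)**2     # P(A_k) = 1/k^2, which is summable

    occurred = rng.random(K) < p         # simulate independent events A_1, ..., A_K
    last = np.max(np.nonzero(occurred)[0]) + 1 if occurred.any() else 0
    print("number of A_k that occurred :", occurred.sum())
    print("largest k with A_k occurring:", last)
    # With sum P(A_k) < infinity, only finitely many A_k occur (first part of the lemma)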
Recap: Almost Sure Convergence
◮ A sequence of random variables, Xn, is said to converge almost surely or with probability one to X if
  P({ω : Xn(ω) → X(ω)}) = 1
  or equivalently P[Xn → X] = 1

Recap
◮ The sequence Xn converges to X almost surely iff
  P( ∩_{N=1}^∞ ∪_{k=0}^∞ [ |X_{N+k} − X| ≥ ǫ ] ) = 0, ∀ǫ > 0
  which is the same as
  P( ∩_{N=1}^∞ ∪_{k=N}^∞ [ |Xk − X| ≥ ǫ ] ) = 0, ∀ǫ > 0
◮ Xn →a.s. X ⇒ Xn →P X
◮ Almost sure convergence is a stronger mode of convergence
Recall: Borel-Cantelli Lemma
◮ Given a sequence Xn we want to know whether it converges to X
◮ Let Aǫ_k = [ |Xk − X| ≥ ǫ ]
◮ Xn →a.s. X if
  Σ_{k=1}^∞ P(Aǫ_k) < ∞, ∀ǫ > 0
◮ If the Aǫ_k are independent,
  Σ_{k=1}^∞ P(Aǫ_k) = ∞ ⇒ P(lim sup Aǫ_k) = 1 ⇒ Xk does not converge to X almost surely
◮ Let us assume the Xi have a finite fourth moment
◮ We have
  ( Σ_{i=1}^n (Xi − µ) )⁴ = Σ_{i=1}^n (Xi − µ)⁴ + (4!/(2! 2!)) Σ_i Σ_{j>i} (Xi − µ)²(Xj − µ)² + T
  where T represents a number of terms such that every term in it contains a factor like (Xi − µ)
◮ Note that E[(Xi − µ)(Xj − µ)³] = 0 etc., because the Xi are independent.
◮ Hence we get
  E[ ( Σ_{i=1}^n (Xi − µ) )⁴ ] = n E[(Xi − µ)⁴] + 3n(n − 1)σ⁴ ≤ C′n²
◮ Now we can get, using the Markov inequality,
  P[ |Sn/n − µ| > ǫ ] = P[ |Sn − nµ| > nǫ ] = P[ | Σ_{i=1}^n (Xi − µ) | > nǫ ]
  ≤ E[ ( Σ_{i=1}^n (Xi − µ) )⁴ ] / (nǫ)⁴
  ≤ C′n² / (n⁴ǫ⁴) = C/n²
◮ Since Σ_n C/n² < ∞, we get Sn/n →a.s. µ
◮ Suppose Xn →r X. Then, by the Markov inequality,
  P[ |Xn − X| > ǫ ] ≤ E[ |Xn − X|^r ] / ǫ^r → 0
◮ Hence
  Xn →r X ⇒ Xn →P X
◮ In general, neither of convergence almost surely and convergence in r-th mean implies the other.
◮ We can generate counter examples for this easily.
◮ However, if all Xn take values in a bounded interval, then almost sure convergence implies r-th mean convergence
◮ Example: let the Xn be independent with
  P[Xn = 0] = 1 − an;  P[Xn = cn] = an
◮ Assume an → 0 so that Xn →P 0
◮ By the Borel-Cantelli lemma,
  Xn →a.s. 0 ⇔ Σ_n an < ∞
◮ For convergence in r-th mean we need
  E[ |Xn − 0|^r ] = (cn)^r an → 0
◮ Take an = 1/n and cn = 1. Then Xn →r 0 but the sequence does not converge almost surely.
◮ Take an = 1/n² and cn = eⁿ. Then Xn →a.s. 0 but the sequence does not converge in r-th mean for any r.
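The two parameter choices above can be checked with a few lines of Python (a sketch of the arithmetic only, under the stated assumptions on an and cn):

    import numpy as np

    n = np.arange(1, 21, dtype=float)
    r = 2  # any fixed r > 0

    # Case 1: a_n = 1/n, c_n = 1  (r-th mean convergence, not almost sure)
    a1, c1 = 1.0 / n, np.ones_like(n)
    # Case 2: a_n = 1/n^2, c_n = e^n  (almost sure convergence, not r-th mean)
    a2, c2 = 1.0 / n**2, np.exp(n)

    # E|X_n|^r = c_n^r a_n; sum of a_n decides almost sure convergence (Borel-Cantelli)
    print("case 1: E|X_n|^r at n=20 =", (c1**r * a1)[-1], "; partial sum of a_n =", a1.sum())
    print("case 2: E|X_n|^r at n=20 =", (c2**r * a2)[-1], "; partial sum of a_n =", a2.sum())

In case 1 the r-th moments vanish while Σ an diverges; in case 2 the partial sums of an stay bounded while the r-th moments blow up.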
◮ Let Xn →r X. Then
  1. E[|Xn|^r] → E[|X|^r]
  2. Xn →s X, ∀s < r
◮ The proofs are straightforward but we omit them

Convergence in distribution
◮ Let Fn be the df of Xn, n = 1, 2, · · · . Let X be a rv with df F.
◮ The sequence Xn is said to converge to X in distribution if
  Fn(x) → F(x), ∀x where F is continuous
◮ We denote this as
  Xn →d X, or Xn →L X, or Fn →w F
◮ This is also known as convergence in law or weak convergence
◮ Note that here we are essentially talking about convergence of distribution functions.
◮ Convergence in probability implies convergence in distribution
◮ The converse is not true. (e.g., a sequence of iid random variables)
Examples
◮ Let X1, X2, · · · be iid, uniform over (0, 1)
◮ Nn = min(X1, · · · , Xn), Yn = nNn. Does Yn converge in distribution?
  P[Nn > a] = (P[Xi > a])ⁿ = (1 − a)ⁿ, 0 < a < 1

Examples
◮ Let {Xn} be iid with density f(x) = e^{−x+θ}, x > θ > 0.
◮ Let Nn = min(X1, · · · , Xn). Does Nn converge in probability?
◮ Guess for the limit: θ

◮ Recall: Xn →r X iff E[|Xn − X|^r] → 0 as n → ∞
◮ Xn →a.s. X iff P[Xn → X] = 1, or P( lim sup [ |Xn − X| > ǫ ] ) = 0, ∀ǫ > 0
◮ By Chebyshev, P[ |Xn − mn| > ǫ ] ≤ σn²/ǫ². Hence a sufficient condition (for convergence in probability) is σn² → 0.
◮ What is a sufficient condition for convergence almost surely?
◮ We have the following relations among the different modes of convergence:
  Xn →r X ⇒ Xn →P X ⇒ Xn →d X
  Xn →a.s. X ⇒ Xn →P X ⇒ Xn →d X
◮ All the implications are one-way and we have seen counter examples
◮ In general, almost sure convergence does not imply convergence in r-th mean and vice versa

◮ Strong and weak laws of large numbers are very useful examples of convergence of sequences of random variables.
◮ Given Xi iid, EXi = µ, Var(Xi) = σ², Sn = Σ_{i=1}^n Xi
◮ Weak law of large numbers: Sn/n →P µ
◮ Strong law of large numbers: Sn/n →a.s. µ
◮ Another useful result is the Central Limit Theorem (CLT)
◮ CLT is about (normalized) sums of independent random variables converging to the Gaussian distribution
Central Limit Theorem
◮ The Central Limit Theorem states: S̃n →d N(0, 1)

Characteristic Function
◮ Since |e^{iux}| ≤ 1, φX exists for all random variables
◮ We use characteristic functions for proving the CLT
◮ Central Limit Theorem:
  lim_{n→∞} P[ S̃n ≤ x ] = lim_{n→∞} P[ (Sn − nµ)/(σ√n) ≤ x ] = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt, ∀x
Proof:
◮ Without loss of generality let us assume µ = 0 (i.e., here we take EXi = 0 and EXi² = σ²).
◮ We use the characteristic function of S̃n for the proof.
◮ Let φ be the characteristic function of Xi. Then
  φ_{Sn}(t) = (φ(t))ⁿ and φ_{S̃n}(t) = ( φ( t/(σ√n) ) )ⁿ
◮ Expanding φ around 0,
  φ(t) = 1 + 0 − (1/2) ρ(t) σ² t²,  where ρ(t) → 1 as t → 0
◮ Hence
  φ( t/(σ√n) ) = 1 − (1/2)(t²/n) ρ( t/(σ√n) )
  = 1 − (1/2)(t²/n) + (1/2)(t²/n)( 1 − ρ( t/(σ√n) ) )
  = 1 − (1/2)(t²/n) + o(1/n)
◮ Hence we get
  lim_{n→∞} φ_{S̃n}(t) = lim_{n→∞} ( φ( t/(σ√n) ) )ⁿ
  = lim_{n→∞} ( 1 − (1/2)(t²/n) + o(1/n) )ⁿ
  = e^{−t²/2}
  which is the characteristic function of the standard normal
◮ By the continuity theorem, the distribution function of S̃n converges to that of a standard Normal rv:
  lim_{n→∞} P[ S̃n ≤ x ] = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt, ∀x

◮ What the CLT says is that sums of iid random variables, when appropriately normalized, always approach the Gaussian distribution.
◮ It allows one to approximate the distribution of sums of independent rv's
◮ Let Xi be iid and Sn = Σ_{i=1}^n Xi. Then
  P[Sn ≤ x] = P[ (Sn − nµ)/(σ√n) ≤ (x − nµ)/(σ√n) ] ≈ Φ( (x − nµ)/(σ√n) )
◮ Thus, Sn is well approximated by a normal rv with mean nµ and variance nσ², if n is large
Example
◮ Twenty numbers are rounded off to the nearest integer and added. What is the probability that the sum obtained differs from the true sum by more than 3?
◮ A reasonable assumption is that the round-off errors are independent and uniform over [−0.5, 0.5]
◮ Take Z = Σ_{i=1}^{20} Xi, where the Xi ~ U[−0.5, 0.5] are iid.
◮ Then Z represents the error in the sum.
◮ EXi = 0 and Var(Xi) = 1/12.
◮ Hence, EZ = 0 and Var(Z) = 20/12 = 5/3
◮ By the CLT,
  P[|Z| ≤ 3] = P[−3 ≤ Z ≤ 3]
  = P[ −3/√(5/3) ≤ (Z − EZ)/√Var(Z) ≤ 3/√(5/3) ]
  ≈ Φ( 3/√(5/3) ) − Φ( −3/√(5/3) )
  ≈ Φ(2.3) − Φ(−2.3)
  = 0.9893 − 0.0107 ≈ 0.98
◮ So the probability that the sum differs from the true sum by more than 3 is about 1 − 0.98 = 0.02
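The normal approximation in this example is easy to check by Monte Carlo. A minimal Python sketch (my own check, not from the slides):

    import numpy as np

    rng = np.random.default_rng(3)
    trials = 200000
    # Each trial: 20 iid round-off errors, uniform on [-0.5, 0.5], summed
    Z = rng.uniform(-0.5, 0.5, size=(trials, 20)).sum(axis=1)
    print("Monte Carlo estimate of P[|Z| <= 3]:", np.mean(np.abs(Z) <= 3.0))
    print("CLT approximation                  : about 0.98")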
◮ CLT allows one to get the rate of convergence in the law of large numbers
◮ Let Xi be iid, EXi = µ, Var(Xi) = σ², Sn = Σ_{i=1}^n Xi

Example: Opinion Polls
◮ Suppose c = 1.96σ/√n. Then
  P[ |Sn/n − µ| > 1.96σ/√n ] ≈ 2(1 − Φ(1.96)) = 0.05
◮ Denoting X̄ = Sn/n, the 95% confidence interval is
  [ X̄ − 1.96σ/√n, X̄ + 1.96σ/√n ]
◮ One generally uses an estimate for σ obtained from the Xi
◮ In analyzing any experimental data the confidence intervals or the variance term is important

◮ CLT essentially states that a sum of many independent random variables behaves like a Gaussian random variable
◮ It is very useful in many statistics applications.
◮ We stated the CLT for iid random variables.
◮ While independence is important, all the rv's need not have the same distribution.
◮ Essentially, the variances should not die out.
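A short Python sketch of the confidence-interval recipe above (the data distribution, sample size and true mean are illustrative assumptions; σ is estimated from the data as the slides suggest):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 500
    x = rng.normal(loc=2.0, scale=1.5, size=n)   # assumed data with unknown mean

    xbar = x.mean()
    sigma_hat = x.std(ddof=1)                    # estimate of sigma from the X_i
    half_width = 1.96 * sigma_hat / np.sqrt(n)
    print(f"95% confidence interval for the mean: "
          f"[{xbar - half_width:.3f}, {xbar + half_width:.3f}]")

Roughly 95% of such intervals, over repeated experiments, would cover the true mean.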
Markov Chains
◮ We have been considering sequences Xn, n = 1, 2, · · ·
◮ We have so far considered only the asymptotic properties or limits of such sequences.
◮ Any such sequence is an example of what is called a random process or stochastic process
◮ Given n rv's, they are completely characterized by their joint distribution.
◮ How do we specify or characterize an infinite collection of random variables?
◮ We need the joint distribution of every finite subcollection of them.

◮ Let Xn, n = 0, 1, · · · be a sequence of discrete random variables taking values in S. Note that S would be countable.
◮ We say it is a Markov chain if
  P[X_{n+1} = x_{n+1} | Xn = xn, X_{n−1} = x_{n−1}, · · · , X0 = x0] = P[X_{n+1} = x_{n+1} | Xn = xn], ∀xi
◮ We can write it as
  P[X_{n+1} = x_{n+1} | Xn = xn, X_{n−1}, · · · , X0] = P[X_{n+1} = x_{n+1} | Xn = xn], ∀xi
◮ Conditioned on Xn, X_{n+1} is independent of X_{n−1}, X_{n−2}, · · ·
◮ We think of Xn as the state at time n
◮ For a Markov chain, given the current state, the future evolution is independent of the history of how you reached the current state
Example
◮ Let the Xi be iid discrete rv's taking integer values. Let Y0 = 0 and Yn = Σ_{i=1}^n Xi
◮ Yn, n = 0, 1, · · · is a Markov chain with state space the integers
◮ Note that Y_{n+1} = Yn + X_{n+1} and X_{n+1} is independent of Y0, · · · , Yn:
  P[Y_{n+1} = y | Yn = x, Y_{n−1}, · · · ] = P[X_{n+1} = y − x]
◮ Thus, Y_{n+1} is conditionally independent of Y_{n−1}, · · · conditioned on Yn
◮ In this example, we can think of Xn as the number of people or things arriving at a facility in the n-th time interval.
◮ Then Yn would be the total arrivals till the end of the n-th time interval.
◮ Number of packets coming into a network switch, number of people joining the queue in a bank, number of infections till date are all Markov chains of this kind.
◮ This is a useful model for many dynamic systems or processes
Transition Probabilities
◮ The Markov property is: given the current state, the future evolution is independent of the history of how we came to the current state.
◮ It essentially means the current state contains all needed information about the history
◮ We are considering the case where states as well as time are discrete.
◮ It can be more general and we will discuss some of the generalizations later
◮ Let {Xn, n = 0, 1, · · · } be a Markov Chain with (countable) state space S:
  Pr[X_{n+1} = x_{n+1} | Xn = xn, X_{n−1}, · · · , X0] = Pr[X_{n+1} = x_{n+1} | Xn = xn], ∀xi
  (Notice the change of notation)
◮ Define the function P : S × S → [0, 1] by
  P(x, y) = Pr[X_{n+1} = y | Xn = x]
◮ P is called the state transition probability function. It satisfies
  P(x, y) ≥ 0, ∀x, y ∈ S and Σ_{y∈S} P(x, y) = 1, ∀x ∈ S
◮ In general, P(x, y) = Pr[X_{n+1} = y | Xn = x] can depend on n though our notation does not show it
◮ If the value of that probability does not depend on n then the chain is called homogeneous
◮ For a homogeneous chain we have Pr[X_{n+1} = y | Xn = x] = Pr[X1 = y | X0 = x], ∀n
◮ Let {Xn} be a Markov Chain with state space S. Define the function π0 : S → [0, 1] by
  π0(x) = Pr[X0 = x]
◮ It is the pmf of the rv X0
◮ Hence it satisfies
  π0(x) ≥ 0, ∀x ∈ S and Σ_{x∈S} π0(x) = 1
◮ Let Xn be a (homogeneous) Markov chain
◮ Then we have
  Pr[X0 = x0, X1 = x1] = Pr[X1 = x1 | X0 = x0] Pr[X0 = x0] = π0(x0)P(x0, x1), ∀x0, x1
◮ Now we can extend this as
  Pr[X0 = x0, X1 = x1, X2 = x2] = Pr[X2 = x2 | X1 = x1, X0 = x0] · Pr[X0 = x0, X1 = x1]
  = Pr[X2 = x2 | X1 = x1] · Pr[X0 = x0, X1 = x1]
  = P(x1, x2) P(x0, x1) π0(x0)
  = π0(x0) P(x0, x1) P(x1, x2)
◮ This calculation is easily generalized to any number of time steps:
  Pr[X0 = x0, · · · , Xn = xn] = Pr[Xn = xn | X_{n−1} = x_{n−1}, · · · , X0 = x0] · Pr[X_{n−1} = x_{n−1}, · · · , X0 = x0]
  = Pr[Xn = xn | X_{n−1} = x_{n−1}] · Pr[X_{n−1} = x_{n−1}, · · · , X0 = x0]
  = P(x_{n−1}, xn) Pr[X_{n−1} = x_{n−1}, · · · , X0 = x0]
  = · · ·
  = π0(x0) P(x0, x1) · · · P(x_{n−1}, xn)
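This joint-probability formula is easy to verify by simulation. A minimal Python sketch (the 3-state transition matrix and initial distribution below are illustrative assumptions; the matrix matches a later example in these notes):

    import numpy as np

    rng = np.random.default_rng(5)
    P = np.array([[0.75, 0.25, 0.0],
                  [0.5,  0.0,  0.5],
                  [0.0,  0.75, 0.25]])   # assumed transition matrix
    pi0 = np.array([1.0, 0.0, 0.0])      # assumed initial distribution

    def simulate(n_steps):
        x = rng.choice(3, p=pi0)
        path = [x]
        for _ in range(n_steps):
            x = rng.choice(3, p=P[x])
            path.append(x)
        return path

    # Estimate Pr[X0=0, X1=1, X2=2] and compare with pi0(0) P(0,1) P(1,2)
    trials = 100000
    hits = sum(simulate(2) == [0, 1, 2] for _ in range(trials))
    print("simulated:", hits / trials)
    print("formula  :", pi0[0] * P[0, 1] * P[1, 2])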
Recap: Central Limit Theorem
◮ Given Xi iid, EXi = µ, Var(Xi) = σ², Sn = Σ_{i=1}^n Xi
◮ Let S̃n = (Sn − ESn)/√Var(Sn) = (Sn − nµ)/(σ√n)

Recap
◮ We showed
  Pr[X0 = x0, · · · , Xn = xn] = π0(x0)P(x0, x1) · · · P(x_{n−1}, xn)

Recap: Initial State Probabilities
◮ P and π0 determine all the joint distributions
◮ Consider the 2-state chain with S = {0, 1} and
  P = [ 1−p    p  ]
      [  q    1−q ]
◮ We can represent the chain through a graph: the nodes represent the states; the edges show the possible transitions and their probabilities (state 0 has a self-loop with probability 1−p and an edge to 1 with probability p; state 1 has a self-loop with probability 1−q and an edge to 0 with probability q)
◮ We can calculate probabilities of any events involving these random variables, e.g.,
  Pr[X2 ≠ X0] = Pr[X2 = 0, X0 = 1] + Pr[X2 = 1, X0 = 0]
  = Σ_{x=0}^{1} ( π0(1)P(1, x)P(x, 0) + π0(0)P(0, x)P(x, 1) )
◮ We have the formula
  Pr[X0 = x0, · · · , Xn = xn] = π0(x0)P(x0, x1) · · · P(x_{n−1}, xn)
◮ This can easily be seen through the graphical notation, e.g.,
  Pr[X0 = 0, X1 = 1, X2 = 1, X3 = 0] = π0(0) p (1 − q) q
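A small Python sketch of the Pr[X2 ≠ X0] computation for this 2-state chain (the numerical values of p, q and π0 are illustrative assumptions):

    import numpy as np

    p, q = 0.3, 0.4                 # assumed transition parameters
    pi0 = np.array([0.6, 0.4])      # assumed initial distribution
    P = np.array([[1 - p, p],
                  [q, 1 - q]])

    # Pr[X2 != X0] = sum_x ( pi0(1) P(1,x) P(x,0) + pi0(0) P(0,x) P(x,1) )
    prob = sum(pi0[1] * P[1, x] * P[x, 0] + pi0[0] * P[0, x] * P[x, 1] for x in range(2))
    print("Pr[X2 != X0] =", prob)

    # Cross-check with the two-step transition matrix P^2
    P2 = P @ P
    print("check        =", pi0[1] * P2[1, 0] + pi0[0] * P2[0, 1])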
◮ Example: the chain on states {0, 1, 2, 3, 4} with

  P =    0      1      2      3      4
   0     0      0      0      0      1
   1     0      0      0     1−p     p
   2     0      0     1−p     p      0
   3     0     1−p     p      0      0
   4    1−p     p      0      0      0

◮ This chain keeps visiting all the states again and again
◮ In general, birth-death chains may have self-loops on states
◮ Random walk: Xi ∈ {−1, +1}, iid, Sn = Σ_{i=1}^n Xi
◮ We can have a ‘reflecting boundary’ at 0
◮ Queuing chains can also be birth-death chains

Gambler's Ruin chain
◮ Here, the chain is ultimately absorbed either in 0 or in N
◮ Here the state can be the current funds that the gambler has
P r[Xm+n+r ∈ Br , r = 0, · · · , s | Xm = x, Xm−k ∈ Ak , k = 1, · · · , m]
= P r[Xm+n+r ∈ Br , r = 0, · · · , s | Xm = x]
Chapman-Kolmogorov Equations
◮ Define πn(x) = Pr[Xn = x].
◮ Then we get
  πn(y) = Σ_x Pr[Xn = y | X0 = x] Pr[X0 = x] = Σ_x π0(x) P^n(x, y)
◮ In particular,
  π_{n+1}(y) = Σ_x Pr[X_{n+1} = y | Xn = x] Pr[Xn = x] = Σ_x πn(x) P(x, y)

Hitting times
◮ Let y be a state.
◮ We define the hitting time for y as the random variable
  Ty = min{n > 0 : Xn = y}
◮ Ty is the first time that the chain is in state y (after t = 0 when the chain is initiated).
◮ It is easy to see that Pr[Ty = 1 | X0 = x] = P(x, y).
◮ We often need conditional probabilities conditioned on the initial state.
◮ Notation: Pz(A) = Pr[A | X0 = z]
◮ We write the above as Px(Ty = 1) = P(x, y)
Transient and recurrent states
◮ Define ρxy = Px(Ty < ∞).
◮ It is the probability that, starting in x, you will visit y
◮ Note that
  ρxy = lim_{n→∞} Px(Ty < n) = Σ_{n=1}^∞ Px(Ty = n)
◮ Definition: A state y is called transient if ρyy < 1; it is called recurrent if ρyy = 1.
◮ Intuitively, all transient states would be visited only finitely many times while recurrent states are visited infinitely often.
◮ For any state y define
  Iy(Xn) = 1 if Xn = y, and 0 otherwise
◮ Now, the total number of visits to y is given by
  Ny = Σ_{n=1}^∞ Iy(Xn)
◮ We can get the distribution of Ny as
  Px(Ny ≥ 1) = Px(Ty < ∞) = ρxy
  Px(Ny ≥ 2) = Σ_m Px(Ty = m) Py(Ty < ∞) = ρyy Σ_m Px(Ty = m) = ρyy ρxy
  Px(Ny ≥ m) = ρ_{yy}^{m−1} ρxy
  Px(Ny = m) = Px(Ny ≥ m) − Px(Ny ≥ m + 1) = ρ_{yy}^{m−1} ρxy − ρ_{yy}^m ρxy = ρxy ρ_{yy}^{m−1}(1 − ρyy)
  Px(Ny = 0) = 1 − Px(Ny ≥ 1) = 1 − ρxy
Recap: Markov Chain
◮ Let Xn, n = 0, 1, · · · be a sequence of discrete random variables taking values in S.
◮ We say it is a Markov chain if
  Pr[X_{n+1} = x_{n+1} | Xn = xn, X_{n−1} = x_{n−1}, · · · , X0 = x0] = Pr[X_{n+1} = x_{n+1} | Xn = xn], ∀xi
◮ We can write it as
  f_{X_{n+1}|Xn,··· ,X0}(x_{n+1} | xn, · · · , x0) = f_{X_{n+1}|Xn}(x_{n+1} | xn), ∀xi
◮ For a Markov chain, given the current state, the future evolution is independent of the history of how you reached the current state

Recap: Transition Probabilities
◮ The transition probability function is P : S × S → [0, 1],
  P(x, y) = Pr[X_{n+1} = y | Xn = x]
  The chain is said to be homogeneous when this is not a function of time.
◮ For a homogeneous chain
  Pr[X_{n+1} = y | Xn = x] = Pr[X1 = y | X0 = x], ∀n
◮ P satisfies
  P(x, y) ≥ 0, ∀x, y ∈ S and Σ_{y∈S} P(x, y) = 1, ∀x ∈ S
◮ The Markov property extends to events about the future and the past:
  Pr[X_{m+n+r} ∈ Br, r = 0, · · · , s | Xm = x, X_{m−k} ∈ Ak, k = 1, · · · , m]
  = Pr[X_{m+n+r} ∈ Br, r = 0, · · · , s | Xm = x]
Recap: Chapman-Kolmogorov Equations
◮ The n-step transition probabilities are defined by
  P^n(x, y) = Pr[Xn = y | X0 = x]
◮ These n-step transition probabilities satisfy
  P^{m+n}(x, y) = Σ_z P^m(x, z) P^n(z, y)
◮ These are known as the Chapman-Kolmogorov equations
◮ For a finite chain, the n-step transition probability matrix is the n-fold product of the transition probability matrix
◮ We also have
  πn(y) ≜ Pr[Xn = y] = Σ_x π0(x) P^n(x, y)

Recap: Hitting times
◮ We define the hitting time for y as the random variable
  Ty = min{n > 0 : Xn = y}
◮ Using this definition, we can derive
  Px(Ty = m) = Σ_{z≠y} P(x, z) Pz(Ty = m − 1)
  (Notation: Pz(A) = Pr[A | X0 = z])
◮ We also have
  P^n(x, y) = Σ_{m=1}^n Px(Ty = m) P^{n−m}(y, y)
Recap
◮ Notation: Ex[Z] = E[Z | X0 = x]
◮ Define
  G(x, y) ≜ Ex[N(y)] = Σ_{n=1}^∞ Ex[Iy(Xn)] = Σ_{n=1}^∞ P^n(x, y)
◮ G(x, y) is the expected number of visits to y for a chain that is started in x.

Theorem:
(i) Let y be transient. Then
  Px(N(y) < ∞) = 1, ∀x and G(x, y) = ρxy/(1 − ρyy) < ∞, ∀x
(ii) Let y be recurrent. Then
  Py[N(y) = ∞] = 1, and G(y, y) = Ey[N(y)] = ∞
  Px[N(y) = ∞] = ρxy, and G(x, y) = 0 if ρxy = 0, G(x, y) = ∞ if ρxy > 0
Proof of (i): y is transient, so ρyy < 1.
  G(x, y) = Ex[N(y)] = Σ_m m Px[N(y) = m]
  = Σ_m m ρxy ρ_{yy}^{m−1}(1 − ρyy)
  = ρxy Σ_{m=1}^∞ m ρ_{yy}^{m−1}(1 − ρyy)
  = ρxy · 1/(1 − ρyy) < ∞, because ρyy < 1
  ⇒ Px[N(y) < ∞] = 1

Proof of (ii): y recurrent ⇒ ρyy = 1. Hence
  Py[N(y) ≥ m] = ρ_{yy}^m = 1, ∀m
  ⇒ Py[N(y) = ∞] = lim_{m→∞} Py[N(y) ≥ m] = 1
  ⇒ G(y, y) = Ey[N(y)] = ∞
  Px[N(y) ≥ m] = ρxy ρ_{yy}^{m−1} = ρxy, ∀m
  Hence Px[N(y) = ∞] = ρxy
  ρxy = 0 ⇒ Px[N(y) ≥ m] = 0, ∀m > 0 ⇒ G(x, y) = 0
  ρxy > 0 ⇒ Px[N(y) = ∞] > 0 ⇒ G(x, y) = ∞
◮ Transient states are visited only finitely many times while recurrent states are visited infinitely often
◮ If S is finite, it should have at least one recurrent state
◮ If y is transient, then, for all x,
  G(x, y) = Σ_{n=1}^∞ P^n(x, y) < ∞ ⇒ lim_{n→∞} P^n(x, y) = 0
◮ We say x leads to y if ρxy > 0
Theorem: If x is recurrent and x leads to y then y is recurrent and ρxy = ρyx = 1.
Proof:
◮ Take x ≠ y, wlog. Since ρxy > 0, ∃n s.t. P^n(x, y) > 0
◮ Take the least such n. Then we have states y1, · · · , y_{n−1}, none of which is x (or y), through which the chain can go from x to y
◮ The remaining steps give
  G(y, y) = Σ_{n=1}^∞ P^n(y, y) = ∞, because x is recurrent
  ⇒ y is recurrent
Equivalence relation
◮ Let R be a relation on a set A. Note R ⊂ A × A
◮ R is called an equivalence relation if it is
  1. reflexive, i.e., (x, x) ∈ R, ∀x ∈ A
  2. symmetric, i.e., (x, y) ∈ R ⇒ (y, x) ∈ R
  3. transitive, i.e., (x, y), (y, z) ∈ R ⇒ (x, z) ∈ R

Example
◮ Let A = { m/n | m, n are integers }
◮ Define the relation R by
  (m/n, p/q) ∈ R if mq = np
◮ This is the usual equality of fractions
◮ It is easy to check that it is an equivalence relation.
Equivalence classes
◮ Let R be an equivalence relation on A.
◮ Then A can be partitioned as
  A = C1 + C2 + · · ·
  where the Ci satisfy
  x, y ∈ Ci ⇒ (x, y) ∈ R, ∀i
  x ∈ Ci, y ∈ Cj, i ≠ j ⇒ (x, y) ∉ R
◮ In our example, each equivalence class corresponds to a rational number.
◮ Here, Ci contains all fractions that are equal to that rational number

◮ The state space of any Markov chain can be partitioned into the transient and recurrent states, S = ST + SR:
  ST = {y ∈ S : ρyy < 1},  SR = {y ∈ S : ρyy = 1}
◮ On SR, consider the relation: ‘x leads to y’ (i.e., x is related to y if ρxy > 0)
◮ This is an equivalence relation:
  ρxx > 0, ∀x ∈ SR
  ρxy > 0 ⇒ ρyx > 0, ∀x, y ∈ SR
  ρxy > 0, ρyz > 0 ⇒ ρxz > 0
◮ Hence we get a partition SR = C1 + C2 + · · · where the Ci are equivalence classes.
◮ Since the Ci are equivalence classes, they satisfy:
  x, y ∈ Ci ⇒ x leads to y
  x ∈ Ci, y ∈ Cj, i ≠ j ⇒ ρxy = 0
◮ All states in any Ci lead to each other, i.e., they communicate with each other
◮ If i ≠ j and x ∈ Ci and y ∈ Cj, then ρxy = ρyx = 0; x and y do not communicate with each other.
◮ A set of states C ⊂ S is said to be irreducible if x leads to y for all x, y ∈ C
◮ An irreducible set is also called a communicating class
◮ A set of states C ⊂ S is said to be closed if x ∈ C, y ∉ C implies ρxy = 0.
◮ Once the chain visits a state in a closed set, it cannot leave that set.
◮ We get a partition of the recurrent states
  SR = C1 + C2 + · · ·
  where each Ci is a closed and irreducible set of states.
◮ If S is irreducible then the chain is said to be irreducible. (Note that S is trivially closed)
Example
◮ Consider a chain with states {0, 1, 2, 3, 4, 5} and the following pattern of + and − entries (rows indexed by x, columns by y):

       0   1   2   3   4   5
  0    +   −   −   −   −   −
  1    +   +   +   −   −   −
  2    −   +   +   +   −   +
  3    −   −   −   +   +   −
  4    −   −   −   −   +   +

◮ If you start the chain in a recurrent state it will stay in the corresponding closed irreducible set
◮ If you start in one of the transient states, it would eventually get ‘absorbed’ in one of the closed irreducible sets of recurrent states.
◮ We want to know the probabilities of ending up in the different sets.
◮ We want to know how long the chain stays in the transient states
◮ π is a stationary distribution if
  π(y) = Σ_{x∈S} π(x)P(x, y), ∀y ∈ S
◮ Recall that πn(x) ≜ Pr[Xn = x] satisfies
  π_{n+1}(y) = Σ_{x∈S} Pr[X_{n+1} = y | Xn = x] Pr[Xn = x] = Σ_{x∈S} πn(x)P(x, y)
◮ Hence, if π0 = π then π1 = π, and hence πn = π, ∀n
◮ Hence the name, stationary distribution.
◮ It is also called the invariant distribution or the invariant measure
◮ If the chain is started in the stationary distribution then the distribution of Xn is not a function of time, as we saw.
◮ Suppose, for a chain, the distribution of Xn does not depend on n. Then the chain must be in a stationary distribution.
◮ Suppose π = π0 = π1 = · · · = πn = · · · . Then
  π(y) = π1(y) = Σ_{x∈S} π0(x)P(x, y) = Σ_{x∈S} π(x)P(x, y)
  which shows π is a stationary distribution
Example
◮ Suppose S is finite. For a finite chain the stationary distribution can be found by solving πᵀP = πᵀ together with Σ_x π(x) = 1.
◮ Consider the 3-state chain on S = {0, 1, 2} with
  P = [ 0.75   0.25    0   ]
      [ 0.5     0     0.5  ]
      [  0     0.75   0.25 ]
◮ (π(0) π(1) π(2)) P = (π(0) π(1) π(2)) gives
  0.75π(0) + 0.5π(1) = π(0)
  0.25π(0) + 0.75π(2) = π(1)
  0.5π(1) + 0.25π(2) = π(2)
◮ We have to solve these along with π(0) + π(1) + π(2) = 1
◮ We can also write the equations for π as
  0.75π(0) + 0.5π(1) = π(0) ⇒ π(1) = (1/2)π(0)
  0.25π(0) + 0.75π(2) = π(1) ⇒ π(2) = (1/3)π(0)
  0.5π(1) + 0.25π(2) = π(2)
  π(0) + π(1) + π(2) = 1 ⇒ π(0)(1 + 1/2 + 1/3) = 1
◮ Now, π(0)(1 + 1/2 + 1/3) = 1 gives π(0) = 6/11
◮ We get the unique solution (6/11, 3/11, 2/11)
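The same stationary distribution can be obtained numerically. A short Python sketch (my own check of the example above):

    import numpy as np

    P = np.array([[0.75, 0.25, 0.0],
                  [0.5,  0.0,  0.5],
                  [0.0,  0.75, 0.25]])

    # Solve pi P = pi together with sum(pi) = 1:
    # stack the rows of (P^T - I) with a row of ones for the normalization
    A = np.vstack([P.T - np.eye(3), np.ones(3)])
    b = np.array([0.0, 0.0, 0.0, 1.0])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    print("numerical stationary distribution:", pi)
    print("exact                            :", [6/11, 3/11, 2/11])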
Example 2
◮ (Transition graph of a 3-state chain on {0, 1, 2}; the edge probabilities shown in the figure are 1.0 and 0.5)
Recap
Theorem:
(i) Let y be transient. Then
  Px(N(y) < ∞) = 1, ∀x and G(x, y) = ρxy/(1 − ρyy) < ∞, ∀x
(ii) Let y be recurrent. Then
  Py[N(y) = ∞] = 1, and G(y, y) = Ey[N(y)] = ∞
  Px[N(y) = ∞] = ρxy, and G(x, y) = 0 if ρxy = 0, G(x, y) = ∞ if ρxy > 0

Recap
◮ Transient states are visited only finitely many times while recurrent states are visited infinitely often
◮ A finite chain should have at least one recurrent state
◮ We say x leads to y if ρxy > 0
Theorem: If x is recurrent and x leads to y then y is recurrent and ρxy = ρyx = 1.
Recap: Stationary distribution
◮ π is said to be a stationary distribution for the Markov chain with transition probabilities P if
  π(y) = Σ_{x∈S} π(x)P(x, y), ∀y ∈ S

◮ Let Iy(Xn) be the indicator of [Xn = y]
◮ The number of visits to y till time n is Nn(y) = Σ_{m=1}^n Iy(Xm)
◮ Define
  Gn(x, y) ≜ Ex[Nn(y)] = Σ_{m=1}^n Ex[Iy(Xm)] = Σ_{m=1}^n P^m(x, y)
◮ Consider a chain started in y
◮ Let Ty^r be the time of the r-th visit to y, r ≥ 1:
  Ty^r = min{n ≥ 1 : Nn(y) = r}
◮ Define Wy^1 = Ty^1 = Ty and Wy^r = Ty^r − Ty^{r−1}, r > 1
◮ Note that Ey[Wy^1] = Ey[Ty] = my
◮ Also, Ty^r = Wy^1 + · · · + Wy^r
◮ The Wy^r are the “waiting times”
◮ By the Markov property we should expect them to be iid; we will prove this.
◮ Then Ty^r/r converges to my by the law of large numbers
◮ We have
  Pr[Wy^3 = k3 | Wy^2 = k2, Wy^1 = k1]
  = Pr[X_{k1+k2+j} ≠ y, 1 ≤ j ≤ k3 − 1, X_{k1+k2+k3} = y | B]
  where B = [X_{k1+k2} = y, X_{k1} = y, Xj ≠ y, j < k1 + k2, j ≠ k1]
◮ Using the Markov property, we get
  Pr[Wy^3 = k3 | Wy^2 = k2, Wy^1 = k1]
  = Pr[X_{k1+k2+j} ≠ y, 1 ≤ j ≤ k3 − 1, X_{k1+k2+k3} = y | X_{k1+k2} = y]
  = Pr[Xj ≠ y, 1 ≤ j ≤ k3 − 1, X_{k3} = y | X0 = y]
  = Py[Wy^1 = k3]
◮ In general, we get
  Pr[Wy^r = kr | Wy^{r−1} = k_{r−1}, · · · , Wy^1 = k1] = Py[Wy^1 = kr]
◮ Hence
  Py[Wy^2 = k2, Wy^1 = k1] = Py[Wy^2 = k2 | Wy^1 = k1] Py[Wy^1 = k1]
  = Py[Wy^1 = k2] Py[Wy^1 = k1]
  = Py[Wy^2 = k2] Py[Wy^1 = k1]
  ⇒ the Wy^r are independent (and identically distributed)
◮ Note that this is true even if my = ∞
◮ We have Ty^{Nn(y)} ≤ n < Ty^{Nn(y)+1}, and hence
  lim_{n→∞} Nn(y)/n = 1/my, w.p.1
◮ The limiting fraction of time spent in a state is inversely proportional to my, the mean return time.
◮ Intuitively, the stationary probability of a state could be the limiting fraction of time spent in that state.
◮ Thus π(y) = 1/my is a good candidate for the stationary distribution.
◮ We first note that we can have my = ∞: though Py[Ty < ∞] = 1, we can have Ey[Ty] = ∞.
◮ What if my = ∞, ∀y?
◮ That does not seem reasonable for a finite chain. But for infinite chains?
◮ Let us characterize the y for which my = ∞
◮ A recurrent state y is called null recurrent if my = ∞; y is called positive recurrent if my < ∞
◮ We earlier saw that the fraction of time spent in a transient state is zero.
◮ Suppose y is null recurrent. Then
  lim_{n→∞} Nn(y)/n = 1/my = 0
◮ Thus the limiting fraction of time spent by the chain in transient and null recurrent states is zero.
◮ Theorem: Let x be positive recurrent and let x lead to y. Then y is positive recurrent.
Proof:
◮ Since x is recurrent and x leads to y, we know ∃n0, n1 s.t.
  P^{n0}(x, y) > 0, P^{n1}(y, x) > 0 and
  P^{n1+m+n0}(y, y) ≥ P^{n1}(y, x) P^m(x, x) P^{n0}(x, y), ∀m
◮ Summing the above for m = 1, 2, · · · , n and dividing by n,
  (1/n) Σ_{m=1}^n P^{n1+m+n0}(y, y) ≥ P^{n1}(y, x) [ (1/n) Σ_{m=1}^n P^m(x, x) ] P^{n0}(x, y), ∀n
◮ We can write the LHS above as
  (1/n) Σ_{m=1}^n P^{n1+m+n0}(y, y) = (1/n) Σ_{m=1}^{n1+n+n0} P^m(y, y) − (1/n) Σ_{m=1}^{n1+n0} P^m(y, y)
  = ((n1+n+n0)/n) · (1/(n1+n+n0)) Σ_{m=1}^{n1+n+n0} P^m(y, y) − (1/n) Σ_{m=1}^{n1+n0} P^m(y, y)
  ⇒ lim_{n→∞} (1/n) Σ_{m=1}^n P^{n1+m+n0}(y, y) = 1/my
◮ If we now let n → ∞ in the inequality, the RHS goes to P^{n1}(y, x)(1/mx)P^{n0}(x, y) > 0, and hence
  1/my ≥ P^{n1}(y, x)(1/mx)P^{n0}(x, y) > 0, which gives my < ∞, i.e., y is positive recurrent.
◮ This would give
  1 = lim_{n→∞} Σ_{y∈C} (1/n) Σ_{m=1}^n P^m(x, y) = Σ_{y∈C} lim_{n→∞} (1/n) Σ_{m=1}^n P^m(x, y) = 0,
  which is a contradiction
◮ Also, from π(y) = Σ_x π(x) P^m(x, y), ∀m, we get
  π(y) = lim_{n→∞} Σ_x π(x) (1/n) Σ_{m=1}^n P^m(x, y)
◮ The proof is complete if we can take the limit inside the sum
◮ We have
  π(x) ≥ 0;  Σ_x π(x) = 1;  0 ≤ (1/n) Σ_{m=1}^n P^m(x, y) ≤ 1, ∀x
◮ Hence, if y is transient or null recurrent, then
  π(y) = Σ_x π(x) lim_{n→∞} (1/n) Σ_{m=1}^n P^m(x, y) = 0
◮ If P^n(x, y) converges to some g(y) then that would be the stationary distribution and πn converges to it
◮ But (1/n) Σ_{m=1}^n a_m may have a limit even though lim_{n→∞} a_n does not exist. For example, a_n = (−1)^n
◮ The chain is the following (transition graph with probabilities 1/3, 2/3, 1)
◮ For this chain one can show πᵀ = (1/8, 3/8, 3/8, 1/8)
◮ However, P^n goes to different limits based on whether n is even or odd
◮ We define the period of a state x as
◮ The extra condition we need for convergence of πn is aperiodicity
◮ For an aperiodic, irreducible, positive recurrent chain, there is a unique stationary distribution and πn converges to it irrespective of what π0 is.
◮ An aperiodic, irreducible, positive recurrent chain is called an ergodic chain

◮ π is said to be a stationary distribution for the Markov chain with transition probabilities P if
  π(y) = Σ_{x∈S} π(x)P(x, y), ∀y ∈ S
◮ When π is a stationary distribution, π0 = π ⇒ πn = π, ∀n
◮ If πn = π, ∀n, then π is a stationary distribution
◮ For a finite chain: Pᵀπ = π
◮ A stationary distribution always exists for a finite chain
Example
◮ Consider the umbrella problem, the chain on states {0, 1, 2, 3, 4} with

  P =    0      1      2      3      4
   0     0      0      0      0      1
   1     0      0      0     1−p     p
   2     0      0     1−p     p      0
   3     0     1−p     p      0      0
   4    1−p     p      0      0      0

◮ We want to calculate the probability of getting caught in the rain without an umbrella.
◮ This would be the steady-state probability of state 0 multiplied by p
◮ We are using the fact that this chain converges to the stationary distribution starting with any initial probabilities.
◮ The stationary distribution satisfies πᵀP = πᵀ:
  π(0) = (1 − p)π(4)
  π(1) = (1 − p)π(3) + pπ(4) ⇒ π(3) = π(1)
  π(2) = (1 − p)π(2) + pπ(3)
  π(3) = (1 − p)π(1) + pπ(2) ⇒ π(2) = π(1)
  π(4) = π(0) + pπ(1) ⇒ π(4) = π(1)
◮ This gives 4π(1) + (1 − p)π(1) = 1 and hence
  π(i) = 1/(5 − p), i = 1, 2, 3, 4, and π(0) = (1 − p)/(5 − p)
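A quick numerical check of the closed form above (a Python sketch; the value p = 0.3 is just an illustrative assumption):

    import numpy as np

    p = 0.3    # assumed probability of rain
    P = np.array([
        [0,     0,     0,     0,     1    ],
        [0,     0,     0,     1 - p, p    ],
        [0,     0,     1 - p, p,     0    ],
        [0,     1 - p, p,     0,     0    ],
        [1 - p, p,     0,     0,     0    ],
    ])

    # Closed-form stationary distribution derived on the slide
    pi = np.array([1 - p, 1, 1, 1, 1]) / (5 - p)
    print("pi P == pi ?", np.allclose(pi @ P, pi))
    print("P[caught in rain without umbrella] = p * pi(0) =", p * pi[0])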
Birth-Death chains – stationary distribution
◮ The following is a finite birth-death chain: states 0, 1, · · · , N; from state i the chain moves to i+1 with probability p_i, to i−1 with probability q_i, and stays at i with probability r_i = 1 − p_i − q_i (it stays at 0 with probability 1 − p_0 and at N with probability 1 − q_N)
◮ We assume p_i, q_i > 0 for all interior i.
◮ Then the chain is irreducible and positive recurrent
◮ If we assume r_i > 0 for at least one i, it is also aperiodic
◮ We can derive a general form for its stationary probabilities
◮ Using π(y) = Σ_x π(x)P(x, y):
  π(0) = π(0)(1 − p_0) + π(1)q_1 ⇒ π(1)q_1 − π(0)p_0 = 0
  π(1) = π(0)p_0 + π(1)(1 − p_1 − q_1) + π(2)q_2
  ⇒ π(1)q_1 − π(0)p_0 = π(2)q_2 − π(1)p_1 ⇒ π(2)q_2 − π(1)p_1 = 0
  π(2) = π(1)p_1 + π(2)(1 − p_2 − q_2) + π(3)q_3
  ⇒ π(2)q_2 − π(1)p_1 = π(3)q_3 − π(2)p_2 = 0
◮ Thus we get
  π(1)q_1 − π(0)p_0 = 0 ⇒ π(1) = (p_0/q_1) π(0)
  π(2)q_2 − π(1)p_1 = 0 ⇒ π(2) = (p_1/q_2) π(1) = (p_0 p_1/(q_1 q_2)) π(0)
  π(n) = η_n π(0), where η_n = (p_0 p_1 · · · p_{n−1}) / (q_1 q_2 · · · q_n), n = 1, 2, · · · , N
◮ With η_0 = 1, we get π(0) Σ_{j=0}^N η_j = 1 and hence
  π(0) = 1 / Σ_{j=0}^N η_j and π(n) = η_n π(0), n = 1, · · · , N
◮ Note that this process is applicable even for infinite chains with state space {0, 1, 2, · · · } (but there may not be a solution)

◮ Now consider a general birth-death chain, with transition probabilities p_x (to x+1), q_x (to x−1) and r_x (staying at x)
◮ The chain may be infinite or finite
◮ Let a, b ∈ S with a < b. Assume p_x, q_x > 0, a < x < b.
◮ Define
  U(x) = Px[Ta < Tb], a < x < b, with U(a) = 1, U(b) = 0
◮ We want to derive a formula for U(x)
◮ This can be useful, e.g., in the gambler's ruin chain
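The η-product formula for the stationary distribution is easy to evaluate numerically. A Python sketch (the particular p_i, q_i values below are illustrative assumptions satisfying r_i = 1 − p_i − q_i ≥ 0):

    import numpy as np

    N = 5
    p = np.array([0.5, 0.4, 0.4, 0.3, 0.3, 0.0])   # p_N = 0
    q = np.array([0.0, 0.2, 0.3, 0.3, 0.4, 0.5])   # q_0 = 0

    # eta_n = (p_0 ... p_{n-1}) / (q_1 ... q_n), with eta_0 = 1
    eta = np.ones(N + 1)
    for n in range(1, N + 1):
        eta[n] = eta[n - 1] * p[n - 1] / q[n]
    pi = eta / eta.sum()
    print("stationary distribution:", pi)

    # Check: build P and verify pi P = pi
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        if i < N: P[i, i + 1] = p[i]
        if i > 0: P[i, i - 1] = q[i]
        P[i, i] = 1.0 - p[i] - q[i]
    print("pi P == pi ?", np.allclose(pi @ P, pi))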
◮ Conditioning on the first step,
  U(x) = Px[Ta < Tb] = Pr[Ta < Tb | X0 = x]
  = Σ_{y=x−1}^{x+1} Pr[Ta < Tb | X1 = y] Pr[X1 = y | X0 = x]
  = U(x − 1)q_x + U(x)r_x + U(x + 1)p_x
  = U(x − 1)q_x + U(x)(1 − p_x − q_x) + U(x + 1)p_x
  ⇒ q_x [U(x) − U(x − 1)] = p_x [U(x + 1) − U(x)]
  ⇒ U(x + 1) − U(x) = (q_x/p_x) [U(x) − U(x − 1)]
◮ Iterating,
  U(x + 1) − U(x) = (q_x/p_x) [U(x) − U(x − 1)]
  = (q_x q_{x−1} / (p_x p_{x−1})) [U(x − 1) − U(x − 2)]
  = (q_x q_{x−1} · · · q_{a+1} / (p_x p_{x−1} · · · p_{a+1})) [U(a + 1) − U(a)]
◮ Let
  γ_y = (q_y q_{y−1} · · · q_{a+1}) / (p_y p_{y−1} · · · p_{a+1}), a < y < b, with γ_a = 1
◮ Now we get
  U(x + 1) − U(x) = (γ_x/γ_a) [U(a + 1) − U(a)]
◮ We have
  U(x + 1) − U(x) = (γ_x/γ_a) [U(a + 1) − U(a)]
◮ Taking x = b − 1, b − 2, · · · , a + 1, a,
  U(b) − U(b − 1) = (γ_{b−1}/γ_a) [U(a + 1) − U(a)]
  U(b − 1) − U(b − 2) = (γ_{b−2}/γ_a) [U(a + 1) − U(a)]
  ...
  U(a + 1) − U(a) = (γ_a/γ_a) [U(a + 1) − U(a)]
◮ Adding all these we get
  (1/γ_a) [U(a + 1) − U(a)] Σ_{x=a}^{b−1} γ_x = U(b) − U(a) = 0 − 1
  ⇒ U(a) − U(a + 1) = γ_a / Σ_{x=a}^{b−1} γ_x
◮ Using these, we get
  U(x) − U(x + 1) = (γ_x/γ_a) [U(a) − U(a + 1)] = (γ_x/γ_a) · γ_a / Σ_{x=a}^{b−1} γ_x = γ_x / Σ_{x=a}^{b−1} γ_x
◮ Putting x = b − 1, b − 2, · · · , y in the above,
  U(b − 1) − U(b) = γ_{b−1} / Σ_{x=a}^{b−1} γ_x
  U(b − 2) − U(b − 1) = γ_{b−2} / Σ_{x=a}^{b−1} γ_x
  ...
  U(y) − U(y + 1) = γ_y / Σ_{x=a}^{b−1} γ_x
◮ Adding these we get
  U(y) − U(b) = U(y) = Σ_{x=y}^{b−1} γ_x / Σ_{x=a}^{b−1} γ_x, a < y < b
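The formula U(y) = Σ_{x=y}^{b−1} γ_x / Σ_{x=a}^{b−1} γ_x is straightforward to code. A Python sketch (the gambler's-ruin parameters N and the win probability are illustrative assumptions):

    import numpy as np

    def hitting_prob_U(p, q, a, b):
        # U(y) = P_y[T_a < T_b]; gamma_a = 1, gamma_y = (q_{a+1}...q_y)/(p_{a+1}...p_y)
        gamma = {a: 1.0}
        for y in range(a + 1, b):
            gamma[y] = gamma[y - 1] * q[y] / p[y]
        total = sum(gamma[x] for x in range(a, b))
        return {y: sum(gamma[x] for x in range(y, b)) / total for y in range(a + 1, b)}

    # Gambler's ruin on states 0..N with constant win probability pwin
    N, pwin = 10, 0.45
    p = {x: pwin for x in range(1, N)}
    q = {x: 1 - pwin for x in range(1, N)}
    U = hitting_prob_U(p, q, a=0, b=N)
    print("P[ruin starting from 3 units] =", U[3])

For constant p and q this reproduces the classical gambler's-ruin probability ((q/p)^y − (q/p)^N)/(1 − (q/p)^N).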
◮ We are considering birth-death chains (states · · · , x−1, x, x+1, · · · with transition probabilities p_x, q_x, r_x)
◮ Suppose this is a Gambler's ruin chain: p_x = p, q_x = q, ∀x, with absorbing boundary states 0 and N
◮ Consider this chain started in state 1.
◮ [T0 < Tn] ⊂ [T0 < T_{n+1}], n = 2, 3, · · ·
◮ Theorem: The chain is recurrent iff Σ_{x=0}^∞ γ_x = ∞
  Proof: Suppose the chain is recurrent. Since it is irreducible,
◮ The chain is transient if Σ_{x=0}^∞ γ_x < ∞
◮ Let p_x = p, q_x = q ⇒ γ_x = (q/p)^x
◮ We know the chain is recurrent if Σ_{x=0}^∞ (q/p)^x = ∞:
  Transient if Σ_{x=0}^∞ (q/p)^x < ∞ ⇔ q < p
  Recurrent if Σ_{x=0}^∞ (q/p)^x = ∞ ⇔ q ≥ p
◮ This is intuitively clear
◮ This chain with q < p is an example of an irreducible chain that is wholly transient
◮ When will this chain be positive recurrent?
◮ We know that an irreducible chain is positive recurrent if and only if it has a stationary distribution.
◮ We can check if it has a stationary distribution
◮ The equations that we derived earlier hold for this infinite case also.
◮ We derived earlier the equations that a stationary distribution of this chain (if it exists) has to satisfy:
  π(n) = η_n π(0), where η_n = (p_0 p_1 · · · p_{n−1}) / (q_1 q_2 · · · q_n), n = 1, 2, · · ·
◮ Setting η_0 = 1, we get π(0) Σ_{j=0}^∞ η_j = 1
◮ Hence a stationary distribution exists iff Σ_{j=0}^∞ η_j < ∞
◮ Let p_x = p, q_x = q:
  Σ_{j=0}^∞ η_j = Σ_{j=0}^∞ (p/q)^j < ∞ ⇔ p < q
◮ Thus in this special case, the chain is
  transient if p > q; recurrent if p ≤ q
  positive recurrent if p < q
  null recurrent if p = q

◮ This analysis can handle chains which are infinite in one direction
◮ Now consider the following random walk chain: the state space is {· · · , −1, 0, +1, · · · }, and from each state the chain moves right with probability p and left with probability 1 − p
◮ The chain is irreducible and periodic with period 2
◮ P^{2n}(0, 0) = C(2n, n) p^n (1 − p)^n
◮ We can look at the limit of (1/n) Σ_n P^{2n}(0, 0)
◮ We can show that the chain is transient if p ≠ 0.5 and recurrent if p = 0.5.
◮ In general, determining when an infinite chain is positive recurrent is difficult.
◮ The method we had works only for birth-death chains over non-negative integers.
◮ There is a useful general theorem.
Foster's Theorem
◮ Let P be the transition probabilities of a homogeneous irreducible Markov chain with state space S. Let h : S → ℜ with h(x) ≥ 0 and Σ_{k∈S} P(i, k)h(k) < ∞, ∀i ∈ F, and Σ_{k∈S} P(i, k)h(k) ≤ h(i) − ǫ, ∀i ∉ F, for some finite set F ⊂ S and some ǫ > 0. Then the chain is positive recurrent.
◮ The h here is called a Lyapunov function.
◮ We will not prove this theorem

◮ Let {Xn, n ≥ 0} be an irreducible Markov chain on a finite state space S with stationary distribution π.
◮ Let r : S → ℜ be a bounded function.
◮ Suppose we want E[r(X)] with respect to the stationary distribution π (E[r(X)] = Σ_{j∈S} r(j)π(j))
◮ Let Nn(j) be as earlier. Then
  (1/n) Σ_{m=1}^n r(Xm) = (1/n) Σ_{j∈S} Nn(j) r(j)
◮ For this to be true for infinite S, we need some extra conditions
MCMC Sampling
◮ Consider a distribution over a (finite) S: π(x) = b(x)/Z
◮ Since this is a distribution, Z = Σ_{x∈S} b(x)
◮ We assume we can efficiently calculate b(x) for any x, but computation of Z is intractable or computationally expensive
◮ E.g., the Boltzmann distribution: b(x) = e^{−E(x)/KT}
◮ We want E[g(X)] w.r.t. the distribution π (for any g):
  E[g(X)] = Σ_x g(x)π(x) ≈ (1/n) Σ_{i=1}^n g(Xi), X1, · · · , Xn ∼ π
◮ One way to generate samples is to design an ergodic Markov chain with stationary distribution π – this is MCMC sampling

◮ Suppose {Xn} is an irreducible, aperiodic, positive recurrent Markov chain with stationary distribution π(x) = b(x)/Z
◮ Then we have
  lim_{n→∞} (1/n) Σ_{m=1}^n g(Xm) = Σ_x g(x)π(x)
◮ Hence, if we can design a Markov chain with a given stationary distribution, we can use that to calculate the expectation.
◮ We can also use the chain to generate samples from the distribution π
◮ Let {Xn} be a Markov chain with stationary distribution π(x) = b(x)/Z
◮ We can approximate the expectation as
  Σ_x g(x)π(x) ≈ (1/n) Σ_{i=1}^n g(X_{M+i})
  where M is large enough to assume the chain is in steady state
◮ When we take a sample mean, (1/n) Σ_{i=1}^n Zi, we want the Zi to be uncorrelated
◮ We can, for example, use
  Σ_x g(x)π(x) ≈ (1/n) Σ_{i=1}^n g(X_{M+Ki})
◮ For all these, we need to design a Markov chain with π as the stationary distribution
◮ Let Q = [q(i, j)] be the transition probability matrix of an irreducible Markov chain over S.
◮ Q is called the proposal distribution
◮ We start with an arbitrary X0 and generate X_{n+1}, n = 0, 1, 2, · · · , iteratively as follows
◮ If Xn = i, we generate Y with Pr[Y = k] = q(i, k)
◮ Let the generated value for Y be j. Set
  X_{n+1} = j with probability α(i, j), and X_{n+1} = Xn with probability 1 − α(i, j)
◮ α(i, j) is called the acceptance probability
◮ We want to choose α(i, j) to make Xn an ergodic Markov chain with stationary probabilities π (see the sketch below)
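A minimal Metropolis-Hastings sketch in Python (my own illustration, not from the slides). The target b(x) on a small finite state space and the symmetric nearest-neighbour proposal are illustrative assumptions; the acceptance probability used, α(i, j) = min{1, b(j)q(j, i)/(b(i)q(i, j))}, is the standard Metropolis-Hastings choice, which makes the chain satisfy detailed balance with respect to π ∝ b.

    import numpy as np

    rng = np.random.default_rng(7)
    S = np.arange(10)
    b = np.exp(-0.5 * (S - 4.0) ** 2)        # unnormalized target b(x); Z is never computed

    def q_sample(i):
        # symmetric nearest-neighbour proposal on {0,...,9}, clipped at the ends
        return min(max(i + rng.choice([-1, 1]), 0), 9)

    def mh_chain(n, x0=0):
        x, path = x0, []
        for _ in range(n):
            y = q_sample(x)
            alpha = min(1.0, b[y] / b[x])    # symmetric proposal: q terms cancel
            if rng.random() < alpha:
                x = y
            path.append(x)
        return np.array(path)

    path = mh_chain(200000)[1000:]           # drop a burn-in of 1000 steps
    emp = np.bincount(path, minlength=10) / path.size
    print("empirical frequencies:", np.round(emp, 3))
    print("target pi            :", np.round(b / b.sum(), 3))

The empirical state frequencies of the chain approach π even though only the unnormalized b(x) was ever used.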
◮ The stationary distribution π satisfies (with transition probabilities P)
  π(y) = Σ_x π(x) P(x, y), ∀y ∈ S
◮ Suppose there is a distribution g(·) that satisfies
  g(y) P(y, x) = g(x) P(x, y), ∀x, y ∈ S
  This is called detailed balance
◮ Summing both sides above over x gives
  g(y) = Σ_x g(y) P(y, x) = Σ_x g(x) P(x, y), ∀y
◮ Thus if g(·) satisfies detailed balance, then it must be the stationary distribution
◮ Note that it is not necessary for a stationary distribution to satisfy detailed balance
◮ In other words: any stationary distribution has to satisfy π(y) = Σ_x π(x)P(x, y), ∀y ∈ S; but if I can find a π that satisfies
  π(x)P(x, y) = π(y)P(y, x), ∀x, y ∈ S, x ≠ y
  then that π is the stationary distribution. This is the detailed balance condition.
Random process
◮ A random process or a stochastic process is a collection of random variables {Xt, t ∈ T}
◮ A Markov chain is an example. Here T = {0, 1, · · · }
◮ We call T the index set.
◮ Normally, T is either (a subset of) the set of integers or an interval on the real line.
◮ We think of the index t as time
◮ Thus a random process can represent the time-evolution of the state of a system
◮ We assume T is infinite
◮ The index need not necessarily represent time. It can represent, for example, space coordinates.
◮ The set T can be countable, e.g., T = {0, 1, 2, · · · }, or continuous, e.g., T = [0, ∞); these are termed discrete-time or continuous-time processes
◮ The random variables Xt may be discrete or continuous; these are termed discrete-state or continuous-state processes
◮ The Markov chain we considered is a discrete-time discrete-state process
Distributions of a random process
◮ A random process: {Xt, t ∈ T}, or X : Ω × T → ℜ
◮ The first order distribution function of X is FX(x; t) = Pr[X(t) ≤ x]
◮ When it is a discrete-state process, all the Xt are discrete random variables
◮ We can then specify the distributions through mass functions
◮ From these finite-dimensional distributions we can get all the distributions of the process
◮ We define the autocorrelation of the process by RX(t1, t2) = E[X(t1)X(t2)]
Stationary Processes
◮ A random process {X(t), t ∈ T} is said to be stationary if, for all n, for all t1, · · · , tn, for all x1, · · · , xn and for all τ, we have
  FX(x1, · · · , xn; t1, · · · , tn) = FX(x1, · · · , xn; t1 + τ, · · · , tn + τ)
◮ For a stationary process, the distributions are unaffected by translation of the time axis.
◮ This is a rather stringent condition and is often referred to as strict-sense stationarity
◮ A homogeneous Markov chain started in its stationary distribution is a stationary process
◮ As we know, if π0 is the stationary distribution then πn is the same for all n.
◮ This, along with the Markov condition, implies that a shift of the time origin does not affect the distributions:
  Pr[Xn = x0, X_{n+1} = x1, · · · , X_{n+m} = xm] = πn(x0)P(x0, x1) · · · P(x_{m−1}, xm)
  = π0(x0)P(x0, x1) · · · P(x_{m−1}, xm)
  = Pr[X0 = x0, X1 = x1, · · · , Xm = xm]
◮ Suppose {X(t), t ∈ T} is (strict-sense) stationary
◮ Then the first order distribution is independent of time:
  FX(x; t) = FX(x; t + τ), ∀x, t, τ ⇒ e.g., FX(x; t) = FX(x; 0)
◮ Similarly,
  FX(x1, x2; t, t + τ) = FX(x1, x2; 0, τ), ∀x1, x2, t, τ
  Hence FX(x1, x2; t1, t2) can depend only on t1 − t2
◮ This implies
  RX(t, t + τ) = E[X(t)X(t + τ)] = RX(τ)
◮ The process {X(t), t ∈ T} is said to be wide-sense stationary if the first and second order distributions are invariant to translation of the time origin
◮ Let {X(t), t ∈ T} be wide-sense stationary. Then
  1. ηX(t) = ηX, a constant
  2. RX(t1, t2) depends only on t1 − t2
◮ In many engineering applications, we call a process wide-sense stationary if the above two hold.
◮ In this course we take the above two conditions as the definition of a wide-sense stationary process
◮ When the process is wide-sense stationary, we write the autocorrelation as
  RX(τ) = E[X(t)X(t + τ)]

Ergodicity
◮ Suppose X(n) is a discrete-time discrete-state process (like a Markov chain)
◮ Suppose it is wide-sense stationary. Then E[X(n)] does not depend on n
◮ Ergodicity is the question of whether
  lim_{n→∞} (1/n) Σ_{i=1}^n X(i) = E[X(n)] = ηX
◮ We proved that this is true for an irreducible, aperiodic, positive recurrent Markov chain (with a finite state space)
◮ The question is: do ‘time-averages’ converge to ‘ensemble-averages’?
◮ The process is wide-sense stationary and hence all X(n) have the same distribution; but they need not be independent or uncorrelated (e.g., a Markov chain)
◮ Ergodicity is a question of whether time-averages converge to ensemble-averages:
  lim_{n→∞} (1/n) Σ_{i=1}^n X(i) = E[X(n)] = ηX ?
◮ Or, more generally,
  lim_{n→∞} (1/n) Σ_{i=1}^n g(X(i)) = E[g(X(n))] ?
◮ For a continuous time process we can write this as
  lim_{τ→∞} (1/(2τ)) ∫_{−τ}^{τ} X(t) dt = E[X(t)] = ηX ?
◮ Define
  η_τ = (1/(2τ)) ∫_{−τ}^{τ} X(t) dt (τ > 0)
◮ For each τ, η_τ is a rv. We write η for ηX.
◮ We say the process is mean-ergodic if
  η_τ →P η, as τ → ∞
◮ That is, if
  lim_{τ→∞} Pr[ |η_τ − η| > ǫ ] = 0, ∀ǫ > 0
◮ Note that E[η_τ] = η, ∀τ.
Poisson Process
◮ A random process {N(t), t ≥ 0} is called a counting process if
  1. N(t) ≥ 0 and is integer-valued
  2. If s < t then N(s) ≤ N(t)
  N(t) represents the number of ‘events’ till t
◮ This is the next process we study
◮ This is a discrete-state continuous-time process
◮ The index set is the interval [0, ∞) and all the random variables are discrete and take non-negative integer values.
◮ The counting process has independent increments if for all t1 < t2 ≤ t3 < t4, N(t2) − N(t1) is independent of N(t4) − N(t3)
◮ In particular, for all s > t, N(s) − N(t) is independent of N(t) − N(0)
◮ The process is said to have stationary increments if N(t2) − N(t1) has the same distribution as N(t2 + τ) − N(t1 + τ), ∀τ, ∀t2 > t1
◮ We start with two definitions of the Poisson process
◮ Definition 1: A counting process {N(t), t ≥ 0} is said to be a Poisson process with rate λ > 0 if
  1. N(0) = 0
  2. The process has stationary and independent increments
  3. Pr[N(t) = n] = e^{−λt}(λt)^n/n!, n = 0, 1, · · ·
◮ N(t) is Poisson with parameter λt
◮ E[N(t)] = λt and hence λ is called the rate
◮ Since the process has stationary increments and N(0) = 0, N(t + s) − N(s) would be Poisson with parameter λt for all s, t > 0.
◮ Definition 2: A counting process {N(t), t ≥ 0} is said to be a Poisson process with rate λ > 0 if
  1. N(0) = 0
  2. The process has stationary and independent increments
  3. Pr[N(h) = 1] = λh + o(h) and Pr[N(h) ≥ 2] = o(h)
◮ We say g(h) is o(h) if
  lim_{h→0} g(h)/h = 0
◮ This definition tells us when a Poisson process may be a good model
◮ We will show that both definitions are equivalent
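The equivalence of the two definitions can be illustrated numerically: approximate the process using Definition 2 (independent small-interval increments, each producing an event with probability ≈ λh) and compare the resulting distribution of N(t) with the Poisson pmf of Definition 1. A Python sketch (the values of λ, t and h are illustrative assumptions):

    import numpy as np
    from math import exp, factorial

    rng = np.random.default_rng(8)
    lam, t, h = 2.0, 3.0, 1e-3        # rate, time horizon, small step
    steps = int(t / h)
    trials = 20000

    # Definition-2 style approximation: one event per step with probability lam*h
    counts = rng.binomial(steps, lam * h, size=trials)

    # Compare empirical P[N(t) = n] with e^{-lam t} (lam t)^n / n!
    for n in range(6):
        emp = np.mean(counts == n)
        pmf = exp(-lam * t) * (lam * t) ** n / factorial(n)
        print(f"n={n}:  empirical {emp:.4f}   Poisson pmf {pmf:.4f}")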
◮ Let Pn(t) = Pr[N(t) = n]. Then
  Pn(t + h) = Pr[N(t + h) = n]
  = Pr[N(t) = n, N(t + h) − N(t) = 0] + Pr[N(t) = n − 1, N(t + h) − N(t) = 1]
    + Σ_{k=2}^{n} Pr[N(t) = n − k, N(t + h) − N(t) = k]
  = Pn(t)P0(h) + P_{n−1}(t)P1(h) + o(h)
  = Pn(t)(1 − λh + o(h)) + P_{n−1}(t)(λh + o(h)) + o(h)
  ⇒ (Pn(t + h) − Pn(t))/h = −λPn(t) + λP_{n−1}(t) + o(h)/h
  ⇒ (d/dt)Pn(t) = −λPn(t) + λP_{n−1}(t)
◮ That is,
  (d/dt)Pn(t) + λPn(t) = λP_{n−1}(t)
◮ We need to solve this linear ODE to obtain Pn
◮ The integrating factor is e^{λt}. Writing Pn′(t) = (d/dt)Pn(t),
  e^{λt}(Pn′(t) + λPn(t)) = e^{λt} λ P_{n−1}(t)
  ⇒ (d/dt)( Pn(t) e^{λt} ) = λ e^{λt} P_{n−1}(t)
◮ We need P_{n−1} to solve for Pn. Take n = 1:
  (d/dt)( P1(t) e^{λt} ) = λ e^{λt} P0(t) = λ e^{λt} e^{−λt} = λ
  ⇒ e^{λt} P1(t) = λt + c ⇒ P1(t) = e^{−λt}(λt + c)
◮ Since P1(0) = Pr[N(0) = 1] = 0, c = 0. Hence P1(t) = λt e^{−λt}
◮ We showed: P0(t) = e^{−λt} and P1(t) = λt e^{−λt}
◮ We need to show: Pk(t) = e^{−λt}(λt)^k/k!
◮ Assume it is true till k = n − 1. Then
  (d/dt)( Pn(t) e^{λt} ) = λ e^{λt} P_{n−1}(t) = λ e^{λt} e^{−λt} (λt)^{n−1}/(n − 1)! = λ^n t^{n−1}/(n − 1)!
  ⇒ e^{λt} Pn(t) = λ^n (t^n/n)(1/(n − 1)!) + c ⇒ Pn(t) = e^{−λt}(λt)^n/n!
  where c = 0 because Pn(0) = 0.
◮ This completes the proof that Definition 2 implies Definition 1
◮ Recall:
  Definition 1: N(0) = 0; the process has stationary and independent increments; Pr[N(t) = n] = e^{−λt}(λt)^n/n!, n = 0, 1, · · ·
  Definition 2: N(0) = 0; the process has stationary and independent increments; Pr[N(h) = 1] = λh + o(h) and Pr[N(h) ≥ 2] = o(h)
◮ Now we prove Definition 1 implies Definition 2
◮ We need to only show point(3) of Definition 2 using point ◮ Now we need to show P r[N (h) ≥ 2] = o(h)
(3) of Definition 1
P r[N (h) ≥ 2] = 1 − P r[N (h) = 0] − P r[N (h) = 1]
k
(λt) = 1 − e−λh − λhe−λh
Let P r[N (t) = k] = e−λt
k!
◮ This goes to zero as h → 0
We can use L’Hospital rule
P r[N (h) = 1] = λ h e−λh = λ h + λ h e−λh − 1 = λ h + o(h) ◮
PS Sastry, IISc, Bangalore, 2020 11/32 PS Sastry, IISc, Bangalore, 2020 12/32
◮ These two definitions are equivalent
◮ Since the process has stationary increments, for t2 > t1, N(t2) − N(t1) has the same distribution as N(t2 − t1), which is Poisson with parameter λ(t2 − t1)
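◮ A minimal numerical illustration of the equivalence (a sketch with assumed values λ = 2, t = 3, h = 0.001, not from the slides): by Definition 2, over a fine grid N(t) is approximately a sum of t/h independent Bernoulli(λh) increments; its distribution should match the Poisson pmf of Definition 1.

```python
# Sketch: build N(t) from Bernoulli(lam*h) increments on a fine grid (Definition 2)
# and compare the resulting distribution with the Poisson(lam*t) pmf (Definition 1).
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(0)
lam, t, h, runs = 2.0, 3.0, 1e-3, 20000
steps = int(t / h)

# sum of 'steps' iid Bernoulli(lam*h) increments is Binomial(steps, lam*h)
counts = rng.binomial(steps, lam * h, size=runs)

for n in range(10):
    empirical = np.mean(counts == n)
    poisson = exp(-lam * t) * (lam * t) ** n / factorial(n)
    print(f"n={n}: empirical {empirical:.4f}   Poisson pmf {poisson:.4f}")
```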
Example
◮ Given a specific T0 we want to guess which is the last event before T0
◮ Consider a strategy: we wait till T0 − τ and pick the next event as the last one before T0
◮ The probability of winning for this is

    Pr[exactly 1 event in (T0 − τ, T0)] = λτ e^{−λτ}

◮ We pick τ to maximize this (setting the derivative to zero gives τ = 1/λ, with winning probability e^{−1})

◮ Let {N(t), t ≥ 0} be a Poisson process with rate λ
◮ Suppose each event can be one of two types – Type-I or Type-II
◮ N1(t) = number of Type-I events till t
◮ N2(t) = number of Type-II events till t
◮ Note that N(t) = N1(t) + N2(t), ∀t
◮ Suppose that, independently of everything else, an event is of Type-I with probability p and Type-II with probability (1 − p)
◮ More generally, suppose each customer (event) is, independently, of Type-i with probability pi, i = 1, · · · , K
◮ Let Ni(t) be the number of Type-i customers till t
◮ Then, these are independent Poisson processes with rates λpi, i = 1, · · · , K
◮ (Recall that the sum of independent Poisson random variables is Poisson)
◮ There is an interesting generalization of this
◮ Events are of different types
◮ The type of an event can depend on the time of occurrence but it is independent of everything else
◮ Suppose an event occurring at time t is Type-i with probability pi(t), where pi(t) ≥ 0, ∀i, t and Σ_{i=1}^{K} pi(t) = 1, ∀t
◮ Ni(t) is the number of Type-i events till t
◮ Theorem: Then, at any t, Ni(t), i = 1, · · · , K are independent Poisson random variables with

    E[Ni(t)] = λ ∫_0^t pi(s) ds

◮ Example: Suppose the number of radioactive particles emitted is Poisson with rate λ
◮ We are counting particles using a sensor
◮ Suppose (independently of everything) an emitted particle is detected by our sensor with probability p
◮ Given that we detected K particles till t, what is the expected number of particles emitted?
◮ Let these processes be N(t), N1(t), N2(t) (emitted, detected, undetected)

    E[N(t) | N1(t) = K] = E[N1(t) + N2(t) | N1(t) = K] = K + E[N2(t)] = K + λ(1 − p)t

    where we have used independence of N1 and N2
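◮ A small simulation sketch of the detector example (the values of λ, p, t, K below are assumed, not from the slides): thin a Poisson count with detection probability p and compare the average number emitted, given K detections, with K + λ(1 − p)t.

```python
# Sketch: simulate emitted particles N(t) ~ Poisson(lam*t), detect each independently
# with probability p, and estimate E[N(t) | N1(t) = K].
import numpy as np

rng = np.random.default_rng(1)
lam, p, t, runs = 5.0, 0.3, 10.0, 200000

emitted = rng.poisson(lam * t, size=runs)   # N(t)
detected = rng.binomial(emitted, p)         # N1(t): each particle detected w.p. p

K = 12                                      # condition on N1(t) = K
mask = detected == K
print("simulated E[N(t) | N1(t)=K]:", emitted[mask].mean())
print("formula   K + lam*(1-p)*t  :", K + lam * (1 - p) * t)
```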
◮ Suppose we have n1 people showing symptoms at t (here N1(t) counts the infected people showing symptoms by t and N2(t) those not yet showing symptoms)
◮ We can approximate

    n1 ≈ E[N1(t)] = λ ∫_0^t G(y) dy

◮ Hence we can estimate

    λ̂ = n1 / ∫_0^t G(y) dy

◮ Using this we can approximate

    E[N2(t)] ≈ λ̂ ∫_0^t (1 − G(y)) dy

◮ The Poisson process we considered is called homogeneous because the rate is constant
◮ For a non-homogeneous Poisson process the rate can be changing with time
◮ But we can still use a definition similar to Definition 2

    Pr[N(t + h) − N(t) = 1] = λ(t)h + o(h)

◮ We still stipulate independent increments, though we cannot have stationary increments now
◮ One can show that N(t + s) − N(t) is Poisson with parameter m(t + s) − m(t), where m(τ) = ∫_0^τ λ(s) ds
◮ Suppose Yi are iid and independent of N(t). Then

    X(t) = Σ_{i=1}^{N(t)} Yi

    defines a compound Poisson process
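◮ A sketch of simulating a non-homogeneous Poisson process by thinning a homogeneous one (the rate function and parameters below are assumed examples, not from the slides): generate events at rate λ_max ≥ λ(s), keep an event at time s with probability λ(s)/λ_max, and check that the mean count over [0, T] matches m(T) = ∫_0^T λ(s) ds.

```python
# Sketch: thinning simulation of a non-homogeneous Poisson process.
import numpy as np

rng = np.random.default_rng(2)

def rate(s):                      # assumed example rate function lambda(s)
    return 2.0 + np.sin(s)

lam_max, T, runs = 3.0, 10.0, 5000
counts = []
for _ in range(runs):
    # homogeneous Poisson(lam_max) events on [0, T] via exponential inter-arrival times
    times = np.cumsum(rng.exponential(1.0 / lam_max, size=int(3 * lam_max * T)))
    times = times[times <= T]
    keep = rng.random(times.size) < rate(times) / lam_max   # thinning step
    counts.append(keep.sum())

# m(T) = integral of (2 + sin s) over [0, T] = 2T + (1 - cos T)
print("simulated mean N(T):", np.mean(counts))
print("m(T)               :", 2 * T + (1 - np.cos(T)))
```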
Random Walk
◮ Let Zi be iid with Pr[Zi = +s] = Pr[Zi = −s] = 0.5
◮ Define a continuous-time process X(t) by

    X(nT) = Z1 + Z2 + · · · + Zn
    X(t) = X(nT), for nT ≤ t < (n + 1)T
◮ We have E[Zi] = 0 and E[Zi^2] = s^2
◮ Hence, E[X(nT)] = 0 and E[X^2(nT)] = n s^2
◮ For large n, X(nT)/(s√n) would be approximately Gaussian:

    Pr[ X(nT)/(s√n) ≤ y ] ≈ Φ(y)

    where Φ is the distribution function of the standard Normal
◮ For any t, X(t) is X(nT) for n = [t/T]. Large n would mean large t. Hence

    Pr[X(t) ≤ ms] = Pr[ X(t)/(s√n) ≤ ms/(s√n) ] ≈ Φ(m/√n),  for large t

◮ We are interested in the limit of this process as T → 0
◮ Consider t = nT:

    E[X^2(t)] = n s^2 = s^2 t / T

◮ If we let T → 0 then the variance goes to infinity (the process goes to infinity) unless we let s also go to zero
◮ We actually need s^2 to go to zero at the same rate as T
◮ So, we keep s^2 = αT and let T go to zero
◮ Define

    W(t) = lim_{T→0, s^2 = αT} X(t)

    This is called the Wiener process or Brownian motion. This result is known as Donsker's theorem
◮ Let us intuitively see some properties of W(t)
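◮ A quick numerical sketch (the values of α, t, T below are assumed): simulate the scaled walk with s^2 = αT for a small T and check that the variance of X(t) is close to αt.

```python
# Sketch: scaled random walk with s^2 = alpha*T; Var(X(t)) should be close to alpha*t.
import numpy as np

rng = np.random.default_rng(3)
alpha, t, T, runs = 1.5, 2.0, 1e-2, 20000
s = np.sqrt(alpha * T)                  # keep s^2 = alpha*T
n = int(t / T)                          # number of steps up to time t

steps = rng.choice([+s, -s], size=(runs, n))
Xt = steps.sum(axis=1)                  # X(t) = Z_1 + ... + Z_n

print("sample variance of X(t):", Xt.var())
print("alpha * t              :", alpha * t)
```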
◮ W(t) is the limit of X(t) as T goes to zero
◮ As T goes to zero, any t is ‘large n’
◮ Hence we can expect

    Pr[W(t) ≤ w] = Φ( w/√(αt) )   ⇒   W(t) ∼ N(0, αt)

◮ X((m + n)T) − X(nT) is a sum of Zn+1, · · · , Zn+m, while X(nT) is a sum of Z1, · · · , Zn
◮ Thus, X(nT) is independent of X((m + n)T) − X(nT)
◮ Hence the X(nT) process has independent increments
◮ Hence, we can expect W(t) to be a process with independent increments
◮ X((m + n + k)T) − X((n + k)T) and X((m + n)T) − X(nT) both are sums of m of the Zi's
◮ Hence both would have the same distribution
◮ Thus X(nT) would also have stationary increments
◮ Hence we also expect W(t) to have stationary increments
◮ Thus, W(t) should be a process with stationary and independent increments, and for each t, W(t) is Gaussian with zero mean and variance proportional to t
◮ We will now formally define Brownian motion using these properties

◮ Let {X(t), t ≥ 0} be a continuous-state continuous-time process. This process is called a Brownian motion if
  1. X(0) = 0
  2. The process has stationary and independent increments
  3. For every t > 0, X(t) is Gaussian with mean 0 and variance σ^2 t
◮ Let B(t) = X(t)/σ. Then the variance of B(t) is t
◮ {B(t), t ≥ 0} is called standard Brownian Motion
◮ Let Y(t) = X(t) + µt. Then Y(t) has non-zero mean; the mean can be a function of time
◮ {Y(t), t ≥ 0} is called Brownian motion with a drift
◮ Suppose we want the joint distribution of X(t1), X(t2), · · · , X(tn)
◮ Let t1 < t2 < · · · < tn
◮ Define random variables Y1, · · · , Yn by

    Y1 = X(t1),   Yi = X(ti) − X(ti−1), i = 2, · · · , n

◮ We know the Yi are independent because the process has independent increments
◮ The transformation is invertible:

    X(t1) = Y1
    X(t2) = Y1 + Y2
    X(t3) = Y1 + Y2 + Y3
    ...
    X(tn) = Y1 + Y2 + · · · + Yn

◮ Hence we can get the joint density of X(t1), · · · , X(tn) in terms of the joint density of Y1, · · · , Yn
◮ This is how we can get the nth order density for any continuous-state process with independent increments
◮ Y1, · · · , Yn are independent and Gaussian and hence are jointly Gaussian
◮ Hence X(t1), · · · , X(tn) are jointly Gaussian
◮ Thus all nth order distributions are Gaussian
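◮ This also gives the standard way to sample (X(t1), · · · , X(tn)) for a Brownian motion: draw the independent Gaussian increments Yi and accumulate them. A minimal sketch (σ and the time points below are assumed values):

```python
# Sketch: sample X(t1), ..., X(tn) of a Brownian motion via independent Gaussian increments.
import numpy as np

rng = np.random.default_rng(4)
sigma = 1.0
ts = np.array([0.5, 1.0, 2.5, 4.0])            # t1 < t2 < ... < tn (assumed values)

dts = np.diff(np.concatenate(([0.0], ts)))     # t1, t2 - t1, ..., tn - t_{n-1}
Y = rng.normal(0.0, sigma * np.sqrt(dts))      # independent increments Y_i ~ N(0, sigma^2 * dt_i)
X = np.cumsum(Y)                               # X(t_i) = Y_1 + ... + Y_i

print(dict(zip(ts.tolist(), X.tolist())))
```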
White Noise
◮ Assume V(t) is Gaussian. Let

    X(t) = ∫_0^t V(τ) dτ
◮ Consider a discrete-time process {Xn, n = 0, 1, · · · } with E|Xn| < ∞, ∀n
◮ It is called a martingale if E[Xn+1 | X0, X1, · · · , Xn] = Xn, ∀n
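◮ To make the defining property concrete, here is a small simulation sketch (mine, using the symmetric random walk as an assumed example martingale): fix a history, simulate many continuations, and check that the average of Xn+1 is close to Xn.

```python
# Sketch: check the martingale property E[X_{n+1} | history] = X_n for a symmetric random walk.
import numpy as np

rng = np.random.default_rng(5)

# Example martingale: X_n = Z_1 + ... + Z_n with iid zero-mean steps Z_i in {+1, -1}
history = rng.choice([+1, -1], size=10).cumsum()
Xn = history[-1]

# Many independent one-step continuations of this same history
next_vals = Xn + rng.choice([+1, -1], size=100000)
print("X_n                        :", Xn)
print("estimated E[X_{n+1} | ... ]:", next_vals.mean())   # should be close to X_n
```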
◮ We have mentioned martingales as an example of discrete-time continuous-state processes
◮ A stochastic iterative algorithm essentially generates a discrete-time continuous-state process
◮ Martingales are very useful in analyzing convergence of many stochastic algorithms
◮ While we mentioned only discrete-time martingales, one can similarly have continuous-time martingales

Continuous-Time Markov Chains
◮ Let {X(t), t ≥ 0} be a continuous-time discrete-state process
◮ Let X(t) take non-negative integer values
◮ It is called a continuous-time Markov chain if

    Pr[X(t + s) = j | X(s) = i, X(u) ∈ Au, 0 ≤ u < s] = Pr[X(t + s) = j | X(s) = i]

◮ Only the most recent past matters
◮ It is called a homogeneous chain if Pr[X(t + s) = j | X(s) = i] does not depend on s (we then write it as Pij(t))
Example: Birth-Death process
◮ Suppose λn = λ, ∀n and µn = 0, ∀n
◮ This is called a pure birth process
◮ The process spends time Ti ∼ exponential(λ) in state i and then moves to state i + 1
◮ This is the Poisson process

◮ Consider a queuing system
◮ Suppose people joining the queue form a Poisson process with rate λ
◮ Suppose the time to service each customer is independent and exponential with parameter µ
◮ We assume that the arrival and service processes are independent
◮ Then this is a birth-death process with λn = λ, n ≥ 0 and µn = µ, n ≥ 1
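◮ A short simulation sketch of this queue (λ, µ and the run length below are assumed; the comparison value λ/(µ − λ) is the standard M/M/1 mean queue length, not derived in these slides): hold in state n for an Exponential(λn + µn) time, then move up with probability λn/(λn + µn) and down otherwise.

```python
# Sketch: simulate the M/M/1 queue as a birth-death chain and estimate the
# time-average number in the system.
import numpy as np

rng = np.random.default_rng(6)
lam, mu, horizon = 0.8, 1.0, 10000.0     # assumed arrival/service rates and run length

t, n, area = 0.0, 0, 0.0                 # 'area' accumulates the integral of n dt
while t < horizon:
    rate = lam + (mu if n > 0 else 0.0)  # total transition rate out of state n
    hold = rng.exponential(1.0 / rate)   # holding time in state n
    area += n * min(hold, horizon - t)
    t += hold
    if n == 0 or rng.random() < lam / rate:
        n += 1                           # birth (arrival)
    else:
        n -= 1                           # death (service completion)

print("time-average queue length:", area / horizon)   # should be close to lam/(mu-lam)
print("lam/(mu - lam)           :", lam / (mu - lam))
```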
◮ (Here Yi denotes the time taken to go from state i to state i + 1, and Ii indicates whether the first transition out of state i is to i + 1)
◮ We also have

    Pr[Ii = 1] = zi,i+1 = λi/(λi + µi);   Pr[Ii = 0] = µi/(λi + µi)

◮ Now we can calculate E[Yi] as

    E[Yi] = Pr[Ii = 1] E[Yi | Ii = 1] + Pr[Ii = 0] E[Yi | Ii = 0]
          = (λi/(λi + µi)) (1/(λi + µi)) + (µi/(λi + µi)) (1/(λi + µi) + E[Yi−1] + E[Yi])
          = 1/(λi + µi) + (µi/(λi + µi)) (E[Yi−1] + E[Yi])

    ⇒  E[Yi] (1 − µi/(λi + µi)) = 1/(λi + µi) + (µi/(λi + µi)) E[Yi−1]

    ⇒  E[Yi] = 1/λi + (µi/λi) E[Yi−1]

◮ Thus we get

    E[Yi] = 1/λi + (µi/λi) E[Yi−1],  i ≥ 1

◮ Since E[Y0] = 1/λ0, we have a formula for E[Yi]
◮ For example,

    E[Y1] = 1/λ1 + µ1/(λ1 λ0);   E[Y2] = 1/λ2 + µ2/(λ2 λ1) + µ2 µ1/(λ2 λ1 λ0)

◮ The expected time to go from i to j, i < j, can now be computed as

    E[Yi] + E[Yi+1] + · · · + E[Yj−1]

◮ Note that these formulas are only for birth-death processes
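◮ The recursion is easy to evaluate numerically; below is a minimal sketch (the helper names and the rates used are my own assumed example, not from the slides).

```python
# Sketch: evaluate E[Y_i] recursively and sum them for expected passage times.
def expected_up_times(lams, mus):
    """E[Y_i] = 1/lam_i + (mu_i/lam_i) * E[Y_{i-1}], with E[Y_0] = 1/lam_0."""
    EY = [1.0 / lams[0]]
    for i in range(1, len(lams)):
        EY.append(1.0 / lams[i] + (mus[i] / lams[i]) * EY[i - 1])
    return EY

def expected_passage_time(i, j, lams, mus):
    """Expected time to go from state i to state j (i < j): E[Y_i] + ... + E[Y_{j-1}]."""
    EY = expected_up_times(lams, mus)
    return sum(EY[i:j])

# assumed example rates: lam_n = 2 for all n, mu_0 = 0, mu_n = 1 for n >= 1
lams = [2.0] * 6
mus = [0.0] + [1.0] * 5
print([round(y, 4) for y in expected_up_times(lams, mus)])
print("E[time 1 -> 4]:", expected_passage_time(1, 4, lams, mus))
```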
◮ By definition, 1 − Pii(h) is the probability that the chain that started in i is not in i at time h
◮ This is equivalent to there being a transition in time h, and transitions out of i occur at the rate νi. Also, two or more transitions in h is o(h)
◮ Hence

    1 − Pii(h) = νi h + o(h)

◮ Thus qii = νi. It is the rate of transition out of i
◮ We also have

    νi = qii = lim_{h→0} (1 − Pii(h))/h = lim_{h→0} Σ_{j≠i} Pij(h)/h = Σ_{j≠i} qij

◮ By definition, Pij(h) = qij h + o(h), i ≠ j
◮ Hence qij is the rate at which transitions out of i into j are occurring
◮ Transitions out of i occur with rate νi and a fraction zij of these are into j
◮ Hence, qij = νi zij, i ≠ j
◮ Thus, we got

    νi = Σ_{j≠i} qij,   zij = qij / Σ_{j≠i} qij,   qii = Σ_{j≠i} qij

◮ The {qij} are called the infinitesimal generator of the process
◮ A continuous-time Markov chain is specified by these qij
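◮ As an illustration of how the qij specify a chain, here is a small sketch (assumed birth-death rates on a truncated state space, my own example): assemble the off-diagonal qij and recover νi and zij from them.

```python
# Sketch: off-diagonal rates q_ij for a truncated birth-death chain; nu_i and z_ij follow.
import numpy as np

N = 5                                 # assumed truncated state space {0, ..., 4}
lam, mu = 2.0, 1.0                    # assumed birth and death rates

q = np.zeros((N, N))                  # q_ij for i != j (diagonal left at 0 here)
for i in range(N):
    if i + 1 < N:
        q[i, i + 1] = lam             # birth: i -> i+1 at rate lam
    if i - 1 >= 0:
        q[i, i - 1] = mu              # death: i -> i-1 at rate mu

nu = q.sum(axis=1)                    # nu_i = sum_{j != i} q_ij: total rate out of i
z = q / nu[:, None]                   # z_ij = q_ij / nu_i: jump probabilities
print("nu:", nu)
print("z:\n", z)
```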
Poisson process as a special case