
Probability Cookbook

Pantelis Sopasakis

January 24, 2019


Contents

Abstract 5

1 Probability Theory 7
1.1 General Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.1 Measurable and Probability spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.3 Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1.4 The Radon-Nikodym Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1.5 Probability distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.6 Probability density function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.1.7 Decomposition of measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.1.8 Lp spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.1.9 Product spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.1.10 Transition Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.11 Law invariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.12 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2 Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2.1 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2.2 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.2.3 Construction of probability spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3 Inequalities on Probability Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3.1 Inequalities on Lp spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3.2 Generic inequalities involving probabilities or expectations . . . . . . . . . . . . . . 21
1.3.3 Involving sums or averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.4 Convergence of random processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.4.1 Convergence of measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.4.2 Almost sure convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.4.3 Convergence in probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.4.4 Convergence in Lp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.4.5 Convergence in distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.4.6 Tail events and 0-1 Laws . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.4.7 Laws of large numbers and CLTs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.5 Standard Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.5.1 Uniform distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.5.2 Normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.5.3 Binomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.5.4 Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2 Multivariate distributions 33
2.1 Multivariate random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2 Copulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.1 Sklar’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.2 Examples of copulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3 Stochastic Processes 37
3.1 General . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Random walk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4 Brownian motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5 Markov processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43


3.6 Markov decision processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43


3.6.1 General . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6.2 Optimal control problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.6.3 Linear-Quadratic problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4 Stochastic Differential Equations 47


4.1 Itô Integral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Stochastic differential equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5 Information Theory 51
5.1 Entropy and Conditional Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 KL divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

6 Risk 53
6.1 Risk measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2 Popular risk measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

7 Uncertainty Quantification 57
7.1 Polynomial chaos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.1.1 The Kosambi-Karhunen-Loève theorem . . . . . . . . . . . . . . . . . . . . . . . . 57
7.1.2 Orthogonal polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.1.3 Generalized polynomial chaos expansions . . . . . . . . . . . . . . . . . . . . . . . 59

8 Bibliography with comments 61

About the author 63

Abstract
This document is intended to serve as a collection of important results in general probability theory,
stochastic processes, uncertainty quantification, risk measures and more. It can be used for a quick
brush-up, or as a reference or cheat sheet, by graduate students and researchers in mathematics and
engineering. The reader will find a list of bibliographic references with comments at the end of this
document. This is still work in progress, so several important results are still missing.
This document, as well as its future versions, will be available at https://mathematix.wordpress.com/probability-cookbook.

1 Probability Theory
1.1 General Probability Theory
1.1.1 Measurable and Probability spaces
1. (σ-algebra). Let X be a nonempty set. A collection F of subsets of X is called a σ-algebra if (i)
X ∈ F, (ii) Ac ∈ F whenever A ∈ F, (iii) if A1, A2, . . . ∈ F, then ∪n An ∈ F. The space X
equipped with a σ-algebra F is called a measurable space.

2. (d-system). A collection D of subsets of X is called a d-system or a Dynkin class if (i) X ∈ D, (ii)
A \ B ∈ D whenever A, B ∈ D and A ⊇ B, (iii) A ∈ D whenever An ∈ D and An ↑ A (meaning,
Ak ⊆ Ak+1 and ∪_{k∈IN} Ak = A).

3. (p-system). A collection of sets P in X is called a p-system if A ∩ B ∈ P whenever A, B ∈ P.

4. A collection of sets is a σ-algebra if and only if it is both a p- and a d-system.

5. (Smallest σ-algebra). Let H be a collection of sets in X. The smallest collection of sets which
contains H and is a σ-algebra exists and is denoted by σ(H).
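For a finite ground set, #5 can be made computational: σ(H) is obtained by closing H under complements and unions until nothing new appears (countable unions reduce to finite ones here). A minimal Python sketch; the function name and the example collection are ours, not from the text:

```python
from itertools import combinations

def generated_sigma_algebra(X, H):
    """Close H under complement and union until stable; on a finite
    ground set X this yields sigma(H)."""
    F = {frozenset(), frozenset(X)} | {frozenset(A) for A in H}
    changed = True
    while changed:
        changed = False
        for A in list(F):
            comp = frozenset(X) - A          # closure under complements
            if comp not in F:
                F.add(comp); changed = True
        for A, B in combinations(list(F), 2):
            U = A | B                         # closure under (finite) unions
            if U not in F:
                F.add(U); changed = True
    return F

X = frozenset({1, 2, 3, 4})
F = generated_sigma_algebra(X, [{1}, {1, 2}])
```

Here σ({{1}, {1, 2}}) has atoms {1}, {2}, {3, 4}, hence 2³ = 8 members.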

6. (Monotone class theorem). If a d-system D contains a p-system P, then it also contains σ(P).

7. (Borel σ-algebra). On IR, the σ-algebra σ({(a, b); a < b}) is called the Borel σ-algebra on IR which
we denote by BIR . For topological spaces (X, τ ), the Borel σ-algebra is defined as BX = σ(τ ), i.e.,
it is the smallest σ-algebra which contains all open sets. BIR is generated by:
i. The open intervals (a, b)
ii. The closed intervals [a, b]
iii. All sets of the form [a, b) or (a, b]
iv. Open rays (a, ∞) or (−∞, a)
v. Closed rays [a, ∞) or (−∞, a]

8. (Measure). A function µ : F → [0, +∞] is called a measure if for every sequence of disjoint sets An
from F, µ(∪n An) = Σn µ(An).

9. (Properties of measures). The following hold:


i. (Empty set is negligible). µ(∅) = 0 [Indeed, µ(A) = µ(A ∪ ∅) = µ(A) + µ(∅) for all A ∈ F]
ii. (Monotonicity). A ⊆ B implies µ(A) ≤ µ(B) [Indeed, µ(B) = µ(A ∪ (B \ A))]
iii. (Boole’s inequality). For all An ∈ F, µ(∪n An) ≤ Σn µ(An)

iv. (Sequential continuity). If An ↑ A, then µ(An ) ↑ µ(A).

10. (Equality of measures). Let µ, ν be two measures on a measurable space (X, F) and let G be a
p-system generating F. If µ(A) = ν(A) for all A ∈ G, then µ(B) = ν(B) for all B ∈ F. As presented
in #7 above, p-systems are often available and have simple forms.

11. (Completeness). A measure space (X, F, µ) is called complete if the following holds:

A ∈ F, µ(A) = 0, B ⊆ A ⇒ B ∈ F.

Of course, by the monotonicity property in #9–ii, if (X, F, µ) is a complete measure space then every
such B satisfies µ(B) = 0.

12. (Completion). Let (X, F, µ) be a measure space and define the set of negligible sets of µ as Zµ =
{N ⊆ X : ∃N′ ⊇ N, N′ ∈ F s.t. µ(N′) = 0}. Let F′ be the σ-algebra generated by F ∪ Zµ. Then
i. Every B ∈ F′ can be written as B = A ∪ N with A ∈ F and N ∈ Zµ
ii. Define µ′(A ∪ N) = µ(A); this is a measure on (X, F′) which renders the space (X, F′, µ′)
complete.
13. (Lebesgue measure on IR and IRn). It suffices to define the Lebesgue measure on (IR, BIR) on the
p-system {(a, b), a < b}; it is λ((a, b)) = b − a. This extends to a measure on (IR, BIR). Likewise,
the collection of n-dimensional rectangles {(a1, b1) × . . . × (an, bn)} is a p-system which generates
BIRn; the Lebesgue measure on (IRn, BIRn) is λ(∏_{i=1}^n (ai, bi)) = ∏_{i=1}^n (bi − ai).
14. (Lebesgue measurable sets). The completion of the Lebesgue measure defines the class of Lebesgue-
measurable sets.
15. (Negligible boundary). If a set C ⊆ IRn has a boundary whose Lebesgue measure is 0, then C is
Lebesgue measurable.
16. (Independent events). Let E1 , E2 be two events from (Ω, F, P); we say that E1 and E2 are inde-
pendent if P[E1 ∩ E2 ] = P[E1 ]P[E2 ].
17. (Independent σ-algebras). We say that two σ-algebras F1 and F2 on Ω are independent if for any
E1 ∈ F1 and E2 ∈ F2, E1 and E2 are independent. Note that E1 ∩ E2 is a member of the σ-algebra
F1 ∨ F2 generated by F1 ∪ F2.
18. (Atom). Let (Ω, F, µ) be a measure space. A set A ∈ F is called an atom if µ(A) > 0 and for every
B ⊂ A with µ(B) < µ(A) we have µ(B) = 0. A space without atoms is called non-atomic.¹

1.1.2 Random variables


1. (Measurable function). A function f : (X, F) → (Y, G) (between two measurable spaces) is called
measurable if f −1 (G) ∈ F for all G ∈ G (i.e., if it inverts all measurable sets to measurable ones).
2. (Measurability test). Let F, G be σ-algebras on the nonempty sets X and Y . Let G0 be a p-system
which generates G. A function f : (X, F) → (Y, G) is measurable if and only if f −1 (G0 ) ∈ F for all
G0 ∈ G0 (it suffices to check the measurability condition on a p-system).
3. (σ-algebra generated by f ). Let f : (X, F) → (Y, G) (between two measurable spaces) be a mea-
surable function. The set
σ(f) := {f⁻¹(B) | B ∈ G},
is a sub-σ-algebra of F and is called the σ-algebra generated by f .
4. (Preservation of measurability). Let f, g : Ω → IR be two measurable functions on (Ω, F). Then,
the functions h1 (x) = f (x) + g(x), h2 (x) = f (x) − g(x), h3 (x) = max{f (x), g(x)}, h4 (x) =
min(f (x), g(x)), h5 (x) = f (x)g(x) are measurable. For all α ∈ IR, h6 (x) = αf (x) is measurable.
5. (Measurability of supremum/infimum). Let (fn)n be a sequence of real-valued measurable functions.
Then supn fn and inf n fn are measurable.
6. (Sub/sup-level sets) Let f : (X, F) → IR. The following are equivalent:
i. f is measurable,
ii. Its sub-level sets, that is sets of the form lev≤α f := {x ∈ X : f (x) ≤ α} are measurable,
iii. Its sup-level sets, that is sets of the form lev≥α f := {x ∈ X : f (x) ≥ α} are measurable.
7. (Random variable). A real-valued random variable X : (Ω, F, P) → (IR, BIR ) is a measurable
function X from a probability space (Ω, F, P) to IR, equipped with the Borel σ-algebra, that is, for
every Borel set B, X −1 (B) ∈ F.
8. Every nonnegative (real-valued) random variable X with values in (IR+, BIR+) can be written as

X(ω) = ∫_0^∞ 1X(ω)≥t dt.
¹A special class of spaces with (only) atoms are the discrete probability spaces where F is generated by a discrete — often
finite — set of events. Several results in measure theory require that the space be non-atomic. However, we may often
prove these results for discrete or finite spaces.


9. (Increasing functions). Every increasing function f : IR → IR is Borel-measurable.


10. (Semi-continuous functions). Every lower semi-continuous function X : Ω → IR (where Ω is
assumed to be equipped with a topology) is Borel-measurable.
11. (Push-forward measure) [3]. Given measurable spaces (X , F) and (Y, G), a measurable mapping
f : X → Y and a (probability) measure µ on (X , F), the push-forward of µ is defined to be the
measure f∗µ on (Y, G) given by

(f∗µ)(B) = µ(f⁻¹(B)) = µ({ω | f(ω) ∈ B}),

for B ∈ G.
12. (Change of variables). Let F be a random variable on the probability space (Ω, F, P) and let F∗P
be the push-forward measure. A random variable X is integrable with respect to the push-forward
measure F∗P if and only if X ◦ F is P-integrable, in which case the integrals coincide:

∫ X d(F∗P) = ∫ (X ◦ F) dP.

13. (Measures from random variables). Let X be a nonnegative random variable on (Ω, F, P). We may
use X to define the measure

ν(A) = ∫_A X dP,

for A ∈ F. This is a positive measure which for short we denote as ν = XP and it satisfies

∫_A Y dν = ∫_A XY dP,

for all random variables Y for which the integrals exist.


14. (Compositions). Let f : (X, FX) → (Y, FY) and g : (Y, FY) → (Z, FZ) be two measurable functions.
Then, the function h : (X, FX) ∋ x ↦ h(x) := g(f(x)) ∈ (Z, FZ) is measurable.
15. (Simple function; definition). A simple function is one of the form

f(x) = Σ_{k=1}^n αk 1Ak(x),

where 1Ak is the characteristic function of a measurable set Ak, that is, 1Ak(x) = 1 if x ∈ Ak and
1Ak(x) = 0 otherwise.

16. (Characterization of measurability). A function f : (X, F) → IR is F-measurable if and only if it is


the point-wise limit of a sequence of simple functions. A function f : (X, F) → IR+ is F-measurable
if and only if it is the point-wise limit of an increasing sequence of simple functions.
17. (Continuity and measurability). Every continuous function f : X → IR, with X a topological space equipped with its Borel σ-algebra, is Borel-measurable.
18. (Monotone class of functions). Let M be a collection of functions f : (X, F) → IR; let M+ be all
positive functions in M and Mb all bounded functions in M . We say that M is a monotone class
of functions if (i) 1 ∈ M , (ii) if f, g ∈ Mb and a, b ∈ IR, then af + bg ∈ M and (iii) if (fn )n ⊆ M+
and fn ↑ f , then f ∈ M .
19. (Monotone class theorem for functions). Let M be a monotone class of functions on (X, F). Suppose
that F is generated by some p-system C and that 1A ∈ M for all A ∈ C. Then, M includes all positive
F-measurable functions and all bounded F-measurable functions.
20. (Simple function approximation theorem). Let X be an extended-real-valued Lebesgue-measurable
function defined on a Lebesgue measurable set E. Then there exists a sequence {φk}k∈IN of simple
functions² on E such that
i. φk → X, point-wise on E
ii. |φk| ≤ |X| on E for all k ∈ IN
If X ≥ 0 then there exists a sequence of point-wise increasing simple functions with these properties.

²A simple function is a finite linear combination of indicator functions of measurable sets, that is, simple functions are
written as φ(x) = Σ_{i=1}^n αi 1Ai(x).

21. (Simple function approximation trick). Let f be a nonnegative real-valued measurable function,
f : (Ω, F, P) → (IR, BIR). Define

φk(x) = (j − 1)/2^k, if (j − 1)/2^k ≤ f(x) < j/2^k (j = 1, . . . , k2^k),
φk(x) = k, if f(x) ≥ k.

Then,
i. The sets {x : f(x) ≥ k} and {x : (j − 1)/2^k ≤ f(x) < j/2^k} are measurable because f is measurable
ii. φk are measurable for all k ∈ IN
iii. φk(x) ≤ φk+1(x) for all k ∈ IN and for all x ∈ Ω
iv. Let E ⊆ Ω be such that supx∈E f(x) ≤ M. Then supx∈E |f(x) − φk(x)| ≤ 1/2^k for all k ≥ M
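The trick in #21 is easy to check numerically. Below is a sketch (the helper names and the example function are ours) of φk for a nonnegative f, verifying monotonicity in k and the 1/2^k error bound on a set where f is bounded:

```python
import math

def phi(f, x, k):
    # phi_k(x) = (j-1)/2^k on {(j-1)/2^k <= f(x) < j/2^k}, and k where f(x) >= k
    v = f(x)
    if v >= k:
        return float(k)
    return math.floor(v * 2**k) / 2**k

f = lambda x: x * x                      # nonnegative, bounded by 1 on [0, 1)
grid = [i / 100 for i in range(100)]

errs = [abs(f(x) - phi(f, x, 5)) for x in grid]          # property iv with M = 1
monotone = all(phi(f, x, 5) <= phi(f, x, 6) for x in grid)  # property iii
```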

1.1.3 Limits
Limits of sequences of events
1. (Nested sequences and probabilities). Let (En)n be a non-increasing sequence of events (En ⊇ En+1
for all n ∈ IN). Then limn P[En] exists and

P[∩n En] = limn P[En].

If (En)n is a nondecreasing sequence (En ⊆ En+1 for all n ∈ IN), then

P[∪n En] = limn P[En].

2. (Limit inferior). For a sequence of events (En)n, the limit inferior is denoted by lim infn En
and is defined as

lim infn En = ∪_{n∈IN} ∩_{m≥n} Em = {x : x ∈ En for all but finitely many n ∈ IN}.

3. (Limit superior). The limit superior of (En)n, lim supn En, is

lim supn En = ∩_{n∈IN} ∪_{m≥n} Em = {x : x ∈ En infinitely often}.

4. (Limits of complements). The limit (superior/inferior) of a sequence of complements is the complement
of the limit:

lim infn Enc = (lim supn En)c,
lim supn Enc = (lim infn En)c.

5. (Relationship between limits). It holds that

lim infn En ⊆ lim supn En.

6. (Probabilities of lim inf En and lim sup En). The sets lim infn En and lim supn En are measurable
and

P[lim infn En] ≤ lim infn P[En] ≤ lim supn P[En] ≤ P[lim supn En].


7. (A result reminiscent of Baire’s category theorem). Let (En )n be a sequence of almost sure events.
Then P[∩n En ] = 1.

8. (Borel-Cantelli lemma). Let (En)n be a sequence of events over (Ω, F, P). The following hold:
i. If Σ_{n=1}^∞ P[En] < ∞, then P[lim supn En] = 0
ii. If (En)n are independent events such that Σ_{n=1}^∞ P[En] = ∞, then P[lim supn En] = 1.
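A quick seeded simulation of the Borel-Cantelli dichotomy, with independent events En = {Un < pn}; the probabilities pn = 1/n² (summable) and pn = 1/n (divergent) are our illustrative choices:

```python
import random

random.seed(0)
N = 200_000

# Summable case: sum 1/n^2 < infinity, so only finitely many E_n occur (a.s.).
occ_summable = [n for n in range(1, N + 1) if random.random() < 1 / n**2]

# Divergent case: sum 1/n = infinity and the events are independent,
# so E_n occurs infinitely often almost surely.
occ_divergent = [n for n in range(1, N + 1) if random.random() < 1 / n]
```

In a typical run the first list contains only a handful of small indices, while the second keeps producing events as N grows.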

9. (Corollary: Borel 0-1 law). If (En )n is a sequence of independent events, then P[lim supn En ] ∈
{0, 1} (according to the summability of (P[En ])n ).

10. (Kochen-Stone lemma). Let (En)n be a sequence of events. Then,

P[lim supn En] ≥ lim supn (Σ_{k=1}^n P[Ek])² / (Σ_{k=1}^n Σ_{j=1}^n P[Ek ∩ Ej]).

11. (Corollary of the Kochen-Stone lemma). If for i ≠ j, Ei and Ej are either independent or satisfy
P[Ei ∩ Ej] ≤ P[Ei]P[Ej], and Σ_{n=1}^∞ P[En] = ∞, then P[lim supn En] = 1.

Limits of sequences of random variables


1. (Lebesgue’s monotone convergence theorem). Let (fn )n be an increasing sequence of nonnegative
Borel functions and let f := limn fn (in the sense fn → f point-wise a.e.). Then IE[fn ] ↑ IE[f ].

2. (Lebesgue’s Dominated Convergence Theorem). Let Xn be real-valued RVs over (Ω, F, P). Suppose
that Xn converges point-wise to X and is dominated by a Y ∈ L1(Ω, F, P), that is, |Xn| ≤ Y P-a.s.
for all n ∈ IN. Then, X ∈ L1(Ω, F, P) and

limn IE[|Xn − X|] = 0,

which implies

limn IE[Xn] = IE[X].
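A numerical illustration of #2 on ([0, 1], BIR, Lebesgue): Xn(x) = xⁿ → 0 a.e. and is dominated by the integrable constant 1, so IE[Xn] → 0. The midpoint quadrature below is our stand-in for the Lebesgue integral:

```python
def integral(f, m=10_000):
    # midpoint rule on [0, 1], a proxy for the Lebesgue integral
    return sum(f((i + 0.5) / m) for i in range(m)) / m

# E[X_n] = 1/(n+1) -> 0 = E[lim X_n], as dominated convergence predicts
vals = [integral(lambda x, n=n: x**n) for n in range(1, 51)]
```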

3. (Dominated convergence in Lp). For p ∈ [1, ∞) and a sequence of random variables Xk : (Ω, F, P) →
IR, assume that Xk → X almost everywhere (X(ω) = limk Xk(ω) P-a.e.) and that there is Y ∈
Lp(Ω, F, P) so that |Xk| ≤ Y. Then,
i. Xk ∈ Lp(Ω, F, P) for all k ∈ IN,
ii. X ∈ Lp(Ω, F, P),
iii. Xk → X in Lp(Ω, F, P), that is, limk ‖Xk − X‖p = 0.

4. (Consequence of the dominated convergence theorem) [8]. Let {Ek}_{k=1}^∞ be a collection of disjoint
events and let E = ∪k Ek. Then,

∫_E f = Σ_{k=1}^∞ ∫_{Ek} f.

5. (Bounded convergence). If Xk → X almost surely and supk |Xk | ≤ b for some constant b > 0, then
IE[Xk ] → IE[X] and IE[|X|] ≤ b.

6. (Fatou’s lemma). Let Xn ≥ 0 be a sequence of random variables. Then,

IE[lim infn Xn] ≤ lim infn IE[Xn].
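The inequality in Fatou’s lemma can be strict. A finite sketch (all names ours) with Ω = {0, 1}, P uniform, and Xn flip-flopping between the two indicator functions:

```python
P = {0: 0.5, 1: 0.5}                       # uniform measure on Omega = {0, 1}

def X(n, w):
    # X_n = indicator of {0} for even n, indicator of {1} for odd n
    return 1.0 if w == n % 2 else 0.0

def E(f):
    return sum(f(w) * p for w, p in P.items())

tail = range(1000, 1100)                   # a long tail window stands in for liminf
liminf_X = {w: min(X(n, w) for n in tail) for w in P}

lhs = E(lambda w: liminf_X[w])             # E[liminf X_n] = 0
rhs = min(E(lambda w, n=n: X(n, w)) for n in tail)  # liminf E[X_n] = 1/2
```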

7. (Fatou’s lemma with varying measures). For a sequence of nonnegative random variables Xn ≥ 0
over (Ω, F, P), and a sequence of (probability) measures µn which converge strongly to a (probability)
measure µ (that is, µn(A) → µ(A) for all A ∈ F), we have

IEµ[lim infn Xn] ≤ lim infn IEµn[Xn].


8. (Reverse Fatou’s lemma). Let Xn ≥ 0 be a sequence of nonnegative random variables over (Ω, F, P)
and assume there is a Y ∈ L1(Ω, F, P) so that Xn ≤ Y. Then

lim supn IE[Xn] ≤ IE[lim supn Xn].

9. (Integrable lower bound). Let Xn be a sequence of random variables over (Ω, F, P). Suppose there
exists an integrable Y ≥ 0 such that Xn ≥ −Y for all n ∈ IN. Then,

IE[lim infn Xn] ≤ lim infn IE[Xn].

10. (Beppo Levi’s Theorem). Let Xk be a sequence of nonnegative random variables on (Ω, F, P) with
0 ≤ X1 ≤ X2 ≤ . . . and let X(ω) = limk→∞ Xk(ω). Then X is a random variable and

limk→∞ IE[Xk] = IE[limk→∞ Xk].

11. (Beppo Levi’s Theorem for series). Let Xk be a sequence of nonnegative integrable random variables
on (Ω, F, P) and let Yk = Σ_{j=0}^k Xj. Assume that Σ_{k=1}^∞ IE[Xk] converges. Then (Yk)k satisfies the
conditions of the BL theorem and

Σ_{k=1}^∞ IE[Xk] = IE[Σ_{k=1}^∞ Xk].

12. (Uniform integrability – definition) [7]. A collection {Xt}t∈T is said to be uniformly integrable if
supt∈T IE[|Xt| 1_{|Xt|>x}] → 0 as x → ∞.
13. (Constant absolutely integrable families are uniformly integrable) [7]. The constant family
{Xt = Y}t∈T with IE[|Y|] < ∞ is uniformly integrable.
14. (Uniform boundedness in Lp, p > 1, implies uniform integrability). If {Xt}t∈T is uniformly bounded
in Lp, p > 1 (that is, IE[|Xt|^p] ≤ c for some c > 0 and all t ∈ T), then it is uniformly integrable.
15. (Convergence under uniform integrability) [7]. If Xk → X a.s. and {Xk}k is uniformly integrable,
then
i. IE[|X|] < ∞
ii. IE[Xk] → IE[X]
iii. IE[|Xk − X|] → 0

1.1.4 The Radon-Nikodym Theorem


1. (Absolute continuity). Let (X , G) be a measurable space and µ and ν two measures on it. We say
that ν is absolutely continuous with respect to µ if for all A ∈ G, ν(A) = 0 whenever µ(A) = 0. We
denote this by ν ≪ µ.
2. (Radon-Nikodym). Let (X , G) be a measurable space and let ν be a σ-finite measure on (X , G) which is
absolutely continuous with respect to a σ-finite measure µ on (X , G). Then, there is a measurable function
f : X → [0, ∞) such that for all A ∈ G

ν(A) = ∫_A f dµ.

This function is denoted by f = dν/dµ.
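On a finite space the Radon-Nikodym derivative is concrete: if µ({x}) > 0 wherever ν({x}) > 0, then f(x) = ν({x})/µ({x}) and ν(A) = ∫_A f dµ reduces to a weighted sum. A sketch with made-up numbers:

```python
mu = {'a': 0.2, 'b': 0.3, 'c': 0.5}
nu = {'a': 0.1, 'b': 0.6, 'c': 0.3}        # nu << mu: no mass where mu vanishes

f = {x: nu[x] / mu[x] for x in mu}         # the density dnu/dmu, a pointwise ratio

def nu_from_density(A):
    # recovers nu(A) as the integral of f over A against mu
    return sum(f[x] * mu[x] for x in A)
```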

3. (Linearity). Let ν, µ and λ be σ-finite measures on (X , G) with ν ≪ λ and µ ≪ λ. Then

d(ν + µ)/dλ = dν/dλ + dµ/dλ, λ-a.e.

4. (Chain rule). If ν ≪ µ ≪ λ, then

dν/dλ = (dν/dµ)(dµ/dλ), λ-a.e.


5. (Inverse). If ν ≪ µ and µ ≪ ν, then

dµ/dν = (dν/dµ)⁻¹, ν-a.e.

6. (Change of measure). If µ ≪ λ and g is a µ-integrable function, then

∫_X g dµ = ∫_X g (dµ/dλ) dλ.

7. (Change of variables in integration). This was addressed using the push-forward:

IE[g(X)] = ∫_Ω g ◦ X dP = ∫_IR g d(X∗P).

If the measure X∗P is absolutely continuous with respect to the Lebesgue measure µ on (IR, BIR),
then the Radon-Nikodym derivative fX := d(X∗P)/dµ, with fX : IR → IR, exists and

IE[g(X)] = ∫_IR g d(X∗P) = ∫_IR g fX dµ = ∫_IR g(τ)fX(τ) dτ.

This is known as the law of the unconscious statistician (LotUS).
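A quick numerical check of LotUS for X ∼ U[0, 1] (so fX = 1 on [0, 1]) and g(x) = x²: the sample average of g(X) and the quadrature of g·fX should both approach IE[g(X)] = 1/3. The seed and sample sizes are arbitrary choices:

```python
import random
random.seed(1)

g = lambda x: x * x

# left-hand side: E[g(X)] estimated by Monte Carlo over Omega
mc = sum(g(random.random()) for _ in range(200_000)) / 200_000

# right-hand side: integral of g(t) f_X(t) dt by midpoint quadrature
m = 10_000
quad = sum(g((i + 0.5) / m) * 1.0 for i in range(m)) / m
```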

1.1.5 Probability distribution


1. (Probability distribution). Let X : (Ω, F, P) → (Y, G) be a random variable. The measure

FX(A) = P[X ∈ A] = P[{ω ∈ Ω | X(ω) ∈ A}] = P[X⁻¹A] = (X∗P)(A),

is called the probability distribution of X. Note that for all A ∈ G, X⁻¹A ∈ F since X is measurable.

2. (Probability distribution of real-valued random variables). The probability distribution or cumulative
distribution function of a random variable X on (Ω, F, P) is FX(x) = P[X ≤ x] for x ∈ IR.
The inverse cumulative distribution of X at p ∈ [0, 1] is defined as

FX⁻¹(p) = inf{x ∈ IR : FX(x) ≥ p}.

3. (Push-forward). The probability distribution of a random variable X with values in (X , G) is the
push-forward measure X∗P on (X , G), which is a probability measure with X∗P = P ◦ X⁻¹.

4. (Associated p-system). We associate with FX : IR → [0, 1] the measure µ which is defined on the
p-system {(−∞, x]}x∈IR as µ((−∞, x]) = FX (x).

5. (Properties of the cumulative and the inverse cumulative distributions). The notation X ∼ Y
means that X and Y have the same cumulative distribution, that is, FX = FY.
i. If Y ∼ U[0, 1], then FX⁻¹(Y) ∼ X.
ii. FX is càdlàg
iii. x1 < x2 ⇒ FX(x1) ≤ FX(x2)
iv. P[X > x] = 1 − FX(x)
v. P[{x1 < X ≤ x2}] = FX(x2) − FX(x1)
vi. limx→−∞ FX(x) = 0, limx→∞ FX(x) = 1
vii. FX⁻¹(FX(x)) ≤ x
viii. FX(FX⁻¹(p)) ≥ p
ix. FX⁻¹(p) ≤ x ⇔ p ≤ FX(x)
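Property (i) is the basis of inverse-transform sampling. A sketch for an Exp(λ) variable, where FX(x) = 1 − e^{−λx} and FX⁻¹(p) = −ln(1 − p)/λ; the seed and λ = 2 are arbitrary:

```python
import math, random
random.seed(2)

lam = 2.0
Finv = lambda p: -math.log(1.0 - p) / lam   # inverse CDF of Exp(lam)

# push U[0,1] samples through the inverse CDF: property (i) says this samples X
samples = [Finv(random.random()) for _ in range(200_000)]
mean = sum(samples) / len(samples)          # should approach 1/lam = 0.5

# empirical CDF at x = 0.5 versus the exact F_X(0.5)
x = 0.5
ecdf = sum(s <= x for s in samples) / len(samples)
cdf = 1.0 - math.exp(-lam * x)
```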


1.1.6 Probability density function


1. (Definition). The probability density function fX of a random variable X : (Ω, F, P) → (X , G) with
respect to a measure µ on (X , G) is the Radon-Nikodym derivative

fX = d(X∗P)/dµ,

which exists provided that X∗P ≪ µ; fX is measurable and µ-integrable. Then,

P[X ∈ A] = ∫_{X⁻¹A} dP = ∫_Ω 1X⁻¹A dP = ∫_Ω (1A ◦ X) dP = ∫_A d(X∗P) = ∫_A fX dµ.

2. (Probability distribution). If X is a real-valued random variable and its range IR is taken with
the Borel σ-algebra, then

P[X ≤ x] = ∫_{(−∞,x]} dP = ∫_{{ω∈Ω : X(ω)≤x}} dP = ∫_{−∞}^x fX dµ.

Note that the first integral is written with a slight abuse of notation: the integration with respect
to P is carried out over the set {ω ∈ Ω : X(ω) ≤ x}, so the first integral should be understood as
shorthand for the second.
3. (Expectation). Let a real-valued random variable X have probability density fX and let ι be the
identity function ι : x ↦ x on IR. Then

IE[X] = ∫_Ω X dP = ∫_Ω (ι ◦ X) dP = ∫_IR ι d(X∗P) = ∫_IR ι(x)fX(x) dµ = ∫_IR x fX(x) dx.

4. (Distribution of transformation). Let g : IR → IR be a strictly increasing function. Let X be
a real-valued random variable with probability density function fX and let Y(ω) = g(X(ω)) be
another random variable. Then

FY(y) = FX(g⁻¹(y)),
fY(y) = fX(g⁻¹(y)) ∂g⁻¹(y)/∂y.
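A sanity check of the transformation formula for X ∼ U[0, 1] and the strictly increasing map g(x) = x³: it predicts FY(y) = FX(g⁻¹(y)) = y^{1/3} on (0, 1), which we compare against an empirical CDF (the seed is arbitrary):

```python
import random
random.seed(3)

g = lambda x: x**3
ys = [g(random.random()) for _ in range(200_000)]   # samples of Y = g(X)

y = 0.2
ecdf = sum(v <= y for v in ys) / len(ys)   # empirical P[Y <= y]
FY = y ** (1.0 / 3.0)                      # predicted F_Y(y) = F_X(g^{-1}(y))
```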

5. (Expectation of transformation). Let X be a real-valued random variable on (Ω, F, P) with
probability density function fX and let Y(ω) = g(X(ω)) be another random variable. Then

IE[Y] = ∫_{−∞}^∞ fX(τ)g(τ) dτ.

If Ω = {ωi}_{i=1}^n, F = 2^Ω and P[{ωi}] = pi, then

IE[Y] = Σ_{i=1}^n pi g(X(ωi)).

See also: law of the unconscious statistician.

1.1.7 Decomposition of measures


Does a density function always exist? The answer is negative, but Lebesgue’s decomposition theorem
offers some further insight.
1. (Singular measures). Let (Ω, F) be a measurable space and µ, ν be two measures defined thereon.
These are called (mutually) singular if there are A, B ∈ F so that
i. A ∪ B = Ω,
ii. A ∩ B = ∅,
iii. µ(B′) = 0 for all B′ ∈ F with B′ ⊆ B,
iv. ν(A′) = 0 for all A′ ∈ F with A′ ⊆ A.

2. (Discrete measure on IR). A measure µ on IR, equipped with the Lebesgue σ-algebra, is said to be
discrete if there is a (possibly finite) sequence of elements {sk}k∈IN so that

µ(IR \ ∪_{k∈IN} {sk}) = 0.

3. (Lebesgue’s decomposition Theorem). For every two σ-finite signed measures µ and ν on a measurable
space (Ω, F), there exist two σ-finite signed measures ν0 and ν1 on (Ω, F) such that
i. ν = ν0 + ν1
ii. ν0 ≪ µ
iii. ν1 ⊥ µ
and ν0 and ν1 are uniquely determined by ν and µ.

4. (Lebesgue’s decomposition Theorem — Corollary). Consider the space (IR, BIR) and let µ be the
Lebesgue measure. Any probability measure ν on this space can be written as

ν = νac + νsc + νd,

where νac ≪ µ (which is easily understood via the Radon-Nikodym Theorem), νsc is singular
continuous (wrt µ) and νd is a discrete measure.

1.1.8 Lp spaces
1. (p-norm). Let X be a real-valued random variable on (Ω, F, P). For p ∈ [1, ∞) define the p-norm
of X as

‖X‖p = IE[|X|^p]^{1/p}.

2. (𝓛p spaces). Define 𝓛p(Ω, F, P) = {X : Ω → IR, measurable, ‖X‖p < ∞} and equip this space
with the addition and scalar multiplication operations (X + Y)(ω) = X(ω) + Y(ω) and (αX)(ω) =
αX(ω). This becomes a semi-normed space.³

3. (Lp spaces). Define N(Ω, F, P) = {X : Ω → IR, measurable, X = 0 a.s.}; this is the kernel of ‖ · ‖p.
Then, define Lp(Ω, F, P) = 𝓛p(Ω, F, P)/N. This is a normed space where for X ∈ 𝓛p(Ω, F, P) and
[X] = X + N ∈ Lp(Ω, F, P) we have ‖[X]‖p := ‖X‖p.

4. (∞-norm, 𝓛∞ and L∞). The infinity norm is defined as

‖X‖∞ = esssup |X| = inf{λ ∈ IR : P[|X| > λ] = 0},

or equivalently

‖X‖∞ = inf{λ ∈ IR : |X| ≤ λ, P-a.s.}.

The spaces 𝓛∞(Ω, F, P) and L∞(Ω, F, P) are defined similarly.

5. (L∞(Ω, F, P) as a limit). If there is a p0 ∈ [1, ∞) such that X ∈ L∞ ∩ Lp0, then

‖X‖∞ = limp→∞ ‖X‖p.
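On a finite probability space the p-norm is a weighted power mean, and #5 is easy to observe: ‖X‖p increases with p toward ‖X‖∞ = esssup |X|. The numbers below are an arbitrary example:

```python
P = [0.25, 0.25, 0.25, 0.25]               # uniform probabilities on four atoms
X = [1.0, -3.0, 2.0, 0.5]

def pnorm(X, P, p):
    # ||X||_p = E[|X|^p]^(1/p) as a finite weighted sum
    return sum(abs(x)**p * w for x, w in zip(X, P)) ** (1.0 / p)

inf_norm = max(abs(x) for x in X)          # esssup |X|: every atom carries mass
norms = [pnorm(X, P, p) for p in (1, 2, 4, 8, 16, 64, 256)]
```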

6. (L2 is a Hilbert space). L2(Ω, F, P) is the only Hilbert space among the Lp spaces; its inner product
is

⟨X, Y⟩ = IE[XY].

³‖X‖p = 0 does not imply that X = 0, but instead that X = 0 almost surely. However, ‖ · ‖p is absolutely homogeneous,
sub-additive and nonnegative.

1.1.9 Product spaces


1. (Product σ-algebra). Let {Xa}a∈A be an indexed collection of nonempty sets; define X = ∏_{a∈A} Xa
and the projections πa : X ∋ (xb)b∈A ↦ xa ∈ Xa. Let Fa be a σ-algebra on Xa. We define the
product σ-algebra as

⊗_{a∈A} Fa := σ({πa⁻¹(Ea); a ∈ A, Ea ∈ Fa}).

This is the smallest σ-algebra on the product space which renders all projections measurable
(compare to the definition of the product topology, which is the smallest topology on the product
space rendering the projections continuous).
2. (Measurability of epigraphs). Let f : (X, F) → IR be a measurable proper function. Its epi-
graph, that is the set epi f := {(x, α) ∈ X × IR | f (x) ≤ α} and its hypo-graph, that is the set
hyp f := {(x, α) ∈ X ×IR | f (x) ≥ α} are measurable in the product measure space (X ×IR, F⊗BIR ).
3. (Measurability of graph). The graph of a measurable function f : (X, F, µ) → IR is a Lebesgue-
measurable set with Lebesgue measure zero.
4. (Countable product of σ-algebras). If A is countable, the product σ-algebra is generated by the
products of measurable sets {∏_{a∈A} Ea; Ea ∈ Fa}.

5. (Product measures). Let (X , F, µ) and (Y, G, ν) be two measure spaces. The product space X × Y
becomes a measurable space with the σ-algebra F⊗G. Let Ex ∈ F and Ey ∈ G; then Ex ×Ey ∈ F⊗G.
We define a measure µ × ν on (X × Y, F ⊗ G) with
(µ × ν)(Ex × Ey ) = µ(Ex )ν(Ey ).

6. Let E ∈ F ⊗ G and define the sections Ex = {y ∈ Y : (x, y) ∈ E} and Ey = {x ∈ X : (x, y) ∈ E}.
Then, Ex ∈ G for all x ∈ X and Ey ∈ F for all y ∈ Y.
7. Let f : X × Y → IR be an F ⊗ G-measurable function. Then, f (x, ·) is G-measurable for all x ∈ X
and f (·, y) is F-measurable for all y ∈ Y.
8. Let (X , F, µ) and (Y, G, ν) be two σ-finite measure spaces. For E ∈ F ⊗ G, the mappings X ∋ x ↦
ν(Ex) ∈ IR and Y ∋ y ↦ µ(Ey) are measurable and

(µ × ν)(E) = ∫ ν(Ex) dµ(x) = ∫ µ(Ey) dν(y).
9. (Tonelli’s Theorem). Let h : X × Y → [0, ∞] be an F ⊗ G-measurable function and let

f(x) = ∫_Y h(x, y) dν(y), g(y) = ∫_X h(x, y) dµ(x).

Then, f and g are measurable and

∫_X f dµ = ∫_Y g dν = ∫_{X ×Y} h d(µ × ν).

10. (Fubini’s Theorem). Let h : X × Y → IR be an F ⊗ G-measurable function with

∫_X ∫_Y |h(x, y)| dν(y) dµ(x) < ∞.

Then, h ∈ L1(X × Y, F ⊗ G, µ × ν) and

∫_X ∫_Y h(x, y) dν(y) dµ(x) = ∫_Y ∫_X h(x, y) dµ(x) dν(y) = ∫_{X ×Y} h d(µ × ν).
11. (Consequence of Fubini’s theorem). Let X be a nonnegative random variable and let
E = {(ω, x) : 0 ≤ x ≤ X(ω)}. Then X(ω) = ∫_0^∞ 1E(ω, x) dx and

IE[X] = ∫_Ω X dP = ∫_Ω ∫_0^∞ 1E(ω, x) dx dP
= ∫_0^∞ ∫_Ω 1E(ω, x) dP dx
= ∫_0^∞ P[X ≥ x] dx.


1.1.10 Transition Kernels


1. (Definition). Let (X , F), (Y, G) be two measurable spaces and let K : G × X → [0, 1]. K is called
a (probability) transition kernel if
i. fB (x) := K(B, x) is F-measurable for every B ∈ G,
ii. µx (B) := K(B, x) is a measure on (Y, G) for every x ∈ X .
2. (Markov kernel). A kernel K : G × X → [0, 1] is called a Markov kernel if K(Y, x) = 1 for all x ∈ X
3. (Existence of transition kernels). Let ν be a σ-finite measure on (Y, G) and let k : X × Y → IR+ be
measurable in the product σ-algebra F ⊗ G with the property ∫Y k(x, y) ν(dy) = 1 for all x ∈ X . Then the
mapping K : G × X → [0, 1] given by

K(B, x) = ∫B k(x, y) ν(dy),

is a probability transition kernel.


4. (Measure on product space via a kernel). Let (X , F, P) be a probability space, let (Y, G) be a
measurable space and let K : G × X → [0, 1] be a transition kernel. For A ∈ F and B ∈ G define

µ(A × B) = ∫A K(B, x) dP(x).

This extends to a unique measure on the product space (X × Y, F ⊗ G).
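The definitions above can be made concrete on finite spaces, where a kernel is just a row-stochastic matrix. In the sketch below the spaces, the kernel entries and the measure P are all hypothetical choices for illustration; it checks the Markov property (item 2) and the induced product measure (item 4).

```python
# Hypothetical finite-space sketch: a Markov kernel on X = Y = {0, 1} represented
# as K[x][y] = K({y}, x), and the induced measure mu(A x B) = sum_{x in A} K(B, x) P({x}).
P = {0: 0.5, 1: 0.5}                      # a probability measure on X = {0, 1}
K = {0: {0: 0.9, 1: 0.1},                 # K(., x) is a probability measure on Y
     1: {0: 0.2, 1: 0.8}}                 # for every x

def kernel_is_markov(K, tol=1e-12):
    """Check K(Y, x) = 1 for all x (the Markov property of item 2)."""
    return all(abs(sum(row.values()) - 1.0) < tol for row in K.values())

def product_measure(A, B):
    """mu(A x B) = sum over x in A of K(B, x) dP(x)."""
    return sum(sum(K[x][y] for y in B) * P[x] for x in A)

assert kernel_is_markov(K)
# mu extends to a probability measure on the product space: mu(X x Y) = 1
assert abs(product_measure([0, 1], [0, 1]) - 1.0) < 1e-12
```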

1.1.11 Law invariance


1. (Equality in distribution). Let X, Y be two real-valued random variables on (Ω, F, P). We say
that X and Y are equal in distribution, denoted X ∼d Y , if X and Y have equal cumulative
distribution functions, that is, FX (s) = FY (s) for all s ∈ IR.
2. (Equal in distribution, nowhere equal). Let Ω = {−1, 1}, F = 2Ω , P[{−1}] = P[{1}] = 1/2. Let X(ω) = ω and
Y (ω) = −X(ω). These two variables have the same distribution, but are nowhere equal.
3. (Equal in distribution, almost nowhere equal). Take X ∼ N (0, 1) and Y = −X. These two random
variables are almost nowhere equal, but have the same distribution.
4. The following are equivalent:
i. X ∼d Y
ii. IE[e−rX ] = IE[e−rY ] for all r > 0
iii. IE[f (X)] = IE[f (Y )] for all bounded continuous functions f
iv. IE[f (X)] = IE[f (Y )] for all bounded Borel functions f
v. IE[f (X)] = IE[f (Y )] for all positive Borel functions f

1.1.12 Expectation
1. (Definition). Let (Ω, F, P) be a probability space and X be a random variable. Then, the expected
value of X is denoted by IE[X] and is defined as the Lebesgue integral

IE[X] = ∫Ω X dP.
2. Because of item 8 in Sec. 1.1.2, for X ≥ 0 nonnegative

IE[X] = ∫Ω X dP
      = ∫Ω ∫0+∞ 1X≥t dt dP
      = ∫0+∞ ∫Ω 1X≥t dP dt,

and we use the fact that

∫Ω 1X>t dP = P[X > t],

so

IE[X] = ∫0∞ P[X > t] dt.

The function S(t) = P[X > t] = 1 − P[X ≤ t] is called the survival function of X, or its tail
distribution or exceedance.
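The identity IE[X] = ∫0∞ P[X > t] dt can be checked numerically. The sketch below uses the hypothetical choice X ∼ Exponential(lam), whose survival function is S(t) = e^(−lam t) and whose mean is 1/lam, and approximates the integral of the survival function by a Riemann sum.

```python
import math

# Numerical sanity check (a sketch) of IE[X] = integral of P[X > t] dt over [0, inf)
# for X ~ Exponential(lam): survival S(t) = exp(-lam * t), mean 1 / lam.
lam = 2.0
S = lambda t: math.exp(-lam * t)

# Left Riemann sum of the survival function; the tail beyond T is negligible here.
dt, T = 1e-4, 20.0
integral = sum(S(k * dt) * dt for k in range(int(T / dt)))

assert abs(integral - 1.0 / lam) < 1e-3
```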

3. (Expectation in terms of PDF). Let X be a real-valued continuous random variable with PDF fX .
Then,
Z ∞
IE[X] = xfX (x)dx.
−∞

4. (Expectation in terms of CDF). Let X be a real-valued random variable. Then,


Z ∞
IE[X] = xdF (x).
−∞

5. Let (Ω, F, P) be a probability space and X a real-valued random variable thereon. Define

f (τ ) = ∫Ω (X − τ )2 dP.

Then τ = IE[X] minimizes f and the minimum value is Var[X].
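Item 5 can be illustrated by simulation: on a sample, the sample mean essentially minimizes τ ↦ mean((X − τ)²), and the minimum is close to the variance. The distribution and parameters below are hypothetical choices.

```python
import random, statistics

# Simulation sketch: f(tau) = IE[(X - tau)^2] is minimized at tau = IE[X],
# with minimum value Var[X]; here X ~ N(3, 2^2), so the minimum should be near 4.
random.seed(0)
xs = [random.gauss(3.0, 2.0) for _ in range(20000)]

def f(tau):
    return sum((x - tau) ** 2 for x in xs) / len(xs)

m = statistics.fmean(xs)
# The sample mean beats nearby candidates...
assert f(m) <= min(f(m - 0.5), f(m + 0.5))
# ...and the minimum value is close to the variance sigma^2 = 4.
assert abs(f(m) - 4.0) < 0.2
```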

6. Let X be a real-valued random variable. Then,

∑n=1∞ P[|X| ≥ n] ≤ IE[|X|] ≤ 1 + ∑n=1∞ P[|X| ≥ n].

Moreover, IE[|X|] < ∞ if and only if the above series converges.

7. If X takes positive integer values, then

IE[X] = ∑n=1∞ P[X ≥ n].

8. (Finite mean, infinite variance). There are several distributions with finite mean and infinite
variance; a standard example is the Pareto distribution. A random variable X follows the Pareto
distribution with parameters xm > 0 and a > 0 if it has support [xm , ∞) and probability density

fX (x) = a xma /xa+1 ,

for x ≥ xm ; equivalently, P[X > x] = (xm /x)a . For a ≤ 1, X has infinite mean (and variance). For
1 < a ≤ 2, its mean is IE[X] = a xm /(a − 1) but its variance is infinite.

9. (Absolutely bounded a.s. ⇔ Bounded moments) [11]. Let X be a random variable on (Ω, F, P).
The following are equivalent:
i. X is almost surely absolutely bounded (i.e., there is M ≥ 0 such that P[|X| ≤ M ] = 1)
ii. IE[|X|k ] ≤ M k , for all k ∈ IN≥1

10. (A useful formula) [2]. For q > 0,

IE[|X|q ] = q ∫0∞ xq−1 P[|X| > x] dx.


1.2 Conditioning
1.2.1 Conditional Expectation
1. (Conditional Expectation). Let X be a random variable on (Ω, F, P) and let H ⊆ F be a sub-σ-algebra. A
conditional expectation of X given H is an H-measurable random variable, denoted as IE [X | H], with

∫H IE [X | H] dP = ∫H X dP,

which equivalently can be written as

IE[X1H ] = IE[IE [X | H] 1H ],

for all H ∈ H.

2. (Uniqueness). All versions of a conditional expectation IE [X | H] differ only on a set of measure
zero.4

3. (Equivalent definition). It is equivalent to define the conditional expectation of X, conditioned by


a σ-algebra H as a random variable IE [X | H] with the property

IE[XZ] = IE[IE [X | H] Z],

for all H-measurable random variables Z.

4. (Best estimator). Assuming IE[Y 2 ] < ∞, the best mean-square estimator of Y given X, i.e., the minimizer of IE[(Y − g(X))2 ] over measurable functions g, is g(X) = IE[Y | X].
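The best-estimator property can be illustrated by simulation: the conditional expectation beats any other predictor based on X in mean squared error. The joint distribution below is a hypothetical choice with IE[Y | X] = X.

```python
import random

# Simulation sketch: among predictors of Y based on X, IE[Y | X] minimizes the MSE.
# Here X ~ Uniform{0, 1} and Y = X + Gaussian noise, so IE[Y | X] = X.
random.seed(1)
data = []
for _ in range(50000):
    x = random.randint(0, 1)
    y = x + random.gauss(0.0, 1.0)
    data.append((x, y))

def mse(predict):
    return sum((y - predict(x)) ** 2 for x, y in data) / len(data)

mse_cond = mse(lambda x: x)          # the conditional expectation IE[Y | X] = X
mse_mean = mse(lambda x: 0.5)        # the constant predictor IE[Y]
mse_other = mse(lambda x: 2 * x)     # some other function of X

assert mse_cond < mse_mean and mse_cond < mse_other
```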

5. (Radon-Nikodym definition). For X ≥ 0 (in general, decompose X = X+ − X− ), the conditional
expectation as introduced above is the Radon-Nikodym derivative

IE [X | H] = dµXH /dPH ,

where µXH : H → [0, ∞] is the measure induced by X restricted on H, that is, µXH (H) = ∫H X dP for
H ∈ H. This measure is absolutely continuous with respect to PH , the restriction of P on H.

6. (Conditional expectation wrt random variable). Let X, Y be random variables on (Ω, F, P). The
conditional expectation of X given Y is IE[X | Y ] := IE[X | σ(Y )], where σ(Y ) is the σ-algebra
generated by Y , that is, σ(Y ) = {Y −1 (B); B ∈ B(IR)}, with B(IR) the Borel σ-algebra on IR.

7. (Conditional expectation using the push-forward Y∗ P). Let X be an integrable random variable
on (Ω, F, P). Then, there is a (Y∗ P-a.e. unique) measurable function IE[X | Y ] such that

∫Y −1 (B) X dP = ∫B IE[X | Y ] d(Y∗ P),

for all Borel sets B.

8. (Conditioning by an event). The conditional expectation IE[X | H], conditioned by an event H ∈ F
with P[H] > 0, is given by

IE[X | H] = (1/P[H]) ∫H X dP = IE[X1H ]/P[H].

9. (Properties of conditional expectations). The conditional expectation has the following properties:
i. (Monotonicity). X ≤ Y ⇒ IE [X | H] ≤ IE [Y | H]
ii. (Positivity). X ≥ 0 ⇒ IE [X | H] ≥ 0 [Set Y = 0 in 9i].
iii. (Linearity). For a, b ∈ IR, IE [aX + bY | H] = aIE [X | H] + bIE [Y | H]
iv. (Monotone convergence). Xn ≥ 0, Xn ↑ X implies IE [Xn | H] ↑ IE [X | H]
v. (Fatou’s lemma). For Xn ≥ 0, IE [lim inf n Xn | H] ≤ lim inf n IE [Xn | H]
vi. (Reverse Fatou's lemma). If Xn ≤ Y for all n with Y integrable, then IE [lim supn Xn | H] ≥ lim supn IE [Xn | H].
4 R.Durrett, “Probability: Theory and Examples,” 2013, Available at: https://services.math.duke.edu/~rtd/PTE/
PTE4_1.pdf


vii. (Dominated convergence theorem). Suppose Xn → X (point-wise) and |Xn | ≤ Y P-a.s., where Y is
integrable. Then, IE [X | H] is integrable and

IE [Xn | H] → IE [X | H] .

viii. (Jensen’s inequality). Let X ∈ L1 (Ω, F, P), f : IR → IR convex. Then


f (IE [X | H]) ≤ IE [f (X) | H] .

ix. (Law of total expectation). For any σ-algebra H ⊆ F,


IE[IE [X | H]] = IE[X].

x. (Tower property). For two σ-algebras H1 and H2 with H1 ⊆ H2 ,


IE[IE[X | H1 ] | H2 ] = IE[IE[X | H2 ] | H1 ] = IE[X | H1 ].

xi. (Measurability under nested σ-algebras). Let H1 ⊆ H2 be two σ-algebras. If X is
H1 -measurable, then it is also H2 -measurable.
xii. If X is H-measurable then
IE [X | H] = X.

1.2.2 Conditional Probability


1. (Conditional probability). Let (Ω, F, P) be a probability space and H be a sub-σ-algebra of F. The
conditional probability given H is defined, for every E ∈ F, as
P[E | H] = IE [1E | H] .

2. (Conditional probability given an event). For E, H ∈ F with P[H] > 0, the conditional probability
of E given H is P[E | H] = P[E ∩ H]/P[H], so that P[E ∩ H] = P[H] P[E | H].

1.2.3 Construction of probability spaces


1. (Ionescu-Tulcea's Theorem). Given a probability measure on an initial measurable space and a sequence of transition kernels from each partial product to the next coordinate space, there exists a unique probability measure on the infinite product space whose finite-dimensional marginals are obtained by successively composing the initial measure with the kernels.
2. (Kolmogorov's Extension Theorem). Let T denote a time interval, k ∈ IN and let νt1 ,...,tk be
probability measures on IRnk such that
a) For all permutations π on {1, . . . , k} it holds that
νtπ(1) ,...,tπ(k) (F1 × · · · × Fk ) = νt1 ,...,tk (Fπ−1 (1) × · · · × Fπ−1 (k) ),
b) For all m ∈ IN the following holds
νt1 ,...,tk (F1 × · · · × Fk ) = νt1 ,...,tk ,tk+1 ,...,tk+m (F1 × · · · × Fk × IRn × · · · × IRn ).

Then, there exists a probability space (Ω, F, P) and a stochastic process (Xt )t on Ω such that
νt1 ,...,tk (F1 × · · · × Fk ) = P[Xt1 ∈ F1 , . . . , Xtk ∈ Fk ],
for all Borel sets Fi , i = 1, . . . , k.

1.3 Inequalities on Probability Spaces


1.3.1 Inequalities on Lp spaces
1. (Hölder's inequality). If X ∈ Lp (Ω, F, P) and Y ∈ Lq (Ω, F, P), where p, q are conjugate exponents
(1/p + 1/q = 1), then XY ∈ L1 (Ω, F, P) and
IE[|XY |] = ‖XY ‖1 ≤ ‖X‖p ‖Y ‖q .
2. (Cauchy-Schwarz inequality). This is Hölder's inequality with p = q = 2:
‖XY ‖1 ≤ ‖X‖2 ‖Y ‖2 .
3. (Minkowski inequality). If X, Y ∈ Lp (Ω, F, P) (p ∈ [1, ∞]), then X + Y ∈ Lp (Ω, F, P) and
‖X + Y ‖p ≤ ‖X‖p + ‖Y ‖p .
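The three inequalities can be sanity-checked numerically on a finite sample space carrying the empirical (uniform) probability measure; the sample size and the exponents below are hypothetical choices.

```python
import random

# Numerical sanity check (a sketch) of Hölder, Cauchy-Schwarz and Minkowski on a
# finite sample space of size n with the uniform (empirical) measure.
random.seed(2)
n = 1000
X = [random.gauss(0, 1) for _ in range(n)]
Y = [random.gauss(0, 1) for _ in range(n)]

def norm(Z, p):
    """Lp norm with respect to the uniform measure on n points."""
    return (sum(abs(z) ** p for z in Z) / n) ** (1 / p)

p, q = 3.0, 1.5                      # conjugate exponents: 1/3 + 1/1.5 = 1
XY = [x * y for x, y in zip(X, Y)]
assert norm(XY, 1) <= norm(X, p) * norm(Y, q)          # Hölder
assert norm(XY, 1) <= norm(X, 2) * norm(Y, 2)          # Cauchy-Schwarz
S = [x + y for x, y in zip(X, Y)]
assert norm(S, p) <= norm(X, p) + norm(Y, p)           # Minkowski
```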


1.3.2 Generic inequalities involving probabilities or expectations


1. (Lyapunov's inequality). Let 0 < s < t. Then
(IE[|X|s ])1/s ≤ (IE[|X|t ])1/t .

2. (Markov's inequality). Let X ≥ 0 be integrable. For all t > 0,
P[X > t] ≤ IE[X]/t.
3. (Chebyshev's inequality). Let X have finite expectation µ and finite variance σ 2 . Then, for all t > 0,
P[|X − µ| ≥ t] ≤ σ 2 /t2 .
4. (Generalized Markov's inequality). Let X be a real-valued random variable and f : IR → IR+ be
an increasing function. Then, for all b ∈ IR with f (b) > 0,
P[X > b] ≤ IE[f (X)]/f (b).
5. (Gaussian tail inequality). Let X ∼ N (0, 1). Then, for ε > 0,
P[|X| > ε] ≤ (2/ε) e−ε²/2 .
6. (Hoeffding's lemma). Let a ≤ X ≤ b be a random variable with finite expectation µ = IE[X]. Then
IE[etX ] ≤ etµ et²(b−a)²/8 .

7. (Corollary of Hoeffding's lemma). Let X be such that etX is integrable for t ≥ 0. Then, for ε ∈ IR,
P[X > ε] ≤ inf t≥0 e−tε IE[etX ].

8. (Jensen’s inequality). Let X ∈ L1 (Ω, F, P), f : IR → IR convex. Then


f (IE[X]) ≤ IE[f (X)].

9. (Paley-Zygmund). Let Z ≥ 0 be a random variable with finite variance and let θ ∈ [0, 1]. Then,

P[Z > θIE[Z]] ≥ (1 − θ)2 IE[Z]2 /IE[Z 2 ],

and this bound can be improved (using the Cauchy-Schwarz inequality) as

P[Z > θIE[Z]] ≥ (1 − θ)2 IE[Z]2 /(Var[Z] + (1 − θ)2 IE[Z]2 ).

10. Let X ≥ 0 and IE[X 2 ] < ∞. We apply the Cauchy-Schwarz inequality to X1X>0 and obtain
P[X > 0] ≥ IE[X]2 /IE[X 2 ].
11. (Dvoretzky-Kiefer-Wolfowitz inequality). Let X1 , . . . , Xn be iid random variables (samples) with
cumulative distribution F . Let Fn be the associated empirical distribution

Fn (x) = (1/n) ∑i=1n 1Xi ≤x .

Then,

P[supx∈IR (Fn (x) − F (x)) > ε] ≤ e−2nε² ,

for every ε ≥ √(ln 2/(2n)).

12. (Chung-Erdős inequality). Let E1 , . . . , En ∈ F and P[Ei ] > 0 for some i. Then

P[E1 ∨ . . . ∨ En ] ≥ (∑i=1n P[Ei ])2 / ∑i=1n ∑j=1n P[Ei ∧ Ej ].


1.3.3 Involving sums or averages


1. (Hoeffding's inequality for sums #1). Let X1 , X2 , . . . , Xn be independent random variables in [0, 1].
Define

X̄ = (X1 + X2 + . . . + Xn )/n.

Then, for t > 0,

P[X̄ − IE[X̄] ≥ t] ≤ e−2nt² .

2. (Hoeffding's inequality for sums #2). Let X1 , X2 , . . . , Xn be independent random variables with
Xi ∈ [ai , bi ]. Let X̄ be as above and let ri = bi − ai . Then

P[X̄ − IE[X̄] ≥ t] ≤ exp(−2n2 t2 / ∑i=1n ri2 ),

and

P[|X̄ − IE[X̄]| ≥ t] ≤ 2 exp(−2n2 t2 / ∑i=1n ri2 ).

3. (Kolmogorov's inequality). Let Xk , k = 1, . . . , n, be independent random variables on (Ω, F, P)
with mean 0 and variances σk2 . Let Sk = X1 + X2 + . . . + Xk . For all ε > 0,

P[max1≤k≤n |Sk | > ε] ≤ (1/ε2 ) ∑k=1n σk2 .

4. (Gaussian tail inequality for averages). Let X1 , . . . , Xn ∼ N (0, 1) be independent and let
X̄n := n−1 ∑i=1n Xi . Then X̄n ∼ N (0, n−1 ) and, for ε > 0,

P[|X̄n | > ε] ≤ 2e−nε²/2 /(√n ε).

5. (Etemadi's inequality). Let X1 , . . . , Xn be independent real-valued random variables and α ≥ 0.
Let Sk = X1 + . . . + Xk . Then

P[max1≤i≤n |Si | ≥ 3α] ≤ 3 max1≤i≤n P[|Si | ≥ α].

1.4 Convergence of random processes


1.4.1 Convergence of measures
1. (Strong convergence). Let {µk }k∈IN be a sequence of measures defined on a measurable space
(X , G). We say that the sequence converges strongly to a measure µ if

lim µk (A) = µ(A),


k

for all A ∈ G.
2. (Total variation convergence). The total variation distance between two measures µ and ν on a
measurable space (X , G) is defined as

dTV (µ, ν) = ‖µ − ν‖TV
          := sup {∫X f dµ − ∫X f dν : f : X → [−1, 1] measurable}
          = 2 supA∈G |µ(A) − ν(A)|.

A sequence of measures {µk }k∈IN converges in total variation to a measure µ if dTV (µk , µ) → 0
as k → ∞.
3. (Weak convergence). The sequence of measures {µk }k∈IN is said to converge in the weak sense,
denoted by µk * µ, if any of the conditions of the Portmanteau Theorem hold; these are


i. IEµk f → IEµ f for all bounded continuous functions f


ii. IEµk f → IEµ f for all bounded Lipschitz functions f
iii. lim supk IEµk f ≤ IEµ f for every upper semi-continuous f bounded from above
iv. lim inf k IEµk f ≥ IEµ f for every lower semi-continuous f bounded from below
v. lim supk µk (C) ≤ µ(C) for every closed set C ⊆ X
vi. lim inf k µk (O) ≥ µ(O) for every open set O ⊆ X

4. (Tightness). A sequence of measures (µn )n is called tight if for every  > 0 there is a compact set
K so that µn (K) > 1 −  for all n ∈ IN.

5. (Prokhorov’s Theorem). If (µn )n is tight, then every subsequence of it has a further subsequence
which is weakly convergent.

6. (Lévy-Prokhorov distance). Let (X, d) be a metric space and let BX be the Borel σ-algebra which
makes (X, BX ) a measurable space. Let P(X) be the space of all probability measures on (X, BX ).
For all A ⊆ X and ε > 0 we define

Aε := {p ∈ X | ∃q ∈ A, d(p, q) < ε} = ∪p∈A Bε (p),

where Bε (p) is an open ball centered at p with radius ε.
The Lévy-Prokhorov distance is a mapping π : P(X) × P(X) → [0, 1] between two probability
measures µ and ν defined as

π(µ, ν) := inf{ε > 0 | µ(A) ≤ ν(Aε ) + ε, ν(A) ≤ µ(Aε ) + ε, ∀A ∈ BX }.

7. (Metrizability of weak convergence). If (X, d) is a separable metric space, then convergence of a


sequence of measures in the Lévy-Prokhorov distance is equivalent to weak convergence.

8. (Separability of (P(X), π)). The space (P(X), π) is separable if and only if (X, d) is separable.

9. (Skorokhod’s representation theorem). Let (µn )n be a sequence of probability measures on a metric


measurable space (S, H) such that µn → µ weakly. Suppose that the support of µ is separable5 .
Then, there exist random variables (Xn )n and X on a common probability space such that the
distribution of Xn is µn , the distribution of X is µ and Xn → X almost surely.

10. (Strong 6⇒ TV). Strong convergence of measures does not imply convergence in total variation.

1.4.2 Almost sure convergence


1. (Almost sure convergence). A sequence of random variables (Xn )n is said to converge almost surely
if the sequence (Xn (ω))n converges (somewhere) for almost every ω. It converges almost surely to
X if limn Xn (ω) = X(ω) for almost every ω.

2. (Uniqueness almost surely). If Xn → X a.s. and Xn → Y a.s., then X = Y a.s.

3. (Characterization of a.s. convergence). The sequence (Xn )n converges a.s. to X if and only if, for
every ε > 0,

∑n∈IN 1(ε,∞) (|Xn − X|) < ∞, almost surely.

4. (Characterization of a.s. convergence à la Borel-Cantelli #1). The sequence (Xn )n converges a.s.
to X if for every ε > 0

∑n∈IN P[|Xn − X| > ε] < ∞.

5 The support of a measure µ on (Ω, F, P) which is equipped with a topology τ is the set of ω ∈ Ω for which every open
neighbourhood Nω of ω has a positive measure: supp(µ) = {ω ∈ Ω : µ(Nx ) > 0, for all Nω ∈ τ, Nω 3 ω}.


5. (Characterization of a.s. convergence à la Borel-Cantelli #2). The sequence (Xn )n converges a.s.
to X if there is a decreasing sequence (εn )n converging to 0 so that

∑n∈IN P[|Xn − X| > εn ] < ∞.

6. (Cauchy criterion). The sequence (Xn )n is convergent almost surely if and only if limm,n→∞ |Xn −
Xm | → 0 almost surely.
7. (Kolmogorov's three-series theorem). Let (Xn )n be a sequence of independent random variables.
The random series ∑n Xn converges almost surely in IR if and only if the following conditions hold
for some ε > 0:
i. ∑n P[|Xn | > ε] converges
ii. Let Yn = Xn 1|Xn |≤ε . Then, ∑n IE[Yn ] converges
iii. ∑n Var[Yn ] converges

8. (Consequence of Kolmogorov's three-series theorem: Random harmonic series). Let (Xn )n be a
sequence of independent random variables taking the values {−1, 1}, each with probability 1/2.
Then, ∑n Xn /n converges almost surely. This is proven using Kolmogorov's three-series theorem
with ε = 2.6
9. (Kolmogorov's two-series theorem). Let (Xn )n be a sequence of independent random variables with
expected values IE[Xn ] = µn and variances Var[Xn ] = σn2 such that ∑n µn and ∑n σn2 converge in
IR. Then, ∑n Xn converges in IR almost surely.
10. (Continuous mapping theorem). Let Xn → X a.s. and let g be an (almost everywhere) continuous
mapping. Then g(Xn ) → g(X) a.s.
11. (Topological (non) characterization). The concept of almost sure convergence does not come from a
topology on the space of random variables. This means there is no topology on the space of random
variables such that the almost surely convergent sequences are exactly the converging sequences
with respect to that topology. In particular, there is no metric of almost sure convergence.

1.4.3 Convergence in probability


1. (Convergence in probability). We say that the stochastic process (Xn )n converges to a random
variable X in probability if for every ε > 0,

limn P[|Xn − X| > ε] = 0.

We denote Xn →p X.
2. (Continuous mapping theorem). Let Xn →p X and let g be an (almost everywhere) continuous
mapping. Then g(Xn ) →p g(X).
3. (Metrizability). Convergence in probability defines a topology which is metrizable via the Ky Fan
metric
d(X, Y ) = inf{ε > 0 | P[|X − Y | > ε] ≤ ε}.
(The functional IE[min(|X − Y |, 1)] induces the same topology, although it is not in general equal
to the Ky Fan metric.)
4. (Metrizability #2). The sequence Xn converges to 0 in probability if and only if

IE[|Xn |/(1 + |Xn |)] → 0.

The functional

d(X, Y ) := IE[|X − Y |/(1 + |X − Y |)]

is a metric that induces the convergence in probability (provided we identify two random variables
as equal if they are almost everywhere equal).
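As an illustration of item 4, for the (hypothetical but standard) sequence with P[Xn = n] = 1/n and P[Xn = 0] = 1 − 1/n, the metric d(Xn , 0) = IE[|Xn |/(1 + |Xn |)] can be computed in closed form; note that IE[Xn ] = 1 for every n, yet Xn → 0 in probability and d(Xn , 0) → 0.

```python
# Exact computation (a sketch) of d(X_n, 0) = IE[|X_n| / (1 + |X_n|)] for the
# sequence with P[X_n = n] = 1/n, P[X_n = 0] = 1 - 1/n.
def d_to_zero(n):
    # IE[|X_n| / (1 + |X_n|)] = (1/n) * (n / (1 + n)) = 1 / (1 + n)
    return (1 / n) * (n / (1 + n))

ds = [d_to_zero(n) for n in (1, 10, 100, 1000)]
assert all(a > b for a, b in zip(ds, ds[1:]))   # the distances decrease...
assert ds[-1] < 1e-2                            # ...and tend to 0
```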
6 More on random harmonic series can be found in B. Schmuland, "Random Harmonic Series," American Mathematical
Monthly 110: 407–416, 2003, doi:10.2307/3647827. There the author explains that the pdf of ∑n Xn /n evaluated at 2
differs from 1/8 by less than 10−42 .


5. (Almost surely convergent subsequence). If Xn →p X, then there exists a subsequence (Xkn )n of
(Xn )n which converges almost surely to X.

6. (Sum of independent variables). Let (Xn )n be a sequence of independent random variables and let
(Sn )n be a sequence defined as Sn = X1 + . . . + Xn . Then Sn converges almost surely if and only
if it converges in probability.

7. (Convergence of pairs). If Xn → X in probability and Yn → Y in probability, then (Xn , Yn ) →


(X, Y ) in probability.

8. (Almost surely ⇒ in probability). If a sequence of random variables {Xk }k converges almost surely,
it converges in probability to the same limit.

9. (In probability 6⇒ almost surely). There are sequences which converge in probability but not almost
surely. Here is an example: Let (Xn )n be a sequence of independent random variables with Xn = 1
with probability 1/n and Xn = 0 with probability 1 − 1/n. Then, for any ε ∈ (0, 1) it is
P[|Xn | > ε] = 1/n → 0, so Xn → 0 in probability. However, ∑n P[|Xn | > ε] = ∑n 1/n = ∞ and the
events {|Xn | > ε} are independent, so by the second Borel-Cantelli lemma P[lim supn {|Xn | > ε}] = 1
and Xn does not converge almost surely.

1.4.4 Convergence in Lp
1. (Convergence in Lp (Ω, F, P)). We say that Xk converges to X in Lp if X, Xk ∈ Lp for all k ∈ IN
and kXk − Xkp → 0.

2. (Convergence L1 under uniform integrability). If Xn → X in probability and (Xn )n is uniformly


integrable, then Xn → X in L1 .

3. (In Ls (Ω, F, P) ⇒ in Lp (Ω, F, P), for s > p ≥ 1). This follows from Lyapunov's inequality: on a probability space, ‖X‖p ≤ ‖X‖s whenever p ≤ s.

4. (Scheffé’s theorem). Let Xn ∈ L1 , X ∈ L1 and Xn → X almost surely. The following are


equivalent:
i. IE[|Xn |] → IE[|X|],
ii. IE[|Xn − X|] → 0.

5. (Convergence in Lp for all p ∈ [1, ∞) but not in L∞ ). Let X be a random variable on Ω = IN which
follows the Poisson distribution (P[X = k] = e−λ λk /k!, λ > 0). Define the sequence Xk = 1{X=k} .
Then ‖Xk ‖p = (P[X = k])1/p → 0 for every p ∈ [1, ∞), while ‖Xk ‖∞ = 1 for all k.

6. (Vitali’s theorem). Suppose that Xn ∈ Lp , p ∈ [1, ∞) and Xn → X in probability. The following


are equivalent
i. {Xn }n is uniformly integrable
ii. Xn → X in Lp
iii. IE[|Xn |p ] → IE[|X|p ]

7. (In Lp ⇒ in probability). If (Xn )n converges to X in Lp , for any p ∈ [1, ∞] it also converges to X


in probability.

8. (Almost surely 6⇒ in Lp ). On ([0, 1], B[0,1] , λ) take Xn = n1[0,1/n] . Then ‖Xn ‖1 = 1 and, more
generally, ‖Xn ‖p = n1−1/p ≥ 1 for all p ∈ [1, ∞], yet the sequence converges almost surely to 0.

9. (In Lp , p ∈ [1, 2) 6⇒ in Lp , for p ≥ 2). Let X = 0 and let Xk , k ≥ 2, be a sequence of random
variables with

P[Xk = k] = 1/k2 ,
P[Xk = 0] = 1 − 1/k2 .

Then IE[|Xk |t ] = kt−2 , so IE[|Xk |t ] → 0 if and only if t < 2; that is, Xk → X in Lt for t ∈ [1, 2)
but not for t ≥ 2.


Figure 1.1: Illustration of the relationships among different modes of convergence of random variables. Convergence in L∞
implies convergence in Lp for all p ∈ [1, ∞), which in turn implies convergence in Lp0 for all 1 ≤ p0 ≤ p, which
implies convergence in probability, which implies convergence in distribution, which implies convergence of the
characteristic functions (Lévy's continuity theorem). Convergence in distribution implies almost sure convergence
of a sequence of RVs {Yk }k which have the same distributions as {Xk }k (Yk ∼d Xk and Y ∼d X).

1.4.5 Convergence in distribution


1. (Convergence in distribution). The sequence of random variables {Xn }n with distributions {µn }n
is said to converge in distribution to X if {µn }n converges weakly to µ, the distribution of X.

2. (Slutsky's theorem). Let Xn → X in distribution and Yn → c in probability, where c is a constant.
Then,
i. Xn + Yn → X + c in distribution
ii. Xn Yn → cX in distribution
iii. Xn /Yn → X/c in distribution, provided that c 6= 0, Yn 6= 0.

3. (Almost sure convergence). If Xn → X in distribution, we may find a probability space (Ω, F, P)


and random variables Y and (Yn )n so that Yn is equal in distribution to Xn , Y is equal in distri-
bution to X and Yn → Y almost surely.

4. (Lévy's continuity theorem)7 . Let {Xk }k be a sequence of random variables with characteristic
functions ϕk (t) and let X be a random variable with characteristic function ϕ(t). If Xk converges
to X in distribution then ϕk → ϕ point-wise. Conversely, if ϕk → ϕ and ϕ is continuous at 0, then
ϕ is the characteristic function of a random variable X and Xk → X in distribution.

5. (Scheffé’s theorem for density functions).8 Let Pn and P have densities fn and f with respect to
a measure µ. If fn → f µ-a.s., then Pn → P in the total variation metric and, as a result, Pn → P
weakly.

6. (Continuous mapping theorem). For a (almost everywhere) continuous function g, if the sequence
{Xk }k converges in distribution to X, then {g(Xk )}k converges in distribution to g(X).

7. (Convergence in probability ⇒ in distribution). If {Xk }k converges in probability, then it converges


in distribution to the same limit.

8. (In distribution 6⇒ in probability). There are sequences which converge in distribution, but not in
probability. For example: On the space ([0, 1], B[0,1] , λ), let X2n (ω) = ω and X2n−1 (ω) = 1 − ω.
Then all Xk have the same distribution, but the sequence does not converge in probability. As a
second example, the sequence Xn = X where X follows the Bernoulli distribution with parameter
1
2 , converges in distribution to 1 − X, but not in probability.

9. (Polya-Cantelli lemma)9 . If Xn → X in distribution, Fn are the distribution functions of Xn and
X has a continuous distribution function F , then ‖Fn − F ‖∞ := supx |Fn (x) − F (x)| → 0 as
n → ∞.
7 Lecture notes 6.436J/15.085J by MIT, Available online at https://goo.gl/7ZaHW9.
8 S. Sagitov, "Weak Convergence of Probability Measures," 2013, Available at: https://goo.gl/m4Qi5i
9 Lecture notes of M. Banerjee, Available at: http://dept.stat.lsa.umich.edu/~moulib/ch2.pdf


10. (Delta method). Let X be a real-valued random variable and Xn be a sequence of real-valued
random variables with n^c (Xn − θ) → X in distribution, for some c > 0. Let g : IR → IR be a function
which is differentiable at θ. Then, n^c (g(Xn ) − g(θ)) → g 0 (θ)X in distribution.10

1.4.6 Tail events and 0-1 Laws


1. (Simple 0-1 law). Let {En } be a sequence of independent events. Then P[lim supn En ] ∈ {0, 1}.

2. (Unions of σ-algebras). Let F1 , F2 be two σ-algebras on a nonempty set X. The σ-algebra
generated by the sets E1 ∪ E2 with E1 ∈ F1 and E2 ∈ F2 is denoted by F1 ∨ F2 .

3. (Tail σ-algebra). Let (Fn )n be a sequence of sub-σ-algebras of F. The σ-algebra Tn := ∨m>n Fm
encodes the information about the future after n, and T = ∩n Tn is the tail σ-algebra, which encodes
the information of the end of time.

4. (Events in the tail σ-algebra). Let (En )n be a sequence of events. The associated tail σ-algebra
is T = ∩n σ({Ek }k≥n ). The event lim supn En is in T .

5. (Kolmogorov's zero-one law). Let (Fn )n be a sequence of independent σ-algebras on a nonempty
set X, let F = σ(∪n Fn ) and let T be the tail σ-algebra. We equip (X, F) with a probability
measure P. For every H ∈ T , P(H) ∈ {0, 1}.

6. (Counterpart of the Borel-Cantelli lemma). Let {En }n∈IN be a nested increasing sequence of events
in (Ω, F, P), that is Ek ⊆ Ek+1 , and let Ekc denote the complement of Ek . Infinitely many Ek occur
with probability 1 if and only if there is an increasing sequence tk ∈ IN such that

∑k P[Etk+1 | Etck ] = ∞.

7. (Lévy's zero-one law). Let F = {Fk }k∈IN be any filtration of F on (Ω, F, P) and X ∈ L1 (Ω, F, P).
Let F∞ be the σ-algebra generated by ∪k Fk . Then

IE[X | Fk ] → IE[X | F∞ ],

both in L1 (Ω, F, P) and P-a.s.

1.4.7 Laws of large numbers and CLTs


1. (Weak law of large numbers). Also known as Bernoulli’s theorem. Let {Xk }k be a sequence of
independent identically distributed random variables, each having a finite mean IE[Xk ] = µ and
finite variance σ 2 . Define X̄k = 1/k(X1 + . . . + Xk ). Then X̄k → µ in probability.

2. (Strong law of large numbers). Let {Xk }k and X̄k be as above. Then X̄k → µ almost surely.

3. (Uniform law of large numbers). Let f (x, θ) be a function defined over θ ∈ Θ. For fixed θ and
a random process {Xk }k define Zkθ := f (Xk , θ). Let {Zkθ }k be a sequence of independent and
identically distributed random variables, such that the sample mean converges in probability to
IE[f (X, θ)]. Suppose that (i) Θ is compact, (ii) f is continuous in θ for almost all x and measurable
with respect to x for each θ, (iii) there is a function g such that IE[g(X)] < ∞ and |f (x, θ)| ≤ g(x)
for all θ ∈ Θ. Then, IE[f (X, θ)] is continuous in θ and

supθ∈Θ |Z̄kθ − IE[f (X, θ)]| → 0 almost surely.

4. (Lindeberg-Lévy central limit theorem). Let {Xk }k be iid with finite mean µ and finite variance σ 2 ,
and let X̄k be as above. Then

√k (X̄k − µ) →d N (0, σ 2 ),

where N (0, σ 2 ) is the normal distribution with zero mean and variance σ 2 (See Section 1.5.2).
10 Seethe lecture notes at http://personal.psu.edu/drh20/asymp/fall2006/lectures/, Chap. 5. The proof makes use
of Taylor’s first-order expansion and Slutsky’s theorem.


5. (Lyapunov central limit theorem). Let {Xk }k be a sequence of independent random variables with
IE[Xk ] = µk and finite variance σk2 . Define s2k = ∑i=1k σi2 . If for some δ > 0, the following condition
holds (Lyapunov's condition)11 :

limk→∞ (1/sk2+δ ) ∑i=1k IE[|Xi − µi |2+δ ] = 0,

then,

(1/sk ) ∑i=1k (Xi − µi ) →d N (0, 1).

6. (Law of iterated logarithm). Let (Xk )k∈IN be independent identically distributed random variables
with zero mean and unit variance, and let Sk := X1 + . . . + Xk . Then,

lim supk Sk /√(2k ln ln k) = 1,

and convergence is in the almost sure sense.

7. (Delta method). Let (Xn )n be a sequence of random variables satisfying

√n (Xn − θ) →d N (0, σ 2 ).

Let g : IR → IR be a function which is differentiable at θ and g 0 (θ) 6= 0. Then,

√n (g(Xn ) − g(θ)) →d N (0, σ 2 [g 0 (θ)]2 ).

8. (Glivenko-Cantelli Theorem). Let X1 , X2 , . . . , XN be independent and identically distributed real-
valued random variables with a common cumulative distribution function F (x). The empirical
distribution of X1 , X2 , . . . , XN is

FN (x) = (1/N ) ∑i=1N 1[Xi ,+∞) (x).

Then,

‖FN − F ‖∞ := supx∈IR |FN (x) − F (x)| → 0,

almost surely as N → ∞12 .

9. (Dvoretzky–Kiefer–Wolfowitz inequality). This inequality is a stronger result compared to the
Glivenko-Cantelli Theorem above as it gives the rate of convergence of the empirical distribution
to the actual one. Let X1 , . . . , XN and F , FN be as above. Then,

P[‖FN − F ‖∞ > ε] ≤ e−2N ε² ,

for ε ≥ √(ln 2/(2N )), and

P[‖FN − F ‖∞ > ε] ≤ 2e−2N ε² ,

for ε > 0.
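Both results can be illustrated numerically: for N iid Uniform(0, 1) samples (a hypothetical choice, with F(x) = x), the sketch below computes the sup-distance ‖FN − F‖∞ at the jump points of FN and compares it with the DKW bound at a fixed ε.

```python
import math, random

# Simulation sketch: Kolmogorov-Smirnov statistic ||F_N - F||_inf for N iid
# Uniform(0, 1) samples, where F(x) = x, compared against the DKW bound.
random.seed(5)
N = 2000
xs = sorted(random.random() for _ in range(N))

# sup_x |F_N(x) - F(x)| for a sorted sample, evaluated at the jump points of F_N
ks = max(max(abs((i + 1) / N - x), abs(i / N - x)) for i, x in enumerate(xs))

eps = 0.05
# DKW: P[||F_N - F||_inf > eps] <= 2 exp(-2 N eps^2), which is tiny here,
assert 2 * math.exp(-2 * N * eps ** 2) < 1e-4
# so observing a statistic below eps is exactly what we expect (Glivenko-Cantelli).
assert ks < eps
```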
11 In practice it is usually easiest to check Lyapunov’s condition for δ = 1. If a sequence of random variables satisfies
Lyapunov’s condition, then it also satisfies Lindeberg’s condition. The converse implication, however, does not hold.
12 The almost sure pointwise convergence of FN to F follows from the strong law of large numbers. This is, therefore, a
stronger result.


1.5 Standard Distributions


1.5.1 Uniform distribution
1. (Definition). A random variable X : Ω → [a, b], a < b, is said to follow the uniform distribution,
denoted as X ∼ U (a, b), if its cumulative distribution function is

FX (x) = P[X ≤ x] = 0 for x < a;  (x − a)/(b − a) for a ≤ x < b;  1 for x ≥ b.

The probability density function of X is

fX (x) = 1/(b − a) for x ∈ [a, b], and fX (x) = 0 otherwise.

The distribution U (0, 1) is called the standard uniform distribution.


2. (Characteristics). For X ∼ U (a, b), the expectation of X is IE[X] = (a + b)/2, and its variance is
Var[X] = (b − a)2 /12.
3. (Probability integral transform). Let X be a real-valued random variable which has a continuous
distribution with cumulative distribution function FX . Then, the random variable Y = FX (X),
that is, Y (ω) = FX (X(ω)) follows the standard uniform distribution.
4. (Inverse probability integral transform). Let Y ∼ U (0, 1) and let X be a random variable with
cumulative distribution function FX . Then FX−1 (Y ) has the same distribution as X, where FX−1 is
the generalized inverse (quantile function) FX−1 (y) = inf{x : FX (x) ≥ y}.
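The inverse transform of item 4 is the standard way to sample from a distribution given only uniform random numbers. The sketch below uses the hypothetical choice of an Exponential(lam) target, whose quantile function is F⁻¹(y) = −ln(1 − y)/lam, and compares empirical and true CDFs.

```python
import math, random

# Sketch of the inverse probability integral transform: if Y ~ U(0, 1) then
# F_X^{-1}(Y) has distribution F_X. Target: Exponential(lam) with
# F_X^{-1}(y) = -ln(1 - y) / lam.
random.seed(6)
lam = 1.5
samples = [-math.log(1.0 - random.random()) / lam for _ in range(100000)]

# Empirical CDF should match the true CDF 1 - exp(-lam x) at a few test points.
for x in (0.2, 0.5, 1.0, 2.0):
    empirical = sum(1 for s in samples if s <= x) / len(samples)
    true_cdf = 1.0 - math.exp(-lam * x)
    assert abs(empirical - true_cdf) < 0.01
```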

1.5.2 Normal distribution


1. (Definition: univariate case). For a real-valued random variable X, the PDF fX of the normal
distribution on IR with parameters µ and σ 2 is given by

fX (x) = (1/√(2πσ 2 )) exp(−(x − µ)2 /(2σ 2 )),

and cumulative distribution function

FX (x) = (1/2) [1 + erf((x − µ)/(√2 σ))].

We write X ∼ N (µ, σ 2 ) to denote that X follows the normal distribution with parameters µ and σ 2 .
2. (Characteristics). The mean, median and mode of N (µ, σ 2 ) are equal to µ. Its variance is σ 2 and
its MGF is MX (z) = exp(µz + σ 2 z 2 /2).
3. (Sum and scalar product of normals). Let X ∼ N (µX , σX2 ) and Y ∼ N (µY , σY2 ) be two independent
random variables. Then X + Y ∼ N (µX + µY , σX2 + σY2 ). If X and Y are jointly normally
distributed random variables, then X + Y is normally distributed with IE[X + Y ] = µX + µY and
σX+Y = √(σX2 + σY2 + 2ρσX σY ), where ρ is the correlation between X and Y . For any α ∈ IR,
αX ∼ N (αµX , α2 σX2 ).
4. (Multivariate normal distribution). The multivariate variant of the normal distribution, denoted by
N (µ, Σ) with µ ∈ IRn and Σ ∈ IRn×n symmetric positive semi-definite, is supported on µ + im(Σ);
for nonsingular Σ its PDF is

pX (x) = |2πΣ|−1/2 exp(−(1/2)(x − µ)> Σ−1 (x − µ)).
5. (Isserlis' theorem – high-order moments of multivariate normal). Let X = (X1 , . . . , X2n ) follow
the multivariate normal distribution with zero mean and covariances Σi,j = cov(Xi , Xj ). Then,

IE[X1 X2 · · · X2n ] = ∑ ∏ cov(Xi , Xj ),

where the sum runs over all distinct ways of partitioning {1, . . . , 2n} into n pairs {i, j} and the
product runs over the pairs of each partition. Moreover,

IE[X1 X2 · · · X2n−1 ] = 0.

29
1 Probability Theory

6. (Linear transformation). Let X ∼ N (µ, Σ), µ ∈ IRn , Σ ∈ IRn×n and Y = AX + b for constant A
and b. Then Y ∼ N (Aµ + b, AΣA> ).
7. (Conditioning). Let X1 , X2 be two random variables with values in IRn1 and IRn2 respectively.
Suppose that (X1 , X2 ) is jointly normal with mean (µ1 , µ2 ) and covariance blocks Σ11 , Σ12 , Σ21 ,
Σ22 . Then, the conditional distribution of X1 given X2 = x2 is N (µ(x2 ), Σ̄) with

µ(x2 ) = µ1 + Σ12 Σ22−1 (x2 − µ2 ),
Σ̄ = Σ11 − Σ12 Σ22−1 Σ21 .

8. (χ2 distribution). If X is an n-dimensional random variable and X ∼ N (µ, Σ), then

(X − µ)> Σ−1 (X − µ)

follows the χ2 (n) distribution.

1.5.3 Binomial distribution


1. (Definition). For n ∈ IN and p ∈ [0, 1], we say that a random variable X : Ω → {0, . . . , n} follows
the Binomial distribution, denoted X ∼ B(n, p), if its probability mass function is

P[X = k] = (n choose k) pk (1 − p)n−k ,

where (n choose k) = n!/(k!(n − k)!).

2. (Characteristics). If X ∼ B(n, p), then IE[X] = np, the median of X is either ⌊np⌋ or ⌈np⌉,
its variance is Var[X] = np(1 − p) and the moment generating function (MGF) of X is MX (z) =
(1 − p + pez )n . The cumulative distribution function of X is given in terms of the regularized
incomplete beta function:
P[X ≤ k] = I1−p (n − k, k + 1).

3. (Sum of independent Binomials). Let X ∼ B(n, p) and Y ∼ B(m, p) be independent random
   variables. Then, X + Y follows the Binomial distribution B(n + m, p).
4. (Binomial sum variance inequality). Let X ∼ B(n, p) and Y ∼ B(n′, p′) be two random variables,
   not necessarily independent. Let S = X + Y. Then,

   Var[S] ≤ IE[S] (1 − IE[S]/(n + n′)).

5. (De Moivre-Laplace theorem). For large n and k in the neighbourhood of np,

   C(n, k) p^k (1 − p)^{n−k} ≈ (2πnp(1 − p))^{−1/2} e^{−(k−np)²/(2np(1−p))}.

   As an example, consider the experiment of tossing n coins a large number of times and observing
   the number of heads each time. Then, as n grows large, the shape of the distribution of the number
   of heads approaches that of the normal distribution.
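The quality of the approximation can be checked numerically (the parameter values below are illustrative choices, with k at the mode np):

```python
from math import comb, exp, pi, sqrt

def binom_pmf(n, p, k):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def dml_approx(n, p, k):
    # Normal approximation: e^{-(k-np)^2 / (2np(1-p))} / sqrt(2 pi np(1-p))
    s2 = n * p * (1 - p)
    return exp(-(k - n * p) ** 2 / (2 * s2)) / sqrt(2 * pi * s2)

n, p, k = 1000, 0.4, 400
rel_err = abs(binom_pmf(n, p, k) - dml_approx(n, p, k)) / binom_pmf(n, p, k)
assert rel_err < 1e-2   # already below 1% at n = 1000
```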

1.5.4 Poisson distribution


1. (Definition). We say that a random variable X : Ω → IN follows the Poisson distribution with
   parameter λ > 0 if its probability mass function is

   P[X = k] = λ^k e^{−λ}/k!.

   We denote X ∼ Poisson(λ).

2. (Characteristics). If X ∼ Poisson(λ), then IE[X] = λ, Var[X] = λ and the moment generating
   function (MGF) of X is MX(z) = exp(λ(e^z − 1)). The median of X is in the interval [λ − ln 2, λ + 1/3].
3. (Sum of independent Poissons). Let Xi , i ∈ IN[1,N ] , be independent random variables with Xi ∼
Poisson(λi ). Define the random variable S = X1 + . . . + XN and λ = λ1 + . . . + λN , then
S ∼ Poisson(λ).
4. (Raikov’s theorem). If the sum of two independent nonnegative random variables, X and Y , follows
the Poisson distribution, then both X and Y follow the Poisson distribution.
5. (Law of rare events/Poisson limit theorem). Let (pn)n∈IN be a sequence of numbers in [0, 1] such
   that npn converges to a limit λ. Then

   lim_{n→∞} C(n, k) pn^k (1 − pn)^{n−k} = e^{−λ} λ^k/k!.
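A numerical illustration of the limit with pn = λ/n (the values of λ, k and n below are arbitrary); the gap between the Binomial and Poisson PMFs shrinks as n grows:

```python
from math import comb, exp, factorial

def binom_pmf(n, p, k):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

lam, k = 2.0, 3
poisson = exp(-lam) * lam ** k / factorial(k)
# With p_n = lam / n, the Binomial PMF approaches the Poisson PMF as n grows.
errs = [abs(binom_pmf(n, lam / n, k) - poisson) for n in (10, 100, 1000)]
assert errs[0] > errs[1] > errs[2]   # the error decreases with n
assert errs[2] < 1e-3
```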
[Figure: Binomial PMF vs. Poisson PMF.]
6. (Large λ). For large values of λ, e.g., λ > 10³, the normal distribution N(λ, λ) is considered a
   good approximation to Poisson(λ).

2 Multivariate distributions
2.1 Multivariate random variables
1. (Multivariate CDF). The cdf of a random variable X : (Ω, F, P) → IR^d is the function

   FX(x1, . . . , xd) = P[X1 ≤ x1, . . . , Xd ≤ xd] = P[⋂_{i=1,...,d} {Xi ≤ xi}].

2. (Multivariate CDF properties). The cdf FX of a random variable X : (Ω, F, P) → IR^d has the
   following properties
   i. It is monotonically nondecreasing with respect to each variable
   ii. It is right-continuous with respect to each variable
   iii. 0 ≤ FX(x1, . . . , xd) ≤ 1 for all x1, . . . , xd ∈ IR
   iv. lim_{x1,...,xd→∞} FX(x1, . . . , xd) = 1
   v. lim_{xi→−∞} FX(x1, . . . , xd) = 0 for all i ∈ IN[1,d]

3. (Expectation and variance). The expectation of a random variable X : (Ω, F, P) → IR^d, X =
   (X1, X2, . . . , Xd), is defined as

   IE[X] = (IE[X1], IE[X2], . . . , IE[Xd]),

   and the variance-covariance matrix of X is

   V(X) = IE[(X − IE[X])(X − IE[X])ᵀ],

   which is the multivariate counterpart of variance.

2.2 Copulas
2.2.1 Sklar’s theorem
1. (Definition). Let X be a d-dimensional random variable, X(ω) = (X1(ω), X2(ω), . . . , Xd(ω)),
   with continuous marginal CDFs FXi(x) = P[Xi ≤ x]. By the probability integral transform, the
   random variable U = (U1, . . . , Ud) defined by

   Ui = FXi(Xi),

   for i = 1, . . . , d, has uniformly distributed marginals, that is, Ui ∼ U(0, 1). The copula of X is the
   joint cumulative distribution function of U, that is

   C(u1, . . . , ud) = P[U1 ≤ u1, . . . , Ud ≤ ud] = P[X1 ≤ FX1⁻¹(u1), . . . , Xd ≤ FXd⁻¹(ud)].

   In general, a d-dimensional copula is a function C : [0, 1]^d → [0, 1] which is the joint cumulative
   distribution function of a d-dimensional random variable on [0, 1]^d with uniform marginals.
2. (Sklar’s theorem). Every multivariate CDF, H(x1 , . . . , xd ) = P[X1 ≤ x1 , . . . , Xd ≤ xd ], can be
expressed in terms of its marginals, FXi (x) = P[Xi ≤ x], and a copula C : [0, 1]d → [0, 1], that is

H(x1 , . . . , xd ) = C(FX1 (x1 ), . . . , FXd (xd )).

If the multivariate distribution has a PDF h, then there is a function c, called the copula density,
such that

h(x1, . . . , xd) = c(FX1(x1), . . . , FXd(xd)) fX1(x1) · . . . · fXd(xd),

where fXi are the marginal densities.
Conversely, given a copula C : [0, 1]^d → [0, 1] and marginal distributions FXi, there is a
d-dimensional CDF as described above.

3. (Characterization). A function C : [0, 1]^d → [0, 1] is a copula if and only if it satisfies the following
   properties
   i. For every j ∈ IN[1,d], C(1, . . . , 1, t, 1, . . . , 1) = t, where t appears in the j-th position
   ii. C is isotonic (order preserving), that is, C(u) ≤ C(u′) whenever u ≤ u′ in the sense ui ≤ u′i
       for all i ∈ IN[1,d]
   iii. C is d-nondecreasing, that is, for every hyperrectangle B ⊆ [0, 1]^d, the dC-volume of B is
        nonnegative, that is

        ∫_B dC ≥ 0,

        where dC is treated as a measure.

4. (Characterization of two-dimensional copulas). A two-dimensional copula C : [0, 1]² → [0, 1] satisfies
   the following properties
   i. for every u ∈ [0, 1], C(u, 0) = C(0, u) = 0
   ii. for every u ∈ [0, 1], C(u, 1) = C(1, u) = u
   iii. for all u, u′, v, v′ ∈ [0, 1] with u ≤ u′ and v ≤ v′,

   C(u′, v′) − C(u′, v) − C(u, v′) + C(u, v) ≥ 0

5. (Properties of copulas). A copula C : [0, 1]^d → [0, 1] possesses the following properties
   i. C(u1, . . . , ud) = 0 if there is an i0 ∈ IN[1,d] such that ui0 = 0
   ii. C is nonexpansive in the following sense:

   |C(u) − C(v)| ≤ Σ_{i=1}^d |ui − vi|

6. (Fréchet-Hoeffding copula bounds). For any copula C : [0, 1]^d → [0, 1],

   W(u1, . . . , ud) ≤ C(u1, . . . , ud) ≤ M(u1, . . . , ud),

   where

   W(u1, . . . , ud) := max{0, 1 − d + Σ_{i=1}^d ui}

   and

   M(u1, . . . , ud) := min{u1, . . . , ud}.

   The upper bound is sharp: M is always a copula and equality is attained for comonotone random
   variables.
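The bounds can be verified on a grid for a concrete copula, here the independence copula Π (a sketch for d = 3; both bounds apply to any copula, even though W itself is not a copula for d > 2):

```python
def W(u):   # lower Frechet-Hoeffding bound
    return max(0.0, 1 - len(u) + sum(u))

def M(u):   # upper Frechet-Hoeffding bound (comonotonicity copula)
    return min(u)

def Pi(u):  # independence copula: product of the coordinates
    p = 1.0
    for ui in u:
        p *= ui
    return p

grid = [i / 10 for i in range(11)]
for u1 in grid:
    for u2 in grid:
        for u3 in grid:
            u = (u1, u2, u3)
            # W(u) <= C(u) <= M(u) with C = Pi (up to float round-off)
            assert W(u) <= Pi(u) + 1e-12 and Pi(u) <= M(u) + 1e-12
```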

2.2.2 Examples of copulas


1. (Independence copula). The independence copula is Πd(u1, . . . , ud) = u1 u2 · · · ud; it is associated
   with a random variable whose components are independent and uniformly distributed on [0, 1].

2. (Comonotonicity copula). The copula Md (u1 , . . . , ud ) = min{u1 , . . . , ud } which is associated with


a random variable U = (U1 , . . . , Ud ) where Ui are uniformly distributed and U1 = U2 = . . . = Ud
almost surely.

3. (Counter-monotonicity copula in 2D). We define the copula W2(u1, u2) = max{u1 + u2 − 1, 0},
   which is associated with a U = (U1, U2) where Ui are uniformly distributed on [0, 1] and U1 = 1 − U2
   almost surely.

4. (Fréchet-Hoeffding bounds). For every d-dimensional copula Cd and u ∈ [0, 1]^d, we have

   Wd(u) ≤ Cd(u) ≤ Md(u),

   where Wd is the d-dimensional variant of the counter-monotonicity copula W2 shown above, defined
   as

   Wd(u) = max{Σ_{i=1}^d ui − d + 1, 0}.

   For d > 2, Wd is not a copula. The above bounds are tight, that is,

   inf_{C : d-dim. copula} C(u) = Wd(u),   sup_{C : d-dim. copula} C(u) = Md(u).

5. (Convex combinations of copulas). Suppose that
   i. U and U′ are two d-dimensional random variables on (Ω, F, P), distributed with copulas C and
      C′ respectively,
   ii. Z is a random variable taking values in {1, 2} with P[Z = 1] = α and P[Z = 2] = 1 − α,
   iii. we define two functions σ, σ′ : IR → {0, 1} by σ(x) = δ1(x), that is, σ(1) = 1 and σ(x) = 0
        for x ≠ 1, and, similarly, σ′(x) = δ2(x).
   Then, the random variable

   Ū = σ(Z)U + σ′(Z)U′

   is distributed according to the copula

   C̄ = αC + (1 − α)C′.

6. (Fréchet-Mardia family of copulas). Define the d-dimensional Fréchet-Mardia copula as

   Cd^FM = λΠd + (1 − λ)Md,

   for λ ∈ [0, 1].

3 Stochastic Processes
3.1 General
1. (Stochastic process). Let T ⊆ IR (e.g., T = IN or T = IR). A random process is a sequence/net
(Xn )n∈T of (real-valued) random variables on a probability space (Ω, F, P).

2. (Version). Let T = [0, ∞) be a time index set and (Xt )t , (Yt )t be two stochastic processes on
(Ω, F, P). We say that (Xt )t is a version of (Yt )t if

Xt = Yt , P-a.s. for all t ∈ T,

that is, P[{ω | Xt (ω) = Yt (ω)}] = 1 for all t ∈ T .

3. (Centered). Let (Xt )t be a real-valued stochastic process with t ∈ [a, b]. We say that (Xt )t is
centered if IE[Xt ] = 0 for all t ∈ [a, b].

4. (Mean-square continuous). Let (Xt)t be a real-valued stochastic process with t ∈ [a, b]. We say
   that (Xt)t is mean-square continuous if

   lim_{ε→0} IE[(X_{t+ε} − Xt)²] = 0,

   for all t ∈ [a, b].

5. (Auto-correlation function). Let (Xt )t∈T be a stochastic process. Define the function RX : T ×T →
IR as
RX (s, t) = IE[Xs Xt ].
This function is called the auto-correlation function of (Xt )t .

6. (Mean-square continuity criterion). A stochastic process (Xt )t∈[a,b] is mean-square continuous if


and only if its auto-correlation function, RX , is continuous on [a, b] × [a, b].

7. (Filtrations). A filtration is an increasing sequence of sub-σ-algebras of F. The space (Ω, F, (Ft )t∈T , P)
is called a filtered probability space. The filtration Ft = σ({Xs ; s ∈ T, s ≤ t}) is called the filtration
generated by (Xn )n∈T . We say that (Xn )n is adapted to a filtration (Fn )n if for all n ∈ T, Xn is
Fn -measurable.

8. (Stopping times). Let (Ft)t be a filtration on (Ω, F, P) and define T̄ := T ∪ {+∞}. A random
   variable τ : Ω → T̄ is called a stopping time if

   {ω | τ(ω) ≤ t} ∈ Ft,

   for all t ∈ T. This is equivalent to requiring that the process Zt = 1_{τ≤t} is adapted to (Ft)t∈T.

9. (Wald’s first identity)1 . Let (Xk )k∈IN be a sequence of iid random variables with common finite
mean, IE[|Xi |] < ∞. Let τ be a stopping time with IE[τ ] < ∞. Then,

IE[X1 + . . . + Xτ ] = IE[τ ]IE[X1 ].

10. (Wald’s second identity). Let (Xk )k∈IN be a sequence of iid random variables with zero mean and
common finite variance σ 2 = IE[Xi2 ] < ∞. Let τ be a stopping time with IE[τ ] < ∞. Then,

IE[(X1 + . . . + Xτ )2 ] = σ 2 IE[τ ].
¹Details and proofs for the three identities of Wald can be found in the lecture notes of S. Lalley (Statistics 381).
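Wald's first identity lends itself to a seeded Monte Carlo illustration (the step distribution and stopping rule below are arbitrary choices, not taken from the text):

```python
import random

random.seed(7)

def one_run():
    """Draw iid X_i uniform on {1, 2} (mean 1.5) and stop at
    tau = first n with X_1 + ... + X_n >= 10."""
    s, n = 0, 0
    while s < 10:
        s += random.choice((1, 2))
        n += 1
    return s, n

runs = [one_run() for _ in range(20000)]
mean_S = sum(s for s, _ in runs) / len(runs)
mean_tau = sum(n for _, n in runs) / len(runs)
# Wald's first identity: IE[X_1 + ... + X_tau] = IE[X_1] IE[tau] = 1.5 IE[tau]
assert abs(mean_S / mean_tau - 1.5) < 0.02
```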

11. (Wald's third identity). Let (Xk)k∈IN be a sequence of nonnegative iid random variables with mean
    IE[Xk] = 1. Let τ be a bounded stopping time with IE[τ] < ∞. Then,

    IE[∏_{i=1}^τ Xi] = 1.

12. (A useful property). For any stochastic process (Xn)n∈IN and ε > 0, we have

    P[max_{i≤k} |Xi| > ε] = P[Σ_{i=0}^k Xi² · 1_{|Xi|>ε} > ε²].

13. (Kolmogorov's continuity theorem). Let (Xt)t be an IRⁿ-valued stochastic process on (Ω, F, P).
    Suppose that for all t > 0 there are positive constants α, β, L such that

    IE[‖Xτ − Xτ′‖^α] ≤ L|τ′ − τ|^{1+β},

    for all 0 ≤ τ, τ′ ≤ t. Then, there is a continuous version of X.


14. (càdlàg function). Let (M, d) be a metric space and E ⊆ IR. A function f : E → M is called a
    càdlàg (continue à droite, limite à gauche) function if for every t ∈ E,
    i. lim_{s→t−} f(s) exists
    ii. lim_{s→t+} f(s) exists and is equal to f(t),
    that is, f is right-continuous and the limit from the left exists.

3.2 Martingales
1. (Martingale — discrete time). A random process (Xn )n is called a martingale if IE[|Xn |] < ∞ and
IE[Xn+1 | X1 , . . . , Xn ] = Xn .
2. (Martingale — continuous time). A random process (Xt)t≥0 on a filtered probability space (Ω, F, (Ft)t≥0, P)
   is called a martingale if (i) it is adapted to (Ft)t≥0, (ii) for every t ≥ 0, IE[|Xt|] < ∞, (iii) for all
   s, t ≥ 0 with s < t and all F ∈ Fs, IE[1F(Xt − Xs)] = 0, or, equivalently, Xs = IE[Xt | Fs].
3. (Martingale examples). The following are common examples of martingales:
   a) Let (Xn)n be a sequence of iid random variables with mean IE[Xn] = µ. Then Yn = Σ_{i=1}^n (Xi −
      µ) is a martingale.
   b) Let (Xn)n be a sequence of iid random variables with mean IE[Xn] = 1 and finite variance.
      Define Y0 = 1 and Yn = X1 X2 · . . . · Xn. Then Yn is a martingale.
   c) More generally, if (Xn)n is a sequence of iid random variables with mean 1, then Yn = ∏_{i=1}^n Xi
      is a martingale (no variance assumption is needed).
   d) If (Xn)n is a sequence of random variables with finite expectation and IE[Xn | X1, . . . , Xn−1] =
      0, then Yn = Σ_{i=0}^n Xi is a martingale.
   e) (The classical martingale). The fortune of a gambler is a martingale in a fair game.
4. (Sub- and super-martingales). A random process (Xn )n is called a super-martingale if IE[|Xn |] < ∞
and IE[Xn+1 | X1 , . . . , Xn ] ≤ Xn . Likewise, it is a sub-martingale if IE[|Xn |] < ∞ and IE[Xn+1 |
X1 , . . . , Xn ] ≥ Xn .
5. (Stopping time). Let {Zk}k be a random process and T a stopping time. Define Xk(ω) = Z_{k∧T(ω)}(ω),
   that is, Xk(ω) = Zk(ω) if k ≤ T(ω), and Xk(ω) = Z_{T(ω)}(ω) otherwise.
   If Z is a (sub-)martingale, then X is a (sub-)martingale too.
6. (Stopped martingales are martingales). Let (Xn )n be a martingale. Let τ be a stopping time.
Then X̃n = Xn∧τ is a martingale.

38
3.2 Martingales

7. (Doob’s optional stopping theorem). Let (Xn )n be a super-martingale and T be a stopping time.
Then XT is integrable and IE[XT ] ≤ IE[X0 ] in each of the following cases
i. T is bounded
ii. X is bounded and T is almost surely finite
iii. IE[T] < ∞ and (Xn)n has (surely) bounded differences, i.e., there is an M > 0 such that
|Xn (ω) − Xn−1 (ω)| ≤ M,
for all n ∈ IN and ω ∈ Ω
iv. Xn ≥ 0 for all n and T is almost surely finite
8. (Optional stopping theorem, version 2). Let (Xt )t be a martingale on (Ω, F, P) subject to a filtration
F = (Ft )t and let τ be a stopping time. Assume that one of the following holds
i. τ is almost surely bounded, that is, there is a τ̄ ≥ 0, so that τ (ω) ≤ τ̄ for P-almost all ω 2
ii. IE[τ] is finite and IE[|X_{k+1} − Xk| | Fk] is almost surely bounded, uniformly in k,
iii. |Xmin(t,τ ) | is almost surely bounded,
Then Xτ is almost surely a well-defined random variable and
IE[Xτ ] = IE[X0 ].
If X is assumed to be a super-martingale, then
IE[Xτ ] ≤ IE[X0 ].
If X is assumed to be a sub-martingale, then
IE[Xτ ] ≥ IE[X0 ].

9. (Optional stopping theorem, more general version). Let (Xt)t be a uniformly integrable martingale
   on (Ω, F, P) with respect to a filtration F = (Ft)t (then X has a well-defined limit X∞, so we may
   define X̄τ = Xτ 1_{τ<∞} + X∞ 1_{τ=∞}). Let τ′ ≤ τ be two stopping times. Then,
   IE[X̄τ | Fτ′] = X̄τ′.
10. (Almost sure martingale convergence). Let (Xn )n be a martingale which is uniformly bounded
in L1, i.e., sup_n IE[|Xn|] < ∞. Then, there is an X ∈ L1(F∞), such that Xn → X a.s., where
F∞ = σ(Fn, n ≥ 0).
11. (Kolmogorov's sub-martingale inequality). Let {Xk}k be a nonnegative sub-martingale. Then, for
    n ∈ IN>0 and α > 0,

    P[max_{k=1,...,n} Xk ≥ α] ≤ IE[Xn]/α.

    i. (Corollary 1). Let {Xk}k be a nonnegative martingale. Then P[sup_{k≥1} Xk ≥ α] ≤ IE[X1]/α
       for α > 0.
    ii. (Corollary 2). Let {Xk}k be a martingale with IE[Xk²] < ∞ for all k ∈ IN>0. Then,
        P[max_{k=1,...,n} |Xk| ≥ α] ≤ IE[Xn²]/α² for all n ∈ IN>0 and α > 0.
    iii. (Corollary 3). Let {Xk}k be a nonnegative super-martingale. Then, for n ∈ IN>0 and α > 0,
         P[∪_{k≥n} {Xk ≥ α}] ≤ IE[Xn]/α.
12. (Azuma-Hoeffding inequality for martingales with bounded differences). Let (Xi)i be a martingale
    or a super-martingale with |Xk − Xk−1| ≤ ck almost surely. Then for all N ∈ IN and t > 0,

    P[XN − X0 ≥ t] ≤ exp(−t² / (2 Σ_{i=1}^N ci²)).

    If (Xi)i is a sub-martingale,

    P[XN − X0 ≤ −t] ≤ exp(−t² / (2 Σ_{i=1}^N ci²)).
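For the simple ±1 random walk, a martingale with bounded differences ci = 1, the Azuma-Hoeffding bound can be checked exactly against the binomial tail (the horizon N below is an arbitrary choice):

```python
from math import comb, exp

# X_N = sum of N iid +/-1 steps is a martingale with |X_k - X_{k-1}| = 1,
# so Azuma-Hoeffding gives  P[X_N - X_0 >= t] <= exp(-t^2 / (2N)).
N = 30
for t in range(1, N + 1):
    # Exact tail: X_N >= t  iff  the number of +1 steps is >= ceil((N + t)/2)
    start = -(-(N + t) // 2)                       # ceil((N + t) / 2)
    tail = sum(comb(N, k) for k in range(start, N + 1)) / 2 ** N
    assert tail <= exp(-t ** 2 / (2 * N)) + 1e-15
```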
²This is a strong condition which is often not satisfied in practice. However, for fixed N ∈ IN, τ ∧ N is a stopping time.
We often apply the optional stopping theorem for the bounded stopping time τ ∧ N and take N → ∞.

[Figure 3.1: Random walk: two different paths, (Xt(ω1))t and (Xt(ω2))t.]

13. (Martingale inequalities). Let (Xt)t≥0 be a càdlàg martingale and define Xt* = sup_{s≤t} |Xs|.
    Then, for every t > 0,
    i. for α > 0, P[Xt* ≥ α] ≤ ‖Xt‖1/α
    ii. for p > 1, ‖Xt*‖p ≤ (p/(p−1)) ‖Xt‖p

14. (Nonnegative submartingale inequalities). Let (Xt)t≥0 be a nonnegative càdlàg submartingale and
    define Xt* = sup_{s≤t} |Xs|. Then, for every t > 0,
    i. for α > 0, P[Xt* ≥ α] ≤ ‖Xt‖1/α
    ii. for p > 1, ‖Xt*‖p ≤ (p/(p−1)) ‖Xt‖p

3.3 Random walk

1. (One-dimensional random walk). Take a sequence of independent random variables (Zt)t∈IN taking
   values in {−1, 1}, each with probability 1/2. Define a random process (Xt)t∈IN with X0 = 0
   and Xt = Σ_{i=1}^t Zi. The process (Xt)t∈IN is called a (one-dimensional) (simple) random walk on Z.
   A random walk is shown in Figure 3.1.
   More generally, a random walk can be such that p := P[Zt = 1] = 1 − P[Zt = −1] differs from 1/2.
2. (Characteristics). The expectation of the random walk (Xt)t∈IN with p = 1/2 is IE[Xt] = 0 and its
   variance is IE[Xt²] = t. Additionally,

   lim_{t→∞} IE[|Xt|]/√t = √(2/π).

3. (Distribution of Xt). Let (Xt)t be a one-dimensional random walk with p := P[Xt+1 − Xt = 1] and
   define q := 1 − p. For t ∈ IN, t ≥ 1, the random variable Xt takes values in Xt := {−t, −t+2, . . . , t−2, t}
   and its distribution is given by

   P[Xt = m] = C(t, (t+m)/2) p^{(t+m)/2} q^{(t−m)/2},

   for all m ∈ Xt. This is a (shifted) Binomial distribution on Xt with parameter p (see definition in
   Section 1.5.3).
4. (Maximum of random walk). Let Xt be a simple symmetric random walk (with p = 1/2) and define
   Mt = max_{t′≤t} Xt′. Then, M0 = 0, the support of Mt is {0, 1, . . . , t} and

   P[Mt = m] = P[Xt = m] + P[Xt = m + 1] = 2^{−t} C(t, ⌊(t+m+1)/2⌋).
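For small t the formula for P[Mt = m] can be verified by brute-force enumeration of all 2^t equally likely paths (t = 10 below is an arbitrary choice):

```python
from math import comb
from itertools import product

t = 10
# Count the maximum over all 2^t paths of the simple symmetric walk.
count = {}
for steps in product((-1, 1), repeat=t):
    pos, m = 0, 0
    for z in steps:
        pos += z
        m = max(m, pos)
    count[m] = count.get(m, 0) + 1

for m in range(t + 1):
    exact = count.get(m, 0) / 2 ** t
    formula = comb(t, (t + m + 1) // 2) / 2 ** t   # 2^{-t} C(t, floor((t+m+1)/2))
    assert abs(exact - formula) < 1e-12
```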

5. (Infinitely often visits). Almost surely, the one-dimensional simple symmetric random walk visits
   every integer n ∈ Z infinitely often.
6. (As a Markov chain). The one-dimensional random walk can be seen as a Markov chain with states
   in Z and P[Xk+1 = i + 1 | Xk = i] = p and P[Xk+1 = i − 1 | Xk = i] = 1 − p.
7. (Probability to reach upper bound before lower bound). Let (Xn)n be a simple symmetric random
   walk starting at x ∈ Z, that is, X0 = x. Let a < x < b for some a, b ∈ Z. Let τa = inf{n ∈ IN | Xn = a}
   and τb = inf{n ∈ IN | Xn = b}. Then

   P[τb < τa] = (x − a)/(b − a).

8. (Average time to exit interval). Let (Xk)k be a simple symmetric random walk with X0 = x. Suppose
   x ∈ [a, b] with a, b ∈ Z. Define the stopping time τ = inf{n ∈ IN | Xn ∈ {a, b}} (therefore, Xτ ∈ {a, b}).
   Define the stochastic process

   Yn = n + (Xn − a)(b − Xn).

   Let Fn be the σ-algebra generated by (Xk)_{k=0}^n. Then,
   a) τ is an almost surely finite stopping time with IE[τ] < ∞
   b) (Yn)n is an (Fn)n-martingale
   c) Y0 = (x − a)(b − x) and Yτ = τ
   d) from the (general) optional stopping theorem, IE[τ] = IE[Yτ] = IE[Y0] = (x − a)(b − x)
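Both facts above, the exit probability of item 7 and the expected exit time of item 8, can be verified numerically by iterating the one-step (first-step analysis) equations to their fixed point (the endpoints a = 0, b = 10 below are illustrative):

```python
# For the simple symmetric walk:
#   P_x = (P_{x-1} + P_{x+1}) / 2,  P_a = 0, P_b = 1     (reach b before a)
#   T_x = 1 + (T_{x-1} + T_{x+1}) / 2,  T_a = T_b = 0    (expected exit time)
a, b = 0, 10
P = {x: 0.0 for x in range(a, b + 1)}; P[b] = 1.0
T = {x: 0.0 for x in range(a, b + 1)}
for _ in range(20000):                 # Gauss-Seidel sweeps to convergence
    for x in range(a + 1, b):
        P[x] = 0.5 * (P[x - 1] + P[x + 1])
        T[x] = 1.0 + 0.5 * (T[x - 1] + T[x + 1])

for x in range(a + 1, b):
    assert abs(P[x] - (x - a) / (b - a)) < 1e-6    # P[tau_b < tau_a]
    assert abs(T[x] - (x - a) * (b - x)) < 1e-6    # IE[tau]
```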
9. (Gaussian random walk). Take a sequence of independent random variables Zt with Zt ∼ N(µ, σ²).
   The random process (Xt)t with X0 = 0 and Xt = Z1 + . . . + Zt is called a Gaussian random walk.
   Then Xt ∼ N(tµ, tσ²).

3.4 Brownian motion


1. (Definition). A stochastic process (Xt)t∈IR+ with values in (IR, B(IR)) is called a Brownian motion
   if it is continuous and has stationary independent increments. A process (Wt)t∈IR+ is called a Wiener
   process if it is a Brownian motion with
   i. W0 = 0
   ii. IE[Wt] = 0
   iii. IE[Wt²] = t for all t ∈ IR+
2. (Definition according to Øksendal). Following Kolmogorov's extension theorem, for given time
   instants t1, . . . , tk in a time set T (typically T = [0, ∞)), define the following probability measure
   on IR^{nk}

   ν_{t1,...,tk}(F1 × · · · × Fk) = ∫_{F1×···×Fk} p(t1, x, x1) p(t2 − t1, x1, x2) · · · p(tk − tk−1, xk−1, xk) dx1 dx2 · · · dxk,

   where p is the function

   p(t, x, y) = (2πt)^{−n/2} exp(−‖x − y‖²/(2t)).

   Let (Ω, F, P) be the probability space obtained from Kolmogorov's extension theorem and let (Bt)t
   be an IRⁿ-valued stochastic process with

   P[Bt1 ∈ F1, . . . , Btk ∈ Fk] = ν_{t1,...,tk}(F1 × · · · × Fk).

   Such a process is a (version of) Brownian motion starting at x.
3. (Expectation and covariance). Let (Bt)t∈T be a Brownian motion starting at x. Then IE[Bt] = x
   for all t ∈ T. Let t1 ≤ . . . ≤ tk be time instants and Z = (Bt1, Bt2, . . . , Btk) be an IR^{nk}-valued
   random variable. Then cov[Z] is the block matrix whose (i, j)-th n×n block is min(ti, tj) In, that is,

   cov[Z] = [t1 In  t1 In  · · ·  t1 In]
            [t1 In  t2 In  · · ·  t2 In]
            [ ...    ...   · · ·   ... ]
            [t1 In  t2 In  · · ·  tk In]

   Observe that, as a result, we have

   IE[(Bt − x)ᵀ(Bs − x)] = n min(s, t).

4. (Existence of continuous version). The Brownian motion satisfies Kolmogorov’s continuity theorem
with α = 4, β = 1 and L = n(n + 2). In particular, IE[kXτ − Xτ 0 k4 ] = n(n + 2)|τ 0 − τ |2 .
5. (Zero crossings). Properties of the set of zero-crossing times Z0 = {t ≥ 0 | Bt = 0}:
   i. With probability 1, the Lebesgue measure of Z0 is zero,
   ii. Almost surely, Z0 is a closed set and has no isolated points,
   iii. The Brownian motion crosses the time axis infinitely often in every time interval (0, t) for
        t > 0.
6. (Distribution of maximum). Let Xt be a Brownian motion and Mt = max_{s≤t} Xs. Then, for all
   t > 0 and a > 0,

   P[Mt ≥ a] = 2P[Xt ≥ a] = 2(1 − Φ(a/√t)).

7. (Attainment of maximum 1). Almost surely, the set of times where Bt attains a local maximum is
   dense in [0, +∞).
8. (Attainment of maximum 2). Almost surely, Bt does not attain the same maximum on any two
   disjoint time intervals.
9. (Strict maximum). Almost surely, every local maximum of a Brownian motion is a strict local
   maximum.
10. (Countability of the set of maximum times). Almost surely, the set of times when Bt attains a local
    maximum is countable.
11. (Maxima are distinct). The local maxima of Bt are almost surely distinct.
12. (Nowhere differentiable). Almost surely, t ↦ Xt(ω) is nowhere differentiable.
13. (Orthogonal transformation). Let Bt be an n-dimensional Brownian motion starting at 0 and U
    be an orthogonal matrix, UUᵀ = I. Then, B̃t = UBt is a Brownian motion.
14. (Brownian scaling). Let Bt be an n-dimensional Brownian motion starting at 0 and c > 0. Then,
    B̂t = (1/c)B_{c²t} is a Brownian motion.
15. (Time inversion). Let Bt be an n-dimensional Brownian motion starting at 0 and (B̆t )t is a process
with B̆0 = 0 and B̆t = tB1/t . Then B̆ is a Brownian motion.
16. (Integrated Brownian motion). The integral of the one-dimensional Brownian motion starting at
    0, ibm(t, ω) := ∫_0^t Bs(ω) ds, is a random variable which follows the normal distribution N(0, t³/3).
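A seeded Monte Carlo sketch of the N(0, t³/3) law of the integrated Brownian motion, using a right-endpoint Riemann sum (step counts and tolerances are illustrative; the discretization adds an O(1/n) bias to the variance):

```python
import random

random.seed(0)
t, n_steps, n_paths = 1.0, 100, 4000
dt = t / n_steps
samples = []
for _ in range(n_paths):
    B, integral = 0.0, 0.0
    for _ in range(n_steps):
        B += random.gauss(0.0, dt ** 0.5)   # Brownian increment over dt
        integral += B * dt                  # Riemann sum for int_0^t B_s ds
    samples.append(integral)

mean = sum(samples) / n_paths
var = sum((s - mean) ** 2 for s in samples) / n_paths
assert abs(mean) < 0.05                  # IE[ibm(t)] = 0
assert abs(var - t ** 3 / 3) < 0.05      # Var[ibm(t)] = t^3 / 3
```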
17. (Exit time). Let (Bt)t be a one-dimensional Brownian motion on (Ω, F, P) started at 0 and define
    τ(ω) = inf{t ∈ T | Bt ∉ [−a, b]}, where a, b > 0. This means that τ is the first time the
    process leaves the interval [−a, b]. Then,
    i. τ is an integrable random variable
    ii. IE[τ] = ab
    iii. IE[Bτ] = 0 and IE[Bτ²] = IE[τ] = ab
    iv. P[Bτ = −a] = b/(a+b) and P[Bτ = b] = a/(a+b)

Note. We often need to evaluate the expectation of a transformation of the Brownian motion, Yt = f(Bt).
Using the fact that Bt is normally distributed at every t and the law of the unconscious statistician,

IE[f(Bt)] = (1/√(2πt)) ∫_{−∞}^{∞} f(x) e^{−x²/(2t)} dx,

provided f(Bt) is integrable. Similarly, we may need to evaluate

IE[∫_0^t f(Bs) ds] = ∫_0^t IE[f(Bs)] ds = ∫_0^t (1/√(2πs)) ∫_{−∞}^{∞} f(x) e^{−x²/(2s)} dx ds

(using Fubini's theorem).

3.5 Markov processes


1. (Definition). Let (Ω, F, {Ft}t∈T, P) be a filtered probability space. Let {Xt}t∈T be a random
   process which is adapted to the filtration {Ft}t∈T. Let {Gt}t∈T be the filtration generated by
   {Xt}t∈T and Gt∞ = σ({Xu : u ≥ t, u ∈ T}). The process is said to be Markovian if for every t ∈ T,
   the past Ft and the future Gt∞ are conditionally independent given Xt.

2. (Characterization). The following are equivalent:
   i. The process {Xt}t is Markovian with state space (E, E)
   ii. For every t ∈ T and u > t, and every positive E-measurable f, IE[f ∘ Xu | Ft] = IE[f ∘ Xu | Xt]
   iii. Let Ê be a π-system generating E. For every t ∈ T and u > t, and A ∈ Ê, IE[1A ∘ Xu | Ft] =
        IE[1A ∘ Xu | Xt]
   iv. For every t ∈ T and positive Gt∞-measurable V, IE[V | Ft] = IE[V | Xt]
   v. For every t ∈ T and positive Gt∞-measurable V, IE[V | Ft] is σ(Xt)-measurable
3. (Markov transition functions). Let (Pt,u )t,u∈T,t≤u be a family of Markov transition kernels on
(Ω, F). This is said to be a Markovian transition function if Ps,t Pt,u = Ps,u for all 0 ≤ s < t ≤ u.
4. (Chapman-Kolmogorov equation). A Markov process X = {Xt }t∈T is said to admit (Pt,u ) as a
transition function if
IE[f ◦ Xu | Xt ] = (Pt,u f ) ◦ Xt , t < u,
for all nonnegative functions f .
5. (Time homogeneity). We call a Markov process time homogeneous if it admits a transition function
   (Pt,u) which depends only on u − t, i.e., there is a family of Markov kernels (Ps)s with Pt,u = P_{u−t}.
6. (Martingales from Markov chains). Let (Xn)n be a random process adapted to a filtration F with
   state space (E, E). Then (Xn)n is a Markov chain with transition kernel P with respect to F if
   and only if

   Mn = f ∘ Xn − Σ_{m=0}^{n−1} (Pf − f) ∘ Xm

   is a martingale with respect to F for every nonnegative E-measurable function f.

3.6 Markov decision processes


3.6.1 General
1. (Definition of MCM). A Markov control model (MCM) is a tuple (X , U, U, Q, c) consisting of
i. two Borel spaces X and U called the state and control spaces respectively,
ii. a multivalued mapping U : X ⇒ U which maps a state x ∈ X to a set U(x) ⊆ U of
feasible control actions. Define the graph of U as the set K := gph(U ) = {(x, u) ∈ X × U |
u ∈ U (x)}. We assume that K contains the graph of a measurable (single-valued) function
u : X → U.
iii. a transition kernel Q : B(X ) × K 3 (B, x, u) 7→ Q(B, x, u) ∈ [0, 1], where (x, u) is such that
x ∈ X , u ∈ U (x) and B ∈ B(X ) (See Section 1.1.10).
iv. a measurable cost function c : K → IR.
2. (Example). Let (S, B(S)) be a Borel space. Let {ξt}t be a collection of iid random variables with
   values in S. Let µ be their common probability distribution. Consider the dynamical system

   x_{t+1} = F(xt, ut, ξt).

   Then, the transition kernel Q is

   Q(B, x, u) = µ({ω ∈ S | F(x, u, ω) ∈ B}) = ∫_S 1B(F(x, u, ω)) dµ(ω) = IE[1B(F(x, u, ξt))].
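A minimal instance of this construction for a hypothetical scalar system (the dynamics F and the noise law below are invented for illustration):

```python
# Hypothetical system x_{t+1} = F(x, u, xi) = x + u + xi, with noise xi
# uniformly distributed on S = {-1, 0, 1}; then Q(B, x, u) = IE[1_B(F(x, u, xi))].
S = (-1, 0, 1)                      # noise support, mu = uniform measure
F = lambda x, u, xi: x + u + xi

def Q(B, x, u):
    """Transition kernel: probability that the next state lands in the set B."""
    return sum(1.0 for xi in S if F(x, u, xi) in B) / len(S)

assert Q({2, 3, 4}, x=2, u=1) == 1.0    # next state is in {2, 3, 4} surely
assert Q({4}, x=2, u=1) == 1 / 3
assert Q({0}, x=2, u=1) == 0.0
```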

3. (Equivalent representation of Markov control models). For every MCM, there is a Borel space S,
   a measurable function F : X × U × S → X and an S-valued iid process {ξt}t so that

   x_{t+1} = F(xt, ut, ξt).

4. (Definition of Ht and H̄t). Define H0 = X and Ht = K^t × X. Ht contains elements of the form
   ht = (x0, u0, . . . , x_{t−1}, u_{t−1}, xt) with uk ∈ U(xk). Define also the larger space H̄t = (X × U)^t × X
   with H̄0 = X.

5. (Definition of a policy). A policy is a sequence π = (π0, π1, . . .) of transition kernels πt : B(U) × Ht →
   [0, 1] with

   πt(U(xt), ht) = 1, for all ht ∈ Ht, t ∈ IN.

6. (The canonical probability space (Ω, F, P)). Given an MCM (X, U, U, Q, c), let Ω = ∏_{t=0}^∞ (X × U).
   Ω contains sequences ω = (x0, u0, x1, u1, . . .). Let F be the corresponding product σ-algebra. Given a
   probability measure ν on (X, B(X)) (called the initial distribution) and a policy π, according to the
   Ionescu-Tulcea theorem, there is a unique probability measure Pπν so that for B ∈ B(X), C ∈ B(U),
   ht ∈ Ht and t ∈ IN:
   i. Pπν[x0 ∈ B] = ν(B)
   ii. Pπν[ut ∈ C | ht] = πt(C, ht)
   iii. Pπν[x_{t+1} ∈ B | ht, ut] = Q(B, xt, ut)
   Note that the last condition is a Markov-like property, but it does not imply that xt is a Markov
   process.

7. (Markov decision process). A (discrete-time) Markov decision process (MDP) is a tuple (Ω, F, Pπν , {xt }t ).
In other words, for a given policy π and a given initial distribution ν, an MDP is a stochastic process
{xt (ω)}t∈IN over the canonical probability space (Ω, F, Pπν ).

8. (Space Φ). We define the space Φ of all transition kernels φ : B(U)×X → [0, 1] with φ(U(x), x) = 1.

9. (Types of policies). A policy π is called


i. a randomized Markov policy if there are φt ∈ Φ so that πt ( · | ht ) = φt ( · | xt ), for ht ∈ Ht ,
t ∈ IN.
ii. a randomized stationary policy if there is a φ ∈ Φ with πt ( · | ht ) = φ( · | xt ), for ht ∈ Ht ,
t ∈ IN.
iii. a deterministic policy if there are functions gt : Ht → U such that for ht ∈ Ht , t ∈ IN,
gt (ht ) ∈ U (xt ) and πt ( · | ht ) is concentrated at gt (ht ), that is, πt (C, ht ) = 1C (gt (ht )) for all
C ∈ B(U)
iv. a deterministic Markov policy if there exist functions gt : X → U, with gt (x) ∈ U (x) for all
x ∈ X , such that πt ( · , ht ) is concentrated at gt (xt ) for all ht ∈ Ht , t ∈ IN.
v. a deterministic stationary policy if there is a function g : X → U, with g(x) ∈ U (x) for all
x ∈ X , such that πt ( · , ht ) is concentrated at g(xt ) for all ht ∈ Ht , t ∈ IN.

10. (Markovianity of {xt }t ). Let ν be an initial distribution. Let π = {φt } be a randomized Markov pol-
icy (see 9-i). Then, {xt }t is a non-homogeneous Markov process with transition kernels {Q(·, ·, φt )}t ,
that is, for B ∈ B(X )

Pπν [xt+1 ∈ B | x0 , . . . , xt ] = Pπν [xt+1 ∈ B | xt ] = Q(B, xt , φt ).

If π = {φ}t is a stationary policy, then the above also holds with

Pπν [xt+1 ∈ B | x0 , . . . , xt ] = Pπν [xt+1 ∈ B | xt ] = Q(B, xt , φ),

so {xt }t is a time-homogeneous Markov process.

3.6.2 Optimal control problems


1. (Statement). The finite-horizon optimal control problem is stated as

   minimize over policies π:  J(π, x) := IE_x^π[Σ_{t=0}^{N−1} c(xt, ut) + cN(xN)]
                                       = IE^π[Σ_{t=0}^{N−1} c(xt, ut) + cN(xN) | x0 = x].

2. (DP Theorem). Define the function JN(x) = cN(x) and, for t = N − 1, N − 2, . . . , 0,

   Jt(x) := min_{u∈U(x)} { c(x, u) + ∫_X J_{t+1}(χ) Q(dχ, x, u) }.

   Suppose that all Jt are measurable and that for all t ∈ IN[0,N−1] there is a measurable selection
   u*t : X → U with u*t(x) ∈ U(x) which attains the minimum, that is,

   Jt(x) = c(x, u*t(x)) + ∫_X J_{t+1}(χ) Q(dχ, x, u*t(x)).

   Then, the deterministic Markov policy π* = {u*0, u*1, . . . , u*_{N−1}} is optimal and the value
   function J* is equal to J0, that is,

   J*(x) = J0(x) = J(π*, x).
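The backward recursion can be sketched on a tiny finite MDP (all numbers below are invented for illustration); an exhaustive search over all deterministic Markov policies, with the expectation computed exactly by propagating the state distribution, confirms that J0 is the optimal cost:

```python
import itertools

# Tiny MDP: states {0, 1}, actions {0, 1}; Qk[u][x] = next-state distribution.
Qk = {0: {0: (0.9, 0.1), 1: (0.9, 0.1)},
      1: {0: (0.2, 0.8), 1: (0.2, 0.8)}}
c = {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 2.0, (1, 1): 1.0}   # stage cost c(x, u)
cN = {0: 0.0, 1: 5.0}                                      # terminal cost
N, states, actions = 3, (0, 1), (0, 1)

# Backward DP:  J_t(x) = min_u { c(x, u) + sum_y J_{t+1}(y) Qk(y | x, u) }
J, policy = dict(cN), []
for t in range(N - 1, -1, -1):
    vals = {x: {u: c[x, u] + sum(J[y] * Qk[u][x][y] for y in states)
                for u in actions} for x in states}
    g = {x: min(vals[x], key=vals[x].get) for x in states}
    J = {x: vals[x][g[x]] for x in states}
    policy = [g] + policy

def cost(pi, x0):
    """Exact expected cost of a deterministic Markov policy (list of maps)."""
    dist, total = {x0: 1.0}, 0.0
    for g in pi:
        nxt = {y: 0.0 for y in states}
        for x, px in dist.items():
            total += px * c[x, g[x]]
            for y in states:
                nxt[y] += px * Qk[g[x]][x][y]
        dist = nxt
    return total + sum(py * cN[y] for y, py in dist.items())

maps = [dict(zip(states, a)) for a in itertools.product(actions, repeat=2)]
for x0 in states:
    best = min(cost(list(pi), x0) for pi in itertools.product(maps, repeat=N))
    assert abs(best - J[x0]) < 1e-9    # DP value equals the optimal cost
```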
3. (Measurable selection theorem 1). There exists a measurable selection u*t in the above DP theorem
   if
   i. (Control constraints). U is compact-valued (i.e., for every x, U(x) is compact)
   ii. (Cost function). c(x, · ) is lower semicontinuous on U(x) for every x ∈ X
   iii. (Integral). The function ξv(x, u) = ∫_X v(χ) Q(dχ, x, u) on K satisfies one of the following
        conditions:
        a. ξv(x, · ) is lower semi-continuous on U(x) for every x ∈ X and every continuous bounded
           function v on X
        b. ξv(x, · ) is lower semi-continuous on U(x) for every x ∈ X and every measurable bounded
           function v on X
4. (Measurable selection theorem 2). There exists a measurable selection u*t in the above DP theorem
   if
   i. (Control constraints). U is compact-valued (i.e., for every x, U(x) is compact) and the multi-
      valued mapping x ⇒ U(x) is upper semi-continuous
   ii. (Cost function). The function c is lower semicontinuous and bounded below
   iii. (Transition kernel). The transition kernel Q satisfies one of the following conditions:
        a. it is weakly continuous, that is, ξv(x, u) = ∫_X v(χ) Q(dχ, x, u) is continuous and bounded
           on K for every continuous bounded function v on X
        b. it is strongly continuous, that is, ξv is continuous and bounded on K for every measurable
           bounded function v on X
5. (Measurable selection theorem 3). There exists a measurable selection u*t in the above DP theorem
   if
   i. (Cost function). The stage cost c is lower semi-continuous, bounded below and inf-compact on
      K, that is, for every x ∈ X and r ≥ 0, the set {u ∈ U(x) | c(x, u) ≤ r} is compact (in other
      words, c(x, · ) has compact level sets)
   ii. (Transition kernel). Condition iii in Measurable Selection Theorem 2 holds.

3.6.3 Linear-Quadratic problems


Work in progress (this section will be updated in the upcoming version). Stay tuned on Twitter for
updates (@isToxic).

4 Stochastic Differential Equations
4.1 Itô Integral
1. (Class V). Let (Ω, F, (Ft)t∈T, P) be a filtered probability space where (Ft)t∈T is a filtration, T =
   [0, +∞) and t, t′ ∈ T with t < t′. We define the class V = V(t, t′) to be the class of functions
   f : T × Ω → IR with
   i. f is B × F-measurable, where B is the Borel σ-algebra on T
   ii. f(s, ω) is (Ft)-adapted
   iii. IE[∫_t^{t′} f(s, ω)² ds] < ∞
2. (Itô integral for elementary functions). Let (Bt)t≥0 be the standard Brownian motion on the
   filtered probability space (Ω, F, {Ft}t≥0, P) and φ be an elementary function of class V(t, t′), that
   is,

   φ(s, ω) = Σ_i ei(ω) 1_{[ti, ti+1)}(s),

   where ei is Fti-measurable. We define the Itô integral of φ from t to t′ to be the random variable

   ∫_t^{t′} φ(s, ω) dBs = Σ_i ei(ω)(B_{ti+1}(ω) − B_{ti}(ω)).

3. (Itô integral on V(t, t′)). Let (Bt)t≥0 be the standard Brownian motion on (Ω, F, {Ft}t≥0, P) and f ∈ V(t, t′). Let φn be a sequence of elementary functions which converges to f in the following sense:

IE[∫_t^t′ (φn(s, ω) − f(s, ω))² ds] → 0.

Then,

∫_t^t′ f(s, ω) dBs = lim_{n→∞} ∫_t^t′ φn(s, ω) dBs,

where the limit is in the L2(Ω, F, P) sense.
4. (Properties of the Itô integral). The Itô integral has the following properties (for all f, g ∈ V(t, t′) and t ≤ t′′ ≤ t′)
i. (Break down). ∫_t^t′ f(s, ω) dBs = ∫_t^t′′ f(s, ω) dBs + ∫_t′′^t′ f(s, ω) dBs, for almost all ω
ii. (Linearity). ∫_t^t′ (cf + g) dBs = c ∫_t^t′ f dBs + ∫_t^t′ g dBs, for almost all ω, where c is a constant
iii. (Zero expectation). IE[∫_t^t′ f(s, ω) dBs] = 0
iv. (Measurability). ∫_t^t′ f(s, ω) dBs is Ft′-measurable
v. (Isometry property). IE[(∫_t^t′ f(s, ω) dBs)²] = IE[∫_t^t′ f(s, ω)² ds]
vi. (Martingale). Mt(ω) = ∫_0^t f(s, ω) dBs is a martingale with respect to the filtration (Ft)t and P, and t ↦ Mt(ω) is almost surely continuous. As a result, Doob's martingale inequality applies, that is, for λ > 0,

P[sup_{0≤s≤t} |Ms| ≥ λ] ≤ (1/λ²) IE[∫_0^t f(s, ω)² ds]
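The isometry property can be checked numerically with the elementary-function approximation of ∫ B dB; the sketch below is not part of the original text, and the horizon, step count and number of paths are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo check of the Ito isometry for f = B:
# E[(int_0^T B_s dB_s)^2] = E[int_0^T B_s^2 ds] = T^2/2.
rng = np.random.default_rng(0)
T, n_steps, n_paths = 1.0, 500, 20_000
dt = T / n_steps
dB = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
B = np.cumsum(dB, axis=1) - dB          # B at the left endpoint of each step
ito = np.sum(B * dB, axis=1)            # elementary-function approximation
lhs = np.mean(ito ** 2)                 # E[(int B dB)^2]
rhs = T ** 2 / 2                        # E[int B^2 ds]
print(lhs, rhs)                         # both close to 0.5
```

Both sides agree up to Monte Carlo and discretization error.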

5. (Itô process). The stochastic process Xt given by

Xt = X0 + ∫_0^t u(s, ω) ds + ∫_0^t v(s, ω) dBs,

where v is an Itô-integrable function with P[∫_0^t v² ds < ∞, ∀t ≥ 0] = 1, is called an Itô process. Such a process is also written in the following shorter differential form

dXt = u dt + v dBt.
6. (Multi-dimensional Itô process). Let Bt be an m-dimensional Brownian motion and

dXt = u dt + V dBt,

where u(t, ω) = [u1(t, ω) · · · un(t, ω)]⊤, V = (Vi,j(t, ω))i,j is an n-by-m matrix whose entries Vi,j are Itô-integrable functions, dBt = [dB1,t · · · dBm,t]⊤ and Xt = [X1,t · · · Xn,t]⊤. In other words, componentwise,

dXi,t = ui dt + Σ_{j=1}^m Vi,j dBj,t,   for i = 1, . . . , n.

This is called an n-by-m-dimensional Itô process.
7. (Itô formula — 1-dimensional). Let Xt be an Itô process, dXt = u dt + v dBt, and let g ∈ C²([0, +∞) × IR). Define Yt = g(t, Xt). Let us denote the partial derivative of g wrt t by gt and let, likewise, gx(t, x) = ∂g/∂x (t, x) and gxx(t, x) = ∂gx/∂x (t, x). Then,

dYt = gt(t, Xt) dt + gx(t, Xt) dXt + ½ gxx(t, Xt)(dXt)²,

with the calculus rules dt · dt = dt · dBt = dBt · dt = 0 and dBt · dBt = dt.
8. (Itô formula — multi-dimensional). Let Bt be an m-dimensional Brownian motion and

dXt = u dt + V dBt

be an n-by-m-dimensional Itô process. Let g be a C²([0, ∞) × IRn; IRp) map and Yt = g(t, Xt). Then,

dYk,t = ∂gk/∂t (t, Xt) dt + Σi ∂gk/∂xi (t, Xt) dXi,t + ½ Σi,j ∂²gk/(∂xi ∂xj) (t, Xt) dXi,t dXj,t,

where dBi dBj = δi,j dt and dBi dt = dt dBi = 0.
9. (Examples). Here are a few examples:
i. Take g(t, x) = ½x² and Xt = Bt. Then, dYt = Bt dBt + ½ dt, which leads to

½Bt² = ∫_0^t Bs dBs + ½t.

ii. By taking Xt = Bt and g(t, x) = tx we obtain

∫_0^t s dBs = tBt − ∫_0^t Bs ds,

where ∫_0^t Bs ds is an integrated Brownian motion.
iii. By taking Xt = Bt and g(t, x) = ⅓x³, we obtain

∫_0^t Bs² dBs = ⅓Bt³ − ∫_0^t Bs ds.

iv. (Integration by parts). Let Xt and Yt be two Itô processes. Then,

d(Xt Yt) = Xt dYt + Yt dXt + dXt dYt,

in other words,

∫_0^t Xs dYs = Xt Yt − X0 Y0 − ∫_0^t Ys dXs − ∫_0^t dXs dYs
v. (Exponential). Let θ(t, ω) be an n-dimensional random process with θi(t, ω) ∈ V([0, T]) for i = 1, . . . , n with T ≤ ∞. Define

Zt = exp(∫_0^t θ(s, ω)·dBs − ½ ∫_0^t ‖θ(s, ω)‖² ds),

for t ∈ [0, T]. Then,

dZt = Zt θ(t, ω)·dBt
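Example i above can be checked pathwise: the left-point Riemann sum Σ B_{ti}(B_{ti+1} − B_{ti}) approximates the Itô integral, and the identity ½Bt² = ∫_0^t Bs dBs + ½t holds up to discretization error. A minimal sketch (grid size and seed are arbitrary choices):

```python
import numpy as np

# Pathwise check of (1/2) B_T^2 = int_0^T B_s dB_s + T/2 on one sample path.
rng = np.random.default_rng(1)
T, n = 1.0, 200_000
dt = T / n
dB = rng.normal(0.0, np.sqrt(dt), size=n)
B = np.cumsum(dB)
B_left = B - dB                          # B at the left endpoint of each step
ito_sum = np.sum(B_left * dB)            # left-point sum ~ int_0^T B_s dB_s
identity = 0.5 * B[-1] ** 2 - 0.5 * T    # right-hand side, rearranged
print(ito_sum, identity)
```

The gap between the two numbers is ½(T − Σ(ΔB)²), which vanishes as the grid is refined.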
10. (Itô integrals of deterministic functions). Let g : [0, T] → IR be a Borel-measurable, square-integrable function. For every t ∈ [0, T], the random variable

Xt(ω) = ∫_0^t g(s) dBs(ω)

is normally distributed with zero mean and variance Var[Xt] = ∫_0^t g(s)² ds.
11. (Expectation of product of Itô integrals). Let (Xt)t, (Yt)t be two stochastic processes on the filtered probability space (Ω, F, {Ft}t≥0, P) so that X and Y are in V(0, T) for some T > 0. Then, for 0 ≤ t ≤ T,

IE[∫_0^t Xs dBs ∫_0^t Ys dBs] = ∫_0^t IE[Xs Ys] ds
Table 4.1: Application of Itô’s formula

Xt                              dXt = u dt + v dBt
Bt                              dBt
Bt²                             2Bt dBt + dt
Bt² − t                         2Bt dBt
Bt³                             3Bt² dBt + 3Bt dt
e^{Bt}                          e^{Bt} dBt + ½ e^{Bt} dt
e^{Bt − t/2}                    e^{Bt − t/2} dBt
e^{t/2} sin Bt                  e^{t/2} cos Bt dBt
e^{t/2} cos Bt                  −e^{t/2} sin Bt dBt
(Bt + t) e^{−Bt − t/2}          (1 − Bt − t) e^{−Bt − t/2} dBt
4.2 Stochastic differential equations

1. (Ornstein-Uhlenbeck process). The stochastic differential equation

dXt = µXt dt + σ dBt

is called the Ornstein-Uhlenbeck process. Its solution is1

Xt = e^{µt} X0 + σ ∫_0^t e^{µ(t−s)} dBs,

with expectation IE[Xt] = e^{µt} X0 and variance Var[Xt] = σ²/(2µ) (e^{2µt} − 1).

1 Multiply both sides by e^{−µt} and apply Itô’s formula on d(e^{−µt} Xt). To find the variance, use the Itô isometry.
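A hedged numerical illustration of the closed-form moments: simulating the SDE with the Euler-Maruyama scheme (the values µ = −1, σ = 0.5, X0 = 2 are arbitrary illustrative choices, not from the text) and comparing sample moments against the formulas above.

```python
import numpy as np

# Euler-Maruyama simulation of dX = mu*X dt + sigma*dB versus the closed-form
# moments E[X_t] = exp(mu t) X0 and Var[X_t] = sigma^2/(2 mu) (exp(2 mu t) - 1).
rng = np.random.default_rng(2)
mu, sigma, x0 = -1.0, 0.5, 2.0           # illustrative parameters
T, n_steps, n_paths = 1.0, 1000, 50_000
dt = T / n_steps
X = np.full(n_paths, x0)
for _ in range(n_steps):
    X = X + mu * X * dt + sigma * rng.normal(0.0, np.sqrt(dt), n_paths)
mean_th = np.exp(mu * T) * x0                              # = 2/e
var_th = sigma ** 2 / (2 * mu) * (np.exp(2 * mu * T) - 1)  # = (1 - e^-2)/8
print(X.mean(), mean_th)
print(X.var(), var_th)
```

Note that with µ < 0 the process is mean-reverting toward zero.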
Table 4.2: Stochastic integrals — we denote by Bt the standard Brownian motion with B0 = 0 and ibm(t, ω) := ∫_0^t Bs ds is the integrated Brownian motion.

Stochastic integral             Result                                          Variance
∫_0^t dBs                       Bt                                              t
∫_0^t s dBs                     tBt − ibm(t, ω)                                 t³/3
∫_0^t Bs dBs                    ½Bt² − ½t                                       t²/2
∫_0^t Bs² dBs                   ⅓Bt³ − ibm(t, ω)                                t³
∫_0^t Bs^k dBs                  Bt^{k+1}/(k+1) − (k/2) ∫_0^t Bs^{k−1} ds
∫_0^t e^{Bs − s/2} dBs          e^{Bt − t/2} − 1                                e^t − 1
5 Information Theory
5.1 Entropy and Conditional Entropy
1. (Self-Information, construction). Let (Ω, F, P) be a discrete probability space. A self-information
function I must satisfy the following desiderata: (i) if ωi is sure (P[ωi ] = 1), then this offers no
information, that is I(ωi ) = 0, (ii) if ωi is not sure, that is P[ωi ] < 1, then I(ωi ) > 0, (iii) I(ω)
depends on the probability P[ω], that is, there is a function f so that I(ω) = f (P[ω]) (iv) for two
independent events A and B, I(A ∩ B) = I(A) + I(B).

2. (Self-information, definition). A definition which satisfies the above desiderata is I(ω) = − log(P[ω]).

3. (Self-information, units). When log2 is used in the definition, the units of measurement of self-information are bits. If ln ≡ loge is used, the self-information is measured in nats. For the decimal logarithm, I is measured in hartleys.

4. (Entropy, definition). The entropy (or Shannon entropy) of a random variable is the expectation
of its self-information denoted as H(X) = IE[I(X)], where I(X) is to be interpreted as follows:
Let (Ω, F, P) be a probability space and X : (Ω, F, P) → {xi }ni=1 a finite-valued random variable.
Consider the events Ei = {ω ∈ Ω | X(ω) = xi } with self-information I(Ei ). Then, I(X) is the
random variable I(X)(ω) = I(Eι(ω) ), where ι(ω) is such that X(ω) = xι(ω) .
The entropy of X is given by

H(X) = − Σ_{i=1}^n pi log(pi),

where pi = P[X = xi ].
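A minimal sketch of the entropy formula with the base-2 logarithm, so the result is in bits; a fair coin carries exactly 1 bit, a biased coin strictly less.

```python
import numpy as np

# Shannon entropy in bits: H(X) = -sum_i p_i log2(p_i).
def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 * log 0 = 0 by convention
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.5, 0.5]))        # 1.0 bit for a fair coin
print(entropy_bits([0.9, 0.1]))        # ~0.469 bits for a biased coin
```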

5. (Joint entropy). The joint entropy of two random variables X and Y (with values {xi }i and {yj }j
respectively) is the entropy of the random variable (X, Y ) in the product space, that is
H(X, Y) = − Σi,j pij log pij,

where pij = P[X = xi , Y = yj ].

6. (Conditional Entropy).

7. (Mutual information).

5.2 KL divergence
1. (Definition/Discrete spaces). Let (Ω, F) be a discrete measurable space and P and P0 two probability measures on it. The Kullback-Leibler (KL) divergence from P0 to P is defined as1

DKL(P‖P0) = − Σi Pi log(P0i/Pi) = Σi Pi log(Pi/P0i)

2. (Definition/Continuous spaces with PDFs). The KL divergence over a continuous probability space
and for two probability measures P and P0 with PDFs p and p0 respectively is
DKL(P‖P0) = ∫_{−∞}^{+∞} p(x) log(p(x)/p0(x)) dx

1 Lecture notes by S. Khudanpur, available online at https://www.clsp.jhu.edu/~sanjeev/520.447/Spring00/I-divergence-properties.ps


3. (Definition/Continuous spaces). If P is absolutely continuous with respect to P0, we define

DKL(P‖P0) = ∫ log(dP/dP0) dP.
4. (Nonnegative). The KL divergence is always nonnegative: DKL (PkP0 ) ≥ 0

5. (Pinsker’s inequality). dTV(P, P0) ≤ √(½ DKL(P‖P0))
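The discrete definition and Pinsker's inequality can be checked directly on a small example; the two distributions below are arbitrary illustrative choices.

```python
import numpy as np

# Discrete KL divergence and a check of Pinsker's inequality,
# d_TV(P, P') <= sqrt(0.5 * D_KL(P || P')).
def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                        # 0 * log(0/q) = 0 by convention
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def tv(p, q):
    return 0.5 * np.sum(np.abs(np.asarray(p) - np.asarray(q)))

p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
print(kl(p, q))                            # nonnegative
print(tv(p, q), np.sqrt(0.5 * kl(p, q)))   # Pinsker: first <= second
```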

6 Risk
6.1 Risk measures
1. (Risk measures and coherency). Let (Ω, F, P) be a probability space and Z = Lp (Ω, F, P) for
p ∈ [1, ∞]. A risk measure ρ : Z → IR is called coherent if
i. (Convexity). For Z, Z′ ∈ Z and λ ∈ [0, 1], ρ(λZ + (1 − λ)Z′) ≤ λρ(Z) + (1 − λ)ρ(Z′)
ii. (Monotonicity). For Z, Z′ ∈ Z, ρ(Z) ≤ ρ(Z′) whenever Z ≤ Z′ a.s.,
iii. (Translation equi-variance). For Z ∈ Z and C ∈ Z with C(ω) = c for almost all ω (almost
surely constant), it is ρ(C + Z) = c + ρ(Z),
iv. (Positive homogeneity). For Z ∈ Z and α ≥ 0, ρ(αZ) = αρ(Z).

2. (Conjugate risk measure). With every convex risk measure, we associate the conjugate risk measure ρ∗ : Z∗ → IR defined as

ρ∗(Y) = sup{⟨Z, Y⟩ − ρ(Z) : Z ∈ Z}.

3. (Biconjugate risk measure). With every convex risk measure, we associate the biconjugate risk measure ρ∗∗ : Z∗∗ → IR,

ρ∗∗(Z) = sup{⟨Z, Y⟩ − ρ∗(Y) : Y ∈ Z∗}.

4. (Dual representation). Let Z = Lp (Ω, F, P) with p ∈ [1, ∞). If ρ is lower semi-continuous, then
ρ = ρ∗∗ . In particular,

ρ(Z) = sup{⟨Z, Y⟩ − ρ∗(Y) : Y ∈ Z∗} = sup{⟨Z, Y⟩ − ρ∗(Y) : Y ∈ A},

where A = dom ρ∗ .

5. (Acceptance set). The set Aρ = {X ∈ Z : ρ(X) ≤ 0} is called the acceptance set of ρ. Several
properties of ρ can be tested using its acceptance set.

6. (Monotonicity condition). ρ is monotone if and only if Y ≥ 0 (almost surely) for every Y ∈ A.

7. (Translation equi-variance condition). ρ is translation equi-variant if and only if IE[Y] = 1 for every Y ∈ A.

8. (Positive homogeneity condition). ρ is positively homogeneous if and only if it is the support function of A, that is, ρ(Z) = sup_{Y∈A} ⟨Y, Z⟩. In this case, A is called the admissibility set of ρ.

9. (Coherency-preserving operations). Let ρ1, ρ2 be two coherent risk measures on Z. Then, the following risk measures are coherent
i. ρ(X) := λ1ρ1(X) + λ2ρ2(X), with λ1, λ2 ≥ 0 not both equal to 0
ii. ρ(X) = max{ρ1(X), ρ2(X)}

10. (Sub-differentiability). If ρ : Z → IR is real valued, convex and monotone, then it is continuous


and sub-differentiable on Z.

11. (Sub-differentials of risk measures). Let ρ : Lp(Ω, F, P) → IR, p ∈ [1, ∞), be convex and lower semi-continuous. Then ∂ρ(Z) = arg max_{Y∈A} {⟨Y, Z⟩ − ρ∗(Y)}. If, additionally, ρ is positively homogeneous, then ∂ρ(Z) = arg max_{Y∈A} ⟨Y, Z⟩.


12. (Convexity of ρ ◦ F ). Let F : IRn → Z be a convex mapping1 and ρ be a convex monotone risk
measure. Then ρ ◦ F is convex.
13. (Directional differentiability). Let Z = Lp(Ω, F, P) with p ∈ [1, ∞), F : IRn → Z be a convex mapping and ρ : Z → IR be a convex monotone risk measure which is finite-valued and continuous at Z̄ = F(x̄). Then, φ := ρ ◦ F is directionally differentiable at x̄, φ′(x̄; h) is finite-valued for all h ∈ IRn and2

φ′(x̄; h) = sup_{Y∈∂ρ(Z̄)} ⟨Y, f′(x̄; h)⟩

14. (Sub-differentiability of ρ ◦ F). Let Z, F, ρ, x̄ and Z̄ be as above. Define φ = ρ ◦ F. Then φ is sub-differentiable at x̄ and

∂φ(x̄) = cl ⋃_{Y∈∂ρ(Z̄), F′∈∂F(x̄)} ⟨Y, F′⟩

15. (Continuity equivalences). Let ρ : Z → IR be a convex, monotone, translation equi-variant risk


measure and Z = Lp (Ω, F, P). The following are equivalent3 :
i. ρ is continuous
ii. ρ is continuous at some X ∈ dom ρ
iii. int Aρ 6= ∅
iv. ρ is lower semi-continuous and finite-valued (dom ρ = Z)
16. (Lipschitz continuity wrt infinity norm). Let ρ : Z → IR be a proper, convex, monotone, translation
equi-variant risk measure. Then, for all X, X 0 ∈ dom ρ
|ρ(X) − ρ(X 0 )| ≤ kX − X 0 k∞ .

17. (Law invariance). A risk measure ρ is called law invariant if ρ(Z) = ρ(Z 0 ) whenever Z and Z 0 have
the same distribution.
18. (Fatou property #1). Let ρ : L∞ → IR be a proper convex risk measure. The following are
equivalent:
i. ρ is σ(L∞ , L1 )-lower semi-continuous
ii. ρ has the Fatou property, i.e., ρ(X) ≤ lim inf_k ρ(Xk) whenever {Xk} is essentially uniformly bounded (there is Z ∈ L∞ so that |Xk| ≤ Z for all k ∈ IN) and Xk → X in probability.
19. (Law-invariant risk measures have the Fatou property)4 . Let LΦ denote an Orlicz space5 . Any
proper, (quasi)convex, law-invariant risk measure ρ : LΦ → IR that is norm-lower semi-continuous
has the Fatou property if and only if Φ is ∆2 .
20. (Kusuoka representations). Let (Ω, F, P) be a non-atomic space and let ρ : Lp (Ω, F, P) → IR be a
proper lower semi-continuous law-invariant coherent risk measure. Then, there exists a set M of
probability measures on [0, 1) so that
ρ(Z) = sup_{µ∈M} ∫_0^1 AV@R1−α(Z) dµ(α),

where AV@R1−α is the average value-at-risk operator at level 1 − α (defined in the next section).
1 The mapping F : IRn → Z is convex if for every λ ∈ [0, 1] and x, y ∈ IRn it is F(λx + (1 − λ)y)(ω) ≤ λF(x)(ω) + (1 − λ)F(y)(ω) for P-almost every ω.
2 F maps a vector x to random variables, so it is F(x)(ω) = f(x, ω). The directional derivative of f with respect to x along a direction h is f′(x̄; h) and it is a random variable. The scalar product here is defined as ⟨Y, f′(x̄; h)⟩ = ∫_Ω Y(ω) f′(x̄; h)(ω) dP(ω).
3 For a detailed discussion on continuity properties of risk measures, see D. Filipović and G. Svindland, “Convex risk

measures on Lp ,” Available online at: http://www.math.lmu.de/~filipo/PAPERS/crmlp.pdf.


4 This result is rather involved. For a detailed presentation refer to the article E. Jouini, W. Schachermayer and N. Touzi,

“Law invariant risk measures have the Fatou property,” (Chapter) Advances in Mathematical Economics, 2006, Springer
Japan.
5 An Orlicz space is a function space which generalizes the Lp spaces. A Young function Φ : [0, ∞) → [0, ∞) is a convex function with limx→∞ Φ(x) = ∞ and Φ(0) = 0. Given a Young function Φ and a probability space (Ω, F, P), define LΦ(Ω, F, P) = {X : Ω → IR, measurable, IE[Φ(|X|)] < ∞}. This set is not necessarily a vector space. The vector space spanned by LΦ is the Orlicz space LΦ(Ω, F, P). This space is equipped with the Luxembourg norm ‖X‖Φ = inf{λ > 0 : IE[Φ(X/λ)] ≤ 1}. We say that Φ satisfies the ∆2 condition if Φ(2t) ≤ KΦ(t) for some K > 0 and all t ≥ 0.


21. (Regularity in spaces with atoms). Let (Ω, F, P) be a space with atoms and (Ω, H, P) be a uniform
probability space so that (Ω, F, P) is isomorphic to it. Let Z := Lp (Ω, F, P) and Ẑ := Lp (Ω, H, P),
p ∈ [1, ∞). Let ρ̂ : Ẑ → IR be a proper, lower semi-continuous, law invariant, coherent risk measure.
We say that ρ̂ is regular if there is a proper, lower semi-continuous, law invariant, coherent risk
measure ρ : Z → IR so that ρ|Ẑ = ρ̂.

22. (Zero risk). Let (Ω, F, P) be a non-atomic probability space. Let ρ be a proper, lower semi-
continuous, coherent, law invariant risk measure. If Z ∈ Z, Z ≥ 0 a.s. then ρ(Z) = 0 if and only
if Z = 0 a.s.

23. (Risk under conditioning). Let (Ω, F, P) be a non-atomic space and ρ : Z → IR be a proper
convex lower semi-continuous law-invariant risk measure. Let H be a sub-σ-algebra of F. Then,
ρ(IE [X | H]) ≤ ρ(X), for all X ∈ Z and IE[X] ≤ ρ(X).

24. (Interchangeability principle for risk measures). Let Z := Lp(Ω, F, P) and Z′ := Lp′(Ω, F, P) with p, p′ ∈ [1, ∞]. Let F : IRn → Z, that is, for x ∈ IRn, F(x) is a random variable; let (F(x))(ω) = f(x, ω). For a set X ⊆ IRn define MX := {χ ∈ Z′ : χ ∈ X, P-a.s.}. Let ρ : Z → IR be a proper monotone risk measure. For χ ∈ Z′ define Fχ(ω) = f(χ(ω), ω). Suppose that inf_{x∈X} F(x) ∈ Z and that ρ is continuous at inf_{x∈X} F(x). Then

inf_{χ∈MX} ρ(Fχ) = ρ(inf_{x∈X} F(x)).

6.2 Popular risk measures


1. (Trivially coherent risk measures). The expectation operator and the essential supremum are
coherent risk measures. For ω ∈ Ω, define ρ(X) = X(ω). This is a coherent risk measure, however,
it is not law invariant.

2. (Mean-Variance measure). The mean-variance risk measure is defined as ρ(X) = IE[X] + cVar[X].
This risk measure is law invariant, continuous, convex and translation equi-variant. However, it is
neither monotone nor positively homogeneous.

3. (Value-at-Risk). The Value-at-Risk of a random variable X at level α is the (1 − α)-quantile of X,


that is, V@Rα [X] = inf{t ∈ IR : P[X > t] ≤ α}. V@Rα is monotone, positively homogeneous and
translation equi-variant, but non-convex and not sub-additive6 .

4. (Average Value-at-Risk). The Average Value-at-Risk is defined as7

AV@Rα[X] = inf_{t∈IR} {t + (1/α) IE[X − t]+}.

This is a coherent law-invariant risk measure.
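The variational formula above can be evaluated by a simple grid search; a sketch on an illustrative empirical distribution (uniform on {1, . . . , 10} with α = 0.2, an arbitrary choice), where AV@R equals the mean of the worst α-fraction of outcomes.

```python
import numpy as np

# AV@R_alpha[X] = inf_t { t + (1/alpha) E[X - t]_+ } via grid search.
x = np.arange(1, 11, dtype=float)      # empirical distribution, uniform weights
alpha = 0.2

def avar_objective(t):
    return t + np.mean(np.maximum(x - t, 0.0)) / alpha

ts = np.linspace(0.0, 12.0, 1201)
avar = min(avar_objective(t) for t in ts)
print(avar)   # 9.5 = mean of the worst 20% of outcomes, {9, 10}
```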

5. (Mean-Deviation of order p). Let X ∈ Lp(Ω, F, P), p ∈ [1, ∞) and c ≥ 0. Define

ρ(X) = IE[X] + c IE[|X − IE[X]|^p]^{1/p}.

This is a convex, translation equi-variant and positively homogeneous risk measure. It is monotone if p = 1, (Ω, F, P) is non-atomic and c ∈ [0, 1/2].

6. (Mean-Upper-Semideviation of order p). Let X ∈ Lp(Ω, F, P), p ∈ [1, ∞) and c ≥ 0. Define the mapping

ρ(X) = IE[X] + c IE[[X − IE[X]]+^p]^{1/p}.

This is a convex, translation equi-variant and positively homogeneous risk measure. It is monotone if p = 1, (Ω, F, P) is non-atomic and c ∈ [0, 1].
6 The Value-at-Risk is convex for certain classes of random variables. See A. I. Kibzun and E. A. Kuznetsov, “Convex
Properties of the Quantile Function in Stochastic Programming,” Automation and Remote Control, Vol. 65, No. 2,
2004, pp. 184–192.
7 We use the notation [X]+ = max{X, 0}. We use the definition of Shapiro et al. Other authors use different definitions such as AV@Rα[X] = inf_{t∈IR} {t + (1/(1−α)) IE[X − t]+}.


7. (Entropic risk measure). Let Z = Lp(Ω, F, P), p ∈ [1, ∞]. For γ > 0, define the entropic risk measure

ρent_γ(X) = (1/γ) ln IE[e^{γX}].

For p = ∞, ρent_γ is finite valued and w*-lower-semi-continuous. Moreover, ρent_γ is convex, monotone and translation equi-variant, but not positively homogeneous. Furthermore, limγ→0 ρent_γ(X) = IE[X] and limγ→∞ ρent_γ(X) = ess sup[X].
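A short numerical illustration of the two limits; the discrete lottery below is an arbitrary example.

```python
import numpy as np

# Entropic risk measure rho_gamma(X) = (1/gamma) ln E[exp(gamma X)];
# small gamma recovers the mean, large gamma approaches the worst outcome.
x = np.array([1.0, 2.0, 5.0])
p = np.array([0.5, 0.3, 0.2])

def rho_ent(gamma):
    return np.log(np.sum(p * np.exp(gamma * x))) / gamma

print(rho_ent(1e-4))   # close to E[X] = 2.1
print(rho_ent(50.0))   # close to max(X) = 5
```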

8. (Entropic Value-at-Risk). The entropic value-at-risk at level 1 − α, α ∈ (0, 1], of a random variable X for which the moment generating function MX exists is defined as8

EV@R1−α[X] = inf_{t>0} {(1/t) ln(MX(t)/α)}.

The entropic value-at-risk is a coherent risk measure for all α ∈ (0, 1].
9. (Expectiles). Let X ∈ L2(Ω, F, P) and τ ∈ (0, 1). The τ-expectile of X is defined as

eτ(X) = argmin_{t∈IR} IE[τ[X − t]²+ + (1 − τ)[t − X]²+].

For τ ∈ [1/2, 1), eτ is a coherent risk measure.
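The argmin above is characterized by the first-order condition τ IE[(X − t)+] = (1 − τ) IE[(t − X)+], whose unique root can be found by bisection; for τ = 1/2 the expectile reduces to the mean. A sketch on a simulated sample (sample size and bracket are arbitrary choices):

```python
import numpy as np

# tau-expectile via bisection on g(t) = tau E[(X-t)_+] - (1-tau) E[(t-X)_+],
# which is decreasing in t and vanishes at the expectile.
rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 100_000)

def expectile(x, tau, lo=-10.0, hi=10.0, iters=80):
    for _ in range(iters):
        t = 0.5 * (lo + hi)
        g = tau * np.mean(np.maximum(x - t, 0)) \
            - (1 - tau) * np.mean(np.maximum(t - x, 0))
        lo, hi = (t, hi) if g > 0 else (lo, t)
    return 0.5 * (lo + hi)

print(expectile(x, 0.5), x.mean())   # e_{1/2}(X) equals the sample mean
print(expectile(x, 0.9))             # exceeds the mean for tau > 1/2
```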


10. (Generalizations of AV@Rα)9. Let X ∈ Z := Lp(Ω, F, P) and φ : Z → IR+ be a function which is lower semi-continuous, monotone, convex and positively homogeneous. Then

ρ(X) = inf_{t∈IR} {t + φ(X − t)}

is a coherent risk measure10.

8 The moment generating function (MGF) MX of a random variable X is defined as MX (z) := IE[ezX ] for z ∈ IR. Not all
random variables have an MGF (e.g., the Cauchy distribution does not define an MGF).
9 These risk measures were first introduced by Ben-Tal and Teboulle; see for example A. Ben-Tal, M. Teboulle, “An old-new concept of convex risk measures: an optimized certainty equivalent,” Mathematical Finance 17 (2007) 449–476. These measures are discussed in: P. Krokhmal, M. Zabarankin and S. Uryasev, “Modeling and optimization of risk,” Surveys in Operations Research and Management Science 16 (2011) 49–66.
10 In the case of AV@Rα, it is φ(X) = (1/α) IE[[X]+], which is indeed convex, monotone and positively homogeneous.

7 Uncertainty Quantification
7.1 Polynomial chaos
7.1.1 The Kosambi-Karhunen-Loève theorem
1. (Kernel function1). A kernel function is a symmetric continuous function K : [a, b] × [a, b] → IR. K is called positive semidefinite if Σ_{i,j=1}^n ci cj K(xi, xj) ≥ 0 for all scalars (ci)_{i=1}^n, all (xi)_{i=1}^n in [a, b] and all n ∈ IN.

2. (Hilbert-Schmidt integral operator). A kernel K is associated with its Hilbert-Schmidt integral operator TK : L2[a, b] → L2[a, b], defined as

TK : L2[a, b] ∋ φ ↦ TKφ = ∫_a^b K(·, s)φ(s) ds

3. (Mercer’s theorem). Mercer’s theorem offers a representation of kernel functions using a basis of L2([a, b]): Let K be a positive semidefinite kernel. Then, there is an orthonormal basis (ei)i of L2([a, b]) and a sequence of nonnegative coefficients (λi)i so that

K(s, t) = Σ_{j=1}^∞ λj ej(s) ej(t),

where the convergence is absolute and uniform on [a, b] × [a, b], and (ei)i and (λi)i are eigenfunctions and eigenvalues of TK.

4. (Kosambi-Karhunen-Loève theorem). Let (Xt)t∈T, T = [a, b], be a centered, mean-square continuous stochastic process on (Ω, F, P) with Xt ∈ L2(Ω, F, P) for all t ∈ T. Then, there is a basis (ei)i∈IN of L2(T) such that for all t ∈ T,

Xt = Σ_{i=1}^∞ λi ei(t),

in L2([a, b]), where the random coefficients λi ∈ L2(Ω, F, P) are given by

λi(ω) = ∫_T Xt(ω) ei(t) dt

and satisfy IE[λi] = 0 and IE[λi λj] = 0 for i ≠ j.

In particular, (ei)i and (λi)i are eigenfunctions and eigenvalues of the Hilbert-Schmidt integral operator TRX of the auto-correlation function RX of the random process, that is, they satisfy the Fredholm integral equation of the second kind

TRX ei = λi ei,

or equivalently

∫_a^b RX(s, t) ei(s) ds = λi ei(t),

for all i ∈ IN.


1 This should not be confused with transition kernels, which we discussed in Section 1.1.10.


5. (Corollary of KKL theorem2). Let (Xt)t∈T, T = [a, b], be a stochastic process which satisfies the requirements of the Kosambi-Karhunen-Loève theorem. Then, there exists a basis (ei)i of L2(T) such that for all t ∈ T,

Xt(ω) = Σ_{i=1}^∞ √λi ξi(ω) ei(t),

in L2(Ω, F, P), where the ξi are centered, mutually uncorrelated random variables with unit variance, given by

ξi(ω) = (1/√λi) ∫_a^b Xτ(ω) ei(τ) dτ.
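For Brownian motion on [0, 1] the eigenpairs of the covariance kernel min(s, t) are known in closed form, λk = ((k − ½)π)^{−2} with ek(t) = √2 sin((k − ½)πt), which makes the expansion easy to test numerically; a sketch (truncation level and sample size are arbitrary choices):

```python
import numpy as np

# Truncated Karhunen-Loeve synthesis of Brownian motion on [0, 1]:
# X_t = sum_k sqrt(lam_k) Z_k e_k(t), Z_k iid N(0,1); check Var[X_t] ~ t.
rng = np.random.default_rng(4)
K, n_paths = 200, 20_000
t = np.linspace(0.0, 1.0, 101)
k = np.arange(1, K + 1)
lam = 1.0 / ((k - 0.5) * np.pi) ** 2                        # eigenvalues
E = np.sqrt(2.0) * np.sin(np.outer(t, (k - 0.5) * np.pi))   # e_k(t), (101, K)
Z = rng.normal(size=(n_paths, K))                           # KL coefficients
X = Z @ (np.sqrt(lam) * E).T                                # paths, (n_paths, 101)
print(X.var(axis=0)[50], t[50])                             # Var[X_{0.5}] ~ 0.5
```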
7.1.2 Orthogonal polynomials


1. (Space of polynomials). Let PN(I) denote the space of polynomials ψ : I → IR of degree at most N ∈ IN.

2. (Weighted L2w space). For two functions ψ1, ψ2 : I → IR, we define the scalar product

⟨ψ1, ψ2⟩w = ∫_I w(τ)ψ1(τ)ψ2(τ) dτ

and the corresponding norm ‖f‖²w = ⟨f, f⟩w. Define the space

L2w(I) = {f : I → IR | ‖f‖w < ∞}.

For a random variable Ξ : (Ω, F, P) → IR with PDF pΞ, we define

⟨ψ1, ψ2⟩Ξ := ⟨ψ1, ψ2⟩pΞ.

3. (Orthogonality wrt random variable). Let Ξ be a real-valued random variable with probability density function pΞ. Let ψ1, ψ2 : IR → IR be two polynomials. We say that ψ1, ψ2 are orthogonal with respect to (the pdf of) Ξ if ⟨ψ1, ψ2⟩Ξ = 0.

4. Let ψ0, ψ1, . . ., with ψ0 = 1, be a sequence of orthogonal polynomials. Then, for every i ≥ 1,

0 = ⟨ψ0, ψi⟩Ξ = ∫_{−∞}^{+∞} ψi(s) pΞ(s) ds = IE[ψi(Ξ)],

by virtue of LotUS.
5. (Hermite polynomials). Let Ξ be distributed as N(0, 1). Then, its pdf is pΞ(s) = (2π)^{−1/2} e^{−s²/2} and the polynomials ψ0, ψ1, . . . are the Hermite polynomials, the first few of which are H0(x) = 1, H1(x) = x, H2(x) = x² − 1, H3(x) = x³ − 3x. These are orthogonal with respect to Ξ, that is,

⟨Hi, Hj⟩Ξ = (2π)^{−1/2} ∫_{−∞}^{+∞} Hi(s) Hj(s) e^{−s²/2} ds = 0,

for i ≠ j.
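The stated orthogonality can be checked with Gauss-Hermite quadrature for the weight e^{−s²/2} (available in numpy.polynomial.hermite_e, which implements these probabilists' Hermite polynomials); under the N(0, 1) density the inner products equal i! on the diagonal and 0 off it.

```python
import numpy as np
from numpy.polynomial import hermite_e as He

# Gram matrix of He_0..He_3 under the N(0,1) pdf via Gauss-Hermite_e
# quadrature; dividing the weights by sqrt(2 pi) normalizes exp(-x^2/2).
x, w = He.hermegauss(30)
w = w / np.sqrt(2 * np.pi)
gram = np.empty((4, 4))
for i in range(4):
    for j in range(4):
        ci = np.zeros(i + 1); ci[i] = 1.0    # coefficient vector of He_i
        cj = np.zeros(j + 1); cj[j] = 1.0
        gram[i, j] = np.sum(w * He.hermeval(x, ci) * He.hermeval(x, cj))
print(np.round(gram, 6))   # diag(1, 1, 2, 6), zeros elsewhere
```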
6. (Legendre polynomials). If Ξ ∼ U([−1, 1]), then ψ0, ψ1, . . . are the Legendre polynomials. If, instead, Ξ is uniformly distributed on a general interval, the coefficients of the Legendre polynomials are modified accordingly (shifted Legendre polynomials).
7. (Laguerre polynomials). If the germ is an exponential random variable on [0, ∞), then ψ0 , ψ1 , . . .
are the Laguerre polynomials.
8. (Polynomial projection). Let MN = {ψi}_{i=0}^N ⊆ PN(I) be a set of orthogonal polynomials ψi : I → IR with respect to the inner product ⟨·, ·⟩w. We define the projection operator onto MN as

PN : L2w(I) ∋ f ↦ PN f := Σ_{j=0}^N f̂j ψj ∈ PN(I),

where

f̂j = (1/‖ψj‖²w) ⟨f, ψj⟩w.
2 For applications of the KKL theorem to decompose stochastic processes, see http://amslaurea.unibo.it/10169/1/
Giambartolomei_Giordano_Tesi.pdf


9. (Properties of polynomial projection). It is easy to see that PN f = f for all f ∈ PN (I), whereas,
for g ⊥ PN (I) it is PN g = 0.
10. (Best approximation). For f ∈ L2w(I),

‖f − PN f‖w = inf_{ψ∈PN(I)} ‖f − ψ‖w.

Additionally, the approximation error ef := f − PN f is orthogonal to PN(I), that is, ⟨ef, ψi⟩w = 0 for all i ∈ IN[0,N].
11. (Approximation properties). For f ∈ L2w(I)3, define fN = PN f. Then,

lim_{N→∞} ‖f − fN‖w = 0.
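A sketch of the projection and the error-orthogonality property using Legendre polynomials (w ≡ 1 on [−1, 1]); the choices f(x) = e^x and N = 5 are arbitrary.

```python
import numpy as np
from numpy.polynomial import legendre as L

# L^2 projection of f onto Legendre polynomials of degree <= N:
# f_hat_j = <f, P_j> / ||P_j||^2 with ||P_j||^2 = 2/(2j+1); the residual
# f - P_N f is orthogonal to every P_i, i <= N.
f = np.exp
N = 5
x, w = L.leggauss(50)                       # Gauss-Legendre nodes/weights
coef = np.zeros(N + 1)
for j in range(N + 1):
    Pj = L.legval(x, np.eye(N + 1)[j])      # evaluate P_j at the nodes
    coef[j] = np.sum(w * f(x) * Pj) / (2.0 / (2 * j + 1))
err = f(x) - L.legval(x, coef)              # residual at the nodes
for i in range(N + 1):
    Pi = L.legval(x, np.eye(N + 1)[i])
    print(np.sum(w * err * Pi))             # ~0 for every i <= N
```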

12. (Strong/weak approximation). An approximation fN which converges, as N → ∞, to f as the


above best approximation (via polynomial projection), is called a strong approximation. An ap-
proximation fN which converges in a weaker sense (e.g., in probability) is called weak.

7.1.3 Generalized polynomial chaos expansions


1. (Polynomial chaos expansions). Given a random variable X, a polynomial chaos expansion is a function f so that for a given random variable Ξ with known distribution, called a “germ,” it holds that X =d f(Ξ) (the notation =d means that the two random variables have the same distribution).
2. (Weak approximation theorem). Let X be a random variable of L2(Ω, F, P) with cumulative distribution function FX. Let Ξ be a germ in L2(Ω, F, P) with distribution FΞ and

⟨ψi, ψj⟩Ξ = δi,j γi,

for all i, j ∈ IN[0,N]. Let

XN = Σ_{j=0}^N αj ψj(Ξ),

where

αj = (1/γj) ∫_0^1 FX^{−1}(u) ψj(FΞ^{−1}(u)) du.

Then, as N → ∞, XN → X in probability4.
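As an illustration (not from the original text), for a lognormal X = e^Ξ with standard normal germ the quantile transform is available in closed form, FX^{−1}(FΞ(s)) = e^s, so the coefficient integrals reduce to Gauss-Hermite_e quadratures with the known values αj = √e/j!.

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial

# Hermite PC coefficients of X = exp(Xi), Xi ~ N(0,1):
# alpha_j = <X, He_j>_Xi / gamma_j with gamma_j = j!.
s, w = He.hermegauss(60)
w = w / np.sqrt(2 * np.pi)                 # normalize to the N(0,1) pdf
N = 6
alpha = np.array([
    np.sum(w * np.exp(s) * He.hermeval(s, np.eye(N + 1)[j])) / factorial(j)
    for j in range(N + 1)
])
print(alpha)                               # closed form: sqrt(e)/j!
print(alpha[0], np.exp(0.5))               # alpha_0 = E[X] = sqrt(e)
```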
3. (Non-intrusive solution via linear regression).
4. (Non-intrusive solution via stochastic projection). Let X be a random variable on (Ω, F, P) and Y = η(X). Suppose we have obtained a truncated PC expansion for X. The question is how uncertainty propagates from X to Y via η; in other words, what is the distribution of Y. For N ∈ IN, let XN be a PC expansion of X as follows

XN = Σ_{j=0}^N xj ψj(Ξ) = fN(Ξ)

3 In certain cases, we may derive approximation bounds. For example, for f in the weighted Sobolev space Hpw([−1, 1]) = {g : [−1, 1] → IR | d^i g/dτ^i ∈ L2w([−1, 1]), i = 0, . . . , p}, equipped with the inner product

⟨f, g⟩Hpw([−1,1]) = Σ_{j=0}^p ⟨d^j f/dτ^j, d^j g/dτ^j⟩L2w([−1,1])

and induced norm ‖f‖Hpw([−1,1]) = ⟨f, f⟩^{1/2}_{Hpw([−1,1])}, and with MN being the set of Legendre polynomials on [−1, 1], we have that there is a constant c, independent of N, so that ‖f − PN f‖ ≤ (c/N^p) ‖f‖Hpw([−1,1]).
4 X can be approximated by projecting on the space spanned by the orthogonal polynomials MN = {ψi}_{i=0}^N, leading to an approximation XN = Σ_{j=0}^N αj ψj(Ξ), where the coefficients αj are computed by αj = ⟨X, ψj⟩Ξ/γj. The problem is that the inner product ⟨X, ψj⟩Ξ, typically, cannot be evaluated. The trick is that X =d FX^{−1}(U), where U is a random variable which is uniformly distributed on [0, 1]; one such variable is U =d FΞ(Ξ), therefore X =d FX^{−1}(FΞ(Ξ)). This leads to the above formula. The integral can be evaluated by quadrature methods.


Let YN be the desired approximation (we take the approximation length and the orthogonal polynomial basis to be the same), YN = Σ_{j=0}^N yj ψj(Ξ). It follows that

yk = (1/‖ψk‖²Ξ) ⟨ψk, η ◦ fN⟩Ξ = (1/‖ψk‖²Ξ) ∫ η(fN(u)) ψk(u) pΞ(u) du,

and the integral can be evaluated using a quadrature method, or even simple Monte Carlo, that is, ∫ η(fN(u)) ψk(u) pΞ(u) du ≈ (1/Nmc) Σ_{i=1}^{Nmc} η(fN(u^{(i)})) ψk(u^{(i)}), where the u^{(i)} are samples drawn from the distribution of Ξ.
5. (Galerkin projection).

6.

8 Bibliography with comments
Bibliographic references including lecture notes and online resources with some comments:
1. R.G. Gallager. Stochastic processes: theory for applications. Cambridge University Press, 2013: A gentle
introduction to stochastic processes suitable for engineers who want to eschew the mathematical drudgery.
Following a short, but circumspect introduction to probability theory, the author discusses several pro-
cesses such as Poisson, Gaussian, Markovian and renewal processes. Lastly, the book discusses hypothesis
testing, martingales and estimation theory. Without doubt, an excellent introduction to the topic for the
uninitiated.
2. Robert L. Wolpert. Probability and measure, 2005. Lecture notes: Lecture notes with a succinct presen-
tation of some very useful results, but without many proofs. Available at https://www2.stat.duke.edu/
courses/Spring05/sta205/lec/s05wk07.pdf.
3. Erhan Çinlar. Probability and Stochastics. Springer New York, 2011: A fantastic book for one’s first steps
in probability theory with emphasis on random processes, filtrations, Martingales, stopping times and
convergence theorems, Poisson random measures, Lévy and Markovian processes and Brownian motion.
4. Olav Kallenberg. Foundations of modern probability. Springer, 1997: The definitive reference for researchers.
In its 23 chapters it gives a circumspect overview of probability theory and stochastic processes; ideal for
researchers in the field.
5. Onesimo Hernández-Lerma and Jean Bernarde Lasserre. Discrete-Time Markov Control Processes: Basic
Optimality Criteria. Springer, 1996
6. Bernt Øksendal. Stochastic Differential Equations. Springer Berlin Heidelberg, sixth edition, 2003: An
amazing eye-opening book on stochastic differential equations and their applications. It offers a very comprehensive presentation of the Brownian motion and Itô’s integral. The exercises are an invaluable tool for
assimilating the theory.
7. Karl Sigman. Lecture notes on stochastic modeling I, 2009: Lecture notes by K. Sigman, Columbia
University, http://www.columbia.edu/~ks20/stochastic-I/stochastic-I.html.
8. David Walnut. Convergence theorems, 2011. Lecture notes: A short compilation of convergence theorems
9. S.R. Srinivasa Varadhan. Lecture notes on limit theorems, 2002: A lot of material on limit theorems
starting from general measure theory, to weak convergence results, limits of independent sums, results
for dependent processes with emphasis on Markov chains, a comprehensive introduction to martingales,
stationary processes and ergodic theorems and some notes on dynamic programming. Available online at
https://www.math.nyu.edu/faculty/varadhan/.
10. Zhengyan Lin and Zhidong Bai. Probability Inequalities. Springer, 2011: several interesting (elementary
and advanced) inequalities on probability spaces.
11. Andrea Ambrosio. Relation between almost surely absolutely bounded random variables and their abso-
lute moments, 2013: A short note at http://planetmath.org/sites/default/files/texpdf/38346.pdf
showing that almost surely bounded RVs have all their moments bounded.
12. Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on stochastic programming:
modeling and theory. SIAM, second edition, 2014: Excellent book on stochastic programming and the
definitive reference for risk measures.
13. Anthony O’Hagan. Polynomial chaos: A tutorial and critique from a statistician’s perspective, 2013. Available at http://tonyohagan.co.uk/academic/pdf/Polynomial-chaos.pdf: This article is written in a very
intuitive manner, it is easy to follow. It seems that it targets applied scientists and practitioners, rather
than mathematicians. It is a good read to understand the basics of polynomial chaos. The author questions
certain aspects of polynomial chaos from a statistics standpoint.
14. Dongbin Xiu. Numerical methods for stochastic computations: a spectral method approach. Princeton
University Press, 2010: A proper theoretical treatise on polynomial chaos and several other topics related
to approximations of (multivariate) probability distributions.
15. M.S. Eldred. Recent advances in non-intrusive polynomial chaos and stochastic collocation methods for
uncertainty analysis and design. In 50th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics,
and Materials Conference, California, USA, 2009
16. Thorsten Schmidt. Coping with copulas, 2006. https://www.researchgate.net/publication/228876267/
download


17. Carlo Sempi. Introduction to copulas, 2011. The 33rd Finnish Summer School on Probability Theory and
Statistics; available at http://web.abo.fi/fak/mnf/mate/gradschool/summer_school/tammerfors2011/
slides_sempi.pdf: a very thorough presentation of copulas along with lots of theoretical results.
18. A. Kaintura, T. Dhaene, and D. Spina. Review of polynomial chaos-based methods for uncertainty quan-
tification in modern integrated circuits. Electronics, 7(3):30, 2018: a not very rigorous review, but it offers
an overview of basic properties of polynomial chaos expansions

About the author
I was born in Athens, Greece, in 1985. I received a Diploma in Chemical Engineering in 2007 and an
MSc with honours in Applied Mathematics in 2009 from NTU Athens. In December 2012, I defended my
PhD thesis titled “Modelling and Control of Biological and Physiological Systems” at NTU Athens. In
January 2013 I joined the Dynamical Systems, Control and Optimization research unit at IMT Lucca as
a post-doctoral Fellow. Afterwards, I worked as a post-doctoral researcher at ESAT, KU Leuven. I am
currently a post-doctoral researcher at KIOS Center of Excellence, University of Cyprus. My research
focuses on model predictive control and numerical optimization.

Web page: https://alphaville.github.io/
