Probability Theory Cookbook
Pantelis Sopasakis
Contents

Abstract
1 Probability Theory
1.1 General Probability Theory
1.1.1 Measurable and Probability spaces
1.1.2 Random variables
1.1.3 Limits
1.1.4 The Radon-Nikodym Theorem
1.1.5 Probability distribution
1.1.6 Probability density function
1.1.7 Decomposition of measures
1.1.8 Lp spaces
1.1.9 Product spaces
1.1.10 Transition Kernels
1.1.11 Law invariance
1.1.12 Expectation
1.2 Conditioning
1.2.1 Conditional Expectation
1.2.2 Conditional Probability
1.2.3 Construction of probability spaces
1.3 Inequalities on Probability Spaces
1.3.1 Inequalities on Lp spaces
1.3.2 Generic inequalities involving probabilities or expectations
1.3.3 Involving sums or averages
1.4 Convergence of random processes
1.4.1 Convergence of measures
1.4.2 Almost sure convergence
1.4.3 Convergence in probability
1.4.4 Convergence in Lp
1.4.5 Convergence in distribution
1.4.6 Tail events and 0-1 Laws
1.4.7 Laws of large numbers and CLTs
1.5 Standard Distributions
1.5.1 Uniform distribution
1.5.2 Normal distribution
1.5.3 Binomial distribution
1.5.4 Poisson distribution
2 Multivariate distributions
2.1 Multivariate random variables
2.2 Copulas
2.2.1 Sklar’s theorem
2.2.2 Examples of copulas
3 Stochastic Processes
3.1 General
3.2 Martingales
3.3 Random walk
3.4 Brownian motion
3.5 Markov processes
5 Information Theory
5.1 Entropy and Conditional Entropy
5.2 KL divergence
6 Risk
6.1 Risk measures
6.2 Popular risk measures
7 Uncertainty Quantification
7.1 Polynomial chaos
7.1.1 The Kosambi-Karhunen-Loève theorem
7.1.2 Orthogonal polynomials
7.1.3 Generalized polynomial chaos expansions
Abstract
This document is intended to serve as a collection of important results in general probability theory, stochastic processes, uncertainty quantification, risk measures, and more. It can be used as a quick refresher, reference, or cheat sheet by graduate students and researchers in mathematics and engineering. A commented list of bibliographic references appears at the end of the document. This is still a work in progress, so several important results are missing. This document, as well as its future versions, is available at https://mathematix.wordpress.com/probability-cookbook.
1 Probability Theory
1.1 General Probability Theory
1.1.1 Measurable and Probability spaces
1. (σ-algebra). Let X be a nonempty set. A collection F of subsets of X is called a σ-algebra if (i) X ∈ F, (ii) A^c ∈ F whenever A ∈ F, (iii) if A1, A2, . . . ∈ F, then ∪_{i∈IN} Ai ∈ F. The space X equipped with a σ-algebra F is called a measurable space.
5. (Smallest σ-algebra). Let H be a collection of sets in X. The smallest collection of sets which
contains H and is a σ-algebra exists and is denoted by σ(H).
6. (Monotone class theorem). If a d-system D contains a p-system P, then it also contains σ(P).
7. (Borel σ-algebra). On IR, the σ-algebra σ({(a, b); a < b}) is called the Borel σ-algebra on IR which
we denote by BIR . For topological spaces (X, τ ), the Borel σ-algebra is defined as BX = σ(τ ), i.e.,
it is the smallest σ-algebra which contains all open sets. BIR is generated by:
i. The open intervals (a, b)
ii. The closed intervals [a, b]
iii. All sets of the form [a, b) or (a, b]
iv. Open rays (a, ∞) or (−∞, a)
v. Closed rays [a, ∞) or (−∞, a]
10. (Equality of measures). Let µ, ν be two measures on a measurable space (X, F) and let G be a
p-system generating F. If µ(A) = ν(A) for all A ∈ G, then µ(B) = ν(B) for all B ∈ F. As presented
in #7 above, p-systems are often available and have simple forms.
11. (Completeness). A measure space (X, F, µ) is called complete if the following holds:

A ∈ F, µ(A) = 0, B ⊆ A ⇒ B ∈ F.

Of course, by the monotonicity property in #9–iii, if (X, F, µ) is a complete measure space then every such B satisfies µ(B) = 0.
12. (Completion). Let (X, F, µ) be a measure space and define the set of negligible sets of µ as Zµ = {N ⊆ X : ∃N′ ⊇ N, N′ ∈ F s.t. µ(N′) = 0}. Let F′ be the σ-algebra generated by F ∪ Zµ. Then µ extends uniquely to a measure µ′ on F′, and (X, F′, µ′) is a complete measure space, the completion of (X, F, µ).
12. (Change of variables). Let F be a random variable on the probability space (Ω, F, P) and let F∗P be the push-forward measure. A random variable X is integrable with respect to the push-forward measure F∗P if and only if X ◦ F is P-integrable, in which case the integrals coincide:

∫ X d(F∗P) = ∫ (X ◦ F) dP.
13. (Measures from random variables). Let X be a nonnegative random variable on (Ω, F, P). We may use X to define the measure

ν(A) = ∫_A X dP,

defined for A ∈ F. This is a positive measure, which for short we denote as ν = XP, and it satisfies

∫_A Y dν = ∫_A XY dP,

for all A ∈ F and all ν-integrable random variables Y.
i. φk → X, point-wise on E
ii. |φk | ≤ |X| on E for all k ∈ IN
If X ≥ 0 then there exists a sequence of point-wise increasing simple functions with these properties.
21. (Simple function approximation trick). Let f be a nonnegative real-valued measurable function, f : (Ω, F, P) → (IR, BIR). Define

φk(x) = (j − 1)/2^k, if (j − 1)/2^k ≤ f(x) < j/2^k, j = 1, . . . , k2^k,
φk(x) = k, if f(x) ≥ k.

Then,
i. The sets {x : f(x) ≥ k} and {x : (j − 1)/2^k ≤ f(x) < j/2^k} are measurable because f is measurable
ii. φk are measurable for all k ∈ IN
iii. φk(x) ≤ φk+1(x) for all k ∈ IN and for all x ∈ Ω
iv. Let E ⊆ Ω be such that sup_{x∈E} f(x) ≤ M. Then sup_{x∈E} |f(x) − φk(x)| ≤ 1/2^k for all k ≥ M
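The dyadic construction above can be sketched numerically. The function f and the evaluation grid below are illustrative choices (not from the text); the code checks the monotonicity property iii and the error bound iv.

```python
import math

def simple_approx(f, k):
    """k-th dyadic simple-function approximation of a nonnegative f:
    phi_k(x) = (j - 1)/2**k  when (j - 1)/2**k <= f(x) < j/2**k,
    phi_k(x) = k             when f(x) >= k."""
    def phi(x):
        v = f(x)
        if v >= k:
            return k
        return math.floor(v * 2**k) / 2**k
    return phi

f = lambda x: x * x                    # illustrative f, bounded by 1 on [0, 1]
grid = [i / 1000 for i in range(1001)]
phi4, phi5 = simple_approx(f, 4), simple_approx(f, 5)

monotone = all(phi4(x) <= phi5(x) for x in grid)   # property iii
max_err = max(f(x) - phi4(x) for x in grid)        # property iv: at most 1/2**4
```

Since f ≤ 1 on the grid and k = 4 ≥ 1, the approximation error never exceeds 1/2⁴.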
1.1.3 Limits
Limits of sequences of events
1. (Nested sequences and probabilities). Let (En)n be a non-increasing sequence of events (En ⊇ En+1 for all n ∈ IN). Then limn P[En] exists and

P[∩_n En] = lim_n P[En].
2. (Limits inferior). For a sequence of events (En)n, the limit inferior is denoted by lim inf_n En and is defined as

lim inf_n En = ∪_{n∈IN} ∩_{m≥n} Em = {x : x ∈ En for all but finitely many n ∈ IN}.
6. (Probabilities of lim inf En and lim sup En). The sets lim inf_n En and lim sup_n En are measurable and

P[lim inf_n En] ≤ lim inf_n P[En] ≤ lim sup_n P[En] ≤ P[lim sup_n En].
7. (A result reminiscent of Baire’s category theorem). Let (En )n be a sequence of almost sure events.
Then P[∩n En ] = 1.
8. (Borel-Cantelli lemma). Let (En)n be a sequence of events over (Ω, F, P). The following hold:
i. If Σ_{n=1}^{∞} P[En] < ∞, then P[lim sup_n En] = 0
ii. If (En)n are independent events such that Σ_{n=1}^{∞} P[En] = ∞, then P[lim sup_n En] = 1.
9. (Corollary: Borel 0-1 law). If (En )n is a sequence of independent events, then P[lim supn En ] ∈
{0, 1} (according to the summability of (P[En ])n ).
2. (Lebesgue’s Dominated Convergence Theorem). Let Xn be real-valued RVs over (Ω, F, P). Suppose that Xn converges point-wise to X and is dominated by a Y ∈ L1(Ω, F, P), that is, |Xn| ≤ Y P-a.s. for all n ∈ IN. Then, X ∈ L1(Ω, F, P) and lim_n IE[|Xn − X|] = 0, which implies

lim_n IE[Xn] = IE[X].
5. (Bounded convergence). If Xk → X almost surely and supk |Xk | ≤ b for some constant b > 0, then
IE[Xk ] → IE[X] and IE[|X|] ≤ b.
7. (Fatou’s lemma with varying measures). For a sequence of nonnegative random variables Xn ≥ 0 over (Ω, F, P), and a sequence of (probability) measures µn which converge strongly to a (probability) measure µ (that is, µn(A) → µ(A) for all A ∈ F), we have

∫ lim inf_n Xn dµ ≤ lim inf_n ∫ Xn dµn.
8. (Reverse Fatou’s lemma). Let Xn ≥ 0 be a sequence of nonnegative random variables over (Ω, F, P) and assume there is a Y ∈ L1(Ω, F, P) so that Xn ≤ Y. Then

lim sup_n IE[Xn] ≤ IE[lim sup_n Xn].
9. (Integrable lower bound). Let Xn be a sequence of random variables over (Ω, F, P). Suppose there exists an integrable Y ≥ 0 such that Xn ≥ −Y for all n ∈ IN. Then

IE[lim inf_n Xn] ≤ lim inf_n IE[Xn].
10. (Beppo Levi’s Theorem). Let Xk be a sequence of nonnegative random variables on (Ω, F, P) with 0 ≤ X1 ≤ X2 ≤ . . .. Let X(ω) = lim_{k→∞} Xk(ω). Then X is a random variable and IE[X] = lim_{k→∞} IE[Xk].
11. (Beppo Levi’s Theorem for series). Let Xk be a sequence of nonnegative integrable random variables on (Ω, F, P) and let Yk = Σ_{j=0}^{k} Xj. Assume that Σ_{k=1}^{∞} IE[Yk] converges. Then Yk satisfies the conditions of the BL theorem and

Σ_{k=1}^{∞} IE[Yk] = IE[Σ_{k=1}^{∞} Yk].
12. (Uniform integrability – definition) [7]. A collection {Xt}t∈T is said to be uniformly integrable if sup_{t∈T} IE[|Xt| 1_{|Xt|>x}] → 0 as x → ∞.
13. (Constant absolutely integrable sequences are uniformly integrable) [7]. The constant collection {Y}t∈T with IE[|Y|] < ∞ is uniformly integrable.
14. (Uniform boundedness in Lp, p > 1, implies uniform integrability). If {Xt}t∈T is uniformly bounded in Lp, p > 1 (that is, IE[|Xt|p] ≤ c for some c > 0 and all t ∈ T), then it is uniformly integrable.
15. (Convergence under uniform integrability) [7]. If Xk → X a.s. and {Xk }k is uniformly integrable
then
i. IE[|X|] < ∞
ii. IE[Xk ] → IE[X]
iii. IE|Xk − X| → 0
This function is denoted by f = dν/dµ.
4. (Chain rule). If ν ≪ µ ≪ λ, then

dν/dλ = (dν/dµ)(dµ/dλ), λ-a.e.
If the measure X∗P is absolutely continuous with respect to the Lebesgue measure µ on (IR, BIR), then the Radon-Nikodym derivative fX := d(X∗P)/dµ, with fX : IR → IR, exists and

IE[g(X)] = ∫_IR g d(X∗P) = ∫_IR g fX dµ = ∫_IR g(τ) fX(τ) dτ.
is called the probability distribution of X and it is a measure. Note that for all A ∈ G, X^{-1}A ∈ F
since X is measurable.
3. (Push-forward). The probability distribution of a random variable X with values in (X , G), is the
push-forward measure X∗ P on (X , G) which is a probability measure on (X , G) with X∗ P = PX −1 .
4. (Associated p-system). We associate with FX : IR → [0, 1] the measure µ which is defined on the
p-system {(−∞, x]}x∈IR as µ((−∞, x]) = FX (x).
5. (Properties of the cumulative and the inverse cumulative distributions). The notation X ∼ Y means that X and Y have the same cumulative distribution, that is, FX = FY.
i. If Y ∼ U[0, 1], then FX^{-1}(Y) ∼ X
ii. FX is càdlàg
iii. x1 < x2 ⇒ FX(x1) ≤ FX(x2)
iv. P[X > x] = 1 − FX(x)
v. P[{x1 < X ≤ x2}] = FX(x2) − FX(x1)
vi. lim_{x→−∞} FX(x) = 0, lim_{x→∞} FX(x) = 1
vii. FX^{-1}(FX(x)) ≤ x
viii. FX(FX^{-1}(p)) ≥ p
ix. FX^{-1}(p) ≤ x ⇔ p ≤ FX(x)
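Properties vii–ix can be checked numerically with the generalized inverse FX^{-1}(p) = inf{x : FX(x) ≥ p}. The discrete distribution below is an illustrative choice, not from the text.

```python
import math

# Illustrative discrete distribution (values and probabilities are arbitrary).
values = [0.0, 1.0, 2.5, 4.0]
probs = [0.2, 0.3, 0.4, 0.1]
cum = []
s = 0.0
for p in probs:
    s += p
    cum.append(s)
EPS = 1e-12  # tolerance for floating-point comparisons

def F(x):
    """CDF: F_X(x) = P[X <= x]."""
    total = 0.0
    for v, c in zip(values, cum):
        if v <= x:
            total = c
    return total

def F_inv(p):
    """Generalized inverse: inf{x : F_X(x) >= p}  (F_inv(0) = -infinity)."""
    if p <= 0.0:
        return -math.inf
    for v, c in zip(values, cum):
        if c >= p - EPS:
            return v
    return values[-1]

xs = [-1.0, 0.0, 0.5, 1.0, 2.0, 2.5, 3.0, 4.0]
ps = [0.05, 0.2, 0.5, 0.9, 1.0]
prop_vii = all(F_inv(F(x)) <= x + EPS for x in xs)
prop_viii = all(F(F_inv(p)) >= p - EPS for p in ps)
prop_ix = all((F_inv(p) <= x) == (p <= F(x) + EPS) for p in ps for x in xs)
```

All three properties hold on the test grid; note that F_inv(0) = −∞ is needed for property vii at points below the support.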
fX = d(X∗P)/dµ,

which exists provided that X∗P ≪ µ, and fX is measurable and µ-integrable. Then,

P[X ∈ A] = ∫_{X^{-1}A} dP = ∫_Ω 1_{X^{-1}A} dP = ∫_Ω (1A ◦ X) dP = ∫_A d(X∗P) = ∫_A fX dµ.
2. (Probability distribution). If X is a real-valued random variable and its range (IR) is taken with the Borel σ-algebra, then

P[X ≤ x] = ∫_{(−∞,x]} dP = ∫_{{ω∈Ω : X(ω)≤x}} dP = ∫_{−∞}^{x} fX dµ.

Note that the first integral is written with a slight abuse of notation, as the integration with respect to P is carried out over the set {ω ∈ Ω : X(ω) ≤ x}; the first integral can be understood as shorthand for the second.
3. (Expectation). Let a real-valued random variable X have probability density fX. Let ι be the identity function ι : x 7→ x on IR. Then

IE[X] = ∫_Ω X dP = ∫_Ω (ι ◦ X) dP = ∫_IR ι d(X∗P) = ∫_IR ι(x) fX(x) dµ = ∫_IR x fX(x) dx.
FY(y) = FX(g^{-1}(y)),

fY(y) = fX(g^{-1}(y)) |∂g^{-1}(y)/∂y|.
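A quick numerical sanity check of the density formula, with the illustrative choice X ∼ U(0, 1) and g(x) = x² (monotone on (0, 1)), so g^{-1}(y) = √y and fY(y) = 1/(2√y):

```python
import math

# Illustrative example: X ~ U(0,1), Y = g(X) with g(x) = x^2 on (0,1).
# The change-of-variables formula gives
#   f_Y(y) = f_X(sqrt(y)) * |d/dy sqrt(y)| = 1 / (2*sqrt(y)),  F_Y(y) = sqrt(y).

def f_Y(y):
    return 1.0 / (2.0 * math.sqrt(y))

def F_Y_numeric(y, n=20000):
    """Midpoint rule for the integral of f_Y over (0, y); the midpoint
    rule avoids evaluating the integrable singularity at y = 0."""
    h = y / n
    return sum(f_Y((i + 0.5) * h) for i in range(n)) * h

err_total = abs(F_Y_numeric(1.0) - 1.0)              # density integrates to 1
err_cdf = abs(F_Y_numeric(0.25) - math.sqrt(0.25))   # matches F_X(g^{-1}(y))
```

Both errors are small (they are dominated by the quadrature error near the singularity), confirming the formula on this example.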
2. (Discrete measure on IR). A measure µ on IR, equipped with the Lebesgue σ-algebra, is said to be discrete if there is a (possibly finite) sequence of elements {sk}k∈IN so that

µ(IR \ ∪_{k∈IN} {sk}) = 0.
3. (Lebesgue’s decomposition Theorem). For every two σ-finite signed measures µ and ν on a mea-
surable space (Ω, F), there exist two σ-finite signed measures ν0 and ν1 on (Ω, F) such that
i. ν = ν0 + ν1
ii. ν0 ≪ µ
iii. ν1 ⊥ µ
and ν0 and ν1 are uniquely determined by ν and µ.
4. (Lebesgue’s decomposition Theorem — Corollary). Consider the space (IR, BIR ) and let µ be the
Lebesgue measure. Any probability measure ν on this space can be written as
ν = νac + νsc + νd ,
where νac ≪ µ (which is easily understood via the Radon-Nikodym Theorem), νsc is singular continuous (wrt µ) and νd is a discrete measure.
1.1.8 Lp spaces
1. (p-norm). Let X be a real-valued random variable on (Ω, F, P). For p ∈ [1, ∞) define the p-norm
of X as
kXkp = IE[|X|p ]1/p .
2. (𝓛p spaces). Define 𝓛p(Ω, F, P) = {X : Ω → IR measurable, kXkp < ∞} and equip this space with the addition and scalar multiplication operations (X + Y)(ω) = X(ω) + Y(ω) and (αX)(ω) = αX(ω). This becomes a semi-normed space3.
3. (Lp spaces). Define N(Ω, F, P) = {X : Ω → IR measurable, X = 0 a.s.}; this is the kernel of k · kp. Then, define Lp(Ω, F, P) = 𝓛p(Ω, F, P)/N. This is a normed space where for X ∈ 𝓛p(Ω, F, P) and [X] = X + N ∈ Lp(Ω, F, P) we have k[X]kp := kXkp.
or equivalently
kXk∞ = inf{λ ∈ IR : |X| ≤ λ, P-a.s.}.
6. (L2 is a Hilbert space). L2(Ω, F, P) is the only Lp space that is a Hilbert space; its inner product is ⟨X, Y⟩ = IE[XY].
3 kXkp = 0 does not imply that X = 0, but instead that X = 0 almost surely. However, k · kp is absolutely homogeneous, sub-additive and nonnegative.
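On a probability space the p-norms are nondecreasing in p (a consequence of Jensen's inequality) and are all dominated by kXk∞. A small numerical illustration, with an arbitrarily chosen discrete distribution:

```python
# Illustrative discrete random variable on a finite probability space.
outcomes = [(-2.0, 0.1), (-0.5, 0.4), (1.0, 0.3), (3.0, 0.2)]  # (value, probability)

def p_norm(p):
    """||X||_p = (E|X|^p)^(1/p) for the discrete distribution above."""
    return sum(prob * abs(v) ** p for v, prob in outcomes) ** (1.0 / p)

norms = [p_norm(p) for p in (1, 1.5, 2, 3, 4)]
# ||X||_inf = essential supremum = largest value with positive probability here.
sup_norm = max(abs(v) for v, prob in outcomes if prob > 0)
```

The computed norms increase with p and stay below the L∞ norm, matching the containment L∞ ⊆ Lq ⊆ Lp for p ≤ q on a probability space.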
This is the smallest σ-algebra on the product space which renders all projections measurable
(compare to the definition of the product topology which is the smallest topology on the product
space which renders the projections continuous).
2. (Measurability of epigraphs). Let f : (X, F) → IR be a measurable proper function. Its epigraph, that is, the set epi f := {(x, α) ∈ X × IR | f(x) ≤ α}, and its hypograph, that is, the set hyp f := {(x, α) ∈ X × IR | f(x) ≥ α}, are measurable in the product measure space (X × IR, F ⊗ BIR).
3. (Measurability of graph). The graph of a measurable function f : (X, F, µ) → IR is a Lebesgue-
measurable set with Lebesgue measure zero.
4. (Countable product of σ-algebras). If A is countable, the product σ-algebra is generated by the products of measurable sets {∏_{a∈A} Ea ; Ea ∈ Fa}.
5. (Product measures). Let (X, F, µ) and (Y, G, ν) be two measure spaces. The product space X × Y becomes a measurable space with the σ-algebra F ⊗ G. Let Ex ∈ F and Ey ∈ G; then Ex × Ey ∈ F ⊗ G. We define a measure µ × ν on (X × Y, F ⊗ G) with

(µ × ν)(Ex × Ey) = µ(Ex)ν(Ey).

Then, for h ∈ L1(X × Y, F ⊗ G, µ × ν),

∫_X ∫_Y h(x, y) dν(y) dµ(x) = ∫_Y ∫_X h(x, y) dµ(x) dν(y) = ∫_{X×Y} h d(µ × ν).
1.1.12 Expectation
1. (Definition). Let (Ω, F, P) be a probability space and X be a random variable. Then, the expected value of X is denoted by IE[X] and is defined as the Lebesgue integral

IE[X] = ∫_Ω X dP.
The function S(t) = P[X > t] = 1 − P[X ≤ t] is called the survival function of X, or its tail
distribution or exceedance.
3. (Expectation in terms of PDF). Let X be a real-valued continuous random variable with PDF fX. Then,

IE[X] = ∫_{−∞}^{∞} x fX(x) dx.
5. Let (Ω, F, P) be a probability space and X a real-valued random variable thereon. Define

f(τ) = ∫_Ω (X − τ)² dP.

Then f is minimized at τ = IE[X], and its minimum value is the variance of X.
8. (Finite mean, infinite variance). There are several distributions with finite mean and infinite variance; a standard example is the Pareto distribution. A random variable X follows the Pareto distribution with parameters xm > 0 and a if it has support [xm, ∞) and probability density function

fX(x) = a xm^a / x^{a+1},

for x ≥ xm. For a ≤ 1, X has infinite mean. For 1 < a ≤ 2, its mean is IE[X] = a xm/(a − 1) and its variance is infinite.
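The Pareto example can be checked by numeric integration (midpoint rule after the substitution x = e^u, which handles the heavy tail). With the illustrative parameters a = 1.5 and xm = 1, the truncated mean settles near a xm/(a − 1) = 3, while the truncated second moment keeps growing with the cutoff:

```python
import math

def pareto_pdf(x, a=1.5, xm=1.0):
    """Pareto density f_X(x) = a * xm**a / x**(a+1) for x >= xm."""
    return a * xm**a / x**(a + 1) if x >= xm else 0.0

def integrate_log(f, lo, hi, n=20000):
    """Midpoint rule after the substitution x = exp(u)."""
    ulo, uhi = math.log(lo), math.log(hi)
    h = (uhi - ulo) / n
    total = 0.0
    for i in range(n):
        x = math.exp(ulo + (i + 0.5) * h)
        total += f(x) * x * h   # extra factor x is the Jacobian dx/du
    return total

# First moment converges (a = 1.5 > 1): E[X] = a*xm/(a-1) = 3.
mean_trunc = integrate_log(lambda x: x * pareto_pdf(x), 1.0, 1e8)
# Second moment diverges (a = 1.5 <= 2): truncated integrals keep growing.
m2 = [integrate_log(lambda x: x * x * pareto_pdf(x), 1.0, T) for T in (1e2, 1e4, 1e6)]
```

The truncated second moments grow roughly like 3(√T − 1), so no finite variance exists.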
9. (Absolutely bounded a.s. ⇔ Bounded moments) [11]. Let X be a random variable on (Ω, F, P).
The following are equivalent:
i. X is almost surely absolutely bounded (i.e., there is M ≥ 0 such that P[|X| ≤ M ] = 1)
ii. IE[|X|k ] ≤ M k , for all k ∈ IN≥1
1.2 Conditioning
1.2.1 Conditional Expectation
1. (Conditional Expectation). Let X be a random variable on (Ω, F, P) and let H ⊆ F be a sub-σ-algebra. A conditional expectation of X given H is an H-measurable random variable, denoted IE[X | H], with

∫_H IE[X | H] dP = ∫_H X dP,

or, equivalently, IE[X 1H] = IE[IE[X | H] 1H], for all H ∈ H.
2. (Uniqueness). All versions of a conditional expectation, IE [X | H], differ only on a set of measure
zero4 .
4. (Best estimator). Assuming IE[Y²] < ∞, the best estimator of Y given X, in the mean-square sense, is IE[Y | X].
6. (Conditional expectation wrt random variable). Let X, Y be random variables on (Ω, F, P). The conditional expectation of X given Y is IE[X | Y] := IE[X | σ(Y)], where σ(Y) is the σ-algebra generated by Y, that is, σ(Y) = Y^{-1}(BIR) = {Y^{-1}(B); B ∈ BIR}.
7. (Conditional expectation using the push-forward Y∗P). Let X be an integrable random variable on (Ω, F, P). Then, there is a Y∗P-a.e. unique random variable IE[X | Y] with

∫_{Y^{-1}(B)} X dP = ∫_B IE[X | Y] d(Y∗P).
9. (Properties of conditional expectations). The conditional expectation has the following properties:
i. (Monotonicity). X ≤ Y ⇒ IE [X | H] ≤ IE [Y | H]
ii. (Positivity). X ≥ 0 ⇒ IE [X | H] ≥ 0 [Set Y = 0 in 9i].
iii. (Linearity). For a, b ∈ IR, IE [aX + bY | H] = aIE [X | H] + bIE [Y | H]
iv. (Monotone convergence). Xn ≥ 0, Xn ↑ X implies IE [Xn | H] ↑ IE [X | H]
v. (Fatou’s lemma). For Xn ≥ 0, IE [lim inf n Xn | H] ≤ lim inf n IE [Xn | H]
vi. (Reverse Fatou’s lemma). If Xn ≤ Y for an integrable Y, then IE[lim sup_n Xn | H] ≥ lim sup_n IE[Xn | H]
4 R.Durrett, “Probability: Theory and Examples,” 2013, Available at: https://services.math.duke.edu/~rtd/PTE/
PTE4_1.pdf
2. (Conditional probability given an event). For E, H ∈ F, P[E ∩ H] = P[H]PH [E]. This is uniquely
defined provided that P[H] > 0.
Then, there exists a probability space (Ω, F, P) and a stochastic process (Xt )t on Ω such that
νt1 ,...,tk (F1 × · · · × Fk ) = P[Xt1 ∈ F1 , . . . , Xtk ∈ Fk ],
for all Borel sets Fi , i = 1, . . . , k.
1.3 Inequalities on Probability Spaces
7. (Corollary of Hoeffding’s lemma). Let X be such that e^{tX} is integrable for t ≥ 0. Then, for ε > 0,

P[X > ε] ≤ inf_{t≥0} e^{−tε} IE[e^{tX}].
10. Let X ≥ 0 and IE[X²] < ∞. We apply the Cauchy-Schwarz inequality to X 1_{X>0} and obtain

P[X > 0] ≥ IE[X]² / IE[X²].
11. (Dvoretzky-Kiefer-Wolfowitz inequality). Let X1, . . . , Xn be iid random variables (samples) with cumulative distribution F. Let Fn be the associated empirical distribution

Fn(x) = (1/n) Σ_{i=1}^{n} 1_{Xi ≤ x}.

Then,

P[sup_{x∈IR} (Fn(x) − F(x)) > ε] ≤ e^{−2nε²},

for every ε ≥ √(ln 2 / (2n)).
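A seeded Monte Carlo sketch of the one-sided bound for U(0, 1) samples; the sample size, ε, and trial count below are arbitrary choices. The empirical exceedance frequency should sit below e^{−2nε²}:

```python
import math
import random

random.seed(0)

def sup_dev(n):
    """sup_x (F_n(x) - F(x)) for n iid U(0,1) samples (here F(x) = x).
    Since F_n - F decreases between jumps, the sup is attained at the
    order statistics: max_i ((i+1)/n - x_(i))."""
    xs = sorted(random.random() for _ in range(n))
    return max((i + 1) / n - x for i, x in enumerate(xs))

n, eps, trials = 200, 0.1, 2000
bound = math.exp(-2 * n * eps**2)   # e^{-4}, roughly 0.018
freq = sum(sup_dev(n) > eps for _ in range(trials)) / trials
```

With this seed the observed frequency stays (well within sampling noise) below the DKW bound.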
12. (Chung-Erdős inequality). Let E1, . . . , En ∈ F with P[Ei] > 0 for some i. Then

P[E1 ∨ . . . ∨ En] ≥ (Σ_{i=1}^{n} P[Ei])² / Σ_{i=1}^{n} Σ_{j=1}^{n} P[Ei ∧ Ej].
2. (Hoeffding’s inequality for sums #2). Let X1, X2, . . . , Xn be independent random variables with Xi ∈ [ai, bi]. Let X̄ be as above and let ri = bi − ai. Then

P[X̄ − IE[X̄] ≥ t] ≤ exp(−2n²t² / Σ_{i=1}^{n} ri²),

and

P[|X̄ − IE[X̄]| ≥ t] ≤ 2 exp(−2n²t² / Σ_{i=1}^{n} ri²).
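A seeded simulation of the one-sided bound for Xi ∼ U(0, 1), so that ri = 1 and the bound reduces to exp(−2nt²); all parameters below are arbitrary choices:

```python
import math
import random

random.seed(1)

n, t, trials = 50, 0.1, 4000
# sum of r_i^2 is n here, so the bound is exp(-2 * n * t^2) = e^{-1}
bound = math.exp(-2 * n**2 * t**2 / n)

exceed = 0
for _ in range(trials):
    xbar = sum(random.random() for _ in range(n)) / n   # sample mean, E = 0.5
    if xbar - 0.5 >= t:
        exceed += 1
freq = exceed / trials
```

The observed exceedance frequency (well under 1%) is far below the Hoeffding bound of about 0.37, as expected since the bound is not tight here.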
4. (Gaussian tail inequality for averages). Let X1, . . . , Xn ∼ N(0, 1) be iid and let X̄n := n^{-1} Σ_{i=1}^{n} Xi. Then X̄n ∼ N(0, n^{-1}) and, for ε > 0,

P[|X̄n| > ε] ≤ 2 e^{−nε²/2} / (√n ε).
2. (Total variation convergence). The total variation distance between two measures µ and ν on a measurable space (X, G) is defined as dTV(µ, ν) = sup_{A∈G} |µ(A) − ν(A)|. A sequence of measures {µk}k∈IN converges in total variation to a measure µ if dTV(µk, µ) → 0 as k → ∞.
3. (Weak convergence). The sequence of measures {µk }k∈IN is said to converge in the weak sense,
denoted by µk * µ, if any of the conditions of the Portmanteau Theorem hold; these are
1.4 Convergence of random processes
4. (Tightness). A sequence of measures (µn)n is called tight if for every ε > 0 there is a compact set Kε so that µn(Kε) > 1 − ε for all n ∈ IN.
5. (Prokhorov’s Theorem). If (µn )n is tight, then every subsequence of it has a further subsequence
which is weakly convergent.
6. (Lévy-Prokhorov distance). Let (X, d) be a metric space and let BX be the Borel σ-algebra which makes (X, BX) a measurable space. Let P(X) be the space of all probability measures on (X, BX). For all A ⊆ X we define the ε-neighbourhood

Aε := {p ∈ X | ∃q ∈ A, d(p, q) < ε} = ∪_{p∈A} Bε(p),
8. (Separability of (PX , π)). The space (PX , π) is separable if and only if (X, d) is separable.
3. (Characterization of a.s. convergence). The sequence (Xn)n converges a.s. to X if and only if for every ε > 0

Σ_{n∈IN} 1_{(ε,∞)} ◦ |Xn − X| < ∞, almost surely.
4. (Characterization of a.s. convergence à la Borel-Cantelli #1). The sequence (Xn)n converges a.s. to X if for every ε > 0

Σ_{n∈IN} P[|Xn − X| > ε] < ∞.
5 The support of a measure µ on a space Ω equipped with a topology τ is the set of ω ∈ Ω for which every open neighbourhood Nω of ω has positive measure: supp(µ) = {ω ∈ Ω : µ(Nω) > 0 for all Nω ∈ τ with Nω ∋ ω}.
5. (Characterization of a.s. convergence à la Borel-Cantelli #2). The sequence (Xn)n converges a.s. to X if there is a decreasing sequence (εn)n converging to 0 so that

Σ_{n∈IN} P[|Xn − X| > εn] < ∞.
6. (Cauchy criterion). The sequence (Xn )n is convergent almost surely if and only if limm,n→∞ |Xn −
Xm | → 0 almost surely.
7. (Kolmogorov’s three-series theorem). Let (Xn)n be a sequence of independent random variables. The random series Σ_n Xn converges almost surely in IR if and only if the following conditions hold for some ε > 0:
i. Σ_n P[|Xn| > ε] converges
ii. Let Yn = Xn 1_{|Xn|≤ε}. Then, Σ_n IE[Yn] converges
iii. Σ_n var Yn converges
5. (Almost surely convergent subsequence). If Xn → X in probability, then there exists a subsequence (Xkn)n of (Xn)n which converges almost surely to X.
6. (Sum of independent variables). Let (Xn )n be a sequence of independent random variables and let
(Sn )n be a sequence defined as Sn = X1 + . . . + Xn . Then Sn converges almost surely if and only
if it converges in probability.
8. (Almost surely ⇒ in probability). If a sequence of random variables {Xk }k converges almost surely,
it converges in probability to the same limit.
9. (In probability ⇏ almost surely). There are sequences which converge in probability but not almost surely. Here is an example: let (Xn)n be a sequence of independent random variables on Ω = IN with Xn = 1 with probability 1/n and 0 with probability 1 − 1/n. Then, for any ε > 0, P[|Xn| > ε] = 1/n → 0, but since Σ_{n=1}^{∞} P[|Xn| > ε] = ∞ and the events {|Xn| > ε} are independent, the second Borel-Cantelli lemma gives P[lim sup_n {|Xn| > ε}] = 1.
1.4.4 Convergence in Lp
1. (Convergence in Lp (Ω, F, P)). We say that Xk converges to X in Lp if X, Xk ∈ Lp for all k ∈ IN
and kXk − Xkp → 0.
5. (Convergence in Lp for all p ∈ [1, ∞) but not in L∞). Let X be a random variable on Ω = IN which follows the Poisson distribution (P[X = k] = e^{−λ} λ^k / k!, λ > 0). Define the sequence Xk = 1_{X=k}. Then kXkkp = P[X = k]^{1/p} → 0 for every p ∈ [1, ∞), while kXkk∞ = 1 for all k.
8. (Almost surely ⇏ in Lp). On ([0, 1], B[0,1], λ) take Xn = n 1_{[0,1/n]}. Then kXnk1 = 1 for all n (and kXnkp = n^{1−1/p} for p > 1), but the sequence converges almost surely to 0.
9. (In Lp, p ∈ [1, 2) ⇏ in Lp for p ≥ 2). Let Ω = IN and let Zkp be a sequence of random variables with parameter p and

P[Zkp = n] = pn,
P[Zkp = 0] = 1 − pn.
Figure 1.1: Illustration of the relationships among different modes of convergence of random variables. Convergence in L∞ implies convergence in Lp for all p ∈ [1, ∞), which in turn implies convergence in Lp′ for all 1 ≤ p′ ≤ p, which implies convergence in probability, which implies convergence in distribution, which implies convergence of the characteristic functions (Lévy’s continuity theorem). Convergence in distribution implies almost sure convergence of a sequence of RVs {Yk}k which have the same distributions as {Xk}k (Yk equal in distribution to Xk and Y equal in distribution to X).
4. (Lévy’s continuity theorem)7. Let {Xk}k be a sequence of random variables with characteristic functions ϕk(t) and let X be a random variable with characteristic function ϕ(t). If Xk converges to X in distribution, then ϕk → ϕ point-wise. Conversely, if ϕk → ϕ and ϕ is continuous at 0, then ϕ is the characteristic function of a random variable X and Xk → X in distribution.
5. (Scheffé’s theorem for density functions).8 Let Pn and P have densities fn and f with respect to
a measure µ. If fn → f µ-a.s., then Pn → P in the total variation metric and, as a result, Pn → P
weakly.
6. (Continuous mapping theorem). For a (almost everywhere) continuous function g, if the sequence
{Xk }k converges in distribution to X, then {g(Xk )}k converges in distribution to g(X).
8. (In distribution ⇏ in probability). There are sequences which converge in distribution, but not in probability. For example: on the space ([0, 1], B[0,1], λ), let X2n(ω) = ω and X2n−1(ω) = 1 − ω. Then all Xk have the same distribution, but the sequence does not converge in probability. As a second example, the sequence Xn = X, where X follows the Bernoulli distribution with parameter 1/2, converges in distribution to 1 − X, but not in probability.
10. (Delta method). Let X be a real-valued random variable and Xn be a sequence of real-valued random variables with n^c (Xn − θ) → X in distribution for some c > 0. Let g : IR → IR be a function which is differentiable at θ. Then, n^c (g(Xn) − g(θ)) → g′(θ)X in distribution.10
4. (Events in the tail σ-algebra). Let (En)n be a sequence of events. The associated tail σ-algebra is T = ∩_n σ({Ek}k≥n). The event lim sup_n En is in T.
6. (Counterpart of the Borel-Cantelli lemma). Let {En}n∈IN be a nested increasing sequence of events in (Ω, F, P), that is, Ek ⊆ Ek+1, and let Ek^c denote the complement of Ek. Infinitely many Ek occur with probability 1 if and only if there is an increasing sequence tk ∈ IN such that

Σ_k P[E_{t_{k+1}} | E_{t_k}^c] = ∞.
7. (Lévy’s zero-one law). Let F = {Fk}k∈IN be any filtration of F on (Ω, F, P) and X ∈ L1(Ω, F, P). Let F∞ = σ(∪_k Fk) be the smallest σ-algebra containing all Fk. Then

IE[X | Fk] → IE[X | F∞],

almost surely and in L1.
1.4.7 Laws of large numbers and CLTs

2. (Strong law of large numbers). Let {Xk}k and X̄k be as above. Then X̄k → µ almost surely.
3. (Uniform law of large numbers). Let f(x, θ) be a function defined over θ ∈ Θ. For fixed θ and a random process {Xk}k define Zkθ := f(Xk, θ). Let {Zkθ}k be a sequence of independent and identically distributed random variables, such that the sample mean Z̄kθ converges in probability to IE[f(X, θ)]. Suppose that (i) Θ is compact, (ii) f is continuous in θ for almost all x and measurable with respect to x for each θ, (iii) there is a function g such that IE[g(X)] < ∞ and kf(x, θ)k ≤ g(x) for all θ ∈ Θ. Then, IE[f(X, θ)] is continuous in θ and

sup_{θ∈Θ} |Z̄kθ − IE[f(X, θ)]| → 0, almost surely.
4. (Lindeberg-Lévy central limit theorem). Let {Xk}k be iid with finite mean µ and variance σ², and let X̄k be as above. Then

√k (X̄k − µ) → N(0, σ²) in distribution,

where N(0, σ²) is the normal distribution with zero mean and variance σ² (see Section 1.5.2).
10 See the lecture notes at http://personal.psu.edu/drh20/asymp/fall2006/lectures/, Chap. 5. The proof makes use of Taylor’s first-order expansion and Slutsky’s theorem.
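The CLT can be illustrated with a seeded simulation for Xi ∼ U(0, 1) (µ = 1/2, σ² = 1/12): the standardized mean √n (X̄n − µ)/σ should land in [−1.96, 1.96] about 95% of the time. Sample sizes and trial counts below are arbitrary choices.

```python
import math
import random

random.seed(2)

n, trials = 100, 4000
sigma = math.sqrt(1.0 / 12.0)   # standard deviation of U(0,1)
inside = 0
for _ in range(trials):
    xbar = sum(random.random() for _ in range(n)) / n
    z = math.sqrt(n) * (xbar - 0.5) / sigma   # standardized sample mean
    if abs(z) <= 1.96:
        inside += 1
frac = inside / trials
```

With this seed the observed fraction is close to the normal value 0.95, well within Monte Carlo noise.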
5. (Lyapunov central limit theorem). Let {Xk}k be a sequence of independent random variables with IE[Xk] = µk and finite variance σk². Define sk² = Σ_{i=1}^{k} σi². If for some δ > 0 the following condition holds (Lyapunov’s condition)11:

lim_{k→∞} (1/sk^{2+δ}) Σ_{i=1}^{k} IE|Xi − µi|^{2+δ} = 0,

then

(1/sk) Σ_{i=1}^{k} (Xi − µi) → N(0, 1) in distribution.
6. (Law of the iterated logarithm). Let (Xk)k∈IN be independent identically distributed random variables with zero mean and unit variance, and let Sk := X1 + . . . + Xk. Then,

lim sup_k Sk / √(2k ln ln k) = 1, almost surely.
Then,

kFN − F k∞ := sup_{x∈IR} |FN(x) − F(x)| → 0,

almost surely.
11 In practice it is usually easiest to check Lyapunov’s condition for δ = 1. If a sequence of random variables satisfies
Lyapunov’s condition, then it also satisfies Lindeberg’s condition. The converse implication, however, does not hold.
12 The almost sure pointwise convergence of FN to F follows from the strong law of large numbers. This is, therefore, a stronger result.
1.5 Standard Distributions
If X ∼ N(µX, σX²) and Y ∼ N(µY, σY²) are jointly normally distributed random variables, then X + Y is normally distributed with IE[X + Y] = µX + µY and σ_{X+Y} = √(σX² + σY² + 2ρ σX σY), where ρ is the correlation between X and Y. For any α ∈ IR, αX ∼ N(αµX, α²σX²).
4. (Multivariate normal distribution). The multivariate variant of the normal distribution, denoted by N(µ, Σ) with µ ∈ IR^n and Σ ∈ IR^{n×n} symmetric positive definite, has PDF

pX(x) = |2πΣ|^{−1/2} exp(−½ (x − µ)^⊤ Σ^{−1} (x − µ)).

For singular (positive semi-definite) Σ, the distribution is supported on µ + im(Σ).
5. (Isserlis’ theorem – high-order moments of multivariate normal). Let X = (X1, . . . , X2n) follow the multivariate normal distribution with zero mean and covariances Σi,j = cov(Xi, Xj). Then,

IE[X1 X2 · · · X2n] = Σ_p ∏_{(i,j)∈p} IE[Xi Xj] = Σ_p ∏_{(i,j)∈p} cov(Xi, Xj),

where the sum runs over all pairings p of {1, . . . , 2n}, and

IE[X1 X2 · · · X2n−1] = 0.
6. (Linear transformation). Let X ∼ N(µ, Σ), µ ∈ IR^n, Σ ∈ IR^{n×n}, and Y = AX + b for constant A and b. Then Y ∼ N(Aµ + b, AΣA^⊤).
7. (Conditioning). Let X1, X2 be two random variables with values in IR^{n1} and IR^{n2} respectively. Suppose that

(X1, X2) ∼ N((µ1, µ2), [Σ11 Σ12; Σ21 Σ22]).

Then, the conditional distribution of X1 given X2 = x2 is N(µ̄(x2), Σ̄), where

µ̄(x2) = µ1 + Σ12 Σ22^{-1} (x2 − µ2),   Σ̄ = Σ11 − Σ12 Σ22^{-1} Σ21.
where C(n, k) = n! / (k!(n − k)!) is the binomial coefficient.
2. (Characteristics). If X ∼ B(n, p), then IE[X] = np, the median of X is either ⌊np⌋ or ⌈np⌉, its variance is Var[X] = np(1 − p) and the moment generating function (MGF) of X is MX(z) = (1 − p + pe^z)ⁿ. The cumulative distribution function of X is given in terms of the regularized incomplete beta function as

P[X ≤ k] = I_{1−p}(n − k, k + 1).
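These characteristics can be cross-checked against the PMF in a few lines of plain Python (the concrete n, p and z values are arbitrary illustrations):

```python
import math

def binom_pmf(n, p, k):
    """PMF of B(n, p): C(n, k) p^k (1-p)^(n-k)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 20, 0.3
mean = sum(k * binom_pmf(n, p, k) for k in range(n + 1))
var = sum((k - mean)**2 * binom_pmf(n, p, k) for k in range(n + 1))
mgf = lambda z: sum(math.exp(z * k) * binom_pmf(n, p, k) for k in range(n + 1))

assert abs(mean - n * p) < 1e-9                       # IE[X] = np
assert abs(var - n * p * (1 - p)) < 1e-9              # Var[X] = np(1-p)
assert abs(mgf(0.7) - (1 - p + p * math.exp(0.7))**n) < 1e-6
```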
Var[S] ≤ IE[S] (1 − IE[S]) / (n + n0).
As an example, consider the experiment of tossing n coins a large number of times and observing
the number of heads each time. Then, as n grows large, the shape of the distribution of the number
of heads approaches that of the normal distribution.
P[X = k] = λᵏ e^{−λ} / k!,  k ∈ IN.

We denote X ∼ Poisson(λ).
5. (Law of rare events). Let pn ∈ (0, 1) be such that npn → λ as n → ∞. Then, for every k ∈ IN,

lim_{n→∞} C(n, k) pnᵏ (1 − pn)^{n−k} = λᵏ e^{−λ} / k!.
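The law of rare events is easy to observe numerically. A small sketch (plain Python; λ = 2, k = 3 and pn = λ/n are illustrative choices): the binomial PMF approaches the Poisson PMF as n grows.

```python
import math

def binom_pmf(n, p, k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    return lam**k * math.exp(-lam) / math.factorial(k)

lam, k = 2.0, 3
errs = [abs(binom_pmf(n, lam / n, k) - poisson_pmf(lam, k)) for n in (10, 100, 1000)]
assert errs[0] > errs[1] > errs[2]   # the approximation improves with n
assert errs[2] < 1e-3                # already tight at n = 1000
```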
[Figure: Binomial PMF vs. Poisson PMF.]
6. (Large λ). For large values of λ, e.g., λ > 10³, the normal distribution N(λ, λ) is considered a good approximation to Poisson(λ).
2 Multivariate distributions
2.1 Multivariate random variables
1. (Multivariate CDF). The cdf of a random variable X : (Ω, F, P) → IRᵈ is the function

FX(x1, . . . , xd) = P[X1 ≤ x1, . . . , Xd ≤ xd] = P[ ∩_{i=1,...,d} {Xi ≤ xi} ].
2. (Multivariate CDF properties). The cdf FX of a random variable X : (Ω, F, P) → IRd has the
following properties
i. It is monotonically nondecreasing with respect to each variable
ii. It is right-continuous with respect to each variable
iii. 0 ≤ FX (x1 , . . . , xd ) ≤ 1 for all x1 , . . . , xd ∈ IR
iv. limx1 ,...,xd →∞ FX (x1 , . . . , xd ) = 1
v. limxi →−∞ FX (x1 , . . . , xd ) = 0 for all i ∈ IN[1,d]
2.2 Copulas
2.2.1 Sklar’s theorem
1. (Definition). Let X be a d-dimensional random variable, with X(ω) = (X1(ω), X2(ω), . . . , Xd(ω)), with continuous marginal CDFs FXi(x) = P[Xi ≤ x]. By the probability integral transform, the random variable U = (U1, . . . , Ud) defined as

Ui = FXi(Xi),

for i = 1, . . . , d, has uniformly distributed marginals, that is, Ui ∼ U(0, 1). The copula of X is the joint cumulative distribution function of U, that is,

C(u1, . . . , ud) = P[U1 ≤ u1, . . . , Ud ≤ ud] = P[X1 ≤ FX1⁻¹(u1), . . . , Xd ≤ FXd⁻¹(ud)].

A d-dimensional copula is a function C : [0, 1]ᵈ → [0, 1] which is a joint cumulative distribution function of a d-dimensional random variable on [0, 1]ᵈ with uniform marginals.
2. (Sklar's theorem). Every multivariate CDF, H(x1, . . . , xd) = P[X1 ≤ x1, . . . , Xd ≤ xd], can be expressed in terms of its marginals, FXi(x) = P[Xi ≤ x], and a copula C : [0, 1]ᵈ → [0, 1], that is,

H(x1, . . . , xd) = C(FX1(x1), . . . , FXd(xd)).

If the multivariate distribution has a PDF h, then there is a function c, called the copula density, such that

h(x1, . . . , xd) = c(FX1(x1), . . . , FXd(xd)) f1(x1) · . . . · fd(xd),

where fi is the PDF of Xi. Conversely, given a copula C : [0, 1]ᵈ → [0, 1] and marginal distributions FXi, there is a d-dimensional CDF as described above.
3. (Characterization). A function C : [0, 1]ᵈ → [0, 1] is a copula if and only if it satisfies the following properties
i. For every j ∈ IN[1,d], C(1, . . . , 1, t, 1, . . . , 1) = t, where t appears in the j-th position
ii. C is isotonic (order preserving), that is, C(u) ≤ C(u′) whenever u ≤ u′ in the sense ui ≤ u′i for all i ∈ IN[1,d]
iii. C is d-nondecreasing, that is, for every hyperrectangle B ⊆ [0, 1]ᵈ, the C-volume of B is nonnegative, that is,

∫_B dC ≥ 0.
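These properties can be spot-checked for a concrete copula. A minimal sketch (plain Python; the independence copula and the coarse grid are illustrative assumptions), verifying the uniform margins of property i and, for d = 2, the nonnegativity of rectangle C-volumes from property iii:

```python
def indep_copula(u):
    """Independence copula C(u) = u1 * u2 * ... * ud."""
    prod = 1.0
    for ui in u:
        prod *= ui
    return prod

# Property i: uniform margins.
for t in (0.0, 0.3, 1.0):
    assert abs(indep_copula((1.0, t, 1.0)) - t) < 1e-12

# Property iii for d = 2: the C-volume of [a1,b1] x [a2,b2] is
# C(b1,b2) - C(a1,b2) - C(b1,a2) + C(a1,a2), and must be nonnegative.
def rect_volume(C, a1, b1, a2, b2):
    return C((b1, b2)) - C((a1, b2)) - C((b1, a2)) + C((a1, a2))

grid = [i / 4 for i in range(5)]
for a1 in grid:
    for b1 in grid:
        for a2 in grid:
            for b2 in grid:
                if a1 <= b1 and a2 <= b2:
                    assert rect_volume(indep_copula, a1, b1, a2, b2) >= -1e-12
```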
5. (Properties of copulas). A copula C : [0, 1]ᵈ → [0, 1] possesses the following properties
i. C(u1, . . . , ud) = 0 if there is an i0 ∈ IN[1,d] so that ui0 = 0
ii. C is nonexpansive in the following sense:

|C(u) − C(v)| ≤ Σ_{i=1}^d |ui − vi|.
6. (Fréchet-Hoeffding copula bounds). For any copula C : [0, 1]ᵈ → [0, 1],

W(u1, . . . , ud) ≤ C(u1, . . . , ud) ≤ M(u1, . . . , ud),

where

W(u1, . . . , ud) := max{0, 1 − d + Σ_{i=1}^d ui},

and

M(u1, . . . , ud) := min{u1, . . . , ud}.

The upper bound is sharp: M is always a copula and equality is attained for comonotone random variables.
4. (Fréchet-Hoeffding bounds, d dimensions). For every d-dimensional copula Cd and u ∈ [0, 1]ᵈ, we have

Wd(u) ≤ Cd(u) ≤ M(u),

where Wd is the d-dimensional variant of the counter-monotonicity copula W2 shown above. This is defined as

Wd(u) = max{ Σ_{i=1}^d ui − d + 1, 0 }.

For d > 2, Wd is not a copula; the above bounds are nevertheless pointwise tight. Moreover, for copulas C, C′ and α ∈ [0, 1], the convex combination C̄ = αC + (1 − α)C′ is also a copula.
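The bounds are easy to check numerically for a concrete copula. A small sketch (plain Python; the independence copula, d = 3 and the grid are illustrative assumptions): W(u) ≤ C(u) ≤ M(u) on a grid of points.

```python
def W(u):
    """Lower Frechet-Hoeffding bound."""
    return max(0.0, sum(u) - len(u) + 1)

def M(u):
    """Upper bound (comonotonicity copula)."""
    return min(u)

def indep(u):
    """Independence copula, used as the test copula C."""
    p = 1.0
    for ui in u:
        p *= ui
    return p

grid = [i / 5 for i in range(6)]
for u1 in grid:
    for u2 in grid:
        for u3 in grid:
            u = (u1, u2, u3)
            assert W(u) - 1e-12 <= indep(u) <= M(u) + 1e-12
```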
3 Stochastic Processes
3.1 General
1. (Stochastic process). Let T ⊆ IR (e.g., T = IN or T = IR). A random process is a sequence/net
(Xn )n∈T of (real-valued) random variables on a probability space (Ω, F, P).
2. (Version). Let T = [0, ∞) be a time index set and (Xt)t, (Yt)t be two stochastic processes on (Ω, F, P). We say that (Xt)t is a version of (Yt)t if P[Xt = Yt] = 1 for every t ∈ T.
3. (Centered). Let (Xt )t be a real-valued stochastic process with t ∈ [a, b]. We say that (Xt )t is
centered if IE[Xt ] = 0 for all t ∈ [a, b].
4. (Mean-square continuous). Let (Xt)t be a real-valued stochastic process with t ∈ [a, b]. We say that (Xt)t is mean-square continuous if

lim_{ε→0} IE[(Xt+ε − Xt)²] = 0,

for every t ∈ [a, b].
5. (Auto-correlation function). Let (Xt )t∈T be a stochastic process. Define the function RX : T ×T →
IR as
RX (s, t) = IE[Xs Xt ].
This function is called the auto-correlation function of (Xt )t .
7. (Filtrations). A filtration is an increasing sequence of sub-σ-algebras of F. The space (Ω, F, (Ft )t∈T , P)
is called a filtered probability space. The filtration Ft = σ({Xs ; s ∈ T, s ≤ t}) is called the filtration
generated by (Xn )n∈T . We say that (Xn )n is adapted to a filtration (Fn )n if for all n ∈ T, Xn is
Fn -measurable.
8. (Stopping times). Let (Ft)t be a filtration on (Ω, F, P) and define T̄ := T ∪ {+∞}. A random variable τ : Ω → T̄ is called a stopping time if

{ω | τ(ω) ≤ t} ∈ Ft,

for all t ∈ T. This is equivalent to requiring that the process Zt = 1{τ≤t} is adapted to (Ft)t∈T.
9. (Wald's first identity)¹. Let (Xk)k∈IN be a sequence of iid random variables with common finite mean, IE[|Xi|] < ∞. Let τ be a stopping time with IE[τ] < ∞. Then,

IE[X1 + . . . + Xτ] = IE[τ] IE[X1].
10. (Wald’s second identity). Let (Xk )k∈IN be a sequence of iid random variables with zero mean and
common finite variance σ 2 = IE[Xi2 ] < ∞. Let τ be a stopping time with IE[τ ] < ∞. Then,
IE[(X1 + . . . + Xτ )2 ] = σ 2 IE[τ ].
1 Details and proofs for the three identities of Wald can be found in the lecture notes of S. Lalley (Statistics 381).
11. (Wald's third identity). Let (Xk)k∈IN be a sequence of nonnegative iid random variables with mean IE[Xk] = 1. Let τ be a bounded stopping time with IE[τ] < ∞. Then,

IE[ Π_{i=1}^τ Xi ] = 1.
12. (A useful property). For any stochastic process (Xn)n∈IN and ε > 0, we have

P[ max_{i≤k} |Xi| > ε ] = P[ Σ_{i=0}^k Xi² · 1{|Xi|>ε} > ε² ].
13. (Kolmogorov's continuity theorem). Let (Xt)t be an IRⁿ-valued stochastic process on (Ω, F, P). Suppose that for every T > 0 there are positive constants α, β, L such that

IE[‖Xt − Xs‖^α] ≤ L |t − s|^{1+β},

for all 0 ≤ s, t ≤ T. Then, (Xt)t admits a continuous version.
3.2 Martingales
1. (Martingale — discrete time). A random process (Xn )n is called a martingale if IE[|Xn |] < ∞ and
IE[Xn+1 | X1 , . . . , Xn ] = Xn .
2. (Martingale — continuous time). A random process (Xt)t≥0 on a filtered probability space (Ω, F, (Ft)t≥0, P) is called a martingale if (i) it is adapted to (Ft)t≥0, (ii) for every t ≥ 0, IE[|Xt|] < ∞, (iii) for all s, t ≥ 0 with s < t and all F ∈ Fs, IE[1F(Xt − Xs)] = 0, or, equivalently, Xs = IE[Xt | Fs].
3. (Martingale examples). The following are common examples of martingales:
a) Let (Xn)n be a sequence of iid random variables with mean IE[Xn] = µ. Then Yn = Σ_{i=1}^n (Xi − µ) is a martingale.
b) Let (Xn)n be a sequence of iid random variables with zero mean and finite variance σ². Define Y0 = 0 and Yn = (X1 + . . . + Xn)² − nσ². Then, (Yn)n is a martingale.
c) If (Xn)n is a sequence of iid random variables with mean 1, then Yn = Π_{i=1}^n Xi is a martingale.
d) If (Xn)n is a sequence of random variables with finite expectation and IE[Xn | X1, . . . , Xn−1] = 0, then Yn = Σ_{i=0}^n Xi is a martingale.
e) (The classical martingale). The fortune of a gambler is a martingale in a fair game.
4. (Sub- and super-martingales). A random process (Xn )n is called a super-martingale if IE[|Xn |] < ∞
and IE[Xn+1 | X1 , . . . , Xn ] ≤ Xn . Likewise, it is a sub-martingale if IE[|Xn |] < ∞ and IE[Xn+1 |
X1 , . . . , Xn ] ≥ Xn .
5. (Stopping time). Let {Zk}k be a random process and T a stopping time. Define Xk(ω) = Z_{k∧T(ω)}(ω), that is,

Xk(ω) = Zk(ω) if k ≤ T(ω), and Xk(ω) = Z_{T(ω)}(ω) otherwise.

If Z is a (sub-)martingale, then X is a (sub-)martingale too.
6. (Stopped martingales are martingales). Let (Xn )n be a martingale. Let τ be a stopping time.
Then X̃n = Xn∧τ is a martingale.
7. (Doob’s optional stopping theorem). Let (Xn )n be a super-martingale and T be a stopping time.
Then XT is integrable and IE[XT ] ≤ IE[X0 ] in each of the following cases
i. T is bounded
ii. X is bounded and T is almost surely finite
iii. E[T ] < ∞ and (Xn )n has (surely) bounded differences, i.e., there is an M > 0 such that
|Xn (ω) − Xn−1 (ω)| ≤ M,
for all n ∈ IN and ω ∈ Ω
iv. Xn ≥ 0 for all n and T is almost surely finite
8. (Optional stopping theorem, version 2). Let (Xt )t be a martingale on (Ω, F, P) subject to a filtration
F = (Ft )t and let τ be a stopping time. Assume that one of the following holds
i. τ is almost surely bounded, that is, there is a τ̄ ≥ 0, so that τ (ω) ≤ τ̄ for P-almost all ω 2
ii. IE[τ] is finite and IE[|Xk+1 − Xk| | Fk] is almost surely bounded, uniformly in k,
iii. |Xmin(t,τ ) | is almost surely bounded,
Then Xτ is almost surely a well-defined random variable and
IE[Xτ ] = IE[X0 ].
If X is assumed to be a super-martingale, then
IE[Xτ ] ≤ IE[X0 ].
If X is assumed to be a sub-martingale, then
IE[Xτ ] ≥ IE[X0 ].
9. (Optional stopping theorem, more general version). Let (Xt )t be a martingale on (Ω, F, P) subject
to a filtration F = (Ft )t and let τ be a stopping time. Suppose that X is uniformly integrable
(then, it has a well-defined limit, X∞ so we may define X̄τ = Xτ 1τ <∞ + X∞ 1τ =∞ ). Let τ 0 ≤ τ be
two stopping times. Then,
IE[Xτ | Fτ 0 ] = Xτ 0 .
10. (Almost sure martingale convergence). Let (Xn )n be a martingale which is uniformly bounded
in L1 , i.e., supn IE[|Xn |] < ∞. Then, there is a X ∈ L1 (F∞ ), so that Xn → X a.s., where
F∞ = σ(Fn , n ≥ 0).
11. (Kolmogorov's sub-martingale inequality). Let {Xk}k be a nonnegative sub-martingale. Then, for n ∈ IN>0 and α > 0,

P[ max_{k=1,...,n} Xk ≥ α ] ≤ IE[Xn] / α.

i. (Corollary 1). Let {Xk}k be a nonnegative martingale. Then P[sup_{k≥1} Xk ≥ α] ≤ IE[X1]/α for α > 0.
ii. (Corollary 2). Let {Xk}k be a martingale with IE[Xk²] < ∞ for all k ∈ IN>0. Then, P[max_{k=1,...,n} |Xk| ≥ α] ≤ IE[Xn²]/α² for all n ∈ IN>0 and α > 0.
iii. (Corollary 3). Let {Zk}k be a nonnegative super-martingale. Then, for n ∈ IN>0 and α > 0, P[∪_{k≥n} {Zk ≥ α}] ≤ IE[Zn]/α.
12. (Azuma-Hoeffding inequality for martingales with bounded differences). Let (Xi)i be a martingale or a super-martingale with |Xk − Xk−1| < ck almost surely. Then for all N ∈ IN and t > 0,

P[XN − X0 ≥ t] ≤ exp( −t² / (2 Σ_{i=1}^N ci²) ).

If (Xi)i is a sub-martingale,

P[XN − X0 ≤ −t] ≤ exp( −t² / (2 Σ_{i=1}^N ci²) ).
2 This is a strong condition which is often not satisfied in practice. However, for fixed N ∈ IN, τ ∧ N is a stopping time. We often apply the optional stopping theorem for the bounded stopping time τ ∧ N and take N → ∞.
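For a ±1-step martingale (the simple symmetric random walk) with ci = 1, the Azuma bound can be compared with the exact tail, which is computable by counting paths. A minimal sketch (plain Python; N = 30 and the chosen thresholds are arbitrary illustrations):

```python
import math

N = 30

def tail_exact(t):
    """P[S_N >= t] for S_N a sum of N iid +/-1 steps (S_N = 2H - N, H ~ B(N, 1/2))."""
    return sum(math.comb(N, h) for h in range(N + 1) if 2 * h - N >= t) / 2**N

def azuma_bound(t):
    """Azuma-Hoeffding bound with c_i = 1 for every step."""
    return math.exp(-t**2 / (2 * N))

for t in (2, 6, 10, 14):
    assert tail_exact(t) <= azuma_bound(t)
```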
Figure 3.1: Random walk: two different paths, (Xt(ω1))t and (Xt(ω2))t.
13. (Martingale inequalities). Let (Xt)t≥0 be a càdlàg martingale and define Xt* = sup_{s≤t} |Xs|. Then, for every t > 0,
i. for α > 0, P[Xt* ≥ α] ≤ ‖Xt‖1 / α
ii. for p > 1, ‖Xt*‖p ≤ (p/(p−1)) ‖Xt‖p
14. (Nonnegative submartingale inequalities). Let (Xt)t≥0 be a nonnegative càdlàg submartingale and define Xt* = sup_{s≤t} |Xs|. Then, for every t > 0,
i. for α > 0, P[Xt* ≥ α] ≤ ‖Xt‖1 / α
ii. for p > 1, ‖Xt*‖p ≤ (p/(p−1)) ‖Xt‖p
for all m in the support of Xt. This is the binomial distribution with parameter p (see definition in Section 1.5.3).
4. (Maximum of random walk). Let Xt be a simple symmetric random walk (with p = 0.5) and define Mt = max_{t′≤t} Xt′. Then, M0 = 0, the support of Mt is {0, 1, . . . , t} and

P[Mt = m] = P[Xt = m] + P[Xt = m + 1] = 2^{−t} C(t, ⌊(t + m + 1)/2⌋).
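For small t the formula can be verified exhaustively, since there are only 2^t equally likely paths. A brute-force sketch (plain Python; t = 8 is an arbitrary illustration):

```python
import math
from itertools import product

t = 8  # enumerate all 2^t = 256 equally likely paths of the symmetric walk
max_counts = {}
for steps in product((-1, 1), repeat=t):
    pos, best = 0, 0
    for s in steps:
        pos += s
        best = max(best, pos)
    max_counts[best] = max_counts.get(best, 0) + 1

for m in range(t + 1):
    exact = max_counts.get(m, 0) / 2**t
    formula = math.comb(t, (t + m + 1) // 2) / 2**t
    assert abs(exact - formula) < 1e-12
```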
3.4 Brownian motion
5. (Infinitely many visits). Almost surely, the one-dimensional simple symmetric random walk visits every integer n ∈ Z infinitely often.
6. (As a Markov chain). The one-dimensional random walk can be seen as a Markov chain with states in Z and P[Xk+1 = i + 1 | Xk = i] = p and P[Xk+1 = i − 1 | Xk = i] = 1 − p.
7. (Probability to reach upper bound before lower bound). Let (Xn)n be a simple symmetric random walk starting at x ∈ Z, that is, X0 = x. Let a < x < b for some a, b ∈ Z. Let τa = inf{n ∈ IN | Xn = a} and τb = inf{n ∈ IN | Xn = b}. Then

P[τb < τa] = (x − a) / (b − a).
8. (Average time to exit interval). Let (Xk)k be a simple symmetric random walk with X0 = x. Suppose x ∈ [a, b] with a, b ∈ Z. Define the stopping time τ = inf{n ∈ IN | Xn ∈ {a, b}} (therefore, Xτ ∈ {a, b}). Define the stochastic process

Yn = n + (Xn − a)(b − Xn).

Let Fn be the σ-algebra generated by (Xk)_{k=0}^n. Then,
a) τ is an almost surely finite stopping time with IE[τ] < ∞
b) (Yn)n is an (Fn)n-martingale
c) Y0 = (x − a)(b − x) and Yτ = τ
d) From the (general) optional stopping theorem, IE[τ] = IE[Yτ] = IE[Y0] = (x − a)(b − x)
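Both closed forms can be verified by checking the defining boundary-value recursions of the symmetric walk, whose solutions are unique. A small sketch (plain Python; the endpoints a, b are arbitrary illustrations): h(x) = (x − a)/(b − a) is harmonic with h(a) = 0, h(b) = 1, and g(x) = (x − a)(b − x) solves g(x) = 1 + (g(x−1) + g(x+1))/2 with zero boundary values.

```python
a, b = -3, 7  # interval endpoints, integers with a < x < b

def h(x):
    """Claimed probability of hitting b before a, started at x."""
    return (x - a) / (b - a)

def g(x):
    """Claimed expected exit time from [a, b], started at x."""
    return (x - a) * (b - x)

assert h(a) == 0 and h(b) == 1
assert g(a) == 0 and g(b) == 0
for x in range(a + 1, b):
    # h is harmonic: h(x) = (h(x-1) + h(x+1)) / 2
    assert abs(h(x) - 0.5 * (h(x - 1) + h(x + 1))) < 1e-12
    # g pays one unit of time per step: g(x) = 1 + (g(x-1) + g(x+1)) / 2
    assert abs(g(x) - (1 + 0.5 * (g(x - 1) + g(x + 1)))) < 1e-12
```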
9. (Gaussian random walk). Take a sequence of independent random variables Zt with Zt ∼ N(µ, σ²). The random process (Xt)t with X0 = 0 and Xt = Z1 + . . . + Zt is called a Gaussian random walk. Then, Xt ∼ N(tµ, tσ²).
4. (Existence of continuous version). The Brownian motion satisfies the condition of Kolmogorov's continuity theorem with α = 4, β = 1 and L = n(n + 2). In particular, IE[‖Xτ − Xτ′‖⁴] = n(n + 2)|τ′ − τ|².
5. (Zero crossings). Properties:
i. Define the set of zero-crossing times, Z0 = {t ≥ 0 | Bt = 0}. With probability 1, the Lebesgue measure of Z0 is zero,
ii. Almost surely, Z0 is a closed set and has no isolated points,
iii. The Brownian motion crosses the time axis infinitely often in every time interval (0, t) for t > 0.
6. (Distribution of maximum). Let Xt be a Brownian motion and Mt = max_{s≤t} Xs. Then, for all t > 0 and a > 0,

P[Mt ≥ a] = 2 P[Xt ≥ a] = 2(1 − Φ(a/√t)).
7. (Attainment of maximum 1). Almost surely, the set of times where Bt attains a local maximum is
dense in [0, +∞)
8. (Attainment of maximum 2). On any two disjoint closed intervals, Bt almost surely does not attain the same maximum.
9. (Strict maximum). Almost surely, every local maximum of a Brownian motion is a strict local
maximum.
10. (Countability of the set maximum times). Almost surely, the set of times when Bt attains a local
maximum is countable.
11. (Maxima are distinct). The local maxima of Bt are almost surely distinct
12. (Nowhere differentiable). Almost surely, t ↦ Xt(ω) is nowhere differentiable.
13. (Orthogonal transformation). Let Bt be an n-dimensional Brownian motion starting at 0 and U be an orthogonal matrix, UUᵀ = I. Then, B̃t = U Bt is a Brownian motion.
14. (Brownian scaling). Let Bt be an n-dimensional Brownian motion starting at 0 and c > 0. Then, B̂t = (1/c) B_{c²t} is a Brownian motion.
15. (Time inversion). Let Bt be an n-dimensional Brownian motion starting at 0 and (B̆t )t is a process
with B̆0 = 0 and B̆t = tB1/t . Then B̆ is a Brownian motion.
16. (Integrated Brownian motion). The integral of the one-dimensional Brownian motion starting at 0, ibm(t, ω) := ∫₀ᵗ Bs(ω) ds, is a random variable which follows the normal distribution N(0, t³/3).
17. (Exit time). Let (Bt)t be a one-dimensional Brownian motion on (Ω, F, P) started at 0 and define τ(ω) = inf{t ∈ T | Bt ∉ [−a, b]}, where a, b > 0. This means that τ is the first time when the process leaves the interval [−a, b]. Then,
i. τ is an integrable random variable
ii. IE[τ] = ab
iii. IE[Bτ] = 0 and IE[Bτ²] = IE[τ] = ab
iv. P[Bτ = −a] = b/(a + b) and P[Bτ = b] = a/(a + b)
Note. We often need to evaluate the expectation of a transformation of the Brownian motion, Yt = f(Bt). Using the fact that Bt is normally distributed at every t and the law of the unconscious statistician,

IE[f(Bt)] = (1/√(2πt)) ∫_{−∞}^{∞} f(x) e^{−x²/(2t)} dx,

provided f(Bt) is integrable. Similarly, we may need to evaluate

IE[∫₀ᵗ f(Bs) ds] = ∫₀ᵗ IE[f(Bs)] ds = ∫₀ᵗ (1/√(2πs)) ∫_{−∞}^{∞} f(x) e^{−x²/(2s)} dx ds,

using Fubini's theorem.
3.5 Markov processes
3. (Equivalent representation of Markov control models). For every MCM, there is a Borel space S, a function F : X × U × S → X and an S-valued iid process {ξt}t so that

xt+1 = F(xt, ut, ξt).
6. (The canonical probability space (Ω, F, P)). Given an MCM (X, U, U, Q, c), let Ω = H∞ = Π_{t=0}^∞ (X × U). Ω contains sequences ω = (x0, u0, x1, u1, . . .). Let F be the corresponding product σ-algebra. Given a probability measure ν on (X, B(X)) (called the initial distribution) and a
uct σ-algebra. Given a probability measure ν on (X , B(X )) (called the initial distribution) and a
policy π, according to the Ionescu-Tulcea Theorem, there is a unique probability measure Pπν so
that for B ∈ B(X ), C ∈ B(U), ht ∈ Ht and t ∈ IN:
i. Pπν [x0 ∈ B] = ν(B)
ii. Pπν [ut ∈ C | ht ] = πt (C, ht )
iii. Pπν [xt+1 ∈ B | ht , ut ] = Q(B, xt , ut )
Note that the last condition is a Markov-like property, but it does not imply that xt is a Markov
process.
7. (Markov decision process). A (discrete-time) Markov decision process (MDP) is a tuple (Ω, F, Pπν , {xt }t ).
In other words, for a given policy π and a given initial distribution ν, an MDP is a stochastic process
{xt (ω)}t∈IN over the canonical probability space (Ω, F, Pπν ).
8. (Space Φ). We define the space Φ of all transition kernels φ : B(U)×X → [0, 1] with φ(U(x), x) = 1.
10. (Markovianity of {xt }t ). Let ν be an initial distribution. Let π = {φt } be a randomized Markov pol-
icy (see 9-i). Then, {xt }t is a non-homogeneous Markov process with transition kernels {Q(·, ·, φt )}t ,
that is, for B ∈ B(X )
3.6 Markov decision processes
where the minimization is over control functions u : X → U. Suppose that all Jt are measurable and that for every t ∈ IN[0,N−1] there is a selection u?t : X → U with u?t(x) ∈ U(x) which attains the minimum, that is,

Jt(x) = c(x, u?t(x)) + ∫_X Jt+1(χ) Q(dχ, x, u?t(x)).

Then, the deterministic Markov policy π? = {u?0, u?1, . . . , u?N−1} is optimal and the value function J? is equal to J0, that is,

J?(x) = J0(x) = J(π?, x).
3. (Measurable selection theorem 1). There exists a measurable selection u?t in the above DP theorem,
if
i. (Control constraints). U is compact-valued (i.e., for every x, U (x) is compact)
ii. (Cost function). c(x, · ) is lower semicontinuous on U (x) for every x ∈ X
iii. (Integral). The function ξ(x, u) = ∫_X v(χ) Q(dχ, x, u) on K satisfies one of the following conditions:
i. ξ(x, · ) is lower semi-continuous on U(x) for every x ∈ X and every continuous bounded function v on X
ii. ξ(x, · ) is lower semi-continuous on U(x) for every x ∈ X and every measurable bounded function v on X.
4. (Measurable selection theorem 2). There exists a measurable selection u?t in the above DP theorem,
if
i. (Control constraints). U is compact-valued (i.e., for every x, U (x) is compact) and the multi-
valued function x 7→ U (x) is upper semi-continuous
ii. (Cost function). Function c is lower semicontinuous and bounded below
iii. (Transition kernel). The transition kernel Q satisfies one of the following conditions:
i. it is weakly continuous, that is, ξ(x, u) = ∫_X v(χ) Q(dχ, x, u) is continuous and bounded on K for every continuous bounded function v on X
ii. it is strongly continuous, that is, ξ is continuous and bounded on K for every measurable bounded function v on X.
5. (Measurable selection theorem 3). There exists a measurable selection u?t in the above DP theorem,
if
i. (Cost function). The stage cost c is lower semi-continuous, bounded below and inf-compact on
K, that is, for every x ∈ X and r ≥ 0, the set {u ∈ U (x) | c(x, u) ≤ r} is compact (in other
words, c has compact level sets)
ii. (Transition kernel). Condition 4iii in Measurable Selection Theorem 2 holds.
4 Stochastic Differential Equations
4.1 Itô Integral
1. (Class V). Let (Ω, F, F, P) be a filtered probability space where F = (Ft)t∈T is a filtration, T = [0, +∞) and t, t′ ∈ T with t < t′. We define the class V = V(t, t′) to be the class of functions f : T × Ω → IR with
i. f is B × F-measurable, where B is the Borel σ-algebra on T
ii. f(t, ω) is F-adapted
iii. IE[ ∫_t^{t′} f(s, ω)² ds ] < ∞.
2. (Itô integral for elementary functions). Let (Bt)t≥0 be the standard Brownian motion on the filtered probability space (Ω, F, {Ft}t≥0, P) and φ be an elementary function of class V(t, t′), that is,

φ(s, ω) = Σᵢ eᵢ(ω) 1_{[tᵢ, tᵢ₊₁)}(s),

where eᵢ is Ftᵢ-measurable. We define the Itô integral of φ from t to t′ to be the random variable

∫_t^{t′} φ(s, ω) dBs = Σᵢ eᵢ(ω) (B_{tᵢ₊₁}(ω) − B_{tᵢ}(ω)).
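For the left-endpoint (Itô) choice eᵢ = B_{tᵢ}, the elementary sum satisfies, path by path, the exact algebraic identity Σᵢ B_{tᵢ}(B_{tᵢ₊₁} − B_{tᵢ}) = ½(B_T² − Σᵢ(ΔB)²); since the quadratic variation Σᵢ(ΔB)² concentrates near T, this is consistent with ∫₀ᵀ B dB = (B_T² − T)/2. A hedged simulation sketch (plain Python; the seed, step count and tolerances are illustrative assumptions):

```python
import random

rng = random.Random(42)
T, n = 1.0, 10000
dt = T / n
# Sample a Brownian path on a grid (independent N(0, dt) increments).
B = [0.0]
for _ in range(n):
    B.append(B[-1] + rng.gauss(0.0, dt ** 0.5))

ito_sum = sum(B[i] * (B[i + 1] - B[i]) for i in range(n))   # left endpoints
quad_var = sum((B[i + 1] - B[i]) ** 2 for i in range(n))

# Exact identity for any sampled path: sum B_i dB_i = (B_T^2 - [B]_T) / 2.
assert abs(ito_sum - 0.5 * (B[-1] ** 2 - quad_var)) < 1e-9
# The quadratic variation is close to T (std of the error is sqrt(2/n)).
assert abs(quad_var - T) < 0.1
```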
3. (Itô integral on V(t, t′)). Let (Bt)t≥0 be the standard Brownian motion on (Ω, F, {Ft}t≥0, P) and f ∈ V(t, t′). Let φn be a sequence of elementary functions which converges to f in the following sense:

IE[ ∫_t^{t′} (φn(s, ω) − f(s, ω))² ds ] → 0.

Then,

∫_t^{t′} f(s, ω) dBs = lim_{n→∞} ∫_t^{t′} φn(s, ω) dBs,

where the limit is taken in L²(Ω, F, P).
where v is an Itô-integrable function with P[∫₀ᵗ v(s, ω)² ds < ∞, for all t ≥ 0] = 1, is called an Itô process. Such a process is also written in the shorter differential form dXt = u dt + v dBt.
where u(t, ω) = [u1(t, ω) · · · un(t, ω)]ᵀ, V = (Vi,j(t, ω))i,j is an n-by-m matrix whose entries Vi,j are Itô-integrable functions, dBt = [dB1,t · · · dBm,t]ᵀ and Xt = [X1,t · · · Xn,t]ᵀ. In other words, componentwise,

dXi,t = ui dt + Σ_{j=1}^m Vi,j dBj,t,  i = 1, . . . , n,

with the calculus rules dt · dt = dt · dBj,t = dBj,t · dt = 0 and dBi,t · dBj,t = δi,j dt.
8. (Itô formula — multi-dimensional). Let Bt be an m-dimensional Brownian motion and Xt be an n-dimensional Itô process, dXt = u dt + V dBt. Let g be a C²([0, ∞) × IRⁿ; IRᵖ) map and Yt = g(t, Xt). Then,

dYk,t = ∂gk/∂t (t, Xt) dt + Σᵢ ∂gk/∂xᵢ (t, Xt) dXi,t + ½ Σᵢ,ⱼ ∂²gk/(∂xᵢ∂xⱼ) (t, Xt) dXi,t dXj,t.

In particular, applying this formula to g(x, y) = xy yields the integration-by-parts formula

∫₀ᵗ Xs dYs = Xt Yt − X0 Y0 − ∫₀ᵗ Ys dXs − ∫₀ᵗ dXs dYs.
4.2 Stochastic differential equations
v. (Exponential). Let θ(t, ω) be an n-dimensional random process with θi(t, ω) ∈ V([0, T]) for i = 1, . . . , n, with T ≤ ∞. Define

Zt = exp( ∫₀ᵗ θ(s, ω)ᵀ dBs − ½ ∫₀ᵗ ‖θ(s, ω)‖² ds ).
1 Multiply both sides by e^{−µt} and apply Itô's formula on d(e^{−µt}Xt). To find the variance, use the Itô isometry.
Table 4.2: Stochastic integrals — we denote by Bt the standard Brownian motion with B0 = 0 and by ibm(t, ω) := ∫₀ᵗ Bs ds the integrated Brownian motion.
5 Information Theory
5.1 Entropy and Conditional Entropy
1. (Self-Information, construction). Let (Ω, F, P) be a discrete probability space. A self-information
function I must satisfy the following desiderata: (i) if ωi is sure (P[ωi ] = 1), then this offers no
information, that is I(ωi ) = 0, (ii) if ωi is not sure, that is P[ωi ] < 1, then I(ωi ) > 0, (iii) I(ω)
depends on the probability P[ω], that is, there is a function f so that I(ω) = f (P[ω]) (iv) for two
independent events A and B, I(A ∩ B) = I(A) + I(B).
2. (Self-information, definition). A definition which satisfies the above desiderata is I(ω) = − log(P[ω]).
3. (Self-information, units). When log2 is used in the definition, self-information is measured in bits. If ln ≡ loge is used, it is measured in nats. For the decimal logarithm, I is measured in hartleys.
4. (Entropy, definition). The entropy (or Shannon entropy) of a random variable is the expectation
of its self-information denoted as H(X) = IE[I(X)], where I(X) is to be interpreted as follows:
Let (Ω, F, P) be a probability space and X : (Ω, F, P) → {xi }ni=1 a finite-valued random variable.
Consider the events Ei = {ω ∈ Ω | X(ω) = xi } with self-information I(Ei ). Then, I(X) is the
random variable I(X)(ω) = I(Eι(ω) ), where ι(ω) is such that X(ω) = xι(ω) .
The entropy of X is given by

H(X) = − Σ_{i=1}^n pi log(pi),

where pi = P[X = xi].
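A minimal sketch of the definition (plain Python; the function name and the example distributions are illustrative), using log2 so that entropy is measured in bits:

```python
import math

def entropy(p, base=2.0):
    """Shannon entropy H(X) = -sum_i p_i log(p_i); 0 log 0 is taken as 0."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

assert abs(entropy([0.5, 0.5]) - 1.0) < 1e-12        # fair coin: 1 bit
assert abs(entropy([0.25] * 4) - 2.0) < 1e-12        # uniform on 4 outcomes: 2 bits
assert entropy([1.0]) == 0.0                         # a sure event carries no information
assert entropy([0.9, 0.1]) < entropy([0.5, 0.5])     # less uncertainty, less entropy
```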
5. (Joint entropy). The joint entropy of two random variables X and Y (with values {xi}i and {yj}j respectively) is the entropy of the random variable (X, Y) in the product space, that is,

H(X, Y) = − Σ_{i,j} pij log pij,

where pij = P[X = xi, Y = yj].
6. (Conditional Entropy).
7. (Mutual information).
5.2 KL divergence
1. (Definition/Discrete spaces). Let (Ω, F) be a discrete measurable space and P and P′ two probability measures on it. The Kullback-Leibler (KL) divergence of P from P′ is defined as¹

DKL(P ‖ P′) = − Σᵢ Pᵢ log(P′ᵢ/Pᵢ) = Σᵢ Pᵢ log(Pᵢ/P′ᵢ).
2. (Definition/Continuous spaces with PDFs). The KL divergence over a continuous probability space and for two probability measures P and P′ with PDFs p and p′ respectively is

DKL(P ‖ P′) = ∫_{−∞}^{∞} p(x) log(p(x)/p′(x)) dx.
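A minimal discrete-space sketch (plain Python; natural logarithm, so values are in nats; the Bernoulli example is illustrative). Note that DKL requires Pᵢ = 0 whenever P′ᵢ = 0, and that it is not symmetric:

```python
import math

def kl_discrete(p, q):
    """D_KL(P || P') = sum_i P_i log(P_i / P'_i), in nats."""
    assert all(qi > 0 for qi in q), "requires P absolutely continuous w.r.t. P'"
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.5, 0.5], [0.25, 0.75]
assert kl_discrete(p, p) == 0.0                       # divergence from itself is zero
assert abs(kl_discrete(p, q) - 0.5 * math.log(4 / 3)) < 1e-12
assert kl_discrete(p, q) != kl_discrete(q, p)         # KL is not symmetric
```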
6 Risk
6.1 Risk measures
1. (Risk measures and coherency). Let (Ω, F, P) be a probability space and Z = Lp (Ω, F, P) for
p ∈ [1, ∞]. A risk measure ρ : Z → IR is called coherent if
i. (Convexity). For Z, Z′ ∈ Z and λ ∈ [0, 1], ρ(λZ + (1 − λ)Z′) ≤ λρ(Z) + (1 − λ)ρ(Z′)
ii. (Monotonicity). For Z, Z 0 ∈ Z, ρ(Z) ≤ ρ(Z 0 ) whenever Z ≤ Z 0 a.s.,
iii. (Translation equi-variance). For Z ∈ Z and C ∈ Z with C(ω) = c for almost all ω (almost
surely constant), it is ρ(C + Z) = c + ρ(Z),
iv. (Positive homogeneity). For Z ∈ Z and α ≥ 0, ρ(αZ) = αρ(Z).
2. (Conjugate risk measure). With every convex risk measure, we associate the conjugate risk measure
ρ∗ : Z ∗ → IR defined as
ρ∗ (Y ) = sup {hZ, Y i − ρ(Z)} .
Z∈Z
3. (Biconjugate risk measure). With every convex risk measure, we associate the biconjugate risk
measure ρ∗∗ : Z ∗∗ → IR
ρ∗∗ (Z) = sup {hZ, Y i − ρ∗ (Y )} .
Y ∈Z ∗
4. (Dual representation). Let Z = Lp(Ω, F, P) with p ∈ [1, ∞). If ρ is lower semi-continuous, then ρ = ρ∗∗. In particular,

ρ(Z) = sup_{Y ∈ A} {hZ, Y i − ρ∗(Y)},

where A = dom ρ∗.
5. (Acceptance set). The set Aρ = {X ∈ Z : ρ(X) ≤ 0} is called the acceptance set of ρ. Several
properties of ρ can be tested using its acceptance set.
6. (Monotonicity condition). If Y ≥ 0 (almost surely) for every Y ∈ A, then and only then ρ is
monotone.
7. (Translation equi-variance condition). If for every Y ∈ A it is IE[Y ] = 1, then and only then, ρ is
translation equi-variant.
8. (Positive homogeneity condition). If ρ is the support function of A, that is, ρ(Z) = supY ∈A hY, Zi,
then and only then it is positively homogeneous. A is called the admissibility set of ρ.
9. (Coherency-preserving operations). Let ρ1, ρ2 be two coherent risk measures on Z. Then, the following risk measures are coherent
i. ρ(X) := λ1ρ1(X) + λ2ρ2(X), λ1, λ2 ≥ 0 not both equal to 0
ii. ρ(X) = max{ρ1(X), ρ2(X)}
11. (Sub-differentials of risk measures). Let ρ : Lp(Ω, F, P) → IR, p ∈ [1, ∞), be convex and lower semi-continuous. Then ∂ρ(Z) = arg max_{Y ∈A} {hY, Zi − ρ∗(Y)}. If, additionally, ρ is positively homogeneous, then ∂ρ(Z) = arg max_{Y ∈A} hY, Zi.
12. (Convexity of ρ ◦ F ). Let F : IRn → Z be a convex mapping1 and ρ be a convex monotone risk
measure. Then ρ ◦ F is convex.
13. (Directional differentiability). Let Z = Lp(Ω, F, P) with p ∈ [1, ∞), F : IRⁿ → Z be a convex mapping and ρ : Z → IR be a convex monotone risk measure which is finite-valued and continuous at Z̄ = F(x̄). Then, φ := ρ ◦ F is directionally differentiable at x̄, φ′(x̄; h) is finite-valued for all h ∈ IRⁿ and²

φ′(x̄; h) = sup_{Y ∈ ∂ρ(Z̄)} hY, f′(x̄; h)i.
17. (Law invariance). A risk measure ρ is called law invariant if ρ(Z) = ρ(Z 0 ) whenever Z and Z 0 have
the same distribution.
18. (Fatou property #1). Let ρ : L∞ → IR be a proper convex risk measure. The following are equivalent:
i. ρ is σ(L∞, L1)-lower semi-continuous
ii. ρ has the Fatou property, i.e., ρ(X) ≤ lim inf_k ρ(Xk) whenever {Xk} is essentially uniformly bounded (there is Z ∈ L∞ so that |Xk| ≤ Z for all k ∈ IN) and Xk → X in probability.
19. (Law-invariant risk measures have the Fatou property)4 . Let LΦ denote an Orlicz space5 . Any
proper, (quasi)convex, law-invariant risk measure ρ : LΦ → IR that is norm-lower semi-continuous
has the Fatou property if and only if Φ is ∆2 .
20. (Kusuoka representations). Let (Ω, F, P) be a non-atomic space and let ρ : Lp (Ω, F, P) → IR be a
proper lower semi-continuous law-invariant coherent risk measure. Then, there exists a set M of
probability measures on [0, 1) so that
ρ(Z) = sup_{µ∈M} ∫₀¹ AV@R_{1−α}(Z) dµ(α),
where AV@R1−α is the average value-at-risk operator at level 1 − α (defined in the next section).
1 The mapping F : IRⁿ → Z is convex if for every λ ∈ [0, 1] and x, y ∈ IRⁿ it is F(λx + (1 − λ)y)(ω) ≤ λF(x)(ω) + (1 − λ)F(y)(ω) for P-almost every ω.
2 F maps a vector x to random variables, so it is F(x)(ω) = f(x, ω). The directional derivative of f with respect to x along a direction h is f′(x̄; h) and it is a random variable. The scalar product here is defined as hY, f′(x̄; h)i = ∫_Ω Y(ω) f′(x̄; h)(ω) dP(ω).
3 For a detailed discussion on continuity properties of risk measures, see D. Filipović and G. Svindland, "Convex risk ..."
4 "Law invariant risk measures have the Fatou property," (chapter in) Advances in Mathematical Economics, 2006, Springer Japan.
5 An Orlicz space is a function space which generalizes the Lp spaces. A Young function Φ : [0, ∞) → [0, ∞) is a convex function with lim_{x→∞} Φ(x) = ∞ and Φ(0) = 0. Given a Young function Φ and a probability space (Ω, F, P), define the set {X : Ω → IR, measurable, IE[Φ(|X|)] < ∞}. This set is not necessarily a vector space; the vector space it spans is the Orlicz space LΦ(Ω, F, P). This space is equipped with the Luxembourg norm ‖X‖Φ = inf{λ > 0 : IE[Φ(|X|/λ)] ≤ 1}. We say that Φ satisfies the ∆2 condition if Φ(2t) ≤ KΦ(t) for some K > 0 and all t ≥ 0.
6.2 Popular risk measures
21. (Regularity in spaces with atoms). Let (Ω, F, P) be a space with atoms and (Ω, H, P) be a uniform
probability space so that (Ω, F, P) is isomorphic to it. Let Z := Lp (Ω, F, P) and Ẑ := Lp (Ω, H, P),
p ∈ [1, ∞). Let ρ̂ : Ẑ → IR be a proper, lower semi-continuous, law invariant, coherent risk measure.
We say that ρ̂ is regular if there is a proper, lower semi-continuous, law invariant, coherent risk
measure ρ : Z → IR so that ρ|Ẑ = ρ̂.
22. (Zero risk). Let (Ω, F, P) be a non-atomic probability space. Let ρ be a proper, lower semi-
continuous, coherent, law invariant risk measure. If Z ∈ Z, Z ≥ 0 a.s. then ρ(Z) = 0 if and only
if Z = 0 a.s.
23. (Risk under conditioning). Let (Ω, F, P) be a non-atomic space and ρ : Z → IR be a proper
convex lower semi-continuous law-invariant risk measure. Let H be a sub-σ-algebra of F. Then,
ρ(IE [X | H]) ≤ ρ(X), for all X ∈ Z and IE[X] ≤ ρ(X).
24. (Interchangeability principle for risk measures). Let Z := Lp(Ω, F, P) and Z′ := Lp′(Ω, F, P) with p, p′ ∈ [1, ∞]. Let F : IRⁿ → Z, that is, for x ∈ IRⁿ, F(x) is a random variable; let (F(x))(ω) = f(x, ω). For a set X ⊆ IRⁿ define MX := {χ ∈ Z′ : χ(ω) ∈ X, P-a.s.}. Let ρ : Z → IR be a proper monotone risk measure. For χ ∈ Z′ define Fχ(ω) = f(χ(ω), ω). Suppose that inf_{x∈X} F(x) ∈ Z and that ρ is continuous at inf_{x∈X} F(x). Then

inf_{χ∈MX} ρ(Fχ) = ρ( inf_{x∈X} F(x) ).
2. (Mean-Variance measure). The mean-variance risk measure is defined as ρ(X) = IE[X] + cVar[X].
This risk measure is law invariant, continuous, convex and translation equi-variant. However, it is
neither monotone nor positively homogeneous.
This is a convex, translation equi-variant and positively homogeneous risk measure. It is monotone
if p = 1, (Ω, F, P) is non-atomic and c ∈ [0, 1/2].
6. (Mean-Upper-Semideviation of order p). Let X ∈ Lp(Ω, F, P), p ∈ [1, ∞) and c ≥ 0. Define the mapping

ρ(X) = IE[X] + c ( IE[ ([X − IE[X]]₊)ᵖ ] )^{1/p}.

This is a convex, translation equi-variant and positively homogeneous risk measure. It is monotone if p = 1, (Ω, F, P) is non-atomic and c ∈ [0, 1].
6 The Value-at-Risk is convex for certain classes of random variables. See A. I. Kibzun and E. A. Kuznetsov, “Convex
Properties of the Quantile Function in Stochastic Programming,” Automation and Remote Control, Vol. 65, No. 2,
2004, pp. 184–192.
7 We use the notation [X]_+ = max{X, 0}. We use the definition of Shapiro et al.; other authors use different definitions.
7. (Entropic risk measure). Let Z = L^p(Ω, F, P), p ∈ [1, ∞]. For γ > 0, define the entropic risk measure

ρ^ent_γ(X) = (1/γ) log IE[e^{γX}].

For p = ∞, ρ^ent_γ is finite valued and w*-lower-semicontinuous. Moreover, ρ^ent_γ is convex, monotone and translation equi-variant, but not positively homogeneous. Furthermore, lim_{γ→0} ρ^ent_γ(X) = IE[X] and lim_{γ→∞} ρ^ent_γ(X) = esssup[X].
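A Monte Carlo sketch of the two limits, using a bounded uniform X so that esssup X is known; the sample size and the values of γ are illustrative:

```python
import numpy as np

# rho_gamma(X) = (1/gamma) log E[exp(gamma X)], estimated by Monte Carlo.
rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, 200_000)    # bounded, so E[X] = 0 and esssup X = 1

def entropic_risk(X, gamma):
    # log-sum-exp for numerical stability at large gamma
    return (np.logaddexp.reduce(gamma * X) - np.log(X.size)) / gamma

small, large = entropic_risk(X, 1e-3), entropic_risk(X, 200.0)
print(small)   # close to E[X] = 0
print(large)   # close to esssup X = 1
```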
The entropic value-at-risk is a coherent risk measure for all α ∈ (0, 1].
9. (Expectiles). Let X ∈ L2 (Ω, F, P) and τ ∈ (0, 1). The τ -expectile of X is defined as
8 The moment generating function (MGF) MX of a random variable X is defined as MX (z) := IE[ezX ] for z ∈ IR. Not all
random variables have an MGF (e.g., the Cauchy distribution does not define an MGF).
9 These risk measures were first introduced by Ben-Tal and Teboulle; see for example A. Ben-Tal, M. Teboulle, “An old-new concept of convex risk measures: an optimized certainty equivalent,” Mathematical Finance 17 (2007) 449–476. These measures are discussed in: P. Krokhmal, M. Zabarankin and S. Uryasev, “Modeling and optimization of risk,” Surveys in Operations Research and Management Science 16 (2011) 49–66.
10 In the case of AV@R_α, it is φ(X) = (1/α) IE[[X]_+], which is indeed convex, monotone and translation equi-variant.
7 Uncertainty Quantification
7.1 Polynomial chaos
7.1.1 The Kosambi-Karhunen-Loève theorem
1. (Kernel function1). A kernel function is a symmetric continuous function K : [a, b] × [a, b] → IR. K is called positive semidefinite if ∑_{i,j=1}^n c_i c_j K(x_i, x_j) ≥ 0, for all scalars (c_i)_{i=1}^n, all (x_i)_{i=1}^n in [a, b] and all n ∈ IN.
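The definition can be checked numerically: for any points x_1, . . . , x_n, the Gram matrix G with G_{ij} = K(x_i, x_j) of a positive semidefinite kernel is a positive semidefinite matrix. A sketch with the Gaussian kernel, a standard PSD example not taken from the text:

```python
import numpy as np

def K(s, t, ell=0.5):
    # Gaussian (RBF) kernel -- symmetric, continuous and positive semidefinite
    return np.exp(-((s - t) ** 2) / (2 * ell ** 2))

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, 30)
G = K(x[:, None], x[None, :])          # Gram matrix G[i, j] = K(x_i, x_j)
eigs = np.linalg.eigvalsh(G)
print(eigs.min())                      # >= 0 up to floating-point rounding

c = rng.standard_normal(30)
print(c @ G @ c)                       # sum_{i,j} c_i c_j K(x_i, x_j) >= 0
```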
3. (Mercer’s theorem). Mercer’s theorem offers a representation of kernel functions using a basis of L^2([a, b]): Let K be a positive definite kernel. Then, there is an orthonormal basis (e_i)_i of L^2([a, b]) and a sequence of nonnegative coefficients (λ_i)_i so that

K(s, t) = ∑_{j=1}^∞ λ_j e_j(s) e_j(t),

where the convergence is absolute and uniform on [a, b] × [a, b], and (e_i)_i and (λ_i)_i are eigenfunctions and eigenvalues of the integral operator T_K, (T_K f)(t) = ∫_a^b K(s, t) f(s) ds.
4. (Kosambi-Karhunen-Loève theorem). Let (X_t)_{t ∈ T}, T = [a, b], be a centered, mean-square continuous stochastic process on (Ω, F, P) with X_t ∈ L^2(Ω, F, P) for all t ∈ T. Then, there is a basis (e_i)_{i ∈ IN} of L^2(T) such that, for all t ∈ T,

X_t = ∑_{i=1}^∞ Z_i e_i(t)

in L^2(Ω, F, P), where Z_i = ∫_a^b X_s e_i(s) ds and the e_i are eigenfunctions of the covariance operator T_{R_X} with eigenvalues λ_i, that is,

T_{R_X} e_i = λ_i e_i,

or equivalently

∫_a^b R_X(s, t) e_i(s) ds = λ_i e_i(t).
5. (Corollary of KKL theorem2). Let (X_t)_{t ∈ T}, T = [a, b], be a stochastic process which satisfies the requirements of the Kosambi-Karhunen-Loève theorem. Then, there exists a basis (e_i)_i of L^2(T) such that, for all t ∈ T,

X_t(ω) = ∑_{i=1}^∞ √λ_i ξ_i(ω) e_i(t)

in L^2(Ω, F, P), where the ξ_i are centered, mutually uncorrelated random variables with unit variance.
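A numerical sketch of the expansion for Brownian motion on [0, 1], whose covariance R(s, t) = min(s, t) has the closed-form eigenpairs λ_k = ((k − 1/2)π)^{−2} and e_k(t) = √2 sin((k − 1/2)πt); this standard example is not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0.0, 1.0, 200)
K = 500                                  # truncation order of the expansion
k = np.arange(1, K + 1)
lam = 1.0 / ((k - 0.5) ** 2 * np.pi ** 2)
e = np.sqrt(2.0) * np.sin((k[:, None] - 0.5) * np.pi * t[None, :])   # (K, len(t))

# X_t = sum_k sqrt(lambda_k) xi_k e_k(t), with xi_k iid N(0, 1)
n_paths = 5000
xi = rng.standard_normal((n_paths, K))
X = (xi * np.sqrt(lam)) @ e              # sampled paths, shape (n_paths, len(t))

# The empirical covariance of the paths should approximate min(s, t)
emp_cov = (X.T @ X) / n_paths
true_cov = np.minimum(t[:, None], t[None, :])
err = np.abs(emp_cov - true_cov).max()
print(err)                               # small (Monte Carlo + truncation error)
```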
3. (Orthogonality wrt random variable). Let Ξ be a real-valued random variable with probability density function p_Ξ. Let ψ_1, ψ_2 : IR → IR be two polynomials. We say that ψ_1, ψ_2 are orthogonal with respect to (the pdf of) Ξ if ⟨ψ_1, ψ_2⟩_Ξ := ∫_{−∞}^{∞} ψ_1(s) ψ_2(s) p_Ξ(s) ds = 0.
4. Let ψ_0, ψ_1, . . ., with ψ_0 = 1, be a sequence of orthogonal polynomials. Then,

0 = ⟨ψ_0, ψ_1⟩_Ξ = ∫_{−∞}^{∞} ψ_1(s) p_Ξ(s) ds = IE[ψ_1(Ξ)],

by virtue of LotUS. Recursively, IE[ψ_i(Ξ)] = 0 for all i ≥ 1.
5. (Hermite polynomials). Let Ξ be distributed as N(0, 1). Then, its pdf is p_Ξ(s) = (1/√(2π)) e^{−s²/2} and the polynomials ψ_0, ψ_1, . . . are the (probabilists’) Hermite polynomials, the first few of which are H_0(x) = 1, H_1(x) = x, H_2(x) = x² − 1, H_3(x) = x³ − 3x. These are orthogonal with respect to Ξ, that is,

⟨H_i, H_j⟩_Ξ = (1/√(2π)) ∫_{−∞}^{∞} H_i(s) H_j(s) e^{−s²/2} ds = 0,

for i ≠ j.
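The orthogonality relations can be verified numerically with NumPy’s probabilists’-Hermite module, using Gauss–Hermite quadrature for the weight e^{−s²/2}:

```python
import numpy as np
from numpy.polynomial import hermite_e as He

# Quadrature rule exact for polynomials up to degree 39 with weight exp(-s^2/2)
nodes, weights = He.hermegauss(20)

def inner(i, j):
    # <H_i, H_j> with respect to the N(0,1) pdf (hence the 1/sqrt(2*pi) factor)
    vals = He.hermeval(nodes, np.eye(4)[i]) * He.hermeval(nodes, np.eye(4)[j])
    return (weights @ vals) / np.sqrt(2 * np.pi)

G = np.array([[inner(i, j) for j in range(4)] for i in range(4)])
print(np.round(G, 6))   # off-diagonal entries vanish; the diagonal is i! = 1, 1, 2, 6
```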
6. (Legendre polynomials). If Ξ ∼ U([−1, 1]), then ψ_0, ψ_1, . . . are the Legendre polynomials. If, instead, Ξ ∼ U([a, b]), the coefficients of the Legendre polynomials can be modified accordingly (shifted Legendre polynomials).
7. (Laguerre polynomials). If the germ is an exponential random variable on [0, ∞), then ψ0 , ψ1 , . . .
are the Laguerre polynomials.
8. (Polynomial projection). Let M_N = {ψ_i}_{i=0}^N ⊆ P_N(I) be a set of polynomials ψ_i : I → IR, orthogonal with respect to the inner product ⟨ · , · ⟩_w. We define the projection operator onto M_N as

P_N : L^2_w(I) ∋ f ↦ P_N f := ∑_{j=0}^N f̂_j ψ_j ∈ P_N(I),

where

f̂_j = (1/‖ψ_j‖^2_w) ⟨f, ψ_j⟩_w.
2 For applications of the KKL theorem to decompose stochastic processes, see http://amslaurea.unibo.it/10169/1/Giambartolomei_Giordano_Tesi.pdf.
9. (Properties of polynomial projection). It is easy to see that P_N f = f for all f ∈ P_N(I), whereas, for g ⊥ P_N(I), we have P_N g = 0.
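A numerical sketch of the projector with Legendre polynomials on [−1, 1] (weight w ≡ 1, with ‖P_j‖² = 2/(2j + 1)), checking that P_N f = f on P_N(I) and that a polynomial orthogonal to P_N(I) projects to zero:

```python
import numpy as np
from numpy.polynomial import legendre as L

N = 5
nodes, weights = L.leggauss(50)          # Gauss-Legendre quadrature on [-1, 1]

def project(f, N):
    # hat f_j = <f, P_j> / ||P_j||^2, with ||P_j||^2 = 2 / (2j + 1)
    coeffs = np.zeros(N + 1)
    for j in range(N + 1):
        Pj = L.legval(nodes, np.eye(N + 1)[j])
        coeffs[j] = (weights @ (f(nodes) * Pj)) / (2 / (2 * j + 1))
    return coeffs

# P_N f = f for f already in the span (here f(x) = x^3 - x):
c = project(lambda x: x**3 - x, N)
x = np.linspace(-1, 1, 7)
err_in = np.abs(L.legval(x, c) - (x**3 - x)).max()
print(err_in)                            # ~ machine precision

# ... whereas the Legendre polynomial P_7, orthogonal to P_5(I), projects to 0:
c7 = project(lambda x: L.legval(x, np.eye(8)[7]), N)
err_orth = np.abs(c7).max()
print(err_orth)                          # ~ 0
```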
10. (Best approximation). For f ∈ L^2_w(I),

‖f − P_N f‖_w = inf_{ψ ∈ P_N(I)} ‖f − ψ‖_w

and, writing f_N := P_N f,

lim_{N→∞} ‖f − f_N‖_w = 0.
⟨ψ_i, ψ_j⟩_Ξ = δ_{i,j} γ_i,

where

α_j = (1/γ_j) ∫_0^1 ψ_j(u) F_X^{−1}(F_Ξ(u)) dF_Ξ(u).

Then, as N → ∞, X_N → X in probability4.
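The inverse-CDF construction of the coefficients can be sketched numerically. The choices below are illustrative: X ∼ Exp(1), a standard normal germ Ξ, the probabilists’ Hermite basis with γ_j = j!, and F_X^{−1}(F_Ξ(s)) = −log(1 − Φ(s)) evaluated via the complementary error function for tail accuracy:

```python
import numpy as np
from math import erfc, factorial
from numpy.polynomial import hermite_e as He

N = 8
nodes, weights = He.hermegauss(120)                   # quadrature, weight exp(-s^2/2)
w = weights / np.sqrt(2 * np.pi)                      # normalize to the N(0,1) pdf
# Phi_bar(s) = 0.5 * erfc(s / sqrt(2)), accurate even deep in the right tail
sf = np.array([0.5 * erfc(s / np.sqrt(2)) for s in nodes])
target = -np.log(sf)                                  # F_X^{-1}(F_Xi(s)) at the nodes

# alpha_j = <X, psi_j>_Xi / gamma_j, with gamma_j = j! for Hermite polynomials
alpha = np.array([w @ (target * He.hermeval(nodes, np.eye(N + 1)[j])) / factorial(j)
                  for j in range(N + 1)])
print(alpha[0])                                       # ~ E[X] = 1
var_N = sum(alpha[j] ** 2 * factorial(j) for j in range(1, N + 1))
print(var_N)                                          # approaches Var[X] = 1 from below
```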
3. (Non-intrusive solution via linear regression).
4. (Non-intrusive solution via stochastic projection). Let X be a random variable on (Ω, F, P) and Y = η(X). Suppose we have obtained a truncated PC expansion for X. The question is how uncertainty propagates from X to Y via η; in other words, what is the distribution of Y? For N ∈ IN, let X_N be a PC expansion of X as follows

X_N = ∑_{j=0}^N x_j ψ_j(Ξ) = f_N(Ξ).
3 In certain cases, we may derive approximation bounds. For example, for f in the weighted Sobolev space H^p_w([−1, 1]) = {g : [−1, 1] → IR | d^i g/dτ^i ∈ L^2_w([−1, 1]), i = 0, . . . , p}, equipped with the inner product

⟨f, g⟩_{H^p_w([−1,1])} = ∑_{j=0}^p ⟨ d^j f/dτ^j , d^j g/dτ^j ⟩_{L^2_w([−1,1])}

and induced norm ‖f‖_{H^p_w([−1,1])} = ⟨f, f⟩_{H^p_w([−1,1])}^{1/2}, and with M_N being the set of Legendre polynomials on [−1, 1], we have that there is a constant c, independent of N, so that ‖f − P_N f‖ ≤ c N^{−p} ‖f‖_{H^p_w([−1,1])}.
4 X can be approximated by projecting on the space spanned by the orthogonal polynomials M_N = {ψ_i}_{i=0}^N, leading to an approximation X_N = ∑_{j=0}^N α_j ψ_j(Ξ), where the coefficients α_j are computed by α_j = ⟨X, ψ_j⟩_Ξ / γ_j. The problem is that the inner product ⟨X, ψ_j⟩_Ξ, typically, cannot be evaluated. The trick is that X is equal in distribution to F_X^{−1}(U), where U is a random variable which is uniformly distributed on [0, 1]; one such variable is U = F_Ξ(Ξ), therefore X is equal in distribution to F_X^{−1}(F_Ξ(Ξ)). This leads to the above formula. The integral can be evaluated by quadrature methods.
Let Y_N be the desired approximation (we take the approximation length and the orthogonal polynomial basis to be the same), Y_N = ∑_{j=0}^N y_j ψ_j(Ξ). It follows that

y_k = (1/‖ψ_k‖^2_Ξ) ⟨ψ_k, η ∘ f_N⟩_Ξ = (1/‖ψ_k‖^2_Ξ) ∫ η(f_N(u)) ψ_k(u) p_Ξ(u) du,

and the integral can be evaluated using a quadrature method, or even simple Monte Carlo, that is,

∫ η(f_N(u)) ψ_k(u) p_Ξ(u) du ≈ N_mc^{−1} ∑_{i=1}^{N_mc} η(f_N(u^{(i)})) ψ_k(u^{(i)}),

where u^{(i)} are samples from the distribution of Ξ.
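A Monte Carlo sketch of this projection step; the choices are illustrative: standard normal germ, probabilists’ Hermite basis (so ‖ψ_k‖²_Ξ = k!), X_N = Ξ and η(x) = x², for which Ξ² = He_0(Ξ) + He_2(Ξ) gives the exact coefficients y = (1, 0, 1, 0, . . .):

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He

rng = np.random.default_rng(5)
N = 4
x = np.zeros(N + 1); x[1] = 1.0                 # PC coefficients of X_N = Xi
eta = lambda v: v ** 2                          # the map Y = eta(X)

u = rng.standard_normal(400_000)                # samples u^(i) of the germ Xi
fN = He.hermeval(u, x)                          # f_N(u^(i))
# y_k = <psi_k, eta o f_N>_Xi / ||psi_k||^2, by simple Monte Carlo
y = np.array([np.mean(eta(fN) * He.hermeval(u, np.eye(N + 1)[k])) / factorial(k)
              for k in range(N + 1)])
print(np.round(y, 1))                           # ~ [1, 0, 1, 0, 0]
```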
5. (Galerkin projection).
6.
8 Bibliography with comments
Bibliographic references including lecture notes and online resources with some comments:
1. R.G. Gallager. Stochastic processes: theory for applications. Cambridge University Press, 2013: A gentle
introduction to stochastic processes suitable for engineers who want to eschew the mathematical drudgery.
Following a short, but circumspect introduction to probability theory, the author discusses several pro-
cesses such as Poisson, Gaussian, Markovian and renewal processes. Lastly, the book discusses hypothesis
testing, martingales and estimation theory. Without doubt, an excellent introduction to the topic for the
uninitiated.
2. Robert L. Wolpert. Probability and measure, 2005. Lecture notes: Lecture notes with a succinct presen-
tation of some very useful results, but without many proofs. Available at https://www2.stat.duke.edu/
courses/Spring05/sta205/lec/s05wk07.pdf.
3. Erhan Çinlar. Probability and Stochastics. Springer New York, 2011: A fantastic book for one’s first steps
in probability theory with emphasis on random processes, filtrations, Martingales, stopping times and
convergence theorems, Poisson random measures, Lévy and Markovian processes and Brownian motion.
4. Olav Kallenberg. Foundations of modern probability. Springer, 1997: The definitive reference for researchers.
In its 23 chapters it gives a circumspect overview of probability theory and stochastic processes; ideal for
researchers in the field.
5. Onesimo Hernández-Lerma and Jean Bernard Lasserre. Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, 1996
6. Bernt Øksendal. Stochastic Differential Equations. Springer Berlin Heidelberg, sixth edition, 2003: An amazing eye-opening book on stochastic differential equations and their applications. It offers a very comprehensive presentation of the Brownian motion and Itô’s integral. The exercises are an invaluable tool for assimilating the theory.
7. Karl Sigman. Lecture notes on stochastic modeling I, 2009: Lecture notes by K. Sigman, Columbia University, http://www.columbia.edu/~ks20/stochastic-I/stochastic-I.html.
8. David Walnut. Convergence theorems, 2011. Lecture notes: A short compilation of convergence theorems.
9. S.R. Srinivasa Varadhan. Lecture notes on limit theorems, 2002: A lot of material on limit theorems
starting from general measure theory, to weak convergence results, limits of independent sums, results
for dependent processes with emphasis on Markov chains, a comprehensive introduction to martingales,
stationary processes and ergodic theorems and some notes on dynamic programming. Available online at
https://www.math.nyu.edu/faculty/varadhan/.
10. Zhengyan Lin and Zhidong Bai. Probability Inequalities. Springer, 2011: several interesting (elementary
and advanced) inequalities on probability spaces.
11. Andrea Ambrosio. Relation between almost surely absolutely bounded random variables and their abso-
lute moments, 2013: A short note at http://planetmath.org/sites/default/files/texpdf/38346.pdf
showing that almost surely bounded RVs have all their moments bounded.
12. Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on stochastic programming:
modeling and theory. SIAM, second edition, 2014: Excellent book on stochastic programming and the
definitive reference for risk measures.
13. Anthony O’Hagan. Polynomial chaos: A tutorial and critique from a statistician’s perspective, 2013. Available at http://tonyohagan.co.uk/academic/pdf/Polynomial-chaos.pdf: This article is written in a very intuitive manner and is easy to follow. It seems that it targets applied scientists and practitioners, rather than mathematicians. It is a good read to understand the basics of polynomial chaos. The author questions certain aspects of polynomial chaos from a statistics standpoint.
14. Dongbin Xiu. Numerical methods for stochastic computations: a spectral method approach. Princeton
University Press, 2010: A proper theoretical treatise on polynomial chaos and several other topics related
to approximations of (multivariate) probability distributions.
15. M.S. Eldred. Recent advances in non-intrusive polynomial chaos and stochastic collocation methods for
uncertainty analysis and design. In 50th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics,
and Materials Conference, California, USA, 2009
16. Thorsten Schmidt. Coping with copulas, 2006. https://www.researchgate.net/publication/228876267/
download
17. Carlo Sempi. Introduction to copulas, 2011. The 33rd Finnish Summer School on Probability Theory and
Statistics; available at http://web.abo.fi/fak/mnf/mate/gradschool/summer_school/tammerfors2011/
slides_sempi.pdf: a very thorough presentation of copulas along with lots of theoretical results.
18. A. Kaintura, T. Dhaene, and D. Spina. Review of polynomial chaos-based methods for uncertainty quantification in modern integrated circuits. Electronics, 7(3):30, 2018: a not very rigorous review, but it offers an overview of basic properties of polynomial chaos expansions.
About the author
I was born in Athens, Greece, in 1985. I received a Diploma in Chemical Engineering in 2007 and an
MSc with honours in Applied Mathematics in 2009 from NTU Athens. In December 2012, I defended my
PhD thesis titled “Modelling and Control of Biological and Physiological Systems” at NTU Athens. In
January 2013 I joined the Dynamical Systems, Control and Optimization research unit at IMT Lucca as
a post-doctoral Fellow. Afterwards, I worked as a post-doctoral researcher at ESAT, KU Leuven. I am
currently a post-doctoral researcher at KIOS Center of Excellence, University of Cyprus. My research
focuses on model predictive control and numerical optimization.