Introduction to Probability and Estimation Theory
Frédéric Lehmann
September, 2024
4 Limit theorems
Chebyshev inequality
Law of large numbers
Central limit theorem
5 Poisson process
7 Classical Estimation
8 Bayesian Estimation
Part I
Introduction to Probability
1 Basic concepts
Sets
Probabilistic models
Conditional probability
Independence
Sets
Definition
A set S is a collection of objects, called the elements of S.
Specification
A set with a finite number of elements can be written:
S = {x1 , . . . , xn }.
Example: the set of possible outcomes of a die roll is S = {1, 2, 3, 4, 5, 6}
A set S is countably infinite if its infinitely many elements can be
enumerated in a list, S = {x1 , x2 , . . . }
Example: the set of non-negative integers, N
Otherwise the set is uncountable
Example: the set of real numbers, R or the interval [0, 1]
Set operations
Definitions
Universal set: Ω, the set containing all possible objects in the
context of interest
Empty set: ∅, the set containing no object
Complement of the set S: S^c = {x ∈ Ω | x ∉ S}
Union of two sets S and T : S ∪ T = {x|x ∈ S or x ∈ T }
Intersection of two sets S and T :
S ∩ T = {x|x ∈ S and x ∈ T }
A collection of disjoint sets: their intersection is ∅
A collection of sets is a partition of S: the sets in the
collection are disjoint and their union is S
Properties
Commutative property: S ∪ T = T ∪ S and S ∩ T = T ∩ S,
Associative property: S ∪ (T ∪ U ) = (S ∪ T ) ∪ U and
S ∩ (T ∩ U ) = (S ∩ T ) ∩ U ,
Distributive law: S ∩ (T ∪ U ) = (S ∩ T ) ∪ (S ∩ U ) and
S ∪ (T ∩ U ) = (S ∪ T ) ∩ (S ∪ U )
S ∪ ∅ = S and S ∩ ∅ = ∅
S ∪ Ω = Ω and S ∩ Ω = S
De Morgan's law: (∪n Sn)^c = ∩n Sn^c and (∩n Sn)^c = ∪n Sn^c
Criticism (of the intuitive definitions of probability)
Finite number of equally likely outcomes: not always true
Uses a law of large numbers argument: not yet demonstrated
Probability law
[Figure: an experiment produces events A and B; the probability law assigns them the numbers P(A) and P(B).]
Important properties
Property 1: P (Ac ) = 1 − P (A)
Property 2: If A ⊂ B, then P (A) ≤ P (B)
Property 3: P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
Property 4: P (A ∪ B) ≤ P (A) + P (B)
Property 5: Let A = A1 ∪ · · · ∪ An be a union of mutually disjoint
events, P (A) = P (A1 ) + P (A2 ) + · · · + P (An )
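These properties can be checked directly on a finite sample space such as the die roll introduced earlier. A minimal sketch in Python, where the events A and B are illustrative choices (not from the slides):

```python
from fractions import Fraction

# Uniform probability law on a fair die, used to check the listed properties.
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability of an event (a subset of omega) under the uniform law."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}          # "even outcome"
B = {4, 5, 6}          # "outcome at least 4"

# Property 1: P(A^c) = 1 - P(A)
assert P(omega - A) == 1 - P(A)
# Property 3 (inclusion-exclusion): P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert P(A | B) == P(A) + P(B) - P(A & B)
# Property 4 (union bound): P(A ∪ B) <= P(A) + P(B)
assert P(A | B) <= P(A) + P(B)
```

Using exact `Fraction` arithmetic keeps the identities exact rather than approximate.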
1 Basic concepts
Sets
Probabilistic models
Conditional probability
Independence
Conditional probability
Definition
Consider two events A and B such that: P (B) > 0
Conditional probability of A given B: P(A|B) = P(A ∩ B) / P(B)
Solution
Intuitive approach: |A ∩ B| = 2, |B| = 3, thus
P (A|B) = |A ∩ B|/|B| = 2/3
Rigorous approach: P (B) = |B|/|Ω| = 3/6 = 1/2,
P (A ∩ B) = |A ∩ B|/|Ω| = 2/6 = 1/3, thus
P(A|B) = P(A ∩ B) / P(B) = (1/3)/(1/2) = 2/3
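Both approaches can be replayed in code. The events below are one concrete choice consistent with the counts above (a fair die, with A = "even" and B = "at least 4"); they are an assumed reconstruction, not necessarily the slides' original events:

```python
from fractions import Fraction

# Fair-die sample space; |A ∩ B| = 2 and |B| = 3 as in the example.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # even outcome
B = {4, 5, 6}   # outcome >= 4

def P(event):
    return Fraction(len(event), len(omega))

# Rigorous approach: P(A|B) = P(A ∩ B) / P(B)
cond = P(A & B) / P(B)
assert cond == Fraction(2, 3)

# Intuitive approach: count within the reduced sample space B
assert Fraction(len(A & B), len(B)) == cond
```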
Multiplicative rule
Assumption: All conditioning events have positive probability
Probability of an intersection of events:
P(A1 ∩ A2 ∩ · · · ∩ An) = P(A1) P(A2|A1) · · · P(An|A1 ∩ · · · ∩ An−1)
Definition
Two events A and B are independent if P (A ∩ B) = P (A)P (B)
If in addition P (B) > 0, this is equivalent to P (A|B) = P (A)
Interpretation
In a probabilistic sense: the occurrence of B does not alter P (A)
Pay attention to the intuitive sense: a common misconception is that A
and B, with P(A) > 0 and P(B) > 0, are independent if they are
disjoint.
In fact the opposite is true: disjoint events with positive probability
are never independent, since P(A ∩ B) = 0 ≠ P(A)P(B)
Definition
A1, A2, . . . , An are independent if
P(∩i∈S Ai) = ∏i∈S P(Ai), for any subset S of {1, 2, . . . , n}
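Both the definition and the disjointness pitfall can be checked on two fair coin tosses. The events here are illustrative examples chosen for the sketch:

```python
from fractions import Fraction
from itertools import product

# Sample space of two fair coin tosses.
omega = set(product("HT", repeat=2))   # {('H','H'), ('H','T'), ...}

def P(event):
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == "H"}   # first toss is heads
B = {w for w in omega if w[1] == "H"}   # second toss is heads

# A and B are independent: P(A ∩ B) = P(A) P(B)
assert P(A & B) == P(A) * P(B)

# Disjoint events with positive probability are never independent:
C = omega - A                            # first toss is tails (disjoint from A)
assert P(A & C) == 0 and P(A) * P(C) > 0
```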
Random variable
Visualization
[Figure: a random variable X maps each element of the sample space to a number on the real line.]
Tabular representation
Sample HH HT TH TT
X 2 1 1 0
Definition
Let X be a discrete random variable
Let x1 ≤ x2 ≤ . . . be the possible outcomes of X in
ascending order
Let P ({X = xk }) = pX (xk ), ∀k
pX (x) is the probability mass function (p.m.f.) of X
[Figure: p.m.f. of X with pX(0) = 1/4, pX(1) = 1/2, pX(2) = 1/4.]
Frédéric Lehmann Introduction to Probability and Estimation Theory 37 / 161
Basic concepts Probability mass function (p.m.f.)
Discrete random variables Expectation and variance
Continuous random variables Joint probability mass function of multiple random variables
Limit theorems Results for some useful distributions
Definition
The c.d.f. is defined as FX(x) = P(X ≤ x) = Σu≤x pX(u)

FX(x) =
  0,                              −∞ < x < x1
  pX(x1),                         x1 ≤ x < x2
  pX(x1) + pX(x2),                x2 ≤ x < x3
  ⋮
  pX(x1) + · · · + pX(xn) = 1,    xn ≤ x < ∞
Properties
FX (x) is a piecewise constant function of x
FX is monotonically nondecreasing
limx→−∞ FX (x) = 0
limx→+∞ FX (x) = 1
P (a < X ≤ b) = FX (b) − FX (a)
[Figure: staircase c.d.f. FX(x) with jumps pX(0) = 1/4 at x = 0, pX(1) = 1/2 at x = 1 and pX(2) = 1/4 at x = 2, reaching 1.]
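The p.m.f. and staircase c.d.f. of the two-coin-toss example can be built by enumeration. A minimal sketch (the helper names are ours):

```python
from fractions import Fraction
from itertools import product

# X = number of heads in two fair coin tosses (the HH/HT/TH/TT example).
omega = list(product("HT", repeat=2))
pmf = {}
for w in omega:
    x = w.count("H")
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, len(omega))

assert pmf == {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def cdf(x):
    """F_X(x) = sum of p_X(u) over u <= x: a piecewise-constant staircase."""
    return sum(p for u, p in pmf.items() if u <= x)

assert cdf(0.5) == Fraction(1, 4)       # constant between the jumps at 0 and 1
assert cdf(2) == 1                      # total mass
assert cdf(1) - cdf(0) == pmf[1]        # P(0 < X <= 1) = F(1) - F(0)
```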
Change of variable
If Y = g(X), then pY(y) = Σ{x | g(x)=y} pX(x)
Expectation
Definition
The expectation of a r.v. X is defined as E[X] = Σx x pX(x).
Interpretation
E(X), as the center of gravity of the p.m.f., represents the mean
value of X in a probabilistic sense.
Expectation
Properties
E[g(X)] = Σx g(x) pX(x)
Given two scalars a and b, E[aX + b] = aE[X] + b
Variance
Definition
The variance of a r.v. X is defined as
V[X] = E[(X − E[X])²] = Σx (x − E[X])² pX(x).
The square root of the variance, σX, is called the standard
deviation.
Interpretation
The standard deviation, σX , provides a measure of the dispersion
of X around its mean.
Variance
Properties
V [X] = E[X 2 ] − (E[X])2
Given two scalars a and b, V [aX + b] = a2 V [X]
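These identities can be verified exactly on the two-coin-toss p.m.f.; the scalar values a = 3, b = 5 below are illustrative:

```python
from fractions import Fraction

# p.m.f. of the two-coin-toss example: an illustrative discrete r.v.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def E(g=lambda x: x):
    """Expected value E[g(X)] = sum_x g(x) p_X(x)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = E()
var = E(lambda x: (x - mean) ** 2)

# V[X] = E[X^2] - (E[X])^2
assert var == E(lambda x: x**2) - mean**2

# E[aX + b] = a E[X] + b  and  V[aX + b] = a^2 V[X]
a, b = 3, 5
assert E(lambda x: a * x + b) == a * mean + b
m2 = E(lambda x: a * x + b)
assert E(lambda x: (a * x + b - m2) ** 2) == a**2 * var
```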
Definition
Let X be a random variable with expectation E[X] and standard
deviation σX > 0, the standardized random variable is defined as
X* = (X − E[X]) / σX
Properties
E[X*] = (E[X] − E[X]) / σX = 0
V[X*] = V[X − E[X]] / σX² = V[X] / V[X] = 1
Definition
Joint p.m.f. of two discrete r.v.s X and Y: pX,Y(x, y) = P(X = x, Y = y)
Marginal p.m.f.
pX(x) = P(X = x) = Σy pX,Y(x, y)
pY(y) = P(Y = y) = Σx pX,Y(x, y)
Definition
The joint c.d.f of two random variables X and Y is defined as
FX,Y(x, y) = P(X ≤ x, Y ≤ y) = Σu≤x Σv≤y pX,Y(u, v)
Change of variable
Given the scalars a, b and c, if g is linear, Z = g(X, Y) = aX + bY + c, then
E[Z] = aE[X] + bE[Y] + c
Definition
Two discrete r.v.s are independent if pX,Y(x, y) = pX(x) pY(y), ∀(x, y)
Definitions
Let X and Y be two discrete r.v.s, their covariance is defined as
cov(X, Y) = E[(X − E[X])(Y − E[Y])]
and their correlation coefficient as
ρ(X, Y) = cov(X, Y) / (σX σY)
Properties
X and Y are independent implies cov(X, Y ) = 0
The converse is false
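That the converse is false can be shown with the classic counterexample X uniform on {−1, 0, 1} and Y = X² (an illustrative choice, not from the slides):

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}, Y = X^2: cov(X, Y) = 0, yet X and Y are dependent.
support = [-1, 0, 1]
p = Fraction(1, 3)

EX = sum(x * p for x in support)            # 0
EY = sum(x**2 * p for x in support)         # 2/3
EXY = sum(x * x**2 * p for x in support)    # E[X Y] = E[X^3] = 0
cov = EXY - EX * EY
assert cov == 0

# Dependence: P(X = 1, Y = 1) != P(X = 1) P(Y = 1)
pXY = p                                      # only x = 1 gives the pair (1, 1)
pX1, pY1 = p, 2 * p                          # Y = 1 whenever x = ±1
assert pXY != pX1 * pY1
```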
Definition
X ∼ B(1, p) describes the success or failure in a single trial
pX(k) =
  p, if k = 1
  q, if k = 0
where q = 1 − p
Properties
E[X] = p
V [X] = pq
Definition
X ∼ B(n, p) describes the number of successes in n independent
Bernoulli trials
pX(k) = (n choose k) p^k q^(n−k), for k = 0, 1, . . . , n
where q = 1 − p
Properties
E[X] = np
V [X] = npq
Definition
X ∼ P(λ) approximates the binomial p.m.f. when n is large, p is
small and λ = np
pX(k) = e^(−λ) λ^k / k!, for k = 0, 1, . . .
Properties
E[X] = λ
V [X] = λ
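The quality of the Poisson approximation to the binomial can be checked numerically; the values n = 1000 and p = 0.002 below are illustrative:

```python
from math import comb, exp, factorial

# Poisson approximation to the binomial: n large, p small, lambda = n p.
n, p = 1000, 0.002
lam = n * p   # 2.0

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k):
    return exp(-lam) * lam**k / factorial(k)

# The two p.m.f.s agree closely for small k.
for k in range(10):
    assert abs(binom_pmf(k) - poisson_pmf(k)) < 1e-3
```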
Note that P(X = x) = limδ→0 P(X ∈ [x, x + δ]) = 0
Definition
The c.d.f. is defined as FX(x) = P(X ≤ x) = ∫_{−∞}^{x} fX(t) dt.
By differentiation, fX(x) = dFX/dx (x), ∀x where fX is continuous
Graphical representation
[Figure: a p.d.f. fX(x) and the corresponding c.d.f. FX(x).]
Properties
FX (x) is a continuous function of x
FX is monotonically nondecreasing
limx→−∞ FX (x) = 0
limx→+∞ FX (x) = 1
Since including or excluding an endpoint of an interval has no
effect on an integral,
P(a ≤ X ≤ b) = ∫_a^b fX(x) dx
= P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b)
= FX(b) − FX(a)
Change of variable
If Y = g(X) with g strictly monotonic and inverse h = g⁻¹, then
fY(y) = fX(h(y)) |dh/dy (y)|
Expectation
Definition
The expectation of a r.v. X is defined as E[X] = ∫_{−∞}^{+∞} x fX(x) dx.
Interpretation
E(X), as the center of gravity of the p.d.f., represents the mean
value of X in a probabilistic sense.
It can also be seen as the anticipated average value of X in a large
number of repeated independent experiments
Expectation
Properties
E[g(X)] = ∫_{−∞}^{+∞} g(x) fX(x) dx
Given two scalars a and b, E[aX + b] = aE[X] + b
Variance
Definition
The variance of a r.v. X is defined as
V[X] = E[(X − E[X])²] = ∫_{−∞}^{+∞} (x − E[X])² fX(x) dx.
The square root of the variance, σX, is called the standard
deviation.
Interpretation
The standard deviation, σX , provides a measure of the dispersion
of X around its mean.
Variance
Properties
V [X] = E[X 2 ] − (E[X])2
Given two scalars a and b, V [aX + b] = a2 V [X]
Definition
Let X be a random variable with expectation E[X] and standard
deviation σX > 0, the standardized random variable is defined as
X* = (X − E[X]) / σX
Properties
E[X*] = (E[X] − E[X]) / σX = 0
V[X*] = V[X − E[X]] / σX² = V[X] / V[X] = 1
Definition
Two random variables X and Y are jointly continuous if there is a
function fX,Y, called the joint probability density function (joint
p.d.f.) of X and Y, verifying
(Nonnegativity) fX,Y(x, y) ≥ 0, ∀(x, y)
(Normalization) ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} fX,Y(x, y) dx dy = 1
such that ∀B ⊂ R², P((X, Y) ∈ B) = ∫∫_{(x,y)∈B} fX,Y(x, y) dx dy,
where all integrals are assumed well-defined in the Riemann sense.
Marginal p.d.f.
fX(x) = ∫_{−∞}^{+∞} fX,Y(x, y) dy and fY(y) = ∫_{−∞}^{+∞} fX,Y(x, y) dx
Definition
The joint c.d.f of two random variables X and Y is defined as
FX,Y(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fX,Y(u, v) du dv
fX,Y(x, y) = ∂²FX,Y/∂x∂y (x, y)
Change of variable
Given the scalars a, b and c, if g is linear, Z = g(X, Y) = aX + bY + c, then
E[Z] = aE[X] + bE[Y] + c
Definition
Two continuous r.v.s are independent if fX,Y(x, y) = fX(x) fY(y), ∀(x, y)
Definitions
Let X and Y be two continuous r.v.s, their covariance is defined as
cov(X, Y) = E[(X − E[X])(Y − E[Y])]
and their correlation coefficient as
ρ(X, Y) = cov(X, Y) / (σX σY)
Properties
X and Y are independent implies cov(X, Y ) = 0
The converse is false
Definition
If X ∼ U([a, b]), then
fX(x) =
  1/(b − a), if a ≤ x ≤ b
  0, otherwise
FX(x) =
  0, if x < a
  (x − a)/(b − a), if a ≤ x ≤ b
  1, if x > b
Properties
E[X] = (a + b)/2
V[X] = (b − a)²/12
Definition
If X ∼ E(λ), then
fX(x) =
  λe^(−λx), if x ≥ 0
  0, otherwise
FX(x) =
  1 − e^(−λx), if x ≥ 0
  0, otherwise
Properties
E[X] = 1/λ
V[X] = 1/λ²
Definition
If X ∼ N (0, 1) (standard normal r.v.), then
fX(x) = (1/√(2π)) e^(−x²/2)
FX(x) = Φ(x) = ∫_{−∞}^{x} (1/√(2π)) e^(−t²/2) dt
Properties
E[X] = 0
V [X] = 1
Definition
If X ∼ N (m, σ 2 ), then
fX(x) = (1/(σ√(2π))) e^(−(x−m)²/(2σ²))
FX(x) = Φ((x − m)/σ)
Properties
E[X] = m
V [X] = σ 2
4 Limit theorems
Chebyshev inequality
Law of large numbers
Central limit theorem
Chebyshev inequality
Theorem
Let X be a r.v. with finite mean µ and variance σ 2 , then
P(|X − µ| ≥ ε) ≤ σ²/ε², ∀ε > 0
Proof
σ² = ∫_{−∞}^{+∞} (x − µ)² fX(x) dx ≥ ∫_{|x−µ|≥ε} (x − µ)² fX(x) dx
≥ ε² ∫_{|x−µ|≥ε} fX(x) dx = ε² P(|X − µ| ≥ ε)
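The bound can be checked empirically. The sketch below uses a uniform r.v. on [0, 1] (µ = 1/2, σ² = 1/12); the distribution and ε values are illustrative:

```python
import random
from statistics import mean

# Empirical check of the Chebyshev bound for X ~ U([0, 1]):
# mu = 1/2, sigma^2 = 1/12.
random.seed(0)
mu, var = 0.5, 1 / 12
samples = [random.random() for _ in range(100_000)]

for eps in (0.3, 0.4, 0.45):
    # Empirical P(|X - mu| >= eps)
    freq = mean(abs(x - mu) >= eps for x in samples)
    assert freq <= var / eps**2   # Chebyshev bound holds
```

The bound is loose for the uniform distribution (e.g. for ε = 0.3 the true probability is 0.4 while the bound is about 0.93), which is typical of Chebyshev's inequality.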
Theorem
Let X1, X2, . . . , Xn be independent and identically distributed
(i.i.d.) r.v.s with mean µ and variance σ², then ∀ε > 0
lim_{n→+∞} P(|(X1 + X2 + · · · + Xn)/n − µ| ≥ ε) = 0
[Figure: p.d.f. fX(x) plotted over −5 ≤ x ≤ 5.]
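A short simulation illustrates the weak law of large numbers; the uniform distribution, sample sizes and ε below are illustrative choices:

```python
import random

# Weak law of large numbers: the sample mean of i.i.d. U([0, 1]) r.v.s
# (mu = 1/2) concentrates around mu as n grows.
random.seed(0)
eps = 0.01

def freq_far_from_mu(n, trials=200):
    """Empirical P(|sample mean - mu| >= eps) for sample size n."""
    count = 0
    for _ in range(trials):
        m = sum(random.random() for _ in range(n)) / n
        count += abs(m - 0.5) >= eps
    return count / trials

# The probability of a deviation of at least eps shrinks as n grows.
assert freq_far_from_mu(10_000) < freq_far_from_mu(100)
```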
Theorem
Let X1, X2, . . . , Xn be i.i.d. r.v.s with mean µ and variance σ²,
and define the standardized sum
Zn = (X1 + X2 + · · · + Xn − nµ) / (σ√n), with E[Zn] = 0 and V[Zn] = 1,
then lim_{n→+∞} P(Zn ≤ x) = Φ(x)
Application
[Figure: histogram over [0, 25] illustrating an application of the central limit theorem.]
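The convergence of the standardized sum to Φ can be observed by simulation; the uniform summands and the values of n and the number of trials are illustrative:

```python
import random
from math import erf, sqrt

# Central limit theorem: the standardized sum of i.i.d. U([0, 1]) r.v.s
# (mu = 1/2, sigma^2 = 1/12) is approximately standard normal.
random.seed(0)
n, trials = 50, 20_000
mu, sigma = 0.5, sqrt(1 / 12)

def z_n():
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / (sigma * sqrt(n))

def phi(x):
    """Standard normal c.d.f. Phi(x)."""
    return 0.5 * (1 + erf(x / sqrt(2)))

zs = [z_n() for _ in range(trials)]
for x in (-1.0, 0.0, 1.0):
    empirical = sum(z <= x for z in zs) / trials
    assert abs(empirical - phi(x)) < 0.02   # P(Z_n <= x) ≈ Φ(x)
```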
Part II
Basic concepts
[Figure: sample paths x2(t), . . . , xn(t) of a random process, each observed at time tk.]
Basic concepts
Definition of a point
A point is a discrete event that occurs in continuous time (or
space).
5 Poisson process
Definition
A temporal point process {Xk , k ≥ 1} is called a Poisson process
with rate λ if it has the following properties
Time homogeneity: The p.m.f. of the number of arrivals Nτ
over any interval of length τ is pNτ(k) = e^(−λτ) (λτ)^k / k!, ∀k ≥ 0
Independence: The number of arrivals during a particular
interval is independent of the history of arrivals outside this
interval
Interpretation
Time homogeneity: Arrivals are equally likely at all times
Independence: The conditional probability of k arrivals during
[t, t′], given the occurrence of n arrivals at times outside [t, t′],
is equal to the unconditional probability
Rate λ: Average number of arrivals per unit time
Interarrival times
Definition
The k-th interarrival time is the r.v. defined by T1 = X1 ,
Tk = Xk − Xk−1 , k = 2, 3, . . .
Note that Xk = T1 + T2 + · · · + Tk
Summary
Interarrival times T1 , T2 . . . , are independent r.v.s
The p.d.f. of each interarrival time is E(λ)
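This characterization suggests a direct simulation: draw i.i.d. E(λ) interarrival times and count the arrivals falling in an interval of length τ. The rate and horizon below are illustrative:

```python
import random
from statistics import mean

# A Poisson process of rate lam simulated through its i.i.d. E(lam)
# interarrival times: T_k ~ E(lam), X_k = T_1 + ... + T_k.
random.seed(0)
lam, tau = 2.0, 5.0

def arrivals_in(tau):
    """Number of arrivals N_tau in an interval of length tau."""
    t, n = 0.0, 0
    while True:
        t += random.expovariate(lam)   # next interarrival time ~ E(lam)
        if t > tau:
            return n
        n += 1

counts = [arrivals_in(tau) for _ in range(20_000)]
# N_tau ~ Poisson(lam * tau), so E[N_tau] = lam * tau = 10.
assert abs(mean(counts) - lam * tau) < 0.2
```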
Definition
A random process {Xn, n ∈ N}, taking values in the set of states
S = {s1, s2, . . . , sm}, is a time-homogeneous Markov chain if
(Markov property):
P(Xn+1 = sj | Xn = si, Xn−1 = si_{n−1}, . . . , X0 = si_0) =
P(Xn+1 = sj | Xn = si), ∀si_{n−1}, . . . , si_0 ∈ S
(Time homogeneity): The transition probabilities defined by
pij = P(Xn+1 = sj | Xn = si) ≥ 0 are independent of n
and satisfy Σ_{j=1}^m pij = 1
The transition matrix is P = [pij], 1 ≤ i ≤ m, 1 ≤ j ≤ m
Transition diagram
A time homogeneous Markov chain can be represented by a graph,
whose nodes are the states (i.e. the elements of S)
whose arcs are the allowed state transitions, labeled with the
corresponding transition probabilities
Probability of a path
Definition
Given that we are in state si , the probability that we are in state sj
in two steps from now is
pij^(2) = P(X2 = sj | X0 = si) = Σ_{k=1}^m pik pkj
Derivation
From the total probability theorem
pij^(2) = P(X2 = sj | X0 = si) = Σ_{k=1}^m P(X2 = sj, X1 = sk | X0 = si)
= Σ_{k=1}^m P(X2 = sj | X1 = sk, X0 = si) P(X1 = sk | X0 = si)
= Σ_{k=1}^m pik pkj
Derivation
[Figure: paths from state si at time 0, through each intermediate state sk at time 1 (with probability pik), to state sj at time 2 (with probability pkj).]
Chapman-Kolmogorov equation
Given that we are in state si, the probability that we are in state sj
in n steps from now is by definition
pij^(n) = P(Xn = sj | X0 = si),
and satisfies pij^(n) = Σ_{k=1}^m pik^(n−1) pkj, i.e. pij^(n) = [P^n]ij
Proof
By induction
Result
Let u(n) = [P (Xn = s1 ), . . . , P (Xn = sm )] be the probability
vector at instant n, then
u(n) = u(0) Pn
Proof
Using the total probability theorem
P(Xn = sj) = Σ_{i=1}^m P(Xn = sj, X0 = si)
= Σ_{i=1}^m P(X0 = si) P(Xn = sj | X0 = si)
= Σ_{i=1}^m P(X0 = si) pij^(n)
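The recursion u(n) = u(n−1) P can be iterated directly. The 2-state chain below is an illustrative example, not from the slides:

```python
# Evolution of the state-probability vector of a time-homogeneous Markov
# chain: u(n) = u(0) P^n, computed by repeated vector-matrix products.

def vec_mat(u, P):
    """Row vector times matrix."""
    m = len(P)
    return [sum(u[i] * P[i][j] for i in range(m)) for j in range(m)]

P = [[0.9, 0.1],
     [0.5, 0.5]]           # each row sums to 1
u = [1.0, 0.0]             # u(0): start in state s1

for _ in range(100):       # u(n) = u(n-1) P
    u = vec_mat(u, P)

# Rows of P sum to one, so u(n) remains a probability vector.
assert abs(sum(u) - 1.0) < 1e-9
# This chain is regular, so u(n) converges to the stationary vector pi = pi P.
pi = vec_mat(u, P)
assert all(abs(a - b) < 1e-9 for a, b in zip(u, pi))
```

For this chain the stationary vector is π = (5/6, 1/6), obtained by solving π = πP with π1 + π2 = 1.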
Definition
A Markov chain is said to be regular if some power of the transition
matrix has only (strictly) positive elements.
Interpretation
For some n, it is possible to go from any state to any state in
exactly n steps.
Birth-death processes
Definition
A birth-death process is a particular Markov process, where only
self-transitions or transitions to neighboring states are allowed.
Therefore, the transition probabilities (pij ≠ 0 iff
j ∈ {i − 1, i, i + 1}) are completely parameterized by
bi = P(Xn+1 = si+1 | Xn = si) (birth probability at state si)
di = P(Xn+1 = si−1 | Xn = si) (death probability at state si)
Applications
Modeling of queueing problems in telecommunications
Modeling of disease evolution
Birth-death processes
Transition diagram
[Figure: chain s0 ↔ s1 ↔ · · · ↔ sm with birth probabilities b0, . . . , bm−1, death probabilities d1, . . . , dm, and self-loop probabilities 1 − b0, 1 − bi − di and 1 − dm.]
Birth-death processes
P =
  [ 1−b0   b0        0         · · ·  0             0
    d1     1−b1−d1   b1        · · ·  0             0
    0      d2        1−b2−d2   · · ·  0             0
    ⋮      ⋮         ⋱          ⋱     ⋮             ⋮
    0      0         0         · · ·  1−bm−1−dm−1   bm−1
    0      0         0         · · ·  dm            1−dm ]
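This tridiagonal structure is easy to build and sanity-check in code; the probabilities below are illustrative values:

```python
# Building the tridiagonal transition matrix of a birth-death process from
# birth probabilities b_i and death probabilities d_i.
m = 4
b = [0.3, 0.3, 0.3, 0.3]        # b_0 ... b_{m-1}
d = [0.2, 0.2, 0.2, 0.2]        # d_1 ... d_m

def birth_death_matrix(m, b, d):
    P = [[0.0] * (m + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        bi = b[i] if i < m else 0.0           # no birth from the last state
        di = d[i - 1] if i > 0 else 0.0       # no death from the first state
        if i < m:
            P[i][i + 1] = bi                  # birth: s_i -> s_{i+1}
        if i > 0:
            P[i][i - 1] = di                  # death: s_i -> s_{i-1}
        P[i][i] = 1.0 - bi - di               # self-transition
    return P

P = birth_death_matrix(m, b, d)
# Every row of a transition matrix sums to 1.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)
# Only self-transitions and moves to neighboring states are allowed.
assert all(P[i][j] == 0.0 for i in range(m + 1)
           for j in range(m + 1) if abs(i - j) > 1)
```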
Part III
Estimation Theory
Motivation...
General definitions
Examples
Mean estimator: M̂ = Σ_{i=1}^N Xi / N
Variance estimator: V̂ = Σ_{i=1}^N (Xi − M̂)² / N
Estimator’s p.d.f.
Desirable properties
For E[Θ̂]: E[Θ̂] = θ (unbiasedness)
For V [Θ̂]: small (good estimation accuracy)
[Figure: p.d.f. of the estimator Θ̂, centered at E[Θ̂] with spread √V[Θ̂].]
Examples
Mean estimate: m̂ = Σ_{i=1}^N xi / N
Variance estimate: v̂ = Σ_{i=1}^N (xi − m̂)² / N
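These two estimates can be computed on simulated data; the Gaussian distribution N(3, 4) and sample size below are illustrative:

```python
import random

# Mean and variance estimates from an observed sample x_1, ..., x_N,
# computed on simulated N(3, 4) data (m = 3, sigma^2 = 4, sigma = 2).
random.seed(0)
x = [random.gauss(3.0, 2.0) for _ in range(100_000)]
N = len(x)

m_hat = sum(x) / N                              # mean estimate
v_hat = sum((xi - m_hat) ** 2 for xi in x) / N  # variance estimate (1/N form)

assert abs(m_hat - 3.0) < 0.05
assert abs(v_hat - 4.0) < 0.1
```

Note that the 1/N variance estimator above is slightly biased; this is precisely the kind of property the bias and MSE definitions of the next section quantify.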
Classical estimation
Principle: The unknown parameter θ is considered as a
deterministic constant
We know: fX (x; θ), the joint p.d.f. of X, parameterized by
the unknown constant θ
Bayesian estimation
Principle: The unknown parameter θ is modeled as a random
variable Θ
We know: the joint p.d.f. fX,Θ (x, θ) as the product of
1 A prior distribution fΘ (θ) for the unknown random variable Θ
2 A conditional distribution fX|Θ (x|θ) for the random process X
7 Classical Estimation
Performance measures
Definitions
Bias of the estimator: b(Θ̂) = E[Θ̂] − θ
Variance of the estimator: V(Θ̂) = E[(Θ̂ − E[Θ̂])²]
Mean square error (MSE): mse(Θ̂) = E[(Θ̂ − θ)²]
Property
mse(Θ̂) = E[((Θ̂ − E[Θ̂]) + (E[Θ̂] − θ))²] = V(Θ̂) + b(Θ̂)²
Performance criteria
Assume only three unbiased estimators, Θ̂1 , Θ̂2 and Θ̂3 exist
Left-hand side: Θ̂1 is the MVUE
Right-hand side: MVUE does not exist, since no unbiased
estimator has minimum variance ∀θ
[Figure: left, V[Θ̂1] ≤ V[Θ̂2], V[Θ̂3] for every θ, so Θ̂1 is the MVUE; right, no single estimator has minimum variance for every θ, so the MVUE does not exist.]
Efficient Estimator
An unbiased estimator attaining the Cramér-Rao lower bound (CRLB) ∀θ is said to be efficient
Assume only three unbiased estimators, Θ̂1 , Θ̂2 and Θ̂3 exist
Left-hand side: the MVUE Θ̂1 exists, but is not efficient
Right-hand side: the MVUE Θ̂1 exists and is efficient
[Figure: left, Θ̂1 has minimum variance among the three but V[Θ̂1] stays above the CRLB, so the MVUE Θ̂1 is not efficient; right, V[Θ̂1] = CRLB for every θ, so the MVUE Θ̂1 is efficient.]
Motivation
MVUE may not even exist
Even if MVUE exists, it may not be found using the CRLB
theorem
Suboptimal BLUE estimator: find the minimum variance
unbiased estimator within the restricted class of linear estimators
[Figure: left, the BLUE is the best estimator within the class of linear estimators, while the MVUE may lie in the class of nonlinear estimators; right, when the MVUE is itself linear, BLUE = MVUE.]
Motivation
MVUE may be difficult to find or not even exist
BLUE always exists, but sometimes a linear estimator is
totally inappropriate (ex: if we force the estimator of the
variance to be a linear function of X!)
MLE Principle: what is the value of θ under which the
observations we have actually seen are most likely?
Definitions
Likelihood function: the joint p.d.f. fX (x; θ) considered as a
numerical function of θ, by setting x as the actually observed
data
Log-likelihood function: the logarithm of the previous function
Problem
Customer arrival times Yi at a restaurant form a Poisson
process with rate θ
Interarrival times Xi = Yi − Yi−1 ∼ E(θ) (with the convention
Y0 = 0) are known to be independent r.v.s
We collect the set of observations x = (x1 , . . . , xN )T
MLE estimator of θ
Likelihood function: fX(x; θ) = ∏_{i=1}^N θ e^(−θxi)
Log-likelihood function: log fX(x; θ) = N log θ − θ Σ_{i=1}^N xi
Derivative of the log-likelihood function:
∂ log fX(x; θ)/∂θ = N/θ − Σ_{i=1}^N xi
Maximum-likelihood estimate: θ̂ = (Σ_{i=1}^N xi / N)⁻¹ = N / Σ_{i=1}^N xi
(inverse of the sample mean of the interarrival times)
Maximum-likelihood estimator: Θ̂ = (Σ_{i=1}^N Xi / N)⁻¹
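The estimator derived above can be tried on simulated interarrival times; the true rate θ = 2 and sample size are illustrative:

```python
import random

# ML estimate of the rate theta of exponential interarrival times:
# theta_hat is the inverse of the sample mean, as derived above.
random.seed(0)
theta = 2.0
x = [random.expovariate(theta) for _ in range(100_000)]

theta_hat = len(x) / sum(x)   # N / sum(x_i) = 1 / sample mean
assert abs(theta_hat - theta) < 0.05
```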
8 Bayesian Estimation
Bayesian philosophy
Principle
The unknown parameter θ is modeled as a random variable Θ
Assumption
We know the joint p.d.f. fX,Θ(x, θ) as the product of
1 A prior distribution fΘ(θ) for the unknown random variable Θ
2 A conditional distribution fX|Θ(x|θ) for the random process X
Cost functions
[Figure: typical cost functions C(θ − θ̂): quadratic, absolute-value and uniform (hit-or-miss).]
Bayes risk
Definition
R(Θ̂) = E[C(Θ − Θ̂)], which depends on the choice of the
estimator Θ̂
Expression
Since C(Θ − Θ̂) depends on both r.v.s Θ and Θ̂, its
expectation is with respect to the joint p.d.f. fX,Θ (x, θ)
We obtain R(Θ̂) = ∫∫ C(θ − θ̂) fX,Θ(x, θ) dθ dx
Estimator construction
General case
Solve ∂g(θ̂)/∂θ̂ = 0, where g(θ̂) = ∫ C(θ − θ̂) fΘ|X(θ|x) dθ
[Figure: posterior p.d.f. fΘ|X(θ|x) and the estimate minimizing each cost function: the posterior mean (quadratic cost), the median (absolute-value cost) and the mode (uniform cost).]
End