
Introduction to Probability and Estimation

Theory

Frédéric Lehmann

Télécom SudParis, Institut Mines-Télécom, IP Paris


Département CITI (Communications, Images et Traitement de l’Information)
Laboratoire SAMOVAR

September, 2024

Frédéric Lehmann Introduction to Probability and Estimation Theory 1 / 161


Acronyms

r.v. Random variable


p.m.f. Probability mass function
p.d.f. Probability density function
c.d.f. Cumulative distribution function
a.s. Almost surely
ML Maximum-Likelihood
MSE Mean Squared Error
X ∼ ... r.v. X is distributed according to . . .
B(1, p) Bernoulli law of parameter p
B(n, p) Binomial law of size n and parameter p
P(λ) Poisson law of parameter λ
U ([a, b]) Uniform law over [a, b]
E(λ) Exponential law of parameter λ
N (m, σ²) Gaussian law of mean m and variance σ²



Motivation...

A useful concept: to model uncertain situations


In science and engineering: noisy experimental data
In biology: a drug can work on a patient or not
In management: is a new activity likely to be profitable?
With several interpretations:
Frequency of occurrence of an event
Our belief about the likelihood of an event



Outline of Part I: Introduction to Probability
1 Basic concepts
Sets
Probabilistic models
Conditional probability
Independence
2 Discrete random variables
Probability mass function (p.m.f.)
Expectation and variance
Joint probability mass function of multiple random variables
Results for some useful distributions
3 Continuous random variables
Probability density function (p.d.f.)
Expectation and variance
Joint probability distribution function of multiple random
variables
Results for some useful distributions

4 Limit theorems
Chebyshev inequality
Law of large numbers
Central limit theorem



Outline of Part II: Introduction to Random Processes

5 Poisson process

6 Discrete-time Markov processes



Outline of Part III: Introduction to Estimation Theory

7 Classical Estimation

8 Bayesian Estimation



Basic concepts
Discrete random variables
Continuous random variables
Limit theorems

Part I

Introduction to Probability


1 Basic concepts
Sets
Probabilistic models
Conditional probability
Independence



Sets

Definition
A set S is a collection of objects, called the elements of S.

Specification
A set with a finite number of elements can be written:
S = {x1 , . . . , xn }.
Example: the set of possible outcomes of a die roll is S = {1, 2, 3, 4, 5, 6}
A set S is countably infinite if its infinitely many elements can be
enumerated in a list, S = {x1 , x2 , . . . }
Example: the set of non-negative integers, N
Otherwise the set is uncountable
Example: the set of real numbers, R or the interval [0, 1]


Set operations

Definitions
Universal set: Ω, the set containing all possible objects in the
context of interest
Empty set: ∅, the set containing no object
Complement of the set S: S^c = {x ∈ Ω | x ∉ S}
Union of two sets S and T : S ∪ T = {x|x ∈ S or x ∈ T }
Intersection of two sets S and T :
S ∩ T = {x|x ∈ S and x ∈ T }
A collection of disjoint sets: their intersection is ∅
A collection of sets is a partition of S: the sets in the
collection are disjoint and their union is S


The algebra of sets

Properties
Commutative property: S ∪ T = T ∪ S and S ∩ T = T ∩ S,
Associative property: S ∪ (T ∪ U ) = (S ∪ T ) ∪ U and
S ∩ (T ∩ U ) = (S ∩ T ) ∩ U ,
Distributive law: S ∩ (T ∪ U ) = (S ∩ T ) ∪ (S ∩ U ) and
S ∪ (T ∩ U ) = (S ∪ T ) ∩ (S ∪ U )
S ∪ ∅ = S and S ∩ ∅ = ∅
S ∪ Ω = Ω and S ∩ Ω = S
De Morgan’s laws: (∪_n S_n)^c = ∩_n S_n^c and (∩_n S_n)^c = ∪_n S_n^c
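These identities can be checked exhaustively on small finite sets. A minimal Python sketch (the universal set and the collection S_n are arbitrary illustrative choices):

```python
# Exhaustive check of De Morgan's laws on small finite sets.
Omega = set(range(10))                      # universal set (arbitrary choice)
S = [{1, 2, 3}, {2, 4, 6}, {3, 6, 9}]       # a collection of sets S_n

def complement(A):
    return Omega - A

union_all = set().union(*S)                 # union of the S_n
inter_all = set.intersection(*S)            # intersection of the S_n

# (union S_n)^c = intersection of S_n^c, and dually
assert complement(union_all) == set.intersection(*[complement(A) for A in S])
assert complement(inter_all) == set().union(*[complement(A) for A in S])
print("De Morgan's laws verified on this example")
```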



Vocabulary of probability theory

Consider a random experiment, whose outcomes are not deterministic (i.e. unpredictable)
Definitions
Sample space: Ω, the set containing all possible outcomes of
a given experiment
Event: subset of the sample space Ω

Example: the experiment is a single coin toss,
with outcomes head (H) or tail (T)
Sample space: Ω = {H, T }
Events: ∅, {H}, {T }, Ω


Intuitive definition of probability


Repeat a random experiment independently n times
Frequentist approach
Hypothesis on the sample space: finite number of equally
likely outcomes
Relative frequency of occurrence of event A:
f_n(A) = (number of occurrences of A) / n
Probability of event A: P(A) = lim_{n→∞} f_n(A)

Criticism
Finite number of equally likely outcomes: not always true
Uses a law of large numbers argument: not yet demonstrated
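Despite these caveats, the frequentist picture is easy to simulate. A short Python sketch (the success probability, seed and sample sizes are arbitrary choices): the relative frequency f_n(A) stabilizes around P(A) as n grows.

```python
import random

random.seed(0)  # fixed seed, for reproducibility

def relative_frequency(n, p=0.5):
    """Fraction of occurrences of A among n independent trials,
    where A occurs with probability p on each trial."""
    hits = sum(random.random() < p for _ in range(n))
    return hits / n

# f_n(A) approaches P(A) = 0.5 as n grows
for n in (100, 10_000, 1_000_000):
    print(n, relative_frequency(n))
```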


Axiomatic definition of probability

A probability law P is a function defined on the algebra of sets,
taking values in [0, 1], satisfying the following axioms:
Kolmogorov approach
Axiom 1 (Nonnegativity): for any event A, P (A) ≥ 0
Axiom 2 (Normalization): P (Ω) = 1
Axiom 3 (Additivity): Let A1 , A2 , . . . be a sequence of
mutually disjoint events (i.e. A_i ∩ A_j = ∅, ∀i ≠ j), then
P(∪_i A_i) = Σ_i P(A_i)


Probability law

[Diagram: an experiment generates outcomes in the sample space; the
probability law assigns probabilities P(A), P(B), . . . to events A, B, . . .]


Probability law

Important properties
Property 1: P(A^c) = 1 − P(A)
Property 2: If A ⊂ B, then P (A) ≤ P (B)
Property 3: P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
Property 4: P (A ∪ B) ≤ P (A) + P (B)
Property 5: Let A = A1 ∪ · · · ∪ An be a union of mutually disjoint
events, P (A) = P (A1 ) + P (A2 ) + · · · + P (An )

Example: intuitive definition of probability


Sample space: Ω = {x1 , . . . , xn }, where P ({xi }) = 1/n, ∀i
Using property 5: P(A) = (number of elements of A) / n
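Properties 1 and 3, together with the equally-likely formula, can be verified mechanically for a fair die (the events A and B below are arbitrary illustrative choices):

```python
from fractions import Fraction

Omega = {1, 2, 3, 4, 5, 6}        # fair die: equally likely outcomes

def P(event):
    # equally likely outcomes: P(A) = |A| / |Omega|
    return Fraction(len(event), len(Omega))

A = {1, 2, 3, 4}                  # "number < 5" (illustrative)
B = {1, 3, 5}                     # "number is odd" (illustrative)

assert P(Omega - A) == 1 - P(A)                  # Property 1
assert P(A | B) == P(A) + P(B) - P(A & B)        # Property 3
print(P(A | B))
```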



Conditional probability

Definition
Consider two events A and B such that: P (B) > 0
Conditional probability of A given B: P(A|B) = P(A ∩ B) / P(B)

Interpretation in terms of the intuitive definition of probability


Consider the sample space: Ω = {x1 , . . . , xn }, where
P ({xi }) = 1/n, ∀i
Interpretation: probability law on a new universe B, whose
elements are also equally likely
Using property 5: P(A|B) = (number of elements of A ∩ B) / (number of elements of B)


Conditional probability example

A die-roll problem

Experiment: outcome of a die roll
Sample space: Ω = {1, 2, 3, 4, 5, 6}
Events: A = {number < 4}, B = {number is odd}
Calculate: P (A|B)

Solution
Intuitive approach: |A ∩ B| = 2, |B| = 3, thus
P(A|B) = |A ∩ B| / |B| = 2/3
Rigorous approach: P(B) = |B|/|Ω| = 3/6 = 1/2,
P(A ∩ B) = |A ∩ B|/|Ω| = 2/6 = 1/3, thus
P(A|B) = P(A ∩ B) / P(B) = (1/3)/(1/2) = 2/3
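The two approaches agree, and the computation can be checked by enumeration; a minimal Python sketch:

```python
from fractions import Fraction

Omega = {1, 2, 3, 4, 5, 6}
A = {x for x in Omega if x < 4}        # {1, 2, 3}
B = {x for x in Omega if x % 2 == 1}   # {1, 3, 5}

def P(event):
    return Fraction(len(event), len(Omega))

P_A_given_B = P(A & B) / P(B)          # P(A|B) = P(A ∩ B) / P(B)
print(P_A_given_B)   # 2/3
```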


Conditional probabilities specify a probability law

All axioms are satisfied


Axiom 1: ∀A, A ∩ B ⊂ B, thus 0 ≤ P (A ∩ B) ≤ P (B) and
P (A|B) ∈ [0, 1]
Axiom 2: P(Ω|B) = P(Ω ∩ B) / P(B) = P(B) / P(B) = 1
Axiom 3: Let A1 , A2 , . . . be a sequence of mutually disjoint
events.
Using the distributive law,
P(∪_i A_i | B) = P[(∪_i A_i) ∩ B] / P(B) = P[∪_i (A_i ∩ B)] / P(B)
Now using the fact that (A_i ∩ B) ∩ (A_j ∩ B) = ∅, ∀i ≠ j,
P(∪_i A_i | B) = Σ_i P(A_i ∩ B) / P(B) = Σ_i P(A_i |B)


Properties of conditional probabilities

Multiplicative rule
Assumption: All conditioning events have positive probability
Probability of an intersection of events:
P(∩_{i=1}^n A_i) = P(A1) P(A2|A1) P(A3|A1 ∩ A2) · · · P(An | ∩_{i=1}^{n−1} A_i)


Properties of conditional probabilities

Total probability theorem


Assumption 1: A1 , A2 , . . . , An are mutually disjoint events,
that form a partition of the sample space
Assumption 2: P (Ai ) > 0, ∀i
Result
P (B) = P (A1 ∩ B) + P (A2 ∩ B) + · · · + P (An ∩ B)
= P (B|A1 )P (A1 ) + P (B|A2 )P (A2 ) + · · · + P (B|An )P (An )
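A sketch of the theorem on a made-up two-machine example (all numbers are illustrative): a part comes from machine A1 or A2, which form a partition of Ω, and each machine has its own defect rate.

```python
from fractions import Fraction

# Partition of the sample space: the part comes from machine A1 or A2
P_A = {"A1": Fraction(6, 10), "A2": Fraction(4, 10)}
# Conditional probabilities of a defect (event B) given the machine
P_B_given_A = {"A1": Fraction(1, 100), "A2": Fraction(5, 100)}

# Total probability theorem: P(B) = sum_i P(B|A_i) P(A_i)
P_B = sum(P_B_given_A[a] * P_A[a] for a in P_A)
print(P_B)   # 13/500
```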



Independence of two events

Definition
Two events A and B are independent if P (A ∩ B) = P (A)P (B)
If in addition P (B) > 0, this is equivalent to P (A|B) = P (A)

Interpretation
In a probabilistic sense: the occurrence of B does not alter P (A)
Pay attention to the intuitive sense: a common misconception is that A
and B, with P(A) > 0 and P(B) > 0, are independent if they are
disjoint.
In fact the opposite holds: disjoint events with positive probability
are always dependent, since P(A ∩ B) = 0 ≠ P(A)P(B)
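Both points can be checked by enumeration on two independent fair coin tosses; a minimal sketch:

```python
from fractions import Fraction

Omega = {(a, b) for a in "HT" for b in "HT"}   # two fair coin tosses

def P(event):
    return Fraction(len(event), len(Omega))

A = {w for w in Omega if w[0] == "H"}   # first toss is heads
B = {w for w in Omega if w[1] == "H"}   # second toss is heads
assert P(A & B) == P(A) * P(B)          # A and B are independent

C = {("T", "T")}                        # disjoint from A, with P(C) > 0
assert P(A & C) != P(A) * P(C)          # disjoint here implies dependent
print("independence checks passed")
```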


Independence of several events

Definition
A1 , A2 , . . . , An are independent if
P(∩_{i∈S} A_i) = Π_{i∈S} P(A_i), for any subset S of {1, 2, . . . , n}

Link with pairwise independence


Independence implies pairwise independence, i.e.
P(A_i ∩ A_j) = P(A_i)P(A_j), ∀i ≠ j
The converse is false


Useful counting results

Consider two integers, n, k, with k ≤ n


Number of permutations of n objects: n!
Number of k-permutations of n objects: n!/(n − k)!
Combinations of k out of n objects: (n choose k) = n!/(k!(n − k)!)
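Python’s standard library exposes these counts directly (math.perm and math.comb, available since Python 3.8); a quick check for n = 5, k = 2:

```python
import math

n, k = 5, 2
# permutations of n objects: n!
assert math.factorial(n) == 120
# k-permutations of n objects: n! / (n - k)!
assert math.perm(n, k) == math.factorial(n) // math.factorial(n - k)
# combinations of k out of n objects: n! / (k! (n - k)!)
assert math.comb(n, k) == math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
print(math.perm(n, k), math.comb(n, k))   # 20 10
```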



2 Discrete random variables


Probability mass function (p.m.f.)
Expectation and variance
Joint probability mass function of multiple random variables
Results for some useful distributions


Random variable

Random variable (r.v.)


A random variable is a real-valued function X : Ω → R
A function of a random variable is another random variable

Visualization
[Diagram: the random variable X maps each outcome of the sample space
to a point on the real number line.]


Discrete random variable

Discrete random variable


The range of X (the set of values it can take) is either finite or countably infinite


Discrete random variable example

Experiment: Two successive and independent coin tosses


Sample space: Ω = {HH, HT, T H, T T }
Random variable X: number of heads

Tabular representation

Sample HH HT TH TT
X 2 1 1 0



Probability mass function (p.m.f.)

Definition
Let X be a discrete random variable
Let x1 ≤ x2 ≤ . . . be the possible outcomes of X in
ascending order
Let P ({X = xk }) = pX (xk ), ∀k
pX (x) is the probability mass function (p.m.f.) of X


Probability mass function (p.m.f.) example

Experiment: Two successive and independent coin tosses


Sample space: Ω = {HH, HT, T H, T T }
Using a fair coin:
P (HH) = P (HT ) = P (T H) = P (T T ) = 1/4
Random variable X: number of heads
P (X = 0) = P (T T ) = 1/4,
P (X = 1) = P (HT ∪ T H) = P (HT ) + P (T H) = 1/2,
P (X = 2) = P (HH) = 1/4
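The same p.m.f. can be obtained by enumerating the sample space; a minimal sketch:

```python
from fractions import Fraction
from itertools import product

Omega = list(product("HT", repeat=2))   # {HH, HT, TH, TT}, equally likely
p_X = {}
for outcome in Omega:
    x = outcome.count("H")              # X = number of heads
    p_X[x] = p_X.get(x, Fraction(0)) + Fraction(1, len(Omega))

print(p_X)   # p_X(0) = 1/4, p_X(1) = 1/2, p_X(2) = 1/4
```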


Probability mass function (p.m.f.): example continued

Bar diagram of the p.m.f of X, the number of heads for two


successive and independent coin tosses
[Bar diagram: p_X(0) = 1/4, p_X(1) = 1/2, p_X(2) = 1/4.]

Cumulative distribution function (c.d.f.)

Definition
The c.d.f. is defined as F_X(x) = P(X ≤ x) = Σ_{u≤x} p_X(u)

Expression when the range of X, x1 ≤ x2 ≤ · · · ≤ xn , is finite

F_X(x) = 0, for −∞ < x < x1
F_X(x) = p_X(x1), for x1 ≤ x < x2
F_X(x) = p_X(x1) + p_X(x2), for x2 ≤ x < x3
...
F_X(x) = p_X(x1) + · · · + p_X(xn) = 1, for xn ≤ x < ∞


Cumulative distribution function (c.d.f.)

Properties
FX (x) is a piecewise constant function of x
FX is monotonically nondecreasing
limx→−∞ FX (x) = 0
limx→+∞ FX (x) = 1
P (a < X ≤ b) = FX (b) − FX (a)
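These properties can be illustrated with the coin-toss p.m.f. of the previous example (values hard-coded here):

```python
from fractions import Fraction

# p.m.f. of X, the number of heads in two fair coin tosses
p_X = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def F_X(x):
    """F_X(x) = P(X <= x): sum of p_X(u) over u <= x."""
    return sum(p for u, p in p_X.items() if u <= x)

assert F_X(-1) == 0                          # limit at -infinity
assert F_X(0.5) == Fraction(1, 4)            # piecewise constant between jumps
assert F_X(2) == 1                           # limit at +infinity
assert F_X(2) - F_X(0) == Fraction(3, 4)     # P(0 < X <= 2) = F_X(2) - F_X(0)
print("c.d.f. properties verified")
```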


Cumulative distribution function (c.d.f.): example continued

c.d.f. of X, the number of heads for two successive and
independent coin tosses

[Staircase plot: F_X(x) jumps by p_X(0) = 1/4 at x = 0, by p_X(1) = 1/2
at x = 1, and by p_X(2) = 1/4 at x = 2, where it reaches 1.]

Change of variable

Functions of random variables


Let X be a random variable with p.m.f. pX (x).
Y = g(X), where g(.) is a real-valued function, is also a random
variable with p.m.f.
p_Y(y) = Σ_{x | g(x)=y} p_X(x)
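A sketch of this change of variable with g(x) = x², for an arbitrary illustrative p.m.f. on {−1, 0, 1}:

```python
from fractions import Fraction

p_X = {-1: Fraction(1, 4), 0: Fraction(1, 2), 1: Fraction(1, 4)}  # illustrative
g = lambda x: x * x

p_Y = {}
for x, p in p_X.items():
    y = g(x)
    p_Y[y] = p_Y.get(y, Fraction(0)) + p    # sum p_X(x) over {x | g(x) = y}

print(p_Y)   # p_Y(0) = 1/2, p_Y(1) = 1/2
```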



Expectation

Definition
The expectation of a r.v. X is defined as E[X] = Σ_x x p_X(x).

Interpretation
E(X), as the center of gravity of the p.m.f., represents the mean
value of X in a probabilistic sense.

Example with X, the number of heads for two successive and


independent coin tosses
E[X] = 0 × 1/4 + 1 × 1/2 + 2 × 1/4 = 1


Expectation

Properties
E[g(X)] = Σ_x g(x) p_X(x)
Given two scalars a and b, E[aX + b] = aE[X] + b


Variance

Definition
The variance of a r.v. X is defined as
V[X] = E[(X − E[X])²] = Σ_x (x − E[X])² p_X(x).
The square root of the variance, σ_X, is called the standard
deviation.

Interpretation
The standard deviation, σX , provides a measure of the dispersion
of X around its mean.

Example with X, the number of heads for two successive and


independent coin tosses
V [X] = (0 − 1)2 × 1/4 + (1 − 1)2 × 1/2 + (2 − 1)2 × 1/4 = 1/2
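Both computations, directly from the p.m.f. of the coin-toss example:

```python
from fractions import Fraction

# p.m.f. of X, the number of heads in two independent fair coin tosses
p_X = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

E = sum(x * p for x, p in p_X.items())               # E[X] = 1
V = sum((x - E) ** 2 * p for x, p in p_X.items())    # V[X] = 1/2

assert E == 1
assert V == Fraction(1, 2)
print(E, V)
```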


Variance

Properties
V[X] = E[X²] − (E[X])²
Given two scalars a and b, V[aX + b] = a²V[X]


Standardized random variable

Definition
Let X be a random variable with expectation E[X] and standard
deviation σX > 0, the standardized random variable is defined as

X* = (X − E[X]) / σ_X

Properties
E[X*] = (E[X] − E[X]) / σ_X = 0
V[X*] = V[X − E[X]] / σ_X² = V[X] / V[X] = 1



Joint p.m.f. of two random variables

Let X and Y be two discrete random variables associated to the


same experiment.
Definition
The joint p.m.f. of X and Y for a particular (x, y) is the
probability of the event {X = x, Y = y}:

pX,Y (x, y) = P (X = x, Y = y)

By analogy, this can be easily generalized for more than two


random variables.


Marginal p.m.f.

Let X and Y be two discrete random variables associated to the


same experiment.
Marginal p.m.f. of X and Y

p_X(x) = P(X = x) = Σ_y p_X,Y(x, y)
p_Y(y) = P(Y = y) = Σ_x p_X,Y(x, y)


Joint p.m.f. of two random variables

Let the range of X (resp. Y ) be {x1 , . . . , xn } (resp. {y1 , . . . , ym })


Joint p.m.f. pX,Y (x, y) in tabular form

X \Y        y1              y2              ...   ym              Row sum
x1          pX,Y(x1, y1)    pX,Y(x1, y2)    ...   pX,Y(x1, ym)    pX(x1)
x2          pX,Y(x2, y1)    pX,Y(x2, y2)    ...   pX,Y(x2, ym)    pX(x2)
...         ...             ...             ...   ...             ...
xn          pX,Y(xn, y1)    pX,Y(xn, y2)    ...   pX,Y(xn, ym)    pX(xn)
Column sum  pY(y1)          pY(y2)          ...   pY(ym)          1
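Row and column sums can be computed mechanically; a sketch with an arbitrary illustrative 2×2 joint p.m.f.:

```python
from fractions import Fraction

# Illustrative joint p.m.f. (values are arbitrary, summing to 1)
p_XY = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(2, 8), (1, 1): Fraction(2, 8),
}
xs = {x for x, _ in p_XY}
ys = {y for _, y in p_XY}

p_X = {x: sum(p_XY[(x, y)] for y in ys) for x in xs}   # row sums
p_Y = {y: sum(p_XY[(x, y)] for x in xs) for y in ys}   # column sums

assert sum(p_X.values()) == 1 and sum(p_Y.values()) == 1
print(p_X, p_Y)
```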


Joint c.d.f. of two random variables

Definition
The joint c.d.f. of two random variables X and Y is defined as
F_X,Y(x, y) = P(X ≤ x, Y ≤ y) = Σ_{u≤x} Σ_{v≤y} p_X,Y(u, v)


Change of variable

Functions of multiple random variables


Let X and Y be two random variables with joint p.m.f. pX,Y (x, y).
U = g(X, Y ) and V = h(X, Y ), where g(.) and h(.) are
real-valued functions, are also random variables with joint p.m.f
p_U,V(u, v) = Σ_{(x,y) | g(x,y)=u and h(x,y)=v} p_X,Y(x, y)


Expectation of a function of two random variables

Consider the r.v. Z = g(X, Y )


The expectation of Z is given by
E[g(X, Y)] = Σ_x Σ_y g(x, y) p_X,Y(x, y)

If g is linear, Z = aX + bY + c
Given the scalars a, b and c

E[aX + bY + c] = aE[X] + bE[Y ] + c


Independence of two random variables

Definition
Two discrete r.v.s are independent if

pX,Y (x, y) = pX (x)pY (y), ∀(x, y)

Properties of independent discrete r.v.s


E[XY ] = E[X]E[Y ]
E[g(X)h(Y )] = E[g(X)]E[h(Y )]
V [X + Y ] = V [X] + V [Y ]


Covariance and correlation coefficient

Definitions
Let X and Y be two discrete r.v.s, their covariance is defined as

cov(X, Y ) = E [(X − E[X])(Y − E[Y ])] = E[XY ] − E[X]E[Y ]

and their correlation coefficient, −1 ≤ ρ(X, Y ) ≤ 1, is defined as

ρ(X, Y) = cov(X, Y) / (σ_X σ_Y)

Properties
X and Y are independent implies cov(X, Y ) = 0
The converse is false
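That the converse is false has a standard counterexample: X uniform on {−1, 0, 1} and Y = X². A sketch:

```python
from fractions import Fraction

p_X = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}
p_XY = {(x, x * x): p for x, p in p_X.items()}   # Y = X^2

E_X = sum(x * p for x, p in p_X.items())                 # 0
E_Y = sum(y * p for (_, y), p in p_XY.items())           # 2/3
E_XY = sum(x * y * p for (x, y), p in p_XY.items())      # 0

cov = E_XY - E_X * E_Y
assert cov == 0                                          # uncorrelated...

p_Y1 = sum(p for (_, y), p in p_XY.items() if y == 1)    # P(Y = 1) = 2/3
assert p_XY[(1, 1)] != p_X[1] * p_Y1                     # ...but dependent
print("cov =", cov)
```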


Bernoulli distribution with parameter p, B(1, p)

Definition
X ∼ B(1, p) describes the success or failure in a single trial:
p_X(k) = p if k = 1, and p_X(k) = q if k = 0,
where q = 1 − p

Properties
E[X] = p
V [X] = pq


Binomial distribution of size n with parameter p, B(n, p)

Definition
X ∼ B(n, p) describes the number of successes in n independent
Bernoulli trials
 
p_X(k) = (n choose k) p^k q^(n−k), for k = 0, 1, . . . , n

where q = 1 − p

Properties
E[X] = np
V [X] = npq


Poisson distribution with parameter λ, P(λ)

Definition
X ∼ P(λ) approximates the binomial p.m.f. when n is large, p is
small and λ = np

p_X(k) = e^(−λ) λ^k / k!, for k = 0, 1, . . .

Properties
E[X] = λ
V [X] = λ
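The approximation can be checked numerically (n = 1000 and p = 0.002 are arbitrary illustrative values, giving λ = np = 2):

```python
import math

n, p = 1000, 0.002
lam = n * p   # Poisson parameter = np = 2

def binom_pmf(k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    return math.exp(-lam) * lam**k / math.factorial(k)

# For large n and small p the two p.m.f.s are close term by term
for k in range(6):
    assert abs(binom_pmf(k) - poisson_pmf(k)) < 1e-3
print([round(poisson_pmf(k), 4) for k in range(3)])
```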


3 Continuous random variables


Probability density function (p.d.f.)
Expectation and variance
Joint probability distribution function of multiple random
variables
Results for some useful distributions



Continuous random variable

Intuition: A random variable is continuous if it has a continuous
range of possible values
Definition
A random variable X is continuous if there is a function fX , called
the probability density function (p.d.f.) of X verifying
(Nonnegativity) fX (x) ≥ 0, ∀x
R +∞
(Normalization) −∞ fX (x)dx = 1
R
such that ∀B ⊂ R, P (X ∈ B) = B fX (x)dx, where for simplicity,
all integrals are assumed to be well-defined in the usual Riemann
sense.


Probability density function (p.d.f.)

Interpretation of fX (x): probability mass per unit length around x


For a small length δ,
P (X ∈ [x, x + δ]) = ∫_x^(x+δ) fX (t) dt ≈ fX (x)δ
Note that P (X = x) = limδ→0 P (X ∈ [x, x + δ]) = 0

Cumulative distribution function (c.d.f.)

Definition
The c.d.f. is defined as FX (x) = P (X ≤ x) = ∫_−∞^x fX (t) dt.
By differentiation, fX (x) = dFX /dx (x), ∀x where fX is continuous

Graphical representation
(Figure: plots of the p.d.f. fX (x) and the corresponding c.d.f. FX (x).)


Cumulative distribution function (c.d.f.)

Properties
FX (x) is a continuous function of x
FX is monotonically nondecreasing
limx→−∞ FX (x) = 0
limx→+∞ FX (x) = 1
Since including or excluding the endpoints of an interval has no
effect on an integral,
P (a ≤ X ≤ b) = ∫_a^b fX (x) dx
= P (a < X < b) = P (a ≤ X < b) = P (a < X ≤ b)
= FX (b) − FX (a)

Change of variable

Strictly monotonic functions of random variables


Let X be a random variable with p.d.f. fX (x).
Let g be a strictly monotonic function, such that for some function
h, y = g(x) if and only if x = h(y).
Assume h is differentiable, then the p.d.f. of Y = g(X) is

fY (y) = fX (h(y)) |dh/dy (y)|
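
The change-of-variable formula can be checked by simulation. In this sketch (an assumed example, not from the slides), X ∼ U([0, 1]) and g(x) = x^2, which is strictly increasing on [0, 1]; then h(y) = √y, fY (y) = 1/(2√y), and the implied c.d.f. is FY (y) = √y:

```python
import random

random.seed(0)

# Change of variable: X ~ U([0, 1]) and Y = g(X) = X**2, strictly increasing on
# [0, 1], so h(y) = sqrt(y) and f_Y(y) = f_X(sqrt(y)) * |dh/dy| = 1/(2*sqrt(y)).
# This density integrates to F_Y(y) = sqrt(y); compare with empirical frequencies.
N = 200_000
ys = [random.random() ** 2 for _ in range(N)]
emp_quarter = sum(y <= 0.25 for y in ys) / N   # theory: sqrt(0.25) = 0.5
emp_081 = sum(y <= 0.81 for y in ys) / N       # theory: sqrt(0.81) = 0.9
```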



Expectation

Definition
The expectation of a r.v. X is defined as E[X] = ∫_−∞^+∞ x fX (x) dx.

Interpretation
E(X), as the center of gravity of the p.d.f., represents the mean
value of X in a probabilistic sense.
It can also be seen as the anticipated average value of X in a large
number of repeated independent experiments


Expectation

Properties
E[g(X)] = ∫_−∞^+∞ g(x) fX (x) dx
Given two scalars a and b, E[aX + b] = aE[X] + b


Variance

Definition
The variance of a r.v. X is defined as
V [X] = E[(X − E[X])^2] = ∫_−∞^+∞ (x − E[X])^2 fX (x) dx.
The square root of the variance, σX , is called the standard
deviation.

Interpretation
The standard deviation, σX , provides a measure of the dispersion
of X around its mean.


Variance

Properties
V [X] = E[X^2] − (E[X])^2
Given two scalars a and b, V [aX + b] = a2 V [X]


Standardized random variable

Definition
Let X be a random variable with expectation E[X] and standard
deviation σX > 0, the standardized random variable is defined as

X* = (X − E[X]) / σX

Properties
E[X*] = (E[X] − E[X])/σX = 0
V [X*] = V [X − E[X]]/σX^2 = V [X]/V [X] = 1
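
A short simulation illustrating standardization (the exponential law and sample size are arbitrary choices); here the sample mean and population standard deviation play the roles of E[X] and σX:

```python
import random
import statistics

random.seed(1)

# Empirical standardization: X* = (X - E[X]) / sigma_X has mean 0 and variance 1.
# The mean and standard deviation are estimated from the sample itself, so the
# standardized sample has mean 0 and variance 1 by construction.
xs = [random.expovariate(2.0) for _ in range(100_000)]
m = statistics.fmean(xs)
s = statistics.pstdev(xs)
std = [(x - m) / s for x in xs]
m_star = statistics.fmean(std)      # ~0
v_star = statistics.pvariance(std)  # ~1
```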



Joint p.d.f. of two random variables

Definition
Two random variables X and Y are jointly continuous if there is a
function fX,Y , called the joint probability density function (joint
p.d.f.) of X and Y , verifying
(Nonnegativity) fX,Y (x, y) ≥ 0, ∀(x, y)
(Normalization) ∫_−∞^+∞ ∫_−∞^+∞ fX,Y (x, y) dx dy = 1
such that ∀B ⊂ R^2, P ((X, Y ) ∈ B) = ∫∫_{(x,y)∈B} fX,Y (x, y) dx dy,
where all integrals are assumed well-defined in the Riemann sense.


Joint p.d.f. of two random variables

Intuition: fX,Y encodes the probabilistic information about X, Y
and their dependencies
Interpretation of fX,Y (x, y): probability mass per unit area around
(x, y)
For a small length δ,
P (X ∈ [x, x + δ], Y ∈ [y, y + δ]) = ∫_x^(x+δ) ∫_y^(y+δ) fX,Y (u, v) du dv
≈ fX,Y (x, y) δ^2


Marginal p.d.f.

Marginal p.d.f. of X (similar expression for Y )
∀A ⊂ R,
P (X ∈ A and Y ∈ (−∞, +∞)) = ∫_A ∫_−∞^+∞ fX,Y (x, y) dy dx.
Since the p.d.f. of X is defined by P (X ∈ A) = ∫_A fX (x) dx,
it follows that fX (x) = ∫_−∞^+∞ fX,Y (x, y) dy.
(Similarly, fY (y) = ∫_−∞^+∞ fX,Y (x, y) dx.)


Joint c.d.f. of two random variables

Definition
The joint c.d.f of two random variables X and Y is defined as
FX,Y (x, y) = P (X ≤ x, Y ≤ y) = ∫_−∞^x ∫_−∞^y fX,Y (u, v) du dv

Conversely, by differentiation we have

fX,Y (x, y) = ∂^2 FX,Y /∂x∂y (x, y)


Change of variable

Functions of multiple random variables
Let X and Y be two jointly continuous random variables with joint
p.d.f. fX,Y (x, y).
If (U, V ) = g(X, Y ), where g is a bijective function from R^2 to R^2,
then U and V are also jointly continuous random variables with joint
p.d.f.

fU,V (u, v) = |J| fX,Y (x, y), evaluated at (x, y) = g^(−1)(u, v)

where J is the Jacobian determinant of g^(−1):

J = det [ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ], evaluated at (x, y) = g^(−1)(u, v)


Expectation of a function of two random variables

Consider the r.v. Z = g(X, Y )
The expectation of Z is given by
E[g(X, Y )] = ∫_−∞^+∞ ∫_−∞^+∞ g(x, y) fX,Y (x, y) dx dy

If g is linear, Z = aX + bY + c
Given the scalars a, b and c

E[aX + bY + c] = aE[X] + bE[Y ] + c


Independence of two random variables

Definition
Two continuous r.v.s are independent if

fX,Y (x, y) = fX (x)fY (y), ∀(x, y)

Properties of independent continuous r.v.s


E[XY ] = E[X]E[Y ]
E[g(X)h(Y )] = E[g(X)]E[h(Y )]
V [X + Y ] = V [X] + V [Y ]


Covariance and correlation coefficient

Definitions
Let X and Y be two continuous r.v.s, their covariance is defined as

cov(X, Y ) = E [(X − E[X])(Y − E[Y ])] = E[XY ] − E[X]E[Y ]

and their correlation coefficient, −1 ≤ ρ(X, Y ) ≤ 1, is defined as

ρ(X, Y ) = cov(X, Y ) / (σX σY )

Properties
X and Y are independent implies cov(X, Y ) = 0
The converse is false
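
The classical counterexample for the converse (X symmetric, Y = X^2, an assumed example) can be checked numerically; this is a sketch with arbitrary sample sizes:

```python
import random
import statistics

random.seed(2)

# cov(X, Y) = 0 does not imply independence: X ~ U([-1, 1]) and Y = X**2.
# Y is a deterministic function of X, yet E[XY] = E[X**3] = 0 = E[X]E[Y].
N = 200_000
xs = [random.uniform(-1.0, 1.0) for _ in range(N)]
ys = [x * x for x in xs]
cov = (statistics.fmean(x * y for x, y in zip(xs, ys))
       - statistics.fmean(xs) * statistics.fmean(ys))
# Dependence is visible in higher moments: E[X**2 * Y] = E[X**4] = 1/5,
# while E[X**2] * E[Y] = (1/3) * (1/3) = 1/9.
lhs = statistics.fmean(x * x * y for x, y in zip(xs, ys))
rhs = statistics.fmean(x * x for x in xs) * statistics.fmean(ys)
```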


Uniform distribution over [a, b], U([a, b])

Definition
If X ∼ U([a, b]), then

fX (x) = 1/(b − a), if a ≤ x ≤ b; 0, otherwise
FX (x) = 0, if x < a; (x − a)/(b − a), if a ≤ x ≤ b; 1, if x > b

Properties
E[X] = (a + b)/2
V [X] = (b − a)^2/12


Exponential distribution with parameter λ, E(λ)

Definition
If X ∼ E(λ), then
fX (x) = λ e^(−λx), if x ≥ 0; 0, otherwise
FX (x) = 1 − e^(−λx), if x ≥ 0; 0, otherwise

Properties
E[X] = 1/λ
V [X] = 1/λ^2


Gaussian distribution with mean 0 and variance 1, N (0, 1)

Definition
If X ∼ N (0, 1) (standard normal r.v.), then
fX (x) = (1/√(2π)) e^(−x^2/2)
FX (x) = Φ(x) = ∫_−∞^x (1/√(2π)) e^(−t^2/2) dt

Properties
E[X] = 0
V [X] = 1


Gaussian distribution with mean m and variance σ^2, N (m, σ^2)

Definition
If X ∼ N (m, σ 2 ), then

fX (x) = (1/(σ√(2π))) e^(−(x−m)^2/(2σ^2))
FX (x) = Φ((x − m)/σ)

Properties
E[X] = m
V [X] = σ^2


Properties of Gaussian random variables

Gaussianity preserved by linearity
Let X ∼ N (mX , σX^2) and Y ∼ N (mY , σY^2) be two independent
Gaussian r.v.s.
Let a and b be fixed scalars, then Z = aX + bY ∼ N (mZ , σZ^2) with
mZ = a mX + b mY
σZ^2 = a^2 σX^2 + b^2 σY^2
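
A Monte Carlo check of this property (the means, variances and scalars below are arbitrary choices):

```python
import random
import statistics

random.seed(4)

# Gaussianity preserved by linearity: X ~ N(1, 4), Y ~ N(-2, 9) independent,
# Z = 3X + 2Y => E[Z] = 3*1 + 2*(-2) = -1 and V[Z] = 9*4 + 4*9 = 72.
N = 200_000
zs = [3.0 * random.gauss(1.0, 2.0) + 2.0 * random.gauss(-2.0, 3.0)
      for _ in range(N)]
mean_z = statistics.fmean(zs)
var_z = statistics.pvariance(zs)
```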


4 Limit theorems
Chebyshev inequality
Law of large numbers
Central limit theorem



Chebyshev inequality

Theorem
Let X be a r.v. with finite mean µ and variance σ^2, then
P (|X − µ| ≥ ε) ≤ σ^2/ε^2, ∀ε > 0

Proof for a continuous r.v. (similar for a discrete r.v.)

σ^2 = ∫_−∞^+∞ (x − µ)^2 fX (x) dx ≥ ∫_{|x−µ|≥ε} (x − µ)^2 fX (x) dx
≥ ε^2 ∫_{|x−µ|≥ε} fX (x) dx = ε^2 P (|X − µ| ≥ ε)
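
Comparing the Chebyshev bound with the actual tail probability for an exponential r.v. (an arbitrary choice of law) shows how conservative the bound can be:

```python
import random

random.seed(5)

# Chebyshev bound vs the true tail: X ~ E(1) has mu = 1 and sigma = 1, so
# P(|X - 1| >= eps) <= 1/eps**2. For eps = 2 the event is {X >= 3} (since X >= 0),
# whose exact probability is exp(-3) ~ 0.0498, far below the bound 0.25.
N, eps = 200_000, 2.0
xs = [random.expovariate(1.0) for _ in range(N)]
emp = sum(abs(x - 1.0) >= eps for x in xs) / N
bound = 1.0 / eps**2
```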



(Weak) Law of large numbers

Theorem
Let X1 , X2 , . . . , Xn be independent and identically distributed
(i.i.d.) r.v.s with mean µ and variance σ^2, then ∀ε > 0
lim_{n→+∞} P (|(X1 + X2 + · · · + Xn )/n − µ| ≥ ε) = 0

Proof for a continuous r.v. (similar for a discrete r.v.)


Let Mn = (X1 + X2 + · · · + Xn )/n, so E[Mn ] = µ and V [Mn ] = σ^2/n
Applying Chebyshev's inequality, P (|Mn − µ| ≥ ε) ≤ σ^2/(n ε^2), ∀ε > 0.
As n → +∞ this probability tends to 0
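
A small experiment illustrating the concentration of Mn around µ as n grows (uniform samples; the values of n, ε and the number of trials are arbitrary):

```python
import random
import statistics

random.seed(6)

# Weak law of large numbers: M_n = (X_1 + ... + X_n)/n concentrates around mu.
# X_i ~ U([0, 1]), so mu = 0.5; estimate P(|M_n - mu| >= eps) for two sample sizes.
def sample_mean(n):
    return statistics.fmean(random.random() for _ in range(n))

eps, trials = 0.05, 2000
p_small = sum(abs(sample_mean(10) - 0.5) >= eps for _ in range(trials)) / trials
p_large = sum(abs(sample_mean(1000) - 0.5) >= eps for _ in range(trials)) / trials
```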


Consequence of the law of large numbers

Consistency with the frequentist definition of probability


Consider an event A with probability P (A). Define the r.v.

Xi = 1, if A occurs; 0, otherwise

Let X1 , X2 , . . . , Xn be i.i.d. r.v.s with mean
E[Xi ] = 1 × P (A) + 0 × (1 − P (A)) = P (A), ∀i.
Then, the relative frequency of occurrence of A is
Mn = (X1 + X2 + · · · + Xn )/n, with E[Mn ] = P (A).
Applying the (weak) law of large numbers obtained from the
axiomatic definition of probability, we obtain
lim_{n→+∞} P (|Mn − P (A)| ≥ ε) = 0, ∀ε > 0
(Interpretation: Mn is concentrated around P (A) for large n)

Consequence of the law of large numbers

Recovering the p.d.f. using a histogram
Consider a continuous r.v. X with p.d.f. fX .
If A is the event {X ∈ [a − h, a + h]}, for a small h > 0,
P (A) = ∫_{a−h}^{a+h} fX (x) dx ≈ 2h fX (a)
Then fX (a) ≈ Mn /(2h) = (number of times Xi = 1, i ≤ n)/(n × 2h)
(Figure: histogram estimate of a p.d.f. fX (x), plotted for x from −5 to 5.)
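
The histogram estimate fX (a) ≈ Mn /(n × 2h) can be tested on standard normal samples (a = 0 and h = 0.1 are arbitrary choices); the target value at 0 is 1/√(2π):

```python
import math
import random

random.seed(7)

# Histogram estimate of a p.d.f. value: f_X(a) ~ (# of X_i in [a-h, a+h]) / (n * 2h).
# Samples are standard normal, so the density at a = 0 is 1/sqrt(2*pi) ~ 0.3989.
n, h, a = 200_000, 0.1, 0.0
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
count = sum(a - h <= x <= a + h for x in xs)
est = count / (n * 2 * h)
true = 1.0 / math.sqrt(2.0 * math.pi)
```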



Central limit theorem

Theorem
Let X1 , X2 , . . . , Xn be i.i.d. r.v.s with mean µ and variance σ^2,
and define the standardized sum
Zn = (X1 + X2 + · · · + Xn − nµ)/(σ√n), with E[Zn ] = 0 and V [Zn ] = 1,
then lim_{n→+∞} P (Zn ≤ x) = Φ(x)

Normal approximation based on the central limit theorem


Let Sn = X1 + X2 + · · · + Xn be a sum of i.i.d. r.v.s, its
distribution can be approximated as normal, typically when
n > 30
The Xi ’s can be discrete or continuous
The distribution of the Xi ’s need not be known
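
A sketch of the theorem for uniform summands (n = 50 is an arbitrary choice above the rule of thumb n > 30): the empirical distribution of the standardized sum should match Φ, here evaluated at 1 via math.erf:

```python
import math
import random

random.seed(9)

# Central limit theorem: standardized sums of n i.i.d. U([0, 1]) r.v.s
# (mu = 1/2, sigma^2 = 1/12) should be approximately N(0, 1) distributed.
n, runs = 50, 20_000
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)

def standardized_sum():
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / (sigma * math.sqrt(n))

zs = [standardized_sum() for _ in range(runs)]
frac = sum(z <= 1.0 for z in zs) / runs                 # empirical P(Z_n <= 1)
phi_1 = 0.5 * (1.0 + math.erf(1.0 / math.sqrt(2.0)))    # Phi(1), about 0.8413
```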

Application

De Moivre-Laplace approximation to the binomial


Let X1 , X2 , . . . , Xn be i.i.d. according to B(1, p), the sum
Sn = X1 + X2 + · · · + Xn ∼ B(n, p).
If n > 30, np > 5 and n(1 − p) > 5,
P (k ≤ Sn ≤ l) ≈ Φ((l − np)/√(np(1 − p))) − Φ((k − np)/√(np(1 − p)))
(Figure: histogram of 20000 draws from B(100, 0.1).)
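
The approximation can be compared with the exact binomial probability for the B(100, 0.1) example (k = 5, l = 15 chosen arbitrarily), with Φ built from math.erf:

```python
import math

# De Moivre-Laplace: exact P(k <= S_n <= l) for S_n ~ B(n, p) vs the normal
# approximation Phi((l - np)/sd) - Phi((k - np)/sd), sd = sqrt(np(1-p)).
def phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, p = 100, 0.1
mu, sd = n * p, math.sqrt(n * p * (1.0 - p))
k, l = 5, 15
exact = sum(math.comb(n, j) * p**j * (1.0 - p)**(n - j) for j in range(k, l + 1))
approx = phi((l - mu) / sd) - phi((k - mu) / sd)
```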


Part II

Introduction to Random Processes


Basic concepts

Definition of a random process


Consider a sample space Ω and a set T of discrete or continuous
time instants, a random process is defined as a family of random
variables (Xt )t∈T
For a fixed tk ∈ T , ω → X(tk , ω) is a r.v.
For a fixed ωi ∈ Ω, t → xi (t) = X(t, ωi ) is a deterministic
function (or realization) defined on T


Basic concepts

Illustration of the concept of random process
(Figure: realizations x1 (t), x2 (t), . . . , xn (t) of a random process; sampling
each realization at time tk yields the values x1 (tk ), x2 (tk ), . . . , xn (tk ) of
the r.v. X(tk , ·).)


Basic concepts

Definition of a point
A point is a discrete event that occurs in continuous time (or
space).

Definition of a point process


A point process is a random process {Xk , k ≥ 1}, whose
realizations are points in time (or space).

Examples of point processes


The arrival times of customers in a store, movie theater, etc.
The location of trees (resp. asteroid impacts) in a forest
(resp. on earth)


5 Poisson process


The homogeneous Poisson process over the positive half line

Definition
A temporal point process {Xk , k ≥ 1} is called a Poisson process
with rate λ if it has the following properties
Time homogeneity: The p.m.f. of the number of arrivals Nτ
over any interval of length τ is pNτ (k) = e^(−λτ) (λτ)^k /k!, ∀k ≥ 0
Independence: The number of arrivals during a particular
interval is independent of the history of arrivals outside this
interval


The homogeneous Poisson process over the positive half line

Interpretation
Time homogeneity: Arrivals are equally likely at all times
Independence: The conditional probability of k arrivals during
[t, t′], given the occurrence of n arrivals outside [t, t′], is equal
to the unconditional probability
Rate λ: Average number of arrivals per unit time


Interarrival times

Definition
The k-th interarrival time is the r.v. defined by T1 = X1 ,
Tk = Xk − Xk−1 , k = 2, 3, . . .
Note that Xk = T1 + T2 + · · · + Tk


Properties of interarrival times

Probability law of the first arrival time


c.d.f. of T1 : FT1 (t) = P (T1 ≤ t) = 1 − P (T1 > t) =
1 − P ({no arrivals in (0, t]}) = 1 − e−λt , t ≥ 0, using the time
homogeneity property
p.d.f. of T1 , fT1 (t), obtained by differentiating FT1 (t):
fT1 (t) = λe−λt , t ≥ 0


Properties of interarrival times

Probability law of the k-th interarrival time


c.d.f. of Tk : using the independence property,
FTk (t) = P (Tk ≤ t|T1 = s1 , . . . , Tk−1 = sk−1 )
= 1 − P (Tk > t|T1 = s1 , . . . , Tk−1 = sk−1 )
= 1 − P ({no arrivals in (s1 + · · · + sk−1 , s1 + · · · + sk−1 + t]})
Using time homogeneity: FTk (t) = 1 − e^(−λt), t ≥ 0.


Since the result does not depend on s1 , . . . , sk−1 , Tk is
independent of T1 , . . . , Tk−1
p.d.f. of Tk : fTk (t) = λe−λt , t ≥ 0


Properties of interarrival times

Summary
Interarrival times T1 , T2 . . . , are independent r.v.s
The p.d.f. of each interarrival time is E(λ)

Simulation of a Poisson process with rate λ


Simulate Tk ∼ E(λ), k ≥ 1
Obtain the points of the Poisson process from
Xk = T1 + T2 + · · · + Tk
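
The simulation recipe above can be sketched directly (the rate, horizon and number of runs are arbitrary choices); the arrival counts over (0, τ] should have mean and variance both close to λτ, the Poisson signature:

```python
import random

random.seed(8)

# Simulate a Poisson process of rate lam from i.i.d. E(lam) interarrival times,
# then count arrivals in (0, tau]. The count should behave like P(lam * tau),
# so its mean and variance should both be close to lam * tau = 10.
lam, tau, runs = 2.0, 5.0, 5000
counts = []
for _ in range(runs):
    t, k = 0.0, 0
    while True:
        t += random.expovariate(lam)   # next interarrival time
        if t > tau:
            break
        k += 1
    counts.append(k)
mean_count = sum(counts) / runs
var_count = sum((c - mean_count) ** 2 for c in counts) / runs
```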


P.d.f of the arrival times

Consider the k-th arrival time Xk


Xk is the sum of k independent r.v.s with p.d.f. E(λ)
The p.d.f. of Xk is the Erlang p.d.f. of order k,
fXk (x) = λ^k x^(k−1) e^(−λx) /(k − 1)!, x ≥ 0


6 Discrete-time Markov processes


Discrete-time Markov process

Definition
A random process {Xn , n ∈ N}, taking values in the set of states
S = {s1 , s2 , . . . , sm }, is a time homogeneous Markov chain if
(Markov property):
P (Xn+1 = sj |Xn = si , Xn−1 = sin−1 , . . . , X0 = si0 ) =
P (Xn+1 = sj |Xn = si ), ∀sin−1 . . . , si0 ∈ S
(Time homogeneity): The transition probabilities defined by
pij = P (Xn+1 = sj |Xn = si ) ≥ 0 are independent of n
and satisfy Σ_{j=1}^m pij = 1


Specifying a Discrete-time Markov process

Transition probability matrix


A time homogeneous Markov chain can be represented by the
square [m × m] matrix whose entry in the i-th row and j-th
column is pij :

P = [pij ], 1 ≤ i ≤ m, 1 ≤ j ≤ m

Transition diagram
A time homogeneous Markov chain can be represented by a graph,
whose nodes are the states (i.e. the elements of S)
whose arcs are the allowed state transitions, labeled with the
corresponding transition probabilities


Discrete-time Markov process example

Machine failure model


A given day, a machine can be either working (i.e. in state s1 ) or
broken down (i.e. in state s2 ). The probability of breaking down,
while it was working the previous day, is b. The probability of
getting repaired, while it was broken down the previous day, is r.

Transition probability matrix/diagram
P = [[1 − b, b], [r, 1 − r]]
(Diagram: states s1 and s2 with self-loop probabilities 1 − b and 1 − r,
transition probability b from s1 to s2 and r from s2 to s1 .)


Probability of a path

Consider a Markov chain with pij = P (Xn+1 = sj |Xn = si )


Using the multiplicative rule of conditional probabilities

P (X0 = i0 , X1 = i1 , . . . , Xn−1 = in−1 , Xn = in )


= P (X0 = i0 )pi0 i1 . . . pin−1 in


2-step transition probabilities

Definition
Given that we are in state si , the probability that we are in state sj
in two steps from now is
pij^(2) = P (X2 = sj |X0 = si ) = Σ_{k=1}^m pik pkj

which is the (i, j)-th entry of P2


2-step transition probabilities

Derivation
From the total probability theorem
m
(2)
X
pij = P (X2 = sj |X0 = si ) = P (X2 = sj , X1 = sk |X0 = si )
k=1
m
X
= P (X2 = sj |X1 = sk , X0 = si )P (X1 = sk |X0 = si )
k=1
Xm
= pik pkj
k=1


2-step transition probabilities

Derivation
(Diagram: from si at time 0, transitions with probabilities pi1 , . . . , pik , . . . , pim
to the intermediate states s1 , . . . , sk , . . . , sm at time 1, then with probabilities
p1j , . . . , pkj , . . . , pmj to sj at time 2.)


n-step transition probabilities

Chapman-Kolmogorov equation
Given that we are in state si , the probability that we are in state sj
in n steps from now is by definition
pij^(n) = P (Xn = sj |X0 = si ),

which is equal to the (i, j)-th entry of Pn

Proof
By induction


Probability vector at instant n

Result
Let u(n) = [P (Xn = s1 ), . . . , P (Xn = sm )] be the probability
vector at instant n, then

u(n) = u(0) Pn

Proof
Using the total probability theorem
P (Xn = sj ) = Σ_{i=1}^m P (Xn = sj , X0 = si )
= Σ_{i=1}^m P (X0 = si ) P (Xn = sj |X0 = si )
= Σ_{i=1}^m P (X0 = si ) pij^(n)


Regular Markov chain

Definition
A Markov chain is said regular if some power of the transition
matrix has only (strictly) positive elements.

Interpretation
For some n, it is possible to go from any state to any state in
exactly n steps.

Example of transition matrix for a non-regular Markov chain
P = [[1, 0], [1/2, 1/2]]
then Pn will be 0 in position (1, 2), ∀n ≥ 1 (since s1 is absorbing)

Steady-state convergence theorem

For regular Markov chain with transition matrix P


For an arbitrary starting distribution u(0) , the probability vector at
instant n, u(0) Pn , converges as n → ∞ to a unique steady-state
probability vector π.

Properties of the steady-state probability vector


The steady-state probability vector π = [π1 , . . . , πm ] is the unique
solution of
π = πP (or equivalently πj = Σ_{k=1}^m πk pkj , ∀j)
together with the normalization Σ_{j=1}^m πj = 1
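
For the machine failure example (b = 0.1 and r = 0.4 are hypothetical values), iterating u(n+1) = u(n) P illustrates the convergence; solving π = πP for this two-state chain gives the closed form π = [r/(b + r), b/(b + r)]:

```python
# Machine failure chain: P = [[1-b, b], [r, 1-r]] with hypothetical b, r.
# Iterating u(n+1) = u(n) P converges to the steady-state vector, which for
# this two-state chain has the closed form pi = [r/(b+r), b/(b+r)].
b, r = 0.1, 0.4
P = [[1.0 - b, b], [r, 1.0 - r]]
u = [1.0, 0.0]                      # start from "working"
for _ in range(200):
    u = [u[0] * P[0][0] + u[1] * P[1][0],
         u[0] * P[0][1] + u[1] * P[1][1]]
pi = [r / (b + r), b / (b + r)]     # [0.8, 0.2]
gap = max(abs(u[j] - pi[j]) for j in range(2))
```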


Birth-death processes

Definition
A birth-death process is a particular Markov process, where only
self-transitions or transitions to neighboring states are allowed.
Therefore, the transition probabilities (pij ≠ 0 iff
j ∈ {i − 1, i, i + 1}) are completely parameterized by
bi = P (Xn+1 = si+1 |Xn = si ) (birth probability at state si )
di = P (Xn+1 = si−1 |Xn = si ) (death probability at state si )

Applications
Modeling of queueing problems in telecommunications
Modeling of disease evolution


Birth-death processes

Transition diagram
[Transition diagram: states s0, s1, . . . , sm−1, sm arranged in a chain;
forward (birth) transitions b0, b1, . . . , bm−1; backward (death)
transitions d1, d2, . . . , dm; self-loops 1−b0, 1−b1−d1, . . . , 1−dm.]


Birth-death processes

Transition probability matrix

P =
[ 1−b0   b0          0          0     ...    0              0    ]
[ d1     1−b1−d1     b1         0     ...    0              0    ]
[ 0      d2          1−b2−d2    b2    ...    0              0    ]
[ ...    ...         ...        ...   ...    bm−2           0    ]
[ 0      0           0          ...   dm−1   1−bm−1−dm−1   bm−1 ]
[ 0      0           0          ...   0      dm             1−dm ]


Birth-death processes

Steady-state probability vector (Proof: by induction)


A birth-death process is a regular Markov chain whose
steady-state probability vector π = [π0 , π1 , . . . , πm ] satisfies
the local balance equations πi bi = πi+1 di+1 , i = 0, 1, . . . , m − 1

Expression of the steady-state probabilities (Proof: by induction)


The steady-state probabilities are the unique solution of

πi = π0 (b0 b1 · · · bi−1) / (d1 d2 · · · di) , i = 1, . . . , m
∑_{i=0}^m πi = 1
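A minimal numerical sketch (hypothetical birth and death probabilities, assumed for illustration) that builds the πi from the product formula and checks the local balance equations:

```python
import numpy as np

# Hypothetical birth-death chain on states s0..s3
b = [0.3, 0.2, 0.1]   # birth probabilities b0, b1, b2
d = [0.4, 0.4, 0.4]   # death probabilities d1, d2, d3

# pi_i = pi_0 * (b0 ... b_{i-1}) / (d1 ... d_i); normalize so the sum is 1
ratios = [1.0]
for bi, di in zip(b, d):
    ratios.append(ratios[-1] * bi / di)
pi = np.array(ratios) / sum(ratios)
print(pi)

# Local balance check: pi_i b_i = pi_{i+1} d_{i+1}
print([abs(pi[i] * b[i] - pi[i + 1] * d[i]) < 1e-12 for i in range(3)])
```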

Classical Estimation
Bayesian Estimation

Part III

Estimation Theory


Motivation...

Problem statement: Given a set of noisy observations (data),
determine the value of a parameter of interest
Applications:
Radar: target range from radar returns
Image analysis: object position from a camera image
Communications: carrier frequency from demodulated signal
Biomedicine: fetus heart rate from a noisy sensor
Etc.


Data or observation set

N -point data set


Definition: realization x = (x1 , x2 , . . . , xN )^T of a continuous
random process X = (X1 , X2 , . . . , XN )^T , whose joint
probability density function (p.d.f.) fX (x; θ) is parameterized by
an unknown parameter θ.
Remark: if the random process X is discrete, replace the
joint p.d.f. of X by its joint p.m.f. pX (x; θ).


General definitions

Estimator for an unknown parameter θ


Definition: a random variable of the form Θ̂ = g(X), for
some function g
Properties: the p.d.f. of Θ̂ can be obtained from that of X
and thus also depends on θ

Examples
Mean estimator: M̂ = (1/N) ∑_{i=1}^N Xi
Variance estimator: V̂ = (1/N) ∑_{i=1}^N (Xi − M̂)²


Estimator’s p.d.f.

Desirable properties
For E[Θ̂]: E[Θ̂] = θ (unbiasedness)
For V [Θ̂]: small (good estimation accuracy)
[Figure: p.d.f. fΘ̂ (θ̂) of the estimator, centered at E[Θ̂] with
spread √V [Θ̂].]


General definitions (continued)

Estimate for an unknown parameter θ


Definition: Let x = (x1 , x2 , . . . , xN )T be a realization of
X = (X1 , X2 , . . . , XN )T , the estimate θ̂ = g(x) is the
corresponding realization of Θ̂

Examples
Mean estimate: m̂ = (1/N) ∑_{i=1}^N xi
Variance estimate: v̂ = (1/N) ∑_{i=1}^N (xi − m̂)²


Classical versus Bayesian estimation

Classical estimation
Principle: The unknown parameter θ is considered as a
deterministic constant
We know: fX (x; θ), the joint p.d.f. of X, parameterized by
the unknown constant θ

Bayesian estimation
Principle: The unknown parameter θ is modeled as a random
variable Θ
We know: the joint p.d.f. fX,Θ (x, θ) as the product of
1 A prior distribution fΘ (θ) for the unknown random variable Θ
2 A conditional distribution fX|Θ (x|θ) for the random process X


7 Classical Estimation


Performance measures

Definitions
Bias of the estimator: b(Θ̂) = E[Θ̂] − θ
Variance of the estimator: V (Θ̂) = E[(Θ̂ − E[Θ̂])²]
Mean square error (MSE): mse(Θ̂) = E[(Θ̂ − θ)²]

Property
mse(Θ̂) = E[((Θ̂ − E[Θ̂]) + (E[Θ̂] − θ))²] = V (Θ̂) + b(Θ̂)²
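A Monte Carlo sketch (assumed Gaussian data, not from the slides) checking this decomposition on the biased variance estimator V̂ = (1/N) ∑ (Xi − M̂)²:

```python
import numpy as np

# Check mse = variance + bias^2 on the (biased) sample-variance estimator
rng = np.random.default_rng(3)
N, true_var = 5, 4.0
v_hat = rng.normal(0.0, 2.0, size=(400_000, N)).var(axis=1)  # divides by N

bias = v_hat.mean() - true_var        # approx -true_var/N = -0.8 here
variance = v_hat.var()
mse = ((v_hat - true_var) ** 2).mean()
print(mse, variance + bias**2)        # identical up to rounding
```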


Performance criteria

Minimum MSE criterion


Find the estimator that minimizes the mean square error
mse(Θ̂) = ∫ (θ̂ − θ)² fX (x; θ) dx, with θ̂ = g(x)
Problem: this estimator is in general a function of θ

Minimum Variance Unbiased Estimator (MVUE)


Find the estimator Θ̂ which
Is unbiased: b(Θ̂) = 0
And minimizes the variance V (Θ̂), ∀θ


Existence of the Minimum Variance Unbiased Estimator

Assume only three unbiased estimators, Θ̂1 , Θ̂2 and Θ̂3 exist
Left-hand side: Θ̂1 is the MVUE
Right-hand side: MVUE does not exist, since no unbiased
estimator has minimum variance ∀θ
[Figure: variance curves V [Θ̂1 ], V [Θ̂2 ], V [Θ̂3 ] as functions of θ.
Left: V [Θ̂1 ] lies below the other two for every θ, so Θ̂1 is the
MVUE. Right: the curves cross, so no estimator has minimum
variance ∀θ and the MVUE does not exist.]


Cramer-Rao lower bound (CRLB)

Theorem (θ is a scalar parameter)


Assumption: the p.d.f. fX (x; θ) satisfies the regularity condition
E[∂ log fX (x; θ)/∂θ] = 0, ∀θ
Define the Fisher information as:
I(θ) = −E[∂² log fX (x; θ)/∂θ²]
     = −∫ (∂² log fX (x; θ)/∂θ²) fX (x; θ) dx, evaluated at θ = true value
Any unbiased estimator Θ̂ satisfies: V [Θ̂] ≥ I(θ)⁻¹

Provides a benchmark against which to compare the variance of
any unbiased estimator
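As an illustration with an assumed textbook setup (not taken from the slides): for Xi ~ N(θ, σ²) i.i.d. with σ known, I(θ) = N/σ², so the CRLB is σ²/N, and the sample mean attains it. A Monte Carlo sketch:

```python
import numpy as np

# Assumed setup: X_i ~ N(theta, sigma^2) i.i.d., sigma known.
# d^2 log f / dtheta^2 = -N / sigma^2  =>  I(theta) = N / sigma^2,
# so the CRLB is sigma^2 / N, attained by the sample mean.
rng = np.random.default_rng(0)
N, sigma, theta = 10, 2.0, 5.0
crlb = sigma**2 / N

# Monte Carlo variance of the sample-mean estimator
estimates = rng.normal(theta, sigma, size=(200_000, N)).mean(axis=1)
print(estimates.var(), crlb)   # the two values should be close
```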

Efficient Estimator
An unbiased estimator attaining the CRLB ∀θ is said to be efficient
Assume only three unbiased estimators, Θ̂1 , Θ̂2 and Θ̂3 exist
Left-hand side: the MVUE Θ̂1 exists, but is not efficient
Right-hand side: the MVUE Θ̂1 exists and is efficient
[Figure: variance curves V [Θ̂1 ], V [Θ̂2 ], V [Θ̂3 ] versus θ, together
with the CRLB. Left: V [Θ̂1 ] is the smallest but stays strictly above
the CRLB, so Θ̂1 is the MVUE but is not efficient. Right: V [Θ̂1 ]
coincides with the CRLB for all θ, so Θ̂1 is the MVUE and is
efficient.]


Cramer-Rao Lower Bound (CRLB)

Theorem continued (θ is a scalar parameter)


An efficient estimator exists iff there exists a function g such
that:
∂ log fX (x; θ)/∂θ = I(θ) (g(x) − θ) , ∀θ
Then the efficient estimator is given by: Θ̂ = g(X)
Since no unbiased estimator can do better, this is also the
MVUE


Best Linear Unbiased Estimator (BLUE)

Motivation
MVUE may not even exist
Even if MVUE exists, it may not be found using the CRLB
theorem
Suboptimal BLUE estimator: find the minimum variance
unbiased estimator within the restricted class of linear estimators


Best Linear Unbiased Estimator (BLUE)

Explore the class of all unbiased estimators


Left-hand side: BLUE is suboptimal
Right-hand side: BLUE is optimal
[Figure: the classes of linear and nonlinear estimators with their
variances. Left: the MVUE lies in the class of nonlinear estimators,
so the BLUE, the best estimator within the class of linear
estimators, is suboptimal. Right: the MVUE is itself linear, so
BLUE = MVUE and the BLUE is optimal.]


Best Linear Unbiased Estimator (BLUE)

BLUE (for a scalar parameter θ)


Assumption 1: Linear observation model X = θs + W, where
W is a length-N zero-mean noise process with arbitrary joint
p.d.f. and s is a suitable length-N deterministic vector
Assumption 2: Linear estimator Θ̂ = a^T X
Finding the BLUE: aopt = arg min_a V [Θ̂] subject to the
unbiasedness constraint E[Θ̂] = θ
Solution: aopt = C⁻¹s / (s^T C⁻¹s), where
C = E[(X − E[X])(X − E[X])^T ] = E[WW^T ]

Only requirements: knowledge of s and C
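A small numerical sketch of the closed-form solution (the signal vector and diagonal noise covariance below are assumed, for illustration only):

```python
import numpy as np

# Sketch for the linear model X = theta*s + W with known s and C
rng = np.random.default_rng(1)
N, theta = 5, 3.0
s = np.ones(N)                            # deterministic signal vector
C = np.diag([1.0, 2.0, 4.0, 8.0, 16.0])   # noise covariance E[W W^T]

Cinv_s = np.linalg.solve(C, s)
a_opt = Cinv_s / (s @ Cinv_s)             # BLUE weights

print(a_opt @ s)   # 1.0: enforces unbiasedness, E[a^T X] = theta
x = theta * s + rng.multivariate_normal(np.zeros(N), C)
print(a_opt @ x)   # BLUE estimate from one noisy realization
```

Note the design choice the formula encodes: the weights are inversely proportional to the noise variances, so cleaner observations count more, and the resulting minimum variance is 1/(s^T C⁻¹ s).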


Maximum-Likelihood Estimator (MLE)

Motivation
MVUE may be difficult to find or not even exist
BLUE always exists, but sometimes a linear estimator is
totally inappropriate (e.g., if we force the estimator of the
variance to be a linear function of X!)
MLE Principle: what is the value of θ under which the
observations we have actually seen are most likely?


Maximum-Likelihood Estimator (MLE)

Definitions
Likelihood function: the joint p.d.f. fX (x; θ) considered as a
numerical function of θ, by setting x as the actually observed
data
Log-likelihood function: the logarithm of the previous function

Find the value of θ maximizing the likelihood function


The maximum-likelihood estimate is given by
θ̂ = arg max_θ fX (x; θ)
or equivalently θ̂ = arg max_θ log fX (x; θ)


Maximum-Likelihood Estimator (MLE)

Finding the MLE


Assumption: the log-likelihood function is differentiable, so
that
∂ log fX (x; θ)/∂θ |_{θ=θ̂} = 0
Analytical determination: find a closed-form solution (not
always feasible)
Numerical determination: gradient algorithm,
Newton-Raphson method, etc.


Example: Estimate the rate of a Poisson process

Problem
Customer arrival times Yi at a restaurant form a Poisson
process with rate θ
Interarrival times Xi = Yi − Yi−1 ∼ E(θ) (with the convention
Y0 = 0) are known to be independent r.v.s
We collect the set of observations x = (x1 , . . . , xN )T


Example: Estimate the rate of a Poisson process

MLE estimator of θ
Likelihood function: fX (x; θ) = ∏_{i=1}^N θ e^{−θxi}
Log-likelihood function: log fX (x; θ) = N log θ − θ ∑_{i=1}^N xi
Derivative of the log-likelihood function:
∂ log fX (x; θ)/∂θ = N/θ − ∑_{i=1}^N xi
Maximum-likelihood estimate: θ̂ = N / ∑_{i=1}^N xi (inverse of
the sample mean of the interarrival times)
Maximum-likelihood estimator: Θ̂ = N / ∑_{i=1}^N Xi
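A quick simulation sketch (the true rate θ = 2.5 is assumed, for illustration) recovering the rate as the inverse sample mean:

```python
import numpy as np

# Simulate interarrival times X_i ~ E(theta) and apply the MLE
rng = np.random.default_rng(2)
theta_true = 2.5
x = rng.exponential(scale=1.0 / theta_true, size=100_000)

theta_hat = 1.0 / x.mean()   # inverse of the sample mean
print(theta_hat)             # close to theta_true for large N
```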


Maximum-Likelihood Estimator (MLE)

Properties of the MLE


If an efficient estimator exists, the MLE will produce it
Asymptotic efficiency: under some regularity conditions, the
MLE Θ̂ is asymptotically (for large data sets) distributed as
N (θ, I(θ)⁻¹)


8 Bayesian Estimation


Bayesian philosophy

Principle
The unknown parameter θ is modeled as a random variable Θ

Assumption
We know the joint p.d.f. fX,Θ (x, θ) as the product of
1 A prior distribution fΘ (θ) for the unknown random variable Θ

2 A conditional distribution fX|Θ (x|θ) for the random process X

fΘ (θ) encodes our prior knowledge on the unknown parameter.


Using prior knowledge can lead to more accurate estimators.


Cost functions

Notion of error and associated cost function


Consider
a particular realization θ of Θ
an estimate θ̂ corresponding to the estimator Θ̂, obtained for
a particular realization x of X
We let ε = θ − θ̂ be the error of the estimate. A deterministic
function C(ε) is termed a cost function.

Typical cost functions (cost is low when the error is small)

1 Quadratic error: C(ε) = ε²
2 Absolute error: C(ε) = |ε|
3 Hit-or-miss error: C(ε) = 0 if |ε| < δ, C(ε) = 1 if |ε| ≥ δ

Typical cost functions

[Figure: plots of C(ε) versus ε for the quadratic error, absolute
error, and hit-or-miss error cost functions.]


Bayes risk

Definition
R(Θ̂) = E[C(Θ − Θ̂)], which depends on the choice of the
estimator Θ̂

Expression
Since C(Θ − Θ̂) depends on both r.v.s Θ and Θ̂, its
expectation is with respect to the joint p.d.f. fX,Θ (x, θ)
We obtain R(Θ̂) = ∫∫ C(θ − θ̂) fX,Θ (x, θ) dθ dx

Note that R(Θ̂) cannot be a function of θ, since this variable


is integrated out (contrary to mse(Θ̂) in classical estimation).


Estimator construction

Find the estimator Θ̂ that minimizes the Bayes risk R(Θ̂)


1 From Bayes' theorem, rewrite the Bayes risk as:
R(Θ̂) = ∫ [ ∫ C(θ − θ̂) fΘ|X (θ|x) dθ ] fX (x) dx, where
fX (x) = ∫ fX,Θ (x, θ) dθ and fΘ|X (θ|x) = fX,Θ (x, θ) / fX (x)
2 Since fX (x) ≥ 0, ∀x, the Bayes risk R(Θ̂) will be minimized if
we can minimize the inner integral
g(θ̂) = ∫ C(θ − θ̂) fΘ|X (θ|x) dθ, ∀x

Only requirement: the posterior p.d.f. of Θ given x, fΘ|X (θ|x)


Find the optimal estimates

General case
Solve ∂g(θ̂)/∂θ̂ = 0


Optimal estimates for the typical cost functions

For the quadratic cost function


1 Cost function: C(ε) = ε²
2 Bayes risk: R(Θ̂) = E[|Θ − Θ̂|²] = ∫∫ |θ − θ̂|² fX,Θ (x, θ) dθ dx
3 Function to be minimized: g(θ̂) = ∫ |θ − θ̂|² fΘ|X (θ|x) dθ, ∀x
4 Minimum mean square error (MMSE) estimate:
mean of the posterior p.d.f., θ̂ = E[Θ|x] = ∫ θ fΘ|X (θ|x) dθ


Optimal estimates for the typical cost functions

For the absolute cost function


1 Cost function: C(ε) = |ε|
2 Bayes risk: R(Θ̂) = E[|Θ − Θ̂|] = ∫∫ |θ − θ̂| fX,Θ (x, θ) dθ dx
3 Function to be minimized: g(θ̂) = ∫ |θ − θ̂| fΘ|X (θ|x) dθ, ∀x
4 Minimum mean absolute error estimate:
median of the posterior p.d.f., such that
∫_{−∞}^{θ̂} fΘ|X (θ|x) dθ = ∫_{θ̂}^{+∞} fΘ|X (θ|x) dθ


Optimal estimates for the typical cost functions

For the hit-or-miss cost function


1 Cost function: C(ε) = 0 if |ε| < δ, C(ε) = 1 if |ε| ≥ δ, for
arbitrarily small δ
2 Bayes risk: R(Θ̂) = E[C(Θ − Θ̂)] = ∫∫ C(θ − θ̂) fX,Θ (x, θ) dθ dx
3 Function to be minimized: g(θ̂) = 1 − ∫_{θ̂−δ}^{θ̂+δ} fΘ|X (θ|x) dθ, ∀x
4 Maximum a posteriori (MAP) estimate:
mode of the posterior p.d.f., θ̂ = arg max_θ fΘ|X (θ|x)
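The three point estimates can be read off any posterior; a grid-based sketch with an assumed skewed posterior (a Beta(3, 6) shape, chosen only so that the three estimates differ):

```python
import numpy as np

# Assumed skewed posterior on a grid: Beta(3, 6) shape, theta in [0, 1]
theta = np.linspace(0.0, 1.0, 100_001)
dtheta = theta[1] - theta[0]
post = theta**2 * (1.0 - theta)**5
post /= post.sum() * dtheta                # normalize to a p.d.f.

mmse = (theta * post).sum() * dtheta       # posterior mean   (quadratic cost)
cdf = np.cumsum(post) * dtheta
median = theta[np.searchsorted(cdf, 0.5)]  # posterior median (absolute cost)
map_est = theta[np.argmax(post)]           # posterior mode   (hit-or-miss cost)
print(map_est, median, mmse)               # mode < median < mean here
```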


Optimal estimates for the typical cost functions


[Figure: Top row: a skewed posterior fΘ|X (θ|x), for which the
mode (MAP), median (minimum mean absolute error), and mean
(MMSE) estimates differ. Bottom row: a symmetric unimodal
posterior, for which mode = mean = median.]


End

Thank you for your attention!
