CS115 Probability 2
Review
The contents of this document are taken mainly from the following sources:
John Tsitsiklis. Massachusetts Institute of Technology. Introduction to Probability. https://ocw.mit.edu/resources/res-6-012-introduction-to-probability-spring-2018/index.htm
Marek Rutkowski. University of Sydney. Probability Review. http://www.maths.usyd.edu.au/u/UG/SM/MATH3075/r/Slides_1_Probability.pdf
https://www.probabilitycourse.com/
Sample Space
The sample space Ω is the list (set) of all possible outcomes (states of the world). The outcomes are also called samples or elementary events.
The list must be:
Mutually exclusive
Collectively exhaustive
At the “right” granularity
The sample space Ω is either countable or uncountable.
Definition (Probability)
A map P : Ω → [0, 1] is called a probability on a discrete sample space Ω = {ω_k, k ∈ I}
if the following conditions are satisfied:
P(ω_k) ≥ 0 for all k ∈ I
∑_{k∈I} P(ω_k) = 1, i.e., P(Ω) = 1
Probability Measure
Theorem
Let P : Ω → [0, 1] be a probability on a discrete sample space Ω. Then the
unique probability measure on (Ω, F) generated by P satisfies, for all A ∈ F,
P(A) = ∑_{ω_k ∈ A} P(ω_k)
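To make the theorem concrete, here is a minimal Python sketch; the fair-die sample space and the event are illustrative choices, not from the slides:

```python
# A minimal sketch: P(A) on a discrete sample space as a sum of
# elementary probabilities. The die example is illustrative.
omega = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}  # P(omega_k), fair die

A = {2, 4, 6}                          # event "the roll is even"
P_A = sum(omega[w] for w in A)         # P(A) = sum over omega_k in A of P(omega_k)
print(P_A)                             # ≈ 0.5
```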
Notation
Random variable: X. Numerical value: x.
Definition
p_X(x) = P(X = x) = P({ω ∈ Ω s.t. X(ω) = x})
Properties
p_X(x) ≥ 0
∑_x p_X(x) = 1
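As an illustration of these properties, the following sketch builds the PMF of the sum of two fair dice directly from the sample space; the two-dice setup is an assumption chosen for illustration:

```python
from collections import Counter
from itertools import product

# Sketch: PMF of X = sum of two fair dice, built from the sample space.
outcomes = list(product(range(1, 7), repeat=2))   # Omega: 36 equally likely samples
counts = Counter(d1 + d2 for d1, d2 in outcomes)  # how many omega map to each x
pmf = {x: c / len(outcomes) for x, c in counts.items()}  # p_X(x) = P(X = x)

assert all(p >= 0 for p in pmf.values())          # p_X(x) >= 0
assert abs(sum(pmf.values()) - 1) < 1e-12         # sum over x of p_X(x) = 1
print(pmf[7])                                     # 6/36 ≈ 0.1667, the most likely sum
```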
[Figure: an example PMF p_X(x).]
"Average" gain:
E[X] = ∑_x x p_X(x)
E[·] is called the expectation operator.
E[X] is the average of X over a large number of independent experiments.
The expectation of a r.v. can be seen as a weighted average.
Since we cannot know in advance which event will occur, the expectation is
useful for making decisions when the probabilities of future outcomes are known.
Any random variable defined on a finite set Ω admits an expectation.
When the set Ω is countable but infinite, we need ∑_x |x| p_X(x) < ∞
so that E[X] is well-defined.
Definition
The expectation (expected value or mean value) of a random variable
X on a discrete sample space Ω is given by
E_P(X) = µ := ∑_{k∈I} X(ω_k) P(ω_k) = ∑_{k∈I} x_k p_k
Definition
The expectation (expected value or mean value) of a discrete random
variable X with range R_X = {x_1, x_2, x_3, . . .} (finite or countably infinite)
is defined as
E(X) = µ := ∑_{x_k ∈ R_X} x_k P(X = x_k) = ∑_{x_k ∈ R_X} x_k P_X(x_k)
If X ≥ 0 then E[X] ≥ 0.
If a ≤ X ≤ b then a ≤ E[X] ≤ b.
If c is a constant, E[c] = c.
Average of X² over x:
E[X²] = ∑_x x² p_X(x)
Caution: In general, E[g(X)] ≠ g(E[X]).
Theorem
E[aX + b] = aE[X] + b
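A short sketch tying these formulas together, again using a fair die as an illustrative PMF: it computes E[X] and E[X²], checks E[aX + b] = aE[X] + b, and shows the caution E[g(X)] ≠ g(E[X]):

```python
# Sketch: expectations from a PMF. The fair-die PMF is illustrative.
pmf = {x: 1/6 for x in range(1, 7)}

E_X  = sum(x * p for x, p in pmf.items())         # E[X] = sum_x x p_X(x) = 3.5
E_X2 = sum(x**2 * p for x, p in pmf.items())      # E[X^2] = sum_x x^2 p_X(x)

a, b = 2, 5
E_Y = sum((a*x + b) * p for x, p in pmf.items())  # E[aX + b] computed directly
assert abs(E_Y - (a*E_X + b)) < 1e-12             # matches a E[X] + b

print(E_X, E_X2, E_Y)  # 3.5, ≈15.17, 12.0
# Note E[X^2] ≈ 15.17 while (E[X])^2 = 12.25: in general E[g(X)] ≠ g(E[X]).
```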
Consider a r.v. X with µ = E[X]. What is the average distance from the mean?
E[X − µ] = E[X] − µ = µ − µ = 0
Definition (Variance)
The variance of a random variable X on a discrete sample space Ω is
defined as
Var(X) = σ² = E_P[(X − µ)²],
where P is a probability measure on Ω.
Theorem
For a random variable X and real numbers a and b,
Var(aX + b) = a² Var(X)
Notation: µ = E[X].
Proof sketch: let Y = X + b; then γ := E[Y] = µ + b and
Var(Y) = E[(Y − γ)²] = E[(X − µ)²] = Var(X), so the shift b drops out.
Scaling X by a multiplies every deviation from the mean by a, hence the variance by a².
Theorem
If X, Y are independent: E[XY] = E[X] E[Y].
Moreover, g(X) and h(Y) are also independent, so E[g(X) h(Y)] = E[g(X)] E[h(Y)].
Theorem
If X, Y are independent: Var (X + Y ) = Var (X) + Var (Y )
Proof.
Assume E[X] = E[Y] = 0; then E[XY] = E[X] E[Y] = 0, and
Var(X + Y) = E[(X + Y)²] = E[X²] + 2 E[XY] + E[Y²] = Var(X) + Var(Y).
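A Monte Carlo sketch of this theorem; the particular distributions chosen for X and Y below are illustrative assumptions:

```python
import random

# Sketch: Var(X + Y) = Var(X) + Var(Y) for independent X and Y.
random.seed(0)
n = 200_000
xs = [random.randint(1, 6) for _ in range(n)]    # X: fair die, Var = 35/12
ys = [random.gauss(0.0, 2.0) for _ in range(n)]  # Y: N(0, 4), independent of X

def var(samples):
    m = sum(samples) / len(samples)
    return sum((s - m) ** 2 for s in samples) / len(samples)

print(var(xs) + var(ys))                     # ≈ 35/12 + 4 ≈ 6.92
print(var([x + y for x, y in zip(xs, ys)]))  # ≈ the same, up to sampling noise
```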
Definition
A random variable X is a Bernoulli random variable with parameter
p ∈ [0, 1], written as X ∼ Bernoulli(p) if its PMF is given by
P_X(x) = p,     for x = 1
P_X(x) = 1 − p, for x = 0.
[Figure: Bernoulli PMF, with mass 1 − p at x = 0 and p at x = 1.]
Bernoulli & Indicator Random Variables
A Bernoulli r.v. X with parameter p ∈ [0, 1] can also be described as
X = 1 with probability p
X = 0 with probability 1 − p
A Bernoulli r.v. is associated with a certain event A. If event A
occurs, then X = 1; otherwise, X = 0.
A Bernoulli r.v. is also called the indicator random variable of an event.
Definition
The indicator random variable of an event A is defined by
I_A = 1 if the event A occurs
I_A = 0 otherwise
The indicator r.v. for an event A has a Bernoulli distribution with parameter
p = P(I_A = 1) = P_{I_A}(1) = P(A). We can write I_A ∼ Bernoulli(P(A)).
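A small simulation sketch of the indicator view; the event "a fair die shows 5 or 6" is an illustrative assumption:

```python
import random

# Sketch: the indicator of an event A is Bernoulli(P(A)).
random.seed(0)
n = 100_000
indicators = [1 if random.randint(1, 6) >= 5 else 0 for _ in range(n)]  # I_A

p_hat = sum(indicators) / n  # empirical P(I_A = 1)
print(p_hat)                 # ≈ P(A) = 2/6 ≈ 0.333
```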
Discrete Uniform Random Variables
The PMF is constant on the range {a, a + 1, . . . , b}:
p_X(x) = 1/(b − a + 1) for x = a, a + 1, . . . , b
[Figure: flat PMF of height 1/(b − a + 1) over x = a, a + 1, . . . , b.]
Binomial Random Variables
Parameters: Probability p ∈ [0, 1], positive integer n.
Experiment: e.g., n independent tosses of a coin with P (Head) = p
Sample space: Set of sequences of H and T of length n
Random variable X : number of Heads observed.
Model of: number of successes in a given number of independent
trials.
Example (n = 3)
P_X(2) = P(X = 2)
       = P(HHT) + P(HTH) + P(THH)
       = 3 p² (1 − p)
       = C(3, 2) p² (1 − p)
Binomial Random Variables
[Figure: binomial PMFs for n = 10, p = 0.5 and for n = 100, p = 0.5.]
The PMF is
P_X(k) = C(n, k) p^k (1 − p)^{n−k}, for k = 0, 1, . . . , n,
where
C(n, k) = n! / (k! (n − k)!)
Then
E[X] = np and Var(X) = np(1 − p)
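A standard-library sketch of the binomial PMF that also checks the mean and variance formulas; the parameters n = 10, p = 0.5 are illustrative:

```python
from math import comb

# Sketch: binomial PMF, mean, and variance.
def binom_pmf(k, n, p):
    """P(X = k) = C(n, k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5  # illustrative parameters
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * q for k, q in enumerate(pmf))
var  = sum((k - mean)**2 * q for k, q in enumerate(pmf))
print(mean, var)  # 5.0 and 2.5, matching np and np(1-p)
```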
Geometric Random Variables
A geometric r.v. X with parameter p ∈ (0, 1] counts the number of independent
Bernoulli(p) trials up to and including the first success, so
P_X(k) = (1 − p)^{k−1} p for k = 1, 2, . . . Then
E[X] = 1/p and Var(X) = (1 − p)/p²
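A simulation sketch of this parametrization, counting trials up to and including the first success; p = 0.25 is an illustrative choice:

```python
import random

# Monte Carlo sketch: sample mean ≈ 1/p, sample variance ≈ (1-p)/p^2.
random.seed(0)
p, n = 0.25, 100_000

def trials_until_success(p):
    k = 1
    while random.random() >= p:  # failure with probability 1 - p
        k += 1
    return k

samples = [trials_until_success(p) for _ in range(n)]
mean = sum(samples) / n
var = sum((s - mean)**2 for s in samples) / n
print(mean, var)  # ≈ 4.0 and ≈ 12.0 for p = 0.25
```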
Definition
A random variable X on the sample space Ω is said to have a continuous
distribution if there exists a real-valued function f such that
f(x) ≥ 0,
∫_{−∞}^{∞} f(x) dx = 1,
and P(a ≤ X ≤ b) = ∫_a^b f(x) dx for all a ≤ b; the function f is called the
probability density function (PDF) of X.
[Figure: a discrete PMF p_X(x) side by side with a continuous PDF f_X(x).]
Discrete (PMF):
P(a ≤ X ≤ b) = ∑_{x: a≤x≤b} p_X(x)
p_X(x) ≥ 0
∑_x p_X(x) = 1
Continuous (PDF):
P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx
f_X(x) ≥ 0
∫_{−∞}^{∞} f_X(x) dx = 1
[Figure: PDF f_X(x) with a small interval [a, a + δ] highlighted.]
P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx
For small δ > 0:
P(a ≤ X ≤ a + δ) ≈ f_X(a) · δ
P(X = a) = 0
This is just like length: a single point has zero length, but a set of many
points can have positive length.
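A numeric sketch of the approximation P(a ≤ X ≤ a + δ) ≈ f_X(a) · δ, using the standard normal density as an illustrative f_X:

```python
from math import exp, pi, sqrt

# Sketch: for small delta, the integral of f over [a, a + delta]
# is close to f(a) * delta. f is the standard normal density here.
def f(x):
    return exp(-x**2 / 2) / sqrt(2 * pi)

a, delta = 1.0, 1e-3
steps = 1000
# midpoint Riemann sum for the integral of f over [a, a + delta]
integral = sum(f(a + (i + 0.5) * delta / steps) for i in range(steps)) * delta / steps

print(integral)      # ≈ 2.4197e-4
print(f(a) * delta)  # ≈ 2.4197e-4, nearly identical
```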
Standard Normal (Gaussian) Random Variable N (0, 1)
[Figure: building the standard normal PDF in three steps: x²/2, then e^{−x²/2},
then (1/√(2π)) e^{−x²/2}.]
Since
∫_{−∞}^{∞} e^{−x²/2} dx = √(2π),
the normalized function
f_X(x) = (1/√(2π)) e^{−x²/2}
is a valid PDF.
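A quick numeric check of this normalizing constant; the truncation to [−10, 10] and the step size are illustrative choices:

```python
from math import exp, pi, sqrt

# Sketch: Riemann-sum check that the integral of exp(-x^2/2)
# over the real line equals sqrt(2*pi). Tails beyond |x| = 10
# are negligible (on the order of e^{-50}).
dx = 1e-3
total = sum(exp(-(k * dx)**2 / 2) for k in range(-10_000, 10_001)) * dx
print(total, sqrt(2 * pi))  # both ≈ 2.5066
```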
Normal (Gaussian) Random Variable N (µ, σ²)
[Figure: building the N(µ, σ²) PDF in three steps: (x − µ)²/(2σ²),
then e^{−(x−µ)²/(2σ²)}, then (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}.]
f_X(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}
E[X] = µ
Var(X) = σ²
[Figure: N(µ, σ²) PDFs for σ = 0.5, 1, 2, 3.]
Smaller σ, narrower PDF.
Let Y = aX + b, where X ∼ N(µ, σ²).
Then E[Y] = aµ + b and Var(Y) = a²σ² (always true, for any X).
But also, Y ∼ N(aµ + b, a²σ²): a linear function of a normal r.v. is again normal.
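A Monte Carlo sketch of this linear-transformation rule; the parameter values below are illustrative assumptions:

```python
import random

# Sketch: if X ~ N(mu, sigma^2) and Y = aX + b, then the sample mean and
# variance of Y approach a*mu + b and a^2 * sigma^2.
random.seed(0)
mu, sigma, a, b, n = 2.0, 3.0, 0.5, 1.0, 200_000

ys = [a * random.gauss(mu, sigma) + b for _ in range(n)]
mean = sum(ys) / n
var = sum((y - mean)**2 for y in ys) / n
print(mean, var)  # ≈ a*mu + b = 2.0 and a^2 * sigma^2 = 2.25
```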
Example: likelihood of the observed sample (1, 0, 1, 1) for each candidate value of θ:
θ    P_{X1 X2 X3 X4}(1, 0, 1, 1; θ)
0    0
1    0.0247
2    0.0988
3    0
The observed data is most likely to occur for θ = 2.
We may choose θ̂ = 2 as our estimate of θ.
Maximum Likelihood Estimation (MLE)
Definition
Let X1 , X2 , . . . , Xn be a random sample from a distribution with a
parameter θ.
Given that we have observed X1 = x1 , X2 = x2 , . . . , Xn = xn , a
maximum likelihood estimate of θ, denoted θ̂_ML, is a value of θ that
maximizes the likelihood function
L(x1 , x2 , . . . , xn ; θ)
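A minimal sketch of this definition for i.i.d. Bernoulli(θ) data, maximizing the likelihood over a grid of candidate θ values; the data vector and the grid are illustrative assumptions, and in practice one usually maximizes the log-likelihood or uses the closed form:

```python
from math import prod

# Sketch: brute-force MLE for a Bernoulli(theta) parameter.
data = [1, 0, 1, 1]  # observed x_1, ..., x_n (illustrative)

def likelihood(theta, xs):
    """L(x_1, ..., x_n; theta) = product over i of theta^x_i (1-theta)^(1-x_i)."""
    return prod(theta if x == 1 else 1 - theta for x in xs)

grid = [i / 100 for i in range(101)]                    # candidate theta values
theta_ml = max(grid, key=lambda t: likelihood(t, data)) # argmax of L over the grid
print(theta_ml)  # 0.75 = sample mean, the closed-form Bernoulli MLE
```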