
Engineering Modelling:

Probability Theory - Brief review

Stephan Schmidt

[email protected]

Last generated: February 27, 2024

1/90
Please let me know if anything needs to be clarified in the slides (e.g.
something that is not apparent) or if there are mistakes/typos.

2/90
Outline

Probability theory
Introduction
Probability density functions
Cumulative distributions
Sampling
Probabilities
Percentiles and quantiles
Likelihood
Transformation of variables
Multivariate distributions
Expected values
Monte Carlo integration
Conclusion

3/90
Probability theory

4/90
Subsection 1

Introduction

5/90
Introduction

[Figure: two scenarios, (a) and (b), each showing Urn 1 and Urn 2 filled with blue circles, red squares, and green triangles (contents tabulated on the following slides).]

Assume that the stones are properly mixed.


1. What is the probability of selecting a (i) red square, (ii) green triangle, or (iii) blue circle?
2. Given that you have selected Urn 1, what is the probability of selecting a (i) red square, (ii) green triangle, or (iii) blue circle?
3. Given that you have selected Urn 2, what is the probability of selecting a (i) red square, (ii) green triangle, or (iii) blue circle?
4. Given that you have selected a blue circle, what is the probability that you have sampled from Urn 1?
5. Given that you have selected a green triangle, what is the probability that you have sampled from Urn 1?
6. Given that you have selected a green triangle and a blue circle, what is the probability that you sampled from Urn 1?

6/90
Introduction

Go to the next page to see the answers . . .

7/90
Introduction

Description Urn 1 Urn 2 Total


Blue circle (B) 2 6 8
Red square (R) 8 3 11
Green triangle (G) 0 1 1
Total 10 10 20
What is the probability of selecting a red square, green triangle, or blue circle?

P(B) = 8/20 = 0.4    (1)
P(R) = 11/20 = 0.55    (2)
P(G) = 1/20 = 0.05    (3)

Given that you have selected Urn 1, what is the probability of selecting a red square, green triangle, or blue circle?

P(B|U1) = 2/10 = 0.2    (4)
P(R|U1) = 8/10 = 0.8    (5)
P(G|U1) = 0/10 = 0.0    (6)

8/90
Introduction

Description Urn 1 Urn 2 Total


Blue circle (B) 2 6 8
Red square (R) 8 3 11
Green triangle (G) 0 1 1
Total 10 10 20
Given that you have selected Urn 2, what is the probability of selecting a red square, green triangle, or blue circle?

P(B|U2) = 6/10 = 0.6    (7)
P(R|U2) = 3/10 = 0.3    (8)
P(G|U2) = 1/10 = 0.1    (9)

9/90
Introduction

Description Urn 1 Urn 2 Total


Blue circle (B) 2 6 8
Red square (R) 8 3 11
Green triangle (G) 0 1 1
Total 10 10 20

Given that you have selected a blue circle, what is the probability that you have sampled from Urn 1?

P(U1|B) = 2/8 = 0.25    (10)
P(U2|B) = 6/8 = 0.75    (11)

10/90
Introduction

Description Urn 1 Urn 2 Total


Blue circle (B) 2 6 8
Red square (R) 8 3 11
Green triangle (G) 0 1 1
Total 10 10 20

Given that you have selected a green triangle, what is the probability that you have sampled from Urn 1?

P(U1|G) = 0/1 = 0    (12)
P(U2|G) = 1/1 = 1    (13)

Given that you have selected a green triangle and a blue circle, what is the probability that you sampled from Urn 1?

P(U1|G, B) = 0    (14)

11/90
Subsection 2

Probability density functions

12/90
Probability density functions

The function p(x) is a probability density function over the continuous random variable x if and only if

• p(x) ≥ 0 over the domain of the function.
• ∫_{-∞}^{∞} p(x) dx = 1 (if −∞ < x < ∞).

Exercise: Calculate c to ensure that this is a valid probability density function:

p(x) = c · (1 + x)⁻¹ if 0 ≤ x ≤ 1, and p(x) = 0 otherwise    (15)

Plot the probability density function.
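For this density, ∫_{0}^{1} c/(1 + x) dx = c · ln 2, so c = 1/ln 2 ≈ 1.443. A minimal sketch of the check and plot in Python (assuming numpy, scipy, and matplotlib are available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import quad

c = 1 / np.log(2)  # normalisation: c * ln(2) = 1

def pdf(x):
    # the candidate density from Eq. (15)
    return np.where((x >= 0) & (x <= 1), c / (1 + x), 0.0)

area, _ = quad(pdf, 0, 1)
print(area)  # ≈ 1.0, so p(x) is a valid PDF

xs = np.linspace(-0.5, 1.5, 400)
plt.plot(xs, pdf(xs))
plt.xlabel("x"); plt.ylabel("p(x)")
plt.show()
```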

13/90
Probability density functions

• Normal/Gaussian distribution:
  N(x; µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))    (16)
• Laplace distribution:
  L(x; µ, b) = (1/(2b)) exp(−|x − µ|/b)    (17)
• Gamma distribution:
  Gamma(τ; a, b) = (1/Γ(a)) bᵃ τ^(a−1) e^(−bτ)    (18)
• Beta distribution:
  Beta(µ; a, b) = (Γ(a + b)/(Γ(a)Γ(b))) µ^(a−1) (1 − µ)^(b−1)    (19)
• Bernoulli distribution:
  Be(x; µ) = µˣ · (1 − µ)^(1−x)    (20)

14/90
Probability density functions

The scipy.stats module

Example of the (univariate) Normal/Gaussian distribution

NB: Always check the form of the function (there are often different
parametrisations of the same distribution)!

15/90
Probability density functions
These are the functions that are available for the Normal distribution in
scipy.stats:

• rvs - sampling
• pdf - probability density function
• cdf - cumulative distribution function
• logpdf and logcdf - more stable than np.log(pdf) and np.log(cdf)
16/90
Probability density functions
Common probability density functions:

[Figure: (a) Normal/Gaussian and (b) Laplace PDFs for (loc=0, scale=1), (loc=1, scale=1), and (loc=-2.5, scale=0.5); (c) Beta PDFs for (a=1.5, b=1), (a=1, b=1), and (a=2, b=1); (d) Gamma PDFs for (a=1, loc=0, scale=1), (a=2, loc=0, scale=1), and (a=5, loc=0, scale=1).]
17/90
Probability density functions

(a) What does the distribution of a random variable u3 = u1 + u2, the sum of two uniformly distributed variables, look like?

(b) What does the distribution of a random variable u3 = u1 − u2, the difference of two uniformly distributed variables, look like?

[Figure: (a) the uniform density of u1 and (b) the uniform density of u2, both on [0, 1].]

18/90
Probability density functions
Example:
Which option is correct for p(u1 + u2)?

[Figure: four candidate densities (a)-(d) for u1 + u2, shown over different supports and heights.]

19/90
Probability density functions
Which option is correct for p(u1 − u2)?

[Figure: four candidate densities (a)-(d) for u1 − u2, shown over different supports and heights.]
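A quick simulation sketch (assuming numpy and matplotlib) to check the answers empirically; the sum of two independent U(0, 1) variables is triangular on [0, 2], and the difference is triangular on [−1, 1]:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
u1 = rng.uniform(0, 1, size=100_000)
u2 = rng.uniform(0, 1, size=100_000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(u1 + u2, bins=100, density=True)  # triangular on [0, 2]
ax1.set_xlabel("u1 + u2")
ax2.hist(u1 - u2, bins=100, density=True)  # triangular on [-1, 1]
ax2.set_xlabel("u1 - u2")
plt.show()
```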

20/90
Subsection 3

Cumulative distributions

21/90
Cumulative distributions

The cumulative distribution function is defined by

Fx(X) = P(x < X)    (21)

and can be calculated with

Fx(X) = ∫_{-∞}^{X} px(x) dx    (22)

Some useful properties:

• P(x ≥ X) = 1 − Fx(X)
• P(X1 ≤ x ≤ X2) = P(x ≤ X2) − P(x ≤ X1)
• px(x) = d Fx(x)/dx

22/90
Cumulative distributions
Example
A uniform distribution U(x; a, b) has the following probability density function:

U(x; a, b) = 1/(b − a) if a ≤ x ≤ b, and 0 otherwise    (23)

Its CDF is given by

Fx(X) = ∫_{-∞}^{X} U(x; a, b) dx    (24)
      = (X − a)/(b − a),  a ≤ X ≤ b    (25)

[Figure: the PDF (left) and CDF (right) of U(x; 0, 1).]

23/90
Cumulative distributions
Common probability density functions (left) with their cumulative distributions (right):

[Figure: (a) Normal/Gaussian PDFs with (b) their CDFs, and (c) Laplace PDFs with (d) their CDFs, for (loc=0, scale=1), (loc=1, scale=1), and (loc=-2.5, scale=0.5).]

24/90
Cumulative distributions
[Figure: (a) Beta PDFs with (b) their CDFs for (a=1.5, b=1), (a=1, b=1), and (a=2, b=1); (c) Gamma PDFs with (d) their CDFs for (a=1, loc=0, scale=1), (a=2, loc=0, scale=1), and (a=5, loc=0, scale=1).]

25/90
Subsection 4

Sampling

26/90
Sampling

• If we have a PDF, p(x), then a sample from the PDF is denoted x ∼ p(x), where ∼ means "sample from" in this course.

• Sampling forms a critical part of modelling and inference.

• It also allows efficient integration in high dimensions.

27/90
Sampling
Samples (left); empirical and analytical probability distribution (right).

[Figure: (a, c) the sampled values xi against sample number for N = 10 and N = 100; (b, d) the corresponding empirical distributions compared with the analytical density.]

28/90
Sampling
[Figure: the same comparison for N = 10000 and N = 100000; the empirical distribution matches the analytical density closely for large N.]
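A sketch of how such a comparison might be generated (assuming scipy and matplotlib, and a standard normal for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

dist = stats.norm(loc=0, scale=1)
xs = np.linspace(-5, 5, 200)
for N in [10, 100, 10_000]:
    samples = dist.rvs(size=N, random_state=0)
    plt.hist(samples, bins=30, density=True, label="Empirical")
    plt.plot(xs, dist.pdf(xs), label="Analytical")
    plt.title(f"N={N}"); plt.legend(); plt.show()
```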

29/90
Subsection 5

Probabilities

30/90
Probabilities

The probability that a variable x ∼ p(x) lies between a and b is given by

P(a < x < b) = ∫_{a}^{b} p(x) dx    (26)
P(a < x < b) = P(x < b) − P(x < a)    (27)
P(a < x < b) = Fx(b) − Fx(a)    (28)

31/90
Probabilities

[Figure: three shaded areas under a PDF: A = P(x < a), B = P(x < b), and C = P(a < x < b).]

P(x < a) = ∫_{-∞}^{a} p(x) dx = Fx(a)    (29)
P(x < b) = ∫_{-∞}^{b} p(x) dx = Fx(b)    (30)
P(a < x < b) = ∫_{a}^{b} p(x) dx    (31)
            = ∫_{-∞}^{b} p(x) dx − ∫_{-∞}^{a} p(x) dx    (32)
            = Fx(b) − Fx(a)    (33)

32/90
Probabilities
[Figure: Example 1 (a = 0, b = 1): PDF and CDF with cdf(a) = 0.5 and cdf(b) = 0.841, so P(0 < x < 1) = 0.341. Example 2 (a = −1, b = 1): cdf(a) = 0.159 and cdf(b) = 0.841, so P(−1 < x < 1) = 0.682.]

33/90
Probabilities
[Figure: Example 3 (a = −2, b = 4): cdf(a) = 0.023 and cdf(b) = 1.0, so P(−2 < x < 4) ≈ 0.977. Example 4 (a = 4, b = 4.5): cdf(a) = 1.0 and cdf(b) = 1.0, so P(4 < x < 4.5) ≈ 0.]
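These examples can be reproduced with a short sketch (a standard normal is assumed here, which matches the cdf values shown):

```python
from scipy import stats

dist = stats.norm(loc=0, scale=1)
for a, b in [(0, 1), (-1, 1), (-2, 4), (4, 4.5)]:
    prob = dist.cdf(b) - dist.cdf(a)  # P(a < x < b) = F(b) - F(a)
    print(f"P({a} < x < {b}) = {prob:.3f}")
```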

34/90
Subsection 6

Percentiles and quantiles

35/90
Percentiles and quantiles

Consider the PDF p(x) with the CDF given by

Fx(X) = ∫_{-∞}^{X} p(x) dx    (34)

The qth percentile, denoted by Xq, is obtained by solving

Fx(Xq) = q/100    (35)

What value should Xq take to ensure that Fx(Xq) = q/100?

36/90
Percentiles and quantiles

Example:
Calculate the 95th percentile of a standard uniform distribution U(x; 0, 1).

Answer:
The 95th percentile is obtained by first calculating the CDF:

Fx(X) = ∫_{0}^{X} 1 dx = X,  0 ≤ X ≤ 1    (36)

Therefore, the 95th percentile is X95 = 0.95.

37/90
Percentiles and quantiles

Example:
Student-t distribution with 4 degrees-of-freedom, mean of 0 and a scaling
parameter of 1:

[Figure: the PDF of this Student-t distribution over x ∈ [−5, 5].]

What is the 75th percentile/3rd quartile?

38/90
Percentiles and quantiles
Method 1: Use a table for standardised t-distributions (outdated and tedious)

Figure: https://www.dummies.com/wp-content/uploads/451675.image0.jpg

Remember,

Fx(Xq) = ∫_{-∞}^{Xq} p(x) dx    (37)

is equivalent to

1 − Fx(Xq) = 1 − ∫_{-∞}^{Xq} p(x) dx    (38)
39/90
Percentiles and quantiles

Method 2: Use the CDF to find the percentile

[Figure: the CDF with the 75th percentile marked at 0.741.]

Hint: See scipy.stats.<distribution>.ppf for directly calculating the percentiles.

Therefore, the 75th percentile is 0.741.
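A sketch using scipy's ppf (percent point function, the inverse CDF), assuming the t-distribution above:

```python
from scipy import stats

# inverse CDF: the x at which the CDF reaches 0.75
x75 = stats.t.ppf(0.75, df=4, loc=0, scale=1)
print(x75)  # ≈ 0.741
```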

40/90
Percentiles and quantiles

Method 3: Calculate the 75th percentile of the samples

[Figure: a histogram of samples from the distribution; the 75th percentile of the samples is 0.745.]

How many samples are sufficient to reliably estimate the percentiles?
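A sketch of the sample-based estimate (assuming scipy and numpy); rerunning with different sample sizes gives a feel for the question above:

```python
import numpy as np
from scipy import stats

samples = stats.t.rvs(df=4, size=10_000, random_state=0)
print(np.percentile(samples, 75))  # ≈ 0.74; stabilises as the sample size grows
```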

41/90
Subsection 7

Likelihood

42/90
Likelihood

Consider the following generative model:

x ∼ p(x; θ)    (39)

The likelihood of the model with the parameters θ for the data point xn is given by

L(θ|xn) = p(xn; θ)    (40)

Example:

[Figure: left, a PDF with a data point and its likelihood marked; right, the log-PDF with the same data point and its log-likelihood marked.]

The likelihood measures how well the model agrees with the measured data.

43/90
Likelihood

The likelihood values are usually small and therefore the log-likelihood is usually calculated and reported:

log L(θ|xn) = log p(xn; θ)    (41)

If we have multiple data points and the data are independently sampled, we can calculate the likelihood as follows:

L(θ|x) = ∏_{n=1}^{N} p(xn; θ)    (42)

and the log-likelihood as follows:

log L(θ|x) = log ∏_{n=1}^{N} p(xn; θ)    (43)

which can be simplified as follows:

log L(θ|x) = Σ_{n=1}^{N} log p(xn; θ)    (44)
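A minimal sketch of evaluating a log-likelihood for i.i.d. data (a Gaussian model and hypothetical measurements are assumed):

```python
import numpy as np
from scipy import stats

data = np.array([1.2, 0.8, 2.1, 1.5, 0.3])  # hypothetical measurements

def log_likelihood(mu, sigma, x):
    # sum of log p(x_n; theta), more stable than multiplying densities
    return stats.norm.logpdf(x, loc=mu, scale=sigma).sum()

print(log_likelihood(1.0, 1.0, data))
```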

44/90
Subsection 8

Transformation of variables

45/90
Let x be from a known distribution with a probability density function pX(x) and a cumulative distribution function F(x). Let

y = ψ(x)    (45)

where ψ is a monotonic function and x = ψ⁻¹(y). Then,

pY(y) = pX(ψ⁻¹(y)) |dx/dy| = pX(ψ⁻¹(y)) |dψ⁻¹(y)/dy|    (46)

What is the probability distribution over

y = a · x + b    (47)

if x ∼ pX(x)?

46/90
What is the probability distribution over

y = a · x + b    (48)

if x ∼ pX(x)?

pY(y) = pX(ψ⁻¹(y)) |dx/dy|    (49)
pY(y) = pX((y − b)/a) · 1/|a|    (50)

If x ∼ N(x|µX, σX²),

pY(y) = N((y − b)/a | µX, σX²) · 1/|a|    (51)

which can be written as follows:

pY(y) = N(y | a · µX + b, a²σX²)    (52)

47/90
Simplification of N((y − b)/a | µX, σX²) · 1/|a|:

The Gaussian probability density function over x is given by

N(x|µX, σX²) = (1/√(2πσX²)) exp(−(x − µX)²/(2σX²))    (53)

The distribution of y is as follows:

pY(y) = (1/√(2πσX²)) exp(−((y − b)/a − µX)²/(2σX²)) · 1/|a|    (54)
      = (1/√(2πa²σX²)) exp(−(y − (a · µX + b))²/(2a²σX²))    (55)
      = (1/√(2πσY²)) exp(−(y − µY)²/(2σY²))    (56)
      = N(y | µY, σY²)    (57)
      = N(y | a · µX + b, a²σX²)    (58)

48/90
Example model 1
The equation

xn = θ + ϵn    (59)

where ϵn ∼ N(0, σ²), can be written as follows:

xn ∼ N(θ, σ²)    (60)

Example model 2
The equation

yn = w0 + w1 · xn + ϵn    (61)

where yn and xn are known and ϵn ∼ N(0, σ²), can be written as follows:

yn ∼ N(w0 + w1 · xn, σ²)    (62)

or

yn|xn ∼ N(w0 + w1 · xn, σ²)    (63)

or

yn|xn, w0, w1, σ² ∼ N(w0 + w1 · xn, σ²)    (64)
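A sketch of sampling from Example model 2 as a generative model (the parameter values below are assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
w0, w1, sigma = 1.0, 2.0, 0.5  # hypothetical parameters

x = np.linspace(0, 1, 50)  # known inputs
# y_n ~ N(w0 + w1 * x_n, sigma^2): deterministic part plus Gaussian noise
y = w0 + w1 * x + rng.normal(0, sigma, size=x.shape)
```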

49/90
Subsection 9

Multivariate distributions

50/90
Multivariate distributions

A multivariate density is a probability density defined over multiple variables.

We can write the density as p(x1, x2, x3) or p(x), where xᵀ = [x1, x2, x3].

The multivariate Gaussian distribution over a D-dimensional space is defined by

p(x; µ, Σ) = (1/((2π)^(D/2) |Σ|^(1/2))) exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ))    (65)

where µ ∈ R^D is the mean vector and Σ ∈ R^(D×D) is the covariance matrix.

51/90
Multivariate distributions

Examples of bivariate Gaussian distributions:

[Figure: two heatmaps of bivariate Gaussian densities over (a, b) ∈ [−10, 10]².]

• Left model: µᵀ = [1, 1]; Σ = [[5, 2], [2, 1]]
• Right model: µᵀ = [−1, 1]; Σ = [[1, −1], [−1, 4]]

Note:

∫ (1/((2π)^(D/2) |Σ|^(1/2))) exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ)) dx = 1    (66)

52/90
Multivariate distributions

We can marginalise over a to obtain a probability distribution over b:

p(b) = ∫ p(a, b) da    (67)

This is referred to as the marginal distribution over b.

We can integrate over c to obtain a probability distribution over a and b:

p(a, b) = ∫ p(a, b, c) dc    (68)

We can marginalise over c and b to obtain a probability distribution over a:

p(a) = ∫∫ p(a, b, c) db dc    (69)

where

∫∫∫ p(a, b, c) da db dc = 1    (70)

53/90
Multivariate distributions

The joint distribution p(a, b) can be decomposed as follows:

p(a, b) = p(b|a)p(a) = p(a|b)p(b),  (Product rule)    (71)

where p(b|a) denotes the conditional distribution of b given a, i.e. what we know about b given observations of a.

If b is independent of a, then

p(b|a) = p(b)    (72)

Bayes' rule:

p(a|b) = p(b|a)p(a) / p(b)    (73)

54/90
Multivariate distributions

Example:
If we go back to the urn example in the introduction . . .

[Figure: the two urn scenarios (a) and (b) from the introduction.]

1. Calculate the joint distribution p(colour, urn).

2. Calculate the marginal distribution p(colour).

3. Calculate the conditional distribution p(colour|urn).

4. Calculate the conditional distribution p(urn|colour).

55/90
Multivariate distributions

This is the frequency table we filled in previously:


Description Urn 1 Urn 2 Total
Blue circle (B) 2 6 8
Red square (R) 8 3 11
Green triangle (G) 0 1 1
Total 10 10 20

56/90
Multivariate distributions

The joint distribution for this problem is denoted

p(colour, urn) (74)

and given by the following table:


Description Urn 1 Urn 2 Total
Blue circle (B) 0.1 0.3 0.4
Red square (R) 0.4 0.15 0.55
Green triangle (G) 0.0 0.05 0.05
Total 0.5 0.5 1.0
Σ_colour Σ_urn p(colour, urn) = 1    (75)

57/90
Multivariate distributions

The marginal distribution

p(colour) = Σ_urn p(colour, urn)    (76)

is given by the following table:


Description Urn 1 Urn 2 P(colour)
Blue circle (B) 0.1 0.3 0.4
Red square (R) 0.4 0.15 0.55
Green triangle (G) 0.0 0.05 0.05
Total 0.5 0.5 1.0
Σ_colour p(colour) = 1    (77)

58/90
Multivariate distributions

The conditional distribution for this problem is denoted

p(colour|urn) = p(colour, urn) / p(urn)    (78)

and given by the following table (each column sums to one):

Description          Urn 1   Urn 2
Blue circle (B)      0.2     0.6
Red square (R)       0.8     0.3
Green triangle (G)   0.0     0.1
Total                1.0     1.0

Σ_colour p(colour|urn) = 1 for each urn    (79)

59/90
Multivariate distributions

Calculate P(U1|B):

P(U1|B) = P(B|U1)P(U1) / P(B)    (80)
        = (0.2 · 0.5) / 0.4    (81)
        = 0.25    (82)

Calculate P(U2|B):

P(U2|B) = P(B|U2)P(U2) / P(B)    (83)
        = (0.6 · 0.5) / 0.4    (84)
        = 0.75    (85)

60/90
Multivariate distributions

Consider the following joint distribution:

[Figure: a heatmap of a bivariate Gaussian joint density over (a, b) ∈ [−10, 10]².]

• Calculate the marginal distributions of a and b.

• Calculate the conditional distributions for b = 2 and b = −1.

61/90
Multivariate distributions

The marginal distributions associated with this joint distribution are given by

[Figure: the analytical and numerical marginal densities p(a) and p(b).]

Analytical: p(a) = N(a|µa, Σaa) (see Bishop, 2006).

Numerical:

p(a) = ∫ N(x|µ, Σ) db    (86)

where x = [a, b].

62/90
Multivariate distributions
Consider b = 2 and b = −1:

[Figure: the joint density heatmap with slices marked at b = 2.0 and b = −1.0.]

The associated conditional distributions are given by:

[Figure: the conditional densities p(a|b = 2.0) (left) and p(a|b = −1.0) (right).]
63/90
Multivariate distributions

See Chapter 2 of Bishop (2006) for an extensive overview of univariate and multivariate probability distributions.

Useful equations:
• Marginal and conditional distributions (I): Equations 2.94-2.98
• Marginal and conditional distributions (II): Equations 2.113-2.117

A numerical sketch of these formulas follows.
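A sketch of the marginal and conditional formulas for a bivariate Gaussian, following the partitioned-Gaussian identities in Bishop (2006); the mean and covariance below are the left model from earlier, used here for illustration:

```python
import numpy as np
from scipy import stats

mu = np.array([1.0, 1.0])
Sigma = np.array([[5.0, 2.0],
                  [2.0, 1.0]])

# marginal over a: just drop the b entries of mu and Sigma
p_a = stats.norm(loc=mu[0], scale=np.sqrt(Sigma[0, 0]))
print(p_a.mean(), p_a.var())

# conditional p(a|b): mean mu_a + S_ab S_bb^-1 (b - mu_b),
#                     variance S_aa - S_ab S_bb^-1 S_ba
b = 2.0
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (b - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
print(cond_mean, cond_var)  # parameters of p(a | b = 2)
```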

64/90
Subsection 10

Expected values

65/90
Expected values

Expected values are often used to understand the performance of models.


Therefore, we need to understand what they mean and know how they are
calculated.

Consider a dartboard:

[Figure: a dartboard drawn as concentric circles on the square [−1, 1]².]
Four people throw darts at the board to see who is the most accurate. The aim is to throw the darts as close as possible to the centre of the board. Each person gets 10 throws.
66/90
Expected values

These are the results from the four people:


[Figure: scatter plots of the 10 throws for Attempts 1-4 on the dartboard.]

Who was the most accurate?


67/90
Expected values

Summary:
• Attempt 1: Close to the centre, low variance
• Attempt 2: Far from the centre, low variance
• Attempt 3: Close to the centre, high variance
• Attempt 4: Far from the centre, high variance

Ideally, we want the darts close to the centre, with a low variance.

• Expected value of throw: average position from the centre for an infinite number of throws
• Variance of throw: average squared distance from the expected value for an infinite number of throws
We need to consider both of them when evaluating the accuracy of the throws.

68/90
Expected values

Consider a Gaussian with a mean of 2 and a standard deviation of 3 (variance of 9), i.e. N(x; 2, 3²).

If we draw 10 samples from this Gaussian,

• What is the mean of the samples?

• Would there be a difference in the data if we draw another 10 samples?

• Would there be a difference in the mean of the data if we draw another 10 samples?

If we repeat this exercise but draw 10⁶ samples instead, what would be different? (A small sampling sketch follows.)
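A sketch of this experiment, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(1)
for n in [10, 10, 10**6, 10**6]:
    samples = rng.normal(loc=2, scale=3, size=n)
    # sample means fluctuate a lot for n = 10, very little for n = 10^6
    print(n, samples.mean())
```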

69/90
Expected values

The distribution N(x; 2, 3²) has the following form:

[Figure: the PDF of N(x; 2, 3²) over x ∈ [−10, 10].]

Most of the data points are between −7 and 11.

What is the analytical mean of the distribution?

70/90
Expected values

Consider 10 samples from N(x; 2, 3²):

[Figure: the 10 sampled values xi plotted against sample number.]

What is the mean of the data?

The mean of the data is calculated with the following equation¹:

g(x) = (1/N) Σ_{i=1}^{N} xi    (87)

where the data vector is denoted x = [x1, . . . , xN].

¹ Please note that g(x) is a definition of an arbitrary function of the data, e.g. g(x) = (1/N) Σᵢ xi, g(x) = (1/N) Σᵢ |xi|, or g(x) = x1.

71/90
Expected values

g(x) is superimposed on the data:

[Figure: the 10 samples with the value g(x) = 1.5675 superimposed.]

If we generate data again, what do we expect would happen with the data and
g(x)?

72/90
Expected values

The data (left) and the data with the corresponding g(x) (right) for 10 repeats:

[Figure: the sampled data for 10 repeats, with the corresponding g(x) values superimposed on the right.]

• Data are random - generated from a random process.


• The function g(x) is dependent on the data and therefore also random.
• How can we quantify the behaviour of g(x)?

73/90
Expected values

If we repeat this process 100,000 times, then g(x) has the following behaviour:

[Figure: g(x) plotted against repeat number for 100,000 repeats.]

g(x) varies significantly around 2.

74/90
Expected values

The distribution over g(x) has the following form:

[Figure: a histogram of g(x) over the repeats, centred near 2.]

This is referred to as the sampling distribution of g(x). (More on this later)

Do not confuse it with the distribution of x!

75/90
Expected values

Distribution of x and the distribution of g(x):

[Figure: the wide distribution of x (left) next to the much narrower sampling distribution of g(x) (right).]

76/90
Expected values

The expected value quantifies what would happen if we could repeat and average an infinite number of experiments.

Example: (Continued)
The (approximate) expected value of g(x) is superimposed on g(x) for each run:

[Figure: g(x) for every repeat with E{g(x)} superimposed.]

The expected value of g(x), denoted E(g(x)), is the average of g(x) over an
infinite number of experiments.

77/90
Expected values
Definition:
The expected value of a function of a random variable x, denoted g(x), is given by

E_{x∼p(x)}{g(x)} = ∫_{-∞}^{∞} g(x) p(x) dx    (88)

Notation:
• E_{x∼p(x)} denotes the expected value where x has a PDF of p(x).
• E_{x∼p(x)}{g(x)} is therefore the expected value of g(x) where x ∼ p(x).
• Sometimes x ∼ p(x) is omitted from the notation.

The expected value tells us on average how far we are from the centre of the dartboard.

The expected value is calculated as an integral and therefore the properties of integrals can be used in the calculation, e.g. let q(x) = e(x) · g(x) + f(x) + h(x); then

E_{x∼p(x)}{q(x)}    (89)
= ∫_{-∞}^{∞} (e(x) · g(x) + f(x) + h(x)) p(x) dx    (90)
= E_{x∼p(x)}{e(x) · g(x)} + E_{x∼p(x)}{f(x)} + E_{x∼p(x)}{h(x)}    (91)
78/90
Expected values

Definition: The variance of g(x) tells us the dispersion of g(x) under the distribution x ∼ p(x):

var_{x∼p(x)}{g(x)} = ∫_{-∞}^{∞} (g(x) − E_{x∼p(x)}{g(x)})² p(x) dx    (92)

The variance tells us how far the different throws are from the average value of the throws.

79/90
Expected values

Example: (Continued)
The expected value and the variance of g(x) are quantified:

[Figure: g(x) for every repeat with E{g(x)} and E{g(x)} ± 3 · std{g(x)} superimposed.]

Therefore, even though the data are random, we have a bound on g(x)!

How do we calculate the expected value and variance in practice?

80/90
What is the expected value and variance of

y = a · x + b    (93)

if x ∼ N(x|µX, σX²)?

81/90
The expected value is given by:

E_{x∼p(x)}{y(x)} = a · E_{x∼p(x)}{x} + b = a · µX + b    (94)

The variance of y is given by:

E_{x∼p(x)}{(y − µY)²} = E_{x∼p(x)}{(a · x + b − a · µX − b)²}    (95)
                      = a² E_{x∼p(x)}{(x − µX)²}    (96)
                      = a² σX²    (97)

Compare these answers against the distribution of y.
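One way to do that comparison, sketched with numpy (the values a = 2, b = 1, µX = 3, and σX = 4 are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, mu_x, sigma_x = 2.0, 1.0, 3.0, 4.0  # hypothetical values

y = a * rng.normal(mu_x, sigma_x, size=1_000_000) + b
print(y.mean(), a * mu_x + b)      # ≈ a·mu_x + b = 7
print(y.var(), a**2 * sigma_x**2)  # ≈ a²·sigma_x² = 64
```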

82/90
Subsection 11

Monte Carlo integration

83/90
Monte Carlo integration

Consider an integral of the form:

∫_{-∞}^{∞} p(x) f(x) dx    (98)

where p(x) is a probability density function and f(x) is an arbitrary function of x.

How do we calculate this?

• Analytically - only possible for specific combinations of f(x) and p(x).
• Numerical integration (e.g. trapezoidal, Simpson) - scales poorly to high dimensions.
• Monte Carlo - we need to have samples available, which is not always easy.

84/90
Monte Carlo integration

Let us assume we have a method to generate samples from a distribution x ∼ p(x).

We can write this integral

∫_{-∞}^{∞} p(x) f(x) dx    (99)

in the form of a summation:

∫_{-∞}^{∞} p(x) f(x) dx = lim_{N→∞} (1/N) Σ_{i=1}^{N} f(xi),  xi ∼ p(x)    (100)

Therefore, we need to sample from p(x) and then calculate the mean of f(x) evaluated at the samples.

85/90
Monte Carlo integration

Exercise:
Calculate the integral

∫_{-∞}^{∞} N(x; 1, 2²) x dx    (101)

using scipy.integrate.quad and Monte Carlo integration.

What is the analytical value of the integral?

(Hint: This integral calculates the expected value of x.)
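A sketch of both approaches (assuming scipy; the analytical value is E{x} = 1):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

dist = stats.norm(loc=1, scale=2)

# quadrature over a wide finite domain covering nearly all probability mass
val, _ = quad(lambda x: dist.pdf(x) * x, -50, 50)
print(val)  # ≈ 1.0

# Monte Carlo: average f(x) = x over samples from p(x)
samples = dist.rvs(size=100_000, random_state=0)
print(samples.mean())  # ≈ 1.0, varies from run to run
```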

86/90
Monte Carlo integration

Exercise:
Calculate the integral

∫_{-∞}^{∞} N(x; 1, 2²) x dx    (102)

• scipy.integrate.quad²: 1.0
• Monte Carlo integration³: 1.0004166378896788 ± 0.020921105045549167

² The integral is not performed from −∞ to ∞. Instead, a large domain that covers the majority of the probability mass of p(x) is used in the integration.
³ Only a single scalar estimate is obtained when performing Monte Carlo integration. The Monte Carlo estimate is a random variable, as we are using randomly sampled values from p(x) to estimate the integral. This means that each time we generate new samples, a new estimate is obtained. The expected value of the estimate is unbiased, and the variance of the Monte Carlo estimate depends on the number of samples used. In this example, the mean and the standard deviation of the estimates are shown to highlight that the Monte Carlo estimate is random.

87/90
Monte Carlo integration

Exercise:
Calculate the integral

∫_{-∞}^{∞} N(x; 1, 2²) x⁴ dx    (103)

• scipy.integrate.quad: 73.00
• Monte Carlo integration: ≈ 73

88/90
Subsection 12

Conclusion

89/90
Conclusion

In this lecture we developed the language needed to build engineering models from the available data:
• PDF, CDF
• Probabilities, percentiles, quantiles
• Likelihood
• Marginal and conditional distributions
• Transformations
• Expected values
• Monte Carlo integration, i.e. integration through sampling.

90/90
