Slides ProbTheoryStats
Stephan Schmidt
Please let me know if anything needs to be clarified in the slides (e.g.
something that is not apparent) or if there are mistakes/typos.
Outline
Probability theory
Introduction
Probability density functions
Cumulative distributions
Sampling
Probabilities
Percentiles and quantiles
Likelihood
Transformation of variables
Multivariate distributions
Expected values
Monte Carlo integration
Conclusion
Probability theory

Subsection 1
Introduction
Introduction

[Figure: two urns, panels (a) and (b), containing coloured shapes]
Introduction
Given you have selected Urn 1, what is the probability of selecting a red square,
a green triangle, or a blue circle?
Introduction
Given you have selected a blue circle, what is the probability that you have
sampled from Urn 1?

P(U1 | B) = 2/8 = 0.25    (10)

P(U2 | B) = 6/8 = 0.75    (11)
Introduction
Given you have selected a green triangle, what is the probability that you have
sampled from Urn 1?

P(U1 | G) = 0/1 = 0    (12)

P(U2 | G) = 1/1 = 1    (13)

Given you have selected a green triangle and a blue circle, what is the
probability that you selected Urn 1 at random?
Subsection 2
Probability density functions
Probability density functions
The function p(x) is a probability density function over the continuous random
variable x if and only if p(x) ≥ 0 for all x and p(x) integrates to one.

Example:

p(x) = c · (1 + x)^(−1)   if 0 ≤ x ≤ 1
p(x) = 0                  otherwise    (15)
Probability density functions
• Normal/Gaussian distribution:

  N(x; µ, σ²) = 1/√(2πσ²) · exp(−(x − µ)²/(2σ²))    (16)

• Laplace distribution:

  L(x; µ, b) = 1/(2b) · exp(−|x − µ|/b)    (17)

• Gamma distribution:

  Gamma(τ; a, b) = (1/Γ(a)) · b^a · τ^(a−1) · e^(−bτ)    (18)

• Beta distribution:

  Beta(µ; a, b) = (Γ(a + b)/(Γ(a)Γ(b))) · µ^(a−1) · (1 − µ)^(b−1)    (19)

• Bernoulli distribution:

  Bern(x; µ) = µ^x · (1 − µ)^(1−x),   x ∈ {0, 1}    (20)
Probability density functions
NB: Always check the form of the function (there are often different
parametrisations of the same distribution)!
Probability density functions
These are the functions that are available for the Normal distribution in
scipy.stats:
• rvs - sampling
• pdf - probability density function
• cdf - cumulative distribution function
• logpdf and logcdf - more stable than np.log(pdf) and np.log(cdf)
The same methods are available for the other distributions (e.g. Laplace).
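As a quick illustration, here is a minimal sketch of the four methods above for scipy.stats.norm (the evaluation points are my own choice; the standard normal is assumed):

```python
import numpy as np
from scipy import stats

# rvs: draw samples from N(0, 1)
samples = stats.norm.rvs(loc=0, scale=1, size=1000, random_state=0)
print(samples.shape)

# pdf and cdf evaluated at x = 0
print(stats.norm.pdf(0.0))  # 1/sqrt(2*pi) ≈ 0.3989
print(stats.norm.cdf(0.0))  # 0.5, by symmetry

# logpdf is more stable than np.log(pdf) deep in the tails:
print(stats.norm.logpdf(-40.0))        # finite, ≈ -800.9
print(np.log(stats.norm.pdf(-40.0)))   # -inf: the pdf underflows to zero
```

The last two lines show exactly why logpdf/logcdf exist: the density itself underflows long before its logarithm does.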
Probability density functions
Common probability density functions:

[Figure: (a) Normal/Gaussian and (b) Laplace PDFs for loc=0, scale=1; loc=1,
scale=1; loc=-2.5, scale=0.5. (c) Beta PDFs for a=1.5, b=1; a=1, b=1; a=2, b=1.
(d) Gamma PDFs for a=1, 2, 5 with loc=0, scale=1.]
Probability density functions
(a) What does the distribution of a random variable u3, which is calculated from
the sum of two uniformly distributed variables u3 = u1 + u2, look like?
(b) What does the distribution of a random variable u3, which is calculated from
the difference of two uniformly distributed variables u3 = u1 − u2, look like?

[Figure: (a) samples of u1 and (b) samples of u2, both uniform on [0, 1]]
Probability density functions
Example:
Which option is correct for p(u2 + u1)?

[Figure: four candidate histograms (a)-(d) for u1 + u2 over different ranges]
Probability density functions
Which option is correct for p(u1 − u2)?

[Figure: four candidate histograms (a)-(d) for u1 − u2 over different ranges]
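One way to check both questions empirically is to simulate the two uniform variables and histogram the sum and the difference. A sketch (the bin edges and sample size are my own choices):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
u1 = rng.uniform(0.0, 1.0, size=n)
u2 = rng.uniform(0.0, 1.0, size=n)

# The sum is supported on [0, 2] and the difference on [-1, 1]; in both cases
# the mass concentrates in the middle (a triangular shape), because many more
# (u1, u2) combinations produce a central value than an extreme one.
dens_sum, _ = np.histogram(u1 + u2, bins=[0.0, 0.5, 1.5, 2.0], density=True)
dens_diff, _ = np.histogram(u1 - u2, bins=[-1.0, -0.5, 0.5, 1.0], density=True)

print(dens_sum)    # middle-bin density clearly exceeds the outer bins
print(dens_diff)
```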
Subsection 3
Cumulative distributions

Cumulative distributions
• P(X1 ≤ x ≤ X2) = P(x ≤ X2) − P(x ≤ X1)
• p_x(x) = d F_x(x) / dx
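The second bullet can be verified numerically: a central finite difference of the CDF recovers the PDF. A sketch using the standard normal (the evaluation points are arbitrary):

```python
import numpy as np
from scipy import stats

x = np.array([-2.0, 0.0, 1.5])
h = 1e-6

# d/dx F(x) via central differences should match p(x)
numeric_pdf = (stats.norm.cdf(x + h) - stats.norm.cdf(x - h)) / (2 * h)

print(numeric_pdf)
print(stats.norm.pdf(x))   # agrees to roughly 1e-9
```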
Cumulative distributions
Example
A uniform distribution U(x; a, b) has the following probability density function:

U(x; a, b) = 1/(b − a)   if a ≤ x ≤ b
U(x; a, b) = 0           otherwise    (23)

[Figure: the PDF (left) and CDF (right) of U(x; 0, 1)]
Cumulative distributions
Common probability density functions (left) with their cumulative distributions (right):

[Figure: (a) Normal/Gaussian PDFs and (b) their CDFs; (c) Laplace PDFs and (d)
their CDFs, each for loc=0, scale=1; loc=1, scale=1; loc=-2.5, scale=0.5]
Cumulative distributions

[Figure: (a) Beta PDFs and (b) their CDFs for a=1.5, b=1; a=1, b=1; a=2, b=1;
(c) Gamma PDFs and (d) their CDFs for a=1, 2, 5 with loc=0, scale=1]
Subsection 4
Sampling

Sampling
• If we have a PDF, p(x), then a sample from the PDF is denoted x ∼ p(x),
  where ∼ means "sampled from" in this course.
Sampling
Samples (left); empirical and analytical probability distribution (right).

[Figure: samples x_i and the corresponding empirical vs analytical densities for
N = 10, 100, 10000, and 100000; the empirical histogram approaches the
analytical PDF as N grows]
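The convergence shown in the figure can be reproduced with a few lines; a sketch for the standard normal (the sample sizes follow the panels above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Empirical moments of N(0, 1) samples approach the analytical values
# (mean 0, standard deviation 1) as N grows
for n in (10, 100, 10_000, 100_000):
    x = stats.norm.rvs(size=n, random_state=rng)
    print(n, round(x.mean(), 3), round(x.std(), 3))
```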
Subsection 5
Probabilities
Probabilities

[Figure: three shaded regions under a Gaussian PDF: x < a (region A),
x < b (region B), and a < x < b (region C)]

P(x < a) = ∫_{−∞}^{a} p(x) dx = F_x(a)    (29)

P(x < b) = ∫_{−∞}^{b} p(x) dx = F_x(b)    (30)

P(a < x < b) = ∫_{a}^{b} p(x) dx    (31)
             = ∫_{−∞}^{b} p(x) dx − ∫_{−∞}^{a} p(x) dx    (32)
Probabilities

[Figure: standard normal PDF (left) and CDF (right) for four examples:
Example 1: a = 0, b = 1, cdf(a) = 0.5, cdf(b) = 0.841
Example 2: a = −1, b = 1, cdf(a) = 0.159, cdf(b) = 0.841
Example 3: a = −2, b = 4, cdf(a) = 0.023, cdf(b) = 1.0
Example 4: a = 4, b = 4.5, cdf(a) = 1.0, cdf(b) = 1.0]
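All four examples reduce to evaluating P(a < x < b) = F_x(b) − F_x(a) with the CDF; a sketch reproducing the first three with scipy.stats.norm:

```python
from scipy import stats

cdf = stats.norm.cdf   # standard normal CDF

# Example 1: a = 0, b = 1
print(round(cdf(0.0), 3), round(cdf(1.0), 3))    # 0.5 0.841
print(round(cdf(1.0) - cdf(0.0), 3))             # P(0 < x < 1) = 0.341

# Example 2: a = -1, b = 1
print(round(cdf(1.0) - cdf(-1.0), 3))            # 0.683

# Example 3: a = -2, b = 4
print(round(cdf(4.0) - cdf(-2.0), 3))            # 0.977
```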
Subsection 6
Percentiles and quantiles
Percentiles and quantiles
Example:
Calculate the 95th percentile of a uniform distribution.
Answer:
The 95th percentile of the uniform distribution U(x; 0, 1) is obtained by firstly
calculating the CDF:

F_x(X) = ∫_0^X 1 dx = X,   0 ≤ X ≤ 1    (36)

Setting F_x(X) = 0.95 and solving gives X = 0.95, i.e. the 95th percentile is 0.95.
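The same answer drops out of scipy's percent-point function (the inverse CDF):

```python
from scipy import stats

# ppf is the inverse CDF, so the 95th percentile of U(0, 1) is ppf(0.95)
print(stats.uniform.ppf(0.95))   # 0.95
```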
Percentiles and quantiles
Example:
Student-t distribution with 4 degrees-of-freedom, mean of 0 and a scaling
parameter of 1:

[Figure: PDF of the Student-t distribution with 4 degrees of freedom]
Percentiles and quantiles
Method 1: Use a table for standardised t-distributions (outdated and tedious)
Figure: https://www.dummies.com/wp-content/uploads/451675.image0.jpg
Remember,

F_x(X_q) = ∫_{−∞}^{X_q} p(x) dx    (37)

is equivalent to

1 − F_x(X_q) = 1 − ∫_{−∞}^{X_q} p(x) dx    (38)
Percentiles and quantiles

[Figure: CDF of the Student-t distribution with the 75th percentile marked;
the 75th percentile is 0.741]
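The marked percentile can be reproduced directly with the inverse CDF:

```python
from scipy import stats

# 75th percentile of a Student-t distribution with 4 degrees of freedom
q75 = stats.t.ppf(0.75, df=4)
print(round(q75, 3))   # 0.741
```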
Percentiles and quantiles

[Figure: PDF of the Student-t distribution]
Subsection 7
Likelihood
Likelihood

x ∼ p(x; θ)    (39)

The likelihood of the model with the parameters θ for the data point x_n is
given by

L(θ|x_n) = p(x_n; θ)    (40)

Example:

[Figure: a PDF with a data point and its likelihood (left); the corresponding
log-PDF with the log-likelihood (right)]

The likelihood measures how well the model agrees with the measured data.
Likelihood
The likelihood values are usually small and therefore the log-likelihood is
usually calculated and reported:

log L(θ|x_n) = log p(x_n; θ)    (41)

If we have multiple data points and the data are independently sampled, we
can calculate the likelihood as follows:

L(θ|x) = ∏_{n=1}^{N} p(x_n; θ)    (42)
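A minimal sketch of both quantities for a Gaussian model (the data points and parameter values are made up for illustration):

```python
import numpy as np
from scipy import stats

# Likelihood and log-likelihood of the model p(x; θ) = N(x; µ, σ²)
# for independently sampled data points
x = np.array([0.3, 1.1, 2.0, 0.7])
mu, sigma = 1.0, 1.0

likelihood = np.prod(stats.norm.pdf(x, loc=mu, scale=sigma))
log_likelihood = np.sum(stats.norm.logpdf(x, loc=mu, scale=sigma))

print(likelihood)        # a small number, shrinking as more data are added
print(log_likelihood)    # equals log(likelihood), but numerically stable
```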
Subsection 8
Transformation of variables
Let x be from a known distribution with a probability density function pX(x)
and a cumulative distribution function F(x). Let

y = ψ(x)    (45)

where ψ is a monotonic function and x = ψ^(−1)(y). Then,

pY(y) = pX(ψ^(−1)(y)) |dx/dy| = pX(ψ^(−1)(y)) |dψ^(−1)(y)/dy|    (46)

What is the probability distribution over

y = a · x + b    (47)

if x ∼ pX(x)?
What is the probability distribution over

y = a · x + b    (48)

if x ∼ pX(x)?

pY(y) = pX(ψ^(−1)(y)) |dx/dy|    (49)

pY(y) = pX((y − b)/a) · (1/|a|)    (50)

If x ∼ N(x | µX, σX²), then

pY(y) = N((y − b)/a | µX, σX²) · (1/|a|)    (51)

which can be written as follows:

pY(y) = N(y | a · µX + b, a²σX²)    (52)
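Equation (52) can be checked by transforming samples; a sketch with arbitrarily chosen values of a, b, µX, and σX:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 3.0
mu_x, sigma_x = 1.0, 0.5

# Push samples of x ~ N(mu_x, sigma_x^2) through y = a*x + b and compare the
# empirical moments against N(a*mu_x + b, a^2 * sigma_x^2)
x = rng.normal(mu_x, sigma_x, size=200_000)
y = a * x + b

print(y.mean())   # ≈ a*mu_x + b = 5.0
print(y.std())    # ≈ |a|*sigma_x = 1.0
```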
Simplification of N((y − b)/a | µX, σX²) · (1/|a|):
The Gaussian probability density function over x is given by

N(x | µX, σX²) = 1/√(2πσX²) · exp(−(x − µX)²/(2σX²))    (53)
Example model 1
The equation

x_n = θ + ϵ_n    (59)

where ϵ_n ∼ N(0, σ²), can be written as follows:

x_n ∼ N(θ, σ²)    (60)

Example model 2
The equation

y_n = w0 + w1 · x_n + ϵ_n    (61)

where y_n and x_n are known and ϵ_n ∼ N(0, σ²), can be written as follows:

y_n ∼ N(w0 + w1 · x_n, σ²)    (62)

or

y_n | x_n ∼ N(w0 + w1 · x_n, σ²)    (63)

or

y_n | x_n, w0, w1, σ² ∼ N(w0 + w1 · x_n, σ²)    (64)
Subsection 9
Multivariate distributions
Multivariate distributions

[Figure: contour plots of two bivariate Gaussian densities over (a, b)]

• Left model: µᵀ = [1, 1]; Σ = [[5, 2], [2, 1]]
• Right model: µᵀ = [−1, 1]; Σ = [[1, −1], [−1, 4]]

Note:

∫_x 1/((2π)^(D/2) |Σ|^(1/2)) · exp(−½ (x − µ)ᵀ Σ^(−1) (x − µ)) dx = 1    (66)
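Equation (66) can be checked numerically for the left model; a sketch using a coarse Riemann sum over a grid that covers essentially all of the probability mass (the grid limits are my own choice):

```python
import numpy as np
from scipy import stats

# The left model from the contour plots
mu = np.array([1.0, 1.0])
cov = np.array([[5.0, 2.0], [2.0, 1.0]])
mv = stats.multivariate_normal(mean=mu, cov=cov)

# Approximate the 2D integral of the density with a grid sum
grid = np.linspace(-15.0, 15.0, 301)
aa, bb = np.meshgrid(grid, grid)
pts = np.dstack([aa, bb])
da = grid[1] - grid[0]

print((mv.pdf(pts) * da * da).sum())   # ≈ 1.0, as equation (66) requires
```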
Multivariate distributions
A joint density must integrate to one over all of its variables:

∫_c ∫_b ∫_a p(a, b, c) da db dc = 1    (70)
Multivariate distributions
If b is independent of a, then p(b|a) = p(b) and p(a, b) = p(a)p(b).
Bayes' rule:

p(a|b) = p(b|a) p(a) / p(b)    (73)
Multivariate distributions
Example:
If we go back to the urn example in the introduction . . .

[Figure: the two urns, panels (a) and (b)]
Multivariate distributions

P(U1|B) = P(B|U1)P(U1) / P(B)    (80)
        = (0.2 · 0.5) / 0.4    (81)
        = 0.25    (82)

P(U2|B) = P(B|U2)P(U2) / P(B)    (83)
        = (0.6 · 0.5) / 0.4    (84)
        = 0.75    (85)
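The arithmetic is easy to verify, including that P(B) = 0.4 follows from the law of total probability:

```python
# Bayes' rule with the values from the urn example:
# P(B|U1) = 0.2, P(B|U2) = 0.6, P(U1) = P(U2) = 0.5
p_b_given_u1, p_b_given_u2 = 0.2, 0.6
p_u1 = p_u2 = 0.5

# Law of total probability: P(B) = P(B|U1)P(U1) + P(B|U2)P(U2)
p_b = p_b_given_u1 * p_u1 + p_b_given_u2 * p_u2

print(p_b)                          # 0.4
print(p_b_given_u1 * p_u1 / p_b)    # P(U1|B) = 0.25
print(p_b_given_u2 * p_u2 / p_b)    # P(U2|B) = 0.75
```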
Multivariate distributions

[Figure: contour plot of the left model's bivariate Gaussian density over (a, b)]
Multivariate distributions

[Figure: analytical vs numerical marginal densities p(a) and p(b)]

Numerical:

p(a) = ∫ N(x|µ, Σ) db    (86)
Multivariate distributions
Consider b = 2 and b = −1:

[Figure: the joint density with slices at b = 2.0 and b = −1.0 (top), and the
conditional densities p(a|b = 2.0) and p(a|b = −1.0) (bottom)]
Multivariate distributions
Useful equations:
• Marginal and conditional distributions (I): Equations 2.94-2.98
• Marginal and conditional distributions (II): Equations 2.113-2.117
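As an illustration of those conditional-Gaussian equations, here is a sketch for the left model above, using the standard partitioned-Gaussian result p(a|b) = N(µ_a + Σ_ab/Σ_bb · (b − µ_b), Σ_aa − Σ_ab²/Σ_bb) (the slice values b = 2 and b = −1 match the earlier figure):

```python
import numpy as np

# Left model: mu = [1, 1], Sigma = [[5, 2], [2, 1]]
mu = np.array([1.0, 1.0])
S = np.array([[5.0, 2.0], [2.0, 1.0]])

def condition_on_b(b):
    """Mean and variance of the 1D conditional p(a | b)."""
    mean = mu[0] + S[0, 1] / S[1, 1] * (b - mu[1])
    var = S[0, 0] - S[0, 1] ** 2 / S[1, 1]
    return mean, var

print(condition_on_b(2.0))    # (3.0, 1.0)
print(condition_on_b(-1.0))   # (-3.0, 1.0)
```

Note that the conditional variance does not depend on the observed value of b; only the conditional mean shifts.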
Subsection 10
Expected values
Expected values
Consider a dartboard:

[Figure: a dartboard centred at the origin]

Four people are throwing darts at the board to see who can throw the most
accurately. The aim is to throw the darts as close as possible to the centre of the
board. Each person gets 10 throws.
Expected values

[Figure: scatter plots of the 10 throws for Attempts 1-4]

Summary:
• Attempt 1: Close to the centre, low variance
• Attempt 2: Far from the centre, low variance
• Attempt 3: Close to the centre, high variance
• Attempt 4: Far from the centre, high variance
Ideally, we want the darts close to the centre, with a low variance.
Expected values
If we repeat this exercise but draw 10^6 samples instead, what would be
different?

[Figure: histogram of the samples]
Expected values

[Figure: 10 samples x_i plotted against sample number]

Note: g(x) denotes an arbitrary function of the data, e.g.
g(x) = (1/N) Σ_i x_i,  g(x) = (1/N) Σ_i |x_i|,  or  g(x) = x_1.
Expected values

[Figure: the same 10 samples with g(x) = 1.5675 superimposed]

If we generate data again, what do we expect would happen with the data and
g(x)?
Expected values
The data (left) and the data with the corresponding g(x) (right) for 10 repeats:

[Figure: the 10 repeated datasets and the corresponding g(x) for each repeat]

If we repeat this process 100,000 times, then g(x) has the following behaviour:

[Figure: the g(x) values across the 100,000 repeats]
Expected values

[Figure: histogram of the data x (left) and histogram of g(x) (right)]
Expected values
The expected value quantifies what would happen if we could repeat and
average an infinite number of experiments.
Example: (Continued)
The (approximate) expected value of g(x) is superimposed on g(x) for each
run:

[Figure: g(x) for each run with E{g(x)} superimposed]

The expected value of g(x), denoted E{g(x)}, is the average of g(x) over an
infinite number of experiments.
Expected values
Definition:
The expected value of a function of a random variable x, denoted g(x), is given
by

E_{x∼p(x)}{g(x)} = ∫_{−∞}^{∞} g(x) p(x) dx    (88)

Notation:
• E_{x∼p(x)} denotes the expected value where x has a PDF of p(x).
• E_{x∼p(x)}{g(x)} is therefore the expected value of g(x) where x ∼ p(x).
• Sometimes x ∼ p(x) is omitted from the notation.
The expected value tells us on average how far we are from the centre of the
dart board.
Definition:
The variance of g(x) tells us the dispersion of g(x) under the distribution
x ∼ p(x):

var_{x∼p(x)}{g(x)} = ∫_{−∞}^{∞} (g(x) − E_{x∼p(x)}{g(x)})² p(x) dx    (92)

The variance tells us how far the different throws are from the average value of
the throws.
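Both definitions can be approximated from samples; a sketch with the arbitrary choice g(x) = x² for x ∼ N(0, 1), where analytically E{g(x)} = 1 and var{g(x)} = 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample-based approximation of E{g(x)} and var{g(x)} for g(x) = x^2
x = rng.normal(0.0, 1.0, size=500_000)
g = x ** 2

print(g.mean())   # ≈ E{x^2} = 1.0
print(g.var())    # ≈ var{x^2} = E{x^4} - E{x^2}^2 = 3 - 1 = 2.0
```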
Expected values
Example: (Continued)
The expected value and the variance of g(x) are quantified:

[Figure: g(x) for each run with E{g(x)} and E{g(x)} ± 3 · std{g(x)} superimposed]

Therefore, even though the data are random, we have a bound on g(x)!
What is the expected value and variance of

y = a · x + b    (93)

if x ∼ N(x | µX, σX²)?

The expected value is given by:

E{y} = E{a · x + b} = a · E{x} + b = a · µX + b

and, similarly, the variance is var{y} = a² · σX².
Subsection 11
Monte Carlo integration
Monte Carlo integration
Integrals of the form ∫ f(x) p(x) dx can be approximated with samples:

∫ f(x) p(x) dx ≈ (1/N) Σ_{n=1}^{N} f(x_n),   where x_n ∼ p(x)

Therefore, we need to sample from p(x) and then calculate the mean of f(x),
evaluated at the samples.
Monte Carlo integration
Exercise:
Calculate the integral

∫_{−∞}^{∞} N(x; 1, 2²) x dx    (101)
Monte Carlo integration
Exercise:
Calculate the integral

∫_{−∞}^{∞} N(x; 1, 2²) x dx    (102)

• scipy.integrate.quad²: 1.0
• Monte Carlo integration³: 1.0004166378896788 ± 0.020921105045549167

² The integral is not performed from −∞ to ∞. Instead, a large domain that covers the
majority of the probability mass of p(x) is used in the integration.
³ Only a single scalar estimate is obtained when performing Monte Carlo integration. The
Monte Carlo estimate is a random variable, as we are using randomly sampled values from p(x) to
estimate the integral. This means that each time we generate new samples, a new estimate is
obtained. The estimate is unbiased, and its variance depends on the number of samples used. In
this example, the mean and the standard deviation of the estimates are shown to highlight that
the Monte Carlo estimate is random.
Monte Carlo integration
Exercise:
Calculate the integral

∫_{−∞}^{∞} N(x; 1, 2²) x⁴ dx    (103)

• scipy.integrate.quad: 73.00
• Monte Carlo integration: ≈ 73
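Both exercises follow the same recipe: sample from p(x) = N(x; 1, 2²) and average f at the samples. A sketch (the sample size is my own choice; the x⁴ estimate has a noticeably larger variance than the x estimate):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo integration: int f(x) p(x) dx ≈ (1/N) sum f(x_n), x_n ~ p(x)
# Here p(x) = N(x; 1, 2^2), so sample x and average f at the samples.
x = rng.normal(loc=1.0, scale=2.0, size=1_000_000)

print(np.mean(x))        # ≈ 1:  int N(x; 1, 4) x dx
print(np.mean(x ** 4))   # ≈ 73: int N(x; 1, 4) x^4 dx
```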
Subsection 12
Conclusion