
E178/ME292b

Statistics and Data Science for Engineers


Reader

Gabriel Gomes
Contents

1 Systems and Models 9

2 Probability theory 15
2.1 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 Sample space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.2 Event space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.3 Probability measure . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.4 Probability density function . . . . . . . . . . . . . . . . . . . . . . . 22
2.1.5 Cumulative distribution function . . . . . . . . . . . . . . . . . . . . 25
2.2 Expected value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.1 Properties of the expected value . . . . . . . . . . . . . . . . . . . . . 27
2.3 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Standardization of random variables . . . . . . . . . . . . . . . . . . . . . . . 31
2.5 Named distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5.1 Bernoulli distribution B(s) . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5.2 Binomial distribution Bin(n, s) . . . . . . . . . . . . . . . . . . . . . 33
2.5.3 Poisson process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5.4 Exponential distribution E(λ) . . . . . . . . . . . . . . . . . . . . . . 36
2.5.5 Uniform distribution U (a, b) . . . . . . . . . . . . . . . . . . . . . . . 37
2.5.6 Gaussian (a.k.a. normal) distribution N (µ, σ²) . . . . . . . . . . . . 38
2.6 Multivariate random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.6.1 Correlation coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.6.2 Marginalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.6.3 Conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.6.4 Bayes’ theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.6.5 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

3 Optimization theory 69
3.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.1.1 Global vs. local solutions . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.2 Types of feasible points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.2.1 First order optimality condition . . . . . . . . . . . . . . . . . . . . . 76
3.3 Convex optimization problems . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.3.1 Properties of convex optimization problems . . . . . . . . . . . . . . 79
3.3.2 Examples of convex sets . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.3.3 Convex functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.4 Gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.4.1 Stochastic gradient descent (SGD) . . . . . . . . . . . . . . . . . . . 85

4 Statistical inference 89
4.0.1 The data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.0.2 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.0.3 Behavior of the sample mean . . . . . . . . . . . . . . . . . . . . . . 94
4.1 Point estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.1.1 Estimator performance . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.1.2 Estimation of the mean with the sample mean . . . . . . . . . . . . . 102
4.1.3 Estimation of the variance σY² with the biased sample variance S̃² . . 104
4.1.4 Estimation of the variance σY² with the unbiased sample variance S² . 106
4.1.5 Mean squared error (MSE) . . . . . . . . . . . . . . . . . . . . . . . . 107
4.1.6 Maximum likelihood estimation (MLE) . . . . . . . . . . . . . . . . . 110

4.1.7 Estimation for Mixture Gaussian Models . . . . . . . . . . . . . . . . 116
4.1.8 Clustering algorithms and K-means . . . . . . . . . . . . . . . . . . . 122
4.2 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.2.1 Confidence interval for the mean µY when Y is Gaussian with known
variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.2.2 Confidence interval for the mean µY when Y is Gaussian with unknown
variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.2.3 Confidence interval for the mean µY when Y is non-Gaussian . . . . . 132

5 Supervised learning 137


5.1 The data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.2 Parametric families of models . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.3 Loss function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.4 Optimization problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.5 Assessing model performance . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.5.1 Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.5.2 K-fold cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . 151
5.6 Hyper-parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

Symbols
Sets
{} ... the empty set
{a, b, c} ... a set with elements a, b, and c
{ai}n ... the set {a1, a2, . . . , an}, indexed with i
{a, . . . , b} ... the set of integers a through b (inclusive)
[a, b] ... the closed interval of the real numbers from a to b
A ∪ B ... the union of sets A and B
A ∩ B ... the intersection of sets A and B
A \ B ... the set that results from removing from A all of the elements of B
∪_{i=1}^n ei ... the union of sets e1, e2, . . . , en
a ∈ A ... assert a is a member of the set A
A ⊆ B ... assert A is a subset of B, and possibly A equals B
∀a ∈ A ... assert a condition holds for all elements a in the set A

Named sets
N ... the natural numbers, not including zero
N0 ... the natural numbers including zero
Z ... the integers
R ... the real numbers
R+ ... the positive real numbers, not including zero
RD ... the D-dimensional vector space of real numbers

Functions
f : A → B ... f is a function with domain A and codomain B.
f (x; θ) ... f is a function with inputs x and parameters θ

Probability

Y ... A random variable


ΩY ... Sample space of Y
EY ... Event space of Y
PY ... Probability measure of Y
pY ... Probability density function of Y
FY ... Cumulative distribution function of Y
E[Y ], µY ... Expected value (mean) of Y
Var[Y ], σY² ... Variance of Y
σY ... Standard deviation of Y
Cov(X, Y ) ... Covariance of X and Y
y ∼ Y ... y is a sample of Y
B(α) ... Bernoulli distribution with parameter α
U (a, b) ... Uniform distribution on the interval [a, b]
N (µ, σ²) ... Normal distribution with mean µ and variance σ²
{yi}N ∼iid Y ... {yi}N is iid sampled from Y
{Yi}N ∼iid Y ... {Yi}N are iid copies of Y
Y |X = x ... The random variable Y conditioned on X = x
ρXY ... Correlation between random variables X and Y

Statistical learning
H ... A family of prediction functions
P ... The number of parameters that characterize H
θ ... The vector of parameters θ ∈ RP
h(x; θ) ... A family of prediction functions

Chapter 1

Systems and Models

Much of the activity of engineers involves building models to predict and control the behavior
of physical systems. For example, we use models of solar panels and batteries to design
solar farms and predict their production. We use models of the drivetrain of a car to
design cruise control systems. We embed models in feedback control loops to guide robots
through uncertain environments. These models consist of mathematical equations or code
that capture the aspects of the system that we are interested in predicting or controlling.

To build a model of a system is to specify a function (mathematical or in code) that


maps the system’s “inputs” to its “outputs”. The outputs are the quantities of interest; the
ones that we wish to predict or control. The inputs are quantities that we can measure (and
perhaps manipulate) and which affect the outputs through the influence of the system. This
is illustrated in Figure 1.1.

Figure 1.1: A model transforms inputs to outputs.

Figure 1.2: Models range from purely data-based to purely mechanistic.

As an example of a system, consider a quadcopter or “drone”. The measurable quantities


include the voltages applied to each of its motors, its position, its linear and angular speeds,
and the masses and moments of inertia of its parts. If our goal is to steer the drone, then
the relevant model will be one that provides its position and speed (outputs) as a function
of the applied motor voltages (inputs).

The process of building a model begins with the specification of the type or family of
models that we wish to work with. Any given system may be described by many different
types of models, and which one we choose will depend on our goal. For the quadcopter,
the type of model used when our aim is simply to predict travel time will be different from
the type needed for real-time steering. The former requires only the average velocity and
approximate path of the drone, while the latter requires detailed knowledge of its position,
orientation, and speed, as well as its maneuvering capabilities.

Figure 1.2 illustrates a range of model types, organized from left to right with respect
to their “a-priori structure”. On the right-hand side we see the so-called “mechanistic”
or “open-box” models. These are ones with lots of a-priori structure, usually obtained by
applying the principles of science, such as Newton’s laws, the physical laws of electricity,
thermodynamics, fluid mechanics, etc. to the system. They are “open-box” because they
require that we understand the inner workings of the system. We build open-box models
by breaking the system down to its individual parts, then modeling each one of those using scientific or engineering principles (or, if not principles, some established techniques),
and then assembling these into a single model for the entire system. As an example, the
coupled system of ordinary differential equations shown below captures the rotational dynamics
of a quadcopter in terms of the torques exerted by its propellers and its principal
moments of inertia. It is obtained by applying Euler’s equations of rotational dynamics to
the configuration of the quadcopter.
(ṗ, q̇, ṙ)ᵀ = (n1/Jxx, n2/Jxx, n3/Jzz)ᵀ − ((Jxx − Jzz)/Jxx) (qr, pr, 0)ᵀ     (1.1)

In these equations, n1, n2, n3 are net torques generated by the propellers, about principal
axes going through the quadcopter’s center of mass; p, q, r are angular speeds about those
axes; and Jxx and Jzz are moments of inertia. These equations model only one aspect of
the movement of the quadcopter – how the net torques influence the angular speeds. If we
wished to control the trajectory of the quadcopter, we would need to model other aspects
as well, such as the influence of motor voltages on propeller thrust, and the influence of
propeller thrust on speed.
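To make Eq. 1.1 concrete, the sketch below integrates it numerically for a hypothetical drone. The inertia values, torques, time span, and the sign convention of the cross-coupling term follow Eq. 1.1 as reconstructed above; none of these numbers come from the text, they are made up for illustration.

```python
from scipy.integrate import solve_ivp

# Hypothetical parameters (illustrative only): moments of inertia (kg*m^2), constant net torques (N*m)
Jxx, Jzz = 0.02, 0.04
n1, n2, n3 = 1e-3, -2e-3, 5e-4

def rotational_dynamics(t, x):
    """Right-hand side of Eq. 1.1; x = [p, q, r] are angular speeds in rad/s."""
    p, q, r = x
    k = (Jxx - Jzz) / Jxx
    return [n1 / Jxx - k * q * r,
            n2 / Jxx - k * p * r,
            n3 / Jzz]

sol = solve_ivp(rotational_dynamics, (0.0, 2.0), y0=[0.0, 0.0, 0.0], max_step=0.01)
print(sol.y[:, -1])   # angular speeds p, q, r after 2 seconds
```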
Let’s keep our focus on the rotational “sub-model” of Eq. 1.1. These equations apply to
a large number of quadcopters – all those that comply with the assumptions made in the
derivation of the equations. Of all of the properties of a drone, the only ones that affect its
rotational dynamics (according to this model) are Jxx and Jzz – the moments of inertia about
its two main axes.¹ These are the parameters of the model. The parameters are quantities
that distinguish two different systems that are both covered by the model. When we assign
a particular numerical value to the parameters (i.e. we tune or train the model), we are
selecting one from a family of models for a particular class of systems. We are selecting the
one that corresponds to our system.
The number of parameters of a model can be taken as a measure of the size of its family.
¹ The details of the drone model don’t matter much here, but if you are interested, please check out ME136.

Figure 1.3: The propellers of the quadcopter produce torques with components n1 , n2 , and
n3

The model family of Eq. 1.1 is fairly small – it has only two parameters. The space that we
have to search for our model (the one that best fits our system) is relatively small. This is
because the model equations contain a lot of a-priori structure (or knowledge, or information)
about the system. And this case is particularly simple because the two parameters, Jxx and
Jzz , can be directly measured. It is more difficult when the parameters are not measurable,
and must be inferred.

Moving from right to left in Figure 1.2 we encounter models with less a-priori structure.
On the extreme left-hand side we find the “data-centric” or “closed-box” models. Amongst
these we find the “machine learning” models. These are models that make very few assump-
tions about the system. They contain very little a-priori structure, and therefore apply to a
very wide range of systems. Such models typically contain many parameters, and require a
large amount of data to calibrate. They are most useful for “closed-box” systems, meaning
ones that cannot be broken down into manageable pieces, as we did with the quadcopter.
Below are two examples of closed-box systems, illustrated in Figure 1.4.

Humans are very good at recognizing cats in photographs. Even though it may seem
like a simple task, our current understanding of the visual cortex and other relevant parts of
the brain is insufficient to build an open-box model of this system. Although we understand
the basic physics of the brain at the neuronal level, that understanding is not helpful to the
task of building a cat-recognition algorithm. Machine learning models on the other hand,

Figure 1.4: Two closed-box models

specifically convolutional neural networks, have been tremendously successful with this task.
These models have very little a-priori structure. They assume only that the input is an
image and the output is a binary “yes cat” or “no cat”. They only learn about cats – that
they consist of two eyes, a triangular nose with whiskers and pointy ears – through a training
process that involves many examples of images with and without cats.

The roll of a die is another example. In this case we have an excellent understanding of
the physics involved, including the 3D rigid-body dynamics of the die, the fluid dynamics of
the air, and the interaction of the die with the floor. However all of these are plagued with
uncertainties. There are uncertainties in the initial position and velocities of the die, the
temperature and viscosity of the air, the restitution coefficient of the ground, etc. These
uncertainties are amplified through the dynamics of the die, producing a wide range of
possible outcomes. The closed-box approach is more straightforward. It acknowledges that
the uncertainties are large and states simply that the output is a randomly chosen number
between 1 and 6!

In this course our focus will be on the left-hand side of Figure 1.2, with the data-centric,
closed-box models. We will cover many techniques for building such models. These tech-
niques invariably begin with an “untrained” model whose parameter values are unspecified.
As the model is presented with input data, we compare its predicted output to the measured
output, and use the difference (i.e. the prediction error) to adjust the model parameters. In
this way the model “learns” about the relationship between inputs and outputs in the real
system. With enough data, this process can often produce reliable models of systems whose

inner workings are too complex for physics-based analysis.
Similar to humans, data-centric models begin life in a somewhat amorphous and un-
trained state. At this stage, their future is uncertain. Given an input, they can produce a
wide variety of outputs. We can think of the training process as one of progressively reducing
the uncertainty in the output by presenting the model with stimuli and showing it the
correct response. The field of mathematics that studies uncertainty is probability theory,
and so that is where we will start.

Chapter 2

Probability theory

Probability theory concerns the study of uncertainty. Most things in life are uncertain.
Predictions about the future are uncertain because they are subject to many factors that
we do not fully know or control. Statements about the present are also uncertain, due to
the finite precision of our measurement devices. Probability theory allows us to quantify,
combine, and evolve interacting uncertainties. This helps us to get a sense of the confidence
that we can place on statements about the world. Large uncertainty means low confidence;
low uncertainty means high confidence.

To begin the discussion, consider the following statement: “The temperature outside my
office is between 60°F and 65°F”. This statement is either true or false, but I do not know
which because I have not looked at the thermometer. However my belief is that it is false,
based on what I see through my window. If asked to rate my belief on a scale from 0 to
1 – where 0 means complete certainty that the statement is false (the temperature is not
between 60°F and 65°F) and 1 means complete certainty that it is true – I would give it a
0.2. This is the so-called Bayesian interpretation of probability. Under this interpretation,
the probability quantifies our subjective belief in a proposition, with 0 signifying complete
certainty that the proposition is false and 1 signifying complete certainty that it is true.
Whenever a belief is based on measurements or perceptual experience, its probability (or
credence in the Bayesian terminology) may only take values in the open interval (0,1). The

extreme values of 0 and 1 are not allowed. When I look at the thermometer and see that
it reads 59°F, I obtain evidence that serves to decrease my credence in the statement, from
0.2 to perhaps 0.1. However no amount of evidence can bring the credence value to 0 or
1. There are always alternative possibilities with small but nonzero probability, such as the
possibility that the thermometer is broken, or that I have lost my mind. The extreme values
of 0 and 1 are reserved for statements that are defined or can be proven to be true within
some system of axioms. For example, the statement “there is no largest prime number” is
true with probability 1 within the rules of arithmetic.

Aside from quantifying belief, we can also use probabilities to gauge the uncertainty in
the outcome of a process or measurement. Take for example the tossing of a coin. When
we say that the “probability of heads is 0.5”, we mean that, if the coin were tossed a large
number of times, we expect that it would turn up heads about half of the time. More
precisely we mean that as the number of trials is increased, the ratio of heads to the total
number of tosses will converge to 0.5. We cannot say with any certainty what the sequence
of heads and tails will be, but we are certain that the ratio will approach 0.5 – provided
the coin is fair. This is the frequentist interpretation of probability.

In both the Bayesian and frequentist interpretations, a probability is a real number in


the interval [0,1] which quantifies the uncertainty of a statement or event. However neither
of the interpretations is completely satisfactory, because they do not suggest a practical
method for measuring probabilities. Psychologists have made progress in the design of
experiments that measure subjective beliefs, however it remains a difficult problem. The
frequentist definition, on the other hand, relies on an infinite experiment, which is also
difficult! Despite this, the theory of probability has been extremely successful in modeling
real-world uncertainties of both types. Furthermore, the mathematics of probability theory
applies equally to both interpretations. Thus, we can proceed without worrying too much
about whether our probabilities are “Bayesian” or “frequentist”, but keeping in mind that
both interpretations are available.

2.1 Random variables
We begin with an informal definition of a random variable as a symbol that represents some
uncertain measurement or outcome. We typically use upper case letters for random variables.
Here are three examples:

• T : the temperature measured by a thermometer outside of my office window in degrees


Fahrenheit. The thermometer ranges from -40°F to 120°F in increments of 1°F. Hence
T can take integer values from -40 to 120.

• D: the outcome of the roll of a die. D can take integer values 1 through 6.

• C: The response given by a person when they are given a photograph and asked
whether it shows a cat. They say either “yes” or “no”.

Apart from its symbol, a random variable has three components: a sample space Ω, an event
space E, and a probability measure P . We often identify these with their corresponding
random variable using a subscript. Hence we have:

D = (ΩD , ED , PD ) (2.1)
T = (ΩT , ET , PT ) (2.2)
C = (ΩC , EC , PC ) (2.3)

This notation states for example that the random variable D consists of a sample space ΩD ,
an event space ED , and a probability measure PD . We will define these three quantities next.

2.1.1 Sample space

The sample space of a random variable is the set of all of the values that it can take. In the
case of the die, the sample space is the integers from 1 to 6:

ΩD = {1, 2, 3, 4, 5, 6} (2.4)

The thermometer is a little trickier. Strictly speaking ΩT is the set of integers from -40 to 120:

ΩT = {−40, . . . , 120} ⊂ Z (2.5)

However we are free to be pragmatic and choose a simpler option, such as the interval of
real numbers:

ΩT = [−40, 120] ⊂ R (2.6)

or even the entire real line:


ΩT = R (2.7)

It will become clear later, when we introduce probability distributions, why it can be prefer-
able to work with real-valued sample spaces such as Eq. 2.6 or Eq. 2.7 instead of discrete-
valued sample spaces such as Eq. 2.5. We will take ΩT = R as the sample space for T from
now on.
Both ΩD and ΩT are examples of numerical sample spaces. ΩC on the other hand is a
categorical sample space because it consists of labels “yes” and “no”.

ΩC = {“yes”, “no”} (2.8)

2.1.2 Event space

An event e is any subset of a sample space: e ⊆ Ω. Here are some examples.

• {2, 4, 6} is an event for the random variable D. This event can be expressed verbally
as “roll an even number”.

• [60, 65] is the event for the random variable T corresponding to the statement “the
temperature is between 60°F and 65°F”.

• {yes} is the event for the random variable C corresponding to the statement “the
picture shows a cat”.

• The empty set {} and the entire sample space Ω are events.

The event space E of a random variable is the set of all of its events, i.e. all subsets of the
sample space.¹ This is a much larger set than Ω, known as its power set. If we use |Ω| for
the size of (i.e. the number of elements in) the sample space, and |E| for the size of the event
space, then

|E| = 2^|Ω| (2.9)

For example, there are 2⁶ = 64 possible events for the roll of a die, since ΩD contains 6
elements.

2.1.3 Probability measure

A probability measure P of a random variable is a function that assigns a real number to


each event in its event space.
P : E → R (2.10)

P (e) is the probability of the event e. To qualify as a probability measure, the function P
must satisfy the following three properties, known as the axioms of probability.

A1. All probabilities are non-negative.

P (e) ≥ 0    ∀e ∈ E (2.11)

A2. The probability of the sample space is 1.

P (Ω) = 1 (2.12)

A3. For any disjoint set of n events {ei}n, meaning that no two events intersect (ei ∩ ej = {}
whenever i ≠ j), the probability of the union of the events equals the sum of the
probabilities of the individual events.

P (∪_{i=1}^n ei) = Σ_{i=1}^n P (ei) (2.13)

¹ Actually we do not have to include all of the subsets in the event space, only enough to form a “σ-algebra”.
That is, E must contain the complements of each of its elements, as well as the intersections and
unions of any number of its elements.

In the frequentist interpretation, disjoint events represent outcomes that cannot happen
simultaneously. For example, the roll of a die cannot be both even and odd. A photo either
has a cat or it doesn’t. Axiom A3 states that to find the probability that any of a set of
disjoint events occurs, we must add up the probabilities that each of them occur.
These axioms were stated in the 1930’s, long after the birth of probability theory in the
sixteenth century. They capture our intuitions for both the Bayesian and frequentist notions,
and they are a sufficient foundation for the full development of the theory. The following
properties are easily deduced from the axioms.

1. Nothing can’t happen:


P ({}) = 0 (2.14)

2. If event e′ happens whenever e happens, then the probability of e cannot exceed that
of e′:

e ⊆ e′ ⇒ P (e) ≤ P (e′)    ∀e, e′ ∈ E (2.15)

3. Everything either happens or it doesn’t:

P (e) + P (Ω \ e) = 1    ∀e ∈ E (2.16)

4. The probability of either e or e′ happening equals the sum of their probabilities, minus
the probability that they both happen:

P (e ∪ e′) = P (e) + P (e′) − P (e ∩ e′)    ∀e, e′ ∈ E (2.17)

Example 2.1.1. A scale is used on a production line to monitor the weight of the widgets
produced. It finds that 30% weigh less than 120 g, 40% weigh more than 200 g, and 50%
weigh between 120 g and 250 g. Describe this situation using a random variable and a
probability measure. What percentage of widgets weigh between 200 g and 250 g?

Solution. We define a random variable W to represent the weight of a randomly chosen


widget. The sample space for W is the real line: ΩW = R. The problem statement gives the
probabilities for three events:

e1 = (−∞, 120)     PW (e1) = 0.3 (2.18)
e2 = (120, 250)    PW (e2) = 0.5 (2.19)
e3 = (200, ∞)      PW (e3) = 0.4 (2.20)

Figure 2.1: Example 2.1.1

Figure 2.1 illustrates the three events. Notice that they are not disjoint, and hence their
probabilities need not add up to one. Our goal is to find the probability of event e4 =
(200, 250), which can be expressed as e4 = e2 ∩ e3. Noting that e1, e2, and e3 cover the entire
sample space, we have

PW (e1 ∪ e2 ∪ e3) = PW (ΩW ) = 1 (2.21)

On the other hand, using Eq. 2.17 with e = e1 ∪ e2 and e′ = e3 we find,

PW (e1 ∪ e2 ∪ e3) = PW (e1 ∪ e2) + PW (e3) − PW ((e1 ∪ e2) ∩ e3) (2.22)

Since e1 and e3 do not intersect, we have (e1 ∪ e2) ∩ e3 = e2 ∩ e3 = e4, and so,

1 = PW (e1 ∪ e2) + PW (e3) − PW (e4) (2.23)
  = PW (e1) + PW (e2) + PW (e3) − PW (e4) (2.24)
  = 0.3 + 0.5 + 0.4 − PW (e4) (2.25)

which gives PW (e4) = 0.2. The second equality above is an application of the third axiom.

A note on notation

We can use P (X ∈ e) instead of PX (e) to denote the probability that the random variable
X takes a value from the event e. If X is discrete-valued and e consists of a single item,
then we can use P (X = e). When e is a semi-infinite interval, we can use P (X ≥ x) instead
of PX ([x, ∞)) and P (X ≤ x) instead of PX ((−∞, x]).

2.1.4 Probability density function

The probability measure fully specifies the probabilities associated with the possible out-
comes of an experiment. However it is not a convenient mathematical object for practical
use. To program a probability measure, one would have to write a function that accepts
every possible subset of the sample space and returns a number for each one. Without some
simplifying property or rule, this would require storing the set of all possible events, which
as we have seen, grows exponentially with the size of the sample space.

Fortunately the axioms of probability ensure the existence of another function that cap-
tures the same information but in a simpler form. This is the probability density function
(pdf), or the distribution of the random variable. The pdf is simpler because it maps the
sample space (as opposed to the event space) to the reals. We denote the pdf with lower

case p, and with a subscript indicating its random variable:

pD : ΩD → R (2.26)
pT : ΩT → R (2.27)
pC : ΩC → R (2.28)

The defining property of the probability density function is that its integral over any event
e equals the probability of e.

∫_e p(ω) dω = P (e)    ∀e ∈ E (2.29)

Or, if the sample space is discrete, it is the sum over the elements of e:

P (e) = Σ_{ω∈e} p(ω)    ∀e ∈ E (2.30)

Although we do not do it here, it can be shown that this property uniquely defines the
function p.
Axiomatic definition of the probability density function
Alternatively, the probability density function can be defined axiomatically, without reference
to the probability measure. Below are the two axioms (a.k.a. properties) that characterize
a pdf.

1. Non-negativity:

p(ω) ≥ 0    ∀ω ∈ Ω (2.31)

2. Sum to one:

∫_Ω p(ω) dω = 1    (continuous sample space) (2.32)

Σ_{ω∈Ω} p(ω) = 1    (discrete sample space) (2.33)

With this definition of the probability density function p, we can remove the more cumber-
some probability measure P from the definition of a random variable. A random variable
then becomes the collection of sample space, event space, and pdf.

D = (ΩD , ED , pD ) (2.34)
T = (ΩT , ET , pT ) (2.35)
C = (ΩC , EC , pC ) (2.36)

Figure 2.2: Example probability density functions.

A note on integrals and sums


Figure 2.2 shows probability density functions over continuous and discrete sample spaces.
Many texts on probability theory refer to the discrete version as a probability mass function.
However here we will dispense with this distinction and refer to both as probability density
functions. We will also only use integrals in our equations; no sums. This eliminates a lot of
repetition (such as in Eqs. 2.32 and 2.33 above) and should not create confusion. Whenever
the sample space is discrete, then the integral symbol should simply be interpreted as a sum.

Example 2.1.2. Let Y be a random variable with ΩY = [1, ∞). Consider the function
pY (y) = a y^b, where a and b are real numbers. Find conditions that a and b must satisfy for
pY to be a valid pdf.

Solution. Non-negativity requires that a y^b ≥ 0 for all y ∈ [1, ∞). Thus we conclude a ≥ 0.
Secondly, we require that the integral of pY (y) over the sample space equal 1. The integral
is only defined when b < −1, so we adopt that assumption.

∫_1^∞ a y^b dy = [a/(b + 1)] y^{b+1} |_1^∞ = [a/(b + 1)] (0 − 1) = −a/(b + 1) = 1 (2.37)

From which we obtain a third condition: a + b = −1.
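A quick symbolic check of this condition (a sketch using sympy, added here for illustration; to encode the assumption b < −1 we write b = −1 − c with c > 0):

```python
import sympy as sp

# Encode the assumption b < -1 by writing b = -1 - c with c > 0
a, c, y = sp.symbols('a c y', positive=True)
b = -1 - c

total = sp.integrate(a * y**b, (y, 1, sp.oo))   # integral of the pdf over [1, oo) -> a/c
print(total)
print(sp.solve(sp.Eq(total, 1), a))             # [c], i.e. a = c = -(b + 1), so a + b = -1
```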

2.1.5 Cumulative distribution function

The cumulative distribution function (cdf) of a random variable Y is a function from the
sample space ΩY to the interval [0,1]. It is denoted with FY .

FY : ΩY → [0, 1] (2.38)

FY (t) is defined as the probability of obtaining a sample that is less than or equal to t:

FY (t) = P (Y ≤ t) (2.39)

which can be computed as the integral of the pdf from −∞ to t.

FY (t) = ∫_{−∞}^t pY (y) dy (2.40)

Notice that this definition only makes sense for numerical random variables, and not for
label-based random variables, since it utilizes the “≤” concept, which relies on the order
of the elements of ΩY . As always, the integral should be interpreted as a sum when Y is
discrete-valued, and in this case the sum must include pY (t). Figure 2.3 shows examples of
discrete and continuous pdfs with their respective cdfs.

The cdf is very useful for the computation of probabilities, since the probability of an

Figure 2.3: FY (t) is the probability of the event (−∞, t], which is obtained by integrating the
pdf from −∞ to t (inclusive). Integrating over the discrete pdf on the left produces a series
of positive jumps. For the continuous distribution, the cdf is a continuous non-decreasing
function. Its value at t equals the area under the pdf to the left of t.

interval [a, b] equals the difference between two values of the cdf:

PY ([a, b]) = ∫_a^b pY (y) dy (2.41)
           = ∫_{−∞}^b pY (y) dy − ∫_{−∞}^a pY (y) dy (2.42)
           = FY (b) − FY (a) (2.43)

In the absence of a computer, one often uses lookup tables of cdfs to compute probabilities.
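For example, for a Gaussian random variable (introduced later in Section 2.5.6) this computation is a one-liner. The sketch below uses scipy.stats; the parameter values are arbitrary choices, not values from the text.

```python
from scipy.stats import norm

Y = norm(loc=62, scale=5)      # a hypothetical Gaussian Y with mean 62 and standard deviation 5

a, b = 60, 65
print(Y.cdf(b) - Y.cdf(a))     # P(Y in [a, b]) = F_Y(b) - F_Y(a), as in Eq. 2.43
```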

2.2 Expected value

The expected value of a random variable Y , also known as the expectation or the mean of
Y , is denoted with E[Y ] or µY and is defined as follows,
E[Y ] = ∫_ΩY y pY (y) dy (2.44)

Figure 2.4: Mean vs. median. pB is obtained by moving a portion of distribution pA to the
right. This action affects the mean but not the median.

Graphically, the expected value can be understood as the “balance point” of the pdf. If
we cut a piece of cardboard into the shape of the pdf, then this shape will be balanced at
E[Y ], as shown in Figure 2.4. The figure also illustrates the “median”, which is any value
ym that satisfies P (Y < ym) = P (Y > ym). That is, a value ym is a median value if there
is equal chance that a sample of Y will be greater than or less than ym. For a symmetric
distribution over a continuous space, such as distribution pA on the left hand side of Figure
2.4, the median coincides with the mean, and they are both at the point of symmetry. We
can appreciate a difference between the median and the mean if we convert distribution pA
into distribution pB by taking a portion of the high-value outcomes and moving them further
to the right. Distribution pB is said to be “positively skewed”, or “right skewed”, since the
action causes the mean (the balance point) to also move to the right. However it does not
affect the median, since the areas to its left and right remain unchanged.

2.2.1 Properties of the expected value

The properties below follow from the definition of the expected value.

1. E[·] is a linear operation. This means that the expected value of a linear combination of
random variables {Yi}n (all over the same sample space) equals the linear combination
of their expected values.

E[Σ_{i=1}^n αi Yi] = Σ_{i=1}^n αi E[Yi] (2.45)

2. The expected value of a fixed number equals that number: E[α] = α.

3. The expected value of a function g of a random variable Y is computed with,


E[g(Y )] = ∫_ΩY g(y) pY (y) dy (2.46)

Example 2.2.1. Find the expected value of the distribution of Example 2.1.2.

Solution. In the example we found that a + b = −1, so the pdf is of the form pX (x) = a x^{−1−a}.
Next we apply the definition of the expected value.

E[X] = ∫_1^∞ x a x^{−1−a} dx (2.47)
     = a ∫_1^∞ x^{−a} dx (2.48)

The integral exists only if a > 1. Then

E[X] = [a/(1 − a)] x^{1−a} |_1^∞ (2.49)
     = a/(a − 1) (2.50)

2.3 Variance
The variance of a random variable Y is denoted with Var[Y ] or σY², and is defined with,

Var[Y ] = E[(Y − µY )²] = ∫_ΩY (y − µY )² pY (y) dy (2.51)

Here, (Y − µY )² is a random variable, a sample of which is obtained by squaring the distance


from a sample of Y to the fixed mean µY . The expected value of this squared distance is
the variance of Y . An alternate formula for the variance can be derived by expanding the

Figure 2.5: Small vs large variance

square and using properties 1 and 2 of the expected value.

Var[Y ] = E[(Y − µY )²]
        = E[Y ² − 2µY Y + µY ²]
        = E[Y ²] − 2µY E[Y ] + µY ²        (2.52)
        = E[Y ²] − µY ²

The variance of Y is therefore the difference between the mean of Y ² and the squared mean
of Y . The variance is a measure of the spread of a distribution (see Figure 2.5). When we
sample a random variable with small variance, we can be fairly sure that the outcome will
be close to the expected value. High-variance random variables on the other hand produce a
wide range of outcomes. Hence the variance quantifies the uncertainty captured by a random
variable.
The unit of variance is the square of the unit of the outcome. For example, if the
temperature is measured in °F, then its variance has units (°F)². For this reason we often
report the square root of variance, known as the standard deviation of Y , and denoted with
σY .

Example 2.3.1. Find the variance of the random variable of Example 2.1.2.

Solution.

Var[X] = E[(X − E[X])²] = E[(X − a/(a − 1))²] (2.53)

We apply Eq. 2.46 for the expected value of a function.

Var[X] = ∫_1^∞ (x − a/(a − 1))² a x^{−1−a} dx (2.54)
       = a ∫_1^∞ (x² − [2a/(a − 1)] x + a²/(a − 1)²) x^{−1−a} dx (2.55)
       = ...
       = a / ((a − 2)(a − 1)²)    provided a > 2 (2.56)
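Both results can be checked for a particular parameter value (a sketch with sympy, added for illustration; any a > 2 makes both moments finite):

```python
import sympy as sp

x = sp.Symbol('x', positive=True)
a = 3                                   # any value greater than 2
p = a * x**(-1 - a)                     # the pdf of Example 2.1.2 with a + b = -1

mean = sp.integrate(x * p, (x, 1, sp.oo))
var = sp.integrate((x - mean)**2 * p, (x, 1, sp.oo))
print(mean)    # 3/2, matching a/(a - 1)
print(var)     # 3/4, matching a/((a - 2)(a - 1)**2)
```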

In contrast with the expected value, the variance is not a linear function. Rather, the
variance of a linear combination of random variables {Yi }n is given by this more complicated
nonlinear formula (provided without proof):

Var[Σ_{i=1}^n αi Yi] = Σ_{i=1}^n αi² Var[Yi] + 2 Σ_{i=1}^n Σ_{j=i+1}^n αi αj Cov(Yi, Yj) (2.57)

This formula includes the covariance of two random variables Cov(Yi, Yj), which is defined
later in part 2.6. We provide the formula here only to point out two sources of nonlinearity:
the squares on the αi's in the first term, and the cross-products αi αj in the second term. We
can then come up with a particular case in which the variance is linear: the variance of a
sum of uncorrelated random variables. We will see later that “uncorrelated” implies that the
covariance is zero, so the second term is removed. With all αi's set to 1, we obtain linearity:

Var[Σ_{i=1}^n Yi] = Σ_{i=1}^n Var[Yi] + 2 Σ_{i=1}^n Σ_{j=i+1}^n Cov(Yi, Yj) (2.58)
                 = Σ_{i=1}^n Var[Yi] (2.59)

Figure 2.6: Standardization of a random variable X.

2.4 Standardization of random variables


We “standardize” a continuous random variable X when we define a new random variable
X̃ with:
X E[X]
X̃ = (2.60)
X

X̃ is standardized in the sense that it necessarily has zero mean and unit variance.
Proof : 
X E[X] E[X] E[X]
E[X̃] = E = =0 (2.61)
X X

X E[X] V ar[X]
V ar[X̃] = V ar = 2
=1 (2.62)
X X

Graphically, the operation has the e↵ect of shifting the distribution of X to the origin, and
squeezing or expanding it so that its variance becomes 1. This is illustrated in Figure 2.6.
Standardization (sometimes referred to as normalization) of random variables has prac-
tical benefits. It can improve the performance of numerical algorithms by bringing all of
the numbers into a common range. Furthermore, lookup tables used in statistics are usually
provided in terms of a standard distribution, and hence one must standardize the data before
consulting them.
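The sample version of Eq. 2.60 is a one-liner (a sketch with numpy, added for illustration; the distribution and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=62.0, scale=5.0, size=100_000)   # samples of some random variable X

x_std = (x - x.mean()) / x.std()                    # empirical version of Eq. 2.60
print(x_std.mean(), x_std.var())                    # approximately 0 and 1
```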

2.5 Named distributions


Here we introduce a few important families of pdfs with proper names. You can find a long
list of such distributions in the article titled “List of probability distributions” in Wikipedia.

These are “parametric” families, in the sense that they are expressed with mathematical
formulas that include some number of tunable parameters. A “member” of a family is a
distribution obtained by setting the parameters to particular values. We denote this with
p(y; θ1, θ2, . . . , θP ). Here θi is a value assigned to the i’th parameter of a family with P
parameters. The semi-colon in the notation separates the arguments of the pdf (values in
the sample space) from its parameters.

2.5.1 Bernoulli distribution B(s)

The Bernoulli distribution is the simplest model for a discrete-valued random variable. Its
sample space consists of just two outcomes: “yes” and “no”, 0 and 1, “left” and “right”,
“true” and “false”, etc. Call them “success” and “failure”, and denote them with ✓ and ✗.

Ω = {✓, ✗} (2.63)

The Bernoulli distribution has only one parameter, which is the probability s of “success”.
Then the probability of “failure” is 1 − s.

p(k; s) = { s,      k = ✓
          { 1 − s,  k = ✗          (2.64)

We use symbol B for the Bernoulli family, and B(s) for the particular distribution with
parameter value s. Y ∼ B(s) means that Y is a Bernoulli random variable with parameter
value s.
To do computations with Bernoulli random variables, we need to assign numbers to
outcomes ✓ and ✗. There are two commonly used options: {0, 1} and {−1, 1}. Which one
to use is purely a matter of mathematical convenience.

{0, 1} encoding:

p(k; s) = { s,      k = 1
          { 1 − s,  k = 0          (2.65)

{−1, 1} encoding:

p(k; s) = { s,      k = 1
          { 1 − s,  k = −1         (2.66)

Later, for the purpose of taking derivatives, it will be convenient to define a smooth extension
of p(k; s). This is a function that passes through the two points of the discrete distribution
and also has a continuous first derivative. Figure 2.7 shows two options: linear and expo-
nential. Below are the formulas for both, in each of the two encodings. It is important to
note that these functions are not distributions, since they do not integrate to 1. They are
merely functions that coincide with p(k; s) at the discrete points of Ω, and that also have
the convenient property of being differentiable. We denote them with p̄(k; s) to emphasize
this distinction.

Extension functions for the {0, 1} encoding:

p̄(k; s) = s^k (1 − s)^{1−k}                 exponential (2.67)

p̄(k; s) = s k + (1 − s)(1 − k)              linear (2.68)

Extension functions for the {−1, 1} encoding:

p̄(k; s) = s^{(1+k)/2} (1 − s)^{(1−k)/2}             exponential (2.69)

p̄(k; s) = [(1 + k)/2] s + [(1 − k)/2] (1 − s)       linear (2.70)

2.5.2 Binomial distribution Bin(n, s)

The binomial distribution applies to the total number of successes in an independent set of
n Bernoulli trials B(s). For example, it applies to the number k of people with a particular
illness in a random sample of n people, when each person has a probability s of having the

Figure 2.7: Smooth extensions of the Bernoulli pdf, in each of two numerical encodings.

disease. The sample space for the binomial distribution is ⌦ = {0, . . . , n}, and its pdf is:
✓ ◆
n k
p(k; n, s) = s (1 s)n k
(2.71)
k

n is the total number of trials, and s is the probability of success in each trial. The pdf then
returns to probability of observing k successes. Notice that the formula reduces to Eq. 2.67
when n = 1 (use 0! = 1).

Figure 2.8: A sample sequence of Bernoulli trials with s = 0.25.

Figure 2.8 provides an example. It shows an outcome for a sequence of 32 Bernoulli trials
with a probability of success of 0.25 for each one. Of the 32 trials, 6 are successes (green
1’s). This is fewer than expected since ns = 32 × 0.25 = 8. The probability of the outcome
shown in Figure 2.8 is obtained by evaluating the binomial pdf with parameters n = 32 and
s = 0.25 at k = 6:

p(k; n, s) = p(6; 32, 0.25) = (32 choose 6) (1/4)^6 (3/4)^26 ≈ 9.06×10^5 × (2.54×10^12 / 1.84×10^19) ≈ 0.125 (2.72)
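The same number can be obtained directly from scipy.stats (a small check added here for illustration):

```python
from scipy.stats import binom

# Probability of exactly 6 successes in 32 Bernoulli trials with success probability 0.25
print(binom.pmf(k=6, n=32, p=0.25))   # approximately 0.125
```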

Figure 2.9: A Poisson process.

2.5.3 Poisson process

A stochastic process is a mathematical object for representing a system that generates a


stream of events over time. This is a similar concept as the sequence of events of Figure 2.8,
except that now they occur on the real line (the axis of time), as shown in Figure 2.9.
Stochastic processes are an interesting and large topic, and we only touch upon them briefly
in the chapter about time series data. The Poisson process is a very simple case that assumes
only that all events are independent, and that the expected number of events in any two
equally-sized intervals is the same. This implies that there is a positive number λ such that
the expected number of events in an interval of length t is λt. We call this number λ the
rate of the process, and it is measured in events per unit of time.

The Poisson random variable or Poisson distribution counts the number of events in a
Poisson process during one unit of time. The expected value of a Poisson random variable
is clearly λ, by definition. Its sample space is all of the natural numbers, including 0:

Ω = N0 (2.73)

Its pdf is

p(k; λ) = e^{−λ} λ^k / k! (2.74)
This formula can be obtained as the limit of a binomial distribution when the number n of
trials in one unit of time goes to infinity.

Proof. We imagine the Bernoulli trials of Figure 2.8 as occurring in time, one after the
other, over a period of one time unit (one second, for example). Define λ as the expected
number of successes in that period: λ = ns. Then, applying Eq. 2.71,

p(k; n, s) = (n choose k) s^k (1 − s)^{n−k} (2.75)
           = [n! / (k!(n − k)!)] (λ/n)^k (1 − λ/n)^{n−k} (2.76)
           = [n(n − 1) · · · (n − k + 1) / k!] (λ^k / n^k) (1 − λ/n)^{n−k} (2.77)
           = (λ^k / k!) [n(n − 1) · · · (n − k + 1) / (n × n × · · · × n)] (1 − λ/n)^{n−k} (2.78)

Take the limit as n → ∞.

lim_{n→∞} p(k; n, s) = (λ^k / k!) lim_{n→∞} (1 − 1/n) · · · (1 − (k − 1)/n) (1 − λ/n)^{−k} (1 − λ/n)^n (2.79)

Notice that all of the terms except the last tend to 1 as n → ∞. So we can remove them
from the limit. The limit of the last term is a common identity in calculus which we will
not derive here. It equals e^{−λ}. Hence,

lim_{n→∞} p(k; n, s) = e^{−λ} λ^k / k! (2.80)
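The convergence can also be checked numerically (a sketch with scipy.stats, added for illustration; λ and k are arbitrary choices):

```python
from scipy.stats import binom, poisson

lam, k = 8.0, 6
for n in (32, 320, 3200):
    print(n, binom.pmf(k, n, lam / n))   # approaches the Poisson pmf as n grows
print(poisson.pmf(k, mu=lam))            # e^(-lam) * lam^k / k!
```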

2.5.4 Exponential distribution E(λ)

The exponential distribution models the waiting time between events in a Poisson process.
A sample waiting time is shown as w in Figure 2.9. The sample space for the exponential
distribution is the positive real numbers R+ (not including zero). The pdf for the waiting
time is:

p(w; λ) = λ e^{−λw} (2.81)
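This gives a simple way to simulate a Poisson process: draw the waiting times from E(λ) and accumulate them. The sketch below (numpy, rate chosen arbitrarily) does this and checks that the counts per unit interval behave like the Poisson distribution above.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 3.0                                               # rate, in events per unit time

waits = rng.exponential(scale=1.0 / lam, size=100_000)  # waiting times w ~ E(lam)
event_times = np.cumsum(waits)                          # positions of the events on the time axis

# Counting events per unit-length interval should give Poisson-distributed counts with mean lam
counts = np.bincount(event_times.astype(int))[:-1]
print(counts.mean(), counts.var())                      # both approximately lam
```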

Figure 2.10: Continuous and discrete uniform distributions

2.5.5 Uniform distribution U (a, b)

A uniform distribution is one whose sample space is an interval of real numbers or integers,
and whose pdf is a constant C on that interval. The uniform distribution is parameterized
by the limits a and b of the interval.
p(y; a, b) = { C,  if y ∈ Ω
             { 0,  otherwise          (2.82)

The value of C can be computed from a and b by noting that the integral must equal 1. In
the continuous case, with Ω = [a, b], we obtain C = 1/(b − a) (from the area of a rectangle).
For the discrete case, with Ω = {a, . . . , b}, we get C = 1/(b − a + 1), since there are b − a + 1
elements in Ω.

Figure 2.10 shows the continuous and discrete uniform distribution, along with their
respective means and variances. Examples of discrete uniform random quantities abound:
rolling a single die, flipping a fair coin, picking a card from a well-shuffled deck, etc. An
example of a continuous uniform variable is the angle of the seconds-hand on a clock when
you observe it at an arbitrary point in the day.

We use the notation U [a, b] for the continuous uniform random variable on the interval
[a, b], and U{a, b} for the discrete uniform random variable on the discrete interval {a, . . . , b}.

Figure 2.11: Normal probability density functions

2.5.6 Gaussian (a.k.a. normal) distribution N (µ, σ²)

The Gaussian or normal distribution is widely used in engineering and the sciences to model
quantities whose value can in principle be any real number, but is expected to fall near
a particular value. An important example of this is measurements of real-valued quantities
taken with measurement devices. These values sometimes fall above the “true value”,
sometimes below, and collectively they form a histogram that is bell-shaped.
The sample space for the Gaussian distribution is all of the real numbers (Ω = R). The
Gaussian family of distributions is parameterized by two numbers: µ and σ². µ is allowed
to be any real number, while σ² is required to be positive and non-zero. Here is the formula
for a Gaussian pdf:

p(y; µ, σ²) = (1 / √(2πσ²)) exp(−(y − µ)² / (2σ²)) (2.83)

Examples of this function are shown in Figure 2.11 for three different settings of the parameters.
We use Y ∼ N (µ, σ²) to designate Y as a Gaussian variable with parameters µ and
σ². Next we prove that µ and σ² turn out to be the mean and variance of N (µ, σ²).

Theorem. With Y ∼ N (µ, σ²), E[Y ] = µ and Var[Y ] = σ².

Proof. Applying the definition from Eq. 2.44 to Eq. 2.83,

E[Y ] = ∫_{−∞}^∞ y (1 / √(2πσ²)) exp(−(y − µ)² / (2σ²)) dy (2.84)

Change of variables: z = (y − µ) / √(2σ²). Then dz = dy / √(2σ²) and y = √(2σ²) z + µ.

E[Y ] = ∫_{−∞}^∞ (√(2σ²) z + µ) (1 / √(2πσ²)) exp(−z²) √(2σ²) dz (2.85)
      = (1/√π) ∫_{−∞}^∞ (√(2σ²) z + µ) exp(−z²) dz (2.86)
      = (1/√π) (√(2σ²) ∫_{−∞}^∞ z exp(−z²) dz + µ ∫_{−∞}^∞ exp(−z²) dz) (2.87)

Noting that −(1/2) exp(−z²) is the antiderivative of z exp(−z²), we find that the first term
in Eq. 2.87 equals −(1/2) exp(−z²)|_{−∞}^∞ = 0. For the second term, we use the result (without
proving it), that ∫_{−∞}^∞ exp(−z²) dz = √π. Therefore,

E[Y ] = (1/√π) (0 + µ√π) = µ (2.88)

For Var[Y ] = σ² see https://proofwiki.org/wiki/Variance_of_Gaussian_Distribution.
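The theorem is also easy to check by simulation (a sketch with numpy, added for illustration; parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2 = 1.5, 4.0

y = rng.normal(loc=mu, scale=np.sqrt(sigma2), size=1_000_000)
print(y.mean())   # approximately mu
print(y.var())    # approximately sigma2
```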

2.6 Multivariate random variables

So far we have considered the individual random variable as a model for a single measure-
ment from a system. However most physical systems are not well described by a single
measurement. A real system may be characterized by multiple measurements, and therefore
multiple random variables are required. The main question that arises regarding multiple
measurements is whether they are related in some way. For example, measurements of ambient
temperature and humidity exhibit a decreasing relationship: the higher the temperature,
the lower the humidity. However the roll of a die is not related to the measurement of
temperature. These relationships must be encoded in the random variables that we use to
model the system.
Let’s again use T for temperature, with sample space ΩT = R and distribution pT , and
H for humidity with sample space ΩH = R and distribution pH . We form a multivariate

Figure 2.12: Joint distribution of temperature and humidity. The event that T ∈ [60, 65]
and H ∈ [69, 72] is a rectangle in R². The probability of this event is the integral of pZ over
the rectangle.

random variable Z by grouping T and H into an array: Z = (T, H). The sample space of
Z is the two-dimensional plane: ΩZ = ΩT × ΩH = R², and its distribution is, as expected,
a function from the sample space to the real numbers: pZ : R² → R. Figure 2.12 shows
a possible distribution of Z. The notion of an event also generalizes: an event of Z is any
subset of ΩZ . Take for example the event that the temperature is between 60°F and 65°F,
while humidity is between 69% and 72%:

e = {(t, h) : t ∈ [60, 65], h ∈ [69, 72]} (2.89)

This event is depicted as a dark green rectangle in Figure 2.12. The probability of an event
is, as before, the integral of the pdf over the event. In the case of e it is a double integral:

PZ (e) = ∫_e pZ dz = ∫_60^65 ∫_69^72 pZ (t, h) dh dt = 0.02 (2.90)

In general, a multivariate random variable may have D components, and its sample space is

the composition of the component sample spaces.

Z = (Z1, Z2, . . . , ZD)          . . . multivariate random variable (2.91)
ΩZ = ΩZ1 × . . . × ΩZD            . . . multivariate sample space (2.92)

The expected value of Z is the vector of expected values of the components:

E[Z] = (E[Z1], E[Z2], . . . , E[ZD]) ∈ R^D (2.93)

The variance of a multivariate random variable is a D × D matrix, and is referred to as the
covariance matrix.

Var[Z] = E[(Z − E[Z])^T (Z − E[Z])] (2.94)

Here we have assumed that Z and E[Z] are arranged as row vectors so that (Z − E[Z])^T
is a column vector and Var[Z] is a square matrix. The diagonal entries in this matrix turn
out to be the variances of the individual Zi's. The non-diagonal entries are the covariances
Cov(Zi, Zj) of the pair (Zi, Zj):

Cov(Zi, Zj) = E[(Zi − E[Zi])(Zj − E[Zj])] (2.95)

Denoting the variance of Zi with σi² and the covariance of Zi and Zj with σi,j², we have,

Var[Z] = [ σ1²    σ1,2²  . . .  σ1,D²
           σ2,1²  σ2²    . . .  σ2,D²
           . . .  . . .  . . .  . . .
           σD,1²  σD,2²  . . .  σD²  ]        (2.96)

Notice from the definition that σi,j² = σj,i², and therefore the covariance matrix is symmetric.
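Given samples of Z arranged as rows, the covariance matrix can be estimated in one call (a sketch with numpy, added for illustration; the joint distribution below is a made-up choice):

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up 2D Gaussian for (temperature, humidity)-like data
mean = [62.0, 70.0]
cov = [[25.0, -8.0],
       [-8.0,  9.0]]
z = rng.multivariate_normal(mean, cov, size=200_000)   # shape (N, 2); each row is a sample of Z

print(np.cov(z, rowvar=False))   # estimate of Var[Z]: variances on the diagonal, covariances off it
```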

Example 2.6.1. Consider the multivariate variable Z with a triangular sample space ΩZ
shown in Figure 2.13, and distribution pZ (x, y) = c (1 − x − y).

Figure 2.13: Sample space for Example 2.6.1

a) Compute c
b) Compute the probability of the event e = {(x, y) : x ≤ 0.5}

Solution.

Figure 2.14: Example 2.6.1

a) pZ (x, y) is the gray triangular plane shown on the left side of Figure 2.14. We must
find the value of c such that the volume under the triangle equals 1. The volume of any
pyramid is one third its base times its height. Therefore we require:

(1/3) (1/2) c = 1 (2.97)

Which implies c = 6.
b) The event e is shown in Figure 2.13. The probability of its complement ΩZ \ e is the
volume of the green-shaded region in Figure 2.14. Again, we can compute this quantity using
the formula for the volume of a pyramid; one third the base times the height. In this case

the base has area 0.5 × 0.5 × 0.5 = 1/8. The height of the pyramid is c/2 = 3. The probability
of ΩZ \ e is therefore:

PZ (ΩZ \ e) = (1/3) (1/8) (3) = 1/8 (2.98)

Therefore PZ (e) = 7/8
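Both answers can be verified by integrating directly over the triangle (a sketch with sympy, added for illustration):

```python
import sympy as sp

x, y = sp.symbols('x y', nonnegative=True)
c = sp.Symbol('c', positive=True)
p = c * (1 - x - y)

# a) the volume under the pdf over the triangle {x >= 0, y >= 0, x + y <= 1} must equal 1
total = sp.integrate(p, (y, 0, 1 - x), (x, 0, 1))
print(sp.solve(sp.Eq(total, 1), c))                                           # [6]

# b) P(x <= 1/2) with c = 6
print(sp.integrate(p.subs(c, 6), (y, 0, 1 - x), (x, 0, sp.Rational(1, 2))))   # 7/8
```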

2.6.1 Correlation coefficient

The covariance between two random variables has been defined in Eq. 2.95. As the name
suggests, the covariance captures the degree to which the samples of the two quantities tend
to move together. As with the variance, the covariance can be difficult to interpret because
its unit is the product of the units of the two variables. So, for example, the covariance of
temperature and humidity has SI units of K · Pa. This deficiency was corrected in the case
of variance by defining the standard deviation. For the covariance we define the correlation
coefficient ρXY of two random variables X and Y .

ρXY = Cov(X, Y ) / (σX σY ) (2.99)

Notice that ρXY is dimensionless, which means that it is insensitive to the units with
which we measure X and Y . Indeed, it can be shown (using Eq. 2.58) that ρXY must lie in
the interval [−1, 1].
Notice that the correlation coefficient can be expressed in terms of standardized versions
of X and Y , which we will denote with X̃ and Ỹ .

ρXY = E[(X − µX)(Y − µY )] / (σX σY ) = E[((X − µX)/σX) ((Y − µY )/σY )] = E[X̃ Ỹ ] (2.100)

Figure 2.15 shows an example of possible samples of X̃ and Ỹ . The value of X̃ Ỹ for a
particular sample is the area of the rectangle with one corner at the origin and another at
the sample, with positive sign if the sample is in the first or third quadrant, and negative
sign if it is in the second or fourth quadrant. The mean of these areas, known as sample

Figure 2.15: Scatter plot of the standardized measurements.

correlation coefficient, converges to ρXY as the number of samples increases. If there tend
to be more and larger such rectangles in quadrants two and four, as in the figure, then the
correlation coefficient will be negative. If there are more and larger rectangles in quadrants
one and three, then it will be positive. If the rectangles in quadrants two and four balance
out with rectangles in one and three, then the correlation coefficient is zero.
Figure 2.16 provides several examples. Each subplot shows a scatter plot of data sampled
from a joint distribution pXY . The top row shows Gaussian distributions. Positive correlation
coefficients indicate data that has an increasing tendency. Negative correlation coefficients
indicate data that decreases. The middle plot on the top row shows uncorrelated Gaussian
data – here there is no discernible pattern.
The middle row shows examples of perfectly correlated data – that is, data from distri-
butions with correlation coefficient equal to 1 or -1. The exception is the middle plot which
lacks a correlation coefficient since σY = 0 in this case, and so ρXY is undefined.
The bottom row shows examples where the correlation coefficient is zero, even though
there is a discernible pattern in the data. Notice however that all of these cases present a
symmetry that balances rectangles in quadrants one and three with those in quadrants two
and four.

Figure 2.16: Correlation coefficient

Example 2.6.2. Find the correlation coefficient for the distribution of Example 2.6.1.

Solution. The standard deviations were found in Example 2.6.5 to be σX = σY = √(3/80).
We can plug these into Eq. 2.99 and use the definition of the expectation to obtain the
answer. The integral is too tedious to do by hand, and so we resort to Python's sympy,
which returns ρXY = −1/3.
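That computation looks roughly like the following sketch (the helper function and variable names are my own, not the author's script):

```python
import sympy as sp

x, y = sp.symbols('x y', nonnegative=True)
p = 6 * (1 - x - y)                     # joint pdf of Example 2.6.1

def expect(f):
    """Expected value of f(X, Y) over the triangular sample space."""
    return sp.integrate(f * p, (y, 0, 1 - x), (x, 0, 1))

mu_x, mu_y = expect(x), expect(y)                               # both 1/4
cov = expect((x - mu_x) * (y - mu_y))                           # -1/80
var_x, var_y = expect((x - mu_x)**2), expect((y - mu_y)**2)     # both 3/80

print(cov / sp.sqrt(var_x * var_y))                             # -1/3
```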

2.6.2 Marginalization

The joint distribution of a multivariate random variable Z = (Z1, . . . , ZD) encodes informa-


tion about all of the individual quantities Zi , as well as their correlations. We can extract the
individual distributions from the joint distribution through the process of marginalization.

As an example, take the joint distribution of temperature and humidity pZ (t, h), with
Z = (T, H). We obtain the distribution of T alone by integrating pZ over ΩH :

pT (t) = ∫_ΩH pZ (t, h) dh (2.101)

Similarly, we obtain the distribution of H alone by integrating pZ over ΩT :

pH (h) = ∫_ΩT pZ (t, h) dt (2.102)

Both of these formulas yield valid probability distributions in the sense that they are non-
negative and have unit integral. More generally, we can compute the distribution of any
number of components d < D (not just one) of a multivariate random variable with D
components, by integrating its joint distribution over the sample spaces of the other D − d
components.

Example 2.6.3. For Example 2.6.1, compute the marginal distributions of X and Y .

Solution. For each value of x, the marginal probability p_X(x) is found by integrating over
Ω_Y. We use the fact that p_XY(x, y) is only non-zero for y between 0 and 1 − x.

p_X(x) = ∫_{Ω_Y} p_Z(x, y) dy        (2.103)
       = ∫_0^{1−x} 6(1 − x − y) dy        (2.104)
       = ∫_0^{1−x} 6(1 − x) dy − ∫_0^{1−x} 6y dy        (2.105)
       = 6(1 − x)² − 6 [ y²/2 ]_0^{1−x}        (2.106)
       = 3(1 − x)²        (2.107)

We can argue by symmetry that p_Y(y) = 3(1 − y)².

Example 2.6.4. For Example 2.6.1, compute E[X].

Solution.

E[X] = ∫_{Ω_X} x p_X(x) dx
     = ∫_0^1 3x(1 − x)² dx
     = 3 ∫_0^1 x(1 − 2x + x²) dx
     = 3 [ x²/2 − (2/3)x³ + x⁴/4 ]_0^1
     = 1/4

Example 2.6.5. For Example 2.6.1, compute V ar[X].

Solution.

Var[X] = E[(X − E[X])²]
       = ∫_{Ω_X} (x − 1/4)² p_X(x) dx
       = ∫_0^1 (x − 1/4)² · 3(1 − x)² dx
       = ...
       = 3/80

As has been mentioned, the important question regarding pairs of random variables is
whether knowing the value of one of them has an influence on our belief about the other.
The correlation coefficient gave a partial answer to this question: knowledge of one influences
belief of the other when ρ_XY ≠ 0. Specifically, if ρ_XY > 0, then larger values of Y imply
larger expected values of X (and vice-versa). Conversely, when ρ_XY < 0, larger values of Y
imply smaller expected values of X (and vice-versa). However, ρ_XY = 0 does not imply that
the two are unrelated, as exemplified in the third row of Figure 2.16.

Upon further reflection we realize that a full answer to this question cannot possibly be a
single number such as the correlation coefficient, since it may depend on the value obtained
in the measurement. For example, the distribution of temperature may depend on humidity
when humidity is low, but not when it is high.
A full answer to the question of the relation between random variables must, in the
language of probability theory, tell us how the distribution of one random variable changes,
when something is known about another random variable. The concept that gives us this
answer is the conditional probability.

2.6.3 Conditional probability

Consider two random variables X and Y , both categorical, with Ω_X = {a, b, c} and Ω_Y =
{α, β}. The joint pdf of Z = (X, Y) is provided in tabular form in Figure 2.17.

Figure 2.17: Joint distribution of two categorical random variables.

From this table we see that the marginal probability of the event (Y = α) is 0.55, and
for (Y = β) it is 0.45. However the odds change if we fix the value of X. If (X = a), then
the (Y = α)-to-(Y = β) odds are 1:3. The odds flip to 3:1 if X = b or X = c. These are
examples of conditional probabilities.
More generally, let Z be a (possibly multivariate) random variable with sample space Ω_Z
and probability measure P_Z. Then for any two events e and e′ in Ω_Z, we define the
conditional probability P(e′|e) as the probability that e′ occurs, provided e occurs. It's good
here to keep an open mind about what is meant by an event "occurring", as well as with the
conjunctive "provided". For example, we have not made any assumptions about the flow of
causality, whether e causes e′, or e′ causes e, or neither. We have also not specified what

Figure 2.18: Eq. 2.108

the events represent. They may be measurements from a system, for example e may be an
input event (i.e. a possible measurement of the inputs), and e′ an output event. In a more
Bayesian setting they could be hypotheses about the world, with P(e) being our credence
value for hypothesis e. The possibilities are vast. In any case, the numerical value of the
conditional probability is obtained with:

P(e′ | e) = P_Z(e′ ∩ e) / P_Z(e)        (2.108)

provided P_Z(e) ≠ 0. We can apply this formula to the example from Figure 2.17 to compute
several other conditional probabilities.

1. The probability that Y = α, provided X = a:

   P(Y = α | X = a) = P_Z(Y = α, X = a) / P_Z(X = a) = 0.1/0.4 = 1/4        (2.109)

2. The probability that X = c, provided Y = β:

   P(X = c | Y = β) = P_Z(X = c, Y = β) / P_Z(Y = β) = 0.05/0.45 = 1/9        (2.110)

3. The probability that Y = β, provided ((Y = β) and (X = a)) or ((Y = α) and (X = c)).
   Let's use e for the conditioning event. e comprises the lower-left and upper-right
   corners of the table in Figure 2.17.

   P(Y = β | e) = P_Z((Y = β) ∩ e) / P_Z(e) = 0.3/0.45 = 2/3        (2.111)

The concept of conditional probability applies equally well to continuous random variables.
For example, we can define the probability that the temperature is between 62°F and 65°F
given that the humidity is between 70% and 72%:

P(T ∈ [62, 65] | H ∈ [70, 72]) = P_Z(T ∈ [62, 65] and H ∈ [70, 72]) / P_Z(H ∈ [70, 72])        (2.112)

You may have noticed that we have not identified the conditional probabilities (left-hand
sides of the equations in this section so far) with a random variable. This was in order to
delay the question of the status of the conditional probability function P . Does it satisfy
the axioms of probability? The answer to this question is “yes” (proof left to the reader).
Since it is a probability measure, we can use it to define a random variable, which we will
call a conditioned random variable.
Let’s be more precise with the definition of the conditioned random variable. Again,
let Z be a (possibly multivariate) random variable with sample space Ω_Z and probability
measure P_Z, and let e be an event of Z: e ⊆ Ω_Z. The event e determines a new random
variable Z|e with the same sample space:

Ω_{Z|e} = Ω_Z        (2.113)

The probability measure for Z|e is defined with:

P_{Z|e}(e′) = P_Z(e′ ∩ e) / P_Z(e)    ∀ e′ ⊆ Ω_Z        (2.114)

The simpler notation of Eq. 2.108 is often used. It is even acceptable to remove subscripts
altogether, as long as the meaning is unambiguous:

P(e′|e) = P(e′ ∩ e) / P(e)    ∀ e′ ⊆ Ω_Z        (2.115)

Figure 2.19: Conditional probability density function

What is the probability density function for the random variable Z|e? Noting that the
probability of every event e′ ⊆ Ω_Z equals the probability of its intersection with e divided
by P_Z(e), we find that the conditional pdf is:

p_{Z|e}(z) = { p_Z(z)/P_Z(e)   if z ∈ e
             { 0               if z ∉ e        (2.116)

This is illustrated in Figure 2.19. The portion of the joint pdf p_Z that is within e is scaled
by 1/P_Z(e), and the rest is "chopped off". We can also use the simplified notation p(z|e) in
place of p_{Z|e}(z), if the meaning is clear from context:

p(z|e) = { p(z)/P(e)   if z ∈ e
         { 0           if z ∉ e        (2.117)

This formula only applies when P_Z(e) ≠ 0, and hence it does not cover a case that is very
common in the sciences and engineering: the conditioning of a continuous random variable
with a measurement event. For example, when we seek the distribution of temperature given
that the humidity is known to be 70%: T |H = 70. We deal with this next.

Conditioning a continuous pdf on a measurement.

Consider a joint random variable Z with components X and Y : Z = (X, Y ), both real-
valued. The joint sample space Ω_Z is the real plane, and the joint pdf p_Z(x, y) is a function

Figure 2.20: Conditioning a continuous pdf on a measurement

from R2 to R. Now suppose that we take a measurement of the quantity X and obtain the
value xo . The conditioning event e is {(x, y) | x = xo }. Visually, e is a straight line through
Ω_Z, as shown in Figure 2.20. Being a line, the event has no area, and therefore its probability
is zero: PZ (e) = 0. This presents a problem, because the formula of Eq. 2.116 for the pdf
does not apply. We need an alternative definition for conditional probabilities of this type.

To this end we define the conditional random variable Y|X = x_o, as opposed to Z|X = x_o,
with sample space Ω_Y instead of Ω_Z. The pdf for Y|X = x_o is computed with:

p_{Y|X=x_o}(y) = p_Z(x_o, y) / p_X(x_o)    ∀ y ∈ Ω_Y        (2.118)

or, using the simplified notation, with:

p(y | X = x_o) = p_Z(x_o, y) / p_X(x_o)    ∀ y ∈ Ω_Y        (2.119)

Let’s note the di↵erences between Eqs. 2.119 and 2.117. First, the domains are di↵erent.
The domain of p(z|e) in Eq. 2.117 is all of ⌦Z . The formula therefore must account for values
of z that are outside of e – hence the two cases. The domain of p(y|X = xo ) in Eq. 2.119
on the other hand is only ⌦Y . We do not need to specify two cases since all values of (xo , y)
fall within the event (X = xo ). Another di↵ernce is the denominator on the right-hand side.
The denominator in in Eqs. 2.117 is PZ (e) – the probability of event e. In Eq. 2.119 it is
pX (xo ) – the marginal pdf of X evaluated at xo . Next, we justify Eq. 2.119 by showing that

it corresponds to the limit of Eq. 2.117 as e becomes a line.

Consider the event e = {(x, y) : x ∈ [x_o, x_o + Δ]}. This is the vertical blue band in
Figure 2.21. Define p(z|e) using Eq. 2.117. Consider the case when z ∈ e. We will transform
e into a line by letting Δ → 0.

lim_{Δ→0} p(z|e) = lim_{Δ→0} p_Z(z) / P_Z(e)        (2.120)

Hence we need to find the limits of p_Z(z) and P_Z(e) as Δ → 0. Start with p_Z(z). We can
write z in (x, y) coordinates as z = (x_o + ε, y) for some ε ∈ [0, Δ]. Then, assuming p_Z is
continuous on the line X = x_o, we have,

lim_{Δ→0} p_Z(x_o + ε, y) = lim_{ε→0} p_Z(x_o + ε, y) = p_Z(x_o, y)        (2.121)

Next, we note that P_Z(e) can be expressed in terms of the marginal cdf of X.

P_Z(e) = ∫_0^Δ ∫_{−∞}^{∞} p_Z(x_o + x, y) dy dx        (2.122)
       = ∫_0^Δ p_X(x_o + x) dx        (2.123)
       = F_X(x_o + Δ) − F_X(x_o)        (2.124)

P_Z(e) clearly vanishes as Δ → 0, but P_Z(e)/Δ becomes p_X(x_o):

lim_{Δ→0} P_Z(e)/Δ = lim_{Δ→0} [F_X(x_o + Δ) − F_X(x_o)] / Δ        (2.125)
                   = F′_X(x_o)        (2.126)
                   = p_X(x_o)        (2.127)

Hence, although p(z|e) explodes as Δ → 0, the limit of Δ·p(z|e) does exist:

lim_{Δ→0} Δ·p(z|e) = lim_{Δ→0} p_Z(z) / (P_Z(e)/Δ) = p_Z(x_o, y) / p_X(x_o)        (2.128)

p_Z(x_o, y)/p_X(x_o) has the same shape as p(z|e) in the limit. It is scaled by Δ, which gives it

unit integral and thus makes it a valid pdf.

Figure 2.21: Derivation of conditioning Eq. 2.119

Example 2.6.6. For Example 2.6.1, compute p(y|X = 1/2).

Solution. We apply the formula for the pdf of a conditional random variable:

p(y|X = x) = p_Z(x, y) / p_X(x) = 6(1 − x − y) / (3(1 − x)²)        (2.129)

With x = 1/2 this becomes:

p(y|X = 1/2) = 6(1 − 1/2 − y) / (3(1 − 1/2)²) = 4(1 − 2y)        (2.130)
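
As a quick sanity check (not part of the original example), sympy confirms that this conditional pdf integrates to one over the admissible values 0 ≤ y ≤ 1/2:

import sympy as sp

x, y = sp.symbols('x y')
p_joint = 6*(1 - x - y)
p_x = 3*(1 - x)**2                                       # marginal pdf from Example 2.6.3
p_cond = sp.simplify(p_joint/p_x).subs(x, sp.Rational(1, 2))
print(sp.expand(p_cond))                                 # 4 - 8*y, i.e. 4(1 - 2y)
print(sp.integrate(p_cond, (y, 0, sp.Rational(1, 2))))   # 1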

Example 2.6.7. A college campus has three types of vehicles: scooters, bicycles, and
mopeds. The scooters are the lightest of the three, and also the most popular, account-
ing for 50% of the total. Bicycles are heavier than scooters and account for 40%, while the
remaining 10% are mopeds and are the heaviest.

We define random variables V for the vehicle type and W for the vehicle weight. The
sample spaces are respectively Ω_V = {scooter, bicycle, moped} and Ω_W = R. The joint
distribution p_Z of Z = (V, W) is shown in Figure 2.22. It consists of three lines; if both
variables were continuous, p_Z would be a surface. We can integrate p_Z over its sample space to confirm

Figure 2.22: Joint distribution of weight and vehicle type

that it is a valid pdf.


∫_{Ω_Z} p_Z(z) dz = Σ_{v∈Ω_V} ∫_R p_Z(v, w) dw
                  = ∫_{−∞}^{∞} p(scooter, w) dw + ∫_{−∞}^{∞} p(bicycle, w) dw + ∫_{−∞}^{∞} p(moped, w) dw
                  = 0.5 + 0.4 + 0.1
                  = 1

Figure 2.23 shows the marginal distributions of W and V, obtained by integrating the joint
distribution over Ω_V and Ω_W, respectively. For each w ∈ R,

p_W(w) = Σ_{v∈Ω_V} p(v, w)        (2.131)
       = p(scooter, w) + p(bicycle, w) + p(moped, w)        (2.132)

Figure 2.23: Marginal distributions

Figure 2.24: Distributions of weight conditioned on vehicle type.

p_V is obtained by integrating p(v, w) over Ω_W.

p_V(scooter) = ∫_{−∞}^{∞} p(scooter, w) dw = 0.5        (2.133)
p_V(bicycle) = ∫_{−∞}^{∞} p(bicycle, w) dw = 0.4        (2.134)
p_V(moped)   = ∫_{−∞}^{∞} p(moped, w) dw = 0.1        (2.135)

The conditional probability of weight given vehicle type is obtained by dividing the joint
pdf by the marginal of each vehicle class. This is shown in Figure 2.24. For each w ∈ R,

p(w | V = scooter) = p(scooter, w) / p_V(scooter) = p(scooter, w) / 0.5        (2.136)
p(w | V = bicycle) = p(bicycle, w) / p_V(bicycle) = p(bicycle, w) / 0.4        (2.137)
p(w | V = moped)   = p(moped, w) / p_V(moped)   = p(moped, w) / 0.1        (2.138)

2.6.4 Bayes’ theorem

Bayes’ theorem gives a formula for swapping the roles of the two events in a conditional
probability. For any two events e and e0 in ⌦Z ,

P (e0 | e)P (e)


P (e | e0 ) = (2.139)
P (e0 )

This formula is easy to derive from the symmetry of set intersections. Since P (e \ e0 ) =
P (e0 \ e), applying the definition of Eq. 2.108 we get P (e|e0 )P (e0 ) = P (e0 |e)P (e), from
whence Eq. 2.139 immediately follows, provided P (e0 ) 6= 0.

Despite its simplicity, Bayes' rule has many useful applications. It is often used as a rule
for updating our belief in a hypothesis or statement h when an observation or measurement
m has been made.

P(h|m) = P(m|h) P(h) / P(m)        (2.140)

On the right-hand side, P(h) is our belief in hypothesis h prior to observing m – we call this
the prior belief. P(m|h) is the probability of observing m assuming that the hypothesis h is
correct – we call this the likelihood of m. P(m) is the probability of observing m when we
make no assumptions about the veracity of h. This can be computed as long as there is a
finite number of alternatives to h, whose probabilities are known.

On the computation of P (m)

Suppose we have a finite number of hypotheses {h1, . . . , hn}. P(m) can then be
obtained by marginalization: P(m) = Σ_{i=1}^n P(m, h_i). Using the definition of the conditional
probability we obtain a useful formula, known as the law of total probability:

P(m) = Σ_{i=1}^n P(m|h_i) P(h_i)        (2.141)

This formula states more generally that, given n mutually exclusive and exhaustive events
{h1 , . . . , hn } (i.e. a partition of the sample space), the probability of any other event m can
be obtained with Eq. 2.141.

Example 2.6.8. Take h to represent the event that my car starts the next time I turn the
ignition. Whether or not it starts will depend on many factors, including the amount of gas
in the tank. We will denote with m the observation that there is sufficient gas in the tank
to start the car. Suppose the car is fairly old, and it only starts about 90% of the time. My
prior belief that it will start is P (h) = 0.9. Furthermore, I sometimes forget to fill the tank,
so there is only a 95% chance that the tank has gas: P (m) = 0.95. P (m|h) is the probability
that the car has gas if it has been observed to start. This must equal 1 since a car with no
gas can never start. Applying Bayes’ rule we get,

P(h|m) = (1 × 0.9) / 0.95 = 0.947        (2.142)

Upon observing that the car has gas, my belief that it will start increases from 0.9 to 0.947.

Example 2.6.9. One of the two urns shown in Figure 2.25 is placed in front of you, but you
do not know which. You are asked to pick a marble. Before looking at the marble, what is
the probability that you’ve picked from urn A? Urn B? How do these change if the marble
turns out to be white? Black?

Figure 2.25: Two urns with black and white marbles.

The prior beliefs for urns A and B are both 0.5.

P (A) = P (B) = 0.5 (2.143)

The fact that A has half white and half black marbles, while B has all white is captured
with conditional probabilities.

P (white|A) = P (black|A) = 0.5 (2.144)

P (white|B) = 1 (2.145)

P (black|B) = 0 (2.146)

Apply Bayes’ rule:

P(A|white) = P(white|A) P(A) / P(white)        (2.147)
           = P(white|A) P(A) / [P(white|A) P(A) + P(white|B) P(B)]        (2.148)
           = (0.5 × 0.5) / (0.5 × 0.5 + 1 × 0.5)        (2.149)
           = 1/3        (2.150)

P (B|white) is therefore 2/3. We can repeat the computation for the case that we choose a

Figure 2.26: Vehicle type conditioned on the weight, as a function of the weight.

black marble.

P(A|black) = P(black|A) P(A) / P(black)        (2.151)
           = P(black|A) P(A) / [P(black|A) P(A) + P(black|B) P(B)]        (2.152)
           = (0.5 × 0.5) / (0.5 × 0.5 + 0 × 0.5)        (2.153)
           = 1        (2.154)

And therefore P (B|black) = 0.
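
The same computation is easy to script. A minimal sketch (the dictionaries simply encode the priors and the likelihoods of Eqs. 2.143–2.146):

prior = {'A': 0.5, 'B': 0.5}
likelihood = {'A': {'white': 0.5, 'black': 0.5},
              'B': {'white': 1.0, 'black': 0.0}}

def posterior(color):
    # law of total probability for P(color), then Bayes' rule for each urn
    p_color = sum(likelihood[u][color]*prior[u] for u in prior)
    return {u: likelihood[u][color]*prior[u]/p_color for u in prior}

print(posterior('white'))   # {'A': 0.333..., 'B': 0.666...}
print(posterior('black'))   # {'A': 1.0, 'B': 0.0}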

Bayes’ theorem for inference

Suppose X and Y are respectively the input and output of a system. Bayes' rule says that
for all pairs (x, y) ∈ Ω_Z:

p(x | Y = y) = p(y | X = x) p_X(x) / p_Y(y)        (2.155)
For input-output systems, we can often say that the input “causes” the output. We can also
often gather data to estimate the distributions on the right-hand side of this equation. We
can then use the equation to build a distribution of the inputs that result in output y. Thus
we can infer something about the cause x of an observed output y.

Example 2.6.10. In Example 2.6.7 we computed the probability of the vehicle’s weight
conditioned on the vehicle’s type. We can use Eq. 2.155 to obtain the distribution of a

vehicle’s type given its weight:

p(v | W = w) = [ p(w | V = v) / p_W(w) ] p_V(v)        (2.156)

Figure 2.26 shows the result of this computation for each of the three vehicle types, and
expressed as a function of weight. Notice that for each weight, the sum of the three lines
equals 1. The plot suggests the following algorithm, written here as a small Python function, for
guessing a vehicle's type based on its weight:

def guess_type(w):
    if w < 20:
        return 'scooter'
    elif w < 43:
        return 'bicycle'
    elif w < 66:
        return 'moped'
    else:
        return 'bicycle'

It is surprising that, even though mopeds are on average heavier than bicycles, we should
expect that very heavy vehicles (w > 66) are bicycles. This is due to the fact that the variance
in the weight of bicycles is larger than that of mopeds (in this fictitious example). If the
three conditional random variables W | V = bicycle, W | V = scooter, and W | V = moped
had equal variance, then this sort of reversal would not occur. We will return to this issue
when we study logistic regression.
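
A sketch of the computation behind Figure 2.26. The Gaussian weight distributions below are invented stand-ins for the fictitious curves of Figure 2.24 (their means and standard deviations are our assumptions), so the resulting numbers will not match the figure exactly; the point is only the mechanics of Eq. 2.156:

from scipy.stats import norm

p_V = {'scooter': 0.5, 'bicycle': 0.4, 'moped': 0.1}
# Assumed conditional weight pdfs p(w | V = v); parameters invented for illustration
weight_pdf = {'scooter': norm(15, 5), 'bicycle': norm(30, 12), 'moped': norm(50, 6)}

def p_type_given_weight(w):
    joint = {v: weight_pdf[v].pdf(w)*p_V[v] for v in p_V}   # p(w | v) p_V(v)
    p_w = sum(joint.values())                               # marginal p_W(w)
    return {v: joint[v]/p_w for v in joint}                 # Eq. 2.156

print(p_type_given_weight(25))   # posterior over vehicle types at w = 25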

2.6.5 Independence

Two random variables are independent if learning the outcome of one does not change our
belief about the other. For example, the roll of a die and the temperature of the room
are independent random variables: knowing the temperature does not influence what we
believe about the die, and vice-versa. Conversely, two random variables are dependent when
observing one of them can inform our belief about the other. This is true of temperature and

humidity. We can also generalize this notion to more than two random variables. Multiple
random variables are pairwise independent when every pair is independent. And even more
strongly, multiple random variables are jointly independent when no measurements from
any subset of the variables provides information about any other variable. The difference
between pairwise independence and joint independence is subtle, and in this course we will
use the term "independent" to mean "jointly independent". The question of independence is
of crucial importance when working simultaneously with multiple random quantities. If the
quantities are independent, then they can be treated separately. When they are dependent,
the goal is often to infer the nature of their relationship from the data.

Before getting to independent random variables, let's first define the notion of two inde-
pendent events in an abstract sample space. Two events e ⊆ Ω and e′ ⊆ Ω are independent
if the probability that they both occur equals the product of the probabilities that they each
occur:

P(e ∩ e′) = P(e) P(e′)        (2.157)

Using Eq. 2.108 we find that independent events satisfy:

P(e|e′) = P(e)        (2.158)
P(e′|e) = P(e′)        (2.159)

This form makes clear that the prior and posterior probabilities are the same, and hence no
information about the occurrence of one event is gained by observing the other.

Two random variables X and Y are independent (denoted as X ⊥ Y) if no event in one
can inform any event in the other. To make this precise, let's define Z = (X, Y), the joint
random variable. The sample space of Z is the product of the sample spaces of X and Y:

Figure 2.27: Independent random variables.

Ω_Z = Ω_X × Ω_Y. Now, consider two arbitrary events in X and Y.

α_x ⊆ Ω_X        (2.160)
α_y ⊆ Ω_Y        (2.161)

From these we can construct events e_x and e_y in Ω_Z:

e_x = {(x, y) : x ∈ α_x}        (2.162)
e_y = {(x, y) : y ∈ α_y}        (2.163)

This is illustrated in Figure 2.27. e_x and e_y are the projections of α_x and α_y into the joint
sample space Ω_Z. We say that X and Y are independent if e_x and e_y are independent events
in Z for any choice of α_x and α_y.

P_Z(e_x ∩ e_y) = P_Z(e_x) P_Z(e_y)        (2.164)

For independent random variables, the joint pdf decomposes into the product of the marginal
pdfs. For all (x, y) ∈ Ω_Z:

p_Z(x, y) = p_X(x) p_Y(y)        (2.165)

Proof: In Eq. 2.164, evaluate the probabilities using the respective pdfs:

∫_{e_x ∩ e_y} p_Z(x, y) dx dy = ( ∫_{e_x} p_Z(x, y) dx dy ) ( ∫_{e_y} p_Z(x, y) dx dy )        (2.166)
                             = ( ∫_{α_x} p_X(x) dx ) ( ∫_{α_y} p_Y(y) dy )        (2.167)

Both sides of this equation can be written as double integrals over x and y:

∫_{α_y} ∫_{α_x} p_Z(x, y) dx dy = ∫_{α_y} ∫_{α_x} p_X(x) p_Y(y) dx dy        (2.168)

Since the intervals were arbitrary, this implies that the integrands must be equal to each
other. □

The pdf of a random variable remains unchanged when it is conditioned on an event in a
variable of which it is independent. With X ⊥ Y,

p(y | X = x) = p_Y(y)        (2.169)
p(x | Y = y) = p_X(x)        (2.170)

Proof

The result is an immediate consequence of Eq. 2.119:

p(y | X = x) = p_Z(x, y) / p_X(x) = p_X(x) p_Y(y) / p_X(x) = p_Y(y)        (2.171)

The covariance of two independent random variables is zero:

X ⊥ Y  ⇒  Cov(X, Y) = 0        (2.172)

Proof: Start with the expected value of the product of X and Y:

E[XY] = ∫_{Ω_Z} x y p_Z(x, y) dx dy        (2.173)

Since X and Y are independent, we can apply Eq. 2.165:

E[XY] = ∫_{Ω_XY} x y p_X(x) p_Y(y) dx dy        (2.174)

The double integral can be separated into two simple integrals:

E[XY] = ( ∫_{Ω_X} x p_X(x) dx ) ( ∫_{Ω_Y} y p_Y(y) dy ) = E[X] E[Y]        (2.175)

This is actually an interesting finding: the expected value of a product of independent


random variables is the product of their expected values. Next we define µX = E[X] and
µY = E[Y ], and note the following identity, obtained using the linearity of the expected
value (Eq. 2.45):
2µX µY = E[µX Y + µY X] (2.176)

Combining equations 2.175 and 2.176:

E[XY ] + 2µX µY = µX µY + E[µX Y + µY X] (2.177)

Rearranging and again using the linearity of the expectation we get:

E[XY + µ_X µ_Y − µ_X Y − µ_Y X] = 0        (2.178)

which under further rearrangement becomes the covariance:

E[(X − µ_X)(Y − µ_Y)] = 0        (2.179)

Independence vs. correlation

Independence is a binary (true/false) property that tells us whether two random variables
have any relation to each other. When two random variables are independent, then they
are also uncorrelated (correlation coefficient equals zero), and there is nothing else to be
said about their relationship. However when they are not independent, then the correlation
coefficient tells us something about the manner in which they are related. Specifically it
tells us the degree to which their relationship is linear. With ρ_XY = 1 or ρ_XY = −1, the
relationship is perfectly linear. When ρ_XY = 0, the relationship is not at all linear. However
this does not preclude many nonlinear relationships, such as the ones exhibited in the bottom
row of Figure 2.16.

Multivariate Gaussian variables

A multivariate Gaussian random variable Y = (Y1, . . . , Yn) is one whose univariate marginals
are all Gaussian: Y_i ∼ N(µ_i, σ_i²) for i = 1, . . . , n.

Y = (Y1, . . . , Yn) ∼ N(µ_Y, Σ_Y²)        (2.180)

Here µ_Y ∈ R^n is the mean, and Σ_Y² ∈ R^{n×n} is the covariance matrix, which, like all covariance
matrices, is symmetric and positive-definite. Figure 2.28 shows the bell-shaped pdf of a
two-dimensional normal random variable. The level-sets (horizontal slices) of the pdf are
concentric ellipses in the sample space (horizontal plane). These are centered on µ_Y.

Figure 2.28: Multivariate normal pdf.

The formula for the multivariate pdf can be found in the Wikipedia article titled ‘Mul-
tivariate normal distribution’, but we will not be needing it. The important point is that it
describes a set of measurements, each of which is Gaussian.
Gaussian variables are an important exception to the observation stated in the previous
part, that uncorrelated variables may not be independent. If two variables X and Y are
Gaussian, then zero correlation implies independence:

X ⊥ Y  ⟺  ρ_XY = 0        (2.181)

Chapter 3

Optimization theory

In this course we will learn several data-based techniques for building models of systems.
Many of the techniques will follow a common paradigm: first we propose a parameterized
family of models, then we choose the member of that family that best fits the data, according
to some criterion. The search for the best-fitting model will be cast as an optimization
problem, and we must therefore establish some of the basic concepts of optimization theory.

Optimization problems arise whenever we are faced with the task of choosing a best
option from a set of possible options. This is an extremely broad formulation, and indeed
optimization theory is useful in many different settings. Within engineering it can be applied
to problems as wide ranging as these:

• What shape should we give a part such that its cost is minimized while meeting a
specification?

• What voltage should we apply to each of the motors of a drone in order to stabilize its
flight?

• How should the weights of a neural network be set so that it reliably identifies cars in
images?

3.1 Problem formulation

The specification of an optimization problem has three parts.

1. The decision vector is an array of decision variables. Each of these variables can
   be real- or discrete-valued; however, we will only encounter real-valued variables in
   this course. Let D be the number of decision variables.

2. The search set or feasible set Ω ⊆ R^D. This set should not be confused with the
   sample space from the previous chapter. This is the set of permissible values for the
   decision variables. The feasible set is typically specified by applying a series of equality
   and inequality constraints to R^D.

3. The objective function J : Ω → R. This function assigns to each feasible decision
   vector x ∈ Ω a measure of its quality. We will assume here that more desirable values
   of x have smaller values of J(x).

Our goal will be to find the best possible decision vector, which we will denote with x*. That
is, we seek an x* ∈ Ω that satisfies:

J(x*) ≤ J(x)    ∀ x ∈ Ω        (3.1)

x* is called the global solution to the problem. Note that the global solution may not be
unique – there may be multiple x ∈ Ω, all with the same value of J(x) that is the smallest
in Ω. Note also that there may be no solution – in the same way as there is no smallest number,
there may be no smallest value of J(x) amongst x ∈ Ω.
We formulate an optimization problem using the following notation:

minimize_{x ∈ R^D}   J(x)
subject to:  x ∈ Ω        (3.2)

These are the elements in the notation:

• “minimize”: This is the main directive to find the smallest value of J(x). If the goal
is to maximize rather than minimize a function, we can use the “maximize” directive,
or we can equivalently minimize the negative of the function.

• "x ∈ R^D" indicates the domain of the decision vector. As stated above, we focus
  here on real numbers. However this is where one would indicate that the variables are
  integer-valued, if that were the case.

• “J(x)”: The objective function.

• "subject to", indicating that what follows is a list of constraints that carves the search
  space Ω out from the domain R^D. This is often abbreviated as "s.t.".

Ω is specified as a set of n equality constraints f_i(x) and m inequality constraints g_j(x):

minimize_{x ∈ R^D}   J(x)
subject to:  f_i(x) = 0    ∀ i ∈ {1, . . . , n}        (3.3)
             g_j(x) ≤ 0    ∀ j ∈ {1, . . . , m}

Using "argmin" in place of "minimize" returns the optimal decision vector x*.

x* = argmin_{x ∈ R^D}   J(x)
subject to:  f_i(x) = 0    ∀ i ∈ {1, . . . , n}        (3.4)
             g_j(x) ≤ 0    ∀ j ∈ {1, . . . , m}

Example 3.1.1. Suppose you wish to build a rectangular enclosure for your pet rabbit using
a fixed length ℓ of fence material, and you would like to know the width w and depth δ that
maximize the area of the rectangle. This can be posed as an optimization problem with
decision variable x = (w, δ). The objective is to maximize the function J(w, δ) = wδ. The
feasible values are all of the positive w and δ that add up to a perimeter of ℓ. Here is the
formulation as an optimization problem.

Given ℓ,
maximize_{(w,δ) ∈ R²}   wδ
subject to:  2w + 2δ = ℓ
             w ≥ 0,  δ ≥ 0        (3.5)

We can convert the problem to a minimization simply by flipping the sign of the objective
function. Objective functions in minimization problems are usually called "cost" functions.

minimize_{(w,δ) ∈ R²}   −wδ
subject to:  2w + 2δ = ℓ
             w ≥ 0,  δ ≥ 0        (3.6)

This problem has D = 2 decision variables, n = 1 equality constraint, and m = 2 inequality
constraints. Figure 3.1 provides an illustration.
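
For this small problem we can also hand the formulation directly to a numerical solver. The sketch below uses scipy (the fence length ℓ = 10 and the starting point are arbitrary choices) and recovers the square solution w* = δ* = ℓ/4:

from scipy.optimize import minimize

ell = 10.0                                     # fence length (arbitrary value for the demo)

def cost(x):
    w, d = x                                   # d stands for the depth
    return -w*d                                # minimize the negative of the area

constraints = [{'type': 'eq', 'fun': lambda x: 2*x[0] + 2*x[1] - ell}]
bounds = [(0, None), (0, None)]                # w >= 0, depth >= 0

result = minimize(cost, x0=[1.0, 1.0], bounds=bounds, constraints=constraints)
print(result.x)                                # approximately [2.5, 2.5], i.e. ell/4 each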

3.1.1 Global vs. local solutions

The optimal decision vector x⇤ is known as the global solution to the problem. A weaker sense
of solving an optimization problem is to find a local solution. This is a feasible point (a.k.a.
decision vector) that is best amongst its immediate feasible neighbors, but not necessarily
best overall. For many problems, this will be the best we can do. A vector x+ is a local

Figure 3.1: The feasible set is the red line segment, which is the restriction of the line
2w + 2δ = ℓ to the positive quadrant. The solution (w*, δ*) must lie on this line. The cost
function J = −wδ is a surface that dips into the page, and descends from the bottom left
corner to the upper right corner. Its level sets are the gray curves. The lowest point along
this surface on the red line is at the blue dot, i.e. at w* = δ* = ℓ/4. Thus, the optimal shape
is a square.

solution if,

J(x⁺) ≤ J(x)    ∀ x ∈ Ω ∩ B_ε(x⁺)        (3.7)

where B_ε(x⁺) is a small ball centered at x⁺ with radius ε,

B_ε(x⁺) = {x : ‖x − x⁺‖ < ε}        (3.8)

3.2 Types of feasible points

Next we list three ways of categorizing the elements of Ω: interior vs. non-interior, differen-
tiable vs. non-differentiable, and stationary vs. non-stationary.

Interior vs. non-interior points

As the name suggests, an interior point of a set is one that is located “inside” the set. All
other points are non-interior (a.k.a boundary) points. The formal definition of an interior
point is one that can be placed at the center of a ball that is entirely contained in the set.
Points that cannot be put inside a ball that is contained in the set are non-interior. We will
use Ω° to denote the set of interior points (a.k.a. the interior) of Ω.

We can characterize Ω° in terms of the constraints of the problem. Ω° is the set of points
for which no constraint is active. An active constraint is one in which the relation is satisfied
with the "=" symbol. Equality constraints are always active, and hence a problem with
equality constraints has no interior points. For problems with only inequality constraints, Ω°
is the set of points that satisfies g_j(x) < 0 for all j ∈ {1, . . . , m}.

Differentiable vs. non-differentiable points

We say that a point x is a differentiable point when the cost function J is continuously
differentiable at x. This means that the gradient of J, denoted with ∇J, exists and is
continuous at x. Otherwise x is a non-differentiable point. The gradient is a generalization
of the scalar derivative to functions with multiple inputs. ∇J is a vector in R^D that points
in the direction of most rapid increase of J.

Stationary vs. non-stationary points

A point x ∈ Ω is a stationary point of J when ∇J(x) = 0. This means that the function
neither increases nor decreases in any direction. It is locally flat. Stationary points are important
points to consider when solving optimization problems, as we will see in the next section.

Example 3.2.1. The feasible set in the enclosure example has no interior, because it has
an equality constraint.

Figure 3.2: Categorizing feasible points

Figure 3.2 illustrates these concepts. Plot (a) shows a non-differentiable point x1. This
is not a stationary point, since the gradient is not defined at x1. In (b) there is a continuum
of stationary points between x2 and x3. All are interior points, differentiable, and also global
minima. Plot (c) shows two stationary points x4 and x5. x4 is a local solution but not a
global solution; x5 is not a local solution. Plot (d) shows another example of a stationary
point that is neither a local nor a global solution. Finally, in plot (e), x7 is a non-stationary
non-interior point that is both a local and a global solution.

From these plots we can begin to see that the solutions to optimization problems are of
at least three types:

1. non-differentiable points, as in (a),

2. non-interior points, as in (e), and

3. stationary points, as in (b).

The first order condition for optimality establishes that these are in fact the only possibilities.

3.2.1 First order optimality condition

The first order optimality condition states that any local solution that is both differentiable
and interior must also be stationary.

x is a differentiable, interior, local solution  ⇒  x is stationary        (3.9)

At first glance, the statement may not seem very useful. It says that, if we know that a
point is a local solution, as well as differentiable and interior, then we can assert that it
is stationary. However it is much easier to test for stationarity than for local optimality;
just evaluate the gradient. A better interpretation requires that we first realize that all
points are either 1) differentiable and interior, or 2) non-differentiable, or 3) non-interior
(some may be both 2 and 3). The first order condition then tells us that any solution that
falls into category 1) must be a stationary point. The condition therefore suggests that we
seek a solution amongst three sets of points: the non-differentiable points, the non-interior
points, and the stationary points. It is most impactful when J is continuously differentiable
everywhere and the problem has no constraints. In this case the condition reduces to,

x is a local solution  ⇒  x is stationary        (3.10)

That is, all solutions are to be found amongst the stationary points. Add to this the fact
that all global solutions are local solutions.

x is a global solution  ⇒  x is a local solution  ⇒  x is stationary        (3.11)

In this context, stationarity is a necessary but not sufficient condition for local and hence
global optimality. Non-sufficiency is demonstrated by points x5 and x6 of Figure 3.2, which
are both stationary but not local solutions. Despite not being sufficient, the condition

Figure 3.3: Example 3.2.2.

suggests a procedure for solving differentiable, unconstrained optimization problems. First,
find all of the stationary points (by solving ∇J(x) = 0). Then, assuming there are a finite
number of such solutions, evaluate J for each of them, and choose the minimizer. The
following example demonstrates this procedure in 1D.

Example 3.2.2. Find the minimum of the function J(x) = x²(x − 3)(x − 2).
Solution. A plot of J(x) is shown in Figure 3.3. We begin by computing the derivative of
J(x).

∇J(x) = dJ/dx = d/dx [x²(x − 3)(x − 2)] = 4x³ − 15x² + 12x        (3.12)

∇J(x) is continuous everywhere. Since the problem is also unconstrained, we are assured
by the first order necessary condition that any local solution must be stationary. The roots
of ∇J(x) can be found (after factoring out x) with the standard formula for quadratic equations,
or using Python, and they are {0, 1.16, 2.60} (green dots in the figure). Finally we evaluate J on each of the
stationary points and choose the one with the least value: x* = 2.60 (red star in the figure).
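
The stationary points can also be found numerically. A short sketch with numpy (the coefficient list is just the polynomial ∇J from Eq. 3.12):

import numpy as np

J = lambda x: x**2*(x - 3)*(x - 2)
# dJ/dx = 4x^3 - 15x^2 + 12x; its roots are the stationary points
stationary = np.roots([4, -15, 12, 0]).real
print(np.sort(stationary))           # approximately [0, 1.157, 2.593]
x_star = min(stationary, key=J)
print(x_star, J(x_star))             # approximately 2.593 and -1.62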

When presented with an optimization problem, there are some important questions to con-
sider:

1. What is its size? That is, how many decision variables and constraints does it have?
Both the dimension and the number of constraints strongly influence the amount of
computation and time needed to solve a problem.

2. Are the decision variables real or integer-valued? Integer-valued problems are harder
to solve than real-valued problems, since many numerical methods rely on the gradient.
Fortunately, the problems that we will encounter in this course all involve real-valued
decision variables and differentiable objective functions.

3. Is the problem convex? The first order conditions become “supercharged” if we can
establish that the problem is convex.

3.3 Convex optimization problems

Convex optimization problems are ones with a special structure that makes them relatively
easy to solve. All other problems are non-convex. Non-convex problems are usually difficult
to solve in the global sense, although they can sometimes be solved in the local sense using
the first order condition.

A convex optimization problem is a minimization problem in which both the feasible set
Ω and the cost function J are convex. The definitions of a convex set and a convex function
are given next.

• A set is convex if for any two of its elements a and b, the line segment ab is entirely
contained in the set.

• A function is convex if its epigraph is a convex set. The epigraph of a function is
  the set of points that lie above the graph of the function. The epigraph of f(x) is
  epi(f) = {(x, y) | y ≥ f(x)}.

Figure 3.4: Convex sets and functions

These concepts are illustrated in Figure 3.4. On the left we see convex and non-convex
sets. For the convex set, no line segment that begins and ends within the set, leaves the
set. The non-convex set has a “dimple”, which violates convexity. Convex functions are
bowl-shaped. Their epigraph (the region above the function) is a convex set. A convex
optimization problem is therefore one with a dimple-less feasible set and a bowl-shaped cost
function.

3.3.1 Properties of convex optimization problems

There are two important facts about convex problems that make them relatively easy to
solve.

1. For convex problems, every local solution is a global solution. The practical implication
of this is that we can use local solvers (e.g. gradient descent) to find global solutions.

2. For convex problems with continuously differentiable cost, stationary points are local
solutions. That is, situations such as that of points x5 and x6 in Figure 3.2 do not
arise in convex problems.

Together, these add left-facing arrows to Eq. 3.11, which now becomes:

x is a global solution  ⟺  x is a local solution  ⟺  x is stationary

Hence, for unconstrained smooth convex problems, the set of stationary points is the same
as the set of global solutions. To solve such problems we therefore need only solve the
system of equations ∇J(x) = 0. We may, for simple problems, be able to find a solution
analytically. In practice we use numerical methods such as Newton's method, or gradient
descent. Newton's method can be faster, but it requires knowledge of the Hessian of J (i.e.
its second derivative). The gradient descent method requires knowledge only of ∇J, and it
is the method that we will use in this course (Section 3.4).

Next we list a few important examples of convex sets and convex functions.

3.3.2 Examples of convex sets

Euclidean norm ball

The Euclidean norm of a vector x = (x1, . . . , xd) ∈ R^d, also known as the 2-norm, is a
generalization of the standard 3D notion of distance to a d-dimensional vector space. We
denote it with ‖x‖₂.

‖x‖₂ = ( Σ_{i=1}^d x_i² )^{1/2}        (3.13)

Notice that in three dimensions, this is the straight-line distance from the origin to x:
‖x‖₂ = √(x₁² + x₂² + x₃²). A Euclidean norm ball (a.k.a. the 2-norm ball) is a generalization of a 3D
sphere. It is defined as the set {x ∈ R^d : ‖x‖₂ ≤ r}, where r is the radius of the ball. The
Euclidean norm ball is convex.

p-norm balls

The Euclidean norm can itself be generalized by replacing the 2's in Eq. 3.13 with p's, where
p is a positive integer. With p ≥ 1, the p-norm is defined as,

‖x‖_p = ( Σ_{i=1}^d |x_i|^p )^{1/p}        (3.14)

The p-norm ball of radius r is analogous to the Euclidean norm ball: it is the set of points
whose p-norm is at most r: {x ∈ R^d : ‖x‖_p ≤ r}. The p-norm ball is convex. Of course,
nothing prevents us from using p < 1, however the resulting function fails to be a norm, and
its corresponding ball fails to be convex.

Figure 3.5: p-norm balls with unit radius (Image from Wikipedia)

Figure 3.5 shows some examples of p-norm balls in R2 . Note three important cases.

• 1-norm. With p = 1 the ball appears diamond-shaped. This is the so-called Manhat-
tan, or taxicab norm, because it measures distance only along vertical and horizontal
displacements (like a taxicab driving through Manhattan).

• 2-norm. The Euclidean norm ball is a circle.

• ∞-norm. As p goes to infinity, the p-norm ball approaches a square, and Eq. 3.14
  returns the largest absolute value among the components of x. (A quick numerical
  check of these three norms follows this list.)

  ‖x‖_∞ = max(|x₁|, |x₂|, . . . , |x_n|)        (3.15)
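
These three norms are easy to evaluate with numpy; a quick check on an arbitrary vector (a sketch, not from the reader):

import numpy as np

x = np.array([3.0, -4.0])
print(np.linalg.norm(x, 1))        # 7.0  (Manhattan norm)
print(np.linalg.norm(x, 2))        # 5.0  (Euclidean norm)
print(np.linalg.norm(x, np.inf))   # 4.0  (largest absolute component)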

Affine equalities

An affine equality is a formula of the form,

α₁x₁ + α₂x₂ + . . . + α_d x_d = β        (3.16)

Here the α_i's and β are real numbers, and the x_i's are the decision variables. The set of
points that satisfy this formula is called a hyperplane, and it is the generalization of a 3D
plane to d dimensions. We can arrange n such affine equality constraints into a matrix form.

Ax = b        (3.17)

where A is an n × d matrix whose coefficients are the α's, and b is an n × 1 column vector
with the β's. The set of points that satisfy Eq. 3.17, or equivalently, the intersection of n
hyperplanes, is convex.

Convex inequalities

A convex inequality is a formula of the form,

g(x) ≤ 0        (3.18)

where g(x) is a convex function. The set of points x that satisfy a convex inequality is
convex. Because the intersection of any number of convex sets is also convex, we find that
the set of points satisfying any number of convex inequalities is a convex set.

This leads us to an important fact about convex optimization problems.

A constraint set consisting of affine equalities and convex inequalities defines a convex
feasible set ⌦.

In fact, all convex feasible sets are made up of affine equality constraints and convex
inequality constraints.

3.3.3 Convex functions

Here are some examples of convex functions. In each case, J is a function from RD to R.

• Affine functions. J(x) = a · x + b, with a ∈ R^D, b ∈ R (a · x is the dot product of a
  and x).

• p-norms. J(x) = ‖x‖_p with p ≥ 1.

• Function composition: J(x) = g(h(x)) is convex whenever g is convex and h is affine.
  Here h is a function that takes x ∈ R^D and returns a real number, and g takes that
  real number and returns another real number. In other words, convexity is preserved
  by composition with an affine scalar function.

These examples will show up as cost functions later in the course.

Next we present numerical algorithms for solving optimization problems. Most of the prob-
lems that we will encounter are differentiable and have no constraints. The first-order opti-
mality condition tells us that it will suffice to find a stationary point. This is exactly what
the gradient descent method is designed to do.

3.4 Gradient descent

Gradient descent is a numerical technique for minimizing differentiable, real-valued functions.


The method can be understood by imagining the function as a hilly terrain, as depicted in
Figure 3.6. The algorithm advances like a person walking along the terrain in search of its
lowest point. The person can only look down at the ground – they cannot raise their sight

Figure 3.6: The gradient descent method

to look around. At each moment, they observe the slope of the ground under their feet, and
move in the downward direction. They continue in this way until they reach an area that is
flat, i.e. a local minimum.

Gradient descent advances in the direction of the negative gradient. In our two-dimensional
example, the gradient is the blue arrow in the horizontal plane, pointing in the direction of
steepest ascent. Its negative (the green arrow) indicates the direction of steepest descent.

The gradient descent algorithm begins with the arbitrary selection of a starting point
x0 2 ⌦. From there, it proceeds by taking steps to x1 , x2 , . . . until no more downward
progress can be made. If the problem is unconstrained, then we can conclude that the
algorithm has reached a stationary point. If furthermore the problem is convex, then we
have found a global minimum.

The update rule for gradient descent is,

x_{k+1} = x_k − γ ∇J(x_k)        (3.19)

Here x_k is the value after k steps and γ is the step size parameter. The step size can be
kept fixed, or it can be varied with each step, either in a predetermined manner or in a way
that depends on the current value of the gradient. There are strategies for varying γ that
guarantee convergence to a local minimum. There are also bad choices for γ that prevent
convergence. Gradient descent does not guarantee convergence to a local minimum unless
the step size parameter is properly chosen.
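
A minimal sketch of the update rule applied to the function of Example 3.2.2. The starting point and the fixed step size are arbitrary choices for illustration:

def grad_J(x):
    return 4*x**3 - 15*x**2 + 12*x   # gradient of J(x) = x^2 (x - 3)(x - 2)

x = 3.0                              # arbitrary starting point x_0
step = 0.01                          # fixed step size (not tuned)
for _ in range(1000):
    x = x - step*grad_J(x)           # Eq. 3.19
print(x)                             # settles at the stationary point near 2.593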

Although we will not study them here, it is worth noting that there are extensions of
gradient descent for problems with constraints. Equality constraints are typically treated by
appending them to the cost function using Lagrange multipliers. For inequality constraints,
projected gradient methods prevent the solution from leaving Ω by projecting the gradient
onto the boundary. Alternatively, the Frank-Wolfe algorithm finds feasible directions by
solving intermediate convex programs.

Next we will introduce a variant of gradient descent that is useful for optimization prob-
lems that are typical of data-based modeling techniques.

3.4.1 Stochastic gradient descent (SGD)

The techniques of supervised learning described in later chapters can all be described as
different formulations of the following optimization problem.

θ* = argmin_{θ ∈ R^d}  E[L(Y; θ)]        (3.20)

This is a type of optimization problem that we had not encountered before – one involving
a random variable Y. The goal here is to choose the values of d decision variables (θ) such
that the expected value of the function L(Y; θ) is minimized. Since Y is a random variable,
L(Y; θ) is random as well, and so it makes sense to consider its expected value. The function
L is known as the loss function. It is not important at this point to understand the purpose

of the loss function, nor to understand the justification for this optimization problem. We
will get to that in Chapter 5. For now we are only interested in the mechanics of solving the
problem using a technique called stochastic gradient descent.

Figure 3.7: Stochastic cost function

Figure 3.7 provides an illustration in 2D. The search space is the set of feasible parameters
θ = (θ₁, θ₂) – i.e. the horizontal plane. For each value of θ, the variations in Y produce
variations in L(Y; θ). Hence, we can visualize L(Y; θ) as a "fuzzy" function and E[L(Y; θ)]
as a smooth function that approximates it at every point. Our goal is to find the lowest
point on this surface.

To do this, we use the given dataset D = {y_i}_N to produce an estimate of the objective
function.

E[L(Y; θ)] ≈ (1/N) Σ_{i=1}^N L(y_i; θ)        (3.21)

The multiplicative factor 1/N does not influence the result, and can therefore be dropped.
Applying standard gradient descent (Eq. 3.19) to this problem produces the following update
rule.

θ_{k+1} = θ_k − γ ∇_θ ( Σ_{i=1}^N L(y_i; θ) )        (3.22)
        = θ_k − γ Σ_{i=1}^N ∇_θ L(y_i; θ)        (3.23)

The notation ∇_θ indicates that the gradient is taken with respect to θ. Eq. 3.23 is the direct
application of gradient descent to the stochastic optimization problem. This works well with
small to medium-sized datasets. However for large datasets (large N), the computation
becomes inefficient, since each step involves N evaluations of ∇_θ L; one for each data point
y_i. For N in the tens or hundreds of thousands, this is an excessive amount of computation
for a single parameter update.

The idea behind SGD is simple: instead of basing the estimate of ∇_θ J on the full dataset
D, use a reduced dataset B, which we call a batch. Then the update becomes,

θ_{k+1} = θ_k − γ Σ_{y_i ∈ B} ∇_θ L(y_i; θ)        (3.24)

The batch size is a parameter of the algorithm. However, since the sample mean is unbiased
for all N, choosing a smaller batch will not affect the bias of the estimate, although it will
increase its variance. The hope is that this increased variance is compensated by the more
frequent steps taken by the algorithm, which allows it to advance quickly in the early stages.

Another parameter of SGD is the method by which we sample B from D. Certainly we


do not want any points in D to be ignored. Hence there are two reasonable approaches:
sampling with replacement and without replacement. To sample with replacement means to
choose each element of B randomly from the full D. Sampling without replacement means
that we partition D into an integer number K of batches (K = N/|B|), and we use one batch
per step. A single K-step pass through D is called an epoch. Typically, SGD will run for
many epochs.
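
A schematic implementation of SGD with sampling without replacement. The loss here is a simple quadratic, L(y; θ) = (y − θ)², chosen only so that the example is self-contained (its minimizer is the mean of Y); the dataset, batch size, and step size are likewise arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)     # dataset D = {y_i}

def grad_L(y, theta):
    return 2*(theta - y)             # gradient of L(y; theta) = (y - theta)^2

theta, step, batch_size = 0.0, 0.002, 50
for epoch in range(5):
    rng.shuffle(data)                                   # new partition of D each epoch
    for batch in data.reshape(-1, batch_size):          # one batch per update, Eq. 3.24
        theta = theta - step*np.sum(grad_L(batch, theta))
print(theta)                                            # close to the true mean, 2.0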

The stopping criterion for SGD is a bit tricky. The noisy nature of the algorithm means
that it will bounce around without ever settling down, and hence we cannot simply look at
θ_{k+1} − θ_k to decide when to stop. Rather, as SGD advances, we track the performance
of the resulting model and stop when the performance begins to deteriorate, due to a phe-
nomenon known as overfitting. We will delve deeper into these topics in lab and later in the
course.

Chapter 4

Statistical inference

Having learned the fundamental concepts of probability and optimization theory, we will now
apply those concepts to the task of building models of systems from measurements alone.
In this chapter we begin with “input-less” systems; systems with inputs and outputs will be
introduced in Chapter 5.

Figure 4.1: The dataset is collected from the system.

The setup is illustrated in Figure 4.1. The system is a closed box, so we can only gain
information about its internal process by sampling and analyzing its output. For example,
if the system is a population of people, we may perform a survey. If the system is a physical
phenomenon, we may design an experiment and collect measurements.
The goal of the field of Statistics is to aid the summary and interpretation of data.
Because datasets are often too large for a single person to grasp, we use the techniques of
Statistics to reduce the information contained in the dataset to a few important facts of
interest. For example, we might be interested, not in the tensile strengths of individual
samples of aluminum, but in the average strength of the metal. Or, we may be interested in

the uncertainty in the strength – that is, the variations that can be expected in a collection
of aluminum samples.

Toward the goal of summarizing the data, it is useful to think about the larger population
from which the data is sampled. In medical studies, where we may be interested in some
physiological fact about humans, the population being sampled is the entire species, or
maybe everyone between the ages of 20 and 35. This population is large – there are about
8 billion living humans – but it is not infinite. On the other hand, if the data collection
process is subject to measurement errors, then the “population” may be regarded as being
infinitely large. For example, we may take several measurements of the tensile strength of a
single aluminum rod and obtain slightly di↵erent values each time. In this case we imagine
the samples as coming from an infinite population of slightly di↵erent rods. In the finite
case, there exists a “true” distribution, which is a histogram constructed from the entire
population, and we may calculate (at least in principle) a true mean, a true variance, etc.
In the second case of an imagined infinite population, however, it is difficult to justify the
notion of a “true” distribution. It is typical in statistics, however, to ignore this issue, and
to speak of the “true” distribution, “true” mean, “true” variance, etc. in both cases. We
will adhere to this convention and refer to the “true” distribution, regardless of whether the
population is finite or infinite. Thus the distinction between finite and infinite becomes
meaningless and we can regard all samples as coming from an unknown true distribution.

A “model” is a random variable whose pdf approximates the true distribution. Using a
random variable allows us to express our assumptions about the system in precise mathe-
matical terms, as constraints on the universe of possible distributions. For example, we may
declare that our model is Gaussian, or mixture-Gaussian, or Bernoulli, thus eliminating a
large number of alternatives. Having selected an initial pool of candidate models (e.g. all
Gaussian distributions), we can then use the data to reduce this pool, perhaps to a single
final model. Here we encounter a trade-off between assumptions and data. The stronger

the assumptions, the smaller the initial set of candidate models, and therefore the easier the
task for the data. Conversely, with fewer or weaker assumptions, the larger the initial pool,
and more data is required to find a model of comparable quality. Modeling assumptions are
therefore useful because they can reduce the amount of data required. Unless the assump-
tions are incorrect. In this case the assumptions may set us back, and we may need lots of
good data to outweigh them. The same is true of the data: more data is better because it
reduces the need for assumptions; unless the data is erroneous, in which case it can have a
detrimental effect.

4.0.1 The data

We begin the modeling procedure by collecting a dataset D consisting of N samples from


the system. This is denoted with,
D = {yi }N (4.1)

It is important that the data-collection methodology satisfy the following two properties:

1. Independence: Each sample yi should be methodologically independent of all other


samples. This notion of independence is related but not identical to that of ran-
dom variables. Independence of random variables is a purely mathematical concept.
Methodological independence, in the case of surveys, means that my choice is not
influenced by anyone else’s choice, nor does my choice a↵ect other peoples’s choices.
Independence would be violated if I were incentivized to choose in the same way as my
neighbor. It could also be violated in more subtle ways, for example if the experimenter
were to influence the subject, perhaps subconsciously, based on what they’ve observed
in other subjects.

When the system is a physical phenomenon (or a machine) and the samples are ob-
tained with a measurement device, methodological independence requires that the

system not alter the measurement device in any way that affects subsequent measure-
ments. For example, when we measure temperature with a mercury thermometer, it
takes some amount of time for the mercury to settle to the current temperature. If
the readings are made too quickly, then the mercury will not have settled, and infor-
mation from the previous sample will “leak” into the current one, thereby violating
methodological independence.

2. Identically distributed: The system should not change between samples, or at least
nothing about the system that a↵ects the measurements may change.

A dataset that satisfies these requirements is said to be “IID” – independent and identically
distributed. IID is often a difficult condition to meet, especially when working with popula-
tions of people. The methodology of surveys and data collection is an important and large
topic that we will not cover in this course. Rather, we will take it as given that the dataset
is IID.

Given that we have collected an IID dataset, we are justified in modeling that dataset with
a collection of IID random variables {Yi }N . Here the terms “independent” and “identically
distributed" do have their mathematical definitions. {Y_i}_N is a jointly independent set of
random variables as defined in Section 2.6.5, and all of the Y_i's are identically distributed
in the sense that they have the same sample space Ω_Y and distribution p_Y. We denote this
with:

{Y_i}_N  ∼iid  Y        (4.2)

One consequence of the IID assumption is that there is no inherent order in the dataset. In
other words, two IID datasets that contain the same samples but in a different order can be
regarded as essentially the same. Inference techniques that account for this symmetry and
produce the same result regardless of the order of the samples will tend to outperform ones
that do not.

The process of selecting the distribution for Y begins with the specification of a pool of
candidates to choose from. For example, if we are modeling the measurement errors due to
vibrations, then it may be reasonable to model these errors as Gaussian random variables. In
this case we say that the initial model pool is N(µ, σ²); the Gaussian family. This family is
“parameterized” with two numbers: the mean and the variance. We may also decide not to
make any a-priori assumption about the distribution, and leave the task entirely to the data.
This is the non-parametric approach to statistics, which we will not pursue here. Instead, we
will assume that our chosen model family is described by some small number of parameters.

The next step is to use the data to select a single best model from the candidate pool.
That is, we must select a particular value for each of the parameters. This is the topic of
point estimation, which we cover in Section 4.1. To make this selection, we must compute
certain statistics related to the given dataset.

4.0.2 Statistics

A statistic is any quantity that can be computed directly from a dataset. Some examples of
statistics include:

• The sample mean µ̂ is the average of all of the values in the dataset:

\[
\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} y_i \tag{4.3}
\]

• The unbiased sample variance σ̂² is a measure of the spread of the dataset, defined as the average of the squared deviations from the sample mean:

\[
\hat{\sigma}^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \hat{\mu})^2 \tag{4.4}
\]

The use of N − 1 instead of N here will be justified in Section 4.1.4.

• The sample standard deviation σ̂ is the square root of the sample variance:

\[
\hat{\sigma} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (y_i - \hat{\mu})^2} \tag{4.5}
\]

Other statistics include the median value, the largest value, and the smallest value of the dataset. In general, a statistic θ̂ is any function g_N of the data y_1, . . . , y_N:

\[
\hat{\theta} = g_N(y_1, \ldots, y_N) \tag{4.6}
\]
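For concreteness, here is a minimal sketch of how these statistics can be computed with NumPy. The dataset y below is a made-up example, not part of the text:

```python
import numpy as np

# Hypothetical dataset of N = 6 measurements (for illustration only).
y = np.array([2.1, 1.9, 2.4, 2.0, 2.2, 1.8])

mu_hat = y.mean()            # sample mean, Eq. 4.3
var_hat = y.var(ddof=1)      # unbiased sample variance, Eq. 4.4 (divides by N - 1)
sigma_hat = y.std(ddof=1)    # sample standard deviation, Eq. 4.5

print(mu_hat, var_hat, sigma_hat)
```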

The question that immediately arises when we consider a statistic is how it is distributed. That is, how might the value of the statistic change when it is computed using a different dataset (of equal size as D) sampled from the same system. If the system is modeled with Y, then the distribution of θ̂ can be computed by applying the function g_N to the random variables {Y_i}_N. We will denote the resulting random variable with Θ̂.

\[
\hat{\Theta} = g_N(Y_1, \ldots, Y_N) \tag{4.7}
\]

Θ̂ captures the variations in the values of θ̂ due to the randomness in the system (according to model Y).

4.0.3 Behavior of the sample mean

We demonstrate these concepts with the sample mean statistic (Eq. 4.3). Its corresponding random variable is denoted with Ȳ_N and is also called the sample mean:

\[
\bar{Y}_N = \frac{1}{N}\sum_{i=1}^{N} Y_i \tag{4.8}
\]

At this point, if we have assigned a particular pdf to Y, then we can use Eq. 4.8 to derive an explicit pdf for Ȳ_N. The method for computing the pdf of a sum of independent random variables involves the convolution operation, and is beyond our scope. Without assigning a distribution to Y, the distribution of Ȳ_N is clearly underdetermined. However, there are certain facts that can be established about Ȳ_N when only µ_Y and σ²_Y are known. These are its mean, its variance, and the central limit theorem.

Expected value of the sample mean

The expected value of the sample mean equals the expected value of Y:

\[
E[\bar{Y}_N] = \mu_Y \tag{4.9}
\]

Proof

\begin{align}
E[\bar{Y}_N] &= E\Big[\frac{1}{N}\sum_{i=1}^{N} Y_i\Big] \tag{4.10}\\
&= \frac{1}{N}\sum_{i=1}^{N} E[Y_i] \tag{4.11}\\
&= \frac{1}{N}\sum_{i=1}^{N} \mu_Y \tag{4.12}\\
&= \frac{1}{N} N \mu_Y \tag{4.13}\\
&= \mu_Y \tag{4.14}
\end{align}

The second equality is obtained by the linearity property of the expected value (Eq. 2.45). The third by the IID assumption.

Variance of the sample mean

The variance of the sample mean is 1/N times the variance of Y:

\[
\mathrm{Var}[\bar{Y}_N] = \frac{\sigma_Y^2}{N} \tag{4.15}
\]

Proof

\begin{align}
\mathrm{Var}[\bar{Y}_N] &= \mathrm{Var}\Big[\frac{1}{N}\sum_{i=1}^{N} Y_i\Big] \tag{4.16}\\
&= \frac{1}{N^2}\sum_{i=1}^{N} \mathrm{Var}[Y_i] \tag{4.17}\\
&= \frac{1}{N^2}\sum_{i=1}^{N} \mathrm{Var}[Y] \tag{4.18}\\
&= \frac{1}{N^2} N \sigma_Y^2 \tag{4.19}\\
&= \frac{\sigma_Y^2}{N} \tag{4.20}
\end{align}

The second equality is obtained by the summation rule for the variance (Eq. 2.57), coupled with the fact that the Y_i's are independent. The third by the IID assumption.

Central limit theorem

The two previous results gave the expected value and the variance of the sample mean for a dataset of any size N, sampled from any distribution Y. We cannot say much about the shape of p_{Ȳ_N} when p_Y is unspecified and the sample size is small. However, as N increases, the central limit theorem asserts that Ȳ_N must converge to a normal distribution.

\[
\bar{Y}_N \rightarrow N\!\left(\mu_Y, \frac{\sigma_Y^2}{N}\right) \quad \text{as } N \rightarrow \infty \tag{4.21}
\]

In words, the distribution of the sample mean becomes more bell-shaped as the size of the sample increases, regardless of the distribution of Y. The fact that the distribution is centered at µ_Y and has variance σ²_Y/N was already established in Eqs. 4.9 and 4.15.

The central limit theorem is often stated in terms of a standardized version of the sample mean:

\[
Z = \frac{\bar{Y}_N - \mu_Y}{\sigma_Y / \sqrt{N}} \tag{4.22}
\]

In terms of Z, the central limit theorem states that,

\[
Z \rightarrow N(0, 1) \quad \text{as } N \rightarrow \infty \tag{4.23}
\]

N(0, 1) is known as the standard normal distribution. The central limit theorem says that the standardized sample mean converges to the standard normal distribution as N grows, regardless of the distribution of Y. Figure 4.2 shows simulations that demonstrate the CLT.
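Simulations like those in Figure 4.2 can be reproduced with a short script along the following lines. This is a minimal sketch; the choice of a uniform Y and the sample sizes are illustrative assumptions, not the exact settings used for the figure:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_datasets = 10_000                      # number of independent datasets per histogram

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, N in zip(axes, [2, 10, 100]):    # illustrative sample sizes
    # Draw n_datasets datasets of size N from a uniform Y, compute each sample mean,
    # then standardize with the true mean (0.5) and standard deviation (1/sqrt(12)).
    y = rng.uniform(0, 1, size=(n_datasets, N))
    z = (y.mean(axis=1) - 0.5) / (np.sqrt(1 / 12) / np.sqrt(N))
    ax.hist(z, bins=50, density=True)
    ax.set_title(f"N = {N}")

plt.show()
```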

4.1 Point estimation

Point estimation is concerned with estimating the parameters of the model Y. A parameter here is any scalar property of Y. It is often a variable that "parameterizes" the pdf. For example, if Y is assumed to be uniformly distributed, Y ∼ U(a, b), then a and b are parameters of Y. But parameters can also be properties that do not appear explicitly in the pdf. For example, we may be interested in estimating the mean and the variance of a uniform distribution, even though these do not appear explicitly in Eq. 2.82.

We denote the parameter of interest with θ. A point estimate θ̂ is a statistic that does a good job of estimating the value of θ. It is called a "point" estimate because it is a single point in the parameter space (the set of all possible values of θ). The function that computes

Figure 4.2: Simulation of the CLT. Each plot in the figure is a histogram of Z obtained
with either a uniform (top row) or Bernoulli (bottom row) distribution for Y . The CLT
predicts that, as N grows, these histograms will approach the standard normal distribution,
shown in orange. Indeed, the uniform distribution reaches “normality” after only a handful
of samples, whereas for the Bernoulli distribution it takes several thousand.

the estimate is called an estimator, and is denoted with g_N:

\[
\hat{\theta} = g_N(y_1, \ldots, y_N) \tag{4.24}
\]

The similarity between the notation of estimators and statistics is not accidental: an estimator is a particular type of statistic, one whose purpose is the estimation of a parameter. The statistical properties of the estimator are obtained by feeding the set of IID random variables {Y_i}_N through the estimator function. Hence, again,

\[
\hat{\Theta} = g_N(Y_1, \ldots, Y_N) \tag{4.25}
\]

Our study of point estimates will focus on the properties that make good estimators. Along the way we will encounter the concepts of bias, variance, and mean squared error of estimators, and we will ultimately arrive at the maximum likelihood estimator.

4.1.1 Estimator performance

Let's recap where we are. We have collected an IID dataset D and we wish to use it to build a model of a system. The model here is a random variable Y with distribution p_Y. Our task is to choose the shape of p_Y. We begin by restricting our search to a particular family of distributions, such as the Gaussian distributions, the uniform distributions, etc. Each of these families is parameterized by its own set of parameters. For example, the Gaussian family N(µ, σ²) is parameterized by the mean and variance (µ and σ²); the exponential distributions E(λ) by the inverse of the mean (λ); the uniform distributions U(a, b) by the lower and upper limits (a and b). The task of choosing the shape of p_Y is thus reduced to one of estimating the values of a finite number of parameters.

We now consider the problem of estimating one of those parameters, which we denote with θ. To this end, we use an estimator g_N and calculate an estimate θ̂. How can we decide whether we've done a good job? Whatever criterion we use, it should be based, not on the particular value θ̂ that we happened to get, but on the distribution Θ̂ of all possible values (given the assumptions we've made about Y).

Consistency

At the very least, we would like our estimator to give the right answer when provided with a sample consisting of the entire population. For this to happen, the distribution of Θ̂ should narrow down and focus on the true value θ as the sample size increases. An estimator with this property is said to be consistent. In mathematical terms, an estimator is consistent if, for any ε > 0,

\[
\lim_{N \rightarrow \infty} P\left( |\hat{\Theta} - \theta| \geq \varepsilon \right) = 0 \tag{4.26}
\]

In words, the probability of obtaining an estimate that is more than a distance ε from the true value converges to zero as the sample size grows. This should hold for any value of ε, no matter how small.

Bias

Consistency can be used to evaluate and eliminate estimators that would not produce the correct answer, even when provided with an infinite amount of IID data. We can also evaluate estimators based on their performance with finite samples. Let's fix N to some finite number. Then the estimator function g_N produces an estimator Θ̂ (Eq. 4.25).

The bias of an estimator measures the expected difference between the estimate and the true value of the parameter:

\[
\mathrm{Bias}[\hat{\Theta}] = E[\hat{\Theta}] - \theta \tag{4.27}
\]

An estimator that is positively biased is expected to produce estimates that are higher than the true value, and vice-versa for negatively biased estimators. All other things being equal, we prefer estimators with a lower absolute value of the bias. Note however that the bias is a function of the unknown true value of the parameter, θ. This creates two problems. First, the bias can generally not be calculated. Nevertheless, we can sometimes prove that the bias of an estimator is zero for all values of θ. In this case we say that the estimator is unbiased for the class of models that satisfy our assumptions. Second, an estimator can only be said to be "less biased" than another if its bias is smaller (in absolute value) for all possible values of θ. Here again, the statement only holds for the assumed class of models Y.

Figure 4.3: The red line is the expected value. The two blue lines are the expected value plus/minus one standard deviation. The left and middle plots are unbiased, and therefore asymptotically unbiased. The right plot is biased for all N, and asymptotically unbiased.

Asymptotic unbiasedness

An estimator is asymptotically unbiased (again, for some class of models!) when its bias converges to zero as N increases, for all values of θ.

\[
\lim_{N \rightarrow \infty} \mathrm{Bias}[\hat{\Theta}_N] = 0 \tag{4.28}
\]

An estimator that lacks asymptotic unbiasedness will have a persistent error that is immune to data. See Figure 4.3.

Example 4.1.1. A team of roboticists is designing an apple-picking robot. In choosing the dimensions of the gripper, they must estimate the size of the largest apple that could be produced by the orchard. To do this, they collect 100 randomly chosen apples and measure their sizes (D = {y_i}_{100}). To estimate the size of the largest apple (θ), they apply the following formula:

\[
\hat{\theta} = \max_i y_i \tag{4.29}
\]

That is, they take the largest apple in the orchard to be equal in size to the largest apple in the basket. This is a biased estimator: it is very unlikely that the basket contains the largest apple. What is the bias of this estimator?

To address this question, we must first choose a class of distribution to model the sizes
of apples. The better our choice of model family, the better our estimate of the bias. We (the team) then decide to assume that the size of apples is uniformly distributed across the orchard: Y ∼ U(a, b). Formula 4.29 can then be regarded as an estimator of b. The question now becomes, what is the bias of θ̂ with respect to the true value of b? That is, we want to find Bias[Θ̂] = E[Θ̂] − b, where Θ̂ is the random variable obtained by applying max_i Y_i to IID uniformly distributed random variables U(a, b). This particular mathematical problem turns out to be easy to solve if we assume a = 0 – admittedly, a very bad assumption for this problem. However, in that case we can prove mathematically (we will not do it here) that

\[
E[\hat{\Theta}] = \frac{N}{N+1}\, b \tag{4.30}
\]

and therefore,

\begin{align}
\mathrm{Bias}[\hat{\Theta}] &= E[\hat{\Theta}] - b \tag{4.31}\\
&= -\frac{1}{N+1}\, b \tag{4.32}
\end{align}

This is illustrated in Figure 4.4.
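A quick simulation also confirms the result. This is a sketch with hypothetical values a = 0, b = 10, and N = 100; the expected value of the max estimator falls short of b by roughly b/(N+1):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, N = 0.0, 10.0, 100          # hypothetical true parameters and sample size
n_trials = 100_000

# Draw many baskets of N apples and record the largest apple in each basket.
theta_hat = rng.uniform(a, b, size=(n_trials, N)).max(axis=1)

print("E[theta_hat] ~", theta_hat.mean())       # close to N/(N+1) * b = 9.90
print("bias ~", theta_hat.mean() - b)           # close to -b/(N+1) = -0.099
```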

4.1.2 Estimation of the mean with the sample mean

Next we consider the use of the sample mean, introduced in Section 4.0.3, as an estimator of the mean. Equation 4.9 shows that the sample mean is an unbiased estimator for all values of the true mean, and for any population model Y. With Eq. 4.15 we found that the variance of the sample mean is σ²_Y/N. It can be shown (although we do not do it here) that this is the lowest variance that can be achieved by any unbiased estimator of the mean when Y is Gaussian. We say therefore that the sample mean is the minimum variance unbiased estimator (MVUE) for the normal distribution.

Figure 4.4: The blue dots are estimates of the size of the largest apple made with max_i y_i, each with a different basket of 100 apples. These can be modeled as samples from Θ̂, whose pdf is shown in blue. This estimator is negatively biased because it tends to underestimate the true value of b.

The sample mean, in addition to being unbiased, is also a consistent estimator of the mean. Generally it can be difficult to show that an estimator is consistent. However, for unbiased estimators, the condition for consistency reduces to the variance tending to zero as N → ∞. This is certainly true of the sample mean:

\[
\lim_{N \rightarrow \infty} \mathrm{Var}[\bar{Y}_N] = \lim_{N \rightarrow \infty} \frac{\sigma_Y^2}{N} = 0 \tag{4.33}
\]

The sample mean is in fact the best possible estimator of the mean for many processes Y. However, there are cases in which other estimators do better. For example, if Y is a Laplace distribution, then the sample median is also unbiased and consistent, and has a smaller variance than the sample mean.

4.1.3 Estimation of the variance σ²_Y with the biased sample variance S̃²

Define the biased sample variance as follows:

\[
\tilde{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{\mu})^2 \tag{4.34}
\]

and its random variable:

\[
\tilde{S}^2 = \frac{1}{N}\sum_{i=1}^{N} (Y_i - \bar{Y}_N)^2 \tag{4.35}
\]

This formula is very intuitive; it is obtained by replacing the mean in Eq. 2.52 with the sample mean, and the expected value with the average over the dataset. Next we will prove that this estimator, as the name suggests, is biased. We will do this with no assumptions on Y (aside from finiteness of σ²_Y), and thus the result holds for a very wide class of models.

Compute the expected value of the estimator:

\begin{align}
E[\tilde{S}^2] &= E\Big[\frac{1}{N}\sum_{i=1}^{N} (Y_i - \bar{Y}_N)^2\Big] \tag{4.36}\\
&= \frac{1}{N} E\Big[\sum_{i=1}^{N} \big(Y_i^2 - 2\bar{Y}_N Y_i + \bar{Y}_N^2\big)\Big] \tag{4.37}\\
&= \frac{1}{N} E\Big[\sum_{i=1}^{N} Y_i^2 - 2\bar{Y}_N \sum_{i=1}^{N} Y_i + N \bar{Y}_N^2\Big] \tag{4.38}
\end{align}

Using Eq. 4.8 we can replace \(\sum_{i=1}^{N} Y_i\) with \(N \bar{Y}_N\), and obtain,

\begin{align}
E[\tilde{S}^2] &= \frac{1}{N} E\Big[\sum_{i=1}^{N} Y_i^2 - N \bar{Y}_N^2\Big] \tag{4.39}\\
&= \frac{1}{N}\Big(\sum_{i=1}^{N} E[Y_i^2]\Big) - E[\bar{Y}_N^2] \tag{4.40}
\end{align}

Because the Y_i's are identically distributed, we have E[Y_i²] = E[Y²]:

\begin{align}
E[\tilde{S}^2] &= \frac{1}{N} N E[Y^2] - E[\bar{Y}_N^2] \tag{4.41}\\
&= E[Y^2] - E[\bar{Y}_N^2] \tag{4.42}
\end{align}

We now apply the identity from Eq. 2.52 to Y and Ȳ_N:

\begin{align}
E[Y^2] &= \sigma_Y^2 + \mu_Y^2 \tag{4.43}\\
E[\bar{Y}_N^2] &= \frac{1}{N}\sigma_Y^2 + \mu_Y^2 \tag{4.44}
\end{align}

Then,

\begin{align}
E[\tilde{S}^2] &= \sigma_Y^2 + \mu_Y^2 - \frac{1}{N}\sigma_Y^2 - \mu_Y^2 \tag{4.45}\\
&= \frac{N-1}{N}\sigma_Y^2 \tag{4.46}
\end{align}

Subtract the true value of the variance to obtain the bias:

\[
\mathrm{Bias}[\tilde{S}^2] = \frac{N-1}{N}\sigma_Y^2 - \sigma_Y^2 = -\frac{1}{N}\sigma_Y^2 \tag{4.47}
\]

S̃² is therefore negatively biased for all values of σ²_Y.

Note on the variance of S̃²

The variance of S̃² is a complicated function of the properties of Y. We will not need it in this course, but you can see it derived here [?] for generic Y. A simpler formula can be found if Y is taken to be Gaussian. In this case the sample variance is distributed as a χ² ("chi squared") distribution [?].

4.1.4 Estimation of the variance σ²_Y with the unbiased sample variance S²

Define the unbiased sample variance σ̂² by replacing the N in the denominator of Eq. 4.34 with N − 1:

\[
\hat{\sigma}^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \hat{\mu})^2 \tag{4.48}
\]

and its random variable:

\[
S^2 = \frac{1}{N-1}\sum_{i=1}^{N} (Y_i - \bar{Y}_N)^2 \tag{4.49}
\]

It is easy to prove that S² is indeed unbiased by noting that,

\[
S^2 = \frac{N}{N-1}\,\tilde{S}^2 \tag{4.50}
\]

which implies E[S²] = (N/(N−1)) E[S̃²] = σ²_Y. Hence, S² is preferable to S̃² in terms of bias. Although it is difficult to explicitly compute the variance of either of these estimators, it is simple to show that the variance of the biased estimator is the smaller of the two: since S̃² = ((N−1)/N) S², it follows that Var[S̃²] = ((N−1)/N)² Var[S²]. Therefore Var[S̃²] < Var[S²].
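A small Monte Carlo check illustrates both facts. This is a sketch with a hypothetical Gaussian Y and N = 10; the biased estimator sits below σ²_Y on average, while the unbiased one is centered on it but has a larger spread:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma2 = 10, 4.0                       # hypothetical sample size and true variance
y = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, N))

s2_biased = y.var(axis=1, ddof=0)         # S-tilde squared: divides by N
s2_unbiased = y.var(axis=1, ddof=1)       # S squared: divides by N - 1

print("mean of biased estimator:  ", s2_biased.mean())    # ~ (N-1)/N * 4 = 3.6
print("mean of unbiased estimator:", s2_unbiased.mean())  # ~ 4.0
print("variance (biased):  ", s2_biased.var())             # smaller
print("variance (unbiased):", s2_unbiased.var())           # larger
```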

We've now seen three examples of estimators: the sample mean, the unbiased sample variance, and the biased sample variance. With this last pair we found that the bias and variance did not agree in their assessment, and hence we have still not decided which to prefer. In archery, is it better to produce a wide cloud of shots that is perfectly centered on the bullseye (zero bias, high variance), or a small cloud that is a little bit off center (small bias, small variance)? The answer depends on how the scores are marked on the target. That is, on how we choose to measure the error.

A more systematic approach to quantifying estimator performance is therefore to base it on the expected value of the estimation error. We get to choose which error metric to

use. If we choose to quantify the error by taking the absolute value of the difference between the estimate and the true value, then we get the so-called mean absolute error criterion, or MAE:

\[
\mathrm{MAE}[\hat{\Theta}] = E\big[\,|\hat{\Theta} - \theta|\,\big] \tag{4.51}
\]

If we use the square of the difference, we get the mean squared error criterion, or MSE:

\[
\mathrm{MSE}[\hat{\Theta}] = E\big[(\hat{\Theta} - \theta)^2\big] \tag{4.52}
\]

Both are useful. In this course we will focus on the MSE because a) it is more widely used, b) it results in a smooth optimization problem (important for gradient descent), and c) it has interesting analytical properties (bias-variance decomposition).

4.1.5 Mean squared error (MSE)

Again, the MSE of an estimator Θ̂ is defined as,

\[
\mathrm{MSE}[\hat{\Theta}] = E\big[(\hat{\Theta} - \theta)^2\big] \tag{4.53}
\]

Figure 4.5 shows an illustration. The MSE is the expected value of the red distribution that is drawn along the vertical axis in the figure. This distribution is obtained by mapping samples of Θ̂ through a parabola centered at θ, shown in the figure as a red line. Note that, if Θ̂ were unbiased and its variance equal to zero, then the MSE would be zero. Note also that one cannot generally estimate the MSE from the data because it depends on θ.

The intuition behind the MSE is simple: it measures the expected size of the errors using a Euclidean metric (circles on the archery target). What is less obvious is the fact that the MSE can be expressed as a function of the bias and the variance of the estimator. This result is known as the bias-variance decomposition of the MSE.

Figure 4.5: MSE of Θ̂_N. See demo MSE.py.

Bias-variance decomposition of MSE

The MSE of an estimator equals the sum of its variance and its bias squared. This is true regardless of the value of θ and the distribution Y.

\[
\mathrm{MSE}[\hat{\Theta}] = \mathrm{Var}[\hat{\Theta}] + \big(\mathrm{Bias}[\hat{\Theta}]\big)^2 \tag{4.54}
\]

Proof: In the definition of MSE (Eq. 4.53) we can add and subtract E[Θ̂] without altering the result.

\[
\mathrm{MSE}[\hat{\Theta}] = E\Big[\big(\hat{\Theta} - E[\hat{\Theta}] + E[\hat{\Theta}] - \theta\big)^2\Big] \tag{4.55}
\]

Expanding the square and applying the linearity of the expectation we get:

\[
\mathrm{MSE}[\hat{\Theta}] = E\Big[\big(\hat{\Theta} - E[\hat{\Theta}]\big)^2\Big] + 2\, E\Big[\big(\hat{\Theta} - E[\hat{\Theta}]\big)\big(E[\hat{\Theta}] - \theta\big)\Big] + E\Big[\big(E[\hat{\Theta}] - \theta\big)^2\Big] \tag{4.56}
\]

Take a close look at each of these three terms. The first is the variance of Θ̂. In the last term, both E[Θ̂] and θ are numbers (not random variables), so the outer expectation can be removed, and the term equals the square of the bias (Eq. 4.27). In the middle term, because E[Θ̂] − θ is a number, it can again be extracted from the expectation. The middle term then becomes:

\[
2\big(E[\hat{\Theta}] - \theta\big)\, E\Big[\hat{\Theta} - E[\hat{\Theta}]\Big] \tag{4.57}
\]

Again using the linearity of the expectation, this becomes:

\[
2\big(E[\hat{\Theta}] - \theta\big)\big(E[\hat{\Theta}] - E[\hat{\Theta}]\big) \tag{4.58}
\]

which equals zero. ∎

This decomposition gives us a different perspective on the MSE: it balances variance and bias by giving equal weight to both. The bias is squared so that the units match the variance. We can use this result to compute the MSE of our three estimators.

Example 4.1.2. MSE of the sample mean:

\begin{align}
\mathrm{MSE}[\bar{Y}_N] &= \mathrm{Var}[\bar{Y}_N] + \mathrm{Bias}[\bar{Y}_N]^2 \tag{4.59}\\
&= \frac{\sigma_Y^2}{N} + 0 \tag{4.60}\\
&= \frac{\sigma_Y^2}{N} \tag{4.61}
\end{align}

As mentioned earlier, the sample mean is the MVUE for the mean of a Gaussian distribution. The above quantity is therefore a lower bound on the MSE that can be achieved by any unbiased estimator, when Y is Gaussian.

Example 4.1.3. MSE of the unbiased sample variance:

\begin{align}
\mathrm{MSE}[S_N^2] &= \mathrm{Var}[S_N^2] + \mathrm{Bias}[S_N^2]^2 \tag{4.62}\\
&= \mathrm{Var}[S_N^2] \tag{4.63}
\end{align}

Recall that Var[S²] is a complicated function of the properties of Y.

Example 4.1.4. MSE of the biased sample variance:

\begin{align}
\mathrm{MSE}[\tilde{S}^2] &= \mathrm{Var}[\tilde{S}^2] + \big(\mathrm{Bias}[\tilde{S}^2]\big)^2 \tag{4.64}\\
&= \left(\frac{N-1}{N}\right)^2 \mathrm{Var}[S^2] + \left(\frac{\sigma_Y^2}{N}\right)^2 \tag{4.65}\\
&= \left(\frac{N-1}{N}\right)^2 \mathrm{MSE}[S^2] + \left(\frac{\sigma_Y^2}{N}\right)^2 \tag{4.66}
\end{align}

Knowing N and σ²_Y allows us to decide which of the two estimators of variance produces a smaller MSE.

How does consistency relate to bias, variance, and MSE? Do these three quantities necessarily tend to zero when the estimator is consistent? The answer to this question is, strictly speaking, "no". There are corner cases of estimators that are consistent, but whose MSE does not tend to zero. However, these examples are contrived. For most real-world distributions, if an estimator is consistent, then its MSE converges to zero (and therefore its bias and variance as well). The converse is always true: if the MSE converges to zero, then the estimator is consistent. In short, MSE convergence is a slightly stronger property than consistency.

This concludes our review of performance criteria for point estimators. We have covered
consistency, bias, asymptotic unbiasedness, and MSE. Next we will see how optimization
theory can be used to compute parameter estimates using the maximum likelihood method.

4.1.6 Maximum likelihood estimation (MLE)

The three estimators introduced in this chapter were given as formulas (Eqs. 4.3, 4.49, and 4.35), with no indication as to how those formulas were obtained. This will now be remedied with the introduction of the maximum likelihood method. MLE is a technique based on optimization theory for generating estimators. As we will see, the MLE approach can be used to produce estimates, not only of the mean and variance, but of any parameter of Y. We will also see that estimators generated with MLE enjoy some beneficial properties.

We begin by assuming that Y is a member of some parameterized family of distributions, with parameter vector θ. Up until now we have only considered the point estimation of a single parameter θ. Maximum likelihood allows the estimation of the entire parameter vector θ at once. The problem is cast as a search over the parameter space for the most likely parameter vector θ̂_MLE, given the observed data D = {y_i}_N. We compute the likelihood of an arbitrary parameter vector θ̂ with the likelihood function:

\[
L(\hat{\theta}\,;\, D) = \prod_{i=1}^{N} p_Y(y_i ; \hat{\theta}) \tag{4.67}
\]

This definition states that the likelihood of the parameter vector θ̂ given the dataset D is the probability of independently sampling all of the data points y_1 . . . y_N from the model specified by θ̂. The maximum likelihood estimate is the parameter setting θ̂_MLE that maximizes the likelihood function. Here is the statement of the problem as an optimization problem:

Given D = {y_i}_N,

\[
\hat{\theta}_{MLE} = \underset{\hat{\theta} \in \mathbb{R}^d}{\mathrm{argmax}} \; \prod_{i=1}^{N} p_Y(y_i ; \hat{\theta}) \tag{4.68}
\]

Example 4.1.5. We are presented with a bag containing 4 marbles. Each marble is either
black or white. We are allowed to extract and look at a marble 5 times, each time returning
the marble to the bag and shaking it. Upon doing this, we observe the following sequence:
{black, white, black, white, white}. Estimate the number of black marbles in the bag.

Solution. We begin the MLE procedure by proposing a distribution family for our sampling process. Since the bag produces only two possible outcomes (black and white), the reasonable choice is a Bernoulli distribution B(s), with parameter s for the probability of drawing a black marble. The quantity that we are interested in estimating – the number θ of black marbles in the bag – relates to s with s = θ/4. For example, if there are two black marbles (θ = 2), the probability of drawing a black marble is s = 2/4 = 1/2. Thus, Y ∼ B(θ/4), and

\[
p_Y(y; \hat{\theta}) =
\begin{cases}
\hat{\theta}/4 & y \text{ is black}\\[4pt]
1 - \hat{\theta}/4 & y \text{ is white}
\end{cases} \tag{4.69}
\]

The maximum likelihood estimate is the value from the set {0, 1, 2, 3, 4} that maximizes the
likelihood given the observed dataset D = {1, 0, 1, 0, 0}. Here we have encoded black marbles
with 1 and white marbles with 0. Next we plug Eq. 4.69 into the likelihood function and
evaluate it using the observed dataset.

\begin{align}
L(\hat{\theta}; D) &= \prod_{i=1}^{N} p_Y(y_i ; \hat{\theta}) \tag{4.70}\\
&= p_Y(1; \hat{\theta})\, p_Y(0; \hat{\theta})\, p_Y(1; \hat{\theta})\, p_Y(0; \hat{\theta})\, p_Y(0; \hat{\theta}) \tag{4.71}\\
&= \big(\hat{\theta}/4\big)^2 \big(1 - \hat{\theta}/4\big)^3 \tag{4.72}
\end{align}

The easiest way to find the maximum of this function is to evaluate it on each of the five possible values for θ̂, and choose the largest value.

    θ̂        : 0    1              2              3              4
    L(θ̂; D)  : 0    (1/4)²(3/4)³   (1/2)²(1/2)³   (3/4)²(1/4)³   0

Of these numbers, (1/2)²(1/2)³ = 1/32 is the largest. Therefore θ̂_MLE = 2. ∎
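The table above can be reproduced with a few lines of Python. This is a sketch of the brute-force search; the encoding 1 = black, 0 = white follows the example:

```python
import numpy as np

data = [1, 0, 1, 0, 0]   # observed draws: black, white, black, white, white

def likelihood(theta, data):
    """Likelihood of theta black marbles (out of 4), Eq. 4.70."""
    s = theta / 4.0                                # probability of drawing black
    p = [s if y == 1 else 1.0 - s for y in data]   # per-draw probabilities, Eq. 4.69
    return np.prod(p)

L = {theta: likelihood(theta, data) for theta in range(5)}
print(L)                     # {0: 0.0, 1: 0.0264, 2: 0.03125, 3: 0.0088, 4: 0.0}
print(max(L, key=L.get))     # 2, the maximum likelihood estimate
```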

Returning to the general problem of Equation 4.68, it is often convenient to apply the loga-
rithm function to the objective. This can make the problem more tractable, both analytically
and numerically, and it does not alter the result since the logarithm is an increasing function.

The problem then becomes:

\begin{align}
\hat{\theta}_{MLE} &= \underset{\hat{\theta} \in \mathbb{R}^d}{\mathrm{argmax}} \; \log L(\hat{\theta}; D) \tag{4.73}\\
&= \underset{\hat{\theta} \in \mathbb{R}^d}{\mathrm{argmax}} \; \log\left(\prod_{i=1}^{N} p_Y(y_i ; \hat{\theta})\right) \tag{4.74}\\
&= \underset{\hat{\theta} \in \mathbb{R}^d}{\mathrm{argmax}} \; \sum_{i=1}^{N} \log p_Y(y_i ; \hat{\theta}) \tag{4.75}
\end{align}

The difficulty of solving this problem depends on the parametric family that is assumed.
Next we will use the MLE technique to derive estimators of the mean µY and the variance
of a Gaussian random variable. Previously we had found the sample mean and the biased
sample variance to be good estimators for these quantities. We will now see how these can
be derived using maximum likelihood.

MLE with Gaussian data

With Y ∼ N(µ, σ²), the pdf of Y is,

\[
p_Y(y; \hat{\mu}, \hat{\sigma}^2) = \frac{1}{\sqrt{2\pi\hat{\sigma}^2}} \exp\left(-\frac{1}{2}\frac{(y - \hat{\mu})^2}{\hat{\sigma}^2}\right) \tag{4.76}
\]

The log-likelihood is then:

\begin{align}
\log L(\hat{\mu}, \hat{\sigma}^2; D) &= \sum_{i=1}^{N} \log p_Y(y_i; \hat{\mu}, \hat{\sigma}^2) \tag{4.77}\\
&= \sum_{i=1}^{N} \log\left(\frac{1}{\sqrt{2\pi\hat{\sigma}^2}} \exp\left(-\frac{1}{2}\frac{(y_i - \hat{\mu})^2}{\hat{\sigma}^2}\right)\right) \tag{4.78}\\
&= -\frac{N}{2}\log(2\pi\hat{\sigma}^2) - \frac{1}{2\hat{\sigma}^2}\sum_{i=1}^{N} (y_i - \hat{\mu})^2 \tag{4.79}
\end{align}

We convert the problem into a minimization problem by flipping the sign of the objective.

\[
\hat{\mu}_{MLE},\, \hat{\sigma}^2_{MLE} = \underset{\hat{\mu},\, \hat{\sigma}^2}{\mathrm{argmin}} \left( \frac{N}{2}\log(2\pi\hat{\sigma}^2) + \frac{1}{2\hat{\sigma}^2}\sum_{i=1}^{N} (y_i - \hat{\mu})^2 \right) \tag{4.80}
\]

The cost function of this problem is the negative of the log-likelihood:

\[
J(\hat{\mu}, \hat{\sigma}^2) = \frac{N}{2}\log(2\pi\hat{\sigma}^2) + \frac{1}{2\hat{\sigma}^2}\sum_{i=1}^{N} (y_i - \hat{\mu})^2 \tag{4.81}
\]

The only constraint on the parameters is σ̂² > 0. Hence, the feasible set is open (i.e. all points are interior points). Because J is differentiable everywhere in the feasible set, the optimality condition of Section 3.2.1 tells us that any local solutions (if any exist) are to be found amongst the stationary points of J. Furthermore, it can be shown that J is convex. Thus, the stationary points are global optima. Next we find these stationary points by equating each of the partial derivatives to zero. We begin with the partial derivative with respect to µ̂.

\begin{align}
\frac{\partial J}{\partial \hat{\mu}} &= \frac{\partial}{\partial \hat{\mu}}\left(\frac{1}{2\hat{\sigma}^2}\sum_{i=1}^{N} (y_i - \hat{\mu})^2\right) \tag{4.82}\\
&= \frac{1}{2\hat{\sigma}^2}\sum_{i=1}^{N} \frac{\partial}{\partial \hat{\mu}} (y_i - \hat{\mu})^2 \tag{4.83}\\
&= -\frac{1}{\hat{\sigma}^2}\sum_{i=1}^{N} (y_i - \hat{\mu}) \tag{4.84}\\
&= \frac{N\hat{\mu}}{\hat{\sigma}^2} - \frac{1}{\hat{\sigma}^2}\sum_{i=1}^{N} y_i \tag{4.85}
\end{align}

Equating this to zero we find that the maximum likelihood estimate of the mean for a Gaussian variable is the sample mean.

\[
\hat{\mu}_{MLE} = \frac{1}{N}\sum_{i=1}^{N} y_i \tag{4.86}
\]

Next we take a partial derivative with respect to σ̂².

\begin{align}
\frac{\partial J}{\partial \hat{\sigma}^2} &= \frac{\partial}{\partial \hat{\sigma}^2}\left(\frac{N}{2}\log(2\pi\hat{\sigma}^2) + \frac{1}{2\hat{\sigma}^2}\sum_{i=1}^{N} (y_i - \hat{\mu})^2\right) \tag{4.87}\\
&= \frac{N}{2\hat{\sigma}^2} - \frac{1}{2\hat{\sigma}^4}\sum_{i=1}^{N} (y_i - \hat{\mu})^2 \tag{4.88}\\
&= \frac{1}{2\hat{\sigma}^2}\left(N - \frac{1}{\hat{\sigma}^2}\sum_{i=1}^{N} (y_i - \hat{\mu})^2\right) \tag{4.89}
\end{align}

Equating this to zero we find that the maximum likelihood estimate of the variance of a Gaussian variable is the biased sample variance.

\[
\hat{\sigma}^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{\mu}_{MLE})^2 \tag{4.90}
\]
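A two-line check with a hypothetical dataset shows that these closed-form MLE estimates coincide with NumPy's mean and population variance:

```python
import numpy as np

y = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=500)  # hypothetical data

mu_mle = y.mean()           # Eq. 4.86: the sample mean
var_mle = y.var(ddof=0)     # Eq. 4.90: the biased sample variance (divides by N)

print(mu_mle, var_mle)      # close to the true values 3.0 and 4.0
```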

Properties of maximum likelihood estimators

The maximum likelihood optimization problem is a straightforward technique for finding the
member of a family of pdfs that best fits a given dataset. In posing this problem, we made
no reference to the desirable properties of consistency, unbiasedness, or low MSE. So it is
reasonable to ask whether the MLE achieves any of these properties. Here are some facts.

1. In general, maximum likelihood estimators have no guaranteed finite-sample properties. For each fixed N, the MLE is not necessarily unbiased, nor does it necessarily achieve minimum MSE. However, we know of at least one case where it does have these properties: the sample mean of a Gaussian variable.

2. Maximum likelihood estimators have good asymptotic properties. They are consistent
and, setting aside a few corner cases, they are asymptotically unbiased.

4.1.7 Estimation for Mixture Gaussian Models

We now demonstrate the application of maximum likelihood estimation to a more compli-


cated distribution family: the Gaussian mixture. A Gaussian mixture is a distribution which,
like pW (w) in Figure 2.23, consists of a weighted sum of Gaussians. Recall Example 2.6.7
with the three vehicle types. Suppose that we only have measurements of the weights W of
the vehicles (and not their type V ), and we wish to build a model for the joint distribution of
weight and type based on this information. In other words, we seek to estimate pV W (Figure
2.22) from samples of pW (left-hand side of Figure 2.23). How can we do this?

To answer the question, notice that to construct the joint distribution it will suffice to
estimate the marginal distribution of V (right-hand side of Figure 2.23) and the conditional
distribution of W given V (Figure 2.24). With these we can construct the joint distribution
using the definition of the conditional distribution:

pV W (scooter, w) = p(w | scooter) pV (scooter) (4.91)

pV W (bicycle, w) = p(w | bicycle) pV (bicycle) (4.92)

pV W (moped, w) = p(w | moped) pV (moped) (4.93)

Our focus will be on estimating the discrete distribution pV and the three continuous distri-
butions p(w | scooter), p(w | bicycle), and p(w | moped). We pose this as a point estimation
problem, and solve it using the maximum likelihood technique.

Let’s first generalize the notation. Instead of W , we’ll use Y for the observed continuous
quantity. Instead of V , we will use Z for the unobserved class of the items. There are K
of these classes (K = 3 in the example), indexed with k = 1 . . . K. The unknown marginal

probabilities of the classes (p_V(v) in the example) are denoted with π_k. Then,

\begin{align}
\sum_{k=1}^{K} \pi_k &= 1 \tag{4.94}\\
\pi_k &\geq 0 \quad \forall k \in \{1 \ldots K\} \tag{4.95}
\end{align}

In this generic notation, our goal is to build a model for the joint distribution p_ZY(k, y) for a multivariate random variable (Z, Y), where Z is discrete-valued (k ∈ {1 . . . K}) and Y is continuous-valued, based on observations of Y alone. And we will do this by estimating the class proportions π_k and the conditional distribution p(y | Z = k) for each k in {1 . . . K}.

To apply the maximum likelihood technique, we must first propose a parametric family for the conditional distribution p(y | Z = k). Here we will assume they are Gaussian.

\[
p(y \mid Z = k) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left(-\frac{1}{2}\frac{(y - \mu_k)^2}{\sigma_k^2}\right) = \mathcal{N}_k(y) \tag{4.96}
\]

The second equality defines the short-hand N_k(y) for the Gaussian pdf of the k'th class. Our problem is thus reduced to the estimation of K proportions π_k and 2K parameters µ_k and σ²_k; 3K parameters in total.

We can now apply the maximum likelihood machinery to the estimation of the parameter set θ = {(π_k, µ_k, σ²_k)}_K. Given a dataset D = {y_i}_N, the log-likelihood of θ is,

\[
\log L(\theta\,;\, D) = \sum_{i=1}^{N} \log p_Y(y_i; \theta) \tag{4.97}
\]

We must express p_Y(y; θ) in terms of p_Z(k) and p(y | Z = k) in order to make explicit its dependence on the parameters.

\begin{align}
p_Y(y\,;\,\theta) &= \sum_{k=1}^{K} p_{ZY}(y, k) \tag{4.98}\\
&= \sum_{k=1}^{K} p_Z(k)\, p(y \mid Z = k) \tag{4.99}\\
&= \sum_{k=1}^{K} \pi_k \mathcal{N}_k(y) \tag{4.100}
\end{align}

Then,

\[
\log L(\theta\,;\, D) = \sum_{i=1}^{N} \log\left(\sum_{k=1}^{K} \pi_k \mathcal{N}_k(y_i)\right) \tag{4.101}
\]

The maximum likelihood optimization problem is then,

\[
\begin{array}{ll}
\underset{\{(\pi_k, \mu_k, \sigma_k^2)\}_K}{\text{maximize}} & \displaystyle\sum_{i=1}^{N} \log\left(\sum_{k=1}^{K} \pi_k \mathcal{N}_k(y_i)\right)\\[10pt]
\text{subject to} & \displaystyle\sum_{k=1}^{K} \pi_k = 1\\[6pt]
& \pi_k \geq 0 \quad k \in \{1 \ldots K\}\\[4pt]
& \sigma_k^2 > 0 \quad k \in \{1 \ldots K\}
\end{array} \tag{4.102}
\]

This is a very complicated optimization problem! Although the objective function is differentiable everywhere in the feasible set, it is not convex (or rather, its negative is not convex, since this is a maximization problem). Furthermore, the presence of an equality constraint implies that the feasible set has no interior, and hence it is unlikely that we will find a feasible stationary point.

A common trick for eliminating an equality constraint is to append it to the objective function using a so-called Lagrange multiplier λ. The theory of Lagrange multipliers is beyond the scope of this course. We will simply accept that Eq. 4.102 is equivalent to:

\[
\begin{array}{ll}
\underset{\lambda,\, \theta}{\text{maximize}} & \displaystyle\sum_{i=1}^{N} \log\left(\sum_{k=1}^{K} \pi_k \mathcal{N}_k(y_i)\right) + \lambda\left(\sum_{k=1}^{K} \pi_k - 1\right)\\[10pt]
\text{subject to} & \pi_k \geq 0 \quad k \in \{1 \ldots K\}\\[4pt]
& \sigma_k^2 > 0 \quad k \in \{1 \ldots K\}
\end{array} \tag{4.103}
\]

The first order optimality conditions imply that the solutions to this problem are amongst the stationary points of the objective function, as well as the non-interior feasible points (i.e. points with π_k = 0 for some k). Next we find the stationary points by taking derivatives of the objective function with respect to each of the µ_k's, σ²_k's, π_k's, and λ. For simplicity of notation, we denote the objective function with J:

\[
J(\lambda, \theta) = \sum_{i=1}^{N} \log\left(\sum_{k=1}^{K} \pi_k \mathcal{N}_k(y_i)\right) + \lambda\left(\sum_{k=1}^{K} \pi_k - 1\right) \tag{4.104}
\]

Derivative with respect to µ_k

Take the derivative of Eq. 4.104 with respect to µ_k, where k is any number in {1 . . . K}:

\begin{align}
\frac{\partial J}{\partial \mu_k} &= \sum_{i=1}^{N} \frac{\pi_k}{\sum_{\ell=1}^{K} \pi_\ell \mathcal{N}_\ell(y_i)} \frac{\partial \mathcal{N}_k(y_i)}{\partial \mu_k} \tag{4.105}\\
&= \sum_{i=1}^{N} \underbrace{\frac{\pi_k \mathcal{N}_k(y_i)}{\sum_{\ell=1}^{K} \pi_\ell \mathcal{N}_\ell(y_i)}}_{\gamma_{ik}} \frac{(y_i - \mu_k)}{\sigma_k^2} \tag{4.106}
\end{align}

Here we have introduced the symbol γ_ik for the "responsibility" of class k for data point y_i. This quantity can be shown (via Bayes' rule) to be the probability that data point i is of class k.

\[
\gamma_{ik} = \frac{\pi_k \mathcal{N}_k(y_i)}{\sum_{\ell=1}^{K} \pi_\ell \mathcal{N}_\ell(y_i)} \tag{4.107}
\]

Equating Eq. 4.106 to zero, we find a condition for stationarity.

\[
\sum_{i=1}^{N} \gamma_{ik} \frac{(y_i - \mu_k)}{\sigma_k^2} = 0 \tag{4.108}
\]

With σ²_k ≠ 0, this leads to an expression for the optimal placement of the means of the components.

\[
\mu_k = \frac{1}{N_k}\sum_{i=1}^{N} \gamma_{ik}\, y_i \tag{4.109}
\]

Here we have defined N_k as the total responsibility of the k'th component.

\[
N_k = \sum_{i=1}^{N} \gamma_{ik} \tag{4.110}
\]

In words, the k'th component is centered at the weighted sample mean of the data points, with the weights set to the responsibilities of that component. It is easily shown that \(\sum_{k=1}^{K} N_k = N\).

Derivative with respect to σ²_k

Next we take a derivative of Eq. 4.104 with respect to σ²_k.

\[
\frac{\partial J}{\partial \sigma_k^2} = \sum_{i=1}^{N} \frac{\pi_k}{\sum_{\ell=1}^{K} \pi_\ell \mathcal{N}_\ell(y_i)} \frac{\partial \mathcal{N}_k(y_i)}{\partial \sigma_k^2} \tag{4.111}
\]

Equating this to zero and skipping some of the details (which are easy, though tedious), we eventually get to a second stationarity condition:

\[
\sum_{i=1}^{N} \gamma_{ik}\left(\frac{(y_i - \mu_k)^2}{\sigma_k^2} - 1\right) = 0 \tag{4.112}
\]

Which leads to

\[
\sigma_k^2 = \frac{1}{N_k}\sum_{i=1}^{N} \gamma_{ik} (y_i - \mu_k)^2 \tag{4.113}
\]

In words, the variance of the k'th component is the weighted sample variance of the data points, with weights set to the responsibilities of that component.

Derivative with respect to π_k

Finally, we take a derivative of Eq. 4.104 with respect to π_k and equate it to zero.

\[
\frac{\partial J}{\partial \pi_k} = \sum_{i=1}^{N} \frac{\mathcal{N}_k(y_i)}{\sum_{\ell=1}^{K} \pi_\ell \mathcal{N}_\ell(y_i)} + \lambda = 0 \tag{4.114}
\]

Multiplying both sides by π_k:

\[
\sum_{i=1}^{N} \gamma_{ik} + \lambda \pi_k = 0 \tag{4.115}
\]

Using the definition of N_k, this implies that π_k = −N_k/λ. Separately, from the requirement that \(\sum_k \pi_k = 1\), we find that λ = −N, and therefore,

\[
\pi_k = \frac{N_k}{N} \tag{4.116}
\]

In words, the marginal probability of class k is the ratio of the total responsibility of component k (N_k) to the number of points in the dataset (N). This justifies the interpretation of N_k as the "effective" number of data points in the k'th component.

To summarize, we have derived five formulas (Eqs. 4.107, 4.109, 4.110, 4.113, and 4.116) that characterize the local solutions of the maximum likelihood estimate for a Gaussian mixture. These formulas can be summarized as follows:

1. Given the marginal probabilities π_k's, compute the responsibilities γ_ik for each data point i and component k, using Eq. 4.107.

2. Compute the centroids µ_k and total responsibilities N_k for each component k, using Eqs. 4.109 and 4.110.

3. Compute the variances σ²_k and marginal probabilities π_k for each component k, using

Figure 4.6: Expectation Maximization for Gaussian Mixture Models

Eqs. 4.113 and 4.116

Taken together, these form a system of nonlinear equations that is difficult to solve. However, notice that the first step assumes that the π_k's are given, and these are computed in the third step. This suggests that a possible procedure for solving this system of equations might be to cycle through the three steps until the values converge. This turns out to be an excellent numerical algorithm for this problem, and it goes by the name of the expectation-maximization or EM algorithm. The details of the algorithm are shown in Figure 4.6.
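A minimal one-dimensional version of the EM loop, directly implementing Eqs. 4.107, 4.109, 4.110, 4.113, and 4.116, might look as follows. This is a sketch; the initialization, the fixed number of iterations, and the synthetic data are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic data from two hypothetical components.
y = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 0.5, 200)])
N, K = len(y), 2

# Crude initialization, sufficient for illustration.
pi = np.full(K, 1.0 / K)
mu = rng.choice(y, size=K, replace=False)
var = np.full(K, y.var())

for _ in range(100):
    # E step: responsibilities, Eq. 4.107.
    dens = np.stack([pi[k] * norm.pdf(y, mu[k], np.sqrt(var[k])) for k in range(K)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)        # shape (N, K)

    # M step: Eqs. 4.109, 4.110, 4.113, 4.116.
    Nk = gamma.sum(axis=0)
    mu = (gamma * y[:, None]).sum(axis=0) / Nk
    var = (gamma * (y[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / N

print(pi, mu, var)
```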

4.1.8 Clustering algorithms and K-means

We saw in the previous section that fitting a Gaussian mixture to a dataset entails finding the mean and variance for each of the Gaussian components, and also the "responsibility" values γ_ik. These responsibilities can be interpreted as membership values: the i'th data point has membership level γ_ik in the k'th Gaussian component. We can use the memberships to create clusters by assigning each data point to the component for which its membership is highest. Thus, the GMM procedure can be used to cluster the data, in addition to generating a model.

But what if we are only interested in the clusters, and not in the model? In this case the
GMM approach may be overkill, and we may prefer an algorithm that focuses exclusively on
the clustering task. There are many algorithms for doing this. Generally they go under the
heading of “unsupervised learning”, as opposed to “supervised learning”, which we will cover
starting in Chapter ?? and throughout the rest of the course. While supervised learning
algorithms teach machines to predict approximately correct answers by presenting them
with examples of correct answers, unsupervised learning algorithms lack a concept of a
“correct” answer. They are used simply to find patterns in the data. In the case of clustering
algorithms, they find groups (a.k.a. clusters) of similar data points.

Gaussian mixtures were presented in the previous section in the single-output context (D = 1). Here we will generalize the presentation to multiple outputs (D ≥ 1). The component means µ_k are now vectors in R^D, and the variances σ²_k become D×D positive definite covariance matrices which we will denote by Σ_k.

The K-means algorithm can be obtained via a simplification of GMM. The first step is to impose a simplified structure on the covariance matrices: Σ_k is required to be a diagonal matrix with equal entries (ε > 0) along the diagonal.

\[
\Sigma_k = \begin{bmatrix}
\varepsilon & 0 & \cdots & 0\\
0 & \varepsilon & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & \varepsilon
\end{bmatrix} \tag{4.117}
\]

Each of the component Gaussian pdfs is now radially symmetric; their level sets are circles. Each membership value γ_ik now depends only on the Euclidean distance from the data point to the mean µ_k of the Gaussian component. The next step is to take the limit as ε → 0. As we do this, the γ_ik's migrate to their extreme values of 0 and 1. That is, the γ_ik's become a hard indicator of cluster membership. The centroids µ_k are the mean of the members of their cluster, and N_k is the integer number of points in cluster k.

Figure 4.7: Basic K-means algorithm

Let's now see what happens to the EM algorithm of Figure 4.6 when we apply the simplification just described. This is illustrated in Figure 4.7. As with GMM, the algorithm begins with a specification of the number of clusters K. The algorithm is initialized with a random placement of the centroids {µ_k}_K. In the 'E' step, instead of computing responsibilities with Eq. 4.107, we now assign each data point to its nearest centroid. Then we find N_k as the number of data points assigned to the k'th cluster. The 'M' step then relocates the µ_k's to the mean of the points in each cluster. This is repeated until the clusters stop changing.

Once convergence is reached, the result can be assured to be a local optimum, but not necessarily a globally optimal solution. It may be possible to reach a better solution by starting from a different initial placement of the centroids. It is very common to perform an "ensemble run", in which many executions of the basic algorithm are carried out, and only the best result (best local optimum) is kept.
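In practice this ensemble strategy is built into most library implementations. A minimal sketch using scikit-learn follows; the dataset X and the settings are illustrative assumptions, not part of the text:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical 2-D dataset with three loose groups.
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ([0, 0], [3, 3], [0, 4])])

# n_init=20 runs the basic algorithm from 20 random initializations
# and keeps the best local optimum (the "ensemble run" described above).
km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)

print(km.cluster_centers_)   # centroids mu_k
print(km.inertia_)           # within-cluster sum of squared distances (the K-means cost)
```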

Finally, we can run ensemble runs for different numbers of clusters. Figure 4.8 shows
the result of running an ensemble of basic K-means models for values of K ranging from 1
to 10. It is clear that the best possible solution with K = 5 will be better than the best
possible solution with K = 4, since K = 4 is achievable by leaving empty one of the clusters
in K = 5. The optimal cost for K-means should therefore be a strictly decreasing function

Figure 4.8: Sweep over K

of K, which reaches zero in the extreme case of K = N (the number of data points). To
select a “best K”, it is therefore necessary to judge the marginal benefit of increasing K
by one, say from 4 to 5. The plot on the right shows these marginal gains as a function
of K. For example, the cost is decreased by over 150% when going from K = 1 to K = 2
clusters. A further improvement of over 200% is attained when going to 3 clusters. However
the improvement decreases to about 15% when going from K = 4 to K = 5. The conclusion
is that K = 4 is a reasonable place to stop.

4.2 Confidence intervals

When we estimate the value of a parameter, the number that we obtain is almost surely not identical to the true value of the parameter. The question that arises then is how close do we expect the estimate to be to the true value? We've already seen in the MSE one way of answering this question. The MSE gives the expected value of the square of the difference between the estimate and the true value. When the estimator is unbiased, however, it is more common in practical applications to use confidence intervals to report the uncertainty in the estimate.

A γ-level confidence interval of a parameter is an interval of the real line that has probability γ of containing the true value of the parameter. The procedure for constructing a γ-level confidence interval for a parameter θ consists of these steps:

1. Collect an iid dataset D = {y_i}_N.

2. Propose a parametric family of distributions for the system: Y ∼ F(θ).

3. Propose an estimator for the parameter that is unbiased for the chosen family of distributions.

4. Use the estimator and the data to compute an estimate θ̂.

5. Center the confidence interval on θ̂, and define ρ as its radius. That is, the confidence interval is of the form [θ̂ − ρ, θ̂ + ρ].

6. Compute ρ such that there is a probability γ that the confidence interval includes θ. That is, the probability that the distance between θ and θ̂ is no greater than ρ must equal γ:

\[
P_{\hat{\Theta}}\big( |\hat{\Theta} - \theta| \leq \rho \big) = \gamma \tag{4.118}
\]

We've seen (in class) that we can use Chebyshev's inequality to place an upper bound on this probability even when we have no knowledge of the shape of Θ̂. We will assume here, however, that we do know the shape of Θ̂, and that we can use that knowledge to compute ρ precisely. We will develop the technique for estimates of the mean µ_Y.

4.2.1 Confidence interval for the mean µ_Y when Y is Gaussian with known variance

Let's suppose that we want to construct a γ-level confidence interval for the mean µ_Y. We begin by collecting an iid dataset D and proposing a model family for the system. In this case we choose the Gaussian family. Let's suppose further that we know the variance σ²_Y of the system, even though we do not know its mean. This might happen if we are using a measurement device with known precision. Thus, our candidate models for the system consist of all Gaussian distributions with variance σ²_Y. The reason for this assumption will become clear, and we will address the case of unknown σ²_Y later.

The third step in our procedure is to choose an unbiased estimator for µ_Y. Of course we choose the sample mean Ȳ_N. We then apply the sample mean formula to the data to obtain the estimate µ̂.

The final and main part of the task is to compute ρ to satisfy Eq. 4.118, which in this scenario becomes:

\[
P\big( |\bar{Y}_N - \mu_Y| \leq \rho \big) = \gamma \tag{4.119}
\]

Having assumed that Y is Gaussian, we can assert that Ȳ_N is Gaussian:

\[
\bar{Y}_N \sim N(\mu_Y, \sigma_Y^2/N) \tag{4.120}
\]

Hence, Eq. 4.119 says that ρ is such that the area under the pdf of N(µ_Y, σ²_Y/N) between µ_Y − ρ and µ_Y + ρ must equal γ (see Figure 4.9). In other words,

\[
F_{\bar{Y}_N}(\mu_Y + \rho) - F_{\bar{Y}_N}(\mu_Y - \rho) = \gamma \tag{4.121}
\]

This equation involves µ_Y, which is unknown. However, we notice that the area under the curve is not affected by shifting the pdf along the horizontal axis – that is, by adding a constant to Ȳ_N. We can therefore write

\[
F_{\bar{Y}_s}(\rho) - F_{\bar{Y}_s}(-\rho) = \gamma \tag{4.122}
\]

where Ȳ_s = Ȳ_N − µ_Y, and Ȳ_s ∼ N(0, σ²_Y/N). Next we notice that, due to the symmetry of the Gaussian, we have F_{Ȳ_s}(ρ) = 1 − F_{Ȳ_s}(−ρ). Plugging this into Eq. 4.122 we obtain:

\[
1 - 2 F_{\bar{Y}_s}(-\rho) = \gamma \tag{4.123}
\]

which implies

\[
F_{\bar{Y}_s}(-\rho) = \frac{1 - \gamma}{2} \tag{4.124}
\]

which, given the fact that the cdf is an invertible function, implies

\[
-\rho = F_{\bar{Y}_s}^{-1}\left(\frac{1 - \gamma}{2}\right) \tag{4.125}
\]

The right-hand side of this formula is a negative number. Take the absolute value to obtain the positive radius of the confidence interval:

\[
\rho = \left| F_{\bar{Y}_s}^{-1}\left(\frac{1 - \gamma}{2}\right) \right| \tag{4.126}
\]

This formula can be easily evaluated using the stats subpackage of Python's SciPy library. However, to use lookup tables, we must express ρ in terms of the standard normal distribution N(0, 1). Going back to Eq. 4.121, instead of shifting the distribution, let's shift and scale it by the standard deviation of Ȳ_N. That is, let's standardize it. Thus we obtain a standard normal variate which we call Z:

\[
Z = \frac{\bar{Y}_N - \mu_Y}{\sigma_Y / \sqrt{N}} \sim N(0, 1) \tag{4.127}
\]

Under this transformation, the point µ_Y + ρ moves to \(\frac{\rho}{\sigma_Y/\sqrt{N}}\) and µ_Y − ρ moves to \(-\frac{\rho}{\sigma_Y/\sqrt{N}}\). Thus, Eq. 4.122 becomes

\[
F_N\!\left(\frac{\rho}{\sigma_Y/\sqrt{N}}\right) - F_N\!\left(-\frac{\rho}{\sigma_Y/\sqrt{N}}\right) = \gamma \tag{4.128}
\]

We again apply the symmetry argument and arrive finally at:

\[
\rho = \frac{\sigma_Y}{\sqrt{N}} \left| F_N^{-1}\left(\frac{1 - \gamma}{2}\right) \right| \tag{4.129}
\]

Example 4.2.1. A sample of 10 resistors is taken from a process for manufacturing 200 ohm resistors. The sample has a mean of 195 ohms. Construct a 95% confidence interval for the mean assuming that the true distribution is Gaussian, with a standard deviation of 7 ohms.

Solution

The problem statement specifies N = 10, µ̂ = 195, Y ∼ N(µ_Y, 7²), and γ = 0.95. The confidence interval is centered at µ̂, and its radius is found with Equation 4.129. Use the lookup table to find,

\[
F_N^{-1}\left(\frac{1 - \gamma}{2}\right) = F_N^{-1}(0.025) = -1.96 \tag{4.130}
\]

This can be done in Python with scipy.stats.norm().ppf(0.025).

The radius of the interval is then:

\[
\rho = \frac{7}{\sqrt{10}}\,(1.96) = 4.34 \tag{4.131}
\]

The confidence interval is therefore 195 ± 4.34 ohms, or [190.66, 199.34] ohms. ∎
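The whole calculation can be checked with SciPy. This is a sketch that mirrors the example; scipy.stats.norm.interval is one of several equivalent ways to get the same interval:

```python
import numpy as np
from scipy import stats

N, mu_hat, sigma, gamma = 10, 195.0, 7.0, 0.95

rho = sigma / np.sqrt(N) * abs(stats.norm.ppf((1 - gamma) / 2))    # Eq. 4.129
print(mu_hat - rho, mu_hat + rho)                                  # [190.66, 199.34]

# Equivalent one-liner using the standard error of the sample mean.
print(stats.norm.interval(gamma, loc=mu_hat, scale=sigma / np.sqrt(N)))
```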

We can see from Equation 4.129 that the width of the confidence interval decreases as the
size of the sample grows. This reflects the consistency of the sample mean estimator, which

Figure 4.9: Calculation of the confidence interval for the mean when Y is Gaussian and σ²_Y is known.

delivers more precise estimates when given more data. Conversely, a larger process variance σ²_Y results in a larger confidence interval. We can be less sure of the location of the mean when the data is noisy. Finally, because the cdf of the standard normal distribution F_N is an increasing function, it follows that its inverse is also increasing, which means that F_N^{-1}((1−γ)/2) is a decreasing function of γ. But since F_N^{-1}((1−γ)/2) is negative for γ ∈ (0, 1), its absolute value is an increasing function of γ. So, as γ increases, the width of the confidence interval also increases. Our confidence that the mean is contained in the interval grows as the interval gets larger.

4.2.2 Confidence interval for the mean µ_Y when Y is Gaussian with unknown variance

A significant limitation of the approach we've just outlined is that it requires a-priori knowledge of σ²_Y, the variance of Y. When σ²_Y is unknown, it must be estimated from the data. For this we use the unbiased sample variance S² (Eq. 4.49). Then, following the same procedure as before, we arrive at an alternate version of Eq. 4.127, in which σ²_Y has been replaced with S²:

\[
t = \frac{\bar{Y}_N - \mu_Y}{\sqrt{S^2/N}} \tag{4.132}
\]

Here, t is a random variable that depends on the random variables Ȳ_N and S², which in turn depend on Y. The distribution of t that results when Y is Gaussian was derived in the early 1900's by William Sealy Gosset under the pseudonym of "Student", and is known as "Student's t distribution". The formula for the pdf of this distribution is complicated and not very enlightening. Like the standardized sample mean Z, the t distribution is symmetric and its mean is zero. Unlike Z, however, the variance of t depends on the number of data points N. Figure 4.10 shows t-distributions for different values of ν ("nu"). ν corresponds to the number of "degrees of freedom" of the distribution. We will not delve further into the concept of degrees of freedom, except to note that in our case it is one less than the number of data points:

\[
\nu = N - 1 \tag{4.133}
\]

We can observe in the figure that, as ν grows, the t-distribution converges to a standard normal distribution. For smaller ν (small datasets), the variance of the distribution is larger, due to the greater uncertainty in the estimate of σ²_Y. When N is sufficiently large, we dispense with the t-distribution and use the standard normal distribution instead. In this class we will use the normal distribution whenever N > 30.

Figure 4.10: Student’s t-distribution

The formula for computing the radius of a confidence interval when σ²_Y is unknown and Y is Gaussian is identical to the case when σ²_Y is known (Eq. 4.129), except that we use σ̂ instead of σ_Y, and the inverse cdf of the t-distribution instead of that of the standard normal distribution.

\[
\rho = \frac{\hat{\sigma}}{\sqrt{N}} \left| F_{t(\nu)}^{-1}\left(\frac{1 - \gamma}{2}\right) \right| \tag{4.134}
\]

Example 4.2.2. Repeat Example 4.2.1, but assuming unknown standard deviation.

Solution

This time, the radius of the confidence interval is found with a lookup table for the inverse cdf of the t distribution, using ν = N − 1 = 9.

\[
F_{t(9)}^{-1}(0.025) = -2.26 \tag{4.135}
\]

This can be done in Python with scipy.stats.t(df=9).ppf(0.025).

The radius of the interval is then:

\[
\rho = \frac{\hat{\sigma}}{\sqrt{10}}\,(2.26) \tag{4.136}
\]

The confidence interval is centered on the sample mean µ̂ = 195; the sample standard deviation σ̂ must be computed from the dataset (not provided in this problem).

4.2.3 Confidence interval for the mean µY when Y is non-Gaussian

In the previous sections we used the assumption of Gaussian Y to assert that the sample mean Ȳ_N is Gaussian. We then proceeded to compute the radius of the confidence interval by standardizing the sample mean. But because the standardization step involved σ²_Y, this generated two cases: known and unknown σ²_Y, and two corresponding distributions: N(0, 1) and t(ν).

If we now drop the assumption of Gaussian Y, then we can no longer be sure that Ȳ_N is Gaussian. This does not necessarily mean that it is unknown. There are non-Gaussian Y's for which the distribution of Ȳ_N is known (e.g. Y ∼ B(s)). However, when this is not the case, we must appeal to the central limit theorem, which tells us that if N is sufficiently large, then we are justified in approximating Ȳ_N as Gaussian.

Here we will again use N > 30 as a threshold for invoking the CLT. Having made that assumption, the rest of the procedure is identical. We can standardize the sample mean by subtracting the mean and dividing it by the true or sample standard deviation. If σ_Y is known, the result is distributed as N(0, 1). If σ_Y is unknown and σ̂ is used instead, then the result is distributed as t(ν). Except, because N is large, ν is also large and t(ν) becomes indistinguishable from N(0, 1), so the two cases collapse into one. Figure 4.11 provides a summary diagram.

Example 4.2.3. (Chapter 5.2 of Navidi).

A soft-drink manufacturer purchases aluminum cans from an outside vendor. A random


sample of 70 cans is selected from a large shipment, and each is tested for strength by
applying an increasing load to the side of the can until it punctures. Of the 70 cans, 52 meet
the specification for puncture resistance. Find a 95% confidence interval for the proportion
of cans in the shipment that meet the specification.

Solution

The can-testing process is a Bernoulli random variable with an unknown probability of

Figure 4.11: Diagram for deciding which distribution to use to compute ρ

success of s. We know about Y ∼ B(s) that,

\begin{align}
\mu_Y &= s \tag{4.137}\\
\sigma_Y^2 &= s(1 - s) \tag{4.138}
\end{align}

The dataset is encoded with a 1 for "success" (the can meets the specification), and a 0 for "failure" (the can does not meet the specification). Hence, the dataset contains 52 ones and 18 zeros. The sample mean is then:

\[
\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{52}{70} \approx 0.74 \tag{4.139}
\]

and the unbiased sample variance is:

\begin{align}
\hat{\sigma}^2 &= \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \hat{\mu})^2 \tag{4.140}\\
&= \frac{1}{69}\left(52 \times \left(1 - \frac{52}{70}\right)^2 + 18 \times \left(\frac{52}{70}\right)^2\right) \tag{4.141}\\
&= \frac{52 \times 18}{69 \times 70} \tag{4.142}\\
&\approx 0.194 \tag{4.143}
\end{align}

Because N is large, we can assert that the normalized sample mean is t-distributed:

\[
t = \frac{\bar{Y}_N - \mu_Y}{\sqrt{S^2/N}} \sim t(\nu) \tag{4.144}
\]

and furthermore that t(ν) ≈ N(0, 1). Therefore we can use tables for F_N^{-1} to compute the radius of the confidence interval. The remaining details are left to the reader.
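For reference, a short SciPy sketch that carries out the remaining steps follows. It completes the example under the same normal approximation; the resulting numbers are not part of the original text:

```python
import numpy as np
from scipy import stats

N, gamma = 70, 0.95
mu_hat = 52 / 70                    # Eq. 4.139
var_hat = 52 * 18 / (69 * 70)       # Eq. 4.142, unbiased sample variance

rho = np.sqrt(var_hat / N) * abs(stats.norm.ppf((1 - gamma) / 2))   # radius, normal approx.
print(mu_hat - rho, mu_hat + rho)   # roughly [0.64, 0.85]
```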

Chapter 5

Supervised learning

In the previous chapter we described the point estimation problem, and specifically the
maximum likelihood technique, for building a model of a closed-box system with no inputs.
The steps in the procedure were as follows,

1. Collect an iid dataset.

2. Propose a parametric family of distributions for the model Y .

3. Find values of the parameters by solving an optimization problem (Eq. 4.68).

We now extend the setting to include systems with inputs. This is illustrated in Figure 5.1.
The input x is a vector in RD , with values corresponding to each of the D inputs of the
system. We will study two important problems in this context: inference and prediction.
The inference problem (left-hand side of Figure 5.1) builds a distribution over the space of

Figure 5.1: Two types of models for closed-box systems with inputs.

possible outputs y, for each input x. That is, it approximates the conditional distribution Y | X = x. The main technique for building inference models will be an extension of the maximum likelihood technique introduced in Section 4.1.6.

The second type of problem that we will study is the prediction problem (right-hand side of Figure 5.1). A prediction model h(x) is one that computes a predicted output y for each input x. It is a simpler model than the inference model Y | X = x, which produces a distribution of y for each x. Because h(x) is a simpler mathematical object than Y | X = x, the available techniques for prediction cover a wider range of scenarios. On the other hand, inference techniques provide more insight about the system. For example, from the inferred distributions we can compute confidence intervals for the output. We will see how to do this in the context of linear regression in Chapter ??.

A third possibility, which we do not cover in this class, is to build a model of the joint distribution (X, Y). Such a model goes beyond an inference model because it includes a model of the inputs (p(x)), as well as the transformation from inputs to outputs (p(y|x)).
Most of the rest of the course will be concerned with techniques for solving the prediction
problem. We begin in this chapter with an overview of a general framework from building
prediction models called “supervised learning”. The approach follows the three steps listed
earlier for building probabilistic models using maximum likelihood. For the prediction prob-
lem the steps are:

1. Collect an iid dataset D = {(xi , yi )}N .

2. Propose a parametric family of input/output functions h(x; θ).

3. Find the values of the parameters that minimize the loss of the model given the data.

Notice that the “distributions” from the maximum likelihood problem have been replaced
with “input/output functions”, and instead of maximizing the likelihood, we are minimizing
a loss function. There are many candidate model families to choose from in step 2. We will

Figure 5.2: Two views of a dataset with D = 2.

cover several of them: K-bins , K-nearest neighbors, linear regression, logistic regression,
decision trees, support vector machines, and neural networks. In this chapter we introduce
high-level notions that apply generally to the framework, irrespective of the chosen model
family.

5.1 The data

The first step is to collect a dataset D consisting of N samples of inputs and the corresponding output of the system. As in the previous chapter, the data must be iid, meaning that nothing about the system or the input generating process may change between samples, and the outcome of one measurement cannot influence the outcome of any other.

\[
D = \{(x_i, y_i)\}_N = \{(x_i^1, \ldots, x_i^D, y_i)\}_N \tag{5.1}
\]

The dataset can be organized into a table with N rows and D + 1 columns, as shown in Figure 5.2. Each row corresponds to a sample. The first D columns hold the inputs, and the right-most column holds the output. The fact that the samples are iid means that we are free to shuffle the rows.
The right hand side of Figure 5.2 shows a scatter plot of the data. Each data point is
thought of as being generated in a two-step process, captured by the decomposition of the
joint input-output distribution into a conditional times a marginal distribution: p(xi , yi ) =
p(yi |X = xi ) pX (xi ). First, an input sample xi is drawn from the distribution of inputs pX .
This corresponds to sampling a black dot in the horizontal plane in the figure. This sample
determines a conditional distribution along the vertical axis p(yi |X = xi ), which generates yi
(the corresponding red dot). Generative models are ones that capture both parts of this data
generating process: input generation and its transformation through the system. Predictive
models only capture p(y|X = x). They take the input as given, and are not concerned with
how it was generated.
Prediction problems are divided broadly into three types: classification problems, ordinal
problems, and regression problems, depending on the sample space of the output ΩY.

• For classification problems, ΩY is categorical, meaning that it consists of an unordered
set of labels. For example, in Example 2.6.7, ΩV = {scooter, bike, moped} is categorical
because it consists of discrete labels, and we do not impose an order on the labels
(moped < scooter has no meaning).

• For ordinal problems, ΩY is again discrete, but it is also ordered. ΩY = {1, 2, 3, 4, 5} is
an example of an ordinal sample space, which may correspond to the output of a rating
system.

• For regression problems, ΩY is some interval of the real numbers, for example ΩY = R
or ΩY = [0, 1]. The table of Figure 5.2 corresponds to a regression problem.

Let’s consider the regression problem for a system with a single input (D = 1). Figure 5.3
shows views of the joint distribution in the leftmost and middle plots, and its decomposition

Figure 5.3: Decomposition of a joint distribution into pX (x), h0 (x) and p(y|X = x).

into the product of pX(x) (in red) and p(y|X = x) (in blue) in the rightmost plot. Consider
the inference problem, where our goal is to approximate the conditional distribution p(y|X = x).
The input-to-output transformation represented by p(y|X = x) can itself be understood
as a two-step process. First the input is transformed by a deterministic function h0(x) into
an “expected output”, and then the expected output is corrupted by zero-mean noise ε. This
is expressed as follows,

    y = h0(x) + ε                                                    (5.2)

where h0 (x) is the expected value of the output for input x,

h0 (x) = E[ Y | X = x ] (5.3)

and E[ε] = 0. The common interpretation of Eq. 5.2 is that h0(x) represents the “true”
model of the system, while ε captures uncertainties such as measurement errors. In the
rightmost plot of Figure 5.3, h0(x) is the purple line, and the distributions of ε for each x are
shown in blue. In the most general setting, the shape of the pdf of ε may vary with x. That
is, each of the blue lines in the figure may have a different shape (not just a different location).
However, it is customary to assume that this is not the case, and that ε is independent of

Figure 5.4: A tiny neural network with 6 vertices and 9 weighted edges.

the input.
The prediction problem is then to construct a function h(x) that approximates h0 (x).
The inference problem requires both the estimation of h0(x) and of the distribution of ε.
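To make the decomposition of Eq. 5.2 concrete, here is a minimal Python sketch (illustrative only; the sine-wave h0, the uniform input distribution pX, and the Gaussian noise are all assumptions, not taken from the text) that simulates a dataset for a single-input regression problem:

import numpy as np

rng = np.random.default_rng(0)

def h0(x):
    # Hypothetical "true" system function, assumed for illustration
    return np.sin(x)

N = 200
x = rng.uniform(0.0, 2.0 * np.pi, size=N)   # draw inputs from pX
eps = rng.normal(0.0, 0.3, size=N)          # zero-mean noise, independent of x
y = h0(x) + eps                             # Eq. 5.2: y = h0(x) + eps

data = np.column_stack([x, y])              # dataset table with N rows and D+1 = 2 columns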

5.2 Parametric families of models

We focus on the prediction problem; i.e. to construct a prediction function h(x) that closely
matches the unknown h0(x). We use H to denote a parameterized family of prediction functions.
P is the number of parameters that parameterize H, and h(x; θ) ∈ H is a particular
member of H with the parameters set to θ ∈ R^P. (We are dropping the underline from the
θ of the previous chapter. θ is always vector-valued from now on.) For example, H might be
the family of parabolas, i.e. all functions that map x ∈ R to θ0 + θ1 x + θ2 x², where θ0, θ1, and
θ2 are real numbers. This family has P = 3, because it is parameterized by (θ0, θ1, θ2) ∈ R³.
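As a sketch of this notation in code (illustrative Python; the particular parameter values are arbitrary), the family of parabolas can be written as a single function of the input x and the parameter vector θ:

import numpy as np

def h(x, theta):
    # Parabola family h(x; theta) = theta0 + theta1*x + theta2*x**2, with P = 3
    theta0, theta1, theta2 = theta
    return theta0 + theta1 * x + theta2 * x ** 2

# Two different members of the family H, obtained by choosing different theta
y_hat_a = h(1.5, np.array([0.0, 1.0, -0.5]))   # theta = (0, 1, -0.5)
y_hat_b = h(1.5, np.array([2.0, 0.0, 1.0]))    # theta = (2, 0, 1)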

Example 5.2.1. Figure 5.4 shows a tiny neural network. The details of how it works are
not important for now and are covered in Chapter ??. This neural network has 9 tunable
parameters θ1 . . . θ9, corresponding to the “weights” on each of its edges. The family of
neural networks with this particular architecture has P = 9 and is parameterized by θ ∈ R⁹.

Figure 5.5: Example loss functions.

Each parameter vector θ yields a prediction function h(x; θ). We denote with ŷ the prediction
for input x:

    ŷ = h(x; θ)                                                      (5.4)

The semicolon in the notation separates the input of the function (x) from its parameters (θ).
Our goal is to find the “best” prediction function h from the proposed family of prediction
functions H. The performance of a particular prediction function is gauged in terms of the
expected prediction error. That is, in terms of the expected “distance” between the predicted
value ŷ and the true value y. This distance is measured using a loss function.

5.3 Loss function

A loss function L(y, ŷ) is a function that quantifies the distance between the predicted output
ŷ and the actual output y, for a particular input x. The loss function is used in the objective
function of the optimization problem that finds optimal values of the parameters θ of h(x; θ)
(see Eq. 5.7). Hence, a loss function should evaluate to zero (or a small value) whenever
y = ŷ, and to a positive (or larger) value when y ≠ ŷ. It is not necessary that a loss function
satisfy all of the requirements of being a metric. That is, it need not be symmetric nor

satisfy the triangle inequality.

When solving a prediction problem, the choice of loss function will depend mainly on the
type of problem being solved (classification, ordinal, or regression). Figure 5.5 shows two
examples of loss functions used in regression problems: the squared loss L2(y, ŷ) = (y − ŷ)²
and the absolute value loss L1(y, ŷ) = |y − ŷ|. These are only used in regression problems and
are not appropriate for classification problems. Loss functions used in classification, such as
the cross-entropy loss, will be introduced in Chapter ??.

Our choice of L1 or L2 or some other loss function reflects our preferences about the
resulting prediction errors. For example, both L1 and L2 are symmetric, meaning L(y, ŷ) =
L(ŷ, y). This means they penalize over-prediction and under-prediction equally. If for a
particular problem we deem over-prediction as worse than under-prediction, then we can use
an asymmetric loss function that penalizes positive errors more harshly than negative errors.

The loss function can also be used to control the occurrence of large outlier errors.
Models trained with the L2 loss will tend to produce errors that are more “nicely behaved”
(i.e. normally distributed, with few large positive or negative “spikes”) than those resulting
from the L1 loss function. This is because the L2 penalty rises more quickly and therefore
penalizes large errors more severely than L1 . On the other hand, models trained with the
L1 loss will tend to produce a lower average absolute error, relative to the L2 loss. This is
because the total L1 loss is proportional to the average absolute error. Despite this advantage,
L2 is often preferred because of its differentiability.
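Here is a small sketch of these ideas in Python (illustrative only; the asymmetric loss shown is just one possible choice, and the factor of 3 is arbitrary):

import numpy as np

def squared_loss(y, y_hat):          # L2
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):         # L1
    return np.abs(y - y_hat)

def asymmetric_loss(y, y_hat, over_weight=3.0):
    # Penalizes over-prediction (y_hat > y) over_weight times more than under-prediction
    err = y_hat - y
    return np.where(err > 0, over_weight * err, -err)

y, y_hat = 2.0, 3.5
print(squared_loss(y, y_hat), absolute_loss(y, y_hat), asymmetric_loss(y, y_hat))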

5.4 Optimization problem

We can now state the general supervised learning optimization problem. The dataset used
here is called the “training” dataset, and denoted with Dtrain . The first formulation is
abstract, and seeks a prediction function h from an amorphous class of candidate functions

H:

Given Dtrain = {(xi, yi)}_{i=1}^N,

    minimize over h ∈ H:    Σ_{i=1}^N L(yi, ŷi)                      (5.5)
    subject to:             ŷi = h(xi),   i = 1 . . . N

This is too broad. We must confine our search to a set of functions parameterized by θ.
Then the formulation becomes:

Given Dtrain = {(xi, yi)}_{i=1}^N,

    minimize over θ ∈ R^P:   Σ_{i=1}^N L(yi, ŷi)                     (5.6)
    subject to:              ŷi = h(xi; θ),   i = 1 . . . N

or more briefly,

    θ̂ = argmin_{θ ∈ R^P}  Σ_{i=1}^N L(yi, h(xi; θ))                 (5.7)

Of course, we may add constraints to the set of feasible parameters. We have seen a similar
optimization problem already in Section 3.4.1 when we introduced stochastic gradient de-
scent. Indeed, SGD is the most widely used (although not the only) numerical algorithm for
solving supervised learning problems. However SGD can only be used if the gradients can
be computed. Let’s consider this issue more closely.

Recall that the update equation for SGD (Eq. 3.24) makes use of ∇θ L, the gradient of L
with respect to the parameters. Using the chain rule, we find that this can be expressed as
a product of two terms:

    ∇θ L = (∂L/∂ŷ) ∇θ h                                              (5.8)

The first term is the partial derivative of the loss function with respect to its second argument.
This is a scalar quantity, and it corresponds simply to the slope of the curve in Figure 5.5.
We can see that for the L2 loss, the slope is proportional to the “residual”, ŷ − y, while for L1 it is
−1 for negative errors and +1 for positive errors. This discontinuity in the gradient of the
L1 loss precludes the use of SGD. We will not consider it further in the course for this
reason. The second term in Eq. 5.8 (∇θ h ∈ R^P) is the vector gradient of the model with
respect to its parameters. The difficulty of computing this term depends on the model
family that we have selected. For example, finding the gradient for the family of parabolas
h(x; θ) = θ0 + θ1 x + θ2 x² is easy.

    ∇θ h = ( ∂h/∂θ0 , ∂h/∂θ1 , ∂h/∂θ2 )                              (5.9)
         = ( 1, x, x² )                                               (5.10)

However computing the gradient of a neural network is more difficult, and was in fact a
major obstacle to their use until an efficient implementation of the backpropagation algorithm
became available in the 1980s. More on this in Chapter ??.
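To ground Eqs. 5.8–5.10, here is a minimal SGD sketch for the parabola family (illustrative Python only; the toy data, learning rate, and number of epochs are assumptions, not prescribed by the text):

import numpy as np

rng = np.random.default_rng(1)

def h(x, theta):
    # Parabola family: h(x; theta) = theta0 + theta1*x + theta2*x**2
    return theta[0] + theta[1] * x + theta[2] * x ** 2

def grad_theta_h(x):
    # Eq. 5.10: gradient of h with respect to theta
    return np.array([1.0, x, x ** 2])

# Toy training data generated from a known parabola plus noise
x_train = rng.uniform(-2.0, 2.0, size=100)
y_train = 1.0 - 2.0 * x_train + 0.5 * x_train ** 2 + rng.normal(0.0, 0.1, size=100)

theta = np.zeros(3)
lr = 0.01
for epoch in range(200):
    for i in rng.permutation(len(x_train)):              # one sample at a time (SGD)
        y_hat = h(x_train[i], theta)
        dL_dyhat = 2.0 * (y_hat - y_train[i])            # slope of the L2 loss w.r.t. y_hat
        theta -= lr * dL_dyhat * grad_theta_h(x_train[i])   # Eq. 5.8 update

print(theta)   # should approach (1.0, -2.0, 0.5)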

5.5 Assessing model performance

Once we have solved the optimization problem and obtained a prediction function h(x; θ̂),
the next question is how to assess the quality of the result. We might think that, by virtue of
being a solution to the optimization problem, h(x; θ̂) is the best amongst its family. We will see
that this is not necessarily so, due to the phenomenon of overfitting, and we will introduce
techniques for detecting and avoiding this problem.

Performance metrics for regression problems include the MSE, RMSE, MAE, MAPE, and R²:

    MSE = (1/N) Σ_{i=1}^N (yi − ŷi)²                         ... mean squared error               (5.11)

    RMSE = √MSE                                              ... root mean squared error          (5.12)

    MAE = (1/N) Σ_{i=1}^N |yi − ŷi|                          ... mean absolute error              (5.13)

    MAPE = (1/N) Σ_{i=1}^N |(yi − ŷi) / yi|                  ... mean absolute percentage error   (5.14)

    R² = 1 − [Σ_{i=1}^N (yi − ŷi)²] / [Σ_{i=1}^N (yi − µ̂Y)²]  ... coefficient of determination    (5.15)

Metrics for classification problems are introduced in Chapter ??.
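The following Python sketch computes Eqs. 5.11–5.15 directly (illustrative only; the toy arrays are made up, and the MAPE line assumes no yi equals zero):

import numpy as np

def regression_metrics(y, y_hat):
    err = y - y_hat
    mse = np.mean(err ** 2)                                         # Eq. 5.11
    rmse = np.sqrt(mse)                                             # Eq. 5.12
    mae = np.mean(np.abs(err))                                      # Eq. 5.13
    mape = np.mean(np.abs(err / y))                                 # Eq. 5.14
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y - np.mean(y)) ** 2)     # Eq. 5.15
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.7])
print(regression_metrics(y, y_hat))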

Concerning the MSE, it is important to understand the relationship between Eq. 5.11 and
the probabilistic quantity defined in Eq. 4.53. The probabilistic MSE applies to estimators in
general. The prediction function h(x; θ̂) can be regarded as an estimator of the true output
y. We can then imagine obtaining a new dataset Dtrain , repeating the training procedure
(i.e. re-solving the optimization problem), and thus obtaining a new prediction function and
a new value of ŷ(x) for the same input x. If we denote with Ŷ (x) the random variable that
captures these variations in ŷ (caused by variations in the training data, while keeping x
fixed), and we regard Ŷ (x) as an estimator for y, then we can define its MSE:

    MSE(Ŷ(x)) = E[ (Ŷ(x) − y)² ]                                    (5.16)

This is the mean squared error for each value of the input x. If we now take the average
(expectation) over all possible inputs x, we obtain the average MSE. Call this \overline{MSE}:

    \overline{MSE} = E_X[ MSE(Ŷ(X)) ]                               (5.17)

The numerical quantity of Eq. 5.11 is an unbiased estimate of \overline{MSE}.

As a performance metric, the MSE can be difficult to interpret because its units are
the square of the units of y. For reporting purposes it is more common to use the RMSE
(Eq. 5.12), which is simply the square root of the MSE. Both the MSE and the RMSE are
increasing functions of the total L2 loss. Hence h(x; θ̂) will be the minimizer of both MSE
and RMSE when L2 is used in the optimization problem (assuming a global optimum is
found). The MAE (Eq. 5.13) computes the average of the absolute values of the errors.
Its units are the same as the output, which makes it easier to interpret than MSE. It is
compatible with the L1 loss function in the same way as MSE and RMSE are compatible
with L2 . The MAPE goes a step further than MAE in its interpretability, since it is unitless.
A MAPE of 0.2 means that the prediction is off by 20% from the true value, on average.

The remaining difficulty of interpretation relates to the scale of the metric. Whether a
MAPE of 0.2 is good or bad depends on the difficulty of the problem. The coefficient of
determination R² addresses this issue by comparing the error incurred by the model with
the error of a baseline model h̄(x) = ȳ, where ȳ is the average of the outputs in the training
dataset: ȳ = (1/N) Σ_{i=1}^N yi. An R² of 1 indicates a “perfect” model; one that perfectly hits all
of the points in the training dataset. R² can never exceed 1. R² = 0 corresponds to the
baseline model. Hence we expect the R² to fall between 0 and 1, with larger values being
preferred. A negative R² indicates a model that performs worse than the baseline model.

5.5.1 Overfitting

These five metrics measure the errors obtained during the training process, but they are not
necessarily good predictors of future model performance. Figure 5.6 illustrates the problem.
Here the blue dots are the data used to train the model (Dtrain ), and the red plus signs
are samples that were withheld from training. We call this the test dataset Dtest . Each of
the four plots shows the optimal prediction function selected from four different families:

Figure 5.6: Overfitting in a polynomial regression.

linear functions, cubics, order five polynomials, and order 10 polynomials. These families are
nested in the sense that the cubics include all of the linear functions as a sub-family, the
order five polynomials include the cubics, etc. Hence, the “training error” for the optimal
cubic polynomial (i.e. the optimal L2 loss) will necessarily be better (or no worse) than that
of the linear functions, assuming global optima are found in each case. Similarly, the optimal
order five polynomial is better than the optimal cubic, and so on. However the same metric
applied to the test data tells a di↵erent story. The cubic polynomial is the best in terms
of the “test error”, while the order 10 polynomial, which was best on training, performs
terribly. The order 10 model is severely overfitted.

Overfitting occurs when the model family used in training is too flexible for the system
being modeled. In the example, the underlying system function happens to be a sine wave,
which is similar to a cubic function in the regime covered by the dataset. The added flexibility
of order ten polynomials allows them to more closely track the training data, but this turns
out to be detrimental to the test error.

How can we detect whether a model is overfitted? The best approach is what
we have just described: to withhold a portion Dtest of the data from the training process,
and to use it to estimate future model performance. This works well if we have sufficient
data. Otherwise, if data is scarce and data collection expensive, then there are metrics that
explicitly penalize some measure of model flexibility, such as the order of the polynomial.
Examples of these include Akaike’s information criterion (AIC) and the Bayesian information
criterion (BIC). These approaches tend to be model-specific, and will not be covered here.
A more data-centered technique for detecting overfitting with scarce data is called K-fold
cross-validation, and we cover this next.
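The following Python sketch reproduces the flavor of Figure 5.6 (illustrative only; the sine-wave system, noise level, and 30/10 train/test split are assumptions): it fits polynomials of increasing order and reports training and test MSE, and the test error typically worsens as the order grows while the training error keeps shrinking:

import numpy as np

rng = np.random.default_rng(2)

# Hypothetical system: h0(x) = sin(x), observed with noise
x = rng.uniform(0.0, 2.0 * np.pi, size=40)
y = np.sin(x) + rng.normal(0.0, 0.2, size=x.shape)

# Withhold a test set
idx = rng.permutation(len(x))
train_idx, test_idx = idx[:30], idx[30:]

for order in [1, 3, 5, 10]:
    coeffs = np.polyfit(x[train_idx], y[train_idx], deg=order)      # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x[train_idx]) - y[train_idx]) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x[test_idx]) - y[test_idx]) ** 2)
    print(f"order {order:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")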

Figure 5.7: K-fold cross validation with K = 3.

5.5.2 K-fold cross-validation

In situations of data scarcity, it is reasonable to want to use all of the available data for
training the model. But this leaves us with no data for testing, which raises the possibil-
ity of undetected overfitting. K-fold cross-validation is a model assessment technique that
produces an unbiased estimate of model performance while allowing all of the data to be
used for training. Figure 5.7 illustrates the approach with K set to 3. The process begins
by splitting Dtrain into K equal parts. K − 1 of those parts constitute a training set D′train,
and the remainder, D′val, is known as the validation data. The “train” block in the figure
takes the training data as input, runs SGD or some other training algorithm, and returns an
optimal set of parameters θ̂1. This model is assessed by evaluating any of the aforementioned
performance metrics using the validation data. This produces an unbiased estimate of model
performance (ℓ1). It is unbiased because D′val was not seen by the training algorithm. The
process is repeated K times, each time with a different 1/K'th portion of Dtrain as the vali-
dation dataset. Finally, the K performance assessments ℓ1 . . . ℓK are averaged to obtain an
overall estimate:

    ℓ = (1/K) Σ_{k=1}^K ℓk                                           (5.18)

Figure 5.8: Hyper-parameter search.

The question that remains is which of the K models (i.e. which of h(x; θ̂1) through h(x; θ̂K))
should be kept. The answer is none. K-fold cross-validation is used only to estimate the
performance of the model. The model parameters themselves are obtained by running the
training algorithm one last time on the full dataset Dtrain.
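A minimal sketch of the procedure in Python (illustrative only; `fit` and `assess` are hypothetical stand-ins for the “train” block and the chosen performance metric, and the cubic-plus-MSE usage at the bottom is made up):

import numpy as np

def k_fold_cv(x, y, fit, assess, K=3, seed=0):
    # Estimate model performance with K-fold cross-validation (Eq. 5.18)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), K)
    scores = []
    for k in range(K):
        val = folds[k]                                               # validation fold
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        h = fit(x[train], y[train])                                  # "train" block returns a prediction function
        scores.append(assess(y[val], h(x[val])))                     # performance estimate for fold k
    return np.mean(scores)                                           # overall estimate

# Example usage with a cubic polynomial family and the MSE metric
fit_cubic = lambda xt, yt: np.poly1d(np.polyfit(xt, yt, deg=3))
mse = lambda y_true, y_hat: np.mean((y_true - y_hat) ** 2)

x = np.linspace(0.0, 2.0 * np.pi, 60)
y = np.sin(x) + np.random.default_rng(1).normal(0.0, 0.2, size=x.shape)
print(k_fold_cv(x, y, fit_cubic, mse, K=3))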

5.6 Hyper-parameters

A parameter p of a model is a hyper-parameter when it is not a decision variable of the
optimization problem (Eq. 5.7). This is usually because the loss function is not differentiable
with respect to p, or simply because including it would make the problem too difficult to
solve. The order of the polynomial in Figure 5.6 is an example of a hyper-parameter. We
cannot take a derivative of L with respect to the polynomial order because a) it is an integer,
and b) modifying the order changes the formula for h by adding new terms. This captures
the two main reasons why a parameter may be considered a hyper-parameter; either because
it is not real-valued or because it is a “structural” parameter of the model family itself. In
terms of notation, we will reserve θ for the tunable parameters, and collect the
hyper-parameters into a separate vector.

Figure 5.8 illustrates the general approach to optimizing hyper-parameters. The main
idea is to use a gradient-free optimization method, such as the ones introduced in Section ??.

The optimization method advances by suggesting new hyper-parameter values to evaluate. These are passed
to the “train” block, which runs SGD to obtain the parameter estimate θ̂ corresponding to
the given hyper-parameters. The result is then evaluated in the “assess” box using cross-validation, and the score
(ℓ) is returned to the search algorithm. This continues until an optimal hyper-parameter value is found.
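A rough, self-contained sketch of this loop in Python (illustrative only; the candidate polynomial orders and the simple grid search are assumptions, and np.polyfit plays the role of the “train” block):

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 2.0 * np.pi, 60)
y = np.sin(x) + rng.normal(0.0, 0.2, size=x.shape)

def cv_score(x, y, order, K=3):
    # K-fold cross-validation MSE for a polynomial of the given order
    folds = np.array_split(rng.permutation(len(x)), K)
    scores = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = np.poly1d(np.polyfit(x[train], y[train], deg=order))   # "train" block
        scores.append(np.mean((y[val] - model(x[val])) ** 2))          # "assess" block
    return np.mean(scores)

candidate_orders = [1, 2, 3, 5, 10]                                    # hyper-parameter grid
best_order = min(candidate_orders, key=lambda d: cv_score(x, y, d))    # gradient-free search
final_model = np.poly1d(np.polyfit(x, y, deg=best_order))              # retrain on the full dataset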

