0% found this document useful (0 votes)
137 views

Machine Learning Summarized Notes 1660762916

This document provides a summary of machine learning techniques for supervised and unsupervised learning. It covers classification algorithms like decision trees, ensemble methods, and support vector machines. It also discusses regression techniques like linear regression and neural networks. Unsupervised topics include randomized optimization, clustering, and expectation maximization. The document aims to serve as an unofficial guide to Georgia Tech's machine learning course.

Uploaded by

Arundas Purakkat
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
137 views

Machine Learning Summarized Notes 1660762916

This document provides a summary of machine learning techniques for supervised and unsupervised learning. It covers classification algorithms like decision trees, ensemble methods, and support vector machines. It also discusses regression techniques like linear regression and neural networks. Unsupervised topics include randomized optimization, clustering, and expectation maximization. The document aims to serve as an unofficial guide to Georgia Tech's machine learning course.

Uploaded by

Arundas Purakkat
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 111

Machine Learning

or: an Unofficial Guide to the Georgia Institute


of Technology’s CS7641: Machine Learning

George Kudrayvtsev
[email protected]

Last Updated: May 23, 2020

Creation of this guide was powered entirely by caffeine in


its many forms. If you found it useful and are gener-
ously looking to fuel my stimulant addiction, feel free to
shoot me a donation on Venmo @george_k_btw or PayPal
[email protected] with whatever this guide was
worth to you.
Happy studying!

I’m just a student, so I can’t really make any guarantees about the correctness of
my content. If you encounter typos; incorrect, misleading, or poorly-worded in-
formation; or simply want to contribute a better explanation or extend a section,
please raise an issue on my notes’ GitHub repository.

I Supervised Learning 5
0 Techniques 6

1 Classification 7
1.1 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.1 Getting Answers . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.2 Asking Questions: The ID3 Algorithm . . . . . . . . . . . . . 9

1
1.1.3 Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Ensemble Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.1 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.2 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.1 There are lines and there are lines. . . . . . . . . . . . . . . . . 19
1.3.2 Support Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3.3 Extending SVMs: The Kernel Trick . . . . . . . . . . . . . . . 21
1.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2 Regression 26
2.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.1 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.2 Sigmoids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.3 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.4 Biases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3 Instance-Based Learning . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.1 Nearest Neighbors . . . . . . . . . . . . . . . . . . . . . . . . 35

3 Computational Learning Theory 39


3.1 Learning to Learn: Interactions . . . . . . . . . . . . . . . . . . . . . 40
3.2 Space Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.1 Version Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.2 Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.3 PAC Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.4 Epsilon Exhaustion . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3 Infinite Hypothesis Spaces . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.1 Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.2 Vapnik-Chervonenkis Dimension . . . . . . . . . . . . . . . . . 47
3.4 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4.1 Entropy: Information Certainty . . . . . . . . . . . . . . . . . 49
3.4.2 Joint Entropy: Mutual Information . . . . . . . . . . . . . . . 50
3.4.3 Kullback-Leibler Divergence . . . . . . . . . . . . . . . . . . . 51

4 Bayesian Learning 52
4.1 Bayesian Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.1.1 Finding the Best Hypothesis . . . . . . . . . . . . . . . . . . . 54
4.1.2 Finding the Best Label . . . . . . . . . . . . . . . . . . . . . . 55
4.2 Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.1 Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.2 Making Inferences . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.3 Naïve Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

2
II Unsupervised Learning 62
5 Randomized Optimization 63
5.1 Hill Climbing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3.1 High-Level Algorithm . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.2 Cross-Over . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4 MIMIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4.1 High-Level Algorithm . . . . . . . . . . . . . . . . . . . . . . . 68
5.4.2 Estimating Distributions . . . . . . . . . . . . . . . . . . . . . 68
5.4.3 Practical Considerations . . . . . . . . . . . . . . . . . . . . . 70

6 Clustering 71
6.1 Single Linkage Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.1.1 Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2 k-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2.1 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2.2 Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3 Soft Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.3.1 Expectation Maximization . . . . . . . . . . . . . . . . . . . . 77
6.3.2 Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.4 Analyzing Clustering Algorithms . . . . . . . . . . . . . . . . . . . . 78
6.4.1 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.4.2 How Many Clusters? . . . . . . . . . . . . . . . . . . . . . . . 79

7 Features 80
7.1 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.1.1 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.1.2 Wrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.1.3 Describing Features . . . . . . . . . . . . . . . . . . . . . . . . 82
7.2 Feature Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.2.2 Principal Component Analysis . . . . . . . . . . . . . . . . . . 83
7.2.3 Independent Component Analysis . . . . . . . . . . . . . . . . 86
7.2.4 Alternatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

III Reinforcement Learning 88


8 Markov Decision Processes 89
8.1 Bellman Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.2 Finding Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

3
8.3 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

9 Game Theory 96
9.1 Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
9.1.1 Relaxation: Non-Determinism . . . . . . . . . . . . . . . . . . 97
9.1.2 Relaxation: Hidden Information . . . . . . . . . . . . . . . . . 98
9.1.3 Prisoner’s Dilemma . . . . . . . . . . . . . . . . . . . . . . . . 100
9.1.4 Nash Equilibrium . . . . . . . . . . . . . . . . . . . . . . . . . 101
9.1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
9.2 Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
9.2.1 Tit-for-Tat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
9.2.2 Folk Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
9.2.3 Pavlov’s Strategy . . . . . . . . . . . . . . . . . . . . . . . . . 105
9.3 Coming Full Circle . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
9.3.1 Example: Grid World . . . . . . . . . . . . . . . . . . . . . . . 106
9.3.2 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
9.3.3 Solving Stochastic Games . . . . . . . . . . . . . . . . . . . . 107

Index of Terms 110

4
PART I
Supervised Learning

ur first minicourse will dive into supervised learning, which is a school of


O machine learning that relies on human input (or “supervision”) to train a model.
Examples of supervised learning include anything that has to do with labelling,
and it occurs far more often than unsupervised learning. It’s often reduced down
to function approximation (think numpy’s polyfit, for example): given enough
predetermined pairs of (input, output)s, the trained model can eventually predict a
never-before-seen input with reasonable accuracy.
x f (x)
An elementary example of supervised learning would be a model
2 4
that “learns” that the dataset on the right represents f (x) =
9 81
x2 . Of course, there’s no guarantee that this data really does
4 16
represent f (x) = x2 . It certainly could just “look a lot like it.”
7 49
Thus, in supervised learning we will need to a basal assumption
...
about the world: that we have some well-behaved, consistent
function behind the data we’re seeing.

Contents
0 Techniques 6

1 Classification 7

2 Regression 26

3 Computational Learning Theory 39

4 Bayesian Learning 52

5
Techniques

upervised learning is typically broken up into two main schools of algorithms.


S Classification involves mapping between complex inputs (like image of faces)
and labels (like True or False, though they don’t necessarily need to be binary)
which we call “classes”. This is in contrast with regression, in which we map our
complex inputs to an arbitrary, often-continuous, often-numeric value (rather than a
discrete value that comes from a small set of labels). Classification leans more towards
data with discrete values, whereas regression is more-universal, being applicable to
any numeric values.
Though data is everything in machine learning, it isn’t perfect. Errors can come from
a variety of places:

• hardware (sensors, precision) • malicious intent (willful misrepre-


sentation)
• human element (mistakes)
• unmodeled influences

These hidden errors will factor into the


resulting model if they’re present in the
training data. Similarly, they’ll cause in-
accuracies in evaluation if they’re present
in the testing data. We need to be careful
about how accurately we “fit” the training
data, ideally keeping it general enough to
flourish on real-world data. A method for
reducing this risk of overfitting is called
cross-validation: we can use some of
the training data as a “fake” testing set. The “Goldilocks zone” of training is be-
tween underfitting and overfitting, where the error across both training data and
cross-validation data are relatively similar.

6
Classification

Science is the systematic classification of experience.


— George Henry Lewes, Physical Basis of Mind

reaking down a classification problem requires a number of important ele-


B ments:
• instances, representing the input data from which the overall model will “learn;”
• the concept, which is the abstract concept that the data represents (hopefully
representable by a well-formed function);
• a target concept, which is the “answer” we want: the ability to classify based
on our concept;
• the hypotheses are all of the possible functions (ideally, we can restrict our-
selves from literally all functions) we’re willing to entertain that may describe
our concept;
• some input samples pulled from our instances and paired (by someone who
“knows”) with the correct output;
• some candidate which is a potential target concept; and
• a testing set from our instances that our candidate concept has not yet seen
in order to evaluate how close it is to the ideal target concept.

1.1 Decision Trees


Decision trees are a form of classification learning. They are exactly what they Quinlan ‘86
sound like: trees of decisions, in which branches are particular questions (in which
each path down a branch represents a different answer to said question) based on the
input, and leaves are final decisions to be made. It maps various choices to diverging
paths that end with some decision.

7
CHAPTER 1: Classification

To create a decision tree for a concept, we need to identify pertinent features that
would describe it well. For example, if we wanted to decide whether or not to eat at
a restaurant, we could use the weather, particular cuisine, average cost, atmosphere,
or even occupancy level as features.
For a great example of “intelligence” being driven by a decision tree in popular culture,
consider the famous “character guessing” AI Akinator. For each yes-or-no question
it asks, there are branches the answers that lead down a tree of further questions
until it can make a confident guess. One could imagine the following (incredibly
oversimplified) tree in Akinator’s “brain:”

Does your character really exist?

Yes No

Is your characteran animal?

Does your character play a sport? Yes No

Yes No Pumba Is your character’sgender female?

Steph Curry . . . Yes No

Turanga Leela . . .
It’s important to note that decision trees are a representation of our features. Only
after we’ve formed a representation can we start talking about the algorithm that
will use the tree to make a decision.
The order in which we apply each feature to our decision tree should be correlated
with its ability to reduce our space. Just like Akinator divides the space of characters
in the world into fiction and non-fiction right off the bat, we should aim to start our
decision tree with questions whose answers can sweep away swaths of finer decision-
making. For our restaurant example, if we want to spend ≤ $10 no matter what, that
would eliminate a massive amount of restaurants immediately from the first question.

1.1.1 Getting Answers


The notion of a “best” question is obviously subjective, but we can make an attempt
to define it with a little more mathematical rigor. Taking some inspiration from
binary search, we could define a question as being good if it divides our data roughly
in half. Regardless of our final decision (heh) regarding the definition of “best,” the
algorithm is roughly the same:
1. Pick the “best” attribute.

Kudrayvtsev 8
MACHINE LEARNING Supervised Learning

2. Ask the question.


3. Follow the answer path.
4. If we haven’t hit a leaf, go to Step 1.
Of course the flaw is that we want to learn our decision tree based on the data. The
above algorithm is for using the tree to make decisions. How do we create the tree in
the first place? Do we need to search over the (massive) space of all possible decision
trees and use some criteria to filter out the best ones?
Given n boolean attributes, there are 2n possible ways to arrange the attributes,
and 22 possible answers (since there are 2n different decisions for each of those
n

arrangements). . . we probably want to be a little smarter than that.

1.1.2 Asking Questions: The ID3 Algorithm


If we approach the feature-ranking process greedily, a simple top-down approach ID3 Algorithm,
emerges: Udacity

• A ← best attribute
• Assign A as the decision attribute (the “question” we’re asking) for the particular
node n we’re working with (initially, this would be the tree’s root node).
• For each v ∈ A, create a branch from n.
• Lump the training examples that correspond to the particular attribute value,
v, to their respective branch.
• If the examples are perfectly classified with this arrangement (that is, we have
one training example per leaf), we can stop.
• Otherwise, repeat this process on each of these branches.
The “information gain” from a particular attribute A can be a good metric for quali-
fying attributes. Namely, we measure how much the attribute can reduce the overall
entropy:
X |Sv |
Gain(S, A) = Entropy(S) − · Entropy(Sv ) (1.1)
v∈A
|S|
where the entropy is calculated based on the probability of seeing values in A:
X
Entropy(A) = − Pr [v] · log2 Pr [v] (1.2)
v∈A

These concepts come from Information Theory which we’ll dive into more when we
discuss randomized optimization algorithms in chapter 5; for now, we should just
think of this as a measure of how much information an attribute gives us about a

Kudrayvtsev 9
CHAPTER 1: Classification

system. Attributes that give a lot of information are more valuable, and should thus
be higher on the decision tree. Then, the “best attribute” is the one that gives us the
maximum information gain: max Gain(S, A).
A∈A

Inductive Bias
There are two kinds of bias we need to worry about when designing any classifier:
• restriction bias, which automatically occurs when we decide our hypothesis
set, H. In this case, our bias comes from the fact that we’re only considering
functions that can be represented with a decision tree.
• preference bias, which tells us what sort of hypotheses from our hypothesis
set, h ∈ H, we prefer.
The latter of these is at the heart of inductive bias. Which decision trees—out of all of
the possible decision trees in the universe that can represent our target concept—will
the ID3 algorithm prefer?

Splits Since it’s greedily choosing the attributes with the most information gain
from the top down, we can confidently say that it will prefer trees with good
splits at the top.

Correctness Critically, the ID3 algorithm repeats until the labels are correctly
classified. And though it may be obvious, it’s still important to note that it will
hence prefer correct decision trees to incorrect ones.

Depth This arises naturally out of the top-heavy split preference, but again, it’s
still worth noting that ID3 will prefer trees that are shallower or “shorter.”

1.1.3 Considerations
Asking (Continuous) Questions
The careful reader may have noticed the explicit mention of branching on attributes
based on every possible value of an attribute: v ∈ A. This is infeasible for many
features, especially continuous ones. For our earlier restaurant example, we may
have discretized our “cost” feature into one, two, or three dollar signs (à la Yelp), but
what if we wanted to keep them as a raw average dish dollar value instead?
Well, if we’re sticking to the “only ask Boolean questions” model, then binning is
a viable approach. Instead of making decisions based on a precise cost, we instead
make decisions based on a place being “cheap,” which we might subjectively define as
cost ∈ [0, 10), for example.

Kudrayvtsev 10
MACHINE LEARNING Supervised Learning

Repeating Attributes

Does it make sense to ask about an attribute more than once down its branch? That
is, if we ask about cost somewhere down a path, can (or should) we ask again, later?

Is the average cost ≤ $10?

Yes No

Is the average cost ≤ $20?

Is it Mongolian food? Yes No

Yes No ... ...

Just eat in. . .

Is it crowded?

Yes No

Is the average cost ≤ $5? Go!

...
With our “proof by example,” it’s pretty apparent that the answer is “yes, it’s ac-
ceptable to ask about the same attribute twice.” However, it really depends on the
attribute. For example, we wouldn’t want to ask about the weather twice, since the
weather will be constant throughout the duration of the decision-making process.
With our bucketed continuous values (cost, age, etc.), though, it does make sense to
potentially refine our buckets as we go further down a branch.

Stopping Point

The ID3 algorithm tells us to stop creating our decision tree when all of our training
examples are classified correctly. That’s. . . a lot of leaf nodes. . . It may actually be
pretty problematic to refine the decision tree to such a point: when we leave our
training set, there may be examples that don’t fall into an exact leaf. There may also
be examples that have identical features but actually have a different outcome; when
we’re talking about restaurant choices, opinions may differ:
Weather Cost Cuisine Go?
Alice: Cloudy $ Mexican 
Bob: Cloudy $ Mexican ×

Kudrayvtsev 11
CHAPTER 1: Classification

If both of these rows were in our training set, we’d actually get an infinite loop in
the naïve ID3 algorithm: it’s impossible to classify every example correctly. It makes
sense to adopt a termination approach that is a little more general and robust. We
want to avoid overfitting our training examples!
If we bubble up the decisions down the branch of a tree back up to its parent node,
then prune the branch entirely, we can avoid overfitting. Of course, we’d need to
make sure that the generalized decision does not increase our training error by too
much. For example, if a cost-based branch had all of its children branches based on
weather, and all but one of those resulted in the go-ahead to eat, we could generalize
and say that we should always eat for the cost branch without incurring a very large
penalty for the one specific “don’t eat” case.

Adapting to Regression
Decision trees as we’ve defined them here don’t transfer directly to regression prob-
lems. We no longer have a useful notion of information gain, so our approach at at-
tribute sorting falls through. Instead, we can rely on purely statistical methods (like
variance and correlation) to determine how important an attribute is. For leaves, too,
we can do averages, local linear fit, or a host of other approaches that mathematically
generalize with no regard for the “meaning” of the data.

1.2 Ensemble Learning


An ensemble is a fancy word for a collective that works together. In this section,
we’re going to discuss combining learners into groups who will each contribute to the
hypothesis individually and essentially corroborate towards the answer.
Ensemble learning is powerful when there are features that may slightly be indicative
of a result on their own, but definitely inconclusive, whereas a combination of some of
the rules is far more conclusive. In essence, when the whole is greater than the sum of
its parts. For example, it’s hard to consider an email containing the word “money” as
spam, but containing a sketchy URL, the word “money,” and having mispelled words
might be far more indicative.
The general approach to ensemble learning algorithms is to learn rules over smaller
subsets of the training data, then combine all of the rules into a collective, smarter
decision-maker. A particular rule might apply well to a subset (such as a bunch of
spam emails all containing the word “Viagra”), but might not be as prevalent in the
whole; hence, each weak learner picks up simple rules that, when combined with the
other learners, can make more-complex inferences about the overall dataset.
This approach begs a few critical questions: we need to determine how to pick
subsets and how to combine learners.

Kudrayvtsev 12
MACHINE LEARNING Supervised Learning

1.2.1 Bagging
Turns out, simply making choosing data uniformally randomly to form our subset
(with replacement) works pretty well. Similarly-simply, combining the results with
an average also works well. This technique is called boostrap aggregation, or
more-commonly bagging.
The reason why taking the average of a set of weak learners trained on subsets of
the data can outperform a single learner trained on the entire dataset is because of
overfitting, our mortal fear in machine learning. Overfitting a subset will not overfit
the overall dataset, and the average will “smooth out” the specifics of each individual
learner.

1.2.2 Boosting
We must be able to pick subsets of the data a little more cleverly than randomly,
right? The basic idea behind boosting is to prefer data that we’re not good at
analyzing. We essentially craft learners that are specifically catered towards data
that previous learners struggled with in order to form a cohesive picture of the entire
dataset.
Initially, all training examples are weighed equally. Then, in each “boosting round,”
we find the weak learner that achieves the lowest error. Following that, we raise the
weights of the training examples that it misclassified. In essence, we say, “learn these
better next time.” Finally, we combine the weak learners from each step into our final
learner with a simple weighted average: weight is directly proportional to accuracy.

(a) Our set of training examples (b) Reweighing the error examples (c) Final classifier is a combination
and the first boundary guess. and trying another boundary. of the weak learners.

Figure 1.1: Iteratively applying weak learners to differentiate between the red
and blue classes while boosting mistakes in each round.

Let’s define this notion of a learner’s “error” a little more rigorously. Previously, when
we pulled from the training set with uniform randomness, this was easy to define.
The number of mismatches M from our model out of the N -element subset meant an
error of M/N . However, since we’re now weighing certain training examples differently
(incorrect =⇒ likelier to get sampled), our error is likewise different. Shouldn’t we
punish an incorrect result on a data point that we are intentionally trying to learn
more-so than an incorrect result that a dozen other learners got correct?

Kudrayvtsev 13
CHAPTER 1: Classification

Example 1.1: Understanding Check: Training Error

Suppose our training subset is just 4 values, and our weak learner H(x) got
two of them correct:
x1 x2 x3 x4
×  × 
What’s our training error? Trivially 1/2, you might say. Sure, but what if the
probability of each xi being chosen for this subset was different? Suppose
x1 x2 x3 x4
×  × 
D: 1/2 1/20 2/5 1/20

Now what’s our error? Well getting x1 wrong is a way bigger deal now, isn’t
it? It barely matters that we got x2 and x4 correct. . . So we need to weigh
each incorrect answer accordingly:

1 2 1 1 9
ε= + =1− − =
|2 {z 5} |20 {z 20} 10
incorrects corrects

To drive the point in the above example home, we’ll now formally define our error as
the probability of a learner H not getting a data point xi correct over the distribution
of xi s. That is,
εi = Pr [H(xi ) 6= yi ]
D

Our notion of a weak learner—a term we’ve been using so far to refer to a learner
that does well on a subset of the training data—can now likewise be formally defined:
it’s a learner that performs better than chance for any distribution of data (where ε
is a number very close to zero):

∀D : Pr [·] ≤ 1/2 − ε
D

Note the implication here: if there is any distribution for which a set of hypotheses
can’t do better than random chance, there’s no way to create a weak learner from
those hypotheses. That makes this a pretty strong condition, actually, since you need
a lot of good hypotheses to cover the various distributions.
Boosting at a high level can be broken down into a simple loop: on iteration t,
• construct a distribution Dt , and
• find a weak classifier Ht (x) that minimizes the error over it.
Then after the loop, combine the weak classifiers into a stronger one.

Kudrayvtsev 14
MACHINE LEARNING Supervised Learning

Specific boosting algorithms vary in how they perform these steps, but a well-known
one is the adaptive boosting (or AdaBoost) algorithm outlined in algorithm 1.1.
Let’s dig into its guts.

AdaBoost
From a very high level, this algorithm follows a very human approach for learning:
The better we’re doing overall, the more we should focus on individual mistakes.

We return to the world of classification, where our training set maps from a feature Freund &
vector to a “correct” or “incorrect” label, yi ∈ {−1, 1}: Schapire, ‘99

X = {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )}

Determining Distribution We start with a uniform distribution, so D1 (i) = 1/n.


Then, the probability of a sample in the next distribution is weighed by its “correct-
ness”:
Dt (i) 1 1 − εt
Dt+1 (i) = · exp(−αt yi Ht (xi )) where αt = ln
zt 2 εt

Uh, scrrrrrr. . . let’s break this down piece by piece.


The leading fraction is the previous probability scaled by a normalization factor zt
that is necessary to keep Dt+1 a proper probability distribution, so we can basically
ignore it.1
Notice the clever trick here given our values for yi : when the classification is correct,
yi = Ht (xi ) and their product is 1; when it’s incorrect, their product is always −1.
Because α > 0 (shown later in the detailed math aside), the −α will always flip the
sign of the “correctness” result. Thus, we’re doing e−α when the learner agrees and eα
otherwise. A negative exponential is a fraction, so we should expect the whole term
to make Dt (i) decrease when we get it right and increase when we get it wrong:
(
↑ if H(·) is incorrect
Dt+1 (i) =⇒
↓ otherwise

On each iteration, our probability distribution adjusts to make D favor incorrect


answers so that our classifier H can learn them better on the next round; it weighs
incorrect results more and more as the overall model performance increases.

1
We don’t dive into this in the main text for brevity, but here zt would be the sum of the pre-
normalized weights. In an algorithm (like in algorithm
P 1.1), you might first calculate a D0t+1 that
0
didn’t divide any terms by zt , then calculate z = d∈D d and do Dt+1 (i) = Dt+1 (i)/z at the end.

Kudrayvtsev 15
CHAPTER 1: Classification

Deng, ‘07 Finding the Weak Classifier Notice that we kind-of glossed over determining
Ht (x) for any given round of boosting. Weak learners encompass a large class of
learners; a simple decision tree could be a weak learner. All it has to do is guarantee
performance that is slightly better than random chance.

Final Hypothesis As you can see in algorithm 1.1, the final classifier is just a
weighted average of the individual weak learners, where the weight of a learner is its
respective α. And remember, αt is in terms of εt , so it measures how well the tth
round went overall; thus, a good round is weighed more than a bad round.
The beauty of ensemble learning is that you can combine many simple weak classifiers
that individually hardly do better than chance together into a final classifier that
performs really well.

Quick Maffs: Boosting, Beyond Intuition

Let’s take a deeper look at the definition of Dt+1 (i) to understand how
the probability of the ith sample changes. Recall that in our equation, the
exponential simplifies to e∓α depending on whether H(·) guesses yi correctly
or incorrectly, respectively.

Dt (i) 1 1 − εt
Dt+1 (i) = · exp(−αt yi Ht (xi )) where αt = ln
zt | {z } 2 εt
e∓α

Let’s look at what eα simplifies to:


 
α 1 1−ε
e = exp · ln
2 ε
"   1 !#
1−ε 2
= exp ln power rule of logarithms
ε
r
1−ε recall that ln x = loge x
= and aloga n = n
ε

(it should be obvious now why we said α ≥ 0)

We can follow the same reasoning for e−α and get a flipped result:

1 − ε − /2
    1  1/2 r
−α 1 1−ε ε ε
e = exp − · ln = = =
2 ε ε 1−ε 1−ε

So we have two general outcomes depending on whether or not the weak

Kudrayvtsev 16
MACHINE LEARNING Supervised Learning


learner classified xi correctly (dropping the · for simplicity of analysis):
1−ε


 if H(·) was wrong
2
f (ε) = exp (−αt yi Ht (xi )) = ε
ε

 if H(·) was right
1−ε

Remember that εt is the total error of all incorrect answers that Ht gave; it’s
a sum of probabilities, so 0 < ε < 1. But note that H is a weak learner, so
it must do better than random chance, so in fact 0 < ε < 21 . The functions
are plotted in Figure 1.2; from them, we can draw some straightforward
conclusions:
5

3
f (ε)

2
right
1
wrong

0
0 0.2 0.4 0.6
ε

Figure 1.2: The two ways the exponential can go when boosting,
depending on whether or not the classifier gets sample i right ( 1−ε
ε ,
ε
in red) or wrong ( 1−ε , in blue).

• When our learner is incorrect, the weight of sample i increases ex-


ponentially as we get more and more confident (ε → 0) in our model.
• When our learner is correct, the weight of sample i will decrease
(relatively) proportionally with our overall confidence.

To summarize these two points in different words: a learner will always


weigh incorrect results more than correct results, but will focus on incorrect
results more and more as the learner’s overall performance increases.

Considerations: Overfitting

Boosting is a robust method that tries very hard to avoid overfitting. Testing per-
formance often mimics training performance pretty closely. Why is this the case?

Kudrayvtsev 17
CHAPTER 1: Classification

Though we’ve been discussing error at length up to this point, and how minimizing
error has been our overarching goal, it’s worth discussing the idea of confidence, as
well. If, for example, you had a 5-nearest neighbor in which three neighbors strongly
voted one way and two neighbors voted another way, you might have low error but
also low confidence, relative to a scenario where all five voted the same way.
Because boosting is insecure and suffers from social anxiety, it tries really hard to
be confident and this lets it avoid overfitting. Once a boosted ensemble reaches a
state at which it has low error and can separate positive and negative examples well,
adding more and more weak learners will actually continue to spread the gap (similar
to support vector machines which we’re about to discuss with respect to margins) at
the boundary.

−1 0 +1 −1 0 +1

However, nobody’s perfect. Boosting still tends to overfit in an oddly-specific case


Dr. Isbell highlights: when its underlying weak learner uses a complex artificial neural
network (one with many nodes and layers). More generally, if the underlying learners
all overfit and can’t stop overfitting (like a complex neural net tends to do), boosting
can’t do anything to prevent that. Boosting also suffers under pink noise—uniform
noise—and tends to overfit.2

1.3 Support Vector Machines


Let’s return to the notion of a dataset being linearly-separable. From the standpoint
of human cognition, finding a line that cleanly divides two colors is pretty easy. In
general when we’re trying to classify something into one category or another, there’s
no reason for us to consider anything but the specific details that truly separate the
two categories. We hardly pay attention to the bulk of the data, focusing on what
defines the boundary between them.
This is the motivation behind support vector machines.

I covered SVMs briefly in my notes on computer vision. For a gentler


introduction to the topic, refer there. This section will have a lot more
assumed knowledge so that I can avoid repeating myself.

Now, which of the green dashed lines below “best” separates the two colors?

2
“White noise” is Gaussian noise.

Kudrayvtsev 18
MACHINE LEARNING Supervised Learning

They’re all correct, but why does the middle one “feel” best? Aesthetics? Maybe.
But more likely, it’s because the middle line does the best job at separating the data
without making us commit too much to it. It leaves the biggest margin of potential
error if some hidden dots got revealed.
A support vector machine operates on this exact notion: it tries find the boundary
that will maximize the margin from the nearest data points. The optimal margin lines
will always have some special points that intersect the dashed lines:

These points are called the support vectors. Just like a human, a support vector
machine reduces computational complexity by focusing on examples near the bound-
aries rather than the entire data set. So how can we use these vectors to maximize
the margin?

1.3.1 There are lines and there are lines. . .


Let’s briefly note that a line in 2D is typically represented in the form y = mx + b.
However, we want to generalize to n dimensions.
The standard form of a line in 2D is defined as: ax + by + c = 0, where a, b, c ∈ Z and
a > 0. From this, we can imagine a “compact” representation of the line that only
uses its constants, so if we want the original equation back, we can dot this vector

Kudrayvtsev 19
CHAPTER 1: Classification

with x y 1 :
 
   
a b c · x y 1 =0

Thus if we let s = a b c and w = x y 1 , then our line can be expressed simply


   

as: wT s = 0. This lets us representing a line in vector form and use any number of
dimensions. Our w defines the parameters of the (hyper)plane; notice that here,
w ⊥ the xy-plane.
If we want to stay reminiscent of y = mx + b, we can drop the last term of w and use
the raw constant: wT s + c = 0.

1.3.2 Support Vectors


Similar to what we did with boosting, we’ll say that when y ≥ 1, the input value x
was part of the class, whereas if y ≤ −1 it wasn’t (take care to differentiate the fact
that y is the label now rather than something related to the y axis). Then a line in
our “label space” is of the form y = wT x + b.
What’s the output, then, of the line that divides the two classes? Well it’s exactly
between all of the 1s and the -1s, so it must be the line wT x + b = 0. Similarly, the
two decision boundaries are exactly ±1. Note that we don’t know w or b yet, but
know we want to maximize d:

wT x + b = 1
wT x + b = 0
wT x + b = −1

Well if we call the two green support vectors above x1 and x2 , what’s the distance
between them? Well,

wT x1 + b = 1
−(wT x2 + b = −1)

wT (x1 − x2 ) = 2
2
ŵT (x1 − x2 ) =
kwk

Kudrayvtsev 20
MACHINE LEARNING Supervised Learning

Thus we want to maximize M = kwk 2


(M for margin), while also classifying all of our
data points correctly. Mathematically-speaking, maximization is harder than mini-
mization (thank u calculus), so we’re actually better off optimizing for min 12 kwk2 .
So if we define yi ∈ {−1, 1} for every training sample as we did when Boosting, then
we arrive at a standard quadratic optimization problem:

1
Minimize: kwk2
2
Subject to: yi (wT xi + b) ≥ 1

which is a well-understood, always-solveable optimization problem whose solution is


just a linear combination of the support vectors:
X
w= α i y i xi
i

where the αi s are “learned weights” that are only non-zero at the support vectors.
Any support vector i is defined by yi = wT xi + b, so:
X
yi = αi yi xTi x + b = ±1
i
| {z }
wT
!
X
We can use this to build our classification function: f (x) = sign αi yi xTi x + b
i

Note the highlighted box: the entirety of the classification depends only on this dot
product between some “new point” x and our support vectors xi s. The dot product
is a metric of similarity: here, it’s the projection of each xi onto the new x, but it
doesn’t have to be. . .

1.3.3 Extending SVMs: The Kernel Trick


We’ve been working with nice, neat 2D plots and drawing a line to separate the data.
This begs the question: what if the data isn’t linearly-separable? The answer to
this question is the source of power of SVMs: we’ll use it to find separation boundaries
between our data points in higher dimensions than our features provide.
Obviously, we can find the optimal separator between the following group of points:

0
x

But what about these?

Kudrayvtsev 21
CHAPTER 1: Classification

0
x

No such luck this time. But what if we mapped them to a higher-dimensional space?
For example, if we map these to y = x2 , a wild linear separator appears!

x2

Figure 1.3: Finding a linear separator by mapping to a higher-dimensional


space.

This seems promising. . . how can we find such a mapping (like the arbitrary x 7→
x2 above) for other feature spaces? Notice that we added a dimension simply by
manipulating the representation of our features.
Let’s generalize this idea. We can call our mapping function Φ; it maps the xs in
our feature space to another higher-dimensional space ϕ(x), so Φ : x 7→ ϕ(x). And
recall the “similarity metric” in our classification function; let’s isolate it and define
K(a, b) = aT b. Then, we have
!
X
f (x) = sign αi yi K(xi , x) + b
i

The kernel trick here is simple: it states that if there exists some Φ(·) that can
represent K(·) as a dot product, we can actually use K in our linear classifier. We
don’t actually need to find, define, or care about Φ, it just needs to exists. And in
practice, it does exist for almost any function we can think of (though I won’t offer
an explanation on how). That means we can apply almost any K and it’ll work out.
For example, if our 2D dataset was separable by a circular boundary, we could use
K(a, b) = (aT b)2 in our classifier and it would actually find the boundary, despite it
not being linearly-separable in two dimensions.
Soak that in for moment. . . we can almost-arbitrarily define a K that represents our
data in a different way and it’ll still find a boundary that just so happens to be linear
in a higher-dimensional space.

Kudrayvtsev 22
MACHINE LEARNING Supervised Learning

Example 1.2: A Simple Polynomial Kernel Function

Let’s work through a proof that a particular kernel function is a dot product
in some higher-dimensional space. Remember, we don’t actually care about
what that space is when it comes to applying the kernel; that’s the beauty of
the kernel trick. We’re working through this to demonstrate how you would
show that some kernel function does have a higher-dimensional mapping.
 
x
We have 2D vectors, so x = 1 .
x2

Let define the following kernel function: K(xi , xj ) = (1 + xTi xj )2


This is a simple polynomial kernel; it’s called that because we are creating
a polynomial from the dot product. To prove that this is a valid kernel
function, we need to show that K(xi , xj ) = ϕ(xj )T ϕ(xj ) for some ϕ.

K(xi , xj ) = (1 + xTi xj )2
     
  xj1   xj1
= 1 + xi1 xi2 1 + xi1 xi2 expand
xj2 xj2
= 1 + x2i1 x2ij + 2xi1 xj1 xi2 xj2 + x2i2 x2j2 + 2xi1 xj1 + 2xi2 xj2 multiply
it all out
 
1
 x2 
√ j1 
√ √ √   2xj1 xj2 

rewrite it as a
= 1 x2i1 2xi1 xi2 x2i2

2xi1 2xi2 

 √x2j2  vector product

 
 √2xj1 
2xj2

At this point, we can see something magical and crucially important: each
of the vectors only relies on terms from either xi or xj ! That means it’s
a. . . wait for it. . . dot product! We can define ϕ as a mapping into this new
6-dimensional space:
√ √ √ T
ϕ(x) = 1 x21 2x1 xn2 x22

2x1 2x2

Which means now we can express K in terms of dot products in ϕ, exactly


as we wanted:
K(xi , xj ) = ϕ(xi )T ϕ(xj ) 

The choice of K(·) is much like the choice of d(·) in k-nearest neighbor: it encodes
domain knowledge about the data in question that can help us classify it better.
Some common kernels include polynomial kernels (like the one in the example) and

Kudrayvtsev 23
CHAPTER 1: Classification

the radial basis kernel which is essentially a Gaussian:

K(a, b) = . . .
= (aT b + c)p (polynomial)
!
ka − bk2
= exp − (radial basis)
2σ 2

1.3.4 Summary
In general, we’ve come to the conclusion that finding the linear separator of a dataset
with the maximum margin is a good way to generalize and avoid overfitting. SVMs
do this with a quadratic optimization problem to find the support vectors. Finally,
we discussed how the kernel trick lets us find these linear separators in an arbitrary
n-dimensional space by projecting the data to said space.

Kudrayvtsev 24
MACHINE LEARNING Supervised Learning

Algorithm 1.1: A simplified AdaBoost algorithm. Note that yi ∈ {−1, 1},


so we’re working with classification here, but it can just as easily be adapted to
regression. (this algorithm comes from my notes on computer vision)

Input: X = {x1 , x2 , . . . , xm } and Y = {y1 , y2 , . . . , ym }, a training set mapping


both positive and negative training examples to their corresponding
labels: xi 7→ yi .
Input: H(·): a weak classifier type.
Result: A boosted classifier, H ∗ .
 
ŵ = 1/m 1/m . . . // a uniform weight distribution
t≈0 // some small threshold value close to 0
foreach training stage j ∈ [1..n] do
ŵ = w/kwk
hj = H(X , Y, ŵ) // train a weak learner for the current weights
/* The error is the sum of the incorrect training predictions. */
∀wi ∈ ŵ where hj (xi ) 6= yi
PM
εj = wi 
i=0
1 1−ε
αj = 2
ln εj j
/* Update the weights only if the error is large enough. */
if ε > t then
wi = wi · exp (−yi αj hj (xi )) ∀wi ∈ w
else
break
end
/* The final boosted classifier is the sum of each hj weighed by its
corresponding αj . Prediction on an new x is then simply: */
h PM i
H ∗ (x) := sign j=0 αj hj (x)
return H ∗

Kudrayvtsev 25
Regression

We stand the risk of regression, because you refused to take risks.


So life demands risks.
— Sunday Adelaja

hen we free ourselves from the limitation of small discrete quantities of la-
W bels, we are open to approximate a much larger range of functions. Machine
learning techniques that use regression can approximate real-valued and con-
tinuous functions.
In supervised learning, we are trying to perform inductive reasoning, in which we try
to figure out the abstract bigger picture from tiny snapshots of the world in which
we don’t know most of the rules (that is, approximate a function from input-output
pairs). This is in stark contrast with deductive reasoning, through which individual
facts are combined through strict, rigorous logic to come to bigger conclusions (think
“if this, then that”).

2.1 Linear Regression


You’ve likely heard of linear regres- 3
sion if you’re reading these notes. Lin-
ear regression is the mathematical pro- 2 e2
cess of acquiring the “line-of-best fit”
y

e1
that we used to plot in grade school. 1 e0
Through the power of linear algebra,
the line of best fit can be rigorously de- 0
fined by solving a linear system. 0 1 2 3
x
One way to find this line is to find the
sum of least squared-error between the Figure 2.1: A set of points with “no solution”:
points and the chosen line; more specif- no line passes through all of them. The set of
errors is plotted in red: (e0 , e1 , e2 ).
ically, a visual demonstration can show

26
MACHINE LEARNING Supervised Learning

us this is the same as minimizing the projection error of the points on the line.
Suppose we have a set of points that don’t exactly fit a line: {(1, 1), (2, 1), (3, 2)},
plotted in Figure 2.1. We want to find the best possible line y = mx + b that mini-
mizes the total error. This corresponds to solving the following system of equations,
forming y = Ax:

1 = b + m · 1
   
 1 1 1  
m
1=b+m·2 =⇒ 1 = 1 2
b
2 1 3

2=b+m·3

The lack of an exact solution to this system (algebraically) means that the vector
of y-values isn’t in the column space of A, or: y ∈ / C(A). The vector can’t be
represented by a linear combination of column vectors in A.
We can imagine the column space as a plane in xyz-
space, and y existing outside of it; then, the vector
that’d be within the column space is the projection y
of y into the column space plane: p = proj y. This
C(A)
is the closest possible vector in the column space e
C(A)
to y, which is exactly the distance we were trying p
to minimize! Thus,

e=y−p Figure 2.2: The vector y rela-


tive to the column space of A, and
its projection p onto the column
The projection isn’t super convenient to calculate space.
or determine, though. Through algebraic manip-
ulation, calculus, and other magic, we learn that the way to find the least squares
approximation of the solution is:

AT Ax∗ = AT y
x∗ = (AT A)−1 AT y
| {z }
pseudoinverse

which is exactly what the linear regression algorithm calculates.

More Resources. This section basically summarizes and synthesizes this Khan
Academy video, this lecture from the Computer Vision course (which goes through
the full derivation), this section of Introduction to Linear Algebra, and this explana-
tion from NYU. These links are provided in order of clarity.

Kudrayvtsev 27
CHAPTER 2: Regression

2.2 Neural Networks


Let’s dive right into the deep end and learn about how a neural network works.
Neurons in the brain can be described relatively succinctly from a high level: a single
neuron is connected to a bunch of other neurons. It accumulates energy and once the
amount is bigger than the “activation threshold,” the neuron fires, sending energy
to the neurons its connected to (potentially causing a cascade of firings). Artificial
neural networks take inspiration from biology and are somewhat analogous to neuron
interactions in the brain, but there’s really not much benefit to looking at the analogy
further.
The basic model of an “artificial neuron” is a function powered by a series of inputs
xi , and weights wi , that somehow run through F and produce an output y:

x1 w1

w2 activation
x2 function F , and y
w3 threshold θ

x3

Typically, the activation function “fires” based on a firing threshold θ.

2.2.1 Perceptron
The simplest, most fundamental activation function procudes a binary output (so
y ∈ {0, 1}) based on the weighted sum of the inputs:

(
1 if
Pn
i=1 wi xi ≥ θ
F (x1 , x2 , . . . , xn , w1 , w2 , . . . , wn ) =
0 otherwise

This is called a perceptron, and it’s the foundational building block of neural net-
works going back to the 1950s.

Kudrayvtsev 28
MACHINE LEARNING Supervised Learning

Example 2.1: Basic Perceptron

Let’s quickly validate our understanding. Given the following input state:
1
x1 = 1, w1 =
2
3
x2 = 0, w2 =
5
x3 = −1.5, w3 = 1

and the firing threshold θ = 0, what’s the output?


Well, pretty simply, we get
     
1 3 1 3
y= 1 +0 + (−1.5) · 1 = − = −1 < 0 = 0
2 5 2 2

The perceptron doesn’t fire!

What kind of functions can we represent with a perceptron? Suppose we have two
inputs, x1 and x2 , along with equal weights w1 = w2 = 21 , θ = 34 . For what values of
xi will we get an activation?
Well, we know that if x2 = 0, then we’ll get an activation for anything that makes
x1 w1 ≥ θ, so x1 ≥ θ/w1 ≥ 1.5. The same rationale applies for x2 , and since we know
that the inequality is linear, we can just connect the dots:
2
x2

x1
1 2

Thus, a perceptron is a linear function (each wi xi term is linear), and so it can be


used to compute the half-plane boundary between its inputs.
This very example actually computes an interesting function: if the inputs are binary
(that is, x1 , x2 ∈ {0, 1}), then this actually computes binary AND operation. The
only possible outputs are marked accordingly; only when x1 = x2 = 1 does y = 1!
We can actually model OR the same way with different weights (like w1 = w2 = 32 ).

Kudrayvtsev 29
CHAPTER 2: Regression

2 2
x2 x2

1 1

x1 x1
1 2 1 2

Note that we “derived” OR by adjusting w1,2 until it worked, though we could’ve also
adjusted θ. This might trigger a small “aha!” moment if the idea of induction from
stuck with you: if we have some known input/output pairs (like the truth tables
for the binary operators), then we can use a computer to rapidly guess-and-check
the weights of a perceptron (or perhaps an entire neural network. . . ?) until they
accurately describe the training pairs as expected.

x y x⊕y
Combining Perceptrons
1 1 0
To build up an intuition for how perceptrons can be com- 1 0 1
bined, let’s binary XOR. It can’t be described by a single 0 1 1
perceptron because it’s not a linear function; however, 0 0 0
it can be described by several!
Table 2.1: The truth table
Intuitively, XOR is like OR, except when the inputs succeed for XOR.
under AND. . . so we might imagine that XOR ≈ OR−AND.
So if x1 and x2 are both “on,” we should take away the result of the AND perceptron so
that we fall under the activation threshold. However, if only one of them is “on,” we
can proceed as normal. Note that the “deactivation weight” needs to be equal to the
sum of the other weights in order to effectively cancel them out, as shown Figure 2.3.

Learning Perceptrons
Let’s delve further into the notion of “training” a perceptron network that we alluded
to earlier: twiddling the wi s and θs to fit some known inputs and outputs. There are
two approaches to this: the perceptron rule, which operates on the post-activated
y-values, and gradient descent, which operates on the raw summation.

Perceptron Rule Suppose ŷ is our perceptron’s current output, while y is its


desired output. In order to “move towards” the goal y, we adjust our wi s accordingly

Kudrayvtsev 30
MACHINE LEARNING Supervised Learning

w1 = 1
x1 x2 AND y
x1 1 1 1 1 · 1 + −2 · 1 + 1 · 1 = 0
w2 = −2 XOR
AND y 1 0 0 1 · 1 + −2 · 0 + 1 · 0 = 1
θ=1
x2 0 1 0 1 · 0 + −2 · 0 + 1 · 1 = 1
0 0 0 1 · 0 + −2 · 0 + 1 · 0 = 0
w3 = 1

Figure 2.3: A neural network modeling XOR and its summations for all possi-
ble bit inputs. The final column in the table is the summation expression for
perceptron activation, w1 x1 + w2 FAND (x1 , x2 ) + w3 x2 .

based on the “error” between y and ŷ. That is,


define ŷ = ( i wi xi ≥ 0)
P

then use ∆wi = η (y − ŷ) xi


to adjust wi ← wi + ∆wi
where η > 0 is a learning rate which influences the size of the ∆wi adjustment made
every iteration.
Notice the absense of the activation threshold, θ. To simplify the math, we can
actually treat it as part of the summation: a “fake” input with a fixed weight of −1,
since:
Pn Pn
i wi xi = θ =⇒ i w i xi − θ = 0
where wn+1 = −1 and xn+1 = θ
Pn+1
=⇒ i w i xi = 0
This extra input value is now called the bias and its weight can be tweaked just like
the other inputs.
Notice that when our output is correct, y − ŷ = 0 so there is no effect on the weights.
If ŷ is too big, then our ∆wi < 0, making that wi xi term smaller, and vice-versa if ŷ
is too small. The learning rate controls our adjustments so that we take small steps
in the right direction rather than overshooting, since we don’t know how close we are
to fixing ŷ.

Theorem 2.1 (Perceptron Rule). If a dataset is linearly-separable (that is,


able to be separated by a line), then a perceptron can find it with a finite number
of iterations by using the perceptron rule.

Of course, it’s impossible to know whether a sufficiently-large “finite” number of a


iterations has occurred, so we can’t use this fact to determine whether or not a
dataset is linearly-separable, but it’s still good to know that it will terminate when
it can.

Kudrayvtsev 31
CHAPTER 2: Regression

Gradient Descent We still need something in our toolkit for datasets that aren’t
linearly-separable. This time, we’ll operate on the unthresholded summation since
it gives us far more information about how close (or far) we P
are from triggering an
activation. We’ll use a as shorthand for the summation: a = i wi xi .
Then we can define an error metric based on the difference between a and each
expected output: we sum the error for each of our input/output pairs in the dataset
D. That is,
1 X
E(w) = (y − a)2
2
(x,y)∈D

Let’s use our good old friend calculus to solve this via gradient descent. A function
is at its minimum when its derivative is zero, so we’ll take the derivative with respect
to a weight:
 
∂E ∂ 1 X
= (y − a)2 
∂wi ∂wi 2
(x,y)∈D
" #
X ∂ X
chain rule, and only a
= (y − a) · − w j xj is in terms of wi
∂wi j
(x,y)∈D
X
when j 6= i, the derivative
= (y − a)(−xi ) will be zero
(x,y)∈D
X rearranged to look like
=− (y − a)xi the perceptron rule
(x,y)∈D

Notice that we essentially end up with a version of the perceptron rule where η = −1,
except we now use the summation a instead of the binary output ŷ. Unlike the
perceptron rule, we have no guarantees about finding a separation, but it is far more
robust to non-separable data. In the limit (thank u Newton very cool), though, it
will converge to a local optimum.
To reiterate our learning rules, we have:

∆wi = η(y − ŷ)xi (2.1)


∆wi = η(y − a)xi (2.2)

2.2.2 Sigmoids
The similarity between the learning rules in (2.1) and (2.2) begs the question, why
didn’t we just use calculus on the thresholded ŷ?

Kudrayvtsev 32
MACHINE LEARNING Supervised Learning

y
1
a

The simple answer is that the function isn’t differentiable. Wouldn’t it be nice, though
if we had one that was very similar to it, but smooth at the hard corners that give
us problems? Enter the sigmoid function:
1
σ(a) = (2.3)
1 + e−a

By introducing this as our activation function, we can use gradient descent all over
the place. Furthermore, the sigmoid’s derivative itself is beautiful (thanks to the fact
that dx e = ex ):
d x

σ̇(a) = σ(a)(1 − σ(a))

Note that this isn’t the only function that smoothly transitions between 0 and 1;
there are other activation functions out there that behave similarly.
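A quick numerical sanity check of that derivative identity (nothing here is assumed beyond the formulas above; the finite-difference step size is arbitrary):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = 0.7
analytic = sigmoid(a) * (1 - sigmoid(a))                   # sigma-dot(a)
numeric = (sigmoid(a + 1e-6) - sigmoid(a - 1e-6)) / 2e-6   # finite-difference estimate
print(analytic, numeric)   # the two agree to roughly six decimal places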

2.2.3 Structure
Now that we have the ability to put together differentiable individual neurons, let’s
look at what a large neural network presents itself as. In Figure 2.4 we see a collec-
tion of sigmoid units arranged in an arbitrary pattern. Just like we combined two
perceptrons to represent the non-linear function XOR, we can combine a larger number
of these sigmoid units to approximate an arbitrary function.

Figure 2.4: A 4-layer neural network with 5 inputs and 3 hidden layers.

Because each individual unit is differentiable, the entire mapping from x 7→ y is differ-
entiable! The overall error of the system enables a bidirectional flow of information:


the error of y impacts the last hidden layer, which results in its own error, which im-
pacts the second-to-last hidden layer, etc. This layout of computationally-beneficial
organizations of the chain rule is called back-propagation: the error of the network
propagates to adjust each unit's weight individually.

Optimization Methods
It’s worth reiterating that we’ve departed from the guarantees of perceptrons since
we’re using σ(a) instead of the binary activation function; this means that gradient
descent can get stuck in local optima and not necessarily result in the best global
approximation of the function in question.
Gradient descent isn’t the only approach to training a neural network. Other, more
advanced methods are researched heavily. Some of these include momentum, which
allows gradient descent to “gain speed” if it’s descending down steep areas in the
function; higher-order derivatives, which look at combinations of weight changes to try
to grasp the bigger picture of how the function is changing; randomized optimization;
and the idea of penalizing “complexity,” so that the network avoids overfitting with
too many nodes, layers, or even too-large of weights.

2.2.4 Biases
What kind of problems are neural networks appropriate for solving?

Restriction Bias A neural network’s restriction bias (which, if you recall, is the
representation’s ability to consider hypotheses) is basically non-existent if you use
sigmoids, though certain models may require arbitrarily-complex structure.
We can clearly represent Boolean functions with threshold-like units. Continuous
functions with no “jumps” can actually be represented with a single hidden layer. We
can think of a hidden layer as a way to stitch together “patches” of the function as
they approach the output layer. Even arbitrary functions can be approximated with
a neural network! They require two hidden layers, one stitching at seams and the
other stitching patches.
This lack of restriction does mean that there’s a significant danger of overfitting,
but by carefully limiting things that add complexity (as before, this might be layers,
nodes, or even the weights themselves), we can stay relatively generalized.

Preference Bias On the other hand, we can’t yet answer the question of the pref-
erence bias of a neural network (which, if you recall, describes which hypotheses from
the restricted space are preferred). We discussed the algorithm for updating weights
(gradient descent), but have yet to discuss how the weights should be initialized in
the first place.


Common practice is choosing small, random values for our initial weights. The ran-
domness allows for variability so that the algorithm doesn’t get stuck at the same
local minima each time; the smallness allows for relative adjustments to be impactful
and reduces complexity.
Given this knowledge, we can say that neural networks—when all other things are
equal—prefer simpler explanations to complex ones. This idea is an embodiment of
Occam’s Razor:
Entities should not be multiplied unnecessarily.

More colloquially, it’s often expressed as the idea that the simpler explanation is
likelier to be true.

2.3 Instance-Based Learning


The learning algorithms presented in this section take a radically-different approach
to modeling a function approximation. Instead of inducing an abstract model from
the training set that compactly represents most of the data, these algorithms will
actually regularly refer to the data itself to create approximations.
For a high-level example, consider our fundamental “line of best fit” problem. Given
some input points, linear regression will determine the line that minimizes the error
with the data, and then we can use the line directly to predict new values. With
instance-based learning methods, we instead would base our predictions from the
points themselves. We might predict the output of a novel input x0 as being the same
as the output of whatever known x is closest to it, or perhaps the average of the
outputs of the three known inputs closest to it.
There are some pretty clear benefits to this paradigm shift: we no longer need to
actually spend time “learning” anything; the model perfectly-remembers the training
data rather than remembering an abstract generalization; and the approach is dead-
simple. The downsides, of course, are that we have to store all of our training data
(which might potentially require massive amounts of storage) and we are not really
generalizing from it (we’re pretty obviously overfitting) at all.

2.3.1 Nearest Neighbors


This idea of referring to similar known inputs for a novel input is exactly the intuition
behind the k-nearest neighbor learning algorithm. While k is the number of knowns
to consider, we also need a notion of “distance” to determine how close or similar an
xi ∈ X is to the novel input x0 .
The distance is our expression of domain knowledge about the space. If we’re classify-
ing restaurants into cheap, average, and expensive, our distance metric might be the
difference between the average entrée price. We might even be talking about literal


distances: if we had some arbitrarily-colored dots (whose color was determined by the
features fm and fn ) and wanted to determine the color of a novel dot (encoded in red
below) with fm = 2, fn = 1, we’d use the standard Euclidean distance.
(Plot: four colored points in the (fm, fn) plane, at distances d1 through d4 from the novel red dot at (2, 1).)

Notice that x4 is closest, followed by x2 . If we were doing classification, we would


probably choose specifically blue or green (depending on our tie-breaking scheme),
but in regression, we might instead color the novel dot by the weighted sum of its
k = 2 nearest neighbors:
$$y' = \frac{d_4 y_4 + d_2 y_2}{d_4 + d_2} \approx \frac{2.8 \cdot \text{blue} + 1.8 \cdot \text{green}}{2.8 + 1.8} = (0, 155, 100) \text{ in RGB} \in [0, 255] \text{ terms}$$

which is a nice, dark blue-green color.


An extremely simple (and non-performant) version of the kNN algorithm is formalized
in algorithm 2.1; notice that it has O(k |X |) complexity, meaning it grows linearly
both in k, the number of neighbors to consider, and in |X |, the size of the training
data.
Obviously we can improve on this. With a sorted list (O(n log n) time), we can apply
binary search (O(log n + k) time for k values) and query far more efficiently. For
features with high dimensionality (i.e. when n ≫ 0 for xi = {f1 , f2 , . . . , fn }), we can
also leverage the kd-tree algorithm1 —which subdivides the data into sectors to search
through them more efficiently—but is still O(kn log n). This is the main downside of
kNN: because we use the training data itself as part of the querying process, things
can get slow and unwieldy (in both time and space) very quickly.

1
Game developers might already be somewhat familiar with the algorithm: quadtrees rely on sim-
ilar principles to efficiently perform collision detection, pathfinding, and other spatially-sensitive
calculations. The game world is dynamically divided into recursively-halved quadrilaterals to
group closer objects together.


This is in contrast with something like linear regression, which calculates a model up-
front and makes querying very cheap (constant time, in fact); in this regard, kNN is
referred to as a lazy learner, whereas linear regression would be an eager learner.

Algorithm 2.1: A naïve kNN learning algorithm. Both the number of neigh-
bors, k, and the similarity metric, d(·), are assumed to be pre-defined.

Input: A series of training samples, X := xi 7→ yi .


Input: The novel input, x0 .
Result: The predicted value of y 0 .
closest := {}
dists := {∞, . . .}
foreach xi ∈ X do
foreach j ∈ [1, k] do
if d (xi , x0 ) < dists[j] then
closest[j] = xi
dists[j] = d (xi , x0 )
end
end
end
// Return the distance-weighted average of the k closest neighbors' outputs.
return $\left(\sum_{d_i \in \text{dists}} d_i \cdot y_i\right) \Big/ \left(\sum_{d_i \in \text{dists}} d_i\right)$
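The same idea as a short, still-naïve Python sketch; the Euclidean metric and the distance-weighting below mirror algorithm 2.1 and are just one reasonable set of choices (many implementations weight by 1/d instead so that closer neighbors count for more):

import numpy as np

def knn_predict(X, y, x_new, k=3):
    # X: (m, n) matrix of training inputs; y: (m,) vector of their outputs.
    dists = np.linalg.norm(X - x_new, axis=1)   # Euclidean distance to every sample
    nearest = np.argsort(dists)[:k]             # indices of the k closest samples
    d, targets = dists[nearest], y[nearest]
    return np.sum(d * targets) / np.sum(d)      # distance-weighted average of outputs

# usage: knn_predict(np.random.rand(100, 2), np.random.rand(100), np.array([0.5, 0.5]))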

In the case of regression, we’ll likely be taking the weighted average like in algo-
rithm 2.1; in the case of classification, we’ll likely instead have a “vote” and choose
the label with plurality. We can also do more “sophisticated” tie-breaking: for regres-
sion, we can consider all data points that fall within a particular shortest distance (in
other words, we consider the k shortest distances rather than the k closest points);
for classification, we have more options. We could choose the label that occurs the
most globally in X , or randomly, or. . .

Biases
As always, it’s important to discuss what sorts of problems a kNN representation of
our data would cater towards.

Preference Bias Our belief of what makes a good hypothesis to explain the data
heavily relies on locality—closer points (based on the domain-aware distance metric)
are in fact similar—and smoothness—averaging neighbors makes sense and feature
behavior smoothly transitions between values.


Notice that we’ve also been treating the features of our training sample vector xi
equally. The scales for the features may of course be different, but there is no notion of
weight on a particular feature. This is critical: obviously, whether or not a restaurant
is within your budget is far more important than whether the atmosphere inside fits
your vibe (being the broke graduate students that we are). However, the kNN model
has difficulty with making this differentiation.
In general, we are encountering the curse of dimensionality in machine learning:
as the number of features (the dimensionality of a single xi ∈ X ) grows linearly, the
amount of data we need to accurately generalize grows exponentially. If we
have a lot of features and we treat them all with equal importance, we’re going to
need a LOT of data to determine which ones are more relevant than others.
It’s common to treat features as “hints” to the machine learning algorithm you’re
using, following the rationale that, “If I give it more features, it can approximate the
model better since it has more information to work with;” however, this paradoxically
makes it more difficult for the model, since now the feature space has grown and
“importance” is harder rather than easier to determine.

Restriction Bias If you can somehow define a distance function that relates two
feature points together, you can represent the data with a kNN.

Computational Learning Theory

People worry that computers will get too smart and take over
the world, but the real problem is that they’re too stupid and
they’ve already taken over the world.
— Pedro Domingos

Computational learning theory lets us define and understand learning problems
better. It's a powerful tool that can let us show that specific algorithms
work for specific problems (like, for example, when a kNN learner would be
best for a problem), and it also lets us show when a certain problem is fundamentally
“hard.”
The kinds of methods used in analysing learning questions are often analogous to the
methods used in algorithm analysis in a more “traditional” computer science setting
(think big-O notation). There, we talk about an algorithm’s complexity in time and
space; analogously, in machine learning problems, while we still care about time and
space, we also care about sample complexity. Does a particular algorithm do well
with small amounts of data? How do metrics like prediction accuracy grow with
increased data?
Generally-speaking, inductive learning is learning from examples. There are many
factors in the way such a learning problem is set up that can affect the resulting induc-
tions; these include:
• the number of samples to learn from: it should come as no surprise that the
amount of data we have can affect our inductions significantly;
• the complexity of the hypothesis class: the more difficult a concept or
idea, the harder it may be for most algorithms to model; similarly, the risk of
overfitting is high when you model a complex hypothesis;
• the accuracy to which the target concept is approximated: obviously, whether
or not a model is “good” correlates directly with how well it performs on novel


data;
• how samples are presented : until now, we’ve been training models on entire
“batches” of data at once, but this isn’t the only option; online learning (one at
a time) is another alternative; and
• how samples are selected : is random shuffling the best way to choose data
to train on?

3.1 Learning to Learn: Interactions


We’ll look at the last two items from the above list first because deep within them
are some important subtleties.
The ways that training examples are selected to learn from can drastically affect the
overall quantity of data necessary to learn something meaningful. There are a number
of ways to approach the learner-teacher relationship:
1. query-based, where the learner “asks questions” that the teacher responds to.
For example, “I have x; what’s c(x)?”
2. friendly, where the teacher provides the example x and its resulting c(x), se-
lecting x to be a helpful way to learn.
3. natural, where nobody chooses and there’s just some nature-provided distribu-
tion of samples.
4. adversarial, where the teacher provides unhelpful (x, c(x)) pairings designed to
make the learner fail (and thereby learn better when it figures out how not to
fail–think “trick questions” on exams).
5. . . . and more!
So if we have a query-based learner, coming up with and asking questions, what should they ask? An effective question is one whose answer gives the most information. In a binary world of yes-or-no questions with n possible hypotheses, the question always eliminates some ℓ number of possibilities, splitting the space into ℓ hypotheses on one side and n − ℓ on the other.

Figure 3.1: From the movie I, Robot, a holographic version of a scientist appears posthumously to help Will Smith solve the riddle of the robot uprising.

A good question, then, ideally halves the space, so ℓ ≈ n − ℓ; then it'll take

log2 |H| questions overall to nail the answer and find h.

Teaching
If the teacher is providing friendly examples, wouldn’t a very ambitious teacher—one
that wants the learner to figure out the hypothesis h ∈ H as quickly as possible—
simply tell the learner to “ask,” “Is the hypothesis h?” Though effective, that’s not
a very realistic way to learn: the question-space is never “the entire set of possible
questions in the universe.” Teachers have a constrained space of questions they can
suggest.

Example 3.1: A Good Teacher

Suppose we have some Boolean function that operates on some (possibly


proper) subset of the bits x1 through x5 . Our job is to determine the
conjunction (logical-AND) of literals or negations that will fulfill the table of
examples:
x1 x2 x3 x4 x5 h
1 0 1 1 0 1
0 0 0 1 0 1
1 1 1 1 0 0
1 0 1 0 0 0
1 0 1 1 1 0
How should we tackle this? Well notice that in the first two rows, x1 is
flipped yet h stays the same. We can infer from this that x1 is not relevant,
and likewise for x3 . What’s consistent across these rows? Well, both x2 =
x5 = 0. Yet this is also the case in row 4, so this may be necessary but not
sufficient. We also see that x4 = 1 in both top rows, so a reasonable guess
overall for the Boolean function in question is:

f (x1 · · · x5 ) = ¬x2 ∧ x4 ∧ ¬x5

This holds for all of the rows.

Learning: Constrained Queries


Notice that in the above example, we can come up with f from the first two rows (we
can think of these as “positive” examples), then use the next three rows to verify it
(these being “negative” examples). This isn’t because we’re some sort of genius, but
rather that we were given very good examples to learn from: each one gave us a lot
of information. This is exactly what it means to be a good teacher: it can
teach us a generic f with just k + 2 samples.
What if we weren’t provided good examples and had to ask questions blindly? Since


we don’t really learn anything from h = 0, there are many possible f s that are
extremely hard to guess. For a general k, it'll take 2^k questions in the worst case
(consider when exactly one k-bit string gives h = 1).

Learning: Mistake Bounds


When the game is too hard, change the game. Instead of measuring how many
samples we need, why not measure how many mistakes we make?
We’ll modify the teacher-learner interaction thus: the learner is now given an input
and has to guess the output. If it guesses wrong, it gets p u n i s h e d. This
is an example of online learning, since the learner adjusts with each sample. Its
algorithm is then:
1. To start, assume it’s possible for each variable to be both positive and negated:
h0 = x1 ∧ ¬x1 ∧ . . . ∧ xk ∧ ¬xk
Obviously this is logically impossible, but this lets us have a meaningful baseline
for Step 3 to work.
2. Given an input, compute the output based on our h0 . For a while (that is, until
we guess wrong), we will just output “false.”
3. If we're wrong, set positive variables (the xi s) that were 0 to absent and negative
variables (the ¬xi s) that were 1 to be absent.
4. Go to Step 2.
Basically whenever the learner guesses wrong, they will know to eliminate some vari-
ables. Even though the number of examples to learn may be the same in the worst
case (2^k), we will now never make more than k + 1 mistakes. A good teacher that
knows that its learner has this algorithm can actually teach h with k + 1 samples,
teaching it one bit every time.
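A small Python sketch of this mistake-bounded learner for conjunctions, assuming inputs are 0/1 sequences; representing the hypothesis as a pair of index sets is just one convenient encoding of mine:

def make_learner(k):
    # Step 1: assume every variable appears both positively and negated (h_0).
    pos = set(range(k))   # indices i where the hypothesis requires x_i = 1
    neg = set(range(k))   # indices i where the hypothesis requires x_i = 0

    def predict(x):
        return all(x[i] == 1 for i in pos) and all(x[i] == 0 for i in neg)

    def update(x, label):
        # Step 3: a wrong guess on a truly-positive example drops the literals
        # that the example contradicts. (A false positive can never happen, since
        # the hypothesis is always at least as specific as the target.)
        if label == 1 and not predict(x):
            pos.difference_update(i for i in range(k) if x[i] == 0)
            neg.difference_update(i for i in range(k) if x[i] == 1)

    return predict, update

The first wrong guess removes k of the 2k starting literals and every later one removes at least one more, which is where the "at most k + 1 mistakes" bound comes from.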

3.2 Space Complexity


As we’ve already established, data is king in machine learning. In this section we’ll
derive a theoretical way to approximate space complexity: how much data do we need
to solve a problem to a particular degree of certainty?
First, we’ll need to rigorously define some terminology:
• a training set: S ⊆ X is made up of some samples x
• H is the hypothesis space
• the true hypothesis or concept: c ∈ H is the thing we want to learn, while
the candidate hypothesis: h ∈ H is what the learner is currently considering


• a consistent learner: one that produces the correct result for all of the training
samples:
∀x ∈ S : c(x) = h(x)

3.2.1 Version Spaces


Given the above, the version space is all of the possible hypotheses that are consis-
tent with the data; in other words, the only hypotheses worth considering1 at some
point in time given the training data:

VS(S) = {h : c(x) = h(x) | ∀x ∈ S, h ∈ H}

Example 3.2: Version Spaces

Given the target concept of XOR and some training data:

Target concept (XOR):
x1 x2 c(x)
0  0  0
0  1  1
1  0  1
1  1  0

Training data:
x1 x2 c(x)
0  0  0
0  1  1
1  1  ?
and the hypotheses:

H = {x1 , ¬x1 , x2 , ¬x2 , T, F, OR, AND, XOR, EQUIV}

which of the hypotheses are in the version space?


Well, which of the hypotheses would be consistent with the training samples?
Only x2, OR, and XOR are; every other hypothesis misclassifies at least one of the labeled samples.

3.2.2 Error
Given a candidate hypothesis h, how do we evaluate how good it is? We define the
training error as the fraction of training examples misclassified by h, while the true
error is the fraction of examples that would be misclassified on the sample drawn from
a distribution D:
$$\text{error}_D(h) = \Pr_{x \sim D}[c(x) \neq h(x)] \tag{3.1}$$

This lets us minimize the punishment that comes from getting rare examples wrong.
1
Author’s note: this is such an unfortunate term and is a classic example of the machine learning
field intentionally gatekeeping newcomers with obscure jargon. Wouldn’t it make much more sense
to simply call this the viable hypotheses?


So if there’s a very obscure case that comes up once in a blue moon, it should be
treated differently than getting a commonly-occurring case incorrect.

3.2.3 PAC Learning


First, some more definitions:
• C is the concept class
• L is the learner
• D is the probability distribution over the inputs
• the error goal is 0 ≤ ε ≤ 1/2, while the certainty goal is 0 ≤ δ ≤ 1/2.
Then, we say C is PAC-learnable2 by L using H if and only if : L will, with prob-
ability 1 − δ, output a hypothesis h ∈ H such that errorD (h) ≤ ε with polynomial
time and sample complexity in 1/ε, 1/δ, n. That is, finding h such that
Pr [errorD (h) ≤ ε] = 1 − δ
is achievable within polynomial time. To rephrase this in English: a concept
is PAC-learnable if it can be learned to a reasonable degree of correctness within a
reasonable amount of time.
(couldn’t we have just said that straight up
instead of needing 4 pages of terminology?)

3.2.4 Epsilon Exhaustion


A version space is considered ε-exhausted if and only if all of its hypotheses have
low error. Formally, iff:
∀h ∈ VS(S) : errorD (h) ≤ ε

Remember that ε varies, so how do we find the smallest-possible ε for a particular


version space? Apply the Haussler theorem: it allows us to bound the true error
(3.1) in terms of the number of necessary data samples.
Let's consider the hypotheses h1..k that have high true error:
errorD (h1 , h2 , . . . , hk ∈ H) > ε
how much data do we need to “knock out” these hypotheses from the version space?
The probability of being right for any of these hypotheses is then:
Pr [hi (x) = c(x)] ≤ 1 − ε
x∼D
2
PAC—probably approximately correct, where 1 − δ defines the “probably,” ε defines the “approx-
imately,” and errorD (h) = 0 defines the “correct.”


Being consistent on m samples then exponentiates this for a specific $h_i$:
$$\Pr[\text{all } h_i(x) = c(x) \text{ on } m \text{ samples}] \le \Pr[h_1(x) = c(x) \text{ on } m \text{ samples}] \cdot \Pr[h_2(x) = c(x) \text{ on } m \text{ samples}] \cdots \Pr[h_k(x) = c(x) \text{ on } m \text{ samples}] \le (1 - \varepsilon)^m$$
but the probability of at least one of them staying consistent is:
$$\Pr[\text{any } h_i(x) = c(x) \text{ on } m \text{ samples}] \le \Pr[h_1(x) = c(x) \text{ on } m \text{ samples}] + \Pr[h_2(x) = c(x) \text{ on } m \text{ samples}] + \cdots + \Pr[h_k(x) = c(x) \text{ on } m \text{ samples}] \le k(1 - \varepsilon)^m \le |H|(1 - \varepsilon)^m$$

Now we use an interesting truth (shown without proof here, just refer to the lectures
or a real textbook lol): −ε ≥ ln(1 − ε). Thus,

$$|H|(1 - \varepsilon)^m \le |H| e^{-\varepsilon m} \le \delta$$

This gives us an upper bound that the version space is not ε-exhausted after
m samples, and this is related to our certainty goal δ. To solve this inequality in
terms of m, we get:
$$\begin{aligned}
|H| e^{-\varepsilon m} &\le \delta && (3.2) \\
\ln|H| - \varepsilon m &\le \ln\delta && \text{logarithm product rule} \\
m &\ge \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right) && \text{flip inequality when dividing by } -\varepsilon
\end{aligned}$$
Notice that this upper bound is polynomial in all of the terms we need for a problem
to be PAC-learnable.

Example 3.3: PAC-Learnability

Given a 10-bit input x, suppose we have a hypothesis space that guesses


one of those bits being set. That is,

H = {hi (x) = xi }

Given that D is a uniform distribution, ε = 0.1, δ = 0.2, how many samples


do we need to PAC learn the hypothesis set?


Well, refer to the upper bound from before:
$$\begin{aligned}
m &\ge \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right) && (3.3) \\
&\ge \frac{1}{0.1}\left(\ln 10 + \ln\frac{1}{0.2}\right) && (3.4) \\
&\ge 10(\ln 10 + \ln 5) \approx 39.12 && (3.5) \\
&\ge 40 \text{ (rounding up to a whole number of samples)} && (3.6)
\end{aligned}$$

This is pretty good: it's only 4% of the total 2^10-element input space. Also
notice that D was irrelevant.
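The bound is simple enough to compute directly; a tiny Python helper (the function name is mine) that reproduces the m ≥ 40 figure:

import math

def haussler_bound(H_size, eps, delta):
    # m >= (1/eps) * (ln|H| + ln(1/delta)), rounded up to a whole sample
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

print(haussler_bound(10, 0.1, 0.2))   # -> 40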

3.3 Infinite Hypothesis Spaces


Consider the limitation of the Haussler theorem presented above:
$$m \ge \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right) \tag{3.7}$$
if |H| = ∞, things kinda. . . blow up. We need to deal with this somehow in our
theory; in fact, all of the concrete, practical algorithms we learned in the previous
chapters have handled this snafu just fine. In fact, linear separators, neural networks,
and decision trees with continuous inputs all have infinite hypothesis spaces.3

3.3.1 Intuition
Let’s work through a concrete example to build up some intuition about how to tackle
this problem. Suppose we’re given a simple input space, and a simple set of infinite
hypotheses: is x bigger than some number θ?
X = {1, 2, 3, 4, . . . , 10}
H : {h(x) = x ≥ θ | ∀θ ∈ R}

Obviously since θ is a real number, |H| = ∞. However, notice that many of those
hypotheses are completely meaningless. More specifically, any choice of θ > 10 gives
the same result as any θ < 1. Similarly, since X ⊂ N, only integer θs really make
sense. This gives us a notion of a “semantic” hypothesis space: hypotheses that
are meaningfully different, as opposed to all of the hypotheses in the conceivable
universe that we could write down. . .
TL;DR: We want to differentiate between hypotheses that matter (of which there
are few) from the hypotheses that don’t (of which there are infinite).
3
Obviously, there are an infinite number of lines, and any model with continuous inputs has an
infinite number of ways to change up the input.


3.3.2 Vapnik-Chervonenkis Dimension


Let’s turn this intuition into something more formal. Our definition of “meaningful”
will be something far more specific and elegant:
What is the largest set of inputs that the hypothesis class can label in all possible ways? (Labeling a set of inputs in all possible ways is also termed "shattering" it.)

Example To continue with our previous example, the answer is simply one. The
hypotheses are all binary, true-or-false questions, so any one input can be labeled in
all of the two possible ways. However, since we only consider x ≥ θ, we can’t properly
map a pair of inputs. Think of θ as being the “separator” between some training data
S = {x1 , x2 }; then, slide it along:
θ1 (left of both points): x1 = T, x2 = T;   θ2 (between them): x1 = F, x2 = T;   θ3 (right of both): x1 = F, x2 = F

It’s impossible to capture a case where x1 = T, x2 = F . Thus, despite being infinite,


the hypothesis space H cannot express very much; it’s a weak space.
In general, this is called the VC dimension of H; it allows us to relate the amount
of data we need to learn to describe a hypothesis space.

Example 3.4: Measuring the VC Dimension

Suppose we’re given an input space of all real numbers and a hypothesis
space that can tell us whether or not an input value is within a particular
interval:

X = ℝ
H = {h(x) = (x ∈ [a, b])}, parameterized by a, b ∈ ℝ

What is the largest set of inputs that we can shatter?


Answer: two

We solve this methodically. Confirming that we can shatter one input is


easy: the interval either goes around or outside of the input. For two, we
have four combinations to confirm: true-true can be done by wrapping the
inputs just like in the single-input case, true-false is done by only wrapping
the former, vice-versa for false-true, and false-false is done by wrapping
something else. What about three? Consider, specifically, the true-false-
true case. This is impossible to capture with a single interval, and thus we
can’t shatter it.


To show a lower bound for a VC dimension, all you need to do is find a single example
of inputs for which all labeling combinations can be done.

Practical VC Dimensions
What’s the VC dimension for something more practical in machine learning, like a
linear separator? That is, for
X = ℝ²
H = {h(x) = (wᵀx ≥ θ)}

The VC dimension is three. The use of the second (Euclidean) dimension lets us solve
the problem from the above example, but four inputs (one inside of the convex hull
of the other three, for example) are impossible to shatter.
This might lead to an inductive hypothesis: with one parameter we had one VC
dimension; with two, we had two; with three, we had three. . . can we expect to find
a VC dimension of four when we decide to work with 3D space?
In fact, this is precisely the case: for any n-dimensional hyperplane (or hypoth-
esis class), the VC dimension will be n + 1.

Sample Complexity
The whole divergence into VC dimensions was to find a way to deal with infinite
hypothesis spaces in the Haussler theorem (3.7) so that we can make estimates about
how many m samples we need to find good hypotheses.
The derivation (which I'm sure is long and arduous) leads us to the following:
$$m \ge \frac{1}{\varepsilon}\left(8 \cdot VC(H) \cdot \log_2\frac{13}{\varepsilon} + 4\log_2\frac{2}{\delta}\right) \tag{3.8}$$

Can we determine the VC dimensions of finite hypothesis spaces, too? Turns out,
finding an upper bound isn't hard. If d = VC(H), then there are 2^d distinct concepts
(each picking a different h ∈ H). Thus, since 2^d ≤ |H|, we know that d ≤ log₂|H|.

Property 3.1. A hypothesis space H is PAC-learnable if and only if its VC


dimension is finite.

3.4 Information Theory


This brief foray into a completely separate field will give us a bit of knowledge that
is useful in understanding our machine learning models better. Our algorithms fit a
pretty typical abstract model:


(Diagram: inputs x1, x2, x3 flow into a learner, which produces the output y.)

We want to be able to answer questions like:


• Which of our xi s will give us the most information about the output?
• Are these input feature vectors similar? We’ll call this metric mutual infor-
mation.
• Does this feature have any information? This is called entropy.
The key first step is simply quantifying this idea of “information.”
Suppose we want to transmit the result of 10 coin flips. However, we’re comparing a
fair coin (1/2 probability of each result) and a biased coin whose sides are identical.
Suppose our flip results were as follows:
fair: HTHHTHTTHT
unfair: HHHHHHHHHH

Which of these messages has more information? Many answers may make sense
intuitively: perhaps we sent one bit per flip regardless; perhaps all we’re sending
is the flip ratio; perhaps we can compress the unfair flips into a single bit; etc. In
actuality, we need 10 bits for the first message and zero for the second: the output is
completely predictable without extra information.

3.4.1 Entropy: Information Certainty


This notion that certain messages are harder to transmit than others leads to en-
tropy: the more unpredictable a particular piece of data, the more information needs
to be transmitted to explain it. Thus, sending HHHHHHHHHH actually transmits no in-
formation.
Let’s work through something a little more complex. Suppose we have a four-letter
alphabet: L = {A, B, C, D}. This set can clearly be represented with 2 bits per letter:
{A = 00, B = 01, C = 10, D = 11}.
However, suppose the letters don't occur with equal probability: a message with an A occurs 50% of the time. Specifically, we have the following ratios. Can we get away with a different bit-representation?

Letter  Frequency
A       50%
B       12.5%
C       12.5%
D       25%

Since A occurs half the time, we can let A = 0. Then, we can represent half of the possible occurrences of letters


with a single bit. However, now we need three bits for B and C: if D = 10, then
B = 110 and C = 111. We’re still saving information, though, because these cases
occur less frequently.
How much are we saving? To answer that, we'll need to use math. What's the expected message size of a single letter? It's the sum, over the letters, of each letter's probability times its size:
$$E = \sum_i \left(|\ell_i| \cdot \Pr[\ell_i]\right) = 0.5 \cdot 1 + 0.125 \cdot 3 + 0.125 \cdot 3 + 0.25 \cdot 2 = 1.75 \text{ average bits}$$

In general, the size of a piece of information x is $\log_2 \frac{1}{x}$. Thus, entropy in general is expressed as:
$$h(S) = -\sum_{s \in S} \Pr[s] \cdot \log_2 \Pr[s] \tag{3.9}$$
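Both the hand computation and (3.9) are easy to check in a couple of lines of Python (a sketch using the letter frequencies above):

import math

freqs = {"A": 0.5, "B": 0.125, "C": 0.125, "D": 0.25}
entropy = -sum(p * math.log2(p) for p in freqs.values())
print(entropy)   # 1.75 bits, matching the expected message size computed above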

3.4.2 Joint Entropy: Mutual Information


Will there be thunder today? If you’re sitting alone in a windowless room, it’s hard
to say, but if I told you it’s raining outside, you can make a better guess. Thus, there
is a relationship between variables: knowing one helps you learn something about
another.
In information theory, this is represented by joint entropy:
$$h(X, Y) = -\sum_{x \in X,\ y \in Y} \Pr[x, y] \log_2 \Pr[x, y]$$

There is also the idea of conditional entropy, where the entropy for a variable
changes based on another:
$$h(Y \mid x) = -\sum_{y \in Y} \Pr[x, y] \log_2 \Pr[Y \mid x]$$

Obviously, if two variables are independent of each other, h(Y | x) = h(Y ), and
H(X, Y ) = H(X) + H(Y ).
We need to differentiate between two cases: perhaps x tells us a lot about Y , then
Y ’s conditional entropy will be small; however, it’s possible that Y ’s entropy is small
to begin with. To solve this, we introduce the formula for mutual information:

I(x, Y ) = h(Y ) − h(Y | x) (3.10)


3.4.3 Kullback-Leibler Divergence


This metric measures the difference between any two distributions; mutual informa-
tion is a special case of KL-divergence. It’s given by:

$$D_{KL}(p \parallel q) = \int \Pr_p[x] \log_2 \frac{\Pr_p[x]}{\Pr_q[x]} \, dx \tag{3.11}$$

Bayesian Learning

The combination of Bayes and Markov Chain Monte Carlo has


been called “arguably the most powerful mechanism ever created
for processing data and knowledge.” Almost instantaneously
MCMC and Gibbs sampling changed statisticians’ entire method
of attacking problems. In the words of Thomas Kuhn, it was a
paradigm shift. MCMC solved real problems, used computer al-
gorithms instead of theorems, and led statisticians and scientists
into a world where “exact” meant “simulated” and repetitive
computer operations replaced mathematical equations. It was a
quantum leap in statistics.
— Sharon Bertsch McGrayne, The Theory That Would Not Die

One of the most important, fundamental rules of statistics is Bayes' rule: it relates the conditional probability of one event to the condition itself:
$$\Pr[x \mid y] = \frac{\Pr[y \mid x] \cdot \Pr[x]}{\Pr[y]} \tag{4.1}$$

It also comes with some bonus corollaries:

$$\Pr[x, y] = \Pr[x \mid y] \cdot \Pr[y] = \Pr[y \mid x] \cdot \Pr[x] \qquad \text{since order doesn't matter}$$

This simple rule is what powers Bayesian learning. Our learning algorithms aim to
learn the “best” hypothesis given data and some domain knowledge. If we reframe
this as being the most probable hypothesis, now we’re cooking with Bayes, since this
can be expressed as:
$$\arg\max_{h \in H} \Pr[h \mid D]$$


where D is our given data. Let’s apply Bayes’ rule to our novel view of the world:

$$\Pr[h \mid D] = \frac{\Pr[D \mid h] \cdot \Pr[h]}{\Pr[D]} \tag{4.2}$$

But what does Pr [D]—the probability of the data—really mean? Well D is our set of
training examples: D = {(x1 , d1 ), . . . , (xn , dn )} and they were chosen/sampled/given
(recall the variations we discussed in section 3.1) from some probability distribution.
What about Pr [D | h], this strange notion of data occurring given our hypothesis?
This is the likelihood—for a particular label, how likely is our hypothesis to output
it? The beauty of this formulation is that it’s a lot easier to figure out Pr [D | h] than
the original quantity.
Finally, what does Pr [h]—the prior probability of the hypothesis h—mean? This
represents our domain knowledge: hypotheses that represent the world or problem
space better will be likelier to occur.

Example 4.1: Likelihood


Suppose we have the following simple hypothesis:
$$h(x) = \begin{cases} 1 & \text{if } x \ge 10 \\ 0 & \text{otherwise} \end{cases}$$

If we have x = 7, what’s the likelihood that h(x) = 1? Obviously zero, since


h(7) will never output 1. Thus, Pr [{7} | h] = 0.

When working with probabilities, it’s crucial to remember the paradox of priors.
Consider a man named Dwight diagnosed with dental hydroplosion—an extremely
rare disease that occurs in 8 of 1000 people—based on a test with a 98% correct
positive rate and a 97% correct negative rate. Is it actually plausible that his teeth
will spontaneously explode and drip down his esophagus in his sleep?
Well, apply Bayes’ rule and see:

$$\Pr[\text{has it} \mid \text{test says has it}] = \frac{\Pr[\text{test says has it} \mid \text{has it}] \cdot \Pr[\text{has it}]}{\Pr[\text{test says has it}]}$$
$$= \frac{0.98 \cdot 0.008}{\underbrace{0.98 \cdot 0.008}_{\Pr[\text{test says has it} \mid \text{has it}] \cdot \Pr[\text{has it}]} + \underbrace{0.03 \cdot (1 - 0.008)}_{\Pr[\text{test says has it} \mid \text{doesn't}] \cdot \Pr[\text{doesn't have it}]}} \approx 20\%$$


In other words, Dwight only has a 20% chance of actually having dental hydroplosion
despite the fact that the test has a 98% accuracy! The disease is so rare that the
probability of actually being infected dominates the test accuracy itself. The prior
(domain knowledge) of a random person having the disease (0.8%) is critical here.
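The arithmetic behind that 20% is easy to reproduce; a small Python sketch (the variable names are mine):

p_disease = 0.008            # 8 in 1000 people
p_pos_given_disease = 0.98   # the 98% correct positive rate
p_pos_given_healthy = 0.03   # 1 minus the 97% correct negative rate

p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
posterior = p_pos_given_disease * p_disease / p_pos
print(posterior)   # ~0.21: only about a 20% chance that Dwight actually has it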

4.1 Bayesian Learning


The trip to the doctor has given us a bit of an algorithm: for each h ∈ H, calculate
Pr [h | D] ∝ Pr [D | h] · Pr [h]. Then, just output the $\arg\max_h$.¹

4.1.1 Finding the Best Hypothesis


Sometimes it’s quite hard to know Pr [h], so there are two possible versions:
• The maximum a posteriori hypothesis (or MAP):
$$h_{\text{MAP}} = \arg\max_h \Pr[h \mid D] = \arg\max_h \Pr[D \mid h] \, \Pr[h]$$
• The maximum likelihood hypothesis (or ML):
$$h_{\text{ML}} = \arg\max_h \Pr[D \mid h]$$

The latter just assumes a uniform probability: all hypotheses are equally-likely to
occur. Unfortunately, enumerating every hypothesis is not practical, but fortunately
it gives us a theoretically-optimal approach.
Let's put theory into action.² Suppose we're given a generic set of labeled examples: $\{(x_i, d_i)\}$ where each label comes from a function and has some normally-distributed noise, so:
$$d_i = f(x_i) + \varepsilon_i \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)$$
The maximum likelihood is clearly defined by the Gaussian: the further from the
mean an element is, the less likely it is to occur.

$$h_{\text{ML}} = \arg\max_h \Pr[D \mid h]$$

1
We dropped the denominator from (4.1) because it’s just a normalization term—the resulting
probabilities will still be proportional and thus equally comparable.
2
At this point, the lectures go through a length derivation for two specific cases, then actually
apply it in general. We are skipping the former here because it’s Tuesday, the project is due on
Sunday, and I have like a dozen more lectures to get through before I can properly start.


$$\begin{aligned}
&= \arg\max_h \prod_i \Pr[d_i \mid h] \\
&= \arg\max_h \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(d_i - h(x_i))^2}{2\sigma^2}} \\
&= \arg\max_h \prod_i e^{-\frac{(d_i - h(x_i))^2}{2\sigma^2}} && \text{the constant has no effect on the arg max} \\
\ln h_{\text{ML}} &= \arg\max_h \sum_i -\frac{1}{2} \cdot \frac{(d_i - h(x_i))^2}{\sigma^2} && \ln(\cdot) \text{ both sides, then the product of logs is the sum of its components} \\
&= \arg\max_h -\sum_i (d_i - h(x_i))^2 && \text{drop constants like before} \\
&= \arg\min_h \sum_i (d_i - h(x_i))^2 && -\max(\cdot) = \min(\cdot)
\end{aligned}$$

But this final expression is simply the sum of squared errors! Thus, the log of the maximum likelihood is simply expressed as:
$$\ln h_{\text{ML}} = \arg\min_h \text{SSD}(d, h) \tag{4.3}$$

Thus, our intuition about minimizing the SSD in the algorithms we covered earlier is
correct: gradient descent, linear regression, etc. are all the way to go to find a good
hypothesis. Bayesian learning just confirmed it!
Note our foundational assumptions, though: our dataset is built upon a true underly-
ing function and is perturbed by Gaussian noise. If this assumption doesn’t actually
hold for our problem, then minimizing the SSD will not give the optimal results that
we’re looking for.

4.1.2 Finding the Best Label


Consider the following table that tells us some results about the hypotheses given the
data and what the hypotheses say about some input x:
h(x) Pr [h | D]
h1 : + 0.4
h2 : − 0.3
h3 : − 0.3
What should we label x, given our hypotheses’ outputs? Obviously, h1 is the likeliest,
and it votes +. However, h2 and h3 ’s combined likelihood is higher than h1 and they
would vote −. What’s the best call? As intuition may suggest, − is the better vote.
The important takeaway is that the best hypothesis does not always provide the
best label. However, allowing all hypotheses to do a (weighted) vote leads to the Bayes


optimal classifier. This is another important result: on average, you cannot do


better than a weighted vote from all of the hypotheses.

4.2 Bayesian Inference


First, let’s review joint distributions. Given the following toy example, which tells us
the likelihood of various weather patterns in Atlanta on a random summer afternoon,
storm lightning Pr [·]
T T 0.25
T F 0.40
F T 0.05
F F 0.30
what’s the probability of there not being a storm? Simple enough: 0.25 + 0.05 = 0.30.
What about the probability of there being lightning given that there’s a storm going
on? Well, we need to renormalize the distribution of just the “is storming” cases:
$$\Pr[\text{lightning} \mid \text{storm}] = \frac{0.25}{0.25 + 0.40} \approx 0.38$$

If we added another variable (like, “Is it thundering?”), our table would double in
size. For n Boolean variables, we have a massive 2^n-entry joint distribution. We can
represent it in a different, more efficient way that instead takes 2n entries via factoring.

4.2.1 Bayesian Networks


Recall the definition of conditional independence between two variables:
X is conditionally independent of Y given Z if the probability distribution
governing X is independent of the value of Y when given the value of Z:

∀x ∈ X, y ∈ Y, z ∈ Z :
Pr [X = x | Y = y, Z = z] = Pr [X = x | Z = z]

More compactly, we typically write Pr [X | Y, Z] = Pr [X | Z].3


If we have a system of three variables (say X, Y, Z) in which two (say X, Z) are
conditionally independent, we can arrange them in a Bayesian network (also called
a graphical model) that represents this relationship as well as some tables—known
as conditional probability tables (CPTs)—that represent the probabilities at each
step:
3
Don’t forget about the definition of two variables being independent: Pr [X, Y ] = Pr [X] · Pr [Y ]
as well as the chain rule that comes out of Bayes’ rule: Pr [X, Y ] = Pr [X | Y ] · Pr [Y ].


X → Y → Z, with the CPTs: Pr[X];  Pr[Y | X], Pr[Y | ¬X];  Pr[Z | Y], Pr[Z | ¬Y]

Optimizing CPTs Now, given an arbitrary Bayesian network:

A ∼ Pr[A]
B ∼ Pr[B]
C ∼ Pr[C | A, B]
D ∼ Pr[D | B, C]
E ∼ Pr[E | C, D]
(The graph: A and B are roots; C has parents A and B; D has parents B and C; E has parents C and D.)

to get a sample over the entire joint distribution (that is, to sample some X ∼
Pr [A, B, C, D, E]), we need to sample from the Bayesian network in its topological
order (an easily-achievable configuration on an acyclic digraph), because:

Pr [A, B, C, D, E] = Pr [A] · Pr [B] · Pr [C | A, B] · Pr [D | B, C] · Pr [E | C, D]

Notice the space savings of this representation: the full joint distribution (if A through
E were Boolean variables) requires 2^5 − 1 = 31 probability specifications to recover,
whereas under the specific conditional independencies we described, only 1 + 1 + 4 +
4 + 4 = 14 probabilities are needed. If they were all completely independent, we’d
only need 5 probabilities (the product of the unconditionals).
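Sampling in topological order ("ancestral sampling") is easy to write down for the small X → Y → Z chain from earlier; this Python sketch hard-codes invented CPT values purely for illustration:

import random

def sample_chain():
    # The numbers below are made up; any values in [0, 1] would do.
    x = random.random() < 0.3                    # Pr[X]
    y = random.random() < (0.8 if x else 0.1)    # Pr[Y | X] vs. Pr[Y | not X]
    z = random.random() < (0.7 if y else 0.2)    # Pr[Z | Y] vs. Pr[Z | not Y]
    return x, y, z

samples = [sample_chain() for _ in range(10_000)]   # draws from Pr[X, Y, Z]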

4.2.2 Making Inferences


Sampling Why do we even care about probability distributions? Well, if the distributions model some complex, real-world processes, simulating them accurately can let us learn things about it. Furthermore, we can use the distributions to make predictions or approximate inferences based on partial information. (We say "approximate" inferences because they are far easier to calculate; finding exact inferences efficiently would mean P = NP.)
In both of these cases, samples from the distribution (rather than the distribution itself, since that's often all we can have) let us model the underlying truth and do those things.


Inferencing Rules Let’s make some inferences. For clarity, though, let’s first reit-
erate the rules we’re going to be using:
• total probability: the probability of a random variable X is the sum of the
probabilities of the individual events that compose it (for example, if X is
“weather,” then it’s the sum of “rainy,” “sunny,” “cloudy,” etc.):
$$\Pr[X] = \sum_{x \in X} \Pr[X = x]$$

• marginalization: if given another random variable, Y , and their joint probabil-


ities we can find the probability of just X by summing all the joint probabilities
regardless of Y ’s value:
$$\Pr[X] = \sum_{y \in Y} \Pr[X, Y = y]$$

• chain rule: Pr [X, Y ] = Pr [X] · Pr [Y | X] = Pr [Y ] · Pr [X | Y ]


Pr [X | Y ] · Pr [Y ]
• Bayes’ rule: Pr [Y | X] =
Pr [X]

Example 4.2: Inference By Hand

Suppose we're given two boxes with colored balls inside of them (this is a
classic probability example; I have no idea why statisticians love boxes with
balls, or bags of marbles, or containers of jelly beans, or . . . )

(Figure: Box 1 and Box 2, each containing an assortment of colored balls.)

For every sample, we first choose a box uniformly at random then fish out
a ball. The Bayesian network of this sampling process is:

box → ball 1 → ball 2

What’s the probability of drawing a green ball then a blue ball?

Pr [2 = blue | 1 = green] = ?


Answer: 6/23 ≈ 0.26

The Bayes net itself gives us an insight on how we should compute this! By
combining the marginalization and chain rules, we get

$$\begin{aligned}
\Pr[2=\text{blue} \mid 1=\text{green}] &= \sum_{\text{box}} \Pr[2=\text{blue} \mid 1=\text{green}, \text{box}] \cdot \Pr[\text{box} \mid 1=\text{green}] \\
&= \Pr[2=\text{blue} \mid 1=\text{green}, \text{box}=1] \cdot \Pr[\text{box}=1 \mid 1=\text{green}] \\
&\quad + \Pr[2=\text{blue} \mid 1=\text{green}, \text{box}=2] \cdot \Pr[\text{box}=2 \mid 1=\text{green}] \\
&= 0 \cdot \Pr[\text{box}=1 \mid 1=\text{green}] + \frac{3}{4} \cdot \Pr[\text{box}=2 \mid 1=\text{green}]
\end{aligned}$$

To find these latter values, we apply Bayes’ rule:

$$\Pr[\text{box}=1 \mid 1=\text{green}] = \frac{\Pr[1=\text{green} \mid \text{box}=1] \cdot \Pr[\text{box}=1]}{\Pr[1=\text{green}]} = \frac{\frac{3}{4} \cdot \frac{1}{2}}{\Pr[1=\text{green}]} = \frac{3/8}{\Pr[1=\text{green}]}$$

and similarly,

$$\Pr[\text{box}=2 \mid 1=\text{green}] = \frac{\Pr[1=\text{green} \mid \text{box}=2] \cdot \Pr[\text{box}=2]}{\Pr[1=\text{green}]} = \frac{\frac{2}{5} \cdot \frac{1}{2}}{\Pr[1=\text{green}]} = \frac{1/5}{\Pr[1=\text{green}]}$$

Since we’re combining the two values, Pr [1 = green] will actually disappear
entirely if we normalize, so we don’t need to calculate it separately:
$$\Pr[\text{box}=2 \mid 1=\text{green}] = \frac{1/5}{\frac{1}{5} + \frac{3}{8}} = \frac{8/40}{23/40} = \frac{8}{23} \qquad \Pr[\text{box}=1 \mid 1=\text{green}] = \frac{3/8}{\frac{1}{5} + \frac{3}{8}} = \frac{15/40}{23/40} = \frac{15}{23}$$

Finally, we can plug this back into our original expansion:
$$\Pr[2=\text{blue} \mid 1=\text{green}] = 0 \cdot \Pr[\text{box}=1 \mid 1=\text{green}] + \frac{3}{4} \cdot \Pr[\text{box}=2 \mid 1=\text{green}] = 0 \cdot \frac{15}{23} + \frac{3}{4} \cdot \frac{8}{23} = \frac{6}{23} \approx 0.26$$


Solving that example was miserable, but we followed a pretty clear process based on
the Bayesian network. Luckily for us, that means it should be possible to automate.

4.2.3 Naïve Bayes


Suppose we cut down on the complexity of allowable relationships, only allowing edges
between immediate family; that is, parents are connected directly to children, and
that’s it. The beauty of the structure of such a naïve Bayesian network is that it
breaks down larger conditional probabilities into products of simpler ones.
For example, given the Bayesian network and CPTs in Figure 4.1, can we compute
the probability of a particular message being spam given its contents?

Is it spam? Pr [S] = 0.4

“viagra” “prince” “Udacity”

Pr [V | S] = 0.3 Pr [P | S] = 0.2 Pr [U | S] = 0.001


Pr [V | ¬S] = 0.001 Pr [P | ¬S] = 0.1 Pr [U | ¬S] = 0.1

Figure 4.1: An example naïve Bayesian network for spam email.

$$\begin{aligned}
\Pr[\text{spam?} \mid \text{contains “viagra”}, \neg\text{contains “prince”}, \neg\text{contains “Udacity”}] \\
&\propto \Pr[\text{contains “viagra”}, \neg\text{contains “prince”}, \neg\text{contains “Udacity”} \mid \text{spam?}] \cdot \Pr[\text{spam?}] && \text{Bayes' rule} \\
&\propto \Pr[\text{contains “viagra”} \mid \text{spam?}] \cdot \Pr[\neg\text{contains “prince”} \mid \text{spam?}] \cdot \Pr[\neg\text{contains “Udacity”} \mid \text{spam?}] \cdot \Pr[\text{spam?}] \\
&\propto 0.3 \cdot 0.8 \cdot 0.999 \cdot 0.4 \\
&= \frac{0.096}{z} && z \text{ is the normalization factor}
\end{aligned}$$

In general, when we have this format of a single parent V and its many children, a1, a2, . . . , an, then the probability is solvable as:
$$\Pr[V \mid a_1, a_2, \ldots, a_n] = \frac{\Pr[V]}{z} \cdot \prod_{i=1}^{n} \Pr[a_i \mid V]$$

This beautiful relationship lets us do classification (yay, we’re finally talking about
machine learning again!): given the features / attributes (the children), we can make

Kudrayvtsev 60
MACHINE LEARNING Supervised Learning

inferences about the parent! The maximum a posteriori class (that is, the best class
label) is then simply the best v ∈ V :
$$\arg\max_{v \in V} \prod_{i=1}^{n} \Pr[a_i \mid V = v]$$
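Using the CPTs from Figure 4.1, the whole spam classification fits in a short Python sketch (the dictionary layout and function name are mine; this is not a general-purpose implementation):

p_spam = 0.4
p_word_given_class = {          # (Pr[word | spam], Pr[word | not spam])
    "viagra":  (0.3,   0.001),
    "prince":  (0.2,   0.1),
    "udacity": (0.001, 0.1),
}

def posterior_spam(present):
    # present: the set of watched words that the message contains.
    score = {True: p_spam, False: 1 - p_spam}
    for word, (p_s, p_ns) in p_word_given_class.items():
        for is_spam, p in ((True, p_s), (False, p_ns)):
            score[is_spam] *= p if word in present else (1 - p)
    z = score[True] + score[False]              # the normalization factor
    return score[True] / z

print(posterior_spam({"viagra"}))   # close to 1: very likely spam under these CPTs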

Benefits This simple limitation on inference networks results in powerful con-


structs. There are many perks:
• inference is cheap: despite being difficult in general, inference under naïve
Bayesian assumptions makes the calculation a series of simple products
• few parameters: even with massive parameter quantities, the conditional
probability tables are small overall, growing linearly rather than exponentially
• allow estimation: we’ve only considered hard-coded CPTs and inferences
made from those, but we can actually estimate the CPTs themselves when
given labeled data. The simplest way to do this is simply by counting: to find
Pr [ai | V = v], count the number of occurrences of ai with V = v and divide it
by the total number of V s in general.
• allows classification: as already mentioned, this structure lets us both do
inference about features from classes (top-down) as well as classification from
feature values (bottom-up).
• empirically successful: with enough data, naïve Bayesian networks do an
incredible job. Apparently, Google uses it extensively.

Downsides This seems almost too good to be true. How can a truly-naïve model
that assumes that there is complete independence between attributes perform well?
There’s no way that email containing the word “prince” is no more-or-less likely to
contain the word “viagra” compared to email without the word “prince.” In fact, I
would expect spam emails targeting male genitalia to be a separate “class” of spam
than emails that target people with ties to Nigeria (though they may both target
gullible suckers).
This is exactly its downside: when there are strong inter-relationships between the
attributes, the model can’t make good inferences. However, these relationships (and
their probabilities) don’t matter so much when doing classification, since it just needs
to be “good enough,” and guess in the right direction from the limited information
that it has.
There's also a flaw in the "counting" approach: notice that if a particular attribute ai has never been seen associated with a particular v ∈ V, then the entire product
is zero. This is not intuitive and is exactly why this doesn't happen in practice: the
probabilities need to be smoothed out so that one bad apple doesn’t spoil the bunch.

PART II
Unsupervised Learning

Rather than providing labeled training pairs to an algorithm like we did before, in unsupervised learning we are focused on inferring patterns from the data alone: we want to make sense out of unlabeled data. While what we were doing earlier essentially amounted to function approximation, now we will be doing data description—finding explanations and compact descriptions for data.

Contents
5 Randomized Optimization

6 Clustering

7 Features

Randomized Optimization

People believe the only alternative to randomness is intelligent


design.
— Richard Dawkins

In an optimization problem, we're given an input space X and an objective or fitness function f : X ↦ ℝ; the goal is to find the best x∗ ∈ X such that f(x∗) = max_x f(x). This chapter discusses algorithmic ways to find said x∗.
Sometimes, calculus-based methods and even input space exhaustion can work, but
this is rare to encounter in the real world; more often than not, we’re dealing with
complex functions with massive input spaces and many local optima. To tackle this
complexity, we’ll need randomized optimization algorithms.

5.1 Hill Climbing


The most straightforward of these is the hill climbing algorithm, which randomly
guesses a starting point, then drifts to the best input within a local neighborhood of
that point.
This is not super effective since it’s highly-dependent on the starting point: it’s easy
to get stuck in a local optimum. However, we can run the algorithm many, many
times to ensure that we don’t always get stuck in the same one; even if we don’t find
the global optimum, we’ll at least get stuck in a really good local one. This is called
random-restart hill climbing.
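A minimal random-restart hill climber in Python; the neighborhood function, the restart count, and the toy objective in the usage line are arbitrary stand-ins for whatever the actual problem defines:

import random

def hill_climb(f, neighbors, x0, max_steps=1_000):
    x = x0
    for _ in range(max_steps):
        best = max(neighbors(x), key=f)      # greedily take the best neighbor
        if f(best) <= f(x):
            break                            # no improvement: local optimum
        x = best
    return x

def random_restart(f, neighbors, sample, restarts=25):
    # Keep the best local optimum found across many random starting points.
    return max((hill_climb(f, neighbors, sample()) for _ in range(restarts)), key=f)

# usage sketch: maximize f(x) = -(x - 3)^2 over the integers
best = random_restart(lambda x: -(x - 3) ** 2,
                      lambda x: [x - 1, x + 1],
                      lambda: random.randint(-100, 100))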

5.2 Simulated Annealing


Here, the idea of random hill climbing is taken a step further: we don’t always nec-
essarily need to improve to be on a “good path” to an optimum. Sure, we could hope
that another restart down the line will let us avoid the pending economic downturn,


Algorithm 5.1: A hill-climbing algorithm.

Input: (X, f ), the parameters for an optimization problem.


Input: N (·), a function that returns a list of neighbors to x ∈ X; this can be
user-defined (encoding domain knowledge) or defined as part of the
optimization problem itself.
Result: x such that f (x) is a local optimum.
x ←$ X // randomly choose x ∈ X
repeat
x_prev ← x
n∗ ← arg max_{n∈N (x)} f (n)
if f (n∗ ) > f (x) then
x ← n∗
end
until x = x_prev // stop once no neighbor improves on x
return x

but it also makes sense to simply do a little exploration of the neighboring space
beyond just the $\arg\max_{n \in N(x)}$.

This concept of simulated annealing leads to an eternal balance between exploita-


tion and exploration: in hill-climbing, we always exploit the best possible direction,
whereas now we’ll do some additional exploration.

Fun Fact: Annealing

The concept of annealing comes from metallurgy, where metals are repeat-
edly heated and cooled to increase their ductility (basically bendability).
It’s a silly thing to name an algorithm after, but this idea of “temperature”
is baked into the simulated annealing (see algorithm 5.2).

The key to simulated annealing is this idea of an “acceptance probability” function,


which lets us smoothly interpret how seriously we should take “bad” points based on
how bad they are.

$$P(x, x', T) = \begin{cases} 1 & \text{if } f(x') \ge f(x) \\ \exp\left(\dfrac{f(x') - f(x)}{T}\right) & \text{otherwise} \end{cases} \tag{5.1}$$

For high temperatures, it's likely to accept x′ regardless of how bad it is; contrarily,
a low T is likely to prefer "decent" x′s that aren't much worse than x. In other


words, T → 0 behaves like hill climbing while T → ∞ behaves like a random walk
through f . In practice, slowly decreasing T is the way to go: the “cooling factor” α
in algorithm 5.2 below controls this.

Algorithm 5.2: The simulated annealing algorithm.

Input: (X, f ), the parameters for an optimization problem.


Input: N (·), a function that returns a list of neighbors to x ∈ X; this can be
user-defined (encoding domain knowledge) or defined as part of the
optimization problem itself.
Result: x such that f (x) is a local optimum.
α ∈ [0, 1) // a “cooling factor”
x ←$ X // randomly choose a starting point x ∈ X
T ← a large initial temperature
for a finite number of iterations do
x′ ←$ N (x) // randomly choose a neighbor of x
with probability P (x, x′, T ): x ← x′ // always accept improvements, sometimes accept worse points
T ← αT
end
return x
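A compact Python sketch of algorithm 5.2, with the acceptance probability (5.1) written out as a coin flip and an arbitrary geometric cooling schedule:

import math
import random

def simulated_annealing(f, neighbors, x0, T=10.0, alpha=0.95, iters=10_000):
    x = x0
    for _ in range(iters):
        x_new = random.choice(neighbors(x))        # random neighbor of x
        # Accept with probability P(x, x', T): always if better, sometimes if worse.
        if f(x_new) >= f(x) or random.random() < math.exp((f(x_new) - f(x)) / T):
            x = x_new
        T = max(alpha * T, 1e-12)                  # cool down (and avoid dividing by 0)
    return x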

Simulated annealing comes with an interesting fact: we can actually determine the
overall probability of ending at any given input. Namely,

$$\Pr[\text{end at } x] = \frac{\exp\left(\frac{f(x)}{T}\right)}{z_T} \qquad z_T \text{ is just a normalization term}$$
Notice that this1 is actually in terms of the fitness of x, so the probability of ending
at a particular input is directly proportional to how good it is! That’s a really nice
property.

5.3 Genetic Algorithms


Our third class of random optimization techniques is called genetic algorithms.
They take their inspiration from biology:
• We begin with populations of individuals—these are our various input sampling
points.
• They undergo mutations—this is a local search (much like the N (x) neighbor-
hoods from before).
1
This probability distribution is called the Boltzmann distribution and is used in physics.


• Then, there’s cross-over, where different population groups have their attributes
combined together to hopefully produce something novel and better-performing
(like sexual selection).
• This happens over generations—this is simply the number of iterations of im-
provement.
The biggest deviation (and not the sexual kind) from what we’ve seen thus far is
the idea of cross-over; otherwise, each population essentially operates like a random-
restart in parallel.

5.3.1 High-Level Algorithm


We typically start with a randomly-generated initial population, P0 , of some size k.
Then, we repeat an “evolution” process until convergence:
1. Compute the fitness of all x ∈ Pt .
2. Select the “most fit” individuals. This is where things can vary: under a strategy
of truncation selection, you might just select the top half of individuals; on
the other hand, under a roulette wheel strategy, you might base it off of
a weighted probability so that low-fitness individuals still have a chance of
sticking around in the gene pool. This is another place where we encounter the trade-off of exploitation vs. exploration.
3. Pair up individuals, replacing the “least fit” individuals (those that weren’t
chosen in the previous step) via cross-over and mutation.

5.3.2 Cross-Over
Again, we’ve (intentionally) left cross-over as a “black box.” Some concrete examples
might help expand on what it means and how it can be useful.

Example: Bitstrings
Suppose, as we’ve been doing, that our input space is bitstrings. Let X be the set of
8-bit strings, and we’re performing a cross-over with the following two “fit” individuals:

01101100
11010111

Perhaps a strategy might be to combine halves:

    M : 0110 | 1100
    D : 1101 | 0111

resulting in two "offspring": 01100111 and 11011100.


This strategy obviously encodes some important assumptions: the locality of the
bits themselves has to matter; furthermore, it assumes that these “subspaces” can be
independently optimized.
Suppose this previous assumption (bit locality) is incorrect. We can cross over in a
different way, then: at each position, let's either keep the parents' bits or swap
("flip") them between the two offspring, at random.

    M : 01101100
    D : 11010111
        kkkkfkfk
        ————————
        01100110
        11011101

where k/f correspond to "keep" and "flip," respectively. This is called uniform
crossover.
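As a quick sketch of that idea in code (the 50/50 swap probability is just an assumed default; it's tunable):

```python
import random

def uniform_crossover(mom, dad, swap_prob=0.5):
    """Produce two offspring: at each position, either keep the parents' bits
    in place or swap ("flip") them between the two children."""
    child_a, child_b = [], []
    for m_bit, d_bit in zip(mom, dad):
        if random.random() < swap_prob:      # "flip": swap this position
            m_bit, d_bit = d_bit, m_bit
        child_a.append(m_bit)
        child_b.append(d_bit)
    return child_a, child_b

print(uniform_crossover([0, 1, 1, 0, 1, 1, 0, 0], [1, 1, 0, 1, 0, 1, 1, 1]))
```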

5.3.3 Challenges
Representing an optimization problem as input to a genetic algorithm is difficult.
However, if done correctly, it’s often quite effective. It’s commonly considered to be
“the second-most effective approach” in a machine learning engineer’s toolbelt.

5.4 MIMIC
The algorithms we’ve looked at thus far: hill climbing, simulated annealing, and
genetic algorithms all feel relatively primitive because of a simple fact—they always
just end at one point. They learn nothing about the space they’re searching over,
they remember nothing about where they’ve been, and they build no recollection over
the underlying probability distribution that they’re searching over.
The main principle behind MIMIC (Isbell '97) is that it should be possible to directly model
a probability distribution, successively refine it over time, and that should lead to a
semblance of structure.
The probability distribution in question depends on a threshold, θ. It’s formulated
as:
$$
\Pr_\theta[x] =
\begin{cases}
  \dfrac{1}{z_\theta} & \text{if } f(x) \ge \theta \\[4pt]
  0 & \text{otherwise}
\end{cases}
\tag{5.2}
$$
Namely, we’re basically only considering samples x ∈ X, uniformally, that have a
“good enough” fitness level.


5.4.1 High-Level Algorithm


Notice that Pr_{θ^max}[x] spans only the optima of f. Contrarily, Pr_{θ^min}[x] is simply
the uniform distribution over X. MIMIC will start from this latter baseline, then
iteratively improve, hoping to eventually reach the optima: Pr_{θ^min}[x] ⇝ Pr_{θ^max}[x].
From a high level, the algorithm is as follows:
1. First, generate samples from the current Pr_{θ_t}[x].
2. Then, set θ_{t+1} to be the n-th percentile of the dataset, like the truncation selection
strategy of genetic algorithms.
3. Now, given only the samples for which f(x) ≥ θ_{t+1}, we estimate the new Pr_{θ_{t+1}}[x].
4. Finally, repeat from Step 1 until θ meets some thresholding criterion (i.e., it stops improving).

This is quite similar to genetic algorithms on the surface, but the way we represent
our probability distribution at step t will encode structure and meaning about
our search space.

There are some key underlying assumptions here: first of all, estimating a probability
distribution needs to be possible; second, we are hoping that choosing good samples
from Pr_{θ_t}[x] actually also gives us good samples at Pr_{θ_{t+1}}[x] (a little more
rigorously, we're basically assuming that Pr_θ[x] ≈ Pr_{θ+ε}[x]).
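A minimal sketch of this loop follows; note that, for brevity, it fits independent per-bit marginals instead of the dependency trees described next (making it closer to UMDA/PBIL than true MIMIC), and the fitness function is an illustrative assumption.

```python
import numpy as np

def mimic_simplified(f, n_bits, n_samples=200, keep_frac=0.5, iters=30, seed=0):
    """A MIMIC-style loop over bitstrings. This sketch fits independent per-bit
    marginals; real MIMIC would replace the `probs = elite.mean(...)` line with
    the dependency-tree estimation described below."""
    rng = np.random.default_rng(seed)
    probs = np.full(n_bits, 0.5)      # Pr_{theta_min}[x]: uniform over all bitstrings
    best = None
    for _ in range(iters):
        samples = (rng.random((n_samples, n_bits)) < probs).astype(int)  # step 1
        fitness = np.array([f(s) for s in samples])
        theta = np.quantile(fitness, 1 - keep_frac)                      # step 2
        elite = samples[fitness >= theta]
        probs = elite.mean(axis=0)                                       # step 3
        top = samples[fitness.argmax()]
        if best is None or f(top) > f(best):
            best = top
    return best

# Illustrative fitness: count of 1-bits ("one-max").
print(mimic_simplified(lambda bits: int(bits.sum()), n_bits=20))
```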

5.4.2 Estimating Distributions


Given a feature vector $\mathbf{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}$, the probability of seeing all particular
features is the joint distribution over all of those features:

$$\Pr[\mathbf{x}] = \Pr[x_1 \mid x_2, \ldots, x_n]\,\Pr[x_2 \mid x_3, \ldots, x_n]\cdots\Pr[x_n]$$

Obviously this is typically computationally-infeasible to calculate directly. However,


we can make some assumptions to make it easier to calculate. Namely, we’re going to
craft dependency trees, which are a special case of Bayesian networks that are trees:
$$\widehat{\Pr}_\pi[\mathbf{x}] = \prod_i \Pr[x_i \mid \pi(x_i)]$$

Here, π(·) represents the "parent" of a node in the dependency tree. Since each
node has exactly one parent (aside from the root, for which π(x₀) = x₀), the table of
conditional probabilities stays much smaller than the full joint above (specifically,
its size is quadratic in the number of features).
Why dependency trees, though? Well, they let us represent relationships between
variables, and they do so compactly. It’s worth noting that we’re not stuck using


dependency trees; they’re simply a means to an end of approximating the probability


distribution (Step 3 above) and underlying structure. And in fact this specific choice
of structure lets us represent similar relationships to cross-over in GAs.

Finding Dependency Trees

We'll do this in general, not referencing MIMIC or θ or any of that jazz. We have a
true probability distribution P that we want to estimate, and P̂_π is our dependency
tree formulation from above.
Recall the KL-divergence (3.11) from Information Theory, which will let us measure
the similarity between any two distributions:
$$
\begin{aligned}
D_{KL}\!\left(P \,\|\, \widehat{P}_\pi\right) &= \sum P\left(\log_2 P - \log_2 \widehat{P}_\pi\right) \\
  &= -h(P) + \sum_i h\!\left(x_i \mid \pi(x_i)\right) && \textstyle\sum p\log_2 p \text{ is just entropy, } -h(P) \\
J_\pi &= \sum_i h\!\left(x_i \mid \pi(x_i)\right)
\end{aligned}
$$

At the end, we end up with a sort of cost function Jπ that we’re aiming to minimize:
find each feature’s parent such that overall entropy is low. In other words, find parents
that give us a lot of information about their child features.
To make this J_π easy to compute—for reasons beyond my understanding—we'll subtract
a new term: the entropy of each feature. This is okay because Σᵢ h(xᵢ) doesn't depend
on π(·) at all, so it doesn't change which π minimizes the objective.
$$
\begin{aligned}
\min_\pi J_\pi  &= \sum_i h\!\left(x_i \mid \pi(x_i)\right) \\
\min_\pi J'_\pi &= -\sum_i h(x_i) + \sum_i h\!\left(x_i \mid \pi(x_i)\right)
\end{aligned}
$$

Both of these will give us the same π(·). This term is related to the mutual information
(see (3.10) in Information Theory):
$$
\begin{aligned}
\min_\pi J'_\pi &= -\sum_i h(x_i) + \sum_i h\!\left(x_i \mid \pi(x_i)\right) = -\sum_i I\!\left(x_i;\, \pi(x_i)\right) \\
\text{or, equivalently,}\quad \max_\pi \hat{J}_\pi &= \sum_i I\!\left(x_i;\, \pi(x_i)\right)
\end{aligned}
$$

In English, what we’ve come to show is that in order to maximize the similarity
between a true probability distribution and our guess (I know it’s been a while, but this is


what we started with), we want to maximize the mutual information between each feature
and its parent.
Finally, we need to actually calculate this optimization. This equation can actually
be represented by a fully-connected graph: the nodes are the x_i features and the
edges are the mutual information I(x_i; x_j) between them:

x3

x1 x2

x4 x5

And since we want to find the tree that maximizes the total mutual information,
this is equivalent to finding the maximum spanning tree. Since more-traditional
algorithms find minimum spanning trees, this is equivalent to just inverting (negating)
the I edge weights; and since this is a fully-connected graph, Prim's algorithm is the
way to go.
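A sketch of this estimation step, assuming scikit-learn and NetworkX as tooling for mutual information and spanning trees (the toy binary dataset is illustrative):

```python
import numpy as np
import networkx as nx
from sklearn.metrics import mutual_info_score

def dependency_tree(samples):
    """Fit a MIMIC-style dependency tree: build the fully-connected graph of
    pairwise mutual information between (discrete) features, then take its
    maximum spanning tree."""
    n_features = samples.shape[1]
    g = nx.Graph()
    for i in range(n_features):
        for j in range(i + 1, n_features):
            g.add_edge(i, j, weight=mutual_info_score(samples[:, i], samples[:, j]))
    return nx.maximum_spanning_tree(g)   # equivalently: a minimum spanning tree on -I

# Illustrative usage: feature 1 depends on feature 0, feature 2 is independent noise.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, size=500)
noise = (rng.random(500) < 0.1).astype(int)
samples = np.column_stack([x0, x0 ^ noise, rng.integers(0, 2, size=500)])
tree = dependency_tree(samples)
print(sorted(tree.edges(data="weight")))
```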

Coming Full Circle


Now that we understand how to estimate a probability distribution as well as pull
samples from it, we can turn the high-level algorithm into reality: our Pr_{θ_t}[x] is simply
P̂_π, and estimation is the spanning tree over mutual information.

5.4.3 Practical Considerations


MIMIC does well with structure: when the problems you’re trying to solve
depend on relationships between features rather than specific feature values, MIMIC
can handle them well; MIMIC does not care about “ties” in optima.
If a fitness function can’t be represented by a probability distribution for all possible
θs, MIMIC struggles. Furthermore, it still can get stuck in local optima. Finally,
time complexity is an issue—though in practice MIMIC takes orders of magnitude
fewer iterations, a single iteration of MIMIC takes far more time.
If information about the problem is important, though, MIMIC is the only
way to go since it provides, as we've said time and time again, structure. If evaluating
the fitness function f(x) has a high cost, MIMIC can avoid far more evaluations
relative to the other randomized algorithms.


Clustering

Wash your hands often with soap and water for at least 20 sec-
onds, especially after blowing your nose, coughing, or sneezing,
or having been in a public place.
Wash your hands after touching surfaces in public places.
Avoid touching your face, nose, eyes, etc.
Avoid crowds, especially in poorly ventilated spaces.
— Center for Disease Control, People at Risk for COVID-19

A classic unsupervised learning problem is clustering: given a set of objects,
we want to divide them into groups. For example, take a look at the following set
of points (straight from the section on Support Vector Machines, but without
the margins):

Here, they’re represented as belonging to the blue or pink categories, but this obvi-
ously is only done visually—if we knew their labels, we’d use a supervised technique.


How do we cluster them into these two groups mathematically? Let’s define the
problem more rigorously:
The clustering problem:
Input: A set of objects: X
A distance metric D(·, ·), defining inter-object distances such
that D(x, y) = D(y, x) where x, y ∈ X .
Output: A partitioning of the objects such that PD (x) = PD (y) if x and y
belong to the same cluster.
This notion of similarity being defined by distance should remind us of k-nearest
neighbor. Given this definition, we can come up with some trivial partitioning
schemes: we can just stuff every element into its own cluster, or even all in the
same cluster. The problem definition does not measure or differentiate between a
“good” or “bad” clustering.
Because of this loose definition for clustering, its solutions have a high variance and
are very algorithm-driven; thus, each algorithm can be analyzed independently.

6.1 Single Linkage Clustering


This is (subjectively) the simplest and most natural clustering algorithm; it’s very
easy to understand. Given an input k that describes the goal number of clusters:
• We start by treating each object as a cluster, so we start with n clusters.
• We’ll define the inter-cluster distance as being the closest-possible distance be-
tween the two clusters (that is, the minimum distance between all pairs of points
in two clusters).
• Merge the two closest clusters.
• Repeat n − k times to make k clusters.
This algorithm has some interesting properties. It's deterministic, and it actually
equates to a minimum spanning tree algorithm if we treat distances as edge lengths.
It runs with O(n³) time complexity: we need to evaluate and compare n² pairs of
points at least k times (which, in the worst case, is n). (This is true for the simplest
possible algorithm; it can absolutely be improved, but definitely not beyond some
factor of n² since every pair of points needs to be considered.)
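A quick sketch using SciPy's hierarchical clustering as assumed tooling (single linkage is its method="single" option); the toy two-blob dataset is illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def single_linkage(points, k):
    """Single linkage clustering: repeatedly merge the two clusters whose
    closest pair of points is the smallest, until k clusters remain."""
    merges = linkage(points, method="single")            # the n-1 merge steps
    return fcluster(merges, t=k, criterion="maxclust")   # cut the tree at k clusters

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
print(single_linkage(points, k=2))
```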

6.1.1 Considerations
If there was a perfect clustering algorithm, we would’ve just covered it and not both-
ered with enumerating them.


[Figure 6.1: A comparison of how single linkage clustering results may differ from intuition (here, k = 2 clusters). (a) A set of points we'd like to cluster. (b) A (likely) intuitive clustering. (c) The actual SLC result.]

Consider the set of points in Figure 6.1a: notice that SLC does not result in an
intuitive clustering. When we consider the “merge closest clusters” behaviour, the
inner four points on the left clump would always be too far relative to all other
possible pairs.

6.2 k-Means Clustering


Clustering encounters a bit of a “chicken and egg” problem:
• If we knew where the cluster centers should be, how would we determine which
points to associate with each ci ? Naturally, we’d choose the closest ci for each
point p.
• If we knew the cluster memberships, how do we get the centers? Naturally, we’d
make cᵢ the mean of all of the points in the cluster.²
In both cases, we can find one if we have the other, but we start with neither. . . This
conundrum is solved by the k-means clustering algorithm (described formally in
algorithm 6.1) by starting with random cluster centers and iteratively improving on
them; it can be thought of as a quantization of the feature space. k-means requires
just a few basic steps:
1. Pick k random cluster centers.
2. Associate each point with its closest center point.
3. Recompute the centers by averaging the above points.
4. Repeat from Step 2 until convergence—that is, when the cluster centers are no
longer moving. (A code sketch follows below.)

² The mean (or average) actually minimizes the SSD (sum of squared differences), since you are implicitly assuming that your data set is distributed around a point with a Gaussian falloff; using the mean thus maximizes the likelihood.
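A minimal NumPy sketch of these steps (the guard against empty clusters and the toy data are assumptions of this sketch, not part of the algorithm proper):

```python
import numpy as np

def k_means(points, k, iters=100, rng=None):
    """Plain k-means: assign each point to its nearest center, then move each
    center to the mean of its assigned points, until nothing changes."""
    rng = np.random.default_rng(rng)
    centers = points[rng.choice(len(points), size=k, replace=False)]  # 1. random centers
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                                 # 2. closest center
        new_centers = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])                                                            # 3. recompute centers
        if np.allclose(new_centers, centers):                         # 4. converged?
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
centers, labels = k_means(pts, k=2, rng=1)
print(centers)
```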


[Figure 6.2: The result of applying k-means on the same set of points that SLC struggled with in Figure 6.1, given that the two white-circled points were selected as the random cluster centers.]

6.2.1 Convergence
The main benefit of k-means clustering is that it’s an incredibly simple and straight-
forward method; as you can see in algorithm 6.1, it requires a small handful of trivial
operations. The secondary benefit is that it actually provably converges to a local
(not a global, mind you) minimum, guaranteeing a certain level of “correctness.”
The “proof” is described in this math aside for those interested. The TL;DR summary
results in the following succinct properties:
• Each iteration takes polynomial time: O(kn).
• There’s a finite–though exponential–number of iterations: O(k n ).
• The error decreases if ties are broken consistently.
• It can get stuck in local minima.
This latter point can be avoided (well, alleviated) much like with random hill climbing:
by using random restarts and well-distributed starting points.

6.2.2 Considerations
There are some significant downsides, though, the biggest of which is that k must
be specified in advance. It’s a fairly memory intensive algorithm since the entire set
of points needs to be kept around. Furthermore, it’s sensitive to initialization
(remember, it finds the local minimum) and outliers since they affect the mean
disproportionately.
Another important downside to k-means is that it only finds “spherical” clusters. In
other words, because we rely on the SSD as our “error function,” the resulting clusters
try to balance the centers to make points in any direction roughly evenly distributed.
This is highlighted in Figure 6.3.
The red and blue segments are artificially colored to differentiate them in feature
space, and as humans we would intuitively group them into those clusters, but
k-means' preference for spherical clusters results in an incorrect segmentation.


[Figure 6.3: A problem of k-means clustering is that it prefers "spherical" clusters, resulting in an unintuitive (and partially incorrect) segmentation of the red and blue regions.]

Quick Maffs: Proving k-means’ Optimality

We’ll walk through the proof by considering k-means in Euclidean space.


Our notation is as follows:

• P^t(x): the partition (or cluster label) of the point x at time/iteration number t.
• C_i^t = {x : P^t(x) = i}: the set of all points belonging to cluster i.
• c_i^t: the center of a particular cluster i on the t-th iteration, defined as
the average of the points.

The algorithm is then a two-way updating process:

$$P^t(x) = \arg\min_i \left\lVert x - c_i^{t-1} \right\rVert_2^2 \qquad\qquad c_i^t = \frac{\sum_{y \in C_i^t} y}{\left|C_i^t\right|}$$

(update the partitions from the centers, then the centers from the partitions, incrementing t each round)

Consider this process as an optimization problem: the goal is to minimize


the total “error,” which is the total distance between points and their clus-
ter center. This process is very similar to hill climbing: we’re iteratively
improving this score by moving in a better direction.
Consider how partitions are updated: by minimizing the Euclidean distance,
the overall error can never go up. Similarly, by choosing the center as the
average, we’re also minimizing squared error.


Algorithm 6.1: The k-means clustering algorithm.

Input: A set of points, P.
Result: A set of cluster centers and their constituent points.

C = {c₁, ..., c_K}        // Randomly initialize cluster centers.
V = [∅₁, ..., ∅_K]        // A set of points corresponding to each cluster.
repeat
    foreach p ∈ P do
        Find the closest cᵢ
        V[i] ← p          // Put p into cluster i.
    end
    // Given the cluster points, solve for each cᵢ by setting it to the mean.
    foreach cᵢ ∈ C do
        cᵢ = (Σⱼ V[i]ⱼ) / |V[i]|
    end
until no cᵢs change

The further rationale is as follows: since there are a finite number of configurations
(clusters and partitions), and we're never increasing in error, and we break ties
consistently, we guarantee convergence to a local minimum; in the worst case, we
will eventually enumerate all O(kⁿ) of the configurations.

6.3 Soft Clustering


Consider how k-means behaves on the following set of points, given that k = 2:
[Figure: two groups of points with a single ambiguous point, marked "?", halfway between them.]

Obviously, the middle point can end up with either cluster depending on how ties
are broken and how the initial centers are chosen. Wouldn’t it be nice, though, if we
could put such "ambiguous" points into both clusters?
This idea precipitates the return of probability: each point now belongs to each of
the possible clusters with some probabilistic certainty. As with all things probability,
though, it relies on a fundamental assumption: in this case, we’re assuming that our
n points were selected from k uniformly-selected Gaussian distributions.
The goal is then to find a hypothesis of means, h = {µ1 , . . . , µk }, that maximize the


likelihood (sound familiar?) of the data.

Single ML Gaussian Suppose k = 1 for the simplest possible case. What Gaussian
maximizes the likelihood of some collection of points? Conveniently (and intuitively),
it’s simply the Gaussian with µ set to the mean of the points!

Extending to Many. . . With k possible sources for each point, we’ll introduce
hidden variables for each point that represent which cluster they came from. They’re
hidden because, obviously, we don’t know them: if we knew that information we
wouldn’t be trying to cluster them. Now each point x is actually coupled with the
probabilities of coming from the clusters:
 
$$\mathbf{x} = \begin{bmatrix} x & z_1 & z_2 & \cdots & z_k \end{bmatrix}$$

6.3.1 Expectation Maximization


Under the above formulation of the soft clustering problem, we get an algorithm that
looks remarkably similar to k-means clustering. There are two parts iterating hand-
in-hand: adjusting the cluster probabilities zi and the chosen mean µ. The process is
called expectation maximization. In fact, k-means is a special case of expectation
maximization: variances are all equal, and there is no covariance.
$$E[z_{ij}] = \frac{\Pr[X = x_i \mid \mu = \mu_j]}{\sum_{j=1}^{k} \Pr[X = x_i \mid \mu = \mu_j]} \qquad\qquad \mu_j = \frac{\sum_i E[z_{ij}]\, x_i}{\sum_i E[z_{ij}]}$$

The left equation is the expectation step (defining Z from μ); the right is the maximization step (defining μ from Z),

where every Pr[X = x_i | μ = μ_j] is simply defined by the Gaussian:

$$\Pr[X = x_i \mid \mu = \mu_j] = \exp\left(-\frac{1}{2}\sigma^2 (x_i - \mu_j)^2\right)$$

At each step, the expectation of a particular cluster is weighed by its probability of


lying at the cluster center; then, the cluster center is chosen by weighing all of the
points by their respective clusters (the denominators are just normalization factors).
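A sketch using scikit-learn's Gaussian mixture model as assumed tooling, which runs exactly this expectation-maximization loop and exposes the soft memberships E[z_ij] directly; the 1-D toy data and the ambiguous middle point are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two 1-D Gaussian "sources" plus one ambiguous point halfway between them.
points = np.concatenate([rng.normal(0, 1, 50), rng.normal(8, 1, 50), [4.0]]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(points)
soft_memberships = gmm.predict_proba(points)   # E[z_ij]: one probability per cluster
print(gmm.means_.ravel())                      # the learned cluster means mu_j
print(soft_memberships[-1])                    # the ambiguous point splits its membership
```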

6.3.2 Considerations
Like with k-means, the initial means (which are akin to the cluster centers) are chosen
randomly. Unfortunately, it doesn’t come with the same convergence guarantees:
though it will never get worse, it might get better at an increasingly-slower rate. In


[Figure 6.4: An application of soft clustering to k = 2 clusters, with green points indicating a reasonable level of uncertainty (±10%) of which cluster they belong to. (a) The initial, randomly-chosen cluster means, marked with little ×s. (b) The clusters and the newly-adjusted centers after a single iteration. (c) The clusters and centers after another iteration. (d) The clusters and centers upon convergence.]

practice, though, you can assume convergence. Another benefit is that it works with
any distribution if the expectation maximization is solvable; we only used Gaussians
here for the simplicity of their assumptions and their generality.

6.4 Analyzing Clustering Algorithms


We’ve enumerated a handful of different clustering algorithm. How should we think
about using them, and how can we compare them a little more rigorously?

6.4.1 Properties
Though the best way to compare clustering algorithms is simply to enumerate and
analyze them, it is useful to define some desirable properties for any given approach:
Richness Ideally, our clustering algorithm would be able to associate any number
of points with any number of clusters depending on our inputs and distance
metric. Formally, richness means that there is some distance metric D such
that any clustering can work, so: ∀c, ∃D : PD = c.


Scale-invariance If we were to upscale our distances by some arbitrary constant,


the clustering should not change; intuitively, the “units” of our feature space
(miles vs. kilometers, for example) shouldn’t matter. Formally, we say that a
clustering algorithm has scale invariance if ∀D, ∀k > 0 : P_D = P_{kD}.
Consistency Given a particular clustering, we would imagine that by “compressing”
or “expanding” the points within a cluster (think squeezing or stretching a cir-
cle), none of the points would magically assign themselves to another cluster.
In other words, shrinking or expanding intracluster distances should not change
the clustering.
Consider some variations on the stopping condition to single linkage clustering (recall
that normally we repeat the merging process n − k times to get k clusters). If we
instead stop when. . .
n/2 clusters are reached, we sacrifice richness simply by the fact that we are lim-
iting the number of possible clusters.
the clusters are θ units apart, we sacrifice scale invariance, since scaling the points
by θ would result in each point belonging to its own cluster.
the clusters are θ/ω units apart, where we define ω as being the maximum dis-
tance between points, ω = max_{i,j∈X} D(i, j), we sacrifice consistency. By shrink-
ing points within a cluster, we may modify the maximum distance and thus
cause a different clustering.
This set of examples might lead to an unfortunate induction: it seems tough to achieve
all of the properties at once. Turns out it’s not only tough, it’s impossible.

Theorem 6.1 (Impossibility Theorem, Kleinberg '15). No clustering scheme can achieve all
three of: richness, scale-invariance, and consistency.

6.4.2 How Many Clusters?


A natural question follows all of the clustering algorithms we’ve enumerated: how
do we choose k, the number of clusters? The unfortunate, fundamental problem of
unsupervised clustering is this needs to be specified in advance.
Depending on what we choose as our problem’s “feature space,” we can group points
in different ways. If we were applying k-means to image segmentation (where we want
to cluster similar pixels together), for example, we could choose pixel intensity as our
feature space and discretize based on its “lightness” or “darkness”; however, we could
choose colors, hues, or even positions with just as much validity. Each of these would
likely lend itself to a different choice of k.
Thus we consider this a feature, not a bug.

Features

Since we live in a world of appearances, people are judged by


what they seem to be. If the mind can’t read the predictable
features, it reacts with alarm or aversion. Faces which don’t
fit in the picture are socially banned. An ugly countenance, a
hideous outlook can be considered as a crime and criminals must
be inexorably discarded from society.
— Erik Pevernagie, “Ugly mug offense”

The selection of features (data points) in a machine learning problem is crucial
to its success. If you wanted to decide what restaurant to eat at, a feature
describing the current political climate is useless and can only lead to a nois-
ier, less effective modeling. Thus, it’s really important to choose features carefully.
Furthermore, it’s useful to be able to transform features into something different to
really squeeze out the information-potential of each data point.

7.1 Feature Selection


When we talk about feature selection, we are talking about it from both a human
and machine perspective.

Humans A good feature lets us discover knowledge, providing us with a lot of


information about the world we’re trying to model, letting us interpret and form
insights about the given data. This has been self-evident through this entire part of
the guide: more often than not, we’ve been using 2-dimensional features to be able
to easily visualize them on the Cartesian coordinate plane and build up an intuition
that can then generalize to n dimensions.

Machines Recall the curse of dimensionality: the amount of data that we need
grows exponentially with the number of feature dimensions. Thus, careful decision-


making about our choice of features lets us build models that learn quickly and
effectively and need minimal datasets.
In summary, with proper feature selection, not only will you understand your data
better, but you will also have an easier learning problem.
Feature selection doesn’t necessarily need to be a human-driven process. We can
leverage the power of a l g o r i t h m s to reduce some set of n features to m
“important” features. However, this is a hard (actually, an NP-hard) problem: we are
choosing some m out of n possible values, and $\binom{n}{m} \in \Theta(n^m)$.


If we had some f (·) that could score a set of features, this becomes an optimization
problem. Algorithms for tackling this problem fall into two categories: filtering—
where a search process directly reduces to a smaller feature set and feeds it to a learn-
ing algorithm—and wrapping—where the search process interacts with the learning
algorithm directly to iteratively adjust the feature set.

[Diagram: filtering runs a search over the features that outputs a smaller feature set, which is then handed to the learner; wrapping puts the search and the learner in a feedback loop that iteratively adjusts the feature set.]

7.1.1 Filtering
Pros & Cons As you can imagine, the logic in the filtering approach needs to be
baked directly into the search itself. There’s no opportunity for the learning algorithm
to give feedback on the process. However, this obviously makes it work much faster:
there’s no need to wait on the learner.

Techniques How can that even work, though: aren’t we essentially looking at the
features isolated in a vacuum? Well, looking at the (supervised) labels is allowed in
filtering, you just can’t use the learner. Thus, decision trees are essentially a filtering
algorithm: we split each branch on the best available attribute by ranking their
information gain. By definition, this is a way to separate the wheat from the chaff.
This is just one such search method; we can also use things like variance, entropy,
and even concepts like linear independence and cross-feature redundancy.

7.1.2 Wrapping
Pros & Cons The natural perk of involving the learner in a feedback loop of feature
search is that model bias and learning is taken into account. Naturally, though, this
takes much longer for each iteration.


Techniques Interestingly enough, we can use approaches directly from randomized


optimization algorithms if we treat the learner’s result as a fitness function. Other
viable options include forward search and backward search.
Forward Search — This is super straightforward. We simply enumerate the features
and greedily choose their best combination. For example, on the first run, keep the
single feature that gives the highest score. Then, on the second run, keep the single
feature that pairs best with the first one. And so on, until you have kept some target
number of features or have stopped making many gains.
Backward Search — This approach is the polar opposite of the previous one. Rather
than building up with one feature at a time, start with all features and iteratively
drop the most useless one.

7.1.3 Describing Features


When it comes to determining which features we should use, it’s necessary to dif-
ferentiate them based on their viability in a general, statistical sense as well as in a
learner-specific sense.

Relevance For starters, we need a way to measure their relevance; we’ll actually
define this colloquial term rigorously: a feature xi is. . .
• strongly relevant if removing it degrades the Bayes' optimal classifier (remember, the Bayes' optimal classifier is the best way to do a classification on average);
• weakly relevant if it’s not strongly relevant (obviously) and there’s some
subset of features S such that adding xi to S improves the Bayes’ optimal
classifier; and
• irrelevant, otherwise.

Usefulness Just because a feature always only maps to a single value (which would
make it irrelevant), for example, it does not mean it’s entirely useless in all selection
scenarios.2 A feature’s usefulness is measured by its effect on a particular predictor
rather than on the Bayes’ optimal classifier.

7.2 Feature Transformation


The goal of feature transformation is to (pre-)process a set of features to create a new,
optimized feature set while retaining as much relevant information as possible. This
might sound like the same thing as feature selection, but the difference is subtle: the
resulting feature set here can be completely new; it doesn’t have to just be a subset
of the original features.

Kudrayvtsev 82
MACHINE LEARNING Unsupervised Learning

Specifically, we’re going to be targeting linear transformations, so our new feature set
will be some linear combination of the originals. This will be somewhat reminiscent
of the inspiration for the kernel trick from support vector machines: we will combine
our features to get new information about them.

7.2.1 Motivation
Consider the information retrieval problem: given an unknown search query, we want
to list documents from a massive database relevant to the query. How do we design
this?
If we treat words as features, we encounter the curse of dimensionality: there are a
lot of words. . . Furthermore, words can be ambiguous (the same word can mean
multiple [and sometimes even conflicting! think about the modern use of "literally"]
things) and can describe the same concepts in a variety of ways (different words
meaning the same thing); the linguistic terms for these are polysemy and synonymy,
respectively. Because of this, we encounter false positives and false negatives even if
we could find the documents with the words efficiently. Thus, words are an insufficient
indicator!
A good feature transformation will combine features together (like lumping words
with similar meanings or overarching concepts together), providing a more compact,
generalized, and efficient way to query things. For example, a query for the word "car"
should rank documents with the word "automobile" (a synonym), "Tesla" (a brand),
and even "motor" (an underlying concept) higher than a generic term like "turtle."
(Funnily enough, "turtle" works as a generic negative example, but "bug" doesn't; the
Volkswagen Beetle is definitely a car!)
Given this motivation via a real-world application, we can now talk about specific
algorithms in the abstract.

7.2.2 Principal Component Analysis


A component is essentially a direction in a feature space. A principal component,
then, is the direction along which points in that feature space have the greatest
variance. Thus, principal component analysis is the process of breaking down a
feature set in this way.
The first principal component of a feature set (which, at the end of the day, is just
a matrix) is the direction of maximum variance. Subsequent principal components
are orthogonal to the previous components and describe the “next” direction of
maximum residual variance.
We’ll define this more rigorously soon, but let’s explain the “direction of maximum
variance” a little more more intuitively. We know that the variance describes how
“spread out” the data is; more specifically, it measures the average of the squared


differences from the mean. In 2 dimensions, the difference from the mean (the ⊗
in Figure 7.1) for a point is expressed by its distance. The direction of maximum
variance, then, is the line that most accurately describes the direction of spread. The
second line, then, describes the direction of the remaining spread relative to the first
line.

[Figure 7.1: The first principal component and its subsequent orthogonal principal component.]

You might notice that the first principal component in Figure 7.1 resembles the line
of best fit; this is no coincidence: the direction of maximum variance reduces the sum
of squared differences just like linear regression.

These notes are pulled straight from my notes on computer vision which
cover the same topic as it pertains to images and facial recognition; they go
into the full derivation of PCA from several different angles.

To summarize the derivation, the principal components of a matrix are the


eigenvectors of the points, ranked in importance by their eigenvalues. The
eigenvectors can be found algorithmically through a bunch of methods.

Lowering Dimensionality
The idea is that we can collapse a set of points to their largest eigenvector (i.e. their
primary principal component). For example, the set of points from Figure 7.1 will be
collapsed to the blue points in Figure 7.2; we ignore the second principal component
and only describe where on the line the points are, instead.
Collapsing our example set of points from two dimensions to one doesn’t seem like
that big of a deal, but this idea can be extended to however many dimensions we


[Figure 7.2: Collapsing a set of points to their principal component, Ê_{λ₁}. The points can now be represented by a coefficient—a scalar of the principal component unit vector.]

want. Unless the data is uniformly random, there will be directions of maximum
variance; collapsing things along them (for however many principal components we
feel are necessary to accurately describe the data) still results in massive dimension-
ality savings.
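A minimal NumPy sketch of PCA via the SVD of the centered data (equivalent to taking the eigenvectors of the covariance matrix); the toy near-linear dataset is an illustrative assumption:

```python
import numpy as np

def pca(points, n_components):
    """Principal component analysis via SVD: center the data, find the
    orthogonal directions of maximum variance, and project onto them."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]               # rows are the principal directions
    return centered @ components.T, components   # coefficients + directions

rng = np.random.default_rng(0)
# Points roughly along a line (one strong direction of variance) plus noise.
t = rng.uniform(0, 30, size=100)
points = np.column_stack([t, t + rng.normal(0, 2, size=100)])
coeffs, components = pca(points, n_components=1)
print(components)   # roughly [0.707, 0.707]: the direction of maximum variance
```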

Example 7.1: Understanding Check

What eigenvectors make up the principal components in the following dataset?


Remember, we’re looking for the directions of maximum variance.

Is it the ones on the right or on the left?


aren’t even perpendicular!


Answer: The eigenvectors found by PCA would’ve be the blue vectors; the red ones

Intuitively, we can imagine the points above the blue line “cancelling out” the
ones below it, effectively making their average in-between them. Principal
component analysis works best when training data comes from a single
class. There are other ways to identify classes that are not orthogonal such
as independent component analysis, which is coming up next.

Considerations
Principal component analysis is a well-studied algorithm with some nice, provable
properties, like the fact that it gives the best reconstruction of the original data. Its
downside comes from its very nature: features with high variance don’t necessarily
correlate to features with high importance. You might have a bunch of random noise
with one useful feature, and PCA will almost certainly drop the useful feature since
the random noise has high variance.

7.2.3 Independent Component Analysis


The improvement of the independent component analysis method (Isbell & Jones '99)
over PCA is that it does not restrict itself to orthogonal components. While PCA aims to find
vectors of maximal variance (and correlation) to aid in data reconstruction, ICA aims
to maximize independence.
From a high level, given some features {x1 , x2 , . . .}, ICA aims to find new features
{y1 , y2 , . . .} such that yi ⊥ yj and I(yi , yj ) = 0 (that is, the pairs of vectors are mu-
tually independent and share minimal mutual information). It also tries to maximize
the mutual information between the xi s and yj s.

Intuition In the cocktail party problem, there are a bunch of people in a crowded,
noisy room (please don’t try this at home until the COVID-19 pandemic is over).
You want to pick out one particular voice from the crowd, but your ears are receiving
some (linear) combination of sounds from everyone who is speaking. If you move
around, the volumes of the voices change. Given enough movement, calculations, and


understanding of physics, you could isolate one of the voices based on their changes
in volume and whatnot.
This is the background for ICA: the voices are hidden variables (variables that we wish
we knew since they powered the data we see) that we’re trying to learn about, and
the overall room noise is our known data. ICA reconstructs the sounds independently
by modeling this assumption of the noise being a linear combination of some hidden
variables.

Specifics(ish) First, ICA samples each feature in the dataset to form a matrix.
Each row in the matrix is a feature, and each column is a sample. Then, it aims to
find a projection such that the above constraints hold: minimizing mutual information
between each pair of new features and maximizing information between the new
features and the old features.

Comparison ICA is heavily dependent on this assumption of the underlying hidden


variables being independent. It’s a very “local” algorithm as a result (finding features
like noses rather than global features like the average face as PCA would) and is
heavily affected by directionality.

7.2.4 Alternatives
While PCA and ICA are well-established and effective algorithms, there are other
alternatives in the space.
RCA or random component analysis generates random directions and projects
the data onto them. It works remarkably well for classification because enough
random components can still capture correlations in the data, though the re-
sulting m-dimensional space is bigger than the one found by the other methods.
Its primary benefit comes simply from being blazingly fast.
LDA or linear discriminant analysis finds a projection that discriminates based
on the label. In other words, it aims to find a collection of good linear separators
on the data.

PART III
Reinforcement Learning

Decision-making in the real world is psychologically driven by risk and reward (we'll
ignore the existence of altruism in our discussions of human behaviour because it's a
concept that doesn't make computational sense). These are concepts we've yet to
integrate into any of our algorithms, and they're
the very concepts that power reinforcement learning.

Contents
8 Markov Decision Processes

9 Game Theory


Markov Decision Processes

Reinforcement learning algorithms come up with policies that specify which
actions to take to reach a particular goal. They're often broken down into three
main parts: sensing, thinking, and acting.
To simulate the real world accurately, our digital recreations of various decision-
making scenarios will have probabilistic components: given a particular action choice,
the resulting action might not be the same. This structure leads to problems following
a Markov decision process (or MDP) as follows:
• The environment can be described by a particular state, s.
• The agent can take actions in the world based on its state: A(s).
• The model of the world is described by a transition function, T, which (probabilistically)
describes how the environment responds to the action:

$$T(s, a, s') \sim \Pr[s' \mid s, a]$$

• The actions are rewarded (or punished) based on their outcome and the resulting
state, R(s) ↦ ℝ.

The Markovian property of this problem states that only the present matters:
notice that the new state in the transition function only depends on the previous state.
Furthermore, the environment in which the agent can act stays static (walls don't
move, for example).

Given this description of the world, an agent has a policy, Π(s) ↦ a, that describes
its decision-making process, converting environment state into an action. This policy
is often computed with the goal of finding the optimal policy Π* which maximizes
the total reward.
Our reinforcement learning algorithms will learn which actions maximize the reward
and use that to define and enforce a policy. At each stage, we will receive a state-
action-reward tuple, ⟨s, a, r⟩, and eventually determine the optimal actions for any
given state.


On Rewards
In our MDP, we have no understanding of how our immediate action will lead to
things down the road. For example, suppose you're standing on an island surrounded
by hot coals, but there's the potential of a pot of gold a few steps away. Under the
Markovian property, you have no knowledge of the gold until you find it, and are
enduring a burning sensation on your feet for an unknown, delayed reward.

[Figure 8.1: Enduring the hot coals for the delayed reward of "manager."]

Because of this property, it's hard to tell which moves along the path had a meaningful
impact on the overall result. Was it one move early on that caused a failure 100 steps
down the line, or was it the step we just did while everything beforehand was perfect?
Furthermore, minor changes to rewards are really impactful. By having a small
negative reward everywhere, for example, you encourage the agent to end the game.
However, if the reward is too negative (that is, it outweighs the maximum-possible
positive reward), you might encourage the agent to instead be “suicidal” and end the
simulation as quickly as possible.
For example, if the commission for a stock trade is $100 (representing an absurdly-high
negative reward for any action) then trading penny stocks—even for a 200% profit!—
would be completely pointless. Thus, a trading simulation agent might simply buy a
stock and hold it indefinitely.

On Utility
In our discussion, we’ve been assuming an infinite horizon: our agent can live and
simulate forever and their optimal policy assumes there’s infinite time to execute its
plans.
We won’t consider this problematic assumption in this chapter because it’s too ad-
vanced for this brief introduction1 even though it’s not representative of the real
world; it’s mentioned because but it’s important to keep in mind. Imagine your au-
tonomous vacuum did nothing but circle the edge of your house because the center
of your rooms is too “unknown” and high risk?
Furthermore, we’ve been assuming that the utility of sequences is Markovian: if
we preferred a particular state today (when we started from s0 ), we actually always
prefer it:
if U(s₀, s₁, s₂, ...) > U(s₀, s₁′, s₂′, ...)


then U(s₁, s₂, ...) > U(s₁′, s₂′, ...)

These are called stationary preferences and this assumption leads to a natural
intuition of accumulating rewards by simple addition. The utility of a sequence of
states is simply the sum of its rewards:

$$U(s_0, s_1, s_2, \ldots) = \sum_{t=0}^{\infty} R(s_t)$$

However, this simple view doesn’t encode what we truly want. Consider the following
two sequences of states:

S1 = {+1, +1, +1, +1, +1, +1, +1, . . .}


S2 = {+1, +1, +1, +2, +1, +1, +2, . . .}

Isn’t the second sequence “better”? But by our above definition, they both sum up
to infinity. . . so it doesn’t matter what we do. How do we encode time into our
utility, and model the agent to prefer instant gratification? Simple: we introduce
a fractional factor γ that decreases how impactful future rewards are.

$$U(s_0, s_1, s_2, \ldots) = \sum_{t=0}^{\infty} \gamma^t R(s_t) \qquad \text{where } 0 \le \gamma < 1$$

Unlike our previous construction which grows to infinity, this geometric series actually
converges to a very useful result:
$$\sum_{t=0}^{\infty} \gamma^t R(s_t) \;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max} \;=\; \frac{R_{\max}}{1 - \gamma}$$

Notice that if we’d let γ = 1, we’d have the equation that we started with. This
model has discounted rewards, letting us converge to a finite reward after taking
infinite steps.

8.1 Bellman Equation


With this model of the world in mind, what is the optimal policy? Well, isn't it
simply the one that maximizes the expectation (average) of the reward?

$$\Pi^* = \arg\max_\Pi \; E\!\left[\,\sum_{t=0}^{\infty} \gamma^t R(s_t) \;\middle|\; \Pi\right]$$


The utility of a particular state, then, is the expected (discounted) reward of the
sequence of states that we'll see given that policy, starting at that state:

$$U^\Pi(s) = E\!\left[\,\sum_{t=0}^{\infty} \gamma^t R(s_t) \;\middle|\; \Pi,\, s_0 = s\right]$$

Note the critical point that the reward at a state is not the same as the utility of
that state (immediate vs. long term): R(s) ≠ U(s). (We'll use the notation U(s) to
refer to the "true" utility of a state; this is the utility when following the optimal
policy, so U(s) := U^{Π*}(s).) Under this notion of the optimal utility, then, we now
want our policy to return the action that maximizes the expected utility:

$$\Pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} T(s, a, s')\, U(s')$$

This feels a little circular, but when we unroll it, a recursive pattern emerges: the
true utility of a state is simply its immediate reward plus all discounted future rewards
(utility):

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} T(s, a, s')\, U(s') \tag{8.1}$$

This final equation is the key equation of reinforcement learning. It’s called
the Bellman equation and fully encodes the utility of a state in an MDP. The rest
of this chapter will simply be different ways to solve the Bellman equation.

8.2 Finding Policies


The Bellman equation (8.1) gives us n equations in n unknowns (one per state), but
they're not linear equations. Thankfully, that doesn't mean they're unsolvable.

Value Iteration The key to an approach for solving this equation will be through
repeated iteration until convergence: we’ll start with arbitrary utilities, then update
them based on their neighbors until they converge.
$$\hat{U}_{t+1}(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, \hat{U}_t(s')$$

where Û₀(s) is simply a set of random values. However, because the transitions and rewards
are actually true, we'll eventually converge by overwhelming our initially-random
guesses about utility. This simple process is called value iteration; the fundamental
reason why value iteration works is because rewards propagate through their neighbors.
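A minimal NumPy sketch of value iteration; the tiny 3-state chain MDP is an illustrative assumption:

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    """Value iteration for an MDP. T has shape (S, A, S): T[s, a, s'] is the
    transition probability; R has shape (S,). Returns utilities and a policy."""
    n_states = len(R)
    U = np.zeros(n_states)                       # arbitrary starting utilities
    while True:
        # Q[s, a] = R(s) + gamma * sum_s' T(s, a, s') * U(s')
        Q = R[:, None] + gamma * (T @ U)
        U_new = Q.max(axis=1)                    # Bellman update, equation (8.1)
        if np.max(np.abs(U_new - U)) < tol:
            return U_new, Q.argmax(axis=1)       # utilities and the greedy policy
        U = U_new

# Illustrative 3-state chain: state 2 is a rewarding state that loops onto itself.
T = np.zeros((3, 2, 3))
T[0, 0, 0] = T[1, 0, 0] = 1.0    # action 0: move "left"
T[0, 1, 1] = T[1, 1, 2] = 1.0    # action 1: move "right"
T[2, :, 2] = 1.0
R = np.array([0.0, 0.0, 1.0])
print(value_iteration(T, R))
```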


Policy Iteration This isn’t the only way to do this, and it gives us an intuition
about the alternative. We want a policy; it maps states to actions, not utilities. And
though we can use utilities to find actions, it’s way more work than is necessary; the
optimal action would be sufficient.
With policy iteration, we'll start with a guess Π₀ of how we should act in the world.
Following this policy will result in a particular utility, so we can find U_t = U^{Π_t}. Then,
given that utility, we can figure out how to improve the policy by finding the
action that maximizes the expected utility:

$$\Pi_{t+1}(s) = \arg\max_{a \in A(s)} \sum_{s'} T(s, a, s')\, U_t(s')$$

Of course, finding U_t is just a matter of solving the Bellman equation (8.1), except
the action is already encoded by the policy:

$$U_t(s) = R(s) + \gamma \sum_{s'} T(s, \Pi_t(s), s')\, U_t(s')$$

With this formulation, we have n linear equations (the max over a is gone) in n unknowns,
which is easily solvable with linear algebra libraries.

8.3 Q-Learning
Critically, solving an MDP is not the same thing as reinforcement learning! Previ-
ously, we knew everything there is to know about the world: the states, the transitions,
the rewards, etc. In reality, though, a lot of that is hidden from us: only after taking
(or even trying to take) an action can we see its effect on the world (our new state)
and the resulting reward.
When thinking about reinforcement learning, there are a lot of moving parts that go
into discovering the policy:
• a planner turns a model (T, R) into a policy;
• a learner turns transitions ⟨s, a, r, s′⟩ into a policy;
• a modeler turns transitions into a model; and
• a simulator turns a model into transitions.


We’ll be “combining” these pieces to build a model-free reinforcement learner, fo-


cusing on a value-based approach that learns utilities to relate states to actions:
model → simulator → transitions → policy
| {z }
learner

From the ashes of our original value and policy equations,


$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} T(s, a, s')\, U(s')$$
$$\Pi(s) = \arg\max_{a \in A(s)} \sum_{s'} T(s, a, s')\, U(s')$$

rises a new challenger called the Q-function, relating the value of arriving at the state
s and leaving via action a:
$$Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s') \max_{a' \in A(s')} Q(s', a') \tag{8.2}$$

Notice that we can redefine our original equations in terms of Q:


Utility For our utility value, we have the immediate reward and optimal behavior
thereafter, whereas with Q we have the immediate reward and the optimal
behavior thereafter with a specific action, so the parallel is incredibly simple:
$$U(s) = \max_a Q(s, a)$$

Policy On the other hand with policies, we’re looking for the action itself that
maximizes the value, so the parallel is just as simple:
$$\Pi(s) = \arg\max_a Q(s, a)$$

The foundation behind Q-learning is using data that we learn about the world as
we take actions in order to evaluate the Bellman equation (8.1). The fundamental
problem, though, is that we don't have R or T, but we do have samples about the
world: ⟨s, a, r, s′⟩ is the reward and new s′tate that result from taking an action in a
state. Given enough of these tuples, then, we can estimate this Q-function:³

$$\langle s, a, r, s'\rangle:\qquad \hat{Q}(s, a) \xleftarrow{\;\alpha_t\;} r + \gamma \max_{a' \in A(s')} \hat{Q}(s', a')$$

where α_t is updated over time.⁴ This simple updating rule is guaranteed to converge
to the true value of the Q-function (8.2) with the huge caveat that (s, a) must be
visited infinitely often:

$$\lim_{t \to \infty} \hat{Q}(s, a) = Q(s, a)$$
³ The non-standard notation f ←α− x is defined here as a simple linear interpolation: f = (1 − α)f + αx, where 0 ≤ α < 1.
⁴ Specifically, it must follow these rules for convergence to hold: Σ_t α_t = ∞ yet Σ_t α_t² < ∞. A function like α_t = 1/t holds this property.


Notice that we’re leaving a bunch of questions unanswered, and the answers to them
are what make Q-learning a family of algorithms:
• How do we initialize Q̂? We could initialize it to zero, or with small random
values, or negative values, or. . .
• How do we decay α_t? We've seen that α_t = 1/t follows the convergence rules we
need, but are there other alternatives?
• How do we choose actions? Neither always choosing the optimal action nor a
random action actually incorporates what we’ve learned into our decision. For
this, we can actually use Q̂, but we need to be careful: relying on our (potentially
faulty, unconverged) knowledge of actions might lead to us simply reinforcing
the wrong actions.
For this last point, we want to sometimes “explore” suboptimal actions (recall our dis-
cussion of exploration vs. exploitation in Simulated Annealing) to see if they actually
are better than what we know so far. To formalize this, we’ll let ε be our “exploration
factor,” and then:

$$\hat{\Pi}(s) = \begin{cases} \arg\max_{a \in A(s)} \hat{Q}(s, a) & \text{with probability } 1 - \varepsilon \\ \text{a random action} & \text{otherwise} \end{cases}$$

With this approach, we can guarantee that (with infinite iterations) all possible state-
action pairs will be explored.
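A minimal sketch of tabular ε-greedy Q-learning; the toy environment (a 3-state chain whose step function hands back rewards on arrival), the α_t = 1/(visit count) decay, and the episode cap are all illustrative assumptions:

```python
import numpy as np

def q_learning(env_step, n_states, n_actions, episodes=2000, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular epsilon-greedy Q-learning. `env_step(s, a)` is assumed to return
    (reward, next_state); alpha decays as 1 / (number of visits to (s, a))."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = 0                                     # assumed start state
        for _ in range(50):                       # cap episode length
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))  # explore
            else:
                a = int(Q[s].argmax())            # exploit
            r, s_next = env_step(s, a)
            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next].max())
            s = s_next
    return Q

# Illustrative 3-state chain: action 1 moves right, earning a reward upon reaching state 2.
def env_step(s, a):
    s_next = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return (1.0 if s_next == 2 else 0.0), s_next

Q = q_learning(env_step, n_states=3, n_actions=2)
print(Q.argmax(axis=1))    # the greedy policy: always move right
```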

Game Theory

The implication of game theory is that the freedom of choice of


any one state is limited by the actions of the others.
— Kenneth Waltz, Man, the State, and War: A Theoretical Analysis

This field comes from a world far beyond computer science, but actually has
important ties to reinforcement learning in the way we design problems, make
decisions, and perceive the "economic utility" of the world.
principle behind game theory is that you are not alone: you are often collaborating
and competing with other agents in the world to achieve various (and potentially
conflicting) goals. That’s the key of this chapter: we’re moving from a world with
a single agent to one with multiple agents; then, we’ll tie it back to reinforcement
learning and even the Bellman equation.

9.1 Games
Let’s consider an extremely simple example of a “game”
in which we introduce a second agent. At each point in 1

time, Agent A makes an action choice, and then Agent L R

B gets a choice. Eventually, this series of choices leads


2 3
to some final reward for A and the opposite reward for L R
B. The game tree is shown on the right: we start with M R

A’s choice, then B’s at the tree’s next level, and so on. +7 +3 4 +2

L R
This is a 2-player, zero-sum, finite, deterministic game
with perfect information. That’s quite a mouthful of a −1 +4

description, but each term should be easily digestible on


its own.
It’s quite possibly the simplest game we could have devised. Given this simple game,
we can devise strategies (akin to policies in MDPs). How many strategies are there
for each player? Well, Agent A only gets two opportunities to take action, and each


opportunity has two options available, so there are four total strategies. Agent B
likewise has two opportunities, but there are three options in one case and only one
in the other, so there are three total strategies.
We can actually represent these strategies as a matrix, with each cell being the final
reward value for Agent A. The rows are A's strategies (its choices at node ① and, if
reached, node ④); the columns are B's strategies (its choices at nodes ② and ③):

                 (②L, ③R)   (②M, ③R)   (②R, ③R)
    (①L, ④L)        7          3          -1
    (①L, ④R)        7          3           4
    (①R, ④L)        2          2           2
    (①R, ④R)        2          2           2
In fact, this matrix represents the entire game tree. Which strategy (row) then, should
Agent A choose? Knowing that Agent B will always choose the minimum value in
the row, it should choose the second row. If Agent B gets to choose, though, it wants
the column with the smallest maximum value, thus choosing the second column.
This two-way process of picking a strategy such that it minimizes the impact your
opponent could have is called minimax. In our above matrix, following the minimax
strategy, the “value” of the game is simply 3 (the intersection of the best row for
Agent A and best column for Agent B).
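A quick NumPy check of that reasoning on the matrix above:

```python
import numpy as np

# The reward matrix above, from Agent A's perspective.
rewards = np.array([[7, 3, -1],
                    [7, 3,  4],
                    [2, 2,  2],
                    [2, 2,  2]])

minimax = rewards.min(axis=1).max()   # A maximizes its worst-case (row) outcome
maximin = rewards.max(axis=0).min()   # B minimizes A's best-case (column) outcome
print(minimax, maximin)               # both are 3: the "value" of the game
```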

Theorem 9.1 (Von Neumann Theorem). In a 2-player, zero-sum, finite game


with perfect information, following the minimax strategy is equivalent to following
the inverse “maximin” strategy (e.g. Agent A minimizing its reward):

minimax ≡ maximin

and there always exists an optimal pure strategy for each player.

This notion of optimality assumes a fundamental approach to the game: all agents
are behaving optimally, and thus are always trying to maximize their respective re-
ward.

9.1.1 Relaxation: Non-Determinism


Let’s introduce chance to our games. The
cells in the game tree on the right in-
dicate “chance nodes,” in which each branch .5 .5

corresponds to a probability resulting from a +4 −20 +3


(weighted) coin flip. To clarify, if Agent A
goes left, s/he then has a 50/50 chance of ei-
.8 .2

ther getting a +4 or −20 reward (hence, a −5 +10

Kudrayvtsev 97
CHAPTER 9: Game Theory

4(0.5) − 20(0.5) = −8 reward on average).


Once again, the beauty of our matrix formulation is that the original game tree no
longer matters. We can distill the game into a few numbers by finding the expected
value of following each strategy, and apply the above theorem to see that -2 is the
optimal “value” of the game:
          L     R
    L    -8    -8
    R    -2     3

9.1.2 Relaxation: Hidden Information


This is where things get interesting. Suppose we are playing a game of "mini-poker,"
in which we have the following simple rules. First, A is dealt a single card that is
equally-likely to be red or black. If A receives a red card, s/he may resign and take a
$20 penalty. (We could explicitly say that A cannot resign with a black card, but in
reality it simply makes no sense to resign with a black card since you cannot lose.)
Otherwise, the reward depends on B: if B resigns, A is rewarded $10, but if B asks to
see A's card, then A is rewarded $30 if B sees black and loses $40 if B sees red (and
not in a fit of rage) as a deterrent from bluffing. The game tree for this is sketched
below.

[Game tree: a 50/50 chance node deals A a red or black card. With a red card, A may resign (−20) or hold; with a black card, A holds. Whenever A holds, B may resign (+10 for A) or ask to see the card: seeing red pays −40, seeing black pays +30.]

Notice the new complication: B does not know what state it's in if A decides to hold.
Notice the new complication: B does not know what state it’s in if A decides to hold.
What does the resulting reward matrix look like? Well, A either resigns or holds,
while B either resigns or asks to see. Take the cell where both are "resigners": half
the time A is dealt red and resigns (losing $20), and half the time A holds a black
card and B resigns (winning A $10); these are equally-probable, so we get a value of
−$5. The other cells can be calculated in a similar fashion (rows are A's strategies,
columns are B's):

                resign    see
    resign        -5       5
    hold          10      -5
What’s the value of this game? Well, for A, choosing either row results in a value
of -5. However, from B’s perspective, choosing the right column is superior, and
the value is +5! This clearly doesn’t fit Theorem 9.1, so hidden information means
minimax ≢ maximin.
This makes sense in the context of our game: if Agent B knew that A always folded
on a red card, he would likewise always fold when A didn’t (since losing is inevitable!).
This introduction of hidden information means there is no longer a pure, consistent



strategy that works for both players, which segues perfectly into the next section.

Mixed Strategies
A mixed strategy is a distribution over strategies. In the context of mini-poker, for
example, Agent A could choose to be a holder half of the time and a resigner the other half.
Suppose, then, that we’re a holder with probability P .
• If Agent B is a “resigner,” what’s our expected profit? Well, if we’re dealt a
black card, we always hold and always win $10. If we’re dealt a red card, we
will hold P of the time and win $10, and not hold 1 − P of the time and lose
$20. Thus, our expected profit is:

    E[profit] = (0.5 × 10) + (0.5 × ((P × 10) + ((1 − P ) × −20)))
              = 5 + (0.5 × (10P + 20P − 20))
              = 5 + 15P − 10
              = 15P − 5

Notice that this is a direct application of our above value table, a simplification
of: 10P − 5(1 − P ).
• On the other hand, if Agent B is a “see”-er, then we win $5 on average when we
resign (which we do 1 − P of the time) and lose $5 on average when we hold, so:

    E[profit] = −5P + 5(1 − P )
              = −10P + 5

Now if we were to plot these lines, we’d get a point of intersection at (0.4, 1). How
do we interpret this value?

    (Plot: expected profit vs. P , with the lines 15P − 5 and −10P + 5 crossing at (0.4, 1).)

This means that if A chooses P = 0.4, it leads to a profit of $1 regardless of what


Agent B is doing. Even if B were to apply a mixed strategy, it would still be confined
within the regions between the above lines and work out to an average of $1. Thus,
the expected profit value of this game for Agent A is $1.


Note that the point of intersection isn’t necessarily always the optimal probability
nor the best value; in this case, it’s the maximum of the possible lower bounds. It
will always be one of the extrema of the lower triangle, though.
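
Finding the intersection itself is just a little algebra; a minimal sketch (my own) of the calculation:

    def intersect(m1, b1, m2, b2):
        """Intersection of the profit lines y = m1*P + b1 and y = m2*P + b2."""
        P = (b2 - b1) / (m1 - m2)
        return P, m1 * P + b1

    # E[profit | B resigns] = 15P - 5 and E[profit | B sees] = -10P + 5:
    print(intersect(15, -5, -10, 5))   # (0.4, 1.0)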

9.1.3 Prisoner’s Dilemma

We’ve relaxed all of the terms that described our original game except one: the notion
of zero-sum. Consider the infamous prisoner’s dilemma:
• Two criminals are imprisoned in separate cells.
• A cop enters one cell and tells the first criminal, “We know you guys committed
the crime, and the other prisoner is telling us that it was all you.”
• He tells the prisoner, “If you can provide us evidence against the other guy, we
can cut you a deal and let you off.”
• To make matters worse, he also tells him that there’s another cop in the other
cell offering the other criminal the same thing, and whoever rats first gets to
walk free.
More specifically, if Prisoner A talks, he gets no sentence and Prisoner B gets a 9-
month sentence. Similarly, if Prisoner B talks, Prisoner A gets a 9-month sentence.
However, the two criminals can also cooperate with each other and say nothing, OR
they might both talk at the same time and both get stuck with a sentence. In the
former case, they don’t get off scot-free because there’s still enough evidence to toss
them both in jail for a month. In the latter case, the cops get to convict both prisoners
(which is better for them), but each sentence is shorter than 9 months: only 6.
cooperate snitch
cooperate (−1, −1) (−9, 0)
snitch (0, −9) (−6, −6)
Given this matrix, an interesting, unintuitive result appears: it’s always better to
defect. Though overall it’s better for the prisoners to cooperate, we need to consider
the cold, hard facts of the above matrix. If you know your partner will cooperate
with you, you get more value from ratting him out (-1 vs. 0). Similarly, if you know
they’ll rat you out, you still get more value from also ratting since your sentence
gets reduced (-9 to -6). The fundamental assumption of this result is that the cold,
calculated matrix does not consider the effects of your actions on your literal partner
in crime.
Notice that the first row is strictly worse for Prisoner A: in both cases, it’s better to
snitch. This means that one strategy is strictly dominated by the other.


9.1.4 Nash Equilibrium


The prisoner’s dilemma generalizes to many players. Given a game with n players,
each with their own possible set of strategies: s1 , s2 , . . . , sn , we say that s∗1 ∈ s1 , s∗2 ∈
s2 , . . . , s∗n ∈ sn (the set of specific chosen strategies by the players) are a Nash
equilibrium if and only if:

    ∀i : s∗i = arg max_{si} utility(s∗1 , . . . , si , . . . , s∗n )

More intuitively, a set of strategies is a Nash equilibrium if no one player would change
their strategy given the opportunity. This concept works for both pure and mixed
strategies.

Example Let’s find the Nash equilibrium of the prisoner’s dilemma. Recall our
value table:
cooperate snitch
cooperate (-1, -1) (-9, 0)
snitch (0, -9) (-6, -6)
There are obviously only two actions for each player, and thus four strategies. We
can do a simple brute-force enumeration to see if any of them cause any one player
to change their mind.
• If they both cooperate, then if you gave either a chance to switch, they would
(since snitching now gives lower utility).
• If one cooperates and the other doesn’t, then if you gave the cooperator the
chance, they would now snitch to reduce their sentence.
• Finally, if they both snitch, then neither would change their action, meaning
this set of strategies is the Nash equilibrium.
In fact, if we’d avoided brute force, we could’ve still come to this same conclusion by
noticing that the (−6, −6) cell is the only one left after removing strictly dominated
rows or columns.

Example Consider another value matrix:


(0, 4) (4, 0) (5, 3)
(4, 0) (0, 4) (5, 3)
(3, 5) (3, 5) (6, 6)
Let’s try to avoid brute force. Does any cell stand out as a potential equilibrium?
Well, the (global) maximum value that Agent A could achieve is 6, and likewise for
Agent B. Since the bottom-right cell achieves this for both simultaneously, neither
would ever want to change.
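
Both examples are easy to check mechanically; here is a small brute-force sketch (my own, not from the lecture) that finds every pure-strategy Nash equilibrium of a two-player payoff matrix:

    def pure_nash_equilibria(payoffs):
        """payoffs[i][j] = (reward to A, reward to B) when A plays row i, B plays column j."""
        rows, cols = len(payoffs), len(payoffs[0])
        equilibria = []
        for i in range(rows):
            for j in range(cols):
                a, b = payoffs[i][j]
                a_happy = all(payoffs[k][j][0] <= a for k in range(rows))  # A can't do better in column j
                b_happy = all(payoffs[i][k][1] <= b for k in range(cols))  # B can't do better in row i
                if a_happy and b_happy:
                    equilibria.append((i, j))
        return equilibria

    dilemma = [[(-1, -1), (-9, 0)],
               [(0, -9), (-6, -6)]]
    print(pure_nash_equilibria(dilemma))   # [(1, 1)]: both snitch

    other = [[(0, 4), (4, 0), (5, 3)],
             [(4, 0), (0, 4), (5, 3)],
             [(3, 5), (3, 5), (6, 6)]]
    print(pure_nash_equilibria(other))     # [(2, 2)]: the (6, 6) cell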


Properties The Nash equilibrium leads to a set of beautiful properties about these
types of games:
• In the n-player pure strategy game, if elimination of strictly dominated strate-
gies eliminates all but one combination, that combination is the unique Nash
equilibrium.
• More generally, any Nash equilibrium will survive the elimination of strictly
dominated strategies.
• If n is finite, and each si (the set of possible strategies for player i) is finite,
then there exists at least one Nash equilibrium (with a possibly mixed strategy).

Repeats Suppose we ran the prisoner’s dilemma multiple times. Intuitively, once
you saw that I would defect regardless (and you would likewise), wouldn’t it make
sense for us to collectively decide to cooperate to reduce our sentences?

Consider the very last repeat of the experiment. At this point you may have built up
some trust in your criminal partner, and you think you can trust them to cooperate.
Well, isn’t that the best time to betray them and get off scot-free? Yes (since guilt
has no utility), and by that same rationale your partner would do likewise, and we’re
back where we started. This leads to another property of the Nash equilibrium: If
you repeat the game n times, you will get the same Nash equilibrium for all n times.

9.1.5 Summary
Game theory operates under the critical assumption that everyone behaves optimally,
and this means that they behave greedily with respect to their utility values. The
key that lets this work in the real world is that the utility value of someone staying
out of jail might not be only the raw amount of months they avoided. It might
also be things like: how much time their partner spends in jail, how much their
family will miss them, how good of a workout routine they could get going in prison,
etc. Furthermore, we know that snitches get stitches, so the (0, −9) cell suddenly


looks much less attractive when that 0 turns into a −100 to represent the number of
beatings you’ll get for snitching on your criminal buddies.
An important conclusion from our cold-hearted matrix view of the world is that if we
know the outcome, we can try to manipulate the game itself to change the outcome
to what we want. This is called mechanism design; it’s an important part of both
economic and political policy.

9.2 Uncertainty
Our rationale for the repeated prisoner’s dilemma is that we can pull a fast one on
our partner if they decide to cooperate with us in the last round. But this relies on
knowledge of when this “last round” will be! To formalize this notion a bit, suppose
the game continues with probability γ. Then, we should expect it to end after
1/(1 − γ) rounds on average. Some ties to MDPs should begin to form.
rounds on average. Some ties to MDPs should begin to form.

9.2.1 Tit-for-Tat
Consider the following strategy for the iterative prisoner’s dilemma game: on the first
round, we agree to cooperate. Then for all future rounds, we simply copy whatever
our opponent did the previous round.
For example, we’d get the following behavior under these opponent strategies:
• under the always snitch opponent strategy, we’ll cooperate, then snitch, snitch,
...
• if they always cooperate, we’ll always cooperate.
• if they also do tit-for-tat, we’ll again always cooperate.
• if they snitch, then cooperate, then snitch, etc. we’ll do the exact opposite—
cooperate, then snitch, then cooperate, etc.
Under such a known opponent strategy, how do we figure out the best response
strategy? Well, it’s dependent on γ, right? With our prisoner’s dilemma matrix,
always snitching is viable if γ is low: we gain the maximum value on our first try,
but then suffer snitching on the next step. Contrarily, always cooperating is good for
high γ: we always gain the next-best reward (after 0).
    utility(always snitch) = 0 − 6γ · (1/(1 − γ)) = −6γ/(1 − γ)

    utility(always cooperate) = −1 · (1/(1 − γ)) = −1/(1 − γ)

The threshold is at γ = 1/6 (just set them equal to each other). How can we find this


for a general strategy, though? Well, consider our model expressed as a finite state
machine:
    (Finite state machine over tit-for-tat’s states C and D: cooperating in C earns −1
    and stays in C, snitching earns 0 and moves to D; cooperating in D earns −9 and
    moves back to C, snitching earns −6 and stays in D.)

and notice that this is simply an MDP with γ as the discount factor. By solving the
MDP, we can find the optimal policy for countering an opponent’s strategy.
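
Here is a sketch of that idea (my own reconstruction; the two-state MDP’s rewards are read straight off the prisoner’s dilemma matrix): value-iterate the best response to tit-for-tat for a low and a high γ, on either side of the 1/6 threshold:

    # State = what tit-for-tat will play next; each action maps to (reward to us, next state).
    mdp = {
        'C': {'coop': (-1, 'C'), 'snitch': (0, 'D')},
        'D': {'coop': (-9, 'C'), 'snitch': (-6, 'D')},
    }

    def best_response(gamma, sweeps=1000):
        V = {'C': 0.0, 'D': 0.0}
        for _ in range(sweeps):   # value iteration (transitions are deterministic)
            V = {s: max(r + gamma * V[s2] for r, s2 in mdp[s].values()) for s in mdp}
        # greedy policy with respect to the converged values:
        return {s: max(mdp[s], key=lambda a: mdp[s][a][0] + gamma * V[mdp[s][a][1]])
                for s in mdp}

    print(best_response(0.1))   # {'C': 'snitch', 'D': 'snitch'}: below the 1/6 threshold
    print(best_response(0.9))   # {'C': 'coop', 'D': 'coop'}: well above it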

9.2.2 Folk Theorem


Beautifully, scenarios where both players use the always-snitch strategy and both use
the tit-for-tat strategy are at Nash equilibria. And, unlike earlier, one of these is
actually a cooperative strategy. Of course, we modified the problem such that we did
a finite number of rounds with no knowledge of when the last round would be, but
the point still stands and leads to a general idea:
In repeated games, the possibility of retaliation opens the door for cooperation.

To start, a feasible payoff is one that can be reached by some combination of the
extremes (by varying probabilities, etc.). This is simply the convex hull of the points
defined in the value matrix on a “player plot.”
A minmax profile is a pair of payoffs—one for each player—that represent the
payoffs that can be achieved by a player defending itself from a malicious adversary.
In other words, we reform the game to be a zero-sum game.
For example, consider the following “battle of the sexes”: two people want to meet up
at a concert, but there are two bands playing and they didn’t coordinate beforehand.
If they end up at different concerts, they’ll both be pretty bummed out (no utility),
but if they end up together, they’ll get different utilities based on which concert it
is: one prefers Bach and the other prefers Stravinsky.
B S
B (1, 2) (0, 0)
S (0, 0) (2, 1)
The minmax profile of the above “game” is (2/3, 2/3), which can be solved by following
the same process we did during our initial discussion of Mixed Strategies.
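
To see where (2/3, 2/3) comes from, here is a sketch (my own) that computes a player’s security level with the same line-intersection trick, assuming the two lines cross inside [0, 1] (they do here):

    def security_level(row_payoffs):
        """row_payoffs[i][j]: payoff to the row player for (row i, column j); 2x2 only.
        Mixing rows with probability (p, 1-p), the row player can guarantee the
        intersection of the two expected-payoff lines (one per opposing column)."""
        (a, b), (c, d) = row_payoffs
        p = (d - c) / ((a - c) - (b - d))
        return min(p * a + (1 - p) * c, p * b + (1 - p) * d)

    # Battle of the sexes, from each player's own point of view:
    print(security_level([[1, 0], [0, 2]]))   # 0.666... for the first player
    print(security_level([[2, 0], [0, 1]]))   # 0.666... for the second player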

Theorem 9.2 (Folk Theorem). Any feasible payoff profile that strictly dominates
the minmax profile can be realized as a Nash equilibrium payoff profile, with a
sufficiently large discount factor.


The proof of this theorem can be viewed as an abstract strategy called the grim
trigger: in this strategy, we guarantee cooperation for mutual benefit as long as
our partner/opponent doesn’t “cross the line”; if they do, we will deal out vengeance
forever. In the context of the prisoner’s dilemma, A cooperates until B snitches, at
which point A will always snitch.
The problem with this strategy is that it’s a bit implausible: the idea of rejecting any
notion of optimality simply to dole out maximum vengeful punishment no-matter-
what has a pretty huge negative effect on the one dealing out the punishment—they
are forgoing potential value. In game theory, we are interested in a plausible threat
that leads to a subgame perfect equilibrium. Under this definition, you always take
the best response independent of any historical responses.
For example, the grim trigger and tit-for-tat are in a Nash equilibrium (since both
will always cooperate), but they are not subgame perfect, since a history of any
defection from tit-for-tat will cause grim to defect while it would’ve been better to
keep cooperating. Similarly, two players both using the tit-for-tat strategy are not
subgame perfect, since any defection in the past from one will “flip” the other and
lead to something worse than their forever-cooperation default.
We can think of subgame perfect strategies as eventually equalizing again, whereas
ones that are not subgame perfect are extremely fragile and dependent on very ideal
sequences—the difference between stability of a marble at the bottom of a ∪ parabola
vs. one at the exact, barely-balanced peak of a ∩ one.

9.2.3 Pavlov’s Strategy


Consider a strategy in which we switch actions on any sign of disagreement. In other
words, we start off cooperating; then, as long as you cooperate when we do, we will
continue to cooperate. If you snitch, we will also snitch, but if you then decide to
cooperate we will still snitch. This will continue until you snitch again, in which case
we will revert back to cooperation as a sort of olive branch of peace.
    (Finite state machine: in state C we cooperate, in state D we snitch. We stay in C
    while the opponent cooperates and move to D when they snitch; we stay in D while
    they cooperate and move back to C when they snitch.)

This strategy is a Nash equilibrium with itself: you always start with mutual cooper-
ation, and there’s no incentive or reason to stop doing so. Furthermore, unlike tit-for-
tat, it’s subgame perfect: no matter what history the two players find themselves in,
they eventually fall back into mutual cooperation, so the average reward is that of
mutual cooperation.

Theorem 9.3 (Computational Folk Theorem; Littman & Stone, ‘04). You can build
a Pavlov-like machine for any game and construct a subgame perfect Nash equilibrium
in polynomial time.


There are three possible forms of the equilibrium resulting from the above theorem: if
possible, it will be Pavlovian; otherwise, either the game is zero-sum-like and solving
a linear program yields the equilibrium, or at most one player can improve and we
simply take their best response.

9.3 Coming Full Circle


Let’s discuss stochastic games and see how they tie in to multi-agent reinforcement
learning. These games will be analogous to how MDPs describe single-agent rein-
forcement learning problems.

9.3.1 Example: Grid World


Consider the following grid world, in which the agents A and B want to get to the $
goal; the first to reach the goal receives a $100 reward:

    (Grid: A and B start in opposite corners; the $ goal sits above the middle square
    between them, and a dashed semi-wall sits directly above each agent.)

If they arrive at the same time, they both get the reward. The world is deterministic
aside from the dashed lines above the agents: these are “partial walls” which can be
passed with a 50% probability.
Obviously, the optimal strategy for each agent when ignoring the other agent is to
move into the middle square and go straight for the goal. However, this ignores the
unfortunate reality that the agents can’t occupy the same square (aside from the
reward cell). We’ll say that if two agents try for the same square, a 50-50 chance
describes who gets there “first;” the other agent stays still.
Does this game have a Nash equilibrium for the agents? Consider the possible strate-
gies:
• If both agents try to go through their walls, each has a 50% chance to make it
through, and the average reward is $67 for either agent.2
• However, if B knows that A will follow the above strategy, it’s in its best interest
to go left and up towards the goal. In this case, A still has a 50% chance of
success regardless of whether it goes up (wall chance) or right (collision chance),

² There’s a 50% chance to pass the wall and win; there’s also a 25% chance that neither of you
passes the wall, giving another opportunity to try the wall. If we say x = Pr[A reaches the goal],
then this probability can be expressed simply as x = 1/2 + (1/4)x. This results in x = 2/3 (a quick
numeric check of this fixed point appears after the list below).


so it will not switch strategies. Thus, this is a Nash equilibrium.


• Obviously, the opposite case in which A goes right and B goes up is also a Nash
equilibrium.
• Finally, in the case where both agents move towards the middle square, each
has a 50% chance of failure. However, switching to a strategy that goes through
the wall is equally-bad, so it’s also a Nash equilibrium.
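
The footnote’s 2/3 figure is a quick fixed point to verify numerically; a tiny sketch (my own):

    # Pr[A reaches the goal] when both agents keep trying their semi-walls:
    # x = 0.5 (A passes now) + 0.25 * x (both fail, so they try again).
    x = 0.0
    for _ in range(50):
        x = 0.5 + 0.25 * x
    print(x, x * 100)   # x -> 0.666..., i.e. an expected reward of about $67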

9.3.2 Generalization
Let’s describe this notion of a stochastic game (Shapley, ‘53) in detail. It looks much
like an MDP, but we use the i subscript to refer to each agent individually:
• We still have our set of states, S.
• We have actions available to each player: Ai . For our purposes, we’ll use two-
player games as examples, so let a = A1 (the actions available to the first agent)
and b = A2 .
• The transitions from state to state are described by a joint action. All of the
players take an action simultaneously: T (s, (a, b), s0 ).
• Each player gets their own rewards, so R1 (s, (a, b)) describes the first agent’s
reward for the joint action, and R2 (s, (a, b)) describes the second’s.
• The discount factor, γ stays the same, and is universal for everyone. We can
consider γ to be part of the definition of the game, rather than a γi which would
allow each agent to discount rewards independently.
In fact, this is a generalization of MDPs, published way before Bellman and his
equation (8.1). We can see how it simplifies to a number of scenarios we’ve seen
under certain constraints: with transition probabilities and rewards that don’t change
regardless of b, this is an MDP; with opposite rewards (R1 = −R2 ), this is a zero-sum
game; and with a single state in the state space, this is a repeated game à la the
iterative prisoner’s dilemma.
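
Purely as a container sketch (my own naming, nothing official), the definition maps onto a data structure almost directly; each of the special cases above is just a constraint on its fields:

    from dataclasses import dataclass
    from typing import Callable, Sequence, Tuple

    @dataclass
    class TwoPlayerStochasticGame:
        states: Sequence[str]
        actions_a: Sequence[str]                          # A1, the first agent's actions
        actions_b: Sequence[str]                          # A2, the second agent's actions
        T: Callable[[str, Tuple[str, str], str], float]   # T(s, (a, b), s')
        R1: Callable[[str, Tuple[str, str]], float]       # reward to the first agent
        R2: Callable[[str, Tuple[str, str]], float]       # reward to the second agent
        gamma: float                                      # shared discount factor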

9.3.3 Solving Stochastic Games


With the parallel to MDPs established, how do things like value iteration, policy
iteration, and Q-learning transfer over to multi-agent worlds?

Zero-Sum Stochastic Games


Consider the Q-function adapted directly for a two-player, zero-sum game (Littman, ‘94):

    Q∗i(s, (a, b)) = Ri(s, (a, b)) + γ · Σ_{s′} T(s, (a, b), s′) · max_{a′,b′} Q∗i(s′, (a′, b′))


What does this mean? By maximizing over joint actions (the (a′, b′) tuple), we
are assuming that the optimal joint action will exclusively benefit us: the whole
world will bend themselves over backwards to improve our utility. This seems a tad
unrealistic. . . In reality, when we discussed agent decision-making in zero-sum games,
we determined that optimal opponents will try minimax. Thus, evaluating a state
should involve actually solving the zero-sum game matrix like we did before:

    Q∗i(s, (a, b)) = Ri(s, (a, b)) + γ · Σ_{s′} T(s, (a, b), s′) · minimax_{a′,b′} Q∗i(s′, (a′, b′))

The analog of the Q-learning update is simple; only our observation tuples get more
complicated:

    ⟨s, (a, b), (ra, rb), s′⟩ :   Qi(s, (a, b)) ←α− ri + γ · minimax_{a′,b′} Qi(s′, (a′, b′))

This is often referred to as minimax-Q. From this algorithm, we get some wonderful
properties:
• Value iteration “just works”! We can solve the system of equations just like
before with MDPs by using the utilities found by the Q function.
• Minimax-Q converges to a unique solution! There’s a single optimal policy to
the zero-sum game, and this iterative algorithm finds it.
• Policies can be computed independently by each agent (that is, by operating
under the assumption that the opponent will behave optimally) and are guar-
anteed to converge to the same optimal policy.
• The Q-function update can be efficiently computed in polynomial time, since
the minimax can be computed using linear programming.
• The resulting optimal Q∗ is sufficient in specifying the optimal policy, since the
optimal utility for a state-action pair still corresponds to the optimal action to
take from that state.
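
That linear-programming step is worth making concrete; here is a sketch (my own, using scipy) of the inner minimax computation, solving the zero-sum matrix game over Q-values with an LP:

    import numpy as np
    from scipy.optimize import linprog

    def minimax_lp(Q):
        """Value of the zero-sum matrix game Q[a][b] (payoffs to the row player, who
        maximizes). Variables are the row player's mixed strategy p plus the value v;
        we maximize v subject to p . Q[:, b] >= v for every opposing action b."""
        Q = np.asarray(Q, dtype=float)
        n_a, n_b = Q.shape
        c = np.zeros(n_a + 1)
        c[-1] = -1.0                                  # maximize v == minimize -v
        A_ub = np.hstack([-Q.T, np.ones((n_b, 1))])   # v - p.Q[:, b] <= 0 for each b
        b_ub = np.zeros(n_b)
        A_eq = np.append(np.ones(n_a), 0.0).reshape(1, -1)
        b_eq = [1.0]                                  # p must be a probability distribution
        bounds = [(0, 1)] * n_a + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[:-1], res.x[-1]                  # (mixed strategy, game value)

    # Matching pennies: value 0 with a uniform mixed strategy.
    print(minimax_lp([[1, -1], [-1, 1]]))

Inside minimax-Q, this LP would be run on the matrix of Qi(s′, (a′, b′)) values to produce the number plugged into the update above.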

General-Sum Stochastic Games


The future for the general case is much more grim. Our reliance on minimax is now
a liability: the assumption that opponents are trying to minimize our reward is no
longer true. Instead, we should be trying to find a Nash equilibrium: it’s a well-
defined notion that we could analyze both Q1 (s0 , (a0 , b0 )) and Q2 (s, (b0 , a0 ) to find the
action tuple that enables an equilibrium.
    Q∗i(s, (a, b)) = Ri(s, (a, b)) + γ · Σ_{s′} T(s, (a, b), s′) · Nash_{a′,b′} Q∗i(s′, (a′, b′))

    Qi(s, (a, b)) ←α− ri + γ · Nash_{a′,b′} Qi(s′, (a′, b′))


This idea is aptly called Nash-Q and it exists in the literature; however, its properties
are much more grim:
• Value iteration doesn’t work; the Nash-Q algorithm doesn’t converge since there
isn’t a unique solution (there can be multiple Nash equilibria).
• Policies can’t be computed independently, since the very notion of a Nash equi-
librium depends on everyone coordinating their actions together.
• Computing the Nash equilibrium is not a polynomial-time operation; it’s as
hard as any problem in NP.
• Finally, even if the Q-function could work, it still would be insufficient in spec-
ifying the policy.

Workarounds
This last conclusion is saddening, but there are some interesting ideas for getting
around the need for a general result:
• By viewing stochastic games as repeated games, we can leverage the Folk the-
orem and ideas like it to come up with better solutions.
• By allowing communication between agents—we can call it “cheap talk,” since
the communication isn’t necessarily binding—we can let the agents coordinate
and compute a correlated equilibrium. This actually lets us approximate the
solution efficiently.3
• By taking a bit of a prideful view of ourselves and assuming we have more com-
putational resources than our opponents do to behave optimally, we can create a
“cognitive hierarchy” and approximate ideal responses under these assumptions.
• By enabling “side payments” (Sodomka et al., ‘13), in which agents can actually
give some of their reward to other agents in order to incentivize high-reward
actions, we can similarly find optimal strategies.

I’m not going to bother taking notes


on the 25+ minute outro, so. . .

³ This idea should be familiar to anyone who’s taken a formal algorithms course: much like we
can efficiently approximate NP-complete problems to a particular degree (like the upper bound of
7/8ths for Max-SAT), we should also be able to approximate the Nash equilibrium to a particular

(and provable) degree of accuracy.

Index of Terms

Symbols D
Q-learning . . . . . . . . . . . . . . . . . 94, 107, 108 decision tree . . . . . . . . . . . . . . . . . . 7, 46, 81
ε-exhausted . . . . . . . . . . . . . . . . . . . . . . . . . 44 dependency trees . . . . . . . . . . . . . . . . . . . . 68
k-means clustering . . . . . . . . . . . . . . . 73, 77 discounted reward . . . . . . . . . . . . . . . . . . . 91
k-nearest neighbor . . . . . . . . . . . 23, 35, 72
E
eager learner . . . . . . . . . . . . . . . . . . . . . . . . 37
A ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
activation threshold . . . . . . . . . . . . . . . . . 28
entropy . . . . . . . . . . . . . . . . . . . . . . . 9, 49, 81
AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
expectation maximization . . . . . . . . . . . 77
B F
back-propagation . . . . . . . . . . . . . . . . . . . . 34 factoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
backward search . . . . . . . . . . . . . . . . . . . . . 82 features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 firing threshold . . . . . . . . . . . . . . . . . . . . . . 28
Bayes’ optimal classifier . . . . . . . . . 55, 82 fitness function . . . . . . . . . . . . . . . . . . 63, 65
Bayes’ rule . . . . . . . . . . . . . . . . . . . 52, 56, 58 Folk theorem . . . . . . . . . . . . . . . . . . . . . . . 109
Bayesian network . . . . . . . . . . . . . . . . 56, 68 forward search . . . . . . . . . . . . . . . . . . . . . . . 82
Bayesian network, naïve . . . . . . . . . . . . . 60 function approximation . . . . . . . . . . . . . . . 5
Bellman equation . . . . . . . . . . . . 92, 94, 96
bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31, 82
G
game theory . . . . . . . . . . . . . . . . . . . . 96, 105
Boltzmann distribution . . . . . . . . . . . . . . 65
genetic algorithms . . . . . . . . . . . 65, 67, 68
boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
gradient descent . . . . . . . . . . 30, 32, 34, 55
grim trigger . . . . . . . . . . . . . . . . . . . . . . . . 105
C
chain rule . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 H
classification . . . . . . . . . . . . . . . 6, 15, 37, 60 Haussler theorem . . . . . . . . . . . . 44, 46, 48
clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 hill climbing . . . . . . . . . . . . . . . . . 63, 67, 75
component analysis, independent . . . . 86 hill climbing, random . . . . . . . . 63, 66, 74
component analysis, principal . . . . . . . 83
component analysis, random . . . . . . . . . 87
I
independent . . . . . . . . . . . . . . . . . . . . . . . . . 56
conditional entropy . . . . . . . . . . . . . . . . . . 50
independent component analysis . . . . . 86
conditionally independent . . . . . . . . . . . 56
correlated equilibrium . . . . . . . . . . . . . . 109 J
cross-validation . . . . . . . . . . . . . . . . . . . . . . . 6 joint distribution . . . . . . . . . . . . . . . . . . . . 56
curse of dimensionality . . . . . . . 38, 80, 83 joint entropy . . . . . . . . . . . . . . . . . . . . . . . . 50


K R
kernel trick . . . . . . . . . . . . . . . . . . . . . . 22, 83 radial basis kernel . . . . . . . . . . . . . . . . . . . 24
KL-divergence . . . . . . . . . . . . . . . . . . . 51, 69 randomized optimization . . 9, 34, 63, 82
regression . . . . . . . . . . . . . . . . . . . . . 6, 26, 37
L reinforcement learning . . . . . . . . . . 88, 106
lazy learner . . . . . . . . . . . . . . . . . . . . . . . . . 37
relevant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
learning rate . . . . . . . . . . . . . . . . . . . . . . . . 31
relevant, strongly . . . . . . . . . . . . . . . . . . . . 82
least squares . . . . . . . . . . . . . . . . . . . . . . . . . 27
relevant, weakly . . . . . . . . . . . . . . . . . . . . . 82
linear discriminant analysis . . . . . . . . . . 87
restriction bias . . . . . . . . . . . . . . . . . . 10, 34
linear regression . . . . . . 26, 35, 37, 55, 84
roulette wheel . . . . . . . . . . . . . . . . . . . . . . . 66
M
margin . . . . . . . . . . . . . . . . . . . 18, 19, 21, 71 S
Markovian property . . . . . . . . . . . . . . . . . 89 shatter . . . . . . . . . . . . . . . . . . . . . . . . . . 47, 47
maximum a posteriori . . . . . . . . . . . 54, 61 sigmoid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
maximum likelihood . . . . . . . . . . . . . . . . . 54 simulated annealing . . . . . . . . . . . . . 64, 67
MDP . . . . . . . . . . . . . 89, 96, 103, 104, 107 single linkage clustering . . . . . . . . . . . . . . 79
mechanism design . . . . . . . . . . . . . . . . . . 103 stationary preferences . . . . . . . . . . . . . . . 91
MIMIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 strictly dominated . . . 100, 101, 102, 104
minimax . . . . . . . . . . . . . . . . . . . . 97, 97, 108 subgame perfect . . . . . . . . . . . . . . . 105, 105
minimax-Q . . . . . . . . . . . . . . . . . . . . . . . . 108 supervised learning . . . . . . . . . . . . . . . . . . . 5
minmax profile . . . . . . . . . . . . . . . . 104, 104 support vector machine . . . . . . 18, 19, 83
mixed strategy . . . . . . . . . . . . 99, 101, 102
momentum . . . . . . . . . . . . . . . . . . . . . . . . . . 34 T
mutual information . . . . . . . 49, 50, 69, 86 training error . . . . . . . . . . . . . . . . . . . . . . . 43
N true error . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Nash equilibrium 101, 102, 104–106, 108 truncation selection . . . . . . . . . . . . . . 66, 68
Nash-Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
neural network . . . . . . . . . . . . . . . 18, 28, 46 U
underfitting . . . . . . . . . . . . . . . . . . . . . . . . . . 6
O utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
overfitting . . . . . . . . . . . . . . . . . 6, 12, 13, 35

P V
PAC-learnable . . . . . . . . . . . . . . . . . . . 44, 48 value iteration . . . . . . . . . . . . . . . . . . 92, 107
perceptron . . . . . . . . . . . . . . . . . . . . . . . 28, 82 VC dimension . . . . . . . . . . . . . . . . . . . 47, 48
perceptron rule . . . . . . . . . . . . . . . . . . 30, 31 version space . . . . . . . . . . . . . . . . 43, 43, 44
pink noise . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
policy iteration . . . . . . . . . . . . . . . . . 93, 107 W
preference bias . . . . . . . . . . . . . . . . . . 10, 34 weak learner . . . . . . . . . . 12, 14, 14, 17, 18
principal component . . . . . . . . . . . . . . . . . 83
prisoner’s dilemma . . 100, 101, 103, 105, Z
107 zero-sum . . . . . . 96, 97, 100, 104, 106–108

