Machine Learning Summarized Notes
George Kudrayvtsev
[email protected]
I’m just a student, so I can’t really make any guarantees about the correctness of
my content. If you encounter typos; incorrect, misleading, or poorly-worded
information; or simply want to contribute a better explanation or extend a section,
please raise an issue on my notes’ GitHub repository.
I Supervised Learning

0 Techniques

1 Classification
  1.1 Decision Trees
      1.1.1 Getting Answers
      1.1.2 Asking Questions: The ID3 Algorithm
      1.1.3 Considerations
  1.2 Ensemble Learning
      1.2.1 Bagging
      1.2.2 Boosting
  1.3 Support Vector Machines
      1.3.1 There are lines and there are lines. . .
      1.3.2 Support Vectors
      1.3.3 Extending SVMs: The Kernel Trick
      1.3.4 Summary

2 Regression
  2.1 Linear Regression
  2.2 Neural Networks
      2.2.1 Perceptron
      2.2.2 Sigmoids
      2.2.3 Structure
      2.2.4 Biases
  2.3 Instance-Based Learning
      2.3.1 Nearest Neighbors

4 Bayesian Learning
  4.1 Bayesian Learning
      4.1.1 Finding the Best Hypothesis
      4.1.2 Finding the Best Label
  4.2 Bayesian Inference
      4.2.1 Bayesian Networks
      4.2.2 Making Inferences
      4.2.3 Naïve Bayes

II Unsupervised Learning

5 Randomized Optimization
  5.1 Hill Climbing
  5.2 Simulated Annealing
  5.3 Genetic Algorithms
      5.3.1 High-Level Algorithm
      5.3.2 Cross-Over
      5.3.3 Challenges
  5.4 MIMIC
      5.4.1 High-Level Algorithm
      5.4.2 Estimating Distributions
      5.4.3 Practical Considerations

6 Clustering
  6.1 Single Linkage Clustering
      6.1.1 Considerations
  6.2 k-Means Clustering
      6.2.1 Convergence
      6.2.2 Considerations
  6.3 Soft Clustering
      6.3.1 Expectation Maximization
      6.3.2 Considerations
  6.4 Analyzing Clustering Algorithms
      6.4.1 Properties
      6.4.2 How Many Clusters?

7 Features
  7.1 Feature Selection
      7.1.1 Filtering
      7.1.2 Wrapping
      7.1.3 Describing Features
  7.2 Feature Transformation
      7.2.1 Motivation
      7.2.2 Principal Component Analysis
      7.2.3 Independent Component Analysis
      7.2.4 Alternatives

  8.3 Q-Learning

9 Game Theory
  9.1 Games
      9.1.1 Relaxation: Non-Determinism
      9.1.2 Relaxation: Hidden Information
      9.1.3 Prisoner's Dilemma
      9.1.4 Nash Equilibrium
      9.1.5 Summary
  9.2 Uncertainty
      9.2.1 Tit-for-Tat
      9.2.2 Folk Theorem
      9.2.3 Pavlov's Strategy
  9.3 Coming Full Circle
      9.3.1 Example: Grid World
      9.3.2 Generalization
      9.3.3 Solving Stochastic Games
PART I
Supervised Learning

Contents
  0 Techniques
  1 Classification
  2 Regression
  3 Computational Learning Theory
  4 Bayesian Learning
Techniques

Classification
To create a decision tree for a concept, we need to identify pertinent features that
would describe it well. For example, if we wanted to decide whether or not to eat at
a restaurant, we could use the weather, particular cuisine, average cost, atmosphere,
or even occupancy level as features.
For a great example of "intelligence" being driven by a decision tree in popular culture, consider the famous "character guessing" AI Akinator. For each yes-or-no question it asks, the answers branch down a tree of further questions until it can make a confident guess. One could imagine the following (incredibly oversimplified) tree in Akinator's "brain:"
(Figure: a tiny example tree in which one chain of yes/no answers ends at the guess "Turanga Leela.")
It’s important to note that decision trees are a representation of our features. Only
after we’ve formed a representation can we start talking about the algorithm that
will use the tree to make a decision.
The order in which we apply each feature to our decision tree should be correlated
with its ability to reduce our space. Just like Akinator divides the space of characters
in the world into fiction and non-fiction right off the bat, we should aim to start our
decision tree with questions whose answers can sweep away swaths of finer decision-
making. For our restaurant example, if we want to spend ≤ $10 no matter what, that would eliminate a massive number of restaurants immediately from the first question.
• A ← best attribute
• Assign A as the decision attribute (the “question” we’re asking) for the particular
node n we’re working with (initially, this would be the tree’s root node).
• For each v ∈ A, create a branch from n.
• Lump the training examples that correspond to the particular attribute value, v, into their respective branches.
• If the examples are perfectly classified with this arrangement (that is, we have
one training example per leaf), we can stop.
• Otherwise, repeat this process on each of these branches.
The "information gain" from a particular attribute A can be a good metric for qualifying attributes. Namely, we measure how much the attribute can reduce the overall entropy:

    Gain(S, A) = Entropy(S) - \sum_{v \in A} \frac{|S_v|}{|S|} \cdot Entropy(S_v)          (1.1)

where the entropy is calculated based on the probability of seeing values in A:

    Entropy(A) = -\sum_{v \in A} \Pr[v] \cdot \log_2 \Pr[v]          (1.2)
These concepts come from Information Theory which we’ll dive into more when we
discuss randomized optimization algorithms in chapter 5; for now, we should just
think of this as a measure of how much information an attribute gives us about a
system. Attributes that give a lot of information are more valuable, and should thus
be higher on the decision tree. Then, the "best attribute" is the one that gives us the maximum information gain: \max_{A \in \mathcal{A}} Gain(S, A).
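To make equations (1.1) and (1.2) concrete, here is a minimal Python sketch (not from the original notes) that computes entropy and information gain for a categorical attribute; the tiny restaurant dataset and attribute names below are made up purely for illustration.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Entropy(S) = -sum_v Pr[v] * log2(Pr[v]) over the label values in S."""
        total = len(labels)
        return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

    def information_gain(examples, labels, attribute):
        """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
        total = len(labels)
        remainder = 0.0
        for value in set(ex[attribute] for ex in examples):
            subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == value]
            remainder += (len(subset) / total) * entropy(subset)
        return entropy(labels) - remainder

    # Hypothetical restaurant data: does "cost" or "weather" tell us more about going?
    examples = [{"cost": "$", "weather": "sunny"}, {"cost": "$$$", "weather": "sunny"},
                {"cost": "$", "weather": "rainy"}, {"cost": "$$$", "weather": "rainy"}]
    labels = ["go", "skip", "go", "skip"]
    print(information_gain(examples, labels, "cost"))     # 1.0 bit: cost determines the label
    print(information_gain(examples, labels, "weather"))  # 0.0 bits: weather tells us nothing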
Inductive Bias
There are two kinds of bias we need to worry about when designing any classifier:
• restriction bias, which automatically occurs when we decide our hypothesis
set, H. In this case, our bias comes from the fact that we’re only considering
functions that can be represented with a decision tree.
• preference bias, which tells us what sort of hypotheses from our hypothesis
set, h ∈ H, we prefer.
The latter of these is at the heart of inductive bias. Which decision trees—out of all of
the possible decision trees in the universe that can represent our target concept—will
the ID3 algorithm prefer?
Splits Since it’s greedily choosing the attributes with the most information gain
from the top down, we can confidently say that it will prefer trees with good
splits at the top.
Correctness Critically, the ID3 algorithm repeats until the labels are correctly
classified. And though it may be obvious, it’s still important to note that it will
hence prefer correct decision trees to incorrect ones.
Depth This arises naturally out of the top-heavy split preference, but again, it’s
still worth noting that ID3 will prefer trees that are shallower or “shorter.”
1.1.3 Considerations
Asking (Continuous) Questions
The careful reader may have noticed the explicit mention of branching on attributes
based on every possible value of an attribute: v ∈ A. This is infeasible for many
features, especially continuous ones. For our earlier restaurant example, we may
have discretized our “cost” feature into one, two, or three dollar signs (à la Yelp), but
what if we wanted to keep them as a raw average dish dollar value instead?
Well, if we’re sticking to the “only ask Boolean questions” model, then binning is
a viable approach. Instead of making decisions based on a precise cost, we instead
make decisions based on a place being “cheap,” which we might subjectively define as
cost ∈ [0, 10), for example.
Repeating Attributes
Does it make sense to ask about an attribute more than once down its branch? That
is, if we ask about cost somewhere down a path, can (or should) we ask again, later?
(Figure: an example tree whose branches ask "Is it crowded?" and then revisit an earlier question further down.)
With our “proof by example,” it’s pretty apparent that the answer is “yes, it’s ac-
ceptable to ask about the same attribute twice.” However, it really depends on the
attribute. For example, we wouldn’t want to ask about the weather twice, since the
weather will be constant throughout the duration of the decision-making process.
With our bucketed continuous values (cost, age, etc.), though, it does make sense to
potentially refine our buckets as we go further down a branch.
Stopping Point
The ID3 algorithm tells us to stop creating our decision tree when all of our training
examples are classified correctly. That’s. . . a lot of leaf nodes. . . It may actually be
pretty problematic to refine the decision tree to such a point: when we leave our
training set, there may be examples that don’t fall into an exact leaf. There may also
be examples that have identical features but actually have a different outcome; when
we’re talking about restaurant choices, opinions may differ:
             Weather   Cost   Cuisine   Go?
    Alice:   Cloudy    $      Mexican   ✓
    Bob:     Cloudy    $      Mexican   ×
If both of these rows were in our training set, we’d actually get an infinite loop in
the naïve ID3 algorithm: it’s impossible to classify every example correctly. It makes
sense to adopt a termination approach that is a little more general and robust. We
want to avoid overfitting our training examples!
If we bubble up the decisions down the branch of a tree back up to its parent node,
then prune the branch entirely, we can avoid overfitting. Of course, we’d need to
make sure that the generalized decision does not increase our training error by too
much. For example, if a cost-based branch had all of its children branches based on
weather, and all but one of those resulted in the go-ahead to eat, we could generalize
and say that we should always eat for the cost branch without incurring a very large
penalty for the one specific “don’t eat” case.
Adapting to Regression
Decision trees as we’ve defined them here don’t transfer directly to regression prob-
lems. We no longer have a useful notion of information gain, so our approach at attribute sorting falls through. Instead, we can rely on purely statistical methods (like
variance and correlation) to determine how important an attribute is. For leaves, too,
we can do averages, local linear fit, or a host of other approaches that mathematically
generalize with no regard for the “meaning” of the data.
1.2.1 Bagging
Turns out, simply choosing data uniformly at random (with replacement) to form our subsets works pretty well. Similarly simply, combining the results with an average also works well. This technique is called bootstrap aggregation, or more commonly bagging.
The reason why taking the average of a set of weak learners trained on subsets of
the data can outperform a single learner trained on the entire dataset is because of
overfitting, our mortal fear in machine learning. Overfitting a subset will not overfit
the overall dataset, and the average will “smooth out” the specifics of each individual
learner.
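As a rough illustration of the idea (not from the original notes), here is a minimal bagging sketch for a 1-D regression problem: every weak learner is a low-degree polynomial fit on a bootstrap sample, and the ensemble prediction is just the average. The data, degree, and number of learners are all made-up choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up 1-D regression data: a noisy cubic.
    x = np.linspace(-1, 1, 40)
    y = x**3 + rng.normal(scale=0.1, size=x.shape)

    def bagged_predict(x_train, y_train, x_query, n_learners=25, degree=3):
        """Bootstrap aggregation: fit each weak learner on a subset drawn uniformly
        at random *with replacement*, then combine the predictions with an average."""
        predictions = []
        for _ in range(n_learners):
            idx = rng.integers(0, len(x_train), size=len(x_train))   # sample w/ replacement
            coeffs = np.polyfit(x_train[idx], y_train[idx], deg=degree)
            predictions.append(np.polyval(coeffs, x_query))
        return np.mean(predictions, axis=0)

    print(bagged_predict(x, y, np.array([0.0, 0.5])))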
1.2.2 Boosting
We must be able to pick subsets of the data a little more cleverly than randomly,
right? The basic idea behind boosting is to prefer data that we’re not good at
analyzing. We essentially craft learners that are specifically catered towards data
that previous learners struggled with in order to form a cohesive picture of the entire
dataset.
Initially, all training examples are weighed equally. Then, in each “boosting round,”
we find the weak learner that achieves the lowest error. Following that, we raise the
weights of the training examples that it misclassified. In essence, we say, “learn these
better next time.” Finally, we combine the weak learners from each step into our final
learner with a simple weighted average: weight is directly proportional to accuracy.
Figure 1.1: Iteratively applying weak learners to differentiate between the red and blue classes while boosting mistakes in each round. (a) Our set of training examples and the first boundary guess. (b) Reweighing the error examples and trying another boundary. (c) Final classifier is a combination of the weak learners.
Let’s define this notion of a learner’s “error” a little more rigorously. Previously, when
we pulled from the training set with uniform randomness, this was easy to define.
The number of mismatches M from our model out of the N -element subset meant an
error of M/N . However, since we’re now weighing certain training examples differently
(incorrect =⇒ likelier to get sampled), our error is likewise different. Shouldn’t we
punish an incorrect result on a data point that we are intentionally trying to learn
more-so than an incorrect result that a dozen other learners got correct?
Suppose our training subset is just 4 values, and our weak learner H(x) got two of them correct:

        x1   x2   x3   x4
        ×    ✓    ×    ✓

What's our training error? Trivially 1/2, you might say. Sure, but what if the probability of each x_i being chosen for this subset was different? Suppose

        x1    x2     x3    x4
        ×     ✓      ×     ✓
   D:   1/2   1/20   2/5   1/20

Now what's our error? Well, getting x1 wrong is a way bigger deal now, isn't it? It barely matters that we got x2 and x4 correct. . . So we need to weigh each incorrect answer accordingly:

    \varepsilon = \underbrace{\frac{1}{2} + \frac{2}{5}}_{\text{incorrects}} = 1 - \underbrace{\left( \frac{1}{20} + \frac{1}{20} \right)}_{\text{corrects}} = \frac{9}{10}
To drive the point in the above example home, we’ll now formally define our error as
the probability of a learner H not getting a data point xi correct over the distribution
of xi s. That is,
    \varepsilon = \Pr_D\left[ H(x_i) \ne y_i \right]
Our notion of a weak learner—a term we’ve been using so far to refer to a learner
that does well on a subset of the training data—can now likewise be formally defined:
it’s a learner that performs better than chance for any distribution of data (where ε
is a number very close to zero):
    \forall D : \Pr_D[\cdot] \le \frac{1}{2} - \varepsilon
Note the implication here: if there is any distribution for which a set of hypotheses
can’t do better than random chance, there’s no way to create a weak learner from
those hypotheses. That makes this a pretty strong condition, actually, since you need
a lot of good hypotheses to cover the various distributions.
Boosting at a high level can be broken down into a simple loop: on iteration t,
• construct a distribution Dt , and
• find a weak classifier Ht (x) that minimizes the error over it.
Then after the loop, combine the weak classifiers into a stronger one.
Specific boosting algorithms vary in how they perform these steps, but a well-known
one is the adaptive boosting (or AdaBoost) algorithm outlined in algorithm 1.1.
Let’s dig into its guts.
AdaBoost
From a very high level, this algorithm follows a very human approach for learning:
The better we’re doing overall, the more we should focus on individual mistakes.
We return to the world of classification, where our training set maps from a feature vector to a "correct" or "incorrect" label, y_i ∈ {−1, 1} (Freund & Schapire, '99):
¹ We don't dive into this in the main text for brevity, but here z_t would be the sum of the pre-normalized weights. In an algorithm (like in algorithm 1.1), you might first calculate a D'_{t+1} that didn't divide any terms by z_t, then calculate z = \sum_{d \in D'} d and do D_{t+1}(i) = D'_{t+1}(i)/z at the end.
Finding the Weak Classifier (Deng, '07) Notice that we kind of glossed over determining H_t(x) for any given round of boosting. Weak learners encompass a large class of learners; a simple decision tree could be a weak learner. All it has to do is guarantee performance that is slightly better than random chance.
Final Hypothesis As you can see in algorithm 1.1, the final classifier is just a
weighted average of the individual weak learners, where the weight of a learner is its
respective α. And remember, αt is in terms of εt , so it measures how well the tth
round went overall; thus, a good round is weighed more than a bad round.
The beauty of ensemble learning is that you can combine many simple weak classifiers
that individually hardly do better than chance together into a final classifier that
performs really well.
Let's take a deeper look at the definition of D_{t+1}(i) to understand how the probability of the ith sample changes. Recall that in our equation, the exponential simplifies to e^{∓α} depending on whether H(·) guesses y_i correctly or incorrectly, respectively.

    D_{t+1}(i) = \frac{D_t(i)}{z_t} \cdot \underbrace{\exp(-\alpha_t y_i H_t(x_i))}_{e^{\mp\alpha}} \qquad \text{where} \qquad \alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}

When H(·) gets x_i wrong, y_i H_t(x_i) = -1, so the exponential becomes

    e^{\alpha} = \exp\left( \frac{1}{2} \ln \frac{1 - \varepsilon}{\varepsilon} \right) = \sqrt{\frac{1 - \varepsilon}{\varepsilon}}

We can follow the same reasoning for e^{-\alpha} and get a flipped result:

    e^{-\alpha} = \exp\left( -\frac{1}{2} \ln \frac{1 - \varepsilon}{\varepsilon} \right) = \left( \frac{1 - \varepsilon}{\varepsilon} \right)^{-1/2} = \left( \frac{\varepsilon}{1 - \varepsilon} \right)^{1/2} = \sqrt{\frac{\varepsilon}{1 - \varepsilon}}
The weight update thus depends on whether or not the weak learner classified x_i correctly (dropping the \sqrt{\cdot} for simplicity of analysis):

    f(\varepsilon) = \exp(-\alpha_t y_i H_t(x_i)) = \begin{cases} \dfrac{1 - \varepsilon}{\varepsilon} & \text{if } H(\cdot) \text{ was wrong} \\[1ex] \dfrac{\varepsilon}{1 - \varepsilon} & \text{if } H(\cdot) \text{ was right} \end{cases}

Remember that ε_t is the total error of all incorrect answers that H_t gave; it's a sum of probabilities, so 0 < ε < 1. But note that H is a weak learner, so it must do better than random chance, so in fact 0 < ε < 1/2. The functions are plotted in Figure 1.2; from them, we can draw some straightforward conclusions:
Figure 1.2: The two ways the exponential can go when boosting, depending on whether or not the classifier gets sample i right (ε/(1 − ε)) or wrong ((1 − ε)/ε), plotted as f(ε) for 0 < ε < 1/2.
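Putting the pieces above together, here is a minimal AdaBoost-style sketch in Python, in the spirit of algorithm 1.1 (which isn't reproduced in these notes): decision stumps on 1-D data serve as the weak learners, ε_t is measured under the current distribution D_t, α_t = 1/2 ln((1 − ε_t)/ε_t), and weights are updated by exp(−α_t y_i H_t(x_i)) and renormalized. The toy dataset is made up.

    import numpy as np

    def stump_predict(x, threshold, polarity):
        """A decision stump on 1-D data: predicts +1 on one side of the threshold."""
        return polarity * np.sign(x - threshold + 1e-12)

    def best_stump(x, y, dist):
        """Pick the stump minimizing the weighted error eps_t = Pr_D[H(x_i) != y_i]."""
        best = None
        for threshold in x:
            for polarity in (+1, -1):
                eps = np.sum(dist[stump_predict(x, threshold, polarity) != y])
                if best is None or eps < best[0]:
                    best = (eps, threshold, polarity)
        return best

    def adaboost(x, y, rounds=5):
        dist = np.full(len(x), 1 / len(x))            # D_1: all examples weighed equally
        ensemble = []
        for _ in range(rounds):
            eps, threshold, polarity = best_stump(x, y, dist)
            alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
            pred = stump_predict(x, threshold, polarity)
            dist = dist * np.exp(-alpha * y * pred)    # raise misclassified weights...
            dist /= dist.sum()                         # ...then normalize by z_t
            ensemble.append((alpha, threshold, polarity))
        return ensemble

    def ensemble_predict(ensemble, x):
        """Final hypothesis: sign of the alpha-weighted vote of the weak learners."""
        return np.sign(sum(a * stump_predict(x, t, p) for a, t, p in ensemble))

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([-1, -1, -1, +1, +1, +1])
    print(ensemble_predict(adaboost(x, y), x))   # recovers the labels on this toy set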
Considerations: Overfitting
Boosting is a robust method that tries very hard to avoid overfitting. Testing per-
formance often mimics training performance pretty closely. Why is this the case?
Though we’ve been discussing error at length up to this point, and how minimizing
error has been our overarching goal, it’s worth discussing the idea of confidence, as
well. If, for example, you had a 5-nearest neighbor in which three neighbors strongly
voted one way and two neighbors voted another way, you might have low error but
also low confidence, relative to a scenario where all five voted the same way.
Because boosting is insecure and suffers from social anxiety, it tries really hard to
be confident and this lets it avoid overfitting. Once a boosted ensemble reaches a
state at which it has low error and can separate positive and negative examples well,
adding more and more weak learners will actually continue to spread the gap (similar
to support vector machines which we’re about to discuss with respect to margins) at
the boundary.
(Figure: ensemble outputs on a −1 to +1 scale before and after adding more weak learners; the gap around zero widens.)
Now, which of the green dashed lines below “best” separates the two colors?
² "White noise" is Gaussian noise.
They’re all correct, but why does the middle one “feel” best? Aesthetics? Maybe.
But more likely, it’s because the middle line does the best job at separating the data
without making us commit too much to it. It leaves the biggest margin of potential
error if some hidden dots got revealed.
A support vector machine operates on this exact notion: it tries to find the boundary that will maximize the margin from the nearest data points. The optimal margin lines will always have some special points that intersect the dashed lines:
These points are called the support vectors. Just like a human, a support vector machine reduces computational complexity by focusing on examples near the boundaries rather than the entire data set. So how can we use these vectors to maximize
the margin?
with \begin{bmatrix} x & y & 1 \end{bmatrix}^T:

    \begin{bmatrix} a & b & c \end{bmatrix} \cdot \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = 0

as: w^T s = 0. This lets us represent a line in vector form and use any number of dimensions. Our w defines the parameters of the (hyper)plane; notice that here, w ⊥ the xy-plane.

If we want to stay reminiscent of y = mx + b, we can drop the last term of w and use the raw constant: w^T s + c = 0.
    w^T x + b = 1
    w^T x + b = 0
    w^T x + b = -1

Well, if we call the two green support vectors above x_1 and x_2, what's the distance between them? Well,

      w^T x_1 + b = 1
    -(w^T x_2 + b = -1)
    ---------------------
    w^T (x_1 - x_2) = 2

    \hat{w}^T (x_1 - x_2) = \frac{2}{\lVert w \rVert}
    Minimize:     \frac{1}{2} \lVert w \rVert^2
    Subject to:   y_i (w^T x_i + b) \ge 1

where the α_i's are "learned weights" that are only non-zero at the support vectors. Any support vector i is defined by y_i = w^T x_i + b, so:

    y_i = \underbrace{\sum_i \alpha_i y_i x_i^T}_{w^T} x + b = \pm 1
We can use this to build our classification function:

    f(x) = \mathrm{sign}\left( \sum_i \alpha_i y_i\, x_i^T x + b \right)
Note the highlighted term x_i^T x: the entirety of the classification depends only on this dot product between some "new point" x and our support vectors x_i's. The dot product
is a metric of similarity: here, it’s the projection of each xi onto the new x, but it
doesn’t have to be. . .
(Figures: two arrangements of points along a one-dimensional x-axis; a single threshold separates the first arrangement but not the second.)
No such luck this time. But what if we mapped them to a higher-dimensional space?
For example, if we map these to y = x², a wild linear separator appears!

(Figure: the same points plotted against x²; a horizontal line now separates them.)

This seems promising. . . how can we find such a mapping (like the arbitrary x ↦ x² above) for other feature spaces? Notice that we added a dimension simply by manipulating the representation of our features.
Let's generalize this idea. We can call our mapping function Φ; it maps the xs in our feature space to another higher-dimensional space ϕ(x), so Φ : x ↦ ϕ(x). And recall the "similarity metric" in our classification function; let's isolate it and define K(a, b) = a^T b. Then, we have

    f(x) = \mathrm{sign}\left( \sum_i \alpha_i y_i\, K(x_i, x) + b \right)
The kernel trick here is simple: it states that if there exists some Φ(·) that can
represent K(·) as a dot product, we can actually use K in our linear classifier. We
don't actually need to find, define, or care about Φ; it just needs to exist. And in practice, it does exist for almost any function we can think of (though I won't offer an explanation on how). That means we can apply almost any K and it'll work out.
For example, if our 2D dataset was separable by a circular boundary, we could use
K(a, b) = (aT b)2 in our classifier and it would actually find the boundary, despite it
not being linearly-separable in two dimensions.
Soak that in for a moment. . . we can almost-arbitrarily define a K that represents our
data in a different way and it’ll still find a boundary that just so happens to be linear
in a higher-dimensional space.
Let's work through a proof that a particular kernel function is a dot product in some higher-dimensional space. Remember, we don't actually care about what that space is when it comes to applying the kernel; that's the beauty of the kernel trick. We're working through this to demonstrate how you would show that some kernel function does have a higher-dimensional mapping.

We have 2D vectors, so x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.

    K(x_i, x_j) = (1 + x_i^T x_j)^2
                = \left( 1 + \begin{bmatrix} x_{i1} & x_{i2} \end{bmatrix} \begin{bmatrix} x_{j1} \\ x_{j2} \end{bmatrix} \right) \left( 1 + \begin{bmatrix} x_{i1} & x_{i2} \end{bmatrix} \begin{bmatrix} x_{j1} \\ x_{j2} \end{bmatrix} \right)          [expand]
                = 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}          [multiply it all out]
                = \begin{bmatrix} 1 & x_{i1}^2 & \sqrt{2} x_{i1} x_{i2} & x_{i2}^2 & \sqrt{2} x_{i1} & \sqrt{2} x_{i2} \end{bmatrix} \begin{bmatrix} 1 \\ x_{j1}^2 \\ \sqrt{2} x_{j1} x_{j2} \\ x_{j2}^2 \\ \sqrt{2} x_{j1} \\ \sqrt{2} x_{j2} \end{bmatrix}          [rewrite it as a vector product]

At this point, we can see something magical and crucially important: each of the vectors only relies on terms from either x_i or x_j! That means it's a. . . wait for it. . . dot product! We can define ϕ as a mapping into this new 6-dimensional space:

    \varphi(x) = \begin{bmatrix} 1 & x_1^2 & \sqrt{2} x_1 x_2 & x_2^2 & \sqrt{2} x_1 & \sqrt{2} x_2 \end{bmatrix}^T
The choice of K(·) is much like the choice of d(·) in k-nearest neighbor: it encodes
domain knowledge about the data in question that can help us classify it better.
Some common kernels include polynomial kernels (like the one in the example) and radial basis function kernels:

    K(a, b) = (a^T b + c)^p                                               (polynomial)

    K(a, b) = \exp\left( -\frac{\lVert a - b \rVert^2}{2\sigma^2} \right)          (radial basis)
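For illustration, here is a small Python sketch (not from the notes) of these two kernels together with the kernelized decision function from earlier; the support vectors, α_i's, labels, and b below are hypothetical stand-ins for the output of the quadratic-programming step, which isn't shown.

    import numpy as np

    def polynomial_kernel(a, b, c=1.0, p=2):
        """K(a, b) = (a^T b + c)^p"""
        return (np.dot(a, b) + c) ** p

    def rbf_kernel(a, b, sigma=1.0):
        """K(a, b) = exp(-||a - b||^2 / (2 sigma^2))"""
        return np.exp(-np.linalg.norm(a - b) ** 2 / (2 * sigma ** 2))

    def classify(x, support_vectors, alphas, labels, b, kernel):
        """f(x) = sign( sum_i alpha_i y_i K(x_i, x) + b ), summing over the support vectors."""
        total = sum(a * y * kernel(sv, x) for sv, a, y in zip(support_vectors, alphas, labels))
        return np.sign(total + b)

    # Hypothetical values standing in for the solved SVM:
    support_vectors = [np.array([1.0, 2.0]), np.array([-1.0, -2.0])]
    alphas, labels, b = [0.7, 0.7], [+1, -1], 0.0
    print(classify(np.array([0.5, 1.0]), support_vectors, alphas, labels, b, rbf_kernel))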
1.3.4 Summary
In general, we’ve come to the conclusion that finding the linear separator of a dataset
with the maximum margin is a good way to generalize and avoid overfitting. SVMs
do this with a quadratic optimization problem to find the support vectors. Finally,
we discussed how the kernel trick lets us find these linear separators in an arbitrary
n-dimensional space by projecting the data to said space.
Regression
When we free ourselves from the limitation of small discrete quantities of labels, we are open to approximating a much larger range of functions. Machine learning techniques that use regression can approximate real-valued and continuous functions.
In supervised learning, we are trying to perform inductive reasoning, in which we try
to figure out the abstract bigger picture from tiny snapshots of the world in which
we don’t know most of the rules (that is, approximate a function from input-output
pairs). This is in stark contrast with deductive reasoning, through which individual
facts are combined through strict, rigorous logic to come to bigger conclusions (think
“if this, then that”).
2.1 Linear Regression

Linear regression comes down to finding the line of best fit, the same kind of line that we used to plot in grade school. Through the power of linear algebra, the line of best fit can be rigorously defined by solving a linear system.

Figure 2.1: A set of points with "no solution": no line passes through all of them. The set of errors is plotted in red: (e_0, e_1, e_2).

One way to find this line is to find the sum of least squared-error between the points and the chosen line; more specifically, a visual demonstration can show
us this is the same as minimizing the projection error of the points on the line.

Suppose we have a set of points that don't exactly fit a line: {(1, 1), (2, 1), (3, 2)}, plotted in Figure 2.1. We want to find the best possible line y = mx + b that minimizes the total error. This corresponds to solving the following system of equations, forming y = Ax:

    1 = b + m · 1
    1 = b + m · 2        \Longrightarrow        \begin{bmatrix} 1 \\ 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} b \\ m \end{bmatrix}
    2 = b + m · 3

The lack of an exact solution to this system (algebraically) means that the vector of y-values isn't in the column space of A, or: y ∉ C(A). The vector can't be represented by a linear combination of column vectors in A.
We can imagine the column space as a plane in xyz-space, and y existing outside of it; then, the vector that'd be within the column space is the projection of y into the column space plane: p = \mathrm{proj}_{C(A)}\, y. This is the closest possible vector in the column space to y, which is exactly the distance we were trying to minimize! Thus,

    A^T A x^* = A^T y
    x^* = \underbrace{(A^T A)^{-1} A^T}_{\text{pseudoinverse}}\, y
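The three-point example above works out to b = 1/3 and m = 1/2; here is that computation in numpy, with the explicit pseudoinverse shown for clarity (in practice you would reach for np.linalg.lstsq).

    import numpy as np

    # The system y = Ax for the points {(1, 1), (2, 1), (3, 2)}, with unknowns x = (b, m).
    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])
    y = np.array([1.0, 1.0, 2.0])

    # x* = (A^T A)^{-1} A^T y  -- the pseudoinverse applied to y.
    x_star = np.linalg.inv(A.T @ A) @ A.T @ y
    b, m = x_star
    print(b, m)                                   # b = 1/3, m = 1/2: the line y = x/2 + 1/3

    # Equivalent and more numerically robust:
    print(np.linalg.lstsq(A, y, rcond=None)[0])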
More Resources. This section basically summarizes and synthesizes this Khan
Academy video, this lecture from the Computer Vision course (which goes through
the full derivation), this section of Introduction to Linear Algebra, and this explana-
tion from NYU. These links are provided in order of clarity.
2.2 Neural Networks

(Figure: a single unit: inputs x1, x2, x3 enter with weights w1, w2, w3, feed an activation function F with threshold θ, and produce the output y.)
2.2.1 Perceptron
The simplest, most fundamental activation function produces a binary output (so y ∈ {0, 1}) based on the weighted sum of the inputs:

    F(x_1, x_2, \ldots, x_n, w_1, w_2, \ldots, w_n) = \begin{cases} 1 & \text{if } \sum_{i=1}^n w_i x_i \ge \theta \\ 0 & \text{otherwise} \end{cases}
This is called a perceptron, and it’s the foundational building block of neural net-
works going back to the 1950s.
Let's quickly validate our understanding. Given the following input state:

    x_1 = 1,      w_1 = 1/2
    x_2 = 0,      w_2 = 3/5
    x_3 = -1.5,   w_3 = 1
What kind of functions can we represent with a perceptron? Suppose we have two inputs, x_1 and x_2, along with equal weights w_1 = w_2 = 1/2 and θ = 3/4. For what values of x_i will we get an activation?

Well, we know that if x_2 = 0, then we'll get an activation for anything that makes x_1 w_1 ≥ θ, so x_1 ≥ θ/w_1 = 1.5. The same rationale applies for x_2, and since we know that the inequality is linear, we can just connect the dots:

(Figure: the x_1–x_2 plane with the halfplane above the line from (1.5, 0) to (0, 1.5) shaded; for 0/1 inputs, only (1, 1) falls inside it, so this perceptron computes AND.)
(Figures: two more decision boundaries in the x_1–x_2 plane, obtained by adjusting the weights to model other Boolean functions such as OR.)
Note that we “derived” OR by adjusting w1,2 until it worked, though we could’ve also
adjusted θ. This might trigger a small "aha!" moment if the idea of induction from earlier stuck with you: if we have some known input/output pairs (like the truth tables
for the binary operators), then we can use a computer to rapidly guess-and-check
the weights of a perceptron (or perhaps an entire neural network. . . ?) until they
accurately describe the training pairs as expected.
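A tiny Python sketch of that guess-and-check idea (not from the notes): a perceptron unit plus a brute-force search over a small, made-up grid of candidate weights and thresholds until the OR truth table is reproduced. Actual training uses the update rules discussed later rather than exhaustive search.

    from itertools import product

    def perceptron(inputs, weights, theta):
        """Fires (returns 1) when the weighted sum of the inputs reaches the threshold."""
        return 1 if sum(w * x for w, x in zip(weights, inputs)) >= theta else 0

    # Target: the OR truth table as known input/output pairs.
    OR_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}

    candidates = [0.25, 0.5, 0.75, 1.0]            # hypothetical grid of values to try
    for w1, w2, theta in product(candidates, repeat=3):
        if all(perceptron(x, (w1, w2), theta) == y for x, y in OR_table.items()):
            print("found OR:", w1, w2, theta)       # prints the first working combination
            break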
Combining Perceptrons

To build up an intuition for how perceptrons can be combined, let's model binary XOR. It can't be described by a single perceptron because it's not a linear function; however, it can be described by several!

    Table 2.1: The truth table for XOR.
      x   y   x⊕y
      1   1   0
      1   0   1
      0   1   1
      0   0   0

Intuitively, XOR is like OR, except when the inputs succeed under AND. . . so we might imagine that XOR ≈ OR − AND.
So if x1 and x2 are both “on,” we should take away the result of the AND perceptron so
that we fall under the activation threshold. However, if only one of them is “on,” we
can proceed as normal. Note that the “deactivation weight” needs to be equal to the
sum of the other weights in order to effectively cancel them out, as shown in Figure 2.3.
Learning Perceptrons
Let’s delve further into the notion of “training” a perceptron network that we alluded
to earlier: twiddling the wi s and θs to fit some known inputs and outputs. There are
two approaches to this: the perceptron rule, which operates on the post-activated
y-values, and gradient descent, which operates on the raw summation.
Figure 2.3: A neural network modeling XOR and its summations for all possible bit inputs. The inputs x1 and x2 feed the output unit directly with weights w1 = 1 and w3 = 1, and also feed an AND unit whose output enters with weight w2 = −2; the output unit's threshold is θ = 1. The final column in the table is the summation expression for perceptron activation, w1·x1 + w2·F_AND(x1, x2) + w3·x2.

    x1   x2   AND   y
    1    1    1     1·1 + (−2)·1 + 1·1 = 0
    1    0    0     1·1 + (−2)·0 + 1·0 = 1
    0    1    0     1·0 + (−2)·0 + 1·1 = 1
    0    0    0     1·0 + (−2)·0 + 1·0 = 0
Gradient Descent We still need something in our toolkit for datasets that aren't linearly-separable. This time, we'll operate on the unthresholded summation since it gives us far more information about how close (or far) we are from triggering an activation. We'll use a as shorthand for the summation: a = \sum_i w_i x_i.

Then we can define an error metric based on the difference between a and each expected output: we sum the error for each of our input/output pairs in the dataset D. That is,

    E(w) = \frac{1}{2} \sum_{(x, y) \in D} (y - a)^2
Let’s use our good old friend calculus to solve this via gradient descent. A function
is at its minimum when its derivative is zero, so we’ll take the derivative with respect
to a weight:
    \frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{(x, y) \in D} (y - a)^2
        = \sum_{(x, y) \in D} (y - a) \cdot \frac{\partial}{\partial w_i} \left[ -\sum_j w_j x_j \right]          [chain rule, and only a is in terms of w_i]
        = \sum_{(x, y) \in D} (y - a)(-x_i)          [when j ≠ i, the derivative will be zero]
        = -\sum_{(x, y) \in D} (y - a)\, x_i          [rearranged to look like the perceptron rule]
Notice that we essentially end up with a version of the perceptron rule where η = −1,
except we now use the summation a instead of the binary output ŷ. Unlike the
perceptron rule, we have no guarantees about finding a separation, but it is far more
robust to non-separable data. In the limit (thank u Newton very cool), though, it will converge to a local optimum.

To reiterate our learning rules, we have:

    \Delta w_i = \eta\, (y - \hat{y})\, x_i          (perceptron rule)          (2.1)
    \Delta w_i = \eta\, (y - a)\, x_i                (gradient descent)         (2.2)
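A minimal numpy sketch of rule (2.2) in action (not from the notes): repeatedly nudge the weight vector by η·Σ(y − a)·x over a made-up dataset whose targets come from a known "true" weight vector plus noise.

    import numpy as np

    rng = np.random.default_rng(1)

    # Made-up data: targets are a linear function of the features plus a little noise.
    X = rng.normal(size=(50, 3))
    true_w = np.array([0.5, -1.0, 2.0])
    y = X @ true_w + rng.normal(scale=0.05, size=50)

    w = np.zeros(3)                        # small (here, zero) initial weights
    eta = 0.05                             # learning rate
    for _ in range(200):
        a = X @ w                          # unthresholded summations for every example
        w += eta * X.T @ (y - a) / len(y)  # delta rule: eta * sum over D of (y - a) x_i
    print(w)                               # close to true_w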
2.2.2 Sigmoids
The similarity between the learning rules in (2.1) and (2.2) begs the question, why
didn’t we just use calculus on the thresholded ŷ?
(Figure: the thresholded output ŷ as a function of the summation a: a step from 0 up to 1.)
The simple answer is that the function isn't differentiable. Wouldn't it be nice, though, if we had one that was very similar to it, but smooth at the hard corners that give us problems? Enter the sigmoid function:

    \sigma(a) = \frac{1}{1 + e^{-a}}          (2.3)

By introducing this as our activation function, we can use gradient descent all over the place. Furthermore, the sigmoid's derivative itself is beautiful (thanks to the fact that \frac{d}{dx} e^x = e^x):

    \sigma'(a) = \sigma(a) \left( 1 - \sigma(a) \right)
Note that this isn’t the only function that smoothly transitions between 0 and 1;
there are other activation functions out there that behave similarly.
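A quick numerical sanity check, in Python, of the sigmoid and the σ(a)(1 − σ(a)) form of its derivative quoted above; the test point a = 0.7 is arbitrary.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    a = 0.7
    analytic = sigmoid(a) * (1 - sigmoid(a))                    # sigma'(a) = sigma(a)(1 - sigma(a))
    numeric = (sigmoid(a + 1e-6) - sigmoid(a - 1e-6)) / 2e-6    # central finite difference
    print(analytic, numeric)                                    # agree to ~6 decimal places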
2.2.3 Structure
Now that we have the ability to put together differentiable individual neurons, let’s
look at what a large neural network presents itself as. In Figure 2.4 we see a collec-
tion of sigmoid units arranged in an arbitrary pattern. Just like we combined two
perceptrons to represent the non-linear function XOR, we can combine a larger number
of these sigmoid units to approximate an arbitrary function.
Figure 2.4: A 4-layer neural network with 5 inputs, x1 through x5, 3 hidden layers, and a single output y.
Because each individual unit is differentiable, the entire mapping from x ↦ y is differentiable! The overall error of the system enables a bidirectional flow of information: the error of y impacts the last hidden layer, which results in its own error, which impacts the second-to-last hidden layer, etc. This computationally-beneficial organization of the chain rule is called back-propagation: the error of the network propagates to adjust each unit's weight individually.
Optimization Methods
It’s worth reiterating that we’ve departed from the guarantees of perceptrons since
we’re using σ(a) instead of the binary activation function; this means that gradient
descent can get stuck in local optima and not necessarily result in the best global
approximation of the function in question.
Gradient descent isn’t the only approach to training a neural network. Other, more
advanced methods are researched heavily. Some of these include momentum, which
allows gradient descent to “gain speed” if it’s descending down steep areas in the
function; higher-order derivatives, which look at combinations of weight changes to try
to grasp the bigger picture of how the function is changing; randomized optimization;
and the idea of penalizing “complexity,” so that the network avoids overfitting with
too many nodes, layers, or even too-large of weights.
2.2.4 Biases
What kind of problems are neural networks appropriate for solving?
Restriction Bias A neural network’s restriction bias (which, if you recall, is the
representation’s ability to consider hypotheses) is basically non-existent if you use
sigmoids, though certain models may require arbitrarily-complex structure.
We can clearly represent Boolean functions with threshold-like units. Continuous
functions with no “jumps” can actually be represented with a single hidden layer. We
can think of a hidden layer as a way to stitch together “patches” of the function as
they approach the output layer. Even arbitrary functions can be approximated with
a neural network! They require two hidden layers, one stitching at seams and the
other stitching patches.
This lack of restriction does mean that there’s a significant danger of overfitting,
but by carefully limiting things that add complexity (as before, this might be layers,
nodes, or even the weights themselves), we can stay relatively generalized.
Preference Bias On the other hand, we can’t yet answer the question of the pref-
erence bias of a neural network (which, if you recall, describes which hypotheses from
the restricted space are preferred). We discussed the algorithm for updating weights
(gradient descent), but have yet to discuss how the weights should be initialized in
the first place.
Common practice is choosing small, random values for our initial weights. The ran-
domness allows for variability so that the algorithm doesn’t get stuck at the same
local minima each time; the smallness allows for relative adjustments to be impactful
and reduces complexity.
Given this knowledge, we can say that neural networks—when all other things are
equal—prefer simpler explanations to complex ones. This idea is an embodiment of
Occam’s Razor:
Entities should not be multiplied unnecessarily.
More colloquially, it’s often expressed as the idea that the simpler explanation is
likelier to be true.
distances: if we had some arbitrarily-colored dots (whose color was determined by the features f_m and f_n) and wanted to determine the color of a novel dot (encoded in red below) with f_m = 2, f_n = 1, we'd use the standard Euclidean distance.
(Figure: colored dots in the f_m–f_n plane, with distances d1 through d4 drawn from the new red dot at (2, 1) to its nearest neighbors.)
¹ Game developers might already be somewhat familiar with the algorithm: quadtrees rely on similar principles to efficiently perform collision detection, pathfinding, and other spatially-sensitive calculations. The game world is dynamically divided into recursively-halved quadrilaterals to group closer objects together.
This is in contrast with something like linear regression, which calculates a model up-
front and makes querying very cheap (constant time, in fact); in this regard, kNN is
referred to as a lazy learner, whereas linear regression would be an eager learner.
Algorithm 2.1: A naïve kNN learning algorithm. Both the number of neighbors, k, and the similarity metric, d(·), are assumed to be pre-defined.
In the case of regression, we’ll likely be taking the weighted average like in algo-
rithm 2.1; in the case of classification, we’ll likely instead have a “vote” and choose
the label with plurality. We can also do more “sophisticated” tie-breaking: for regres-
sion, we can consider all data points that fall within a particular shortest distance (in
other words, we consider the k shortest distances rather than the k closest points);
for classification, we have more options. We could choose the label that occurs the
most globally in X , or randomly, or. . .
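A minimal Python sketch in the spirit of algorithm 2.1, whose pseudocode isn't reproduced in these notes: find the k nearest training points under d(·) and average their labels (swap the average for a plurality vote to get classification). The toy data and the choice of a plain rather than distance-weighted average are made-up simplifications.

    import numpy as np

    def knn_query(X_train, y_train, x_query, k=3, d=None):
        """Return the average label of the k training points nearest to x_query."""
        if d is None:
            d = lambda a, b: np.linalg.norm(a - b)   # default similarity metric: Euclidean
        distances = np.array([d(x, x_query) for x in X_train])
        nearest = np.argsort(distances)[:k]          # indices of the k closest points
        return np.mean(y_train[nearest])             # regression: average the neighbors

    # Toy data: the label is roughly the sum of the two features.
    X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
    y_train = np.array([0.0, 1.0, 1.0, 3.0, 3.0])
    print(knn_query(X_train, y_train, np.array([1.0, 1.0]), k=3))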
Biases
As always, it’s important to discuss what sorts of problems a kNN representation of
our data would cater towards.
Preference Bias Our belief of what makes a good hypothesis to explain the data
heavily relies on locality—closer points (based on the domain-aware distance metric)
are in fact similar—and smoothness—averaging neighbors makes sense and feature
behavior smoothly transitions between values.
Notice that we’ve also been treating the features of our training sample vector xi
equally. The scales for the features may of course be different, but there is no notion of
weight on a particular feature. This is critical: obviously, whether or not a restaurant
is within your budget is far more important than whether the atmosphere inside fits
your vibe (being the broke graduate students that we are). However, the kNN model
has difficulty with making this differentiation.
In general, we are encountering the curse of dimensionality in machine learning: as the number of features (the dimensionality of a single x_i ∈ X) grows linearly, the amount of data we need to accurately generalize grows exponentially. If we have a lot of features and we treat them all with equal importance, we're going to need a LOT of data to determine which ones are more relevant than others.
It’s common to treat features as “hints” to the machine learning algorithm you’re
using, following the rationale that, “If I give it more features, it can approximate the
model better since it has more information to work with;” however, this paradoxically
makes it more difficult for the model, since now the feature space has grown and
“importance” is harder rather than easier to determine.
Restriction Bias If you can somehow define a distance function that relates feature points together, you can represent the data with a kNN.
Computational Learning Theory
People worry that computers will get too smart and take over
the world, but the real problem is that they’re too stupid and
they’ve already taken over the world.
— Pedro Domingos
data;
• how samples are presented : until now, we’ve been training models on entire
“batches” of data at once, but this isn’t the only option; online learning (one at
a time) is another alternative; and
• how samples are selected : is random shuffling the best way to choose data
to train on?
Teaching
If the teacher is providing friendly examples, wouldn't a very ambitious teacher—one that wants the learner to figure out the hypothesis h ∈ H as quickly as possible—simply tell the learner to ask, "Is the hypothesis h?" Though effective, that's not
a very realistic way to learn: the question-space is never “the entire set of possible
questions in the universe.” Teachers have a constrained space of questions they can
suggest.
f (x1 · · · x5 ) = x2 ∧ x4 ∧ x5
we don’t really learn anything from h = 0, there are many possible f s that are
extremely hard to guess. For a general k, it'll take 2^k questions in the worst case (consider when exactly one k-bit string gives h = 1).
• a consistent learner: one that produces the correct result for all of the training samples:

    \forall x \in S : c(x) = h(x)

Given the target concept of XOR (left) and some training data (right):

    Target concept:          Training data:
    x1   x2   c(x)           x1   x2   c(x)
    0    0    0              0    0    0
    0    1    1              0    1    1
    1    0    1              1    1    ?
    1    1    0

and the hypotheses:
3.2.2 Error
Given a candidate hypothesis h, how do we evaluate how good it is? We define the
training error as the fraction of training examples misclassified by h, while the true
error is the fraction of examples that would be misclassified on the sample drawn from
a distribution D:
    \mathrm{error}_D(h) = \Pr_{x \sim D}\left[ c(x) \ne h(x) \right]          (3.1)
This lets us minimize the punishment that comes from getting rare examples wrong.
¹ Author's note: this is such an unfortunate term and is a classic example of the machine learning field intentionally gatekeeping newcomers with obscure jargon. Wouldn't it make much more sense to simply call this the viable hypotheses?
So if there’s a very obscure case that comes up once in a blue moon, it should be
treated differently than getting a commonly-occurring case incorrect.
Now we use an interesting truth (shown without proof here, just refer to the lectures or a real textbook lol): −ε ≥ ln(1 − ε). Thus, |H|(1 − ε)^m ≤ |H| e^{−εm}.

This gives us an upper bound that the version space is not ε-exhausted after m samples, and this is related to our certainty goal δ. To solve this inequality in terms of m, we get:

    |H|\, e^{-\varepsilon m} \le \delta          (3.2)
    \ln|H| - \varepsilon m \le \ln\delta          [logarithm product rule]
    m \ge \frac{1}{\varepsilon} \left( \ln|H| + \ln\frac{1}{\delta} \right)          [flip inequality when dividing by -\varepsilon]

Notice that this upper bound is polynomial in all of the terms we need for a problem to be PAC-learnable.
H = {hi (x) = xi }
This is pretty good: it's only 4% of the total 2^10-element input space. Also notice that D was irrelevant.
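Since the example's specific targets aren't reproduced above, here is a small Python evaluation of the bound m ≥ (1/ε)·(ln|H| + ln(1/δ)) with made-up choices of ε = 0.1 and δ = 0.2; for |H| = 10 it comes out to about 40 samples, i.e. roughly 4% of the 2^10 possible inputs.

    from math import ceil, log

    def haussler_bound(H_size, epsilon, delta):
        """m >= (1/epsilon) * (ln|H| + ln(1/delta))"""
        return ceil((log(H_size) + log(1 / delta)) / epsilon)

    # H = {h_i(x) = x_i} over 10-bit inputs has only |H| = 10 hypotheses.
    m = haussler_bound(H_size=10, epsilon=0.1, delta=0.2)   # epsilon, delta: assumed targets
    print(m, m / 2**10)                                     # 40 samples, ~3.9% of the input space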
3.3.1 Intuition
Let’s work through a concrete example to build up some intuition about how to tackle
this problem. Suppose we’re given a simple input space, and a simple set of infinite
hypotheses: is x bigger than some number θ?
X = {1, 2, 3, 4, . . . , 10}
H : {h(x) = x ≥ θ | ∀θ ∈ R}
Obviously since θ is a real number, |H| = ∞. However, notice that many of those
hypotheses are completely meaningless. More specifically, any choice of θ > 10 gives
the same result as any θ < 1. Similarly, since X ⊂ N, only integer θs really make
sense. This gives us a notion of a “semantic” hypothesis space: hypotheses that
are meaningfully different, as opposed to all of the hypotheses in the conceivable
universe that we could write down. . .
TL;DR: We want to differentiate the hypotheses that matter (of which there are few) from the hypotheses that don't (of which there are infinitely many).
³ Obviously, there are an infinite number of lines, and any model with continuous inputs has an infinite number of ways to change up the input.
Example To continue with our previous example, the answer is simply one. The
hypotheses are all binary, true-or-false questions, so any one input can be labeled in
all of the two possible ways. However, since we only consider x ≥ θ, we can’t properly
map a pair of inputs. Think of θ as being the “separator” between some training data
S = {x1 , x2 }; then, slide it along:
(Figure: sliding θ along a number line containing x1 < x2: placing θ1 before both points labels them (T, T); θ2 between them gives (F, T); θ3 after both gives (F, F). No placement produces (T, F).)
Suppose we’re given an input space of all real numbers and a hypothesis
space that can tell us whether or not an input value is within a particular
interval:
    X = \mathbb{R}
    H = \{ h(x) = (x \in [a, b]) \},\ \text{parameterized by } a, b \in \mathbb{R}
To show a lower bound for a VC dimension, all you need to do is find a single example
of inputs for which all labeling combinations can be done.
Practical VC Dimensions
What’s the VC dimension for something more practical in machine learning, like a
linear separator? That is, for
    X = \mathbb{R}^2
    H = \{ h(x) = (w^T x \ge \theta) \}
The VC dimension is three. The use of the second (Euclidean) dimension lets us solve
the problem from the above example, but four inputs (one inside of the convex hull
of the other three, for example) are impossible to shatter.
This might lead to an inductive hypothesis: with one parameter we had one VC
dimension; with two, we had two; with three, we had three. . . can we expect to find
a VC dimension of four when we decide to work with 3D space?
In fact, this is precisely the case: for any n-dimensional hyperplane (or hypothesis class), the VC dimension will be n + 1.
Sample Complexity
The whole divergence into VC dimensions was to find a way to deal with infinite
hypothesis spaces in the Haussler theorem (3.7) so that we can make estimates about
how many m samples we need to find good hypotheses.
The derivation (which I'm sure is long and arduous) leads us to the following:

    m \ge \frac{1}{\varepsilon} \left( 8 \cdot \mathrm{VC}(H) \cdot \log_2\frac{13}{\varepsilon} + 4 \log_2\frac{2}{\delta} \right)          (3.8)
Can we determine the VC dimensions of finite hypothesis spaces, too? Turns out, finding an upper bound isn't hard. If d = VC(H), then there are 2^d distinct concepts (each picking a different h ∈ H). Thus, since 2^d ≤ |H|, we know that d ≤ log_2 |H|.
(Figure: a learner taking inputs x1, x2, x3 and producing an output y.)
Which of these messages has more information? Many answers may make sense
intuitively: perhaps we sent one bit per flip regardless; perhaps all we’re sending
is the flip ratio; perhaps we can compress the unfair flips into a single bit; etc. In
actuality, we need 10 bits for the first message and zero for the second: the output is
completely predictable without extra information.
with a single bit. However, now we need three bits for B and C: if D = 10, then B = 110 and C = 111. We're still saving information, though, because these cases occur less frequently.

How much are we saving? To answer that, we'll need to use math. What's the expected message size of a single letter? It's the sum, over the letters, of each letter's size weighted by its probability:

    E = \sum_i \left( |\ell_i| \cdot \Pr[\ell_i] \right) = 0.5 \cdot 1 + 0.125 \cdot 3 + 0.125 \cdot 3 + 0.25 \cdot 2 = 1.75 \text{ average bits}
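The same computation in Python, along with the distribution's entropy; assuming the remaining letter A is the one coded as "0" with probability 1/2 (as the expected-size computation implies), the expected length and the entropy both come out to 1.75 bits, which is what makes this variable-length code optimal.

    from math import log2

    # Letter probabilities and code lengths from the example:
    #   A -> "0" (1 bit), D -> "10" (2 bits), B -> "110" and C -> "111" (3 bits each).
    probs   = {"A": 0.5, "B": 0.125, "C": 0.125, "D": 0.25}
    lengths = {"A": 1,   "B": 3,     "C": 3,     "D": 2}

    expected_bits = sum(probs[l] * lengths[l] for l in probs)
    entropy = -sum(p * log2(p) for p in probs.values())
    print(expected_bits, entropy)   # both 1.75: the code matches the distribution's entropy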
There is also the idea of conditional entropy, where the entropy for a variable changes based on another:

    h(Y \mid x) = -\sum_{y \in Y} \Pr[x, y] \log_2 \Pr[y \mid x]

Obviously, if two variables are independent of each other, h(Y | x) = h(Y), and H(X, Y) = H(X) + H(Y).
We need to differentiate between two cases: perhaps x tells us a lot about Y, in which case Y's conditional entropy will be small; however, it's possible that Y's entropy is small to begin with. To solve this, we introduce the formula for mutual information:

    I(X, Y) = H(Y) - H(Y \mid X)
The Kullback–Leibler divergence measures the difference between two distributions, p and q:

    D_{KL}(p \,\|\, q) = \int \Pr_p[x] \log_2 \frac{\Pr_p[x]}{\Pr_q[x]}\, dx          (3.11)
Bayesian Learning
    \Pr[x \mid y] = \frac{\Pr[y \mid x] \cdot \Pr[x]}{\Pr[y]}          (4.1)

    \Pr[a, b] = \Pr[a \mid b] \cdot \Pr[b]
    \Pr[a, b] = \Pr[b \mid a] \cdot \Pr[a]          [since order doesn't matter]
This simple rule is what powers Bayesian learning. Our learning algorithms aim to learn the "best" hypothesis given data and some domain knowledge. If we reframe this as being the most probable hypothesis, now we're cooking with Bayes, since this can be expressed as:

    \arg\max_{h \in H} \Pr[h \mid D]
where D is our given data. Let's apply Bayes' rule to our novel view of the world:

    \Pr[h \mid D] = \frac{\Pr[D \mid h] \cdot \Pr[h]}{\Pr[D]}          (4.2)
But what does Pr [D]—the probability of the data—really mean? Well D is our set of
training examples: D = {(x1 , d1 ), . . . , (xn , dn )} and they were chosen/sampled/given
(recall the variations we discussed in section 3.1) from some probability distribution.
What about Pr [D | h], this strange notion of data occurring given our hypothesis?
This is the likelihood—for a particular label, how likely is our hypothesis to output
it? The beauty of this formulation is that it’s a lot easier to figure out Pr [D | h] than
the original quantity.
Finally, what does Pr [h]—the prior probability of the hypothesis h—mean? This
represents our domain knowledge: hypotheses that represent the world or problem
space better will be likelier to occur.
When working with probabilities, it’s crucial to remember the paradox of priors.
Consider a man named Dwight diagnosed with dental hydroplosion—an extremely
rare disease that occurs in 8 of 1000 people—based on a test with a 98% correct
positive rate and a 97% correct negative rate. Is it actually plausible that his teeth
will spontaneously explode and drip down his esophagus in his sleep?
Well, apply Bayes' rule and see:

    \Pr[\text{has it} \mid \text{test says has it}] = \frac{\Pr[\text{test says has it} \mid \text{has it}] \cdot \Pr[\text{has it}]}{\Pr[\text{test says has it}]}
        = \frac{0.98 \cdot 0.008}{\underbrace{0.98 \cdot 0.008}_{\Pr[\text{test says has it} \mid \text{has it}] \cdot \Pr[\text{has it}]} + \underbrace{0.03 \cdot (1 - 0.008)}_{\Pr[\text{test says has it} \mid \text{not has it}] \cdot \Pr[\text{not has it}]}}
        \approx 20\%
In other words, Dwight only has a 20% chance of actually having dental hydroplosion
despite the fact that the test has a 98% accuracy! The disease is so rare that the
probability of actually being infected dominates the test accuracy itself. The prior
(domain knowledge) of a random person having the disease (0.8%) is critical here.
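The same arithmetic as a few lines of Python, which makes it easy to vary the prior and watch how strongly it drives the posterior.

    def posterior(prior, true_positive_rate, false_positive_rate):
        """Pr[has it | test says has it], via Bayes' rule with total probability
        over the two ways the test can come back positive."""
        evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
        return true_positive_rate * prior / evidence

    # Dwight's numbers: an 8-in-1000 prior, a 98% correct positive rate,
    # and a 97% correct negative rate (so a 3% false positive rate).
    print(posterior(prior=0.008, true_positive_rate=0.98, false_positive_rate=0.03))  # ~0.21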
The latter just assumes a uniform probability: all hypotheses are equally-likely to
occur. Unfortunately, enumerating every hypothesis is not practical, but fortunately
it gives us a theoretically-optimal approach.
Let's put theory into action.² Suppose we're given a generic set of labeled examples: {(x_i, d_i)}, where each label comes from a function and has some normally-distributed noise, so:

    d_i = f(x_i) + \varepsilon_i
    \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
The maximum likelihood is clearly defined by the Gaussian: the further from the
mean an element is, the less likely it is to occur.
¹ We dropped the denominator from (4.1) because it's just a normalization term—the resulting probabilities will still be proportional and thus equally comparable.
² At this point, the lectures go through a lengthy derivation for two specific cases, then actually apply it in general. We are skipping the former here because it's Tuesday, the project is due on Sunday, and I have like a dozen more lectures to get through before I can properly start.
    h_{ML} = \arg\max_h \prod_i \Pr[d_i \mid h]
           = \arg\max_h \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(d_i - h(x_i))^2}{2\sigma^2} \right)
           = \arg\max_h \prod_i \exp\left( -\frac{(d_i - h(x_i))^2}{2\sigma^2} \right)          [the constant has no effect on the arg max]
    \ln h_{ML} = \arg\max_h \sum_i -\frac{1}{2} \cdot \frac{(d_i - h(x_i))^2}{\sigma^2}          [ln(·) both sides, then the log of a product is the sum of the logs of its components]
           = \arg\max_h -\sum_i (d_i - h(x_i))^2          [drop constants like before]
           = \arg\min_h \sum_i (d_i - h(x_i))^2          [-\max(\cdot) = \min(\cdot)]

But this final expression is simply the sum of squared errors! Thus, the log of the maximum likelihood is simply expressed as:

    h_{ML} = \arg\min_h \sum_i (d_i - h(x_i))^2
Thus, our intuition about minimizing the SSD in the algorithms we covered earlier is
correct: gradient descent, linear regression, etc. are all the way to go to find a good
hypothesis. Bayesian learning just confirmed it!
Note our foundational assumptions, though: our dataset is built upon a true underly-
ing function and is perturbed by Gaussian noise. If this assumption doesn’t actually
hold for our problem, then minimizing the SSD will not give the optimal results that
we’re looking for.
If we added another variable (like, "Is it thundering?"), our table would double in size. For n Boolean variables, we have a massive 2^n-entry joint distribution. We can represent it in a different, more-efficient way that instead takes 2n entries via factoring.
∀x ∈ X, y ∈ Y, z ∈ Z :
Pr [X = x | Y = y, Z = z] = Pr [X = x | Z = z]
Kudrayvtsev 56
MACHINE LEARNING Supervised Learning
Pr [Y | X]
Pr [Y | ¬X]
X Y Z
Pr [Z | Y ]
Pr [X]
Pr [Z | ¬Y ]
(Figure: a five-node Bayesian network in which A and B have no parents, C depends on A and B, D depends on B and C, and E depends on C and D.)

    A ∼ Pr[A]
    B ∼ Pr[B]
    C ∼ Pr[C | A, B]
    D ∼ Pr[D | B, C]
    E ∼ Pr[E | C, D]

To get a sample over the entire joint distribution (that is, to sample some X ∼ Pr[A, B, C, D, E]), we need to sample from the Bayesian network in its topological order (an easily-achievable configuration on an acyclic digraph), because:

    \Pr[A, B, C, D, E] = \Pr[A] \cdot \Pr[B] \cdot \Pr[C \mid A, B] \cdot \Pr[D \mid B, C] \cdot \Pr[E \mid C, D]
Notice the space savings of this representation: the full joint distribution (if A through E were Boolean variables) requires 2^5 − 1 = 31 probability specifications to recover,
whereas under the specific conditional independencies we described, only 1 + 1 + 4 +
4 + 4 = 14 probabilities are needed. If they were all completely independent, we’d
only need 5 probabilities (the product of the unconditionals).
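A sketch of sampling this five-node network in topological order (ancestral sampling); the conditional probability tables below are entirely made up, since the notes only specify the network's structure.

    import random

    random.seed(0)

    # Hypothetical CPTs for Boolean A..E: probability of being True given the parents.
    P_A, P_B = 0.3, 0.6
    P_C = {(False, False): 0.1, (False, True): 0.5, (True, False): 0.4, (True, True): 0.9}
    P_D = {(False, False): 0.2, (False, True): 0.7, (True, False): 0.3, (True, True): 0.8}
    P_E = {(False, False): 0.1, (False, True): 0.6, (True, False): 0.5, (True, True): 0.95}

    def sample_joint():
        """Walk the network in topological order: each node's parents already
        have values by the time we sample it."""
        A = random.random() < P_A
        B = random.random() < P_B
        C = random.random() < P_C[(A, B)]
        D = random.random() < P_D[(B, C)]
        E = random.random() < P_E[(C, D)]
        return A, B, C, D, E

    print(sample_joint())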
Inferencing Rules Let’s make some inferences. For clarity, though, let’s first reit-
erate the rules we’re going to be using:
• total probability: the probability of a random variable X is the sum of the
probabilities of the individual events that compose it (for example, if X is
“weather,” then it's the sum of “rainy,” “sunny,” “cloudy,” etc.):
$$\Pr[X] = \sum_{x \in X} \Pr[X = x]$$
Suppose we're given two boxes with colored balls inside of them (this is a
classic probability example; I have no idea why statisticians love boxes with
balls, or bags of marbles, or containers of jelly beans, or . . . )

[Figure: Box 1 and Box 2, each holding a different mix of colored balls.]

For every sample, we first choose a box uniformly at random, then fish out
a ball. The Bayesian network of this sampling process relates the box choice
to the balls we draw, and we want to answer:

Pr[2 = blue | 1 = green] = ?
The Bayes net itself gives us an insight on how we should compute this! By
combining the marginalization and chain rules, we get an expression for each candidate
value. Since we're combining the two values, Pr[1 = green] will actually disappear
entirely if we normalize, so we don't need to calculate it separately:
$$
\frac{\frac{1}{5}}{\frac{1}{5} + \frac{3}{8}}
  = \frac{\frac{8}{40}}{\frac{8}{40} + \frac{15}{40}}
  = \frac{8}{23}
\qquad\qquad
\frac{\frac{3}{8}}{\frac{1}{5} + \frac{3}{8}}
  = \frac{\frac{15}{40}}{\frac{8}{40} + \frac{15}{40}}
  = \frac{15}{23}
$$
Solving that example was miserable, but we followed a pretty clear process based on
the Bayesian network. Luckily for us, that means it should be possible to automate.
$$
\Pr\big[\text{spam?} \mid \text{contains “viagra”},\ \neg\,\text{contains “prince”},\ \neg\,\text{contains “Udacity”}\big]
\propto
\Pr\big[\text{contains “viagra”},\ \neg\,\text{contains “prince”},\ \neg\,\text{contains “Udacity”} \mid \text{spam?}\big] \cdot \Pr[\text{spam?}]
$$
(by Bayes' rule)
In general, when we have this format of a single parent V and its many children,
a₁, a₂, . . . , aₙ, then the probability is solvable as:
$$\Pr[V \mid a_1, a_2, \ldots, a_n] = \frac{\Pr[V]}{z} \cdot \prod_i^n \Pr[a_i \mid V]$$
where z is just a normalization term.
This beautiful relationship lets us do classification (yay, we’re finally talking about
machine learning again!): given the features / attributes (the children), we can make
inferences about the parent! The maximum a posteriori class (that is, the best class
label) is then simply the best v ∈ V :
$$\arg\max_{v \in V} \prod_i^n \Pr[a_i \mid V = v]$$
Downsides This seems almost too good to be true. How can a truly-naïve model
that assumes that there is complete independence between attributes perform well?
There's no way that an email containing the word “prince” is just as likely to
contain the word “viagra” as an email without the word “prince.” In fact, I
would expect spam emails targeting male genitalia to be a separate “class” of spam
than emails that target people with ties to Nigeria (though they may both target
gullible suckers).
This is exactly its downside: when there are strong inter-relationships between the
attributes, the model can’t make good inferences. However, these relationships (and
their probabilities) don’t matter so much when doing classification, since it just needs
to be “good enough,” and guess in the right direction from the limited information
that it has.
There's also a flaw in the “counting” approach: if a particular attribute aᵢ has never
been seen or associated with a particular v ∈ V, the entire product becomes zero.
This is not intuitive and is exactly why this doesn't happen in practice: the
probabilities need to be smoothed out so that one bad apple doesn't spoil the bunch.
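As a concrete sketch (a toy example of mine, not from the lectures), here is naïve Bayes by counting with a simple Laplace ("add-one") smoothing term so that unseen attribute values don't zero out the whole product.

```python
from collections import Counter, defaultdict

# Toy dataset: each example is ({attribute: value}, label).
data = [
    ({"viagra": 1, "prince": 0}, "spam"),
    ({"viagra": 1, "prince": 1}, "spam"),
    ({"viagra": 0, "prince": 0}, "ham"),
    ({"viagra": 0, "prince": 1}, "ham"),
]

label_counts = Counter(label for _, label in data)
# counts[label][attr][value] = times we saw (attr = value) alongside label
counts = defaultdict(lambda: defaultdict(Counter))
for attrs, label in data:
    for attr, value in attrs.items():
        counts[label][attr][value] += 1

def posterior(attrs, label, k=1):
    """Unnormalized Pr[label] * prod_i Pr[a_i | label], with add-k smoothing."""
    p = label_counts[label] / len(data)
    for attr, value in attrs.items():
        seen = counts[label][attr][value]
        total = label_counts[label]
        p *= (seen + k) / (total + 2 * k)  # 2 possible values per attribute
    return p

def classify(attrs):
    return max(label_counts, key=lambda label: posterior(attrs, label))

print(classify({"viagra": 1, "prince": 1}))  # most likely "spam"
```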
PART II
Unsupervised Learning
Rather than providing labeled training pairs to an algorithm like we did before,
in unsupervised learning we are focused on inferring patterns from the data
alone: we want to make sense out of unlabeled data. While earlier we were
doing what essentially amounted to function approximation, now we will be doing data
description: finding explanations and compact descriptions for data.
Contents
5 Randomized Optimization 63
6 Clustering 71
7 Features 80
Randomized Optimization
but it also makes sense to simply do a little exploration of the neighboring space
beyond just $\arg\max_{n \in N}$.
The concept of annealing comes from metallurgy, where metals are repeat-
edly heated and cooled to increase their ductility (basically bendability).
It's a silly thing to name an algorithm after, but this idea of “temperature”
is baked into simulated annealing (see algorithm 5.2).
$$
P(x, x', T) =
\begin{cases}
  1 & \text{if } f(x') \ge f(x) \\
  \exp\left(\dfrac{f(x') - f(x)}{T}\right) & \text{otherwise}
\end{cases}
\tag{5.1}
$$
For high temperatures, it's likely to accept x′ regardless of how bad it is; contrarily,
a low T is likely to prefer “decent” x′s that aren't much worse than x. In other
words, T → 0 behaves like hill climbing while T → ∞ behaves like a random walk
through f . In practice, slowly decreasing T is the way to go: the “cooling factor” α
in algorithm 5.2 below controls this.
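A minimal sketch of that acceptance rule and cooling schedule; the fitness function, neighbor function, initial temperature, and cooling factor below are all placeholder choices of mine.

```python
import math
import random

def simulated_annealing(f, neighbor, x0, T=10.0, alpha=0.99, iters=10_000):
    """Maximize f, sometimes accepting worse neighbors per equation (5.1)."""
    x = x0
    for _ in range(iters):
        x_new = neighbor(x)
        delta = f(x_new) - f(x)
        # Always accept improvements; accept downhill moves with prob e^(delta/T).
        if delta >= 0 or random.random() < math.exp(delta / T):
            x = x_new
        T *= alpha  # slowly cool the temperature
    return x

# Example: maximize a bumpy 1-D function.
f = lambda x: -(x - 3) ** 2 + math.sin(5 * x)
best = simulated_annealing(f, neighbor=lambda x: x + random.uniform(-0.5, 0.5), x0=0.0)
print(round(best, 2))
```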
Simulated annealing comes with an interesting fact: we can actually determine the
overall probability of ending at any given input. Namely,
$$\Pr[\text{end at } x] = \frac{\exp\left(f(x)/T\right)}{z_T}$$
where $z_T$ is just a normalization term.
Notice that this is actually in terms of the fitness of x, so the probability of ending
at a particular input is directly proportional to how good it is! That's a really nice
property.
• Then, there’s cross-over, where different population groups have their attributes
combined together to hopefully produce something novel and better-performing
(like sexual selection).
• This happens over generations—this is simply the number of iterations of im-
provement.
The biggest deviation (and not the sexual kind) from what we’ve seen thus far is
the idea of cross-over; otherwise, each population essentially operates like a random-
restart in parallel.
5.3.2 Cross-Over
Again, we’ve (intentionally) left cross-over as a “black box.” Some concrete examples
might help expand on what it means and how it can be useful.
Example: Bitstrings
Suppose, as we've been doing, that our input space is bitstrings. Let X be the set of
8-bit strings, and we're performing a cross-over with the following two “fit” individuals:

01101100
11010111

One simple option is to pick a split point, keep each parent's bits on one side, and
swap the bits on the other (here, splitting the strings in half):

M : 0110 1100
D : 1101 0111
This strategy obviously encodes some important assumptions: the locality of the
bits themselves has to matter; furthermore, it assumes that these “subspaces” can be
independently optimized.
Suppose this previous assumption (bit locality) is incorrect. We can cross over in a
different way, then: instead, let's either keep or flip the bits at random.

M :  01101100
D :  11010111
     kkkkfkfk
     --------
     01100110
     11011101

where k/f correspond to “keep” and “flip,” respectively. This is called uniform
crossover.
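A quick sketch of both strategies on 8-bit strings (a toy illustration of mine; the split point and the 50% flip probability are arbitrary choices):

```python
import random

def one_point_crossover(m, d, split=4):
    """Swap the tails of the two parents at a fixed split point."""
    return m[:split] + d[split:], d[:split] + m[split:]

def uniform_crossover(m, d):
    """At each position, randomly keep the parents' bits or swap them."""
    c1, c2 = [], []
    for a, b in zip(m, d):
        if random.random() < 0.5:       # "keep"
            c1.append(a); c2.append(b)
        else:                           # "flip": take the other parent's bit
            c1.append(b); c2.append(a)
    return "".join(c1), "".join(c2)

print(one_point_crossover("01101100", "11010111"))  # ('01100111', '11011100')
print(uniform_crossover("01101100", "11010111"))
```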
5.3.3 Challenges
Representing an optimization problem as input to a genetic algorithm is difficult.
However, if done correctly, it’s often quite effective. It’s commonly considered to be
“the second-most effective approach” in a machine learning engineer’s toolbelt.
5.4 MIMIC
The algorithms we've looked at thus far (hill climbing, simulated annealing, and
genetic algorithms) all feel relatively primitive because of a simple fact: they always
just end at one point. They learn nothing about the space they're searching over,
they remember nothing about where they've been, and they build no model of
the underlying probability distribution that they're searching over.
The main principle behind MIMIC (Isbell ‘97) is that it should be possible to directly model
a probability distribution, successively refine it over time, and have that lead to some
semblance of structure.
The probability distribution in question depends on a threshold, θ. It's formulated
as:
$$
\Pr{}_{\theta}[x] =
\begin{cases}
  \frac{1}{z_\theta} & \text{if } f(x) \ge \theta \\
  0 & \text{otherwise}
\end{cases}
\tag{5.2}
$$
Namely, we're basically only considering samples x ∈ X, uniformly, that have a
“good enough” fitness level.
Here, π(·) represents the “parent” of a node in the dependency tree. Since each
node only has exactly one parent (aside from the root, so π(x0 ) = x0 ), the table of
conditional probabilities stays much smaller than the full joint above.3
Why dependency trees, though? Well, they let us represent relationships between
variables, and they do so compactly. It's worth noting that we're not stuck using
dependency trees specifically; any distribution we can estimate and sample from would do.
² A little more rigorously, we're basically assuming that $\Pr_\theta[x] \approx \Pr_{\theta+\varepsilon}[x]$.
³ Specifically, the size of the table is quadratic in the number of features.
At the end, we end up with a sort of cost function Jπ that we’re aiming to minimize:
find each feature’s parent such that overall entropy is low. In other words, find parents
that give us a lot of information about their child features.
To make this Jπ easy to compute—for reasons beyond my understanding—we’ll in-
troduce a new term: the entropy of each feature. This is okay because it doesn’t
affect π(·) which is what we’re minimizing.
$$\min_\pi J_\pi = \sum_i h\left(x_i \mid \pi(x_i)\right)
\qquad\qquad
\min_\pi J'_\pi = -\sum_i h(x_i) + \sum_i h\left(x_i \mid \pi(x_i)\right)$$
Both of these will give us the same π(·). This term is related to the mutual information
(see (3.10) in Information Theory):
$$\min_\pi J'_\pi = -\sum_i h(x_i) + \sum_i h\left(x_i \mid \pi(x_i)\right) = -\sum_i I\left(x_i;\, \pi(x_i)\right)$$
which is the same as solving
$$\max_\pi \sum_i I\left(x_i;\, \pi(x_i)\right)$$
In English, what we’ve come to show is that in order to maximize the similarity
between a true probability distribution and our guess (I know it’s been a while, but this is
what we started with), we want to maximize the mutual information between each feature
and its parent.
Finally, we need to actually calculate this optimization. This equation can actually
be represented by a fully connected graph: the nodes are the xᵢ features and the
edges are the mutual information I(xᵢ; xⱼ) between them:

[Figure: a fully-connected graph over the features x₁, . . . , x₅ with mutual-information edge weights.]
And since we want to find the tree that maximizes the total mutual information,
this is equivalent to finding the maximum spanning tree. Since more-traditional
algorithms find the minimum spanning trees, this is equivalent to just inverting the
I edge weights.4
⁴ Since this is a fully-connected graph, Prim's algorithm is the way to go.
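Here's a rough sketch of that step (my own illustration, not the paper's implementation): estimate pairwise mutual information from the current sample of "good enough" points, then grow a maximum spanning tree Prim-style by always adding the highest-MI edge (equivalently, a minimum spanning tree on the negated weights).

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def dependency_tree(samples):
    """samples: (m, n) array of discrete feature values from the top-theta points.
    Returns parent[i] for each feature, forming a maximum-MI spanning tree."""
    n = samples.shape[1]
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi[i, j] = mi[j, i] = mutual_info_score(samples[:, i], samples[:, j])

    parent = {0: 0}                        # feature 0 is the root: pi(x0) = x0
    in_tree = {0}
    while len(in_tree) < n:
        # Greedily attach the out-of-tree feature with the strongest MI link.
        best = max(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: mi[e[0], e[1]],
        )
        parent[best[1]] = best[0]
        in_tree.add(best[1])
    return parent

# e.g. on random bitstring samples (purely illustrative):
rng = np.random.default_rng(0)
print(dependency_tree(rng.integers(0, 2, size=(100, 5))))
```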
Clustering
Wash your hands often with soap and water for at least 20 sec-
onds, especially after blowing your nose, coughing, or sneezing,
or having been in a public place.
Wash your hands after touching surfaces in public places.
Avoid touching your face, nose, eyes, etc.
Avoid crowds, especially in poorly ventilated spaces.
— Center for Disease Control, People at Risk for COVID-19
Here, they’re represented as belonging to the blue or pink categories, but this obvi-
ously is only done visually—if we knew their labels, we’d use a supervised technique.
How do we cluster them into these two groups mathematically? Let’s define the
problem more rigorously:
The clustering problem:
Input: A set of objects: X
A distance metric D(·, ·), defining inter-object distances such
that D(x, y) = D(y, x) where x, y ∈ X .
Output: A partitioning of the objects such that PD (x) = PD (y) if x and y
belong to the same cluster.
This notion of similarity being defined by distance should remind us of k-nearest
neighbor. Given this definition, we can come up with some trivial partitioning
schemes: we can just stuff every element into its own cluster, or even all in the
same cluster. The problem definition does not measure or differentiate between a
“good” or “bad” clustering.
Because of this loose definition for clustering, its solutions have a high variance and
are very algorithm-driven; thus, each algorithm can be analyzed independently.
6.1.1 Considerations
If there was a perfect clustering algorithm, we would’ve just covered it and not both-
ered with enumerating them.
¹ This is true for the simplest possible algorithm; it can absolutely be improved, but definitely not
beyond some factor of n² since every pair of points needs to be considered.
(a) A set of points we’d like to cluster. (b) A (likely) intuitive clustering. (c) The actual SLC result.
Figure 6.1: A comparison of how single linkage clustering results may differ
from intuition. Here, k = 2 clusters.
Consider the set of points in Figure 6.1a: notice that SLC does not result in an
intuitive clustering. When we consider the “merge closest clusters” behaviour, the
inner four points on the left clump would always be too far relative to all other
possible pairs.
² The mean (or average) actually minimizes the SSD (sum of squared differences), since you are
implicitly assuming that your data set is distributed around a point with a Gaussian falloff; using
the mean thus maximizes the likelihood.
Figure 6.2: The result of applying k-means on the same set of points that SLC
struggled with in Figure 6.1, given that the two white-circled points were selected
as the random cluster centers.
6.2.1 Convergence
The main benefit of k-means clustering is that it’s an incredibly simple and straight-
forward method; as you can see in algorithm 6.1, it requires a small handful of trivial
operations. The secondary benefit is that it actually provably converges to a local
(not a global, mind you) minimum, guaranteeing a certain level of “correctness.”
The “proof” is described in this math aside for those interested. The TL;DR summary
results in the following succinct properties:
• Each iteration takes polynomial time: O(kn).
• There's a finite (though exponential) number of iterations: O(kⁿ).
• The error decreases if ties are broken consistently.
• It can get stuck in local minima.
This latter point can be avoided (well, alleviated) much like with random hill climbing:
by using random restarts and well-distributed starting points.
6.2.2 Considerations
There are some significant downsides, though, the biggest of which is that k must
be specified in advance. It’s a fairly memory intensive algorithm since the entire set
of points needs to be kept around. Furthermore, it’s sensitive to initialization
(remember, it finds the local minimum) and outliers since they affect the mean
disproportionately.
Another important downside to k-means is that it only finds “spherical” clusters. In
other words, because we rely on the SSD as our “error function,” the resulting clusters
try to balance the centers to make points in any direction roughly evenly distributed.
This is highlighted in Figure 6.3.
The red and blue segments are artificially colored to differentiate them in feature
space, and as humans we would intuitively group them into those clusters, but
k-means' preference for spherical clusters results in an incorrect segmentation.
$$P^t(x) = \arg\min_i \left\lVert x - c_i^{t-1} \right\rVert_2^2
\qquad\qquad
c_i^t = \frac{\sum_{y \in C_i^t} y}{\left\lvert C_i^t \right\rvert}$$
updating the cᵢs and incrementing t each iteration.
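Those two update steps (assign each point to its nearest center, then move each center to its cluster's mean) are short enough to sketch directly; the random initialization and toy data below are my own choices.

```python
import numpy as np

def k_means(points, k, iters=100, seed=0):
    """Lloyd's iteration: assign points to the nearest center, then recenter."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # P^t(x): index of the closest center for each point.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # c_i^t: the mean of each cluster (keep the old center if a cluster empties).
        new_centers = np.array([
            points[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign

pts = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = k_means(pts, k=2)
print(np.round(centers, 2))
```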
The further rationale is as follows: since there are a finite number of configu-
rations (clusters and partitions), and we’re never increasing in error, and we
break ties consistently, we guarantee convergence to a local minimum; in
the worst case, we will eventually enumerate all kⁿ of the configurations.
Obviously, the middle point can end up with either cluster depending on how ties
are broken and how the initial centers are chosen. Wouldn’t it be nice, though, if we
could put such “ambiguous” points into both clusters?
This idea precipitates the return of probability: each point now belongs to each of
the possible clusters with some probabilistic certainty. As with all things probability,
though, it relies on a fundamental assumption: in this case, we’re assuming that our
n points were selected from k uniformly-selected Gaussian distributions.
The goal is then to find a hypothesis of means, h = {µ₁, . . . , µₖ}, that maximizes the probability of the data.
Single ML Gaussian Suppose k = 1 for the simplest possible case. What Gaussian
maximizes the likelihood of some collection of points? Conveniently (and intuitively),
it’s simply the Gaussian with µ set to the mean of the points!
Extending to Many. . . With k possible sources for each point, we’ll introduce
hidden variables for each point that represent which cluster they came from. They’re
hidden because, obviously, we don’t know them: if we knew that information we
wouldn't be trying to cluster them. Now each point x is actually coupled with the
probabilities of coming from the clusters:
$$x = \langle x,\; z_1,\; z_2,\; \cdots,\; z_k \rangle$$
$$E[z_{ij}] = \frac{\Pr[X = x_i \mid \mu = \mu_j]}{\sum_{j=1}^{k} \Pr[X = x_i \mid \mu = \mu_j]}
\qquad\qquad
\mu_j = \frac{\sum_i E[z_{ij}]\, x_i}{\sum_i E[z_{ij}]}$$
The first is the expectation step (defining Z from µ); the second is the maximization
step (defining µ from Z).
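A compact sketch of those two alternating steps for a mixture of unit-variance Gaussians; the unit variance and random initialization are simplifying assumptions of mine, mirroring the means-only hypothesis above.

```python
import numpy as np

def em_soft_cluster(points, k, iters=50, seed=0):
    """Alternate the E-step (E[z_ij] from mu) and M-step (mu from E[z_ij])."""
    rng = np.random.default_rng(seed)
    mu = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # E-step: responsibilities proportional to Pr[x_i | mu_j], unit variance.
        sq_dist = ((points[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        z = np.exp(-0.5 * (sq_dist - sq_dist.min(axis=1, keepdims=True)))
        z /= z.sum(axis=1, keepdims=True)
        # M-step: each mean becomes the z-weighted average of the points.
        mu = (z.T @ points) / z.sum(axis=0)[:, None]
    return mu, z

pts = np.vstack([np.random.randn(60, 2), np.random.randn(60, 2) + 4])
means, resp = em_soft_cluster(pts, k=2)
print(np.round(means, 2))
```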
6.3.2 Considerations
Like with k-means, the initial means (which are akin to the cluster centers) are chosen
randomly. Unfortunately, it doesn’t come with the same convergence guarantees:
though it will never get worse, it might get better at an increasingly-slower rate. In
practice, though, you can assume convergence. Another benefit is that it works with
any distribution for which the expectation maximization is solvable; we only used
Gaussians here for simplicity.

Figure 6.4: An application of soft clustering to k = 2 clusters, with green points
indicating a reasonable level of uncertainty (±10%) about which cluster they belong to.
(a) The initial, randomly-chosen cluster means, marked with little ×s. (b) The clusters
and the newly-adjusted centers after a single iteration. (c) The clusters and centers after
another iteration. (d) The clusters and centers upon convergence.
6.4.1 Properties
Though the best way to compare clustering algorithms is simply to enumerate and
analyze them, it is useful to define some desirable properties for any given approach:
Richness Ideally, our clustering algorithm would be able to associate any number
of points with any number of clusters depending on our inputs and distance
metric. Formally, richness means that there is some distance metric D such
that any clustering can work, so: ∀c, ∃D : PD = c.
Theorem 6.1 (Impossibility Theorem, Kleinberg ‘15). No clustering scheme can achieve all
three of: richness, scale-invariance, and consistency.
Features
Machines Recall the curse of dimensionality: the amount of data that we need
grows exponentially with the number of feature dimensions. Thus, careful decision-making
about our choice of features lets us build models that learn quickly and
effectively and need minimal datasets.
In summary, with proper feature selection, not only will you understand your data
better, but you will also have an easier learning problem.
Feature selection doesn’t necessarily need to be a human-driven process. We can
leverage the power of a l g o r i t h m s to reduce some set of n features to m
“important” features. However, this is a hard (actually, an NP-hard) problem: we are
choosing some m out of n possible values, and $\binom{n}{m} \in \Theta(n^m)$.
If we had some f(·) that could score a set of features, this becomes an optimization
problem. Algorithms for tackling this problem fall into two categories: filtering,
where a search process reduces the features to a smaller set up front and feeds it to a
learning algorithm, and wrapping, where the search process interacts with the learning
algorithm directly to iteratively adjust the feature set.
[Figure: in filtering, a search box takes the features, outputs fewer features, and hands them to the learner L; in wrapping, the search and the learner sit in a loop, exchanging candidate feature sets.]
7.1.1 Filtering
Pros & Cons As you can imagine, the logic in the filtering approach needs to be
baked directly into the search itself. There’s no opportunity for the learning algorithm
to give feedback on the process. However, this obviously makes it work much faster:
there’s no need to wait on the learner.
Techniques How can that even work, though: aren't we essentially looking at the
features isolated in a vacuum? Well, looking at the (supervised) labels is allowed in
filtering; you just can't use the learner. Thus, decision trees are essentially a filtering
algorithm: we split each branch on the best available attribute by ranking their
information gain. By definition, this is a way to separate the wheat from the chaff.
This is just one such search method; we can also use things like variance, entropy,
and even concepts like linear independence and cross-feature redundancy.
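For instance, a bare-bones filter might just score every feature on its own against the labels and keep the top m; the dataset here is hypothetical and the scoring uses scikit-learn's mutual-information estimator as an information-gain-style stand-in.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Hypothetical dataset: 200 samples, 10 features, only the first two informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

scores = mutual_info_classif(X, y, random_state=0)  # per-feature relevance score
m = 2
keep = np.argsort(scores)[-m:]                       # indices of the top-m features
print(sorted(keep))                                  # most likely [0, 1]
```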
7.1.2 Wrapping
Pros & Cons The natural perk of involving the learner in a feedback loop of feature
search is that model bias and learning are taken into account. Naturally, though, each
iteration takes much longer.
Relevance For starters, we need a way to measure their relevance; we’ll actually
define this colloquial term rigorously: a feature xi is. . .
• strongly relevant if removing it degrades the Bayes' optimal classifier;
• weakly relevant if it’s not strongly relevant (obviously) and there’s some
subset of features S such that adding xi to S improves the Bayes’ optimal
classifier; and
• irrelevant, otherwise.
Usefulness Just because a feature only ever maps to a single value (which would
make it irrelevant), for example, does not mean it's entirely useless in all selection
scenarios. A feature's usefulness is measured by its effect on a particular predictor
rather than on the Bayes’ optimal classifier.
Specifically, we’re going to be targeting linear transformations, so our new feature set
will be some linear combination of the originals. This will be somewhat reminiscent
of the inspiration for the kernel trick from support vector machines: we will combine
our features to get new information about them.
7.2.1 Motivation
Consider the information retrieval problem: given an unknown search query, we want
to list documents from a massive database relevant to the query. How do we design
this?
If we treat words as features, we encounter the curse of dimensionality: there are a
lot of words. . . Furthermore, words can be ambiguous (the same word can mean
multiple [and sometimes even conflicting! think about the modern use of “literally”]
things) and can describe the same concepts in a variety of ways (different words
meaning the same thing). Because of this, we encounter false positives and false
negatives even if we could find the documents with the words efficiently. Thus, words
are an insufficient indicator!
A good feature transformation will combine features together (like lumping words
with similar meanings or overarching concepts together), providing a more compact,
generalized, and efficient way to query things. For example, a query for the word “car”
should rank documents with the word “automobile” (a synonym), “Tesla” (a brand),
and even “motor” (an underlying concept) higher than a generic term like “turtle.”
Given this motivation via a real-world application, we can now talk about specific
algorithms in the abstract.
Recall that variance measures spread as the average of the squared differences from
the mean. In 2 dimensions, the difference from the mean (the ⊗ in Figure 7.1) for a
point is expressed by its distance. The direction of maximum variance, then, is the
line that most accurately describes the direction of spread. The second line, then,
describes the direction of the remaining spread relative to the first line.
Figure 7.1: The first principal component and its subsequent orthogonal principal component.
You might notice that the first principal component in Figure 7.1 resembles the line
of best fit; this is no coincidence: the direction of maximum variance reduces the sum
of squared differences just like linear regression.
These notes are pulled straight from my notes on computer vision which
cover the same topic as it pertains to images and facial recognition; they go
into the full derivation of PCA from several different angles.
Lowering Dimensionality
The idea is that we can collapse a set of points to their largest eigenvector (i.e. their
primary principal component). For example, the set of points from Figure 7.1 will be
collapsed to the blue points in Figure 7.2; we ignore the second principal component
and only describe where on the line the points are, instead.
Collapsing our example set of points from two dimensions to one doesn't seem like
that big of a deal, but this idea can be extended to however many dimensions we
want. Unless the data is uniformly random, there will be directions of maximum
variance; collapsing things along them (for however many principal components we
feel are necessary to accurately describe the data) still results in massive dimension-
ality savings.

Figure 7.2: Collapsing a set of points to their principal component, Ê_{λ₁}. The
points can now be represented by a coefficient: a scalar of the principal component
unit vector.
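A small sketch of that collapse using the eigenvectors of the covariance matrix (plain numpy, my own illustration; the generated data is arbitrary):

```python
import numpy as np

def pca_collapse(points, n_components=1):
    """Project points onto their top principal component(s)."""
    centered = points - points.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    top = eigvecs[:, -n_components:]                # direction(s) of max variance
    coeffs = centered @ top                         # scalar coordinate(s) per point
    return coeffs, top

rng = np.random.default_rng(1)
# Points roughly along a line, plus noise: most variance lies in one direction.
pts = np.outer(rng.normal(size=100), [2.0, 1.0]) + rng.normal(scale=0.3, size=(100, 2))
coeffs, direction = pca_collapse(pts)
print(direction.ravel(), coeffs.shape)  # unit direction and (100, 1) coefficients
```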
Intuitively, we can imagine the points above the blue line “cancelling out” the
ones below it, effectively making their average in-between them. Principal
component analysis works best when training data comes from a single
class. There are other ways to identify classes that are not orthogonal such
as independent component analysis, which is coming up next.
Considerations
Principal component analysis is a well-studied algorithm with some nice, provable
properties, like the fact that it gives the best reconstruction of the original data. Its
downside comes from its very nature: features with high variance don’t necessarily
correlate to features with high importance. You might have a bunch of random noise
with one useful feature, and PCA will almost certainly drop the useful feature since
the random noise has high variance.
Intuition In the cocktail party problem, there are a bunch of people in a crowded,
noisy room (please don’t try this at home until the COVID-19 pandemic is over).
You want to pick out one particular voice from the crowd, but your ears are receiving
some (linear) combination of sounds from everyone who is speaking. If you move
around, the volumes of the voices change. Given enough movement, calculations, and
understanding of physics, you could isolate one of the voices based on their changes
in volume and whatnot.
This is the background for ICA: the voices are hidden variables (variables that we wish
we knew since they powered the data we see) that we’re trying to learn about, and
the overall room noise is our known data. ICA reconstructs the sounds independently
by modeling this assumption of the noise being a linear combination of some hidden
variables.
Specifics(ish) First, ICA samples each feature in the dataset to form a matrix.
Each row in the matrix is a feature, and each column is a sample. Then, it aims to
find a projection such that the above constraints hold: minimizing mutual information
between each pair of new features and maximizing information between the new
features and the old features.
7.2.4 Alternatives
While PCA and ICA are well-established and effective algorithms, there are other
alternatives in the space.
RCA or random component analysis generates random directions and projects
the data onto them. It works remarkably well for classification because enough
random components can still capture correlations in the data, though the re-
sulting m-dimensional space is bigger than the one found by the other methods.
Its primary benefit comes simply from being blazingly fast.
LDA or linear discriminant analysis finds a projection that discriminates based
on the label. In other words, it aims to find a collection of good linear separators
on the data.
PART III
Reinforcement Learning
Contents
8 Markov Decision Processes 89
9 Game Theory 96
⁵ We'll ignore the existence of altruism in our discussions of human behaviour because it's a concept
that doesn't make computational sense.
Markov Decision Processes
T(s, a, s′) ∼ Pr[s′ | s, a]
• The actions are rewarded (or punished) based on their outcome and the re-
sulting state, R(s) ↦ ℝ.
The Markovian property of this problem states that only the present matters:
notice that the new state in the transition function only depends on the current
state and action, not on any earlier history.
Furthermore, the environment in which the agent can act stays static (walls don’t
move, for example).
Given this description of the world, an agent has a policy, Π(s) ↦ a, that describes
its decision-making process, converting environment state into an action. This policy
is often computed with the goal of finding the optimal policy Π∗ which maximizes
the total reward.
Our reinforcement learning algorithms will learn which actions maximize the reward
and use that to define and enforce a policy. At each stage, we will receive a state-
action-reward tuple: hs, a, ri and eventually determine the optimal actions for any
given state.
On Rewards
In our MDP, we have no understanding of how our immediate action will lead to
things down the road. For example, suppose you're standing on an island surrounded
by hot coals, but there's the potential of a pot of gold a few steps away. Under the
Markovian property, you have no knowledge of the gold until you find it, and are
enduring a burning sensation on your feet for an unknown, delayed reward.

Figure 8.1: Enduring the hot coals for the delayed reward of “manager.”

Because of this property, it's hard to tell which moves along the path had a meaningful
impact on the overall result. Was it one move early on that caused a failure 100 steps
down the line, or was it the step we just did while everything beforehand was perfect?
Furthermore, minor changes to rewards are really impactful. By having a small
negative reward everywhere, for example, you encourage the agent to end the game.
However, if the reward is too negative (that is, it outweighs the maximum-possible
positive reward), you might encourage the agent to instead be “suicidal” and end the
simulation as quickly as possible.
For example, if the commission for a stock trade is $100 (representing an absurdly-high
negative reward for any action) then trading penny stocks—even for a 200% profit!—
would be completely pointless. Thus, a trading simulation agent might simply buy a
stock and hold it indefinitely.
On Utility
In our discussion, we’ve been assuming an infinite horizon: our agent can live and
simulate forever and their optimal policy assumes there’s infinite time to execute its
plans.
We won't consider this problematic assumption in this chapter because it's too
advanced for this brief introduction,¹ even though it's not representative of the real
world; it's mentioned because it's important to keep in mind. Imagine if your
autonomous vacuum did nothing but circle the edge of your house because the center
of your rooms is too “unknown” and high-risk!
Furthermore, we've been assuming that the utility of sequences is Markovian: if
we preferred a particular sequence of states today (when we started from s₀), we
actually always prefer it:
$$\text{if } U(s_0, s_1, s_2, \ldots) > U(s_0, s_1', s_2', \ldots)\ \text{ then }\ U(s_1, s_2, \ldots) > U(s_1', s_2', \ldots)$$
¹ Ultimately, you end up creating a sort of time-based policy, Π(s, t) ↦ a, to deal with this.
These are called stationary preferences and this assumption leads to a natural
intuition of accumulating rewards by simple addition. The utility of a sequence of
states is simply the sum of its rewards:
$$U(s_0, s_1, s_2, \ldots) = \sum_{t=0}^{\infty} R(s_t)$$
However, this simple view doesn’t encode what we truly want. Consider the following
two sequences of states:
Isn’t the second sequence “better”? But by our above definition, they both sum up
to infinity. . . so it doesn’t matter what we do. How do we encode time into our
utility, and model the agent to prefer instant gratification? Simple: we introduce
a fractional factor γ that decreases how impactful future rewards are.
$$U(s_0, s_1, s_2, \ldots) = \sum_{t=0}^{\infty} \gamma^t R(s_t) \qquad \text{where } 0 \le \gamma < 1$$
Unlike our previous construction which grows to infinity, this geometric series actually
converges to a very useful result:
∞ ∞
X
t
X Rmax
γ R(st ) ≤ γ t Rmax =
t=0 t=0
1−γ
Notice that if we’d let γ = 1, we’d have the equation that we started with. This
model has discounted rewards, letting us converge to a finite reward after taking
infinite steps.
The utility of a particular state, then, is the expected (discounted) reward over the
sequence of states that we'll see by following that policy and starting at that state:
$$U^\Pi(s) = \mathbb{E}\left[\left.\sum_{t=0}^{\infty} \gamma^t R(s_t)\;\right|\; \Pi, s_0 = s\right]$$
Note the critical point that the reward at a state is not the same as the utility of
that state (immediate vs. long term): R(s) ≠ U(s).² Under this notion of the optimal
utility, then, we now want our policy to return the action that maximizes the expected
utility:
$$\Pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} T(s, a, s')\, U(s')$$
This feels a little circular, but when we unroll it, a recursive pattern emerges: the
true utility of a state is simply its immediate reward plus all discounted future rewards
(utility):
$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} T(s, a, s')\, U(s') \tag{8.1}$$
This final equation is the key equation of reinforcement learning. It’s called
the Bellman equation and fully encodes the utility of a state in an MDP. The rest
of this chapter will simply be different ways to solve the Bellman equation.
Value Iteration The key to an approach for solving this equation will be through
repeated iteration until convergence: we’ll start with arbitrary utilities, then update
them based on their neighbors until they converge.
$$\hat{U}_{t+1}(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, \hat{U}_t(s')$$
where Û₀(s) is simply random values. However, because the transitions and rewards
are actually true, we'll eventually converge by overwhelming our initially-random
guesses about utility. This simple process is called value iteration; the fundamental
reason why value iteration works is that rewards propagate through their neighbors.
² We'll use the notation U(s) to refer to the “true” utility of a state; this is the utility when following
the optimal policy, so $U(s) := U^{\Pi^*}(s)$.
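A toy sketch of value iteration on a hypothetical 3-state MDP; the states, transitions, and rewards below are made up purely for illustration.

```python
import random

# Hypothetical MDP: T[s][a] is a list of (probability, next_state); R is per state.
T = {
    "A": {"stay": [(1.0, "A")], "go": [(0.8, "B"), (0.2, "A")]},
    "B": {"stay": [(1.0, "B")], "go": [(0.9, "C"), (0.1, "A")]},
    "C": {"stay": [(1.0, "C")]},
}
R = {"A": 0.0, "B": -0.1, "C": 1.0}
gamma = 0.9

U = {s: random.random() for s in T}          # arbitrary initial utilities
for _ in range(200):
    # One synchronous sweep of equation (8.1) over all states.
    U = {
        s: R[s] + gamma * max(
            sum(p * U[s2] for p, s2 in outcomes) for outcomes in T[s].values()
        )
        for s in T
    }
print({s: round(u, 2) for s, u in U.items()})
```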
Policy Iteration This isn’t the only way to do this, and it gives us an intuition
about the alternative. We want a policy; it maps states to actions, not utilities. And
though we can use utilities to find actions, it’s way more work than is necessary; the
optimal action would be sufficient.
With policy iteration, we’ll start with a guess Π0 of how we should act in the world.
Following this policy will result in a particular utility, so we can find Ut = U Πt . Then,
given that utility, we can figure out how to improve the policy by then finding the
action that maximizes the expected utility:
X
Πt+1 (s) = arg max T (s, a, s0 )Ut (s0 )
a∈A(s) s0
Of course, finding Ut is just a matter of solving the Bellman equation (8.1), except
the action is already encoded by the policy:
$$U_t(s) = R(s) + \gamma \sum_{s'} T(s, \Pi_t(s), s')\, U_t(s')$$
With this formulation, we have n linear equations (the max over a is gone) in n unknowns,
which is easily solvable with linear algebra libraries.
8.3 Q-Learning
Critically, solving an MDP is not the same thing as reinforcement learning! Previ-
ously, we knew everything there is to know about the world: the states, the transitions,
the rewards, etc. In reality, though, a lot of that is hidden from us: only after taking
(or even trying to take) an action can we see its effect on the world (our new state)
and the resulting reward.
When thinking about reinforcement learning, there are a lot of moving parts that go
into discovering the policy:
[Figure: two pipelines for discovering a policy. A planner turns a model (T, R) into a policy; a learner turns raw transition tuples ⟨s, a, r, s′⟩ into a policy.]
A new challenger arises called the Q-function, relating the value of arriving at the state
s and leaving via action a:
$$Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s') \max_{a' \in A(s')} Q(s', a') \tag{8.2}$$
Policy On the other hand with policies, we’re looking for the action itself that
maximizes the value, so the parallel is just as simple:
$$\Pi(s) = \arg\max_a Q(s, a)$$
The foundation behind Q-learning is using data that we learn about the world as
we take actions in order to evaluate the Bellman equation (8.1). The fundamental
problem, though, is that we don't have R or T, but we do have samples about the
world: ⟨s, a, r, s′⟩ is the reward and new s′tate that result from taking an action in a
state. Given enough of these tuples, then, we can estimate this Q-function:³
$$\langle s, a, r, s' \rangle:\quad \hat{Q}(s, a) \xleftarrow{\;\alpha_t\;} r + \gamma \max_{a' \in A(s')} \hat{Q}(s', a')$$
where αₜ is updated over time.⁴ This simple updating rule is guaranteed to converge
to the true value of the Q-function (8.2) with the huge caveat that (s, a) must be
visited infinitely often:
$$\lim_{t \to \infty} \hat{Q}(s, a) = Q(s, a)$$
³ The non-standard notation $f \xleftarrow{\alpha} x$ is defined here as a simple linear interpolation: $f = (1 - \alpha)f + \alpha x$, where $0 \le \alpha < 1$.
⁴ Specifically, it must follow these rules for convergence to hold true: $\sum_t \alpha_t = \infty$ yet $\sum_t \alpha_t^2 < \infty$. A function like $\alpha_t = \frac{1}{t}$ holds this property.
Notice that we’re leaving a bunch of questions unanswered, and the answers to them
are what make Q-learning a family of algorithms:
• How do we initialize Q̂? We could initialize it to zero, or with small random
values, or negative values, or. . .
• How do we decay αₜ? We've seen that αₜ = 1/t follows the convergence rules we
need, but are there other alternatives?
• How do we choose actions? Neither always choosing the optimal action nor a
random action actually incorporates what we’ve learned into our decision. For
this, we can actually use Q̂, but we need to be careful: relying on our (potentially
faulty, unconverged) knowledge of actions might lead to us simply reinforcing
the wrong actions.
For this last point, we want to sometimes “explore” suboptimal actions (recall our dis-
cussion of exploration vs. exploitation in Simulated Annealing) to see if they actually
are better than what we know so far. To formalize this, we’ll let ε be our “exploration
factor,” and then:
$$\hat{\Pi}(s) =
\begin{cases}
  \arg\max_{a \in A(s)} \hat{Q}(s, a) & \text{with probability } 1 - \varepsilon \\
  \text{a random action} & \text{otherwise}
\end{cases}$$
With this approach, we can guarantee that (with infinite iterations) all possible state-
action pairs will be explored.
Game Theory
This field comes from a world far beyond computer science, but actually has
important ties to reinforcement learning in the way we design problems, make
decisions, and perceive the “economic utility” of the world. The fundamental
principle behind game theory is that you are not alone: you are often collaborating
and competing with other agents in the world to achieve various (and potentially
conflicting) goals. That’s the key of this chapter: we’re moving from a world with
a single agent to one with multiple agents; then, we’ll tie it back to reinforcement
learning and even the Bellman equation.
9.1 Games
Let's consider an extremely simple example of a “game” in which we introduce a
second agent. At each point in the tree it's A's choice, then B's at the tree's next
level, and so on.

[Figure: a small game tree with choices L and R at each node and leaf values +7, +3, −1, +4, and +2 for Agent A.]

This is a 2-player, zero-sum, finite, deterministic game with perfect information.
That's quite a mouthful of a
opportunity has two options available, so there are four total strategies. Agent B
likewise has two opportunities, but there are three options in one case and only one
in the other, so there are three total strategies.
We can actually represent these strategies as a matrix, with each cell being the final
reward value for Agent A. Rows are A's strategies (its choices at nodes ① and ④);
columns are B's strategies (its choices at nodes ② and ③):

               ②: L   ②: M   ②: R
               ③: R   ③: R   ③: R
  ①: L, ④: L     7      3     -1
  ①: L, ④: R     7      3      4
  ①: R, ④: L     2      2      2
  ①: R, ④: R     2      2      2
In fact, this matrix represents the entire game tree. Which strategy (row) then, should
Agent A choose? Knowing that Agent B will always choose the minimum value in
the row, it should choose the second row. If Agent B gets to choose, though, it wants
the column with the smallest maximum value, thus choosing the second column.
This two-way process of picking a strategy such that it minimizes the impact your
opponent could have is called minimax. In our above matrix, following the minimax
strategy, the “value” of the game is simply 3 (the intersection of the best row for
Agent A and best column for Agent B).
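That row/column reasoning is easy to sketch over the matrix above:

```python
# Reward matrix for Agent A: rows are A's strategies, columns are B's.
M = [
    [7, 3, -1],
    [7, 3,  4],
    [2, 2,  2],
    [2, 2,  2],
]

# A maximizes the worst case over B's responses (maximin)...
maximin = max(min(row) for row in M)
# ...while B minimizes the worst case over A's responses (minimax).
minimax = min(max(col) for col in zip(*M))
print(maximin, minimax)  # 3 3: the value of the game
```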
minimax ≡ maximin
and there always exists an optimal pure strategy for each player.
This notion of optimality assumes a fundamental approach to the game: all agents
are behaving optimally, and thus are always trying to maximize their respective re-
ward.
and loses $40 if B sees red (and not in a fit of rage).

[Figure: the mini-poker game tree, where B chooses to resign or see at each of its decision points.]
¹ We could explicitly say that A cannot resign with a black card, but in reality it simply makes no
sense to resign with a black card since you cannot lose.
strategy that works for both players, which segues perfectly into the next section.
Mixed Strategies
A mixed strategy is a distribution of strategies. In the context of mini-poker, for
example, Agent A could choose to be a holder or a resigner half of the time.
Suppose, then, that we’re a holder with probability P .
• If Agent B is a “resigner,” what’s our expected profit? Well, if we’re dealt a
black card, we always hold and always win $10. If we’re dealt a red card, we
will hold P of the time and win $10, and not hold 1 − P of the time and lose
$20. Thus, our expected profit is:
$$10P - 5(1 - P) = 15P - 5$$
Notice that this is a direct application (and simplification) of our above value table.
• On the other hand, if Agent B is a “see”-er, then we win $5 if we resign (1 − P
of the time) and lose $5 if we hold, so:
$$5(1 - P) - 5P = 5 - 10P$$
Now if we were to plot these lines, we’d get a point of intersection at (0.4, 1). How
do we interpret this value?
[Figure: the two expected-profit lines plotted as a function of P, crossing at (0.4, 1).]
Note that the point of intersection isn’t necessarily always the optimal probability
nor the best value; in this case, it’s the maximum of the possible lower bounds. It
will always be one of the extrema of the lower triangle, though.
We’ve relaxed all of the terms that described our original game except one: the notion
of zero-sum. Consider the infamous prisoner’s dilemma:
• Two criminals are imprisoned in separate cells.
• A cop enters one cell and tells the first criminal, “We know you guys committed
the crime, and the other prisoner is telling us that it was all you.”
• He tells the prisoner, “If you can provide us evidence against the other guy, we
can cut you a deal and let you off.”
• To make matters worse, he also tells him that there’s another cop in the other
cell offering the other criminal the same thing, and whoever rats first gets to
walk free.
More specifically, if Prisoner A talks, he gets no sentence and Prisoner B gets a 9-
month sentence. Similarly, if Prisoner B talks, Prisoner A gets a 9-month sentence.
However, the two criminals can also cooperate with each other and say nothing, OR
they might both talk at the same time and both get stuck with a sentence. In the
former case, they don’t get off scot-free because there’s still enough evidence to toss
them both in jail for a month. In the latter case, the cops get to convict both prisoners
(which is better for them), but the sentence for each will be shorter: 6 months.
               cooperate    snitch
  cooperate    (−1, −1)     (−9, 0)
  snitch       (0, −9)      (−6, −6)
Given this matrix, an interesting, unintuitive result appears: it's always better to
defect. Though overall it’s better for the prisoners to cooperate, we need to consider
the cold, hard facts of the above matrix. If you know your partner will cooperate
with you, you get more value from ratting him out (-1 vs. 0). Similarly, if you know
they’ll rat you out, you still get more value from also ratting since your sentence
gets reduced (-9 to -6). The fundamental assumption of this result is that the cold,
calculated matrix does not consider the effects of your actions on your literal partner
in crime.
Notice that the first row is strictly worse for Prisoner A: in both cases, it’s better to
snitch. This means that one strategy is strictly dominated by the other.
More intuitively, a set of strategies is a Nash equilibrium if no one player would change
their strategy given the opportunity. This concept works for both pure and mixed
strategies.
Example Let’s find the Nash equilibrium of the prisoner’s dilemma. Recall our
value table:
               cooperate    snitch
  cooperate    (−1, −1)     (−9, 0)
  snitch       (0, −9)      (−6, −6)
There are obviously only two actions for each player, and thus four strategies. We
can do a simple brute-force enumeration to see if any of them cause any one player
to change their mind.
• If they both cooperate, then if you gave either a chance to switch, they would
(since snitching would now give them a better payoff).
• If one cooperates and the other doesn’t, then if you gave the cooperator the
chance, they would now snitch to reduce their sentence.
• Finally, if they both snitch, then neither would change their action, meaning
this set of strategies is the Nash equilibrium.
In fact, if we’d avoided brute force, we could’ve still come to this same conclusion by
noticing that the (−6, −6) cell is the only one left after removing strictly dominated
rows or columns.
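That brute-force check is tiny in code; here's a sketch over the payoff table above:

```python
payoff = {  # (A's action, B's action) -> (A's payoff, B's payoff)
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "snitch"):    (-9,  0),
    ("snitch",    "cooperate"): ( 0, -9),
    ("snitch",    "snitch"):    (-6, -6),
}
actions = ["cooperate", "snitch"]

def is_nash(a, b):
    """Nash: neither player can do better by unilaterally switching."""
    a_ok = all(payoff[(a, b)][0] >= payoff[(a2, b)][0] for a2 in actions)
    b_ok = all(payoff[(a, b)][1] >= payoff[(a, b2)][1] for b2 in actions)
    return a_ok and b_ok

print([(a, b) for a in actions for b in actions if is_nash(a, b)])
# [('snitch', 'snitch')]
```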
Properties The Nash equilibrium leads to a set of beautiful properties about these
types of games:
• In the n-player pure strategy game, if elimination of strictly dominated strate-
gies eliminates all but one combination, that combination is the unique Nash
equilibrium.
• More generally, any Nash equilibrium will survive the elimination of strictly
dominated strategies.
• If n is finite, and all the sᵢ (the sets of possible strategies for player i) are finite,
then there exists at least one Nash equilibrium (with a possibly mixed strategy).
Repeats Suppose we ran the prisoner’s dilemma multiple times. Intuitively, once
you saw that I would defect regardless (and you would likewise), wouldn’t it make
sense for us to collectively decide to cooperate to reduce our sentences?
Consider the very last repeat of the experiment. At this point you may have built up
some trust in your criminal partner, and you think you can trust them to cooperate.
Well, isn’t that the best time to betray them and get off scot-free? Yes (since guilt
has no utility), and by that same rationale your partner would do likewise, and we’re
back where we started. This leads to another property of the Nash equilibrium: If
you repeat the game n times, you will get the same Nash equilibrium for all n times.
9.1.5 Summary
Game theory operates under the critical assumption that everyone behaves optimally,
and this means that they behave greedily with respect to their utility values. The
key that lets this work in the real world is that the utility value of someone staying
out of jail might not be only the raw amount of months they avoided. It might
also be things like: how much time their partner spends in jail, how much their
family will miss them, how good of a workout routine they could get going in prison,
etc. Furthermore, we know that snitches get stitches, so the (0, −9) cell suddenly
looks much less attractive when that 0 turns into a −100 to represent the number of
beatings you’ll get for snitching on your criminal buddies.
An important conclusion from our cold-hearted matrix view of the world is that if we
know the outcome, we can try to manipulate the game itself to change the outcome
to what we want. This is called mechanism design; it’s an important part of both
economic and political policy.
9.2 Uncertainty
Our rationale for the repeated prisoner’s dilemma is that we can pull a fast one on
our partner if they decide to cooperate with us in the last round. But this relies on
knowledge of when this “last round” will be! To formalize this notion a bit, suppose
the game continues with probability γ. Then, we should expect it to end after $\frac{1}{1-\gamma}$ rounds.
9.2.1 Tit-for-Tat
Consider the following strategy for the iterative prisoner’s dilemma game: on the first
round, we agree to cooperate. Then for all future rounds, we simply copy whatever
our opponent did the previous round.
For example, we’d get the following behavior under these opponent strategies:
• under the always snitch opponent strategy, we’ll cooperate, then snitch, snitch,
...
• if they always cooperate, we’ll always cooperate.
• if they also do tit-for-tat, we’ll again always cooperate.
• if they snitch, then cooperate, then snitch, etc. we’ll do the exact opposite—
cooperate, then snitch, then cooperate, etc.
Under such a known opponent strategy, how do we figure out the best response
strategy? Well, it’s dependent on γ, right? With our prisoner’s dilemma matrix,
always snitching is viable if γ is low: we gain the maximum value on our first try,
but then suffer snitching on the next step. Contrarily, always cooperating is good for
high γ: we always gain the next-best reward (after 0).
$$
\text{utility}\left(\text{always snitch}\right) = 0 - 6\gamma \cdot \frac{1}{1-\gamma} = -\frac{6\gamma}{1-\gamma}
\qquad\qquad
\text{utility}\left(\text{always cooperate}\right) = -1 \cdot \frac{1}{1-\gamma} = -\frac{1}{1-\gamma}
$$
The threshold is at γ = 1/6 (just set them equal to each other). How can we find this
for a general strategy, though? Well, consider our model expressed as a finite state
machine:
[Figure: a two-state finite state machine over C (cooperate) and D (defect), with the per-round rewards −1, −6, and −9 labeling its transitions.]
and notice that this is simply an MDP with γ as the discount factor. By solving the
MDP, we can find the optimal policy for countering an opponent’s strategy.
To start, a feasible payoff is one that can be reached by some combination of the
extremes (by varying probabilities, etc.). This is simply the convex hull of the points
defined in the value matrix on a “player plot.”
A minmax profile is a pair of payoffs—one for each player—that represent the
payoffs that can be achieved by a player defending itself from a malicious adversary.
In other words, we reform the game to be a zero-sum game.
For example, consider the following battle of sexes: two people want to meet up at a
concert, but there are two bands playing and they didn’t coordinate beforehand. If
they end up at different concerts, they’ll both be pretty bummed out (no utility), but
if they do, they’ll end up with different utilities based on the concert—one prefers
Bach and the other prefers Stravinsky.
B S
B (1, 2) (0, 0)
S (0, 0) (2, 1)
The minmax profile of the above “game” is (2/3, 2/3), which can be solved by following
the same process we did during our initial discussion of Mixed Strategies.
Theorem 9.2 (Folk Theorem). Any feasible payoff profile that strictly dominates
the minmax profile can be realized as a Nash equilibrium payoff profile, with a
sufficiently large discount factor.
The proof of this theorem can be viewed as an abstract strategy called the grim
trigger: in this strategy, we guarantee cooperation for mutual benefit as long as
our partner/opponent doesn’t “cross the line”; if they do, we will deal out vengeance
forever. In the context of the prisoner’s dilemma, A cooperates until B snitches, at
which point A will always snitch.
The problem with this strategy is that it’s a bit implausible: the idea of rejecting any
notion of optimality simply to dole out maximum vengeful punishment no-matter-
what has a pretty huge negative effect on the one dealing out the punishment—they
are forgoing potential value. In game theory, we are interested in a plausible threat
that leads to a subgame perfect equilibrium. Under this definition, you always take
the best response independent of any historical responses.
For example, the grim trigger and tit-for-tat are in a Nash equilibrium (since both
will always cooperate), but they are not subgame perfect, since a history of any
defection from tit-for-tat will cause grim to defect while it would’ve been better to
keep cooperating. Similarly, two players both using the tit-for-tat strategy are not
subgame perfect, since any defection in the past from one will “flip” the other and
lead to something worse than their forever-cooperation default.
We can think of subgame perfect strategies as eventually equalizing again, whereas
ones that are not subgame perfect are extremely fragile and dependent on very ideal
sequences—the difference between stability of a marble at the bottom of a ∪ parabola
vs. one at the exact, barely-balanced peak of a ∩ one.
The Pavlov strategy cooperates as long as both players agreed with each other on the
previous round and defects otherwise:

[Figure: the Pavlov strategy as a finite state machine over the states C and D.]

This strategy is a Nash equilibrium with itself: you always start with mutual cooper-
ation, and there’s no incentive or reason to stop doing so. Furthermore, it’s subgame
perfect unlike tit-for-tat: the average reward is always mutual cooperation.
Theorem 9.3 (Computational Folk Theorem, Littman & Stone ‘04). You can build a Pavlov-like
machine for any game and construct a subgame perfect Nash equilibrium in poly-
nomial time.
There are three possible forms of the equilibrium resulting from the above theorem: if
possible, it will be Pavlovian; otherwise, it must be a zero-sum-like game and solving
a linear program will lead to a strategy with at most one player improving.
[Figure: a grid world with agents A and B, dashed “partial walls” above each of them, and a shared goal cell between them.]
If they arrive at the same time, they both get the reward. The world is deterministic
aside from the dashed lines above the agents: these are “partial walls” which can be
passed with a 50% probability.
Obviously, the optimal strategy for each agent when ignoring the other agent is to
move into the middle square and go straight for the goal. However, this ignores the
unfortunate reality that the agents can’t occupy the same square (aside from the
reward cell). We’ll say that if two agents try for the same square, a 50-50 chance
describes who gets there “first;” the other agent stays still.
Does this game have a Nash equilibrium for the agents? Consider the possible strate-
gies:
• If both agents try to go through their walls, each has a 50% chance to make it
through, and the average reward is $67 for either agent.2
• However, if B knows that A will follow the above strategy, it’s in its best interest
to go left and up towards the goal. In this case, A still has a 50% chance of
success regardless of whether it goes up (wall chance) or right (collision chance),
² There's a 50% chance to pass the wall and win; then there's also the chance that 25% of the time,
neither of you will pass the wall, so there's another opportunity to try the wall. If we say
x = Pr[A reaches the goal], then this probability can be expressed simply as $x = \frac{1}{2} + \frac{1}{4}x$. This results in
x = 2/3.
9.3.2 Generalization
Let's describe this notion of a stochastic game (Shapley ‘53) in detail. It looks much like an MDP,
but we use the i subscript to refer to each agent individually:
• We still have our set of states, S.
• We have actions available to each player: Ai . For our purposes, we’ll use two-
player games as examples, so let a = A1 (the actions available to the first agent)
and b = A2 .
• The transitions from state to state are described by a joint action. All of the
players take an action simultaneously: T (s, (a, b), s0 ).
• Each player gets their own rewards, so R1 (s, (a, b)) describes the first agent’s
reward for the joint action, and R2 (s, (a, b)) describes the second’s.
• The discount factor, γ stays the same, and is universal for everyone. We can
consider γ to be part of the definition of the game, rather than a γi which would
allow each agent to discount rewards independently.
In fact, this is a generalization of MDPs, published way before Bellman and his
equation (8.1). We can see how it simplifies to a number of scenarios we’ve seen
under certain constraints: with transition probabilities and rewards that don't change
regardless of b, this is an MDP; with opposite rewards (R₁ = −R₂), this is a zero-sum
game; and with a single state in the state space, this is a repeated game à la the
iterative prisoner's dilemma.
What does this mean? By maximizing over joint actions (the (a′, b′) tuple), we
are assuming that the optimal joint action will exclusively benefit us: the whole
world will bend themselves over backwards to improve our utility. This seems a tad
unrealistic. . . In reality, when we discussed agent decision-making in zero-sum games,
we determined that optimal opponents will try minimax. Thus, evaluating a state
should involve actually solving the zero-sum game matrix like we did before:
$$Q_i^*(s, (a, b)) = R_i(s, (a, b)) + \gamma \sum_{s'} T(s, (a, b), s') \cdot \operatorname*{minimax}_{a', b'} Q_i^*(s', (a', b'))$$
The analog of the Q-learning update is simple; only our observation tuples get more
complicated:
\[
\langle s, (a, b), (r_a, r_b), s' \rangle : \quad Q_i(s, (a, b)) \xleftarrow{\;\alpha\;} r_i + \gamma \operatorname*{minimax}_{a', b'} \; Q_i(s', (a', b'))
\]
This is often referred to as minimax-Q. From this algorithm, we get some wonderful
properties:
• Value iteration “just works”! We can solve the system of equations just like
before with MDPs by using the utilities found by the Q function.
• Minimax-Q converges to a unique solution! There’s a single optimal policy to
the zero-sum game, and this iterative algorithm finds it.
• Policies can be computed independently by each agent (that is, by operating
under the assumption that the opponent will behave optimally) and are guar-
anteed to converge to the same optimal policy.
• The Q-function update can be efficiently computed in polynomial time, since
the minimax can be computed using linear programming (see the sketch after
this list).
• The resulting optimal Q∗ is sufficient in specifying the optimal policy, since the
optimal utility for a state-action pair still corresponds to the optimal action to
take from that state.
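As a concrete illustration of the last two points, here’s a minimal sketch (not from the notes) of the minimax-Q update in Python. It assumes each Q[s] is stored as an |A1| × |A2| numpy array indexed by integer actions, and it finds the matrix-game value with scipy’s linear-programming solver; the function names are hypothetical:

import numpy as np
from scipy.optimize import linprog

def minimax_value(Q_s):
    """Value of the zero-sum matrix game Q_s (rows: our actions a',
    columns: opponent actions b') for the maximizing row player."""
    n_a, n_b = Q_s.shape
    # Variables are [p_1, ..., p_{n_a}, v]; linprog minimizes, so minimize -v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For every opponent action b':  v - sum_a p_a * Q_s[a, b'] <= 0.
    A_ub = np.hstack([-Q_s.T, np.ones((n_b, 1))])
    b_ub = np.zeros(n_b)
    # The mixed strategy p must sum to one.
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]          # res.x[:-1] is the minimax mixed strategy

def minimax_q_update(Q, s, a, b, r_i, s_next, alpha=0.1, gamma=0.9):
    """One minimax-Q step for agent i given the observation <s, (a, b), r_i, s'>."""
    target = r_i + gamma * minimax_value(Q[s_next])
    Q[s][a, b] = (1 - alpha) * Q[s][a, b] + alpha * target

The learning-rate arrow in the update above is just the usual exponential average, which is what the last line implements.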
What about general-sum games, where the opponent isn’t strictly adversarial? The
natural generalization is to replace the minimax operation with a Nash equilibrium
computation in both the value and the update. This idea is aptly called Nash-Q and
it exists in the literature; however, its properties are much grimmer:
• Value iteration doesn’t work; the Nash-Q algorithm doesn’t converge since there
isn’t a unique solution (there can be multiple Nash equilibria).
• Policies can’t be computed independently, since the very notion of a Nash equi-
librium depends on everyone coordinating their actions together.
• Computing the Nash equilibrium is not a polynomial-time operation; it’s as
hard as any problem in np.
• Finally, even if we could compute the Q-function, it would still be insufficient
for specifying the policy.
Workarounds
This last conclusion is saddening, but there are some interesting ideas for getting
around the need for a general result:
• By viewing stochastic games as repeated games, we can leverage the Folk the-
orem and ideas like it to come up with better solutions.
• By allowing communication between agents—we can call it “cheap talk,” since
the communication isn’t necessarily binding—we can let the agents coordinate
and compute a correlated equilibrium. This actually lets us approximate the
solution efficiently.3 (A small LP sketch for the one-shot case appears below.)
• By taking a bit of a prideful view of ourselves and assuming we have more com-
putational resources than our opponents do to behave optimally, we can create a
“cognitive hierarchy” and approximate ideal responses under these assumptions.
• By enabling “side payments,” in which agents can actually give some of their
reward to other agents in order to incentivize high-reward actions, we can sim-
ilarly find optimal strategies (Sodomka et al. ‘13).
3 This idea should be familiar to anyone who’s taken a formal algorithms course: much like we
can efficiently approximate np-complete problems to a particular degree (like the upper bound of
7/8ths for Max-SAT), we should also be able to approximate the Nash equilibrium to a particular
degree.
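To make the correlated-equilibrium workaround concrete, here’s a minimal sketch (again, not from the notes) that computes a welfare-maximizing correlated equilibrium of a one-shot two-player game with a single linear program; the correlated_equilibrium name and the game of “chicken” used as the example are illustrative choices:

import numpy as np
from scipy.optimize import linprog

def correlated_equilibrium(U1, U2):
    """Welfare-maximizing correlated equilibrium of a one-shot two-player game.

    U1, U2: payoff matrices of shape (n1, n2) for players 1 and 2.
    Returns a joint distribution p over action pairs (shape (n1, n2))."""
    n1, n2 = U1.shape
    n = n1 * n2                      # one variable per joint action (i, j)
    idx = lambda i, j: i * n2 + j    # row-major index into the flattened p

    A_ub, b_ub = [], []
    # Player 1: following the recommended row i must beat deviating to i_dev.
    for i in range(n1):
        for i_dev in range(n1):
            if i_dev == i:
                continue
            row = np.zeros(n)
            for j in range(n2):
                # sum_j p[i, j] * (U1[i, j] - U1[i_dev, j]) >= 0
                row[idx(i, j)] = -(U1[i, j] - U1[i_dev, j])
            A_ub.append(row)
            b_ub.append(0.0)
    # Player 2: the symmetric constraints over recommended columns.
    for j in range(n2):
        for j_dev in range(n2):
            if j_dev == j:
                continue
            row = np.zeros(n)
            for i in range(n1):
                row[idx(i, j)] = -(U2[i, j] - U2[i, j_dev])
            A_ub.append(row)
            b_ub.append(0.0)

    # Probabilities sum to one; maximize total welfare (linprog minimizes).
    A_eq, b_eq = [np.ones(n)], [1.0]
    c = -(U1 + U2).flatten()
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, 1)] * n)
    return res.x.reshape(n1, n2)

# Example: "chicken" (rows/columns are swerve, dare).
chicken_1 = np.array([[6, 2], [7, 0]])
chicken_2 = np.array([[6, 7], [2, 0]])
print(correlated_equilibrium(chicken_1, chicken_2).round(3))
# ≈ [[0.5, 0.25], [0.25, 0.0]]: half on mutual swerving, a quarter on each
# asymmetric outcome, which no mixed Nash equilibrium of this game achieves.

This only handles a single stage, of course; the point is that the per-stage equilibrium computation stays a polynomial-time LP, unlike computing a Nash equilibrium.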
Index of Terms
Symbols
Q-learning . . . 94, 107, 108
ε-exhausted . . . 44
k-means clustering . . . 73, 77
k-nearest neighbor . . . 23, 35, 72
A
activation threshold . . . 28
AdaBoost . . . 15
B
back-propagation . . . 34
backward search . . . 82
bagging . . . 13
Bayes’ optimal classifier . . . 55, 82
Bayes’ rule . . . 52, 56, 58
Bayesian network . . . 56, 68
Bayesian network, naïve . . . 60
Bellman equation . . . 92, 94, 96
bias . . . 31, 82
Boltzmann distribution . . . 65
boosting . . . 13
C
chain rule . . . 56
classification . . . 6, 15, 37, 60
clustering . . . 71
component analysis, independent . . . 86
component analysis, principal . . . 83
component analysis, random . . . 87
conditional entropy . . . 50
conditionally independent . . . 56
correlated equilibrium . . . 109
cross-validation . . . 6
curse of dimensionality . . . 38, 80, 83
D
decision tree . . . 7, 46, 81
dependency trees . . . 68
discounted reward . . . 91
E
eager learner . . . 37
ensemble . . . 12
entropy . . . 9, 49, 81
expectation maximization . . . 77
F
factoring . . . 56
features . . . 8
firing threshold . . . 28
fitness function . . . 63, 65
Folk theorem . . . 109
forward search . . . 82
function approximation . . . 5
G
game theory . . . 96, 105
genetic algorithms . . . 65, 67, 68
gradient descent . . . 30, 32, 34, 55
grim trigger . . . 105
H
Haussler theorem . . . 44, 46, 48
hill climbing . . . 63, 67, 75
hill climbing, random . . . 63, 66, 74
I
independent . . . 56
independent component analysis . . . 86
J
joint distribution . . . 56
joint entropy . . . 50
K
kernel trick . . . 22, 83
KL-divergence . . . 51, 69
L
lazy learner . . . 37
learning rate . . . 31
least squares . . . 27
linear discriminant analysis . . . 87
linear regression . . . 26, 35, 37, 55, 84
M
margin . . . 18, 19, 21, 71
Markovian property . . . 89
maximum a posteriori . . . 54, 61
maximum likelihood . . . 54
MDP . . . 89, 96, 103, 104, 107
mechanism design . . . 103
MIMIC . . . 67
minimax . . . 97, 97, 108
minimax-Q . . . 108
minmax profile . . . 104, 104
mixed strategy . . . 99, 101, 102
momentum . . . 34
mutual information . . . 49, 50, 69, 86
N
Nash equilibrium . . . 101, 102, 104–106, 108
Nash-Q . . . 109
neural network . . . 18, 28, 46
O
overfitting . . . 6, 12, 13, 35
P
PAC-learnable . . . 44, 48
perceptron . . . 28, 82
perceptron rule . . . 30, 31
pink noise . . . 18
policy iteration . . . 93, 107
preference bias . . . 10, 34
principal component . . . 83
prisoner’s dilemma . . . 100, 101, 103, 105, 107
R
radial basis kernel . . . 24
randomized optimization . . . 9, 34, 63, 82
regression . . . 6, 26, 37
reinforcement learning . . . 88, 106
relevant . . . 82
relevant, strongly . . . 82
relevant, weakly . . . 82
restriction bias . . . 10, 34
roulette wheel . . . 66
S
shatter . . . 47, 47
sigmoid . . . 33
simulated annealing . . . 64, 67
single linkage clustering . . . 79
stationary preferences . . . 91
strictly dominated . . . 100, 101, 102, 104
subgame perfect . . . 105, 105
supervised learning . . . 5
support vector machine . . . 18, 19, 83
T
training error . . . 43
true error . . . 43
truncation selection . . . 66, 68
U
underfitting . . . 6
utility . . . 90
V
value iteration . . . 92, 107
VC dimension . . . 47, 48
version space . . . 43, 43, 44
W
weak learner . . . 12, 14, 14, 17, 18
Z
zero-sum . . . 96, 97, 100, 104, 106–108