Machine Learning
In the previous few notes of this course, we’ve learned about various types of models that help us reason
under uncertainty. Until now, we’ve assumed that the probabilistic models we’ve worked with can be taken
for granted, and the methods by which the underlying probability tables we worked with were generated have
been abstracted away. We’ll begin to break down this abstraction barrier as we delve into our discussion
of machine learning, a broad field of computer science that deals with constructing and/or learning the
parameters of a specified model given some data.
There are many machine learning algorithms which deal with many different types of problems and differ-
ent types of data, classified according to the tasks they hope to accomplish and the types of data that they
work with. Two primary subgroups of machine learning algorithms are supervised learning algorithms
and unsupervised learning algorithms. Supervised learning algorithms infer a relationship between input
data and corresponding output data in order to predict outputs for new, previously unseen input data. Unsu-
pervised learning algorithms, on the other hand, have input data that doesn’t have any corresponding output
data and so deal with recognizing inherent structure between or within datapoints and grouping and/or pro-
cessing them accordingly. In this class, the algorithms we’ll discuss will be limited to supervised learning
tasks.
Once you have a dataset that you’re ready to learn with, the machine learning process usually involves
splitting your dataset into three distinct subsets. The first, training data, is used to actually generate a
model mapping inputs to outputs. Then, validation data (also known as hold-out or development data)
is used to measure your model’s performance by making predictions on inputs and generating an accuracy
score. If your model doesn’t perform as well as you’d like it to, it’s always okay to go back and train again,
either by adjusting special model-specific values called hyperparameters or by using a different learning
algorithm altogether until you’re satisfied with your results. Finally, use your model to make predictions on
the third and final subset of your data, the test set. The test set is the portion of your data that's never seen by your model until the very end of development; it serves as a final check of how well your model generalizes to data it has never encountered.
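To make the three-way split concrete, here is a minimal sketch in Python (the function name, array types, and split fractions are illustrative assumptions, not something prescribed by this note):

    import numpy as np

    def train_val_test_split(X, y, val_frac=0.1, test_frac=0.1, seed=0):
        """Randomly partition a dataset (X, y) into training, validation, and test subsets."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))              # shuffle indices so the split is random
        n_test = int(len(X) * test_frac)
        n_val = int(len(X) * val_frac)
        test_idx = idx[:n_test]
        val_idx = idx[n_test:n_test + n_val]
        train_idx = idx[n_test + n_val:]
        return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])

Hyperparameters are tuned by checking accuracy on the validation subset; the test subset is touched only once, at the very end.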
Naive Bayes
We’ll further motivate our discussion of machine learning with a concrete example of a machine learning
algorithm. Let’s consider the common problem of building an email spam filter which sorts messages into
spam (unwanted email) or ham (wanted email). Such a problem is called a classification problem – given
various datapoints (in this case, each email is a datapoint), our goal is to group them into one of two or more
classes. For classification problems, we’re given a training set of datapoints along with their corresponding
labels, which are typically one of a few discrete values. As we’ve discussed, our goal will be to use this
training data (emails, and a spam/ham label for each one) to learn some sort of relationship that we can use
to make predictions on new and previously unseen datapoints. In this section we’ll describe how to construct
a specific type of model for solving classification problems known as a Naive Bayes Classifier.
To train a model to classify emails as spam or ham, we need some training data consisting of preclassified
emails that we can learn from. However, emails are simply strings of text, and in order to learn anything
useful, we need to extract certain attributes from each of them known as features. Features can be anything
ranging from specific word counts to text patterns (e.g. whether words are in all caps or not) to pretty much
any other attribute of the data that you can imagine. The specific features extracted for training are often
dependent on the specific problem you’re trying to solve and which features you decide to select can often
impact the performance of your model dramatically. Deciding which features to utilize is known as feature
engineering and is fundamental to machine learning, but for the purposes of this course you can assume
you’ll always be given the extracted features for any given dataset.
Now let’s say you have a dictionary of n words, and from each email you extract a feature vector F ∈ Rn
where the ith entry in F is a random variable Fi which can take on a value of either a 0 or a 1 depending on
whether the ith word in your dictionary appears in the email under consideration. For example, if F200 is the
feature for the word free, we will have F200 = 1 if free appears in the email, and 0 if it does not. With these
definitions, we can state more concretely how to predict whether an email is spam or ham – if we can generate a joint probability table between each F_i and the label Y, we can compute the probability that any email under consideration is spam or ham given its feature vector. Specifically, we can compute both
P(Y = spam|F1 = f1 , . . . , Fn = fn )
and
P(Y = ham|F1 = f1 , . . . , Fn = fn )
and simply label the email depending on which of the two probabilities is higher. Unfortunately, since we have n features and 1 label, each of which can take on 2 distinct values, the joint probability table over Y, F_1, ..., F_n has 2^(n+1) entries – exponential in the number of features, and far too large to store or learn from any reasonable amount of data. The fix is a strong simplifying assumption: the Naive Bayes model assumes that each feature F_i is conditionally independent of every other feature given the class label Y, which corresponds to a Bayes' net in which Y is the parent of each F_i and there are no other edges.
Note that the rules of d-separation delineated earlier in the course make it immediately clear that in this
Bayes’ net each Fi is conditionally independent of all the others, given Y . Now we have one table for
P(Y) with 2 entries, and n tables for each P(F_i | Y), each with 2 · 2 = 4 entries, for a total of 4n + 2 entries –
linear in n! This simplifying assumption highlights the tradeoff that arises from the concept of statistical
efficiency; we sometimes need to compromise our model’s complexity in order to stay within the limits of
our computational resources.
Indeed, in cases where the number of features is sufficiently low, it’s common to make more assumptions
about relationships between features to generate a better model (corresponding to adding edges to your
Bayes’ net). With this model we’ve adopted, making predictions for unknown datapoints amounts to running
inference on our Bayes' net. We have observed values for F_1, ..., F_n, and want to choose the value of Y that has the highest probability conditioned on these features:
argmax_y P(Y = y | F_1 = f_1, ..., F_n = f_n) = argmax_y P(Y = y, F_1 = f_1, ..., F_n = f_n)

                                              = argmax_y P(Y = y) ∏_{i=1}^{n} P(F_i = f_i | Y = y)
where the first step is because the highest probability class will be the same in the normalized or unnor-
malized distribution, and the second comes directly from the Naive Bayes’ independence assumption that
features are independent given the class label (as seen in the graphical model structure).
Generalizing away from a spam filter, assume now that there are k class labels (possible values for Y ).
Additionally, after noting that our desired probabilities – the probability of each label y_i given our features, P(Y = y_i | F_1 = f_1, ..., F_n = f_n) – are proportional to the joint probabilities P(Y = y_i, F_1 = f_1, ..., F_n = f_n), we can compute:

P(Y, F_1 = f_1, ..., F_n = f_n) =

  [ P(Y = y_1, F_1 = f_1, ..., F_n = f_n) ]   [ P(Y = y_1) ∏_i P(F_i = f_i | Y = y_1) ]
  [ P(Y = y_2, F_1 = f_1, ..., F_n = f_n) ]   [ P(Y = y_2) ∏_i P(F_i = f_i | Y = y_2) ]
  [                  ...                  ] = [                  ...                  ]
  [ P(Y = y_k, F_1 = f_1, ..., F_n = f_n) ]   [ P(Y = y_k) ∏_i P(F_i = f_i | Y = y_k) ]
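To see how these quantities are used at prediction time, here is a minimal Python sketch of the argmax computation above. It assumes the prior P(Y) and the conditionals P(F_i = 1 | Y) have already been estimated and stored in the (hypothetical) dictionaries prior and cond:

    def naive_bayes_predict(f, prior, cond):
        """
        f     : list of n binary feature values [f_1, ..., f_n]
        prior : maps each class label y to P(Y = y)
        cond  : maps (i, y) to P(F_i = 1 | Y = y)
        Returns the label y maximizing P(Y = y) * prod_i P(F_i = f_i | Y = y).
        """
        best_label, best_score = None, -1.0
        for y, p_y in prior.items():
            score = p_y
            for i, f_i in enumerate(f):
                p = cond[(i, y)]                       # P(F_i = 1 | Y = y)
                score *= p if f_i == 1 else (1.0 - p)  # P(F_i = f_i | Y = y)
            if score > best_score:
                best_label, best_score = y, score
        return best_label

In practice it's common to sum log probabilities instead of multiplying raw probabilities, since the product of many small numbers can underflow to 0.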
We’ve now learned the basic theory behind the modeling assumptions of the Naive Bayes’ classifier and
how to make predictions with one, but have yet to touch on how exactly we learn the conditional probability
tables used in our Bayes’ net from the input data. This will have to wait for our next topic of discussion,
parameter estimation.
Parameter Estimation
Assume you have a set of N sample points or observations, x1 , . . . , xN , and you believe that this data was
drawn from a distribution parametrized by an unknown value θ . In other words, you believe that the
probability Pθ (xi ) of each of your observations is a function of θ . For example, we could be flipping a coin
which has probability θ of coming up heads.
How can you "learn" the most likely value of θ given your sample? For example, if we have 10 coin flips,
and saw that 7 of them were heads, what value should we choose for θ ? One answer to this question is to
infer that θ is equal to the value that maximizes the probability of having selected your sample x1 , . . . , xN
from your assumed probability distribution. A frequently used and fundamental method in machine learn-
ing known as maximum likelihood estimation (MLE) does exactly this. Maximum likelihood estimation
typically makes the following simplifying assumptions:
• Each sample is drawn from the same distribution. In other words, each xi is identically distributed.
In our coin flipping example, each coin flip has the same chance, θ , of coming up heads.
• Each sample xi is conditionally independent of the others, given the parameters for our distribution.
This is a strong assumption, but as we’ll see greatly helps simplify the problem of maximum likelihood
estimation and generally works well in practice. In the coin flipping example, the outcome of one flip
doesn’t affect any of the others.
• All possible values of θ are equally likely before we’ve seen any data (this is known as a uniform
prior).
The first two assumptions above are often referred to as independent, identically distributed (i.i.d.).
Let's now define the likelihood L(θ) of our sample, a function which represents the probability of having drawn our sample from our distribution. For a fixed sample x_1, ..., x_N, the likelihood is just a function of θ:

L(θ) = P_θ(x_1, ..., x_N)
Using our simplifying assumption that the samples xi are i.i.d., the likelihood function can be reexpressed
as follows:
L(θ) = ∏_{i=1}^{N} P_θ(x_i)
How can we find the value of θ that maximizes this function? This will be the value of θ that best explains
the data we saw. Recall from calculus that at points where a function's maxima and minima are realized, its derivative is equal to 0:

∂L(θ)/∂θ = 0
Let’s go through an example to make this concept more concrete. Say you have a bag filled with red and
blue balls and don’t know how many of each there are. You draw samples by taking a ball out of the bag,
noting the color, then putting the ball back in (sampling with replacement). Drawing a sample of three balls
from this bag yields red, red, blue. This seems to imply that we should infer that 2/3 of the balls in the bag are red and 1/3 of the balls are blue. We'll assume that each ball being taken out of the bag will be red with
probability θ and blue with probability 1 − θ , for some value θ that we want to estimate (this is known as a
Bernoulli distribution):

P_θ(x_i) = θ        if x_i = red
           1 − θ    if x_i = blue
The likelihood of our sample is then:
L(θ) = ∏_{i=1}^{3} P_θ(x_i) = P_θ(x_1 = red) · P_θ(x_2 = red) · P_θ(x_3 = blue) = θ² · (1 − θ)
The final step is to set the derivative of the likelihood to 0 and solve for θ :
∂L(θ)/∂θ = ∂/∂θ [θ² · (1 − θ)] = θ(2 − 3θ) = 0
Solving this equation for θ yields θ = 2/3, which intuitively makes sense! (There's a second solution, too, θ = 0 – but this corresponds to a minimum of the likelihood function, as L(0) = 0 < L(2/3) = 4/27.)
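The same answer can be checked numerically. The short Python sketch below (illustrative, not part of the original derivation) evaluates the likelihood θ²(1 − θ) on a grid of values and confirms the maximizer is 2/3:

    import numpy as np

    sample = ["red", "red", "blue"]                 # the observed draws
    thetas = np.linspace(0.0, 1.0, 1001)            # candidate values of theta
    likelihood = np.ones_like(thetas)
    for x in sample:
        likelihood *= thetas if x == "red" else (1.0 - thetas)
    print(thetas[np.argmax(likelihood)])            # prints 0.667, i.e. count(red)/N = 2/3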
Let's now return to the spam filter and use maximum likelihood estimation to learn the conditional probability tables of its Naive Bayes model. We'll use the following notation:
• N - the number of observations (emails) you have for training. For our upcoming discussion, let's also
define Nh as the number of training samples labeled as ham and Ns as the number of training samples
labeled as spam. Note Nh + Ns = N.
• Fi - a random variable which is 1 if the ith dictionary word is present in an email under consideration,
and 0 otherwise.
• Y - a random variable that’s either spam or ham depending on the label of the corresponding email.
• f_i^(j) - this references the resolved value of the random variable F_i in the jth item in the training set. In other words, each f_i^(j) is 1 if word i appeared in the jth email under consideration and 0 otherwise. This is the first time we're seeing this notation, but it'll come in handy in the upcoming derivation.
Disclaimer: Feel free to skip the following mathematical derivation. For CS 188, you’re only required to
know the result of the derivation summarized in the paragraph at the end of this section.
Suppose we want to estimate θ = P(F_i = 1 | Y = ham), the probability that word i appears in a ham email. Restricting attention to the N_h ham emails in the training set, the likelihood of the observed feature values is

L(θ) = ∏_{j=1}^{N_h} P(F_i = f_i^(j) | Y = ham) = ∏_{j=1}^{N_h} θ^(f_i^(j)) (1 − θ)^(1 − f_i^(j))

The second step comes from a small mathematical trick: if f_i^(j) = 1 then

P(F_i = f_i^(j) | Y = ham) = θ¹ (1 − θ)⁰ = θ

and similarly, if f_i^(j) = 0, then

P(F_i = f_i^(j) | Y = ham) = θ⁰ (1 − θ)¹ = (1 − θ)
In order to compute the maximum likelihood estimate for θ , recall that the next step is to compute the
derivative of L (θ ) and set it equal to 0. Attempting this proves quite difficult, as it’s no simple task to isolate
and solve for θ . Instead, we’ll employ a trick that’s very common in maximum likelihood derivations, and
that's to instead find the value of θ that maximizes the log of the likelihood function. Because log(x) is a
strictly increasing function (sometimes referred to as a monotonic transformation), finding a value that
maximizes log L (θ ) will also maximize L (θ ). The expansion of log L (θ ) is below:
log L(θ) = log ∏_{j=1}^{N_h} θ^(f_i^(j)) (1 − θ)^(1 − f_i^(j))

         = ∑_{j=1}^{N_h} log [ θ^(f_i^(j)) (1 − θ)^(1 − f_i^(j)) ]

         = ∑_{j=1}^{N_h} log [ θ^(f_i^(j)) ] + ∑_{j=1}^{N_h} log [ (1 − θ)^(1 − f_i^(j)) ]

         = log(θ) ∑_{j=1}^{N_h} f_i^(j) + log(1 − θ) ∑_{j=1}^{N_h} (1 − f_i^(j))
Note that in the above derivation, we've used the properties of the log function that log(a^c) = c · log(a) and log(ab) = log(a) + log(b). Now we set the derivative of the log of the likelihood function to 0 and solve for θ:

∂/∂θ log L(θ) = (1/θ) ∑_{j=1}^{N_h} f_i^(j) − (1/(1 − θ)) ∑_{j=1}^{N_h} (1 − f_i^(j)) = 0

Multiplying through by θ(1 − θ) and solving gives

θ = ( ∑_{j=1}^{N_h} f_i^(j) ) / N_h
We've arrived at a remarkably simple final result! According to our formula above, the maximum likelihood
estimate for θ (which, remember, is P(F_i = 1 | Y = ham)) corresponds to counting the
number of ham emails in which word i appears and dividing it by the total number of ham emails. You
may think this was a lot of work for an intuitive result (and it was), but the derivation and techniques
will be useful for more complex distributions than the simple Bernoulli distribution we are using for each
feature here. To summarize, in this Naive Bayes model with Bernoulli feature distributions, within any given
class the maximum likelihood estimate for the probability of any outcome corresponds to the count for the
outcome divided by the total number of samples for the given class. The above derivation can be generalized
to cases where we have more than two classes and more than two outcomes for each feature, though this
derivation is not provided here.
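Translated into code, this result is just counting. The sketch below (Python, with hypothetical variable names) estimates the prior P(Y) and every conditional P(F_i = 1 | Y = y) from a labeled training set, exactly as the count-divided-by-total formula prescribes:

    from collections import defaultdict

    def estimate_naive_bayes_cpts(emails, labels, n_words):
        """
        emails : list of binary feature vectors, where emails[j][i] = f_i^(j)
        labels : list of class labels (e.g. "spam" / "ham"), one per email
        Returns (prior, cond) with prior[y] = P(Y = y) and
        cond[(i, y)] = count(F_i = 1, Y = y) / count(Y = y), the MLE of P(F_i = 1 | Y = y).
        """
        class_counts = defaultdict(int)
        word_counts = defaultdict(int)
        for f, y in zip(emails, labels):
            class_counts[y] += 1
            for i in range(n_words):
                word_counts[(i, y)] += f[i]
        prior = {y: c / len(labels) for y, c in class_counts.items()}
        cond = {(i, y): word_counts[(i, y)] / class_counts[y]
                for y in class_counts for i in range(n_words)}
        return prior, cond

These tables are exactly the inputs assumed by the prediction sketch in the Naive Bayes section above.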
Smoothing
Though maximum likelihood estimation is a very powerful method for parameter estimation, bad training
data can often lead to unfortunate consequences. For example, if every time the word "minute" appears in an email in our training set, that email is classified as spam, our trained model will learn that

P(F_minute = 1 | Y = ham) = 0
Hence in an unseen email, if the word minute ever shows up, P(Y = ham) ∏i P(Fi |Y = ham) = 0, and so
your model will never classify any email containing the word minute as ham. This is a classic example
of overfitting, or building a model that doesn’t generalize well to previously unseen data. Just because a
specific word didn’t appear in an email in your training data, that doesn’t mean that it won’t appear in an
email in your test data or in the real world. Overfitting with Naive Bayes’ classifiers can be mitigated by
Laplace smoothing. Conceptually, Laplace smoothing with strength k assumes having seen k extra of each
outcome. Hence if for a given sample your maximum likelihood estimate for an outcome x that can take on |X| different values is

P_MLE(x) = count(x) / N

then the Laplace estimate with strength k is

P_LAP,k(x) = (count(x) + k) / (N + k|X|)
What does this equation say? We’ve made the assumption of seeing k additional instances of each outcome,
and so act as if we’ve seen count(x) + k rather than count(x) instances of x. Similarly, if we see k additional
instances of each of |X| classes, then we must add k|X| to our original number of samples N. Together,
these two statements yield the above formula. A similar result holds for computing Laplace estimates for
conditionals (which is useful for computing Laplace estimates for outcomes across different classes):
P_LAP,k(x|y) = (count(x, y) + k) / (count(y) + k|X|)
There are two particularly notable cases for Laplace smoothing. The first is when k = 0: then P_LAP,0(x) = P_MLE(x). The second is the case where k = ∞. Observing an infinitely large number of each outcome makes the results of your actual sample inconsequential, so your Laplace estimates imply that each outcome is equally likely. Indeed:

P_LAP,∞(x) = 1 / |X|
The specific value of k that’s appropriate to use in your model is typically determined by trial-and-error. k is
a hyperparameter in your model, which means that you can set it to whatever you want and see which value
yields the best prediction accuracy/performance on your validation data.
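In code, Laplace smoothing is a one-line change to the counting estimate; here is a minimal sketch (the helper name is hypothetical):

    def laplace_estimate(count_x, total, num_outcomes, k=1):
        """P_LAP,k(x) = (count(x) + k) / (N + k * |X|)."""
        return (count_x + k) / (total + k * num_outcomes)

    # Example: word i appeared in 0 of 100 ham emails. The MLE would be 0, but with
    # k = 1 and a binary outcome (|X| = 2) the smoothed estimate is small yet nonzero:
    print(laplace_estimate(0, 100, 2, k=1))   # 1/102, approximately 0.0098

Trying a few values of k and keeping whichever gives the best validation accuracy is exactly the hyperparameter tuning described above.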
Perceptron
Linear Classifiers
The core idea behind Naive Bayes is to extract certain attributes of the training data called features and then
estimate the probability of a label given the features: P(y| f1 , f2 , ... fn ). Thus, given a new data point, we
can then extract the corresponding features, and classify the new data point with the label with the highest
probability given the features. All of this, however, requires us to estimate distributions, which we did with MLE. What if instead we decided not to estimate the probability distribution at all? Let's start by looking at a simple linear classifier, which we can use for binary classification – the case where the label has two possibilities, positive or negative.
The basic idea of a linear classifier is to do classification using a linear combination of the features– a
value which we call the activation. Concretely, the activation function takes in a data point, multiplies each
feature of our data point, fi (x), by a corresponding weight, wi , and outputs the sum of all the resulting values.
In vector form, we can also write this as a dot product of our weights as a vector, w, and our featurized data point as a vector f(x):

activation_w(x) = ∑_i w_i f_i(x) = w^T f(x)

To classify a data point, we look at the sign of its activation: if the activation is positive we assign the positive label, and if it is negative we assign the negative label.
To understand this geometrically, let us reexamine the vectorized activation function. Using the Law of Cosines, we can rewrite the dot product as follows, where ‖ · ‖ is the magnitude operator and θ is the angle between w and f(x):

activation_w(x) = w^T f(x) = ‖w‖ ‖f(x)‖ cos(θ)
Since magnitudes are always non-negative, and our classification rule looks at the sign of the activation, the
only term that matters for determining the class is cos(θ ).
classify(x) = +   if cos(θ) > 0
              −   if cos(θ) < 0
We, therefore, are interested in when cos(θ) is negative or positive. It is easily seen that for θ < π/2, cos(θ) will be somewhere in the interval (0, 1], which is positive. For θ > π/2, cos(θ) will be somewhere in the interval [−1, 0), which is negative. You can confirm this by looking at a unit circle. Essentially, our simple
linear classifier is checking to see if the feature vector of a new data point roughly "points" in the same
direction as a predefined weight vector and applies a positive label if it does.
classify(x) = +   if θ < π/2 (i.e. when θ is less than 90°, or acute)
              −   if θ > π/2 (i.e. when θ is greater than 90°, or obtuse)
Up to this point, we haven't considered the points where activation_w(x) = w^T f(x) = 0. Following all the same logic, we will see that cos(θ) = 0 for those points. Furthermore, θ = π/2 (i.e. θ is 90°) for those points. In other words, these are the data points with feature vectors that are orthogonal to w. We can add a dotted blue line, orthogonal to w, where any feature vector that lies on this line will have an activation equal to 0.
We call this blue line the decision boundary because it is the boundary that separates the region where
we classify data points as positive from the region of negatives. In higher dimensions, a linear decision
boundary is generically called a hyperplane. A hyperplane is a linear surface that is one dimension lower than the feature space, thus dividing that space in two. For general (non-linear) classifiers, the decision boundary may not be linear, but is simply defined as a surface in the space of feature vectors that separates the
classes. To classify points that end up on the decision boundary, we can apply either label since both classes
are equally valid (in the algorithms below, we’ll classify points on the line as positive).
Binary Perceptron
Great, now you know how linear classifiers work, but how do we build a good one? When building a
classifier, you start with data that are labeled with the correct class; we call this the training set. You build a classifier by evaluating it on the training data, comparing your predictions to the training labels, and adjusting the parameters of your classifier until you reach your goal.
Let’s explore one specific implementation of a simple linear classifier: the binary perceptron. The percep-
tron is a binary classifier–though it can be extended to work on more than two classes. The goal of the binary
perceptron is to find a decision boundary that perfectly separates the training data. In other words, we're seeking the best possible weights – the best w – such that every featurized training point, when multiplied by the weights, is classified correctly.
The Algorithm
The perceptron algorithm works as follows:
1. Initialize all weights to 0: w = 0.
2. For each training sample, with features f(x) and true class label y* ∈ {−1, +1}, do:
(a) Classify the sample using the current weights; let y be the class predicted by your current w:

y = classify(x) = +1   if activation_w(x) = w^T f(x) > 0
                  −1   if activation_w(x) = w^T f(x) < 0

(b) Compare the predicted label y to the true label y*:
• If y = y*, do nothing.
• Otherwise, if y ≠ y*, update your weights: w ← w + y* f(x).
3. If you went through every training sample without making a single mistake (and hence without updating w), terminate. Otherwise, run through the training set again (return to step 2).
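Below is a minimal sketch of this training loop in Python; it assumes the data points have already been featurized into a 2D NumPy array, and the variable names are illustrative:

    import numpy as np

    def train_binary_perceptron(features, labels, max_passes=100):
        """
        features : NumPy array of featurized points f(x), shape (num_samples, num_features)
        labels   : array of true labels y* in {-1, +1}
        Returns a weight vector w learned by the perceptron update rule.
        """
        w = np.zeros(features.shape[1])
        for _ in range(max_passes):
            mistakes = 0
            for f, y_star in zip(features, labels):
                y = 1 if w @ f >= 0 else -1   # step 2a (points on the boundary count as positive)
                if y != y_star:               # step 2b: update only on a mistake
                    w = w + y_star * f
                    mistakes += 1
            if mistakes == 0:                 # step 3: stop after a mistake-free pass
                return w
        return w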
Updating weights
Let’s examine and justify the procedure for updating our weights. Recall that in step 2b above, when our
classifier is right, nothing changes. But when our classifier is wrong, the weight vector is updated as follows:
w ← w + y∗ f(x)
where y* is the true label, which is either +1 or −1, and x is the training sample which we mis-classified. You can interpret this update rule as follows:
• Case 1: the true label is y* = +1 but we predicted −1. The update adds the feature vector: w ← w + f(x).
• Case 2: the true label is y* = −1 but we predicted +1. The update subtracts the feature vector: w ← w − f(x).
Why does this work? One way to look at this is to see it as a balancing act. Mis-classification happens either
when the activation for a training sample is much smaller than it should be (causes a Case 1 misclassification)
or much larger than it should be (causes a Case 2 misclassification).
Consider Case 1, where the activation is negative when it should be positive. In other words, the activation is too
small. How we adjust w should strive to fix that and make the activation larger for that training sample. To
convince yourself that our update rule w ← w + f(x) does that, let us update w and see how the activation
changes.
activation_{w + f(x)}(x) = (w + f(x))^T f(x) = w^T f(x) + f(x)^T f(x) = activation_w(x) + f(x)^T f(x)
Using our update rule, we see that the new activation increases by f(x)^T f(x), which is a positive number, therefore showing that our update makes sense. The activation is getting larger – closer to becoming positive.
You can repeat the same logic for when the classifier is mis-classifying because the activation is too large
(activation is positive when it should be negative). You’ll see that the update will cause the new activation
to decrease by f(x)T f(x), thus getting smaller and closer to classifying correctly.
While this makes it clear why we are adding and subtracting something, why would we want to add and
subtract our sample point's features? A good way to think about it is that the weights aren't the only thing
that determines this score. The score is determined by multiplying the weights by the relevant sample. This
means that certain parts of a sample contribute more than others. Consider the following situation where x
is a training sample we are given with true label y* = −1:

w^T = [ 2  2  2 ],    f(x) = [ 4  0  1 ]^T

activation_w(x) = (2 · 4) + (2 · 0) + (2 · 1) = 10
We know that our weights need to be smaller because activation needs to be negative to classify correctly.
We don’t want to change them all the same amount though. You’ll notice that the first element of our sample,
the 4, contributed much more to our score of 10 than the third element, and that the second element did not
contribute at all. An appropriate weight update, then, would change the first weight a lot, the third weight a little, and the second weight not at all. After all, the second and third weights might not even be that broken, and we don't want to fix what isn't broken!
Bias
If you tried to implement a perceptron based on what has been mentioned thus far, you would notice one particularly unfriendly quirk: any decision boundary you end up drawing will cross the origin.
Basically, your perceptron can only produce a decision boundary that could be represented by the function
w> f(x) = 0, w, f(x) ∈ Rn . The problem is, even among problems where there is a linear decision boundary
that separates the positive and negative classes in the data, that boundary may not go through the origin, and
we want to be able to draw those lines.
To do so, we will modify our feature and weights to add a bias term: add a feature to your sample feature
vectors that is always 1, and add an extra weight for this feature to your weight vector. Doing so essentially
allows us to produce a decision boundary representable by w> f(x) + b = 0, where b is the weighted bias
term (i.e. 1 * the last weight in the weight vector).
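In code, the bias trick is just appending a constant feature of 1; a minimal sketch (the helper name is hypothetical):

    import numpy as np

    def add_bias_feature(f):
        """Append a constant 1 so that w^T f(x) + b becomes a single dot product,
        with the last entry of the augmented weight vector playing the role of b."""
        return np.append(f, 1.0)

    print(add_bias_feature(np.array([4.0, 0.0, 1.0])))   # [4. 0. 1. 1.]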
Geometrically, we can visualize this by thinking about what the activation function looks like when it is w^T f(x) and when there is a bias, w^T f(x) + b. To do so, we need to work one dimension higher than the space of our featurized data. In all the above sections, we had only been looking at a flat view of the data space.
In practice, this algorithm typically needs to run for many passes through the data before every data point is classified correctly within a single pass.
Multiclass Perceptron
The perceptron presented above is a binary classifier, but we can extend it to account for multiple classes
rather easily. The main difference is in how we set up weights and how we update said weights. For the
binary case, we had one weight vector, which had a dimension equal to the number of features (plus the bias
feature). For the multi-class case, we will have one weight vector for each class, so in the 3 class case, we
have 3 weight vectors. In order to classify a sample, we compute a score for each class by taking the dot
product of the feature vector with each of the weight vectors. Whichever class yields the highest score is the
one we choose as our prediction.
For example, consider the 3-class case. Let our sample have features f(x) = [ −2  3  1 ] and our weights for classes 0, 1, and 2 be:

w_0 = [ −2  2  1 ]
w_1 = [ 0  3  4 ]
w_2 = [ 1  4  −2 ]
Taking dot products for each class gives us scores s0 = 11, s1 = 13, s2 = 8. We would thus predict that x
belongs to class 1.
An important thing to note is that in actual implementation, we do not keep track of the weights as separate
structures, we usually stack them on top of each other to create a weight matrix. This way, instead of doing
as many dot products as there are classes, we can instead do a single matrix-vector multiplication. This tends
to be much more efficient in practice (because matrix-vector multiplication usually has a highly optimized
implementation).
Along with the structure of our weights, our weight update also changes when we move to a multi-class
case. If we correctly classify our data point, then we do nothing, just like in the binary case. If we chose incorrectly, say we chose class y ≠ y*, then we add the feature vector to the weight vector corresponding to the true class y*, and subtract the feature vector from the weight vector corresponding to the predicted class y. In our above example, let's say that the correct class was class 2, but we predicted class 1. We would now take the weight vector corresponding to class 1 and subtract f(x) from it,
w_1 = [ 0  3  4 ] − [ −2  3  1 ] = [ 2  0  3 ]
Next we take the weight vector corresponding to the correct class, class 2 in our case, and add f(x) to it:
w_2 = [ 1  4  −2 ] + [ −2  3  1 ] = [ −1  7  −1 ]
What this amounts to is 'rewarding' the correct weight vector, 'punishing' the misleading, incorrect weight vector, and leaving all other weight vectors alone. With the difference in the weights and weight updates taken into account, the rest of the algorithm is essentially the same: cycle through every sample point, updating weights when a mistake is made, until you stop making mistakes.
In order to incorporate a bias term, do the same as we did for binary perceptron – add an extra feature of 1
to every feature vector, and an extra weight for this feature to every class’s weight vector (this amounts to
adding an extra column to the matrix form).
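Here is a minimal sketch of one multiclass perceptron step with the weights stacked into a matrix, reproducing the worked example above (variable names are illustrative):

    import numpy as np

    def multiclass_perceptron_step(W, f, y_star):
        """
        W      : weight matrix, shape (num_classes, num_features), one row per class
        f      : feature vector for a single sample
        y_star : index of the true class
        Scores every class with one matrix-vector product and applies the update on a mistake.
        """
        scores = W @ f                    # s_c = w_c . f(x) for every class c at once
        y = int(np.argmax(scores))        # predicted class = highest score
        if y != y_star:                   # reward the true class, punish the prediction
            W[y_star] += f
            W[y] -= f
        return W

    W = np.array([[-2., 2., 1.], [0., 3., 4.], [1., 4., -2.]])
    f = np.array([-2., 3., 1.])
    W = multiclass_perceptron_step(W, f, y_star=2)
    print(W[1], W[2])   # [2. 0. 3.] [-1. 7. -1.], matching the update worked out above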
Summary
In this note, we introduced several fundamental principles of machine learning, including:
• Splitting your data into training data, validation data, and test data.
• The difference between supervised learning, which learns from labeled data, and unsupervised learn-
ing, which doesn’t have labeled data and so attempts to infer inherent structure from it.
We then proceeded to discuss an example of a supervised learning algorithm for classification, Naive Bayes’,
which uses parameter estimation to construct conditional probability tables within a Bayes’ net before run-
ning inference to predict the class labels of datapoints. We extended this idea to discuss the problem of
overfitting in the context of Naive Bayes’ and how this issue can be mitigated with Laplace smoothing. Fi-
nally, we talked about perceptrons and decision boundaries - methods for classification that don’t explicitly
store conditional probability tables.