CS 229 - Supervised Learning Cheatsheet
Introduction

❐ Type of prediction ― The different types of predictive models are summed up in the table below:

               Regression              Classification
Outcome        Continuous              Class
Examples       Linear regression       Logistic regression, SVM, Naive Bayes
❐ Type of model ― The different models are summed up in the table below:

                  Discriminative model                      Generative model
Goal              Directly estimate P(y∣x)                  Estimate P(x∣y) to then deduce P(y∣x)
What's learned    Decision boundary                         Probability distributions of the data
Examples          Regressions, SVMs                         GDA, Naive Bayes
Notations and general concepts

❐ Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i), the model prediction output is hθ(x(i)).

❐ Loss function ― A loss function is a function L : (z, y) ∈ R × Y ↦ L(z, y) ∈ R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The most common loss functions are the least squared error ½(y − z)² (linear regression), the logistic loss log(1 + exp(−yz)) (logistic regression), the hinge loss max(0, 1 − yz) (SVM) and the cross-entropy −[y log(z) + (1 − y) log(1 − z)] (neural networks).

❐ Gradient descent ― By noting α ∈ R the learning rate and J the cost function, the update rule for gradient descent is:

θ ← θ − α∇J(θ)

Remark: Stochastic gradient descent (SGD) updates the parameters based on each training example, whereas batch gradient descent updates them on a batch of training examples.

❐ Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through likelihood maximization. We have:

θopt = arg max_θ L(θ)

Remark: in practice, we use the log-likelihood ℓ(θ) = log(L(θ)), which is easier to optimize.

❐ Newton's algorithm ― Newton's algorithm is a numerical method that finds θ such that ℓ′(θ) = 0. Its update rule is as follows:

θ ← θ − ℓ′(θ) / ℓ′′(θ)

Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:

θ ← θ − (∇²θ ℓ(θ))⁻¹ ∇θ ℓ(θ)
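To make the Newton update concrete, here is a minimal sketch, assuming a toy Bernoulli sample parameterized by its log-odds θ; the data and variable names are illustrative and not part of the cheatsheet.

```python
# Minimal sketch (not from the cheatsheet): Newton's method applied to the
# log-likelihood of a Bernoulli sample, parameterized by the log-odds theta.
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # toy binary observations

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

theta = 0.0                                    # initial guess for the log-odds
for _ in range(20):
    phi = sigmoid(theta)
    grad = y.sum() - len(y) * phi              # l'(theta)
    hess = -len(y) * phi * (1 - phi)           # l''(theta)
    theta = theta - grad / hess                # Newton update: theta <- theta - l'(theta)/l''(theta)

print(theta, np.log(y.mean() / (1 - y.mean())))  # both approach the logit of the sample mean
```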
Generalized Linear Models

❐ Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:

p(y; η) = b(y) exp(ηT(y) − a(η))

Remark: we will often have T(y) = y. Also, exp(−a(η)) can be seen as a normalization parameter that makes sure the probabilities sum to one. The most common exponential family distributions are the Bernoulli, Gaussian, Poisson and Geometric distributions, each with its own η, T(y), a(η) and b(y).
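As a sanity check of this form, the sketch below (an illustration added here, not taken from the cheatsheet) verifies numerically that the Bernoulli distribution fits it, assuming the standard choices η = log(ϕ/(1 − ϕ)), T(y) = y, a(η) = log(1 + exp(η)) and b(y) = 1.

```python
# Illustrative check: Bernoulli(phi) written in exponential-family form
# p(y; eta) = b(y) exp(eta*T(y) - a(eta)) with T(y) = y and b(y) = 1.
import numpy as np

phi = 0.3
eta = np.log(phi / (1 - phi))          # natural parameter
a = np.log(1 + np.exp(eta))            # log-partition function

for y in (0, 1):
    direct = phi**y * (1 - phi)**(1 - y)      # usual Bernoulli pmf
    exp_family = 1 * np.exp(eta * y - a)      # b(y) exp(eta*T(y) - a(eta))
    print(y, direct, exp_family)              # the two values match
```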
Linear models

Linear regression

We assume here that y∣x; θ ∼ N(µ, σ²).

❐ Normal equations ― By noting X the design matrix, the value of θ that minimizes the cost function is given by the closed-form solution:

θ = (XᵀX)⁻¹ Xᵀy

❐ Locally Weighted Regression ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), defined with a bandwidth parameter τ ∈ R as:

w(i)(x) = exp (− (x(i) − x)² / (2τ²))
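A minimal sketch of the two formulas above, assuming synthetic one-dimensional data; the normal equations give the global fit, and the Gaussian weights give a locally weighted fit at a query point x0.

```python
# Illustrative sketch (synthetic data assumed): closed-form linear regression
# via the normal equations, and a locally weighted prediction at x0.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=50)
y = np.sin(x) + 0.1 * rng.normal(size=50)

X = np.column_stack([np.ones_like(x), x])       # design matrix with intercept

# Normal equations: theta = (X^T X)^(-1) X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Locally weighted regression at query point x0 with bandwidth tau
x0, tau = 0.5, 0.8
w = np.exp(-(x - x0) ** 2 / (2 * tau ** 2))     # weights w(i)(x0)
W = np.diag(w)
theta_lwr = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print("global fit:", theta)
print("LWR prediction at x0:", np.array([1.0, x0]) @ theta_lwr)
```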
Classification and logistic regression

❐ Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:

∀z ∈ R,   g(z) = 1 / (1 + e^(−z)) ∈ ]0, 1[

❐ Logistic regression ― We assume here that y∣x; θ ∼ Bernoulli(ϕ). We have the following form:

ϕ = p(y = 1∣x; θ) = 1 / (1 + exp(−θᵀx)) = g(θᵀx)

❐ Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK = 0, which makes the Bernoulli parameter ϕi of each class i be such that:

ϕi = exp(θiᵀx) / ∑_{j=1}^{K} exp(θjᵀx)
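Below is a small sketch of these prediction rules, with made-up parameter values; it only illustrates how the sigmoid and softmax probabilities are computed, not how θ is fitted.

```python
# Illustrative sketch (made-up parameters): a logistic-regression prediction
# with the sigmoid, and softmax probabilities with the convention theta_K = 0.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, -0.5])                 # one input, first entry = intercept
theta = np.array([0.3, -1.2, 0.8])             # logistic-regression parameters
print("p(y = 1 | x) =", sigmoid(theta @ x))

# Softmax over K = 3 classes; the last parameter vector is fixed to zero.
Theta = np.array([[0.2, 0.5, -0.1],
                  [-0.4, 1.0, 0.3],
                  [0.0, 0.0, 0.0]])            # theta_K = 0 by convention
scores = Theta @ x
phi = np.exp(scores) / np.exp(scores).sum()    # phi_i = exp(theta_i^T x) / sum_j exp(theta_j^T x)
print("class probabilities:", phi, phi.sum())
```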
Generative Learning

Gaussian Discriminant Analysis

❐ Setting ― Gaussian Discriminant Analysis assumes that y ∼ Bernoulli(ϕ), x∣y = 0 ∼ N(µ0, Σ) and x∣y = 1 ∼ N(µ1, Σ).

❐ Estimation ― Maximizing the likelihood gives the following estimates of ϕ, µj (j = 0, 1) and Σ:

ϕ = (1/m) ∑_{i=1}^{m} 1{y(i) = 1}

µj = ∑_{i=1}^{m} 1{y(i) = j} x(i) / ∑_{i=1}^{m} 1{y(i) = j}

Σ = (1/m) ∑_{i=1}^{m} (x(i) − µ_{y(i)})(x(i) − µ_{y(i)})ᵀ
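A possible numpy sketch of these estimates on synthetic two-dimensional data (the data generation is an assumption made for the example):

```python
# Illustrative sketch (synthetic data assumed): GDA maximum-likelihood
# estimates of phi, mu_0, mu_1 and the shared covariance Sigma.
import numpy as np

rng = np.random.default_rng(1)
m = 200
y = rng.integers(0, 2, size=m)                          # labels in {0, 1}
X = np.where(y[:, None] == 1, 2.0, -1.0) + rng.normal(size=(m, 2))

phi = (y == 1).mean()                                   # (1/m) sum 1{y(i) = 1}
mu = [X[y == j].mean(axis=0) for j in (0, 1)]           # class means mu_0, mu_1
centered = X - np.where(y[:, None] == 1, mu[1], mu[0])  # x(i) - mu_{y(i)}
Sigma = centered.T @ centered / m                       # shared covariance estimate

print(phi, mu[0], mu[1])
print(Sigma)
```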
Naive Bayes

❐ Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:

P(x∣y) = P(x1, x2, ...∣y) = P(x1∣y)P(x2∣y)... = ∏_{i=1}^{n} P(xi∣y)

❐ Solutions ― Maximizing the log-likelihood gives the following solutions:

P(y = k) = (1/m) × #{j ∣ y(j) = k}    and    P(xi = l∣y = k) = #{j ∣ y(j) = k and xi(j) = l} / #{j ∣ y(j) = k}
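For binary features, the counts above reduce to simple class-wise averages; here is an illustrative sketch on a toy dataset (no smoothing is applied, matching the formulas as written):

```python
# Illustrative sketch (toy data assumed): Naive Bayes count-based estimates
# P(y = k) and P(x_i = 1 | y = k) for binary features.
import numpy as np

X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 0],
              [0, 1, 1]])          # m = 5 examples, n = 3 binary features
y = np.array([1, 1, 0, 1, 0])

for k in (0, 1):
    prior = (y == k).mean()                    # P(y = k) = #{j | y(j) = k} / m
    cond = X[y == k].mean(axis=0)              # P(x_i = 1 | y = k), one value per feature
    print(f"P(y={k}) = {prior:.2f}, P(x_i=1 | y={k}) =", np.round(cond, 2))
```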
Support Vector Machines

The goal of support vector machines is to find the line that maximizes the minimum distance to the line.

❐ Optimal margin classifier ― The optimal margin classifier h is such that:

h(x) = sign(wᵀx − b)

where (w, b) ∈ Rⁿ × R is the solution of the following optimization problem:

min ½ ∣∣w∣∣²    such that    y(i)(wᵀx(i) − b) ⩾ 1

Remark: the decision boundary is defined as wᵀx − b = 0.

❐ Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:

L(z, y) = [1 − yz]+ = max(0, 1 − yz)

❐ Kernel ― Given a feature mapping ϕ, we define the kernel K as follows:

K(x, z) = ϕ(x)ᵀϕ(z)

In practice, the kernel K defined by K(x, z) = exp (− ∣∣x − z∣∣² / (2σ²)) is called the Gaussian kernel and is commonly used.
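A short sketch of the hinge loss and the Gaussian kernel, with example values chosen for illustration:

```python
# Illustrative sketch (example values assumed): the hinge loss max(0, 1 - yz)
# and the Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2)).
import numpy as np

def hinge_loss(z, y):
    return np.maximum(0.0, 1.0 - y * z)

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

print(hinge_loss(np.array([2.0, 0.3, -1.0]), y=np.array([1, 1, 1])))   # [0, 0.7, 2]
print(gaussian_kernel(np.array([1.0, 0.0]), np.array([0.0, 1.0])))     # exp(-1)
```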
Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x, z) are needed.

❐ Lagrangian ― We define the Lagrangian L(w, b) as follows:

L(w, b) = f(w) + ∑_{i=1}^{l} βi hi(w)

Remark: the coefficients βi are called the Lagrange multipliers.

Tree-based and ensemble methods

These methods can be used for both regression and classification problems.

❐ CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage of being very interpretable.

❐ Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable, but its generally good performance makes it a popular algorithm.

Remark: random forests are a type of ensemble method.

❐ Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are adaptive boosting, where high weights are put on the errors made at the previous step, and gradient boosting, where each weak learner is trained on the remaining errors (residuals) of the current ensemble.
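As a usage illustration, the sketch below fits the three families of models with scikit-learn; the library and its hyperparameters are assumptions made for the example, not something prescribed by the cheatsheet.

```python
# Illustrative sketch (scikit-learn assumed): a decision tree, a random forest
# and a gradient-boosting classifier fitted on the same toy dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

for model in (DecisionTreeClassifier(max_depth=3, random_state=0),
              RandomForestClassifier(n_estimators=100, random_state=0),
              GradientBoostingClassifier(n_estimators=100, random_state=0)):
    model.fit(X, y)
    print(type(model).__name__, "training accuracy:", model.score(X, y))
```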
Other methods

❐ k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.

Remark: the higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.
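A bare-bones sketch of k-NN classification by majority vote, on a toy dataset assumed for illustration:

```python
# Illustrative sketch (toy data assumed): a plain k-NN classifier where the
# predicted label is the majority vote among the k closest training points.
import numpy as np

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_train = np.array([0, 0, 1, 1, 1])

def knn_predict(x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)        # distances to all training points
    nearest = y_train[np.argsort(dists)[:k]]           # labels of the k nearest neighbors
    return np.bincount(nearest).argmax()               # majority vote

print(knn_predict(np.array([0.2, 0.1])))   # close to the class-0 cluster
print(knn_predict(np.array([1.0, 0.9])))   # close to the class-1 cluster
```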
Learning Theory

❐ Union bound ― Let A1, ..., Ak be k events. We have:

P(A1 ∪ ... ∪ Ak) ⩽ P(A1) + ... + P(Ak)

❐ Hoeffding inequality ― Let Z1, ..., Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ϕ̂ be their sample mean and γ > 0 fixed. We have:

P(∣ϕ − ϕ̂∣ > γ) ⩽ 2 exp(−2γ²m)

❐ Training error ― For a given classifier h, we define the training error ϵ̂(h), also known as the empirical risk or empirical error, to be as follows:

ϵ̂(h) = (1/m) ∑_{i=1}^{m} 1{h(x(i)) ≠ y(i)}

❐ Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:
● the training and testing sets follow the same distribution
● the training examples are drawn independently

❐ Shattering ― Given a set S = {x(1), ..., x(d)} and a set of classifiers H, we say that H shatters S if for any set of labels {y(1), ..., y(d)}, we have:

∃h ∈ H,  ∀i ∈ [[1, d]],  h(x(i)) = y(i)

❐ Upper bound theorem ― Let H be a finite hypothesis class such that ∣H∣ = k and let δ and the sample size m be fixed. Then, with probability of at least 1 − δ, we have:

ϵ(ĥ) ⩽ (min_{h∈H} ϵ(h)) + 2 √( (1/(2m)) log(2k/δ) )

❐ VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H), is the size of the largest set that is shattered by H.

❐ Theorem (Vapnik) ― Let H be given, with VC(H) = d and m the number of training examples. With probability at least 1 − δ, we have:

ϵ(ĥ) ⩽ (min_{h∈H} ϵ(h)) + O( √( (d/m) log(m/d) + (1/m) log(1/δ) ) )
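As a small numerical illustration (all numbers are made up), the sketch below computes an empirical error and the gap term 2√(log(2k/δ)/(2m)) from the upper bound theorem:

```python
# Illustrative sketch (made-up numbers): the empirical error of a classifier
# and the finite-hypothesis-class bound term 2 * sqrt(log(2k / delta) / (2m)).
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1, 1, 0])

train_error = np.mean(y_pred != y_true)     # (1/m) sum 1{h(x(i)) != y(i)}
print("empirical error:", train_error)

m, k, delta = 1000, 50, 0.05                # sample size, |H|, confidence level
gap = 2 * np.sqrt(np.log(2 * k / delta) / (2 * m))
print("generalization gap term:", gap)
```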