CS 229 - Supervised Learning Cheatsheet

By Afshine Amidi (https://twitter.com/afshinea) and Shervine Amidi (https://twitter.com/shervinea)

Introduction to Supervised Learning

Given a set of data points {x(1), ..., x(m)} associated to a set of outcomes {y(1), ..., y(m)}, we want to build a classifier that learns how to predict y from x.

❐ Type of prediction ― The different types of predictive models are summed up in the table below:

                   Regression          Classification
  Outcome          Continuous          Class
  Examples         Linear regression   Logistic regression, SVM, Naive Bayes

❐ Type of model ― The different models are summed up in the table below:

                   Discriminative model         Generative model
  Goal             Directly estimate P(y∣x)     Estimate P(x∣y) to then deduce P(y∣x)
  What's learned   Decision boundary            Probability distributions of the data
  Examples         Regressions, SVMs            GDA, Naive Bayes

Notations and general concepts

❐ Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i), the model prediction output is hθ(x(i)).

❐ Loss function ― A loss function is a function L : (z, y) ∈ R × Y ⟼ L(z, y) ∈ R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:

  Least squared error   (1/2)(y − z)²
  Logistic loss         log(1 + exp(−yz))
  Hinge loss            max(0, 1 − yz)
  Cross-entropy         −[y log(z) + (1 − y) log(1 − z)]

❐ Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:

  J(θ) = ∑_{i=1}^{m} L(hθ(x(i)), y(i))

❐ Gradient descent ― By noting α ∈ R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:

  θ ⟵ θ − α∇J(θ)

Remark: stochastic gradient descent (SGD) updates the parameters based on each training example, whereas batch gradient descent performs each update on a batch of training examples.
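
To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent applied to a least-squares cost; the function name, the made-up data shapes, and the 1/(2m) scaling of the cost are illustrative choices, not part of the cheatsheet:

    import numpy as np

    def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
        """Repeatedly apply theta <- theta - alpha * grad J(theta) for a least-squares cost."""
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(n_iters):
            grad = X.T @ (X @ theta - y) / m   # gradient of J(theta) = (1/(2m)) * ||X @ theta - y||^2
            theta -= alpha * grad              # gradient descent step
        return theta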

❐ Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through likelihood maximization. We have:

  θ_opt = arg max_θ L(θ)

Remark: in practice, we use the log-likelihood ℓ(θ) = log(L(θ)), which is easier to optimize.
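
As a small numerical illustration (not from the cheatsheet), the sketch below evaluates the Bernoulli log-likelihood ℓ(ϕ) on a grid and checks that its maximizer is close to the sample mean:

    import numpy as np

    # Log-likelihood of i.i.d. Bernoulli(phi) observations: l(phi) = sum(y*log(phi) + (1-y)*log(1-phi))
    y = np.array([1, 0, 1, 1, 0, 1])
    phis = np.linspace(0.01, 0.99, 99)
    loglik = y.sum() * np.log(phis) + (len(y) - y.sum()) * np.log(1 - phis)
    print(phis[np.argmax(loglik)], y.mean())   # the grid maximizer is close to the sample mean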

❐ Newton's algorithm ― Newton's algorithm is a numerical method that finds θ such that ℓ′(θ) = 0. Its update rule is as follows:

  θ ⟵ θ − ℓ′(θ)/ℓ′′(θ)

Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:

  θ ⟵ θ − (∇θ² ℓ(θ))⁻¹ ∇θ ℓ(θ)
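
For intuition, here is a one-dimensional sketch of this update, assuming ℓ′ and ℓ′′ are available as Python callables; the helper name and the example function are made up:

    def newton_1d(dl, d2l, theta0=0.0, n_iters=20):
        """Apply theta <- theta - l'(theta)/l''(theta) to find a stationary point of l."""
        theta = theta0
        for _ in range(n_iters):
            theta -= dl(theta) / d2l(theta)
        return theta

    # Example: l(theta) = -(theta - 3)**2 has l'(theta) = -2*(theta - 3) and l''(theta) = -2
    print(newton_1d(lambda t: -2 * (t - 3), lambda t: -2.0))   # converges to 3.0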

Linear models

Linear regression

We assume here that y∣x; θ ∼ N(µ, σ²).

❐ Normal equations ― By noting X the design matrix, the value of θ that minimizes the cost function is a closed-form solution such that:

  θ = (XᵀX)⁻¹ Xᵀy
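
As a sketch, the closed-form solution can be computed directly in NumPy; the data below is made up, and a least-squares solver is used instead of an explicit matrix inverse for numerical stability:

    import numpy as np

    X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # design matrix with an intercept column
    y = np.array([1.0, 2.9, 5.1])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)        # solves min ||X @ theta - y||^2
    # Equivalent to the normal equations: np.linalg.inv(X.T @ X) @ X.T @ y
    print(theta)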

❐ LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:

  ∀j,  θj ← θj + α ∑_{i=1}^{m} [y(i) − hθ(x(i))] xj(i)

Remark: the update rule is a particular case of the gradient ascent.

❐ LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ ∈ R as:

  w(i)(x) = exp(−(x(i) − x)²/(2τ²))
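
Below is a minimal sketch of how these weights can be used: for a query point x, a weighted least-squares problem is solved around it. Solving the weighted normal equations is one standard choice and is not prescribed by the cheatsheet; names are illustrative:

    import numpy as np

    def lwr_predict(x_query, X, y, tau=0.5):
        """Predict at x_query by solving a weighted least-squares problem centered on it."""
        w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))  # weights w^(i)(x)
        W = np.diag(w)
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)                 # weighted least squares
        return x_query @ theta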

Classification and logistic regression

❐ Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:

  ∀z ∈ R,  g(z) = 1/(1 + e^(−z)) ∈ ]0, 1[

❐ Logistic regression ― We assume here that y∣x; θ ∼ Bernoulli(ϕ). We have the following form:

  ϕ = p(y = 1∣x; θ) = 1/(1 + exp(−θᵀx)) = g(θᵀx)

Remark: logistic regressions do not have closed form solutions.

❐ Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK = 0, which makes the Bernoulli parameter ϕi of each class i equal to:

  ϕi = exp(θiᵀx) / ∑_{j=1}^{K} exp(θjᵀx)
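
Since there is no closed form, the parameters are typically found iteratively; here is a minimal gradient-ascent sketch on the average log-likelihood (function names, data shapes and hyperparameters are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic_regression(X, y, alpha=0.1, n_iters=1000):
        """Fit phi = g(theta^T x) by gradient ascent on the log-likelihood."""
        theta = np.zeros(X.shape[1])
        for _ in range(n_iters):
            grad = X.T @ (y - sigmoid(X @ theta)) / len(y)   # gradient of the average log-likelihood
            theta += alpha * grad                            # ascent step (maximization)
        return theta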


Generalized linear models

❐ Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:

  p(y; η) = b(y) exp(ηT(y) − a(η))

Remark: we will often have T(y) = y. Also, exp(−a(η)) can be seen as a normalization parameter that ensures the probabilities sum to one.

The most common exponential distributions are summed up in the following table:

  Distribution   η              T(y)   a(η)                 b(y)
  Bernoulli      log(ϕ/(1−ϕ))   y      log(1 + exp(η))      1
  Gaussian       µ              y      η²/2                 (1/√(2π)) exp(−y²/2)
  Poisson        log(λ)         y      e^η                  1/y!
  Geometric      log(1−ϕ)       y      log(e^η/(1−e^η))     1

❐ Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function of x ∈ R^(n+1) and rely on the following 3 assumptions:

  (1) y∣x; θ ∼ ExpFamily(η)    (2) hθ(x) = E[y∣x; θ]    (3) η = θᵀx

Remark: ordinary least squares and logistic regression are special cases of generalized linear models.

Support Vector Machines

The goal of support vector machines is to find the line that maximizes the minimum distance of the data points to the line.

❐ Optimal margin classifier ― The optimal margin classifier h is such that:

  h(x) = sign(wᵀx − b)

where (w, b) ∈ R^n × R is the solution of the following optimization problem:

  min (1/2)∣∣w∣∣²  such that  y(i)(wᵀx(i) − b) ⩾ 1

Remark: the decision boundary is defined as wᵀx − b = 0.

❐ Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:

  L(z, y) = [1 − yz]₊ = max(0, 1 − yz)

❐ Kernel ― Given a feature mapping ϕ, we define the kernel K as follows:

  K(x, z) = ϕ(x)ᵀϕ(z)

In practice, the kernel K defined by K(x, z) = exp(−∣∣x − z∣∣²/(2σ²)) is called the Gaussian kernel and is commonly used.

Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x, z) are needed.
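
As a small sketch, the Gaussian kernel can be evaluated directly from its formula; the bandwidth and the example vectors are arbitrary:

    import numpy as np

    def gaussian_kernel(x, z, sigma=1.0):
        """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
        return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

    # Only kernel values are needed ("kernel trick"); the mapping phi is never formed explicitly.
    print(gaussian_kernel(np.array([1.0, 2.0]), np.array([1.5, 1.0])))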

❐ Lagrangian ― We define the Lagrangian L(w, b) as follows:

  L(w, b) = f(w) + ∑_{i=1}^{l} βi hi(w)

Remark: the coefficients βi are called the Lagrange multipliers.

Generative Learning

A generative model first tries to learn how the data is generated by estimating P(x∣y), which we can then use to estimate P(y∣x) by using Bayes' rule.

Gaussian Discriminant Analysis

❐ Setting ― The Gaussian Discriminant Analysis assumes that y and x∣y = 0 and x∣y = 1 are such that:

  (1) y ∼ Bernoulli(ϕ)    (2) x∣y = 0 ∼ N(µ0, Σ)    (3) x∣y = 1 ∼ N(µ1, Σ)

❐ Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:

  ϕ               (1/m) ∑_{i=1}^{m} 1{y(i) = 1}
  µj (j = 0, 1)   (∑_{i=1}^{m} 1{y(i) = j} x(i)) / (∑_{i=1}^{m} 1{y(i) = j})
  Σ               (1/m) ∑_{i=1}^{m} (x(i) − µ_{y(i)})(x(i) − µ_{y(i)})ᵀ
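
These estimates translate directly into code; here is a minimal NumPy sketch assuming binary labels y ∈ {0, 1} (function and variable names are illustrative):

    import numpy as np

    def gda_fit(X, y):
        """Maximum-likelihood estimates of phi, mu_0, mu_1 and the shared covariance Sigma."""
        m = len(y)
        phi = np.mean(y == 1)
        mu = [X[y == j].mean(axis=0) for j in (0, 1)]
        centered = X - np.where(y[:, None] == 1, mu[1], mu[0])   # x^(i) - mu_{y^(i)}
        Sigma = centered.T @ centered / m
        return phi, mu[0], mu[1], Sigma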

Naive Bayes

❐ Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:

  P(x∣y) = P(x1, x2, ...∣y) = P(x1∣y)P(x2∣y)... = ∏_{i=1}^{n} P(xi∣y)

❐ Solutions ― Maximizing the log-likelihood gives the following solutions:

  P(y = k) = (1/m) × #{j ∣ y(j) = k}    and    P(xi = l∣y = k) = #{j ∣ y(j) = k and xi(j) = l} / #{j ∣ y(j) = k}

with k ∈ {0, 1} and l ∈ [[1, L]]

Remark: Naive Bayes is widely used for text classification and spam detection.
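
A minimal counting-based sketch of these estimates, assuming discrete features encoded as integers in [0, n_values); no smoothing is applied, since the cheatsheet does not discuss it, and the names are illustrative:

    import numpy as np

    def naive_bayes_fit(X, y, n_values, n_classes=2):
        """Estimate P(y=k) and P(x_i=l | y=k) by counting, as in the maximum-likelihood solutions."""
        m, n = X.shape
        prior = np.array([np.mean(y == k) for k in range(n_classes)])
        cond = np.zeros((n_classes, n, n_values))
        for k in range(n_classes):
            Xk = X[y == k]
            for i in range(n):
                for l in range(n_values):
                    cond[k, i, l] = np.mean(Xk[:, i] == l)   # #{x_i = l and y = k} / #{y = k}
        return prior, cond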

Tree-based and ensemble methods

These methods can be used for both regression and classification problems.

❐ CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage of being very interpretable.

❐ Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Unlike a single decision tree, it is not easily interpretable, but its generally good performance makes it a popular algorithm.

Remark: random forests are a type of ensemble method.

❐ Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:

  Adaptive boosting
  • High weights are put on errors to improve at the next boosting step
  • Known as Adaboost

  Gradient boosting
  • Weak learners are trained on residuals
  • Examples include XGBoost

Other non-parametric approaches

❐ k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.

Remark: the higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.
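
A minimal k-NN classification sketch (majority vote among the k closest training points; the Euclidean distance and the names are illustrative choices):

    import numpy as np

    def knn_predict(x_query, X_train, y_train, k=3):
        """Classify x_query by majority vote among its k nearest neighbors."""
        dists = np.sum((X_train - x_query) ** 2, axis=1)   # squared Euclidean distances
        nearest = np.argsort(dists)[:k]                    # indices of the k closest points
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]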

Learning Theory

❐ Union bound ― Let A1, ..., Ak be k events. We have:

  P(A1 ∪ ... ∪ Ak) ⩽ P(A1) + ... + P(Ak)

❐ Hoeffding inequality ― Let Z1, ..., Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ϕ̂ be their sample mean and γ > 0 fixed. We have:

  P(∣ϕ − ϕ̂∣ > γ) ⩽ 2 exp(−2γ²m)

Remark: this inequality is also known as the Chernoff bound.

❐ Training error ― For a given classifier h, we define the training error ϵ̂(h), also known as the empirical risk or empirical error, to be as follows:

  ϵ̂(h) = (1/m) ∑_{i=1}^{m} 1{h(x(i)) ≠ y(i)}
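
For concreteness, the empirical error is just the fraction of mismatched predictions; the sketch below computes it on made-up labels and, separately, evaluates the Hoeffding bound for an arbitrary tolerance γ:

    import numpy as np

    y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
    y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])
    train_error = np.mean(y_pred != y_true)            # empirical risk: fraction of misclassified points
    m, gamma = len(y_true), 0.1
    hoeffding_bound = 2 * np.exp(-2 * gamma ** 2 * m)  # P(|phi - phi_hat| > gamma) <= this value
    print(train_error, hoeffding_bound)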

❐ Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:

  • the training and testing sets follow the same distribution
  • the training examples are drawn independently

❐ Shattering ― Given a set S = {x(1), ..., x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1), ..., y(d)}, we have:

  ∃h ∈ H,  ∀i ∈ [[1, d]],  h(x(i)) = y(i)

❐ Upper bound theorem ― Let H be a finite hypothesis class such that ∣H∣ = k and let δ and the sample size m be fixed. Then, with probability of at least 1 − δ, we have:

  ϵ(ĥ) ⩽ (min_{h∈H} ϵ(h)) + 2 √((1/(2m)) log(2k/δ))

❐ VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H), is the size of the largest set that is shattered by H.

Remark: the VC dimension of H = {set of linear classifiers in 2 dimensions} is 3.

❐ Theorem (Vapnik) ― Let H be given, with VC(H) = d and m the number of training examples. With probability at least 1 − δ, we have:

  ϵ(ĥ) ⩽ (min_{h∈H} ϵ(h)) + O(√((d/m) log(m/d) + (1/m) log(1/δ)))