12_Bài toán phân lớp_LR_v2

The document discusses basic machine learning concepts, focusing on classification problems, including binary and multi-class classification. It introduces linear models for classification, decision boundaries, and logistic regression as an alternative to traditional methods. Additionally, it highlights various applications of machine learning such as handwritten digit recognition, spam filtering, and object detection.


Basic Machine Learning

The Classification Problem

From regression to classification

From classification to the perceptron algorithm and single-layer neural networks
Applications of Machine Learning
• Handwritten Digit Recognition

References: https://courses.cs.washington.edu/courses/cse546/12wi/slides/cse546wi12intro.pdf
Applications of Machine Learning
• Spam Filtering
• Spam or Not Spam

References:
• Paper: Konstantin Tretyakov, Machine Learning Techniques in Spam Filtering, 2004.
• http://www.techlicious.com/blog/gmail-spam-filter-artificial-neural-network/
• https://upxacademy.com/spam-alert-machine-learning-is-filtering-your-emails/
Applications of Machine Learning
• Object Detection and Recognition

References:
• https://devblogs.nvidia.com/parallelforall/deep-learning-for-computer-vision-with-matlab-and-cudnn/
• https://www.cnet.com/how-to/how-to-disable-facial-recognition-in-facebook/
Linear Models for Classification
• Classification
  • Determine which discrete category an example belongs to.

• Binary classification: two possible labels


• e.g. Yes/No, True/False

• Multi-class classification: multiple possible labels


• e.g. Dog/Cat/Bird/Other
Linear Models for Classification
• Key Concepts
• Classification as regression
• Decision boundary
• Loss function
• Metrics to evaluate classification
A Simple 1-D Classification
• Classification as Regression
• Given a set of input-output pairs for binary classification:
  D = {(x^(1), t^(1)), …, (x^(i), t^(i)), …, (x^(N), t^(N))}
  [Figure: 1-D training points on the x-axis, labeled Class 1 and Class 2]
A Simple 1-D Classification
• Classification as Regression
• Given a set of input-output pairs for binary classification:
  D = {(x^(1), t^(1)), …, (x^(i), t^(i)), …, (x^(N), t^(N))}, where t = 1 or −1.
  [Figure: targets t ∈ {1, −1} plotted against x, together with a fitted regression line f(x) = w^T x; Class 1 points sit at t = 1, Class 2 points at t = −1]
A Simple 1-D Classification
• Classification as Regression
• Fit a linear regression model to the ±1 targets: let f(x) denote w_0 + w_1 x (that is, f(x) = w^T x).
  If f(x) ≥ 0, we predict class 1.
  If f(x) < 0, we predict class 2. (linear regression)
  [Figure: the fitted line f(x) over the 1-D data; Class 1, Class 2]
A Simple 1-D Classification
• Classification as Regression
• In math: y(x) = sgn(f(x)),
  where sgn(z) = 1 for z > 0, 0 for z = 0, −1 for z < 0.
  [Figure: the step-shaped prediction y(x) against x; Class 1, Class 2]
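A minimal NumPy sketch of this "classification as regression" idea (the toy data below are invented for illustration): fit f(x) = w_0 + w_1 x to targets in {−1, 1} by least squares, then classify by the sign of f(x).

    import numpy as np

    # Toy 1-D data: class 1 has t = +1, class 2 has t = -1 (made-up values).
    x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
    t = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

    # Design matrix with a bias column, so f(x) = w0 + w1 * x.
    X = np.column_stack([np.ones_like(x), x])

    # Least-squares fit of the linear regression model to the +/-1 targets.
    w, *_ = np.linalg.lstsq(X, t, rcond=None)

    # Predict class 1 when f(x) >= 0, class 2 when f(x) < 0.
    f = X @ w
    pred = np.where(f >= 0, "class 1", "class 2")
    print(w, pred)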
A Simple 1-D Classification
• Classification as Regression
• In math: y(x) = sgn(f(x)), where sgn(z) = 1 for z > 0, 0 for z = 0, −1 for z < 0.
• Decision boundary (in 1-D, this is simply a threshold): the value of x at which f(x) = 0.
A Simple 1-D Classification
• Buy or not buy it?
  [Diagram: a constant input 1 (weight w_0) and the feature "Size in square feet" (weight w_1) feed a single unit; their weighted sum, a price score, is passed through sgn, giving 1: buy or −1: not buy]
  where sgn(z) = 1 for z > 0, 0 for z = 0, −1 for z < 0.
• The weighted sum defines the decision boundary; sgn converts it to 1 or −1.
A Simple 1-D Classification
• Classification as Regression
• Given D = {(x^(1), t^(1)), …, (x^(N), t^(N))}, where t = 1 or −1, we predict y(x) = sgn(f(x)),
  where sgn(z) = 1 for z > 0, 0 for z = 0, −1 for z < 0.
  [Figure: the step-shaped prediction y(x); Class 1, Class 2]
• But sgn gives only a hard label. If we want a class probability, what should we do?
Logistic Regression
• An alternative: replace the sgn function with the sigmoid, or logistic, function:
  D = {(x^(1), t^(1)), …, (x^(i), t^(i)), …, (x^(N), t^(N))}, where t = 1 or −1.
  y(x) = σ(f(x)), where σ(z) = 1 / (1 + e^(−z))
  [Figure: the sigmoid curve y(x) rising from 0 toward 1 and crossing 0.5; outputs near 0 are labeled "very unlikely" / "unlikely" and outputs near 1 "likely" / "very likely"; Class 1, Class 2]
Logistic Regression (Note: starting from now, we use targets {1, 0}, just for convenience!)
• An alternative: replace the sgn function with the sigmoid, or logistic, function:
  D = {(x^(1), t^(1)), …, (x^(i), t^(i)), …, (x^(N), t^(N))}, where t = 1 or 0.
  y(x) = σ(f(x)), where σ(z) = 1 / (1 + e^(−z))
  [Figure: the sigmoid output between 0 and 1; Class 1, Class 2]
Logistic Regression
• An alternative: replace the sgn function with the sigmoid, or logistic, function:
  D = {(x^(1), t^(1)), …, (x^(i), t^(i)), …, (x^(N), t^(N))}, where t = 1 or 0.
  y(x) = 1 / (1 + e^(−w^T x))
  y(x) = 0.8 means an 80% chance of being in class 1.
  [Figure: the sigmoid output between 0 and 1, crossing 0.5; Class 1, Class 2]
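A small sketch of this logistic model; the weights w_0, w_1 are assumed example values, not taken from the slides.

    import numpy as np

    def sigmoid(z):
        # Logistic function: squashes any real score into (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    # Assumed example weights for a 1-D model y(x) = sigmoid(w0 + w1 * x).
    w0, w1 = -4.0, 1.0

    def predict_proba(x):
        return sigmoid(w0 + w1 * x)

    x = 6.0
    p = predict_proba(x)           # about 0.88: roughly an 88% chance of class 1
    label = 1 if p >= 0.5 else 0   # threshold the probability at 0.5
    print(p, label)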
A Simple 1-D Classification
• Decision Boundary
  • partitions the space into several sets, one for each class:
    If y(x) ≥ 0.5, we predict class 1.
    If y(x) < 0.5, we predict class 2.
  • can be found by setting y(x) = 1 / (1 + e^(−(w_0 + w_1 x))) = 0.5
    ⇒ w_0 + w_1 x = 0 (in 1-D, this is simply a threshold)
Decision Boundary
• In 2-D (two features): w^T x = w_0 + w_1 x_1 + w_2 x_2 = 0, a line in the (x_1, x_2) plane which separates the space into two parts.
• In 3-D (three features): w^T x = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 = 0, a plane. (Courtesy of Dr. Sanja Fidler)
  [Figures: a separating line in 2-D and a separating plane in 3-D]
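A short sketch of how such a 2-D boundary splits the plane; the weight vector is made up for illustration.

    import numpy as np

    # Assumed weights: boundary x1 + x2 - 1 = 0, i.e. w = [w0, w1, w2] = [-1, 1, 1].
    w = np.array([-1.0, 1.0, 1.0])

    def side(x1, x2):
        # The sign of w^T x tells which half-plane the point falls in.
        score = w @ np.array([1.0, x1, x2])
        return "class 1" if score >= 0 else "class 2"

    print(side(0.9, 0.8))  # above the line -> class 1
    print(side(0.1, 0.2))  # below the line -> class 2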
Decision Boundary
• More on decision boundary: with nonlinear feature terms the boundary itself becomes nonlinear, e.g.
  w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2 = 0
  or, with higher-order terms,
  w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1^2 x_2^3 + w_5 x_2^5 + ⋯ = 0
  [Figure: curved boundaries separating Class 1 from Class 2]
Logistic Regression
• For multiple variables: each feature of the example (CRIM, ZN, CHAS, INDUS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, LSTAT, Price, …) gets its own weight w_1, …, w_13, and the weighted sum is passed through the sigmoid function σ(z) = 1 / (1 + e^(−z)) to give the probability p of buying it.
  [Diagram: the housing features feeding a single sigmoid unit]
Linear Regression vs. Logistic Regression
• Training datasets
  Linear regression: D = {(x^(1), t^(1)), …, (x^(i), t^(i)), …, (x^(N), t^(N))} (target t^(i): real value)
  Logistic regression: D = {(x^(1), t^(1)), …, (x^(i), t^(i)), …, (x^(N), t^(N))} (target t^(i): 0 or 1)
• Linear model
  Linear regression: y(x) = w_0 + w_1 x_1 + w_2 x_2 + ⋯ + w_M x_M
  Logistic regression: y(x) = σ(w_0 + w_1 x_1 + w_2 x_2 + ⋯ + w_M x_M), where σ(z) = 1 / (1 + e^(−z))
• Parameters (both): w_0, w_1, w_2, …, w_M
• Loss function
  Linear regression: ℓ(w) = (1/2N) Σ_{i=1}^{N} [t^(i) − y(x^(i))]^2
  Logistic regression: the same squared loss could be tried here, but we will see below that it is a bad choice.
• Goal (both): minimize ℓ(w)
• Steps (gradient descent):
  • Initialize w (e.g., randomly)
  • Repeatedly update w based on the gradient: w = w − ε ∇_w ℓ(w), where ε is the learning rate.
Loss Function in Logistic Regression
• Bad loss function: the squared error
  ℓ(w) = (1/2N) Σ_{i=1}^{N} [t^(i) − y(x^(i))]^2
  [Figure: the sigmoid fit with red vertical lines marking the error t^(i) − y(x^(i)) for each point; Class 1, Class 2]
  Because of the sigmoid function, the length of each red vertical line is at most 1: the penalty for a badly misclassified example is too small.
Loss Function in Logistic Regression
• A better per-example error:
  If the target t^(i) is 1, the error is set to −log y(x^(i)).
  If the target t^(i) is 0, the error is set to −log(1 − y(x^(i))).
  [Figure: the error −log y(x^(i)) grows without bound as y(x^(i)) → 0, and −log(1 − y(x^(i))) grows without bound as y(x^(i)) → 1]
Loss Function in Logistic Regression
• The error on each example is
  −log y(x^(i))        if t^(i) = 1
  −log(1 − y(x^(i)))   if t^(i) = 0
  which can be written in a single expression as
  −t^(i) log y(x^(i)) − (1 − t^(i)) log(1 − y(x^(i)))
• Averaging over the training set gives the loss
  ℓ(w) = −(1/N) Σ_{i=1}^{N} [ t^(i) log y(x^(i)) + (1 − t^(i)) log(1 − y(x^(i))) ]
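A direct NumPy translation of this loss (a sketch; the small clipping constant is an implementation detail added to avoid log(0)):

    import numpy as np

    def cross_entropy(t, y, eps=1e-12):
        # t: targets in {0, 1}; y: predicted probabilities y(x^(i)) in (0, 1).
        y = np.clip(y, eps, 1.0 - eps)  # avoid log(0)
        return -np.mean(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

    t = np.array([1, 0, 1, 0])
    y = np.array([0.9, 0.2, 0.7, 0.4])
    print(cross_entropy(t, y))  # average of -log 0.9, -log 0.8, -log 0.7, -log 0.6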
Linear Regression vs. Logistic Regression
• Training datasets, linear model, and parameters: as above.
• Loss function
  Linear regression: ℓ(w) = (1/2N) Σ_{i=1}^{N} [t^(i) − y(x^(i))]^2
  Logistic regression: ℓ(w) = −(1/N) Σ_{i=1}^{N} [ t^(i) log y(x^(i)) + (1 − t^(i)) log(1 − y(x^(i))) ], the Cross Entropy (in statistics)
• Goal (both): minimize ℓ(w)
• Steps (both, gradient descent):
  • Initialize w (e.g., randomly)
  • Repeatedly update w based on the gradient: w = w − ε ∇_w ℓ(w), where ε is the learning rate.
  Here y(x) = w^T x for linear regression and y(x) = σ(w^T x) = 1 / (1 + e^(−w^T x)) for logistic regression.
Linear Regression vs. Logistic Regression: gradients
• Linear regression: ℓ(w) = (1/2N) Σ_{i=1}^{N} [t^(i) − y(x^(i))]^2, with y(x^(i)) = w^T x^(i)
• Logistic regression: ℓ(w) = −(1/N) Σ_{i=1}^{N} [ t^(i) log y(x^(i)) + (1 − t^(i)) log(1 − y(x^(i))) ], with y(x^(i)) = σ(w^T x^(i)) = 1 / (1 + e^(−w^T x^(i)))
• Goal: minimize ℓ(w) by gradient descent: initialize w (e.g., randomly), then repeatedly update w = w − ε ∇_w ℓ(w), where ε is the learning rate.
• In both cases the gradient ∇_w ℓ(w) has the same form; only the definition of y(x^(i)) differs:
  ∂ℓ(w)/∂w_0 = −(1/N) Σ_{i=1}^{N} [t^(i) − y(x^(i))]
  ∂ℓ(w)/∂w_1 = −(1/N) Σ_{i=1}^{N} [t^(i) − y(x^(i))] x_1^(i)
  ∂ℓ(w)/∂w_2 = −(1/N) Σ_{i=1}^{N} [t^(i) − y(x^(i))] x_2^(i)
  ⋮
  ∂ℓ(w)/∂w_M = −(1/N) Σ_{i=1}^{N} [t^(i) − y(x^(i))] x_M^(i)
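A compact sketch that puts the pieces together: the model y = σ(w^T x), the cross-entropy loss, and the update w = w − ε ∇_w ℓ(w) built from the gradient formulas above. The toy data, the learning rate, and the number of steps are made up for illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Toy binary data: X has a leading column of ones for the bias w0.
    X = np.array([[1, 0.5], [1, 1.0], [1, 1.5], [1, 3.0], [1, 3.5], [1, 4.0]], dtype=float)
    t = np.array([0, 0, 0, 1, 1, 1], dtype=float)

    N, M = X.shape
    w = np.zeros(M)          # initialize w (here: zeros; could be random)
    eps = 0.5                # learning rate
    for _ in range(2000):
        y = sigmoid(X @ w)              # y(x^(i)) = sigma(w^T x^(i))
        grad = -(X.T @ (t - y)) / N     # gradient of the cross-entropy loss
        w = w - eps * grad              # gradient descent update

    print(w, sigmoid(X @ w).round(2))   # probabilities close to the 0/1 targets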
Cross Entropy vs. Mean Squared Error
[Figure: comparison of cross entropy and mean squared error]
Reference: Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." AISTATS, Vol. 9, 2010.
Cross Entropy vs. Mean Squared Error
• With the mean squared error ℓ(w) = (1/2N) Σ_{i=1}^{N} [t^(i) − y(x^(i))]^2, a bad classifier can still get a small error.
  [Figure: a poor decision boundary whose squared error is nevertheless small]
Cross Entropy vs. Mean Squared Error
• With the cross entropy ℓ(w) = −(1/N) Σ_{i=1}^{N} [ t^(i) log y(x^(i)) + (1 − t^(i)) log(1 − y(x^(i))) ], a bad classifier gets a large error while a good classifier gets a small error, even for points at a large distance to the decision boundary.
  [Figure: the same data with a bad boundary (large error) and a good boundary (small error)]
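A quick numeric illustration of this point (values invented): a confidently wrong prediction receives only a small, bounded squared-error penalty but a very large cross-entropy penalty.

    import numpy as np

    t, y = 1.0, 0.01   # true class 1, but the model says only 1% probability

    squared_error = 0.5 * (t - y) ** 2                            # ~0.49: bounded, "small error"
    cross_entropy = -(t * np.log(y) + (1 - t) * np.log(1 - y))    # ~4.6: large penalty
    print(squared_error, cross_entropy)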
Multi-Class Classification
• One vs. All: train one binary model per class.
  Model 1: probability of being in class 1
  Model 2: probability of being in class 2
  Model 3: probability of being in class 3
  [Figure: three 2-D classes; each model separates one class from the rest]
• For new data, run all models and choose the class with the largest probability (see the sketch below).
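A sketch of the one-vs-all rule, assuming three already-trained binary logistic models; the weight matrix below is an invented placeholder.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Assumed weights of three binary models, one per class; x includes a bias 1.
    W = np.array([[ 0.5, -1.0,  2.0],   # model 1: class 1 vs. rest
                  [-0.2,  1.5, -0.5],   # model 2: class 2 vs. rest
                  [ 0.1, -0.5, -1.0]])  # model 3: class 3 vs. rest

    def predict(x):
        probs = sigmoid(W @ x)          # probability of each class vs. the rest
        return np.argmax(probs) + 1     # choose the class with the largest probability

    x_new = np.array([1.0, 0.3, 0.8])   # [bias, x1, x2]
    print(predict(x_new))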


Multi-Class Classification
• We can also consider multiple classes at the same time (Courtesy of Dr. Hung-yi Lee).
  Each class gets a score: C1: w^(1)T x, C2: w^(2)T x, C3: w^(3)T x, and the scores are exponentiated and normalized:
  y_1 = e^{w^(1)T x} / (e^{w^(1)T x} + e^{w^(2)T x} + e^{w^(3)T x})
  y_2 = e^{w^(2)T x} / (e^{w^(1)T x} + e^{w^(2)T x} + e^{w^(3)T x})
  y_3 = e^{w^(3)T x} / (e^{w^(1)T x} + e^{w^(2)T x} + e^{w^(3)T x})
  Argmax over (y_1, y_2, y_3) gives the predicted class, compared against the targets (ŷ_1, ŷ_2, ŷ_3).
• In optimization, the loss function is −Σ_{i=1}^{3} ŷ_i ln y_i
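A sketch of the softmax computation and of the loss −Σ ŷ_i ln y_i for the three-class case; the weight vectors and the input are made up.

    import numpy as np

    def softmax(z):
        z = z - np.max(z)               # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    # Assumed weight vectors w^(1), w^(2), w^(3); x includes a bias term.
    W = np.array([[ 1.0,  2.0, -1.0],
                  [ 0.5, -1.0,  1.0],
                  [-0.5,  0.5,  0.5]])
    x = np.array([1.0, 0.2, 0.4])

    y = softmax(W @ x)                  # (y1, y2, y3), sums to 1
    y_hat = np.array([1.0, 0.0, 0.0])   # one-hot target: x belongs to C1

    loss = -np.sum(y_hat * np.log(y))   # cross entropy: -sum_i y_hat_i * ln(y_i)
    pred = np.argmax(y) + 1             # Argmax picks the predicted class
    print(y, loss, pred)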


Multi-Class Classification
• We can also consider multiple classes at the same time, with one-hot encoding of the target (Courtesy of Dr. Hung-yi Lee):
  If x is in C1, the target is (ŷ_1, ŷ_2, ŷ_3) = (1, 0, 0)
  If x is in C2, the target is (ŷ_1, ŷ_2, ŷ_3) = (0, 1, 0)
  If x is in C3, the target is (ŷ_1, ŷ_2, ŷ_3) = (0, 0, 1)
  (the outputs y_1, y_2, y_3 are computed as on the previous slide)
• Why not use 0, 1, 2? A single integer label would impose an ordering and unequal distances between classes; one-hot encoding treats all classes symmetrically.
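A two-line sketch of one-hot encoding integer class labels (the labels are illustrative):

    import numpy as np

    labels = np.array([0, 2, 1, 0])      # integer class ids for 3 classes
    one_hot = np.eye(3)[labels]          # each row becomes a one-hot vector
    print(one_hot)
    # [[1. 0. 0.]
    #  [0. 0. 1.]
    #  [0. 1. 0.]
    #  [1. 0. 0.]]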
Multi-Class Classification
• Consider multiple classes at the same time: the scores w^(1)T x, w^(2)T x, w^(3)T x are passed through the Softmax function to give y_1, y_2, y_3, and Argmax picks the predicted class (ŷ_1, ŷ_2, ŷ_3). (Courtesy of Dr. Hung-yi Lee)
  [Diagram: x → three linear units → Softmax → (y_1, y_2, y_3) → Argmax → (ŷ_1, ŷ_2, ŷ_3)]
• The softmax function is often used in the final layer of neural networks applied to classification problems.
Regularized Logistic Regression
• As with linear regression, the learned decision boundary can underfit (e.g., predicting y ≥ 0.5 everywhere or y < 0.5 everywhere) or overfit. (Courtesy of Dr. Hung-yi Lee)
  [Figures: an underfitting boundary and an overfitting boundary]
Limitation of Logistic Regression
• Consider this data (Courtesy of Dr. Hung-yi Lee):

  Data            Target
  x_1   x_2       t
  0     0         1
  0     1         0
  1     0         0
  1     1         1

• A single logistic regression unit y = σ(w_0 + w_1 x_1 + w_2 x_2) predicts C1 if y ≥ 0.5 and C2 if y < 0.5, so its decision boundary is the line w_0 + w_1 x_1 + w_2 x_2 = 0.
• We want y ≥ 0.5 at (0, 0) and (1, 1) but y < 0.5 at (0, 1) and (1, 0); no single line achieves this, since any line gives all y ≥ 0.5 on one side and all y < 0.5 on the other.
• We can't separate them well using a simple logistic regression (the data are not linearly separable).
Limitation of Logistic Regression
• Feature Transformation (Courtesy of Dr. Hung-yi Lee): map (x_1, x_2) to new features (x_1′, x_2′), for example
  x_1′: distance to (0, 0)
  x_2′: distance to (1, 1)
  Then both class-2 points (0, 1) and (1, 0) map to (1, 1), while the class-1 points (0, 0) and (1, 1) map onto the axes, away from (1, 1).
  [Figure: the original points (not linearly separable) and the transformed points (linearly separable)]
• However, it is not easy to find a good transformation by hand.
  https://www.slideshare.net/cacois/machine-learningfor-moderndevelopers-35608763
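A sketch of such a transformation on the four points of the table, assuming x_1′ and x_2′ are Euclidean distances to (0, 0) and (1, 1) respectively; the separating line in the transformed space is hand-picked for illustration.

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    t = np.array([1, 0, 0, 1])                      # targets from the table above

    def transform(p):
        # Assumed transformation: x1' = distance to (0,0), x2' = distance to (1,1).
        return np.array([np.linalg.norm(p - np.array([0.0, 0.0])),
                         np.linalg.norm(p - np.array([1.0, 1.0]))])

    Xp = np.array([transform(p) for p in X])
    # Class-1 points map to (0, 1.41) and (1.41, 0); both class-0 points map to (1, 1),
    # so a single line in the (x1', x2') plane now separates the classes, e.g.:
    pred = (Xp[:, 0] + Xp[:, 1] < 1.7).astype(int)  # hand-picked linear boundary
    print(Xp.round(2), pred, (pred == t).all())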
Limitation of Logistic Regression
• Cascading logistic regression models: let a first layer of logistic units learn the feature transformation, and a final unit do the classification (Courtesy of Dr. Hung-yi Lee):
  x_1′ = σ(w_0^(1) + w_1^(1) x_1 + w_2^(1) x_2)
  x_2′ = σ(w_0^(2) + w_1^(2) x_1 + w_2^(2) x_2)
  y    = σ(w_0^(3) + w_1^(3) x_1′ + w_2^(3) x_2′)
  [Diagram: the inputs (1, x_1, x_2) feed two logistic units producing x_1′ and x_2′, which in turn feed a third logistic unit producing y]
• The first two units try to make the data linearly separable: they perform the feature transformation.
  https://www.slideshare.net/cacois/machine-learningfor-moderndevelopers-35608763
• Worked example (Courtesy of Dr. Hung-yi Lee): the first layer maps the four input points to transformed features with values such as x_1′ ∈ {0.10, 0.30, 0.65} and x_2′ ∈ {0.10, 0.30, 0.65}, and the output unit y = σ(w_0^(3) + w_1^(3) x_1′ + w_2^(3) x_2′) then gives clearly different values for the two classes (around 0.70 versus 0.20).
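A sketch of the cascade with hand-picked illustrative weights (not the ones learned in the slides) that reproduces the targets of the table above: two logistic units build the features x_1′, x_2′ and a third unit classifies them.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hand-picked weights (illustrative): each vector is [w0, w1, w2] for one logistic unit.
    w1 = np.array([-10.0, 20.0, 20.0])   # x1' ~ OR(x1, x2)
    w2 = np.array([-30.0, 20.0, 20.0])   # x2' ~ AND(x1, x2)
    w3 = np.array([ 10.0, -20.0, 40.0])  # y fires when x1' is off or x2' is on: (0,0) and (1,1)

    def cascade(x1, x2):
        x1p = sigmoid(w1 @ np.array([1.0, x1, x2]))     # first-layer unit
        x2p = sigmoid(w2 @ np.array([1.0, x1, x2]))     # second first-layer unit
        return sigmoid(w3 @ np.array([1.0, x1p, x2p]))  # output unit on (x1', x2')

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print((x1, x2), round(cascade(x1, x2), 3))      # ~1, ~0, ~0, ~1: matches the targets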
Limitation of Logistic Regression
• Worked example, continued (Courtesy of Dr. Hung-yi Lee): for a point whose transformed features are, say, x_1′ = 0.6 and x_2′ = 0.8, the output unit computes σ(w_0^(3) + w_1^(3) × 0.6 + w_2^(3) × 0.8); points of the two classes end up with clearly different outputs (e.g., y = 0.6 versus y = 0.2).
  [Figure: the transformed points and the resulting outputs]
Limitation of Logistic Regression
• Cascading logistic regression models (repeated from above): two logistic units produce x_1′ and x_2′ from (x_1, x_2), and a third unit produces y = σ(w_0^(3) + w_1^(3) x_1′ + w_2^(3) x_2′). (Courtesy of Dr. Hung-yi Lee)
  https://www.slideshare.net/cacois/machine-learningfor-moderndevelopers-35608763
Limitation of Logistic Regression
• Cascading logistic regression models: each unit computes a weighted sum w^T x of its inputs (followed by the sigmoid); such a unit is called a "Neuron", and a cascade of many of them is a Neural Network. (Courtesy of Dr. Hung-yi Lee)
  [Diagram: the inputs x_1, x_2 feed several units, whose outputs feed further units]
  https://www.slideshare.net/cacois/machine-learningfor-moderndevelopers-35608763
  https://www.tutorialspoint.com/artificial_intelligence/artificial_intelligence_neural_networks.htm
Metrics to Evaluate Classification
• Accuracy
• The ratio of correct predictions to total
predictions made.
Metrics to Evaluate Classification
• Accuracy
• The ratio of correct predictions to total
predictions made.

• Cases
• True Positive: Correctly identified as
relevant
• True Negative: Correctly identified as
not relevant
• False Positive: Incorrectly labeled as
relevant
• False Negative: Incorrectly labeled as
not relevant
Metrics to Evaluate Classification
• Accuracy
  • The ratio of correct predictions to total predictions made.
• Cases, arranged as a confusion matrix:

                Predicted 1        Predicted 0
  Actual 1      True Positive      False Negative (a miss: a Positive predicted as Negative)
  Actual 0      False Positive     True Negative
                (a false alarm: a Negative predicted as Positive)
Metrics to Evaluate Classification
• Accuracy
  • The ratio of correct predictions to total predictions made:
    Accuracy = (# of True Positives + # of True Negatives) / (# of samples)
  References: http://machinelearningmastery.com/confusion-matrix-machine-learning/
Metrics to Evaluate Classification
• Accuracy = (# of True Positives + # of True Negatives) / (# of samples)
  However, using classification accuracy alone can be misleading
  • when you have an unequal number of observations in each class, or
  • when you have more than two classes in your dataset.
  References: http://machinelearningmastery.com/confusion-matrix-machine-learning/
Metrics to Evaluate Classification
• Suppose we have a classifier that identifies whether the object in an image is a cat or not:
  If y(x) ≥ 0.5, we predict 1 (cat); if y(x) < 0.5, we predict 0 (not cat).
  Its accuracy is 99% on the test set (1000 images).
• Is this classifier good or not?
• What do you think if the 1000 images contain only 5 cat images? Then even the trivial classifier

  def classifier(img):
      return 0

  achieves 99.5% accuracy. The problem comes from an unequal number of observations in each class.
Metrics to Evaluate Classification
• Precision, Recall, and Accuracy (see the sketch below)
• Precision (hurt by false alarms)
  • Percentage of positive labels that are correct
  • Precision = (# true positives) / (# true positives + # false positives)
• Recall (hurt by misses)
  • Percentage of positive examples that are correctly labeled
  • Recall = (# true positives) / (# true positives + # false negatives)
• Accuracy
  • Percentage of correct labels
  • Accuracy = (# true positives + # true negatives) / (# of samples)
  Reference: Nvidia - Deep Learning Teaching Kit
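A small helper that computes these three metrics from confusion-matrix counts (a sketch; the example counts are invented):

    def classification_metrics(tp, fp, fn, tn):
        # tp, fp, fn, tn: counts of true/false positives/negatives.
        precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")  # undefined with no positive predictions
        recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        return precision, recall, accuracy

    print(classification_metrics(tp=40, fp=10, fn=20, tn=930))
    # (0.8, 0.666..., 0.97): good precision, weaker recall, high accuracy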


Metrics to Evaluate Classification
• Precision (hurt by false alarms)
  • Percentage of positive labels that are correct:
    Precision = (# of True Positives) / (# of True Positives + # of False Positives)
• Recall (hurt by misses)
  • Percentage of positive examples that are correctly labeled:
    Recall = (# of True Positives) / (# of True Positives + # of False Negatives)
  [Confusion matrix as above: Actual 1/0 versus Predicted 1/0 with True Positive, False Negative, False Positive, True Negative]
  References: http://machinelearningmastery.com/confusion-matrix-machine-learning/
Metrics to Evaluate Classification
• Suppose we have a classifier that identifies whether the object in an image is a cat or not:
  If y(x) ≥ 0.5, we predict 1 (cat); if y(x) < 0.5, we predict 0 (not cat).
• The test set is 1000 images containing only 5 cat images, and the classifier is

  def classifier(img):
      return 0

• Confusion Matrix:

                Predicted 1    Predicted 0
  Actual 1      0              5
  Actual 0      0              995

  Accuracy = 99.5%, Precision = ? (0/0), Recall = 0%
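Plugging this confusion matrix into the metric definitions confirms the numbers above:

    tp, fn = 0, 5      # actual cats:     0 predicted as cat, 5 missed
    fp, tn = 0, 995    # actual non-cats: 0 false alarms, 995 correct

    accuracy = (tp + tn) / (tp + fp + fn + tn)              # 0.995 -> 99.5%
    recall = tp / (tp + fn)                                 # 0.0   -> misses every cat
    precision = tp / (tp + fp) if (tp + fp) > 0 else None   # 0/0   -> undefined
    print(accuracy, recall, precision)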
Metrics to Evaluate Classification
• Suppose instead we predict 1 (cat) if y(x) ≥ threshold and 0 (not cat) if y(x) < threshold.
  A large threshold tends to give higher precision but lower recall; a small threshold tends to give higher recall but lower precision.
  [Figure: precision versus recall as the threshold varies]
  http://cb.csail.mit.edu/cb/struct2net/webserver/images/prec-v-recall-v2.png
Metrics to Evaluate Classification
• The F1 score is a measure of a classifier's performance:
  F1 = 2 × (Precision × Recall) / (Precision + Recall)
• Measure precision and recall on validation sets and select the model that gives the maximum F1 score.
  https://en.wikipedia.org/wiki/F1_score
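A one-function sketch of F1-based model selection over precision/recall pairs measured on a validation set; the candidate values are invented.

    def f1(precision, recall):
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # Assumed validation results (precision, recall) for three candidate models.
    candidates = {"model A": (0.90, 0.40), "model B": (0.75, 0.70), "model C": (0.55, 0.85)}
    best = max(candidates, key=lambda name: f1(*candidates[name]))
    print(best, f1(*candidates[best]))   # picks the model with the largest F1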
Readings
• Logistic Regression
  • Pages 203-207 in the book "Pattern Recognition and Machine Learning", by Christopher M. Bishop, Springer, 2006.
• Multi-classes
  • Sections 4.1.2 and 4.3.4 in the book "Pattern Recognition and Machine Learning", by Christopher M. Bishop, Springer, 2006.
• On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes.
  • http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf