Lecture 12: The Classification Problem (Logistic Regression), v2
References: https://courses.cs.washington.edu/courses/cse546/12wi/slides/cse546wi12intro.pdf
Applications of Machine Learning
• Spam Filtering
• Spam or Not Spam
References:
• Paper: Konstantin Tretyakov, Machine Learning Techniques in Spam Filtering, 2004.
• http://www.techlicious.com/blog/gmail-spam-filter-artificial-neural-network/
• https://upxacademy.com/spam-alert-machine-learning-is-filtering-your-emails/
Applications of Machine Learning
• Object Detection and Recognition
References:
• https://devblogs.nvidia.com/parallelforall/deep-learning-for-computer-vision-with-matlab-and-cudnn/
• https://www.cnet.com/how-to/how-to-disable-facial-recognition-in-facebook/
Linear Models for Classification
• Classification
• Determine which discrete category an example belongs to.
(Figure: data points from two categories, Class 1 and Class 2.)
A Simple 1-D Classification
• Classification as Regression
• Given a set of input-output pairs for binary classification
𝒟 = {(x^(1), t^(1)), …, (x^(i), t^(i)), …, (x^(N), t^(N))}, where t = 1 or −1.
(Figure: 1-D inputs x with targets t = 1 (Class 1) and t = −1 (Class 2).)
A Simple 1-D Classification
• Classification as Regression
• Fit a linear model f(x) = w^T x; let f(x) denote w0 + w1 x (linear regression).
(Figure: the fitted line f(x) crossing the 1-D data; Class 1 at t = 1, Class 2 at t = −1.)
If f(x) ≥ 0, we predict class 1.
If f(x) < 0, we predict class 2. (A small numpy sketch follows.)
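A minimal numpy sketch of this classification-as-regression recipe (the toy data points are hypothetical, not from the slides): fit f(x) = w0 + w1 x by least squares on targets t in {+1, −1}, then predict by the sign of f(x).

import numpy as np

# Toy 1-D data: class 1 near x = 2, class 2 near x = -2 (hypothetical values).
x = np.array([1.5, 2.0, 2.5, -1.5, -2.0, -2.5])
t = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])   # +1 = class 1, -1 = class 2

# Design matrix [1, x] so that f(x) = w0 + w1 * x.
X = np.column_stack([np.ones_like(x), x])

# Least-squares fit of w = (w0, w1), i.e. ordinary linear regression.
w, *_ = np.linalg.lstsq(X, t, rcond=None)

f = X @ w
pred = np.where(f >= 0, 1, -1)   # f(x) >= 0 -> class 1, else class 2
print(w, pred)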
A Simple 1-D Classification
• Classification as Regression
• In math: y(x) = sgn(f(x)),
where sgn(z) = 1 for z > 0, 0 for z = 0, −1 for z < 0.
(Figure: the step-shaped output y(x) over the 1-D data; Class 1 at t = 1, Class 2 at t = −1.)
A Simple 1-D Classification
• Buy or not buy it?
(Diagram: the inputs 1 and "size in square feet" are weighted by w0 and w1 and summed into a price score, which is passed through sgn; output 1: buy, −1: not buy.)
In math: y(x) = sgn(f(x)), where sgn(z) = 1 for z > 0, 0 for z = 0, −1 for z < 0.
Logistic Regression
• An alternative: replace the sgn function with the sigmoid or logistic function
𝒟 = {(x^(1), t^(1)), …, (x^(i), t^(i)), …, (x^(N), t^(N))}, where t = 1 or −1.
y(x) = σ(f(x)), where σ(z) = 1 / (1 + e^(−z)).
(Figure: the sigmoid curve y(x) rising from 0 to 1 through 0.5 over the Class 1 and Class 2 data, with regions reading "very unlikely", "unlikely", "likely", "very likely".)
Logistic Regression (Note: starting from now, we use labels {1, 0}, just for convenience!)
• An alternative: replace the sgn function with the sigmoid or logistic function
𝒟 = {(x^(1), t^(1)), …, (x^(i), t^(i)), …, (x^(N), t^(N))}, where t = 1 or 0.
y(x) = σ(f(x)), where σ(z) = 1 / (1 + e^(−z)).
Logistic Regression
• An alternative: replace the sgn function with the sigmoid or logistic function
y(x) = 1 / (1 + e^(−w^T x))
If y(x) ≥ 0.5, we predict class 1.
If y(x) < 0.5, we predict class 2.
• Decision boundary: it can be found at
y(x) = 1 / (1 + e^(−(w0 + w1 x))) = 0.5 ⇒ w0 + w1 x = 0
(In 1-D, this is simply a threshold; a small sketch follows.)
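A minimal sketch of this rule with hypothetical weights w0, w1 (any values would do): compute y(x) = σ(w0 + w1 x), threshold at 0.5, and note that the boundary sits exactly where w0 + w1 x = 0.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w0, w1 = -2.0, 1.0                  # hypothetical parameters
x = np.array([0.0, 1.0, 2.0, 3.0])

y = sigmoid(w0 + w1 * x)            # y(x) = 1 / (1 + e^(-(w0 + w1*x)))
pred = np.where(y >= 0.5, 1, 2)     # class 1 if y >= 0.5, else class 2

# y(x) = 0.5 exactly where w0 + w1*x = 0: the 1-D decision boundary (threshold).
print(y, pred, -w0 / w1)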
Decision Boundary
• In 2-D (two features): w^T x = w0 + w1 x1 + w2 x2 = 0,
which separates the space into two parts.
• In 3-D (three features): w^T x = w0 + w1 x1 + w2 x2 + w3 x3 = 0.
Courtesy of Dr. Sanja Fidler
Decision Boundary
• More on decision boundary: with quadratic features, the boundary
w0 + w1 x1 + w2 x2 + w3 x1^2 + w4 x1 x2 + w5 x2^2 = 0
can be a curve. (Figure: a closed curve separating Class 1 from Class 2.)
Decision Boundary
• More on decision boundary: with higher-order features,
w0 + w1 x1 + w2 x2 + w3 x1^2 + w4 x1^2 x2^3 + w5 x2^5 + ⋯ = 0,
the boundary can take even more complex shapes (a small sketch follows).
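A small sketch of how such a nonlinear boundary is used (the weights are hypothetical): build the polynomial features for a point and check the sign of the weighted sum, exactly as with a linear boundary.

import numpy as np

# Hypothetical weights giving the boundary x1^2 + x2^2 = 1 (a circle).
w = np.array([-1.0, 0.0, 0.0, 1.0, 0.0, 1.0])

def boundary_value(x1, x2):
    feats = np.array([1.0, x1, x2, x1**2, x1 * x2, x2**2])
    return w @ feats     # w0 + w1*x1 + w2*x2 + w3*x1^2 + w4*x1*x2 + w5*x2^2

# Points inside the circle get a negative value, points outside a positive one.
print(boundary_value(0.0, 0.0), boundary_value(2.0, 0.0))   # -1.0 and 3.0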
Logistic Regression
• For multiple variables (a scikit-learn sketch follows)
(Diagram: the housing features CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, LSTAT, … are weighted by w1, …, w13 and summed into a price score, which is passed through the sigmoid function σ(z) = 1 / (1 + e^(−z)) to give the probability p of buying it.)
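A minimal scikit-learn sketch of multi-variable logistic regression (the data here are random stand-ins, not the housing features named in the diagram):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))              # 13 features, as in the diagram
t = (X[:, 0] + X[:, 1] > 0).astype(int)     # hypothetical 0/1 targets

model = LogisticRegression().fit(X, t)      # learns w0 (intercept) and w1..w13
p = model.predict_proba(X[:2])[:, 1]        # sigmoid output: probability of "buy"
print(p)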
Linear Regression vs. Logistic Regression

Linear Regression:
• Training dataset: 𝒟 = {(x^(1), t^(1)), …, (x^(i), t^(i)), …, (x^(N), t^(N))} (target t^(i): real value)
• Linear model: y(x) = w0 + w1 x1 + w2 x2 + ⋯ + wM xM
• Parameters: w0, w1, w2, …, wM
• Loss function: ℓ(w) = (1/2N) Σ_{i=1..N} [t^(i) − y(x^(i))]^2
• Goal: minimize ℓ(w)

Logistic Regression:
• Training dataset: 𝒟 = {(x^(1), t^(1)), …, (x^(i), t^(i)), …, (x^(N), t^(N))} (target t^(i): 0 or 1)
• Model: y(x) = σ(w0 + w1 x1 + w2 x2 + ⋯ + wM xM), where σ(z) = 1 / (1 + e^(−z))
• Parameters: w0, w1, w2, …, wM
• Loss function: to be chosen (see below)
• Goal: minimize ℓ(w)

Steps (gradient descent):
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient: w = w − ε ∇w ℓ(w), where ε is the learning rate.
Linear Regression vs. Logistic Regression
• A first attempt: reuse the squared loss for logistic regression,
ℓ(w) = (1/2N) Σ_{i=1..N} [t^(i) − y(x^(i))]^2, with y(x) = σ(w^T x).
The next slides show why this is a bad choice.
Loss Function in Logistic Regression
• Bad loss function: the squared loss
ℓ(w) = (1/2N) Σ_{i=1..N} [t^(i) − y(x^(i))]^2
(with y(x) = σ(w^T x) this loss is non-convex and gives weak gradients for confident mistakes).
(Figure: sigmoid output over the 1-D data, Class 1 and Class 2.)
• A better choice: if target t^(i) is 1, the error is set to −log y(x^(i)).
(Plot: the error −log y(x^(i)) over y(x^(i)) ∈ (0, 1): zero at y = 1, growing without bound as y → 0.)
• If target t^(i) is 0, the error is set to −log(1 − y(x^(i))).
(Plot: the error −log(1 − y(x^(i))) over y(x^(i)) ∈ (0, 1): zero at y = 0, growing without bound as y → 1.)
• Combining both cases, the per-example error is
−t^(i) log y(x^(i)) − (1 − t^(i)) log(1 − y(x^(i)))
and averaging over the dataset gives the loss (a small numpy sketch follows):
ℓ(w) = −(1/N) Σ_{i=1..N} [ t^(i) log y(x^(i)) + (1 − t^(i)) log(1 − y(x^(i))) ]
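A minimal numpy sketch of this loss (the clipping epsilon is a numerical-safety addition, not part of the slide's formula):

import numpy as np

def cross_entropy(t, y, eps=1e-12):
    # l(w) = -(1/N) sum_i [ t_i log y_i + (1 - t_i) log(1 - y_i) ]
    y = np.clip(y, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

t = np.array([1.0, 0.0, 1.0])        # targets
y = np.array([0.9, 0.2, 0.6])        # hypothetical model outputs y(x_i)
print(cross_entropy(t, y))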
Linear Regression vs. Logistic Regression (with the loss filled in)

Linear Regression:
• Model: y(x) = w0 + w1 x1 + ⋯ + wM xM (target t^(i): real value)
• Loss: ℓ(w) = (1/2N) Σ_{i=1..N} [t^(i) − y(x^(i))]^2

Logistic Regression:
• Model: y(x) = σ(w0 + w1 x1 + ⋯ + wM xM), σ(z) = 1 / (1 + e^(−z)) (target t^(i): 0 or 1)
• Loss: ℓ(w) = −(1/N) Σ_{i=1..N} [ t^(i) log y(x^(i)) + (1 − t^(i)) log(1 − y(x^(i))) ]
(this loss is the Cross Entropy in statistics)

Both:
• Goal: minimize ℓ(w)
• Steps: initialize w (e.g., randomly); repeatedly update w = w − ε ∇w ℓ(w), where ε is the learning rate.
Linear Regression vs. Logistic Regression: the Gradients

For both models, the partial derivatives have exactly the same form:
∂ℓ(w)/∂w0 = −(1/N) Σ_{i=1..N} [t^(i) − y(x^(i))]
∂ℓ(w)/∂wj = −(1/N) Σ_{i=1..N} [t^(i) − y(x^(i))] xj^(i),  j = 1, …, M
and ∇w ℓ(w) stacks these M + 1 partial derivatives.

The only difference is the model output (a gradient-descent sketch follows):
• Linear regression (squared loss): y(x^(i)) = w^T x^(i)
• Logistic regression (cross entropy): y(x^(i)) = σ(w^T x^(i)) = 1 / (1 + e^(−w^T x^(i)))
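A minimal gradient-descent sketch for logistic regression using exactly these formulas (data, learning rate, and iteration count are hypothetical):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])   # [1, x1, x2]
t = (X[:, 1] + X[:, 2] > 0).astype(float)                        # 0/1 targets

w = np.zeros(3)
eps = 0.5                                 # learning rate
for _ in range(1000):
    y = sigmoid(X @ w)                    # y(x_i) = sigma(w^T x_i)
    grad = -(X.T @ (t - y)) / len(t)      # -(1/N) sum_i [t_i - y_i] * x_i
    w = w - eps * grad                    # w = w - eps * grad_w l(w)
print(w)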
Cross Entropy vs Mean Squared Error
(Figure, from the reference below: training curves comparing cross entropy and mean squared error.)
Reference: Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." AISTATS, Vol. 9, 2010.
Cross Entropy vs Mean Squared Error
• With the mean squared error
ℓ(w) = (1/2N) Σ_{i=1..N} [t^(i) − y(x^(i))]^2
a bad classifier can still get a small error ("small error, bad classifier").
• With the cross entropy
ℓ(w) = −(1/N) Σ_{i=1..N} [ t^(i) log y(x^(i)) + (1 − t^(i)) log(1 − y(x^(i))) ]
a bad classifier gets a large error and a good classifier a small one
("large error, bad classifier"; "small error, good classifier"). (A numeric check follows.)
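A small numeric check of this point (the outputs are hypothetical): a confidently wrong prediction barely registers under the squared error but blows up under cross entropy.

import numpy as np

t, y = 1.0, 0.001                                   # true class 1, model says ~0
mse = 0.5 * (t - y) ** 2                            # ~0.50: "small error, bad classifier"
ce = -(t * np.log(y) + (1 - t) * np.log(1 - y))     # ~6.91: "large error, bad classifier"
print(mse, ce)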
Multi-Class Classification
• One vs All
(Figure: data from three classes, class 1, class 2, and class 3, in a 2-D feature space.)
Multi-Class Classification
• One vs All
• Model 1: probability of being in class 1 (class 1 vs. the rest)
• Model 2: probability of being in class 2
• Model 3: probability of being in class 3
• Choose the class with the largest probability.
Courtesy of Dr. Hung-yi Lee
Multi-Class Classification
• We can also consider multiple classes at the same time:
C1: score w^(1)T x,  C2: score w^(2)T x,  C3: score w^(3)T x
y1 = e^(w^(1)T x) / (e^(w^(1)T x) + e^(w^(2)T x) + e^(w^(3)T x))
y2 = e^(w^(2)T x) / (e^(w^(1)T x) + e^(w^(2)T x) + e^(w^(3)T x))
y3 = e^(w^(3)T x) / (e^(w^(1)T x) + e^(w^(2)T x) + e^(w^(3)T x))
This normalization is the Softmax; the predicted class is the Argmax of (y1, y2, y3).
• Training targets are one-hot:
If x is in C1, the target is (ŷ1, ŷ2, ŷ3) = (1, 0, 0)
If x is in C2, the target is (0, 1, 0)
If x is in C3, the target is (0, 0, 1)
Courtesy of Dr. Hung-yi Lee
Multi-Class Classification
• Consider multiple classes at the same time (a numpy sketch follows)
(Diagram: x feeds three linear scores w^(1)T x, w^(2)T x, w^(3)T x, which pass through Softmax to give y1, y2, y3, followed by Argmax to give ŷ1, ŷ2, ŷ3.)
Courtesy of Dr. Hung-yi Lee
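A minimal numpy sketch of this softmax layer (the weight vectors are hypothetical):

import numpy as np

W = np.array([[ 1.0, -0.5],     # w^(1)
              [ 0.2,  0.8],     # w^(2)
              [-1.0,  0.3]])    # w^(3)
x = np.array([0.5, 1.0])

z = W @ x                       # the three scores w^(k)T x
y = np.exp(z - z.max())         # subtracting the max is a stability trick
y = y / y.sum()                 # y_k = e^(z_k) / sum_j e^(z_j)

print(y, y.argmax() + 1)        # class probabilities and the argmax class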
Regularized Logistic Regression
(Figure: decision boundaries illustrating underfitting vs. overfitting. Courtesy of Dr. Hung-yi Lee.)
Limitation of Logistic Regression
• Data and target:
x1  x2 | t
 0   0 | 1
 0   1 | 0
 1   0 | 0
 1   1 | 1
No single line can separate the t = 1 points from the t = 0 points.
Courtesy of Dr. Hung-yi Lee
Feature Transformation
• Transform the inputs (x1, x2) into new features (x1′, x2′).
Courtesy of Dr. Hung-yi Lee
https://www.slideshare.net/cacois/machine-learningfor-moderndevelopers-35608763
Limitation of Logistic Regression
• Feature Transformation: define new features (the reference points are read from the figure)
x1′: distance to (0, 0)
x2′: distance to (1, 1)
• In the original (x1, x2) space the four points are not linearly separable.
• After the transformation:
(0, 0) → (0, √2),  (0, 1) → (1, 1),  (1, 0) → (1, 1),  (1, 1) → (√2, 0)
and the t = 1 points can now be separated from the t = 0 points by a line (linearly separable; a numpy sketch follows).
• However, it is not easy to find a good transformation by hand.
Courtesy of Dr. Hung-yi Lee
https://www.slideshare.net/cacois/machine-learningfor-moderndevelopers-35608763
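A minimal numpy sketch of the transformation reconstructed above (distances to (0, 0) and (1, 1)):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([1, 0, 0, 1])

x1p = np.linalg.norm(X - np.array([0.0, 0.0]), axis=1)   # distance to (0, 0)
x2p = np.linalg.norm(X - np.array([1.0, 1.0]), axis=1)   # distance to (1, 1)
print(np.column_stack([x1p, x2p]))
# Class 1 points map to (0, 1.414) and (1.414, 0) (sum = sqrt(2)); class 0
# points both map to (1, 1) (sum = 2), so the line x1' + x2' = 1.7 separates them.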
Limitation of Logistic Regression
• Cascading logistic regression models
x1′ = σ(w0^(1) + w1^(1) x1 + w2^(1) x2)
x2′ = σ(w0^(2) + w1^(2) x1 + w2^(2) x2)
y   = σ(w0^(3) + w1^(3) x1′ + w2^(3) x2′)
• The first two models perform a feature transformation that tries to make the data linearly separable; the third model classifies in the transformed space. (A forward-pass sketch follows.)
Courtesy of Dr. Hung-yi Lee
https://www.slideshare.net/cacois/machine-learningfor-moderndevelopers-35608763
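A minimal forward-pass sketch of the cascade (the weights are hypothetical, hand-picked so that the cascade reproduces the table t = 1, 0, 0, 1 that a single logistic regression cannot fit):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cascade(x1, x2):
    x1p = sigmoid(-30 + 20 * x1 + 20 * x2)      # first unit: ~AND(x1, x2)
    x2p = sigmoid( 10 - 20 * x1 - 20 * x2)      # second unit: ~AND(NOT x1, NOT x2)
    return sigmoid(-10 + 20 * x1p + 20 * x2p)   # final unit: ~OR of the two

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), round(float(cascade(x1, x2)), 3))   # ~1, ~0, ~0, ~1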
Limitation of Logistic Regression
(Figures, courtesy of Dr. Hung-yi Lee: a numerical walk-through of the cascade. Each input point is mapped to transformed features (x1′, x2′), with values such as (0.65, 0.10), (0.30, 0.30) and (0.6, 0.8), and the final unit evaluates w0^(3) + w1^(3) x1′ + w2^(3) x2′ on them, e.g. w0^(3) + w1^(3) × 0.6 + w2^(3) × 0.2, so the two classes end up on opposite sides of the final boundary.)
Limitation of Logistic Regression
• Cascading logistic regression models
(Diagram: inputs x1, x2 feed several w^T x units with sigmoid outputs, whose outputs feed further units.)
• Each such unit is called a "Neuron", and the cascaded structure is a Neural Network.
Courtesy of Dr. Hung-yi Lee
https://www.slideshare.net/cacois/machine-learningfor-moderndevelopers-35608763
https://www.tutorialspoint.com/artificial_intelligence/artificial_intelligence_neural_networks.htm
Metrics to Evaluate Classification
• Accuracy
• The ratio of correct predictions to total predictions made.
• Cases
• True Positive: correctly identified as relevant
• True Negative: correctly identified as not relevant
• False Positive: incorrectly labeled as relevant
• False Negative: incorrectly labeled as not relevant
Metrics to Evaluate Classification
• Confusion matrix:
                Predicted: 1      Predicted: 0
Actual: 1       True Positive     False Negative
Actual: 0       False Positive    True Negative
• False Positive: incorrectly labeled as relevant (a false alarm: Negative => Positive)
• False Negative: incorrectly labeled as not relevant (a miss: Positive => Negative)
Metrics to Evaluate Classification
• Accuracy
• The ratio of correct predictions to total predictions made:
Accuracy = (# of True Positives + # of True Negatives) / (# of samples)
• However, classification accuracy alone can be misleading:
• when you have an unequal number of observations in each class
• when you have more than two classes in your dataset
References: http://machinelearningmastery.com/confusion-matrix-machine-learning/
Metrics to Evaluate Classification
• Suppose we have a classifier that identifies whether the object in an image is a cat or not:
If y(x) ≥ 0.5, we predict 1 (cat)
If y(x) < 0.5, we predict 0 (not cat)
and suppose its accuracy is 99% on the test dataset (1000 images).
• What do you think if the 1000 images contain only 5 cat images? A trivial classifier that always answers "not cat" is already about that accurate:

def classifier(img):
    return 0

• Accuracy
• Percentage of correct labels
• Accuracy = (# true positives + # true negatives) / (# of samples)
References: http://machinelearningmastery.com/confusion-matrix-machine-learning/
Metrics to Evaluate Classification
• Precision
• Percentage of positive labels that are correct:
Precision = (# of True Positives) / (# of True Positives + # of False Positives)
References: http://machinelearningmastery.com/confusion-matrix-machine-learning/
Metrics to Evaluate Classification
• Recall (low recall means many positives are missed)
• Percentage of positive examples that are correctly labeled:
Recall = (# of True Positives) / (# of True Positives + # of False Negatives)
References: http://machinelearningmastery.com/confusion-matrix-machine-learning/
Metrics to Evaluate Classification
• Suppose we have a classifier that identifies whether the object in an image is a cat or not:
If y(x) ≥ 0.5, we predict 1 (cat)
If y(x) < 0.5, we predict 0 (not cat)
• The test dataset: 1000 images, containing only 5 cat images; the classifier is

def classifier(img):
    return 0

• Confusion matrix:
                Predicted: 1    Predicted: 0
Actual: 1       0               5
Actual: 0       0               995
• The resulting metrics:
Accuracy = 99.5%
Precision = ? (0/0, undefined)
Recall = 0%
High accuracy, yet the classifier never finds a cat.
Metrics to Evaluate Classification
• The prediction threshold trades precision against recall:
If y(x) ≥ threshold, we predict 1 (cat)
If y(x) < threshold, we predict 0 (not cat)
A large threshold gives higher precision but lower recall; a small threshold gives higher recall but lower precision.
(Figure: precision vs. recall curve.)
http://cb.csail.mit.edu/cb/struct2net/webserver/images/prec-v-recall-v2.png
Metrics to Evaluate Classification
• The F1 score is a single measure of a classifier's performance:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
• Measure precision and recall on validation sets and select the model that gives the maximum F1 score (a small sketch follows).
https://en.wikipedia.org/wiki/F1_score
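A minimal sketch computing these metrics from the confusion-matrix counts of the cat example above (TP = 0, FN = 5, FP = 0, TN = 995), with guards for the 0/0 case:

tp, fn, fp, tn = 0, 5, 0, 995

accuracy = (tp + tn) / (tp + tn + fp + fn)                # 0.995
precision = tp / (tp + fp) if (tp + fp) > 0 else None     # 0/0: undefined
recall = tp / (tp + fn) if (tp + fn) > 0 else None        # 0.0

if precision and recall:                                  # F1 needs both > 0
    f1 = 2 * precision * recall / (precision + recall)
else:
    f1 = 0.0

print(accuracy, precision, recall, f1)    # 0.995 None 0.0 0.0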
Readings
• Logistic Regression
• Pages 203-207 in the book "Pattern Recognition and Machine Learning" by Christopher M. Bishop, Springer, 2006.
• Multi-class classification
• Sections 4.1.2 and 4.3.4 in the book "Pattern Recognition and Machine Learning" by Christopher M. Bishop, Springer, 2006.