Lecture 12: The Classification Problem (Logistic Regression), v2
References: https://courses.cs.washington.edu/courses/cse546/12wi/slides/cse546wi12intro.pdf
Applications of Machine Learning
• Spam Filtering
• Spam or Not Spam
References:
• Paper: Konstantin Tretyakov, Machine Learning Techniques in Spam Filtering, 2004.
• http://www.techlicious.com/blog/gmail-spam-filter-artificial-neural-network/
• https://upxacademy.com/spam-alert-machine-learning-is-filtering-your-emails/
Applications of Machine Learning
• Object Detection and Recognition
References:
• https://devblogs.nvidia.com/parallelforall/deep-learning-for-computer-vision-with-matlab-and-cudnn/
• https://www.cnet.com/how-to/how-to-disable-facial-recognition-in-facebook/
Linear Models for Classification
• Classification
• Determine which discrete category an example belongs to.
(Figure: data points from two categories, Class 1 and Class 2.)
A Simple 1-D Classification
• Classification as Regression
• Given a set of input-output pairs for binary classification
𝒟 = {(x^(1), t^(1)), …, (x^(i), t^(i)), …, (x^(N), t^(N))}, where t = 1 or −1.
(Figure: 1-D inputs x with targets t = 1 (Class 1) and t = −1 (Class 2).)
A Simple 1-D Classification
• Classification as Regression
• Fit a linear model f(x) = w^T x; let f(x) denote w0 + w1 x (linear regression).
(Figure: the fitted line f(x) crossing the 1-D data; Class 1 at t = 1, Class 2 at t = −1.)
If f(x) ≥ 0, we predict class 1.
If f(x) < 0, we predict class 2. (A small numpy sketch follows.)
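A minimal numpy sketch of this classification-as-regression recipe (the toy data points are hypothetical, not from the slides): fit f(x) = w0 + w1 x by least squares on targets t in {+1, −1}, then predict by the sign of f(x).

import numpy as np

# Toy 1-D data: class 1 near x = 2, class 2 near x = -2 (hypothetical values).
x = np.array([1.5, 2.0, 2.5, -1.5, -2.0, -2.5])
t = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])   # +1 = class 1, -1 = class 2

# Design matrix [1, x] so that f(x) = w0 + w1 * x.
X = np.column_stack([np.ones_like(x), x])

# Least-squares fit of w = (w0, w1), i.e. ordinary linear regression.
w, *_ = np.linalg.lstsq(X, t, rcond=None)

f = X @ w
pred = np.where(f >= 0, 1, -1)   # f(x) >= 0 -> class 1, else class 2
print(w, pred)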
A Simple 1-D Classification
• Classification as Regression
• In math: y(x) = sgn(f(x)),
where sgn(z) = 1 for z > 0, 0 for z = 0, −1 for z < 0.
(Figure: the step-shaped output y(x) over the 1-D data; Class 1 at t = 1, Class 2 at t = −1.)
A Simple 1-D Classification
• Buy or not buy it?
(Diagram: the inputs 1 and "size in square feet" are weighted by w0 and w1 and summed into a price score, which is passed through sgn; output 1: buy, −1: not buy.)
In math: y(x) = sgn(f(x)), where sgn(z) = 1 for z > 0, 0 for z = 0, −1 for z < 0.
Logistic Regression
• An alternative: replace the sgn function with the sigmoid or logistic function
𝒟 = {(x^(1), t^(1)), …, (x^(i), t^(i)), …, (x^(N), t^(N))}, where t = 1 or −1.
y(x) = σ(f(x)), where σ(z) = 1 / (1 + e^(−z)).
(Figure: the sigmoid curve y(x) rising from 0 to 1 through 0.5 over the Class 1 and Class 2 data, with regions reading "very unlikely", "unlikely", "likely", "very likely".)
Logistic Regression (Note: starting from now, we use labels {1, 0}, just for convenience!)
• An alternative: replace the sgn function with the sigmoid or logistic function
𝒟 = {(x^(1), t^(1)), …, (x^(i), t^(i)), …, (x^(N), t^(N))}, where t = 1 or 0.
y(x) = σ(f(x)), where σ(z) = 1 / (1 + e^(−z)).
Logistic Regression
• An alternative: replace the sgn function with the sigmoid or logistic function
y(x) = 1 / (1 + e^(−w^T x))
If y(x) ≥ 0.5, we predict class 1.
If y(x) < 0.5, we predict class 2.
• Decision boundary: it can be found at
y(x) = 1 / (1 + e^(−(w0 + w1 x))) = 0.5 ⇒ w0 + w1 x = 0
(In 1-D, this is simply a threshold; a small sketch follows.)
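A minimal sketch of this rule with hypothetical weights w0, w1 (any values would do): compute y(x) = σ(w0 + w1 x), threshold at 0.5, and note that the boundary sits exactly where w0 + w1 x = 0.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w0, w1 = -2.0, 1.0                  # hypothetical parameters
x = np.array([0.0, 1.0, 2.0, 3.0])

y = sigmoid(w0 + w1 * x)            # y(x) = 1 / (1 + e^(-(w0 + w1*x)))
pred = np.where(y >= 0.5, 1, 2)     # class 1 if y >= 0.5, else class 2

# y(x) = 0.5 exactly where w0 + w1*x = 0: the 1-D decision boundary (threshold).
print(y, pred, -w0 / w1)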
Decision Boundary
• In 2-D (two features): w^T x = w0 + w1 x1 + w2 x2 = 0,
which separates the space into two parts.
• In 3-D (three features): w^T x = w0 + w1 x1 + w2 x2 + w3 x3 = 0.
Courtesy of Dr. Sanja Fidler
Decision Boundary
• More on decision boundary: with quadratic features, the boundary
w0 + w1 x1 + w2 x2 + w3 x1^2 + w4 x1 x2 + w5 x2^2 = 0
can be a curve. (Figure: a closed curve separating Class 1 from Class 2.)
Decision Boundary
• More on decision boundary: with higher-order features,
w0 + w1 x1 + w2 x2 + w3 x1^2 + w4 x1^2 x2^3 + w5 x2^5 + ⋯ = 0,
the boundary can take even more complex shapes (a small sketch follows).
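A small sketch of how such a nonlinear boundary is used (the weights are hypothetical): build the polynomial features for a point and check the sign of the weighted sum, exactly as with a linear boundary.

import numpy as np

# Hypothetical weights giving the boundary x1^2 + x2^2 = 1 (a circle).
w = np.array([-1.0, 0.0, 0.0, 1.0, 0.0, 1.0])

def boundary_value(x1, x2):
    feats = np.array([1.0, x1, x2, x1**2, x1 * x2, x2**2])
    return w @ feats     # w0 + w1*x1 + w2*x2 + w3*x1^2 + w4*x1*x2 + w5*x2^2

# Points inside the circle get a negative value, points outside a positive one.
print(boundary_value(0.0, 0.0), boundary_value(2.0, 0.0))   # -1.0 and 3.0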
Logistic Regression
• For multiple variables (a scikit-learn sketch follows)
(Diagram: the housing features CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, LSTAT, … are weighted by w1, …, w13 and summed into a price score, which is passed through the sigmoid function σ(z) = 1 / (1 + e^(−z)) to give the probability p of buying it.)
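A minimal scikit-learn sketch of multi-variable logistic regression (the data here are random stand-ins, not the housing features named in the diagram):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))              # 13 features, as in the diagram
t = (X[:, 0] + X[:, 1] > 0).astype(int)     # hypothetical 0/1 targets

model = LogisticRegression().fit(X, t)      # learns w0 (intercept) and w1..w13
p = model.predict_proba(X[:2])[:, 1]        # sigmoid output: probability of "buy"
print(p)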
Linear Regression vs. Logistic Regression

Linear Regression:
• Training dataset: 𝒟 = {(x^(1), t^(1)), …, (x^(i), t^(i)), …, (x^(N), t^(N))} (target t^(i): real value)
• Linear model: y(x) = w0 + w1 x1 + w2 x2 + ⋯ + wM xM
• Parameters: w0, w1, w2, …, wM
• Loss function: ℓ(w) = (1/2N) Σ_{i=1..N} [t^(i) − y(x^(i))]^2
• Goal: minimize ℓ(w)

Logistic Regression:
• Training dataset: 𝒟 = {(x^(1), t^(1)), …, (x^(i), t^(i)), …, (x^(N), t^(N))} (target t^(i): 0 or 1)
• Model: y(x) = σ(w0 + w1 x1 + w2 x2 + ⋯ + wM xM), where σ(z) = 1 / (1 + e^(−z))
• Parameters: w0, w1, w2, …, wM
• Loss function: to be chosen (see below)
• Goal: minimize ℓ(w)

Steps (gradient descent):
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient: w = w − ε ∇w ℓ(w), where ε is the learning rate.
Linear Regression vs. Logistic Regression
• A first attempt: reuse the squared loss for logistic regression,
ℓ(w) = (1/2N) Σ_{i=1..N} [t^(i) − y(x^(i))]^2, with y(x) = σ(w^T x).
The next slides show why this is a bad choice.
Loss Function in Logistic Regression
• Bad loss function: the squared loss
ℓ(w) = (1/2N) Σ_{i=1..N} [t^(i) − y(x^(i))]^2
(with y(x) = σ(w^T x) this loss is non-convex and gives weak gradients for confident mistakes).
(Figure: sigmoid output over the 1-D data, Class 1 and Class 2.)
• A better choice: if target t^(i) is 1, the error is set to −log y(x^(i)).
(Plot: the error −log y(x^(i)) over y(x^(i)) ∈ (0, 1): zero at y = 1, growing without bound as y → 0.)
• If target t^(i) is 0, the error is set to −log(1 − y(x^(i))).
(Plot: the error −log(1 − y(x^(i))) over y(x^(i)) ∈ (0, 1): zero at y = 0, growing without bound as y → 1.)
• Combining both cases, the per-example error is
−t^(i) log y(x^(i)) − (1 − t^(i)) log(1 − y(x^(i)))
and averaging over the dataset gives the loss (a small numpy sketch follows):
ℓ(w) = −(1/N) Σ_{i=1..N} [ t^(i) log y(x^(i)) + (1 − t^(i)) log(1 − y(x^(i))) ]
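A minimal numpy sketch of this loss (the clipping epsilon is a numerical-safety addition, not part of the slide's formula):

import numpy as np

def cross_entropy(t, y, eps=1e-12):
    # l(w) = -(1/N) sum_i [ t_i log y_i + (1 - t_i) log(1 - y_i) ]
    y = np.clip(y, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

t = np.array([1.0, 0.0, 1.0])        # targets
y = np.array([0.9, 0.2, 0.6])        # hypothetical model outputs y(x_i)
print(cross_entropy(t, y))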
Linear Regression vs. Logistic Regression (with the loss filled in)

Linear Regression:
• Model: y(x) = w0 + w1 x1 + ⋯ + wM xM (target t^(i): real value)
• Loss: ℓ(w) = (1/2N) Σ_{i=1..N} [t^(i) − y(x^(i))]^2

Logistic Regression:
• Model: y(x) = σ(w0 + w1 x1 + ⋯ + wM xM), σ(z) = 1 / (1 + e^(−z)) (target t^(i): 0 or 1)
• Loss: ℓ(w) = −(1/N) Σ_{i=1..N} [ t^(i) log y(x^(i)) + (1 − t^(i)) log(1 − y(x^(i))) ]
(this loss is the Cross Entropy in statistics)

Both:
• Goal: minimize ℓ(w)
• Steps: initialize w (e.g., randomly); repeatedly update w = w − ε ∇w ℓ(w), where ε is the learning rate.
Linear Regression vs. Logistic Regression: the Gradients

For both models, the partial derivatives have exactly the same form:
∂ℓ(w)/∂w0 = −(1/N) Σ_{i=1..N} [t^(i) − y(x^(i))]
∂ℓ(w)/∂wj = −(1/N) Σ_{i=1..N} [t^(i) − y(x^(i))] xj^(i),  j = 1, …, M
and ∇w ℓ(w) stacks these M + 1 partial derivatives.

The only difference is the model output (a gradient-descent sketch follows):
• Linear regression (squared loss): y(x^(i)) = w^T x^(i)
• Logistic regression (cross entropy): y(x^(i)) = σ(w^T x^(i)) = 1 / (1 + e^(−w^T x^(i)))
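A minimal gradient-descent sketch for logistic regression using exactly these formulas (data, learning rate, and iteration count are hypothetical):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])   # [1, x1, x2]
t = (X[:, 1] + X[:, 2] > 0).astype(float)                        # 0/1 targets

w = np.zeros(3)
eps = 0.5                                 # learning rate
for _ in range(1000):
    y = sigmoid(X @ w)                    # y(x_i) = sigma(w^T x_i)
    grad = -(X.T @ (t - y)) / len(t)      # -(1/N) sum_i [t_i - y_i] * x_i
    w = w - eps * grad                    # w = w - eps * grad_w l(w)
print(w)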
Cross Entropy vs Mean Squared Error
(Figure, from the reference below: training curves comparing cross entropy and mean squared error.)
Reference: Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." AISTATS, Vol. 9, 2010.
Cross Entropy vs Mean Squared Error
• With the mean squared error
ℓ(w) = (1/2N) Σ_{i=1..N} [t^(i) − y(x^(i))]^2
a bad classifier can still get a small error ("small error, bad classifier").
• With the cross entropy
ℓ(w) = −(1/N) Σ_{i=1..N} [ t^(i) log y(x^(i)) + (1 − t^(i)) log(1 − y(x^(i))) ]
a bad classifier gets a large error and a good classifier a small one
("large error, bad classifier"; "small error, good classifier"). (A numeric check follows.)
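A small numeric check of this point (the outputs are hypothetical): a confidently wrong prediction barely registers under the squared error but blows up under cross entropy.

import numpy as np

t, y = 1.0, 0.001                                   # true class 1, model says ~0
mse = 0.5 * (t - y) ** 2                            # ~0.50: "small error, bad classifier"
ce = -(t * np.log(y) + (1 - t) * np.log(1 - y))     # ~6.91: "large error, bad classifier"
print(mse, ce)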
Multi-Class Classification
• One vs All
(Figure: data from three classes, class 1, class 2, and class 3, in a 2-D feature space.)
Multi-Class Classification
• One vs All
• Model 1: probability of being in class 1 (class 1 vs. the rest)
• Model 2: probability of being in class 2
• Model 3: probability of being in class 3
• Choose the class with the largest probability.
Courtesy of Dr. Hung-yi Lee
Multi-Class Classification
• We can also consider multiple classes at the same time:
C1: score w^(1)T x,  C2: score w^(2)T x,  C3: score w^(3)T x
y1 = e^(w^(1)T x) / (e^(w^(1)T x) + e^(w^(2)T x) + e^(w^(3)T x))
y2 = e^(w^(2)T x) / (e^(w^(1)T x) + e^(w^(2)T x) + e^(w^(3)T x))
y3 = e^(w^(3)T x) / (e^(w^(1)T x) + e^(w^(2)T x) + e^(w^(3)T x))
This normalization is the Softmax; the predicted class is the Argmax of (y1, y2, y3).
• Training targets are one-hot:
If x is in C1, the target is (ŷ1, ŷ2, ŷ3) = (1, 0, 0)
If x is in C2, the target is (0, 1, 0)
If x is in C3, the target is (0, 0, 1)
Courtesy of Dr. Hung-yi Lee
Multi-Class Classification
• Consider multiple classes at the same time (a numpy sketch follows)
(Diagram: x feeds three linear scores w^(1)T x, w^(2)T x, w^(3)T x, which pass through Softmax to give y1, y2, y3, followed by Argmax to give ŷ1, ŷ2, ŷ3.)
Courtesy of Dr. Hung-yi Lee
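A minimal numpy sketch of this softmax layer (the weight vectors are hypothetical):

import numpy as np

W = np.array([[ 1.0, -0.5],     # w^(1)
              [ 0.2,  0.8],     # w^(2)
              [-1.0,  0.3]])    # w^(3)
x = np.array([0.5, 1.0])

z = W @ x                       # the three scores w^(k)T x
y = np.exp(z - z.max())         # subtracting the max is a stability trick
y = y / y.sum()                 # y_k = e^(z_k) / sum_j e^(z_j)

print(y, y.argmax() + 1)        # class probabilities and the argmax class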
Regularized Logistic Regression
(Figure: decision boundaries illustrating underfitting vs. overfitting. Courtesy of Dr. Hung-yi Lee.)
Limitation of Logistic Regression
• Data and target:
x1  x2 | t
 0   0 | 1
 0   1 | 0
 1   0 | 0
 1   1 | 1
No single line can separate the t = 1 points from the t = 0 points.
Courtesy of Dr. Hung-yi Lee
Feature Transformation
• Transform the inputs (x1, x2) into new features (x1′, x2′).
Courtesy of Dr. Hung-yi Lee
https://www.slideshare.net/cacois/machine-learningfor-moderndevelopers-35608763
Limitation of Logistic Regression
• Feature Transformation: define new features (the reference points are read from the figure)
x1′: distance to (0, 0)
x2′: distance to (1, 1)
• In the original (x1, x2) space the four points are not linearly separable.
• After the transformation:
(0, 0) → (0, √2),  (0, 1) → (1, 1),  (1, 0) → (1, 1),  (1, 1) → (√2, 0)
and the t = 1 points can now be separated from the t = 0 points by a line (linearly separable; a numpy sketch follows).
• However, it is not easy to find a good transformation by hand.
Courtesy of Dr. Hung-yi Lee
https://www.slideshare.net/cacois/machine-learningfor-moderndevelopers-35608763
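A minimal numpy sketch of the transformation reconstructed above (distances to (0, 0) and (1, 1)):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([1, 0, 0, 1])

x1p = np.linalg.norm(X - np.array([0.0, 0.0]), axis=1)   # distance to (0, 0)
x2p = np.linalg.norm(X - np.array([1.0, 1.0]), axis=1)   # distance to (1, 1)
print(np.column_stack([x1p, x2p]))
# Class 1 points map to (0, 1.414) and (1.414, 0) (sum = sqrt(2)); class 0
# points both map to (1, 1) (sum = 2), so the line x1' + x2' = 1.7 separates them.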
Limitation of Logistic Regression
• Cascading logistic regression models
x1′ = σ(w0^(1) + w1^(1) x1 + w2^(1) x2)
x2′ = σ(w0^(2) + w1^(2) x1 + w2^(2) x2)
y   = σ(w0^(3) + w1^(3) x1′ + w2^(3) x2′)
• The first two models perform a feature transformation that tries to make the data linearly separable; the third model classifies in the transformed space. (A forward-pass sketch follows.)
Courtesy of Dr. Hung-yi Lee
https://www.slideshare.net/cacois/machine-learningfor-moderndevelopers-35608763
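A minimal forward-pass sketch of the cascade (the weights are hypothetical, hand-picked so that the cascade reproduces the table t = 1, 0, 0, 1 that a single logistic regression cannot fit):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cascade(x1, x2):
    x1p = sigmoid(-30 + 20 * x1 + 20 * x2)      # first unit: ~AND(x1, x2)
    x2p = sigmoid( 10 - 20 * x1 - 20 * x2)      # second unit: ~AND(NOT x1, NOT x2)
    return sigmoid(-10 + 20 * x1p + 20 * x2p)   # final unit: ~OR of the two

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), round(float(cascade(x1, x2)), 3))   # ~1, ~0, ~0, ~1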
Limitation of Logistic Regression
(Figures, courtesy of Dr. Hung-yi Lee: a numerical walk-through of the cascade. Each input point is mapped to transformed features (x1′, x2′), with values such as (0.65, 0.10), (0.30, 0.30) and (0.6, 0.8), and the final unit evaluates w0^(3) + w1^(3) x1′ + w2^(3) x2′ on them, e.g. w0^(3) + w1^(3) × 0.6 + w2^(3) × 0.2, so the two classes end up on opposite sides of the final boundary.)
Limitation of Logistic Regression
• Cascading logistic regression models
(Diagram: inputs x1, x2 feed several w^T x units with sigmoid outputs, whose outputs feed further units.)
• Each such unit is called a "Neuron", and the cascaded structure is a Neural Network.
Courtesy of Dr. Hung-yi Lee
https://www.slideshare.net/cacois/machine-learningfor-moderndevelopers-35608763
https://www.tutorialspoint.com/artificial_intelligence/artificial_intelligence_neural_networks.htm
Metrics to Evaluate Classification
• Accuracy
• The ratio of correct predictions to total predictions made.
• Cases
• True Positive: correctly identified as relevant
• True Negative: correctly identified as not relevant
• False Positive: incorrectly labeled as relevant
• False Negative: incorrectly labeled as not relevant
Metrics to Evaluate Classification
• Confusion matrix:
                Predicted: 1      Predicted: 0
Actual: 1       True Positive     False Negative
Actual: 0       False Positive    True Negative
• False Positive: incorrectly labeled as relevant (a false alarm: Negative => Positive)
• False Negative: incorrectly labeled as not relevant (a miss: Positive => Negative)
Metrics to Evaluate Classification
• Accuracy
• The ratio of correct predictions to total predictions made:
Accuracy = (# of True Positives + # of True Negatives) / (# of samples)
• However, classification accuracy alone can be misleading:
• when you have an unequal number of observations in each class
• when you have more than two classes in your dataset
References: http://machinelearningmastery.com/confusion-matrix-machine-learning/
Metrics to Evaluate Classification
• Suppose we have a classifier that identifies whether the object in an image is a cat or not:
If y(x) ≥ 0.5, we predict 1 (cat)
If y(x) < 0.5, we predict 0 (not cat)
and suppose its accuracy is 99% on the test dataset (1000 images).
• What do you think if the 1000 images contain only 5 cat images? A trivial classifier that always answers "not cat" is already about that accurate:

def classifier(img):
    return 0

• Accuracy
• Percentage of correct labels
• Accuracy = (# true positives + # true negatives) / (# of samples)
References: http://machinelearningmastery.com/confusion-matrix-machine-learning/
Metrics to Evaluate Classification
• Precision
• Percentage of positive labels that are correct:
Precision = (# of True Positives) / (# of True Positives + # of False Positives)
References: http://machinelearningmastery.com/confusion-matrix-machine-learning/
Metrics to Evaluate Classification
• Recall (low recall means many positives are missed)
• Percentage of positive examples that are correctly labeled:
Recall = (# of True Positives) / (# of True Positives + # of False Negatives)
References: http://machinelearningmastery.com/confusion-matrix-machine-learning/
Metrics to Evaluate Classification
• Suppose we have a classifier that identifies whether the object in an image is a cat or not:
If y(x) ≥ 0.5, we predict 1 (cat)
If y(x) < 0.5, we predict 0 (not cat)
• The test dataset: 1000 images, containing only 5 cat images; the classifier is

def classifier(img):
    return 0

• Confusion matrix:
                Predicted: 1    Predicted: 0
Actual: 1       0               5
Actual: 0       0               995
• The resulting metrics:
Accuracy = 99.5%
Precision = ? (0/0, undefined)
Recall = 0%
High accuracy, yet the classifier never finds a cat.
Metrics to Evaluate Classification
• The prediction threshold trades precision against recall:
If y(x) ≥ threshold, we predict 1 (cat)
If y(x) < threshold, we predict 0 (not cat)
A large threshold gives higher precision but lower recall; a small threshold gives higher recall but lower precision.
(Figure: precision vs. recall curve.)
http://cb.csail.mit.edu/cb/struct2net/webserver/images/prec-v-recall-v2.png
Metrics to Evaluate Classification
• The F1 score is a single measure of a classifier's performance:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
• Measure precision and recall on validation sets and select the model that gives the maximum F1 score (a small sketch follows).
https://en.wikipedia.org/wiki/F1_score
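A minimal sketch computing these metrics from the confusion-matrix counts of the cat example above (TP = 0, FN = 5, FP = 0, TN = 995), with guards for the 0/0 case:

tp, fn, fp, tn = 0, 5, 0, 995

accuracy = (tp + tn) / (tp + tn + fp + fn)                # 0.995
precision = tp / (tp + fp) if (tp + fp) > 0 else None     # 0/0: undefined
recall = tp / (tp + fn) if (tp + fn) > 0 else None        # 0.0

if precision and recall:                                  # F1 needs both > 0
    f1 = 2 * precision * recall / (precision + recall)
else:
    f1 = 0.0

print(accuracy, precision, recall, f1)    # 0.995 None 0.0 0.0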
Readings
• Logistic Regression
• Pages 203-207 in the book "Pattern Recognition and Machine Learning" by Christopher M. Bishop, Springer, 2006.
• Multi-class classification
• Sections 4.1.2 and 4.3.4 in the book "Pattern Recognition and Machine Learning" by Christopher M. Bishop, Springer, 2006.