
Machine Learning

(機器學習)
Lecture 5: Linear Models
Hsuan-Tien Lin (林軒田)
[email protected]

Department of Computer Science


& Information Engineering
National Taiwan University
(↵Àc'x«⌦Â↵˚)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 0/52


Linear Models

Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?

Lecture 5: Linear Models


Linear Regression Problem
Linear Regression Algorithm
Logistic Regression Problem
Logistic Regression Error
Gradient of Logistic Regression Error
Gradient Descent
Stochastic Gradient Descent

Hsuan-Tien Lin (NTU CSIE) Machine Learning 1/52


Linear Models Linear Regression Problem

Credit Limit Problem


age 23 years
gender female
annual salary NTD 1,000,000
year in residence 1 year
year in job 0.5 year
current debt 200,000
credit limit? 100,000

unknown target function f : X → Y
(ideal credit limit formula)
(how much credit limit should we give a customer? it differs from customer to customer)

training examples learning final hypothesis


D : (x1 , y1 ), · · · , (xN , yN ) algorithm g ≈ f
A
(historical records in bank) (‘learned’ formula to be used)

hypothesis set
H

(set of candidate formula)

key point of regression: the output space is the entire real line,
Y = R: regression
Hsuan-Tien Lin (NTU CSIE) Machine Learning 2/52
Linear Models Linear Regression Problem

Linear Regression Hypothesis


(what should H look like, when the output space is the whole real line?)

age 23 years
annual salary NTD 1,000,000
year in job 0.5 year
current debt 200,000

• For x = (x0, x1, x2, · · · , xd) 'features of customer' (the customer data),
  approximate the desired credit limit with a weighted sum (weights wi):

      y ≈ Σ_{i=0}^{d} wi xi

• linear regression hypothesis: h(x) = wT x

h(x): like perceptron, but without the sign


(PLA takes the sign of the score, positive or negative; here we do not.)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 3/52
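As an illustrative aside (not on the original slide): a minimal NumPy sketch of the hypothesis h(x) = w^T x, using made-up weights and the customer features from the slide, with x0 = 1 as the constant term.

    import numpy as np

    # hypothetical customer: x0 = 1 (constant), age, annual salary, year in job, current debt
    x = np.array([1.0, 23.0, 1_000_000.0, 0.5, 200_000.0])
    # made-up weights purely for illustration
    w = np.array([0.05, 100.0, 0.08, 2000.0, -0.2])

    h = w @ x          # h(x) = w^T x: a weighted sum, like a perceptron but without the sign
    print(h)           # a credit-limit-like score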


Linear Models Linear Regression Problem

Illustration of Linear Regression


x = (x) ∈ R                          x = (x1, x2) ∈ R^2
(in one dimension the linear hypothesis is a line; in three-dimensional space it forms a plane)
(the residuals measure how close each point is to the hypothesis)

[figure: left, data points and a fitted line in the (x, y) plane;
 right, data points and a fitted plane over (x1, x2)]
linear regression:
find lines/hyperplanes with small residuals

Hsuan-Tien Lin (NTU CSIE) Machine Learning 4/52


Linear Models Linear Regression Problem

Pointwise Error Measure for ‘Small Residuals’


final hypothesis
g ≈ f

how well? often use averaged err(g(x), f (x)), like

    Eout(g) = E_{x~P} [[ g(x) ≠ f(x) ]]
              (the quantity inside the expectation is err(g(x), f(x)))

—err: called pointwise error measure

in-sample:      Ein(g)  = (1/N) Σ_{n=1}^{N} err(g(xn), f(xn))
out-of-sample:  Eout(g) = E_{x~P} err(g(x), f(x))

will mainly consider pointwise err for simplicity


Hsuan-Tien Lin (NTU CSIE) Machine Learning 5/52
Linear Models Linear Regression Problem

Learning Flow with Pointwise Error Measure


y
unknown target
distribution P(y |x) unknown
containing f (x) + noise P on X

(ideal credit approval formula)


x1 , x2 , · · · , xN x
y1 , y2 , · · · , yN

training examples learning final hypothesis


D : (x1 , y1 ), · · · , (xN , yN ) algorithm g ≈ f
A
(historical records in bank) (‘learned’ formula to be used)

hypothesis set error measure


H err

(set of candidate formula)

extended VC theory/‘philosophy’
works for most H and err

Hsuan-Tien Lin (NTU CSIE) Machine Learning 6/52


Linear Models Linear Regression Problem

Two Important Pointwise Error Measures


    err( g(x), f(x) )
    (with ỹ denoting g(x) and y denoting f(x))

0/1 error:                               squared error:
    err(ỹ, y) = [[ ỹ ≠ y ]]               err(ỹ, y) = (ỹ − y)^2
    • correct or incorrect?               • how far is ỹ from y?
    • often for classification            • often for regression

squared error: quantify ‘small residual’

Hsuan-Tien Lin (NTU CSIE) Machine Learning 7/52


Linear Models Linear Regression Problem

Squared Error Measure for Regression


popular/historical error measure for linear regression:
squared error err(ŷ, y) = (ŷ − y)^2

in-sample:      Ein(w)  = (1/N) Σ_{n=1}^{N} ( h(xn) − yn )^2,  with h(xn) = w^T xn
out-of-sample:  Eout(w) = E_{(x,y)~P} ( w^T x − y )^2

next: how to minimize Ein (w)?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 8/52


Linear Models Linear Regression Problem

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 9/52


Linear Models Linear Regression Algorithm

Matrix Form of Ein (w)


how do we make Ein as small as possible?

    Ein(w) = (1/N) Σ_{n=1}^{N} ( w^T xn − yn )^2 = (1/N) Σ_{n=1}^{N} ( xn^T w − yn )^2
             (xn^T w: the hypothesis' prediction;  yn: the desired value)

           = (1/N) || [ x1^T w − y1 ] ||^2
                      [ x2^T w − y2 ]
                      [     ...     ]
                      [ xN^T w − yN ]

           = (1/N) || [ x1^T ]       [ y1 ] ||^2          (factoring out the common w)
                      [ x2^T ]  w −  [ y2 ]
                      [  ···  ]      [ ··· ]
                      [ xN^T ]       [ yN ]

           = (1/N) || X w − y ||^2,   with X: N×(d+1),  w: (d+1)×1,  y: N×1

Hsuan-Tien Lin (NTU CSIE) Machine Learning 10/52
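A small sketch (assumed toy data, NumPy) of the matrix form Ein(w) = (1/N)||Xw − y||^2 derived above:

    import numpy as np

    def E_in(w, X, y):
        # E_in(w) = (1/N) * || X w - y ||^2, the averaged squared error in matrix form
        return np.sum((X @ w - y) ** 2) / len(y)

    # toy data: N = 4 examples, d = 2 features plus the constant x0 = 1
    X = np.array([[1.0, 2.0, 3.0],
                  [1.0, 0.5, 1.0],
                  [1.0, 1.5, 2.5],
                  [1.0, 3.0, 0.5]])
    y = np.array([5.0, 2.0, 4.0, 3.5])
    print(E_in(np.zeros(3), X, y))   # Ein at w = 0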


Linear Models Linear Regression Algorithm
(the only unknown is w; X and y are known)

    min_w  Ein(w) = (1/N) || X w − y ||^2

• Ein(w): continuous, differentiable, convex
• necessary condition of 'best' w: the partial derivative in every direction is 0

    ∇Ein(w) ≡ [ ∂Ein/∂w0 (w), ∂Ein/∂w1 (w), ..., ∂Ein/∂wd (w) ]^T = [0, 0, ..., 0]^T

  at the lowest point, no direction can decrease the function value further
  —not possible to 'roll down'; the gradient is 0

task: find wLIN such that ∇Ein(wLIN) = 0

Hsuan-Tien Lin (NTU CSIE) Machine Learning 11/52


Linear Models Linear Regression Algorithm

The Gradient rEin (w)


    Ein(w) = (1/N) || X w − y ||^2
           = (1/N) ( w^T X^T X w − 2 w^T X^T y + y^T y )
             (write A = X^T X,  b = X^T y,  c = y^T y)

one w only (if w has a single dimension):       vector w:
  Ein(w) = (1/N)(a w^2 − 2 b w + c)               Ein(w) = (1/N)(w^T A w − 2 w^T b + c)
  ∇Ein(w) = (1/N)(2 a w − 2 b)                    ∇Ein(w) = (1/N)(2 A w − 2 b)
  simple! :-)                                     similar (derived by definition)

when is the gradient 0?

    ∇Ein(w) = (2/N) ( X^T X w − X^T y )

Hsuan-Tien Lin (NTU CSIE) Machine Learning 12/52


Linear Models Linear Regression Algorithm

Optimal Linear Regression Weights


task: find wLIN such that (2/N) ( X^T X w − X^T y ) = ∇Ein(w) = 0
(only w is unknown here)

invertible X^T X (the inverse exists):          singular X^T X:
• easy! unique solution                         • many optimal solutions
    wLIN = ( X^T X )^{-1} X^T  y                • one of the solutions:
  ( ( X^T X )^{-1} X^T is called the                wLIN = X† y,
    pseudo-inverse X† )                           by defining X† in other ways
• often the case because N ≫ d + 1

practical suggestion:
use a well-implemented † routine
instead of ( X^T X )^{-1} X^T
for numerical stability when almost-singular
Hsuan-Tien Lin (NTU CSIE) Machine Learning 13/52
Linear Models Linear Regression Algorithm

Linear Regression Algorithm


1  from D, construct input matrix X and output vector y (build the matrices):

       X = [ x1^T ]              y = [ y1 ]
           [ x2^T ]                  [ y2 ]
           [  ···  ]                 [ ··· ]
           [ xN^T ]                  [ yN ]
         (N×(d+1))                  (N×1)

2  calculate the pseudo-inverse X†   ((d+1)×N)

3  return wLIN = X† y   ((d+1)×1)

simple and efficient


with good † routine
Hsuan-Tien Lin (NTU CSIE) Machine Learning 14/52
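A minimal sketch of the three steps above in NumPy, assuming toy data; np.linalg.pinv computes an SVD-based pseudo-inverse, which matches the practical suggestion to use a well-implemented † routine rather than forming (X^T X)^{-1} X^T explicitly.

    import numpy as np

    def linear_regression(X, y):
        # steps 2-3: wLIN = X† y
        return np.linalg.pinv(X) @ y

    # toy data (assumed): y is roughly 1 + 2*x1 plus a little noise
    rng = np.random.default_rng(0)
    x1 = rng.uniform(0.0, 1.0, size=20)
    X = np.column_stack([np.ones_like(x1), x1])   # prepend x0 = 1
    y = 1.0 + 2.0 * x1 + 0.1 * rng.standard_normal(20)

    w_lin = linear_regression(X, y)
    print(w_lin)   # should be close to [1, 2]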
Linear Models Linear Regression Algorithm

Is Linear Regression a ‘Learning Algorithm’?


wLIN = X† y

No!                                          Yes! (in some respects it is a machine learning algorithm)

• analytic (closed-form) solution,           1  good Ein? yes, optimal!
  'instantaneous'                            2  good Eout? yes, finite dVC like perceptrons
• not improving Ein nor Eout iteratively        (with enough data, good Ein leads to good Eout)
  (the result comes out in one shot—         3  improving iteratively?
   is there a learning process? it does         somewhat, within an iterative
   not update as data is observed!)             pseudo-inverse routine

if Eout (wLIN ) is good, learning ‘happened’!

Hsuan-Tien Lin (NTU CSIE) Machine Learning 15/52


Linear Regression Generalization Issue

Benefit of Analytic Solution:
'Simpler-than-VC' Guarantee
(how do we guarantee that Ein is good?)

to be shown:   Ē_in = E_{D~P^N} { Ein(wLIN w.r.t. D) } = noise level · (1 − (d+1)/N)
(noise level: the noise in the data; the expectation averages over all datasets D that could be drawn)

    Ein(wLIN) = (1/N) || y − ŷ ||^2 = (1/N) || y − X X† y ||^2
                (ŷ: the predictions, with X† y = wLIN)
              = (1/N) || ( I − X X† ) y ||^2        (I: identity)

call XX† the hat matrix H


because it puts ^ on y

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 14/23


Linear Regression Generalization Issue

Geometric View of Hat Matrix

[figure: the vector y, its projection ŷ onto the span of the columns of X in R^N,
 and the residual y − ŷ perpendicular to that span]

• ŷ = X wLIN lies within the span of the columns of X
• y − ŷ smallest: y − ŷ ⊥ span
• H: projects y to ŷ ∈ span
• I − H: transforms y to y − ŷ ⊥ span

claim: trace(I − H) = N − (d + 1). Why? :-)
(the trace adds up all values on the diagonal)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/23
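A quick numerical check of the claim trace(I − H) = N − (d + 1), using an arbitrary random toy X (an assumption made purely for illustration):

    import numpy as np

    N, d = 10, 3
    rng = np.random.default_rng(1)
    X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])   # N x (d+1)
    H = X @ np.linalg.pinv(X)                                        # hat matrix H = X X†
    print(np.trace(np.eye(N) - H))                                   # approximately N - (d+1) = 6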


Linear Regression Generalization Issue

An Illustrative ‘Proof’
[figure: y = f(X) + noise in R^N; its projection ŷ onto the span of X; residual y − ŷ]

• if y comes from some ideal f(X) ∈ span plus noise
• noise with per-dimension 'noise level' σ^2 is transformed by I − H into y − ŷ

    Ein(wLIN) = (1/N) || y − ŷ ||^2 = (1/N) || (I − H) noise ||^2
              = (1/N) ( N − (d + 1) ) σ^2

    Ē_in  = σ^2 · (1 − (d+1)/N)      (on the data we see, Ein looks good)
    Ē_out = σ^2 · (1 + (d+1)/N)      (complicated! judged on new, incoming data;
                                      on average Ein and Eout differ by 2(d+1)/N)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 16/23
Linear Models Linear Regression Algorithm

The Learning Curves of Linear Regression


(proof skipped this year)

    Ē_out = noise level · (1 + (d+1)/N)
    Ē_in  = noise level · (1 − (d+1)/N)

[learning-curve figure: expected error versus number of data points N,
 with Eout decreasing and Ein increasing toward the noise level σ^2]

• both converge to σ^2 (noise level) for N → ∞
• expected generalization error: 2(d+1)/N
  —similar to worst-case guarantee from VC

linear regression (LinReg):


learning ‘happened’!
Hsuan-Tien Lin (NTU CSIE) Machine Learning 16/52
Linear Models Linear Regression Algorithm

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 17/52


Linear Models Logistic Regression Problem

Heart Attack Prediction Problem (1/2)


age 40 years
gender male
blood pressure 130/85
cholesterol level 240
weight 70
heart disease? yes    (a binary classification problem)

unknown target distribution P(y|x) containing f(x) + noise

training examples learning final hypothesis


D : (x1 , y1 ), · · · , (xN , yN ) algorithm g ≈ f
A

hypothesis set error measure


H err

binary classification:
ideal f(x) = sign( P(+1|x) − 1/2 ) ∈ {−1, +1}
because of classification err

Hsuan-Tien Lin (NTU CSIE) Machine Learning 18/52


Linear Models Logistic Regression Problem

Heart Attack Prediction Problem (2/2)


age 40 years
gender male
blood pressure 130/85
cholesterol level 240
weight 70
heart attack? 80% risk    (the risk of having a heart attack)

unknown target distribution P(y|x) containing f(x) + noise

training examples learning final hypothesis


D : (x1 , y1 ), · · · , (xN , yN ) algorithm g ≈ f
A

hypothesis set error measure


H err

‘soft’ binary classification:


f(x) = P(+1|x) ∈ [0, 1]
(this value, the risk, is what we want to know)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 19/52


Linear Models Logistic Regression Problem

Soft Binary Classification


target function f(x) = P(+1|x) ∈ [0, 1]

ideal (noiseless) data                         actual (noisy) data
 (x1, y1' = 0.9 = P(+1|x1))                     (x1, y1 = ○), with y1 ~ P(y|x1)
 (x2, y2' = 0.2 = P(+1|x2))                     (x2, y2 = ×), with y2 ~ P(y|x2)
          ...                                            ...
 (xN, yN' = 0.6 = P(+1|xN))                     (xN, yN = ×), with yN ~ P(y|xN)

(but we may not have the ideal data on hand; what we have is the same as before,
 only correct/incorrect labels)

same data as hard binary classification,


different target function

Hsuan-Tien Lin (NTU CSIE) Machine Learning 20/52


Linear Models Logistic Regression Problem

Soft Binary Classification


target function f(x) = P(+1|x) ∈ [0, 1]
(view the data on the left as having noise: the observed labels are samples drawn
 according to 0.9, 0.2, . . .)

ideal (noiseless) data                         actual (noisy) data
 (x1, y1' = 0.9 = P(+1|x1))                     (x1, y1' = 1), viewed as a noisy sample, y1 ~ P(y|x1)
 (x2, y2' = 0.2 = P(+1|x2))                     (x2, y2' = 0), viewed as a noisy sample, y2 ~ P(y|x2)
          ...                                            ...
 (xN, yN' = 0.6 = P(+1|xN))                     (xN, yN' = 0), viewed as a noisy sample, yN ~ P(y|xN)

same data as hard binary classification,
different target function
(how can we find a good hypothesis, when what we care about is this probability
 but the observed outputs are only ±1 labels?)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 20/52


Linear Models Logistic Regression Problem

Logistic Hypothesis (commonly used)
age 40 years
gender male
blood pressure 130/85
cholesterol level 240

• For x = (x0, x1, x2, · · · , xd) 'features of patient', calculate a weighted 'risk score'
  (a weighted sum of the features, with x0 the constant term):

      s = Σ_{i=0}^{d} wi xi

• convert the score to an estimated probability (a value between 0 and 1)
  by the S-shaped logistic function θ(s)

[figure: θ(s) rising from 0 to 1 as s goes from −∞ to +∞]

logistic hypothesis: h(x) = θ(w^T x)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 21/52


Linear Models Logistic Regression Problem

Logistic Function
[figure: the logistic function θ(s), rising from 0 to 1]

    θ(−∞) = 0;   θ(0) = 1/2;   θ(+∞) = 1

    θ(s) = e^s / (1 + e^s) = 1 / (1 + e^{−s})
    (close to 0 when the score is low, close to 1 when the score is high)

—smooth, monotonic, sigmoid function of s

logistic regression: use

    h(x) = 1 / (1 + exp(−w^T x))

to approximate target function f(x) = P(+1|x)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 22/52
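A tiny NumPy sketch of the logistic function and the logistic hypothesis, just to make the formulas above concrete:

    import numpy as np

    def theta(s):
        # logistic function theta(s) = 1 / (1 + exp(-s)): smooth, monotonic, sigmoid
        return 1.0 / (1.0 + np.exp(-s))

    def logistic_h(w, x):
        # logistic hypothesis h(x) = theta(w^T x), an estimated probability in [0, 1]
        return theta(w @ x)

    # sanity checks from the slide: theta(0) = 1/2, theta(-inf) -> 0, theta(+inf) -> 1
    print(theta(0.0), theta(-10.0), theta(10.0))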


Linear Models Logistic Regression Problem

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 23/52


Linear Models Logistic Regression Error

Three Linear Models


linear scoring function: s = wT x

linear classification              linear regression              logistic regression
  h(x) = sign(s)                     h(x) = s                       h(x) = θ(s)

[figure: three diagrams, each feeding x0, x1, x2, ..., xd through the score s = w^T x into h(x);
 classification takes the sign of the score, regression outputs the score itself,
 logistic regression passes the score through the sigmoid]

  plausible err = 0/1                friendly err = squared          err = ?
  (small flipping noise)             (easy to minimize)
  (PLA tries to make as few classification errors as possible)

how to define
Ein (w) for logistic regression?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 24/52


Linear Models Logistic Regression Error

Likelihood

target function f(x) = P(+1|x)  ⟺  P(y|x) = f(x) for y = +1, and 1 − f(x) for y = −1
(y takes one of two values)

consider D = {(x1, ○), (x2, ×), . . . , (xN, ×)}
(the probability of generating this data is ...)

probability that f generates D               likelihood that h generates D
  P(x1) P(○|x1) ×                              P(x1) h(x1) ×
  P(x2) P(×|x2) ×                              P(x2) (1 − h(x2)) ×
  ...                                          ...
  P(xN) P(×|xN)                                P(xN) (1 − h(xN))

(if we pretend that h is f, this product measures how likely such data would be observed;
 since h is not really f, it is called the likelihood — which we will maximize)

• if h ≈ f (then the likelihood of h is similar to the probability under f, when h is good),
  then likelihood(h) ≈ probability using f
• probability using f usually large
Hsuan-Tien Lin (NTU CSIE) Machine Learning 25/52
Linear Models Logistic Regression Error

Likelihood

target function f(x) = P(+1|x)  ⟺  P(y|x) = f(x) for y = +1, and 1 − f(x) for y = −1
(assume the examples in D are i.i.d.)

consider D = {(x1, ○), (x2, ×), . . . , (xN, ×)}

probability that f generates D               likelihood that h generates D  (pretend h is f)
  P(x1) f(x1) ×                                P(x1) h(x1) ×
  P(x2) (1 − f(x2)) ×                          P(x2) (1 − h(x2)) ×
  ...                                          ...
  P(xN) (1 − f(xN))                            P(xN) (1 − h(xN))

• if h ≈ f,
  then likelihood(h) ≈ probability using f
• probability using f usually large
Hsuan-Tien Lin (NTU CSIE) Machine Learning 25/52
Linear Models Logistic Regression Error

Likelihood of Logistic Hypothesis


likelihood(h) ≈ (probability using f) ≈ large

    g = argmax_h  likelihood(h)

when logistic: h(x) = θ(w^T x),
    1 − h(x) = h(−x)
(a symmetry property of the logistic function: rotating its graph by 180° gives the same curve)

likelihood(h) = P(x1) h(x1) × P(x2) (1 − h(x2)) × . . . × P(xN) (1 − h(xN))
(the P(xn) factors are the same for every h)

    likelihood(logistic h) ∝ ∏_{n=1}^{N} h(yn xn)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 26/52


Linear Models Logistic Regression Error

Likelihood of Logistic Hypothesis


likelihood(h) ≈ (probability using f) ≈ large

    g = argmax_h  likelihood(h)

when logistic: h(x) = θ(w^T x),   1 − h(x) = h(−x)

                  (○)               (×)                     (×)
likelihood(h) = P(x1) h(+x1) × P(x2) h(−x2) × . . . × P(xN) h(−xN)
              = P(x1) h(y1 x1) × P(x2) h(y2 x2) × . . . × P(xN) h(yN xN)

    likelihood(logistic h) ∝ ∏_{n=1}^{N} h(yn xn)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 26/52


Linear Models Logistic Regression Error

Cross-Entropy Error
    max_h  likelihood(logistic h) ∝ ∏_{n=1}^{N} h(yn xn)

(if the label is ○, use +xn; if the label is ×, use −xn)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/52


Linear Models Logistic Regression Error

Cross-Entropy Error
    max_w  likelihood(w) ∝ ∏_{n=1}^{N} θ( yn w^T xn )

(written in terms of w, the quantity we are actually interested in)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/52


Linear Models Logistic Regression Error

Cross-Entropy Error
    max_w  ln ∏_{n=1}^{N} θ( yn w^T xn )  =  max_w  Σ_{n=1}^{N} ln θ( yn w^T xn )

(taking the logarithm turns the product into a sum)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/52


Linear Models Logistic Regression Error

Cross-Entropy Error
    min_w  (1/N) Σ_{n=1}^{N} − ln θ( yn w^T xn )

(flip max into min by negating; this becomes the error function)

with θ(s) = 1 / (1 + exp(−s)):

    min_w  (1/N) Σ_{n=1}^{N} ln( 1 + exp(−yn w^T xn) )

    ⟹  min_w  (1/N) Σ_{n=1}^{N} err(w, xn, yn)        (this sum is Ein(w))

    err(w, x, y) = ln( 1 + exp(−y w^T x) ):
    cross-entropy error (entropy: a measure of disorder)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/52
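A short sketch (toy data assumed) of the cross-entropy in-sample error just derived:

    import numpy as np

    def cross_entropy_E_in(w, X, y):
        # E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n w^T x_n)), with y_n in {-1, +1}
        s = y * (X @ w)                        # correctness scores y_n w^T x_n
        return np.mean(np.log1p(np.exp(-s)))   # log1p(z) = ln(1 + z)

    # toy data (assumed)
    X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
    y = np.array([+1.0, -1.0, +1.0])
    print(cross_entropy_E_in(np.zeros(2), X, y))   # equals ln 2 when w = 0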


Linear Models Logistic Regression Error

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 28/52


Linear Models Gradient of Logistic Regression Error

Minimizing Ein (w)


    min_w  Ein(w) = (1/N) Σ_{n=1}^{N} ln( 1 + exp(−yn w^T xn) )

• Ein(w): continuous, differentiable, twice-differentiable, convex
  (shaped like a valley, a bowl opening upward)
• how to minimize? locate the valley:
  want ∇Ein(w) = 0

first: derive ∇Ein(w)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 29/52


Linear Models Gradient of Logistic Regression Error

The Gradient
    Ein(w) = (1/N) Σ_{n=1}^{N} ln( 1 + exp( −yn w^T xn ) )

(chain rule, writing ○ = −yn w^T xn and □ = 1 + exp(○))

    ∂Ein(w)/∂wi = (1/N) Σ_{n=1}^{N} ( ∂ln(□)/∂□ ) ( ∂(1 + exp(○))/∂○ ) ( ∂(−yn w^T xn)/∂wi )
                = (1/N) Σ_{n=1}^{N} ( 1/□ ) ( exp(○) ) ( −yn xn,i )
                = (1/N) Σ_{n=1}^{N} ( exp(○) / (1 + exp(○)) ) ( −yn xn,i )
                = (1/N) Σ_{n=1}^{N} θ(○) ( −yn xn,i )

    ∇Ein(w) = (1/N) Σ_{n=1}^{N} θ( −yn w^T xn ) ( −yn xn )

Hsuan-Tien Lin (NTU CSIE) Machine Learning 30/52
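A small NumPy sketch of the gradient formula just derived, on assumed toy data:

    import numpy as np

    def theta(s):
        return 1.0 / (1.0 + np.exp(-s))

    def gradient_E_in(w, X, y):
        # grad E_in(w) = (1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)
        weights = theta(-y * (X @ w))                 # theta(-y_n w^T x_n), one per example
        return -(X * (weights * y)[:, None]).mean(axis=0)

    # toy data (assumed): at w = 0 each theta(.) = 1/2, so the gradient is -(1/2N) * sum_n y_n x_n
    X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
    y = np.array([+1.0, -1.0, +1.0])
    print(gradient_E_in(np.zeros(2), X, y))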




Linear Models Gradient of Logistic Regression Error

Minimizing Ein (w)


    min_w  Ein(w) = (1/N) Σ_{n=1}^{N} ln( 1 + exp(−yn w^T xn) )

    want ∇Ein(w) = (1/N) Σ_{n=1}^{N} θ( −yn w^T xn ) ( −yn xn ) = 0

—a scaled, θ-weighted sum of the yn xn

• all θ(·) = 0: only if yn w^T xn ≫ 0
  —linearly separable D
• weighted sum = 0:
  non-linear equation of w

closed-form solution? no :-(
Hsuan-Tien Lin (NTU CSIE) Machine Learning 31/52
Linear Models Gradient of Logistic Regression Error

PLA Revisited: Iterative Optimization


PLA: start from some w0 (say, 0), and ‘correct’ its mistakes on D

For t = 0, 1, . . .
1  find a mistake of wt, called ( xn(t), yn(t) ):

       sign( wt^T xn(t) ) ≠ yn(t)

2  (try to) correct the mistake by

       wt+1 ← wt + yn(t) xn(t)

when stop, return last w as g

Hsuan-Tien Lin (NTU CSIE) Machine Learning 32/52
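For reference, a minimal sketch of the PLA update above (toy data assumed; the 'pick a mistake' rule here simply takes the first mistaken example):

    import numpy as np

    def pla(X, y, max_iter=1000):
        # start from w0 = 0 and keep correcting one mistake: w <- w + y_n x_n
        w = np.zeros(X.shape[1])
        for _ in range(max_iter):
            mistakes = np.sign(X @ w) != y
            if not mistakes.any():
                break                          # no mistakes left (D was linearly separable)
            n = np.flatnonzero(mistakes)[0]    # pick a mistaken example
            w = w + y[n] * X[n]                # (try to) correct it
        return w

    # toy linearly separable data (assumed): label is the sign of x1
    X = np.array([[1.0, 2.0], [1.0, 0.5], [1.0, -1.0], [1.0, -2.0]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])
    print(pla(X, y))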


Linear Models Gradient of Logistic Regression Error

PLA Revisited: Iterative Optimization


PLA: start from some w0 (say, 0), and ‘correct’ its mistakes on D

For t = 0, 1, . . .
1  find a mistake of wt, called ( xn(t), yn(t) ):

       sign( wt^T xn(t) ) ≠ yn(t)

2  (try to) correct the mistake by

       wt+1 ← wt + yn(t) xn(t)

1  (equivalently) pick some n, and update wt by

       wt+1 ← wt + [[ sign( wt^T xn ) ≠ yn ]] · yn xn     (the update direction)

when stop, return last w as g

Hsuan-Tien Lin (NTU CSIE) Machine Learning 32/52


Linear Models Gradient of Logistic Regression Error

PLA Revisited: Iterative Optimization


PLA: start from some w0 (say, 0), and ‘correct’ its mistakes on D

For t = 0, 1, . . .
1  (equivalently) pick some n, and update wt by

       wt+1 ← wt + 1 · [[ sign( wt^T xn ) ≠ yn ]] · yn xn
       (η = 1: the step size;  v = [[ sign(wt^T xn) ≠ yn ]] yn xn: the update direction)

when stop, return last w as g

choice of (η, v) and stopping condition defines
an iterative optimization approach

Hsuan-Tien Lin (NTU CSIE) Machine Learning 32/52


Linear Models Gradient of Logistic Regression Error

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 33/52


Linear Models Gradient Descent

Iterative Optimization
For t = 0, 1, . . .
       wt+1 ← wt + ηv          (η plays the role of a learning rate)
when stop, return last w as g
(two things to decide: the direction v and the step size η)

• PLA: v comes from mistake correction
• smooth Ein(w) for logistic regression:
  choose v to get the ball to roll 'downhill'?
  [figure: in-sample error Ein versus weights w, a ball rolling down a smooth valley]
• direction v: (assumed) of unit length
• step size η: (assumed) positive

a greedy approach for some given η > 0:

    min_{||v||=1}  Ein( wt + ηv )         (the minimizer defines wt+1)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 34/52


Linear Models Gradient Descent

Linear Approximation
a greedy approach for some given η > 0:

    min_{||v||=1}  Ein( wt + ηv )

• still non-linear optimization, now with constraints
  —not any easier than min_w Ein(w)
• local approximation by a linear formula makes the problem easier:

    Ein( wt + ηv ) ≈ Ein(wt) + η v^T ∇Ein(wt)

  if η really small (Taylor expansion)
  (in one dimension the expansion is the tangent line; here it extends to multiple dimensions)

an approximate greedy approach for some given small η:

    min_{||v||=1}  Ein(wt) + η v^T ∇Ein(wt)
                   (known)  (given, (known)
                            positive)
  (we want to make the last term, v^T ∇Ein(wt), as small as possible)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 35/52
Linear Models Gradient Descent

Gradient Descent
an approximate greedy approach for some given small η:

    min_{||v||=1}  Ein(wt) + η v^T ∇Ein(wt)
                   (known)  (given, (known)
                            positive)

• optimal v: the opposite direction of ∇Ein(wt)

    v = − ∇Ein(wt) / ||∇Ein(wt)||        (the negative, normalized gradient direction)

• gradient descent: for small η, update

    wt+1 ← wt − η ∇Ein(wt) / ||∇Ein(wt)||

gradient descent:
a simple & popular optimization tool

Hsuan-Tien Lin (NTU CSIE) Machine Learning 36/52


Linear Models Gradient Descent
Choice of η

too small:                                     too large:
[figure: tiny steps down the error valley]     [figure: big jumps across the valley, possibly even
                                                upward, where the Taylor approximation no longer holds]
too slow :-( (small steps)                     too unstable :-( (large steps)

a naive yet effective heuristic

• make the (red) η proportional to ||∇Ein(wt)||, with the (purple) η as the fixed learning rate:

    wt+1 ← wt − η ∇Ein(wt) / ||∇Ein(wt)||  =  wt − η ∇Ein(wt)

fixed learning rate gradient descent:

    wt+1 ← wt − η ∇Ein(wt)


Hsuan-Tien Lin (NTU CSIE) Machine Learning 37/52
Linear Models Gradient Descent

Putting Everything Together


Logistic Regression Algorithm
initialize w0
For t = 0, 1, · · ·
1  compute

       ∇Ein(wt) = (1/N) Σ_{n=1}^{N} θ( −yn wt^T xn ) ( −yn xn )

2  update by

       wt+1 ← wt − η ∇Ein(wt)

...until ∇Ein(wt+1) = 0 or enough iterations


return last wt+1 as g

O(N) time complexity in step 1 per iteration


Hsuan-Tien Lin (NTU CSIE) Machine Learning 38/52
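A compact sketch of the whole algorithm (fixed learning rate, toy data assumed); the stopping rule here simply checks for a near-zero gradient or runs out of iterations:

    import numpy as np

    def theta(s):
        return 1.0 / (1.0 + np.exp(-s))

    def logistic_regression_gd(X, y, eta=0.1, max_iter=1000):
        # fixed-learning-rate gradient descent: w <- w - eta * grad E_in(w)
        w = np.zeros(X.shape[1])
        for _ in range(max_iter):
            grad = -(X * (theta(-y * (X @ w)) * y)[:, None]).mean(axis=0)
            if np.linalg.norm(grad) < 1e-6:    # gradient (approximately) zero: stop
                break
            w = w - eta * grad
        return w

    # toy data (assumed): positive when x1 > 0
    X = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.5], [1.0, -0.5]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])
    print(logistic_regression_gd(X, y))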
Linear Models Gradient Descent

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 39/52


Linear Models Gradient Descent

Linear Models Revisited


linear scoring function: s = wT x

linear classification              linear regression              logistic regression
  h(x) = sign(s)                     h(x) = s                       h(x) = θ(s)

[figure: three diagrams, each feeding x0, x1, ..., xd through the score s into h(x)]

  plausible err = 0/1                friendly err = squared          plausible err = cross-entropy
  (right or wrong)                   (make the error small)          (a probability in [0, 1])
  discrete Ein(w):                   quadratic convex Ein(w):        smooth convex Ein(w):
  NP-hard to solve in general        closed-form solution            gradient descent
  hard                               easy                            easy
can linear regression or logistic regression
help linear classification?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 40/52


Linear Models Gradient Descent

Error Functions Revisited


linear scoring function: s = wT x

for binary classification y ∈ {−1, +1}

linear classification              linear regression              logistic regression
  h(x) = sign(s)                     h(x) = s                       h(x) = θ(s)
  err(h, x, y) = [[ h(x) ≠ y ]]      err(h, x, y) = (h(x) − y)^2    err(h, x, y) = −ln h(y x)
  (check whether the hypothesis output matches the label)

rewritten in terms of the score s = w^T x:

  err0/1(s, y)                       errSQR(s, y)                   errCE(s, y)
  = [[ sign(s) ≠ y ]]                = (s − y)^2                    = ln( 1 + exp(−y s) )
  = [[ sign(y s) ≠ 1 ]]              = (y s − 1)^2

(y s): classification correctness score (the larger, the better)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 41/52
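A small sketch evaluating the four error functions at a few correctness scores ys (values chosen arbitrarily), mirroring the comparison plotted on the next slide:

    import numpy as np

    def err_01(ys):  return (np.sign(ys) != 1).astype(float)   # 1 iff ys <= 0
    def err_sqr(ys): return (ys - 1.0) ** 2                     # (ys - 1)^2
    def err_ce(ys):  return np.log1p(np.exp(-ys))               # ln(1 + exp(-ys))
    def err_sce(ys): return np.log2(1.0 + np.exp(-ys))          # log2(1 + exp(-ys))

    ys = np.array([-2.0, 0.0, 1.0, 2.0])
    for f in (err_01, err_sqr, err_ce, err_sce):
        print(f.__name__, f(ys))   # err_sce upper-bounds err_01 at every ys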


Linear Models Gradient Descent

Visualizing Error Functions (comparing the shapes of the error functions, drawn on the same plot)

  0/1         err0/1(s, y)  = [[ sign(y s) ≠ 1 ]]
  sqr         errSQR(s, y)  = (y s − 1)^2
  ce          errCE(s, y)   = ln( 1 + exp(−y s) )
  scaled ce   errSCE(s, y)  = log2( 1 + exp(−y s) )

[figure: the four error curves plotted against ys over roughly −3 ≤ ys ≤ 3]

• 0/1: 1 iff ys ≤ 0
• sqr: large if ys ≪ 1, but over-charges ys ≫ 1;
  small errSQR → small err0/1
• ce: monotonic in ys;
  small errCE ↔ small err0/1
• scaled ce: a proper upper bound of 0/1;
  small errSCE ↔ small err0/1

upper bound:
useful for designing algorithm
Hsuan-Tien Lin (NTU CSIE) Machine Learning 42/52
Linear Models Gradient Descent

Learning Flow with Algorithmic Error Measure


y
unknown target
distribution P(y |x) unknown
containing f (x) + noise P on X

(ideal credit approval formula)


x1 , x2 , · · · , xN x
y1 , y2 , · · · , yN

training examples learning final hypothesis


D : (x1 , y1 ), · · · , (xN , yN ) algorithm g ≈ f
A
êrr
(historical records in bank) (‘learned’ formula to be used)

hypothesis set error measure


H err

(set of candidate formula)

err: the goal, not always easy to optimize;
êrr: something 'similar' to facilitate A, e.g.
an upper bound
Hsuan-Tien Lin (NTU CSIE) Machine Learning 43/52
Linear Models Gradient Descent

Theoretical Implication of Upper Bound


For any ys where s = w^T x
(if what we care about is the 0/1 error, the scaled cross-entropy is its upper bound):

    err0/1(s, y) ≤ errSCE(s, y) = (1/ln 2) errCE(s, y)

    ⟹  Ein0/1(w)  ≤ EinSCE(w)  = (1/ln 2) EinCE(w)       (averaged over the examples)
        Eout0/1(w) ≤ EoutSCE(w) = (1/ln 2) EoutCE(w)

VC on 0/1 (starting from Ein):                 VC-Reg on CE (starting from Eout):

    Eout0/1(w) ≤ Ein0/1(w) + Ω0/1                  Eout0/1(w) ≤ (1/ln 2) EoutCE(w)
               ≤ (1/ln 2) EinCE(w) + Ω0/1                     ≤ (1/ln 2) EinCE(w) + (1/ln 2) ΩCE

small EinCE(w) ⟹ small Eout0/1(w):
logistic/linear reg. for linear classification

Hsuan-Tien Lin (NTU CSIE) Machine Learning 44/52


Linear Models Gradient Descent
Regression for Classification
(use regression to get a score, then take the sign of the score)

1  run logistic/linear reg. on D with yn ∈ {−1, +1} to get wREG
2  return g(x) = sign( wREG^T x )

PLA                                linear regression                  logistic regression
• pros: efficient +                • pros: 'easiest' optimization     • pros: 'easy' optimization
  strong guarantee                   (solved by a single formula)     • cons: loose bound of
  if lin. separable                • cons: loose bound of err0/1        err0/1 for very
• cons: works only if                for large |ys|                     negative ys
  lin. separable                     (when |ys| is large, the bound
                                      is much looser than err0/1)

• linear regression sometimes used to set w0 for PLA/logistic regression
  (as an initialization: if linear regression looks reasonable, take its w as w0
   and continue optimizing from that good starting point)
• logistic regression often preferred in practice

Hsuan-Tien Lin (NTU CSIE) Machine Learning 45/52
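A minimal sketch of this recipe with linear regression as the regressor (toy data assumed):

    import numpy as np

    def regression_for_classification(X, y):
        # 1. run linear regression on D with y_n in {-1, +1} to get wREG
        w_reg = np.linalg.pinv(X) @ y
        # 2. classify with g(x) = sign(wREG^T x)
        return lambda X_new: np.sign(X_new @ w_reg)

    # toy data (assumed)
    X = np.array([[1.0, 2.0], [1.0, 1.5], [1.0, -1.0], [1.0, -2.0]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])
    g = regression_for_classification(X, y)
    print(g(X))   # reproduces the training labels on this toy set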


Linear Models Gradient Descent

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 46/52


Linear Models Stochastic Gradient Descent

Two Iterative Optimization Schemes


For t = 0, 1, . . .
       wt+1 ← wt + ηv
when stop, return last w as g
(step by step, make w better and better; each round, one update)

PLA                                            logistic regression
  pick (xn, yn) and decide wt+1 by               check D and decide wt+1 (or
  the one example                                new ŵ) by all examples
  O(1) time per iteration :-)                    O(N) time per iteration :-(
  (looks at a single example each round)         (each round looks at every example, averages all
                                                  their gradient contributions, then moves along
                                                  the negative gradient)

logistic regression with
O(1) time per iteration?
Hsuan-Tien Lin (NTU CSIE) Machine Learning 47/52


Linear Models Stochastic Gradient Descent

Logistic Regression Revisited (the algorithm)

    wt+1 ← wt + η (1/N) Σ_{n=1}^{N} θ( −yn wt^T xn ) ( yn xn )
                  (the sum is −∇Ein(wt))

• want: update direction v ≈ −∇Ein(wt)
  while computing v by one single (xn, yn)
  (we want the update direction to stay close to the gradient direction,
   but without spending N times the effort to compute it)
• technique on removing (1/N) Σ_{n=1}^{N}:
  view it as an expectation E over a uniform choice of n!

stochastic gradient: ∇_w err(w, xn, yn) with random n
  (a random gradient: not the true overall one, but the one at a single random example)
true gradient: ∇_w Ein(w) = E_{random n} ∇_w err(w, xn, yn)
  (the true gradient is the expected value of this random process)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 48/52


Linear Models Stochastic Gradient Descent

Stochastic Gradient Descent (SGD)


stochastic gradient = true gradient + zero-mean 'noise' directions

Stochastic Gradient Descent

• idea: replace the true gradient by the stochastic gradient
  (descend using a random gradient instead of the true one)
• after enough steps,
  average true gradient ≈ average stochastic gradient
  (run long enough and, on average, the two agree)
• pros: simple & cheaper computation :-)
  —useful for big data or online learning
• cons: less stable in nature

SGD logistic regression, looks familiar? :-):

    wt+1 ← wt + η θ( −yn wt^T xn ) ( yn xn )
    (the added term is −∇err(wt, xn, yn), the update direction)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 49/52
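A minimal sketch of SGD logistic regression (toy data assumed; T and eta follow the rules of thumb on the next slide):

    import numpy as np

    def theta(s):
        return 1.0 / (1.0 + np.exp(-s))

    def logistic_regression_sgd(X, y, eta=0.1, T=10000, seed=0):
        # each step: pick one random example and move along its stochastic gradient direction
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(T):                 # rule of thumb: just run 'long enough'
            n = rng.integers(len(y))
            w = w + eta * theta(-y[n] * (w @ X[n])) * y[n] * X[n]   # the 'soft PLA' update
        return w

    # toy data (assumed), same as the gradient-descent sketch earlier
    X = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.5], [1.0, -0.5]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])
    print(logistic_regression_sgd(X, y))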


Linear Models Stochastic Gradient Descent

PLA Revisited
SGD logistic regression:

    wt+1 ← wt + η · θ( −yn wt^T xn ) ( yn xn )

PLA:

    wt+1 ← wt + 1 · [[ yn ≠ sign( wt^T xn ) ]] ( yn xn )

• SGD logistic regression ≈ 'soft' PLA
• PLA ≈ SGD logistic regression with η = 1 when wt^T xn is large

two practical rules of thumb:
• stopping condition? t large enough
  (hard to decide; trust that running long enough suffices)
• η? 0.1 when x is in a proper range

Hsuan-Tien Lin (NTU CSIE) Machine Learning 50/52


Linear Models Stochastic Gradient Descent

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 51/52


Linear Models Stochastic Gradient Descent

Summary
1 Why Can Machines Learn?

Lecture 4: Theory of Generalization


2 How Can Machines Learn?

Lecture 5: Linear Models


Linear Regression Problem
Linear Regression Algorithm
Logistic Regression Problem
Logistic Regression Error
Gradient of Logistic Regression Error
Gradient Descent
Stochastic Gradient Descent
• next: beyond simple linear models

Hsuan-Tien Lin (NTU CSIE) Machine Learning 52/52
