Lecture 5: Linear Models
Hsuan-Tien Lin (林軒田)
[email protected]
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
hypothesis set $\mathcal{H}$
$\mathcal{Y} = \mathbb{R}$: regression (the output space is the whole real line)
Linear Regression Problem
age 23 years
annual salary NTD 1,000,000
year in job 0.5 year
current debt 200,000
$x = (x_1) \in \mathbb{R}$ (one feature)    $x = (x_1, x_2) \in \mathbb{R}^2$ (two features: the hypothesis forms a plane in three-dimensional space)
[figure: data points around a line in the $(x_1, y)$ plane and around a plane in $(x_1, x_2, y)$ space; the residuals measure how close each point is to the hypothesis]
linear regression:
find lines/hyperplanes with small residuals
in-sample: $E_{\text{in}}(g) = \frac{1}{N}\sum_{n=1}^{N} \text{err}\big(g(x_n), f(x_n)\big)$    out-of-sample: $E_{\text{out}}(g) = \mathop{\mathbb{E}}_{x \sim P}\, \text{err}\big(g(x), f(x)\big)$
extended VC theory/‘philosophy’
works for most H and err
in-sample: $E_{\text{in}}(w) = \frac{1}{N}\sum_{n=1}^{N} \big(\underbrace{h(x_n)}_{w^T x_n} - y_n\big)^2$    out-of-sample: $E_{\text{out}}(w) = \mathop{\mathbb{E}}_{(x,y) \sim P}\, \big(w^T x - y\big)^2$
Questions?
$\min_{w}\ E_{\text{in}}(w) = \frac{1}{N}\,\|Xw - y\|^2$
(at the minimum, the gradient is $0$)
one $w$ only (if $w$ is a scalar): $E_{\text{in}}(w) = \frac{1}{N}\left(aw^2 - 2bw + c\right)$
vector $w$: $E_{\text{in}}(w) = \frac{1}{N}\left(w^T A w - 2 w^T b + c\right)$
invertible $X^T X$ (its inverse can be computed):
• easy! unique solution $w_{\text{LIN}} = \underbrace{\left(X^T X\right)^{-1} X^T}_{\text{pseudo-inverse } X^\dagger}\, y$, where $\left(X^T X\right)^{-1}$ is $(d+1)\times(d+1)$
• often the case because $N \gg d+1$
singular $X^T X$:
• many optimal solutions
• one of the solutions $w_{\text{LIN}} = X^\dagger y$, by defining $X^\dagger$ in other ways
practical suggestion: use a well-implemented pseudo-inverse ($^\dagger$) routine instead of $\left(X^T X\right)^{-1} X^T$, for numerical stability when almost-singular
Linear Regression Algorithm
2 calculate pseudo-inverse $\underbrace{X^\dagger}_{(d+1)\times N}$
3 return $w_{\text{LIN}} = \underbrace{X^\dagger y}_{(d+1)\times 1}$
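As an illustration (not part of the original slides), the steps above translate directly into NumPy; the function and variable names here are my own, and the toy data is made up.

import numpy as np

def linear_regression(X, y):
    """Return w_LIN = X^dagger y for an N x (d+1) data matrix X and an N-vector y."""
    # np.linalg.pinv computes the pseudo-inverse via SVD, which is the kind of
    # well-implemented routine recommended above for near-singular X^T X.
    return np.linalg.pinv(X) @ y

# hypothetical usage: 4 examples, d = 2 features plus the constant feature x_0 = 1
X = np.array([[1.0, 23.0, 0.5],
              [1.0, 30.0, 2.0],
              [1.0, 41.0, 5.0],
              [1.0, 35.0, 1.0]])
y = np.array([0.8, 1.2, 2.5, 1.6])
w_lin = linear_regression(X, y)
E_in = np.mean((X @ w_lin - y) ** 2)   # E_in(w_LIN) = (1/N) ||X w - y||^2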
$E_{\text{in}}(w_{\text{LIN}}) = \frac{1}{N}\,\|y - \underbrace{\hat{y}}_{\text{predictions}}\|^2 = \frac{1}{N}\,\|y - X\underbrace{X^\dagger y}_{w_{\text{LIN}}}\|^2 = \frac{1}{N}\,\|(\underbrace{I}_{\text{identity}} - XX^\dagger)\,y\|^2$
[figure: in $\mathbb{R}^N$, the target $y$, its projection $\hat{y}$ onto the span of the columns of $X$, and the residual $y - \hat{y}$]
• $\hat{y} = Xw_{\text{LIN}}$: within the span of the columns of $X$
• $y - \hat{y}$ smallest: $y - \hat{y} \perp$ span
• $H$: projects $y$ to $\hat{y} \in$ span
• $I - H$: transforms $y$ to $y - \hat{y} \perp$ span
An Illustrative ‘Proof’
[figure: $y = f(X) + \text{noise}$; $\hat{y}$ is the projection of $y$ onto the span of $X$, and $y - \hat{y}$ is what remains of the noise]
$\overline{E_{\text{in}}} = \text{noise level} \cdot \left(1 - \frac{d+1}{N}\right)$ (the data we have already seen makes $E_{\text{in}}$ look good)
$\overline{E_{\text{out}}} = \text{noise level} \cdot \left(1 + \frac{d+1}{N}\right)$ (complicated! judged on fresh data)
so the two differ by about $\frac{2(d+1)}{N}$ times the noise level
Expected Error
$\overline{E_{\text{out}}} = \text{noise level} \cdot \left(1 + \frac{d+1}{N}\right)$
$\overline{E_{\text{in}}} = \text{noise level} \cdot \left(1 - \frac{d+1}{N}\right)$
[figure: learning curves of $\overline{E_{\text{out}}}$ (above) and $\overline{E_{\text{in}}}$ (below) versus $N$]
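For a feel of the numbers (a made-up illustration, not from the slides): with $d + 1 = 10$, $N = 1000$, and noise level $\sigma^2$,
\[
\overline{E_{\text{in}}} = \sigma^2\left(1 - \tfrac{10}{1000}\right) = 0.99\,\sigma^2, \qquad
\overline{E_{\text{out}}} = \sigma^2\left(1 + \tfrac{10}{1000}\right) = 1.01\,\sigma^2,
\]
so the expected generalization gap is roughly $\frac{2(d+1)}{N}\,\sigma^2 = 0.02\,\sigma^2$ and shrinks as $N$ grows.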
Questions?
binary classification: ideal $f(x) = \text{sign}\left(P(+1|x) - \frac{1}{2}\right) \in \{-1, +1\}$, because of the classification (0/1) err
Logistic Hypothesis (commonly used)
age 40 years
gender male
blood pressure 130/85
cholesterol level 240
Logistic Function
[figure: $\theta(s)$ rising smoothly from 0 to 1 as $s$ goes from $-\infty$ to $\infty$]
$\theta(-\infty) = 0$; $\theta(0) = \frac{1}{2}$; $\theta(\infty) = 1$ (close to 0 when the score is low, close to 1 when the score is high)
$\theta(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}$ (analytic form)
— smooth, monotonic, sigmoid function of $s$
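A minimal sketch (my own, not from the slides) of $\theta(s)$ in NumPy, arranged so that exp never receives a positive argument and therefore never overflows:

import numpy as np

def theta(s):
    """Logistic function theta(s) = 1 / (1 + exp(-s))."""
    s = np.asarray(s, dtype=float)
    z = np.exp(-np.abs(s))                       # always in (0, 1]
    return np.where(s >= 0, 1.0 / (1.0 + z), z / (1.0 + z))

# sanity checks matching the slide: theta(-inf) -> 0, theta(0) -> 1/2, theta(inf) -> 1
# theta(np.array([-50.0, 0.0, 50.0]))  ->  approximately [0.0, 0.5, 1.0]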
Questions?
[figure: three linear models sharing the total score $s = w^T x$ computed from features $x_1, \ldots, x_d$; the logistic model passes the score through the sigmoid to output $h(x)$]
how to define $E_{\text{in}}(w)$ for logistic regression?
Likelihood
target function $f(x) = P(+1|x) \;\Leftrightarrow\; P(y|x) = \begin{cases} f(x) & \text{for } y = +1 \\ 1 - f(x) & \text{for } y = -1 \end{cases}$
(assume the examples are generated i.i.d.)
• if $h \approx f$, then likelihood$(h) \approx$ probability using $f$ (a good $h$ assigns the data a probability similar to $f$'s)
• probability using $f$ usually large
Logistic Regression Error
$g = \mathop{\text{argmax}}_{h}\ \text{likelihood}(h)$
when logistic: $h(x) = \theta(w^T x)$, with the symmetry $1 - h(x) = h(-x)$ (a property of the logistic function: rotating its curve by 180° gives the same shape, so flipping the label is the same as negating the input as far as $h$ is concerned)
$\text{likelihood}(\text{logistic } h) \propto \prod_{n=1}^{N} h(y_n x_n)$
Cross-Entropy Error
$\max_{h}\ \text{likelihood}(\text{logistic } h) \propto \prod_{n=1}^{N} h(y_n x_n)$ (use $h(x_n)$ for a positive example and $h(-x_n)$ for a negative one)
$\max_{w}\ \text{likelihood}(w) \propto \prod_{n=1}^{N} \theta\left(y_n w^T x_n\right)$ (switch to $w$, the variable we are really interested in)
$\max_{w}\ \ln \prod_{n=1}^{N} \theta\left(y_n w^T x_n\right) = \sum_{n=1}^{N} \ln \theta\left(y_n w^T x_n\right)$ (take $\ln$ to turn the product into a sum)
$\min_{w}\ \frac{1}{N}\sum_{n=1}^{N} -\ln \theta\left(y_n w^T x_n\right)$ (negate and average to turn max into min, giving an error function)
with $\theta(s) = \frac{1}{1 + \exp(-s)}$: $\quad \min_{w}\ \frac{1}{N}\sum_{n=1}^{N} \ln\left(1 + \exp\left(-y_n w^T x_n\right)\right)$
$\implies \min_{w}\ \underbrace{\frac{1}{N}\sum_{n=1}^{N} \text{err}(w, x_n, y_n)}_{E_{\text{in}}(w)}$ (the error summed over the examples)
$\text{err}(w, x, y) = \ln\left(1 + \exp\left(-y\, w^T x\right)\right)$: cross-entropy error (related to entropy)
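A short sketch (my own) of computing this $E_{\text{in}}(w)$ with NumPy; np.logaddexp(0, z) evaluates $\ln(1 + e^{z})$ without overflow, and X, y, w follow the conventions above.

import numpy as np

def cross_entropy_Ein(w, X, y):
    """E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n w^T x_n)) for labels y_n in {-1, +1}."""
    scores = X @ w                               # w^T x_n for every n
    return np.mean(np.logaddexp(0.0, -y * scores))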
Questions?
The Gradient $\nabla E_{\text{in}}(w)$
$E_{\text{in}}(w) = \frac{1}{N}\sum_{n=1}^{N} \ln\Big(\underbrace{1 + \exp(\overbrace{-y_n w^T x_n}^{\square})}_{\bigcirc}\Big)$
chain rule:
$\frac{\partial E_{\text{in}}(w)}{\partial w_i} = \frac{1}{N}\sum_{n=1}^{N} \left(\frac{\partial \ln(\bigcirc)}{\partial \bigcirc}\right)\left(\frac{\partial\left(1 + \exp(\square)\right)}{\partial \square}\right)\left(\frac{\partial\left(-y_n w^T x_n\right)}{\partial w_i}\right)$
$= \frac{1}{N}\sum_{n=1}^{N} \left(\frac{1}{\bigcirc}\right)\Big(\exp(\square)\Big)\Big(-y_n x_{n,i}\Big)$
$= \frac{1}{N}\sum_{n=1}^{N} \left(\frac{\exp(\square)}{1 + \exp(\square)}\right)\left(-y_n x_{n,i}\right) = \frac{1}{N}\sum_{n=1}^{N} \theta(\square)\left(-y_n x_{n,i}\right)$
$\nabla E_{\text{in}}(w) = \frac{1}{N}\sum_{n=1}^{N} \theta\left(-y_n w^T x_n\right)\left(-y_n x_n\right)$
want $\nabla E_{\text{in}}(w) = 0$ for some $w$: closed-form solution? no :-(
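The final formula turns into a few lines of NumPy; this sketch (mine, not the lecture's) reuses the theta helper from the earlier logistic-function sketch.

import numpy as np

def gradient_Ein(w, X, y):
    """grad E_in(w) = (1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)."""
    weights = theta(-y * (X @ w))                # theta(-y_n w^T x_n), shape (N,)
    return X.T @ (weights * -y) / len(y)         # averaged sum of theta(.) * (-y_n x_n)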
PLA: For t = 0, 1, . . .
1 find a mistake of $w_t$, called $\left(x_{n(t)}, y_{n(t)}\right)$: $\text{sign}\left(w_t^T x_{n(t)}\right) \ne y_{n(t)}$
(equivalently) pick some $n$, and update $w_t$ by
$w_{t+1} \leftarrow w_t + \underbrace{1}_{\eta} \cdot \underbrace{\left\llbracket \text{sign}\left(w_t^T x_n\right) \ne y_n \right\rrbracket \cdot y_n x_n}_{v}$
when stop, return last $w$ as $g$ ($\eta$: the step size, $v$: the update direction)
choice of $(\eta, v)$ and stopping condition defines an iterative optimization approach
Questions?
Iterative Optimization
For t = 0, 1, . . .
  $w_{t+1} \leftarrow w_t + \eta v$ ($\eta$ acts like a learning rate)
when stop, return last $w$ as $g$ (two things to decide each round: ① the direction $v$, ② the step size $\eta$)
• PLA: $v$ comes from mistake correction
Linear Approximation
a greedy approach for some given $\eta > 0$ (now extended to multiple dimensions):
$\min_{\|v\|=1}\ \underbrace{E_{\text{in}}(w_t)}_{\text{known}} + \underbrace{\eta}_{\text{given positive}}\ \underbrace{v^T \nabla E_{\text{in}}(w_t)}_{\text{known}}$
(with $\|v\| = 1$ fixed, we only want to make the last term as small as possible)
Gradient Descent
an approximate greedy approach for some given small $\eta$:
$v = -\frac{\nabla E_{\text{in}}(w_t)}{\|\nabla E_{\text{in}}(w_t)\|}$ (the negative, normalized gradient direction)
• gradient descent: for small $\eta$, $\quad w_{t+1} \leftarrow w_t - \eta\,\frac{\nabla E_{\text{in}}(w_t)}{\|\nabla E_{\text{in}}(w_t)\|}$
gradient descent: a simple & popular optimization tool
[figure: $E_{\text{in}}$ versus the weights $w$ for different choices of $\eta$]
absorbing the gradient norm into the learning rate: $w_{t+1} \leftarrow w_t - \eta\,\frac{\nabla E_{\text{in}}(w_t)}{\|\nabla E_{\text{in}}(w_t)\|} = w_t - \eta\,\nabla E_{\text{in}}(w_t)$
update by $w_{t+1} \leftarrow w_t - \eta\,\nabla E_{\text{in}}(w_t)$
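Putting the update rule into a loop gives a minimal gradient-descent trainer for logistic regression; this is my own sketch, the fixed $\eta$, iteration count, and zero initialization are arbitrary choices, and theta / gradient_Ein are the helpers sketched earlier.

import numpy as np

def logistic_regression_gd(X, y, eta=0.1, T=1000):
    """Minimize the cross-entropy E_in with w_{t+1} <- w_t - eta * grad E_in(w_t)."""
    w = np.zeros(X.shape[1])                     # w_0 = 0, an arbitrary starting point
    for _ in range(T):
        w = w - eta * gradient_Ein(w, X, y)      # gradient-descent update
    return w                                     # return the last w as g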
Questions?
linear classification: plausible err = 0/1; discrete $E_{\text{in}}(w)$: NP-hard to solve in general (hard)
linear regression: friendly err = squared; quadratic convex $E_{\text{in}}(w)$: closed-form solution (easy)
logistic regression (outputs a probability between 0 and 1): plausible err = cross-entropy; smooth convex $E_{\text{in}}(w)$: gradient descent (easy)
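For reference, the three pointwise errors can all be written in terms of the classification score $ys = y\,w^T x$ (using $(w^T x - y)^2 = (ys - 1)^2$ for $y \in \{-1, +1\}$); the snippet below is my own illustration.

import numpy as np

def err_01(ys):
    return (ys <= 0).astype(float)               # 0/1 error: 1 iff ys <= 0

def err_sqr(ys):
    return (ys - 1.0) ** 2                       # squared error: (w^T x - y)^2 = (ys - 1)^2

def err_ce(ys):
    return np.logaddexp(0.0, -ys)                # cross-entropy error: ln(1 + exp(-ys))

# e.g. at the boundary ys = 0: err_01 = 1, err_sqr = 1, err_ce = ln 2 (about 0.693)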
can linear regression or logistic regression
help linear classification?
[figure: err versus $ys$ for the 0/1, squared, and (scaled) cross-entropy errors]
• 0/1: err $= 1$ iff $ys \le 0$
• sqr: large if $ys \ll 1$, but over-charges when $ys \gg 1$ (it still penalizes points that 0/1 already gets right); small err$_{\text{SQR}}$ $\rightarrow$ small err$_{0/1}$
• ce: monotonic in $ys$; small err$_{\text{CE}}$ $\leftrightarrow$ small err$_{0/1}$
• scaled ce: a proper upper bound of 0/1 (the 0/1 and scaled-ce views give similar results); small err$_{\text{SCE}}$ $\leftrightarrow$ small err$_{0/1}$
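To spell out the last bullet (a standard argument, written here for completeness): scaling the cross-entropy error by $\log_2 e$ gives
\[
\text{err}_{\text{SCE}}(s, y) = \log_2\left(1 + \exp(-ys)\right) = \tfrac{1}{\ln 2}\,\text{err}_{\text{CE}}(s, y),
\]
\[
ys \le 0 \;\Rightarrow\; \exp(-ys) \ge 1 \;\Rightarrow\; \text{err}_{\text{SCE}} \ge \log_2 2 = 1 = \text{err}_{0/1}, \qquad
ys > 0 \;\Rightarrow\; \text{err}_{\text{SCE}} > 0 = \text{err}_{0/1},
\]
so $\text{err}_{0/1}(s, y) \le \text{err}_{\text{SCE}}(s, y)$ for every $(s, y)$.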
upper bound:
useful for designing algorithm
VC on 0/1 (starting from $E_{\text{in}}$) vs. VC-Reg on CE (starting from $E_{\text{out}}$): either way the resulting bound is somewhat loose, because the cross-entropy error looks quite different from 0/1 when $ys$ is very positive or very negative.
small $E_{\text{in}}^{\text{CE}}(w) \implies$ small $E_{\text{out}}^{0/1}(w)$: logistic/linear reg. for linear classification
Questions?
(step by step, make $w$ better and better)
$w_{t+1} \leftarrow w_t + \eta v$
when stop, return last $w$ as $g$
• PLA: picks one point to correct per round
• logistic regression (gradient descent) update: $w_{t+1} \leftarrow w_t + \eta \cdot \frac{1}{N}\sum_{n=1}^{N} \theta\left(-y_n w_t^T x_n\right)\left(y_n x_n\right)$, which averages every point's gradient contribution before moving
• technique on removing $\frac{1}{N}\sum_{n=1}^{N}$: pick one point at random and follow its direction, instead of spending $N$ times the effort to compute and average the full gradient
• view as expectation $\mathcal{E}$ over uniform choice of $n$!
stochastic gradient: a random gradient, not the true one, but the gradient evaluated at one randomly chosen point
PLA Revisited
SGD logistic regression: $w_{t+1} \leftarrow w_t + \eta \cdot \theta\left(-y_n w_t^T x_n\right)\left(y_n x_n\right)$
PLA: $w_{t+1} \leftarrow w_t + 1 \cdot \left\llbracket y_n \ne \text{sign}\left(w_t^T x_n\right)\right\rrbracket\left(y_n x_n\right)$
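To make the analogy concrete, here is a small sketch (my own) of the two update rules side by side; it reuses the theta helper from earlier, and the learning rate, round count, and random seed are arbitrary.

import numpy as np

def sgd_logistic(X, y, eta=0.1, T=10000, seed=0):
    """Each round: pick one example at random, move by eta * theta(-y_n w^T x_n) * y_n x_n."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(T):
        n = rng.integers(len(y))                          # uniform choice of n
        w += eta * theta(-y[n] * (X[n] @ w)) * y[n] * X[n]
    return w

def pla(X, y, T=10000, seed=0):
    """Same skeleton, but a hard 0/1 mistake indicator and eta = 1 (SGD logistic regression acts like a 'soft' version of this)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(T):
        n = rng.integers(len(y))
        if np.sign(X[n] @ w) != y[n]:                     # [[ y_n != sign(w_t^T x_n) ]]
            w += y[n] * X[n]
    return w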
Questions?
Summary
1 Why Can Machines Learn?