
Machine Learning

(機器學習)
Lecture 5: Linear Models
Hsuan-Tien Lin (林軒田)
[email protected]

Department of Computer Science


& Information Engineering
National Taiwan University
(↵Àc'x«⌦Â↵˚)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 0/52


Linear Models

Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?

Lecture 5: Linear Models


Linear Regression Problem
Linear Regression Algorithm
Logistic Regression Problem
Logistic Regression Error
Gradient of Logistic Regression Error
Gradient Descent
Stochastic Gradient Descent

Hsuan-Tien Lin (NTU CSIE) Machine Learning 1/52


Linear Models Linear Regression Problem

Credit Limit Problem


age 23 years
gender female
annual salary NTD 1,000,000
year in residence 1 year
year in job 0.5 year
current debt 200,000
credit limit? 100,000

unknown target function f : X → Y
(ideal credit limit formula)
(how much credit limit should we give a customer? it differs from customer to customer)

training examples learning final hypothesis


D : (x1 , y1 ), · · · , (xN , yN ) algorithm g ≈ f
A
(historical records in bank) (‘learned’ formula to be used)

hypothesis set
H

(set of candidate formula)

key point of regression: the output space is the entire real line,
Y = R: regression
Hsuan-Tien Lin (NTU CSIE) Machine Learning 2/52
Linear Models Linear Regression Problem

Linear Regression Hypothesis


(what should H look like, when the output space is the whole real line?)

age 23 years
annual salary NTD 1,000,000
year in job 0.5 year
current debt 200,000

• For x = (x0, x1, x2, · · · , xd) 'features of customer' (the customer data),
  approximate the desired credit limit with a weighted sum (weights wi):

      y ≈ Σ_{i=0}^{d} wi xi

• linear regression hypothesis: h(x) = wT x

h(x): like perceptron, but without the sign


(PLA takes the sign of the score, positive or negative; here we do not.)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 3/52
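As an illustrative aside (not on the original slide): a minimal NumPy sketch of the hypothesis h(x) = w^T x, using made-up weights and the customer features from the slide, with x0 = 1 as the constant term.

    import numpy as np

    # hypothetical customer: x0 = 1 (constant), age, annual salary, year in job, current debt
    x = np.array([1.0, 23.0, 1_000_000.0, 0.5, 200_000.0])
    # made-up weights purely for illustration
    w = np.array([0.05, 100.0, 0.08, 2000.0, -0.2])

    h = w @ x          # h(x) = w^T x: a weighted sum, like a perceptron but without the sign
    print(h)           # a credit-limit-like score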


Linear Models Linear Regression Problem

Illustration of Linear Regression


x = (x) ∈ R                          x = (x1, x2) ∈ R^2
(in one dimension the linear hypothesis is a line; in three-dimensional space it forms a plane)
(the residuals measure how close each point is to the hypothesis)

[figure: left, data points and a fitted line in the (x, y) plane;
 right, data points and a fitted plane over (x1, x2)]
linear regression:
find lines/hyperplanes with small residuals

Hsuan-Tien Lin (NTU CSIE) Machine Learning 4/52


Linear Models Linear Regression Problem

Pointwise Error Measure for ‘Small Residuals’


final hypothesis
g ≈ f

how well? often use averaged err(g(x), f (x)), like

    Eout(g) = E_{x~P} [[ g(x) ≠ f(x) ]]
              (the quantity inside the expectation is err(g(x), f(x)))

—err: called pointwise error measure

in-sample:      Ein(g)  = (1/N) Σ_{n=1}^{N} err(g(xn), f(xn))
out-of-sample:  Eout(g) = E_{x~P} err(g(x), f(x))

will mainly consider pointwise err for simplicity


Hsuan-Tien Lin (NTU CSIE) Machine Learning 5/52
Linear Models Linear Regression Problem

Learning Flow with Pointwise Error Measure


y
unknown target
distribution P(y |x) unknown
containing f (x) + noise P on X

(ideal credit approval formula)


x1 , x2 , · · · , xN x
y1 , y2 , · · · , yN

training examples learning final hypothesis


D : (x1 , y1 ), · · · , (xN , yN ) algorithm g ≈ f
A
(historical records in bank) (‘learned’ formula to be used)

hypothesis set error measure


H err

(set of candidate formula)

extended VC theory/‘philosophy’
works for most H and err

Hsuan-Tien Lin (NTU CSIE) Machine Learning 6/52


Linear Models Linear Regression Problem

Two Important Pointwise Error Measures


    err( g(x), f(x) )
    (with ỹ denoting g(x) and y denoting f(x))

0/1 error:                               squared error:
    err(ỹ, y) = [[ ỹ ≠ y ]]               err(ỹ, y) = (ỹ − y)^2
    • correct or incorrect?               • how far is ỹ from y?
    • often for classification            • often for regression

squared error: quantify ‘small residual’

Hsuan-Tien Lin (NTU CSIE) Machine Learning 7/52


Linear Models Linear Regression Problem

Squared Error Measure for Regression


popular/historical error measure for linear regression:
squared error err(ŷ, y) = (ŷ − y)^2

in-sample:      Ein(w)  = (1/N) Σ_{n=1}^{N} ( h(xn) − yn )^2,  with h(xn) = w^T xn
out-of-sample:  Eout(w) = E_{(x,y)~P} ( w^T x − y )^2

next: how to minimize Ein (w)?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 8/52


Linear Models Linear Regression Problem

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 9/52


Linear Models Linear Regression Algorithm

Matrix Form of Ein (w)


how do we make Ein as small as possible?

    Ein(w) = (1/N) Σ_{n=1}^{N} ( w^T xn − yn )^2 = (1/N) Σ_{n=1}^{N} ( xn^T w − yn )^2
             (xn^T w: the hypothesis' prediction;  yn: the desired value)

           = (1/N) || [ x1^T w − y1 ] ||^2
                      [ x2^T w − y2 ]
                      [     ...     ]
                      [ xN^T w − yN ]

           = (1/N) || [ x1^T ]       [ y1 ] ||^2          (factoring out the common w)
                      [ x2^T ]  w −  [ y2 ]
                      [  ···  ]      [ ··· ]
                      [ xN^T ]       [ yN ]

           = (1/N) || X w − y ||^2,   with X: N×(d+1),  w: (d+1)×1,  y: N×1

Hsuan-Tien Lin (NTU CSIE) Machine Learning 10/52
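A small sketch (assumed toy data, NumPy) of the matrix form Ein(w) = (1/N)||Xw − y||^2 derived above:

    import numpy as np

    def E_in(w, X, y):
        # E_in(w) = (1/N) * || X w - y ||^2, the averaged squared error in matrix form
        return np.sum((X @ w - y) ** 2) / len(y)

    # toy data: N = 4 examples, d = 2 features plus the constant x0 = 1
    X = np.array([[1.0, 2.0, 3.0],
                  [1.0, 0.5, 1.0],
                  [1.0, 1.5, 2.5],
                  [1.0, 3.0, 0.5]])
    y = np.array([5.0, 2.0, 4.0, 3.5])
    print(E_in(np.zeros(3), X, y))   # Ein at w = 0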


Linear Models Linear Regression Algorithm
(the only unknown is w; X and y are known)

    min_w  Ein(w) = (1/N) || X w − y ||^2

• Ein(w): continuous, differentiable, convex
• necessary condition of 'best' w: the partial derivative in every direction is 0

    ∇Ein(w) ≡ [ ∂Ein/∂w0 (w), ∂Ein/∂w1 (w), ..., ∂Ein/∂wd (w) ]^T = [0, 0, ..., 0]^T

  at the lowest point, no direction can decrease the function value further
  —not possible to 'roll down'; the gradient is 0

task: find wLIN such that ∇Ein(wLIN) = 0

Hsuan-Tien Lin (NTU CSIE) Machine Learning 11/52


Linear Models Linear Regression Algorithm

The Gradient rEin (w)


    Ein(w) = (1/N) || X w − y ||^2
           = (1/N) ( w^T X^T X w − 2 w^T X^T y + y^T y )
             (write A = X^T X,  b = X^T y,  c = y^T y)

one w only (if w has a single dimension):       vector w:
  Ein(w) = (1/N)(a w^2 − 2 b w + c)               Ein(w) = (1/N)(w^T A w − 2 w^T b + c)
  ∇Ein(w) = (1/N)(2 a w − 2 b)                    ∇Ein(w) = (1/N)(2 A w − 2 b)
  simple! :-)                                     similar (derived by definition)

when is the gradient 0?

    ∇Ein(w) = (2/N) ( X^T X w − X^T y )

Hsuan-Tien Lin (NTU CSIE) Machine Learning 12/52


Linear Models Linear Regression Algorithm

Optimal Linear Regression Weights


task: find wLIN such that (2/N) ( X^T X w − X^T y ) = ∇Ein(w) = 0
(only w is unknown here)

invertible X^T X (the inverse exists):          singular X^T X:
• easy! unique solution                         • many optimal solutions
    wLIN = ( X^T X )^{-1} X^T  y                • one of the solutions:
  ( ( X^T X )^{-1} X^T is called the                wLIN = X† y,
    pseudo-inverse X† )                           by defining X† in other ways
• often the case because N ≫ d + 1

practical suggestion:
use a well-implemented † routine
instead of ( X^T X )^{-1} X^T
for numerical stability when almost-singular
Hsuan-Tien Lin (NTU CSIE) Machine Learning 13/52
Linear Models Linear Regression Algorithm

Linear Regression Algorithm


1  from D, construct input matrix X and output vector y (build the matrices):

       X = [ x1^T ]              y = [ y1 ]
           [ x2^T ]                  [ y2 ]
           [  ···  ]                 [ ··· ]
           [ xN^T ]                  [ yN ]
         (N×(d+1))                  (N×1)

2  calculate the pseudo-inverse X†   ((d+1)×N)

3  return wLIN = X† y   ((d+1)×1)

simple and efficient


with good † routine
Hsuan-Tien Lin (NTU CSIE) Machine Learning 14/52
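A minimal sketch of the three steps above in NumPy, assuming toy data; np.linalg.pinv computes an SVD-based pseudo-inverse, which matches the practical suggestion to use a well-implemented † routine rather than forming (X^T X)^{-1} X^T explicitly.

    import numpy as np

    def linear_regression(X, y):
        # steps 2-3: wLIN = X† y
        return np.linalg.pinv(X) @ y

    # toy data (assumed): y is roughly 1 + 2*x1 plus a little noise
    rng = np.random.default_rng(0)
    x1 = rng.uniform(0.0, 1.0, size=20)
    X = np.column_stack([np.ones_like(x1), x1])   # prepend x0 = 1
    y = 1.0 + 2.0 * x1 + 0.1 * rng.standard_normal(20)

    w_lin = linear_regression(X, y)
    print(w_lin)   # should be close to [1, 2]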
Linear Models Linear Regression Algorithm

Is Linear Regression a ‘Learning Algorithm’?


wLIN = X† y

No!                                          Yes! (in some respects it is a machine learning algorithm)

• analytic (closed-form) solution,           1  good Ein? yes, optimal!
  'instantaneous'                            2  good Eout? yes, finite dVC like perceptrons
• not improving Ein nor Eout iteratively        (with enough data, good Ein leads to good Eout)
  (the result comes out in one shot—         3  improving iteratively?
   is there a learning process? it does         somewhat, within an iterative
   not update as data is observed!)             pseudo-inverse routine

if Eout (wLIN ) is good, learning ‘happened’!

Hsuan-Tien Lin (NTU CSIE) Machine Learning 15/52


Linear Regression Generalization Issue

Benefit of Analytic Solution:
'Simpler-than-VC' Guarantee
(how do we guarantee that Ein is good?)

to be shown:   Ē_in = E_{D~P^N} { Ein(wLIN w.r.t. D) } = noise level · (1 − (d+1)/N)
(noise level: the noise in the data; the expectation averages over all datasets D that could be drawn)

    Ein(wLIN) = (1/N) || y − ŷ ||^2 = (1/N) || y − X X† y ||^2
                (ŷ: the predictions, with X† y = wLIN)
              = (1/N) || ( I − X X† ) y ||^2        (I: identity)

call XX† the hat matrix H


because it puts ^ on y

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 14/23


Linear Regression Generalization Issue

Geometric View of Hat Matrix

[figure: the vector y, its projection ŷ onto the span of the columns of X in R^N,
 and the residual y − ŷ perpendicular to that span]

• ŷ = X wLIN lies within the span of the columns of X
• y − ŷ smallest: y − ŷ ⊥ span
• H: projects y to ŷ ∈ span
• I − H: transforms y to y − ŷ ⊥ span

claim: trace(I − H) = N − (d + 1). Why? :-)
(the trace adds up all values on the diagonal)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/23
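A quick numerical check of the claim trace(I − H) = N − (d + 1), using an arbitrary random toy X (an assumption made purely for illustration):

    import numpy as np

    N, d = 10, 3
    rng = np.random.default_rng(1)
    X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])   # N x (d+1)
    H = X @ np.linalg.pinv(X)                                        # hat matrix H = X X†
    print(np.trace(np.eye(N) - H))                                   # approximately N - (d+1) = 6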


Linear Regression Generalization Issue

An Illustrative ‘Proof’
[figure: y = f(X) + noise in R^N; its projection ŷ onto the span of X; residual y − ŷ]

• if y comes from some ideal f(X) ∈ span plus noise
• noise with per-dimension 'noise level' σ^2 is transformed by I − H into y − ŷ

    Ein(wLIN) = (1/N) || y − ŷ ||^2 = (1/N) || (I − H) noise ||^2
              = (1/N) ( N − (d + 1) ) σ^2

    Ē_in  = σ^2 · (1 − (d+1)/N)      (on the data we see, Ein looks good)
    Ē_out = σ^2 · (1 + (d+1)/N)      (complicated! judged on new, incoming data;
                                      on average Ein and Eout differ by 2(d+1)/N)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 16/23
Linear Models Linear Regression Algorithm

The Learning Curves of Linear Regression


(proof skipped this year)

    Ē_out = noise level · (1 + (d+1)/N)
    Ē_in  = noise level · (1 − (d+1)/N)

[learning-curve figure: expected error versus number of data points N,
 with Eout decreasing and Ein increasing toward the noise level σ^2]

• both converge to σ^2 (noise level) for N → ∞
• expected generalization error: 2(d+1)/N
  —similar to worst-case guarantee from VC

linear regression (LinReg):


learning ‘happened’!
Hsuan-Tien Lin (NTU CSIE) Machine Learning 16/52
Linear Models Linear Regression Algorithm

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 17/52


Linear Models Logistic Regression Problem

Heart Attack Prediction Problem (1/2)


age 40 years
gender male
blood pressure 130/85
cholesterol level 240
weight 70
heart disease? yes    (a binary classification problem)

unknown target distribution P(y|x) containing f(x) + noise

training examples learning final hypothesis


D : (x1 , y1 ), · · · , (xN , yN ) algorithm g ≈ f
A

hypothesis set error measure


H err

binary classification:
ideal f(x) = sign( P(+1|x) − 1/2 ) ∈ {−1, +1}
because of classification err

Hsuan-Tien Lin (NTU CSIE) Machine Learning 18/52


Linear Models Logistic Regression Problem

Heart Attack Prediction Problem (2/2)


age 40 years
gender male
blood pressure 130/85
cholesterol level 240
weight 70
heart attack? 80% risk    (the risk of having a heart attack)

unknown target distribution P(y|x) containing f(x) + noise

training examples learning final hypothesis


D : (x1 , y1 ), · · · , (xN , yN ) algorithm g ≈ f
A

hypothesis set error measure


H err

‘soft’ binary classification:


f(x) = P(+1|x) ∈ [0, 1]
(this value, the risk, is what we want to know)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 19/52


Linear Models Logistic Regression Problem

Soft Binary Classification


target function f(x) = P(+1|x) ∈ [0, 1]

ideal (noiseless) data                         actual (noisy) data
 (x1, y1' = 0.9 = P(+1|x1))                     (x1, y1 = ○), with y1 ~ P(y|x1)
 (x2, y2' = 0.2 = P(+1|x2))                     (x2, y2 = ×), with y2 ~ P(y|x2)
          ...                                            ...
 (xN, yN' = 0.6 = P(+1|xN))                     (xN, yN = ×), with yN ~ P(y|xN)

(but we may not have the ideal data on hand; what we have is the same as before,
 only correct/incorrect labels)

same data as hard binary classification,


different target function

Hsuan-Tien Lin (NTU CSIE) Machine Learning 20/52


Linear Models Logistic Regression Problem

Soft Binary Classification


target function f(x) = P(+1|x) ∈ [0, 1]
(view the data on the left as having noise: the observed labels are samples drawn
 according to 0.9, 0.2, . . .)

ideal (noiseless) data                         actual (noisy) data
 (x1, y1' = 0.9 = P(+1|x1))                     (x1, y1' = 1), viewed as a noisy sample, y1 ~ P(y|x1)
 (x2, y2' = 0.2 = P(+1|x2))                     (x2, y2' = 0), viewed as a noisy sample, y2 ~ P(y|x2)
          ...                                            ...
 (xN, yN' = 0.6 = P(+1|xN))                     (xN, yN' = 0), viewed as a noisy sample, yN ~ P(y|xN)

same data as hard binary classification,
different target function
(how can we find a good hypothesis, when what we care about is this probability
 but the observed outputs are only ±1 labels?)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 20/52


Linear Models Logistic Regression Problem

Logistic Hypothesis (commonly used)
age 40 years
gender male
blood pressure 130/85
cholesterol level 240

• For x = (x0, x1, x2, · · · , xd) 'features of patient', calculate a weighted 'risk score'
  (a weighted sum of the features, with x0 the constant term):

      s = Σ_{i=0}^{d} wi xi

• convert the score to an estimated probability (a value between 0 and 1)
  by the S-shaped logistic function θ(s)

[figure: θ(s) rising from 0 to 1 as s goes from −∞ to +∞]

logistic hypothesis: h(x) = θ(w^T x)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 21/52


Linear Models Logistic Regression Problem

Logistic Function
[figure: the logistic function θ(s), rising from 0 to 1]

    θ(−∞) = 0;   θ(0) = 1/2;   θ(+∞) = 1

    θ(s) = e^s / (1 + e^s) = 1 / (1 + e^{−s})
    (close to 0 when the score is low, close to 1 when the score is high)

—smooth, monotonic, sigmoid function of s

logistic regression: use

    h(x) = 1 / (1 + exp(−w^T x))

to approximate target function f(x) = P(+1|x)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 22/52
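A tiny NumPy sketch of the logistic function and the logistic hypothesis, just to make the formulas above concrete:

    import numpy as np

    def theta(s):
        # logistic function theta(s) = 1 / (1 + exp(-s)): smooth, monotonic, sigmoid
        return 1.0 / (1.0 + np.exp(-s))

    def logistic_h(w, x):
        # logistic hypothesis h(x) = theta(w^T x), an estimated probability in [0, 1]
        return theta(w @ x)

    # sanity checks from the slide: theta(0) = 1/2, theta(-inf) -> 0, theta(+inf) -> 1
    print(theta(0.0), theta(-10.0), theta(10.0))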


Linear Models Logistic Regression Problem

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 23/52


Linear Models Logistic Regression Error

Three Linear Models


linear scoring function: s = wT x

linear classification              linear regression              logistic regression
  h(x) = sign(s)                     h(x) = s                       h(x) = θ(s)

[figure: three diagrams, each feeding x0, x1, x2, ..., xd through the score s = w^T x into h(x);
 classification takes the sign of the score, regression outputs the score itself,
 logistic regression passes the score through the sigmoid]

  plausible err = 0/1                friendly err = squared          err = ?
  (small flipping noise)             (easy to minimize)
  (PLA tries to make as few classification errors as possible)

how to define
Ein (w) for logistic regression?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 24/52


Linear Models Logistic Regression Error

Likelihood

target function f(x) = P(+1|x)  ⟺  P(y|x) = f(x) for y = +1, and 1 − f(x) for y = −1
(y takes one of two values)

consider D = {(x1, ○), (x2, ×), . . . , (xN, ×)}
(the probability of generating this data is ...)

probability that f generates D               likelihood that h generates D
  P(x1) P(○|x1) ×                              P(x1) h(x1) ×
  P(x2) P(×|x2) ×                              P(x2) (1 − h(x2)) ×
  ...                                          ...
  P(xN) P(×|xN)                                P(xN) (1 − h(xN))

(if we pretend that h is f, this product measures how likely such data would be observed;
 since h is not really f, it is called the likelihood — which we will maximize)

• if h ≈ f (then the likelihood of h is similar to the probability under f, when h is good),
  then likelihood(h) ≈ probability using f
• probability using f usually large
Hsuan-Tien Lin (NTU CSIE) Machine Learning 25/52
Linear Models Logistic Regression Error

Likelihood

target function f(x) = P(+1|x)  ⟺  P(y|x) = f(x) for y = +1, and 1 − f(x) for y = −1
(assume the examples in D are i.i.d.)

consider D = {(x1, ○), (x2, ×), . . . , (xN, ×)}

probability that f generates D               likelihood that h generates D  (pretend h is f)
  P(x1) f(x1) ×                                P(x1) h(x1) ×
  P(x2) (1 − f(x2)) ×                          P(x2) (1 − h(x2)) ×
  ...                                          ...
  P(xN) (1 − f(xN))                            P(xN) (1 − h(xN))

• if h ≈ f,
  then likelihood(h) ≈ probability using f
• probability using f usually large
Hsuan-Tien Lin (NTU CSIE) Machine Learning 25/52
Linear Models Logistic Regression Error

Likelihood of Logistic Hypothesis


likelihood(h) ≈ (probability using f) ≈ large

    g = argmax_h  likelihood(h)

when logistic: h(x) = θ(w^T x),
    1 − h(x) = h(−x)
(a symmetry property of the logistic function: rotating its graph by 180° gives the same curve)

likelihood(h) = P(x1) h(x1) × P(x2) (1 − h(x2)) × . . . × P(xN) (1 − h(xN))
(the P(xn) factors are the same for every h)

    likelihood(logistic h) ∝ ∏_{n=1}^{N} h(yn xn)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 26/52


Linear Models Logistic Regression Error

Likelihood of Logistic Hypothesis


likelihood(h) ≈ (probability using f) ≈ large

    g = argmax_h  likelihood(h)

when logistic: h(x) = θ(w^T x),   1 − h(x) = h(−x)

                  (○)               (×)                     (×)
likelihood(h) = P(x1) h(+x1) × P(x2) h(−x2) × . . . × P(xN) h(−xN)
              = P(x1) h(y1 x1) × P(x2) h(y2 x2) × . . . × P(xN) h(yN xN)

    likelihood(logistic h) ∝ ∏_{n=1}^{N} h(yn xn)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 26/52


Linear Models Logistic Regression Error

Cross-Entropy Error
    max_h  likelihood(logistic h) ∝ ∏_{n=1}^{N} h(yn xn)

(if the label is ○, use +xn; if the label is ×, use −xn)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/52


Linear Models Logistic Regression Error

Cross-Entropy Error
    max_w  likelihood(w) ∝ ∏_{n=1}^{N} θ( yn w^T xn )

(written in terms of w, the quantity we are actually interested in)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/52


Linear Models Logistic Regression Error

Cross-Entropy Error
    max_w  ln ∏_{n=1}^{N} θ( yn w^T xn )  =  max_w  Σ_{n=1}^{N} ln θ( yn w^T xn )

(taking the logarithm turns the product into a sum)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/52


Linear Models Logistic Regression Error

Cross-Entropy Error
    min_w  (1/N) Σ_{n=1}^{N} − ln θ( yn w^T xn )

(flip max into min by negating; this becomes the error function)

with θ(s) = 1 / (1 + exp(−s)):

    min_w  (1/N) Σ_{n=1}^{N} ln( 1 + exp(−yn w^T xn) )

    ⟹  min_w  (1/N) Σ_{n=1}^{N} err(w, xn, yn)        (this sum is Ein(w))

    err(w, x, y) = ln( 1 + exp(−y w^T x) ):
    cross-entropy error (entropy: a measure of disorder)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/52
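A short sketch (toy data assumed) of the cross-entropy in-sample error just derived:

    import numpy as np

    def cross_entropy_E_in(w, X, y):
        # E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n w^T x_n)), with y_n in {-1, +1}
        s = y * (X @ w)                        # correctness scores y_n w^T x_n
        return np.mean(np.log1p(np.exp(-s)))   # log1p(z) = ln(1 + z)

    # toy data (assumed)
    X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
    y = np.array([+1.0, -1.0, +1.0])
    print(cross_entropy_E_in(np.zeros(2), X, y))   # equals ln 2 when w = 0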


Linear Models Logistic Regression Error

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 28/52


Linear Models Gradient of Logistic Regression Error

Minimizing Ein (w)


    min_w  Ein(w) = (1/N) Σ_{n=1}^{N} ln( 1 + exp(−yn w^T xn) )

• Ein(w): continuous, differentiable, twice-differentiable, convex
  (shaped like a valley, a bowl opening upward)
• how to minimize? locate the valley:
  want ∇Ein(w) = 0

first: derive ∇Ein(w)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 29/52


Linear Models Gradient of Logistic Regression Error

The Gradient
    Ein(w) = (1/N) Σ_{n=1}^{N} ln( 1 + exp( −yn w^T xn ) )

(chain rule, writing ○ = −yn w^T xn and □ = 1 + exp(○))

    ∂Ein(w)/∂wi = (1/N) Σ_{n=1}^{N} ( ∂ln(□)/∂□ ) ( ∂(1 + exp(○))/∂○ ) ( ∂(−yn w^T xn)/∂wi )
                = (1/N) Σ_{n=1}^{N} ( 1/□ ) ( exp(○) ) ( −yn xn,i )
                = (1/N) Σ_{n=1}^{N} ( exp(○) / (1 + exp(○)) ) ( −yn xn,i )
                = (1/N) Σ_{n=1}^{N} θ(○) ( −yn xn,i )

    ∇Ein(w) = (1/N) Σ_{n=1}^{N} θ( −yn w^T xn ) ( −yn xn )

Hsuan-Tien Lin (NTU CSIE) Machine Learning 30/52
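A small NumPy sketch of the gradient formula just derived, on assumed toy data:

    import numpy as np

    def theta(s):
        return 1.0 / (1.0 + np.exp(-s))

    def gradient_E_in(w, X, y):
        # grad E_in(w) = (1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)
        weights = theta(-y * (X @ w))                 # theta(-y_n w^T x_n), one per example
        return -(X * (weights * y)[:, None]).mean(axis=0)

    # toy data (assumed): at w = 0 each theta(.) = 1/2, so the gradient is -(1/2N) * sum_n y_n x_n
    X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
    y = np.array([+1.0, -1.0, +1.0])
    print(gradient_E_in(np.zeros(2), X, y))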




Linear Models Gradient of Logistic Regression Error

Minimizing Ein (w)


    min_w  Ein(w) = (1/N) Σ_{n=1}^{N} ln( 1 + exp(−yn w^T xn) )

    want ∇Ein(w) = (1/N) Σ_{n=1}^{N} θ( −yn w^T xn ) ( −yn xn ) = 0

—a scaled, θ-weighted sum of the yn xn

• all θ(·) = 0: only if yn w^T xn ≫ 0
  —linearly separable D
• weighted sum = 0:
  non-linear equation of w

closed-form solution? no :-(
Hsuan-Tien Lin (NTU CSIE) Machine Learning 31/52
Linear Models Gradient of Logistic Regression Error

PLA Revisited: Iterative Optimization


PLA: start from some w0 (say, 0), and ‘correct’ its mistakes on D

For t = 0, 1, . . .
1  find a mistake of wt, called ( xn(t), yn(t) ):

       sign( wt^T xn(t) ) ≠ yn(t)

2  (try to) correct the mistake by

       wt+1 ← wt + yn(t) xn(t)

when stop, return last w as g

Hsuan-Tien Lin (NTU CSIE) Machine Learning 32/52
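For reference, a minimal sketch of the PLA update above (toy data assumed; the 'pick a mistake' rule here simply takes the first mistaken example):

    import numpy as np

    def pla(X, y, max_iter=1000):
        # start from w0 = 0 and keep correcting one mistake: w <- w + y_n x_n
        w = np.zeros(X.shape[1])
        for _ in range(max_iter):
            mistakes = np.sign(X @ w) != y
            if not mistakes.any():
                break                          # no mistakes left (D was linearly separable)
            n = np.flatnonzero(mistakes)[0]    # pick a mistaken example
            w = w + y[n] * X[n]                # (try to) correct it
        return w

    # toy linearly separable data (assumed): label is the sign of x1
    X = np.array([[1.0, 2.0], [1.0, 0.5], [1.0, -1.0], [1.0, -2.0]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])
    print(pla(X, y))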


Linear Models Gradient of Logistic Regression Error

PLA Revisited: Iterative Optimization


PLA: start from some w0 (say, 0), and ‘correct’ its mistakes on D

For t = 0, 1, . . .
1  find a mistake of wt, called ( xn(t), yn(t) ):

       sign( wt^T xn(t) ) ≠ yn(t)

2  (try to) correct the mistake by

       wt+1 ← wt + yn(t) xn(t)

1  (equivalently) pick some n, and update wt by

       wt+1 ← wt + [[ sign( wt^T xn ) ≠ yn ]] · yn xn     (the update direction)

when stop, return last w as g

Hsuan-Tien Lin (NTU CSIE) Machine Learning 32/52


Linear Models Gradient of Logistic Regression Error

PLA Revisited: Iterative Optimization


PLA: start from some w0 (say, 0), and ‘correct’ its mistakes on D

For t = 0, 1, . . .
1  (equivalently) pick some n, and update wt by

       wt+1 ← wt + 1 · [[ sign( wt^T xn ) ≠ yn ]] · yn xn
       (η = 1: the step size;  v = [[ sign(wt^T xn) ≠ yn ]] yn xn: the update direction)

when stop, return last w as g

choice of (η, v) and stopping condition defines
an iterative optimization approach

Hsuan-Tien Lin (NTU CSIE) Machine Learning 32/52


Linear Models Gradient of Logistic Regression Error

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 33/52


Linear Models Gradient Descent

Iterative Optimization
For t = 0, 1, . . .
       wt+1 ← wt + ηv          (η plays the role of a learning rate)
when stop, return last w as g
(two things to decide: the direction v and the step size η)

• PLA: v comes from mistake correction
• smooth Ein(w) for logistic regression:
  choose v to get the ball to roll 'downhill'?
  [figure: in-sample error Ein versus weights w, a ball rolling down a smooth valley]
• direction v: (assumed) of unit length
• step size η: (assumed) positive

a greedy approach for some given η > 0:

    min_{||v||=1}  Ein( wt + ηv )         (the minimizer defines wt+1)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 34/52


Linear Models Gradient Descent

Linear Approximation
a greedy approach for some given η > 0:

    min_{||v||=1}  Ein( wt + ηv )

• still non-linear optimization, now with constraints
  —not any easier than min_w Ein(w)
• local approximation by a linear formula makes the problem easier:

    Ein( wt + ηv ) ≈ Ein(wt) + η v^T ∇Ein(wt)

  if η really small (Taylor expansion)
  (in one dimension the expansion is the tangent line; here it extends to multiple dimensions)

an approximate greedy approach for some given small η:

    min_{||v||=1}  Ein(wt) + η v^T ∇Ein(wt)
                   (known)  (given, (known)
                            positive)
  (we want to make the last term, v^T ∇Ein(wt), as small as possible)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 35/52
Linear Models Gradient Descent

Gradient Descent
an approximate greedy approach for some given small η:

    min_{||v||=1}  Ein(wt) + η v^T ∇Ein(wt)
                   (known)  (given, (known)
                            positive)

• optimal v: the opposite direction of ∇Ein(wt)

    v = − ∇Ein(wt) / ||∇Ein(wt)||        (the negative, normalized gradient direction)

• gradient descent: for small η, update

    wt+1 ← wt − η ∇Ein(wt) / ||∇Ein(wt)||

gradient descent:
a simple & popular optimization tool

Hsuan-Tien Lin (NTU CSIE) Machine Learning 36/52


Linear Models Gradient Descent
Choice of η

too small:                                     too large:
[figure: tiny steps down the error valley]     [figure: big jumps across the valley, possibly even
                                                upward, where the Taylor approximation no longer holds]
too slow :-( (small steps)                     too unstable :-( (large steps)

a naive yet effective heuristic

• make the (red) η proportional to ||∇Ein(wt)||, with the (purple) η as the fixed learning rate:

    wt+1 ← wt − η ∇Ein(wt) / ||∇Ein(wt)||  =  wt − η ∇Ein(wt)

fixed learning rate gradient descent:

    wt+1 ← wt − η ∇Ein(wt)


Hsuan-Tien Lin (NTU CSIE) Machine Learning 37/52
Linear Models Gradient Descent

Putting Everything Together


Logistic Regression Algorithm
initialize w0
For t = 0, 1, · · ·
1  compute

       ∇Ein(wt) = (1/N) Σ_{n=1}^{N} θ( −yn wt^T xn ) ( −yn xn )

2  update by

       wt+1 ← wt − η ∇Ein(wt)

...until ∇Ein(wt+1) = 0 or enough iterations


return last wt+1 as g

O(N) time complexity in step 1 per iteration


Hsuan-Tien Lin (NTU CSIE) Machine Learning 38/52
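A compact sketch of the whole algorithm (fixed learning rate, toy data assumed); the stopping rule here simply checks for a near-zero gradient or runs out of iterations:

    import numpy as np

    def theta(s):
        return 1.0 / (1.0 + np.exp(-s))

    def logistic_regression_gd(X, y, eta=0.1, max_iter=1000):
        # fixed-learning-rate gradient descent: w <- w - eta * grad E_in(w)
        w = np.zeros(X.shape[1])
        for _ in range(max_iter):
            grad = -(X * (theta(-y * (X @ w)) * y)[:, None]).mean(axis=0)
            if np.linalg.norm(grad) < 1e-6:    # gradient (approximately) zero: stop
                break
            w = w - eta * grad
        return w

    # toy data (assumed): positive when x1 > 0
    X = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.5], [1.0, -0.5]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])
    print(logistic_regression_gd(X, y))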
Linear Models Gradient Descent

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 39/52


Linear Models Gradient Descent

Linear Models Revisited


linear scoring function: s = wT x

linear classification              linear regression              logistic regression
  h(x) = sign(s)                     h(x) = s                       h(x) = θ(s)

[figure: three diagrams, each feeding x0, x1, ..., xd through the score s into h(x)]

  plausible err = 0/1                friendly err = squared          plausible err = cross-entropy
  (right or wrong)                   (make the error small)          (a probability in [0, 1])
  discrete Ein(w):                   quadratic convex Ein(w):        smooth convex Ein(w):
  NP-hard to solve in general        closed-form solution            gradient descent
  hard                               easy                            easy
can linear regression or logistic regression
help linear classification?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 40/52


Linear Models Gradient Descent

Error Functions Revisited


linear scoring function: s = wT x

for binary classification y ∈ {−1, +1}

linear classification              linear regression              logistic regression
  h(x) = sign(s)                     h(x) = s                       h(x) = θ(s)
  err(h, x, y) = [[ h(x) ≠ y ]]      err(h, x, y) = (h(x) − y)^2    err(h, x, y) = −ln h(y x)
  (check whether the hypothesis output matches the label)

rewritten in terms of the score s = w^T x:

  err0/1(s, y)                       errSQR(s, y)                   errCE(s, y)
  = [[ sign(s) ≠ y ]]                = (s − y)^2                    = ln( 1 + exp(−y s) )
  = [[ sign(y s) ≠ 1 ]]              = (y s − 1)^2

(y s): classification correctness score (the larger, the better)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 41/52
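A small sketch evaluating the four error functions at a few correctness scores ys (values chosen arbitrarily), mirroring the comparison plotted on the next slide:

    import numpy as np

    def err_01(ys):  return (np.sign(ys) != 1).astype(float)   # 1 iff ys <= 0
    def err_sqr(ys): return (ys - 1.0) ** 2                     # (ys - 1)^2
    def err_ce(ys):  return np.log1p(np.exp(-ys))               # ln(1 + exp(-ys))
    def err_sce(ys): return np.log2(1.0 + np.exp(-ys))          # log2(1 + exp(-ys))

    ys = np.array([-2.0, 0.0, 1.0, 2.0])
    for f in (err_01, err_sqr, err_ce, err_sce):
        print(f.__name__, f(ys))   # err_sce upper-bounds err_01 at every ys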


Linear Models Gradient Descent

Visualizing Error Functions (comparing the shapes of the error functions, drawn on the same plot)

  0/1         err0/1(s, y)  = [[ sign(y s) ≠ 1 ]]
  sqr         errSQR(s, y)  = (y s − 1)^2
  ce          errCE(s, y)   = ln( 1 + exp(−y s) )
  scaled ce   errSCE(s, y)  = log2( 1 + exp(−y s) )

[figure: the four error curves plotted against ys over roughly −3 ≤ ys ≤ 3]

• 0/1: 1 iff ys ≤ 0
• sqr: large if ys ≪ 1, but over-charges ys ≫ 1;
  small errSQR → small err0/1
• ce: monotonic in ys;
  small errCE ↔ small err0/1
• scaled ce: a proper upper bound of 0/1;
  small errSCE ↔ small err0/1

upper bound:
useful for designing algorithm
Hsuan-Tien Lin (NTU CSIE) Machine Learning 42/52
Linear Models Gradient Descent

Learning Flow with Algorithmic Error Measure


y
unknown target
distribution P(y |x) unknown
containing f (x) + noise P on X

(ideal credit approval formula)


x1 , x2 , · · · , xN x
y1 , y2 , · · · , yN

training examples learning final hypothesis


D : (x1 , y1 ), · · · , (xN , yN ) algorithm g ≈ f
A
êrr
(historical records in bank) (‘learned’ formula to be used)

hypothesis set error measure


H err

(set of candidate formula)

err: the goal, not always easy to optimize;
êrr: something 'similar' to facilitate A, e.g.
an upper bound
Hsuan-Tien Lin (NTU CSIE) Machine Learning 43/52
Linear Models Gradient Descent

Theoretical Implication of Upper Bound


For any ys where s = w^T x
(if what we care about is the 0/1 error, the scaled cross-entropy is its upper bound):

    err0/1(s, y) ≤ errSCE(s, y) = (1/ln 2) errCE(s, y)

    ⟹  Ein0/1(w)  ≤ EinSCE(w)  = (1/ln 2) EinCE(w)       (averaged over the examples)
        Eout0/1(w) ≤ EoutSCE(w) = (1/ln 2) EoutCE(w)

VC on 0/1 (starting from Ein):                 VC-Reg on CE (starting from Eout):

    Eout0/1(w) ≤ Ein0/1(w) + Ω0/1                  Eout0/1(w) ≤ (1/ln 2) EoutCE(w)
               ≤ (1/ln 2) EinCE(w) + Ω0/1                     ≤ (1/ln 2) EinCE(w) + (1/ln 2) ΩCE

small EinCE(w) ⟹ small Eout0/1(w):
logistic/linear reg. for linear classification

Hsuan-Tien Lin (NTU CSIE) Machine Learning 44/52


Linear Models Gradient Descent
Regression for Classification
(use regression to get a score, then take the sign of the score)

1  run logistic/linear reg. on D with yn ∈ {−1, +1} to get wREG
2  return g(x) = sign( wREG^T x )

PLA                                linear regression                  logistic regression
• pros: efficient +                • pros: 'easiest' optimization     • pros: 'easy' optimization
  strong guarantee                   (solved by a single formula)     • cons: loose bound of
  if lin. separable                • cons: loose bound of err0/1        err0/1 for very
• cons: works only if                for large |ys|                     negative ys
  lin. separable                     (when |ys| is large, the bound
                                      is much looser than err0/1)

• linear regression sometimes used to set w0 for PLA/logistic regression
  (as an initialization: if linear regression looks reasonable, take its w as w0
   and continue optimizing from that good starting point)
• logistic regression often preferred in practice

Hsuan-Tien Lin (NTU CSIE) Machine Learning 45/52
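A minimal sketch of this recipe with linear regression as the regressor (toy data assumed):

    import numpy as np

    def regression_for_classification(X, y):
        # 1. run linear regression on D with y_n in {-1, +1} to get wREG
        w_reg = np.linalg.pinv(X) @ y
        # 2. classify with g(x) = sign(wREG^T x)
        return lambda X_new: np.sign(X_new @ w_reg)

    # toy data (assumed)
    X = np.array([[1.0, 2.0], [1.0, 1.5], [1.0, -1.0], [1.0, -2.0]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])
    g = regression_for_classification(X, y)
    print(g(X))   # reproduces the training labels on this toy set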


Linear Models Gradient Descent

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 46/52


Linear Models Stochastic Gradient Descent

Two Iterative Optimization Schemes


For t = 0, 1, . . .
       wt+1 ← wt + ηv
when stop, return last w as g
(step by step, make w better and better; each round, one update)

PLA                                            logistic regression
  pick (xn, yn) and decide wt+1 by               check D and decide wt+1 (or
  the one example                                new ŵ) by all examples
  O(1) time per iteration :-)                    O(N) time per iteration :-(
  (looks at a single example each round)         (each round looks at every example, averages all
                                                  their gradient contributions, then moves along
                                                  the negative gradient)

logistic regression with
O(1) time per iteration?
Hsuan-Tien Lin (NTU CSIE) Machine Learning 47/52


Linear Models Stochastic Gradient Descent

Logistic Regression Revisited (the algorithm)

    wt+1 ← wt + η (1/N) Σ_{n=1}^{N} θ( −yn wt^T xn ) ( yn xn )
                  (the sum is −∇Ein(wt))

• want: update direction v ≈ −∇Ein(wt)
  while computing v by one single (xn, yn)
  (we want the update direction to stay close to the gradient direction,
   but without spending N times the effort to compute it)
• technique on removing (1/N) Σ_{n=1}^{N}:
  view it as an expectation E over a uniform choice of n!

stochastic gradient: ∇_w err(w, xn, yn) with random n
  (a random gradient: not the true overall one, but the one at a single random example)
true gradient: ∇_w Ein(w) = E_{random n} ∇_w err(w, xn, yn)
  (the true gradient is the expected value of this random process)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 48/52


Linear Models Stochastic Gradient Descent

Stochastic Gradient Descent (SGD)


stochastic gradient = true gradient + zero-mean 'noise' directions

Stochastic Gradient Descent

• idea: replace the true gradient by the stochastic gradient
  (descend using a random gradient instead of the true one)
• after enough steps,
  average true gradient ≈ average stochastic gradient
  (run long enough and, on average, the two agree)
• pros: simple & cheaper computation :-)
  —useful for big data or online learning
• cons: less stable in nature

SGD logistic regression, looks familiar? :-):

    wt+1 ← wt + η θ( −yn wt^T xn ) ( yn xn )
    (the added term is −∇err(wt, xn, yn), the update direction)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 49/52
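A minimal sketch of SGD logistic regression (toy data assumed; T and eta follow the rules of thumb on the next slide):

    import numpy as np

    def theta(s):
        return 1.0 / (1.0 + np.exp(-s))

    def logistic_regression_sgd(X, y, eta=0.1, T=10000, seed=0):
        # each step: pick one random example and move along its stochastic gradient direction
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(T):                 # rule of thumb: just run 'long enough'
            n = rng.integers(len(y))
            w = w + eta * theta(-y[n] * (w @ X[n])) * y[n] * X[n]   # the 'soft PLA' update
        return w

    # toy data (assumed), same as the gradient-descent sketch earlier
    X = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.5], [1.0, -0.5]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])
    print(logistic_regression_sgd(X, y))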


Linear Models Stochastic Gradient Descent

PLA Revisited
SGD logistic regression:

    wt+1 ← wt + η · θ( −yn wt^T xn ) ( yn xn )

PLA:

    wt+1 ← wt + 1 · [[ yn ≠ sign( wt^T xn ) ]] ( yn xn )

• SGD logistic regression ≈ 'soft' PLA
• PLA ≈ SGD logistic regression with η = 1 when wt^T xn is large

two practical rules of thumb:
• stopping condition? t large enough
  (hard to decide; trust that running long enough suffices)
• η? 0.1 when x is in a proper range

Hsuan-Tien Lin (NTU CSIE) Machine Learning 50/52


Linear Models Stochastic Gradient Descent

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 51/52


Linear Models Stochastic Gradient Descent

Summary
1 Why Can Machines Learn?

Lecture 4: Theory of Generalization


2 How Can Machines Learn?

Lecture 5: Linear Models


Linear Regression Problem
Linear Regression Algorithm
Logistic Regression Problem
Logistic Regression Error
Gradient of Logistic Regression Error
Gradient Descent
Stochastic Gradient Descent
• next: beyond simple linear models

Hsuan-Tien Lin (NTU CSIE) Machine Learning 52/52
