supervised learning
Carlo Ciliberto
Previous Classes
• Nearest neighbor
• Least-squares
• Support Vector Machines
• Logistic Regression
• Decision Trees
• Ensemble methods (Bagging, Boosting)
Outline
Refresher on the Learning Problem
$$\inf_{f : \mathcal{X} \to \mathcal{Y}} \mathcal{E}(f), \qquad \mathcal{E}(f) = \int_{\mathcal{X} \times \mathcal{Y}} \ell(f(x), y) \, d\rho(x, y)$$
A Wishlist
Empirical Risk as a Proxy
Empirical Vs Expected
Let $X$ and $(X_i)_{i=1}^n$ be i.i.d. random variables, $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$.
Then
$$\mathbb{E}[(\bar{X}_n - \mathbb{E}(X))^2] = \mathrm{Var}(\bar{X}_n) = \frac{\mathrm{Var}(X)}{n}$$
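A quick simulation can make the identity $\mathrm{Var}(\bar{X}_n) = \mathrm{Var}(X)/n$ concrete. The sketch below assumes $X$ is uniform on $[0, 1]$ (so $\mathrm{Var}(X) = 1/12$); the function name, the number of trials and the sample sizes are illustrative choices, not part of the lecture.

```python
import random

def mean_squared_deviation(n, trials=5000, seed=0):
    """Monte Carlo estimate of E[(mean of n samples - E[X])^2] for X ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    true_mean = 0.5  # E[X] for Uniform(0, 1)
    total = 0.0
    for _ in range(trials):
        sample_mean = sum(rng.random() for _ in range(n)) / n
        total += (sample_mean - true_mean) ** 2
    return total / trials

var_x = 1.0 / 12.0  # Var(X) for Uniform(0, 1)
for n in (1, 10, 100):
    # The two printed numbers should be close: simulated value vs Var(X) / n.
    print(n, mean_squared_deviation(n), var_x / n)
```

The simulated mean squared deviation should shrink roughly by a factor of 10 each time $n$ grows by a factor of 10.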
Empirical Vs Expected Risk
$$\mathbb{E}[(\mathcal{E}_n(f) - \mathcal{E}(f))^2] = \frac{V_f}{n}$$
where $V_f = \mathrm{Var}(\ell(f(x), y))$. In particular
$$\mathbb{E}[|\mathcal{E}_n(f) - \mathcal{E}(f)|] \le \sqrt{\frac{V_f}{n}}$$
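The bound on the expected gap between empirical and expected risk can be checked numerically for a fixed function. The sketch below is only an illustration under assumptions that are not in the slides: a toy distribution with $x$ uniform on $[-1, 1]$ and noise-free labels $y = 1$ if $x \ge 0$, the 0-1 loss, and a fixed threshold function at $0.3$, whose expected risk is then $0.15$.

```python
import math
import random

def threshold(a):
    """Return the threshold function x -> 1 if x >= a else 0."""
    return lambda x: 1 if x >= a else 0

def empirical_risk(f, data):
    """Average 0-1 loss of f on a list of (x, y) pairs."""
    return sum(1 for x, y in data if f(x) != y) / len(data)

def sample(n, rng):
    """Toy distribution (an assumption): x ~ Uniform(-1, 1), y = 1 if x >= 0 else 0."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 1 if x >= 0 else 0) for x in xs]

rng = random.Random(0)
f = threshold(0.3)
expected_risk = 0.15                        # P(0 <= x < 0.3) under the toy distribution
v_f = expected_risk * (1 - expected_risk)   # variance of a Bernoulli(0.15) loss

n, trials = 200, 2000
# Average |E_n(f) - E(f)| over many independent training sets of size n.
mean_abs_gap = sum(abs(empirical_risk(f, sample(n, rng)) - expected_risk)
                   for _ in range(trials)) / trials
bound = math.sqrt(v_f / n)
print(mean_abs_gap, bound)
```

The averaged gap should fall below $\sqrt{V_f / n}$, in line with the inequality above.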
Empirical Risk Minimization (ERM)
$$f_n = \arg\min_{f \in \mathcal{F}} \mathcal{E}_n(f)$$
Then...
$$\mathbb{E}[\mathcal{E}(f_n) - \mathcal{E}_n(f_n)]$$
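For a finite set of candidate functions, ERM reduces to evaluating the empirical risk of each candidate and keeping the smallest. The sketch below illustrates this with assumptions that are not in the slides: the 0-1 loss, a toy data set with 10% label noise around a true threshold at 0.25, and a hypothetical candidate set of 21 threshold functions.

```python
import random

def empirical_risk(f, data):
    """Average 0-1 loss of f on a list of (x, y) pairs."""
    return sum(1 for x, y in data if f(x) != y) / len(data)

def erm(hypotheses, data):
    """Return a hypothesis with the smallest empirical risk (ties: first found)."""
    return min(hypotheses, key=lambda f: empirical_risk(f, data))

def make_threshold(a):
    def f(x):
        return 1 if x >= a else 0
    f.a = a  # keep the threshold value for inspection
    return f

rng = random.Random(1)
data = []
for _ in range(500):
    x = rng.uniform(-1.0, 1.0)
    y = 1 if x >= 0.25 else 0      # toy ground truth (an assumption)
    if rng.random() < 0.1:         # flip 10% of the labels
        y = 1 - y
    data.append((x, y))

candidates = [make_threshold(k / 10) for k in range(-10, 11)]
f_n = erm(candidates, data)
print(f_n.a, empirical_risk(f_n, data))
```

The exhaustive search costs one pass over the data per candidate, which is only practical when the candidate set is small.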
Generalization Error
• $\mathbb{E}[\mathcal{E}_n(f_n)] = 0$
• $\mathbb{E}[\mathcal{E}(f_n)] = \mathcal{E}(0)$, which is greater than $\mathcal{E}(f_*)$ (unless $f_* \equiv 0$)
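The two bullets above are consistent with an estimator that memorizes the training set and predicts 0 everywhere else. The slide's own construction did not survive extraction, so the sketch below assumes that reading; the toy distribution (uniform inputs, labels $1$ for $x \ge 0$) and the 0-1 loss are also assumptions.

```python
import random

def make_memorizer(data):
    """Predict the stored label on training inputs and 0 everywhere else."""
    table = {x: y for x, y in data}
    return lambda x: table.get(x, 0)

def average_loss(f, data):
    """Average 0-1 loss of f on a list of (x, y) pairs."""
    return sum(1 for x, y in data if f(x) != y) / len(data)

def sample(n, rng):
    """Toy distribution (an assumption): x ~ Uniform(-1, 1), y = 1 if x >= 0 else 0."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 1 if x >= 0 else 0) for x in xs]

rng = random.Random(0)
train = sample(100, rng)
test = sample(10000, rng)      # fresh points stand in for the expected risk

f_n = make_memorizer(train)
train_risk = average_loss(f_n, train)   # empirical risk on the training set
test_risk = average_loss(f_n, test)     # close to the risk of the zero function
print(train_risk, test_risk)
```

The training risk is zero by construction, while the risk on fresh points stays near that of the constant function 0, which is about one half for this toy distribution.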
Overfitting
ERM on Finite Hypotheses Spaces
In particular, if $f_* \in \mathcal{H}$, then
$$\mathbb{E}\,|\mathcal{E}(f_n) - \mathcal{E}(f_*)| \le |\mathcal{H}| \sqrt{\frac{V_{\mathcal{H}}}{n}}$$
Example: Threshold functions
[Figure: plot over the horizontal range -1.5 to 1.5, with points labeled a and b]
with $[a]$ denoting the integer part (i.e. the closest integer) of a
scalar $a$. The value $p$ can be interpreted as the “precision” of
our space of functions $\mathcal{H}_p$. Note that $|\mathcal{H}_p| = 2 \cdot 10^p$.
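The count $|\mathcal{H}_p| = 2 \cdot 10^p$ can be reproduced by enumerating thresholds with $p$ decimal digits. The definition of $\mathcal{H}_p$ itself did not survive on this slide, so the sketch below assumes threshold functions $1_{[a, +\infty)}$ with $a$ on a grid of spacing $10^{-p}$ over $[-1, 1)$; the half-open endpoint convention is chosen only so that the count matches.

```python
def threshold_grid(p):
    """Thresholds a = k / 10**p for k = -10**p, ..., 10**p - 1 (a grid on [-1, 1))."""
    scale = 10 ** p
    return [k / scale for k in range(-scale, scale)]

def threshold_function(a):
    """Return the function x -> 1 if x >= a else 0."""
    return lambda x: 1 if x >= a else 0

for p in (1, 2, 3):
    grid = threshold_grid(p)
    # Number of thresholds in the grid next to the count stated on the slide.
    print(p, len(grid), 2 * 10 ** p)
```

Each extra digit of precision multiplies the number of hypotheses by 10.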
Rates in Expectation Vs Probability
Hoeffding’s Inequality
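Let $X_1, \dots, X_n$ be independent random variables with $X_i \in [a_i, b_i]$, and let $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$. Then,
$$P\left( |\bar{X} - \mathbb{E}\bar{X}| \ge \epsilon \right) \le 2 \exp\left( - \frac{2 n^2 \epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2} \right)$$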
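Hoeffding's inequality can be compared against simulated deviation frequencies. The sketch below assumes fair coin flips (values in $[0, 1]$, mean $0.5$); the function names and parameter values are illustrative only.

```python
import math
import random

def hoeffding_bound(n, eps, a=0.0, b=1.0):
    """Two-sided Hoeffding bound for the mean of n independent variables in [a, b]."""
    return 2.0 * math.exp(-2.0 * n * n * eps * eps / (n * (b - a) ** 2))

def deviation_frequency(n, eps, trials=5000, seed=0):
    """Fraction of trials where the mean of n fair coin flips deviates from 0.5 by eps or more."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(1 for _ in range(n) if rng.random() < 0.5) / n
        # Small tolerance so that rounding does not exclude deviations equal to eps.
        if abs(mean - 0.5) >= eps - 1e-12:
            hits += 1
    return hits / trials

n, eps = 100, 0.1
print(deviation_frequency(n, eps), hoeffding_bound(n, eps))
```

The simulated frequency should sit below the bound; the bound is distribution-free, so it is typically loose for any specific distribution.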
Applying Hoeffding’s inequality
$$P\left( |\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon \right) \le 2 \exp\left( - \frac{n \epsilon^2}{2 M^2} \right)$$
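The constant in the exponent can be recovered from Hoeffding's inequality by taking $X_i = \ell(f(x_i), y_i)$. The derivation below assumes the loss takes values in an interval of length $2M$, for instance $|\ell| \le M$; this hypothesis is inferred from the formula, since it is not shown on the slide.

```latex
\begin{align*}
X_i &= \ell(f(x_i), y_i) \in [-M, M]
  \quad\Rightarrow\quad b_i - a_i = 2M, \\
\sum_{i=1}^n (b_i - a_i)^2 &= 4 n M^2, \\
\frac{2 n^2 \epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}
  &= \frac{2 n^2 \epsilon^2}{4 n M^2} = \frac{n \epsilon^2}{2 M^2}.
\end{align*}
```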
Controlling the Generalization Error
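$$P\left( |\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \ge \epsilon \right) \le P\left( \sup_{f \in \mathcal{H}} |\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon \right)$$
The latter term is the probability that at least one of the events
$|\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon$ occurs for $f \in \mathcal{H}$, in other words the
probability of the union of such events. Therefore, by the union bound,
$$P\left( \sup_{f \in \mathcal{H}} |\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon \right) \le \sum_{f \in \mathcal{H}} P\left( |\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon \right)$$
Bounding each term of the sum with Hoeffding's inequality gives
$$P\left( |\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \ge \epsilon \right) \le 2 |\mathcal{H}| \exp\left( - \frac{n \epsilon^2}{2 M^2} \right)$$
Setting the right-hand side equal to $\delta$ and solving for $\epsilon$,
$$|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{2 M^2 \log(2 |\mathcal{H}| / \delta)}{n}}$$
with probability at least $1 - \delta$.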
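The high-probability bound for a finite hypothesis space is easy to evaluate, and it can be inverted to get a sufficient sample size. The sketch below is a direct transcription of the formula; the function names and the default $M = 1$ are illustrative assumptions.

```python
import math

def generalization_bound(n, card_h, delta, m=1.0):
    """sqrt(2 M^2 log(2|H|/delta) / n): deviation that holds with probability >= 1 - delta."""
    return math.sqrt(2.0 * m * m * math.log(2.0 * card_h / delta) / n)

def samples_needed(eps, card_h, delta, m=1.0):
    """Smallest n for which the bound above is at most eps."""
    return math.ceil(2.0 * m * m * math.log(2.0 * card_h / delta) / (eps * eps))

# Multiplying |H| by 10 only adds a constant inside the logarithm.
for card_h in (20, 200, 2000):
    print(card_h, generalization_bound(1000, card_h, 0.05), samples_needed(0.1, card_h, 0.05))
```

Because $|\mathcal{H}|$ enters only through a logarithm, the required sample size grows slowly with the size of the hypothesis space.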
Example: Threshold Functions (in Probability)
Bounds in Expectation Vs Probability
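$$\mathbb{E}\,|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le (1 - \delta) \sqrt{\frac{2 M^2 \log(2 |\mathcal{H}_p| / \delta)}{n}} + \delta M$$
Therefore only $\log |\mathcal{H}_p|$ appears (no $|\mathcal{H}_p|$ alone).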
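The difference between a bound that is linear in $|\mathcal{H}_p|$ and one that is logarithmic in it shows up clearly in numbers. The sketch below compares the two expressions for $|\mathcal{H}_p| = 2 \cdot 10^p$; the values $V_{\mathcal{H}} = 0.25$, $M = 1$, $\delta = 0.01$ and $n = 10000$ are illustrative assumptions.

```python
import math

def bound_in_expectation(n, card_h, v=0.25):
    """|H| * sqrt(V_H / n), the bound that is linear in |H|."""
    return card_h * math.sqrt(v / n)

def bound_via_probability(n, card_h, delta, m=1.0):
    """(1 - delta) * sqrt(2 M^2 log(2|H|/delta) / n) + delta * M."""
    tail = math.sqrt(2.0 * m * m * math.log(2.0 * card_h / delta) / n)
    return (1.0 - delta) * tail + delta * m

n, delta = 10000, 0.01
for p in (1, 2, 3):
    card_h = 2 * 10 ** p
    print(p, bound_in_expectation(n, card_h), bound_via_probability(n, card_h, delta))
```

With these values the linear bound exceeds 1 already for moderate $p$, while the logarithmic one changes only slightly as $p$ grows.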
Infinite Hypotheses Spaces
Approximation Error for Threshold Functions
Consider $f_p = 1_{[a_p, +\infty)} = \arg\min_{f \in \mathcal{H}_p} \mathcal{E}(f)$ with $a_p \in [-1, 1]$.
We decompose the excess risk $\mathcal{E}(f_n) - \mathcal{E}(f_*)$:
Regularization
Regularization and Decomposition of the Excess Risk
Irreducible Error
Approximation Error
Convergence of the Approximation Error
Density Results
Approximation error bounds
$\mathcal{E}(f_{\gamma,n}) - \mathcal{E}(f_\gamma)$
• Capacity/Complexity estimates on $\mathcal{H}_\gamma$.
• Stability.
Sample Error Decomposition
Generalization Error(s)
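As we have observed,
$$\mathbb{E}\left[ \sup_{f \in \mathcal{H}_\gamma} |\mathcal{E}_n(f) - \mathcal{E}(f)| \right] \le |\mathcal{H}_\gamma| \sqrt{\frac{V_{\mathcal{H}_\gamma}}{n}}$$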
ERM on Finite Spaces and Computational Efficiency
ERM on Convex Spaces?
Example: Risks for Continuous functions
Example: Covering numbers
Example. If $\mathcal{H} \cong B_R(0)$ is a ball of radius $R$ in $\mathbb{R}^d$:
$$N(B_R(0), \eta) = (4R/\eta)^d$$
Example: Covering numbers (continued)
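$$\sup_{f \in \mathcal{H}} |\mathcal{E}_n(f) - \mathcal{E}(f)| \le 2 L \eta + \sqrt{\frac{2 M^2 \log(2 N(\mathcal{H}, \eta) / \delta)}{n}}$$
with probability at least $1 - \delta$.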
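The covering-number bound trades off the discretization term $2 L \eta$ against the logarithm of the number of balls, so the scale $\eta$ can be tuned. The sketch below plugs in the ball covering number $(4R/\eta)^d$ as stated in the lecture and scans a grid of $\eta$ values; the constants $L = M = 1$, $R = 1$, $d = 5$, $n = 10000$, $\delta = 0.05$ and the grid itself are illustrative assumptions.

```python
import math

def covering_number_ball(radius, eta, d):
    """(4R / eta)^d, the covering number of a ball of radius R in R^d as used in the lecture."""
    return (4.0 * radius / eta) ** d

def covering_bound(n, eta, radius, d, delta, lipschitz=1.0, m=1.0):
    """2 L eta + sqrt(2 M^2 log(2 N(H, eta) / delta) / n), with the log expanded to avoid overflow."""
    log_cover = d * math.log(4.0 * radius / eta)
    log_term = math.log(2.0 / delta) + log_cover
    return 2.0 * lipschitz * eta + math.sqrt(2.0 * m * m * log_term / n)

n, radius, d, delta = 10000, 1.0, 5, 0.05
etas = [10 ** (-k / 4) for k in range(0, 17)]   # from 1 down to 1e-4
best_eta = min(etas, key=lambda eta: covering_bound(n, eta, radius, d, delta))
print(best_eta, covering_bound(n, best_eta, radius, d, delta))
```

A large $\eta$ inflates the first term, while a very small $\eta$ inflates the logarithm; the scan picks a value in between.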
Complexity Measures
• Covering numbers,
• combinatorial dimension, e.g. VC-dimension, fat-shattering
dimension
• Rademacher complexities
• Gaussian complexities
• ...
Prototypical Results
Choosing γ(n) in practice
• Cross validation,
• complexity regularization/structural risk minimization,
• balancing principles.
• ...
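Of the options listed above, cross validation is the simplest to sketch. The example below is a minimal K-fold illustration and not the lecture's procedure: it treats the precision $p$ of a threshold-function class as the parameter playing the role of $\gamma$, and it assumes the 0-1 loss, a grid of thresholds on $[-1, 1)$, and toy data with 10% label noise around a true threshold at 0.37.

```python
import random

def threshold_grid(p):
    """Thresholds with p decimal digits on [-1, 1) (an assumed hypothesis class)."""
    scale = 10 ** p
    return [k / scale for k in range(-scale, scale)]

def risk(a, data):
    """Average 0-1 loss of x -> 1[x >= a] on a list of (x, y) pairs."""
    return sum(1 for x, y in data if (1 if x >= a else 0) != y) / len(data)

def erm_threshold(p, data):
    """Empirical risk minimizer over the grid with precision p."""
    return min(threshold_grid(p), key=lambda a: risk(a, data))

def cross_validate(p, data, folds=5):
    """Average held-out risk of ERM with precision p over K folds."""
    scores = []
    for k in range(folds):
        held_out = data[k::folds]
        train = [z for i, z in enumerate(data) if i % folds != k]
        scores.append(risk(erm_threshold(p, train), held_out))
    return sum(scores) / folds

rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.uniform(-1.0, 1.0)
    y = 1 if x >= 0.37 else 0     # toy ground truth (an assumption)
    if rng.random() < 0.1:        # flip 10% of the labels
        y = 1 - y
    data.append((x, y))

cv_scores = {p: cross_validate(p, data) for p in (0, 1, 2)}
best_p = min(cv_scores, key=cv_scores.get)
print(cv_scores, best_p)
```

The coarsest class cannot place a threshold near 0.37, so its held-out risk is visibly worse than that of the finer classes.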
Abstract Regularization
Wrapping Up
• Have shown how the empirical risk can be a proxy for the
expected risk.
• Identified the main reasons behind overfitting and
discussed how to counteract it (in a more principled way!).
• Highlighted the key role played by the choice of the
hypothesis space and how its “complexity” affects
performance.