3. Multi-class Classification

The document discusses multi-class classification in artificial intelligence and machine learning, covering topics such as predictors, learning theory, and optimization techniques. It introduces methods for reducing multi-class problems to binary classification, including One-vs-One and One-vs-Rest approaches, and explores the linear multiclass hypothesis space. Additionally, it addresses margin-based loss functions and regularization schemes to enhance model performance in multi-class settings.


ARIN7015/MATH6015: Topics in Artificial Intelligence

and Machine Learning

Multi-Class Classification and Its Optimization

Yunwen Lei

Department of Mathematics, The University of Hong Kong

March 20, 2025


Outline

1 Multiclass Predictors

2 Learning Theory for Multi-class Classification

3 Frank-Wolfe Algorithm

4 Optimization for Multi-class Classification


Multi-class Classification

(⟨image⟩, dog), (⟨image⟩, car), (⟨image⟩, airplane), . . .


Formally z1 = (x1, y1), . . . , zn = (xn, yn) ∼ ρ i.i.d., with each zi ∈ X × Y
▶ Y := {1, 2, . . . , c}
▶ c = number of classes
Reduction to Binary Classification

One-vs-One
Train c(c − 1)/2 binary classifiers, one for each pair of distinct classes
The final prediction is made by majority vote
Reduction to Binary Classification
One-vs-Rest
Train c binary classifiers, one for each class
Train j-th classifier to distinguish class j from the rest
Suppose h1, . . . , hc : X ↦ R are our binary classifiers
The final prediction is
h(x) = arg max_{j∈[c]} hj(x)

Tie breaking: If two distinct i, j ∈ [c] achieve the maximum (i.e., wi⊤ x = wj⊤ x), we break the tie using some consistent policy, e.g., predicting the smaller of i and j.
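As a concrete illustration, here is a minimal NumPy sketch of One-vs-Rest prediction with this tie-breaking rule; the function name ovr_predict and the toy numbers are illustrative, not part of the lecture.

import numpy as np

def ovr_predict(W, X):
    # W has shape (d, c) with columns w_1, ..., w_c; X has shape (n, d).
    # np.argmax returns the first (smallest-index) maximizer, which matches
    # the tie-breaking policy of predicting the smaller label.
    scores = X @ W                          # scores[i, j] = w_j^T x_i
    return np.argmax(scores, axis=1) + 1    # labels in {1, ..., c}

# Toy usage with c = 3 classes in d = 2 dimensions (made-up numbers)
W = np.array([[0.0, -0.7, 0.7],
              [1.0,  0.7, 0.7]])
X = np.array([[0.0, 1.0], [-1.0, 0.2]])
print(ovr_predict(W, X))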
Recap: Linear Binary Classifier

Y = {−1, +1}, X = Rd
Linear classifier score function h(x) = w⊤ x
Final prediction: sign(h(x))

h(x) = w⊤ x = ∥w∥2 ∥x∥2 cos(θ)


h(x) > 0 ⇐⇒ cos θ > 0 =⇒ θ ∈ (−90◦ , 90◦ )
h(x) < 0 ⇐⇒ cos θ < 0 =⇒ θ ̸∈ (−90◦ , 90◦ )
Three Class Example

Base hypothesis space


H = {x ↦ w⊤ x : w ∈ R2}


Note: separating boundary always contains the origin


Three Class Example: One-vs-Rest

Class 1 vs Rest: h1 (x) = w1⊤ x


Three Class Example: One-vs-Rest

Class 2 vs Rest
▶ Predicts everything to be “Not 2”
▶ It misclassifies 6 examples (all examples in class “2”)
▶ If it predicted some “2”, then it would get many more “Not 2” incorrect
Score for class j is
hj(x) = wj⊤ x = ∥wj∥2 ∥x∥2 cos θj,
where θj is the angle between x and wj .
One-vs-Rest: Class Boundaries

For simplicity, we assume ∥w1 ∥2 = ∥w2 ∥2 = ∥w3 ∥2


Then wj⊤ x = ∥wj ∥2 ∥x∥2 cos θj
▶ Only θj matters for comparison
▶ x is classified by whichever has largest cos θj (i.e., θj closest to 0)
One-vs-Rest: Class Boundaries

This approach doesn’t work well in this instance!

How can we fix this?


The Linear Multiclass Hypothesis Space

Base hypothesis space


H̃ = {x ↦ w⊤ x : w ∈ Rd}

Linear multiclass hypothesis space
H = {x ↦ arg max_{j∈[c]} hj(x) : h1, . . . , hc ∈ H̃}

A learning algorithm chooses a hypothesis from the hypothesis space
Is this a failure of the hypothesis space or of the learning algorithm?
A Solution with Linear Functions

This works... so the problem is not with the hypothesis space


How can we get a solution like this?
Multiclass Predictors
Multiclass Hypothesis Space

Idea: use one function h(x, y ) to give a compatibility score between input x and output y
Final prediction is the y ∈ Y that is “most compatible” with x

x ← arg max_{y∈Y} h(x, y).

Given class-specific score functions h1, . . . , hc, we could define the compatibility function as
h(x, j) = hj(x), j ∈ [c].

Base Hypothesis Space: H = {h : X × Y ↦ R}
Multiclass Hypothesis Space
{x ↦ arg max_{y∈Y} h(x, y) : h ∈ H}

Want compatibility h(x, y ) to be large when x has label y , small otherwise!


Multi-class Classification: Example

Points x1 , x2 and x3 will be classified as label 1, 2 and 3, respectively.

What do the three rays stand for?


Linear Multiclass Prediction Function

A linear compatibility score function is given by

h(x, y ) = ⟨w, Ψ(x, y )⟩,



where Ψ : X × Y ↦ Rd is a compatibility feature map

Ψ(x, y ) extracts features relevant to how compatible y is with x


Final compatibility score is a linear function of Ψ(x, y )
Linear Multiclass Hypothesis Space

{x ↦ arg max_{y∈Y} ⟨w, Ψ(x, y)⟩ : w ∈ Rd}
Example: X = R2 , Y = {1, 2, 3}

w1 = (0, 1)⊤,   w2 = (−1/√2, 1/√2)⊤,   w3 = (1/√2, 1/√2)⊤

Prediction function
x = (x1, x2)⊤ ← arg max_{j∈{1,2,3}} ⟨wj, x⟩

How can we get this into the form x ← arg max_{y∈Y} ⟨w, Ψ(x, y)⟩?
The Multivector Construction
What if we stack the wj's together:
w = (w1, w2, w3) = ((0, 1)⊤, (−1/√2, 1/√2)⊤, (1/√2, 1/√2)⊤)

We define the feature map Ψ : R2 × {1, 2, 3} ↦ R2×3 as
Ψ(x, 1) = ((x1, x2)⊤, (0, 0)⊤, (0, 0)⊤),
Ψ(x, 2) = ((0, 0)⊤, (x1, x2)⊤, (0, 0)⊤),
Ψ(x, 3) = ((0, 0)⊤, (0, 0)⊤, (x1, x2)⊤).

Then we get the desired identity ⟨w, Ψ(x, y)⟩ = ⟨wy, x⟩
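As a quick numerical check of this identity, here is a minimal NumPy sketch of the multivector construction; psi and the test point are illustrative names, and the weight vectors follow the example above.

import numpy as np

def psi(x, y, c):
    # Multivector feature map: place x in the y-th block of a length-(d*c) vector.
    d = x.shape[0]
    out = np.zeros(d * c)
    out[(y - 1) * d:y * d] = x              # labels are 1, ..., c
    return out

w1 = np.array([0.0, 1.0])
w2 = np.array([-1.0, 1.0]) / np.sqrt(2)
w3 = np.array([1.0, 1.0]) / np.sqrt(2)
w = np.concatenate([w1, w2, w3])            # stacked parameter vector

x = np.array([0.5, -2.0])                   # an arbitrary test point
for y, wy in enumerate([w1, w2, w3], start=1):
    assert np.isclose(np.dot(w, psi(x, y, 3)), np.dot(wy, x))
print("<w, Psi(x, y)> = <w_y, x> for every y")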


Linear Multiclass Prediction Function: Simplification

In our lecture, we always consider the following compatibility feature map


 
Ψ(x, j) = (0, . . . , 0, x, 0, . . . , 0) ∈ Rd×c, where x appears in the j-th component

We stack all wj into a matrix

w = (w1 , . . . , wc ) ∈ Rd×c .

Then it is clear that


⟨w, Ψ(x, j)⟩ = wj⊤ x!
Linearly Separable Dataset

A training dataset S is linearly separable if there exist w1, . . . , wc that correctly classify all the points in S, i.e.,
for every point x ∈ S with label y, wy⊤ x > wj⊤ x for all j ≠ y
Recap: Binary Classification
Margin of a hyperplane for a dataset: the width by which the boundary could be increased without hitting a data point


How to define the margin for multi-class classification?


Multi-class Margin

Recap: we choose the class label with the largest score function

ŷ := arg maxj∈Y h(x, j)

The prediction is correct if the true class label receives the largest score!

Multi-class Margin
Aim: Want h(xi, yi) larger than all other h(xi, j) for j ≠ yi, i.e., a large margin
margin(h, z) = h(x, y) − max_{j:j≠y} h(x, j)
Multi-class Margin

Defined as the difference between the score of the correct label and the highest score among the incorrect labels
Margin-based Loss

Ideal property of loss function


Loss should be a decreasing function of the margin
The model w predicts a vector in Rc : (w1⊤ x, . . . , wc⊤ x)
We need a map from Rc to R+

We consider the margin-based loss of the form

f(w; z) = ℓy(w1⊤ x, w2⊤ x, . . . , wc⊤ x),
where ℓy : Rc ↦ R.

Hinge loss: ℓy(t1, . . . , tc) = max{0, 1 − margin} = max{0, 1 + max_{j:j≠y}(tj − ty)}, where margin = ty − max_{j:j≠y} tj

f(w; z) = max{0, 1 + max_{j:j≠y} ⟨wj − wy, x⟩} = ℓy(w1⊤ x, . . . , wc⊤ x)    (1)
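A minimal NumPy sketch of the multiclass hinge loss (1) for a single example; the names W, x, y are illustrative placeholders (W stores w_1, ..., w_c as columns).

import numpy as np

def multiclass_hinge(W, x, y):
    # max{0, 1 + max_{j != y} <w_j - w_y, x>} with label y in {1, ..., c}
    scores = W.T @ x                        # t_j = w_j^T x
    t_y = scores[y - 1]
    others = np.delete(scores, y - 1)       # scores of the incorrect labels
    return max(0.0, 1.0 + np.max(others) - t_y)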
Soft-Max Loss
Note that the margin of t is ty − maxj̸=y tj
Max is not differentiable. Replace it with the soft-max margin
max_{j:j≠y}(tj − ty) ≤ log Σ_{j:j≠y} exp(tj − ty) ≤ max_{j:j≠y}(tj − ty) + log c,
where the left-hand side equals −margin and the middle term is the soft-max

Want to maximize margin: x ↦ log(1 + exp(−x)) is a decreasing function
log(1 + exp(−margin)) ≈ log(1 + exp(log Σ_{j:j≠y} exp(tj − ty))) = log(1 + Σ_{j:j≠y} exp(tj − ty)),
where −margin is replaced by its soft-max approximation
With ℓy(t1, . . . , tc) = log(1 + Σ_{j:j≠y} exp(tj − ty)) we know
f(w; z) = log(1 + Σ_{j:j≠y} exp(wj⊤ x − wy⊤ x)) = ℓy(w1⊤ x, . . . , wc⊤ x).
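A numerically careful NumPy sketch of this soft-max loss for one example (names are illustrative); note it coincides with the familiar cross-entropy loss −log softmax(t)_y, since log(1 + Σ_{j≠y} exp(tj − ty)) = logsumexp(t) − ty.

import numpy as np

def softmax_loss(W, x, y):
    # log(1 + sum_{j != y} exp(t_j - t_y)) with t_j = w_j^T x
    t = W.T @ x
    m = np.max(t)                                   # shift for numerical stability
    logsumexp_t = m + np.log(np.sum(np.exp(t - m)))
    return logsumexp_t - t[y - 1]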
Regularization Schemes for Multi-class Classification

We train a model by minimizing the following objective w.r.t. w = (w1 , . . . , wc ), wj ∈ Rd


min_{w∈Rd×c} (1/n) Σ_{i=1}^n ℓyi(w1⊤ xi, . . . , wc⊤ xi)   s.t.   w ∈ W = {w ∈ Rd×c : Ω(w) ≤ 1},

where each summand equals f(w; zi) and Ω : Rd×c ↦ R+ is a regularizer.

Examples: Ω(w) = (λ/2) ∥w∥2,p², where
∥w∥2,p := (Σ_{j=1}^c ∥wj∥2^p)^{1/p},   p ≥ 1

p = 1: ∥w∥2,1 = Σ_{j=1}^c ∥wj∥2
p = 2: ∥w∥2,2 = (Σ_{j=1}^c ∥wj∥2²)^{1/2}
p = ∞: ∥w∥2,∞ = max_{j∈[c]} ∥wj∥2
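A small NumPy sketch of the (2, p) group norm defined above; group_norm is an illustrative name and W stores w_1, ..., w_c as columns.

import numpy as np

def group_norm(W, p):
    # ||W||_{2,p} = || (||w_1||_2, ..., ||w_c||_2) ||_p
    col_norms = np.linalg.norm(W, axis=0)
    if np.isinf(p):
        return np.max(col_norms)                    # p = infinity case
    return np.sum(col_norms ** p) ** (1.0 / p)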


Learning Theory for Multi-class Classification
Rademacher Complexity for Multi-class Classification

The loss function class becomes


F = {z ↦ ℓy(w1⊤ x, . . . , wc⊤ x) : Ω(w) ≤ 1}


How to estimate the Rademacher complexity


Eσ sup_{w∈W} (1/n) Σ_{i=1}^n σi ℓyi(w1⊤ xi, . . . , wc⊤ xi)?

The difficulty lies in the nonlinearity of ℓy , which takes a vector in Rc as the input!
Lipschitz Continuity of ℓy

We say ℓy is G-Lipschitz continuous w.r.t. a norm ∥·∥ on Rc if
|ℓy(a1, . . . , ac) − ℓy(a1′, . . . , ac′)| ≤ G ∥(a1, . . . , ac) − (a1′, . . . , ac′)∥.


 

It is an extension of Lipschitz continuity for functions from R to R


We are interested in ∥ · ∥2 and ∥ · ∥∞

Lemma. Let a1 , . . . , ac and b1 , . . . , bc ∈ R. Then


 
|max{a1, . . . , ac} − max{b1, . . . , bc}| ≤ max_{j∈[c]} |aj − bj|.

Proof. Let ak = max{a1, . . . , ac} and bk′ = max{b1, . . . , bc}. Without loss of generality, assume ak ≥ bk′. Then
max{a1, . . . , ac} − max{b1, . . . , bc} = ak − bk′ ≤ ak − bk ≤ max_{j∈[c]} |aj − bj|.
Lipschitz Continuity of Hinge Loss

Example
The hinge loss ℓy(t1, . . . , tc) = max{0, 1 + max_{j:j≠y}(tj − ty)} is 2-Lipschitz continuous w.r.t. ∥·∥∞.

Proof. We just showed
|max{a1, . . . , ac} − max{b1, . . . , bc}| ≤ max_{j∈[c]} |aj − bj|.
Let t = (t1, . . . , tc) and t′ = (t1′, . . . , tc′). Then
|ℓy(t) − ℓy(t′)| = |max{0, 1 + max_{j:j≠y}(tj − ty)} − max{0, 1 + max_{j:j≠y}(tj′ − ty′)}|
≤ |(1 + max_{j:j≠y}(tj − ty)) − (1 + max_{j:j≠y}(tj′ − ty′))|
≤ max_{j:j≠y} |(tj − ty) − (tj′ − ty′)| ≤ 2∥t − t′∥∞.
Lipschitz Continuity of the Soft-max Loss

Example
The soft-max loss ℓy(t1, . . . , tc) = log(1 + Σ_{j:j≠y} exp(tj − ty)) is 2-Lipschitz continuous w.r.t. ∥·∥∞.
Proof. The function t ↦ g(t) := log Σ_{j∈[c]} exp(tj) is 1-Lipschitz continuous w.r.t. ∥·∥∞:
∇g(t) = (Σ_{j∈[c]} exp(tj))^{-1} (exp(t1), . . . , exp(tc))⊤ ⟹ ∥∇g(t)∥1 ≤ 1,
and by a first-order Taylor expansion (mean value theorem), g(t) − g(t′) = ⟨t − t′, ∇g(αt + (1 − α)t′)⟩ for some α ∈ [0, 1]
⟹ |g(t) − g(t′)| ≤ ∥t − t′∥∞ ∥∇g(αt + (1 − α)t′)∥1 ≤ ∥t − t′∥∞.
Since ℓy(t) = g(t1 − ty, . . . , tc − ty), we get |ℓy(t) − ℓy(t′)| ≤ max_{j∈[c]} |(tj − ty) − (tj′ − ty′)| ≤ 2∥t − t′∥∞.
Vector contraction of Rademacher complexity

Vector contraction of Rademacher complexity


Let F be a class of functions from Z to Rc. For each i ∈ [n], let σi : Rc ↦ R be L-Lipschitz continuous w.r.t. ∥·∥2. Then
E sup_{f∈F} Σ_{i∈[n]} ϵi σi(f(zi)) ≤ √2 L E sup_{f∈F} Σ_{i∈[n]} Σ_{k∈[c]} ϵi,k fk(zi),
where ϵi,k are i.i.d. Rademacher variables, and fk(zi) is the k-th component of f(zi).

It is an extension of Talagrand's contraction lemma to vector-valued functions


It removes the nonlinear σ, which is the loss function in multi-class classification!

Maurer, A., 2016. A vector-contraction inequality for Rademacher complexities. In Algorithmic Learning
Theory.
Rademacher Complexity Bound (Optional)
Let ℓy be either the hinge loss or the soft-max loss
|ℓy(t) − ℓy(t′)| ≤ 2 max_{j∈[c]} |tj − tj′| = 2∥t − t′∥∞ ≤ 2 (Σ_{j=1}^c (tj − tj′)²)^{1/2} = 2∥t − t′∥2.

Then vector contraction of Rademacher complexity implies


E sup_{w∈W} Σ_{i∈[n]} ϵi ℓyi(w1⊤ xi, . . . , wc⊤ xi) ≤ 2√2 E sup_{w∈W} Σ_{i∈[n]} Σ_{j∈[c]} ϵij wj⊤ xi.

Let W = {w ∈ Rd×c : ∥w∥2,2 ≤ 1}. Then


E sup_{w∈W} Σ_{i∈[n]} Σ_{j∈[c]} ϵij wj⊤ xi = E sup_{w∈W} Σ_{j∈[c]} wj⊤ (Σ_{i∈[n]} ϵij xi) ≤ E sup_{w∈W} Σ_{j∈[c]} ∥wj∥2 ∥Σ_{i∈[n]} ϵij xi∥2
≤ E sup_{w∈W} (Σ_{j∈[c]} ∥wj∥2²)^{1/2} (Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}
≤ E [(Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}].
Rademacher Complexity Bound (Optional)
We just showed
E sup_{w∈W} Σ_{i∈[n]} ϵi ℓyi(w1⊤ xi, . . . , wc⊤ xi) ≤ √2 L E [(Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}].


By the concavity of x ↦ √x (Jensen's inequality), we know
E [(Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}] ≤ (Σ_{j=1}^c E ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}.

Furthermore, there holds
E ∥Σ_{i∈[n]} ϵij xi∥2² = E [Σ_{i∈[n]} Σ_{i′∈[n]} ⟨ϵij xi, ϵi′j xi′⟩] = Σ_{i∈[n]} E[ϵij²] xi⊤ xi + Σ_{i≠i′} E[ϵij ϵi′j] xi⊤ xi′,
where the second sum equals 0.

⟹ E [(Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}] ≤ (c · Σ_{i∈[n]} E∥xi∥2²)^{1/2}.
Rademacher Complexity Bound

The Rademacher complexity of F is bounded by


RS({z ↦ ℓy(w1⊤ x, . . . , wc⊤ x) : ∥w∥2,2 ≤ 1}) ≤ (1/n) (c · Σ_{i∈[n]} E∥xi∥2²)^{1/2}.

Generalization bound
Let A be ERM. Then with probability at least 1 − δ
F(A(S)) − F(w∗) ≲ n^{−1/2} log^{1/2}(2/δ) + √c/√n.

This shows a square-root dependency of the generalization bound on the number of classes!
Frank-Wolfe Algorithm
Training Multiclass Model

Consider the following minimization problem (p ≥ 1)


min_w (1/n) Σ_{i=1}^n ℓyi((wj⊤ xi)_{j∈[c]})   s.t.   Σ_{j∈[c]} ∥wj∥2^p ≤ B^p.

It is a constrained optimization problem


Can be solved by projected gradient descent, which, however, requires a projection step per iteration
The projection can be expensive
Can we develop a projection-free method?

Yes, the Frank-Wolfe Algorithm!


Frank-Wolfe Method

Objective: min_{w∈W} FS(w), where W is convex.

Recap: projected gradient descent (PGD)



Recall that PGD wt+1 = ProjW(wt − η∇FS(wt)) uses a quadratic approximation
wt+1 = arg min_{w∈W} {FS(wt) + ⟨w − wt, ∇FS(wt)⟩ + (1/(2η)) ∥w − wt∥2²}.

In some cases, the projection may be hard to compute (or even approximate)

For these problems, we can solve the simplified problem


arg min_{w∈W} {FS(wt) + ⟨w − wt, ∇FS(wt)⟩},

which optimizes a linear approximation to the function over the constraint set
This requires the set W to be bounded, otherwise there may be no solution
Frank-Wolfe Method

Frank-Wolfe Method
1: for t = 0, 1, . . . , T do
2: Set vt = arg minv∈W ⟨∇FS (wt ), v⟩
3: Set wt+1 = wt + γt (vt − wt ), where γt ∈ [0, 1]

wt+1 is feasible since it is a convex combination of two other feasible points

wt+1 = (1 − γt )wt + γt vt

A choice of γt is γt = 2/(t + 2)
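A minimal NumPy sketch of this loop; grad (the gradient of FS), lmo (the linear minimization oracle over W), the feasible starting point w0, and the iteration count T are all user-supplied placeholders.

import numpy as np

def frank_wolfe(grad, lmo, w0, T):
    # Generic Frank-Wolfe: only an LMO over the constraint set is needed.
    w = w0
    for t in range(T):
        v = lmo(grad(w))                    # v_t = argmin_{v in W} <grad F_S(w_t), v>
        gamma = 2.0 / (t + 2)               # standard step size gamma_t = 2/(t+2)
        w = (1 - gamma) * w + gamma * v     # convex combination stays feasible
    return w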
Linear Minimization Oracle
Let W be a convex, closed and bounded set. The linear minimization oracle of W (LMO_W) returns a vector v̂ such that
v̂ = LMO_W(g) ∈ arg min_{v∈W} g⊤ v.

LMO_W returns an extreme point of W
LMO_W is projection-free and arguably cheaper than a projection
Example
Apply the FW algorithm to solve the following nonlinear problem with the initialization
point (x1 , y1 ) = (0, 0)

min f (x, y ) = (x − 1)2 + (y − 1)2


s.t. 2x + y ≤ 6
x + 2y ≤ 6
x, y ≥ 0.

Solution. The gradient of f is ∇f (x, y ) = (2x − 2, 2y − 2)⊤ .


We know ∇f (0, 0) = (−2, −2)⊤
We get v by solving the linear optimization problem
min_{(x,y)} −2x − 2y   s.t.   2x + y ≤ 6, x + 2y ≤ 6, x, y ≥ 0.

By the simplex method, one can show that v = (2, 2)⊤


If we choose γ = 1/2, then we get
(x2, y2)⊤ = (x1, y1)⊤ + γ(v − (x1, y1)⊤) = (0, 0)⊤ + (1/2)(2, 2)⊤ = (1, 1)⊤
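This single step can be reproduced numerically with a small sketch (the LMO here is a brute-force minimum over the vertices of the feasible polytope, which suffices because the subproblem's objective is linear):

import numpy as np

grad = lambda w: 2 * (w - np.array([1.0, 1.0]))                 # gradient of f
vertices = np.array([[0, 0], [3, 0], [0, 3], [2, 2]], float)    # extreme points of the feasible set
lmo = lambda g: vertices[np.argmin(vertices @ g)]               # linear minimization over the polytope

w = np.zeros(2)
v = lmo(grad(w))                 # v = (2, 2)
w = w + 0.5 * (v - w)            # gamma = 1/2 gives w = (1, 1)
print(v, w)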
Example

Lasso Regression
min_w FS(w) = ∥Aw − b∥2²   s.t.   ∥w∥1 ≤ 1.

∇FS(w) = g := 2A⊤(Aw − b)


LMO_W(g) = −sign(gj∗) ej∗ with j∗ := arg max_{j∈[d]} |gj|, which is simpler than projection onto the L1-ball. Here ej is the j-th unit vector.
Proof. For any w ∈ W, we have
w⊤ g = Σ_{j∈[d]} wj gj ≥ − Σ_{j∈[d]} |wj gj| ≥ − max_{j∈[d]} |gj| Σ_{j∈[d]} |wj| ≥ −|gj∗| = −sign(gj∗) ej∗⊤ g.
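A minimal NumPy sketch of this L1-ball LMO inside a Frank-Wolfe loop on random placeholder data (function and variable names are illustrative; the gradient uses 2A⊤(Aw − b)):

import numpy as np

def lmo_l1(g):
    # LMO for {w : ||w||_1 <= 1}: a signed standard basis vector at the largest |g_j|
    j = np.argmax(np.abs(g))
    v = np.zeros_like(g)
    v[j] = -np.sign(g[j])
    return v

rng = np.random.default_rng(0)
A, b = rng.standard_normal((100, 500)), rng.standard_normal(100)   # n = 100, d = 500
w = np.zeros(500)
for t in range(200):                     # Frank-Wolfe iterations
    g = 2 * A.T @ (A @ w - b)            # gradient of ||Aw - b||_2^2
    gamma = 2.0 / (t + 2)
    w = (1 - gamma) * w + gamma * lmo_l1(g)
print(np.abs(w).sum() <= 1 + 1e-9)       # the iterate stays in the L1 ball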
Lasso comparison
Comparing projected and conditional gradient methods for the constrained lasso problem, with n = 100, d = 500
Convergence of Frank-Wolfe Method
Theorem. Let W be nonempty, convex and bounded. Let FS be convex and L-smooth.
Then FW with γt = 2/(t + 2) satisfies

FS(wt) − FS(w∗) ≤ 2LD²/(t + 1),   where D := max_{w,w′∈W} ∥w − w′∥2.

Proof. Since vt and w∗ are in W, optimality of vt shows

⟨vt − wt , ∇FS (wt )⟩ ≤ ⟨w∗ − wt , ∇FS (wt )⟩ ≤ FS (w∗ ) − FS (wt ).

By the L-smoothness and wt+1 − wt = γt(vt − wt),
FS(wt+1) ≤ FS(wt) + ⟨wt+1 − wt, ∇FS(wt)⟩ + (L/2) ∥wt+1 − wt∥2²
= FS(wt) + γt ⟨vt − wt, ∇FS(wt)⟩ + (γt² L/2) ∥vt − wt∥2²
≤ FS(wt) + γt (FS(w∗) − FS(wt)) + γt² LD²/2
⟹ FS(wt+1) − FS(w∗) ≤ (1 − γt) (FS(wt) − FS(w∗)) + γt² LD²/2.
Convergence of Frank-Wolfe Method

We derived FS(wt+1) − FS(w∗) ≤ (1 − γt) (FS(wt) − FS(w∗)) + γt² LD²/2
⟹ FS(wt+1) − FS(w∗) ≤ (t/(t + 2)) (FS(wt) − FS(w∗)) + 4LD²/(2(t + 2)²)

We multiply both sides by (t + 1)(t + 2) and get, with ∆t := FS(wt) − FS(w∗),
(t + 1)(t + 2) ∆t+1 ≤ t(t + 1) ∆t + 2LD².
Telescoping shows
(t + 1)(t + 2) ∆t+1 = Σ_{k=0}^t [(k + 1)(k + 2) ∆k+1 − k(k + 1) ∆k] ≤ Σ_{k=0}^t 2LD² = 2LD²(t + 1),
which gives ∆t+1 ≤ 2LD²/(t + 2).
Stochastic Frank-Wolfe Method
Stochastic Frank-Wolfe Method
1: for t = 0, 1, . . . , T do
2: Set vt = arg min_{v∈W} ⟨∇̂t, v⟩, where ∇̂t is an unbiased estimator of ∇FS(wt)
3: Set wt+1 = wt + γt (vt − wt), where γt ∈ [0, 1]

Convergence Rate
Let W be nonempty, convex and bounded. Let FS be convex and L-smooth. Assume the
following variance condition

E ∥∇̂t − ∇FS(wt)∥2² ≤ (LD/(t + 1))².
Then stochastic FW with γt = 2/(t + 2) satisfies
E [FS(wt) − FS(w∗)] ≤ 4LD²/(t + 1).

Stochastic FW requires decreasing variance!


Stochastic Frank-Wolfe Method
Note
FS(w) = (1/n) Σ_{i=1}^n f(w; zi).

Assume St is a random sample (with replacement) from [n] and
∇̂t = (1/|St|) Σ_{i∈St} ∇f(wt; zi).

Then ∇̂t is an unbiased estimator of ∇FS(wt):
ESt[∇̂t] = (1/|St|) ESt[Σ_{i∈St} ∇f(wt; zi)] = ∇FS(wt).

Furthermore, the variance decreases by a factor of |St|:
E ∥∇̂t − ∇FS(wt)∥2² = σ²/|St|,   σ² := E[∥∇f(wt; zit) − ∇FS(wt)∥2²].
Choosing |St| = σ²(t + 1)²/(L²D²) meets the variance condition for stochastic FW.
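A minimal sketch of this mini-batch gradient estimator and batch-size schedule; grad_i (computing ∇f(w; zi)), sigma, L and D are assumed to be supplied by the user.

import numpy as np

def stochastic_grad(grad_i, w, n, batch_size, rng):
    # Unbiased estimate of grad F_S(w) via sampling with replacement from [n]
    idx = rng.integers(0, n, size=batch_size)
    return np.mean([grad_i(w, i) for i in idx], axis=0)

def batch_size(t, sigma, L, D):
    # |S_t| = sigma^2 (t+1)^2 / (L^2 D^2) meets the variance condition
    return int(np.ceil(sigma**2 * (t + 1)**2 / (L**2 * D**2)))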
Optimization for Multi-class Classification
Training Multi-class Model by Frank-Wolfe Method
Recall the problem
min_w (1/n) Σ_{i=1}^n ℓyi((wj⊤ xi)_{j∈[c]})   s.t.   Σ_{j∈[c]} ∥wj∥2² ≤ B².

Frank-Wolfe method for multi-class classification


1: for t = 0, 1, . . . , T do
2: Set vt = arg min_{w: Σj∈[c] ∥wj∥2² ≤ 1} ⟨w, ∇FS(wt)⟩
3: Set wt+1 = wt + γt (vt − wt), where γt ∈ [0, 1]

Frank-Wolfe Algorithm requires solving a linear optimization problem


arg min_w ⟨w, g⟩   s.t.   Σ_{j∈[c]} ∥wj∥2² ≤ 1,

which has a closed-form solution w∗ = (w1∗, . . . , wc∗) as follows:
wj∗ = −gj / (Σ_{j̃=1}^c ∥gj̃∥2²)^{1/2}.    (2)
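A minimal NumPy sketch of the closed-form LMO (2), with the gradient stored as a d × c matrix G (an illustrative name); note that (Σ_j ∥gj∥2²)^{1/2} is just the Frobenius norm of G. Plugged into the generic Frank-Wolfe sketch above, this gives a projection-free solver for the multiclass training problem.

import numpy as np

def lmo_group_l2(G):
    # Closed-form LMO over {W : sum_j ||w_j||_2^2 <= 1} for a d x c gradient matrix G
    denom = np.linalg.norm(G)               # Frobenius norm = (sum_j ||g_j||_2^2)^{1/2}
    return -G / denom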
Proof on Closed-form Solution (Optional)

w∗ = arg min_{w: Σj∈[c] ∥wj∥2² ≤ 1} ⟨w, g⟩   ⟺   wj∗ = −gj / (Σ_{j̃=1}^c ∥gj̃∥2²)^{1/2}.

It suffices to check ∥w∗∥2,2 ≤ 1 and ⟨w∗, g⟩ = −∥g∥2,2, since by the Cauchy-Schwarz inequality ⟨w, g⟩ ≥ −∥w∥2,2 ∥g∥2,2 ≥ −∥g∥2,2 for every feasible w.


It is clear that
∥w∗∥2,2 = (Σ_{j̃=1}^c ∥gj̃∥2²)^{1/2} / (Σ_{j̃=1}^c ∥gj̃∥2²)^{1/2} = 1
and
⟨w∗, g⟩ = Σ_{j=1}^c ⟨wj∗, gj⟩ = −(Σ_{j̃=1}^c ∥gj̃∥2²)^{−1/2} Σ_{j=1}^c ∥gj∥2² = −(Σ_{j̃=1}^c ∥gj̃∥2²)^{1/2} = −∥g∥2,2.
Summary

Multiclass predictors: vector-valued functions


▶ margin-based formulation: we want to find a model with a large margin
Learning theory
▶ Rademacher complexity bound with a square-root dependency on c
Frank-Wolfe Algorithm
▶ projection-free algorithm for constrained optimization
▶ replace projection by a linear minimization problem
Optimization for multi-class classification
▶ A closed-form solution for the linear minimization problem
