3. Multi-class Classification

The document discusses multi-class classification in artificial intelligence and machine learning, covering topics such as predictors, learning theory, and optimization techniques. It introduces methods for reducing multi-class problems to binary classification, including One-vs-One and One-vs-Rest approaches, and explores the linear multiclass hypothesis space. Additionally, it addresses margin-based loss functions and regularization schemes to enhance model performance in multi-class settings.


ARIN7015/MATH6015: Topics in Artificial Intelligence

and Machine Learning

Multi-Class Classification and Its Optimization

Yunwen Lei

Department of Mathematics, The University of Hong Kong

March 20, 2025


Outline

1 Multiclass Predictors

2 Learning Theory for Multi-class Classification

3 Frank-Wolfe Algorithm

4 Optimization for Multi-class Classification


Multi-class Classification

(⟨image⟩, dog), (⟨image⟩, car), (⟨image⟩, airplane), . . .


Formally z1 = (x1, y1), . . . , zn = (xn, yn) ∼ ρ i.i.d., with each zi ∈ X × Y
▶ Y := {1, 2, . . . , c}
▶ c = number of classes
Reduction to Binary Classification

One-vs-One
Train c(c − 1)/2 binary classifiers, one for each pair of distinct classes
The final prediction is made by majority vote
Reduction to Binary Classification
One-vs-Rest
Train c binary classifiers, one for each class
Train j-th classifier to distinguish class j from the rest
Suppose h1, . . . , hc : X ↦ R are our binary classifiers
The final prediction is
h(x) = arg max_{j∈[c]} hj(x)

Tie breaking: If two distinct i, j ∈ [c] achieve the maximum (i.e., wi⊤ x = wj⊤ x), we break the tie using some consistent policy, e.g., predicting the smaller of i and j.
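As a concrete illustration, here is a minimal NumPy sketch of One-vs-Rest prediction with this tie-breaking rule; the function name ovr_predict and the toy numbers are illustrative, not part of the lecture.

import numpy as np

def ovr_predict(W, X):
    # W has shape (d, c) with columns w_1, ..., w_c; X has shape (n, d).
    # np.argmax returns the first (smallest-index) maximizer, which matches
    # the tie-breaking policy of predicting the smaller label.
    scores = X @ W                          # scores[i, j] = w_j^T x_i
    return np.argmax(scores, axis=1) + 1    # labels in {1, ..., c}

# Toy usage with c = 3 classes in d = 2 dimensions (made-up numbers)
W = np.array([[0.0, -0.7, 0.7],
              [1.0,  0.7, 0.7]])
X = np.array([[0.0, 1.0], [-1.0, 0.2]])
print(ovr_predict(W, X))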
Recap: Linear Binary Classifier

Y = {−1, +1}, X = Rd
Linear classifier score function h(x) = w⊤ x
Final prediction: sign(h(x))

h(x) = w⊤ x = ∥w∥2 ∥x∥2 cos(θ)


h(x) > 0 ⇐⇒ cos θ > 0 =⇒ θ ∈ (−90◦ , 90◦ )
h(x) < 0 ⇐⇒ cos θ < 0 =⇒ θ ̸∈ (−90◦ , 90◦ )
Three Class Example

Base hypothesis space


H = {x ↦ w⊤ x : w ∈ R2}


Note: separating boundary always contains the origin


Three Class Example: One-vs-Rest

Class 1 vs Rest: h1 (x) = w1⊤ x


Three Class Example: One-vs-Rest

Class 2 vs Rest
▶ Predicts everything to be “Not 2”
▶ It misclassifies 6 examples (all examples in class “2”)
▶ If it predicted some “2”, then it would get many more “Not 2” incorrect
Score for class j is
hj(x) = wj⊤ x = ∥wj∥2 ∥x∥2 cos θj,
where θj is the angle between x and wj .
One-vs-Rest: Class Boundaries

For simplicity, we assume ∥w1 ∥2 = ∥w2 ∥2 = ∥w3 ∥2


Then wj⊤ x = ∥wj ∥2 ∥x∥2 cos θj
▶ Only θj matters for comparison
▶ x is classified by whichever has largest cos θj (i.e., θj closest to 0)
One-vs-Rest: Class Boundaries

This approach doesn’t work well in this instance!

How can we fix this?


The Linear Multiclass Hypothesis Space

Base hypothesis space


H̃ = {x ↦ w⊤ x : w ∈ Rd}

Linear multiclass hypothesis space
H = {x ↦ arg max_{j∈[c]} hj(x) : h1, . . . , hc ∈ H̃}

A learning algorithm chooses a hypothesis from the hypothesis space
Is this a failure of the hypothesis space or of the learning algorithm?
A Solution with Linear Functions

This works... so the problem is not with the hypothesis space


How can we get a solution like this?
Multiclass Predictors
Multiclass Hypothesis Space

Idea: use one function h(x, y ) to give a compatibility score between input x and output y
Final prediction is the y ∈ Y that is “most compatible” with x

x ← arg max_{y∈Y} h(x, y).

Given class-specific score functions h1, . . . , hc, we could define the compatibility function as
h(x, j) = hj(x), j ∈ [c].

Base Hypothesis Space: H = {h : X × Y ↦ R}
Multiclass Hypothesis Space
{x ↦ arg max_{y∈Y} h(x, y) : h ∈ H}

Want compatibility h(x, y ) to be large when x has label y , small otherwise!


Multi-class Classification: Example

Points x1 , x2 and x3 will be classified as label 1, 2 and 3, respectively.

What do the three rays stand for?


Linear Multiclass Prediction Function

A linear compatibility score function is given by

h(x, y ) = ⟨w, Ψ(x, y )⟩,



where Ψ : X × Y ↦ Rd is a compatibility feature map

Ψ(x, y ) extracts features relevant to how compatible y is with x


Final compatibility score is a linear function of Ψ(x, y )
Linear Multiclass Hypothesis Space

{x ↦ arg max_{y∈Y} ⟨w, Ψ(x, y)⟩ : w ∈ Rd}
Example: X = R2 , Y = {1, 2, 3}

w1 = (0, 1)⊤,   w2 = (−1/√2, 1/√2)⊤,   w3 = (1/√2, 1/√2)⊤

Prediction function
x = (x1, x2)⊤ ← arg max_{j∈{1,2,3}} ⟨wj, x⟩

How can we get this into the form x ← arg max_{y∈Y} ⟨w, Ψ(x, y)⟩?
The Multivector Construction
What if we stack the wj's together:
w = (w1, w2, w3) = ((0, 1)⊤, (−1/√2, 1/√2)⊤, (1/√2, 1/√2)⊤)

We define the feature map Ψ : R2 × {1, 2, 3} ↦ R2×3 as
Ψ(x, 1) = ((x1, x2)⊤, (0, 0)⊤, (0, 0)⊤),
Ψ(x, 2) = ((0, 0)⊤, (x1, x2)⊤, (0, 0)⊤),
Ψ(x, 3) = ((0, 0)⊤, (0, 0)⊤, (x1, x2)⊤).

Then we get the desired identity ⟨w, Ψ(x, y)⟩ = ⟨wy, x⟩
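As a quick numerical check of this identity, here is a minimal NumPy sketch of the multivector construction; psi and the test point are illustrative names, and the weight vectors follow the example above.

import numpy as np

def psi(x, y, c):
    # Multivector feature map: place x in the y-th block of a length-(d*c) vector.
    d = x.shape[0]
    out = np.zeros(d * c)
    out[(y - 1) * d:y * d] = x              # labels are 1, ..., c
    return out

w1 = np.array([0.0, 1.0])
w2 = np.array([-1.0, 1.0]) / np.sqrt(2)
w3 = np.array([1.0, 1.0]) / np.sqrt(2)
w = np.concatenate([w1, w2, w3])            # stacked parameter vector

x = np.array([0.5, -2.0])                   # an arbitrary test point
for y, wy in enumerate([w1, w2, w3], start=1):
    assert np.isclose(np.dot(w, psi(x, y, 3)), np.dot(wy, x))
print("<w, Psi(x, y)> = <w_y, x> for every y")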


Linear Multiclass Prediction Function: Simplification

In our lecture, we always consider the following compatibility feature map


 
Ψ(x, j) = (0, . . . , 0, x, 0, . . . , 0) ∈ Rd×c, where x appears in the j-th component

We stack all wj into a matrix

w = (w1 , . . . , wc ) ∈ Rd×c .

Then it is clear that


⟨w, Ψ(x, j)⟩ = wj⊤ x!
Linearly Separable Dataset

A training dataset S is linearly separable if there exist w1, . . . , wc that correctly classify all the points in S, i.e.,
for every point x ∈ S with label y, wy⊤ x > wj⊤ x for all j ≠ y
Recap: Binary Classification
Margin of a hyperplane for a dataset: the width by which the boundary could be increased without hitting a data point


How to define the margin for multi-class classification?


Multi-class Margin

Recap: we choose the class label with the largest score function

ŷ := arg maxj∈Y h(x, j)

The prediction is correct if the true class label receives the largest score!

Multi-class Margin
Aim: Want h(xi, yi) larger than all other h(xi, j) for j ≠ yi, i.e., a large margin
margin(h, z) = h(x, y) − max_{j:j≠y} h(x, j)
Multi-class Margin

Defined as the difference between the score of the correct label and the highest score among the incorrect labels
Margin-based Loss

Ideal property of loss function


Loss should be a decreasing function of the margin
The model w predicts a vector in Rc : (w1⊤ x, . . . , wc⊤ x)
We need a map from Rc to R+

We consider the margin-based loss of the form

f(w; z) = ℓy(w1⊤ x, w2⊤ x, . . . , wc⊤ x),
where ℓy : Rc ↦ R.

Hinge loss: ℓy(t1, . . . , tc) = max{0, 1 − margin} = max{0, 1 + max_{j:j≠y}(tj − ty)}, where margin = ty − max_{j:j≠y} tj

f(w; z) = max{0, 1 + max_{j:j≠y} ⟨wj − wy, x⟩} = ℓy(w1⊤ x, . . . , wc⊤ x)    (1)
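A minimal NumPy sketch of the multiclass hinge loss (1) for a single example; the names W, x, y are illustrative placeholders (W stores w_1, ..., w_c as columns).

import numpy as np

def multiclass_hinge(W, x, y):
    # max{0, 1 + max_{j != y} <w_j - w_y, x>} with label y in {1, ..., c}
    scores = W.T @ x                        # t_j = w_j^T x
    t_y = scores[y - 1]
    others = np.delete(scores, y - 1)       # scores of the incorrect labels
    return max(0.0, 1.0 + np.max(others) - t_y)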
Soft-Max Loss
Note that the margin of t is ty − maxj̸=y tj
Max is not differentiable. Replace it with the soft-max margin
max_{j:j≠y}(tj − ty) ≤ log Σ_{j:j≠y} exp(tj − ty) ≤ max_{j:j≠y}(tj − ty) + log c,
where the left-hand side equals −margin and the middle term is the soft-max

Want to maximize margin: x ↦ log(1 + exp(−x)) is a decreasing function
log(1 + exp(−margin)) ≈ log(1 + exp(log Σ_{j:j≠y} exp(tj − ty))) = log(1 + Σ_{j:j≠y} exp(tj − ty)),
where −margin is replaced by its soft-max approximation
With ℓy(t1, . . . , tc) = log(1 + Σ_{j:j≠y} exp(tj − ty)) we know
f(w; z) = log(1 + Σ_{j:j≠y} exp(wj⊤ x − wy⊤ x)) = ℓy(w1⊤ x, . . . , wc⊤ x).
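A numerically careful NumPy sketch of this soft-max loss for one example (names are illustrative); note it coincides with the familiar cross-entropy loss −log softmax(t)_y, since log(1 + Σ_{j≠y} exp(tj − ty)) = logsumexp(t) − ty.

import numpy as np

def softmax_loss(W, x, y):
    # log(1 + sum_{j != y} exp(t_j - t_y)) with t_j = w_j^T x
    t = W.T @ x
    m = np.max(t)                                   # shift for numerical stability
    logsumexp_t = m + np.log(np.sum(np.exp(t - m)))
    return logsumexp_t - t[y - 1]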
Regularization Schemes for Multi-class Classification

We train a model by minimizing the following objective w.r.t. w = (w1 , . . . , wc ), wj ∈ Rd


min_{w∈Rd×c} (1/n) Σ_{i=1}^n ℓyi(w1⊤ xi, . . . , wc⊤ xi)   s.t.   w ∈ W = {w ∈ Rd×c : Ω(w) ≤ 1},

where each summand equals f(w; zi) and Ω : Rd×c ↦ R+ is a regularizer.

Examples: Ω(w) = (λ/2) ∥w∥2,p², where
∥w∥2,p := (Σ_{j=1}^c ∥wj∥2^p)^{1/p},   p ≥ 1

p = 1: ∥w∥2,1 = Σ_{j=1}^c ∥wj∥2
p = 2: ∥w∥2,2 = (Σ_{j=1}^c ∥wj∥2²)^{1/2}
p = ∞: ∥w∥2,∞ = max_{j∈[c]} ∥wj∥2
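A small NumPy sketch of the (2, p) group norm defined above; group_norm is an illustrative name and W stores w_1, ..., w_c as columns.

import numpy as np

def group_norm(W, p):
    # ||W||_{2,p} = || (||w_1||_2, ..., ||w_c||_2) ||_p
    col_norms = np.linalg.norm(W, axis=0)
    if np.isinf(p):
        return np.max(col_norms)                    # p = infinity case
    return np.sum(col_norms ** p) ** (1.0 / p)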


Learning Theory for Multi-class Classification
Rademacher Complexity for Multi-class Classification

The loss function class becomes


F = {z ↦ ℓy(w1⊤ x, . . . , wc⊤ x) : Ω(w) ≤ 1}


How to estimate the Rademacher complexity


Eσ sup_{w∈W} (1/n) Σ_{i=1}^n σi ℓyi(w1⊤ xi, . . . , wc⊤ xi)?

The difficulty lies in the nonlinearity of ℓy , which takes a vector in Rc as the input!
Lipschitz Continuity of ℓy

We say ℓy is G-Lipschitz continuous w.r.t. a norm ∥·∥ on Rc if
|ℓy(a1, . . . , ac) − ℓy(a1′, . . . , ac′)| ≤ G ∥(a1, . . . , ac) − (a1′, . . . , ac′)∥.


 

It is an extension of Lipschitz continuity for functions from R to R


We are interested in ∥ · ∥2 and ∥ · ∥∞

Lemma. Let a1 , . . . , ac and b1 , . . . , bc ∈ R. Then


 
|max{a1, . . . , ac} − max{b1, . . . , bc}| ≤ max_{j∈[c]} |aj − bj|.

Proof. Let ak = max{a1, . . . , ac} and bk′ = max{b1, . . . , bc}. Without loss of generality, assume ak ≥ bk′. Then
max{a1, . . . , ac} − max{b1, . . . , bc} = ak − bk′ ≤ ak − bk ≤ max_{j∈[c]} |aj − bj|.
Lipschitz Continuity of Hinge Loss

Example
The hinge loss ℓy(t1, . . . , tc) = max{0, 1 + max_{j:j≠y}(tj − ty)} is 2-Lipschitz continuous w.r.t. ∥·∥∞.

Proof. We just showed
|max{a1, . . . , ac} − max{b1, . . . , bc}| ≤ max_{j∈[c]} |aj − bj|.
Let t = (t1, . . . , tc) and t′ = (t1′, . . . , tc′). Then
|ℓy(t) − ℓy(t′)| = |max{0, 1 + max_{j:j≠y}(tj − ty)} − max{0, 1 + max_{j:j≠y}(tj′ − ty′)}|
≤ |(1 + max_{j:j≠y}(tj − ty)) − (1 + max_{j:j≠y}(tj′ − ty′))|
≤ max_{j:j≠y} |(tj − ty) − (tj′ − ty′)| ≤ 2∥t − t′∥∞.
Lipschitz Continuity of the Soft-max Loss

Example
The soft-max loss ℓy(t1, . . . , tc) = log(1 + Σ_{j:j≠y} exp(tj − ty)) is 2-Lipschitz continuous w.r.t. ∥·∥∞.
Proof. The function t ↦ g(t) := log Σ_{j∈[c]} exp(tj) is 1-Lipschitz continuous w.r.t. ∥·∥∞:
∇g(t) = (Σ_{j∈[c]} exp(tj))^{-1} (exp(t1), . . . , exp(tc))⊤ ⟹ ∥∇g(t)∥1 ≤ 1,
and by a first-order Taylor expansion (mean value theorem), g(t) − g(t′) = ⟨t − t′, ∇g(αt + (1 − α)t′)⟩ for some α ∈ [0, 1]
⟹ |g(t) − g(t′)| ≤ ∥t − t′∥∞ ∥∇g(αt + (1 − α)t′)∥1 ≤ ∥t − t′∥∞.
Since ℓy(t) = g(t1 − ty, . . . , tc − ty), we get |ℓy(t) − ℓy(t′)| ≤ max_{j∈[c]} |(tj − ty) − (tj′ − ty′)| ≤ 2∥t − t′∥∞.
Vector contraction of Rademacher complexity

Vector contraction of Rademacher complexity


Let F be a class of functions from Z to Rc. For each i ∈ [n], let σi : Rc ↦ R be L-Lipschitz continuous w.r.t. ∥·∥2. Then
E sup_{f∈F} Σ_{i∈[n]} ϵi σi(f(zi)) ≤ √2 L E sup_{f∈F} Σ_{i∈[n]} Σ_{k∈[c]} ϵi,k fk(zi),
where ϵi,k are i.i.d. Rademacher variables, and fk(zi) is the k-th component of f(zi).

It is an extension of Talagrand's contraction lemma to vector-valued functions


It removes the nonlinear σ, which is the loss function in multi-class classification!

Maurer, A., 2016. A vector-contraction inequality for Rademacher complexities. In Algorithmic Learning
Theory.
Rademacher Complexity Bound (Optional)
Let ℓy be either the hinge loss or the soft-max loss
|ℓy(t) − ℓy(t′)| ≤ 2 max_{j∈[c]} |tj − tj′| = 2∥t − t′∥∞ ≤ 2 (Σ_{j=1}^c (tj − tj′)²)^{1/2} = 2∥t − t′∥2.

Then vector contraction of Rademacher complexity implies


E sup_{w∈W} Σ_{i∈[n]} ϵi ℓyi(w1⊤ xi, . . . , wc⊤ xi) ≤ 2√2 E sup_{w∈W} Σ_{i∈[n]} Σ_{j∈[c]} ϵij wj⊤ xi.

Let W = {w ∈ Rd×c : ∥w∥2,2 ≤ 1}. Then


E sup_{w∈W} Σ_{i∈[n]} Σ_{j∈[c]} ϵij wj⊤ xi = E sup_{w∈W} Σ_{j∈[c]} wj⊤ (Σ_{i∈[n]} ϵij xi) ≤ E sup_{w∈W} Σ_{j∈[c]} ∥wj∥2 ∥Σ_{i∈[n]} ϵij xi∥2
≤ E sup_{w∈W} (Σ_{j∈[c]} ∥wj∥2²)^{1/2} (Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}
≤ E [(Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}].
Rademacher Complexity Bound (Optional)
We just showed
E sup_{w∈W} Σ_{i∈[n]} ϵi ℓyi(w1⊤ xi, . . . , wc⊤ xi) ≤ √2 L E [(Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}].


By the concavity of x ↦ √x (Jensen's inequality), we know
E [(Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}] ≤ (Σ_{j=1}^c E ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}.

Furthermore, there holds
E ∥Σ_{i∈[n]} ϵij xi∥2² = E [Σ_{i∈[n]} Σ_{i′∈[n]} ⟨ϵij xi, ϵi′j xi′⟩] = Σ_{i∈[n]} E[ϵij²] xi⊤ xi + Σ_{i≠i′} E[ϵij ϵi′j] xi⊤ xi′,
where the second sum equals 0.

⟹ E [(Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}] ≤ (c · Σ_{i∈[n]} E∥xi∥2²)^{1/2}.
Rademacher Complexity Bound

The Rademacher complexity of F is bounded by


RS({z ↦ ℓy(w1⊤ x, . . . , wc⊤ x) : ∥w∥2,2 ≤ 1}) ≤ (1/n) (c · Σ_{i∈[n]} E∥xi∥2²)^{1/2}.

Generalization bound
Let A be ERM. Then with probability at least 1 − δ
F(A(S)) − F(w∗) ≲ n^{−1/2} log^{1/2}(2/δ) + √c/√n.

This shows a square-root dependency of the generalization bound on the number of classes!
Frank-Wolfe Algorithm
Training Multiclass Model

Consider the following minimization problem (p ≥ 1)


min_w (1/n) Σ_{i=1}^n ℓyi((wj⊤ xi)_{j∈[c]})   s.t.   Σ_{j∈[c]} ∥wj∥2^p ≤ B^p.

It is a constrained optimization problem


Can be solved by projected gradient descent, which, however, requires a projection step per iteration
The projection can be expensive
Can we develop a projection-free method?

Yes, the Frank-Wolfe Algorithm!


Frank-Wolfe Method

Objective: min_{w∈W} FS(w), where W is convex.

Recap: projected gradient descent (PGD)



Recall that PGD wt+1 = ProjW(wt − η∇FS(wt)) uses a quadratic approximation
wt+1 = arg min_{w∈W} {FS(wt) + ⟨w − wt, ∇FS(wt)⟩ + (1/(2η)) ∥w − wt∥2²}.

In some cases, the projection may be hard to compute (or even approximate)

For these problems, we can solve the simplified problem


arg min_{w∈W} {FS(wt) + ⟨w − wt, ∇FS(wt)⟩},

which optimizes a linear approximation to the function over the constraint set
This requires the set W to be bounded, otherwise there may be no solution
Frank-Wolfe Method

Frank-Wolfe Method
1: for t = 0, 1, . . . , T do
2: Set vt = arg minv∈W ⟨∇FS (wt ), v⟩
3: Set wt+1 = wt + γt (vt − wt ), where γt ∈ [0, 1]

wt+1 is feasible since it is a convex combination of two other feasible points

wt+1 = (1 − γt )wt + γt vt

A choice of γt is γt = 2/(t + 2)
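A minimal NumPy sketch of this loop; grad (the gradient of FS), lmo (the linear minimization oracle over W), the feasible starting point w0, and the iteration count T are all user-supplied placeholders.

import numpy as np

def frank_wolfe(grad, lmo, w0, T):
    # Generic Frank-Wolfe: only an LMO over the constraint set is needed.
    w = w0
    for t in range(T):
        v = lmo(grad(w))                    # v_t = argmin_{v in W} <grad F_S(w_t), v>
        gamma = 2.0 / (t + 2)               # standard step size gamma_t = 2/(t+2)
        w = (1 - gamma) * w + gamma * v     # convex combination stays feasible
    return w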
Linear Minimization Oracle
Let W be a convex, closed and bounded set. The linear minimization oracle of W (LMO_W) returns a vector v̂ such that
v̂ = LMO_W(g) ∈ arg min_{v∈W} g⊤ v.

LMO_W returns an extreme point of W
LMO_W is projection-free and arguably cheaper than a projection
Example
Apply the FW algorithm to solve the following nonlinear problem with the initialization
point (x1 , y1 ) = (0, 0)

min f (x, y ) = (x − 1)2 + (y − 1)2


s.t. 2x + y ≤ 6
x + 2y ≤ 6
x, y ≥ 0.

Solution. The gradient of f is ∇f (x, y ) = (2x − 2, 2y − 2)⊤ .


We know ∇f (0, 0) = (−2, −2)⊤
We get v by solving the linear optimization problem
min_{(x,y)} −2x − 2y   s.t.   2x + y ≤ 6, x + 2y ≤ 6, x, y ≥ 0.

By the simplex method, one can show that v = (2, 2)⊤


If we choose γ = 1/2, then we get
(x2, y2)⊤ = (x1, y1)⊤ + γ(v − (x1, y1)⊤) = (0, 0)⊤ + (1/2)(2, 2)⊤ = (1, 1)⊤
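This single step can be reproduced numerically with a small sketch (the LMO here is a brute-force minimum over the vertices of the feasible polytope, which suffices because the subproblem's objective is linear):

import numpy as np

grad = lambda w: 2 * (w - np.array([1.0, 1.0]))                 # gradient of f
vertices = np.array([[0, 0], [3, 0], [0, 3], [2, 2]], float)    # extreme points of the feasible set
lmo = lambda g: vertices[np.argmin(vertices @ g)]               # linear minimization over the polytope

w = np.zeros(2)
v = lmo(grad(w))                 # v = (2, 2)
w = w + 0.5 * (v - w)            # gamma = 1/2 gives w = (1, 1)
print(v, w)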
Example

Lasso Regression
min_w FS(w) = ∥Aw − b∥2²   s.t.   ∥w∥1 ≤ 1.

∇FS(w) = g := 2A⊤(Aw − b)


LMO_W(g) = −sign(gj∗) ej∗ with j∗ := arg max_{j∈[d]} |gj|, which is simpler than projection onto the L1-ball. Here ej is the j-th unit vector.
Proof. For any w ∈ W, we have
w⊤ g = Σ_{j∈[d]} wj gj ≥ − Σ_{j∈[d]} |wj gj| ≥ − max_{j∈[d]} |gj| Σ_{j∈[d]} |wj| ≥ −|gj∗| = −sign(gj∗) ej∗⊤ g.
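A minimal NumPy sketch of this L1-ball LMO inside a Frank-Wolfe loop on random placeholder data (function and variable names are illustrative; the gradient uses 2A⊤(Aw − b)):

import numpy as np

def lmo_l1(g):
    # LMO for {w : ||w||_1 <= 1}: a signed standard basis vector at the largest |g_j|
    j = np.argmax(np.abs(g))
    v = np.zeros_like(g)
    v[j] = -np.sign(g[j])
    return v

rng = np.random.default_rng(0)
A, b = rng.standard_normal((100, 500)), rng.standard_normal(100)   # n = 100, d = 500
w = np.zeros(500)
for t in range(200):                     # Frank-Wolfe iterations
    g = 2 * A.T @ (A @ w - b)            # gradient of ||Aw - b||_2^2
    gamma = 2.0 / (t + 2)
    w = (1 - gamma) * w + gamma * lmo_l1(g)
print(np.abs(w).sum() <= 1 + 1e-9)       # the iterate stays in the L1 ball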
Lasso comparison
Comparing projected and conditional gradient methods for the constrained lasso problem, with n = 100, d = 500
Convergence of Frank-Wolfe Method
Theorem. Let W be nonempty, convex and bounded. Let FS be convex and L-smooth.
Then FW with γt = 2/(t + 2) satisfies

FS(wt) − FS(w∗) ≤ 2LD²/(t + 1),   where D := max_{w,w′∈W} ∥w − w′∥2.

Proof. Since vt and w∗ are in W, optimality of vt shows

⟨vt − wt , ∇FS (wt )⟩ ≤ ⟨w∗ − wt , ∇FS (wt )⟩ ≤ FS (w∗ ) − FS (wt ).

By the L-smoothness and wt+1 − wt = γt(vt − wt),
FS(wt+1) ≤ FS(wt) + ⟨wt+1 − wt, ∇FS(wt)⟩ + (L/2) ∥wt+1 − wt∥2²
= FS(wt) + γt ⟨vt − wt, ∇FS(wt)⟩ + (γt² L/2) ∥vt − wt∥2²
≤ FS(wt) + γt (FS(w∗) − FS(wt)) + γt² LD²/2
⟹ FS(wt+1) − FS(w∗) ≤ (1 − γt) (FS(wt) − FS(w∗)) + γt² LD²/2.
Convergence of Frank-Wolfe Method

We derived FS(wt+1) − FS(w∗) ≤ (1 − γt) (FS(wt) − FS(w∗)) + γt² LD²/2
⟹ FS(wt+1) − FS(w∗) ≤ (t/(t + 2)) (FS(wt) − FS(w∗)) + 4LD²/(2(t + 2)²)

We multiply both sides by (t + 1)(t + 2) and get, with ∆t := FS(wt) − FS(w∗),
(t + 1)(t + 2) ∆t+1 ≤ t(t + 1) ∆t + 2LD².
Telescoping shows
(t + 1)(t + 2) ∆t+1 = Σ_{k=0}^t [(k + 1)(k + 2) ∆k+1 − k(k + 1) ∆k] ≤ Σ_{k=0}^t 2LD² = 2LD²(t + 1),
which gives ∆t+1 ≤ 2LD²/(t + 2).
Stochastic Frank-Wolfe Method
Stochastic Frank-Wolfe Method
1: for t = 0, 1, . . . , T do
2: Set vt = arg min_{v∈W} ⟨∇̂t, v⟩, where ∇̂t is an unbiased estimator of ∇FS(wt)
3: Set wt+1 = wt + γt (vt − wt), where γt ∈ [0, 1]

Convergence Rate
Let W be nonempty, convex and bounded. Let FS be convex and L-smooth. Assume the
following variance condition

E ∥∇̂t − ∇FS(wt)∥2² ≤ (LD/(t + 1))².
Then stochastic FW with γt = 2/(t + 2) satisfies
E [FS(wt) − FS(w∗)] ≤ 4LD²/(t + 1).

Stochastic FW requires decreasing variance!


Stochastic Frank-Wolfe Method
Note
FS(w) = (1/n) Σ_{i=1}^n f(w; zi).

Assume St is a random sample (with replacement) from [n] and
∇̂t = (1/|St|) Σ_{i∈St} ∇f(wt; zi).

Then ∇̂t is an unbiased estimator of ∇FS(wt):
ESt[∇̂t] = (1/|St|) ESt[Σ_{i∈St} ∇f(wt; zi)] = ∇FS(wt).

Furthermore, the variance decreases by a factor of |St|:
E ∥∇̂t − ∇FS(wt)∥2² = σ²/|St|,   σ² := E[∥∇f(wt; zit) − ∇FS(wt)∥2²].
Choosing |St| = σ²(t + 1)²/(L²D²) meets the variance condition for stochastic FW.
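A minimal sketch of this mini-batch gradient estimator and batch-size schedule; grad_i (computing ∇f(w; zi)), sigma, L and D are assumed to be supplied by the user.

import numpy as np

def stochastic_grad(grad_i, w, n, batch_size, rng):
    # Unbiased estimate of grad F_S(w) via sampling with replacement from [n]
    idx = rng.integers(0, n, size=batch_size)
    return np.mean([grad_i(w, i) for i in idx], axis=0)

def batch_size(t, sigma, L, D):
    # |S_t| = sigma^2 (t+1)^2 / (L^2 D^2) meets the variance condition
    return int(np.ceil(sigma**2 * (t + 1)**2 / (L**2 * D**2)))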
Optimization for Multi-class Classification
Training Multi-class Model by Frank-Wolfe Method
Recall the problem
min_w (1/n) Σ_{i=1}^n ℓyi((wj⊤ xi)_{j∈[c]})   s.t.   Σ_{j∈[c]} ∥wj∥2² ≤ B².

Frank-Wolfe method for multi-class classification


1: for t = 0, 1, . . . , T do
2: Set vt = arg min_{w: Σj∈[c] ∥wj∥2² ≤ 1} ⟨w, ∇FS(wt)⟩
3: Set wt+1 = wt + γt (vt − wt), where γt ∈ [0, 1]

Frank-Wolfe Algorithm requires solving a linear optimization problem


arg min_w ⟨w, g⟩   s.t.   Σ_{j∈[c]} ∥wj∥2² ≤ 1,

which has a closed-form solution w∗ = (w1∗, . . . , wc∗) as follows:
wj∗ = −gj / (Σ_{j̃=1}^c ∥gj̃∥2²)^{1/2}.    (2)
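A minimal NumPy sketch of the closed-form LMO (2), with the gradient stored as a d × c matrix G (an illustrative name); note that (Σ_j ∥gj∥2²)^{1/2} is just the Frobenius norm of G. Plugged into the generic Frank-Wolfe sketch above, this gives a projection-free solver for the multiclass training problem.

import numpy as np

def lmo_group_l2(G):
    # Closed-form LMO over {W : sum_j ||w_j||_2^2 <= 1} for a d x c gradient matrix G
    denom = np.linalg.norm(G)               # Frobenius norm = (sum_j ||g_j||_2^2)^{1/2}
    return -G / denom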
Proof on Closed-form Solution (Optional)

w∗ = arg min_{w: Σj∈[c] ∥wj∥2² ≤ 1} ⟨w, g⟩   ⟺   wj∗ = −gj / (Σ_{j̃=1}^c ∥gj̃∥2²)^{1/2}.

It suffices to check ∥w∗∥2,2 ≤ 1 and ⟨w∗, g⟩ = −∥g∥2,2, since by the Cauchy-Schwarz inequality ⟨w, g⟩ ≥ −∥w∥2,2 ∥g∥2,2 ≥ −∥g∥2,2 for every feasible w.


It is clear that
∥w∗∥2,2 = (Σ_{j̃=1}^c ∥gj̃∥2²)^{1/2} / (Σ_{j̃=1}^c ∥gj̃∥2²)^{1/2} = 1
and
⟨w∗, g⟩ = Σ_{j=1}^c ⟨wj∗, gj⟩ = −(Σ_{j̃=1}^c ∥gj̃∥2²)^{−1/2} Σ_{j=1}^c ∥gj∥2² = −(Σ_{j̃=1}^c ∥gj̃∥2²)^{1/2} = −∥g∥2,2.
Summary

Multiclass predictors: vector-valued functions


▶ margin-based formulation: we want to find a model with a large margin
Learning theory
▶ Rademacher complexity bound with a square-root dependency on c
Frank-Wolfe Algorithm
▶ projection-free algorithm for constrained optimization
▶ replace projection by a linear minimization problem
Optimization for multi-class classification
▶ A closed-form solution for the linear minimization problem
