ECE 8443 – Pattern Recognition
• Objectives:
Empirical Risk Minimization
Large-Margin Classifiers
Soft Margin Classifiers
SVM Training
Relevance Vector Machines
• Resources:
DML: Introduction to SVMs
AM: SVM Tutorial
JP: SVM Resources
OC: Taxonomy
NC: SVM Tutorial
LECTURE 16: SUPPORT VECTOR MACHINES
Audio:
URL:
ECE 8443: Lecture 16, Slide 1
Generative Models
• Thus far we have essentially considered techniques that perform classification
indirectly by modeling the training data, optimizing the parameters of that
model, and then performing classification by choosing the closest model. This
approach is known as a generative model: in training such models, supervised
learning assumes we know the form of the underlying density function, which
is often not true in real applications.
• Convergence in maximum likelihood does
not guarantee optimal classification.
• Gaussian MLE modeling tends to
overfit data.
[Figure: Gaussian MLE fits to the data; the ML decision boundary (MLE Gaussian)
differs from the optimal decision boundary. Panels contrast discrimination with
class-dependent PCA.]
• Real data often not separable by hyperplanes.
• Goal: balance representation and discrimination
in a common framework (rather than alternating
between the two).
ECE 8443: Lecture 16, Slide 2
Risk Minimization
• The expected risk can be defined as:
  $R(\alpha) = \int \tfrac{1}{2}\,|y - f(\mathbf{x},\alpha)|\; dP(\mathbf{x}, y)$
• Empirical risk is defined as:
  $R_{emp}(\alpha) = \dfrac{1}{2l}\sum_{i=1}^{l} |y_i - f(\mathbf{x}_i,\alpha)|$
• These are related by the Vapnik-Chervonenkis (VC) dimension:
  $R(\alpha) \le R_{emp}(\alpha) + \varphi(h)$, where
  $\varphi(h) = \sqrt{\dfrac{h\,(\log(2l/h) + 1) - \log(\eta/4)}{l}}$
  $\varphi(h)$ is referred to as the VC confidence, and $\eta$ is a confidence measure in
the range [0,1].
• The VC dimension, h, is a measure of the capacity of the learning machine.
• The principle of structural risk minimization (SRM) involves finding the subset
of functions that minimizes the bound on the actual risk.
• Optimal hyperplane classifiers achieve zero empirical risk for linearly
separable data.
• A Support Vector Machine is an approach that gives the least upper bound on
the risk.
[Figure: the bound on the expected risk is the sum of the empirical risk and the
confidence in the risk, plotted against the VC dimension; the optimum occurs where
the bound on the expected risk is minimized.]
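To make the bound concrete, here is a minimal numpy sketch (not from the original slides; the values of l, h, and η below are hypothetical) that evaluates the empirical risk and the VC confidence:

```python
import numpy as np

def empirical_risk(y_true, y_pred):
    """R_emp: average of (1/2)|y_i - f(x_i)| for labels in {-1, +1}."""
    return 0.5 * np.mean(np.abs(y_true - y_pred))

def vc_confidence(h, l, eta):
    """phi(h): VC confidence for capacity h, l samples, confidence eta."""
    return np.sqrt((h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l)

# Hypothetical numbers chosen only for illustration.
l, h, eta = 1000, 25, 0.05
y_true = np.random.choice([-1, 1], size=l)
y_pred = np.where(np.random.rand(l) < 0.9, y_true, -y_true)  # ~10% errors

r_emp = empirical_risk(y_true, y_pred)
bound = r_emp + vc_confidence(h, l, eta)
print(f"R_emp = {r_emp:.3f}, bound on expected risk <= {bound:.3f}")
```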
ECE 8443: Lecture 16, Slide 3
Large-Margin Classification
• Hyperplanes C0 - C2 achieve perfect classification
(zero empirical risk):
 C0 is optimal in terms of generalization.
 The data points that define the boundary
are called support vectors.
 A hyperplane can be defined by:  $\mathbf{w}\cdot\mathbf{x} + b = 0$
 We will impose the constraints:  $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \ge 0$
The data points that satisfy the equality are
called support vectors.
• Support vectors are found using a constrained optimization:
  $L_p = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N}\alpha_i y_i(\mathbf{w}\cdot\mathbf{x}_i + b) + \sum_{i=1}^{N}\alpha_i$
• The final classifier is computed using the support vectors and the weights:
  $f(\mathbf{x}) = \sum_{i=1}^{N}\alpha_i y_i\,\mathbf{x}_i\cdot\mathbf{x} + b$
[Figure: two classes separated by candidate hyperplanes C0, C1, and C2, with margin
hyperplanes H1 and H2; C0 is the optimal classifier, w is the normal to the hyperplane,
and b sets its offset from the origin.]
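As an aside not on the original slide, the constrained optimization above is what off-the-shelf solvers carry out. A minimal sketch assuming scikit-learn's SVC (a very large C approximates the hard-margin case) recovers w, b, and the support vectors from synthetic, separable data:

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable point clouds (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(+2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# A very large C approximates the hard-margin classifier.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]          # hyperplane: w.x + b = 0
print("support vectors:\n", clf.support_vectors_)
print("margin width =", 2.0 / np.linalg.norm(w))

# Support vectors satisfy y_i (w.x_i + b) ~= 1.
margins = y[clf.support_] * (X[clf.support_] @ w + b)
print("y_i (w.x_i + b) at the support vectors:", np.round(margins, 3))
```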
ECE 8443: Lecture 16, Slide 4
Soft-Margin Classification
• In practice, the number of support vectors will grow unacceptably large for
real problems with large amounts of data.
• Also, the system will be very sensitive to mislabeled training data or outliers.
• Solution: introduce “slack variables”
or a soft margin:
This gives the system the ability to
ignore data points near the boundary,
and effectively pushes the margin
towards the centroid of the training data.
• This is now a constrained optimization
with an additional constraint:
  $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i$, with slack $\xi_i \ge 0$
• The solution to this problem can still be found using Lagrange multipliers.
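A small illustrative sketch (again assuming scikit-learn's SVC and synthetic, overlapping data): the slack ξ_i = max(0, 1 − y_i(w·x_i + b)) is nonzero for points that violate the margin, and the trade-off parameter C controls how much total slack is tolerated:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Overlapping classes, so a hard margin is impossible (hypothetical data).
X = np.vstack([rng.normal(-1, 1.0, (100, 2)), rng.normal(+1, 1.0, (100, 2))])
y = np.array([-1] * 100 + [+1] * 100)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # xi_i per training point
    print(f"C={C:7.2f}  support vectors={len(clf.support_)}  "
          f"total slack={slack.sum():.1f}")
```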
ECE 8443: Lecture 16, Slide 5
Nonlinear Decision Surfaces
[Figure: a nonlinear mapping φ(·) takes points from the input space to a
higher-dimensional feature space where they become linearly separable.]
• Thus far we have only considered linear decision surfaces. How do we
generalize this to a nonlinear surface?
• Our approach will be to transform the data to a higher dimensional space
where the data can be separated by a linear surface.
• Define a kernel function:
  $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)\cdot\varphi(\mathbf{x}_j)$
Examples of kernel functions include the polynomial kernel:
  $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^{t}\mathbf{x}_j + 1)^d$
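A quick numerical check, not from the slides, that the degree-2 polynomial kernel really is an inner product in a transformed space: for 2-D inputs, (x_i^t x_j + 1)^2 equals φ(x_i)·φ(x_j) for the explicit six-dimensional map written out below:

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """K(x, z) = (x.z + 1)^d."""
    return (x @ z + 1.0) ** d

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])
print(poly_kernel(x, z), phi(x) @ phi(z))   # identical up to rounding
```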
ECE 8443: Lecture 16, Slide 6
Kernel Functions
Other popular kernels are a radial basis function (popular in neural networks):
  $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\dfrac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$
and a sigmoid function:
  $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(k\,\mathbf{x}_i^{t}\mathbf{x}_j - \delta)$
• Our optimization does not change significantly:
  $\max_{\alpha}\; W(\alpha) = \sum_{i=1}^{n}\alpha_i - \dfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,K(\mathbf{x}_i, \mathbf{x}_j)$
  subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{n}\alpha_i y_i = 0$
• The final classifier has a similar form:
  $f(\mathbf{x}) = \sum_{i=1}^{N}\alpha_i y_i\,K(\mathbf{x}_i, \mathbf{x}) + b$
• Let’s work some examples.
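One way to work an example is to verify the final classifier numerically. The sketch below assumes scikit-learn's RBF-kernel SVC, whose dual_coef_ attribute stores the products α_i y_i for the support vectors, and checks that Σ α_i y_i K(x_i, x) + b matches the library's decision function:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)      # a nonlinear (XOR-like) problem

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)

def rbf(a, b):
    """K(a, b) = exp(-gamma ||a - b||^2), vectorized over rows of a."""
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

x_new = np.array([0.3, -0.7])
# f(x) = sum_i alpha_i y_i K(x_i, x) + b; dual_coef_ already holds alpha_i * y_i.
f_manual = np.sum(clf.dual_coef_[0] * rbf(clf.support_vectors_, x_new)) + clf.intercept_[0]
print(f_manual, clf.decision_function([x_new])[0])   # should match
```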
ECE 8443: Lecture 16, Slide 7
SVM Limitations
[Figure: error vs. model complexity; the training-set error keeps decreasing with
complexity while the open-loop error reaches a minimum at the optimum.]
• Uses a binary (yes/no) decision rule
• Generates a distance from the hyperplane, but this distance is often not a
good measure of our “confidence” in the classification
• Can produce a “probability” as a function of the distance (e.g., using
sigmoid fits), but such estimates are often inadequate
• Number of support vectors grows linearly with the size of the data set
• Requires the estimation of trade-off parameter, C, via held-out sets
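For the "probability from distance" point above, the usual workaround is Platt scaling: fit a sigmoid to the SVM's signed distances on held-out folds. A hedged sketch using scikit-learn's sigmoid calibration (the data here is synthetic and purely illustrative):

```python
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical two-class data; any labeled set would do.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)        # yields only signed distances
platt = CalibratedClassifierCV(SVC(kernel="rbf", C=1.0),
                               method="sigmoid", cv=3).fit(X_tr, y_tr)

print("signed distance :", raw.decision_function(X_te[:3]))
print("sigmoid-mapped p:", platt.predict_proba(X_te[:3])[:, 1])
```

scikit-learn's SVC(probability=True) applies the same kind of sigmoid fit internally, with the caveats the slide notes.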
ECE 8443: Lecture 16, Slide 8
Evidence Maximization
• Build a fully specified probabilistic model – incorporate prior
information/beliefs as well as a notion of confidence in predictions.
• MacKay posed a special form for regularization in neural networks – sparsity.
• Evidence maximization: evaluate candidate models based on their
“evidence”, P(D|Hi).
• Evidence approximation:
  $P(D|H_i) \simeq P(D|\hat{\mathbf{w}}, H_i)\,P(\hat{\mathbf{w}}|H_i)\,\Delta\mathbf{w}$
• Likelihood of the data given the best-fit parameter set: $P(D|\hat{\mathbf{w}}, H_i)$
• Penalty that measures how well our posterior model $P(\mathbf{w}|D,H_i)$
fits our prior assumptions $P(\mathbf{w}|H_i)$: $P(\hat{\mathbf{w}}|H_i)\,\Delta\mathbf{w}$
• We can set the prior in favor of sparse,
smooth models.
• Incorporates an automatic relevance
determination (ARD) prior over each weight:
  $P(\mathbf{w}|\boldsymbol{\alpha}) = \prod_{i=0}^{N} \mathcal{N}(w_i \,|\, 0,\ \alpha_i^{-1})$
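To illustrate the ARD idea (this example is not on the slide), each weight gets its own zero-mean Gaussian prior with precision α_i; as α_i grows, the prior concentrates that weight at zero and effectively prunes it:

```python
import numpy as np

rng = np.random.default_rng(3)
alphas = np.array([0.1, 1.0, 100.0, 1e6])       # hypothetical precision values
samples = rng.normal(0.0, 1.0 / np.sqrt(alphas), size=(10000, len(alphas)))

for a, s in zip(alphas, samples.T):
    print(f"alpha = {a:>9.1f}   std of w under the prior = {s.std():.4f}")
# Large alpha => the prior pins w_i at zero => that weight is pruned.
```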
ECE 8443: Lecture 16, Slide 9
Relevance Vector Machines
• Still a kernel-based learning machine:
  $y(\mathbf{x};\mathbf{w}) = w_0 + \sum_{i=1}^{N} w_i\,K(\mathbf{x}, \mathbf{x}_i)$
  $P(t=1\,|\,\mathbf{x};\mathbf{w}) = \dfrac{1}{1 + e^{-y(\mathbf{x};\mathbf{w})}}$
• Incorporates an automatic relevance determination (ARD) prior over each
weight (MacKay):
  $P(\mathbf{w}|\boldsymbol{\alpha}) = \prod_{i=0}^{N} \mathcal{N}(w_i \,|\, 0,\ \alpha_i^{-1})$
• A flat (non-informative) prior over α completes the Bayesian specification.
• The goal in training becomes finding:
  $\hat{\mathbf{w}}, \hat{\boldsymbol{\alpha}} = \arg\max_{\mathbf{w},\boldsymbol{\alpha}}\, p(\mathbf{w},\boldsymbol{\alpha} \,|\, \mathbf{t}, \mathbf{X}), \quad \text{where} \quad p(\mathbf{w},\boldsymbol{\alpha} \,|\, \mathbf{t}, \mathbf{X}) = \dfrac{p(\mathbf{t} \,|\, \mathbf{w},\boldsymbol{\alpha},\mathbf{X})\; p(\mathbf{w},\boldsymbol{\alpha} \,|\, \mathbf{X})}{p(\mathbf{t} \,|\, \mathbf{X})}$
• Estimation of the “sparsity” parameters α is inherent in the optimization – no
need for a held-out set.
• A closed-form solution to this maximization problem is not available.
Rather, we iteratively reestimate $\hat{\mathbf{w}}$ and $\hat{\boldsymbol{\alpha}}$.
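For concreteness, a hedged sketch of the iterative re-estimation, using Tipping-style updates for the simpler regression (Gaussian-likelihood) case rather than the sigmoid classifier above; the step the slide alludes to is α_i ← γ_i / μ_i² with γ_i = 1 − α_i Σ_ii, repeated until most α_i diverge and their weights are pruned:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.linspace(-5, 5, 50)
t = np.sinc(X) + 0.05 * rng.normal(size=X.size)          # noisy targets (illustrative)

# Design matrix: bias column plus an RBF kernel column per training point.
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)
Phi = np.hstack([np.ones((X.size, 1)), K])               # shape (N, N+1)

alpha = np.ones(Phi.shape[1])                            # one precision per weight
beta = 100.0                                             # assumed noise precision
for _ in range(100):
    Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
    mu = beta * Sigma @ Phi.T @ t                        # posterior mean of the weights
    gamma = 1.0 - alpha * np.diag(Sigma)
    alpha = gamma / (mu ** 2 + 1e-12)                    # re-estimate the sparsity params
    alpha = np.minimum(alpha, 1e12)                      # treat huge alpha as "pruned"

print("relevance vectors kept:", int(np.sum(alpha < 1e6)), "of", Phi.shape[1], "weights")
```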
ECE 8443: Lecture 16, Slide 10
Summary
• Support Vector Machines are one example of a kernel-based learning machine
that is trained in a discriminative fashion.
• Integrates notions of risk minimization, large-margin and soft margin
classification.
• Two fundamental innovations:
 maximize the margin between the classes using actual data points,
 map the data into a higher-dimensional space in which it becomes linearly
separable.
• Training can be computationally expensive but classification is very fast.
• Note that SVMs are inherently non-probabilistic (e.g., non-Bayesian).
• SVMs can be used to estimate posteriors by mapping the SVM output to a
likelihood-like quantity using a nonlinear function (e.g., sigmoid).
• SVMs are not inherently suited to an N-way classification problem. Typical
approaches include a pairwise comparison or a “one vs. world” approach, as
sketched below.
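A brief sketch of those two strategies using scikit-learn's wrappers (the iris data set is just a convenient stand-in for any N-way problem):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                 # a 3-class problem

pairwise = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)      # k(k-1)/2 binary SVMs
one_vs_rest = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)  # k binary SVMs

print(pairwise.predict(X[:5]), one_vs_rest.predict(X[:5]))
```

Note that scikit-learn's SVC already applies the pairwise (one-vs-one) strategy internally when given more than two classes.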
ECE 8443: Lecture 16, Slide 11
Summary
• Many alternate forms include Transductive SVMs, Sequential SVMs, Support
Vector Regression, Relevance Vector Machines, and data-driven kernels.
• Key lesson learned: a linear algorithm in the feature space is equivalent to a
nonlinear algorithm in the input space. Standard linear algorithms can be
generalized (e.g., kernel principal component analysis, kernel independent
component analysis, kernel canonical correlation analysis, kernel k-means); a
small kernel PCA sketch appears at the end of this slide.
• What we didn’t discuss:
 How do you train SVMs?
 Computational complexity?
 How to deal with large amounts of data?
See Ganapathiraju for an excellent, easy-to-understand discourse on SVMs
and Hamaker (Chapter 3) for a nice overview of RVMs. There are many other
tutorials available online (see the links on the title slide) as well.
 Other methods based on kernels – more to follow.
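As a concrete instance of the "kernelize a linear algorithm" lesson above, here is a small kernel PCA sketch (assuming scikit-learn's KernelPCA; the concentric-circles data is illustrative): ordinary PCA runs implicitly in the kernel-induced feature space, so it can capture structure linear PCA cannot:

```python
import numpy as np
from sklearn.decomposition import KernelPCA, PCA

rng = np.random.default_rng(5)
theta = rng.uniform(0, 2 * np.pi, 300)
radius = np.r_[np.ones(150), 2 * np.ones(150)][:, None]         # two concentric rings
X = np.c_[np.cos(theta), np.sin(theta)] * radius

linear = PCA(n_components=2).fit_transform(X)                   # rings stay entangled
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=4.0).fit_transform(X)
print(linear[:2], kpca[:2], sep="\n")
```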