
Unit-4

Supervised learning: Classification

1
Contents

• What is classification?
• Use cases of classification

• Classification algorithms:
– Nearest Neighbor
– Decision Tree
– Bayesian Classification
– Support Vector Machine
– Neural Networks

2
What is Classification?

• Classification is a supervised learning task in machine learning terminology

• Classification
– predicts categorical class labels (discrete or nominal)
– builds a model from the training set and the values of a class label
attribute, and uses it to classify new data
• Classification vs. Regression
• Classification (pattern recognition) and regression (function approximation)
are supervised prediction tasks; clustering (grouping) is unsupervised
3
Use cases of Classification

• Classification is widely known as pattern recognition in the literature.

• Character Recognition, Handwriting recognition


• Biometric (Fingerprint/Face/Expression) Recognition
– Person Identification
• Speech Recognition, Speaker recognition
• Medical diagnosis (Disease Detection)
• Image classification (object classification, vehicle detection, etc.)
• Fault detection in electric circuits
• Win-loss prediction of games
• Prediction of natural calamity such as earthquake, flood, etc.
• …
4
Classification Model

5
Data Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes

– Each tuple/sample is assumed to belong to a predefined class,


as determined by the class label attribute
– The set of tuples used for model construction is training set
– The model is represented as classification rules, decision trees,
or mathematical formula

• Model usage: for classifying future or unknown objects


– Estimate accuracy of the model

• The known label of test sample is compared with the


classified result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
6
Process (1): Model Construction

Training Data → Classification Algorithm → Classifier (Model)

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Learned model (classification rule):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
7
Process (2): Using the Model in Prediction

Testing Data → Classifier → prediction for Unseen Data

NAME     RANK             YEARS   TENURED
Tom      Assistant Prof   2       no
Merlisa  Associate Prof   7       no
George   Professor        5       yes
Joseph   Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
8
Classification Model Steps

9
Common Classification Algorithms

1. k-Nearest Neighbour (kNN)


2. Naïve Bayes classifier
3. Decision tree
4. Random forest
5. Artificial Neural Networks
6. Support Vector Machine (SVM)

 Rule-Based Classification
 Associative Classification
 Case-based Reasoning
 Convolutional Neural Networks
10
Nearest Neighbor

• Uses similarity for classification


– We are similar to our neighbors

• Pattern matching:
– Person identification
– Vehicle classification
– Speaker recognition

11
K-Nearest Neighbor Classification
• k-Nearest Neighbor is a simple algorithm that stores all the
available cases and classifies new data or cases based
on a similarity measure (see the sketch after this slide).

• Similarity measure
– Euclidean distance
– Manhattan distance

12
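Below is a minimal Python sketch of the k-NN idea described above, assuming numeric feature vectors and Euclidean distance; the function names and the toy data are illustrative, not taken from the slides.

```python
# A minimal k-NN classifier sketch (illustrative only).
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    # Rank stored training cases by distance to the query
    # ("lazy" learning: no model is built in advance)
    ranked = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))
    k_labels = [label for _, label in ranked[:k]]
    # Majority vote among the k nearest neighbours
    return Counter(k_labels).most_common(1)[0][0]

# Toy usage
train_X = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (9.0, 8.5)]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, (1.1, 0.9), k=3))  # -> "A"
```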
K-Nearest Neighbor Classification: Example

13
K-Nearest Neighbor Algorithm

14
K-Nearest Neighbor

• Lazy learner
– Does not build a classification model in advance
– Stores the training data and predicts only when test data is available

15
How to choose Value of K?

• k is selected empirically.

• A common heuristic is k ≈ sqrt(N), where N is the number of training samples.

• An odd value of k is preferred to avoid ties between the two classes

16
Strength and Weaknesses of KNN

• Strength
– Extremely simple algorithm – easy to understand
– Very effective in certain situations, e.g. for recommender system design
– Very fast or almost no time required for the training phase
• Weakness
– Does not learn anything in the real sense; classification is done entirely on
the basis of the training data, so it relies heavily on that data. If the
training data does not represent the problem domain comprehensively,
the algorithm fails to make an effective classification.
– Because no model is trained and every prediction scans the stored training
data, the classification process is very slow.
– A large amount of computational space is also required to load the training
data for classification.
17
Use cases of k-Nearest Neighbor

• Marketing: Amazon Recommender system


– Suggests not only the searched items but also relevant products that you may
be interested in.
– More than 35% of revenue of Amazon is based on Recommendations
• Social Media
– Recommendation of News, friends, etc.
• Tourism
– Recommendation of Hotels, tourism places
• Banking
– Credit card
• Document classifications

18
Bayesian Theorem: Basics

• Let X be a data sample (“evidence”): class label is unknown


• Let H be a hypothesis that ‘X belongs to class C’

• Classification is to determine:
• P(H|X) (posterior probability), the probability that the hypothesis holds
given the observed data sample X

• P(H) (prior probability), the initial probability


– E.g., X will buy computer, regardless of age, income, …
• P(X): probability that sample data is observed
• P(X|H): (likelihood), the probability of observing the sample X, given
that the hypothesis holds

19
Bayesian Theorem

• Given training data X, posteriori probability of a hypothesis H, P(H|X),


follows the Bayes theorem

P(H | X) = P(X | H) P(H) / P(X)

• Informally, this can be written as

posterior = likelihood × prior / evidence


• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest among
all the P(Ck|X) for all the k classes
• Practical difficulty: require initial knowledge of many probabilities,
significant computational cost
20
Towards Naïve Bayesian Classifier

• Let D be a training set of tuples and their associated class


labels, and each tuple is represented by an n-D attribute
vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
• This can be derived from Bayes’ theorem:

P(Ci | X) = P(X | Ci) P(Ci) / P(X)

• Since P(X) is constant for all classes, only P(X | Ci) P(Ci) needs to be maximized
21


Derivation of Naïve Bayes Classifier

• A simplified assumption: attributes are conditionally independent given the
class (i.e., no dependence relation between attributes):

P(X | Ci) = ∏_{k=1}^{n} P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)

• This greatly reduces the computation cost: only the class distributions need
to be counted
• If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for
Ak divided by |Ci,D| (the number of tuples of Ci in D)

• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian
distribution with mean μ and standard deviation σ:

g(x, μ, σ) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) )

P(xk|Ci) = g(xk, μ_Ci, σ_Ci)
22
Naïve Bayesian Classifier: Training Dataset

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data sample:
X = (age <= 30, Income = medium, Student = yes, Credit_rating = fair)

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no

P(Ci | X) ∝ P(X | Ci) P(Ci)
23
Naïve Bayesian Classifier: An Example

• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643


P(buys_computer = “no”) = 5/14= 0.357

• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)

• Compute P(X|Ci) for each class


P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

• P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044


P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

• P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.044 x 0.643 = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.019 x 0.357 = 0.007

Therefore, X belongs to class (“buys_computer = yes”), since P(Ci|X) ∝ P(X|Ci) P(Ci)
24
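As a check on the hand computation above, here is a minimal Naïve Bayes sketch in Python over the same Buys_computer table (categorical attributes only, no smoothing); the function and variable names are illustrative, not from the slides.

```python
# Minimal Naïve Bayes for the buys_computer example.
from collections import Counter

# (age, income, student, credit_rating) -> buys_computer, copied from the slide
data = [
    (("<=30", "high", "no", "fair"), "no"), (("<=30", "high", "no", "excellent"), "no"),
    (("31...40", "high", "no", "fair"), "yes"), ((">40", "medium", "no", "fair"), "yes"),
    ((">40", "low", "yes", "fair"), "yes"), ((">40", "low", "yes", "excellent"), "no"),
    (("31...40", "low", "yes", "excellent"), "yes"), (("<=30", "medium", "no", "fair"), "no"),
    (("<=30", "low", "yes", "fair"), "yes"), ((">40", "medium", "yes", "fair"), "yes"),
    (("<=30", "medium", "yes", "excellent"), "yes"), (("31...40", "medium", "no", "excellent"), "yes"),
    (("31...40", "high", "yes", "fair"), "yes"), ((">40", "medium", "no", "excellent"), "no"),
]

def predict(x):
    classes = Counter(label for _, label in data)          # class counts: yes=9, no=5
    scores = {}
    for c, n_c in classes.items():
        score = n_c / len(data)                            # prior P(Ci)
        for k, value in enumerate(x):
            # conditional independence: multiply P(x_k | Ci) for each attribute
            n_match = sum(1 for feats, label in data if label == c and feats[k] == value)
            score *= n_match / n_c
        scores[c] = score                                  # proportional to P(Ci | X)
    return max(scores, key=scores.get), scores

print(predict(("<=30", "medium", "yes", "fair")))
# -> ('yes', {'no': ~0.007, 'yes': ~0.028}), matching the slide's hand computation
```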
Avoiding the Zero-Probability Problem

• Naïve Bayesian prediction requires each conditional prob. be non-


zero. Otherwise, the predicted prob. will be zero
P(X | Ci) = ∏_{k=1}^{n} P(xk | Ci)

• Ex. Suppose a dataset with 1000 tuples, income=low (0), income=


medium (990), and income = high (10)

• Use Laplacian correction (or Laplacian estimator)


– Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
– The “corrected” prob. estimates are close to their “uncorrected”
counterparts
25
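A short numeric sketch of the Laplacian correction above, using the income counts from the slide:

```python
# Laplacian (add-one) correction: each count gets +1 and the denominator grows
# by the number of distinct values (3 here), so no estimate is exactly zero.
counts = {"low": 0, "medium": 990, "high": 10}
total = sum(counts.values())                                       # 1000
smoothed = {v: (c + 1) / (total + len(counts)) for v, c in counts.items()}
print(smoothed)   # low: 1/1003, medium: 991/1003, high: 11/1003
```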
Naïve Bayesian Classifier: Comments

• Advantages
– Easy to implement
– Good results obtained in most of the cases

• Disadvantages
– They require a significant amount of probability data to construct the
knowledge base.
– Assumption: class conditional independence, therefore loss of accuracy
– Practically, dependencies exist among variables
• E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by Naïve Bayesian
Classifier

• How to deal with these dependencies? Bayesian Belief Networks


26
Decision Tree
• Decision tree learning is one of the most widely adopted algorithms for
classification.
• As the name indicates, it builds a model in the form of a tree structure.
• It is used for multi-dimensional analysis with multiple classes.
• It is characterized by fast execution time and ease in the interpretation of
the rules.
• The goal of decision tree learning is to create a model, based on past
data, that predicts the value of the output variable based on the input
variables in the feature vector.
• Root Node
• Decision Node (Non-leaf node)
• Leaf Node (output variable - class label)
• Branches (possible range of feature values)
27

Decision Tree Structure


Example Decision Tree for Car Driving
• The decision to be taken is whether to ‘Keep Going’ or to ‘Stop’, which
depends on various situations as depicted in the figure.

28
Decision Tree Induction: An Example
 Training data set: Buys_computer
 The data set follows an example of Quinlan’s ID3 (Playing Tennis)

age income student credit_rating buys_computer


<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
29
Decision Tree Induction: An Example
[Figure: a candidate decision tree for the Buys_computer data, with Age at the root
(branches <=30, 31..40, >40) followed by Income, Student and Credit_rating (CR)
tests at the lower levels, shown alongside the training table.]

Stopping criteria:
1. All or most of the examples at a particular node have the same class
2. All features have been used up in the partitioning
3. The tree has grown to a pre-defined threshold limit
30
Decision Tree Induction: An Example
 Resulting tree:

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31..40 -> yes
  >40    -> credit rating?
              excellent -> no
              fair      -> yes
31
Algorithm for Decision Tree Induction

• Basic algorithm (a greedy algorithm)


– Tree is constructed in a top-down recursive divide-and-conquer
manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized
in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)

• Conditions for stopping partitioning


– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left
32
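The greedy induction scheme above can be exercised with an off-the-shelf learner. A hedged sketch using scikit-learn (assumed to be available) on the Buys_computer data follows; note that scikit-learn grows binary (CART-style) splits even with the entropy criterion, unlike the multiway ID3 tree shown in the slides.

```python
# Decision tree induction on the Buys_computer data with scikit-learn.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame(
    [["<=30","high","no","fair","no"], ["<=30","high","no","excellent","no"],
     ["31...40","high","no","fair","yes"], [">40","medium","no","fair","yes"],
     [">40","low","yes","fair","yes"], [">40","low","yes","excellent","no"],
     ["31...40","low","yes","excellent","yes"], ["<=30","medium","no","fair","no"],
     ["<=30","low","yes","fair","yes"], [">40","medium","yes","fair","yes"],
     ["<=30","medium","yes","excellent","yes"], ["31...40","medium","no","excellent","yes"],
     ["31...40","high","yes","fair","yes"], [">40","medium","no","excellent","no"]],
    columns=["age", "income", "student", "credit_rating", "buys_computer"])

X = pd.get_dummies(df.drop(columns="buys_computer"))   # one-hot encode categorical features
y = df["buys_computer"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)  # information-gain style splits
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))             # textual view of the learned tree
```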
Decision Tree

• There can be many decision trees that fit the example/data

• Que. Which decision tree should we choose?
– Prefer smaller trees:
• Trees with low depth
• Trees with fewer nodes
• Finding the smallest decision tree is a computationally hard problem.
• So heuristics are applied
• A decision tree represents a disjunction of conjunctions: each root-to-leaf path
is a conjunction of attribute tests, and the tree is the disjunction of those paths
• Entropy is a measure of disorder in a system. We want nodes to reach low
entropy quickly

33
Attribute Selection Measure:
Information Gain (ID3/C4.5/C5.0)

 Select the attribute with the highest information gain


 Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
 Expected information (entropy) needed to classify a tuple in D:

Entropy(D) = − Σ_{i=1}^{m} pi log2(pi)

 Information needed (after using A to split D into v partitions) to classify D:

Entropy_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Entropy(Dj)

 Information gained by branching on attribute A:

Gain(A) = Entropy(D) − Entropy_A(D)
34
• The information gain is created on the basis of the decrease in entropy
(S) after a data set is split according to a particular attribute (A).
• Constructing a decision tree is all about finding an attribute that returns
the highest information gain (i.e. the most homogeneous branches).
• If the information gain is 0, it means that there is no reduction in entropy
due to split of the data set according to that particular feature.
• On the other hand, the maximum amount of information gain which may
happen is the entropy of the data set before the split.
• Information gain for a particular feature A is calculated as the difference
between the entropy before the split (Sbs) and the entropy after the split (Sas).

35
Attribute Selection: Information Gain
 Class P: buys_computer = “yes” (9 tuples)
 Class N: buys_computer = “no” (5 tuples)

Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

age      pi   ni   I(pi, ni)
<=30     2    3    0.971
31…40    4    0    0
>40      3    2    0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

(5/14) I(2,3) means that the branch “age <= 30” has 5 out of the 14 samples,
with 2 yes’es and 3 no’s. Hence

Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

Age has the highest information gain, so it is chosen as the splitting attribute.
36
37
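To make the arithmetic above concrete, the following Python sketch (illustrative, not from the slides) reproduces Info(D), Info_age(D) and Gain(age) for the Buys_computer data.

```python
# Information-gain sketch for the Buys_computer data (values from the slides).
import math
from collections import Counter

def info(labels):
    # Expected information (entropy): -sum p_i log2(p_i)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# (age, buys_computer) pairs taken from the training table
rows = [("<=30","no"),("<=30","no"),("31...40","yes"),(">40","yes"),(">40","yes"),
        (">40","no"),("31...40","yes"),("<=30","no"),("<=30","yes"),(">40","yes"),
        ("<=30","yes"),("31...40","yes"),("31...40","yes"),(">40","no")]

labels = [y for _, y in rows]
info_D = info(labels)                                    # I(9,5) = 0.940

# Information after splitting on age: weighted entropy of each partition
partitions = {}
for age, y in rows:
    partitions.setdefault(age, []).append(y)
info_age = sum(len(part) / len(rows) * info(part) for part in partitions.values())

print(round(info_D, 3), round(info_age, 3), round(info_D - info_age, 3))
# -> 0.94 0.694 0.247 (the slide's 0.246 comes from rounding before subtracting)
```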
Case Study

• Let us try to understand this in the context of an


example. Global Technology Solutions (GTS), a
leading provider of IT solutions, is coming to
College of Engineering and Management (CEM)
for hiring B.Tech. students. Last year during
campus recruitment, they had shortlisted 18
students for the final interview. Being a company
of international repute, they follow a stringent
interview process to select only the best of the
students. The information related to the interview
evaluation results of shortlisted students (hiding
the names) on the basis of different evaluation
parameters is available for reference in Figure
7.10. Chandra, a student of CEM, wants to find out
if he may be offered a job in GTS. His CGPA is
quite high. His self-evaluation on the other
parameters is as follows:
• Communication – Bad; Aptitude – High;
Programming skills – Bad

38
Training data for GTS recruitment
Decision tree based on the training data

39
Decision tree based on the training data
(depicting a sample path)
Information Gain (S, A) = Entropy (Sbs ) − Entropy (Sas )

40
41
Entropy and information gain calculation (Level 2)

42
43
Computing Information-Gain
for Continuous-Valued Attributes

• Let attribute A be a continuous-valued attribute


• Must determine the best split point for A
– Sort the value A in increasing order
– Typically, the midpoint between each pair of adjacent
values is considered as a possible split point
• (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
– The point with the minimum expected information
requirement for A is selected as the split-point for A
• Split:
– D1 is the set of tuples in D satisfying A ≤ split-point, and
D2 is the set of tuples in D satisfying A > split-point
44
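A tiny sketch of the midpoint-based candidate split points described above (the attribute values are made up for illustration):

```python
# Candidate split points for a continuous attribute: midpoints of adjacent
# sorted values; duplicate values produce no midpoint.
values = sorted([45, 52, 38, 61, 45, 70])
midpoints = [(a + b) / 2 for a, b in zip(values, values[1:]) if a != b]
print(midpoints)   # [41.5, 48.5, 56.5, 65.5]; each is evaluated and the one with the
                   # minimum expected information requirement becomes the split point
```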
Gain Ratio for Attribute Selection (C4.5)

• Information gain measure is biased towards attributes with


a large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
SplitInfo_A(D) = − Σ_{j=1}^{v} (|Dj| / |D|) log2(|Dj| / |D|)
– GainRatio(A) = Gain(A) / SplitInfo(A)
• Ex.

– gain_ratio(income) = 0.029/1.557 = 0.019


• The attribute with the maximum gain ratio is selected as
the splitting attribute
45
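A short sketch that reproduces SplitInfo(income) and the gain ratio quoted above (income takes the values high, medium and low 4, 6 and 4 times in the 14-tuple data):

```python
# Gain-ratio sketch for the income attribute (values match the slide).
import math

def split_info(partition_sizes):
    # SplitInfo_A(D) = -sum (|Dj|/|D|) log2(|Dj|/|D|)
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes)

sizes = [4, 6, 4]                      # income = high / medium / low
si = split_info(sizes)                 # ~1.557
gain_income = 0.029                    # Gain(income) from the earlier slide
print(round(si, 3), round(gain_income / si, 3))   # 1.557 0.019
```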
Gini Index (CART, IBM IntelligentMiner)

• If a data set D contains examples from n classes, the gini
index gini(D) is defined as

gini(D) = 1 − Σ_{j=1}^{n} pj²

where pj is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the
gini index of the split is defined as

gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

• Reduction in impurity:

Δgini(A) = gini(D) − gini_A(D)

• The attribute that provides the smallest gini_split(D) (or the largest
reduction in impurity) is chosen to split the node (need to
enumerate all the possible splitting points for each attribute)
46
Computation of Gini Index

• Ex. D has 9 tuples with buys_computer = “yes” and 5 with “no”

gini(D) = 1 − (9/14)² − (5/14)² = 0.459

• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium}
and 4 tuples in D2: {high}

gini_income∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)

 Gini_{low,high} is 0.458; Gini_{medium,high} is 0.450. Thus, split on
{low, medium} (and {high}) since it has the lowest Gini index
• All attributes are assumed continuous-valued
• May need other tools, e.g., clustering, to get the possible split values
• Can be modified for categorical attributes
47
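The Gini computation above can be reproduced with a few lines of Python; the per-partition class counts below are read off the Buys_computer table.

```python
# Gini-index sketch reproducing the example above.
def gini(counts):
    # gini = 1 - sum p_j^2 over the classes
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([9, 5]), 3))                     # gini(D) = 0.459

# income in {low, medium} -> D1 (7 yes, 3 no); income = high -> D2 (2 yes, 2 no)
d1, d2 = [7, 3], [2, 2]
gini_split = (10 / 14) * gini(d1) + (4 / 14) * gini(d2)
print(round(gini_split, 3))                       # ~0.443, lower than 0.458 and 0.450
```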
Comparing Attribute Selection Measures

• The three measures, in general, return good results but


– Information gain:
• biased towards multivalued attributes
– Gain ratio:
• tends to prefer unbalanced splits in which one
partition is much smaller than the others
– Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal-sized
partitions and purity in both partitions
48
Overfitting and Tree Pruning

• Overfitting: An induced tree may overfit the training data


– Too many branches, some may reflect anomalies due to noise or
outliers
– Poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early: do not split a node if this
would result in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
• Use a set of data different from the training data to decide which is
the “best pruned tree”

49
Decision Tree
• Avoiding overfitting in decision tree – pruning

• There are two approaches of pruning:


• Pre-pruning: Stop growing the tree before it reaches
perfection.
• Post-pruning: Allow the tree to grow entirely and then post-
prune some of the branches from it.

50
Decision Tree
• Strengths of Decision Tree
• It produces very simple understandable rules. For smaller trees, not much mathematical and
computational knowledge is required to understand this model.
• Works well for most of the problems.
• It can handle both numerical and categorical variables.
• Can work well both with small and large training data sets.
• Decision trees provide a definite clue of which features are more useful for classification.

• Weaknesses of Decision Tree


• Decision tree models are often biased towards features having a larger number
of possible values, i.e. levels.
• This model gets overfitted or underfitted quite easily.
• Decision trees are prone to errors in classification problems with many classes
and a relatively small number of training examples.
• A decision tree can be computationally expensive to train.
• Large trees are complex to understand.

51
Random forest model

• Random forest is an ensemble classifier, i.e. a combining


classifier that uses and combines many decision tree
classifiers.
• Ensembling is usually done using the concept of bagging
with different feature sets.
• After the random forest is generated by combining the
trees, majority vote is applied to combine the output of the
different trees.

52
Random forest model
• How does random forest work?
• 1. If there are N variables or features in the input data set, select a subset of ‘m’ (m < N) features at
random out of the N features. Also, the observations or data instances should be picked randomly.
• 2. Use the best split principle on these ‘m’ features to calculate the number of nodes ‘d’.
• 3. Keep splitting the nodes to child nodes till the tree is grown to the maximum possible extent.
• 4. Select a different subset of the training data ‘with replacement’ to train another decision tree
following steps (1) to (3). Repeat this to build and train ‘n’ decision trees.
• 5. Final class assignment is done on the basis of the majority votes from the ‘n’ trees.

53
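A hedged usage sketch with scikit-learn's RandomForestClassifier (assumed available), which implements the bagging-plus-random-feature-subset procedure described in the steps above; the dataset here is synthetic and purely illustrative.

```python
# Random forest usage sketch with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# n_estimators = number of trees 'n'; max_features = size of the random feature
# subset 'm' tried at each split; bootstrap=True samples the training data with replacement
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))           # majority-vote accuracy on held-out data
print(forest.feature_importances_[:5])        # estimates of which features matter most
```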
Random forest model
• Strengths of random forest
• It runs efficiently on large and expansive data sets.
• It has a robust method for estimating missing data and maintains precision when a large proportion of
the data is absent.
• It has powerful techniques for balancing errors in a class population of unbalanced data sets.
• It gives estimates (or assessments) about which features are the most important ones in the overall
classification.
• It generates an internal unbiased estimate (gauge) of the generalization error as the forest generation
progresses.
• Generated forests can be saved for future use on other data.
• Lastly, the random forest algorithm can be used to solve both classification and regression problems.

• Weaknesses of random forest


• This model, because it combines a number of decision tree models, is not as
easy to understand as a decision tree model.
• It is computationally much more expensive than a simple model like decision
tree.

54
Support Vector Machine (SVM)

• Machine learning technique introduced by Vapnik in 1996


• Popular due to its generalization ability
• SVM is used for
– Pattern Recognition (Classification)
– Regression Estimation (Prediction)

• The solution provided by SVM is

– Theoretically elegant
– Computationally efficient
– Very effective in many large practical problems

• It has a simple geometrical interpretation in a high-dimensional feature space that is nonlinearly related to
the input space.

• By using kernels, all computations remain simple.

• It contains ANN, RBF and Polynomial classifiers as special cases.
55


ANN vs. SVM

ANN                                          SVM
Minimizes empirical risk                     Minimizes structural risk
Solution can have multiple local minima      Solution is global and unique
Less generalization                          More generalization
Data intensive                               Requires less data
Presence of outliers affects                 Presence of outliers may not influence
generalization accuracy                      generalization accuracy
Difficult to determine structure and         Developing an SVM-based model is much
parameters                                   easier compared to an ANN model
Black box                                    Based on computational learning theory;
                                             SVM can give some explanation about how
                                             the final result was arrived at
56
History: SVM
• The Study on Statistical Learning Theory was
started in the 1960s by Vapnik. A classifier
derived from statistical learning theory by Vapnik,
et al. in 1992
• SVM became famous when, using images as
input, it gave accuracy comparable to neural-
network with hand-designed features in a
handwriting recognition task
• Currently, SVM is widely used in
• object detection & recognition,
• content-based image retrieval,
• text recognition,
• biometrics,
• speech recognition
• and many more…
57
Two Key Concepts of SVM

1. Construction of optimum hyperplane in the feature space.

Optimum Hyper-plane

2. Mapping of training data into higher dimensional feature


space. A kernel trick is used to avoid explicit mapping
functions.

Φ(x)
58
Linear Classifiers

denotes +1        w·x + b > 0
denotes -1

How would you


classify this
data?

w·x + b < 0

59
Linear Classifiers

denotes +1
denotes -1

How would you


classify this
data?

60
Linear Classifiers

denotes +1
denotes -1

How would you


classify this
data?

61
Linear Classifiers

denotes +1
denotes -1

How would you


classify this
data?

62
Linear Classifiers

denotes +1
denotes -1
Any of these
would be fine..

..but which is
best?

63
Linear Classifiers

denotes +1
denotes -1

How would you


classify this
data?

Misclassified
to +1 class
64
Margin Classifier

denotes +1
denotes -1

65
Define the margin of a linear classifier as the width that the
boundary could be increased by before hitting a datapoint.
Maximizing the Margin

IDEA 1: Select the separating hyperplane that maximizes the margin!

[Figure: two classes plotted on Var1 vs. Var2, with the margin width marked]
66
Support Vectors

[Figure: the support vectors lie on the margin boundaries; axes Var1 vs. Var2, margin width marked]
67
Maximum Margin Classifier

denotes +1
denotes -1

• Implies that only support vectors are important; other training
examples are ignorable.

Support Vectors are those datapoints that the margin pushes up against.

This is the simplest kind of SVM, called a Linear SVM (LSVM).
68
Maximizing Margin Mathematically

x+, x−: points on the two margin boundaries; M = margin width

What we know:
• w · x+ + b = +1
• w · x− + b = −1
• w · (x+ − x−) = 2

M = (x+ − x−) · w / |w| = 2 / |w|
69
Quadratic Optimization Problem

 Goal: 1) Correctly classify all training data:
w·xi + b ≥ +1 if yi = +1
w·xi + b ≤ −1 if yi = −1
i.e. yi (w·xi + b) ≥ 1 for all i

2) Maximize the margin M = 2 / |w|, which is the same as
minimizing (1/2) wᵀw

 We can formulate a Quadratic Optimization Problem and solve for w and b

Minimize:   Φ(w) = (1/2) wᵀw
Subject to: yi (w·xi + b) ≥ 1 for all i
70
Data set with Noise

denotes +1  Hard Margin: So far we require


all data points be classified correctly
denotes -1
- No training error
 What if the training set is
noisy?
- Solution 1: use very powerful
kernels

OVERFITTING!
71
Soft Margin Classification

Slack variables ξi can be added to allow


misclassification of difficult or noisy examples.

What should our quadratic optimization criterion be?

Minimize   (1/2) w·w + C Σ_{k=1}^{R} εk

 Parameter C can be viewed as a way to control overfitting.
72
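A hedged scikit-learn sketch (assumed available) showing the role of the soft-margin parameter C described above; the data is synthetic and the specific C values are illustrative.

```python
# Soft-margin SVM: C trades off margin width against training errors.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C)        # small C: wider margin, more slack allowed;
    clf.fit(X_tr, y_tr)                    # large C: fewer violations, risk of overfitting
    print(C, clf.score(X_te, y_te), len(clf.support_))   # accuracy and number of support vectors
```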
Non-Linear SVMs

 Datasets that are linearly separable with some noise


work out great:

0 x

 But what are we going to do if the dataset is just too


hard?
0 x

 How about… mapping data to a higher-dimensional
space (e.g. adding an x² feature)?

73
Non-Linear SVM:
Map to High dimensional Feature Space

 General idea: The original input space can always be


mapped to some higher-dimensional feature space where
the training set is separable

Φ: x → φ(x)

74
The Kernel Trick: K(xi, xj) = φ(xi)ᵀ φ(xj)
The XOR Problem

x2 x2

x1 x1x2

X = ( x1, x2 ) x1 Z = ( x1, x2, x1x2 )

75
Examples of Kernel Functions

 Linear Kernel: K(x, y) = xᵀy

 Polynomial Kernel with degree d: K(x, y) = (1 + xᵀy)^d

 Gaussian (radial-basis function) Kernel:

K(x, y) = exp( −‖x − y‖² / (2σ²) )

 Exponential RBF Kernel:

K(x, y) = exp( −‖x − y‖ / (2σ²) )

 Sigmoid (MLP) Kernel: K(x, y) = tanh(β0 xᵀy + β1)

 Fourier series
 Splines
 …
76
Strengths and Weakness of SVM

•Strengths
•Training is relatively easy
• No local optima, unlike in neural networks
•It scales relatively well to high dimensional data
•Weakness
•Need to choose a “good” kernel function.
•SVMs can only handle two-class outputs (i.e. a categorical
output variable with arity 2).

•What can be done?

•Answer: with output arity N, learn N SVMs (see the sketch below)
SVM 1 learns “Output==1” vs “Output != 1”
SVM 2 learns “Output==2” vs “Output != 2” …
SVM N learns “Output==N” vs “Output != N”
•Then to predict the output for a new input, just predict with each
SVM and find out which one puts the prediction the furthest into the
positive region.
77
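A minimal one-against-all sketch using scikit-learn SVCs (assumed available), following the recipe above: train one binary SVM per class and pick the class whose SVM pushes the sample furthest into the positive region. The Iris dataset is used purely for illustration.

```python
# One-against-all (one-vs-rest) multiclass SVM sketch.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Train one binary SVM per class: "this class" (+1) vs "all the rest" (0)
models = {c: SVC(kernel="linear").fit(X, (y == c).astype(int)) for c in classes}

def predict(x):
    # decision_function gives the signed distance to each class's separating hyperplane
    scores = {c: m.decision_function(x.reshape(1, -1))[0] for c, m in models.items()}
    return max(scores, key=scores.get)       # furthest into the positive region wins

print(predict(X[0]), y[0])   # predicted class vs. true class for one sample
```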
SVM Multiclass Classification Methods

1. One-against-all (Cortes and Vapnik, 1995)


• Each of the SVMs separates a single class from all
remaining classes

2. One-against-One (Friedman, 1996)


• Each SVM separates a pair of classes
• k(k-1)/2 classifiers where each one is trained on data from
two classes
• For training data from ith and jth classes, run binary
classification
• Voting strategy:
• If x is in class i, then add 1 to class i. Else to class j.

• Assign x to class with largest vote (Max wins)

78
References

• Video tutorial – Edureka, simplilearn


• Pattern Recognition, by Rajjan Singhal
• Data Mining, Han and Kamber

79
