
Unit-4

Supervised learning: Classification

1
Contents

• What is classification?
• Use cases of classification

• Classification algorithms:
– Nearest Neighbor
– Decision Tree
– Bayesian Classification
– Support Vector Machine
– Neural Networks

2
What is Classification?

• Classification is a supervised learning task in machine learning terminology

• Classification
– predicts categorical class labels (discrete or nominal)
– builds a model from the training set and the values of a class label
attribute, and uses it to classify new data
• Classification vs. Regression
• Classification (pattern recognition) and regression (function approximation)
are supervised prediction tasks; clustering (grouping) is unsupervised
3
Use cases of Classification

• Classification is widely known as pattern recognition in the literature.

• Character Recognition, Handwriting recognition


• Biometric (Fingerprint/Face/Expression) Recognition
– Person Identification
• Speech Recognition, Speaker recognition
• Medical diagnosis (Disease Detection)
• Image classification (object classification, vehicle detection, etc.)
• Fault detection in electric circuits
• Win-loss prediction of games
• Prediction of natural calamity such as earthquake, flood, etc.
• …
4
Classification Model

5
Data Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes

– Each tuple/sample is assumed to belong to a predefined class,


as determined by the class label attribute
– The set of tuples used for model construction is training set
– The model is represented as classification rules, decision trees,
or mathematical formula

• Model usage: for classifying future or unknown objects


– Estimate accuracy of the model

• The known label of test sample is compared with the


classified result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
6
Process (1): Model Construction

Training Data → Classification Algorithm → Classifier (Model)

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Learned model (classification rule):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
7
Process (2): Using the Model in Prediction

Testing Data → Classifier → prediction for Unseen Data

NAME     RANK             YEARS   TENURED
Tom      Assistant Prof   2       no
Merlisa  Associate Prof   7       no
George   Professor        5       yes
Joseph   Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
8
Classification Model Steps

9
Common Classification Algorithms

1. k-Nearest Neighbour (kNN)


2. Naïve Bayes classifier
3. Decision tree
4. Random forest
5. Artificial Neural Networks
6. Support Vector Machine (SVM)

 Rule-Based Classification
 Associative Classification
 Case-based Reasoning
 Convolutional Neural Networks
10
Nearest Neighbor

• Uses similarity for classification


– We are similar to our neighbors

• Pattern matching:
– Person identification
– Vehicle classification
– Speaker recognition

11
K-Nearest Neighbor Classification
• k-Nearest Neighbor is a simple algorithm that stores all the
available cases and classifies new data or cases based
on a similarity measure (see the sketch after this slide).

• Similarity measure
– Euclidean distance
– Manhattan distance

12
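Below is a minimal Python sketch of the k-NN idea described above, assuming numeric feature vectors and Euclidean distance; the function names and the toy data are illustrative, not taken from the slides.

```python
# A minimal k-NN classifier sketch (illustrative only).
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    # Rank stored training cases by distance to the query
    # ("lazy" learning: no model is built in advance)
    ranked = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))
    k_labels = [label for _, label in ranked[:k]]
    # Majority vote among the k nearest neighbours
    return Counter(k_labels).most_common(1)[0][0]

# Toy usage
train_X = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (9.0, 8.5)]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, (1.1, 0.9), k=3))  # -> "A"
```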
K-Nearest Neighbor Classification: Example

13
K-Nearest Neighbor Algorithm

14
K-Nearest Neighbor

• Lazy learner
– Does not build a classification model in advance
– Stores the training data and predicts only when test data is available

15
How to choose Value of K?

• k is selected empirically.

• A common heuristic is k ≈ sqrt(N), where N is the number of training samples.

• An odd value of k is preferred to avoid ties between the two classes

16
Strength and Weaknesses of KNN

• Strength
– Extremely simple algorithm – easy to understand
– Very effective in certain situations, e.g. for recommender system design
– Very fast or almost no time required for the training phase
• Weakness
– Does not learn anything in the real sense; classification is done entirely on
the basis of the training data, so it relies heavily on that data. If the
training data does not represent the problem domain comprehensively,
the algorithm fails to make an effective classification.
– Because no model is trained and every prediction scans the stored training
data, the classification process is very slow.
– A large amount of computational space is also required to load the training
data for classification.
17
Use cases of k-Nearest Neighbor

• Marketing: Amazon Recommender system


– Suggests not only the searched items but also relevant products that you may
be interested in.
– More than 35% of revenue of Amazon is based on Recommendations
• Social Media
– Recommendation of News, friends, etc.
• Tourism
– Recommendation of Hotels, tourism places
• Banking
– Credit card
• Document classifications

18
Bayesian Theorem: Basics

• Let X be a data sample (“evidence”): class label is unknown


• Let H be a hypothesis that ‘X belongs to class C’

• Classification is to determine:
• P(H|X) (posterior probability), the probability that the hypothesis holds
given the observed data sample X

• P(H) (prior probability), the initial probability


– E.g., X will buy computer, regardless of age, income, …
• P(X): probability that sample data is observed
• P(X|H): (likelihood), the probability of observing the sample X, given
that the hypothesis holds

19
Bayesian Theorem

• Given training data X, posteriori probability of a hypothesis H, P(H|X),


follows the Bayes theorem

P(H | X) = P(X | H) P(H) / P(X)

• Informally, this can be written as

posterior = likelihood × prior / evidence


• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest among
all the P(Ck|X) for all the k classes
• Practical difficulty: require initial knowledge of many probabilities,
significant computational cost
20
Towards Naïve Bayesian Classifier

• Let D be a training set of tuples and their associated class


labels, and each tuple is represented by an n-D attribute
vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
• This can be derived from Bayes’ theorem:

P(Ci | X) = P(X | Ci) P(Ci) / P(X)

• Since P(X) is constant for all classes, only P(X | Ci) P(Ci) needs to be maximized
21


Derivation of Naïve Bayes Classifier

• A simplified assumption: attributes are conditionally independent given the
class (i.e., no dependence relation between attributes):

P(X | Ci) = ∏_{k=1}^{n} P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)

• This greatly reduces the computation cost: only the class distributions need
to be counted
• If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for
Ak divided by |Ci,D| (the number of tuples of Ci in D)

• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian
distribution with mean μ and standard deviation σ:

g(x, μ, σ) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) )

P(xk|Ci) = g(xk, μ_Ci, σ_Ci)
22
Naïve Bayesian Classifier: Training Dataset

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data sample:
X = (age <= 30, Income = medium, Student = yes, Credit_rating = fair)

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no

P(Ci | X) ∝ P(X | Ci) P(Ci)
23
Naïve Bayesian Classifier: An Example

• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643


P(buys_computer = “no”) = 5/14= 0.357

• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)

• Compute P(X|Ci) for each class


P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

• P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044


P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

• P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.044 x 0.643 = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.019 x 0.357 = 0.007

Therefore, X belongs to class (“buys_computer = yes”), since P(Ci|X) ∝ P(X|Ci) P(Ci)
24
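As a check on the hand computation above, here is a minimal Naïve Bayes sketch in Python over the same Buys_computer table (categorical attributes only, no smoothing); the function and variable names are illustrative, not from the slides.

```python
# Minimal Naïve Bayes for the buys_computer example.
from collections import Counter

# (age, income, student, credit_rating) -> buys_computer, copied from the slide
data = [
    (("<=30", "high", "no", "fair"), "no"), (("<=30", "high", "no", "excellent"), "no"),
    (("31...40", "high", "no", "fair"), "yes"), ((">40", "medium", "no", "fair"), "yes"),
    ((">40", "low", "yes", "fair"), "yes"), ((">40", "low", "yes", "excellent"), "no"),
    (("31...40", "low", "yes", "excellent"), "yes"), (("<=30", "medium", "no", "fair"), "no"),
    (("<=30", "low", "yes", "fair"), "yes"), ((">40", "medium", "yes", "fair"), "yes"),
    (("<=30", "medium", "yes", "excellent"), "yes"), (("31...40", "medium", "no", "excellent"), "yes"),
    (("31...40", "high", "yes", "fair"), "yes"), ((">40", "medium", "no", "excellent"), "no"),
]

def predict(x):
    classes = Counter(label for _, label in data)          # class counts: yes=9, no=5
    scores = {}
    for c, n_c in classes.items():
        score = n_c / len(data)                            # prior P(Ci)
        for k, value in enumerate(x):
            # conditional independence: multiply P(x_k | Ci) for each attribute
            n_match = sum(1 for feats, label in data if label == c and feats[k] == value)
            score *= n_match / n_c
        scores[c] = score                                  # proportional to P(Ci | X)
    return max(scores, key=scores.get), scores

print(predict(("<=30", "medium", "yes", "fair")))
# -> ('yes', {'no': ~0.007, 'yes': ~0.028}), matching the slide's hand computation
```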
Avoiding the Zero-Probability Problem

• Naïve Bayesian prediction requires each conditional prob. be non-


zero. Otherwise, the predicted prob. will be zero
P(X | Ci) = ∏_{k=1}^{n} P(xk | Ci)

• Ex. Suppose a dataset with 1000 tuples, income=low (0), income=


medium (990), and income = high (10)

• Use Laplacian correction (or Laplacian estimator)


– Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
– The “corrected” prob. estimates are close to their “uncorrected”
counterparts
25
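A short numeric sketch of the Laplacian correction above, using the income counts from the slide:

```python
# Laplacian (add-one) correction: each count gets +1 and the denominator grows
# by the number of distinct values (3 here), so no estimate is exactly zero.
counts = {"low": 0, "medium": 990, "high": 10}
total = sum(counts.values())                                       # 1000
smoothed = {v: (c + 1) / (total + len(counts)) for v, c in counts.items()}
print(smoothed)   # low: 1/1003, medium: 991/1003, high: 11/1003
```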
Naïve Bayesian Classifier: Comments

• Advantages
– Easy to implement
– Good results obtained in most of the cases

• Disadvantages
– They require a significant amount of probability data to construct the
knowledge base.
– Assumption: class conditional independence, therefore loss of accuracy
– Practically, dependencies exist among variables
• E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by Naïve Bayesian
Classifier

• How to deal with these dependencies? Bayesian Belief Networks


26
Decision Tree
• Decision tree learning is one of the most widely adopted algorithms for
classification.
• As the name indicates, it builds a model in the form of a tree structure.
• It is used for multi-dimensional analysis with multiple classes.
• It is characterized by fast execution time and ease in the interpretation of
the rules.
• The goal of decision tree learning is to create a model, based on past
data, that predicts the value of the output variable based on the input
variables in the feature vector.
• Root Node
• Decision Node (Non-leaf node)
• Leaf Node (output variable - class label)
• Branches (possible range of feature values)
27

Decision Tree Structure


Example Decision Tree for Car Driving
• The decision to be taken is whether to ‘Keep Going’ or to ‘Stop’, which
depends on various situations as depicted in the figure.

28
Decision Tree Induction: An Example
 Training data set: Buys_computer
 The data set follows an example of Quinlan’s ID3 (Playing Tennis)

age income student credit_rating buys_computer


<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
29
Decision Tree Induction: An Example
[Figure: a candidate decision tree for the Buys_computer data, with Age at the root
(branches <=30, 31..40, >40) followed by Income, Student and Credit_rating (CR)
tests at the lower levels, shown alongside the training table.]

Stopping criteria:
1. All or most of the examples at a particular node have the same class
2. All features have been used up in the partitioning
3. The tree has grown to a pre-defined threshold limit
30
Decision Tree Induction: An Example
 Resulting tree:

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31..40 -> yes
  >40    -> credit rating?
              excellent -> no
              fair      -> yes
31
Algorithm for Decision Tree Induction

• Basic algorithm (a greedy algorithm)


– Tree is constructed in a top-down recursive divide-and-conquer
manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized
in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)

• Conditions for stopping partitioning


– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left
32
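The greedy induction scheme above can be exercised with an off-the-shelf learner. A hedged sketch using scikit-learn (assumed to be available) on the Buys_computer data follows; note that scikit-learn grows binary (CART-style) splits even with the entropy criterion, unlike the multiway ID3 tree shown in the slides.

```python
# Decision tree induction on the Buys_computer data with scikit-learn.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame(
    [["<=30","high","no","fair","no"], ["<=30","high","no","excellent","no"],
     ["31...40","high","no","fair","yes"], [">40","medium","no","fair","yes"],
     [">40","low","yes","fair","yes"], [">40","low","yes","excellent","no"],
     ["31...40","low","yes","excellent","yes"], ["<=30","medium","no","fair","no"],
     ["<=30","low","yes","fair","yes"], [">40","medium","yes","fair","yes"],
     ["<=30","medium","yes","excellent","yes"], ["31...40","medium","no","excellent","yes"],
     ["31...40","high","yes","fair","yes"], [">40","medium","no","excellent","no"]],
    columns=["age", "income", "student", "credit_rating", "buys_computer"])

X = pd.get_dummies(df.drop(columns="buys_computer"))   # one-hot encode categorical features
y = df["buys_computer"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)  # information-gain style splits
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))             # textual view of the learned tree
```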
Decision Tree

• There can be many decision trees that fit the example/data

• Que. Which decision tree should we choose?
– Prefer smaller trees:
• Trees with low depth
• Trees with fewer nodes
• Finding the smallest decision tree is a computationally hard problem.
• So heuristics are applied
• A decision tree represents a disjunction of conjunctions: each root-to-leaf path
is a conjunction of attribute tests, and the tree is the disjunction of those paths
• Entropy is a measure of disorder in a system. We want nodes to reach low
entropy quickly

33
Attribute Selection Measure:
Information Gain (ID3/C4.5/C5.0)

 Select the attribute with the highest information gain


 Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
 Expected information (entropy) needed to classify a tuple in D:

Entropy(D) = − Σ_{i=1}^{m} pi log2(pi)

 Information needed (after using A to split D into v partitions) to classify D:

Entropy_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Entropy(Dj)

 Information gained by branching on attribute A:

Gain(A) = Entropy(D) − Entropy_A(D)
34
• The information gain is created on the basis of the decrease in entropy
(S) after a data set is split according to a particular attribute (A).
• Constructing a decision tree is all about finding an attribute that returns
the highest information gain (i.e. the most homogeneous branches).
• If the information gain is 0, it means that there is no reduction in entropy
due to split of the data set according to that particular feature.
• On the other hand, the maximum amount of information gain which may
happen is the entropy of the data set before the split.
• Information gain for a particular feature A is calculated as the difference
between the entropy before the split (Sbs) and the entropy after the split (Sas).

35
Attribute Selection: Information Gain
 Class P: buys_computer = “yes” (9 tuples)
 Class N: buys_computer = “no” (5 tuples)

Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

age      pi   ni   I(pi, ni)
<=30     2    3    0.971
31…40    4    0    0
>40      3    2    0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

(5/14) I(2,3) means that the branch “age <= 30” has 5 out of the 14 samples,
with 2 yes’es and 3 no’s. Hence

Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

Age has the highest information gain, so it is chosen as the splitting attribute.
36
37
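To make the arithmetic above concrete, the following Python sketch (illustrative, not from the slides) reproduces Info(D), Info_age(D) and Gain(age) for the Buys_computer data.

```python
# Information-gain sketch for the Buys_computer data (values from the slides).
import math
from collections import Counter

def info(labels):
    # Expected information (entropy): -sum p_i log2(p_i)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# (age, buys_computer) pairs taken from the training table
rows = [("<=30","no"),("<=30","no"),("31...40","yes"),(">40","yes"),(">40","yes"),
        (">40","no"),("31...40","yes"),("<=30","no"),("<=30","yes"),(">40","yes"),
        ("<=30","yes"),("31...40","yes"),("31...40","yes"),(">40","no")]

labels = [y for _, y in rows]
info_D = info(labels)                                    # I(9,5) = 0.940

# Information after splitting on age: weighted entropy of each partition
partitions = {}
for age, y in rows:
    partitions.setdefault(age, []).append(y)
info_age = sum(len(part) / len(rows) * info(part) for part in partitions.values())

print(round(info_D, 3), round(info_age, 3), round(info_D - info_age, 3))
# -> 0.94 0.694 0.247 (the slide's 0.246 comes from rounding before subtracting)
```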
Case Study

• Let us try to understand this in the context of an


example. Global Technology Solutions (GTS), a
leading provider of IT solutions, is coming to
College of Engineering and Management (CEM)
for hiring B.Tech. students. Last year during
campus recruitment, they had shortlisted 18
students for the final interview. Being a company
of international repute, they follow a stringent
interview process to select only the best of the
students. The information related to the interview
evaluation results of shortlisted students (hiding
the names) on the basis of different evaluation
parameters is available for reference in Figure
7.10. Chandra, a student of CEM, wants to find out
if he may be offered a job in GTS. His CGPA is
quite high. His self-evaluation on the other
parameters is as follows:
• Communication – Bad; Aptitude – High;
Programming skills – Bad

38
Training data for GTS recruitment
Decision tree based on the training data

39
Decision tree based on the training data
(depicting a sample path)
Information Gain (S, A) = Entropy (Sbs ) − Entropy (Sas )

40
41
Entropy and information gain calculation (Level 2)

42
43
Computing Information-Gain
for Continuous-Valued Attributes

• Let attribute A be a continuous-valued attribute


• Must determine the best split point for A
– Sort the value A in increasing order
– Typically, the midpoint between each pair of adjacent
values is considered as a possible split point
• (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
– The point with the minimum expected information
requirement for A is selected as the split-point for A
• Split:
– D1 is the set of tuples in D satisfying A ≤ split-point, and
D2 is the set of tuples in D satisfying A > split-point
44
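A tiny sketch of the midpoint-based candidate split points described above (the attribute values are made up for illustration):

```python
# Candidate split points for a continuous attribute: midpoints of adjacent
# sorted values; duplicate values produce no midpoint.
values = sorted([45, 52, 38, 61, 45, 70])
midpoints = [(a + b) / 2 for a, b in zip(values, values[1:]) if a != b]
print(midpoints)   # [41.5, 48.5, 56.5, 65.5]; each is evaluated and the one with the
                   # minimum expected information requirement becomes the split point
```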
Gain Ratio for Attribute Selection (C4.5)

• Information gain measure is biased towards attributes with


a large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
SplitInfo_A(D) = − Σ_{j=1}^{v} (|Dj| / |D|) log2(|Dj| / |D|)
– GainRatio(A) = Gain(A) / SplitInfo(A)
• Ex.

– gain_ratio(income) = 0.029/1.557 = 0.019


• The attribute with the maximum gain ratio is selected as
the splitting attribute
45
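A short sketch that reproduces SplitInfo(income) and the gain ratio quoted above (income takes the values high, medium and low 4, 6 and 4 times in the 14-tuple data):

```python
# Gain-ratio sketch for the income attribute (values match the slide).
import math

def split_info(partition_sizes):
    # SplitInfo_A(D) = -sum (|Dj|/|D|) log2(|Dj|/|D|)
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes)

sizes = [4, 6, 4]                      # income = high / medium / low
si = split_info(sizes)                 # ~1.557
gain_income = 0.029                    # Gain(income) from the earlier slide
print(round(si, 3), round(gain_income / si, 3))   # 1.557 0.019
```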
Gini Index (CART, IBM IntelligentMiner)

• If a data set D contains examples from n classes, the gini
index gini(D) is defined as

gini(D) = 1 − Σ_{j=1}^{n} pj²

where pj is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the
gini index of the split is defined as

gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

• Reduction in impurity:

Δgini(A) = gini(D) − gini_A(D)

• The attribute that provides the smallest gini_split(D) (or the largest
reduction in impurity) is chosen to split the node (need to
enumerate all the possible splitting points for each attribute)
46
Computation of Gini Index

• Ex. D has 9 tuples with buys_computer = “yes” and 5 with “no”

gini(D) = 1 − (9/14)² − (5/14)² = 0.459

• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium}
and 4 tuples in D2: {high}

gini_income∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)

 Gini_{low,high} is 0.458; Gini_{medium,high} is 0.450. Thus, split on
{low, medium} (and {high}) since it has the lowest Gini index
• All attributes are assumed continuous-valued
• May need other tools, e.g., clustering, to get the possible split values
• Can be modified for categorical attributes
47
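The Gini computation above can be reproduced with a few lines of Python; the per-partition class counts below are read off the Buys_computer table.

```python
# Gini-index sketch reproducing the example above.
def gini(counts):
    # gini = 1 - sum p_j^2 over the classes
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([9, 5]), 3))                     # gini(D) = 0.459

# income in {low, medium} -> D1 (7 yes, 3 no); income = high -> D2 (2 yes, 2 no)
d1, d2 = [7, 3], [2, 2]
gini_split = (10 / 14) * gini(d1) + (4 / 14) * gini(d2)
print(round(gini_split, 3))                       # ~0.443, lower than 0.458 and 0.450
```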
Comparing Attribute Selection Measures

• The three measures, in general, return good results but


– Information gain:
• biased towards multivalued attributes
– Gain ratio:
• tends to prefer unbalanced splits in which one
partition is much smaller than the others
– Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal-sized
partitions and purity in both partitions
48
Overfitting and Tree Pruning

• Overfitting: An induced tree may overfit the training data


– Too many branches, some may reflect anomalies due to noise or
outliers
– Poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early: do not split a node if this
would result in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
• Use a set of data different from the training data to decide which is
the “best pruned tree”

49
Decision Tree
• Avoiding overfitting in decision tree – pruning

• There are two approaches of pruning:


• Pre-pruning: Stop growing the tree before it reaches
perfection.
• Post-pruning: Allow the tree to grow entirely and then post-
prune some of the branches from it.

50
Decision Tree
• Strengths of Decision Tree
• It produces very simple understandable rules. For smaller trees, not much mathematical and
computational knowledge is required to understand this model.
• Works well for most of the problems.
• It can handle both numerical and categorical variables.
• Can work well both with small and large training data sets.
• Decision trees provide a definite clue of which features are more useful for classification.

• Weaknesses of Decision Tree


• Decision tree models are often biased towards features having a larger number
of possible values, i.e. levels.
• This model gets overfitted or underfitted quite easily.
• Decision trees are prone to errors in classification problems with many classes
and a relatively small number of training examples.
• A decision tree can be computationally expensive to train.
• Large trees are complex to understand.

51
Random forest model

• Random forest is an ensemble classifier, i.e. a combining


classifier that uses and combines many decision tree
classifiers.
• Ensembling is usually done using the concept of bagging
with different feature sets.
• After the random forest is generated by combining the
trees, majority vote is applied to combine the output of the
different trees.

52
Random forest model
• How does random forest work?
• 1. If there are N variables or features in the input data set, select a subset of ‘m’ (m < N) features at
random out of the N features. Also, the observations or data instances should be picked randomly.
• 2. Use the best split principle on these ‘m’ features to calculate the number of nodes ‘d’.
• 3. Keep splitting the nodes to child nodes till the tree is grown to the maximum possible extent.
• 4. Select a different subset of the training data ‘with replacement’ to train another decision tree
following steps (1) to (3). Repeat this to build and train ‘n’ decision trees.
• 5. Final class assignment is done on the basis of the majority votes from the ‘n’ trees.

53
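A hedged usage sketch with scikit-learn's RandomForestClassifier (assumed available), which implements the bagging-plus-random-feature-subset procedure described in the steps above; the dataset here is synthetic and purely illustrative.

```python
# Random forest usage sketch with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# n_estimators = number of trees 'n'; max_features = size of the random feature
# subset 'm' tried at each split; bootstrap=True samples the training data with replacement
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))           # majority-vote accuracy on held-out data
print(forest.feature_importances_[:5])        # estimates of which features matter most
```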
Random forest model
• Strengths of random forest
• It runs efficiently on large and expansive data sets.
• It has a robust method for estimating missing data and maintains precision when a large proportion of
the data is absent.
• It has powerful techniques for balancing errors in a class population of unbalanced data sets.
• It gives estimates (or assessments) about which features are the most important ones in the overall
classification.
• It generates an internal unbiased estimate (gauge) of the generalization error as the forest generation
progresses.
• Generated forests can be saved for future use on other data.
• Lastly, the random forest algorithm can be used to solve both classification and regression problems.

• Weaknesses of random forest


• This model, because it combines a number of decision tree models, is not as
easy to understand as a decision tree model.
• It is computationally much more expensive than a simple model like decision
tree.

54
Support Vector Machine (SVM)

• Machine learning technique introduced by Vapnik in 1996


• Popular due to its generalization ability
• SVM is used for
– Pattern Recognition (Classification)
– Regression Estimation (Prediction)

• The solution provided by SVM is

– Theoretically elegant
– Computationally efficient
– Very effective in many large practical problems

• It has a simple geometrical interpretation in a high-dimensional feature space that is nonlinearly related to
the input space.

• By using kernels, all computations remain simple.

• It contains ANN, RBF and Polynomial classifiers as special cases.
55


ANN vs. SVM

ANN                                          SVM
Minimizes empirical risk                     Minimizes structural risk
Solution can have multiple local minima      Solution is global and unique
Less generalization                          More generalization
Data intensive                               Requires less data
Presence of outliers affects                 Presence of outliers may not influence
generalization accuracy                      generalization accuracy
Difficult to determine structure and         Developing an SVM-based model is much
parameters                                   easier compared to an ANN model
Black box                                    Based on computational learning theory;
                                             SVM can give some explanation about how
                                             the final result was arrived at
56
History: SVM
• The Study on Statistical Learning Theory was
started in the 1960s by Vapnik. A classifier
derived from statistical learning theory by Vapnik,
et al. in 1992
• SVM became famous when, using images as
input, it gave accuracy comparable to neural-
network with hand-designed features in a
handwriting recognition task
• Currently, SVM is widely used in
• object detection & recognition,
• content-based image retrieval,
• text recognition,
• biometrics,
• speech recognition
• and many more…
57
Two Key Concepts of SVM

1. Construction of optimum hyperplane in the feature space.

Optimum Hyper-plane

2. Mapping of training data into higher dimensional feature


space. A kernel trick is used to avoid explicit mapping
functions.

Φ(x)
58
Linear Classifiers

denotes +1        w·x + b > 0
denotes -1

How would you


classify this
data?

w·x + b < 0

59
Linear Classifiers

denotes +1
denotes -1

How would you


classify this
data?

60
Linear Classifiers

denotes +1
denotes -1

How would you


classify this
data?

61
Linear Classifiers

denotes +1
denotes -1

How would you


classify this
data?

62
Linear Classifiers

denotes +1
denotes -1
Any of these
would be fine..

..but which is
best?

63
Linear Classifiers

denotes +1
denotes -1

How would you


classify this
data?

Misclassified
to +1 class
64
Margin Classifier

denotes +1
denotes -1

65
Define the margin of a linear classifier as the width that the
boundary could be increased by before hitting a datapoint.
Maximizing the Margin

IDEA 1: Select the separating hyperplane that maximizes the margin!

[Figure: two classes plotted on Var1 vs. Var2, with the margin width marked]
66
Support Vectors

[Figure: the support vectors lie on the margin boundaries; axes Var1 vs. Var2, margin width marked]
67
Maximum Margin Classifier

denotes +1
denotes -1

• Implies that only support vectors are important; other training
examples are ignorable.

Support Vectors are those datapoints that the margin pushes up against.

This is the simplest kind of SVM, called a Linear SVM (LSVM).
68
Maximizing Margin Mathematically

x+, x−: points on the two margin boundaries; M = margin width

What we know:
• w · x+ + b = +1
• w · x− + b = −1
• w · (x+ − x−) = 2

M = (x+ − x−) · w / |w| = 2 / |w|
69
Quadratic Optimization Problem

 Goal: 1) Correctly classify all training data:
w·xi + b ≥ +1 if yi = +1
w·xi + b ≤ −1 if yi = −1
i.e. yi (w·xi + b) ≥ 1 for all i

2) Maximize the margin M = 2 / |w|, which is the same as
minimizing (1/2) wᵀw

 We can formulate a Quadratic Optimization Problem and solve for w and b

Minimize:   Φ(w) = (1/2) wᵀw
Subject to: yi (w·xi + b) ≥ 1 for all i
70
Data set with Noise

denotes +1  Hard Margin: So far we require


all data points be classified correctly
denotes -1
- No training error
 What if the training set is
noisy?
- Solution 1: use very powerful
kernels

OVERFITTING!
71
Soft Margin Classification

Slack variables ξi can be added to allow


misclassification of difficult or noisy examples.

What should our quadratic optimization criterion be?

Minimize   (1/2) w·w + C Σ_{k=1}^{R} εk

 Parameter C can be viewed as a way to control overfitting.
72
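A hedged scikit-learn sketch (assumed available) showing the role of the soft-margin parameter C described above; the data is synthetic and the specific C values are illustrative.

```python
# Soft-margin SVM: C trades off margin width against training errors.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C)        # small C: wider margin, more slack allowed;
    clf.fit(X_tr, y_tr)                    # large C: fewer violations, risk of overfitting
    print(C, clf.score(X_te, y_te), len(clf.support_))   # accuracy and number of support vectors
```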
Non-Linear SVMs

 Datasets that are linearly separable with some noise


work out great:

0 x

 But what are we going to do if the dataset is just too


hard?
0 x

 How about… mapping data to a higher-dimensional
space (e.g. adding an x² feature)?

73
Non-Linear SVM:
Map to High dimensional Feature Space

 General idea: The original input space can always be


mapped to some higher-dimensional feature space where
the training set is separable

Φ: x → φ(x)

74
The Kernel Trick: K(xi, xj) = φ(xi)ᵀ φ(xj)
The XOR Problem

x2 x2

x1 x1x2

X = ( x1, x2 ) x1 Z = ( x1, x2, x1x2 )

75
Examples of Kernel Functions

 Linear Kernel: K(x, y) = xᵀy

 Polynomial Kernel with degree d: K(x, y) = (1 + xᵀy)^d

 Gaussian (radial-basis function) Kernel:

K(x, y) = exp( −‖x − y‖² / (2σ²) )

 Exponential RBF Kernel:

K(x, y) = exp( −‖x − y‖ / (2σ²) )

 Sigmoid (MLP) Kernel: K(x, y) = tanh(β0 xᵀy + β1)

 Fourier series
 Splines
 …
76
Strengths and Weakness of SVM

•Strengths
•Training is relatively easy
• No local optima, unlike in neural networks
•It scales relatively well to high dimensional data
•Weakness
•Need to choose a “good” kernel function.
•SVMs can only handle two-class outputs (i.e. a categorical
output variable with arity 2).

•What can be done?

•Answer: with output arity N, learn N SVMs (see the sketch below)
SVM 1 learns “Output==1” vs “Output != 1”
SVM 2 learns “Output==2” vs “Output != 2” …
SVM N learns “Output==N” vs “Output != N”
•Then to predict the output for a new input, just predict with each
SVM and find out which one puts the prediction the furthest into the
positive region.
77
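A minimal one-against-all sketch using scikit-learn SVCs (assumed available), following the recipe above: train one binary SVM per class and pick the class whose SVM pushes the sample furthest into the positive region. The Iris dataset is used purely for illustration.

```python
# One-against-all (one-vs-rest) multiclass SVM sketch.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Train one binary SVM per class: "this class" (+1) vs "all the rest" (0)
models = {c: SVC(kernel="linear").fit(X, (y == c).astype(int)) for c in classes}

def predict(x):
    # decision_function gives the signed distance to each class's separating hyperplane
    scores = {c: m.decision_function(x.reshape(1, -1))[0] for c, m in models.items()}
    return max(scores, key=scores.get)       # furthest into the positive region wins

print(predict(X[0]), y[0])   # predicted class vs. true class for one sample
```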
SVM Multiclass Classification Methods

1. One-against-all (Cortes and Vapnik, 1995)


• Each of the SVMs separates a single class from all
remaining classes

2. One-against-One (Friedman, 1996)


• Each SVM separates a pair of classes
• k(k-1)/2 classifiers where each one is trained on data from
two classes
• For training data from ith and jth classes, run binary
classification
• Voting strategy:
• If x is in class i, then add 1 to class i. Else to class j.

• Assign x to class with largest vote (Max wins)

78
References

• Video tutorial – Edureka, simplilearn


• Pattern Recognition, by Rajjan Singhal
• Data Mining, Han and Kamber

79
