ML Module 4: Classification
Contents
• What is classification?
• Use cases of classification
• Classification algorithms:
– Nearest Neighbor
– Decision Tree
– Bayesian Classification
– Support Vector Machine
– Neural Networks
What is Classification?
• Classification
– predicts categorical class labels (discrete or nominal)
– classifies data based on the training set and the values of a classifying (class label) attribute, and uses the resulting model to classify new data
• Classification vs. Regression
– Classification (pattern recognition) and regression (function approximation) are both forms of supervised prediction; clustering (grouping) is unsupervised.
Use cases of Classification
Data Classification—A Two-Step Process
• Model construction: describe a set of predetermined classes by building a classifier from the training data (each training tuple has a known class label).
• Model usage: apply the classifier to testing data to estimate its accuracy, then classify unseen data (a minimal code sketch follows the example below).

Example: given the training data below, will the unseen tuple (Jeff, Professor, 4) be tenured?

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
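A minimal sketch of the two-step process in Python. scikit-learn and its bundled iris data are assumptions for illustration; they are not part of the original example.

```python
# A sketch of the two-step process (model construction, then model usage),
# assuming scikit-learn; the iris data stands in for any labelled data set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction -- learn a classifier from the training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on test data, then classify unseen tuples.
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
unseen = X_test[:1]                     # stands in for a new, unlabelled tuple
print("predicted class:", clf.predict(unseen)[0])
```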
Classification Model Steps
Common Classification Algorithms
• Rule-Based Classification
• Associative Classification
• Case-based Reasoning
• Convolutional Neural Networks
Nearest Neighbor
• Pattern matching:
– Person identification
– Vehicle classification
– Speaker recognition
K-Nearest Neighbor Classification
• K-Nearest Neighbor is a simple algorithm that stores all the available cases and classifies new cases based on a similarity measure (see the sketch after this list).
• Similarity measures
– Euclidean distance
– Manhattan distance
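A minimal sketch of this idea in Python with NumPy; the tiny two-class data set is made up for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by a majority vote among its k nearest stored cases
    (Euclidean distance as the similarity measure)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances to all cases
    nearest = np.argsort(dists)[:k]                   # indices of the k closest cases
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority class among neighbours

# Hypothetical two-class data: cluster A near (1, 1), cluster B near (5, 5).
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([4.9, 5.1]), k=3))   # -> B
```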
K-Nearest Neighbor Classification: Example
K-Nearest Neighbor Algorithm
K-Nearest Neighbor
• Lazy learner
– Does not build a classification model in advance
– Simply stores the training data and makes a prediction whenever test data arrives
How to choose the value of K?
• Typically selected empirically, e.g., by comparing classification accuracy over several candidate values of K (see the sketch below); odd values of K also help avoid ties in two-class problems.
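One hedged way to pick K empirically, assuming scikit-learn and its bundled iris data (not part of the original slides):

```python
# Evaluate several candidate K values with cross-validation and keep the best one.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 16, 2)}      # odd K values also help avoid ties
best_k = max(scores, key=scores.get)
print("best K:", best_k, "mean CV accuracy:", round(scores[best_k], 3))
```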
Strength and Weaknesses of KNN
• Strengths
– Extremely simple algorithm, easy to understand
– Very effective in certain situations, e.g., for recommender system design
– Very fast training phase: almost no time is required, since no model is built
• Weaknesses
– Does not learn anything in the real sense; classification is done entirely on the basis of the training data, so it relies heavily on that data. If the training data does not represent the problem domain comprehensively, the algorithm fails to classify effectively.
– Because no model is trained and every classification works directly from the stored training data, the classification process is very slow.
– A large amount of memory (computational space) is required to load the training data for classification.
Use cases of k-Nearest Neighbor
Bayesian Theorem: Basics
• Classification amounts to determining P(H|X), the posterior probability that a hypothesis H (e.g., "X belongs to class C") holds given the observed data sample X
• By Bayes' theorem: $P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$
Bayesian Theorem
• A simplified assumption: attributes are conditionally independent (i.e., there is no dependence relation between attributes):
$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$
• This greatly reduces the computation cost: Only counts the class
distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for
Ak divided by |Ci, D| (# of tuples of Ci in D)
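A minimal sketch of these counting estimates for categorical attributes; the toy weather-style records are hypothetical, not from the slides.

```python
from collections import Counter, defaultdict

# Hypothetical categorical records: (attribute values, class label).
data = [({"outlook": "sunny", "windy": "no"},  "play"),
        ({"outlook": "sunny", "windy": "yes"}, "no_play"),
        ({"outlook": "rain",  "windy": "no"},  "play"),
        ({"outlook": "rain",  "windy": "yes"}, "no_play")]

class_counts = Counter(c for _, c in data)          # |Ci, D| for each class Ci
value_counts = defaultdict(Counter)                 # counts of Ak = xk within each Ci
for x, c in data:
    for attr, val in x.items():
        value_counts[c][(attr, val)] += 1

def score(x, ci):
    """P(Ci) * prod_k P(xk | Ci), with each P(xk | Ci) estimated by counting."""
    s = class_counts[ci] / len(data)                # prior P(Ci)
    for attr, val in x.items():
        s *= value_counts[ci][(attr, val)] / class_counts[ci]
    return s

query = {"outlook": "sunny", "windy": "no"}
print(max(class_counts, key=lambda ci: score(query, ci)))   # -> play
```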
Naïve Bayesian Classifier: Advantages and Disadvantages
• Advantages
– Easy to implement
– Good results obtained in most of the cases
• Disadvantages
– A significant amount of probability data is required to construct a knowledge base.
– Assumption: class conditional independence, therefore loss of accuracy
– Practically, dependencies exist among variables
• E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by Naïve Bayesian
Classifier
Decision Tree Induction: An Example
Training data set: Buys_computer
The data set follows an example of Quinlan’s ID3 (Playing Tennis)
Stopping criteria:
1. All or most of the examples at a particular node have the same class
2. All features have been used up in the partitioning
3. The tree has grown to a pre-defined threshold limit
Decision Tree Induction: An Example
age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Resulting tree: the root node splits on age, with branches <=30, 31…40, and >40.
Attribute Selection Measure:
Information Gain (ID3/C4.5/C5.0)
$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$

$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$

$Gain(A) = Info(D) - Info_A(D)$
Attribute Selection: Information Gain
Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

$Info(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$

age    p_i  n_i  I(p_i, n_i)
<=30   2    3    0.971
31…40  4    0    0
>40    3    2    0.971

$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$

Here $\frac{5}{14} I(2,3)$ means "age <=30" covers 5 out of 14 samples, with 2 yes's and 3 no's. Hence

$Gain(age) = Info(D) - Info_{age}(D) = 0.246$

Similarly,

$Gain(income) = 0.029$
$Gain(student) = 0.151$
$Gain(credit\_rating) = 0.048$

Since age has the highest information gain, it is selected as the splitting attribute.
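The same numbers can be reproduced with a short Python snippet over the (age, buys_computer) columns of the table above (a sketch, not part of the original slides):

```python
import math
from collections import Counter

# (age, buys_computer) pairs taken from the table above: 9 "yes" and 5 "no".
rows = [("<=30", "no"), ("<=30", "no"), ("31-40", "yes"), (">40", "yes"),
        (">40", "yes"), (">40", "no"), ("31-40", "yes"), ("<=30", "no"),
        ("<=30", "yes"), (">40", "yes"), ("<=30", "yes"), ("31-40", "yes"),
        ("31-40", "yes"), (">40", "no")]

def info(labels):
    """Expected information (entropy) I(...) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

info_D = info([c for _, c in rows])                          # I(9,5) = 0.940

partitions = {}
for age, c in rows:
    partitions.setdefault(age, []).append(c)
info_age = sum(len(p) / len(rows) * info(p) for p in partitions.values())   # 0.694

print(f"{info_D:.3f} {info_age:.3f} {info_D - info_age:.3f}")
# -> 0.940 0.694 0.247 (the slide reports 0.246, i.e. 0.940 - 0.694 of the rounded values)
```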
Case Study
Training data for GTS recruitment
Decision tree based on the training data
Decision tree based on the training data (depicting a sample path)

$\text{Information Gain}(S, A) = \text{Entropy}(S_{bs}) - \text{Entropy}(S_{as})$

where $S_{bs}$ denotes the data set before the split on attribute A and $S_{as}$ the data set after the split.
Entropy and information gain calculation (Level 2)
Computing Information Gain for Continuous-Valued Attributes
Decision Tree
• Avoiding overfitting in decision trees: pruning
Decision Tree
• Strengths of Decision Tree
• It produces very simple, understandable rules. For smaller trees, not much mathematical or computational knowledge is required to understand the model.
• Works well for most problems.
• It can handle both numerical and categorical variables.
• Can work well with both small and large training data sets.
• Decision trees give a definite clue as to which features are more useful for classification.
Random forest model
Random forest model
• How does a random forest work? (See the sketch after these steps.)
• 1. If there are N variables or features in the input data set, select a subset of 'm' (m < N) features at random out of the N features. Also, the observations or data instances should be picked randomly.
• 2. Use the best split principle on these 'm' features to calculate the number of nodes 'd'.
• 3. Keep splitting the nodes into child nodes till the tree is grown to the maximum possible extent.
• 4. Select a different subset of the training data 'with replacement' to train another decision tree following steps (1) to (3). Repeat this to build and train 'n' decision trees.
• 5. The final class assignment is done on the basis of the majority vote from the 'n' trees.
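A minimal sketch of this procedure using scikit-learn's RandomForestClassifier; the data set and parameter values are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

forest = RandomForestClassifier(
    n_estimators=100,        # 'n' trees, each trained on a bootstrap sample (step 4)
    max_features="sqrt",     # random subset of 'm' < N features at each split (step 1)
    random_state=1,
).fit(X_train, y_train)

print("accuracy:", forest.score(X_test, y_test))       # majority vote of the 'n' trees (step 5)
print("feature importances:", forest.feature_importances_)
```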
Random forest model
• Strengths of random forest
• It runs efficiently on large and expansive data sets.
• It has a robust method for estimating missing data and maintains precision when a large proportion of
the data is absent.
• It has powerful techniques for balancing errors in a class population of unbalanced data sets.
• It gives estimates (or assessments) about which features are the most important ones in the overall
classification.
• It generates an internal unbiased estimate (gauge) of the generalization error as the forest generation
progresses.
• Generated forests can be saved for future use on other data.
• Lastly, the random forest algorithm can be used to solve both classification and regression problems.
Support Vector Machine (SVM)
• It has a simple geometrical interpretation in a high-dimensional feature space that is nonlinearly related to
input space.
ANN | SVM
Minimizes empirical risk | Minimizes structural risk
Solution can be multiple local minima | Solution is global and unique
Less generalization | More generalization
Data intensive | Requires less data
Presence of outliers affects generalization accuracy | Presence of outliers may not influence generalization accuracy
Difficult to determine structure and parameters | Developing an SVM-based model is much easier compared to an ANN model
Black box | Based on computational learning theory; SVM can give some explanation about how the final result was arrived at
History: SVM
• The study of statistical learning theory was started in the 1960s by Vapnik. A classifier derived from statistical learning theory was introduced by Vapnik et al. in 1992.
• SVM became famous when, using images as input, it gave accuracy comparable to neural networks with hand-designed features in a handwriting recognition task.
• Currently, SVM is widely used in:
– object detection and recognition
– content-based image retrieval
– text recognition
– biometrics
– speech recognition
– and many more…
Two Key Concepts of SVM
• Optimum hyperplane: choose the separating hyperplane that maximizes the margin.
• Kernel mapping Φ(x): map the input into a higher-dimensional feature space where the classes become linearly separable.
Linear Classifiers
• A linear classifier assigns a point x to the +1 class if w·x + b > 0 and to the -1 class if w·x + b < 0.
• Many different separating hyperplanes classify the training data correctly: any of these would be fine, but which is best?
• A boundary placed too close to the data can easily misclassify new points (e.g., a point of the -1 class misclassified to the +1 class).
Margin Classifier
• The margin of a linear classifier is the width by which the boundary could be increased before hitting a data point.
Maximizing the Margin
• IDEA 1: Select the separating hyperplane that maximizes the margin!
Support Vectors
Maximum Margin Classifier
• Support vectors are those data points that the margin pushes up against.
• This is the simplest kind of SVM (called a Linear SVM, or LSVM).
Maximizing Margin Mathematically
What we know:
• $w \cdot x^{+} + b = +1$
• $w \cdot x^{-} + b = -1$
• $w \cdot (x^{+} - x^{-}) = 2$

Margin width: $M = (x^{+} - x^{-}) \cdot \frac{w}{\|w\|} = \frac{2}{\|w\|}$
Quadratic Optimization Problem
Minimize: $\Phi(w) = \frac{1}{2} w^{T} w$

Subject to: $y_i (w \cdot x_i + b) \geq 1 \;\; \forall i$
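A small numeric sketch, assuming scikit-learn: a linear SVM with a very large C approximates this hard-margin program on a made-up toy data set, and the margin width 2/||w|| can be read off the fitted weights.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters (made-up data), one at x1 = 0 and one at x1 = 3.
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
y = np.array([-1, -1, 1, 1])

# A very large C makes the soft-margin solver behave like the hard-margin program above.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

print("w =", w, "b =", b)
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))   # ~3.0, the gap between the clusters
```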
Data set with Noise
• Forcing the margin to separate every noisy point exactly leads to OVERFITTING!
Soft Margin Classification
• Allow slack variables $\varepsilon_k \geq 0$ for points that violate the margin; the parameter C controls the trade-off between margin width and training errors.

Minimize: $\frac{1}{2} w \cdot w + C \sum_{k=1}^{R} \varepsilon_k$
Non-Linear SVM: Map to a High-Dimensional Feature Space
• Φ: x → φ(x)
The Kernel Trick
• $K(x_i, x_j) = \varphi(x_i)^{T} \varphi(x_j)$: the kernel computes inner products in the mapped feature space without constructing φ(x) explicitly.

The XOR Problem
• The XOR data are not linearly separable in the original (x1, x2) space, but become separable after a mapping such as adding the product feature x1·x2.
Examples of Kernel Functions
• Gaussian (RBF): $K(x, y) = \exp\left(-\frac{\|x - y\|^{2}}{2\sigma^{2}}\right)$
• Fourier series
• Splines
• …
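A short sketch of the kernel trick on the XOR problem, assuming scikit-learn; the gamma and C values are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# XOR data are not linearly separable in the original space, but an SVM with a
# Gaussian (RBF) kernel separates them via the implicit mapping phi(x).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])                      # XOR labels

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))   # cannot reach 1.0 on XOR
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))      # 1.0
```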
Strengths and Weakness of SVM
• Strengths
– Training is relatively easy: there are no local optima, unlike in neural networks
– It scales relatively well to high-dimensional data
• Weaknesses
– Need to choose a "good" kernel function
– A basic SVM can only handle two-class outputs (i.e., a categorical output variable with arity 2); multi-class problems need extensions such as one-vs-rest