
Unit IV

Classification and
Prediction

1
Classification and Prediction
• What is classification? What is prediction?
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian Classification
• Classification by Neural Networks
• Classification based on concepts from association rule mining
• Other Classification Methods
• Prediction
• Classification accuracy
• Summary

2
Classification and Prediction

• What is classification? What is prediction?


• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian Classification
• Classification by Neural Networks(Back propagation)
• Other Classification Methods
• Support Vector Machine(SVM)
• K Nearest Neighbor Classification(KNN)
• Prediction
• Classification accuracy
• Summary

3
Classification vs. Prediction
• Classification:
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the
    values (class labels) of a classifying attribute, and uses it to classify
    new data
• Prediction:
  - models continuous-valued functions, i.e., predicts unknown or missing
    values
• Typical applications:
  - credit approval
  - target marketing
  - medical diagnosis
  - treatment effectiveness analysis
4
Classification—A Two-Step
Process
1. Model construction: describing a set of predetermined
classes

2. Model usage: for classifying future or unknown objects

5
Classification—A Two-Step
Process
1. Model construction: describing a set of predetermined classes
   - Each tuple/sample is assumed to belong to a predefined class, as
     determined by the class label attribute
   - The set of tuples used for model construction is the training set
   - The model is represented as classification rules, decision trees, or
     mathematical formulae

2. Model usage: for classifying future or unknown objects
   - Estimate the accuracy of the model
     - The known label of each test sample is compared with the classified
       result from the model
     - Accuracy rate = % of test set samples that are correctly classified
       by the model
     - The test set is independent of the training set, otherwise
       over-fitting will occur
   - If the accuracy is acceptable, use the model to classify data tuples
     whose class labels are not known
6
Classification Process (1):
Model Construction

Training data:

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

The classification algorithm produces the classifier (model), e.g.:

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
7
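The learned model can then be applied directly; a minimal sketch (the helper name and tuple layout are illustrative, not part of the slides):

```python
def tenured(rank, years):
    """Apply the rule learned above: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Training tuples from the slide; the rule reproduces every TENURED label
training = [("Mike", "Assistant Prof", 3, "no"), ("Mary", "Assistant Prof", 7, "yes"),
            ("Bill", "Professor", 2, "yes"),     ("Jim", "Associate Prof", 7, "yes"),
            ("Dave", "Assistant Prof", 6, "no"), ("Anne", "Associate Prof", 3, "no")]
for name, rank, years, label in training:
    assert tenured(rank, years) == label
```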
Classification Process (2): Use the
Model in Prediction

Classifier (from step 1)

Testing data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4)  ->  Tenured?
8
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
  - Supervision: the training data are accompanied by labels indicating the
    class of the observations
  - New data is classified based on the training set
• Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given: a set of measurements, observations, etc.
  - Aim: establish the existence of classes or clusters in the data
9
Classification and Prediction

• What is classification? What is prediction?


• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian Classification
• Classification by Neural Networks(Back propagation)
• Other Classification Methods
• Support Vector Machine(SVM)
• K Nearest Neighbor Classification(KNN)
• Prediction
• Classification accuracy
• Summary

10
Issues Regarding Classification and
Prediction: 1. Data Preparation
Issues are:
1. Data preparation: helps improve the efficiency, accuracy and scalability
   of classification and prediction
2. Comparing classification methods

A. Data cleaning
   - Preprocess the data in order to reduce noise and handle missing values
B. Relevance analysis (feature selection)
   - Remove irrelevant or redundant attributes; this avoids slowing down or
     misleading the learning step
C. Data transformation
   - Generalize and/or normalize the data; concept hierarchies can be used
11
Issues regarding classification and
prediction:
2. Comparing Classification Methods
1. Predictive accuracy:
   - the ability of the model to correctly predict the class label of
     new/unseen data
2. Speed:
   - time to construct the model
   - time to use the model
3. Robustness:
   - handling noise and missing values
4. Scalability:
   - efficiency of the model for large databases
5. Interpretability:
   - understanding the insight provided by the model
6. Goodness of rules:
   - decision tree size
   - compactness of classification rules

12
Classification and Prediction

• What is classification? What is prediction?


• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian Classification
• Classification by Neural Networks(Back propagation)
• Other Classification Methods
• Support Vector Machine(SVM)
• K Nearest Neighbor Classification(KNN)
• Prediction
• Classification accuracy
• Summary

13
Training Dataset
This follows an example from Quinlan's ID3.

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
14
Output: A Decision Tree for “buys_computer”
The tree splits on age at the root: samples with age <=30 are further tested
on student, samples with age 31…40 are classified as yes, and samples with
age >40 are further tested on credit_rating.

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31…40  -> yes
  >40    -> credit_rating?
              excellent -> no
              fair      -> yes
15
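Read as executable logic, the tree above corresponds to nested tests; a minimal sketch (the function name and dict-based sample format are illustrative):

```python
def buys_computer(sample):
    """Classify a sample with the decision tree above.
    sample is a dict with keys: age, student, credit_rating."""
    if sample["age"] == "<=30":
        return "yes" if sample["student"] == "yes" else "no"
    elif sample["age"] == "31...40":
        return "yes"
    else:  # age > 40
        return "yes" if sample["credit_rating"] == "fair" else "no"

print(buys_computer({"age": "<=30", "student": "yes", "credit_rating": "fair"}))  # yes
```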
Basic algorithm for inducing a decision
tree from training tuples

16
Basic algorithm for inducing a decision tree
from training tuples (contd..)

17
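The algorithm figure itself is not reproduced in this text export. As a rough guide, the following is a minimal sketch of the standard greedy, top-down induction loop (ID3-style), with illustrative function names; the attribute selection measure (information gain) is defined on the next slide:

```python
from collections import Counter

def induce_tree(tuples, attributes, select_attribute):
    """Greedy top-down decision-tree induction (ID3-style sketch).
    tuples: list of (attribute_dict, class_label);
    select_attribute: attribute selection measure, e.g. information gain."""
    labels = [label for _, label in tuples]
    if len(set(labels)) == 1:          # all tuples in the same class: leaf
        return labels[0]
    if not attributes:                 # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = select_attribute(tuples, attributes)   # choose the splitting attribute
    remaining = [a for a in attributes if a != best]
    node = {best: {}}
    # Partition on each value of the chosen attribute and recurse
    for value in {t[best] for t, _ in tuples}:
        subset = [(t, c) for t, c in tuples if t[best] == value]
        node[best][value] = induce_tree(subset, remaining, select_attribute)
    return node
```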
Attribute Selection Measure:
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• S contains si tuples of class Ci, for i = 1, ..., m
• Information (entropy) required to classify an arbitrary tuple:

    I(s1, s2, ..., sm) = - Σ_{i=1..m} (si/s) · log2(si/s)

• Entropy of attribute A with values {a1, a2, ..., av}:

    E(A) = Σ_{j=1..v} ((s1j + ... + smj)/s) · I(s1j, ..., smj)

• Information gained by branching on attribute A:

    Gain(A) = I(s1, s2, ..., sm) - E(A)

18
Attribute Selection by Information
Gain Computation
• Class P: buys_computer = "yes" (9 tuples)
• Class N: buys_computer = "no" (5 tuples)
• I(p, n) = I(9, 5) = 0.940
• Compute the entropy for age:

    age      pi   ni   I(pi, ni)
    <=30     2    3    0.971
    31…40    4    0    0
    >40      3    2    0.971

    E(age) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694

  (5/14)·I(2,3) means "age <=30" covers 5 of the 14 samples, with 2 yes's and
  3 no's. Hence

    Gain(age) = I(p, n) - E(age) = 0.246

• Similarly,

    Gain(income)        = 0.029
    Gain(student)       = 0.151
    Gain(credit_rating) = 0.048

  Age has the highest information gain, so it is chosen as the splitting
  attribute at the root.
19
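A short sketch that reproduces these numbers (assuming the 14-tuple training set from slide 14, represented as dicts with a "class" key; the function names are illustrative):

```python
import math
from collections import Counter

def info(counts):
    """I(s1, ..., sm): expected information (entropy) for the class counts."""
    s = sum(counts)
    return -sum(c / s * math.log2(c / s) for c in counts if c)

def gain(tuples, attr):
    """Information gain obtained by branching on `attr`."""
    total = info(list(Counter(t["class"] for t in tuples).values()))
    e = 0.0
    for value in {t[attr] for t in tuples}:
        subset = [t for t in tuples if t[attr] == value]
        e += len(subset) / len(tuples) * info(list(Counter(t["class"] for t in subset).values()))
    return total - e

# With the buys_computer data this yields approximately:
# gain(data, "age") = 0.246, gain(data, "income") = 0.029,
# gain(data, "student") = 0.151, gain(data, "credit_rating") = 0.048
```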
Tree Pruning: Avoid Overfitting in
Classification
• Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or
    outliers
  - Poor accuracy for unseen samples
• Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this
    would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree, giving a
    sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is
      the "best pruned tree"
20
Extracting Classification Rules from
Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand

Example (using the buys_computer tree from slide 15):

IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
21
Approaches to Determine
the Final Tree Size

• Separate training (2/3) and testing (1/3) sets
• Use the minimum description length (MDL) principle:
  - halt growth of the tree when the encoding is minimized
22
Enhancements to basic decision
tree induction
 Allow for continuous-valued attributes
 Dynamically define new discrete-valued
attributes that partition the continuous attribute
value into a discrete set of intervals
 Handle missing attribute values
 Assign the most common value of the attribute
 Assign probability to each of the possible values
 Attribute construction
 Create new attributes based on existing ones
that are sparsely represented
 This reduces fragmentation, repetition, and
replication
23
Classification in Large Databases
 Classification—a classical problem extensively studied by
statisticians and machine learning researchers
 Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
 Why decision tree induction in data mining?
 relatively faster learning speed (than other

classification methods)

convertible to simple and easy to understand
classification rules
 can use SQL queries for accessing databases

 comparable classification accuracy with other methods

24
Scalable Decision Tree Induction Methods
in Data Mining Studies

 SLIQ
 builds an index for each attribute and only class list and
the current attribute list reside in memory
 SPRINT
 constructs an attribute list data structure
 PUBLIC
 integrates tree splitting and tree pruning: stop growing the
tree earlier
 RainForest
 separates the scalability aspects from the criteria that
determine the quality of the tree
 builds an AVC-list (attribute, value, class label)

25
Classification and Prediction

• What is classification? What is prediction?


• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian Classification
• Classification by Neural Networks(Back propagation)
• Other Classification Methods
• Support Vector Machine(SVM)
• K Nearest Neighbor Classification(KNN)
• Prediction
• Classification accuracy
• Summary

26
Bayesian Classification
• Bayesian classifiers are statistical classifiers.
• They can predict class membership probabilities: the probability that a
  given sample belongs to a particular class.
• They offer high accuracy and speed for large databases.
• Class conditional independence: naive Bayesian classifiers assume that the
  effect of an attribute value on a given class is independent of the values
  of the other attributes.
• This assumption is made to simplify the computations involved, and in this
  sense the classifier is considered "naive".

28
Bayesian Theorem Basics
• Given training data X, the posterior probability of a hypothesis H, P(H|X),
  follows Bayes' theorem:

    P(H|X) = P(X|H) · P(H) / P(X)

  Informally: posterior = likelihood × prior / evidence

• Let
  - X be a data sample whose class label is unknown
  - H be the hypothesis that X belongs to class C
• For classification problems, determine P(H|X): the probability that the
  hypothesis holds given the observed data sample X
• P(H): prior probability of hypothesis H (the initial probability before we
  observe any data; reflects the background knowledge)
• P(X): probability that the sample data is observed
• P(X|H): probability of observing the sample X, given that the hypothesis
  holds
29
Bayesian Classification: Practical Difficulty
• Requires initial knowledge of many probabilities and incurs significant
  computational cost

30
Naive Bayesian classification
• Suppose
  1. Each data sample X = (x1, x2, ..., xn) holds n measurements on the n
     attributes A1, A2, ..., An, respectively
  2. There are m classes, C1, C2, ..., Cm
• Given an unknown data sample X, the naive Bayesian classifier assigns X to
  the class Ci iff

    P(Ci|X) > P(Cj|X)   for 1 <= j <= m, j != i
31
• By Bayes' theorem:

    P(Ci|X) = P(X|Ci) · P(Ci) / P(X)

  Since P(X) is the same for all classes, only P(X|Ci) · P(Ci) needs to be
  maximized.
• With the class conditional independence assumption, the probability of X
  conditioned on Ci is the product

    P(X|Ci) = P(x1|Ci) · P(x2|Ci) · ... · P(xn|Ci)

  where X = (x1, x2, ..., xn) is the unknown data sample.

32
Training dataset
Classes:
  C1: buys_computer = "yes"
  C2: buys_computer = "no"

Unknown data sample:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
33
Naïve Bayesian Classifier: Example
Compute P(X/Ci) for each class
 Data sample X =(age<=30, Income=medium, Student=yes Credit_rating=
Fair)

 P(age=“<30” | buys_computer=“yes”) = 2/9=0.222


P(age=“<30” | buys_computer=“no”) = 3/5 =0.6
P(income=“medium” | buys_computer=“yes”)= 4/9 =0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes)= 6/9 =0.667
P(student=“yes” | buys_computer=“no”)= 1/5=0.2
P(credit_rating=“fair” | buys_computer=“yes”)=6/9=0.667
P(credit_rating=“fair” | buys_computer=“no”)=2/5=0.4

P(buys_computer=“yes”)=9/14=0.643
P(buys_computer=“no”)=5/14=0.357

P(X|Ci) : P(X|buys_computer=“yes”)= 0.222 x 0.444 x 0.667 x 0.667 =0.044


P(X|buys_computer=“no”)= 0.6 x 0.4 x 0.2 x 0.4 =0.019
P(X|Ci)*P(Ci ) : P(X|buys_computer=“yes”) * P(buys_computer=“yes”)=0.028
P(X|buys_computer=“no”) * P(buys_computer=“no”)=0.007

X belongs to class “buys_computer=yes” 34
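A compact sketch of this computation (assuming the 14-tuple training set above, stored as (attribute_dict, class_label) pairs; the function name is illustrative, and no smoothing is applied, exactly as in the slide):

```python
from collections import Counter

def naive_bayes_classify(data, x):
    """Return the class Ci maximizing P(X|Ci) * P(Ci) under the naive
    class-conditional independence assumption."""
    class_counts = Counter(label for _, label in data)
    n = len(data)
    best_class, best_score = None, -1.0
    for ci, count in class_counts.items():
        score = count / n                                   # prior P(Ci)
        rows = [attrs for attrs, label in data if label == ci]
        for attr, value in x.items():
            matches = sum(1 for r in rows if r[attr] == value)
            score *= matches / len(rows)                    # P(xk | Ci)
        if score > best_score:
            best_class, best_score = ci, score
    return best_class, best_score

# For X = (age<=30, income=medium, student=yes, credit_rating=fair) this
# reproduces the slide: class "yes" with score ~0.028 vs class "no" ~0.007.
```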


Bayesian Classification: Why?
 Probabilistic learning: Calculate explicit probabilities for
hypothesis, among the most practical approaches to certain
types of learning problems

 Incremental: Each training example can incrementally


increase/decrease the probability that a hypothesis is correct.
Prior knowledge can be combined with observed data.

 Probabilistic prediction: Predict multiple hypotheses,


weighted by their probabilities

 Standard: Even when Bayesian methods are computationally


intractable, they can provide a standard of optimal decision
making against which other methods can be measured

35
Naïve Bayesian Classifier:
Comments
 Advantages :
1. Easy to implement
2. Good results obtained in most of the cases
 Disadvantages
1. Assumption: class conditional independence , therefore loss
of accuracy
2. Practically, dependencies exist among variables
   - e.g., hospital patients: profile (age, family history, etc.),
     symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
3. Dependencies among these cannot be modeled by a naive Bayesian classifier

 How to deal with these dependencies?



Bayesian Belief Networks

36
Bayesian Networks

• A Bayesian belief network allows a subset of the variables to be
  conditionally independent
• It is a graphical model of causal relationships
  - Represents dependency among the variables
  - Gives a specification of the joint probability distribution
• Nodes: random variables; links: dependency
  - Example (figure): X and Y are the parents of Z, and Y is the parent of P;
    there is no dependency between Z and P; the graph has no loops or cycles
37
Bayesian Belief Network: An
Example
Figure: a network over the variables FamilyHistory (FH), Smoker (S),
LungCancer (LC), Emphysema, PositiveXRay and Dyspnoea, in which FamilyHistory
and Smoker are the parents of LungCancer.

The Conditional Probability Table (CPT) for the variable LungCancer shows the
conditional probability for each possible combination of its parents:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
  LC     0.8       0.5        0.7        0.1
  ~LC    0.2       0.5        0.3        0.9

The network specifies the joint probability

    P(z1, ..., zn) = Π_{i=1..n} P(zi | Parents(Zi))

where the tuple (z1, ..., zn) corresponds to the attributes Z1, Z2, ..., Zn.
38
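A small sketch of how the CPT above is read (the probabilities are the slide's values; the dictionary layout and function name are illustrative):

```python
# CPT for LungCancer given its parents (FamilyHistory, Smoker)
cpt_lc = {
    (True, True): 0.8,    # P(LC | FH, S)
    (True, False): 0.5,   # P(LC | FH, ~S)
    (False, True): 0.7,   # P(LC | ~FH, S)
    (False, False): 0.1,  # P(LC | ~FH, ~S)
}

def p_lungcancer(fh, s, lc=True):
    """P(LungCancer = lc | FamilyHistory = fh, Smoker = s), read from the CPT."""
    p = cpt_lc[(fh, s)]
    return p if lc else 1.0 - p

print(p_lungcancer(False, False, lc=False))  # P(~LC | ~FH, ~S) = 0.9
```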
Learning Bayesian Networks

 Several cases
1. Given both the network structure and all variables
observable: learn only the CPTs

2. Network structure known, some hidden variables:


method of gradient descent, analogous to neural
network learning

3. Network structure unknown, all variables observable:


search through the model space to reconstruct graph
topology

4. Unknown structure, all hidden variables: no good


algorithms known for this purpose

39
Classification and Prediction

• What is classification? What is prediction?


• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian Classification
• Classification by Neural Networks(Back propagation)
• Other Classification Methods
• Support Vector Machine(SVM)
• K Nearest Neighbor Classification(KNN)
• Prediction
• Classification accuracy
• Summary

40
Classification by Neural Networks
(Backpropagation)
"What is backpropagation?"
• It is a neural network learning algorithm.

Neural network
• A set of connected input/output units in which each connection has a
  weight associated with it.
• During the learning phase, the network learns by adjusting the weights so
  as to be able to predict the correct class label of the input samples.
• Such a network is a multilayer feed-forward neural network.

41
Multilayer Feed-forward Neural
Network
• A training sample, X = (x1, x2, ..., xn), is fed to the input layer.
• The weighted outputs of these units are, in turn, fed simultaneously to a
  second layer, known as a hidden layer.
• The hidden layer's weighted outputs can be input to another hidden layer,
  and so on. The number of hidden layers is arbitrary, although in practice
  usually only one is used.
• The weighted outputs of the last hidden layer are input to the units making
  up the output layer, which emits the network's prediction for given
  samples.
• Weighted connections exist between the layers, where wij denotes the weight
  on the connection from a unit i in one layer to a unit j in the next layer.
42
Benefits and drawbacks of Neural
Networks
• The network is feed-forward in that none of the weights cycles back to an
  input unit or to an output unit of a previous layer.

Benefits:
• High tolerance to noisy data
• Ability to classify patterns on which the network has not been trained

Disadvantages:
• Long training times, so more suitable for applications where this is
  feasible
• Need to know the network topology or "structure"
• Poor interpretability: it is difficult for humans to interpret the symbolic
  meaning behind the learned weights
43
Defining a network topology
 Before training can begin, the user must decide
on the network topology by specifying
1. the number of units in the input layer,
2. the number of hidden layers (if more than one),
3. the number of units in each hidden layer, and
4. the number of units in the output layer.

 Normalizing the input values for each attribute in


the training samples to fall between 0 and 1.0.

44
Backpropagation
“What is backpropagation?"
 Its a neural network learning algorithm.

How does it work?


 Backpropagation learns by iteratively processing a set of
training samples, comparing the network's prediction for
each sample with the actual known class label.

 For each training sample, the weights are modified so as to


minimize the error between the network's prediction and the
actual class.

 These modifications are made in the “backwards" direction,


i.e., from the output layer, through each hidden layer down to
the first hidden layer (hence the name backpropagation).

45
Multi-Layer Perceptron

The figure shows the input nodes (receiving the input vector x1, ..., xn),
hidden nodes and output nodes (producing the output vector), with a weight
wij on each connection and a bias (threshold) θj at each unit j.

• Given a unit j in a hidden or output layer, the net input Ij to unit j is

    Ij = Σ_i wij · Oi + θj

  where Oi is the output of unit i in the previous layer and θj is the bias
  of unit j.

• Given the net input Ij, the output Oj of unit j is computed with the
  sigmoid (logistic) function:

    Oj = 1 / (1 + e^(-Ij))

• Backpropagate the error: the error is propagated backwards by updating the
  weights and biases to reflect the error of the network's prediction.
  For unit j in the output layer (Tj = true output):

    Errj = Oj (1 - Oj)(Tj - Oj)

  For unit j in a hidden layer (Errk = error of unit k in the next layer):

    Errj = Oj (1 - Oj) Σ_k Errk · wjk

• Weights and biases are then updated with learning rate (l):

    wij = wij + (l) · Errj · Oi
    θj  = θj  + (l) · Errj
47
Network Training
 The ultimate objective of training
 obtain a set of weights that makes almost all the

tuples in the training data classified correctly

 Steps
 Initialize weights with random values

 Feed the input tuples into the network one by one

 For each unit


Compute the net input to the unit as a linear
combination of all the inputs to the unit

Compute the output value using the activation function

Compute the error

Update the weights and the bias
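A minimal training sketch for one hidden layer, following the steps and update rules above (the array shapes, learning rate, epoch count and seed are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(X, T, n_hidden=3, lr=0.1, epochs=1000, seed=0):
    """Backpropagation for a 1-hidden-layer feed-forward network.
    X: (n_samples, n_inputs) inputs scaled to [0, 1]; T: (n_samples, n_outputs) targets."""
    rng = np.random.default_rng(seed)
    # Initialize weights with small random values, biases with zeros
    W1 = rng.uniform(-0.5, 0.5, (X.shape[1], n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-0.5, 0.5, (n_hidden, T.shape[1])); b2 = np.zeros(T.shape[1])
    for _ in range(epochs):
        for x, t in zip(X, T):
            # Forward pass: Ij = sum_i wij*Oi + theta_j, Oj = sigmoid(Ij)
            h = sigmoid(x @ W1 + b1)
            o = sigmoid(h @ W2 + b2)
            # Backpropagate the error
            err_o = o * (1 - o) * (t - o)          # output layer: Oj(1-Oj)(Tj-Oj)
            err_h = h * (1 - h) * (W2 @ err_o)     # hidden layer: Oj(1-Oj) * sum_k Errk*wjk
            # Update weights and biases: wij += l*Errj*Oi, theta_j += l*Errj
            W2 += lr * np.outer(h, err_o); b2 += lr * err_o
            W1 += lr * np.outer(x, err_h); b1 += lr * err_h
    return W1, b1, W2, b2
```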
Chapter 7. Classification and
Prediction
• What is classification? What is prediction?
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian Classification
• Classification by back propagation
• Other Classification Methods
• Support Vector Machine(SVM)
• K Nearest Neighbor Classification(KNN)
• Prediction
• Classification accuracy
• Summary

49
Other Classification Methods
 k-nearest neighbor classifier
 case-based reasoning
 Genetic algorithm
 Rough set approach
 Fuzzy set approaches

50
Instance-Based (Lazy Learners)
Methods
 Instance-based learning:
 Store training examples and delay the processing

(“lazy evaluation”) until a new instance must be


classified

 Typical approaches
 k-nearest neighbor approach


Instances represented as points in a Euclidean
space.

 Case-based reasoning

Uses symbolic representations and knowledge-
based inference

 Locally weighted regression



Constructs local approximation

51
Remarks on Lazy vs. Eager Learning
• K-nearest neighbor and case-based reasoning: lazy evaluation
• Decision tree and Bayesian classification: eager evaluation

Key differences:

  Lazy learning                           Eager learning
  Stores training examples and delays     Constructs a generalization model
  processing until a new instance must    before receiving new samples to
  be classified                           classify
  Faster at training                      Slower at training
  Expensive computational cost when       Less expensive; can assign
  there are many neighbors                different weights to attributes
  Takes more time for prediction          Faster at classification, as the
                                          model is built in advance
52
The k-Nearest Neighbor Algorithm
• All instances correspond to points in the n-dimensional space.
• The nearest neighbors are defined in terms of Euclidean distance.
• The target function may be discrete- or real-valued.
• For discrete-valued targets, k-NN returns the most common value among the
  k training examples nearest to X.
• Euclidean distance between two points X = (x1, x2, ..., xn) and
  Y = (y1, y2, ..., yn):

    d(X, Y) = sqrt( Σ_{i=1..n} (xi - yi)^2 )
53
Case-Based Reasoning
 Uses: lazy evaluation + analyze similar instances
 Instances are not “points in a Euclidean space”

 Methodology
 Instances represented by rich symbolic

descriptions, or as cases.
 Earlier similar cases are seen. Multiple retrieved

cases may be combined


 Tight coupling between case retrieval, knowledge-

based reasoning, and problem solving

 Research issues
 Indexing based on syntactic similarity measure,

and when failure, backtracking, and adapting to


additional cases
54
Genetic Algorithms
Based on an analogy to biological evolution
• Each rule is represented by a string of bits
• An initial population is created consisting of randomly generated rules
  - e.g., the rule IF A1 AND NOT A2 THEN C2 can be encoded as the bit
    string 101
• Based on the notion of survival of the fittest, a new population is formed
  consisting of the fittest rules and their offspring
• The fitness of a rule is represented by its classification accuracy on a
  set of training examples
• Offspring are generated by crossover and mutation
55
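A toy sketch of the evolutionary loop described above (the selection scheme, rates and population handling are illustrative assumptions; fitness would be a rule's classification accuracy on the training examples):

```python
import random

def evolve(population, fitness, generations=50, mutation_rate=0.01):
    """Generic GA loop over bit-string rules: selection by fitness, then
    crossover and mutation produce the next generation."""
    for _ in range(generations):
        # Survival of the fittest: keep the better half of the population
        population.sort(key=fitness, reverse=True)
        survivors = population[: len(population) // 2]
        offspring = []
        while len(survivors) + len(offspring) < len(population):
            p1, p2 = random.sample(survivors, 2)
            cut = random.randrange(1, len(p1))                     # single-point crossover
            child = p1[:cut] + p2[cut:]
            child = [b ^ 1 if random.random() < mutation_rate else b for b in child]
            offspring.append(child)
        population = survivors + offspring
    return max(population, key=fitness)
```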
Rough Set Approach
 Rough sets are used to approximately or “roughly” define
equivalent classes

 A rough set for a given class C is approximated by two


sets: a lower approximation (certain to be in C) and an
upper approximation (cannot be described as not
belonging to C)

 Finding the minimal subsets (reducts) of attributes (for


feature reduction) is NP-hard but a discernibility matrix is
used to reduce the computation intensity

56
Fuzzy Set
Approaches

 Fuzzy logic uses truth values between 0.0 and 1.0 to


represent the degree of membership (such as using fuzzy
membership graph)

 Attribute values are converted to fuzzy values



e.g., income is mapped into the discrete categories
{low, medium, high} with fuzzy values calculated

 For a given new sample, more than one fuzzy value may
apply

 Each applicable rule contributes a vote for membership in


the categories

 Typically, the truth values for each predicted category are


summed
57
Chapter 7. Classification and
Prediction
• What is classification? What is prediction?
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian Classification
• Classification by back propagation
• Other Classification Methods
• Support Vector Machine(SVM)
• K Nearest Neighbor Classification(KNN)
• Prediction
• Classification accuracy
• Summary

58
The k-Nearest Neighbor Algorithm
• The k-nearest-neighbor method was first described in the early 1950s. The
  method is labor intensive when given large training sets, and did not gain
  popularity until the 1960s, when increased computing power became
  available. It has since been widely used in the area of pattern
  recognition.

• Nearest-neighbor classifiers are based on learning by analogy, that is, by


comparing a given test tuple with training tuples that are similar to it. The
training tuples are described by n attributes.

• Each tuple represents a point in an n-dimensional space.

• In this way, all of the training tuples are stored in an n-dimensional


pattern space.
• When given an unknown tuple, a k-nearest-neighbor classifier searches
the pattern space for the k training tuples that are closest to the
unknown tuple.

• These k training tuples are the k “nearest neighbors” of the unknown


tuple.

• “Closeness” is defined in terms of a distance metric, such as Euclidean


distance.

• The Euclidean distance between two points or tuples, say
  X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is

    dist(X1, X2) = sqrt( Σ_{i=1..n} (x1i - x2i)^2 )
• For k-nearest-neighbor classification, the unknown tuple is
assigned the most common class among its k nearest
neighbors.
• When k = 1, the unknown tuple is assigned the class of the
training tuple that is closest to it in pattern space.
• Nearest neighbor classifiers can also be used for prediction,
that is, to return a real-valued prediction for a given unknown
tuple.
• Example: predict the class for X1 (P1 = 3, P2 = 7), with k = 3.

    sno   P1   P2   Class
    1     7    7    Bad
    2     7    4    Bad
    3     3    4    Good
    4     1    4    Good

  Compute the Euclidean distance from X1 to each training tuple, take the
  3 nearest neighbors, and assign X1 the majority class among them.
62
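A sketch that works through this example (the function name is illustrative; the distances and the majority vote follow directly from the table above):

```python
import math
from collections import Counter

def knn_predict(training, query, k=3):
    """training: list of ((p1, p2), class_label); query: (p1, p2).
    Returns the majority class among the k nearest neighbors (Euclidean distance)."""
    by_distance = sorted(training, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

training = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
# Distances from (3, 7): 4.0, 5.0, 3.0 and ~3.61, so the 3 nearest neighbors
# are Good, Good and Bad
print(knn_predict(training, (3, 7), k=3))  # -> "Good"
```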
Chapter 7. Classification and
Prediction
• What is classification? What is prediction?
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian Classification
• Classification by back propagation
• Other Classification Methods
• Support Vector Machine(SVM)
• K Nearest Neighbor Classification(KNN)
• Prediction
• Classification accuracy
• Summary

67
What Is Prediction?

 Prediction is similar to classification


 First, construct a model
 Second, use model to predict unknown value

Major method for prediction is regression
 Linear and multiple regression
 Non-linear regression

 Prediction is different from classification


 Classification refers to predict categorical class label
 Prediction models continuous-valued functions

68
Regress Analysis and Log-Linear
Models in Prediction

• Linear regression: Y = α + β X
  - The two parameters, α and β, specify the line and are to be estimated
    using the data at hand
  - They are fitted by applying the least squares criterion to the known
    values of Y1, Y2, ..., X1, X2, ...
• Multiple regression: Y = b0 + b1 X1 + b2 X2
  - Many nonlinear functions can be transformed into the above

69
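A minimal least-squares sketch for the simple linear case (the closed-form estimates of α and β; the data values are illustrative):

```python
def fit_line(xs, ys):
    """Estimate alpha and beta in Y = alpha + beta*X by least squares:
    beta = sum((x - mx)(y - my)) / sum((x - mx)^2),  alpha = my - beta*mx."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx
    return alpha, beta

# Usage: predict a continuous value for a new X
alpha, beta = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.0])
print(alpha + beta * 5)  # predicted Y at X = 5
```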
• Non-linear models:
  - Polynomial regression can be modeled by adding polynomial terms to the
    basic linear model
  - By applying transformations to the variables, we can convert the
    nonlinear model into a linear one that can then be solved by the method
    of least squares
  - e.g., for a cubic polynomial Y = b0 + b1 X + b2 X^2 + b3 X^3, define the
    new variables X1 = X, X2 = X^2, X3 = X^3
  - The equation can then be rewritten in linear form as
    Y = b0 + b1 X1 + b2 X2 + b3 X3
70
Chapter 7. Classification and
Prediction
 What is classification? What is prediction?
 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian Classification
 Classification by Neural Networks
 Classification based on concepts from association
rule mining
 Other Classification Methods
 Prediction
 Classification accuracy
 Summary
71
Summary

• Classification is an extensively studied problem (mainly in statistics,
  machine learning and neural networks)
• Classification is probably one of the most widely used data mining
  techniques, with many extensions
• Scalability is still an important issue for database applications; thus
  combining classification with database techniques is a promising research
  topic
• Research directions: classification of non-relational data, e.g., text,
  spatial and multimedia data
72
