UNIT 1 CLASSIFICATION & PREDICTION DM
Classification
Basic Concepts
• There are two forms of data analysis that can be used to extract models describing
important classes or to predict future data trends.
» Classification
» Prediction
• Prediction :- for example, predicting how much a given customer will spend during a sale at the company.
How does classification work?
• Data classification is a two-step process,
» Learning step
» Classification step
• Learning step :- The training data sets are analyzed by a classification algorithm, and
the learned classifier is represented in the form of classification rules.
• Classification step :- Test data are used to estimate the accuracy of the classification rules.
If the accuracy is considered acceptable, the rules can be applied to the classification of
new data tuples.
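The two-step process can be sketched with a deliberately simple rule learner. This is an illustrative sketch only; the loan tuples below are hypothetical, not the actual Bank Loan_Decision data.

```python
from collections import Counter, defaultdict

def learn_rules(train, attr, label):
    """Learning step: analyze the training tuples and derive one
    classification rule (majority class) per value of `attr`."""
    counts = defaultdict(Counter)
    for row in train:
        counts[row[attr]][row[label]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def estimate_accuracy(rules, test, attr, label):
    """Classification step: apply the learned rules to test tuples
    and estimate their accuracy."""
    hits = sum(rules.get(row[attr]) == row[label] for row in test)
    return hits / len(test)

# Hypothetical loan-decision tuples
train = [
    {"income": "high", "loan_decision": "approve"},
    {"income": "high", "loan_decision": "approve"},
    {"income": "low",  "loan_decision": "reject"},
]
test = [
    {"income": "high", "loan_decision": "approve"},
    {"income": "low",  "loan_decision": "reject"},
    {"income": "low",  "loan_decision": "approve"},
]
rules = learn_rules(train, "income", "loan_decision")
print(estimate_accuracy(rules, test, "income", "loan_decision"))  # 2 of 3 correct
```

If the estimated accuracy on the test tuples is acceptable, the learned rules can then be applied to classify new tuples.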
Models
Database :- Bank Loan_Decision
Datasets :- Bank Loan_Decision
Learning step :- Training data sets
Classification step :- Test data
Decision Tree Induction
• A decision tree includes a root node, branches, and leaf nodes.
• Each internal node denotes a test on an attribute,
• each branch denotes the outcome of a test,
• each leaf node holds a class label.
• The topmost node in the tree is the root node.
Example Decision Tree :-
• Decision trees can easily be converted to classification rules.
• Decision trees can handle multidimensional data.
• The learning and classification steps of decision tree induction are
simple and fast.
• They have good accuracy.
• Application areas include medicine, manufacturing and production,
financial analysis, astronomy, and molecular biology.
Decision Tree Induction Algorithm :-
Three parameters in DTI
• Input
• Method
• Output
• Input :-
– Data Partition , D (Training tuples and class label)
– Attribute_List
– Attribute_Selection_Method
• Method :-
1. create a node N;
2. if tuples in D are all of the same class, C, then
3. return N as a leaf node labeled with the class C;
4. if attribute_list is empty then
5. return N as a leaf node labeled with the majority class in D;
6. apply Attribute_selection_method(D, attribute_list) to find the best
splitting_criterion;
7. label node N with splitting_criterion;
8. if splitting_attribute is discrete-valued and multiway splits allowed then
remove splitting_attribute from attribute_list;
9. for each outcome j of splitting_criterion
10. let Dj be the set of data tuples in D satisfying outcome j;
11. if Dj is empty then
12. attach a leaf labeled with the majority class in D to node N;
13. else attach the node returned by Generate_decision_tree(Dj, attribute_list)
to node N;
endfor
14. return N;
• Output :-
A Decision Tree
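The Method above can be rendered as a runnable Python sketch. This is illustrative, not the exact textbook procedure: it assumes discrete-valued attributes, uses information gain as the Attribute_selection_method, and grows a multiway split for each observed value. Removing the splitting attribute at each level guarantees the recursion terminates.

```python
import math
from collections import Counter

def entropy(rows, label):
    counts = Counter(r[label] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, label):
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, label)
    return entropy(rows, label) - remainder

def generate_decision_tree(rows, attrs, label):
    classes = {r[label] for r in rows}
    if len(classes) == 1:                    # steps 2-3: pure partition -> leaf
        return classes.pop()
    if not attrs:                            # steps 4-5: no attributes left -> majority leaf
        return Counter(r[label] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, label))   # step 6
    remaining = [a for a in attrs if a != best]                  # step 8
    tree = {}                                # node N, labeled with `best`
    for value in {r[best] for r in rows}:    # steps 9-13: multiway split
        subset = [r for r in rows if r[best] == value]
        tree[(best, value)] = generate_decision_tree(subset, remaining, label)
    return tree

rows = [
    {"student": "yes", "buys_computer": "yes"},
    {"student": "yes", "buys_computer": "yes"},
    {"student": "no",  "buys_computer": "no"},
]
print(generate_decision_tree(rows, ["student"], "buys_computer"))
# maps ('student', 'yes') -> 'yes' and ('student', 'no') -> 'no'
```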
There are three possible scenarios,
1. Discrete-valued
2. Continuous-valued
3. Discrete-valued and a binary tree
1. Discrete-valued :- If A is discrete-valued, then one branch is grown for each
known value of A.
2. Continuous-valued :- If A is continuous-valued, then two branches are
grown, corresponding to A <= split_point and A > split_point,
where split_point is the split-point returned by Attribute_selection_method
as part of the splitting criterion.
3. Discrete-valued and a binary tree :- If A is discrete-valued and a binary tree
must be produced, then the test is of the form A ∈ SA, where SA is the
splitting subset for A.
Attribute Selection Measures
Information Gain :- For the customers table (9 tuples of class buys_computer =
yes and 5 tuples of class buys_computer = no),
Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14)
= 0.940 bits
Info_age(D) = (5/14) [-(2/5) log2(2/5) - (3/5) log2(3/5)]
+ (4/14) [-(4/4) log2(4/4)]
+ (5/14) [-(3/5) log2(3/5) - (2/5) log2(2/5)]
= 0.694 bits
Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246 bits
Similarly, we can compute,
Gain(income) = 0.029 bits,
Gain(student) = 0.151 bits,
Gain(credit_rating) = 0.048 bits.
Age has the highest information gain among the attributes, so it is selected as
the splitting attribute. Node N is labeled with age, and branches are grown
for each of the attribute's values.
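The Info(D), Info_age(D), and Gain(age) values can be reproduced with a short sketch. The per-value class counts for age (youth: 2 yes/3 no, middle_aged: 4 yes/0 no, senior: 3 yes/2 no) are assumed from the customers table used in this example.

```python
import math

def entropy(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

info_d = entropy([9, 5])            # Info(D) for 9 yes / 5 no tuples
# Assumed class distribution per age value (youth, middle_aged, senior):
partitions = [[2, 3], [4, 0], [3, 2]]
info_age = sum(sum(p) / 14 * entropy(p) for p in partitions)
gain_age = info_d - info_age
print(round(info_d, 3), round(info_age, 3), round(gain_age, 3))
# prints 0.94 0.694 0.247 (the text rounds intermediate values, giving 0.246)
```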
• Again we have split the remaining branches
• Similarly, we can compute,
Gain(student) = 0.971
Gain(credit_rating) = 0.021
Student has the highest information gain among the remaining attributes, so it
is selected as the splitting attribute.
Final Decision Tree
Gain Ratio
• The information gain measure is biased toward tests with many outcomes.
• It prefers to select attributes having a large number of values.
• For example, consider an attribute that acts as a unique identifier such as
product ID. A split on product ID would result in a large number of
partitions (as many as there are values), each one containing just one tuple.
• Because each partition is pure, the information required to classify data set
D based on this partitioning would be Info_product_ID(D) = 0.
• Therefore, the information gained by partitioning on this attribute is
maximal, yet such a partitioning is useless for classification.
• C4.5, a successor of ID3, uses an extension to information gain known as
the gain ratio.
• It applies a kind of normalization to information gain using a "split
information" value defined analogously with Info(D) as
SplitInfo_A(D) = - Σ (|Dj| / |D|) log2(|Dj| / |D|), summed over the v outcomes j = 1, ..., v
• The gain ratio is GainRatio(A) = Gain(A) / SplitInfo_A(D), and the attribute
with the maximum gain ratio is selected as the splitting attribute.
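A short sketch of the gain-ratio computation. The income distribution (4 low, 6 medium, 4 high tuples) is assumed from the customers table, and Gain(income) = 0.029 bits is taken from the information-gain example earlier.

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) for a split producing partitions of the given sizes."""
    total = sum(partition_sizes)
    return -sum(s / total * math.log2(s / total) for s in partition_sizes if s)

# income splits D into 4 low, 6 medium, and 4 high tuples (assumed counts)
si_income = split_info([4, 6, 4])
gain_ratio_income = 0.029 / si_income   # Gain(income) from the example above
print(round(si_income, 3), round(gain_ratio_income, 3))
# prints 1.557 0.019
```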
Gini Index
• The Gini index measures the impurity of a data partition D as
Gini(D) = 1 - Σ pi^2
where pi is the probability that a tuple in D belongs to class Ci. For a
binary split of D into D1 and D2 on attribute A,
Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2).
• The attribute that maximizes the reduction in impurity (or, equivalently, has
the minimum Gini index) is selected as the splitting attribute. This attribute
and either its splitting subset (for a discrete-valued splitting attribute) or
split-point (for a continuous-valued splitting attribute) together form the
splitting criterion.
• Example :- Induction of a decision tree using the Gini index. Let D be the
training data shown earlier in the customers table, where 9 tuples belong to
the class buys_computer = yes and the remaining 5 tuples belong to the class
buys_computer = no. A (root) node N is created for the tuples in D. We first
use the Gini index formula to compute the impurity of D:
• Solution :-
Gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
• To find the splitting criterion for the tuples in D, we need to compute the
Gini index for each attribute.
• Let’s start with the attribute income and consider each of the possible
splitting subsets.
• Consider the subset {low, medium}. This would result in 10 tuples in
partition D1 satisfying the condition "income Є {low, medium}." The
remaining 4 tuples of D would be assigned to partition D2. The Gini index
value computed based on this partitioning is
Gini_income Є {low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
= (10/14) [1 - (7/10)^2 - (3/10)^2] + (4/14) [1 - (2/4)^2 - (2/4)^2]
= 0.443
• Similarly, the Gini index can be computed for the remaining splitting
subsets of income and for the other attributes age, student, and credit_rating.
• The binary split "age Є {youth, senior}" gives the minimum Gini index
overall, with a reduction in impurity of
0.459 - 0.357 = 0.102
• This split yields the maximum reduction in impurity of the tuples in D and
is returned as the splitting criterion. Node N is labeled with the criterion,
two branches are grown from it, and the tuples are partitioned accordingly.
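The quoted Gini values can be checked with a few lines of Python. The class counts per partition (age Є {youth, senior}: 5 yes/5 no; middle_aged: 4 yes/0 no) are assumed from the customers table.

```python
def gini(counts):
    """Gini impurity of a class distribution given as a list of counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_d = gini([9, 5])   # impurity of the full data set D
# Binary split age Є {youth, senior}: D1 has 5 yes / 5 no,
# D2 (middle_aged) has 4 yes / 0 no  (assumed counts)
gini_age = 10 / 14 * gini([5, 5]) + 4 / 14 * gini([4, 0])
reduction = gini_d - gini_age
print(round(gini_d, 3), round(gini_age, 3), round(reduction, 3))
# prints 0.459 0.357 0.102
```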
Tree Pruning
• In some data sets (i.e., in large data sets), decision tree induction builds
trees with many branches, some of which reflect anomalies due to noise and
outliers.
• Tree pruning methods address this problem of overfitting the data.
• Such methods typically use statistical measures to remove the least reliable
branches.
• “How does tree pruning work?” There are two common approaches to tree
pruning:
1. Prepruning
2. Postpruning
1. Prepruning :-
In the prepruning approach, a tree is "pruned" by halting its construction
early.
Example :- by deciding not to further split or partition the subset of
training tuples at a given node. Upon halting, the node becomes a leaf.
The leaf may hold the most frequent class among the subset tuples or
the probability distribution of those tuples.
2. Postpruning :-
The postpruning approach removes subtrees from a "fully grown" tree.
A subtree at a given node is pruned by removing its branches and
replacing it with a leaf.
The leaf is labeled with the most frequent class among the subtree
being replaced.
• Step 3 :- As P(X) is constant for all classes, only P(X/Ci) P(Ci) needs to be
maximized.
If the class prior probabilities are not known, then it is commonly
assumed that the classes are equally likely, that is, P(C1) = P(C2) = ........ =
P(Cm), and we would therefore maximize P(X/Ci). Otherwise, we maximize
P(X/Ci) P(Ci). Note that the class prior probabilities may be estimated by
P(Ci) = |Ci, D|/|D|, where |Ci,D| is the number of training tuples of class Ci
in D.
• Step 4 :- Given data sets with many attributes, it would be extremely
computationally expensive to compute P(X/Ci). To reduce computation in
evaluating P(X/Ci), the naive assumption of class-conditional independence
is made. This presumes that the attributes' values are conditionally
independent of one another, given the class label of the tuple (i.e., that there
are no dependence relationships among the attributes). Thus,
P(X/Ci) = P(x1/Ci) * P(x2/Ci) * ...... * P(xn/Ci)
• The predicted class label is the class Ci for which P(X/Ci) P(Ci) is the
maximum.
Example :-
Predicting a class label using naive Bayesian classification. We wish to
predict the class label of a tuple using naive Bayesian classification, given the
same training data as in customer dataset for decision tree induction. The
training data were shown earlier in Table customer dataset. The data tuples
are described by the attributes age, income, student, and credit rating. The
class label attribute, buys computer, has two distinct values (namely, {yes,
no}). Let C1 correspond to the class buys computer = yes and C2 correspond
to buys computer = no. The tuple we wish to classify is
X = (age = youth, income = medium, student = yes, credit-rating = fair)
We need to maximize P(X/Ci) P(Ci), for i = 1, 2. P(Ci), the prior probability
of each class, can be computed based on the training tuples:
P(buys computer = yes) = 9/14 = 0.643
P(buys computer = no) = 5/14 = 0.357
To compute P(X/Ci), for i = 1, 2, we compute the following conditional
probabilities:
P(age = youth / buys computer = yes) = 2/9 = 0.222
P(age = youth / buys computer = no) = 3/5 = 0.600
P(income = medium / buys computer = yes) = 4/9 = 0.444
P(income = medium / buys computer = no) = 2/5 = 0.400
P(student = yes / buys computer = yes) = 6/9 = 0.667
P(student = yes / buys computer = no) = 1/5 = 0.200
P(credit rating = fair / buys computer = yes) = 6/9 = 0.667
P(credit rating = fair / buys computer = no) = 2/5 = 0.400
• Using these probabilities, we obtain
P(X/buys computer = yes) = P(age = youth / buys computer = yes)
* P(income = medium / buys computer = yes)
* P(student = yes / buys computer = yes)
* P(credit rating = fair / buys computer = yes)
=0.222 * 0.444 * 0.667 * 0.667
= 0.044.
Similarly,
P(X/buys computer = no) = 0.600 * 0.400 * 0.200 * 0.400
= 0.019.
To find the class, Ci , that maximizes P(X/Ci) P(Ci), we compute
P(X/buys computer = yes) P(buys computer = yes) = 0.044 * 0.643 = 0.028
P(X/buys computer = no) P(buys computer = no) = 0.019 * 0.357 = 0.007
Therefore, the naive Bayesian classifier predicts buys computer = yes for tuple X.
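The arithmetic of this worked example can be verified with a short sketch that plugs in the priors and class-conditional probabilities listed above.

```python
from math import prod

# Priors and class-conditional probabilities from the worked example above
p = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],   # age=youth, income=medium, student=yes, credit=fair
    "no":  [3 / 5, 2 / 5, 1 / 5, 2 / 5],
}
# Score each class by P(X/Ci) * P(Ci) and pick the maximum
scores = {c: p[c] * prod(cond[c]) for c in p}
prediction = max(scores, key=scores.get)
print(prediction, round(scores["yes"], 3), round(scores["no"], 3))
# prints yes 0.028 0.007
```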
Regression
• Regression is a data mining function that predicts a number. Age, weight,
distance, temperature, income, or sales could all be predicted using
regression techniques. For example, a regression model could be used to
predict children's height, given their age, weight, and other factors.
• A regression task begins with a data set in which the target values are
known.
• For example, a regression model that predicts children's height could be
developed based on observed data for many children over a period of time.
The data might track age, height, weight, developmental milestones, family
history, and so on. Height would be the target, the other attributes would be
the predictors, and the data for each child would constitute a case.
• Regression models are tested by computing various statistics that measure
the difference between the predicted values and the expected values.
• Regression is for predicting a numeric attribute. Regression analysis can be
used to model the relationship between one or more independent (predictor)
variables and a dependent (response) variable.
• Two types of Regression
1. Linear Regression
2. Multiple Regression
1. Linear Regression :-
Simple linear regression is a method that enables you to determine the
relationship between a continuous process output (Y) and one factor (X).
The relationship is typically expressed in terms of a mathematical equation
such as Y = b + mX
Here,
Y --> response, b and m --> constant, x --> predictor variable.
Y = m0 + m1x
m1 = Σ (xi - x')(yi - y') / Σ (xi - x')^2, with the sums taken over i = 1, ..., |D|
m0 = y' - m1 x'
where x' and y' are the mean values of x and y over the |D| training pairs.
Linear regression is performed either to predict the response variable based
on the predictor variables, or to study the relationship between the response
variable and predictor variables.
Example :- Linear Regression, salary database
x (Experience) y (salary in $1000)
3 30
8 57
9 64
13 72
3 36
6 43
11 59
21 90
1 20
16 83
• Solution :-
x' = (3 + 8 + 9 + 13 + 3 + 6 + 11 + 21 + 1 + 16) / 10
x' = 9.1
y' = (30 + 57 + 64 + 72 + 36 + 43 + 59 + 90 + 20 + 83) / 10
y' = 55.4
m1 = [(3 - 9.1)(30 - 55.4) + (8 - 9.1)(57 - 55.4) + ...... + (16 - 9.1)(83 - 55.4)]
/ [(3 - 9.1)^2 + (8 - 9.1)^2 + ...... + (16 - 9.1)^2]
m1 = 3.5
m0 = 55.4 - 3.5 * 9.1
m0 = 23.6
y = 23.6 + 3.5x (equation for straight line)
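A minimal sketch of simple linear regression on the salary data above. Note that computing at full precision gives m1 ≈ 3.54 and m0 ≈ 23.2; the text's m0 = 23.6 comes from rounding m1 to 3.5 before computing m0.

```python
def simple_linear_regression(xs, ys):
    """Least-squares estimates of m0 and m1 for Y = m0 + m1*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    m1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
         / sum((x - x_mean) ** 2 for x in xs)
    m0 = y_mean - m1 * x_mean
    return m0, m1

xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]        # years of experience
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]  # salary in $1000
m0, m1 = simple_linear_regression(xs, ys)
print(round(m1, 1), round(m0, 1))  # prints 3.5 23.2
```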
Fig for Linear Regression :-
2. Multiple Regression :-
Y = b0 + b1X1 + b2X2 + .... + bkXk + e
where Y is the dependent variable (response), X1, X2, ..., Xk are the
independent variables (predictors), and e is the random error. b0, b1, b2, ...., bk
are known as the regression coefficients, which have to be estimated from the
data.
The multiple linear regression algorithm in XLMiner chooses regression
coefficients so as to minimize the difference between predicted values and
actual values.
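A minimal sketch of the same least-squares fit, assuming NumPy is available (XLMiner is not required; any least-squares solver minimizes the same objective). The data and the generating equation Y = 2 + 3X1 - X2 are hypothetical, chosen so the solver recovers the coefficients exactly.

```python
import numpy as np

# Hypothetical noise-free data generated from Y = 2 + 3*X1 - X2
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 2 + 3 * X[:, 0] - X[:, 1]
A = np.column_stack([np.ones(len(X)), X])   # prepend intercept column for b0
b, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimize ||A b - y||^2
print(np.round(b, 6))                       # [b0, b1, b2]
```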
Fig for Multiple Linear regression :-
Model Evaluation and Selection
• Having built a classification model, we face the questions of which classifier
is the best and how accurate it is.
• For example, suppose you used data from previous sales to build a classifier
to predict customer purchasing behavior. You would like an estimate of how
accurately the classifier can predict the purchasing behavior of future
customers, that is, future customer data on which the classifier has not been
trained.
• But what is accuracy? How can we estimate it? Are some measures of a
classifier's accuracy more appropriate than others? How can we obtain a
reliable accuracy estimate?
• This section describes various evaluation metrics for the predictive accuracy
of a classifier:
– Holdout and random subsampling
– Cross-validation and bootstrap methods
• These are common techniques for assessing accuracy, based on randomly
sampled partitions of the given data.
• What if we have more than one classifier and want to choose the “best” one?
This is referred to as model selection.
• The last two sections address this issue and discuss how to use tests of
statistical significance to assess whether the difference in accuracy between
two classifiers is due to chance.
• Techniques to Improve Classification Accuracy presents how to compare
classifiers based on cost–benefit and receiver operating characteristic (ROC)
curves.
Metrics for Evaluating Classifier Performance
• This section presents measures for assessing how good or how “accurate”
your classifier is at predicting the class label of tuples.
• We will consider the case where the class tuples are more or less evenly
distributed, as well as the case where classes are unbalanced.
• They include accuracy (also known as recognition rate), sensitivity (or
recall), specificity, precision, F1, and Fβ.
• Note that although accuracy is a specific measure, the word “accuracy” is
also used as a general term to refer to a classifier’s predictive abilities.
• Before we discuss the various measures, we need to become comfortable
with some terminology. Recall that we can talk in terms of positive tuples
(tuples of the main class of interest) and negative tuples (all other tuples).
• Given two classes, for example, the positive tuples may be buys computer =
yes while the negative tuples are buys computer = no.
• Suppose we use our classifier on a test set of labeled tuples. P is the number of positive
tuples and N is the number of negative tuples. For each tuple, we compare the
classifier’s class label prediction with the tuple’s known class label.
• There are four additional terms we need to know that are the "building blocks" used in
computing many evaluation measures. Understanding them will make it easy to grasp
the meaning of the various measures.
– True positives (TP): These refer to the positive tuples that were correctly labeled by
the classifier. Let TP be the number of true positives.
– True negatives (TN): These are the negative tuples that were correctly labeled by
the classifier. Let TN be the number of true negatives.
– False positives (FP): These are the negative tuples that were incorrectly labeled as
positive (e.g., tuples of class buys computer = no for which the classifier predicted
buys computer = yes). Let FP be the number of false positives.
– False negatives (FN): These are the positive tuples that were mislabeled as negative
(e.g., tuples of class buys computer = yes for which the classifier predicted buys
computer = no). Let FN be the number of false negatives.
Confusion Matrix
• These terms are summarized in the confusion matrix,
• The confusion matrix is a useful tool for analyzing how well your classifier
can recognize tuples of different classes. TP and TN tell us when the
classifier is getting things right, while FP and FN tell us when the classifier
is getting things wrong.
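The confusion-matrix counts map directly onto the evaluation measures listed earlier. A minimal sketch with hypothetical counts:

```python
def classifier_metrics(tp, tn, fp, fn):
    """Standard evaluation measures computed from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f1

# Hypothetical counts: 90 TP, 80 TN, 20 FP, 10 FN
acc, sens, spec, prec, f1 = classifier_metrics(tp=90, tn=80, fp=20, fn=10)
print(round(acc, 3), round(sens, 3), round(spec, 3), round(prec, 3), round(f1, 3))
# prints 0.85 0.9 0.8 0.818 0.857
```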
• In addition to accuracy-based measures, classifiers can also be compared
with respect to the following additional aspects:
– Speed: This refers to the computational costs involved in generating and
using the given classifier.
– Robustness: This is the ability of the classifier to make correct
predictions given noisy data or data with missing values. Robustness is
typically assessed with a series of synthetic data sets representing
increasing degrees of noise and missing values.
– Scalability: This refers to the ability to construct the classifier efficiently
given large amounts of data. Scalability is typically assessed with a series
of data sets of increasing size.
– Interpretability: Interpretability is subjective and therefore more difficult
to assess. Decision trees and classification rules can be easy to interpret,
yet their interpretability may diminish the more they become complex.
Holdout Method and Random Subsampling
Holdout Method
• The holdout method is what we have alluded to so far in our discussions
about accuracy.
• In this method, the given data are randomly partitioned into two
independent sets, a training set and a test set.
• Typically, two-thirds of the data are allocated to the training set, and the
remaining one-third is allocated to the test set.
Random subsampling
• Random subsampling is a variation of the holdout method in which the
holdout method is repeated k times.
• The overall accuracy estimate is taken as the average of the accuracies
obtained from each iteration.
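The holdout method and random subsampling can be sketched as follows, with a trivial majority-class classifier standing in for a real learner (the data and split fraction are illustrative):

```python
import random
from collections import Counter

def holdout_accuracy(rows, label, train_frac=2/3, rng=None):
    """One holdout round: random ~2/3 train / 1/3 test split."""
    rng = rng or random.Random(0)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    train, test = shuffled[:cut], shuffled[cut:]
    # "Train" a majority-class classifier on the training set only
    majority = Counter(r[label] for r in train).most_common(1)[0][0]
    return sum(r[label] == majority for r in test) / len(test)

def random_subsampling(rows, label, k=10):
    """Repeat the holdout method k times and average the accuracies."""
    rng = random.Random(42)
    return sum(holdout_accuracy(rows, label, rng=rng) for _ in range(k)) / k

rows = [{"buys_computer": "yes"}] * 9 + [{"buys_computer": "no"}] * 5
print(round(random_subsampling(rows, "buys_computer"), 3))
```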
Cross-Validation