DWDM UNIT 4
Unit 4
1
Classification: Basic Concepts
Process (1): Model Construction
[Figure: training data is fed to a classification algorithm, which learns a classifier (model); the model can be expressed as rules, e.g., IF rank = 'professor' OR years > 6 THEN tenured = 'yes']
6
Process (2): Using the Model in Prediction
[Figure: the classifier is applied first to testing data and then to unseen data, e.g., (Jeff, Professor, 4) → Tenured?]
7
Classification: Basic Concepts
[Figure: decision tree for buys_computer — the root tests age: "<=30" branches to student? (no → no, yes → yes); "31..40" → yes; ">40" branches to credit_rating? (excellent → no, fair → yes)]
9
Algorithm for Decision Tree
Induction
■ Basic algorithm (a greedy algorithm)
■Tree is constructed in a top-down recursive divide-and-conquer
manner
■At start, all the training examples are at the root
■Attributes are categorical (if continuous-valued, they are
discretized in advance)
■Examples are partitioned recursively based on selected
attributes
■Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
■ Conditions for stopping partitioning
■All samples for a given node belong to the same class
■There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
■There are no samples left
10
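The greedy procedure above can be made concrete with a short sketch. The following Python code (not part of the original slides; the names build_tree and info_gain are illustrative) grows a tree top-down on categorical attributes, selects each split by information gain, and stops under the conditions listed above.

```python
# A minimal ID3-style sketch of top-down decision tree induction,
# assuming categorical attributes and information gain as the selection measure.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    base = entropy(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return base - remainder

def build_tree(rows, labels, attributes):
    # Stop: all samples belong to one class, or no attributes remain (majority voting)
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedily pick the attribute with the highest information gain
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    branches = {}
    for row, label in zip(rows, labels):
        branches.setdefault(row[best], ([], []))
        branches[row[best]][0].append(row)
        branches[row[best]][1].append(label)
    rest = [a for a in attributes if a != best]
    return {best: {value: build_tree(sub_rows, sub_labels, rest)
                   for value, (sub_rows, sub_labels) in branches.items()}}
```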
Brief Review of Entropy
■ Entropy of a random variable with m possible outcomes (probabilities p1, …, pm): H = -∑i pi log2(pi)
[Figure: entropy curve for the binary case m = 2, maximized at p = 0.5]
11
Attribute Selection Measure:
Information Gain (ID3/C4.5)
■ Select the attribute with the highest information gain
■ Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
■ Expected information (entropy) needed to classify a tuple in D:
Info(D) = -∑i=1..m pi log2(pi)
■ Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = ∑j=1..v (|Dj|/|D|) × Info(Dj)
■ Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
12
Attribute Selection: Information
Gain
■ Class P: buys_computer = “yes” (9 tuples)
■ Class N: buys_computer = “no” (5 tuples)
Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
Gain(age) = Info(D) - Info_age(D) = 0.246
Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048; since age has the highest information gain, it is selected as the splitting attribute
13
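As a check on the worked example, the following Python snippet recomputes Info(D), Info_age(D), and Gain(age) from the class counts that appear on these slides (9 "yes" / 5 "no" overall; 2/3, 4/0, 3/2 per age branch). The helper name info is illustrative.

```python
# Worked computation of Gain(age) for the buys_computer data.
import math

def info(*counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

info_D = info(9, 5)                       # I(9,5) ≈ 0.940
partitions = [(2, 3), (4, 0), (3, 2)]     # (yes, no) counts per age branch
info_age = sum((y + n) / 14 * info(y, n) for y, n in partitions)   # ≈ 0.694
gain_age = info_D - info_age              # ≈ 0.246
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
```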
Computing Information Gain for Continuous-Valued Attributes
■ Let attribute A be a continuous-valued attribute
■ Must determine the best split point for A
■Sort the values of A in increasing order
■Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
■(ai+ai+1)/2 is the midpoint between the values of ai and ai+1
■The point with the minimum expected information
requirement for A is selected as the split-point for A
■ Split:
■D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
14
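A minimal sketch of the split-point search described above, assuming class labels and a plain Python list of values; best_split_point is an illustrative name, not a library function.

```python
# Choose a split point for a continuous attribute: sort values, take midpoints
# of adjacent pairs, and keep the midpoint with the smallest Info_A(D).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))
    best_point, best_info = None, float("inf")
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                    # skip duplicate values
        midpoint = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for v, lab in pairs if v <= midpoint]
        right = [lab for v, lab in pairs if v > midpoint]
        info_a = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if info_a < best_info:
            best_point, best_info = midpoint, info_a
    return best_point, best_info
```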
Gain Ratio for Attribute
Selection (C4.5)
■ Information gain measure is biased towards attributes with a
large number of values
■ C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
■GainRatio(A) = Gain(A)/SplitInfo_A(D), where
SplitInfo_A(D) = -∑j=1..v (|Dj|/|D|) × log2(|Dj|/|D|)
■ Ex. SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557, so gain_ratio(income) = 0.029/1.557 = 0.019; the attribute with the maximum gain ratio is selected as the splitting attribute
■ Reduction in Impurity (used with the Gini index): Δgini(A) = gini(D) - gini_A(D); the attribute providing the largest impurity reduction is chosen to split the node
21
Classification in Large Databases
■ Classification—a classical problem extensively studied by
statisticians and machine learning researchers
■ Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
■ Why is decision tree induction popular?
■relatively faster learning speed (than other classification
methods)
■convertible to simple and easy to understand classification
rules
■can use SQL queries for accessing databases
■comparable classification accuracy with other methods
■ RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
■Builds an AVC-list (attribute, value, class label)
22
Scalability Framework for
RainForest
23
Rainforest: Training Set and Its AVC Sets

AVC-set on age (Buy_Computer):
  age      yes  no
  <=30      2    3
  31..40    4    0
  >40       3    2

AVC-set on income (Buy_Computer):
  income   yes  no
  high      2    2
  medium    4    2
  low       3    1

AVC-set on student (Buy_Computer):
  student  yes  no
  yes       6    1
  no        3    4

AVC-set on credit_rating (Buy_Computer):
  credit_rating  yes  no
  fair            6    2
  excellent       3    3

24
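As an illustration of the AVC idea (Attribute-Value, Class label), the following sketch counts (value, class) pairs for one attribute with collections.Counter; the column names and the three-row toy dataset are made up for the example.

```python
# Build an AVC set for one attribute, in the spirit of RainForest.
from collections import Counter

def avc_set(tuples, attribute, class_label):
    return Counter((t[attribute], t[class_label]) for t in tuples)

data = [
    {"student": "yes", "buys_computer": "yes"},
    {"student": "no",  "buys_computer": "no"},
    {"student": "yes", "buys_computer": "yes"},
]
print(avc_set(data, "student", "buys_computer"))
# Counter({('yes', 'yes'): 2, ('no', 'no'): 1})
```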
BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)
■ Use a statistical technique called bootstrapping to create
several smaller samples (subsets), each fits in memory
■ Each subset is used to create a tree, resulting in several
trees
■ These trees are examined and used to construct a new
tree T’
■It turns out that T’ is very close to the tree that would
be generated using the whole data set together
■ Adv: requires only two scans of DB, an incremental alg.
25
Presentation of Classification Results
[Figure-only slide]
Bayes’ Theorem: Basics
■ Bayes’ Theorem: P(H|X) = P(X|H) P(H) / P(X), where X is a data tuple (evidence) and H is a hypothesis such as “X belongs to class C”
32
Classification Is to Derive the Maximum
Posteriori
■ Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
■ Suppose there are m classes C1, C2, …, Cm.
■ Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
■ This can be derived from Bayes’ theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X)
■ Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized
33
Naïve Bayes Classifier
■ A simplified assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):
P(X|Ci) = ∏k=1..n P(xk|Ci)
■ This greatly reduces the computation cost: only class distributions need to be counted, and P(xk|Ci) is estimated as the fraction of class-Ci tuples having value xk for attribute Ak (for categorical Ak), or from a Gaussian density (for continuous-valued Ak)
34
Naïve Bayes Classifier: Training
Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
35
Naïve Bayes Classifier: An
Example
■ P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
■ Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
■ X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class “buys_computer = yes”
36
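The arithmetic on this slide can be reproduced with a few lines of Python; the probabilities below are copied from the slide, and the loop simply applies the conditional-independence product and multiplies by the class prior.

```python
# Reproduce the naive Bayes computation for
# X = (age<=30, income=medium, student=yes, credit_rating=fair).
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": {"age<=30": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9},
    "no":  {"age<=30": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5},
}
for c in ("yes", "no"):
    p_x_given_c = 1.0
    for p in cond[c].values():
        p_x_given_c *= p                      # conditional independence assumption
    score = p_x_given_c * priors[c]           # P(X|Ci) * P(Ci)
    print(c, round(p_x_given_c, 3), round(score, 3))
# yes: 0.044 and 0.028; no: 0.019 and 0.007 -> predict buys_computer = "yes"
```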
Avoiding the Zero-Probability
Problem
■ Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero
■ Use the Laplacian correction (Laplacian estimator): add 1 to each count so that a single zero count does not wipe out the whole product, while the corrected probability estimates remain close to their uncorrected counterparts
Rule Extraction from a Decision Tree
■ One rule is created for each path from the root to a leaf
■ Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
[Figure: decision tree for buys_computer (age: <=30 / 31..40 / >40; student?; credit_rating?) from which the rules are read off]
[Figure: sequential covering — examples covered by Rule 1, Rule 2, and Rule 3 among the positive examples]
43
Rule Generation
■ To generate a rule
while(true)
find the best predicate p
if foil-gain(p) > threshold then add p to current rule
else break
[Figure: rule growing from general to specific — e.g., A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5 — progressively separating positive from negative examples]
44
How to Learn-One-Rule?
■ Start with the most general rule possible: condition = empty
■ Adding new attributes by adopting a greedy depth-first strategy
■Picks the one that most improves the rule quality
■ Rule-Quality measures: consider both coverage and accuracy
■Foil-gain (in FOIL & RIPPER): assesses info_gain by extending
condition
■favors rules that have high accuracy and cover many positive tuples
■ Rule pruning based on an independent set of test tuples
47
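A minimal sketch of the FOIL-gain measure mentioned above, using the usual formulation pos' × (log2(pos'/(pos'+neg')) − log2(pos/(pos+neg))), where pos/neg are the positive/negative tuples covered by rule R and pos'/neg' those covered by the extended rule R'. The example counts are hypothetical.

```python
# FOIL gain: rewards extensions that keep many positives while raising accuracy.
import math

def foil_gain(pos, neg, pos_, neg_):
    return pos_ * (math.log2(pos_ / (pos_ + neg_)) - math.log2(pos / (pos + neg)))

# Example: extending a rule from covering 50 pos / 50 neg to 45 pos / 5 neg
print(round(foil_gain(50, 50, 45, 5), 3))
```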
Classifier Evaluation Metrics:
Confusion Matrix
Confusion Matrix:
  Actual class \ Predicted class |  C1                   |  ¬C1
  C1                             |  True Positives (TP)  |  False Negatives (FN)
  ¬C1                            |  False Positives (FP) |  True Negatives (TN)

Example of Confusion Matrix:
  Actual class \ Predicted class |  buy_computer = yes  |  buy_computer = no  |  Total
  buy_computer = yes             |  6954                |  46                 |  7000
  buy_computer = no              |  412                 |  2588               |  3000
  Total                          |  7366                |  2634               |  10000
49
Precision and Recall, and F-
measures
■ Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive: Precision = TP / (TP + FP)
■ Recall: completeness – what % of positive tuples the classifier labeled as positive: Recall = TP / (TP + FN)
■ F measure (F1 score): harmonic mean of precision and recall: F = 2 × Precision × Recall / (Precision + Recall)
■ Fβ: weighted measure of precision and recall: Fβ = (1 + β²) × Precision × Recall / (β² × Precision + Recall)
50
Classifier Evaluation Metrics:
Example
51
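As a worked example (not on the original slide), the metrics defined above can be computed directly from the buy_computer confusion matrix shown earlier.

```python
# Evaluation metrics from the buy_computer confusion matrix above.
TP, FN, FP, TN = 6954, 46, 412, 2588

accuracy  = (TP + TN) / (TP + TN + FP + FN)                  # ≈ 0.954
error     = 1 - accuracy                                     # ≈ 0.046
precision = TP / (TP + FP)                                   # ≈ 0.944
recall    = TP / (TP + FN)                                   # ≈ 0.993
f1        = 2 * precision * recall / (precision + recall)    # ≈ 0.968
print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```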
Holdout & Cross-Validation
Methods
■ Holdout method
■Given data is randomly partitioned into two independent sets
■Training set (e.g., 2/3) for model construction
■Test set (e.g., 1/3) for accuracy estimation
■Random sampling: a variation of holdout
■Repeat holdout k times, accuracy = avg. of the accuracies
obtained
■ Cross-validation (k-fold, where k = 10 is most popular)
■Randomly partition the data into k mutually exclusive subsets,
each approximately equal size
■At i-th iteration, use Di as test set and others as training set
■Leave-one-out: k folds where k = # of tuples, for small sized
data
■*Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data
52
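A minimal 10-fold stratified cross-validation sketch, assuming scikit-learn is available; the decision tree classifier and the iris dataset are placeholders for illustration.

```python
# 10-fold stratified cross-validation: average accuracy over the folds.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
print(scores.mean())   # estimated accuracy = average over the 10 folds
```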
Evaluating Classifier Accuracy:
Bootstrap
■ Bootstrap
■ Works well with small data sets
■ Samples the given training tuples uniformly with replacement
■i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
■ Several bootstrap methods exist; a common one is the .632 bootstrap
■ A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 - 1/d)^d ≈ e^-1 = 0.368)
■ Repeat the sampling procedure k times; the overall accuracy of the model is the average over the k iterations:
Acc(M) = (1/k) ∑i=1..k (0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set)
53
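A small sketch of the .632 accuracy estimate described above; the per-round test and training accuracies passed in are hypothetical numbers.

```python
# .632 bootstrap: average of 0.632 * test accuracy + 0.368 * training accuracy.
def bootstrap_632_accuracy(acc_test, acc_train):
    k = len(acc_test)
    return sum(0.632 * t + 0.368 * r for t, r in zip(acc_test, acc_train)) / k

print(bootstrap_632_accuracy([0.81, 0.79, 0.83], [0.95, 0.93, 0.96]))
```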
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
■ Suppose we have 2 classifiers, M1 and M2, which one is better?
54
Estimating Confidence Intervals:
Null Hypothesis
■ Perform 10-fold cross-validation
■ Assume samples follow a t distribution with k–1 degrees of
freedom (here, k=10)
■ Use t-test (or Student’s t-test)
■ Null Hypothesis: M1 & M2 are the same
■ If we can reject null hypothesis, then
■we conclude that the difference between M1 & M2 is
statistically significant
■Choose the model with the lower error rate
55
Estimating Confidence Intervals: t-test
■ The t distribution is symmetric around 0
■ Significance level: e.g., sig = 0.05 or 5% means M1 & M2 are significantly different for 95% of the population
■ Confidence limit: z = sig/2
57
Estimating Confidence Intervals:
Statistical Significance
■ Are M1 & M2 significantly different?
■Compute t. Select significance level (e.g. sig = 5%)
■Consult table for t-distribution: Find t value corresponding to
k-1 degrees of freedom (here, 9)
■t-distribution is symmetric: typically upper % points of
distribution shown → look up value for confidence limit
z=sig/2 (here, 0.025)
■If t > z or t < -z, then t value lies in rejection region:
■Reject null hypothesis that mean error rates of M1 & M2
are same
■Conclude: statistically significant difference between M1
& M2
■Otherwise, conclude that any difference is chance
58
Model Selection: ROC
Curves
■ ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
■ Originated from signal detection theory
■ Shows the trade-off between the true positive rate and the false positive rate
■ The area under the ROC curve is a measure of the accuracy of the model
■ Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
■ The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
■ Vertical axis represents the true positive rate; horizontal axis represents the false positive rate
■ The plot also shows a diagonal line
■ A model with perfect accuracy will have an area of 1.0
59
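A minimal ROC sketch, assuming scikit-learn is available; the true labels and classifier scores below are made-up illustrative values.

```python
# Rank test tuples by predicted score of the positive class and trace TPR vs FPR.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                      # actual classes (illustrative)
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]     # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))
print(roc_auc_score(y_true, y_score))   # area under the curve: 1.0 = perfect, 0.5 = diagonal
```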
Issues Affecting Model Selection
■ Accuracy
■classifier accuracy: predicting class label
■ Speed
■time to construct the model (training time)
■time to use the model (classification/prediction time)
■ Robustness: handling noise and missing values
■ Scalability: efficiency in disk-resident databases
■ Interpretability
■understanding and insight provided by the model
■ Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
60
Ensemble Methods: Increasing the Accuracy
■ Ensemble methods
■Use a combination of models to increase accuracy
■Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
■ Popular ensemble methods
■Bagging: averaging the prediction over a collection of
classifiers
■Boosting: weighted vote with a collection of classifiers
■Ensemble: combining a set of heterogeneous classifiers
62
Bagging: Bootstrap Aggregation
■ Analogy: Diagnosis based on multiple doctors’ majority vote
■ Training
■ Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
■ A classifier model Mi is learned for each training set Di
■ Classification: classify an unknown sample X
■ Each classifier Mi returns its class prediction
■ The bagged classifier M* counts the votes and assigns the class with the
most votes to X
■ Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
■ Accuracy
■ Often significantly better than a single classifier derived from D
■ For noisy data: not considerably worse, more robust
■ Proved improved accuracy in prediction
63
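A minimal bagging sketch along the lines described above: draw k bootstrap samples, fit one model per sample, and take a majority vote at prediction time. scikit-learn's DecisionTreeClassifier is used as the base learner for brevity; bagging_fit and bagging_predict are illustrative names.

```python
# Bagging: bootstrap samples + majority vote.
import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=10):
    models, n = [], len(X)
    for _ in range(k):
        idx = [random.randrange(n) for _ in range(n)]   # sample d tuples with replacement
        Xi = [X[i] for i in idx]
        yi = [y[i] for i in idx]
        models.append(DecisionTreeClassifier().fit(Xi, yi))
    return models

def bagging_predict(models, x):
    votes = [m.predict([x])[0] for m in models]
    return Counter(votes).most_common(1)[0][0]          # class with the most votes
```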
Boosting
■ Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
■ How boosting works:
■ Weights are assigned to each training tuple
■ A series of k classifiers is iteratively learned
■ After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
■ The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
■ Boosting algorithm can be extended for numeric prediction
■ Comparing with bagging: Boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data
64
Adaboost (Freund and Schapire,
1997)
■ Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
■ Initially, all the weights of tuples are set the same (1/d)
■ Generate k classifiers in k rounds. At round i,
■ Tuples from D are sampled (with replacement) to form a training set Di
of the same size
■ Each tuple’s chance of being selected is based on its weight
■ A classification model Mi is derived from Di
■ Its error rate is calculated using Di as a test set
■ If a tuple is misclassified, its weight is increased, o.w. it is decreased
■ Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). Classifier Mi's error rate is the sum of the weights of the misclassified tuples:
error(Mi) = ∑j wj × err(Xj)
■ The weight of classifier Mi's vote is log((1 - error(Mi)) / error(Mi))
65
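A small sketch of one round of AdaBoost bookkeeping as described above (weighted error, down-weighting of correctly classified tuples, normalization, and the classifier's vote weight); adaboost_round is an illustrative name and the example weights are hypothetical.

```python
# One AdaBoost round: weighted error of Mi, tuple-weight update, vote weight.
import math

def adaboost_round(weights, correct):
    error = sum(w for w, ok in zip(weights, correct) if not ok)     # err(Mi)
    beta = error / (1 - error)
    # correctly classified tuples have their weights reduced, then all are normalized
    new_weights = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(new_weights)
    new_weights = [w / total for w in new_weights]
    vote_weight = math.log((1 - error) / error)                     # weight of Mi's vote
    return new_weights, vote_weight

# Example: 4 tuples with equal initial weights, the last one misclassified
print(adaboost_round([0.25, 0.25, 0.25, 0.25], [True, True, True, False]))
```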
Random Forest (Breiman 2001)
■ Random Forest:
■ Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to
determine the split
■ During classification, each tree votes and the most popular class is
returned
■ Two Methods to construct Random Forest:
■ Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
■ Forest-RC (random linear combinations): Creates new attributes (or
features) that are a linear combination of the existing attributes (reduces
the correlation between individual classifiers)
■ Comparable in accuracy to Adaboost, but more robust to errors and outliers
■ Insensitive to the number of attributes selected for consideration at each
split, and faster than bagging or boosting
66
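A minimal Forest-RI-style example, assuming scikit-learn is available; max_features plays the role of F, the number of attributes considered as split candidates at each node, and the iris data is a placeholder.

```python
# Random forest: each tree splits on a random subset of attributes, trees vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))   # each tree votes; the most popular class is returned
```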
Classification of Class-Imbalanced Data
Sets
■ Class-imbalance problem: Rare positive example but numerous
negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
■ Traditional methods assume a balanced distribution of classes
and equal error costs: not suitable for class-imbalanced data
■ Typical methods for imbalanced data in 2-class classification:
■Oversampling: re-sampling of data from positive class
■Under-sampling: randomly eliminate tuples from negative
class
■Threshold-moving: moves the decision threshold, t, so that
the rare class tuples are easier to classify, and hence, less
chance of costly false negative errors
■Ensemble techniques: Ensemble multiple classifiers
introduced above
■ The class-imbalance problem remains difficult for multiclass tasks
67
Chapter 8. Classification: Basic
Concepts
69
Summary (II)
■ Significance tests and ROC curves are useful for model selection.
■ There have been numerous comparisons of the different
classification methods; the matter remains a research topic
■ No single method has been found to be superior over all others
for all data sets
■ Issues such as accuracy, training time, robustness, scalability,
and interpretability must be considered and can involve trade-
offs, further complicating the quest for an overall superior
method
70
References (1)
■ C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
■ C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press,
1995
■ L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984
■ C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data
Mining and Knowledge Discovery, 2(2): 121-168, 1998
■ P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data
for scaling machine learning. KDD'95
■ H. Cheng, X. Yan, J. Han, and C.-W. Hsu,
Discriminative Frequent Pattern Analysis for Effective Classification, ICDE'07
■ H. Cheng, X. Yan, J. Han, and P. S. Yu,
Direct Discriminative Pattern Mining for Effective Classification, ICDE'08
■ W. Cohen. Fast effective rule induction. ICML'95
■ G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for
gene expression data. SIGMOD'05
71
References (2)
■ A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.
■ G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and
differences. KDD'99.
■ R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001
■ U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.
■ Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. J. Computer and System Sciences, 1997.
■ J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. VLDB’98.
■ J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree
Construction. SIGMOD'99.
■ T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, 2001.
■ D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 1995.
■ W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple
Class-Association Rules, ICDM'01.
72
References (3)
■ T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity,
and training time of thirty-three old and new classification algorithms. Machine
Learning, 2000.
■ J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic
interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing
Research, Blackwell Business, 1994.
■ M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining.
EDBT'96.
■ T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
■ S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-
Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
■ J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
■ J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
■ J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
■ J. R. Quinlan. Bagging, boosting, and C4.5. AAAI'96.
73
References (4)
■ R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. VLDB’98.
■ J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. VLDB’96.
■ J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann,
1990.
■ P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley,
2005.
■ S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert
Systems. Morgan Kaufmann, 1991.
■ S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
■ I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2ed. Morgan Kaufmann, 2005.
■ X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03
■ H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical
clusters. KDD'03.
74
Issues: Evaluating Classification
Methods
■ Accuracy
■classifier accuracy: predicting class label
■predictor accuracy: guessing value of predicted attributes
■ Speed
■time to construct the model (training time)
■time to use the model (classification/prediction time)
■ Robustness: handling noise and missing values
■ Scalability: efficiency in disk-resident databases
■ Interpretability
■understanding and insight provided by the model
■ Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
77
Predictor Error Measures
■ Measure predictor accuracy: measure how far off the predicted value is from
the actual known value
■ Loss function: measures the error between yi and the predicted value yi'
■ Absolute error: |yi - yi'|
■ Squared error: (yi - yi')^2
■ Test error (generalization error): the average loss over the test set
■ Mean absolute error: (1/d) ∑i=1..d |yi - yi'|
■ Mean squared error: (1/d) ∑i=1..d (yi - yi')^2
80
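A short sketch computing the mean absolute error and mean squared error defined above; the actual and predicted values are illustrative.

```python
# Predictor error measures: MAE and MSE over a test set.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

d = len(y_true)
mae = sum(abs(a - p) for a, p in zip(y_true, y_pred)) / d        # mean absolute error
mse = sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / d      # mean squared error
print(mae, mse)
```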