UNIT 2 Classification: Basics
Figure: the classification process. A classification algorithm is applied to the training data to learn a classifier; the classifier is then applied to testing data and to unseen data.
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen tuple: (Jeff, Professor, 4) -> Tenured?
Decision Tree Induction
Several algorithms exist for decision tree induction, for example:
(a) ID3 (Iterative Dichotomiser)
(b) C4.5 (successor of ID3)
(c) CART (Classification and Regression Trees)
ID3, C4.5, and CART adopt a greedy (i.e., non-backtracking) approach in which decision trees are constructed in a top-down, recursive, divide-and-conquer manner.
The training set is recursively partitioned into smaller subsets as the tree is being built.
Decision Tree Induction: An Example
Training data set: buys_computer (follows the example in Quinlan's ID3, "playing tennis")

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Resulting tree: the root tests age?; the 31..40 branch leads directly to the leaf "yes", the <=30 branch is split further on student (no -> "no", yes -> "yes"), and the >40 branch is split on credit_rating (excellent -> "no", fair -> "yes").
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
◦ The tree is constructed in a top-down, recursive, divide-and-conquer manner
◦ At the start, all the training examples are at the root
◦ Attributes are categorical (continuous-valued attributes are discretized in advance)
◦ Examples are partitioned recursively based on selected attributes
◦ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
◦ All samples for a given node belong to the same class
◦ There are no remaining attributes for further partitioning (majority voting is then used to label the leaf)
A sketch of this procedure is given below.
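The recursive procedure above can be summarized in a short sketch. Below is a minimal, illustrative Python implementation (not the slides' own algorithm listing); the function names, the dictionary-based tree representation, and the toy dataset in the usage example are assumptions chosen for demonstration.

```python
# Minimal ID3-style decision tree induction (illustrative sketch, not the slides' code).
import math
from collections import Counter

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, target):
    # Info(D) minus the expected information after splitting on attr.
    total = len(rows)
    split_info = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        split_info += (len(subset) / total) * entropy(subset, target)
    return entropy(rows, target) - split_info

def build_tree(rows, attrs, target):
    classes = [r[target] for r in rows]
    # Stopping condition 1: all samples belong to the same class.
    if len(set(classes)) == 1:
        return classes[0]
    # Stopping condition 2: no attributes left -> majority voting.
    if not attrs:
        return Counter(classes).most_common(1)[0][0]
    # Greedy choice: attribute with the highest information gain.
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attrs if a != best]
        tree[best][value] = build_tree(subset, remaining, target)
    return tree

if __name__ == "__main__":
    # Hypothetical toy data for demonstration only.
    data = [
        {"age": "<=30", "student": "no", "buys": "no"},
        {"age": "<=30", "student": "yes", "buys": "yes"},
        {"age": "31..40", "student": "no", "buys": "yes"},
        {"age": ">40", "student": "yes", "buys": "yes"},
    ]
    print(build_tree(data, ["age", "student"], "buys"))
```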
Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let p_i be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |C_{i,D}| / |D|
Expected information (entropy) needed to classify a tuple in D:
    Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
    Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
Information gained by branching on attribute A:
    Gain(A) = Info(D) - Info_A(D)
◦ C4.5 uses a normalized variant, the gain ratio: GainRatio(A) = Gain(A) / SplitInfo(A)
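As a quick worked illustration (the numbers are computed from the buys_computer training data shown earlier, which has 9 "yes" and 5 "no" tuples):
    Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
    Info_age(D) = (5/14) × Info(2 yes, 3 no) + (4/14) × Info(4 yes, 0 no) + (5/14) × Info(3 yes, 2 no) = 0.694
    Gain(age) = 0.940 - 0.694 = 0.246
so age would be selected as the splitting attribute at the root.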
Scalability and Decision Tree Induction
Several scalable decision tree induction methods have been introduced, e.g.:
RainForest: adapts to the amount of main memory available and applies to any decision tree induction algorithm.
BOAT (Bootstrapped Optimistic Algorithm for Tree Construction): uses a statistical technique known as bootstrapping to create several smaller samples of the training data, each of which fits in memory.
Classification in Large Databases
Classification—a classical problem extensively studied
by statisticians and machine learning researchers
Scalability: Classifying data sets with millions of
examples and hundreds of attributes with reasonable
speed
Why is decision tree induction popular?
◦ relatively fast learning speed compared with other classification methods
◦ convertible to simple, easy-to-understand classification rules
◦ can use SQL queries for accessing databases
◦ comparable classification accuracy with other methods
RainForest: Training Set and Its AVC Sets (an AVC-set records, for each value of an attribute, the count of tuples per class label)
Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities such as the
probability that a given tuple belongs to a particular class.
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, Naïve Bayesian
classifier, has comparable performance with decision tree
and selected neural network classifiers
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct
— prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayes’ Theorem: Basics
Total probability theorem:  P(B) = Σ_{i=1}^{M} P(B | A_i) P(A_i)
Bayes’ theorem:  P(H | X) = P(X | H) P(H) / P(X)
◦ Let X be a data sample (“evidence”): class label is unknown
◦ Let H be a hypothesis that X belongs to class C
◦ Classification is to determine P(H|X), (i.e., posteriori probability):
the probability that the hypothesis holds given the observed
data sample X
◦ P(H) (prior probability): the initial probability
E.g., X will buy computer, regardless of age, income, …
◦ P(X): probability that sample data is observed
◦ P(X|H) (likelihood): the probability of observing the sample X,
given that the hypothesis holds
E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
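As a small numerical illustration (the values below are hypothetical, not taken from the slides): if P(H) = 0.1, P(X | H) = 0.6, and P(X) = 0.2, then Bayes' theorem gives P(H | X) = (0.6 × 0.1) / 0.2 = 0.3.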
Prediction Based on Bayes’ Theorem
Classification derives the maximum a posteriori class, i.e., the class Ci that maximizes P(Ci | X). Since P(X) is constant for all classes, only P(X | Ci) P(Ci) needs to be maximized.
Naïve Bayes Classifier
A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):
    P(X | Ci) = Π_{k=1}^{n} P(x_k | Ci) = P(x_1 | Ci) × P(x_2 | Ci) × ... × P(x_n | Ci)
This greatly reduces the computation cost: only the class distribution needs to be counted.
If A_k is categorical, P(x_k | Ci) is the number of tuples in Ci having value x_k for A_k, divided by |C_{i,D}| (the number of tuples of Ci in D).
If A_k is continuous-valued, P(x_k | Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:
    g(x, μ, σ) = (1 / (√(2π) σ)) exp(-(x - μ)² / (2σ²)),  and  P(x_k | Ci) = g(x_k, μ_Ci, σ_Ci)
Naïve Bayes Classifier: An Example
Training data: the buys_computer table shown earlier (age, income, student, credit_rating, buys_computer).
P(Ci):  P(buys_computer = "yes") = 9/14 = 0.643
        P(buys_computer = "no") = 5/14 = 0.357
Compute P(X | Ci) for each class, e.g.:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
The conditional probabilities for income, student, and credit_rating are computed in the same way; the prediction for a tuple X is the class Ci with the largest P(X | Ci) P(Ci). A worked computation is sketched below.
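The following minimal Python sketch (not from the slides; the variable names and the hypothetical unseen tuple X are chosen for illustration) reproduces this computation by direct counting on the buys_computer table.

```python
# Naive Bayes by direct counting on the buys_computer data (illustrative sketch).
from collections import Counter

# (age, income, student, credit_rating, buys_computer) -- table from the example slide
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

classes = Counter(row[-1] for row in data)           # class counts: {"yes": 9, "no": 5}
total = len(data)

def p_x_given_c(x, c):
    """P(X | Ci) under the conditional-independence assumption."""
    rows_c = [r for r in data if r[-1] == c]
    p = 1.0
    for k, value in enumerate(x):
        count = sum(1 for r in rows_c if r[k] == value)
        p *= count / len(rows_c)                     # P(x_k | Ci) by counting
    return p

# Hypothetical unseen tuple: X = (age <= 30, income = medium, student = yes, credit = fair)
x = ("<=30", "medium", "yes", "fair")
scores = {c: p_x_given_c(x, c) * (classes[c] / total) for c in classes}
print(scores)                                        # P(X | Ci) * P(Ci) for each class
print("prediction:", max(scores, key=scores.get))    # -> "yes"
```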
Classifier Evaluation Metrics: Example
Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
Holdout method
◦ Given data is randomly partitioned into two independent
sets
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
◦ Random sampling: a variation of holdout
Repeat holdout k times, accuracy = avg. of the
accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
◦ Randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size
◦ At the i-th iteration, use Di as the test set and the remaining subsets together as the training set
◦ Leave-one-out: k folds where k = the number of tuples, for small-sized data
◦ Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
A minimal k-fold loop is sketched below.
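A minimal sketch of k-fold cross-validation, assuming generic train and evaluate callables; those names, and the use of plain Python lists, are assumptions for illustration rather than anything from the slides.

```python
# Minimal k-fold cross-validation (illustrative sketch).
import random

def k_fold_accuracy(data, k, train, evaluate, seed=0):
    """Average accuracy over k folds; train(rows) returns a model,
    evaluate(model, rows) returns an accuracy in [0, 1]."""
    rows = list(data)
    random.Random(seed).shuffle(rows)            # random partition into k folds
    folds = [rows[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]                          # fold i is the test set
        train_rows = [r for j, f in enumerate(folds) if j != i for r in f]
        model = train(train_rows)                # build the model on the other k-1 folds
        accuracies.append(evaluate(model, test)) # estimate accuracy on the held-out fold
    return sum(accuracies) / k
```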
Evaluating Classifier Accuracy: Bootstrap
Bootstrap
◦ Works well with small data sets
◦ Samples the given training tuples uniformly with
replacement
i.e., each time a tuple is selected, it is equally likely to be
selected again and re-added to the training set
Several bootstrap methods, and a common one is .632
bootstrap
◦ A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 - 1/d)^d ≈ e^(-1) ≈ 0.368)
◦ Repeat the sampling procedure k times; the overall accuracy of the model is
    Acc(M) = Σ_{i=1}^{k} (0.632 × Acc(M_i)_test_set + 0.368 × Acc(M_i)_train_set)
A sampling sketch is given below.
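A minimal sketch of one .632 bootstrap split; the function name is an assumption for illustration.

```python
# One .632 bootstrap split (illustrative sketch).
import random

def bootstrap_split(data, seed=0):
    """Sample d tuples with replacement for training; unsampled tuples form the test set."""
    rng = random.Random(seed)
    d = len(data)
    train_idx = [rng.randrange(d) for _ in range(d)]   # sample d times with replacement
    chosen = set(train_idx)
    train = [data[i] for i in train_idx]
    test = [data[i] for i in range(d) if i not in chosen]  # roughly 36.8% of the tuples
    return train, test
```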
Estimating Confidence Intervals: Classifier Models M1 vs. M2
(Figure: the symmetric t-distribution.)
Significance level: e.g., sig = 0.05 or 5% means M1 and M2 are significantly different for 95% of the population.
Confidence limit: z = sig/2.
Estimating Confidence Intervals:
Statistical Significance
Are M1 & M2 significantly different?
◦ Compute t. Select significance level (e.g. sig = 5%)
◦ Consult table for t-distribution: Find t value
corresponding to k-1 degrees of freedom (here, 9)
◦ The t-distribution is symmetric; typically only the upper percentage points of the distribution are tabulated, so look up the table value for confidence limit z = sig/2 (here, 0.025)
◦ If t > z or t < -z, then t value lies in rejection
region:
Reject null hypothesis that mean error rates of
M1 & M2 are same
Conclude: statistically significant difference
between M1 & M2
◦ Otherwise, conclude that any difference is chance
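A minimal sketch of the paired t-test described above, assuming per-fold error rates for M1 and M2 are already available; the example error rates are hypothetical, and the critical value 2.262 is the tabulated upper 2.5% point of the t-distribution with 9 degrees of freedom.

```python
# Paired t-test on per-fold error rates of two models (illustrative sketch).
import math

def paired_t(err1, err2):
    """Return the t statistic for the paired differences of two error-rate lists."""
    k = len(err1)
    diffs = [a - b for a, b in zip(err1, err2)]
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)   # sample variance of differences
    return mean / math.sqrt(var / k)

# Hypothetical per-fold error rates from 10-fold cross-validation
m1 = [0.12, 0.10, 0.15, 0.11, 0.13, 0.12, 0.14, 0.10, 0.12, 0.11]
m2 = [0.18, 0.17, 0.16, 0.19, 0.15, 0.18, 0.17, 0.16, 0.18, 0.19]
t = paired_t(m1, m2)
t_critical = 2.262   # upper 2.5% point of t with 9 degrees of freedom (sig = 0.05)
print("t =", round(t, 3), "significant" if abs(t) > t_critical else "not significant")
```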
Model Selection: ROC Curves
ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
Originated from signal detection theory
Shows the trade-off between the true positive rate (vertical axis) and the false positive rate (horizontal axis); the plot also shows a diagonal line
The area under the ROC curve is a measure of the accuracy of the model; a model with perfect accuracy has an area of 1.0
To plot the curve, rank the test tuples in decreasing order of estimated probability: the one that is most likely to belong to the positive class appears at the top of the list
The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
A sketch of the ROC/AUC computation is given below.
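A minimal sketch of computing ROC points and the area under the curve from scored test tuples; the scores and labels in the example are hypothetical.

```python
# ROC points and AUC from (score, label) pairs (illustrative sketch).
def roc_points(scored):
    """scored: list of (probability_of_positive, true_label) with labels 1/0;
    assumes both classes are present. Returns (fpr, tpr) points from sweeping the ranked list."""
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)   # most likely positive first
    pos = sum(1 for _, y in ranked if y == 1)
    neg = len(ranked) - pos
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))   # (false positive rate, true positive rate)
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical scored test tuples
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.55, 0), (0.4, 0), (0.3, 1), (0.2, 0)]
print(round(auc(roc_points(scored)), 3))
```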
Issues Affecting Model Selection
Accuracy (along with the other criteria listed in Summary (II) below)
Ensemble methods
◦ Use a combination of models to increase accuracy
◦ Combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M*
Popular ensemble methods include bagging and boosting; a bagging sketch is given below.
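A minimal bagging sketch, assuming a generic train callable that builds a base classifier from a list of rows and models that expose a predict(x) method; these names are assumptions for illustration.

```python
# Bagging: train k models on bootstrap samples and combine by majority vote (illustrative sketch).
import random
from collections import Counter

def bagging(data, k, train, seed=0):
    """Return a list of k base models, each trained on a bootstrap sample of data."""
    rng = random.Random(seed)
    models = []
    for _ in range(k):
        sample = [data[rng.randrange(len(data))] for _ in range(len(data))]  # with replacement
        models.append(train(sample))
    return models

def vote(models, x):
    """Combined prediction M*: majority vote of the base models."""
    return Counter(m.predict(x) for m in models).most_common(1)[0][0]
```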
Summary (I)
Classification is a form of data analysis that extracts models describing important data classes.
Effective and scalable methods have been developed for decision tree induction, naïve Bayesian classification, rule-based classification, and many other classification methods.
Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F measure, and Fβ measure.
Stratified k-fold cross-validation is recommended for accuracy estimation. Bagging and boosting can be used to increase overall accuracy by learning and combining a series of individual models.
Summary (II)
Accuracy
◦ classifier accuracy: predicting class label
◦ predictor accuracy: guessing value of predicted
attributes
Speed
◦ time to construct the model (training time)
◦ time to use the model (classification/prediction
time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
◦ understanding and insight provided by the
model
Other measures, e.g., goodness of rules, such as decision tree size or the compactness of classification rules
Predictor Error Measures
◦ Mean absolute error:  (1/d) Σ_{i=1}^{d} |y_i - y_i'|
◦ Mean squared error:  (1/d) Σ_{i=1}^{d} (y_i - y_i')²
◦ Relative absolute error:  Σ_{i=1}^{d} |y_i - y_i'| / Σ_{i=1}^{d} |y_i - ȳ|
◦ Relative squared error:  Σ_{i=1}^{d} (y_i - y_i')² / Σ_{i=1}^{d} (y_i - ȳ)²
where y_i is the true value, y_i' the predicted value, and ȳ the mean of the y_i values. A short computation sketch follows.
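A minimal Python sketch computing these four measures; the function name and the example numbers are hypothetical, chosen only for illustration.

```python
# Predictor error measures (illustrative sketch).
def error_measures(y_true, y_pred):
    d = len(y_true)
    mean_y = sum(y_true) / d
    mae = sum(abs(a - b) for a, b in zip(y_true, y_pred)) / d          # mean absolute error
    mse = sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / d        # mean squared error
    rae = sum(abs(a - b) for a, b in zip(y_true, y_pred)) / sum(abs(a - mean_y) for a in y_true)
    rse = sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / sum((a - mean_y) ** 2 for a in y_true)
    return {"MAE": mae, "MSE": mse, "RAE": rae, "RSE": rse}

print(error_measures([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 3.0, 8.0]))
```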
Data Cube-Based Decision-Tree Induction