Unit 3
• Each record, also known as an instance, is characterized by a tuple (x, y), where x
is the attribute set and y is a special attribute, designated as the class label
The vertebrate data set
• Definition: Classification is the task of learning a target function f that maps each
attribute set x to one of the predefined class labels y.
• Predictive Modeling A classification model can also be used to predict the class
label of unknown records.
• Suppose we are given the following characteristics of a creature known as a gila
monster:
• We can use a classification model built from the data set to determine the class to which the creature
belongs.
General Approach to Solving a Classification Problem
• Each technique uses a learning algorithm to identify a model that best fits the
relationship between the attribute set and class label of the input data.
• The model generated by a learning algorithm should both fit the input data and
correctly predict the class labels
General approach for building a classification model.
• First, a training set consisting of records whose class labels are known must be
provided.
• The training set is used to build a classification model, which is subsequently applied to
the test set.
• Each entry fij in this table denotes the number of records from class i predicted
to be of class j.
• For instance, f01 is the number of records from class 0 incorrectly predicted as
class 1.
• Based on the entries in the confusion matrix, the total number of correct predictions
made by the model is (f11 + f00) and the total number of incorrect predictions is (f10 +
f01).
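As a quick sketch, accuracy and error rate can be read directly off such a 2×2 confusion matrix (the counts below are made up for illustration, not taken from the text):

```python
# Accuracy and error rate from a 2x2 confusion matrix.
# f[i][j] = number of records from class i predicted as class j.

def accuracy(f):
    correct = f[0][0] + f[1][1]          # f00 + f11
    total = sum(sum(row) for row in f)   # all predictions
    return correct / total

def error_rate(f):
    return 1 - accuracy(f)

# Illustrative counts: rows are true classes, columns are predictions.
f = [[40, 10],   # class 0: 40 correct, 10 predicted as class 1 (f01)
     [5, 45]]    # class 1: 5 predicted as class 0 (f10), 45 correct

print(accuracy(f))                 # 0.85
print(round(error_rate(f), 2))     # 0.15
```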
• Let's classify the vertebrates into two categories: mammals and non-mammals.
• One approach is to pose a series of questions about the characteristics of the species.
• The first question we may ask is whether the species is cold- or warm-blooded.
• The series of questions and their possible answers can be organized in the form of
a decision tree.
The tree has three types of nodes:
• A root node that has no incoming edges and zero or more outgoing edges.
• Internal nodes, each of which has exactly one incoming edge and two or more
outgoing edges.
• Leaf or terminal nodes, each of which has exactly one incoming edge and no
outgoing edges.
A decision tree for the mammal classification problem
• In a decision tree, each leaf node is assigned a class label.
• The nonterminal nodes, which include the root and other internal nodes, contain
attribute test conditions to separate records that have different characteristics.
• Classifying a test record is straightforward once a decision tree has been
constructed.
• Starting from the root node, we apply the test condition to the record and follow
the appropriate branch based on the outcome of the test.
Classifying an unlabeled vertebrate
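The traversal described above can be sketched in a few lines. The node layout and attribute names below are assumptions chosen to mirror the mammal example (body temperature first, then whether the species gives birth):

```python
# A tiny decision tree as nested dicts: internal nodes hold an
# attribute test, leaves hold a class label.
tree = {
    "attr": "body_temp",
    "branches": {
        "cold-blooded": {"label": "non-mammal"},
        "warm-blooded": {
            "attr": "gives_birth",
            "branches": {
                "yes": {"label": "mammal"},
                "no": {"label": "non-mammal"},
            },
        },
    },
}

def classify(node, record):
    # Starting from the root, follow the branch matching each
    # test outcome until a leaf node is reached.
    while "label" not in node:
        node = node["branches"][record[node["attr"]]]
    return node["label"]

flamingo = {"body_temp": "warm-blooded", "gives_birth": "no"}
print(classify(tree, flamingo))  # non-mammal
```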
How to Build a Decision Tree
• In principle, there are exponentially many decision trees that can be constructed
from a given set of attributes.
Hunt’s Algorithm
• Let Dt be the set of training records associated with node t and let y = {y1, y2, . . . , yc} be the set of class labels.
• Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
• Step 2: If Dt contains records that belong to more than one class, an attribute test condition is
selected to partition the records into smaller subsets.
• A child node is created for each outcome of the test condition and the records in Dt are
distributed to the children based on the outcomes.
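The two steps can be sketched as a recursion. This is a deliberately minimal version: the attribute to test is picked naively (first in a given order) rather than by an impurity measure, and ties at a mixed leaf fall back to the majority class:

```python
from collections import Counter

def hunt(records, attrs):
    """Minimal sketch of Hunt's algorithm.
    records: list of (attribute_dict, class_label) pairs.
    attrs:   ordered list of attribute names still available."""
    labels = [y for _, y in records]
    # Step 1: all records in Dt belong to one class -> leaf node.
    if len(set(labels)) == 1 or not attrs:
        return {"label": Counter(labels).most_common(1)[0][0]}
    # Step 2: pick a test attribute; one child per observed outcome.
    attr, rest = attrs[0], attrs[1:]
    node = {"attr": attr, "branches": {}}
    for value in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == value]
        node["branches"][value] = hunt(subset, rest)
    return node

records = [({"home": "yes"}, "No"),
           ({"home": "no"}, "Yes"),
           ({"home": "yes"}, "No")]
print(hunt(records, ["home"]))
```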
Training set
• The initial tree for the classification problem contains a single node with class
label Defaulted = No
• The tree, however, needs to be refined since the root node contains records from
both classes.
• The records are subsequently divided into smaller subsets based on the outcomes
of the Home Owner test condition
• From the training set, notice that all borrowers who are home owners
successfully repaid their loans.
• The left child of the root is therefore a leaf node labeled Defaulted = No
Hunt’s algorithm for inducing decision trees
Design Issues of Decision Tree Induction
• How should the training records be split? Each recursive step of the tree-growing
process must select an attribute test condition to divide the records into smaller subsets.
• How should the splitting procedure stop? A stopping condition is needed to terminate
the tree-growing process.
• A possible strategy is to continue expanding a node until either all the records belong to
the same class or all the records have identical attribute values.
Methods for Expressing Attribute Test Conditions
• Binary Attributes The test condition for a binary attribute generates two potential
outcomes.
• Ordinal attribute values can be grouped as long as the grouping does not violate the
order property of the attribute values.
Different ways of grouping ordinal attribute values
• Continuous Attributes For continuous attributes, the test condition can be
expressed as a comparison test
• There are many measures that can be used to determine the best way to split the
records.
• These measures are defined in terms of the class distribution of the records before
and after splitting.
• Let p(i|t) denote the fraction of records belonging to class i at a given node t.
• In a two-class problem, the class distribution at any node can be written as (p0, p1),
where p1 = 1 − p0.
To illustrate, consider the test conditions shown in the figure.
• The class distribution before splitting is (0.5, 0.5) because there are an equal number of
records from each class.
• If we split the data using the Gender attribute, then the class distributions of the child
nodes are (0.6, 0.4) and (0.4, 0.6), respectively.
• Although the classes are no longer evenly distributed, the child nodes still contain
records from both classes.
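These distributions can be scored numerically. The snippet below evaluates the Gini index and entropy for the parent distribution (0.5, 0.5) and the Gender-split children (0.6, 0.4) and (0.4, 0.6) quoted above:

```python
import math

def gini(p):
    # Gini index: 1 - sum of squared class fractions.
    return 1 - sum(pi ** 2 for pi in p)

def entropy(p):
    # Entropy in bits; terms with pi = 0 contribute nothing.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

parent = (0.5, 0.5)
children = [(0.6, 0.4), (0.4, 0.6)]

print(gini(parent))                           # 0.5
print([round(gini(c), 2) for c in children])  # [0.48, 0.48]
print(round(entropy(parent), 3))              # 1.0
print(round(entropy(children[0]), 3))         # 0.971
```

Both measures drop after the split, reflecting the (slightly) purer child nodes.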
Multiway versus binary splits
• The measures developed for selecting the best split are often based on the degree of
impurity of the child nodes.
• The smaller the degree of impurity, the more skewed the class distribution.
• ID3 stands for Iterative Dichotomiser 3, so named because the algorithm iteratively
(repeatedly) dichotomizes (divides) the features into two or more groups at each
step.
Three popular attribute selection measures:
• Information gain (ID3)
• Gain ratio (C4.5/J48)
• Gini index (CART)
Algorithm for Decision Tree Induction
• The input to this algorithm consists of the training records E and the attribute set F.
• The algorithm works by recursively selecting the best attribute to split the data and
expanding the leaf nodes of the tree until the stopping criterion is met.
The details of this algorithm are explained below:
• The createNode() function creates a new node. A node in the decision tree contains
either an attribute test condition or a class label.
• The find_best_split() function determines which attribute should be selected as the test
condition for splitting the training records.
• The Classify() function determines the class label to be assigned to a leaf node. For
each leaf node t, let p(i|t) denote the fraction of training records from class i
associated with the node t; the leaf is assigned the label that maximizes p(i|t).
• The errors committed by a classification model are generally divided into two
types: training errors and generalization errors.
• In other words, a good model must have low training error as well as low generalization
error.
• This is important because a model that fits the training data too well can have a poorer
generalization error. Such a situation is known as model overfitting.
Overfitting Example in Two-Dimensional Data
• The data set contains data points that belong to two different classes, denoted as
class o and class +, respectively.
Training and test error rates
• Notice that the training and test error rates of the model are large when the size of
the tree is very small. This situation is known as model underfitting.
• However, once the tree becomes too large, its test error rate begins to increase
even though its training error rate continues to decrease. This phenomenon is
known as model overfitting.
Overfitting Due to Presence of Noise
• Consider the training and test sets for the mammal classification problem.
• Two of the ten training records are mislabeled: bats and whales are classified as
non-mammals instead of mammals.
Training set for classifying mammals
Test set for classifying mammals
• A decision tree that perfectly fits the training data treats the mislabeled records (an
exceptional case) as genuine and, as a result, misclassifies some test records as
non-mammals.
• A simpler decision tree has a lower test error rate (10%) even though its training
error rate is somewhat higher (20%).
• Although its training error is zero, its error rate on the test set is 30%.
An example training set for classifying mammals.
Decision tree
• Humans, elephants, and dolphins are misclassified because the decision tree classifies
all warm-blooded vertebrates that do not hibernate as non-mammals.
• The tree arrives at this classification decision because there is only one training record,
which is an eagle, with such characteristics.
Evaluating the Performance of a Classifier
• Once the model has been constructed, it can be applied to the test set to predict the
class labels.
• The accuracy or error rate computed from the test set can be used to compare the
relative performance of different classifiers.
Holdout Method
• In the holdout method, the original data with labeled examples is partitioned into the training
and the test sets.
• A classification model is then constructed from the training set and its performance is
evaluated on the test set.
• The proportion of data reserved for training and for testing is at the discretion of the
analyst (e.g., 50-50, or two-thirds for training and one-third for testing).
• The accuracy of the classifier can be estimated based on the accuracy of the model on the
test set.
• The holdout method has several limitations.
• First, fewer labeled examples are available for training because some of the records
are used for testing.
• As a result, the model may not be as good as when all the labeled examples are used
for training.
• Second, the model may be highly dependent on the composition of the training and test sets.
• The holdout method can be repeated several times to improve the estimation of a
classifier’s performance. This approach is known as random subsampling.
• It also has no control over the number of times each record is used for testing and
training.
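A minimal sketch of the holdout split and its repeated (random subsampling) variant, using only the standard library; the fraction and record stand-ins are illustrative:

```python
import random

def holdout_split(records, train_frac=2 / 3, seed=None):
    """Shuffle labeled records and partition them into training and
    test sets (two-thirds / one-third by default)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(30))  # stand-ins for 30 labeled records

# Random subsampling: repeat the holdout split several times and
# average the resulting accuracy estimates in practice.
for trial in range(3):
    train, test = holdout_split(data, seed=trial)
    print(len(train), len(test))  # 20 10 on every trial
```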
Cross-Validation
• In this approach, each record is used the same number of times for training and
exactly once for testing.
• To illustrate this method, suppose we partition the data into two equal-sized subsets.
• First, we choose one of the subsets for training and the other for testing.
• We then swap the roles of the subsets so that the previous training set becomes the
test set and vice versa. This approach is called a twofold cross-validation.
• The total error is obtained by summing up the errors for both runs.
• The k-fold cross-validation method generalizes this approach by segmenting the data
into k equal-sized partitions.
• During each run, one of the partitions is chosen for testing, while the rest of them are
used for training.
• This procedure is repeated k times so that each partition is used for testing exactly
once.
• Again, the total error is found by summing up the errors for all k runs.
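The fold bookkeeping can be sketched as follows; the round-robin assignment of indices to folds is one simple choice (shuffling first is common in practice):

```python
def k_fold_indices(n, k):
    """Partition record indices 0..n-1 into k near-equal folds.
    Yields (train, test) index lists; each fold is the test set
    exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, f in enumerate(folds) if m != i for j in f]
        yield train, test

total_tested = []
for train, test in k_fold_indices(10, 5):
    total_tested.extend(test)

print(sorted(total_tested))  # every index appears exactly once
```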
Bootstrap
• The methods presented so far assume that the training records are sampled without
replacement.
• As a result, there are no duplicate records in the training and test sets.
• In the bootstrap approach, the training records are sampled with replacement; i.e., a
record already chosen for training is put back into the original dataset, so that it is
equally likely to be redrawn.
• If the original data has N records, it can be shown that, on average, a bootstrap
sample of size N contains about 63.2% of the records in the original data.
• Records that are not included in the bootstrap sample become part of the test set.
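The 63.2% figure can be checked empirically: each record escapes a size-N sample with probability (1 − 1/N)^N ≈ e⁻¹ ≈ 0.368, so about 63.2% of records are drawn at least once. A quick simulation:

```python
import random

def bootstrap_sample(n, rng):
    # Draw n indices with replacement; duplicates are expected.
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(42)
n = 100_000
sample = bootstrap_sample(n, rng)
covered = len(set(sample)) / n  # fraction of distinct records drawn
print(round(covered, 3))        # close to 1 - 1/e ~= 0.632
```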
• One of the more widely used approaches is the .632 bootstrap, which computes
the overall accuracy by combining the accuracies of each bootstrap sample
with the accuracy computed from a training set that contains all the labeled
examples in the original data (accs):
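A standard form of this estimate, reproduced here since the formula did not survive extraction (εi denotes the accuracy obtained from the i-th of b bootstrap runs):

```latex
acc_{boot} = \frac{1}{b} \sum_{i=1}^{b} \left( 0.632 \times \epsilon_i + 0.368 \times acc_s \right)
```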
Bayes Theorem
• Consider a football game between two rival teams: Team 0 and Team 1. Suppose
Team 0 wins 65% of the time and Team 1 wins the remaining matches. Among the
games won by Team 0, only 30% of them come from playing on Team 1’s football
field. On the other hand, 75% of the victories for Team 1 are obtained while playing
at home. If Team 1 is to host the next match between the two teams, which team
will most likely emerge as the winner?
• This question can be answered by using the well-known Bayes theorem.
• Their joint probability, P(X =x, Y = y), refers to the probability that variable X will
take on the value x and variable Y will take on the value y.
• the conditional probability P(Y = y|X = x) refers to the probability that the variable Y
will take on the value y, given that the variable X is observed to have the value x.
• The joint and conditional probabilities for X and Y are related in the
following way:
• Rearranging the last two expressions leads to the formula known as the Bayes
theorem:
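Written out (the standard forms, supplied here since the equations did not survive extraction), the joint/conditional relationship and the resulting Bayes theorem are:

```latex
P(X, Y) = P(Y \mid X)\, P(X) = P(X \mid Y)\, P(Y)
\qquad\Longrightarrow\qquad
P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}
```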
• Our objective is to compute P(Y = 1|X = 1), which is the conditional probability
that Team 1 wins the next match it will be hosting, and compares it against P(Y =
0|X = 1).
• Using the Bayes theorem, we obtain P(Y = 1|X = 1) = 0.5738.
• Furthermore, P(Y = 0|X = 1) = 1 − P(Y = 1|X = 1) =0.4262.
• Since P(Y = 1|X = 1) > P(Y = 0|X = 1), Team 1 has a better chance
than Team 0 of winning the next match.
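The arithmetic behind these numbers, using only the probabilities given in the problem statement:

```python
# Probabilities from the football example.
p_y0 = 0.65        # P(Y = 0): Team 0 wins
p_y1 = 0.35        # P(Y = 1): Team 1 wins
p_x1_y0 = 0.30     # P(X = 1 | Y = 0): Team 0's wins at Team 1's field
p_x1_y1 = 0.75     # P(X = 1 | Y = 1): Team 1's wins at home

# Law of total probability: P(X = 1).
p_x1 = p_x1_y1 * p_y1 + p_x1_y0 * p_y0

# Bayes theorem: P(Y = 1 | X = 1).
p_y1_x1 = p_x1_y1 * p_y1 / p_x1
print(round(p_y1_x1, 4))       # 0.5738
print(round(1 - p_y1_x1, 4))   # 0.4262
```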
Using the Bayes Theorem for Classification
• Let X denote the attribute set and Y denote the class variable.
• This conditional probability is also known as the posterior probability for Y , as opposed
to its prior probability, P(Y ).
• During the training phase, we need to learn the posterior probabilities P(Y |X) for every
combination of X and Y based on information gathered from the training data.
• By knowing these probabilities, a test record X can be classified by finding the class Y
that maximizes the posterior probability, P(Y|X).
• To illustrate this approach, consider the task of predicting whether a loan borrower will
default on their payments.
• Loan borrowers who defaulted on their payments are classified as Yes, while those who
repaid their loans are classified as No.
• Suppose we are given a test record : X =(Home Owner = No, Marital Status =
Married, Annual Income = $120K).
• To classify the record, we need to compute the posterior probabilities P(Yes|X) and
P(No|X) based on information available in the training data.
• If P(Yes|X) > P(No|X), then the record is classified as Yes; otherwise, it is classified as
No.
Naïve Bayes Classifier
• A naïve Bayes classifier estimates the class-conditional probability by assuming
that the attributes are conditionally independent, given the class label y.
• To classify a test record, the naïve Bayes classifier computes the posterior probability for
each class Y:
• Since P(X) is fixed for every Y , it is sufficient to choose the class that maximizes the
numerator term.
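The posterior in question, written out under the conditional-independence assumption for d attributes (the standard naïve Bayes form, supplied since the equation did not survive extraction):

```latex
P(Y \mid \mathbf{X}) = \frac{P(Y)\, \prod_{i=1}^{d} P(X_i \mid Y)}{P(\mathbf{X})}
```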
Estimating Conditional Probabilities for Categorical Attributes
• For example, three out of the seven people who repaid their loans also own a home.
• The parameter µij can be estimated from the sample mean of Xi over all training
records that belong to the class yj. Similarly, σ²ij can be estimated from the sample
variance (s²) of those training records.
• For example, consider the Annual Income attribute for records of class No.
• Given a test record with taxable income equal to $120K, we can compute its class-
conditional probability as follows:
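The computation follows the normal density. The sample mean and variance below (110 and 2975 for class No) are assumed illustrative values, since the training-set statistics are not reproduced in the text:

```python
import math

def gaussian_density(x, mean, var):
    """Class-conditional density of a continuous attribute under a
    normal (Gaussian) assumption."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Assumed statistics for Annual Income given class No (in $K).
mean_no, var_no = 110.0, 2975.0
print(round(gaussian_density(120, mean_no, var_no), 4))  # 0.0072
```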
Example of the Naïve Bayes Classifier
• We can compute the class conditional probability for each categorical attribute, along with
the sample mean and variance for the continuous attribute.
• To predict the class label of a test record X = (Home Owner=No, Marital Status =
Married, Income = $120K), we need to compute the posterior probabilities P(No|X) and
P(Yes|X).
• Since there are three records that belong to the class Yes and seven records that belong to
the class No, P(Yes)=0.3 and P(No)=0.7. the class-conditional probabilities can be
computed as follows:
• Putting them together, the posterior probability for class No is P(No|X) = α × 7/10 ×
0.0024 = 0.0016α, where α = 1/P(X) is a constant term.
• Using a similar approach, we can show that the posterior probability for class Yes is
zero because its class-conditional probability is zero.
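The unnormalized posterior for class No quoted above is just the prior times the product of class-conditional probabilities (0.0024 is the product quoted in the text; the constant α = 1/P(X) is left factored out):

```python
p_no = 7 / 10             # prior P(No) from the 10 training records
p_x_given_no = 0.0024     # product of class-conditional probabilities
posterior_no_unnormalized = p_no * p_x_given_no
print(round(posterior_no_unnormalized, 5))  # 0.00168 (the text rounds to 0.0016)
```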