Unit 3

Uploaded by Asif EE-010

UNIT - III: Classification

Basic Concepts, General Approach to solving a classification problem,


Decision Tree Induction: Working of Decision Tree, building a decision tree,
methods for expressing an attribute test conditions, measures for selecting the
best split, Algorithm for decision tree induction.

Model Overfitting: due to presence of noise, due to lack of representative samples. Evaluating the performance of a classifier: holdout method, random subsampling, cross-validation, bootstrap. Bayes Theorem, Naïve Bayes Classifier.
Basic concepts

• Classification is the task of assigning objects to one of several predefined


categories.

• The input data for a classification task is a collection of records.

• Each record, also known as an instance, is characterized by a tuple (x, y), where x is the attribute set and y is a special attribute, designated as the class label.

The vertebrate data set
• Definition: Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.

• The target function is also known informally as a classification model.


• Descriptive Modeling: A classification model can serve as an explanatory tool to
distinguish between objects of different classes.

• Predictive Modeling: A classification model can also be used to predict the class label of unknown records.
• Suppose we are given the following characteristics of a creature known as a gila
monster:

• We can use a classification model built from the data set to determine the class to which the creature
belongs.
General Approach to Solving a Classification Problem

• A classification technique (or classifier) is a systematic approach to building classification models from an input data set.

• Each technique uses a learning algorithm to identify a model that best fits the
relationship between the attribute set and class label of the input data.

• The model generated by a learning algorithm should both fit the input data and
correctly predict the class labels
General approach for building a classification model.
• First, a training set consisting of records whose class labels are known must be
provided.

• The training set is used to build a classification model, which is subsequently applied to
the test set.

• Performance of a classification model is based on the counts of test records correctly


and incorrectly predicted by the model.

• These counts are shown in a table known as a confusion matrix.


Confusion matrix for a 2-class problem

• Each entry fij in this table denotes the number of records from class i predicted
to be of class j.
• For instance, f01 is the number of records from class 0 incorrectly predicted as
class 1.
• Based on the entries in the confusion matrix, the total number of correct predictions
made by the model is (f11 + f00) and the total number of incorrect predictions is (f10 +

f01).

• Summarizing this information with a single number can be done using a performance metric such as accuracy:

Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)

• Equivalently, the performance of a model can be expressed in terms of its error rate:

Error rate = (f10 + f01) / (f11 + f10 + f01 + f00)
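As a quick illustrative sketch, the accuracy and error rate follow directly from the four confusion-matrix entries (the counts below are hypothetical):

```python
# Hypothetical 2-class confusion matrix counts; names follow the f_ij
# convention in the text (row i = actual class, column j = predicted class).
f11, f10, f01, f00 = 40, 10, 5, 45

total = f11 + f10 + f01 + f00
accuracy = (f11 + f00) / total      # fraction of correct predictions
error_rate = (f10 + f01) / total    # fraction of incorrect predictions

print(accuracy)    # 0.85
print(error_rate)  # 0.15
```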
Decision Tree Induction

How a Decision Tree Works

• Let's classify the vertebrates into two categories: mammals and non-mammals.

• One approach is to pose a series of questions about the characteristics of the species.

• The first question we may ask is whether the species is cold- or warm-blooded.

• If it is cold-blooded, then it is definitely not a mammal. Otherwise, it is either a bird


or a mammal.
• Do the females of the species give birth to their young? Those that give birth are
definitely mammals.

• The series of questions and their possible answers can be organized in the form of
a decision tree.
The tree has three types of nodes:

• A root node that has no incoming edges and zero or more outgoing edges.

• Internal nodes, each of which has exactly one incoming edge and two or more
outgoing edges.

• Leaf or terminal nodes, each of which has exactly one incoming edge and no
outgoing edges.
A decision tree for the mammal classification problem
• In a decision tree, each leaf node is assigned a class label.

• The nonterminal nodes, which include the root and other internal nodes, contain
attribute test conditions to separate records that have different characteristics.
• Classifying a test record is straightforward once a decision tree has been
constructed.

• Starting from the root node, we apply the test condition to the record and follow
the appropriate branch based on the outcome of the test.
Classifying an unlabeled vertebrate
How to Build a Decision Tree

• In principle, there are exponentially many decision trees that can be constructed
from a given set of attributes.

Hunt’s Algorithm

• In Hunt’s algorithm, a decision tree is grown in a recursive fashion by partitioning


the training records into successively purer subsets.

• Let Dt be the set of training records that are associated with node t and y = {y1, y2, ..., yc} be the class labels.


• Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.

• Step 2: If Dt contains records that belong to more than one class, an attribute test condition is
selected to partition the records into smaller subsets.

• A child node is created for each outcome of the test condition and the records in Dt are
distributed to the children based on the outcomes.

• The algorithm is then recursively applied to each child node.
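The two steps above can be sketched in Python. This is an illustrative stand-in, not the textbook's exact pseudocode: the simplistic first-attribute split choice is an assumption, and a real learner would score candidate splits with an impurity measure.

```python
from collections import Counter

def hunts_algorithm(records, attributes):
    """Sketch of Hunt's algorithm. Each record is (attribute_dict, label)."""
    labels = [y for _, y in records]
    # Step 1: if all records share one class (or no attributes remain),
    # make a leaf labeled with the majority class.
    if len(set(labels)) == 1 or not attributes:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Step 2: select an attribute test condition and partition the records;
    # picking the first attribute is a placeholder for real split selection.
    attr = attributes[0]
    children = {}
    for value in set(x[attr] for x, _ in records):
        subset = [(x, y) for x, y in records if x[attr] == value]
        children[value] = hunts_algorithm(subset, attributes[1:])
    return {"test": attr, "children": children}

# Tiny loan-style training set (hypothetical records):
data = [({"HomeOwner": "Yes"}, "No"),
        ({"HomeOwner": "Yes"}, "No"),
        ({"HomeOwner": "No"}, "Yes")]
tree = hunts_algorithm(data, ["HomeOwner"])
```

Here the root tests HomeOwner, and both children are immediately pure, so the recursion stops after one split.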


• Consider the problem of predicting whether a loan applicant will repay her loan or default on her loan.

Training set
• The initial tree for the classification problem contains a single node with class
label Defaulted = No

• The tree, however, needs to be refined since the root node contains records from
both classes.
• The records are subsequently divided into smaller subsets based on the outcomes
of the Home Owner test condition
• From the training set, notice that all borrowers who are home owners
successfully repaid their loans.
• The left child of the root is therefore a leaf node labeled Defaulted = No
Hunt’s algorithm for inducing decision trees
Design Issues of Decision Tree Induction

• How should the training records be split? Each recursive step of the tree-growing
process must select an attribute test condition to divide the records into smaller subsets.

• How should the splitting procedure stop? A stopping condition is needed to terminate
the tree-growing process.

• A possible strategy is to continue expanding a node until either all the records belong to the same class or all the records have identical attribute values.
Methods for Expressing Attribute Test Conditions

• Binary Attributes: The test condition for a binary attribute generates two potential outcomes.

Test condition for binary attributes


• Nominal Attributes: Since a nominal attribute can have many values, its test condition can be expressed in two ways: as a multiway split or as a binary split.

Test conditions for nominal attributes


• Ordinal Attributes: Ordinal attributes can also produce binary or multiway splits.

• Ordinal attribute values can be grouped as long as the grouping does not violate the
order property of the attribute values.
Different ways of grouping ordinal attribute values
• Continuous Attributes: For continuous attributes, the test condition can be expressed as a comparison test (A < v) or (A ≥ v) with binary outcomes, or a range query with outcomes of the form vi ≤ A < vi+1, for i = 1, ..., k.

Test condition for continuous attributes
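The different kinds of test conditions can be sketched as simple predicates; the attribute names (`HomeOwner`, `MaritalStatus`, `AnnualIncome`) and the threshold are illustrative assumptions:

```python
def binary_test(record):
    """Binary attribute: exactly two outcomes."""
    return record["HomeOwner"] == "Yes"

def nominal_multiway(record):
    """Nominal attribute, multiway split: one branch per distinct value."""
    return record["MaritalStatus"]  # e.g. Single / Married / Divorced

def continuous_test(record, threshold=80_000):
    """Continuous attribute: a binary comparison test A <= v."""
    return record["AnnualIncome"] <= threshold

r = {"HomeOwner": "Yes", "MaritalStatus": "Single", "AnnualIncome": 60_000}
print(binary_test(r), nominal_multiway(r), continuous_test(r))
```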
Measures for Selecting the Best Split

• There are many measures that can be used to determine the best way to split the
records.

• These measures are defined in terms of the class distribution of the records before
and after splitting.
• Let p(i|t) denote the fraction of records belonging to class i at a given node t.

• In a two-class problem, the class distribution at any node can be written as (p0, p1), where p1 = 1 − p0.
To illustrate, consider the test conditions shown in Figure.

• The class distribution before splitting is (0.5, 0.5) because there are an equal number of
records from each class.

• If we split the data using the Gender attribute, then the class distributions of the child
nodes are (0.6, 0.4) and (0.4, 0.6), respectively.

• Although the classes are no longer evenly distributed, the child nodes still contain
records from both classes.
Multiway versus binary splits
• The measures developed for selecting the best split are often based on the degree of
impurity of the child nodes.

• The smaller the degree of impurity, the more skewed the class distribution. Examples of impurity measures include:

Entropy(t) = −Σi p(i|t) log2 p(i|t)
Gini(t) = 1 − Σi [p(i|t)]²
Classification error(t) = 1 − maxi [p(i|t)]

where the sums run over the c classes and 0 log2 0 = 0 in entropy calculations.
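A minimal sketch of the two most common impurity measures, entropy and the Gini index, applied to a node's class distribution:

```python
import math

def entropy(p):
    """Entropy of a class distribution p (fractions summing to 1)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)  # 0*log2(0) = 0

def gini(p):
    """Gini index of a class distribution."""
    return 1 - sum(pi * pi for pi in p)

print(entropy([0.5, 0.5]))  # 1.0 (maximum impurity for two classes)
print(entropy([1.0, 0.0]))  # 0.0 (pure node)
print(gini([0.5, 0.5]))     # 0.5
```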


• A decision tree is a flowchart-like tree structure, where each internal node
(nonleaf node) denotes a test on an attribute, each branch represents an
outcome of the test, and each leaf node (or terminal node) holds a class label.
• Attribute selection measures are also known as splitting rules because they determine how
the tuples at a given node are to be split.

• ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively (repeatedly) dichotomizes (divides) features into two or more groups at each step.
Three popular attribute selection measures:
• Information gain (ID3)
• Gain ratio (C4.5/J48)
• Gini index (CART)
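Information gain, the ID3 criterion, can be sketched as the parent's entropy minus the size-weighted entropy of the children; the class counts below are hypothetical:

```python
import math

def entropy_counts(counts):
    """Entropy of a node given its per-class record counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(parent_counts, child_counts_list):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy_counts(child)
                   for child in child_counts_list)
    return entropy_counts(parent_counts) - weighted

# Hypothetical split: a parent node with 10 records (5 vs 5) split into
# two children with class counts (4,1) and (1,4).
gain = information_gain([5, 5], [[4, 1], [1, 4]])
print(round(gain, 3))
```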
Algorithm for Decision Tree Induction

• A skeleton decision tree induction algorithm called TreeGrowth.

• The input to this algorithm consists of the training records E and the attribute set F.

• The algorithm works by recursively selecting the best attribute to split the data and
expanding the leaf nodes of the tree until the stopping criterion is met.
The details of this algorithm are explained below:

• The createNode() function creates a new node. A node in the decision tree has either a test condition or a class label.

• The find_best_split() function determines which attribute should be selected as the test condition for splitting the training records.
• The Classify() function determines the class label to be assigned to a leaf node. For
each leaf node t, let p(i|t) denote the fraction of training records from class i
associated with the node t.

• The stopping_cond() function is used to terminate the tree-growing process by testing whether all the records have either the same class label or the same attribute values.
Model Overfitting

• The errors committed by a classification model are generally divided into two
types: training errors and generalization errors.

• Training error is the number of misclassification errors committed on training records, whereas generalization error is the expected error of the model on test records.
• A good classification model must not only fit the training data well, it must also accurately classify test records.

• In other words, a good model must have low training error as well as low generalization
error.

• This is important because a model that fits the training data too well can have a poorer
generalization error. Such a situation is known as model overfitting.
Overfitting Example in Two-Dimensional Data

• Consider the two-dimensional data set.

• The data set contains data points that belong to two different classes, denoted as
class o and class +, respectively.
Training and test error rates
• Notice that the training and test error rates of the model are large when the size of
the tree is very small. This situation is known as model underfitting.

• However, once the tree becomes too large, its test error rate begins to increase
even though its training error rate continues to decrease. This phenomenon is
known as model overfitting.
Overfitting Due to Presence of Noise

• Consider the training and test sets for the mammal classification problem.

• Two of the ten training records are mislabeled: bats and whales are classified as
non-mammals instead of mammals.
Training set for classifying mammals
test set for classifying mammals
• A decision tree that perfectly fits the training data can still perform poorly: although the training error for the tree is zero, its error rate on the test set is 30%.

• Both humans and dolphins were misclassified as non-mammals.

• Spiny anteaters, on the other hand, represent an exceptional case.
• A simpler decision tree has a lower test error rate (10%) even though its training error rate is somewhat higher (20%).

• It is evident that the first decision tree has overfitted the training data, because there is a simpler model with a lower error rate on the test set.
Overfitting Due to Lack of Representative Samples

• Models that make their classification decisions based on a small number of training records are also susceptible to overfitting.

• Consider the five training records; all of these training records are labeled correctly.

• Although its training error is zero, its error rate on the test set is 30%.
An example training set for classifying mammals.
Decision tree
• Humans, elephants, and dolphins are misclassified because the decision tree classifies
all warm-blooded vertebrates that do not hibernate as non-mammals.

• The tree arrives at this classification decision because there is only one training record,
which is an eagle, with such characteristics.
Evaluating the Performance of a Classifier

• The estimated error helps the learning algorithm to do model selection.

• Once the model has been constructed, it can be applied to the test set to predict the
class labels.

• The accuracy or error rate computed from the test set can be used to compare the
relative performance of different classifiers.
Holdout Method

• In the holdout method, the original data with labeled examples is partitioned into the training
and the test sets.

• A classification model is then constructed from the training set and its performance is
evaluated on the test set.

• The proportion of data reserved for training and for testing is at the analyst's discretion (e.g., a 50-50 split, or two-thirds for training and one-third for testing).

• The accuracy of the classifier can be estimated based on the accuracy of the model on the
test set.
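A minimal holdout partition might look like the following sketch; the function name, the two-thirds split, and the fixed seed are illustrative choices:

```python
import random

def holdout_split(records, train_fraction=2/3, seed=0):
    """Simple holdout: shuffle the labeled examples, then cut once.
    train_fraction=2/3 matches the common two-thirds/one-third split."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(30))
train, test = holdout_split(data)
print(len(train), len(test))  # 20 10
```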
• The holdout method has several well-known limitations.

• First, fewer labeled examples are available for training because some of the records
are used for testing.

• As a result, the model may not be as good as when all the labeled examples are used
for training.
• Second, the model may be highly dependent on the composition of the training

and test sets.


Random Subsampling

• The holdout method can be repeated several times to improve the estimation of a
classifier’s performance. This approach is known as random subsampling.

• Let acci be the model accuracy during the ith iteration. The overall accuracy is given by

accsub = (acc1 + acc2 + ... + acck) / k

• Random subsampling still has problems because it does not utilize as much data as possible for training.

• It also has no control over the number of times each record is used for testing and

training.
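The overall accuracy under random subsampling is simply the mean of the per-iteration accuracies (the three accuracy values below are hypothetical):

```python
def overall_accuracy(accs):
    """Random subsampling estimate: the mean of acc_1 .. acc_k."""
    return sum(accs) / len(accs)

# Hypothetical accuracies from three repeated holdout runs:
print(overall_accuracy([0.80, 0.85, 0.75]))
```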
Cross-Validation

• In this approach, each record is used the same number of times for training and exactly once for testing.

• To illustrate this method, suppose we partition the data into two equal-sized subsets.

• First, we choose one of the subsets for training and the other for testing.
• We then swap the roles of the subsets so that the previous training set becomes the
test set and vice versa. This approach is called a twofold cross-validation.

• The total error is obtained by summing up the errors for both runs.
• The k-fold cross-validation method generalizes this approach by segmenting the data
into k equal-sized partitions.

• During each run, one of the partitions is chosen for testing, while the rest of them are
used for training.

• This procedure is repeated k times so that each partition is used for testing exactly
once.

• Again, the total error is found by summing up the errors for all k runs.
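A sketch of how k-fold partitioning guarantees each record is tested exactly once; the striding scheme here is just one simple way to form near-equal folds:

```python
def kfold_indices(n, k):
    """Partition indices 0..n-1 into k near-equal folds; every index
    lands in exactly one fold, so each record is tested exactly once."""
    return [list(range(i, n, k)) for i in range(k)]

for fold in kfold_indices(10, 5):
    test_idx = set(fold)
    train_idx = [i for i in range(10) if i not in test_idx]
    # fit on train_idx, evaluate on test_idx, and accumulate the error
```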
Bootstrap

• The methods presented so far assume that the training records are sampled without
replacement.

• As a result, there are no duplicate records in the training and test sets.

• In the bootstrap approach, the training records are sampled with replacement; i.e., a
record already chosen for training is put back into the original dataset, so that it is
equally likely to be redrawn.
• If the original data has N records, it can be shown that, on average, a bootstrap
sample of size N contains about 63.2% of the records in the original data.

• Records that are not included in the bootstrap sample become part of the test set.
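The 63.2% figure can be checked empirically with a quick simulation; the sample size and seed below are arbitrary:

```python
import random

def bootstrap_unique_fraction(n, seed=0):
    """Draw a bootstrap sample of size n with replacement and return the
    fraction of distinct original records it contains; for large n this
    approaches 1 - 1/e, i.e. about 0.632."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n

frac = bootstrap_unique_fraction(100_000)
print(round(frac, 2))  # close to 0.63
```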
• One of the more widely used approaches is the .632 bootstrap, which computes the overall accuracy by combining the accuracy of each bootstrap sample (εi) with the accuracy computed from a training set that contains all the labeled examples in the original data (accs):

accboot = (1/b) Σi (0.632 × εi + 0.368 × accs)
Bayes Theorem

• Consider a football game between two rival teams: Team 0 and Team 1. Suppose
Team 0 wins 65% of the time and Team 1 wins the remaining matches. Among the
games won by Team 0, only 30% of them come from playing on Team 1’s football
field. On the other hand, 75% of the victories for Team 1 are obtained while playing
at home. If Team 1 is to host the next match between the two teams, which team
will most likely emerge as the winner?
• This question can be answered by using the well-known Bayes theorem.

• Let X and Y be a pair of random variables.

• Their joint probability, P(X =x, Y = y), refers to the probability that variable X will
take on the value x and variable Y will take on the value y.

• The conditional probability P(Y = y|X = x) refers to the probability that the variable Y will take on the value y, given that the variable X is observed to have the value x.
• The joint and conditional probabilities for X and Y are related in the following way:

P(X, Y) = P(Y|X) × P(X) = P(X|Y) × P(Y)

• Rearranging the last two expressions leads to the formula known as the Bayes theorem:

P(Y|X) = P(X|Y) P(Y) / P(X)
• Our objective is to compute P(Y = 1|X = 1), which is the conditional probability
that Team 1 wins the next match it will be hosting, and compares it against P(Y =
0|X = 1).
• Using the Bayes theorem:

P(Y = 1|X = 1) = P(X = 1|Y = 1) P(Y = 1) / P(X = 1) = (0.75 × 0.35) / (0.75 × 0.35 + 0.30 × 0.65) = 0.5738

• Furthermore, P(Y = 0|X = 1) = 1 − P(Y = 1|X = 1) = 0.4262.

• Since P(Y = 1|X = 1) > P(Y = 0|X = 1), Team 1 has a better chance
than Team 0 of winning the next match.
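The football example can be verified numerically:

```python
# Y = winning team (0 or 1), X = 1 if the game is on Team 1's field.
p_y1 = 0.35           # Team 1 wins 35% of all games
p_y0 = 0.65           # Team 0 wins 65%
p_x1_given_y1 = 0.75  # 75% of Team 1's wins come at home
p_x1_given_y0 = 0.30  # 30% of Team 0's wins come on Team 1's field

# Law of total probability, then Bayes theorem:
p_x1 = p_x1_given_y1 * p_y1 + p_x1_given_y0 * p_y0
p_y1_given_x1 = p_x1_given_y1 * p_y1 / p_x1

print(round(p_y1_given_x1, 4))  # 0.5738
```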
Using the Bayes Theorem for Classification

• Let X denote the attribute set and Y denote the class variable.

• If the class variable has a non-deterministic relationship with the attributes, then we can model their relationship probabilistically using P(Y|X).

• This conditional probability is also known as the posterior probability for Y , as opposed
to its prior probability, P(Y ).
• During the training phase, we need to learn the posterior probabilities P(Y |X) for every
combination of X and Y based on information gathered from the training data.

• By knowing these probabilities, a test record X can be classified by finding the class Y that maximizes the posterior probability, P(Y|X).

• To illustrate this approach, consider the task of predicting whether a loan borrower will
default on their payments.

• Loan borrowers who defaulted on their payments are classified as Yes, while those who
repaid their loans are classified as No.
• Suppose we are given a test record : X =(Home Owner = No, Marital Status =
Married, Annual Income = $120K).

• To classify the record, we need to compute the posterior probabilities P(Yes|X) and
P(No|X) based on information available in the training data.

• If P(Yes|X) > P(No|X), then the record is classified as Yes; otherwise, it is classified as
No.
Naïve Bayes Classifier

• A naïve Bayes classifier estimates the class-conditional probability by assuming that the attributes are conditionally independent, given the class label y.

The conditional independence assumption can be formally stated as follows:

P(X|Y = y) = Πi=1..d P(Xi|Y = y),

where each attribute set X = {X1, X2, ..., Xd} consists of d attributes.

How a Naïve Bayes Classifier Works

• With the conditional independence assumption, instead of computing the class-conditional probability for every combination of X, we only have to estimate the conditional probability of each Xi, given Y.
• To classify a test record, the naïve Bayes classifier computes the posterior probability for each class Y:

P(Y|X) = P(Y) Πi P(Xi|Y) / P(X)

• Since P(X) is fixed for every Y, it is sufficient to choose the class that maximizes the numerator term, P(Y) Πi P(Xi|Y).
Estimating Conditional Probabilities for Categorical Attributes

• For example, three out of the seven people who repaid their loans also own a home.

• As a result, the conditional probability for P(Home Owner=Yes|No) is equal to 3/7.


Similarly, the conditional probability for defaulted borrowers who are single is given by
P(Marital Status = Single|Yes)=2/3.
Estimating Conditional Probabilities for Continuous Attributes

• A Gaussian distribution is chosen to represent the class-conditional probability for


continuous attributes.
• The distribution is characterized by two parameters, its mean µ and variance σ². For each class yj, the class-conditional probability for attribute Xi is

P(Xi = xi|Y = yj) = 1/√(2πσij²) × exp(−(xi − µij)² / (2σij²))

• The parameter µij can be estimated from the sample mean of Xi over all training records that belong to the class yj. Similarly, σij² can be estimated from the sample variance (s²) of those training records.
• For example, consider the annual income attribute for the No class.
• Given a test record with taxable income equal to $120K, we can compute its class-
conditional probability as follows:
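A sketch of that computation; the mean 110 ($110K) and sample variance 2975 are the No-class values from the standard version of this loan example:

```python
import math

def gaussian_density(x, mu, var):
    """Gaussian class-conditional density used by naive Bayes
    for a continuous attribute."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# No-class annual income: sample mean 110, sample variance 2975.
p = gaussian_density(120, mu=110, var=2975)
print(round(p, 4))  # 0.0072
```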
Example of the Naïve Bayes Classifier

• We can compute the class conditional probability for each categorical attribute, along with
the sample mean and variance for the continuous attribute.

• To predict the class label of a test record X = (Home Owner=No, Marital Status =
Married, Income = $120K), we need to compute the posterior probabilities P(No|X) and
P(Yes|X).
• Since there are three records that belong to the class Yes and seven records that belong to
the class No, P(Yes)=0.3 and P(No)=0.7. the class-conditional probabilities can be
computed as follows:
• Putting them together, the posterior probability for class No is P(No|X) = α × 7/10 ×
0.0024 = 0.0016α, where α = 1/P(X) is a constant term.

• Using a similar approach, we can show that the posterior probability for class Yes is
zero because its class-conditional probability is zero.

• Since P(No|X) > P(Yes|X), the record is classified as No.
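The full posterior comparison can be reproduced numerically; the hard-coded conditional probabilities below follow the standard loan training set used with this example (7 No records, 3 Yes records):

```python
# Test record X = (HomeOwner=No, MaritalStatus=Married, Income=$120K).
p_no = 7 / 10
p_x_given_no = (4/7) * (4/7) * 0.0072   # HomeOwner=No, Married, Gaussian income

p_yes = 3 / 10
p_x_given_yes = 1.0 * 0.0 * 1.2e-9      # P(Married|Yes) = 0 zeroes the product

posterior_no = p_no * p_x_given_no      # ~0.0016 * alpha, with alpha = 1 here
posterior_yes = p_yes * p_x_given_yes   # exactly 0

print(posterior_no > posterior_yes)  # True -> classify the record as No
```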
