
Unit-3

Attribute Selection Measures


 An attribute selection measure is a heuristic for
selecting the splitting criterion that “best” separates a
given data partition, D, of class-labeled training tuples
into individual classes.
 Attribute selection measures are also known as
splitting rules because they determine how the
tuples at a given node are to be split.
 The attribute selection measure provides a ranking for
each attribute describing the given training tuples.
The attribute having the best score for the measure is
chosen as the splitting attribute for the given tuples
Attribute Selection Measures
 Three popular attribute selection measures are information gain, gain ratio, and the Gini index.
1) Information gain: The attribute with the highest information gain minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or "impurity" in these partitions.
Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.
Let node N represent or hold the tuples of partition D. The
attribute with highest information gain is chosen as the
splitting attribute for node N.
 The expected information needed to classify a tuple in D is given by:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i and is estimated by |C_{i,D}|/|D|.
 A log function to the base 2 is used, because the
information is encoded in bits.
 Info(D) is just the average amount of information needed
to identify the class label of a tuple in D.
 Info(D) is also known as the entropy of D.
 Suppose we were to partition the tuples in D on some attribute A
having v distinct values, {a1, a2,..., av }, as observed from the training
data
 Attribute A can be used to split D into v partitions or subsets, {D1,
D2,..., Dv },
 These partitions would correspond to the branches grown from node
N.
 The expected information required to classify a tuple from D based on the partitioning by A is given by:

Info_A(D) = \sum_{j=1}^{v} (|D_j| / |D|) \times Info(D_j)

The term |D_j|/|D| acts as the weight of the jth partition.
 Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A):

Gain(A) = Info(D) - Info_A(D)

 Example (the standard 14-tuple training set with 9 tuples of class "yes" and 5 of class "no", counts assumed to be consistent with the values quoted below). The expected information needed to classify a tuple in D:

Info(D) = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) = 0.940 bits

 Expected information needed to classify a tuple in D if the tuples are partitioned according to age is:

Info_age(D) = (5/14) Info(D_1) + (4/14) Info(D_2) + (5/14) Info(D_3) = 0.694 bits

where the three age partitions are assumed to contain (2 yes, 3 no), (4 yes, 0 no), and (3 yes, 2 no) tuples, respectively.

 Hence, the gain in information from such a partitioning would be Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits.
 Gain(income) = 0.029 bits, Gain(student) = 0.151 bits,
and Gain(credit rating) = 0.048 bits. Because age has
the highest information gain among the attributes, it
is selected as the splitting attribute.
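To make the computation concrete, here is a minimal Python sketch of the information-gain calculation above. The class counts (9 "yes" / 5 "no") and the per-branch counts for age are assumptions consistent with the quoted values of 0.940, 0.694 and 0.246 bits; all function names are illustrative.

import math

def entropy(class_counts):
    # Info(D): expected information (in bits) needed to classify a tuple in D.
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

def info_after_split(partitions):
    # Info_A(D): weighted sum of the entropy of each partition D_j.
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * entropy(p) for p in partitions)

info_D = entropy([9, 5])                      # 9 "yes", 5 "no"  -> ~0.940 bits
age_partitions = [[2, 3], [4, 0], [3, 2]]     # [yes, no] counts per age branch (assumed)
info_age = info_after_split(age_partitions)   # ~0.694 bits
gain_age = info_D - info_age                  # ~0.246 bits
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))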
Attribute Selection Measures
2) Gain ratio: The information gain measure is biased toward tests with many outcomes. That is, it prefers to select attributes having a large number of values.
 C4.5, a successor of ID3, uses an extension to information
gain known as gain ratio, which attempts to overcome this
bias.
 It applies a kind of normalization to information gain using a "split information" value defined analogously with Info(D) as:

SplitInfo_A(D) = -\sum_{j=1}^{v} (|D_j| / |D|) \log_2(|D_j| / |D|)

 This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A.
 The gain ratio is defined as

GainRatio(A) = Gain(A) / SplitInfo_A(D)

 The attribute with the maximum gain ratio is selected as the splitting attribute.
 To compute the gain ratio of income: assuming income splits the 14 training tuples into partitions of size 4 (low), 6 (medium), and 4 (high), consistent with the result below,

SplitInfo_income(D) = -(4/14) \log_2(4/14) - (6/14) \log_2(6/14) - (4/14) \log_2(4/14) = 1.557

 Gain(income) = 0.029.
 Therefore, GainRatio(income) = 0.029/1.557 = 0.019.
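A short Python sketch of the gain-ratio normalization, assuming income partitions the 14 tuples into subsets of size 4, 6 and 4 (an assumption consistent with SplitInfo = 1.557):

import math

def split_info(partition_sizes):
    # SplitInfo_A(D): potential information generated by the split itself.
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total)
                for s in partition_sizes if s > 0)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

print(round(split_info([4, 6, 4]), 3))          # ~1.557
print(round(gain_ratio(0.029, [4, 6, 4]), 3))   # ~0.019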
Attribute Selection Measures
3) Gini index: The Gini index is used in CART (Classification and Regression Trees).
 Gini index measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 - \sum_{i=1}^{m} p_i^2

where p_i is the probability that a tuple in D belongs to class C_i and is estimated by |C_{i,D}|/|D|. The sum is computed over m classes.
 Gini index considers a binary split for each attribute.
 When considering a binary split, we compute a weighted
sum of the impurity of each resulting partition.
 E.g., if a binary split on A partitions D into D1 and D2, the Gini index of D given that partitioning is:

Gini_A(D) = (|D_1| / |D|) Gini(D_1) + (|D_2| / |D|) Gini(D_2)
 The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute A is:

\Delta Gini(A) = Gini(D) - Gini_A(D)

 The attribute that has the minimum Gini index (equivalently, the maximum reduction in impurity) is selected as the splitting attribute.
 Using the Gini index to compute the impurity of D (again 9 "yes" and 5 "no" tuples):

Gini(D) = 1 − (9/14)^2 − (5/14)^2 = 0.459

 Consider the subset income ∈ {low, medium}. This binary split partitions D into D1 (the 10 tuples with income in {low, medium}) and D2 (the remaining 4 tuples with income = high); the partition sizes shown are assumptions consistent with the quoted results. The Gini index value computed based on this partitioning is:

Gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) = 0.443

 Similarly, the Gini index values for splits on the remaining subsets are 0.458 (for the subsets {low, high} and {medium}) and 0.450 (for the subsets {medium, high} and {low}). Therefore, the best binary split for attribute income is on {low, medium} (or {high}) because it minimizes the Gini index.
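The Gini calculations can be sketched in Python as follows; the per-subset class counts (7 "yes" / 3 "no" for income in {low, medium} and 2 "yes" / 2 "no" for income = high) are assumptions chosen to be consistent with the quoted results:

def gini(class_counts):
    # Gini(D) = 1 - sum(p_i^2) for a partition with the given class counts.
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_binary_split(counts_d1, counts_d2):
    # Weighted Gini index of D for a binary split into D1 and D2.
    n1, n2 = sum(counts_d1), sum(counts_d2)
    n = n1 + n2
    return (n1 / n) * gini(counts_d1) + (n2 / n) * gini(counts_d2)

gini_D = gini([9, 5])                                    # ~0.459
gini_low_med = gini_binary_split([7, 3], [2, 2])         # ~0.443
print(round(gini_D, 3), round(gini_low_med, 3), round(gini_D - gini_low_med, 3))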
ID3 Algorithm
 ID3 stands for Iterative Dichotomiser 3
 J. Ross Quinlan, a researcher in machine learning,
developed a decision tree algorithm known as ID3
(Iterative Dichotomiser)
 It uses a top-down greedy approach to build a
decision tree
 This algorithm uses information gain to decide which attribute should be used to classify the current subset of the data. At each level of the tree, information gain is calculated recursively for the remaining data and attributes.
ID3 Algorithm
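The slide presents the ID3 pseudocode as a figure. Below is a minimal Python sketch of the same top-down, greedy, information-gain-driven procedure; the data format (each row a dict of attribute values) and all names are illustrative assumptions, not taken from the slides.

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    return -sum((c / len(labels)) * math.log2(c / len(labels))
                for c in counts.values())

def info_gain(rows, labels, attr):
    # Gain(attr) = entropy before the split minus weighted entropy after it.
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        remainder += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    # Returns the tree as nested dicts: {attribute: {value: subtree}}; leaves are class labels.
    if len(set(labels)) == 1:                 # all tuples in one class -> leaf
        return labels[0]
    if not attributes:                        # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in {r[best] for r in rows}:     # grow one branch per observed value
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attributes if a != best])
    return tree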
ID3 Algorithm Example
Q : Create a Decision tree for the following
training data set using ID3 Algorithm.
Pruning in Decision Tree
 Pruning is a data compression technique in ML and search algorithms that reduces the size of decision trees by removing sections of the tree that are non-critical or redundant for classifying instances.
 There are two common approaches to tree pruning:
prepruning and postpruning.
 In the prepruning approach, a tree is “pruned” by
halting its construction early (e.g., by deciding not to
further split or partition the subset of training tuples at
a given node). Upon halting, the node becomes a leaf.
The leaf may hold the most frequent class among the
subset tuples
Pruning in Decision Tree

(Figure: an unpruned decision tree and the corresponding pruned tree.)
 The second and more common approach is postpruning,
which removes subtrees from a “fully grown” tree. A
subtree at a given node is pruned by removing its
branches and replacing it with a leaf. The leaf is labeled
with the most frequent class among the subtree being
replaced.
 E.g., notice the subtree at node "A3?" in the unpruned
tree of previous Fig. Suppose that the most common class
within this subtree is “class B.” In the pruned version of
the tree, the subtree in question is pruned by replacing it
with the leaf “class B.”
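As an illustration, here is a hedged Python sketch of one common postpruning strategy (reduced-error pruning, which the slides do not name explicitly). It works on a tree in the nested-dict format used in the ID3 sketch above and, for brevity, takes the replacement class from the hold-out tuples that reach each node rather than from the training tuples.

from collections import Counter

def classify(tree, row, default):
    # Walk the nested-dict tree; leaves are class labels.
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(row.get(attr), default)
    return tree

def errors(tree, rows, labels, default):
    return sum(classify(tree, r, default) != y for r, y in zip(rows, labels))

def postprune(tree, rows, labels, default):
    # Bottom-up: prune each branch first, then try replacing this subtree by a leaf.
    if not isinstance(tree, dict) or not rows:
        return tree
    attr = next(iter(tree))
    for value, subtree in tree[attr].items():
        idx = [i for i, r in enumerate(rows) if r.get(attr) == value]
        tree[attr][value] = postprune(subtree,
                                      [rows[i] for i in idx],
                                      [labels[i] for i in idx],
                                      default)
    leaf = Counter(labels).most_common(1)[0][0]   # most frequent class at this node
    if errors(leaf, rows, labels, default) <= errors(tree, rows, labels, default):
        return leaf                               # pruning does not hurt hold-out accuracy
    return tree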
Inductive Inference with Decision
Trees
 Describing the inductive bias of ID3 consists of describing
the basis by which it chooses one of the consistent
hypotheses over the others.
 Which of these decision trees does ID3 choose?
 ID3 search strategy
(a) selects in favor of shorter trees over longer ones, and
(b) selects trees that place the attributes with highest
information gain closest to the root.
It is difficult to characterize precisely the inductive bias
exhibited by ID3. However, we can approximately
characterize its bias as a preference for short decision trees
over complex trees
Issues in Decision Tree
1) Avoiding Overfitting the Data
2) Incorporating Continuous-Valued Attributes
3) Alternative Measures for Selecting Attributes
4) Handling Training Examples with Missing Attribute
Values
5) Handling Attributes with Differing Costs
Over-fitting & Under-fitting in
decision trees
 When a model performs very well for training data but has
poor performance with test data (new data), it is known as
over-fitting. In this case, the machine learning model
learns the details and noise in the training data such that it
negatively affects the performance of the model on test
data. Over-fitting can happen due to low bias and high
variance.
 When a model has not learned the patterns in the training
data well and is unable to generalize well on the new data,
it is known as under-fitting. An under-fit model has poor
performance on the training data and will result in
unreliable predictions. Under-fitting occurs due to high
bias and low variance.
Instance Based Learning
 Classification methods discussed so far (decision tree induction, Bayesian classification, support vector machines) are all examples of eager learners.
 Eager learners, when given a set of training tuples, will
construct a generalization (i.e., classification) model before
receiving new (e.g., test) tuples to classify. We can think of the
learned model as being ready and eager to classify previously
unseen tuples.
 Imagine a contrasting lazy approach, in which the learner instead waits until the last minute before doing any model construction to classify a given test tuple; i.e., when given a training tuple, a lazy learner simply stores it (or does only minor processing) and waits until it is given a test tuple.
Instance Based Learning
 Only when it sees the test tuple does it perform
generalization to classify the tuple based on its
similarity to the stored training tuples.
 Unlike eager learning methods, lazy learners do
less work when a training tuple is presented and
more work when making a classification or
numeric prediction. Because lazy learners store
the training tuples or “instances,” they are also
referred to as instance-based learners
K-Nearest Neighbour Learning
 The k-nearest-neighbor method was first described in the early 1950s.
 The method is labor intensive when given large
training sets, and did not gain popularity until the
1960s when increased computing power became
available.
 It has since been widely used in areas such as pattern recognition and data mining.
 K-Nearest Neighbour is one of the simplest machine learning algorithms based on the supervised learning technique.
K-Nearest Neighbour Learning
 It is also called a lazy learner algorithm because it simply stores the training tuples and defers all computation until it is given a test tuple to classify.
Example
(Figures: worked k-NN classification examples.)
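A minimal k-nearest-neighbour classifier sketch in Python (Euclidean distance and majority vote; the data and all names are illustrative):

import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, train_points, train_labels, k=3):
    # Lazy learner: the training tuples are simply stored; all work happens at query time.
    neighbours = sorted(zip(train_points, train_labels),
                        key=lambda pair: euclidean(query, pair[0]))[:k]
    # Majority vote among the k nearest stored tuples.
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

X = [(1.0, 1.0), (1.2, 0.8), (6.0, 6.0), (5.8, 6.2)]   # made-up 2-D training points
y = ["A", "A", "B", "B"]
print(knn_classify((1.1, 0.9), X, y, k=3))             # -> "A"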
Locally Weighted Regression
 Linear Regression cannot be used for making predictions when
there exists a non-linear relationship between X and Y. In such
cases, locally weighted linear regression is used.
 Locally weighted linear regression is a supervised learning
algorithm.
 It is a non-parametric algorithm.
 Model-based methods, such as neural networks use the data to
build a parameterized model.
 After training, the model is used for predictions and the data is
generally discarded.
 In contrast, "memory-based" methods are non-parametric approaches that explicitly retain the training data, and use it each time a prediction needs to be made.
 LWR is a memory-based method that performs a regression around a point of interest using only training data that are "local" to that point.
Locally Weighted Regression
 The model does not learn a fixed set of parameters as is done in
ordinary linear regression.
 Rather parameters θ are computed individually for each query
point x.
 While computing θ , a higher “preference” is given to the points
in the training set lying in the vicinity of x than the points
lying far away from x .
 The cost function is:

J(\theta) = \sum_{i} w^{(i)} (y^{(i)} - \theta^T x^{(i)})^2

where w^{(i)} is a non-negative "weight" associated with training point x^{(i)}.
For x^{(i)} lying closer to the query point x, the value of w^{(i)} is large, while for x^{(i)} lying far away from x, the value of w^{(i)} is small.
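A hedged NumPy sketch of locally weighted linear regression for a single query point. The Gaussian kernel and the bandwidth tau are assumptions (the slides only require that nearby training points receive larger weights); theta is re-computed for every query by weighted least squares.

import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    # Weights: close to 1 for training points near the query, close to 0 far away.
    w = np.exp(-((X - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(w)
    X_aug = np.column_stack([np.ones(len(X)), X])        # add an intercept column
    # Weighted normal equations: theta = (X^T W X)^-1 X^T W y, solved per query point.
    theta = np.linalg.pinv(X_aug.T @ W @ X_aug) @ X_aug.T @ W @ y
    return np.array([1.0, x_query]) @ theta

X = np.linspace(0, 6, 50)                 # 1-D inputs with a non-linear relationship to y
y = np.sin(X) + 0.1 * np.random.randn(50)
print(lwr_predict(3.0, X, y))             # locally fitted prediction around x = 3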
Radial Basis Function Networks
 Radial basis function (RBF) networks are a commonly used
type of artificial neural network for function approximation
problems.
 Radial basis function networks are distinguished from
other neural networks due to their universal approximation
and faster learning speed.
 An RBF network is a type of feed forward neural network
composed of three layers, namely the input layer, the
hidden layer and the output layer.
 The computation that is performed inside the hidden layer
is very different from most neural networks, and this is
where the power of the RBF network comes from.
 RBF Neural networks are conceptually similar to K-Nearest
Neighbor models, though the implementation of both
models is starkly different.
Input Vector
 The input vector is the n-dimensional vector that you are trying
to classify. The entire input vector is shown to each of the RBF
neurons.
RBF Neurons
 Each RBF neuron stores a “prototype” vector which is just one of
the vectors from the training set.
 Each RBF neuron compares the input vector to its prototype, and outputs a value between 0 and 1 which is a measure of similarity.
 If the input is equal to the prototype, then the output of that RBF neuron will be 1.
 As the distance between the input and prototype grows, the
response falls off exponentially towards 0.
 The shape of the RBF neuron’s response is a bell curve, as
illustrated in the network architecture diagram.
 The neuron’s response value is also called its “activation” value.
Output Nodes
 The output of the network consists of a set of nodes,
one per category that we are trying to classify.
 Each output node computes a sort of score for the
associated category.
 Typically, a classification decision is made by assigning
the input to the category with the highest score.
 The score is computed by taking a weighted sum of the
activation values from every RBF neuron.
 The output node will typically give a positive weight to
the RBF neurons that belong to its category, and a
negative weight to the others.
RBF Neuron Activation Function
 Each RBF neuron computes a measure of the similarity
between the input and its prototype vector (taken
from the training set).
 Input vectors which are more similar to the prototype
return a result closer to 1.
 There are different possible choices of similarity
functions, but the most popular is based on the
Gaussian.
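A minimal NumPy sketch of an RBF network forward pass with Gaussian activations. The prototype vectors, the width parameter beta, and the output weights are illustrative placeholders; in practice they would be learned from the training data.

import numpy as np

def rbf_activations(x, prototypes, beta=1.0):
    # Gaussian similarity to each prototype: 1 at the prototype, falling towards 0 with distance.
    dists_sq = np.sum((prototypes - x) ** 2, axis=1)
    return np.exp(-beta * dists_sq)

def rbf_forward(x, prototypes, output_weights, beta=1.0):
    # Hidden layer: RBF activations; output layer: one weighted-sum score per category.
    phi = rbf_activations(x, prototypes, beta)
    scores = output_weights @ phi
    return int(np.argmax(scores))          # predict the category with the highest score

prototypes = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])   # 3 stored training vectors
output_weights = np.array([[ 1.0,  1.0, -1.0],                # category 0 favours prototypes 1-2
                           [-1.0, -1.0,  1.0]])               # category 1 favours prototype 3
print(rbf_forward(np.array([0.2, 0.1]), prototypes, output_weights))   # -> 0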
Case Based Learning
 In case-based reasoning, the training examples,
the cases, are stored and accessed to solve a new problem.
 To get a prediction for a new example, those cases that are
similar, or close to, the new example are used to predict the
value of the target features of the new example.
 This is at one extreme of the learning problem where,
unlike decision trees and neural networks, relatively little
work must be done offline, and virtually all of the work is
performed at query time.
Case Based Learning
 Case-based reasoning is used for classification and for
regression
 If the cases are simple, one algorithm that works well is to
use the k-nearest neighbors for some given number k.
 Given a new example, the k training examples whose input features are closest to that example are used to predict the target value for the new example.
 The prediction could be the mode, average, or some
interpolation between the prediction of these k training
examples, weighting closer examples more than distant
examples.
How CBR works?
 When a new case arises to classify, a Case-based Reasoner (CBR) will first check if an identical training case exists.
 If one is found, then the accompanying solution to that case is
returned.
 If no identical case is found, then the CBR will search for training
cases having components that are similar to those of the new case.
 Conceptually, these training cases may be considered as neighbours of
the new case.
 If cases are represented as graphs, this involves searching for
subgraphs that are similar to subgraphs within the new case.
 The CBR tries to combine the solutions of the neighbouring training
cases to propose a solution for the new case.
 If incompatibilities arise with the individual solutions, then backtracking to search for other solutions may be necessary.
 The CBR may employ background knowledge and problem-solving
strategies to propose a feasible solution.
