
Unit-3

Attribute Selection Measures


 An attribute selection measure is a heuristic for
selecting the splitting criterion that “best” separates a
given data partition, D, of class-labeled training tuples
into individual classes.
 Attribute selection measures are also known as
splitting rules because they determine how the
tuples at a given node are to be split.
 The attribute selection measure provides a ranking for
each attribute describing the given training tuples.
The attribute having the best score for the measure is
chosen as the splitting attribute for the given tuples
Attribute Selection Measures
 Three popular attribute selection measures are information gain, gain ratio, and the Gini index.
1) Information gain: The attribute with the highest information gain minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or "impurity" in these partitions.
Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.
Let node N represent or hold the tuples of partition D. The
attribute with highest information gain is chosen as the
splitting attribute for node N.
 The expected information needed to classify a tuple in D is given by:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i and is estimated by |C_{i,D}|/|D|.
 A log function to the base 2 is used, because the
information is encoded in bits.
 Info(D) is just the average amount of information needed
to identify the class label of a tuple in D.
 Info(D) is also known as the entropy of D.
 Suppose we were to partition the tuples in D on some attribute A
having v distinct values, {a1, a2,..., av }, as observed from the training
data
 Attribute A can be used to split D into v partitions or subsets, {D1,
D2,..., Dv },
 These partitions would correspond to the branches grown from node
N.
 The expected information required to classify a tuple from D based on the partitioning by A is given by:

Info_A(D) = \sum_{j=1}^{v} (|D_j| / |D|) \times Info(D_j)

The term |D_j|/|D| acts as the weight of the jth partition.
 Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A):

Gain(A) = Info(D) - Info_A(D)

 Example (the standard 14-tuple training set with 9 tuples of class "yes" and 5 of class "no", counts assumed to be consistent with the values quoted below). The expected information needed to classify a tuple in D:

Info(D) = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) = 0.940 bits

 Expected information needed to classify a tuple in D if the tuples are partitioned according to age is:

Info_age(D) = (5/14) Info(D_1) + (4/14) Info(D_2) + (5/14) Info(D_3) = 0.694 bits

where the three age partitions are assumed to contain (2 yes, 3 no), (4 yes, 0 no), and (3 yes, 2 no) tuples, respectively.

 Hence, the gain in information from such a partitioning would be Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits.
 Gain(income) = 0.029 bits, Gain(student) = 0.151 bits,
and Gain(credit rating) = 0.048 bits. Because age has
the highest information gain among the attributes, it
is selected as the splitting attribute.
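To make the computation concrete, here is a minimal Python sketch of the information-gain calculation above. The class counts (9 "yes" / 5 "no") and the per-branch counts for age are assumptions consistent with the quoted values of 0.940, 0.694 and 0.246 bits; all function names are illustrative.

import math

def entropy(class_counts):
    # Info(D): expected information (in bits) needed to classify a tuple in D.
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

def info_after_split(partitions):
    # Info_A(D): weighted sum of the entropy of each partition D_j.
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * entropy(p) for p in partitions)

info_D = entropy([9, 5])                      # 9 "yes", 5 "no"  -> ~0.940 bits
age_partitions = [[2, 3], [4, 0], [3, 2]]     # [yes, no] counts per age branch (assumed)
info_age = info_after_split(age_partitions)   # ~0.694 bits
gain_age = info_D - info_age                  # ~0.246 bits
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))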
Attribute Selection Measures
2) Gain ratio: The information gain measure is biased toward tests with many outcomes. That is, it prefers to select attributes having a large number of values.
 C4.5, a successor of ID3, uses an extension to information
gain known as gain ratio, which attempts to overcome this
bias.
 It applies a kind of normalization to information gain using a "split information" value defined analogously with Info(D) as:

SplitInfo_A(D) = -\sum_{j=1}^{v} (|D_j| / |D|) \log_2(|D_j| / |D|)

 This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A.
 The gain ratio is defined as

GainRatio(A) = Gain(A) / SplitInfo_A(D)

 The attribute with the maximum gain ratio is selected as the splitting attribute.
 To compute the gain ratio of income: assuming income splits the 14 training tuples into partitions of size 4 (low), 6 (medium), and 4 (high), consistent with the result below,

SplitInfo_income(D) = -(4/14) \log_2(4/14) - (6/14) \log_2(6/14) - (4/14) \log_2(4/14) = 1.557

 Gain(income) = 0.029.
 Therefore, GainRatio(income) = 0.029/1.557 = 0.019.
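A short Python sketch of the gain-ratio normalization, assuming income partitions the 14 tuples into subsets of size 4, 6 and 4 (an assumption consistent with SplitInfo = 1.557):

import math

def split_info(partition_sizes):
    # SplitInfo_A(D): potential information generated by the split itself.
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total)
                for s in partition_sizes if s > 0)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

print(round(split_info([4, 6, 4]), 3))          # ~1.557
print(round(gain_ratio(0.029, [4, 6, 4]), 3))   # ~0.019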
Attribute Selection Measures
3) Gini index: The Gini index is used in CART (Classification and Regression Trees).
 Gini index measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 - \sum_{i=1}^{m} p_i^2

where p_i is the probability that a tuple in D belongs to class C_i and is estimated by |C_{i,D}|/|D|. The sum is computed over m classes.
 Gini index considers a binary split for each attribute.
 When considering a binary split, we compute a weighted
sum of the impurity of each resulting partition.
 E.g., if a binary split on A partitions D into D1 and D2, the Gini index of D given that partitioning is:

Gini_A(D) = (|D_1| / |D|) Gini(D_1) + (|D_2| / |D|) Gini(D_2)
 The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute A is:

\Delta Gini(A) = Gini(D) - Gini_A(D)

 The attribute that has the minimum Gini index (equivalently, the maximum reduction in impurity) is selected as the splitting attribute.
 Using the Gini index to compute the impurity of D (again 9 "yes" and 5 "no" tuples):

Gini(D) = 1 − (9/14)^2 − (5/14)^2 = 0.459

 Consider the subset income ∈ {low, medium}. This binary split partitions D into D1 (the 10 tuples with income in {low, medium}) and D2 (the remaining 4 tuples with income = high); the partition sizes shown are assumptions consistent with the quoted results. The Gini index value computed based on this partitioning is:

Gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) = 0.443

 Similarly, the Gini index values for splits on the remaining subsets are 0.458 (for the subsets {low, high} and {medium}) and 0.450 (for the subsets {medium, high} and {low}). Therefore, the best binary split for attribute income is on {low, medium} (or {high}) because it minimizes the Gini index.
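The Gini calculations can be sketched in Python as follows; the per-subset class counts (7 "yes" / 3 "no" for income in {low, medium} and 2 "yes" / 2 "no" for income = high) are assumptions chosen to be consistent with the quoted results:

def gini(class_counts):
    # Gini(D) = 1 - sum(p_i^2) for a partition with the given class counts.
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_binary_split(counts_d1, counts_d2):
    # Weighted Gini index of D for a binary split into D1 and D2.
    n1, n2 = sum(counts_d1), sum(counts_d2)
    n = n1 + n2
    return (n1 / n) * gini(counts_d1) + (n2 / n) * gini(counts_d2)

gini_D = gini([9, 5])                                    # ~0.459
gini_low_med = gini_binary_split([7, 3], [2, 2])         # ~0.443
print(round(gini_D, 3), round(gini_low_med, 3), round(gini_D - gini_low_med, 3))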
ID3 Algorithm
 ID3 stands for Iterative Dichotomiser 3
 J. Ross Quinlan, a researcher in machine learning,
developed a decision tree algorithm known as ID3
(Iterative Dichotomiser)
 It uses a top-down greedy approach to build a
decision tree
 This algorithm uses information gain to decide which attribute should be used to classify the current subset of the data. At each level of the tree, information gain is calculated recursively for the remaining data and attributes.
ID3 Algorithm
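The slide presents the ID3 pseudocode as a figure. Below is a minimal Python sketch of the same top-down, greedy, information-gain-driven procedure; the data format (each row a dict of attribute values) and all names are illustrative assumptions, not taken from the slides.

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    return -sum((c / len(labels)) * math.log2(c / len(labels))
                for c in counts.values())

def info_gain(rows, labels, attr):
    # Gain(attr) = entropy before the split minus weighted entropy after it.
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        remainder += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    # Returns the tree as nested dicts: {attribute: {value: subtree}}; leaves are class labels.
    if len(set(labels)) == 1:                 # all tuples in one class -> leaf
        return labels[0]
    if not attributes:                        # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in {r[best] for r in rows}:     # grow one branch per observed value
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attributes if a != best])
    return tree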
ID3 Algorithm Example
Q : Create a Decision tree for the following
training data set using ID3 Algorithm.
Pruning in Decision Tree
 Pruning is a data compression technique in ML and search algorithms that reduces the size of decision trees by removing sections of the tree that are non-critical or redundant for classifying instances.
 There are two common approaches to tree pruning:
prepruning and postpruning.
 In the prepruning approach, a tree is “pruned” by
halting its construction early (e.g., by deciding not to
further split or partition the subset of training tuples at
a given node). Upon halting, the node becomes a leaf.
The leaf may hold the most frequent class among the
subset tuples
Pruning in Decision Tree

(Figure: an unpruned decision tree and the corresponding pruned tree.)
 The second and more common approach is postpruning,
which removes subtrees from a “fully grown” tree. A
subtree at a given node is pruned by removing its
branches and replacing it with a leaf. The leaf is labeled
with the most frequent class among the subtree being
replaced.
 E.g., notice the subtree at node "A3?" in the unpruned
tree of previous Fig. Suppose that the most common class
within this subtree is “class B.” In the pruned version of
the tree, the subtree in question is pruned by replacing it
with the leaf “class B.”
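As an illustration, here is a hedged Python sketch of one common postpruning strategy (reduced-error pruning, which the slides do not name explicitly). It works on a tree in the nested-dict format used in the ID3 sketch above and, for brevity, takes the replacement class from the hold-out tuples that reach each node rather than from the training tuples.

from collections import Counter

def classify(tree, row, default):
    # Walk the nested-dict tree; leaves are class labels.
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(row.get(attr), default)
    return tree

def errors(tree, rows, labels, default):
    return sum(classify(tree, r, default) != y for r, y in zip(rows, labels))

def postprune(tree, rows, labels, default):
    # Bottom-up: prune each branch first, then try replacing this subtree by a leaf.
    if not isinstance(tree, dict) or not rows:
        return tree
    attr = next(iter(tree))
    for value, subtree in tree[attr].items():
        idx = [i for i, r in enumerate(rows) if r.get(attr) == value]
        tree[attr][value] = postprune(subtree,
                                      [rows[i] for i in idx],
                                      [labels[i] for i in idx],
                                      default)
    leaf = Counter(labels).most_common(1)[0][0]   # most frequent class at this node
    if errors(leaf, rows, labels, default) <= errors(tree, rows, labels, default):
        return leaf                               # pruning does not hurt hold-out accuracy
    return tree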
Inductive Inference with Decision
Trees
 Describing the inductive bias of ID3 consists of describing
the basis by which it chooses one of the consistent
hypotheses over the others.
 Which of these decision trees does ID3 choose?
 ID3 search strategy
(a) selects in favor of shorter trees over longer ones, and
(b) selects trees that place the attributes with highest
information gain closest to the root.
It is difficult to characterize precisely the inductive bias
exhibited by ID3. However, we can approximately
characterize its bias as a preference for short decision trees
over complex trees
Issues in Decision Tree
1) Avoiding Overfitting the Data
2) Incorporating Continuous-Valued Attributes
3) Alternative Measures for Selecting Attributes
4) Handling Training Examples with Missing Attribute
Values
5) Handling Attributes with Differing Costs
Over-fitting & Under-fitting in
decision trees
 When a model performs very well for training data but has
poor performance with test data (new data), it is known as
over-fitting. In this case, the machine learning model
learns the details and noise in the training data such that it
negatively affects the performance of the model on test
data. Over-fitting can happen due to low bias and high
variance.
 When a model has not learned the patterns in the training
data well and is unable to generalize well on the new data,
it is known as under-fitting. An under-fit model has poor
performance on the training data and will result in
unreliable predictions. Under-fitting occurs due to high
bias and low variance.
Instance Based Learning
 Classification methods discussed so far (decision tree induction, Bayesian classification, support vector machines) are all examples of eager learners.
 Eager learners, when given a set of training tuples, will
construct a generalization (i.e., classification) model before
receiving new (e.g., test) tuples to classify. We can think of the
learned model as being ready and eager to classify previously
unseen tuples.
 Imagine a contrasting lazy approach, in which the learner instead waits until the last minute before doing any model construction to classify a given test tuple; i.e., when given a training tuple, a lazy learner simply stores it (or does only minor processing) and waits until it is given a test tuple.
Instance Based Learning
 Only when it sees the test tuple does it perform
generalization to classify the tuple based on its
similarity to the stored training tuples.
 Unlike eager learning methods, lazy learners do
less work when a training tuple is presented and
more work when making a classification or
numeric prediction. Because lazy learners store
the training tuples or “instances,” they are also
referred to as instance-based learners
K-Nearest Neighbour Learning
 The k-nearest-neighbor method was first described in the early 1950s.
 The method is labor intensive when given large
training sets, and did not gain popularity until the
1960s when increased computing power became
available.
 It has since been widely used in areas such as pattern recognition and data mining.
 K-Nearest Neighbour is one of the simplest machine learning algorithms based on the supervised learning technique.
K-Nearest Neighbour Learning
 It is also called a lazy learner algorithm because it simply stores the training tuples and defers all computation until it is given a test tuple to classify.
Example
(Figures: worked k-NN classification examples.)
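A minimal k-nearest-neighbour classifier sketch in Python (Euclidean distance and majority vote; the data and all names are illustrative):

import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, train_points, train_labels, k=3):
    # Lazy learner: the training tuples are simply stored; all work happens at query time.
    neighbours = sorted(zip(train_points, train_labels),
                        key=lambda pair: euclidean(query, pair[0]))[:k]
    # Majority vote among the k nearest stored tuples.
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

X = [(1.0, 1.0), (1.2, 0.8), (6.0, 6.0), (5.8, 6.2)]   # made-up 2-D training points
y = ["A", "A", "B", "B"]
print(knn_classify((1.1, 0.9), X, y, k=3))             # -> "A"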
Locally Weighted Regression
 Linear Regression cannot be used for making predictions when
there exists a non-linear relationship between X and Y. In such
cases, locally weighted linear regression is used.
 Locally weighted linear regression is a supervised learning
algorithm.
 It is a non-parametric algorithm.
 Model-based methods, such as neural networks use the data to
build a parameterized model.
 After training, the model is used for predictions and the data is
generally discarded.
 In contrast, "memory-based" methods are non-parametric approaches that explicitly retain the training data, and use it each time a prediction needs to be made.
 LWR is a memory-based method that performs a regression around a point of interest using only training data that are "local" to that point.
Locally Weighted Regression
 The model does not learn a fixed set of parameters as is done in
ordinary linear regression.
 Rather parameters θ are computed individually for each query
point x.
 While computing θ , a higher “preference” is given to the points
in the training set lying in the vicinity of x than the points
lying far away from x .
 The cost function is:

J(\theta) = \sum_{i} w^{(i)} (y^{(i)} - \theta^T x^{(i)})^2

where w^{(i)} is a non-negative "weight" associated with training point x^{(i)}.
For x^{(i)} lying closer to the query point x, the value of w^{(i)} is large, while for x^{(i)} lying far away from x, the value of w^{(i)} is small.
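A hedged NumPy sketch of locally weighted linear regression for a single query point. The Gaussian kernel and the bandwidth tau are assumptions (the slides only require that nearby training points receive larger weights); theta is re-computed for every query by weighted least squares.

import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    # Weights: close to 1 for training points near the query, close to 0 far away.
    w = np.exp(-((X - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(w)
    X_aug = np.column_stack([np.ones(len(X)), X])        # add an intercept column
    # Weighted normal equations: theta = (X^T W X)^-1 X^T W y, solved per query point.
    theta = np.linalg.pinv(X_aug.T @ W @ X_aug) @ X_aug.T @ W @ y
    return np.array([1.0, x_query]) @ theta

X = np.linspace(0, 6, 50)                 # 1-D inputs with a non-linear relationship to y
y = np.sin(X) + 0.1 * np.random.randn(50)
print(lwr_predict(3.0, X, y))             # locally fitted prediction around x = 3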
Radial Basis Function Networks
 Radial basis function (RBF) networks are a commonly used
type of artificial neural network for function approximation
problems.
 Radial basis function networks are distinguished from
other neural networks due to their universal approximation
and faster learning speed.
 An RBF network is a type of feed forward neural network
composed of three layers, namely the input layer, the
hidden layer and the output layer.
 The computation that is performed inside the hidden layer
is very different from most neural networks, and this is
where the power of the RBF network comes from.
 RBF Neural networks are conceptually similar to K-Nearest
Neighbor models, though the implementation of both
models is starkly different.
Input Vector
 The input vector is the n-dimensional vector that you are trying
to classify. The entire input vector is shown to each of the RBF
neurons.
RBF Neurons
 Each RBF neuron stores a “prototype” vector which is just one of
the vectors from the training set.
 Each RBF neuron compares the input vector to its prototype, and outputs a value between 0 and 1 which is a measure of similarity.
 If the input is equal to the prototype, then the output of that RBF neuron will be 1.
 As the distance between the input and prototype grows, the
response falls off exponentially towards 0.
 The shape of the RBF neuron’s response is a bell curve, as
illustrated in the network architecture diagram.
 The neuron’s response value is also called its “activation” value.
Output Nodes
 The output of the network consists of a set of nodes,
one per category that we are trying to classify.
 Each output node computes a sort of score for the
associated category.
 Typically, a classification decision is made by assigning
the input to the category with the highest score.
 The score is computed by taking a weighted sum of the
activation values from every RBF neuron.
 The output node will typically give a positive weight to
the RBF neurons that belong to its category, and a
negative weight to the others.
RBF Neuron Activation Function
 Each RBF neuron computes a measure of the similarity
between the input and its prototype vector (taken
from the training set).
 Input vectors which are more similar to the prototype
return a result closer to 1.
 There are different possible choices of similarity
functions, but the most popular is based on the
Gaussian.
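A minimal NumPy sketch of an RBF network forward pass with Gaussian activations. The prototype vectors, the width parameter beta, and the output weights are illustrative placeholders; in practice they would be learned from the training data.

import numpy as np

def rbf_activations(x, prototypes, beta=1.0):
    # Gaussian similarity to each prototype: 1 at the prototype, falling towards 0 with distance.
    dists_sq = np.sum((prototypes - x) ** 2, axis=1)
    return np.exp(-beta * dists_sq)

def rbf_forward(x, prototypes, output_weights, beta=1.0):
    # Hidden layer: RBF activations; output layer: one weighted-sum score per category.
    phi = rbf_activations(x, prototypes, beta)
    scores = output_weights @ phi
    return int(np.argmax(scores))          # predict the category with the highest score

prototypes = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])   # 3 stored training vectors
output_weights = np.array([[ 1.0,  1.0, -1.0],                # category 0 favours prototypes 1-2
                           [-1.0, -1.0,  1.0]])               # category 1 favours prototype 3
print(rbf_forward(np.array([0.2, 0.1]), prototypes, output_weights))   # -> 0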
Case Based Learning
 In case-based reasoning, the training examples,
the cases, are stored and accessed to solve a new problem.
 To get a prediction for a new example, those cases that are
similar, or close to, the new example are used to predict the
value of the target features of the new example.
 This is at one extreme of the learning problem where,
unlike decision trees and neural networks, relatively little
work must be done offline, and virtually all of the work is
performed at query time.
Case Based Learning
 Case-based reasoning is used for classification and for
regression
 If the cases are simple, one algorithm that works well is to
use the k-nearest neighbors for some given number k.
 Given a new example, the k training examples whose input features are closest to that example are used to predict the target value for the new example.
 The prediction could be the mode, average, or some
interpolation between the prediction of these k training
examples, weighting closer examples more than distant
examples.
How CBR works?
 When a new case arises to classify, a Case-based Reasoner (CBR) will first check if an identical training case exists.
 If one is found, then the accompanying solution to that case is
returned.
 If no identical case is found, then the CBR will search for training
cases having components that are similar to those of the new case.
 Conceptually, these training cases may be considered as neighbours of
the new case.
 If cases are represented as graphs, this involves searching for
subgraphs that are similar to subgraphs within the new case.
 The CBR tries to combine the solutions of the neighbouring training
cases to propose a solution for the new case.
 If incompatibilities arise with the individual solutions, then backtracking to search for other solutions may be necessary.
 The CBR may employ background knowledge and problem-solving
strategies to propose a feasible solution.
