Classification and Regression Trees (CART - I)
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
Agenda
Introduction
Problem Description for Illustration
Han, J., Pei, J. and Kamber, M., 2011. Data Mining: Concepts and Techniques. Elsevier.
Root Node, Internal Node, Child Node
Figure: a sample tree showing the root node (or parent node), internal nodes, and child (leaf) nodes
• A decision tree uses a tree structure to represent a number of possible decision paths and an outcome for each path
• A decision tree consists of a root node, internal nodes and leaf nodes
• The topmost node in a tree is the root node or parent node
• It represents the entire sample population
• An internal node (non-leaf node) denotes a test on an attribute; each branch represents an outcome of the test
• A leaf node (or terminal node or child node) holds a class label
• It cannot be further split
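To make this structure concrete, here is a minimal sketch of a decision-tree node in Python; the class name DecisionNode and its fields are illustrative assumptions, not part of any standard CART implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class DecisionNode:
    """One node of a decision tree.

    An internal (non-leaf) node stores the attribute it tests and one child
    per outcome of the test; a leaf (terminal/child) node stores a class label.
    """
    attribute: Optional[str] = None      # attribute tested at this node (None for a leaf)
    children: Dict[Any, "DecisionNode"] = field(default_factory=dict)  # test outcome -> child node
    label: Optional[Any] = None          # class label (set only for leaf nodes)

    def is_leaf(self) -> bool:
        return not self.children
```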
Decision Tree Introduction
Decision Tree Algorithm
Input:
• Data partition, D, which is a set of
training tuples and their associated
class labels;
• Attribute list, the set of candidate
attributes;
• Attribute selection method, a
procedure to determine the splitting
criterion that “best” partitions the data
tuples into individual classes. This
criterion consists of a splitting attribute
and, possibly, either a split point or
splitting subset.
Output: A decision tree
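Read as a function interface, these inputs and outputs might translate into Python roughly as follows; the names generate_decision_tree and Example, and the (attribute values, class label) tuple representation, are assumptions for illustration, DecisionNode is the sketch above, and the body is filled in after the step-by-step slides below.

```python
from typing import Any, Callable, Dict, List, Tuple

# Assumed representation of a training tuple: (attribute values, class label).
Example = Tuple[Dict[str, Any], Any]

def generate_decision_tree(
    D: List[Example],                      # data partition of training tuples and their class labels
    attribute_list: List[str],             # the set of candidate attributes
    attribute_selection_method: Callable,  # procedure returning the "best" splitting criterion for D
) -> "DecisionNode":
    """Return the root of a decision tree grown from D (body sketched later)."""
    ...
```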
Decision Tree Method
• N – node
• C – class
• D – set of tuples in the training data set
Decision Tree Method – Steps 1 to 6
• The tree starts as a single node, N,
representing the training tuples in D (step
1).
• If the tuples in D are all of the same class,
then node N becomes a leaf and is
labelled with that class (steps 2 and 3)
• Steps 4 and 5 are terminating conditions
• Otherwise, the algorithm calls the Attribute selection method to determine the splitting criterion
• The splitting criterion (e.g., the Gini index) tells us which attribute to test at node N by determining the “best” way to separate or partition the tuples in D into individual classes (step 6)
Decision Tree Method – Steps 7 to 11
• The splitting criterion indicates the splitting
attribute and may also indicate either a
split-point or a splitting subset
• The splitting criterion is determined so
that, ideally, the resulting partitions at each
branch are as “pure” as possible. A
partition is pure if all of the tuples in it
belong to the same class.
• The node N is labelled with the splitting
criterion, which serves as a test at the node
(step 7).
• A branch is grown from node N for each of
the outcomes of the splitting criterion.
• The tuples in D are partitioned accordingly
(steps 10 to 11)
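Putting steps 1 to 15 together, here is a minimal Python sketch of the recursive procedure, restricted to the discrete-valued, multiway-split case. It reuses the DecisionNode class and Example tuple representation sketched earlier; majority_class, and the assumption that attribute_selection_method simply returns an attribute name, are simplifications for illustration.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Tuple

Example = Tuple[Dict[str, Any], Any]          # (attribute values, class label)

def majority_class(D: List[Example]) -> Any:
    """Most common class label among the tuples in D (used for majority voting)."""
    return Counter(label for _, label in D).most_common(1)[0][0]

def generate_decision_tree(D: List[Example],
                           attribute_list: List[str],
                           attribute_selection_method: Callable) -> "DecisionNode":
    node = DecisionNode()                                 # step 1: create a node N for D
    labels = {label for _, label in D}
    if len(labels) == 1:                                  # steps 2-3: all tuples in the same class
        node.label = labels.pop()                         #   -> N becomes a leaf with that class
        return node
    if not attribute_list:                                # step 4: no remaining attributes
        node.label = majority_class(D)                    # step 5: leaf labelled by majority voting
        return node
    A = attribute_selection_method(D, attribute_list)     # step 6: pick the splitting attribute
    node.attribute = A                                    # step 7: label N with the splitting criterion
    remaining = [a for a in attribute_list if a != A]
    for v in {x[A] for x, _ in D}:                        # steps 8-9: one branch per known value of A
        Dj = [(x, y) for x, y in D if x[A] == v]          # steps 10-11: partition the tuples
        if not Dj:                                        # steps 12-13: empty partition gets a majority-class
            node.children[v] = DecisionNode(label=majority_class(D))  # leaf (cannot occur here, since the
        else:                                             # values v are taken from D; kept to mirror the algorithm)
            node.children[v] = generate_decision_tree(    # step 14: recurse on each partition Dj
                Dj, remaining, attribute_selection_method)
    return node                                           # step 15: return the tree rooted at N
```

The terminating conditions embedded in steps 2–5 and 12–13 of this sketch correspond to the cases discussed on the termination-condition slides below.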
Three possibilities for partitioning tuples based on the
splitting criterion
• There are three possible scenarios, as illustrated in Figures (a), (b) and (c).
• Let A be the splitting attribute. A has v distinct values, {a1, a2, ..., av}, based on the training data.
• If A is discrete-valued (Figure (a)), then one branch is grown for each known value of A.
Figure (a)
Three possibilities for partitioning tuples based on the
splitting criterion
• If A is continuous-valued (Figure (b)), then two branches are grown, corresponding to A ≤ split point and A > split point, where split point is the split-point returned by the Attribute selection method as part of the splitting criterion.
Figure (b)
Three possibilities for partitioning tuples based on the
splitting criterion
• If A is discrete-valued and a binary tree must be produced, then the test is
of the form A ∈ 𝑆𝐴 , where 𝑆𝐴 is the splitting subset for A.
Figure (c)
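The three branching tests can be summarized in code; the following is a rough Python sketch, where route_tuple and its parameters are illustrative assumptions rather than part of the algorithm as given.

```python
from typing import Any, Dict, Optional, Set

def route_tuple(x: Dict[str, Any],
                attribute: str,
                split_point: Optional[float] = None,
                splitting_subset: Optional[Set[Any]] = None):
    """Return the branch (test outcome) that tuple x follows for splitting attribute A.

    (a) discrete-valued A, multiway split: one branch per known value of A
    (b) continuous-valued A: two branches, A <= split_point and A > split_point
    (c) discrete-valued A with a binary tree: A in S_A versus A not in S_A
    """
    value = x[attribute]
    if split_point is not None:               # case (b)
        return "<=" if value <= split_point else ">"
    if splitting_subset is not None:          # case (c)
        return "in" if value in splitting_subset else "not in"
    return value                              # case (a): the branch is the attribute value itself
```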
Decision Tree Method – termination condition
• The algorithm uses the same process recursively to form a decision tree
for the tuples at each resulting partition, 𝐷𝑗 , of D (step 14).
Decision Tree Method – termination condition
The recursive partitioning stops when any one of the following terminating conditions is true:
1. All the tuples in partition D (represented at node N) belong to the same class (steps 2 and 3).
2. There are no remaining attributes on which the tuples may be further partitioned (step 4).
• In this case, majority voting is employed (step 5).
• This involves converting node N into a leaf and labelling it with the most common class in D.
• Alternatively, the class distribution of the node tuples may be stored.
3. There are no tuples for a given branch, that is, a partition Dj is empty (step 12).
• In this case, a leaf is created with the majority class in D (step 13).
• The resulting decision tree is returned (step 15).
Attribute Selection Measures
• If the splitting attribute is continuous-valued, or if we are restricted to binary trees, then, respectively, either a split point or a splitting subset must also be determined as part of the splitting criterion
• The CART algorithm uses the information gain and Gini index measures for attribute selection
Information Gain-Entropy Measure
• The expected information (entropy) needed to classify a tuple in D is given by
Info(D) = − Σ_{i=1..m} p_i log2(p_i)
where p_i is the probability that an arbitrary tuple in D belongs to class C_i, and m is the number of classes.
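A minimal sketch of this computation in Python, assuming the (attribute values, class label) tuple representation used in the earlier sketches; the function name info is an illustrative choice.

```python
import math
from collections import Counter
from typing import Any, Dict, List, Tuple

Example = Tuple[Dict[str, Any], Any]   # (attribute values, class label)

def info(D: List[Example]) -> float:
    """Expected information (entropy) needed to classify a tuple in D:
    Info(D) = -sum_i p_i * log2(p_i), with p_i estimated from the class counts in D."""
    counts = Counter(label for _, label in D)
    total = len(D)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```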
Attribute Selection Measures
• It is quite likely that the partitions will be impure (e.g., where a partition
may contain a collection of tuples from different classes rather than from
a single class).
• How much more information would we still need (after the partitioning) in order to arrive at an exact classification?
• This amount is measured by
Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
• The term |D_j| / |D| acts as the weight of the j-th partition. Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A.
Information Gain
• The smaller the expected information (still) required, the greater the
purity of the partitions
• Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,
Gain(A) = Info(D) − Info_A(D)
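Building on the info helper sketched above, the gain of a discrete-valued attribute could be computed roughly as follows; info_gain and the dictionary-based partitioning are assumptions for illustration.

```python
from collections import defaultdict

def info_gain(D, attribute):
    """Gain(A) = Info(D) - Info_A(D) for a discrete-valued attribute A."""
    partitions = defaultdict(list)
    for x, y in D:
        partitions[x[attribute]].append((x, y))     # D_j for each known value a_j of A
    info_A = sum(len(Dj) / len(D) * info(Dj) for Dj in partitions.values())
    return info(D) - info_A

# Example usage (hypothetical data):
#   D = [({"age": "youth"}, "no"), ({"age": "senior"}, "yes"), ...]
#   best = max(attribute_list, key=lambda a: info_gain(D, a))
```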
Gini Index
• The Gini index measures the impurity of D, a data partition or set of training tuples, as
Gini(D) = 1 − Σ_{i=1..m} p_i²
where p_i is, as before, the probability that a tuple in D belongs to class C_i, and the sum is taken over the m classes.
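A minimal Python sketch of this measure, using the same tuple representation as before; the name gini is an illustrative choice.

```python
from collections import Counter

def gini(D):
    """Gini(D) = 1 - sum_i p_i^2 over the class proportions in D."""
    counts = Counter(label for _, label in D)
    total = len(D)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())
```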
Gini Index
• For continuous-valued attributes, each possible split-point must be considered
• The strategy is to take the midpoint between each pair of (sorted) adjacent values as a possible split-point.
• For a possible split-point of A, D_1 is the set of tuples in D satisfying A ≤ split point, and D_2 is the set of tuples in D satisfying A > split point.
• The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute A is
ΔGini(A) = Gini(D) − Gini_A(D), where Gini_A(D) = (|D_1| / |D|) Gini(D_1) + (|D_2| / |D|) Gini(D_2)
• The attribute that maximizes the reduction in impurity (or, equivalently, has
the minimum Gini index) is selected as the splitting attribute
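For the continuous-valued case, the split-point search described here could be sketched as follows; best_gini_split is an assumed name, and gini is the helper sketched above.

```python
def best_gini_split(D, attribute):
    """Return (split_point, reduction) giving the largest reduction in impurity
    Gini(D) - Gini_A(D) over candidate midpoints of a continuous-valued attribute."""
    values = sorted({x[attribute] for x, _ in D})
    base = gini(D)
    best = (None, 0.0)
    for lo, hi in zip(values, values[1:]):
        split_point = (lo + hi) / 2                   # midpoint of adjacent sorted values
        D1 = [(x, y) for x, y in D if x[attribute] <= split_point]
        D2 = [(x, y) for x, y in D if x[attribute] > split_point]
        gini_A = len(D1) / len(D) * gini(D1) + len(D2) / len(D) * gini(D2)
        reduction = base - gini_A
        if reduction > best[1]:
            best = (split_point, reduction)
    return best
```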
Tree Pruning
• When a decision tree is built, many of the branches will reflect anomalies
in the training data due to noise or outliers
• Tree pruning uses statistical measures to remove the least reliable branches
• Pruned trees tend to be smaller and less complex and, thus, easier to
comprehend
• They are usually faster and better at correctly classifying independent test
data than unpruned trees
How does Tree Pruning Work?
• There are two common approaches to tree pruning: pre-pruning and post-
pruning.
• In the pre-pruning approach, a tree is “pruned” by halting its construction
early (e.g., by deciding not to further split or partition the subset of
training tuples at a given node).
• When constructing a tree, measures such as statistical significance, information gain, or the Gini index can be used to assess the goodness of a split.
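For context, here is a hedged example of how pre-pruning and post-pruning are typically exposed in practice, using scikit-learn's DecisionTreeClassifier; scikit-learn is not part of the lecture, and the specific thresholds below are arbitrary illustrations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: halt construction early by bounding depth and split size.
pre_pruned = DecisionTreeClassifier(
    criterion="gini",        # CART-style impurity measure
    max_depth=3,             # stop growing below this depth
    min_samples_split=10,    # do not split nodes with fewer than 10 tuples
).fit(X, y)

# Post-pruning: grow the tree, then prune it back via cost-complexity pruning.
post_pruned = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.02).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())
```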
THANK YOU