
UET

Since 2004

ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN


VNU-University of Engineering and Technology

INT3405 - Machine Learning


Lecture 5: Classification (P2)
Decision Tree

Hanoi, 09/2023
Outline
● Decision Tree Examples
● Decision Tree Algorithms
● Methods for Expressing Test Conditions
● Measures of Node Impurity
○ Gini index
○ Entropy
○ Gain Ratio
○ Classification Error



Example of a Decision Tree

Source: https://regenerativetoday.com/simple-explanation-on-how-decision-tree-algorithm-makes-decisions/



Example of Model Prediction (1)-(6): a test record is routed down the tree, one decision node at a time, until it reaches a leaf.


Decision Tree - Another Solution

There could be more than one tree that fits the same data!



Decision Tree - Classification Task

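As a practical aside (not from the slides), the same induction-and-prediction workflow can be tried with scikit-learn's decision tree. The toy records below are made up for illustration, and scikit-learn is assumed to be installed.

```python
# Sketch: learning a tree from labeled records and classifying a new one (scikit-learn assumed).
from sklearn.tree import DecisionTreeClassifier

# Toy training records: [home_owner (0/1), marital_status code, annual_income], label = defaulted?
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95], [0, 1, 60]]
y = ["No", "No", "No", "No", "Yes", "No"]

clf = DecisionTreeClassifier(criterion="gini", max_depth=3)  # impurity measure is configurable
clf.fit(X, y)                                                # induction on the training set
print(clf.predict([[0, 2, 90]]))                             # apply the model to an unseen record
```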


Decision Tree Induction

● Various Algorithms
○ Hunt’s Algorithm (one of the earliest)
○ CART
○ ID3, C4.5
○ SLIQ, SPRINT



General Structure of Hunt’s Algorithm
● Let Dt be the set of training records that reach a
node t
● General Procedure:
○ If Dt contains records that all belong to the same
class yt, then t is a leaf node labeled as yt
○ If Dt contains records that belong to more than
one class, use an attribute test to split the data
into smaller subsets.
○ Recursively apply the procedure to each subset (a minimal sketch follows).
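Below is a minimal Python sketch of this recursive procedure, assuming records are given as (attribute-dict, label) pairs and using a weighted Gini index (introduced later in this lecture) to pick the test attribute. It is an illustration, not the exact algorithm behind the slides' figures.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def hunt(records, attributes):
    """records: list of (attribute-dict, label) pairs; attributes: set of attribute names."""
    labels = [y for _, y in records]
    if len(set(labels)) == 1 or not attributes:        # pure node, or nothing left to test
        return {"leaf": Counter(labels).most_common(1)[0][0]}

    def weighted_gini(a):                              # impurity of a multi-way split on attribute a
        groups = {}
        for x, y in records:
            groups.setdefault(x[a], []).append(y)
        return sum(len(g) / len(records) * gini(g) for g in groups.values())

    attr = min(attributes, key=weighted_gini)          # greedy choice of the attribute test
    node = {"test": attr, "children": {}}
    for value in {x[attr] for x, _ in records}:        # split the data into smaller subsets
        subset = [(x, y) for x, y in records if x[attr] == value]
        node["children"][value] = hunt(subset, attributes - {attr})
    return node
```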


Design Issues of Decision Tree Induction
● How should training records be split?
○ Method for expressing test condition
■ depending on attribute types
○ Measure for evaluating the goodness of a test condition
● How should the splitting procedure stop?
○ Stop splitting if all the records belong to the same class or
have identical attribute values
○ Early termination



Methods for Expressing Test Conditions
● Depends on attribute types
○ Binary
○ Nominal
○ Ordinal
○ Continuous



Test Conditions for Nominal Attributes
● Multi-way split:
○ Use as many partitions as distinct values.

● Binary split:
○ Divides values into two subsets



Test Conditions for Ordinal Attributes
● Multi-way split:
○ Use as many partitions as distinct values.
● Binary split:
○ Divides values into two subsets
○ Preserve order property among attribute values

This grouping violates the order property



Test Conditions for Continuous Attributes



Splitting based on Continuous Attributes
● Different ways of handling
○ Discretization to form an ordinal categorical attribute. Ranges
can be found by equal-interval bucketing, equal-frequency
bucketing (percentiles), or clustering (a binning sketch follows below).
■ Static – discretize once at the beginning
■ Dynamic – repeat at each node
○ Binary Decision: (A < v) or (A ≥ v)
■ considers all possible split points and finds the best cut
■ can be more compute-intensive
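A small sketch of the static discretization option, assuming NumPy; the income values are illustrative placeholders, not the slide's data.

```python
import numpy as np

income = np.array([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])

# Equal-interval bucketing: 4 bins of equal width between min and max
width_edges = np.linspace(income.min(), income.max(), 5)

# Equal-frequency bucketing: edges at the 25th, 50th and 75th percentiles
freq_edges = np.percentile(income, [25, 50, 75])

# Map each value to an ordinal bin index (0..3) using the equal-frequency edges
ordinal = np.digitize(income, freq_edges)
print(width_edges, freq_edges, ordinal)
```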
How to determine the Best Split

Before Splitting: 10 records of class 0, 10 records of class 1

Which test condition is the best?


How to determine the Best Split
● Greedy approach:
– Nodes with purer class distribution are preferred

● Need a measure of node impurity:

High degree of impurity Low degree of impurity



Measures of Node Impurity
● Gini Index

● Entropy

● Misclassification error



Find the Best Split
● Compute impurity measure (P) before splitting
● Compute impurity measure (M) after splitting
○ Compute impurity measure of each child node
○ M is the weighted impurity of child nodes
● Choose the attribute test condition that produces the highest gain:
Gain = P – M
● or equivalently, the lowest impurity measure after splitting (M); a small sketch follows below
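A minimal sketch of this rule, assuming per-class record counts for the parent node and for each child of a candidate split; the Gini index used here is defined on the following slides.

```python
def weighted_impurity(children, impurity):
    """M: size-weighted impurity of the child nodes.
    `children` is a list of per-child class-count lists, e.g. [[7, 3], [3, 7]]."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * impurity(c) for c in children)

def gain(parent_counts, children, impurity):
    """Gain = P - M: impurity before the split minus weighted impurity after it."""
    return impurity(parent_counts) - weighted_impurity(children, impurity)

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Parent node with 10 records of class 0 and 10 of class 1, split into (7, 3) and (3, 7)
print(gain([10, 10], [[7, 3], [3, 7]], gini))   # 0.5 - 0.42 = 0.08
```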





Measure of Impurity: Gini (index)

● Gini Index for a given node t:

Gini(t) = 1 – Σ_i [ p_i(t) ]^2, where p_i(t) is the relative frequency of class i at node t

● For a 2-class problem (p, 1 – p):

◆ Gini = 1 – p^2 – (1 – p)^2 = 2p(1 – p)



Computing Gini Index of a Single Node

P(C1) = 0/6 = 0   P(C2) = 6/6 = 1
Gini = 1 – P(C1)^2 – P(C2)^2 = 1 – 0 – 1 = 0

P(C1) = 1/6   P(C2) = 5/6
Gini = 1 – (1/6)^2 – (5/6)^2 = 0.278

P(C1) = 2/6   P(C2) = 4/6
Gini = 1 – (2/6)^2 – (4/6)^2 = 0.444
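The same computation as a short Python function, reproducing the three node examples above from their per-class counts.

```python
def gini(counts):
    """Gini index of a node, given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))              # 0.0
print(round(gini([1, 5]), 3))    # 0.278
print(round(gini([2, 4]), 3))    # 0.444
```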
Computing Gini Index for a Collection of Nodes



Binary Attributes: Computing GINI Index
● Splits into two partitions (child nodes)
● Effect of weighting partitions:
– Larger and purer partitions are sought



Categorical Attributes: Computing GINI Index
● For each distinct value, gather counts for each class in the dataset
● Use the count matrix to make decisions

Which of these is the best?
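The count matrices from the slide's figure are not reproduced here; the sketch below shows how such a matrix would be scored, using made-up car-type counts as placeholders.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_gini(count_matrix):
    """Weighted Gini of a candidate split, given {attribute value: per-class counts}."""
    total = sum(sum(c) for c in count_matrix.values())
    return sum(sum(c) / total * gini(c) for c in count_matrix.values())

# Hypothetical counts: a multi-way split vs. one binary grouping of the same attribute
multiway = {"Family": [1, 3], "Sports": [8, 0], "Luxury": [1, 7]}
binary   = {"{Sports}": [8, 0], "{Family, Luxury}": [2, 10]}
print(round(split_gini(multiway), 3), round(split_gini(binary), 3))   # lower is better
```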



Continuous Attributes: Computing Gini Index
● Use Binary Decisions based on one value
● Several Choices for the splitting value
○ Number of possible splitting values
= Number of distinct values
● Each splitting value has a count matrix associated with it
○ Class counts in each of the partitions, A ≤ v and A > v
● Simple method to choose best v
○ For each v, scan the database to gather count matrix
and compute its Gini index
○ Computationally inefficient! Repetition of work (a brute-force sketch follows).
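A brute-force version of that simple method, assuming illustrative income values and labels (not the slide's data); for every candidate value v it rebuilds the count matrix and scores the split A ≤ v vs. A > v.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_threshold(values, labels):
    """Scan every distinct value v and score the binary split A <= v vs. A > v."""
    classes = sorted(set(labels))
    best_v, best_w = None, float("inf")
    for v in sorted(set(values)):
        left  = [y for x, y in zip(values, labels) if x <= v]
        right = [y for x, y in zip(values, labels) if x > v]
        if not right:                      # skip the degenerate split with an empty partition
            continue
        w = sum(len(part) / len(values) * gini([part.count(c) for c in classes])
                for part in (left, right))
        if w < best_w:
            best_v, best_w = v, w
    return best_v, best_w

# Illustrative data (placeholders, not the slide's table)
income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
label  = ["N", "N", "N", "Y", "Y", "Y", "N", "N", "N", "N"]
print(best_threshold(income, label))       # (95, 0.3): best cut falls between 95 and 100
```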



Measure of Impurity: Entropy
● Entropy for a given node t:

Entropy(t) = – Σ_i p_i(t) log2 p_i(t), where p_i(t) is the relative frequency of class i at node t
(0 log2 0 is taken to be 0)

P(C1) = 0/6 = 0   P(C2) = 6/6 = 1
Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0

P(C1) = 1/6   P(C2) = 5/6
Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

P(C1) = 2/6   P(C2) = 4/6
Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
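The same entropy computation in Python, reproducing the node examples above.

```python
from math import log2

def entropy(counts):
    """Entropy of a node from its per-class record counts (0 log2 0 taken as 0)."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))            # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```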



Computing Information Gain After Splitting

Gain_split = Entropy(parent) – Σ_i (n_i / n) × Entropy(child i), where n_i is the number of records routed to child i out of the n records at the parent


Problem with large number of partitions
● Node impurity measures tend to prefer splits that result in a large number
of partitions, each being small but pure
– Customer ID has the highest information gain because the entropy of all
the children is zero



Gain Ratio

Gain Ratio = Gain_split / SplitINFO, with SplitINFO = – Σ_i (n_i / n) log2 (n_i / n)

SplitINFO grows with the number of partitions, so the gain ratio penalizes splits that produce many small partitions (such as a split on Customer ID).


Gain Ratio

SplitINFO = 1.52 SplitINFO = 0.72 SplitINFO = 0.97
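A sketch of SplitINFO and the gain ratio; the partition sizes below are assumed examples, not necessarily the splits behind the 1.52 / 0.72 / 0.97 figures above.

```python
from math import log2

def split_info(sizes):
    """SplitINFO = - sum over partitions of (n_i / n) log2 (n_i / n)."""
    n = sum(sizes)
    return -sum((s / n) * log2(s / n) for s in sizes if s > 0)

def gain_ratio(gain, sizes):
    return gain / split_info(sizes)

print(round(split_info([4, 8, 8]), 2))    # a 3-way split into partitions of 4, 8 and 8 records
print(round(split_info([1] * 20), 2))     # a split on a unique ID: many tiny partitions, large SplitINFO
```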



Measure of Impurity: Classification Error

Error(t) = 1 – max_i p_i(t), the fraction of records at node t that do not belong to the majority class


Computing Error of a Single Node
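The slide's worked figure is not reproduced; below is a sketch of the computation, assuming the same per-class counts used in the earlier Gini and entropy examples.

```python
def classification_error(counts):
    """Error(t) = 1 - max_i p_i(t), from per-class record counts."""
    n = sum(counts)
    return 1.0 - max(counts) / n

print(classification_error([0, 6]))             # 0.0
print(round(classification_error([1, 5]), 3))   # 0.167
print(round(classification_error([2, 4]), 3))   # 0.333
```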



Comparison among Impurity Measures
For a 2-class problem:
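The slide's figure (presumably the usual plot of the three measures against the class fraction p) is not reproduced; the same comparison can be tabulated with a short sketch.

```python
from math import log2

def gini(p):    return 2 * p * (1 - p)
def entropy(p): return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)
def error(p):   return min(p, 1 - p)

# All three impurity measures peak at p = 0.5 and vanish at p = 0 or 1
for p in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
    print(f"p={p:.1f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}  error={error(p):.3f}")
```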



Decision Tree Classification
● Advantages:
○ Relatively inexpensive to construct
○ Extremely fast at classifying unknown records
○ Easy to interpret for small-sized trees
○ Robust to noise
○ Can easily handle redundant attributes, irrelevant attributes
● Disadvantages:
○ Due to the greedy nature of the splitting criterion, interacting attributes (that
can distinguish between classes together but not individually) may be
passed over in favor of other attributes that are less discriminating.
○ Each decision boundary involves only a single attribute
Summary
● Decision Tree Examples
● Decision Tree Algorithms
● Methods for Expressing Test Conditions
● Measures of Node Impurity
○ Gini index
○ Entropy
○ Gain Ratio
○ Classification Error



UET
Since 2004

ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN


VNU-University of Engineering and Technology

Thank you
