
Machine Learning

2EL1730

Lecture 5
Tree-based methods and ensemble learning

Fragkiskos Malliaros and Maria Vakalopoulou

Friday, December 18, 2020


Acknowledgements

• The lecture is partially based on material by


– Richard Zemel, Raquel Urtasun and Sanja Fidler (University of Toronto)
– Chloé-Agathe Azencott (Mines ParisTech)
– Julian McAuley (UC San Diego)
– Dimitris Papailiopoulos (UW-Madison)
– Jure Leskovec, Anand Rajaraman, Jeff Ullman (Stanford Univ.)
• http://www.mmds.org
– Panagiotis Tsaparas (UOI)
– Evimaria Terzi (Boston University)
– Andrew Ng (Stanford University)
– Nina Balcan and Matt Gormley (CMU)
– Ricardo Gutierrez-Osuna (Texas A&M Univ.)
– Tan, Steinbach, Kumar
• Introduction to Data Mining

Thank you!
2
Last lecture

3
Non-parametric learning
• Non-parametric learning algorithm (does not mean NO parameters)
• The complexity of the decision function grows with the number of data
points

• Contrast with linear/logistic regression (≈ as many parameters as features)

• Usually: the decision function is expressed directly in terms of the training
examples

• Examples:
• k-nearest neighbors (last lecture)
• Tree-based methods
• Some cases of SVMs

4
k-Nearest Neighbors (kNN) Algorithm

1NN: every example in the blue shaded area will be misclassified as the blue class.
3NN: every example in the blue shaded area will be classified correctly as the red class.

• 1NN is sensitive to mis-labeled data (‘class noise’)

• Consider the vote of the k nearest neighbors (majority vote)

Algorithm kNN
• Find the k examples (x*_i, y*_i), i = 1, …, k, closest to the test instance x
• The output is the majority class
5
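As a quick refresher on the recap above, below is a minimal kNN sketch in Python (NumPy only). The toy 2-D points and the choice of Euclidean distance are illustrative assumptions, not taken from the slides.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify x_test by majority vote among its k nearest training points."""
    # Euclidean distance from the test instance to every training example
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k closest examples (x*_i, y*_i)
    nearest = np.argsort(dists)[:k]
    # Majority class among the k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2-D data: two small clusters (illustrative values)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X, y, np.array([0.15, 0.15]), k=1))  # -> 0
print(knn_predict(X, y, np.array([0.95, 1.00]), k=3))  # -> 1
```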
Choice of Parameter k

• Small k: noisy decision
– The idea behind using more than one neighbor is to average out the
noise
• Large k
– May lead to better prediction performance
– If we set k too large, we may end up looking at samples that are not
neighbors (are far away from the point of interest)
– Also, computationally intensive
– Extreme case: set k=m (number of points in the dataset)
• For classification: the majority class
• For regression: the average value

6
In this Lecture

• Decision trees
• Ensemble learning
– Bagging methods
– Boosting methods
• AdaBoost

7
Decision trees

8
Another Classification Idea (1/2)

• We learned about linear classification (e.g., logistic regression), LDA,
and nearest neighbors
• Any other idea?
• Pick an attribute, do a simple test
• Conditioned on the outcome, pick another attribute and do another test
– This creates a tree
• In the leaves, assign a class by majority vote
• Do the same for the other branches

9
Another Classification Idea (2/2)

• Gives axis-aligned decision boundaries

10
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes):

Refund?
├─ Yes → NO
└─ No  → MarSt?
         ├─ Single, Divorced → TaxInc?
         │                     ├─ < 80K → NO
         │                     └─ > 80K → YES
         └─ Married → NO

11
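As a hedged illustration, the sketch below fits scikit-learn's DecisionTreeClassifier to the same 10-record table, one-hot encoding the categorical attributes since scikit-learn trees expect numeric input. The learned tree is not guaranteed to match the one drawn on the slide; as the next slide notes, more than one tree can fit the same data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The 10-record training table from the slide (income in thousands)
data = pd.DataFrame({
    "Refund":  ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Marital": ["Single", "Married", "Single", "Married", "Divorced", "Married",
                "Divorced", "Single", "Married", "Single"],
    "Income":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Cheat":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# One-hot encode the categorical attributes
X = pd.get_dummies(data[["Refund", "Marital", "Income"]])
y = data["Cheat"]

clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(clf, feature_names=list(X.columns)))
```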
Another Example of Decision Tree

Same training data as before (Tid, Refund, Marital Status, Taxable Income, Cheat).

MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
                      ├─ Yes → NO
                      └─ No  → TaxInc?
                               ├─ < 80K → NO
                               └─ > 80K → YES

There could be more than one tree that fits the same data!

12
Decision Trees – Nodes and Branching

• Internal nodes correspond to test attributes
• Branching is determined by the attribute value
• Leaf nodes are outputs (class assignments)
– YES or NO in this example

(The tree shown is the MarSt-first tree from the previous slide.)

13
Decision Tree Classification Task

Induction: a tree induction algorithm learns a model (the decision tree) from the
Training Set – records 1–10 with attributes Attrib1, Attrib2, Attrib3 and a known Class.

Deduction: the learned model is then applied to the Test Set – records 11–15, whose
Class is unknown – to predict their class labels.

14
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch that matches the test record
at each node:
• Refund = No → move to the MarSt node
• MarSt = Married → reach the leaf NO

Assign Cheat to “No”

15–20
Decision Tree Induction – The Idea (1/2)

• Basic algorithm
– Tree is constructed in a top-down recursive manner
– Initially, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized
in advance)
– Examples are partitioned recursively based on the selected
attributes
– Split attributes are selected on the basis of a heuristic or statistical
measure (e.g., Gini index, information gain)

• Most commercial DTs use variations of this algorithm

21
Decision Tree Induction – The Idea (2/2)

• Simple, greedy, recursive approach that builds up the tree node by node
(see the Python sketch after this slide)

1. Pick an attribute to split on at a non-terminal node
2. Split the examples into groups based on the attribute value
3. For each group:
– If there are no examples – return the majority class of the parent node
– Else, if all examples are in the same class – return that class
– Else, go back to step 1

22
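A minimal sketch of the greedy recursive procedure above, in Python. To keep it short, attributes are taken in a fixed, user-given order rather than being chosen by an impurity measure (the Gini index and information gain are introduced later); the toy rows at the bottom are illustrative.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attributes, parent_labels=None):
    """Greedy recursive induction over categorical attributes."""
    if not rows:                                  # no examples: majority of parent
        return majority(parent_labels)
    if len(set(labels)) == 1:                     # all examples in the same class
        return labels[0]
    if not attributes:                            # nothing left to split on
        return majority(labels)
    attr = attributes[0]                          # 1. pick an attribute to split on
    node = {"split": attr, "children": {}}
    for value in set(row[attr] for row in rows):  # 2. split examples by attribute value
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        node["children"][value] = build_tree(     # 3. recurse on each group
            sub_rows, sub_labels, attributes[1:], parent_labels=labels)
    return node

rows = [{"Refund": "Yes", "Marital": "Single"},
        {"Refund": "No",  "Marital": "Married"},
        {"Refund": "No",  "Marital": "Single"}]
print(build_tree(rows, ["No", "No", "Yes"], ["Refund", "Marital"]))
```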
Decision Tree Induction Algorithms

• Many algorithms
– Hunt’s algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT

23
General Structure of Hunt’s Algorithm

• Let D_t be the set of training records that reach a node t
• General procedure:
– If D_t contains records that all belong to the same class y_t, then t is a
leaf node labeled as y_t
– If D_t is an empty set, then t is a leaf node labeled by the default class y_d
– If D_t contains records that belong to more than one class, use an
attribute test to split the data into smaller subsets
• Recursively apply the procedure to each subset

(Illustrated on the Refund / Marital Status / Taxable Income / Cheat training
table from slide 11.)

24
Hunt’s Algorithm (applied to the Refund / Marital Status / Taxable Income / Cheat data)

Step 1: split on Refund
  Refund = Yes → Cheat = No
  Refund = No  → Cheat = Yes (node still impure)

Step 2: split the Refund = No branch on Marital Status
  Refund = Yes → Cheat = No
  Refund = No, Marital Status = Single or Divorced → Cheat = Yes (node still impure)
  Refund = No, Marital Status = Married            → Cheat = No

Step 3: split the Single/Divorced branch on Taxable Income
  Refund = Yes → Cheat = No
  Refund = No, Marital Status = Married → Cheat = No
  Refund = No, Marital Status = Single or Divorced, Taxable Income < 80K  → Cheat = No
  Refund = No, Marital Status = Single or Divorced, Taxable Income >= 80K → Cheat = Yes

25–27
Tree Induction

• Greedy strategy
– Split the records based on an attribute test that
optimizes a certain criterion

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting

28
How to Specify Test Condition?

• Depends on attribute types


– Nominal (categorical)
• E.g., female, male
– Ordinal (categorical, but there is an ordering)
• E.g., economic status: low, medium, high
– Continuous

• Depends on the number of ways to split


– 2-way split
– Multi-way split

29
Splitting Based on Nominal Attributes

• Multi-way split: use as many partitions as there are distinct values
– CarType → {Family}, {Sport}, {Luxury} (no ordering of the values)

• Binary split: divides the values into two subsets
(need to find the optimal partitioning)
– CarType → {Sport, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sport}

30
Splitting Based on Ordinal Attributes

• Multi-way split: use as many partitions as there are distinct values
– Size → {Small}, {Medium}, {Large}

• Binary split: divides the values into two subsets, respecting the order
(need to find the optimal partitioning)
– Size → {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}

31
Splitting Based on Continuous Attributes

• Different ways of handling


– Discretization to form an ordinal categorical
attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval
bucketing, equal frequency bucketing
(percentiles), or clustering.

– Binary decision: (A < v) or (A ≥ v)
• Considers all possible splits and finds the best cut
• Can be more computationally intensive
32
Splitting Based on Continuous Attributes

(i) Binary split: Taxable Income > 80K? → Yes / No
(ii) Multi-way split: Taxable Income? → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K

33
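A small sketch of both options on a continuous attribute, using pandas. The income values and bin edges mirror the Taxable Income examples on the slides; equal-frequency bucketing (percentiles) is shown with qcut.

```python
import pandas as pd

income = pd.Series([125, 100, 70, 120, 95, 60, 220, 85, 75, 90])  # Taxable Income (K)

# (i) Binary split: a single threshold test (A < v) vs. (A >= v)
binary = income >= 80          # "Taxable Income > 80K?"-style test
print(binary.values)

# (ii) Multi-way split via static discretization into ordinal buckets
buckets = pd.cut(income, bins=[0, 10, 25, 50, 80, float("inf")],
                 labels=["<10K", "[10K,25K)", "[25K,50K)", "[50K,80K)", ">80K"],
                 right=False)
print(buckets.value_counts())

# Equal-frequency bucketing (percentiles) is a dynamic alternative
print(pd.qcut(income, q=4).value_counts())
```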
Tree Induction

• Greedy strategy
– Split the records based on an attribute test that
optimizes a certain criterion

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting

34
How to Determine the Best Split (1/2)

Before splitting: 10 records of class C0 and 10 records of class C1.
Suppose that we want to predict whether a student will get a tax return or not.

Candidate test conditions:
• Own Car?    Yes: C0: 6, C1: 4    No: C0: 4, C1: 6
• Car Type?   Family: C0: 1, C1: 3    Sports: C0: 8, C1: 0    Luxury: C0: 1, C1: 7
• Student ID? c1: C0: 1, C1: 0 … c10: C0: 1, C1: 0    c11: C0: 0, C1: 1 … c20: C0: 0, C1: 1

Which test condition is the best? The one that gives the purer partitions.

35–36
How to Determine the Best Split (2/2)

• Greedy approach:
– Nodes with a homogeneous (pure) class distribution are preferred
• Need a measure of node impurity:
– C0: 5, C1: 5 → non-homogeneous, high degree of impurity (very impure)
– C0: 9, C1: 1 → homogeneous, low degree of impurity (close to pure)

37


Measures of Node Impurity

• Gini Index

• Entropy and Information Gain

• Misclassification error

38
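The three measures listed above can be computed in a few lines of NumPy. This is a sketch, not lecture code; the example class counts mirror the very impure / medium / pure nodes of the previous slide.

```python
import numpy as np

def impurity(counts):
    """Gini index, entropy, and misclassification error for one node,
    given the class counts at that node."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    gini = 1.0 - np.sum(p ** 2)
    entropy = -np.sum([pi * np.log2(pi) for pi in p if pi > 0])
    misclass = 1.0 - p.max()
    return gini, entropy, misclass

for counts in [(5, 5), (9, 1), (10, 0)]:   # very impure, medium, pure
    print(counts, impurity(counts))
```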
How to Find the Best Split – Gini Index

Before splitting the node, the class counts are C0: N00 and C1: N01, giving impurity M0.

Two possible splits, on attribute A or on attribute B (both with Yes/No outcomes):
• Split on A: node N1 (C0: N10, C1: N11) with impurity M1, and node N2 (C0: N20, C1: N21)
  with impurity M2; their weighted combination gives impurity M12
• Split on B: node N3 (C0: N30, C1: N31) with impurity M3, and node N4 (C0: N40, C1: N41)
  with impurity M4; their weighted combination gives impurity M34

Gain (reduction in impurity): compare Gain_A = M0 – M12 vs. Gain_B = M0 – M34

39–42
Measure of Impurity: Gini Index

Gini index for a node t:

  Gini(t) = 1 – Σ_j [p(j | t)]²

• J is the number of classes
• p(j | t) is the relative frequency of class j at node t

Examples (nodes with 6 records each):

  C1: 0, C2: 6   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                 Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0

  C1: 1, C2: 5   P(C1) = 1/6, P(C2) = 5/6
                 Gini = 1 – (1/6)² – (5/6)² = 0.278

  C1: 2, C2: 4   P(C1) = 2/6, P(C2) = 4/6
                 Gini = 1 – (2/6)² – (4/6)² = 0.444

Select the split with the smallest Gini index

43
Splitting based on Gini Index

• When a node t is split on attribute A into k partitions (children), the quality of
  the split is computed as the weighted Gini of the children:

  Gini_split(A) = Σ_(i=1..k) (n_i / n) · Gini(i)

  where n_i is the number of records at child i and n is the number of records at node t

• The reduction in impurity (gain) after splitting node t on attribute A is computed as

  Gain(A) = Gini(t) – Gini_split(A)

• The attribute with the smallest Gini_split (equivalently, the largest reduction in
  impurity) is chosen to split the node

44
Example

• Split into two partitions on attribute B – goal: reduce impurity

Parent node: C1: 6, C2: 6, Gini = 0.500

Split on attribute B:        N1    N2
                      C1      5     1
                      C2      2     4

Gini(N1) = 1 – (5/7)² – (2/7)² = 0.408
Gini(N2) = 1 – (1/5)² – (4/5)² = 0.320

Gini_B = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

45
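The numbers on this slide can be reproduced with a short Python sketch of the Gini index and the weighted split quality:

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split: Gini_split = sum_i (n_i / n) * Gini(child i)."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

parent = (6, 6)              # C1: 6, C2: 6
n1, n2 = (5, 2), (1, 4)      # split on attribute B: nodes N1 and N2 from the slide

print(round(gini(parent), 3))                          # 0.5
print(round(gini(n1), 3), round(gini(n2), 3))          # 0.408 0.32
print(round(gini_split([n1, n2]), 3))                  # 0.371
print(round(gini(parent) - gini_split([n1, n2]), 3))   # gain = reduction in impurity
```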
Tree Induction

• Greedy strategy
– Split the records based on an attribute test that optimizes a certain
criterion

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting

46
Stopping Criteria for Tree Induction

• Stop expanding a node when all the records belong to the same class

• Stop expanding a node when all the records have similar attribute values

47
Decision Tree Based Classification

• Advantages
– Inexpensive to construct (training phase)
– Extremely fast at testing phase (classifying unseen data)
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification techniques for
many simple data sets

48
Overfitting and Tree Pruning

• Overfitting: An induced tree may overfit the training data


– Too many branches, some may reflect anomalies due to noise or
outliers

• Two approaches to avoid overfitting


– Pre-pruning: Halt tree construction early; do not split a node if this
would result in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
– Post-pruning: Remove branches from a fully grown tree
• Get a sequence of progressively pruned trees

49
scikit-learn

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

50
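A minimal usage sketch of the DecisionTreeClassifier linked above. The built-in iris data and the hyperparameter values are illustrative choices, not from the lecture; max_depth and min_samples_leaf act as the pre-pruning controls mentioned on the previous slide.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth and min_samples_leaf serve as pre-pruning (early stopping) controls
clf = DecisionTreeClassifier(criterion="gini", max_depth=3,
                             min_samples_leaf=5, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```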
Ensemble learning

51
Ensemble Methods

• Typical application: classification


• Ensemble of classifiers: set of classifiers whose individual
decisions are combined in some way to classify new examples
• Simplest approach:
1. Generate multiple classifiers (e.g., decision trees, logistic
regression)
2. Each classifier votes (decides) on a test instance
3. Take majority as classification
• Classifiers are different due to different sampling of training data,
or randomized parameters within the classification algorithm

52
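A sketch of this "simplest approach" on synthetic data, assuming three off-the-shelf scikit-learn classifiers; scikit-learn's VotingClassifier packages the same idea.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

# Step 1: generate multiple, different classifiers
models = [DecisionTreeClassifier(random_state=0),
          LogisticRegression(max_iter=1000),
          KNeighborsClassifier(n_neighbors=5)]
for m in models:
    m.fit(X_train, y_train)

# Steps 2-3: each classifier votes, take the majority as the prediction
votes = np.stack([m.predict(X_test) for m in models])   # shape (3, n_test)
majority = (votes.sum(axis=0) >= 2).astype(int)          # binary majority vote
print("ensemble accuracy:", (majority == y_test).mean())
```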
Ensemble Learning – General Idea

Original training data D
Step 1: Create multiple data sets D_1, D_2, …, D_(t-1), D_t
Step 2: Build multiple classifiers C_1, C_2, …, C_(t-1), C_t
Step 3: Combine the classifiers into C*

53
Ensemble Methods: Summary

• Differ in training strategy and combination method

• Bagging (bootstrap aggregation)


– Random sampling with replacement
– Train separate models on overlapping training sets, average their
predictions
– E.g., random forest classifier

• Boosting
– Sequential training, iteratively re-weighting the training examples –
the current classifier focuses on hard examples
– E.g., AdaBoost

54
Bagging vs. Boosting

Source: https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/
55
Bagging: Bootstrap Estimation

• Repeatedly draw n samples from D


• For each set of samples, estimate a statistic
• The bootstrap estimate is the mean of the individual estimates
• Used to estimate a statistic (parameter) and its variance
• Bagging: bootstrap aggregation (Breiman 1994)

56
Bagging

• Simple idea
– Generate M bootstrap samples from your original training set
– Train a model on each one to get a predictor y_m, and average the predictors

• Each bootstrap sample is drawn with replacement


– Each one contains some duplicates of certain training points and
leaves out other training points completely

• For regression: average the predictions


• For classification: average class probabilities or take majority
vote
57
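A hedged sketch of bagging by hand, on synthetic data: draw M bootstrap samples, train a tree on each, and average the class probabilities (a majority vote would work as well).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

M = 25                      # number of bootstrap samples / models
probas = []
for m in range(M):
    # Bootstrap sample: draw n points with replacement (duplicates some, omits others)
    idx = rng.randint(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=m).fit(X_train[idx], y_train[idx])
    probas.append(tree.predict_proba(X_test))

# For classification: average the class probabilities over the M trees
avg = np.mean(probas, axis=0)
y_pred = avg.argmax(axis=1)
print("bagged accuracy:", (y_pred == y_test).mean())
```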
Random Forest

• The random forest classifier is an extension of bagging which uses
de-correlated trees

Create bootstrap samples from the training data (N examples, M features) and
construct a decision tree on each sample.
58
Random Forest

• The random forest classifier is an extension of bagging which uses
de-correlated trees

Create bootstrap samples from the training data (N examples, M features),
construct a decision tree on each sample, and take the majority vote over
the trees.
59
Random Forest - Algorithm

• Input: training data D with N cases and M variables, and labels Y
• Output: the model with the best accuracy

1. Select m variables at random (m ≤ M) and use them to grow the decision tree
2. Choose the training set for this tree using bootstrap sampling
3. Repeat the process k times, as many times as the number of trees

• For prediction, a new sample is pushed down each tree and is assigned the
label of the training samples in the leaf it ends up in. This process is
repeated over all trees in the ensemble, and the average vote of all trees
is reported as the final prediction
60
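The same recipe is available off the shelf as scikit-learn's RandomForestClassifier; a minimal usage sketch follows (the dataset and hyperparameter values are illustrative).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# n_estimators = k trees; max_features = m <= M variables considered at each split;
# bootstrap=True gives each tree its own bootstrap sample of the training data
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())
```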
Boosting

• Also works by manipulating the training set, but the individual classifiers
are trained sequentially

• Each classifier is trained based on knowledge of the performance of
previously trained classifiers
– Focus on difficult examples

• Final classifier: weighted sum of the component classifiers

61
AdaBoost: Making Weak Learners Stronger

• Suppose you have a weak learning module (a base classifier) that can always
get a fraction 0.5 + ε of the cases correct when given a binary classification task
– A little bit better than a random classifier
– A weak learner

• Can you apply this learning module many times to get a strong
learner that can get close to zero error rate on the training data?
– ML theorists showed how to do this, and it actually led to an
effective new learning procedure (Freund & Schapire, 1996)
– AdaBoost

62
AdaBoost – The Idea

• Train T weak learners (models) sequentially

• First train the base classifier on all the training data with equal
importance weights on each case
• Then, re-weight the training data to emphasize the hard cases
and train a second model
– Instances that were misclassified in the previous step
– Q: How do we re-weight the data?
• Keep training new models on the re-weighted data
• Finally, use a weighted committee of all the models for the test
data
– How do we weight the models in the committee?

63
How to Train Each Classifier

• Input: training examples (x_i, y_i), i = 1, …, N, with labels y_i ∈ {-1, +1}
• Output: a weak classifier h_t(x) ∈ {-1, +1}
• Weight of instance (e.g., data point) x_i for classifier t: w_i^t

• Cost function for classifier t:

  J_t = Σ_i w_i^t · [h_t(x_i) ≠ y_i]       (the bracket is 1 if error, 0 otherwise)

64
Weight of Instances for Classifier t

• Weighted error rate of classifier t:

  γ_t = Σ_i w_i^t · [h_t(x_i) ≠ y_i] / Σ_i w_i^t

• The quality (coefficient) of classifier t is

  α_t = ln((1 - γ_t) / γ_t)

  It is zero if the classifier has a weighted error rate of 0.5, and grows to
  infinity as the classifier approaches perfection

• Update the weights for the next classifier:

  w_i^(t+1) = w_i^t · exp(α_t · [h_t(x_i) ≠ y_i])       (the bracket is 1 if error, 0 otherwise)

The weights ‘inform’ the training of the weak learner
(decision trees can be grown that favor splitting sets of samples with high weights)

65
How to Make Predictions

• Weight the binary prediction of each classifier by the quality (coefficient)
of that classifier:

  y(x) = sign( Σ_t α_t · h_t(x) )

66
AdaBoost Pseudocode

(Pseudocode: at each round, compute the weighted error of the base learner and
examine whether any errors were encountered; see the Python sketch below.)

67
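A minimal AdaBoost sketch in Python, assuming depth-1 trees (decision stumps) as the weak learners and using the coefficient α_t = ln((1 - γ_t) / γ_t) from slide 65; the synthetic data at the bottom is illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=20):
    """AdaBoost sketch with decision stumps; labels y must be in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)                         # equal importance weights at the start
    learners, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)            # the weights inform the weak learner
        pred = stump.predict(X)
        miss = (pred != y).astype(float)            # 1 if error, 0 otherwise
        gamma = np.dot(w, miss) / w.sum()           # weighted error rate
        if gamma >= 0.5:                            # no better than chance: stop
            break
        alpha = np.log((1 - gamma) / (gamma + 1e-12))
        w = w * np.exp(alpha * miss)                # emphasize the hard (misclassified) cases
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    # Weighted committee: sign of the alpha-weighted sum of the weak predictions
    scores = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(scores)

# Tiny demo on synthetic, non-linearly-separable data (illustrative)
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)          # XOR-like problem
learners, alphas = adaboost_fit(X, y, T=50)
print("training accuracy:", (adaboost_predict(learners, alphas, X) == y).mean())
```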
scikit-learn

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

http://scikit-learn.org/stable/modules/ensemble.html

68
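A minimal usage sketch of the AdaBoostClassifier linked above, on a built-in dataset (the dataset and settings are illustrative choices); its default weak learner is a depth-1 decision tree.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 100 boosting rounds; the default base estimator is a decision stump
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())
```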
Next Class

• Support Vector Machines

69
Thank You!

70
