ML_Course_15-17

The document covers K-Nearest Neighbor (KNN) learning and Decision Tree learning, detailing their classification and regression methods. KNN uses proximity to classify or predict values based on the majority vote or average of nearest neighbors, while Decision Trees recursively split data based on feature attributes to create a model. Various algorithms such as ID3, C4.5, and CART are discussed, highlighting their approaches to splitting criteria and handling of data attributes.

Prof. Dr. Zeki YETGİN, Mersin Univ., Machine Learning Course Book (Content is copyrighted)

6. K-Nearest Neighbor Learning

6.1. K-Nearest Neighbor Classification (KNN)


KNN can be considered a non-linear, (weakly) generative and non-parametric model.
KNN finds the nearest K samples around a given point in the feature space and
assigns the given sample to the majority-voted class as the most probable class. The
parameter K is an input parameter and is not affected by the learning stage. KNN has some
information about the data distribution around each point. From this perspective,
KNN could be considered a generative model in the sense that it can crudely generate the
data distribution of each class's samples. Finding the K nearest neighbors is also used
as a method to compute a density estimation p(x | ci) for Bayes-based classifiers,
which are explained in later chapters.

Basic Steps:

1- Given a sample x in the feature space, find the nearest K points around x.

2- Find the majority vote among those K points.

Training: Training means finding the K nearest points around all possible points.
Normally, the complexity of this step is O(n²), but it can be reduced with
algorithms like BallTree, KDTree, etc., which use tree-based data structures to
find and store the nearest points.

Feature space example: which class should x be assigned for K = 6?

x = Majority Vote (decision of the nearest K examples)

x is assigned to the red class in this example.

Similarity metrics: To find the nearest K points, distance or similarity metrics
(Euclidean, cosine, Mahalanobis, Manhattan, etc.) are used.
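As an illustration of how the chosen metric changes the neighborhood, the following minimal sketch (using sklearn's NearestNeighbors on a small made-up 2D dataset; the points and the query are arbitrary values for the example) finds the K = 3 nearest points under the Euclidean and Manhattan metrics.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# small made-up 2D dataset and a query point (illustrative values only)
X = np.array([[0, 0], [1, 1], [2, 0], [0, 2], [3, 3], [1, 0]])
x_query = np.array([[1.2, 0.8]])

for metric in ("euclidean", "manhattan"):
    nn = NearestNeighbors(n_neighbors=3, metric=metric)
    nn.fit(X)                              # builds the search structure
    dist, idx = nn.kneighbors(x_query)     # K nearest points of x_query
    print(metric, "neighbors:", idx[0], "distances:", dist[0])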

KNN classifier with Python Sklearn


The KNeighborsClassifier class can be used with the following basic parameters:

• n_neighbors: number of nearest neighbors (K) to consider

• algorithm: the algorithm that finds the nearest K neighbors of each point
during the KNN fit process. BallTree ('ball_tree') and KDTree ('kd_tree') are two
well-known algorithms that find the nearest neighbors in an unsupervised
manner (they do not use output labels at this stage). The default algorithm = 'auto',
meaning the algorithm is automatically selected after analysing the data.
• metric: distance metric used to calculate the distance between
points. This metric is used to find the nearest K neighbors.
• predict: finds the majority vote of the K neighbors for the given sample.
• Additional parameters can be used depending on the selected parameters.

Example: Classification on the iris dataset with KNN (K=3)

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3, algorithm='ball_tree')
model.fit(X, y)  # stores the nearest points in a tree-like structure inside the model
fx = model.predict(X)  # predict the outputs of the training samples
print(accuracy_score(y, fx))

6.2. KNN Regression


Similarly, the K nearest neighbors of each point are found during the training
process (fit) with the selected algorithm ('kd_tree', 'ball_tree', etc.) given as a
parameter, and the final value is the average of the K neighbors. Thus, instead of a
majority vote, the majority value (average value) is used for regression.

x is given! → f(x) = ? for K = 5

f(x) = average value of the K neighbors

(the K training examples around x in the feature space)
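To make the averaging step concrete, here is a minimal sketch (assuming a tiny made-up 1D regression set; the values are illustrative only) that computes f(x) by hand as the mean of the K = 3 nearest neighbors' targets and checks the result against sklearn's KNeighborsRegressor.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# tiny made-up 1D regression data (illustrative values only)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])
x_query = np.array([[2.4]])

# manual KNN regression: average the targets of the K nearest points
K = 3
dist = np.abs(X.ravel() - x_query.ravel()[0])
nearest = np.argsort(dist)[:K]
print("manual f(x) =", y[nearest].mean())

# same prediction with sklearn
model = KNeighborsRegressor(n_neighbors=K).fit(X, y)
print("sklearn f(x) =", model.predict(x_query)[0])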



KNN Regression with Python Sklearn


The KNeighborsRegressor class can be used with the following basic parameters:

• n_neighbors: number of nearest neighbors (K)

• algorithm: the algorithm that finds the nearest K neighbors of each point
during the KNN fit process. Similarly, the possible algorithms are {'auto', 'ball_tree',
'kd_tree', 'brute'} with default='auto'.
• metric: distance metric used to calculate the distance between
points. This metric is used to find the nearest K neighbors.
• predict: finds the average f value of the K neighbors for the given sample.
• Additional parameters can be used depending on the selected parameters.

Example: Regression with KNN Regressor

from sklearn.neighbors import KNeighborsRegressor

X, Y = load_xxx  # load any regression dataset
model = KNeighborsRegressor(n_neighbors=3, algorithm='auto')  # 'auto' is the default
model.fit(X, Y)
print("R2 score: %f" % model.score(X, Y))  # R2 score for the training set
# score calls predict internally

7. Decision Tree Learning


A decision tree can be considered a non-linear, discriminative and non-parametric
model. The decision tree algorithm recursively splits the training set into disjoint
subsets. Each split occurs on a feature (or attribute), and the best feature to split
on is found using various methods (gini, mse, etc.). As a result of each split, a new
branch is added to the current node of the decision tree. This splitting continues until
the subsets reach a certain purity (e.g. same class), at which point a subset becomes a leaf
of the tree, and the leaves are assigned an output by majority vote (classification) or
average value (regression). The skeleton of decision tree learning is almost the same
for classification and regression; the splitting criteria and the way the final
outputs are formed are the basic differences.

7.1. Decision Tree Classification


Different algorithms are used to determine the "best" split at a node. ID3, C4.5, C5
and CART (Classification And Regression Tree) are some decision-tree-based
classification algorithms. ID3 and its extension C4.5 are classifiers, whereas CART and
C5 (an extension of C4.5) can handle both classification and regression tasks. ID3
assumes categoric features and splits until the leaves are pure w.r.t. the output class. The
basic issue of ID3 is the tree size: a big tree easily overfits the data, and thus a small
tree is usually preferred in ID3. In general, finding the correctly sized tree is a
research issue for tree-based learners. CART uses a binary tree (it always splits into left
and right) and works on both categoric and numeric features. C4.5 is the extended
version of ID3: it can handle impure leaves, numeric/categoric features and
splitting into many branches. The algorithms above basically differ in their splitting criteria
and the assumed structure of the tree.

Example: let the training set contain only categoric features, age and weight.

Two solutions are shown below, where the smaller tree is better. A decision tree
classification can be represented as a logical function, which involves a set of binary
rules (a disjunction of conjunctions) to predict the output.

Solution-1: fhappy = ((age=old) AND (weight=normal)) OR ((age=young) AND (weight=normal))

Solution-2: fhappy = (weight=normal)

Example: let the training set contain categoric and numeric features, where [TI, PE] are
the features and Response is the target variable (class).

Problems (best splitting):
i- which feature to split on?
ii- which value to split on?
iii- the meaning of a leaf (when splitting is no longer needed)?
iv- the size of the tree?

These problems are addressed by the decision tree algorithms themselves.

7.1.1. Inductive Decision Tree Learning (ID3)


ID3 finds a tree that is fully consistent with the training samples (splitting continues until all leaves
are pure). Features must be of categoric type. ID3 recursively chooses the "most significant" feature
as the root of the (sub)tree. A greedy search through the space of possible branches is followed.

ID3(S, features):
    If Entropy(S) = 0 Then Add_Leaf_to_Tree(Label(S)) and return
    If features = ∅ Then Add_Leaf_to_Tree(Majority_Vote(S)) and return
    Best = argmax_{x ∈ features} InformationGain(S, x)
    Add_Node_to_Tree(Best)
    features = features - {Best}
    for each value of Best:
        Add_Branch_to_Tree(value)
        ID3(S_value, features)

Choosing the Best Attribute (splitting criteria): use Information Gain with Entropy.

Entropy measures the average amount of information or impurity of the samples in a
set S with respect to the output classes. Let S = {Np+, Nn-} contain Np positive and Nn
negative samples, respectively, with N = Np + Nn. Then Entropy(S) is defined as

Entropy(S) = -(Np/N) log(Np/N) - (Nn/N) log(Nn/N) = -Σ_i P_i log P_i

where P_i is the probability of the i-th class.

Example: Entropy of a set S where only two classes are considered. Entropy(S) is zero
if all samples in S belong to the same class.

Information Gain(S, x) measures the expected reduction in entropy due to splitting on
feature x. Let S_{x=v} be the subset due to splitting on x = v (feature = value); that is, the
subset S_{x=v} selects the samples of S whose feature x has value v.

For a parent set S split into child sets S_{x=v1}, S_{x=v2}, ..., S_{x=vn}:

E(S) = total entropy of the parent
E(S, x) = average entropy of the children

Gain(S, x) = E(S) - E(S, x)

E(S, x) = Σ_{i: values of x} (|S_{x=vi}| / |S|) · Entropy(S_{x=vi})

A higher Gain(S, x) is better. Higher gain means lower entropy of an average child
(meaning the child sets are getting purer).
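The formulas above translate directly into code. The following minimal sketch (a generic helper, not part of any library; the two-class example set is made up for illustration) computes Entropy(S), the weighted child entropy E(S, x) and Gain(S, x) for a categoric feature.

import numpy as np

def entropy(labels):
    # Entropy(S) = -sum_i P_i log2 P_i over the class probabilities in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    # Gain(S, x) = Entropy(S) - sum_v |S_{x=v}|/|S| * Entropy(S_{x=v})
    total = entropy(labels)
    n = len(labels)
    child = 0.0
    for v in np.unique(feature_values):
        mask = feature_values == v
        child += mask.sum() / n * entropy(labels[mask])
    return total - child

# tiny made-up example: feature x with values a/b, two output classes (+/-)
x = np.array(["a", "a", "a", "b", "b", "b"])
y = np.array(["+", "+", "-", "-", "-", "-"])
print("Entropy(S) =", entropy(y))            # impurity of the parent set
print("Gain(S, x) =", information_gain(x, y))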

Example: A person's decisions on playing tennis are recorded for 14 days of various
weather conditions and given in the table below. Train the agent using inductive decision
tree learning and provide the final decision tree.

Basic steps of the solution:

1- Find the best feature of S, i.e. the one with the maximum information gain after splitting,
and add it as a node to the tree:

best = argmax_{x ∈ features} Gain(S, x)

2- For each value of the best feature,
add a new branch to the tree,
replace S with S_value and go to step 1.

What comes next here? → Addressed by the next recursive call.

The final decision tree is shown below.
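Since the data table itself is not reproduced here, the following sketch assumes the standard 14-day PlayTennis dataset (Outlook, Temperature, Humidity, Wind → PlayTennis) commonly used with ID3, and computes the root-level information gains; with that table, Outlook gives the largest gain and becomes the root. Check the values against the course table.

import numpy as np

def H(y):
    # entropy of a label array
    _, c = np.unique(y, return_counts=True)
    p = c / c.sum()
    return -(p * np.log2(p)).sum()

def gain(x, y):
    # information gain of categoric feature x w.r.t. labels y
    return H(y) - sum((x == v).mean() * H(y[x == v]) for v in np.unique(x))

# Standard 14-day PlayTennis table (assumed here; verify against the course table).
outlook  = np.array(["sunny","sunny","overcast","rain","rain","rain","overcast",
                     "sunny","sunny","rain","sunny","overcast","overcast","rain"])
temp     = np.array(["hot","hot","hot","mild","cool","cool","cool",
                     "mild","cool","mild","mild","mild","hot","mild"])
humidity = np.array(["high","high","high","high","normal","normal","normal",
                     "high","normal","normal","normal","high","normal","high"])
wind     = np.array(["weak","strong","weak","weak","weak","strong","strong",
                     "weak","weak","weak","strong","strong","weak","strong"])
play     = np.array(["no","no","yes","yes","yes","no","yes",
                     "no","yes","yes","yes","yes","yes","no"])

for name, feat in [("Outlook", outlook), ("Temperature", temp),
                   ("Humidity", humidity), ("Wind", wind)]:
    print(name, "Gain =", round(gain(feat, play), 3))
# with this table, Outlook has the largest gain (about 0.247), so Outlook becomes the root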



7.1.2. C4.5 Algorithm


C4.5 is the extended version of ID3. It can handle leaf subsets that are not
necessarily pure, and it also handles numeric (continuous) features by converting
them to categoric values. Converting numeric values to categoric ones is an
important issue and is solved on the basis of information theory. Choosing the best
attribute (splitting criterion) is based on the Gain Ratio, rather than the Information Gain of
ID3. In contrast to CART, which always grows a binary tree, C4.5 can grow an arbitrary
tree by splitting into many branches. It handles the basic weakness of ID3, the
overfitting problem, by pruning the tree. There are two ways of pruning:
prepruning and postpruning.

In prepruning, splitting is stopped when it is no longer statistically significant
or no longer contributes to the classification ability. The aim is to stop earlier, before
the perfect classification. However, it is hard to estimate when to stop splitting.
Postpruning lets the tree grow fully and later prunes it gradually until further
pruning no longer contributes to the classification. This approach is more practical.
Reduced error pruning and rule post-pruning are two methods of postpruning.

Reduced error pruning

• Examine each decision node to see whether pruning it decreases the tree's
performance on the evaluation (validation) data; if it does not, prune the node.
• "Pruning" here means replacing a subtree with a leaf labeled with the most common
classification in the subtree.

Rule post-pruning
One of the most popular methods for pruning (e.g., in C4.5). First a full decision
tree is built and represented as a set of if/then rules. Each rule is pruned by
removing any precondition whose removal improves accuracy. Finally, the
pruned rules are sorted by accuracy and used in that order.

Incorporating continuous-valued attributes (ref[1])

Choosing the best attribute (splitting criterion), ref[1]

Which is better: splitting a continuous attribute into two branches or into many? Use the Gain Ratio,
which considers the Information Gain together with the best value split (SplitInformation).

Information gain uses the concept of the entropy of sets and does not consider the
entropy of the attribute itself (how much information the attribute carries).

Entropy of the attribute A (SplitInformation):

Experimentally determined from the training samples.
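The slide formulas referenced above are not reproduced here, so the following sketch spells out the usual C4.5 quantities as an assumption: SplitInformation(S, A) is the entropy of the attribute's own value distribution, and GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A).

import numpy as np

def split_information(x):
    # SplitInformation(S, A): entropy of the attribute's own value distribution
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(info_gain, x):
    # GainRatio = InformationGain / SplitInformation (guard against a zero denominator)
    si = split_information(x)
    return info_gain / si if si > 0 else 0.0

# example: an attribute that splits 14 samples into subsets of sizes 5, 4 and 5
x = np.array(["sunny"] * 5 + ["overcast"] * 4 + ["rain"] * 5)
print("SplitInformation =", round(split_information(x), 3))   # about 1.577
print("GainRatio =", round(gain_ratio(0.247, x), 3))          # assuming Gain = 0.247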

Unknown attribute values (ref[1])

C4.5 can also handle missing values. The basic approach is to fill a missing value
with the most probable value. A more sophisticated method could fill a missing value
based on the probability distribution of all values of the given attribute.
When missing values exist in the dataset, the entropy-based computations need special
handling. The Gain can be multiplied by a factor F that is the ratio of the number of
known values to the total number of values. To compute the GainRatio, the missing values
should be treated as a separate group.
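As a small illustration of the correction factor (a sketch of the idea only, not C4.5's exact bookkeeping), the gain computed on the known values is scaled by F = known / total:

# 14 values of an attribute, 2 of them missing (None)
values = ["a", "b", "a", None, "b", "a", "a", None, "b", "a", "b", "a", "b", "a"]
known = [v for v in values if v is not None]
F = len(known) / len(values)          # ratio of known values to all values
gain_on_known = 0.20                  # assumed gain computed from the known samples only
print("F =", round(F, 3), "adjusted gain =", round(F * gain_on_known, 3))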

7.1.3. CART (Classification and Regression Trees)


CART grows a binary tree. It splits the data into left and right parts using splitting
criteria such as the squared error for regression and the gini index for classification. The
skeleton of CART is almost the same for both regression and classification. The basic
differences are the splitting methods and the handling of leaves: when a subset is
considered a leaf, it is assigned the average value for regression and the majority
vote for classification. The core algorithm for building the decision tree is similar
to ID3, where Information Gain is replaced by Variance Reduction for regression and
Gini Gain for classification.

Choosing the Best Attribute (splitting criteria): variance and gini

Use variance reduction (or alternatively the residual sum) for regression and the Gini gain
(or alternatively the gini index) for classification.

Classification
Gini measures the average amount of impurity of the samples in a
set S with respect to the output classes. Gini(S) is defined as

Gini(S) = Σ_i P_i (1 - P_i), where P_i is the probability of the i-th class.

Gini Gain(S, x) measures the expected reduction in the gini index due to splitting on x.

For a parent set S split into child sets S_left and S_right:

Gini(S) = total gini index of the parent
Gini(S, x) = average gini index of the children

GiniGain(S, x) = Gini(S) - Gini(S, x)

Gini(S, x) = Σ_{i: left, right} (|S_i| / |S|) · Gini(S_i)

A higher Gini Gain is better. Higher gain means a lower gini index of an average child
(meaning the child sets are getting purer). Alternatively, instead of maximizing the Gini
Gain, minimizing Gini(S, x) can also be considered. By ignoring the division by |S|, the following
Gini criterion is obtained, where a smaller value is better:

Gini Criterion = |S_left| · Gini(S_left) + |S_right| · Gini(S_right)
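The following minimal sketch (a generic helper applied to a made-up binary split, not sklearn's internal code) computes Gini(S), the weighted child impurity Gini(S, x) and the Gini gain for one candidate split.

import numpy as np

def gini(labels):
    # Gini(S) = sum_i P_i (1 - P_i) = 1 - sum_i P_i^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# made-up parent set and a candidate binary split (left / right)
y       = np.array(["+", "+", "+", "-", "-", "-", "-", "-"])
y_left  = y[:4]    # {+, +, +, -}
y_right = y[4:]    # {-, -, -, -}

g_parent = gini(y)
g_split  = len(y_left) / len(y) * gini(y_left) + len(y_right) / len(y) * gini(y_right)
print("Gini(S) =", g_parent)        # impurity of the parent
print("Gini(S, x) =", g_split)      # weighted impurity of the children
print("GiniGain =", g_parent - g_split)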



Classification Tree Example on Hepatitis (ref[2]):

...

Regression
Variance (or MSE) measures the average amount of information in a set S with
respect to the mean of the set. Variance Reduction (VR) measures the expected reduction in
variance (MSE) due to splitting on feature x:

VR(S, x) = MSE(S) - Σ_{i: left, right} (|S_i| / |S|) · MSE(S_i)

where the sum is the average MSE of the children.

One can see that in order to maximize the Variance Reduction, one can minimize the
total MSE of the children (the Residual Sum) for regression. Thus, a smaller RSS (Residual
Sum of Squares) is better for splitting. The RSS of a split is defined as

RSS = Σ_{x_i ∈ S_left} (y_i - mean(y_left))² + Σ_{x_i ∈ S_right} (y_i - mean(y_right))²
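As an illustration, the following sketch (generic code on made-up 1D data, not CART's actual implementation) scans the candidate thresholds of a numeric feature and picks the split with the smallest RSS.

import numpy as np

def rss(y):
    # residual sum of squares of a subset around its own mean
    return np.sum((y - y.mean()) ** 2) if len(y) else 0.0

# made-up 1D regression data (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 0.9, 1.0, 3.1, 2.9, 3.0])

best = None
for t in (x[:-1] + x[1:]) / 2:          # candidate thresholds between sorted points
    total = rss(y[x <= t]) + rss(y[x > t])
    if best is None or total < best[1]:
        best = (t, total)
print("best split x <=", best[0], "with RSS =", round(best[1], 3))   # expect x <= 3.5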

Regression Tree Example on Prostate Cancer (ref[2]):

i- Choosing the best x split = 3.67 with RSS = 68.09
ii- Choosing the best y split = 1.05 with RSS = 61.76
→ node0 splits on y = 1.05 (the lower RSS)

iii- Left of node0: the best x split = 3.66 with RSS = 16.11
iv- Left of node0: the best y split = -0.48 with RSS = 13.61
→ node1 splits on y = -0.48

v- Right of node0: the best x split = 3.07 with RSS = 27.15
vi- Right of node0: the best y split = 2.79 with RSS = 25.11
→ node2 splits on y = 2.79

...
If we complete the training, we obtain the final Regression Tree.

7.1.4. Decision Tree Learning using Python Sklearn

Python's scikit-learn currently uses an optimized version of the CART algorithm for decision tree
classification and regression. The DecisionTreeClassifier and DecisionTreeRegressor
classes are used for classification and regression, respectively. The following are
some of the general parameters that work for both classification and regression:

criterion: decides the criterion that measures the impurity of a set. "gini" and
"entropy" are the alternatives for classification, with default "gini". "mse" (mean
squared error) and "mae" (mean absolute error) are alternatives for regression,
with default "mse".

splitter: decides the splitting method on the basis of the criterion. The default "best" means
choose the best attribute to split on, according to the gain (variance
reduction, gini gain, information gain, etc.) computed with the selected criterion. The
other alternative, "random", means choose a random attribute but give a higher chance
to the better attributes according to their gains.

max_depth: the maximum depth of the tree. Default None: continue to expand
the tree's nodes until all leaves reach purity or all leaves contain fewer samples
than the value of min_samples_split.

min_samples_split: if a node has fewer samples than min_samples_split, do not
split it. Default = 2.

min_samples_leaf: minimum number of samples required for a node to become a
leaf. Default = 1.

min_impurity_decrease: a stopping criterion for further splitting nodes in the
decision tree. It ensures that a node will only split if the decrease in impurity (e.g.,
gini impurity or entropy) is at least as large as the specified min_impurity_decrease
value. In other words, if the gain is less than this threshold, do not
expand the tree. Default = 0, meaning the tree will split as long as there is any
decrease in impurity.

For a binary split of S into S_left and S_right, the impurity decrease (gain) compared with this threshold is

Gain(S, x) = Entropy(S) - Σ_{i: left, right} P_i · Entropy(S_i)

or Gain(S, x) = Gini(S) - Σ_{i: left, right} P_i · Gini(S_i)

or Gain(S, x) = MSE(S) - Σ_{i: left, right} P_i · MSE(S_i)

where P_i = |S_i| / N and N is the total number of samples in the parent set S.

Example: analyse the DecisionTreeClassifier on the iris dataset using cross-validation

...
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
model = DecisionTreeClassifier()
print(cross_val_score(model, iris.data, iris.target, cv=10))
...

Example: DecisionTreeClassifier using entropy, rather than the default gini criterion

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
model = DecisionTreeClassifier(criterion="entropy", max_depth=3)
model.fit(iris.data, iris.target)
fx = model.predict(iris.data)
print("Accuracy rate =", accuracy_score(iris.target, fx))
print("Accuracy rate =", model.score(iris.data, iris.target))  # same accuracy score

Example: analyse the DecisionTreeRegressor on the boston dataset using cross-validation

...
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

boston = load_boston()
model = DecisionTreeRegressor()
print(cross_val_score(model, boston.data, boston.target, cv=10))
...

Example: DecisionTreeRegressor using mae, rather than the default mse

...
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

boston = load_boston()
model = DecisionTreeRegressor(criterion="mae", max_depth=3)
model.fit(boston.data, boston.target)
fx = model.predict(boston.data)
print("R2 score (default) =", model.score(boston.data, boston.target))
print("mse error =", mean_squared_error(boston.target, fx))

7.2. Ensemble Learning using Decision Trees: Bagging, Random Forests, and Adaboost

7.2.1. Bagging
A group of weak learners (decision trees here) is ensembled together to make a
single decision. Each weak learner is trained on a random dataset that is derived from
the original training set by sampling with replacement. The decision of the ensemble
learner is the majority vote for classification and the average value for
regression. Bagging is one of the extensions to CART that reduces the variance of the decision by
using an ensemble of many weak trees (learners). An example for regression is as follows:

(Figure: a single tree has high variance; the average of 100 trees has low variance.)

An example for classification is as follows.

When creating the random datasets (bootstrap sets), the features (attributes) of the bootstrap
sets are preserved. Thus, no feature selection is used, although Python implementations
may allow it. The bootstrap trees are built independently and differ from the original
tree.
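To show what "sampling with replacement" and averaging look like in code, here is a minimal hand-rolled bagging sketch for regression (made-up data, not sklearn's BaggingRegressor internals): each tree is trained on a bootstrap sample and the ensemble prediction is the average of the trees.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.linspace(0, 6, 80)[:, None]                 # made-up 1D regression data
y = np.sin(X).ravel() + rng.normal(0, 0.3, 80)

n_trees, n = 100, len(X)
trees = []
for _ in range(n_trees):
    idx = rng.randint(0, n, n)                     # bootstrap: draw n indices with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# ensemble prediction = average of the individual trees' predictions
x_test = np.array([[1.5], [3.0], [4.5]])
pred = np.mean([t.predict(x_test) for t in trees], axis=0)
print(pred)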

7.2.2. Random Forest

Random Forest is one of the extensions to Bagging. The difference from Bagging is that Random Forest
selects a random group of features when generating the bootstrap sets. Thus, the
random datasets have varying numbers of features Mi (≤ M).

Advantages: Random Forests reduce the effects of overfitting and improve
generalization. The samples that are not selected for a given random dataset are
combined into its out-of-bag set. The score on the out-of-bag samples is called the oob_score.

There is no need for a cross-validation test, since the out-of-bag set is an ideal test set. The
out-of-bag error is the error of the learner on the out-of-bag set used as a test set.
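A short sketch of how the out-of-bag score can be read off in scikit-learn (using RandomForestClassifier on iris simply as a convenient example):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# oob_score=True asks the forest to evaluate each tree on its own out-of-bag samples
model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
model.fit(X, y)
print("out-of-bag score:", model.oob_score_)   # estimate of generalization accuracy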

Bagging and Random Forest with Python Scikit-Learn

The BaggingClassifier and BaggingRegressor classes are used for bagging classification
and regression, respectively. The classes can use any estimator as the weak learner, not
only a decision tree or KNN.
The majority of the parameters are valid for both regression and classification. Some
basic parameters are:
base_estimator: the weak learner. The default is a decision tree.
n_estimators: number of weak learners.
oob_score: whether to use the out-of-bag samples to estimate the generalization score.
bootstrap: are the samples drawn with replacement? (Default: yes)

Example: BaggingClassifier where the weak classifier is the default DecisionTreeClassifier

...
from sklearn.ensemble import BaggingClassifier
X, y = load …
model = BaggingClassifier(n_estimators=15, oob_score=True)
model.fit(X, y)
print("out-of-bag score", model.oob_score_)
fx = model.predict(X)
...

Example: BaggingClassifier where the weak classifier is KNN

. . .
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
X, y = load …
model = BaggingClassifier(base_estimator=KNeighborsClassifier())
model.fit(X, y)
fx = model.predict(X)
print("accuracy score", model.score(X, y))
. . .

Example: BaggingRegressor where the weak learner is the default DecisionTreeRegressor

...
from sklearn.ensemble import BaggingRegressor
X, y = load …
model = BaggingRegressor(n_estimators=15, oob_score=True)
model.fit(X, y)
print("out-of-bag score", model.oob_score_)
fx = model.predict(X)
...

Example: BaggingRegressor where the weak regressor is KNN

. . .
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import BaggingRegressor
X, y = load …
model = BaggingRegressor(base_estimator=KNeighborsRegressor())
model.fit(X, y)
fx = model.predict(X)
print("R2 score", model.score(X, y))
. . .

The RandomForestClassifier and RandomForestRegressor classes are used for Random
Forest classification and regression, respectively. Random Forest inherently uses
the parameters of the decision trees (its weak learners) and also those of bagging. Thus, the
parameters of the decision trees and of bagging are passed to the Random Forest
classes implicitly. Parameters such as criterion, max_depth, and so on are applied
to all weak trees.

Example: Regression with Random Forest
. . .
from sklearn.ensemble import RandomForestRegressor
x_train, x_test, y_train, y_test = train_test_split(...)
model = RandomForestRegressor(n_estimators=15, max_depth=3, criterion="mse")
model.fit(x_train, y_train)
fx = model.predict(x_test)
print("Ensemble R2 score: ", model.score(x_test, y_test))
. . .

Example: Classification with Random Forest

...
from sklearn.ensemble import RandomForestClassifier
x_train, x_test, y_train, y_test = train_test_split(...)
model = RandomForestClassifier(n_estimators=50, max_depth=2, criterion="gini")
model.fit(x_train, y_train)
fx = model.predict(x_test)
print("Ensemble accuracy score on test set:", model.score(x_test, y_test))
...

7.2.3. Adaboost
(ref: https://www.datacamp.com/community/tutorials/adaboost-classifier-python)

Adaboost:

Adaboost on a classification example:

(ref: https://laptrinhx.com/understanding-adaboost-and-scikit-learn-s-algorithm-3554153184/)

Adaboost stands for adaptive boosting. The models are sequentially arranged in the
ensemble. This means that at each step we try to boost our weak learners (base
models) based on the mistakes of the previous models, so that together they form one
strong ensemble model.
Step 1: Assign equal weights (w_i) to all samples in the data set

Assign equal weights to all samples: w_i = 1/N = 1/8 (there are 8 samples in this example).

This means that initially each sample is equally important w.r.t. learning.

Step 2: Create a decision stump using the feature that has the lowest Gini index
A decision tree stump is just a decision tree with a single root and two leaves, i.e. a tree
with one level. This stump is our weak learner (base model). The feature that first gets to
classify our data is determined using the Gini index.

If you want, you can increase the depth to two, but it is very common to go for
a stump.

Step 3: Evaluate the performance of your stump and assign its weight
In Adaboost, we have an ensemble of stumps, and all their predictions are taken into
account before deciding the final prediction. But some stumps do a better job of
classifying our data than the other stumps in the ensemble. It only makes sense
to give more importance to these stumps. Adaboost does this by assigning a weight to
each stump in the ensemble. The higher the weight, the more amount of
say (significance) the stump has in the final prediction. So, for example, suppose
sample-3 and sample-6 are misclassified by the stump; the weight of stump m is
calculated by

significance_m = (1/2) · log((1 - totalerror) / totalerror)

where

totalerror = Σ_i w_i · loss_i

and

loss_i = 1 if h(x_i) ≠ y_i, else 0 (zero-one loss for the weak learner h)

Total error = sum of the weights of the samples that were wrongly classified. So if our stump got
two samples misclassified, using the weights of those samples we get stump
significance = 0.5 · log((1 - (1/8 + 1/8)) / (1/8 + 1/8)) ≈ 0.55 (using the natural log).
And that is the weight of our first model in the ensemble.

Remember that this is different from the weights of the samples. Sample weights
stress the importance of getting the classification of the sample right, while model
weights are used to determine the amount of say a model gets in the final prediction.
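A quick numeric check of the stump weight for this example (plain Python, mirroring the numbers above):

import math

w = [1/8] * 8                      # equal initial sample weights
misclassified = [2, 5]             # sample-3 and sample-6 (0-based indices)
total_error = sum(w[i] for i in misclassified)          # 1/8 + 1/8 = 0.25
significance = 0.5 * math.log((1 - total_error) / total_error)
print("total error =", total_error, "significance =", round(significance, 3))   # about 0.549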

Step 4: Re-assign the weights of the samples

A sample's weight increases or decreases depending on whether the stump got it right:
for wrongly classified samples the exponent is +significance, for correctly classified
samples it is -significance (some formulations encode this with a 0/1 or ±1 loss indicator).

Remember we set equal weights for all samples earlier? Well, after classifying with
our first model, we got two samples misclassified. Adaboost then tries to stress the
importance of getting these two samples right next time around by assigning them a
higher sample weight. This will help the next model focus more on getting these two
samples right. The formula for the new weight of a wrongly classified sample is

w_i^(t+1) = w_i^t · e^(significance)

Initially the weights of both samples were 1/8, and now the new weight of each is
1/8 · e^(amount of say) ≈ 0.21. This is greater than the initial 1/8.

Next, it reduces the weights of the other samples. This is done using the formula

w_i^(t+1) = w_i^t · e^(-significance)

Step 5: Normalise the weights of the samples

Now that we have a new weight for each sample, we can normalize the weights by
dividing each weight by the sum of all weights. This makes the sum of the new
weights equal to 1. Notice that after normalization the weights of the two wrongly
classified samples (sample-3 and sample-6) are increased relative to the others.

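Continuing the numeric example, a short sketch of steps 4 and 5 (plain Python; the misclassified indices are the assumed samples from above):

import math

significance = 0.5 * math.log((1 - 0.25) / 0.25)     # stump weight from step 3
w = [1/8] * 8
misclassified = {2, 5}                                # sample-3 and sample-6 (0-based)

# step 4: scale the weights up for wrong samples, down for correct ones
w = [wi * math.exp(significance if i in misclassified else -significance)
     for i, wi in enumerate(w)]

# step 5: normalize so the weights sum to 1
total = sum(w)
w = [wi / total for wi in w]
print([round(wi, 3) for wi in w])    # wrong samples get 0.25 each, correct ones about 0.083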

Step 6: Create a new dataset of the same size as the original dataset and pick
samples based on their weights
This step is crucial because it is how the next model benefits from the experience
of the previous model, so that misclassified samples are given more importance. It
works in the following way:

• First, an empty dataset of the same size as the original dataset is created.
• Then samples are selected according to their weights, e.g. using roulette wheel
selection. Thus, important samples are more likely to be selected for the new dataset.
This grants the new learner the ability to correct the mistakes of the
previous learner.

• With the roulette wheel, a random number from 0 to 1 is picked and the sample whose
weight slice contains that number is chosen. For example, if 0.3 is picked, then the third
sample is chosen, since 0.3 falls between the cumulative weights 0.167 and 0.416, and it is
added to the new dataset. (See the sketch after this list.)
• Repeat the previous step until the new dataset is filled. Once the new dataset is
filled, reassign the sample weights to an equal value of 1/(number of samples).
This is what our new dataset looks like.
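A minimal sketch of the weighted resampling (generic numpy code; np.random.choice with the probability vector p plays the role of the roulette wheel described above):

import numpy as np

rng = np.random.RandomState(0)
weights = np.array([0.083, 0.083, 0.25, 0.083, 0.083, 0.25, 0.083, 0.083])
weights = weights / weights.sum()                 # make sure the wheel sums to 1

n = len(weights)
new_indices = rng.choice(n, size=n, replace=True, p=weights)   # roulette wheel draws
print(new_indices)        # samples 3 and 6 (indices 2 and 5) tend to appear more often

# after filling the new dataset, the sample weights are reset to 1/n
new_weights = np.full(n, 1 / n)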

Notice how the samples that we got wrong previously are included more often? This gives
the next model a better chance to get them right; it is a bit like creating a large penalty
for misclassification.

Step 7: Repeat steps 2 to 6 until we have enough models

Step 8: Assign the final prediction by a weighted majority vote for classification

f(x) = argmax_{k ∈ classes} Σ_{m ∈ models} significance_m · (h_m(x) == k)

So if the sum of the weights of the models that classified a sample as "Yes" is greater than the sum of the
weights of the models that classified it as "No", the final prediction is "Yes", and vice versa.
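A small sketch of that weighted vote (plain Python with made-up model outputs and significances):

from collections import defaultdict

# made-up predictions of 3 stumps for one sample, with their significances
predictions   = ["Yes", "No", "Yes"]
significances = [0.55, 0.9, 0.3]

votes = defaultdict(float)
for pred, sig in zip(predictions, significances):
    votes[pred] += sig                        # each model votes with its significance

f_x = max(votes, key=votes.get)
print(dict(votes), "->", f_x)                 # {"Yes": 0.85, "No": 0.9} -> "No"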

For the AdaBoost.SAMME classification case

Replace the model weight (significance) with

significance_m = log((1 - totalerror) / totalerror) + log(K - 1)

where K is the number of classes.

Note that when K = 2, SAMME becomes the original Adaboost.

For the AdaBoost.SAMME.R classification case

Replace the model weight (significance) with

significance_m = (1/2) · log(P_hm(y | x) / (1 - P_hm(y | x)))

It extends SAMME by replacing class predictions with class probabilities from the
weak learners. It minimizes the exponential loss directly using these probabilities. The
probabilities P_hm are computed using the softmax function applied to the raw scores
of the weak learner. These probabilities are then used to compute the sample weights
and the overall weight of the weak learner in SAMME.R.

For regression (the AdaBoost.R2 case):

Replace the loss with

error_i = |y_i - h(x_i)| (for weak learner h)

loss_i = error_i / max(error) (loss for x_i)

Replace the weight update phase as follows:

w_i^(t+1) = w_i^t · (1 - loss_i)^(significance)

Adaboost with Python Scikit Learn

The AdaBoostClassifier and AdaBoostRegressor classes are used for classification and
regression, respectively. The classes can use any estimator as the weak learner, not
only a decision tree or KNN.
The majority of the parameters are valid for both regression and classification. Some
basic parameters are:
base_estimator: the weak learner. The default base classifier is a
DecisionTreeClassifier initialized with max_depth=1; the default base regressor is a
DecisionTreeRegressor initialized with max_depth=3.
n_estimators: number of weak learners.
learning_rate: contributes to the weights of the weak learners. There is a trade-off
between learning_rate and n_estimators. Default = 1.
algorithm: {'SAMME', 'SAMME.R'}, default='SAMME.R'. The SAMME and SAMME.R
algorithms are multiclass Adaboost variants that were put forward in a paper by Ji
Zhu, Saharon Rosset, Hui Zou and Trevor Hastie. These algorithms adapt the
main idea of Adaboost, extending its functionality with multiclass capabilities.

Example (ref: https://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_regression.html)
A decision tree is boosted using the AdaBoost.R2 (Drucker 1997) algorithm on a 1D
sinusoidal dataset with a small amount of Gaussian noise. 299 boosts (300 decision
trees) are compared with a single decision tree regressor. As the number of boosts is
increased, the regressor can fit more detail.

print(__doc__)

# Author: Noel Dawe <[email protected]>


#
# License: BSD 3 clause

# importing necessary libraries


import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor

# Create the dataset


rng = np.random.RandomState(1)
X = np.linspace(0, 6, 100)[:, np.newaxis]
y = np.sin(X).ravel() + np.sin(6 * X).ravel() + rng.normal(0, 0.1, X.shape[0])

# Fit regression model


regr_1 = DecisionTreeRegressor(max_depth=4)

regr_2 = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
n_estimators=300, random_state=rng)

regr_1.fit(X, y)
regr_2.fit(X, y)

# Predict
y_1 = regr_1.predict(X)
y_2 = regr_2.predict(X)

# Plot the results


plt.figure()
plt.scatter(X, y, c="k", label="training samples")
plt.plot(X, y_1, c="g", label="n_estimators=1", linewidth=2)

plt.plot(X, y_2, c="r", label="n_estimators=300", linewidth=2)


plt.xlabel("data")
plt.ylabel("target")
plt.title("Boosted Decision Tree Regression")
plt.legend()
plt.show()

Example( Adaboost classification on iris dataset)


ref: https://www.datacamp.com/community/tutorials/adaboost-classifier-python

from sklearn.ensemble import AdaBoostClassifier


from sklearn import datasets
# Import train_test_split function
from sklearn.model_selection import train_test_split
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split dataset into training set and test set


X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3) # 70% training and 30% test

# Create adaboost classifer model


model = AdaBoostClassifier(n_estimators=50,
learning_rate=1)
# Train Adaboost Classifer
model.fit(X_train, y_train)

#Predict the response for test dataset


fx_test = model.predict(X_test)

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",metrics.accuracy_score(y_test, fx_test))

Output:
Accuracy: 0.8888888888888888

Example(using SVC as base learner for Adaboost classification and compare with
decision tree base learner )
ref: https://www.datacamp.com/community/tutorials/adaboost-classifier-python

# Load libraries
from sklearn.ensemble import AdaBoostClassifier

# Import Support Vector Classifier


from sklearn.svm import SVC
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
svc=SVC(probability=True, kernel='linear')

# Create adaboost classifer object


model =AdaBoostClassifier(n_estimators=50,
base_estimator=svc,learning_rate=1)

# Train Adaboost Classifer


model.fit(X_train, y_train)

#Predict the response for test dataset


fx_test = model.predict(X_test)

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",metrics.accuracy_score(y_test, fx_test))

Output:
Accuracy: 0.9555555555555556

References
[1]- Berlin-Chen Slides
[2]- Ovronnaz, Switzerland Slides
[3] https://towardsdatascience.com/machine-learning-part-17-boosting-algorithms-
adaboost-in-python-d00faac6c464

Some useful resources:


See the YouTube video https://www.youtube.com/watch?v=LsK-xG1cLYA

and https://towardsdatascience.com/machine-learning-part-17-boosting-algorithms-
adaboost-in-python-d00faac6c464
