ML_Course_15-17
Basic Steps:
Training: Training means finding the K nearest points around all possible points. Normally, the complexity of this step is O(n²), but it can be reduced with algorithms such as BallTree and KDTree, which use tree-based data structures to find and store the nearest points.
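For illustration, the following is a minimal sketch (not part of the course text) of building a KD-tree over random training points with scikit-learn and querying the K nearest neighbours:

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X_train = rng.rand(1000, 2)                   # 1000 training points in a 2-D feature space
tree = KDTree(X_train, leaf_size=30)          # tree construction avoids the naive O(n^2) search
dist, idx = tree.query(rng.rand(5, 2), k=3)   # 3 nearest training points for 5 query points
print(idx)                                    # indices of the neighbours of each query point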
(Figure: feature space)
ID3, C4.5, CART and C5 are well-known decision-tree classification algorithms. ID3 and its extension C4.5 are classifiers, whereas CART and C5 (an extension of C4.5) can handle both classification and regression tasks. ID3 assumes categorical features and splits until the leaves are pure w.r.t. the output class. The basic issue of ID3 is the tree size: a big tree easily overfits the data, so a small tree is usually preferred in ID3. In general, finding the correctly sized tree is a research issue for tree-based learners. CART uses a binary tree (it always splits into left and right) and works on both categorical and numeric features. C4.5 is the extended version of ID3 that can handle impure leaves, numeric/categorical features and splitting into many branches. The algorithms above basically differ in their splitting criteria and the assumed structure of the tree.
Example: let the training set contain only categorical features, age and weight. Two solutions are shown below, where the smaller tree is better. A decision-tree classification can be represented as a logical function, which involves a set of binary rules (a disjunction of conjunctions) to predict the output.
(Figure: Solution-1 and Solution-2 decision trees)
Example: let the training set contain categorical and numeric features; [TI, PE] are the features and Response is the target variable (class).
Problems:
i- which feature to split on?
ii- which value to split on?
iii- the meaning of a leaf (when splitting is no longer needed)?
iv- the size of the tree?
Best Splitting
Choosing the Best Attribute (splitting criteria): Use Information Gain with Entropy.
E(S, x) = \sum_{i:\, values\ of\ x} \frac{|S_{x=v_i}|}{|S|} \, Entropy(S_{x=v_i})
Gain(S, x) = Entropy(S) - E(S, x)
Higher Gain (S, x) is better. Higher gain means lower entropy of an average child
(meaning child sets are getting more pure).
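As a sketch of how these quantities are computed (the helper functions here are assumed, not defined in the text), the entropy and information gain of a categorical feature can be written as:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    n = len(labels)
    e_sx = 0.0
    for v in set(feature_values):
        child = [y for xv, y in zip(feature_values, labels) if xv == v]
        e_sx += len(child) / n * entropy(child)   # E(S, x): weighted child entropy
    return entropy(labels) - e_sx                 # Gain(S, x) = Entropy(S) - E(S, x)

# Example: a feature with values weak/strong against a yes/no class
print(information_gain(["weak", "strong", "weak", "weak"], ["yes", "no", "yes", "no"]))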
Example: A person's decisions on playing tennis are recorded for 14 days under various weather conditions and given in the table below. Train the agent using inductive decision-tree learning and provide the final decision tree.
1- Find the best feature of S , which has maximum information gain after splitting,
and add it as a node to the tree.
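As an illustration, the sketch below trains a scikit-learn decision tree with the entropy criterion on a few weather-style rows; the rows shown are an assumed toy subset, not the actual 14-day table, and the categorical features are one-hot encoded because scikit-learn trees expect numeric input.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumed toy rows; replace with the full 14-day recorded table
df = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain"],
    "Humidity": ["High", "Normal", "High", "High", "Normal"],
    "Wind":     ["Weak", "Strong", "Weak", "Strong", "Weak"],
    "Play":     ["No", "Yes", "Yes", "No", "Yes"],
})
X = pd.get_dummies(df.drop(columns="Play"))          # one-hot encode the categorical features
clf = DecisionTreeClassifier(criterion="entropy").fit(X, df["Play"])
print(export_text(clf, feature_names=list(X.columns)))   # textual view of the learned tree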
Rule post-pruning
One of the most popular methods for pruning (used, e.g., in C4.5). First a full decision tree is built and represented as a set of if/then rules. Each rule is then pruned by removing any precondition whose removal improves accuracy. Finally, the pruned rules are sorted by accuracy and used in that order.
Incorporating continuous-valued attributes (ref[1])
Information gain uses the concept of entropy of sets and does not consider the entropy of an attribute (how much information the attribute itself carries).
Classification
The Gini index measures the average amount of information or impurity of the samples in a set S with respect to the output classes. Gini(S) is defined as
Gini(S) = 1 - \sum_{c:\, classes} p_c^2
where p_c is the fraction of samples in S that belong to class c.
Gini Gain(S, x) measures the expected reduction in the Gini index due to splitting on x:
Gini(S, x) = \sum_{i \in \{left,\, right\}} \frac{|S_i|}{|S|} \, Gini(S_i)
Gini\ Gain(S, x) = Gini(S) - Gini(S, x)
Higher Gini Gain is better. A higher gain means a lower Gini index for the average child (the child sets are getting purer). Alternatively, instead of maximizing the Gini Gain, minimizing Gini(S, x) can also be considered. By ignoring the division by |S|, the following Gini criterion is obtained, where a smaller value is better:
Gini\ criterion(S, x) = \sum_{i \in \{left,\, right\}} |S_i| \, Gini(S_i)
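A small sketch (helper functions assumed, not from the text) of the Gini index, the Gini Gain of a left/right split, and the unnormalised Gini criterion:

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(left, right):
    parent = left + right
    n = len(parent)
    gini_sx = len(left) / n * gini(left) + len(right) / n * gini(right)   # Gini(S, x)
    return gini(parent) - gini_sx                                         # higher is better

left, right = ["yes", "yes", "no"], ["no", "no"]
print(gini_gain(left, right))
# The unnormalised criterion |S_left|*Gini(S_left) + |S_right|*Gini(S_right): smaller is better
print(len(left) * gini(left) + len(right) * gini(right))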
...
Regression
Variance (or MSE) measures the average amount of information in a set S with respect to the mean of the set. Variance Reduction (VR) measures the expected reduction in variance (MSE) due to splitting on feature x:
VR(S, x) = MSE(S) - \sum_{i \in \{left,\, right\}} \frac{|S_i|}{|S|} \, MSE(S_i)
One can see that in order to maximize the Variance Reduction, one can minimize the total MSE of the children (the residual sum) for regression. Thus, a smaller RSS (Residual Sum of Squares) is better for splitting. The Residual Sum (RSS) is defined as follows:
RSS(S, x) = \sum_{i \in \{left,\, right\}} \sum_{j \in S_i} (y_j - \bar{y}_{S_i})^2
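The following sketch (not the book's code) searches for the best split value of a single numeric feature by minimising the residual sum of squares of the two children:

import numpy as np

def best_split(x, y):
    best_value, best_rss = None, np.inf
    for v in np.unique(x):
        left, right = y[x <= v], y[x > v]
        if len(left) == 0 or len(right) == 0:
            continue
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_value, best_rss = v, rss
    return best_value, best_rss

rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, 50)
y = np.sin(x) + 0.1 * rng.normal(size=50)
print(best_split(x, y))      # the split value and RSS that would be chosen for this node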
(Figures: the regression tree after each split: node0, node1, node2.)
iv- left of node0: the best y split = -0.48 with RSS = 13.61
vi- right of node0: the best y split = 2.79 with RSS = 25.11
...
If we complete the training, we obtain the final regression tree:
criterion: decides the criterion that measures the impurity of a set. "gini" and "entropy" are the alternatives for classification, with default "gini". "mse" (mean squared error) and "mae" (mean absolute error) are the alternatives for regression, with default "mse".
splitter: decides the splitting method on a "criterion" basis. The default "best" means choose the best attribute to split on, according to the gain (information gain, Gini gain, variance reduction, etc.) computed with the selected criterion. The other alternative, "random", chooses a random attribute but gives a higher chance to the better attributes according to their gains.
max_depth: the maximum depth of the tree. The default None continues to expand the tree's nodes until all leaves are pure or all leaves contain fewer samples than the value of min_samples_split.
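For reference, a short sketch of how these parameters are passed to scikit-learn's tree estimators (the values are only illustrative; note that newer scikit-learn versions rename "mse" to "squared_error"):

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

clf = DecisionTreeClassifier(criterion="gini",     # or "entropy"
                             splitter="best",      # or "random"
                             max_depth=None,       # grow until leaves are pure
                             min_samples_split=2)
reg = DecisionTreeRegressor(criterion="mse",       # "squared_error" in newer versions
                            max_depth=3)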
7.1.1. Bagging
A group of weak learners (decision trees here) is ensembled together to predict a single decision. Each weak learner is trained on a random dataset that is derived from the original training set by sampling with replacement. The decision of the ensemble learner is the majority vote for classification and the average value for regression. Bagging is one of the extensions of CART that reduces the variance of the decision by using an ensemble of many weak trees (learners). An example for regression is as follows:
(Figure: a single tree (high variance) vs. the average of 100 trees (low variance).)
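A minimal sketch (not the book's exact experiment) of bagging 100 regression trees with scikit-learn on noisy sinusoidal data:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.2 * rng.normal(size=200)

single_tree = DecisionTreeRegressor().fit(X, y)                 # high-variance single learner
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                          bootstrap=True, random_state=0).fit(X, y)
# bagged.predict(X) averages the 100 trees, which smooths out the single tree's variance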
There is no need for a cross-validation test since the out-of-bag set is an ideal test set. The out-of-bag error is the error of the learner on the out-of-bag samples used as a test set.
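As an illustrative sketch, scikit-learn's bagging estimators can report this out-of-bag error directly via oob_score:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

X, y = load_iris(return_X_y=True)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        oob_score=True, random_state=0).fit(X, y)
print("OOB accuracy:", bag.oob_score_)   # estimated without a separate test split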
7.1.3. Adaboost
(ref: https://www.datacamp.com/community/tutorials/adaboost-classifier-python)
Adaboost:
Adaboost stands for adaptive boosting. The models are arranged sequentially in the ensemble. This means that at each step we try to boost our weak learners (base models) based on the mistakes of the previous models, so that together they form one strong ensemble model.
Step 1: Assign equal weights (w_i) to all samples in the data set
(Table: the 8 training samples, indexed 1-8.)
Assign equal weights to all samples with w_i = 1/N = 1/8.
This means that initially each sample is equally important w.r.t learning.
Step 2: Create a decision stump using the feature that has the lowest Gini index
A decision tree stump is just a decision tree with a single root and two leaves, i.e. a tree with one level. This stump is our weak learner (base model). The feature that first gets to classify our data is determined using the Gini index.
If you want, you can increase the depth to two levels, but it is very common to go for a stump.
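In scikit-learn terms, such a stump is simply a depth-1 tree, e.g.:

from sklearn.tree import DecisionTreeClassifier
stump = DecisionTreeClassifier(max_depth=1, criterion="gini")   # one root node and two leaves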
Step 3: Evaluate the performance of your stump and assign its weight
In Adaboost we have an ensemble of stumps, and all their predictions are taken into account before deciding the final prediction. But some stumps do a better job of classifying our data than the other stumps in the ensemble. It only makes sense to give more importance to these stumps. Adaboost does this by assigning a weight to each stump in the ensemble. The higher the weight, the more say (significance) the stump will have in the final prediction. So, for example, let samples 3 and 6 be misclassified by the stump; the weight of this stump m is calculated by
significance_m = \frac{1}{2} \ln\left(\frac{1 - total\ error}{total\ error}\right)
where
total\ error = \sum_i w_i \cdot loss_i
and loss_i = 1 if sample i is misclassified and 0 otherwise.
Total error is the sum of the weights of the samples that are wrongly classified. So if our stump got two samples misclassified, using the weights of those samples we get stump significance = 0.5 * ln((1 - (1/8 + 1/8)) / (1/8 + 1/8)) = 0.5 * ln(3) ≈ 0.55 (using the natural log).
And that is the weight of our first model in the ensemble.
Remember that this is different from the weights of the sample. Sample weights
stress the importance of getting the classification of the sample right, while model
weights are used to determine the amount of say a model gets in the final prediction.
Remember that we set equal weights for all samples earlier? Well, after classifying with our first model, we got two samples misclassified. Adaboost then tries to stress the importance of getting these two samples right next time around by assigning them a higher sample weight. This will help the next model focus more on getting these two samples right. The formula for the new weight of the wrongly classified samples is:
w_i^{t+1} = w_i^{t} \, e^{significance}
Initially the weights of both samples were 1/8, and now the new weight of each is 1/8 * e^{significance} ≈ 0.21, which is greater than the initial 1/8.
Next, it reduces the weights of the other samples. This is done using the formula:
w_i^{t+1} = w_i^{t} \, e^{-significance}
(Table: the 8 samples, indexed 1-8, with their updated sample weights.)
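A sketch (illustrative, following the numbers of this example) of one round of Adaboost weight updates, assuming samples 3 and 6 are the misclassified ones:

import numpy as np

w = np.full(8, 1 / 8)                                         # step 1: equal sample weights
miss = np.array([0, 0, 1, 0, 0, 1, 0, 0], dtype=bool)         # loss_i = 1 for samples 3 and 6
total_error = w[miss].sum()                                   # 1/8 + 1/8 = 0.25
significance = 0.5 * np.log((1 - total_error) / total_error)  # ≈ 0.55
w[miss] *= np.exp(significance)                               # misclassified: 1/8 * e^0.55 ≈ 0.21
w[~miss] *= np.exp(-significance)                             # correctly classified: lowered
w /= w.sum()                                                  # renormalise so the weights sum to 1
print(significance, w)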
Step 6: Create a new dataset of the same size as the original dataset and pick samples based on their weights
This step is crucial because this is how the next model will benefit from the experience of the previous model, so that misclassified samples are given more importance. It works in the following way:
• First, an empty dataset of the same size as the original dataset is created.
• Then samples are selected according to their weights, e.g. using roulette-wheel selection. Thus, important samples are more likely to be selected into the new dataset. This grants the new learner the ability to correct the mistakes of the previous learner.
• With the roulette wheel, a random number from 0 to 1 is picked and the sample whose weight slice contains that number is chosen. For example, if 0.3 is picked, then the third sample is chosen, since 0.3 falls between the cumulative weights 0.167 and 0.416, and it is added to the new dataset.
• Repeat the previous step until the new dataset is filled. Once the new dataset is filled, reassign the sample weights to an equal value of 1/(number of samples).
This is what our new dataset looks like.
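A sketch of the roulette-wheel selection (the weights here follow the renormalised values of this example):

import numpy as np

w = np.array([1, 1, 3, 1, 1, 3, 1, 1], dtype=float)   # misclassified samples carry 3x the weight
w /= w.sum()                                          # ≈ [0.083, 0.083, 0.25, ...]
cum = np.cumsum(w)                                    # slice boundaries of the roulette wheel
rng = np.random.RandomState(0)
new_idx = [int(np.searchsorted(cum, rng.rand())) for _ in range(len(w))]
print(new_idx)   # indices 2 and 5 (samples 3 and 6) tend to appear more often
# equivalently: rng.choice(len(w), size=len(w), p=w)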
Notice how the samples that we got wrong previously are included more often? This gives the next model a better chance to get them right, sort of like creating a large penalty for misclassification.
So if the sum of the weights of the models that classified a sample as "Yes" is greater than the sum of the weights of the models that classified it as "No", the final prediction is "Yes", and vice versa.
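A tiny sketch of that final weighted vote (the significances here are assumed values):

import numpy as np

significances = np.array([0.55, 0.30, 0.42])   # assumed weights of three stumps
votes = np.array(["Yes", "No", "Yes"])         # each stump's prediction for one sample
yes_w = significances[votes == "Yes"].sum()
no_w = significances[votes == "No"].sum()
print("Yes" if yes_w > no_w else "No")         # the class with the larger total weight wins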
Example (ref: https://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_regression.html)
A decision tree is boosted using the AdaBoost.R2 (Drucker 1997) algorithm on a 1D sinusoidal dataset with a small amount of Gaussian noise. 299 boosts (300 decision trees) are compared with a single decision tree regressor. As the number of boosts is increased, the regressor can fit more detail.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
# Create the noisy 1D sinusoidal dataset (completed from the referenced scikit-learn example)
rng = np.random.RandomState(1)
X = np.linspace(0, 6, 100)[:, np.newaxis]
y = np.sin(X).ravel() + np.sin(6 * X).ravel() + rng.normal(0, 0.1, X.shape[0])
# Fit a single decision tree and a boosted ensemble of 300 trees
regr_1 = DecisionTreeRegressor(max_depth=4)
regr_2 = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                           n_estimators=300, random_state=rng)
regr_1.fit(X, y)
regr_2.fit(X, y)
# Predict
y_1 = regr_1.predict(X)
y_2 = regr_2.predict(X)
# Load the iris data and train an AdaBoost classifier with the default decision-stump base
# learner (imports, train/test split and classifier completed from the datacamp reference below)
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
abc = AdaBoostClassifier(n_estimators=50, learning_rate=1).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, abc.predict(X_test)))
Output:
Accuracy: 0.8888888888888888
Example (using SVC as the base learner for Adaboost classification, to be compared with the decision-tree base learner)
ref: https://www.datacamp.com/community/tutorials/adaboost-classifier-python
# Load libraries
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Use a linear SVC (with probability estimates) as the base learner
svc = SVC(probability=True, kernel='linear')
abc = AdaBoostClassifier(n_estimators=50, base_estimator=svc, learning_rate=1)
abc.fit(X_train, y_train)      # reuses the iris train/test split from the previous example
print("Accuracy:", accuracy_score(y_test, abc.predict(X_test)))
Output:
Accuracy: 0.9555555555555556
References
[1] Berlin-Chen Slides
[2] Ovronnaz, Switzerland Slides
[3] https://towardsdatascience.com/machine-learning-part-17-boosting-algorithms-adaboost-in-python-d00faac6c464