DATA MINING and MACHINE LEARNING. PREDICTIVE TECHNIQUES: ENSEMBLE METHODS, BOOSTING, BAGGING, RANDOM FOREST, DECISION TREES and REGRESSION TREES. Examples with MATLAB
César Pérez López
Currently the weak learner types are:
'Discriminant' (recommended for Subspace ensemble)
'KNN' (only for Subspace ensemble)
'Tree' (for any ensemble except Subspace)
There are two ways to set the weak learner type in the ensemble.
To create an ensemble with default weak learner options, pass the character vector naming the weak learner. For example:
ens = fitensemble(X,Y,'AdaBoostM2',50,'Tree');
% or
ens = fitensemble(X,Y,'Subspace',50,'KNN');
To create an ensemble with nondefault weak learner options, create a nondefault weak learner using the appropriate template method. For example, if you have missing data, and want to use trees with surrogate splits for better accuracy:
templ = templateTree('Surrogate','all');
ens = fitensemble(X,Y,'AdaBoostM2',50,templ);
To grow trees with leaves containing a number of observations that is at least 10% of the sample size:
templ = templateTree('MinLeafSize',size(X,1)/10);
ens = fitensemble(X,Y,'AdaBoostM2',50,templ);
Alternatively, choose the maximal number of splits per tree:
templ = templateTree('MaxNumSplits',4);
ens = fitensemble(X,Y,'AdaBoostM2',50,templ);
While you can give fitensemble a cell array of learner templates, the most common usage is to give just one weak learner template.
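If you do pass a cell array, fitensemble trains numberens learners from each template, so the totals multiply. A minimal sketch (the template settings are illustrative choices, and X and Y are assumed to be classification data with three or more classes, as AdaBoostM2 expects):

shallow = templateTree('MaxNumSplits',1);   % stumps
deep = templateTree('MaxNumSplits',15);     % deeper trees
ens = fitensemble(X,Y,'AdaBoostM2',50,{shallow,deep});
% ens.NumTrained is 100: 50 learners per template times 2 templates.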
Decision trees can handle NaN values in X. Such values are called missing. If you have some missing values in a row of X, a decision tree finds optimal splits using nonmissing values only. If an entire row consists of NaN, fitensemble ignores that row. If you have data with a large fraction of missing values in X, use surrogate decision splits.
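As a minimal sketch of this behavior, with hypothetical toy data (not from the text), train a small boosted ensemble on predictors that contain NaN values, using surrogate splits as recommended above:

X = [1 2; NaN 3; 4 NaN; 5 6; 7 8; 2 9; 3 1; 6 4];  % two rows contain NaN
Y = [0; 0; 1; 1; 0; 1; 0; 1];                      % binary response
templ = templateTree('Surrogate','all');           % surrogate splits handle NaN
ens = fitensemble(X,Y,'AdaBoostM1',10,templ);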
Common Settings for Tree Weak Learners
The depth of a weak learner tree makes a difference for training time, memory usage, and predictive accuracy. You control the depth with these parameters:
MaxNumSplits — The maximal number of branch node splits per tree. Set large values of MaxNumSplits to get deep trees. The default for bagging is size(X,1) - 1. The default for boosting is 1.
MinLeafSize — Each leaf has at least MinLeafSize observations. Set small values of MinLeafSize to get deep trees. The default is 1 for classification and 5 for regression.
MinParentSize — Each branch node in the tree has at least MinParentSize observations. Set small values of MinParentSize to get deep trees. The default is 2 for classification and 10 for regression.
If you supply both MinParentSize and MinLeafSize, the learner uses the setting that gives larger leaves (shallower trees):
MinParent = max(MinParent,2*MinLeaf)
If you additionally supply MaxNumSplits, then the software splits a tree until one of the three splitting criteria is satisfied.
Surrogate — Grow decision trees with surrogate splits when Surrogate is 'on'. Use surrogate splits when your data has missing values.
PredictorSelection — fitensemble and TreeBagger grow trees using the standard CART[1] algorithm by default. If the predictor variables are heterogeneous, or there are predictors having many levels and others having few levels, then standard CART tends to select predictors having many levels as split predictors. For split-predictor selection that is robust to the number of levels that the predictors have, consider specifying 'curvature' or 'interaction-curvature'. These specifications conduct chi-square tests of association between each predictor and the response, or between each pair of predictors and the response, respectively. The predictor that yields the minimal p-value is the split predictor for a particular node. (A sketch combining several of these settings follows this list.)
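As a hedged sketch of how these settings combine (the parameter values are illustrative choices, not recommendations from the text; X and Y are assumed classification data):

% Moderately deep trees with surrogate splits and curvature-based
% split-predictor selection.
t = templateTree('MaxNumSplits',20,'MinLeafSize',5, ...
    'Surrogate','on','PredictorSelection','curvature');
ens = fitensemble(X,Y,'Bag',100,t,'Type','classification');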
The syntax of fitensemble is:
ens = fitensemble(X,Y,model,numberens,learners)
X is the matrix of data. Each row contains one observation, and each column contains one predictor variable.
Y is the responses, with the same number of observations as rows in X.
model is a character vector, such as 'bag', naming the type of ensemble.
numberens is the number of weak learners in ens from each element of learners. The number of elements in ens is numberens times the number of elements in learners.
learners is a character vector, such as 'tree', naming a weak learner, a weak learner template, or a cell array of such character vectors and templates.
The result of fitensemble is an ensemble object, suitable for making predictions on new data.
Where to Set Name-Value Pairs. There are several name-value pairs you can pass to fitensemble, and several that apply to the weak learners (templateDiscriminant, templateKNN, and templateTree). To determine whether a name-value pair argument belongs to the ensemble or to the weak learner:
Use template name-value pairs to control the characteristics of the weak learners.
Use fitensemble name-value pair arguments to control the ensemble as a whole, either for algorithms or for structure.
For example, for an ensemble of boosted classification trees with each tree deeper than the default, set the templateTree name-value pair arguments MinLeafSize and MinParentSize to smaller values than the defaults, or set MaxNumSplits to a larger value than the default. The trees are then leafier (deeper).
To name the predictors in the ensemble (part of the structure of the ensemble), use the PredictorNames name-value pair in fitensemble.
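A minimal sketch of the two levels (the predictor names 'hp' and 'weight' are hypothetical and assume a two-column X with a binary response Y):

% Weak-learner level: deeper trees, set through the template.
t = templateTree('MinLeafSize',1,'MinParentSize',2);
% Ensemble level: predictor names, passed to fitensemble itself.
ens = fitensemble(X,Y,'AdaBoostM1',100,t, ...
    'PredictorNames',{'hp','weight'});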
This example shows how to create a classification tree ensemble for the ionosphere data set, and use it to predict the classification of a radar return with average measurements.
Load the ionosphere data set.
load ionosphere
Train a classification ensemble. Because this is a binary classification problem, fitcensemble aggregates 100 classification trees using LogitBoost by default.
Mdl = fitcensemble(X,Y)
Mdl =
classreg.learning.classif.ClassificationEnsemble
ResponseName: 'Y'
CategoricalPredictors: []
ClassNames: {'b' 'g'}
ScoreTransform: 'none'
NumObservations: 351
NumTrained: 100
Method: 'LogitBoost'
LearnerNames: {'Tree'}
ReasonForTermination: 'Terminated normally after completing the requested number of training cycles.'
FitInfo: [100×1 double]
FitInfoDescription: {2×1 cell}
Mdl is a ClassificationEnsemble model.
Plot a graph of the first trained classification tree in the ensemble.
view(Mdl.Trained{1},'Mode','graph');
(Figure: graph view of the first trained classification tree. See http://www.mathworks.com/help/examples/stats/win64/TrainAClassificationEnsembleExample_01.png)
By default, fitcensemble grows shallow trees for boosting algorithms. You can alter the tree depth by passing a tree template object to fitcensemble. For more details, see templateTree.
Predict the quality of a radar return with average predictor measurements.
label = predict(Mdl,mean(X))
label =
  1×1 cell array
    'g'
This example shows how to create a regression ensemble to predict mileage of cars based on their horsepower and weight, trained on the carsmall data.
Load the carsmall data set.
load carsmall
Prepare the predictor data.
X = [Horsepower Weight];
The response data is MPG. The only available boosted regression ensemble type is LSBoost. For this example, arbitrarily choose an ensemble of 100 trees, and use the default tree options.
Train an ensemble of regression trees.
Mdl = fitensemble(X,MPG,'LSBoost',100,'Tree')
Mdl =
classreg.learning.regr.RegressionEnsemble
ResponseName: 'Y'
CategoricalPredictors: []
ResponseTransform: 'none'
NumObservations: 94
NumTrained: 100
Method: 'LSBoost'
LearnerNames: {'Tree'}
ReasonForTermination: 'Terminated normally after completing the requested number of training cycles.'
FitInfo: [100×1 double]
FitInfoDescription: {2×1 cell}
Regularization: []
Plot a graph of the first trained regression tree in the ensemble.
view(Mdl.Trained{1},'Mode','graph');
(Figure: graph view of the first trained regression tree. See http://www.mathworks.com/help/examples/stats/win64/TrainARegressionEnsemble1Example_01.png)
By default, fitensemble grows stumps for boosted trees.
Predict the mileage of a car with 150 horsepower weighing 2750 lbs.
mileage = predict(Mdl,[150 2750])
mileage =
22.4236
This example shows how to choose the appropriate split predictor selection technique for your data set when growing a random forest of regression trees. This example also shows how to decide which predictors are most important to include in the training data.
Load and Preprocess Data
Load the carbig data set. Consider a model that predicts the fuel economy of a car given its number of cylinders, engine displacement, horsepower, weight, acceleration, model year, and country of origin. Consider Cylinders, Model_Year, and Origin as categorical variables.
load carbig
Cylinders = categorical(Cylinders);
Model_Year = categorical(Model_Year);
Origin = categorical(cellstr(Origin));
X = table(Cylinders,Displacement,Horsepower,Weight,Acceleration,Model_Year,...
Origin,MPG);
Determine Levels in Predictors
The standard CART algorithm tends to split predictors with many unique values (levels), e.g., continuous variables, over those with fewer levels, e.g., categorical variables. If your data is heterogeneous, or your predictor variables vary greatly in their number of levels, then consider using the curvature or interaction tests for split-predictor selection instead of standard CART.
For each predictor, determine the number of levels in the data. One way to do this is to define an anonymous function that:
Converts all variables to the categorical data type using categorical
Determines all unique categories while ignoring missing values using categories
Counts the categories using numel
Then, apply the function to each variable using varfun.
countLevels = @(x)numel(categories(categorical(x)));
numLevels = varfun(countLevels,X(:,1:end-1),'OutputFormat','uniform');
Compare the number of levels among the predictor variables.
figure;
bar(numLevels);
title('Number of Levels Among Predictors');
xlabel('Predictor variable');
ylabel('Number of levels');
h = gca;
h.XTickLabel = X.Properties.VariableNames(1:end-1);
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
(Figure: bar chart of the number of levels among the predictors. See http://www.mathworks.com/help/examples/stats/win64/SelectPredictorsForRandomForestsExample_01.png)
The continuous variables have many more levels than the categorical variables. Because the number of levels among the predictors varies so much, using standard CART to select split predictors at each node of the trees in a random forest can yield inaccurate predictor importance estimates.
Grow Robust Random Forest
Grow a random forest of 200 regression trees. Specify sampling all variables at each node, and specify the interaction-curvature test to select split predictors. Because there are missing values in the data, specify surrogate splits to increase accuracy.
t = templateTree('NumVariablesToSample','all',...
'PredictorSelection','interaction-curvature','Surrogate','on');
rng(1); % For reproducibility
Mdl = fitrensemble(X,'MPG','Method','bag','NumLearningCycles',200,...
'Learners',t);
Mdl is a RegressionBaggedEnsemble model.
Estimate the model R² using out-of-bag predictions.
yHat = oobPredict(Mdl);
R2 = corr(Mdl.Y,yHat)^2
R2 =
0.8739
Mdl explains 87.39% of the variability around the mean.
Predictor Importance Estimation
Estimate predictor importance values by permuting out-of-bag observations among the trees.
impOOB = oobPermutedPredictorImportance(Mdl);
impOOB is a 1-by-7 vector of predictor importance estimates corresponding to the predictors in Mdl.PredictorNames. The estimates are not biased toward predictors containing many levels.
Compare the predictor importance estimates.
figure;
bar(impOOB);
title('Unbiased Predictor Importance Estimates');
xlabel('Predictor variable');
ylabel('Importance');
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
(Figure: bar chart of unbiased predictor importance estimates. See http://www.mathworks.com/help/examples/stats/win64/SelectPredictorsForRandomForestsExample_02.png)
Greater importance estimates indicate more important predictors. The bar graph suggests that Model_Year is the most important predictor, followed by Weight. Model_Year has only 13 distinct levels, whereas Weight has over 300.
Compare the predictor importance estimates obtained by permuting out-of-bag observations with the estimates obtained by summing gains in the mean squared error due to splits on each predictor. Also, obtain predictor association measures estimated by surrogate splits.
[impGain,predAssociation] = predictorImportance(Mdl);
figure;
plot(1:numel(Mdl.PredictorNames),[impOOB' impGain']);
title('Predictor Importance Estimation Comparison')
xlabel('Predictor variable');
ylabel('Importance');
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
legend('OOB permuted','MSE improvement')
grid on
(Figure: comparison of OOB-permuted and MSE-improvement importance estimates. See http://www.mathworks.com/help/examples/stats/win64/SelectPredictorsForRandomForestsExample_03.png)
impGain is commensurate with impOOB. According to the values of impGain, Model_Year and Weight do not appear to be the most important predictors.
predAssociation is a 7-by-7 matrix of predictor association measures. Rows and columns correspond to the predictors in Mdl.PredictorNames. You can infer the strength of the relationship between pairs of predictors using the elements of predAssociation. Larger values indicate more highly correlated pairs of predictors.
figure;
imagesc(predAssociation);
title('Predictor Association Estimates');
colorbar;
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
h.YTickLabel = Mdl.PredictorNames;
predAssociation(1,2)
ans =
0.6830
(Figure: heat map of predictor association estimates. See http://www.mathworks.com/help/examples/stats/win64/SelectPredictorsForRandomForestsExample_04.png)
The largest association is between Cylinders and Displacement, but the value is not high enough to indicate a strong relationship between the two predictors.
Grow Random Forest Using Reduced Predictor Set
Because prediction time increases with the number of predictors in random forests, it is good practice to create a model using as few predictors as possible.
Grow a random forest of 200 regression trees using the best two predictors only.
MdlReduced = fitrensemble(X(:,{'Model_Year' 'Weight' 'MPG'}),'MPG','Method','bag',...
'NumLearningCycles',200,'Learners',t);
Compute the R² of the reduced model.
yHatReduced = oobPredict(MdlReduced);
r2Reduced = corr(MdlReduced.Y,yHatReduced)^2
r2Reduced =
0.8525
The R² for the reduced model is close to the R² of the full model. This result suggests that the reduced model is sufficient for prediction.
Usually you cannot evaluate the predictive quality of an ensemble based on its performance on training data. Ensembles tend to overtrain, meaning they produce overly optimistic estimates of their predictive power. This means that the result of resubLoss for classification (resubLoss for regression) usually indicates lower error than you get on new data.
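A minimal sketch of this point, reusing the ionosphere data from the earlier example (the 5-fold choice is an illustrative assumption):

load ionosphere
Mdl = fitcensemble(X,Y);           % boosted classification ensemble
trainErr = resubLoss(Mdl)          % resubstitution error: optimistic
cvMdl = crossval(Mdl,'KFold',5);   % cross-validate the trained ensemble
cvErr = kfoldLoss(cvMdl)           % a more honest generalization estimate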