Week-7 - Lecture Notes

The document discusses decision trees for big data analytics and regression. It covers topics like how decision trees work by splitting data into pure subsets, algorithms like ID3, choosing the best attribute to split on using information gain, and growing decision trees for regression tasks.


Decision Trees for Big Data Analytics

Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface

Content of this Lecture:

In this lecture, we will discuss Decision Trees for Big Data Analytics and also discuss a case study of a medical application using a Decision Tree in Spark ML.
Decision Trees
Predict if Sachin will play cricket

Consider the following data set. Our task is to predict whether Sachin is going to play cricket on a given day. We have observed Sachin over a number of days and recorded various things that might influence his decision to play cricket: the weather outlook (is it sunny or is it raining), the humidity (high or normal), and the wind (strong or weak).

This is our training set, and these are the examples from which we are going to build a classifier.
Predict if Sachin will play cricket

Now consider day 15: it is raining, the humidity is high, and it is not very windy. Is Sachin going to play or not? Just by looking at the data it is hard to decide, because on some rainy days Sachin plays and on others he does not; sometimes he plays with strong winds and sometimes he does not, and likewise with weak winds. So how do you predict? The basic idea behind decision trees is to try, at some level, to understand why Sachin plays. This is the only classifier we will cover that tries to predict Sachin's playing in this way.
Predict if Sachin will play cricket

It is hard to guess directly, so we try to understand when Sachin plays.

Divide & conquer:
- Split the data into subsets
- Are they pure (all yes or all no)?
- If yes: stop
- If not: repeat the split on that subset
- To classify new data, see which subset it falls into
ID3 Algorithm

In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domains.

Split(node, {examples}):
1. A ← the best attribute for splitting the {examples}
2. Decision attribute for this node ← A
3. For each value of A, create a new child node
4. Split training {examples} to the child nodes
5. For each child node / subset:
   • if the subset is pure: STOP
   • else: Split(child_node, {subset})

Ross Quinlan (ID3: 1986), (C4.5: 1993)
Breiman et al. (CART: 1984), from statistics
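To make the Split pseudo-code above concrete, here is a minimal Python sketch of ID3-style tree growing. It is only an illustration, not the lecture's code: the example format (a list of (attribute-dict, label) pairs) and the choose_best_attribute helper (assumed to return the attribute with the highest information gain) are assumptions.

    def is_pure(examples):
        # A subset is pure when all of its labels agree.
        return len({label for _, label in examples}) <= 1

    def id3(examples, attributes, choose_best_attribute):
        # Stop on a pure (or attribute-exhausted) subset: return the majority label.
        if is_pure(examples) or not attributes:
            labels = [label for _, label in examples]
            return max(set(labels), key=labels.count)
        best = choose_best_attribute(examples, attributes)    # steps 1-2: pick A
        node = {"attribute": best, "children": {}}
        for value in {attrs[best] for attrs, _ in examples}:  # step 3: a child per value of A
            subset = [(a, y) for a, y in examples if a[best] == value]     # step 4
            remaining = [a for a in attributes if a != best]
            node["children"][value] = id3(subset, remaining, choose_best_attribute)  # step 5
        return node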
Which attribute to split on?

We want to measure the "purity" of the split, i.e., how much more certain we are about Yes/No after the split:
• a pure set (4 yes / 0 no) ➔ completely certain (100%)
• an impure set (3 yes / 3 no) ➔ completely uncertain (50%)

We can't simply use P("yes" | set): the measure must be symmetric, so 4 yes / 0 no is as pure as 0 yes / 4 no.
Entropy

Entropy: H(S) = − p(+) log2 p(+) − p(−) log2 p(−) bits
S … subset of training examples
p(+) / p(−) … fraction of positive/negative examples in S

Interpretation: assume an item X belongs to S. The entropy is how many bits are needed to tell whether X is positive or negative.

Impure set (3 yes / 3 no): H(S) = −0.5 log2 0.5 − 0.5 log2 0.5 = 1 bit
Pure set (4 yes / 0 no): H(S) = −1 log2 1 − 0 log2 0 = 0 bits
Information Gain

We want many items to end up in pure sets. Information gain is the expected drop in entropy after the split:

Gain(S, A) = H(S) − Σ_v (|S_v| / |S|) · H(S_v)

This is the mutual information between attribute A and the class labels of S.

Example:
Gain(S, Wind)
= H(S) − 8/14 · H(S_weak) − 6/14 · H(S_strong)
= 0.94 − 8/14 · 0.81 − 6/14 · 1.0
= 0.049
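To check these numbers, here is a small Python sketch that computes entropy and information gain from label counts. The per-branch counts (9 yes / 5 no overall, 6 yes / 2 no for weak wind, 3 yes / 3 no for strong wind) are assumed from the standard play-cricket example behind the slide.

    import math

    def entropy(pos, neg):
        # H(S) = -p(+) log2 p(+) - p(-) log2 p(-), with 0*log2(0) treated as 0.
        total = pos + neg
        h = 0.0
        for count in (pos, neg):
            if count:
                p = count / total
                h -= p * math.log2(p)
        return h

    def information_gain(parent, children):
        # parent and each child are (pos, neg) counts; the children partition the parent.
        n = sum(parent)
        return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in children)

    print(entropy(3, 3))   # impure set: 1.0 bit
    print(entropy(4, 0))   # pure set: 0.0 bits
    print(information_gain((9, 5), [(6, 2), (3, 3)]))  # ~0.048 (0.049 on the slide, from rounded intermediates)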
Decision Trees for Regression
How to grow a decision tree

The tree is built greedily from top to bottom. Each split is selected to maximize the information gain (IG).
Decision Tree for Regression

Given a training set: Z = {(x1, y1), …, (xn, yn)}
yi are real values.

The goal is to find f(x) (a tree) that minimizes the squared error Σi (f(xi) − yi)² on the training set.

How do we grow a decision tree for regression?
How to find the best split

The tree is built from top to bottom, and at each step you find the best split at an internal node. Given the dataset Z, the splitting criterion is xk < t versus xk ≥ t for some threshold t. The problem is how to find k (the index of the feature) and the threshold t, and also the values aL and aR to predict in the left and right leaves.
What happens without a split?

Without a split, the best you can do is to predict a single number, a. It makes sense to select a so as to minimize the squared error.

It is easy to prove that the optimal a equals the average of all yi, so you just need to calculate the average value of all the targets. We denote this average value by ā.

The impurity of Z is then the mean squared error when ā is used as the prediction:

Impurity(Z) = (1/|Z|) Σi (yi − ā)²
Find the best split (xk < t)

What happens if you make a split by a condition xk < t? For one part of the training objects, ZL, we have xk < t; for the other part, ZR, we have xk ≥ t.

The error consists of two parts, one for ZL and one for ZR, so we need to find k, t, aL and aR simultaneously.

We can calculate the optimal aL and aR in exactly the same way as in the case without a split: the optimal aL equals the average of the targets of the objects that reach the left leaf (and similarly for aR). We denote these values by āL and āR respectively.
Find the best split

After this step we only need to find the optimal k and t. We have impurity formulas for the objects that go to the left branch of the splitting condition and for those that go to the right. We then choose the splitting criterion that maximizes the information gain. This procedure is repeated iteratively from top to bottom.
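A brute-force version of this split search can be written in a few lines of Python; the sketch below assumes a small in-memory numpy dataset and simply tries every feature k and every observed threshold t. It mirrors the procedure above but is not how Spark implements it.

    import numpy as np

    def best_regression_split(X, y):
        # Returns (k, t, a_left, a_right) minimizing the summed squared error.
        best, best_error = None, np.sum((y - y.mean()) ** 2)   # impurity without a split
        for k in range(X.shape[1]):
            for t in np.unique(X[:, k]):
                left, right = y[X[:, k] < t], y[X[:, k] >= t]
                if len(left) == 0 or len(right) == 0:
                    continue
                a_left, a_right = left.mean(), right.mean()     # optimal leaf values
                error = np.sum((left - a_left) ** 2) + np.sum((right - a_right) ** 2)
                if error < best_error:
                    best, best_error = (k, t, a_left, a_right), error
        return best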
Stopping rule

Splitting stops when any of the following holds:
- The node depth is equal to the maxDepth training parameter.
- No split candidate leads to an information gain greater than minInfoGain.
- No split candidate produces child nodes which each have at least minInstancesPerNode training instances (i.e. |ZL| or |ZR| < minInstancesPerNode).
Summary: Decision Trees

Automatic handling of feature interactions: A benefit of decision trees is that the algorithm can automatically handle interactions of features, because it can combine several different features in a single decision tree and build complex functions involving multiple splitting criteria.

Computational scalability: There exist efficient algorithms for building decision trees on very large data sets with many features.

Predictive power: A single decision tree is actually not a very good predictor; the predictive power of a single tree is typically not so good.

Interpretability: You can visualize the decision tree and analyze the splitting criteria in the nodes, the values in the leaves, and so on. This can often be helpful.
Building a tree using MapReduce
Problem: Building a tree

Given a large dataset with hundreds of attributes, build a decision tree!

General considerations:
- The tree is small (can be kept in memory): shallow (~10 levels)
- The dataset is too large to keep in memory
- The dataset is too big to scan over on a single machine
- MapReduce to the rescue!

[Figure: FindBestSplit is invoked for each node of the tree.]
MapReduce
PLANET

PLANET: Parallel Learner for Assembling Numerous Ensemble Trees [Panda et al., VLDB '09] is a sequence of MapReduce jobs that builds a decision tree.

Setting:
- Hundreds of numerical (discrete & continuous) attributes
- Target (class) is numerical: regression
- Splits are binary: Xj < v
- The decision tree is small enough for each Mapper to keep it in memory
- The data is too large to keep in memory
PLANET Architecture

[Figure: the Master holds the model and the attribute metadata; the input data is processed by the FindBestSplit and InMemoryGrow MapReduce jobs, whose intermediate results are returned to the Master. The example tree has nodes A–I.]
PLANET Overview

We build the tree level by level; one MapReduce step builds one level of the tree.

Mapper:
- Considers a number of possible splits (Xi, v) on its subset of the data
- For each split it stores partial statistics
- Partial split statistics are sent to the Reducers

Reducer:
- Collects all partial statistics and determines the best split

The Master then grows the tree by one level.
PLANET Overview

- The Mapper loads the model and info about which attribute splits to consider.
- Each Mapper sees a subset D* of the data.
- The Mapper "drops" each datapoint down the tree to find the appropriate leaf node L.
- For each leaf node L it keeps statistics about:
  1) the data reaching L
  2) the data in the left/right subtree under split S
- The Reducer aggregates the statistics (1) and (2) and determines the best split for each node.
PLANET: Components

Master: monitors everything (runs multiple MapReduce jobs).

Three types of MapReduce jobs:
(1) MapReduce Initialization (run once, first)
    - For each attribute, identify the values to be considered for splits
(2) MapReduce FindBestSplit (run multiple times)
    - MapReduce job to find the best split when there is too much data to fit in memory
(3) MapReduce InMemoryBuild (run once, last)
    - Similar to FindBestSplit (but for small data)
    - Grows an entire sub-tree once the data fits in memory
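The following is only a schematic Python sketch of the FindBestSplit map/reduce logic described above; it is not the PLANET code. Each mapper emits, for every (node, candidate split, branch), the partial statistics (count, sum of targets, sum of squared targets) needed to compute the squared-error impurity; the reducer adds them up. The tree.drop(x) helper, which returns the leaf a point reaches, is an assumption.

    from collections import defaultdict

    def find_best_split_mapper(records, tree, candidate_splits):
        # records: the (x, y) pairs seen by this mapper; emits partial statistics.
        out = defaultdict(lambda: [0, 0.0, 0.0])          # count, sum_y, sum_y2
        for x, y in records:
            node_id = tree.drop(x)                        # leaf this datapoint reaches
            for feature, threshold in candidate_splits[node_id]:
                side = "L" if x[feature] < threshold else "R"
                stats = out[(node_id, (feature, threshold), side)]
                stats[0] += 1
                stats[1] += y
                stats[2] += y * y
        return out.items()

    def find_best_split_reducer(key, partial_stats):
        # Sums partial statistics; impurity of predicting the mean is sum(y^2) - (sum y)^2 / n.
        n = sum(s[0] for s in partial_stats)
        sy = sum(s[1] for s in partial_stats)
        sy2 = sum(s[2] for s in partial_stats)
        return key, (n, sy2 - sy * sy / n if n else 0.0)

The Master would then combine the left/right impurities per (node, split) pair and keep the split with the lowest total error, exactly as in the single-machine case.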
Reference

B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. VLDB 2009.

J. Ye, J.-H. Chow, J. Chen, Z. Zheng. Stochastic Gradient Boosted Distributed Decision Trees. CIKM 2009.
Example: Medical Application using a Decision Tree in Spark ML
Create SparkContext and SparkSession

This example is about using Spark ML for classification and regression with decision trees and ensembles of decision trees.

First of all, you create a SparkContext and a SparkSession.
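A minimal PySpark sketch of this step (the application name is illustrative, not the notebook's exact code):

    from pyspark import SparkContext
    from pyspark.sql import SparkSession

    sc = SparkContext.getOrCreate()               # SparkContext
    spark = (SparkSession.builder
             .appName("decision-tree-example")    # illustrative name
             .getOrCreate())                      # SparkSession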
Download a dataset

Now you download a dataset about breast cancer diagnosis.
Exploring the Dataset

Let's explore this dataset a little bit. The first column is the ID of the observation, the second column is the diagnosis, and the other columns are comma-separated features. These features are the results of various analyses and measurements. There are 569 examples in total in this dataset. If the second column is M, it means cancer; if it is B, there is no cancer in that particular woman.
Exploring the Dataset

First you need to transform the label, which is either M or B in the second column, from a string to a number. We use a StringIndexer object for this purpose. You first load the dataset and then create a Spark DataFrame, which is stored in a distributed manner on the cluster.
Exploring the Dataset

The inputDF DataFrame has two columns, label and features. You use a Vector object to create the vector column in this dataset, and then you can do the string indexing: Spark enumerates all the possible labels in string form and transforms them into label indexes. Now label M corresponds to 1 and label B to 0.
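Putting the last few steps together, here is a hedged PySpark sketch; the file name and exact parsing are assumptions about the notebook (the CSV rows are id, diagnosis M/B, then the numeric features):

    from pyspark.ml.feature import StringIndexer
    from pyspark.ml.linalg import Vectors

    # Assumed file name and layout: id, diagnosis (M/B), then numeric features.
    rows = sc.textFile("wdbc.data").map(lambda line: line.split(","))
    inputDF = spark.createDataFrame(
        rows.map(lambda r: (r[1], Vectors.dense([float(v) for v in r[2:]]))),
        ["label", "features"])

    # StringIndexer: the most frequent string label gets index 0.0,
    # so B -> 0.0 and M -> 1.0, as described above.
    indexer = StringIndexer(inputCol="label", outputCol="labelIndexed")
    indexedDF = indexer.fit(inputDF).transform(inputDF)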
Train/Test Split

Now we can start doing the machine learning. First, you make a train/test split in the proportion 70% to 30%. The first model you are going to evaluate is a single decision tree.
Train/Test Split

We import the DecisionTreeClassifier object, create the estimator that is responsible for training, call the fit method on the training data, and obtain a decision tree model.
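Continuing the variable names from the previous sketch, the split and training step might look as follows (the seed is only for reproducibility):

    from pyspark.ml.classification import DecisionTreeClassifier

    trainDF, testDF = indexedDF.randomSplit([0.7, 0.3], seed=42)   # 70% / 30%

    dt = DecisionTreeClassifier(labelCol="labelIndexed", featuresCol="features")
    dtModel = dt.fit(trainDF)                      # a single decision tree

    print(dtModel.numNodes, dtModel.depth)         # size of the tree
    print(dtModel.featureImportances)              # per-feature importances
    print(dtModel.toDebugString)                   # the if/else structure of the tree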
Train/Test Split

The trained model reports the number of nodes and the depth of the decision tree, the feature importances, the total number of features used in this decision tree, and so on. We can even visualize this decision tree and explore its structure: it shows the If/Else splitting conditions and the predicted values in the leaves of the decision tree.
Train/Test Split

Now we apply the decision tree model to the test data and obtain predictions, which you can explore. The predictions are in the last column, and in this particular case our model always predicts the zero class.
Predictions

Accuracy
Conclusion

In this lecture, we have discussed Decision Trees for Big Data Analytics. We have also discussed a case study of breast cancer diagnosis using a Decision Tree in Spark ML.
Big Data Predictive Analytics

Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]


Preface

Content of this Lecture:

In this lecture, we will discuss the fundamental techniques of predictive analytics. We will mainly cover Random Forest, Gradient Boosted Decision Trees, and a case study with Spark ML programming, decision trees, and ensembles.


Decision Trees


Summary: Decision Trees

Automatic handling of feature interactions: A decision tree can combine several different features in a single tree and build complex functions involving multiple splitting criteria.

Computational scalability: There exist efficient algorithms for building decision trees on very large data sets with many features.

Predictive power: Unfortunately, a single decision tree is not a very good predictor; its predictive power is typically not so good.

Interpretability: We can visualize the decision tree and analyze the splitting criteria in the nodes, the values in the leaves, and so on.
Bootstrap and Bagging


Bootstrap

Bootstrapping is an algorithm that produces replicas of a data set by random sampling with replacement. This idea is essential for the random forest algorithm.

Consider a dataset Z = {(x1, y1), …, (xn, yn)}.

A bootstrapped dataset Z* is a modification of the original dataset Z, produced by random sampling with replacement.


Sampling with Replacement

At each iteration you pick an object at random, with no correlation with the previous step. Consider a data set Z with five objects:

Original dataset: 1 2 3 4 5

At the first step you pick an object at random, for example object number 3. Then you repeat the process and pick object number 1, then object number 5. Then you may pick object number 5 again, because at each iteration you pick an object at random, independently of the previous steps. Finally, you pick object number 2.

Bootstrap dataset: 3 1 5 5 2

After bootstrapping we have a new data set. Its size equals the number of elements in the original data set, but its content is slightly different: some objects may be missing and other objects may be present several times, more than once.
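The same walkthrough in a couple of lines of numpy (the particular draw depends on the random seed):

    import numpy as np

    Z = np.array([1, 2, 3, 4, 5])                             # original dataset
    Z_star = np.random.default_rng().choice(Z, size=len(Z))   # replace=True is the default
    print(Z_star)   # same size; some objects repeated, some missing, e.g. [3 1 5 5 2]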


Bagging

Bagging is the second idea essential for understanding the random forest algorithm.

Bagging (Bootstrap Aggregation) is a general method for averaging the predictions of other algorithms: not only decision trees, but any algorithm in general.

Bagging works because it reduces the variance of the prediction.


Algorithm: Bagging
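Since the bagging algorithm itself appears here only as a figure, below is a hedged Python sketch of the generic procedure: train B copies of a base learner, each on a bootstrap replica, and average their predictions. The base learner is assumed to follow the usual fit/predict interface, with fit returning the fitted model.

    import numpy as np

    def bagging_fit(X, y, make_base_learner, B=100, rng=None):
        # Each model sees a bootstrap replica Z* of the training set (X, y).
        rng = rng or np.random.default_rng()
        n = len(X)
        models = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)       # sampling with replacement
            models.append(make_base_learner().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        # Averaging the predictions reduces their variance (regression case).
        return np.mean([m.predict(X) for m in models], axis=0)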


Why does Bagging work?

The averaged model f(x) has higher predictive power than any single f_b(x), b = 1, …, B.

In most situations, with any machine learning method at the core, the quality of such aggregated predictions will be better than that of any single prediction.

Why does bagging work? This phenomenon is based on a very general principle called the bias-variance trade-off. You can consider the training data set to be random by itself.


Why does Bagging work?

Why is that so? What is the training data set? In a real situation, the training data set may be user behavior on the Internet, for example web browsing, search engine usage, clicks on advertisements, and so on. Other examples of training data are physical measurements, for example temperature, location, date, time, and so on. All these measurements are essentially stochastic: if you repeat the same experiment under the same conditions, the measurements will differ because of noise, and user behavior is essentially stochastic and not exactly predictable. So the training data set itself is random.
Why does Bagging work?

Bagging is an averaging over a set of possible datasets, removing the noisy and unstable parts of the models. After averaging, the noisy parts of the machine learning models vanish, whereas the stable and reliable parts remain. The quality of the averaged model is better than that of any single model.


Summary

Bootstrap: a method for generating different replicas of the dataset.

Bagging (Bootstrap Aggregation): a method for averaging predictions and reducing the prediction's variance.

Bagging improves the quality of almost any machine learning method, but it is very time consuming for large data sets.
Random Forest


Random Forest

The Random Forest algorithm is bagging of de-correlated decision trees.


Algorithm: Random Forest
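The Random Forest algorithm slide is a figure, so here is a schematic Python sketch of the idea: bagging plus a random subset of candidate features at every split. scikit-learn's max_features argument stands in for the per-split selection of m out of p variables; using it this way is an illustration, not the lecture's implementation (m ≈ p/3 is the regression recommendation from the next slides).

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def random_forest_fit(X, y, B=100, rng=None):
        rng = rng or np.random.default_rng()
        n = len(X)
        trees = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)                     # bootstrap replica
            tree = DecisionTreeRegressor(max_features=1.0 / 3)   # ~p/3 candidate features per split
            trees.append(tree.fit(X[idx], y[idx]))
        return trees

    def random_forest_predict(trees, X):
        return np.mean([t.predict(X) for t in trees], axis=0)    # average over de-correlated trees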


How to grow a random forest decision tree

The tree is built greedily from top to bottom. At each node, select m ≤ p of the input variables at random as candidates for splitting. Each split is selected to maximize the information gain (IG).


How to grow a random forest decision tree

Select m ≤ p of the input variables at random as candidates for splitting.

Recommendations from the inventors of Random Forests:
- m = √p for classification, minInstancesPerNode = 1
- m = p/3 for regression, minInstancesPerNode = 5


Random forest

Here are the results of training two random forests. The first variant, marked in green, uses m = p at each step: we grow a regular decision tree and find the best split among all the variables. The blue line is the de-correlated variant: here we randomly pick m = √p, so the trees are built using different subsets of variables. As the diagram shows, at the initial stage the m = √p variant is worse, up to about 20 iterations, but eventually this variant of the Random Forest algorithm converges to the better solution.


Summary

Random Forest is a good method for general-purpose classification/regression problems (typically slightly worse than gradient boosted decision trees).


Summary

Automatic handling of feature interactions: This algorithm can automatically handle interactions of features, because this is already done by a single decision tree.

Computational scalability: In the Random Forest algorithm, each tree can be built independently of the other trees. This is an important feature worth emphasizing: it is why the Random Forest algorithm is essentially parallel. The Random Forest can be trained in a distributed environment with a high degree of parallelization.


Summary

Predictive power: The predictive power of the Random Forest is better than that of a single decision tree, but slightly worse than that of gradient boosted decision trees.

Interpretability: Here you lose interpretability, because an ensemble of hundreds or thousands of Random Forest decision trees cannot be analyzed by a human expert.


Gradient Boosted Decision Trees: Regression


Boosting

Boosting is a method for combining the outputs of many weak classifiers to produce a powerful ensemble. There are several variants of boosting algorithms: AdaBoost, BrownBoost, LogitBoost, and Gradient Boosting.


Big Data

Big data means a large number of training examples and a large number of features describing the objects. In this situation it is natural to want to train a really complex model, since you have such a great amount of data, and hopefully this model will be accurate.

There are two basic ways in machine learning to build complex models. The first is to start with a complex model from the very beginning and fit its parameters; this is exactly how neural networks operate. The second is to build a complex model iteratively, where each step requires training only a simple model. In the context of boosting, these simple models are called weak classifiers, or base classifiers.


Regression

Given a training set Z = {(x1, y1), …, (xn, yn)}, where xi are features and yi are targets (real values).

The goal is to find f(x) using the training set such that the squared error Σi (f(xi) − yi)² is small on a test set T = {(x1, y1), …, (xn, yn)}.

How do we build f(x)?


Gradient Boosted Trees for Regression

How do we build such an f(x)? In boosting, the goal is to build the function f(x) iteratively. We suppose that f(x) is simply a sum of other simple functions hm(x):

f(x) = Σm hm(x)

In particular, we assume that each function hm(x) is a decision tree.


Algorithm: Gradient Boosted Trees for Regression
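The algorithm was shown as a figure; below is a standard sketch of gradient boosting for regression with squared loss, where each new tree is fit to the current residuals (the negative gradient). Using shallow scikit-learn trees as weak learners, and the learning rate and iteration count shown, are illustrative assumptions.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gbt_regression_fit(X, y, M=100, learning_rate=0.1, max_depth=3):
        # f(x) = f0 + lr * sum_m h_m(x), built iteratively.
        f0 = y.mean()                               # initial constant prediction
        pred = np.full(len(y), f0)
        trees = []
        for _ in range(M):
            residuals = y - pred                    # negative gradient of the squared loss
            h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
            pred += learning_rate * h.predict(X)    # small step, scaled by the learning rate
            trees.append(h)
        return f0, trees

    def gbt_regression_predict(f0, trees, X, learning_rate=0.1):
        pred = np.full(X.shape[0], f0)
        for h in trees:
            pred += learning_rate * h.predict(X)
        return pred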


Optimization Theory

You may have noticed that gradient boosting is somewhat similar to gradient descent in optimization theory. If we want to minimize a function using gradient descent, we make a small step in the direction opposite to the gradient. The gradient of a function, by definition, is a vector that points in the direction of fastest increase; since we want to minimize the function, we move in the opposite direction. To ensure convergence, we must make very small steps, so each gradient is multiplied by a small constant called the step size. This is very similar to what we do in gradient boosting.

Gradient boosting can therefore be considered a minimization in the functional space: boosting is minimization in the functional space.


Summary

Boosting is a method for combining the outputs of many weak classifiers or regressors to produce a powerful ensemble.

Gradient Boosting is a gradient descent minimization of the target function in the functional space.

Gradient Boosting with Decision Trees is considered to be the best algorithm for general-purpose classification or regression problems.


Gradient Boosted Decision Trees: Classification


Classification

Given a training set Z = {(x1, y1), …, (xn, yn)}, where xi are features and yi are class labels (0, 1).

The goal is to find f(x) using the training set such that the classification error is small on a test set T = {(x1, y1), …, (xn, yn)}.

How do we build f(x)?


Gradient Boosted Trees for Classification

How are we going to build such a function f(x)? We use a probabilistic model:

P(y = 1 | x) = 1 / (1 + exp(−Σm hm(x)))

We model the probability that an object belongs to the first class. Inside the exp there is the sum of the hm(x), and each hm(x) is a decision tree. One can easily check that this expression for the probability is always between zero and one, so it is a proper probability.


Sigmoid Function

The function σ(z) = 1 / (1 + exp(−z)) is called the sigmoid function; it maps all real values into the range between zero and one.


Let us denote the sum of all hm(x) by f(x); it is an ensemble of decision trees. Then the probability of belonging to the first class can be written in a simple way using f(x): P(y = 1 | x) = 1 / (1 + exp(−f(x))). The main idea used here is the principle of maximum likelihood.

What is it? First of all, what is the likelihood? The likelihood is the probability of observing some data given a statistical model. If we have a data set with objects 1 to n, then the probability of observing such a data set is the product of the probabilities for all the individual objects. This product is called the likelihood.


The principle of maximum likelihood

Algorithm: find a function f(x) maximizing the likelihood.
Equivalently: find a function f(x) maximizing the logarithm of the likelihood (since the logarithm is a monotone function).


The principle of maximum likelihood

We denote by Q[f] the logarithm of the likelihood; it is the sum of the logarithms of the probabilities, and we are going to maximize this function.

We use the shorthand L(yi, f(xi)) for this logarithm of probability, emphasizing that it depends on the true label yi and on our prediction f(xi). Then Q[f] = Σi L(yi, f(xi)).


Algorithm: Gradient Boosted Trees for Classification


Stochastic Boosting


Algorithm: Gradient Boosted Trees for Classification


Algorithm: Gradient Boosted Trees for Classification + Stochastic Boosting
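The classification and stochastic-boosting algorithms above were shown as figures; the sketch below combines them under the log-likelihood objective from the previous slides: each tree is fit to the gradient y − p(x) on a random subsample of the training data. The weak learner, depth, learning rate, and subsample size are illustrative assumptions.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def stochastic_gbt_classification_fit(X, y, M=100, learning_rate=0.1,
                                          subsample=0.5, max_depth=3, rng=None):
        rng = rng or np.random.default_rng()
        n = len(y)
        f = np.zeros(n)                            # current ensemble scores f(x_i)
        trees = []
        for _ in range(M):
            grad = y - sigmoid(f)                  # gradient of the log-likelihood w.r.t. f
            idx = rng.choice(n, size=int(subsample * n), replace=True)   # random subsample
            h = DecisionTreeRegressor(max_depth=max_depth).fit(X[idx], grad[idx])
            f += learning_rate * h.predict(X)
            trees.append(h)
        return trees

    def predict_proba(trees, X, learning_rate=0.1):
        f = sum(learning_rate * h.predict(X) for h in trees)
        return sigmoid(f)                          # P(y = 1 | x)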


Sampling with Replacement

n = 8: 1 2 3 4 5 6 7 8
k = 4: 7 3 1 3

At each boosting iteration a random subsample of size k (here k = 4 out of n = 8 objects, drawn with replacement) is used to train the next tree.


Tips for Usage

First of all, it is important to understand how the regularization parameter works. In this figure you can see the behavior of the gradient boosted decision trees algorithm with two values of this parameter, 0.1 and 0.05. At the initial stage of learning, the variant with parameter 0.1 is better because it has a lower testing error.


Tips for Usage

At each iteration, you measure the testing error of the ensemble on a held-out data set. Eventually the variant with the lower value of the regularization parameter, 0.05, reaches a lower testing error, and finally this variant turns out to be superior.

This is very typical behavior: you should not stop the algorithm after a few dozen iterations but proceed until convergence, which happens when the testing error no longer changes much. The variant with the lower regularization parameter converges more slowly, but eventually it builds a better model.


Tips for Usage

- The recommended learningRate is less than or equal to 0.1.
- The bigger your data set, the larger the number of iterations should be; the recommended number of iterations ranges from several hundred to several thousand.
- The more features you have in your data set, the deeper your decision trees should be.

These are very general rules: the bigger your data set and the more features you have, the more complex a model you can build without overfitting.


Summary

Gradient boosted decision trees are among the best methods for general-purpose classification and regression problems.

The algorithm automatically handles interactions of features because, at its core, it is based on decision trees, which can combine several features in a single tree.

The algorithm is also computationally scalable: it can be effectively executed in a distributed environment, for example on top of a Spark cluster.


Summary

The algorithm also has very good predictive power, but unfortunately the models are not interpretable: the final ensemble is very large and cannot be analyzed by a human expert.

There is always a trade-off in machine learning between predictive power and interpretability: the more complex and accurate your model is, the harder it is for a human to analyze it.


Spark ML, Decision Trees and Ensembles


Introduction

This lesson is about using Spark ML for classification and regression with decision trees and ensembles of decision trees.


First of all, you create a SparkContext and a SparkSession.


Now you download the dataset.


Let's explore this dataset a little bit. The first column is the ID of the observation, the second column is the diagnosis, and the other columns are comma-separated features.


These features are the results of various analyses and measurements. There are 569 examples in total in this dataset.


First you need to transform the label, which is either M or B in the second column, from a string to a number. You use a StringIndexer object for this purpose, and you first need to load the dataset.


Then you create a Spark DataFrame, which is stored in a distributed manner on the cluster.


The inputDF DataFrame has two columns, label and features. We use a Vector object to create the vector column in this dataset.


Then we can do the string indexing: Spark enumerates all the possible labels in string form and transforms them into label indexes.


Now label M corresponds to 1 and label B to 0.


First of all, you make a train/test split in the proportion 70% to 30%. The first model you are going to evaluate is a single decision tree.


Here we import the DecisionTreeClassifier object, create the estimator responsible for training, then call the fit method on the training data and obtain a decision tree model.


The training was quite fast, and here are some results: the number of nodes and the depth of the decision tree, the feature importances, the total number of features used in this decision tree, and so on.


We can even visualize this decision tree and explore it; here is its structure, showing the If/Else splitting conditions and the predicted values in the leaves of the decision tree.


Here we apply the decision tree model to the test data and obtain predictions.


Here we can explore these predictions. The predictions are in the last column, and in this particular case our model always predicts the zero class.


Now we can evaluate the accuracy of the model. For this purpose, we use a MulticlassClassificationEvaluator with the metric named "accuracy". The testing error is 3%, so the model is quite accurate.
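A short PySpark sketch of this evaluation step, reusing the dtModel and testDF names from the earlier sketches:

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    predictions = dtModel.transform(testDF)        # apply the tree to the test data
    evaluator = MulticlassClassificationEvaluator(
        labelCol="labelIndexed", predictionCol="prediction", metricName="accuracy")
    accuracy = evaluator.evaluate(predictions)
    print("Test error =", 1.0 - accuracy)          # about 3% in the lecture's run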


Gradient Boosted Decision Trees

First we import it and create an object that will do this classification. Here you specify labelCol = "labelIndexed" and featuresCol = "features". Actually, it is not mandatory to specify the features column, because its default name is "features", so we can pass this argument or omit it.


Gradient Boosted Decision Trees

We are going to do 100 iterations, and the default stepSize is 0.1.
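A hedged PySpark sketch matching this description (column names as in the earlier sketches; maxIter=100 and the default stepSize=0.1):

    from pyspark.ml.classification import GBTClassifier

    gbt = GBTClassifier(labelCol="labelIndexed", featuresCol="features",
                        maxIter=100, stepSize=0.1)
    gbtModel = gbt.fit(trainDF)

    print(gbtModel.featureImportances)        # importances for the whole ensemble
    print(gbtModel.toDebugString[:1000])      # the full dump is very long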


Gradient Boosted Decision Trees

The model is ready, and we can explore the featureImportances. We can even visualize the ensemble of decision trees, but it is quite large and long, and it is not interpretable by a human expert.


Gradient Boosted Decision Trees

Now you make predictions on the test data, and finally we evaluate the accuracy.


Gradient Boosted Decision Trees

In this case, the accuracy is a bit lower than that of the single decision tree.


Random Forest

We import the classes required for evaluating a random forest, create an object, and fit it to the training data.


Random Forest

Here are the featureImportances, and here is the ensemble itself; again, it is quite large.


Random Forest

Here are our model's predictions and the testing accuracy.


Random Forest

We can see that in this example the testing accuracy of the random forest was the best, but with another dataset the situation may be quite different. As a general rule, the bigger your dataset is and the more features it has, the better the quality of complex algorithms like gradient boosted decision trees or random forests will be.


Spark ML, Cross Validation


Cross Validation

Cross-validation helps to assess the quality of a machine learning model and to find the best model among a family of models. First of all, we start the SparkContext.


Cross Validation

We use the same dataset that we used for evaluating the different decision trees.


Cross Validation

Finally we have the dataset, which is called inputDF2.


Cross Validation

Here is the content of inputDF2.


Cross Validation

What steps are required for doing cross-validation with this dataset? Suppose we want to select the best parameters of a single decision tree. We create a decision tree object and then we create a pipeline.


Cross Validation

A pipeline, in general, may contain many stages, including feature pre-processing, string indexing, machine learning, and so on. In this case, the pipeline contains only one stage: training the decision tree.


Cross Validation

Then we import the CrossValidator and ParamGridBuilder classes and create a ParamGridBuilder. For example, we want to select the best maximum depth of the decision tree in the range from 1 to 8. With the ParamGridBuilder in place, we create an evaluator, because we want to select the model with the best accuracy.


Cross Validation

We create a CrossValidator and pass into it the pipeline, the ParamGrid, and the evaluator.


Cross Validation

Finally, we select the number of folds, which should not be less than 5 (typically 5 or 10).
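Putting the cross-validation steps together, a hedged PySpark sketch might look like this (the column names and the labelIndexed label follow the earlier sketches and are assumptions; inputDF2 is the dataset named above):

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    dt = DecisionTreeClassifier(labelCol="labelIndexed", featuresCol="features")
    pipeline = Pipeline(stages=[dt])                        # a one-stage pipeline

    paramGrid = (ParamGridBuilder()
                 .addGrid(dt.maxDepth, list(range(1, 9)))   # depths 1..8
                 .build())

    evaluator = MulticlassClassificationEvaluator(
        labelCol="labelIndexed", predictionCol="prediction", metricName="accuracy")

    cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid,
                        evaluator=evaluator, numFolds=10)
    cvModel = cv.fit(inputDF2)                              # trains and evaluates numFolds times

    print(cvModel.avgMetrics)                               # average accuracy per depth
    bestTree = cvModel.bestModel.stages[0]                  # the best single decision tree
    print(bestTree.depth, bestTree.numNodes)                # depth 6 with 47 nodes in the lecture's run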


Cross Validation

We create the cvModel; it takes some time because Spark needs to train and evaluate the model 10 times.


Cross Validation

You can see the average accuracy across the folds for each value of the decision tree depth.


Cross Validation

The first stage of our pipeline was the decision tree; you can get the best model, which has depth 6 and 47 nodes.


Cross Validation

We can then use this model to make predictions on any other dataset. In the ParamGridBuilder we can vary several parameters, for example the maximum depth together with other parameters of the decision tree, such as minInstancesPerNode, and specify another grid here; in this simple example we did not do so, for simplicity. If we evaluate only one parameter, the training is much faster.


Conclusion

In this lecture, we have discussed the concepts of Random Forest and Gradient Boosted Decision Trees, and a case study with Spark ML programming, decision trees, and ensembles.
