Week-7 - Lecture Notes
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:
In this lecture, we will discuss decision trees for big data analytics, how to build decision trees at scale using MapReduce, and a medical application using a Decision Tree in Spark ML.
Decision Trees
Predict if Sachin will play cricket
Consider the following data set: our task is to predict whether Sachin is going to play cricket on a given day. We have observed Sachin over a number of days and recorded various things that might influence his decision to play cricket: what kind of weather it is (sunny or raining), whether the humidity is high or normal, and whether it is windy. This is our training set, and these are the examples from which we are going to build a classifier.
Predict if Sachin will play cricket
Suppose that on day 15 it is raining, the humidity is high, and it is not very windy: is Sachin going to play or not? Just by looking at the data it is hard to decide, because on some rainy days Sachin plays and on other rainy days he does not; sometimes he plays with strong winds and sometimes he does not, and likewise with weak winds. So what do you do, how do you predict it? The basic idea behind decision trees is to try, at some level, to understand why Sachin plays. This is the only classifier we will cover that tries to predict Sachin's playing in this way.
Predict if Sachin will play cricket
Hard to guess
Try to understand when Sachin plays
Divide & conquer:
Split into subsets
Are they pure? (all yes or all no)
If yes: stop
If not: repeat
See which subset new data falls into
ID3 Algorithm
In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm
invented by Ross Quinlan used to generate a decision tree from a
dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used
in the machine learning and natural language processing domains.
Split (node, {examples}):
1. A ← the best attribute for splitting the {examples}
2. Decision attribute for this node ← A
3. For each value of A, create a new child node
4. Split training {examples} to child nodes
5. For each child node / subset:
• if subset is pure: STOP
• else: Split (child_node, {subset})
Ross Quinlan (ID3: 1986), (C4.5: 1993)
Breiman et al. (CART: 1984) from statistics
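A minimal Python sketch of this recursive Split procedure, assuming the training examples are represented as dicts and using entropy-based information gain as the attribute-selection criterion (the attribute names and data layout are illustrative, not from the original slides):

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum_i p_i * log2(p_i) over the class labels in this subset
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def id3(examples, attributes, target="Play"):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:          # subset is pure: STOP
        return labels[0]
    if not attributes:                 # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]

    def info_gain(attr):
        remainder = 0.0
        for value in set(ex[attr] for ex in examples):
            subset = [ex[target] for ex in examples if ex[attr] == value]
            remainder += len(subset) / len(examples) * entropy(subset)
        return entropy(labels) - remainder

    best = max(attributes, key=info_gain)           # step 1: best attribute A
    tree = {best: {}}                               # step 2: decision attribute for this node
    for value in set(ex[best] for ex in examples):  # steps 3-4: one child per value of A
        subset = [ex for ex in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, rest, target)   # step 5: recurse on impure subsets
    return tree

# Example call (hypothetical column names):
# id3(training_days, ["Outlook", "Humidity", "Wind"])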
Which attribute to split on ?
Want to measure “purity” of the split
More certain about Yes/No after the split
• pure set (4 yes / 0 no) ➔ completely certain (100%)
• impure (3 yes / 3 no) ➔ completely uncertain (50%)
Can’t use P(“yes” | set):
• must be symmetric: 4 yes / 0 no as pure as 0 yes / 4 no
Entropy
Entropy: H(S) = − p(+) log2 p(+) − p(-) log2 p(-) bits
S … subset of training examples
p(+) / p(-) … % of positive/negative examples in S
Interpretation: assume an item X belongs to S.
How many bits are needed to tell whether X is positive or negative?
Impure (3 yes / 3 no): H(S) = 1 bit
Pure set (4 yes / 0 no): H(S) = 0 bits
Information Gain
Want many items in pure sets
Expected drop in entropy after the split (the mutual information between attribute A and the class labels of S):
Gain(S, A) = H(S) − Σv (|Sv| / |S|) · H(Sv)
Gain(S, Wind)
= H(S) − 8/14 · H(Sweak) − 6/14 · H(Sstrong)
= 0.94 − 8/14 · 0.81 − 6/14 · 1.0
= 0.049
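These numbers can be checked with a few lines of Python; the 9 yes / 5 no overall counts and the 8 weak-wind / 6 strong-wind counts are the ones behind the figures above (assuming the standard 14-day training set):

import math

def H(pos, neg):
    # binary entropy of a set with pos positive and neg negative examples
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

H_S      = H(9, 5)   # ~0.94 for the whole training set
H_weak   = H(6, 2)   # ~0.81 for the 8 weak-wind days
H_strong = H(3, 3)   #  1.0 for the 6 strong-wind days
gain_wind = H_S - 8/14 * H_weak - 6/14 * H_strong   # ~0.049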
Decision Trees for Regression
How to grow a decision tree
The tree is built greedily from top to bottom.
Each split is selected to maximize information gain (IG).
Decision Tree for Regression
Given a training set: Z = {(x1, y1), …, (xn, yn)}, where the yi are real values.
Goal is to find f(x) (a tree) such that the squared error Σi (f(xi) − yi)² is small.
How to grow a decision tree for regression?
How to find the best split
A tree is built from top to bottom, and at each step you should find the best split in an internal node. You are given a dataset Z, and the splitting criterion is xk < t versus xk ≥ t for a threshold t. The problem is how to find k, the index of the feature, and the threshold t. You also need to find the values in the leaves, aL and aR.
What happens without a split?
Without a split, the best you can do is to predict one number, a. It makes sense to select such a number a to minimize the squared error.
It is easy to prove that this value a equals the average of all the yi, so you need to calculate the average value of all the targets. Here â denotes this average value.
The impurity of Z equals the mean squared error if we use â as the prediction.
Find the best split (xk < t)
What happens if you make a split by a condition xk < t? For one part of the training objects, ZL, we have xk < t; for the other part, ZR, xk ≥ t holds.
The error consists of two parts, for ZL and ZR respectively, so here we need to find k, t, aL and aR simultaneously.
We can calculate the optimal aL and aR exactly the same way as we did for the case without a split: it is easy to prove that the optimal aL equals the average of the targets of the objects which get to that leaf, and similarly for aR. We denote these values by âL and âR respectively.
Find the best split
After this step we only need to find the optimal k and t. We have formulas for the impurity of the objects which go to the left branch of the splitting condition and of those which go to the right. Then you can find the best splitting criterion, the one which maximizes the information gain. This procedure is done iteratively from top to bottom.
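A brute-force sketch of this search in Python (X is an assumed n-by-d NumPy array of features and y the vector of targets; production implementations avoid recomputing the means by sorting each feature and using prefix sums):

import numpy as np

def impurity(y):
    # mean squared error around the mean prediction (a-hat) for this subset
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y):
    n, d = X.shape
    parent = impurity(y)
    best = None                       # (gain, k, t, a_hat_L, a_hat_R)
    for k in range(d):
        for t in np.unique(X[:, k]):
            left, right = y[X[:, k] < t], y[X[:, k] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            # information gain = drop in the weighted impurity after the split
            gain = parent - (len(left) * impurity(left) + len(right) * impurity(right)) / n
            if best is None or gain > best[0]:
                best = (gain, k, t, float(left.mean()), float(right.mean()))
    return best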
Stopping rule
The node depth is equal to the maxDepth training parameter.
No split candidate leads to an information gain greater than minInfoGain.
No split candidate produces child nodes which each have at least minInstancesPerNode training instances (|ZL|, |ZR| < minInstancesPerNode).
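These names correspond to the Spark MLlib training parameters; a hedged PySpark sketch of setting them (the DataFrame and column names are placeholders):

from pyspark.ml.regression import DecisionTreeRegressor

dt = DecisionTreeRegressor(
    featuresCol="features", labelCol="label",
    maxDepth=10,               # stop expanding a node once it reaches this depth
    minInfoGain=0.0,           # require at least this information gain to split
    minInstancesPerNode=5)     # each child must receive at least this many instances
# model = dt.fit(trainingData)   # trainingData is an assumed DataFrame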
Summary: Decision Trees
Automatically handling interactions of features: The benefit of decision trees is that the algorithm can automatically handle interactions of features, because it can combine several different features in a single decision tree and build complex functions involving multiple splitting criteria.
Computational scalability: There exist effective algorithms for building decision trees for very large data sets with many features.
Predictive power: A single decision tree is actually not a very good predictor; the predictive power of a single tree is typically not so good.
Interpretability: You can visualize the decision tree and analyze the splitting criteria in the nodes, the values in the leaves, and so on. Sometimes this is helpful.
Building a tree using
MapReduce
Problem: Building a tree
Given a large dataset, build a decision tree by running FindBestSplit at every node.
Tree is small (can keep it in memory):
• Shallow (~10 levels)
Dataset too large to keep in memory
Dataset too big to scan over on a single machine
MapReduce to the rescue!
MapReduce
PLANET
Parallel Learner for Assembling Numerous
Ensemble Trees [Panda et al., VLDB ‘09]
A sequence of MapReduce jobs that build a decision tree
Setting:
Hundreds of numerical (discrete & continuous) attributes
Target (class) is numerical: Regression
Splits are binary: Xj < v
Decision tree is small enough for each Mapper to keep it in memory
Data too large to keep in memory
PLANET Architecture
[Figure: the Master keeps the model and attribute metadata and launches MapReduce jobs (FindBestSplit, InMemoryGrow) over the input data; the jobs return intermediate results to the Master.]
PLANET Overview
Mapper:
Considers a number of possible splits (Xi, v) on its subset of the data
For each split it stores partial statistics
Partial split-statistics are sent to Reducers
Reducer:
Collects all partial statistics and determines the best split
Master grows the tree for one level
PLANET Overview
Mapper “drops” each datapoint to find the appropriate leaf node L
For each leaf node L it keeps statistics about
1) the data reaching L
2) the data in left/right subtree under split S
Reducer aggregates the statistics (1) and (2) and determines the best split for each node
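A schematic Python sketch (not the actual PLANET code) of what the FindBestSplit mapper and reducer exchange; the tree.route and point.x / point.y accessors are assumptions used only to illustrate the data flow:

def find_best_split_mapper(point, tree, candidate_splits):
    # 'drop' the datapoint through the current tree to its leaf node L
    leaf = tree.route(point)
    for feature, value in candidate_splits[leaf]:
        side = "left" if point.x[feature] < value else "right"
        # partial statistics: count, sum(y) and sum(y^2) are enough to compute
        # the variance-based impurity of each child on the reducer side
        yield (leaf, feature, value, side), (1, point.y, point.y ** 2)

def find_best_split_reducer(key, partial_stats):
    # aggregate the partial statistics for one (leaf, split, side) combination;
    # the Master then compares all candidate splits of a leaf and grows the
    # tree by one level using the best one
    n   = sum(s[0] for s in partial_stats)
    sy  = sum(s[1] for s in partial_stats)
    syy = sum(s[2] for s in partial_stats)
    yield key, (n, sy, syy)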
PLANET: Components
Master
Monitors everything (runs multiple MapReduce jobs)
Three types of MapReduce jobs:
(1) MapReduce Initialization (run once, first)
For each attribute, identify values to be considered for splits
(2) MapReduce FindBestSplit (run multiple times)
MapReduce job to find the best split when there is too much data to fit in memory
(3) MapReduce InMemoryBuild (run once, last)
Similar to FindBestSplit (but for small data)
Grows an entire sub-tree once the data fits in memory
Reference
B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. VLDB 2009.
J. Ye, J.-H. Chow, J. Chen, and Z. Zheng. Stochastic Gradient Boosted Distributed Decision Trees. CIKM 2009.
Example: Medical Application using a Decision Tree in Spark ML
Create SparkContext and SparkSession
This example is about using Spark ML for doing classification and regression with decision trees and ensembles of decision trees.
First of all, you are going to create a SparkContext and a SparkSession, and here it is.
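A typical way to do this in PySpark (the application name is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DecisionTreeBreastCancer") \
    .getOrCreate()
sc = spark.sparkContext   # the underlying SparkContext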
Download a dataset
Now you are downloading a dataset which is about breast cancer diagnosis; here it is.
Exploring the Dataset
Let's explore this dataset a little bit. The first column is the ID of the observation. The second column is the diagnosis, and the other columns are features, which are comma separated. These features are the results of some analyses and measurements. There are 569 examples in total in this dataset. If the second column is M, it means that it is cancer; if it is B, it means that there is no cancer in a particular woman.
PT
N
Big Data Computing Vu Pham Decision Trees for Big Data Analytics
Exploring the Dataset
First of all you need to transform the label, which is either M or B, from the second column: you should transform it from a string to a number. We use a StringIndexer object for this purpose. First you need to load the dataset, then you create a Spark DataFrame, which is stored in a distributed manner on the cluster.
Exploring the Dataset
The inputDF DataFrame has two columns, label and features. You use the Vectors object for creating a vector column in this dataset, and then you can do the string indexing. Spark now enumerates all the possible labels in string form and transforms them to label indexes: label M is equivalent to 1, and label B is equivalent to 0.
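One possible PySpark sketch of these steps; the file name, the column layout (id, diagnosis, then the numeric features) and the output column names follow the description above and are assumptions:

from pyspark.ml.feature import StringIndexer
from pyspark.ml.linalg import Vectors

# each line: id, diagnosis (M or B), then the comma-separated numeric features
lines = sc.textFile("wdbc.data")
parsed = lines.map(lambda l: l.split(",")) \
              .map(lambda p: (p[1], Vectors.dense([float(v) for v in p[2:]])))
inputDF = spark.createDataFrame(parsed, ["label", "features"])

# StringIndexer enumerates the string labels and maps them to numeric indexes
indexer = StringIndexer(inputCol="label", outputCol="labelIndex")
indexedDF = indexer.fit(inputDF).transform(inputDF)   # M -> 1.0, B -> 0.0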
Train/Test Split
We can start doing the machine learning right now. First of all, you make a train/test split in the proportion 70% to 30%. The first model you are going to evaluate is a single decision tree.
Train/Test Split
We import the DecisionTreeClassifier object, create the class which is responsible for training, call the fit method on the training data, and obtain a decision tree model.
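A hedged sketch of the 70/30 split and the single-tree training, continuing the column names assumed above:

from pyspark.ml.classification import DecisionTreeClassifier

trainingData, testData = indexedDF.randomSplit([0.7, 0.3], seed=42)

dt = DecisionTreeClassifier(labelCol="labelIndex", featuresCol="features")
model = dt.fit(trainingData)   # returns a DecisionTreeClassificationModel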
Train/Test Split
The model exposes the number of nodes and the depth of the decision tree, the feature importances, the total number of features used in this decision tree, and so on.
We can even visualize this decision tree and explore it: here is the structure of the decision tree, with the If and Else splitting conditions and the predicted values in the leaves of our decision tree.
Train/Test Split
Now we apply the decision tree model to the test data and obtain predictions. You can explore these predictions; they are in the last column. In this particular case, our model always predicts the zero class.
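The properties mentioned above can be read off the fitted model, and the model is applied to the test data with transform (a sketch, continuing the assumed names):

print(model.numNodes, model.depth)   # size and depth of the learned tree
print(model.featureImportances)      # per-feature importance scores
print(model.toDebugString)           # the full If/Else structure of the tree

predictions = model.transform(testData)   # adds rawPrediction, probability, prediction
predictions.select("labelIndex", "prediction").show(10)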
Predictions
Accuracy
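A common way to compute the test accuracy of these predictions in Spark ML (a sketch, assuming the column names used above):

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(
    labelCol="labelIndex", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test accuracy = %.3f" % accuracy)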
Conclusion
We have discussed decision trees for big data analytics and how to build them at scale with MapReduce (PLANET).
We have also discussed a case study of Breast Cancer Diagnosis using a Decision Tree in Spark ML.
Big Data Predictive Analytics
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
We will mainly cover Random Forest, Gradient Boosted Decision Trees and a Case Study with Spark ML Programming, Decision Trees and Ensembles.
Computational scalability: There exist effective algorithms for building decision trees for very large data sets with many features. But unfortunately, a single decision tree is actually not a very good predictor.
Predictive power: The predictive power of a single tree is typically not so good.
Consider a dataset Z = {(x1, y1), …, (xn, yn)}.
Bootstrapped dataset Z*: a modification of the original dataset Z, produced by random sampling with replacement.
Example: original dataset 1 2 3 4 5 → bootstrap dataset 3 1 5 5 2 (each element drawn at random, with replacement, from the original dataset).
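A two-line Python sketch of producing such a bootstrapped dataset:

import random

Z = [1, 2, 3, 4, 5]                                   # original dataset
Z_star = [random.choice(Z) for _ in range(len(Z))]    # e.g. [3, 1, 5, 5, 2]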
Bagging can also be used for averaging the predictions of other algorithms, not only decision trees but any other algorithm in general.
Bagging works because it reduces the variance of the prediction.
In most situations, with any machine learning method at the core, the quality of such aggregated predictions will be better than that of any single prediction.
Why does bagging work?
This phenomenon is based on a very general principle which is called the bias-variance trade-off. You can consider the training data set to be random by itself.
In a real situation, the training data set may be user behavior on the Internet, for example web browsing, using a search engine, doing clicks on advertisements, and so on. Or it may be physical measurements, for example temperature, location, date, time, and so on. And all these measurements are essentially stochastic.
If you could repeat the same experiment in the same conditions, the measurements would actually be different, because of the noise in the measurements and because user behavior is essentially stochastic and not exactly predictable. Now you understand that the training data set itself is random.
Why does Bagging work ?
After averaging, the noisy parts of the machine learning models will vanish, whereas the stable and reliable parts will remain. The quality of the average model will be better than that of any single model.
Bagging (Bootstrap Aggregation): a method for averaging predictions and reducing the prediction’s variance.
Bagging improves the quality of almost any machine learning method.
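A minimal sketch of bagging for regression, assuming some base learner train(sample) that returns a model with a predict method:

import random

def bagging(train, data, n_models=100):
    models = []
    for _ in range(n_models):
        # each model is trained on its own bootstrapped copy of the data
        sample = [random.choice(data) for _ in range(len(data))]
        models.append(train(sample))
    # the bagged prediction is the average of the individual predictions,
    # which reduces the variance of the prediction
    def predict(x):
        return sum(m.predict(x) for m in models) / len(models)
    return predict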
Random Forest: bagging applied to de-correlated decision trees.
Each split is selected to maximize information gain (IG).
Recommendations from the inventors of Random Forests:
m = √p for classification, minInstancesPerNode = 1
m = p/3 for regression, minInstancesPerNode = 5
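In Spark ML these recommendations map onto the featureSubsetStrategy and minInstancesPerNode parameters; a hedged sketch for the classification case (DataFrame and column names are assumptions):

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(
    labelCol="labelIndex", featuresCol="features",
    numTrees=100,
    featureSubsetStrategy="sqrt",   # m = sqrt(p), the recommendation for classification
    minInstancesPerNode=1)
# rfModel = rf.fit(trainingData)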
One curve corresponds to a regular decision tree, where you find the best split among all the variables, and the blue line is a de-correlated decision tree. In this situation, we randomly pick m = √p variables, and all the trees are built using different subsets of variables. As we can see in this diagram, at the initial stage the variant with m = √p is worse, before 20 iterations. But eventually, this variant of the Random Forest algorithm converges to the better solution.
Random Forest is a good algorithm for general purpose classification/regression problems (typically slightly worse than gradient boosted decision trees).
The Random Forest algorithm can automatically handle interactions of features, because this can be done by a single decision tree.
In the Random Forest algorithm, each tree can be built independently of the other trees. This is an important feature and I would like to emphasize it: that is why the Random Forest algorithm is essentially parallel.
Interpretability: here you lose interpretability, because a composition of hundreds or thousands of Random Forest decision trees cannot be analyzed by a human expert.
Regression
There are several variants of boosting algorithms: AdaBoost, BrownBoost, LogitBoost, and Gradient Boosting.
In boosting, we want to build a model from the training data and hopefully this model will be accurate. There are two basic ways, in machine learning, to build complex models.
The first way is to start with a complex model from the very beginning and fit its parameters. This is exactly the way a neural network operates.
The second way is to build a complex model iteratively, where each step requires the training of a simple model. In the context of boosting, these models are called weak classifiers, or base classifiers.
Goal is to find f(x) using the training set, such that the error Σi (f(xi) − yi)² is small at the test set T = {(x1, y1), …, (xn, yn)}.
How to build f(x)?
f(x) is built as a sum of simple functions hm(x): f(x) = h1(x) + h2(x) + … + hM(x).
In particular, you assume that each function hm(x) is a decision tree.
The gradient is the direction of the fastest increase of a function. Since we want to minimize the function, we must move in the direction opposite to the gradient. To ensure convergence, we must make very small steps, so we multiply each gradient by a small constant, which is called the step size. This is very similar to what we do in gradient boosting.
Gradient boosting is considered to be a minimization in the functional space.
Boosting: Minimization in the functional space
Gradient Boosting is a gradient descent minimization of the target function in the functional space.
Gradient Boosting with Decision Trees is considered to be the best algorithm for general purpose classification or regression problems.
Classification
Goal is to find f(x) using the training set, such that the classification error is small at the test set T = {(x1, y1), …, (xn, yn)}.
How to build f(x)?
We model the probability of an object belonging to the first class. Inside the exp there is the sum of the hm(x), and each hm(x) is a decision tree.
We can easily check that such an expression for the probability will always be between zero and one, so it is a normal, regular probability.
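Written out explicitly, a standard form consistent with this description (the exact formula on the original slide is not reproduced here, so treat this as an assumption) is:

P(y = 1 | x) = 1 / (1 + exp(−(h1(x) + h2(x) + … + hM(x))))

where each hm(x) is a decision tree.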
Likelihood is the probability of observing some data given a statistical model. If we have a data set with objects from one to n, then the probability of observing such a data set is the multiplication of the probabilities for all single objects. This multiplication is called the likelihood.
Instead of maximizing the likelihood, it is convenient to minimize the negative logarithm of the probability. Here we emphasize that these logarithms actually depend on the true label yi and on our prediction f(xi). Now, Q[f] is the sum of L(yi, f(xi)).
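Under the same assumptions, these quantities can be written as (again a reconstruction, since the original slide formulas are not reproduced here):

Likelihood = P(y1 | x1) · P(y2 | x2) · … · P(yn | xn)
Q[f] = Σi L(yi, f(xi)),   with L(yi, f(xi)) = − log P(yi | xi)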
[Figure: a random subsample of k = 4 objects drawn from a dataset of n = 8 objects.]
This diagram compares the gradient boosted decision trees algorithm with two variants of the step-size (regularization) parameter, 0.1 and 0.05. Eventually the variant with 0.05 reaches a lower testing error; finally, this variant turns out to be superior.
It is a very typical behavior, and you should not stop your algorithm after several dozen iterations; you should proceed until convergence. Convergence happens when your testing error does not change a lot. The variant with lower regularization converges more slowly, but eventually it builds a better model.
The smaller the step size is, the larger the number of iterations should be. The recommended number of iterations ranges from several hundred to several thousand.
Also, the more features you have in your data set, the deeper your decision tree should be.
These are very general rules, because the bigger your data set is and the more features you have, the more complex a model you can build without overfitting.
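A hedged PySpark sketch of a gradient boosted trees classifier with an explicit step size and number of iterations (the specific values are only illustrative, within the ranges discussed above):

from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(
    labelCol="labelIndex", featuresCol="features",
    maxIter=500,     # number of boosting iterations (trees)
    stepSize=0.05,   # the step-size / regularization parameter discussed above
    maxDepth=5)
# gbtModel = gbt.fit(trainingData)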
Gradient boosting automatically handles interactions of features because, at the core, it is based on decision trees, which can combine several features in a single tree.
This algorithm is also computationally scalable: it can be effectively executed in a distributed environment, for example in Spark, so it can be executed on top of a Spark cluster.
Interpretability: the final model is a composition of hundreds or thousands of trees, which cannot be analyzed by a human expert.
There is always a tradeoff in machine learning between predictive power and interpretability, because the more complex and accurate your model is, the harder the analysis of this model by a human.
We want to select the model which has the best accuracy among the others.
Since we evaluate only one parameter, the training is much faster.
Conclusion
We have discussed Random Forest, Gradient Boosted Decision Trees and a case study with Spark ML Programming, Decision Trees and Ensembles.