Week-7 - Lecture Notes
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:
In this lecture, we will discuss decision trees for big data analytics, how to build decision trees at scale using MapReduce, and a medical application using a Decision Tree in Spark ML.
Decision Trees
Predict if Sachin will play cricket
Consider the following data set: our task is to predict whether Sachin is going to play cricket on a given day. We have observed Sachin over a number of days and recorded various things that might influence his decision to play cricket: what kind of weather it is (sunny or raining), whether the humidity is high or normal, and whether it is windy. This is our training set, and these are the examples from which we are going to build a classifier.
Predict if Sachin will play cricket
Suppose that on day 15 it is raining, the humidity is high, and it is not very windy: is Sachin going to play or not? Just by looking at the data it is hard to decide, because on some rainy days Sachin plays and on other rainy days he does not; sometimes he plays with strong winds and sometimes he does not, and likewise with weak winds. So what do you do, how do you predict it? The basic idea behind decision trees is to try, at some level, to understand why Sachin plays. This is the only classifier we will cover that tries to predict Sachin's playing in this way.
Predict if Sachin will play cricket
Hard to guess
Try to understand when Sachin plays
Divide & conquer:
Split into subsets
Are they pure? (all yes or all no)
If yes: stop
If not: repeat
See which subset new data falls into
ID3 Algorithm
In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm
invented by Ross Quinlan used to generate a decision tree from a
dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used
in the machine learning and natural language processing domains.
Split (node, {examples}):
1. A ← the best attribute for splitting the {examples}
2. Decision attribute for this node ← A
3. For each value of A, create a new child node
4. Split training {examples} to child nodes
5. For each child node / subset:
• if subset is pure: STOP
• else: Split (child_node, {subset})
Ross Quinlan (ID3: 1986), (C4.5: 1993)
Breiman et al. (CART: 1984) from statistics
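A minimal Python sketch of this recursive Split procedure, assuming the training examples are represented as dicts and using entropy-based information gain as the attribute-selection criterion (the attribute names and data layout are illustrative, not from the original slides):

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum_i p_i * log2(p_i) over the class labels in this subset
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def id3(examples, attributes, target="Play"):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:          # subset is pure: STOP
        return labels[0]
    if not attributes:                 # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]

    def info_gain(attr):
        remainder = 0.0
        for value in set(ex[attr] for ex in examples):
            subset = [ex[target] for ex in examples if ex[attr] == value]
            remainder += len(subset) / len(examples) * entropy(subset)
        return entropy(labels) - remainder

    best = max(attributes, key=info_gain)           # step 1: best attribute A
    tree = {best: {}}                               # step 2: decision attribute for this node
    for value in set(ex[best] for ex in examples):  # steps 3-4: one child per value of A
        subset = [ex for ex in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, rest, target)   # step 5: recurse on impure subsets
    return tree

# Example call (hypothetical column names):
# id3(training_days, ["Outlook", "Humidity", "Wind"])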
Which attribute to split on ?
Want to measure “purity” of the split
More certain about Yes/No after the split
• pure set (4 yes / 0 no) ➔ completely certain (100%)
• impure (3 yes / 3 no) ➔ completely uncertain (50%)
Can’t use P(“yes” | set):
• must be symmetric: 4 yes / 0 no as pure as 0 yes / 4 no
Entropy
Entropy: H(S) = − p(+) log2 p(+) − p(-) log2 p(-) bits
S … subset of training examples
p(+) / p(-) … % of positive/negative examples in S
Interpretation: assume an item X belongs to S.
How many bits are needed to tell whether X is positive or negative?
Impure (3 yes / 3 no): H(S) = 1 bit
Pure set (4 yes / 0 no): H(S) = 0 bits
Information Gain
Want many items in pure sets
Expected drop in entropy after the split (the mutual information between attribute A and the class labels of S):
Gain(S, A) = H(S) − Σv (|Sv| / |S|) · H(Sv)
Gain(S, Wind)
= H(S) − 8/14 · H(Sweak) − 6/14 · H(Sstrong)
= 0.94 − 8/14 · 0.81 − 6/14 · 1.0
= 0.049
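These numbers can be checked with a few lines of Python; the 9 yes / 5 no overall counts and the 8 weak-wind / 6 strong-wind counts are the ones behind the figures above (assuming the standard 14-day training set):

import math

def H(pos, neg):
    # binary entropy of a set with pos positive and neg negative examples
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

H_S      = H(9, 5)   # ~0.94 for the whole training set
H_weak   = H(6, 2)   # ~0.81 for the 8 weak-wind days
H_strong = H(3, 3)   #  1.0 for the 6 strong-wind days
gain_wind = H_S - 8/14 * H_weak - 6/14 * H_strong   # ~0.049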
Decision Trees for Regression
How to grow a decision tree
The tree is built greedily from top to bottom.
Each split is selected to maximize information gain (IG).
Decision Tree for Regression
Given a training set: Z = {(x1, y1), …, (xn, yn)}, where the yi are real values.
Goal is to find f(x) (a tree) such that the squared error Σi (f(xi) − yi)² is small.
How to grow a decision tree for regression?
How to find the best split
A tree is built from top to bottom, and at each step you should find the best split in an internal node. You are given a dataset Z, and the splitting criterion is xk < t versus xk ≥ t for a threshold t. The problem is how to find k, the index of the feature, and the threshold t. You also need to find the values in the leaves, aL and aR.
What happens without a split?
Without a split, the best you can do is to predict one number, a. It makes sense to select such a number a to minimize the squared error.
It is easy to prove that this value a equals the average of all the yi, so you need to calculate the average value of all the targets. Here â denotes this average value.
The impurity of Z equals the mean squared error if we use â as the prediction.
Find the best split (xk < t)
What happens if you make a split by a condition xk < t? For one part of the training objects, ZL, we have xk < t; for the other part, ZR, xk ≥ t holds.
The error consists of two parts, for ZL and ZR respectively, so here we need to find k, t, aL and aR simultaneously.
We can calculate the optimal aL and aR exactly the same way as we did for the case without a split: it is easy to prove that the optimal aL equals the average of the targets of the objects which get to that leaf, and similarly for aR. We denote these values by âL and âR respectively.
Find the best split
After this step we only need to find the optimal k and t. We have formulas for the impurity of the objects which go to the left branch of the splitting condition and of those which go to the right. Then you can find the best splitting criterion, the one which maximizes the information gain. This procedure is done iteratively from top to bottom.
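A brute-force sketch of this search in Python (X is an assumed n-by-d NumPy array of features and y the vector of targets; production implementations avoid recomputing the means by sorting each feature and using prefix sums):

import numpy as np

def impurity(y):
    # mean squared error around the mean prediction (a-hat) for this subset
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y):
    n, d = X.shape
    parent = impurity(y)
    best = None                       # (gain, k, t, a_hat_L, a_hat_R)
    for k in range(d):
        for t in np.unique(X[:, k]):
            left, right = y[X[:, k] < t], y[X[:, k] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            # information gain = drop in the weighted impurity after the split
            gain = parent - (len(left) * impurity(left) + len(right) * impurity(right)) / n
            if best is None or gain > best[0]:
                best = (gain, k, t, float(left.mean()), float(right.mean()))
    return best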
Stopping rule
The node depth is equal to the maxDepth training parameter.
No split candidate leads to an information gain greater than minInfoGain.
No split candidate produces child nodes which each have at least minInstancesPerNode training instances (|ZL|, |ZR| < minInstancesPerNode).
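These names correspond to the Spark MLlib training parameters; a hedged PySpark sketch of setting them (the DataFrame and column names are placeholders):

from pyspark.ml.regression import DecisionTreeRegressor

dt = DecisionTreeRegressor(
    featuresCol="features", labelCol="label",
    maxDepth=10,               # stop expanding a node once it reaches this depth
    minInfoGain=0.0,           # require at least this information gain to split
    minInstancesPerNode=5)     # each child must receive at least this many instances
# model = dt.fit(trainingData)   # trainingData is an assumed DataFrame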
Summary: Decision Trees
Automatically handling interactions of features: The benefit of decision trees is that the algorithm can automatically handle interactions of features, because it can combine several different features in a single decision tree and build complex functions involving multiple splitting criteria.
Computational scalability: There exist effective algorithms for building decision trees for very large data sets with many features.
Predictive power: A single decision tree is actually not a very good predictor; the predictive power of a single tree is typically not so good.
Interpretability: You can visualize the decision tree and analyze the splitting criteria in the nodes, the values in the leaves, and so on. Sometimes this is helpful.
Building a tree using
MapReduce
Problem: Building a tree
Given a large dataset, build a decision tree by running FindBestSplit at every node.
Tree is small (can keep it in memory):
• Shallow (~10 levels)
Dataset too large to keep in memory
Dataset too big to scan over on a single machine
MapReduce to the rescue!
MapReduce
PLANET
Parallel Learner for Assembling Numerous
Ensemble Trees [Panda et al., VLDB ‘09]
A sequence of MapReduce jobs that build a decision tree
Setting:
Hundreds of numerical (discrete & continuous) attributes
Target (class) is numerical: Regression
Splits are binary: Xj < v
Decision tree is small enough for each Mapper to keep it in memory
Data too large to keep in memory
PLANET Architecture
[Figure: the Master keeps the model and attribute metadata and launches MapReduce jobs (FindBestSplit, InMemoryGrow) over the input data; the jobs return intermediate results to the Master.]
PLANET Overview
Mapper:
Considers a number of possible splits (Xi, v) on its subset of the data
For each split it stores partial statistics
Partial split-statistics are sent to Reducers
Reducer:
Collects all partial statistics and determines the best split
Master grows the tree for one level
PLANET Overview
Mapper “drops” each datapoint to find the appropriate leaf node L
For each leaf node L it keeps statistics about
1) the data reaching L
2) the data in left/right subtree under split S
Reducer aggregates the statistics (1) and (2) and determines the best split for each node
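A schematic Python sketch (not the actual PLANET code) of what the FindBestSplit mapper and reducer exchange; the tree.route and point.x / point.y accessors are assumptions used only to illustrate the data flow:

def find_best_split_mapper(point, tree, candidate_splits):
    # 'drop' the datapoint through the current tree to its leaf node L
    leaf = tree.route(point)
    for feature, value in candidate_splits[leaf]:
        side = "left" if point.x[feature] < value else "right"
        # partial statistics: count, sum(y) and sum(y^2) are enough to compute
        # the variance-based impurity of each child on the reducer side
        yield (leaf, feature, value, side), (1, point.y, point.y ** 2)

def find_best_split_reducer(key, partial_stats):
    # aggregate the partial statistics for one (leaf, split, side) combination;
    # the Master then compares all candidate splits of a leaf and grows the
    # tree by one level using the best one
    n   = sum(s[0] for s in partial_stats)
    sy  = sum(s[1] for s in partial_stats)
    syy = sum(s[2] for s in partial_stats)
    yield key, (n, sy, syy)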
PLANET: Components
Master
Monitors everything (runs multiple MapReduce jobs)
Three types of MapReduce jobs:
(1) MapReduce Initialization (run once, first)
For each attribute, identify values to be considered for splits
(2) MapReduce FindBestSplit (run multiple times)
MapReduce job to find the best split when there is too much data to fit in memory
(3) MapReduce InMemoryBuild (run once, last)
Similar to FindBestSplit (but for small data)
Grows an entire sub-tree once the data fits in memory
Reference
B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. VLDB 2009.
J. Ye, J.-H. Chow, J. Chen, and Z. Zheng. Stochastic Gradient Boosted Distributed Decision Trees. CIKM 2009.
Example: Medical Application using a Decision Tree in Spark ML
Create SparkContext and SparkSession
This example is about using Spark ML for doing classification and regression with decision trees and ensembles of decision trees.
First of all, you are going to create a SparkContext and a SparkSession, and here it is.
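A typical way to do this in PySpark (the application name is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DecisionTreeBreastCancer") \
    .getOrCreate()
sc = spark.sparkContext   # the underlying SparkContext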
Download a dataset
Now you are downloading a dataset which is about breast cancer diagnosis; here it is.
Exploring the Dataset
Let's explore this dataset a little bit. The first column is the ID of the observation. The second column is the diagnosis, and the other columns are features, which are comma separated. These features are the results of some analyses and measurements. There are 569 examples in total in this dataset. If the second column is M, it means that it is cancer; if it is B, it means that there is no cancer in a particular woman.
PT
N
Big Data Computing Vu Pham Decision Trees for Big Data Analytics
Exploring the Dataset
First of all you need to transform the label, which is either M or B, from the second column: you should transform it from a string to a number. We use a StringIndexer object for this purpose. First you need to load the dataset, then you create a Spark DataFrame, which is stored in a distributed manner on the cluster.
Exploring the Dataset
The inputDF DataFrame has two columns, label and features. You use the Vectors object for creating a vector column in this dataset, and then you can do the string indexing. Spark now enumerates all the possible labels in string form and transforms them to label indexes: label M is equivalent to 1, and label B is equivalent to 0.
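One possible PySpark sketch of these steps; the file name, the column layout (id, diagnosis, then the numeric features) and the output column names follow the description above and are assumptions:

from pyspark.ml.feature import StringIndexer
from pyspark.ml.linalg import Vectors

# each line: id, diagnosis (M or B), then the comma-separated numeric features
lines = sc.textFile("wdbc.data")
parsed = lines.map(lambda l: l.split(",")) \
              .map(lambda p: (p[1], Vectors.dense([float(v) for v in p[2:]])))
inputDF = spark.createDataFrame(parsed, ["label", "features"])

# StringIndexer enumerates the string labels and maps them to numeric indexes
indexer = StringIndexer(inputCol="label", outputCol="labelIndex")
indexedDF = indexer.fit(inputDF).transform(inputDF)   # M -> 1.0, B -> 0.0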
Train/Test Split
We can start doing the machine learning right now. First of all, you make a train/test split in the proportion 70% to 30%. The first model you are going to evaluate is a single decision tree.
Train/Test Split
We import the DecisionTreeClassifier object, create the class which is responsible for training, call the fit method on the training data, and obtain a decision tree model.
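A hedged sketch of the 70/30 split and the single-tree training, continuing the column names assumed above:

from pyspark.ml.classification import DecisionTreeClassifier

trainingData, testData = indexedDF.randomSplit([0.7, 0.3], seed=42)

dt = DecisionTreeClassifier(labelCol="labelIndex", featuresCol="features")
model = dt.fit(trainingData)   # returns a DecisionTreeClassificationModel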
Train/Test Split
The model exposes the number of nodes and the depth of the decision tree, the feature importances, the total number of features used in this decision tree, and so on.
We can even visualize this decision tree and explore it: here is the structure of the decision tree, with the If and Else splitting conditions and the predicted values in the leaves of our decision tree.
Train/Test Split
Now we apply the decision tree model to the test data and obtain predictions. You can explore these predictions; they are in the last column. In this particular case, our model always predicts the zero class.
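The properties mentioned above can be read off the fitted model, and the model is applied to the test data with transform (a sketch, continuing the assumed names):

print(model.numNodes, model.depth)   # size and depth of the learned tree
print(model.featureImportances)      # per-feature importance scores
print(model.toDebugString)           # the full If/Else structure of the tree

predictions = model.transform(testData)   # adds rawPrediction, probability, prediction
predictions.select("labelIndex", "prediction").show(10)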
Predictions
Accuracy
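A common way to compute the test accuracy of these predictions in Spark ML (a sketch, assuming the column names used above):

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(
    labelCol="labelIndex", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test accuracy = %.3f" % accuracy)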
Conclusion
We have discussed decision trees for big data analytics and how to build them at scale with MapReduce (PLANET).
We have also discussed a case study of Breast Cancer Diagnosis using a Decision Tree in Spark ML.
Big Data Predictive Analytics
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
We will mainly cover Random Forest, Gradient Boosted Decision Trees and a Case Study with Spark ML Programming, Decision Trees and Ensembles.
Computational scalability: There exist effective algorithms for building decision trees for very large data sets with many features. But unfortunately, a single decision tree is actually not a very good predictor.
Predictive power: The predictive power of a single tree is typically not so good.
Consider a dataset Z = {(x1, y1), …, (xn, yn)}.
Bootstrapped dataset Z*: a modification of the original dataset Z, produced by random sampling with replacement.
Example: original dataset 1 2 3 4 5 → bootstrap dataset 3 1 5 5 2 (each element drawn at random, with replacement, from the original dataset).
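A two-line Python sketch of producing such a bootstrapped dataset:

import random

Z = [1, 2, 3, 4, 5]                                   # original dataset
Z_star = [random.choice(Z) for _ in range(len(Z))]    # e.g. [3, 1, 5, 5, 2]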
Bagging can also be used for averaging the predictions of other algorithms, not only decision trees but any other algorithm in general.
Bagging works because it reduces the variance of the prediction.
In most situations, with any machine learning method at the core, the quality of such aggregated predictions will be better than that of any single prediction.
Why does bagging work?
This phenomenon is based on a very general principle which is called the bias-variance trade-off. You can consider the training data set to be random by itself.
In a real situation, the training data set may be user behavior on the Internet, for example web browsing, using a search engine, doing clicks on advertisements, and so on. Or it may be physical measurements, for example temperature, location, date, time, and so on. And all these measurements are essentially stochastic.
If you could repeat the same experiment in the same conditions, the measurements would actually be different, because of the noise in the measurements and because user behavior is essentially stochastic and not exactly predictable. Now you understand that the training data set itself is random.
Why does Bagging work ?
After averaging, the noisy parts of the machine learning models will vanish, whereas the stable and reliable parts will remain. The quality of the average model will be better than that of any single model.
Bagging (Bootstrap Aggregation): a method for averaging predictions and reducing the prediction’s variance.
Bagging improves the quality of almost any machine learning method.
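A minimal sketch of bagging for regression, assuming some base learner train(sample) that returns a model with a predict method:

import random

def bagging(train, data, n_models=100):
    models = []
    for _ in range(n_models):
        # each model is trained on its own bootstrapped copy of the data
        sample = [random.choice(data) for _ in range(len(data))]
        models.append(train(sample))
    # the bagged prediction is the average of the individual predictions,
    # which reduces the variance of the prediction
    def predict(x):
        return sum(m.predict(x) for m in models) / len(models)
    return predict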
Random Forest: bagging applied to de-correlated decision trees.
Each split is selected to maximize information gain (IG).
Recommendations from the inventors of Random Forests:
m = √p for classification, minInstancesPerNode = 1
m = p/3 for regression, minInstancesPerNode = 5
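In Spark ML these recommendations map onto the featureSubsetStrategy and minInstancesPerNode parameters; a hedged sketch for the classification case (DataFrame and column names are assumptions):

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(
    labelCol="labelIndex", featuresCol="features",
    numTrees=100,
    featureSubsetStrategy="sqrt",   # m = sqrt(p), the recommendation for classification
    minInstancesPerNode=1)
# rfModel = rf.fit(trainingData)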
One curve corresponds to a regular decision tree, where you find the best split among all the variables, and the blue line is a de-correlated decision tree. In this situation, we randomly pick m = √p variables, and all the trees are built using different subsets of variables. As we can see in this diagram, at the initial stage the variant with m = √p is worse, before 20 iterations. But eventually, this variant of the Random Forest algorithm converges to the better solution.
Random Forest is a good algorithm for general purpose classification/regression problems (typically slightly worse than gradient boosted decision trees).
The Random Forest algorithm can automatically handle interactions of features, because this can be done by a single decision tree.
In the Random Forest algorithm, each tree can be built independently of the other trees. This is an important feature and I would like to emphasize it: that is why the Random Forest algorithm is essentially parallel.
Interpretability: here you lose interpretability, because a composition of hundreds or thousands of Random Forest decision trees cannot be analyzed by a human expert.
Regression
There are several variants of boosting algorithms: AdaBoost, BrownBoost, LogitBoost, and Gradient Boosting.
In boosting, we want to build a model from the training data and hopefully this model will be accurate. There are two basic ways, in machine learning, to build complex models.
The first way is to start with a complex model from the very beginning and fit its parameters. This is exactly the way a neural network operates.
The second way is to build a complex model iteratively, where each step requires the training of a simple model. In the context of boosting, these models are called weak classifiers, or base classifiers.
Goal is to find f(x) using the training set, such that the error Σi (f(xi) − yi)² is small at the test set T = {(x1, y1), …, (xn, yn)}.
How to build f(x)?
f(x) is built as a sum of simple functions hm(x): f(x) = h1(x) + h2(x) + … + hM(x).
In particular, you assume that each function hm(x) is a decision tree.
The gradient is the direction of the fastest increase of a function. Since we want to minimize the function, we must move in the direction opposite to the gradient. To ensure convergence, we must make very small steps, so we multiply each gradient by a small constant, which is called the step size. This is very similar to what we do in gradient boosting.
Gradient boosting is considered to be a minimization in the functional space.
Boosting: Minimization in the functional space
Gradient Boosting is a gradient descent minimization of the target function in the functional space.
Gradient Boosting with Decision Trees is considered to be the best algorithm for general purpose classification or regression problems.
Classification
Goal is to find f(x) using the training set, such that the classification error is small at the test set T = {(x1, y1), …, (xn, yn)}.
How to build f(x)?
We model the probability of an object belonging to the first class. Inside the exp there is the sum of the hm(x), and each hm(x) is a decision tree.
We can easily check that such an expression for the probability will always be between zero and one, so it is a normal, regular probability.
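Written out explicitly, a standard form consistent with this description (the exact formula on the original slide is not reproduced here, so treat this as an assumption) is:

P(y = 1 | x) = 1 / (1 + exp(−(h1(x) + h2(x) + … + hM(x))))

where each hm(x) is a decision tree.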
Likelihood is the probability of observing some data given a statistical model. If we have a data set with objects from one to n, then the probability of observing such a data set is the multiplication of the probabilities for all single objects. This multiplication is called the likelihood.
Instead of maximizing the likelihood, it is convenient to minimize the negative logarithm of the probability. Here we emphasize that these logarithms actually depend on the true label yi and on our prediction f(xi). Now, Q[f] is the sum of L(yi, f(xi)).
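Under the same assumptions, these quantities can be written as (again a reconstruction, since the original slide formulas are not reproduced here):

Likelihood = P(y1 | x1) · P(y2 | x2) · … · P(yn | xn)
Q[f] = Σi L(yi, f(xi)),   with L(yi, f(xi)) = − log P(yi | xi)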
[Figure: a random subsample of k = 4 objects drawn from a dataset of n = 8 objects.]
This diagram compares the gradient boosted decision trees algorithm with two variants of the step-size (regularization) parameter, 0.1 and 0.05. Eventually the variant with 0.05 reaches a lower testing error; finally, this variant turns out to be superior.
It is a very typical behavior, and you should not stop your algorithm after several dozen iterations; you should proceed until convergence. Convergence happens when your testing error does not change a lot. The variant with lower regularization converges more slowly, but eventually it builds a better model.
The smaller the step size is, the larger the number of iterations should be. The recommended number of iterations ranges from several hundred to several thousand.
Also, the more features you have in your data set, the deeper your decision tree should be.
These are very general rules, because the bigger your data set is and the more features you have, the more complex a model you can build without overfitting.
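A hedged PySpark sketch of a gradient boosted trees classifier with an explicit step size and number of iterations (the specific values are only illustrative, within the ranges discussed above):

from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(
    labelCol="labelIndex", featuresCol="features",
    maxIter=500,     # number of boosting iterations (trees)
    stepSize=0.05,   # the step-size / regularization parameter discussed above
    maxDepth=5)
# gbtModel = gbt.fit(trainingData)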
Gradient boosting automatically handles interactions of features because, at the core, it is based on decision trees, which can combine several features in a single tree.
This algorithm is also computationally scalable: it can be effectively executed in a distributed environment, for example in Spark, so it can be executed on top of a Spark cluster.
Interpretability: the final model is a composition of hundreds or thousands of trees, which cannot be analyzed by a human expert.
There is always a tradeoff in machine learning between predictive power and interpretability, because the more complex and accurate your model is, the harder the analysis of this model by a human.
We want to select the model which has the best accuracy among the others.
Since we evaluate only one parameter, the training is much faster.
Conclusion
We have discussed Random Forest, Gradient Boosted Decision Trees and a case study with Spark ML Programming, Decision Trees and Ensembles.