Lecture Notes - Random Forests
Random Forests
You are familiar with decision trees. Now it's time to learn about random forests, which are collections of decision trees. A key advantage of random forests is that they almost always outperform individual decision trees in terms of accuracy.
Ensembles
An ensemble is a group of things viewed as a whole rather than individually. In ensemble learning, a collection of models is used to make predictions rather than a single model. Arguably, the most popular member of the family of ensemble models is the random forest, which is an ensemble built from numerous decision trees.
For an ensemble to work, each model in it should satisfy the following conditions:
1. Each model should be diverse. Diversity ensures that the models serve complementary purposes, i.e. the individual models make their predictions independently of one another.
2. Each model should be acceptable. Acceptability implies that each model is at least better than a random model.
Consider a binary classification problem where the response variable is either 0 or 1. You have an ensemble of three models, where each model has an accuracy of 0.7, i.e. it is correct 70% of the time. Enumerating all 2^3 = 8 combinations of the three models being correct or wrong while classifying a test data point, together with the probability of each case, shows that the majority vote is correct whenever at least two of the three models are correct. Assuming the models err independently, this happens with probability 0.7^3 + 3 × 0.7^2 × 0.3 = 0.784, which is higher than the 0.7 accuracy of any individual model.
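The calculation above can be checked with a short script. The following is a minimal sketch; the three-model setup, the 0.7 accuracy, and the independence assumption come from the example, while everything else is illustrative.

```python
from itertools import product

accuracy = 0.7   # each individual model is correct 70% of the time
n_models = 3

ensemble_correct = 0.0
for outcome in product([True, False], repeat=n_models):   # all 2^3 cases
    # probability of this combination, assuming the models err independently
    p = 1.0
    for correct in outcome:
        p *= accuracy if correct else (1 - accuracy)
    # the majority vote is correct when at least 2 of the 3 models are correct
    if sum(outcome) >= 2:
        ensemble_correct += p

print(ensemble_correct)   # 0.784, higher than any individual model's 0.7
```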
Random forests are created using a special ensemble method called bagging, which stands for bootstrap aggregation. Bootstrapping means creating bootstrap samples from a given data set. A bootstrap sample is created by sampling the rows of a given data set uniformly and with replacement. A bootstrap sample typically contains about 30-70% of the data from the data set. Aggregation implies combining the results of the different models present in the ensemble.
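As a minimal sketch of the bootstrapping step, the snippet below draws one bootstrap sample with NumPy; the toy data set and the 60% sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

X = np.arange(20).reshape(10, 2)      # toy data set: 10 rows, 2 features
sample_size = int(0.6 * len(X))       # e.g. 60% of the rows

# sample row indices uniformly and with replacement
indices = rng.integers(low=0, high=len(X), size=sample_size)
bootstrap_sample = X[indices]

print(indices)             # some rows repeat, others are left out ("out of bag")
print(bootstrap_sample)
```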
A random forest is an ensemble of many decision trees. It is created in the following way:
1. Create a bootstrap sample from the training set.
2. Construct a decision tree using the bootstrap sample. While splitting a node of the tree, only consider a random subset of features. Every time a node has to be split, a different random subset of features is considered.
3. Repeat steps 1 and 2 n times to construct n trees in the forest. Remember, each tree is constructed independently, so it is possible to construct the trees in parallel.
4. When predicting on a test case, each tree predicts individually, and the final prediction is given by the majority vote of all the trees (or, for regression, the average of their predictions).
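This procedure is what scikit-learn's RandomForestClassifier implements. The following is a minimal sketch, assuming scikit-learn is available; the synthetic data set and the parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees, each built on its own bootstrap sample
    max_features="sqrt",  # random subset of features considered at each split
    n_jobs=-1,            # trees are independent, so they can be built in parallel
    random_state=0,
)
forest.fit(X_train, y_train)

# each tree votes; the majority vote is the final prediction
print(forest.predict(X_test[:5]))
print(forest.score(X_test, y_test))
```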
The OOB (out-of-bag) error is calculated by using each observation of the training set as a test observation. Since each tree is built on a bootstrap sample, each observation can be used as a test observation by the trees that did not have it in their bootstrap sample. All these trees predict on this observation, which gives an error for that single observation. The final OOB error is obtained by computing this error for every observation and aggregating the results.
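In scikit-learn, the out-of-bag estimate can be requested when the forest is built; a short sketch (with illustrative data and parameters) follows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,   # score each observation using only the trees that did not see it
    random_state=0,
)
forest.fit(X, y)

print(forest.oob_score_)        # out-of-bag accuracy
print(1 - forest.oob_score_)    # out-of-bag error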
To construct a forest of S trees on a data set that has M features and N observations, the time taken will depend on
the following factors:
1. The number of trees: The time is directly proportional to the number of trees. However, this duration can be reduced by building the trees in parallel.
2. The size of the bootstrap sample: Generally, the size of a bootstrap sample is 30-70% of N. The smaller the sample, the faster the forest is built.
3. The size of the subset of features considered while splitting a node: Generally, this is taken as √M in classification and M/3 in regression.
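As a rough illustration of how these three factors map onto scikit-learn's parameters (the values below are illustrative, and max_samples requires scikit-learn 0.22 or later):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier(
    n_estimators=100,     # 1. the number of trees S
    max_samples=0.5,      # 2. bootstrap sample size as a fraction of N
    max_features="sqrt",  # 3. about sqrt(M) features per split for classification
    n_jobs=-1,            # build the trees in parallel
)

reg = RandomForestRegressor(
    n_estimators=100,
    max_samples=0.5,
    max_features=1 / 3,   # roughly M/3 features per split for regression
    n_jobs=-1,
)
```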
Random forests lab