Hyperparameter Tuning the Random Forest in Python
by Will Koehrsen | Towards Data Science
So we’ve built a random forest model to solve our machine learning problem (perhaps
by following this end-to-end guide) but we’re not too impressed by the results. What
are our options? As we saw in the first part of this series, our first step should be to
gather more data and perform feature engineering. These steps usually have the
greatest payoff in terms of time invested versus improved
performance, but when we have exhausted all data sources, it’s time to move on to
model hyperparameter tuning. This post will focus on optimizing the random forest
model in Python using Scikit-Learn tools. Although this article builds on part one, it
fully stands on its own, and we will cover many widely-applicable machine learning
concepts.
I have included Python code in this article where it is most instructive. Full code and
data to follow along can be found on the project Github page.
Hyperparameter tuning relies more on experimental results than theory, and thus the
best method to determine the optimal settings is to try many different combinations and
evaluate the performance of each model. However, evaluating each model only on the
training set can lead to one of the most fundamental problems in machine learning:
overfitting.
If we optimize the model for the training data, then our model will score very well on
the training set, but will not be able to generalize to new data, such as in a test set.
When a model performs well on the training set but poorly on the test set, this is
known as overfitting, or essentially creating a model that knows the training set very
well but cannot be applied to new problems. It’s like a student who has memorized the
simple problems in the textbook but has no idea how to apply concepts in the messy
real world.
An overfit model may look impressive on the training set, but will be useless in a real
application. Therefore, the standard procedure for hyperparameter optimization
accounts for overfitting through cross validation.
Cross Validation
The technique of cross validation (CV) is best explained by example using the most
common method, K-Fold CV. When we approach a machine learning problem, we
make sure to split our data into a training and a testing set. In K-Fold CV, we further
split our training set into K number of subsets, called folds. We then iteratively fit the
model K times, each time training on K-1 of the folds and evaluating on the remaining
fold (called the validation data). As an example, consider fitting a model with K = 5. In
the first iteration we train on the first four folds and evaluate on the fifth. The second
time we train on the first, second, third, and fifth folds and evaluate on the
fourth. We repeat this procedure 3 more times, each time evaluating on a different
fold. At the very end of training, we average the performance on each of the folds to
come up with final validation metrics for the model.
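As a concrete illustration, Scikit-Learn can handle the fold bookkeeping for us. Below is a minimal sketch using cross_val_score; the toy data, estimator, and scoring metric are placeholders, not the project's actual setup.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Toy data standing in for real features and labels (placeholder only)
X = np.random.rand(100, 6)
y = np.random.rand(100)

# 5-fold CV: train on 4 folds, validate on the held-out fold, repeat 5 times
scores = cross_val_score(RandomForestRegressor(random_state = 42), X, y,
                         cv = 5, scoring = 'neg_mean_absolute_error')

# Average the 5 validation scores to get the final cross validation metric
print(scores.mean())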
As a brief recap before we get into model tuning, we are dealing with a supervised
regression machine learning problem. We are trying to predict the temperature
tomorrow in our city (Seattle, WA) using historical weather data. We have 4.5
years of training data, 1.5 years of test data, and are using 6 different features
(variables) to make our predictions. (To see the full code for data preparation, see the
notebook).
In previous posts, we examined the data for anomalies, so we know our data is
clean. Therefore, we can skip the data cleaning and jump straight into hyperparameter
tuning.
To look at the available hyperparameters, we can create a random forest and examine
the default values.
from sklearn.ensemble import RandomForestRegressor
from pprint import pprint

rf = RandomForestRegressor(random_state = 42)
pprint(rf.get_params())
{'bootstrap': True,
'criterion': 'mse',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 10,
'n_jobs': 1,
'oob_score': False,
'random_state': 42,
'verbose': 0,
'warm_start': False}
Wow, that is quite an overwhelming list! How do we know where to start? A good place
is the documentation on the random forest in Scikit-Learn. This tells us the most
important settings are the number of trees in the forest (n_estimators) and the number
of features considered when splitting a node (max_features). We could go read
the research papers on the random forest and try to theorize the best hyperparameters,
but a more efficient use of our time is just to try out a wide range of values and see
what works! We will try adjusting the following set of hyperparameters:
n_estimators = number of trees in the forest
max_features = max number of features considered for splitting a node
max_depth = max number of levels in each decision tree
min_samples_split = min number of data points placed in a node before the node
is split
min_samples_leaf = min number of data points allowed in a leaf node
bootstrap = method for sampling data points (with or without replacement)
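The code that builds the grid of candidate values is not reproduced in this excerpt; a sketch consistent with the 4320 combinations counted below (the specific ranges are assumptions on my part) might look like the following, after which we can print the grid:

import numpy as np

# Number of trees in the random forest (10 values)
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split (2 values; 'auto' matches this article's Scikit-Learn version)
max_features = ['auto', 'sqrt']
# Maximum number of levels in each tree (11 values plus None = 12)
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node (3 values)
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node (3 values)
min_samples_leaf = [1, 2, 4]
# Method of sampling data points for each tree (2 values)
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}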
pprint(random_grid)
On each iteration, the algorithm will choose a different combination of the hyperparameters.
Altogether, there are 2 * 12 * 2 * 3 * 3 * 10 = 4320 settings! However, the benefit of a
random search is that we are not trying every combination, but selecting at random to
sample a wide range of values.
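The search itself is also not shown in this excerpt; a sketch using Scikit-Learn's RandomizedSearchCV (the number of sampled settings and folds are assumptions, and train_features/train_labels come from the data preparation notebook) could be:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Base model to tune
rf = RandomForestRegressor(random_state = 42)

# Randomly sample hyperparameter combinations from the grid, scoring each with 3-fold CV
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,
                               n_iter = 100, cv = 3, verbose = 2,
                               random_state = 42, n_jobs = -1)

# Fit the random search model on the training data
rf_random.fit(train_features, train_labels)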
We can view the best parameters from fitting the random search:
rf_random.best_params_
{'bootstrap': True,
'max_depth': 70,
'max_features': 'auto',
'min_samples_leaf': 4,
'min_samples_split': 10,
'n_estimators': 400}
From these results, we should be able to narrow the range of values for each
hyperparameter.
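The evaluate helper used below, and the baseline model it is first applied to, are not shown in full in this excerpt. Here is a minimal sketch consistent with the printed output, assuming accuracy is defined as 100% minus the mean absolute percentage error and that the training data lives in train_features and train_labels:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def evaluate(model, test_features, test_labels):
    # Accuracy here is 100% minus the mean absolute percentage error (assumed definition)
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
    return accuracy

# Baseline for comparison: a forest with the default hyperparameters (n_estimators = 10 above)
base_model = RandomForestRegressor(n_estimators = 10, random_state = 42)
base_model.fit(train_features, train_labels)
base_accuracy = evaluate(base_model, test_features, test_labels)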
Evaluating this baseline model on the test set gives:
Model Performance
Average Error: 3.9199 degrees.
Accuracy = 93.36%.
We can compare that to the best model found by the random search:
best_random = rf_random.best_estimator_
random_accuracy = evaluate(best_random, test_features, test_labels)
Model Performance
Average Error: 3.7152 degrees.
Accuracy = 93.73%.
Improvement of 0.40%.
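Random search narrowed down the promising range for each hyperparameter; the next step is Grid Search with Cross Validation, which evaluates every combination in an explicitly specified grid. The exact grid is not reproduced in this excerpt; a sketch built around the random search results (the specific candidate values are assumptions) might look like:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Candidate values centered on the best settings found by random search (illustrative)
param_grid = {'bootstrap': [True],
              'max_depth': [60, 70, 80, 90],
              'max_features': [2, 3],
              'min_samples_leaf': [3, 4, 5],
              'min_samples_split': [8, 10, 12],
              'n_estimators': [100, 200, 300, 400]}

# Exhaustively try every combination in the grid, scoring each with 3-fold CV
rf = RandomForestRegressor(random_state = 42)
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid,
                           cv = 3, n_jobs = -1, verbose = 2)
grid_search.fit(train_features, train_labels)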
grid_search.best_params_
{'bootstrap': True,
'max_depth': 80,
'max_features': 3,
'min_samples_leaf': 5,
'min_samples_split': 12,
'n_estimators': 100}
best_grid = grid_search.best_estimator_
grid_accuracy = evaluate(best_grid, test_features, test_labels)
Model Performance
Average Error: 3.6561 degrees.
Accuracy = 93.83%.
Improvement of 0.50%.
It seems we have about maxed out performance, but we can give it one more try with a
grid further refined from our previous results. The code is the same as before, just with
a different grid, so I only present the results:
Model Performance
Average Error: 3.6602 degrees.
Accuracy = 93.82%.
Improvement of 0.49%.
Comparisons
We can make some quick comparisons between the different approaches used to
improve performance, showing the returns on each. The following table shows the final
results from all the improvements we made (including those from the first part):
Model is the (very unimaginative) name for each model, accuracy is the percentage
accuracy, error is the average absolute error in degrees, n_features is the number of
features in the dataset, n_trees is the number of decision trees in the forest, and time is
the training and predicting time in seconds.
four_years_all: model trained using 4.5 years of data and expanded features (see
Part One for details)
four_years_red: model trained using 4.5 years of data and subset of most
important features
first_grid: best model from first grid search with cross validation (selected as the
final model)
second_grid: best model from second grid search
Overall, gathering more data and feature selection reduced the error by 17.69%
while hyperparameter tuning further reduced the error by 6.73%.
Training Visualizations
To further analyze the process of hyperparameter optimization, we can change one
setting at a time and see the effect on the model performance (essentially conducting a
controlled experiment). For example, we can create a grid with a range of number of
trees, perform grid search CV, and then plot the results. Plotting the training and
testing error and the training time will allow us to inspect how changing one
hyperparameter impacts the model.
First we can look at the effect of changing the number of trees in the forest. (see
notebook for training and plotting code)
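The full training and plotting code lives in the notebook; a minimal sketch of this one-variable experiment, using GridSearchCV over a range of tree counts (the range and scoring metric are assumptions), could look like:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Vary only the number of trees, holding every other hyperparameter fixed
tree_grid = {'n_estimators': [int(x) for x in np.linspace(1, 300, 30)]}
tree_search = GridSearchCV(RandomForestRegressor(random_state = 42), tree_grid,
                           cv = 3, scoring = 'neg_mean_absolute_error',
                           return_train_score = True, n_jobs = -1)
tree_search.fit(train_features, train_labels)

# Plot training and validation error against the number of trees
results = tree_search.cv_results_
n_trees = tree_grid['n_estimators']
plt.plot(n_trees, -results['mean_train_score'], label = 'training error')
plt.plot(n_trees, -results['mean_test_score'], label = 'validation error')
plt.xlabel('Number of Trees')
plt.ylabel('Mean Absolute Error (degrees)')
plt.legend()
plt.show()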
As the number of trees increases, our error decreases up to a point. There is not much
benefit in accuracy to increasing the number of trees beyond 20 (our final model had
100) and the training time rises consistently.
We can also examine curves for the number of features to split a node:
Number of Features Training Curves
Together with the quantitative stats, these visuals can give us a good idea of the trade-
offs we make with different combinations of hyperparameters. Although there is
usually no way to know ahead of time what settings will work the best, this example
has demonstrated the simple tools in Python that allow us to optimize our machine
learning model.