05.XGBoost
XGBoost Features
SPLITTING ALGORITHMS
XGBoost's core strength lies in its regularized objective function. It doesn't just aim to
minimize the loss (like mean squared error or logistic loss); it also incorporates a penalty
term for the complexity of the tree. This helps prevent overfitting. The objective function
looks something like this:
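Using the notation of the original XGBoost paper, with training loss l, predictions ŷ_i and trees f_k:

\[
\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2
\]

Here T is the number of leaves in a tree, w are the leaf weights, and γ and λ control how strongly tree complexity is penalised.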
XGBoost uses a greedy approach to build the trees. It starts with a single root node and
iteratively adds branches to the tree. For each possible split, it calculates the gain – how
much the objective function would be improved by making that split.
A key innovation is that XGBoost uses a second-order Taylor expansion of the loss
function. This gives a more accurate estimate of the gain compared to using just the first
derivative (as in gradient boosting). The second-order information (Hessian) helps
XGBoost find better splits.
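Concretely, for a candidate split that sends summed gradients G_L, G_R and summed Hessians H_L, H_R to the left and right children, the gain used by XGBoost (in the paper's notation) is

\[
\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma
\]

A split is only worth keeping if this gain is positive, which is how the γ penalty prunes away weak splits.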
Exact Greedy Algorithm: the main problem in tree learning is finding the best split. The exact greedy algorithm enumerates every possible split on every feature. For continuous features and large datasets this becomes computationally expensive, so XGBoost also provides an approximate greedy algorithm to speed up the process.
Instead of evaluating every possible split point, it proposes a set of candidate split points
(quantiles of the feature distribution). It then evaluates the gain for these candidate split
points and chooses the best one.
These approximations are based on quantiles: the first quantile is the first threshold, the second quantile is the second threshold, and so on. By default, the approximate greedy algorithm builds approximately 33 quantiles (this corresponds to the historical sketch_eps default of 0.03, i.e. roughly 1/0.03 candidate bins).
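As a rough illustration of candidate proposal (plain NumPy rather than XGBoost's internal code; the function name and bin count are made up for the example):

```python
import numpy as np

def propose_candidate_splits(feature_values, n_quantiles=33):
    # Propose thresholds at (roughly) evenly spaced quantiles of a single
    # feature, instead of testing every unique value exactly.
    qs = np.linspace(0.0, 1.0, n_quantiles + 2)[1:-1]  # interior quantiles
    return np.unique(np.quantile(feature_values, qs))

# ~33 thresholds instead of one candidate per unique value
x = np.random.lognormal(size=100_000)
print(len(propose_candidate_splits(x)))
```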
Even the approximate algorithm can be slow if the dataset doesn't fit into memory. XGBoost uses a clever data structure called a weighted quantile sketch to efficiently find the candidate split points. This sketch approximates quantiles of the feature distribution without needing to load the entire dataset into memory, and it weights each example by its second-order derivative (Hessian), so the thresholds divide the data into buckets of roughly equal total Hessian weight rather than equal row counts.
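The real sketch is a streaming summary structure, but the effect it approximates can be shown in a few lines of NumPy (illustrative names only, not XGBoost's API): thresholds are chosen so that each bucket carries roughly the same total Hessian weight.

```python
import numpy as np

def weighted_quantile_thresholds(x, hessians, n_bins=33):
    # Place thresholds so each bin holds roughly equal total Hessian weight;
    # the weighted quantile sketch approximates this without keeping the
    # whole sorted dataset in memory.
    order = np.argsort(x)
    x_sorted, h_sorted = x[order], hessians[order]
    cum_w = np.cumsum(h_sorted) / h_sorted.sum()
    targets = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.unique(x_sorted[np.searchsorted(cum_w, targets)])
```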
Sparsity-aware split finding handles missing values in the training data and defines a default direction for missing values in new data. In this optimisation, the data is split into two groups: one group contains the rows with no missing feature values, and the second group contains the rows with missing feature values together with their associated response values. The data in the first group is sorted in ascending order of the feature. Then, the split finding process calculates two sets of gain values –
● First, it calculates the gain with the missing rows from the second group sent down the left branch of the split
● Second, it calculates the gain with the missing rows from the second group sent down the right branch of the split
This is done for each candidate threshold (quantile). The direction that gives the largest gain overall is picked as the default direction whenever missing data is encountered, as sketched below.
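Below is a simplified sketch of that idea in NumPy (not XGBoost's implementation; split_gain follows the gain formula above and all names are for illustration only):

```python
import numpy as np

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    # Second-order gain for a candidate split (see the formula above).
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

def sparsity_aware_best_split(x, g, h, missing_mask, thresholds):
    # For every candidate threshold, try sending all missing rows left,
    # then right; keep the best gain and the learned default direction.
    G_miss, H_miss = g[missing_mask].sum(), h[missing_mask].sum()
    xp, gp, hp = x[~missing_mask], g[~missing_mask], h[~missing_mask]
    best = (-np.inf, None, None)  # (gain, threshold, default direction)
    for t in thresholds:
        left = xp < t
        G_L, H_L = gp[left].sum(), hp[left].sum()
        G_R, H_R = gp[~left].sum(), hp[~left].sum()
        for G_add, H_add, side in ((G_miss, H_miss, "left"), (0.0, 0.0, "right")):
            gain = split_gain(G_L + G_add, H_L + H_add,
                              G_R + (G_miss - G_add), H_R + (H_miss - H_add))
            if gain > best[0]:
                best = (gain, t, side)
    return best
```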
Cache-Aware Access
Cache memory is the fastest memory for the CPU to access. Hence, XGBoost stores the first- and second-order derivatives (gradients and Hessians) in cache to rapidly calculate the scores for each node and leaf in the tree.
XGBoost also divides the dataset into multiple blocks stored in Compressed Sparse Column (CSC) format, so that the blocks can be distributed to multiple cores for parallel learning.
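The CSC layout itself is easy to picture with SciPy (illustrative only; XGBoost builds its own block structure internally): each feature's values and row indices can be scanned independently, which is what makes per-feature, per-block parallel split finding convenient.

```python
import numpy as np
from scipy.sparse import csc_matrix

X = np.array([[1.0, 0.0, 3.0],
              [0.0, 2.0, 0.0],
              [4.0, 0.0, 5.0]])
X_csc = csc_matrix(X)

# Walk one feature (column) at a time: only the non-zero entries and their
# row indices are stored, column by column.
for j in range(X_csc.shape[1]):
    start, end = X_csc.indptr[j], X_csc.indptr[j + 1]
    rows, vals = X_csc.indices[start:end], X_csc.data[start:end]
    print(f"feature {j}: rows={rows.tolist()} values={vals.tolist()}")
```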
How does it compare to the traditional gradient boosting technique?
In the traditional sequential gradient boosting technique, the most time-consuming step is the split finding process, which uses a greedy algorithm. Although the greedy algorithm is fast on small datasets, it becomes extremely slow on very large ones. This is because the entire dataset is linearly scanned and every unique value is evaluated as a candidate split, and each split is chosen greedily without accounting for its effect on the splits made later in the tree.
The parallelism within XGBoost occurs within this split finding process for the tree branches. It is highly optimised and well engineered, making the process roughly ten times faster than the traditional gradient boosting technique. In this parallel split finding process, the data is split into multiple subsets and distributed to the available cores (4 in the diagram below). Each subset is scanned over its possible split values using the approximate greedy algorithm. The results from all 4 cores are then combined into an approximate quantile histogram, which provides the candidate thresholds for the tree splits.
The first- and second-order derivatives (gradients and Hessians) calculated for the splits are stored in cache memory for faster access when determining the gain and the output value for the leaf nodes. This process exploits most of the distinctive features of XGBoost: parallel learning, the approximate greedy algorithm, the weighted quantile sketch, sparsity-aware splitting and cache-aware access. Furthermore, the CSC block format makes reading the data from the hard drive much faster, even though the blocks need to be decompressed first.
Below is the diagram I put together to demonstrate how the process and optimisations
come together in XGBoost.
[Diagram: xgboost.png — how the XGBoost process and optimisations fit together]
Goals of XGBoost
Execution Speed: in benchmarks, XGBoost was almost always faster than the other gradient boosting implementations from R, Python, Spark and H2O, and it is markedly faster than most alternative algorithms.
Model Performance: XGBoost dominates structured or tabular datasets on
classification and regression predictive modelling problems.
The eval_metric parameter sets the metric to be used for the validation data. The default values are rmse for regression and error for classification.
Typical values are:
rmse – root mean square error.
mae – mean absolute error.
logloss – negative log-likelihood.
error – Binary classification error rate (0.5 threshold).
merror – Multiclass classification error rate.
mlogloss – Multiclass logloss.
auc – Area under the curve.
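For example, with the core Python API (the random dataset and parameter values below are made up purely to show where eval_metric goes):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)

# Hold out the last 200 rows as validation data.
dtrain = xgb.DMatrix(X[:800], label=y[:800])
dvalid = xgb.DMatrix(X[800:], label=y[800:])

# eval_metric controls which metric is reported on the validation set.
params = {"objective": "binary:logistic", "eval_metric": "auc"}
booster = xgb.train(params, dtrain, num_boost_round=50,
                    evals=[(dvalid, "validation")])
```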
Consider using XGBoost for any supervised machine learning task that satisfies the following criteria:
XGBoost is usually used to train gradient-boosted decision trees (GBDT) and other gradient-boosted models. Random forests use the same model representation and inference as gradient-boosted decision trees, but a different training algorithm. XGBoost can also be used to train a standalone random forest, and a random forest can even be used as the base model for gradient boosting; a short sketch of the two model types follows below.
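A small sketch with the scikit-learn style wrappers, contrasting a standalone random forest with gradient-boosted trees (assuming a reasonably recent xgboost package; the hyperparameter values are arbitrary, not recommendations):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Standalone random forest trained by XGBoost: many parallel trees,
# a single boosting round under the hood.
rf = xgb.XGBRFClassifier(n_estimators=100, max_depth=6, random_state=0)
rf.fit(X_tr, y_tr)
print("random forest accuracy:", rf.score(X_te, y_te))

# Gradient-boosted decision trees for comparison: trees built sequentially.
gbdt = xgb.XGBClassifier(n_estimators=100, max_depth=6, random_state=0)
gbdt.fit(X_tr, y_tr)
print("boosted trees accuracy:", gbdt.score(X_te, y_te))
```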
Further, random forest is an improvement over bagging that helps reduce variance. Random forest builds its trees in parallel, while in boosting the trees are built sequentially: each tree is grown using information from the previously grown trees. This is unlike bagging, where multiple bootstrap copies of the original training data are created and a separate decision tree is fit on each. This is the reason why XGBoost generally performs better than random forest.