
XGBOOST

∙ XGBoost is an open-source software library that implements optimized distributed gradient boosting machine learning algorithms under the Gradient Boosting framework.
∙ XGBoost, which stands for Extreme Gradient Boosting, is a scalable, distributed
gradient-boosted decision tree (GBDT) machine learning library. It provides
parallel tree boosting and is the leading machine learning library for regression,
classification, and ranking problems.
∙ XGBoost minimizes a regularized (L1 and L2) objective function that combines a convex loss function (based on the difference between the predicted and target outputs) and a penalty term for model complexity (in other words, the complexity of the regression tree functions). Training proceeds iteratively, adding new trees that predict the residuals or errors of the prior trees, which are then combined with the previous trees to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
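
To make this concrete, here is a minimal, illustrative sketch using the xgboost Python package with its scikit-learn wrapper; the synthetic dataset and parameter values are assumptions chosen only for demonstration, not recommended settings.

# Gradient boosting with a regularized objective: each new tree fits the
# residual errors of the current ensemble, scaled by the learning rate.
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = xgb.XGBRegressor(
    n_estimators=200,                 # number of trees added sequentially
    learning_rate=0.1,                # shrinkage applied to each new tree
    max_depth=4,                      # caps the complexity of each tree
    reg_alpha=0.0,                    # L1 penalty on leaf weights
    reg_lambda=1.0,                   # L2 penalty on leaf weights
    objective="reg:squarederror",     # convex loss term of the objective
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))    # R^2 on held-out data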

XGBoost Features

∙ Regularized Learning: A regularization term helps to smooth the final learnt weights to avoid over-fitting. The regularized objective will tend to select a model employing simple and predictive functions.
∙ Gradient Tree Boosting: The tree ensemble model cannot be optimized using
traditional optimization methods in Euclidean space. Instead, the model is
trained in an additive manner.
∙ Shrinkage and Column Subsampling: Besides the regularized objective, two additional techniques are used to further prevent over-fitting. The first technique is shrinkage, introduced by Friedman. Shrinkage scales newly added weights by a factor η after each step of tree boosting. Similar to a learning rate in stochastic optimization, shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model.
∙ The second technique is column (feature) subsampling, a technique also used in Random Forest. Column subsampling prevents over-fitting even more than traditional row subsampling, and the use of column subsamples also speeds up computation of the parallel algorithm.
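
In the xgboost Python package these two techniques map directly onto parameters; the values below are illustrative assumptions, not tuned settings.

# Shrinkage (learning_rate, also called eta) and column subsampling.
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=500,        # more trees are usually needed when shrinkage is small
    learning_rate=0.05,      # shrinkage factor applied to each new tree's weights
    colsample_bytree=0.8,    # fraction of features sampled for each tree
    subsample=0.8,           # fraction of rows sampled for each tree
    reg_lambda=1.0,          # L2 regularization on leaf weights
)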

SPLITTING ALGORITHMS

XGBoost (Extreme Gradient Boosting) employs a sophisticated approach to splitting nodes in its decision tree construction, going beyond basic methods like Gini impurity or information gain. Here's a breakdown of how splitting works in XGBoost:
1. Regularized Objective Function:

XGBoost's core strength lies in its regularized objective function. It doesn't just aim to
minimize the loss (like mean squared error or logistic loss); it also incorporates a penalty
term for the complexity of the tree. This helps prevent overfitting. The objective function
looks something like this:

Obj(Θ) = Σ(Loss(yi, ŷi)) + Σ(Ω(tree_k))
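
Writing the complexity penalty out explicitly, following the formulation in the XGBoost paper, where T is the number of leaves of a tree f and w is its vector of leaf weights:

\mathrm{Obj}(\Theta) = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k),
\qquad \Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \lVert w \rVert^2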

2. Greedy Learning with Second-Order Approximation:

XGBoost uses a greedy approach to build the trees. It starts with a single root node and
iteratively adds branches to the tree. For each possible split, it calculates the gain – how
much the objective function would be improved by making that split.

A key innovation is that XGBoost uses a second-order Taylor expansion of the loss
function. This gives a more accurate estimate of the gain compared to using just the first
derivative (as in gradient boosting). The second-order information (Hessian) helps
XGBoost find better splits.
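
Concretely, writing g_i and h_i for the first and second derivatives of the loss at the current prediction, and G_L, H_L (respectively G_R, H_R) for their sums over the instances falling into the left (respectively right) child, the gain of a candidate split as derived in the XGBoost paper is:

\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma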

3. Exact Greedy Algorithm: The main problem in tree learning is to find the best
split. This algorithm enumerates over all the possible splits on all the features. It is
computationally demanding to enumerate all the possible splits for continuous features.

4. Approximate Greedy Algorithm:

Calculating the exact best split can be computationally expensive, especially for large
datasets. XGBoost provides an approximate greedy algorithm to speed up the process.
Instead of evaluating every possible split point, it proposes a set of candidate split points
(quantiles of the feature distribution). It then evaluates the gain for these candidate split
points and chooses the best one.

These approximations are based on quantiles: the first quantile is the first threshold, the second quantile is the second threshold, and so on. By default, the approximate greedy algorithm builds approximately 33 quantiles.
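
In the xgboost Python package the split-finding strategy can be selected explicitly; the settings below are illustrative.

# Exact enumeration vs. quantile-based approximations for split finding.
import xgboost as xgb

exact_model  = xgb.XGBRegressor(tree_method="exact")               # enumerate every split point
approx_model = xgb.XGBRegressor(tree_method="approx")              # quantile-sketch candidate splits
hist_model   = xgb.XGBRegressor(tree_method="hist", max_bin=256)   # histogram binning of features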

5. Weighted Quantile Sketch:

Even the approximate algorithm can be slow if the dataset doesn't fit into memory.
XGBoost uses a clever data structure called a weighted quantile sketch to efficiently find
the candidate split points. This sketch approximates the quantiles of the feature
distribution without needing to load the entire dataset into memory.

6. Sparsity-Aware Split Finding

Sparsity-aware split finding handles missing values in the data and provides a default rule for dealing with missing values in new data. In this optimization, the data is split into two groups: one group contains the rows with no missing feature values, and the second group contains the rows with missing feature values, along with their associated response variables. The data from the first group is sorted in ascending order. Then, the split finding process calculates two sets of gain values:

● First, it calculates the gain obtained by sending the missing-value rows from the second group to the left of the split.
● Second, it calculates the gain obtained by sending the missing-value rows from the second group to the right of the split.

This is done for each of the quantiles. The direction with the largest overall gain is picked as the default direction for missing data.
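
The following toy sketch illustrates the idea for a single feature. It is not XGBoost's actual implementation (it enumerates every threshold rather than quantiles, and it drops the 1/2 factor and γ from the gain), but it shows how the rows with missing values are tried on both sides of each candidate split.

# Toy sketch of sparsity-aware split finding for one feature.
# g, h are the per-row gradients and Hessians of the loss; lam is the L2 penalty.
import numpy as np

def similarity(g_sum, h_sum, lam=1.0):
    # Quality score of a leaf holding these gradient/Hessian sums.
    return g_sum ** 2 / (h_sum + lam)

def best_split_with_default(x, g, h, lam=1.0):
    present = ~np.isnan(x)
    g_miss, h_miss = g[~present].sum(), h[~present].sum()
    order = np.argsort(x[present])
    xs, gs, hs = x[present][order], g[present][order], h[present][order]
    g_tot, h_tot = g.sum(), h.sum()
    parent = similarity(g_tot, h_tot, lam)

    best = (-np.inf, None, None)            # (gain, threshold, default direction)
    g_left = h_left = 0.0
    for i in range(len(xs) - 1):
        g_left += gs[i]
        h_left += hs[i]
        g_right = g_tot - g_miss - g_left   # non-missing rows on the right
        h_right = h_tot - h_miss - h_left
        threshold = (xs[i] + xs[i + 1]) / 2.0
        # Try sending the missing rows left, then right, and keep the better gain.
        gain_missing_left = (similarity(g_left + g_miss, h_left + h_miss, lam)
                             + similarity(g_right, h_right, lam) - parent)
        gain_missing_right = (similarity(g_left, h_left, lam)
                              + similarity(g_right + g_miss, h_right + h_miss, lam) - parent)
        for gain, side in ((gain_missing_left, "left"), (gain_missing_right, "right")):
            if gain > best[0]:
                best = (gain, threshold, side)
    return best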

Cache-Aware Access

Cache memory in the CPU is the fastest to access. Hence, XGBoost stores the first- and second-order derivatives (gradients and Hessians) there to rapidly calculate the scores for each node and leaf in the tree.

Compressed Sparse Column (CSC) data format

XGBoost divides the dataset into multiple blocks in a Compressed Sparse Column format
(CSC) to distribute the blocks to multiple cores for parallel learning.
How does it compare to gradient boosting technique?

In the traditional sequential gradient boosting technique, the process that takes the most time is the split finding process, which uses a greedy algorithm. Though the greedy algorithm is fast on small datasets, for very large datasets the process becomes extremely slow. This is because the entire dataset is linearly scanned and a candidate split is evaluated at each unique value in the data, without considering the effect of the split until the next iteration.

The parallelism within XGBoost occurs within this split finding process for the tree branches. It is a highly optimised and well-engineered parallelism which makes the process 10 times faster than the traditional gradient boosting technique. In this parallel split finding process, the data is split into multiple subsets and distributed to the available cores (for example, four cores). Each subset is then scanned for all the possible values, and candidate splits are approximated using the greedy algorithm. The data from all the cores is then combined to form an approximate quantile histogram, which provides the approximate candidate thresholds for tree splits. The first- and second-order derivatives (gradients and Hessians) calculated for the splits are stored in cache memory for faster access when determining the gain and output value for the leaf nodes. This process exploits the majority of the unique features of XGBoost, that is, parallel learning, the approximate greedy algorithm, the weighted quantile sketch, sparsity-aware splitting and cache-aware access. Furthermore, the CSC data format makes reading the data from the hard drive much faster, even though it needs to be decompressed first.
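
The number of threads used for this parallel split finding is user-configurable; the value below is illustrative.

# Controlling the degree of parallelism in the scikit-learn wrapper.
import xgboost as xgb

model = xgb.XGBRegressor(n_jobs=4)   # CPU threads used for parallel split finding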

[Diagram: how the process and optimisations come together in XGBoost.]

Goals of XGBoost

Execution Speed: XGBoost was almost always faster than the other benchmarked implementations from R, Python, Spark and H2O, and it is considerably faster than the other algorithms.
Model Performance: XGBoost dominates on structured or tabular datasets in classification and regression predictive modelling problems.

Learning Task Parameters:

eval_metric – the metric to be used for validation data. The default values are rmse for regression and error for classification.
Typical values are:
rmse – root mean square error.
mae – mean absolute error.
logloss – negative log-likelihood.
error – Binary classification error rate (0.5 threshold).
merror – Multiclass classification error rate.
mlogloss – Multiclass logloss.
auc – Area under the curve.
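
A sketch of how these metrics are specified in the xgboost Python package (depending on the version, eval_metric may instead be passed to fit()); the choices below are illustrative.

# Specifying evaluation metrics for validation data.
import xgboost as xgb

clf = xgb.XGBClassifier(eval_metric="logloss")          # single metric
clf = xgb.XGBClassifier(eval_metric=["auc", "error"])   # several metrics at once
# Native API equivalent: params = {"objective": "binary:logistic", "eval_metric": "auc"}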

When to Use XGBoost?

Consider using XGBoost for any supervised machine learning task when it satisfies the following criteria:

∙ When you have a large number of observations in the training data.
∙ The number of features is smaller than the number of observations in the training data.
∙ The data contains a mixture of numerical and categorical features, or just numeric features.
∙ When model performance metrics are an important consideration.

How does XGB handle missing values?

Solution: XGBoost supports missing values by default. In tree algorithms, branch directions for missing values are learned during training. It is important to note that the gblinear booster treats missing values as zeros. During training, XGBoost decides whether missing values should fall into the right node or the left node; this decision is taken so as to minimise the loss. If there are no missing values during training, the tree makes a default decision to send any new missing values to the right node.
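
A minimal illustration with the xgboost Python package; the tiny dataset is purely for demonstration.

# NaN is treated as missing by default; no imputation is required.
import numpy as np
import xgboost as xgb

X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 0.5], [4.0, 1.0]])
y = np.array([0, 1, 0, 1])

clf = xgb.XGBClassifier(n_estimators=10)
clf.fit(X, y)                                  # default directions for missing values are learned here
print(clf.predict(np.array([[np.nan, 2.0]])))  # a missing value follows the learned default branch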

Key Difference Between Random Forest VS. XGBoost


1. XGBoost prunes the tree using a score called the "Similarity score" before entering the actual modelling. It considers the "Gain" of a node as the difference between the combined similarity score of its children and its own similarity score (see the formula sketch after this list). If the gain from a node is found to be minimal, it simply stops constructing the tree to a greater depth, which overcomes the challenge of overfitting to a great extent. Meanwhile, Random Forest may overfit the data if the majority of the trees in the forest are provided with similar samples. If the trees are fully grown, the model may generalize poorly once test data is introduced. Therefore, major consideration is given to distributing all the elementary units of the sample with approximately equal participation to all trees.
2. XGBoost is a good option for unbalanced datasets, but we cannot trust Random Forest in these types of cases. In applications like fraud detection, the classes will almost certainly be imbalanced: the number of authentic transactions will be huge compared with the number of fraudulent transactions. In XGBoost, when the model fails to predict the anomaly the first time, it gives more weight to it in the upcoming iterations, thereby increasing its ability to predict the class with low participation; but we cannot assure that Random Forest will treat the class imbalance with a proper process.
3. One of the most important differences between XGBoost and Random Forest is that XGBoost always gives more importance to functional space when reducing the cost of a model, while Random Forest tries to give more preference to hyperparameters to optimize the model. A small change in a hyperparameter will affect almost all trees in the forest, which can alter the prediction. This is also not a good approach when we expect test data with many variations in real time while a pre-defined set of hyperparameters is applied to the whole forest; XGBoost hyperparameters, by contrast, are applied to only one tree at the beginning, which is expected to adjust itself in an efficient manner as iterations progress. Also, XGBoost needs only a very small number of initial hyperparameters (the shrinkage parameter, the depth of the trees and the number of trees) compared with Random Forest.
4. When the model encounters a categorical variable whose classes occur with different frequencies, there is a possibility that Random Forest may give more preference to the class with more participation.
5. XGBoost may be preferable in situations like Poisson regression, rank regression, etc., because its trees are derived by optimizing an objective function.
6. Random Forests are easier to tune than boosting algorithms.
7. Random Forests adapt to distributed computing more easily than boosting algorithms.
8. Random Forests will almost certainly not overfit if the data is neatly pre-processed and cleaned, unless similar samples are repeatedly given to the majority of trees.
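
For regression with squared-error loss, the similarity score and gain mentioned in point 1 reduce to the following (this is the squared-error specialization of the gain formula given earlier, with r_i the residuals and N the number of residuals in the node):

\mathrm{Similarity} = \frac{\left(\sum_{i} r_i\right)^2}{N + \lambda},
\qquad \mathrm{Gain} = \mathrm{Sim}_{\mathrm{left}} + \mathrm{Sim}_{\mathrm{right}} - \mathrm{Sim}_{\mathrm{parent}}

A branch is pruned when the gain does not exceed the pruning threshold γ.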

Is XGBoost faster than random forest?

XGBoost is usually used to train gradient-boosted decision trees (GBDT) and other gradient-boosted models. Random Forests use the same model representation and inference as gradient-boosted decision trees, but they are trained with a different algorithm. XGBoost can be used to train a standalone random forest (a sketch follows below). Also, a random forest can be used as the base model for gradient boosting techniques.
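
As a sketch of the point above, recent versions of the xgboost package ship random-forest wrappers; the settings below are illustrative.

# Training a (non-boosted) random forest with the xgboost package.
import xgboost as xgb

rf = xgb.XGBRFClassifier(
    n_estimators=200,        # trees are grown independently, not sequentially
    subsample=0.8,           # row subsampling per tree (bagging-style)
    colsample_bynode=0.8,    # feature subsampling at each split, as in Random Forest
)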

Further, random forest is an improvement over bagging that helps in reducing the variance. Random forest builds trees in parallel, while in boosting, trees are built sequentially: each tree is grown using information from previously grown trees, unlike bagging, where multiple bootstrap copies of the original training data are created and a separate decision tree is fit on each. This is the reason why XGBoost generally performs better than random forest.
