XGBoost Tutorial
What is XGBoost?
Extreme Gradient Boosting (XGBoost) implements the gradient boosting framework, but more efficiently. It provides both a linear model solver and tree learning algorithms, and what makes it fast is its capacity to do parallel computation on a single machine.
This makes XGBoost at least 10 times faster than earlier gradient boosting implementations. It supports various objective functions, including regression, classification and ranking.
Because it combines very high predictive power with a fast implementation, XGBoost has become an ideal fit for many competitions. It also has additional features for performing cross-validation and finding important variables.
Idea of boosting
Let's start with an intuitive definition of the concept:
Boosting (Freund and Schapire, 1996) - an algorithm that fits many weak classifiers to reweighted versions of the training data and classifies the final examples by majority voting.
When using the boosting technique, every instance in the dataset is assigned a score that tells how difficult it is to classify. In each subsequent iteration the algorithm pays more attention (assigns bigger weights) to instances that were wrongly classified previously.
Ensemble parameters are optimized in a stage-wise way, which means that we calculate the optimal parameters for the next classifier while holding fixed what was already calculated. This might sound like a limitation, but it turns out to be a very reasonable way of regularizing the model.
Any algorithm can be used as a base for the boosting technique, but trees have some nice properties that make them particularly suitable candidates.
Pros
computational scalability,
handling missing values,
robust to outliers,
does not require feature scaling,
can deal with irrelevant inputs,
interpretable (if small),
can handle mixed predictors (quantitative and qualitative)
Cons
A single tree has limited predictive power on its own; the boosting technique tries to reduce the variance by averaging many different trees (where each one is solving the same problem).
Common Algorithms
In every machine learning model the training objective is a sum of a loss function L and a regularization term Ω:
obj = L + Ω
The loss function controls the predictive power of the algorithm, and the regularization term controls its simplicity.
Boosting is commonly implemented with decision trees as the base learners, although it is a meta-estimator, which means you can fit any classifier in. The intuitive recipe is presented below:
Algorithm (the classic AdaBoost recipe):
Assume that the number of training samples is denoted by $N$, and the number of iterations (created trees) is $M$. Notice that the possible class outputs are $Y = \{-1, 1\}$.
1. Initialize the observation weights uniformly: $w_i = 1/N$ for $i = 1, \dots, N$.
2. For $m = 1$ to $M$:
   fit a classifier $G_m(x)$ to the training data using the weights $w_i$,
   compute its weighted error $\mathrm{err}_m$ and the classifier weight $\alpha_m = \log\left(\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}\right)$,
   increase the weights $w_i$ of the observations misclassified by $G_m(x)$.
3. Output the final classifier as the weighted majority vote $G(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right)$.
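A rough from-scratch sketch of this recipe in Python (the function names and the choice of decision stumps as weak learners are illustrative and not part of the original material):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=10):
    """Fit M decision stumps on reweighted data; labels y must be in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                 # step 1: uniform weights
    stumps, alphas = [], []
    for _ in range(M):                      # step 2
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        miss = (pred != y)
        err = np.sum(w * miss) / np.sum(w)  # weighted error of this stump
        alpha = np.log((1.0 - err) / (err + 1e-12))
        w = w * np.exp(alpha * miss)        # up-weight misclassified samples
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Step 3: weighted majority vote over all fitted stumps."""
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)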
We can take advantage of the fact that the loss function can be represented in a form suitable for optimization (due to the stage-wise additivity). This creates a class of general boosting algorithms named simply the generalized boosted model (GBM).
An example of a GBM is the Gradient Boosted Tree, which uses a decision tree as the estimator. It can work with different loss functions (regression, classification, risk modeling, etc.): it evaluates the gradient of the loss and approximates it with a simple tree (stage-wise, so that the overall error is minimized).
AdaBoost is a special case of the Gradient Boosted Tree that uses the exponential loss function.
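As an illustration (not taken from this notebook's original code), scikit-learn's GradientBoostingClassifier implements exactly this stage-wise scheme; a minimal sketch on a toy problem:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy binary problem; the default loss corresponds to logistic (deviance) loss.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=2)
gbt.fit(X, y)
print('Training accuracy: {:.3f}'.format(gbt.score(X, y)))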
GBT tries to approach the problem of overfitting by adding some regularization parameters. We can, for example, control the tree structure (maximum depth, minimum samples per leaf), shrink the learning rate, or reduce the variance by subsampling rows and columns.
XGBoost takes this idea further. It was developed by Tianqi Chen in C++, but it also provides interfaces for Python, R and Julia.
XGBoost's objective function is a sum of a specific loss function evaluated over all predictions and a sum of regularization terms for all predictors ($K$ trees). In the formula, $f_k$ denotes the prediction coming from the $k$-th tree:
$$\mathrm{obj}(\theta) = \sum_{i}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
The loss function depends on the task being performed (classification, regression, etc.), and the regularization term is described by the following equation:
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
The first part ($\gamma T$) is responsible for controlling the overall number of created leaves, and the second term ($\frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$) watches over their scores.
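In xgboost's parameter dictionary these two knobs are exposed as gamma (the penalty γ paid for each additional leaf, i.e. the minimum loss reduction required to make a split) and lambda (the L2 penalty λ on leaf weights). A small sketch, with arbitrary example values rather than tuned ones:

# gamma (γ) penalizes each additional leaf; lambda (λ) is the L2 penalty on
# leaf scores. The numeric values below are arbitrary examples, not tuned.
params_regularized = {
    'objective': 'binary:logistic',
    'max_depth': 2,
    'gamma': 1.0,     # larger values -> fewer leaves
    'lambda': 1.0,    # larger values -> smaller leaf weights
}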
Implementation
In [1]:
import numpy as np
import xgboost as xgb
/usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
Loading data
We are going to use the bundled Agaricus (https://archive.ics.uci.edu/ml/datasets/Mushroom) dataset, which can be downloaded here (https://github.com/dmlc/xgboost/tree/master/demo/data).
This data set records biological attributes of different mushroom species, and the target is to
predict whether it is poisonous
It consists of 8124 instances, characterized by 22 attributes (both numeric and categorical). The target class is either 0 or 1, which makes this a binary classification problem.
Luckily, all the data has already been pre-processed for us. Categorical variables have been encoded, and all instances have been divided into train and test datasets. You will learn how to do this on your own in later lectures.
Data needs to be stored in a DMatrix object, which is designed to handle sparse datasets. It can be populated in a couple of ways, for example from a libsvm-format text file, from a NumPy array or SciPy sparse matrix, or from an XGBoost binary buffer file (a short sketch follows).
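For illustration, a few of those options (the random toy data below is only for this sketch; the actual agaricus files are loaded in the next cell):

import numpy as np
import xgboost as xgb
from scipy.sparse import csr_matrix

# From a dense NumPy array, with labels passed separately ...
X_toy = np.random.rand(5, 10)
y_toy = np.random.randint(2, size=5)
dtoy_dense = xgb.DMatrix(X_toy, label=y_toy)

# ... from a SciPy sparse matrix ...
dtoy_sparse = xgb.DMatrix(csr_matrix(X_toy), label=y_toy)

# ... or directly from a libsvm-format text file, as done below:
# dtrain = xgb.DMatrix('data/agaricus.txt.train')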
In [2]:
dtrain = xgb.DMatrix('data/agaricus.txt.train')
dtest = xgb.DMatrix('data/agaricus.txt.test')
In [3]:
In [4]:
In [5]:
params = {
    'objective': 'binary:logistic',
    'max_depth': 2,
    'silent': 1,
    'eta': 1
}
num_rounds = 5
Training classifier
To train the classifier we simply pass it the training dataset, the parameter list and the number of iterations.
In [6]:
In [7]:
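For reference, a minimal sketch of what this training call looks like with the native API, reusing the params, dtrain and num_rounds objects defined above:

# Train a booster for num_rounds boosting iterations.
bst = xgb.train(params, dtrain, num_rounds)

The returned Booster object (bst) is what the prediction cells below operate on.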
Make predictions
In [8]:
preds_prob = bst.predict(dtest)
preds_prob
Out[8]:
Let's calculate a simple accuracy metric to verify the results. Of course, validation should be performed according to the dataset, but in this case accuracy is sufficient.
In [9]:
labels = dtest.get_label()
preds = preds_prob > 0.5  # threshold the predicted probabilities

# Count how many thresholded predictions match the true labels.
correct = 0
for i in range(len(preds)):
    if (labels[i] == preds[i]):
        correct += 1

print('Predicted correctly: {}/{}'.format(correct, len(preds)))
print('Accuracy: {:.2f}'.format(correct / len(preds)))
Loading libraries
Begin with loading all required libraries.
In [10]:
import numpy as np
Loading data
We are going to use the same dataset as in the previous lecture. The scikit-learn package provides a convenient function, load_svmlight_files, capable of reading many libsvm files at once and storing them as SciPy sparse matrices.
In [12]:
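A sketch of that loading step, assuming the same agaricus files as before (the file paths are an assumption):

from sklearn.datasets import load_svmlight_files

# Read both libsvm files in one call; X_* are SciPy sparse matrices,
# y_* are NumPy arrays of labels.
X_train, y_train, X_test, y_test = load_svmlight_files(
    ('data/agaricus.txt.train', 'data/agaricus.txt.test'))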
In [15]:
params = {
    'objective': 'binary:logistic',
    'max_depth': 2,
    'learning_rate': 1.0,
    'silent': 1.0,
    'n_estimators': 5
}
Training classifier
In [16]:
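A sketch of the training step with the scikit-learn wrapper, assuming the X_train/y_train matrices from the loading step and the params dictionary above (note that the silent parameter is only understood by older xgboost versions):

from xgboost.sklearn import XGBClassifier

# Wrap the parameter dictionary in the scikit-learn style estimator and fit it.
bst = XGBClassifier(**params)
bst.fit(X_train, y_train)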
Make predictions
In [17]:
preds = bst.predict(X_test)
preds
Out[17]:
In [18]:
# Count correctly classified test examples (predict returns class labels here).
correct = 0
for i in range(len(preds)):
    if (y_test[i] == preds[i]):
        correct += 1

print('Accuracy: {:.2f}'.format(correct / len(preds)))
Evaluate results
Specify the training parameters - we are going to use 5 decision tree stumps with a moderate learning rate.
In [21]:
num_rounds = 5
Before training the model, let's also specify a watchlist array to observe its performance on both datasets.
In [22]:
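A sketch of what such a watchlist might look like, reusing the dtrain and dtest DMatrix objects (the 'test'/'train' strings are just display names used in the evaluation log):

# Each entry pairs a DMatrix to evaluate with the name shown in the log.
watchlist = [(dtest, 'test'), (dtrain, 'train')]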
There are already some predefined metrics available. You can use them as the input for the eval_metric parameter while training the model.
In [23]:
In [24]:
params['eval_metric'] = 'logloss'
bst = xgb.train(params, dtrain, num_rounds, watchlist)
In [25]:
In this example our classification metric will simply count the number of misclassified examples, assuming that classes with p > 0.5 are positive. You can change this threshold if you want more certainty.
The algorithm is getting better as the number of misclassified examples gets lower. Remember to also set the argument maximize=False while training.
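A sketch of such a custom metric passed through the feval argument (the function name misclassified is an assumption, not taken from the original cells):

def misclassified(preds, dtrain):
    """Custom evaluation metric: count of misclassified examples.

    preds are predicted probabilities, dtrain is the DMatrix being evaluated.
    """
    labels = dtrain.get_label()
    return 'misclassified', float(np.sum(labels != (preds > 0.5)))

# Lower is better, so tell xgboost not to maximize the metric.
bst = xgb.train(params, dtrain, num_rounds, watchlist,
                feval=misclassified, maximize=False)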
In [26]:
In [27]:
General advice
These are some common tactics when approaching imbalanced datasets (a parameter sketch follows the list):
make sure that the parameter min_child_weight is small (because leaf nodes can have smaller size groups); it is set to min_child_weight=1 by default,
assign more weights to specific samples while initializing the DMatrix,
control the balance of positive and negative weights using the scale_pos_weight parameter,
use AUC for evaluation.
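A sketch that puts these tactics together, reusing the dtrain DMatrix and the numpy import from earlier (the ratio computation assumes binary 0/1 labels; the variable names are illustrative):

# Weight positives by the negative-to-positive ratio of the training labels.
train_labels = dtrain.get_label()
ratio = float(np.sum(train_labels == 0)) / np.sum(train_labels == 1)

params_imbalanced = {
    'objective': 'binary:logistic',
    'min_child_weight': 1,        # keep small so tiny leaf groups are allowed
    'scale_pos_weight': ratio,    # balance positive and negative weights
    'eval_metric': 'auc',         # rank-based metric, robust to imbalance
}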