Abstract— The Santander customer transaction prediction dataset was part of a Kaggle competition whose aim is to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. Such predictions are one way to help customers understand their financial health and identify which products and services might help them achieve their monetary goals. This project trains a model on a dataset containing nearly 200 numeric feature variables, a binary target column and a string ID_code column to predict which customers will make a specific transaction in the future.
Keywords—customer prediction, LightGBM, machine learning, binary classification, gradient boosting

I. INTRODUCTION

This project can help the company in the following ways.

Segmenting customers into small groups and addressing individual customers based on actual behaviours, instead of making assumptions about what makes customers similar to one another, and instead of only looking at aggregated data, which hides important facts about individual customers.

Accurately predicting the future behaviour of customers, such as transaction prediction, using predictive customer behaviour modelling techniques, instead of relying only on the analysis of historical data.

By such analysis, we can determine which cluster of customers is going to be loyal to the company and which group of customers is hesitant to make a transaction. Knowing this, we can treat each customer group differently, for instance by providing coupons to customers to encourage transactions and rewards to those who are loyal to the company.

This is a classification problem, and we need to understand the confusion matrix to obtain evaluation metrics. The confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. It is a table of the four different combinations of predicted and actual values, and it is extremely useful for measuring Recall, Precision, Specificity, Accuracy and, most importantly, the AUC-ROC curve.
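As a concrete illustration, the following minimal sketch (with hypothetical label and score arrays, and assuming scikit-learn is available) derives these metrics from the four cells of a binary confusion matrix:

```python
# Minimal sketch: deriving the evaluation metrics from a confusion matrix.
# The arrays y_true and y_score are hypothetical placeholders, not project outputs.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])                     # actual labels
y_score = np.array([0.1, 0.4, 0.8, 0.3, 0.2, 0.9, 0.6, 0.05])   # predicted probabilities
y_pred = (y_score >= 0.5).astype(int)                            # threshold at 0.5

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)               # true positive rate
precision = tp / (tp + fp)
specificity = tn / (tn + fp)          # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)
auc = roc_auc_score(y_true, y_score)  # area under the ROC curve

print(recall, precision, specificity, accuracy, auc)
```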
This paper is divided as follows: Section II discusses the background and the related work, Section III discusses the conceptual study of the techniques that have been employed according to the research done to date, Section IV discusses the model that we have employed, Section V discusses the results obtained and the analysis of those results, and Section VI concludes the paper and discusses the future work that can be performed to improve the results further.

II. BACKGROUND WORK

A general study was performed to analyze the concept of gradient boosting decision trees, and especially to build an in-depth conceptual understanding of the LightGBM model. We then did targeted research analyzing the existing experiments that have been performed using various machine learning techniques. There have been quite a few implementations of GBDT in the literature, including XGBoost [1], pGBRT [2], scikit-learn [3], and gbm in R [4]. Scikit-learn and gbm in R implement the pre-sorted algorithm, pGBRT implements the histogram-based algorithm, and XGBoost supports both the pre-sorted algorithm and the histogram-based algorithm. As shown in [1], XGBoost outperforms the other tools, so we use XGBoost as our baseline in the experiment section.

Accuracy has improved with the use of certain machine learning techniques, but there is still a lot of room for improvement. Among those techniques, RNNs have an edge for textual data, since text comments of variable length can be used flexibly across a chronological sequence; RNNs are famously used for sequence prediction problems, although they were conventionally difficult to train.

To reduce the size of the training data, a common approach is to down-sample the data instances. For example, in one of the papers, data instances are filtered out if their weights are smaller than a fixed threshold. In another paper, Stochastic Gradient Boosting (SGB) uses a random subset to train the weak learners in every iteration. In another, the sampling ratio is dynamically adjusted during training. However, all these works except SGB are based on AdaBoost and cannot be directly applied to GBDT, since there are no native weights for data instances in GBDT. Though SGB can be applied to GBDT, it usually hurts accuracy and is thus not a desirable choice. Similarly, to reduce the number of features, it is natural to filter weak features. This is usually done by principal component analysis or projection pursuit. However, these approaches rely heavily on the assumption that features contain significant redundancy, which might not always be true in practice (features are usually designed with their unique contributions, and removing any of them may affect the training accuracy to some degree).

LightGBM is a recent improvement of the gradient boosting algorithm that inherits its high predictivity but resolves its scalability and long computational time by adopting a leaf-wise growth strategy and introducing novel techniques. Experiments on multiple public datasets show that LightGBM can accelerate the training process by over 20 times while achieving almost the same accuracy.
III. LIGHTGBM
Gradient boosting decision tree (GBDT) [1] is a widely-used machine learning algorithm, due to its efficiency, accuracy, and interpretability. GBDT achieves state-of-the-art performance in many machine learning tasks, such as multi-class classification [2], click prediction [3], and learning to rank [4]. In recent years, with the emergence of big data (in terms of both the number of features and the number of instances), GBDT has faced new challenges, especially in the tradeoff between accuracy and efficiency. Conventional implementations of GBDT need to, for every feature, scan all the data instances to estimate the information gain of all the possible split points. Therefore, their computational complexities are proportional to both the number of features and the number of instances, which makes these implementations very time consuming when handling big data. To tackle this challenge, a straightforward idea is to reduce the number of data instances and the number of features. However, this turns out to be highly non-trivial. For example, it is unclear how to perform data sampling for GBDT. While there are some works that sample data according to their weights to speed up the training process of boosting [5, 6, 7], they cannot be directly applied to GBDT since there is no sample weight in GBDT at all.

GBDT is an ensemble model of decision trees, which are trained in sequence [1]. In each iteration, GBDT learns the decision trees by fitting the negative gradients (also known as the residual errors). The main cost in GBDT lies in learning the decision trees, and the most time-consuming part of learning a decision tree is finding the best split points. One of the most popular algorithms for finding split points is the pre-sorted algorithm [8, 9], which enumerates all possible split points on the pre-sorted feature values. This algorithm is simple and can find the optimal split points; however, it is inefficient in both training speed and memory consumption. Another popular algorithm is the histogram-based algorithm [10, 11, 12]. Instead of finding the split points on the sorted feature values, the histogram-based algorithm buckets continuous feature values into discrete bins and uses these bins to construct feature histograms during training. Since the histogram-based algorithm is more efficient in both memory consumption and training speed, we develop our work on its basis. The histogram-based algorithm finds the best split points based on the feature histograms. It costs O(#data × #feature) for histogram building and O(#bin × #feature) for split point finding. Since #bin is usually much smaller than #data, histogram building dominates the computational complexity; if we can reduce #data or #feature, we will be able to substantially speed up the training of GBDT.
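As a rough illustration of why histograms help, the following minimal sketch (illustrative function and toy data, not the project's implementation) buckets one feature into bins, builds a gradient histogram, and scans the O(#bin) boundaries for the best split using a simplified variance gain:

```python
# Minimal sketch of histogram-based split finding for a single feature.
# Uses the simplified gain G_L^2/n_L + G_R^2/n_R; names and data are illustrative.
import numpy as np

def best_histogram_split(feature, gradients, n_bins=32):
    # Bucket continuous feature values into discrete bins.
    edges = np.linspace(feature.min(), feature.max(), n_bins + 1)
    bins = np.clip(np.digitize(feature, edges[1:-1]), 0, n_bins - 1)

    # Build the feature histogram: per-bin gradient sums and instance counts.
    grad_hist = np.bincount(bins, weights=gradients, minlength=n_bins)
    count_hist = np.bincount(bins, minlength=n_bins)

    total_grad, total_count = grad_hist.sum(), count_hist.sum()
    best_gain, best_bin = -np.inf, None
    g_left = n_left = 0.0
    # Scan bin boundaries once: O(#bin) instead of O(#data) per feature.
    for b in range(n_bins - 1):
        g_left += grad_hist[b]
        n_left += count_hist[b]
        n_right = total_count - n_left
        if n_left == 0 or n_right == 0:
            continue
        gain = g_left**2 / n_left + (total_grad - g_left)**2 / n_right
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
g = np.where(x > 0.3, -1.0, 1.0) + rng.normal(scale=0.1, size=1000)  # toy gradients
print(best_histogram_split(x, g))
```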
Gradient-based One-Side Sampling (GOSS). While there is no native weight for data instances in GBDT, we notice that data instances with different gradients play different roles in the computation of information gain. In particular, according to the definition of information gain, those instances with larger gradients (i.e., under-trained instances) will contribute more to the information gain. Therefore, when down-sampling the data instances, in order to retain the accuracy of information gain estimation, we should preferably keep those instances with large gradients (e.g., larger than a pre-defined threshold, or among the top percentiles), and only randomly drop those instances with small gradients. We prove that such a treatment can lead to a more accurate gain estimation than uniformly random sampling, with the same target sampling rate, especially when the value of information gain has a large range.

In AdaBoost, the sample weight serves as a good indicator of the importance of data instances. However, in GBDT, there are no native sample weights, and thus the sampling methods proposed for AdaBoost cannot be directly applied. Fortunately, we notice that the gradient of each data instance in GBDT provides useful information for data sampling: if an instance is associated with a small gradient, the training error for this instance is small and it is already well trained. A straightforward idea is to discard those data instances with small gradients. However, doing so would change the data distribution and hurt the accuracy of the learned model. To avoid this problem, GOSS keeps all the instances with large gradients and performs random sampling on the instances with small gradients. To compensate for the influence on the data distribution, when computing the information gain, GOSS introduces a constant multiplier for the data instances with small gradients. Specifically, GOSS first sorts the data instances according to the absolute value of their gradients and selects the top a×100% instances. It then randomly samples b×100% instances from the rest of the data. After that, GOSS amplifies the sampled data with small gradients by a constant (1−a)/b when calculating the information gain. By doing so, we put more focus on the under-trained instances without changing the original data distribution by much.
Exclusive Feature Bundling (EFB). Usually in real applications, although there are a large number of features, the feature space is quite sparse, which gives us the possibility of designing a nearly lossless approach to reduce the number of effective features. Specifically, in a sparse feature space, many features are (almost) exclusive, i.e., they rarely take nonzero values simultaneously. Examples include one-hot features (e.g., one-hot word representations in text mining). We can safely bundle such exclusive features. To this end, we design an efficient algorithm that reduces the optimal bundling problem to a graph coloring problem (taking features as vertices and adding edges between every two features that are not mutually exclusive) and solves it with a greedy algorithm that has a constant approximation ratio.
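The following minimal sketch illustrates the greedy bundling idea in a simplified form that tolerates no conflicts at all (the actual EFB algorithm allows a small fraction of conflicts; names and the toy matrix are illustrative):

```python
# Minimal sketch of greedy exclusive feature bundling.
# Simplified variant with zero conflict tolerance; illustrative names and data.
import numpy as np

def greedy_bundles(X):
    """Group columns of X so that, within a bundle, no two features
    are simultaneously nonzero on any row (i.e., they are exclusive)."""
    n_features = X.shape[1]
    nonzero_masks = [X[:, j] != 0 for j in range(n_features)]
    bundles = []       # lists of feature indices
    bundle_masks = []  # union of nonzero masks per bundle
    for j in range(n_features):
        for k, mask in enumerate(bundle_masks):
            if not np.any(mask & nonzero_masks[j]):   # no conflict with bundle k
                bundles[k].append(j)
                bundle_masks[k] = mask | nonzero_masks[j]
                break
        else:                                         # no compatible bundle found
            bundles.append([j])
            bundle_masks.append(nonzero_masks[j].copy())
    return bundles

# Toy sparse matrix: columns 0 and 1 are mutually exclusive one-hot halves.
X = np.array([[1, 0, 3],
              [0, 2, 0],
              [1, 0, 0],
              [0, 4, 5]])
print(greedy_bundles(X))  # e.g. [[0, 1], [2]]
```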
The LightGBM model can be tuned further through parameter tuning. In order to improve efficiency, the following parameters are the main ones to adjust (a configuration sketch follows this list).

1. num_leaves: This is the main parameter controlling the complexity of the tree model. Ideally, the value of num_leaves should be less than 2^(max_depth), the number of leaves of a full tree of that depth; a larger value will result in overfitting.

2. min_data_in_leaf: Setting it to a large value can avoid growing too deep a tree, but may cause under-fitting. In practice, setting it to hundreds is enough for a large dataset.

3. max_depth: max_depth can also be used to limit the tree depth explicitly.
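A minimal training sketch using these parameters might look as follows (the values shown and the toy data are illustrative assumptions, not the tuned values used in this project):

```python
# Minimal LightGBM training sketch; parameter values are illustrative,
# and the synthetic data stands in for the real 200-feature dataset.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))                      # toy stand-in features
y = (X[:, 0] + rng.normal(size=2000) > 0).astype(int)

params = {
    "objective": "binary",
    "metric": "auc",
    "num_leaves": 31,          # main complexity control; keep it below 2**max_depth
    "max_depth": 6,            # explicit depth limit
    "min_data_in_leaf": 200,   # hundreds works well for large datasets
    "learning_rate": 0.05,
    "verbose": -1,
}

train_set = lgb.Dataset(X[:1400], label=y[:1400])
valid_set = lgb.Dataset(X[1400:], label=y[1400:], reference=train_set)

model = lgb.train(params, train_set, num_boost_round=200, valid_sets=[valid_set])
print(model.num_trees())
```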
IV. METHODOLOGY
A. Pre-Processing
The dataset was obtained from the Kaggle website (https://www.kaggle.com/c/santander-customer-transaction-prediction/overview). It contains data rows, each with 200 feature variables, a binary label which describes whether the customer will perform a transaction in the future (0 for negative and 1 for positive), and a string ID identifying each data point. A missing-value check was done and no missing values were found. We drop the train label during pre-processing. In addition, the mean, median and standard deviation have been plotted for the training data set, and the distributions of the feature columns have also been plotted as histograms. The data was split in a 70:30 ratio for training and testing.
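A minimal pre-processing sketch along these lines (assuming the Kaggle train.csv with ID_code, target and var_0 ... var_199 columns; the plotting steps are omitted, and the stratified split is an added assumption):

```python
# Minimal pre-processing sketch for the Santander dataset; the file path is
# assumed to be the Kaggle train.csv, and plotting/summary steps are omitted.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")              # columns: ID_code, target, var_0..var_199
print(df.isnull().sum().sum())             # missing-value check (expect 0)

y = df["target"]
X = df.drop(columns=["ID_code", "target"])  # drop ID and train label

# 70:30 split for training and testing; stratified to preserve class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)
```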
V. RESULTS AND ANALYSIS

Fig 1: CV, Train and Test accuracies for LightGBM

REFERENCES

[1] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.