
Customer Transaction Prediction in the Banking Sector

Aditi Venkatesh, Anusha Pugazhedhi, Padma Priya Cheruvu, Rajashree Kamalakannan

Department of Computer Science and Software Engineering,
University of Texas at Dallas, Dallas, USA
[email protected], [email protected], [email protected], [email protected]

Abstract— The Santander customer transaction prediction dataset was part of a Kaggle competition whose aim is to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. Such predictions are among the ways to help customers understand their financial health and identify which products and services might help them achieve their monetary goals. This project trains a model on a dataset containing 200 numeric feature variables, a binary target column and a string ID_code column to predict which customers will make a specific transaction in the future.

Keywords—customer prediction, LightGBM, machine learning, binary classification, gradient boosting

I. INTRODUCTION

This project can help the company in the following ways.

Segmenting customers into small groups and addressing individual customers based on actual behaviours, instead of making assumptions about what makes customers similar to one another, and instead of looking only at aggregated data, which hides important facts about individual customers.

Accurately predicting the future behaviour of customers, such as transaction prediction, using predictive customer behaviour modelling techniques, instead of relying solely on the analysis of historical data.

Determining which cluster of customers is likely to stay loyal to the company and which group of customers is hesitating to make a transaction. Knowing this, we can treat each customer group differently, for instance by providing coupons to one group to encourage transactions and rewards to those who are loyal to the company.

This is a classification problem, so we need to understand the confusion matrix in order to derive evaluation metrics. The confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. It is a table of the four combinations of predicted and actual values, and it is extremely useful for measuring recall, precision, specificity, accuracy and, most importantly, the AUC-ROC curve.
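As a minimal sketch (ours, not part of the original paper), the metrics above can be read off the four confusion-matrix cells with scikit-learn; the small arrays below are hypothetical placeholders for actual labels, predicted labels and predicted probabilities.

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical labels and scores, for illustration only
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 0])
y_score = np.array([0.2, 0.9, 0.4, 0.1, 0.8, 0.6, 0.7, 0.3])

# The four cells: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)           # sensitivity, true positive rate
specificity = tn / (tn + fp)      # true negative rate
auc = roc_auc_score(y_true, y_score)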
This paper is organized as follows: Section II discusses the background and related work, Section III presents a conceptual study of the techniques employed according to the research done to date, Section IV describes the model we have employed, Section V presents the results obtained and their analysis, and Section VI concludes the paper and discusses future work that could further improve the results.

II. BACKGROUND WORK

A general study was performed to analyze the concept of gradient boosting decision trees, and especially to build an in-depth conceptual understanding of the LightGBM model. We then carried out targeted research analyzing existing experiments performed with various machine learning techniques. There have been quite a few implementations of GBDT in the literature, including XGBoost [1], pGBRT [2], scikit-learn [3], and gbm in R [4]. Scikit-learn and gbm in R implement the pre-sorted algorithm, pGBRT implements the histogram-based algorithm, and XGBoost supports both the pre-sorted algorithm and the histogram-based algorithm. As shown in [1], XGBoost outperforms the other tools, so we use XGBoost as our baseline in the experiment section.

Accuracy has improved through the use of certain machine learning techniques, but there is still considerable room for improvement. In related work, RNNs have an edge over several other techniques for textual data, since text comments of variable length can be used flexibly across a chronological sequence; however, RNNs have conventionally been difficult to train and are mostly used for sequence prediction problems.

To reduce the size of the training data, a common approach is to downsample the data instances. For example, in one line of work, data instances are filtered out if their weights are smaller than a fixed threshold; in another, stochastic gradient boosting (SGB) uses a random subset to train the weak learners in every iteration; in yet another, the sampling ratio is dynamically adjusted during training. However, all these works except SGB are based on AdaBoost and cannot be directly applied to GBDT, since there are no native weights for data instances in GBDT. Though SGB can be applied to GBDT, it usually hurts accuracy and is thus not a desirable choice. Similarly, to reduce the number of features, it is natural to filter weak features, which is usually done by principal component analysis or projection pursuit. However, these approaches rely heavily on the assumption that features contain significant redundancy, which might not always be true in practice (features are usually designed with their own unique contributions, and removing any of them may affect training accuracy to some degree).

LightGBM is a recent improvement of the gradient boosting algorithm: it inherits the high predictivity of GBDT but resolves its scalability and long computation time by adopting a leaf-wise growth strategy and introducing novel techniques. Experiments on multiple public datasets show that LightGBM can accelerate the training process by over 20 times while achieving almost the same accuracy.

III. LIGHTGBM

Gradient boosting decision tree (GBDT) [1] is a widely used machine learning algorithm, due to its efficiency, accuracy, and interpretability. GBDT achieves state-of-the-art performance in many machine learning tasks, such as multi-class classification [2], click prediction [3], and learning to rank [4]. In recent years, with the emergence of big data (in terms of both the number of features and the number of instances), GBDT has been facing new challenges, especially in the tradeoff between accuracy and efficiency. Conventional implementations of GBDT need to scan all the data instances, for every feature, to estimate the information gain of all possible split points, so their computational complexity is proportional to both the number of features and the number of instances. This makes these implementations very time consuming when handling big data. To tackle this challenge, a straightforward idea is to reduce the number of data instances and the number of features. However, this turns out to be highly non-trivial. For example, it is unclear how to perform data sampling for GBDT: while there are some works that sample data according to their weights to speed up the training process of boosting [5, 6, 7], they cannot be directly applied to GBDT since there is no sample weight in GBDT at all.

GBDT is an ensemble model of decision trees, which are trained in sequence [1]. In each iteration, GBDT learns the decision trees by fitting the negative gradients (also known as the residual errors). The main cost in GBDT lies in learning the decision trees, and the most time-consuming part in learning a decision tree is finding the best split points. One of the most popular algorithms for finding split points is the pre-sorted algorithm [8, 9], which enumerates all possible split points on the pre-sorted feature values. This algorithm is simple and can find the optimal split points, but it is inefficient in both training speed and memory consumption. Another popular algorithm is the histogram-based algorithm [10, 11, 12], as shown in Alg. 1. Instead of finding the split points on the sorted feature values, the histogram-based algorithm buckets continuous feature values into discrete bins and uses these bins to construct feature histograms during training. Since the histogram-based algorithm is more efficient in both memory consumption and training speed, we develop our work on its basis. As shown in Alg. 1, the histogram-based algorithm finds the best split points based on the feature histograms. It costs O(#data × #feature) for histogram building and O(#bin × #feature) for split point finding. Since #bin is usually much smaller than #data, histogram building dominates the computational complexity; if we can reduce #data or #feature, we will be able to substantially speed up the training of GBDT.
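To make the cost argument concrete, the following sketch (ours, not the paper's Alg. 1) finds a split for a single feature: one O(#data) pass accumulates per-bin gradient statistics, then one O(#bin) scan over the boundaries picks the best split. The bin count and the variance-gain style score are our simplifying assumptions.

import numpy as np

def best_split_histogram(feature, gradients, n_bins=32):
    # Bucket continuous feature values into discrete bins (quantile boundaries)
    bins = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
    bin_idx = np.digitize(feature, bins)

    # One O(#data) pass: per-bin gradient sums and instance counts
    grad_sum = np.bincount(bin_idx, weights=gradients, minlength=n_bins)
    count = np.bincount(bin_idx, minlength=n_bins)
    total_grad, total_count = grad_sum.sum(), count.sum()

    # One O(#bin) scan over boundaries for the best split
    best_gain, best_bin = -np.inf, None
    left_grad, left_count = 0.0, 0.0
    for b in range(n_bins - 1):
        left_grad += grad_sum[b]
        left_count += count[b]
        right_grad = total_grad - left_grad
        right_count = total_count - left_count
        if left_count == 0 or right_count == 0:
            continue
        # Simplified variance-gain style score
        gain = left_grad ** 2 / left_count + right_grad ** 2 / right_count
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain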
Gradient-based One-Side Sampling (GOSS). While there is no native weight for data instances in GBDT, we notice that data instances with different gradients play different roles in the computation of information gain. In particular, according to the definition of information gain, instances with larger gradients (i.e., under-trained instances) contribute more to the information gain. Therefore, when downsampling the data instances, in order to retain the accuracy of information gain estimation, we should preferably keep those instances with large gradients (e.g., larger than a pre-defined threshold, or among the top percentiles), and only randomly drop instances with small gradients. We prove that such a treatment can lead to a more accurate gain estimation than uniformly random sampling, at the same target sampling rate, especially when the value of information gain has a large range.

In AdaBoost, the sample weight serves as a good indicator of the importance of data instances. However, in GBDT there are no native sample weights, so the sampling methods proposed for AdaBoost cannot be directly applied. Fortunately, we notice that the gradient of each data instance in GBDT provides useful information for data sampling: if an instance is associated with a small gradient, the training error for this instance is small and it is already well trained. A straightforward idea is to discard those data instances with small gradients; however, the data distribution would be changed by doing so, which would hurt the accuracy of the learned model. To avoid this problem, GOSS keeps all the instances with large gradients and performs random sampling on the instances with small gradients. In order to compensate for the influence on the data distribution, when computing the information gain, GOSS introduces a constant multiplier for the data instances with small gradients (see Alg. 2). Specifically, GOSS first sorts the data instances according to the absolute value of their gradients and selects the top a×100% instances. Then it randomly samples b×100% instances from the rest of the data. After that, GOSS amplifies the sampled data with small gradients by the constant (1−a)/b when calculating the information gain. By doing so, we put more focus on the under-trained instances without changing the original data distribution by much.
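A minimal sketch of the sampling step just described (ours, not the paper's Alg. 2): sort by absolute gradient, keep the top a×100% instances, sample b×100% of the remainder, and return per-instance weights with the (1−a)/b multiplier applied to the small-gradient sample.

import numpy as np

def goss_sample(gradients, a=0.2, b=0.1):
    n = len(gradients)
    top_k = int(a * n)
    rand_k = int(b * n)

    # Sort instances by |gradient|, descending
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:top_k]                  # large-gradient instances: keep all
    rest = order[top_k:]                     # small-gradient instances: sample

    sampled_idx = np.random.choice(rest, size=rand_k, replace=False)

    # Amplify the small-gradient sample by (1 - a) / b to preserve
    # the original data distribution in the gain computation
    weights = np.ones(top_k + rand_k)
    weights[top_k:] = (1.0 - a) / b
    return np.concatenate([top_idx, sampled_idx]), weights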
Exclusive Feature Bundling (EFB). Usually in real applications, although there are a large number of features, the feature space is quite sparse, which gives us the possibility of designing a nearly lossless approach to reduce the number of effective features. Specifically, in a sparse feature space, many features are (almost) exclusive, i.e., they rarely take nonzero values simultaneously. Examples include one-hot features (e.g., one-hot word representations in text mining). We can safely bundle such exclusive features. To this end, we design an efficient algorithm that reduces the optimal bundling problem to a graph coloring problem (by taking features as vertices and adding edges between every two features that are not mutually exclusive), and solves it with a greedy algorithm that has a constant approximation ratio.
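As an illustration of the greedy bundling idea (our simplified sketch, not the paper's algorithm), the function below treats two features as conflicting when they are nonzero on the same row, and greedily assigns each feature to the first bundle whose accumulated conflict count stays within a tolerance. The density-based ordering is our stand-in for the paper's graph-degree ordering.

import numpy as np

def greedy_feature_bundles(X, max_conflicts=0):
    nonzero = X != 0
    # Visit features densest first (proxy for vertex degree in the conflict graph)
    order = np.argsort(-nonzero.sum(axis=0))

    bundles, bundle_masks = [], []
    for f in order:
        placed = False
        for i, mask in enumerate(bundle_masks):
            # Rows where this feature and the bundle are both nonzero
            conflicts = np.sum(mask & nonzero[:, f])
            if conflicts <= max_conflicts:
                bundles[i].append(f)
                bundle_masks[i] = mask | nonzero[:, f]
                placed = True
                break
        if not placed:
            bundles.append([f])
            bundle_masks.append(nonzero[:, f].copy())
    return bundles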
The LightGBM model can be tuned further through parameter tuning. To improve efficiency, the following parameters can be adjusted (a configuration sketch follows the list):
1. num_leaves: the main parameter controlling the complexity of the tree model. Ideally, num_leaves should be at most 2^max_depth, where max_depth is the maximum depth of the tree; values larger than this can result in overfitting.
2. min_data_in_leaf: setting this to a large value can avoid growing too deep a tree, but may cause under-fitting. In practice, setting it to hundreds is sufficient for a large dataset.
3. max_depth: this can also be used to limit the tree depth explicitly.
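A minimal sketch of such a configuration with the LightGBM Python API, assuming training matrices X_train and y_train already exist; the specific values are illustrative, not the paper's tuned settings.

import lightgbm as lgb

params = {
    "objective": "binary",        # binary target column
    "metric": "auc",
    "num_leaves": 31,             # main complexity control; keep below 2**max_depth
    "max_depth": 8,               # explicit cap on tree depth
    "min_data_in_leaf": 200,      # hundreds for a large dataset
    "learning_rate": 0.05,
}

train_set = lgb.Dataset(X_train, label=y_train)
booster = lgb.train(params, train_set, num_boost_round=1000)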

IV. METHODOLOGY

A. Pre-Processing
The dataset was obtained from the Kaggle website (https://www.kaggle.com/c/santander-customer-transaction-prediction/overview). Each data row contains 200 feature variables, a binary label describing whether the customer will perform a transaction in the future (0 for negative and 1 for positive), and a string ID identifying the data point. A missing-value check was performed and no missing values were found. We drop the train label during pre-processing. In addition, the mean, median and standard deviation have been plotted for the training data set, and the distributions of the feature columns have been plotted as histograms. The data was split in a 70:30 ratio for training and testing.
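A minimal sketch of this pre-processing, assuming the competition's train.csv layout with ID_code and target columns; the random seed is an arbitrary choice.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")               # Kaggle competition training file

X = df.drop(columns=["ID_code", "target"])  # the 200 numeric feature variables
y = df["target"]                            # binary label: 1 = will transact

# 70:30 split for training and testing, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)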

V. RESULTS AND ANALYSIS

A. Finding loss and accuracy

For LightGBM, training accuracy is 98.23% and accuracy on the validation data set is 98.54%. K-fold cross validation was performed. Testing accuracy for LightGBM is 89.95%.

Fig 1: CV, train and test accuracies for LightGBM

In addition, a Keras sequential neural network with exactly one input tensor and exactly one output tensor was built for comparison. Training accuracy was 90.37%, and testing accuracy after 50 epochs (training was stopped after a few epochs due to no change in error) was 90.7%.

Fig 2: Train and test accuracy for the Keras sequential neural network used for comparison
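A minimal sketch of such a comparison model, assuming the 200-feature input and the X_train/y_train split above; the hidden-layer size and optimizer are our assumptions, not settings reported in the paper.

from tensorflow import keras

# Sequential model: exactly one input tensor and one output tensor
model = keras.Sequential([
    keras.Input(shape=(200,)),                    # 200 numeric features
    keras.layers.Dense(64, activation="relu"),    # assumed hidden layer
    keras.layers.Dense(1, activation="sigmoid"),  # binary transaction target
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop when the error plateaus, mirroring the early stop described above
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
model.fit(X_train, y_train, epochs=50, validation_split=0.1, callbacks=[early_stop])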
VI. CONCLUSION
In this paper, we have used the LightGBM model on the Santander customer transaction dataset from Kaggle. After experimenting with different values of the learning rate, we obtain an accuracy of 89.96%. The results show that LightGBM with feature-engineered data is one of the best suited methods for this classification dataset.
VII. FUTURE WORK
The project can be further enhanced in several
methodologies. Some of them are by using parallel processing
along with the LightGBM Algorithm, selecting important
features and then modelling them, using stratified folding for
train and test splits, taking a try for XGBoost for faster speeds.
.

REFERENCES

[1] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.

[2] Stephen Tyree, Kilian Q. Weinberger, Kunal Agrawal, and Jennifer Paykin. Parallel boosted regression trees for web search ranking. In Proceedings of the 20th International Conference on World Wide Web, pages 387–396. ACM, 2011.

[3] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

[4] Greg Ridgeway. Generalized boosted models: A guide to the gbm package. Update, 1(1):2007, 2007.

[5] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, 2017.
