BANK MARKETING
DATA MINING GROUP PROJECT
Group7
Arun Kumar Narayana Murthy
Manikandan Sundarapandian
Muthu Kannan Subramaniam
Sathya Narayanan Manivannan
Sourabh Mahajan
Contents
INTRODUCTION
Domain description
Problem statement
DATA SET
Description
PRE-PROCESSING STEPS
Data Cleaning
Missing Values
Duplicate Values
Class Imbalance
Removing Outliers (Skewed Data)
Scaling Data
Observation
DATA VISUALIZATION
EXPERIMENT DESIGN
ALGORITHMS USED
Naïve Bayes Classifier
Trees – J48
PART-algorithm
Experimental results
CONSOLIDATED RESULTS
Confusion Matrix
RELATIVE PERFORMANCE OF ALGORITHMS
FALSE PREDICTORS
Testing the dataset without False Predictors
Tree Visualization for J48 algorithm
Confusion Matrix after removing false predictors
Receiver Operating Characteristics (ROC) Curve
CONCLUSION
REFERENCES
APPENDIX
INTRODUCTION
Data mining is a powerful technique that supports decision making based on the available data and also helps predict changes or outcomes that might occur in the future. It can surface useful predictions that usually cannot be obtained from graphical reporting alone.
Classification is a data mining technique that assigns items to a predefined set of classes. In this project, we apply the classification technique to predict the ‘Subscription’ attribute based on the other relevant fields.
This paper evaluates three different algorithms using the WEKA tool. The collected data enables better strategies for finding customers who are likely to subscribe to a term deposit.
Domain description
Bank marketing campaigns depend on large volumes of electronic customer data. Identifying customers who are more likely to respond to new offers is a central problem in direct marketing, and with such large datasets it is impractical for analysts to extract interesting information about customers by hand. Data mining has therefore been used extensively in direct marketing to identify potential customers for a new offer (target selection).
The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (yes or no).
Problem statement
The requirement is to predict the value of the Subscription attribute based on Job Type, Client Marital Status, Education, Bank Balance, Housing Loan, Personal Loan, Contact, Last Contact Day, Last Contact Month, Last Contact Duration, Current Campaigns, Days Passed, Previous Campaigns and Previous Outcome.
DATA SET
Description
The dataset includes a total of 61,079 rows and 16 attributes of both nominal and numeric types: Job Type, Client Marital Status, Education, Bank Balance, Housing Loan, Personal Loan, Contact, Last Contact Day, Last Contact Month, Last Contact Duration, Current Campaigns, Days Passed, Previous Campaigns, Previous Outcome and the Subscription Status of several customers, whose information was extracted from a Portuguese banking institution.
The dataset was obtained from the UCI Machine Learning Repository:
http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
Attribute Description

| Attribute | Description | Values |
| --- | --- | --- |
| Age | Age of the client | Numeric |
| Job | Type of job of the client | blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown |
| Marital Status | Marital status of the client | divorced, married, single, unknown |
| Education | Educational qualification of the client | basic 4y, basic 6y, basic 9y, high school, illiterate, professional course, university degree, unknown |
| Bank Balance | Bank balance of the client | no, yes, unknown |
| Housing Loan | Whether the client has a housing loan | no, yes, unknown |
| Personal Loan | Whether the client has a personal loan | no, yes |
| Contact | Contact communication type | cellular, telephone |
| Last Contact Day | Client's last contact day of the week | Mon, Tue, Wed, Thu, Fri |
| Last Contact Month | Client's last contact month of the year | Jan, Feb, Mar … Nov, Dec |
| Last Contact Duration | Client's last contact duration, in seconds | Numeric (call duration in seconds) |
| Current Campaigns | Number of contacts performed during this campaign and for this client | Numeric |
| Days Passed | Number of days that passed after the client was last contacted in a previous campaign | Numeric (999 means the client was not previously contacted) |
| Previous Campaigns | Number of contacts performed before this campaign and for this client | Numeric |
| Previous Outcome | Outcome of the previous marketing campaign | failure, nonexistent, success, unknown |
| Subscription | Whether the client has subscribed to a term deposit | yes, no |
PRE-PROCESSING STEPS
Quality data yields quality decisions. Data preprocessing transforms the data into a format that can be processed more easily and effectively by the algorithm. Real-world data are incomplete, noisy, and inconsistent: some attributes are false predictors, and others have missing values, noise, errors and other data discrepancies.
Data Cleaning
Data cleaning is a process used to determine inaccurate, incomplete, or unreasonable
data and then improve the quality through correcting detected errors and omissions. Raw data
is highly susceptible to noise, missing values and inconsistency. The quality of data affects the
data mining results. To help improve the quality of data and consequently of mining results,
raw data is pre-processed so as to improve the efficiency and ease of mining process. Data pre-
processing is one of the most critical steps in a data mining process which deals with
preparation and transformation of the initial data set.
Missing Values
Real-world data tends to be incomplete, noisy and inconsistent. An important preprocessing task is to fill in the missing values, smooth out noise and correct inconsistencies.
Some of the steps to handle Missing Values are as follows:
o Ignore the data row
o Use a global constant to fill in for missing values
o Use attribute mean for all samples belonging to the same class
o Use a data mining algorithm to predict the most probable value
In our dataset, 105 of the 61,079 instances had a missing value in one or more attributes, represented by a “?”. These missing values were replaced using the unsupervised ReplaceMissingValues attribute filter in WEKA's ‘Filters’ panel.
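As an illustration of the mean/mode replacement strategy, here is a minimal sketch (a hypothetical helper in Python, not the WEKA filter itself): numeric columns get the attribute mean, nominal columns the most frequent value.

```python
from collections import Counter

def impute_missing(rows, numeric_cols, missing="?"):
    """Replace '?' with the column mean (numeric) or mode (nominal),
    mirroring the idea behind WEKA's unsupervised ReplaceMissingValues."""
    cols = list(zip(*rows))
    filled = []
    for j, col in enumerate(cols):
        present = [v for v in col if v != missing]
        if j in numeric_cols:
            fill = sum(float(v) for v in present) / len(present)
            filled.append([float(v) if v != missing else fill for v in col])
        else:
            fill = Counter(present).most_common(1)[0][0]  # mode
            filled.append([v if v != missing else fill for v in col])
    return [list(r) for r in zip(*filled)]
```

For example, a missing age in `[["25", "married"], ["?", "single"], ["35", "?"]]` would be filled with the mean 30.0, and the missing marital status with the most frequent observed value.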
Duplicate Values
The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical mistakes and different representations of the same logical value. Hence, another important aspect of data cleaning was to check for duplicate values in the dataset. No duplicate values were detected in our Bank Marketing dataset.
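Checking for exact duplicates can be sketched as a simple first-occurrence filter (a hypothetical helper, not what WEKA runs internally):

```python
def drop_duplicates(rows):
    """Remove exact duplicate rows while preserving first occurrences."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)          # rows become hashable keys
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

If the output has the same length as the input, as with our dataset, no duplicates were present.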
Class Imbalance
Learning from imbalanced data has been studied actively for about two decades in
machine learning. A vast number of techniques have been tried, with varying results and few
clear answers. Data scientists facing this problem have no definite answer since it entirely
depends on the data.
Let us see some of the useful approaches to handle Class Imbalance.
 Do nothing. Sometimes you get lucky and nothing needs to be done. You can train on
the so-called natural (or stratified) distribution and sometimes it works without need for
modification.
 Balance the training set in some way (e.g., SMOTE):
o Oversample the minority class.
o Undersample the majority class.
o Synthesize new minority-class instances.
 Throw away minority examples and switch to an anomaly detection framework.
 Construct an entirely new algorithm to perform well on imbalanced data.
We have used SMOTE technique to balance our class attribute.
In our dataset the class attribute is heavily imbalanced. Applying an algorithm to such a dataset may build models that are biased towards one value of the class variable. Hence, we chose the SMOTE filter (i.e., oversampling the minority class) to handle the class-imbalance problem.
What is SMOTE? What does it do?
 Resamples a dataset by applying the Synthetic Minority Oversampling Technique (SMOTE).
 SMOTE oversamples the minority class by adding synthetic minority-class instances. Alternatively, down-sampling (under-sampling) the majority class could also rectify the imbalance.
The SMOTE filter can be applied from WEKA's Filters panel (supervised > instance > SMOTE).
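The core SMOTE idea, synthesizing new minority instances by interpolating between a minority sample and one of its k nearest minority neighbours, can be sketched as follows (a simplified stand-in for WEKA's filter, assuming purely numeric attributes):

```python
import math
import random

def smote(minority, n_new, k=5, seed=7):
    """Create n_new synthetic minority samples: pick a sample, pick one of
    its k nearest minority neighbours, and interpolate a random point
    between the two."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: math.dist(x, m))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # how far to move from x towards the neighbour
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic
```

Because each synthetic point lies on a line segment between two real minority points, the new instances stay inside the region the minority class already occupies.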
After applying the SMOTE filter to our dataset, the class attribute is much more balanced.
Removing Outliers (Skewed Data):
Outliers do not follow the pattern of the majority of the data. Such observations need to
be set apart at the onset of any analysis simply because their distance from the bulk of the data
ensures that they will exert a disproportionate pull on any model fitted by maximum likelihood.
Furthermore, detecting outliers is a statistical procedure with a well-defined objective
and whose efficacy can be measured. It is also important to point out that no matter how they
are identified, the outliers of a group of suspect observations can be assessed simply by
measuring their influence on a non-robust fit.
While preprocessing our dataset, positively skewed outliers were removed to check for improvement in the accuracy of the classifiers. Some attributes had skewed data, such as one whose values cluster between 0 and 1,735 even though the range extends to 4,000.
List of Attributes with skewed data:
• Bank Balance
• Last Contact Duration
• Current Campaigns
• Days Passed
• Previous Campaigns
Scaling Data
To handle this problem, we scaled the data by modifying the range for the one of the attributes
from (0-1735).
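The rescaling step can be sketched as a min-max transformation into the target range (a hypothetical helper illustrating the idea, not WEKA's normalization filter):

```python
def rescale(values, new_min=0.0, new_max=1735.0):
    """Min-max scale a numeric attribute into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]
```

For example, an attribute spanning 0–4,000 maps linearly onto 0–1,735, so the midpoint 2,000 becomes 867.5.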
After changing the scale, we compared the runs between Original data and scaled data.
Observation
Though the algorithms took less time to produce results, there was little difference in accuracy between the original and scaled data. Hence, we decided to build models on the original skewed dataset to which SMOTE had been applied.
DATA VISUALIZATION
Data visualization is a general term for any effort to help people understand the significance of data by placing it in a visual context. Patterns, trends and correlations that might go undetected in text-based data can be recognized easily with data visualization software.
Most business intelligence software vendors embed data visualization tools into their
products, either developing the visualization technology themselves or sourcing it from
companies that specialize in visualization.
The following picture shows a visual representation of the banking dataset that we have
chosen:
EXPERIMENT DESIGN
Experiment design describes how we test our hypothesis: it refers to how runs are allocated to the different combinations of conditions in an experiment. The most common design divides subjects into two groups, the experimental group and the control group, and then introduces a change to the experimental group but not to the control group. One member of each matched pair is randomly assigned to the experimental group and the other to the control group.
In our dataset, following factors are considered for Experimental Design:
 Noise (0%, 10%).
 Size of the Training set (10/90, 80/20).
Now, let us define the factors as
F1 – size of the training set
F2 – noise
and their levels as
F11 – 10% training set
F12 – 80% training set
F21 – 0% noise
F22 – 10% noise
Here, counterbalancing is applied to these factors, and the following four cells are created in the experimental design:
C1 – 0% noise, 10% training set
C2 – 0% noise, 80% training set
C3 – 10% noise, 10% training set
C4 – 10% noise, 80% training set
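Crossing the two factors mechanically reproduces the four cells above; a small sketch:

```python
from itertools import product

# Factor levels from the design above.
noise_levels = ["0% noise", "10% noise"]                    # F21, F22
training_levels = ["10% training set", "80% training set"]  # F11, F12

# Crossing the levels yields the four experimental cells C1..C4.
cells = {f"C{i}": (n, t)
         for i, (n, t) in enumerate(product(noise_levels, training_levels),
                                    start=1)}
```

Each algorithm is then run in every cell, so no combination of noise level and training-set size is left untested.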
ALGORITHMS USED
We wanted to cover probability-based, tree-based and rule-based classifiers. Since noise is handled well by the Naïve Bayes classifier, we chose it in the probability-based category. In the tree-based category we chose the J48 algorithm, since it gives good accuracy and its results are easily compared with other algorithms. We chose the PART algorithm because it was new to all of us, and testing the dataset on an unfamiliar algorithm offered additional learning beyond the familiar ones. Here are the algorithms/classifiers we selected:
 Naïve Bayes Classifier
 J48
 Part
Naïve Bayes Classifier:
Bayes' rule is applied to calculate the posterior from the prior and the likelihood, because the latter two are generally easier to calculate from a probability model. We chose to work with the Naïve Bayes classifier under this method. The method is called Naïve Bayes because it is based on Bayes' rule and “naïvely” assumes independence: it is only valid to multiply probabilities when the events are independent.
One of the nice things about Naïve Bayes is that missing values are no problem at all. If a
value is missing in a training instance, it is simply not included in the frequency counts, and the
probability ratios are based on the number of values that occur rather than on the total number
of instances. Numeric values are usually handled by assuming that they have a “normal”
probability distribution. The advantages of Naïve Bayes are that it only requires a small amount
of training data to estimate the parameters necessary for classification. Because independent
variables are assumed, only the variables for each class need to be determined and not the
entire covariance matrix.
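The counting scheme described above, class priors times per-attribute frequency ratios, with missing “?” values simply skipped, can be sketched in a few lines (a simplified categorical Naïve Bayes with Laplace smoothing, not WEKA's implementation):

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Categorical Naive Bayes with Laplace smoothing; missing '?'
    values are simply left out of the frequency counts."""
    def fit(self, X, y):
        self.priors = Counter(y)
        self.counts = defaultdict(Counter)   # (class, attr index) -> value counts
        for row, label in zip(X, y):
            for j, v in enumerate(row):
                if v != "?":
                    self.counts[(label, j)][v] += 1
        return self

    def predict(self, row):
        best, best_p = None, -1.0
        total = sum(self.priors.values())
        for label, prior in self.priors.items():
            p = prior / total                # class prior
            for j, v in enumerate(row):
                if v == "?":
                    continue                 # missing values are no problem
                c = self.counts[(label, j)]
                # Laplace-smoothed conditional probability P(value | class)
                p *= (c[v] + 1) / (sum(c.values()) + len(c) + 1)
            if p > best_p:
                best, best_p = label, p
        return best
```

Note how a “?” in the query row simply skips that factor, exactly the behaviour described above for missing values.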
Trees – J48
A decision tree is a predictive machine-learning model that decides the target value
(dependent variable) of a new sample based on various attribute values of the available data.
The internal nodes of a decision tree denote the different attributes, the branches between the
nodes tell us the possible values that these attributes can have in the observed samples, while
the terminal nodes tell us the final value (classification) of the dependent variable.
The attribute that is to be predicted is known as the dependent variable, since its value depends
upon, or is decided by, the values of all the other attributes. The other attributes, which help in
predicting the value of the dependent variable, are known as the independent variables in the
dataset.
The J48 decision tree classifier follows a simple algorithm. To classify a new item, it first creates a decision tree from the attribute values of the available training data. Whenever it encounters a training set, it identifies the attribute that discriminates the various instances most clearly; the feature that tells us most about the data instances, so that we can classify them best, is said to have the highest information gain. Among the possible values of this feature, if there is a value for which there is no ambiguity, that is, for which the data instances falling within its category all have the same value for the target variable, then we terminate that branch and assign to it the target value that we have obtained.
For the other cases, we then look for another attribute that gives us the highest
information gain. Hence, we continue in this manner until we either get a clear decision of what
combination of attributes gives us a particular target value, or we run out of attributes. If we run
out of attributes, or if we cannot get an unambiguous result from the available information, we
assign this branch a target value that the majority of the items under this branch possess.
Now that we have the decision tree, we follow the order of attribute selection as we have
obtained for the tree. By checking all the respective attributes and their values with those seen
in the decision tree model, we can assign or predict the target value of this new instance.
J48 can work with both continuous and discrete data. It does this by specifying ranges or
thresholds for continuous data thus turning continuous data into discrete data.
J48 is well known for its capability of building high accuracy models.
The best selling point of decision trees is their ease of interpretation and explanation. They are also quite fast and popular, and their output is human-readable.
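The attribute-selection criterion described above can be sketched directly: information gain is the drop in class entropy after splitting on an attribute (a minimal sketch of the quantity J48/C4.5 computes, ignoring C4.5's gain-ratio correction):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in class entropy from splitting on one attribute,
    the criterion J48 uses to pick each split of the tree."""
    total = entropy(labels)
    n = len(labels)
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, []).append(y)
    # Weighted entropy remaining after the split.
    remainder = sum(len(part) / n * entropy(part) for part in by_value.values())
    return total - remainder
```

An attribute that separates the classes perfectly scores the full entropy of the class (1 bit for a balanced binary class), while an attribute unrelated to the class scores 0.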
PART-algorithm
PART is a rule-based classifier built on partial decision trees. It uses a separate-and-conquer method: it builds a partial C4.5 (J48) decision tree in each iteration and makes the “best” leaf into a rule, removing the instances that the rule covers before the next iteration. In each iteration, a rule is thus derived from a pre-pruned partial decision tree.
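The separate-and-conquer loop can be illustrated with a drastically simplified sketch: instead of building a partial C4.5 tree, each iteration here picks a single (attribute, value) test whose covered instances are purest, emits it as a rule, and removes the covered instances. This is the same remove-and-repeat control structure PART uses, not PART itself.

```python
from collections import Counter

def learn_rules(X, y):
    """Toy separate-and-conquer learner: one (attribute, value) test per
    rule, scored by purity then coverage; covered rows are removed."""
    rules, data = [], list(zip(X, y))
    while data:
        best = None
        for j in range(len(data[0][0])):
            for v in {row[j] for row, _ in data}:
                covered = [lab for row, lab in data if row[j] == v]
                label, hits = Counter(covered).most_common(1)[0]
                score = (hits / len(covered), len(covered))
                if best is None or score > best[0]:
                    best = (score, j, v, label)
        _, j, v, label = best
        rules.append((j, v, label))          # "if attr j == v then label"
        data = [(row, lab) for row, lab in data if row[j] != v]
    return rules
```

Each pass shrinks the remaining data, so the loop terminates with an ordered rule list that covers every training instance.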
Experimental results
Naïve Bayes Algorithm:
J48 Algorithm:
The top attribute for the J48 algorithm was the ‘Last Contact Duration’ attribute.
Part Algorithm:
CONSOLIDATED RESULTS

| Algorithm / Cell | C1 | C2 | C3 | C4 |
| --- | --- | --- | --- | --- |
| Naïve Bayes | 81.69037 % | 82.21676 % | 74.69219 % | 75.33154 % |
| J48 | 86.55393 % | 89.82483 % | 78.20231 % | 81.11085 % |
| PART | 87.36565 % | 90.24394 % | 75.02638 % | 78.54535 % |

The above table shows the consolidated results for each algorithm. The values displayed are the averages over ten runs, compiled for the four cells of the experimental design.
Confusion Matrix
The confusion matrix shows the number of correctly classified and wrongly classified instances for each algorithm used in this project.
RELATIVE PERFORMANCE OF ALGORITHMS
Performance of Classifiers under Noise
pg. 17
Performance of classifiers under varied training set split
FALSE PREDICTORS
False predictors are attributes that misdirect the working logic of an algorithm. They sometimes increase the measured accuracy of the algorithm, but they are misleading. The following attributes were determined to be false predictors.
• Last Contact Duration – gives the duration of the call between the bank and the customer. Ideally, potential customers should be predicted before the sales call is made, so this information would not be available at prediction time.
• Last Contact Day – gives the day of the week of the call between the bank and the customer. As discussed above, in an ideal scenario prediction happens before the call, so this attribute is a false predictor.
• Last Contact Month – gives the month of the year of the call between the bank and the customer. As discussed above, prediction should happen before the call, so this attribute is a false predictor.
• Current Campaigns – gives information about current campaigns in the bank. In data mining, prediction is usually based on past data, so previous-campaign data is relevant but the current campaign is a false predictor.
Testing the dataset without False Predictors:
The selected three algorithms were applied to the dataset again, using cross-validation, after removing the false predictors. The following results were obtained.

| Algorithm | 0% Noise, 10 folds | 10% Noise, 10 folds |
| --- | --- | --- |
| Naïve Bayes | 73.0263 | 68.0458 |
| J48 | 81.1290 | 73.4454 |
| PART | 85.1289 | 73.9563 |

We compare these results with C2 (0% noise, 80% training set) and C4 (10% noise, 80% training set), because C2 and C4 use the larger training split (80% instead of 10%), which makes their accuracies the most relevant comparison.
On comparison, none of the values in the above table is higher than the earlier results. Nevertheless, data mining should ideally be done without false predictors, so the above values are the more trustworthy accuracies.
Tree Visualization for J48 algorithm:
Confusion Matrix after removing false predictors:
Class 1: 0% noise, 10 folds. Class 2: 10% noise, 10 folds. Columns: a = predicted No, b = predicted Yes; rows: actual No, actual Yes.

| Algorithm | Class 1 (a, b) | Class 2 (a, b) |
| --- | --- | --- |
| J48, actual No | 36024, 3898 | 32909, 5137 |
| J48, actual Yes | 7628, 13528 | 11082, 11950 |
| PART, actual No | 37352, 2570 | 32862, 5184 |
| PART, actual Yes | 6513, 14643 | 10723, 12309 |
| Naïve Bayes, actual No | 32848, 7074 | 30016, 8030 |
| Naïve Bayes, actual Yes | 9401, 11755 | 11487, 11545 |
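From any of the 2×2 matrices above, the headline accuracy and the TPR/FPR used later for the ROC curves follow directly; for example, the J48 Class 1 matrix reproduces the 81.129% reported earlier:

```python
def matrix_metrics(tn, fp, fn, tp):
    """Accuracy, TPR and FPR from a 2x2 confusion matrix
    (rows = actual No/Yes, columns = predicted a=No / b=Yes)."""
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    tpr = tp / (tp + fn)   # sensitivity / recall
    fpr = fp / (fp + tn)   # fall-out
    return accuracy, tpr, fpr

# J48 at 0% noise and 10 folds (Class 1 above).
acc, tpr, fpr = matrix_metrics(tn=36024, fp=3898, fn=7628, tp=13528)
```

Here `acc` comes out at about 0.81129, matching the J48 accuracy reported for the dataset without false predictors.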
Receiver Operating Characteristics (ROC) Curve
The Receiver Operating Characteristic (ROC) curve is plotted between TPR and FPR: the true positive rate on the y-axis against the false positive rate on the x-axis. The area between the diagonal (threshold line) and the curve contributes to the AUC (Area Under the Curve), and the TPR/FPR points falling in this region indicate the accuracy of the algorithm.
Inferences from algorithm runs:
TPR (True Positive Rate) – the proportion of correct positive results among all true positive and false negative instances. It is also known as sensitivity or recall.
FPR (False Positive Rate) – the proportion of incorrect positive results among all false positive and true negative instances. The false positive rate is also known as the fall-out.
The ROC curves below were plotted for each algorithm with cross-validation testing and 10 folds.
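The ROC construction can be sketched directly: sort instances by the classifier's score for the positive class, sweep the decision threshold, and accumulate TPR/FPR points; the trapezoid rule then gives the AUC. A minimal sketch:

```python
def roc_points(scores, labels):
    """TPR/FPR pairs obtained by sweeping the decision threshold over
    the predicted scores for the positive class (labels are 0/1)."""
    pairs = sorted(zip(scores, labels), reverse=True)  # high scores first
    p = sum(labels)
    n = len(labels) - p
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for score, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        pts.append((fp / n, tp / p))   # (FPR, TPR) at this threshold
    return pts

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A classifier that ranks every positive above every negative gets AUC 1.0; one that ranks them all backwards gets 0.0, and random ranking hovers around 0.5 (the diagonal).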
ROC for Naïve Bayes:
0% Noise (area under curve: 0.7806)
10% Noise (area under curve: 0.7135)
ROC for J48:
0% Noise (area under curve: 0.8479)
10% Noise (area under curve: 0.7465)
ROC for PART:
0% Noise (area under curve: 0.8774)
10% Noise (area under curve: 0.7601)
CONCLUSION
 After removing the false predictors, the top predictor for the J48 algorithm is the ‘Contact’ attribute.
 Based on the accuracy obtained from each algorithm and the Area Under the Curve (AUC), we can conclude that the PART algorithm gives the best accuracy for our dataset.
 However, the Naïve Bayes algorithm performs most consistently in the presence of noise.
| Algorithm | Correctly Classified Instances | Incorrectly Classified Instances |
| --- | --- | --- |
| Naïve Bayes | 41561 | 19517 |
| J48 | 44859 | 16219 |
| PART | 45171 | 15907 |
REFERENCES
[1] Ian H. Witten, Eibe Frank and Mark A. Hall – Data Mining: Practical Machine Learning Tools and Techniques
[2] Data pre-processing: http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMining-3.html
[3] Data cleaning and data pre-processing: http://www.mimuw.edu.pl/~son/datamining/DM/4-preprocess.pdf
[4] Data pre-processing: http://searchsqlserver.techtarget.com/definition/data-preprocessing
[5] Algorithms used: http://www.d.umn.edu/~padhy005/Chapter5.html and http://www.ec.tuwien.ac.at/~dieter/research/publications/sac2006.pdf
APPENDIX
Since we used the SMOTE option on our dataset, we also compared three algorithms (J48, Decision Table, Naïve Bayes) across the class-imbalance and noise factors.
F1 – Class Imbalance, F2 – Noise
F11 – Original, F12 – SMOTE, F21 – 0% Noise, F22 – 10% Noise
C1 – 0% noise, original dataset
C2 – 0% noise, SMOTE dataset
C3 – 10% noise, original dataset
C4 – 10% noise, SMOTE dataset

| Algorithm / Cell | C1 | C2 | C3 | C4 |
| --- | --- | --- | --- | --- |
| Naïve Bayes | 88.0693% | 82.2424% | 80.5291% | 75.4331% |
| J48 | 86.1251% | 94.4055% | 81.6505% | 86.3928% |
| Decision Table | 78.3261% | 85.787% | 73.2919% | 82.3357% |
Class Imbalance Versus Accuracy:
Noise Percentage Versus Accuracy:
Noise Vs SMOTE in Weka
 When we add noise to the dataset, the values of only one attribute are changed; for example, the value of the class variable is changed from ‘yes’ to ‘no’. That is why this option is listed below ‘Attribute’ in Filters.
 The SMOTE option, in contrast, adds additional instances by oversampling the minority class. Unlike noise, SMOTE simulates records that have the minority class, which is why SMOTE is listed below ‘Instance’ in Filters.
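The noise idea, flipping a percentage of class values at random, can be sketched as follows (a hypothetical helper illustrating the concept, not WEKA's AddNoise filter):

```python
import random

def add_class_noise(labels, pct, seed=1):
    """Flip roughly pct% of binary class labels at random."""
    rng = random.Random(seed)
    flip = {"yes": "no", "no": "yes"}
    return [flip[l] if rng.random() < pct / 100 else l for l in labels]
```

Notice that the number of instances never changes, only attribute values do, which matches why noise lives under ‘Attribute’ filters while SMOTE lives under ‘Instance’ filters.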
Naïve Bayes and SMOTE
Why is there a decrease in accuracy for Naïve Bayes when SMOTE is used on the dataset?
Naïve Bayes assumes that all attributes are independent of each other. The number of ‘yes’ instances in the dataset remains constant, but the number of ‘no’ values in the class variable is increased by SMOTE, which in turn increases the total number of instances. This reduces the prior probability of ‘yes’, so when predicting, an instance has a greater chance of being classified as ‘no’ than before.
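A toy calculation (with assumed, illustrative counts, not our dataset's exact figures) makes the prior shift concrete:

```python
# Hypothetical class counts before oversampling (illustrative only).
counts = {"yes": 7000, "no": 54000}
prior_yes_before = counts["yes"] / sum(counts.values())

# Oversampling one class value grows the total instance count,
# shrinking the prior of the other class.
counts["no"] += 20000
prior_yes_after = counts["yes"] / sum(counts.values())
```

With these numbers P(‘yes’) drops from roughly 0.115 to 0.086, so every Naïve Bayes prediction is pulled towards ‘no’.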
Why does accuracy increase when the training-set percentage is increased?
When the training set has very few records (e.g. 10%), the algorithms build a model from these records, which cover only a few of the scenarios that may occur in the test set. When the size of the training set is increased (e.g. 80%), the model is built on more records and covers more scenarios, so the algorithms achieve higher accuracy with the larger training set.
Performance of the algorithms in the presence of noise
Why do J48 and PART have low accuracy in the presence of noise?
The Naïve Bayes algorithm works on probabilities, so the presence of noise has less effect on its accuracy. J48 and PART, in contrast, work on rules created from decision trees. While building a model on the training set, these algorithms generate rules that also cover the noise, and those rules then predict the wrong output on the test set, resulting in much lower accuracy in the presence of noise.
Performance of Naïve Bayes in the presence of noise
The graph shows that when the noise percentage was increased, the performance of both Naïve Bayes and J48 decreased. However, the accuracy of Naïve Bayes dropped from 82.11% to 55.63%, which is comparatively better than J48 (which dropped from 89.49% to 54.17%).
 
Assessing Component based ERP Architecture for Developing Organizations
Assessing Component based ERP Architecture for Developing OrganizationsAssessing Component based ERP Architecture for Developing Organizations
Assessing Component based ERP Architecture for Developing Organizations
IJCSIS Research Publications
 
HCI - Individual Report for Metrolink App
HCI - Individual Report for Metrolink AppHCI - Individual Report for Metrolink App
HCI - Individual Report for Metrolink App
Darran Mottershead
 
Portuguese Bank - Direct Marketing Campaign
Portuguese Bank - Direct Marketing CampaignPortuguese Bank - Direct Marketing Campaign
Portuguese Bank - Direct Marketing Campaign
Rehan Akhtar
 
HCI - Group Report for Metrolink App
HCI - Group Report for Metrolink AppHCI - Group Report for Metrolink App
HCI - Group Report for Metrolink App
Darran Mottershead
 
An Introduction To Weka
An Introduction To WekaAn Introduction To Weka
An Introduction To Weka
weka Content
 
Data Mining – analyse Bank Marketing Data Set
Data Mining – analyse Bank Marketing Data SetData Mining – analyse Bank Marketing Data Set
Data Mining – analyse Bank Marketing Data Set
Mateusz Brzoska
 
Project 2 Data Mining Part 1
Project 2 Data Mining Part 1Project 2 Data Mining Part 1
Project 2 Data Mining Part 1
rayborg
 
Classifiers for Predicting Wine Quality
Classifiers for Predicting Wine QualityClassifiers for Predicting Wine Quality
Classifiers for Predicting Wine Quality
Laurent Declercq
 

Similar to Group7_Datamining_Project_Report_Final (20)

Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdasBig data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Prof Dr Mehmed ERDAS
 
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdasBig data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Prof Dr Mehmed ERDAS
 
Big Data in Banking (White paper)
Big Data in Banking (White paper)Big Data in Banking (White paper)
Big Data in Banking (White paper)
InData Labs
 
Architecting A Platform For Big Data Analytics
Architecting A Platform For Big Data AnalyticsArchitecting A Platform For Big Data Analytics
Architecting A Platform For Big Data Analytics
Arun Chinnaraju MBA, PMP, CSM, CSPO, SA
 
Customer relationship management in banking sector
Customer relationship management in banking sectorCustomer relationship management in banking sector
Customer relationship management in banking sector
Vivekanandha College of arts and Science for Women (Autonomous)
 
20 ccp using logistic
20 ccp using logistic20 ccp using logistic
20 ccp using logistic
Vrinda Sachdeva
 
Why is Data Science still not a mainstream in corporations - Sasa Radovanovic
Why is Data Science still not a mainstream in corporations - Sasa RadovanovicWhy is Data Science still not a mainstream in corporations - Sasa Radovanovic
Why is Data Science still not a mainstream in corporations - Sasa Radovanovic
Institute of Contemporary Sciences
 
Top 50 B2B Marketing Case Studies of 2012
Top 50 B2B Marketing Case Studies of 2012Top 50 B2B Marketing Case Studies of 2012
Top 50 B2B Marketing Case Studies of 2012
BtoB Online
 
Datasets using R-StudioUsha Rani Singh.docx
Datasets using R-StudioUsha Rani Singh.docxDatasets using R-StudioUsha Rani Singh.docx
Datasets using R-StudioUsha Rani Singh.docx
edwardmarivel
 
Team 8 Business Plan
Team 8 Business PlanTeam 8 Business Plan
Team 8 Business Plan
Emma Morgan
 
Day 1 (Lecture 2): Business Analytics
Day 1 (Lecture 2): Business AnalyticsDay 1 (Lecture 2): Business Analytics
Day 1 (Lecture 2): Business Analytics
Aseda Owusua Addai-Deseh
 
Ch7-Overview of data Science-part 2 - Copy.pptx
Ch7-Overview of data Science-part 2 - Copy.pptxCh7-Overview of data Science-part 2 - Copy.pptx
Ch7-Overview of data Science-part 2 - Copy.pptx
HaneenSabbhin
 
Project Report - Acquisition Credit Scoring Model
Project Report - Acquisition Credit Scoring ModelProject Report - Acquisition Credit Scoring Model
Project Report - Acquisition Credit Scoring Model
Subhasis Mishra
 
Star cement [www.writekraft.com]
Star cement [www.writekraft.com] Star cement [www.writekraft.com]
Star cement [www.writekraft.com]
WriteKraft Dissertations
 
Data Mining in Life Insurance Business
Data Mining in Life Insurance BusinessData Mining in Life Insurance Business
Data Mining in Life Insurance Business
Ankur Khanna
 
Future of Tracking: Transforming how we do it not what we do
Future of Tracking: Transforming how we do it not what we doFuture of Tracking: Transforming how we do it not what we do
Future of Tracking: Transforming how we do it not what we do
Kantar
 
Forrester’s study: Discover How Marketing Analytics Increases Business Perfor...
Forrester’s study: Discover How Marketing Analytics Increases Business Perfor...Forrester’s study: Discover How Marketing Analytics Increases Business Perfor...
Forrester’s study: Discover How Marketing Analytics Increases Business Perfor...
Nicolas Valenzuela
 
Data MiningData MiningData MiningData Mining
Data MiningData MiningData MiningData MiningData MiningData MiningData MiningData Mining
Data MiningData MiningData MiningData Mining
abdulraqeebalareqi1
 
How to Achieve World Class Customer Experience through Insightful IT
How to Achieve World Class Customer Experience through Insightful IT How to Achieve World Class Customer Experience through Insightful IT
How to Achieve World Class Customer Experience through Insightful IT
Bobhallahan
 
CUSTOMER FEEDBACK MANAGEMENT
CUSTOMER FEEDBACK MANAGEMENTCUSTOMER FEEDBACK MANAGEMENT
CUSTOMER FEEDBACK MANAGEMENT
George Krasadakis
 
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdasBig data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Prof Dr Mehmed ERDAS
 
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdasBig data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Prof Dr Mehmed ERDAS
 
Big Data in Banking (White paper)
Big Data in Banking (White paper)Big Data in Banking (White paper)
Big Data in Banking (White paper)
InData Labs
 
Why is Data Science still not a mainstream in corporations - Sasa Radovanovic
Why is Data Science still not a mainstream in corporations - Sasa RadovanovicWhy is Data Science still not a mainstream in corporations - Sasa Radovanovic
Why is Data Science still not a mainstream in corporations - Sasa Radovanovic
Institute of Contemporary Sciences
 
Top 50 B2B Marketing Case Studies of 2012
Top 50 B2B Marketing Case Studies of 2012Top 50 B2B Marketing Case Studies of 2012
Top 50 B2B Marketing Case Studies of 2012
BtoB Online
 
Datasets using R-StudioUsha Rani Singh.docx
Datasets using R-StudioUsha Rani Singh.docxDatasets using R-StudioUsha Rani Singh.docx
Datasets using R-StudioUsha Rani Singh.docx
edwardmarivel
 
Team 8 Business Plan
Team 8 Business PlanTeam 8 Business Plan
Team 8 Business Plan
Emma Morgan
 
Ch7-Overview of data Science-part 2 - Copy.pptx
Ch7-Overview of data Science-part 2 - Copy.pptxCh7-Overview of data Science-part 2 - Copy.pptx
Ch7-Overview of data Science-part 2 - Copy.pptx
HaneenSabbhin
 
Project Report - Acquisition Credit Scoring Model
Project Report - Acquisition Credit Scoring ModelProject Report - Acquisition Credit Scoring Model
Project Report - Acquisition Credit Scoring Model
Subhasis Mishra
 
Data Mining in Life Insurance Business
Data Mining in Life Insurance BusinessData Mining in Life Insurance Business
Data Mining in Life Insurance Business
Ankur Khanna
 
Future of Tracking: Transforming how we do it not what we do
Future of Tracking: Transforming how we do it not what we doFuture of Tracking: Transforming how we do it not what we do
Future of Tracking: Transforming how we do it not what we do
Kantar
 
Forrester’s study: Discover How Marketing Analytics Increases Business Perfor...
Forrester’s study: Discover How Marketing Analytics Increases Business Perfor...Forrester’s study: Discover How Marketing Analytics Increases Business Perfor...
Forrester’s study: Discover How Marketing Analytics Increases Business Perfor...
Nicolas Valenzuela
 
Data MiningData MiningData MiningData Mining
Data MiningData MiningData MiningData MiningData MiningData MiningData MiningData Mining
Data MiningData MiningData MiningData Mining
abdulraqeebalareqi1
 
How to Achieve World Class Customer Experience through Insightful IT
How to Achieve World Class Customer Experience through Insightful IT How to Achieve World Class Customer Experience through Insightful IT
How to Achieve World Class Customer Experience through Insightful IT
Bobhallahan
 
CUSTOMER FEEDBACK MANAGEMENT
CUSTOMER FEEDBACK MANAGEMENTCUSTOMER FEEDBACK MANAGEMENT
CUSTOMER FEEDBACK MANAGEMENT
George Krasadakis
 
Ad

Group7_Datamining_Project_Report_Final

  • 1. BANK MARKETING - DATA MINING GROUP PROJECT
Group 7: Arun Kumar Narayana Murthy, Manikandan Sundarapandian, Muthu Kannan Subramaniam, Sathya Narayanan Manivannan, Sourabh Mahajan
  • 2. pg. 1 Contents
INTRODUCTION .... 2
Domain description .... 2
Problem statement .... 2
DATA SET .... 2
Description .... 2
PRE-PROCESSING STEPS .... 3
Data Cleaning .... 4
Missing Values .... 4
Duplicate Values .... 5
Class Imbalance .... 5
Removing Outliers (Skewed Data) .... 7
Scaling Data .... 8
Observation .... 9
DATA VISUALIZATION .... 9
EXPERIMENT DESIGN .... 10
ALGORITHMS USED .... 11
Naïve Bayes Classifier .... 12
Trees – J48 .... 12
PART algorithm .... 13
Experimental results .... 13
Naïve Bayes Algorithm .... 13
CONSOLIDATED RESULTS .... 15
Confusion Matrix .... 15
RELATIVE PERFORMANCE OF ALGORITHMS .... 16
FALSE PREDICTORS .... 18
Testing the dataset without False Predictors .... 18
Tree Visualization for J48 algorithm .... 19
Confusion Matrix after removing false predictor .... 19
Receiver Operating Characteristics (ROC) Curve .... 20
CONCLUSION .... 24
REFERENCES .... 24
APPENDIX .... 25
  • 3. pg. 2 INTRODUCTION
The data mining process is a powerful technique that supports decision making based on the available data and also helps predict changes or outcomes that might occur in the future. Data mining can surface useful predictions that usually cannot be obtained from graphical reporting alone. Classification is a data mining technique that assigns items to a predefined set of classes. In this project, we apply classification to predict the 'Subscription' attribute from the other relevant fields. This paper evaluates three different algorithms using the WEKA tool. The collected data will enable better strategies for finding customers who are likely to subscribe to a term deposit.
Domain description
Bank marketing campaigns depend on large volumes of electronic customer data. Identifying customers who are more likely to respond to new offers is an important issue in direct marketing; with so much data, it is impossible for analysts to manually extract interesting information about the customers. In direct marketing, data mining has been used extensively to identify potential customers for a new offer (target selection). The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (values 'yes' or 'no').
Problem statement
The requirement is to predict the value of the Subscription attribute based on Job Type, Client Marital Status, Education, Bank Balance, Housing Loan, Personal Loan, Contact, Last Contact Day, Last Contact Month, Last Contact Duration, Current Campaigns, Days Passed, Previous Campaigns and Previous Outcome.
DATA SET
Description
The dataset includes a total of 61,079 rows and 16 attributes, containing both nominal and numeric attribute types: Job Type, Client Marital Status, Education, Bank Balance, Housing Loan, Personal Loan, Contact, Last Contact Day, Last Contact Month, Last Contact Duration, Current Campaigns, Days Passed, Previous Campaigns, Previous Outcome and
  • 4. pg. 3 Subscription Status of several customers, whose information was extracted from a Portuguese banking institution. The dataset was obtained from the UCI website: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing

Attribute description:
Attribute | Description | Values
Age | Age of the client | Numeric value
Job | Type of job of the client | blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown
Marital Status | Marital status of the client | divorced, married, single, unknown
Education | Educational qualification of the client | basic 4y, basic 6y, basic 9y, high school, illiterate, professional course, university degree, unknown
Bank Balance | Bank balance of the client | no, yes, unknown
Housing Loan | Whether the client has a housing loan | no, yes, unknown
Personal Loan | Whether the client has a personal loan | no, yes
Contact | Contact communication type | cellular, telephone
Last Contact Day | Client's last contact day of the week | Mon, Tue, Wed, Thu, Fri
Last Contact Month | Client's last contact month of the year | Jan, Feb, Mar, ..., Nov, Dec
Last Contact Duration | Client's last contact duration, in seconds | call duration in seconds (numeric)
Current Campaigns | Number of contacts performed during this campaign for this client | numeric data
Days Passed | Number of days since the client was last contacted in a previous campaign | numeric data (999 means the client was not previously contacted)
Previous Campaigns | Number of contacts performed before this campaign for this client | numeric data
Previous Outcome | Outcome of the previous marketing campaign | failure, nonexistent, success, unknown
Subscription | Whether the client has subscribed to a term deposit | yes, no

PRE-PROCESSING STEPS
Quality data provides quality decisions. Data preprocessing transforms the data into a format that can be processed more easily and effectively by the algorithm. Real-world data are
  • 5. pg. 4 incomplete, noisy, and inconsistent. Some attributes are false predictors or have missing values, noise, errors and other discrepancies.
Data Cleaning
Data cleaning is the process of detecting inaccurate, incomplete, or unreasonable data and improving its quality by correcting the detected errors and omissions. Raw data is highly susceptible to noise, missing values and inconsistency, and the quality of the data affects the data mining results. To improve the quality of the data, and consequently of the mining results, raw data is pre-processed to improve the efficiency and ease of the mining process. Data pre-processing is one of the most critical steps in a data mining process, dealing with the preparation and transformation of the initial dataset.
Missing Values
Real-world data tends to be incomplete, noisy and inconsistent. An important preprocessing task is to fill in the missing values, smooth out noise and correct inconsistencies. Some strategies for handling missing values are:
o Ignore the data row
o Use a global constant to fill in missing values
o Use the attribute mean for all samples belonging to the same class
o Use a data mining algorithm to predict the most probable value
In our dataset, 105 of the 61,079 instances had a missing value in at least one attribute. These missing values are represented with a "?" and were replaced using an unsupervised filter from Weka's 'Filters' option.
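The mean/mode replacement strategy listed above can be sketched in a few lines of Python. This is a toy illustration of the idea, not Weka's filter; the attribute names ('age', 'job') and the "?" marker mirror the dataset's convention but the rows are invented:

```python
from statistics import mean
from collections import Counter

def impute(rows, numeric_cols):
    """Replace '?' markers: numeric columns get the column mean,
    nominal columns get the most frequent value (mode)."""
    cols = rows[0].keys()
    fill = {}
    for c in cols:
        observed = [r[c] for r in rows if r[c] != "?"]
        if c in numeric_cols:
            fill[c] = mean(float(v) for v in observed)
        else:
            fill[c] = Counter(observed).most_common(1)[0][0]
    return [{c: (fill[c] if r[c] == "?" else r[c]) for c in cols} for r in rows]

# Toy rows: one numeric attribute ('age'), one nominal attribute ('job')
rows = [
    {"age": "30", "job": "services"},
    {"age": "?",  "job": "management"},
    {"age": "50", "job": "?"},
    {"age": "40", "job": "services"},
]
clean = impute(rows, numeric_cols={"age"})
# The missing age becomes the mean (40.0); the missing job becomes the mode
```

Weka's class-aware variant would compute the mean per class value rather than over the whole column.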
  • 6. pg. 5 Duplicate Values
Detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. Hence, another important aspect of data cleaning was to check for duplicate values; none were detected in our bank marketing dataset.
Class Imbalance
Learning from imbalanced data has been studied actively in machine learning for about two decades. A vast number of techniques have been tried, with varying results and few clear answers; the right choice depends entirely on the data. Some useful approaches to handling class imbalance are:
 Do nothing. Sometimes you get lucky and nothing needs to be done: you can train on the so-called natural (or stratified) distribution and it works without modification.
 Balance the training set in some way (e.g., SMOTE):
o Oversample the minority class.
o Undersample the majority class.
o Synthesize new minority-class examples.
 Throw away minority examples and switch to an anomaly-detection framework.
 Construct an entirely new algorithm that performs well on imbalanced data.
We used the SMOTE technique to balance our class attribute.
  • 7. pg. 6 As we can see from the image above, the class attribute in our dataset has a large imbalance. Applying an algorithm to this dataset might build models that are biased towards one value of the class variable. Hence, we chose the SMOTE filter (i.e., oversampling the minority class) to handle the class imbalance problem.
What is SMOTE? What does it do?
 It resamples a dataset by applying the Synthetic Minority Oversampling Technique (SMOTE).
 SMOTE oversamples the minority class, i.e., adds synthetic minority-class instances. Similarly, downsampling (undersampling) the majority class could also rectify the imbalance.
To use the SMOTE filter in Weka,
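Outside Weka, the core SMOTE idea of synthesizing new minority points by interpolating between a sample and one of its nearest neighbours can be sketched as follows. This is a simplified illustration on made-up 2-D points, not Weka's implementation:

```python
import random

def smote(minority, n_new, k=2, seed=42):
    """Generate n_new synthetic minority samples by interpolating between
    a random minority point and one of its k nearest neighbours (the core
    idea of SMOTE; real implementations add more bookkeeping)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x by squared Euclidean distance (excluding x)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        out.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return out

minority = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.5)]  # invented minority points
synthetic = smote(minority, n_new=5)
# Every synthetic point lies on a segment between two real minority points
```

Because each new point is a convex combination of two existing minority points, the synthetic data stays inside the region the minority class already occupies.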
  • 8. pg. 7 After applying the SMOTE filter to our dataset, the imbalance is corrected and the class attribute has fairly balanced data.
Removing Outliers (Skewed Data):
Outliers do not follow the pattern of the majority of the data. Such observations need to be set apart at the onset of any analysis, simply because their distance from the bulk of the data ensures that they will exert a disproportionate pull on any model fitted by maximum likelihood.
  • 9. pg. 8 Furthermore, detecting outliers is a statistical procedure with a well-defined objective whose efficacy can be measured. It is also important to point out that, no matter how they are identified, a group of suspect observations can be assessed simply by measuring their influence on a non-robust fit. While pre-processing our dataset, outliers from attributes that are positively skewed were removed to check for improvement in classifier accuracy. Some attributes had skewed data, like the one shown below, where most values fall between 0 and 1,735 even though the range goes up to 4,000.
List of attributes with skewed data:
• Bank Balance
• Last Contact Duration
• Current Campaigns
• Days Passed
• Previous Campaigns
Scaling Data
To handle this problem, we scaled the data by rescaling one of the attributes to the range (0-1735).
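Rescaling an attribute to a target range like (0-1735) is ordinary min-max scaling; a minimal sketch (the input values here are hypothetical, not taken from the dataset):

```python
def rescale(values, new_min=0.0, new_max=1735.0):
    """Min-max scale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [new_min + (v - lo) / span * (new_max - new_min) for v in values]

balances = [0, 250, 1735, 4000]  # hypothetical skewed attribute values
scaled = rescale(balances)
# The largest value (4000) maps to 1735; the smallest (0) maps to 0
```

Note that linear rescaling changes the range but not the shape of the distribution, which is consistent with the observation below that accuracy barely changed.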
  • 10. pg. 9 After changing the scale, we compared runs between the original data and the scaled data.
Observation
Though the algorithms took less time to produce results on the scaled data, there was not much difference in accuracy between the original and scaled data. Hence, we decided to build the models on the original skewed dataset, to which SMOTE had been applied.
DATA VISUALIZATION
Data visualization is a general term describing any effort to help people understand the significance of data by placing it in a visual context. Patterns, trends and correlations that might go undetected in text-based data can be recognized easily with data visualization software.
  • 11. pg. 10 Most business intelligence software vendors embed data visualization tools into their products, either developing the visualization technology themselves or sourcing it from companies that specialize in visualization. The following picture shows a visual representation of the banking dataset that we have chosen:
EXPERIMENT DESIGN
Experiment design is the approach we use for testing our hypothesis. It refers to how participants are allocated to the different combinations in an experiment. The most common design divides participants into two groups, an experimental group and a control group, and then introduces a change to the experimental group but not the control group; one member of each matched pair is randomly assigned to the experimental group and the other to the control group. In our dataset, the following factors are considered for the experimental design:
  • 12. pg. 11
 Noise (0%, 10%)
 Size of the training set (10/90, 80/20)
Now, consider the two factors
F1 - size of the training set
F2 - noise
with levels
F11 - 10% training set
F12 - 80% training set
F21 - 0% noise
F22 - 10% noise
Counterbalancing is applied to these factors, giving the following four cells in the experimental design:
C1 - 0% noise, 10% training set
C2 - 0% noise, 80% training set
C3 - 10% noise, 10% training set
C4 - 10% noise, 80% training set
ALGORITHMS USED
We wanted to cover probability-based, tree-based and rule-based classifiers. Since noise is handled effectively by the Naïve Bayes classifier, we chose it in the probability-based category. In the tree-based category we chose the J48 algorithm, since it gives good accuracy and its results are easily compared with other algorithms. We chose the PART algorithm because it was new to all of us, and testing the dataset on an unfamiliar algorithm offered additional learning beyond the familiar ones. The algorithms/classifiers that we selected are:
 Naïve Bayes Classifier
 J48
 PART
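Each cell C1-C4 pairs a noise level with a training-set split. A minimal sketch of how one cell could be constructed, assuming a hypothetical 'subscription' class field (this illustrates the design, not the exact Weka procedure):

```python
import random

def make_cell(rows, noise_pct, train_pct, seed=7):
    """Build one experimental-design cell: flip the class label on
    noise_pct percent of the rows, then split into train/test sets."""
    rng = random.Random(seed)
    rows = [dict(r) for r in rows]  # copy so the original data is untouched
    for r in rng.sample(rows, int(len(rows) * noise_pct / 100)):
        r["subscription"] = "no" if r["subscription"] == "yes" else "yes"
    rng.shuffle(rows)
    cut = int(len(rows) * train_pct / 100)
    return rows[:cut], rows[cut:]

# 100 toy rows with alternating class labels
data = [{"subscription": "yes" if i % 2 else "no"} for i in range(100)]
# Cell C4 above: 10% noise, 80% training split
train, test = make_cell(data, noise_pct=10, train_pct=80)
```

Running this once per cell, with several seeds, yields the repeated runs whose averages are reported in the consolidated results.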
  • 13. pg. 12 Naïve Bayes Classifier:
Bayes' rule is applied to calculate the posterior from the prior and the likelihood, because the latter two are generally easier to calculate from a probability model. We chose to work with the Naïve Bayes classifier under this method. The method goes by the name Naïve Bayes because it is based on Bayes' rule and "naïvely" assumes independence: it is only valid to multiply probabilities when the events are independent. One of the nice things about Naïve Bayes is that missing values are no problem at all: if a value is missing in a training instance, it is simply not included in the frequency counts, and the probability ratios are based on the number of values that occur rather than on the total number of instances. Numeric values are usually handled by assuming they follow a "normal" probability distribution. The advantage of Naïve Bayes is that it requires only a small amount of training data to estimate the parameters necessary for classification: because the variables are assumed independent, only the variances of the variables for each class need to be determined, not the entire covariance matrix.
Trees – J48
A decision tree is a predictive machine-learning model that decides the target value (dependent variable) of a new sample based on various attribute values of the available data. The internal nodes of a decision tree denote the different attributes; the branches between the nodes tell us the possible values that these attributes can take in the observed samples; and the terminal nodes give the final value (classification) of the dependent variable. The attribute to be predicted is known as the dependent variable, since its value depends upon, or is decided by, the values of all the other attributes. The other attributes, which help in predicting the value of the dependent variable, are known as the independent variables. The J48 decision tree classifier follows this simple algorithm.
To classify a new item, it first needs to create a decision tree based on the attribute values of the available training data. Whenever it encounters a set of items (a training set), it identifies the attribute that discriminates between the various instances most clearly. This feature, the one that tells us the most about the data instances so that we can classify them best, is said to have the highest information gain. Now, among the possible values of this feature, if there is any value for which there is no ambiguity, that is, for which the data instances falling within its category all have the same value for the target variable, then we terminate that branch and assign to it the target value that we have obtained. For the other cases, we look for another attribute that gives us the highest information gain, and continue in this manner until we either get a clear decision of which combination of attributes gives a particular target value, or we run out of attributes. If we run
  • 14. pg. 13 out of attributes, or if we cannot get an unambiguous result from the available information, we assign the branch a target value that the majority of the items under it possess. Once we have the decision tree, we follow the order of attribute selection as obtained for the tree: by checking the respective attributes and their values against those in the decision tree model, we can predict the target value of a new instance. J48 can work with both continuous and discrete data; it handles continuous data by specifying ranges or thresholds, thus turning it into discrete data. J48 is well known for building high-accuracy models. The best selling point of decision trees is their ease of interpretation and explanation; they are also quite fast and popular, and their output is human-readable.
PART algorithm
The PART algorithm adopts a supervised machine learning technique, namely partial decision trees, as a method for feature subset selection. Feature subset selection aims at finding the smallest feature set having the most beneficial impact on machine learning algorithms; its prime goal is to identify a subset of features upon which attention should be centered. More precisely, PART exploits the partial decision tree learning algorithm for feature-space reduction. It uses the separate-and-conquer method: it builds a partial C4.5 (J48) decision tree in each iteration and makes the "best" leaf into a rule, so in each iteration a rule is derived from a pre-pruned decision tree.
Experimental results
Naïve Bayes Algorithm:
J48 Algorithm: The top attribute for the J48 algorithm was the 'LAST CONTACT DURATION' attribute.

Part Algorithm:
CONSOLIDATED RESULTS

Algorithm       C1           C2           C3           C4
Naïve Bayes     81.69037 %   82.21676 %   74.69219 %   75.33154 %
J48             86.55393 %   89.82483 %   78.20231 %   81.11085 %
PART            87.36565 %   90.24394 %   75.02638 %   78.54535 %

The above table shows the consolidated results for each algorithm. The values displayed are the averages over ten runs, compiled for the four scenarios in the experimental design.

Confusion Matrix

A confusion matrix represents the number of correctly classified instances and the wrongly classified instances for each algorithm that was used for this project.
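Such a matrix can be tallied directly from true and predicted labels; a minimal sketch (the tiny label lists are invented):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels=("no", "yes")):
    """Return a 2x2 matrix: rows are actual labels, columns are predicted."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(a, p)] for p in labels] for a in labels]

y_true = ["no", "no", "yes", "yes", "yes"]
y_pred = ["no", "yes", "yes", "yes", "no"]
m = confusion_matrix(y_true, y_pred)
# m[0][0] = true negatives, m[0][1] = false positives,
# m[1][0] = false negatives, m[1][1] = true positives
print(m)  # [[1, 1], [1, 2]]
```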
RELATIVE PERFORMANCE OF ALGORITHMS

Performance of Classifiers under Noise
Performance of classifiers under varied training set split
FALSE PREDICTORS

False predictors are values that misdirect the working logic of an algorithm. They sometimes tend to increase the apparent accuracy of the algorithm, but are misleading. Such attributes are determined to be false predictors.

• Last contact duration – This attribute gives the duration of the call between the bank and the customer. Ideally, predicting the potential customers should have been done prior to making the sales call, so this attribute is a false predictor.
• Last contact day – This attribute gives the day of the call between the bank and the customer. As discussed above, in an ideal scenario prediction happens before the sales call, hence this attribute is a false predictor.
• Last contact month – This attribute gives the month of the call between the bank and the customer. For the same reason, it is a false predictor.
• Current campaigns – This attribute gives information about the current sales campaigns in the bank. In data mining, prediction is usually based on past data, so previous campaign data is relevant, but the current campaign is a false predictor.

Testing the dataset without false predictors:

The three selected algorithms were applied again on the dataset using cross validation after removing the false predictors, with the following results.

Algorithm       0% Noise, 10 folds   10% Noise, 10 folds
Naïve Bayes     73.0263              68.0458
J48             81.129               73.4454
PART            85.1289              73.9563

We compare these results with C2 (0% noise & 80% training set) and C4 (10% noise & 80% training set), because C2 and C4 use a larger training set percentage (80% instead of 10%), which makes their accuracies the more relevant comparison for the current results.
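Removing the false predictors before training can be sketched as a simple column drop. The snake_case attribute names below are assumed stand-ins for the dataset's actual column names, and the sample record is invented:

```python
# Assumed column names for the false predictors discussed above.
FALSE_PREDICTORS = {"last_contact_duration", "last_contact_day",
                    "last_contact_month", "current_campaign"}

def drop_false_predictors(records, to_drop=FALSE_PREDICTORS):
    """Return copies of the records without the false-predictor attributes."""
    return [{k: v for k, v in r.items() if k not in to_drop} for r in records]

records = [{"age": 41, "job": "admin.", "last_contact_duration": 217, "y": "no"}]
print(drop_false_predictors(records))
# [{'age': 41, 'job': 'admin.', 'y': 'no'}]
```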
On comparison, it is evident that none of the values in the above table is higher than the earlier results. Still, data mining should ideally be done without any false predictors, so the above-mentioned values are the more trustworthy accuracies.

Tree Visualization for J48 algorithm:

Confusion Matrix after removing false predictors:

Class 1 – 0% noise & 10 folds; Class 2 – 10% noise & 10 folds (a = No, b = Yes).

                      Class 1              Class 2
Algorithm             a        b           a        b
J48          a (No)   36024    3898        32909    5137
             b (Yes)  7628     13528       11082    11950
PART         a (No)   37352    2570        32862    5184
             b (Yes)  6513     14643       10723    12309
Naïve Bayes  a (No)   32848    7074        30016    8030
             b (Yes)  9401     11755       11487    11545
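The accuracies reported in the earlier table can be re-derived from these confusion matrices; for example, for J48 at 0% noise and 10 folds (figures taken from the matrix above):

```python
# J48, 0% noise & 10 folds: actual 'no' row and actual 'yes' row.
tn, fp = 36024, 3898    # actual 'no' predicted as no / yes
fn, tp = 7628, 13528    # actual 'yes' predicted as no / yes

accuracy = (tn + tp) / (tn + fp + fn + tp)
print(f"{accuracy:.4%}")  # ~81.13%, matching the 81.129 reported for J48
```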
Receiver Operating Characteristics (ROC) Curve

A Receiver Operating Characteristic (ROC) curve is plotted between TPR and FPR: the true positive rate on the y-axis against the false positive rate on the x-axis. The area under the curve (AUC) summarizes the classifier's performance; the diagonal represents a random classifier, and the further the curve rises above this threshold line, the more accurate the algorithm.

Inferences from algorithm runs:

TPR (True Positive Rate) – the proportion of correct positive results among all actually positive instances (true positives plus false negatives). It is also known as sensitivity or recall.

FPR (False Positive Rate) – the proportion of incorrect positive results among all actually negative instances (false positives plus true negatives). The false positive rate is also known as the fall-out.

The ROC curves below were plotted for the respective algorithms with cross-validation testing and 10 folds.
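The TPR/FPR definitions above, together with a trapezoidal area-under-curve over a set of ROC points, can be sketched as follows (the ROC points are invented for illustration):

```python
def tpr(tp, fn):
    """True positive rate (sensitivity / recall)."""
    return tp / (tp + fn)

def fpr(fp, tn):
    """False positive rate (fall-out)."""
    return fp / (fp + tn)

def auc(points):
    """Trapezoidal area under an ROC curve given (fpr, tpr) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Invented ROC points running from (0, 0) to (1, 1).
roc = [(0.0, 0.0), (0.2, 0.6), (0.5, 0.85), (1.0, 1.0)]
print(auc(roc))  # 0.74
```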
ROC for Naïve Bayes:
0% noise (area under curve is 0.7806)
10% noise (area under curve is 0.7135)

ROC for J48:
0% noise (area under curve is 0.8479)
10% noise (area under curve is 0.7465)

ROC for PART:
0% noise (area under curve is 0.8774)
10% noise (area under curve is 0.7601)
CONCLUSION

• After removing the false predictors, the top predictor for the J48 algorithm is the 'CONTACT' attribute.
• Based on the accuracy obtained from each algorithm and the Area Under Curve (AUC), we conclude that the PART algorithm gives the highest accuracy for our dataset.
• However, the Naïve Bayes algorithm performs the most consistently in the presence of noise.

Algorithm       Correctly Classified Instances   Incorrectly Classified Instances
Naïve Bayes     41561                            19517
J48             44859                            16219
PART            45171                            15907
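The counts in the table can be turned into overall accuracies; these reproduce, to rounding, the 10%-noise 10-fold figures reported in the false-predictor section:

```python
counts = {
    "Naive Bayes": (41561, 19517),
    "J48":         (44859, 16219),
    "PART":        (45171, 15907),
}

for name, (correct, incorrect) in counts.items():
    accuracy = correct / (correct + incorrect)
    print(f"{name}: {accuracy:.4%}")
# Naive Bayes ~68.05%, J48 ~73.45%, PART ~73.96%
```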
APPENDIX

Since we used the SMOTE option on our dataset, we also tried three algorithms (J48, Decision Table, Naïve Bayes) across the class imbalance factor and noise.

F1 – Class Imbalance, F2 – Noise
F11 – Original, F12 – SMOTE, F21 – 0% noise, F22 – 10% noise

C1 – 0% noise in original dataset
C2 – 0% noise in SMOTE dataset
C3 – 10% noise in original dataset
C4 – 10% noise in SMOTE dataset

Algorithm        C1          C2          C3          C4
Naïve Bayes      88.0693 %   82.2424 %   80.5291 %   75.4331 %
J48              86.1251 %   94.4055 %   81.6505 %   86.3928 %
Decision Table   78.3261 %   85.787 %    73.2919 %   82.3357 %

Class Imbalance Versus Accuracy:
Noise Percentage Versus Accuracy:
Noise Vs SMOTE in Weka

 When we add noise to the dataset, only the value of one attribute is changed; for example, the value of the class variable is changed from 'yes' to 'no'. That is why this option is listed below 'Attribute' in Filters.
 The SMOTE option, by contrast, adds additional instances so that the minority class can be oversampled. Unlike noise, SMOTE simulates records which have the minority class. This is why SMOTE is listed below 'Instance' in Filters.

Naïve Bayes and SMOTE

Why is there a decrease in accuracy for Naïve Bayes when SMOTE is used on the dataset? Naïve Bayes assumes that all attributes are independent of each other. The number of 'yes' instances in the dataset remains constant, but the number of 'no' values of the class variable has been increased by SMOTE, which in turn increases the total number of instances. This reduces the prior probability of 'yes', so while predicting, the output has a greater chance of being classified as 'no' than before.

Why does accuracy increase when the training set percentage is increased?

When the training set has very few records (e.g. 10%), the algorithms build a model based on those records, which cover very few of the scenarios that may occur in the test set. When the size of the training set is increased (e.g. 80%), the model is built on more records and covers more scenarios, so the algorithms achieve higher accuracy.

Performance of the algorithms in the presence of noise

Why do J48 and PART have low accuracy in the presence of noise? The Naïve Bayes algorithm works on probabilities, so the presence of noise has less effect on its accuracy. J48 and PART, by contrast, work from rules created by decision trees. While building a model on the training set, they generate rules that fit the noise, and those rules then predict the wrong output on the test set, resulting in much lower accuracy in the presence of noise.
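The SMOTE behaviour described above, synthesizing new minority-class records by interpolating between existing ones, can be sketched roughly as follows. This is a simplified stand-in for Weka's SMOTE filter: it interpolates between random minority pairs rather than true k-nearest neighbours, and the feature vectors are invented:

```python
import random

def smote_like(minority, n_new, rng=random.Random(0)):
    """Create n_new synthetic points by interpolating between random
    pairs of existing minority-class feature vectors."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the segment between a and b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

minority = [(1.0, 2.0), (1.5, 2.5), (2.0, 1.8)]
new_points = smote_like(minority, 4)
print(len(minority) + len(new_points))  # minority class grown from 3 to 7
```

Each synthetic point lies on a segment between two real minority points, so the class region is filled in rather than merely duplicated.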
Performance of Naïve Bayes in the presence of noise

This graph shows that when the noise percentage was increased, the performance of both Naïve Bayes and J48 decreased. But the accuracy of Naïve Bayes dropped from 82.11% to 55.63%, which is comparatively better than J48 (which dropped from 89.49% to 54.17%).
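The noise experiments above flip a percentage of class labels; a minimal sketch of such label-noise injection (only loosely mirroring Weka's noise filter; the label list is invented):

```python
import random

def add_class_noise(labels, percent, rng=random.Random(42)):
    """Flip `percent` of the binary class labels ('yes' <-> 'no')."""
    labels = list(labels)
    k = int(len(labels) * percent / 100)
    for i in rng.sample(range(len(labels)), k):
        labels[i] = "no" if labels[i] == "yes" else "yes"
    return labels

clean = ["yes"] * 50 + ["no"] * 50
noisy = add_class_noise(clean, 10)
flipped = sum(c != n for c, n in zip(clean, noisy))
print(flipped)  # 10 of the 100 labels were flipped
```

Training on `noisy` while testing against clean labels is what degrades the rule-based learners, as discussed above.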