BANK MARKETING
DATA MINING GROUP PROJECT
Group7
Arun Kumar Narayana Murthy
Manikandan Sundarapandian
Muthu Kannan Subramaniam
Sathya Narayanan Manivannan
Sourabh Mahajan
Contents
INTRODUCTION
Domain description
Problem statement
DATA SET
Description
PRE-PROCESSING STEPS
Data Cleaning
Missing Values
Duplicate Values
Class Imbalance
Removing Outliers (Skewed Data)
Scaling Data
Observation
DATA VISUALIZATION
EXPERIMENT DESIGN
ALGORITHMS USED
Naïve Bayes Classifier
Trees – J48
PART-algorithm
Experimental results
CONSOLIDATED RESULTS
Confusion Matrix
RELATIVE PERFORMANCE OF ALGORITHMS
FALSE PREDICTORS
Testing the dataset without False Predictors
Tree Visualization for J48 algorithm
Confusion Matrix after removing false predictors
Receiver Operating Characteristics (ROC) Curve
CONCLUSION
REFERENCES
APPENDIX
INTRODUCTION
Data mining is a powerful technique that supports decision making based on the available data and also helps predict changes or outcomes that might occur in the future. It can surface useful predictions that usually cannot be obtained from graphical reporting alone.
Classification is a data mining technique that assigns items to a predefined set of classes. In this project, we apply the classification technique to predict the ‘Subscription’ attribute based on the other relevant fields.
This paper evaluates three different algorithms using the WEKA tool. The collected data enables better strategies for finding customers who are likely to subscribe to a term deposit.
Domain description
Bank marketing campaigns depend on large volumes of electronic customer data. Identifying customers who are more likely to respond to new offers is a central problem in direct marketing, and with such large datasets it is impractical for analysts to extract interesting information about customers by hand. Data mining has therefore been used extensively in direct marketing to identify potential customers for a new offer (target selection).
The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (yes or no).
Problem statement
The requirement is to predict the value of the Subscription attribute based on Job Type, Client Marital Status, Education, Bank Balance, Housing Loan, Personal Loan, Contact, Last Contact Day, Last Contact Month, Last Contact Duration, Current Campaigns, Days Passed, Previous Campaigns and Previous Outcome.
DATA SET
Description
The dataset includes a total of 61,079 rows and 16 attributes of both nominal and numeric types: Job Type, Client Marital Status, Education, Bank Balance, Housing Loan, Personal Loan, Contact, Last Contact Day, Last Contact Month, Last Contact Duration, Current Campaigns, Days Passed, Previous Campaigns, Previous Outcome and the Subscription Status of several customers, whose information was extracted from a Portuguese banking institution.
The dataset was obtained from the UCI Machine Learning Repository:
http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
Attribute Description

| Attribute | Description | Values |
| --- | --- | --- |
| Age | Age of the client | Numeric |
| Job | Type of job of the client | blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown |
| Marital Status | Marital status of the client | divorced, married, single, unknown |
| Education | Educational qualification of the client | basic 4y, basic 6y, basic 9y, high school, illiterate, professional course, university degree, unknown |
| Bank Balance | Bank balance of the client | no, yes, unknown |
| Housing Loan | Whether the client has a housing loan | no, yes, unknown |
| Personal Loan | Whether the client has a personal loan | no, yes |
| Contact | Contact communication type | cellular, telephone |
| Last Contact Day | Client's last contact day of the week | Mon, Tue, Wed, Thu, Fri |
| Last Contact Month | Client's last contact month of the year | Jan, Feb, Mar … Nov, Dec |
| Last Contact Duration | Client's last contact duration, in seconds | Numeric (call duration in seconds) |
| Current Campaigns | Number of contacts performed during this campaign and for this client | Numeric |
| Days Passed | Number of days that passed after the client was last contacted in a previous campaign | Numeric (999 means the client was not previously contacted) |
| Previous Campaigns | Number of contacts performed before this campaign and for this client | Numeric |
| Previous Outcome | Outcome of the previous marketing campaign | failure, nonexistent, success, unknown |
| Subscription | Whether the client has subscribed to a term deposit | yes, no |
PRE-PROCESSING STEPS
Quality data yields quality decisions. Data preprocessing transforms the data into a format that can be processed more easily and effectively by the algorithm. Real-world data are incomplete, noisy, and inconsistent: some attributes are false predictors, and others have missing values, noise, errors and other data discrepancies.
Data Cleaning
Data cleaning is a process used to determine inaccurate, incomplete, or unreasonable
data and then improve the quality through correcting detected errors and omissions. Raw data
is highly susceptible to noise, missing values and inconsistency. The quality of data affects the
data mining results. To help improve the quality of data and consequently of mining results,
raw data is pre-processed so as to improve the efficiency and ease of mining process. Data pre-
processing is one of the most critical steps in a data mining process which deals with
preparation and transformation of the initial data set.
Missing Values
Real-world data tends to be incomplete, noisy and inconsistent. An important preprocessing task is to fill in the missing values, smooth out noise and correct inconsistencies.
Some of the steps to handle Missing Values are as follows:
o Ignore the data row
o Use a global constant to fill in for missing values
o Use attribute mean for all samples belonging to the same class
o Use a data mining algorithm to predict the most probable value
In our dataset, 105 of the 61,079 instances had a missing value in one or more attributes, represented by a “?”. These missing values were replaced using the unsupervised ReplaceMissingValues attribute filter in WEKA's ‘Filters’ panel.
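As an illustration of the mean/mode replacement strategy, here is a minimal sketch (a hypothetical helper in Python, not the WEKA filter itself): numeric columns get the attribute mean, nominal columns the most frequent value.

```python
from collections import Counter

def impute_missing(rows, numeric_cols, missing="?"):
    """Replace '?' with the column mean (numeric) or mode (nominal),
    mirroring the idea behind WEKA's unsupervised ReplaceMissingValues."""
    cols = list(zip(*rows))
    filled = []
    for j, col in enumerate(cols):
        present = [v for v in col if v != missing]
        if j in numeric_cols:
            fill = sum(float(v) for v in present) / len(present)
            filled.append([float(v) if v != missing else fill for v in col])
        else:
            fill = Counter(present).most_common(1)[0][0]  # mode
            filled.append([v if v != missing else fill for v in col])
    return [list(r) for r in zip(*filled)]
```

For example, a missing age in `[["25", "married"], ["?", "single"], ["35", "?"]]` would be filled with the mean 30.0, and the missing marital status with the most frequent observed value.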
Duplicate Values
The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical mistakes and different representations of the same logical value. Hence, another important aspect of data cleaning was to check for duplicate values in the dataset. No duplicate values were detected in our Bank Marketing dataset.
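Checking for exact duplicates can be sketched as a simple first-occurrence filter (a hypothetical helper, not what WEKA runs internally):

```python
def drop_duplicates(rows):
    """Remove exact duplicate rows while preserving first occurrences."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)          # rows become hashable keys
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

If the output has the same length as the input, as with our dataset, no duplicates were present.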
Class Imbalance
Learning from imbalanced data has been studied actively for about two decades in
machine learning. A vast number of techniques have been tried, with varying results and few
clear answers. Data scientists facing this problem have no definite answer since it entirely
depends on the data.
Let us see some of the useful approaches to handle Class Imbalance.
 Do nothing. Sometimes you get lucky and nothing needs to be done. You can train on
the so-called natural (or stratified) distribution and sometimes it works without need for
modification.
 Balance the training set in some way (e.g., SMOTE):
o Oversample the minority class.
o Undersample the majority class.
o Synthesize new minority-class instances.
 Throw away minority examples and switch to an anomaly detection framework.
 Construct an entirely new algorithm to perform well on imbalanced data.
We have used SMOTE technique to balance our class attribute.
In our dataset the class attribute is heavily imbalanced. Applying an algorithm to such a dataset may build models that are biased towards one value of the class variable. Hence, we chose the SMOTE filter (i.e., oversampling the minority class) to handle the class-imbalance problem.
What is SMOTE? What does it do?
 Resamples a dataset by applying the Synthetic Minority Oversampling Technique (SMOTE).
 SMOTE oversamples the minority class by adding synthetic minority-class instances. Alternatively, down-sampling (under-sampling) the majority class could also rectify the imbalance.
The SMOTE filter can be applied from WEKA's Filters panel (supervised > instance > SMOTE).
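The core SMOTE idea, synthesizing new minority instances by interpolating between a minority sample and one of its k nearest minority neighbours, can be sketched as follows (a simplified stand-in for WEKA's filter, assuming purely numeric attributes):

```python
import math
import random

def smote(minority, n_new, k=5, seed=7):
    """Create n_new synthetic minority samples: pick a sample, pick one of
    its k nearest minority neighbours, and interpolate a random point
    between the two."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: math.dist(x, m))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # how far to move from x towards the neighbour
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic
```

Because each synthetic point lies on a line segment between two real minority points, the new instances stay inside the region the minority class already occupies.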
After applying the SMOTE filter to our dataset, the class attribute is much more balanced.
Removing Outliers (Skewed Data):
Outliers do not follow the pattern of the majority of the data. Such observations need to
be set apart at the onset of any analysis simply because their distance from the bulk of the data
ensures that they will exert a disproportionate pull on any model fitted by maximum likelihood.
Furthermore, detecting outliers is a statistical procedure with a well-defined objective
and whose efficacy can be measured. It is also important to point out that no matter how they
are identified, the outliers of a group of suspect observations can be assessed simply by
measuring their influence on a non-robust fit.
While preprocessing our dataset, positively skewed outliers were removed to check for improvement in the accuracy of the classifiers. Some attributes had skewed data, such as one whose values cluster between 0 and 1,735 even though the range extends to 4,000.
List of Attributes with skewed data:
• Bank Balance
• Last Contact Duration
• Current Campaigns
• Days Passed
• Previous Campaigns
Scaling Data
To handle this problem, we scaled the data by modifying the range for the one of the attributes
from (0-1735).
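The rescaling step can be sketched as a min-max transformation into the target range (a hypothetical helper illustrating the idea, not WEKA's normalization filter):

```python
def rescale(values, new_min=0.0, new_max=1735.0):
    """Min-max scale a numeric attribute into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]
```

For example, an attribute spanning 0–4,000 maps linearly onto 0–1,735, so the midpoint 2,000 becomes 867.5.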
After changing the scale, we compared the runs between Original data and scaled data.
Observation
Though the algorithms took less time to produce results, there was little difference in accuracy between the original and scaled data. Hence, we decided to build models on the original skewed dataset to which SMOTE had been applied.
DATA VISUALIZATION
Data visualization is a general term for any effort to help people understand the significance of data by placing it in a visual context. Patterns, trends and correlations that might go undetected in text-based data can be recognized easily with data visualization software.
Most business intelligence software vendors embed data visualization tools into their
products, either developing the visualization technology themselves or sourcing it from
companies that specialize in visualization.
The following picture shows a visual representation of the banking dataset that we have
chosen:
EXPERIMENT DESIGN
Experiment design describes how we test our hypothesis: it refers to how runs are allocated to the different combinations of conditions in an experiment. The most common design divides subjects into two groups, the experimental group and the control group, and then introduces a change to the experimental group but not to the control group. One member of each matched pair is randomly assigned to the experimental group and the other to the control group.
In our dataset, following factors are considered for Experimental Design:
 Noise (0%, 10%).
 Size of the Training set (10/90, 80/20).
Now, let us define the factors as
F1 – size of the training set
F2 – noise
and their levels as
F11 – 10% training set
F12 – 80% training set
F21 – 0% noise
F22 – 10% noise
Here, counterbalancing is applied to these factors, and the following four cells are created in the experimental design:
C1 – 0% noise, 10% training set
C2 – 0% noise, 80% training set
C3 – 10% noise, 10% training set
C4 – 10% noise, 80% training set
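Crossing the two factors mechanically reproduces the four cells above; a small sketch:

```python
from itertools import product

# Factor levels from the design above.
noise_levels = ["0% noise", "10% noise"]                    # F21, F22
training_levels = ["10% training set", "80% training set"]  # F11, F12

# Crossing the levels yields the four experimental cells C1..C4.
cells = {f"C{i}": (n, t)
         for i, (n, t) in enumerate(product(noise_levels, training_levels),
                                    start=1)}
```

Each algorithm is then run in every cell, so no combination of noise level and training-set size is left untested.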
ALGORITHMS USED
We wanted to cover probability-based, tree-based and rule-based classifiers. Since noise is handled well by the Naïve Bayes classifier, we chose it in the probability-based category. In the tree-based category we chose the J48 algorithm, since it gives good accuracy and its results are easily compared with other algorithms. We chose the PART algorithm because it was new to all of us, and testing the dataset on an unfamiliar algorithm offered additional learning beyond the familiar ones. Here are the algorithms/classifiers we selected:
 Naïve Bayes Classifier
 J48
 Part
Naïve Bayes Classifier:
Bayes' rule is applied to calculate the posterior from the prior and the likelihood, because the latter two are generally easier to calculate from a probability model. We chose to work with the Naïve Bayes classifier under this method. The method is called Naïve Bayes because it is based on Bayes' rule and “naïvely” assumes independence: it is only valid to multiply probabilities when the events are independent.
One of the nice things about Naïve Bayes is that missing values are no problem at all. If a
value is missing in a training instance, it is simply not included in the frequency counts, and the
probability ratios are based on the number of values that occur rather than on the total number
of instances. Numeric values are usually handled by assuming that they have a “normal”
probability distribution. The advantages of Naïve Bayes are that it only requires a small amount
of training data to estimate the parameters necessary for classification. Because independent
variables are assumed, only the variables for each class need to be determined and not the
entire covariance matrix.
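The counting scheme described above, class priors times per-attribute frequency ratios, with missing “?” values simply skipped, can be sketched in a few lines (a simplified categorical Naïve Bayes with Laplace smoothing, not WEKA's implementation):

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Categorical Naive Bayes with Laplace smoothing; missing '?'
    values are simply left out of the frequency counts."""
    def fit(self, X, y):
        self.priors = Counter(y)
        self.counts = defaultdict(Counter)   # (class, attr index) -> value counts
        for row, label in zip(X, y):
            for j, v in enumerate(row):
                if v != "?":
                    self.counts[(label, j)][v] += 1
        return self

    def predict(self, row):
        best, best_p = None, -1.0
        total = sum(self.priors.values())
        for label, prior in self.priors.items():
            p = prior / total                # class prior
            for j, v in enumerate(row):
                if v == "?":
                    continue                 # missing values are no problem
                c = self.counts[(label, j)]
                # Laplace-smoothed conditional probability P(value | class)
                p *= (c[v] + 1) / (sum(c.values()) + len(c) + 1)
            if p > best_p:
                best, best_p = label, p
        return best
```

Note how a “?” in the query row simply skips that factor, exactly the behaviour described above for missing values.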
Trees – J48
A decision tree is a predictive machine-learning model that decides the target value
(dependent variable) of a new sample based on various attribute values of the available data.
The internal nodes of a decision tree denote the different attributes, the branches between the
nodes tell us the possible values that these attributes can have in the observed samples, while
the terminal nodes tell us the final value (classification) of the dependent variable.
The attribute that is to be predicted is known as the dependent variable, since its value depends
upon, or is decided by, the values of all the other attributes. The other attributes, which help in
predicting the value of the dependent variable, are known as the independent variables in the
dataset.
The J48 decision tree classifier follows a simple algorithm. To classify a new item, it first creates a decision tree from the attribute values of the available training data. Whenever it encounters a training set, it identifies the attribute that discriminates the various instances most clearly; the feature that tells us most about the data instances, so that we can classify them best, is said to have the highest information gain. Among the possible values of this feature, if there is a value for which there is no ambiguity, that is, for which the data instances falling within its category all have the same value for the target variable, then we terminate that branch and assign to it the target value that we have obtained.
For the other cases, we then look for another attribute that gives us the highest
information gain. Hence, we continue in this manner until we either get a clear decision of what
combination of attributes gives us a particular target value, or we run out of attributes. If we run
out of attributes, or if we cannot get an unambiguous result from the available information, we
assign this branch a target value that the majority of the items under this branch possess.
Now that we have the decision tree, we follow the order of attribute selection as we have
obtained for the tree. By checking all the respective attributes and their values with those seen
in the decision tree model, we can assign or predict the target value of this new instance.
J48 can work with both continuous and discrete data. It does this by specifying ranges or
thresholds for continuous data thus turning continuous data into discrete data.
J48 is well known for its capability of building high accuracy models.
The best selling point of decision trees is their ease of interpretation and explanation. They are also quite fast and popular, and their output is human-readable.
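The attribute-selection criterion described above can be sketched directly: information gain is the drop in class entropy after splitting on an attribute (a minimal sketch of the quantity J48/C4.5 computes, ignoring C4.5's gain-ratio correction):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in class entropy from splitting on one attribute,
    the criterion J48 uses to pick each split of the tree."""
    total = entropy(labels)
    n = len(labels)
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, []).append(y)
    # Weighted entropy remaining after the split.
    remainder = sum(len(part) / n * entropy(part) for part in by_value.values())
    return total - remainder
```

An attribute that separates the classes perfectly scores the full entropy of the class (1 bit for a balanced binary class), while an attribute unrelated to the class scores 0.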
PART-algorithm
PART is a rule-based classifier built on partial decision trees. It uses a separate-and-conquer method: it builds a partial C4.5 (J48) decision tree in each iteration and makes the “best” leaf into a rule, removing the instances that the rule covers before the next iteration. In each iteration, a rule is thus derived from a pre-pruned partial decision tree.
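The separate-and-conquer loop can be illustrated with a drastically simplified sketch: instead of building a partial C4.5 tree, each iteration here picks a single (attribute, value) test whose covered instances are purest, emits it as a rule, and removes the covered instances. This is the same remove-and-repeat control structure PART uses, not PART itself.

```python
from collections import Counter

def learn_rules(X, y):
    """Toy separate-and-conquer learner: one (attribute, value) test per
    rule, scored by purity then coverage; covered rows are removed."""
    rules, data = [], list(zip(X, y))
    while data:
        best = None
        for j in range(len(data[0][0])):
            for v in {row[j] for row, _ in data}:
                covered = [lab for row, lab in data if row[j] == v]
                label, hits = Counter(covered).most_common(1)[0]
                score = (hits / len(covered), len(covered))
                if best is None or score > best[0]:
                    best = (score, j, v, label)
        _, j, v, label = best
        rules.append((j, v, label))          # "if attr j == v then label"
        data = [(row, lab) for row, lab in data if row[j] != v]
    return rules
```

Each pass shrinks the remaining data, so the loop terminates with an ordered rule list that covers every training instance.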
Experimental results
Naïve Bayes Algorithm:
J48 Algorithm:
The top attribute for the J48 algorithm was the ‘Last Contact Duration’ attribute.
Part Algorithm:
CONSOLIDATED RESULTS

| Algorithm / Cell | C1 | C2 | C3 | C4 |
| --- | --- | --- | --- | --- |
| Naïve Bayes | 81.69037 % | 82.21676 % | 74.69219 % | 75.33154 % |
| J48 | 86.55393 % | 89.82483 % | 78.20231 % | 81.11085 % |
| PART | 87.36565 % | 90.24394 % | 75.02638 % | 78.54535 % |

The above table shows the consolidated results for each algorithm. The values displayed are the averages over ten runs, compiled for the four cells of the experimental design.
Confusion Matrix
The confusion matrix shows the number of correctly classified and wrongly classified instances for each algorithm used in this project.
RELATIVE PERFORMANCE OF ALGORITHMS
Performance of Classifiers under Noise
pg. 17
Performance of classifiers under varied training set split
FALSE PREDICTORS
False predictors are attributes that misdirect the working logic of an algorithm. They sometimes increase the measured accuracy of the algorithm, but they are misleading. The following attributes were determined to be false predictors.
• Last Contact Duration – gives the duration of the call between the bank and the customer. Ideally, potential customers should be predicted before the sales call is made, so this information would not be available at prediction time.
• Last Contact Day – gives the day of the week of the call between the bank and the customer. As discussed above, in an ideal scenario prediction happens before the call, so this attribute is a false predictor.
• Last Contact Month – gives the month of the year of the call between the bank and the customer. As discussed above, prediction should happen before the call, so this attribute is a false predictor.
• Current Campaigns – gives information about current campaigns in the bank. In data mining, prediction is usually based on past data, so previous-campaign data is relevant but the current campaign is a false predictor.
Testing the dataset without False Predictors:
The selected three algorithms were applied to the dataset again, using cross-validation, after removing the false predictors. The following results were obtained.

| Algorithm | 0% Noise, 10 folds | 10% Noise, 10 folds |
| --- | --- | --- |
| Naïve Bayes | 73.0263 | 68.0458 |
| J48 | 81.1290 | 73.4454 |
| PART | 85.1289 | 73.9563 |

We compare these results with C2 (0% noise, 80% training set) and C4 (10% noise, 80% training set), because C2 and C4 use the larger training split (80% instead of 10%), which makes their accuracies the most relevant comparison.
On comparison, none of the values in the above table is higher than the earlier results. Nevertheless, data mining should ideally be done without false predictors, so the above values are the more trustworthy accuracies.
Tree Visualization for J48 algorithm:
Confusion Matrix after removing false predictors:
Class 1: 0% noise, 10 folds. Class 2: 10% noise, 10 folds. Columns: a = predicted No, b = predicted Yes; rows: actual No, actual Yes.

| Algorithm | Class 1 (a, b) | Class 2 (a, b) |
| --- | --- | --- |
| J48, actual No | 36024, 3898 | 32909, 5137 |
| J48, actual Yes | 7628, 13528 | 11082, 11950 |
| PART, actual No | 37352, 2570 | 32862, 5184 |
| PART, actual Yes | 6513, 14643 | 10723, 12309 |
| Naïve Bayes, actual No | 32848, 7074 | 30016, 8030 |
| Naïve Bayes, actual Yes | 9401, 11755 | 11487, 11545 |
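From any of the 2×2 matrices above, the headline accuracy and the TPR/FPR used later for the ROC curves follow directly; for example, the J48 Class 1 matrix reproduces the 81.129% reported earlier:

```python
def matrix_metrics(tn, fp, fn, tp):
    """Accuracy, TPR and FPR from a 2x2 confusion matrix
    (rows = actual No/Yes, columns = predicted a=No / b=Yes)."""
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    tpr = tp / (tp + fn)   # sensitivity / recall
    fpr = fp / (fp + tn)   # fall-out
    return accuracy, tpr, fpr

# J48 at 0% noise and 10 folds (Class 1 above).
acc, tpr, fpr = matrix_metrics(tn=36024, fp=3898, fn=7628, tp=13528)
```

Here `acc` comes out at about 0.81129, matching the J48 accuracy reported for the dataset without false predictors.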
Receiver Operating Characteristics (ROC) Curve
The Receiver Operating Characteristic (ROC) curve is plotted between TPR and FPR: the true positive rate on the y-axis against the false positive rate on the x-axis. The area between the diagonal (threshold line) and the curve contributes to the AUC (Area Under the Curve), and the TPR/FPR points falling in this region indicate the accuracy of the algorithm.
Inferences from algorithm runs:
TPR (True Positive Rate) – the proportion of correct positive results among all true positive and false negative instances. It is also known as sensitivity or recall.
FPR (False Positive Rate) – the proportion of incorrect positive results among all false positive and true negative instances. The false positive rate is also known as the fall-out.
The ROC curves below were plotted for each algorithm with cross-validation testing and 10 folds.
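The ROC construction can be sketched directly: sort instances by the classifier's score for the positive class, sweep the decision threshold, and accumulate TPR/FPR points; the trapezoid rule then gives the AUC. A minimal sketch:

```python
def roc_points(scores, labels):
    """TPR/FPR pairs obtained by sweeping the decision threshold over
    the predicted scores for the positive class (labels are 0/1)."""
    pairs = sorted(zip(scores, labels), reverse=True)  # high scores first
    p = sum(labels)
    n = len(labels) - p
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for score, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        pts.append((fp / n, tp / p))   # (FPR, TPR) at this threshold
    return pts

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A classifier that ranks every positive above every negative gets AUC 1.0; one that ranks them all backwards gets 0.0, and random ranking hovers around 0.5 (the diagonal).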
ROC for Naïve Bayes:
0% Noise (area under curve: 0.7806)
10% Noise (area under curve: 0.7135)
ROC for J48:
0% Noise (area under curve: 0.8479)
10% Noise (area under curve: 0.7465)
ROC for PART:
0% Noise (area under curve: 0.8774)
10% Noise (area under curve: 0.7601)
CONCLUSION
 After removing the false predictors, the top predictor for the J48 algorithm is the ‘Contact’ attribute.
 Based on the accuracy obtained from each algorithm and the Area Under the Curve (AUC), we can conclude that the PART algorithm gives the best accuracy for our dataset.
 However, the Naïve Bayes algorithm performs most consistently in the presence of noise.
| Algorithm | Correctly Classified Instances | Incorrectly Classified Instances |
| --- | --- | --- |
| Naïve Bayes | 41561 | 19517 |
| J48 | 44859 | 16219 |
| PART | 45171 | 15907 |
REFERENCES
[1] Ian H. Witten, Eibe Frank and Mark A. Hall – Data Mining: Practical Machine Learning Tools and Techniques
[2] Data pre-processing: http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMining-3.html
[3] Data cleaning and data pre-processing: http://www.mimuw.edu.pl/~son/datamining/DM/4-preprocess.pdf
[4] Data pre-processing: http://searchsqlserver.techtarget.com/definition/data-preprocessing
[5] Algorithms used: http://www.d.umn.edu/~padhy005/Chapter5.html and http://www.ec.tuwien.ac.at/~dieter/research/publications/sac2006.pdf
APPENDIX
Since we used the SMOTE option on our dataset, we also compared three algorithms (J48, Decision Table, Naïve Bayes) across the class-imbalance and noise factors.
F1 – Class Imbalance, F2 – Noise
F11 – Original, F12 – SMOTE, F21 – 0% Noise, F22 – 10% Noise
C1 – 0% noise, original dataset
C2 – 0% noise, SMOTE dataset
C3 – 10% noise, original dataset
C4 – 10% noise, SMOTE dataset

| Algorithm / Cell | C1 | C2 | C3 | C4 |
| --- | --- | --- | --- | --- |
| Naïve Bayes | 88.0693% | 82.2424% | 80.5291% | 75.4331% |
| J48 | 86.1251% | 94.4055% | 81.6505% | 86.3928% |
| Decision Table | 78.3261% | 85.787% | 73.2919% | 82.3357% |
Class Imbalance Versus Accuracy:
Noise Percentage Versus Accuracy:
Noise Vs SMOTE in Weka
 When we add noise to the dataset, the values of only one attribute are changed; for example, the value of the class variable is changed from ‘yes’ to ‘no’. That is why this option is listed below ‘Attribute’ in Filters.
 The SMOTE option, in contrast, adds additional instances by oversampling the minority class. Unlike noise, SMOTE simulates records that have the minority class, which is why SMOTE is listed below ‘Instance’ in Filters.
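The noise idea, flipping a percentage of class values at random, can be sketched as follows (a hypothetical helper illustrating the concept, not WEKA's AddNoise filter):

```python
import random

def add_class_noise(labels, pct, seed=1):
    """Flip roughly pct% of binary class labels at random."""
    rng = random.Random(seed)
    flip = {"yes": "no", "no": "yes"}
    return [flip[l] if rng.random() < pct / 100 else l for l in labels]
```

Notice that the number of instances never changes, only attribute values do, which matches why noise lives under ‘Attribute’ filters while SMOTE lives under ‘Instance’ filters.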
Naïve Bayes and SMOTE
Why is there a decrease in accuracy for Naïve Bayes when SMOTE is used on the dataset?
Naïve Bayes assumes that all attributes are independent of each other. The number of ‘yes’ instances in the dataset remains constant, but the number of ‘no’ values in the class variable is increased by SMOTE, which in turn increases the total number of instances. This reduces the prior probability of ‘yes’, so when predicting, an instance has a greater chance of being classified as ‘no’ than before.
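A toy calculation (with assumed, illustrative counts, not our dataset's exact figures) makes the prior shift concrete:

```python
# Hypothetical class counts before oversampling (illustrative only).
counts = {"yes": 7000, "no": 54000}
prior_yes_before = counts["yes"] / sum(counts.values())

# Oversampling one class value grows the total instance count,
# shrinking the prior of the other class.
counts["no"] += 20000
prior_yes_after = counts["yes"] / sum(counts.values())
```

With these numbers P(‘yes’) drops from roughly 0.115 to 0.086, so every Naïve Bayes prediction is pulled towards ‘no’.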
Why does accuracy increase when the training-set percentage is increased?
When the training set has very few records (e.g. 10%), the algorithms build a model from these records, which cover only a few of the scenarios that may occur in the test set. When the size of the training set is increased (e.g. 80%), the model is built on more records and covers more scenarios, so the algorithms achieve higher accuracy with the larger training set.
Performance of the algorithms in the presence of noise
Why do J48 and PART have low accuracy in the presence of noise?
The Naïve Bayes algorithm works on probabilities, so the presence of noise has less effect on its accuracy. J48 and PART, in contrast, work on rules created from decision trees. While building a model on the training set, these algorithms generate rules that also cover the noise, and those rules then predict the wrong output on the test set, resulting in much lower accuracy in the presence of noise.
Performance of Naïve Bayes in the presence of noise
The graph shows that when the noise percentage was increased, the performance of both Naïve Bayes and J48 decreased. However, the accuracy of Naïve Bayes dropped from 82.11% to 55.63%, which is comparatively better than J48 (which dropped from 89.49% to 54.17%).
 
Assessing Component based ERP Architecture for Developing Organizations
Assessing Component based ERP Architecture for Developing OrganizationsAssessing Component based ERP Architecture for Developing Organizations
Assessing Component based ERP Architecture for Developing Organizations
IJCSIS Research Publications
 
HCI - Individual Report for Metrolink App
HCI - Individual Report for Metrolink AppHCI - Individual Report for Metrolink App
HCI - Individual Report for Metrolink App
Darran Mottershead
 
Portuguese Bank - Direct Marketing Campaign
Portuguese Bank - Direct Marketing CampaignPortuguese Bank - Direct Marketing Campaign
Portuguese Bank - Direct Marketing Campaign
Rehan Akhtar
 
HCI - Group Report for Metrolink App
HCI - Group Report for Metrolink AppHCI - Group Report for Metrolink App
HCI - Group Report for Metrolink App
Darran Mottershead
 
An Introduction To Weka
An Introduction To WekaAn Introduction To Weka
An Introduction To Weka
weka Content
 
Data Mining – analyse Bank Marketing Data Set
Data Mining – analyse Bank Marketing Data SetData Mining – analyse Bank Marketing Data Set
Data Mining – analyse Bank Marketing Data Set
Mateusz Brzoska
 
Project 2 Data Mining Part 1
Project 2 Data Mining Part 1Project 2 Data Mining Part 1
Project 2 Data Mining Part 1
rayborg
 
Classifiers for Predicting Wine Quality
Classifiers for Predicting Wine QualityClassifiers for Predicting Wine Quality
Classifiers for Predicting Wine Quality
Laurent Declercq
 

Similar to Group7_Datamining_Project_Report_Final (20)

Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdasBig data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Prof Dr Mehmed ERDAS
 
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdasBig data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Prof Dr Mehmed ERDAS
 
Big Data in Banking (White paper)
Big Data in Banking (White paper)Big Data in Banking (White paper)
Big Data in Banking (White paper)
InData Labs
 
Architecting A Platform For Big Data Analytics
Architecting A Platform For Big Data AnalyticsArchitecting A Platform For Big Data Analytics
Architecting A Platform For Big Data Analytics
Arun Chinnaraju MBA, PMP, CSM, CSPO, SA
 
Customer relationship management in banking sector
Customer relationship management in banking sectorCustomer relationship management in banking sector
Customer relationship management in banking sector
Vivekanandha College of arts and Science for Women (Autonomous)
 
20 ccp using logistic
20 ccp using logistic20 ccp using logistic
20 ccp using logistic
Vrinda Sachdeva
 
Why is Data Science still not a mainstream in corporations - Sasa Radovanovic
Why is Data Science still not a mainstream in corporations - Sasa RadovanovicWhy is Data Science still not a mainstream in corporations - Sasa Radovanovic
Why is Data Science still not a mainstream in corporations - Sasa Radovanovic
Institute of Contemporary Sciences
 
Top 50 B2B Marketing Case Studies of 2012
Top 50 B2B Marketing Case Studies of 2012Top 50 B2B Marketing Case Studies of 2012
Top 50 B2B Marketing Case Studies of 2012
BtoB Online
 
Datasets using R-StudioUsha Rani Singh.docx
Datasets using R-StudioUsha Rani Singh.docxDatasets using R-StudioUsha Rani Singh.docx
Datasets using R-StudioUsha Rani Singh.docx
edwardmarivel
 
Team 8 Business Plan
Team 8 Business PlanTeam 8 Business Plan
Team 8 Business Plan
Emma Morgan
 
Day 1 (Lecture 2): Business Analytics
Day 1 (Lecture 2): Business AnalyticsDay 1 (Lecture 2): Business Analytics
Day 1 (Lecture 2): Business Analytics
Aseda Owusua Addai-Deseh
 
Ch7-Overview of data Science-part 2 - Copy.pptx
Ch7-Overview of data Science-part 2 - Copy.pptxCh7-Overview of data Science-part 2 - Copy.pptx
Ch7-Overview of data Science-part 2 - Copy.pptx
HaneenSabbhin
 
Project Report - Acquisition Credit Scoring Model
Project Report - Acquisition Credit Scoring ModelProject Report - Acquisition Credit Scoring Model
Project Report - Acquisition Credit Scoring Model
Subhasis Mishra
 
Star cement [www.writekraft.com]
Star cement [www.writekraft.com] Star cement [www.writekraft.com]
Star cement [www.writekraft.com]
WriteKraft Dissertations
 
Data Mining in Life Insurance Business
Data Mining in Life Insurance BusinessData Mining in Life Insurance Business
Data Mining in Life Insurance Business
Ankur Khanna
 
Future of Tracking: Transforming how we do it not what we do
Future of Tracking: Transforming how we do it not what we doFuture of Tracking: Transforming how we do it not what we do
Future of Tracking: Transforming how we do it not what we do
Kantar
 
Forrester’s study: Discover How Marketing Analytics Increases Business Perfor...
Forrester’s study: Discover How Marketing Analytics Increases Business Perfor...Forrester’s study: Discover How Marketing Analytics Increases Business Perfor...
Forrester’s study: Discover How Marketing Analytics Increases Business Perfor...
Nicolas Valenzuela
 
Data MiningData MiningData MiningData Mining
Data MiningData MiningData MiningData MiningData MiningData MiningData MiningData Mining
Data MiningData MiningData MiningData Mining
abdulraqeebalareqi1
 
How to Achieve World Class Customer Experience through Insightful IT
How to Achieve World Class Customer Experience through Insightful IT How to Achieve World Class Customer Experience through Insightful IT
How to Achieve World Class Customer Experience through Insightful IT
Bobhallahan
 
CUSTOMER FEEDBACK MANAGEMENT
CUSTOMER FEEDBACK MANAGEMENTCUSTOMER FEEDBACK MANAGEMENT
CUSTOMER FEEDBACK MANAGEMENT
George Krasadakis
 
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdasBig data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Prof Dr Mehmed ERDAS
 
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdasBig data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Prof Dr Mehmed ERDAS
 
Big Data in Banking (White paper)
Big Data in Banking (White paper)Big Data in Banking (White paper)
Big Data in Banking (White paper)
InData Labs
 
Why is Data Science still not a mainstream in corporations - Sasa Radovanovic
Why is Data Science still not a mainstream in corporations - Sasa RadovanovicWhy is Data Science still not a mainstream in corporations - Sasa Radovanovic
Why is Data Science still not a mainstream in corporations - Sasa Radovanovic
Institute of Contemporary Sciences
 
Top 50 B2B Marketing Case Studies of 2012
Top 50 B2B Marketing Case Studies of 2012Top 50 B2B Marketing Case Studies of 2012
Top 50 B2B Marketing Case Studies of 2012
BtoB Online
 
Datasets using R-StudioUsha Rani Singh.docx
Datasets using R-StudioUsha Rani Singh.docxDatasets using R-StudioUsha Rani Singh.docx
Datasets using R-StudioUsha Rani Singh.docx
edwardmarivel
 
Team 8 Business Plan
Team 8 Business PlanTeam 8 Business Plan
Team 8 Business Plan
Emma Morgan
 
Ch7-Overview of data Science-part 2 - Copy.pptx
Ch7-Overview of data Science-part 2 - Copy.pptxCh7-Overview of data Science-part 2 - Copy.pptx
Ch7-Overview of data Science-part 2 - Copy.pptx
HaneenSabbhin
 
Project Report - Acquisition Credit Scoring Model
Project Report - Acquisition Credit Scoring ModelProject Report - Acquisition Credit Scoring Model
Project Report - Acquisition Credit Scoring Model
Subhasis Mishra
 
Data Mining in Life Insurance Business
Data Mining in Life Insurance BusinessData Mining in Life Insurance Business
Data Mining in Life Insurance Business
Ankur Khanna
 
Future of Tracking: Transforming how we do it not what we do
Future of Tracking: Transforming how we do it not what we doFuture of Tracking: Transforming how we do it not what we do
Future of Tracking: Transforming how we do it not what we do
Kantar
 
Forrester’s study: Discover How Marketing Analytics Increases Business Perfor...
Forrester’s study: Discover How Marketing Analytics Increases Business Perfor...Forrester’s study: Discover How Marketing Analytics Increases Business Perfor...
Forrester’s study: Discover How Marketing Analytics Increases Business Perfor...
Nicolas Valenzuela
 
Data MiningData MiningData MiningData Mining
Data MiningData MiningData MiningData MiningData MiningData MiningData MiningData Mining
Data MiningData MiningData MiningData Mining
abdulraqeebalareqi1
 
How to Achieve World Class Customer Experience through Insightful IT
How to Achieve World Class Customer Experience through Insightful IT How to Achieve World Class Customer Experience through Insightful IT
How to Achieve World Class Customer Experience through Insightful IT
Bobhallahan
 
CUSTOMER FEEDBACK MANAGEMENT
CUSTOMER FEEDBACK MANAGEMENTCUSTOMER FEEDBACK MANAGEMENT
CUSTOMER FEEDBACK MANAGEMENT
George Krasadakis
 
Ad

Group7_Datamining_Project_Report_Final

  • 1. BANK MARKETING - DATA MINING GROUP PROJECT
Group 7: Arun Kumar Narayana Murthy, Manikandan Sundarapandian, Muthu Kannan Subramaniam, Sathya Narayanan Manivannan, Sourabh Mahajan
  • 2. pg. 1 Contents
INTRODUCTION .... 2
Domain description .... 2
Problem statement .... 2
DATA SET .... 2
Description .... 2
PRE-PROCESSING STEPS .... 3
Data Cleaning .... 4
Missing Values .... 4
Duplicate Values .... 5
Class Imbalance .... 5
Removing Outliers (Skewed Data) .... 7
Scaling Data .... 8
Observation .... 9
DATA VISUALIZATION .... 9
EXPERIMENT DESIGN .... 10
ALGORITHMS USED .... 11
Naïve Bayes Classifier .... 12
Trees – J48 .... 12
PART algorithm .... 13
Experimental results .... 13
Naïve Bayes Algorithm .... 13
CONSOLIDATED RESULTS .... 15
Confusion Matrix .... 15
RELATIVE PERFORMANCE OF ALGORITHMS .... 16
FALSE PREDICTORS .... 18
Testing the dataset without False Predictors .... 18
Tree Visualization for J48 algorithm .... 19
Confusion Matrix after removing false predictor .... 19
Receiver Operating Characteristics (ROC) Curve .... 20
CONCLUSION .... 24
REFERENCES .... 24
APPENDIX .... 25
  • 3. pg. 2 INTRODUCTION
The data mining process is a powerful technique that supports decision making based on the available data and also helps predict changes or outcomes that might occur in the future. Data mining can surface useful predictions that usually cannot be obtained from graphical reporting alone. Classification is a data mining technique that assigns items to a predefined set of classes. In this project, we apply classification to predict the 'Subscription' attribute from the other relevant fields. This paper evaluates three different algorithms using the WEKA tool. The collected data will enable better strategies for finding customers who are likely to subscribe to a term deposit.
Domain description
Bank marketing campaigns depend on large volumes of electronic customer data. Identifying customers who are more likely to respond to new offers is an important issue in direct marketing; with so much data, it is impossible for analysts to manually extract interesting information about the customers. In direct marketing, data mining has been used extensively to identify potential customers for a new offer (target selection). The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (values 'yes' or 'no').
Problem statement
The requirement is to predict the value of the Subscription attribute based on Job Type, Client Marital Status, Education, Bank Balance, Housing Loan, Personal Loan, Contact, Last Contact Day, Last Contact Month, Last Contact Duration, Current Campaigns, Days Passed, Previous Campaigns and Previous Outcome.
DATA SET
Description
The dataset includes a total of 61,079 rows and 16 attributes, containing both nominal and numeric attribute types: Job Type, Client Marital Status, Education, Bank Balance, Housing Loan, Personal Loan, Contact, Last Contact Day, Last Contact Month, Last Contact Duration, Current Campaigns, Days Passed, Previous Campaigns, Previous Outcome and
  • 4. pg. 3 Subscription Status of several customers, whose information was extracted from a Portuguese banking institution. The dataset was obtained from the UCI website: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing

Attribute description:
Attribute | Description | Values
Age | Age of the client | Numeric value
Job | Type of job of the client | blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown
Marital Status | Marital status of the client | divorced, married, single, unknown
Education | Educational qualification of the client | basic 4y, basic 6y, basic 9y, high school, illiterate, professional course, university degree, unknown
Bank Balance | Bank balance of the client | no, yes, unknown
Housing Loan | Whether the client has a housing loan | no, yes, unknown
Personal Loan | Whether the client has a personal loan | no, yes
Contact | Contact communication type | cellular, telephone
Last Contact Day | Client's last contact day of the week | Mon, Tue, Wed, Thu, Fri
Last Contact Month | Client's last contact month of the year | Jan, Feb, Mar, ..., Nov, Dec
Last Contact Duration | Client's last contact duration, in seconds | call duration in seconds (numeric)
Current Campaigns | Number of contacts performed during this campaign for this client | numeric data
Days Passed | Number of days since the client was last contacted in a previous campaign | numeric data (999 means the client was not previously contacted)
Previous Campaigns | Number of contacts performed before this campaign for this client | numeric data
Previous Outcome | Outcome of the previous marketing campaign | failure, nonexistent, success, unknown
Subscription | Whether the client has subscribed to a term deposit | yes, no

PRE-PROCESSING STEPS
Quality data provides quality decisions. Data preprocessing transforms the data into a format that can be processed more easily and effectively by the algorithm. Real-world data are
  • 5. pg. 4 incomplete, noisy, and inconsistent. Some attributes are false predictors or have missing values, noise, errors and other discrepancies.
Data Cleaning
Data cleaning is the process of detecting inaccurate, incomplete, or unreasonable data and improving its quality by correcting the detected errors and omissions. Raw data is highly susceptible to noise, missing values and inconsistency, and the quality of the data affects the data mining results. To improve the quality of the data, and consequently of the mining results, raw data is pre-processed to improve the efficiency and ease of the mining process. Data pre-processing is one of the most critical steps in a data mining process, dealing with the preparation and transformation of the initial dataset.
Missing Values
Real-world data tends to be incomplete, noisy and inconsistent. An important preprocessing task is to fill in the missing values, smooth out noise and correct inconsistencies. Some strategies for handling missing values are:
o Ignore the data row
o Use a global constant to fill in missing values
o Use the attribute mean for all samples belonging to the same class
o Use a data mining algorithm to predict the most probable value
In our dataset, 105 of the 61,079 instances had a missing value in at least one attribute. These missing values are represented with a "?" and were replaced using an unsupervised filter from Weka's 'Filters' option.
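The mean/mode replacement strategy listed above can be sketched in a few lines of Python. This is a toy illustration of the idea, not Weka's filter; the attribute names ('age', 'job') and the "?" marker mirror the dataset's convention but the rows are invented:

```python
from statistics import mean
from collections import Counter

def impute(rows, numeric_cols):
    """Replace '?' markers: numeric columns get the column mean,
    nominal columns get the most frequent value (mode)."""
    cols = rows[0].keys()
    fill = {}
    for c in cols:
        observed = [r[c] for r in rows if r[c] != "?"]
        if c in numeric_cols:
            fill[c] = mean(float(v) for v in observed)
        else:
            fill[c] = Counter(observed).most_common(1)[0][0]
    return [{c: (fill[c] if r[c] == "?" else r[c]) for c in cols} for r in rows]

# Toy rows: one numeric attribute ('age'), one nominal attribute ('job')
rows = [
    {"age": "30", "job": "services"},
    {"age": "?",  "job": "management"},
    {"age": "50", "job": "?"},
    {"age": "40", "job": "services"},
]
clean = impute(rows, numeric_cols={"age"})
# The missing age becomes the mean (40.0); the missing job becomes the mode
```

Weka's class-aware variant would compute the mean per class value rather than over the whole column.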
  • 6. pg. 5 Duplicate Values
Detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. Hence, another important aspect of data cleaning was to check for duplicate values; none were detected in our bank marketing dataset.
Class Imbalance
Learning from imbalanced data has been studied actively in machine learning for about two decades. A vast number of techniques have been tried, with varying results and few clear answers; the right choice depends entirely on the data. Some useful approaches to handling class imbalance are:
 Do nothing. Sometimes you get lucky and nothing needs to be done: you can train on the so-called natural (or stratified) distribution and it works without modification.
 Balance the training set in some way (e.g., SMOTE):
o Oversample the minority class.
o Undersample the majority class.
o Synthesize new minority-class examples.
 Throw away minority examples and switch to an anomaly-detection framework.
 Construct an entirely new algorithm that performs well on imbalanced data.
We used the SMOTE technique to balance our class attribute.
  • 7. pg. 6 As we can see from the image above, the class attribute in our dataset has a large imbalance. Applying an algorithm to this dataset might build models that are biased towards one value of the class variable. Hence, we chose the SMOTE filter (i.e., oversampling the minority class) to handle the class imbalance problem.
What is SMOTE? What does it do?
 It resamples a dataset by applying the Synthetic Minority Oversampling Technique (SMOTE).
 SMOTE oversamples the minority class, i.e., adds synthetic minority-class instances. Similarly, downsampling (undersampling) the majority class could also rectify the imbalance.
To use the SMOTE filter in Weka,
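Outside Weka, the core SMOTE idea of synthesizing new minority points by interpolating between a sample and one of its nearest neighbours can be sketched as follows. This is a simplified illustration on made-up 2-D points, not Weka's implementation:

```python
import random

def smote(minority, n_new, k=2, seed=42):
    """Generate n_new synthetic minority samples by interpolating between
    a random minority point and one of its k nearest neighbours (the core
    idea of SMOTE; real implementations add more bookkeeping)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x by squared Euclidean distance (excluding x)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        out.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return out

minority = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.5)]  # invented minority points
synthetic = smote(minority, n_new=5)
# Every synthetic point lies on a segment between two real minority points
```

Because each new point is a convex combination of two existing minority points, the synthetic data stays inside the region the minority class already occupies.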
  • 8. pg. 7 After applying the SMOTE filter to our dataset, the imbalance is corrected and the class attribute has fairly balanced data.
Removing Outliers (Skewed Data):
Outliers do not follow the pattern of the majority of the data. Such observations need to be set apart at the onset of any analysis, simply because their distance from the bulk of the data ensures that they will exert a disproportionate pull on any model fitted by maximum likelihood.
  • 9. pg. 8 Furthermore, detecting outliers is a statistical procedure with a well-defined objective whose efficacy can be measured. It is also important to point out that, no matter how they are identified, a group of suspect observations can be assessed simply by measuring their influence on a non-robust fit. While pre-processing our dataset, outliers from attributes that are positively skewed were removed to check for improvement in classifier accuracy. Some attributes had skewed data, like the one shown below, where most values fall between 0 and 1,735 even though the range goes up to 4,000.
List of attributes with skewed data:
• Bank Balance
• Last Contact Duration
• Current Campaigns
• Days Passed
• Previous Campaigns
Scaling Data
To handle this problem, we scaled the data by rescaling one of the attributes to the range (0-1735).
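Rescaling an attribute to a target range like (0-1735) is ordinary min-max scaling; a minimal sketch (the input values here are hypothetical, not taken from the dataset):

```python
def rescale(values, new_min=0.0, new_max=1735.0):
    """Min-max scale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [new_min + (v - lo) / span * (new_max - new_min) for v in values]

balances = [0, 250, 1735, 4000]  # hypothetical skewed attribute values
scaled = rescale(balances)
# The largest value (4000) maps to 1735; the smallest (0) maps to 0
```

Note that linear rescaling changes the range but not the shape of the distribution, which is consistent with the observation below that accuracy barely changed.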
  • 10. pg. 9 After changing the scale, we compared runs between the original data and the scaled data.
Observation
Though the algorithms took less time to produce results on the scaled data, there was not much difference in accuracy between the original and scaled data. Hence, we decided to build the models on the original skewed dataset, to which SMOTE had been applied.
DATA VISUALIZATION
Data visualization is a general term describing any effort to help people understand the significance of data by placing it in a visual context. Patterns, trends and correlations that might go undetected in text-based data can be recognized easily with data visualization software.
  • 11. pg. 10 Most business intelligence software vendors embed data visualization tools into their products, either developing the visualization technology themselves or sourcing it from companies that specialize in visualization. The following picture shows a visual representation of the banking dataset that we have chosen:
EXPERIMENT DESIGN
Experiment design is the approach we use for testing our hypothesis. It refers to how participants are allocated to the different combinations in an experiment. The most common design divides participants into two groups, an experimental group and a control group, and then introduces a change to the experimental group but not the control group; one member of each matched pair is randomly assigned to the experimental group and the other to the control group. In our dataset, the following factors are considered for the experimental design:
  • 12. pg. 11
 Noise (0%, 10%)
 Size of the training set (10/90, 80/20)
Now, consider the two factors
F1 - size of the training set
F2 - noise
with levels
F11 - 10% training set
F12 - 80% training set
F21 - 0% noise
F22 - 10% noise
Counterbalancing is applied to these factors, giving the following four cells in the experimental design:
C1 - 0% noise, 10% training set
C2 - 0% noise, 80% training set
C3 - 10% noise, 10% training set
C4 - 10% noise, 80% training set
ALGORITHMS USED
We wanted to cover probability-based, tree-based and rule-based classifiers. Since noise is handled effectively by the Naïve Bayes classifier, we chose it in the probability-based category. In the tree-based category we chose the J48 algorithm, since it gives good accuracy and its results are easily compared with other algorithms. We chose the PART algorithm because it was new to all of us, and testing the dataset on an unfamiliar algorithm offered additional learning beyond the familiar ones. The algorithms/classifiers that we selected are:
 Naïve Bayes Classifier
 J48
 PART
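Each cell C1-C4 pairs a noise level with a training-set split. A minimal sketch of how one cell could be constructed, assuming a hypothetical 'subscription' class field (this illustrates the design, not the exact Weka procedure):

```python
import random

def make_cell(rows, noise_pct, train_pct, seed=7):
    """Build one experimental-design cell: flip the class label on
    noise_pct percent of the rows, then split into train/test sets."""
    rng = random.Random(seed)
    rows = [dict(r) for r in rows]  # copy so the original data is untouched
    for r in rng.sample(rows, int(len(rows) * noise_pct / 100)):
        r["subscription"] = "no" if r["subscription"] == "yes" else "yes"
    rng.shuffle(rows)
    cut = int(len(rows) * train_pct / 100)
    return rows[:cut], rows[cut:]

# 100 toy rows with alternating class labels
data = [{"subscription": "yes" if i % 2 else "no"} for i in range(100)]
# Cell C4 above: 10% noise, 80% training split
train, test = make_cell(data, noise_pct=10, train_pct=80)
```

Running this once per cell, with several seeds, yields the repeated runs whose averages are reported in the consolidated results.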
  • 13. pg. 12 Naïve Bayes Classifier:
Bayes' rule is applied to calculate the posterior from the prior and the likelihood, because the latter two are generally easier to calculate from a probability model. We chose to work with the Naïve Bayes classifier under this method. The method goes by the name Naïve Bayes because it is based on Bayes' rule and "naïvely" assumes independence: it is only valid to multiply probabilities when the events are independent. One of the nice things about Naïve Bayes is that missing values are no problem at all: if a value is missing in a training instance, it is simply not included in the frequency counts, and the probability ratios are based on the number of values that occur rather than on the total number of instances. Numeric values are usually handled by assuming they follow a "normal" probability distribution. The advantage of Naïve Bayes is that it requires only a small amount of training data to estimate the parameters necessary for classification: because the variables are assumed independent, only the variances of the variables for each class need to be determined, not the entire covariance matrix.
Trees – J48
A decision tree is a predictive machine-learning model that decides the target value (dependent variable) of a new sample based on various attribute values of the available data. The internal nodes of a decision tree denote the different attributes; the branches between the nodes tell us the possible values that these attributes can take in the observed samples; and the terminal nodes give the final value (classification) of the dependent variable. The attribute to be predicted is known as the dependent variable, since its value depends upon, or is decided by, the values of all the other attributes. The other attributes, which help in predicting the value of the dependent variable, are known as the independent variables. The J48 decision tree classifier follows this simple algorithm.
To classify a new item, it first needs to create a decision tree based on the attribute values of the available training data. Whenever it encounters a set of items (a training set), it identifies the attribute that discriminates between the various instances most clearly. This feature, the one that tells us the most about the data instances so that we can classify them best, is said to have the highest information gain. Now, among the possible values of this feature, if there is any value for which there is no ambiguity, that is, for which the data instances falling within its category all have the same value for the target variable, then we terminate that branch and assign to it the target value that we have obtained. For the other cases, we look for another attribute that gives us the highest information gain, and continue in this manner until we either get a clear decision of which combination of attributes gives a particular target value, or we run out of attributes. If we run
  • 14. pg. 13 out of attributes, or if we cannot get an unambiguous result from the available information, we assign the branch a target value that the majority of the items under it possess. Once we have the decision tree, we follow the order of attribute selection as obtained for the tree: by checking the respective attributes and their values against those in the decision tree model, we can predict the target value of a new instance. J48 can work with both continuous and discrete data; it handles continuous data by specifying ranges or thresholds, thus turning it into discrete data. J48 is well known for building high-accuracy models. The best selling point of decision trees is their ease of interpretation and explanation; they are also quite fast and popular, and their output is human-readable.
PART algorithm
The PART algorithm adopts a supervised machine learning technique, namely partial decision trees, as a method for feature subset selection. Feature subset selection aims at finding the smallest feature set having the most beneficial impact on machine learning algorithms; its prime goal is to identify a subset of features upon which attention should be centered. More precisely, PART exploits the partial decision tree learning algorithm for feature-space reduction. It uses the separate-and-conquer method: it builds a partial C4.5 (J48) decision tree in each iteration and makes the "best" leaf into a rule, so in each iteration a rule is derived from a pre-pruned decision tree.
Experimental results
Naïve Bayes Algorithm:
J48 Algorithm: The top attribute for the J48 algorithm was the 'LAST CONTACT DURATION' attribute.

Part Algorithm:
CONSOLIDATED RESULTS

Algorithm       C1           C2           C3           C4
Naïve Bayes     81.69037 %   82.21676 %   74.69219 %   75.33154 %
J48             86.55393 %   89.82483 %   78.20231 %   81.11085 %
PART            87.36565 %   90.24394 %   75.02638 %   78.54535 %

The above table shows the consolidated results for each algorithm. The values displayed are the averages over ten runs, compiled for the four scenarios in the experimental design.

Confusion Matrix

A confusion matrix represents the number of correctly classified instances and the wrongly classified instances for each algorithm that was used for this project.
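Such a matrix can be tallied directly from true and predicted labels; a minimal sketch (the tiny label lists are invented):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels=("no", "yes")):
    """Return a 2x2 matrix: rows are actual labels, columns are predicted."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(a, p)] for p in labels] for a in labels]

y_true = ["no", "no", "yes", "yes", "yes"]
y_pred = ["no", "yes", "yes", "yes", "no"]
m = confusion_matrix(y_true, y_pred)
# m[0][0] = true negatives, m[0][1] = false positives,
# m[1][0] = false negatives, m[1][1] = true positives
print(m)  # [[1, 1], [1, 2]]
```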
RELATIVE PERFORMANCE OF ALGORITHMS

Performance of Classifiers under Noise
Performance of classifiers under varied training set split
FALSE PREDICTORS

False predictors are values that misdirect the working logic of an algorithm. They sometimes tend to increase the apparent accuracy of the algorithm, but are misleading. Such attributes are determined to be false predictors.

• Last contact duration – This attribute gives the duration of the call between the bank and the customer. Ideally, predicting the potential customers should have been done prior to making the sales call, so this attribute is a false predictor.
• Last contact day – This attribute gives the day of the call between the bank and the customer. As discussed above, in an ideal scenario prediction happens before the sales call, hence this attribute is a false predictor.
• Last contact month – This attribute gives the month of the call between the bank and the customer. For the same reason, it is a false predictor.
• Current campaigns – This attribute gives information about the current sales campaigns in the bank. In data mining, prediction is usually based on past data, so previous campaign data is relevant, but the current campaign is a false predictor.

Testing the dataset without false predictors:

The three selected algorithms were applied again on the dataset using cross validation after removing the false predictors, with the following results.

Algorithm       0% Noise, 10 folds   10% Noise, 10 folds
Naïve Bayes     73.0263              68.0458
J48             81.129               73.4454
PART            85.1289              73.9563

We compare these results with C2 (0% noise & 80% training set) and C4 (10% noise & 80% training set), because C2 and C4 use a larger training set percentage (80% instead of 10%), which makes their accuracies the more relevant comparison for the current results.
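Removing the false predictors before training can be sketched as a simple column drop. The snake_case attribute names below are assumed stand-ins for the dataset's actual column names, and the sample record is invented:

```python
# Assumed column names for the false predictors discussed above.
FALSE_PREDICTORS = {"last_contact_duration", "last_contact_day",
                    "last_contact_month", "current_campaign"}

def drop_false_predictors(records, to_drop=FALSE_PREDICTORS):
    """Return copies of the records without the false-predictor attributes."""
    return [{k: v for k, v in r.items() if k not in to_drop} for r in records]

records = [{"age": 41, "job": "admin.", "last_contact_duration": 217, "y": "no"}]
print(drop_false_predictors(records))
# [{'age': 41, 'job': 'admin.', 'y': 'no'}]
```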
On comparison, it is evident that none of the values in the above table is higher than the earlier results. Still, data mining should ideally be done without any false predictors, so the above-mentioned values are the more trustworthy accuracies.

Tree Visualization for J48 algorithm:

Confusion Matrix after removing false predictors:

Class 1 – 0% noise & 10 folds; Class 2 – 10% noise & 10 folds (a = No, b = Yes).

                      Class 1              Class 2
Algorithm             a        b           a        b
J48          a (No)   36024    3898        32909    5137
             b (Yes)  7628     13528       11082    11950
PART         a (No)   37352    2570        32862    5184
             b (Yes)  6513     14643       10723    12309
Naïve Bayes  a (No)   32848    7074        30016    8030
             b (Yes)  9401     11755       11487    11545
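The accuracies reported in the earlier table can be re-derived from these confusion matrices; for example, for J48 at 0% noise and 10 folds (figures taken from the matrix above):

```python
# J48, 0% noise & 10 folds: actual 'no' row and actual 'yes' row.
tn, fp = 36024, 3898    # actual 'no' predicted as no / yes
fn, tp = 7628, 13528    # actual 'yes' predicted as no / yes

accuracy = (tn + tp) / (tn + fp + fn + tp)
print(f"{accuracy:.4%}")  # ~81.13%, matching the 81.129 reported for J48
```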
Receiver Operating Characteristics (ROC) Curve

A Receiver Operating Characteristic (ROC) curve is plotted between TPR and FPR: the true positive rate on the y-axis against the false positive rate on the x-axis. The area under the curve (AUC) summarizes the classifier's performance; the diagonal represents a random classifier, and the further the curve rises above this threshold line, the more accurate the algorithm.

Inferences from algorithm runs:

TPR (True Positive Rate) – the proportion of correct positive results among all actually positive instances (true positives plus false negatives). It is also known as sensitivity or recall.

FPR (False Positive Rate) – the proportion of incorrect positive results among all actually negative instances (false positives plus true negatives). The false positive rate is also known as the fall-out.

The ROC curves below were plotted for the respective algorithms with cross-validation testing and 10 folds.
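The TPR/FPR definitions above, together with a trapezoidal area-under-curve over a set of ROC points, can be sketched as follows (the ROC points are invented for illustration):

```python
def tpr(tp, fn):
    """True positive rate (sensitivity / recall)."""
    return tp / (tp + fn)

def fpr(fp, tn):
    """False positive rate (fall-out)."""
    return fp / (fp + tn)

def auc(points):
    """Trapezoidal area under an ROC curve given (fpr, tpr) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Invented ROC points running from (0, 0) to (1, 1).
roc = [(0.0, 0.0), (0.2, 0.6), (0.5, 0.85), (1.0, 1.0)]
print(auc(roc))  # 0.74
```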
ROC for Naïve Bayes:
0% noise (area under curve is 0.7806)
10% noise (area under curve is 0.7135)

ROC for J48:
0% noise (area under curve is 0.8479)
10% noise (area under curve is 0.7465)

ROC for PART:
0% noise (area under curve is 0.8774)
10% noise (area under curve is 0.7601)
CONCLUSION

• After removing the false predictors, the top predictor for the J48 algorithm is the 'CONTACT' attribute.
• Based on the accuracy obtained from each algorithm and the Area Under Curve (AUC), we conclude that the PART algorithm gives the highest accuracy for our dataset.
• However, the Naïve Bayes algorithm performs the most consistently in the presence of noise.

Algorithm       Correctly Classified Instances   Incorrectly Classified Instances
Naïve Bayes     41561                            19517
J48             44859                            16219
PART            45171                            15907
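The counts in the table can be turned into overall accuracies; these reproduce, to rounding, the 10%-noise 10-fold figures reported in the false-predictor section:

```python
counts = {
    "Naive Bayes": (41561, 19517),
    "J48":         (44859, 16219),
    "PART":        (45171, 15907),
}

for name, (correct, incorrect) in counts.items():
    accuracy = correct / (correct + incorrect)
    print(f"{name}: {accuracy:.4%}")
# Naive Bayes ~68.05%, J48 ~73.45%, PART ~73.96%
```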
APPENDIX

Since we used the SMOTE option on our dataset, we also tried three algorithms (J48, Decision Table, Naïve Bayes) across the class imbalance factor and noise.

F1 – Class Imbalance, F2 – Noise
F11 – Original, F12 – SMOTE, F21 – 0% noise, F22 – 10% noise

C1 – 0% noise in original dataset
C2 – 0% noise in SMOTE dataset
C3 – 10% noise in original dataset
C4 – 10% noise in SMOTE dataset

Algorithm        C1          C2          C3          C4
Naïve Bayes      88.0693 %   82.2424 %   80.5291 %   75.4331 %
J48              86.1251 %   94.4055 %   81.6505 %   86.3928 %
Decision Table   78.3261 %   85.787 %    73.2919 %   82.3357 %

Class Imbalance Versus Accuracy:
Noise Percentage Versus Accuracy:
Noise Vs SMOTE in Weka

 When we add noise to the dataset, only the value of one attribute is changed; for example, the value of the class variable is changed from 'yes' to 'no'. That is why this option is listed below 'Attribute' in Filters.
 The SMOTE option, by contrast, adds additional instances so that the minority class can be oversampled. Unlike noise, SMOTE simulates records which have the minority class. This is why SMOTE is listed below 'Instance' in Filters.

Naïve Bayes and SMOTE

Why is there a decrease in accuracy for Naïve Bayes when SMOTE is used on the dataset? Naïve Bayes assumes that all attributes are independent of each other. The number of 'yes' instances in the dataset remains constant, but the number of 'no' values of the class variable has been increased by SMOTE, which in turn increases the total number of instances. This reduces the prior probability of 'yes', so while predicting, the output has a greater chance of being classified as 'no' than before.

Why does accuracy increase when the training set percentage is increased?

When the training set has very few records (e.g. 10%), the algorithms build a model based on those records, which cover very few of the scenarios that may occur in the test set. When the size of the training set is increased (e.g. 80%), the model is built on more records and covers more scenarios, so the algorithms achieve higher accuracy.

Performance of the algorithms in the presence of noise

Why do J48 and PART have low accuracy in the presence of noise? The Naïve Bayes algorithm works on probabilities, so the presence of noise has less effect on its accuracy. J48 and PART, by contrast, work from rules created by decision trees. While building a model on the training set, they generate rules that fit the noise, and those rules then predict the wrong output on the test set, resulting in much lower accuracy in the presence of noise.
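The SMOTE behaviour described above, synthesizing new minority-class records by interpolating between existing ones, can be sketched roughly as follows. This is a simplified stand-in for Weka's SMOTE filter: it interpolates between random minority pairs rather than true k-nearest neighbours, and the feature vectors are invented:

```python
import random

def smote_like(minority, n_new, rng=random.Random(0)):
    """Create n_new synthetic points by interpolating between random
    pairs of existing minority-class feature vectors."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the segment between a and b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

minority = [(1.0, 2.0), (1.5, 2.5), (2.0, 1.8)]
new_points = smote_like(minority, 4)
print(len(minority) + len(new_points))  # minority class grown from 3 to 7
```

Each synthetic point lies on a segment between two real minority points, so the class region is filled in rather than merely duplicated.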
Performance of Naïve Bayes in the presence of noise

This graph shows that when the noise percentage was increased, the performance of both Naïve Bayes and J48 decreased. But the accuracy of Naïve Bayes dropped from 82.11% to 55.63%, which is comparatively better than J48 (which dropped from 89.49% to 54.17%).
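The noise experiments above flip a percentage of class labels; a minimal sketch of such label-noise injection (only loosely mirroring Weka's noise filter; the label list is invented):

```python
import random

def add_class_noise(labels, percent, rng=random.Random(42)):
    """Flip `percent` of the binary class labels ('yes' <-> 'no')."""
    labels = list(labels)
    k = int(len(labels) * percent / 100)
    for i in rng.sample(range(len(labels)), k):
        labels[i] = "no" if labels[i] == "yes" else "yes"
    return labels

clean = ["yes"] * 50 + ["no"] * 50
noisy = add_class_noise(clean, 10)
flipped = sum(c != n for c, n in zip(clean, noisy))
print(flipped)  # 10 of the 100 labels were flipped
```

Training on `noisy` while testing against clean labels is what degrades the rule-based learners, as discussed above.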