CIS111-6 Assignment 2: Advanced Data Techniques for Data Mining

This document discusses advanced data mining techniques for direct marketing campaigns. It introduces the topic and describes how banks analyze customer data to target marketing offers. The assignment is to analyze different data mining techniques that can be used for this purpose using a bank marketing dataset from Kaggle. Business intelligence and data mining techniques are commonly used to improve business performance.


UNIT:

ASSIGNMENT:
UNIT COORDINATOR:

STUDENT NAME: DIKSHA


ID:
EMAIL: {YOUR.NAME}@STUDY.BEDS.AC.UK

ADVANCED DATA MINING TECHNIQUES
FOR DIRECT MARKETING CAMPAIGNS

1. INTRODUCTION

This task describes basic and advanced data mining techniques applied to the bank marketing dataset from Kaggle. The banking sector is growing day by day in terms of innovation and is constantly evolving. We chose this dataset for two reasons: first, it has been used in a Kaggle competition; second, banks store very large amounts of data, including customers' personal information and the full history of all their customers, and they market their products and offers with the help of that history. Targeting customers' demands through one-to-one meetings and media channels is called direct marketing. In this assignment we analyse different data mining techniques for this purpose.

Business intelligence with data mining is very common nowadays, and many techniques and solutions for business improvement have been developed, particularly in the data science field, on which the modern world relies for its decisions. Understanding previous data is the first step in solving an existing problem, while prediction on upcoming data is very useful. Many basic and advanced techniques exist, but in this task we use only some of them. The data mining methods and algorithms used here are: data cleaning, such as dealing with missing values and removing outliers; data visualization with several Python libraries; tracking patterns in the data with two types of analysis, univariate analysis over a single column and bivariate analysis over pairs of columns, both shown as graphs drawn with Python plotting libraries; classification, the process of assigning rows to classes from several columns in a simple format, for which we use a linear model on the provided dataset; and, finally, a decision tree classifier built on the target column, which, as we know, works with a single labelled target column. The objective of this task should be stated here: as said earlier, business decisions can be made with data mining, and for datasets that have a labelled target column the decision tree classifier is among the best of these techniques. Other machine learning algorithms, such as XGBoost and SVM, are also strong choices for analysing this dataset, but in this assignment we use only the decision tree classifier. The objective on this dataset is to reach the right customers and improve direct marketing by calculating the features for which direct marketing is used most efficiently.

2. DESIGNING A SOLUTION

First we need to analyse the dataset and how many features it has. The questions are: which features are useful for meeting the objective, which are just outliers, and how many columns affect the dataset in different directions. As described earlier, many data mining techniques exist, but we use the few with which we can find a solution.

DATASET:
The dataset is described with the help of Python. It includes more than 41k rows with 20 columns; later we extract the most important features from them. The picture below shows the basic structure of the dataset. The first column, age, gives the age of the customer, and the next, job, gives his or her profession. The other features, such as education and campaign duration, also carry important information. Let's dive into the column values.

Fig 1: Describing the dataset
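The inspection described above can be sketched with pandas. The read_csv call is commented out because the Kaggle file name and path are assumptions; a tiny illustrative frame stands in for the real 41k-row dataset so the snippet runs on its own.

```python
import pandas as pd

# In the assignment the full Kaggle file is loaded; the file name below is an
# assumption -- adjust the path to wherever the dataset was downloaded.
# df = pd.read_csv("bank-additional-full.csv", sep=";")

# Tiny illustrative frame with the same kind of columns as the bank dataset.
df = pd.DataFrame({
    "age": [34, 45, 58, 29, 41],
    "job": ["admin.", "technician", "retired", "student", "blue-collar"],
    "education": ["university.degree", "basic.4y", "unknown",
                  "high.school", "basic.9y"],
    "y": ["no", "no", "yes", "no", "yes"],
})

print(df.shape)   # the real dataset has over 41k rows and 20 columns
print(df.head())  # this is the view summarised in Fig 1
print(df.dtypes)
```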

The picture below raises another important point: it shows the counts of the target column, and this gives two pieces of information.

A- The dataset has far more "no" values than "yes" values in the target column.

B- This column behaves differently even if we use only half of the dataset; the distribution is heavily skewed towards the majority value.
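The imbalance in points A and B can be checked directly with value_counts; the series below is an illustrative stand-in for the real target column, in which roughly 90% of rows are "no".

```python
import pandas as pd

# Illustrative target column: 9 "no" rows for every "yes" row, mimicking the
# roughly 90/10 imbalance described in points A and B.
y = pd.Series(["no"] * 9 + ["yes"])

counts = y.value_counts()
share_no = counts["no"] / len(y)
print(counts)
print(f"share of 'no': {share_no:.0%}")  # 90%
```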

Fig 2: Explains the target column

The picture above covers the complete dataset; now, using this column, we analyse at which ages customers have a deposit account ("yes") and at which ages they do not, given the many "no" values in the target column.

Fig 3: Count of the target column with respect to the age column

The picture above shows that most records fall between ages 30 and 50, but the split between "yes" and "no" differs by age. Although almost 90% of records are "no" overall, once we condition on age the distribution of having an account or not is no longer skewed as strongly. This means there is not much skewness in the distribution of the age feature.

Now we move to univariate analysis. This technique is used for analysing the values of a single column, i.e. the distribution of one column at a time. The questions are how skewed the column is towards particular values and how the data is distributed within the dataset.
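A univariate look at one column amounts to value counts for a categorical column and a skewness figure for a numeric one; both series below are illustrative stand-ins for the dataset's columns.

```python
import pandas as pd

# Illustrative categorical column: univariate analysis is just its frequency
# table, as counts and as proportions.
education = pd.Series(["basic.4y", "high.school", "high.school",
                       "university.degree", "university.degree",
                       "university.degree", "unknown"])
print(education.value_counts())
print(education.value_counts(normalize=True))

# For a numeric column such as age, skewness quantifies the asymmetry of the
# distribution; the high outlier here pulls the skew positive.
age = pd.Series([25, 30, 32, 35, 38, 41, 45, 52, 60, 88])
print(f"age skewness: {age.skew():.2f}")
```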

Fig 4: Explains the education column value counts

Fig 5: Explains the job column

Now let's move to bivariate analysis, which looks at column-to-column dependencies: how one column affects another column's values, and how they interact across multiple values of the distribution.

BIVARIATE ANALYSIS:

The first analysis relates marital status to age and to the target column of this dataset. Many kinds of graph can show this type of analysis, but we use a boxplot, which has the benefit of showing the values of one column split by another in different colours.

This analysis shows that divorced and married customers have more "yes" target values than the other marital statuses. The target column also shows that single people have fewer "yes" values, and age-wise they are younger than the other marital statuses. The picture below presents this analysis.

Fig 6: Bivariate analysis of age and marital status with the target column
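The comparison in Fig 6 can also be read off numerically with a groupby over marital status and the target; the small frame below is illustrative only, and the seaborn call noted in the comment is what would draw the boxplot itself.

```python
import pandas as pd

# Illustrative rows: single customers are younger, married/divorced older,
# mirroring the pattern described for Fig 6.
df = pd.DataFrame({
    "age":     [24, 27, 30, 44, 51, 39, 48, 56, 61],
    "marital": ["single", "single", "single", "married", "married",
                "married", "divorced", "divorced", "divorced"],
    "y":       ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes"],
})

# Median age per (marital status, target) cell -- the numbers behind the boxes.
summary = df.groupby(["marital", "y"])["age"].median().unstack()
print(summary)
# seaborn.boxplot(data=df, x="marital", y="age", hue="y") would draw Fig 6.
```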

Now let's move to another bivariate analysis; this time the column we target is education, which is very skewed with respect to the target column. Since the target column itself is very skewed towards "no" values, we examine how this column affects the others. The picture below shows the main education values with respect to age and the target feature. The question we are analysing is how many educated and uneducated people have a deposit account, and how age matters.

Fig 7: Describing education on the target column ("yes" or "no")

The picture above shows, first, that customers with basic.4y education and age 60+ have more deposit accounts than the others, and, second, that customers whose education is unknown also hold deposit accounts. By count, basic.4y education has the highest number of "yes" records.

3. EXPERIMENTS

Now we apply the advanced data mining techniques: classification, regression and the decision tree classifier. First we need to extract features and split the dataset into two streams, a training dataset and a testing dataset. The training dataset is used to train our model, the decision tree in this case, and we predict values on the testing dataset. A proper split is very important for accuracy, and we also have to identify the main features needed for the evaluation procedure. First of all we compute a column-to-column correlation matrix; this is how we decide which main features to keep.
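The correlation step can be sketched with DataFrame.corr() over the numeric columns; the values below are illustrative stand-ins, not the real dataset's figures.

```python
import pandas as pd

# Illustrative numeric columns: duration rises with age, campaign falls, so
# the matrix shows one strong positive and one negative correlation.
df = pd.DataFrame({
    "age":      [25, 32, 47, 51, 62],
    "duration": [100, 180, 240, 310, 380],
    "campaign": [3, 2, 2, 1, 1],
})

corr = df.corr()
print(corr.round(2))  # column-to-column correlation matrix
```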

We use the sklearn library for preprocessing the dataset. The LabelEncoder function transforms the dataset's categorical columns into numeric codes. For training a model, all the numeric columns should fall within one boundary, meaning every feature column is scaled relative to the maximum and minimum values of its distribution. To put it this way: when the columns have very different distributions it is very difficult to train a model, so all feature columns are brought to the same distribution. StandardScaler from sklearn is used to transform the values to such a distribution. After transforming, the dataset looks like this:

Fig 8: Dataset shape after transformation
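A minimal sketch of this preprocessing step, using sklearn's LabelEncoder and StandardScaler on illustrative data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Illustrative frame with one categorical and one numeric column.
df = pd.DataFrame({
    "job": ["admin.", "technician", "admin.", "retired"],
    "age": [30, 40, 50, 60],
})

# LabelEncoder maps each category to an integer code (classes sorted
# alphabetically: admin.=0, retired=1, technician=2).
df["job"] = LabelEncoder().fit_transform(df["job"])

# StandardScaler rescales a numeric column to mean 0 and unit variance --
# the "one boundary" the text refers to.
df["age"] = StandardScaler().fit_transform(df[["age"]]).ravel()
print(df)
```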

After transforming, the next step is to split the dataset into train and test sets using sklearn's train_test_split. The shapes after the split are as follows: X_train.shape is 32950 rows and 19 columns (one column is removed from training because it serves as the y target for evaluation, and the y column likewise has 32950 rows); for testing, X_test.shape is 8238 rows with 19 columns, and y_test.shape is also 8238 rows.
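The quoted shapes can be reproduced with train_test_split and a 20% test share (the random_state value here is an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's dimensions: 41188 rows, 19 features.
X = np.zeros((41188, 19))
y = np.zeros(41188)

# A 20% test share gives ceil(0.2 * 41188) = 8238 test rows and 32950
# training rows -- exactly the shapes quoted in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (32950, 19) (8238, 19)
```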

The first experiment is logistic regression, which reaches an accuracy of 90% on the test set, so this model needs some improvement. Before moving to the classification report for the next experimental technique, we need to create a confusion matrix to assess the results.
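A sketch of this first experiment on synthetic stand-in data from make_classification; the roughly 90% figure in the text comes from the real run on the bank dataset, not from this toy data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data with 19 features, matching the
# post-split feature count of the bank dataset.
X, y = make_classification(n_samples=2000, n_features=19, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit logistic regression and score it on the held-out test set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(f"test accuracy: {acc:.2f}")
```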

The following parameters are calculated for the classification techniques: the confusion matrix and the prediction metrics for the classification problem.

The picture below, taken from the code, shows the confusion matrix, accuracy score, F1-score and all the other prediction parameters computed against the y target class. Whether we use logistic regression or another classification technique, the difference in results is measured with these values.

Fig 9: Summary of classification results

The picture above shows that our predictions in the confusion matrix are far more often correct than wrong. Actual "no" predicted as "no" occurs 7191 times, while actual "no" predicted as "yes" occurs only 103 times, so correct predictions greatly outnumber incorrect ones. With a precision of 91%, the correct predictions carry the higher weight.
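The matrix in Fig 9 can be recomputed from label vectors with sklearn.metrics. Only the 7191 and 103 counts are quoted above, so the two "actual yes" cells below are illustrative assumptions.

```python
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

# Actual-"no" rows reproduce the quoted counts (7191 predicted "no",
# 103 predicted "yes"); the 600/344 split of the actual-"yes" rows is
# illustrative, chosen only so the snippet runs end to end.
y_true = ["no"] * 7294 + ["yes"] * 944
y_pred = ["no"] * 7191 + ["yes"] * 103 + ["no"] * 600 + ["yes"] * 344

cm = confusion_matrix(y_true, y_pred, labels=["no", "yes"])
print(cm)
print(f"accuracy: {accuracy_score(y_true, y_pred):.2f}")
print(f"f1 (yes): {f1_score(y_true, y_pred, pos_label='yes'):.2f}")
```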

Fig 10: ROC curve over the false positive rate

The ROC curve plots the true positive rate against the false positive rate. A false positive is a record predicted positive whose actual value is negative, and a true positive is a record predicted positive whose actual value is indeed "yes"; the curve shows how well the algorithm separates these two rates as the decision threshold varies, and so gives us a way of testing the algorithm. The graph above shows the ROC curve of the logistic regression model.
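A minimal sketch of computing the ROC curve and its area with sklearn; the labels and predicted scores below are illustrative, not taken from the real model.

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative true labels (0 = "no", 1 = "yes") and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.35, 0.8, 0.4, 0.6, 0.7, 0.9]

# roc_curve sweeps the decision threshold and returns the false positive
# rate and true positive rate at each step; the AUC summarises the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(f"AUC: {auc:.2f}")
# matplotlib's plt.plot(fpr, tpr) would draw the curve shown in Fig 10.
```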

Next, the decision tree classifier acts as a path-finder for the results: the complete dataset is divided into paths that each lead to an exact result.

Fig 11: Decision tree classifier of the bank dataset

The tree reads as follows:

1- When entropy is greater than 0.9, the predicted class is always "yes".

2- When the column value nr.employed <= -1.099, the entropy is 0.5 and the predicted class is "yes".

3- Together with that nr.employed value, if cons.conf.idx <= -1.328 the prediction is always "yes", whichever other column value is added.

4- When checking the three column values nr.employed, month and day of week together, the prediction is "no" with an entropy of 0.9.

5- If we consider poutcome <= -1.5 together with day of week, comparing every entropy value of nr.employed leads to a "no" decision.

6- If we consider poutcome <= -1.5 together with cons.price.idx, comparing every entropy value of nr.employed likewise leads to a "no" decision.

7- The best case is nr.employed with month in addition to a poutcome value of <= 1.5, which leads to a predicted class of "yes".
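Rule paths like those listed above can be printed from a tree trained with the entropy criterion. The data here is synthetic and the feature names are illustrative stand-ins for the bank dataset columns, so the printed thresholds will not match the figures quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the four names below mimic bank dataset columns.
X, y = make_classification(n_samples=500, n_features=4, random_state=1)
names = ["nr.employed", "cons.conf.idx", "poutcome", "cons.price.idx"]

# Depth-limited tree using the entropy criterion, as in the assignment.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                              random_state=1).fit(X, y)

# export_text prints the tree as threshold rules, one path per leaf --
# the textual form of the paths read off Fig 11.
rules = export_text(tree, feature_names=names)
print(rules)
```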

4. CONCLUSIONS

The bank's marketing strategy is affected by multiple data patterns. The results obtained from the decision tree are the ones that lead to a positive marketing strategy: the patterns it finds point to the best marketing campaigns. Each resulting path ends in either "no" or "yes"; for the "no" paths a scheme is needed to shift customers towards "yes", while the "yes" patterns deserve further business attention. Managers and other stakeholders can make their choices according to these situations. Many patterns in the bivariate analysis resemble the old styles, and moreover the classifier takes considerable effort before it lifts the business.

