ANIL DS PROJECT
ON
DATA SCIENCE
BY
K.ANIL
FROM
Example of tuple:
thistuple = ("apple", "banana", "cherry")
Method     Description
count()    Returns the number of times a specified value occurs in a tuple
index()    Searches the tuple for a specified value and returns the position of where it was found
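As a quick illustration of the two tuple methods above, here is a minimal sketch. The tuple `fruits` is a made-up variant of the example with "apple" repeated so that count() returns something other than 1.

```python
# Illustrative tuple; the repeated "apple" is an assumption added for this example.
fruits = ("apple", "banana", "cherry", "apple")

# count() reports how many times a value occurs in the tuple.
apple_count = fruits.count("apple")

# index() reports the position of the first occurrence of a value.
cherry_pos = fruits.index("cherry")

print(apple_count, cherry_pos)
```

Note that index() raises a ValueError when the value is not in the tuple, so it is worth checking membership with `in` first if the value may be missing.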
•Dictionary: Dictionaries are used to store data values in key:value pairs.
• A dictionary is a collection which is ordered (as of Python 3.7), changeable,
and does not allow duplicate keys.
Example: thisdict = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
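The properties listed above (changeable, no duplicate keys) can be seen in a short sketch. The "color" key and the `dup` dictionary are hypothetical additions used only for illustration.

```python
thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}

# Look up a value by its key.
model = thisdict["model"]

# Changeable: overwrite an existing value and add a new key:value pair.
thisdict["year"] = 2020
thisdict["color"] = "red"   # hypothetical extra key

# No duplicate keys: when a key is repeated, the last value wins.
dup = {"year": 1964, "year": 2020}

print(model, thisdict, dup)
```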
Numpy
Pandas
Matplotlib
FINAL PROJECT
PROJECT STATEMENT:
Your client is a retail banking institution. Term deposits are a
major source of income for a bank. A term deposit is a cash
investment held at a financial institution. Your money is
invested for an agreed rate of interest over a fixed amount of
time, or term.
The bank has various outreach plans to sell term deposits to
its customers, such as email marketing, advertisements,
telephonic marketing and digital marketing. Telephonic
marketing campaigns remain one of the most effective
ways to reach out to people. However, they require a huge
investment, as large call centers are hired to actually execute
these campaigns. Hence, it is crucial to identify beforehand the
customers most likely to convert so that they can
be specifically targeted via call.
You are provided with client data such as the age of the client, their job
type, their marital status, etc. Along with the client data, you are also
provided with information about the call, such as the duration of the call,
the day and month of the call, etc. Given this information, your task is to
predict whether the client will subscribe to a term deposit.
DATA PROVIDED:
You are provided with the following files:
train.csv : Use this dataset to train the model. This file contains all the
client and call details as well as the target variable “subscribed”. You have
to train your model using this file.
test.csv : Use the trained model to predict whether a new set of clients
will subscribe to the term deposit.
Information provided in test.csv file
Information provided in train.csv file
Libraries which are used in this project
So, 3715 users out of a total of 31647 have subscribed, which is around 12%.
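The share quoted above can be checked with plain arithmetic. This is only a sketch using the two counts from the text. With the training data loaded into a DataFrame (assumed here to be called `train`), `train['subscribed'].value_counts(normalize=True)` should give the same proportions.

```python
# Counts taken from the text above.
subscribed_yes = 3715
total_clients = 31647

# Share of clients who subscribed.
share_yes = subscribed_yes / total_clients

# Share of clients who did not subscribe; a model that always predicts "no"
# would reach roughly this accuracy on the training data.
share_no = 1 - share_yes

print(f"subscribed: {share_yes:.1%}, not subscribed: {share_no:.1%}")
```

Because the classes are this unbalanced, it helps to keep the "always no" figure in mind when reading accuracy scores later on.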
Let's now explore the variables to get a better understanding of the
dataset. We will first explore the variables individually using univariate
analysis, then we will look at the relation between the various independent
variables and the target variable. We will also look at the correlation plot to
see which variables affect the target variable most.
Now let's look at the different types of jobs the clients have.
As job is a categorical variable, we will look at its frequency table.
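A frequency table for a categorical column can be produced with value_counts(). The sketch below uses a tiny made-up DataFrame in place of train.csv. The column name `job` follows the text, but the rows are invented for illustration.

```python
import pandas as pd

# Toy stand-in for train.csv; these rows are invented, not real data.
train = pd.DataFrame(
    {"job": ["management", "student", "retired", "management", "services", "management"]}
)

# Frequency table: number of clients per job type, most common first.
job_counts = train["job"].value_counts()

# The same table expressed as proportions of all clients.
job_shares = train["job"].value_counts(normalize=True)

print(job_counts)
print(job_shares)
```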
Bivariate Analysis
From the above graph we can infer that students and retired people have
higher chances of subscribing to a term deposit, which is surprising as
students generally do not subscribe to a term deposit. A possible reason
is that the number of students in the dataset is small, so compared with
other job types, a larger proportion of students have subscribed to a term deposit.
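The comparison behind such a graph can be sketched with a cross-tabulation of job against the target, converted to row-wise proportions. The data below is invented for illustration. Only the column names `job` and `subscribed` come from the text.

```python
import pandas as pd

# Toy stand-in for train.csv; these rows are invented, not real data.
train = pd.DataFrame({
    "job": ["student", "student", "retired", "retired",
            "management", "management", "management", "management"],
    "subscribed": ["yes", "no", "yes", "no", "no", "no", "no", "yes"],
})

# Counts of yes/no per job type.
counts = pd.crosstab(train["job"], train["subscribed"])

# Divide each row by its total so job types of different sizes are comparable.
rates = counts.div(counts.sum(axis=1), axis=0)

print(counts)
print(rates)
```

Looking at `counts` next to `rates` shows how a small group can have a high subscription rate while contributing few subscribers in absolute terms.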
We can infer that clients with no previous default have slightly higher
chances of subscribing to a term deposit than clients who
have a previous default history.
Let's now look at how correlated our numerical variables are. We will compute
the correlation between each pair of these variables; pairs with
large positive or negative values are strongly correlated. This gives
an overview of the variables which might affect our target variable. We
will convert our target variable into numeric values first.
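The two steps described above (encode the target as numbers, then compute pairwise correlations) might look like the following sketch. The values are invented. The columns `age` and `duration` are named in the problem statement, and the 0/1 encoding for "no"/"yes" is an assumption.

```python
import pandas as pd

# Toy stand-in for train.csv; these rows are invented, not real data.
train = pd.DataFrame({
    "age": [25, 40, 33, 58, 47, 29],
    "duration": [60, 420, 90, 600, 120, 480],
    "subscribed": ["no", "yes", "no", "yes", "no", "yes"],
})

# Convert the target into numeric values (assumed encoding: no -> 0, yes -> 1).
train["subscribed"] = train["subscribed"].map({"no": 0, "yes": 1})

# Pairwise Pearson correlation between all numeric columns.
corr = train.corr()

# Correlation of every variable with the target, strongest first.
print(corr["subscribed"].sort_values(ascending=False))
```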
We can infer that the duration of the call is highly correlated with the target
variable. This also makes intuitive sense: the longer the call, the more likely
it is that the client is showing interest in the term deposit, and
hence the more likely it is that the client will subscribe to a term deposit.
Next, we will start to build our predictive model to predict whether a client
will subscribe to a term deposit or not. As sklearn models take only
numerical input, we will convert the categorical variables into numerical
values using dummies. We will remove the ID variable, as its values are unique
to each client, and then apply dummies. We will also remove the target variable and
Model Building
Logistic Regression
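The model-building steps can be sketched end to end as follows. This is a minimal illustration on invented data, not the project's actual code. The column names `ID` and `subscribed` and the name `clf` come from the text, while the toy rows, the 0/1 target encoding, the 25% validation split, `random_state` and `max_iter` are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in for train.csv; these rows are invented, not real data.
train = pd.DataFrame({
    "ID": range(1, 13),
    "job": ["student", "retired", "management", "services"] * 3,
    "duration": [500, 450, 60, 80, 520, 610, 90, 40, 480, 700, 30, 100],
    "subscribed": ["yes", "yes", "no", "no"] * 3,
})

# Keep the target separately and encode it as 0/1 (assumed encoding).
target = train["subscribed"].map({"no": 0, "yes": 1})

# Drop the ID and target columns, then one-hot encode the categorical columns.
features = pd.get_dummies(train.drop(columns=["ID", "subscribed"]), dtype=int)

# Hold out part of the data as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    features, target, test_size=0.25, random_state=12, stratify=target
)

# Fit the logistic regression model and score it on the validation set.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
val_accuracy = accuracy_score(y_val, clf.predict(X_val))

print(f"validation accuracy: {val_accuracy:.2f}")
```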
We got an accuracy of more than 90% on the validation set. You can try to
improve the score by tuning the hyperparameters of the model. Let's now
make the predictions on the test dataset. We will make the same changes to
the test set as we made to the training set before making the
predictions.
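# drop the ID column before encoding, as was done for the training set;
# the original `test` frame is kept so its ID column can be used later
test_features = pd.get_dummies(test.drop('ID', axis=1))
test_prediction = clf.predict(test_features)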
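One practical detail when applying dummies to the test set is that a category may appear in only one of the two files, so the encoded columns can differ. The sketch below shows one way to align them with reindex(). The data and the names `train_features`, `test_raw` and `test_features` are hypothetical and used only for illustration.

```python
import pandas as pd

# Toy encoded training features; these rows are invented, not real data.
train_features = pd.get_dummies(
    pd.DataFrame({"job": ["student", "retired", "management"],
                  "duration": [300, 200, 50]}),
    dtype=int,
)

# Toy test set containing a job type that the training data never saw.
test_raw = pd.DataFrame({"ID": [101, 102],
                         "job": ["student", "services"],
                         "duration": [120, 80]})

# Apply the same preparation as for the training set: drop ID, then encode.
test_features = pd.get_dummies(test_raw.drop(columns=["ID"]), dtype=int)

# Align to the training columns: missing dummies are filled with 0 and
# columns the model never saw are dropped.
test_features = test_features.reindex(columns=train_features.columns, fill_value=0)

print(test_features)
```

After this step the test features have the same columns in the same order as the data the model was fitted on, which is what predict() expects.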
Finally, we will save these predictions into a csv file. You can then open this
csv file and copy and paste the predictions into the provided Excel file to
generate your score.
submission = pd.DataFrame()
# create the ID and subscribed columns and save the predictions in them
submission['ID'] = test['ID']
submission['subscribed'] = test_prediction
# convert the numeric predictions back to the 'no'/'yes' labels
submission['subscribed'] = submission['subscribed'].replace({0: 'no', 1: 'yes'})
# write the predictions to disk without the DataFrame index
submission.to_csv('submission.csv', header=True, index=False)
This produces the submission file, which contains the predicted value for
each client.