
INTERNSHIP

ON
DATA SCIENCE

BY
K.ANIL
FROM
TABLE OF CONTENTS

 Introduction to Data Science
 Python for Data Science
 Understanding statistics for Data Science
 Predictive modeling and basics of Machine Learning
 About the final project
What is Data Science?

 Data Science is about finding patterns in data through analysis and making future predictions.
 By using Data Science, companies are able to make:
 Better decisions (should we choose A or B?)
 Predictive analyses (what will happen next?)
 Pattern discoveries (finding patterns, or hidden information, in the data)
Structured vs Unstructured data
oStructured data refers to data that is
organized and formatted in a specific
way to make it easily readable and
understandable by both humans and
machines.

oStructured data is typically found in


databases and spreadsheets, and is
characterized by its organized nature.

oStructured data is highly valuable because it can be easily searched,


queried, and analyzed using various tools and techniques
Python for Data Science

o List: Lists are used to store multiple items in a single variable.
• Lists are one of 4 built-in data types in Python used to store collections of data; the other 3 are Tuple, Set, and Dictionary, all with different qualities and usage.
• Lists are created using square brackets:
Example: thislist = ["apple", "banana", "cherry"]

Method       Description
append()     Adds an element at the end of the list
clear()      Removes all the elements from the list
copy()       Returns a copy of the list
sort()       Sorts the list
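As a quick illustration, a minimal sketch of these methods in action, reusing the fruit list from the example above:

thislist = ["apple", "banana", "cherry"]

thislist.append("orange")    # adds "orange" at the end of the list
backup = thislist.copy()     # returns an independent copy of the list
thislist.sort()              # sorts the list in place, alphabetically
print(thislist)              # ['apple', 'banana', 'cherry', 'orange']
thislist.clear()             # removes all the elements from the list
print(thislist)              # []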
o Tuple: Tuples are used to store multiple items in a single variable.
• Tuples are written with round brackets.
• A tuple is a collection which is ordered and unchangeable.
Example of tuple:
thistuple = ("apple", "banana", "cherry")

Method       Description
count()      Returns the number of times a specified value occurs in a tuple
index()      Searches the tuple for a specified value and returns the position where it was found
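A minimal sketch of the two tuple methods (a repeated "apple" is added just to make count() meaningful):

thistuple = ("apple", "banana", "cherry", "apple")

print(thistuple.count("apple"))    # 2: "apple" occurs twice in the tuple
print(thistuple.index("banana"))   # 1: position where "banana" was found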
• Dictionary: Dictionaries are used to store data values in key:value pairs.
• A dictionary is a collection which is ordered (as of Python 3.7), changeable, and does not allow duplicate keys.
Example: thisdict = {
  "brand": "Ford",
  "model": "Mustang",
  "year": 1964
}

Method       Description
clear()      Removes all the elements from the dictionary
copy()       Returns a copy of the dictionary
fromkeys()   Returns a dictionary with the specified keys and value
get()        Returns the value of the specified key
items()      Returns a list containing a tuple for each key:value pair
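A minimal sketch of these dictionary methods, reusing the car dictionary from the example above:

thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}

print(thisdict.get("model"))              # Mustang
print(thisdict.items())                   # dict_items([('brand', 'Ford'), ...])
blank = dict.fromkeys(["a", "b"], 0)      # {'a': 0, 'b': 0}
spare = thisdict.copy()                   # independent copy of the dictionary
thisdict.clear()                          # thisdict is now {}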
Statistics for Data Science

• Data scientists need to understand the fundamental concepts of descriptive statistics and probability theory, which include the key concepts of probability distributions, statistical significance, hypothesis testing, and regression.

• Competency in statistics, computer programming, and information technology can lead to a successful career in a wide range of industries. Data scientists are needed almost everywhere, from health care and science to business and banking.
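As a small illustration of statistical significance and hypothesis testing, a minimal sketch using NumPy and SciPy; the simulated sample and the hypothesized mean of 100 are invented for the example:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=103, scale=10, size=50)   # simulated measurements

print(sample.mean(), sample.std())                # descriptive statistics
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
print(t_stat, p_value)   # a small p-value suggests the sample mean differs significantly from 100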
ROLE OF MACHINE LEARNING IN DATA SCIENCE

 Machine learning analyzes and examines large chunks of data automatically.

 It automates the data analysis process and makes predictions in real time without any human involvement.

 You can further build and train the data model to make real-time predictions.
Some basic libraries imported in Data Science:

 NumPy
 Pandas
 Matplotlib
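A minimal sketch of the conventional import aliases for these three libraries:

import numpy as np                 # numerical arrays and linear algebra
import pandas as pd                # tabular data (DataFrame, Series)
import matplotlib.pyplot as plt    # plotting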
FINAL PROJECT
PROJECT STATEMENT:
 Your client is a retail banking institution. Term deposits are a
major source of income for a bank. A term deposit is a cash
investment held at a financial institution. Your money is
invested for an agreed rate of interest over a fixed amount of
time, or term.
 The bank has various outreach plans to sell term deposits to its customers, such as email marketing, advertisements, telephonic marketing, and digital marketing. Telephonic marketing campaigns still remain one of the most effective ways to reach out to people. However, they require huge investment, as large call centers are hired to actually execute these campaigns. Hence, it is crucial to identify beforehand the customers most likely to convert, so that they can be specifically targeted via call.
You are provided with client data such as: age of the client, their job type, their marital status, etc. Along with the client data, you are also provided with information about the call, such as the duration of the call, day and month of the call, etc. Given this information, your task is to predict whether the client will subscribe to a term deposit.
DATA PROVIDED:
You are provided with the following files:
train.csv: Use this dataset to train the model. This file contains all the client and call details as well as the target variable "subscribed". You have to train your model using this file.
test.csv: Use the trained model to predict whether a new set of clients will subscribe to the term deposit.
Information provided in test.csv file
Information provided in train.csv file
Libraries used in this project

 pandas is used to read the csv files.
 The columns attribute lists the columns present in the train.csv and test.csv files.
 shape returns the number of rows and columns of the data.
 dtypes returns the data types of the columns of the data.
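A minimal sketch of these steps, assuming train.csv and test.csv sit in the working directory:

import pandas as pd

train = pd.read_csv("train.csv")   # read the csv files
test = pd.read_csv("test.csv")

print(train.columns)               # columns present in train.csv
print(train.shape)                 # (number of rows, number of columns)
print(train.dtypes)                # data type of each column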
Univariate Analysis
Now let's look at the distribution of our target variable, i.e. subscribed. As it is a categorical variable, let us look at its frequency table, percentage distribution, and bar plot.

 The proportions of subscribed and unsubscribed clients can be obtained as follows.
 Plot the bar graph for the obtained ratios.
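A minimal sketch of the frequency table, proportions, and bar plot, assuming the train DataFrame loaded above:

import matplotlib.pyplot as plt

print(train["subscribed"].value_counts())                 # frequency table
print(train["subscribed"].value_counts(normalize=True))   # proportions

train["subscribed"].value_counts().plot.bar()             # bar plot of the counts
plt.show()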

So, 3715 users out of a total of 31647 have subscribed, which is around 12%. Let's now explore the variables to get a better understanding of the dataset. We will first explore the variables individually using univariate analysis, then we will look at the relation between the various independent variables and the target variable. We will also look at the correlation plot to see which variables affect the target variable most.
Now let's look at the different types of jobs of the clients. As job is a categorical variable, we will look at its frequency table, shown in the sketch below.
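A short sketch of that frequency table and its bar plot, continuing the session above:

print(train["job"].value_counts())       # frequency of each job type
train["job"].value_counts().plot.bar()
plt.show()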
Bivariate Analysis
From the above graph we can infer that students and retired people have higher chances of subscribing to a term deposit, which is surprising, as students generally do not subscribe to a term deposit. The possible reason is that the number of students in the dataset is small and, compared to other job types, a higher proportion of students have subscribed to a term deposit.
We can infer that clients with no previous default have slightly higher chances of subscribing to a term deposit than clients with a previous default history.
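A minimal sketch of the bivariate comparison behind these graphs, assuming the columns are named job, default, and subscribed:

import pandas as pd
import matplotlib.pyplot as plt

# proportion of subscribers within each job type
pd.crosstab(train["job"], train["subscribed"], normalize="index").plot.bar(stacked=True)
plt.show()

# the same comparison for the default variable
print(pd.crosstab(train["default"], train["subscribed"], normalize="index"))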
Let's now look at how correlated our numerical variables are. We will compute the correlation between each pair of these variables; pairs with high positive or negative values are strongly correlated. This gives an overview of which variables might affect our target variable. We will convert our target variable into numeric values first.
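A minimal sketch of that step; the subscribed_num column name is my own:

# convert the target into numeric values, then inspect correlations with it
train["subscribed_num"] = train["subscribed"].map({"no": 0, "yes": 1})
corr = train.select_dtypes(include="number").corr()
print(corr["subscribed_num"].sort_values(ascending=False))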
We can infer that the duration of the call is highly correlated with the target variable. This makes intuitive sense: the longer the call, the more interest the client is showing in the term deposit, and hence the higher the chances that the client will subscribe.

Next, we will start building our predictive model to predict whether a client will subscribe to a term deposit or not. As the sklearn models take only numerical input, we will convert the categorical variables into numerical values using dummies. We will remove the ID variable, as its values are unique, and then apply dummies. We will also remove the target variable and keep it separately, as sketched below.
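A minimal sketch of this preprocessing plus a train/validation split; variable names such as features and x_train are my own:

import pandas as pd
from sklearn.model_selection import train_test_split

target = train["subscribed_num"]                   # numeric target from above
features = train.drop(columns=["ID", "subscribed", "subscribed_num"])
features = pd.get_dummies(features)                # one-hot encode the categorical variables

x_train, x_val, y_train, y_val = train_test_split(
    features, target, test_size=0.2, random_state=42)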
Model Building
Logistic Regression
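A minimal sketch of fitting and scoring the model on the split above; the solver settings are an assumption, not the original notebook's:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lreg = LogisticRegression(max_iter=1000)
lreg.fit(x_train, y_train)
print(accuracy_score(y_val, lreg.predict(x_val)))   # validation accuracy (the slides report ~90%)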

 We got an accuracy score of around 90% on the validation dataset.

Logistic regression has a linear decision boundary. What if our data has non-linearity? We need a model that can capture this non-linearity. Let's try the decision tree algorithm now to check whether we get better accuracy with it.
Decision Tree
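A minimal sketch, reusing the split above; clf matches the variable name used in the prediction step below, while max_depth=4 is an assumed setting:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(x_train, y_train)
print(accuracy_score(y_val, clf.predict(x_val)))    # validation accuracy (the slides report >90%)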

 We got an accuracy of more than 90% on the validation set. You can try to improve the score by tuning the hyperparameters of the model. Let's now make the predictions on the test dataset. We will make the same changes to the test set as we made to the training set before making the predictions.
test_ids = test["ID"]                          # keep the IDs for the submission file
test = pd.get_dummies(test.drop(columns=["ID"]))
test = test.reindex(columns=features.columns, fill_value=0)   # align with the training columns
test_prediction = clf.predict(test)

Finally, we will save these predictions into a csv file. You can then open this csv file and copy-paste the predictions into the provided excel file to generate the score.

submission = pd.DataFrame()
# create the ID and subscribed columns and save the predictions in them
submission['ID'] = test_ids
submission['subscribed'] = test_prediction
# map the numeric predictions back to the 'no'/'yes' labels
submission['subscribed'] = submission['subscribed'].replace({0: 'no', 1: 'yes'})
submission.to_csv('submission.csv', header=True, index=False)

This produces the submission file containing the predicted values.
