Krishna Report
on
FRAUD DETECTION ON BANK PAYMENTS
Submitted in partial fulfilment of the requirements for the award of the degree
of
BACHELOR OF TECHNOLOGY
In
ELECTRICAL AND ELECTRONICS ENGINEERING
Submitted by
R.S.V.KRISHNA
21B91A02C6
(2024)
S.R.K.R. ENGINEERING COLLEGE (A)
(Affiliated to JNTU KAKINADA) (Recognized by A.I.C.T.E, New Delhi)
(Accredited by N.B.A., NAAC with ‘A+’ grade, New Delhi)
CHINNA AMIRAM, BHIMAVARAM-534204
Certificate
This is to certify that the Summer Internship Report titled “FRAUD DETECTION ON
BANK PAYMENTS” is the bonafide work done by Mr. S. Vijay Kumar Raju
(21B91A02D0) at the end of the second year of B.Tech., at Next24tech Pvt Ltd,
Hyderabad, from 10-11-2023 to 10-01-2024, in partial fulfilment of the requirements for the
award of the degree of ‘Bachelor of Technology’ with specialization in Electrical and
Electronics Engineering at S.R.K.R. Engineering College (A), Bhimavaram.
Head of the Department

VALID INTERNSHIP CERTIFICATE Given by the COMPANY
DECLARATION
Table of Contents

1. Introduction
   1. What are the different types of Machine Learning?
   2. Benefits of Using Machine Learning in Bank Marketing
   3. About Industry (Bank Marketing)
   4. AI / ML Role in Bank Marketing
2. Bank Marketing Dataset Content
   1. Main Drivers for AI Bank Marketing
   2. Internship Project - Data Link
3. AI / ML Modelling and Results
   1. Problem Statement
   2. Data Science Project Life Cycle
      1. Data Exploratory Analysis
      2. Data Pre-processing
         1. Check the Duplicate and Low Variation Data
         2. Identify and Address the Missing Variables
         3. Handling of Outliers
         4. Categorical Data and Encoding Techniques
         5. Feature Scaling
      3. Selection of Dependent and Independent Variables
      4. Data Sampling Methods
         1. Stratified Sampling
         2. Simple Random Sampling
In general, datasets which contain marketing data can be used for two different business goals:
1. Prediction of the results of the marketing campaign for each customer, and clarification of the
factors which affect the campaign results. This helps us find ways to make marketing
campaigns more efficient.
2. Finding out customer segments, using data on customers who subscribed to a term deposit.
This helps to identify the profile of a customer who is more likely to acquire the product,
and to develop more targeted marketing campaigns.
This dataset contains bank marketing campaign data, and we can use it to optimize marketing
campaigns to attract more customers to a term deposit subscription.
In order to optimize marketing campaigns with the help of a dataset, we will have to take the
following steps:
1. Import data from the dataset and perform an initial high-level analysis (a minimal sketch is given below).
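As an illustration of this first step, a minimal sketch is given below; it assumes the CSV downloaded from the Kaggle link has been saved locally, and the file name bank_payments.csv is only a placeholder, not the actual dataset file name.

# A minimal, illustrative sketch of step 1 (not the report's exact code).
# "bank_payments.csv" is a placeholder name for the CSV downloaded from the Kaggle link.
import pandas as pd

df = pd.read_csv("bank_payments.csv")

print(df.shape)              # number of rows and columns
print(df.head())             # first few records
df.info()                    # column types and non-null counts
print(df.describe())         # summary statistics for numeric columns
print(df.isna().sum())       # missing values per column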
With the increasing power of computer technology, companies and institutions can
nowadays store large amounts of data at reduced cost. The amount of available data is
increasing exponentially and cheap disk storage makes it easy to store data that previously
was thrown away. There is a huge amount of information locked up in databases that is
potentially important but has not yet been explored. The growing size and complexity of
the databases make it hard to analyse the data manually, so it is important to have
automated systems to support the process. Hence there is a need for computational tools
that can handle these large amounts of data and extract valuable information.
In this context, Data Mining provides automated systems capable of processing large
amounts of data that are already present in databases. Data Mining is used to
automatically extract important patterns and trends from databases seeking regularities or
patterns that can reveal the structure of the data and answer business problems. Data mining
includes learning techniques that fall within the field of machine learning. The growth of
databases in recent years has brought data mining to the forefront of new business technologies.
A key challenge for the insurance industry is to charge each customer an appropriate price
for the risk they represent. Risk varies widely from customer to customer and a deep
understanding of different risk factors helps predict the likelihood and cost of insurance
claims. The goal of this program is to see how well various statistical methods perform in
predicting auto insurance claims based on the characteristics of the driver, the vehicle, and the
driver/vehicle coverage details.
A number of factors determine BI claims prediction, among them a driver's age, past
accident history, and domicile. However, this contest focused on the relationship
between claims and vehicle characteristics, as well as other characteristics associated with the
auto insurance policies.
1.1. What are the different types of Machine Learning?
There are two main types: supervised learning, which trains a model on known input and
output data so that it can predict future outputs, and unsupervised learning, which finds
hidden patterns or intrinsic structures in input data.
Supervised Learning:
Supervised machine learning builds a model that makes predictions based on evidence
in the presence of uncertainty. A supervised learning algorithm takes a known set of
input data and known responses to the data (output) and trains a model to generate
reasonable predictions for the response to new data. Use supervised learning if you
have known data for the output you are trying to predict.
Supervised learning uses Regression and Classification techniques to develop predictive
models.
Use regression techniques if you are working with a data range or if the nature of
your response is a real number, such as temperature or the time until failure for a
piece of equipment.
Use classification if your data can be tagged, categorized, or separated into specific
groups or classes. For example, applications for hand-writing recognition use
classification to recognize letters and numbers.
Unsupervised Learning:
In image processing and computer vision, unsupervised pattern recognition techniques
are used for object detection and image segmentation.
For example, if a cell phone company wants to optimize the locations where they build
cell phone towers, they can use machine learning to estimate the number of clusters of
people relying on their towers. A phone can only talk to one tower at a time, so the
team uses clustering algorithms to design the best placement of cell towers to
optimize signal reception for groups, or clusters, of their customers.
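The sketch below illustrates this clustering idea on synthetic customer coordinates; it is an example of the technique, not code taken from the internship project.

# Illustrative sketch of the clustering idea above, using synthetic customer coordinates.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
locations = rng.uniform(0, 100, size=(500, 2))   # synthetic (x, y) customer positions

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(locations)
print(kmeans.cluster_centers_)                   # candidate tower locations
print(np.bincount(kmeans.labels_))               # customers assigned to each cluster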
ML could also become a powerful tool for education in the future, providing creative
techniques to help students study.
The bank marketing industry has been going through massive transformations in
the past couple of years, with more emphasis on customized loan plans and an
increasing level of market competition. Bank marketing is known for its nature of
developing a unique brand image, which is treated as the reputational capital of the
financial institution. It is very important for a bank to develop good relationships with
valued customers, accompanied by innovative ideas which can be used as measures
to meet their requirements. Basically, banks engage in transactions of products and
services through their retail outlets, known as branches, with different customers at the
grassroots level. This is referred to as the ‘top to bottom’ approach. Relationship banking can
be defined as a process that includes proactively predicting the demands of
individual bank customers and taking steps to meet these demands before the
client expresses them.
1.4. AI / ML Role in Bank Marketing:
The data is related to the direct marketing campaigns of a Portuguese banking institution.
The marketing campaigns were based on phone calls. Often, more than one contact with the
same client was required in order to assess whether the product (bank term deposit) was subscribed
or not. The data set has 16 predictor variables and about 45K rows. Customers who received phone
calls need not be unique, and the same customer might have received multiple phone calls.
The main factors for term deposit claims are Age, Job, Education, Marital, Default, Housing,
Loan, Balance, Contact, Month, Day, Duration, Campaign, Previous Outcomes, Days, …
The data is also related to the customers’ age, gender, zipcode, merchant, category, amount, …
The internship project data has been taken from Kaggle and the link is:
https://www.kaggle.com/code/gustavooff/fraud-detection-on-bank-payments/notebook
3. AI / ML Modelling and Results
Predictive models are most effective when they are constructed using a company’s own
historical claims data since this allows the model to recognize the specific nature of a
company’s exposure as well as its claims practices. The construction of the model also
involves input from the company throughout the process, as well as consideration of industry
leading claims practices and benchmarks.
Predictive modelling can be used to quantify the impact on the claims department resulting from
the failure to meet or exceed leading claim service practices. It can also be used to identify the
root cause of claim leakage. Proper use of predictive modelling will allow for potential savings
across two dimensions:
Early identification of claims with the potential for high leakage, thereby allowing for
the proactive management of the claim
Data Science is a multidisciplinary field of study that combines programming skills, domain
expertise and knowledge of statistics and mathematics to extract useful insights and knowledge
from data.
In simple terms, a data science life cycle is nothing but a repetitive set of steps that you need to
take to complete and deliver a project/product to your client.
Although the projects and the teams involved in developing and deploying the
model will differ, the data science life cycle will be slightly different in every
company.
However, most data science projects happen to follow a broadly similar process.
In order to start and complete a data science-based project, we need to understand the various roles
and responsibilities of the people involved in building and developing the project.
1. Data Exploratory Analysis
2. Data Pre-processing
Data pre-processing transforms the data into a format that is more easily and effectively
processed in data mining, machine learning and other data science tasks. The techniques
are generally used at the earliest stages of the machine learning and AI development
pipeline to ensure accurate results.
3.2.1.1 Check the Duplicate and low variation data
Two things distinguish top data scientists from others in most cases: Feature Creation and
Feature Selection. i.e., creating features that capture deeper/hidden insights about the business
or customer and then making the right choices about which features to choose for your model.
2. Duplicate Index (the values of two features are different, but they occur at the same
index). Duplicate features lead to the problem of multicollinearity:
In the case of linear models, the weight distribution between the two
features will be problematic.
If you are using tree-based models, it won’t matter unless you are looking
at feature importance.
In the case of distance-based models, it will make that feature count more
in the distance.
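A minimal sketch of how such checks could be implemented is given below; the approach (dropping duplicate rows, duplicate columns, and near-constant numeric columns) is an assumption for illustration, not the report's exact procedure.

# Illustrative sketch: remove duplicate rows, duplicate columns, and near-constant
# (low variation) numeric columns before modelling.
import pandas as pd

def drop_duplicates_and_low_variance(df: pd.DataFrame, var_threshold: float = 0.0) -> pd.DataFrame:
    df = df.drop_duplicates()                 # duplicate rows
    df = df.loc[:, ~df.T.duplicated()]        # duplicate columns (same values, different names)

    numeric = df.select_dtypes("number")
    low_var = numeric.columns[numeric.var() <= var_threshold]
    return df.drop(columns=low_var)           # near-constant numeric columns

# Usage: cleaned = drop_duplicates_and_low_variance(df)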
Missing data are values that are not recorded in a dataset. They can be a single value missing
in a single cell or missing of an entire observation (row). Missing data can occur both in a
continuous variable (e.g., height of students) or a categorical variable (e.g., gender of a
population).
Missing data are common in any field of natural or behavioral science, but they are particularly
commonplace in social science research data.
So where do the missing values come from, and why do they even exist?
Let’s give an example. Suppose you are administering a questionnaire survey among a sample
of respondents, and in the questionnaire you ask a question about household
income. Now, what if a respondent refuses to answer that question? Would you make
that up, or rather leave the field empty? You would probably leave that cell empty,
creating an instance of a missing value.
• Problems caused
However, if the dataset is relatively small, every data point counts. In these
situations, a missing data point means loss of valuable information.
In any case, missing data generally creates imbalanced observations, causes biased
estimates, and in extreme cases can even lead to invalid conclusions.
Case deletion: if the dataset is relatively large, delete the complete record with a
missing value.
Substitution: substitute missing cells with (a) the column mean, (b) the mean of the
nearest neighbors, (c) a moving average, or (d) the last observation carried forward.
Sensitivity analysis: if the sample is small or the share of missing values is relatively large,
conduct a sensitivity analysis with multiple variations of outcomes.
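The sketch below illustrates the first two strategies listed above; the file name bank_payments.csv and the column name amount are placeholders.

# Illustrative sketch of the missing-value strategies above (assumed code, not from the report).
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("bank_payments.csv")        # placeholder file name

# Case deletion: drop every record that has at least one missing value.
df_deleted = df.dropna()

# Substitution: fill numeric columns with the column mean.
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = SimpleImputer(strategy="mean").fit_transform(df[numeric_cols])

# Other substitutions: forward-fill (last observation) or a rolling (moving) average.
# df = df.ffill()
# df["amount"] = df["amount"].fillna(df["amount"].rolling(5, min_periods=1).mean())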
What is an outlier?
An outlier is an observation that lies an abnormal distance from other values in a random
sample from a population.
There is, of course, a degree of ambiguity. Qualifying a data point as an anomaly leaves it up
to the analyst or model to determine what is abnormal, and what to do with such data
points.
There are also different degrees of outliers, from mild to extreme.
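As an illustration, the sketch below flags potential outliers with the common 1.5 × IQR rule; the file name and the amount column are placeholders, and the rule itself is an assumed choice rather than the report's stated method.

# Illustrative sketch: flag outliers in a numeric column using the 1.5 * IQR rule.
import pandas as pd

def iqr_outlier_mask(series: pd.Series, factor: float = 1.5) -> pd.Series:
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - factor * iqr) | (series > q3 + factor * iqr)

df = pd.read_csv("bank_payments.csv")            # placeholder file name
mask = iqr_outlier_mask(df["amount"])            # "amount" is a hypothetical column
print(f"{mask.sum()} potential outliers out of {len(df)} rows")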
Since we are going to be working on categorical variables in this article, here is a quick
refresher on the same with a couple of examples. Categorical variables are usually
represented as ‘strings’ or ‘categories’ and are finite in number. Here are a few examples:
1. The city where a person lives: Delhi, Mumbai, Ahmedabad, Bangalore, etc.
2. The department a person works in: Finance, Human resources, IT, Production.
3. The highest degree a person has: High school, Diploma, Bachelors, Masters, PhD.
In the above examples, the variables only have a definite set of possible values. Further, we can
see there are two kinds of categorical data: ordinal and nominal.
Label Encoding:
• We use this categorical data encoding technique when the categorical feature is
ordinal. In this case, retaining the order is important. Hence encoding should reflect
the sequence.
• In Label encoding, each label is converted into an integer value. We will create
a variable that contains the categories representing the education qualification
of a person.
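A minimal sketch of label encoding such an ordinal education feature is given below; the categories and the integer mapping are assumptions for the example.

# Illustrative sketch of label encoding an ordinal "education" variable.
import pandas as pd

df = pd.DataFrame({"education": ["High school", "Bachelors", "Masters", "PhD", "Bachelors"]})

# An explicit mapping keeps the natural order of the qualifications.
order = {"High school": 0, "Diploma": 1, "Bachelors": 2, "Masters": 3, "PhD": 4}
df["education_encoded"] = df["education"].map(order)
print(df)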
1. Stratified sampling
Stratified sampling randomly selects data points from the majority class so that they are
equal in number to the data points in the minority class. So, after sampling, both classes
will have the same number of observations.
2. Simple random sampling
Simple random sampling is a sampling technique where a set percentage of the data is
selected randomly. It is generally done to reduce bias in the dataset, which can occur if
data is selected manually without randomizing the dataset.
We used this method to split the dataset into a train dataset containing 70% of
the total data and a test dataset with the remaining 30% of the data (a minimal sketch is given below).
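The sketch assumes a target column named fraud and the placeholder file name bank_payments.csv; both names are illustrative.

# Minimal sketch of the 70% / 30% split described above.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("bank_payments.csv")            # placeholder file name
X = df.drop(columns=["fraud"])                   # "fraud" is an assumed target column
y = df["fraud"]

# Simple random split; passing stratify=y instead keeps the class ratio equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
print(len(X_train), len(X_test))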
The first stage in the sampling process is to clearly define the target population.
• So, to carry out opinion polls, polling agencies consider only the people who are
above 18 years of age and are eligible to vote in the population.
Sampling Frame – it is a list of items or people forming the population from which the
sample is taken.
• So, the sampling frame would be the list of all the people whose names appear on
the voter list of a constituency.
• Generally, probability sampling methods are used because every
vote has equal value and any person can be included in the sample
irrespective of his caste, community, or religion.
Different samples are taken from different regions all over the country.
3.2.5.1 Model 01 (Logistic Regression)
Logistic regression uses the logistic link function to convert likelihood values into probabilities, so we
can get a good estimate of the probability that a particular observation belongs to the positive or the
negative class. It also gives us the p-value of each variable, which tells us about the significance of
each independent variable.
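A minimal sketch of such a model is given below; it assumes the X_train, X_test, y_train, y_test variables from the split sketched earlier and purely numeric features.

# Minimal sketch of fitting a logistic regression classifier.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)
print(classification_report(y_test, y_pred))
# Estimated probability of the positive class for the first few test observations.
print(log_reg.predict_proba(X_test)[:5, 1])

Note that scikit-learn's LogisticRegression does not report p-values; obtaining them would typically require a separate statsmodels fit.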
3.2.5.2 Model 02 (Decision Tree Classifier)
Decision Tree is a Supervised learning technique that can be used for both Classification
and Regression problems, but mostly it is preferred for solving Classification problems. It is a
tree-structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules, and each leaf node represents the outcome.
In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas Leaf
nodes are the output of those decisions and do not contain any further branches.
The decisions or the test are performed based on features of the given dataset.
It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
It is called a decision tree because, like a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
Decision Trees (DTs) are a non-parametric supervised learning method used for
classification and regression. The goal is to create a model that predicts the value of a
target variable by learning simple decision rules inferred from the data features. A tree
can be seen as a piecewise constant approximation.
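A minimal sketch of a decision tree classifier, under the same assumptions as the earlier model sketch (numeric features and the previously created train/test split), is given below.

# Minimal sketch of a decision tree classifier.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier(max_depth=5, random_state=42)   # limit depth to curb overfitting
tree.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, tree.predict(X_test)))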
3.2.5.3 Model 03 (Random Forest Classifier)
Random forest is an algorithm that consists of many decision trees. It was first
developed by Leo Breiman and Adele Cutler. The idea behind it is to build several trees, to
have the instance classified by each tree, and to give a "vote" for a class. The model uses
a "bagging" approach and the random selection of features to build a collection of
decision trees with controlled variance. The instance's class is set to the class with the
highest number of votes, i.e. the class that occurs most often within the leaves in which the
instance is placed.
• The strength of each tree in the forest. A strong tree is a tree with low error. By using
trees that classify the instances with low error the error rate of the forest decreases.
3.2.5.4 Model 04 (Extra Trees Classifier)
This class implements a meta estimator that fits several randomized decision trees
(a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve
the predictive accuracy and control over-fitting.
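The sketch below fits both a random forest and an extra-trees classifier under the same assumptions as the earlier model sketches.

# Minimal sketch of the random forest and extra-trees classifiers.
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
et = ExtraTreesClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

print("random forest accuracy:", accuracy_score(y_test, rf.predict(X_test)))
print("extra trees accuracy:  ", accuracy_score(y_test, et.predict(X_test)))
# feature_importances_ shows how much each feature contributed to the forest's decisions.
print(rf.feature_importances_)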
3.2.5.5 Model 05 (KNN Classifier)
K-Nearest Neighbour (K-NN) is one of the simplest Machine Learning algorithms, based on
the Supervised Learning technique. The K-NN algorithm assumes similarity between the new
case/data and the available cases and puts the new case into the category that is most similar
to the available categories. The K-NN algorithm stores all the available data and classifies a
new data point based on this similarity. This means that when new data appears, it can easily
be classified into a well-suited category using the KNN algorithm.
The K-NN algorithm can be used for Regression as well as for Classification, but mostly it
is used for Classification problems. K-NN is a non-parametric algorithm, which means it
does not make any assumption about the underlying data. It is also called a lazy learner
algorithm because it does not learn from the training set immediately; instead, it stores the
dataset and performs an action on it at the time of classification. At the training phase, the
KNN algorithm just stores the dataset, and when it gets new data, it classifies that data
into a category that is most similar to the new data.
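A minimal K-NN sketch is given below; since K-NN is distance-based, the features are standardized first, which is an assumed preprocessing choice rather than one stated in the report.

# Minimal sketch of a K-NN classifier with feature standardization.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, knn.predict(X_test)))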
3.2.5.6 Model 06 (Gaussian Naïve Bayes Classifier)
Naïve Bayes is a probabilistic machine learning algorithm used for many classification
functions and is based on the Bayes theorem. Gaussian Naïve Bayes is an extension
of Naïve Bayes. While other functions can be used to estimate the data distribution, the Gaussian
or normal distribution is the simplest to implement, as you only need to calculate the
mean and standard deviation of the training data.
Naive Bayes is a probabilistic machine learning algorithm that can be used in several
classification tasks. Typical applications of Naive Bayes are classification of
documents, filtering spam, prediction and so on. This algorithm is based on the
discoveries of Thomas Bayes and hence its name.
The name “Naïve” is used because the algorithm incorporates features in its
model that are independent of each other. Any modifications in the value of one
feature do not directly impact the value of any other feature of the algorithm. The
main advantage of the Naïve Bayes algorithm is that it is a simple yet powerful
algorithm.
It is based on a probabilistic model where the algorithm can be coded easily, and
predictions can be made quickly in real time. Hence this algorithm is a typical choice for
solving real-world problems, as it can be tuned to respond to user requests instantly.
But before we dive deep into Naïve Bayes and Gaussian Naïve Bayes, we must
know what is meant by conditional probability.
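A minimal Gaussian Naïve Bayes sketch, under the same assumptions as the earlier model sketches, is given below.

# Minimal sketch of a Gaussian Naïve Bayes classifier; the model estimates a mean and
# standard deviation per feature and per class from the training data.
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

gnb = GaussianNB()
gnb.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, gnb.predict(X_test)))
print(gnb.theta_)    # per-class feature means learned from the training data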
3.2.5.7 Model 07 (SVM Classifier)
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However, it is
primarily used for Classification problems in Machine Learning.
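A minimal SVM sketch is given below; the RBF kernel, the feature standardization, and the earlier train/test variables are assumed choices for illustration.

# Minimal sketch of an SVM classifier; features are standardized because SVMs are
# sensitive to feature scale.
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print(classification_report(y_test, svm.predict(X_test)))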