Fraud Detection System Micro-Project

INTRODUCTION

Nowadays the usage of credit cards has increased dramatically. As the credit
card becomes the most popular mode of payment for both online and in-store
purchases, the fraud associated with it is also rising. In this paper, we
model the sequence of operations in credit card transaction processing using
a decision tree and a deep neural network, and show how they can be used to
detect fraud. Both algorithms are initially trained on the normal behaviour
of a cardholder. If an incoming credit card transaction is not accepted by
the trained model with sufficiently high probability, it is considered
fraudulent. At the same time, we try to ensure that genuine transactions are
not rejected. We present detailed experimental results to show the
effectiveness of our approach and compare it with other techniques available
in the literature.


Motivation

 The prediction model will tell you whether or not to invest in a
proposal. Here, we choose to minimize the risk of investing, i.e. we aim to
avoid investing in proposals for which the loan will not be paid back.

Issues

Credit card fraud is a criminal offense. It causes severe damage to
financial institutions and individuals. Therefore, the detection and
prevention of fraudulent activities are critically important to financial
institutions. Fraud detection and prevention are costly, time-consuming and
labour-intensive tasks. A number of significant research works have been
dedicated to developing innovative solutions to detect different types of
fraud. However, these solutions have proved ineffective. According to Cifas,
33,305 cases of credit card identity fraud were reported between January and
June 2018.

Scope of The Project

 In this project we designed a protocol, or model, to detect fraudulent
activity in credit card transactions.

 This system is capable of providing most of the essential features
required to distinguish fraudulent from legitimate transactions.

 As technology changes, it becomes difficult to track the behaviour and
pattern of fraudulent transactions.

 With the rise of machine learning, artificial intelligence and other
relevant fields of information technology, it becomes feasible to automate
this process and to save some of the intensive labour that is put into
detecting credit card fraud.

Abstract
Our project focuses on credit card fraud detection in the real world. We
first collect credit card datasets to build the training set, then take the
user's credit card queries as the testing set. A random forest classifier is
applied to the already analysed dataset together with the user's current
data, and the accuracy of the result is then optimized. By processing some
of the provided attributes, we can identify transactions affected by fraud
and view them in a graphical model visualization. The performance of the
techniques is evaluated based on accuracy, sensitivity, specificity and
precision. The results indicate an optimal accuracy of 98.6% for the
decision tree.

Existing System

The existing system describes a case study of credit card fraud detection in
which data normalization is applied before cluster analysis. The results
obtained from using cluster analysis and artificial neural networks on fraud
detection show that, by clustering attributes, the neural inputs can be
minimized, and that promising results can be obtained by training an MLP on
normalized data. This research was based on unsupervised learning. The
significance of this work was to find new methods for fraud detection and to
increase the accuracy of the results. The dataset for this study is based on
real-life transactional data from a large European company, with personal
details kept confidential. The accuracy of the algorithm is around 50%. A
related goal was to find an algorithm that reduces the cost measure; the
result obtained was a 23% reduction, and the algorithm found was Bayes
minimum risk.

Disadvantage

 In this approach, a new comparative measure that reasonably represents
the gains and losses due to fraud detection is proposed.
 A cost-sensitive method based on Bayes minimum risk is presented using
the proposed cost measure.

Proposed System

In the proposed system, we apply the random forest algorithm to classify the
credit card dataset. Random forest is an algorithm for classification and
regression; summarily, it is a collection of decision tree classifiers.
Random forest has an advantage over a single decision tree in that it
corrects the habit of overfitting to the training set. A subset of the
training set is sampled randomly to train each individual tree, and then a
decision tree is built; each node then splits on a feature selected from a
random subset of the full feature set.
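
As a rough sketch of this training procedure (assuming scikit-learn and a
CSV dataset with a binary 'Class' label; file and column names are
illustrative, not from the original report):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Load the labelled transactions; "Class" is 1 for fraud, 0 for genuine.
    df = pd.read_csv("creditcard.csv")  # hypothetical file name
    X = df.drop(columns=["Class"])
    y = df["Class"]

    # Hold out part of the data for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)

    # Each tree is fit on a bootstrap sample of the training set, and each
    # split considers only a random subset of the features.
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))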

Advantage

 Random forest ranks the importance of variables in a regression or
classification problem in a natural way, which cannot be done with a single
decision tree (see the sketch after this list).

 The 'amount' feature is the transaction amount. The 'class' feature is
the target of the binary classification: it takes value 1 for the positive
case (fraud) and 0 for the negative case (non-fraud).
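
The variable ranking from the first point is exposed directly by
scikit-learn's feature_importances_ attribute; a sketch reusing the model
and X from the snippet above:

    import pandas as pd

    # Rank features by the importance the forest assigned to them.
    importances = pd.Series(model.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False).head(10))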
System Architecture

Software and Hardware Requirements

Hardware

 OS: Windows 7, 8 or 10 (32- and 64-bit)

 RAM: 4 GB

Software

 Python

 Anaconda

PROBLEM STATEMENT

Billions of dollars are lost every year to fraudulent credit card
transactions. Fraud is as old as humanity itself and can take an unlimited
variety of forms. The PwC global economic crime survey of 2017 suggests that
approximately 48% of organizations experienced economic crime, so there is
clearly a need to solve the problem of credit card fraud detection.
Moreover, the development of new technologies provides additional ways in
which criminals may commit fraud. The use of credit cards is prevalent in
modern society, and credit card fraud has kept growing in recent years. The
resulting huge financial losses affect not only merchants and banks, but
also the individuals who use the cards. Fraud may also affect the reputation
and image of a merchant, causing non-financial losses that, though difficult
to quantify in the short term, may become visible in the long run. For
example, if a cardholder is the victim of fraud involving a certain company,
he may no longer trust its business and may choose a competitor.

METHODOLOGY

Various techniques for detecting fraudulent activity in credit card
transactions have been implemented; researchers have developed models based
on artificial intelligence, data mining, fuzzy logic and machine learning.
Credit card fraud detection is an extremely difficult, but also popular,
problem to solve. In our proposed system we build the credit card fraud
detection using machine learning, which has been recognized as a successful
measure for fraud detection. A great deal of data is transferred during
online transaction processes, resulting in a binary outcome: genuine or
fraudulent. Online businesses are able to identify fraudulent transactions
accurately because they receive chargebacks on them. Within the sample
fraudulent datasets, features are constructed. These are data points such as
the age and value of the customer account, as well as the origin of the
credit card. There can be hundreds of features, and each contributes, to
varying extents, towards the fraud probability. Note that the degree to
which each feature contributes to the fraud score is not determined by a
fraud analyst, but is generated by the machine's artificial intelligence,
which is driven by the training set. So, with regard to card fraud, if the
use of cards to commit fraud is proven to be high, the fraud weighting of a
transaction that uses a credit card will be equally so.
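
The fraud score described here corresponds to the probability output of a
trained classifier; a minimal sketch, assuming the fitted scikit-learn model
from the earlier snippet:

    # Probability that each test transaction belongs to class 1 (fraud),
    # learned from the training set rather than set by an analyst.
    fraud_prob = model.predict_proba(X_test)[:, 1]

    # Flag transactions above an illustrative threshold.
    THRESHOLD = 0.5
    flagged = X_test[fraud_prob > THRESHOLD]
    print(f"flagged {len(flagged)} of {len(X_test)} transactions")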

PURPOSE OF THE PROJECT

We propose a machine learning model to detect fraudulent credit card
activities in online financial transactions. Analysing fraudulent
transactions manually is unfeasible due to the huge amount of data and its
complexity. However, given sufficiently informative features, one could
expect this to be possible using machine learning. This hypothesis will be
explored in the project. The objectives are:

 To classify fraudulent and legitimate credit card transactions with a
supervised learning algorithm such as random forest.

 To raise awareness of fraudulent activity without incurring any
financial loss.

MODULES

1. DATA COLLECTION

2. DATA PRE-PROCESSING
3. FEATURE EXTRACTION

4. EVALUATION MODEL

DATA COLLECTION

Data used in this project is a set of records collected from credit card
transactions. This step is concerned with selecting the subset of all
available data that you will be working with. ML problems start with data,
preferably lots of data (examples or observations), for which you already
know the target answer. Data for which you already know the target answer is
called labelled data.
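
A sketch of this step with pandas (the file path and column name are
assumptions, not from the original report):

    import pandas as pd

    # Load the labelled transaction records; the "Class" column holds the
    # known target answer (1 = fraud, 0 = genuine).
    df = pd.read_csv("creditcard_transactions.csv")  # hypothetical path
    print(df.shape)
    print(df["Class"].value_counts())  # how many examples of each label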

DATA PRE-PROCESSING

Organize your selected data by formatting, cleaning and sampling from it.

Three common data pre-processing steps, illustrated in the sketch after this
list, are:

 Formatting: The data you have selected may not be in a format that
is suitable for you to work with. The data may be in a relational
database and you would like it in a flat file, or the data may be in a
proprietary file format and you would like it in a relational database
or a text file.

 Cleaning: Cleaning data is the removal or fixing of missing data. There
may be data instances that are incomplete and do not carry the data you
believe you need to address the problem. These instances may need to be
removed. Additionally, there may be sensitive information in some of the
attributes, and these attributes may need to be anonymized or removed from
the data entirely.
 Sampling: There may be far more selected data available than you
need to work with. More data can result in much longer running
times for algorithms and larger computational and memory
requirements. You can take a smaller representative sample of the
selected data that may be much faster for exploring and prototyping
solutions before considering the whole dataset.
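
A sketch of the cleaning and sampling steps, under the same assumed column
names as before (the sensitive column is hypothetical):

    # Cleaning: remove incomplete instances and a sensitive attribute.
    df = df.dropna()  # drop rows with missing data
    df = df.drop(columns=["CardholderName"], errors="ignore")  # hypothetical

    # Sampling: take a smaller representative sample for prototyping.
    sample = df.sample(frac=0.1, random_state=42)  # keep 10% of the rows
    print(len(sample), "of", len(df), "rows kept for prototyping")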

FEATURE EXTRACTION

The next step, feature extraction, is an attribute reduction process. Unlike
feature selection, which ranks the existing attributes according to their
predictive significance, feature extraction actually transforms the
attributes. The transformed attributes, or features, are linear combinations
of the original attributes. Finally, our models are trained using a
classifier algorithm. We use the classify module of the Natural Language
Toolkit library in Python, with the labelled dataset we gathered. The rest
of our labelled data is used to evaluate the models. Machine learning
algorithms are used to classify the pre-processed data; the chosen
classifier is random forest. These algorithms are very popular in
classification tasks.
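
Principal component analysis is one standard feature extraction of this
kind: the components it produces are exactly linear combinations of the
original attributes. A sketch, reusing X_train from the earlier snippet (PCA
is our illustrative choice; the report does not name a specific extraction
method):

    from sklearn.decomposition import PCA

    # Transform the original attributes into a smaller set of features,
    # each a linear combination of the originals.
    pca = PCA(n_components=10)  # illustrative number of components
    X_reduced = pca.fit_transform(X_train)
    print(X_reduced.shape)
    print("variance explained:", pca.explained_variance_ratio_.sum())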

EVALUATION MODEL

Model evaluation is an integral part of the model development process. It
helps to find the best model that represents our data and to estimate how
well the chosen model will work in the future. Evaluating model performance
with the data used for training is not acceptable in data science because it
can easily generate overoptimistic and overfitted models. There are two
methods of evaluating models in data science: hold-out and cross-validation.
To avoid overfitting, both methods use a test set (not seen by the model) to
evaluate model performance. The performance of each classification model is
estimated based on its average. The result is presented in visual form, as a
representation of the classified data in graphs. Accuracy is defined as the
percentage of correct predictions for the test data. It can be calculated
easily by dividing the number of correct predictions by the total number of
predictions.
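
Both evaluation methods, and the accuracy formula above, can be sketched
with scikit-learn (reusing the earlier split):

    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import cross_val_score

    # Hold-out: score on a test set the model never saw during training.
    y_pred = model.predict(X_test)
    print("hold-out accuracy:", accuracy_score(y_test, y_pred))  # correct / total

    # Cross-validation: average accuracy over 5 train/test folds.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print("5-fold CV accuracy:", scores.mean())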

UML DIAGRAMS
SEQUENCE DIAGRAM
ACTIVITY DIAGRAM
COLLABORATION DIAGRAM
REQUIREMENTS ANALYSIS

SOFTWARE REQUIREMENTS

• Python

• Anaconda Navigator

• Python libraries

o Numpy

o Pandas

o Matplotlib

o Sklearn

o Seaborn

ANACONDA NAVIGATOR

Anaconda Navigator is a desktop graphical user interface (GUI) included in
the Anaconda distribution that allows you to launch applications and easily
manage conda packages, environments and channels without using command-line
commands. Navigator can search for packages on Anaconda Cloud or in a local
Anaconda Repository. It is available for Windows, macOS and Linux.

Why use Navigator?


In order to run, many scientific packages depend on specific
versions of other packages. Data scientists often use multiple
versions of many packages, and use multiple environments to
separate these different versions.

The command-line program conda is both a package manager and an environment
manager. It helps data scientists ensure that each version of each package
has all the dependencies it requires and works correctly.

Navigator is an easy, point-and-click way to work with packages and


environments without needing to type conda commands in a
terminal window. You can use it to find the packages you want,
install them in an environment, run the packages and update them,
all inside Navigator.
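
For comparison, the terminal commands Navigator replaces look roughly like
this (the environment and package names are just examples):

    conda create --name fraud-env python=3.9   # create an environment
    conda activate fraud-env                   # switch into it
    conda install numpy pandas scikit-learn    # install packages into it
    conda update pandas                        # update a single package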

WHAT APPLICATIONS CAN I ACCESS USING NAVIGATOR?

The following applications are available by default in Navigator:

 Jupyter Notebook

 Qt Console

 Spyder

 VS Code

 Glueviz

 Orange 3

 Rodeo

 RStudio

How can I run code with Navigator?

The simplest way is with Spyder. From the Navigator Home tab,
click Spyder, and write and execute your code.
You can also use Jupyter Notebooks the same way. Jupyter Notebooks are an
increasingly popular system that combines your code, descriptive text,
output, images and interactive interfaces into a single notebook file that
is edited, viewed and used in a web browser.

PYTHON

Python is a general-purpose, versatile and popular programming language.
It's great as a first language because it is concise and easy to read, and
it is also a good language to have in any programmer's stack, as it can be
used for everything from web development to software development and
scientific applications.

It has a simple, easy-to-use syntax, making it the perfect language for
someone trying to learn computer programming for the first time.

Features of Python

• A simple language which is easier to learn

Python has a very simple and elegant syntax. It's much easier to read and
write Python programs compared to other languages like C++, Java, or C#.
Python makes programming fun and allows you to focus on the solution rather
than the syntax. If you are a newbie, it's a great choice to start your
journey with Python.

• Free and open source

You can freely use and distribute Python, even for commercial use. Not only
can you use and distribute software written in it, you can even make changes
to Python's source code. Python has a large community constantly improving
it in each iteration.

 Portability

 Extensible and Embeddable

Suppose an application requires high performance. You can easily


combine pieces of C/C++ or other languages with Python code. This
will give your application high performance as well as scripting
capabilities which other languages may not provide out of the box.

 Large standard libraries to solve common tasks

Python has a number of standard libraries which make the life of a
programmer much easier, since you don't have to write all the code yourself.
For example: need to connect to a MySQL database on a web server? You can
use the MySQLdb library (a third-party package) via import MySQLdb.
Libraries in Python are well tested and used by hundreds of people, so you
can be sure that they won't break your application.
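
A sketch of that database example (assuming the MySQLdb package is
installed; the connection details are placeholders):

    import MySQLdb

    # Open a connection and run a trivial query.
    conn = MySQLdb.connect(host="localhost", user="user",
                           passwd="password", db="mydb")
    cursor = conn.cursor()
    cursor.execute("SELECT VERSION()")
    print(cursor.fetchone())
    conn.close()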

 Object-oriented

Everything in Python is an object. Object-oriented programming (OOP) helps
you solve complex problems intuitively. With OOP, you are able to divide
these complex problems into smaller sets by creating objects.

NUMPY

NumPy is the fundamental package for scientific computing in Python. It is a
Python library that provides a multidimensional array object, various
derived objects (such as masked arrays and matrices), and an assortment of
routines for fast operations on arrays, including mathematical, logical,
shape manipulation, sorting, selecting, I/O, discrete Fourier transforms,
basic linear algebra, basic statistical operations, random simulation and
much more.

At the core of the NumPy package is the ndarray object. This encapsulates
n-dimensional arrays of homogeneous data types, with many operations being
performed in compiled code for performance. There are several important
differences between NumPy arrays and the standard Python sequences: NumPy
arrays have a fixed size at creation, unlike Python lists (which can grow
dynamically); changing the size of an ndarray will create a new array and
delete the original. The elements in a NumPy array are all required to be of
the same data type, and thus will be the same size in memory. The exception:
one can have arrays of (Python, including NumPy) objects, thereby allowing
for arrays of different-sized elements.

NumPy arrays facilitate advanced mathematical and other types of operations
on large amounts of data. Typically, such operations are executed more
efficiently and with less code than is possible using Python's built-in
sequences. A growing plethora of scientific and mathematical Python-based
packages use NumPy arrays; though these typically support Python-sequence
input, they convert such input to NumPy arrays prior to processing, and they
often output NumPy arrays. In other words, in order to efficiently use much
(perhaps even most) of today's scientific/mathematical Python-based
software, just knowing how to use Python's built-in sequence types is
insufficient; one also needs to know how to use NumPy arrays.

The points about sequence size and speed are particularly important in
scientific computing. As a simple example, consider the case of multiplying
each element in a 1-D sequence with the corresponding element in another
sequence of the same length. If the data are stored in two Python lists, a
and b, we could iterate over each element:
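
The comparison this sets up looks like the following (a sketch; a and b are
the two lists just mentioned, filled with illustrative values):

    a = [1, 2, 3, 4]   # illustrative data
    b = [5, 6, 7, 8]

    # Element-wise multiplication with plain Python lists: explicit loop.
    c = []
    for i in range(len(a)):
        c.append(a[i] * b[i])

    # The NumPy equivalent is shorter and runs in compiled code.
    import numpy as np
    c_arr = np.array(a) * np.array(b)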
The Numeric Python extensions (NumPy henceforth) are a set of extensions to
the Python programming language which allow Python programmers to
efficiently manipulate large sets of objects organized in grid-like fashion.
These sets of objects are called arrays, and they can have any number of
dimensions: one-dimensional arrays are similar to standard Python sequences,
and two-dimensional arrays are similar to matrices from linear algebra. Note
that one-dimensional arrays are also different from any other Python
sequence, and that two-dimensional matrices are also different from the
matrices of linear algebra, in ways which we will mention later in this
text.

Why are these extensions needed? The core reason is a very prosaic one:
manipulating a set of a million numbers in Python with the standard data
structures such as lists, tuples or classes is much too slow and uses too
much space. Anything which we can do in NumPy we can do in standard Python;
we just may not be alive to see the program finish. A more subtle reason for
these extensions, however, is that the kinds of operations that programmers
typically want to do on arrays, while sometimes very complex, can often be
decomposed into a set of fairly standard operations. This decomposition has
been developed similarly in many array languages. In some ways, NumPy is
simply the application of this experience to the Python language; thus many
of the operations described in NumPy work the way they do because experience
has shown that way to be a good one, in a variety of contexts. The languages
which were used to guide the development of NumPy include the infamous APL
family of languages, Basis, MATLAB, FORTRAN, S and S+, and others. This
heritage will be obvious to users of NumPy who already have experience with
these other languages.

This tutorial, however, does not assume any such background, and all that is
expected of the reader is a reasonable working knowledge of the standard
Python language. This document is the "official" documentation for NumPy. It
is both a tutorial and the most authoritative source of information about
NumPy, with the exception of the source code. The tutorial material will
walk you through a set of manipulations of simple, small arrays of numbers,
as well as image files. This choice was made because a concrete data set
makes explaining the behavior of some functions much easier to motivate than
simply talking about abstract operations on abstract data sets.

TESTING

Software testing is an investigation conducted to provide stakeholders with
information about the quality of the product or service under test. Software
testing also provides an objective, independent view of the software to
allow the business to appreciate and understand the risks of software
implementation. Test techniques include, but are not limited to, the process
of executing a program or application with the intent of finding software
bugs. Software testing can also be stated as the process of validating and
verifying that a software program/application/product:

 Meets the business and technical requirements that guided its design and
development.

 Works as expected and can be implemented with the same characteristics.

TESTING METHODS

• Functional Testing

Functional tests provide systematic demonstrations that functions


tested are available as specified by the business and technical
requirements, system documentation, and user manuals.

 Functional testing is centered on the following items:

 Functions: Identified functions must be exercised.

 Output: Identified classes of software outputs must be


exercised.
 Systems/Procedures: the system should work properly.

Integration Testing

Software integration testing is the incremental integration testing of


two or more integrated software components on a single platform to
produce failures caused by interface defects.
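
As an illustration, a functional test of the classifier's interface might
look like this (a hypothetical pytest-style sketch, not part of the original
report):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def test_model_outputs_only_known_classes():
        # Functional check: predictions fall in the two expected classes.
        rng = np.random.default_rng(0)
        X = rng.random((100, 5))
        y = rng.integers(0, 2, size=100)
        model = RandomForestClassifier(n_estimators=10).fit(X, y)
        assert set(model.predict(X)) <= {0, 1}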
Conclusion

This project finds that the decision tree and support vector machine
algorithms perform better with a larger amount of training data than the
AdaBoost classifier, but speed during testing and application will suffer.
Applying more pre-processing techniques would also help. The SVM algorithm
still suffers from the imbalanced dataset problem and requires more
pre-processing to give better results; the results shown by SVM are good,
but they could have been better if more pre-processing had been done on the
data. So, in the proposed work, we balanced the imbalanced data with an
up-sampling technique during pre-processing. We review the existing work on
credit card fraud prediction from three different perspectives: datasets,
methods, and metrics. Firstly, we present the details about the availability
of public datasets and what kinds of details are available in each dataset
for predicting credit card fraud. Secondly, we compare and contrast the
various predictive modeling methods that have been used in the literature,
and then quantitatively compare their performance in terms of accuracy.
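
The up-sampling step mentioned above can be sketched with scikit-learn's
resample utility (column name assumed as in the earlier snippets):

    import pandas as pd
    from sklearn.utils import resample

    # Separate the majority (genuine) and minority (fraud) classes.
    genuine = df[df["Class"] == 0]
    fraud = df[df["Class"] == 1]

    # Up-sample the minority class until the classes are balanced.
    fraud_up = resample(fraud, replace=True, n_samples=len(genuine),
                        random_state=42)
    balanced = pd.concat([genuine, fraud_up])
    print(balanced["Class"].value_counts())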

