A Datamining Model For Detection of Fraudulent Behaviour in Water
MAIN PROJECT ON
BACHELOR OF TECHNOLOGY
IN
OF
SUBMITTED BY
K.MADHURI (16C51A0528)
A.SRUTHI (16C55A0504)
D.BHARATH (16C51A0514)
Mr. V.V.SIVA PRASAD
Associate Professor
CERTIFICATE
This is to certify that the main project entitled “A DATAMINING MODEL FOR
DETECTION OF FRAUDULENT BEHAVIOUR IN WATER” is a bona fide work done
by S.SWETHA SRAVANTHI (16C51A0546), K.MADHURI (16C51A0524), A.SRUTHI
(16C55A0504), D.BHARATH (16C51A0514) under the guidance and supervision of Mr.
V.V.SIVA PRASAD, Assoc. Professor in the CSE Department at SAI SPURTHI INSTITUTE
OF TECHNOLOGY, in partial fulfillment of the Bachelor of Technology in Computer
Science and Engineering from JNTU-Hyderabad during the year 2019-2020.
It is our privilege to thank all Project Review Committee members for allowing
us to do this project and providing us all the facilities to do our project.
In all sincerity,
K.KRISHNASRI (16C51A0523)
M.SPOORTHI (16C51A0528)
CONTENTS
1. INTRODUCTION
1.1. Data science
Data science is the process of deriving knowledge and
insights from a huge and diverse set of data through
organizing, processing and analyzing the data. It involves
many different disciplines like mathematical and
statistical modeling, extracting data from its source and
applying data visualization techniques. Often it also
involves handling big data technologies to gather both
structured and unstructured data.
Python in Data Science:
The programming requirements of data science demand a very versatile yet flexible language
which is simple to write the code but can handle highly complex mathematical processing.
Python is most suited for such requirements as it has already established itself both as a
language for general computing as well as
Machine learning is a discipline that deals with programming the systems so as to make them
automatically learn and improve with experience. Here, learning implies recognizing and
understanding the input data and taking informed decisions based on the supplied data. It is
very difficult to consider all the decisions based on all possible inputs.
To solve this problem, algorithms are developed that build knowledge from specific data
and past experience by applying the principles of statistical science, probability, logic,
mathematical optimization, reinforcement learning, and control theory.
For example, machine learning programs can scan and process huge databases detecting
patterns that are beyond the scope of human perception.
The developed machine learning algorithms are used in various applications such as:
Vision processing
Language processing
Pattern recognition
Games
Data mining
Expert systems
Robotics
Machine learning is broadly divided into three types:
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Supervised Learning:
Supervised learning involves building a machine learning model that is based on labeled
samples. Learning data comes with description, labels, targets or desired outputs and the
objective is to find a general rule that maps inputs to outputs. This kind of learning data is
called labeled data.
For example, if we build a system to estimate the price of a plot of land or a house based on
various features, such as size, location, and so on, we first need to create a database and label
it. We need to teach the algorithm what features correspond to what prices. Based on this
data, the algorithm will learn how to calculate the price of real estate using the values of the
input features.
Regression trains on and predicts a continuous-valued response, for example predicting real
estate prices.
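The real-estate example above can be sketched with scikit-learn's LinearRegression. This is only an illustrative sketch: the feature values, the prices, and the variable names are invented for the example and are not data from this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical labelled data: each row is [size in sq. ft, distance to city centre in km].
features = np.array([
    [800, 12.0],
    [1000, 10.0],
    [1200, 8.0],
    [1500, 5.0],
    [1800, 3.0],
])
# Labels: the known price of each house (arbitrary currency units, made up).
prices = np.array([40.0, 52.0, 65.0, 85.0, 104.0])

# "Teach" the algorithm which feature values correspond to which prices.
model = LinearRegression()
model.fit(features, prices)

# Estimate the price of a house that was not in the labelled data.
new_house = np.array([[1300, 7.0]])
predicted_price = float(model.predict(new_house)[0])
print(f"Predicted price: {predicted_price:.1f}")
```

The labelled prices play the role of the "desired outputs"; the fitted model is the general rule that maps inputs to outputs.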
Regression algorithms:
Linear regression
Polynomial regression
Classification algorithms:
Logistic regression
Support Vector Machine (SVM)
K-Nearest Neighbors (KNN)
Unsupervised learning:
Unsupervised learning works without labelled data. When learning data contains only some
indications without any description or labels, it is up to the coder or the algorithm to find
the structure of the underlying data, to discover hidden patterns, or to determine how to
describe the data. This kind of learning data is called unlabeled data.
Unsupervised learning algorithms are extremely powerful tools for analyzing data and for
identifying patterns and trends. They are most commonly used for clustering similar input
into logical groups. Unsupervised learning algorithms include
Clustering algorithms
K-means
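As a small illustration of clustering similar input into logical groups, the sketch below runs scikit-learn's KMeans on a handful of unlabelled points. The numbers and the interpretation of the two columns are assumptions made for the example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabelled data: [monthly consumption in cubic metres, number of occupants].
points = np.array([
    [10.0, 2], [12.0, 3], [11.0, 2],   # a group of low-consumption households
    [48.0, 5], [52.0, 6], [50.0, 5],   # a group of high-consumption households
])

# Ask K-means to find two groups; no labels are supplied.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print("Cluster assigned to each point:", labels)
print("Cluster centres:", kmeans.cluster_centers_)
```

The cluster numbers themselves are arbitrary; what matters is that points in the same group receive the same number.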
Reinforcement Learning:
Here the learning data gives feedback so that the system adjusts to dynamic conditions in order
to achieve a certain objective. The system evaluates its performance based on the feedback
responses and reacts accordingly. The best-known instances include self-driving cars and
the Go-playing program AlphaGo.
Water is essential for households, industry, and agriculture.
Fraudulent behaviour in drinking water consumption is a significant problem facing water
supply companies and agencies. It results in a massive loss of income and
forms the highest percentage of non-technical loss. Finding efficient methods for
detecting fraudulent activities has been an active research area in recent years.
Intelligent data mining techniques can help water supply companies detect these
fraudulent activities and reduce such losses. This research explores the
use of two classification techniques (SVM and KNN) to detect customers suspected of water
fraud. The SVM-based approach uses customer load profile attributes to expose abnormal
behaviour that is known to be correlated with non-technical loss activities. The data was
taken from historical records. The system will help the company to predict suspicious
water customers to be inspected on site.
To do a data science project, we must know some Python libraries, such as:
NumPy
Pandas
Scikit-learn
Matplotlib
and IDEs such as:
Jupyter
Spyder
2. INSTALLATIONS
2.1 ANACONDA:
Anaconda is a package manager, an environment manager,
and a Python distribution that contains a collection of many open source packages. This is
advantageous because a data science project typically needs many different packages
(NumPy, Scikit-learn, SciPy, pandas, to name a few), which come preinstalled with
Anaconda.
1. Go to the Anaconda Website and choose a Python 3.x graphical installer (A) or a Python
2.x graphical installer (B). If you aren't sure which Python version you want to install, choose
Python 3. Do not choose both.
7. Click on Next.
8. Click on Next
9. Click on Finish.
Anaconda provides various IDEs such as Jupyter and Spyder. You can launch and use them
from the Anaconda installation.
Jupyter:
The Jupyter Notebook is an incredibly powerful tool for interactively developing and
presenting data science projects.
A notebook integrates code and its output into a single document that combines
visualisations, narrative text, mathematical equations, and other rich media.
It is possible to use many different programming languages within Jupyter Notebooks;
this report focuses on Python, as it is the most common use case.
Spyder:
Spyder is an open source, cross-platform IDE developed specifically for data science.
Spyder integrates essential data science libraries such as
IPython, SciPy, Matplotlib and NumPy.
Spyder has features like code completion, a text editor with syntax highlighting, and
a variable explorer whose values you can edit using a GUI.
It also has an online help browser that lets users search and view Python and package
documentation inside the IDE.
3. PYTHON LIBRARIES
Libraries:
3.1 NumPy:
3.2 Pandas:
Pandas is a Python module that contains high-level data structures and tools designed
for fast and easy data analysis operations.
Pandas is built on NumPy, which makes it easy to use in NumPy-centric applications.
It is also easy to handle missing data using Pandas. Pandas is one of the best tools for
data munging.
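A minimal sketch of missing-data handling with Pandas is given below. The column names and readings are hypothetical and are not taken from the project dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical meter readings; NaN marks a missing value.
readings = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "consumption": [12.5, np.nan, 9.0, np.nan],
})

# Count the missing readings.
missing_count = int(readings["consumption"].isna().sum())

# Option 1: fill missing readings with the column mean (NaN is skipped by mean()).
mean_value = readings["consumption"].mean()
filled = readings.fillna({"consumption": mean_value})

# Option 2: drop rows whose reading is missing.
dropped = readings.dropna(subset=["consumption"])

print("Missing readings:", missing_count)
print(filled)
print(dropped)
```

Whether to fill or drop depends on the data; filling with the mean is only one of several possible choices.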
3.3 Matplotlib:
3.4 Scikit-Learn:
4. System Specifications
4.1 Hardware Requirements:
Processor : i5 or higher
Processor Speed : minimum 1.1GHz
Hard Disk : minimum 100GB
Input Devices : Keyboard, Mouse
Ram : 8GB or higher.
4.2 Software Requirements:
Features:
5.2 Python:
Python is an interpreted, high-level, general-purpose programming language. Created by Guido
van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its
notable use of significant whitespace. Its language constructs and object-oriented approach aim to help
programmers write clear, logical code for small and large-scale projects.
Python is dynamically typed and supports multiple programming paradigms,
including procedural, object-oriented, and functional programming. Python is often described as a
"batteries included" language due to its comprehensive standard library.
Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0,
released in 2000, introduced features like list comprehensions and a garbage collection system capable
of collecting reference cycles. Python 3.0, released in 2008, was a major revision of the language that is
not completely backward-compatible, and much Python 2 code does not run unmodified on Python 3.
The Python 2 language, i.e. Python 2.7.x, was officially discontinued on 1 January 2020 (first
planned for 2015) after which security patches and other improvements will not be released for it. With
Python 2's end-of-life, only Python 3.5.x and later are supported.
Python interpreters are available for many operating systems. A global community of
programmers develops and maintains CPython, an open source reference implementation. A non-profit
organization, the Python Software Foundation, manages and directs resources for Python and CPython
development.
Features:
Python is a multi-paradigm programming language.
Object-oriented programming and structured programming are fully supported.
Supports functional programming and aspect-oriented programming.
Algorithms:
Linear Regression,
Support Vector Machine (SVM) ,
K-Nearest Neighbors (KNN).
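Support Vector Machine (SVM):
An SVM represents the training examples as points in space, mapped so that the examples of
the separate categories are divided by a gap that is as
wide as possible. New examples are then mapped into that same space and predicted to
belong to a category based on the side of the gap on which they fall.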
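The idea of classifying a new example by the side of the gap on which it falls can be sketched with scikit-learn's SVC. The two-feature training points and the class meanings (0 and 1) are assumptions made for the example; a linear kernel is chosen here only to keep the picture simple.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training examples with two features each.
X_train = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # category 0
    [6.0, 6.5], [7.0, 6.0], [6.5, 7.5],   # category 1
])
y_train = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel looks for the separating line with the widest margin.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

# New examples are mapped into the same space and classified by side.
new_points = np.array([[1.2, 1.4], [6.8, 6.9]])
predictions = clf.predict(new_points)
side = clf.decision_function(new_points)  # sign indicates the side of the boundary

print("Predicted categories:", predictions)
print("Signed distance scores:", side)
```

A negative score corresponds to the first class in `clf.classes_` and a positive score to the second.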
Advantages:
Support vector machine is one of the most widely used classification algorithms due to the
following advantages:
SVMs are helpful in text and hypertext categorization as their application can significantly
reduce the need for labeled training instances in both the standard inductive and transductive
settings.
Classification of images can also be performed using SVMs.
Experimental results show that SVMs achieve significantly higher search accuracy than
traditional query refinement schemes after just three to four rounds of relevance feedback.
This is also true of image segmentation systems, including those using a modified version
of SVM.
K-Nearest Neighbors (KNN):
KNN makes predictions using the training dataset directly. In KNN, K is the
number of nearest neighbors, and it is the core deciding factor. Predictions are made
for a new instance (x) by searching through the entire training set for the K most similar instances (the
neighbors) and summarizing the output variable for those K instances. For regression this might be the
mean output variable; for classification this might be the mode (or most common) class value.
To determine which of the K instances in the training dataset are most similar to a new
input, a distance measure is used. For real-valued input variables, the most popular distance measure is
Euclidean distance. This is calculated as the square root of the sum of the squared differences between a
new point (x) and an existing point (xi) across all input attributes j.
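The distance calculation and the majority vote described above can be written out directly. The sketch below is a plain illustration of the technique rather than the project's implementation; the function names, the training points, and the "normal"/"fraud" labels are hypothetical.

```python
from collections import Counter

import numpy as np


def euclidean_distance(x, xi):
    """Square root of the sum of squared differences across all attributes j."""
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(xi)) ** 2)))


def knn_predict(train_X, train_y, x, k=3):
    """Classify x by the most common label among its k nearest training points."""
    distances = [euclidean_distance(x, xi) for xi in train_X]
    nearest = np.argsort(distances)[:k]          # indices of the k closest points
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]            # mode of the neighbours' labels


# Hypothetical training data with two real-valued attributes.
train_X = [[1.0, 2.0], [2.0, 1.0], [1.5, 1.5], [8.0, 9.0], [9.0, 8.0]]
train_y = ["normal", "normal", "normal", "fraud", "fraud"]

print(euclidean_distance([0, 0], [3, 4]))
print(knn_predict(train_X, train_y, [1.2, 1.8]))
print(knn_predict(train_X, train_y, [8.5, 8.5]))
```

In practice the attributes are usually scaled first, since an attribute with a large numeric range would otherwise dominate the distance.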
Process flow: INPUT -> SELECT PROCESS (LR, SVM, KNN) -> TRAIN -> PREDICT -> OUTPUT
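One way the flow above could be organised in code is sketched below. The helper names `select_model` and `run_pipeline`, the model settings, and the small dataset are assumptions for illustration; they do not appear in the project code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def select_model(name):
    """SELECT PROCESS: return an untrained model for 'LR', 'SVM' or 'KNN'."""
    models = {
        "LR": LinearRegression,
        "SVM": lambda: SVC(kernel="linear"),
        "KNN": lambda: KNeighborsClassifier(n_neighbors=3),
    }
    if name not in models:
        raise ValueError(f"Unknown process: {name}")
    return models[name]()


def run_pipeline(name, X, y, new_rows):
    """INPUT -> SELECT PROCESS -> TRAIN -> PREDICT -> OUTPUT."""
    model = select_model(name)
    model.fit(X, y)                  # TRAIN
    return model.predict(new_rows)   # PREDICT; the caller receives the OUTPUT


# Hypothetical input: columns are speed, time, users; target is 0 (normal) or 1 (fraud).
X = np.array([[10, 5, 2], [12, 6, 3], [11, 5, 2],
              [40, 20, 8], [42, 22, 9], [41, 21, 8]])
y = np.array([0, 0, 0, 1, 1, 1])
new_rows = np.array([[11, 6, 2], [41, 20, 9]])

for name in ("LR", "SVM", "KNN"):
    print(name, run_pipeline(name, X, y, new_rows))
```

Note that linear regression returns a continuous value rather than a class label, so its output would need a threshold before it can be read as fraud or not fraud.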
7. SAMPLE CODE
7.1 Sample code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression


def predict(file, impacts, outcome, inps):
    """Fit a linear regression on the CSV data and predict the outcome for one input row."""
    data = pd.read_csv(file)
    X = data[impacts]   # input (feature) columns
    Y = data[outcome]   # target column
    linear_regressor = LinearRegression()
    linear_regressor.fit(X, Y)
    # Wrap the new inputs in a DataFrame with the same column names used for training.
    nx = pd.DataFrame([inps], columns=impacts)
    pred = linear_regressor.predict(nx)
    return pred


# Read the three input values from the user.
speed = int(input("Enter speed: "))
time = int(input("Enter time: "))
users = int(input("Enter users: "))
print("Predicted fraud value:",
      predict('water.csv', ["speed", "time", "users"], "fraud", [speed, time, users]))
Visual Representation:
# Load the dataset again so that X and Y are available outside predict().
data = pd.read_csv('water.csv')
X = data[["speed", "time", "users"]]
Y = data["fraud"]

# Scatter plot of each input feature against the fraud value.
plt.scatter(X["speed"], Y, color='r')
plt.xlabel('Speed')
plt.ylabel('Fraud')
plt.show()
plt.scatter(X["time"], Y, color='g')
plt.xlabel('Time')
plt.ylabel('Fraud')
plt.show()
plt.scatter(X["users"], Y, color='b')
plt.xlabel('Users')
plt.ylabel('Fraud')
plt.show()
8. SCREENSHOTS
8.1 Code:
8.2 DataSets:
8.3 Outputs:
9. CONCLUSION
In this research, we applied data mining classification techniques for the purpose of
detecting fraudulent behaviour in water consumption. We used SVM and KNN classifiers to build
classification models for detecting suspected fraud. The models were built using the
customers’ historical metered consumption data.
Pre-processing and formatting the data to fit the SVM and KNN classifiers took
considerable effort and time. The conducted experiments showed that both
Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) performed well,
with an overall accuracy of around 70% for both. The model hit rate is 60%-70% which is
apparently better
10. REFERENCES