Cs Batchno19
DIABETES PREDICTION USING DATA SCIENCE IN PYTHON
Submitted in partial fulfillment of the requirements for the award
of
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade "A" by NAAC
JEPPIAAR NAGAR, RAJIV GANDHI SALAI, CHENNAI 600199
MAY 2023
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited "A" Grade by NAAC | 12B Status by UGC | Approved by AICTE
www.sathyabama.ac.in
DEPARTMENT OF COMPUTER SCIENCE
BONAFIDE CERTIFICATE
This is to certify that this Project Report is the bonafide work of
NAVEENRANGASAMY B (40290060) who carried out the project entitled
"DIABETES PREDICTION USING DATA SCIENCE IN PYTHON" under my
supervision from January 2023 to May 2023.
INTERNAL GUIDE
Dr. V.ULAGAMUTHALVI M.E., Ph.D.
Dean and Head of the Department
Dr. REKHA CHAKRAVARTHI M.E., Ph.D.
DATE: 08.05.2023
PLACE: CHENNAI
SIGNATURE OF THE CANDIDATE
ABSTRACT
How to diagnose and analyze diabetes quickly and accurately is therefore a topic
worth studying. Machine learning can help people make a preliminary judgment
about diabetes from their routine physical examination data, and it can serve
as a reference for doctors. In machine learning, selecting valid features and
the correct classifier are the most important problems. In this study,
LogisticRegression and DecisionTree are implemented to predict diabetes.
TABLE OF CONTENTS
1. INTRODUCTION
USING PYTHON
4.1 SYSTEM SPECIFICATION
4.2 PROJECT DESCRIPTION
4.3 LIBRARIES USED
4.4 LOGISTIC REGRESSION
APPENDIX
A. SOURCE CODE
B. SCREEN SHOTS
REFERENCES
LIST OF FIGURES
4.7 Code for splitting data frame into two parts based on the column values
4.8 Code for splitting data frame into two parts based on the column values
5.1 Code for the prediction model
6.5 Code for splitting data frame into two parts based on the column values
6.6 Code for splitting data frame into two parts based on the column values
CHAPTER 1
INTRODUCTION
Machine learning is a type of artificial intelligence (AI) that provides computers with
the ability to learn without being explicitly programmed. Machine learning focuses
on the development of computer programs that can change when exposed to new
data. In this report, we cover the basics of machine learning and the
implementation of a simple machine-learning algorithm using Python.
Machine learning is a method of teaching computers to learn from data, without
being explicitly programmed. Python is a popular programming language for
machine learning because it has a large number of powerful libraries and
frameworks that make it easy to implement machine learning algorithms.
Python offers multiple great graphing libraries that come packed with lots of
different features. Whether you want to create interactive, live or highly
customized plots, Python has an excellent library for you.
PLOTLY: Plotly is a Python library for interactive data visualizations. It lets
you build more interactive graphs than either Matplotlib or Seaborn.
The field of artificial intelligence known as machine learning is concerned with the
process by which computers attempt to forecast future events using historical data
and the information they already possess. There are two distinct forms of machine
learning. The first kind of learning is called supervised learning, and in this type of
learning, the data itself serve as the instructor, and the system is constructed
based on the dataset. The second kind of learning is called unsupervised learning,
and it involves the data teaching itself by identifying certain patterns within the
dataset and then categorising those patterns. Over the last several years, a large
number of writers have reported and discussed their research on diabetes
prediction by utilising machine learning algorithms.
1.3 MACHINE LEARNING ALGORITHM
The research that has been done on machine learning has resulted in the
development of multiple data mining methods. These algorithms may be directly
applied to a dataset in order to create some predictions or to derive important
inferences and conclusions from such a dataset. Decision tree, naive Bayes,
k-means, neural network, and other similar algorithms are examples of prominent
data mining techniques. They are discussed in the part that follows.
be obtained by adhering to these criteria.
1.4 INTRODUCTION TO PYTHON FOR DATA ANALYTICS
Features in Python
Python has many features, some of which are listed below:
• Easy to code.
• Free and open source.
• Object-oriented language.
• GUI programming support.
• High-level language.
• Extensible.
• Portable language.
• Integrated language.
1.5 ANACONDA
Anaconda distribution comes with over 250 packages automatically installed, and
over 7,500 additional open-source packages can be installed from PyPI as well as
the conda package and virtual environment manager. It also includes a GUI,
Anaconda Navigator, [12] as a graphical alternative to the Command Line Interface
(CLI).
The big difference between conda and the pip package manager is in how package
dependencies are managed, which is a significant challenge for Python data
science and the reason conda exists.
When pip installs a package, it automatically installs any dependent Python
packages. Versions of pip before 20.3 did this without checking whether the new
packages conflicted with previously installed ones: they installed a package and
its dependencies regardless of the state of the existing installation. Because of
this, a user with a working installation of, for example, TensorFlow could find
that it stopped working after using pip to install a different package that
requires a different version of the NumPy library than the one used by TensorFlow.
In some cases, the package may appear to work but produce different results in
detail.
Open source packages can be individually installed from the Anaconda repository,
Anaconda Cloud (anaconda.org), or the user's own private repository or mirror,
using the conda install command. Anaconda, Inc. compiles and builds the
packages available in the Anaconda repository itself, and provides binaries for
Windows 32/64 bit, Linux 64 bit and MacOS 64-bit. Anything available on PyPI
may be installed into a conda environment using pip, and conda will keep track of
what it has installed itself and what pip has installed.
Custom packages can be made using the conda build command, and can be
shared with others by uploading them to Anaconda Cloud, PyPI or other
repositories.
The default installation of Anaconda2 includes Python 2.7 and Anaconda3 includes
Python 3.7. However, it is possible to create new environments that include any
version of Python packaged with conda.
Repository, install them in an environment, run the packages and update them. It
is available for Windows, macOS and Linux.
manipulation and analysis tool using its powerful data structures. The name
Pandas is derived from "panel data", an econometrics term for multidimensional
data. In 2008, developer Wes McKinney started developing pandas when he needed
a high-performance, flexible tool for data analysis. Prior to Pandas, Python was
mainly used for data munging and preparation and contributed very little to data
analysis. Pandas solved this problem. Using Pandas, we can accomplish five typical
steps in the processing and analysis of data, regardless of its origin: load,
prepare, manipulate, model and analyze. Python with Pandas is used in a wide
range of academic and commercial fields, including finance, economics,
statistics, and analytics.
• Fast and efficient Data Frame object with default and customized
indexing.
• Tools for loading data into in-memory data objects from different
file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and sub-setting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High performance merging and joining of data.
• Time Series functionality.
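A few of the features listed above can be illustrated with a minimal sketch. The small table below is made-up data: the column names mirror the diabetes data set used later in this report, but the values, the row labels and the second table (labs) are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up illustrative records; the values are assumptions, not real patient data.
df = pd.DataFrame(
    {
        "Glucose": [148.0, 85.0, np.nan, 89.0],
        "BMI": [33.6, 26.6, 23.3, 28.1],
        "Outcome": [1, 0, 1, 0],
    },
    index=["p1", "p2", "p3", "p4"],
)

# Integrated handling of missing data: count the gaps in each column.
missing_per_column = df.isna().sum()

# Label-based slicing: rows p1 through p3 (inclusive) and two named columns.
subset = df.loc["p1":"p3", ["Glucose", "BMI"]]

# Group by the target column and aggregate.
mean_bmi_by_outcome = df.groupby("Outcome")["BMI"].mean()

# Joining: attach a second (hypothetical) table on the shared row labels.
labs = pd.DataFrame({"Insulin": [0, 94]}, index=["p1", "p2"])
joined = df.join(labs, how="left")

# Columns can be deleted from (or inserted into) a data structure.
features_only = joined.drop(columns=["Outcome"])

print(missing_per_column)
print(subset.shape, joined.shape)
print(mean_bmi_by_outcome)
```

Rows that have no match in the second table receive missing values after the join, which is the data alignment behaviour mentioned in the list.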
Matplotlib is the most popular Python plotting library. It is a low-level library with a
MATLAB-like interface, which offers lots of freedom at the cost of having to write more
code.
• Semantic way to generate complex, subplot grids.
• Charts.
• Dashboards.
• File Export.
• App Manager.
• Kubernetes Authentication.
• Jobs Queue.
• Snapshot Engine.
CHAPTER 2
LITERATURE SURVEY
Birjais et al.[8] experimented on PIMA Indian Diabetes (PID) data set. It has 768
instances and 8 attributes and is available in the UCI machine learning repository.
They aimed to focus more on diabetes diagnosis, which, according to the World
Health Organization (WHO) in 2014, is one of the world’s fastest-growing chronic
diseases. Gradient boosting, logistic regression, and naive Bayes classifiers were
used to predict whether a person is diabetic or not, with gradient boosting having
an accuracy of 86%, logistic regression having a 79% accuracy, and naive Bayes
having a 77% accuracy.
Sadhu, A. and Jadli A.[9] experimented on a diabetes data set taken from the UCI
repository. There were 520 instances and 16 attributes in all. They concentrated
their efforts on predicting diabetes at an early stage. On the validation
set of the employed data set, seven classification techniques were implemented:
k-NN, logistic regression, SVM, naive Bayes, decision tree, random forests, and
multilayer perceptron. The random forests classifier proved to be the best model
for the data set, with an accuracy score of 98%. The other reported accuracies
were 93% for logistic regression, 94% for SVM, 91% for naive Bayes, 94% for
decision tree, and 98% for multilayer perceptron.
Xue et al.[10] experimented on the diabetes data set taken from the UCI repository,
which contained 520 patients and 17 attributes. They concentrated on early
detection of diabetes. They trained on the actual data of 520 diabetic patients
and probable diabetic patients aged 16–90 using supervised ML techniques such
as SVM, naive Bayes classifiers, and LightGBM. The naive Bayes classifier, the
most widely used classification algorithm, reached an accuracy of 93.27%. SVM had
the highest accuracy rate of 96.54%, while LightGBM had an accuracy of only 88.46%.
According to the authors, this shows that SVM is the best of the compared
classification algorithms for diabetes prediction.
Le et al.[11] experimented on the early-stage diabetes risk prediction; the data set
used in this research was taken from the UCI repository and consisted of 520
patients and 16 variables. They suggested a ML approach for predicting diabetes
patients’ early onset. It was a new wrapper-based feature selection method that
employed grey wolf optimizer (GWO) and adaptive particle swarm optimization
(APSO) to optimize the multilayer perceptron (MLP) and reduce the number of
needed input attributes. They also compared the results obtained with this method
to those obtained via a variety of traditional machine learning algorithms, including
SVM, DT, k-NN, naive Bayes classifier (NBC), random forest classifier (RFC), and
logistic regression (LR). LR achieved a 95% accuracy rate. k-NN had a
96% accuracy rate, SVM a 95% accuracy rate, NBC a 93% accuracy rate, DT a
95% accuracy rate, and RFC had a 96% accuracy rate. The suggested methods’
computational findings show that not only are fewer features required but also that
higher prediction accuracy may be attained (96% for GWO–MLP and 97% for
APSO–MLP). This research has the potential to be applied in clinical practice and
used as a tool to assist doctors and physicians.
Julius et al.[12] used the Waikato Environment for Knowledge Analysis (Weka)
application platform to test a data set collected from the UCI repository. There were
520 samples in the data set, each with a collection of 17 attributes. The goal of this
study was to use machine learning classification approaches based on observable
sample attributes to predict diabetes at an early stage. The k-NN, SVM, functional
tree (FT), and RF classifiers were employed. k-NN had the highest accuracy of 98%,
followed by RF at 97%, SVM at 94%, and FT at 93%.
Shafi et al.[13] reported that because diabetes is a serious illness, early detection
is always a struggle. This study used machine learning classification methods to
develop a model that could be used to identify diabetes development early on. The
authors made concerted efforts to develop a framework that could accurately predict
the likelihood of diabetes in patients. As part of this study, three ML
classification algorithms (DT, SVM, and NBC) were studied and assessed on various
measures. The PID data set acquired from the UCI repository was used to save time
and produce precise findings. The experimental results suggested that the NBC
approach was adequate, with a 74% accuracy, followed by the DT with a 72% accuracy
and SVM with a 63% accuracy. In the future, the built framework, as well as the ML
classifiers used, could be used to identify or diagnose other diseases. The study
could be extended and improved for diabetes research with several other ML
methodologies, and the authors intended to assess other algorithms on data with
missing values.
Sisodia et al.[15] used the PID data set available on the UCI repository. This data
set contained 768 patients and 8 attributes. They employed three ML
classifications to identify diabetic patients: DT, SVM, and NBC. NBC had the
highest accuracy (76.30%) when compared to the other models.
Agarwal et al.[16] used the PID data set of 738 patients as well in their study. To
analyze the effectiveness of this data set for identifying diabetic patients, the
authors applied models such as SVM, k-NN, NBC, ID3, C4.5, and CART. The SVM
and LDA algorithms were the most accurate, with an accuracy of 88%.
Meng et al.[20] examined J48, LR, and k-NN algorithms on the diabetes data set.
J48 was found to be the most accurate, with a classification accuracy of 78.27%.
Kumari and Chitra[23] used SVM, RFC, DT, MLP, and LR, as well as four k-fold
cross-validations (k = 2,4,5,10) in their research. According to the researchers,
MLP with four-fold cross-validation achieves the best accuracy, at 78.7%. They
discovered that MLP outscored all other algorithms.
To predict diabetes, Kavakiotis et al.[24] employed NBC, RFC, k-NN, SVM, DT, and
LR methods. The algorithms were applied using a ten-fold cross-validation
technique. SVM had the best accuracy of all the approaches, measuring 84%,
according to the study.
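The k-fold cross-validation used in several of the studies above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with make_classification to match the shape of the PID data set, 768 rows and 8 features, not its contents), and logistic regression with ten folds is chosen as an example, not as the setup of any cited study.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption): 768 rows and 8 features,
# matching only the shape of the PID data set, not its contents.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=0
)

# max_iter is raised from the default so the solver has room to converge.
model = LogisticRegression(max_iter=1000)

# Ten folds: each fold serves once as the test set while the other nine
# are used to train the model.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(len(scores))
print(round(scores.mean(), 3))
```

Averaging the per-fold scores gives a less split-dependent accuracy estimate than a single train/test split.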
Perveen et al.[26] used a data set from the Canadian Primary Care Sentinel
Surveillance Network (CPCSSN) database to do their research. The study
employed the AdaBoost and bagging ensemble techniques using the J48 (C4.5)
DT as a base learner and standalone data mining methodology J48 to categorize
patients with diabetes mellitus based on diabetes risk indicators. This
categorization was done across three separate ordinal adult groups in the
CPCSSN. In terms of overall performance, the AdaBoost ensemble method
surpassed both bagging and a single J48 DT, according to the findings.
as well as regular components such as glucose, BMI, age, insulin, and so on. The
new data set enhanced classification accuracy when compared to the old data set.
Multiple ML approaches were used on the data set, and classification was done
with a variety of algorithms, with LR yielding the highest accuracy at 96%. The
AdaBoost classifier was found to be the most accurate, with a
98.8% accuracy rate. They used two separate data sets to compare
the accuracy of ML techniques. When compared to the existing data set, it was
clear that the model improved diabetes prediction accuracy and precision.
Mercaldo et al.[28] offered a strategy for classifying diabetic patients based on a set
of features chosen according to the WHO criteria, evaluating real-world data using
state-of-the-art machine learning algorithms. The model was trained using six
alternative classification approaches, with the Hoeffding Tree method scoring
0.770 in precision and 0.775 in recall. They used data from the PIMA Indian
community in Phoenix, Arizona, to evaluate the method.
CHAPTER 3
3.1 AIM: To accurately predict whether or not the patients in the dataset have
diabetes.
3.2 SCOPE: The accuracy of the model is obtained by comparing the predicted
values against the original set of values. Identifying diabetics, or predicting
the future onset of diabetes, can be supported by various machine learning
techniques.
In this project we use a dataset file named Diabetes.csv, which contains the
columns Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI,
DiabetesPedigreeFunction, Age and Outcome.
dataset.
The dataset consists of several medical predictor variables and one target
variable, Outcome. Predictor variables include the number of pregnancies the
patient has had, their BMI, insulin level, age, and so on.
CHAPTER 4
We will generally be working on this project using standard libraries such as:
● Sklearn
● Pandas
● Numpy
● Matplotlib
Sklearn is the most important library: from it we import the algorithms, the
metrics, and train_test_split, and use them to build the model and make
predictions.
Pandas is the standard library one deals with in a machine learning project.
As mentioned above, the data set is stored in a CSV file (Diabetes.csv), which
we can read with pandas.
Similarly, we will use Numpy for its arrays when calculating errors or handling
data.
We will describe each library and its usage in more detail once we get into the
coding part. The libraries above are imported in full, but from some we import
only the necessary modules, which also makes clear exactly which modules are
being used.
We will import a few specific functions and algorithms from modules of the Sklearn
library, namely:
● train_test_split from the model_selection module of sklearn
● LogisticRegression from the linear_model module of sklearn
● the metrics module of sklearn
4.4 Logistic Regression
Logistic regression is the predictive analysis used when the dependent variable is
categorical. It explains the relationship between one dependent variable and one
or more independent variables.
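To make the relationship concrete, the sketch below shows how logistic regression turns a weighted sum of the independent variables into a probability through the sigmoid function. The intercept, the weights and the patient values are hypothetical numbers chosen for illustration; they are not the coefficients fitted in this project.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for two features (assumed values, not fitted
# to the project data).
intercept = -6.0
weights = np.array([0.04, 0.05])   # e.g. for Glucose and BMI

x = np.array([150.0, 30.0])        # one hypothetical patient
z = intercept + weights @ x        # linear combination of the inputs
probability = sigmoid(z)           # estimated probability of class 1
predicted_class = int(probability >= 0.5)

print(round(float(probability), 3), predicted_class)
```

A probability of at least 0.5 is mapped to class 1 (diabetic) and anything lower to class 0, which is the default decision rule of a binary logistic regression classifier.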
Project Implementation:
Step1:
Searched for a diabetes dataset with all the labels given.
Step2:
Made a flow chart of how to implement the diabetes prediction system and
decided which ML algorithms or techniques would fit the problem with good
accuracy.
Step3:
Installed the required software: Anaconda, Jupyter, and Python.
Step4:
After installing the required software, launched Jupyter Notebook with Python 3
from Anaconda.
Step5:
Imported all the necessary libraries (Pandas, Numpy, Sklearn, Matplotlib)
and the required modules.
Step6:
Imported the diabetes dataset into the Jupyter notebook.
Step7:
Inspected the structure of the data, such as how many columns and rows are
present.
Step8:
Looked for missing values and filled them with the mean of the particular
column (here we have 2 missing values, one in SkinThickness and the other in
Insulin).
Fig 4.5: Example code for the pre-work of data set with null values
Fig 4.6: Example code for the pre-work of data set with null values
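The mean-filling described in Step 8 can be sketched on a toy table. The values below are made up for illustration (only the two column names come from the data set), so this shows the technique and not the project's actual data.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap in each of the two columns (values are made up).
data = pd.DataFrame(
    {
        "SkinThickness": [35.0, 29.0, np.nan, 23.0],
        "Insulin": [0.0, 94.0, 168.0, np.nan],
    }
)

# Count missing values per column before filling.
print(data.isna().sum())

# Replace each gap with the mean of its own column. mean() skips missing
# values by default, so the gap itself does not distort the result.
for column in ["SkinThickness", "Insulin"]:
    data[column] = data[column].fillna(data[column].mean())

# No missing values should remain.
print(data.isna().sum().sum())
```

Assigning the filled column back to the frame avoids relying on in-place modification of a column selection.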
Step9:
Once the data has no null values, no missing values and no string values, we can
split it into 2 parts.
Part1: Training data
Part2: Testing data
Here our target column is ‘Outcome’, i.e. y. Separating it from the main data gives
a y data frame with ‘Outcome’ and an x data frame with all the other labels.
We use 80% of the data for training and 20% for testing.
Fig 4.7 : code for splitting data frame into two parts based on the column
values
Step10:
Now further splitting the data into
x_train,x_test
y_train,y_test
Fig 4.8: code for splitting data frame into two parts based on the column
values
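Steps 9 and 10 can be sketched together on a small stand-in table. The ten rows below are made-up values (an assumption for illustration); the split parameters (test_size=0.20, stratify=y, random_state=2) are the ones used in the source code in the appendix.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small made-up frame standing in for the diabetes data (values are assumptions).
data = pd.DataFrame(
    {
        "Glucose": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
        "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 27.4],
        "Outcome": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    }
)

# Step 9: separate the predictors (x) from the target column (y).
x = data.drop("Outcome", axis=1)
y = data["Outcome"]

# Step 10: hold out 20% of the rows for testing; stratify=y keeps the
# ratio of diabetic to non-diabetic rows the same in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, stratify=y, random_state=2
)

print(x_train.shape, x_test.shape)
```

Fixing random_state makes the split reproducible, so the reported accuracy does not change from run to run.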
Step11:
Now we will apply the ML algorithm, i.e. LogisticRegression from sklearn.linear_model.
Fig 4.10: code to demonstrate fitting the algorithm to the model and
resulting array of predictions made
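Fitting and predicting can be sketched as below. To keep the example self-contained it uses synthetic data from make_classification instead of Diabetes.csv, and it raises max_iter above the default; both are assumptions for illustration and differ from the project's own code in the appendix.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data (an assumption): 768 rows, 8 features.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=2
)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=2
)

# max_iter is raised from the default so the solver has room to converge.
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)      # learn the coefficients from the training part

y_pred = model.predict(x_test)   # array of 0/1 predictions, one per test row
print(y_pred[:10])
```

The fitted model holds one coefficient per feature in model.coef_, and predict returns one class label for every row of the test part.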
Step12:
From the trained model we will calculate the accuracy of our model.
Fig 4.11: code to calculate the accuracy of the model
Step13:
Now we will look at the accuracy graph.
Confusion Matrix
Fig 4.13: code for importing metrics to find the accuracy score
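How the confusion matrix relates to accuracy, precision and recall can be shown with a small hand-made example. The two label lists below are assumed values chosen for illustration, not the project's predictions.

```python
from sklearn import metrics

# Hand-made labels for illustration (assumed values):
# 1 = diabetic, 0 = not diabetic.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are the true classes, columns the predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = metrics.confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = metrics.accuracy_score(y_test, y_pred)    # (tp + tn) / all
precision = metrics.precision_score(y_test, y_pred)  # tp / (tp + fp)
recall = metrics.recall_score(y_test, y_pred)        # tp / (tp + fn)

print(cm)
print(accuracy, precision, recall)
```

Here 6 of the 8 predictions are correct, so the accuracy is 0.75; precision and recall look separately at the false positives and the false negatives, which a single accuracy figure hides.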
METHODOLOGY:
⚫ Data Collection
⚫ Data Preparation
⚫ Choosing a model
⚫ Training the model
⚫ Evaluating the model
⚫ Parameter tuning
⚫ Making prediction
◼ Training the Model: After selecting the model we need to train it on the data.
◼ Evaluating the Model: After training we need to check whether the model is
accurately predicting the outcomes (Accuracy Score = 75%).
◼ Making Predictions: Finally we need to predict the diabetes outcome of a
particular row to check whether the model is working correctly.
TECHNOLOGIES USED:
CHAPTER 5
RESULT, PERFORMANCE ANALYSIS
◼ From the above snippet we are successfully able to see the predictions made
by the model.
◼ If the Output is ‘1’ that means the person is diabetic but if the Output is ‘0’ that
means the person is not diabetic.
◼ Here the Output is based on the other labels: Glucose, Pregnancies,
BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age.
◼ Successfully achieved an accuracy of 75%
Code to calculate the accuracy of the model
CHAPTER 6
SUMMARY AND CONCLUSION
The main aim of this project was to design and implement diabetes prediction
using machine learning methods and to analyze the performance of those methods,
and it has been achieved successfully. The data was cleaned and split into
training and testing data. The proposed approach uses classification methods,
namely Decision Tree and Logistic Regression. A classification accuracy of 75%
has been achieved. The experimental results can assist health care in making
early predictions and early decisions to treat diabetes and save human lives.
In future I would like to move on to deep learning, upgrade myself with new
technologies like TensorFlow, Keras and neural networks, continue my research
in the field of AI, and build many more projects.
APPENDIX
A. Source Code
{#c1}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
{#c2}
data=pd.read_csv("diabetes.csv")
data
{#c3}
data.isna().sum()
{#c3}
mean_SkinThickness=data.SkinThickness.mean()
{#c4}
mean_Insulin=data.Insulin.mean()
{#c5}
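# Assign the filled column back instead of using inplace=True on a column selection
data['SkinThickness']=data['SkinThickness'].fillna(mean_SkinThickness)
# Fill Insulin as well, using the mean computed above
data['Insulin']=data['Insulin'].fillna(mean_Insulin)
data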
{#c6}
x=data.drop('Outcome',axis = 1)
y=data.Outcome
print(x,y)
{#c7}
from sklearn.model_selection import train_test_split
# 80/20 split; stratify=y keeps the class ratio the same in both parts
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.20,stratify=y,random_state=2)
{#c8}
x_train.shape
{#c9}
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train,y_train)
x_test = x_test.fillna(x_train.mean())
y_pred=model.predict(x_test)
y_pred
{#c10}
accuracy=metrics.accuracy_score(y_test,y_pred)
Precision=metrics.precision_score(y_test, y_pred)
Recall=metrics.recall_score(y_test, y_pred)
print("Accuracy",accuracy)
print("precision",Precision)
print("Recall",Recall)
{#c11}
y_pred_proba = model.predict_proba(x_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve (y_test, y_pred_proba)
auc = metrics.roc_auc_score (y_test, y_pred_proba)
plt.plot(fpr, tpr, label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
{#c12}
from sklearn import metrics
{#c13}
#Predicting system
test_sample = (5,166,72,19,175,25.8,0.587,51)
# One-row data frame with the training column names, so the input matches the fitted model
input_data = pd.DataFrame([test_sample], columns=x.columns)
prediction = model.predict(input_data)
print(prediction)
if prediction[0] == 0:
    print('The person is not diabetic')
else:
    print('The person is diabetic')
B. Screen Shots
Fig 6.2: code for importing data set
Fig 6.3: Example code for the pre-work of data set with null values
Fig 6.4: Example code for the pre-work of data set with null values
Fig 6.5: code for splitting data frame into two parts based on the column values
Fig 6.6: code for splitting data frame into two parts based on the column values
Fig 6.7: code to demonstrate fitting the algorithm to the model and resulting array of predictions made
Fig 6.8: code to calculate the accuracy of the model
Fig 6.10: code for importing metrics to find the accuracy score
REFERENCES
0661,p-ISSN: 2278-8727, Volume 19, Issue 1, Ver. IV (Jan.-Feb. 2017),
PP 39-44, www.iosrjournals.org.
SheakRashedHaiderNoori, MdNazirul Islam Sarkar, “Machine Learning
Based Unified Framework for Diabetes Prediction”, BDET 2018, August
25–27, 2018, Chengdu, China. © 2018 Association for Computing
Machinery. ACM ISBN 978-1-4503-6582-6/18/08.
DOI: https://doi.org/10.1145/3297730.3297737.