
Proceedings of the Third International Conference on Intelligent Sustainable Systems [ICISS 2020]

IEEE Xplore Part Number: CFP20M19-ART; ISBN: 978-1-7281-7089-3

Diabetes prediction by using Big Data Tool and Machine Learning Approaches
2020 3rd International Conference on Intelligent Sustainable Systems (ICISS) | 978-1-7281-7089-3/20/$31.00 ©2020 IEEE | DOI: 10.1109/ICISS49785.2020.9315866

Srinivasa Rao Swarna, Sr. Data Architect, Tata Consultancy Services, Edison, NJ, [email protected]
Sumati Boyapati, Sr. Solution Architect, IRIS Software, INC, Edison, NJ, [email protected]
Pooja Dixit, Research Scholar, Bundelkhand University, Jhansi, UP, [email protected]
Rashmi Agrawal, Faculty of Computer Application, Manav Rachna International Institute of Research and Studies, [email protected]

Abstract: The use of big data in daily life is increasing across health care, social networks, banking systems, sensors and smart devices, leading to large amounts of data. It is therefore necessary to develop models and tools that handle this data in an optimized form. In this paper, diabetes is predicted from a data set with the help of several machine-learning algorithms: Naive Bayes, KNN, Random Forest and logistic regression. The main objective of the paper is to detect diabetes with the help of big data tools and machine learning models. To do this, the authors select the most accurate model with the help of several metrics. The paper predicts diabetes using four machine learning models and then compares their performance. Machine learning provides more flexibility and scalability than older biostatistical methods, which helps it perform a variety of tasks such as risk detection, diagnosis, classification, and prediction.

Keywords: Big data, Machine Learning, KNN, Logistic algorithm, Naïve Bayes, Random Forest.

I. INTRODUCTION

Health has always been a primary concern, even before technology existed. The health care domain offers a lot of scope for research because it has developed considerably. Health care technology needs to be upgraded to cope with the digitization of patients' data and of the medical results generated by advanced equipment. The major consequence of this information revolution is that interpreting and understanding the huge volume of collected data has become a challenging task. That is why Big Data Analytics is used to handle this type of data. Big data can be defined as high-variety, high-volume and high-velocity data, which is useful for deriving new forms of effective information. In 2001, Doug Laney defined the characteristics of big data as the 3 V's, i.e. Volume, Velocity and Variety [1]. Big data processing and analysis can be used in many fields, such as science, engineering, business, social studies, finance and health care. The main purpose of analysis is to gain valuable insight that supports the best possible decision making. A variety of tools have been developed for data analysis and processing, among which Hadoop and Spark are the most commonly used. Hadoop mainly consists of two components: MapReduce, used for distributed processing, and the Hadoop Distributed File System (HDFS), used for distributed storage. In 2020 Ahmed Yusuf proposed a framework for health information systems (HIS) based on Big Data Analytics. This framework was primarily intended to benefit patients as well as to help healthcare professionals organize and examine large amounts of data. The HIS framework mainly consists of 5 components, which are described below [1].

[Figure 1: Health Information System with five components: (1) Cloud Environment, (2) EHR, (3) Security Layer, (4) Big Data Analytics, (5) Information Delivery]
Fig 1. Health Information System Proposed by Yousuf, 2014

The first component in the health information infrastructure is the cloud environment, which is used to provide a wide variety of services and to allow authorized users to access the data. The second component is the electronic health record (EHR), through which data from patients in different places is collected. The third component is the security layer, which is used to manage a variety of security issues, such as keeping patients' data secure through authentication and an encryption algorithm. The fourth component is big data analytics, which is used to deploy large-scale data analytics tools. The last component is information delivery, through which data related to the health of patients in different places is collected. This type of structure proved useful for improving the quality and safety of various types of health services.

Machine learning is a sub-field of computer science and a form of artificial intelligence used to build intelligent machines that can learn without being explicitly programmed. Machine learning has developed primarily based on pattern recognition and computational learning


theory. Machine learning algorithms are mainly divided into three categories: supervised, unsupervised and reinforcement learning. In this work, several machine learning algorithms (Naive Bayes, Logistic Regression, Random Forest and KNN) are used to predict diabetes, and their performance is compared based on accuracy, recall, precision and F1-score [2]. Different models built with these algorithms have been tested on the input data set, and the best performing algorithm, chosen on the basis of its accuracy, is the one used to predict diabetes.

II. LITERATURE REVIEW

Different types of computer technologies are in use in health care. The focus of this literature review is the use of Big Data Analytics and ML in the health care domain. Creating a smart, learning health care system involves a variety of analytical challenges [3]. Correlations between data patterns can be understood through data analytics, and additional value can be extracted from huge health data [4]. Using data mining techniques for the classification of diseases, researchers proposed a decision support system that focuses on the diagnosis of diabetes. For this it uses the Nearest Neighbor algorithm and the Decision Tree algorithm [5].

Prasanna Kumar et al. [6] proposed a probabilistic data collection approach that analyzes the mutual relationships within the collected data. They then developed a stochastic predictive model that relates to any disease based on current health status. It was designed to predict and understand the health of patients. Sudha Ram et al. [7] proposed a technique that helps estimate the number of emergency department visits by asthma patients. In this system, the total number of visits to the asthma emergency department was estimated using a variety of data, such as Twitter data and environmental sensors.

Yichuan Wang et al. [8] proposed a data analytics structure for the healthcare sector and identified five big data analytics capabilities: analysis of patterns, unstructured data analysis, decision support, prediction and traceability.

Abdulsalam Yassine et al. [9] proposed a model for healthcare that learns and discovers human activity patterns from smart home big data. It uses frequent pattern mining, cluster analysis and prediction to analyze the behavior of occupants. Javier Andreu-Perez et al. [10] introduced a new approach to diagnosis and disease management for treatment using the features of big data. Health data from imaging informatics, health informatics, traditional bioinformatics and sensor informatics provide information from different data sources. Nongyao [11] proposed a model for the risk of diabetes mellitus using four well-known machine-learning algorithms, including Decision Tree, Artificial Neural Network and Logistic Regression, to compare classification techniques. In addition, he designed a robust version using bagging and boosting strategies.

TABLE I
COMPARISON OF PROPOSED WORK DONE IN LITERATURE REVIEW

No.  Author                        Proposed Work
1    Prasanna Kumar et al.         Proposed probabilistic data collection, which analyzes the mutual relationships within the collected data, and developed a stochastic predictive model
2    Yichuan Wang et al.           Data analytics structure that identified five big data analytics capabilities: analysis of patterns, unstructured data analysis, decision support, prediction and traceability
3    Abdulsalam Yassine et al.     Developed a model that discovers human activity patterns with the help of smart home big data for health care
4    Javier Andreu-Perez et al.    Introduced an approach to diagnosis and disease management for treatment using the features of big data
5    Nongyao                       Model for the risk of diabetes using four well-known machine-learning algorithms, including Decision Tree, Artificial Neural Network and Logistic Regression

III. QUALITY HEALTHCARE AND BIG DATA ANALYTICS


Quality healthcare rests on four main pillars, which are shown in figure 2.

[Figure 2: the Healthcare Sector divided into four panels]
Patient Care: patient drug history; electronic medical records; clinical trials; medical imaging; patient behavior and preferences.
Real time patient monitoring: exact and timely data; automatic data capture and input; regular patient surveillance; open API for hospital IT system; customizable early warning scores.
Predictive analysis of disease: predict human health risk by forecasting and determining patterns; surfacing high risk markers; detect comorbidity to predict critical disease; early diagnosis of disease.
Improve the treatment methods: treatment comparison with medical guidelines; find unexpected patterns in treatments; measure efficiency of specific drugs.
Fig 2 Big data in Healthcare sector

Each of these four mainstays of quality healthcare can be effectively managed using expressive and reusable big data analytical techniques.

A. Patient Centric Care: By detecting disease at an early stage based on clinical data, and by limiting drug doses, it helps the patient. This helps reduce hospital readmission rates and also lowers costs for patients.
B. Predictive Analysis of Diseases: Predict viral diseases at an early stage, before they spread, based on live analysis. This can be done by examining the social logs of patients who are suffering from a disease in a specific area. It further helps healthcare professionals advise those affected to take the required preventive measures.
C. Real Time Patient Monitoring: This verifies whether hospitals are set up according to the standards set by the Indian Clinical Committee. This type of periodic check helps the government take the necessary measures against hospitals that fall short.
D. Renovate the Treatment System: The treatment prescribed to earlier patients can be analyzed based on the drugs used, which may change rapidly. Analysis of patients' information, organized by their symptoms, helps the doctor give effective prescriptions to new patients [12].

IV. FLOW CHART & PROPOSED METHODOLOGY

This section gives a brief description of the technology used. The proposed classifier model primarily flags patients with diabetes and takes the diabetes data set as input. Different machine learning algorithms (random forest, KNN, logistic regression and naive Bayes) have been tested on the input data set, and the results they generate have been collected as experimental results. The best performing algorithm, chosen on the basis of its accuracy, is the one used to predict diabetes. Figure 3 describes the methodology used to construct the model and to carry out the comparative analysis for predicting diabetes accurately [13].

[Figure 3: flowchart. Dataset for Diabetes Disease Prediction -> Training Set -> Machine Learning Algorithms (Naive Bayes, K-Nearest Neighbour, Logistic Regression, Random Forest) -> Learning Model; Test Set -> Classifier Model -> Performance Results -> Recommend Best ML algorithm based on Performance Evaluation]
Fig 3 Proposed Methodology Flowchart

The following lists the steps of the procedure shown in Fig 3, the proposed classifier methodology.

Stepwise Procedure of Proposed Methodology
Stage 1: The diabetes data set is preprocessed with the help of the Python tool.
Stage 2: After the first stage, the data set is divided 80:20 into training and testing sets.
Stage 3: In this stage, the machine learning algorithms (naive Bayes, logistic regression, random forest and KNN) are selected for testing.
Stage 4: In this stage, an ML model is built for each machine learning algorithm based on the data set.
Stage 5: After the model is built, it is tested on the testing set.
Stage 6: The experimental results obtained from the classifiers are evaluated comparatively.
Stage 7: Comparative evaluation of the experimental performance results, derived from the classifier model, is


performed, in which the best algorithms are selected based on the accuracy and precision of their results.

The proposed classifier model was developed using the Python tool and relies on successful execution of the experimental steps. It is capable of estimating test results.

V. DIABETES DISEASE DATASET

The dataset is derived from the Pima Indian diabetes data and is known as diabetes.csv. It has 8 attributes that act as indicators for diabetes. They are used to test for the presence of diabetes in the patient based on 768 cases. These indicators can take different values in different patients. TABLE II briefly describes the 8 clinical attributes of the diabetes dataset.

Table II
ATTRIBUTES OF DATASET

Fig 4 Analysis of Diabetes in Pima Indian Women

The diabetes data file was read with pandas for this research study. It has 8 medical predictor features as input and 1 target variable as output, with 1 for 'yes' and 0 for 'no' diabetes, and 768 records. Figure 4 shows that, out of 768 Pima Indian women, 65.1% have not been diagnosed with diabetes, while 34.9% have diabetes [14].

VI. OBJECTIVE

Machine learning algorithms have been used to make predictions on the Pima Indian Diabetes data set. With the help of these models, it is estimated which people are likely to develop diabetes, with the goal of achieving >70% accuracy as measured from the confusion matrix.

VII. PROPOSED MACHINE LEARNING ALGORITHM

Several classification algorithms (Naive Bayes, Random Forest, Logistic Regression and KNN) have been used in this research. The algorithms have been evaluated based on the accuracy, precision and recall metrics, as these are widely used in standard data mining fields [15]. Based on the confusion matrix, it is easy to calculate the accuracy of the proposed algorithm. Accuracy can be found using the following formula.

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]

Where TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative

Precision is defined in equation 2 as the number of correctly classified positive samples divided by the total number of samples predicted as positive.

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]

Recall is defined in equation 3 as the number of correctly classified positive samples divided by the total number of actual positive samples.

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{3} \]

A. Naïve Bayes Algorithm: It is based on a probability measure and follows a distinct order of execution. This method is implemented using the following formula (Bayes' theorem, for a class c and a feature vector x):

\[ P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \]

Under the naive assumption that the features x_1, ..., x_n are independent given the class, this method uses the following formula for implementation:

\[ P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c) \]

For Naive Bayes, the dataset is divided in the ratio of 80:20, where the training set is around 80% and the test set is 20%. The Gaussian variant was chosen to create the model, as it is the simplest classifier model [15].

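The Gaussian Naive Bayes step with an 80:20 split can be illustrated with a minimal sketch. The paper does not list its code, so everything here is an assumption: the data is synthetic (two hypothetical features rather than the 8 attributes of diabetes.csv), the classifier is written from scratch with the standard library instead of whatever library the authors used, and the helper names (`train_test_split`, `fit_gaussian_nb`, `predict`, `make_synthetic`) are illustrative only.

```python
import math
import random
from statistics import mean, pvariance


def train_test_split(rows, labels, test_ratio=0.2, seed=0):
    """Shuffle the indices and split them into training and test parts."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


def fit_gaussian_nb(rows, labels):
    """Estimate the prior and per-feature mean/variance for each class."""
    model = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        columns = list(zip(*members))
        model[cls] = {
            "prior": len(members) / len(rows),
            "means": [mean(col) for col in columns],
            # small constant avoids division by zero for constant features
            "vars": [pvariance(col) + 1e-9 for col in columns],
        }
    return model


def predict(model, row):
    """Return the class with the highest log posterior, log P(c) + sum log P(x_i | c)."""
    def log_posterior(params):
        total = math.log(params["prior"])
        for x, mu, var in zip(row, params["means"], params["vars"]):
            total += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        return total
    return max(model, key=lambda cls: log_posterior(model[cls]))


def make_synthetic(n=768, seed=1):
    """Hypothetical stand-in for diabetes.csv: two features, binary outcome."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        y = 1 if rng.random() < 0.35 else 0
        rows.append([rng.gauss(140 if y else 110, 20), rng.gauss(35 if y else 30, 5)])
        labels.append(y)
    return rows, labels


rows, labels = make_synthetic()
X_train, y_train, X_test, y_test = train_test_split(rows, labels)  # 80:20 split
model = fit_gaussian_nb(X_train, y_train)
predictions = [predict(model, r) for r in X_test]
test_accuracy = mean(int(p == y) for p, y in zip(predictions, y_test))
print(f"train={len(X_train)} test={len(X_test)} accuracy={test_accuracy:.2f}")
```

Working in log space avoids numerical underflow when many per-feature likelihoods are multiplied; the same structure would apply to the 8 real attributes once the file is loaded.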

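The accuracy, precision and recall measures of Section VII can be computed directly from the four confusion-matrix counts. The sketch below is illustrative: the class name `ConfusionCounts` and the example counts are hypothetical and are not taken from the paper's tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # true positives
    tn: int  # true negatives
    fp: int  # false positives
    fn: int  # false negatives


def accuracy(c: ConfusionCounts) -> float:
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (c.tp + c.tn) / (c.tp + c.tn + c.fp + c.fn)


def precision(c: ConfusionCounts) -> float:
    """TP / (TP + FP): share of predicted positives that are correct."""
    predicted_positive = c.tp + c.fp
    return c.tp / predicted_positive if predicted_positive else 0.0


def recall(c: ConfusionCounts) -> float:
    """TP / (TP + FN): share of actual positives that are found."""
    actual_positive = c.tp + c.fn
    return c.tp / actual_positive if actual_positive else 0.0


def f1_score(c: ConfusionCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts, chosen only to exercise the formulas.
example = ConfusionCounts(tp=50, tn=100, fp=20, fn=30)
print(f"accuracy={accuracy(example):.3f}")
print(f"precision={precision(example):.3f}")
print(f"recall={recall(example):.3f}")
print(f"f1={f1_score(example):.3f}")
```

Comparing each value against the >70% objective then amounts to a simple threshold check on the metric computed for the positive class.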
TABLE III
CONFUSION MATRIX FOR NAÏVE BAYES

                Predicted TRUE   Predicted FALSE
Actual TRUE     106              45
Actual FALSE    21               59

TABLE IV
CLASSIFICATION REPORT OF NAÏVE BAYES

Class         Precision   Recall   f1-score   Support
0             0.83        0.84     0.84       100
1             0.69        0.68     0.69       50
avg / total   0.74        0.71     0.77       150

The above table shows the accuracy of Naive Bayes. The recall for class 1 here is 0.68, while the precision is 0.69, which is less than the objective (> 70%).

B. Random Forest: It is a very flexible and efficient ML technique that gives very good results in a short time. Due to its simplicity and diversity, it is the most used algorithm [15].

TABLE V
CONFUSION MATRIX FOR RANDOM FOREST

                Predicted TRUE   Predicted FALSE
Actual TRUE     121              30
Actual FALSE    37               43

TABLE VI
CLASSIFICATION REPORT OF RANDOM FOREST

Class         Precision   Recall   f1-score   Support
0             0.77        0.8      0.78       151
1             0.6         0.58     0.56       80
avg / total   0.7         0.71     0.71       231

The recall for class 1 here is 0.58, while the precision is 0.60, so both precision and recall are lower than the objective (> 70%).

C. Logistic Regression: Logistic regression is a supervised machine-learning algorithm that uses the input features to estimate the probability of the target variable.

TABLE VII
CONFUSION MATRIX FOR LOGISTIC REGRESSION

                Predicted TRUE   Predicted FALSE
Actual TRUE     118              33
Actual FALSE    28               52

TABLE VIII
CLASSIFICATION REPORT OF LOGISTIC REGRESSION

Class         Precision   Recall   f1-score   Support
0             0.82        0.69     0.85       100
1             0.74        0.72     0.69       50
avg / total   0.77        0.78     0.79       154

The above table shows the accuracy of Logistic Regression. The recall for class 1 here is 0.72, while the precision is 0.74, so both precision and recall are above 70%, which means that this result accomplishes the objective [16].

D. K-Nearest Neighbor Algorithm: This approach calculates the Euclidean distance over the attributes and uses a chosen value of k, called the number of nearest neighbors. Then, using the Euclidean distance, it finds the k closest records and derives the result from them [16].

Figure 8: KNN Training and Testing accuracy measure

The graph above indicates that the training accuracy is at its maximum for n = 1, but in this case the test score is the lowest.

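The K-Nearest Neighbor procedure described above, and the comparison of training and testing accuracy over several values of k, can be sketched as follows. This is a from-scratch illustration on hypothetical synthetic data, not the authors' implementation; the function names and the chosen k values are assumptions.

```python
import math
import random
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_rows, train_labels, query, k):
    """Majority label among the k training rows closest to `query`."""
    by_distance = sorted(zip(train_rows, train_labels),
                         key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_rows, train_labels, rows, labels, k):
    """Fraction of `rows` whose predicted label matches the true label."""
    hits = sum(knn_predict(train_rows, train_labels, row, k) == label
               for row, label in zip(rows, labels))
    return hits / len(rows)


# Hypothetical synthetic data: two features whose means depend on the class.
rng = random.Random(42)
labels = [1 if rng.random() < 0.35 else 0 for _ in range(200)]
rows = [[rng.gauss(140 if y else 110, 20), rng.gauss(35 if y else 30, 5)]
        for y in labels]
train_rows, train_labels = rows[:160], labels[:160]
test_rows, test_labels = rows[160:], labels[160:]

train_scores, test_scores = {}, {}
for k in (1, 3, 5, 7, 9):  # odd k avoids tied votes with two classes
    train_scores[k] = knn_accuracy(train_rows, train_labels,
                                   train_rows, train_labels, k)
    test_scores[k] = knn_accuracy(train_rows, train_labels,
                                  test_rows, test_labels, k)
    print(f"k={k}: train={train_scores[k]:.2f} test={test_scores[k]:.2f}")
```

With k = 1 every training point is its own nearest neighbor, so training accuracy is perfect by construction; this is why a high training score at k = 1 says little about performance on the test set.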

TABLE IX
CONFUSION MATRIX FOR KNN ALGORITHM

                Predicted TRUE   Predicted FALSE
Actual TRUE     142              25
Actual FALSE    35               54

TABLE X
CLASSIFICATION REPORT OF KNN ALGORITHM

Class         Precision   Recall   f1-score   Support
0             0.88        0.85     0.83       167
1             0.68        0.61     0.64       89
avg / total   0.76        0.77     0.76       256

VIII. RESULT ANALYSIS

Several classification algorithms (Naive Bayes, Random Forest, Logistic Regression and KNN) have been used in this research. The algorithms have been evaluated on the basis of the accuracy, precision and recall metrics, as these are widely used in standard data mining fields. Based on the confusion matrix, it is easy to calculate the accuracy of the proposed algorithm.

IX. CONCLUSION

As more and more data becomes available, the use of machine learning is also increasing rapidly, since it is useful for handling huge amounts of data with the help of Big Data tools. In clinical practice and biomedical research, it is very challenging to create a model that identifies a testable hypothesis and makes accurate predictions. ML models are therefore proving very useful in healthcare, where they have made it easier to develop therapies and medical products using new technology. The analysis above shows the accuracy results for the different algorithms. Naive Bayes, Random Forest, KNN and logistic regression were compared based on precision and the confusion matrix. In conclusion, logistic regression gives the most accurate results with respect to the objective of the paper.

REFERENCES
[1]. Prableen Kaur, Manik Sharma, Mamta Mittal, "Big Data and Machine Learning Based Secure Healthcare Framework", International Conference on Computational Intelligence and Data Science (ICCIDS 2018), 10.1016/j.procs.2018.05.020
[2]. Thérence Nibareke and Jalal Laassiri, "Using Big Data-machine learning models for diabetes prediction and flight delays analytics", J Big Data (2020) 7:78, https://doi.org/10.1186/s40537-020-00355-0, Springer
[3]. Rahul C. Basole, Mark L. Braunstein, and Jimeng Sun, "Data and Analytics Challenges for a Learning Healthcare System", ACM Journal of Data and Information Quality, Vol. 6, No. 2-3, Article 10, Publication date: July 2015
[4]. Emrana Kabir Hashi, Md. Shahid Uz Zaman, Md. Rokibul Hasan, "An Expert Clinical Decision Support System to Predict Disease Using Classification Techniques", International Conference on Electrical, Computer and Communication Engineering (ECCE), February 16-18, 2017, IEEE
[5]. Md. Golam Rabiul Alam, Rim Haw, Sung Soo Kim, Md. Abul Kalam Azad, Sarder Fakhrul Abedin, Choong Seon Hong, "EM-Psychiatry: An Ambient Intelligent System for Psychiatric Emergency", IEEE Transactions on Industrial Informatics, Vol. 12, No. 6, December 2016
[6]. Prasan Kumar Sahoo, Suvendu Kumar Mohapatra, Shih-Lin Wu, "Analyzing Healthcare Big Data With Prediction for Future Health Condition", Vol. 4, 2016, IEEE, Digital Object Identifier 10.1109/ACCESS.2016.2647619
[7]. Ram, S., Zhang, W., and Williams, M., "Predicting Asthma-Related Emergency Department Visits Using Big Data", IEEE Journal 19(4): 1216-1223, 2015
[8]. Wang, Y., Kung, L. A., and Terry Anthony Byrd, "Understanding its capabilities and potential benefits for healthcare organizations", Journal of Technological Forecasting and Social Change 126: 3-13, 2018
[9]. Abdulsalam Yassine, S., "Mining Human Activity Patterns From Smart Home Big Data for Health Care Applications", IEEE Access 5: 13131-13149, 2017
[10]. Javier Andreu-Perez, Carmen C. Y. Poon, Robert D. Merrifield, Stephen T. C. Wong, and Guang-Zhong Yang, "Big Data for Health", IEEE Journal of Biomedical and Health Informatics, Vol. 19, No. 4, July 2015
[11]. Nongyao Nai-arun, Rungruttikarn Moungmai, "Comparison of Classifiers for the Risk of Diabetes Prediction", Procedia Computer Science 69 (2015) 132 (http://creativecommons.org/licenses/by-nc-nd/4.0/)
[12]. Archenaa J. et al. (2015), "A Survey of Big Data Analytics in Healthcare and Government", Procedia Computer Science 50: 408-413
[13]. Ayman Mir, Sudhir N. Dhage, "Diabetes Disease Prediction using Machine Learning on Big Data of Healthcare", 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), 978-1-5386-5257-2/18/$31.00, 2018 IEEE
[14]. Senthilkumar SA, Bharatendara K Rai, Amruta A Meshram, Angappa Gunasekaran, Chandrakumarmangalam, "Big Data in Healthcare Management: A Review of Literature", American Journal of Theoretical and Applied Business 2018; 4(2): 57-69, http://www.sciencepublishinggroup.com/j/ajtab, doi: 10.11648/j.ajtab.20180402.14, ISSN: 2469-7834 (Print); ISSN: 2469-7842 (Online)
[15]. K. Shailaja, B. Seetharamulu, M. A. Jabbar, "Machine Learning in Healthcare: A Review", Proceedings of the 2nd International Conference on Electronics, Communication and Aerospace Technology (ICECA 2018), IEEE Conference Record 42487; IEEE Xplore ISBN: 978-1-5386-0965-1
[16]. Usha Nandhini, Dr. K. Dharmarajan, "Diabetic Analysis on Big data and Machine Learning - A Literature Review", Parishodh Journal, ISSN NO: 2347-6648, 2020
