BDA Paper7
Abstract— The use of big data in daily life is increasing across health care, social networks, banking systems, sensors and smart devices, leading to large amounts of data. It is therefore necessary to develop models and tools that handle this data in an optimized form. In this paper, diabetes is predicted from a data set with the help of several machine-learning algorithms: Naive Bayes, KNN, Random Forest and Logistic Regression. The main objective of the paper is to study diabetes with the help of big data tools and machine learning models. To do this, the more accurate model is selected with the help of evaluation metrics. The paper predicts diabetes using four machine learning models and then compares their performance. Machine learning provides more flexibility and scalability than older biostatistical methods, which helps it perform a variety of tasks such as risk detection, diagnosis, classification and prediction.

Keywords: Big data, Machine Learning, KNN, Logistic Regression, Naïve Bayes, Random Forest.

I. INTRODUCTION

Health was a primary concern long before technology existed. The health care domain offers a lot of scope for research, as it has developed considerably. There is a need to upgrade current health care technology to cover the digitization of patients' data and of the medical results generated by advanced equipment. A major consequence of this information revolution is that interpreting and understanding the huge volume of data collected has become a challenging task. That is why Big Data Analytics is used to handle this type of data. Big data can be defined as high-variety, high-volume and high-velocity data, which is useful as a new source of effective information. In 2001, Doug Laney defined the characteristics of such data as the 3 V's, i.e. Volume, Velocity and Variety [1]. Big data processing and analysis can be used in many fields, such as science, engineering, business, social analysis and finance, as well as health care. The main purpose of the analysis is to gain valuable insight that can help decision making.

A variety of tools have been developed for data analysis and processing, among which Hadoop and Spark are the most commonly used. Hadoop mainly consists of two components: MapReduce, used for distributed processing, and the Hadoop Distributed File System (HDFS), used for distributed storage. In 2020 Ahmed Yusuf proposed a structure for health information systems (HIS) based on Big Data Analytics. This type of framework was primarily intended to benefit patients, and to help healthcare professionals organize and examine large amounts of facts and details. The HIS framework consists of five components, which are described below [1].

[Figure: Health Information System with five components: (1) Cloud Environment, (2) EHR, (3) Security Layer, (4) Big Data Analytics, (5) Information Delivery]
Fig. 1. Health Information System proposed by Yousuf, 2014

The first component in the health information infrastructure is the cloud environment, which is used to provide a wide variety of services and to allow authorized users to access the data. The second component is the electronic health record (EHR), through which patient data is collected from different places. The third component is the security layer, which is used to manage a variety of security issues, such as keeping patients' data secure through authentication and an encryption algorithm. The fourth component is big data analytics, which is used to deploy large-scale data analytics tools. The last component is information delivery, through which health-related data about patients in different places is delivered. This type of structure proved useful for improving the quality and safety of various types of health services.

Machine learning is a sub-field of computer science and a branch of artificial intelligence, used to build intelligent machines that can learn without being explicitly programmed. Machine learning has developed primarily from pattern recognition and computational learning
Authorized licensed use limited to: Carleton University. Downloaded on May 30,2021 at 00:45:51 UTC from IEEE Xplore. Restrictions apply.
Proceedings of the Third International Conference on Intelligent Sustainable Systems [ICISS 2020]
IEEE Xplore Part Number: CFP20M19-ART; ISBN: 978-1-7281-7089-3
theory. Machine learning algorithms are mainly divided into three categories: supervised, unsupervised and reinforcement learning. In this work, several machine learning algorithms, namely Naive Bayes, Logistic Regression, Random Forest and KNN, have been used. Using these, diabetes is defined in three ways, and the performance of the algorithms is compared based on accuracy, recall, precision and f1-score [2]. Different models of these algorithms have been tested on the input data set, and the best performing algorithm has been selected on the basis of its accuracy, so that it can predict diabetes accurately.

II. LITERATURE REVIEW

Different types of computer technologies are used in health care. The literature review here focuses on the use of Big Data Analytics and ML in the health care domain. Creating a smart health care learning model involves facing a variety of analytical challenges [3]. The correlation between data patterns can be understood through data analytics, and additional value can be extracted from huge health data [4]. Using data mining techniques for the classification of diseases, the researchers proposed a decision support system that focuses on the diagnosis of diabetes, for which it uses the Nearest Neighbor algorithm and the Decision Tree algorithm [5].

Prasanna Kumar et al. [6] proposed probabilistic data collection, which performs an analysis of the mutual relationship between the data collected. They then developed a stochastic predictive model that relates to any disease based on current health status. It was designed to predict and understand the health of patients. Sudha Ram et al. [7] also proposed a technique that was helpful in estimating the number of asthma-related emergency department visits. In this system, the total number of visits to the asthma emergency department was estimated using a variety of data such as Twitter data and environmental sensors.

Yichuan Wang et al. [8] proposed a data analytics structure for the healthcare sector, with the help of which they identified five big data analytics entities: pattern analysis, unstructured data analysis, decision support, prediction and traceability.

Abdulsalam Yassine et al. [9] proposed a model for healthcare that proved useful for learning and discovering human activity patterns with the help of smart home big data. It analyzes and predicts the behavior of occupants using frequent pattern mining and cluster analysis. Javier Andreu-Perez et al. [10] introduced a new theorem about diagnosis and disease management for treatment using the features of big data. Health data such as imaging informatics, health informatics, traditional bioinformatics and sensor informatics provide information from different data sources. Nongyao [11] proposed a model for the risk of diabetes mellitus in which he used four well-known machine-learning algorithms, including Decision Tree, Artificial Neural Network and Logistic Regression, to gain knowledge of classification techniques. In addition, he designed a robust version using bagging and boosting strategies.

TABLE I
COMPARISON OF PROPOSED WORK DONE IN LITERATURE REVIEW
No. | Author | Proposed Work
1 | Prasanna Kumar et al. | Proposed probabilistic data collection, which performs an analysis of the mutual relationship between the data collected, and developed a stochastic predictive model
2 | Yichuan Wang et al. | Data analytics structure which identified five big data analytics entities: pattern analysis, unstructured data analysis, decision support, prediction and traceability
3 | Abdulsalam Yassine et al. | Developed a model that discovers human activity patterns with the help of smart home big data for health care
4 | Javier Andreu-Perez et al. | Introduced a theorem about diagnosis and disease management for treatment using the features of big data
5 | Nongyao | Model for the risk of diabetes using four well-known machine-learning algorithms, including Decision Tree, Artificial Neural Network and Logistic Regression

III. QUALITY HEALTHCARE AND BIG DATA ANALYTICS
Quality healthcare is divided into four main bases, which are shown in Figure 2.

IV. FLOW CHART & PROPOSED METHODOLOGY
perform, in which the best algorithms are selected based on the accuracy and precision of their results.

The proposed classifier model was developed using Python and relies on the successful execution of the experimental steps. It is capable of estimating test results.

V. DIABETES DISEASE DATASET

The dataset is derived from the Pima Indians Diabetes data and is known as diabetes.csv. It has 8 attributes that act as indicators for diabetes, and it records the presence of diabetes in patients across 768 cases. The values of these indicators vary from patient to patient. TABLE II briefly describes the 8 attributes (clinical indicators) of the diabetes dataset.

TABLE II
ATTRIBUTES OF DATASET

VI. OBJECTIVE

Machine learning algorithms have been used for prediction on the Pima Indians Diabetes data set. With the help of this model, it has been estimated which people are likely to develop diabetes, with >70% accuracy based on the confusion matrix.

VII. PROPOSED MACHINE LEARNING ALGORITHM

Several classification algorithms, namely Naive Bayes, Random Forest, Logistic Regression and KNN, have been used in this research. The algorithms have been evaluated based on accuracy, precision and recall metrics, as these are widely used in standard data mining fields [15]. Based on the confusion matrix, it is straightforward to calculate the accuracy of the proposed algorithm. Accuracy can be found using the following formula.
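As a minimal sketch of the confusion-matrix calculation referred to above, the standard definitions of accuracy, precision and recall can be written in Python as follows. The counts are the ones reported in Table IX for the KNN algorithm. Treating the table's TRUE label as the positive class, and the function name `metrics_from_confusion`, are assumptions made for illustration and are not details given in the paper.

```python
def metrics_from_confusion(tp, fn, fp, tn):
    """Return (accuracy, precision, recall) from the four cells of a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total      # share of all cases classified correctly
    precision = tp / (tp + fp)        # share of predicted positives that are correct
    recall = tp / (tp + fn)           # share of actual positives that are found
    return accuracy, precision, recall

# Counts from Table IX (rows = actual class, columns = predicted class).
# Assumption: TRUE is treated as the positive class.
tp, fn = 142, 25   # actual TRUE:  predicted TRUE / predicted FALSE
fp, tn = 35, 54    # actual FALSE: predicted TRUE / predicted FALSE

accuracy, precision, recall = metrics_from_confusion(tp, fn, fp, tn)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With these counts the accuracy is 196/256, or about 0.766, which is above the 70% level stated in the objective.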
TABLE IV
CLASSIFICATION REPORT OF NAÏVE BAYES
Class | Precision | Recall | f1-score | Support

TABLE VI
CLASSIFICATION REPORT OF RANDOM FOREST
Class | Precision | Recall | f1-score | Support

TABLE VIII
CLASSIFICATION REPORT OF LOGISTIC REGRESSION
Class | Precision | Recall | f1-score | Support
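The classification reports list precision, recall, f1-score and support for each class. As an illustrative sketch only, a report of this shape can be derived from true and predicted labels in plain Python. The helper name `classification_report_rows` and the toy labels below are hypothetical and are not the paper's data.

```python
from collections import Counter

def classification_report_rows(y_true, y_pred):
    """Return {class: (precision, recall, f1, support)} for each class in y_true."""
    pairs = Counter(zip(y_true, y_pred))   # (actual, predicted) -> count
    predicted = Counter(y_pred)            # predicted class -> count
    support = Counter(y_true)              # actual class -> count
    rows = {}
    for cls in sorted(support):
        correct = pairs[(cls, cls)]
        precision = correct / predicted[cls] if predicted[cls] else 0.0
        recall = correct / support[cls]
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows[cls] = (precision, recall, f1, support[cls])
    return rows

# Hypothetical toy labels (0 = non-diabetic, 1 = diabetic), not taken from the paper.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

for cls, (p, r, f1, n) in classification_report_rows(y_true, y_pred).items():
    print(f"{cls}  precision={p:.2f}  recall={r:.2f}  f1-score={f1:.2f}  support={n}")
```

The f1-score is the harmonic mean of precision and recall, so it falls between the two, and support is simply the number of actual cases in each class.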
TABLE IX
CONFUSION MATRIX FOR KNN ALGORITHM
Actual Class | Predicted TRUE | Predicted FALSE
TRUE | 142 | 25
FALSE | 35 | 54

TABLE X
CLASSIFICATION REPORT OF LOGISTIC REGRESSION
Class | Precision | Recall | f1-score | Support
0 | 0.88 | 0.85 | 0.83 | 167
1 | 0.68 | 0.61 | 0.64 | 89
Avg/total | 0.76 | 0.77 | 0.76 | 256

VIII. RESULT ANALYSIS

Several classification algorithms, namely Naive Bayes, Random Forest, Logistic Regression and KNN, have been used in this research. The algorithms have been evaluated on the basis of accuracy, precision and recall metrics, as these are widely used in standard data mining fields. Based on the confusion matrix, it is straightforward to calculate the accuracy of the proposed algorithm.

IX. CONCLUSION

As more data becomes available, the use of machine learning is increasing rapidly, since it is useful for handling huge amounts of data with the help of big data tools. In clinical practice and biomedical research, it is very challenging to create a model that identifies a testable hypothesis and predicts accurately. ML models are therefore proving very useful in healthcare, and have made it easier to develop therapies and medical products using new technology. The analysis above reports the percentage accuracy of the different algorithms: Naive Bayes, Random Forest, KNN and Logistic Regression were compared on the basis of precision and the confusion matrix. In conclusion, logistic regression gives more accurate results than the other algorithms, in line with the objective of the paper.

REFERENCES
[1]. Prableen Kaur, Manik Sharma, Mamta Mittal, "Big Data and Machine Learning Based Secure Healthcare Framework", International Conference on Computational Intelligence and Data Science (ICCIDS 2018), 10.1016/j.procs.2018.05.020
[2]. Thérence Nibareke and Jalal Laassiri, "Using Big Data-machine learning models for diabetes prediction and flight delays analytics", J Big Data (2020) 7:78, https://doi.org/10.1186/s40537-020-00355-0, Springer
[3]. Rahul C. Basole, Mark L. Braunstein, and Jimeng Sun, "Data and Analytics Challenges for a Learning Healthcare System", ACM Journal of Data and Information Quality, Vol. 6, No. 2–3, Article 10, Publication date: July 2015
[4]. Emrana Kabir Hashi, Md. Shahid Uz Zaman, Md. Rokibul Hasan, "An Expert Clinical Decision Support System to Predict Disease Using Classification Techniques", International Conference on Electrical, Computer and Communication Engineering (ECCE), February 16-18, 2017, IEEE
[5]. Md. Golam Rabiul Alam, Rim Haw, Sung Soo Kim, Md. Abul Kalam Azad, Sarder Fakhrul Abedin, Choong Seon Hong, "EM-Psychiatry: An Ambient Intelligent System for Psychiatric Emergency", IEEE Transactions on Industrial Informatics, Vol. 12, No. 6, December 2016
[6]. Prasan Kumar Sahoo, Suvendu Kumar Mohapatra, Shih-Lin Wu, "Analyzing Healthcare Big Data With Prediction for Future Health Condition", Vol-4 20176 IEEE, Digital Object Identifier 10.1109/ACCESS.2016.2647619
[7]. Ram, S., Zhang, W., and Williams, M., "Predicting Asthma-Related Emergency Department Visits Using Big Data", IEEE Journal 19(4): 1216–1223, 2015.
[8]. Wang, Y., Kung, L. A., and Terry Anthony Byrd, "Understanding its capabilities and potential benefits for healthcare organizations", Journal of Technological Forecasting and Social Change 126:3–13, 2018.
[9]. Abdulsalamyassine, S., "Mining Human Activity Patterns From Smart Home Big Data for Health Care Applications", IEEE Access 5:13131–13149, 2017.
[10]. Javier Andreu-Perez, Carmen C. Y. Poon, Robert D. Merrifield, Stephen T. C. Wong, and Guang-Zhong Yang, "Big Data for Health", IEEE Journal of Biomedical and Health Informatics, Vol. 19, No. 4, July 2015
[11]. Nongyao Nai-arun, Rungruttikarn Moungmai, "Comparison of Classifiers for the Risk of Diabetes Prediction", Procedia Computer Science 69 (2015) 132
[12]. Archenaa J. et al. (2015), "A Survey of Big Data Analytics in Healthcare and Government", Procedia Computer Science 50: 408–413.
[13]. Ayman Mir, Sudhir N. Dhage, "Diabetes Disease Prediction using Machine Learning on Big Data of Healthcare", 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), 978-1-5386-5257-2/18/$31.00 © 2018 IEEE
[14]. Senthilkumar SA, Bharatendara K Rai, Amruta A Meshram, Angappa Gunasekaran, Chandrakumarmangalam, "Big Data in Healthcare Management: A Review of Literature", American Journal of Theoretical and Applied Business 2018; 4(2): 57-69, http://www.sciencepublishinggroup.com/j/ajtab, doi: 10.11648/j.ajtab.20180402.14, ISSN: 2469-7834 (Print); ISSN: 2469-7842 (Online)
[15]. K. Shailaja, B. Seetharamulu, M. A. Jabbar, "Machine Learning in Healthcare: A Review", Proceedings of the 2nd International Conference on Electronics, Communication and Aerospace Technology (ICECA 2018), IEEE Conference Record 42487; IEEE Xplore ISBN: 978-1-5386-0965-1
[16]. Usha Nandhini, Dr. K. Dharmarajan, "Diabetic Analysis on Big data and Machine Learning - A Literature Review", Parishodh Journal, ISSN NO: 2347-6648, 2020.