Diabetes Prediction Using Data Mining Te
Abstract:- This research work covers the design and implementation of a diabetes prediction system, with Fudawa Health Centre as a case study. The system helps to automate the prediction of diabetes even before clinicians arrive. The current process is carried out manually, which makes data analysis inflexible for doctors and the transmission of information opaque. The system was designed using the Java programming language, the Weka tool, and MySQL as the back end. A strategic approach was taken to analyse the existing system in order to meet the demands of the new system and to solve the problems of the existing one by implementing the Naïve Bayes classifier. The new system will help to reduce the stress doctors face during the prediction of diabetes. The results of the experiment show that the proposed system gives better prediction in terms of accuracy.

Keywords: Diabetes, Data mining, Weka tool, Diagnosis, prediction, Naïve Bayes classifier, technique

I. INTRODUCTION

Diabetes is divided into two distinct types. Type 1 diabetes requires insulin to be supplied artificially, through medicines or injections. In type 2 diabetes the pancreas creates insulin, but the body does not use it effectively. The majority of people with diabetes have type 2 diabetes. Diabetes used to be a problem mainly among adults, specifically middle-aged people, but with changing lifestyles it now affects children too. Type 1 diabetes cannot be prevented, because various external environmental stimulants result in the destruction of the body's insulin-producing cells. However, lifestyle changes that achieve the required body weight and level of physical activity can help to prevent type 2 diabetes from developing further. Diabetes is a chronic health problem with devastating, yet preventable consequences. It is characterized by high blood glucose levels resulting from defects in insulin production, insulin action, or both.1,2 Globally, there were 15.1 million people with diabetes in 2003, and the number of people with diabetes worldwide is projected to increase to 36.6 million by 2030. Of these, 90-95% of these
Frawley and Piatetsky (1996) describe data mining as the process of extracting implicit and previously undisclosed important information from data sets that can be used for effective decision-making. The process is termed Knowledge Discovery in Databases. Such discovered knowledge can be very useful in many areas of science, and health care is no different: Knowledge Discovery in Databases would help in predicting trends for many kinds of diseases and illnesses. Doctors, rather than depending only on their own knowledge and experience, can use data mining, and specifically Knowledge Discovery in Databases, to forecast trends that would lead to better diagnoses, reduce cost and save person-hours for the organization. Data mining lies at the interface of statistics, database technology, pattern recognition, machine-readable data, and intelligent expert systems (Obenshain, 2004). The prime objective of data mining is to extract information from data sources and turn it into a comprehensible structure for further use (Data Mining Curriculum, 2014). Data mining is a process used to locate correlations between data and to form patterns of relationships among cluster fields in enormous interactive databases (Extract - Nature Biotechnology 18, 2000).

With data mining techniques, doctors around the world will be able to predict illnesses effectively and be better equipped to manage potential high-risk candidates. Such analysis and predictions become critical if the objective is to provide relief to millions around the world.

This research addressed the main challenging issue confronting the health care industry, which is the lack of quality service at minimal cost, from diagnosing to predicting patients correctly (Chaurasia and Pal, 2013), or sometimes even understanding the complications that may result from the diseases (Srinivas et al., 2010). This issue can lead to unfortunate clinical decisions with devastating and unacceptable consequences (Apte and Dangare, 2012). The availability of patients' medical data has created the need, for clinicians and patients, for an alternative computer-based assessment tool that can assist in decision-making (Soni et al., 2011). For example, physicians can compare analytical information of numerous patients with the matching condition, and can equally confirm their results against those from other parts of the country (Srinivas et al., 2010).

This research applies the Naïve Bayes classification technique to the dataset obtained from Fudawa Health Care Centre, Jos, Plateau State, Nigeria. The dataset was preprocessed with the Weka tool to remove noise and null fields, and was then divided into a training dataset and a test dataset. The following parameters were used for classifying diabetes into positive and negative classes: age, insulin, smoke cigarette, age first smoked, and where the survey was taken.

Support vector machine is a method that uses concepts from computer science and statistics to analyze data and support pattern recognition, which is then used for classification. It makes predictions based on a set of accepted inputs, where for every given input there are two feasible classes (Madzov et al., 2009). The support vector machine is designed on the principle of structural risk minimization, with the basic idea of finding the hypothesis with the lowest error. However, the drawback of this learner is that its computation is highly expensive, so it runs slowly on large datasets, and it does not offer probability estimates directly. It also does not perform very well on large datasets because a longer training time is required.

Naïve Bayes classification is simple and particularly suited to cases where the dimensionality of the input is high. Despite its simplicity, it can outperform more sophisticated classification methods. This classifier works on the assumptions that the data are categorical in nature and that the occurrences of attributes are independent, and it predicts accurately on high-volume datasets.

II. RELATED CONCEPTS

The clinical presentation of diabetes in a patient is the set of symptomatic features presented by the patient. These features are an indication of the cause of the disease and have a direct impact in guiding clinicians about the decision to take. For classifying positive and negative diabetes, the following parameters were considered: age, insulin, smoke, age first smoked and where the survey was taken.

Positive class label (P): a patient is confirmed to have positive diabetes when the patient has one or more symptoms of diabetes and this has also been confirmed by laboratory tests, since these features are the most commonly occurring symptoms in a patient with diabetes.

Negative class label (N): a patient may have some of the parameters (symptoms) of positive diabetes, but after several diagnostic tests the diabetes is undetectable. This means that the signs may be the result of another concomitant disease.

2.1 Naïve Bayes Classifier

The Naive Bayesian classifier is based on Bayes' theorem with independence assumptions between predictors. A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Despite its simplicity, the Naive Bayesian classifier often does surprisingly well and is widely used because it often outperforms more sophisticated classification methods. It works on the assumptions that the data are categorical and that occurrences of events are independent, and it predicts accurately on large datasets.

ALGORITHM

Let A be a training dataset. Suppose each tuple is represented by an n-dimensional attribute vector $X = (X_1, X_2, \dots, X_n)$ which
represents n measurements on the tuple from n attributes $B_1, B_2, \dots, B_n$.

Suppose that there are N classes $C_1, C_2, \dots, C_N$. Given a tuple X, the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X. That is, X belongs to class $C_k$ if and only if

$$P(C_k \mid X) > P(C_i \mid X) \quad \text{for } 1 \le i \le N,\ i \ne k.$$

For two classes, the algorithm will maximize

$$P(C_1 \mid X) = \frac{P(X \mid C_1)\,P(C_1)}{P(X)} \quad \text{and} \quad P(C_2 \mid X) = \frac{P(X \mid C_2)\,P(C_2)}{P(X)}.$$

Hence,

$$P(C_1 \mid X) > P(C_2 \mid X) \quad \text{if and only if} \quad P(X \mid C_1)\,P(C_1) > P(X \mid C_2)\,P(C_2),$$

since $P(X)$ is the same in both cases.

Given a dataset with many attributes, it is expensive to compute $P(X \mid C_1)$ directly. Therefore, we assume that the values of the attributes are conditionally independent given the class. Thus,

$$P(X \mid C_1) = \prod_{k=1}^{n} P(X_k \mid C_1).$$

Therefore,

$$P(X \mid C_1) = P(X_1 \mid C_1) \times P(X_2 \mid C_1) \times P(X_3 \mid C_1) \times \dots \times P(X_n \mid C_1).$$

The advantages of the Naïve Bayes classifier are as follows:

- It can approximate the class probabilities of any given instance, and it is relatively simple.
- It requires less model training time.
- It performs well in the presence of irrelevant features.

2.2 The Knowledge Discovery in Databases (KDD)

Knowledge Discovery in Databases (KDD) is the procedure used to attain important and useful knowledge from a large collection of previously collected data. The process involves selecting, preparing and cleansing the data of unnecessary information. Any previously available information is incorporated into the data sets. Data interpretations are conducted to achieve precise outcomes from available results, as shown in Figure 1.
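The selection and cleansing steps of KDD described in Section 2.2 can be illustrated with a minimal Python sketch. It is not the Weka preprocessing used in this research: the field names, the sample records and the rule that a record with any null or empty field is discarded are all assumptions made for illustration.

```python
# Hypothetical attribute selection; the names are illustrative only.
FIELDS = ("age", "take_insulin", "smoke_cigarette", "label")


def select_fields(record, fields=FIELDS):
    """Selection step: keep only the attributes relevant to the task."""
    return {name: record.get(name) for name in fields}


def is_clean(record):
    """Cleansing rule (assumed): reject records with null or empty fields."""
    return all(value not in (None, "") for value in record.values())


def prepare(records):
    """Run selection, then cleansing, and return the usable records."""
    selected = [select_fields(r) for r in records]
    return [r for r in selected if is_clean(r)]


# Made-up records; "note" stands in for unnecessary information.
raw = [
    {"age": 45, "take_insulin": "Yes", "smoke_cigarette": "No", "label": "P", "note": "x"},
    {"age": None, "take_insulin": "No", "smoke_cigarette": "No", "label": "N"},
    {"age": 30, "take_insulin": "", "smoke_cigarette": "Yes", "label": "N"},
    {"age": 52, "take_insulin": "No", "smoke_cigarette": "Yes", "label": "P"},
]
cleaned = prepare(raw)
print(len(cleaned), "of", len(raw), "records kept")
```

Dropping incomplete records is only one possible cleansing policy; imputing missing values is a common alternative when data are scarce.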
examines data objects without referring to an identified class label.

Summarization categorizes the distinctive properties of data and points out whether data values should be categorized as noise or outliers.

This research classifies data mining tasks as shown in Figure 2.

Figure 2: Data Mining Tasks

III. METHODOLOGIES AND PREDICTION FRAMEWORK

3.1 Prediction Framework

The framework uses clinical parameters of diabetes to classify diabetes in a patient. The steps involved are data collection, data pre-processing, and classification and prediction.

- Data collection: the data of patients having diabetes is collected.
- Data pre-processing: noise and null fields are removed.
- Classification and prediction: the Naïve Bayes classifier is used to classify the dataset into categories.

The working of the framework is illustrated as follows:

1. Data collection and preprocessing are done.
2. The preprocessed data is stored in a training dataset.
3. The test data is stored in a test dataset in the database.

The test dataset is then classified into the positive and negative class labels. If a patient has diabetes, the patient is classified as positive (P); if the patient does not have diabetes, the patient is classified as negative (N), as shown in Figure 3.

[Figure 3: Data collection, Data pre-processing, classification into Positive / Negative]
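The framework steps in Section 3.1, combined with the decision rule of Section 2.1, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Java and Weka implementation used in this research: the training records are made up, only two categorical parameters are used, and Laplace (add-one) smoothing is added so that an unseen value does not force a probability of zero.

```python
from collections import Counter, defaultdict


def train(records, label_key="label"):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (class, attribute) -> value counts
    domains = defaultdict(set)           # attribute -> observed values
    for rec in records:
        c = rec[label_key]
        class_counts[c] += 1
        for attr, value in rec.items():
            if attr == label_key:
                continue
            value_counts[(c, attr)][value] += 1
            domains[attr].add(value)
    return class_counts, value_counts, domains


def scores(model, features):
    """Return P(X|C) * P(C) for each class (the unnormalised posterior)."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    result = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(C)
        for attr, value in features.items():
            # Laplace smoothing: an assumption, not stated in the paper.
            count = value_counts[(c, attr)][value]
            score *= (count + 1) / (n_c + len(domains[attr]))
        result[c] = score
    return result


def classify(model, features):
    """Pick the class with the highest score, P (positive) or N (negative)."""
    s = scores(model, features)
    return max(s, key=s.get)


# Hypothetical training dataset, for illustration only.
training = [
    {"take_insulin": "Yes", "smoke_cigarette": "Yes", "label": "P"},
    {"take_insulin": "Yes", "smoke_cigarette": "No", "label": "P"},
    {"take_insulin": "Yes", "smoke_cigarette": "Yes", "label": "P"},
    {"take_insulin": "No", "smoke_cigarette": "No", "label": "N"},
    {"take_insulin": "No", "smoke_cigarette": "Yes", "label": "N"},
    {"take_insulin": "No", "smoke_cigarette": "No", "label": "N"},
]
model = train(training)
print(classify(model, {"take_insulin": "Yes", "smoke_cigarette": "Yes"}))
```

Because P(X) is the same for every class, the sketch compares P(X|C) P(C) directly and never computes the normalising term.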
3.2 Using The Naïve Bayes In The Study

The Naïve Bayes method discussed in Section 2 works as follows on our problem.

Using the Naïve Bayes formula,

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

where H is the class and E is the predictor:

- $P(H \mid E)$ is the posterior probability of the class given the predictor
- $P(H)$ is the prior probability of the class
- $P(E \mid H)$ is the probability of the predictor given the class
- $P(E)$ is the prior probability of the predictor

The diabetes class is calculated as follows.

Positive class (P): a patient may have diabetes if the selected features indicate that the probability of the positive class is greater than that of the negative class.

$$P(\text{Positive} \mid \text{patient}) = \frac{P(\text{patient} \mid \text{P})\,P(\text{P})}{P(\text{patient})}$$

Negative class (N): a patient may not have diabetes if the selected features indicate that the probability of the negative class is greater than that of the positive class.

$$P(\text{Negative} \mid \text{patient}) = \frac{P(\text{patient} \mid \text{N})\,P(\text{N})}{P(\text{patient})}$$

Table 1: Parameters used for prediction

| Serial number | Parameter | Description | Allowed values |
|---|---|---|---|
| 1 | Age | Age of the subject | Discrete integer value |
| 2 | Take insulin | Takes any drug or injection that can prevent diabetes | Yes or No |
| 3 | Smoke cigarette | Whether the subject smokes cigarettes | Yes or No |
| 4 | Age first smoked | Age at which the subject started smoking | Discrete integer value |
| 5 | Where did you take the survey? | Where the subject takes the survey | Home or Office |

3.3 The Proposed Application Software

This research applied a data mining technique for predicting heart and diabetes risks for individual patients of Fudawa. It used the mining comparison prediction algorithm and the patient data set attributes that affect the prediction. The program was developed using the Java programming language, with MySQL as the database.

3.4 Requirement Analysis Of Proposed System

Requirement analysis is the process of determining user expectations for a new or modified system. The requirements for a system are the descriptions of what the system should do, the services that it provides and the constraints on its operation. The system requirements are classified into two types.

3.4.1 Functional Requirement

A functional requirement describes what a software system should do. It also specifies the operations and activities that a system must be able to perform.

Functional requirements should include:

- Descriptions of data to be entered into the system
- Descriptions of work-flows performed by the system
- Descriptions of system reports or other outputs

Some of the functional requirements of the proposed system include:

i. The proposed system will provide a platform to analyze the dataset for new patients.
ii. The proposed system will measure the dataset for accuracy.

3.4.2 Non-Functional Requirement

Non-functional requirements, as the name suggests, are requirements that are not directly concerned with the specific services delivered by the system to its users. A non-functional requirement specifies criteria that can be used to judge the operation of a system, rather than specific behaviors.

3.5 System Design and Modeling

Software design is a creative activity in which software components and their relationships are identified, based on requirements. It is the process of defining the component modules, interfaces and the architecture of the system to satisfy user requirements. The modeling of the system was done using Unified Modeling Language (UML) objects and components. The UML diagrams used in the design of the proposed system include the Data Flow Diagram (DFD), class diagram and activity diagram, as shown in Figure 4 and Figure 5.
A portion of the real data was used for training the model. We have only one training set for classifying patients as either positive or negative for diabetes, using the Naïve Bayes classification discussed in Section 3.

A total of 155 cleaned, preprocessed records were collected and stored in a database named diabetes. The 155 records were used for training the model in the classification phase. During performance testing, a sample of 50 records was drawn from the initial population of 155 as a validation set.

In this study, we check the accuracy of the Naive Bayes classifier using a confusion matrix.

4.1 Confusion Matrix

A confusion matrix is used to summarize the performance of a classification algorithm. It demonstrates the accuracy of a solution to a given classification problem. It contains information about the predicted and actual classifications done by a classifier system. The performance of the model is normally evaluated using the data in the confusion matrix. In this study, we achieved 90-95% accuracy of correctly classified instances in the classification phase.

Table 2: Confusion Matrix

| N = 155 | Predicted: No | Predicted: Yes | Total |
|---|---|---|---|
| Actual: No | Tn = 45 | Fp = 15 | 60 |
| Actual: Yes | Fn = 5 | Tp = 90 | 95 |

In the table above, there are two predicted classes: 'yes' and 'no'. The classifier made a total of 155 predictions for the 155 cases. It predicted 'yes' 105 times and 'no' 50 times.

$$\text{Sensitivity (TPR)} = \frac{Tp}{Fn + Tp} = \frac{90}{5 + 90} = \frac{90}{95} \approx 0.95 = 95\%$$

$$\text{Specificity (TNR)} = \frac{Tn}{Tn + Fp} = \frac{45}{45 + 15} = \frac{45}{60} = 0.75 = 75\%$$

$$\text{False positive rate (FPR)} = \frac{Fp}{Tn + Fp} = \frac{15}{45 + 15} = \frac{15}{60} = 0.25 = 25\%$$

$$\text{False negative rate (FNR)} = \frac{Fn}{Fn + Tp} = \frac{5}{5 + 90} = \frac{5}{95} \approx 0.05 = 5\%$$

$$\text{Precision} = \frac{Tp}{Tp + Fp} = \frac{90}{90 + 15} = \frac{90}{105} \approx 0.86 = 86\%$$

$$\text{Accuracy} = \frac{Tp + Tn}{N} = \frac{90 + 45}{155} = \frac{135}{155} \approx 0.87 = 87\%$$
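The measures derived from the confusion matrix in Table 2 can be recomputed with a short Python sketch. The function name `confusion_metrics` is hypothetical; the four counts are the ones reported in Table 2, and the formulas are the standard definitions used in this section.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute the standard measures from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "false_positive_rate": fp / (tn + fp),
        "false_negative_rate": fn / (fn + tp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / total,
    }


# Counts taken from Table 2.
metrics = confusion_metrics(tp=90, tn=45, fp=15, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Computing every measure from the same four counts keeps the reported figures consistent with each other and with the table.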
4.2 Hardware Requirements

The most common set of requirements defined by any operating system or software application is the physical computer resources, also known as hardware. A hardware requirement list is often accompanied by a hardware compatibility list (HCL), especially in the case of operating systems.

Below are the minimum requirements of this system:

1. A keyboard
2. UPS (uninterrupted power supply)
3. A pointing device
4. VGA (video graphics adapter)
5. A minimum of 64 MB RAM (Random Access Memory) or higher
6. A Pentium processor (or any equivalent processor) with a common speed of 1.27 MHz or above.

4.3 Software Requirements

Software requirements deal with defining the software resource requirements and prerequisites that need to be installed on a computer to provide optimal functioning of an application. These requirements or prerequisites are generally not included in the software installation package and need to be installed separately before the software is installed. The following is the software required by the system:

1. Windows Operating System (OS) such as Windows 7, Windows 8, Windows 8.1 or Windows 10
2. Weka Tool
3. NetBeans Integrated Development Environment (IDE)
4. Java Development Kit (JDK)

4.4 Choice of Tools

The following tools are used for the application.

Front End: A front-end system is the part of an information system that is directly accessed and interacted with by the user to receive or utilize the back-end capabilities of the host system. It enables users to access and request the features and services of the underlying information system. The front-end system can be a software application or a combination of hardware, software and network resources. The following are the tools used for the front end of the application:

1. Java Programming Language
2. Weka Tool
3. Naïve Bayesian Classifier
4. Java Development Kit

Back End: The back end of an application or program serves indirectly in support of the front-end services, usually by being closer to the required resource or having the capability to communicate with the required resource. The back end of the application may interact directly with the front end or, perhaps more typically, is a program called from an intermediate program that mediates front-end and back-end activities. MySQL is used as the database application; WAMP Server was used as the testing server.

4.5 System Implementation

In the course of the research, we tested the system using the NetBeans development environment. The implementation of the diabetes prediction system was successful, although many challenges were encountered, such as errors during execution, financial constraints, and delays in obtaining the information we needed. Despite these, the system was implemented successfully once all the issues were corrected.

4.5.1 Software Performance Testing

Performance testing is generally executed to determine how a system or sub-system performs in terms of responsiveness and stability under a particular workload. It can also serve to investigate, measure, validate or verify other quality attributes of the system, such as scalability, reliability and resource usage, as shown in Figure 6.
V. CONCLUSION AND FUTURE WORK

5.1 Conclusion

An application using a data mining algorithm of class comparison has been developed to predict the occurrence or recurrence of diabetes risks. In addition, the results of the application show that the prediction system is capable of predicting diabetes effectively, efficiently and, most importantly, in a timely manner. That means the application is capable of helping a physician in making decisions about patient health risks. It generates results that are close to real-life situations. That makes data mining more helpful in the health sector, which means that it is necessary for knowledge discovery in the healthcare sector.

Much more than huge savings in costs in terms of medical expenses, loss of duty time and usage of critical medical facilities,

The Naïve Bayes classifier based system is very useful for the diagnosis of diabetes. The system can perform good prediction with less error, and this technique could be an important tool for supplementing medical doctors in performing expert diagnosis. In this method the efficiency of forecasting was found to be around 95%.
This application would be a tremendous asset for doctors, who can have structured, specific and invaluable information about their patients and others, so that they can ensure that their diagnoses or inferences are correct and professional.

Finally, the huge appreciation received from doctors for having such software proves that in a place like, where diseases are on the rise, such applications should be developed to cover the entire state. The common person stands to benefit from doctors having such a tool, and can become more knowledgeable as far as personal health and wellbeing are concerned.

5.2 Future Work

Future work should be done on improving the accuracy of the prediction by increasing the amount of training data. Its performance can be further improved by identifying and incorporating various other parameters and increasing the size of the training set.

REFERENCES

[1]. Acharya, Rajendra U. and Yu, Wenwei (2010). Data Mining Techniques in Medical Informatics. The Open Medical Informatics Journal, PMCID: PMC2916206.
[2]. Aflori, C. and Craus, M. (May 2007). Grid Implementation of the Apriori Algorithm. Advances in Engineering Software, 38(5), pp. 295-300.
[3]. Anbarasi, M. (2010). 'Enhanced Prediction of Heart Disease with Feature Subset Selection using Genetic Algorithm', International Journal of Engineering Science and Technology, 2(10), 5370-5376.
[4]. Bronzino, D. Joseph (2006). Medical Devices and Systems.
[5]. Chaurasia, V. and Pal, S. (2013). 'Data Mining Approach to Detect Heart Diseases', International Journal of Advanced Computer Science and Information Technology (IJACSIT), 2(4), pp. 56-66.
[6]. Clifton, Christopher (2010). Encyclopedia Britannica: Definition of Data Mining. Retrieved 2016.
[7]. Data Mining Curriculum, ACM SIGKDD, 2006-04-30. Retrieved 2016.
[8]. Fayyad, Usama (15 June 1999). First Editorial by Editor-in-Chief, SIGKDD Explorations 1:1, doi: 10.1145/2207243.2207269.
[9]. Fayyad, Usama, Piatetsky-Shapiro, Gregory and Smyth, Padhraic (1996). From Data Mining to Knowledge Discovery in Databases.
[10]. Han, J. and Kamber, M. (2010). Data Mining: Concepts and Techniques, 2nd ed., The Morgan Kaufmann Series.
[11]. Han, Jiawei and Kamber, Micheline (2001). Data Mining: Concepts and Techniques, pp. 5.
[12]. Hastie, Trevor, Tibshirani, Robert and Friedman, Jerome (2009). 'The Elements of Statistical Learning'.
[13]. Hnin Wint Khaing (2011). "Data Mining based Fragmentation and Prediction of Medical Data", IEEE.
[14]. A. J.T. Lee, Y.H. Liu, H.Mu Tsai, H. Hui Lin, H-W. Wu (2009). 'Mining frequent patterns in image databases with 9DSPA Representation', Journal of Systems and Software, 82(4), pp. 603-618.
[15]. Lazakidou, A. Athina and Siassiakos, M. Konstantinos (2009). 'Handbook of Research on Distributed Medical Informatics and E-health'.
[16]. Madzov, G., Gjorgevikj, D. and Chorbev, I. (2009). 'A multi-class SVM classifier utilizing binary decision tree', Informatica, vol. 33, no. 2, pp. 233-241.
[17]. Maimon, Oded and Rokach, Lior (2010). Introduction to Knowledge Discovery in Databases, University of Tel Aviv, Springer Science and Business Media.
[18]. Mena, Jesus (2011). Machine Learning Forensics for Law Enforcement, Security and Intelligence, Boca Raton, FL: CRC Press, ISBN 978-1-4398-6069-4.
[19]. Obenshain, Mary K. (2004). Application of Data Mining Techniques to Healthcare Data, Statistics for Hospital Epidemiology.
[20]. Piatetsky-Shapiro, Gregory and Parker, Gary (2011). Lesson: Data Mining, Knowledge Discovery: An Introduction.
[21]. Quinlan, J. (1986). Induction of Decision Trees. Machine Learning, 1, pp. 81-106.
[22]. Reutemann, Peter and Witten, Ian H. (2010). WEKA Experiences with a Java Open-source Project, Journal of Machine Learning Research, 11, pp. 2533-2541.
[23]. Robert E. Hoyt and Yoshihashi, Ann (2014). 'Health Informatics, Practical Guide for Health Care', 6th ed. See e.g. OKAIRP (2005) Fall Conference, Arizona State University: Data mining.
[24]. Shanta Kumar, Patil, P. and Kumaraswamy, Y.S. (2011). "Predictive data mining for medical diagnosis of heart disease prediction", IJCSE, 17.
[25]. Srinivas, K. (2010). "Analysis of Coronary Heart Disease and Prediction of Heart Attack in coal mining regions using data mining techniques", IEEE Transaction on Computer Science and Education (ICCSE), pp. 1344-1349.
[26]. Witten, Ian H., Frank, Eibe and Hall, Mark A. (2011). Data Mining: Practical Machine Learning Tools and Techniques (3rd ed.), Elsevier, ISBN 978-0-12-374856-0.
[27]. World Health Day (2016). WHO calls for global action to halt rise in and improve care for people with diabetes. http://www.who.int/diabetes/global-report/WHD16-press-release-EN_3.pdf?ua=1
[28]. Yoo, Illhoi, Alafaireet, Patricia, Marinov, Miroslav, Pena-Hernandez, Keila, Gopidi, Rejitha, Chang, Jia-Fu and Hua, Lei (2011). Data Mining in Healthcare and Biomedicine: A Survey of the Literature, Med Syst DOI, Springer.