
Paper 19

This document summarizes a journal article about using random forest and genetic algorithms for intelligent heart disease prediction. The authors propose a classification model that uses random forest as the classifier and chi square and genetic algorithms for feature selection. The model is tested on heart disease data and shows improved classification accuracy compared to other methods. Random forest is an accurate ensemble learning algorithm suitable for medical applications. Feature selection helps increase accuracy by removing irrelevant features. The presented model can help healthcare professionals predict heart disease.


Journal of Network and Innovative Computing

ISSN 2160-2174 Volume 4 (2016) pp. 175-184


© MIR Labs, www.mirlabs.net/jnic/index.html

Intelligent heart disease prediction system using random forest and evolutionary approach

M.A. Jabbar1, B.L. Deekshatulu2 and Priti Chandra3
1 Associate Professor, MJCET, Hyderabad, India
2 Distinguished Fellow, IDRBT, RBI, Hyderabad
3 Senior Scientist, ASL, DRDO, Hyderabad

Abstract: Heart disease is a leading cause of premature death in the world. Predicting the outcome of the disease is a challenging task. Data mining can be used to automatically infer diagnostic rules and help specialists make the diagnosis process more reliable. Several data mining techniques are used by researchers to help health care professionals predict heart disease. Random forest is an ensemble learning algorithm, among the most accurate, and is suitable for medical applications. The chi square feature selection measure is used to evaluate the relationship between variables and determine whether they are correlated or not. In this paper, we propose a classification model which uses random forest as the classifier and chi square and genetic algorithm as feature selection measures to predict heart disease. The experimental results show that our approach improves classification accuracy compared to other classification approaches, and that the presented model can be successfully used by health care professionals for predicting heart disease.

Keywords: Heart disease, Random forest, Data mining, Feature selection, Chi square, Genetic algorithm.

I. Introduction

Heart disease, also called coronary artery disease, is a condition that affects the heart. Heart disease is a leading cause of death worldwide. Physicians generally make decisions by evaluating the current test results of their patients. Previous decisions made for other patients with the same condition are also examined. Diagnosing heart disease therefore requires experienced and highly skilled physicians. Heart disease will become a leading cause of death by 2020. Heart disease diagnosis is an important yet complicated task. Today many hospitals collect patient data to manage the health care of patients. This information is in different formats such as numbers, charts, text and images. Although these databases contain rich information, they are poorly used for clinical decision making [1].

Automation, or a decision support system, would be extremely advantageous. Data mining can be used to automatically infer diagnostic rules and help specialists make the diagnosis process more reliable. The purpose of prediction in data mining is to discover trends in patient data in order to improve their health care [2].

Knowledge discovery in databases is applied to extract useful patterns from medical data sets using various data mining techniques. Data mining has shown good results in the prediction of heart disease and is widely applied for this purpose. Due to the shortage of doctors and experts in the medical field who can predict heart disease, and because patients' symptoms are often neglected, data mining has emerged as an analysis tool.

Random forest is an ensemble classifier which combines bagging and random selection of features. Random forest can handle data without preprocessing. The random forest algorithm has been used in prediction and probability estimation. A random forest consists of many decision trees and outputs the class that is the mode of the classes predicted by the individual trees [3]. It is one of the most accurate classifiers. It produces highly accurate classification for many data sets, especially heart disease data sets.

Feature selection is the process of identifying and removing redundant and irrelevant features in order to increase accuracy. During the last decade, the motivation for applying feature subset selection in model building has increased. Feature subset selection methods are classified into four types: 1) embedded methods, 2) wrapper methods, 3) filter methods and 4) hybrid methods. Genetic algorithm is a randomized wrapper feature selection technique. The chi square test is a filter method used to determine the difference between expected frequency and observed frequency. Information gain and gain ratio are fast, univariate, filter-based feature selection methods. These feature selection measures are independent of the classifier.

The major contributions of our paper are summarized as follows.

Dynamic Publishers, Inc., USA


1) We propose a new method which employs the random forest ensemble algorithm for prediction of heart disease.
2) We apply chi square and genetic algorithm to select the best features.
3) We apply feature selection measures to improve the accuracy in predicting heart disease.

The rest of the paper is organized as follows. Section 2 presents related work, reviewing various articles related to heart disease. Section 3 deals with the literature review. Section 4 presents the proposed approach. Experimental results are discussed in Section 5. We conclude in Section 6.

II. Related work

In this section, we review some articles related to heart disease.

Kemal Polat et al. proposed a hybrid method which uses fuzzy weighted preprocessing and an artificial immune system [4]. Their proposed medical decision making method consists of two phases. In the first phase, fuzzy weighted preprocessing is applied to the heart disease data set to weight the input data. In the second phase, the artificial immune system is applied to classify the weighted input. They applied their methodology to the Cleveland heart disease data set, which consists of 13 attributes. The method uses 10-fold cross validation.

Diagnosis of heart disease through neural network ensembles was proposed by Resul Das et al. [5]. Their method creates a new model by combining posterior probabilities from multiple predecessor models. They implemented the method with SAS base software on the Cleveland heart disease data set and obtained 89.01% accuracy.

P.K. Anooj developed a clinical decision support system to predict heart disease using a fuzzy weighted approach. The method consists of two phases. The first phase consists of the generation of weighted fuzzy rules, and in the second phase a fuzzy rule based decision support system is developed. The author used attribute selection and an attribute weight method to generate fuzzy weighted rules. Experiments were carried out on data from the UCI repository and obtained an accuracy of 57.85% [6].

Robert Detrano et al. proposed a probability algorithm for the diagnosis of coronary artery disease. The probabilities that resulted from the application of the Cleveland algorithm were compared with a Bayesian algorithm. Their method obtained an accuracy of 77% [7].

Decision trees for diagnosing heart disease patients were proposed by Mai Shouman et al. [8]. Different types of decision trees are used for classification. The research involves data discretization, decision tree selection and reduced error pruning. Their method outperforms bagging and the J48 decision tree. Their approach achieved 79.1% accuracy.

Diagnosis of heart disease through a bagging approach was proposed by My Chau Tu et al. [9]. The proposed bagging algorithm is used to identify warning signs of heart disease. They made a comparison with decision tree. Their approach claimed an accuracy of 81.4%.

Andreeva used the C4.5 decision tree for the diagnosis of heart disease. Feature extraction and specific rule inference from the heart disease data set are considered. The proposed approach achieved an accuracy of 75.73% [10].

Diagnosis of CVD with Bayesian classifiers was proposed by Alaa Elsayad et al. [11]. The researchers evaluated the performance of Bayesian classifiers in predicting heart disease. The Cleveland heart disease data set, which consists of 14 features, is used for their study. The model is implemented in the SPSS workbench. Their study evaluates two Bayesian network classifiers, namely 1) tree augmented naïve Bayes (TAN) and 2) Markov blanket estimation (MBE). Classification accuracies are compared with SVM. The performance of the classification model is evaluated using classification accuracy, specificity and sensitivity. The MBE model achieved an accuracy of 97.92, whereas the TAN and SVM classifiers achieved accuracies of 88.54 and 70.83 respectively.

Hlaudi Daniel Masethe et al. proposed prediction of heart disease using five different classifiers, namely J48, Bayes, CART, REPTree and Bayes net [2]. A data set collected in South Africa is used for experimental analysis. Only 11 attributes are considered for modeling. The accuracies obtained by the various algorithms are used as reliable indicators for prediction of heart disease.

Heart disease classification using a nearest neighbor classifier with feature subset selection was proposed in [12]. Their method achieved an accuracy of 97.5%.

Feature analysis of coronary artery heart disease data sets is proposed in [13]. Their work is focused on integrating the results of machine learning on different data sets targeting coronary artery disease.

A heart disease prediction system using associative classification was proposed in [14]. The authors proposed an efficient associative classification algorithm for heart disease prediction using a genetic algorithm. Their method uses the Gini index for class association rule generation. The Gini index is used to improve classification accuracy through informative, attribute-centered rule generation. The fitness of a rule is evaluated using Z statistics. The experimental results showed that their approach achieved an accuracy of 88.9%.

M.A. Jabbar et al. [15] proposed a model for prediction of heart disease using random forest (RF) and feature subset selection. The authors proposed a new method which uses RF and chi square feature subset selection for disease prediction. The chi square metric is used to filter features in the data set. The Cleveland data set is used for experimental analysis. Five metrics (sensitivity, specificity, disease prevalence, negative predictive value and positive predictive value) are used for analysis of the classification model. Their experimental results demonstrated a significant improvement in accuracy.

Early diagnosis of heart disease using computational intelligence techniques is proposed in [16]. The authors attempted to increase the accuracy of the naïve Bayes classifier in classifying heart disease data. They used a discretization method and genetic search to remove irrelevant features.
Genetic algorithm is used for optimization. The authors performed a comparison with other traditional algorithms.

Alternating decision trees for early diagnosis of heart disease were proposed by Jabbar et al. [17]. The alternating decision tree is a new type of classifier, which is a generalization of decision trees, voted decision trees and voted decision stumps. Principal component analysis is used as a feature selection measure to select the best features. The heart disease data consists of 96 patient records with 10 features. Their proposed approach achieved an accuracy of 91.66%.

M.A. Jabbar et al. proposed cluster based association rule mining for heart attack prediction [18]. The authors proposed a model to analyze medical data sets using association rule mining. The Cleveland data set is used for experimental analysis. The medical data set is divided into partitions of equal size based on skipping fragments. Their approach reduces the main memory requirement and is efficient in pruning heart disease prediction rules.

III. Literature review

This section reviews literature used in this paper.

A. Heart Disease

Heart disease, also called coronary heart disease (CHD), is a deposition of fats inside the tubes which supply blood to the heart muscles. Heart disease can start as early as 18 years of age, and patients only come to know about it when the blockage exceeds about 70%. These blockages develop over the years and lead to rupture of the membrane covering the blockage due to increased pressure. If the chemicals released by the broken membrane mix with blood and lead to a blood clot, the result is heart disease [19].

The factors which increase blockage are called risk factors. These risk factors are classified as modifiable and non-modifiable risk factors. Non-modifiable risk factors are age, gender and heredity. These risk factors cannot be modified and will always keep contributing to heart disease.

Risk factors which can be changed by our efforts are called modifiable risk factors. Some modifiable risk factors are 1) food related, 2) habit related, 3) stress related and 4) biochemical and miscellaneous risk factors. Atherosclerosis, coronary, congenital, rheumatic, myocarditis, arrhythmia and angina are the different types of heart disease [20]. Common symptoms of heart disease are listed in Table 1 [21].

Table 1: Symptoms of heart disease

Sl.no  Symptom name
1      Chest pain
2      Strong compressing or flaming in the chest
3      Discomfort in chest area
4      Sweating
5      Light headedness
6      Dizziness
7      Shortness of breath
8      Pain spanning from the chest to arm and neck
9      Cough
10     Fluid retention

Major risk factors of coronary heart disease are listed in Table 2.

Table 2: Risk factors of heart disease [22]

Sl.no  Risk factor
1      Diabetes
2      High blood pressure
3      High LDL
4      Low HDL
5      Not getting enough physical activity
6      Obesity
7      Smoking

An effective decision support system should be developed to help in tackling the menace of heart disease.

B. Random forest (RF)

The random forest algorithm is one of the most effective ensemble classification approaches. The RF algorithm has been used in prediction and probability estimation. RF consists of many decision trees. Each decision tree gives a vote that indicates its decision about the class of the object. The term random forest was first proposed by Tin Kam Ho of Bell Labs in 1995. The RF method combines bagging and random selection of features. There are three important tuning parameters in random forest: 1) number of trees (n tree), 2) minimum node size and 3) number of features employed in splitting each node for each tree (m try). The advantages of the random forest algorithm are listed below.

1) Random forest is an accurate ensemble learning algorithm.
2) Random forest runs efficiently on large data sets.
3) It can handle hundreds of input variables.
4) Random forest estimates which variables are important in classification.
5) It can handle missing data.
6) Random forest has methods for balancing error for class-unbalanced data sets.
7) Generated forests in this method can be saved for future reference [23].
8) Random forest overcomes the problem of overfitting.
9) RF is less sensitive to outliers in the training data.
10) In RF, parameters can be set easily, which eliminates the need for tree pruning.
11) In RF, accuracy and variable importance are generated automatically [24].

When constructing individual trees in a random forest, randomization is applied when selecting the best split at each node: only a random subset of the attributes is considered. The size of this subset is equal to √A, where A is the number of attributes in the data set [25]. However, RF may generate many noisy trees, which affect classification accuracy and can lead to wrong decisions for new samples [25].

The following algorithm illustrates the random forest method.

Algorithm Random forest

Step 1: From the training set, select a new bootstrap sample.
Step 2: Grow an unpruned tree on this bootstrap sample.
Step 3: At each internal node, randomly select m try features and determine the best split among them.
Step 4: Grow each tree fully. Do not perform pruning.
Step 5: Output the overall prediction as the majority vote from all the trees.

B. Chi-square method

Feature selection is a preprocessing technique used to remove irrelevant and redundant features. Medical data is high volume in nature and contains redundant features. Medical diagnosis is a complicated task that needs to be executed accurately and efficiently. Feature selection, if applied to a medical data set, will give more accurate results. In this paper, we consider chi square and genetic search as feature selection and ranking methods, which show good performance in various domains.

Figure 1: Filter and wrapper feature subset selection measures

Chi square is a statistical test that is used to measure divergence from the distribution expected if feature occurrence were independent of the class value [26]. Chi square requires the following conditions to be satisfied.
1) Data must be quantitative
2) One or more categories of data required
3) Independent observations
4) Sample size should be adequate and simple
5) Data must be in frequency form
6) All observations must be read.

The chi square formula is represented as

$$\chi^2 = \sum \frac{(O - e)^2}{e}$$

where O is the observed frequency and e is the expected frequency. The following example illustrates the chi square hypothesis test.

Example: A six sided die is thrown 264 times. The results are shown in the table. We want to know if the die is biased [let $\chi^2_{0.05} = 11.07$ for 5 degrees of freedom (df)].

Number appeared on the die   1    2    3    4    5    6
Frequency                    40   32   28   58   54   52

Solution

Degrees of freedom are calculated as the number of categories in the problem minus 1. Null hypothesis H0: the die is unbiased.
Expected frequency of each of the numbers = 264/6 = 44

Observed frequency (O)   Expected frequency (e)   (O-e)^2   (O-e)^2/e
40                       44                       16        0.3636
32                       44                       144       3.2727
28                       44                       256       5.8182
58                       44                       196       4.4545
54                       44                       100       2.2727
52                       44                       64        1.4545
                                                  χ2 =      17.636

The number of degrees of freedom (df) = n-1 = 5.
The tabulated value of χ2 for df = 5 at the 5% level is 11.07. Since the calculated χ2 is greater than the tabulated χ2, we reject the null hypothesis H0, i.e. we reject the hypothesis that the die is unbiased. Hence the die is biased.

Table 3 shows an example data set, the weather data. This data set consists of 14 instances and 5 features. The last feature is the class.
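Both the die example and the attribute ranking on the weather data reduce to the same sum of (O - e)^2 / e terms. The following Python sketch recomputes the die statistic and then scores each weather attribute against the class. It assumes the attribute score is the usual contingency-table chi square, with expected counts taken from the row and column totals; the paper does not spell this out. The function names `chi_square` and `feature_class_chi_square` are illustrative, and the rows are transcribed from Table 3.

```python
from collections import Counter

def chi_square(observed, expected):
    """Sum of (O - e)^2 / e over paired observed and expected frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Die example: 264 throws, so the expected frequency is 264 / 6 = 44 per face.
die_observed = [40, 32, 28, 58, 54, 52]
die_expected = [sum(die_observed) / len(die_observed)] * len(die_observed)
die_stat = chi_square(die_observed, die_expected)
CRITICAL_5PCT_DF5 = 11.07  # tabulated value quoted in the text
die_is_biased = die_stat > CRITICAL_5PCT_DF5

# Weather data transcribed from Table 3:
# (outlook, temperature, humidity, windy, play)
WEATHER = [
    ("sunny", "hot", "high", "F", "no"),
    ("rainy", "mild", "normal", "F", "yes"),
    ("sunny", "mild", "normal", "T", "yes"),
    ("overcast", "mild", "high", "T", "yes"),
    ("overcast", "hot", "normal", "F", "yes"),
    ("rainy", "mild", "high", "T", "no"),
    ("sunny", "hot", "high", "T", "no"),
    ("overcast", "hot", "high", "F", "yes"),
    ("rainy", "mild", "high", "F", "yes"),
    ("rainy", "cool", "normal", "F", "yes"),
    ("rainy", "cool", "normal", "T", "no"),
    ("overcast", "cool", "normal", "T", "yes"),
    ("sunny", "mild", "high", "F", "no"),
    ("sunny", "cool", "normal", "F", "yes"),
]
FEATURES = ["outlook", "temperature", "humidity", "windy"]

def feature_class_chi_square(rows, col, class_col=-1):
    """Chi square of one attribute against the class (assumed definition)."""
    n = len(rows)
    value_totals = Counter(r[col] for r in rows)
    class_totals = Counter(r[class_col] for r in rows)
    cell_counts = Counter((r[col], r[class_col]) for r in rows)
    observed, expected = [], []
    for value, value_total in value_totals.items():
        for label, class_total in class_totals.items():
            observed.append(cell_counts[(value, label)])
            expected.append(value_total * class_total / n)
    return chi_square(observed, expected)

scores = {name: feature_class_chi_square(WEATHER, i)
          for i, name in enumerate(FEATURES)}
ranking = sorted(scores, key=scores.get, reverse=True)

print(f"die chi-square = {die_stat:.3f}, biased at 5% level: {die_is_biased}")
for name in ranking:
    print(f"{name:12s} {scores[name]:.3f}")
```

Under these assumptions the attribute order comes out as outlook, humidity, windy, temperature, which agrees with the ranking reported in Table 4.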
Table 3: Weather data set

No   Outlook    Temperature   Humidity   Windy   Play
1    SUNNY      HOT           HIGH       F       no
2    RAINY      MILD          NORMAL     F       yes
3    SUNNY      MILD          NORMAL     T       yes
4    OVERCAST   MILD          HIGH       T       yes
5    OVERCAST   HOT           NORMAL     F       yes
6    RAINY      MILD          HIGH       T       no
7    SUNNY      HOT           HIGH       T       no
8    OVERCAST   HOT           HIGH       F       yes
9    RAINY      MILD          HIGH       F       yes
10   RAINY      COOL          NORMAL     F       yes
11   RAINY      COOL          NORMAL     T       no
12   OVERCAST   COOL          NORMAL     T       yes
13   SUNNY      MILD          HIGH       F       no
14   SUNNY      COOL          NORMAL     F       yes

The ranking of the attributes of the weather data set based on chi square is shown in Table 4.

Table 4: Ranking of attributes based on chi square

Rank   Name of the attribute   Chi square value
1      outlook                 3.547
2      humidity                2.8
3      windy                   0.933
4      temperature             0.57

C. Genetic Algorithm(GA)

Genetic algorithm (GA) is a general purpose search method based on natural selection and genetics. GA simulates natural processes based on the principles of Lamarck and Darwin [27]. GAs are implemented using computer simulation for optimization. GAs are useful for searching very general spaces, depending on some probability values, for optimization [14][28]. Each solution generated in GA is called a chromosome. Genetic algorithms have played a major role in many applications of engineering science.

The driving force behind the genetic algorithm is the use of three operators, namely

1) Selection:
The selection operator is used to give preference to better chromosomes using an objective function.
2) Crossover:
The crossover operator takes more than one parent chromosome and produces a child from them.
3) Mutation:
The mutation operator is used to maintain diversity and to inhibit premature convergence. In mutation, a portion of the bits of the new individual are flipped.

Pseudo Code of Genetic Algorithm
Step 1: Randomly initialize the population
Step 2: Compute the fitness of the population
Step 3: Repeat
Step 4: Select parents from the population
Step 5: Perform crossover
Step 6: Perform mutation
Step 7: Compute fitness
Step 8: Until the best individuals are selected, then stop

The flow chart of the genetic algorithm is shown in Figure 2.

Figure 2: Flow chart of genetic algorithm

IV. Proposed method

The literature survey presents various techniques for prediction of heart disease. Each method has its own advantages and shortcomings. The proposed technique uses the random forest algorithm for prediction of heart disease. Feature subset selection is a process that selects a subset of the original attributes and reduces the feature space [29].

We applied random forest with chi square and GA as feature selection measures to a heart disease data set collected from various corporate hospitals in Hyderabad (heart disease data set T.S) and also to the heart Statlog data set.

In our proposed work, we used chi square and GA to select attributes and keep only the attributes which contribute more towards the diagnosis of heart disease.

A confusion matrix is a table used to visualize the performance of an algorithm. The confusion matrix (Table 5) has two rows and two columns (for two-class problems) that specify TP, FP, TN and FN.
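As a minimal illustration of how the four cells of a two-class confusion matrix are tallied, the sketch below counts TP, FP, TN and FN from paired actual and predicted labels. The helper name `confusion_counts` and the label vectors are hypothetical and are not taken from the paper's data.

```python
def confusion_counts(actual, predicted, positive=1):
    """Tally TP, FP, TN and FN for a two-class problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # disease present, predicted present
        elif p == positive:
            fp += 1   # disease absent, predicted present
        elif a == positive:
            fn += 1   # disease present, predicted absent
        else:
            tn += 1   # disease absent, predicted absent
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

# Hypothetical labels: 1 = heart disease, 0 = no heart disease.
actual = [1, 1, 1, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 1, 1, 0, 0]
counts = confusion_counts(actual, predicted)
print(counts)
```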
The confusion matrix is used to compare the actual classification of the heart disease data set with the number of correct and incorrect predictions made by the model. The traditional classification matrix is shown below.

To evaluate the performance of our proposed model, we used the following classification measures [30].

Table 5: Confusion Matrix

              Disease
Prediction    +                     -
+             True positive (TP)    False positive (FP)
-             False negative (FN)   True negative (TN)

1) Specificity = TN / (FP+TN)
2) Sensitivity = TP / (TP+FN)
3) Disease prevalence = (TP+FN) / (TP+FP+TN+FN)
4) Positive predictive value (PPV) = TP / (TP+FP)
5) Negative predictive value (NPV) = TN / (FN+TN)
6) Accuracy = (TP+TN) / (TP+FP+TN+FN)

where
TP => Positive tuples that are correctly labeled by the classifier.
TN => Negative tuples that are correctly labeled by the classifier.
FN => Positive tuples that are incorrectly labeled by the classifier.
FP => Negative tuples that are incorrectly labeled by the classifier.

Positive predictive value (PPV) is defined as the probability that heart disease is present when the diagnosis test is positive. Negative predictive value (NPV) is defined as the probability that heart disease is absent when the diagnosis test is negative [16].

Attributes of our heart disease data set T.S are listed in Table 6 and the heart Statlog attributes are shown in Table 7.

Table 6: Heart disease data set attributes

1    Age             Numeric
2    Gender          Nominal
3    BP              Numeric
4    Diabetic        Nominal
5    Height          Numeric
6    Weight          Numeric
7    BMI             Numeric
8    Hypertension    Nominal
9    Rural           Nominal
10   Urban           Nominal
11   Disease class   Nominal

Table 7: Attributes of heart Statlog data set

Sl.no  Attributes of heart disease
1      Age
2      Sex
3      Chest pain
4      Resting blood pressure
5      Serum cholesterol
6      Fasting blood sugar
7      Resting electrocardiographic results
8      Max. heart rate achieved
9      Exercise induced angina
10     Old peak
11     Slope
12     No. of major vessels
13     Thal
14     Class

Proposed algorithm:

Step 1: Load the heart disease data set.

Step 2: Rank the features in descending order based on chi square and GA value. A high chi square value indicates that the feature is more related to the class.

Step 3: Apply the backward elimination algorithm. Backward elimination starts from the full feature set and iteratively removes, one by one, the features with low value. In each iteration only one feature is removed, chosen by its effect on overall model accuracy, and the process continues until the accuracy stops increasing. The lowest-ranked feature is pruned. Chi square and GA are used to select high-ranked features.

Step 4: Select the features with the highest value.

Step 5: Apply the random forest algorithm on the remaining features of the data set that maximize the classification accuracy.

Step 6: Find the accuracy of the classifier.

Steps 1 to 4 deal with feature selection. High-ranked features are selected for classification. In steps 5 and 6, RF classification is applied to the selected feature subset. After applying classification, the accuracy of the classifier is calculated.

V. Experimental results

We carried out experiments using the hold-out and cross validation approaches. In the hold-out approach, we partitioned the samples into two independent data sets. 75% of the data set is used to train and build the classifier. The remaining 25% of the data set is used for testing.

In 10-fold cross validation, all the instances of the data set are used and are divided into 10 disjoint groups, where nine
groups are used for training and the remaining one is used for testing. The algorithm runs 10 times and the average accuracy over all folds is calculated.

To evaluate the performance of our approach, we used the measures listed in section 4. The accuracy comparison for the heart disease data set Cleveland [31] is shown in Table 8 and Figure 3. The naïve Bayes approach obtained an accuracy of 78.56%, whereas decision table obtained an accuracy of 82.43%. The results are obtained using 10-fold cross validation. Our approach obtained a 7.97% improvement over the C4.5 algorithm. The accuracy comparison for heart disease data set T.S against decision tree (DT) is shown in Table 9 and Figure 4. Our approach obtained 100% accuracy, whereas DT obtained an accuracy of 98.66%. The comparison of various parameters for heart disease data set T.S is listed in Table 10 and Figure 5.

Specificity is the probability that the test result will be negative when heart disease is not present. Positive predictive value (PPV) is the probability that heart disease is present when the diagnosis test is positive. The PPV for DT is 98.39%, whereas our approach records 100%. The negative predictive value (NPV) recorded by our approach is 90% and the positive predictive value is 75.8% for the heart Statlog data set, as shown in Table 11 and Figure 6. Clinically, the disease prevalence (DP) is the same as the probability of disease being present before the test is performed (prior probability of disease).

The above experimental results suggest that our proposed approach efficiently achieves a high degree of dimensionality reduction and improves accuracy with the predominant features. Overall, our approach outperforms the other approaches. This indirectly helps reduce the number of diagnostic tests a patient must take for the prediction of heart disease.

Table 8: Accuracy comparison for Heart Disease Data set (Cleveland data set)

Sl.no  Approach         Accuracy
1      PART C4.5        75.73
2      Naïve bayes      78.56
3      Decision table   82.43
4      Neural nets      82.77
5      Our approach     83.70

Table 9: Accuracy comparison for Heart Disease Data set (TS Data set)

Sl.no  Approach              Accuracy
1      Decision Trees (DT)   98.66
2      Our approach          100

Table 10: Comparison of various parameters for heart disease data set T.S

Sl.no  Parameter                         Our approach   Decision tree
1      Sensitivity                       100            100
2      Specificity                       100            92.86
3      Disease prevalence                82.67          81.33
4      Positive Predictive Value (PPV)   100            98.39
5      Negative Predictive Value (NPV)   100            100

Table 11: Comparison of various parameters for heart disease data set: heart Statlog

Sl.no  Parameter                         Our approach   Decision tree
1      Sensitivity                       85.8           80.18
2      Specificity                       82.3           80.50
3      Disease prevalence                39.2           41.11
4      Positive Predictive Value (PPV)   75.8           74.17
5      Negative Predictive Value (NPV)   90.0           85.33

Figure 3: Accuracy comparison of heart Statlog data set by various approaches

Figure 4: Accuracy comparison of heart disease data set T.S
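As a consistency check on the decision tree column of Table 10, the measures of Section IV can be computed from a single confusion matrix. In the sketch below, the counts TP=61, FP=1, TN=13, FN=0 (75 test records) are an assumption chosen because they reproduce that column; the paper does not report the underlying counts, and the function name `classification_measures` is illustrative.

```python
def classification_measures(tp, fp, tn, fn):
    """The six measures of Section IV, as fractions between 0 and 1."""
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "disease_prevalence": (tp + fn) / total,
        "ppv": tp / (tp + fp),
        "npv": tn / (fn + tn),
        "accuracy": (tp + tn) / total,
    }

# Assumed counts, NOT reported in the paper: picked so that the measures
# match the decision tree column of Table 10.
dt = classification_measures(tp=61, fp=1, tn=13, fn=0)
for name, value in dt.items():
    print(f"{name:20s} {100 * value:6.2f}")
```

With these assumed counts the accuracy is 74/75, about 98.67%, which matches the 98.66% reported for DT in Table 9 up to rounding.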
Figure 5: Comparison of various parameters for heart disease data set T.S

Figure 6: Comparison of various parameters for heart Statlog data set

The specificity obtained by our approach on heart disease data set T.S is 100%, whereas it is 92.86% for decision tree. Our approach obtained a 1.8% improvement over decision tree for the heart Statlog data set, as shown in Figures 5 and 6.

Table 12: Accuracy of heart disease using RF and GA

No. of trees in RF   Before GA as feature selection (RF only)   After applying GA (RF+GA)
50                   80                                         84
10                   80                                         82.96
100                  80                                         82.96

Table 13: Parameters of GA

Sl.no  Parameter name    Threshold value
1      Crossover         0.6
2      Mutation          0.03
3      Population size   20
4      Max. generation   20

Figure 7: Accuracy comparison

Figure 8: Accuracy comparison before and after GA

Table 12 shows the accuracy obtained using RF and GA. The RF+GA model improved accuracy by 4% over RF without GA. We tested accuracy for various numbers of trees in RF. Table 13 shows the parameters of GA. The population size is limited to 20 and the maximum number of generations is limited to 20. The crossover and mutation probabilities are set to 0.6 and 0.03 respectively. Table 14 and Figure 9 show the accuracy comparison for heart disease data by various methods. From the table it is clearly evident that our approach outperforms other models developed by researchers.

Table 14: Accuracy comparison for heart disease data set

Name of the author/approach   Accuracy
Chen                          80
Chaurasia                     83.49
Abdullah                      63.3
Decision tree                 63.3
Anooj                         57.85
Mi chau tu                    81.4
Andreeva                      75.73
Our Approach (RF+GA)          84
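The GA settings of Table 13 can be plugged into a minimal sketch of the pseudo code from Section III. The four constants come from Table 13; everything else is an assumption made for illustration: the toy `fitness` function and its `RELEVANT` index set stand in for a classifier-based objective, and tournament selection, single-point crossover and elitism are choices of this sketch, not details given in the paper.

```python
import random

POP_SIZE = 20          # Table 13: population size
MAX_GENERATIONS = 20   # Table 13: maximum generations
P_CROSSOVER = 0.6      # Table 13: crossover probability
P_MUTATION = 0.03      # Table 13: mutation probability
N_FEATURES = 13        # non-class attributes of Table 7

# Hypothetical stand-in: pretend these feature indices are the informative ones.
RELEVANT = {2, 7, 8, 9, 11, 12}

def fitness(chrom):
    """Toy objective: reward relevant features, lightly penalise the rest."""
    chosen = {i for i, bit in enumerate(chrom) if bit}
    return len(chosen & RELEVANT) - 0.5 * len(chosen - RELEVANT)

def tournament(pop, rng, k=2):
    """Selection: the fitter of k randomly drawn chromosomes."""
    return max(rng.sample(pop, k), key=fitness)

def crossover(a, b, rng):
    """Single-point crossover, applied with probability P_CROSSOVER."""
    if rng.random() < P_CROSSOVER:
        cut = rng.randrange(1, len(a))
        return a[:cut] + b[cut:]
    return a[:]

def mutate(chrom, rng):
    """Flip each bit independently with probability P_MUTATION."""
    return [1 - bit if rng.random() < P_MUTATION else bit for bit in chrom]

def run_ga(seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(POP_SIZE)]
    best = max(pop, key=fitness)
    for _ in range(MAX_GENERATIONS):
        children = []
        while len(children) < POP_SIZE - 1:
            child = crossover(tournament(pop, rng), tournament(pop, rng), rng)
            children.append(mutate(child, rng))
        pop = children + [best]   # elitism: carry the best chromosome over
        best = max(pop, key=fitness)
    return best

best = run_ga()
selected = [i for i, bit in enumerate(best) if bit]
print("selected feature indices:", selected, "fitness:", fitness(best))
```

In a wrapper setting the toy `fitness` would be replaced by the cross-validated accuracy of the classifier trained on the selected columns, which is far more expensive to evaluate than this stand-in.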
Figure 9: Accuracy comparison for heart disease data set by various approaches

VI. Conclusion

In this research paper, we developed an efficient approach for prediction of heart disease using random forest. Data mining plays an important role in the prediction of heart disease. We adopted feature selection using chi square and genetic algorithm measures for heart disease classification. Our proposed approach (random forest and chi square) achieved an accuracy of 83.70% for the heart Statlog data set. Applying random forest has shown improved accuracy in prediction of heart disease. The approach was systematically tested using 10-fold cross validation to identify the most accurate method. We compared our approach with other traditional classification algorithms. Our approach outperforms traditional classification algorithms for effective classification of heart disease. This type of research can be successfully used in predicting the risk factors of heart disease and can help health care professionals in the prediction of heart disease.

References

[1] P. Shrama, K. Saxena, "Heart disease prediction system evaluation using C4.5 rules and partial trees", AISC, Springer, pp. 285-294 (2016)
[2] Hlaudi Daniel Masethe, "Prediction of heart disease using classification algorithms", vol. 11, pp. 1-4, WCECS 2014
[3] Sheik Abdullah, R.R. Rajalakshmi, "A data mining model for predicting the coronary heart disease using random forest classifier", IJCA, pp. 22-25 (2012)
[5] Resul Das, Turkoglu, A. Sengur, "Effective diagnosis of heart disease through network ensembles", Expert Systems with Applications 36, pp. 7675-7680 (2009)
[6] P.K. Anooj, "Clinical decision support system: Risk level prediction of heart disease using weighted fuzzy rules", Journal of King Saud University, CIS, 24, pp. 27-40 (2012)
[7] Detrano, Janosi, W. Steinbrunn, et al., "International application of new probability algorithm for the diagnosis of CAD", The American Journal of Cardiology, 64(5), pp. 304-310 (1989)
[8] Mai Shouman, Turner, Stocker, "Using decision tree for diagnosing heart disease patients", in 9th Australian Data Mining Conference, Australia, vol. 121, ACM (2011)
[9] Tu et al., "Effective diagnosis of heart disease through bagging approach", Biomedical Engineering and Informatics, pp. 1-4, BMEI 2009, IEEE (2009)
[10] Andreeva, "Data modeling and specific rule generation via data mining techniques", International Conference on Computer Systems and Technologies, CompSysTech 2006, pp. 1-6 (2006)
[11] Alaa Elsayad, Mahmoud Fakhr, "Diagnosis of cardiovascular diseases with Bayesian classifier", Journal of Computer Science, vol. 11(2), pp. 274-282 (2015)
[12] M.A. Jabbar, B.L. Deekshatulu, Priti Chandra, "Heart disease classification using nearest neighbor classifier with feature subset selection", Annals Computer Science Series, 11th tome, 1st fasc., pp. 47-54 (2013)
[13] Randa El-Bay, "Feature analysis of coronary artery heart disease data sets", Procedia Computer Science, Elsevier, vol. 65, pp. 459-469 (2015)
[14] M.A. Jabbar, B.L. Deekshatulu, Priti Chandra, "Heart disease prediction system using associative classification and genetic algorithm", ICECIT 2012, vol. 1, pp. 183-192 (2012)
[15] M.A. Jabbar, B.L. Deekshatulu, Priti Chandra, "Prediction of heart disease using random forest and feature subset selection", Springer, AISC, IBICA 2015, pp. 187-196 (2015)
[16] M.A.Jabbar,B L Deekshatulu, Priti chandra,
[4] Kemal polat, S.Gunes, S.Tosun, ” Diagnosis of heart
“Computaional intelligence technique for early diagnosis of
disease using artificial immune recognition system and fuzzy
heart disease’IEEE,ICETECH 2015,pp 1-6(2015)
weight preprocessing”, pattern recognition, 39,
pp2186-193(2006)

[17] M. A. Jabbar, B. L. Deekshatulu, Priti Chandra, "Alternating decision tree for early diagnosis of heart disease", IEEE, I4C 2014, pp. 322-328 (2014)
[18] M. A. Jabbar, B. L. Deekshatulu, Priti Chandra, "Cluster based association rule mining for heart disease prediction", JATIT, vol. 32, no. 2, pp. 1-8 (2011)
[19] Saaol Times, monthly magazine, "Modifiable risk factors of heart disease", pp. 6-10, July (2015)
[20] Khan M. G., "Heart disease diagnosis and therapy: a practical approach", 2nd edition, Springer, pp. 544 (2015)
[21] Khan M. G., "Heart disease diagnosis and therapy: a practical approach", 2nd edition, Springer, pp. 544 (2015)
[22] M. A. Jabbar, B. L. Deekshatulu, Priti Chandra, "Classification of heart disease using artificial neural network and feature subset selection", GJCST, vol. 13, issue 3, 2013
[23] home.etf.rs/~vm/os/dmsw/Random%20Forest.pptx, last accessed 10/8/2015
[24] Jehad Ali et al., "Random forest and decision trees", IJCSI, vol. 9, no. 3, pp. 272-278 (2012)
[25] Khaled Fawagreh, Mohamed Medhat Gaber, Eyad Elyan, "Random forest: from early developments to recent advancements", Systems Science and Control Engineering, 2:1, pp. 602-609 (2014)
[26] George Forman, "An extensive empirical study of feature selection metrics for text classification", Journal of Machine Learning Research 3, pp. 1289-1305 (2003)
[27] M. A. Jabbar, B. L. Deekshatulu, Priti Chandra, "An evolutionary algorithm for heart disease prediction", ICICP 2012, CCIS 292, Springer, pp. 378-389 (2012)
[28] M. A. Jabbar, B. L. Deekshatulu, Priti Chandra, "Prediction of risk scores for heart disease using associate classification and hybrid feature subset selection", IEEE, ISDA, pp. 628-634 (2012)
[29] Preecha Sonwang et al., "Computer network security based on SVM approach", In 11th Intl. Conf. on Control, Automation, and Systems.
[30] MedCalc, "www.medcalc.org", last accessed on 5/8/2015
[31] UCI machine learning repository, "archive.ics.uci.edu/ml", last accessed 15/08/2015

Author Biographies

Dr. M. A. Jabbar was born in Telangana state, India. He obtained his B.E. in computer science engineering from MGMCE, Nanded and his M.Tech from JNTUH, Hyderabad. He obtained his Ph.D from JNTUH in data mining. He has published more than 25 papers in international journals and conferences. He is a technical committee member for many international conferences. He is an execom member in the IEEE Computer Society India Council. Presently he is working as an associate professor at MJCET, Hyderabad. His research interests include data mining, attack graphs, IDS, big data and IoT.

Dr. B. L. Deekshatulu did his BSc (Electrical Engg) from BHU (1958) and ME (1960) and PhD (1964) from Indian Institute of Science (IISc), Bangalore. Dr Deekshatulu contributed in the areas of linear and non-linear systems, digital image processing and remote sensing (data processing and applications). At IISc, Deekshatulu developed grayscale and colour drum scanners besides a flying spot scanner for image digitization, introduced an ME course in Servo Mechanisms, set up image/photo-processing labs and initiated the School of Automation. At NRSA, he developed widespread applications of remote sensing, a commercial version of the large format drum scanner, image analysis equipment, etc. He was the Chairman, Remote Sensing Application Missions (ISRO), IGBP and SCOPE. Deekshatulu is a recipient of a number of awards that include Sir M Visveswaraya Award (1984), NRDC Invention Awards (1986, 1993), Dr Biren Roy Space Science Award (1988), Padma Shri (1991) and LTC award by INAE.

Dr. Priti Chandra obtained her Ph.D in artificial intelligence from HCU, Hyderabad. She has published more than 40 papers in international journals and conferences. Presently she is working as senior scientist, ASL, DRDO, Hyderabad. Her research interests include data mining, artificial intelligence, optimization techniques and fault tolerant systems. She is a member of IEEE.
