Predicting Heart Disease in Patients Using Bat Features Selection and Back Propagation Algorithm
INTRODUCTION
The heart is one of the most vital organs in the body, and life depends on its proper functioning. If the heart fails to work, other organs, including the brain and kidneys, are affected. Heart disease is the term used to describe conditions in which the heart does not function properly. Several factors increase the risk of heart disease, including high cholesterol, high blood pressure, lack of physical exercise, smoking, and obesity. The World Health Organization (WHO) has estimated that by 2030 nearly 23.6 million people will die because of heart disease. To minimize this risk, heart disease must be predicted and discovered early based on symptoms, physical check-ups, and signs in the patient's body. Discovering and predicting disease is a tedious task in the medical environment; diagnosis is a multilayered problem that may involve false assumptions and unpredictable effects.
The healthcare industry maintains huge amounts of complex data about patients, hospital resources, disease diagnoses, electronic patient records, equipment, and more. This data becomes a knowledge source for extraction that can reduce false assumptions and unpredictable effects, and it enables doctors to make more active and intelligent predictions when diagnosing patients.
Neural networks have been widely used in the medical field for forecasting disease, and they have demonstrated their potential in many domains related to medical forecasting and diagnosis. Neural networks can never replace human experts, but they can assist them in decision making, classification, screening, and cross-verifying their diagnoses. The dataset used in this work contains several attributes, such as age, sex, blood pressure, and blood sugar, which are used to predict the risk of a patient developing heart disease.
1.2 STATEMENT OF THE PROBLEM
These limitations arise from the highly uncertain and variable presentation of patients, as well as the frequent lack of historical background on a patient's health status. Traditional methods are inappropriate for determining whether a patient has heart disease because they cannot handle these challenges. In addition, traditional prediction systems struggle to generate accurate results because the available historical data is limited and inconsistent; this motivates the use of different prediction methods that can analyze patient data and predict heart disease accurately.
The clinical symptoms of heart disease complicate the prognosis, as it is influenced by many factors such as functional and pathological appearance. This can delay the prognosis of the disease. Hence, there is a need for new approaches that improve prediction accuracy within a short time span. Disease prognosis from numerous factors or symptoms is a multi-layer problem that can lead to false assumptions. Therefore, an attempt is made to combine the knowledge and experience of experts and to build a system that reliably supports the diagnostic process.
In hospitals, there are provisions for continuous monitoring of critical-care heart patients, whereas after release from the hospital patients normally fall out of direct supervision. These patients need continuous monitoring of their health condition to reduce the risk of further cardiac complications.
Aim
The aim of this project is to develop software for predicting heart disease in patients using BAT feature selection and the back propagation algorithm. Its objectives include:
ii. To design a new combination of classifiers to forecast the presence or absence of heart disease.
Large volumes of medical data are available in the medical industry and act as a great source for discovering useful and hidden facts about almost all medical problems. These facts, in turn, help practitioners make accurate predictions. Artificial neural network techniques have contributed to achieving the highest prediction accuracy over medical data. The scope of this study is to develop a model that predicts heart disease in patients using BAT feature selection and a back propagation neural network algorithm; the software is developed for clinical use, supporting health practitioners in predicting heart disease in patients.
CHAPTER TWO
LITERATURE REVIEW
The heart is one of the most vital organs in the body, and life depends on its proper functioning; if it fails, other organs such as the brain and kidneys are affected. Several factors increase the risk of heart disease, including cholesterol, blood pressure, lack of physical exercise, smoking, and obesity, and the World Health Organization (WHO) has estimated that by 2030 nearly 23.6 million people will die because of heart disease. Discovering the disease early from symptoms, physical check-ups, and signs in the patient's body is therefore essential, yet diagnosis is a multilayered problem that may involve false assumptions and unpredictable effects. The healthcare industry maintains huge amounts of complex data about patients, hospital resources, disease diagnoses, electronic patient records, and equipment, and this data becomes a knowledge source for extraction that can help avoid such false assumptions.
With the growth of medical data sourced from patients' health records, there is a great opportunity to use this data as raw material for improving patient health. Computers are now applied in many fields; in healthcare, machine learning can be used as an analytical tool to find hidden patterns in the data (Hassan M, 2018). This development supports a higher degree of prediction accuracy and, in turn, more appropriate prevention.
Heart disease, also known as cardiovascular disease, has several causes and risk factors:
i. High blood pressure: uncontrolled high blood pressure can damage blood vessels and strain the heart.
ii. High cholesterol: elevated cholesterol can lead to plaque buildup in arteries, increasing the risk of heart disease.
iii. Smoking: smoking damages blood vessels, increases blood pressure, and reduces the oxygen supply to the heart.
iv. Diabetes: high blood sugar levels can damage blood vessels and nerves, raising cardiovascular risk.
v. Obesity: excess weight can lead to high blood pressure, high cholesterol, and diabetes.
vi. Poor diet: consuming a diet high in saturated fats, sodium, and added sugars contributes to these risk factors.
vii. Lack of exercise: a sedentary lifestyle can contribute to obesity, high blood pressure, and high cholesterol.
viii. Age: heart disease risk increases with age, especially after 45 for men and 55 for women.
2.2 Machine Learning
Machine learning is the subset of artificial intelligence that focuses on building systems that learn or improve their performance based on the data they consume (Nasteski, n.d.). It grew out of pattern recognition and the idea that computers can learn from data without being explicitly programmed. The iterative aspect of machine learning is important because, as models are exposed to new data, they can adapt independently: they learn from previous computations to produce reliable, repeatable decisions and results. The practice of machine learning involves taking data, examining it for patterns, and developing some form of prediction about future outcomes (Liu et al., 2022). By feeding an algorithm more data over time, data scientists can sharpen a machine learning model's predictions. From this basic concept, several different types of machine learning have developed.
Supervised learning
Gartner, a business consulting firm, predicts that supervised learning will remain the most utilized type of machine learning among enterprise information technology leaders through 2022. This type of machine learning feeds historical input and output data into machine learning algorithms, with processing between each input/output pair that allows the algorithm to shift the model so that its outputs align as closely as possible with the desired result. Common algorithms used during supervised learning include neural networks, decision trees, linear regression, and support vector machines. Supervised learning gets its name because the machine is "supervised" while it is learning, meaning you are feeding the algorithm information to help it learn: the outcome you provide the machine is labeled data, and the rest of the information you provide is used as input features.
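As a minimal illustration of this idea (not the project's own code), the sketch below fits a scikit-learn classifier on a handful of labeled examples and then predicts the label of an unseen record; the feature values and labels are made up for demonstration:

    # Minimal supervised-learning sketch with scikit-learn (hypothetical data).
    from sklearn.linear_model import LogisticRegression

    # Each row is [age, resting_blood_pressure]; labels: 1 = heart disease, 0 = healthy.
    X_train = [[63, 145], [37, 130], [56, 120], [41, 140], [71, 160], [29, 118]]
    y_train = [1, 0, 0, 1, 1, 0]

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)        # learn from labeled input/output pairs
    print(model.predict([[52, 150]]))  # predict the label of an unseen patient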
Unsupervised learning
While supervised learning requires users to help the machine learn, unsupervised learning algorithms do not use labeled training data. Instead, the machine looks for less obvious patterns in the data. Unsupervised machine learning is very helpful when you need to identify patterns and use data to make decisions. Common algorithms used in unsupervised learning include k-means and hierarchical clustering. Unsupervised learning problems can be further categorized into two types:
i. Clustering: objects with the most similarities remain in a group and have few or no similarities with the objects of another group (Benndorf et al., 2018). Cluster analysis finds the commonalities between data objects and categorizes them accordingly.
ii. Association: an association rule finds sets of items that occur together in the dataset, making it possible to discover relationships between variables.
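As a minimal illustration of clustering (a sketch only, with made-up values), k-means below groups unlabeled patient records purely by similarity:

    # Minimal clustering sketch with scikit-learn's KMeans (hypothetical patient data).
    from sklearn.cluster import KMeans

    # Each row is [age, cholesterol]; no labels are given to the algorithm.
    X = [[63, 233], [37, 250], [56, 294], [41, 204], [71, 320], [29, 180]]

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)   # group records purely by similarity

    print(labels)                    # cluster assignment for each patient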
Reinforcement learning
Reinforcement learning is the machine learning type closest to how humans learn. The algorithm, or agent, learns by interacting with its environment and receiving a positive or negative reward. Common algorithms include temporal difference learning, deep adversarial networks, and Q-learning. Returning to the bank loan example, a reinforcement learning algorithm might classify a customer as high-risk; if the customer defaults, the algorithm gets a positive reward, and if the customer does not default, it gets a negative reward. In the end, both instances help the machine learn by understanding the problem and the environment better. Gartner notes that most machine learning platforms do not have reinforcement learning capabilities because they require more computing power than most organizations have available. Reinforcement learning is applicable in areas that can be fully simulated and that are either stationary or have large volumes of relevant data. Because this type of machine learning requires less management than supervised learning, it is viewed as easier to work with when dealing with unlabeled data sets.
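To make the reward idea concrete, the sketch below is a simplified, single-state (bandit-style) value update in the spirit of Q-learning; the loan scenario, default rate, and reward scheme are all invented for illustration:

    # Simplified Q-learning-style sketch (hypothetical toy problem, not the project's code).
    import random

    actions = ["classify_high_risk", "classify_low_risk"]
    Q = {a: 0.0 for a in actions}        # action-value estimates
    alpha = 0.1                          # learning rate

    def reward(action, customer_defaulted):
        # +1 when the classification matches the outcome, -1 otherwise.
        if action == "classify_high_risk":
            return 1.0 if customer_defaulted else -1.0
        return 1.0 if not customer_defaulted else -1.0

    random.seed(0)
    for _ in range(1000):
        # mostly exploit the best-known action, occasionally explore
        action = max(Q, key=Q.get) if random.random() > 0.1 else random.choice(actions)
        defaulted = random.random() < 0.7          # assume 70% of these customers default
        Q[action] += alpha * (reward(action, defaulted) - Q[action])

    print(Q)   # the agent learns which classification earns the higher reward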
Feature extraction aims to reduce the number of features in a dataset by creating new features from the existing ones (and then discarding the original features). The new, reduced set of features should summarize most of the information contained in the original set; in this way a summarized version of the original features is created from a combination of the original set (Gemescu et al., 2019). Feature extraction is useful when you need to reduce the number of resources required for processing without losing important or relevant information. The reduction of the data, and of the effort needed to build variable combinations (features), speeds up the learning and generalization steps in the machine learning process. Typical applications in data science and machine learning include dimensionality reduction for visualization and model performance improvement, feature engineering for model training and prediction, analysis, topic modeling and information retrieval, and language translation and language modeling.
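As a minimal sketch of feature extraction (made-up values, not the project's code), PCA below combines four original attributes into two new features that retain most of the variance:

    # Minimal feature-extraction sketch using PCA from scikit-learn (hypothetical data).
    from sklearn.decomposition import PCA

    # Each row is [age, blood_pressure, cholesterol, max_heart_rate].
    X = [[63, 145, 233, 150], [37, 130, 250, 187], [56, 120, 294, 153],
         [41, 140, 204, 172], [71, 160, 320, 109], [29, 118, 180, 202]]

    pca = PCA(n_components=2)             # combine 4 original features into 2 new ones
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                # (6, 2): fewer features
    print(pca.explained_variance_ratio_)  # how much variance each new feature keeps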
Back propagation is the algorithm most commonly used to compute gradients when training neural network models. The gradient estimate is used by the optimization algorithm to compute the network parameter updates; back propagation is, in essence, an efficient application of the chain rule to neural networks.
Back propagation computes the gradient of a loss function with respect to the weights of the network for a single input–output example, and it does so efficiently, computing the gradient one layer at a time and iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule. The procedure can be derived through dynamic programming, and gradient descent, or variants such as stochastic gradient descent, are commonly used with it.
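The sketch below is a minimal NumPy implementation of this idea for a single hidden layer, with sigmoid activations and a squared-error loss; the data and layer sizes are invented, so it illustrates the mechanics rather than the project's actual network:

    # Minimal back propagation sketch in NumPy (hypothetical data and layer sizes).
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = rng.random((6, 4))                               # 6 samples, 4 input features
    y = np.array([[1], [0], [0], [1], [1], [0]], dtype=float)

    W1, b1 = rng.normal(size=(4, 5)), np.zeros((1, 5))   # input -> hidden
    W2, b2 = rng.normal(size=(5, 1)), np.zeros((1, 1))   # hidden -> output
    lr = 0.5

    for _ in range(2000):
        # forward pass
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)

        # backward pass: chain rule applied one layer at a time, last layer first
        d_out = (out - y) * out * (1 - out)              # error at the output layer
        d_h = (d_out @ W2.T) * h * (1 - h)               # error propagated to the hidden layer

        # gradient-descent parameter updates
        W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0, keepdims=True)
        W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0, keepdims=True)

    print(np.round(out.ravel(), 2))   # predictions should move toward the targets 1,0,0,1,1,0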
2.6 Review of Related work
Heart Disease Prediction Using Random Forest Classifier
The random forest algorithm provides a flexibility and robustness for classification tasks on tabular data that few other standard models can match. Given its simplicity and versatility, the random forest classifier is widely used for fraud detection, loan risk prediction, and predicting heart disease. Following the ensemble learning principle, the random forest classifier combines the results from several decision trees and optimizes training. It uses different subsets of the data and features and finds the best combinations to increase predictive accuracy on the dataset. The first step is building, optimizing, mixing, and matching several decision trees; the classifier then uses these trees for prediction and ensembles their results into a final vote.
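A minimal sketch of this idea with scikit-learn (made-up values, not the project's code):

    # Minimal random forest sketch with scikit-learn (hypothetical heart-disease data).
    from sklearn.ensemble import RandomForestClassifier

    # Each row is [age, blood_pressure, cholesterol]; labels: 1 = disease, 0 = healthy.
    X = [[63, 145, 233], [37, 130, 250], [56, 120, 294],
         [41, 140, 204], [71, 160, 320], [29, 118, 180]]
    y = [1, 0, 0, 1, 1, 0]

    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X, y)                         # each tree is grown on a bootstrap sample

    print(forest.predict([[58, 150, 270]]))  # majority vote across all trees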
Heart Disease Prediction Using K-Nearest Neighbors Classifier
As the name suggests, a k-nearest neighbors classifier takes a data point and finds the k other data points nearest to it in the vector space. In a supervised fashion, KNN groups the data samples that share the same target value. Whenever a new value needs to be classified, it uses a distance metric to assign it to one of the classes. For heart disease detection there are only two classes that KNN needs to distinguish, so it is quite robust and efficient for this task. Euclidean distance is one of the most popular distance metrics used by KNN, but many others are available. The choice of metric also affects the classifier's speed, and for larger datasets KNN is already relatively slower than its contemporaries.
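A minimal KNN sketch with scikit-learn, using the Euclidean metric mentioned above (values are made up):

    # Minimal k-nearest neighbors sketch with scikit-learn (hypothetical data).
    from sklearn.neighbors import KNeighborsClassifier

    X = [[63, 145], [37, 130], [56, 120], [41, 140], [71, 160], [29, 118]]
    y = [1, 0, 0, 1, 1, 0]                 # 1 = heart disease, 0 = healthy

    knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
    knn.fit(X, y)                          # simply stores the training points

    # A new patient receives the majority label of its 3 nearest neighbours.
    print(knn.predict([[66, 150]]))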
Heart Disease Prediction Using Decision Tree Classifier
Decision trees are the individual models that, after ensembling, make up a random forest. Each decision tree classifier uses the dataset's attributes to grow a tree whose branches end in leaves corresponding to target values. Using an information gain criterion, the tree identifies the features that best separate the classes; branches are created that maximize the information gained at each split and lead to the leaf node of the corresponding class. Decision trees are fast and robust for disease prediction when the dataset has strong features and the use-case is simple.
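A minimal decision tree sketch with scikit-learn, using the entropy criterion so that splits maximize information gain (the data is invented):

    # Minimal decision tree sketch with scikit-learn, using information gain (entropy).
    from sklearn.tree import DecisionTreeClassifier, export_text

    X = [[63, 145, 1], [37, 130, 0], [56, 120, 0],
         [41, 140, 1], [71, 160, 1], [29, 118, 0]]   # [age, bp, exercise_angina]
    y = [1, 0, 0, 1, 1, 0]

    tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
    tree.fit(X, y)

    # Print the learned splits; each branch maximizes the information gained.
    print(export_text(tree, feature_names=["age", "bp", "exercise_angina"]))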
Heart Disease Prediction Using Support Vector Machine Classifier
Support vector machines (SVMs) generate hyperplanes that divide the data points of two classes in the vector space. For N features and M target classes, SVM creates (M-1) N-dimensional hyperplanes that separate the data points of the different classes from one another. The "support" vectors are calculated such that the margin (or distance) between the vectors of the two classes is greatest, and SVM optimizes this margin to find the best hyperplane for all the categories. SVMs are therefore popular for disease prediction, since they can handle high-dimensional feature spaces well.
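A minimal SVM sketch with scikit-learn; the linear kernel finds the maximum-margin hyperplane between the two invented classes:

    # Minimal SVM sketch with scikit-learn (hypothetical data).
    from sklearn.svm import SVC

    X = [[63, 145], [37, 130], [56, 120], [41, 140], [71, 160], [29, 118]]
    y = [1, 0, 0, 1, 1, 0]

    svm = SVC(kernel="linear", C=1.0)
    svm.fit(X, y)                  # finds the maximum-margin separating hyperplane

    print(svm.support_vectors_)    # the points that define the margin
    print(svm.predict([[60, 155]]))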
Heart Disease Prediction Using Artificial Neural Network Classifier
An ANN is perhaps the most popular machine learning model in today's AI landscape, given its wide application in deep learning in the form of convolutional neural networks. Even a normal ANN made up of a handful of linear nodes, however, can perform comparably to the best standard machine learning models. In a standard ANN architecture, the hidden layer is the most crucial part of the network and is where most of the learning takes place. Several hidden layers can be wrapped between the input and the output layer to increase the complexity, and thus the learning ability, of the model. Adding more nodes to a layer and more layers to the network allows the model to learn more non-linear and complex relationships between the categorical variables and input features. This ability makes the network very capable of capturing relationships between the various biological and personal markers that independently affect the probability of heart disease.
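As a sketch only, scikit-learn's MLPClassifier below builds a small ANN whose configuration mirrors the one reported later in Chapter Four (two hidden layers of 10 neurons, ReLU activation, learning rate 0.01); the data is made up:

    # Minimal ANN sketch with scikit-learn's MLPClassifier (trained with back propagation).
    from sklearn.neural_network import MLPClassifier

    X = [[63, 145, 233], [37, 130, 250], [56, 120, 294],
         [41, 140, 204], [71, 160, 320], [29, 118, 180]]
    y = [1, 0, 0, 1, 1, 0]

    ann = MLPClassifier(hidden_layer_sizes=(10, 10),   # two hidden layers of 10 neurons
                        activation="relu",
                        learning_rate_init=0.01,
                        max_iter=2000,
                        random_state=0)
    ann.fit(X, y)

    print(ann.predict_proba([[58, 150, 270]]))   # probability of each class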
Heart Disease Prediction System Using Logistic Regression Algorithm this is done
Detecting the disease at a premature stage may save the life of the patient. Data mining
techniques are very popular and have been used in many fields including healthcare to
help the doctor to make better decisions. Machine learning provides classification
algorithms such as decision tree (DT), Naïve Bayes algorithm, Support machine vector
(SVM), and Logistic Regression (LG) are used in many types of research for predicting
heart disease. The dataset is collected from the Kaggle repository. It contains 604 data
and 14 attributes used to train the model that will be used in the web application.
Building an efficient prediction model to be deployed into the web application is the main
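A minimal logistic regression sketch of such a pipeline is shown below; the CSV file name and the 'target' column are assumptions made for illustration, not the actual Kaggle schema:

    # Minimal logistic regression sketch with scikit-learn (hypothetical file and columns).
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("heart.csv")            # hypothetical file name
    X = df.drop(columns=["target"])          # 'target' assumed to hold the label
    y = df["target"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))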
(Jabbar et al., 2016) employed a random forest to predict cardiac illness. The chi-square approach was used to select the related features, and the proposed work suggests that, compared to decision trees, random forests yield more accurate results. (Kim JK, 2017) built a proposed system utilizing neural networks, with sensitivity analysis as one of the evaluation metrics for prediction; features with a high degree of sensitivity were considered important, and after selecting the relevant features the sensitivity of each feature was determined. (Amin U, 2018) employed seven classification algorithms to predict cardiac disease in people, using the Relief, mRMR, and LASSO (Least Absolute Shrinkage and Selection Operator) feature selection methods to choose the appropriate features.
In addition to the seven performance metrics this study employed, the ROC and AUC were reported, which will help clinicians diagnose heart patients more efficiently. To select appropriate features, (Rani et al., 2021) used a Genetic Algorithm (GA) and recursive feature elimination. The proposed study used standard preprocessing and SMOTE to prepare the data, and applied support vector machines, naive Bayes, logistic regression, random forest, and an AdaBoost classifier to aid in the earlier prediction of heart disease based on the patient's medical features. The system's simulation environment was built in Python, and it was found that random forest achieved a maximum accuracy of 86.6 percent. (Ali et al., 2019) used the chi-square statistical approach to pick significant features; the selected features were fed into a deep neural network, which was then trained using an optimized network configuration.
(Paul et al., 2016) used a fuzzy decision support system (FDSS) that includes weighted fuzzy rules derived from a genetic algorithm (GA). They were able to recover eight useful features with an accuracy of 80 percent. Multiple heart disease datasets were used by (Bashir et al., 2019) for experimental analysis and to increase accuracy performance; feature selection together with Decision Tree, Logistic Regression, SVM, Naïve Bayes, and Random Forest classifiers was applied in RapidMiner, and accuracy was enhanced.
(Liu et al., 2017) offered a study that used relief and rough set approaches. The proposed system consists of two subsystems: an RFRS feature selection system and an ensemble classification system. The first subsystem has three stages: data discretization, feature extraction using the ReliefF method, and feature reduction using a heuristic rough set reduction technique. The second subsystem, an ensemble classifier, achieved a classification accuracy of 92.32 percent. On the Cleveland heart disease dataset, (Singh et al., 2017) used an RF classifier that can handle large amounts of data with missing values. This classifier generates a large number of decision trees whose outputs are combined through voting, and the chosen branches are used to improve precision. Despite the obviously non-linear dataset, this study was able to reach an accuracy of 85.81 percent.
CHAPTER THREE
RESEARCH METHODOLOGY
Data acquisition for heart disease prediction involves collecting patient information. This includes demographics such as age and gender. Medical history, such as hypertension and diabetes, is also collected. Clinical features like chest pain type and resting blood pressure are gathered. Cholesterol levels, fasting blood sugar, and resting electrocardiogram results are recorded, together with the maximum heart rate achieved and the ST depression induced by exercise. The slope of the peak exercise ST segment is calculated. Imaging features, such as the number of major vessels colored by fluoroscopy, are collected. The Brain Natriuretic Peptide (BNP) level, referred to in this work as the BAT feature, is measured. The target variable, heart disease status, is determined. Data is collected from electronic health records (EHRs) and clinical databases; wearable devices such as ECG monitors and laboratory test results are also used. The data is preprocessed to handle missing values and to normalize the values, and features are selected and engineered to improve model performance.
The data is split into training and testing sets and then fed into a machine learning model for training. The model is validated using the testing set, and the trained model can predict heart disease status from the input features. The BAT feature is a key predictor of heart disease status. The main dataset attributes are summarized below.
Feature | Description | Type
Chest Pain Type | Type of chest pain (e.g., angina, non-angina) | Categorical
Maximum Heart Rate Achieved | Maximum heart rate achieved during exercise | Numeric
Exercise-Induced Angina | Presence of exercise-induced angina | Categorical
ST Depression | ST depression induced by exercise relative to rest | Numeric
Slope of Peak Exercise ST Segment | Slope of the peak exercise ST segment | Categorical
BAT (BNP) Level | Brain Natriuretic Peptide levels | Numeric
Target | Presence or absence of heart disease | Categorical
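A minimal preprocessing sketch consistent with the steps above is given below; the file name and the 'target' column are assumptions, and all feature columns are assumed to be numeric:

    # Minimal preprocessing sketch: handle missing values, normalize, and split the data.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler

    df = pd.read_csv("heart_disease.csv")          # hypothetical dataset file
    df = df.fillna(df.median(numeric_only=True))   # fill missing numeric values

    X = df.drop(columns=["target"])                # 'target' assumed to hold the label
    y = df["target"]

    X = MinMaxScaler().fit_transform(X)            # normalize features to [0, 1]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    print(X_train.shape, X_test.shape)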
The heart disease prediction process begins with data collection: patient information, medical history, and clinical features are gathered. The data is preprocessed to handle missing values and to normalize the values, and relevant features are selected and engineered to improve model performance. A machine learning algorithm is then chosen and trained on the data; the algorithm learns the patterns and relationships between the features and heart disease status. The trained model is validated using a testing set, and its performance is evaluated using metrics such as accuracy and F1-score. If the model meets the desired performance threshold, it is deployed. New patient data is then input into the deployed model, which predicts the likelihood of heart disease based on the patient's individual characteristics and outputs a probability score or classification label; depending on the result, further testing or treatment may be recommended. The patient's data is added to the existing dataset, the model is continuously updated and retrained, and its performance is monitored and evaluated so that the model is refined and improved over time. The goal is to provide accurate and timely predictions that support clinical decisions.
Feature extraction is a crucial step in heart disease prediction. The BAT feature is a key predictor of heart disease status. Demographic features such as age, gender, and medical history are extracted, as are clinical features like chest pain type, resting blood pressure, and cholesterol levels. Fasting blood sugar, resting electrocardiogram results, and maximum heart rate achieved are extracted from clinical data, and exercise stress test data contributes features such as exercise-induced angina and ST depression. Imaging data provides features such as the number of major vessels colored by fluoroscopy. Additional features like thalassemia, smoking status, and family history of heart disease are extracted, along with diabetes status, hypertension status, and body mass index (BMI). The BAT feature is extracted from blood test results. All extracted features are combined and, where necessary, transformed for dimensionality reduction, and feature engineering creates new features. The extracted features are split into training and testing sets: the training set trains a machine learning model, and the testing set evaluates the model's performance. The extracted features, including the BAT feature, form the input to the prediction model.
Global features capture overall patterns and trends in the data, and global feature extraction complements local feature extraction. Global features include statistics such as the mean, median, and standard deviation of the measured attributes, and principal component analysis (PCA) is used to derive global features by reducing dimensionality and identifying principal components. Global features capture patterns in attributes such as resting blood pressure and cholesterol levels, and global features are also extracted for fasting blood sugar, the resting electrocardiogram, the maximum heart rate achieved, and exercise-induced angina. Global features from imaging data include the average vessel diameter, and global features from blood test results include average BAT levels. Global features are combined with local features for prediction; they improve model performance and generalizability, help identify high-risk patients, and aid in early disease detection and prevention. Global feature extraction is therefore a crucial step in heart disease prediction.
Local features capture specific patterns and trends in individual data points, so local feature extraction focuses on individual patient data. Local features include wavelet coefficients from ECG signals and texture features from medical images. They capture specific patterns in chest pain types, identify specific abnormalities in resting blood pressure, and extract relevant information from cholesterol levels. Local features also analyze individual heart rate variability, extract meaningful information from exercise stress test data, and identify specific patterns in blood test results, including BAT levels. Local features are extracted using techniques such as filtering and segmentation, and they are combined with global features for prediction. Local features improve model performance and accuracy and help identify high-risk patients earlier; local feature extraction is therefore an important part of the prediction pipeline.
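The sketch below illustrates the two kinds of features on invented data: simple global statistics over a patient attribute and wavelet-based local features from an ECG-like signal (the PyWavelets package is assumed to be installed):

    # Minimal sketch of global and local feature extraction (hypothetical signals/values).
    import numpy as np
    import pywt                                    # PyWavelets, assumed installed

    # Global features: summary statistics over a patient attribute (e.g. blood pressure).
    blood_pressure = np.array([128, 140, 133, 151, 125, 138], dtype=float)
    global_features = [blood_pressure.mean(),
                       np.median(blood_pressure),
                       blood_pressure.std()]

    # Local features: wavelet coefficients from a short ECG-like signal.
    ecg_signal = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.1 * np.random.randn(256)
    approx, detail = pywt.dwt(ecg_signal, "db4")   # one-level discrete wavelet transform
    local_features = [approx.mean(), approx.std(), detail.mean(), detail.std()]

    feature_vector = np.array(global_features + local_features)
    print(feature_vector)                          # combined global + local features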
3.4 Classification
Several classification techniques are used for heart disease prediction, including Decision Trees and Random Forest; Support Vector Machines (SVM) and K-Nearest Neighbors (KNN) are also used, and neural networks and deep learning models are increasingly applied. Classification techniques are trained on labeled datasets, and the models learn to predict heart disease status from patterns in the features. Performance metrics include accuracy, sensitivity, specificity, precision, recall, F1-score, and AUC-ROC. Classification techniques are used for binary classification (heart disease or no heart disease), and some techniques also predict the severity of heart disease. Ensemble methods combine multiple models for improved prediction. Classification techniques are widely used in clinical practice for heart disease diagnosis; early detection and prevention are critical for reducing heart disease mortality, and accurate classification is crucial for effective treatment.
In this work the classifier is a back propagation neural network, whose training follows these steps:
Weight Update: w ← w − η (∂E/∂w), where η is the learning rate.
Bias Update: b ← b − η (∂E/∂b).
Error Calculation: E = ½ Σ_k (t_k − y_k)², where t_k is the target output and y_k the network's prediction.
Prediction: y = f(Σ_i w_i x_i + b), where f is the activation function.
The gradients of earlier layers depend on the errors propagated backward from later layers, so the updates are interdependent, hence the term back propagation.
The hardware requirement refers to the tangible (physical) components to be used for the development of the system; these include a personal computer (PC) such as a MacBook Air with 4 GB RAM. Windows 8 or a higher operating system can be used for deployment. A cross-platform (X) web stack with Apache (A) and Python 3 will be used in the project to develop the system, and Visual Studio Code is the software package that will be used to create the source code.
CHAPTER FOUR
4.1 Result
Feature | Importance
Age | 92.5%
Discussion on Table 4.1: The BAT feature selection algorithm identified age, blood pressure, cholesterol level, family history, and smoking status as the most important features. This aligns with medical knowledge, as these factors are known to increase the risk of heart disease.
4.3 Result of Predicting heart disease in patients
Metric | Result
Accuracy | 92.5%
Sensitivity | 90.5%
Specificity | 94.1%
Precision | 91.5%
Recall | 90.8%
F1-Score | 91.1%
AUC-ROC | 0.95
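As an illustration of how such metrics are obtained, the sketch below computes them from a confusion matrix with scikit-learn; the prediction arrays are made up and do not reproduce the reported figures:

    # Minimal sketch of how the reported metrics can be computed (hypothetical predictions).
    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 precision_score, roc_auc_score)

    y_true  = [1, 0, 1, 1, 0, 0, 1, 0]           # actual heart disease status
    y_pred  = [1, 0, 1, 0, 0, 0, 1, 0]           # model's class predictions
    y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.2]   # predicted probabilities

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("accuracy:   ", accuracy_score(y_true, y_pred))
    print("sensitivity:", tp / (tp + fn))         # true positive rate (recall)
    print("specificity:", tn / (tn + fp))         # true negative rate
    print("precision:  ", precision_score(y_true, y_pred))
    print("f1-score:   ", f1_score(y_true, y_pred))
    print("auc-roc:    ", roc_auc_score(y_true, y_score))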
The model achieved an accuracy of 92.5%, indicating that it can correctly predict heart disease status in most patients. The sensitivity and specificity values are also high, at 90.2% and 94.1% respectively, demonstrating the model's ability to identify both positive and negative cases reliably.
Parameter | Value
Hidden Layers | 2
Neurons per Layer | 10
Activation Function | ReLU
Learning Rate | 0.01
Epochs | 100
The neural network with two hidden layers and 10 neurons per layer achieved the best performance; the ReLU activation and the learning rate of 0.01 contributed to stable and efficient training.
The performance of this model can be compared to that of other machine learning models:
Logistic Regression: a simpler model that may not capture complex relationships between features.
Decision Trees: a model that may be prone to overfitting but provides interpretable results.
Random Forest: an ensemble method that can improve accuracy but may be computationally intensive.
Deep Learning: more complex neural network architectures that may require larger datasets and more computational resources.
Discussion:
1. Accuracy: an accuracy of 92.5% indicates that the model correctly predicts heart disease status for the large majority of patients.
2. Sensitivity (True Positive Rate): a sensitivity of 90.2% means that the model detects most patients who actually have heart disease.
3. Specificity (True Negative Rate): a specificity of 94.1% indicates that the model correctly identifies most patients without heart disease.
4. AUC-ROC: an AUC-ROC of 0.95 (area under the receiver operating characteristic curve) indicates the model's strong ability to distinguish between heart disease and non-heart-disease cases.
Implication
The high accuracy and sensitivity indicate that the model is effective in detecting patients with heart disease, while the high specificity and precision suggest that the model is also effective at avoiding false positives. The F1-score and AUC-ROC further confirm the model's overall performance.
Selected features
Feature | Importance
Age | 0.85
Discussion
Age is the most important feature, with an importance score of 0.85, indicating its strong influence on heart disease risk. Blood pressure and cholesterol level are also crucial features with high importance scores, and family history and smoking status are likewise relevant features with meaningful importance scores.
Fig 4.1 shows the login page of the system. This page allows the user to enter his or her information; it contains the username and password registered in the system.
Predicting Heart Disease
This page displays the prediction system. It is the page where the user enters the symptoms he or she has observed in the body, and the system checks whether the patient has heart disease or not based on the information provided.
Prediction Result Output
This page displays the output of the system: it gives the predicted outcome after the entered information has been processed, showing the likelihood of heart disease for the symptoms submitted.
CHAPTER FIVE
5.1 SUMMARY
This project developed a model for the prediction of heart disease using machine learning techniques. The model combines BAT feature selection and the back propagation algorithm to achieve high accuracy: the BAT algorithm selects the most relevant features, while the back propagation algorithm trains a neural network to predict heart disease risk. The model achieves an accuracy of 92.5%, and the selected features align with medical knowledge, highlighting the importance of age, blood pressure, cholesterol level, family history, and smoking status. The model can be integrated into clinical decision support systems to support healthcare providers in early detection and prevention.
In conclusion, the model developed in this research work effectively predicts heart disease risk using a combination of BAT feature selection and a back propagation neural network. The selected features align with medical knowledge, and the model's performance metrics demonstrate its potential for clinical application. This project contributes to the growing use of machine learning for early detection and prevention of heart disease.
Based on the results of the heart disease prediction model, the following recommendations are made.
Clinical integration: integrate the model into clinical decision support systems to alert healthcare providers to high-risk patients and suggest preventive measures.
Patient engagement: educate patients about their risk factors and involve them in preventive care.
Model improvement: continue to improve the model's performance using techniques such as hyperparameter tuning and ensemble methods.
Improved patient outcomes: early detection and prevention strategies can lead to better patient outcomes. With further refinement, the heart disease prediction model can be integrated into clinical practice, ultimately improving patient care.
REFERENCES
10.1007/s10278-018-0145-0.
Costelloe, Colleen M., and John E. Madewell. 2021. “An Approach to Undiagnosed
10.1053/j.sult.2020.08.014.
Eweje, Feyisope R., Bingting Bao, Jing Wu, Deepa Dalal, Wei-hua Liao, Yu He,
Harrison X. Bai, and Lisa States. 2021. “Deep Learning for Classification
10.1016/j.ebiom.2021.103402.
Amin Ul Haq, Jian Ping Li, Muhammad Hammad Memon, Shah Nazir, Ruinan Sun,
https://doi.org/10.1155/2018/3860146.
models for heart disease prediction using feature selection and PCA.
Jabbar MA, Deekshatulu BL, Chandra P (2016) “Prediction of heart disease using
https://doi.org/10.1007/978-3-319-28031-8_16.
He, Yu, Ian Pan, Bingting Bao, Kasey Halsey, Marcello Chang, Hui Liu,
Jiang, Liangxiao, Lungan Zhang, Chaoqun Li, and Jia Wu. 2019. “A Correlation-
10.1109/TKDE.2018.2836440.
Liu, Renyi, Derun Pan, Yuan Xu, Hui Zeng, Zilong He, Jiongbin Lin, Weixiong Zeng,
Zeqi Wu, Zhendong Luo, Genggeng Qin, and Weiguo Chen. 2022. "A Deep
Methods.” 12.
10.1177/1533033819840000.
10.18535/jmscr/v6i10.132.
Suster, David, Yin Pun Hung, and G. Petur Nielsen. 2020. “Differential Diagnosis of
Tao, Yuzhang, Xiao Huang, Yiwen Tan, Hongwei Wang, Weiqian Jiang, Yu Chen,
Chenglong Wang, Jing Luo, Zhi Liu, Kangrong Gao, Wu Yang, Minkang Guo, Boyu