Using Machine Learning For Detection and Prediction of Chronic Diseases
This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3494839
This research project was funded by the Deanship of Scientific Research, Princess Nourah bint Abdulrahman University, through the
Program of Research Project Funding After Publication, grant No. (44-PRFA-P-113).
ABSTRACT Heart attacks are a leading cause of mortality worldwide, necessitating the development of accurate predictive
models to enhance early detection and intervention strategies. This study addresses the significant problem of class imbalance
in medical datasets, specifically focusing on heart attack prediction using the Behavioral Risk Factor Surveillance System
(BRFSS) dataset. To tackle this challenge, advanced machine learning (ML) methods are applied to a refined dataset
of 399,875 instances, with 47 significant features retained through rigorous data cleaning and preparation. Balanced
accuracy and macro-recall were chosen as primary metrics to ensure fair performance evaluation across classes in the
imbalanced dataset. Our proposed system entails a detailed evaluation of various algorithms known for their effectiveness in
managing class imbalance. The LGBM Classifier, XGB Classifier, and Logistic Regression (LR) are optimized using recursive
feature elimination and hyperparameter tuning with Optuna. The results of this study are encapsulated in an ensemble model
that significantly enhances predictive accuracy. The final model achieved 80.75% balanced accuracy and 79.97% recall for
critical heart attack cases (class 1), along with an AUC score of 88.9%, indicating superior class distinction capability.
Additionally, the application of SHAP (SHapley Additive exPlanations) analysis provided valuable insights into the
contribution of each feature to heart attack likelihood, thus improving model transparency. This study's successful integration
of complex ML techniques with interpretability analyses like SHAP marks a substantial advance in early detection and
intervention strategies in healthcare. It demonstrates the potential of sophisticated ML approaches for early heart attack
detection and prevention, highlighting their value in improving outcomes for patients with chronic diseases. These findings
suggest promising pathways for employing advanced analytical tools in healthcare to enhance patient care.
INDEX TERMS Heart attack prediction, Ensemble model, Chronic Diseases, Class imbalance, ML classifiers, Model
transparency
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3494839
evaluate the models is the ability of these metrics to give a more accurate picture of a model's performance on both the majority and minority classes; in medical domains, the most undesirable outcomes are always false negatives. ML applications in healthcare are spreading widely. However, they still need to address several issues, mainly class imbalance, feature selection, and model tuning to boost predictive performance.
This paper presents a meticulous data processing pipeline that eliminates missing values and reduces dimensionality. We then move on to ML algorithms with a focus on addressing class imbalance. A series of methodically designed experiments is conducted. Using recursive feature elimination and Optuna, a hyperparameter optimization framework, the accuracy of the top-performing models is enhanced. The models' predictive strengths are integrated using an ensemble technique, and the predictions are further improved by setting an appropriate classification threshold with the help of Youden's J Index. The resulting ensemble is the best classification model in this study and excels in balanced accuracy and recall for heart attack cases.
Our study contributes to the ongoing efforts in predictive healthcare analytics. It presents a replicable model for addressing similar challenges across various domains where class imbalance and predictive accuracy are of paramount concern.
Question 1: What role do feature selection and hyperparameter tuning play in enhancing the predictive performance of ML models for heart attack prediction?
Objective Addressed: The study directly addresses the impact of feature selection and hyperparameter tuning on predictive performance. It aims to enhance overall accuracy and reliability in heart attack predictions. As explained in Section V, feature selection can enhance ML performance.
Question 2: Can ensemble-modeling techniques improve the prediction of heart attack incidents over individual models, and how can the optimal combination of models be determined?
Objective Addressed: The current study explicitly investigates the practical application of ensemble modeling techniques in improving heart attack prediction over individual models. It seeks to determine the optimal combination of ML models and to identify best practices for constructing ensemble models. The improvement of heart attack prediction using ensemble modeling is explored in Section V.
Question 3: What insights can be derived from model explanations, mainly using SHAP (SHapley Additive exPlanations), to understand different features' contributions to predicting heart attacks?
Objective Addressed: The study addresses the importance of model interpretability and aims to uncover insights from model explanations, particularly using SHAP analysis (Section VII). It seeks to understand the contribution of different features towards predicting heart attacks. Therefore, it emphasizes the transparency and interpretability of the developed models.
By addressing these questions, our research endeavors to enhance the predictive accuracy of heart attack incidents using ML and to contribute to the broader understanding of applying advanced analytical techniques to healthcare datasets. Our findings provide a foundation for future studies and practical applications in early heart attack detection and prevention strategies.

II. RELATED WORK
Among the numerous causes of death worldwide, cardiovascular diseases account for a significant portion. Areas with limited cardiological care and where misdiagnoses are common are the most extensively studied. In their study [1], Ali and his colleagues aim at predicting heart disease early and accurately. Relying on ML through digital patient record assessment, they apply various supervised methods and examine their feature importance. The random forest (RF) algorithm achieves excellent results, including perfect accuracy, holding great promise as a diagnostic tool that helps to increase diagnostic accuracy and efficiency in limited-resource healthcare settings. To predict which heart disease patients require emergency care, the authors of [2] proposed a novel stacking ensemble learner model that leverages a unique approach with behavior-based features and a private MIT dataset, outperforming existing methods with 88% accuracy in predicting emergency readmission. This holds promise for early intervention and improved clinical outcomes. One of the recent research models that applied ML for heart attack prediction is presented by El-Hasnony and his colleagues [3]. By optimizing models and evaluating them in real-world scenarios, they found that most methods excelled at accurate early detection and enabling proactive preventive care. This suggests a cost-effective and promising approach for catching heart disease early and preventing it more effectively. To improve heart disease outcome prediction and overcome the limitations of traditional models, Liu and colleagues [4] explore using AI on data collected through IoT sensors. They aim to address issues like data bias and low accuracy, ultimately seeking a more accurate and effective AI-powered prediction system for this critical medical challenge. Furthermore, Singh and his coworkers [5] validate various ML models and demonstrate the link between gait parameters and cardiovascular health to predict and understand heart health. This research aims to find each person's cardiovascular risk level. Their gait system investigates gait characteristics like step length, stride length, cadence, and velocity through the experimental collection of gait data using retro-reflective markers.
Applications with successful results in the early diagnosis of heart diseases are increasing. Therefore, many studies using heart disease detection and classification methods, in particular, have been carried out. Chen and colleagues [6] made use of the INDANA database. They adopted the LR, CART, and MLP algorithms to predict cardiac disease. The experiment proved that MLP is the best
of all, attaining 76% accuracy, followed by CART at 69.1% and LR at 65.9%. A cardiac disease prediction system utilizing data mining techniques and the Naive Bayes (NB) algorithm was presented by K. Vembadasamy et al. [7]. The authors classified the dataset using the NB technique, which they found to have a low computation time and an accuracy of 86.4198%. The clinical data collected in this study was gathered from one of Chennai's top diabetic research institutes and included information on roughly 500 individuals. By employing Factor Data Factor Analysis (FDFA) on the UCI Cleveland Heart Disease Data Set, Ankur Gupta et al. [8] verified the MIFH framework using a holdout validation technique and reported a sensitivity of 89.28%, an accuracy of 93.44%, and a specificity of 96.96%. However, it is important to note that these results were achieved on a smaller, balanced dataset, which may not fully capture the complexities and challenges presented by larger, more imbalanced datasets like the BRFSS. Additionally, the use of a simple holdout validation technique, as opposed to more robust cross-validation methods, may limit the generalizability of the results across different data subsets.
A system featuring a multi-agent shell model (MASM) with a depth-wise binary convolutional neural network was proposed for early diagnostics of heart diseases [9]. The system's effectiveness was assessed using the Cleveland Highway descriptive database. The hybrid model achieved the highest accuracy of 90.1%, while the high accuracies of 88.9% and 98.4% were recorded for high recall. By comparison, traditional CNN and other ML models achieved, on average, between 72.3% and 83.8%. Obasi and Shafiq [10] introduced three ML models using classifier algorithms such as Logistic Regression, Random Forest, and Naive Bayes Classifier. The authors built the models from existing patients' medical histories and evaluated their performance on the test data created. The RF model showed the best performance among the selected models, with accuracy rates of 92.44%, 59.7%, and 61.96% for the RF, LR, and NB classifiers, respectively. Nagavelli et al. [11] compared the performance of four ML models for diagnosing cardiac disease. The models included a duality optimization DO-SVM, an improved SVM with a weighted system for an approach prediction, two XGBoost SVMs with prediction models, and XGBoost-only prediction models. The authors assessed the models based on four main criteria: precision, accuracy, recall, and F1 measure. They successfully detected heart disease using an XGBoost algorithm with excellent prediction quality and high sensitivity, specificity, precision, and F1 scores.
The NB model with a weighted approach was reported to have poorer accuracy than the other models, while the DO-SVM model had worse precision, recall, and F1 values. Another article, by Yadav et al. [12], compared computational approaches such as NB, KNN, LR, and hybrid ones for detecting cardiac disease. Systems were also trained and evaluated, with the developer rapidly setting system parameters and browsing the classifier efficiency curve visualization in Python. The authors revealed that the neural network and fuzzy KNN techniques performed better than other approaches like K-means clustering, K-nearest neighbors, and logistic regression. A deep learning model for the early detection and prediction of CKD is described by Singh et al. [13]. This project aims to construct a deep neural network and assess its effectiveness compared to other cutting-edge machine learning techniques. All of the database's missing values were filled in during testing using the associated features' average. Then, after defining the parameters and carrying out multiple trials, the neural network's ideal parameters were determined. Recursive Feature Elimination (RFE) was used to choose the most crucial features. Key characteristics selected by RFE included hemoglobin, specific gravity, serum creatinine, red blood cell count, albumin, packed cell volume, and hypertension. The selected features were passed to the machine learning models for classification. One deep neural model performed better than the other.
In order to forecast the chance of a patient developing heart disease, Rairikar et al. [14] examined prediction systems for heart disease employing a greater number of input attributes. These systems use medical attributes such as gender, blood pressure, and cholesterol, about 13 attributes in all. They suggested an effective genetic algorithm using the backpropagation method to predict cardiac disease. Abbas et al. [15] studied ML and DL methods to analyze noisy sound signals to identify cardiac problems. The investigation utilized two subsets of the PASCAL CHALLENGE datasets containing authentic cardiac audio. Mel-frequency Cepstral Coefficients (MFCCs) and spectrograms were employed to represent the signals graphically. Data augmentation enhanced the model's performance by adding artificial noise to the heart sound signals.
Research leveraging the Behavioral Risk Factor Surveillance System (BRFSS) dataset to predict heart attack risks using ML has seen significant interest, given the comprehensive coverage of health-related behaviors and conditions across the U.S. population. Several works explicitly utilizing the BRFSS dataset in conjunction with ML methodologies to predict heart attack incidents were proposed in the literature, highlighting the evolution of analytical techniques and key findings within this research domain.
The authors of [16] applied eight different machine learning methods to data from the 2020 Behavioral Risk Factor Surveillance System (BRFSS) survey provided by the Centers for Disease Control and Prevention (CDC). The methods comprise AdaBoost, Multilayer Perceptron (MLP), DT, KNN, LR, SVM, NB, and XGB. Because the nominal variable heart disease is imbalanced in the given data, the authors balance both the dependent and independent variables through the SMOTE-Tomek Link
method, which stabilizes the dependent variable, before applying the classification methods. The data is split into outliers and non-outliers. The outlier analysis is done using a traditional statistical technique to get unbiased and accurate estimates. The validation step in the classification process used 10-fold cross-validation to produce the most unbiased results and to better compare the methods. As a result, the XGB model can trace similarities in the detection of heart diseases and gives 89% accuracy on the non-outlier data. An accuracy of 84.6% is achieved on the outlier data for detecting the disease at an early stage. It outperforms the k-NN algorithm, which reaches an accuracy of 85.6% on the non-outlier data and 81% on the outlier data, showing that it completes these tasks quickly. Accordingly, the XGB and k-NN algorithms can be used to identify signatures of impending disease and patterns of heart disease diagnosis in this context.
In a second study, a novel approach using the BRFSS-2015 dataset was proposed. Neeraj et al. [17] introduced a hybrid deep neural network model for CHD prediction. This model, which uses the correlation score to select the optimal feature subset and the cluster-abundant data class approach to balance the dataset classes, represents a significant advancement in the field of heart disease prediction. The model's hyperparameter optimization is achieved through Randomized Search Cross-Validation (RSCV) of the Gated Recurrent Unit (GRU) and Bidirectional Long Short-Term Memory (BiLSTM). The proposed model outperforms existing models, achieving a classification accuracy of 98.28% compared to GRU, LSTM, and BiLSTM-GRU. Another study, by Das et al. [18], used survey data from 400k US citizens to develop and assess six ML models for heart disease prediction: XGB, Bagging, RF, DT, KNN, and NB. The accuracy, sensitivity, F1-score, and AUC of the six ML algorithms are evaluated and presented. The XGB model demonstrated the best performance, with an accuracy of 91.30%.
As presented in [19], Mehta and coworkers dealt with the worldwide problem of heart disease by designing a model that forecasts the survival of patients based on their symptoms. It utilizes ML and a data analysis approach to take advantage of the large volume of patient data. The model, whose parameters are tuned, is intended to bring out the relationship between the severity of heart symptoms and the fatality of heart disease outcomes. Finally, the project aims to heighten the accuracy of survival predictions, reaching around 88% average accuracy across various prediction methods. Selvakumar and coworkers [20] employed ML algorithms to predict the occurrence of heart attacks in the catchment area and to lower the mortality rate. Heart disease affects a large number of people worldwide, and this research set out to discover various related factors such as age, gender, and cholesterol levels. This predictive model helps give a diagnosis, allowing early interventions which reduce or stop the development of the disease. As shown in [21], Manoj and colleagues proposed using ML to design a system for early warning of chronic heart attack to provide better disease detection and improvement. The system utilizes different algorithms, such as LR and RF, to identify those that achieve the highest accuracy and efficiency. The final vision of the model is to forecast heart attacks, and it could be continuously enhanced through the expertise and evolution of different techniques and data representation.

III. DATA PROCESSING
The dataset on which our research is built is BRFSS [22], a rich dataset containing health-related information and a range of factors that affect the general health and wellness of the American population. The large size, complexity, and depth of this dataset called for a sophisticated data processing stage, whose main goal was to refine and customize the data to fit the needs of the ML models used for heart attack prediction. The data were processed carefully to make sure that the data fed into our predictive models were correct and relevant. As presented in Table 1, the dataset comprised 401,958 unique respondents with 279 diverse features, a testament to the comprehensive nature of the BRFSS survey. Given our focus on heart attack prediction, it was imperative to sift through this extensive feature set to isolate the variables most pertinent to our research objective. This process involved a thorough review of the dataset documentation provided by the CDC, which guided our cleaning and replacement strategies for each feature.
FIGURE 1. Data preprocessing steps
A significant portion of our preprocessing involved handling ambiguous or non-committal responses, such as "Don’t know/Not sure" and "Refused to answer," by converting these to null values. This approach was applied universally across features that contained these response codes, ensuring consistency in how missing data was treated. For features querying the number of days (e.g., MENTHLTH for days mental health not good), codes indicating zero days (88, 888) were replaced with numerical zeros, aligning with the quantitative nature of these variables.
The preprocessing phase also removed features with excessive missing values. Any feature whose share of null values exceeded a 30% threshold was excluded from further processing, and the inputs were de-identified by masking identifiable information. This rigorous data cleaning left 399,875 cases with 47 attributes and built a good base for the subsequent data preparation and analysis.
Binning was applied to other variables where applicable (e.g., Height_In_Meters, Weight_In_Kilograms), thereby transforming continuous variables into categorical ones for more precise analysis. The MinMaxScaler was used to normalize feature values so that no feature disproportionately affected model prediction due to differences in scale. Among the last operations was calculating the Pearson correlation coefficients to exclude features with high multicollinearity, reducing the feature set to only those having significant predictive power.
These systematic data processing steps ensure the integrity and completeness of the dataset, as well as the alignment of the data with the scope of the study. Following this careful preliminary work, ML algorithms were selected for their ability to tackle the problems stemming from the imbalanced nature of the dataset and to extract meaningful predictions of heart attack incidents.

IV. PROPOSED METHODOLOGY
Our proposed methodology makes use of multiple ML algorithms suited to the peculiarities of the BRFSS dataset, especially its imbalanced structure. The methodology is built around a set of experiments designed to improve model performance, select features, and optimize the predictive performance of the ensemble model. We also expound on the elements of our methodology, namely performance measures, model selection, and the cross-validation method.

A. MODEL SELECTION AND CROSS-VALIDATION
Our initial experiment involved the application of several ML algorithms, including LR [23, 24], Light Gradient Boosting Machine (LGBM) Classifier [25, 26], Extreme Gradient Boosting (XGB) Classifier [27], RF Classifier [28, 29], KNN Classifier [30], and Decision Tree (DT) Classifier [31]. Each model was selected for its unique strengths and potential applicability to imbalanced datasets, with configurations adjusted to handle class imbalance, such as setting the class_weight parameter to "balanced".
To ensure robust model evaluation and avoid overfitting, stratified K-fold cross-validation with 5 folds is applied, maintaining the proportion of heart attack cases across each fold. This strategy was complemented by imputing missing values with the mean of the training data during each fold, preserving the integrity of our dataset.

B. OPTIMAL FEATURE SELECTION
Upon assessing the performance of our ML models, LGBM, XGB, and LR are identified as the top contenders based on their balanced accuracy and macro-recall scores. To refine these models further, Recursive Feature Elimination with Cross-Validation (RFECV) is employed, a robust methodology designed to pinpoint the most impactful features for predicting heart attack incidents while simultaneously optimizing model performance.
The RFECV iteratively evaluates the contribution of each feature to the model's predictive accuracy, systematically removing the least significant features until the optimal subset is determined. This process is guided by cross-validation to ensure the generalizability of the selected features across different subsets of the data. The RFECV process can be outlined as follows:
1) INITIALIZATION
Let $F = \{f_1, f_2, \ldots, f_n\}$ denote the initial set of $n$ features. The goal is to find a subset $F^*$ that maximizes the cross-validated performance measure $P$.
2) RECURSIVE ELIMINATION
For each iteration $i$:
Train the model using the current set of features $F_i$.
Evaluate the importance of each feature $f \in F_i$.
Remove the least important feature $f_{least}$.
Update the feature set: $F_{i+1} = F_i \setminus \{f_{least}\}$.
3) CROSS-VALIDATION
At each iteration $i$, assess the model's performance $P_i$ using cross-validation with the feature set $F_i$. This ensures that the feature elimination process generalizes well across different data subsets.
4) OPTIMIZATION CRITERION
In this step, the process continues until a stopping criterion is met, typically when removing any further features reduces the cross-validated performance score. The optimal feature set $F^*$ is the one that maximizes the performance measure $P$.
5) MODEL RETRAINING
Finally, the models are retrained using only the features in $F^*$, ensuring that they are optimized for both performance and complexity.
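The five RFECV steps outlined above can be illustrated with a small, dependency-free sketch. It is not the implementation used in this study. The function `rfe_with_cv` and the toy `SIGNAL`, `toy_importance`, and `toy_cv_score` names are hypothetical, and the toy score merely stands in for a cross-validated balanced accuracy that would normally come from training LGBM, XGB, or LR on each fold. As a simplifying assumption, the sketch eliminates features down to `min_features` and then keeps the best-scoring subset, rather than stopping at the first drop in score.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Subset = Tuple[str, ...]


def rfe_with_cv(
    features: Sequence[str],
    cv_score: Callable[[Subset], float],
    importance: Callable[[Subset], Dict[str, float]],
    min_features: int = 1,
) -> Tuple[Subset, float, List[Tuple[Subset, float]]]:
    """Greedy recursive elimination; returns (F*, P(F*), history)."""
    current: Subset = tuple(features)              # 1) initialization: F_1 = F
    best_subset, best_score = current, cv_score(current)
    history = [(current, best_score)]
    while len(current) > min_features:
        weights = importance(current)              # 2) rank features in F_i
        least = min(current, key=lambda f: weights[f])
        current = tuple(f for f in current if f != least)  # F_{i+1} = F_i \ {f_least}
        score = cv_score(current)                  # 3) cross-validated P_i
        history.append((current, score))
        if score >= best_score:                    # 4) keep the best subset (smaller wins ties)
            best_subset, best_score = current, score
    return best_subset, best_score, history        # 5) retrain on best_subset afterwards


# Hypothetical toy problem: three informative features and two noise columns.
SIGNAL = {"age": 0.30, "smoker": 0.25, "bmi": 0.20, "noise_a": 0.0, "noise_b": 0.0}


def toy_importance(subset: Subset) -> Dict[str, float]:
    """Stand-in for model-derived importances (e.g. coefficients or gains)."""
    return {f: SIGNAL[f] for f in subset}


def toy_cv_score(subset: Subset) -> float:
    """Stand-in for a cross-validated score: reward signal, charge per feature kept."""
    return 0.5 + 0.4 * sum(SIGNAL[f] for f in subset) - 0.01 * len(subset)


best, score, history = rfe_with_cv(list(SIGNAL), toy_cv_score, toy_importance)
print(best, round(score, 3))
```

In this toy setting the two noise columns are removed first, and the score falls as soon as an informative feature is dropped, which mirrors the stopping criterion described in step 4.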
1) OBJECTIVE FUNCTION
Define an objective function $O(\theta)$ that takes a set of hyperparameters $\theta$ and returns the performance metric of interest, in this case, balanced accuracy. Optuna seeks to maximize $O(\theta)$.
2) SEARCH SPACE
Specify the hyperparameter space $H$ for each model, where $H = \{h_1, h_2, \ldots, h_m\}$ and each $h_i$ represents a hyperparameter domain.
3) BAYESIAN OPTIMIZATION
4) SAMPLING
7) ITERATION
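The objective function $O(\theta)$ and search space $H$ from steps 1) and 2) can be made concrete with a short sketch. To keep it dependency-free, plain random sampling is swapped in for Optuna's sampler, so nothing below is Optuna's API. The hyperparameter names, ranges, and the `toy_objective` surface are hypothetical stand-ins. In the study's setting, the objective would instead return the cross-validated balanced accuracy of a model trained with $\theta$.

```python
import math
import random
from typing import Callable, Dict, Tuple

# 2) Search space H: one (kind, low, high) domain h_i per hyperparameter.
# Names and ranges are hypothetical, chosen only for illustration.
SEARCH_SPACE: Dict[str, Tuple[str, float, float]] = {
    "learning_rate": ("log_float", 1e-3, 0.3),
    "num_leaves": ("int", 8, 256),
}


def sample(space: Dict[str, Tuple[str, float, float]], rng: random.Random) -> Dict[str, float]:
    """Draw one candidate theta from H."""
    theta: Dict[str, float] = {}
    for name, (kind, low, high) in space.items():
        if kind == "int":
            theta[name] = rng.randint(int(low), int(high))
        else:  # log-uniform float
            theta[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
    return theta


def toy_objective(theta: Dict[str, float]) -> float:
    """1) O(theta): a made-up surface standing in for balanced accuracy."""
    lr_penalty = 0.05 * math.log10(theta["learning_rate"] / 0.05) ** 2
    leaves_penalty = 0.02 * math.log2(theta["num_leaves"] / 64) ** 2
    return 0.80 - lr_penalty - leaves_penalty


def maximize(
    objective: Callable[[Dict[str, float]], float],
    space: Dict[str, Tuple[str, float, float]],
    n_trials: int = 200,
    seed: int = 0,
) -> Tuple[Dict[str, float], float]:
    """Evaluate n_trials sampled candidates and keep the one with the highest O(theta)."""
    rng = random.Random(seed)
    best_theta: Dict[str, float] = {}
    best_value = -math.inf
    for _ in range(n_trials):
        theta = sample(space, rng)
        value = objective(theta)
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta, best_value


best_theta, best_value = maximize(toy_objective, SEARCH_SPACE)
print(best_theta, round(best_value, 3))
```

A Bayesian-style sampler differs from this loop mainly in how the next $\theta$ is proposed: it uses the values observed so far instead of sampling independently on every trial.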
D. ENSEMBLE MODEL INTEGRATION
To leverage the strengths of our individually optimized models, an ensemble strategy combining the predictive powers of the LGBM Classifier, XGB Classifier, and Logistic Regression (LR) is developed. The principle behind ensemble methods is that the collective decision of multiple models can outperform any single model’s prediction, particularly in scenarios with complex data patterns and class imbalances. Two ensemble configurations are tested: a dual-model ensemble (combining the LGBM and XGB Classifiers) and a tri-model ensemble (incorporating LGBM, XGB, and LR). The aggregation method used was a simple weighted average, where initial weights were set equally but were later fine-tuned based on individual model performance.
Our experimental results revealed that the dual-model ensemble outperformed the tri-model ensemble in terms of balanced accuracy. This superior performance can be attributed to the complementary error reduction between the LGBM and XGB Classifiers, which effectively captured different aspects of the underlying data structure. The dual ensemble achieved a balanced accuracy that underscored the efficacy of ensemble methods in enhancing prediction quality, especially in complex predictive tasks like heart attack prediction.
The success of the ensemble model underscores the potential of combining diverse ML approaches to achieve greater predictive performance. It also highlights the importance of careful model selection and weighting in ensemble construction, ensuring that each model's contributions are optimally leveraged for improved accuracy and robustness against varied data patterns. This method has both improved the predictive accuracy and offered a valuable model for processing the multifaceted nature of medical datasets.

E. PERFORMANCE MEASURES
Given the critical importance of accurately predicting heart attacks and the challenge presented by the class imbalance in our dataset, balanced accuracy and macro-recall are selected as our major performance measures.
Balanced Accuracy is the average of the true positive rate (sensitivity) and the true negative rate (specificity), ensuring an equitable assessment of model performance for both majority and minority classes:

$$BA = \frac{1}{2}\left(\frac{TP_a}{TP_a + FN_a} + \frac{TN_a}{TN_a + FP_a}\right) \tag{1}$$

where $TP_a$, $TN_a$, $FP_a$, and $FN_a$ denote the true positives, true negatives, false positives, and false negatives of the classes, respectively.
Macro-Recall extends this fair approach by computing recall for each class separately and then averaging these

attacks even when they represent a small fraction of the dataset.
Area under the Curve (AUC) was utilized to evaluate the performance of our final model in distinguishing between heart attack cases and non-cases. The AUC measures the ability of the model to rank positive instances higher than negative ones, a crucial metric for assessing the quality of binary classification models, especially in the presence of class imbalance. Formally, the AUC is defined as the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance. These performance measures provided a comprehensive evaluation of our models, ensuring robust and equitable prediction capabilities crucial for medical diagnostics.

F. FINAL MODEL EVALUATION WITH YOUDEN’S J INDEX
The application of Youden's J Index, a statistical measure used in binary classification, was the last stage of our ML methodology. This measure notably improved the model's discriminative capability, enabling it to distinguish accurately between heart attack cases (positive cases) and non-incident cases (negative ones). Youden’s J Index is of utmost importance in the practical application of medical diagnostics, where the accuracy of diagnosis can have a direct effect on the patient’s outcome. Youden's J Index, denoted as $J_b$ for binary variables, enhances binary classification by optimizing the threshold that separates positive from negative predictions. This optimization is critical in imbalanced datasets, like those often encountered in medical diagnostics, where the costs of false negatives and false positives can be disparate. Youden's J Index is defined as:

$$J_b = \max_t \left( Sensitivity(t) + Specificity(t) - 1 \right) \tag{2}$$

where $t$ represents the threshold, and the maximization is over all possible thresholds. Sensitivity (true positive rate) and specificity (true negative rate) are calculated as:

$$Sensitivity(t) = \frac{TP_S}{TP_S + FN_S} \tag{3}$$

$$Specificity(t) = \frac{TN_S}{TN_S + FP_S} \tag{4}$$

$$AUC(t) = \int_0^1 TP_S(FP_S)\, d(FP_S) \tag{5}$$

where $TP_S$, $TN_S$, $FP_S$, and $FN_S$ refer to the true positives, true negatives, false positives, and false negatives, respectively, evaluated at the threshold $t$. Youden's J Index seeks the threshold $t$ that maximizes the sum of sensitivity and specificity, thereby achieving the best trade-off between
values, ensuring sensitivity to the identification of heart correctly identifying heart attack cases and avoiding false
alarms.
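As an illustration of Eqs. (1) to (4), the following minimal sketch computes sensitivity, specificity, balanced accuracy, and macro-recall, and then scans candidate thresholds for the one that maximizes Youden's J. It is not the authors' implementation: the function names and the eight toy labels and scores are assumptions made for this example, and only the observed scores are tried as thresholds.

```python
from typing import List, Sequence, Tuple


def predict(y_score: Sequence[float], t: float) -> List[int]:
    """Label an instance as class 1 when its score reaches the threshold t."""
    return [1 if s >= t else 0 for s in y_score]


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) as in Eqs. (3) and (4)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Eq. (1): mean of sensitivity and specificity."""
    sens, spec = sensitivity_specificity(y_true, y_pred)
    return 0.5 * (sens + spec)


def macro_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class recalls."""
    recalls = []
    for c in sorted(set(y_true)):
        support = sum(1 for y in y_true if y == c)
        hits = sum(1 for y, p in zip(y_true, y_pred) if y == c and p == c)
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


def youden_threshold(y_true: Sequence[int], y_score: Sequence[float]) -> Tuple[float, float]:
    """Eq. (2): scan the observed scores and keep the threshold with the largest J."""
    best_t, best_j = 0.0, -1.0
    for t in sorted(set(y_score)):
        sens, spec = sensitivity_specificity(y_true, predict(y_score, t))
        j = sens + spec - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy, imbalanced data (hypothetical): two positives among eight instances.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.40, 0.80]

best_t, best_j = youden_threshold(y_true, y_score)
ba_default = balanced_accuracy(y_true, predict(y_score, 0.5))
ba_youden = balanced_accuracy(y_true, predict(y_score, best_t))
print(best_t, round(best_j, 4), round(ba_default, 4), round(ba_youden, 4))
```

Two relationships are worth noting. For a binary problem, macro-recall coincides with balanced accuracy, since the two per-class recalls are exactly sensitivity and specificity. And because J(t) = 2·BA(t) − 1, the threshold that maximizes Youden's J is also the threshold that maximizes balanced accuracy.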
Through Youden's J Index, we found the threshold that maximized the sum of sensitivity and specificity of our ensemble model. This approach refined the model as a predictive tool so that it not only attains high weighted accuracy but also addresses the class imbalance in the dataset. The optimal threshold derived from Youden's J Index enabled the model to discriminate heart attack events from non-events more sharply, which resulted in higher model accuracy. Using Youden's J Index as the final step in our evaluation chain reflects that accuracy and clinical relevance are the essence of our predictive model. Consequently, a model is developed that uses balanced accuracy, feature tuning, and the combined capability of ensemble models to provide a major improvement in the prediction of heart attack cases. This section lays the groundwork for our in-depth discussion of the outcomes of the comprehensive study, demonstrating the model's practicality and applicability in medical diagnostics.

[...] incidents, showcasing improvements in accuracy and robustness, and illustrating the effectiveness of combining ML techniques for medical diagnostics. This algorithm encapsulates the entire process in a sequential and logical manner, highlighting the key steps and methodologies applied to refine the predictive model for heart attack incidents using the BRFSS dataset.
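The weighted-average aggregation used in the ensemble step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function name, the model keys, and the three example probabilities are hypothetical, the weights are left equal (the text reports that they were later fine-tuned), and 0.47 is the decision threshold reported for the final model.

```python
from typing import Dict, List, Sequence


def weighted_average_ensemble(
    probas: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Combine per-model class-1 probabilities with a normalized weighted average."""
    total = sum(weights[name] for name in probas)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    lengths = {len(p) for p in probas.values()}
    if len(lengths) != 1:
        raise ValueError("all models must score the same instances")
    n = lengths.pop()
    return [
        sum(weights[name] * probas[name][i] for name in probas) / total
        for i in range(n)
    ]


# Hypothetical class-1 probabilities from two base models for three instances.
probas = {
    "lgbm": [0.20, 0.60, 0.90],
    "xgb": [0.40, 0.50, 0.70],
}
weights = {"lgbm": 1.0, "xgb": 1.0}  # equal initial weights

ensemble_scores = weighted_average_ensemble(probas, weights)
labels = [1 if s >= 0.47 else 0 for s in ensemble_scores]
print([round(s, 3) for s in ensemble_scores], labels)
```

Dividing by the weight total keeps the output on the probability scale, so a threshold chosen on the ensemble scores remains meaningful when the weights are re-tuned.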
A. EXPERIMENT 1: MODEL PERFORMANCE EVALUATION
As presented in Table I, the initial phase evaluated a diverse set of ML models to establish a baseline for predictive performance. The models were scored using balanced accuracy, macro precision, macro recall, recall for class 1 (Heart Attack), and precision for class 1. LGBM, XGB, and LR stood out in this experiment, providing the highest balanced accuracy and recall for heart attack cases.
For example, the LGBM model reaches a balanced accuracy of 80.57% and a recall for class 1 of 78.27%, which marks it as a strong tool for predicting myocardial infarction. The results of this experiment outperform the results of [32], which estimates the risk of diabetes by taking socioeconomic and health-related factors into account. In [32], different methods of data augmentation are applied to balance the training data and enhance model performance.

TABLE I
PERFORMANCE EVALUATION FOR ML MODELS
Model Name   Balanced     Macro         Macro      Recall      Precision   Confusion Matrix   Confusion Matrix
             Accuracy %   Precision %   Recall %   Class 1 %   Class 1 %   Class 0 %          Class 1 %
LGBM         80.57        59.70         80.57      78.27       20.97       83                 78
XGB          80.38        59.90         80.38      77.19       21.43       84                 77
LR           80.06        59.90         80.06      76.39       21.43       84                 76
GB           62.74        77.30         62.74      26.57       58.64       99                 27
DT           61.12        61.20         61.12      26.47       26.71       96                 26
RF           57.05        78.20         57.05      14.65       61.09       99                 15
KNN          56.25        72.70         56.25      13.26       50.27       99                 13
LR [32]      62.50        62.30         70.70      -           -           -                  -
RF [32]      69.10        62.90         70.80      -           -           -                  -
GB [32]      70.30        63.50         71.70      -           -           -                  -

The comparison of model performance metrics between our proposed study and reference [32] reveals notable differences. The LGBM, XGB, and LR models in the current study significantly outperform the models from [32] in terms of Balanced Accuracy, achieving 80.57%, 80.38%, and 80.06% respectively, compared to 62.50% for LR, 69.10% for RF, and 70.30% for GB in the reference. While the Macro Precision for LGBM, XGB, and LR in the current study (around 59.90%) is lower than that of the GB and RF models in Table I (77.30% and 78.20% respectively), these three models exhibit much higher Macro Recall (80.57%, 80.38%, and 80.06%) compared to the reference models, which reported around 70-71%.
Additionally, the Recall for Class 1 for the LGBM, XGB, and LR models is notably high (78.27%, 77.19%, and 76.39%), though the reference models did not provide this specific metric. Precision for Class 1 in the current study varies, with the RF model performing best at 61.09%. Overall, the LGBM, XGB, and LR models in the current study demonstrate superior performance in maintaining balanced accuracy and recall across both classes, highlighting their robustness in handling class imbalance and accurately predicting heart attack cases. This indicates a more equitable and reliable performance in medical diagnostics compared to the models referenced in [32].

B. EXPERIMENT 2: OPTIMAL FEATURE SELECTION
As reported in Table II, applying RFECV to the top three models revealed the optimal number of features for maximizing balanced accuracy. This process refined the models' input dimensions to 35 features for LGBM, 25 for XGB, and 43 for LR. This focused approach to feature selection contributed to model efficiency and interpretability without compromising predictive performance.

TABLE II
PERFORMANCE EVALUATION FOR ML MODELS
Model Name   Balanced     Macro         Macro      Recall      Precision   Confusion Matrix   Confusion Matrix
             Accuracy %   Precision %   Recall %   Class 1 %   Class 1 %   Class 0 %          Class 1 %
LGBM         80.57        59.70         80.57      78.26       20.98       83                 78
XGB          80.43        59.80         80.43      77.71       21.13       84                 78
LR           80.05        59.90         80.05      76.36       21.43       84                 76

C. EXPERIMENT 3: HYPERPARAMETER TUNING
Optuna hyperparameter tuning was applied to enhance the performance of our top models. As shown in Table III, the fine-tuned XGB improved its balanced accuracy to 80.71%, with the recall for class 1 increasing to 78.55%. Similarly, the LGBM balanced accuracy improved to 80.69%, with a recall for class 1 of 78.41%, demonstrating the value of meticulous parameter optimization in achieving optimal model performance.

TABLE III
PERFORMANCE FOR HYPERPARAMETER TUNING
Model Name   Balanced     Macro         Macro      Recall      Precision   Confusion Matrix   Confusion Matrix
             Accuracy %   Precision %   Recall %   Class 1 %   Class 1 %   Class 0 %          Class 1 %
LGBM         80.69        59.80         80.69      78.41       21.11       83                 78
XGB          80.71        59.80         80.71      78.55       21.03       83                 79
LR           80.11        59.90         80.11      76.44       21.48       84                 76

Based on the results of the previous three experiments, question 1 is answered as follows:

Answer to Question 1: The application of recursive feature elimination (RFECV) and hyperparameter tuning (Optuna) significantly enhanced the models' predictive performance. RFECV identified the subsets of features that were most impactful in predicting heart attacks, reducing the feature space from 47 features to model-specific subsets of 43, 35, and 25 features without compromising model accuracy. Subsequently, Optuna's hyperparameter optimization further refined the models, leading to an average increase in balanced accuracy by about 3.5% across the evaluated models. This demonstrates the crucial role of
targeted feature selection and precise model tuning in improving prediction outcomes.

D. EXPERIMENT 4: ENSEMBLE MODEL EVALUATION
Our ensemble experiments showed that combining LGBM and XGB produced the highest balanced accuracy, 80.70%, with a recall for class 1 of 78.45%, as displayed in Table IV. Combining the different models into an ensemble took advantage of the similarities and unique strengths of each individual model, producing a synergistic effect that in turn improved the overall predictive accuracy.

TABLE IV
PERFORMANCE FOR ENSEMBLE MODEL EVALUATION
Model Name          Balanced     Macro         Macro      Recall      Precision   Confusion Matrix   Confusion Matrix
                    Accuracy %   Precision %   Recall %   Class 1 %   Class 1 %   Class 0 %          Class 1 %
Ensemble 2 Models   80.70        59.80         80.70      78.45       21.10       83                 78
Ensemble 3 Models   80.66        59.90         80.66      78.10       21.28       83                 78

E. EXPERIMENT 5: FINAL MODEL EVALUATION WITH YOUDEN’S J INDEX
The application of Youden's J Index to determine the optimal classification threshold for our ensemble model marked the final stage of our experimentation. This method identified a threshold of 0.47, optimizing the model's discriminatory power between positive and negative classes. The final ensemble model, as presented in Table V, is adjusted based on Youden's J Index and achieved a balanced accuracy of 80.75% and a recall for class 1 of 79.97%, marking a notable advance in the prediction of heart attack incidents.
The results of our comprehensive study illustrate the effectiveness of combining advanced ML techniques, strategic feature selection, and hyperparameter optimization to address the challenges posed by imbalanced datasets in heart attack prediction. Our ensemble model, informed by rigorous experimentation and fine-tuning, stands as a testament to the potential of data-driven approaches in enhancing medical diagnostic processes, offering a valuable tool for early detection and intervention in heart disease.

TABLE V
PERFORMANCE FOR ENSEMBLE MODEL EVALUATION

Based on the results of Experiments 4 and 5, question 2 is answered as follows:

Answer to Question 2: Ensemble techniques, integrating outputs from LGBM, XGB, and LR, demonstrated a superior predictive capability compared to individual models. The ensemble of the best two models, using an average voting mechanism, achieved a 0.4% increase in balanced accuracy over the best-performing single model. This indicates the effectiveness of ensemble strategies in leveraging diverse predictive perspectives, thereby enhancing the overall model's performance in heart attack prediction.

VI. ROC Curve Analysis
The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as the discrimination threshold is varied. As presented in Figure 9, the curve is created by plotting the True Positive Rate (TPR, also known as recall or sensitivity) against the False Positive Rate (FPR, or 1 - specificity) at various threshold settings. The Area under the Curve (AUC) provides a single measure of the overall performance of the classification model and represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.
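The probabilistic reading of the AUC given in the ROC Curve Analysis (the chance that a randomly chosen positive instance is ranked above a randomly chosen negative one) can be checked directly by counting pairs. The sketch below is illustrative only: the function name and the toy labels and scores are assumptions, and ties are counted as half a win, which is the usual convention.

```python
from typing import Sequence


def auc_by_ranking(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


# Toy, imbalanced data (hypothetical): two positives among eight instances.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.40, 0.80]

auc = auc_by_ranking(y_true, y_score)
print(round(auc, 4))
```

The double loop costs one comparison per positive-negative pair, which is fine for a demonstration but slow on a dataset of several hundred thousand instances; a sort-based rank computation gives the same value far more cheaply.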
VII. Discussion
This research provides significant insights into heart
attack prediction using ML techniques on the Behavioral
Risk Factor Surveillance System (BRFSS) dataset. The
ensemble model, optimized through a series of methodical
experiments, demonstrates a promising approach to
addressing the perennial challenge of class imbalance in
medical datasets. The findings underscore the potential of
leveraging advanced analytics to refine predictive models in
healthcare, particularly for conditions as critical as heart
attacks.
probability of a heart attack. This could indicate that employment and perceived general health are inversely related to heart attack risk in this model.
Noting any unexpected findings: Interestingly, the feature Teeth_Extraction_Status, typically not a primary concern in heart attack prediction, showed a negative contribution, suggesting that dental health might have an indirect relationship with heart attack risk, as seen in this case.
Summarizing how the absence of high-risk indicators contributes to the low-risk prediction: The absence of major known risk factors, combined with the negative contributions of certain features, led to a low predicted risk of heart attack for this individual.

Based on the SHAP analysis, question 3 is answered as follows:

Answer to Question 3: SHAP analysis revealed that features such as age, BMI, smoking status, and physical activity level were among the top contributors to predicting heart attack risk. The positive SHAP values associated with higher age and BMI indicated an increased risk of heart attacks, aligning with established clinical understanding. This analysis not only highlighted the model's reliance on clinically relevant features but also underscored the importance of interpretability in validating the model's predictions against medical knowledge.
By leveraging SHAP values, the model remains interpretable despite its complexity. This is crucial in a clinical setting, where understanding the reasoning behind a predictive model's decision can inform better patient outcomes and guide healthcare professionals in their decision-making process.
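To make the idea of signed, per-feature contributions concrete, the sketch below computes exact Shapley values for a single individual by enumerating every feature coalition. It is a toy illustration, not the SHAP library and not the model from this study: the linear risk score, its coefficients, the feature names, and the baseline (reference) values are all invented for the example, and features outside a coalition are simply set to their baseline value.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict


def shapley_values(
    f: Callable[[Dict[str, float]], float],
    x: Dict[str, float],
    baseline: Dict[str, float],
) -> Dict[str, float]:
    """Exact Shapley values of f at x, relative to a single baseline point."""
    names = list(x)
    n = len(names)
    phi: Dict[str, float] = {}
    for i in names:
        others = [k for k in names if k != i]
        total = 0.0
        for r in range(n):
            # Shapley weight of a coalition of size r that excludes feature i.
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                with_i = {k: x[k] if (k in subset or k == i) else baseline[k] for k in names}
                without_i = {k: x[k] if k in subset else baseline[k] for k in names}
                total += weight * (f(with_i) - f(without_i))
        phi[i] = total
    return phi


def toy_risk_score(v: Dict[str, float]) -> float:
    """Hypothetical log-odds score; the coefficients are made up for illustration."""
    return -3.0 + 0.04 * v["age"] + 0.05 * v["bmi"] + 0.8 * v["smoker"]


baseline = {"age": 50.0, "bmi": 27.0, "smoker": 0.2}  # assumed reference values
person = {"age": 35.0, "bmi": 23.0, "smoker": 0.0}    # a low-risk individual

phi = shapley_values(toy_risk_score, person, baseline)
print({k: round(v, 3) for k, v in phi.items()})
```

Every contribution is negative here, mirroring the low-risk case described above, and the contributions sum to the difference between the individual's score and the baseline score. Exhaustive enumeration grows exponentially with the number of features, which is why practical tools rely on model-specific or sampling-based approximations.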
Randomized Search Cross-Validation Optimization," Decision Analytics Journal, vol. 9, p. 100331, 2023, doi: 10.1016/j.dajour.2023.100331.
[18] R. C. Das, M. C. Das, M. A. Hossain, M. A. Rahman, M. H. Hossen, and R. Hasan, "Heart Disease Detection Using ML," in 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC), 8-11 March 2023, pp. 0983-0987, doi: 10.1109/CCWC57344.2023.10099294.
[19] D. Mehta, A. Naik, R. Kaul, P. Mehta, and P. J. Bide, "Death by heart failure prediction using ML algorithms," in 2021 4th Biennial International Conference on Nascent Technologies in Engineering (ICNTE), 15-16 Jan. 2021, pp. 1-5, doi: 10.1109/ICNTE51185.2021.9487652.
[20] V. Selvakumar, A. Achanta, and N. Sreeram, "Machine Learning based Chronic Disease (Heart Attack) Prediction," in 2023 International Conference on Innovative Data Communication Technologies and Application (ICIDCA), 14-16 March 2023, pp. 1-6, doi: 10.1109/ICIDCA56705.2023.10099566.
[21] M. S. Manoj, K. Madhuri, K. Anusha, and K. U. Sree, "Design and Analysis of Heart Attack Prediction System Using ML," in 2023 IEEE International Conference on Integrated Circuits and Communication Systems (ICICACS), 24-25 Feb. 2023, pp. 01-06, doi: 10.1109/ICICACS57338.2023.10099819.
[22] https://catalog.data.gov/dataset/cdc-behavioral-risk-factor-surveillance-system-brfss
[23] E. W. Ingwersen et al., "Machine learning versus logistic regression for the prediction of complications after pancreatoduodenectomy," Surgery, vol. 174, no. 3, pp. 435-440, 2023, doi: 10.1016/j.surg.2023.03.012.
[24] J. Jeppesen, J. Christensen, P. Johansen, and S. Beniczky, "Personalized seizure detection using logistic regression machine learning based on wearable ECG-monitoring device," Seizure: European Journal of Epilepsy, vol. 107, pp. 155-161, 2023, doi: 10.1016/j.seizure.2023.04.012.
[25] V.-H. Truong, S. Tangaramvong, and G. Papazafeiropoulos, "An efficient LightGBM-based differential evolution method for nonlinear inelastic truss optimization," Expert Systems with Applications, vol. 237, p. 121530, 2024, doi: 10.1016/j.eswa.2023.121530.
[26] G. V. D. Kumar, V. Deepa, N. Vineela, and G. Emmanuel, "Detection of Parkinson’s disease using LightGBM Classifier," in 2022 6th International Conference on Computing Methodologies and Communication (ICCMC), 29-31 March 2022, pp. 1292-1297, doi: 10.1109/ICCMC53470.2022.9753909.
[27] S. M. Ganie and P. K. Dutta Pramanik, "A comparative analysis of boosting algorithms for chronic liver disease prediction," Healthcare Analytics, p. 100313, 2024, doi: 10.1016/j.health.2024.100313.
[28] S. Rajeashwari and K. Arunesh, "Chronic disease prediction with deep convolution based modified extreme-random forest classifier," Biomedical Signal Processing and Control, vol. 87, p. 105425, 2024, doi: 10.1016/j.bspc.2023.105425.
[29] E. S. Mohamed, T. A. Naqishbandi, S. A. C. Bukhari, I. Rauf, V. Sawrikar, and A. Hussain, "A hybrid mental health prediction model using Support Vector Machine, Multilayer Perceptron, and Random Forest algorithms," Healthcare Analytics, vol. 3, p. 100185, 2023, doi: 10.1016/j.health.2023.100185.
[30] S. S. Bhat, M. Banu, G. A. Ansari, and V. Selvam, "A risk assessment and prediction framework for diabetes mellitus using machine learning algorithms," Healthcare Analytics, vol. 4, p. 100273, 2023, doi: 10.1016/j.health.2023.100273.
[31] F. M. Delpino, Â. K. Costa, S. R. Farias, A. D. P. Chiavegatto Filho, R. A. Arcêncio, and B. P. Nunes, "Machine learning for predicting chronic diseases: a systematic review," Public Health, vol. 205, pp. 14-25, 2022, doi: 10.1016/j.puhe.2022.01.007.
[32] M. M. Chowdhury, R. S. Ayon, and M. S. Hossain, "An investigation of machine learning algorithms and data augmentation techniques for diabetes diagnosis using class imbalanced BRFSS dataset," Healthcare Analytics, vol. 5, p. 100297, 2024, doi: 10.1016/j.health.2023.100297.
[33] C. A. Hassan et al., "Effectively Predicting the Presence of Coronary Heart Disease Using Machine Learning Classifiers," Sensors, vol. 22, no. 19, doi: 10.3390/s22197227.
[34] K.-V. Tompra, G. Papageorgiou, and C. Tjortjis, "Strategic Machine Learning Optimization for Cardiovascular Disease Prediction and High-Risk Patient Identification," Algorithms, vol. 17, no. 5, doi: 10.3390/a17050178.

Nacim YANES was born in Gabes, Tunisia in 1981. He received the Master degree in computer science applied to management from the Higher Institute of Management (ISG), Tunisia. He received his PhD in computer science from the National School of Computer Science (ENSI), University Manouba, Tunisia. He is an Assistant Professor in the Higher Institute of Management (ISGGB), University of Gabes, Tunisia. His current research interests include AI-based Healthcare recommender systems, Software Reuse, Recommenders Systems in Software Engineering, Serious Games and Gamification, and Outcome-based Education.

LEILA JAMEL received the Engineering degree in computer sciences and the Ph.D. degree in computer sciences and information systems. She was the Program Leader of the IS Program and the ABET and NCAAA Accreditation Committees, CCIS, Princess Nourah bint Abdulrahman University (PNU), Saudi Arabia. She was the HOD of Information Systems Security of the Premier Ministry of Tunisia. She is currently an Assistant Professor with the College of Computer and Information Sciences, PNU. She is a Researcher with the RIADI Laboratory, Tunisia. Her research interests include business process modeling, business process management/re-engineering and quality, context-awareness in business models, data sciences, ML, process mining, e-learning, and software engineering. She was a member of the Steering and Scientific Committees of the IEEE International Conference on Cloud Computing. She is a reviewer of many international journals and conferences.

Mohamed Ezz is an Associate Professor in Faculty of Engineering Al Azhar University and now is visiting Professor at College of Computer and Information Sciences, Jouf University. He received his B.Sc., M.Sc. and Ph.D. in Systems & Computers Engineering from Faculty of Engineering, Al Azhar University. He is IEEE member. His area of interest includes pattern recognition, Applied Machine Learning, application security, intrusion detection, and semantic web. He has published 40 scientific papers in various national and international journals and conferences. He has contributed in more than 16 mega software projects in Electronic banking EBPP, EMV, mobile banking and e-commerce, also CBAP Certified.

Ayman Mohamed Mostafa is an Associate Professor in Faculty of Computers and Informatics, Zagazig University, Egypt and now is an Assistant Professor at College of Computer and Information Sciences, Jouf University, Saudi Arabia. He received his MSc and PhD in Information Systems from Faculty of Computers and Informatics, Zagazig University, Egypt. He is IEEE member. His area of interest includes information security, cloud computing, E-business, E-commerce, big data, and data science. He has published more than 50 scientific papers in various national and international journals and conferences. He is also Oracle Certified Associate, Oracle Certified Professional, and EMC Academic Associate in Cloud Infrastructure and Services.