This study investigates the use of machine learning algorithms to improve stroke prediction by analyzing electronic health records and identifying key risk factors such as age, glucose levels, heart disease, and hypertension. Seven algorithms were implemented, with the Decision Tree and KNN achieving the highest accuracy of 96.3%. The research emphasizes the importance of timely diagnosis and the potential of machine learning to enhance predictive analytics in healthcare.

(IJACSA) International Journal of Advanced Computer Science and Applications,

Vol. 14, No. 4, 2023

Using Machine Learning Algorithm as a Method for Improving Stroke Prediction
Nojood Alageel, Rahaf Alharbi, Rehab Alharbi, Maryam Alsayil, Lubna A. Alharbi
Faculty of Computers and Information Technology
University of Tabuk
Tabuk, Saudi Arabia

Abstract—Sudden strokes have had a very negative impact on all aspects of society, to the point that they have attracted efforts to improve the management of stroke diagnosis. Technological advancement has also affected the medical field: caregivers now have better options for taking care of their patients by mining and archiving their medical records for ease of retrieval. Furthermore, it is essential to understand the risk factors that make a patient more susceptible to strokes, since some of these factors make stroke prediction much easier. This research offers an analysis of the factors that enhance the stroke prediction process based on electronic health records. The most important factors for stroke prediction are identified using statistical methods and Principal Component Analysis (PCA). It has been found that the most critical factors affecting stroke prediction are age, average glucose level, heart disease, and hypertension. Because the dataset for stroke occurrence is highly imbalanced, a balanced dataset created by sub-sampling is used for the model evaluation. In this study, seven different machine learning algorithms are implemented: Naïve Bayes, SVM, Random Forest, KNN, Decision Tree, Stacking, and majority voting, which are trained on the Kaggle dataset to predict the occurrence of stroke in patients. After preprocessing and splitting the dataset into training and testing sub-datasets, the proposed algorithms were evaluated according to accuracy, F1 score, recall, and precision. The NB classifier achieved the lowest accuracy (86%), whereas the rest of the algorithms achieved similar results: accuracy of 96%, F1 score of 0.98, precision of 0.97, and recall of 1.

Keywords—Stroke prediction; machine learning; PCA; decision tree; KNN; majority voting; Naïve Bayes

I. INTRODUCTION

Strokes, or cerebrovascular accidents, are considered among the top three causes of morbidity and mortality in many countries all over the world [1]. They account for around 10% of deaths worldwide, which makes stroke the second leading cause of death. As an estimate, approximately 700,000 individuals suffer from strokes each year, and by the year 2030 this number is expected to increase greatly and to cause a medical cost of 240 billion dollars in the US alone [2].

The World Health Organization (WHO) defines stroke as a brain-related illness that leads to the dysfunction of the brain, which could be focal, acute, or diffuse. This dysfunction is mainly a result of vessel problems and lasts for longer than 24 hours. There are many types of strokes depending on the exact origin of the dysfunction, which defines four main types: ischemic stroke, subarachnoid hemorrhage, cerebral venous sinus thrombosis, and intra-cerebral hemorrhage [3].

In general, brain strokes can be classified as either ischemic or hemorrhagic. Ischemic strokes are the predominant type and account for approximately 70% of the total stroke incidents [4]. Ischemic strokes occur as a result of clots in vessels, hypotensive vasoconstriction, arterial tears, or sickle cell anemia [5]. On the other hand, hemorrhagic strokes account for approximately 15% of the total incidents, yet their effects are usually more detrimental, as they often lead to serious morbidity and death [6]. Hemorrhagic strokes have many causes, among which are vascular malfunction and uncontrolled hypertension [7].

The risk factors behind the occurrence of strokes can be divided into two types: factors that can be modified, and factors that cannot be modified [8]. Some of the modifiable factors are hypercholesterolemia, diabetes, and hypertension. On the other hand, the non-modifiable factors include age, gender, and genetic factors [9].

The traditional stroke identification methods are usually magnetic resonance imaging (MRI) and computed tomography (CT) scans, which are expensive and invasive [10]. However, since stroke is a very time-sensitive condition, dealing with it in a timely and efficient manner is very important, because in most cases death or permanent damage from stroke can be prevented if the diagnosis happens early on [11], [12]. Therefore, it is essential to develop medical tools and devices that allow physicians to diagnose a stroke without being invasive or uncomfortable, for example by relying on biomarkers or by studying the risk factors. Machine learning is a promising tool for predicting whether a stroke may occur based on different factors. Machine learning is capable of diagnosing, treating, and predicting disease through analyzing clinical data.

In this research, the aim is to develop and implement a machine learning-based system for the accurate prediction of future occurrence of stroke in patients based on several features including age, gender, BMI, and medical history. The primary objective is to get this system to predict the occurrence of stroke with 100% accuracy so that lives can be saved. The


contributions that are provided in this report can be listed as follows:

• A predictive analytics approach to predict stroke recurrence is suggested.

• Machine learning and neural network algorithms are implemented.

• A publicly available dataset of electronic health records is used.

• Subsampling techniques are followed for balancing the dataset.

• Dimensionality reduction techniques are implemented in analyzing the attributes.

• The most impactful features for predicting strokes are picked out and shown.

Thus, the added value of this paper lies in the fact that it uses simple algorithms to achieve high accuracies with explainable results, instead of using complex algorithms. More precisely, the majority of the chosen algorithms were able to score similarly high results.

The rest of the paper is organized as follows: Section II is the literature review, where some studies are mentioned with their respective results. Section III describes the details of the methodology followed in this study. Section IV shows the results that were obtained by the proposed model. Finally, Section V concludes the paper.

II. LITERATURE REVIEW

Since technologies like machine learning and deep learning can greatly benefit the medical sector by increasing the accuracy of stroke prediction, many studies were conducted to explore how exactly machine learning models can be used in predicting strokes. In this section, a group of similar studies that relied on freely available datasets, such as Kaggle, and on datasets from local hospitals or labs were selected.

Dritsas and Trigka [13] gathered data from Kaggle covering 3254 participants. The dataset consists of 10 independent features such as age, BMI value, glucose level, smoking status, hypertension, and whether the individual had contracted a stroke before. Data preprocessing was performed on the dataset, and class balancing was implemented through a resampling method known as SMOTE. Machine learning models, namely Stacking, Decision Tree, Random Forest, Majority Voting, Naïve Bayes, Multilayer Perceptron, KNN, Stochastic Gradient Descent, and Logistic Regression, were used for predicting stroke or no-stroke. The results show that the stacking classifier performed best, achieving an AUC of 0.989 with 0.974 precision and 0.974 recall. The other high-performing models were Random Forest, KNN, and Majority Voting.

Rakshit and Shrestha [14] also relied on the Kaggle dataset and on some of the same algorithms as [13], namely Decision Tree, Naïve Bayes, Support Vector Machine, Random Forest, K-Nearest Neighbor, and Logistic Regression. According to their results, the best performance was recorded by Decision Tree followed by KNN (96.3%).

Using the Kaggle dataset, Sailasya and Kumari [15] discussed the prediction of stroke based on machine learning algorithms, namely Logistic Regression, K-Nearest Neighbour, Random Forest, Support Vector Machine, Naïve Bayes, and Decision Tree. An undersampling method was used to handle the imbalanced data. The results showed that among these algorithms, Naïve Bayes had the best performance with 82% overall accuracy, compared to 80% for both K-NN and Support Vector Machine, and 78% for Logistic Regression.

Emon et al. [16] used information on 5110 patients collected from a medical clinic in Bangladesh. Then, ten different machine learning classifiers, which are ANN, MLP, K Neighbours algorithm, SGD, QDA, AdaBoost, Gaussian, QDA, GBC, and XGB, were used. The weighted voting classifier offered the highest accuracy of about 97%, with the GBC and XGB classifiers achieving 96% accuracy, just ahead of the AdaBoost classifier, which scored 94% accuracy. On the other hand, the lowest accuracy was recorded by the SGD classifier with a value of 65%.

Shoily et al. [17] used KNN, Naïve Bayes, J48, and Random Forest classifiers. They gathered data from multiple sources to create their dataset of 1058 individuals overall and took a total of 28 features. The authors performed integer encoding to make the data suitable for WEKA processing. After that, feature selection took place, and the models were trained and tested, then evaluated according to F1 score, accuracy, precision, and recall. Random Forest, KNN, and J48 achieved the same results: 0.998 accuracy, 0.998 F1 score, 0.998 precision, and 0.98 recall, whereas Naïve Bayes achieved 0.856 accuracy and 0.861 F1 score.

Abedi et al. [18] created a dataset termed "GNSIS", which is a collection of electronic health records from 2003 to 2019. Data preprocessing was performed, and the individuals within the dataset were classified into six groups totaling 2091 individuals: one group consists of those who did not contract a stroke in the last 5 years, and the other five groups are of stroke patients. After that, the dataset was split into training and testing sets by an 80 to 20 ratio, and data imputation was also done. The dataset contained 53 features including BMI, diastolic blood pressure, creatinine, and smoking status. Then, four feature selection sets were created with exclusion of some features at times, and six machine learning models were used in each of the 5 recurrence prediction windows, which makes 24 models in total. For the 1-year prediction window, Random Forest achieved the best results with 90% accuracy, whereas the average accuracy of all models was 88%. The average accuracy achieved in the 5-year prediction window was 78%; thus the wider prediction window results in less accurate performance.

Relying on electronic health records, Nwosu et al. [19] used a dataset published by McKinsey & Company, containing 11 different attributes including body mass index, heart disease, marital status, age, average blood glucose, and smoking status. In the dataset, 548 patients suffered from stroke whereas 28524 patients did not suffer from any previous strokes, thus the


dataset needed downsizing. In fact, 1000 downsizing experiments were done to avoid sampling bias. After that, 70% of the dataset was selected for training and 30% for testing. Over the 1000 experiments, the Neural Network model achieved the best accuracy of 75.02%, followed by Random Forest at 74.53% accuracy and Decision Tree at 74.31%.

In [13], the dataset was large and their study was able to score very similar results to ours, even though at times our metrics were better. However, they did not mention the scored accuracy. Similarly, our proposed model achieved better performance than [14].

It is noteworthy that the proposed method in this study achieved 96.7% accuracy, which is significantly higher than the accuracy of [15] (80%).

In [16], the authors chose complex algorithms such as AdaBoost and XGB and were only able to achieve results similar to ours, whereas we achieved the high performance using much simpler algorithms, which is more desirable.

In [17], the study relied on 28 inputs to predict stroke occurrence, which are usually difficult to obtain from patients for a quick prediction. Conversely, the algorithms in the system proposed in this paper relied on only 9 factors as input. In addition, [17] used a much smaller dataset.

Similarly, [18] used a very high number of inputs, which is not desirable for ML algorithms.

III. METHODOLOGY

Machine learning permits the advancement of a system by making it capable of learning and improving from past experiences without the need for continuous programming. Through machine learning, these systems learn how to analyze data to identify patterns that help them make decisions in the future without the help of humans.

The real influence of machine learning becomes clear in the fields that deal with a huge amount of data, such as retail, health, government, finance, and transportation. This is mainly due to the decision-making capabilities of machine learning, since it can understand the data and fit them into different models such that humans can rely on them for decisions. Machine learning models are efficiently used for identifying diseases and computing risk stratification in the healthcare sector. These are only a few examples of the capabilities of machine learning.

Nonetheless, real-life data cannot simply be processed directly by the selected machine learning algorithms, which is why data preprocessing is an essential step before applying the ML models. After that, the available dataset must be divided into training and testing datasets. The training step is performed in order to teach the algorithm about the data. In addition, ML algorithms can make predictions for unknown data, and the prediction results are then checked against each other.

This study is dedicated to implementing machine learning algorithms for stroke prediction, since it is a dangerous and common disease. Machine learning is often suitable for datasets due to its simplicity, structure, and compatibility with a wide range of machine learning platforms and tools. For this reason, machine learning algorithms were chosen in this study. However, the limitation of this method is that it requires many inputs for the model to be able to make predictions. It is possible that when predicting a person's status, not all inputs are available, and then the model will not be able to predict. This issue was avoided since the chosen dataset was large.

In general, a wide set of attributes is used to predict strokes, such as gender, age, and blood pressure data, among many others. Additionally, the performance of a number of machine learning algorithms was examined to see which one is best suited for predicting stroke incidence based on the dataset. Ultimately, the chosen ML algorithm must give the predictions with the highest accuracy.

A. Implementation

In this section, the machine learning algorithms that will be implemented and put to the test are presented and described.

1) Naive bayes: When the features are highly independent, the Naïve Bayes (NB) algorithm can lead to probability maximization [20]. A subject with feature vector $\mathbf{f} = (f_1, \ldots, f_n)$ is assigned to the class $c$ for which $P(c \mid f_1, \ldots, f_n)$ is maximized. The formula that defines the conditional probability is as in (1):

$$P(c \mid f_1, \ldots, f_n) = \frac{P(f_1, \ldots, f_n \mid c)\, P(c)}{P(f_1, \ldots, f_n)} \qquad (1)$$

In (1), $P(f_1, \ldots, f_n \mid c)$ represents the probability of the features given the class, whereas the prior probability of the features is represented by $P(f_1, \ldots, f_n)$, and the prior class probability is represented by $P(c)$. Since the denominator of (1) does not depend on $c$, maximizing the numerator of (1) also maximizes the posterior, and the optimization becomes as in (2), where the independence assumption gives $P(f_1, \ldots, f_n \mid c) = \prod_{j=1}^{n} P(f_j \mid c)$:

$$\hat{c} = \arg\max_{c}\; P(c) \prod_{j=1}^{n} P(f_j \mid c) \qquad (2)$$

2) Random forest: There are multiple decision trees in a Random Forest (RF) classifier [21]. These independent trees are combined in an ensemble, and resampling is used to create the subsets of instances that are used for classification and regression. In a random forest, the final output is a result of majority voting, since each independent tree generates its own classification outcome.

3) K-Nearest neighbors: The K-nearest neighbors (KNN) classifier depends on Manhattan or Euclidean distances to evaluate similarities or differences between instances in the dataset [22]. More often than not, the Euclidean distance is the metric of choice in KNN classifiers. In stroke prediction, the feature vector of a new sample would be $f_{new}$. The closest K vectors (neighbors) to $f_{new}$ are determined through KNN. After that, $f_{new}$ is assigned to the class to which most of its neighbors belong.

4) Decision tree: In the proposed Decision Tree model [23], J48 as the single classifier and RepTree [24] as the base classifier were chosen. The classes are denoted by the leaf nodes, whereas the features are denoted by the internal nodes. The Gini index technique is employed by the J48 classifier in order to split a single feature


at each node. RepTree is a fast and simple decision tree learner that builds a DT using information gain as an impurity measure and prunes it via reduced-error pruning.

5) Majority voting: Soft or hard voting is implemented through simple majority voting, assuming an ensemble of K base models. This method allows the prediction of the class label associated with an instance [25]. Hard voting collects the votes related to each class label and chooses the one with the most votes as the output, that is, the candidate class. On the other hand, soft voting collects the predicted probabilities for every class label, and the class label with the largest probability is predicted. In the proposed model, hard voting is adopted. The general function of hard voting is represented by (3):

$$\hat{c} = \arg\max_{c} \sum_{k=1}^{K} P_{k,c} \qquad (3)$$

where $P_{k,c}$ is the prediction or probability of the k-th model for class $c$, and $c \in \{\text{Stroke}, \text{Non-Stroke}\}$.

6) Stacking: Stacking is an ensemble learning technique in which the predictions of multiple heterogeneous classifiers are integrated within a meta-classifier. Usually, the training set is used for training the base models, whereas the outputs of the base models are used to train the meta-classifier. Here, J48, RF, NB, and RepTree were chosen to be included in the stacking ensemble classifier. The predictions of these classifiers are used for training a logistic regression meta-classifier.

The influence of machine learning parameters on the performance of a model can vary depending on the specific algorithm used, the dataset being analyzed, and the problem being solved. However, in general, adjusting the values of these parameters can have a significant impact on the accuracy and speed of a machine learning model. In this study, several parameters for the different algorithms were modified to make sure better results are achieved. The modifications to the parameters of each algorithm are shown in Table I.

TABLE I. THE CHANGED PARAMETERS FOR EACH ALGORITHM

Algorithm | Specific Parameters
KNN | Number of neighbors (k): 6; Distance metric: Euclidean distance
SVM | Kernel type: radial basis function (RBF), the default kernel; Regularization parameter (C): 1.0, the default value
DT | Tree depth: 3
NB | No modifications
RF | Number of trees in the forest: 100, the default value; Maximum depth of each tree: 9

B. Pre-Processing

1) Dataset description: For this study, the dataset of choice was adopted from Kaggle. The dataset comprises a large number of participants, of which only those above 18 years old are chosen, making the total number of participants 3254. The 9 input attributes (most of which are nominal) as well as the target class are briefly described in Table II.

TABLE II. DESCRIPTION OF THE ATTRIBUTES/FEATURES IN THE DATASET

Risk factor | Description | Details
Age (year) | The actual age of participants | All of the participants are older than 18
Gender | Whether the participant is male or female | In the dataset, 1260 participants are male, and 1994 participants are female
Hypertension | The participant suffering from hypertension or not | 12.54% of the participants in the dataset are hypertensive
Heart disease | The participant suffering from heart diseases in general or not | 6.33% of the participants in the dataset suffer from heart diseases
Marital status | The participant is married or not | In the dataset, 79.84% of the participants are married
Work type | The work status of the participants | 65.02% of them work in the private sector, 19.21% are self-employed, 15.67% have a job, while 0.1% have never worked
Residence type | Whether the participant lives in an urban or rural place | 51.14% of the participants in the dataset live in an urban place, whereas the rest live in rural places
Avg glucose level (mg/dL) | The average level of a participant's blood glucose | Numerical values for each patient
BMI (Kg/m2) | Body mass index of the participants | Numerical values for each patient
Smoking status | Whether a participant currently smokes or not | 22.37% of the participants smoke, 24.99% of them have smoked in the past, and 52.64% of them have never smoked
Stroke history | Whether the participant has had a stroke previously or not | 5.53% of the participants have previously had a stroke

2) Data pre-processing: If the data were kept in their raw form, it might negatively affect the quality of the predictions, which is why data preprocessing is essential. In the raw data, there might be some missing values and redundancy as well as noisy data, so tasks like data discretization and reduction of redundant values are performed. Furthermore, one of the data pre-processing tasks is to balance the classes by selecting one of the available resampling techniques. In the proposed workflow, the SMOTE technique is used so that the participants can be distributed over the stroke and non-stroke classes in a balanced way. In more detail, the minority class, which consists of the stroke participants, was oversampled to increase the number of participants in this class. In addition, there were no missing or null values, so neither dropping nor data imputation was applied.
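The oversampling step described above can be illustrated with a minimal sketch of the interpolation idea behind SMOTE. This is not the implementation used in the study: the function name `smote_like_oversample`, the toy (age, average glucose level) values, and the class sizes are all hypothetical, and a production workflow would more likely rely on an existing library implementation.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class (Euclidean)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical toy data: (age, avg_glucose_level) for the minority (stroke) class.
stroke = [(67.0, 228.7), (80.0, 105.9), (74.0, 70.1), (69.0, 94.4)]
n_majority = 10  # hypothetical size of the non-stroke class
new_points = smote_like_oversample(stroke, n_new=n_majority - len(stroke))
balanced_minority = stroke + new_points
print(len(balanced_minority))  # prints 10
```

Each synthetic point lies on the line segment between two real minority samples, which is why the technique adds new minority examples without simply duplicating existing records.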

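As a small illustration of the hard-voting rule adopted in Section III-A, the sketch below counts one vote per base model and returns the class with the most votes. The function name `hard_vote`, the example votes, and the tie-breaking rule are assumptions made for this sketch and are not taken from the study.

```python
from collections import Counter

CLASSES = ("Stroke", "Non-Stroke")

def hard_vote(predictions):
    """Return the class label predicted by the most base models.

    `predictions` holds one label per base model, so counting labels
    corresponds to summing P_{k,c} over k with P_{k,c} in {0, 1}.
    """
    counts = Counter(predictions)
    # Ties are broken by the order of CLASSES (an assumption of this sketch).
    return max(CLASSES, key=lambda c: counts[c])

# Hypothetical votes from three base models for one participant.
votes = ["Stroke", "Non-Stroke", "Stroke"]
print(hard_vote(votes))  # prints Stroke
```

Soft voting would instead sum the predicted class probabilities of the base models before taking the maximum.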

C. Proposed Workflow

The details of the proposed approach and methodology can be summed up in a workflow chart presented in Fig. 1.

Fig. 1. Workflow of the proposed model.

Initially, the Kaggle dataset with 3254 participants is acquired. Then, the data is visualized to determine the specifics, such as the columns and the relevant attributes. In this stage, the distribution of the participants can be visualized over the different features, such as the age and gender distribution. After that, data preprocessing takes place, where the data is prepared through reduction of redundant information or resampling. In this approach, the SMOTE technique is selected. Later, the data is split into 80% for training and 20% for testing. Six different algorithms were selected to perform the predictions: Naive Bayes, Random Forest, K-Nearest Neighbors, Decision Tree, Majority Voting, and Stacking. These algorithms are then evaluated according to the evaluation metrics.

D. Evaluation

A group of performance metrics was chosen to evaluate the performance of the chosen machine learning methods. The most commonly used metrics in general are also used in this study [see (4), (5), (6) and (7)]. Sensitivity, which is also termed Recall, represents the proportion of participants who have had a stroke that were successfully classified into the stroke class. Precision, on the other hand, specifies how many of those predicted to have had a stroke actually belong to this class. F-measure is the harmonic mean of precision and recall and sums up the predictive performance of a model.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (5)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (6)$$

$$\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (7)$$

where true positive is designated by TP, false negative by FN, false positive by FP, and true negative by TN.

On the other hand, Area under curve (AUC) is also a beneficial metric, where the values must be between 0 and 1, such that the higher the AUC value, the better the performance. If the model can discriminate between the instances of two classes perfectly, then AUC would be 1. Conversely, an AUC of 0.5 corresponds to a model that cannot distinguish between the classes, and an AUC of 0 means that every instance is ranked incorrectly.

IV. RESULTS

A. Data Visualization

The dataset can be visualized such that each of the features or attributes is analyzed separately and against the others. Fig. 2, for instance, illustrates how the participants from the dataset are distributed according to age and gender. It can be seen that the patients have an average age of 41 years, and that there are slightly more females than males; specifically, 56% of the participants are female.

Fig. 2. Distribution of data by age and gender.

On the other hand, Fig. 3 shows how the patients who had suffered from a previous stroke are distributed according to age, where it becomes clear that approximately all of them were older than 40 years, and the largest number of stroke patients were 80 years old. The patients who did not suffer from stroke were distributed among the different age groups.

Fig. 3. Distribution of patients who suffered from a stroke and those who didn't according to age.

In addition, Fig. 4 shows that the majority of the participants did not suffer from any heart diseases, nor did they suffer from hypertension.

Fig. 4. Distribution of data over heart disease and hypertension cases.

Moreover, 25% of the patients were obese, and 18% of the participants were overweight according to Fig. 5.


Fig. 5. Distribution of data according to BMI.

Fig. 6 depicts that the majority of the patients were smokers, followed by a large group of participants with unknown smoking status (1544 participants).

Fig. 6. Distribution of data according to smoking status and relation to stroke.

Additionally, the majority of the participants are employed in the private sector. Meanwhile, the data was almost equally distributed between living in rural and urban areas, as depicted in Fig. 7.

Fig. 7. Distribution of data according to work type and residency.

Furthermore, the participants in the dataset scored mostly healthy levels of blood glucose (below 100) as shown in Fig. 8.

Fig. 8. Distribution of data according to average glucose level.

B. Model Evaluation

After acquiring the data, preprocessing it, and visualizing it, it was used to train and test several classifiers whose role was to predict whether a stroke occurs to a patient or not. The evaluation results for each classifier are presented in Table III.

TABLE III. EVALUATION OF THE DIFFERENT CLASSIFIERS IN TERMS OF ACCURACY, F1 SCORE, RECALL, AND PRECISION

Algorithm | Accuracy | F-1 Score | Recall | Precision
KNN | 0.9633 | 0.98 | 1.00 | 0.97
SVM | 0.9674 | 0.98 | 1.00 | 0.97
Decision Tree | 0.9674 | 0.98 | 1.00 | 0.97
Gaussian NB | 0.8655 | 0.93 | 0.89 | 0.97
Random Forest | 0.96741 | 0.98 | 1.00 | 0.97
Voting Classifier | 0.9674 | 0.98 | 1.00 | 0.97
Stacking Classifier | 0.96741 | 0.98 | 1.00 | 0.97

In addition, these evaluation metrics can be seen in Fig. 9, which illustrates that all of the proposed algorithms in this study have similar results in predicting the stroke occurrence in patients, except for Naïve Bayes, which clearly has the worst performance among these classifiers.

Fig. 9. Evaluation of different algorithms according to several evaluation metrics.

However, taking into consideration that the stacking classifier is an ensemble model, it can be said that choosing the stacking algorithm might enhance the prediction results in the case of stroke.
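The metrics reported in Table III follow from confusion-matrix counts through the formulas in Section III-D. The sketch below shows that computation and checks that the reported F1 scores agree with the reported precision and recall values. The helper names `f1_from` and `classification_metrics` and the confusion-matrix counts are hypothetical; only the precision and recall pairs are taken from Table III.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }

# Hypothetical counts, used only to show the calculation.
example = classification_metrics(tp=40, fp=10, fn=5, tn=45)
print(example["accuracy"], example["precision"])  # prints 0.85 0.8

# Precision/recall pairs as reported in Table III.
print(round(f1_from(0.97, 1.00), 2))  # prints 0.98 (all classifiers except NB)
print(round(f1_from(0.97, 0.89), 2))  # prints 0.93 (Gaussian NB)
```

Both rounded values match the F-1 Score column of Table III, so the reported F1 scores are consistent with the reported precision and recall.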


V. CONCLUSION

Stroke is among the leading medical emergencies that result in death, and even when patients survive, it can leave serious consequences for their lives. A patient who survives a stroke may suffer paralysis, among many other lifelong complications. Because several known risk factors increase the likelihood of stroke, predicting it in advance is possible. Machine learning algorithms have been employed for this purpose, promising fast and efficient prediction.

This study aimed to develop a system that predicts stroke occurrence with high accuracy from several risk factors collected about patients. Seven machine learning algorithms were implemented and compared: Naïve Bayes, SVM, Random Forest, KNN, Decision Tree, Stacking, and majority voting. The best algorithm was then chosen on the basis of the evaluation results.

After appropriate preparation, the data was divided into training and testing parts so that every proposed algorithm could be tested for its ability to predict stroke occurrence. The evaluation metrics were accuracy, F1 score, recall, and precision. The results showed that the selected algorithms perform well in predicting stroke: SVM, DT, RF, KNN, Voting, and the stacking classifier scored almost the same values, namely 96% accuracy, an F1 score of 0.98, a recall of 1, and a precision of 0.97. The Naïve Bayes algorithm, however, might not be the best choice for a stroke prediction model, since it scored lower accuracy (86%), a lower F1 score (0.93), and lower recall (0.89), though the same precision (0.97).
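
The evaluation metrics named in the conclusion are related to one another, so the reported F1 scores can be cross-checked against the reported precision and recall. The sketch below is an illustration, not the authors' evaluation code. The function names are hypothetical, and the confusion-matrix counts in the usage example are made up solely to show the calculation.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from_precision_recall(precision, recall),
    }


# Rounded precision and recall as reported in the conclusion.
best_f1 = f1_from_precision_recall(0.97, 1.00)  # SVM, DT, RF, KNN, Voting, Stacking
nb_f1 = f1_from_precision_recall(0.97, 0.89)    # Naive Bayes

print(round(best_f1, 2))  # 0.98
print(round(nb_f1, 2))    # 0.93

# Hypothetical counts (not taken from the study) to show the full calculation.
example = metrics_from_counts(tp=8, fp=2, fn=2, tn=88)
print(example)
```

Both recomputed values agree with the reported F1 scores of 0.98 and 0.93, so the reported precision, recall, and F1 figures are internally consistent.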
REFERENCES [19] C. S. Nwosu, S. Dev, P. Bhardwaj, B. Veeravalli, and D. John,
"Predicting stroke from electronic health records," in 2019 41st Annual
[1] V. L. Feigin, C. M. M. Lawes, D. A. Bennett, and C. S. Anderson,
International Conference of the IEEE Engineering in Medicine and
"Stroke epidemiology: A review of population-based studies of
Biology Society (EMBC), Berlin, Germany, 2019, pp. 5704-5707.
incidence, prevalence, and case-fatality in the late 20th century," The
Lancet Neurology, vol. 2, pp. 43-53, 2003. [20] D. Berrar, "Bayes’ theorem and naive bayes classifier," in Encyclopedia
of Bioinformatics and Computational Biology, S. Ranganathan, M.
[2] B. Ovbiagele, L. B. Goldstein, R. T. Higashida, V. J. Howard, S. C.
Gribskov, K. Nakai, and C. Schönbach, Eds. ed Oxford: Academic
Johnston, O. A. Khavjou, et al., "Forecasting the future of stroke in the
Press, 2019, pp. 403-412.
United States," Stroke, vol. 44, pp. 2361-2375, 2013.
[3] World Health Organization. Noncommunicable Diseases and Mental [21] S. Alexiou, E. Dritsas, O. Kocsis, K. Moustakas, and N. Fakotakis, "An
Health Cluster, "WHO STEPS stroke manual : the WHO STEPwise approach for personalized continuous glucose prediction with regression
approach to stroke surveillance," World Health Organization, Geneva, trees," in 2021 6th South-East Europe Design Automation, Computer
2005. Engineering, Computer Networks and Social Media Conference
(SEEDA-CECNSM), Preveza, Greece, 2021, pp. 1-6.
[4] B. C. V. Campbell, D. A. De Silva, M. R. Macleod, S. B. Coutts, L. H.
[22] P. Cunningham and S. J. Delany, "K-Nearest neighbour classifiers - a
Schwamm, S. M. Davis, et al., "Ischaemic stroke," Nature Reviews
tutorial," ACM Computing Surveys, vol. 54, p. Article 128, 2021.
Disease Primers, vol. 5, p. 70, 2019.
[23] M. B. A. Snousy, H. M. El-Deeb, K. Badran, and I. A. A. Khlil, "Suite
[5] R. V. Krishnamurthi, V. L. Feigin, M. H. Forouzanfar, G. A. Mensah,
of decision tree-based classification algorithms on cancer gene
M. Connor, D. A. Bennett, et al., "Global and regional burden of first-
expression data," Egyptian Informatics Journal, vol. 12, pp. 73-82, 2011.
ever ischaemic and haemorrhagic stroke during 1990–2010: findings
from the Global Burden of Disease Study 2010," The Lancet Global [24] K. G. Dinesh, K. Arumugaraj, K. D. Santhosh, and V. Mareeswari,
Health, vol. 1, pp. e259-e281, 2013. "Prediction of cardiovascular disease using machine learning
algorithms," in 2018 International Conference on Current Trends
[6] S. Kamalakannan, A. S. V. Gudlavalleti, V. S. M. Gudlavalleti, S.
towards Converging Technologies (ICCTCT), Coimbatore, India, 2018,
Goenka, and H. Kuper, "Incidence & prevalence of stroke in India: A
pp. 1-7.
systematic review," Indian Journal of Medical Research, vol. 146, pp.
175-185, 2017. [25] A. Dogan and D. Birant, "A weighted majority voting ensemble
approach for classification," in 2019 4th International Conference on
[7] E. S. Donkor, "Stroke in the 21st century: A snapshot of the burden,
Computer Science and Engineering (UBMK), Samsun, Turkey, 2019,
epidemiology, and quality of life," Stroke Research and Treatment, vol.
pp. 1-6.
2018, p. 3238165, 2018.
