Abstract—Sudden strokes have a severe negative impact on all aspects of society, which has attracted efforts to improve the diagnosis and management of stroke. Technological advancement has also affected the medical field: caregivers now have better options for taking care of their patients by mining and archiving medical records for ease of retrieval. Furthermore, it is essential to understand the risk factors that make a patient more susceptible to strokes, since some of these factors make stroke prediction much easier. This research offers an analysis of the factors that enhance the stroke prediction process based on electronic health records. The most important factors for stroke prediction are identified using statistical methods and Principal Component Analysis (PCA). It was found that the most critical factors affecting stroke prediction are age, average glucose level, heart disease, and hypertension. Because the dataset for stroke occurrence is highly imbalanced, a balanced dataset created by sub-sampling is used for model evaluation. In this study, seven different machine learning algorithms are implemented: Naïve Bayes (NB), SVM, Random Forest, KNN, Decision Tree, Stacking, and majority voting. They are trained on the Kaggle dataset to predict the occurrence of stroke in patients. After preprocessing and splitting the dataset into training and testing subsets, the proposed algorithms were evaluated according to accuracy, F1 score, recall, and precision. The NB classifier achieved the lowest accuracy (86%), whereas the rest of the algorithms achieved similar results: an accuracy of 96%, an F1 score of 0.98, a precision of 0.97, and a recall of 1.

Keywords—Stroke prediction; machine learning; PCA; decision tree; KNN; majority voting; Naïve Bayes

I. INTRODUCTION

Strokes, or cerebrovascular accidents, are considered among the top three causes of morbidity and mortality in many countries all over the world [1]. They account for around 10% of deaths worldwide, which makes stroke the second leading cause of death. Approximately 700,000 individuals are estimated to suffer from strokes each year, and by the year 2030 this number is expected to increase greatly and to cause medical costs of 240 billion dollars in the US alone [2].

The World Health Organization (WHO) defines stroke as a brain-related illness that leads to dysfunction of the brain, which can be focal, acute, or diffuse. This dysfunction is mainly a result of vessel problems and lasts for longer than 24 hours. There are many types of strokes, depending on the exact origin of the dysfunction, which defines four main types: ischemic stroke, subarachnoid hemorrhage, cerebral venous sinus thrombosis, and intra-cerebral hemorrhage [3].

In general, brain strokes can be classified as either ischemic or hemorrhagic. Ischemic strokes are the predominant type and account for approximately 70% of all stroke incidents [4]. Ischemic strokes occur as a result of clots in vessels, hypotensive vasoconstriction, arterial tears, or sickle cell anemia [5]. Hemorrhagic strokes, on the other hand, account for approximately 15% of all incidents, yet their effects are usually more detrimental, as they often lead to serious morbidity and death [6]. Hemorrhagic strokes have many causes, among which are vascular malfunction and uncontrolled hypertension [7].

The risk factors behind the occurrence of strokes can be divided into two types: factors that can be modified and factors that cannot be modified [8]. Modifiable factors include hypercholesterolemia, diabetes, and hypertension, whereas non-modifiable factors include age, gender, and genetic factors [9].

The traditional stroke identification methods are usually magnetic resonance imaging (MRI) and computed tomography (CT) scans, which are expensive and invasive [10]. However, stroke occurrence is very time-sensitive, and dealing with it in a timely and efficient manner is very important because, in most cases, death or permanent damage from stroke can be prevented if the diagnosis happens early [11], [12]. Therefore, it is essential to develop medical tools and devices that allow physicians to diagnose a stroke without being invasive or uncomfortable, for example by relying on biomarkers or by studying the risk factors. Machine learning is a promising tool for predicting whether a stroke is likely to occur based on different factors, since it can support the diagnosis, treatment, and prediction of disease by analyzing clinical data.

The aim of this research is to develop and implement a machine learning-based system for the accurate prediction of the future occurrence of stroke in patients based on several features, including age, gender, BMI, and medical history. The primary objective is to get this system to predict the occurrence of stroke with 100% accuracy so that lives can be saved. The
www.ijacsa.thesai.org
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 14, No. 4, 2023
contributions that are provided in this report can be listed as follows:

• A predictive analytics approach to predict stroke recurrence is suggested.
• Machine learning and neural network algorithms are implemented.
• A publicly available dataset of electronic health records is used.
• Subsampling techniques are followed for balancing the dataset.
• Dimensionality reduction techniques are implemented in analyzing the attributes.
• The most impactful features for predicting strokes are picked out and shown.

Given these contributions, the added value of this paper lies in the fact that it uses simple algorithms, instead of complex ones, to achieve high accuracies with explainable results. More precisely, the majority of the chosen algorithms were able to score similarly high results.

The rest of the paper is organized as follows: Section II is the literature review, where related studies are presented with their respective results. Section III describes the details of the methodology followed in this study. Section IV shows the results that were obtained by the proposed model. Finally, Section V concludes the paper.

II. LITERATURE REVIEW

Since technologies like machine learning and deep learning can greatly benefit the medical sector by increasing the accuracy of stroke prediction, many studies have been conducted to explore how exactly machine learning models can be used in predicting strokes. In this section, a group of similar studies was selected that relied on freely available datasets, such as those from Kaggle, and on datasets from local hospitals or labs.

Dritsas and Trigka [13] gathered data from Kaggle comprising 3254 participants. The dataset consists of 10 independent features such as age, BMI, glucose level, smoking status, hypertension, and whether the individual had previously had a stroke. Data preprocessing was performed on the dataset, and class balancing was implemented through a resampling method known as SMOTE. The machine learning models Stacking, Decision Tree, Random Forest, Majority Voting, Naïve Bayes, Multilayer Perceptron, KNN, Stochastic Gradient Descent, and Logistic Regression were used for predicting stroke or no-stroke. The results show that the stacking classifier performed best, achieving an AUC of 0.989 with 0.974 precision and 0.974 recall. The other high-performing models were Random Forest, KNN, and Majority Voting.

Rakshit [14] also relied on the Kaggle dataset and on some of the same algorithms as [13], namely Decision Tree, Naïve Bayes, Support Vector Machine, Random Forest, K-Nearest Neighbor, and Logistic Regression. According to their results, the best performance was recorded by Decision Tree followed by KNN (96.3%).

Using the Kaggle dataset, Sailasya and [15] discussed the prediction of stroke based on machine learning algorithms, namely Logistic Regression, K-Nearest Neighbour, Random Forest, Support Vector Machine, Naïve Bayes, and Decision Tree. An undersampling method was used to handle the imbalanced data. The results showed that, among these algorithms, Naïve Bayes had the best performance with 82% overall accuracy, compared to 80% for both K-NN and Support Vector Machine, and 78% for Logistic Regression.

Emon et al. [16] used information on 5110 patients collected from a medical clinic in Bangladesh. Then, ten different machine learning classifiers, which are ANN, MLP, K Neighbours algorithm, SGD, QDA, AdaBoost, Gaussian, QDA, GBC, and XGB, were used. The weighted voting classifier offered the highest accuracy of about 97%, followed by the GBC and XGB classifiers at 96% accuracy and the AdaBoost classifier at 94% accuracy. The lowest accuracy, 65%, was recorded by the SGD classifier.

Shoily et al. [17] used KNN, Naïve Bayes, J48, and Random Forest classifiers. They gathered data from multiple sources to create a dataset of 1058 individuals with a total of 28 features. The authors performed integer encoding to make the data suitable for processing by the machine learning algorithms in WEKA. After that, feature selection took place, and the models were trained, tested, and then evaluated according to F1 score, accuracy, precision, and recall. Random Forest, KNN, and J48 achieved the same results: 0.998 accuracy, 0.998 F1 score, 0.998 precision, and 0.98 recall, whereas Naïve Bayes achieved 0.856 accuracy and a 0.861 F1 score.

Abedi et al. [18] created a dataset termed "GNSIS", which is a collection of electronic health records from 2003 to 2019. Data preprocessing was performed, and the individuals within the dataset were classified into six groups totaling 2091 individuals: one group consists of those who did not have a stroke in the last 5 years, and the other five groups consist of stroke patients. After that, the dataset was split into training and testing sets at an 80 to 20 ratio, and data imputation was also done. The dataset contained 53 features, including BMI, diastolic blood pressure, creatinine, and smoking status. Then, four feature selection sets were created, some of which excluded certain features, and six machine learning models were used in each of the five recurrence prediction windows, which makes 24 models in total. For the 1-year prediction window, Random Forest achieved the best results with 90% accuracy, whereas the average accuracy of all models was 88%. The average accuracy achieved in the 5-year prediction window was 78%; thus, a wider prediction window results in less accurate performance.

Relying on electronic health records, Nwosu et al. [19] used a dataset published by McKinsey & Company, containing 11 different attributes including body mass index, heart disease, marital status, age, average blood glucose, and smoking status. In the dataset, 548 patients suffered from stroke whereas 28524 patients had not suffered any previous strokes, thus the
dataset needed downsizing. In fact, 1000 downsizing experiments were done to avoid sampling bias. After that, 70% of the dataset was selected for training and 30% for testing. Over the 1000 experiments, the Neural Network model achieved the best accuracy of 75.02%, followed by Random Forest at 74.53% accuracy and Decision Tree at 74.31%.

In [13], the dataset was large and the study scored results very similar to ours, even though at times our metrics were better. However, the authors did not mention the achieved accuracy. Similarly, our proposed model achieved better performance than [14].

It is noteworthy that the proposed method in this study achieved 96.7% accuracy, which is significantly higher than the accuracy of [15] (80%).

In [16], the authors chose complex algorithms such as AdaBoost and XGB and achieved results similar to ours, whereas we achieved this high performance using much simpler algorithms, which is more desirable.

In [17], the study relied on 28 inputs to predict stroke occurrence, which are usually difficult to obtain from patients for a quick prediction. Conversely, the algorithms in the system proposed in this paper rely on only 9 factors as input. In addition, [17] used a much smaller dataset.

Similarly, [18] used a very high number of inputs, which is not desirable for ML algorithms.

III. METHODOLOGY

Machine learning permits the advancement of a system by making it capable of learning and improving from past experience without the need for constant reprogramming. Such systems learn how to analyze data and identify patterns that help them make future decisions without human help.

The real influence of machine learning becomes clear in fields that deal with huge amounts of data, such as retail, health, government, finance, and transportation. This is mainly due to the decision-making capabilities of machine learning: it can fit the data to different models that humans can then rely on for decisions. In the healthcare sector, machine learning models are used effectively for identifying diseases and computing risk stratification. These are only a few examples of the capabilities of machine learning.

Nonetheless, real-life data cannot simply be processed directly by the selected machine learning algorithms, which is why data preprocessing is an essential step before applying the ML models. After that, the available dataset must be divided into training and testing datasets. The training step is performed in order to teach the algorithm about the data. In addition, ML algorithms can predict unseen data, and the prediction results of the different algorithms are then checked against each other.

This study is dedicated to implementing machine learning algorithms for stroke prediction, since stroke is a dangerous and common disease. Machine learning is often suitable for such datasets due to its simplicity, structure, and compatibility with a wide range of machine learning platforms and tools. For this reason, machine learning algorithms were chosen in this study. However, the limitation of this method is that it requires many inputs for the model to be able to make predictions. When predicting a person's status, it is possible that not all inputs are available, and then the model will not be able to predict. This issue was mitigated because the chosen dataset was large.

In general, a wide set of attributes is used to predict strokes, such as gender, age, and blood pressure data, among many others. Additionally, the performance of a number of machine learning algorithms was examined to see which one is best suited for predicting stroke incidence based on the dataset. Ultimately, the chosen ML algorithm must give the predictions with the highest accuracy.

A. Implementation

In this section, the machine learning algorithms that will be implemented and put to the test are presented and described.

1) Naive Bayes: When the features are highly independent, the Naïve Bayes (NB) algorithm classifies through probability maximization [20]. Each subject has a feature vector $(f_1, \dots, f_n)$ and is assigned to the class $c$ for which $P(c \mid f_1, \dots, f_n)$ is maximized. The conditional probability is defined as in (1):

$$P(c \mid f_1, \dots, f_n) = \frac{P(f_1, \dots, f_n \mid c)\, P(c)}{P(f_1, \dots, f_n)} \tag{1}$$

In (1), $P(f_1, \dots, f_n \mid c) = \prod_{j=1}^{n} P(f_j \mid c)$ represents the probability of the features given the class, whereas the prior probability of the features is represented by $P(f_1, \dots, f_n)$ and the prior probability of the class is represented by $P(c)$. Since the denominator does not depend on $c$, maximizing the numerator of (1) also maximizes the whole expression, and the optimization becomes as in (2):

$$\hat{c} = \arg\max_{c} P(c) \prod_{j=1}^{n} P(f_j \mid c) \tag{2}$$

2) Random forest: A Random Forest (RF) classifier contains multiple decision trees [21]. The independent trees are built on subsets of instances obtained through resampling and are combined into an ensemble that can be used for classification and regression. In a random forest, the final output is the result of majority voting, since each independent tree generates its own classification outcome.

3) K-Nearest neighbors: The K-nearest neighbors (KNN) classifier depends on Manhattan or Euclidean distances to evaluate similarities or differences between instances in the dataset [22]. More often than not, the Euclidean distance is the metric of choice in KNN classifiers. In stroke prediction, the feature vector of a new sample would be $f_{new}$. KNN determines the K vectors (neighbors) closest to $f_{new}$. After that, $f_{new}$ is assigned to the class to which most of these neighbors belong.

4) Decision tree: In the proposed Decision Tree model [23], J48 was chosen as the single classifier and RepTree [24] as the base classifier. The classes are denoted by the leaf nodes, whereas the features are denoted by the internal nodes. The Gini index technique is employed by the J48 classifier in order to split a single feature
at each node. RepTree is a fast and simple decision tree learner that builds a DT using information gain as the impurity measure and prunes it via reduced-error pruning.

5) Majority voting: Hard or soft voting can be implemented through simple majority voting, assuming an ensemble of K base models. This method allows the prediction of the class label associated with an instance [25]. Hard voting collects the votes for each class label and chooses the one with the most votes as the output, that is, the candidate class. Soft voting, on the other hand, collects the predicted probabilities for every class label and predicts the class label with the largest probability. In the proposed model, hard voting is adopted. The general function of hard voting is represented by (3):

$$\hat{y} = \arg\max_{c} \sum_{k=1}^{K} P_{k,c} \tag{3}$$

where $P_{k,c}$ is the prediction or probability of the $k$-th model for class $c$, and $c \in \{\text{Stroke}, \text{Non-Stroke}\}$.

6) Stacking: Stacking is an ensemble learning technique in which the predictions of multiple heterogeneous classifiers are integrated within a meta-classifier. Usually, the training set is used for training the base models, whereas the outputs of the base models are used to train the meta-classifier. Here, J48, RF, NB, and RepTree were chosen to be included in the stacking ensemble classifier. The predictions of these classifiers are used for training a logistic regression meta-classifier.

The influence of machine learning parameters on the performance of a model can vary depending on the specific algorithm used, the dataset being analyzed, and the problem being solved. In general, however, adjusting the values of these parameters can have a significant impact on the accuracy and speed of a machine learning model. In this study, several parameters of the different algorithms were modified to achieve better results. The modifications to the parameters of each algorithm are shown in Table I.

TABLE I. THE CHANGED PARAMETERS FOR EACH ALGORITHM

Algorithm | Specific Parameters
KNN | Number of neighbors (k): 6; distance metric: Euclidean distance
SVM | Kernel type: radial basis function (RBF), the default kernel; regularization parameter (C): 1.0, the default value
DT | Tree depth: 3
NB | No modifications
RF | Number of trees in the forest: 100, the default value; maximum depth of each tree: 9

B. Pre-Processing

1) Dataset description: For this study, the dataset of choice was adopted from Kaggle. The dataset comprises a large number of participants, of which only those above 18 years old are chosen, making the total number of participants 3254. The 9 input attributes (most of which are nominal) as well as the target class are briefly described in Table II.

TABLE II. DESCRIPTION OF THE ATTRIBUTES/FEATURES IN THE DATASET

Risk factor | Description | Details
Age (year) | The actual age of the participants | All of the participants are older than 18
Gender | Whether the participant is male or female | In the dataset, 1260 participants are male and 1994 participants are female
Hypertension | Whether the participant suffers from hypertension or not | 12.54% of the participants in the dataset are hypertensive
Heart disease | Whether the participant suffers from heart diseases in general or not | 6.33% of the participants in the dataset suffer from heart diseases
Marital status | Whether the participant is married or not | In the dataset, 79.84% of the participants are married
Work type | The work status of the participants | 65.02% of them work in the private sector, 19.21% are self-employed, 15.67% have a job, while 0.1% have never worked
Residence type | Whether the participant lives in an urban or rural place | 51.14% of the participants in the dataset live in an urban place whereas the rest live in rural places
Avg glucose level (mg/dL) | The average level of a participant's blood glucose | Numerical values for each patient
BMI (kg/m²) | Body mass index of the participants | Numerical values for each patient
Smoking status | Whether a participant currently smokes or not | 22.37% of the participants smoke, 24.99% of them have smoked in the past, and 52.64% of them have never smoked
Stroke history | Whether the participant has had a stroke previously or not | 5.53% of the participants have previously had a stroke

2) Data pre-processing: If the data were kept in their raw form, the quality of the predictions might suffer, which is why data preprocessing is essential. The raw data might contain missing values, redundancy, and noise, so tasks like data discretization and reduction of redundant values are performed. Furthermore, one of the data pre-processing tasks is to balance the classes by selecting one of the available resampling techniques. In the proposed workflow, the SMOTE technique is used so that the participants are distributed over the stroke and non-stroke classes in a balanced way. More specifically, the minority class, which corresponds to the stroke participants, was oversampled to increase the number of participants in this class. In addition, there were no missing or null values, so neither dropping nor data imputation was applied.
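As a rough illustration of the oversampling idea behind SMOTE mentioned above, the following minimal Python sketch creates synthetic minority-class rows by interpolating between a minority sample and one of its nearest minority neighbours. It is not the implementation used in this study: the function name `smote_like_oversample`, the two-feature toy rows, and the parameter values are assumptions made only for illustration.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic rows built by SMOTE-style interpolation.

    Each synthetic row lies on the line segment between a randomly chosen
    minority row and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy rows [age, avg_glucose_level]; the values are made up for illustration.
stroke = [[67.0, 228.7], [80.0, 105.9], [49.0, 171.2], [79.0, 174.1]]
no_stroke_count = 10  # hypothetical size of the majority class

new_rows = smote_like_oversample(stroke, no_stroke_count - len(stroke), k=2)
balanced_stroke = stroke + new_rows
print(len(balanced_stroke))
```

Because every synthetic row is a convex combination of two real minority rows, the generated values stay within the range spanned by the observed stroke cases. In practice, nominal attributes would need separate handling, which this sketch does not attempt.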
C. Proposed Workflow

The details of the proposed approach and methodology can be summed up in the workflow chart presented in Fig. 1.

In the evaluation metrics (accuracy, precision, recall, and F1 score), TP designates true positives, FN false negatives, FP false positives, and TN true negatives.

On the other hand, the area under the curve (AUC) is also a beneficial metric. Its values lie between 0 and 1, and the higher the AUC value, the better the performance. If the model can discriminate between the instances of the two classes perfectly, the AUC is 1. Conversely, if the model fails to distinguish between the classes at all, the AUC is about 0.5, and a value of 0 corresponds to a ranking that is perfectly inverted.

IV. RESULTS

A. Data Visualization

The dataset can be visualized so that each of the features or attributes is analyzed separately and against the others. Fig. 2, for instance, illustrates how the participants in the dataset are distributed according to age and gender. It can be seen that the patients have an average age of 41 years and that there are slightly more females than males: specifically, 56% of the participants are female.

Fig. 4. Distribution of data over heart disease and hypertension cases.

Moreover, 25% of the patients were obese, and 18% of the participants were overweight according to Fig. 5.
Fig. 5. Distribution of data according to BMI.

Fig. 6 shows that the majority of the patients were smokers, followed by a large group of participants with unknown smoking status (1544 participants).

Fig. 6. Distribution of data according to smoking status and relation to stroke.

Additionally, the majority of the participants are employed in the private sector. Meanwhile, the participants were almost equally distributed between rural and urban areas, as depicted in Fig. 7.

B. Model Evaluation

After the data were acquired, preprocessed, and visualized, they were used to train and test several classifiers whose role was to predict whether or not a patient will have a stroke. The evaluation results for each classifier are presented in Table III.

TABLE III. EVALUATION OF THE DIFFERENT CLASSIFIERS IN TERMS OF ACCURACY, F1 SCORE, RECALL, AND PRECISION

In addition, these evaluation metrics can be seen in Fig. 9, which illustrates that all of the proposed algorithms in this study have similar results in predicting stroke occurrence in patients, except for Naïve Bayes, which clearly has the worst performance among these classifiers.
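To make the four reported metrics concrete, the following minimal Python sketch computes accuracy, precision, recall, and F1 score from the TP, FP, FN, and TN counts using their standard definitions. The helper names `confusion_counts` and `classification_metrics` and the toy label lists are assumptions for illustration only. They are not taken from the study, and the numbers are not its results.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem (positive = stroke)."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 from the four counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (1 = stroke, 0 = no stroke), made up for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts, metrics)
```

With these toy labels there are three true positives, one false positive, one false negative, and three true negatives, so all four metrics equal 0.75. On a strongly imbalanced test set, accuracy alone can look high even when recall for the stroke class is poor, which is why the four metrics are reported together.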