
Developing a Predictive Model of Stroke using Support Vector Machine


Jovel T. Rosado
Technological Institute of the Philippines
Manila, Philippines
[email protected]

Alexander A. Hernandez
Technological Institute of the Philippines
Manila, Philippines
[email protected]

Abstract— Health is a fundamental human right of all Filipinos, as stated by the Philippine Constitution of 1987. Based on data published by the World Health Organization in 2018, 41 million deaths occurred because of stroke and its complications. Thus, given the parameters for the risk factors of stroke, a predictive model is developed for the occurrence of stroke based on the medical records of the patient. To ensure data quality, the medical data of the patients underwent data preprocessing, and principal component analysis was used for dimension reduction. The model is evaluated using accuracy, precision, recall, F1 score, and area under the curve. The study used a dataset of 1,500 patients from Cavite, Philippines: 60 percent for training the model, 30 percent for testing, and 10 percent for validation. The SVM model achieved an accuracy of 99% on the training data, 98.89% for testing, and 97.33% for validation. The results show the potential use of the predictive model for stroke, which thus remains relevant for researchers and practitioners in the medical and health sciences field.

Keywords—support vector machine, principal component analysis, stroke prediction, Philippines

I. INTRODUCTION

Stroke is the top life-threatening disease in the world and the leading cause of cognitive disorder worldwide [1]. To reduce the burden of stroke in the population, it is first necessary to identify the modifiable risk factors and to demonstrate the effectiveness of risk reduction efforts [2]. Accordingly, preventing stroke remains one of the essential targets in the fields of neurology, cardiology, vascular medicine, and geriatric medicine [3].

In 2016, there were an estimated 41 million deaths because of non-communicable diseases. The largest share was due to cardiovascular disease, which accounted for 17.9 million deaths, equivalent to 44% of all non-communicable disease deaths [4]. In the Philippines, according to the Philippine Statistics Authority (PSA), stroke was the leading cause of death with 74,134 deaths, or 12.7 percent of the total [5].

However, the growing number of stroke incidents can be addressed through innovation and technology. The use of machine learning in knowledge discovery for disease prediction has been one of the interesting and relevant topics addressed by researchers [6]. Accordingly, because of the importance of disease prediction, several studies have been conducted on modeling procedures for prediction [7]. This study incorporates machine learning and proposes a new predictive method using Principal Component Analysis (PCA) and a supervised machine learning algorithm. PCA is used for dimensionality reduction and for dealing with the multi-collinearity problem in the experimental data [8].

Support Vector Machine (SVM) is a technique suitable for disease prediction tasks [9]. Thus, SVM is chosen to predict stroke. An SVM-based approach with various kernel functions produced accurate results and showed the predictive power of SVM with a small set of input parameters [10].

The paper intends to develop a predictive model using the medical records of the patients. The records undergo dimension reduction through Principal Component Analysis, which reduces the range of continuous data into a range of values or categories, and are then processed using a Support Vector Machine. The model is evaluated using accuracy, precision, recall, F1 score, and area under the curve.

II. RELATED WORKS

A. Overview of Stroke

Stroke is a prevalent disease that can affect the patient and his/her family for many years. It is one of the world's major causes of adult disability, and developing countries also face this kind of non-communicable disease [11]. For this reason, knowing what a stroke is makes an essential first step. A stroke is a "brain attack." It can occur at any time and can affect anyone. It happens when blood flow to an area of the brain is cut off. When this occurs, brain cells die from the absence of oxygen, and the capabilities regulated by that region of the brain, such as memory and muscle control, are lost. The common signs of stroke are weakness or numbness of the face, arm, and leg on one side of the body, difficulty speaking, and trouble seeing in one or both eyes. A patient can also experience sudden severe dizziness, loss of balance, and a severe headache and, lastly, increasing drowsiness with possible loss of consciousness and confusion [12].

B. Support Vector Machine

Support Vector Machine is a machine learning method based on statistical learning theory. A separating hyperplane is constructed in the descriptor space of the training data, and samples are classified according to the side of the hyperplane on which they fall [13]. It is possible to use

978-1-7281-5247-9/19/$31.00 ©2019 IEEE


SVM to discover complicated patterns. A similarity function (kernel) is selected to transform the data, and data points (vectors) are selected to support it [14].

Moreover, SVM is one of the supervised learning methods used for classification, prediction, and regression analysis [15].

Figure 1. Maximum Margin separating Hyperplane (the figure labels the negative hyperplane and the positive hyperplane)

Figure 1 shows the margin of the classes and the hyperplane used to classify data of two classes. Support vectors are used to obtain the maximum margin from each class of data [16]. The solid line is the maximum margin separating hyperplane. The points with the smallest margins are exactly the ones closest to the decision boundary; they lie on the lines parallel to it. Thus, only the Lagrange multipliers associated with these three points will be non-zero at the optimal solution to the optimization problem. These three points are known as support vectors [17].

C. Principal Component Analysis (PCA)

PCA is an important method from multivariate analysis that is often used for data dimensionality reduction. It is also a popular way to extract significant features from the training data used to learn a machine learning model [18]. PCA is used in this study on the patients' data sets for the prediction of stroke.

In general, PCA works as a linear transformation method that converts the original data variables into a feature space with the same number of dimensions as the unprocessed data. The transformed variables in the feature space are uncorrelated and are called principal components. The transformation aims to maximize the variance among the projected variables in the feature space and thus enables the contribution of each principal component to be evaluated. The leading components can then be retained and the remaining ones discarded [19].

D. Stroke Prediction Model

For the prediction of stroke, this study uses a machine learning method, SVM, to predict the possibility of stroke based on the medical records of the patient. In a study conducted by Bentley et al. [20], SVM achieved higher accuracy than radiological methods. In the research undertaken by Colak, Karaman, and Turtay [21], SVM and ANN predicted stroke based on chosen early diagnostic predictors for a clinical decision support system.

Moreover, Xiang [22] applied and compared different categories of machine learning models that have good interpretability, including generalized linear models, to build predictions for stroke and thromboembolism. That study used integrated machine learning approaches, including data curation, feature engineering, and supervised learning, to build the thromboembolism prediction model, and showed that the approach could achieve significantly better prediction performance.

III. MATERIALS AND METHODS

This study applies the general framework of knowledge discovery in databases, presented in Figure 2.

Figure 2. Knowledge Discovery in Databases

A. Datasets

The data used in this study came from the medical records of the patients. The datasets were originally owned by a hospital in Cavite, Philippines. The study covers a total of 1,500 patients, from the past year to the present. The medical records of the hospital contained different variables, such as those shown in the table below.

TABLE 1. PATIENTS' MEDICAL DATA

Attribute                        Description
Age                              Patient's age
Sex                              Gender of the patient
Chief Complaint                  Patient's major health complaint
Diabetes                         If patient has diabetes
Hypertension                     If patient has hypertension
Smoker                           If patient is a smoker
Alcoholic and Beverage Drinker   If patient is alcoholic and beverage drinker
Blood Pressure                   Blood pressure of the patient
Pulse Rate                       Pulse rate of the patient
Weight                           Weight of the patient

The data sets consist of 33 attributes (patient's name, age, sex, civil status, birthday, nationality, occupation, father's name, mother's name, chief complaint, history of present illness, past medical history, diabetes, hypertension, cancer, pulmonary tuberculosis, others, smoker, alcoholic and beverage drinker, food and drug allergy, general
appearance, blood pressure, respiratory rate, temperature, weight, SHEENT (skin, head, eyes, ears, nose & throat), chest and lungs, CVS, abdomen, genitalia, extremities, CNS, diagnosis), which were cleaned and underwent dimension reduction to extract the essential features used to train the support vector machine. The data was narrowed down to 11 attributes that served as the attributes for the stroke prediction model. These remaining 11 attributes were the data related to the causes of stroke and were annotated by the physicians. Based on the medical records, if a patient had positive responses for all of the attributes used, the patient was considered likely to have a stroke. 60% of the total data (900 records) was used for training the model, 30% (450) was used as the testing data set, and 10% (150) was used for validation.

B. Data Preprocessing

Since the hospital did not have an electronic copy of the medical records of the patients, individual records were encoded in Microsoft Excel. After encoding all the information of the 1,500 patients with 33 attributes, the data were cleaned by deleting all redundant information and unnecessary details, leaving 11 attributes as the parameters of stroke.

The raw data contains binary, nominal, and numeric types. For the different data types, this study designed sets of cleansing rules to ensure that complete and accurate data are available. The cleansing rules were used to standardize the format, correct input errors, or discard values that cannot be recognized.

Missing values that could not be linked to other features were then imputed. Features with too many missing entries are discarded because their distributions are difficult to estimate, which may lead to inaccurate results. Xiang [22] suggests that if a binary feature has more than 80% missing instances, or a numeric/multi-value nominal feature has more than 60% missing entries, then the feature should be removed from the data sets. Thus, the other 22 variables were dropped since they were not necessary stroke parameters, and some had missing values.

C. Model Building

Further preprocessing was performed to remove outliers from the data set. PCA is used for feature selection as it is a standard method of extracting the essential features from the training data. Feature selection can benefit distinct aspects of data analysis, such as better data visualization and comprehension, reduced computation time and analysis duration, and improved predictive accuracy [23].

This study used the SVM algorithm for model building. It utilizes both linear and nonlinear kernel functions. It classifies the data by finding the hyperplane, the boundary that separates the data points of the first class from those of the second class. The larger the margin found, the better the model [24].

The SVM uses a linear classifier of the following form,

$$f(x) = W I + \text{bias} \tag{1}$$

where W is the weight factor, I is the input vector, and bias is the bias term. The dividing hyperplane is defined by f(x) = 0. Therefore, the first class, which falls above the hyperplane, has f(x) > 0, and the other class, below the hyperplane, has f(x) < 0 [24].

D. Evaluation

The performance of the model is evaluated using accuracy. It is defined as the number of correctly classified instances divided by the total number of instances present in the dataset, as used in another study [25].

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{2}$$

where TP is True Positive, FP is False Positive, TN is True Negative, and FN is False Negative.

TP Rate: the fraction of positive instances that were predicted positive. The true-positive rate is also called sensitivity [25].

$$\text{TPR} = \frac{TP}{TP + FN} \tag{3}$$

Precision is the fraction of instances predicted positive that are actually positive [25].

$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$

Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class [26].

$$\text{Recall} = \frac{TP}{TP + FN} \tag{5}$$

F-measure is the combination of both precision and recall. It is used to estimate classification performance [25].

$$\text{F-Measure} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \tag{6}$$
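As a worked illustration of Equations (2) to (6), the following minimal Python sketch computes the evaluation measures from the four confusion-matrix counts. The counts are the testing results reported in Table 5 (TP = 9, FP = 3, TN = 436, FN = 2); the class name `ConfusionCounts` and the function names are hypothetical and chosen only for this example, and the sketch is not the code used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Confusion-matrix counts (hypothetical helper for this example)."""
    tp: int  # patients with stroke predicted as with stroke
    fp: int  # patients without stroke predicted as with stroke
    tn: int  # patients without stroke predicted as without stroke
    fn: int  # patients with stroke predicted as without stroke


def accuracy(c: ConfusionCounts) -> float:
    """Equation (2): correctly classified instances over all instances."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def precision(c: ConfusionCounts) -> float:
    """Equation (4): true positives over all predicted positives."""
    return c.tp / (c.tp + c.fp)


def recall(c: ConfusionCounts) -> float:
    """Equations (3) and (5): true positives over all actual positives."""
    return c.tp / (c.tp + c.fn)


def f_measure(c: ConfusionCounts) -> float:
    """Equation (6): harmonic mean of recall and precision."""
    r, p = recall(c), precision(c)
    return 2 * r * p / (r + p)


# Counts taken from the testing confusion matrix reported in Table 5.
testing = ConfusionCounts(tp=9, fp=3, tn=436, fn=2)

if __name__ == "__main__":
    for name, fn in [("accuracy", accuracy), ("precision", precision),
                     ("recall", recall), ("f_measure", f_measure)]:
        print(f"{name}: {fn(testing) * 100:.2f}%")
```

With these counts, accuracy, precision, and recall come out to 98.89%, 75%, and 81.82%, matching Table 3. Equation (6) gives about 78.26% for the F-measure, whereas Table 3 reports 78.41%, a value that coincides with the arithmetic mean of the reported precision and recall.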
IV. EXPERIMENTAL RESULTS

Prevention is better than cure. Detecting the early signs of a potential stroke is essential since stroke is life-threatening. Early detection could improve a patient's life expectancy and health condition. A supervised algorithm known as SVM was used to develop the stroke prediction model.

Figure 3. Data plot of patients using SVM

Figure 3 shows the plot of the patients' data. The blue dots indicate patients who are negative for stroke, and the brown dots show patients who have the possibility of stroke. The spread of the Radial Basis Function (RBF) kernel shows that the gamma value is high enough that the decision boundary starts to follow the spread of the data more closely, transforming the data into a higher dimensional feature space. RBF is a popular kernel (a way of computing the dot products of two vectors) used in the SVM model. It is a function whose value depends on the distance from the origin.

In this study, accuracy, precision, recall, F1 score, and AUC are computed to evaluate the performance of the SVM classifier. The 1,500 records were divided into 60% training, 30% testing, and 10% validation. The data underwent cross-validation to evaluate and compare the results by dividing the data into two segments: one used to learn or train a model, and the other used to validate the model.

TABLE 2. Model Training Result

                            Accuracy   Precision   Recall    F1 Score   AUC
Model Training Data (900)   99%        80%         76.19%    78.10%     99.8% (.998)

Table 2 presents the different parameters for evaluating the model using the training data, which consists of 900 medical records. The results show that the accuracy of the SVM model on the training data is 99.00%. Precision is 80%, and recall, the fraction of actual positive stroke cases that the SVM model correctly identified, is 76.19%. The F1 score of the SVM model is 78.10%, and the AUC is 99.8% (.998), which is close to that of an ideal classifier. The results show that the classifier could still be improved in terms of accuracy and other relevant performance measures.

TABLE 3. Model Testing Result

                      Accuracy   Precision   Recall    F1 Score   AUC
Model Testing (450)   98.89%     75%         81.82%    78.41%     99.8% (.998)

Table 3, on the other hand, presents the different parameters for evaluating the model using the testing data, which consists of 450 medical records. The accuracy of the SVM model on the testing data is 98.89%. Precision is 75%, and recall, the fraction of actual positive stroke cases that the SVM model correctly identified, is 81.82%. The F1 score of the SVM model is 78.41%, and the AUC is 99.8% (.998). Based on the results, the classifier is able to predict correctly based on the patterns learned in the training activity. Thus, the model is accurate enough to use for predicting potential stroke.

TABLE 4. Model Validation Result

                 Number of Data   Predicted Correctly   Not Predicted Correctly   Accuracy
Without Stroke   150              143                   3                         97.33%
With Stroke                       3                     1

Table 4 shows the validation result, which used 10% of the total data, or 150 records. The model correctly predicted 143 records without stroke and incorrectly predicted 3. Moreover, it correctly predicted 3 instances with stroke and incorrectly predicted 1. Based on the validation results, the model is 97.33% accurate.

TABLE 5. Model Testing Confusion Matrix

                      True without Stroke   True with Stroke
pred without stroke   436                   2
pred with stroke      3                     9

Table 5 shows the confusion matrix of the data used in testing the model. The rows in the confusion matrix correspond to what the model predicted, and the columns correspond to the known truth. There are 436 patients without stroke who were correctly identified by the model, and 9 patients with stroke who were correctly identified by the model. On the other hand, there were 3 patients without stroke whom the algorithm identified as having a stroke. Lastly, 2 patients had a stroke, but the algorithm classified them as without stroke.

Hence, from the above study, it can be seen that using the training data, the model obtained an accuracy of 99%
and 98.89% for testing. To better ensure the accuracy and efficiency of the algorithm used, the model underwent validation and generated a result of 97.33%. For a better understanding of classifier performance, the F1 score matters as it provides a balance between recall and precision [28].

V. CONCLUSION

The objective of this study is to develop a predictive model using SVM to predict the possibility of stroke among patients in Cavite, Philippines. Predictions from the SVM kernels resulted in a high-performance classifier, with RBF at 1.0. This can assist doctors in planning better stroke detection and medication in the near future. This study demonstrates the predictive capability of SVM with 1,500 patients and 10 attributes. The evaluation resulted in an accuracy of 99% using the training data and 98.89% using the testing data, with a validation result of 97.33%.

This study is not free from limitations. Thus, it recommends some future activities. The study could be used in the future for stroke prevention since it could detect the early occurrence of stroke among the patients of Cavite, Philippines. The results could also help in developing a control plan for those patients, since stroke cannot otherwise be detected beforehand. This study could also be used for developing another model for further comparison of the different machine learning algorithms.

REFERENCES

[1] V Mozaffarian, D. B. (2015). Heart disease and stroke statistics 2015 update: a report from the American Heart Association. American Heart Association, Circulation 131, e29–322.

[2] Amelia K. Boehme, C. E. (2017). Stroke Risk Factors, Genetics, and Prevention. Circulation Research Journal of the American Heart Association.

[3] M. Edip Gurol, J. S. (2018). Advances in Stroke Prevention in 2018. Journal of Stroke, 143-144.

[4] WHO. (2018). World Health Statistics 2018: Monitoring Health for SDGs, sustainable development goals. Geneva: World Health Organization.

[5] PSA. (2018, February 12). Deaths in the Philippines 2016. Retrieved from Philippine Statistics Authority: https://psa.gov.ph/content/deaths-philippines-2016

[6] Mehrbakhsh Nilashi, H. A. (September 2017). Knowledge Discovery and Diseases Prediction: A Comparative Study of Machine Learning Techniques. Journal of Soft Computing and Decision Support Systems, 4(No,5), 8-16.

[7] Nilashi, M. b. (2017). An Analytical Method for Diseases Prediction Using Machine Learning Techniques. Computers & Chemical Engineering, 106, 212-223.

[8] Nilashi, M. E. (2016). A multi-criteria collaborative filtering recommender system using clustering and regression techniques. Journal of Soft Computing and Decision Support Systems, 5, 24-30.

[9] Hazi Mohammad Azamathulla, A. H. (2017). Application of Data Mining Methods in Diabetes Prediction. 2017 2nd International Conference on Image, Vision and Computing (IEEE), 106-110.

[10] Jeena RS, D. S. (2016). Stroke Prediction Using SVM. International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT) (IEEE).

[11] Subha PP, P. G. (2015). Pattern and risk factors of stroke in the young among stroke patients admitted in medical college hospital, Thiruvananthapuram. Ann Indian Acad Neurol, 18:20-3.

[12] National Stroke Association. (2019). (American Heart Association Inc.) Retrieved May 28, 2019, from https://www.stroke.org/understand-stroke/what-is-stroke/

[13] Dr. S. Vijayarani, M. S. (2015). Data Mining Classification Algorithms for Kidney Disease Prediction. International Journal on Cybernetics and Informatics, 4(4), 13-25.

[14] Jean-Emmanuel Bibault, P. G. (2016). Big Data and machine learning in radiation oncology: State of the art and future prospect. Elsevier, 110-117.

[15] Cemil Colak, E. K. (2015). Application of knowledge discovery process on the prediction of stroke. Elsevier, 181-185.

[16] Raoof Gholami, N. F. (2017). Support Vector Machine: Principles, Parameters, And Applications. Elsevier, 515-533.

[17] Ng, A. (n.d.). Stanford Edu. Retrieved May 30, 2019, from cs229.stanford.edu/notes/cs229-notes3.pdf

[18] Smita Jhajharia, H. K. (2016). A Neural Network Based Breast Cancer Prognosis Model with PCA Processed Feature. Intl. Conference on Advances in Computing, Communications and Informatics (ICACCI). Jaipur, India.

[19] O. Inan, M. S. (n.d.). A new hybrid feature selection method based on association rules and pca for detection of breast cancer. International Journal of Innovative Computing and Information and Control, 09(02), 727-739.

[20] P. Bentley, J. G. (n.d.). Prediction of stroke thrombolysis outcome using CT brain machine learning. Neuroimage, 4, 635-640.

[21] Cemil Colak, E. K. (2015). Application of knowledge discovery process on the prediction of stroke. Elsevier, 181-185.

[22] Xiang Li, P. H. (2017). Integrated Machine Learning Approaches for Predicting Ischemic Stroke and Thromboembolism in Atrial Fibrillation. AMIA Annual Proceedings Archive, 799-807.

[23] Ioannis Kavakiotis, O. T. (2017). Machine Learning and Data Mining Methods in Diabetes Research. Elsevier Computational and Structural Biotechnology Journal (15), 104-116.

[24] Radhimeennakshi, S. (2016). Classification and prediction of Heart Disease Risk Using Data Mining Techniques of Support Vector Machine and Artificial Neural Network. IEEE International Conference on Computing for Sustainable Global Development (INDIACom), 3107-3111.

[25] O. Dr. S. Vijayarani, M. S. (2015). Data Mining Classification Algorithms for Kidney Disease Prediction. International Journal on Cybernetics and Informatics, 4(4), 13-25.
[26] Joshi, R. (2016, September 9). Exsilio Solutions. Retrieved June 4, 2019, from https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/

[27] Harleen Kaur, V. K. (2018). Predictive modelling and analytics for diabetes using a machine learning approach. Applied Computing and Informatics.

[28] J. Li, O. A. (2017). Glycaemic index precision: a pilot study of data linkage challenges and the application of machine learning. IEEE EMBS Int. Conf. on Biomed. & Health Informat (BHI), 357-360.
