
Stroke Prediction and its Algorithm using Machine Learning

Anonnya Barua, Ahammed Ekrak Hossain Utsha, Shadman Sakib Faruki, Zerin Farzana

1. Introduction:
Stroke is the third leading cause of death and the principal cause of serious long-term disability in the United States. Accurate prediction of stroke is highly valuable for early intervention and treatment. In this work, we compare the Cox proportional hazards model to a machine learning strategy for stroke prediction using the Cardiovascular Health Study (CHS) dataset. Stroke risk prediction can help greatly with prevention and early treatment. Several medical investigations and data analyses have been undertaken to find effective predictors of stroke. The Framingham
Study [6, 34] reported a list of stroke risk factors including age, systolic blood pressure, the use
of anti-hypertensive therapy, diabetes mellitus, cigarette smoking, prior cardiovascular disease,
atrial fibrillation, and left ventricular hypertrophy by electrocardiogram. Furthermore, in the
past decade, a number of other studies [3, 4, 5, 6] have led to the discovery of more risk factors
such as creatinine level, time to walk 15 feet, and others. Most previous prediction models have
adopted features (risk factors) that are verified by clinical trials or selected manually by medical
experts. With a large number of features in current medical datasets, it is a cumbersome task to
identify and verify each risk factor manually. On the other hand, machine learning algorithms
are capable of identifying features highly related to stroke occurrence efficiently from the huge
set of features; therefore, we believe machine learning can be used both to improve the prediction accuracy of stroke risk and to discover new risk factors.

According to the World Stroke Organization, 13 million people suffer a stroke each year, and approximately 5.5 million people die as a result. It is a leading cause of death and disability worldwide, and its imprint is therefore serious in all aspects of life. Stroke affects not only the patient but also the patient's social environment, family, and workplace. In addition, contrary to popular belief, it can happen to anyone, at any age, regardless of gender or physical condition [2].

The goal of this machine learning project is to create a predictive model that can effectively identify people at risk of having a stroke based on their medical history, lifestyle factors, and demographic information, and to predict whether a patient will be diagnosed with stroke, considering the following attributes: id, gender, age, hypertension, heart disease, ever married, work type, residence type, average glucose level, BMI, smoking status, and stroke.

The research proposes to construct a machine-learning model that can estimate the likelihood of stroke in patients using a large dataset of health records. To train the model for accuracy, techniques such as k-nearest neighbours and Naïve Bayes are used. The research has great significance for healthcare: it intends to address an essential need by employing technology to improve risk assessment and stroke prevention measures, saving lives and lowering healthcare costs.

2. Project Methodology:
In this study, various methods were used to forecast the likelihood of a patient being diagnosed with brain stroke, based on the individual's medical history. The algorithm with the highest accuracy rate was selected as the best model for predicting stroke. This study also examines the role of machine learning (ML) in addressing stroke-related issues such as prevention, risk factor identification, diagnosis, therapy, and prognosis. The systematic approach involves reviewing the finest studies and providing helpful insights for further investigation. To train multiple machine learning algorithms, relevant health-record datasets were collected and chosen. The gathered information was preprocessed to remove missing data, normalize numerical values, extract and scale features, and split into testing and training datasets. Different machine learning models are evaluated and applied to the processed dataset. The model performance is then compared using common metrics such as accuracy and F1 score.

2.1. Data Collection Procedure:


The dataset [8] used in our work has 12 columns and 5110 rows in total. The first 11 columns are the features used to predict the final column, 'target (stroke)', which tells us whether the patient is going to be affected by stroke or not. The 5110 rows represent the records of 5110 patients. A short description of the features of the dataset is given in Table 1.
Number | Attribute | Description and Domain | Data Type
1 | id | Unique identifier | Numerical
2 | gender | Gender of the person [1: Male, 0: Female] | Binary
3 | age | Age of the person in years (0.08-82) | Numerical
4 | hypertension | 0 if the patient doesn't have hypertension, 1 if the patient has hypertension | Binary
5 | heart disease | Heart disease (0: No, 1: Yes) | Binary
6 | ever married | Marital status (0: No, 1: Yes) | Binary
7 | work type | 1: Children, 2: Govt_Job, 3: Never_Worked, 4: Private, 5: Self_Employed | Nominal
8 | Residence type | 0: Rural, 1: Urban | Binary
9 | avg glucose level | Average glucose level (55.12-271.74) | Numerical
10 | bmi | Body mass index (10-97) | Numerical
11 | smoking status | 0: No smoke, 1: Smoke | Binary
12 | stroke | Class attribute (0: No stroke risk, 1: Stroke risk) | Binary
Table 1. Description of attributes for the dataset collected from Kaggle


2.2. Data Validation Procedure:

Data validation is a key stage in the data preprocessing phase of a machine learning project. It
entails evaluating the data's quality, consistency, and accuracy to verify that it is compatible with
analysis and modeling. The approach encompasses cleaning, preprocessing, sampling, cross-
validation, outlier identification, and assessing data quality using metrics.

This procedure entails handling missing values, scaling features, encoding categorical variables,
finding and treating outliers, and selecting useful features. Splitting the dataset into training and
testing sets, as well as subdividing it for validation, allows for more accurate evaluation of model
performance. Cross-validation procedures are used to check the robustness of the trained model
across different subsets of the data, ensuring its generalization to previously unseen data. Finally,
data validation is crucial to assuring the stability and effectiveness of machine learning models.
The steps for data validation vary depending on the type of data and analytic aims.
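
As a minimal sketch of this step (assuming scikit-learn and a cleaned, numerically encoded DataFrame df shaped like the stroke dataset), the hold-out split and cross-validation check might look like this:

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = df.drop('stroke', axis=1)   # features
y = df['stroke']                # class attribute

# Hold out 20% of the data for final testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation on the training set to check robustness across data subsets
scores = cross_val_score(KNeighborsClassifier(), X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")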

2.3. Data Preprocessing and Normalization:


Real-world data frequently contains noise and missing values, and may be in an unacceptable format, making it impossible to build machine learning models on it directly.

Data preprocessing is an important step in machine learning that prepares raw data for analysis
and model training. It includes a variety of strategies for cleaning, transforming, and
standardizing datasets to improve machine learning model performance and accuracy. Missing
value handling, outlier removal, categorical variable encoding, and numerical feature scaling are
common preprocessing techniques. Missing data can be imputed using methods such as mean,
median, or mode imputation, and outliers can be identified and addressed using trimming or
winsorization. Categorical variables are frequently encoded into numerical representations using
methods such as one-hot encoding or label encoding to make them compatible with machine
learning algorithms. Normalization is a preprocessing approach that scales numerical features to
a standard range, such as 0 to 1, or with a mean of 0 and a standard deviation of 1, to guarantee
that features of varied scales contribute equally to model training. Preprocessing and
normalizing data can increase model convergence, lessen the influence of outliers, and improve
machine learning models' interpretability and generalization performance. The data can be
normalized to guarantee that the algorithm works as intended.
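
For illustration, a minimal preprocessing sketch along these lines (column names as in the Kaggle stroke dataset; note that the appendix code simply drops rows with missing values rather than imputing):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Impute missing BMI values with the median (an alternative to dropping rows)
df['bmi'] = df['bmi'].fillna(df['bmi'].median())

# Encode categorical variables into numerical representations
le = LabelEncoder()
for col in ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']:
    df[col] = le.fit_transform(df[col])

# Normalize numerical features to the 0-1 range
scaler = MinMaxScaler()
df[['age', 'avg_glucose_level', 'bmi']] = scaler.fit_transform(df[['age', 'avg_glucose_level', 'bmi']])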

2.4. Feature Extraction:

Feature extraction in machine learning is selecting and manipulating raw data into a more compact
representation that captures critical information pertinent to the learning goal. It seeks to minimize the
data's dimensionality while retaining its discriminative capability, making it easier for machine learning
algorithms to find patterns and predict accurately. Principal component analysis (PCA), linear
discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE) are examples of
dimensionality reduction techniques used in feature extraction. These strategies transform the original
data into a lower-dimensional space by identifying the most informative characteristics or patterns.
Algorithms often perform better when they are trained on well-defined and relevant features. Feature
extraction can help identify the most important features for a given problem and improve the accuracy
of the resulting models, as well as make the data more interpretable by transforming it into a more
understandable format.
There are many techniques for feature extraction. Here, a correlation matrix is used to analyse the relationship between stroke and the other variables. The dataset's features, including age, hypertension, gender, average glucose level, BMI, smoking status, heart disease, ever_married, work_type, and residence_type, have a substantial impact on stroke (see the heatmap in the Appendix). Previous medical research indicates that stroke risk factors include hypertension, heart disease, diabetes, age, BMI, and smoking. Features with negligible correlation were therefore deleted.
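
A small sketch of this selection step (assuming the encoded, scaled DataFrame df_scaled_up built in the appendix; the 0.01 threshold is illustrative):

# Rank features by the absolute strength of their linear relationship with the target
correlations = df_scaled_up.corr()['stroke'].abs().sort_values(ascending=False)
print(correlations)

# Drop features whose correlation with 'stroke' is negligible
weak = correlations[correlations < 0.01].index
df_reduced = df_scaled_up.drop(columns=weak)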

2.5. Classification Algorithms:


A classification algorithm is a type of machine learning algorithm that assigns input data to one or more classes or categories depending on its characteristics. It learns from labelled training data, in which each data point is assigned a predetermined class label, and then creates a model that can predict the class labels of previously unseen data. Classification algorithms are utilised in a wide range of applications, including spam detection, sentiment analysis, medical diagnosis, and image recognition.

The k-Nearest Neighbours (k-NN) algorithm is a straightforward machine learning technique used for classification and regression applications. It predicts the class or value of a new data point based on the majority class or average value of its k nearest neighbours in the feature space. The training set's k nearest data points are used to forecast the target variable's value.
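
Nearness is measured with a distance metric. With the Minkowski distance of order p (the appendix model uses p = 5; p = 2 gives the familiar Euclidean distance), the distance between feature vectors x and x' is:

d(x, x') = ( |x_1 - x'_1|^p + |x_2 - x'_2|^p + ... + |x_n - x'_n|^p )^(1/p)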

A Naïve Bayes classifier is a probabilistic machine-learning model used for classification tasks. The crux of the classifier is Bayes' theorem. It presupposes that the presence of a certain feature in a class is unaffected by the presence of other features, which is frequently an oversimplification but works well in practice for many real-world datasets.
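
Concretely, under this independence assumption, Bayes' theorem gives the class posterior as a simple product:

P(y | x_1, ..., x_n) ∝ P(y) · P(x_1 | y) · P(x_2 | y) · ... · P(x_n | y)

The predicted class is the y that maximizes this product; the Gaussian variant used in this work (GaussianNB) models each P(x_i | y) as a normal distribution.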

2.6. Data Analysis Techniques:


Data analysis techniques are a set of tools and processes used to investigate, comprehend, and gain insights from data. These strategies aid in the discovery of patterns, correlations, and trends in data, which may subsequently be used to develop prediction models or make data-driven decisions. A confusion matrix is a table that shows the number of true positives, false positives, true negatives, and false negatives for a binary classification model. This paper made use of a confusion matrix, as well as the accuracy and F1 score of each model. Accuracy is the most basic performance metric, measuring the proportion of correct predictions made by a model, whereas the F1 score is the harmonic mean of precision and recall, providing a single metric that balances the trade-off between the two. These are calculated as:
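
In terms of the confusion-matrix counts TP, TN, FP, and FN:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP), Recall = TP / (TP + FN)

F1 = 2 × (Precision × Recall) / (Precision + Recall)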
Model evaluation techniques assess the performance of machine learning models using metrics
such as accuracy, precision, recall, F1-score, or area under the ROC curve (AUC). Model
interpretation techniques aim to explain the behavior and predictions of the models, providing
insights into the underlying relationships between features and the target variable.

2.7. Workflow Diagram of Proposed Model:


The workflow diagram of a proposed model depicts the step-by-step flow of data through the
model development and deployment pipeline.
Fig 3. Workflow diagram of proposed methodology

3. Results and Discussion:


The machine learning classifiers were analyzed by comparing their performance, as detailed
below.

3.1. Results Comparison

The training set was used to build the models, while the testing set was used to evaluate them. K-nearest neighbours (k-NN), Naïve Bayes, decision trees, and support vector machines (SVM) were the four algorithms considered for building the model, but only k-NN and Naïve Bayes were trained and evaluated. For the models' evaluation, the accuracy and F1 score were employed.

k-NN: Accuracy 94.60%, F1-score 0.64

Naïve Bayes: Accuracy 87.47%, F1-score 0.50

k-NN, having the highest accuracy and F1-score of the evaluated models, was the best-fit model for predicting brain stroke on this dataset.

3.2. Confusion Matrix Analysis:

Confusion matrices, ROC curves, a histogram of predicted probabilities, cumulative gain curves, and lift curves were produced for both the k-NN and Naïve Bayes models (see the Appendix for the plotting code).
4. Conclusion and Future Recommendations:
The study suggests that machine learning algorithms have a high potential for effectively predicting strokes based on numerous risk factors and medical data. These models outperform traditional methods in terms of both predictive accuracy and efficiency. However, the study
recognizes constraints such as the need for larger, more diverse datasets and the difficulty of
model interpretation. Future research should concentrate on enhancing data quality and
availability, as well as developing more understandable machine learning approaches.
Furthermore, these models require additional validation in clinical settings to assure their
practical relevance and effectiveness in real-world circumstances. Longitudinal studies should be
conducted to test machine learning models' long-term prediction performance and impact on
patient outcomes.

Appendix:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split  # splits the dataset into train and test sets (80-20)
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.metrics import roc_curve, precision_recall_curve, auc, average_precision_score
from sklearn.calibration import calibration_curve
import scikitplot as skplt  # provides plot_cumulative_gain and plot_lift_curve


Load Dataset

df = pd.read_csv('/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
df_copy = df.copy()
df

Data Preprocessing

Check null values

df.isnull().sum()

Remove Null Values

df = df.dropna()

Check the Dataset Information

df.info()

Check the Number of Duplicated values

df.duplicated().sum()

df.describe()
Convert the object columns to numerical ones

scaler = StandardScaler()
label_encoder = LabelEncoder()

# Label-encode every object (string) column; leave numeric columns unchanged
df_labeled = df.apply(lambda x: label_encoder.fit_transform(x) if x.dtype == 'O' else x)

Check how many unique values in each column

print('Unique Values = \n\n')
for column in df.columns:
    unique_values = df[column].nunique()
    print(f"'{column}' = {unique_values}")

Scale the Dataset

df_scaled_up = pd.DataFrame(scaler.fit_transform(df_labeled), columns=df_labeled.columns)

df_scaled_up

Feature Selections

Heat Map

# With standardized features, the covariance matrix coincides (up to a constant) with the correlation matrix
covariance_matrix = abs(df_scaled_up.cov())

plt.figure(figsize=(10,10))

sns.heatmap(covariance_matrix, annot=True, cmap='copper', fmt='.2f', linewidths=.5)

plt.title('Covariance Matrix Heatmap')

plt.show()

Train Test Split

df_scaled_up_final = df_labeled.drop(['id', 'gender'], axis=1)

# Scale the features, then split into train and test sets (80-20)
X = df_scaled_up_final.drop(['stroke'], axis=1)
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
y = df_scaled_up_final['stroke']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Model : K-NN

# k-NN with 20 neighbours, uniform weights, and Minkowski distance of order p=5
knn = KNeighborsClassifier(n_neighbors=20, weights='uniform', algorithm='auto', p=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

Model : Naive Bayes

naive_bayes = GaussianNB()

# Training the model

naive_bayes.fit(X_train, y_train)

y_pred_naive = naive_bayes.predict(X_test)

Accuracy:

accuracy_knn = accuracy_score(y_test, y_pred)
print(f"Accuracy KNN : {accuracy_knn*100} %")

accuracy_naive = accuracy_score(y_test, y_pred_naive)
print(f"Accuracy Naive Bayes : {accuracy_naive*100} %")

Result:

F1-Score

f1_KNN = f1_score(y_test, y_pred)

print("F1-score for KNN : ", f1_KNN)


f1_naive = f1_score(y_test, y_pred_naive)

print("F1-score Naive Bayes : ", f1_naive)

Confusion Matrix

conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))

sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False)

plt.title("Confusion Matrix KNN")

plt.xlabel("Predicted")

plt.ylabel("Actual")

plt.show()

conf_matrix = confusion_matrix(y_test, y_pred_naive)

plt.figure(figsize=(8, 6))

sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False)

plt.title("Confusion Matrix Naive Bayes")

plt.xlabel("Predicted")

plt.ylabel("Actual")

plt.show()
# Plot ROC curve (K-NN)
# Use predicted probabilities rather than hard class labels so the curve traces all operating points
y_probs = knn.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_probs)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))

plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)

plt.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--')

plt.xlim([0.0, 1.0])

plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('Receiver Operating Characteristic (ROC) Curve')

plt.legend(loc="lower right")

plt.show()
# Plot ROC curve (Naive Bayes)
y_probs_naive = naive_bayes.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_probs_naive)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))

plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)

plt.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--')

plt.xlim([0.0, 1.0])

plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('Receiver Operating Characteristic (ROC) Curve')

plt.legend(loc="lower right")

plt.show()
Plot histogram

# Histogram of the k-NN predicted probabilities (y_probs computed above)
plt.figure(figsize=(8, 6))
sns.histplot(y_probs, bins=20, kde=True)

plt.title("Histogram of Predicted Probabilities")

plt.xlabel("Predicted Probability")

plt.ylabel("Frequency")

plt.show()

Cumulative Gains Curve

plt.figure(figsize=(8, 6))

skplt.metrics.plot_cumulative_gain(y_test, knn.predict_proba(X_test))

plt.title("Cumulative Gains Curve")

plt.xlabel("Percentage of Sample")

plt.ylabel("Cumulative Gain")

plt.show()
Lift Curve

plt.figure(figsize=(8, 6))

skplt.metrics.plot_lift_curve(y_test, knn.predict_proba(X_test))

plt.title("Lift Curve")

plt.xlabel("Percentage of Sample")

plt.ylabel("Lift")

plt.show()

References:

[1] K. Akazawa and T. Nakamura. Simulation program for estimating statistical power of Cox's proportional hazards model assuming no specific distribution for the survival time. Elsevier Ireland, 1991.

[2] American Heart Association. Heart Disease and Stroke Statistics 2009 Update. American Heart Association, Dallas, Texas, 2009.

[3] W. T. Longstreth, Jr., C. Bernick, A. Fitzpatrick, M. Cushman, L. Knepper, J. Lima, and C. Furberg. Frequency and predictors of stroke death in 5,888 participants in the Cardiovascular Health Study. Neurology, 56:368–375, February 2001.

[4] T. Lumley, R. A. Kronmal, M. Cushman, T. A. Manolio, and S. Goldstein. A stroke prediction score in the elderly: Validation and web-based application. Journal of Clinical Epidemiology, 55(2):129–136, February 2002.

[5] T. A. Manolio, R. A. Kronmal, G. L. Burke, D. H. O'Leary, and T. R. Price. Short-term predictors of incident stroke in older adults: The Cardiovascular Health Study. Stroke, 27:1479–1486, September 1996.

[6] A. P. McGinn, R. C. Kaplan, J. Verghese, D. M. Rosenbaum, B. M. Psaty, A. E. Baird, J. K. Lynch, P. A. Wolf, C. Kooperberg, J. C. Larson, and S. Wassertheil-Smoller. Walking speed and risk of incident ischemic stroke among postmenopausal women. Stroke, 39:1233–1239, April 2008.

[7] A. Kupusinac, R. Doroslovački, D. Malbaški, B. Srdić, and E. Stokić. A primary estimation of the cardiometabolic risk by using artificial neural networks. Computers in Biology and Medicine, 43(6):751–757, June 2013. doi: 10.1016/j.compbiomed.2013.04.001.

[8] Kaggle. "Stroke Prediction Dataset." https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/data
