Machine Learning_project
Anonnya Barua, AHAMMED EKRAK HOSSAIN UTSHA, SHADMAN SAKIB FARUKI, ZERIN FARZANA
1. Introduction:
Stroke is the third leading cause of death and the principal cause of serious long-term disability
in the United States. Accurate prediction of stroke is highly valuable for early intervention and
treatment. In this work, we compare the Cox proportional hazards model to a machine learning
strategy for stroke prediction using the Cardiovascular Health Study (CHS) dataset. Stroke risk
prediction can help greatly with prevention and early treatment. Several medical investigations
and data analyses have been undertaken to find effective predictors of stroke. The Framingham
Study [6, 34] reported a list of stroke risk factors including age, systolic blood pressure, the use
of anti-hypertensive therapy, diabetes mellitus, cigarette smoking, prior cardiovascular disease,
atrial fibrillation, and left ventricular hypertrophy by electrocardiogram. Furthermore, in the
past decade, a number of other studies [3, 4, 5, 6] have led to the discovery of more risk factors
such as creatinine level, time to walk 15 feet, and others. Most previous prediction models have
adopted features (risk factors) that are verified by clinical trials or selected manually by medical
experts. With a large number of features in current medical datasets, it is a cumbersome task to
identify and verify each risk factor manually. On the other hand, machine learning algorithms
are capable of identifying features highly related to stroke occurrence efficiently from the huge
set of features; therefore, we believe machine learning can be used to improve the prediction
accuracy of stroke risk and to discover new risk factors.
According to the World Stroke Organization, 13 million people have a stroke each year, and approximately
5.5 million of them die as a result. It is a leading cause of death and disability worldwide, which is
why its impact is serious in all aspects of life. Stroke affects not only the patient but also the
patient's social environment, family, and workplace. In addition, contrary to popular belief, it can happen
to anyone, at any age, regardless of gender or physical condition [2].
The goal of this machine learning project is to create a predictive model that can effectively
identify people at risk of having a stroke based on their medical history, lifestyle factors, and
demographic information. The model predicts whether or not a patient has a stroke from the
following attributes: id, gender, age, hypertension, heart disease, ever married, work type,
residence type, average glucose level, BMI, and smoking status. Stroke is the target attribute.
The research proposes to construct a machine learning model that can estimate the likelihood
of stroke in patients using a large dataset of health records. Techniques such as k-nearest
neighbors and Naïve Bayes are used to train the model. The research is significant for
healthcare: it addresses an essential healthcare need by employing technology to improve risk
assessment and stroke prevention measures, with the aim of saving lives and lowering
healthcare costs.
2. Project Methodology:
In this study, several methods were used to forecast the likelihood of a patient being diagnosed
with a brain stroke, based on the individual's medical history. The algorithm with the highest
accuracy was selected as the best model for predicting stroke. This study also
examines the role of machine learning (ML) in addressing stroke-related issues such as
prevention, risk factor identification, diagnosis, therapy, and prognosis. The systematic approach
involves reviewing the most relevant studies and providing helpful insights for further
investigation. In order to train multiple machine learning algorithms, various heart disease
datasets were collected and chosen. The data gathered were preprocessed to remove missing data,
normalize numerical values, extract and scale features, and split into training and testing sets.
Different machine learning models were then applied to the processed dataset and evaluated. The
models' performance was compared using common metrics such as accuracy and F1 score.
Data validation is a key stage in the data preprocessing phase of a machine learning project. It
entails evaluating the data's quality, consistency, and accuracy to verify that it is compatible with
analysis and modeling. The approach encompasses cleaning, preprocessing, sampling, cross-
validation, outlier identification, and assessing data quality using metrics.
This procedure entails handling missing values, scaling features, encoding categorical variables,
finding and treating outliers, and selecting useful features. Splitting the dataset into training and
testing sets, as well as subdividing it for validation, allows for more accurate evaluation of model
performance. Cross-validation procedures are used to check the robustness of the trained model
across different subsets of the data, ensuring its generalization to previously unseen data. Finally,
data validation is crucial to assuring the stability and effectiveness of machine learning models.
The steps for data validation vary depending on the type of data and analytic aims.
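The splitting and cross-validation steps described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the synthetic data from make_classification, the class ratio, the 5 folds, and the choice of K-NN are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the stroke data (assumption: about 10% positives).
X, y = make_classification(n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42)

# Hold out 20% for final testing; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 5-fold stratified cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(KNeighborsClassifier(), X_train, y_train, cv=cv, scoring="f1")

print("F1 per fold:", np.round(cv_scores, 3))
print("Mean F1:", cv_scores.mean())
```

Keeping the test set out of the cross-validation loop means the final evaluation is made on data the model has never seen during training or tuning.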
Data preprocessing is an important step in machine learning that prepares raw data for analysis
and model training. It includes a variety of strategies for cleaning, transforming, and
standardizing datasets to improve machine learning model performance and accuracy. Missing
value handling, outlier removal, categorical variable encoding, and numerical feature scaling are
common preprocessing techniques. Missing data can be imputed using methods such as mean,
median, or mode imputation, and outliers can be identified and addressed using trimming or
winsorization. Categorical variables are frequently encoded into numerical representations using
methods such as one-hot encoding or label encoding to make them compatible with machine
learning algorithms. Normalization is a preprocessing approach that scales numerical features to
a standard range, such as 0 to 1, or with a mean of 0 and a standard deviation of 1, to guarantee
that features of varied scales contribute equally to model training. Preprocessing and
normalizing data can speed up model convergence, lessen the influence of outliers, and improve
machine learning models' interpretability and generalization performance. Normalization matters
in particular for distance-based algorithms such as K-NN, where an unscaled feature with a large
range would otherwise dominate the distance computation.
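A minimal sketch of the imputation, encoding, and scaling steps described above, using scikit-learn. The four sample rows are made up, and the column names (avg_glucose_level, bmi, smoking_status) are assumed spellings of the attributes listed in the introduction. The appendix drops rows with missing values instead of imputing them, so this shows an alternative rather than the project's own pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up sample; column names are assumptions for illustration.
sample = pd.DataFrame({
    "age": [67.0, 45.0, 80.0, 33.0],
    "avg_glucose_level": [228.7, 95.1, 105.9, 88.2],
    "bmi": [36.6, np.nan, 32.5, 24.0],  # one missing value to impute
    "gender": ["Male", "Female", "Female", "Male"],
    "smoking_status": ["smokes", "never smoked", "never smoked", "formerly smoked"],
})

numeric_cols = ["age", "avg_glucose_level", "bmi"]
categorical_cols = ["gender", "smoking_status"]

# Numeric columns: fill gaps with the median, then scale to mean 0 and std 1.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: one-hot encode; sparse_threshold=0 forces a dense array.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,
)

features = preprocess.fit_transform(sample)
print(features.shape)
```

Wrapping the steps in a pipeline lets the same fitted transformations be reused on the test set, so no statistics from the test data leak into training.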
Feature extraction in machine learning is the process of selecting and transforming raw data into a more compact
representation that captures critical information pertinent to the learning goal. It seeks to minimize the
data's dimensionality while retaining its discriminative capability, making it easier for machine learning
algorithms to find patterns and predict accurately. Principal component analysis (PCA), linear
discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE) are examples of
dimensionality reduction techniques used in feature extraction. These strategies transform the original
data into a lower-dimensional space by identifying the most informative characteristics or patterns.
Algorithms often perform better when they are trained on well-defined and relevant features. Feature
extraction can help identify the most important features for a given problem and improve the accuracy
of the resulting models, as well as make the data more interpretable by transforming it into a more
understandable format.
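As an illustration of dimensionality reduction, the sketch below applies PCA to synthetic data. The data, the number of components, and the variable names are assumptions for the example; the project itself does not report using PCA on the stroke dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 6 features driven by only 2 underlying factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

# PCA is scale-sensitive, so standardize before projecting.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the 2 directions of largest variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
print("Variance retained:", pca.explained_variance_ratio_.sum())
```

The explained variance ratio indicates how much of the original variation survives the projection, which is a practical guide when choosing the number of components.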
There are many techniques for feature extraction. In this project, a correlation matrix is used
to analyse the relationship between stroke and the other variables. The dataset's features,
including age, hypertension, gender, average glucose level, BMI, smoking status, heart disease,
ever married, work_type, and residence_type, have a substantial impact on stroke (figure).
Previous medical research indicates that stroke risk factors include hypertension, heart disease,
diabetes, age, BMI, and smoking. Features with a negligible relationship to stroke were removed.
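A minimal sketch of correlation-based feature screening. The data are synthetic, and the 0.1 threshold, the noise_feature column, and the df_demo name are assumptions made for illustration; they are not values taken from the project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300
age = rng.uniform(20, 90, size=n)
bmi = rng.normal(28, 5, size=n)
noise_feature = rng.normal(size=n)  # unrelated to the target by construction

# Synthetic target: the chance of stroke rises with age (illustrative only).
stroke = (rng.uniform(size=n) < (age - 20) / 140).astype(int)

df_demo = pd.DataFrame(
    {"age": age, "bmi": bmi, "noise_feature": noise_feature, "stroke": stroke}
)

# Absolute correlation of every feature with the target, strongest first.
corr_with_target = (
    df_demo.corr()["stroke"].drop("stroke").abs().sort_values(ascending=False)
)
print(corr_with_target)

threshold = 0.1  # arbitrary cut-off chosen for this sketch
selected = corr_with_target[corr_with_target >= threshold].index.tolist()
print("Kept features:", selected)
```

Pearson correlation only captures linear relationships, so a feature with a low score may still be useful to a non-linear model.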
A Naïve Bayes classifier is a probabilistic machine learning model used for classification
tasks. The classifier is built on Bayes' theorem. It assumes that, within a class, the presence
of a certain feature is unaffected by the presence of other features, which is frequently
an oversimplification but works well in practice for many real-world datasets.
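The independence assumption can be written out as follows. The notation is introduced here for illustration: y is the class (stroke or no stroke), x_1, ..., x_n are the feature values, and \mu_{iy} and \sigma_{iy} are the mean and standard deviation of feature i within class y. The Gaussian form of the likelihood corresponds to the GaussianNB classifier used in the appendix.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{iy}^{2}}}
\exp\!\left(-\frac{(x_i - \mu_{iy})^{2}}{2\sigma_{iy}^{2}}\right)
```

The predicted class is the value of y that maximizes the right-hand side of the first expression.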
The training set was used to build the models, while the testing set was used to evaluate them.
K-nearest neighbors (K-NN), Naïve Bayes, decision trees, and support vector machines (SVM) were
the four algorithms considered for building the model; of these, K-NN and Naïve Bayes were
trained and evaluated. Accuracy and F1-score were used to evaluate the models.
K-NN had the highest accuracy and F1-score and was therefore the best-fit model for predicting
brain stroke on this dataset.
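The comparison can be sketched as below. This is an illustrative example on synthetic data, not a reproduction of the project's results: the dataset, the class ratio, and n_neighbors=5 are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced binary problem standing in for the stroke data.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the training set and score it on the held-out test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }
    print(name, results[name])
```

On imbalanced data, accuracy alone can be misleading, because a model that always predicts "no stroke" already scores highly. Reporting F1 alongside accuracy guards against this.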
[Figure: ROC curve, K-NN]
[Figure: ROC curve, Naïve Bayes]
[Figure: Histogram]
[Figure: Cumulative Gain Curve]
[Figure: Lift Curve]
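The appendix draws the cumulative gain and lift curves with scikit-plot. The quantities behind those curves can be sketched directly with NumPy, as below. The function name cumulative_gain_and_lift and the ten example scores are hypothetical and serve only as illustration.

```python
import numpy as np

def cumulative_gain_and_lift(y_true, y_score):
    """Return (fraction of sample, cumulative gain, lift) for the positive class."""
    y_true = np.asarray(y_true)
    # Rank samples from the highest predicted score to the lowest.
    order = np.argsort(-np.asarray(y_score), kind="stable")
    hits = np.cumsum(y_true[order])           # positives found so far
    n = len(y_true)
    fraction = np.arange(1, n + 1) / n        # share of the sample inspected
    gain = hits / y_true.sum()                # share of all positives found
    lift = gain / fraction                    # improvement over random selection
    return fraction, gain, lift

# Made-up labels and scores for ten patients (3 positives).
y_true = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1, 0.05, 0.0]

fraction, gain, lift = cumulative_gain_and_lift(y_true, y_score)
print(gain[:4])
print(lift[:4])
```

In this example all three positives are found within the top 40% of ranked patients, so the lift at that point is 2.5, meaning the ranking finds positives 2.5 times faster than random selection would.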
Conclusion and Future Recommendations:
The study suggests that machine learning algorithms have a high potential for effectively
predicting strokes based on numerous risk factors and medical data. These models exceed
traditional methods in terms of both predictive accuracy and efficiency. However, the study
recognizes constraints such as the need for larger, more diverse datasets and the difficulty of
model interpretation. Future research should concentrate on enhancing data quality and
availability, as well as developing more understandable machine learning approaches.
Furthermore, these models require additional validation in clinical settings to assure their
practical relevance and effectiveness in real-world circumstances. Longitudinal studies should be
conducted to test machine learning models' long-term prediction performance and impact on
patient outcomes.
Appendix:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
df_copy = df.copy()
df
Data Preprocessing
df.isnull().sum()
df = df.dropna()
df.info()
df.duplicated().sum()
df.describe()
Convert the object columns to numerical ones
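from sklearn.preprocessing import LabelEncoder, StandardScaler
scaler = StandardScaler()
label_encoder = LabelEncoder()
# Print the number of distinct values in each column
for column in df.columns:
    unique_values = df[column].nunique()
    print(f"'{column}' = {unique_values}")
df_scaled_up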
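Feature Selection
Heat Map
import seaborn as sns  # assumed library; the original heat map call is not in the source
covariance_matrix = abs(df_scaled_up.cov())
plt.figure(figsize=(10,10))
sns.heatmap(covariance_matrix, annot=True)
plt.show()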
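from sklearn.model_selection import train_test_split
X_train = df_labeled.drop(['stroke'], axis=1)
Y_train = df_labeled['stroke']
# The start of this call is truncated in the source; the first argument
# (all columns except the target) is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    df_scaled_up_final.drop(['stroke'], axis=1),
    df_scaled_up_final['stroke'],
    test_size=0.2,
    random_state=42)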
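Model: K-NN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()  # hyperparameters are not shown in the source; defaults assumed
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
Model: Naïve Bayes
from sklearn.naive_bayes import GaussianNB
naive_bayes = GaussianNB()
naive_bayes.fit(X_train, y_train)
y_pred_naive = naive_bayes.predict(X_test)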
Accuracy:
Result:
F1-Score
Confusion Matrix
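from sklearn.metrics import confusion_matrix
import seaborn as sns  # assumed library; the original plotting calls are not in the source
# K-NN (model order assumed)
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
# Naïve Bayes
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix(y_test, y_pred_naive), annot=True, fmt="d")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()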
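from sklearn.metrics import roc_curve, auc
# The curve computation and plt.plot calls are not in the source; reconstructed as an assumption
# Plot ROC curve (K-NN)
fpr, tpr, _ = roc_curve(y_test, knn.predict_proba(X_test)[:, 1])
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc(fpr, tpr):.2f})")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend(loc="lower right")
plt.show()
# Plot ROC curve (Naïve Bayes)
fpr_nb, tpr_nb, _ = roc_curve(y_test, naive_bayes.predict_proba(X_test)[:, 1])
plt.figure(figsize=(8, 6))
plt.plot(fpr_nb, tpr_nb, label=f"ROC curve (AUC = {auc(fpr_nb, tpr_nb):.2f})")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend(loc="lower right")
plt.show()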
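Plot histogram
plt.figure(figsize=(8, 6))
# The plt.hist call is not in the source; the K-NN model and bins=20 are assumptions
plt.hist(knn.predict_proba(X_test)[:, 1], bins=20)
plt.xlabel("Predicted Probability")
plt.ylabel("Frequency")
plt.show()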
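Cumulative Gain Curve
import scikitplot as skplt
plt.figure(figsize=(8, 6))
# Pass the current axes so the curve is drawn on the figure created above
skplt.metrics.plot_cumulative_gain(y_test, knn.predict_proba(X_test), ax=plt.gca())
plt.xlabel("Percentage of Sample")
plt.ylabel("Cumulative Gain")
plt.show()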
Lift Curve
plt.figure(figsize=(8, 6))
# Pass the current axes so the curve is drawn on the figure created above
skplt.metrics.plot_lift_curve(y_test, knn.predict_proba(X_test), ax=plt.gca())
plt.title("Lift Curve")
plt.xlabel("Percentage of Sample")
plt.ylabel("Lift")
plt.show()
References:
[1] K. Akazawa and T. Nakamura. Simulation program for estimating statistical power of Cox's
proportional hazards model assuming no specific distribution for the survival time. Elsevier
Ireland, 1991.
[2] American Heart Association. Heart Disease and Stroke Statistics 2009 Update. American
Heart Association, Dallas, Texas, 2009.
[3] W. T. Longstreth, Jr., C. Bernick, A. Fitzpatrick, M. Cushman, L. Knepper, J. Lima, and C. Furberg.
Frequency and predictors of stroke death in 5,888 participants in the Cardiovascular Health Study.
Neurology, 56:368–375, February 2001.
[4] T. Lumley, R. A. Kronmal, M. Cushman, T. A. Manolio, and S. Goldstein. A stroke prediction score in the
elderly: Validation and web-based application. Journal of Clinical Epidemiology, 55(2):129–136, February
2002.
[7] A. Kupusinac, R. Doroslovački, D. Malbaški, B. Srdić, and E. Stokić, "A primary estimation of the
cardiometabolic risk by using artificial neural networks," Computers in Biology and Medicine, vol. 43, no.
6, pp. 751-757, Jun. 2013. doi: 10.1016/j.compbiomed.2013.04.001