Case Study
xxxxxx
REGISTRATION NO. 202003286
INTRODUCTION:
The World Health Organization estimates that 12 million deaths occur annually worldwide
due to heart diseases. In the U.S. and other developed countries, cardiovascular diseases
account for half of all deaths. Early prognosis of cardiovascular diseases can guide lifestyle
changes in high-risk patients, potentially reducing complications.
PROBLEM STATEMENT:
Cardiovascular diseases (CVDs) remain the leading cause of mortality globally, with a
significant impact on both developed and developing nations. Early identification of
individuals at risk can drastically reduce the burden of these diseases. Despite the
availability of various diagnostic tools and preventive measures, there is a need for
accurate predictive models to identify high-risk individuals early. This study seeks to
address this gap by developing a predictive model using logistic regression to forecast the
10-year risk of coronary heart disease (CHD) based on demographic, behavioral, and
medical factors.
OBJECTIVES:
The primary objective of this research is to pinpoint the most relevant risk factors
associated with heart disease and accurately predict the 10-year risk of coronary heart
disease (CHD) in patients. By leveraging logistic regression on a comprehensive dataset
from the Framingham Heart Study, this study aims to:
Identify and quantify the influence of major risk factors for heart disease.
Create a predictive model to estimate the 10-year risk of coronary heart disease
(CHD).
Data Preparation:
Source:
The dataset used in this study is publicly available on the Kaggle website and originates
from an ongoing cardiovascular study on residents of Framingham, Massachusetts. The
classification goal is to predict whether a patient has a 10-year risk of future coronary heart
disease (CHD). The dataset includes over 4,000 records and 15 attributes.
Attributes
Each feature in the dataset serves as a potential risk factor for heart disease and is
categorized as follows:
Cigs Per Day: Average number of cigarettes smoked per day by the individual
(Continuous)
Prevalent Stroke: Indicates if the patient has had a previous stroke (Nominal)
Target Variable:
10-Year Risk of Coronary Heart Disease (CHD): A binary outcome where “1” indicates a
risk of CHD and “0” indicates no risk.
Visualization of Data:
Data Shape and Information:
The columns education, CigsPerDay, BPMeds, totChol, BMI, heartrate, and glucose
contain null values.
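As a rough illustration of how such a null-value check can be done, the sketch below counts missing entries per column. The rows are made-up toy values, not records from the Framingham data, and `count_missing` is a hypothetical helper. With pandas, `df.isnull().sum()` gives the same per-column counts.

```python
from collections import Counter


def count_missing(rows, columns):
    """Return {column: number of rows where the value is missing (None)}."""
    missing = Counter()
    for row in rows:
        for col in columns:
            if row.get(col) is None:
                missing[col] += 1
    return {col: missing[col] for col in columns}


# Toy rows standing in for the real dataset (values are invented).
rows = [
    {"education": 1, "cigsPerDay": 20, "glucose": None},
    {"education": None, "cigsPerDay": 0, "glucose": 85},
    {"education": 2, "cigsPerDay": None, "glucose": None},
]

missing = count_missing(rows, ["education", "cigsPerDay", "glucose"])
print(missing)
```

Columns with a non-zero count then need either imputation or row removal before modelling.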
Based on this assessment, continuous variables like CigsPerDay, TotChol, SysBP, DiaBP,
BMI, HeartRate, and Glucose may benefit from outlier treatment.
Outlier Treatment:
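The report does not state which outlier method was applied. One common choice, sketched here purely as an assumption, is to cap values outside the 1.5 x IQR fences. The `sys_bp` readings are invented for the example, and `iqr_bounds` and `cap_outliers` are hypothetical helper names.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(values, k=1.5):
    """Clip every value to the IQR fences instead of dropping the row."""
    low, high = iqr_bounds(values, k)
    return [min(max(v, low), high) for v in values]


# Invented systolic blood pressure readings with one extreme value.
sys_bp = [110, 118, 120, 122, 125, 130, 135, 295]
capped = cap_outliers(sys_bp)
print(capped)
```

Capping keeps the record in the dataset while limiting its influence; dropping the row is the alternative when the value is clearly a recording error.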
1.4 Visualization:
Univariate Analysis:
Histogram
diaBP and SysBP are strongly correlated (0.78). Similarly, prevalent Hyp is
strongly correlated with both diaBP (0.62) and SysBP (0.72).
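The coefficients quoted above are pairwise correlations. A minimal sketch of how a Pearson correlation is computed follows; the blood pressure values are invented, so the result will not match the 0.78 reported for the real data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Invented readings: systolic and diastolic pressure tend to rise together.
sys_bp = [110, 120, 135, 150, 165]
dia_bp = [70, 82, 85, 95, 100]

r = pearson(sys_bp, dia_bp)
print(round(r, 2))
```

Strongly correlated predictors such as these carry overlapping information, which can make individual logistic regression coefficients harder to interpret.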
Multivariate Analysis:
Based on the detailed output of the EDA, we conclude that the data is clean and
normalized, so we can proceed to build the ML model in Python.
The EDA shows that the target variable, the 10-year risk of coronary heart disease
(CHD), is binary ("1" for Yes and "0" for No). Logistic regression is well suited to
binary classification tasks such as this one.
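To make the idea concrete, the sketch below shows how a logistic regression turns a weighted sum of risk factors into a probability and then into a 0/1 prediction. The coefficients, intercept, and patient values are hypothetical placeholders chosen for illustration; they are not fitted values from this study.

```python
import math


def predict_probability(features, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients (NOT estimated from the Framingham data).
coefficients = {"cigsPerDay": 0.02, "sysBP": 0.015, "glucose": 0.005}
intercept = -4.0

# Hypothetical patient.
patient = {"cigsPerDay": 10, "sysBP": 140, "glucose": 80}

p = predict_probability(patient, coefficients, intercept)
label = 1 if p >= 0.5 else 0  # default 0.5 decision threshold
print(round(p, 3), label)
```

The 0.5 threshold is the usual default; the predicted class depends on it, while the probability itself does not.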
The data is split 80:20, with 80% used for training and 20% held out for
testing.
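A minimal sketch of such a split is shown below, assuming a random shuffle with a fixed seed. The seed and the helper name `train_test_split_rows` are assumptions made for the example. In practice scikit-learn's `train_test_split` with `test_size=0.2` is the usual tool.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out `test_fraction` for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the dataset: 100 row indices.
rows = list(range(100))
train, test = train_test_split_rows(rows)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, so reported metrics can be regenerated on the same test rows.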
2.2 Training the Model
o ROC Score
o ROC Curve
o Classification Report
o Confusion Matrix
A ROC AUC score of 0.7103 indicates that the model has acceptable
discrimination. This means the model can distinguish between the positive class
(individuals at risk of CHD) and the negative class (individuals not at risk of CHD)
better than random guessing but isn't performing exceptionally well.
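One way to read the ROC AUC is as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes it that way on invented labels and scores; scikit-learn's `roc_auc_score` yields the same quantity for real predictions.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]

auc = roc_auc(y_true, y_score)
print(round(auc, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why 0.71 reads as acceptable but not strong.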
ROC Curve
Output
Classification Report
Precision
o Class 0 (No CHD): Precision is 0.86, indicating that 86% of the instances
predicted as class 0 are actually class 0.
o Class 1 (CHD): Precision is 0.64, meaning that 64% of the instances
predicted as class 1 are actually class 1
Recall (Sensitivity or True Positive Rate)
o Class 0 (No CHD): Recall is 0.99, indicating that the model correctly
identifies 99% of the actual class 0 instances.
o Class 1 (CHD): Recall is 0.08, meaning the model correctly identifies only
8% of the actual class 1 instances, which is very low.
Confusion Matrix
True negatives: the model correctly identified 492 individuals who do not have
CHD. This indicates good performance in predicting the absence of the condition.
True positives: the model correctly identified only 7 individuals who have CHD.
This is a low number, indicating poor performance in identifying individuals at risk.
False positives: the model incorrectly identified 4 individuals as having CHD when
they do not. This is a relatively small number, indicating that the model does not
frequently misclassify non-risk individuals as being at risk.
False negatives: the model failed to identify 83 individuals who actually have CHD.
This is a significant number, indicating that the model misses many individuals at
risk, which is critical in a healthcare context where early detection is vital.
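The precision, recall, F1, and accuracy figures in this report can be reproduced from the four confusion matrix counts above. The sketch below does the arithmetic and also compares against a baseline that always predicts "no CHD", which is added here only to illustrate the effect of class imbalance.

```python
# Counts taken from the confusion matrix described above.
tn, fp, fn, tp = 492, 4, 83, 7

total = tn + fp + fn + tp                  # 586 test records
accuracy = (tp + tn) / total               # about 0.85

# Class 1 (CHD)
precision_1 = tp / (tp + fp)               # about 0.64
recall_1 = tp / (tp + fn)                  # about 0.08
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)  # about 0.14

# Class 0 (No CHD)
precision_0 = tn / (tn + fn)               # about 0.86
recall_0 = tn / (tn + fp)                  # about 0.99

# Baseline that labels every patient as "no CHD".
baseline_accuracy = (tn + fp) / total      # about 0.846

print(f"accuracy={accuracy:.3f} baseline={baseline_accuracy:.3f}")
print(f"CHD: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

The model's accuracy exceeds the always-negative baseline by less than one percentage point, which shows why accuracy alone says little on this dataset.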
Conclusion:
The ROC AUC score of 0.71 indicates that the logistic regression model has
acceptable discrimination capability between individuals at risk of CHD and those
not at risk. This means the model is better than random guessing but still has room
for improvement.
The model achieves a high overall accuracy of 85%, which seems impressive at
first glance. However, this high accuracy is misleading due to the significant class
imbalance, where the majority class (no CHD) dominates.
The recall (sensitivity) for the minority class (CHD) is very low at 8%. This
indicates that the model fails to identify a large portion of individuals who actually
have CHD, which is a critical issue for a healthcare predictive model.
Precision for the minority class (CHD) is moderate at 64%, meaning that when the
model predicts CHD, it is correct 64% of the time. However, the F1-score for CHD
is very low at 14%, reflecting the poor balance between precision and recall for this
class.
The confusion matrix reveals a high number of false negatives (83), indicating that
many individuals at risk of CHD are not being correctly identified. This is a
significant shortcoming since early identification of at-risk individuals is crucial for
preventive measures.
References:
data_cardiovascular_risk.csv
Plagiarism Details:
Duplichecker-Plagiarism-Report-0.02164000 1719521388.docx