
"HEART DISEASE PREDICTION USING THE WHO DATA"

xxxxxx
REGISTRATION NO. 202003286
INTRODUCTION:
The World Health Organization estimates that 12 million deaths occur annually worldwide
due to heart diseases. In the U.S. and other developed countries, cardiovascular diseases
account for half of all deaths. Early prognosis of cardiovascular diseases can guide lifestyle
changes in high-risk patients, potentially reducing complications.

PROBLEM STATEMENT:
Cardiovascular diseases (CVDs) remain the leading cause of mortality globally, with a
significant impact on both developed and developing nations. Early identification of
individuals at risk can drastically reduce the burden of these diseases. Despite the
availability of various diagnostic tools and preventive measures, there is a need for
accurate predictive models to identify high-risk individuals early. This study seeks to
address this gap by developing a predictive model using logistic regression to forecast the
10-year risk of coronary heart disease (CHD) based on demographic, behavioral, and
medical factors.

OBJECTIVES:
The primary objective of this research is to pinpoint the most relevant risk factors
associated with heart disease and accurately predict the 10-year risk of coronary heart
disease (CHD) in patients. By leveraging logistic regression on a comprehensive dataset
from the Framingham Heart Study, this study aims to:

 Identify and quantify the influence of major risk factors for heart disease.

 Create a predictive model to estimate the 10-year risk of coronary heart disease (CHD).

 Offer insights that inform clinical decisions and lifestyle changes aimed at reducing the risk of heart disease.

Data Preparation:

Source:
The dataset used in this study is publicly available on the Kaggle website and originates
from an ongoing cardiovascular study on residents of Framingham, Massachusetts. The
classification goal is to predict whether a patient has a 10-year risk of future coronary heart
disease (CHD). The dataset includes over 4,000 records and 15 attributes.

Attributes

Each feature in the dataset serves as a potential risk factor for heart disease and is
categorized as follows:

Sex: Male or female (Nominal)

Age: Age of the patient (Continuous)

Current Smoker: Indicates if the patient currently smokes (Nominal)

Cigs Per Day: Average number of cigarettes smoked per day by the individual
(Continuous)

BP Meds: Indicates if the patient is on blood pressure medication (Nominal)

Prevalent Stroke: Indicates if the patient has had a previous stroke (Nominal)

Prevalent Hyp: Indicates if the patient is hypertensive (Nominal)

Diabetes: Indicates if the patient has diabetes (Nominal)

Tot Chol: Total cholesterol level (Continuous)

Sys BP: Systolic blood pressure (Continuous)

Dia BP: Diastolic blood pressure (Continuous)

BMI: Body Mass Index (Continuous)

Heart Rate: Heart rate (Continuous)

Glucose: Glucose level (Continuous)

Target Variable:

10-Year Risk of Coronary Heart Disease (CHD): A binary outcome where “1” indicates a
risk of CHD and “0” indicates no risk.

1. Data Processing/Exploratory Data Analysis (EDA):


1.1 Data Familiarization:
We begin by importing the necessary libraries. The key libraries are:
 Numpy: It supports large, multi-dimensional arrays and matrices, and comes
with a collection of mathematical functions to operate on these arrays.
 Pandas: Provides data structures and data analysis tools. It is essential for
handling and manipulating structured data (e.g., tables).
 matplotlib.pyplot: 2D plotting library which produces publication-quality
figures in a variety of formats and interactive environments.
 Seaborn: Statistical data visualization library built on top of matplotlib.

Next, we read the dataset from the drive into a pandas DataFrame.
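A minimal sketch of the imports and loading step is shown below; the file name is taken from the dataset listed in the References (data_cardiovascular_risk.csv), and the local path is an assumption.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (file name from the References; adjust the path to your drive)
df = pd.read_csv("data_cardiovascular_risk.csv")
df.head()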

Visualization of Data:
Data Shape and Information:

The dataset contains 3390 rows and 17 columns.


The dataset has 9 float, 6 integer, and 2 object variables. We can also observe that a few null values are present in the dataset.
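A short sketch of how the shape, data types, and null counts can be inspected (assuming the DataFrame is named df):

print(df.shape)           # (3390, 17)
df.info()                 # 9 float, 6 integer and 2 object columns
print(df.isnull().sum())  # per-column count of missing values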

1.2 Data Analysis:


1.3 Data Cleaning:

Null Values in Dataset

The columns education, cigsPerDay, BPMeds, totChol, BMI, heartRate, and glucose contain null values.

Dropping the null values


The id column does not have any significant impact on our analysis, so we drop it as well.
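A sketch of the cleaning step described above (the column name id is taken from the text):

# Drop rows containing null values
df = df.dropna()
# Drop the id column, which carries no predictive information
df = df.drop(columns=["id"])
print(df.shape)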
Outlier Treatment:

Plotting the box plot to identify outliers.
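A sketch of how the box plots can be produced; the column names are assumed to follow the dataset's casing:

cont_cols = ["cigsPerDay", "totChol", "sysBP", "diaBP", "BMI", "heartRate", "glucose"]
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[cont_cols])
plt.xticks(rotation=45)
plt.show()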

 As per the box plots, the following variables contain outliers:


o cigsPerDay: The 75th percentile is 20, but the maximum is 70. This
suggests possible outliers.
o TotChol: The maximum value (600) could be an outlier. Outlier treatment
might be necessary.
o SysBP: The maximum value (295) could be an outlier. Outlier treatment
might be necessary.
o DiaBP: The maximum value (142.5) could be an outlier. Outlier treatment
might be necessary.
o BMI: The maximum value (56.8) could be an outlier. Outlier treatment
might be necessary.
o HeartRate: The maximum value (143) could be an outlier. Outlier treatment
might be necessary.
o Glucose: The maximum value (394) could be an outlier. Outlier treatment
might be necessary

Based on this assessment, continuous variables like CigsPerDay, TotChol, SysBP, DiaBP,
BMI, HeartRate, and Glucose may benefit from outlier treatment.

Outlier Treatment:
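The report does not state which treatment was applied; one common choice is IQR-based capping, sketched below under that assumption (using the cont_cols list defined above):

# Cap each continuous variable at 1.5 * IQR beyond the quartiles
for col in cont_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)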
1.4 Visualization:

Univariate Analysis:

Histogram

 All variables appear to be approximately normally distributed.
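The histograms can be generated with a short sketch like the following:

# Histogram of every numeric column
df.hist(figsize=(14, 10), bins=30)
plt.tight_layout()
plt.show()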


Bivariate Analysis

 We can see a strong correlation between diaBP and sysBP (0.78); similarly, prevalentHyp is strongly correlated with diaBP (0.62) and sysBP (0.72).
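A sketch of the correlation heatmap used for this bivariate check:

plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()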
Multivariate Analysis:

 With this detailed EDA output, we can conclude that our data is clean and normalized, so we can proceed to build the ML model in Python.

1.5 Data Standardization:


We need to encode the categorical variables into a numeric format before performing logistic regression. In this dataset, the categorical variables are sex and is_smoking.

Standardizing the data
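A sketch of the encoding and standardization steps; the value labels (M/F, YES/NO) and the target column name TenYearCHD are assumptions about the dataset:

from sklearn.preprocessing import StandardScaler

# Encode the two categorical variables as 0/1 (value labels assumed)
df["sex"] = df["sex"].map({"M": 1, "F": 0})
df["is_smoking"] = df["is_smoking"].map({"YES": 1, "NO": 0})

# Separate features and target (target column name assumed)
X = df.drop(columns=["TenYearCHD"])
y = df["TenYearCHD"]

# Standardize the features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)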

2. Machine Learning Model:

Based on the EDA, the target variable, the 10-year risk of coronary heart disease (CHD), is categorical and binary ("1" for Yes and "0" for No). Logistic regression is well suited for binary classification tasks where the outcome is categorical.

2.1 Splitting Data

The data is split in an 80:20 ratio, with 80% used for training and 20% for testing.
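A sketch of the split using scikit-learn:

from sklearn.model_selection import train_test_split

# 80% training data, 20% test data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)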
2.2 Training the Model
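A minimal sketch of fitting the logistic regression model on the training split:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)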

2.3 Evaluating the Model Performance

To evaluate the model's performance, we use the following methods, sketched in code after the list:

o ROC Score
o ROC Curve
o Classification Report
o Confusion Matrix
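A sketch of computing these metrics with scikit-learn, assuming the fitted model and test split from the previous steps:

from sklearn.metrics import (roc_auc_score, roc_curve,
                             classification_report, confusion_matrix)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]   # probability of the CHD class

print("ROC AUC:", roc_auc_score(y_test, y_prob))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# ROC curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label="Logistic regression")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()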

ROC AUC Score

 A ROC AUC score of 0.7103 indicates that the model has acceptable
discrimination. This means the model can distinguish between the positive class
(individuals at risk of CHD) and the negative class (individuals not at risk of CHD)
better than random guessing but isn't performing exceptionally well.

ROC Curve

Output
Classification Report

 Precision
o Class 0 (No CHD): Precision is 0.86, indicating that 86% of the instances
predicted as class 0 are actually class 0.
o Class 1 (CHD): Precision is 0.64, meaning that 64% of the instances
predicted as class 1 are actually class 1
 Recall (Sensitivity or True Positive Rate)
o Class 0 (No CHD): Recall is 0.99, indicating that the model correctly
identifies 99% of the actual class 0 instances.
o Class 1 (CHD): Recall is 0.08, meaning the model correctly identifies only
8% of the actual class 1 instances, which is very low.

Confusion Matrix
 The model correctly identified 492 individuals who do not have CHD. This
indicates good performance in predicting the absence of the condition.
 The model correctly identified only 7 individuals who have CHD. This is a low
number, indicating poor performance in identifying individuals at risk.
 The model incorrectly identified 4 individuals as having CHD when they do not.
This is a relatively small number, indicating that the model does not frequently
misclassify non-risk individuals as being at risk.
 The model failed to identify 83 individuals who actually have CHD. This is a
significant number, indicating that the model misses many individuals at risk,
which is critical in a healthcare context where early detection is vital.
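These counts are consistent with the metrics reported above: precision for the CHD class is 7 / (7 + 4) ≈ 0.64, recall is 7 / (7 + 83) ≈ 0.08, and overall accuracy is (492 + 7) / (492 + 7 + 4 + 83) = 499 / 586 ≈ 0.85.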

Conclusion:

 The ROC AUC score of 0.71 indicates that the logistic regression model has
acceptable discrimination capability between individuals at risk of CHD and those
not at risk. This means the model is better than random guessing but still has room
for improvement.
 The model achieves a high overall accuracy of 85%, which seems impressive at
first glance. However, this high accuracy is misleading due to the significant class
imbalance, where the majority class (no CHD) dominates.
 The recall (sensitivity) for the minority class (CHD) is very low at 8%. This
indicates that the model fails to identify a large portion of individuals who actually
have CHD, which is a critical issue for a healthcare predictive model.
 Precision for the minority class (CHD) is moderate at 64%, meaning that when the
model predicts CHD, it is correct 64% of the time. However, the F1-score for CHD
is very low at 14%, reflecting the poor balance between precision and recall for this
class.
 The confusion matrix reveals a high number of false negatives (83), indicating that
many individuals at risk of CHD are not being correctly identified. This is a
significant shortcoming since early identification of at-risk individuals is crucial for
preventive measures.

References:

data_cardiovascular_risk.csv — cardiovascular risk dataset (Kaggle).