
"HEART DISEASE PREDICTION USING THE WHO DATA"

xxxxxx
REGISTRATION NO. 202003286
INTRODUCTION:
The World Health Organization estimates that 12 million deaths occur annually worldwide
due to heart diseases. In the U.S. and other developed countries, cardiovascular diseases
account for half of all deaths. Early prognosis of cardiovascular diseases can guide lifestyle
changes in high-risk patients, potentially reducing complications.

PROBLEM STATEMENT:
Cardiovascular diseases (CVDs) remain the leading cause of mortality globally, with a
significant impact on both developed and developing nations. Early identification of
individuals at risk can drastically reduce the burden of these diseases. Despite the
availability of various diagnostic tools and preventive measures, there is a need for
accurate predictive models to identify high-risk individuals early. This study seeks to
address this gap by developing a predictive model using logistic regression to forecast the
10-year risk of coronary heart disease (CHD) based on demographic, behavioral, and
medical factors.

OBJECTIVES:
The primary objective of this research is to pinpoint the most relevant risk factors
associated with heart disease and accurately predict the 10-year risk of coronary heart
disease (CHD) in patients. By leveraging logistic regression on a comprehensive dataset
from the Framingham Heart Study, this study aims to:

 Identify and quantify the influence of major risk factors for heart disease.

 Create a predictive model to estimate the 10-year risk of coronary heart disease (CHD).

 Offer insights that inform clinical decisions and lifestyle changes aimed at reducing the risk of heart disease.

Data Preparation:

Source:
The dataset used in this study is publicly available on the Kaggle website and originates
from an ongoing cardiovascular study on residents of Framingham, Massachusetts. The
classification goal is to predict whether a patient has a 10-year risk of future coronary heart
disease (CHD). The dataset includes over 4,000 records and 15 attributes.

Attributes

Each feature in the dataset serves as a potential risk factor for heart disease and is
categorized as follows:

Sex: Male or female (Nominal)

Age: Age of the patient (Continuous)

Current Smoker: Indicates if the patient currently smokes (Nominal)

Cigs Per Day: Average number of cigarettes smoked per day by the individual
(Continuous)

BP Meds: Indicates if the patient is on blood pressure medication (Nominal)

Prevalent Stroke: Indicates if the patient has had a previous stroke (Nominal)

Prevalent Hyp: Indicates if the patient is hypertensive (Nominal)

Diabetes: Indicates if the patient has diabetes (Nominal)

Tot Chol: Total cholesterol level (Continuous)

Sys BP: Systolic blood pressure (Continuous)

Dia BP: Diastolic blood pressure (Continuous)

BMI: Body Mass Index (Continuous)

Heart Rate: Heart rate (Continuous)

Glucose: Glucose level (Continuous)

Target Variable:

10-Year Risk of Coronary Heart Disease (CHD): A binary outcome where “1” indicates a
risk of CHD and “0” indicates no risk.

1. Data Processing/Exploratory Data Analysis (EDA):


1.1 Data Familiarization:
We begin by importing the necessary libraries. The key libraries are:
 Numpy: It supports large, multi-dimensional arrays and matrices, and comes
with a collection of mathematical functions to operate on these arrays.
 Pandas: Provides data structures and data analysis tools. It is essential for
handling and manipulating structured data (e.g., tables).
 matplotlib.pyplot: 2D plotting library which produces publication-quality
figures in a variety of formats and interactive environments.
 Seaborn: Statistical data visualization library built on top of matplotlib.

Next, we read the dataset from the drive into a pandas DataFrame.
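A minimal sketch of the imports and loading step is shown below; the file name is taken from the dataset listed in the References (data_cardiovascular_risk.csv), and the local path is an assumption.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (file name from the References; adjust the path to your drive)
df = pd.read_csv("data_cardiovascular_risk.csv")
df.head()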

Visualization of Data:
Data Shape and Information:

The dataset contains 3390 rows and 17 columns.


The dataset has 9 float, 6 integer, and 2 object variables. We can also observe that a few null values are present in the dataset.
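A short sketch of how the shape, data types, and null counts can be inspected (assuming the DataFrame is named df):

print(df.shape)           # (3390, 17)
df.info()                 # 9 float, 6 integer and 2 object columns
print(df.isnull().sum())  # per-column count of missing values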

1.2 Data Analysis:


1.3 Data Cleaning:

Null Values in Dataset

The columns education, cigsPerDay, BPMeds, totChol, BMI, heartRate, and glucose contain null values.

Dropping the null values


The id column does not have any significant impact on our analysis, so we drop it as well.
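A sketch of the cleaning step described above (the column name id is taken from the text):

# Drop rows containing null values
df = df.dropna()
# Drop the id column, which carries no predictive information
df = df.drop(columns=["id"])
print(df.shape)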
Outlier Treatment:

Plotting the box plot to identify outliers.
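A sketch of how the box plots can be produced; the column names are assumed to follow the dataset's casing:

cont_cols = ["cigsPerDay", "totChol", "sysBP", "diaBP", "BMI", "heartRate", "glucose"]
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[cont_cols])
plt.xticks(rotation=45)
plt.show()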

 As per the box plots, the following variables contain outliers:


o cigsPerDay: The 75th percentile is 20, but the maximum is 70. This
suggests possible outliers.
o TotChol: The maximum value (600) could be an outlier. Outlier treatment
might be necessary.
o SysBP: The maximum value (295) could be an outlier. Outlier treatment
might be necessary.
o DiaBP: The maximum value (142.5) could be an outlier. Outlier treatment
might be necessary.
o BMI: The maximum value (56.8) could be an outlier. Outlier treatment
might be necessary.
o HeartRate: The maximum value (143) could be an outlier. Outlier treatment
might be necessary.
o Glucose: The maximum value (394) could be an outlier. Outlier treatment
might be necessary

Based on this assessment, continuous variables like CigsPerDay, TotChol, SysBP, DiaBP,
BMI, HeartRate, and Glucose may benefit from outlier treatment.

Outlier Treatment:
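The report does not state which treatment was applied; one common choice is IQR-based capping, sketched below under that assumption (using the cont_cols list defined above):

# Cap each continuous variable at 1.5 * IQR beyond the quartiles
for col in cont_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)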
1.4 Visualization:

Univariate Analysis:

Histogram

 All variables appear to be approximately normally distributed.
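The histograms can be generated with a short sketch like the following:

# Histogram of every numeric column
df.hist(figsize=(14, 10), bins=30)
plt.tight_layout()
plt.show()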


Bivariate Analysis

 We can see a strong correlation between diaBP and sysBP (0.78); similarly, prevalentHyp is strongly correlated with diaBP (0.62) and sysBP (0.72).
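A sketch of the correlation heatmap used for this bivariate check:

plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()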
Multivariate Analysis:

 With this detailed EDA output, we can conclude that our data is clean and normalized, so we can proceed to build the ML model in Python.

1.5 Data Standardization:


We need to encode the categorical variables into a numeric format before performing logistic regression. In this dataset, the categorical variables are sex and is_smoking.

Standardizing the data
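A sketch of the encoding and standardization steps; the value labels (M/F, YES/NO) and the target column name TenYearCHD are assumptions about the dataset:

from sklearn.preprocessing import StandardScaler

# Encode the two categorical variables as 0/1 (value labels assumed)
df["sex"] = df["sex"].map({"M": 1, "F": 0})
df["is_smoking"] = df["is_smoking"].map({"YES": 1, "NO": 0})

# Separate features and target (target column name assumed)
X = df.drop(columns=["TenYearCHD"])
y = df["TenYearCHD"]

# Standardize the features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)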

2. Machine Learning Model:

Based on the EDA, the target variable, the 10-year risk of coronary heart disease (CHD), is categorical and binary ("1" for Yes and "0" for No). Logistic regression is well suited for binary classification tasks where the outcome is categorical.

2.1 Splitting Data

The data is split in an 80:20 ratio, with 80% used for training and 20% for testing.
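A sketch of the split using scikit-learn:

from sklearn.model_selection import train_test_split

# 80% training data, 20% test data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)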
2.2 Training the Model
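A minimal sketch of fitting the logistic regression model on the training split:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)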

2.3 Evaluating the Model Performance

To evaluate the model's performance, we use the following methods, sketched in code after the list:

o ROC Score
o ROC Curve
o Classification Report
o Confusion Matrix
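A sketch of computing these metrics with scikit-learn, assuming the fitted model and test split from the previous steps:

from sklearn.metrics import (roc_auc_score, roc_curve,
                             classification_report, confusion_matrix)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]   # probability of the CHD class

print("ROC AUC:", roc_auc_score(y_test, y_prob))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# ROC curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label="Logistic regression")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()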

ROC AUC Score

 A ROC AUC score of 0.7103 indicates that the model has acceptable
discrimination. This means the model can distinguish between the positive class
(individuals at risk of CHD) and the negative class (individuals not at risk of CHD)
better than random guessing but isn't performing exceptionally well.

ROC Curve

Output
Classification Report

 Precision
o Class 0 (No CHD): Precision is 0.86, indicating that 86% of the instances
predicted as class 0 are actually class 0.
o Class 1 (CHD): Precision is 0.64, meaning that 64% of the instances
predicted as class 1 are actually class 1
 Recall (Sensitivity or True Positive Rate)
o Class 0 (No CHD): Recall is 0.99, indicating that the model correctly
identifies 99% of the actual class 0 instances.
o Class 1 (CHD): Recall is 0.08, meaning the model correctly identifies only
8% of the actual class 1 instances, which is very low.

Confusion Matrix
 The model correctly identified 492 individuals who do not have CHD. This
indicates good performance in predicting the absence of the condition.
 The model correctly identified only 7 individuals who have CHD. This is a low
number, indicating poor performance in identifying individuals at risk.
 The model incorrectly identified 4 individuals as having CHD when they do not.
This is a relatively small number, indicating that the model does not frequently
misclassify non-risk individuals as being at risk.
 The model failed to identify 83 individuals who actually have CHD. This is a
significant number, indicating that the model misses many individuals at risk,
which is critical in a healthcare context where early detection is vital.
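These counts are consistent with the metrics reported above: precision for the CHD class is 7 / (7 + 4) ≈ 0.64, recall is 7 / (7 + 83) ≈ 0.08, and overall accuracy is (492 + 7) / (492 + 7 + 4 + 83) = 499 / 586 ≈ 0.85.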

Conclusion:

 The ROC AUC score of 0.71 indicates that the logistic regression model has
acceptable discrimination capability between individuals at risk of CHD and those
not at risk. This means the model is better than random guessing but still has room
for improvement.
 The model achieves a high overall accuracy of 85%, which seems impressive at
first glance. However, this high accuracy is misleading due to the significant class
imbalance, where the majority class (no CHD) dominates.
 The recall (sensitivity) for the minority class (CHD) is very low at 8%. This
indicates that the model fails to identify a large portion of individuals who actually
have CHD, which is a critical issue for a healthcare predictive model.
 Precision for the minority class (CHD) is moderate at 64%, meaning that when the
model predicts CHD, it is correct 64% of the time. However, the F1-score for CHD
is very low at 14%, reflecting the poor balance between precision and recall for this
class.
 The confusion matrix reveals a high number of false negatives (83), indicating that
many individuals at risk of CHD are not being correctly identified. This is a
significant shortcoming since early identification of at-risk individuals is crucial for
preventive measures.

References:

data_cardiovascular_risk.csv — cardiovascular risk dataset (Kaggle).