Mini Project
CHAPTER-1
INTRODUCTION
Cardiovascular disease (CVD) is a collective term for all conditions affecting the blood
circulatory system, including the heart and the vasculature, which respectively pump and
carry the blood.
The heart has four valves: the aortic, mitral, pulmonary and tricuspid valves. They open
and close to move blood through the heart. Many things can damage the heart valves. A
damaged valve may become narrowed (stenosis), become leaky (regurgitation or insufficiency)
or fail to close properly (prolapse).
Cardiovascular diseases account for approximately 31% of all global deaths, making them a
critical area for medical research and intervention. Traditional methods of diagnosing CVDs
often involve invasive procedures and can be expensive. Machine learning offers a non-
invasive, cost-effective approach to predicting CVD risk based on readily available health
data. Previous studies have demonstrated the potential of machine learning models in
healthcare, particularly for predicting diseases. However, continuous improvement in model
accuracy and reliability is necessary to ensure these tools can be effectively used in clinical
practice.
The dataset used in this project contains various health-related parameters for individuals,
including age, gender, height, weight, blood pressure (systolic and diastolic), cholesterol
levels, glucose levels, smoking status, alcohol intake, physical activity, and whether the
individual has cardiovascular disease. The dataset comprises thousands of records, providing
a rich source of information for training and evaluating the prediction model.
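As a minimal sketch, one record of such a dataset could be represented as a typed structure and parsed from CSV text. The field names, the units (years, cm, kg), the 0/1 encoding of the yes/no fields, and the two sample rows are all assumptions made for illustration; the report lists the parameters but not the actual column names or values.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class PatientRecord:
    # Hypothetical field names and units; not taken from the project's dataset.
    age_years: float
    gender: str
    height_cm: float
    weight_kg: float
    systolic_bp: int
    diastolic_bp: int
    cholesterol_level: int
    glucose_level: int
    smoker: bool
    alcohol: bool
    active: bool
    has_cvd: bool  # target label


def parse_row(row: dict) -> PatientRecord:
    """Convert one CSV row (all values are strings) into a typed record."""
    return PatientRecord(
        age_years=float(row["age_years"]),
        gender=row["gender"],
        height_cm=float(row["height_cm"]),
        weight_kg=float(row["weight_kg"]),
        systolic_bp=int(row["systolic_bp"]),
        diastolic_bp=int(row["diastolic_bp"]),
        cholesterol_level=int(row["cholesterol_level"]),
        glucose_level=int(row["glucose_level"]),
        smoker=row["smoker"] == "1",
        alcohol=row["alcohol"] == "1",
        active=row["active"] == "1",
        has_cvd=row["has_cvd"] == "1",
    )


def load_records(text: str) -> list:
    """Parse CSV text with a header line into a list of PatientRecord objects."""
    return [parse_row(row) for row in csv.DictReader(io.StringIO(text))]


# Two made-up rows, used only to show the parsing step.
SAMPLE_CSV = (
    "age_years,gender,height_cm,weight_kg,systolic_bp,diastolic_bp,"
    "cholesterol_level,glucose_level,smoker,alcohol,active,has_cvd\n"
    "52,F,165,70,130,85,2,1,0,0,1,1\n"
    "41,M,178,82,120,80,1,1,1,0,1,0\n"
)

records = load_records(SAMPLE_CSV)
```

Typing the fields up front makes unit or encoding mistakes visible at load time rather than during model training.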
CHAPTER-2
LITERATURE SURVEY
2.1 REVIEW 1:
Authors - A. Rama.
2.2 REVIEW 2:
Authors - Jiawei Zhou.
Outcome - ML algorithms are effective for mining real-world data in CVD research. Future
work is needed for method development and CVD applications.
2.3 REVIEW 3:
Authors - PrasannaVenkatesan Theerthagiri.
Title - Predictive analysis of cardiovascular disease using gradient boosting based learning
and recursive feature elimination technique.
Outcome - RFE achieved the best accuracy, 88.8%. RFE-GB outperformed the other methods,
with an AUC of 0.85.
2.4 REVIEW 4:
Authors - Khomkrit Yongcharoenchaiyasit.
Title - Gradient Boosting Based Model for Elderly Heart Failure, Aortic Stenosis, and
Dementia Classification.
Outcome - The paper proposed a gradient boosting (GB) based model for the multiclass
classification of heart failure, aortic stenosis, and dementia in the elderly. The GB model
achieved the highest accuracy of 83.81% after applying feature engineering techniques.
CHAPTER-3
OBJECTIVES AND PROBLEM STATEMENT
3.1 Objectives
3.1.1 Data Preprocessing:
Load the Dataset: Import and preprocess a dataset containing medical and personal
information relevant to cardiovascular health.
Data Splitting: Split the data into training and testing sets to evaluate the
performance of the models.
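The splitting step can be sketched with scikit-learn's `train_test_split`. The synthetic feature matrix, the 80/20 ratio, the fixed `random_state`, and the use of `stratify` are assumptions for illustration; the report does not state which settings were used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 4 numeric features, balanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0, 1] * 50)

# Hold out 20% of the rows for testing. stratify=y keeps the class balance
# the same in both subsets; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

Stratifying matters most when one class is rare, because an unlucky random split could otherwise leave the test set with too few positive cases to evaluate on.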
3.1.6 Visualization:
Prediction Comparison Graph: Provide a graphical comparison of prediction
probabilities from different models to help users understand the relative performance
and confidence of each model.
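A comparison graph of this kind could be drawn as a grouped bar chart with matplotlib. The sketch below is an illustration only: the function name `plot_probability_comparison` is hypothetical, and the probabilities in `example` are invented placeholders, not results from this project.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_probability_comparison(probabilities):
    """Draw one group of bars per model and one bar per class.

    `probabilities` maps a model name to a (p_healthy, p_disease) pair.
    """
    models = list(probabilities)
    healthy = [probabilities[name][0] for name in models]
    disease = [probabilities[name][1] for name in models]

    x = np.arange(len(models))
    width = 0.35
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.bar(x - width / 2, healthy, width, label="Healthy")
    ax.bar(x + width / 2, disease, width, label="Cardiovascular disease")
    ax.set_xticks(x)
    ax.set_xticklabels(models)
    ax.set_ylim(0, 1)
    ax.set_ylabel("Predicted probability")
    ax.legend()
    return fig


# Placeholder values chosen only to exercise the function.
example = {
    "Logistic Regression": (0.62, 0.38),
    "Random Forest": (0.55, 0.45),
    "Gradient Boosting": (0.70, 0.30),
}
fig = plot_probability_comparison(example)
```

In a Streamlit application the returned figure could be displayed with `st.pyplot(fig)`.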
The main problem addressed in this project is the accurate prediction of cardiovascular
diseases based on available health data. The objective is to develop a robust and reliable
model that can assist healthcare providers in early diagnosis and intervention.
Given the global burden of cardiovascular diseases and the potential of machine learning to
improve early detection, there is a pressing need to develop accurate, reliable, and
interpretable predictive models. Such models can aid clinicians in identifying high-risk
individuals, guiding preventive measures, and ultimately reducing the incidence of CVDs.
CHAPTER-4
METHODOLOGY
CHAPTER-5
HARDWARE AND SOFTWARE DETAILS
5.1 SOFTWARE DETAILS
5.1.1 Programming Language:
Python: Python is the primary programming language used in this project due to its
simplicity, readability, and extensive support for data science and machine learning
libraries.
The local machine is used for initial data exploration, preprocessing, and model development.
It provides sufficient computational power for running standard machine learning algorithms
on small to medium-sized datasets.
CHAPTER-6
RESULTS AND DISCUSSIONS
CHAPTER-7
ADVANTAGES AND APPLICATIONS
7.1 ADVANTAGES
7.1.1
CHAPTER-8
CONCLUSION AND FUTURE SCOPE
8.1 CONCLUSION
In this project, we developed a web-based application using Streamlit to predict
cardiovascular disease using three different machine learning models: Logistic Regression
(via SGDClassifier), Random Forest, and Gradient Boosting. The application allows users to
input their personal and medical information, including age, height, weight, gender, blood
pressure, cholesterol, glucose levels, smoking status, alcohol intake, and physical activity.
Based on these inputs, the models provide predictions on the likelihood of the user being
healthy or having cardiovascular disease.
The logistic regression model, implemented using SGDClassifier, is particularly suited for
large datasets with a simple linear decision boundary. It operates by minimizing the logistic
loss using stochastic gradient descent, making it efficient for high-dimensional data.
However, its performance can be limited when dealing with non-linear relationships in the
data. The Random Forest model, on the other hand, is an ensemble method that builds
multiple decision trees and combines their outputs. Averaging over many trees makes this
model less prone to overfitting than a single decision tree, and it can capture complex
interactions between features. Finally, the Gradient Boosting model is
another ensemble technique that builds trees sequentially, each one trying to correct the errors
of the previous one. This model is particularly powerful for capturing intricate patterns in the
data but can be computationally intensive.
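To make the phrase "minimizing the logistic loss using stochastic gradient descent" concrete, the following is a minimal pure-Python sketch of the idea. It is a simplified stand-in, not SGDClassifier itself: it uses a fixed learning rate and no regularization, whereas the library applies a penalty term and a learning-rate schedule by default. The function names and the toy data are hypothetical.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logistic_loss(weights, bias, samples) -> float:
    """Mean logistic (cross-entropy) loss over (features, label) pairs."""
    total = 0.0
    for x, y in samples:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(samples)


def sgd_fit(samples, learning_rate=0.1, epochs=50, seed=0):
    """Fit a linear classifier by updating on one sample at a time."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    order = list(samples)
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order
        for x, y in order:
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # gradient of the loss w.r.t. the linear score
            weights = [w - learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


# Toy, linearly separable data: the label is 1 when the feature is positive.
data = [([x / 10.0], 1 if x > 0 else 0) for x in range(-20, 21) if x != 0]
w, b = sgd_fit(data)
```

Because each update touches a single sample, the cost per step does not grow with the dataset size, which is why this approach scales to large datasets. The decision boundary remains linear, which is the limitation noted above.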
The application not only provides individual predictions for each model but also includes a
feature to compare the prediction probabilities visually. The comparison graph displays the
probabilities of being healthy and having heart disease for each model, allowing users to
understand the differences in model predictions. This feature is particularly useful for
highlighting how different algorithms may interpret the same input data and helps in
assessing the reliability and agreement between the models.
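Agreement between the models can also be summarized numerically. The sketch below is one possible approach under stated assumptions: the function name `summarize_agreement`, the 0.5 decision threshold, and the example probabilities are all hypothetical and are not taken from the project.

```python
from statistics import mean


def summarize_agreement(disease_probabilities, threshold=0.5):
    """Summarize how far the models agree on one input.

    `disease_probabilities` maps a model name to its predicted
    probability of cardiovascular disease for the same input.
    """
    votes = {name: p >= threshold for name, p in disease_probabilities.items()}
    positive = sum(votes.values())
    values = list(disease_probabilities.values())
    return {
        "votes": votes,
        "unanimous": positive in (0, len(votes)),
        "majority_has_disease": positive > len(votes) / 2,
        "mean_probability": mean(values),
        # A large spread means the models disagree strongly on this input.
        "spread": max(values) - min(values),
    }


# Placeholder probabilities, invented for this example.
example = {
    "Logistic Regression": 0.38,
    "Random Forest": 0.55,
    "Gradient Boosting": 0.30,
}
summary = summarize_agreement(example)
```

In this invented example only one of the three models crosses the threshold, so the result is not unanimous and the majority vote is "healthy"; a wide spread of this kind is a signal to treat the prediction with caution.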
each model to provide a more reliable assessment. Future work could involve enhancing the
models with more sophisticated techniques, incorporating additional relevant features, and
expanding the application to include more detailed explanations of the predictions to improve
transparency and user trust.