PROJECT PROPOSAL
Abstract: The underlying causes of cardiovascular disease (CVD) remain hard to pin down, but it is well established that CVD carries a high risk of death. This project aims to build AI-based technology that can reliably predict future outcomes for individuals with cardiovascular disease. Machine Learning (ML) algorithms have been applied to many medical datasets to automate the analysis of large and complex data, and researchers have increasingly turned to ML techniques to aid healthcare professionals in diagnosing heart-related ailments. Given the heart's pivotal role in circulating blood throughout the body, predicting heart disease is of paramount importance. Data analytics enhances disease prediction by leveraging extensive patient data, allowing medical facilities to anticipate future occurrences. Various techniques, including Artificial Neural Networks (ANN), Random Forest (RF), and Support Vector Machines (SVM), have been employed for heart disease prediction. Developing efficient detection methods is crucial for minimizing heart-related fatalities. By harnessing data mining and Machine Learning, researchers are striving to create software that assists doctors in predicting and diagnosing heart diseases swiftly and accurately. This research aims to predict heart disease using Machine Learning algorithms, addressing a pressing need in global healthcare.
Primarily, machine learning enables the creation of personalized risk prediction models,
assessing factors such as age, gender, blood pressure, cholesterol, and family history to
estimate individual heart disease likelihood. This aids healthcare providers in strategizing
preventive measures for high-risk individuals. Furthermore, machine learning fosters early
diagnosis by discerning subtle indications within medical data, such as irregular ECG readings
or variations in heart rate, facilitating timely interventions.
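As a rough illustration of such a risk model, the sketch below trains a logistic-regression classifier on a handful of the factors named above. The file name, column names, and 0/1 codings are assumptions for illustration, not a real dataset.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical file and column names; sex and family_history are assumed
# to already be coded as 0/1.
df = pd.read_csv("patients.csv")
features = ["age", "sex", "resting_bp", "cholesterol", "family_history"]
X, y = df[features], df["heart_disease"]  # 1 = disease present, 0 = absent

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Logistic regression yields an interpretable per-patient probability.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

risk = model.predict_proba(X_test)[:, 1]  # individual risk estimates in [0, 1]
print(risk[:5])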
The integration of wearable devices and IoT technology allows remote monitoring of
patients' vital signs. Machine learning interprets real-time data, detecting anomalies in heart
rate and activity levels, and alerting healthcare professionals for prompt evaluation.
Similarly, machine learning's prowess in medical image analysis aids in identifying structural anomalies and blockages in the heart and blood vessels.
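As a minimal sketch of the remote-monitoring idea above, the snippet below flags heart-rate readings that drift far from a rolling baseline. The window size and z-score threshold are arbitrary assumptions; a deployed system would need clinically validated thresholds.

from collections import deque
import statistics

def detect_anomalies(readings, window=30, z_threshold=3.0):
    """Yield (index, bpm) for readings far outside the recent rolling window."""
    recent = deque(maxlen=window)
    for i, bpm in enumerate(readings):
        if len(recent) == window:
            mean = statistics.fmean(recent)
            std = statistics.pstdev(recent)
            if std > 0 and abs(bpm - mean) / std > z_threshold:
                yield i, bpm  # flag this reading for clinician review
        recent.append(bpm)

# Example: a steady resting stream with one abrupt spike.
stream = [72, 74, 71, 73, 75] * 10 + [140] + [72] * 5
print(list(detect_anomalies(stream, window=20)))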
Treatment decisions are informed by machine learning's analysis of intervention
effectiveness in comparable patient groups, paving the way for personalized treatment
plans. By fusing data from diverse sources like medical records and diagnostic reports, these
algorithms offer a holistic view of patient health, enhancing risk assessment.
Machine learning models continually refine their accuracy through exposure to growing
datasets, bolstering their effectiveness over time. However, ethical considerations and
patient privacy are paramount. Rigorous testing, validation, and collaboration between
medical experts, data scientists, and regulators are indispensable to ensure the secure and
effective integration of machine learning into healthcare practices.
LITERATURE SURVEY- Baban U Rinde et al. [1] suggested a heart disease prediction model using machine learning. The dataset is taken from the UCI Machine Learning Repository [2]; it contains 303 records with 14 input features such as age, sex, type of chest pain, and maximum heart rate. They use SVM, RF, and ANN as classification techniques.
Devendra Sandhya et al. [3] proposed heart disease detection using a combination of hardware and software. The hardware components are an Arduino/Raspberry Pi, various biomedical sensors, and a display monitor, while on the software side a recursive feature elimination algorithm is used for feature selection. ANN and logistic regression are then applied individually.
Chala Beyene et al. [4] recommended prediction and analysis of the occurrence of heart disease using data mining techniques. The main objective is to predict the occurrence of heart disease for early automatic diagnosis, with results available in a short time. The proposed methodology is also valuable in healthcare organizations whose staff lack specialist knowledge and skill. It uses medical attributes such as blood sugar, heart rate, age, and sex to identify whether a person has heart disease. Analyses of the dataset are computed using the WEKA software.
Ali, Liaqat, et al. [5] propose a system containing two models based on the linear Support Vector Machine (SVM): an L1-regularized model and an L2-regularized model. The first model removes unnecessary features by driving their coefficients to zero; the second model performs the prediction of disease. To optimize both models they propose a hybrid grid search algorithm, which tunes them against several metrics: accuracy, sensitivity, specificity, the Matthews correlation coefficient, the ROC chart, and the area under the curve. They used the Cleveland dataset, split into 70% training and 30% testing with holdout validation. Two experiments were carried out, each for various values of C1, C2, and k, where C1 is the hyperparameter of the L1-regularized model, C2 is the hyperparameter of the L2-regularized model, and k is the size of the selected feature subset. The first experiment, an L1-linear SVM model stacked with an L2-linear SVM model, gives a maximum testing accuracy of 91.11% and training accuracy of 84.05%. The second experiment, an L1-linear SVM model cascaded with an L2 SVM model with an RBF kernel, gives a maximum testing accuracy of 92.22% and training accuracy of 85.02%. They obtained a 3.3% improvement in accuracy over conventional SVM models.
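The two-stage idea can be sketched with off-the-shelf components, although this is not the authors' exact system: an L1-penalized linear SVM zeroes out weak features, and an L2-regularized SVM with an RBF kernel is then tuned over C1 and C2. The paper's hybrid grid search is replaced here by a plain GridSearchCV, and a built-in dataset stands in for the Cleveland data.

from sklearn.datasets import load_breast_cancer  # stand-in for Cleveland data
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    # Stage 1: the L1 penalty drives coefficients of weak features to zero.
    ("l1_select", SelectFromModel(
        LinearSVC(penalty="l1", dual=False, max_iter=5000))),
    # Stage 2: an L2-regularized SVM (RBF kernel here) does the prediction.
    ("l2_svm", SVC(kernel="rbf")),
])

grid = GridSearchCV(pipe, {
    "l1_select__estimator__C": [0.01, 0.1, 1.0],  # C1: selection strength
    "l2_svm__C": [0.1, 1.0, 10.0],                # C2: prediction model
}, cv=5)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))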
Problem Formulation- Detecting heart disease remains a significant challenge due to costly
or inefficient predictive instruments. Early identification is vital for reducing mortality and
complications. Yet, continuous patient monitoring is often impractical and 24/7 doctor
consultations are limited by expertise and time. With abundant contemporary data,
leveraging machine learning holds promise. By analysing data through advanced algorithms,
hidden patterns in medical information can be unveiled, aiding health diagnosis. These
patterns, concealed within vast datasets, can provide crucial insights into cardiac health.
Machine learning's ability to discern subtle correlations and trends enhances our
understanding of heart disease development and progression. By deploying predictive
models, we can assess an individual's risk factors, incorporating factors like age, gender, and
medical history to offer personalized risk evaluations. Moreover, machine learning's
proficiency in processing real-time data from wearables enables remote monitoring,
promptly detecting deviations in vital signs.
Software Requirements-
1. Programming Language: Python is a popular choice for machine learning and data analysis due to its extensive libraries, such as NumPy, Pandas, Scikit-Learn, and Matplotlib.
2. Machine Learning Libraries: You'll need Scikit-Learn for machine learning tasks,
TensorFlow or PyTorch for deep learning, and XGBoost for boosting algorithms.
3. Data Analysis Libraries: Pandas for data manipulation and analysis, and NumPy for
numerical operations.
4. Data Visualization: Matplotlib and Seaborn for creating visualizations, and Plotly for
interactive dashboards.
5. Data Collection and Cleaning: Tools for collecting, cleaning, and preprocessing data.
You might use SQL databases, Excel, or data manipulation libraries within Python.
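A quick way to confirm the stack above is installed is to import each library and print its version; the versions shown will of course depend on your environment.

import numpy, pandas, sklearn, matplotlib

for lib in (numpy, pandas, sklearn, matplotlib):
    print(f"{lib.__name__:12s} {lib.__version__}")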
Software Rescues-
During the development of a Heart Disease prediction system, a programmer may face some
challenges that require "software rescue." Here are common issues and how to address
them:
a. Data Quality: Inaccurate or incomplete data can impact the model's performance.
Data cleansing and imputation techniques should be employed to handle missing
values and outliers.
b. Model Selection: Choosing the right machine learning algorithm is crucial. If the initial
model doesn't perform well, consider trying various algorithms and ensemble
methods.
c. Overfitting: Overfit models may perform well on the training data but poorly on new data. Implement techniques like cross-validation, regularization, and hyperparameter tuning to mitigate overfitting (see the sketch after this list).
d. Feature Engineering: The choice and engineering of features are essential. Iteratively
refine your features by considering domain expertise and using feature selection
techniques.
e. Scalability: As the dataset grows, your system might need optimization for efficiency.
Parallel processing or distributed computing may be necessary.
f. Model Updates: Medical guidelines and data change over time. You must have a mechanism for updating and retraining the model as new data becomes available.
g. Regulatory Compliance: Ensure your system complies with healthcare data privacy regulations, such as HIPAA in the United States or GDPR in Europe.
h. User Interface: The user interface for medical applications must be user-friendly and follow best practices in healthcare UX/UI design.
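As a sketch of item (c), the snippet below combines 5-fold cross-validation with a small hyperparameter grid that regularizes a random forest; the grid values and the synthetic stand-in dataset are assumptions.

from sklearn.datasets import make_classification  # stand-in for heart data
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Limiting tree depth and leaf size regularizes the forest; 5-fold CV scores
# each candidate on data it was not trained on, exposing overfit settings.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 20]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))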
Easy To Store Data: A spreadsheet can store large amounts of data. MS Excel is widely used to save and analyse data, and filtering information in Excel is easy and convenient.
Easy To Recover Data: Finding and recovering data is very easy in an Excel spreadsheet.
Application of Mathematical Formulas: Calculations have become easier and less time-consuming with the formulas option in MS Excel.
More Secure: Spreadsheets can be password-protected on a laptop or personal computer, and the probability of losing them is far lower than for data written in registers or on pieces of paper.
Data at One Place: Earlier, data had to be kept in different files and registers when paperwork was done by hand. Now this has become convenient, as more than one worksheet can be added to a single workbook.
Python- To collect data, a web scraper programmed in Python was used. According to Wikipedia, Python's syntax allows programmers to express concepts in fewer lines of code. Guido van Rossum started Python's implementation at CWI in the Netherlands in December 1989; Python 2.0 was released on 16 October 2000 and Python 3.0 on 3 December 2008.
Python is used here for web scraping rather than another language because Python offers a module called 'urllib2', which has suitable functions to open websites and extract information easily. Python is used to program the web scraper that is in charge of collecting the data for the model.
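A minimal sketch of the fetching step follows. Note that 'urllib2' is the Python 2 name; on Python 3 the same functionality lives in urllib.request. The URL is a placeholder, not a real data source.

from urllib.request import urlopen

url = "https://example.com/data"  # placeholder, not a real data source
with urlopen(url, timeout=10) as response:
    html = response.read().decode("utf-8")
print(html[:200])  # first 200 characters of the fetched page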
Support Vector Machines (SVMs)- Support vector machines come in different forms, linear and non-linear. A support vector machine is a supervised classifier. As is usual in this context, two datasets are involved: a training set and a test set. In the ideal situation the classes are linearly separable: a line can be found that splits the two classes perfectly. However, not just one line splits the dataset perfectly; a whole family of lines does, and from these the best is selected as the "separating line".
An SVM may tolerate some misclassification errors in order to avoid over-fitting, while trying to minimize the number of errors made. Support vector machine classifiers are applied in many applications and are very popular in recent research, a popularity due to their good overall empirical performance. Comparing the naive Bayes and SVM classifiers, the SVM has been applied more often.
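A minimal sketch of a linear SVM on a two-class toy problem follows, assuming scikit-learn; the parameter C controls the soft margin that trades training errors against over-fitting.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters stand in for a linearly separable problem.
X, y = make_blobs(n_samples=100, centers=2, random_state=6)

# Smaller C tolerates more training errors in exchange for a wider margin.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print("support vectors:", len(clf.support_vectors_))
print("prediction for a new point:", clf.predict([[0.0, 0.0]]))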
Random Forest Algorithm (RF)- The random forest algorithm is an extension of the bagging
method as it utilizes both bagging and feature randomness to create an uncorrelated forest
of decision trees. Feature randomness, also known as feature bagging or “the random
subspace method”, generates a random subset of features, which ensures low correlation
among decision trees. This is a key difference between decision trees and random forests.
While decision trees consider all the possible feature splits, random forests only select a
subset of those features.
Random forest algorithms have three main hyperparameters, which need to be set before
training. These include node size, the number of trees, and the number of features sampled.
From there, the random forest classifier can be used to solve regression or classification problems.
The random forest algorithm is made up of a collection of decision trees, and each tree in the ensemble is built from a data sample drawn from the training set with replacement, called the bootstrap sample. Of that training sample, about one-third is left out as test data, known as the out-of-bag (OOB) sample, which we will come back to later. Another instance of randomness is then injected through feature bagging, adding more diversity to the dataset and reducing the correlation among decision trees. Depending on the type of problem, the determination of the prediction varies: for a regression task the individual decision trees are averaged, and for a classification task a majority vote, i.e., the most frequent categorical variable, yields the predicted class. Finally, the out-of-bag sample is used for cross-validation, finalizing that prediction.
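The description above maps directly onto scikit-learn's RandomForestClassifier, sketched below with a built-in dataset standing in for heart data; enabling oob_score uses the out-of-bag samples as the built-in validation estimate just mentioned.

from sklearn.datasets import load_breast_cancer  # stand-in for heart data
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,       # number of trees in the ensemble
    max_features="sqrt",    # feature bagging: random subset at each split
    oob_score=True,         # evaluate each tree on its out-of-bag sample
    random_state=0,
)
forest.fit(X, y)
print("out-of-bag accuracy:", round(forest.oob_score_, 3))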
In a decision tree there are two kinds of node: the decision node and the leaf node. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches. The decisions or tests are performed on the basis of features of the given dataset. A decision tree is a graphical representation for getting all the possible solutions to a problem or decision based on given conditions. It is called a decision tree because, like a tree, it starts with the root node, which expands into further branches and constructs a tree-like structure.
[Figure: a decision tree, with a root decision node branching into further decision nodes, subtrees, and leaf nodes.]
A decision tree simply asks a question and, based on the answer (yes/no), splits further into subtrees.
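A small sketch of a decision tree follows; the six-patient dataset and its labels are made up purely to show the question asked at each decision node and the class returned at each leaf.

from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: [age, resting_bp]; labels: 1 = heart disease, 0 = healthy (made up).
X = [[29, 120], [45, 130], [61, 160], [50, 148], [38, 118], [66, 170]]
y = [0, 0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# export_text prints the learned test at each decision node and the class
# predicted at each leaf node.
print(export_text(tree, feature_names=["age", "resting_bp"]))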
2. Pandas, a core Python library, specializes in data manipulation and analysis. With its versatile data structures, including DataFrames and Series, Pandas simplifies the management of structured data. It is an indispensable tool for data professionals, streamlining operations like data cleaning, transformation, and exploration (a short cleaning sketch follows this list).
3. Scikit-Learn is a versatile Python library, known for its extensive machine learning
capabilities. It features a wide range of machine learning algorithms for classification,
regression, clustering, and more. Notably, Scikit-Learn offers model selection,
evaluation, and preprocessing tools, making it essential for building and fine-tuning
machine learning models.
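As a brief illustration of the cleaning and transformation work described in item 2, the sketch below imputes missing values and encodes a categorical column; the column names and the median-imputation choice are assumptions.

import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "age": [63, 45, np.nan, 51],
    "cholesterol": [233, np.nan, 250, 212],
    "sex": ["M", "F", "F", "M"],
})

clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["cholesterol"] = clean["cholesterol"].fillna(clean["cholesterol"].median())
clean["sex"] = clean["sex"].map({"M": 1, "F": 0})  # encode for modelling
print(clean)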
[Workflow diagram: Start → dataset collection → pre-processing → feature selection → split into train dataset and test dataset → train classifiers → if the classifiers are not yet trained, continue training; once trained, use the trained classifier on the test dataset → End.]
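The workflow above translates almost line-for-line into a scikit-learn pipeline; the sketch below uses a built-in dataset and an arbitrary feature count as stand-ins for the project's actual data.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)            # dataset collection
X_tr, X_te, y_tr, y_te = train_test_split(            # train/test split
    X, y, test_size=0.2, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                      # pre-processing
    ("select", SelectKBest(f_classif, k=10)),         # feature selection
    ("clf", RandomForestClassifier(random_state=0)),  # train classifier
])
pipe.fit(X_tr, y_tr)
print("test accuracy:", round(pipe.score(X_te, y_te), 3))  # test dataset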
Future Scope- As illustrated before, the system can be used as a clinical assistant: any internet user can access the system through a web browser and understand their risk of heart disease. The proposed model can be implemented for any real-time application, and it can also be extended to determine other types of heart disease, such as rheumatic heart disease, hypertensive heart disease, ischemic heart disease, cardiovascular disease, and inflammatory heart disease.
References-
[4] Mr. Chala Beyene, Prof. Pooja Kamat, “Survey on Prediction and
Analysis the Occurrence of Heart Disease Using Data Mining Technique”,
International Journal of Pure and Applied Mathematics, 2018.
[5] Ali, Liaqat, et al., "An optimized stacked support vector machines based expert system for the effective prediction of heart failure", IEEE Access 7 (2019): 54007-54014.