Heart Disease Prediction
CHAPTER 1
INTRODUCTION
The heart is a muscular organ that pumps blood through the body and is the
central part of the body's cardiovascular system. The cardiovascular system also
comprises a network of blood vessels, for example, veins, arteries, and capillaries,
which deliver blood all over the body. Abnormalities in the normal flow of blood from
the heart cause several types of heart diseases, commonly known as cardiovascular
diseases (CVD). Heart diseases are among the main causes of death worldwide.
According to a survey of the World Health Organization (WHO), 17.5 million global
deaths occur because of heart attacks and strokes. More than 75% of deaths from
cardiovascular diseases occur in middle-income and low-income countries, and 80% of
the deaths due to CVDs are caused by stroke and heart attack. Therefore, predicting
cardiac abnormalities at an early stage, and tools for the prediction of heart disease,
can save many lives and help doctors design effective treatment plans, ultimately
reducing the mortality rate due to cardiovascular diseases.
Due to the development of advanced healthcare systems, a lot of patient data is
nowadays available (i.e., big data in electronic health record systems) which can be
used for designing predictive models for cardiovascular diseases. Data mining or
machine learning is a discovery method for analyzing big data from assorted
perspectives and encapsulating it into useful information. "Data mining is the non-trivial
extraction of implicit, previously unknown and potentially useful information from data."
Nowadays, a huge amount of data pertaining to disease diagnosis, patients etc. is
generated by healthcare industries. Data mining provides techniques which discover
hidden patterns or similarities in data.
Medical data mining has great potential for exploring the cryptic patterns in
the data sets of the clinical domain. These patterns can be utilized for healthcare
diagnosis. However, the available raw medical data are widely distributed, voluminous
and heterogeneous in nature. This data needs to be collected in an organized form
and can then be integrated to form a medical information system. Data mining provides
a user-oriented approach to novel and hidden patterns in the data. Data mining tools
are useful for answering business questions and provide techniques for predicting the
various diseases in the healthcare field. Disease prediction plays a significant role in
data mining. This paper analyzes heart disease prediction using classification
algorithms. These invisible patterns can be utilized for health diagnosis in healthcare
data.
Data mining technology affords an efficient approach to new and previously unknown
patterns in the data. The information so identified can be used by healthcare
administrators to improve their services. Heart disease is the most crucial cause of
death in countries like India and the United States. In this project we predict heart
disease using classification algorithms. Machine learning classification algorithms
such as Random Forest and Logistic Regression are used to explore different kinds of
heart-related problems.
CHAPTER 2
LITERATURE SURVEY
Machine learning techniques are used to analyze and predict medical data
information resources. Diagnosis of heart disease is a significant and tedious task in
medicine. The term heart disease encompasses the various diseases that affect the
heart. Detecting heart disease from various factors or symptoms is a problem that is
not free from false presumptions and is often accompanied by unpredictable effects.
The data classification is based on supervised machine learning algorithms, which
results in better accuracy. Here we use Random Forest as the training algorithm to
train on the heart disease dataset and to predict heart disease. The results showed
that the designed prediction system is capable of predicting heart disease
successfully.
In this literature survey we provide a summary of the different methods that have
been proposed for heart disease prediction over the period 2002 to 2022. We have
gone through five papers, each of which has a unique approach in some parameter or
the other. Summaries of the papers are provided below.
Paper-1: Clinical decision support system: Risk level prediction of heart disease
using weighted fuzzy rules
Paper-2: Heart Disease with Feature Subset Selection using Genetic Algorithm
CHAPTER 3
RESEARCH GAP
Disadvantages:
Prediction is not possible at early stages.
In the existing system, practical use of the collected data is time-consuming.
Any fault made by the doctor or hospital staff in prediction could lead to fatal
incidents.
A highly expensive and laborious process needs to be performed before treating
the patient to find out if he/she has any chance of getting heart disease in the
future.
3.2 PROPOSED SYSTEM
This section depicts the overview of the proposed system and illustrates all of the
components, techniques and tools that are used for developing the entire system. To
develop an intelligent and user-friendly heart disease prediction system, an efficient
software tool is needed to train huge datasets and compare multiple machine learning
algorithms. After choosing the robust algorithm with best accuracy and performance
measures, it will be implemented on the development of the smartphone-based
application for detecting and predicting heart disease risk level. Hardware components
like Arduino/Raspberry Pi, different biomedical sensors, display monitor, buzzer etc. are
needed to build the continuous patient monitoring system.
3.3 TECHNICAL FEASIBILITY
As the developing organization has all the resources available to build the system,
the proposed system is technically feasible.
3.4 COCOMO MODEL
3.4.1. EMBEDDED:
This class of system is characterized by tight constraints, a changing environment,
and unfamiliar surroundings. Projects of the embedded type are new to the company
and usually exhibit tight temporal constraints.
3.4.2. ORGANIC
This category encompasses all systems that are small relative to project size and
team size, and that have a stable environment, familiar surroundings and relaxed
interfaces. These are simple business systems, data processing systems, and small
software libraries.
3.4.3. SEMIDETACHED
The software systems falling under this category are a mix of those of organic
and embedded in nature. Some examples of software of this class are operating
systems, database management systems, and inventory management systems.
TYPE OF PRODUCT      A      B      C      D
Organic              2.4    1.05   2.5    0.38
Semi-Detached        3.0    1.12   2.5    0.35
Embedded             3.6    1.20   2.5    0.32
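The coefficients in the table plug into the basic COCOMO equations: effort E = a·(KLOC)^b person-months and development time T = c·E^d months. A small sketch (the 10 KLOC figure is purely illustrative):

```python
# Basic COCOMO estimation: E = a * KLOC^b person-months, T = c * E^d months.
# Coefficients per product type, taken from the table above.
COEFFS = {
    "organic":      (2.4, 1.05, 2.5, 0.38),
    "semidetached": (3.0, 1.12, 2.5, 0.35),
    "embedded":     (3.6, 1.20, 2.5, 0.32),
}

def cocomo_estimate(kloc, mode="organic"):
    """Return (effort in person-months, development time in months)."""
    a, b, c, d = COEFFS[mode]
    effort = a * kloc ** b
    time = c * effort ** d
    return effort, time

effort, time = cocomo_estimate(10, "organic")
print(f"effort = {effort:.1f} PM, time = {time:.1f} months")
```

For a 10 KLOC organic project this yields roughly 27 person-months and about 9 months of development time.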
Each attribute is rated on a six-point scale: VL (Very Low), LO (Low), NM (Nominal),
HI (High), VH (Very High) and XH (Extra High). The list of attributes is composed of
several features of the software and includes product, computer, personnel and project
attributes as follows.
Data bytes per DSI (DATA): A lower rating corresponds to a smaller database size.
Complexity (CPLX): This attribute expresses code complexity, again ranging from
straight batch code (VL) to real-time code with multiple resource scheduling (XH).
These are used to quantify the amount of experience in each area possessed by the
development team; the more experience, the higher the rating.
            VL      LO      NM      HI
LEXP        1.14    1.07    1.00    0.95
CHAPTER 4
SYSTEM SPECIFICATION
Software Requirement Specification (SRS) is the starting point of the software
development activity. As systems grew more complex, it became evident that the goals
of the entire system could not be easily comprehended. Hence the need for the
requirement phase arose. A software project is initiated by the client's needs. The
SRS is the means of translating the ideas in the minds of the clients (the input)
into a formal document (the output of the requirement phase). Under requirement
specification, the focus is on specifying what has been found during analysis; issues
such as representation, specification languages and tools, and checking the
specifications are addressed during this activity. The requirement phase terminates
with the production of the validated SRS document. Producing the SRS document is the
basic goal of this phase. The purpose of the Software Requirement Specification is to
reduce the communication gap between the clients and the developers. It is the medium
through which the client and user needs are accurately specified, and it forms the
basis of software development. A good SRS should satisfy all the parties involved in
the system.
4.2.1 OPERATIONAL REQUIREMENTS
a) Economic: The developed product is economic as it does not require any special
hardware interface. Environmental requirements are statements of fact and assumptions
that define the system's expectations in terms of mission objectives, environment,
constraints, and effectiveness and suitability measures (MOE/MOS). The customers are
those that perform the eight primary functions of systems engineering, with special
emphasis on the operator as the key customer.
b) Health and Safety: The software may be safety-critical. If so, there are issues
associated with its integrity level. The software may not be safety-critical although it
forms part of a safety-critical system.
For example, software may simply log transactions. If a system must be of a high
integrity level and if the software is shown to be of that integrity level, then the
hardware must be at least of the same integrity level.
There is little point in producing 'perfect' code in some language if hardware and
system software (in widest sense) are not reliable. If a computer system is to run
software of a high integrity level then that system should not at the same time
accommodate software of a lower integrity level.
4.3.1 HARDWARE REQUIREMENTS
RAM : 4 GB
Hard Disk : 4 GB
Input device : Standard Keyboard and Mouse
Output device : VGA and High-Resolution Monitor
4.3.2 SOFTWARE REQUIREMENTS
Operating System : Windows 7 or higher
4.4.1 Python
Python is an interpreted, high-level, general-purpose programming language with a
design philosophy that emphasizes code readability, notably using significant
whitespace. It provides constructs that enable clear programming on both small and
large scales. Python features a dynamic type system and automatic memory
management. It supports multiple programming paradigms, including object-oriented,
imperative, functional and procedural, and has a large and comprehensive standard
library. Python interpreters are available for many operating systems. CPython, the
reference implementation of Python, is open-source software and has a community-
based development model, as do nearly all its variant implementations. CPython is
managed by the non-profit Python Software Foundation.
4.4.2 Pandas
Pandas is an open-source Python library providing high-performance data
manipulation and analysis tools built on its powerful data structures. The name
Pandas is derived from "panel data", an econometrics term for multidimensional data.
In 2008, developer Wes McKinney started developing Pandas when in need of a
high-performance, flexible tool for data analysis. Prior to Pandas, Python was majorly
used for data munging and preparation; it had very little contribution towards data
analysis. Pandas solved this problem.
Using Pandas, we can accomplish five typical steps in the processing and
analysis of data, regardless of the origin of the data: load, prepare, manipulate,
model, and analyze. Python with Pandas is used in a wide range of fields, including
academic and commercial domains such as finance, economics, statistics and analytics.
Key features of Pandas include:
Group-by of data for aggregation and transformations.
High-performance merging and joining of data.
Time series functionality.
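The group-by aggregation listed above can be sketched as follows; the tiny frame here stands in for the real heart disease dataset, and its column names are illustrative assumptions:

```python
import pandas as pd

# Illustrative sample standing in for the heart disease dataset.
df = pd.DataFrame({
    "age":    [63, 45, 58, 39],
    "chol":   [233, 204, 284, 199],
    "target": [1, 0, 1, 0],
})

# Data quality check: count missing values per column.
print(df.isnull().sum())

# Group-by aggregation: mean cholesterol for each class label.
print(df.groupby("target")["chol"].mean())
```

The same pattern (load, check, aggregate) applies unchanged when the frame is read from the project's actual CSV file.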
4.4.3 NumPy
NumPy is a general-purpose array-processing package. It provides a
highperformance multidimensional array object, and tools for working with these arrays.
It is the fundamental package for scientific computing with Python. It contains various
features including these important ones:
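A tiny sketch of the NumPy array object and its vectorized operations (the values are illustrative):

```python
import numpy as np

# The multidimensional array object at the core of NumPy.
x = np.array([[1.0, 2.0], [3.0, 4.0]])

print(x.shape)         # shape of the 2-D array
print(x.mean(axis=0))  # column means, computed without an explicit loop
print(x @ x)           # matrix product via the @ operator
```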
4.4.4 Scikit-Learn
Simple and efficient tools for data mining and data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable - BSD license
4.4.5 Matplotlib
Matplotlib is a plotting library for Python. It supports a very wide variety of
graphs and plots, namely histograms, bar charts, power spectra, error charts etc.
CHAPTER 5
SYSTEM DESIGN
For feature selection we used the Recursive Feature Elimination algorithm with the
Chi2 method and obtained the top 16 features. After that we applied the ANN and
Logistic Regression algorithms individually and computed the accuracy. Finally, we
used the proposed Ensemble Voting method and determined the best method for
diagnosis of heart disease.
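The Chi2-based selection step can be sketched with scikit-learn. Note that sklearn's RFE ranks features through an estimator rather than a chi-squared score, so this sketch approximates the step with SelectKBest using chi2; the data and the k value are illustrative (the report keeps the top 16 features of its real dataset):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Illustrative non-negative feature matrix (chi2 requires non-negative values).
X = np.array([[10, 1, 5],
              [20, 2, 5],
              [10, 8, 5],
              [30, 9, 5]])
y = np.array([0, 0, 1, 1])

# Keep the k highest-scoring features under the chi-squared test.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # two columns survive the selection
```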
5.2 MODULES
The entire work of this project is divided into 4 modules. They are:
a. Data Pre-processing
b. Feature Extraction
c. Classification
d. Prediction
a. Data Pre-processing:
This file contains all the pre-processing functions needed to process all input
documents and texts. First we read the train, test and validation data files, then
perform some preprocessing like tokenizing, stemming etc. Some exploratory data
analysis is performed, such as response variable distribution, and data quality checks
such as null or missing values.
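The data-quality check mentioned above can be sketched with Pandas; the small frame and the mean-imputation choice are illustrative assumptions, not the project's actual cleaning rules:

```python
import pandas as pd

# Illustrative raw data with a missing value, standing in for the input files.
df = pd.DataFrame({
    "age":  [63, None, 58],
    "chol": [233, 204, 284],
})

# Data quality check: locate nulls, then impute with the column mean.
print(df.isnull().sum())
df["age"] = df["age"].fillna(df["age"].mean())
print(df)
```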
b. Feature Extraction:
In this file we have performed feature extraction and selection methods from the
scikit-learn Python libraries. For feature selection, we have used methods like simple
bag-of-words and n-grams, and then term-frequency weighting like TF-IDF. We have
also used word2vec and POS tagging to extract features, though POS tagging and
word2vec have not been used at this point in the project.
c. Classification:
Here we have built all the classifiers for the breast cancer diseases detection.
The extracted features are fed into different classifiers. We have used Naive-bayes,
Logistic Regression, Linear SVM, Stochastic gradient decent and Random forest
classifiers from sklearn. Each extracted feature was used in all the classifiers. Once
fitting the model, we compared the f1 score and checked the confusion matrix.
After fitting all the classifiers, the two best performing models were selected as
candidate models for heart disease classification. We performed parameter tuning by
applying GridSearchCV on these candidate models and chose the best performing
parameters for each classifier. The finally selected model was used for heart disease
detection along with the probability of truth.
In addition to this, we also extracted the top 50 features from our
term-frequency TF-IDF vectorizer to see which features are most important in each of
the classes. We also used precision-recall and learning curves to see how the training
and test sets perform as we increase the amount of data fed to our classifiers.
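The GridSearchCV tuning described above might look like the following sketch; the synthetic data, candidate model, parameter grid and scoring choice are illustrative assumptions rather than the project's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the extracted heart disease features.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Parameter tuning on a candidate model via cross-validated grid search.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="f1",
    cv=3,
)
grid.fit(X_train, y_train)

# Compare the tuned model on held-out data, as described above.
pred = grid.best_estimator_.predict(X_test)
print("best params:", grid.best_params_)
print("f1:", round(f1_score(y_test, pred), 3))
print(confusion_matrix(y_test, pred))
```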
d. Prediction:
Our finally selected and best performing classifier was algorithm which was then
saved on disk with name final_model.sav. Once you close this repository, this model will
be copied to user's machine and will be used by prediction.py file to classify the Heart
diseases . It takes a news article as input from user then model is used for final
classification output that is shown to user along with probability of truth.
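The save-and-reload flow around final_model.sav can be sketched with joblib; the model and data here are illustrative stand-ins, and the actual prediction.py may differ:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in training data and model for the persisted classifier.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Persist the trained classifier, then reload it for prediction.
joblib.dump(model, "final_model.sav")
loaded = joblib.load("final_model.sav")

# Classify one record and report the probability of the predicted class.
proba = loaded.predict_proba(X[:1])[0]
print("class:", loaded.predict(X[:1])[0], "probability:", round(max(proba), 3))
```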
LEVEL 0:
LEVEL 1
5.3 USE-CASE DIAGRAM
A use case diagram shows a set of use cases and actors and their relationships.
A use case diagram is just a special kind of diagram and shares the same common
properties as all other diagrams, i.e., a name and graphical contents that are a
projection into a model. What distinguishes a use case diagram from all other kinds of
diagrams is its particular content.
5.4 ACTIVITY DIAGRAM
An activity diagram shows the flow from activity to activity. An activity is an
ongoing nonatomic execution within a state machine. An activity diagram is basically a
projection of the elements found in an activity graph, a special case of a state machine
in which all or most states are activity states and in which all or most transitions are
triggered by completion of activities in the source.
5.5 SEQUENCE DIAGRAM
A sequence diagram is an interaction diagram that emphasizes the time ordering
of messages. A sequence diagram shows a set of objects and the messages sent and
received by those objects. The objects are typically named or anonymous instances of
classes, but may also represent instances of other things, such as collaborations,
components, and nodes. We use sequence diagrams to illustrate the dynamic view of a
system.
5.6 CLASS DIAGRAM
A Class diagram in the Unified Modeling Language (UML) is a type of static
structure diagram that describes the structure of a system by showing the system's
classes, their attributes, operations (or methods), and the relationships among objects.
It provides a basic notation for other structure diagrams prescribed by UML. It is helpful
for developers and other team members too.
CHAPTER 6
ALGORITHMS SPECIFICATION
Logistic Regression (LogR) models the data points using the standard logistic
function, an S-shaped curve also called the sigmoid curve, which is given by the
equation

sigmoid(z) = 1 / (1 + e^(-z))
Fig. 6.1: Logistic Regression
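The sigmoid curve of Fig. 6.1 can be computed directly; a minimal sketch:

```python
import math

def sigmoid(z):
    """Standard logistic function, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))   # 0.5, the midpoint of the S-curve
print(sigmoid(4))   # close to 1 for large positive z
print(sigmoid(-4))  # close to 0 for large negative z
```

Logistic Regression applies this function to a linear combination of the input features, turning that score into a class probability.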
Similarly, Random Forest creates decision trees on data samples, gets a prediction
from each of them, and finally selects the best solution by means of voting. It is an
ensemble method that is better than a single decision tree because it reduces
over-fitting by averaging the results.
Random Forest works with the help of the following steps:
First, start with the selection of random samples from a given dataset.
Next, this algorithm will construct a decision tree for every sample. Then it will get
the prediction result from every decision tree.
In this step, voting will be performed for every predicted result.
At last, select the most voted prediction results as the final prediction result.
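The voting steps above can be sketched with scikit-learn's RandomForestClassifier; the synthetic dataset and parameter choices are illustrative stand-ins for the heart disease data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the heart disease dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Each of the 100 trees is fit on a bootstrap sample of the data;
# the forest combines the trees' predictions to produce its answer.
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Inspect the individual trees' votes for the first sample.
votes = [tree.predict(X[:1])[0] for tree in forest.estimators_]
print("forest prediction:", forest.predict(X[:1])[0])
print("fraction of trees voting 1:", sum(votes) / len(votes))
```

(Strictly, sklearn's forest averages the trees' class probabilities rather than counting hard votes, but the majority of tree votes usually coincides with the forest's prediction.)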
CHAPTER 7
SYSTEM TESTING
The term "WhiteBox" was used because of the see-through box concept. The
clear box or WhiteBox name symbolizes the ability to see through the software's outer
shell (or "box") into its inner workings. Likewise, the "black box" in "Black Box Testing"
symbolizes not being able to see the inner workings of the software so that only the
end-user experience can be tested.
White box testing involves the testing of the software code for the following:
Internal security holes
Broken or poorly structured paths in the coding processes
The flow of specific inputs through the code
Expected output
The functionality of conditional loops
Testing of each statement, object, and function on an individual basis
The testing can be done at system, integration and unit levels of software
development. One of the basic goals of white box testing is to verify the working flow
of an application. It involves testing a series of predefined inputs against expected
or desired outputs, so that when a specific input does not result in the expected
output, you have encountered a bug.
There are automated tools available to perform code coverage analysis. Below are a
few coverage analysis techniques:
1. Statement Coverage: -
This technique requires every possible statement in the code to be tested at least
once during the testing process of software engineering.
2. Branch Coverage: -
This technique checks every possible path (if-else and other conditional loops) of
a software application.
Apart from the above, there are numerous coverage types such as Condition
Coverage, Multiple Condition Coverage, Path Coverage, Function Coverage etc. Each
technique has its own merits and attempts to test (cover) all parts of software code.
Using Statement and Branch coverage you generally attain 80-90% code coverage
which is sufficient.
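A minimal illustration of branch coverage in Python; the function and its threshold are invented for illustration, and two inputs suffice to execute both branches at least once:

```python
# Illustrative function with one conditional branch.
def risk_band(cholesterol):
    if cholesterol >= 240:
        return "high"
    else:
        return "normal"

# One test input per path gives 100% branch coverage of this function:
assert risk_band(250) == "high"    # exercises the if-branch
assert risk_band(180) == "normal"  # exercises the else-branch
print("both branches covered")
```

Tools such as coverage.py report which of these branches a test suite actually executed.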
In white box penetration testing, the tester has full information about the
application: the source code, detailed network information, the IP addresses involved
and all server information the application runs on. The aim is to attack the code from
several angles to expose security threats.
White-box testing is time-consuming; bigger applications take more time to test
fully.
Here are the generic steps followed to carry out any type of Black Box Testing.
Initially, the requirements and specifications of the system are examined.
Tester chooses valid inputs (positive test scenario) to check whether SUT
processes them correctly. Also, some invalid inputs (negative test scenario) are
chosen to verify that the SUT can detect them.
Tester determines expected outputs for all those inputs.
Software tester constructs test cases with the selected inputs.
The test cases are executed.
Software tester compares the actual outputs with the expected outputs.
Defects if any are fixed and re-tested.
1. Functional testing - This black box testing type is related to the functional
requirements of a system; it is done by software testers.
2. Non-functional testing - This type of black box testing is not related to testing
of specific functionality, but to non-functional requirements such as performance,
scalability and usability.
3. Regression testing - Regression testing is done after code fixes, upgrades or any
other system maintenance to check that the new code has not affected the existing code.
1. Equivalence Class Testing: It is used to minimize the number of possible test
cases to an optimum level while maintaining reasonable test coverage.
2. Boundary Value Testing: Boundary value testing is focused on the values at
boundaries. This technique determines whether a certain range of values are
acceptable by the system or not. It is very useful in reducing the number of test cases. It
is most suitable for the systems where an input is within certain ranges.
3. Decision Table Testing: A decision table puts causes and their effects in a matrix.
There is a unique combination in each column.
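Boundary value testing, as described above, can be illustrated with a hypothetical input validator; the 1-120 age range is an assumed example, not a project requirement:

```python
# Hypothetical validator for a patient-age field accepted in the range 1..120.
def valid_age(age):
    return 1 <= age <= 120

# Boundary value cases cluster at the edges of the accepted range:
# just below, on, and just above each boundary.
boundary_cases = {0: False, 1: True, 2: True, 119: True, 120: True, 121: False}
for value, expected in boundary_cases.items():
    assert valid_age(value) == expected
print("all boundary cases pass")
```

Six values exercise both boundaries of the range, far fewer tests than checking every possible input.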
Black box testing has its own life cycle, called the Software Testing Life Cycle
(STLC), which relates to every stage of the Software Development Life Cycle of
Software Engineering.
1. Requirement
This is the initial stage of the SDLC; in this stage, requirements are gathered.
Software testers also take part in this stage.
2. Test Planning & Analysis
The testing types applicable to the project are determined. A test plan is created
which identifies possible project risks and their mitigation.
3. Design
In this stage, test cases/scripts are created on the basis of the software
requirement documents.
4. Test Execution
In this stage, the prepared test cases are executed. Bugs, if any, are fixed and
re-tested.
CHAPTER 8
CONCLUSION AND FUTURE SCOPE
8.1 CONCLUSION
In this project, we introduce the heart disease prediction system with different
classifier techniques for the prediction of heart disease. The techniques are Random
Forest and Logistic Regression: we have analyzed that the Random Forest has better
accuracy as compared to Logistic Regression. Our purpose is to improve the
performance of Random Forest by removing unnecessary and irrelevant attributes from
the dataset and only picking those that are most informative for the classification task.
REFERENCE
[1] P.K. Anooj, "Clinical decision support system: Risk level prediction of heart
disease using weighted fuzzy rules", Journal of King Saud University – Computer and
Information Sciences (2012) 24, 27–40.
[2] Nidhi Bhatla, Kiran Jyoti, "An Analysis of Heart Disease Prediction using
Different Data Mining Techniques", International Journal of Engineering Research &
Technology.
[3] Jyoti Soni, Ujma Ansari, Dipesh Sharma, Sunita Soni, "Predictive Data Mining for
Medical Diagnosis: An Overview of Heart Disease Prediction".
[4] Chaitrali S. Dangare, Sulabha S. Apte, "Improved Study of Heart Disease
Prediction System using Data Mining Classification Techniques", International Journal
of Computer Applications (0975 – 888).
[5] Dane Bertram, Amy Voida, Saul Greenberg, Robert Walker, "Communication,
Collaboration, and Bugs: The Social Nature of Issue Tracking in Small, Collocated
Teams".
[7] Ankita Dewan, Meghna Sharma, "Prediction of Heart Disease Using a Hybrid
Technique in Data Mining Classification", 2nd International Conference on Computing
for Sustainable Global Development, IEEE 2015, pp. 704–706.
[8] R. Alizadehsani, J. Habibi, B. Bahadorian, H. Mashayekhi, A. Ghandeharioun, R.
Boghrati, et al., "Diagnosis of coronary arteries stenosis using data mining", J Med
Signals Sens, vol. 2, pp. 153–9, Jul 2012.
[9] M. Akhil Jabbar, B.L. Deekshatulu, Priti Chandra, "Heart disease classification
using nearest neighbor classifier with feature subset selection", Anale. Seria
Informatica, 11, 2013.
[10] Shadab Adam Pattekari and Asma Parveen, "Prediction System for Heart Disease
using Naive Bayes", International Journal of Advanced Computer and Mathematical
Sciences, ISSN 2230-9624, Vol 3, Issue 3, 2012, pp. 290–294.
[12] Keerthana T. K., "Heart Disease Prediction System using Data Mining Method",
International Journal of Engineering Trends and Technology, May 2017.
[13] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Elsevier;
Animesh Hazra, Arkomita Mukherjee, Amit Gupta, "Prediction Using Machine Learning and
Data Mining", July 2017, pp. 2137–2159.
[14] Khdair H., "Exploring Machine Learning Techniques for Coronary Heart Disease
Prediction", 2021 (accessed on 12 April 2023). Available online:
https://thesai.org/Publications/ViewPaper?Volume=12&Issue=5&Code=IJACSA&SerialNo=5
[15] Singh A., Kumar R., "Heart disease prediction using machine learning
algorithms", Proceedings of the 2020 International Conference on Electrical and
Electronics Engineering (ICE3), Gorakhpur, India, 14–15 February 2020, pp. 452–457.