HEART DISEASE PREDICTION

CHAPTER 1

INTRODUCTION
The heart is a muscular organ that pumps blood through the body and is the
central part of the body's cardiovascular system, which also includes the lungs. The
cardiovascular system further comprises a network of blood vessels, for example veins,
arteries, and capillaries, which deliver blood all over the body.
Abnormalities in the normal flow of blood from the heart cause several types of heart disease,
commonly known as cardiovascular diseases (CVD). Heart diseases are among the
main causes of death worldwide. According to a survey of the World Health
Organization (WHO), 17.5 million global deaths occur because of heart attacks and
strokes. More than 75% of deaths from cardiovascular diseases occur in
middle-income and low-income countries, and about 80% of the deaths that occur due to
CVDs are because of stroke and heart attack. Therefore, prediction of cardiac
abnormalities at an early stage, and tools for the prediction of heart disease, can save
many lives and help doctors design effective treatment plans, ultimately reducing
the mortality rate due to cardiovascular diseases.

Due to the development of advanced healthcare systems, large volumes of patient data are
now available (i.e., big data in electronic health record systems) which can be
used for designing predictive models for cardiovascular diseases. Data mining, or
machine learning, is a discovery method for analyzing big data from assorted
perspectives and encapsulating it into useful information. "Data mining is the non-trivial
extraction of implicit, previously unknown and potentially useful information from data."
Nowadays, a huge amount of data pertaining to disease diagnosis, patients, etc. is
generated by the healthcare industry. Data mining provides techniques that discover
hidden patterns or similarities in the data.

Therefore, in this paper a machine learning approach is proposed for
implementing a heart disease prediction system, validated on two open-access heart
disease datasets. Data mining is the computer-based process of extracting
useful information from enormous databases. Data mining is most helpful in
exploratory analysis because it can surface non-trivial information from large volumes of
evidence. Medical data mining has great potential for exploring the hidden patterns in
the data sets of the clinical domain.

These patterns can be utilized for healthcare diagnosis. However, the available
raw medical data are widely distributed, voluminous and heterogeneous in nature. These
data need to be collected in an organized form and can then be
integrated to form a medical information system. Data mining provides a user-oriented
approach to novel and hidden patterns in the data, and data mining tools are useful for
answering business questions and for predicting various diseases in the
healthcare field. Disease prediction plays a significant role in data mining. This paper
analyzes heart disease prediction using classification algorithms; the patterns
discovered can be utilized for diagnosis on healthcare data.

Data mining technology affords an efficient approach to discovering new and previously
unknown patterns in the data. The information identified can be used by healthcare
administrators to provide better services. Heart disease is among the leading causes of
death in countries such as India and the United States. In this project we predict
heart disease using classification algorithms: machine learning techniques such as
Random Forest and Logistic Regression are used to explore different kinds of
heart-related problems.

CHAPTER 2
LITERATURE SURVEY

Machine learning techniques are used to analyze and predict medical data
from information resources. Diagnosis of heart disease is a significant and tedious task in
medicine. The term heart disease encompasses the various diseases that affect the
heart. Detecting heart disease from various factors or symptoms is an issue that
is not free from false presumptions and is often accompanied by unpredictable
effects. The data classification here is based on supervised machine learning algorithms,
which yield better accuracy. We use Random Forest as the training
algorithm to train on the heart disease dataset and to predict heart disease. The results
showed that the designed prediction system is capable of predicting heart disease
successfully.

In this literature survey we provide a summary of the different methods that have
been proposed for heart disease prediction over the period 2010 to 2021. We have reviewed five
papers, each of which takes a unique approach to the problem in some parameter
or the other. Summaries of each of the papers are provided below.

Paper-1: Clinical decision support system: Risk level prediction of heart disease
using weighted fuzzy rules

 Publication Year: 2012


 Author: P.K. Anooj
 Journal Name: Journal of King Saud University – Computer and Information
Sciences
 Summary: This paper discusses the development of a clinical decision support
system (CDSS) for predicting heart disease risk levels using weighted fuzzy
rules. The system aims to enhance prediction accuracy by integrating various
patient data points, helping clinicians in early diagnosis and treatment planning.

Paper-2: Enhanced Prediction of Heart Disease with Feature Subset Selection using Genetic Algorithm

 Publication Year: 2010


 Author: M. Anbarasi, E. Anupriya, & N.Ch.S.N. Iyengar
 Journal Name: International Journal of Engineering Science and Technology
 Summary: This paper explores the use of genetic algorithms for feature subset
selection to enhance the prediction accuracy of heart disease. By optimizing the
selection of relevant features, the study aims to improve the performance of
predictive models. The research demonstrates how genetic algorithms can
effectively identify the most significant features, thereby increasing the accuracy
and efficiency of heart disease prediction models.

Paper-3: Diagnosis of Heart Disease Using K-Nearest Neighbor Algorithm of Data Mining

 Publication Year: 2016


 Author: C. Kalaiselvi, PhD
 Journal Name: IEEE
 Summary: In this paper, C. Kalaiselvi, PhD, explores the application of the K-
Nearest Neighbor (KNN) algorithm in diagnosing heart disease. KNN is a popular
data mining technique used for classification tasks. The paper likely discusses
how KNN is implemented in analyzing heart disease datasets to predict the
presence or absence of heart conditions based on various input features. This
approach leverages similarities between new data points and existing data to
make predictions, providing valuable insights for medical practitioners in
diagnosing heart conditions.

Paper-4: Heart Disease Prediction System using Data Mining Method

 Publication Year: 2017


 Author: Keerthana T. K.
 Journal Name: International Journal of Engineering Trends and Technology
 Summary: This paper authored by Keerthana T. K. presents a heart disease
prediction system employing data mining techniques. It likely explores various
data mining methods such as Decision Trees, Neural Networks, or Support
Vector Machines to analyze heart disease datasets and predict the likelihood of
heart disease occurrence in individuals. The study may discuss the
implementation of these techniques, their comparative performance, and their
potential application in clinical settings for early diagnosis and intervention of
heart conditions.

Paper-5: Exploring Machine Learning Techniques for Coronary Heart Disease Prediction
 Publication Year: 2021
 Author: Khdair H
 Journal Name: International Journal of Advanced Computer Science and
Applications (IJACSA)
 Summary: The paper authored by Khdair H. explores the application of machine
learning techniques for predicting coronary heart disease. It likely discusses the
use of various machine learning algorithms, such as Decision Trees, Random
Forests, Support Vector Machines, or Neural Networks, to analyze relevant
datasets and predict the likelihood of coronary heart disease occurrence in
individuals. The study may evaluate the performance of these techniques in
terms of accuracy, sensitivity, and specificity, aiming to identify the most effective
approach for coronary heart disease prediction.

CHAPTER 3
RESEARCH GAP

3.1 EXISTING SYSTEM


Clinical decisions are often made based on doctors' intuition and experience
rather than on the knowledge-rich data hidden in databases. This practice leads to
unwanted biases, errors and excessive medical costs, which affect the quality of
service provided to patients. There are many ways that a medical misdiagnosis can
present itself. Whether a doctor or hospital staff is at fault, a misdiagnosis of a serious
illness can have very extreme and harmful effects. The National Patient Safety
Foundation cites that 42% of medical patients feel they have experienced a medical
error or missed diagnosis. Patient safety is sometimes negligently given the back seat
to other concerns, such as the cost of medical tests, drugs, and operations. Medical
misdiagnoses are a serious risk to the healthcare profession. If they continue,
people will fear going to the hospital for treatment. We can reduce medical
misdiagnosis by informing the public and by filing claims and suits against the medical
practitioners at fault.

Disadvantages:
 Prediction is not possible at early stages.
 In the existing system, practical use of the collected data is time-consuming.
 Any fault made by the doctor or hospital staff in prediction could lead to fatal
incidents.
 A highly expensive and laborious process needs to be performed before treating
the patient to find out whether he or she has any chance of developing heart disease
in the future.

3.2 PROPOSED SYSTEM
This section depicts the overview of the proposed system and illustrates all of the
components, techniques and tools that are used for developing the entire system. To
develop an intelligent and user-friendly heart disease prediction system, an efficient
software tool is needed to train on huge datasets and compare multiple machine learning
algorithms. After choosing the most robust algorithm based on accuracy and other performance
measures, it will be implemented in the development of a smartphone-based
application for detecting and predicting heart disease risk levels. Hardware components
such as an Arduino/Raspberry Pi, various biomedical sensors, a display monitor and a buzzer are
needed to build the continuous patient monitoring system.

3.3 FEASIBILITY STUDY


A feasibility study is a preliminary study undertaken before the real work of a
project starts to ascertain the likelihood of the project's success. It is an analysis of
possible alternative solutions to a problem and a recommendation on the best
alternative.

3.3.1 ECONOMIC FEASIBILITY


Economic feasibility is defined as the process of assessing the benefits and costs associated with
the development of a project. A proposed system, which is both operationally and
technically feasible, must be a good investment for the organization. With the proposed
system, users benefit greatly because they can assess their risk of heart disease early,
without costly diagnostic procedures. This proposed system does not need any additional software or a high
system configuration. Hence, the proposed system is economically feasible.

3.3.2 TECHNICAL FEASIBILITY


Technical feasibility infers whether the proposed system can be developed
considering technical issues such as the availability of the necessary technology. The
technology chosen here (Python and its libraries) is freely available over the internet and is easy
for a professional programmer to learn and use effectively.
As the developing organization has all the resources available to build the system,
the proposed system is technically feasible.

3.3.3 OPERATIONAL FEASIBILITY


Operational feasibility is defined as the process of assessing the degree to which a
proposed system solves business problems or takes advantage of business
opportunities. The system is self-explanatory and does not need any extra sophisticated
training. The system has built-in methods and classes that are required to produce the
result. The application can be handled very easily by a novice user, and the overall time
that a user needs to get trained is less than one hour. As the software used for
developing this application is economical and available in the market, the
proposed system is operationally feasible.

3.4 EFFORT, DURATION AND COST ESTIMATION USING COCOMO MODEL


The COCOMO (Constructive Cost Model) is the most complete and
thoroughly documented model used in effort estimation. The model provides detailed
formulas for determining the development time schedule, overall development effort,
effort breakdown by phase and activity, and maintenance effort. COCOMO
estimates the effort in person-months of direct labor. The primary effort factor is the
number of source lines of code (SLOC) expressed in thousands of delivered source
instructions (KDSI). The model is developed in three versions of different levels of detail:
basic, intermediate, and detailed. The modeling process considers three classes of
systems.

3.4.1. EMBEDDED:
This class of system is characterized by tight constraints, a changing environment,
and unfamiliar surroundings. Projects of the embedded type are new to the
company and usually exhibit tight temporal constraints.

3.4.2. ORGANIC

This category encompasses systems that are small relative to project and team size,
and that have a stable environment, familiar surroundings and relaxed interfaces. These are
simple business systems, data processing systems, and small software libraries. Our
project, built using Python and Jupyter Notebook, falls into this category.

3.4.3. SEMIDETACHED
The software systems falling under this category are a mix of those of organic
and embedded nature. Some examples of software of this class are operating
systems, database management systems, and inventory management systems.

TYPE OF PRODUCT A B C D
Organic 2.4 1.02 2.5 0.38
Semi Detached 3.0 1.12 2.5 0.35
Embedded 3.6 1.20 2.5 0.32

Table 3.1: Organic, Semidetached and Embedded system values

For basic COCOMO:
    Effort = a * (KLOC)^b
    Time   = c * (Effort)^d
For intermediate and detailed COCOMO:
    Effort = a * (KLOC)^b * EAF    (EAF = product of the cost drivers)
The intermediate COCOMO model is a refinement of the basic model that introduces
15 attributes of the product. For each attribute, the model user must provide
a rating using the following six-point scale.

VL (Very Low), LO (Low), NM (Nominal), HI (High), VH (Very High), XH (Extra High).

The list of attributes is composed of several features of the software and includes product,
computer, personnel and project attributes as follows.

3.4.4 PRODUCT ATTRIBUTES:


 Required reliability (RELY): It is used to express an effect of software faults
ranging from slight inconvenience (VL) to loss of life (VH). The nominal value
(NM) denotes moderate recoverable losses.

 Data bytes per DSI (DATA): A lower rating comes with a smaller database size.

 Complexity (CPLX): This attribute expresses code complexity, again
ranging from straight batch code (VL) to real-time code with multiple resource
scheduling (XH).

3.4.5 COMPUTER ATTRIBUTES


 Execution time (TIME) and memory (STOR) constraints: These attributes
identify the percentage of computer resources used by the system. NM states
that less than 50% is used; 95% is indicated by XH.
 Virtual machine volatility (VIRT): It is used to indicate the frequency of changes
made to the hardware, operating system, and overall software environment.
More frequent and significant changes are indicated by higher ratings.
 Development turnaround time (TURN): This is the time from when a job is
submitted until the output is received. LO indicates a highly interactive
environment; VH quantifies a situation where this time is longer than 12 hours.

3.4.6 PERSONAL ATTRIBUTES


 Analyst capability (ACAP) and programmer capability (PCAP): These describe
the skills of the developing team. The higher the skills, the higher the rating.
 Application experience (AEXP), language experience (LEXP), and virtual
machine experience (VEXP): These are used to quantify the amount of experience
the development team has in each area; more experience means a higher rating.

3.4.7 PROJECT ATTRIBUTES


 Modern development practices (MODP): deals with the amount of use of
modern software practices such as structured programming and the object-oriented
approach.
 Use of software tools (TOOL): measures the level of sophistication of
automated tools used in software development and the degree of integration
among the tools being used. A higher rating describes higher levels in both aspects.
 Schedule effects (SCED): concerns the amount of schedule compression (HI
or VH) or schedule expansion (LO or VL) of the development schedule in
comparison to a nominal (NM) schedule.

        VL     LO     NM     HI     VH     XH

RELY    0.75   0.88   1.10   1.40   1.40
DATA           0.94   1.10   1.16   1.16
CPLX    0.70   0.85   1.10   1.30   1.30   1.65
TIME                  1.10   1.30   1.30   1.65
STOR                  1.10   0.71   1.21
VIRT           0.87   1.10   0.82   1.30
TURN           0.87   1.10   0.86   1.30
ACAP    1.46   1.19   1.10   0.91   0.71
AEXP    1.29   1.29   1.10   0.86   0.82
PCAP    1.42   1.17   1.10   0.86   0.82
LEXP    1.14   1.07   1.10   0.95
VEXP    1.21   1.10   1.10   0.90
MODP    1.24   1.10   1.10   0.91   0.82
TOOL    1.24   1.10   1.10   0.91   0.83
SCED    1.23   1.08   1.10   1.04   1.10

Table 3.2: Cost driver ratings for the intermediate COCOMO attributes

Our project is an organic system. For intermediate COCOMO:

Effort = a * (KLOC)^b * EAF
KLOC = 0.115 (115 lines of code)
For an organic system:
a = 2.4
b = 1.02
EAF = product of the cost drivers = 1.30

Effort = 2.4 * (0.115)^1.02 * 1.30
       = 1.034 person-months
Time for development = c * (Effort)^d
                     = 2.5 * (1.034)^0.38
                     = 2.71 months
Cost of programmer = Effort * cost of programmer per month
                   = 1.034 * 20000
                   = 20680
Project cost = 20000 + 20680
             = 40680
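For reference, the same intermediate COCOMO arithmetic can be scripted in the project's language. The sketch below uses the organic-mode constants, the EAF of 1.30 and the monthly programmer cost of 20000 quoted above; it is an illustration only, and the values it prints follow directly from the formulas, so they may differ from the rounded figures reported above.

# Intermediate COCOMO sketch for an organic project (constants taken from the estimate above).
def cocomo_intermediate(kloc, a=2.4, b=1.02, c=2.5, d=0.38, eaf=1.30, cost_per_month=20000):
    """Return effort (person-months), duration (months) and programmer cost."""
    effort = a * (kloc ** b) * eaf      # Effort = a * KLOC^b * EAF
    duration = c * (effort ** d)        # Time   = c * Effort^d
    cost = effort * cost_per_month      # direct labour cost
    return effort, duration, cost

if __name__ == "__main__":
    effort, duration, cost = cocomo_intermediate(kloc=0.115)
    print(f"Effort:   {effort:.3f} person-months")
    print(f"Duration: {duration:.2f} months")
    print(f"Cost:     {cost:.0f}")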

CHAPTER 4
SYSTEM SPECIFICATION

4.1 INTRODUCTION TO REQUIREMENT SPECIFICATION


A Software Requirements Specification (SRS) is a description of a particular software
product, program or set of programs that performs a set of functions in a target
environment (IEEE Std. 830-1993).
a. Purpose:
The purpose section of the software requirements specification states the intentions and
intended audience of the SRS.
b. Scope:
The scope of the SRS identifies the software product to be produced, its
capabilities, application, relevant objects, etc. We propose to implement classification
algorithms such as Random Forest and Logistic Regression, which take the training and test datasets as input.
c. Definitions, Acronyms and Abbreviations:
Software Requirements Specification (SRS): a description of a particular software product,
program or set of programs that performs a set of functions in the target environment.
d. References:
IEEE Std. 830-1993, IEEE Recommended Practice for Software Requirements
Specifications; Software Engineering by James F. Peters & Witold Pedrycz; Head First
Java by Kathy Sierra and Bert Bates.
e. Overview
The SRS contains the details of the process, DFDs, functions of the product and user
characteristics. The non-functional requirements, if any, are also specified.
f. Overall description
The main functions associated with the product are described in this section of
SRS. The characteristics of a user of this product are indicated. The assumptions in this
section result from interaction with the project stakeholders.

4.2 REQUIREMENT ANALYSIS

The Software Requirement Specification (SRS) is the starting point of the software
development activity. As systems grew more complex, it became evident that the goals of
the entire system could not be easily comprehended; hence the need for the requirement
phase arose. A software project is initiated by the client's needs. The SRS is the
means of translating the ideas in the minds of the clients (the input) into a formal document
(the output of the requirement phase). Under requirement specification, the focus is on
specifying what has been found during analysis; issues such as representation, specification
languages and tools, and checking of the specifications are addressed during this activity.
The requirement phase terminates with the production of the validated SRS document.

Producing the SRS document is the basic goal of this phase. The purpose of the
Software Requirement Specification is to reduce the communication gap between the
clients and the developers. The SRS is the medium through
which the client and user needs are accurately specified. It forms the basis of software
development. A good SRS should satisfy all the parties involved in the system.

4.2.1 OPERATIONAL REQUIREMENTS
a) Economic: The developed product is economical, as it does not require any special hardware
interface. Environmental requirements are statements of fact and assumptions that define the
system's expectations in terms of mission objectives, environment, constraints, and
effectiveness and suitability measures (MOE/MOS). The customers are those that
perform the eight primary functions of systems engineering, with special emphasis on
the operator as the key customer.

b) Health and Safety: The software may be safety-critical. If so, there are issues
associated with its integrity level. The software may not be safety-critical although it
forms part of a safety-critical system.

 For example, software may simply log transactions. If a system must be of a high
integrity level and if the software is shown to be of that integrity level, then the
hardware must be at least of the same integrity level.

 There is little point in producing 'perfect' code in some language if the hardware and
system software (in the widest sense) are not reliable. If a computer system is to run
software of a high integrity level, then that system should not at the same time
accommodate software of a lower integrity level.

 Systems with different requirements for safety levels must be separated.


Otherwise, the highest level of integrity required must be applied to all systems in
the same environment.

4.3 SYSTEM REQUIREMENTS


4.3.1 HARDWARE REQUIREMENTS

 Processor : Above 500 MHz

 Ram : 4 GB
 Hard Disk : 4 GB
 Input device : Standard Keyboard and Mouse
 Output device : VGA and High-Resolution Monitor
4.3.2 SOFTWARE REQUIREMENTS
 Operating System : Windows 7 or higher

 Programming : Python 3.6 and related libraries

 Software : Anaconda Navigator and Jupyter Notebook

4.4 SOFTWARE DESCRIPTION


4.4.1 Python
Python is an interpreted, high-level programming language for general-purpose
programming. Created by Guido van Rossum and first released in 1991, Python has a
design philosophy that emphasizes code readability, notably using significant
whitespace. It provides constructs that enable clear programming on both small and
large scales. Python features a dynamic type system and automatic memory
management. It supports multiple programming paradigms, including object-oriented,
imperative, functional and procedural, and has a large and comprehensive standard
library. Python interpreters are available for many operating systems. CPython, the
reference implementation of Python, is open-source software and has a community-based
development model, as do nearly all its variant implementations. CPython is
managed by the non-profit Python Software Foundation.

4.4.2 Pandas
Pandas is an open-source Python library providing high-performance data
manipulation and analysis tools built on its powerful data structures. The name Pandas is
derived from "panel data", an econometrics term for multidimensional data. In
2008, developer Wes McKinney started developing Pandas when he needed a high-performance,
flexible tool for data analysis. Prior to Pandas, Python was mostly used
for data munging and preparation and contributed very little to data analysis;
Pandas solved this problem.

Using Pandas, we can accomplish five typical steps in the processing and
analysis of data, regardless of the origin of the data: load, prepare, manipulate, model,
and analyze. Python with Pandas is used in a wide range of academic
and commercial domains, including finance, economics, statistics and analytics. A short
loading sketch is shown after the feature list below.

Key Features of Pandas


 Fast and efficient Data Frame object with default and customized indexing.
 Tools for loading data into in-memory data objects from different file formats.
 Data alignment and integrated handling of missing data
 Reshaping and pivoting of data sets.
 Label-based slicing, indexing and sub setting of large data sets.
 Columns from a data structure can be deleted or inserted.

 Group by data for aggregation and transformations.
 High performance merging and joining of data.
 Time Series functionality.
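As a minimal sketch of how such a dataset might be loaded and inspected with Pandas, consider the following; the file name heart.csv and the column name target are assumptions made for illustration, not files shipped with this project.

import pandas as pd

# Load the heart disease dataset (file name assumed for illustration).
df = pd.read_csv("heart.csv")

# Basic inspection: shape, data types and missing values.
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())

# Distribution of the response variable (column name 'target' is an assumption).
print(df["target"].value_counts())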

4.4.3 NumPy
NumPy is a general-purpose array-processing package. It provides a
high-performance multidimensional array object and tools for working with these arrays.
It is the fundamental package for scientific computing with Python. It contains various
features including these important ones:

 A powerful N-dimensional array object
 Sophisticated (broadcasting) functions
 Tools for integrating C/C++ and Fortran code
 Useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient
multidimensional container of generic data. Arbitrary data types can be defined
using NumPy, which allows NumPy to seamlessly and speedily integrate with a
wide variety of databases.
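As a small illustration of the array operations described above, the snippet below standardizes a toy feature matrix using broadcasting; the numbers are made up for the example.

import numpy as np

# Toy 2-D array: rows are patients, columns are features (illustrative values only).
X = np.array([[63, 145, 233],
              [37, 130, 250],
              [41, 130, 204]], dtype=float)

# Column-wise mean and standard deviation, then standardization via broadcasting.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(X_scaled.round(3))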

4.4.4 Scikit-learn
 Simple and efficient tools for data mining and data analysis
 Accessible to everybody, and reusable in various contexts
 Built on NumPy, SciPy, and Matplotlib
 Open source, commercially usable (BSD license)

4.4.5 Matplotlib

 Matplotlib is a Python library used to create 2D graphs and plots from Python
scripts.
 It has a module named pyplot which makes plotting easy by providing
features to control line styles, font properties, axis formatting, etc.
 It supports a very wide variety of graphs and plots, namely histograms, bar
charts, power spectra, error charts, etc.
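For instance, a histogram of one attribute of the dataset could be drawn as follows; the file and column names are assumptions carried over from the earlier Pandas sketch.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("heart.csv")            # file name assumed for illustration

# Histogram of the age attribute (column name assumed).
plt.hist(df["age"], bins=20, edgecolor="black")
plt.xlabel("Age")
plt.ylabel("Number of patients")
plt.title("Age distribution in the heart disease dataset")
plt.show()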

4.4.6 Jupyter Notebook


 The Jupyter Notebook is an incredibly powerful tool for interactively developing
and presenting data science projects.
 A notebook integrates code and its output into a single document that combines
visualizations, narrative text, mathematical equations, and other rich media.
 The Jupyter Notebook is an open-source web application that allows you to
create and share documents that contain live code, equations, visualizations and
narrative text.
 Uses include data cleaning and transformation, numerical simulation, statistical
modeling, data visualization, machine learning, and much more.
 The Notebook has support for over 40 programming languages, including
Python, R, Julia, and Scala.
 Notebooks can be shared with others using email, Dropbox, GitHub and the
Jupyter Notebook Viewer. Your code can produce rich, interactive output: HTML,
images, videos, LaTeX, and custom MIME types.
 Leverage big data tools, such as Apache Spark, from Python, R and Scala.
Explore the same data with pandas, scikit-learn, ggplot2 and TensorFlow.

CHAPTER 5
SYSTEM DESIGN

5.1 SYSTEM ARCHITECTURE


The figure below shows the process flow diagram of the proposed work. First we
collected the Cleveland Heart Disease Database from the UCI website, then pre-processed
the dataset and selected the 16 most important features.

Fig.5.1: System Architecture

For feature selection we used the Recursive Feature Elimination algorithm with the Chi2
(chi-squared) scoring method and obtained the top 16 features. After that we applied the ANN and
Logistic Regression algorithms individually and computed their accuracy. Finally, we used the
proposed ensemble voting method and determined the best method for the diagnosis of heart disease.
A short feature-selection sketch in Python follows this section.
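A minimal feature-selection sketch with scikit-learn is shown below. It assumes the pre-processed data are available as a CSV file with a target column, and it uses SelectKBest with the chi2 score as a stand-in for the RFE-with-Chi2 step described above; the exact pipeline used in the project may differ.

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Assumes a pre-processed, Cleveland-style dataset with a 'target' column (assumption).
df = pd.read_csv("heart.csv")
X = df.drop(columns=["target"])
y = df["target"]

# chi2 requires non-negative feature values, so the raw (unscaled) attributes are used here.
selector = SelectKBest(score_func=chi2, k=min(16, X.shape[1]))
selector.fit(X, y)

selected = X.columns[selector.get_support()]
print("Selected features:", list(selected))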

5.2 MODULES
The entire work of this project is divided into 4 modules. They are:
a. Data Pre-processing
b. Feature Extraction
c. Classification
d. Prediction

a. Data Pre-processing:

This module contains all the pre-processing functions needed to process the input
data. First we read the train, test and validation data files and then
perform some preprocessing such as tokenizing and stemming. Some
exploratory data analysis is also performed, such as examining the response variable
distribution, along with data quality checks for null or missing values. A short
pre-processing sketch is shown below.
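The sketch below illustrates the data-quality checks and train/test preparation described here; the file name, target column and split ratio are assumptions for illustration.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")                          # file name assumed

# Data quality checks: missing values and response variable distribution.
print(df.isnull().sum())
print(df["target"].value_counts(normalize=True))       # 'target' column assumed

# Drop rows with missing values (one simple strategy) and split the data.
df = df.dropna()
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)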

b. Feature Extraction:
In this module we perform feature extraction and selection using methods from the
scikit-learn Python library. For feature selection, we have used methods such as
simple bag-of-words and n-grams, followed by term-frequency weighting such as tf-idf. We
have also experimented with word2vec and POS tagging to extract features, though POS
tagging and word2vec have not been used at this point in the project.

c. Classification:
Here we have built all the classifiers for heart disease detection.
The extracted features are fed into different classifiers. We have used Naive Bayes,
Logistic Regression, Linear SVM, Stochastic Gradient Descent and Random Forest
classifiers from scikit-learn. Each extracted feature set was used with all the classifiers.
After fitting the models, we compared the F1 scores and checked the confusion matrices.

After fitting all the classifiers, the two best-performing models were selected as
candidate models for heart disease classification. We performed parameter
tuning by applying GridSearchCV to these candidate models and chose the
best-performing parameters for each classifier. The finally selected model was used for heart
disease detection, reporting the probability of the prediction.

In addition to this, we have also extracted the top 50 features from our term-frequency
(tf-idf) vectorizer to see which features are most important for each of the
classes. We have also used precision-recall and learning curves to see how the training
and test sets perform as we increase the amount of data given to our classifiers. A
hyper-parameter tuning sketch is shown below.
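The GridSearchCV tuning mentioned above could look like the following sketch for the Random Forest candidate; the parameter grid, scoring choice and data-loading details are assumptions, not the exact values used in the project.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("heart.csv")                          # file and column names assumed
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Small illustrative parameter grid for the Random Forest candidate.
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
}

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1",        # F1 score, as compared in the text above
    cv=5,
)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validated F1 score:", round(grid.best_score_, 3))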

d. Prediction:
Our finally selected, best-performing classifier was then saved to disk with the name
final_model.sav. Once this repository is cloned, the saved model is copied to the user's
machine and used by the prediction.py file to classify heart disease. It takes the
patient's attributes as input from the user; the model then produces the final
classification output, which is shown to the user along with the probability of the prediction.
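Saving and reloading the chosen model as final_model.sav could be done as sketched below; pickle is one common choice (joblib is an equivalent alternative), and the data-loading lines repeat the assumptions of the earlier sketches.

import pickle
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")                          # file and column names assumed
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# Persist the trained classifier under the file name used in the text.
with open("final_model.sav", "wb") as f:
    pickle.dump(clf, f)

# Later, e.g. inside prediction.py: reload the model and classify a new patient record.
with open("final_model.sav", "rb") as f:
    model = pickle.load(f)

print(model.predict(X_test[:1]))          # class label (0 = no disease, 1 = disease)
print(model.predict_proba(X_test[:1]))    # class probabilities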

5.3.1 DATA FLOW DIAGRAM


The data flow diagram (DFD) is one of the most important tools used in systems
analysis. Data flow diagrams are made up of a number of symbols which represent
system components. Most data flow modeling methods use four kinds of symbols:
processes, data stores, data flows and external entities. These symbols are used to
represent the four kinds of system components. Circles in a DFD represent processes,
a data flow is represented by a thin line, each data store has a unique name, and
a square or rectangle represents an external entity.

LEVEL 0:

Fig.5.2:Data flow diagram level 0

LEVEL 1

Fig.5.3: Data flow diagram level 1

5.3 USE-CASE DIAGRAM
A use case diagram shows a set of use cases and actors and their relationships.
A use case diagram is just a special kind of diagram and shares the same common
properties as all other diagrams, i.e., a name and graphical contents that are a
projection into a model. What distinguishes a use case diagram from all other kinds of
diagrams is its particular content.

Fig.5.4 : Use-Case Diagram

5.4 ACTIVITY DIAGRAM
An activity diagram shows the flow from activity to activity. An activity is an
ongoing nonatomic execution within a state machine. An activity diagram is basically a
projection of the elements found in an activity graph, a special case of a state machine
in which all or most states are activity states and in which all or most transitions are
triggered by completion of activities in the source.

Fig.5.5 : Activity Diagram

5.5 SEQUENCE DIAGRAM
A sequence diagram is an interaction diagram that emphasizes the time ordering
of messages. A sequence diagram shows a set of objects and the messages sent and
received by those objects. The objects are typically named or anonymous instances of
classes, but may also represent instances of other things, such as collaborations,
components, and nodes. We use sequence diagrams to illustrate the dynamic view of a
system.

Fig.5.6 : Sequence Diagram

5.6 CLASS DIAGRAM
A Class diagram in the Unified Modeling Language (UML) is a type of static
structure diagram that describes the structure of a system by showing the system's
classes, their attributes, operations (or methods), and the relationships among objects.
It provides a basic notation for other structure diagrams prescribed by UML. It is helpful
for developers and other team members too.

Fig.5.7 : Class diagram

CHAPTER 6
ALGORITHMS SPECIFICATION

6.1 LOGISTIC REGRESSION


A popular statistical technique for predicting binomial outcomes (y = 0 or 1) is
Logistic Regression. Logistic Regression predicts categorical outcomes (binomial or
multinomial values of y). The predictions of Logistic Regression (henceforth LogR)
are in the form of probabilities of an event occurring, i.e., the probability of
y = 1 given certain values of the input variables x. Thus, the results of LogR range between
0 and 1.

LogR models the data points using the standard logistic function, which is an
S-shaped curve, also called the sigmoid curve, given by the equation
f(z) = 1 / (1 + e^(-z)).

Fig. 6.1: Logistic Regression
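A minimal scikit-learn sketch of training such a Logistic Regression model is shown below; the data-loading details are assumptions carried over from Chapter 5, not the project's exact code.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")                          # file and column names assumed
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the logistic regression classifier.
logr = LogisticRegression(max_iter=1000)
logr.fit(X_train, y_train)

# Predicted probabilities of y = 1 lie between 0 and 1, as described above.
probs = logr.predict_proba(X_test)[:, 1]
preds = (probs >= 0.5).astype(int)

print("Accuracy:", round(accuracy_score(y_test, preds), 3))
print(classification_report(y_test, preds))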

6.2 RANDOM FOREST


Random Forest is a supervised learning algorithm that can be used for both classification and
regression; however, it is mainly used for classification problems. Just as a
forest is made up of trees, and more trees make a more robust forest,

the random forest algorithm creates decision trees on data samples, gets
the prediction from each of them, and finally selects the best solution by means of
voting. It is an ensemble method that performs better than a single decision tree because it
reduces over-fitting by averaging the results.

Random Forest works with the help of the following steps:
 First, start with the selection of random samples from the given dataset.
 Next, the algorithm constructs a decision tree for every sample and obtains
the prediction result from every decision tree.
 In this step, voting is performed over every predicted result.
 At last, the most voted prediction result is selected as the final prediction.

The following diagram illustrates its working; a code sketch follows the figure.

Fig.6.2: Random Forest
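The same workflow in scikit-learn might look like the sketch below; the number of trees and the data-loading details are illustrative assumptions.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")                          # file and column names assumed
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bootstrap sampling, per-tree training and majority voting are handled internally.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred), 3))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

# Feature importances can guide removal of irrelevant attributes, as the conclusion suggests.
for name, score in sorted(zip(X.columns, rf.feature_importances_),
                          key=lambda t: t[1], reverse=True)[:5]:
    print(f"{name}: {score:.3f}")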

CHAPTER 7
SYSTEM TESTING

7.1 WHITE BOX TESTING


White box testing is testing a software solution's internal structure, design, and code. In this
type of testing, the code is visible to the tester. It focuses primarily on verifying the flow
of inputs and outputs through the application, improving design and usability, and
strengthening security. White box testing is also known as clear box testing, open box
testing, structural testing, transparent box testing, code-based testing, and glass box
testing. It is usually performed by developers. It is one of two parts of the box testing
approach to software testing. Its counterpart, black box testing, involves testing from an
external or end-user perspective. White box testing, on the other hand, is based on
the inner workings of an application and revolves around internal testing.

The term "WhiteBox" was used because of the see-through box concept. The
clear box or WhiteBox name symbolizes the ability to see through the software's outer
shell (or "box") into its inner workings. Likewise, the "black box" in "Black Box Testing"
symbolizes not being able to see the inner workings of the software so that only the
end-user experience can be tested.
 White box testing involves the testing of the software code for the following:
 Internal security holes
 Broken or poorly structured paths in the coding processes
 The flow of specific inputs through the code
 Expected output
 The functionality of conditional loops
 Testing of each statement, object, and function on an individual basis

The testing can be done at the system, integration and unit levels of software
development. One of the basic goals of white box testing is to verify the working flow of
an application. It involves testing a series of predefined inputs against expected
or desired outputs, so that when a specific input does not result in the expected output,
you have encountered a bug.

How do you perform White Box Testing?


To give you a simplified explanation of white box testing, we have divided it into
two basic steps. This is what testers do when testing an application using the white box
testing technique:

Step 1: Understand The Source Code


The first thing a tester will often do is learn and understand the source code of
the application. Since white box testing involves the testing of the inner workings of an
application, the tester must be very knowledgeable in the programming languages used
in the applications they are testing. Also, the testing person must be highly aware of
secure coding practices. Security is often one of the primary objectives of testing
software. The tester should be able to find security issues and prevent attacks from
hackers and naive users who might inject malicious code into the application either
knowingly or unknowingly.

Step 2: Create Test Cases and Execute


The second basic step to white box testing involves testing the application's
source code for proper flow and structure. One way is by writing more code to test the
application's source code. The tester will develop little tests for each process or series
of processes in the application. This method requires that the tester must have intimate
knowledge of the code and is often done by the developer. Other methods include
manual testing, trial-and-error testing, and the use of testing tools, as explained
further below.
White Box Testing Techniques
A major White box testing technique is Code Coverage analysis. Code Coverage
analysis eliminates gaps in a Test Case suite. It identifies areas of a program that are
not exercised by a set of test cases. Once gaps are identified, you create test cases to
verify untested parts of the code, thereby increasing the quality of the software product.

There are automated tools available to perform Code coverage analysis. Below are a
few coverage analysis techniques

1. Statement Coverage: -
This technique requires every possible statement in the code to be tested at least
once during the testing process of software engineering.

2. Branch Coverage: -
This technique checks every possible path (if-else and other conditional loops) of
a software application.
Apart from the above, there are numerous coverage types such as Condition
Coverage, Multiple Condition Coverage, Path Coverage, Function Coverage etc. Each
technique has its own merits and attempts to test (cover) all parts of software code.
Using Statement and Branch coverage you generally attain 80-90% code coverage
which is sufficient.

Types of White Box Testing


1. Unit Testing:
It is often the first type of testing done on an application. Unit Testing is
performed on each unit or block of code as it is developed. Unit Testing is essentially
done by the programmer. As a software developer, you develop a few lines of code, a
single function or an object and test it to ensure it works before continuing Unit Testing
helps identify most bugs, early in the software development lifecycle. Bugs identified in
this stage are cheaper and easy to fix.
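As an illustration in this project's language, the sketch below unit-tests a small, hypothetical helper that validates a patient record before it is fed to the classifier; both the helper and its validation rules are assumptions made for the example.

import unittest

def is_valid_record(record):
    """Hypothetical helper: checks that required fields are present and plausible."""
    required = ("age", "trestbps", "chol")
    if any(key not in record for key in required):
        return False
    return 0 < record["age"] < 120 and record["trestbps"] > 0 and record["chol"] > 0

class TestRecordValidation(unittest.TestCase):
    def test_valid_record(self):
        self.assertTrue(is_valid_record({"age": 54, "trestbps": 130, "chol": 246}))

    def test_missing_field(self):
        self.assertFalse(is_valid_record({"age": 54, "chol": 246}))

    def test_implausible_age(self):
        self.assertFalse(is_valid_record({"age": 300, "trestbps": 130, "chol": 246}))

if __name__ == "__main__":
    unittest.main()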

2. Testing for Memory Leaks:

Memory leaks are a leading cause of slow-running applications. A QA specialist
who is experienced at detecting memory leaks is essential in cases where you have a
slow-running software application. Apart from the above, a few testing types are part of both
black box and white box testing; they are listed below.

White Box Penetration Testing:
In this testing, the tester/developer has full information about the application's
source code, detailed network information, the IP addresses involved and all the server
information the application runs on. The aim is to attack the code from several angles to
expose security threats.

White Box Mutation Testing:


Mutation testing is often used to discover the best coding techniques to use for
expanding a software solution.

White Box Testing Tools


Below is a list of top white box testing tools.
 Parasoft Jtest
 EclEmma
 NUnit
 PyUnit
 HTMLUnit
 CppUnit

Advantages of White Box Testing

 Code optimization by finding hidden errors.
 White box test cases can be easily automated.
 Testing is more thorough as all code paths are usually covered.
 Testing can start early in the SDLC, even if the GUI is not available.

Disadvantages of White Box Testing

 White box testing can be quite complex and expensive.
 Developers, who usually execute white box test cases, often dislike it; white box
testing done by developers may not be detailed and can lead to production errors.
 White box testing requires professional resources with a detailed understanding
of programming and implementation.
 White box testing is time-consuming; bigger applications take more time to test fully.

7.2 BLACK BOX TESTING


Black box testing is defined as a testing technique in which the functionality of the Application
Under Test (AUT) is tested without looking at the internal code structure, implementation details
or knowledge of the internal paths of the software. This type of testing is based entirely on
software requirements and specifications. In black box testing we just focus on the inputs
and outputs of the software system without bothering about internal knowledge of the
software program.
The black box can be any software system you want to test: for example, an
operating system like Windows, a website like Google, a database like Oracle or even
your own custom application. Under black box testing, you can test these applications
by just focusing on the inputs and outputs without knowing their internal code
implementation.

Here are the generic steps followed to carry out any type of black box testing:
 Initially, the requirements and specifications of the system are examined.
 The tester chooses valid inputs (positive test scenario) to check whether the SUT
processes them correctly. Also, some invalid inputs (negative test scenario) are
chosen to verify that the SUT can detect them.
 The tester determines the expected outputs for all those inputs.
 The software tester constructs test cases with the selected inputs.
 The test cases are executed.
 The software tester compares the actual outputs with the expected outputs.
 Defects, if any, are fixed and re-tested.

Types of Black Box Testing


There are many types of black box testing, but the following are the prominent
ones:

1. Functional testing: This black box testing type is related to the functional
requirements of a system; it is done by software testers.
2. Non-functional testing: This type of black box testing is not related to testing of
specific functionality, but to non-functional requirements such as performance, scalability
and usability.
3. Regression testing: Regression testing is done after code fixes, upgrades or any
other system maintenance to check that the new code has not affected the existing code.

Tools used for Black Box Testing:


The tools used for black box testing largely depend on the type of black box testing you
are doing.
 For functional/regression tests you can use QTP or Selenium.
 For non-functional tests, you can use LoadRunner or JMeter.

Black Box Testing Techniques


The following are the prominent test strategies among the many used in black box
testing:
1. Equivalence Class Testing
2. Boundary Value Testing
3. Decision Table Testing

1. Equivalence Class Testing: It is used to minimize the number of possible test cases
to an optimum level while maintaining reasonable test coverage.
2. Boundary Value Testing: Boundary value testing is focused on the values at
boundaries. This technique determines whether a certain range of values is
acceptable to the system or not. It is very useful in reducing the number of test cases. It
is most suitable for systems where an input lies within certain ranges. A small
boundary-value example follows this list.
3. Decision Table Testing: A decision table puts causes and their effects in a matrix.
There is a unique combination in each column.
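To make the boundary value idea concrete, the sketch below exercises a hypothetical input check for a blood-pressure field at the edges of an assumed acceptable range; both the function and the 50-250 range are illustrative assumptions.

# Hypothetical validator: accepts resting blood pressure values in an assumed 50-250 range.
def is_acceptable_trestbps(value):
    return 50 <= value <= 250

# Boundary value test cases: just below, on, and just above each boundary.
cases = [(49, False), (50, True), (51, True),
         (249, True), (250, True), (251, False)]

for value, expected in cases:
    result = is_acceptable_trestbps(value)
    status = "PASS" if result == expected else "FAIL"
    print(f"input={value:>3}  expected={expected}  got={result}  {status}")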

Black Box Testing and Software Development Life Cycle (SDLC)

Black box testing has its own life cycle, called the Software Testing Life Cycle (STLC),
which maps to every stage of the Software Development Life Cycle.
1. Requirement: This is the initial stage of the SDLC, in which requirements are gathered.
Software testers also take part in this stage.
2. Test Planning & Analysis: The testing types applicable to the project are determined,
and a test plan is created which identifies possible project risks and their mitigation.
3. Design: In this stage, test cases/scripts are created on the basis of the software
requirement documents.
4. Test Execution: In this stage, the prepared test cases are executed. Bugs, if any,
are fixed and re-tested.

CHAPTER 8
CONCLUSION AND FUTURE SCOPE

8.1 CONCLUSION
In this project, we introduced a heart disease prediction system and evaluated different
classifier techniques for the prediction of heart disease, namely Random
Forest and Logistic Regression. We found that Random Forest gives better
accuracy compared to Logistic Regression. Our purpose is to improve the
performance of Random Forest by removing unnecessary and irrelevant attributes from
the dataset and picking only those that are most informative for the classification task.

8.2 FUTURE SCOPE


As illustrated before, the system can be used as a clinical assistant for
clinicians. The disease prediction through risk factors can be hosted online, so
any internet user can access the system through a web browser and
understand their risk of heart disease. The proposed model can be implemented for any
real-time application. Using the proposed model, other types of heart disease can also
be determined; different heart diseases such as rheumatic heart disease, hypertensive
heart disease, ischemic heart disease, cardiovascular disease and inflammatory heart
disease can be identified.
Other healthcare systems can be formulated using this proposed model in order
to identify diseases at an early stage. The proposed model requires an efficient
processor with a good memory configuration to implement it in real time. The proposed
model has a wide area of application, such as grid computing, cloud computing and robotic
modeling. To increase the performance of our classifier in the future, we will work on
ensembling two algorithms, Random Forest and AdaBoost; by ensembling these
two algorithms we expect to achieve higher performance, as sketched below.
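One way such an ensemble might be built with scikit-learn is a soft-voting combination of Random Forest and AdaBoost, as sketched below; this is an illustration under the data-loading assumptions used in the earlier sketches, not a finished design.

import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")                          # file and column names assumed
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Soft voting averages the predicted class probabilities of the two base models.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
        ("ada", AdaBoostClassifier(n_estimators=100, random_state=42)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

y_pred = ensemble.predict(X_test)
print("Ensemble accuracy:", round(accuracy_score(y_test, y_pred), 3))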

REFERENCES
[1] P.K. Anooj, "Clinical decision support system: Risk level prediction of heart disease
using weighted fuzzy rules", Journal of King Saud University – Computer and
Information Sciences (2012) 24, 27–40. Computer Science & Information Technology
(CS & IT) 59.

[2] Nidhi Bhatla, Kiran Jyoti, "An Analysis of Heart Disease Prediction using Different
Data Mining Techniques", International Journal of Engineering Research & Technology.

[3] Jyoti Soni, Ujma Ansari, Dipesh Sharma, Sunita Soni, "Predictive Data Mining for
Medical Diagnosis: An Overview of Heart Disease Prediction".

[4] Chaitrali S. Dangare, Sulabha S. Apte, "Improved Study of Heart Disease Prediction
System using Data Mining Classification Techniques", International Journal of Computer
Applications (0975 – 888).

[5] Dane Bertram, Amy Voida, Saul Greenberg, Robert Walker, "Communication,
Collaboration, and Bugs: The Social Nature of Issue Tracking in Small, Collocated
Teams".

[6] M. Anbarasi, E. Anupriya, N.Ch.S.N. Iyengar, "Enhanced Prediction of Heart
Disease with Feature Subset Selection using Genetic Algorithm", International Journal of
Engineering Science and Technology, Vol. 2(10), 2010.

[7] Ankita Dewan, Meghna Sharma, "Prediction of Heart Disease Using a Hybrid
Technique in Data Mining Classification", 2nd International Conference on Computing
for Sustainable Global Development, IEEE 2015, pp. 704-706.

[8] R. Alizadehsani, J. Habibi, B. Bahadorian, H. Mashayekhi, A. Ghandeharioun, R.
Boghrati, et al., "Diagnosis of coronary arteries stenosis using data mining", J Med
Signals Sens, vol. 2, pp. 153-9, Jul 2012.

[9] M. Akhil Jabbar, B.L. Deekshatulu, Priti Chandra, "Heart disease classification using
nearest neighbor classifier with feature subset selection", Anale. Seria Informatica, 11,
2013.

[10] Shadab Adam Pattekari and Asma Parveen, "Prediction System for Heart Disease
Using Naive Bayes", International Journal of Advanced Computer and Mathematical
Sciences, ISSN 2230-9624, Vol 3, Issue 3, 2012, pp. 290-294.

[11] C. Kalaiselvi, "Diagnosis of Heart Disease Using K-Nearest Neighbor Algorithm of
Data Mining", IEEE, 2016.

[12] Keerthana T. K., "Heart Disease Prediction System using Data Mining Method",
International Journal of Engineering Trends and Technology, May 2017.

[13] Jiawei Han and Micheline Kamber, Data Mining Concepts and Techniques, Elsevier.
Animesh Hazra, Arkomita Mukherjee, Amit Gupta, Prediction Using Machine Learning
and Data Mining, July 2017, pp. 2137-2159.

[14] Khdair H., "Exploring Machine Learning Techniques for Coronary Heart Disease
Prediction", 2021 (accessed on 12 April 2023). Available online:
https://thesai.org/Publications/ViewPaperVolume=12&Issue=5&Code=IJACSA&SerialNo=5

[15] Singh A., Kumar R., "Heart disease prediction using machine learning algorithms",
Proceedings of the 2020 International Conference on Electrical and Electronics
Engineering (ICE3), Gorakhpur, India, 14–15 February 2020, pp. 452–457.
