Mini Report 2
Submitted in partial fulfilment of the requirements for the award of the degree
of
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
by
Guided by
Dr Arvind Rehalia
CANDIDATE’S DECLARATION
It is hereby certified that the work which is being presented in the B. Tech Minor Project Report
entitled "HEART DISEASE PREDICTION USING ML" in partial fulfilment of the
requirements for the award of the degree of Bachelor of Technology and submitted in the
Department of Information Technology of BHARATI VIDYAPEETH’S COLLEGE OF
ENGINEERING, New Delhi (Affiliated to Guru Gobind Singh Indraprastha University,
Delhi) is an authentic record of our own work carried out during the period from September 2022
to January 2023 under the guidance of Dr. Arun Kumar Dubey.
The matter presented in this B. Tech Minor Project Report has not been submitted by us for the
award of any other degree of this or any other Institute.
This is to certify that the above statement made by the candidates is correct to the best of my
knowledge. He/She/They are permitted to appear in the External Minor Project Examination.
Abstract
This report presents the mini-project on Machine Learning assigned to seventh-semester students in partial
fulfillment of the course requirements, given by the department of computer science and engineering, KU.
Cardiovascular diseases have been the most common cause of death worldwide over the last few decades, in
developed as well as underdeveloped and developing countries. Early detection of cardiac diseases
and continuous supervision by clinicians can reduce the mortality rate. However, it is not possible to
monitor patients accurately every day in all cases, and round-the-clock consultation with a doctor
is not available, since it requires more skill, time and expertise. In this project, we have developed
and researched models that predict heart disease from the various heart-related attributes of a patient
and detect impending heart disease using machine learning techniques such as the backward elimination
algorithm, logistic regression and RFECV on a dataset publicly available on the Kaggle website, further
evaluating the results using a confusion matrix and cross-validation. The early prognosis of
cardiovascular diseases can aid in making decisions on lifestyle changes in high-risk patients and in
turn reduce complications, which can be a great milestone in the field of medicine.
Acknowledgement
We express our deep gratitude to Dr. Arvind Rehalia, Department of Information Technology
Engineering, for his valuable guidance and suggestions throughout our project work. We are
thankful to Dr. Arun Kumar Dubey and Mahesh Kumar, Project Coordinators, for their valuable
guidance.
We would also like to extend our sincere thanks to the Head of the Department, Mr. Prakhar
Priyadarshi, for his time-to-time suggestions to complete our project work. We are also thankful to
Prof. Dharmender Saini, Principal, for providing us the facilities to carry out our project work.
Table of Contents
CANDIDATE’S DECLARATION II
ABSTRACT III
ACKNOWLEDGEMENT IV
TABLE OF CONTENTS V-XX
Chapter-1
1.1 Introduction
According to the World Health Organization, every year 12 million deaths occur worldwide due to heart disease. The
burden of cardiovascular disease has been rapidly increasing all over the world over the past few years. Many studies have
been conducted in an attempt to pinpoint the most influential factors of heart disease as well as to accurately predict the overall
risk.
Heart disease is often described as a silent killer because it can lead to death without obvious symptoms. The
early diagnosis of heart disease plays a vital role in making decisions on lifestyle changes in high-risk patients and in
turn reduces complications. This project aims to predict future heart disease by analyzing patient data and
classifying, using machine-learning algorithms, whether or not a patient has heart disease.
The major challenge in heart disease is its detection. Instruments that can predict heart disease are available, but
they are either expensive or not efficient at estimating the chance of heart disease in humans. Early detection of cardiac
diseases can decrease the mortality rate and overall complications.
However, it is not possible to monitor patients accurately every day in all cases, and round-the-clock consultation
with a doctor is not available, since it requires more skill, time, and expertise. Since we have a good amount
of data in today’s world, we can use various machine learning algorithms to analyze the data for hidden patterns. These
hidden patterns can be used for health diagnosis in medical data.
1.2 Motivation
Machine learning techniques have been around for a long time and have been compared and used for analysis in many kinds of
scientific applications. The major motivation behind this research-based project was to explore the feature-selection methods, data
preparation and processing behind training models in machine learning. Even with readily available models and libraries,
the challenge we face today is that, despite the abundance of data and well-prepared models, the accuracy we see
during training, testing and actual validation shows high variance.
Hence, this project is carried out with the motivation to explore what lies behind the models and to implement a Logistic
Regression model to train on the obtained data. Furthermore, as machine learning as a whole is motivated by the goal of developing an
appropriate computer-based system and decision support that can aid early detection of heart disease, in this project
we have developed a model which classifies whether a patient will have heart disease within ten years or not based on various
features.
Hence, the early prognosis of cardiovascular diseases can aid in making decisions on lifestyle changes in high-risk
patients and in turn reduce complications, which can be a great milestone in the field of medicine.
1.3 Objectives
1.4 Related Works
With growing development in the field of medical science alongside machine learning, various experiments and
studies have been carried out in recent years, resulting in significant published papers. One paper proposes
heart disease prediction using KStar, J48, SMO, Bayes Net and Multilayer Perceptron using the WEKA software.
Based on performance across different factors, SMO (89% accuracy) and Bayes Net (87% accuracy) achieved
better performance than the KStar, Multilayer Perceptron and J48 techniques using k-fold cross-validation.
The accuracy achieved by those algorithms is still not satisfactory, so the accuracy needs to be improved
further to support better decisions in disease diagnosis.
In a study conducted using the Cleveland heart disease dataset, which contains 303 instances, using 10-fold
cross-validation, considering 13 attributes and implementing 4 different algorithms, the authors concluded that Gaussian Naïve
Bayes and Random Forest gave the maximum accuracy of 91.2 percent. Using the similar dataset from Framingham,
Massachusetts, experiments were carried out with 4 models, which were trained and tested with maximum accuracies of
K Neighbors Classifier: 87%, Support Vector Classifier: 83%, Decision Tree Classifier: 79%, and Random Forest
Classifier.
1.5 Data
What you'll want to do here is dive into the data your problem definition is based on. This may involve sourcing the data,
defining different parameters, talking to experts about it and finding out what you should expect.
The original data came from the Cleveland database in the UCI Machine Learning Repository.
However, we've downloaded it in a formatted way from Kaggle.
The original database contains 76 attributes, but here only 14 attributes will be used. Attributes (also called features)
are the variables we'll use to predict our target variable.
Attributes and features are also referred to as independent variables, and a target variable can be referred to as
a dependent variable.
Chapter-2
2.1 Evaluation
Features
Features are different parts of the data. During this step, you'll want to start finding out what you can about the
data.
A data dictionary describes the data you're dealing with. Not all datasets come with one, so this is where
you may have to do your own research or ask a subject matter expert (someone who knows about the data) for
more information.
The following are the features we'll use to predict our target variable (heart disease or no heart disease).
8. thalach - maximum heart rate achieved
9. exang - exercise induced angina (1 = yes; 0 = no)
10. oldpeak - ST depression induced by exercise relative to rest
    o looks at the stress of the heart during exercise
    o an unhealthy heart will stress more
11. slope - the slope of the peak exercise ST segment
    o 0: Upsloping: better heart rate with exercise (uncommon)
    o 1: Flatsloping: minimal change (typical healthy heart)
It's a good idea to save these to a Python dictionary or in an external file, so we can look at
them later without coming back here.
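As a sketch of that idea (the file name and the exact description strings are illustrative, not part of the original notebook), the feature notes could be stored and saved like this:

import json

# Hypothetical data dictionary for a few of the 14 features used in this project.
# Keys are column names from the Kaggle/Cleveland dataset; descriptions mirror the notes above.
feature_dict = {
    "thalach": "maximum heart rate achieved",
    "exang": "exercise induced angina (1 = yes; 0 = no)",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "slope of the peak exercise ST segment",
}

# Save to an external JSON file so the dictionary can be reloaded later.
with open("feature_dictionary.json", "w") as f:
    json.dump(feature_dict, f, indent=2)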
The human heart functions throughout a person’s lifespan and is one of the most robust and
hardest working muscles in the human body.
Besides humans, most other animals also possess a heart that pumps blood throughout their
bodies. Even invertebrates such as grasshoppers possess a heart-like pumping organ, though
it does not function the same way a human heart does.
2.2 Position of the heart in the human body
The human heart is located between the lungs in the thoracic cavity, slightly towards the left of the
sternum (breastbone). It is derived from the embryonic mesodermal germ layer.
The function of the heart in any organism is to maintain a constant flow of blood throughout the
body. This replenishes oxygen and circulates nutrients among the cells and tissues.
One of the primary functions of the human heart is to pump blood throughout the body.
Blood delivers oxygen, hormones, glucose and other components to various parts of the body, including
the human heart.
The heart also ensures that adequate blood pressure is maintained in the body.
There are two types of circulation within the body, namely pulmonary circulation and systemic
circulation.
Now, the heart itself is a muscle and therefore, it needs a constant supply of oxygenated blood.
This is where another type of circulation comes into play, the coronary circulation.
Coronary circulation is an essential portion of the circulation, where oxygenated blood is supplied to
the heart. This is important as the heart is responsible for supplying blood throughout the body.
Moreover, organs like the brain need a steady flow of fresh, oxygenated blood to ensure functionality.
2.3 Structure of the human heart
The human heart is about the size of a human fist and is divided into four chambers, namely two
ventricles and two atria. The ventricles are the chambers that pump blood, and the atria are the
chambers that receive blood. Among these, the right atrium and ventricle make up the “right
heart,” and the left atrium and ventricle make up the “left heart.” The structure of the heart also
houses the biggest artery in the body – the aorta.
One of the very first structures which can be observed when the external structure of the heart is
viewed is the pericardium.
Pericardium
The human heart is situated to the left of the chest and is enclosed within a fluid-filled cavity
described as the pericardial cavity. The walls and lining of the pericardial cavity are made up of a
membrane known as the pericardium.
The pericardium is a fibrous membrane found as an external covering around the heart. It protects
the heart by producing a serous fluid, which serves to lubricate the heart and prevent friction
with the surrounding organs. Apart from lubrication, the pericardium also helps by holding
the heart in its position and by maintaining a hollow space for the heart to expand into when it is
full. The wall of the heart itself is made up of three layers—
Epicardium – Epicardium is the outermost layer of the heart. It is composed of a thin-layered membrane
that serves to lubricate and protect the outer section.
Myocardium – This is a layer of muscle tissue and it constitutes the middle layer wall of the heart. It
contributes to the thickness and is responsible for the pumping action.
Endocardium – It is the innermost layer that lines the inner heart chambers and covers the heart
valves. Furthermore, it prevents the blood from sticking to the inner walls, thereby preventing potentially
fatal blood clots
Vertebrate hearts can be classified based on the number of chambers present. For instance, most
fish have two chambers, and reptiles and amphibians have three chambers. Avian and
mammalian hearts consist of four chambers. Humans are mammals; hence, we have four
chambers, namely:
Left atrium
Right atrium
Left ventricle
Right ventricle
Atria have thinner, less muscular walls and are smaller than the ventricles. These are the blood-receiving
chambers that are fed by the large veins.
Ventricles are larger and more muscular chambers responsible for pumping and pushing blood
out into circulation. These are connected to larger arteries that deliver blood for circulation.
The right ventricle and right atrium are comparatively smaller than the left chambers. The walls
consist of fewer muscles compared to the left portion, and the size difference is based on their
functions. The blood originating from the right side flows through the pulmonary circulation, while
blood arising from the left chambers is pumped throughout the body.
Blood Vessels
In organisms with closed circulatory systems, the blood flows within vessels of varying sizes. All
vertebrates, including humans, possess this type of circulation. The external structure of the heart
has many blood vessels that form a network, with other major vessels emerging from within the
structure. The blood vessels typically comprise the following:
Veins carry deoxygenated blood to the heart via the inferior and superior vena cava, and it eventually
drains into the right atrium.
Capillaries are tiny, tube-like vessels which form a network between the arteries and veins.
Arteries are muscular-walled tubes mainly involved in carrying oxygenated blood away from the heart
to all other parts of the body. The aorta is the largest of the arteries, and it branches off into various smaller
arteries throughout the body.
2.4 Preparing the tools
At the start of any project, it's customary to see the required libraries imported in a big chunk like the one you can see
below.
However, in practice, your projects may import libraries as you go. After you've spent a couple of hours working
on your problem, you'll probably want to do some tidying up. This is where you may want to consolidate every
library you've used at the top of your notebook (like the cell below).
The libraries you use will differ from project to project. But there are a few which you'll likely take advantage
of during almost every structured data project.
Pandas has a built-in function to read .csv files called read_csv(), which takes the file pathname of your .csv file.
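A minimal sketch of such a consolidated import cell and the data-loading step might look like the following (the file name heart-disease.csv is an assumption; use the path of the file downloaded from Kaggle):

# Standard imports for a structured-data project like this one.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Models and evaluation tools from scikit-learn.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix

# Read the formatted Kaggle version of the Cleveland data into a DataFrame.
df = pd.read_csv("heart-disease.csv")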
Compare different columns to each other, compare them to the target variable. Refer back to your data dictionary and
remind yourself of what different columns mean.
Your goal is to become a subject matter expert on the dataset you're working with. So if someone asks you a question
about it, you can give them an explanation and when you start building models, you can sound check them to make
sure they're not performing too well (overfitting) or why they might be performing poorly (underfitting).
Since EDA has no real set methodology, the following is a short checklist you might want to walk through:
One of the quickest and easiest ways to check your data is with the head() function. Calling it on any DataFrame will
print the top 5 rows, while tail() prints the bottom 5. You can also pass a number to them, like head(10), to show the top 10
rows.
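For example (assuming the DataFrame df loaded above):

df.head()      # first 5 rows
df.tail()      # last 5 rows
df.head(10)    # first 10 rows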
value_counts() allows you to show how many times each of the values of a
categorical column appears.
# Let's see how many positive (1) and negative (0) samples we have in our dataframe
1    165
0    138
Name: target, dtype: int64
Since these two values are close to even, our target column can be considered balanced. An unbalanced target
column, meaning some classes have far more samples than others, can be harder to model than a balanced one. Ideally, all of your
target classes have the same number of samples.
If you'd prefer these values as percentages, value_counts() takes a parameter, normalize, which can be set to True.
# Normalized value counts
1    0.544554
0    0.455446
We can plot the target column value counts by calling the plot() function and telling it what kind of plot we'd like; in this
case, bar is good.
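A minimal sketch of those calls (assuming the DataFrame df and imports from earlier, and a target column named target; the colours are illustrative):

# Count positive (1) and negative (0) samples in the target column.
df["target"].value_counts()

# The same counts expressed as proportions.
df["target"].value_counts(normalize=True)

# Plot the counts as a bar chart.
df["target"].value_counts().plot(kind="bar", color=["salmon", "lightblue"])
plt.title("Heart Disease Frequency")
plt.show()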
Heart Disease Frequency according to Gender
If you want to compare two columns to each other, you can use the function pd.crosstab(column_1, column_2).
This is helpful if you want to start gaining an intuition about how your independent variables interact with your
dependent variable.
Let's compare our target column with the sex column.
Remember from our data dictionary, for the target column, 1 = heart disease present, 0 = no heart disease. And for sex,
1 = male, 0 = female.
sex        0      1
target
0         24    114
1         72     93
You can plot the crosstab by using the plot() function and passing it a few parameters such as kind (the type of plot
you want), figsize=(length, width) (how big you want it to be) and color=[colour_1, colour_2] (the different colours
you'd like to use).
Different metrics are represented best with different kinds of plots. In our case, a bar graph is great. We'll see examples of
more later. And with a bit of practice, you'll gain an intuition of which plot to use with different variables.
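A sketch of that comparison (again assuming df, with columns named target and sex; figure size and colours are illustrative):

# Cross-tabulate the target against sex.
pd.crosstab(df["target"], df["sex"])

# Plot the crosstab as a grouped bar chart.
pd.crosstab(df["target"], df["sex"]).plot(kind="bar",
                                          figsize=(10, 6),
                                          color=["salmon", "lightblue"])
plt.title("Heart Disease Frequency according to Sex")
plt.xlabel("0 = No Disease, 1 = Disease")
plt.legend(["Female", "Male"])
plt.show()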
2.5 Age vs Max Heart rate for Heart Disease
Let's try combining a couple of independent variables, such as age and thalach (maximum heart rate), and then
comparing them to our target variable, heart disease.
Because there are so many different values for age and thalach, we'll use a scatter plot.
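A minimal sketch of such a scatter plot, assuming columns named age, thalach and target (colours and figure size are illustrative):

# Scatter plot of age vs. maximum heart rate, split by heart disease status.
plt.figure(figsize=(10, 6))
plt.scatter(df.age[df.target == 1], df.thalach[df.target == 1], c="salmon")
plt.scatter(df.age[df.target == 0], df.thalach[df.target == 0], c="lightblue")
plt.title("Age vs. Max Heart Rate for Heart Disease")
plt.xlabel("Age")
plt.ylabel("Max heart rate (thalach)")
plt.legend(["Disease", "No disease"])
plt.show()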
Heart Disease Frequency per Chest Pain Type
Let's try another independent variable. This time, cp (chest pain).
We'll use the same process as we did before with sex.
target     0     1
cp
0        104    39
1          9    41
2         18    69
3          7    16
3. cp - chest pain type
0: Typical angina: chest pain related to decreased blood supply to the heart
1: Atypical angina: chest pain not related to heart
2: Non-anginal pain: typically esophageal spasms (non heart related)
3: Asymptomatic: chest pain not showing signs of disease
Model Comparison
Since we've saved our models' scores to a dictionary, we can plot them by first converting them to a DataFrame.
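A sketch of that step, assuming a dictionary model_scores collected from model.score() calls (the numbers below are placeholders for illustration, not this project's measured results, apart from the 85% logistic regression accuracy reported in the conclusion):

# Hypothetical scores dictionary; values would normally come from model.score(X_test, y_test).
model_scores = {"Logistic Regression": 0.85,
                "KNN": 0.75,
                "Random Forest": 0.83}

# Convert to a DataFrame and plot as a bar chart.
model_compare = pd.DataFrame(model_scores, index=["accuracy"])
model_compare.T.plot(kind="bar")
plt.title("Model Comparison")
plt.ylabel("Accuracy")
plt.show()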
Chapter-3
3.1 Machine Learning
Machine Learning is the field of study that gives computers the capability to learn without being explicitly
programmed. ML is one of the most exciting technologies that one could come across. As is
evident from the name, it gives the computer something that makes it more similar to humans: the ability to learn.
Machine learning is actively being used today, perhaps in many more places than one would expect.
A machine can learn by itself from past data and automatically improve.
From the given dataset it detects various patterns in the data.
For big organizations branding is important, and it becomes easier to target a relatable customer
base.
It is similar to data mining because it also deals with huge amounts of data.
Supervised Machine Learning is where you have input variables (x) and an output variable (Y) and you use
an algorithm to learn the mapping function from the input to the output, Y = f(X). The goal is to approximate
the mapping function so well that when you have new input data (x) you can predict the output variable (Y)
for that data.
Supervised learning problems can be further grouped into Regression and Classification problems.
Regression: Regression algorithms are used to predict a continuous numerical output. For example, a
regression algorithm could be used to predict the price of a house based on its size, location, and other
features.
Classification: Classification algorithms are used to predict a categorical output. For example, a
classification algorithm could be used to predict whether an email is spam or not.
Classification Types
There are two main classification types in machine learning:
Binary Classification
In binary classification, the goal is to classify the input into one of two classes or categories. Example – On
the basis of the given health conditions of a person, we have to determine whether the person has a certain
disease or not.
Multiclass Classification
In multi-class classification, the goal is to classify the input into one of several classes or categories. For
example, on the basis of data about different species of flowers, we have to determine which species our
observation belongs to.
Regression analysis is a statistical process for estimating the relationships between a dependent (criterion)
variable and one or more independent variables (predictors). Regression analysis is generally
used when we deal with a dataset that has the target variable in the form of continuous data. Regression
analysis explains the changes in the criterion variable in relation to changes in selected predictors.
The conditional expectation of the criterion variable is based on the predictors, i.e., the average value of the dependent
variable when the independent variables are varied. Three major uses for regression analysis are
determining the strength of predictors, forecasting an effect, and trend forecasting.
There are times when we would like to analyze the effect of different independent features on the target, or
what we call the dependent feature. This helps us make decisions that can affect the target variable in the
desired direction.
Regression analysis is heavily based on statistics and hence gives quite reliable results; for this reason,
regression models are used to find both linear and non-linear relations between the independent variables and the
dependent (target) variable.
Along with the development of the machine learning domain, regression analysis techniques have gained
popularity and developed manifold from just y = mx + c.
There are several types of regression techniques, each suited for different types of data and different types of
relationships. The main types of regression techniques are:
Polynomial Regression
This is an extension of linear regression and is used to model a non-linear relationship between the
dependent variable and independent variables. Here as well the syntax remains the same, but now in the input
variables we also include polynomial or higher-degree terms of some already existing features.
Linear regression was only able to fit a linear model to the data at hand, but with polynomial features, we can
easily fit some non-linear relationship between the target and the input features.
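A minimal sketch of this idea using scikit-learn (the single toy feature x and the quadratic relationship are illustrative, not part of this project's pipeline):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data with a non-linear (quadratic) relationship plus noise.
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + x.ravel() + np.random.normal(0, 0.2, 50)

# Same linear-regression syntax, but the input is expanded with degree-2 terms.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict([[1.5]]))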
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various
subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and
based on the majority votes of predictions, and it predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.
The below diagram explains the working of the Random Forest algorithm:
Assumptions for Random Forest
Since the random forest combines multiple trees to predict the class of the dataset, it is possible that some decision trees
may predict the correct output, while others may not. But together, all the trees predict the correct output. Therefore,
below are two assumptions for a better Random Forest classifier:
o There should be some actual values in the feature variables of the dataset so that the classifier can predict accurate
results rather than guessed results.
o The predictions from each tree must have very low correlations.
Random Forest algorithm is a powerful tree learning technique in Machine Learning. It works by creating a number of Decision
Trees during the training phase. Each tree is constructed using a random subset of the data set to measure a random subset of
features in each partition. This randomness introduces variability among individual trees, reducing the risk of overfitting and
improving overall prediction performance.
In prediction, the algorithm aggregates the results of all trees, either by voting (for classification tasks) or by
averaging (for regression tasks).
This collaborative decision-making process, supported by multiple trees with their own insights, provides stable and
precise results. Random forests are widely used for classification and regression tasks and are known for their ability to
handle complex data, reduce overfitting, and provide reliable forecasts in different environments.
Ensemble learning models work like a group of diverse experts teaming up to make a decision: picture a group of
friends with different skills working on a project. Each friend excels in a particular area, and by combining their
strengths, they create a more robust solution than any individual could achieve alone.
The Random Forest algorithm works in several steps, which are discussed below:
Random Forest leverages the power of ensemble learning by constructing an army of Decision Trees. These trees are like
individual experts, each specializing in a particular aspect of the data. Importantly, they operate independently, minimizing the
risk of the model being overly influenced by the nuances of a single tree.
Random Feature Selection: To ensure that each decision tree in the ensemble brings a unique perspective, Random Forest
employs random feature selection. During the training of each tree, a random subset of features is chosen.
This randomness ensures that each tree focuses on different aspects of the data, fostering a diverse set of predictors within the
ensemble.
Bootstrap Aggregating or Bagging: The technique of bagging is a cornerstone of Random Forest’s training strategy which
involves creating multiple bootstrap samples from the original dataset, allowing instances to be sampled with replacement. This
results in different subsets of data for each decision tree, introducing variability in the training process and making the model
more robust.
Decision Making and Voting: When it comes to making predictions, each decision tree in the Random Forest casts its vote. For
classification tasks, the final prediction is determined by the mode (most frequent prediction) across all the trees. In regression
tasks, the average of the individual tree predictions is taken. This internal voting mechanism ensures a balanced and collective
decision-making process
Applications of Random Forest in Real-World Scenarios
Finance Wizard: Imagine Random Forest as our financial superhero, diving into the world of credit scoring. Its mission? To
determine if you’re a credit superhero or, well, not so much. With a knack for handling financial data and sidestepping overfitting
issues, it’s like having a guardian angel for robust risk assessments.
Health Detective: In healthcare, Random Forest turns into a medical Sherlock Holmes. Armed with the ability to decode medical
jargon, patient records, and test results, it’s not just predicting outcomes; it’s practically assisting doctors in solving the mysteries
of patient health.
Environmental Guardian: Out in nature, Random Forest transforms into an environmental superhero. With the power to
decipher satellite images and brave noisy data, it becomes the go-to hero for tasks like tracking land cover changes and
safeguarding against potential deforestation, standing as the protector of our green spaces.
Digital Bodyguard: In the digital realm, Random Forest becomes our vigilant guardian against online trickery. It's like a
cyber-sleuth, analyzing our digital footsteps for any hint of suspicious activity. Its ensemble approach is akin to having a
team of cyber-detectives, spotting subtle deviations that scream “fraud alert!” It's not just protecting our online
transactions; it's our digital bodyguard.
Random Forest Classification is an ensemble learning technique designed to enhance the accuracy and robustness of classification
tasks. The algorithm builds a multitude of decision trees during training and outputs the class that is the mode of the classes
predicted by the individual trees. Each decision tree in the random forest is constructed using a subset of the training data and a
random subset of features, introducing diversity among the trees and making the model more robust and less prone to overfitting.
The random forest algorithm employs a technique called bagging (Bootstrap Aggregating) to create these diverse subsets.
During the training phase, each tree is built by recursively partitioning the data based on the features. At each split, the algorithm
selects the best feature from the random subset, optimizing for information gain or Gini impurity. The process continues until a
predefined stopping criterion is met, such as reaching a maximum depth or having a minimum number of samples in each leaf
node.
Once the random forest is trained, it can make predictions: each tree “votes” for a class, and the class with the most votes
becomes the predicted class for the input data. A few practical tuning notes:
o More trees generally lead to better performance, but at the cost of computational time.
o Deeper trees can capture more complex patterns, but also risk overfitting.
o Experiment with values between 5 and 15, and consider lower values for smaller datasets.
o Gini impurity is often slightly faster than entropy, but both are generally similar in performance.
o Higher values can prevent overfitting, but values that are too high can limit model complexity.
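As a hedged sketch, these tuning knobs map roughly onto scikit-learn parameters as shown below; the specific values are illustrative defaults, not tuned results from this project, and df is the heart-disease DataFrame loaded earlier:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split the heart-disease data into train and test sets.
X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(
    n_estimators=100,      # number of trees: more trees, better but slower
    max_depth=10,          # deeper trees capture more patterns but may overfit
    criterion="gini",      # split quality measure: "gini" or "entropy"
    min_samples_leaf=2,    # higher values help prevent overfitting
    random_state=42,
)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))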
What is Random Forest Regression?
Random Forest Regression in machine learning is an ensemble technique capable of performing both regression and classification
tasks with the use of multiple decision trees and a technique called Bootstrap Aggregation, commonly known as bagging. The
basic idea behind this is to combine multiple decision trees in determining the final output rather than relying on individual
decision trees.
Random Forest has multiple decision trees as base learning models. We randomly perform row sampling and feature sampling
from the dataset, forming sample datasets for every model. This part is called Bootstrap.
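A minimal sketch of this bootstrap-and-aggregate idea with scikit-learn's regressor (the toy data is illustrative; this project itself uses classification):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy regression data: each tree is trained on a bootstrap sample (row sampling)
# and considers a random subset of features at each split (feature sampling).
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] ** 2 + 3 * X[:, 1] + rng.normal(0, 1, 200)

reg = RandomForestRegressor(
    n_estimators=100,     # number of bootstrapped trees
    max_features="sqrt",  # random feature subset per split
    bootstrap=True,       # sample rows with replacement
    random_state=0,
)
reg.fit(X, y)
# The final prediction is the average of the individual tree predictions.
print(reg.predict([[5.0, 2.0, 1.0]]))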
We need to approach the Random Forest regression technique like any other machine learning technique:
o Design a specific question or problem and identify the source of the required data.
o Make sure the data is in an accessible format; otherwise, convert it to the required format.
o Specify all noticeable anomalies and missing data points that may need to be handled to obtain the required data.
o Now compare the performance metrics on both the test data and the data predicted by the model.
o If it doesn't satisfy your expectations, you can try improving your model accordingly, updating your data, or using another
data modeling technique.
The working process can be explained in the following steps and diagram:
Step-1: Select random data points (subsets) from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat the previous steps until N trees have been built.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data point to the category
that wins the majority of votes.
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the Random
forest classifier. The dataset is divided into subsets and given to each decision tree. During the training phase,
each decision tree produces a prediction result, and when a new data point occurs, then based on the majority of
results, the Random Forest classifier predicts the final decision. Consider the below image:
Applications of Random Forest
There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
K-Nearest Neighbor (K-NN) Algorithm
o The K-NN algorithm assumes the similarity between the new case/data and the available cases and puts the new case into
the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means
that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used for
Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumptions about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it
stores the dataset, and at the time of classification, it performs an action on the dataset.
o The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data
into a category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to a cat and a dog, but we want to know whether
it is a cat or a dog. For this identification, we can use the K-NN algorithm, as it works on a similarity measure.
Our K-NN model will find the features of the new image similar to those of the cat and dog images, and based on the
most similar features it will put it in either the cat or dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1; in which of
these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With
the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below
diagram:
Suppose we have a new data point and we need to put it in the required category. Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance
between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2), it can be
calculated as d = √((x2 − x1)² + (y2 − y1)²).
o By calculating the Euclidean distance, we found the nearest neighbors: three nearest neighbors in category A and
two nearest neighbors in category B. Consider the below image:
o As we can see, the 3 nearest neighbors are from category A; hence this new data point must belong to category A.
o There is no particular way to determine the best value for "K", so we need to try some values to find the best out
of them. The most preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.
o Large values for K are good, but they may cause some difficulties.
Python implementation of the KNN algorithm
For the Python implementation of the K-NN algorithm, we will use the same problem and dataset which we
used for Logistic Regression, but here we will improve the performance of the model. Below is the
problem description:
Problem for the K-NN Algorithm: There is a car manufacturer company that has manufactured a new SUV car.
The company wants to show ads to the users who are interested in buying that SUV. For this problem, we
have a dataset that contains information on multiple users from a social network. The dataset contains lots of
information, but we will consider Estimated Salary and Age as the independent variables and
the Purchased variable as the dependent variable. Below is the dataset:
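A hedged sketch of that K-NN workflow with scikit-learn, assuming columns named Age, EstimatedSalary and Purchased (the file name User_Data.csv is illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# Illustrative file and column names; adjust to the actual dataset.
data = pd.read_csv("User_Data.csv")
X = data[["Age", "EstimatedSalary"]]
y = data["Purchased"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling matters for distance-based methods like K-NN.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# k = 5 neighbors with Euclidean distance (minkowski metric, p=2).
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred))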
o Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a
categorical or discrete value. It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact
values 0 and 1, it gives probabilistic values which lie between 0 and 1.
o Logistic Regression is much like Linear Regression, except in how they are used. Linear Regression is
used for solving regression problems, whereas Logistic Regression is used for solving classification
problems.
o In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts
two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous
or not, or whether a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability to provide probabilities
and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify observations using different types of data and can easily determine
the most effective variables used for the classification. The below image shows the logistic function:
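Since that image is the standard logistic (sigmoid) curve, the function it shows can be written as σ(z) = 1 / (1 + e^(−z)), which maps any real-valued input z to a probability between 0 and 1.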
Note: Logistic regression uses the concept of predictive modeling like regression; therefore, it is
called logistic regression, but it is used to classify samples, and therefore it falls under
classification algorithms.
o In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1.
Values above the threshold tend towards 1, and values below the threshold tend towards 0.
o The equation of a straight line can be written as y = b0 + b1x1 + b2x2 + ... + bnxn. In Logistic Regression, y can be
between 0 and 1 only, so we divide the equation by (1 − y), giving y / (1 − y).
o But we need a range from −infinity to +infinity, so we take the logarithm of the equation, which becomes
log[y / (1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn.
Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the dependent variables, such
as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the
dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of the dependent variable,
such as "low", "Medium", or "High".
Example: There is a dataset which contains the information of various users obtained from social
networking sites. A car-making company has recently launched a new SUV car, and the company
wants to check how many users from the dataset want to purchase the car.
For this problem, we will build a Machine Learning model using the Logistic Regression algorithm. The dataset
is shown in the image below. In this problem, we will predict the Purchased variable (dependent
variable) by using Age and Salary (independent variables).
Steps in Logistic Regression: To implement the Logistic Regression using Python, we will use the same steps
as we have done in previous topics of Regression. Below are the steps:
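A minimal sketch of those steps applied to the heart-disease data used in this project (assuming the DataFrame df and target column from earlier; the hyperparameters shown are defaults, not tuned values):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix

# Split features and target, then hold out a test set.
X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the logistic regression model.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Evaluate with accuracy, a confusion matrix and cross-validation, as described in the abstract.
print("Test accuracy:", log_reg.score(X_test, y_test))
print(confusion_matrix(y_test, log_reg.predict(X_test)))
print("5-fold CV accuracy:", cross_val_score(log_reg, X, y, cv=5).mean())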
Chapter-4
4.1 CONCLUSION
The early prognosis of cardiovascular diseases can aid in making decisions on lifestyle changes in high-risk patients
and in turn reduce complications, which can be a great milestone in the field of medicine. This project explored the
feature selection (i.e., backward elimination and RFECV) behind the models and successfully predicted heart disease
with 85% accuracy. The model used was Logistic Regression. For further enhancement, we can train additional models
to predict the types of cardiovascular disease and provide recommendations to the users, and also use more advanced
models.
4.2 REFERENCES
[1] A. H. M. S. U. Marjia Sultana, "Analysis of Data Mining Techniques for Heart Disease Prediction," 2018.
[3] A. B. Nassif, I. Shahin, M. Bader, A. Hassan, and N. Werghi, "COVID-19 Detection Systems Using Deep-Learning Algorithms Based on Speech and Image Data," Mathematics, 2022.
[6] S. Rehman, E. Rehman, M. Ikram, and Z. Jianglin, "Cardiovascular disease (CVD): assessment, prediction and policy implications," BMC Public Health, vol. 21, no. 1, p. 1299, 2021, doi: 10.1186/s12889-021-11334-2.
[7] O. Atef, A. B. Nassif, M. A. Talib, and Q. Nassir, "Death/Recovery Prediction for Covid-19 Patients using Machine Learning," 2020.
[8] H. Hijazi, M. Abu Talib, A. Hasasneh, A. Bou Nassif, N. Ahmed, and Q. Nasir, "Wearable Devices, Smartphones, and Interpretable Artificial Intelligence in Combating COVID-19," Sensors, vol. 21, no. 24, 2021, doi: 10.3390/s21248424.
[9] O. T. Ali, A. B. Nassif, and L. F. Capretz, "Business intelligence solutions in healthcare a case study: Transforming OLTP system to BI solution," in 2013 3rd International Conference on Communications and Information Technology, ICCIT 2013, 2013, pp. 209–214, doi: 10.1109/ICCITechnology.2013.6579551.