Machine Learning Project Report (Group 3) Shahbaz Khan
Logistic Regression
Logistic Regression is a supervised machine learning
algorithm that models the probability of a certain
class or event. It is suited to data in which the classes are
(approximately) linearly separable and the outcome is binary, or
dichotomous, in nature. For this reason, logistic regression is
usually used for binary classification problems. Binary
classification refers to predicting an output variable that takes
one of two discrete classes.
A logistic regression model predicts a dependent variable by
analyzing its relationship with one or more
independent variables.
Independent variables can be numeric or categorical,
but the dependent variable is always
categorical.
A few examples of Binary classification are Yes/No, Pass/Fail,
Win/Lose, Cancerous/Non-cancerous, etc.
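To make the idea concrete, here is a minimal sketch of logistic regression with a single numeric feature, trained by plain gradient descent. It is only an illustration, not the code used in the project: the data (hours studied versus Pass/Fail), the function names and the learning-rate settings are all assumptions made for this example.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weight w and bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y   # predicted probability minus label
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def predict(x, w, b, threshold=0.5):
    """Return class 1 if the predicted probability reaches the threshold."""
    return 1 if sigmoid(w * x + b) >= threshold else 0

# Hypothetical toy data: hours studied -> Pass (1) / Fail (0)
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = train_logistic(hours, passed)
print("P(pass | 1 hour)  =", round(sigmoid(w * 1.0 + b), 3))
print("P(pass | 8 hours) =", round(sigmoid(w * 8.0 + b), 3))
```

The model outputs a probability, and the 0.5 threshold in predict turns that probability into one of the two classes.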
K-nearest-neighbor algorithm
So, the higher the AUC value for a classifier, the better its ability
to distinguish between positive and negative classes.
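One way to read the AUC is as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing every positive/negative pair. The labels and scores are made up for illustration, and auc_score is a hypothetical helper name, not a library function.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities from two classifiers
y_true = [0, 0, 1, 1, 0, 1]
good_scores = [0.1, 0.3, 0.8, 0.9, 0.2, 0.7]   # every positive outranks every negative
weak_scores = [0.6, 0.3, 0.4, 0.9, 0.5, 0.7]   # some negatives outrank a positive

print("good classifier AUC:", auc_score(y_true, good_scores))
print("weak classifier AUC:", auc_score(y_true, weak_scores))
```

The pairwise loop is quadratic in the number of samples, which is fine for a small example but would be replaced by a sorting-based method on a large dataset.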
Let’s dig a bit deeper and understand what our ROC curve would
look like for different threshold values and how the specificity and
sensitivity would vary.
We can understand this graph by generating a confusion
matrix for each point, which corresponds to one threshold, and
discussing the performance of our classifier:
All points above this line correspond to the situation where the
proportion of correctly classified points belonging to the Positive
class (the true positive rate) is greater than the proportion of
incorrectly classified points belonging to the Negative class (the
false positive rate).
Although Point B has the same Sensitivity as Point A, it has a
higher Specificity, meaning that the number of incorrectly classified
Negative class points is lower than at the previous threshold. This
indicates that this threshold is better than the previous one.
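The comparison between thresholds can be reproduced with a small sketch that builds the confusion-matrix counts for each threshold and derives Sensitivity and Specificity from them. The labels, scores and threshold values below are assumptions chosen so that two thresholds share the same Sensitivity but differ in Specificity; they are not the points from the original graph.

```python
def confusion_counts(labels, scores, threshold):
    """Return (TP, FP, TN, FN) when scores >= threshold are predicted Positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity is the true positive rate, Specificity the true negative rate."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.45, 0.6, 0.55, 0.65, 0.8, 0.9]

for t in (0.4, 0.5, 0.7):
    sens, spec = sensitivity_specificity(y_true, y_score, t)
    # Each threshold gives one ROC point: x = 1 - specificity, y = sensitivity
    print(f"threshold={t:.1f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this toy data, moving the threshold from 0.4 to 0.5 keeps the Sensitivity unchanged while raising the Specificity, so 0.5 is the better of the two. Raising it further to 0.7 trades Sensitivity for Specificity.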
Confusion Matrix:
Accuracy
Accuracy (ACC) is calculated as the number of correct
predictions divided by the total number of samples in the dataset.
The best accuracy is 1.0, whereas the worst is 0.0. It can also be
calculated as 1 - ERR, where ERR is the error rate.
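As a quick check of the two formulas, the sketch below computes the accuracy and the error rate from confusion-matrix counts and confirms that they add up to 1. The counts are hypothetical and are not taken from the project's results.

```python
def accuracy(tp, fp, tn, fn):
    """Correct predictions (TP + TN) divided by all predictions."""
    return (tp + tn) / (tp + fp + tn + fn)

def error_rate(tp, fp, tn, fn):
    """Incorrect predictions (FP + FN) divided by all predictions."""
    return (fp + fn) / (tp + fp + tn + fn)

# Hypothetical confusion-matrix counts
tp, fp, tn, fn = 40, 5, 45, 10

acc = accuracy(tp, fp, tn, fn)
err = error_rate(tp, fp, tn, fn)
print("ACC =", acc)
print("1 - ERR =", 1 - err)
```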
5. Conclusions
With the increasing number of deaths due to heart disease, it
has become essential to develop a system that predicts heart
disease effectively and accurately. The motivation for this study
was to find the most efficient ML algorithm for detecting heart
disease. The study compares the accuracy scores of the Decision
Tree, Logistic Regression, Random Forest and Naive Bayes
algorithms for predicting heart disease using a dataset from the
UCI Machine Learning Repository. The results indicate that
the Random Forest algorithm is the most efficient, with an
accuracy score of 90.16% for the prediction of heart disease. In
future, the work can be extended by developing a web application
based on the Random Forest algorithm and by using a larger dataset
than the one used in this analysis, which should provide better
results and help health professionals predict heart disease
effectively and efficiently.
6. Implementation code
The Jupyter Notebook of the project is attached here.