
ETE 457: Neural and Fuzzy Systems

Machine Learning Classifiers

Online Resource
Topics to be Covered
• Logistic Regression
• Decision Tree
• Random Forest
• K-Nearest Neighbour (KNN)
• Naïve Bayes
• Support Vector Machine (SVM)
• Evaluation Metrics

2
Logistic Regression
• Why the linear regression model will not work for classification problems.
• How the logistic regression model is derived from a simple linear model.

• While working with machine learning models, one question that generally comes to mind for a given problem is whether to use a regression model or a classification model.
• Regression and classification are both supervised learning approaches.
• In regression, the predicted values are continuous, whereas in classification the predicted values are categorical.
• In simple terms, if you have a dataset with a student's marks in five subjects and you have to predict the marks in a sixth subject, it is a regression problem.
• On the other hand, if you have to predict whether a student passes or fails based on those marks, it is a classification problem.

3
Logistic Regression
• Is logistic regression a classification algorithm or a regression algorithm?

• Despite its name, logistic regression is actually a classification algorithm.

• Let's consider a small example: a plot with the age of a person on the x-axis and whether that person owns a smartphone on the y-axis.
• It is a classification problem: given the age of a person, we have to predict whether he or she possesses a smartphone.

• In such a classification problem, can we use linear regression?
4


Issues with Linear Regression
• To solve the above prediction problem, let's first use a linear model.
• On the plot, we can draw a line that separates the data points into two groups using a threshold age value.
• All the data points below that threshold will be classified as 0, i.e., those who do not have smartphones.
• Similarly, all the observations above the threshold will be classified as 1, which means these people have smartphones, as shown in the image below.

5
Issues with Linear Regression
Case I
•Suppose we get a new data point at the extreme right of the plot; suddenly the slope of the fitted line changes.
•Now we are forced to change the threshold of our model.
•Hence, this is the first issue with linear regression: the age threshold should not have to shift every time new data arrives.

6
Issues with Linear Regression
Case II
•The other issue with linear regression is that when you extend the line, it gives values above 1 and below 0.
•In our classification problem, we do not know what values greater than 1 or below 0 represent, so this is not a natural extension of the linear model.
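Both issues can be reproduced in a few lines of code. The sketch below is illustrative only: the ages, labels, and the use of NumPy's polyfit for the least-squares line are assumptions, not part of the original slides.

import numpy as np

# Hypothetical data: ages and whether each person owns a smartphone (0/1).
ages = np.array([13, 15, 18, 22, 25, 30, 35, 40], dtype=float)
owns = np.array([0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

def fit_line(x, y):
    """Ordinary least-squares fit: y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

slope, intercept = fit_line(ages, owns)
# Age at which the fitted line crosses 0.5 -- the implied decision threshold.
print("threshold age:", round((0.5 - intercept) / slope, 1))

# Issue 1: a single extreme point on the right changes the slope and the threshold.
slope2, intercept2 = fit_line(np.append(ages, 90.0), np.append(owns, 1.0))
print("threshold after outlier:", round((0.5 - intercept2) / slope2, 1))

# Issue 2: the line is unbounded, so predictions fall outside [0, 1].
print("prediction at age 5:", round(slope * 5 + intercept, 2))    # below 0
print("prediction at age 80:", round(slope * 80 + intercept, 2))  # above 1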

7
Linear to Logistic
Regression
• Here is where logistic regression comes in. Let's try to build a new model known as logistic regression. Suppose the equation of the linear line is Z = β₀ + β₁ · Age.

• Now we want a function Q(Z) that transforms these values to lie between 0 and 1, as shown in the following image. This is where the sigmoid (logistic) function comes in handy.

8
Linear to Logistic
Regression
• The sigmoid function transforms the straight line into an S-shaped curve.
• This constrains the values between 0 and 1. Now it does not matter how many new points are added at either extreme; they will not affect the model.
• The other important aspect is that, for each observation, the model gives a continuous value between 0 and 1.
• This continuous value is the predicted probability of that data point. If the predicted probability is near 1, the data point is classified as 1; otherwise, as 0.
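A minimal sketch of this transformation, assuming made-up coefficients b0 and b1 for the linear part Z = b0 + b1 · Age (the numbers are illustrative, not from the slides):

import numpy as np

def sigmoid(z):
    """Squash any real-valued input into the (0, 1) range: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -8.0, 0.35                       # hypothetical coefficients
ages = np.array([10, 20, 23, 30, 60], dtype=float)
z = b0 + b1 * ages                        # linear output, unbounded
probs = sigmoid(z)                        # always strictly between 0 and 1

print(np.round(probs, 3))                 # predicted probabilities
print((probs >= 0.5).astype(int))         # class 1 when probability >= 0.5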

9
Cost function for Logistic
regression
• For linear regression, the cost function we mostly use is the mean squared error: the difference between y_predicted and y_actual, squared and then averaged over all data points.
• It is a convex function, as shown below. This cost function can be optimized easily using gradient descent.

• However, if we use the same cost function for logistic regression, whose hypothesis (the sigmoid) is non-linear, the cost surface becomes non-convex. This creates unnecessary complications if we use gradient descent for model optimization.

10
Cost function for Logistic
regression
• Hence, we need a different cost function for our new model.
• This is where log loss comes into the picture. As you can see, the probability in the log-loss equation has been replaced with y_hat, the model's predicted probability.

11
Cost function for Logistic
regression
• In the first case, when the actual class is 1, the left-hand term of the equation is active and the right-hand term vanishes.
• You will notice in the plot below that, as the predicted probability moves towards 0, the cost increases sharply.

12
Cost function for Logistic
regression
• Similarly, when the actual class is 0, the right-hand term becomes active and the left-hand term vanishes; as the predicted probability moves towards 1, the cost increases sharply.
• This increases the cost of wrong predictions. The two terms are then added together to give the complete log-loss cost.
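Written out, the log loss for one observation is −[y · log(ŷ) + (1 − y) · log(1 − ŷ)], and the cost is its average over all observations. A minimal NumPy sketch with made-up labels and predicted probabilities:

import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy: -[y*log(p) + (1-y)*log(1-p)], averaged over points."""
    p = np.clip(y_prob, eps, 1 - eps)          # keep log() away from 0 and 1
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1, 1, 0, 0])
good = np.array([0.9, 0.8, 0.1, 0.2])          # confident and mostly right
bad = np.array([0.1, 0.2, 0.9, 0.8])           # confident and mostly wrong

print(round(log_loss(y_true, good), 3))        # small cost
print(round(log_loss(y_true, bad), 3))         # much larger cost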

13
Cost function for Logistic
regression

14
Decision Tree
• Decision tree is a supervised learning technique that can be used for both classification and regression problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents an outcome.
• In a decision tree, there are two types of nodes: decision nodes and leaf nodes.
• Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of the features of the given dataset.
• It is called a decision tree because, similar to a tree, it starts with a root node, which expands into further branches and constructs a tree-like structure.
• A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
• The diagram below shows the general structure of a decision tree:
15
Decision Tree

16
Decision Tree
• Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further after reaching a leaf node.
• Splitting: Splitting is the process of dividing a decision node/root node into sub-nodes according to the given conditions.
• Branch/Sub-Tree: A subtree formed by splitting the tree.
• Pruning: Pruning is the process of removing unwanted branches from the tree.
• Parent/Child node: A node that is divided into sub-nodes is called the parent of those sub-nodes, and the sub-nodes are called its child nodes.

17
Working Principle
• In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the tree.
• The algorithm compares the value of the root attribute with the corresponding attribute of the record (from the real dataset) and, based on the comparison, follows a branch and jumps to the next node.

• At the next node, the algorithm again compares the attribute value with the sub-nodes and moves further.
• It continues this process until it reaches a leaf node of the tree. The complete process can be better understood from the algorithm below:

18
Working Principle
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node that contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step 3. Continue this process until a stage is reached where the nodes cannot be classified further; such a final node is called a leaf node. A minimal code sketch follows these steps.
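In practice the recursion above is usually delegated to a library. A minimal sketch with scikit-learn's DecisionTreeClassifier on a made-up two-feature dataset (the feature names and values are illustrative only):

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy dataset: [salary_in_lakhs, distance_to_office_km]; label 1 = accept the offer.
X = [[4, 20], [6, 25], [8, 5], [9, 30], [12, 10], [15, 3]]
y = [0, 0, 1, 0, 1, 1]

tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned rules (root node, decision nodes, leaf nodes).
print(export_text(tree, feature_names=["salary", "distance"]))
print(tree.predict([[10, 8]]))   # classify a new candidate's situation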

19
Working Principle
• Example: Suppose a candidate has a job offer and wants to decide whether to accept it or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits into the next decision node (distance from the office) and one leaf node, based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (Accept offer and Decline offer). Consider the diagram below:

20
Attribute Selection
Measures
• While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes.
• To solve such problems there is a technique called the Attribute Selection Measure (ASM).
• Using this measure, we can easily select the best attribute for the nodes of the tree. There are two popular ASM techniques (a small computational sketch follows the list):

• Information Gain
• Gini Index
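A minimal sketch of how both measures can be computed for a candidate split; the class counts used here are made up for illustration:

import numpy as np

def entropy(labels):
    """Shannon entropy of a collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    """Gini impurity of a collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of its child nodes."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ["yes"] * 9 + ["no"] * 5                              # 9 positives, 5 negatives
split = [["yes"] * 6 + ["no"] * 1, ["yes"] * 3 + ["no"] * 4]   # one candidate split
print("gini(parent):    ", round(gini(parent), 3))
print("information gain:", round(information_gain(parent, split), 3))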

21
Applications
• In the healthcare industry, a decision tree can tell whether a patient is suffering from a disease or not based on conditions such as age, weight, sex, and other factors. Other applications include estimating the effect of a medicine from factors such as composition, period of manufacture, etc. A decision tree can also be very effective in the diagnosis of medical reports.

22
Applications
• Whether a person is eligible for a loan or not, based on financial status, family members, salary, etc., can be decided with a decision tree. Other applications include credit card fraud, bank schemes and offers, loan defaults, etc., which can be detected or prevented using a proper decision tree.

23
Applications
• In colleges and universities, the shortlisting of a student can be decided based upon merit scores, attendance, overall score, etc. A decision tree can also guide the overall promotional strategy of the faculties present in a university.

24
Random Forest
• Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique.
• It can be used for both classification and regression problems in ML.
• It is based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and improve the performance of the model.

• "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset."

• Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, it predicts the final output.

• A greater number of trees in the forest leads to higher accuracy and helps prevent the problem of overfitting.
25
Random Forest

26
Why use Random Forest
Below are some points that explain why we should use the Random Forest algorithm:

•It takes less training time compared to other algorithms.
•It predicts output with high accuracy, and it runs efficiently even on large datasets.
•It can also maintain accuracy when a large proportion of the data is missing.

27
Working Principle
Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions using each tree created in the first phase.

The working process can be explained by the steps and diagram below (a minimal code sketch follows the steps):

Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees you want to build.
Step-4: Repeat Steps 1 and 2.
Step-5: For a new data point, find the prediction of each decision tree and assign the new data point to the category that wins the majority vote.
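A minimal scikit-learn sketch of these steps; n_estimators plays the role of N, and the tiny dataset is made up for illustration:

from sklearn.ensemble import RandomForestClassifier

# Made-up two-feature dataset with binary labels.
X = [[1, 5], [2, 4], [3, 7], [2, 6], [6, 1], [7, 2], [8, 3], [7, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# N = 100 trees, each trained on a bootstrap sample (random subset) of the rows,
# which mirrors Steps 1-4 above.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
forest.fit(X, y)

# Step 5: each tree predicts, and the forest combines the trees' votes.
print(forest.predict([[2, 5], [7, 2]]))   # expected: [0 1]
print(forest.predict_proba([[2, 5]]))     # averaged class probabilities across trees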

28
Working Principle
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this
dataset is given to the Random forest classifier. The dataset is divided into subsets
and given to each decision tree. During the training phase, each decision tree
produces a prediction result, and when a new data point occurs, then based on the
majority of results, the Random Forest classifier predicts the final decision. Consider
the below image:

29
K Nearest Neighbor
• The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity.
• This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
• K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on the dataset.
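A minimal sketch of this lazy-learning behaviour with scikit-learn's KNeighborsClassifier; the 2-D points below are made up for illustration:

from sklearn.neighbors import KNeighborsClassifier

# Made-up 2-D points belonging to two classes.
X = [[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

# "Training" only stores the data; distances are computed at prediction time.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# A new point is assigned the majority class among its 3 nearest neighbours.
print(knn.predict([[2, 2], [7, 6]]))   # expected: [0 1]
print(knn.kneighbors([[2, 2]]))        # distances and indices of the 3 neighbours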

30
K Nearest Neighbor
• At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data.
• Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will compare the features of the new image with those of the cat and dog images and, based on the most similar features, place it in either the cat or the dog category.

31
Numerical Example
• Numerical Example (ctrl + click)
• Advantages and Disadvantages

32
Naïve Bayes
• It is a classification technique based on Bayes’ Theorem with an assumption of
independence among predictors.
• In simple terms, a Naive Bayes classifier assumes that the presence of a particular
feature in a class is unrelated to the presence of any other feature.

• For example, a fruit may be considered to be an apple if it is red, round, and about 3
inches in diameter. Even if these features depend on each other or upon the
existence of the other features, all of these properties independently contribute to
the probability that this fruit is an apple and that is why it is known as ‘Naive’.

• A Naive Bayes model is easy to build and particularly useful for very large datasets. Along with its simplicity, Naive Bayes is known to sometimes outperform even highly sophisticated classification methods.

• Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c),
P(x) and P(x|c).
33
Naïve Bayes
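In symbols, Bayes' theorem reads:

P(c|x) = [ P(x|c) × P(c) ] / P(x)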

• Above,
• P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, i.e., the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.
34
Naïve Bayes
• Suppose, we have a training data set of weather and corresponding target variable
‘Play’ (suggesting possibilities of playing).

35
Naïve Bayes
• Suppose, we have a training data set of weather and corresponding target variable
‘Play’ (suggesting possibilities of playing).
• Now, we need to classify whether players will play or not based on the weather conditions.
Let's follow the steps below.

• Step 1: Convert the data set into a frequency table.

• Step 2: Create a likelihood table by finding the probabilities, e.g., the probability of Overcast is 0.29 and the probability of playing is 0.64.
• Step 3: Use the Naive Bayes equation to calculate the posterior probability for each class.
• The class with the highest posterior probability is the outcome of the prediction.

36
Naïve Bayes
• Problem: Players will play if the weather is sunny. Is this statement correct?

• We can solve it using the method of posterior probability discussed above.

• P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

• Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64.

• Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher of the two posterior probabilities, so the prediction is that players will play.

• Naive Bayes uses a similar method to predict the probabilities of different classes based on various attributes. The algorithm is mostly used in text classification and in problems having multiple classes.
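The same calculation in code, using only the counts quoted above (3 of the 9 "Yes" days are Sunny, 5 of the 14 days are Sunny, and 9 of the 14 days are "Yes"):

# Counts taken from the frequency/likelihood table described above.
sunny_and_yes = 3
yes_days = 9
sunny_days = 5
total_days = 14

p_sunny_given_yes = sunny_and_yes / yes_days   # P(Sunny | Yes) = 0.33
p_yes = yes_days / total_days                  # P(Yes)         = 0.64
p_sunny = sunny_days / total_days              # P(Sunny)       = 0.36

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))             # 0.6 -> predict that players will play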
37
Naïve Bayes (Example-2)

38
Naïve Bayes (Example-2)
• Concerning our dataset, the concept of assumptions made by the algorithm can be
understood as:

• We assume that no pair of features are dependent.


• For example, the color being ‘Red’ has nothing to do with the Type or the Origin of
the car. Hence, the features are assumed to be Independent.
• Secondly, each feature is given the same influence (or importance).
• For example, knowing only the Color and the Type alone cannot predict the outcome perfectly.
• So no attribute is considered irrelevant, and all are assumed to contribute equally to the outcome.

• Note: The assumptions made by Naïve Bayes are generally not correct in real-world situations. The independence assumption rarely holds exactly, yet the method often works well in practice. Hence the name 'Naïve'.
39
Naïve Bayes (Example-2)
• Here in our dataset, we need to classify whether the car is stolen, given the features
of the car.
• The columns represent these features and the rows represent individual entries.
• If we take the first row of the dataset, we can observe that the car was stolen when the Color is Red, the Type is Sports, and the Origin is Domestic.
• So we want to classify whether a Red Domestic SUV would get stolen or not.
• Note that there is no example of a Red Domestic SUV in our dataset.

40
Naïve Bayes (Example-2)

41
Naïve Bayes (Example-2)

Since the posterior for 'No' is greater than the posterior for 'Yes' (No > Yes), the Red Domestic SUV is classified as not stolen.
42
Support Vector Machine
(SVM)
• Support Vector Machine (SVM) is a supervised machine learning algorithm.
• We plot each data item as a point in n-dimensional space (where n is the number of features).
• Then, we perform classification by finding the hyperplane that best separates the two classes.
• Reference Article

43
SVM
• Hyperplane: There can be multiple lines/decision boundaries that segregate the classes in n-dimensional space. SVM chooses the best boundary, the one with the maximum margin between the classes, and this boundary is known as the hyperplane of the SVM.

• Support Vectors:
• The data points or vectors that are closest to the hyperplane and which affect its position are termed support vectors.

44
SVM

[Figures: Scenario-I and Scenario-II]

[Figures: Scenario-III and Scenario-IV]
45
SVM

[Figure: Scenario-V]

***What is Kernel Trick?
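One way to see the kernel trick in action is to compare a linear-kernel SVM with an RBF-kernel SVM on data that no straight line can separate. The sketch below uses scikit-learn's SVC and a made-up ring-shaped dataset (the data and parameters are illustrative, not from the slides):

import numpy as np
from sklearn.svm import SVC

# Made-up 2-D data: class 1 forms a ring around the class-0 blob at the origin,
# so the classes are not linearly separable.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 40)
inner = rng.normal(0, 0.3, (40, 2))                      # class 0
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)]    # class 1
X = np.vstack([inner, outer])
y = np.array([0] * 40 + [1] * 40)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)       # kernel trick: implicit non-linear mapping

print("linear accuracy:", linear_svm.score(X, y))   # poor
print("rbf accuracy:   ", rbf_svm.score(X, y))      # close to 1.0
print("support vectors per class:", rbf_svm.n_support_)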


46
SVM
Pros
•High accuracy
•Works well on smaller, cleaner datasets
•It can be more efficient because it uses only a subset of the training points (the support vectors)

Cons

•Not suited to larger datasets, as the training time with SVMs can be high
•Less effective on noisier datasets with overlapping classes

47
***Evaluation Metrics
Regression Evaluation Metrics (Link) (Press Ctrl+Click)

Mean Absolute Error (MAE)


Root Mean Square Error (RMSE)
Why Use R-Squared

1. Classification Evaluation Metrics (Link)


Confusion Matrix Interpretation
Accuracy
Precision
Recall
F1-score

2. Classification Evaluation Metrics (Broad Analysis) (Link)


Why F1-score? Interpretation of Precision and Recall with Example
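A minimal scikit-learn sketch of the metrics listed above, computed on made-up labels and predictions (the numbers are illustrative only):

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error,
                             precision_score, r2_score, recall_score)

# Classification metrics on made-up true/predicted class labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
print(confusion_matrix(y_true, y_pred))                 # [[TN FP] [FN TP]]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))       # TP / (TP + FN)
print("f1-score :", f1_score(y_true, y_pred))           # harmonic mean of the two

# Regression metrics on made-up true/predicted continuous values.
r_true = [3.0, 5.0, 2.5, 7.0]
r_pred = [2.5, 5.0, 3.0, 8.0]
print("MAE :", mean_absolute_error(r_true, r_pred))
print("RMSE:", np.sqrt(mean_squared_error(r_true, r_pred)))
print("R^2 :", r2_score(r_true, r_pred))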

48
