ML NOTES BY PUSHPA

Classification and regression are key tasks in supervised machine learning, with classification focusing on discrete outcomes and regression on continuous values. Classification algorithms, such as decision trees and logistic regression, categorize data, while regression algorithms, like linear regression, predict numerical values. Understanding the differences between these approaches is crucial for selecting the appropriate method for specific machine learning tasks.

Classification vs Regression in Machine Learning
Classification and regression are two primary tasks in supervised machine
learning, where the key difference lies in the nature of the output:
classification deals with discrete outcomes (e.g., yes/no, categories),
while regression handles continuous values (e.g., price, temperature).
Both approaches require labeled data for training but differ in their
objectives: classification aims to find decision boundaries that separate
classes, whereas regression focuses on finding the best-fitting line to
predict numerical outcomes. Understanding these distinctions helps in
selecting the right approach for specific machine learning tasks.

For example, a classification model can determine whether an email is spam
or not, classify images as “cat” or “dog,” or predict weather conditions
like “sunny,” “rainy,” or “cloudy” using a decision boundary, while
regression models are used to predict house prices based on features like
size and location, or to forecast stock prices over time with a best-fit line.
What is Regression in Machine Learning?
Regression algorithms predict a continuous value based on input data. This
is used when you want to predict numbers such as income, height, weight,
or even the probability of something happening (like the chance of rain).
Some of the most common types of regression are:
1. Simple Linear Regression: Models the relationship between one
independent variable and a dependent variable using a straight line.
2. Multiple Linear Regression: Predicts a dependent variable based on
two or more independent variables.
3. Polynomial Regression: Models nonlinear relationships by fitting a
curve to the data.
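
To make the first of these concrete, here is a minimal sketch of simple linear regression using scikit-learn. The library import is standard, but the size/price numbers below are made up purely for illustration and are not from the notes.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house size in square feet (one independent variable)
# and price in thousands (the continuous dependent variable).
X = np.array([[600], [800], [1000], [1200], [1500]])
y = np.array([150, 190, 240, 275, 340])

model = LinearRegression()
model.fit(X, y)                        # learn the best-fitting straight line

print(model.coef_, model.intercept_)   # slope and intercept of the fitted line
print(model.predict([[1100]]))         # predicted price for a 1100 sq ft house

The same pattern extends to multiple linear regression (more columns in X) and to polynomial regression (expanding X with polynomial features before fitting).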
What is Classification in Machine Learning?
Classification is used when you want to categorize data into different
classes or groups. For example, classifying emails as “spam” or “not
spam” or predicting whether a patient has a certain disease based on their
symptoms. Here are some common types of classification models:
1. Decision Tree Classification: Builds a tree where each node represents
a test case for an attribute, and branches represent possible outcomes.
2. Random Forest Classification: Uses an ensemble of decision trees to
make predictions, improving accuracy by averaging the results from
multiple trees.
3. K-Nearest Neighbor (KNN): Classifies data points based on the ‘k’
nearest neighbors using feature similarity.
Decision Boundary vs Best-Fit Line
When teaching the difference between classification and regression in
machine learning, a key concept to focus on is the decision
boundary (used in classification) versus the best-fit line (used in
regression). These are fundamental tools that help models make
predictions, but they serve distinctly different purposes.

1. Decision Boundary in Classification

It is a surface or line that separates data points into different classes
in a feature space. It can be linear (a straight line) or non-linear (a
curve), depending on the complexity of the data and the algorithm used.
For example:
 A linear decision boundary might separate two classes in a 2D space
with a straight line (e.g., logistic regression).
 A more complex model may create non-linear boundaries to better fit
intricate datasets.

During training, the classifier learns to partition the feature space by
finding a boundary that minimizes classification errors.
 For binary classification, this boundary separates data points into two
groups (e.g., spam vs. non-spam emails).
 In multi-class classification, multiple boundaries are created to separate
more than two classes.

2. Best-Fit Line in Regression

In regression, a best-fit line (or regression line) represents the relationship
between independent variables (inputs) and a dependent variable (output).
It is used to predict continuous numerical values by capturing trends and
relationships within the data. The best-fit line can be linear or non-linear:
 A straight line is used for linear regression.
 Curves are used for more complex regressions, like polynomial
regression.

Classification Algorithms
There are different types of classification algorithms that have been
developed over time to give the best results for classification tasks. Don’t
worry if they seem overwhelming at first—we’ll dive deeper into each
algorithm, one by one, in the upcoming chapters.
 Logistic Regression
 Decision Tree
 Random Forest
 K – Nearest Neighbors
 Support Vector Machine
 Naive Bayes
Regression Algorithms
There are different types of regression algorithms that have been developed
over time to give the best results for regression tasks.
 Lasso Regression
 Ridge Regression
 XGBoost Regressor
 LGBM Regressor
Comparison between Classification and Regression
 Output type: Classification predicts discrete categories (e.g., “spam” or “not spam”); the target variable is discrete. Regression predicts a continuous numerical value (e.g., price, temperature).
 Goal: Classification predicts which category a data point belongs to. Regression predicts an exact numerical value based on input data.
 Example problems: Classification covers email spam detection, image recognition, and customer sentiment analysis. Regression covers house price prediction, stock market forecasting, and sales prediction.
 Evaluation metrics: Classification uses metrics like Precision, Recall, and F1-Score. Regression uses Mean Squared Error, R2-Score, MAPE, and RMSE.
 Decision boundary: Classification has clearly defined boundaries between different classes. Regression has no distinct boundaries and focuses on finding the best-fit line.
 Common algorithms: Classification commonly uses Logistic Regression, Decision Trees, and Support Vector Machines (SVM). Regression commonly uses Linear Regression, Polynomial Regression, and Decision Trees (with a regression objective).

Classification vs Regression: Conclusion


Classification trees are employed when there’s a need to categorize the
dataset into distinct classes associated with the response variable. Often,
these classes are binary, such as “Yes” or “No,” and they are mutually
exclusive. While there are instances where there may be more than two
classes, a modified version of the classification tree algorithm is used in
those scenarios.
On the other hand, regression trees are utilized when dealing with
continuous response variables. For instance, if the response variable
represents continuous values like the price of an object or the temperature
for the day, a regression tree is the appropriate choice.

What is Logistic Regression?


Logistic regression is a supervised machine learning algorithm used
for classification tasks where the goal is to predict the probability that an
instance belongs to a given class or not. Logistic regression is a statistical
algorithm that analyzes the relationship between two data factors. The
article explores the fundamentals of logistic regression, its types and
implementations.
Logistic regression is used for binary classification, where we use the sigmoid
function, which takes the independent variables as input and produces a
probability value between 0 and 1.
For example, if we have two classes, Class 0 and Class 1, and the value of the
logistic function for an input is greater than 0.5 (the threshold value), then it
belongs to Class 1; otherwise it belongs to Class 0. It’s referred to as
regression because it is an extension of linear regression but is mainly
used for classification problems.

Key Points:

 Logistic regression predicts the output of a categorical dependent
variable. Therefore, the outcome must be a categorical or discrete
value.
 It can be either Yes or No, 0 or 1, True or False, etc., but instead of
giving the exact values 0 and 1, it gives probabilistic values which
lie between 0 and 1.
 In logistic regression, instead of fitting a regression line, we fit an “S”-
shaped logistic function, which predicts two maximum values (0 or 1).
Types of Logistic Regression
On the basis of the categories, Logistic Regression can be classified into
three types:
1. Binomial: In binomial Logistic regression, there can be only two
possible types of the dependent variables, such as 0 or 1, Pass or Fail,
etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or
more possible unordered types of the dependent variable, such as “cat”,
“dog”, or “sheep”.
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible
ordered types of dependent variables, such as “low”, “Medium”, or
“High”.
Assumptions of Logistic Regression
We will explore the assumptions of logistic regression, as understanding
these assumptions is important to ensure appropriate application of the
model. The assumptions include:
1. Independent observations: Each observation is independent of the others,
meaning there is no correlation between observations.
2. Binary dependent variables: It takes the assumption that the dependent
variable must be binary or dichotomous, meaning it can take only two
values. For more than two categories SoftMax functions are used.
3. Linearity relationship between independent variables and log odds: The
relationship between the independent variables and the log odds of the
dependent variable should be linear.
4. No outliers: There should be no outliers in the dataset.
5. Large sample size: The sample size is sufficiently large
Understanding Sigmoid Function
So far, we’ve covered the basics of logistic regression, but now let’s focus
on the most important function that forms the core of logistic regression.
 The sigmoid function is a mathematical function used to map the
predicted values to probabilities.
 It maps any real value into another value within a range of 0 and 1. The
value of the logistic regression must be between 0 and 1, which cannot
go beyond this limit, so it forms a curve like the “S” form.
 The S-form curve is called the Sigmoid function or the logistic
function.
 In logistic regression, we use the concept of the threshold value, which
defines the probability of either 0 or 1. Values above the threshold tend
toward 1, and values below the threshold tend toward 0.
σ(z) = 1 / (1 + e^(−z))

Sigmoid function

As shown in the figure above, the sigmoid function converts the continuous
input into a probability, i.e., a value between 0 and 1.
 σ(z) tends towards 1 as z → ∞
 σ(z) tends towards 0 as z → −∞
 σ(z) is always bounded between 0 and 1
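
A minimal sketch of the sigmoid function and the 0.5 threshold rule described above, assuming NumPy is available; the sample z values are arbitrary:

import numpy as np

def sigmoid(z):
    # Maps any real value z into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)                     # approx. [0.018, 0.269, 0.5, 0.731, 0.982]
labels = (probs > 0.5).astype(int)     # greater than 0.5 -> Class 1, otherwise Class 0
print(probs)
print(labels)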
Decision Tree
A decision tree is a simple diagram that shows different choices and their
possible results, helping you make decisions easily. This article is all about
what decision trees are, how they work, their advantages and disadvantages,
and their applications.

Understanding Decision Tree


A decision tree is a graphical representation of different options for solving
a problem and shows how different factors are related. It has a hierarchical
tree structure that starts with one main question at the top, called a node,
which further branches out into different possible outcomes, where:
 Root Node is the starting point that represents the entire dataset.
 Branches: These are the lines that connect nodes. It shows the flow
from one decision to another.
 Internal Nodes are Points where decisions are made based on the input
features.
 Leaf Nodes: These are the terminal nodes at the end of branches that
represent final outcomes or predictions

They also support decision-making by visualizing outcomes. You can


quickly evaluate and compare the “branches” to determine which course of
action is best for you.
Now, let’s take an example to understand the decision tree. Imagine you
want to decide whether to drink coffee based on the time of day and how
tired you feel. First the tree checks the time of day: if it’s morning, it asks
whether you are tired. If you’re tired, the tree suggests drinking coffee; if
not, it says there’s no need. Similarly, in the afternoon the tree again asks if
you are tired. If you are, it recommends drinking coffee; if not, it concludes
no coffee is needed.
Classification of Decision Tree
We have mainly two types of decision tree based on the nature of the target
variable: classification trees and regression trees.
 Classification trees: They are designed to predict categorical outcomes,
meaning they classify data into different classes. For example, they can
determine whether an email is “spam” or “not spam” based on various
features of the email.
 Regression trees: These are used when the target variable is
continuous. They predict numerical values rather than categories. For
example, a regression tree can estimate the price of a house based on its
size, location, and other features.
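
A short sketch of both tree types using scikit-learn, mirroring the two bullets above. The feature values and targets are invented stand-ins (link/word counts for the spam example, size/distance for the house example), not real data:

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: [number of links, number of spam words] -> spam (1) / not spam (0)
X_cls = [[10, 8], [1, 0], [7, 5], [0, 1], [9, 9], [2, 0]]
y_cls = [1, 0, 1, 0, 1, 0]
clf = DecisionTreeClassifier(max_depth=2).fit(X_cls, y_cls)
print(clf.predict([[8, 6]]))      # predicted class label for a new email

# Regression tree: [size in sq ft, distance to city in km] -> price in thousands
X_reg = [[800, 10], [1200, 5], [1500, 3], [900, 12], [2000, 2]]
y_reg = [150, 260, 330, 160, 450]
reg = DecisionTreeRegressor(max_depth=2).fit(X_reg, y_reg)
print(reg.predict([[1300, 4]]))   # predicted numerical value for a new house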
Advantages of Decision Trees
 Simplicity and Interpretability: Decision trees are straightforward and
easy to understand. You can visualize them like a flowchart which
makes it simple to see how decisions are made.
 Versatility: They can be used for different types of tasks and work
well for both classification and regression.
 No Need for Feature Scaling: They don’t require you to normalize or
scale your data.
 Handles Non-linear Relationships: It is capable of capturing non-
linear relationships between features and target variables.
Disadvantages of Decision Trees
 Overfitting: Overfitting occurs when a decision tree captures noise and
details in the training data and therefore performs poorly on new data.
 Instability: The model can be unreliable; slight variations in the input
can lead to significant differences in predictions.
 Bias towards Features with More Levels: Decision trees can become
biased towards features with many categories, focusing too much on
them during decision-making. This can cause the model to miss
other important features, leading to less accurate predictions.
Applications of Decision Trees
 Loan Approval in Banking: A bank needs to decide whether to
approve a loan application based on customer profiles.
o Input features include income, credit score, employment status,
and loan history.
o The decision tree predicts loan approval or rejection, helping the
bank make quick and reliable decisions.
 Medical Diagnosis: A healthcare provider wants to predict whether a
patient has diabetes based on clinical test results.
o Features like glucose levels, BMI, and blood pressure are used to
make a decision tree.
o Tree classifies patients into diabetic or non-diabetic, assisting
doctors in diagnosis.
 Predicting Exam Results in Education: A school wants to predict
whether a student will pass or fail based on study habits.
o Data includes attendance, time spent studying, and previous
grades.
o The decision tree identifies at-risk students, allowing teachers to
provide additional support.
A decision tree can also be used to help build automated predictive models,
which have applications in machine learning, data mining, and statistics.

Random Forest
It is a method that combines the predictions of multiple decision trees to
produce a more accurate and stable result. It can be used for both
classification and regression tasks.
In classification tasks, Random Forest Classification predicts categorical
outcomes based on the input data. It uses multiple decision trees and
outputs the label that has the maximum votes among all the individual tree
predictions and in this article we will learn more about it.
Random Forest Classification works by creating multiple decision trees
each trained on a random subset of data. The process begins with Bootstrap
Sampling where random rows of data are selected with replacement to
form different training datasets for each tree.

Benefits of Random Forest Classification:


 Random Forest can handle large datasets and high-dimensional data.
 By combining predictions from many decision trees it reduces the risk
of overfitting compared to a single decision tree.
 It is robust to noisy data and works well with categorical data.
Implementing Random Forest Classification in
Python
Before implementing a random forest classifier in Python, let’s first
understand its parameters; a short example using them follows the list below.
 n_estimators: Number of trees in the forest.
 max_depth: Maximum depth of each tree.
 max_features: Number of features considered for splitting at each
node.
 criterion: Function used to measure split quality (‘gini’ or ‘entropy’).
 min_samples_split: Minimum samples required to split a node.
 min_samples_leaf: Minimum samples required to be at a leaf node.
 bootstrap: Whether to use bootstrap sampling when building trees
(True or False).
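
A minimal sketch tying these parameters together with scikit-learn’s RandomForestClassifier. The built-in iris dataset and the specific parameter values are chosen only for illustration, not taken from the notes:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_depth=5,           # maximum depth of each tree
    max_features="sqrt",   # features considered for splitting at each node
    criterion="gini",      # function used to measure split quality
    min_samples_split=2,   # minimum samples required to split a node
    min_samples_leaf=1,    # minimum samples required at a leaf node
    bootstrap=True,        # bootstrap sampling of rows for each tree
    random_state=42,
)
rf.fit(X_train, y_train)
print(accuracy_score(y_test, rf.predict(X_test)))   # accuracy on held-out data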
K-Nearest Neighbor(KNN) Algorithm
K-Nearest Neighbors (KNN) is a simple way to classify things by looking
at what’s nearby. Imagine a streaming service wants to predict if a new
user is likely to cancel their subscription (churn) based on their age.
It checks the ages of its existing users and whether they churned or
stayed. If most of the “K” users closest in age to the new user canceled their
subscription, KNN will predict the new user might churn too. The key
idea is that users with similar ages tend to have similar behaviors, and
KNN uses this closeness to make decisions.

Getting Started with K-Nearest Neighbors


K-Nearest Neighbors is also called a lazy learner algorithm because it
does not learn from the training set immediately; instead it stores the
dataset and, at the time of classification, performs an action on the dataset.

KNN Algorithm working visualization

The new point is classified as Category 2 because most of its closest


neighbors are blue squares. KNN assigns the category based on the
majority of nearby points.
The image shows how KNN predicts the category of a new data
point based on its closest neighbours.
 The red diamonds represent Category 1 and the blue
squares represent Category 2.
 The new data point checks its closest neighbours (circled points).
 Since the majority of its closest neighbours are blue squares (Category
2) KNN predicts the new data point belongs to Category 2.
KNN works by using proximity and majority voting to make predictions.

What is ‘K’ in K Nearest Neighbour ?


In the k-Nearest Neighbours (k-NN) algorithm k is just a number that
tells the algorithm how many nearby points (neighbours) to look at when it
makes a decision.
Example:

Imagine you have a new fruit and you’re deciding which fruit it is based on
its shape and size. You compare it to fruits you already know.
 If k = 3, the algorithm looks at the 3 closest fruits to the new one.
 If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the
new fruit is an apple because most of its neighbours are apples.
Working of KNN algorithm
Thе K-Nearest Neighbors (KNN) algorithm operates on the principle of
similarity where it predicts the label or value of a new data point by
considering the labels or values of its K nearest neighbors in the training
dataset.

Step-by-Step explanation of how KNN works is discussed below:

Step 1: Selecting the optimal value of K

 K represents the number of nearest neighbors that need to be
considered while making a prediction.

Step 2: Calculating distance

 To measure the similarity between the target point and the training data
points, Euclidean distance is used. The distance is calculated between
each data point in the dataset and the target point.

Step 3: Finding Nearest Neighbors

 The k data points with the smallest distances to the target point are the
nearest neighbors.
Step 4: Voting for Classification or Taking Average for
Regression

 When you want to classify a data point into a category (like spam or not
spam), the K-NN algorithm looks at the K closest points in the dataset.
These closest points are called neighbors. The algorithm then looks at
which category the neighbors belong to and picks the one that appears
the most. This is called majority voting.
 In regression, the algorithm still looks for the K closest points. But
instead of voting for a class in classification, it takes the average of the
values of those K neighbors. This average is the predicted value for the
new point for the algorithm.
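
A minimal sketch of both modes (majority voting and averaging) using scikit-learn’s KNN estimators. The ages, churn labels and incomes below are invented for demonstration only:

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical churn data: [age] -> churned (1) / stayed (0)
ages = [[18], [22], [25], [30], [35], [40], [45], [50]]
churned = [1, 1, 1, 0, 0, 0, 0, 1]

knn_clf = KNeighborsClassifier(n_neighbors=3)   # K = 3, Euclidean distance by default
knn_clf.fit(ages, churned)
print(knn_clf.predict([[24]]))   # majority vote among the 3 users closest in age

# For regression, the same idea averages the neighbours' values instead of voting
incomes = [20, 24, 28, 35, 42, 50, 55, 60]
knn_reg = KNeighborsRegressor(n_neighbors=3).fit(ages, incomes)
print(knn_reg.predict([[24]]))   # mean of the 3 nearest incomes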
Distance Metrics Used in KNN Algorithm
KNN uses distance metrics to identify the nearest neighbours; these neighbours
are then used for the classification or regression task. To identify the nearest
neighbours we use the distance metrics below:

1. Euclidean Distance

Euclidean distance is defined as the straight-line distance between two


points in a plane or space. You can think of it like the shortest path you
would walk if you were to go directly from one point to another.
distance(x, Xi) = √( Σⱼ (xj − Xij)² ), with the sum over the d features j = 1, …, d

2. Manhattan Distance

This is the total distance you would travel if you could only move along
horizontal and vertical lines (like a grid or city streets). It’s also called
“taxicab distance” because a taxi can only drive along the grid-like streets
of a city.
d(x, y) = Σᵢ |xi − yi|, with the sum over i = 1, …, n

3. Minkowski Distance

Minkowski distance is like a family of distances, which includes


both Euclidean and Manhattan distances as special cases.
d(x, y) = ( Σᵢ |xi − yi|ᵖ )^(1/p), with the sum over i = 1, …, n
From the formula above we can say that when p = 2 then it is the same as
the formula for the Euclidean distance and when p = 1 then we obtain the
formula for the Manhattan distance.
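
The three metrics are easy to write out with NumPy; the points a and b below are arbitrary examples, chosen so the results are round numbers:

import numpy as np

def euclidean(x, y):
    # straight-line distance (Minkowski with p = 2)
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # grid / taxicab distance (Minkowski with p = 1)
    return np.sum(np.abs(x - y))

def minkowski(x, y, p):
    # general form; reduces to the two metrics above for p = 2 and p = 1
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
print(euclidean(a, b))       # 5.0
print(manhattan(a, b))       # 7.0
print(minkowski(a, b, 2))    # 5.0, same as Euclidean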
Support Vector Machine (SVM) Algorithm
Support Vector Machine (SVM) is a supervised machine learning
algorithm used for classification and regression tasks. While it can handle
regression problems, SVM is particularly well-suited for classification
tasks.
SVM aims to find the optimal hyperplane in an N-dimensional space to
separate data points into different classes. The algorithm maximizes the
margin between the closest points of different classes.

Support Vector Machine (SVM) Terminology


 Hyperplane: A decision boundary separating different classes in
feature space, represented by the equation wx + b = 0 in linear
classification.
 Support Vectors: The closest data points to the hyperplane, crucial for
determining the hyperplane and margin in SVM.
 Margin: The distance between the hyperplane and the support vectors.
SVM aims to maximize this margin for better classification
performance.
 Kernel: A function that maps data to a higher-dimensional space,
enabling SVM to handle non-linearly separable data.
 Hard Margin: A maximum-margin hyperplane that perfectly separates
the data without misclassifications.
 Soft Margin: Allows some misclassifications by introducing slack
variables, balancing margin maximization and misclassification
penalties when data is not perfectly separable.
 C: A regularization term balancing margin maximization and
misclassification penalties. A higher C value enforces a stricter penalty
for misclassifications.
 Hinge Loss: A loss function penalizing misclassified points or margin
violations, combined with regularization in SVM.
 Dual Problem: Involves solving for Lagrange multipliers associated
with support vectors, facilitating the kernel trick and efficient
computation.
How does Support Vector Machine Algorithm
Work?
The key idea behind the SVM algorithm is to find the hyperplane that best
separates two classes by maximizing the margin between them. This
margin is the distance from the hyperplane to the nearest data points
(support vectors) on each side.
Multiple hyperplanes separate the data from two classes

The best hyperplane, also known as the “hard margin,” is the one that
maximizes the distance between the hyperplane and the nearest data points
from both classes. This ensures a clear separation between the classes. So,
from the above figure, we choose L2 as hard margin.
Let’s consider a scenario like shown below:

Selecting hyperplane for data with outlier

Here, we have one blue ball in the boundary of the red ball.
When data is not linearly separable (i.e., it can’t be divided by a straight
line), SVM uses a technique called kernels to map the data into a higher-
dimensional space where it becomes separable. This transformation helps
SVM find a decision boundary even for non-linear data.

Original 1D dataset for classification

A kernel is a function that maps data points into a higher-dimensional


space without explicitly computing the coordinates in that space. This
allows SVM to work efficiently with non-linear data by implicitly
performing the mapping.
For example, consider data points that are not linearly separable. By
applying a kernel function, SVM transforms the data points into a higher-
dimensional space where they become linearly separable.
 Linear Kernel: For linear separability.
 Polynomial Kernel: Maps data into a polynomial space.
 Radial Basis Function (RBF) Kernel: Transforms data into a space
based on distances between data points.

Mapping 1D data to 2D to become able to separate the two classes


In this case, the new variable y is created as a function of distance from the
origin.

Types of Support Vector Machine


Based on the nature of the decision boundary, Support Vector Machines
(SVM) can be divided into two main parts:
 Linear SVM: Linear SVMs use a linear decision boundary to separate
the data points of different classes. When the data can be precisely
linearly separated, linear SVMs are very suitable. This means that a
single straight line (in 2D) or a hyperplane (in higher dimensions) can
entirely divide the data points into their respective classes. A
hyperplane that maximizes the margin between the classes is the
decision boundary.
 Non-Linear SVM: Non-Linear SVM can be used to classify data when
it cannot be separated into two classes by a straight line (in the case of
2D). By using kernel functions, nonlinear SVMs can handle nonlinearly
separable data. The original input data is transformed by these kernel
functions into a higher-dimensional feature space, where the data points
can be linearly separated. A linear SVM is used to locate a nonlinear
decision boundary in this modified space.
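
A small sketch contrasting the two SVM variants with scikit-learn. The make_moons toy dataset is used only as a stand-in for non-linearly separable data; the C and gamma settings are illustrative defaults:

from sklearn.svm import SVC
from sklearn.datasets import make_moons

# Hypothetical non-linearly separable data: two interleaving half-circles
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)

# Linear SVM: straight-line decision boundary
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Non-linear SVM: RBF kernel maps the data so a curved boundary can be found
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print(linear_svm.score(X, y))  # typically lower: the classes are not linearly separable
print(rbf_svm.score(X, y))     # typically higher: the kernel captures the curved boundary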
Naive Bayes Classifiers
Naive Bayes classifiers are supervised machine learning algorithms used
for classification tasks, based on Bayes’ Theorem to find probabilities.
This article will give you an overview as well as more advanced use and
implementation of Naive Bayes in machine learning.

Key Features of Naive Bayes Classifiers


The main idea behind the Naive Bayes classifier is to use Bayes’
Theorem to classify data based on the probabilities of different classes
given the features of the data. It is used mostly in high-dimensional text
classification.
 The Naive Bayes Classifier is a simple probabilistic classifier and it has
a very small number of parameters, which are used to build ML models
that can predict at a faster speed than other classification algorithms.
 It is a probabilistic classifier because it assumes that one feature in the
model is independent of the existence of another feature. In other words,
each feature contributes to the predictions with no relation to the others.
 The Naive Bayes algorithm is used in spam filtration, sentiment analysis,
classifying articles, and many more applications.
Why it is Called Naive Bayes?
It is named “Naive” because it assumes the presence of one feature does
not affect other features.
The “Bayes” part of the name refers to its basis in Bayes’ Theorem.
Consider a fictional dataset that describes the weather conditions for
playing a game of golf. Given the weather conditions, each tuple classifies
the conditions as fit(“Yes”) or unfit(“No”) for playing golf. Here is a
tabular representation of our dataset.

Outlook Temperature Humidity Windy Play Golf

0 Rainy Hot High False No

1 Rainy Hot High True No

2 Overcast Hot High False Yes

3 Sunny Mild High False Yes

4 Sunny Cool Normal False Yes

5 Sunny Cool Normal True No

6 Overcast Cool Normal True Yes

7 Rainy Mild High False No

8 Rainy Cool Normal False Yes

9 Sunny Mild Normal False Yes

10 Rainy Mild Normal True Yes

11 Overcast Mild High True Yes

12 Overcast Hot Normal False Yes

13 Sunny Mild High True No

The dataset is divided into two parts, namely, feature matrix and
the response vector.
 Feature matrix contains all the vectors (rows) of the dataset, in which each
vector consists of the values of the dependent features. In the above dataset,
the features are ‘Outlook’, ‘Temperature’, ‘Humidity’ and ‘Windy’.
 Response vector contains the value of the class variable (prediction or
output) for each row of the feature matrix. In the above dataset, the class
variable name is ‘Play Golf’.
Assumption of Naive Bayes
The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. More specifically:
 Feature independence: This means that when we are trying to classify
something, we assume that each feature (or piece of information) in the
data does not affect any other feature.
 Continuous features are normally distributed: If a feature is
continuous, then it is assumed to be normally distributed within each
class.
 Discrete features have multinomial distributions: If a feature is
discrete, then it is assumed to have a multinomial distribution within
each class.
 Features are equally important: All features are assumed to
contribute equally to the prediction of the class label.
 No missing data: The data should not contain any missing values.
Types of Naive Bayes Model
There are three types of Naive Bayes Model :

Gaussian Naive Bayes

In Gaussian Naive Bayes, continuous values associated with each feature
are assumed to be distributed according to a Gaussian distribution. A
Gaussian distribution is also called a Normal distribution. When plotted, it
gives a bell-shaped curve which is symmetric about the mean of the feature
values.

Multinomial Naive Bayes

Multinomial Naive Bayes is used when features represent the frequency of


terms (such as word counts) in a document. It is commonly applied in text
classification, where term frequencies are important.

Bernoulli Naive Bayes

Bernoulli Naive Bayes deals with binary features, where each feature
indicates whether a word appears or not in a document. It is suited for
scenarios where the presence or absence of terms is more relevant than
their frequency. Both models are widely used in document classification
tasks.
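
A minimal sketch of the multinomial and Bernoulli variants on a tiny, made-up spam corpus, using scikit-learn; the texts and labels are invented for demonstration only:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# Hypothetical tiny spam-filter corpus: 1 = spam, 0 = not spam
texts = ["win money now", "meeting at noon",
         "win a free prize now", "project meeting schedule"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
counts = vec.fit_transform(texts)          # word-count features (term frequencies)

mnb = MultinomialNB().fit(counts, labels)  # uses the counts themselves
bnb = BernoulliNB().fit(counts, labels)    # uses only word presence/absence

test = vec.transform(["free money prize"])
print(mnb.predict(test), bnb.predict(test))  # both should predict spam (1) here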

Advantages of Naive Bayes Classifier


 Easy to implement and computationally efficient.
 Effective in cases with a large number of features.
 Performs well even with limited training data.
 It performs well in the presence of categorical features.
 For numerical features, data is assumed to come from normal
distributions.
Disadvantages of Naive Bayes Classifier
 Assumes that features are independent, which may not always hold in
real-world data.
 Can be influenced by irrelevant attributes.
 May assign zero probability to unseen events, leading to poor
generalization.
Applications of Naive Bayes Classifier
 Spam Email Filtering: Classifies emails as spam or non-spam based
on features.
 Text Classification: Used in sentiment analysis, document
categorization, and topic classification.
 Medical Diagnosis: Helps in predicting the likelihood of a disease
based on symptoms.
 Credit Scoring: Evaluates creditworthiness of individuals for loan
approval.
 Weather Prediction: Classifies weather conditions based on various
factors.
How to solve K-Means Algorithm Numerical?

Q. Apply K(=2)-Means algorithm over the data (185, 72), (170, 56), (168, 60),
(179,68), (182,72), (188,77) up to two iterations and show the clusters. Initially choose
first two objects as initial centroids.

Solution:
Given, number of clusters to be created (K) = 2, say c1 and c2,
number of iterations = 2, and
the given data points: (185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77).

Also, taking the first two objects as initial centroids:

Centroid for first cluster c1 = (185, 72)
Centroid for second cluster c2 = (170, 56)

Iteration 1: Now calculating the similarity of each data point to the two centroids using the Euclidean distance measure:

Point (185, 72): distance to c1 = 0, distance to c2 ≈ 21.93 → cluster c1
Point (170, 56): distance to c1 ≈ 21.93, distance to c2 = 0 → cluster c2
Point (168, 60): distance to c1 ≈ 20.81, distance to c2 ≈ 4.47 → cluster c2
Point (179, 68): distance to c1 ≈ 7.21, distance to c2 = 15.00 → cluster c1
Point (182, 72): distance to c1 = 3.00, distance to c2 = 20.00 → cluster c1
Point (188, 77): distance to c1 ≈ 5.83, distance to c2 ≈ 27.66 → cluster c1

The resulting clusters after the first iteration are:
c1 = {(185, 72), (179, 68), (182, 72), (188, 77)} and c2 = {(170, 56), (168, 60)}

Iteration 2: Now calculating the new centroid for each cluster as the mean of its data points:
c1 = ((185 + 179 + 182 + 188)/4, (72 + 68 + 72 + 77)/4) = (183.5, 72.25)
c2 = ((170 + 168)/2, (56 + 60)/2) = (169, 58)

Now, again calculating the similarity of each data point to the new centroids:

Point (185, 72): distance to c1 ≈ 1.52, distance to c2 ≈ 21.26 → cluster c1
Point (170, 56): distance to c1 ≈ 21.13, distance to c2 ≈ 2.24 → cluster c2
Point (168, 60): distance to c1 ≈ 19.76, distance to c2 ≈ 2.24 → cluster c2
Point (179, 68): distance to c1 ≈ 6.19, distance to c2 ≈ 14.14 → cluster c1
Point (182, 72): distance to c1 ≈ 1.52, distance to c2 ≈ 19.10 → cluster c1
Point (188, 77): distance to c1 ≈ 6.54, distance to c2 ≈ 26.87 → cluster c1

The resulting clusters after the second iteration are unchanged:
c1 = {(185, 72), (179, 68), (182, 72), (188, 77)} and c2 = {(170, 56), (168, 60)}

Since the clustering doesn’t change after the second iteration, we terminate the iteration even if the question doesn’t say so.
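
The two iterations above can be reproduced with a short NumPy sketch; this is an illustrative check, not part of the original worked solution:

import numpy as np

points = np.array([[185, 72], [170, 56], [168, 60],
                   [179, 68], [182, 72], [188, 77]], dtype=float)
centroids = points[:2].copy()   # first two objects as the initial centroids

for iteration in range(2):
    # Euclidean distance of every point to every centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)        # nearest centroid for each point
    print(f"Iteration {iteration + 1}: clusters =", assignment + 1)
    # recompute each centroid as the mean of its assigned points
    centroids = np.array([points[assignment == k].mean(axis=0) for k in range(2)])
    print("New centroids:", centroids)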
Mathematical explanation of K-Nearest Neighbour
KNN, which stands for K-Nearest Neighbour, is a popular algorithm in Supervised Learning
commonly used for classification tasks. It works by classifying data based on its
similarity to neighboring data points. The core idea of KNN is straightforward: when a
new data point is introduced, the algorithm finds its K nearest neighbors and assigns
the most frequent class from these neighbors to the new point.
Working of K-Nearest Neighbour
KNN algorithm stores all available cases and classifies new data based on the
majority class of its nearest neighbors. Value of K in KNN refers to the number of
nearest neighbors to consider when performing classification.
K parameter is critical because:
 If K is too small, the model may be sensitive to noise in the dataset.
 If K is too large, the classification might be too generalized, and nuances in the
data may be overlooked.
Distance between data points is measured using a distance metric, such as Euclidean
distance, to find the nearest neighbors.

How do we choose K?

Choosing the right value for K is crucial:


 A commonly used rule of thumb is to select K ≈ sqrt(n), where n is the number of
data points in the dataset.
 If the resulting K is even, adjust it to be odd by adding or subtracting 1 to avoid ties in
majority voting.
Let’s dive deeper into an example of KNN to make the concept clearer. Below is a
dataset that includes age, gender and the class of sports people play.

NAME AGE GENDER CLASS OF SPORTS


Ajay 32 0 Football

Mark 40 0 Neither

Sara 16 1 Cricket

Zaira 34 1 Cricket

Sachin 55 0 Neither

Rahul 40 0 Cricket

Pooja 20 1 Neither

Smith 15 0 Cricket

Laxmi 55 1 Football

Michael 15 0 Football

Here male is denoted with the numeric value 0 and female with 1. Let’s find which
class of people Angelina will lie in, where her k factor is 3 and her age is 5. We have to
find the distance using the Euclidean distance formula
d = √((x2 − x1)² + (y2 − y1)²)
to find the distance between any two points.
To calculate the distance between Angelina and other individuals in the dataset:
d = √((age2 − age1)² + (gender2 − gender1)²)
Here, Angelina has:
 Age = 5
 Gender = 1 (female)
1. Distance between Angelina and Ajay (age = 32, gender = 0):
d = √((5 − 32)² + (1 − 0)²) = √(729 + 1) = √730 ≈ 27.02
2. Distance between Angelina and Mark (age = 40, gender = 0):
d = √((5 − 40)² + (1 − 0)²) = √(1225 + 1) = √1226 ≈ 35.01
3. Distance between Angelina and Sara (age = 16, gender = 1):
d = √((5 − 16)² + (1 − 1)²) = √(121 + 0) = √121 = 11.00
4. Distance between Angelina and Zaira (age = 34, gender = 1):
d = √((5 − 34)² + (1 − 1)²) = √(841 + 0) = √841 = 29.00
5. Distance between Angelina and Sachin (age = 55, gender = 0):
d = √((5 − 55)² + (1 − 0)²) = √(2500 + 1) = √2501 ≈ 50.01
6. Distance between Angelina and Rahul (age = 40, gender = 0):
d = √((5 − 40)² + (1 − 0)²) = √(1225 + 1) = √1226 ≈ 35.01
7. Distance between Angelina and Pooja (age = 20, gender = 1):
d = √((5 − 20)² + (1 − 1)²) = √(225 + 0) = √225 = 15.00
8. Distance between Angelina and Smith (age = 15, gender = 0):
d = √((5 − 15)² + (1 − 0)²) = √(100 + 1) = √101 ≈ 10.05
9. Distance between Angelina and Laxmi (age = 55, gender = 1):
d = √((5 − 55)² + (1 − 1)²) = √(2500 + 0) = √2500 = 50.00
10. Distance between Angelina and Michael (age = 15, gender = 0):
d = √((5 − 15)² + (1 − 0)²) = √(100 + 1) = √101 ≈ 10.05

Distance between Angelina and Distance

Ajay 27.02

Mark 35.01

Sara 11.00

Zaira 29.00

Sachin 50.01

Rahul 35.01
Pooja 15.00

Smith 10.05

Laxmi 50.00

Michael 10.05

K-Nearest Neighbors (K = 3): The 3 nearest neighbors to Angelina are:


1. Smith (Cricket) - 10.05
2. Michael (Football) - 10.05
3. Sara (Cricket) - 11.00
So, according to the KNN algorithm, classifying based on majority vote, Angelina will be
in the class of people who like cricket.
This example illustrates the working of KNN and how it classifies data based on the
majority class of its nearest neighbors. By calculating the distance between data
points, KNN helps in making predictions about new data. The importance of selecting
the right value of K and handling categorical data appropriately (such as converting
gender to numeric values) cannot be overstated in ensuring accurate classification
results.
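
The same calculation can be reproduced in a few lines of Python; this sketch simply re-implements the distance computation and majority vote for the dataset above (it is an illustration, not part of the original notes):

import math

people = {
    "Ajay": (32, 0, "Football"), "Mark": (40, 0, "Neither"), "Sara": (16, 1, "Cricket"),
    "Zaira": (34, 1, "Cricket"), "Sachin": (55, 0, "Neither"), "Rahul": (40, 0, "Cricket"),
    "Pooja": (20, 1, "Neither"), "Smith": (15, 0, "Cricket"), "Laxmi": (55, 1, "Football"),
    "Michael": (15, 0, "Football"),
}
angelina_age, angelina_gender = 5, 1

# Euclidean distance from Angelina to every person, smallest first
distances = sorted(
    (math.hypot(age - angelina_age, gender - angelina_gender), name, sport)
    for name, (age, gender, sport) in people.items()
)
k = 3
nearest = distances[:k]                     # Smith (10.05), Michael (10.05), Sara (11.00)
votes = [sport for _, _, sport in nearest]  # two 'Cricket', one 'Football'
print(nearest)
print(max(set(votes), key=votes.count))     # majority vote -> 'Cricket'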
