ML unit-2 (CEC)

Probabilistic Classifiers:

1. Bayes Classifiers (Naïve Bayesian Classifier)


2. Logistic Regression
 What is logistic regression? Logistic regression is an example of supervised learning. It is used to predict the probability of a binary (yes/no) event occurring. An example of logistic regression could be applying machine learning to determine whether a person is likely to be infected with COVID-19 or not.
 Logistic regression is a statistical method used to
predict the outcome of a dependent variable based
on previous observations. It's a type of regression
analysis and is a commonly used algorithm for
solving binary classification problems.
 Logistic regression performs better when the data is linearly separable.
 It does not require many computational resources, and the resulting model is highly interpretable.
 There is no strict requirement to scale the input features, and it has few hyperparameters to tune.
 It is easy to implement and train a model using logistic regression.
 It gives a measure of how relevant a predictor is (the coefficient size) and its direction of association (positive or negative).
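As a quick illustration, here is a minimal scikit-learn sketch of training a logistic regression classifier; the synthetic dataset and all variable names are assumptions for the example, not part of the original notes.

# A minimal logistic regression sketch (assumed synthetic data)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate a small synthetic binary (yes/no) dataset.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

# Predicted probability of the positive class for the first few test points.
print(model.predict_proba(X_test)[:5, 1])
# Coefficient sizes and signs give predictor relevance and direction of association.
print(model.coef_)
print("Test accuracy:", model.score(X_test, y_test))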
1. K-Nearest Neighbor (KNN)
2. Support Vector Machine (SVM)
 K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
 The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
 The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using the K-NN algorithm.
 The K-NN algorithm can be used for Regression as well as for Classification, but it is mostly used for Classification problems.
 It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at the time of classification.
 K-NN is a non-parametric algorithm, which means it does not make any assumptions about the underlying data.
 At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category that is most similar to the new data.
 Why do we need a K-NN Algorithm?

 Suppose there are two categories, Category A and Category B, and we have a new data point x1. Which of these categories will the data point lie in? To solve this type of problem we need a K-NN algorithm: with K-NN we can easily identify the category or class of a particular data point. Consider the below diagram:
 Step-1: Select the number K of neighbors.
 Step-2: Calculate the Euclidean distance from the new point to each data point.
 Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
 Step-4: Among these K neighbors, count the number of data points in each category.
 Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
 Step-6: Our model is ready.
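Before the library-based example that follows, here is a minimal from-scratch sketch of these steps; the function name, the toy data, and the labels are all assumptions for illustration.

# A from-scratch sketch of the K-NN steps above (illustrative names and data)
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    # Step-2: Euclidean distance from the new point to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step-3: take the K nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # Step-4 and Step-5: count labels among the neighbors and pick the majority.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny assumed example: two categories, one new point.
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
print(knn_classify(X, y, np.array([2, 2]), k=3))  # expected: 'A'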
 Firstly, we will choose the number of neighbors; for example, k = 5.
 Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2) it can be calculated as:

d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('user_data.csv')

# extracting independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

# fitting the K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)
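To round off the example, a short continuation (reusing the variables above) that predicts on the test set and checks the result; this evaluation step is an assumed addition, not part of the original listing.

# predicting the test set results and evaluating (continuation of the sketch above)
y_pred = classifier.predict(x_test)

from sklearn.metrics import confusion_matrix, accuracy_score
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))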
 Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms. It is used for Classification as well as Regression problems, but primarily for Classification problems in Machine Learning.

 The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane.
 SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:

 Types of SVM
 SVM can be of two types:
 Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by a single straight line, the data is termed linearly separable, and the classifier used is called a Linear SVM classifier.
 Non-linear SVM: Non-Linear SVM is used for non-linearly separable data. If a dataset cannot be classified by a straight line, the data is termed non-linear, and the classifier used is called a Non-linear SVM classifier.
 Hyperplane and Support Vectors in the SVM algorithm:
 Hyperplane: There can be multiple lines/decision boundaries that segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.
 The dimensions of the hyperplane depend on the number of features in the dataset: if there are 2 features (as shown in the image), the hyperplane is a straight line; if there are 3 features, the hyperplane is a two-dimensional plane.
 We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the hyperplane and the nearest data points of each class.
Support Vectors:
 The data points or vectors that are closest to the hyperplane and that affect the position of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.
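To make the linear/non-linear distinction concrete, here is a minimal scikit-learn sketch; the concentric-circles dataset and the kernel choices are illustrative assumptions, not part of the original notes.

# Linear vs. non-linear SVM on data that is NOT linearly separable (assumed example)
from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Concentric circles: no single straight line can separate the two classes.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel='linear').fit(X_train, y_train)
rbf_svm = SVC(kernel='rbf').fit(X_train, y_train)  # non-linear kernel

print("Linear SVM accuracy:", linear_svm.score(X_test, y_test))
print("RBF SVM accuracy:  ", rbf_svm.score(X_test, y_test))
# The support vectors are the extreme points that define the hyperplane.
print("Support vectors per class:", rbf_svm.n_support_)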
1. Accuracy: 0/1 Loss, Sensitivity, Specificity.

2. Clustering: The General Problem, K-Means Clustering.
Sensitivity tells us what proportion of the positive class got correctly classified:

Sensitivity (Recall) = TP / (TP + FN)

 Misclassification rate: Also termed the Error rate, it defines how often the model gives wrong predictions. The error rate is calculated as the number of incorrect predictions divided by the total number of predictions made by the classifier. The formula is given below:

Error rate = (FP + FN) / (TP + TN + FP + FN)
When to use Accuracy / Precision / Recall / F1-Score?
 a. Accuracy is used when the True Positives and True Negatives are more important. Accuracy is a better metric for balanced data.
 b. Whenever False Positives are much more important, use Precision.
 c. Whenever False Negatives are much more important, use Recall.
 d. F1-Score is used when both the False Negatives and False Positives are important. F1-Score is a better metric for imbalanced data.
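A short sketch of computing these metrics with scikit-learn; the toy label vectors are assumptions for the example.

# Computing Accuracy / Precision / Recall / F1-Score (assumed toy labels)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall (Sensitivity):", recall_score(y_true, y_pred))
print("F1-Score: ", f1_score(y_true, y_pred))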
 K-means clustering is the most common partitioning algorithm. K-means reassigns each data point in the dataset to exactly one of the new clusters formed. A record or data point is assigned to the nearest cluster using a measure of distance or similarity.

 K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if k=2, there will be two clusters; for k=3, there will be three clusters; and so on.
The following steps are used in K-means clustering:
 (a) Select K initial cluster centroids c1, c2, c3, ..., ck.
 (b) Assign each instance x in S to the cluster whose centroid is nearest to x.
 (c) For each cluster, recompute its centroid based on the elements contained in that cluster.
 (d) Go to (b) until convergence is reached.
In other words, K-means:
 separates the objects (data points) into K clusters;
 uses the cluster center (centroid) = the average of all the data points in the cluster;
 assigns each point to the cluster whose centroid is nearest (using a distance function).
A simple one-dimensional version of the procedure:
 Step 1: Take the mean value of each cluster.
 Step 2: Find the nearest mean for each number and put the number in that cluster.
 Step 3: Repeat Step 1 and Step 2 until the means no longer change.
 Example:
 Data points = {2, 4, 6, 9, 12, 16, 20, 24, 26}
 Number of clusters = 2
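As a sketch of how this example might be run in code (the library choice and the k-means++ initialisation, scikit-learn's default, are assumptions; the notes do not fix an initialisation):

# Clustering the example data points with K=2 (assumed scikit-learn sketch)
import numpy as np
from sklearn.cluster import KMeans

data = np.array([2, 4, 6, 9, 12, 16, 20, 24, 26], dtype=float).reshape(-1, 1)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Typically one cluster gathers {2, 4, 6, 9, 12} and the other {16, 20, 24, 26}.
print("Cluster labels:   ", kmeans.labels_)
print("Cluster centroids:", kmeans.cluster_centers_.ravel())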
Euclidean Distance Method
Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters, C1, ..., Ck, such that Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k, i ≠ j. An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. That is, the objective function aims for high intra-cluster similarity and low inter-cluster similarity.
 A centroid-based partitioning technique uses the centroid of a cluster Ci, denoted ci, to represent that cluster. Conceptually, the centroid of a cluster is its center point. The centroid can be defined in various ways, such as by the mean or medoid of the objects (or points) assigned to the cluster. The difference between an object p ∈ Ci and ci, the representative of the cluster, is measured by dist(p, ci), where dist(x, y) is the Euclidean distance between two points x and y. The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared errors between all objects in Ci and the centroid ci, defined as

E = Σ (i=1 to k) Σ (p ∈ Ci) dist(p, ci)²
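Continuing the assumed sketch above, the within-cluster variation E can be computed directly; in scikit-learn the same quantity is exposed as the fitted model's inertia_ attribute.

# Sum of squared errors (within-cluster variation) for the k-means result above
# (reuses the `data` and `kmeans` names from the earlier assumed sketch).
sse = 0.0
for i, center in enumerate(kmeans.cluster_centers_):
    members = data[kmeans.labels_ == i]      # objects p in cluster Ci
    sse += ((members - center) ** 2).sum()   # sum of dist(p, ci)^2

print("Within-cluster variation E:", sse)
print("scikit-learn inertia_:     ", kmeans.inertia_)  # same quantity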
