Probabilistic Classifiers:
1. Bayes Classifiers (Naïve Bayesian Classifier)
2. Logistic Regression

What is logistic regression?
Logistic regression is an example of supervised learning. It is used to calculate or predict the probability of a binary (yes/no) event occurring. An example of logistic regression could be applying machine learning to determine whether a person is likely to be infected with COVID-19 or not. Logistic regression is a statistical method used to predict the outcome of a dependent variable based on previous observations. It is a type of regression analysis and a commonly used algorithm for solving binary classification problems.

Advantages of logistic regression:
1. It performs well when the data is linearly separable.
2. It does not require many computational resources, and the resulting model is highly interpretable.
3. There is no problem scaling the input features, and it requires little tuning.
4. It is easy to implement and train.
5. It gives a measure of how relevant each predictor is (coefficient size) and its direction of association (positive or negative).
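As an illustration, here is a minimal scikit-learn sketch of logistic regression on a synthetic binary problem (the dataset and variable names below are made up purely for demonstration):

# Minimal logistic regression sketch using scikit-learn
# (synthetic toy data; names and values are illustrative only)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# generate a toy binary classification dataset
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

# predict_proba returns the probability of each class for every sample
print(model.predict_proba(X_test[:5]))
# coefficient sizes and signs show each predictor's relevance and direction
print(model.coef_)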
1. K-Nearest Neighbor (KNN)
2. Support Vector Machine (SVM)

K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised learning technique. The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories. The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a suitable category using the K-NN algorithm.
The K-NN algorithm can be used for regression as well as for classification, but it is mostly used for classification problems.
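To illustrate, scikit-learn exposes both variants through parallel classes. A minimal sketch (the tiny datasets below are made up purely for illustration):

# K-NN for classification and for regression in scikit-learn
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = [[1], [2], [3], [10], [11], [12]]

# classification: predict a discrete label by majority vote of the neighbors
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, [0, 0, 0, 1, 1, 1])
print(clf.predict([[2.5]]))   # expected: label 0 (nearest neighbors are 1, 2, 3)

# regression: predict a continuous value as the average of the neighbors' targets
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X, [1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
print(reg.predict([[2.5]]))   # expected: mean of 1.0, 2.0, 3.0 = 2.0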
K-NN is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at the time of classification. K-NN is a non-parametric algorithm, which means it makes no assumptions about the underlying data.
At the training phase, the KNN algorithm simply stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data.

Why do we need a K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm: with K-NN, we can easily identify the category or class of a particular data point. The algorithm proceeds as follows:

Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance from the new point to the available data points.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
Step-6: Our model is ready.

Firstly, we choose the number of neighbors, say k = 5. Next, we calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2), it can be calculated as:

dist = √((x2 − x1)² + (y2 − y1)²)

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('user_data.csv')

# extracting independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

# fitting the K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)  # minkowski with p=2 is Euclidean distance
classifier.fit(x_train, y_train)
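Once the classifier is fitted, we can predict the test set results and inspect the errors. A minimal continuation sketch of the example above:

# predicting the test set results
y_pred = classifier.predict(x_test)

# evaluating with a confusion matrix (rows: actual classes, columns: predicted classes)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)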
Support Vector Machine (SVM) is one of the most popular supervised learning algorithms, used for classification as well as regression problems. However, it is primarily used for classification problems in machine learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put a new data point into the correct category in the future. This best decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.

Types of SVM
SVM can be of two types:
Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes using a single straight line, such data is termed linearly separable, and the classifier used is called a linear SVM classifier.
Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified using a straight line, such data is termed non-linear, and the classifier used is called a non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM. The dimensions of the hyperplane depend on the number of features in the dataset: if there are 2 features, the hyperplane is a straight line; if there are 3 features, the hyperplane is a 2-dimensional plane. We always create the hyperplane with the maximum margin, which means the maximum distance between the hyperplane and the nearest data points.
Support Vectors: The data points or vectors that are closest to the hyperplane and that affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
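For illustration, here is a minimal scikit-learn sketch of a linear SVM (the synthetic data below is made up for demonstration; a non-linear SVM would simply swap in a different kernel, e.g. kernel='rbf'):

# Minimal SVM sketch using scikit-learn (illustrative synthetic data)
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two well-separated blobs: a linearly separable toy problem
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# linear SVM: finds the maximum-margin hyperplane
clf = SVC(kernel='linear')
clf.fit(X, y)

# the support vectors are the training points closest to the hyperplane
print(clf.support_vectors_)

# for non-linearly separable data, use a non-linear kernel instead:
# clf = SVC(kernel='rbf')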
1. Accuracy: 0/1 Loss, Sensitivity, Specificity
2. Clustering: The General Problem, K-Means Clustering

Sensitivity tells us what proportion of the positive class got correctly classified:

Sensitivity = TP / (TP + FN)

Misclassification rate: Also termed the error rate, it defines how often the model gives wrong predictions. The error rate is calculated as the number of incorrect predictions divided by the total number of predictions made by the classifier:

Error Rate = (FP + FN) / (TP + TN + FP + FN)

When to use Accuracy / Precision / Recall / F1-Score?
a. Accuracy is used when the True Positives and True Negatives are more important. Accuracy is a better metric for balanced data.
b. Whenever False Positives are much more important, use Precision.
c. Whenever False Negatives are much more important, use Recall.
d. F1-Score is used when both False Negatives and False Positives are important. F1-Score is a better metric for imbalanced data.
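These metrics are available directly in scikit-learn. A minimal sketch, assuming y_test and y_pred from the K-NN example above (or any other classifier):

# computing the evaluation metrics with scikit-learn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))   # recall = sensitivity
print("F1-Score: ", f1_score(y_test, y_pred))

# error rate = 1 - accuracy = (FP + FN) / (TP + TN + FP + FN)
print("Error rate:", 1 - accuracy_score(y_test, y_pred))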
K-means clustering is the most common partitioning algorithm. K-means reassigns each record in the dataset to exactly one of the new clusters formed. A record or data point is assigned to the nearest cluster using a measure of distance or similarity.

K-Means Clustering is an unsupervised learning algorithm which groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if k=2, there will be two clusters; for k=3, there will be three clusters; and so on.

The K-means algorithm uses the following steps:
(a) Select K initial cluster centroids c1, c2, c3, ..., ck.
(b) Assign each instance x in the dataset to the cluster whose centroid is nearest to x.
(c) For each cluster, recompute its centroid based on the elements contained in that cluster.
(d) Go to (b) until convergence is reached.

The algorithm thus separates the objects (data points) into K clusters, where each cluster center (centroid) is the average of all the data points in the cluster, and each point is assigned to the cluster whose centroid is nearest (using a distance function).

In its simplest one-dimensional form:
Step 1: Take the mean value of each cluster.
Step 2: Find the nearest mean for each number and put the number in that cluster.
Step 3: Repeat Step 1 and Step 2 until the means no longer change.

Example: Data points = {2, 4, 6, 9, 12, 16, 20, 24, 26}, number of clusters = 2.

Euclidean Distance Method
Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters, C1, ..., Ck, such that Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k, i ≠ j. An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. That is, the objective function aims for high intra-cluster similarity and low inter-cluster similarity.

A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster. Conceptually, the centroid of a cluster is its center point. The centroid can be defined in various ways, such as by the mean or medoid of the objects (or points) assigned to the cluster. The difference between an object p ∈ Ci and ci, the representative of the cluster, is measured by dist(p, ci), where dist(x, y) is the Euclidean distance between two points x and y. The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared errors between all objects in Ci and the centroid ci, defined as:

E = Σ (i=1 to k) Σ (p ∈ Ci) dist(p, ci)²
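As a worked sketch of the example above, K-means can be run on the nine data points with scikit-learn. Cluster assignments can vary with initialization, but with these well-separated points the algorithm typically converges to the grouping shown in the comments:

# K-means on the one-dimensional example: {2, 4, 6, 9, 12, 16, 20, 24, 26}, k=2
import numpy as np
from sklearn.cluster import KMeans

points = np.array([2, 4, 6, 9, 12, 16, 20, 24, 26]).reshape(-1, 1)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                    # cluster label assigned to each point
print(kmeans.cluster_centers_)   # the final centroids

# typically converges to {2, 4, 6, 9, 12} (centroid 6.6)
# and {16, 20, 24, 26} (centroid 21.5)

# the within-cluster variation E (sum of squared errors) is exposed as inertia_
print(kmeans.inertia_)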