ml_unit2
Supervised learning:
Supervised learning is a process of providing input data as well as correct output data to the machine learning
model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x)
to the output variable (y). In the real world, supervised learning can be used for risk assessment, image
classification, fraud detection, and spam filtering.
We define a linear classifier as a two-class classifier that decides class membership by comparing a linear
combination of the features to a threshold.
Linear classification:
In linear classification, the decision boundary is a straight line (in 2D), a plane (in 3D), or a hyperplane (in
higher dimensions) that separates data points belonging to different classes. The goal is to find a linear
function that can accurately classify data points.
Linear classifiers are simple, fast, and computationally efficient, making them widely used in many real-world
applications. These models make predictions based on a linear combination of input features. Some common
linear classification algorithms include:
1. Perceptron:
A foundational binary linear classifier that updates its weights iteratively to minimize misclassifications.
It’s one of the earliest and simplest models in machine learning.
2. Linear Support Vector Machine (SVM):
A powerful classifier that finds the optimal hyperplane separating the classes by maximizing the margin
—the distance between the hyperplane and the nearest data points from each class.
3. Logistic Regression:
A probabilistic linear classifier commonly used for binary classification tasks. It models the probability
of class membership using the logistic (sigmoid) function and is particularly useful when interpretability
is important.
Linear classifiers work well when the data points are linearly separable, meaning they can be separated by a
straight line or a plane.
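To make this concrete, here is a minimal sketch, assuming scikit-learn is available, that trains the three linear classifiers listed above on a synthetic two-feature dataset; the dataset and all parameter choices are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

# Synthetic two-class data with two informative features.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each model learns a linear decision boundary of the form w·x + b = 0.
for model in (Perceptron(), LinearSVC(), LogisticRegression()):
    model.fit(X_train, y_train)
    print(type(model).__name__, "accuracy:", model.score(X_test, y_test))
```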
Non-linear classification:
Non-linear classification is employed when the data cannot be accurately separated by a straight line or a
hyperplane in the input feature space. In this case, more complex decision boundaries, such as curves or
surfaces, are used to separate the data into different classes. To achieve this, non-linear classifiers use
techniques like feature transformations or kernel methods to map the original data into a higher-dimensional
space where a linear boundary can be found. Some common non-linear classification algorithms include:
1. Support Vector Machine with non-linear kernels (e.g., Polynomial kernel, Radial Basis Function kernel).
2. Decision Trees: These recursively split the feature space into regions to form non-linear decision boundaries.
3. Random Forest: An ensemble method that combines multiple decision trees to improve performance.
4. Neural Networks: Deep learning models capable of learning complex non-linear relationships between
features.
Non-linear classifiers are more flexible and can handle complex data patterns. They are well-suited for tasks
where the decision boundary is intricate and not easily separable by a straight line or a plane. However, they
may require more computational resources and could be prone to overfitting if not properly regularized. In
summary, linear classifiers work best for linearly separable data, while non-linear classifiers are more
appropriate for complex and non-linear data patterns. The choice between the two depends on the nature of
the data and the task at hand. Sometimes, a combination of both techniques can be used to achieve better
classification performance.
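As an illustration of a non-linear classifier, here is a minimal sketch, assuming scikit-learn, that fits an SVM with an RBF kernel to the "two moons" dataset, which no single straight line can separate; the gamma value is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Two interleaving half-circles: not linearly separable in the input space.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional space
# where a linear separating hyperplane can be found.
clf = SVC(kernel="rbf", gamma=2.0).fit(X_train, y_train)
print("RBF-SVM test accuracy:", clf.score(X_test, y_test))
```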
Multi-class & Multi-label Classification:
Multi-class and multi-label classification are two types of classification problems in machine learning that
involve assigning objects to multiple classes or labels.
In multi-class classification, each data point belongs to one and only one class out of multiple mutually
exclusive classes. The goal is to predict the correct class label for each data point from a predefined set of
classes. Some common algorithms used for multi-class classification include:
1. Softmax Regression (Multinomial Logistic Regression): An extension of logistic regression that handles
multiple classes.
2. Support Vector Machine (SVM): Uses methods like one-vs-one or one-vs-all to extend the binary SVM to multiple classes.
3. Decision Trees and Random Forest: Decision trees can directly handle multi-class problems, and random
forests can be used for more robust multi-class classification.
On the other hand, in multi-label classification, each data point can belong to multiple classes or have multiple
labels simultaneously. The goal is to predict the presence or absence of multiple labels for each data point. This
type of classification is commonly used in tasks where an object can have more than one attribute or
characteristic. Some common algorithms used for multi-label classification include:
1. Binary Relevance: Treat each label as a separate binary classification problem and combine the results.
2. Label Powerset: Convert the multi-label problem into a multi-class problem with one class for each unique
combination of labels.
3. Classifier Chains: Create a chain of classifiers, where each classifier predicts one label and takes into account
the predictions of previous classifiers in the chain.
Some algorithms can be extended to handle both multi-class and multi-label tasks; the sketch below contrasts the two settings.
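The following minimal sketch, assuming scikit-learn, shows a multinomial logistic regression for multi-class data, and a one-vs-rest wrapper (scikit-learn's implementation of the Binary Relevance strategy) for multi-label data. The datasets are bundled or synthetic and purely illustrative.

```python
from sklearn.datasets import load_iris, make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Multi-class: each sample belongs to exactly one of three iris species.
X, y = load_iris(return_X_y=True)
multiclass = LogisticRegression(max_iter=1000).fit(X, y)
print("multi-class prediction:", multiclass.predict(X[:1]))

# Multi-label: each sample may carry several labels simultaneously.
Xm, Ym = make_multilabel_classification(n_samples=100, n_labels=2, random_state=0)
multilabel = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(Xm, Ym)
print("multi-label prediction:", multilabel.predict(Xm[:1]))  # a 0/1 entry per label
```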
Email Spam Filtering using Binary Classification:
Email spam filtering is a classic application of machine learning, and various techniques can be employed to
identify and filter out spam messages effectively. Spam is unsolicited email containing commercial or otherwise
unwanted content, in contrast to legitimate email (often called ham). A decision tree is one technique that can
solve the spam filtering problem by treating it as a binary classification task, as sketched below.
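A minimal sketch of this idea, assuming scikit-learn; the four-message corpus and its labels are made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

emails = ["win a free prize now", "meeting agenda for monday",
          "free offer click now", "lunch with the project team"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = legitimate (ham)

# Turn raw text into word-count features the tree can split on.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

clf = DecisionTreeClassifier(random_state=0).fit(X, labels)
test = vectorizer.transform(["claim your free prize"])
print("spam" if clf.predict(test)[0] == 1 else "ham")
```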
Performance Metrics:
To evaluate the performance of a machine learning algorithm, one can use several different approaches. While
building any machine learning model, the first thing that comes to mind is how to build an accurate, 'good fit'
model, and what challenges will arise during the process.
Confusion Matrix in Machine Learning: A confusion matrix displays the performance of a model, i.e., how the
model has made its predictions. It helps us visualize the points where the model gets confused in discriminating
between two classes. It is easiest to understand as a 2×2 matrix, where the rows represent the actual (true)
labels and the columns represent the predicted labels.
Precision and recall are two of the most important, and most often confused, concepts in machine learning.
They are performance metrics used for pattern recognition and classification.
Precision: Precision is the number of true positive results divided by the number of all positive results,
including those not identified correctly. Precision is also known as positive predictive value. It attempts to
answer the following question: What proportion of positive identifications was actually correct?
Precision is defined as follows:
Precision = TP / (TP + FP)
where TP is the number of true positives and FP is the number of false positives. For example, a model that
makes 10 positive predictions of which 8 are correct (TP = 8, FP = 2) has a precision of 8 / 10 = 0.8.
Recall: The recall is the number of true positive results divided by the number of all samples that should have
been identified as positive.
Recall is also known as sensitivity in diagnostic binary classification. Recall attempts to answer the following
question: What proportion of actual positives was identified correctly? Mathematically, recall is defined as
follows:
Recall = TP / (TP + FN)
where FN is the number of false negatives. For example, a model that finds 8 of 12 actual positives (TP = 8,
FN = 4) has a recall of 8 / 12 ≈ 0.67.
Recall is also known as the true positive rate (TPR). To fully evaluate the effectiveness of a model, you must
examine both precision and recall. Unfortunately, precision and recall are often in tension: improving precision
typically reduces recall, and vice versa. A typical example is classifying emails as spam or not spam using a
confusion matrix, as in the sketch below.
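Here is a minimal sketch, assuming scikit-learn, that computes the confusion matrix, precision, and recall for a made-up spam-classification run; the label vectors are illustrative only.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # 1 = spam, 0 = not spam (made up)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Rows are actual labels, columns are predicted labels.
print(confusion_matrix(y_true, y_pred))
print("precision = TP / (TP + FP) =", precision_score(y_true, y_pred))  # 0.8
print("recall    = TP / (TP + FN) =", recall_score(y_true, y_pred))     # 0.8
```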
Decision Trees:
A Decision tree is a popular and widely used supervised machine learning algorithm for both classification and
regression tasks, though it is mostly preferred for solving classification problems. It is a non-linear model that
can handle various types of data, including both numerical and categorical features. It is a tree-structured
classifier in which internal nodes represent the features of a dataset, branches represent the decision rules, and
each leaf node represents the outcome. A decision tree therefore contains two kinds of nodes: the decision
node and the leaf node.
The basic idea behind a decision tree is to recursively divide the dataset into subsets based on the values of
different features, making decisions at each step to create a tree-like structure. The leaves of the tree
represent the final decisions or predictions for the instances falling into those subsets.
Entropy: In machine learning, entropy is a measure of uncertainty or disorder in a dataset. It quantifies the
randomness, unpredictability, or impurity of the information being processed. It is commonly used in decision
trees as a criterion to evaluate the quality of a split: when building a decision tree, the algorithm aims to find
the feature to split on such that the resulting subsets have the highest possible information gain, i.e., the
greatest reduction in entropy. Entropy is calculated using the following formula:
Entropy(S) = − Σ (i = 1 to n) Pi log2(Pi)
Where:
S is the dataset for which we are calculating the entropy.
n is the number of classes in the dataset.
Pi is the proportion of instances belonging to class i in the dataset S
In decision trees, the goal is to minimize entropy by finding the feature that leads to the most homogeneous
subsets (i.e., subsets with low entropy) after the split. This process is iteratively applied to create a tree-like
structure, where the leaves represent the final decision or prediction for each instance based on the majority
class in the corresponding leaf node.
Information gain, which is used to decide which feature to split on, is simply the difference between the
entropy of the parent node before the split and the weighted average of the entropies of the child nodes after
the split:
Gain(S, A) = Entropy(S) − Σv (|Sv| / |S|) · Entropy(Sv)
where the sum runs over the values v of attribute A, and Sv is the subset of S for which attribute A has value v.
Decision trees use entropy (or alternative measures like Gini impurity) to make decisions about how to divide
the data at each node, which ultimately allows them to create a tree that can make predictions for new,
unseen instances based on the patterns learned from the training data.
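The entropy and information-gain formulas above can be implemented directly; the following sketch uses plain Python on a toy label column, with made-up labels for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(Pi * log2(Pi)) over the classes in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the child subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

labels = ["yes", "yes", "yes", "no", "no", "no"]
split = [["yes", "yes", "yes"], ["no", "no", "no"]]   # a perfect split
print(entropy(labels))                  # 1.0: maximum impurity for two classes
print(information_gain(labels, split))  # 1.0: the split removes all uncertainty
```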
ID3 (Iterative Dichotomiser 3) and CART (Classification and Regression Trees) are both decision tree algorithms
used in machine learning for classification and regression tasks.
CART's main advantages include its ability to handle non-linear relationships between features and the target
variable, as well as its ability to handle missing values. Additionally, the resulting decision tree can be visualized
and easily interpreted, making it a valuable tool for understanding the underlying patterns in the data.
ID3 Vs CART:
1. Algorithm type: ID3 (Iterative Dichotomiser 3) is primarily used for classification tasks; it constructs a decision tree from the training data and uses it to classify new instances. CART (Classification and Regression Tree) is versatile and constructs decision trees for both classification and regression purposes.
2. Splitting criterion: ID3 uses Information Gain. CART uses the Gini Index (classification) or variance (regression).
3. Splits: ID3 creates multi-way splits. CART creates binary splits only.
4. Feature types: ID3 works with categorical features only. CART works with both categorical and numerical features.
5. Task support: ID3 supports classification only. CART supports both classification and regression.
6. Pruning: ID3 does not support pruning. CART supports pruning (cost-complexity pruning).
7. Tree structure: ID3 can produce deep and complex trees, which may lead to overfitting on the training data. CART trees tend to be more balanced and less complex due to pruning, which often leads to improved performance on unseen data.
8. Missing values: ID3 handles missing values poorly. CART can handle missing values well.
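As a practical note, scikit-learn's DecisionTreeClassifier is a CART-style implementation (binary splits), but it exposes both split criteria from the table; this minimal sketch, assuming scikit-learn and the bundled iris dataset, contrasts them.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Binary splits using the Gini index (CART's classification criterion)...
gini_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
# ...or using entropy, the information-gain criterion associated with ID3.
entropy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

print("gini tree depth:", gini_tree.get_depth())
print("entropy tree depth:", entropy_tree.get_depth())
```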
REGRESSION:
Regression is a type of supervised learning algorithm in machine learning that is used for predicting
continuous numerical values. Regression models are used to describe relationships between variables by fitting
a line to the observed data. Regression allows you to estimate how a dependent variable changes as the
independent variable(s) change. In regression tasks, the goal is to model the relationship between a set of
input features (independent variables) and a continuous target variable (dependent variable). The algorithm
learns from a labeled training dataset, where the target variable's actual values are provided.
The main objective in regression is to find a function or model that best fits the data, allowing us to make
accurate predictions on new, unseen data. The most common form of regression is linear regression, but other
types of regression models include:
1. Multiple Linear Regression
2. Logistic Regression
3. Polynomial Regression
4. Decision Tree Regression, and more.
Linear Regression:
Linear regression is a simple and widely used regression technique that models the relationship between the
input features and the target variable as a linear equation. The equation takes the form: y = ax + b.
The goal is to find the optimal values for the coefficients a and b that minimize the difference between the
predicted values and the actual target values in the training data. This is usually done by minimizing a cost or
loss function, such as the mean squared error.
Training the Model: During the training phase, the model is presented with the labeled training dataset. The
model adjusts the coefficients using an optimization algorithm (e.g., gradient descent) to find the best-fitting
line or hyperplane that minimizes the error between the predicted values and the actual target values.
Making Predictions: Once the model is trained, it can be used to make predictions on new, unseen data. Given
the input features of a new instance, the model calculates the predicted target value using the learned
coefficients and the linear equation.
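A minimal sketch of this train-then-predict workflow, assuming scikit-learn and NumPy; the data are generated around the made-up line y = 2x + 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data scattered around the line y = 2x + 1.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.5, size=50)

model = LinearRegression().fit(X, y)   # minimizes mean squared error
print("a ≈", model.coef_[0], "b ≈", model.intercept_)
print("prediction at x = 4:", model.predict([[4.0]])[0])  # close to 9
```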
Regression is commonly used in various fields, such as finance (stock price prediction), economics, healthcare
(medical data analysis), and many other domains where predicting continuous numerical values is essential.
Regression is closely related to correlation and covariance; for example, the slope of a simple linear regression equals Cov(x, y) / Var(x).
Correlation: Correlation coefficients are used to measure how strong a relationship is between two variables.
There are several types of correlation coefficient, but the most popular, and the most common measure of
correlation in statistics, is Pearson's:
r = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² · Σ(yi − ȳ)² )
The formula returns a value between -1 and 1, where:
• 1 indicates a strong positive relationship.
• -1 indicates a strong negative relationship.
• A result of zero indicates no relationship at all.
Covariance: In statistics, covariance is used to assess the relationship between two variables. It is a quantitative
measure of the degree to which the deviation of one variable (X) from its mean is related to the deviation of
another variable (Y) from its mean; in other words, covariance measures the joint variability of two random
variables. Its value can be any positive or negative number:
Cov(X, Y) = Σ(xi − x̄)(yi − ȳ) / n
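Both quantities are one-liners in NumPy; this sketch uses two made-up variables where y is roughly twice x.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly y = 2x

# np.corrcoef returns the correlation matrix; the off-diagonal entry is r.
print("Pearson r:", np.corrcoef(x, y)[0, 1])        # close to 1
# bias=True divides by n, matching the covariance formula above.
print("Cov(X, Y):", np.cov(x, y, bias=True)[0, 1])
```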
Multiple Linear Regression:
Multiple linear regression models the dependent variable as a linear function of several independent variables:
y = b0 + b1 x1 + b2 x2 + … + bn xn
here,
• y is the dependent variable.
• x1, x2, x3, … are independent variables.
• b0 = intercept of the line.
• b1, b2, … are coefficients.
The multiple regression of two variables x1 and x2 is:
y = f(x1, x2) = b0 + b1 x1 + b2 x2
In general, for 'n' independent variables, it is:
y = b0 + b1 x1 + b2 x2 + … + bn xn + E, where E is the error term.
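A minimal sketch of multiple linear regression with two independent variables, assuming scikit-learn and NumPy; the true coefficients below are made up so the fitted values can be checked.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 3 + 1.5*x1 - 2.0*x2 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))   # columns are x1 and x2
y = 3 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.3, size=100)

model = LinearRegression().fit(X, y)
print("intercept b0 ≈", model.intercept_)    # close to 3
print("coefficients b1, b2 ≈", model.coef_)  # close to [1.5, -2.0]
```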
Logistic Regression:
Logistic regression is one of the most popular Machine Learning algorithms, which comes under the
Supervised Learning technique. It is used for predicting the categorical dependent variable using a given set of
independent variables. It is a statistical method used for binary classification tasks, where the goal is to predict
the probability of an input belonging to one of two classes. Despite its name, logistic regression is a
classification algorithm rather than a regression algorithm. It models the probability of the dependent variable
belonging to a particular class using the logistic function (also known as the sigmoid function).
Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a
categorical or discrete value: Yes or No, 0 or 1, True or False, etc. However, instead of giving the exact values
0 and 1, it gives probabilistic values that lie between 0 and 1.
Logistic regression is much like linear regression except in how it is used: linear regression is used for solving
regression problems, whereas logistic regression is used for solving classification problems.
In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts
two maximum values (0 or 1). The curve from the logistic function indicates the likelihood of something, such
as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight. The logistic
function is defined as follows:
σ(z) = 1 / (1 + e^(−z))
Where z is a linear combination of the input features and their corresponding weights: z = w0 + w1 x1 + w2 x2 +
… + wn xn. Here, w0, w1, w2, …, wn are the coefficients or weights associated with the input features x1, x2,
x3, …, xn.
The probability that the dependent variable is 1, given the independent variables, is:
P(y = 1 | x) = σ(z) = 1 / (1 + e^(−z))
The sigmoid function is a mathematical function used to map predicted values to probabilities. It maps any
real value to a value within the range 0 to 1. Since the output of logistic regression must lie between 0 and 1
and cannot go beyond this limit, the function forms a curve like the "S" shape; this S-shaped curve is called the
sigmoid function or the logistic function.
In logistic regression, we use the concept of a threshold value, which decides between the classes 0 and 1:
values above the threshold map to 1, and values below the threshold map to 0.
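The sigmoid and the threshold rule can be written out directly; this sketch uses plain Python with made-up weights for two features.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into the interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

w0, w1, w2 = -4.0, 1.0, 0.5   # illustrative weights
x1, x2 = 3.0, 4.0             # illustrative feature values

z = w0 + w1 * x1 + w2 * x2    # linear combination of the inputs
p = sigmoid(z)                # P(y = 1 | x)
print("probability:", p)                          # sigmoid(1.0) ≈ 0.73
print("predicted class:", 1 if p >= 0.5 else 0)   # threshold at 0.5
```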
On the basis of the categories, Logistic Regression can be classified into three types:
1. Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable,
such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered types of the
dependent variable, such as "cat", "dog", or "sheep".
3. Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the dependent
variable, such as "low", "medium", or "high".
Linear Regression Vs Logistic Regression:
1. Hypothesis function: Linear regression uses a linear equation to model the relationship between input features and the target variable. Logistic regression uses the logistic (sigmoid) function to model the probability of a binary outcome.
2. Objective: Linear regression often uses Mean Squared Error (MSE) or Root Mean Squared Error (RMSE) as the loss function to be minimized. Logistic regression uses Maximum Likelihood Estimation (MLE) to maximize the likelihood of observing the given data under the model's parameters.
3. Cost function: Linear regression often uses Mean Squared Error (MSE) as the cost function to be minimized during training. Logistic regression uses the Cross-Entropy Loss (also known as Log Loss) as the cost function.
4. Optimization: In linear regression, Gradient Descent is commonly used to optimize the parameters. In logistic regression, Gradient Descent or other optimization algorithms are used to find the parameters that maximize the likelihood.
5. Regularization: Regularization techniques like L1 and L2 regularization can be applied to prevent overfitting in linear regression. Regularization can also be applied to prevent overfitting in logistic regression.
6. Evaluation: Linear regression is evaluated using metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or R-squared. Logistic regression is evaluated using metrics like accuracy, precision, recall, F1-score, and ROC-AUC for binary classification.
7. Use case: Linear regression is used for predicting house prices, stock prices, sales data, etc. Logistic regression is used for binary classification tasks such as spam detection or disease diagnosis.