
UNIT-1

Regression vs. Classification

Regression and Classification algorithms are both supervised learning algorithms. Both are used for prediction in machine learning and both work with labelled datasets.

The difference between them lies in the kinds of machine learning problems they are used for.

The main difference between Regression and Classification algorithms is that Regression algorithms are used to predict continuous values such as price, salary, or age, whereas Classification algorithms are used to predict (classify) discrete values such as Male or Female, True or False, or Spam or Not Spam.


Classification:

Classification is the process of finding a function that helps divide the dataset into classes based on different parameters. In classification, a computer program is trained on the training dataset and, based on that training, categorizes the data into different classes.

The task of the classification algorithm is to find the mapping function that maps the input (x) to the discrete output (y).

Example: The best example for understanding the classification problem is email spam detection. The model is trained on millions of emails with different parameters, and whenever it receives a new email, it identifies whether the email is spam or not. If the email is spam, it is moved to the Spam folder.
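To make this concrete, here is a minimal sketch of such a spam classifier in Python using scikit-learn; the tiny emails, labels, and word-count features below are invented purely for illustration and are not a real spam dataset.

```python
# Minimal spam-classifier sketch (assumes scikit-learn is installed).
# The example emails and labels are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "Win a free prize now",        # spam
    "Meeting rescheduled to 3pm",  # not spam
    "Claim your free reward",      # spam
    "Project report attached",     # not spam
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()          # turn raw text into word-count features
X = vectorizer.fit_transform(emails)

model = MultinomialNB()                 # a common choice for text classification
model.fit(X, labels)

new_email = ["Free prize waiting for you"]
print(model.predict(vectorizer.transform(new_email)))  # [1] -> classified as spam
```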

Types of ML Classification Algorithms:

Classification Algorithms can be further divided into the following types:

o Logistic Regression

o K-Nearest Neighbours

o Support Vector Machines

o Kernel SVM

o Naïve Bayes

o Decision Tree Classification

o Random Forest Classification

Regression:

Regression is the process of finding the correlations between dependent and independent variables. It helps in predicting continuous variables, such as market trends or house prices.

The task of the regression algorithm is to find the mapping function that maps the input variable (x) to the continuous output variable (y).

Example: Suppose we want to do weather forecasting; for this, we would use a regression algorithm. In weather prediction, the model is trained on past data, and once training is complete it can predict the weather for future days.
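As a minimal sketch of the idea (assuming scikit-learn is available; the house sizes and prices are made-up numbers), a linear regression model learns the mapping from x to a continuous y:

```python
# Minimal regression sketch; the size/price numbers are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[50], [80], [100], [120], [150]])   # e.g. house size in square metres
y = np.array([150, 240, 310, 355, 450])           # e.g. price in thousands

model = LinearRegression()
model.fit(X, y)                                   # learn the mapping x -> continuous y

print(model.predict([[110]]))                     # predicted (continuous) price for a new input
```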

Types of Regression Algorithm:

o Simple Linear Regression

o Multiple Linear Regression

o Polynomial Regression

o Support Vector Regression

o Decision Tree Regression

o Random Forest Regression

Difference between Regression and Classification

Regression Algorithm | Classification Algorithm
In Regression, the output variable must be of continuous nature or a real value. | In Classification, the output variable must be a discrete value.
The task of the regression algorithm is to map the input value (x) to the continuous output variable (y). | The task of the classification algorithm is to map the input value (x) to the discrete output variable (y).
Regression algorithms are used with continuous data. | Classification algorithms are used with discrete data.
In Regression, we try to find the best-fit line, which can predict the output more accurately. | In Classification, we try to find the decision boundary, which can divide the dataset into different classes.
Regression algorithms can be used to solve regression problems such as weather prediction and house price prediction. | Classification algorithms can be used to solve classification problems such as identification of spam emails, speech recognition, and identification of cancer cells.
Regression algorithms can be further divided into Linear and Non-linear Regression. | Classification algorithms can be divided into Binary classifiers and Multi-class classifiers.

Data Preprocessing

Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and a crucial step when creating a machine learning model.

When creating a machine learning project, we do not always come across clean, formatted data. Before doing any operation with data, it is necessary to clean it and put it in a formatted way; this is what the data preprocessing task is for.

Why do we need Data Preprocessing?

Real-world data generally contains noise and missing values, and it may be in an unusable format that cannot be directly used by machine learning models. Data preprocessing is the set of tasks required for cleaning the data and making it suitable for a machine learning model; it also increases the accuracy and efficiency of the model.

It involves the following steps:

o Getting the dataset

o Importing libraries

o Importing datasets

o Finding Missing Data

o Encoding Categorical Data

o Splitting dataset into training and test set

o Feature scaling
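The steps above can be sketched compactly with pandas and scikit-learn. This is only an illustrative outline on a toy DataFrame, and parameter names such as sparse_output may differ between scikit-learn versions:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# A toy DataFrame standing in for "getting/importing the dataset"
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44, 27, None, 38],               # contains a missing value
    "Salary": [72000, 48000, 54000, None],   # contains a missing value
    "Purchased": [0, 1, 0, 1],
})

# Finding and handling missing data: mean imputation on the numeric columns
numeric = SimpleImputer(strategy="mean").fit_transform(df[["Age", "Salary"]])

# Encoding categorical data: one-hot encode the Country column
categorical = OneHotEncoder(sparse_output=False).fit_transform(df[["Country"]])

# Feature scaling: standardize the numeric columns
numeric = StandardScaler().fit_transform(numeric)

# Splitting the dataset into training and test sets
X = np.hstack([numeric, categorical])
y = df["Purchased"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
```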
Feature extraction

● Feature extraction is a part of the dimensionality reduction process, in which an initial set of raw data is divided and reduced to more manageable groups, so that it is easier to process.

● The most important characteristic of these large data sets is that they have a large number of variables.

● These variables require a lot of computing resources to process. Feature extraction helps to get the best features from those big data sets by selecting and combining variables into features, thus effectively reducing the amount of data.

● These features are easy to process, yet still able to describe the actual data set with accuracy and originality.

Why Feature Extraction is Useful?

The technique of extracting the features is useful when you have a large data set and
need to reduce the number of resources without losing any important or relevant
information. Feature extraction helps to reduce the amount of redundant data from the
data set.

In the end, the reduction of the data helps to build the model with less machine effort
and also increases the speed of learning and generalization steps in the machine
learning process.
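As a brief, illustrative sketch (assuming scikit-learn and NumPy), ten redundant variables generated from two underlying signals can be combined into just two extracted features:

```python
# Feature extraction sketch: combine 10 correlated variables into 2 features.
# The random data is generated purely for illustration.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
signals = rng.normal(size=(100, 2))            # 2 underlying signals
data = signals @ rng.normal(size=(2, 10))      # 100 samples, 10 redundant variables

features = PCA(n_components=2).fit_transform(data)
print(data.shape, "->", features.shape)        # (100, 10) -> (100, 2)
```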

Introduction to Dimensionality Reduction Technique

What is Dimensionality Reduction?

The number of input features, variables, or columns present in a given dataset is known
as dimensionality, and the process to reduce these features is called dimensionality
reduction.

A dataset may contain a huge number of input features, which makes the predictive modelling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.

A dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information." These techniques are widely used in machine learning to obtain a better-fitting predictive model when solving classification and regression problems.

It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.

The Curse of Dimensionality

Handling high-dimensional data is very difficult in practice; this is commonly known as the curse of dimensionality. As the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex. As the number of features increases, the number of samples needed to generalize also grows, and the chance of overfitting increases. A machine learning model trained on high-dimensional data tends to become overfitted and to perform poorly.

Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.

Benefits of applying Dimensionality Reduction

Some benefits of applying dimensionality reduction technique to the given dataset are
given below:

o By reducing the dimensions of the features, the space required to store the dataset also gets reduced.

o Less computation and training time is required for a reduced number of feature dimensions.

o Reduced dimensions of the dataset's features help in visualizing the data quickly.

o It removes redundant features (if present) by taking care of multicollinearity.

Disadvantages of dimensionality Reduction

There are also some disadvantages of applying the dimensionality reduction, which are
given below:

o Some data may be lost due to dimensionality reduction.

o In the PCA dimensionality reduction technique, the number of principal components that need to be considered is sometimes unknown.

Need For Principal Component Analysis (PCA)

Machine Learning in general works wonders when the dataset provided for training the
machine is large and concise. Usually having a good amount of data lets us build a better
predictive model since we have more data to train the machine with. However, using a
large data set has its own pitfalls. The biggest pitfall is the curse of dimensionality.
It turns out that in large dimensional datasets, there might be lots of inconsistencies in
the features or lots of redundant features in the dataset, which will only increase the
computation time and make data processing and EDA more convoluted.

To get rid of the curse of dimensionality, a process called dimensionality reduction was
introduced. Dimensionality reduction techniques can be used to filter only a limited
number of significant features needed for training and this is where PCA comes in.

What Is Principal Component Analysis (PCA)?

Principal Component Analysis (PCA) is a dimensionality reduction technique that enables you to identify correlations and patterns in a data set so that it can be transformed into a data set of significantly lower dimension without losing any important information.

The main idea behind PCA is to figure out patterns and correlations among various
features in the data set. On finding a strong correlation between different variables, a
final decision is made about reducing the dimensions of the data in such a way that the
significant data is still retained.

Such a process is very essential in solving complex data-driven problems that involve
the use of high-dimensional data sets. PCA can be achieved via a series of steps. Let’s
discuss the whole end-to-end process.

Step By Step Computation Of PCA

The below steps need to be followed to perform dimensionality reduction using PCA:

1. Standardization of the data
2. Computing the covariance matrix
3. Calculating the eigenvectors and eigenvalues
4. Computing the Principal Components
5. Reducing the dimensions of the data set

Let’s discuss each of the steps in detail:


Step 1: Standardization of the data
If you’re familiar with data analysis and processing, you know that missing out on
standardization will probably result in a biased outcome. Standardization is all about
scaling your data in such a way that all the variables and their values lie within a similar
range.

Consider an example: let's say we have 2 variables in our data set, one with values ranging between 10 and 100 and the other with values between 1000 and 5000. In such a scenario, the output calculated using these predictor variables is going to be biased, since the variable with the larger range will have a more obvious impact on the outcome.

Therefore, standardizing the data into a comparable range is very important. Standardization is carried out by subtracting each value in the data from the mean and dividing it by the overall standard deviation of the data set.

It can be calculated like so:

z = (value - mean) / standard deviation

After this step, all the variables in the data are scaled to a standard and comparable scale.
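A small NumPy sketch of this standardization step, using two made-up variables on very different scales:

```python
# Standardize each column: subtract its mean and divide by its standard deviation.
import numpy as np

X = np.array([[10.0, 1000.0],
              [55.0, 3000.0],
              [100.0, 5000.0]])    # two variables on very different scales

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))          # ~0 for every column
print(X_std.std(axis=0))           # 1 for every column
```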

Step 2: Computing the covariance matrix


As mentioned earlier, PCA helps to identify the correlation and dependencies among the
features in a data set. A covariance matrix expresses the correlation between the
different variables in the data set. It is essential to identify heavily dependent variables
because they contain biased and redundant information which reduces the overall
performance of the model.

Mathematically, a covariance matrix is a p × p matrix, where p represents the number of dimensions of the data set. Each entry in the matrix represents the covariance of the corresponding pair of variables.

Consider a case where we have a 2-dimensional data set with variables a and b. The covariance matrix is a 2 × 2 matrix, as shown below:

    | Cov(a, a)  Cov(a, b) |
    | Cov(b, a)  Cov(b, b) |

In the above matrix:

● Cov(a, a) represents the covariance of a variable with itself, which is nothing but the variance of the variable 'a'.

● Cov(a, b) represents the covariance of the variable 'a' with respect to the variable 'b'. And since covariance is commutative, Cov(a, b) = Cov(b, a).

Here are the key takeaways from the covariance matrix:

● The covariance value denotes how co-dependent two variables are with respect to each other.

● If the covariance value is negative, the respective variables are inversely related: one tends to decrease as the other increases.

● A positive covariance denotes that the respective variables are directly related: they tend to increase or decrease together.

Simple math, isn’t it? Now let’s move on and look at the next step in PCA.
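Here is a short NumPy sketch of this step on invented data, where variable b is constructed to be positively correlated with a:

```python
# Computing the covariance matrix with NumPy.
# rowvar=False tells np.cov that the variables are in columns, not rows.
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = 2 * a + rng.normal(scale=0.5, size=100)   # b is positively correlated with a
X = np.column_stack([a, b])

cov = np.cov(X, rowvar=False)                 # 2 x 2 matrix: [[Cov(a,a), Cov(a,b)], [Cov(b,a), Cov(b,b)]]
print(cov)
```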

Step 3: Calculating the Eigenvectors and Eigenvalues


Eigenvectors and eigenvalues are the mathematical constructs that must be computed
from the covariance matrix in order to determine the principal components of the data
set.

But first, let’s understand more about principal components

What are Principal Components?


Simply put, principal components are the new set of variables that are obtained from
the initial set of variables. The principal components are computed in such a manner
that newly obtained variables are highly significant and independent of each other. The
principal components compress and possess most of the useful information that was
scattered among the initial variables.

If your data set has 5 dimensions, then 5 principal components are computed, such that the first principal component stores the maximum possible information, the second one stores the maximum of the remaining information, and so on.

Now, where do Eigenvectors fall into this whole process?

Assuming that you all have a basic understanding of Eigenvectors and eigenvalues, we
know that these two algebraic formulations are always computed as a pair, i.e, for every
eigenvector there is an eigenvalue. The dimensions in the data determine the number of
eigenvectors that you need to calculate.

Consider a 2-Dimensional data set, for which 2 eigenvectors (and their respective
eigenvalues) are computed. The idea behind eigenvectors is to use the Covariance
matrix to understand where in the data there is the most amount of variance. Since
more variance in the data denotes more information about the data, eigenvectors are
used to identify and compute Principal Components.

Eigenvalues, on the other hand, denote the scaling factors of the respective eigenvectors and, in this context, the amount of variance captured along them. Together, the eigenvectors and eigenvalues are used to compute the Principal Components of the data set.

Step 4: Computing the Principal Components


Once we have computed the eigenvectors and eigenvalues, all we have to do is order them in descending order of eigenvalue, where the eigenvector with the highest eigenvalue is the most significant and thus forms the first principal component. The principal components of lesser significance can then be removed in order to reduce the dimensions of the data.

The final step in computing the Principal Components is to form a matrix, known as the feature vector (or feature matrix), whose columns are the significant eigenvectors, i.e. the ones that carry the maximum information about the data.
Step 5: Reducing the dimensions of the data set
The last step in performing PCA is to re-arrange the original data along the final principal components, which represent the maximum and most significant information of the data set. In order to replace the original data axes with the newly formed principal components, you multiply the transpose of the obtained feature vector by the transpose of the standardized data set (equivalently, multiply the standardized data set by the feature vector).
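Putting steps 3 to 5 together, here is an illustrative NumPy sketch on random data; the projection is written in the equivalent form X_std times the feature vector, which gives the same result as multiplying the transposes:

```python
# Steps 3-5 in NumPy: eigen-decompose the covariance matrix, sort eigenvectors
# by eigenvalue, and project the standardized data onto the top components.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)              # Step 1: standardize

cov = np.cov(X_std, rowvar=False)                         # Step 2: covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)                    # Step 3: eigenvalues/eigenvectors (symmetric matrix)

order = np.argsort(eigvals)[::-1]                         # Step 4: sort by descending eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                                     # keep the 2 most significant components
feature_vector = eigvecs[:, :k]

X_reduced = X_std @ feature_vector                        # Step 5: project onto the principal components
print(X_reduced.shape)                                    # (200, 2)
```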

Polynomial Curve Fitting

We will now discuss Polynomial Curve Fitting. Don't worry if the name makes it appear tough.

First, we will discuss linear regression. As before, we have a set of inputs

x = (x1, x2, . . . , xN)^T

corresponding to a set of target variables

t = (t1, t2, . . . , tN)^T

where, in this example, N = 6.

Our objective is to find a function that relates each of the input variables to the corresponding target values. If we assume that the relationship is a linear one, then we can use the equation of a straight line, given as:

y = β0 + β1x

Then we simply calculate the coefficients β0 and β1. However, since we don't know the nature of this relationship, we extend the equation to cover more options.
So we would have it as:

y(x, w) = w0 + w1x + w2x^2 + . . . + wMx^M

which is the same as:

y(x, w) = Σ (j = 0 to M) wj x^j

This is similar to what we already have: just as y depends on x and β in the linear model, here y depends on x and w. M is the order of the polynomial, so if M is 1 we have the linear model, if M is 2 we have a quadratic function, and so on.

So what is w?
w is simply the vector of polynomial coefficients: w0, w1, . . . , wM are collectively denoted by the vector w. So the problem reduces to determining the polynomial coefficients. Once we have them, we simply plug them into the polynomial and evaluate it for any value of x.

How do we determine w?
We determine w by fitting the polynomial to the training data set. This is achieved by minimizing an error function that measures the misfit between the function y(x, w), for any given value of w, and the corresponding points in the training data set.

To perform the minimization, we need an error function. A good choice is the sum-of-squares error between the predicted value y(xn, w) for each training data point and the corresponding target value tn.

This error function is given by:

E(w) = (1/2) Σ (n = 1 to N) [y(xn, w) - tn]^2

The value of this function is always non-negative. It is zero if and only if the function y(x, w) passes exactly through every training point, which rarely happens.
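As a rough sketch of the whole procedure (assuming NumPy), np.polyfit performs exactly this sum-of-squares minimization for a chosen order M on some invented noisy data:

```python
# Polynomial curve fitting sketch: fit polynomials of order M to noisy data by
# minimizing the sum-of-squares error (np.polyfit does this least-squares fit).
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 6)                                     # N = 6 inputs, as above
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=6)    # noisy targets

for M in (1, 3):                               # M = 1 is the linear model
    w = np.polyfit(x, t, deg=M)                # polynomial coefficients w
    y = np.polyval(w, x)                       # predictions y(x, w)
    E = 0.5 * np.sum((y - t) ** 2)             # sum-of-squares error E(w)
    print(f"M={M}  E(w)={E:.4f}")
```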
Multivariate Logistic Regression or Multivariate non-Linear Functions:

Logistic regression is an algorithm used to predict a binary outcome based on multiple independent variables. A binary outcome has two possibilities: either the scenario happens (represented by 1) or it does not happen (denoted by 0).

Logistic regression is used when working with binary data, that is, data where the outcome (the dependent variable) is dichotomous.

Where can logistic regression be used?

Logistic regression is primarily used to deal with classification problems, for instance, to ascertain whether an email is spam or not, or whether a particular transaction is malicious or not. In data analysis, it is used to make calculated decisions that minimize loss and increase profit.

Multivariate logistic regression is used when there is one dependent variable with multiple possible outcomes; it differs from (binary) logistic regression in that there are more than two possible outcomes.

The multiple logistic regression model can be written as:

P(outcome) = 1 / (1 + e^-(b0 + b1X1 + b2X2 + ... + bpXp))

where X1 to Xp are distinct independent variables and b0 to bp are the regression coefficients.

The multiple logistic regression model can also be written in a different form. In the form below, the outcome is the expected log of the odds that the outcome is present:

ln(P / (1 - P)) = b0 + b1X1 + b2X2 + ... + bpXp

The right-hand side of the above equation resembles the linear regression equation, but the method of finding the regression coefficients differs.
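A minimal sketch with scikit-learn, using synthetically generated data with three outcome classes purely for illustration:

```python
# Logistic regression on a toy three-class problem (data is synthetic).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

model = LogisticRegression(max_iter=1000)   # handles multiple classes via a multinomial model
model.fit(X, y)

print(model.predict(X[:5]))                 # predicted classes
print(model.predict_proba(X[:5]).round(2))  # class probabilities (rows sum to 1)
```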

Assumptions in the Multivariate Regression Model

● The dependent and the independent variables have a linear relationship.

● The independent variables do not have a strong correlation among themselves.

● The observations of yi are chosen randomly and independently from the population.
Assumptions in Multivariate Logistic Regression Model

● The dependent variable is nominal or ordinal. Nominal variables have two or more categories without any meaningful ordering; ordinal variables can also have two or more categories, but they have a structure and can be ranked.

● There can be single or multiple independent variables, which can be ordinal, continuous, or nominal. Continuous variables are those that can take infinitely many values within a specific range.

● The categories of the dependent variable are mutually exclusive and exhaustive.

● The independent variables do not have a strong correlation among themselves.

Advantages of Multivariate Regression


1. Multivariate regression helps us to study the relationships among multiple
variables in the dataset.
2. The correlation between dependent and independent variables helps in
predicting the outcome.
3. It is one of the most convenient and popular algorithms used in machine
learning.
Disadvantages of Multivariate Regression

● The complexity of multivariate techniques requires complex mathematical calculations.

● It is not easy to interpret the output of the multivariate regression model since there are inconsistencies in the loss and error outputs.

● Multivariate regression models cannot be applied to smaller datasets; they are designed for producing accurate outputs when it comes to larger datasets.

Bayes Theorem

Bayes' theorem helps us find conditional probabilities. It is simply derived from the product rule, which states that P(X, Y) = P(X|Y) P(Y) = P(Y|X) P(X).

If we rewrite the product rule in terms of P(X|Y), we have:

P(X|Y) = P(X, Y) / P(Y)

Now we can use the symmetry property from the product rule to replace the numerator. Then we have:

P(X|Y) = P(Y|X) P(X) / P(Y)
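As a small numeric illustration of the theorem (all the probabilities below are invented):

```python
# A numeric check of Bayes' theorem: P(Y|X) = P(X|Y) * P(Y) / P(X),
# with P(X) obtained via the sum rule. Numbers are invented.
p_y = 0.01                 # prior P(Y), e.g. probability of a rare condition
p_x_given_y = 0.95         # likelihood P(X|Y), e.g. test positive given the condition
p_x_given_not_y = 0.05     # P(X|not Y), false-positive rate

p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)   # sum rule for the denominator
p_y_given_x = p_x_given_y * p_y / p_x                   # Bayes' theorem

print(round(p_y_given_x, 3))   # ~0.161: the posterior after observing X
```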

Decision boundary
A decision boundary is a crucial concept in machine learning and pattern recognition. It refers to the boundary or surface that separates different classes or categories in a classification problem. In simple terms, a decision boundary is a line or curve that divides the data into two or more categories based on their features. The objective of a decision boundary is to make accurate predictions on unseen data by identifying the correct class for a given input.

What is Decision boundary


A hyperplane that partitions the feature space into distinct classes is known as a
decision boundary. In binary classification problems, the decision boundary serves as
the line of demarcation between positive and negative classes. The position and
orientation of the decision boundary are determined by the model's training data and
algorithm. The primary aim is to discover a decision boundary that can effectively
generalize to new data, making it a reliable predictor.

Types of Decision boundaries:


There are different types of decision boundaries based on the complexity of the
classification problem. The most common types of decision boundaries are:

1. Linear decision boundary:

A linear decision boundary is a straight line that separates the data into two classes. It is the simplest form of decision boundary and is used when the classification problem is linearly separable. A linear decision boundary can be expressed in the form of a linear equation, y = mx + b, where m is the slope of the line and b is the y-intercept (a small sketch of reading such a boundary off a trained linear model follows this list).

2. Non Linear decision boundary:

A non-linear decision boundary is a curved line that separates the data into two
or more classes. Non-linear decision boundaries are used when the classification
problem is not linearly separable. Non-linear decision boundaries can take
different forms such as parabolas, circles, ellipses, etc.

3. Decision Boundary with Margin:

A decision boundary with margin is a line or curve that separates the data into
two classes while maximizing the distance between the boundary and the closest
data points. The margin is defined as the distance between the decision
boundary and the closest data points of each class. The objective of decision
boundary with margin is to improve the generalization performance of the
classifier by reducing the risk of overfitting.

4. Decision Boundary with Soft Margin:

A decision boundary with soft margin is a line or curve that separates the data
into two classes while allowing some misclassifications. Soft margin is used
when the data is not linearly separable and when the classification problem has
some noise or outliers. The objective of decision boundary with soft margin is to
find a balance between the accuracy of the classifier and its ability to generalize
to unseen data.
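As a small illustrative sketch (assuming scikit-learn), the linear decision boundary learned by a logistic regression classifier on 2-D toy data can be read off its coefficients:

```python
# Recover the linear decision boundary of a logistic regression classifier.
# With two features, the boundary is the line w1*x1 + w2*x2 + b = 0,
# i.e. slope m = -w1/w2 and intercept -b/w2. Data is synthetic.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=200, centers=2, random_state=0)
clf = LogisticRegression().fit(X, y)

w1, w2 = clf.coef_[0]
b = clf.intercept_[0]
print(f"decision boundary: x2 = {-w1 / w2:.2f} * x1 + {-b / w2:.2f}")
```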

Probability Density Estimation

Probability density is the relationship between observations and their probability.

Some outcomes of a random variable will have low probability density and other
outcomes will have a high probability density.

The overall shape of the probability density is referred to as a probability distribution, and the calculation of probabilities for specific outcomes of a random variable is performed by a probability density function, or PDF for short.

It is useful to know the probability density function for a sample of data in order
to know whether a given observation is unlikely, or so unlikely as to be
considered an outlier or anomaly and whether it should be removed. It is also
helpful in order to choose appropriate learning methods that require input data to have
a specific probability distribution.
It is unlikely that the probability density function for a random sample of data is known.
As such, the probability density must be approximated using a process known as
probability density estimation.

Probability Density

A random variable x has a probability distribution p(x). The relationship between the outcomes of a random variable and its probability is referred to as the probability density, or simply the "density."
If a random variable is continuous, then the probability can be calculated via probability
density function, or PDF for short. The shape of the probability density function across
the domain for a random variable is referred to as the probability distribution and
common probability distributions have names, such as uniform, normal, exponential,
and so on.

Given a random variable, we are interested in the density of its probabilities.

For example, given a random sample of a variable, we might want to know things like
the shape of the probability distribution, the most likely value, the spread of values, and
other properties.

Knowing the probability distribution for a random variable can help to calculate
moments of the distribution, like the mean and variance, but can also be useful
for other more general considerations, like determining whether an observation
is unlikely or very unlikely and might be an outlier or anomaly.

The problem is, we may not know the probability distribution for a random variable. We
rarely do know the distribution because we don’t have access to all possible outcomes
for a random variable. In fact, all we have access to is a sample of observations. As such,
we must select a probability distribution.

This problem is referred to as probability density estimation, or simply "density estimation," as we are using the observations in a random sample to estimate the general density of probabilities beyond just the sample of data we have available.
There are a few steps in the process of density estimation for a random variable.
The first step is to review the density of observations in the random sample with a
simple histogram. From the histogram, we might be able to identify a common and well-
understood probability distribution that can be used, such as a normal distribution. If
not, we may have to fit a model to estimate the distribution.

Parametric Density Estimation

The shape of a histogram of most random samples will match a well-known probability
distribution.

The common distributions are common because they occur again and again in different
and sometimes unexpected domains.

Get familiar with the common probability distributions as it will help you to identify a
given distribution from a histogram.

Once identified, you can attempt to estimate the density of the random variable with a
chosen probability distribution. This can be achieved by estimating the parameters of
the distribution from a random sample of data.

For example, the normal distribution has two parameters: the mean and the
standard deviation. Given these two parameters, we now know the probability
distribution function. These parameters can be estimated from data by
calculating the sample mean and sample standard deviation.

We refer to this process as parametric density estimation.

The reason is that we are using predefined functions to summarize the relationship
between observations and their probability that can be controlled or configured with
parameters, hence “parametric“.
Once we have estimated the density, we can check if it is a good fit. This can be done in
many ways, such as:

● Plotting the density function and comparing the shape to the histogram.
● Sampling the density function and comparing the generated sample to the real

sample.
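A brief sketch of parametric density estimation (assuming NumPy and SciPy), fitting a normal distribution to a simulated sample and using the fitted PDF to judge how unlikely an observation is:

```python
# Parametric density estimation: assume a normal distribution, estimate its
# mean and standard deviation from the sample, then query the fitted PDF.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
sample = rng.normal(loc=50, scale=5, size=1000)   # pretend this is an observed sample

mu_hat = sample.mean()            # estimated mean
sigma_hat = sample.std()          # estimated standard deviation
fitted = norm(mu_hat, sigma_hat)  # the fitted parametric density

print(mu_hat, sigma_hat)          # should be close to 50 and 5
print(fitted.pdf(50))             # density near the most likely region
print(fitted.pdf(80))             # far tail: tiny density -> likely an outlier/anomaly
```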
Bayesian Inference

What Is Bayesian Inference?

Bayesian inference in mathematics is a method of statistical inference used to amend or update the probability of an event or a hypothesis as more information becomes available. Hence, it is also referred to as Bayesian updating, and it plays an important role in sequential analysis and hypothesis testing.

Also called Bayesian probability, it is based on Bayes' Theorem. Bayesian inference has many applications due to its significance in predictive analysis. It has been widely used and studied in science, mathematics, economics, philosophy, etc. Its potential in data science looks especially promising for machine learning.

Bayesian inference in statistical analysis can be understood by first studying statistical inference. Statistical inference is a technique used to determine the characteristics of a probability distribution and, thus, of the population itself. Bayesian updating therefore helps to update the characteristics of the population as new evidence comes up. Hence, its role is justified, as new information is necessary to obtain accurate results. Now, let's systematically understand the technical aspect of a Bayesian inference model.

Bayes' theorem states that:

P(H|E) = P(E|H) P(H) / P(E)

Here, H is the hypothesis or event whose probability is to be determined.

E is the evidence or the new data that can affect the hypothesis.

P(H) is the prior probability or the probability of the hypothesis before the new data
was available.

P(E) is the marginal likelihood and probability of the event occurring.

P(E|H) is the probability that event E occurs, given that event H has already occurred. It
is also called the likelihood.

P(H|E) is the posterior probability and determines the probability of event H when
event E has occurred. Hence, event E is the update required.

Thus, the posterior probability increases with the likelihood and prior probability, while
it decreases with the marginal likelihood.
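As a small illustrative sketch of Bayesian updating (all numbers invented), the posterior from one piece of evidence becomes the prior for the next:

```python
# Bayesian updating sketch: apply P(H|E) = P(E|H) * P(H) / P(E) repeatedly,
# feeding each posterior back in as the next prior. Numbers are invented.
def update(prior, likelihood, likelihood_if_not_h):
    evidence = likelihood * prior + likelihood_if_not_h * (1 - prior)  # marginal P(E)
    return likelihood * prior / evidence                               # posterior P(H|E)

belief = 0.5                                # initial prior P(H)
for _ in range(3):                          # three observations of the same evidence
    belief = update(belief, likelihood=0.8, likelihood_if_not_h=0.3)
    print(round(belief, 3))                 # belief rises as evidence accumulates
```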

Bayesian Inference vs Maximum Likelihood

Like Bayesian inference, maximum likelihood is an important concept in statistical inference. However, their approaches and scope are different.

● As the name suggests, maximum likelihood refers to finding the condition (the parameter values) under which the probability of the observed outcome is highest. In statistics, this is arrived at by estimating the parameters from the observed values.

● For example, based on certain data, a scientist determines that the probability of a particular outcome is 65%. To apply maximum likelihood, they adjust the assumed parameter values, through repeated trials, until the observed data attains its maximum probability.

● By contrast, take the probability of getting heads when a coin is tossed being 50%. A Bayesian would say it's because there are only two possibilities, a head and a tail, and the prior probability of either appearing is the same.

● These concepts have significant applications in highly data-driven fields like research, machine learning, business, etc.
