UNIT-1 Regression vs. Classification
Regression and Classification algorithms are both Supervised Learning algorithms. Both are used
for prediction in machine learning and work with labeled datasets.
The difference between them lies in how they are applied to different machine learning
problems.
The main difference between Regression and Classification algorithms is that Regression
algorithms are used to predict continuous values such as price, salary, age, etc.,
while Classification algorithms are used to predict/classify discrete values such as
Male or Female, True or False, Spam or Not Spam, etc.
Classification:
Classification is a process of finding a function which helps in dividing the dataset into
classes based on different parameters.
In Classification, a computer program is trained on the training dataset and based on
that training, it categorizes the data into different classes.
The task of the classification algorithm is to find the mapping function to map the
input(x) to the discrete output(y).
Example: The best example to understand the Classification problem is Email Spam
Detection. The model is trained on the basis of millions of emails on different
parameters, and whenever it receives a new email, it identifies whether the email is
spam or not. If the email is spam, then it is moved to the Spam folder.
Some common types of Classification algorithms are given below:
o Logistic Regression
o K-Nearest Neighbours
o Kernel SVM
o Naïve Bayes
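As a concrete illustration of the spam-detection example, here is a minimal sketch using scikit-learn's Naive Bayes classifier; the tiny email list and labels are made-up toy data, not a real training set.

# Sketch: classifying emails as spam (1) or not spam (0) with Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win a free prize now",               # spam
    "lowest price guaranteed, buy now",   # spam
    "meeting rescheduled to monday",      # not spam
    "please review the attached report",  # not spam
]
labels = [1, 1, 0, 0]

# Turn raw text into word-count features, then train the classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)

# The prediction is a discrete class label.
new_email = vectorizer.transform(["claim your free prize"])
print(model.predict(new_email))  # e.g. [1] -> spam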
Regression:
The task of the Regression algorithm is to find the mapping function to map the input
variable(x) to the continuous output variable(y).
Example: Suppose we want to do weather forecasting, so for this, we will use the
Regression algorithm. In weather prediction, the model is trained on the past data, and
once the training is completed, it can easily predict the weather for future days.
Some common types of Regression algorithms are given below:
o Polynomial Regression
Regression | Classification
The task of the regression algorithm is to map the input value (x) to the continuous output variable (y). | The task of the classification algorithm is to map the input value (x) to the discrete output variable (y).
Regression Algorithms are used with continuous data. | Classification Algorithms are used with discrete data.
In Regression, we try to find the best-fit line, which can predict the output more accurately. | In Classification, we try to find the decision boundary, which can divide the dataset into different classes.
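In the same spirit as the weather-forecasting example above, here is a minimal regression sketch with scikit-learn; the day/temperature numbers are made-up toy data.

# Sketch: predicting a continuous value (temperature) with linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression

days = np.array([[1], [2], [3], [4], [5], [6], [7]])          # past days
temps = np.array([21.0, 21.5, 22.1, 22.8, 23.0, 23.6, 24.1])  # past temperatures

model = LinearRegression().fit(days, temps)

# The prediction is a continuous value, not a class label.
print(model.predict([[8]]))  # roughly 24.6 for the next day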
Data Preprocessing
Data preprocessing is a process of preparing the raw data and making it suitable for a
machine learning model. It is the first and crucial step while creating a machine learning
model.
When creating a machine learning project, it is not always the case that we come across
clean and formatted data. And before doing any operation with data, it is necessary
to clean it and put it in a formatted way. For this, we use the data preprocessing task.
Real-world data generally contains noise and missing values, and may be in an unusable
format that cannot be directly used by machine learning models. Data preprocessing
is the required task of cleaning the data and making it suitable for a machine learning
model, which also increases the accuracy and efficiency of the model.
Some of the data preprocessing steps are given below:
o Importing libraries
o Importing datasets
o Feature scaling
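A minimal sketch of these preprocessing steps is shown below; "data.csv" is a hypothetical file name, and the column layout (numeric feature columns followed by a target column) is an assumption made for illustration.

# Sketch: importing libraries, importing the dataset, and feature scaling.
import numpy as np                      # importing libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler

dataset = pd.read_csv("data.csv")       # importing the dataset (hypothetical file)
X = dataset.iloc[:, :-1].values         # feature columns (assumed numeric)
y = dataset.iloc[:, -1].values          # target column

scaler = StandardScaler()               # feature scaling: zero mean, unit variance
X_scaled = scaler.fit_transform(X)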
Feature extraction
● Feature extraction is a process in which an initial set of raw data is divided and
reduced into more manageable groups, so that it is easier to process.
● The most important characteristic of these large data sets is that they have a
large number of variables. Feature extraction helps to get the best features from
those big data sets by selecting and combining variables into features, thus
effectively reducing the amount of data.
● These features are easy to process, but are still able to describe the actual data set accurately.
The technique of extracting the features is useful when you have a large data set and
need to reduce the number of resources without losing any important or relevant
information. Feature extraction helps to reduce the amount of redundant data from the
data set.
In the end, the reduction of the data helps to build the model with less machine effort
and also increases the speed of learning and generalization steps in the machine
learning process.
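As one common example of feature extraction, the sketch below combines ten toy variables into three new features with PCA (discussed in detail later); the random data is only there to show the shapes involved.

# Sketch: extracting a small number of combined features from many variables.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))      # toy data: 100 samples, 10 variables

extractor = PCA(n_components=3)     # keep only 3 combined features
X_features = extractor.fit_transform(X)
print(X_features.shape)             # (100, 3): less data, easier to process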
The number of input features, variables, or columns present in a given dataset is known
as dimensionality, and the process to reduce these features is called dimensionality
reduction.
In many cases a dataset contains a huge number of input features, which makes the
predictive modelling task more complicated. Because it is very difficult to visualize or
make predictions for a training dataset with a high number of features,
dimensionality reduction techniques need to be used in such cases.
A dimensionality reduction technique can be defined as "a way of converting a
higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides
similar information." These techniques are widely used in machine learning to
obtain a better-fitting predictive model while solving classification and regression
problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.
Some benefits of applying dimensionality reduction technique to the given dataset are
given below:
o By reducing the dimensions of the features, the space required to store the
dataset also gets reduced.
There are also some disadvantages of applying dimensionality reduction, which are
given below:
o Some amount of information may be lost when the dimensions are reduced.
o The reduced features (for example, principal components) are often harder to interpret than the original features.
Machine Learning in general works wonders when the dataset provided for training the
machine is large and concise. Usually having a good amount of data lets us build a better
predictive model since we have more data to train the machine with. However, using a
large data set has its own pitfalls. The biggest pitfall is the curse of dimensionality.
It turns out that in large dimensional datasets, there might be lots of inconsistencies in
the features or lots of redundant features in the dataset, which will only increase the
computation time and make data processing and EDA more convoluted.
To get rid of the curse of dimensionality, a process called dimensionality reduction was
introduced. Dimensionality reduction techniques can be used to filter only a limited
number of significant features needed for training and this is where PCA comes in.
The main idea behind PCA is to figure out patterns and correlations among various
features in the data set. On finding a strong correlation between different variables, a
final decision is made about reducing the dimensions of the data in such a way that the
significant data is still retained.
Such a process is very essential in solving complex data-driven problems that involve
the use of high-dimensional data sets. PCA can be achieved via a series of steps. Let’s
discuss the whole end-to-end process.
The below steps need to be followed to perform dimensionality reduction using PCA:
Step 1: Standardization of the data
Standardization means scaling the data so that all the variables and their values lie within a
similar range. Consider an example: let's say that we have 2 variables in our data set, one has values
ranging between 10-100 and the other has values between 1000-5000. In such a
scenario, it is obvious that the output calculated by using these predictor variables is
going to be biased since the variable with a larger range will have a more obvious
impact on the outcome.
Step 2: Computing the covariance matrix
A covariance matrix expresses how the different variables in the data set vary with respect to
each other:
● Cov(a, a) represents the covariance of a variable with itself, which is nothing but
the variance of that variable.
● Cov(a, b) represents the covariance of the variable 'a' with respect to the variable
'b'; since covariance is commutative, Cov(a, b) = Cov(b, a).
● The covariance value denotes how co-dependent two variables are with respect
to each other.
Simple math, isn’t it? Now let’s move on and look at the next step in PCA.
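As a quick illustration, the covariance matrix of two toy variables can be computed as in the sketch below; the numbers are made up.

# Sketch: the covariance matrix of two variables a and b.
import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0])
b = np.array([1.0, 3.0, 2.0, 5.0])

C = np.cov(a, b)         # 2x2 covariance matrix
print(C[0, 0])           # Cov(a, a): the variance of a
print(C[0, 1], C[1, 0])  # Cov(a, b) equals Cov(b, a)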
Step 3: Calculating the eigenvectors and eigenvalues
If your data set is of 5 dimensions, then 5 principal components are computed, such that
the first principal component stores the maximum possible information, the second
one stores the remaining maximum information, and so on.
Assuming that you all have a basic understanding of Eigenvectors and eigenvalues, we
know that these two algebraic formulations are always computed as a pair, i.e, for every
eigenvector there is an eigenvalue. The dimensions in the data determine the number of
eigenvectors that you need to calculate.
Consider a 2-Dimensional data set, for which 2 eigenvectors (and their respective
eigenvalues) are computed. The idea behind eigenvectors is to use the Covariance
matrix to understand where in the data there is the most amount of variance. Since
more variance in the data denotes more information about the data, eigenvectors are
used to identify and compute Principal Components.
Eigenvalues, on the other hand, simply denote the scalars of the respective eigenvectors.
Therefore, eigenvectors and eigenvalues will compute the Principal Components of the
data set.
Step 4: Computing the Principal Components
The final step in computing the Principal Components is to form a matrix, known as the
feature vector (or feature matrix), whose columns are the eigenvectors of the components
that possess maximum information about the data.
Step 5: Reducing the dimensions of the data set
The last step in performing PCA is to re-arrange the original data with the final principal
components which represent the maximum and the most significant information of the
data set. In order to replace the original data axis with the newly formed Principal
Components, you simply multiply the transpose of the original data set by the transpose
of the obtained feature vector.
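The sketch below walks through the five PCA steps with plain NumPy on a small made-up data matrix; it is an illustration of the procedure described above, not a production implementation.

# Sketch: PCA from scratch, following Steps 1-5.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                      # toy data set (50 samples, 4 variables)

# Step 1: standardize each variable (zero mean, unit variance).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: compute the covariance matrix of the standardized data.
cov = np.cov(X_std, rowvar=False)

# Step 3: compute the eigenvalues and eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: form the feature vector from the eigenvectors with the largest
# eigenvalues (here, the top 2 principal components).
order = np.argsort(eigvals)[::-1]
feature_vector = eigvecs[:, order[:2]]

# Step 5: project the data onto the chosen principal components.
X_reduced = X_std @ feature_vector
print(X_reduced.shape)                            # (50, 2)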
Polynomial Curve Fitting
In polynomial curve fitting, we model the target t as a polynomial function of the input x:
y(x, w) = w0 + w1 x + w2 x^2 + ... + wM x^M
So what is w?
w is simply the vector of polynomial coefficients, so w0, w1, ..., wM are denoted by the vector w.
So the problem reduces to simply determining the polynomial coefficients. Once we
have them, we simply plug them into the polynomial to make predictions for new values of x.
How do we determine w?
We determine w by fitting the polynomial to the training data set. This is achieved by
minimizing an error function that measures the misfit between the function y(x, w),
for any given value of w, and the corresponding points in the training data set.
To perform minimization, we need the error function. A good choice is to use the sum of
squares error between the predicted values y(xn, w) for each training data point and
the corresponding target values tn.
This error function is given by:
E(w) = (1/2) Σn { y(xn, w) − tn }^2
where the sum runs over all N training points (the factor 1/2 is included for convenience).
The value of this function is always non-negative. It can be zero, but rarely: it is zero if and
only if the fitted polynomial produces exactly the same outputs as the training set.
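A minimal sketch of this fitting procedure is given below; np.polyfit returns the least-squares coefficients, i.e. the w that minimizes the sum-of-squares error, and the (x, t) training points are made-up toy data.

# Sketch: fitting polynomial coefficients w by minimizing the sum-of-squares error.
import numpy as np

x = np.linspace(0.0, 1.0, 10)                                               # training inputs
t = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(2).normal(size=10)  # noisy targets

M = 3                        # polynomial order
w = np.polyfit(x, t, deg=M)  # least-squares estimate of w

# Plug the fitted w back into the polynomial to predict y for a new x.
print(np.polyval(w, 0.5))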
Multivariate Logistic Regression or Multivariate non-Linear Functions:
Logistic regression is used while working on binary data, the data where the outcome
(or the dependent variable) is dichotomous.
Logistic regression is primarily used to deal with classification issues. For instance, to
ascertain if an email is spam or not and if a particular transaction is malicious or not. In
data analysis, it is used to make calculated decisions to minimize loss and increase
profits.
Multivariate logistic regression is used when there is one dependent variable and multiple
possible outcomes. It differs from binary logistic regression in that the outcome can take
more than two values.
The multiple logistic regression model can also be written in a different form. In the
form below, the outcome is the expected log of the odds that the outcome is present:
ln( p / (1 − p) ) = b0 + b1 X1 + b2 X2 + ... + bp Xp
The right side of the above equation resembles the linear regression equation, but the
method of finding out the regression coefficients differs.
Assumptions in Multivariate Logistic Regression Model
● The dependent variable is nominal or ordinal. Nominal variables have two or
more categories without any meaningful ordering. Ordinal variables can also
have two or more categories, but they have a structure and can be ranked.
● The independent variables can be continuous or nominal. Continuous variables are
those that can take infinitely many values within a specific range.
● The independent variables should not be highly correlated with each other, since
multicollinearity creates problems in the calculations.
● It is not easy to interpret the output of the multivariate logistic regression model, since
the coefficients are expressed as log-odds rather than probabilities.
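A minimal sketch of a logistic regression model with several independent variables and more than two outcome categories is shown below; the data and labels are randomly generated toy values, used only to show the shapes of the fitted coefficients.

# Sketch: logistic regression with 4 predictors and 3 outcome categories.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 4))            # 4 continuous independent variables
y = rng.integers(0, 3, size=150)         # 3 outcome categories (toy labels)

model = LogisticRegression(max_iter=1000).fit(X, y)

# The coefficients are on the log-odds scale, one row per outcome category.
print(model.coef_.shape)                 # (3, 4)
print(model.predict(X[:5]))              # predicted categories for 5 samples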
Bayes Theorem
Bayes theorem helps us find conditional probability. It is simply derived from the product
rule:
P(X, Y) = P(Y | X) P(X)
so that P(Y | X) = P(X, Y) / P(X). Now we can use the symmetry property from the product rule,
P(X, Y) = P(Y, X) = P(X | Y) P(Y), to replace the numerator. Then we have:
P(Y | X) = P(X | Y) P(Y) / P(X)
Decision boundary
Decision boundary is a crucial concept in machine learning and pattern recognition. It
refers to the boundary or surface that separates different classes or categories in a
classification problem. In simple terms, a decision boundary is a line or curve that divides
the data into two or more categories based on their features. The objective of a decision
boundary is to make accurate predictions on unseen data by identifying the correct
class for a given input.
A linear decision boundary is a straight line that separates the data into two
classes. It is the simplest form of decision boundary and is used when the
classification problem is linearly separable. A linear decision boundary can be
expressed in the form of a linear equation, y = mx + b, where m is the slope of the
line and b is the y-intercept.
A non-linear decision boundary is a curved line that separates the data into two
or more classes. Non-linear decision boundaries are used when the classification
problem is not linearly separable. Non-linear decision boundaries can take
different forms such as parabolas, circles, ellipses, etc.
A decision boundary with margin is a line or curve that separates the data into
two classes while maximizing the distance between the boundary and the closest
data points. The margin is defined as the distance between the decision
boundary and the closest data points of each class. The objective of decision
boundary with margin is to improve the generalization performance of the
classifier by reducing the risk of overfitting.
A decision boundary with soft margin is a line or curve that separates the data
into two classes while allowing some misclassifications. Soft margin is used
when the data is not linearly separable and when the classification problem has
some noise or outliers. The objective of decision boundary with soft margin is to
find a balance between the accuracy of the classifier and its ability to generalize
to unseen data.
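The sketch below fits a linear decision boundary with a soft margin using a linear SVM on toy two-dimensional data; the parameter C controls how soft the margin is (a smaller C tolerates more misclassifications).

# Sketch: a linear decision boundary with a (soft) margin.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
class0 = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(50, 2))
class1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2))
X = np.vstack([class0, class1])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The learned boundary is the line w0*x0 + w1*x1 + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]
print("boundary:", w, b)
print("prediction for (0, 0):", clf.predict([[0.0, 0.0]]))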
Some outcomes of a random variable will have low probability density and other
outcomes will have a high probability density.
It is useful to know the probability density function for a sample of data in order
to know whether a given observation is unlikely, or so unlikely as to be
considered an outlier or anomaly and whether it should be removed. It is also
helpful in order to choose appropriate learning methods that require input data to have
a specific probability distribution.
It is unlikely that the probability density function for a random sample of data is known.
As such, the probability density must be approximated using a process known as
probability density estimation.
Probability Density
For example, given a random sample of a variable, we might want to know things like
the shape of the probability distribution, the most likely value, the spread of values, and
other properties.
Knowing the probability distribution for a random variable can help to calculate
moments of the distribution, like the mean and variance, but can also be useful
for other more general considerations, like determining whether an observation
is unlikely or very unlikely and might be an outlier or anomaly.
The problem is, we may not know the probability distribution for a random variable. We
rarely do know the distribution because we don’t have access to all possible outcomes
for a random variable. In fact, all we have access to is a sample of observations. As such,
we must select a probability distribution.
The shape of a histogram of most random samples will match a well-known probability
distribution.
The common distributions are common because they occur again and again in different
and sometimes unexpected domains.
Get familiar with the common probability distributions as it will help you to identify a
given distribution from a histogram.
Once identified, you can attempt to estimate the density of the random variable with a
chosen probability distribution. This can be achieved by estimating the parameters of
the distribution from a random sample of data.
For example, the normal distribution has two parameters: the mean and the
standard deviation. Given these two parameters, we now know the probability
distribution function. These parameters can be estimated from data by
calculating the sample mean and sample standard deviation.
This process is known as parametric probability density estimation. The reason is that we are
using predefined functions to summarize the relationship between observations and their
probability that can be controlled or configured with parameters, hence "parametric".
Once we have estimated the density, we can check if it is a good fit. This can be done in
many ways, such as:
● Plotting the density function and comparing the shape to the histogram.
● Sampling the density function and comparing the generated sample to the real
sample.
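A minimal sketch of this parametric approach is given below: a normal distribution is assumed, its two parameters are estimated from a made-up sample, and the fitted density is compared with the sample histogram.

# Sketch: parametric density estimation with an assumed normal distribution.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
sample = rng.normal(loc=50.0, scale=5.0, size=1000)   # toy observations

mu_hat = sample.mean()        # estimated mean
sigma_hat = sample.std()      # estimated standard deviation

# Evaluate the fitted density at a few points and compare with the histogram.
points = np.array([40.0, 50.0, 60.0])
print(norm.pdf(points, loc=mu_hat, scale=sigma_hat))

hist, edges = np.histogram(sample, bins=30, density=True)
print(hist[:5])               # histogram heights approximate the same density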
Bayesian Inference
Bayesian inference updates the probability of a hypothesis H as new evidence E becomes
available, using Bayes' rule: P(H|E) = P(E|H) P(H) / P(E), where:
E is the evidence, or the new data that can affect the hypothesis.
P(H) is the prior probability, or the probability of the hypothesis before the new data
was available.
P(E|H) is the probability that event E occurs, given that event H has already occurred. It
is also called the likelihood.
P(H|E) is the posterior probability and determines the probability of event H when
event E has occurred. Hence, event E is the update required.
P(E) is the marginal likelihood, or the overall probability of observing the evidence.
Thus, the posterior probability increases with the likelihood and the prior probability, while
it decreases with the marginal likelihood.
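The posterior update can be illustrated with a tiny numeric sketch; the prior, likelihood, and evidence values below are made-up numbers.

# Sketch: Bayes' rule with made-up numbers.
prior = 0.3          # P(H): probability of the hypothesis before the new data
likelihood = 0.8     # P(E|H): probability of the evidence if H is true
evidence = 0.5       # P(E): marginal likelihood of the evidence

posterior = likelihood * prior / evidence   # P(H|E)
print(posterior)     # 0.48: the evidence has increased the belief in H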
● As the name suggests, maximum likelihood refers to the condition where the
probability that the observed data occurs is the highest. In statistics, this is arrived at
by estimating the values of the parameters that make the observed data most probable.
● For example, based on certain data, a scientist may determine that the probability of a
particular outcome is 65%. To estimate the underlying parameter, they vary the candidate
parameter values, through simple trial and error or optimization, until the probability of
the observed data reaches its maximum.
● For example, the probability of getting heads when a coin is tossed is 50%. A
Bayesian would say it’s because there are only two possibilities – a head and a
tail. And the probability of any of these appearing is the same.
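A minimal sketch of maximum likelihood estimation for the coin example is shown below: the heads-probability p is varied over a grid and the value that makes the observed tosses most likely is kept. The 7-heads-out-of-10 data is made up.

# Sketch: maximum likelihood estimate of a coin's heads-probability p.
import numpy as np

heads, tosses = 7, 10
p_grid = np.linspace(0.01, 0.99, 99)        # candidate values of p

# Likelihood of the observed data for each candidate p (binomial form;
# the constant factor is omitted since it does not affect the maximum).
likelihood = p_grid**heads * (1 - p_grid)**(tosses - heads)

p_mle = p_grid[np.argmax(likelihood)]
print(p_mle)   # approximately 0.7, the observed proportion of heads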