
Unit 2

Supervised Learning: Regression

Pallavi Shukla
Assistant Professor
UCER
Regression
• Regression analysis is a statistical method for modelling the relationship
between a dependent (target) variable and one or more independent
(predictor) variables.
• It helps us understand how the value of the dependent variable changes
with one independent variable while the other independent variables are
held fixed.
• Regression searches for relationships among variables.
• For example, you can observe several employees of some company and try
to understand how their salaries depend on features such as experience,
level of education, role, the city they work in, and so on.
Regression
• In regression, we fit a line or curve that best matches the given data points.
• Using this fitted model, the machine learning model can make predictions
about the data.
• In simple words, "Regression finds a line or curve through the data points
on the target–predictor graph such that the vertical distance between the
data points and the regression line is minimized."
• The distance between the data points and the line tells whether the model
has captured a strong relationship or not.
Examples
• Prediction of rain using temperature and other factors
• Determining market trends
• Prediction of road accidents due to rash driving
Taxonomy
• Dependent Variable: The main factor in regression analysis that we want to
predict or understand is called the dependent variable. It is also called the
target variable.
• Independent Variable: The factors that affect the dependent variable, or
that are used to predict its values, are called independent variables, also
known as predictors.
• Outliers: An outlier is an observation with either a very low or a very high
value in comparison to the other observed values. An outlier may distort
the result, so it should be handled carefully.
Taxonomy
• Multicollinearity: If the independent variables are highly correlated with
each other, the condition is called multicollinearity. It should not be present
in the dataset, because it creates problems when ranking the most
influential variables.
• Underfitting and Overfitting: If our algorithm works well on the training
dataset but not on the test dataset, the problem is called overfitting. If our
algorithm does not perform well even on the training dataset, the problem
is called underfitting.
Common Regression Algorithms
The most common regression algorithms are:
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Multivariate adaptive regression splines
• Logistic Regression
• Maximum likelihood estimation (least squares)
Linear Regression:
• Linear regression is a statistical regression method used for predictive
analysis.
• It is one of the simplest regression algorithms, and it models the
relationship between continuous variables.
• It is used for solving regression problems in machine learning.
• Linear regression shows the linear relationship between the independent
variable (X-axis) and the dependent variable (Y-axis), hence the name
linear regression.
• If there is only one input variable (x), it is called simple linear regression;
if there is more than one input variable, it is called multiple linear
regression.
• The relationship between the variables in a linear regression model can be
illustrated by predicting the salary of an employee on the basis of years of
experience.
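To make this concrete, here is a minimal simple-linear-regression sketch in Python. The salary figures are illustrative numbers, not data from these slides; it uses the closed-form least-squares estimates b1 = cov(x, y)/var(x) and b0 = mean(y) − b1·mean(x).

```python
# Simple linear regression: salary (in thousands) vs. years of experience.
# The data below is made up purely for illustration.
import numpy as np

experience = np.array([1.0, 2.0, 3.0, 5.0, 7.0, 10.0])   # years
salary = np.array([35.0, 42.0, 50.0, 61.0, 74.0, 98.0])  # thousands

# Closed-form least-squares estimates:
# slope b1 = cov(x, y) / var(x); intercept b0 = mean(y) - b1 * mean(x)
b1 = np.cov(experience, salary, bias=True)[0, 1] / np.var(experience)
b0 = salary.mean() - b1 * experience.mean()

print(f"salary ≈ {b0:.2f} + {b1:.2f} × experience")
print(f"predicted salary at 4 years: {b0 + b1 * 4.0:.2f}")
```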
Applications of linear regression:
• Analyzing trends and sales estimates
• Salary forecasting
• Real estate prediction
• Arriving at ETAs in traffic
Simple Linear Regression
(Figure slides: linear positive slope, curvilinear positive slope, linear
negative slope, curvilinear negative slope, no-relationship graph, error in
simple regression, and a worked example.)
Multiple Linear Regression
Logistic Regression
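Multiple linear regression extends the simple model to several predictors, while logistic regression passes a linear combination of the inputs through the sigmoid function to output a class probability. Below is a minimal logistic-regression sketch trained with gradient descent; the hours-studied data is invented purely for illustration.

```python
# Logistic regression on toy 1-D data: hours studied -> pass (1) / fail (0).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # hours studied (illustrative)
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # pass/fail labels

w, b, lr = 0.0, 0.0, 0.1
for _ in range(5000):                    # batch gradient descent on log-loss
    p = sigmoid(w * x + b)               # predicted P(y = 1 | x)
    w -= lr * np.mean((p - y) * x)       # dL/dw for the cross-entropy loss
    b -= lr * np.mean(p - y)             # dL/db

print(f"P(pass | 3.5 hours) ≈ {sigmoid(w * 3.5 + b):.2f}")  # near 0.5
```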
Bayesian Decision Theory

• It is a framework for choosing actions based on present observations and
known probabilities.
• The approach is attributed to Rev. Thomas Bayes; his essay on the problem
was published posthumously in 1763.
Basic Concepts in Bayes Decision Theory
• Marginal Probability (Simple Probability) P(A):
• The ordinary probability of occurrence of an event A, irrespective of all
other events, is called simple or marginal probability.
• P(A) = (number of successful events) / (total number of events)
• Example: for a fair die, P(even) = 3/6 = 0.5.
Basic Concepts in Bayes Decision Theory
• Conditional Probability P(A|B):
• The probability of the occurrence of an event A, given that event B has
already occurred, is called conditional probability:
P(A|B) = P(A, B) / P(B)
Basic Concepts in Bayes Decision Theory
• Joint Probability P(A, B):
• The probability of two events A and B occurring simultaneously is called
the joint probability, P(A, B) = P(A ∩ B).
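The three quantities above can be checked on a tiny sample; the (weather, activity) pairs below are made up purely for illustration.

```python
# Marginal, joint, and conditional probability over a toy sample.
sample = [("Sunny", "Play"), ("Sunny", "Play"), ("Sunny", "NoPlay"),
          ("Rainy", "Play"), ("Rainy", "NoPlay"), ("Rainy", "NoPlay")]

n = len(sample)
p_sunny = sum(w == "Sunny" for w, _ in sample) / n          # marginal P(A)
p_sunny_and_play = sum((w, a) == ("Sunny", "Play")
                       for w, a in sample) / n              # joint P(A, B)
p_play = sum(a == "Play" for _, a in sample) / n            # marginal P(B)
p_sunny_given_play = p_sunny_and_play / p_play              # P(A|B) = P(A,B)/P(B)

print(p_sunny, p_sunny_and_play, p_sunny_given_play)        # 0.5 0.333 0.667
```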
Basic Concepts in Bayes Decision Theory
• Prior: The prior knowledge or belief about the probabilities of the various
hypotheses in H is called the prior in the context of Bayes' theorem.
• Example: prior knowledge about how often tumours turn out to be
malignant.
Basic Concepts in Bayes Decision Theory
• Posterior: The probability that a particular hypothesis holds for a dataset,
given the prior, is called the posterior probability, or simply the posterior.
• Example: the probability of the hypothesis that the patient has a malignant
tumour, given the prior correctness of the malignancy test.
BAYES' THEOREM
• It is based on conditional probability. It is given as:
P(A|B) = P(B|A) · P(A) / P(B)
SOME MORE QUESTIONS:
• Q1 – Calculate the probability of "fire" given "smoke", with the data:
P(Fire) = prior probability = 0.3, P(Smoke|Fire) = likelihood = 0.5,
P(Smoke) = evidence = 0.7.
• Q2 – (Patient disease problem) Consider the data of a patient where
Effect = the state of the patient having a red dot on the skin, and
Cause = the state of the patient having rubella disease. Given the
probabilities P(Cause) = 0.001, P(Effect) = 0.01, P(Effect|Cause) = 0.9,
use Bayes' rule to find the probability P(Cause|Effect).
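Both questions are direct substitutions into Bayes' theorem; the short script below works them out with the numbers given above.

```python
# Worked solutions to Q1 and Q2 using P(A|B) = P(B|A) * P(A) / P(B).

def bayes(likelihood, prior, evidence):
    return likelihood * prior / evidence

# Q1: P(Fire|Smoke) = P(Smoke|Fire) * P(Fire) / P(Smoke)
print(bayes(likelihood=0.5, prior=0.3, evidence=0.7))     # ≈ 0.214

# Q2: P(Cause|Effect) = P(Effect|Cause) * P(Cause) / P(Effect)
print(bayes(likelihood=0.9, prior=0.001, evidence=0.01))  # = 0.09
```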
Bayes' Theorem in Terms of Posterior Probability
P(h|D) = P(D|h) · P(h) / P(D)
P(h|D) = posterior probability: the conditional probability of the hypothesis
h when the data D is given.
P(D|h) = likelihood: the conditional probability of the data D when the
hypothesis h is given.
P(h) = prior probability of the hypothesis h, i.e. the simple probability of h.
P(D) = prior probability of the data D, i.e. the simple probability of D.
Maximum a Posteriori (MAP) Hypothesis
• The most probable hypothesis is called the maximum a posteriori (MAP)
hypothesis.
• Denoted by hMAP.
• hMAP = argmax P(h|D) = argmax P(D|h) · P(h) / P(D)
• Since P(D) is the same for every hypothesis, the denominator can be
ignored:
hMAP = argmax P(D|h) · P(h)
Maximum Likelihood (ML) Hypothesis
• If all hypotheses are assumed equiprobable a priori, the MAP hypothesis
reduces to the maximum likelihood (ML) hypothesis:
hML = argmax P(D|h)
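A toy comparison of the two rules, with made-up priors and likelihoods over two hypothetical hypotheses, shows how they can disagree:

```python
# MAP vs. ML over two hypothetical hypotheses (all numbers illustrative).
priors = {"h1": 0.8, "h2": 0.2}        # P(h)
likelihoods = {"h1": 0.4, "h2": 0.9}   # P(D|h)

h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])
h_ml = max(likelihoods, key=likelihoods.get)

print("hMAP =", h_map)  # h1: 0.4*0.8 = 0.32 beats h2: 0.9*0.2 = 0.18
print("hML  =", h_ml)   # h2: likelihood 0.9 beats 0.4
```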
Difference between Max f(x) and Arg max f(x) in Mathematics
Max f(x): the maximum value attained by the function f(x).
Arg max f(x): the argument (the value of x) at which f(x) attains its maximum.

Example: for f(θ) = sin θ, max f(θ) = 1, while arg max f(θ) = 90°;
that is, sin θ attains its maximum value of 1 at θ = 90°.
BRUTE FORCE BAYESIAN CONCEPT LEARNING
• Also called the Brute Force Algorithm.
• P(h|D) = P(D|h) · P(h) / P(D)
• hMAP = argmax P(h|D)
• Assume a uniform prior: P(h) = 1/|H| for all h in H, where h is a single
hypothesis and H = {h1, h2, h3, …, hn} is the set of all hypotheses.
• Assume noise-free data, so the likelihood is
P(D|h) = 1 if di = h(xi) for every training example, and 0 otherwise,
where di is the observed target value and xi is the corresponding instance.
• For a hypothesis consistent with D:
P(h|D) = (1 · 1/|H|) / P(D)
• But P(D) = |VS H,D| / |H|.
• Substituting this value into the equation above:
P(h|D) = 1 / |VS H,D|
• Here |VS H,D| is the size of the version space of the hypothesis set H with
respect to D: the subset of hypotheses in H that are consistent with D.
Inconsistent hypotheses get posterior 0.
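The following sketch runs brute-force Bayesian concept learning over a tiny hypothetical hypothesis space of threshold rules, assuming the uniform prior and noise-free data stated above; every consistent hypothesis ends up with posterior 1/|VS H,D|.

```python
# Brute-force Bayesian concept learning over a toy hypothesis space.
# H: hypotheses h_t(x) = True iff x >= t, for thresholds t = 1..5.
hypotheses = {t: (lambda x, t=t: x >= t) for t in range(1, 6)}

# Training data D: (x_i, d_i) pairs (illustrative).
D = [(2, False), (4, True)]

# Noise-free likelihood: P(D|h) = 1 if h is consistent with every example.
def likelihood(h):
    return 1.0 if all(h(x) == d for x, d in D) else 0.0

prior = 1.0 / len(hypotheses)                      # uniform P(h) = 1/|H|
unnorm = {t: likelihood(h) * prior for t, h in hypotheses.items()}
p_D = sum(unnorm.values())                         # P(D) = |VS| / |H|

posterior = {t: v / p_D for t, v in unnorm.items()}
print(posterior)   # consistent hypotheses share P(h|D) = 1/|VS|; others get 0
```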
BAYES OPTIMAL CLASSIFIER
• It is a probabilistic model which makes the most probable prediction for a
new example.
• Its equation (in the standard form) is:
vOB = argmax over vj in V of Σ over hi in H of P(vj|hi) · P(hi|D)
• P(vj|D) = probability of value vj when the data is given
• P(vj|hi) = probability of value vj when hypothesis hi is given
• P(hi|D) = probability of hypothesis hi when the data is given
Naïve Bayes Classifier
• It is a supervised learning algorithm.
• Based on Bayes' theorem.
• Used for solving classification problems in machine learning.
• It is a probabilistic classifier.
• It predicts on the basis of the probability of an event.
Naïve Bayes Classifier
• Naïve: the classifier naïvely assumes that the features are independent of
one another given the class.
• Bayes: it is based on Bayes' theorem.

• Question: We are given a weather dataset with two columns: one holds the
weather condition (outlook) and the other records whether the player went
out to play or not. Find the probability of the player going out to play on a
sunny day.
No.  Outlook   Play
0    Rainy     Yes
1    Sunny     Yes
2    Overcast  Yes
3    Overcast  Yes
4    Sunny     No
5    Rainy     Yes
6    Sunny     Yes
7    Overcast  Yes
8    Rainy     No
9    Sunny     No
10   Sunny     Yes
11   Rainy     No
12   Overcast  Yes
13   Overcast  Yes
Solution: Frequency Table

Weather    Yes  No
Overcast   5    0
Rainy      2    2
Sunny      3    2
Total      10   4
Make the Likelihood Table:

Weather    No           Yes           Likelihood
Overcast   0            5             5/14 ≈ 0.36
Rainy      2            2             4/14 ≈ 0.29
Sunny      2            3             5/14 ≈ 0.36
All        4/14 ≈ 0.29  10/14 ≈ 0.71
Apply Bayes' Theorem:
• P(A|B) = P(B|A) · P(A) / P(B)

• P(Yes|Sunny) = P(Sunny|Yes) × P(Yes) / P(Sunny)
• P(Sunny|Yes) = 3/10 = 0.3
• P(Sunny) = 5/14 ≈ 0.36
• P(Yes) = 10/14 ≈ 0.71
• P(Yes|Sunny) = (3/10 × 10/14) / (5/14) = 3/5 = 0.60
• P(No|Sunny) = P(Sunny|No) × P(No) / P(Sunny)
= (2/4 × 4/14) / (5/14) = 2/5 = 0.40
• Since P(Yes|Sunny) > P(No|Sunny), i.e. 0.60 > 0.40,
• we can say that on a sunny day the player will go out to play.
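The same numbers can be reproduced with a few lines of plain Python over the dataset above:

```python
# Re-running the sunny-day example from the slides.
from collections import Counter

data = [("Rainy","Yes"),("Sunny","Yes"),("Overcast","Yes"),("Overcast","Yes"),
        ("Sunny","No"),("Rainy","Yes"),("Sunny","Yes"),("Overcast","Yes"),
        ("Rainy","No"),("Sunny","No"),("Sunny","Yes"),("Rainy","No"),
        ("Overcast","Yes"),("Overcast","Yes")]

n = len(data)
play = Counter(label for _, label in data)            # {"Yes": 10, "No": 4}
sunny = Counter(label for w, label in data if w == "Sunny")

p_sunny = sum(sunny.values()) / n                     # P(Sunny) = 5/14
for label in ("Yes", "No"):
    p_label = play[label] / n                         # P(Yes), P(No)
    p_sunny_given = sunny[label] / play[label]        # P(Sunny | label)
    posterior = p_sunny_given * p_label / p_sunny     # Bayes' theorem
    print(label, round(posterior, 2))                 # Yes 0.6, No 0.4
```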
Advantages of Naïve Bayes Classifier
• It is a fast and easy algorithm for classification.
• It can be used for binary and multi-class classification.
• It is widely used for text classification problems.
Disadvantages of Naïve Bayes Classifier
• Because it assumes the features are independent, it cannot learn
relationships between features.
Applications of Naïve Bayes Classifier
• Real Time Prediction
• Text Classification
• Sentiment Analysis
• Multiclass Classification
• Spam Filtering
• Recommendation System
BAYESIAN BELIEF NETWORKS
• A Bayesian belief network is a probabilistic graphical model. It represents a
set of variables and their conditional dependencies using a directed acyclic
graph.
• Also called a Bayes network, belief network, decision network, or Bayesian
model.
• The Bayesian network consists of two parts:
• a directed acyclic graph, and
• a table of conditional probabilities.
• Bayesian belief networks are based on joint probability and marginal
probability.
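As a toy illustration of that last point (with hypothetical conditional-probability values), the marginal probability of a child node can be obtained from its parent's distribution by summing joint probabilities, i.e. the law of total probability:

```python
# Two-node belief network Cloudy -> Rain with made-up CPT values.
p_cloudy = 0.5                                  # P(Cloudy = true)
p_rain_given_cloudy = {True: 0.8, False: 0.2}   # P(Rain = true | Cloudy)

# Sum the joint probabilities P(Rain, Cloudy) over both parent states
# to get the marginal P(Rain):
p_rain = (p_rain_given_cloudy[True] * p_cloudy +
          p_rain_given_cloudy[False] * (1 - p_cloudy))
print("P(Rain) =", p_rain)                      # 0.5
```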
Support Vector Machine
• It is a popular supervised learning technique that can be used for both
classification and regression tasks.
• It is mainly used for classification problems in machine learning.
• The objective of an SVM algorithm is to find a hyperplane in an
N-dimensional space that distinctly classifies the data points.
Support Vectors
• Support vectors are the individual observations (data points) that lie
closest to the decision boundary.
• The SVM classifier is the frontier (hyperplane or line) that best segregates
the two classes.
Support Vector Machine Terminology
1.Hyperplane: Hyperplane is the decision boundary that is used to
separate the data points of different classes in a feature space. In
the case of linear classifications, it will be a linear equation i.e.
wx+b = 0.
2.Support Vectors: Support vectors are the data points closest to the
hyperplane; they play a critical role in deciding the hyperplane and
margin.
3.Margin: Margin is the distance between the support vectors and the
hyperplane. The main objective of the support vector machine algorithm
is to maximize the margin, as a wider margin indicates better
classification performance.
Support Vector Machine Terminology
1. Kernel: A kernel is a mathematical function used in SVM to map the
original input data points into a high-dimensional feature space, so that a
separating hyperplane can be found even if the data points are not
linearly separable in the original input space. Common kernel functions
include linear, polynomial, radial basis function (RBF), and sigmoid.
2. Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane
is a hyperplane that properly separates the data points of different categories
without any misclassifications.
3. Soft Margin: When the data is not perfectly separable or contains outliers,
SVM permits a soft-margin technique. The soft-margin SVM formulation
introduces a slack variable for each data point, which relaxes the strict
margin requirement and permits certain misclassifications or violations.
It finds a compromise between maximizing the margin and minimizing
violations.
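A minimal working sketch of these ideas, assuming scikit-learn is available (the slides do not prescribe a library): the C parameter controls the soft margin (small C tolerates more violations; large C approaches a hard margin), and the kernel argument selects the mapping.

```python
# Soft-margin SVM with an RBF kernel on the built-in iris dataset.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # soft margin controlled by C
clf.fit(X_train, y_train)

print("support vectors per class:", clf.n_support_)
print("test accuracy:", clf.score(X_test, y_test))
```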
Types of SVM
• Linear SVM
• Non-linear SVM
PROPERTIES OF SVM
• Flexibility in choosing a similarity function
• Sparseness of solution when dealing with large data sets - only support vectors are
used to specify the separating hyperplane
• Ability to handle large feature spaces - complexity does not depend on the
dimensionality of the feature space
• Overfitting can be controlled by soft margin approach
• Nice math property: a simple convex optimization problem which is guaranteed to
converge to a single global solution
• Feature Selection
Advantages of SVM
• Handling high-dimensional data: SVMs are effective in handling high-
dimensional data, which is common in many applications such as image
and text classification.
• Handling small datasets: SVMs can perform well with small datasets,
as they only require a small number of support vectors to define the
boundary.
• Modeling non-linear decision boundaries: SVMs can model non-linear
decision boundaries by using the kernel trick, which maps the data into a
higher-dimensional space where the data becomes linearly separable.
Advantages of SVM
• Robustness to noise: SVMs are robust to noise in the data, as the decision boundary is determined
by the support vectors, which are the closest data points to the boundary.
• Generalization: SVMs have good generalization performance, which means that they are able to
classify new, unseen data well.
• Versatility: SVMs can be used for both classification and regression tasks, and they can be applied
to a wide range of applications such as natural language processing, computer vision, and
bioinformatics.
• Sparse solution: SVMs have sparse solutions, which means that they only use a subset of the
training data to make predictions. This makes the algorithm more efficient and less prone to
overfitting.
• Regularization: SVMs can be regularized, which means that the algorithm can be modified to avoid
overfitting.
Disadvantages of SVM
• Computationally expensive: SVMs can be computationally expensive for large
datasets, as the algorithm requires solving a quadratic optimization problem.
• Choice of kernel: The choice of kernel can greatly affect the performance of an
SVM, and it can be difficult to determine the best kernel for a given dataset.
• Sensitivity to the choice of parameters: SVMs can be sensitive to the choice of
parameters, such as the regularization parameter, and it can be difficult to
determine the optimal parameter values for a given dataset.
• Memory-intensive: SVMs can be memory-intensive, as the algorithm requires
storing the kernel matrix, which can be large for large datasets.
Disadvantages of SVM
• Limited to two-class problems: SVMs are primarily used for two-class
problems, although multi-class problems can be solved by using one-versus-
one or one-versus-all strategies.
• Lack of probabilistic interpretation: SVMs do not provide a probabilistic
interpretation of the decision boundary, which can be a disadvantage in some
applications.
• Not suitable for large datasets with many features: SVMs can be very slow and
can consume a lot of memory when the dataset has many features.
• Not suitable for datasets with missing values: SVMs require complete datasets
with no missing values; they cannot handle missing values directly.
Applications of SVM
1.Face detection – SVMs are used to detect faces by classifying image
regions according to the trained classifier and model.
2.Text and hypertext categorization – Here, the classification technique is
used to find the important (required) information for organizing text.
3.Image classification – SVMs are also used to group images by comparing
pieces of information and acting accordingly.
Applications of SVM
1. Bioinformatics – SVMs are also used in medical science, for example in
laboratory work, DNA analysis, and research.
2. Handwriting recognition – SVMs are used to recognize handwritten
characters.
3. Protein fold and remote homology detection – SVMs are used to classify
proteins into functional and structural classes given their amino acid
sequences. This is one of the standard problems in bioinformatics.
4. Generalized predictive control (GPC) – SVMs are also used in generalized
predictive control, where the controller relies on a predictive model (such
as a multilayer feed-forward network) of the plant.
Applications of SVM
5. Facial expression classification – The SVM is a binary classification
technique. A facial expression classification model determines the precise
facial expression by modelling the differences between two facial images.
Validation techniques include the leave-one-out method and the K-fold
test method.
6. Speech recognition – The transcription of speech into text is called speech
recognition. Mel-frequency cepstral coefficient (MFCC) based features are
used to train support vector machines for recognizing speech. Speech
recognition is a challenging classification problem that is tackled with a
variety of techniques, including support vector machines and other
pattern recognition methods.
• For any query, write to:
[email protected]
