
Unit 4

(Optimization for Machine Learning) Basics of optimization, Convex
Objective functions, Minutiae of Gradient Descent, properties of optimization
in Machine Learning, Computing Derivatives with respect to Vectors, Linear
Regression: Optimization with Numerical Targets, Optimization Models for
Binary Targets, Optimization Models for the Multiclass setting, Co-ordinate
Descent.
Optimization in Machine Learning
• In machine learning, optimization is the procedure of identifying the set of
model parameters that minimizes a loss function.
• For a given set of inputs, the loss function measures the discrepancy between
the predicted and the actual outputs. Optimization seeks to minimize this loss
so that the model can successfully forecast the output for fresh inputs (a
minimal example follows this list).
• An optimization algorithm is a method for finding a function's minimum or
maximum.
• The optimization algorithm iteratively modifies the model parameters until
the minimum (or maximum) of the loss function is reached.
• Gradient descent, stochastic gradient descent, Adam, Adagrad, and RMSProp
are a few optimization methods that can be utilised in machine learning.
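
To make the role of the loss function concrete, here is a minimal sketch; the
mean squared error loss and the sample values are illustrative assumptions, not
fixed by the text:

```python
import numpy as np

# Mean squared error: one common way to measure the discrepancy
# between predicted and actual outputs (illustrative choice).
def mse_loss(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([1.0, 2.0, 3.0])   # actual outputs
y_pred = np.array([1.1, 1.9, 3.2])   # model predictions
print(mse_loss(y_pred, y_true))      # 0.02 -> small discrepancy
```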
Gradient Descent
• In machine learning, gradient descent is a popular optimization approach.
• It is a first-order optimization algorithm that works by repeatedly updating
the model's parameters in the direction of the negative gradient of the loss
function.
• The negative gradient points in the direction of steepest descent, so the
loss function decreases most quickly in that direction.
• Starting with an initial set of parameters, the gradient descent algorithm
computes the gradient of the loss function with respect to each parameter.
• The gradient is a vector containing the partial derivatives of the loss
function with respect to each parameter.
• The algorithm then updates the parameters by subtracting a small multiple of
the gradient from their current values (a sketch of this update follows).
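
A minimal sketch of this update rule, applied to a least-squares linear
regression loss; the synthetic data, learning rate, and iteration count are
illustrative assumptions:

```python
import numpy as np

# Loss: L(w) = (1/n) * ||X w - y||^2
# Gradient: grad L(w) = (2/n) * X^T (X w - y)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # synthetic inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                       # synthetic targets

w = np.zeros(3)                      # initial parameters
eta = 0.1                            # learning rate
for _ in range(200):
    grad = (2 / len(y)) * X.T @ (X @ w - y)  # gradient of the loss
    w -= eta * grad                          # step against the gradient
print(w)                             # close to [1.0, -2.0, 0.5]
```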
Q.1 Find the optimum solution of the function f(x) = (x - 2)^2 using the
gradient descent method, where the initial point is x0 = 1.5 and η denotes the
learning rate.

Setting the derivative to zero locates the optimum directly:

f'(x) = 2(x - 2) = 0, which gives x* = 2

Gradient descent reaches this point iteratively. At the initial point,

f'(x0) = 2(1.5 - 2) = -1

and each step moves against the gradient:

x_{k+1} = x_k - η f'(x_k), so x1 = x0 - η f'(x0) = 1.5 + η

Repeating this update calculates the values of x1, x2, ……, which approach the
minimizer x* = 2.
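
A short sketch of this iteration; the learning rate η = 0.1 is an illustrative
assumption, since the problem statement does not fix its value:

```python
# Gradient descent on f(x) = (x - 2)^2 with f'(x) = 2 * (x - 2).
x = 1.5        # initial point x0
eta = 0.1      # assumed learning rate
for k in range(1, 6):
    x = x - eta * 2 * (x - 2)        # x_{k+1} = x_k - eta * f'(x_k)
    print(f"x{k} = {x:.4f}")
# x1 = 1.6000, x2 = 1.6800, x3 = 1.7440, ... -> converges to x* = 2
```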
Convex Objective functions
Example:
Prove that the set B = {(x1, x2, x3) : 2x1 − x2 + x3 ≤ 4} ⊂ R3 is a convex
set.
Solution:
Recall that a set is convex if, for any two of its points, every convex
combination of those points also belongs to the set.
Assume that X = (x1, x2, x3) and Y = (y1, y2, y3) are two points of B.
From the given condition, we can write
2x1 − x2 + x3 ≤ 4
2y1 − y2 + y3 ≤ 4
Now, assume that W = (w1, w2, w3) is any point of the segment [X, Y], so that
for some 0 ≤ θ ≤ 1,
w1 = θx1 + (1 − θ)y1
w2 = θx2 + (1 − θ)y2
w3 = θx3 + (1 − θ)y3
Using the above equations, we can write
2w1 − w2 + w3 = θ(2x1 − x2 + x3) + (1 − θ)(2y1 − y2 + y3) ≤ 4θ + 4(1 − θ) = 4
Thus, W = (w1, w2, w3) is a point of B, and hence B is a convex set.
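
As a numeric sanity check of this argument, the following sketch samples pairs
of points of B and random values of θ and verifies that every convex
combination stays in B; the sampling ranges are arbitrary assumptions:

```python
import numpy as np

# Numeric check: convex combinations of points of
# B = {(x1, x2, x3) : 2*x1 - x2 + x3 <= 4} remain in B.
def in_B(p, tol=0.0):
    return 2 * p[0] - p[1] + p[2] <= 4 + tol

rng = np.random.default_rng(1)
for _ in range(1000):
    X, Y = rng.uniform(-10, 10, 3), rng.uniform(-10, 10, 3)
    if not (in_B(X) and in_B(Y)):
        continue                          # only test pairs that lie in B
    theta = rng.uniform()
    W = theta * X + (1 - theta) * Y       # a point of the segment [X, Y]
    assert in_B(W, tol=1e-9)              # tolerance for float rounding
print("all sampled convex combinations stayed in B")
```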
Optimization Models for Binary Targets
Optimization Models for the Multiclass setting

• Naive Bayes: a simple and effective algorithm that uses conditional
probability and prior probabilities to make predictions
• Decision tree techniques: can be used for multiclass classification
• Logistic regression: can be used for multiclass classification
• Neural networks: can be used for multiclass classification
• SVM: can be used for multiclass classification
Naive Bayes
Naive Bayes classifiers are a collection of classification algorithms based on
Bayes’ Theorem. It is not a single algorithm but a family of algorithms that
all share a common principle: every pair of features being classified is
assumed to be independent of each other, given the class. To start with, let us
consider a dataset.

The dataset is divided into two parts, namely, the feature matrix and the
response vector.
• The feature matrix contains all the vectors (rows) of the dataset, in which
each vector consists of the values of the independent features. In the dataset
below, the features are ‘Outlook’, ‘Temperature’, ‘Humidity’ and ‘Windy’.
• The response vector contains the value of the class variable (prediction or
output) for each row of the feature matrix. In the dataset below, the class
variable name is ‘Play golf’.
     Outlook    Temperature  Humidity  Windy  Play Golf
0    Rainy      Hot          High      False  No
1    Rainy      Hot          High      True   No
2    Overcast   Hot          High      False  Yes
3    Sunny      Mild         High      False  Yes
4    Sunny      Cool         Normal    False  Yes
5    Sunny      Cool         Normal    True   No
6    Overcast   Cool         Normal    True   Yes
7    Rainy      Mild         High      False  No
8    Rainy      Cool         Normal    False  Yes
9    Sunny      Mild         Normal    False  Yes
10   Rainy      Mild         Normal    True   Yes
11   Overcast   Mild         High      True   Yes
12   Overcast   Hot          Normal    False  Yes
13   Sunny      Mild         High      True   No
Bayes’ Theorem finds the probability of an event occurring given the
probability of another event that has already occurred. Bayes’ theorem is
stated mathematically as the following equation:

P(A|B) = P(B|A) · P(A) / P(B)

Basically, we are trying to find the probability of event A, given that event B
is true. Event B is also termed the evidence.
• P(A) is the prior probability of A, i.e. the probability of the event before
the evidence is seen. The evidence is an attribute value of an unknown instance
(here, it is event B).
• P(B) is the marginal probability: the probability of the evidence.
• P(A|B) is the posterior probability of A, i.e. the probability of the event
after the evidence is seen.
• P(B|A) is the likelihood, i.e. the probability of the evidence B given that
the hypothesis A is true.
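
To tie the theorem to the dataset above, here is a minimal hand-rolled Naive
Bayes sketch; the `predict` helper and the query values are hypothetical, and
no Laplace smoothing is applied:

```python
from collections import Counter

# Naive Bayes on the play-golf table above:
# P(class | features) is proportional to P(class) * prod_i P(feature_i | class)
data = [
    ("Rainy", "Hot", "High", "False", "No"),
    ("Rainy", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Sunny", "Mild", "High", "False", "Yes"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Sunny", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Rainy", "Mild", "High", "False", "No"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "High", "True", "No"),
]

def predict(query):
    classes = Counter(row[-1] for row in data)        # Yes: 9, No: 5
    scores = {}
    for c, n_c in classes.items():
        score = n_c / len(data)                       # prior P(c)
        rows_c = [r for r in data if r[-1] == c]
        for i, value in enumerate(query):
            matches = sum(1 for r in rows_c if r[i] == value)
            score *= matches / n_c                    # likelihood P(x_i | c)
        scores[c] = score
    return max(scores, key=scores.get), scores

label, scores = predict(("Sunny", "Hot", "Normal", "False"))
print(label, scores)   # predicts "Yes" for this query
```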
