Machine Learning - Exploring The Model - Resp

The document provides an introduction to machine learning models for predicting house prices based on area. It discusses representing the model, which involves defining input (x) and output (y) variables, creating a hypothesis (h) function to map x to y, and evaluating the accuracy of predictions using a cost function. It then introduces gradient descent as an algorithm for minimizing the cost function and improving the model by iteratively updating parameters based on the steepest rate of descent down the cost surface.

Uploaded by

IgorJales
Copyright
© © All Rights Reserved

Machine Learning - Exploring the Model

 Welcome to the course on Machine Learning - Exploring the Model
 The objective of this course is to familiarize you with the steps involved
in fitting a machine learning model to a data set
 You will learn the concepts involved in building a machine learning
model, from the hypothesis function that represents a model for a given
data set to evaluating that hypothesis on the general case

ML Model Representation

 Suppose you are provided with a data set that lists the area of each house in
square feet and its price
 How would you come up with a machine learning model that learns from this
data and predicts the price of an arbitrary house given its area?

You will learn that in the following cards.

Let's get started...

House Price Prediction

 We have a data set consisting of houses with their area in square feet and
their respective prices
 Assume that the price depends on the area of the house
 Let us learn how to represent this idea in machine learning parlance

ML Notations

 The input / independent variables are denoted by 'x'
 The output / dependent variable(s) are denoted by 'y'

In our problem, the area values in square feet are 'x' and the house prices are 'y'.
Here the output variable depends on the input variable, and the output is a
real-valued quantity. Predicting such an output from the inputs is called regression.

Model Representation

 Given a set of training data, the objective is for the algorithm to come
up with a way to map 'x' to 'y'
 This mapping is denoted by h: X → Y

h(x) is called the hypothesis; it is the function that does the mapping.
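
As a minimal sketch, the hypothesis for the house price problem could be written as a straight line, h(x) = theta0 + theta1 * x. The straight-line form and the parameter values below are assumptions chosen for illustration; they are not given in the cards above.

```python
def hypothesis(x, theta0, theta1):
    """Straight-line hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Hypothetical parameters: a base price of 50,000 plus 100 per square foot.
predicted_price = hypothesis(1500, theta0=50_000, theta1=100)
print(predicted_price)  # 200000
```

Fitting the model means finding values of theta0 and theta1 for which these predictions come close to the observed prices.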

Model Representation Explained

This video outlines the model representation process in Machine Learning.

Why Cost Function?

 You have learnt how to map the input variables to the output variable through
the hypothesis function in the previous example.
 After defining the hypothesis function, its accuracy has
to be determined to gauge its predictive power, i.e., how accurately the
square feet values predict the housing prices.
 The cost function is the representation of this objective.

Demystifying Cost Function

In the cost function:

 m - number of observations
 ŷ - predicted value
 y - actual value
 i - index of a single observation

The objective is to minimize the sum of squared errors between
the predicted and actual values.
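
The symbols above can be combined into a squared-error cost. The sketch below assumes the common convention J = (1 / (2m)) * sum over i of (ŷ(i) - y(i))²; the 1/(2m) factor is an assumption, since the original formula is not reproduced in the text. The numbers are made up.

```python
def cost(predictions, actuals):
    """Squared-error cost: (1 / (2m)) * sum of (y_hat_i - y_i)^2."""
    m = len(actuals)  # number of observations
    squared_errors = [(y_hat - y) ** 2 for y_hat, y in zip(predictions, actuals)]
    return sum(squared_errors) / (2 * m)


# Made-up predicted and actual prices (in thousands).
y_hat = [200.0, 310.0, 390.0]
y = [210.0, 300.0, 400.0]
print(cost(y_hat, y))  # 50.0
```

A cost of zero would mean every prediction matches its actual value exactly.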

Cost Function Intuition

 The points on the line are the predicted values, represented by ŷ
 The other points are the actual y values

Cost Function Explained

This video talks about cost functions.

Why Gradient Descent?

You have learnt how to

 Represent a machine learning model
 Represent the accuracy of the model using a cost function

The next step is to determine the parameters that best fit the data.
Learning gradient descent will help you with that.

Gradient Descent Explained

 Imagine the error surface as a 3D plot with the parameters theta0 and
theta1 on the x and y axes and the error value on the z axis
 The intuition behind gradient descent is to choose the parameters that
make the cost as low as possible
 Each step down the cost surface is taken in the direction of
steepest descent
 The learning rate (alpha) decides the magnitude of each step
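
Written as an update rule, the idea in the bullets above is usually expressed as follows. This is the standard textbook form, with J the cost function and alpha the learning rate, repeated until the parameters stop changing:

```latex
\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1),
\qquad j \in \{0, 1\}
```

The partial derivative gives the slope of the cost surface along each parameter, so subtracting it moves the parameters downhill.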

Gradient Descent Explained

This video explains the process of Gradient Descent.

Transcript:

We previously defined the cost function J. In this video, I want to tell you about an
algorithm called gradient descent for minimizing the cost function J. It turns out
gradient descent is a more general algorithm and is used not only in linear
regression; it is actually used all over the place in machine learning, and later in
the class we'll use gradient descent to minimize other functions as well, not just
the cost function J for linear regression. So in this video I'm going to talk about
gradient descent for minimizing some arbitrary function J, and then in later
videos we'll take this algorithm and apply it specifically to the cost function J
that we had defined for linear regression.

So here's the problem setup. We're going to assume that we have some function J of
theta 0, theta 1. Maybe it's a cost function from linear regression, maybe it's some
other function we want to minimize, and we want to come up with an algorithm for
minimizing that as a function J of theta 0, theta 1. Just as an aside, it turns out
that gradient descent actually applies to more general functions. So imagine you have
a function J of theta 0, theta 1, theta 2, up to say theta n, and you want to minimize
over theta 0 up to theta n this J of theta 0 up to theta n. It turns out gradient
descent is an algorithm for solving this more general problem, but for the sake of
brevity, for the sake of succinctness of notation, I'm just going to pretend I have
only two parameters throughout the rest of this video.

Here's the idea for gradient descent. What we're going to do is start off with some
initial guesses for theta 0 and theta 1. It doesn't really matter what they are, but a
common choice would be to set theta 0 to 0 and set theta 1 to 0; just initialize them
to 0. What we're going to do in gradient descent is keep changing theta 0 and theta 1
a little bit to try to reduce J of theta 0, theta 1, until hopefully we wind up at a
minimum, or maybe a local minimum.

So let's see in pictures what gradient descent does. Let's say you're trying to
minimize this function. Notice the axes: this is theta 0, theta 1 on the horizontal
axes and J is the vertical axis, so the height of the surface shows J, and we want to
minimize this function. We're going to start off with theta 0, theta 1 at some point.
So imagine picking some value for theta 0, theta 1, and that corresponds to starting
at some point on the surface of this function. Okay, so whatever value of theta 0,
theta 1 gives you some point here. I didn't initialize them to 0, 0, but you know,
sometimes you initialize them to other values as well.

Now I want you to imagine that this figure shows a hill. Imagine this is like the
landscape of some grassy park with two hills like so, and I want you to imagine that
you are physically standing at that point on the hill, on this little red hill in
your park. In gradient descent, what we're going to do is spin 360 degrees around,
just look all around us, and ask: if I were to take a little baby step in some
direction, and I want to go downhill as quickly as possible, what direction do I take
that little baby step in, if I want to physically walk down this hill as rapidly as
possible? It turns out that if you're standing at that point on the hill and you look
all around, you find that the best direction to take a little step downhill is roughly
that direction. Okay, and now you're at this new point on your hill. You're going to
again look all around and say, what direction should I step in order to take a little
baby step downhill? And if you do that and take another step, you take a step in that
direction, and then you keep going. From this new point you look around, decide what
direction will take you downhill most quickly, take another step, another step, and so
on, until you converge to this local minimum down here.

Gradient descent has an interesting property. This first time we ran gradient descent,
we were starting at this point over here. Now imagine we had initialized gradient
descent just a couple of steps to the right. Imagine we had initialized gradient
descent at that point on the upper right. If you were to repeat this process, so start
from that point, look all around, take a little step in the direction of steepest
descent, then look around, take another step, and so on, then gradient descent would
have taken you to this second local optimum over on the right. So if you had started
at this first point, you would have wound up at this local optimum, but if you had
started at just a slightly different location, you would have wound up at a very
different local optimum. This is a property of gradient descent that we'll say a
little bit more about later.

So that's the intuition in pictures. Let's look at the math. This is the definition of
the gradient descent algorithm. We're going to just repeatedly do this until
convergence: we're going to update my parameter theta j by taking theta j and
subtracting from it alpha times this term over here. Okay, so there are a lot of
details in this equation, so let me unpack some of it.

First, this notation here, colon equals. We're going to use := to denote assignment,
so it's the assignment operator. Concretely, if I write a := b, what this means in a
computer is: take the value in b and use it to overwrite whatever value is in a. So
this means set a to be equal to the value of b; it's assignment. And I can also do
a := a + 1. This means take a and increase its value by one. Whereas in contrast, if I
use the equals sign and I write a = b, then this is a truth assertion. Okay, so if I
write a = b, then I'm asserting that the value of a equals the value of b. So the
left-hand side, that's the computer operation, where you set the value of a to a new
value. The right-hand side, this is asserting, I'm just making a claim that the values
of a and b are the same. And so whereas I can write a := a + 1, which means increment
a by 1, hopefully I won't ever write a = a + 1, because that's just wrong: a and a + 1
can never be equal to the same value. Okay, so that's the first part of the definition.

This alpha here is a number that is called the learning rate, and what alpha does is
it basically controls how big a step we take downhill with gradient descent. So if
alpha is very large, then that corresponds to a very aggressive gradient descent
procedure where we're trying to take huge steps downhill, and if alpha is very small,
then we're taking little baby steps downhill. I'll come back and say more about this
later, about how to set alpha and so on. And finally, this term here, that's a
derivative term. I don't want to talk about it right now, but I will derive this
derivative term and tell you exactly what it is later. Okay, and some of you will be
more familiar with calculus than others, but even if you aren't familiar with
calculus, don't worry about it; I'll tell you what you need to know about this term
here.

Now there's one more subtlety about gradient descent, which is that in gradient
descent we're going to update theta 0 and theta 1. So this update takes place for
j = 0 and j = 1, so you update theta 0 and update theta 1. And the subtlety of how you
implement gradient descent is that, for this update equation, you want to
simultaneously update theta 0 and theta 1. What I mean by that is that in this
equation we're going to update theta 0 := theta 0 minus something, and update
theta 1 := theta 1 minus something. And the way to implement this is you should
compute the right-hand side, compute that thing for theta 0 and theta 1, and then
simultaneously, at the same time, update theta 0 and theta 1.

Okay, so let me say what I mean by that. This is a correct implementation of gradient
descent, meaning simultaneous update. I'm going to set temp0 equal to that, set temp1
equal to that, so basically compute the right-hand sides, and then, having computed
the right-hand sides and stored them in the variables temp0 and temp1, I'm going to
update theta 0 and theta 1 simultaneously. That's a correct implementation. In
contrast, here's an incorrect implementation that does not do a simultaneous update.
In this incorrect implementation, we compute temp0 and then we update theta 0, then we
compute temp1 and then we update theta 1. And the difference between the right-hand
side and the left-hand side implementations is that if you look down here, at this
step, if by this time you've already updated theta 0, then you would be using the new
value of theta 0 to compute this derivative term, and so this gives you a different
value of temp1 than the left-hand side, because you've now plugged the new value of
theta 0 into this equation. And so this on the right-hand side is not a correct
implementation of gradient descent.

Okay, so I don't want to say why you need to do the simultaneous update. It turns out
that the way gradient descent is usually implemented, which we'll say more about
later, it is actually more natural to implement the simultaneous update, and when
people talk about gradient descent, they always mean simultaneous update. If you
implement the non-simultaneous update, it turns out it will probably work anyway, but
this algorithm on the right is not what people refer to as gradient descent; it is
some other algorithm with different properties, and for various reasons it can behave
in slightly stranger ways. And so what you should do is really implement the
simultaneous update of gradient descent.

So that's the outline of the gradient descent algorithm. In the next video, we're
going to go into the details of the derivative term, which I wrote out but didn't
really define. If you've taken a calculus class before, and if you're familiar with
partial derivatives and derivatives, it turns out that's exactly what that derivative
term is. But in case you aren't familiar with calculus, don't worry about it. The next
video will give you all the intuitions and will tell you everything you need to know
to compute that derivative term, even if you haven't seen calculus or even if you
haven't seen partial derivatives before. And with that, with the next video, hopefully
we'll be able to give you all the intuitions you need to apply gradient descent.
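
The simultaneous update described in the transcript can be sketched for the straight-line hypothesis h(x) = theta0 + theta1 * x. The transcript leaves the derivative term for a later video, so the gradient expressions below are an assumption: they are the derivatives of a squared-error cost with a 1/(2m) factor. The data, the learning rate and the step count are made up for illustration.

```python
def gradient_step(theta0, theta1, xs, ys, alpha):
    """One simultaneous gradient descent update for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    # Compute both right-hand sides from the OLD parameter values first ...
    temp0 = theta0 - alpha * sum(errors) / m
    temp1 = theta1 - alpha * sum(e * x for e, x in zip(errors, xs)) / m
    # ... and only then hand back the new theta0 and theta1 together.
    return temp0, temp1


def gradient_descent(xs, ys, alpha=0.1, steps=2000):
    theta0, theta1 = 0.0, 0.0  # common choice: initialize both parameters to 0
    for _ in range(steps):
        theta0, theta1 = gradient_step(theta0, theta1, xs, ys, alpha)
    return theta0, theta1


# Toy data generated from y = 1 + 2x, so the fit should approach (1, 2).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 3), round(theta1, 3))
```

The incorrect variant from the transcript would overwrite theta0 before computing temp1, so the second derivative term would already see the new theta0.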

Convergence

 If the learning rate is too small, convergence takes a long time
 If the learning rate is too high, the updates overshoot the minimum and may fail to converge

Choosing the right learning rate is very important.
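
The effect of the learning rate can be seen on a toy problem. The sketch below is an illustration under assumed values, not part of the course material: it minimizes J(theta) = theta², whose derivative is 2 * theta, starting from theta = 1 with three hypothetical learning rates.

```python
def descend(alpha, theta=1.0, steps=20):
    """Run gradient descent on J(theta) = theta**2, whose derivative is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta


# Small, moderate and overly large learning rates.
for alpha in (0.01, 0.4, 1.1):
    print(alpha, abs(descend(alpha)))
```

After 20 steps the smallest rate has barely moved towards the minimum at 0, the moderate rate is essentially there, and the largest rate has moved further away from the minimum with every step.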

Multiple Features

 For illustration, a single variable has been used so far. In
practice, multiple features / independent variables are used for
predicting a variable.
 In the first example you saw how housing prices were predicted based
on their square feet value. Real problems are usually more complex and
need multiple features to map to the output.

Hypothesis Representation

For a model with two features, hθ(x) = θ0 + θ1·x1 + θ2·x2, where:

 θ0 could be the base price of a house
 θ1 could be the price per square foot
 θ2 could be the price per floor
 x1 could be the area of the house in square feet
 x2 could be the number of floors
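
A hypothesis with several features can be sketched as a weighted sum of the features plus a base term. The function below assumes the linear form theta0 + theta1 * x1 + theta2 * x2 + ..., and the parameter values are hypothetical.

```python
def hypothesis(theta, features):
    """h(x) = theta[0] + theta[1] * x1 + theta[2] * x2 + ..."""
    return theta[0] + sum(t * x for t, x in zip(theta[1:], features))


# Hypothetical values: base price, price per square foot, price per floor.
theta = [50_000, 100, 10_000]
house = [1500, 2]  # x1 = area in square feet, x2 = number of floors
print(hypothesis(theta, house))  # 220000
```

Adding another feature only means appending one more parameter to theta and one more value to the feature list.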

Why Feature Scaling?

When there are multiple features and their values differ widely in
magnitude, fitting the model becomes slow, because gradient descent needs many
more steps to converge.
Scaling comes to our help in these scenarios.

What is Feature Scaling?

 In scaling, each feature / variable is divided by its range (maximum
minus minimum)
 The result of scaling is a variable whose range (maximum minus minimum) is 1
 This reduces the computation needed to fit the model to a considerable extent

Mean Normalization

 In mean normalization, the mean of each variable is subtracted from
the variable, so that the new variable has a mean of zero
 In many cases, mean normalization and scaling are
performed together
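
Scaling and mean normalization performed together can be sketched as x' = (x - mean) / (max - min). The helper name and the sample areas below are made up for illustration.

```python
def mean_normalize(values):
    """Subtract the mean, then divide by the range (max - min)."""
    mean = sum(values) / len(values)
    value_range = max(values) - min(values)
    return [(v - mean) / value_range for v in values]


areas = [1000.0, 1500.0, 2000.0, 3000.0]  # square feet
scaled = mean_normalize(areas)
print(scaled)  # [-0.4375, -0.1875, 0.0625, 0.5625]
```

The scaled values have a mean of zero and span a range of exactly 1. A feature whose values are all identical has a range of zero, so this sketch would need a guard before dividing.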

Classification

So far we have seen an example where we predicted one
variable depending on other variables. Machine learning is also used
for classification.
You will learn that in this topic.

Classification Explained

 In classification, unlike regression, we need to discern one group of
data from another
 The idea is to get the likelihood of an observation falling into a specific class
 For example, when we try to classify an e-mail, it falls into one of two
buckets, viz. spam or not spam

Classification Visualized

Binary Classification

 In a binary classification problem, the dependent variable y can take
either of two values, 0 or 1. (This idea can also be extended to
the multiple-class case.)
 For instance, if we are trying to build a classifier for
identifying a tumour from an image, then x(i) may be some feature of the
image, and y may be 1 if that feature indicates a cancer cell and 0 otherwise.
Hence, y∈{0,1}.

Classification Explained

This video explains classification.

Sigmoid Function

The logistic (sigmoid) function is used as the mapping function for classification problems.
It maps any real-valued input to a value between 0 and 1.
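
A minimal sketch of the logistic function, g(z) = 1 / (1 + e^(-z)), using only the standard library. The function name is chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```

The output approaches 0 for very negative inputs and 1 for very positive inputs. This simple form can raise an OverflowError for extremely negative z, so a production version would need a numerically safer formulation.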

Interpreting Logistic Regression

 The mapping function hθ(x) gives us the probability that the output is 1.
 For example, hθ(x)=0.65 gives us a probability of 65% that our output
is 1.
 The probability that the output is 0 is just the complement of
the probability that it is 1, here 35%.

Logistic Regression

This video explains logistic regression.

Decision Boundary

 The decision boundary or threshold is the line that separates the area
where y = 0 from the area where y = 1
 The hypothesis function creates the decision boundary

Optimal Threshold

Choosing the right threshold value is important in classification.

 A lower threshold can lead to more false positives
 A higher threshold can lead to more false negatives, i.e. positive cases that are missed
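
The trade-off can be illustrated by applying different thresholds to the same predicted probabilities. The probabilities and labels below are made-up values, not results from a real model.

```python
def classify(probabilities, threshold=0.5):
    """Predict 1 when the estimated probability reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical model outputs h(x) and true labels for six e-mails (1 = spam).
probs = [0.10, 0.35, 0.45, 0.60, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 1]

for threshold in (0.3, 0.5, 0.9):
    preds = classify(probs, threshold)
    false_pos = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    false_neg = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    print(threshold, false_pos, false_neg)
```

On this toy data the threshold of 0.3 produces two false positives and no false negatives, while the threshold of 0.9 produces no false positives and two false negatives.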

Why Evaluate the Hypothesis?

After defining
 A hypothesis function that maps input to output
 A cost function that represents the prediction error
 Gradient descent to choose the right parameters
we have built the model.
It is necessary to test this model on a new set of data to evaluate the model
fitting process.

Tips on Evaluation

After fitting the data and viewing the results, you can try the following:

 Adding more training examples
 Trying smaller sets of features
 Trying new features
 Trying polynomial features

Evaluating Hypothesis

 Each machine learning algorithm has its own way of being evaluated.
 For regression, the error is calculated by finding the sum of squared
differences between actual and predicted values
 For classification, the error is determined by getting the proportion of
values misclassified by the model
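
The two error measures can be sketched as small helper functions. The function names and sample values are chosen for illustration.

```python
def sum_squared_error(actual, predicted):
    """Regression error: sum of squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def misclassification_rate(actual, predicted):
    """Classification error: proportion of labels predicted incorrectly."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


print(sum_squared_error([3.0, 5.0], [2.0, 7.0]))           # 5.0
print(misclassification_rate([1, 0, 1, 1], [1, 1, 1, 0]))  # 0.5
```

Both measures should be computed on data the model was not trained on, otherwise they understate the error on new examples.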

Hypothesis Evaluation

This video talks about how to evaluate a hypothesis.


Model Selection

Model selection is a part of the hypothesis evaluation process in which the
model is evaluated on a test set to check how well it generalizes to new
data sets.

Train/Validation/Test Sets

One way to break down our dataset into the three sets is:

 Training set: 60%
 Cross validation set: 20%
 Test set: 20%
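
A 60/20/20 split can be sketched with the standard library. Shuffling before splitting and the fixed seed are assumptions made here so that the split is random but reproducible; the function name is hypothetical.

```python
import random


def split_dataset(rows, seed=0):
    """Shuffle the rows and split them 60% / 20% / 20%."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    train_set = shuffled[:n_train]
    val_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]
    return train_set, val_set, test_set


train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))  # 60 20 20
```

Every row ends up in exactly one of the three sets, so the test set stays unseen until the final evaluation.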

Best Practice

 Use the training set to find the parameters that minimize the cost
function
 Use the validation set to choose the polynomial degree with the least error
 Use the test set for estimating the generalization error

Fitting Visualized

You can see three different mapping functions for the same data.

 Example 1 - Under-fitting with high bias
 Example 2 - Proper fit
 Example 3 - Over-fitting with high variance

Tips on Reducing Overfitting

 Reduce the number of features:
o Manually select which features to keep
o Use a model selection algorithm
 Regularization:
o Reduce the magnitude of the parameters
o Regularization works well when there are a lot
of moderately useful features
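
One common way to reduce the magnitude of the parameters is to add a penalty on their squares to the cost. The sketch below assumes an L2 penalty weighted by a hypothetical parameter lam, a squared-error cost with a 1/(2m) factor, and no penalty on theta[0]; none of these specifics come from the text above.

```python
def regularized_cost(predictions, actuals, theta, lam):
    """Squared-error cost plus an L2 penalty on theta[1:] (theta[0] is not penalized)."""
    m = len(actuals)
    squared_error = sum((p - y) ** 2 for p, y in zip(predictions, actuals))
    penalty = lam * sum(t ** 2 for t in theta[1:])
    return (squared_error + penalty) / (2 * m)


# Perfect predictions, so the whole cost comes from the penalty term.
print(regularized_cost([1.0, 2.0], [1.0, 2.0], [5.0, 3.0, 4.0], lam=1.0))  # 6.25
```

A larger lam pushes the fitted parameters towards zero, which trades a little training accuracy for a smoother model.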

Bias Vs Variance

 How far the predictions are from the actual values is measured
by bias
 How much the predictions for a given point change
between different realizations of the model is measured by variance

Both these values are essential for analysis while selecting an optimum
machine learning model.

Bias vs Variance continued

 If there are bad predictions, we need to distinguish whether they are due
to bias or variance
 High bias leads to under-fitting of data and high variance leads to
over-fitting
 The need is to find the right balance between bias and variance

Learning Curves Intro

 Training an algorithm on a very small number of data points will give
almost zero error, because we can find a quadratic
function that passes exactly through those points
 As the training set gets larger and more complex, the error for a
quadratic function increases
 The error value will level off only after a certain number of
training examples

High Bias

 Low training set size: causes training set error to be low and cross
validation set error to be high
 Large training set size: causes both training set error and cross
validation set error to be high, with validation set error approximately
equal to training set error
 So when a learning algorithm has high bias, getting more training data
will not help much in improving it

High Variance

 Low training set size: training set error will be low and cross validation
set error will be high.
 Large training set size: training set error increases with training set
size and cross validation set error continues to decrease without
leveling off. Also, training set error stays less than cross validation set error,
and the difference between them remains significant.

If a learning algorithm has high variance, getting more training data will help
improve it.

More tips

 Getting more training data: fixes high variance
 Trying a smaller number of input features: fixes high variance
 Adding new input features: fixes high bias
 Adding new polynomial features: fixes high bias

Model Complexity Effects

 Lower-order polynomials have very high bias and very low variance.
This is under-fit.
 Higher-order polynomials have low bias on the training data, but very
high variance. This is over-fit.
 The objective is to build a model that fits the data well and also
generalizes well.

Final Summary

In this course you have learnt

 How to represent a machine learning model
 How to represent the cost function
 How to determine the optimal parameters with gradient descent
 Classification problems
 How to evaluate machine learning models
 Bias vs variance
 Some tips to improve machine learning model fitting

The objective function for linear regression is also known as the cost function. ---- true

The cost function in linear regression is also called the squared error function. ---- true

Output variables are known as feature variables. ---- false

Input variables are also known as feature variables. ---- true

The result of scaling is a variable in the range of [1, 10]. ---- false

For different parameters of the hypothesis function we get the same hypothesis function. ---- false

What is the function that takes the input and maps it to the output variable called? ---- hypothesis function

What is the process of dividing each feature by its range called? ---- feature scaling

What is the process of subtracting the mean of each variable from that variable called? ---- mean normalization

What is the learning technique in which the right answer is given for each example in the data called? ---- supervised learning

Problems that predict real-valued outputs are called? ---- regression problems

____________ controls the magnitude of a step taken during gradient descent. ---- learning rate

How are the parameters updated during the gradient descent process? ---- simultaneously

For an underfit data set the training and the cross validation error will be high. ---- true

For an overfit data set the cross validation error will be much bigger than the training error. ---- true

So when an ML model has high bias, getting more training data will help in improving the model. ---- false

Underfit data has a high variance. ---- false

High values of threshold are good for the classification problem. ---- false

A lower decision boundary leads to false positives during classification. ---- true

____________ function is used as a mapping function for classification problems. ---- sigmoid

____________ is the line that separates y = 0 and y = 1 in a logistic function. ---- decision boundary

A suggested approach for evaluating the hypothesis is to split the data into training and test sets. ---- true

Where does the sigmoid function asymptote? ---- 0 and 1

Problems where discrete-valued outputs are predicted are called? ---- classification problems

Classification problems with just two classes are called binary classification problems. ---- true

Reducing the number of features can reduce overfitting. ---- true

Overfit data has a high bias. ---- false

Linear regression is an optimal function that can be used for classification problems. ---- false

Overfitting and underfitting are applicable only to linear regression problems. ---- false

What is the range of the output values for a sigmoid function? ---- [0,1]

I have a scenario where my hypothesis fits my training set well but fails to generalize for the test set. What is this scenario called? ---- overfitting

For _____________ the error is calculated by finding the sum of squared differences between actual and predicted values. ---- regression

For ______________ the error is determined by getting the proportion of values misclassified by the model. ---- classification

What measures how far the predictions are from the actual values? ---- bias
