Machine Learning - Exploring The Model - Resp
ML Model Representation
Suppose you are provided with a dataset that has the area of a house in square feet and the respective price.
ML Notations
Model Representation
The objective is: given a set of training data, the algorithm needs to come up with a way to map 'x' to 'y'.
This is denoted by h: X → Y
You learnt how to map the input and output variables through the hypothesis function in the previous example.
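For instance, a minimal sketch of such a hypothesis function in Python (the parameter values below are made up purely for illustration):

# Hypothesis for univariate linear regression: h(x) = theta0 + theta1 * x
# The default parameter values here are hypothetical, for illustration only.
def hypothesis(x, theta0=50000.0, theta1=200.0):
    # Predict a house price from its area in square feet.
    return theta0 + theta1 * x

print(hypothesis(1500))  # predicted price for a 1500 sq. ft. house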
After defining the hypothesis function, its accuracy has to be determined to gauge its predictive power, i.e., how accurately the square-feet values predict the housing prices.
The cost function is the representation of this objective.
The points on the line are the predicted values, represented by ŷ.
The other points are the actual y values.
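A small sketch of the squared-error cost in Python (assuming NumPy; the data points are made up for illustration):

import numpy as np

# Squared-error cost: J(theta0, theta1) = 1/(2m) * sum((y_hat - y)^2)
def cost(theta0, theta1, x, y):
    y_hat = theta0 + theta1 * x           # predicted values (points on the line)
    return np.mean((y_hat - y) ** 2) / 2  # half the mean squared distance to actual y

x = np.array([1000.0, 1500.0, 2000.0])        # areas in sq. ft. (made-up data)
y = np.array([245000.0, 340000.0, 455000.0])  # actual prices (made-up data)
print(cost(50000.0, 200.0, x, y))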
Cost Function Explained
Transcript:
We previously defined the cost function J. In this video, I want to tell you about an algorithm called gradient descent for minimizing the cost function J. It turns out gradient descent is a more general algorithm: it is used not only in linear regression, it's actually used all over the place in machine learning, and later in the class we'll use gradient descent to minimize other functions as well, not just the cost function J for linear regression. So in this video I'm going to talk about gradient descent for minimizing some arbitrary function J, and then in later videos we'll take this algorithm and apply it specifically to the cost function J that we defined for linear regression.

So here's the problem setup. We're going to assume that we have some function J(θ0, θ1). Maybe it's a cost function from linear regression, maybe it's some other function we want to minimize, and we want an algorithm for minimizing it as a function of θ0 and θ1. Just as an aside, it turns out that gradient descent actually applies to more general functions: imagine you have a function J(θ0, θ1, θ2, ..., θn) and you want to minimize it over θ0 up to θn. Gradient descent is an algorithm for solving this more general problem too, but for the sake of brevity, for succinctness of notation, I'm just going to pretend I have only two parameters throughout the rest of this video.

Here's the idea for gradient descent. We're going to start off with some initial guesses for θ0 and θ1. It doesn't really matter what they are, but a common choice is to set θ0 to 0 and θ1 to 0, just initialize them to 0. Then, in gradient descent, we'll keep changing θ0 and θ1 a little bit to try to reduce J(θ0, θ1), until hopefully we wind up at a minimum, or maybe a local minimum.

Let's see in pictures what gradient descent does. Say you are trying to minimize this function. Notice the axes: θ0 and θ1 are on the horizontal axes and J is the vertical axis, so the height of the surface shows J, and we want to minimize this function. We start off with θ0 and θ1 at some point: imagine picking some value for θ0 and θ1, which corresponds to starting at some point on the surface of this function. Whatever value of θ0 and θ1 gives you some point here; I didn't initialize them to 0, 0, and sometimes you initialize them to other values as well.

Now I want you to imagine that this figure shows the landscape of some grassy park, with two hills like so, and that you are physically standing at that point on the hill, on this little red knob in your park. In gradient descent, what we're going to do is spin 360 degrees around, look all around us, and ask: if I were to take a little baby step in some direction, and I want to go downhill as quickly as possible, what direction do I take that little baby step in? If I sort of want to physically walk down this hill as rapidly as possible, it turns out that if you're standing at that point on the hill and look all around, you find that the best direction to take a little step downhill is roughly that direction. Now you're at a new point on your hill, and you again look all around and ask what direction I should step in to take a little baby step downhill; you take a step in that direction, and then you keep going: from this new point you look around, decide what direction will take you downhill most quickly, take another step, another step, and so on, until you converge to this local minimum down here.

Gradient descent has an interesting property. The first time we ran gradient descent, we started at this point over here. Now imagine we had initialized gradient descent just a couple of steps to the right, at that point on the upper right. If you were to repeat this process, starting at that point, looking all around, taking a little step in the direction of steepest descent, then looking around and taking another step, and so on, then gradient descent would have taken you to this second local optimum over on the right. So if you had started at the first point, you would have wound up at the first local optimum, but starting at a slightly different location, you would have wound up at a very different local optimum. This is a property of gradient descent that we'll say a little bit more about later.

So that's the intuition in pictures; let's look at the math. This is the definition of the gradient descent algorithm: we're going to repeatedly do this until convergence. We update the parameter θj by taking θj and subtracting from it α times this term over here:

θj := θj − α · ∂/∂θj J(θ0, θ1)

There are a lot of details in this equation, so let me unpack some of it. First, this ':=' notation: I'm going to use ':=' to denote assignment, so it's the assignment operator. Concretely, if I write a := b, what this means in a computer is: take the value in b and use it to overwrite whatever value is in a. So this means set a to be equal to the value of b; it's assignment. I can also write a := a + 1; this means take a and increase its value by one. In contrast, if I use the equals sign and write a = b, then this is a truth assertion: I'm asserting that the value of a is equal to the value of b. So ':=' is a computer operation where you set the value of a to a new value, whereas a = b is just a claim that the values of a and b are the same. And whereas I can write a := a + 1 to increment a by 1, hopefully I won't ever write a = a + 1, because that is simply wrong: a and a + 1 can never be equal to the same value.

So that's the first part of the definition. This α here is a number that is called the learning rate, and what α does is basically control how big a step we take downhill with gradient descent. If α is very large, that corresponds to a very aggressive gradient descent procedure, where we try to take huge steps downhill; if α is very small, then we take little baby steps downhill. I'll come back and say more later about how to set α and so on. And finally, this term here is a derivative term. I don't want to talk about it right now, but I will derive this derivative term and tell you exactly what it is later. Some of you will be more familiar with calculus than others, but even if you aren't familiar with calculus, don't worry about it; I'll tell you what you need to know about this term.
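As a rough sketch of this update rule in Python (the bowl-shaped function J and the numerical approximation of the derivative term are assumptions made purely for illustration, since the exact derivative for linear regression is derived later):

# One gradient descent step on an arbitrary J(theta0, theta1), approximating
# the partial derivatives numerically with a small central difference.
def gradient_step(J, theta0, theta1, alpha, eps=1e-6):
    d0 = (J(theta0 + eps, theta1) - J(theta0 - eps, theta1)) / (2 * eps)
    d1 = (J(theta0, theta1 + eps) - J(theta0, theta1 - eps)) / (2 * eps)
    # theta_j := theta_j - alpha * dJ/dtheta_j, computed for both j at once
    return theta0 - alpha * d0, theta1 - alpha * d1

# Example: minimize a simple made-up bowl-shaped J with minimum at (3, -1).
J = lambda t0, t1: (t0 - 3) ** 2 + (t1 + 1) ** 2
theta0, theta1 = 0.0, 0.0            # initialize both parameters to 0
for _ in range(200):
    theta0, theta1 = gradient_step(J, theta0, theta1, alpha=0.1)
print(theta0, theta1)                # converges close to (3, -1)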
Now, there's one more subtlety about gradient descent. In gradient descent we're going to update θ0 and θ1, so this update takes place for j = 0 and j = 1: you update θ0 and you update θ1. The subtlety of how you implement gradient descent is that, for this update equation, you want to simultaneously update θ0 and θ1. What I mean by that is that in this equation we're going to update θ0 := θ0 minus something, and update θ1 := θ1 minus something. The way to implement this is to compute the right-hand sides, compute that thing for both θ0 and θ1, and then simultaneously, at the same time, update θ0 and θ1. Let me say what I mean by that. This is a correct implementation of gradient descent, with simultaneous updates: I set temp0 equal to that and temp1 equal to that, so I basically compute the right-hand sides, and then, having computed the right-hand sides and stored them in the two variables temp0 and temp1, I update θ0 and θ1 simultaneously. That's a correct implementation.

In contrast, here's an incorrect implementation that does not do a simultaneous update. In this incorrect implementation, we compute temp0 and then we update θ0, then we compute temp1 and then we update θ1. The difference between the two implementations is that, if you look at the second one, by the time you compute temp1 you've already updated θ0, so you would be using the new value of θ0 to compute the derivative term, and this gives you a different value of temp1. Because you've plugged the new value of θ0 into that equation, the second version is not a correct implementation of gradient descent.
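A minimal sketch of the difference in Python (the function J and its hand-computed derivatives below are made up so that the two parameters interact and the update order matters; temp0 and temp1 mirror the names used above):

# Partial derivatives of an illustrative J(t0, t1) = (t0 - 3)**2 + (t0 + t1)**2.
def dJ_dt0(t0, t1): return 2 * (t0 - 3) + 2 * (t0 + t1)
def dJ_dt1(t0, t1): return 2 * (t0 + t1)

alpha = 0.1

# Correct: simultaneous update -- both right-hand sides use the OLD values.
theta0, theta1 = 0.0, 0.0
temp0 = theta0 - alpha * dJ_dt0(theta0, theta1)
temp1 = theta1 - alpha * dJ_dt1(theta0, theta1)
theta0, theta1 = temp0, temp1
print(theta0, theta1)   # (0.6, 0.0)

# Incorrect: sequential update -- theta1's derivative sees the NEW theta0.
theta0, theta1 = 0.0, 0.0
theta0 = theta0 - alpha * dJ_dt0(theta0, theta1)   # theta0 is now 0.6
theta1 = theta1 - alpha * dJ_dt1(theta0, theta1)   # uses the updated theta0
print(theta0, theta1)   # (0.6, -0.12) -- a different result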
I don't want to say much more about why you need to do the simultaneous updates. It turns out that, the way gradient descent is usually implemented (we'll say more about this later), it's actually more natural to implement the simultaneous update, and when people talk about gradient descent they always mean simultaneous update. If you implement the non-simultaneous update, it turns out it will probably work anyway, but that algorithm is not what people refer to as gradient descent; it's some other algorithm with different properties, and for various reasons it can behave in slightly strange ways. So what you should do is really implement the simultaneous update of gradient descent.

So that's the outline of the gradient descent algorithm. In the next video we're going to go into the details of the derivative term, which I wrote down but didn't really define. If you have taken a calculus class before and are familiar with partial derivatives and derivatives, it turns out that's exactly what that derivative term is; but in case you aren't familiar with calculus, don't worry about it. The next video will give you all the intuitions and will tell you everything you need to know to compute that derivative term, even if you haven't seen calculus, or even if you haven't seen partial derivatives before. And with that, with the next video, hopefully we'll really give you all the intuitions you need to apply gradient descent.
Convergence
Multiple Features
Mean Normalization
Classification
Classification Explained
Classification Visualized
Binary Classification
Classification Explained
Sigmoid Function
The logistic (sigmoid) function is used as the mapping function for classification problems.
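A minimal sketch of the sigmoid function in Python (assuming NumPy):

import numpy as np

# The sigmoid squashes any real number into the interval (0, 1),
# so its output can be read as a class probability.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))               # 0.5 -- the usual decision threshold
print(sigmoid(-6), sigmoid(6))  # values close to 0 and close to 1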
Logistic Regression
Decision Boundary
Optimal Threshold
After defining:
the hypothesis function that maps input to output,
the cost function that represents the prediction error, and
gradient descent, which chooses the right parameters,
we have built the model.
It is necessary to test this model on a new set of data to evaluate the model
fitting process.
Tips on Evaluation
After fitting the data and viewing the results, you can try out options such as:
Adding more training examples
Evaluating Hypothesis
Hypothesis Evaluation
Train/Validation/Test Sets
One way to break down our dataset into the three sets is:
Training set: 60%
Cross validation set: 20%
Test set: 20%
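One way to sketch this split in Python (assuming scikit-learn and made-up data; train_test_split is applied twice to get 60/20/20):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # made-up features
y = np.arange(100)                  # made-up targets

# First carve off 40% of the data, then split that portion in half,
# giving 60% training / 20% cross validation / 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
print(len(X_train), len(X_cv), len(X_test))  # 60 20 20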
Best Practice
Fitting Visualized
Bias Vs Variance
High Bias
High Variance
Low training set size: the training set error will be low and the cross-validation set error will be high.
Large training set size: the training set error increases with training set size and the cross-validation set error continues to decrease without leveling off. Also, the training set error stays less than the cross-validation set error, but the difference between them remains significant.
If a learning algorithm suffers from high variance, getting more training data is likely to help in improving it, as sketched below.
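As a rough sketch of how such a learning curve could be computed (assuming scikit-learn; the model and the noisy data are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Made-up noisy data, just to exercise the API.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(scale=2.0, size=200)

# Training and cross-validation error at increasing training set sizes.
sizes, train_scores, cv_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    scoring="neg_mean_squared_error")

for n, tr, cv in zip(sizes, -train_scores.mean(axis=1), -cv_scores.mean(axis=1)):
    print(n, tr, cv)  # a persistent train/CV gap suggests high variance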
More tips
Final Summary
The objective function for linear regression is also known as the cost function. ----- true
The cost function in linear regression is also called the squared error function. ----- true
For different parameters of the hypothesis function, we get the same hypothesis function. ----- false
What is the name of the function that takes the input and maps it to the output variable? ----- hypothesis function
What is the process of dividing each feature by its range called? ----- feature scaling
What is the process of subtracting the mean of each variable from its values called? ----- mean normalization
What is the learning technique in which the right answer is given for each example in the data called? ----- supervised learning
What are problems that predict real-valued outputs called? ----- regression problems
____________ controls the magnitude of a step taken during gradient descent. ----- learning rate
How are the parameters updated during the gradient descent process? ----- simultaneously
For an underfit data set, both the training and the cross-validation error will be high. ----- true
For an overfit data set, the cross-validation error will be much bigger than the training error. ----- true
When an ML model has high bias, getting more training data will help in improving the model. ----- false
High values of the threshold are good for the classification problem. ----- false
A suggested approach for evaluating the hypothesis is to split the data into a training and a test set. ----- true
What are problems where discrete-valued outputs are predicted called? ----- classification problems
Classification problems with just two classes are called binary classification problems. ----- true
Linear regression is an optimal function that can be used for classification problems. ----- false
What is the range of the output values for a sigmoid function? ----- [0, 1]
I have a scenario where my hypothesis fits my training set well but fails to generalize for the test set. What is this scenario called? ----- overfitting
For _____________ the error is calculated by finding the sum of squared distances between actual and predicted values. ----- regression
For ______________ the error is determined by getting the proportion of values misclassified by the model. ----- classification
What measures how far the predictions are from the actual values? ----- bias