
Linear Models

Characteristics of Linear Models


- Linear models are parametric, i.e. they have a fixed form with a small number of numeric parameters that are to be learnt from data.
- Linear models are stable: small variations in the training data have only a limited impact on the learned model.
- Linear models are less likely to overfit the training data because they have relatively few parameters.
- Linear models have high bias and low variance (they tend to underfit).
- Linear models are preferable when there is limited training data and overfitting is to be avoided.
Linear Regression
Linear regression is a method for finding the straight line or hyperplane that best fits a set of points.

It is a very simple supervised learning approach.

It is used to predict a quantitative response such as price, temperature, etc.

It is a widely used statistical learning method.

When there is only one feature, it is called Univariate (or Simple) Linear Regression.

E.g.: predicting the price of a house based on its area.

When there are multiple features, it is called Multiple Linear Regression.

E.g.: predicting the price of a house based on its area, number of floors, number of rooms, etc.
Advertising Data (Multivariate Regression)
Simple Linear Regression
It is a simple approach for predicting a quantitative response Y on the basis of a single
predictor variable X.
It assumes that there is approximately a linear relationship between X and Y.
Mathematically, it can be written as:
Y ≈ β0 + β1X

β0 and β1 are two unknown constants that represent the intercept and slope terms in the linear model.
Together, β0 and β1 are known as the model coefficients or parameters (to be learnt).
This is also referred to as regressing Y on X.
Simple Linear Regression
For this example, it can be written as

sales ≈ β0 + β1 × TV

Once the values of β0 and β1 have been estimated using the training data, we can
predict future sales on the basis of the expenditure on TV advertising.
Estimating the Coefficients
Let us say we have a dataset

(x1, y1), (x2, y2), …, (xn, yn)

Now, we want to find an intercept β0 and slope β1 such that the resulting straight
line is as close as possible to the n data points.

The most common approach involves minimising the least squares criterion.


Ways to Estimate the Coefficients
- Ordinary Least Square Method
- Gradient Descent Method
- Normal Equation Method
Ordinary Least Square Method
Least Squares Method
Let ŷi = β0 + β1xi be the prediction for Y based on the ith value of X.

Then, ei = yi - ŷi represents the ith residual - this is the difference between the ith
observed response value and the ith response value that is predicted by the linear
model.

Residual sum of squares is defined as:

RSS = e1² + e2² + … + en²


Understanding Residual Sum of Squares

X    Y    Y_predicted_1   Y_predicted_2   (Y−Y1)²   (Y−Y2)²
95   85   86              88                1          9
85   95   88              81               49        196
80   70   72              66                4         16
70   65   64              72                1         49
60   70   69              64                1         36
                                    SUM    56        306

RSS of Model 1 is lower than that of Model 2. So Model 1 is preferable over Model 2.
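As a quick check, here is a minimal sketch (using NumPy; not part of the original notes) that reproduces the two RSS values in the table above:

```python
import numpy as np

y       = np.array([85, 95, 70, 65, 70])   # observed values (Y column)
y_pred1 = np.array([86, 88, 72, 64, 69])   # Model 1 predictions
y_pred2 = np.array([88, 81, 66, 72, 64])   # Model 2 predictions

# Residual sum of squares for each model
rss1 = np.sum((y - y_pred1) ** 2)   # 56
rss2 = np.sum((y - y_pred2) ** 2)   # 306
print(rss1, rss2)
```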
But the question is: how do we find Model 1 (i.e., the best model)?
Derivation
Now solving for β1
Alternative Formulas
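For reference, the standard least-squares formulas that this derivation leads to are:

β1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
β0 = ȳ − β1·x̄

and, equivalently (the form used in the working below):

β1 = (Σ xi·yi − n·x̄·ȳ) / (Σ xi² − n·x̄²)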
Sample Question
Working
β1 = ((30500) - (5*78*77)) / (31150 - 5*6084)

= (30500 - 30030) / (31150 - 30420)

= (470)/ (730) = 0.6438

β0 = 77 - (0.6438)(78)

= 77 - 50.2191

= 26.7808

Marks_in_lang_course = 26.7808 + (0.6438)(marks_in_prof_course)


Computing the sum of squared error (SSE)
Making Prediction
If a student scored 80 on the proficiency test, what marks would we expect her to
obtain in the language course?

Set x=80, β1 = 0.6438 and β0 = 26.7808

Predicted marks = 26.7808 + (0.6438)(80)

= 26.7808 + 51.504

= 78.2848
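A minimal sketch (plain Python; illustrative, not part of the slides) that reproduces the working above from the summary statistics shown in it:

```python
# Summary statistics taken from the working above
n      = 5        # number of students
sum_xy = 30500    # sum of x*y
x_bar  = 78       # mean proficiency-test mark
y_bar  = 77       # mean language-course mark
sum_x2 = 31150    # sum of x^2

# Least-squares coefficients (alternative formulas)
beta1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)   # ≈ 0.6438
beta0 = y_bar - beta1 * x_bar                                      # ≈ 26.78

# Predicted language-course marks for a proficiency score of 80
print(beta0 + beta1 * 80)                                          # ≈ 78.29
```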
Points to Ponder
The intercept β0 is such that the regression line goes through the point of means (x̄, ȳ).

The sum of the residuals of the least squares solution is zero.

Least squares is also susceptible to outliers, because squaring the residuals gives points far from the line a disproportionately large influence.

Outliers are points that lie too far away from the regression line (or from most of the data points), often because of measurement errors.
Multivariate Linear Regression
Given marks in English and marks in Mathematics, predict the GATE score.

Y (gate_score) [RESPONSE VARIABLE]

x1 (eng_score) and x2 (math_score) [INPUT VARIABLES]

Y = β0 + β1x1 + β2x2

Normal Equation Method
Matrix Notation
In order to deal with an arbitrary number of features it will be useful to employ
matrix notation.

Univariate linear regression can be written as:


Matrix Notation
For m examples with n features, this can be written more generally as ŷ = Xβ̂.

Here, X is an m×(n+1) matrix whose first column is all 1s and whose remaining n columns
are the feature columns, and β̂ has the intercept β̂0 as its first entry and the n
regression coefficients as its remaining entries.
Normal Equation
The ꞵ hat vector can be computed using normal equation as given below:
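In its standard form, the normal equation is:

β̂ = (XᵀX)⁻¹ Xᵀ y

where y is the vector of observed responses.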
Characteristics of Normal Equation Method
The normal equation method is used:

- If n (the number of features) is small.
- If m (the number of training examples) is small, i.e. up to around 20,000.

It is a one-step method for calculating the regression coefficients.

Computation increases significantly as the number of features grows, because the
matrices become larger and matrix multiplication (and inversion) is a computationally
intensive operation.
Practice Problem
Find the least squares regression line for the given dataset using the normal
equation method. Show the computation at each step. [2022]

x1   x2   y
1    9    14
2    1    7
3    2    12
4    3    16
5    4    20
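A minimal sketch (NumPy; not part of the original question) of how the normal equation would be applied to this dataset:

```python
import numpy as np

# Design matrix: a column of 1s (for the intercept) followed by x1 and x2
X = np.array([[1, 1, 9],
              [1, 2, 1],
              [1, 3, 2],
              [1, 4, 3],
              [1, 5, 4]], dtype=float)
y = np.array([14, 7, 12, 16, 20], dtype=float)

# Normal equation: beta_hat = (X^T X)^(-1) X^T y
beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(beta_hat)   # [beta0_hat, beta1_hat, beta2_hat]
```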
Gradient Descent
Gradient Descent
Gradient Descent is an optimization algorithm that can be used to find a global
or local minimum of a differentiable function.

It is an iterative algorithm.
Notations
Matrix Notation

We will set x0(i) = 1, for all values of i.


Loss/Cost Function
A loss/cost function is a function that signifies how much our predicted values
deviate from the actual values of the dependent variable.
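For linear regression over m training examples, the cost function used in these notes (following the convention of Andrew Ng's course, which the worked examples below also use) is:

J(θ0, θ1) = (1/(2m)) · Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²,   where hθ(x) = θ0 + θ1x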
Understanding Cost Function
Training Data:

x0   x1   y
1    1    1
1    2    2
1    3    3

Assuming 𝜽0 = 0, find out J(𝜽1) for:

a) 𝜽1 = 1
b) 𝜽1 = 0.5
c) 𝜽1 = 0
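Using the 1/(2m) cost defined above (with hθ(x) = θ1x since θ0 = 0), these work out to:

a) θ1 = 1:   J = (1/6)[(1−1)² + (2−2)² + (3−3)²] = 0
b) θ1 = 0.5: J = (1/6)[(0.5−1)² + (1−2)² + (1.5−3)²] = 3.5/6 ≈ 0.58
c) θ1 = 0:   J = (1/6)[1² + 2² + 3²] = 14/6 ≈ 2.33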
Understanding Cost Function

Source: Andrew Ng, Machine Learning Course, Coursera


Steps Involved in Linear Regression with Gradient Descent Implementation

1. Initialize the weight and bias (i.e. the regression coefficients) randomly or with 0 (both will work).
2. Make predictions with this initial weight and bias.
3. Compare these predicted values with the actual values and define the loss function using both the predicted and actual values.
4. With the help of differentiation, calculate how the loss function changes with respect to the weight and bias terms.
5. Update the weight and bias terms so as to minimize the loss function.

To update 𝜽s, we need to calculate the gradients for each 𝜽i


Source: Andrew Ng, Machine Learning Course, Coursera
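A minimal sketch (NumPy; function and variable names are illustrative) of the five steps above for univariate linear regression, using the 1/(2m) cost defined earlier:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, n_iters=2):
    """Batch gradient descent for the hypothesis h(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0                   # step 1: initialise with 0
    for _ in range(n_iters):
        y_pred = theta0 + theta1 * x            # step 2: make predictions
        error = y_pred - y                      # step 3: compare with actual values
        cost = np.sum(error ** 2) / (2 * m)     #         J(theta0, theta1)
        grad0 = np.sum(error) / m               # step 4: dJ/dtheta0
        grad1 = np.sum(error * x) / m           #         dJ/dtheta1
        theta0 -= alpha * grad0                 # step 5: simultaneous update
        theta1 -= alpha * grad1
        print(cost, theta0, theta1)
    return theta0, theta1

# Example usage with the dataset from the exercise further below
x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([0.85, 1.20, 1.55, 1.90])
gradient_descent(x, y, alpha=0.01, n_iters=2)
```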
Differentiating J(𝜽0,𝜽1) w.r.t. 𝜽0 and 𝜽1

Cancel out 2 from numerator and denominator


Now differentiating w.r.t. 𝜽1

Cancelling out 2 from numerator and denominator and taking x1 common


Calculating Gradients
Updating weights

More generally, it can be written as:
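In its standard form, for any parameter θj the simultaneous update is:

θj := θj − α · (1/m) · Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xj⁽ⁱ⁾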


Visualizing Gradient Descent Algorithm

The gradient basically represents the slope of the line.

When y increases with x, the line has a +ve slope, thus w is decreased.

When y decreases with x, the line has a -ve slope, thus w is increased.

In both scenarios, w moves towards the minimum.
What is alpha?
Alpha represents the learning rate, i.e. how large a step the algorithm takes towards
the minimum point at each update.

The learning rate alpha has to be chosen manually; it is not known beforehand.
A value of 0.01 is commonly chosen as a starting point.

If alpha is too small, the algorithm moves very slowly and is said to converge too slowly.

If alpha is too large, the algorithm can overshoot the minimum point and thus never
reach it.
Source: Andrew Ng, Machine Learning Course, Coursera
Find the values of the regression coefficients for the next two iterations of
gradient descent. Take the initial values of the coefficients as 0. Also, find the
cost at each iteration. Take alpha = 0.01.
x    y
1    0.85
2    1.20
3    1.55
4    1.90
Iteration 1
Iteration 1: updating theta_1
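Using the 1/m-averaged gradients given above, the first update works out as:

∂J/∂θ0 = (1/4)(−0.85 − 1.20 − 1.55 − 1.90) = −1.375   →   θ0 = 0 − 0.01·(−1.375) = 0.01375
∂J/∂θ1 = (1/4)(−0.85·1 − 1.20·2 − 1.55·3 − 1.90·4) = (1/4)(−15.5) = −3.875   →   θ1 = 0 − 0.01·(−3.875) = 0.03875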
Calculating Cost
J(0.01375,0.03875) = ?

~0.86 (approx.)
Iteration 2
Calculating Cost
J(0.026393,0.4225) = ?

0.046
Vectorised Notation for Gradient Descent
Solving previous example using vectorized notation
Initially:

Updating theta by placing these values in the equation given below:


First Iteration
Second Iteration
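A minimal vectorised sketch (NumPy; an illustration of the standard update θ := θ − (α/m)·Xᵀ(Xθ − y), not the slides' own code):

```python
import numpy as np

X = np.array([[1, 1],
              [1, 2],
              [1, 3],
              [1, 4]], dtype=float)        # first column of 1s, second column is x
y = np.array([0.85, 1.20, 1.55, 1.90])
theta = np.zeros(2)                        # [theta0, theta1], initialised to 0
alpha, m = 0.01, len(y)

for i in range(2):
    theta = theta - (alpha / m) * X.T @ (X @ theta - y)   # vectorised update
    print(i + 1, theta)                    # first iteration gives [0.01375, 0.03875]
```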
Polynomial Regression
Polynomial Regression
Our hypothesis function need not be linear (a straight line) if that does not fit the
data well.
We can change the behavior or curve of our hypothesis function by making it a
quadratic, cubic or square root function (or any other form).
For example, if our hypothesis function is

hθ(x) = θ0 + θ1x1

then we can create additional features based on x1 to get the quadratic function

hθ(x) = θ0 + θ1x1 + θ2x1²

or the cubic function

hθ(x) = θ0 + θ1x1 + θ2x1² + θ3x1³

In the cubic version, we have created new features x2 and x3, where x2 = x1² and x3 = x1³.
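A minimal sketch (NumPy; the feature and target values are purely illustrative) of creating the new features x2 = x1² and x3 = x1³ and fitting an ordinary linear regression on them:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # original single feature (illustrative values)
y  = np.array([1.2, 4.8, 11.1, 19.9, 31.2])   # illustrative target values

# New features: x2 = x1^2 and x3 = x1^3; the model is still linear in the parameters
X = np.column_stack([np.ones_like(x1), x1, x1 ** 2, x1 ** 3])

# Fit by least squares (equivalent to the normal equation)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # [beta0, beta1, beta2, beta3]
```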
