Reference Material: Linear Regression
Prerequisites:
Statistics: Mean, Median, Mode, Variance and Standard Deviation, Correlation, and Covariance.
Exploratory Data Analysis: Data distribution, Scatter plot, Correlation matrix, Heat map.
Objectives:
Understand what Linear Regression is and the motivation behind it.
Understand the best fit line and the concept of residual(s) in Linear Regression.
Use the Least Squares method to find the best fit line of regression.
Use the Gradient Descent method to find the best fit line of regression.
Linear Regression
Linear regression is a way to identify a relationship between two or more variables. We use this relationship to predict the values of one variable for a given set of value(s) of the other variable(s). The variable used for prediction is termed the independent/explanatory/regressor variable, whereas the predicted variable is termed the dependent/target/response/regressand variable. Linear regression assumes that the dependent variable is linearly related to the estimated parameter(s).
𝑦 = 𝑐 + 𝑚𝑥
In machine learning and regression literature the above equation is used in the form:
𝑦 = 𝑤0 + 𝑤1 𝑥
1. Let us see a use case of linear regression for the prediction of house prices. For each house we have been given the plot size (area) and the price at which the house was sold. Can we use this information to predict the selling price of a house for a given plot size?
The above are examples of Simple Linear Regression: a regression which has one independent variable. In such cases, we use one explanatory (x) variable to predict the target/response (y) variable.
3. Next, consider a problem where we need to predict the price of a used car. The selling price of a used car depends on many attributes; some of them may be mileage (km/litre), model (Maruti, Hyundai, Honda, Toyota, Tata), and segment (Small, Medium, Luxury). In this scenario the selling price is the response or target variable, which depends on mileage, model and segment (explanatory variables). This problem can be modelled as a Multiple Linear Regression, as more than one explanatory variable is involved in the prediction of the target variable.
𝑆𝑒𝑙𝑙𝑖𝑛𝑔𝑃𝑟𝑖𝑐𝑒(𝑦) = 𝑤0 + 𝑤1 𝑀𝑖𝑙𝑒𝑎𝑔𝑒 + 𝑤2 𝑀𝑜𝑑𝑒𝑙 + 𝑤3 𝑆𝑒𝑔𝑚𝑒𝑛𝑡
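As a rough illustration of how such a model can be fitted, the sketch below builds a design matrix with an intercept column, the mileage, and 0/1 dummy columns for the categorical attributes, and solves for the weights by ordinary least squares. The car records and prices are made-up values, and the drop-first dummy encoding is one common choice, not something prescribed by this material.

```python
import numpy as np

# Made-up used-car records: mileage (km/litre), model, segment -> selling price.
mileage = np.array([18.0, 15.0, 12.0, 20.0, 14.0, 11.0, 19.0, 13.0])
model   = ["Maruti", "Hyundai", "Honda", "Maruti", "Honda", "Hyundai", "Maruti", "Honda"]
segment = ["Small", "Medium", "Medium", "Small", "Luxury", "Medium", "Small", "Luxury"]
price   = np.array([3.5, 5.0, 7.2, 4.0, 12.5, 5.5, 3.8, 11.9])  # hypothetical prices

def dummy_columns(values):
    """0/1 indicator columns for every category except the first (drop-first encoding)."""
    categories = sorted(set(values))[1:]
    return np.array([[1.0 if v == c else 0.0 for c in categories] for v in values])

# Design matrix: intercept, mileage, dummy columns for model and segment.
X = np.column_stack([np.ones_like(mileage), mileage,
                     dummy_columns(model), dummy_columns(segment)])

# Ordinary least squares estimate of w = (w0, w1, ...): minimise ||price - X w||^2.
w, *_ = np.linalg.lstsq(X, price, rcond=None)
print("estimated weights:", np.round(w, 3))
```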
In real scenarios, we rarely have only one explanatory variable, so we use multiple linear regression rather than simple linear regression. However, here we take an example of simple linear regression to understand the fundamentals of regression.
Example: Consider a toy example where we are interested in finding the effect of study hours per day on grades in an examination, and in predicting the grades of a student for a given number of study hours. We have sample data for six students with their grades and total study hours per day.
Table-1
Study Hours per day (x)    Grades (y)
6                          7
5                          5
4                          2
7                          9
8                          6
2                          3
Figure-1: Scatter graph of the data in Table-1.
To fit a linear regression model to the given data, we can draw multiple lines which go through our data points, and one of them will be our best fit line. Let the equation of the linear regression model be given by:
𝑦(𝐺𝑟𝑎𝑑𝑒𝑠) = 𝑤0 + 𝑤1 𝑋(𝑆𝑡𝑢𝑑𝑦 𝐻𝑜𝑢𝑟𝑠 𝑝𝑒𝑟 𝑑𝑎𝑦)
The following graphs illustrate the process of finding the best fit line of regression.
Figure-2: (a) Fit a line with equation y = x. (b) Fit a line with equation y = 1.667 + 0.667x. (c) Fit a line with equation y = 0.457 + 0.914x. (d) Combine all three lines and choose the best fit line with minimum residual.
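To make the comparison in Figure-2 concrete, here is a small sketch (assuming Python with NumPy) that computes the sum of squared residuals of each candidate line on the Table-1 data; the line with the smallest value is the best fit.

```python
import numpy as np

# Data from Table-1: study hours per day (x) and grades (y).
x = np.array([6, 5, 4, 7, 8, 2], dtype=float)
y = np.array([7, 5, 2, 9, 6, 3], dtype=float)

# The three candidate lines of Figure-2, written as (w0, w1) pairs.
candidates = {
    "y = x":              (0.0, 1.0),
    "y = 1.667 + 0.667x": (1.667, 0.667),
    "y = 0.457 + 0.914x": (0.457, 0.914),
}

# Sum of squared residuals for each line; the best fit line has the smallest value.
for name, (w0, w1) in candidates.items():
    residuals = y - (w0 + w1 * x)
    print(f"{name:20s} -> sum of squared residuals = {np.sum(residuals**2):.3f}")
```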
Minimising the sum of squared residuals analytically leads to the following pair of equations:
$$\sum_{i=1}^{n} y_i = n\,w_0 + w_1 \sum_{i=1}^{n} x_i$$
$$\sum_{i=1}^{n} x_i y_i = w_0 \sum_{i=1}^{n} x_i + w_1 \sum_{i=1}^{n} x_i^2$$
These two equations are called the normal equations of linear regression. They are obtained by partial differentiation of the function $f = \sum_{i=1}^{n}\left[y_{i(\text{actual})} - y_{i(\text{predicted})}\right]^2$ with respect to $w_0$ and $w_1$ and equating the derivatives to 0 to minimise the function $f$.
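A minimal sketch of solving the two normal equations directly as a 2×2 linear system for the Table-1 data (assuming NumPy); the result should match the line of Figure-2(c).

```python
import numpy as np

# Data from Table-1.
x = np.array([6, 5, 4, 7, 8, 2], dtype=float)
y = np.array([7, 5, 2, 9, 6, 3], dtype=float)
n = len(x)

# Normal equations written as a 2x2 linear system A [w0, w1]^T = b:
#   n*w0       + (sum x)*w1   = sum y
#   (sum x)*w0 + (sum x^2)*w1 = sum x*y
A = np.array([[n,       x.sum()],
              [x.sum(), (x**2).sum()]])
b = np.array([y.sum(), (x * y).sum()])

w0, w1 = np.linalg.solve(A, b)
print(f"w0 = {w0:.4f}, w1 = {w1:.4f}")  # expected to match the line of Figure-2(c)
```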
Also,
$$\mathrm{var}(x) = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2 \;\Longrightarrow\; \frac{1}{n}\sum_{i=1}^{n} x_i^2 = \mathrm{var}(x) + \bar{x}^2 \qquad \ldots (5)$$
Solving the normal equations, we obtain
$$w_1 = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}$$
and
$$w_0 = \bar{y} - w_1\,\bar{x}$$
The straight line y = w0 + w1 x that minimises the least squares error E{(y − (w0 + w1 x))²} with respect to w0 and w1 is called the line of regression of y on x. Let us use these equations to estimate the best fit line for the data given in Table-1.
To estimate w0 and w1, we need to find the covariance between x and y [cov(x, y)], the variance of x [var(x)] and the means of the x and y variables (x̄ and ȳ). For the given data we get,
$$\bar{x} = \frac{6 + 5 + 4 + 7 + 8 + 2}{6} = 5.333$$
$$\bar{y} = \frac{7 + 5 + 2 + 9 + 6 + 3}{6} = 5.333$$
$$\mathrm{cov}(x, y) = 3.5555$$
$$\mathrm{var}(x) = 3.8889$$
When we substitute these values into equations (7) and (8) we get
$$w_0 = 0.4571 \quad \text{and} \quad w_1 = 0.9143$$
which are exactly the same as shown in Figure-2(c) for the line y = 0.457 + 0.914x, the line which gives the minimum residual among all the lines.
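The same estimate can be reproduced numerically from the sample statistics. The sketch below (assuming NumPy and the 1/n convention for covariance and variance used above) computes cov(x, y), var(x) and the resulting w0 and w1.

```python
import numpy as np

# Data from Table-1.
x = np.array([6, 5, 4, 7, 8, 2], dtype=float)
y = np.array([7, 5, 2, 9, 6, 3], dtype=float)

x_bar, y_bar = x.mean(), y.mean()
cov_xy = np.mean(x * y) - x_bar * y_bar   # 1/n (population) covariance
var_x  = np.mean(x**2) - x_bar**2         # 1/n (population) variance, identity (5)

w1 = cov_xy / var_x          # slope
w0 = y_bar - w1 * x_bar      # intercept
print(f"cov(x, y) = {cov_xy:.4f}, var(x) = {var_x:.4f}")
print(f"w0 = {w0:.4f}, w1 = {w1:.4f}")   # approximately 0.4571 and 0.9143
```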
Performance metrics for least squares regression: Performance metrics are a way to quantify and compare the efficiency of any machine learning model. Least squares regression uses the R² (R-squared) and R²_adj (Adjusted R-squared) metrics to measure the performance of a regression model; R²_adj (Adjusted R-squared) is used with multiple linear regression. Both metrics denote how well the selected independent variable(s) explain the variation of the response variable. The equations of R² (R-squared) and R²_adj (Adjusted R-squared) are given by:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$
where n is the total number of observations in the data and k is the number of explanatory variables. R²_adj (Adjusted R-squared) is a slight improvement over R² (R-squared), obtained by adjusting it for the number of terms in the model. The problem with R² (R-squared) is that it increases as the number of terms in the model increases, irrespective of whether the added terms significantly contribute to predicting/explaining the dependent variable or not. On the contrary, the value of R²_adj (Adjusted R-squared) increases only if significant terms are added to the model. The relation between R² (R-squared) and R²_adj (Adjusted R-squared) is:
$$R^2_{adj} \leq R^2$$
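As an illustration, the following sketch computes R-squared and adjusted R-squared for the best fit line found above (assuming NumPy; the residual and total sums of squares follow the formulas given in this section).

```python
import numpy as np

# Data from Table-1 and the best fit line estimated above.
x = np.array([6, 5, 4, 7, 8, 2], dtype=float)
y = np.array([7, 5, 2, 9, 6, 3], dtype=float)
w0, w1 = 0.4571, 0.9143
y_hat = w0 + w1 * x

n, k = len(y), 1                      # n observations, k explanatory variables
ss_res = np.sum((y - y_hat)**2)       # residual sum of squares
ss_tot = np.sum((y - y.mean())**2)    # total sum of squares

r2 = 1 - ss_res / ss_tot
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(f"R^2 = {r2:.4f}, adjusted R^2 = {r2_adj:.4f}")   # adjusted R^2 <= R^2
```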
Gradient Descent method to find the Best Fit line
The cost function, $cost(w_0, w_1) = \frac{1}{n}\sum_{i=1}^{n}\left[y_i - (w_0 + w_1 x_i)\right]^2$, includes two parameters, w0 and w1, which control its value. As we know, derivatives give us the rate of change of one variable with respect to another, so we can use partial derivatives to find the impact of each individual parameter on the cost function.
The principle of gradient descent is that we always make progress in the direction in which the partial derivatives with respect to w0 and w1 are steepest. If the derivatives with respect to the parameters approach zero or become very small, this indicates either a maximum or a minimum on the surface of the cost function. The process of gradient descent starts with a random initialisation of w0 and w1. Every iteration of gradient descent moves towards the optimal values of the w0 and w1 parameters, for which the cost function has its minimum value. The following figure illustrates the process of optimisation.
Parameter updates:
$$w_0 = w_0 - lrate \cdot \frac{\partial\, cost(w_0, w_1)}{\partial w_0}$$
$$w_1 = w_1 - lrate \cdot \frac{\partial\, cost(w_0, w_1)}{\partial w_1}$$
lrate is the learning rate which controls the step size of parameter update.
Let us run it on our example, starting from w0 = 0, w1 = 0 with lrate = 0.01:
X: 6 5 4 7 8 2
y: 7 5 2 9 6 3
Iteration #1:
$$\frac{\partial\, cost(w_0, w_1)}{\partial w_0} = \frac{-2}{6}(7 + 5 + 2 + 9 + 6 + 3) = -10.6667$$
$$\frac{\partial\, cost(w_0, w_1)}{\partial w_1} = \frac{-2}{6}(7 \cdot 6 + 5 \cdot 5 + 2 \cdot 4 + 9 \cdot 7 + 6 \cdot 8 + 3 \cdot 2) = -64$$
$$w_0 = w_0 - lrate \cdot \frac{\partial\, cost(w_0, w_1)}{\partial w_0} = 0.0 - 0.01(-10.6667) = 0.1067$$
$$w_1 = w_1 - lrate \cdot \frac{\partial\, cost(w_0, w_1)}{\partial w_1} = 0.0 - 0.01(-64) = 0.64$$
Iteration #2:
$$\hat{y}_i = 0.1067 + 0.64\, x_i$$
Calculate gradients (the terms in parentheses are the residuals $y_i - \hat{y}_i$):
$$\frac{\partial\, cost(w_0, w_1)}{\partial w_0} = \frac{-2}{6}(3.0533 + 1.6933 - 0.6667 + 4.4133 + 0.7733 + 1.6133) = -3.6266$$
$$\frac{\partial\, cost(w_0, w_1)}{\partial w_1} = \frac{-2}{6}(3.0533 \cdot 6 + 1.6933 \cdot 5 - 0.6667 \cdot 4 + 4.4133 \cdot 7 + 0.7733 \cdot 8 + 1.6133 \cdot 2) = -21.475$$
$$w_0 = w_0 - lrate \cdot \frac{\partial\, cost(w_0, w_1)}{\partial w_0} = 0.1067 - 0.01(-3.6266) = 0.14296$$
$$w_1 = w_1 - lrate \cdot \frac{\partial\, cost(w_0, w_1)}{\partial w_1} = 0.64 - 0.01(-21.475) = 0.8547$$
Similarly, iterations of gradient descent are performed until the minimum value of the cost function is achieved or a fixed number of iterations is reached. The quality of the fit can be measured with the root mean squared error:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
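Putting the pieces together, here is a minimal gradient descent sketch (assuming NumPy, the MSE cost and the update rules above). The first two printed iterations should reproduce the hand calculations, and with enough iterations w0 and w1 approach the least squares values of about 0.457 and 0.914.

```python
import numpy as np

# Data from Table-1.
x = np.array([6, 5, 4, 7, 8, 2], dtype=float)
y = np.array([7, 5, 2, 9, 6, 3], dtype=float)
n = len(x)

w0, w1 = 0.0, 0.0   # initialisation used in the worked example
lrate = 0.01        # learning rate

for i in range(1, 5001):
    y_hat = w0 + w1 * x
    grad_w0 = (-2.0 / n) * np.sum(y - y_hat)        # d cost / d w0
    grad_w1 = (-2.0 / n) * np.sum((y - y_hat) * x)  # d cost / d w1
    w0 -= lrate * grad_w0
    w1 -= lrate * grad_w1
    if i in (1, 2, 5000):
        rmse = np.sqrt(np.mean((y - (w0 + w1 * x))**2))
        print(f"iteration {i:4d}: w0 = {w0:.4f}, w1 = {w1:.4f}, RMSE = {rmse:.4f}")
```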
********