Notes On Linear Regression - 2
Building a regression model typically involves the following steps:
1. Collect/extract the data
2. Pre-process the data
3. Divide the data into training and validation data sets
4. Define the functional form of the relationship
5. Estimate the regression parameters
6. Perform regression model diagnostics
7. Deploy the model
The regression coefficient (b1) captures the existence of a linear relationship between the response
variable and the explanatory variable. If b1 = 0, there is no linear relationship between the two
variables, so we test whether the estimated coefficient is statistically significantly different from zero.
Since b1 = 0 would imply that there is no linear relationship between the response variable Y and
the explanatory variable X, the null and alternative hypotheses for the SLR model can be stated as
follows:
H0 : b1 = 0
HA : b1 ≠ 0
If the p-value is less than 0.05 (or another appropriate significance level), we reject the null
hypothesis and conclude that there is significant evidence of a linear relationship between X and Y.
(Remember, the p-value gets smaller as the test statistic calculated from the data moves further
away from zero, the value predicted by the null hypothesis.)
What is Homoskedasticity?
Homoskedasticity refers to a condition in which the variance of the residual, or error term, in a
regression model is constant. That is, the spread of the errors does not change as the value of the
predictor variable changes. Another way of saying this is that the variance of the residuals is roughly
the same across all data points.
This consistency makes it easier to model and work with the data through regression; a lack of
homoskedasticity may suggest that the regression model needs additional predictor variables to
explain the performance of the dependent variable.
What is Heteroskedasticity?
With heteroskedasticity, the tell-tale sign upon visual inspection of the residuals is that they tend to
fan out: the errors grow in magnitude as the X or Y variable increases.
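One common formal check for this (not covered in these notes, added here for illustration) is the Breusch-Pagan test; a rough sketch with statsmodels, using simulated data whose error variance deliberately grows with x:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulated data: error standard deviation grows with x (fan-out)
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
y = 1.0 + 0.5 * x + rng.normal(0, 0.3 * x)

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, X)
print(lm_pvalue)   # a small p-value signals heteroskedasticity
```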
The primary objective of regression is to explain the variation in Y using the knowledge of X. The
coefficient of determination (R-square or R2) measures the percentage of variation in Y explained
by the model (b0 + b1 X).
Calculation of R-Squared
R-Squared = SSR/SST
where SSR is the regression (explained) sum of squares and SST is the total sum of squares.
Equivalently, R-Squared = 1 - SSE/SST, where SSE is the sum of squared errors.
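A small sketch computing R-squared directly from this definition (function and variable names are illustrative; the SSR/SST form assumes a model fitted with an intercept):

```python
import numpy as np

def r_squared(y, y_pred):
    """R^2 = SSR / SST (valid for OLS models with an intercept)."""
    sst = np.sum((y - y.mean()) ** 2)       # total variation in Y
    ssr = np.sum((y_pred - y.mean()) ** 2)  # variation explained by the model
    return ssr / sst
```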
The following distance measures are useful in identifying the influential observations:
Z-Score
Cook’s Distance
Leverage Values
Z-Score
Z-score is the standardized distance of an observation from its mean value. For the predicted value
of the dependent variable Y, the Z-score is given by
Z = (Ypred – Ymean) / Std-Y
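A quick numeric illustration of the same standardization (values are made up; scipy's zscore implements the identical formula):

```python
import numpy as np
from scipy import stats

y_pred = np.array([10.0, 12.0, 9.5, 20.0, 11.0])   # made-up predictions
z = (y_pred - y_pred.mean()) / y_pred.std()         # (Ypred - Ymean) / Std-Y
print(z)
print(stats.zscore(y_pred))                         # identical result
# |Z| > 3 is a common cut-off for flagging unusual observations
```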
Cook’s Distance
Cook’s distance measures how much the predicted values of the dependent variable change, across
all the observations in the sample, when a particular observation is excluded from the sample used
for estimating the regression parameters.
Leverage Value
Leverage value of an observation measures the influence of that observation on the overall fit of the
regression function.
A leverage value of more than 2k/n or 3k/n, where k is the number of estimated parameters and n
is the number of observations, is treated as marking a highly influential observation.
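Both measures are available from a fitted statsmodels OLS result; a sketch on simulated data with one deliberately distorted point (the cut-offs follow the rule above plus the common 4/n rule for Cook's distance):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 1.2 * x + rng.normal(0, 1.0, size=50)
y[0] += 15                                # inject one influential point

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()
influence = res.get_influence()

cooks_d, _ = influence.cooks_distance     # one value per observation
leverage = influence.hat_matrix_diag      # diagonal of the hat matrix

n, k = X.shape                            # n observations, k parameters
print(np.where(cooks_d > 4 / n)[0])       # common Cook's distance cut-off
print(np.where(leverage > 2 * k / n)[0])  # the 2k/n leverage rule above
```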
F-Statistic
Using the Analysis of Variance (ANOVA), we can test whether the overall model is statistically
significant.
The null and alternative hypotheses for the F-test are given by
H0 : There is no statistically significant relationship between Y and any of the explanatory variables
(i.e., all regression coefficients are zero).
HA : At least one regression coefficient is not zero (i.e., at least one explanatory variable has a
statistically significant linear relationship with Y).
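statsmodels reports this F-test directly on a fitted model; a brief sketch with two simulated explanatory variables:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 1.0 + 0.6 * x1 - 0.4 * x2 + rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

print(res.fvalue)    # overall F-statistic from the ANOVA table
print(res.f_pvalue)  # p-value for H0: all slope coefficients are zero
```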
T-Distribution
It is a distribution similar in shape to the normal but with heavier tails, used for smaller sample
sizes where the variance in the data is unknown.
The t-distribution is used when data are approximately normally distributed, which means the data
follow a bell shape, but the population variance is unknown. The shape of a t-distribution is
governed by the degrees of freedom of the data set (total number of observations minus 1); as the
degrees of freedom grow, it approaches the standard normal distribution.
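A short sketch comparing two-sided 5% critical values of the t-distribution against the standard normal, showing the heavier tails shrinking as degrees of freedom grow:

```python
from scipy import stats

for df in (5, 30, 1000):
    print(df, stats.t.ppf(0.975, df))    # t critical value at this df
print("normal", stats.norm.ppf(0.975))   # ~1.96 for the standard normal
```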