Notes On Linear Regression - 2

This document discusses the key steps and assumptions in regression model building and diagnostics. It describes collecting and preprocessing data, dividing it into training and validation sets, defining the relationship model, estimating parameters, and performing diagnostics. Diagnostics include hypothesis tests of coefficients, checking for homoscedasticity, identifying outliers, and using measures such as R-squared, the F-statistic, leverage values, and the t-distribution. The goal is to select the best-fitting regression model and identify any issues.

Steps in Regression Model Building

 Collect/Extract Data
 Pre-process the Data
 Divide the Data into Training and Validation Data Sets
 Define the Functional Form of Relationship
 Estimate the Regression Parameters
 Perform Regression Model Diagnostics
 Model Deployment
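
The sketch below illustrates steps 3-5 of the list above on a toy data set, using pandas, scikit-learn, and statsmodels; the column names x and y and the numbers are made up for illustration.

import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8],
                   "y": [2.1, 4.3, 6.2, 8.4, 10.1, 12.3, 13.9, 16.2]})

# Step 3: divide the data into training and validation sets.
train, valid = train_test_split(df, test_size=0.25, random_state=42)

# Step 4: define the functional form, here Y = b0 + b1*X.
X_train = sm.add_constant(train[["x"]])

# Step 5: estimate the regression parameters by ordinary least squares.
model = sm.OLS(train["y"], X_train).fit()
print(model.params)   # b0 (const) and b1 (x)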

The Assumptions in Regression Models

 The regression model is linear in the regression parameters.
 The expected value of the residuals is zero.
 The residuals follow a normal distribution. The assumption of normally distributed errors is not necessary for estimating the regression parameters; however, it is essential for testing hypotheses, such as whether there is a statistically significant relationship between the outcome variable and the features.
 The variance of the residuals is constant for all values of X. Constant residual variance across different values of X is called homoscedasticity; non-constant variance of residuals is called heteroscedasticity.
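
A minimal sketch of checking these assumptions, continuing from the model fitted in the earlier sketch: the residuals should average to zero, look normal, and show no pattern against the fitted values.

import scipy.stats as stats
import matplotlib.pyplot as plt

resid = model.resid
print("Mean of residuals:", resid.mean())   # ~0 by construction for OLS with an intercept

# Normality of residuals (needed for hypothesis tests, not for estimation):
# the Shapiro-Wilk test; a small p-value suggests non-normal residuals.
print("Shapiro-Wilk:", stats.shapiro(resid))

# Homoscedasticity by eye: residuals vs fitted values should show no fanning out.
plt.scatter(model.fittedvalues, resid)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()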

THE MODEL DIAGNOSTICS

Hypothesis Test for the Regression Coefficient

The regression coefficient (b1) captures the existence of a linear relationship between the response variable and the explanatory variable. If b1 = 0, we can conclude that there is no statistically significant linear relationship between the two variables.

The null and alternative hypotheses for the SLR model can be stated as follows:

H0 : There is no relationship between X and Y

HA: There is a relationship between X and Y

b1 = 0 would imply that there is no linear relationship between the response variable Y and the
explanatory variable X. Thus, the null and alternative hypotheses can be restated as follows:

H0 : b1 = 0

HA: b1 ≠ 0

If the p-value is less than 0.05 (or another appropriate significance level), we reject the null hypothesis and conclude that there is significant evidence of a linear relationship between X and Y. (Recall that the p-value gets smaller as the test statistic computed from the data moves further from zero, the value predicted by the null hypothesis.)
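
Continuing from the fitted model above, statsmodels reports this t-test for b1 = 0 directly; the sketch below also rebuilds the t-statistic as the estimate divided by its standard error.

print(model.summary())            # coefficient table includes t and P>|t|
print(model.pvalues["x"])         # p-value for H0: b1 = 0

t_stat = model.params["x"] / model.bse["x"]   # estimate / standard error
print(t_stat, model.tvalues["x"])             # the two agree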

What is Homoscedasticity?

Homoscedasticity refers to a condition in which the variance of the residual, or error term, in a regression model is constant: the error term does not vary much as the value of the predictor variable changes. Put another way, the spread of the data points is roughly the same across all data points.

This consistency makes the data easier to model and work with through regression; a lack of homoscedasticity, however, may suggest that the regression model needs additional predictor variables to explain the variation in the dependent variable.

What is Heteroscedasticity?

Heteroscedasticity occurs when the standard deviation of a predicted variable, monitored over different values of an independent variable or over prior time periods, is non-constant.

With heteroscedasticity, the tell-tale sign on visual inspection of the residual errors is that they tend to fan out: the errors increase as the X or Y variable increases in magnitude.
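
Beyond visual inspection, a common formal check is the Breusch-Pagan test, sketched below for the model fitted earlier; a small p-value is evidence of heteroscedasticity.

from statsmodels.stats.diagnostic import het_breuschpagan

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan p-value:", lm_pvalue)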

What is the Coefficient of Determination (R-squared)?

The primary objective of regression is to explain the variation in Y using the knowledge of X. The coefficient of determination (R-squared, or R2) measures the proportion of variation in the response variable Y explained by the model (b0 + b1X).


Coefficient of determination (R2 ) has the following properties:

 The value of R2 lies between 0 and 1.
 A higher value of R2 implies a better fit, but one should be aware of spurious regression; no minimum threshold is imposed on R2.
 Mathematically, in simple linear regression the square of the correlation coefficient equals the coefficient of determination.

Calculation of R-Squared

R-Squared = SSR/SST

SSR: the sum of squares due to regression (explained sum of squares)

SST: the total sum of squares

Since SST = SSR + SSE, where SSE is the residual (error) sum of squares, R-squared can equivalently be written as 1 − SSE/SST.
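
A self-contained sketch of this decomposition with NumPy on made-up numbers; it also confirms that, in simple regression, R-squared equals the squared correlation between X and Y.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.3, 7.9, 10.2])

b1, b0 = np.polyfit(x, y, 1)            # least-squares fit of y = b0 + b1*x
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # regression (explained) sum of squares
sse = np.sum((y - y_hat) ** 2)          # residual (error) sum of squares

print(ssr / sst)                        # R-squared
print(1 - sse / sst)                    # same value, since SST = SSR + SSE for OLS
print(np.corrcoef(x, y)[0, 1] ** 2)     # squared correlation, equal in simple regression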


Outlier Analysis:

The following distance measures are useful in identifying the influential observations:

 Z-Score
 Cook’s Distance
 Leverage Values

Z-Score

Z-score is the standardized distance of an observation from its mean value. For the predicted value
of the dependent variable Y, the Z-score is given by

Z = (Ypred − Ymean) / Std(Y)
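
A self-contained sketch of flagging observations by Z-score on made-up numbers (the cutoff of 2 is illustrative; 2.5 or 3 are also common):

import numpy as np

y = np.array([2.1, 4.3, 6.2, 8.4, 10.1, 55.0])   # the last value looks suspect
z = (y - y.mean()) / y.std()
print(z)
print(np.where(np.abs(z) > 2)[0])                # index 5 is flagged here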

Cook’s Distance

Cook’s distance measures how much the predicted values of the dependent variable change, across all observations in the sample, when a particular observation is excluded from the sample used to estimate the regression parameters.
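
Continuing from the model fitted earlier, statsmodels computes this leave-one-out measure for every observation at once:

from statsmodels.stats.outliers_influence import OLSInfluence

influence = OLSInfluence(model)
cooks_d, _ = influence.cooks_distance   # one distance per observation
print(cooks_d)                          # unusually large values mark influential points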

Leverage Value

Leverage value of an observation measures the influence of that observation on the overall fit of the
regression function.

A leverage value of more than 2k/n or 3k/n, where k is the number of estimated parameters and n the number of observations, marks the observation as highly influential.
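
Continuing from the same influence object, leverage values are the diagonal of the hat matrix; the sketch below applies the 2k/n rule of thumb, taking k as the number of estimated parameters.

import numpy as np

h = influence.hat_matrix_diag           # leverage of each observation
k = model.df_model + 1                  # parameters, including the intercept
n = int(model.nobs)
print(np.where(h > 2 * k / n)[0])       # observations exceeding the 2k/n cutoff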

F-Statistic

Using the Analysis of Variance (ANOVA), we can test whether the overall model is statistically
significant.
The null and alternative hypotheses for the F-test are given by

H0 : There is no statistically significant relationship between Y and any of the explanatory variables
(i.e., all regression coefficients are zero).

H1 : Not all regression coefficients are zero.

Equivalently, in symbols:

H0 : All regression coefficients are equal to zero.

HA: Not all regression coefficients are equal to zero.

The F-statistic is given by

F = [SSR/k] / [SSE/(n − k − 1)] = MSR/MSE

where k is the number of explanatory variables and n is the number of observations, so n − k − 1 is the residual degrees of freedom.
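
Continuing from the fitted model, statsmodels reports the overall F-test directly, and the ratio can be rebuilt from the sums of squares (note that statsmodels calls SSE "ssr", the residual sum of squares, and SSR "ess", the explained sum of squares):

print(model.fvalue, model.f_pvalue)     # F-statistic and its p-value

msr = model.ess / model.df_model        # MSR = SSR / k
mse = model.ssr / model.df_resid        # MSE = SSE / (n - k - 1)
print(msr / mse)                        # same F-statistic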

T-Distribution

The t-distribution, also known as Student’s t-distribution, is a way of describing data that follow a bell curve when plotted on a graph, with the greatest number of observations close to the mean and fewer observations in the tails.

It is used in place of the normal distribution for smaller sample sizes, where the variance in the data is unknown.

The t-distribution is used when data are approximately normally distributed, which means the data
follow a bell shape but the population variance is unknown. The variance in a t-distribution is
estimated based on the degrees of freedom of the data set (total number of observations minus 1).

It is a more conservative form of the standard normal distribution, also known as the z-distribution: it gives a lower probability to the center and a higher probability to the tails than the standard normal distribution.
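
A self-contained SciPy sketch of that heavier tail: the probability of exceeding 2 is larger under the t-distribution than under the standard normal, and shrinks toward the normal value as the degrees of freedom grow.

from scipy import stats

for df in (2, 10, 30, 1000):
    print(df, stats.t.sf(2.0, df))      # P(T > 2) under t with df degrees of freedom
print("norm", stats.norm.sf(2.0))       # P(Z > 2) under the standard normal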
