
ICS422 Applied Predictive Analytics [3-0-0-3]

Linear Regression: Residual Analysis

Class 15
Presented by
Dr. Selvi C
Assistant Professor
IIIT Kottayam
Simple Linear Regression
• Simple linear regression is really a comparison of two models:
• one in which the independent variable does not exist at all,
• and one that uses the best-fit regression line.
• With only one variable, the best prediction for new values is the mean of the dependent variable.
• The difference between the best-fit line and an observed value is called the residual (or error).
• The residuals are squared and added together to generate the sum of squared residuals/errors, SSE.
• Simple linear regression finds the best-fitting line through the data, the one that minimizes the SSE.
Simple Linear Regression

REGRESSION EQUATION WITH
ESTIMATES
• If we actually knew the population parameters, β0 and β1, we could use the simple linear regression equation:
E(y) = β0 + β1x
• In reality we almost never have the population parameters, so we estimate them from sample data. With sample data the equation changes slightly.
• ŷ, pronounced "y-hat", is the point estimator of E(y):
ŷ = b0 + b1x
• ŷ is the mean value of y for a given value of x.
Least squares criterion
• yi = observed value of the dependent variable (tip amount)
• ŷi = estimated (predicted) value of the dependent variable (predicted tip amount)
• The goal is to minimize the sum of the squared differences between the observed values yi and the predicted values ŷi provided by the regression line, i.e., the sum of the squared residuals:
min Σ(yi − ŷi)²



Parameters
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)², b0 = ȳ − b1x̄

For the denominator, Σ(xi − x̄)²:
1. For each data point,
2. take the x-value and subtract the mean of x,
3. square Step 2,
4. add up all of the squares.

For the numerator, Σ(xi − x̄)(yi − ȳ):
1. For each data point,
2. take the x-value and subtract the mean of x,
3. take the y-value and subtract the mean of y,
4. multiply Step 2 and Step 3,
5. add up all of the products.
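The two step-lists above can be sketched in code. The data below is a reconstruction chosen to be consistent with the slide's summary numbers (Σ(xi − x̄)² = 4206, b1 ≈ 0.1462); the lecture's actual table may differ.

```python
# Sketch of the two step-lists: denominator = sum of squared x-deviations,
# numerator = sum of cross-products of deviations.
# Data reconstructed to match the slide's summary numbers; may differ
# from the lecture's actual table.
x = [34, 108, 64, 88, 99, 51]   # bill amounts ($)
y = [5, 17, 11, 8, 14, 5]       # tip amounts ($)

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

sxx = sum((xi - x_bar) ** 2 for xi in x)                        # left column, steps 1-4
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # right column, steps 1-5

b1 = sxy / sxx             # slope
b0 = y_bar - b1 * x_bar    # intercept
print(sxx, round(b1, 4), round(b0, 4))  # 4206.0 0.1462 -0.8203
```

Note that full precision gives b0 ≈ −0.8203; the slide's −0.8188 comes from rounding b1 to 0.1462 before computing the intercept.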



Example



• For every $1 increase in the bill amount (x), we would expect the tip amount to increase by $0.1462, or about 15 cents.
• If the bill amount (x) is zero, then the expected/predicted tip amount is −$0.8188, or negative 82 cents! Does this make sense? No. The intercept may or may not make sense in the "real world."
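The fitted equation ŷ = −0.8188 + 0.1462x can be used to predict a tip; the $50 bill below is an illustrative input, not from the slides.

```python
# Prediction with the slide's fitted equation y_hat = -0.8188 + 0.1462 x.
# The $50 bill is an illustrative input, not from the slides.
b0, b1 = -0.8188, 0.1462

def predict_tip(bill):
    """Predicted tip (dollars) for a given bill amount (dollars)."""
    return b0 + b1 * bill

print(round(predict_tip(50), 2))  # 6.49
```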



RESIDUAL ANALYSIS

• Residual (n): a quantity remaining after other things have been subtracted or allowed for.
• Difference between the observed value of the
dependent variable (tip amount) and what is predicted
by the regression model
• So if the model predicts a tip of $10 for a given meal,
but the observed tip is $12, then the residual amount is
12 - 10 = 2
• Residual = yi − ŷi = observed tip − predicted tip
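The $12-vs-$10 example above, as a one-liner:

```python
# Residual = observed - predicted, as in the example above.
observed_tip = 12.0   # what the diner actually tipped
predicted_tip = 10.0  # what the model predicted
residual = observed_tip - predicted_tip
print(residual)  # 2.0
```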



Goodness of fit
• Only part of the variance in the dependent variable will be explained by the values of the independent variable:
• R² = SSR / SST
• The variance left unexplained is due to model error: SSE / SST
• Think "How far off" or "How good" the model accounts
for the variance in the dependent variable
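A minimal sketch of the R² = SSR/SST computation. The data is reconstructed to match the slide's reported R² ≈ 0.7493 and may differ from the lecture's actual table.

```python
# R^2 = SSR / SST from first principles. Data reconstructed to match the
# slide's summary statistics; the lecture's actual table may differ.
x = [34, 108, 64, 88, 99, 51]   # bill amounts ($)
y = [5, 17, 11, 8, 14, 5]       # tip amounts ($)
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained (error)
ssr = sst - sse                                        # explained by the model

r_squared = ssr / sst
print(round(r_squared, 4))  # ~0.7494 (slide reports 0.7493)
```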



Model Assumption

• Residuals offer the best information about the error term, ε
• The expected value of the error term is zero; E (ε) = 0
• For all values of the independent variable x, the
variance of the error term ε is the same
• The values of the error term ε are independent of each
other
• The error term ε follows a normal distribution



Assumption
For the results of a linear regression model to be valid and reliable, we
need to check that the following four assumptions are met:
1. Linear relationship: There exists a linear relationship between the
independent variable, x, and the dependent variable, y.
2. Independence: The residuals are independent. In particular, there is
no correlation between consecutive residuals in time series data.
3. Homoscedasticity: The residuals have constant variance at every
level of x.
4. Normality: The residuals of the model are normally distributed.

If one or more of these assumptions are violated, then the results of our
linear regression may be unreliable or even misleading.
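Two of these assumptions can be checked numerically, as a rough sketch: the zero-mean error term, and independence via the Durbin-Watson statistic (which is not covered in the slides; it is the standard test for assumption 2). The residual values below are illustrative approximations; in practice, residual-vs-fitted and Q-Q plots are the usual tools.

```python
# Rough numeric checks for the zero-mean error term and assumption 2
# (independence). Residuals below are illustrative/approximate.
residuals = [0.85, 2.03, 2.46, -4.05, 0.34, -1.64]

# The mean residual should be ~0 for a least-squares fit with an intercept.
mean_resid = sum(residuals) / len(residuals)

# Durbin-Watson statistic: values near 2 suggest no first-order
# autocorrelation between consecutive residuals.
num = sum((residuals[i] - residuals[i - 1]) ** 2
          for i in range(1, len(residuals)))
dw = num / sum(e ** 2 for e in residuals)

print(round(mean_resid, 3), round(dw, 2))  # mean near 0, DW near 2
```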
Best case residual distribution
• Evenly distributed left to right, up to down, all over the graph

• Problem case: residuals are not evenly distributed

Points observed

• What happens if the residual analysis reveals heteroscedasticity?
• Rebuild the model with different independent variable(s)
• Perform transformations on non-linear data
• Fit a non-linear regression model... but don't OVERFIT
• Are there statistical tests for residuals?





R2 INTERPRETATION
• Coefficient of determination = r² = 0.7493, or 74.93%
• We can conclude that 74.93% of the total sum of
squares can be explained by using the estimated
regression equation to predict the tip amount.
• The remainder is error.



Comparison of R-squared to the
Standard Error of the Regression (S)
• The standard error of the regression provides the absolute
measure of the typical distance that the data points fall from
the regression line. S is in the units of the dependent variable.
• R-squared provides the relative measure of the percentage of
the dependent variable variance that the model explains. R-
squared can range from 0 to 100%.

Sum of Squared Error

• A measure of the variability of the observations about the regression line:
SSE = Σ(yi − ŷi)²



Mean Squared Error
• MSE is an estimate of the variance of the error, ε.
• In other words, how spread out the data points are from the regression line. MSE is SSE divided by its degrees of freedom, n − 2, because we are estimating two parameters: the slope and the intercept.
MSE = s² = SSE / (n − 2)
• Why divide by n − 2 and not just n? Remember, we are using sample data; it is also why we write s² rather than σ².
• This is why MSE is not simply the average of the squared residuals.
• If we were using population data, we would divide by N, and it would simply be the average of the squared residuals.
Standard error of the Estimate
• The standard error of the estimate, s (or just "standard error"), is the estimated standard deviation of the error term, ε. Now we are un-squared!
• It is the average distance an observation falls from the
regression line in units of the dependent variable.
• Since the MSE is s², the standard error is just the square root of MSE:
• s = √MSE = √(SSE / (n − 2))
• s = √7.5187 = 2.742
• So the average distance of the data points from the fitted line
is about $2.74.
• You can think of s as a measure of how well the regression line fits the data.
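The SSE → MSE → s chain, using the slide's numbers. SSE is back-computed from the reported MSE of 7.5187, and n = 6 is an assumption inferred from the critical t of 2.776 (which has 4 degrees of freedom).

```python
import math

# SSE -> MSE -> s using the slide's numbers. SSE is back-computed from
# the reported MSE (7.5187) with df = n - 2 = 4; n = 6 is inferred.
sse = 30.0748            # sum of squared errors (= 4 * 7.5187)
n = 6
mse = sse / (n - 2)      # two parameters estimated: slope and intercept
s = math.sqrt(mse)       # standard error of the estimate
print(round(mse, 4), round(s, 3))  # 7.5187 2.742
```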
Statistically significant
• How much variance in the dependent variable is
explained by the model / independent variable?
• For this we look at the value of R² or adjusted R²
• Does a statistically significant linear relationship exist
between the independent and dependent variables?
• Is the overall F-test or t-test (in simple regression these are
actually the same thing) significant?
• Can we reject the null hypothesis that the slope β1 of the regression line is ZERO?
• Does the confidence interval for the slope b1 contain zero?



Estimators Everywhere
Linear regression contains many estimators
• the slope of the regression line
• the intercept of the regression line on the y-axis
• Centroid: the point that is the intersection of the mean
of each variable (x, y)
• The mean value of ŷ* for any value of x* (confidence
interval)
• The individual value of ŷ* for any value of x* (prediction
interval)
• And many others about variance, etc.


Degree of freedom
• What are degrees of freedom in statistics? Degrees of
freedom are the number of independent values that a
statistical analysis can estimate.
• Calculating the degrees of freedom is often the sample size minus the number of parameters you're estimating (here, n − 2 for the slope and intercept)

Confidence Interval
• 95% confidence that the actual mean for the population
falls within this interval
t-value calculation
• b1 ± tα/2 · sb1
where sb1 is the standard deviation of the slope, tα/2 · sb1 is the margin of error, and b1 is the point estimator for the slope

Standard Deviation of the slope
• sb1 = s / √(Σ(xi − x̄)²)
• sb1 = 2.742 / √4206
• sb1 = 0.04228



Confidence Interval for slope
• b1 ± tα/2 · sb1
• 0.1462 ± 2.776 × 0.04228
• 0.1462 ± 0.1174
• (0.02885, 0.2636)
We are 95% confident that the interval (0.02885, 0.2636) contains the true slope of the regression line
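The interval computation, using the slide's numbers. Small differences from the slide's lower bound of 0.02885 are expected because 0.1462 and 0.04228 are already rounded.

```python
import math

# Confidence interval for the slope, from the slide's numbers.
b1 = 0.1462          # point estimate of the slope
s = 2.742            # standard error of the estimate
sxx = 4206           # sum of squared x-deviations
t_crit = 2.776       # t(0.025, df = 4) from a t-table

s_b1 = s / math.sqrt(sxx)   # standard deviation of the slope
margin = t_crit * s_b1      # margin of error
ci = (round(b1 - margin, 4), round(b1 + margin, 4))
print(ci)  # approximately (0.0288, 0.2636)
```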



Does the interval contain zero?
• (0.02885,0.2636)

• Hypotheses: H0: β1 = 0 versus Ha: β1 ≠ 0

• Can we reject the null hypothesis that the slope is zero?

• The null hypothesis is that the slope of the regression line is zero, and therefore that no significant relationship exists between the two variables.



Test statistics
• t = b1 / sb1 = 0.1462 / 0.04228 ≈ 3.4584
• Compare t to tα/2 = 2.776 (df = n − 2 = 4)
• 3.4584 > 2.776 is significant, so reject the null hypothesis
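The test statistic from the slide's rounded inputs (which give ≈ 3.458 rather than the slide's 3.4584, computed from unrounded values):

```python
# t test for H0: beta1 = 0, using the slide's rounded inputs.
b1 = 0.1462        # estimated slope
s_b1 = 0.04228     # standard deviation of the slope
t_crit = 2.776     # critical t at alpha = 0.05, df = 4

t_stat = b1 / s_b1
reject_h0 = abs(t_stat) > t_crit
print(round(t_stat, 3), reject_h0)  # ~3.458 True -> reject H0
```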



Summary
• Does the confidence interval for the slope contain the value zero?
• Is the test statistic t greater than the critical value of t at the chosen significance level and correct degrees of freedom?



Any
Queries?
Thank you
