Understanding Linear Regression Output
Regression is a tool to test for a relationship between a response variable and one
or more predictor variables. Starting with a straight-line relationship between two
variables:
ŷᵢ = β₀ + β₁ * xᵢ
yᵢ = ŷᵢ + εᵢ
yᵢ = β₀ + β₁ * xᵢ + εᵢ
β₀ and β₁ are the coefficients for the population. Since we work only with sample data
and not with the entire population, we compute the sample estimates b₀ and b₁, which are
our best guesses of the population parameters β₀ and β₁. Therefore:
β₀ and β₁ are the true intercept and slope for the entire population
b₀ and b₁ are the estimated intercept and slope computed from a sample of that
population
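To make the distinction concrete, here is a minimal R sketch (using simulated data, not the fitness dataset) in which the population values of β₀ and β₁ are fixed, yet each sample produces slightly different estimates b₀ and b₁:
# Hypothetical illustration: population parameters are fixed, sample estimates vary
set.seed(42)
beta0 <- 10; beta1 <- -2                      # assumed "true" population parameters
draw_sample <- function(n = 30) {
  x <- runif(n, 0, 10)
  y <- beta0 + beta1 * x + rnorm(n, sd = 2)   # straight line plus random error
  data.frame(x = x, y = y)
}
coef(lm(y ~ x, data = draw_sample()))         # b0, b1 from the first sample
coef(lm(y ~ x, data = draw_sample()))         # slightly different b0, b1 from a second sample
Each call returns estimates close to, but not exactly equal to, 10 and -2; that gap is the sampling error discussed throughout this section.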
Now let us run the simple linear regression with Oxygen_Consumption as the
response variable and RunTime as the predictor variable in R.
Oxygen_Consumption = β₀ + β₁ * RunTime + ε
fitness.lm <- lm(Oxygen_Consumption ~ RunTime, data = fitness)
summary(fitness.lm)
The results are below. I have superscripted the numbers we are going to interpret.
Call:
lm(formula = Oxygen_Consumption ~ RunTime, data = fitness)
Residuals:
Min 1Q Median 3Q Max
-5.3311 -1.8445 -0.0599 1.5352 6.2077
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   82.4249     3.8558  21.377  < 2e-16 ***
RunTime      -3.3109¹    0.3612²  -9.165³ 4.59e-10⁴ ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple R-squared: 0.7434⁵, Adjusted R-squared: 0.7345⁶
Oxygen_Consumption = 82.4249 − 3.3109 * RunTime
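Before interpreting each number, note that the fitted equation can be used directly for prediction. For example, for a hypothetical RunTime of 10 minutes (assuming the fitness.lm object fitted above):
predict(fitness.lm, newdata = data.frame(RunTime = 10))   # model-based prediction
82.4249 - 3.3109 * 10                                     # same value by hand, about 49.32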
1. Estimate: Based on the sample data, we can say that for a unit increase in
RunTime, Oxygen_Consumption decreases by 3.3109. But remember, we are working only
with a sample, so this estimate is subject to sampling error.
2. Standard Error: Because -3.3109, the coefficient associated with RunTime, is
only a sample estimate, it has a standard error, here 0.3612. The lower this
value, the more precise our estimate of the coefficient, and vice versa.
3. t value: The estimated coefficient, -3.3109, is 9.165 standard errors (of 0.3612
each) below zero, the null-hypothesized value for the coefficient. The larger
this value is in absolute terms, the more confident we are that the population
coefficient is not equal to zero.
4. Pr(>|t|): The significance of the relationship between the predictor variable RunTime
and the response variable Oxygen_Consumption is 4.59 × 10⁻¹⁰. This is the probability of
getting an absolute t value greater than 9.165 if the null hypothesis is true.
Since this p-value is far below 0.05, we reject the null hypothesis at the 95%
confidence level and conclude that there is a relationship between
Oxygen_Consumption and RunTime.
5. R-Squared: The coefficient of determination of the model is 0.7434. We
can say that 74.34% of the total variation in the response variable,
Oxygen_Consumption, is explained by the predictor variable, RunTime.
6. Adjusted R-Squared: The adjusted R-Squared of the model is 0.7345. It adjusts
R-Squared for the number of predictors, so that adding an additional
independent variable does not overstate the variability of the response
variable explained by the model. All six of these quantities can be pulled
directly from the fitted model object, as shown in the sketch after this list.
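The following is a minimal sketch, assuming the same fitness.lm object, that extracts each of the six quantities above from the summary object and reproduces the t value and p-value from the estimate and standard error:
s <- summary(fitness.lm)
est <- coef(s)["RunTime", "Estimate"]                  # 1. estimate, about -3.3109
se  <- coef(s)["RunTime", "Std. Error"]                # 2. standard error, about 0.3612
t_val <- est / se                                      # 3. t value, about -9.165
2 * pt(abs(t_val), df = s$df[2], lower.tail = FALSE)   # 4. p-value, about 4.59e-10
s$r.squared                                            # 5. R-squared, about 0.7434
s$adj.r.squared                                        # 6. adjusted R-squared, about 0.7345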
Now let us run the anova() function on the regression object fitness.lm.
anova(fitness.lm)
The results are below. I have superscripted the numbers we are going to interpret.
Response: Oxygen_Consumption
           Df  Sum Sq Mean Sq F value    Pr(>F)
RunTime     1¹ 633.01³ 633.01⁵     84⁷ 4.59e-10⁸ ***
Residuals  29² 218.54⁴   7.54⁶
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
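As a consistency check (again assuming the fitness.lm object), the F value is the ratio of the two mean squares, and with a single predictor it equals the square of the RunTime t value from summary():
a <- anova(fitness.lm)
a["RunTime", "Mean Sq"] / a["Residuals", "Mean Sq"]   # F value, about 84
(-9.165)^2                                            # square of the t value, also about 84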
confint(fitness.lm)
The results are below and show the 95% confidence intervals of the estimates. I have
superscripted the numbers we are going to interpret.
                2.5 %     97.5 %
(Intercept) 74.53890  90.310980
RunTime¹    -4.04968  -2.572029
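These limits can be reproduced by hand as the estimate plus or minus the appropriate t quantile times the standard error, using the 29 residual degrees of freedom; a minimal sketch:
est <- -3.3109; se <- 0.3612; df <- 29
est + c(-1, 1) * qt(0.975, df) * se    # roughly (-4.05, -2.57)
Because this interval does not contain zero, it agrees with rejecting the null hypothesis of no relationship at the 5% level.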