Lab 5 LR
This analysis examines two datasets: Property Valuation Data and Clathrate Formation Data.
Both datasets were subjected to linear regression modeling and diagnostic tests to assess
model fit, normality of residuals, presence of outliers, and other key assumptions. The analysis
includes interpretation of various plots and statistical tests to evaluate the validity and
reliability of the regression models.
Objectives
Residuals:
   Min     1Q Median     3Q    Max
-5.281 -1.887 -0.152  2.161  4.546
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.39273    8.59315   0.744    0.470
x1           2.53344    1.51623   1.671    0.119
x2           2.27081    5.18611   0.438    0.669
x3           0.69366    0.63312   1.096    0.293
x4           1.04991    5.57969   0.188    0.854
x5           0.18550    0.37228   0.498    0.627
x6           0.96185    2.75860   0.349    0.733
x7          -1.28039    3.52793  -0.363    0.722
x8           0.03784    0.10252   0.369    0.718
x9           0.56956    2.33648   0.244    0.811
Shapiro-Wilk normality test
data: fit$residuals
W = 0.97056, p-value = 0.7024
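The W statistic and p-value above are in the form printed by `shapiro.test()`. As a minimal sketch of the kind of call that produces such output, the following uses simulated data, because the `l41` data frame is not reproduced in this report; the data frame `d`, its columns, and `fit_demo` are hypothetical names introduced only for illustration.

```r
# Simulated stand-in for the lab data (hypothetical; not l41)
set.seed(1)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
d$y <- 2 + 1.5 * d$x1 - 0.5 * d$x2 + rnorm(30)

# Fit a linear model and test the residuals for normality
fit_demo <- lm(y ~ x1 + x2, data = d)
sw <- shapiro.test(residuals(fit_demo))
print(sw)

# TRUE when the test does not reject normality at the 0.05 level
normal_ok <- sw$p.value > 0.05
print(normal_ok)
```

With a p-value such as the 0.7024 reported above, the test gives no evidence against normally distributed residuals.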
> # Question 3: Construct and interpret the plot of residuals versus predicted responses
> plot(fit$fitted.values, fit$residuals,
+ xlab = "Fitted values", ylab = "Residuals",
+ main = "Residuals vs Fitted")
> abline(h = 0, col = "red")
> # Question 4: Check for outliers
> boxplot(l41$y, main = "Boxplot of y", horizontal = TRUE)
> # Identify outliers
> outliers = outlier(l41$y)
> print(outliers)
[1] 45.8
> # Treat outliers by imputing with mean
> l41$y_treated = imputate_outlier(l41, y, method = "mean")
> # Question 5: Plot the residual plots of the lm function
> par(mfrow = c(2, 2), mar=c(2,2,1,1))
> plot(fit)
Call:
lm(formula = y ~ x1 + x2, data = l42)
Residuals:
    Min      1Q  Median      3Q     Max
-7.3359 -2.4682  0.0104  2.8996  4.3209
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  13.878706   1.618607   8.574 3.63e-07 ***
x1          -21.793748 196.865674  -0.111    0.913
x2            0.095717   0.009761   9.806 6.46e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Shapiro-Wilk normality test
data: fitt$residuals
W = 0.9449, p-value = 0.3506
> # Question 3: Construct and interpret the plot of residuals versus predicted responses
> plot(fitt$fitted.values, fitt$residuals,
+ xlab = "Fitted values", ylab = "Residuals",
+ main = "Residuals vs Fitted")
> abline(h = 0, col = "red")
> # Question 4: Check for outliers
> boxplot(l42$y, main = "Boxplot of y", horizontal = TRUE)
> # Identify outliers
> outliers = outlier(l42$y)
> print(outliers)
[1] 7.5
> # Treat outliers by imputing with mean
> l42$y_treated = imputate_outlier(l42, y, method = "mean")
Interpretation
This Normal Q-Q plot shows a good alignment between sample and theoretical quantiles,
indicating the data generally follows a normal distribution. There's a slight deviation at the
tails, particularly for higher values, suggesting some potential outliers or a slightly
heavy-tailed distribution.
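A Q-Q plot of this kind can be drawn in base R with `qqnorm()` and `qqline()`. The sketch below uses simulated values as a stand-in for the model residuals (the vector `res` is hypothetical); with the lab's own model it would be `residuals(fit)`.

```r
# Simulated stand-in for model residuals (hypothetical data)
set.seed(3)
res <- rnorm(25)

# qqnorm() draws the plot and returns the plotted coordinates:
# x = theoretical normal quantiles, y = sample quantiles
qq <- qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res, col = "red")

# Correlation between theoretical and sample quantiles;
# values near 1 correspond to points lying close to a straight line
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```

Points that bend away from the reference line at either end correspond to the tail deviations described above.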
1. Fitting a suitable linear regression model:
- The model includes all variables (x1 to x9).
- The adjusted R-squared is 0.6807, so the predictors explain about 68% of the variance in y after
adjusting for the number of predictors.
- The overall model is significant (F-statistic p-value = 0.001753), but none of the individual
predictors is statistically significant at the 0.05 level.
- Residuals vs Leverage: No points outside Cook's distance lines, suggesting no highly influential
points.
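The adjusted R-squared, the overall F-test p-value, and the individual coefficient p-values quoted in item 1 can all be read from the object returned by `summary()` on an `lm` fit. The following is a minimal sketch on simulated data (the data frame `d` and all values in it are hypothetical, not the Property Valuation data); it also recomputes the adjusted R-squared from the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the number of predictors.

```r
# Simulated stand-in data (hypothetical; not the lab dataset)
set.seed(42)
n <- 24
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 0.8 * d$x1 + 0.5 * d$x2 + rnorm(n)

s <- summary(lm(y ~ ., data = d))

# Adjusted R-squared as reported by summary()
adj_r2 <- s$adj.r.squared

# Overall F test: fstatistic holds the statistic and its two degrees of freedom
f <- s$fstatistic
f_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Recompute adjusted R-squared by hand; p = number of predictors
p <- unname(f["numdf"])
adj_check <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

# p-values of the individual t tests, intercept dropped
coef_p <- s$coefficients[-1, "Pr(>|t|)"]

print(c(adj_r2 = adj_r2, adj_check = adj_check, f_p = f_p))
print(coef_p)
```

Comparing `f_p` with `coef_p` shows directly whether a model is significant overall while its individual predictors are not, which is the situation reported for the first dataset.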
5. Plots interpretation:
A. Residuals vs Fitted:
The residuals appear to be scattered fairly randomly around the horizontal line at 0, with no
clear pattern.
There's some variability in the spread, but no strong funneling or systematic curvature.
Points 6, 22, and 8 are labeled as potential outliers.
B. Q-Q Residuals:
Most points follow the diagonal line closely, suggesting the residuals are approximately
normally distributed.
There's some deviation at the tails, particularly for points 8 and 22.
C. Scale-Location:
The spread of residuals appears fairly consistent across the range of fitted values.
There's a slight upward trend in the smoothed line, but it's not severe.
Points 8, 22, and 16 are highlighted as potential outliers.
D. Residuals vs Leverage:
Most points fall within Cook's distance contours, indicating they don't have excessive
influence.
Point 22 has high leverage and a large residual, potentially influencing the model fit.
Point 16 has high leverage but a smaller residual.
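The leverage and influence readings in panel D can also be checked numerically with `hatvalues()` and `cooks.distance()`. The sketch below uses simulated data (the objects `x`, `y`, and `fit_demo` are hypothetical). The leverage cutoff of 2p/n is a common rule of thumb and an assumption of this example; the Cook's distance cutoff of 0.5 matches the first contour that `plot()` draws by default for an `lm` fit.

```r
# Simulated stand-in data (hypothetical; not the lab dataset)
set.seed(7)
n <- 25
x <- rnorm(n)
y <- 3 + 2 * x + rnorm(n)
fit_demo <- lm(y ~ x)

# Leverage (diagonal of the hat matrix) and Cook's distance per observation
h <- hatvalues(fit_demo)
cd <- cooks.distance(fit_demo)

# Number of estimated coefficients, including the intercept
p <- length(coef(fit_demo))

# Rule-of-thumb flags: leverage above 2p/n, Cook's distance above 0.5
high_leverage <- which(h > 2 * p / n)
influential <- which(cd > 0.5)

print(high_leverage)
print(influential)
```

An observation listed in both `high_leverage` and `influential` combines high leverage with a large residual, which is the pattern described for point 22 above.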
Conclusion:
Both datasets yielded regression models with significant overall fit, although individual
predictor significance varied. The Property Valuation model explained about 68% of the variance,
while the Clathrate Formation model accounted for 86%. Residual analyses generally supported
the assumptions of normality, linearity, and homoscedasticity, with some minor deviations.
Outliers were identified and treated in both datasets. The Q-Q plots for both models showed
good alignment with theoretical normal distributions, with slight deviations at the tails.
Overall, the analyses provide valuable insights into the relationships between variables and the
robustness of the regression models, while also highlighting areas for potential further
investigation or model refinement.