Lab 5 LR

Uploaded by

aniket nivesh

Introduction:

This analysis examines two datasets: Property Valuation Data and Clathrate Formation Data.
Both datasets were subjected to linear regression modeling and diagnostic tests to assess model
fit, normality of residuals, presence of outliers, and other key assumptions. The analysis
includes interpretation of various plots and statistical tests to evaluate the validity and
reliability of the regression models.

Dataset: “Property valuation data”


Source: “Prediction, Linear Regression and Minimum Sum of Relative Errors,” by S. C. Narula and
J. F. Wellington, Technometrics, 19, 1977. Also see “Letter to the Editor,” Technometrics, 22, 1980.

Dataset: “Clathrate Formation Data”


Source: “Study on a Cool Storage System Using HCFC (Hydro-chloro-fluoro-carbon)-141b
(1,1-dichloro-1-fluoro-ethane) Clathrate,” by T. Tanii, M. Minemoto, K. Nakazawa, and Y. Ando,
Canadian Journal of Chemical Engineering

Objectives

1. Fit and evaluate a linear regression model

2. Assess normality of residuals via Q-Q plot

3. Analyze residuals vs predicted responses plot

4. Identify and treat outliers

5. Interpret comprehensive residual plots

Source code and results


> l41 <- read.csv("l41.csv", header = TRUE)
> l42 <- read.csv("l42.csv", header = TRUE)
> library(ggplot2)
> library(GGally)
> library(dplyr)
> library(stats)
> library(tidyverse)
> library(corrplot)
> library(outliers)
> library(dlookr)
> # Question 1: Fit a suitable linear regression model
> fit = lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9, data = l41)
> summary(fit)
Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9,
data = l41)

Residuals:
   Min     1Q Median     3Q    Max
-5.281 -1.887 -0.152  2.161  4.546

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.39273    8.59315   0.744    0.470
x1           2.53344    1.51623   1.671    0.119
x2           2.27081    5.18611   0.438    0.669
x3           0.69366    0.63312   1.096    0.293
x4           1.04991    5.57969   0.188    0.854
x5           0.18550    0.37228   0.498    0.627
x6           0.96185    2.75860   0.349    0.733
x7          -1.28039    3.52793  -0.363    0.722
x8           0.03784    0.10252   0.369    0.718
x9           0.56956    2.33648   0.244    0.811

Residual standard error: 3.634 on 13 degrees of freedom
Multiple R-squared:  0.8113,    Adjusted R-squared:  0.6807
F-statistic: 6.212 on 9 and 13 DF,  p-value: 0.001753

> # Question 2: Construct a normal probability plot of the residuals


> qqnorm(fit$residuals, main = "Normal Q-Q Plot")
> qqline(fit$residuals)
> # Shapiro-Wilk test for normality
> shapiro.test(fit$residuals)

Shapiro-Wilk normality test

data: fit$residuals
W = 0.97056, p-value = 0.7024

> # Question 3: Construct and interpret the plot of residuals versus predicted responses
> plot(fit$fitted.values, fit$residuals,
+ xlab = "Fitted values", ylab = "Residuals",
+ main = "Residuals vs Fitted")
> abline(h = 0, col = "red")
> # Question 4: Check for outliers
> boxplot(l41$y, main = "Boxplot of y", horizontal = TRUE)
> # Identify outliers
> outliers = outlier(l41$y)
> print(outliers)
[1] 45.8
> # Treat outliers by imputing with mean
> l41$y_treated = imputate_outlier(l41, y, method = "mean")
> # Question 5: Plot the residual plots of the lm function
> par(mfrow = c(2, 2), mar=c(2,2,1,1))
> plot(fit)

> # Question 1: Fit a suitable linear regression model


> fitt = lm(y ~ x1 + x2, data = l42)
> summary(fitt)

Call:
lm(formula = y ~ x1 + x2, data = l42)

Residuals:
    Min      1Q  Median      3Q     Max
-7.3359 -2.4682  0.0104  2.8996  4.3209

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  13.878706   1.618607   8.574 3.63e-07 ***
x1          -21.793748 196.865674  -0.111    0.913
x2            0.095717   0.009761   9.806 6.46e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.625 on 15 degrees of freedom
Multiple R-squared:  0.878,     Adjusted R-squared:  0.8617
F-statistic: 53.97 on 2 and 15 DF,  p-value: 1.405e-07

> # Question 2: Construct a normal probability plot of the residuals


> qqnorm(fitt$residuals, main = "Normal Q-Q Plot")
> qqline(fitt$residuals)
> # Shapiro-Wilk test for normality
> shapiro.test(fitt$residuals)

Shapiro-Wilk normality test

data: fitt$residuals
W = 0.9449, p-value = 0.3506

> # Question 3: Construct and interpret the plot of residuals versus predicted responses
> plot(fitt$fitted.values, fitt$residuals,
+ xlab = "Fitted values", ylab = "Residuals",
+ main = "Residuals vs Fitted")
> abline(h = 0, col = "red")
> # Question 4: Check for outliers
> boxplot(l42$y, main = "Boxplot of y", horizontal = TRUE)
> # Identify outliers
> outliers = outlier(l42$y)
> print(outliers)
[1] 7.5
> # Treat outliers by imputing with mean
> l42$y_treated = imputate_outlier(l42, y, method = "mean")

> # Question 5: Plot the residual plots of the lm function


> par(mfrow = c(2, 2), mar=c(2,2,1,1))
> plot(fitt)

Interpretation

Dataset 1: Property Valuation Data


y : Sale price of the house/1000
x1 : Taxes (local, school, county)/1000
x2 : Number of baths
x3 : Lot size (sq ft × 1000)
x4 : Living space (sq ft × 1000)
x5 : Number of garage stalls
x6 : Number of rooms
x7 : Number of bedrooms
x8 : Age of the home (years)
x9 : Number of fireplaces

This Normal Q-Q plot shows a good alignment between sample and theoretical quantiles,
indicating the data generally follows a normal distribution. There's a slight deviation at the
tails, particularly for higher values, suggesting some potential outliers or a slightly
heavy-tailed distribution.
1. Fitting a suitable linear regression model:
- The model includes all nine predictors (x1 to x9).
- The multiple R-squared is 0.8113, so the predictors explain about 81% of the variance in y. The adjusted R-squared, which penalizes the model for its nine predictors, is 0.6807.
- The overall model is significant (F-statistic p-value = 0.001753), but none of the individual predictors is statistically significant at the 0.05 level. A significant overall fit with no significant individual coefficient often points to correlated predictors.
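
The adjusted R-squared and the overall F-statistic can both be recovered from the multiple R-squared, which is a quick way to check the summary figures. The sketch below assumes n = 23 observations, inferred from the 13 residual degrees of freedom plus 10 estimated coefficients; the variable names are illustrative only.

```r
# Summary figures reported by summary(fit) for the property valuation model
r2 <- 0.8113   # multiple R-squared
n  <- 23       # assumed: 13 residual df + 10 estimated coefficients
p  <- 9        # predictors x1..x9

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic and its p-value, derived from R-squared
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
f_pval <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

print(round(c(adj_r2 = adj_r2, f_stat = f_stat, f_pval = f_pval), 4))
```

Small differences from the printed summary are expected because the inputs here are already rounded.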

2. Normal probability plot of the residuals:


- The Shapiro-Wilk test gives W = 0.97056 with a p-value of 0.7024, which is > 0.05, so we fail to reject the null hypothesis of normality. The residuals are consistent with a normal distribution.
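
The decision rule for the Shapiro-Wilk test is easy to get backwards, so a small sketch may help. The helper `normality_verdict` is hypothetical (it is not part of any package), and the 0.05 threshold is the one used in this report.

```r
# Hypothetical helper: turn a Shapiro-Wilk p-value into a plain-language verdict.
# The null hypothesis is that the data are normal, so a LARGE p-value means
# there is no evidence against normality.
normality_verdict <- function(p_value, alpha = 0.05) {
  if (p_value > alpha) "consistent with normality" else "evidence against normality"
}

print(normality_verdict(0.7024))   # p-value reported for the property valuation model
print(normality_verdict(0.3506))   # p-value reported for the clathrate model
```

On a fitted model the same helper could be applied as `normality_verdict(shapiro.test(residuals(fit))$p.value)`.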

3. Plot of residuals versus predicted responses:


- There is no clear pattern visible in the residuals vs. fitted plot, which is good.
- The residuals appear to be randomly scattered around the horizontal line at 0.

4. Outliers in the data:


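- The boxplot and the outlier() function flagged one value of y, 45.8, as a potential outlier. Because outlier() always returns the value farthest from the sample mean, this flag alone is not a formal test.
- The flagged value was replaced with the mean of y and stored in y_treated. The regression model itself was not refit on the treated response.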
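
To make the two flagging rules concrete, here is a minimal base-R sketch on made-up numbers (`toy_y` is hypothetical, not the lab data). It mimics the "farthest from the mean" idea behind `outlier()` and the 1.5 x IQR rule used by `boxplot()`. The imputation step uses the mean of the non-flagged values, which is an assumption of this sketch and may differ from what `imputate_outlier()` does internally.

```r
# Made-up response values for illustration only
toy_y <- c(10, 12, 11, 13, 12, 14, 11, 30)

# Rule 1: the value with the largest absolute deviation from the sample mean.
# This always returns some value, even when nothing is unusual.
farthest <- toy_y[which.max(abs(toy_y - mean(toy_y)))]

# Rule 2: values beyond 1.5 x IQR from the hinges, as drawn by boxplot()
box_out <- boxplot.stats(toy_y)$out

# Mean imputation (assumption: mean of the values that were not flagged)
toy_treated <- toy_y
toy_treated[toy_y %in% box_out] <- mean(toy_y[!(toy_y %in% box_out)])

print(farthest)
print(box_out)
print(toy_treated)
```

Replacing a response value with the mean changes the data the model is fit to, so the model would need to be refit on the treated column for the treatment to have any effect.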

5. Residual plots interpretation:


- Residuals vs Fitted: No clear pattern, suggesting the linearity assumption is met.
- Normal Q-Q: Points generally follow the line, supporting the normality assumption.
- Scale-Location: No clear pattern and a roughly even spread of points, suggesting homoscedasticity (constant variance).
- Residuals vs Leverage: No points outside the Cook's distance lines, suggesting no highly influential points.
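
The quantities behind the Residuals vs Leverage panel can also be inspected numerically. The sketch below uses the built-in `mtcars` data purely as a stand-in for `l41`; the cut-offs (leverage above 2k/n, Cook's distance above 0.5) are common rules of thumb rather than formal tests.

```r
# mtcars ships with R and is only a stand-in for the lab data
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)
k <- length(coef(toy_fit))        # number of coefficients, including the intercept

lev  <- hatvalues(toy_fit)        # leverage of each observation
cook <- cooks.distance(toy_fit)   # influence of each observation on the fit

# Rule-of-thumb flags (conventions, not hypothesis tests)
diag_tbl <- data.frame(leverage = lev,
                       cooks_d = cook,
                       high_leverage = lev > 2 * k / n,
                       influential = cook > 0.5)

# Show only the observations that trip at least one flag
print(diag_tbl[diag_tbl$high_leverage | diag_tbl$influential, ])
```

For the lab models, replacing `toy_fit` with `fit` or `fitt` would give the same table for the property valuation and clathrate data.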

Dataset 2: Clathrate Formation Data


Variables:
y : Clathrate formation (mass %)
x1 : Amount of surfactant (mass %)
x2 : Time (minutes)
This Normal Q-Q plot shows a good alignment between sample and theoretical quantiles,
indicating the data generally follows a normal distribution. There's a slight deviation at the
tails, particularly for higher values, suggesting some potential outliers or a slightly
heavy-tailed distribution.
1. Fitting a suitable linear regression model:
- The model includes only x1 and x2 as predictors.
- The multiple R-squared is 0.878, so x1 and x2 together explain about 88% of the variance in y. The adjusted R-squared is 0.8617.
- The overall model is significant (F-statistic p-value = 1.405e-07).
- x2 is a significant predictor (p < 0.001), but x1 is not (p = 0.913).
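
The t values and p-values in the coefficient table follow directly from the estimates and standard errors, which makes it clear why x1 is far from significant: its standard error is roughly nine times its estimate. A short sketch using the numbers printed by summary(fitt), with 15 residual degrees of freedom:

```r
# Estimates and standard errors copied from the summary(fitt) output
est <- c(x1 = -21.793748, x2 = 0.095717)
se  <- c(x1 = 196.865674, x2 = 0.009761)
df_resid <- 15

# t statistic = estimate / standard error; two-sided p-value from the t distribution
t_val <- est / se
p_val <- 2 * pt(abs(t_val), df = df_resid, lower.tail = FALSE)

print(data.frame(estimate = est,
                 std_error = se,
                 t_value = round(t_val, 3),
                 p_value = signif(p_val, 3)))
```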

2. Normal probability plot of the residuals:


- The Shapiro-Wilk test gives W = 0.9449 with a p-value of 0.3506, which is > 0.05, so we fail to reject the null hypothesis of normality. The residuals are consistent with a normal distribution.

3. Plot of residuals versus predicted responses:


- There is no clear pattern visible in the residuals vs. fitted plot, which is good.
- The residuals appear to be randomly scattered around the horizontal line at 0.

4. Outliers in the data:


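- The outlier() function flagged one value of y, 7.5, as a potential outlier. Because outlier() always returns the value farthest from the sample mean, this flag alone is not a formal test.
- The flagged value was replaced with the mean of y and stored in y_treated. The regression model itself was not refit on the treated response.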

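5. Plots interpretation:

The point labels quoted below (6, 8, 16, 22) should be checked against the data: this dataset has only 18 observations (15 residual degrees of freedom plus 3 coefficients), so a label of 22 is not possible for the clathrate model.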

A. Residuals vs Fitted:

- The residuals appear to be scattered fairly randomly around the horizontal line at 0, with no clear pattern.
- There's some variability in the spread, but no strong funneling or systematic curvature.
- Points 6, 22, and 8 are labeled as potential outliers.

B. Q-Q Residuals:

- Most points follow the diagonal line closely, suggesting the residuals are approximately normally distributed.
- There's some deviation at the tails, particularly for points 8 and 22.

C. Scale-Location:

- The spread of residuals appears fairly consistent across the range of fitted values.
- There's a slight upward trend in the smoothed line, but it's not severe.
- Points 8, 22, and 16 are highlighted as potential outliers.

D. Residuals vs Leverage:

- Most points fall within Cook's distance contours, indicating they don't have excessive influence.
- Point 22 has high leverage and a large residual, potentially influencing the model fit.
- Point 16 has high leverage but a smaller residual.

Conclusion:
Both datasets yielded regression models with a significant overall fit, although individual
predictor significance varied. The Property Valuation model had an adjusted R-squared of about
0.68 with no individually significant predictor, while the Clathrate Formation model had an
adjusted R-squared of about 0.86 with time (x2) as the only significant predictor. Residual
analyses generally supported the assumptions of normality, linearity, and homoscedasticity,
with some minor deviations. One potential outlier in y was flagged and mean-imputed in each
dataset. The Q-Q plots for both models showed good alignment with theoretical normal
distributions, with slight deviations at the tails. Areas for further investigation include
the lack of individually significant predictors in the Property Valuation model and the
non-significant surfactant term (x1) in the Clathrate Formation model.
