Assignment - 03
Model Building, Selection, & Prediction
Question 1:
1. Predicting the Output Variable Y – Energy Production Prediction
a) Importing the data from a CSV file and splitting it into training and test data:
Using the read.csv() function, we can import the data into R.
INPUT:
OUTPUT:
INPUT:
OUTPUT:
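The import and split steps can be sketched roughly as follows. The original commands are not shown, so this is only an illustration: the column names (Pressure, Wind, Temp, Energy), the synthetic values, the temporary CSV file, and the 70/30 split ratio are all assumptions, not the assignment's actual data or settings.

```r
set.seed(42)  # fixed seed so the random split is reproducible

# Synthetic stand-in for the assignment data (all names and values are assumed)
n <- 100
demo <- data.frame(
  Pressure = rnorm(n, mean = 1013, sd = 8),
  Wind     = rnorm(n, mean = 10, sd = 3),
  Temp     = rnorm(n, mean = 20, sd = 5)
)
demo$Energy <- 0.5 * (demo$Pressure - 1013) + 0.2 * demo$Wind + rnorm(n, sd = 5)

# Write the data to a temporary CSV so that read.csv() has a file to load
csv_path <- tempfile(fileext = ".csv")
write.csv(demo, csv_path, row.names = FALSE)

# Import the CSV into a data frame
energy <- read.csv(csv_path)

# Random 70/30 split into training and test rows (the ratio is an assumption)
train_idx <- sample(seq_len(nrow(energy)), size = floor(0.7 * nrow(energy)))
train <- energy[train_idx, ]
test  <- energy[-train_idx, ]

dim(train)
dim(test)
```

Sampling row indices, rather than taking the first 70% of rows, avoids any ordering in the file leaking into the split.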
b) Fitting a Linear Regression Model:
Running the Linear Regression Model with all the Variables
INPUT:
OUTPUT:
The Adjusted R-Squared value is found to be 0.2366.
From the output, it can be seen that only Pressure and Wind are significant.
So, we run the model with only the Wind and Pressure variables.
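A minimal sketch of fitting the all-variable model and reading off the adjusted R-squared and the significant predictors is given here. The data are synthetic and the column names are assumptions, so the numbers it prints will not match the values reported in this section; the 5% cutoff is also an assumption.

```r
set.seed(1)  # reproducible synthetic data

# Synthetic training data; column names are assumptions
n <- 80
train <- data.frame(
  Pressure = rnorm(n, mean = 1013, sd = 8),
  Wind     = rnorm(n, mean = 10, sd = 3),
  Temp     = rnorm(n, mean = 20, sd = 5)
)
train$Energy <- 0.5 * (train$Pressure - 1013) + 0.3 * train$Wind + rnorm(n, sd = 5)

# Full model: regress Energy on every other column in the data frame
full_fit <- lm(Energy ~ ., data = train)
full_sum <- summary(full_fit)

# Adjusted R-squared of the full model
full_sum$adj.r.squared

# Coefficient table; the "Pr(>|t|)" column holds the p-values
coefs <- full_sum$coefficients
coefs

# Predictors significant at the 5% level (intercept excluded)
sig <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
setdiff(sig, "(Intercept)")
```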
Reduced Regression Model (Wind and Pressure Variables only)
INPUT:
OUTPUT:
We remove the Wind variable, since the Adjusted R-Squared value is only 0.0229, and run the regression using only the Pressure variable.
Running the Regression model with only the Pressure Variable:
INPUT:
OUTPUT:
The Adjusted R-Squared value is found to be 0.219, which is lower than that of the previous regression models.
An ANOVA test is conducted to compare the model with all variables, the reduced model, and the pressure-only model.
INPUT:
OUTPUT:
Between the all-variable model and the reduced model, the p-value is found to be 0.2578, so we fail to reject the null hypothesis and use the reduced model.
Between the pressure model and the reduced model, the p-value is found to be 0.0768, so at the 5% significance level we again fail to reject the null hypothesis and use the pressure model.
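The nested-model comparisons can be sketched with anova(), which runs a partial F-test whose null hypothesis is that the extra terms in the larger model have zero coefficients. The data and column names in this sketch are synthetic assumptions, so the p-values it produces are illustrative only.

```r
set.seed(2)  # reproducible synthetic data

# Synthetic training data; column names are assumptions
n <- 80
train <- data.frame(
  Pressure = rnorm(n, mean = 1013, sd = 8),
  Wind     = rnorm(n, mean = 10, sd = 3),
  Temp     = rnorm(n, mean = 20, sd = 5)
)
train$Energy <- 0.5 * (train$Pressure - 1013) + 0.3 * train$Wind + rnorm(n, sd = 5)

# Three nested models, from largest to smallest
full_fit     <- lm(Energy ~ Pressure + Wind + Temp, data = train)
reduced_fit  <- lm(Energy ~ Pressure + Wind, data = train)
pressure_fit <- lm(Energy ~ Pressure, data = train)

# Partial F-tests; the smaller model is listed first
cmp_full_vs_reduced     <- anova(reduced_fit, full_fit)
cmp_reduced_vs_pressure <- anova(pressure_fit, reduced_fit)

# The p-value of each comparison sits in the second row of "Pr(>F)"
p_full    <- cmp_full_vs_reduced[["Pr(>F)"]][2]
p_reduced <- cmp_reduced_vs_pressure[["Pr(>F)"]][2]

# A p-value above 0.05 means the smaller model is not significantly worse
c(full_vs_reduced = p_full, reduced_vs_pressure = p_reduced)
```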
Running Best Subset to find the model:
Best subset selection computes fit statistics for the best model of each size and prints them for comparison, from which we can select the appropriate variables.
INPUT:
OUTPUT:
The RSS value decreases as the number of variables increases.
The model with 5 variables has the highest Adjusted R-Squared.
The model with 3 variables has the smallest AIC (or Cp).
The model with 8 variables has the smallest BIC.
Since the best subset approach gives a broad result, we check the predicted R-squared and use the model with the highest R-squared and the lowest RMSE.
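To illustrate what a best subset search does, here is a hedged base-R sketch that fits every non-empty subset of predictors and keeps the lowest-RSS model of each size. It is a hand-written exhaustive search, not the function used for the output above, and the data and column names are synthetic assumptions.

```r
set.seed(3)  # reproducible synthetic data

# Synthetic training data; column names are assumptions
n <- 100
train <- data.frame(
  Pressure = rnorm(n, mean = 1013, sd = 8),
  Wind     = rnorm(n, mean = 10, sd = 3),
  Temp     = rnorm(n, mean = 20, sd = 5),
  Humidity = rnorm(n, mean = 60, sd = 10)
)
train$Energy <- 0.5 * (train$Pressure - 1013) + 0.3 * train$Wind + rnorm(n, sd = 5)

predictors <- setdiff(names(train), "Energy")

# Fit every non-empty subset of predictors and record its statistics
results <- do.call(rbind, lapply(seq_along(predictors), function(k) {
  do.call(rbind, combn(predictors, k, simplify = FALSE, FUN = function(vars) {
    fit <- lm(reformulate(vars, response = "Energy"), data = train)
    data.frame(
      size  = k,
      vars  = paste(vars, collapse = " + "),
      rss   = sum(resid(fit)^2),
      adjr2 = summary(fit)$adj.r.squared,
      aic   = AIC(fit),
      bic   = BIC(fit)
    )
  }))
}))

# Keep the lowest-RSS model of each size
best_by_size <- do.call(rbind, lapply(split(results, results$size), function(d) {
  d[which.min(d$rss), ]
}))
best_by_size

# Model size preferred by each criterion
best_by_size$size[which.max(best_by_size$adjr2)]
best_by_size$size[which.min(best_by_size$aic)]
best_by_size$size[which.min(best_by_size$bic)]
```

Because the criteria penalize model size differently, they can point to different sizes, which is why a held-out check is a reasonable tie-breaker.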
R-squared and RMSE of Prediction:
For the model with all variables:
INPUT:
OUTPUT:
For the Reduced Model with Pressure and Wind Variables:
INPUT:
OUTPUT:
Single Model with Pressure as the only independent variable:
INPUT:
OUTPUT:
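The test-set comparison of the three models can be sketched as below. The helper eval_model is hypothetical, the data and column names are synthetic assumptions, and the predicted R-squared is computed as 1 - SSE/SST on the test data, which may differ from the definition used for the outputs above.

```r
set.seed(4)  # reproducible synthetic data and split

# Synthetic data; column names are assumptions
n <- 120
energy <- data.frame(
  Pressure = rnorm(n, mean = 1013, sd = 8),
  Wind     = rnorm(n, mean = 10, sd = 3),
  Temp     = rnorm(n, mean = 20, sd = 5)
)
energy$Energy <- 0.5 * (energy$Pressure - 1013) + 0.3 * energy$Wind + rnorm(n, sd = 5)

# 70/30 split (ratio assumed)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- energy[train_idx, ]
test  <- energy[-train_idx, ]

# The three candidate models, all fitted on the training data only
fits <- list(
  all      = lm(Energy ~ Pressure + Wind + Temp, data = train),
  reduced  = lm(Energy ~ Pressure + Wind, data = train),
  pressure = lm(Energy ~ Pressure, data = train)
)

# Hypothetical helper: out-of-sample R-squared and RMSE for one fitted model
eval_model <- function(fit, newdata, response = "Energy") {
  pred   <- predict(fit, newdata = newdata)
  actual <- newdata[[response]]
  sse    <- sum((actual - pred)^2)
  sst    <- sum((actual - mean(actual))^2)
  c(R2 = 1 - sse / sst, RMSE = sqrt(mean((actual - pred)^2)))
}

# One row per model, one column per metric
metrics <- t(sapply(fits, eval_model, newdata = test))
metrics
```

The preferred model is the row with the highest R2 and the lowest RMSE.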
Summary:
From the analysis, we can conclude that the model with Pressure as the only independent variable is better than the other models. Its Adjusted R-Squared value of 0.31 is the highest, and its RMSE is also the lowest.
Based on the Adjusted R-Squared value, we conclude that the pressure model is the best; it explains about 31% of the variance in energy production.
c) Backward Selection Approach:
Regression Model using all the variables:
INPUT:
OUTPUT:
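A hedged sketch of backward selection by AIC follows. It uses step() from base R with direction = "backward" as a stand-in; the function used for the output above is presumably stepAIC from the MASS package, which works along the same lines. The data and column names are synthetic assumptions.

```r
set.seed(5)  # reproducible synthetic data

# Synthetic training data; column names are assumptions
n <- 100
train <- data.frame(
  Pressure = rnorm(n, mean = 1013, sd = 8),
  Wind     = rnorm(n, mean = 10, sd = 3),
  Temp     = rnorm(n, mean = 20, sd = 5),
  Humidity = rnorm(n, mean = 60, sd = 10)
)
train$Energy <- 0.5 * (train$Pressure - 1013) + 0.3 * train$Wind + rnorm(n, sd = 5)

# Start from the model with all variables
full_fit <- lm(Energy ~ ., data = train)

# Drop one term at a time for as long as doing so lowers the AIC
back_fit <- step(full_fit, direction = "backward", trace = 0)

# Variables kept by the backward search
kept <- attr(terms(back_fit), "term.labels")
kept

# Compare the selected model with the full model
summary(back_fit)$adj.r.squared
c(full = AIC(full_fit), backward = AIC(back_fit))
```

Setting trace = 0 suppresses the step-by-step log; leaving it at the default prints the AIC of each candidate deletion.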
Conclusion:
The backward stepAIC function gives a slightly different result than the models generated above. However, when we fit the resulting regression model, we see a lower R-squared value than that of our single-variable model.