Experiment No.2 Title: Predicting Missing Data Using Regression Modeling
Missing data (or missing values) are data values that are not stored for a variable in
the observation of interest. The problem of missing data is relatively common in data sets and
can have a significant effect on the conclusions that can be drawn from the data. Various
techniques have been proposed for handling missing values, such as deletion of
records/attributes, filling with a random value or with a measure of central tendency, and
imputation using regression. Regression imputation estimates a missing value using regression
when we know there is a correlation between the variable with missing values and other
variables. Scatterplots can be used to identify correlation between variables.
Figure 1: A scatter plot showing correlation between attributes pain and tampascale.
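Before fitting a regression line, the strength of the correlation can also be checked numerically with Pearson's correlation coefficient. A minimal sketch, using made-up pain/tampascale readings (the actual values behind Figure 1 are not reproduced here):

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sum((xi - mx) ** 2 for xi in x) ** 0.5
    sy = sum((yi - my) ** 2 for yi in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical pain / tampascale readings, for illustration only
pain = [2, 3, 5, 6, 8]
tampascale = [30, 34, 41, 45, 52]
print("r =", pearson_r(pain, tampascale))
```

A value of r close to +1 or −1 suggests the attribute pair is a good candidate for regression imputation.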
Once correlation is identified, either linear regression or multiple regression can be used for
imputation. Linear regression involves finding the "best" line, as shown in Fig. 1, to fit two
attributes (or variables) so that one attribute can be used to predict the other.
Figure 2: Example of simple linear regression, which has one independent variable
Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are fit to a multidimensional surface.
Prediction means estimating continuous or ordered values for a given input, i.e. numeric
prediction; for example, predicting the salary of an employee with 10 years of experience.
Straight-line regression analysis involves a response variable y and a single predictor
variable x, modeling y as a linear function of x as given in equation 1:

y = w0 + w1·x …………………………………………………………………..(1)

where
w0 = y-intercept
w1 = slope of the line
Calculate w0 and w1 by the method of least squares, which estimates the best-fitting
straight line. The regression coefficients are

w1 = Σ_{i=1}^{|D|} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{|D|} (x_i − x̄)² ………………..(2)

w0 = ȳ − w1·x̄ …………………………………………………..…………………….(3)

where |D| is the number of training tuples and x̄, ȳ are the mean values of x and y
respectively.
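Equations (2) and (3) can be implemented directly, with no built-in prediction package. A minimal sketch, using illustrative years-of-experience/salary values (not from a real dataset):

```python
def fit_line(xs, ys):
    """Least-squares estimates of the intercept w0 and slope w1, eqs. (2)-(3)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Equation (2): covariance term over variance term
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    # Equation (3)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Illustrative data: salary (in $1000s) vs. years of experience
years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
w0, w1 = fit_line(years, salary)
print(f"y = {w0:.2f} + {w1:.2f} x")
print(f"predicted salary at 10 years: {w0 + w1 * 10:.2f}")
```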
To handle the complications of multiple regression, we will use matrix algebra. The least
squares normal equations can be derived from the model

Y = XB

Multiplying both sides by Xᵀ, the transpose of matrix X, gives

XᵀY = XᵀXB, i.e. XᵀXB = XᵀY

To solve for the regression coefficients, simply pre-multiply by the inverse of XᵀX:

(XᵀX)⁻¹XᵀXB = (XᵀX)⁻¹XᵀY …........since (XᵀX)⁻¹XᵀX = I, the identity matrix, we get
the coefficient vector B as

B = (XᵀX)⁻¹XᵀY
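The closed-form solution B = (XᵀX)⁻¹XᵀY can be sketched with NumPy used only for the matrix algebra, not as a prediction package. The data below are made up so that the model fits exactly:

```python
import numpy as np

def multiple_regression(X, y):
    """Solve the normal equations (X^T X) B = X^T y for B.

    X : (n, p) matrix of predictor values; a column of ones is prepended
        so that B[0] is the intercept.
    """
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    XtX = X.T @ X
    Xty = X.T @ np.asarray(y, dtype=float)
    # Solving the linear system is numerically safer than forming the inverse
    return np.linalg.solve(XtX, Xty)

# Illustrative data generated from y = 1 + 2*x1 + 3*x2
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]
y = [9, 8, 19, 18, 29]
print("B =", multiple_regression(X, y))
```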
___________________________________________________________________________
Evaluate the accuracy of the prediction (use of a built-in package for prediction is not
expected).
___________________________________________________________________________
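Accuracy of the prediction can also be evaluated without built-in packages, for example with root mean squared error and the coefficient of determination R². A minimal sketch; the actual/predicted values below are illustrative:

```python
def rmse(actual, predicted):
    """Root mean squared error of the predictions."""
    n = len(actual)
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [30, 57, 64, 72, 36]
predicted = [34.1, 51.6, 55.1, 69.1, 34.1]
print("RMSE:", rmse(actual, predicted))
print("R^2 :", r_squared(actual, predicted))
```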
Results: (Program printout with output / Document printout as per the format)
Questions:
1. How will you choose between linear regression and non-linear regression?
Ans: The general guideline is to use linear regression first to determine whether it can fit the
particular type of curve in your data. If you can’t obtain an adequate fit using linear
regression, that’s when you might need to choose nonlinear regression. Linear regression is
easier to use, simpler to interpret, and you obtain more statistics that help you assess the
model. While linear regression can model curves, it is relatively restricted in the shapes of the
curves that it can fit. Sometimes it can’t fit the specific curve in your data.
Nonlinear regression can fit many more types of curves, but it can require more effort both to
find the best fit and to interpret the role of the independent variables. Additionally, R-squared
is not valid for nonlinear regression, and p-values for the parameter estimates cannot be
computed in the usual way.
A simple guess of a missing value is the mean, median, or mode (most frequently
appearing value) of that variable.
Mean, median or mode imputation only looks at the distribution of values of
the variable with missing entries. If we know there is a correlation between the
missing value and other variables, we can often get better guesses by regressing
the missing variable on other variables.
As we can see, in our example data, tip and total_bill have the highest correlation.
Thus, we can use a simple linear model regressing total_bill on tip to fill the
missing values in total_bill.
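The imputation described above can be sketched as follows, assuming a pandas DataFrame with tip and total_bill columns; the toy values below stand in for the real tips data, and the helper name is ours:

```python
import pandas as pd

def impute_by_regression(df, target, predictor):
    """Fill NaNs in df[target] using a least-squares line fitted on the
    complete rows, regressing target on predictor."""
    complete = df.dropna(subset=[target, predictor])
    x, y = complete[predictor], complete[target]
    w1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    w0 = y.mean() - w1 * x.mean()
    # fillna with a Series aligns on the index, so only missing rows change
    filled = df[target].fillna(w0 + w1 * df[predictor])
    return df.assign(**{target: filled})

# Toy stand-in for the tips data; None marks a missing total_bill
df = pd.DataFrame({
    "tip":        [1.0, 2.0, 3.0, 4.0, 2.5],
    "total_bill": [10.0, 20.0, 30.0, 40.0, None],
})
df = impute_by_regression(df, target="total_bill", predictor="tip")
print(df)
```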
When we replace the missing data with some common value, we might under- or
over-estimate it. In other words, we add some bias to our estimation.
There are many missing data imputation methods to avoid these troublesome
cases and Regression Imputation is one such method in which we estimate the
missing values by Regression using other variables as the parameters.
To add uncertainty back to the imputed values, we can add some normally
distributed noise with a mean of zero and a standard deviation equal to the
residual standard error of the regression estimates. This method is called Random
Imputation or Stochastic Regression Imputation.
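Stochastic regression imputation can be sketched by adding Gaussian noise whose standard deviation equals the residual standard error of the fit on the complete cases. The helper name and toy tip/total_bill values below are illustrative:

```python
import numpy as np

def stochastic_impute(x, y, rng):
    """Fill NaNs in y by regressing y on x, then adding N(0, s^2) noise,
    where s is the residual standard error of the fit on complete cases."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float).copy()
    obs = ~np.isnan(y)
    xo, yo = x[obs], y[obs]
    w1 = np.sum((xo - xo.mean()) * (yo - yo.mean())) / np.sum((xo - xo.mean()) ** 2)
    w0 = yo.mean() - w1 * xo.mean()
    resid = yo - (w0 + w1 * xo)
    s = np.sqrt(np.sum(resid ** 2) / (len(yo) - 2))  # residual standard error
    miss = ~obs
    y[miss] = w0 + w1 * x[miss] + rng.normal(0.0, s, size=miss.sum())
    return y

rng = np.random.default_rng(42)
tip = np.array([1.0, 2.0, 3.0, 4.0, 2.5, 3.5])
total_bill = np.array([11.0, 19.0, 31.0, 39.0, np.nan, np.nan])
filled = stochastic_impute(tip, total_bill, rng)
print(filled)
```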
In this experiment we learnt about the descriptive and proximity measures of data. We also
learnt about regression algorithms, how to implement them, and how to use them to predict
values, which further helps in analysing the dataset based on the outcome of the regression
model.
Grade: AA / AB / BB / BC / CC / CD /DD