
KJSCE/IT/TY/SEMVI/EDA/2020-21

Experiment No.2

Title: Predicting missing data using regression modeling

(A constituent college under Somaiya Vidyavihar University)



Batch: B-1    Roll No.: 1814091    Experiment No.: 2

Aim: Predict missing data using regression modeling.


___________________________________________________________________________
Resources needed: Any programming language, any data source (RDBMS/Excel/CSV)
___________________________________________________________________________
Theory:

Missing data (or a missing value) is defined as a data value that is not stored for a variable in
the observation of interest. The problem of missing data is relatively common in real data sets
and can have a significant effect on the conclusions that can be drawn from the data. Various
techniques have been proposed for handling missing values, such as deleting records or
attributes, filling in a random value or a measure of central tendency, and imputation using
regression. Regression imputation estimates a missing value by regressing the variable with
missing entries on other variables with which it is correlated. Scatter plots can be used to
identify correlation between variables.
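Since the decision to use regression imputation hinges on that correlation, it helps to quantify what the scatter plot shows. A minimal from-scratch sketch of Pearson's correlation coefficient in Python (the data values below are purely illustrative):

```python
def pearson_r(xs, ys):
    """Pearson's r between two attributes, computed from scratch."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Illustrative attribute values; r near +1 or -1 marks the pair as a
# good candidate for regression imputation.
r = pearson_r([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.1, 9.8])
```

A value of r close to ±1 suggests a linear model of one attribute on the other will give good imputed values; values near 0 suggest regression imputation will do little better than mean imputation.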

Figure 1: A scatter plot showing correlation between attributes pain and tampascale.

Once correlation is identified, either linear regression or multiple linear regression can be used
for imputation. Linear regression involves finding the "best" line, as shown in Fig. 1, that fits
two attributes (or variables) so that one attribute can be used to predict the other.

Figure 2: Example of simple linear regression, which has one independent variable


Multiple linear regression is an extension of linear regression in which more than two
attributes are involved and the data are fit to a multidimensional surface.

Prediction means predicting continuous or ordered values for a given input, i.e. numeric
prediction; for example, predicting the salary of an employee with 10 years of experience.

Simple Linear Regression:

Straight-line regression analysis involves a response variable y and a single predictor
variable x, modeling y as a linear function of x as given in equation (1):

y = w0 + w1·x ………………………………………………………………….. (1)

where w0 and w1 are regression coefficients:

w0 = y-intercept
w1 = slope of the line

w0 and w1 are calculated by the method of least squares, which estimates the best-fitting straight line.

Let D be a training set,


D = {(x1, y1), (x2, y2), (x3, y3), …, (xn, yn)}

Regression coefficients,

w1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)² , with both sums taken over i = 1 to |D| ………………..(2)

w0 = ȳ − w1·x̄ …………………………………………………..…………………….(3)

where x̄ is the mean value of x1, x2, x3, …, xn,

and ȳ is the mean value of y1, y2, y3, …, yn.
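Since the procedure asks for prediction without a built-in fitting package, equations (2) and (3) can be computed directly. A minimal Python sketch, using a small illustrative years-of-experience vs. salary dataset:

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates w0, w1 for y = w0 + w1*x (equations 2 and 3)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Equation (2): w1 = sum((xi - x_mean)(yi - y_mean)) / sum((xi - x_mean)^2)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    w1 = num / den
    # Equation (3): intercept from the means and the slope
    w0 = y_mean - w1 * x_mean
    return w0, w1

# Illustrative data: years of experience vs. salary (in $1000s)
years  = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
w0, w1 = fit_simple_linear(years, salary)
pred = w0 + w1 * 10   # predicted salary for 10 years of experience, ~58.6
```

The same fitted line is what regression imputation evaluates at each record whose y value is missing.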


Multiple Linear Regression:
Multiple linear regression is used to explain the relationship between one continuous
dependent variable and two or more independent variables. The independent variables can
be continuous or categorical. The formula can be written as Y = b0 + b1X1 + b2X2 + … + bkXk,
or in matrix form as Y = Xb.


To handle the complications of multiple regression, we use matrix algebra. Starting from
Y = Xb, multiply both sides by X^T (the transpose of the matrix X) to obtain the least-squares
normal equations:
X^T X b = X^T Y
To solve for the regression coefficients, pre-multiply both sides by the inverse of X^T X:
(X^T X)^-1 X^T X b = (X^T X)^-1 X^T Y …….. since (X^T X)^-1 X^T X = I, the identity matrix,
we get the coefficient vector
b = (X^T X)^-1 X^T Y
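The closed-form normal-equation solution above can be sketched in Python, using NumPy only for the matrix algebra (the data are illustrative, built so the true coefficients are recoverable):

```python
import numpy as np

# Illustrative predictors and an exact linear response y = 1 + 2*x1 + 3*x2,
# so the normal equations should recover b = [1, 2, 3].
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = 1.0 + 2.0 * x1 + 3.0 * x2

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# b = (X^T X)^-1 X^T Y; np.linalg.solve is the numerically safer
# equivalent of forming the inverse explicitly
b = np.linalg.solve(X.T @ X, X.T @ y)
```

Solving the linear system rather than inverting X^T X directly is the standard practice, since explicit inversion amplifies rounding error when X^T X is ill-conditioned.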

___________________________________________________________________________

Procedure / Approach /Algorithm / Activity Diagram:

1. Identify attributes suitable for applying linear regression. Construct a linear
regression model for your dataset and predict the missing values in your data set.
Evaluate the accuracy of prediction. (Usage of a built-in package for prediction is not
expected.)
2. Identify attributes suitable for applying multiple linear regression. Construct a multiple
linear regression model for your dataset and predict the missing values in your data set.
Evaluate the accuracy of prediction. (Usage of a built-in package for prediction is not
expected.)
___________________________________________________________________________

Results: (Program printout with output / Document printout as per the format)

Questions:
1. How will you choose between linear regression and non-linear regression?
Ans: The general guideline is to try linear regression first to determine whether it can fit the
particular type of curve in your data. If you can't obtain an adequate fit using linear
regression, that is when you might need to choose nonlinear regression. Linear regression is
easier to use, simpler to interpret, and you obtain more statistics that help you assess the
model. While linear regression can model curves, it is relatively restricted in the shapes of the
curves that it can fit; sometimes it can't fit the specific curve in your data.
Nonlinear regression can fit many more types of curves, but it can require more effort both to
find the best fit and to interpret the role of the independent variables. Additionally, R-squared
is not valid for nonlinear regression, and p-values for the parameter estimates cannot be
calculated in the usual way.


2. Explain the nature or characteristics of a dataset where we can apply regression
imputation.
Ans: Imputation simply means that we replace the missing values with guessed/estimated
ones.

A simple guess for a missing value is the mean, median, or mode (the most frequently
occurring value) of that variable.


Mean, median, or mode imputation looks only at the distribution of values of the variable
with missing entries. If we know there is a correlation between the missing value and
other variables, we can often get better guesses by regressing the missing variable on the
other variables.

As we can see, in our example data, tip and total_bill have the highest correlation.
Thus, we can use a simple linear model regressing total_bill on tip to fill the
missing values in total_bill.
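The step described above, regressing total_bill on tip over the complete rows and then filling the gaps from the fitted line, can be sketched with pandas; the tiny DataFrame here is a stand-in for the actual tips data:

```python
import pandas as pd

# Stand-in for the tips data, with one missing total_bill value
df = pd.DataFrame({
    "tip":        [1.0, 2.0, 3.0, 4.0, 5.0, 2.5],
    "total_bill": [8.0, 15.0, 22.0, 30.0, 36.0, None],
})

# Fit the simple linear model on the complete rows only
complete = df.dropna(subset=["total_bill"])
x, y = complete["tip"], complete["total_bill"]
w1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
w0 = y.mean() - w1 * x.mean()

# Deterministic regression imputation: evaluate the line at the missing rows
missing = df["total_bill"].isna()
df.loc[missing, "total_bill"] = w0 + w1 * df.loc[missing, "tip"]
```

Fitting on `dropna` output and writing back through a boolean mask keeps the observed values untouched; only the rows that were missing receive predicted values.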

When we replace the missing data with some common value we might underestimate or
overestimate it. In other words, we add some bias to our estimation.


There are many missing data imputation methods that avoid these troublesome cases,
and regression imputation is one such method, in which we estimate the missing values
by regression using the other variables as predictors.

In Deterministic Regression Imputation, we replace the missing data with the values
predicted by our regression model and repeat this process for each variable.

To add uncertainty back to the imputed values, we can add normally distributed noise
with a mean of zero and a standard deviation equal to the standard error of the
regression estimates. This method is called Random Imputation or Stochastic
Regression Imputation.
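A sketch of that noise step, under the assumption that the regression coefficients and the residual standard error have already been estimated from the fitted model (the values below are purely illustrative):

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

def stochastic_impute(xs_missing, w0, w1, residual_se):
    """Deterministic prediction w0 + w1*x plus zero-mean Gaussian noise."""
    return [w0 + w1 * x + rng.gauss(0.0, residual_se) for x in xs_missing]

# Illustrative coefficients and residual standard error
filled = stochastic_impute([2.5, 4.0], w0=0.9, w1=7.1, residual_se=0.8)
```

Each imputed value now scatters around the regression line instead of sitting exactly on it, which preserves some of the natural variability of the variable.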

Regression imputation has the opposite problem of mean imputation: a regression model
is estimated to predict observed values of a variable based on other variables, and that
model is then used to impute values in cases where the value of that variable is missing.
Whereas mean imputation weakens the correlations between variables, deterministic
regression imputation artificially strengthens them, because every imputed value lies
exactly on the regression line.

Outcomes: Comprehend descriptive and proximity measures of data.

Conclusion: (Conclusion to be based on the objectives and outcomes achieved)


In this experiment we learnt about the descriptive and proximity measures of data. We also
learnt about regression algorithms, how to implement them, and how to use them to predict
missing values, which further helps in analysing the dataset based on the outcome of the
regression model.

Grade: AA / AB / BB / BC / CC / CD /DD

Signature of faculty in-charge with date


___________________________________________________________________________
References:

Books/ Journals/ Websites:

1. Han, J., Kamber, M., "Data Mining: Concepts and Techniques", Morgan Kaufmann, 3rd Edition
