ML Unit 3: Multiple Linear Regression
REGRESSION
Multiple Linear Regression
Extension of simple linear regression
Study variable depends upon more than one
independent (explanatory) variable
Simple linear regression is a special case of
multiple linear regression with one explanatory
variable
Example 1: 50 start ups
You have a dataset in front of you with information on 50 companies.
Ref: https://towardsdatascience.com/multiple-linear-regression-in-four-lines-of-code-b8ba26192e84
Example 2: Regional delivery service
You are a small business owner of a regional delivery service that offers same-day delivery for letters, packages and small cargo.
You are able to use Google Maps to group individual deliveries into one trip to reduce time and costs.
As the owner you would like to estimate how long a delivery will take based on the following factors:
- The total distance of the trip in miles
- The number of deliveries that must be made during the trip
- Petrol price

milesTravelled (x1) | numDeliveries (x2) | petrolPrice (x3) | travelTime (y)
89 | 4 | 3.84 | 7
66 | 1 | 3.19 | 5.4
78 | 3 | 3.78 | 6.6
111 | 6 | 3.89 | 7.4
44 | 1 | 3.57 | 4.8
77 | 3 | 3.57 | 6.4
80 | 3 | 3.03 | 7
66 | 2 | 3.51 | 5.6
109 | 5 | 3.54 | 7.3
76 | 3 | 3.25 | 6.4
Example 3: Predicting examination performance
Assume we have data on students such as revision time, test anxiety, lecture attendance and gender.
We want to use this data to predict examination performance.
We will use multiple regression to understand whether this can be predicted.
Example 4: Predicting house prices
We want to predict the price of a house based on:
Area in square foot
Locality
Number of bedrooms
Age of construction
Ref: https://online.stat.psu.edu/stat462/node/131/
Multiple Linear Regression
In case of one explanatory variable, estimated regression equation yields a
line
In case of two explanatory variables, estimated regression equation yields
a plane
In case of more than two explanatory variables, estimated regression
equation yields a hyperplane
Ref: https://online.stat.psu.edu/stat462/node/131/
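For reference, the model behind these geometric pictures can be written in the standard form below (a restatement of the multiple linear regression equation, not taken from the slide itself).

```latex
% Multiple linear regression with k explanatory variables:
% y depends linearly on x_1, ..., x_k plus a random error term.
\[
  y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon
\]
% k = 1 gives a line, k = 2 a plane, and k > 2 a hyperplane.
```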
Multiple regression (two predictors) by hand
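This slide title refers to a hand calculation; a minimal numerical sketch of the same idea is given below, using the delivery data from the earlier table (milesTravelled and numDeliveries as the two predictors) and solving the normal equations directly with NumPy. The data values and variable names are the ones shown above.

```python
import numpy as np

# Delivery data from the table above.
miles = np.array([89, 66, 78, 111, 44, 77, 80, 66, 109, 76])            # x1
deliveries = np.array([4, 1, 3, 6, 1, 3, 3, 2, 5, 3])                   # x2
travel_time = np.array([7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4])  # y

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(miles, dtype=float), miles, deliveries])

# Normal equations: (X'X) beta = X'y  ->  beta = (X'X)^(-1) X'y
beta = np.linalg.solve(X.T @ X, X.T @ travel_time)
print("intercept, b1 (miles), b2 (deliveries):", beta)
```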
Multicollinearity
Multicollinearity occurs when independent variables in a regression model are correlated
High degree of correlation between independent variables can cause problems when model is
fitted and used to interpret the results
The interpretation of a regression coefficient is that it represents the mean change in the
dependent variable for each one-unit change in an independent variable when all of the
other independent variables are held constant
When independent variables are correlated, changes in one variable are
associated with shifts in another variable
The stronger the correlation, the more difficult it is to change one variable without changing
another
It becomes difficult for the model to estimate the relationship between each independent
variable and the dependent variable independently
Independent variables tend to change in unison.
Ref: https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/
https://towardsdatascience.com/multi-collinearity-in-regression-fe7a2c1467ea
How to detect multicollinearity
Variance Inflation Factor (VIF)
Measure of collinearity among predictor variables
within multiple regression
Correlation Matrix
Variance Inflation Factor
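A minimal sketch of both detection tools in Python, using the delivery predictors from the earlier example as illustration (the data and column names are assumptions). The VIF of predictor i is 1/(1 - R_i^2), where R_i^2 comes from regressing that predictor on all the others.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative predictor matrix (delivery example); any DataFrame of predictors works.
X = pd.DataFrame({
    "milesTravelled": [89, 66, 78, 111, 44, 77, 80, 66, 109, 76],
    "numDeliveries":  [4, 1, 3, 6, 1, 3, 3, 2, 5, 3],
})
X = X.assign(const=1.0)  # compute VIF with an intercept column present

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)                               # VIF > 5-10 is a common rule of thumb
print(X.drop(columns="const").corr())    # correlation matrix, the other detection tool
```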
Example: Heart Disease
Data from approximately 500 towns is taken on the percentage of people who smoke, the
percentage of people who go to work by bike and the percentage of people who have heart
disease in each town
We want to find out how the following factors influence the percentage of people having
heart disease in each town:
Percentage of people who smoke
Percentage of people who go to work by bike
[Figure: scatter plot of biking and predicted biking (y-axis, 0-70) against smoking (x-axis, 0-35)]
Example Heart Disease - contd
Since we have established that there is no multicollinearity between smoking and biking,
let us continue with the regression using both independent variables, smoking and biking,
and the dependent variable heart disease.
The regression output from EXCEL is shown below.

SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.98975626
R Square | 0.97961745
Adjusted R Square | 0.9795351
Standard Error | 0.65403217
Observations | 498

ANOVA
| df | SS | MS | F | Significance F
Regression | 2 | 10176.57 | 5088.286 | 11895.24 | 0
Residual | 495 | 211.7403 | 0.427758
Total | 497 | 10388.31

| Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 14.984658 | 0.080137 | 186.9882 | 0 | 14.82720754 | 15.14211
biking | -0.2001331 | 0.001366 | -146.525 | 0 | -0.202816647 | -0.19745
smoking | 0.17833391 | 0.003539 | 50.38667 | 5.2E-197 | 0.171379996 | 0.185288
Example Heart Disease-contd
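Using the coefficients in the table above, the fitted model can be written out and used for prediction; a small sketch (the example town values are my own illustration):

```python
# Fitted model from the Excel output above:
# heart_disease% = 14.985 - 0.200 * biking% + 0.178 * smoking%
def predicted_heart_disease(biking_pct, smoking_pct):
    return 14.984658 - 0.2001331 * biking_pct + 0.17833391 * smoking_pct

# Example: a town where 30% bike to work and 15% smoke.
print(round(predicted_heart_disease(30, 15), 2))
```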
Missing data
Types of missing data
Missing Completely At Random
Data is missing completely at random if all observations have the same likelihood of being missing
The propensity for a data point to be missing is completely random
Example: some participants may have missing laboratory values because a batch of lab samples was processed improperly
Typically safe to remove MCAR data as the results will be unbiased
Missing At Random
Missing at Random means the propensity for a data point to be missing is not related to the missing data, but it is related to some of the observed data
Example: if older people are more likely to skip survey question #13 than younger people, the missingness mechanism is based on age, a different variable.
Example: a registry on patient data may confront data as MAR, where the smoking status is not documented in female patients because the doctor did not want to ask.
Missing Not At Random
Missing data has a structure to it
Data is missing systematically
There appear to be reasons why the data is missing
Example: In a survey, perhaps a specific group of people – say women ages 45 to 55 – did not answer a question
Example: the depression registry may encounter data that are MNAR if participants with severe depression are more likely to refuse to complete the survey about depression severity
http://www.statisticalassociates.com/missingvaluesanalysis_p.pdf
ncbi.nlm.nih.gov/books/NBK493614/
Applied Missing Data Analysis - Craig Enders
https://www.theanalysisfactor.com/mar-and-mcar-missing-data/
https://www.analyticsvidhya.com/blog/2021/10/guide-to-deal-with-missing-values/
Ways to deal with missing data
Deletion
Listwise deletion (deleting rows)
All data for an observation that has one or more missing values are deleted
Analysis is run only on observations that have a complete set of data
However, when the missing data is not MCAR, deleting the instances with missing observations can
result in biased parameters and estimates
and reduces the statistical power of the analysis
Pairwise deletion
Pairwise deletion attempts to reduce the loss that happens in listwise deletion
Each statistic, such as the correlation between two variables, is calculated using all cases that have data on the variables involved in that calculation
Delete columns
If a variable is missing for more than 60% of the observations and is insignificant, it can be discarded
Imputation
Method of substituting missing data with statistically determined values.
Replacement with mean, median, mode
Linear Regression-Missing values are predicted using a linear model and the other variables in the dataset.
https://www.aptech.com/blog/introduction-to-handling-missing-values/
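A minimal sketch of listwise deletion and mean imputation with pandas/scikit-learn (the small DataFrame and its column names are purely illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0, 5.0],
    "x2": [2.1, np.nan, 3.3, 4.2, 5.1],
    "y":  [3.0, 4.1, 5.2, np.nan, 7.0],
})

# Listwise deletion: keep only rows with a complete set of values.
complete_cases = df.dropna()

# Imputation: replace missing values with the column mean (median/mode work similarly).
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)

# Dropping a mostly-missing column (e.g. > 60% missing) would be df.drop(columns=[...]).
print(complete_cases.shape, mean_imputed.isna().sum().sum())
```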
Missing data IBM ICE (Innovation Centre for Education)
• Missing data causes problems because multiple regression procedures require that every
case have a score on every variable that is used in the analysis.
• If data are missing randomly, then it may be appropriate to estimate each bivariate
correlation on the basis of all cases that have data on the two variables:
– Pairwise deletion of missing data.
• A second procedure is to delete an entire case if information is missing on any one of the
variables that is used in the analysis:
– List-wise deletion.
• A third procedure is simply to delete a variable that has substantial missing data:
– Deletion of variables.
Validation of multiple regression model IBM ICE (Innovation Centre for Education)
• The validation process can involve analyzing the goodness of fit of the regression, analyzing
whether the regression residuals are random and checking whether the model's predictive
performance deteriorates substantially when applied to data that were not used in model
estimation.
• Numerical methods also play an important role in model validation. For example, the lack of
fit test for assessing the correctness of the functional part of the model can aid in interpreting
a borderline residual plot.
• Cross-validation is the process of assessing how the results of a statistical analysis will
generalize to an independent data set.
• A development in medical statistics is the use of out of sample cross validation techniques in
meta analysis. It forms the basis of the validation statistic, Vn, which is used to test the
statistical validity of meta analysis summary estimates. Essentially it measures a type of
normalized prediction error and its distribution is a linear combination of χ2 variables of
degree 1.
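A minimal cross-validation sketch with scikit-learn on synthetic data (purely illustrative; this is ordinary k-fold validation of a regression, not the meta-analysis Vn statistic described above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                   # three synthetic predictors
y = 2 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=100)

# 5-fold cross-validated R^2: how well the fitted model generalises to held-out data.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.round(3), scores.mean().round(3))
```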
R-Squared and Goodness of Fit
R squared measures how closely the data points cluster around the fitted regression line
R squared varies between zero and one
If R squared is one, the model fits the data perfectly; if it is zero, the model explains none of the variation
It is the proportion of variation in the dependent variable explained by variations in the independent variables
R squared can be increased by adding many independent variables
This is not always desirable, especially when the sample size is small
R squared does not measure the magnitude of the slope
A high R squared does not necessarily mean high predictability, nor does a low R squared mean low
predictability
R squared is sensitive to the sample size
The smaller the sample size, the higher its value tends to be
Coefficient of multiple determination
(R-Squared) IBM ICE (Innovation Centre for Education)
• Indicates the percentage of the variance in the dependent variable that the independent
variables explain collectively.
• R-squared measures the strength of the relationship between the model and the dependent
variable on a convenient 0-100% scale.
• Residuals are the distance between the observed value and the fitted value.
Adjusted R-squared
IBM ICE (Innovation Centre for Education)
• The adjusted R-squared compares the explanatory power of regression models that contain
different numbers of predictors.
• The adjusted R-squared is a modified version of R-squared that has been adjusted for the
number of predictors in the model.
• Multiple R squared is the proportion of Y variance that can be explained by the linear model
using X variables in the sample data, but it over-estimates that proportion in the population.
• Consider, for example, sample R2 = 0.60 based on k=7 predictor variables in a sample of
N=15 cases. An estimate of the proportion of Y variance that can be accounted for by the X
variables in the population is called shrunken R squared or adjusted R squared. It can be
calculated with the following formula:
Shrunken R² (adjusted R²) = 1 - (1 - R²)(N - 1)/(N - k - 1) = 1 - (1 - 0.6)(14/7) = 0.20
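The same calculation as a small sketch, reproducing the worked example above (R² = 0.60, N = 15, k = 7):

```python
def adjusted_r2(r2, n, k):
    """Shrunken / adjusted R-squared for n cases and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.60, 15, 7))   # 0.20, as in the example above
```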
Significance of adjusted R squared
Both R2 and the adjusted R2 give an idea of how well the data points
fit the line of the regression equation
One main difference between R2 and the adjusted R2 is
R2 assumes that every single variable explains the variation in
the dependent variable
Adjusted R2 tells the percentage of variation explained by only
the independent variables that actually affect the dependent variable
Statistical significance: t-Test
IBM ICE (Innovation Centre for Education)
• A t-test looks at the t-statistic, the t-distribution values and the degrees of freedom to determine
the probability of difference between two sets of data.
• Mathematically, the t-test takes a sample from each of the two sets and establishes the
problem statement by assuming a null hypothesis that the two means are equal.
• For a large sample size, statisticians use a z-test. Other testing options include the chi-
square test and the f-test.
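A minimal two-sample t-test sketch with SciPy (the two samples are illustrative data, not from the course examples):

```python
import numpy as np
from scipy import stats

a = np.array([5.1, 4.8, 5.6, 5.0, 4.9, 5.3])
b = np.array([5.9, 6.1, 5.7, 6.0, 5.8, 6.2])

# Null hypothesis: the two means are equal.
t_stat, p_value = stats.ttest_ind(a, b)
print(t_stat, p_value)   # a small p-value leads us to reject the null hypothesis
```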
Regional Delivery Service Example
We want to know how much time a delivery takes based on the following factors:
Total distance of the trip in miles
Number of deliveries that must be made during the trip

| Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 3.18556 | 0.466951 | 6.822047 | 0.000135 | 2.10877 | 4.262351
milesTravelled | 0.040257 | 0.005706 | 7.054653 | 0.000107 | 0.027098 | 0.053416

The ANOVA table gives the significance of the overall model.
Under coefficients:
Coefficient of milesTravelled: each additional mile travelled increases travel time by about 0.04 hours
Its p-value (0.000107) is significant, and is the same as Significance F in the ANOVA table
This is because we have only one independent variable.
Regional delivery service example
Regression between milesTravelled (x1) and travelTime (y)

SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.928179
R Square | 0.861515
Adjusted R Square | 0.844205
Standard Error | 0.342309
Observations | 10

ANOVA
| df | SS | MS | F | Significance F
Regression | 1 | 5.831597 | 5.831597 | 49.76813 | 0.000107
Residual | 8 | 0.937403 | 0.117175
Total | 9 | 6.769

| Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 3.536488 | 3.649039 | 0.969156 | 0.360851 | -4.87821 | 11.95119

Example: calculate travelTime (y) for milesTravelled (x1) = 84 miles
A one-unit increase in milesTravelled (x1) increases travelTime by 0.040257 hours
With a 95% confidence level, the estimated time interval is calculated as -
If we are using only one independent variable, we choose x1, since it gives the highest F-statistic, the lowest standard error and the highest R-squared
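Using the intercept and slope from the summary output above, the requested prediction for 84 miles can be sketched as follows (a small illustrative calculation, not part of the original slide):

```python
# Coefficients from the x1-only regression output above.
intercept = 3.536488
slope = 0.040257          # extra hours per additional mile travelled

miles = 84
predicted_time = intercept + slope * miles
print(round(predicted_time, 2))   # roughly 6.9 hours
```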
Regional delivery service example
Regression with IVs milesTravelled (x1) and numDeliveries (x2)
Regression between milesTravelled (x1), numDeliveries (x2) and travelTime (y)

SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.933488
R Square | 0.8714
Adjusted R Square | 0.834657
Standard Error | 0.352642
Observations | 10

ANOVA
| df | SS | MS | F | Significance F
Regression | 2 | 5.898503 | 2.949252 | 23.71607 | 0.000763
Residual | 7 | 0.870497 | 0.124357
Total | 9 | 6.769

| Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 3.732158 | 0.886974 | 4.207744 | 0.003997 | 1.634799 | 5.829517
milesTravelled | 0.026223 | 0.020016 | 1.310076 | 0.231521 | -0.02111 | 0.073553
numDeliveries | 0.184041 | 0.250909 | 0.733496 | 0.487089 | -0.40926 | 0.777345

Observations
The rows for milesTravelled (x1) and numDeliveries (x2) show p-values greater than 0.05
Hence both x1 and x2 are individually insignificant
The overall model is significant, with an F value of 23.72 and a Significance F of about 0.001
This is because of the strong correlation between the independent variables

Note about p-values
The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect)
A low p-value (< 0.05) indicates that you can reject the null hypothesis
In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable
Conversely, a larger (insignificant) p-value suggests that changes in the predictor are not associated with changes in the response
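The strong correlation between the two predictors claimed above can be checked directly from the delivery data shown earlier; a quick sketch:

```python
import numpy as np

miles = np.array([89, 66, 78, 111, 44, 77, 80, 66, 109, 76])
deliveries = np.array([4, 1, 3, 6, 1, 3, 3, 2, 5, 3])

# Pearson correlation between the two independent variables.
print(np.corrcoef(miles, deliveries)[0, 1])
```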
Coefficients of Multiple Linear Regression
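In matrix form, the least-squares coefficient estimates (presumably what this slide presents) can be stated as follows; this is the standard result, added here for reference.

```latex
% Stacking the n observations into y (n x 1) and X (n x (k+1), first column all ones),
% the least-squares estimate of the coefficient vector is:
\[
  \hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}
\]
```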
Example: Multicollinearity problem in Diamonds Dataset
The first 15 entries of this dataset are shown

| price | depth | carat | table | x | y | z
1 | 326 | 61.5 | 0.23 | 55 | 3.95 | 3.98 | 2.43
2 | 326 | 59.8 | 0.21 | 61 | 3.89 | 3.84 | 2.31
3 | 327 | 56.9 | 0.23 | 65 | 4.05 | 4.07 | 2.31
4 | 334 | 62.4 | 0.29 | 58 | 4.2 | 4.23 | 2.63
5 | 335 | 63.3 | 0.31 | 58 | 4.34 | 4.35 | 2.75
6 | 336 | 62.8 | 0.24 | 57 | 3.94 | 3.96 | 2.48
7 | 336 | 62.3 | 0.24 | 57 | 3.95 | 3.98 | 2.47
8 | 337 | 61.9 | 0.26 | 55 | 4.07 | 4.11 | 2.53
9 | 337 | 65.1 | 0.22 | 61 | 3.87 | 3.78 | 2.49
10 | 338 | 59.4 | 0.23 | 61 | 4 | 4.05 | 2.39
11 | 339 | 64 | 0.3 | 55 | 4.25 | 4.28 | 2.73
12 | 340 | 62.8 | 0.23 | 56 | 3.93 | 3.9 | 2.46
13 | 342 | 60.4 | 0.22 | 61 | 3.88 | 3.84 | 2.33
14 | 344 | 62.2 | 0.31 | 54 | 4.35 | 4.37 | 2.71
15 | 345 | 60.2 | 0.2 | 62 | 3.79 | 3.75 | 2.27
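A sketch of how the multicollinearity in this dataset could be inspected, assuming the diamonds dataset that ships with seaborn (which matches the columns shown above):

```python
import seaborn as sns

# The diamonds dataset bundled with seaborn.
diamonds = sns.load_dataset("diamonds")

# carat and the physical dimensions x, y, z are all measures of size,
# so their pairwise correlations are expected to be very high.
print(diamonds[["price", "depth", "carat", "table", "x", "y", "z"]].corr().round(2))
```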
Dummy variables for a categorical variable City
One-hot encoding with three dummy variables:
City | C1 | C2 | C3
Mumbai | 1 | 0 | 0
Delhi | 0 | 1 | 0
Bangalore | 0 | 0 | 1

Using k - 1 = 2 dummy variables (one column dropped):
City | C1 | C2
Mumbai | 1 | 0
Delhi | 0 | 1
Bangalore | 0 | 0
Example: Gender Discrimination
Lawsuit dataset from Kaggle; only the first entries are shown
We need to create dummy variables for multiple categorical variables (Gender and Job)

| Gender | Years_of_Experiences | Job | Salary
0 | Male | 9 | Full_Professor | 84612
1 | Male | 10 | Associate | 78497
2 | Male | 6 | Assistant | 67756
3 | Male | 27 | Full_Professor | 173220
4 | Male | 10 | Full_Professor | 96099
5 | Male | 10 | Full_Professor | 87531
6 | Male | 9 | Associate | 99972
7 | Male | 11 | Full_Professor | 166601
8 | Male | 18 | Full_Professor | 85437

https://www.kaggle.com/hjmjerry/gender-discrimination
Example: Gender Discrimination
One hot encoding technique
Creating dummy variable for
categorical variable gender using
one hot encoding technique
If the candidate is female, assign
1 to Gender_Female variable and
0 to Gender_Male variable
And vice versa
Dummy variable trap
Two or more dummy variables created by one-hot encoding are highly correlated (multi-
collinear)
One variable can be predicted from the others, making it difficult to interpret the estimated
coefficients in regression models
In other words, the individual effect of each dummy variable on the prediction cannot be
interpreted well because of multicollinearity
In our example, if the male variable is 1, the female variable must be 0
The two variables are highly correlated
Multicollinearity exists
The dummy variables contain redundant information
To overcome the dummy variable trap, we drop one of the columns created when the
categorical variable was converted to dummy variables by one-hot encoding, as shown in the sketch below
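A minimal one-hot encoding sketch with pandas; the rows are loosely based on the lawsuit table above with illustrative Female entries added. drop_first=True drops one column per categorical variable and thereby avoids the dummy variable trap:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Job": ["Full_Professor", "Associate", "Assistant", "Full_Professor"],
    "Salary": [84612, 78497, 67756, 173220],
})

# Full one-hot encoding: k dummy columns per categorical variable (redundant).
full = pd.get_dummies(df, columns=["Gender", "Job"])

# k - 1 dummy columns per variable: one level is dropped to avoid the trap.
reduced = pd.get_dummies(df, columns=["Gender", "Job"], drop_first=True)
print(full.columns.tolist())
print(reduced.columns.tolist())
```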
So how many dummy variables are
needed?
Number of dummy variables required to represent
a particular categorical variable depends on the
number of values that the categorical variable can
assume
To represent a categorical variable that can
assume k different values, we need k - 1 dummy
variables
Removing redundant information from the dummy variable to
overcome dummy variable trap
Ref-William Mendenhall, Robert Beaver, Barbara Beaver, Introduction to probability and statistics, Cengage, 14th edition
Example: Condominiums
Correlation matrix
Observations??

 | List price y | Living area x1 | Floors x2 | bedrooms x3 | baths x4
List price y | 1
Living area x1 | 0.9507451 | 1
Floors x2 | 0.6053571 | 0.6297639 | 1
bedrooms x3 | 0.7441209 | 0.7109191 | 0.375 | 1
baths x4 | 0.8333801 | 0.7199549 | 0.7596752 | 0.6751223 | 1
Example: Condominiums
Regression results with all the IVs
Regression taking into account all the independent variables; regression results from Excel are shown

SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.9849108
R Square | 0.9700494
Adjusted R Square | 0.9580691
Standard Error | 6.9841692
Observations | 15

ANOVA
| df | SS | MS | F | Significance F
Regression | 4 | 15798.558 | 3949.6395 | 80.97071 | 1.41E-07
Residual | 10 | 487.78619 | 48.778619
Total | 14 | 16286.344

| Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 119.03967 | 9.388678 | 12.679067 | 1.738E-07 | 98.120396 | 139.95895
Living area x1 | 6.2530548 | 0.7395039 | 8.4557428 | 7.224E-06 | 4.6053374 | 7.9007721
Floors x2 | -16.119062 | 6.3343795 | -2.5446947 | 0.0291277 | -30.232939 | -2.0051847
bedrooms x3 | -2.8213208 | 4.5823898 | -0.6156876 | 0.5518553 | -13.031522 | 7.38888
baths x4 | 30.266348 | 6.9835377 | 4.3339564 | 0.0014803 | 14.706057 | 45.82664

Observations
Here, the number of independent variables is k = 4 and the total number of observations is n = 15
s^2 = MSE = SSE/(n - k - 1) = 487.78619/(15 - 4 - 1) = 48.7787
The p-value of bedrooms (x3) is > 0.05
If we drop this IV, could the results be any better??
Regression results with all the IVs except bedrooms (x3)
Regression output obtained by dropping the IV bedrooms (x3)

SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.984334
R Square | 0.968914
Adjusted R Square | 0.960436
Standard Error | 6.784185
Observations | 15

ANOVA
| df | SS | MS | F | Significance F
Regression | 3 | 15780.07 | 5260.022 | 114.2858 | 1.43E-08
Residual | 11 | 506.2768 | 46.02516
Total | 14 | 16286.34

| Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 115.9232 | 7.680888 | 15.09242 | 1.07E-08 | 99.01767 | 132.8287
Floors x2 | -14.4939 | 5.59334 | -2.59128 | 0.025081 | -26.8048 | -2.18306
Living area x1 | 6.015165 | 0.612481 | 9.820977 | 8.85E-07 | 4.667103 | 7.363227
baths x4 | 28.10408 | 5.863379 | 4.793154 | 0.000559 | 15.19887 | 41.00929

Observations
One variable has been removed, and the MSE is reduced
The value of R squared with three variables is 0.9689, compared with 0.970 with four variables
Adjusted R squared - revisited
Because of the definition of R squared, its value can never decrease when
more variables are added to the regression
It can either stay the same or increase
Hence its value can be inflated by including more predictor variables
To account for this, we adjust the measure for models that have
more than one feature (independent variable)
This is done by using the adjusted R squared value
Adjusted R squared is based on MSE rather than SSE, so the degrees of
freedom are built into the calculation
{MSE = SSE/(n - k - 1)}
Hence adjusted R squared represents the variation in the response y
explained by the independent variables, corrected for degrees of
freedom
Adjusted R squared - revisited
The adjustment is made by balancing the number of features
against the number of observations
In a model with many features and few observations,
a severe adjustment will be made
Adjusted R squared can become negative if there are
many features and very few observations
It can be said that each additional feature reduces the
degrees of freedom by one
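Applying the adjusted R squared formula to the condominium regressions above shows this effect directly; the numbers are taken from the two Excel outputs shown earlier.

```python
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Condominium example: n = 15 observations.
print(adjusted_r2(0.9700494, 15, 4))   # all four IVs       -> ~0.9581
print(adjusted_r2(0.968914, 15, 3))    # bedrooms dropped   -> ~0.9604
# R squared falls slightly when bedrooms is dropped, but adjusted R squared rises.
```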
Example: Corporate profits

Corporate Profits
profit y | advertising x1 | capital x2
15 | 25 | 4
16 | 1 | 5
2 | 6 | 3
3 | 30 | 1
12 | 29 | 2
1 | 20 | 0
16 | 12 | 4
18 | 15 | 5
13 | 6 | 4
2 | 16 | 2

Find the least squares prediction equation
Use the overall F test to determine whether the model contributes significant information for prediction of y. Use α = 0.01
Does advertising expenditure x1 contribute significant information for prediction of y, given that x2 is already in the model? Use α = 0.01
Calculate the coefficient of determination R squared. What percentage of the overall variation is explained by the model?
Ref-William Mendenhall, Robert Beaver, Barbara Beaver, Introduction to probability and statistics, Cengage, 14th edition
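A sketch of how this exercise could be worked in Python with statsmodels; the summary reports the overall F test, the coefficient t tests and R squared asked for above (the numerical answers are left to the reader):

```python
import pandas as pd
import statsmodels.api as sm

data = pd.DataFrame({
    "profit":      [15, 16, 2, 3, 12, 1, 16, 18, 13, 2],
    "advertising": [25, 1, 6, 30, 29, 20, 12, 15, 6, 16],
    "capital":     [4, 5, 3, 1, 2, 0, 4, 5, 4, 2],
})

X = sm.add_constant(data[["advertising", "capital"]])
model = sm.OLS(data["profit"], X).fit()

print(model.summary())   # overall F test, t test for each coefficient, confidence intervals
print(model.rsquared)    # proportion of overall variation explained by the model
```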
References
Machine Learning, IBM ICE
https://towardsdatascience.com/multiple-linear-regression-in-four-lines-of-code-b8ba26192e84
https://www.youtube.com/watch?v=dQNpSa-bq4M
https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/
Alice Zheng, Amanda Casari, Feature Engineering for Machine Learning, O'Reilly, 2018, first edition
https://methods.sagepub.com/base/download/DatasetStudentGuide/multiple-reg-dummy-in-gss-2012