ML Unit3 MultipleLinearRegression

Multiple linear regression is an extension of simple linear regression that allows the dependent variable to be modeled as a linear combination of two or more independent variables. It can be used to model relationships between a dependent variable and multiple explanatory variables, and to predict the value of the dependent variable based on the explanatory variables. Simple linear regression is a special case of multiple linear regression with only one explanatory variable.

MULTIPLE LINEAR REGRESSION
Multiple Linear Regression
 Extension of simple linear regression
 The study (dependent) variable depends upon more than one independent (explanatory) variable
 Simple linear regression is a special case of multiple linear regression with one explanatory variable
Example 1: 50 start ups
 You have a dataset in front of you with information on 50 companies.

Ref: https://towardsdatascience.com/multiple-linear-regression-in-four-lines-of-code-b8ba26192e84
Example 1: 50 start ups
 You’ve been hired to analyze this information and create a model.
 You need to tell the person who hired you which kinds of companies will make the most sense to invest in going forward.
 To keep things simple, let’s say that your employer wants to make this decision based on last year’s profit.
 This means that the profit column is your dependent variable.
 The other columns are the independent variables.
Example 2: Regional Delivery Service
 You are a small business owner of a regional delivery service that offers same-day delivery for letters, packages and small cargo.
 You are able to use Google Maps to group individual deliveries into one trip to reduce time and costs.
 As the owner you would like to estimate how long a delivery will take based on the following factors:
 - The total distance of the trip in miles
 - The number of deliveries that must be made during the trip
 - Petrol price

milesTravelled (x1)  numDeliveries (x2)  petrolPrice (x3)  travelTime (y)
89                   4                   3.84              7
66                   1                   3.19              5.4
78                   3                   3.78              6.6
111                  6                   3.89              7.4
44                   1                   3.57              4.8
77                   3                   3.57              6.4
80                   3                   3.03              7
66                   2                   3.51              5.6
109                  5                   3.54              7.3
76                   3                   3.25              6.4

Ref: Statistics 101 - Brandon Foltz
https://www.youtube.com/watch?v=dQNpSa-bq4M
Example 3: Examination Performance
 Let us assume we have data on students such as revision time, test anxiety, lecture attendance and gender.
 We want to use this data to predict examination performance.
 We will use multiple regression to understand whether this can be predicted.
Example 4: Predicting house prices
 We want to predict the price of a house based on
 Area in square feet
 Locality
 Number of bedrooms
 Age of construction
 Facilities and Amenities
Multiple Linear Regression

Multiple Linear Regression

Ref: https://online.stat.psu.edu/stat462/node/131/
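The model equation shown on the referenced slides is not reproduced above; the standard multiple linear regression model with k explanatory variables is

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon $$

where y is the dependent (study) variable, x1, ..., xk are the explanatory variables, β0, ..., βk are the regression coefficients and ε is the random error term.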
Multiple Linear Regression

Ref: https://online.stat.psu.edu/stat462/node/131/
Multiple Linear Regression
 In case of one explanatory variable, the estimated regression equation yields a line
 In case of two explanatory variables, the estimated regression equation yields a plane
 In case of more than two explanatory variables, the estimated regression equation yields a hyperplane

Ref: https://online.stat.psu.edu/stat462/node/131/
Multiple regression (two predictors) by
hand

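The worked example from this slide is not reproduced here; as a sketch, the standard closed-form least squares formulas for fitting y = b0 + b1x1 + b2x2 by hand are

$$ b_1 = \frac{S_{x_2 x_2} S_{x_1 y} - S_{x_1 x_2} S_{x_2 y}}{S_{x_1 x_1} S_{x_2 x_2} - S_{x_1 x_2}^2}, \qquad b_2 = \frac{S_{x_1 x_1} S_{x_2 y} - S_{x_1 x_2} S_{x_1 y}}{S_{x_1 x_1} S_{x_2 x_2} - S_{x_1 x_2}^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}_1 - b_2 \bar{x}_2 $$

where, for example, $S_{x_1 y} = \sum_i (x_{1i} - \bar{x}_1)(y_i - \bar{y})$ and $S_{x_1 x_1} = \sum_i (x_{1i} - \bar{x}_1)^2$ are sums of squares and cross-products of deviations from the means.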
Multicollinearity
 Multicollinearity occurs when independent variables in a regression model are correlated
 A high degree of correlation between independent variables can cause problems when the model is fitted and used to interpret the results
 The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each one-unit change in an independent variable when all of the other independent variables are held constant
 When independent variables are correlated, it indicates that changes in one variable are associated with shifts in another variable
 The stronger the correlation, the more difficult it is to change one variable without changing another
 It becomes difficult for the model to estimate the relationship between each independent variable and the dependent variable independently
 Independent variables tend to change in unison.
Ref: https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/
https://towardsdatascience.com/multi-collinearity-in-regression-fe7a2c1467ea
How to detect multicollinearity
 Variance Inflation Factor (VIF)
 Measure of collinearity among predictor variables within multiple regression
 Correlation Matrix
Variance Inflation Factor

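The formula this slide presumably showed is the standard one: for predictor j, regress x_j on all the other predictors and compute

$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2} $$

where R_j^2 is the R squared of that auxiliary regression. VIF = 1 means no collinearity with the other predictors; values above about 5 (or 10, depending on the rule of thumb) signal problematic multicollinearity.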
Example: Heart Disease
 Data from approximately 500 towns is taken on the percentage of people who smoke, the percentage of people who go to work by bike and the percentage of people who have heart disease in each town
 We want to find out how the following factors influence the percentage of people having heart disease in each town
 Percentage of people who smoke
 Percentage of people who go to work by bike

Dataset can be downloaded from https://www.scribbr.com/statistics/multiple-linear-regression/
Assumptions in Multiple Linear Regression
 There are some assumptions that absolutely have to be true:
 There is a linear relationship between the dependent variable and the independent variables.
 The independent variables aren’t too highly correlated with each other.
 Your observations for the dependent variable are selected independently and at random.
 Regression residuals are normally distributed.
 Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t change significantly across the values of the independent variable
Example Heart Disease-contd
 We want to observe the effect of smoking (independent variable) and biking on heart disease (dependent variable)
 Let us first check for multicollinearity
 Following is the result of a regression performed on biking and smoking to find the correlation between them

SUMMARY OUTPUT
Regression Statistics
Multiple R         0.015136
R Square           0.000229
Adjusted R Square  -0.00179
Standard Error     21.5007
Observations       498

ANOVA
            df   SS        MS        F         Significance F
Regression  1    52.54352  52.54352  0.113662  0.736156
Residual    496  229290.9  462.2801
Total       497  229343.5

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept  37.18302      2.03783         18.24638  2.74E-57  33.17918   41.18686   33.17918     41.18686
smoking    0.039222      0.116338        0.337137  0.736156  -0.18935   0.267798   -0.18935     0.267798

 R Square is practically zero and the p-value for smoking is 0.736, so biking and smoking are essentially uncorrelated
 That means there is no multicollinearity
Scatter plot between IVs - smoking and biking
 [Figure: smoking line fit plot - biking and predicted biking plotted against smoking]
Example Heart Disease-contd
 Since we have established that there is no multicollinearity between smoking and biking, let us continue with the regression with both the independent variables smoking and biking and the dependent variable heart disease
 The regression output from EXCEL is shown in the table

SUMMARY OUTPUT
Regression Statistics
Multiple R         0.98975626
R Square           0.97961745
Adjusted R Square  0.9795351
Standard Error     0.65403217
Observations       498

ANOVA
            df   SS        MS        F         Significance F
Regression  2    10176.57  5088.286  11895.24  0
Residual    495  211.7403  0.427758
Total       497  10388.31

           Coefficients  Standard Error  t Stat    P-value   Lower 95%     Upper 95%  Lower 95.0%  Upper 95.0%
Intercept  14.984658     0.080137        186.9882  0         14.82720754   15.14211   14.82721     15.14211
biking     -0.2001331    0.001366        -146.525  0         -0.202816647  -0.19745   -0.20282     -0.19745
smoking    0.17833391    0.003539        50.38667  5.2E-197  0.171379996   0.185288   0.17138      0.185288
Example Heart Disease-contd

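From the coefficient table above, the fitted regression equation for this example is approximately

$$ \widehat{\text{heart disease}} = 14.985 - 0.200 \times \text{biking} + 0.178 \times \text{smoking} $$

i.e. each extra percentage point of biking to work is associated with a 0.2 point lower rate of heart disease, and each extra percentage point of smoking with a 0.178 point higher rate, holding the other variable constant.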
Missing data
 Types of missing data
 Missing Completely At Random (MCAR)
 Data is missing completely at random if all observations have the same likelihood of being missing
 The propensity for a data point to be missing is completely random
 Example: some participants may have missing laboratory values because a batch of lab samples was processed improperly
 Typically safe to remove MCAR data as the results will be unbiased
 Missing At Random (MAR)
 Missing at Random means the propensity for a data point to be missing is not related to the missing data itself, but it is related to some of the observed data
 Example: if older people are more likely to skip survey question #13 than younger people, the missingness mechanism is based on age, a different variable.
 Example: a registry on patient data may treat data as MAR where smoking status is not documented in female patients because the doctor did not want to ask.
 Missing Not At Random (MNAR)
 Missing data has a structure to it
 Data is missing systematically
 There appear to be reasons the data is missing
 Example: in a survey, perhaps a specific group of people - say women ages 45 to 55 - did not answer a question
 Example: a depression registry may encounter data that are MNAR if participants with severe depression are more likely to refuse to complete the survey about depression severity

http://www.statisticalassociates.com/missingvaluesanalysis_p.pdf ncbi.nlm.nih.gov/books/NBK493614/
Applied Missing Data Analysis - Craig Enders https://www.theanalysisfactor.com/mar-and-mcar-missing-data/
https://www.analyticsvidhya.com/blog/2021/10/guide-to-deal-with-missing-values/
Ways to deal with missing data
 Deletion
 Listwise deletion (deleting rows)
 All data for an observation that has one or more missing values are deleted
 Analysis is run only on observations that have a complete set of data
 However, when the missing data is not MCAR, deleting the instances with missing observations can result in
 biased parameters and estimates
 reduced statistical power of the analysis
 Pairwise deletion
 Pairwise deletion makes an attempt to reduce the loss that happens in listwise deletion.
 It calculates the correlation between two variables using every pair of observations for which both values are present
 The coefficient of correlation can be used to take such data into account.
 Delete columns
 If data is missing for more than 60% of the observations, discard the variable if it is insignificant.
 Imputation
 Method of substituting missing data with statistically determined values.
 Replacement with mean, median, mode
 Linear Regression - missing values are predicted using a linear model and the other variables in the dataset (see the sketch after the reference below).

https://www.aptech.com/blog/introduction-to-handling-missing-values/
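As a minimal sketch (not part of the original slides; the column names are hypothetical), listwise deletion, column dropping and mean imputation can be done with pandas and scikit-learn as follows:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values (None marks a missing entry)
df = pd.DataFrame({
    "miles":      [89, 66, None, 111, 44],
    "deliveries": [4, 1, 3, None, 1],
    "time":       [7.0, 5.4, 6.6, 7.4, 4.8],
})

# Listwise deletion: drop every row that has one or more missing values
listwise = df.dropna()

# Delete columns: drop a variable if more than 60% of its values are missing
sparse_cols = df.columns[df.isna().mean() > 0.6]
df = df.drop(columns=sparse_cols)

# Imputation: replace missing values with the column mean (median/mode also possible)
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(listwise)
print(imputed)
```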
Missing data IBM ICE (Innovation Centre for Education)

• Missing data causes problems because multiple regression procedures require that every
case have a score on every variable that is used in the analysis.

• The most common ways of dealing with missing data are:


– Pairwise deletion, listwise deletion, deletion of variables and coding of missingness.

• If data are missing randomly, then it may be appropriate to estimate each bivariate
correlation on the basis of all cases that have data on the two variables:
– Pairwise deletion of missing data.

• A second procedure is to delete an entire case if information is missing on any one of the
variables that is used in the analysis:
– List-wise deletion.

• A third procedure is simply to delete a variable that has substantial missing data:
– Deletion of variables.
Validation of multiple regression model IBM ICE (Innovation Centre for Education)

• The validation process can involve analyzing the goodness of fit of the regression, analyzing
whether the regression residuals are random and checking whether the model's predictive
performance deteriorates substantially when applied to data that were not used in model
estimation.

• One measure of goodness of fit is the R2 (coefficient of determination), which, in ordinary
least squares with an intercept, ranges between 0 and 1.

• Numerical methods also play an important role in model validation. For example, the lack of
fit test for assessing the correctness of the functional part of the model can aid in interpreting
a borderline residual plot.

• Cross-validation is the process of assessing how the results of a statistical analysis will
generalize to an independent data set.

• A development in medical statistics is the use of out-of-sample cross validation techniques in
meta-analysis. It forms the basis of the validation statistic, Vn, which is used to test the
statistical validity of meta-analysis summary estimates. Essentially it measures a type of
normalized prediction error, and its distribution is a linear combination of χ2 variables with
1 degree of freedom.
R-Squared and Goodness of Fit
 The spread of data points around the fitted regression line is summarized by the R squared metric
 R squared varies between zero and one
 If R squared is one, the model fits the data perfectly; if it is zero, the model explains none of the variation
 It is the proportion of variation in the dependent variable explained by variations in the independent variables
 R squared may be increased by adding many independent variables
 Sometimes this is not good when the sample size is small
 R squared does not measure the magnitude of the slope
 A high R squared does not necessarily mean high predictability, nor does a poor R squared mean low predictability
 R squared is very sensitive to the sample size
 The smaller the sample size, the higher its value tends to be
Coefficient of multiple determination
(R-Squared) IBM ICE (Innovation Centre for Education)

• R-squared is a goodness of fit measure for linear regression models.

• Indicates the percentage of the variance in the dependent variable that the independent
variables explain collectively.

• R-squared measures the strength of the relationship between the model and the dependent
variable on a convenient 0-100% scale.

• Residuals are the distance between the observed value and the fitted value.
Adjusted R-squared
IBM ICE (Innovation Centre for Education)

• The adjusted R-squared compares the explanatory power of regression models that contain
different numbers of predictors.

• The adjusted R-squared is a modified version of R-squared that has been adjusted for the
number of predictors in the model.

• Multiple R squared is the proportion of Y variance that can be explained by the linear model
using X variables in the sample data, but it over-estimates that proportion in the population.

• Consider, for example, sample R2 = 0.60 based on k=7 predictor variables in a sample of
N=15 cases. An estimate of the proportion of Y variance that can be accounted for by the X
variables in the population is called shrunken R squared or adjusted R squared. It can be
calculated with the following formula:
$$ \text{Shrunken } R^2 = \tilde{R}^2 = 1 - (1 - R^2)\,\frac{N - 1}{N - k - 1} = 1 - (1 - 0.6)\,\frac{14}{7} = 0.20 $$
Significance of adjusted R squared
 Both R2 and the adjusted R2 give an idea of how well the data points fit the regression equation
 One main difference between R2 and the adjusted R2 is
 R2 assumes that every single variable explains the variation in the dependent variable
 Adjusted R2 tells the percentage of variation explained by only the independent variables that actually affect the dependent variable
Statistical significance: t-Test
IBM ICE (Innovation Centre for Education)

• A t-test is a type of inferential statistic used to determine if there is a significant difference
between the means of two groups, which may be related in certain features.

• A t-test looks at the t-statistic, the t-distribution values and the degrees of freedom to
determine the probability of difference between two sets of data.

• Mathematically, the t-test takes a sample from each of the two sets and establishes the
problem statement by assuming a null hypothesis that the two means are equal.

• For a large sample size, statisticians use a z-test. Other testing options include the chi-
square test and the f-test.
Regional Delivery Service Example
Regional delivery service example
 We want to know how much time a delivery takes based on the following factors-
 Total distance of the trip in miles
 Number of deliveries that must be made during the trip
 Daily price of gas/petrol in $

Ref: Statistics 101 - Brandon Foltz
https://www.youtube.com/watch?v=dQNpSa-bq4M
Regional delivery service example-steps
 Steps
 Check the relation between
 DV and IVs
 IV and IV
 Remove redundant IV
 Remove based on multicollinearity
 Conduct simple linear regression for each IV/DV pair
 Select the best fit model to make predictions
Regional delivery service example - Dataset
 A random sample of the past 10 trips is taken
 Information recorded for each trip is
 Miles travelled
 Number of deliveries
 Daily petrol price
 Total travel time in hours

milesTravelled (x1)  numDeliveries (x2)  petrolPrice (x3)  travelTime (y)
89                   4                   3.84              7
66                   1                   3.19              5.4
78                   3                   3.78              6.6
111                  6                   3.89              7.4
44                   1                   3.57              4.8
77                   3                   3.57              6.4
80                   3                   3.03              7
66                   2                   3.51              5.6
109                  5                   3.54              7.3
76                   3                   3.25              6.4
Regional delivery service example - Correlation matrix
 The correlation matrix is obtained in Excel to find the relationship strength amongst
 the IVs
 the DV and individual IVs

Correlation matrix
                     milesTravelled (x1)  numDeliveries (x2)  petrolPrice (x3)  travelTime (y)
milesTravelled (x1)  1
numDeliveries (x2)   0.955898             1
petrolPrice (x3)     0.355796             0.498242            1
travelTime (y)       0.928179             0.916443            0.267212          1

 Observations
 milesTravelled (x1) is highly correlated to travelTime (y) - IV to DV relation
 numDeliveries (x2) is highly correlated to travelTime (y) - IV to DV relation
 petrolPrice (x3) is not significantly correlated to travelTime (y) - IV to DV relation
 Non-contributing independent variable
 numDeliveries (x2) is also highly correlated to milesTravelled (x1) - IV to IV relation
 Redundant independent variable
 Note: similar observations could be made using scatterplots or VIF calculations
Regional delivery service example
Regression between milesTravelled (x1) and travelTime (y)

SUMMARY OUTPUT
Regression Statistics
Multiple R         0.928179
R Square           0.861515
Adjusted R Square  0.844205
Standard Error     0.342309
Observations       10

ANOVA
            df  SS        MS        F         Significance F
Regression  1   5.831597  5.831597  49.76813  0.000107
Residual    8   0.937403  0.117175
Total       9   6.769

                Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept       3.18556       0.466951        6.822047  0.000135  2.10877    4.262351   2.10877      4.262351
milesTravelled  0.040257      0.005706        7.054653  0.000107  0.027098   0.053416   0.027098     0.053416

 R-square is the % of variation in the dependent variable explained by the independent variable.
 Here about 86% of the variation is explained, which is pretty high.
 Adjusted R square is similar to R-square, but adjusted for the number of independent variables (one in this case). It is always lower than R-square.
 The standard error of the regression is the average distance of the data points from the regression line, in units of the dependent variable.
 Data points are on average 0.342 hrs away from the regression line.
 It gives us a measure of how tightly the data points sit around the regression line - it forms a channel around the line.
 The narrower the channel, the more tightly the data points sit around the regression line; a wider band means they are more scattered from it.
 The ANOVA table gives the significance of the overall model.
 Under coefficients: one unit increase in milesTravelled (x1) increases travelTime by 0.040257 hours.
 The p-value for milesTravelled is 0.000107, which is significant. It is the same as Significance F in the ANOVA table because we have only one independent variable.
Regional delivery service example
Regression between milesTravelled (x1) and travelTime (y)
(regression output repeated from the previous slide)

 Example: Calculate travelTime (y) for milesTravelled (x1) = 84 miles
 One unit increase in milesTravelled (x1) increases travelTime by 0.040257 hours
 Point estimate: travelTime = 3.18556 + 0.040257 x 84 ≈ 6.57 hrs
 With a 95% confidence level, an estimated time interval is calculated
 The estimated time interval will be (5.825 hrs to 7.601 hrs)
Regional delivery service example
Regression between numberDeliveries (x2) and travelTime (y)

SUMMARY OUTPUT
Regression Statistics
Multiple R         0.916443
R Square           0.839868
Adjusted R Square  0.819852
Standard Error     0.368091
Observations       10

ANOVA
            df  SS       MS        F         Significance F
Regression  1   5.68507  5.68507   41.95894  0.000193
Residual    8   1.08393  0.135491
Total       9   6.769

               Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept      4.845415      0.265345        18.26079  8.32E-08  4.233528   5.457302   4.233528     5.457302
numDeliveries  0.498253      0.07692         6.477572  0.000193  0.320876   0.675631   0.320876     0.675631

 One unit increase in the number of deliveries increases travel time by 0.4983 hours
 Example: Calculate travelTime (y) for numberDeliveries (x2) = 4
 Point estimate: travelTime = 4.845415 + 0.498253 x 4 ≈ 6.84 hrs
Regional delivery service example
Regression between petrolPrice (x3) and travelTime (y)

SUMMARY OUTPUT
Regression Statistics
Multiple R         0.267212
R Square           0.071402
Adjusted R Square  -0.04467
Standard Error     0.886403
Observations       10

ANOVA
            df  SS       MS       F         Significance F
Regression  1   0.48332  0.48332  0.615138  0.455453
Residual    8   6.28568  0.78571
Total       9   6.769

             Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept    3.536488      3.649039        0.969156  0.360851  -4.87821   11.95119   -4.87821     11.95119
petrolPrice  0.811348      1.034477        0.784307  0.455453  -1.57416   3.196857   -1.57416     3.196857

 Observation
 The standard error is very high (0.886403)
 R squared is very small (0.071402)
 Hence we will not consider petrol price for our regression model
Regional delivery service example
Model summary for simple linear regression with a single independent variable

IV  Standard Error  R squared  F statistic  p value
x1  0.3423          0.86       49.76        <0.0001
x2  0.3680          0.83       41.95        <0.0001
x3  0.8864          0.0714     0.615        0.455

For the first model, on average the data points are 0.3423 hrs away from the regression line, which is the least among all the models (i.e. the second and third)

If we are only using one independent variable then we will choose x1, as it has the highest F-statistic, the lowest standard error and the highest R-squared
Regional delivery service example
Regression with IVs milesTravelled (x1) and numDeliveries (x2)
Regression between milesTravelled (x1), numberDeliveries (x2) and travelTime (y)

SUMMARY OUTPUT
Regression Statistics
Multiple R         0.933488
R Square           0.8714
Adjusted R Square  0.834657
Standard Error     0.352642
Observations       10

ANOVA
            df  SS        MS        F         Significance F
Regression  2   5.898503  2.949252  23.71607  0.000763
Residual    7   0.870497  0.124357
Total       9   6.769

                Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept       3.732158      0.886974        4.207744  0.003997  1.634799   5.829517   1.634799     5.829517
milesTravelled  0.026223      0.020016        1.310076  0.231521  -0.02111   0.073553   -0.02111     0.073553
numDeliveries   0.184041      0.250909        0.733496  0.487089  -0.40926   0.777345   -0.40926     0.777345

 Observations
 The rows for milesTravelled (x1) and numDeliveries (x2) show that their p-values are > 0.05
 Hence the individual coefficients of x1 and x2 are insignificant
 The overall model is significant, with an F-value of 23.72 and a p-value of about 0.001
 This is because of the strong correlation between the independent variables

 Note about p-values
 The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect)
 A low p-value (< 0.05) indicates that you can reject the null hypothesis
 In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable
 Conversely, a larger (insignificant) p-value suggests that changes in the predictor are not associated with changes in the response
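A minimal sketch (assumed use of statsmodels, not shown in the original slides) that reproduces this regression on the 10-trip dataset:

```python
import pandas as pd
import statsmodels.api as sm

# Delivery-service data from the dataset slide
data = pd.DataFrame({
    "milesTravelled": [89, 66, 78, 111, 44, 77, 80, 66, 109, 76],
    "numDeliveries":  [4, 1, 3, 6, 1, 3, 3, 2, 5, 3],
    "travelTime":     [7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4],
})

# Add the intercept column and fit ordinary least squares
X = sm.add_constant(data[["milesTravelled", "numDeliveries"]])
model = sm.OLS(data["travelTime"], X).fit()

# The summary shows R-squared, the F statistic and the coefficient p-values,
# comparable to the Excel output above
print(model.summary())
```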
Coefficients of Multiple Linear Regression

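The slide content is not reproduced here; in matrix form, the least squares estimates of the multiple linear regression coefficients are

$$ \hat{\beta} = (X^{\top}X)^{-1} X^{\top} y $$

where X is the n x (k+1) design matrix (a column of ones for the intercept plus the k predictor columns) and y is the n x 1 vector of responses.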
Example: Multicollinearity problem in Diamonds Dataset
 The first 15 entries of this dataset are shown

     price  depth  carat  table  x     y     z
1    326    61.5   0.23   55     3.95  3.98  2.43
2    326    59.8   0.21   61     3.89  3.84  2.31
3    327    56.9   0.23   65     4.05  4.07  2.31
4    334    62.4   0.29   58     4.2   4.23  2.63
5    335    63.3   0.31   58     4.34  4.35  2.75
6    336    62.8   0.24   57     3.94  3.96  2.48
7    336    62.3   0.24   57     3.95  3.98  2.47
8    337    61.9   0.26   55     4.07  4.11  2.53
9    337    65.1   0.22   61     3.87  3.78  2.49
10   338    59.4   0.23   61     4     4.05  2.39
11   339    64     0.3    55     4.25  4.28  2.73
12   340    62.8   0.23   56     3.93  3.9   2.46
13   342    60.4   0.22   61     3.88  3.84  2.33
14   344    62.2   0.31   54     4.35  4.37  2.71
15   345    60.2   0.2    62     3.79  3.75  2.27

Dataset from https://www.kaggle.com/shivam2503/diamonds
Example: Multicollinearity problem in Diamonds Dataset-contd
 Correlation matrix for this dataset
 This matrix gives Pearson’s correlation coefficient among each pair of variables in the dataset
 We observe that there is a strong correlation among the variables carat and x, y, z

Correlation matrix
       price     depth     carat     table     x         y         z
price  1
depth  -0.00348  1
carat  0.858804  0.096712  1
table  0.044536  -0.33338  0.106334  1
x      0.910751  -0.04391  0.978803  0.109662  1
y      0.918344  -0.05232  0.973486  0.090974  0.996182  1
z      0.894832  0.213112  0.981171  0.014295  0.965433  0.963135  1
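A minimal sketch (assuming the CSV has been downloaded from the Kaggle link above to a local file) of computing this correlation matrix with pandas:

```python
import pandas as pd

# Assumed local path for the Kaggle diamonds dataset
diamonds = pd.read_csv("diamonds.csv")

cols = ["price", "depth", "carat", "table", "x", "y", "z"]
print(diamonds[cols].corr(method="pearson").round(3))
```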
Categorical and dummy variable
Categorical variables
 Used to represent categories or labels
 When an independent variable of interest is categorical-
 "Gender" might be coded as Male or Female
 "Region" might be coded as South, Northeast, Midwest, and West
 Seasons of the year
 Metro cities
 Such variables need to be coded as dummy variables for inclusion into a
regression model
 Dummy variable is a numeric variable that represents categorical data,
such as gender, race, political affiliation, color, designation etc
Alice Zheng, Amanda Casari, Feature Engineering for Machine Learning, O’Reilly, 2018, first edition.
https://methods.sagepub.com/base/download/DatasetStudentGuide/multiple-reg-dummy-in-gss-2012
Categorical Variable types
 Nominal
 Values don’t have a particular order
 Example:
 For the categorical variable colour (red, green, blue) there is no order to these colours
 The categorical variable gender (male and female) does not have any order
 Ordinal
 Categorical variable education level (10th, 12th, Bachelors) has an order
 Categorical variable feedback (happy, neutral, unhappy) has an order
 Categorical variable size (large, medium, small)
One hot encoding
 Each bit represents a possible category
 If the variable cannot belong to multiple categories
at once, then only one bit in the group can be “on”
 This is called one-hot encoding
 In scikit-learn it is implemented as
 sklearn.preprocessing.OneHotEncoder
One hot encoding
Example- Representation of cities

City C1 C2 C3
Mumbai 1 0 0
Delhi 0 1 0
Banglore 0 0 1

Each of the bits is a feature

Hence a categorical variable with k possible categories is encoded as a feature vector


of length k.
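A minimal sketch of one-hot encoding the city example with scikit-learn's OneHotEncoder (the city names follow the table above):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cities = pd.DataFrame({"City": ["Mumbai", "Delhi", "Banglore"]})

# One bit ("feature") per category; sparse_output=False returns a dense array
# (use sparse=False on older scikit-learn versions)
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(cities[["City"]])

print(encoder.categories_)  # category corresponding to each bit
print(encoded)              # each row has exactly one bit set to 1
```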
One hot encoding
 Very simple to understand
 However it uses one more bit than is strictly necessary
 If we see that k–1 of the bits are 0, then the last bit must
be 1
 Because the variable must take on one of the k values
 Mathematically, one can write this constraint as
 “the sum of all bits must be equal to 1”
 c1 + c2 + c3 + … + ck = 1
Dummy coding
 The problem with one-hot encoding is that it allows for k degrees of freedom, while the variable itself needs only k–1
 Dummy coding removes the extra degree of freedom by using only k–1 features in the representation
 One feature is dropped and represented by the vector of all zeros
 Dummy coding and one-hot encoding are both implemented in Pandas as pandas.get_dummies

City C1 C2
Mumbai 1 0
Delhi 0 1
Banglore 0 0
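A minimal sketch of the same encoding with pandas.get_dummies; drop_first=True gives the k–1 dummy-coded columns shown in the table above (the all-zeros row stands for the dropped category):

```python
import pandas as pd

cities = pd.DataFrame({"City": ["Mumbai", "Delhi", "Banglore"]})

one_hot = pd.get_dummies(cities["City"])                   # k columns (one-hot encoding)
dummies = pd.get_dummies(cities["City"], drop_first=True)  # k-1 columns (dummy coding)

print(one_hot)
print(dummies)
```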
Example: Gender Discrimination
 Lawsuit dataset from Kaggle
 Only the first entries (rows 0-8) are shown
 Need to create dummy variables for multiple categorical variables (Gender and Job)

    Gender  Years_of_Experience  Job             Salary
0   Male    9                    Full_Professor  84612
1   Male    10                   Associate       78497
2   Male    6                    Assistant       67756
3   Male    27                   Full_Professor  173220
4   Male    10                   Full_Professor  96099
5   Male    10                   Full_Professor  87531
6   Male    9                    Associate       99972
7   Male    11                   Full_Professor  166601
8   Male    18                   Full_Professor  85437

https://www.kaggle.com/hjmjerry/gender-discrimination
Example: Gender Discrimination
One hot encoding technique
 Creating dummy variable for
categorical variable gender using
one hot encoding technique
 If the candidate is female, assign
1 to Gender_Female variable and
0 to Gender_Male variable
 And vice versa
Dummy variable trap
 Two or more dummy variables created by one-hot encoding are highly correlated (multi-
collinear)
 One variable can be predicted from the others, making it difficult to interpret predicted
coefficient variables in regression models
 In other words, the individual effect of the dummy variables on the prediction model cannot be
interpreted well because of multicollinearity
 In our example, if male variable is 1, that means that the female variable is 0
 Both the variables are highly correlated
 Multicollinearity exists
 The dummy variables contain redundant information
 To overcome the dummy variable trap, we drop one of the columns created when the categorical variables were converted to dummy variables by one-hot encoding
So how many dummy variables are
needed?
 Number of dummy variables required to represent
a particular categorical variable depends on the
number of values that the categorical variable can
assume
 To represent a categorical variable that can
assume k different values, we need k - 1 dummy
variables
Removing redundant information from the dummy variable to
overcome dummy variable trap

 Drop one column


 Here we have
removed
Gender_female
column
Similarly creating dummy variable
column for Job
 Here we have dropped
the column with
Job_Assistant
 When both
Job_Full_Professor and
Job_Associate are 0, that
means it is Job_Assistant
Check for multicollinearity using VIF
 Using built in function in
statsmodels package for
calculating VIF
 We see that VIF <5 for all
the independent variables
 Hence we take into
consideration all the
independent variables for
regression
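A minimal sketch of the VIF check with the built-in statsmodels function mentioned above; since the lawsuit data are not reproduced in full here, the delivery-service predictors from the earlier example are used instead:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = pd.DataFrame({
    "milesTravelled": [89, 66, 78, 111, 44, 77, 80, 66, 109, 76],
    "numDeliveries":  [4, 1, 3, 6, 1, 3, 3, 2, 5, 3],
    "petrolPrice":    [3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25],
})

# VIF is computed per column of the design matrix (constant included)
Xc = add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])],
    index=Xc.columns,
)
print(vif.drop("const"))  # VIF < 5 is commonly treated as acceptable
```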
Label encoding
 Involves converting each value in a column to a number
 The problem with using numbers is that they introduce a relation/comparison

Level (text)  Level (numeric)  Colour (text)  Colour (numeric)
Low           0                Red            0
Medium        1                Green          1
High          2                Blue           2

For Level, the order 0 < 1 < 2 is introduced in the right way: Low < Medium < High.
For Colour, the algorithm might misunderstand 0 < 1 < 2 as Red < Green < Blue.
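A minimal sketch reproducing the mapping in the table with pandas ordered categories (scikit-learn's LabelEncoder would instead assign codes alphabetically):

```python
import pandas as pd

df = pd.DataFrame({"Level":  ["Low", "Medium", "High"],
                   "Colour": ["Red", "Green", "Blue"]})

# Explicit category order so that Low=0, Medium=1, High=2 as in the table
df["Level (numeric)"] = pd.Categorical(
    df["Level"], categories=["Low", "Medium", "High"], ordered=True
).codes

# Colour has no natural order; the numeric codes are arbitrary labels
df["Colour (numeric)"] = pd.Categorical(
    df["Colour"], categories=["Red", "Green", "Blue"]
).codes

print(df)
```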
Hypothesis testing
Testing usefulness of a regression model
Interpreting results of significant regression
 Once it is determined that the model is useful for predicting the DV, this usefulness should be explored in more detail
 Do all the predictor variables add important information for prediction in the presence of the other predictors already in the model?
Example: Condominiums
 How do real estate agents decide the asking price of a condominium?

Observation  List price* y  Living area x1  Floors x2  bedrooms x3  baths x4
1            169            6               1          2            1
2            218.5          10              1          2            2
3            216.5          10              1          3            2
4            225            11              1          3            2
5            229            13              1          3            1.7
6            235            13              2          3            2.5
7            239.9          13              1          3            2
8            247            17              2          3            2.5
9            260            19              2          3            2
10           269            18              1          3            2
11           234            13              1          4            2
12           255            18              1          4            2
13           269            17              2          4            3
14           294            20              2          4            3
15           309.9          21              2          4            3
*List price in thousands

Ref - William Mendenhall, Robert Beaver, Barbara Beaver, Introduction to Probability and Statistics, Cengage, 14th edition
Example: Condominiums - Correlation matrix
 Observations??

                List price y  Living area x1  Floors x2  bedrooms x3  baths x4
List price y    1
Living area x1  0.9507451     1
Floors x2       0.6053571     0.6297639       1
bedrooms x3     0.7441209     0.7109191       0.375      1
baths x4        0.8333801     0.7199549       0.7596752  0.6751223    1
Example: Condominiums
Regression results with all the IVs
Regression taking into account all the independent variables

SUMMARY OUTPUT
Regression Statistics
Multiple R         0.9849108
R Square           0.9700494
Adjusted R Square  0.9580691
Standard Error     6.9841692
Observations       15

ANOVA
            df  SS         MS         F         Significance F
Regression  4   15798.558  3949.6395  80.97071  1.41E-07
Residual    10  487.78619  48.778619
Total       14  16286.344

                Coefficients  Standard Error  t Stat      P-value    Lower 95%   Upper 95%   Lower 95.0%  Upper 95.0%
Intercept       119.03967     9.388678        12.679067   1.738E-07  98.120396   139.95895   98.120396    139.95895
Living area x1  6.2530548     0.7395039       8.4557428   7.224E-06  4.6053374   7.9007721   4.6053374    7.9007721
Floors x2       -16.119062    6.3343795       -2.5446947  0.0291277  -30.232939  -2.0051847  -30.232939   -2.0051847
bedrooms x3     -2.8213208    4.5823898       -0.6156876  0.5518553  -13.031522  7.38888     -13.031522   7.38888
baths x4        30.266348     6.9835377       4.3339564   0.0014803  14.706057   45.82664    14.706057    45.82664

 Regression results from Excel are shown
 Here, the number of independent variables k = 4
 And the total number of observations n = 15
 s^2 = MSE = SSE/(n-k-1) = 487.78619/(15-4-1) = 48.7787
 The p-value for bedrooms (x3) > 0.05
 If we drop this IV, could the results be any better??
Regression results with all the IVs except bedrooms (x3)
Regression output by dropping IV bedrooms (x3)

SUMMARY OUTPUT
Regression Statistics
Multiple R         0.984334
R Square           0.968914
Adjusted R Square  0.960436
Standard Error     6.784185
Observations       15

ANOVA
            df  SS        MS        F         Significance F
Regression  3   15780.07  5260.022  114.2858  1.43E-08
Residual    11  506.2768  46.02516
Total       14  16286.34

                Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept       115.9232      7.680888        15.09242  1.07E-08  99.01767   132.8287   99.01767     132.8287
Floors x2       -14.4939      5.59334         -2.59128  0.025081  -26.8048   -2.18306   -26.8048     -2.18306
Living area x1  6.015165      0.612481        9.820977  8.85E-07  4.667103   7.363227   4.667103     7.363227
baths x4        28.10408      5.863379        4.793154  0.000559  15.19887   41.00929   15.19887     41.00929

 Observations
 One variable is removed, and the MSE is reduced (46.03 vs 48.78)
 The value of R squared with three variables is 0.9689
 The value of R squared with four variables was 0.9700; adding a variable never decreases R squared (see the next slide)
Adjusted R squared - revisited
 Due to the definition of R squared, its value can never decrease with the addition of more variables into the regression
 It can either stay the same or increase
 Hence its value can be inflated by the inclusion of more predictor variables
 To account for this, we can adjust the model which has more than one feature (independent variable)
 It can be done by using the adjusted R squared value
 Adjusted R squared uses mean squares (MSE) rather than sums of squares
 In the MSE formula, the degrees of freedom are embedded
 {MSE = SSE/(n-k-1)}
 Hence adjusted R squared represents the variation in the response y explained by the independent variables, corrected for degrees of freedom
Adjusted R squared - revisited
 The adjustment is made by balancing the number of features against the number of observations
 In a model having many features and few observations, a severe adjustment will be made
 Adjusted R squared can become negative if there are many features and very few observations
 It can be said that an increase of one feature reduces the degrees of freedom by one
Example: Corporate profits
 Find the least squares prediction equation
 Use the overall F test to determine whether the model contributes significant information for prediction of y. Use α = 0.01
 Does advertising expenditure x1 contribute significant information for prediction of y given that x2 is already in the model? Use α = 0.01
 Calculate the coefficient of determination R squared. What percentage of the overall variation is explained by the model?

Corporate Profits
profit y  advertising x1  capital x2
15        25              4
16        1               5
2         6               3
3         30              1
12        29              2
1         20              0
16        12              4
18        15              5
13        6               4
2         16              2

Ref - William Mendenhall, Robert Beaver, Barbara Beaver, Introduction to Probability and Statistics, Cengage, 14th edition
References
 Machine Learning, IBM ICE
 https://towardsdatascience.com/multiple-linear-regression-in-four-lines-of-code-b8ba26192e84
 https://www.youtube.com/watch?v=dQNpSa-bq4M
 https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/
 Alice Zheng, Amanda Casari, Feature Engineering for Machine Learning, O’Reilly, 2018, first edition.
 https://methods.sagepub.com/base/download/DatasetStudentGuide/multiple-reg-dummy-in-gss-2012