Linear Regression
More Examples?
• Software project time estimation?
• Marketing?
• Finance?
• Supply chain?
• Elsewhere?
Simple Linear Regression:
Continuous Response
One Predictor
Auto Data
Description
Mpg: miles per gallon (Y: Response)
What is Regression?
Example: Auto Data
r = -0.86 r = 0.40
Correlation
Correlation is a measure of the strength of the relationship between
a pair of variables
In common usage it refers to the extent to which two variables
have a linear relationship with each other
Denoted by ρ or r
Ranges from -1.0 to +1.0
The closer r is to +1 or -1, the more closely the two variables
are related
If r is close to 0, there is little or no linear relationship between
the variables (a nonlinear relationship may still exist)
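A minimal sketch of how r can be computed, assuming the usual Pearson formula r = Sxy / sqrt(Sxx · Syy). The data below are made up for illustration and are not from the Auto data; the function name `pearson_r` is hypothetical. The second example shows a perfectly dependent but symmetric U-shaped pattern for which r is 0.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data (an assumption, not from the slides)
x = [1, 2, 3, 4, 5]
y_line = [2 * v + 1 for v in x]      # exact increasing line
y_curve = [(v - 3) ** 2 for v in x]  # symmetric U shape around x = 3

r_line = pearson_r(x, y_line)    # perfect positive linear relationship
r_curve = pearson_r(x, y_curve)  # no linear relationship, despite dependence
print(r_line, r_curve)
```

A value of r near 0 therefore rules out a linear pattern only, which is why a scatterplot is worth checking alongside r.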
Uncorrelated Data
Pairwise Correlations
Correlation - Regression
Correlation and regression are closely related in concept
Correlation is limited in scope
Regression is more versatile and flexible
Correlation does not provide a structure of dependence
Regression provides a model
Correlation is only pairwise
Regression can handle multiple predictors in a joint
environment
Auto Data
Can mpg be estimated as a linear function of
displacement?
mpg = α + β displacement
Simple Linear Regression
Observe (Xi, Yi) for a sample of size n
Y : Response (Continuous)
X : Predictor
E(Y) = α + β X
Yi = α + β Xi + εi
εi : Error, independent
εi ~ N(0, σ²)
Estimate α and β
Ŷi = a + b Xi
Predict the response Y by Ŷ
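A minimal sketch of how the estimates a and b can be obtained, assuming the standard least-squares formulas b = Sxy / Sxx and a = ȳ − b·x̄. The five data points are made up for illustration; `fit_line` is a hypothetical helper name, not something from the slides.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Made-up illustrative data (an assumption, not the Auto data)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
fitted = [a + b * xi for xi in x]                    # predicted responses
residuals = [yi - fi for yi, fi in zip(y, fitted)]   # observed minus fitted
print(a, b, sum(residuals))
```

With an intercept in the model, the least-squares residuals sum to zero up to rounding error, which the last printed value illustrates.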
Which Line?
Error in Regression
Regression Model
Yi = α + βxi + εi
E(Yi) = α + βxi
εi ~ N(0, σ²)
Estimated
Ŷi = a + bxi
Residual
ri = Yi - Ŷi
Fuel Efficiency of a Car:
Displacement
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const 28.5089 1.179 24.188 0.000 26.095 30.923
displacement -0.0418 0.005 -8.829 0.000 -0.051 -0.032
=====================================================
R-squared: 0.736
F-statistic: 77.96
Prob (F-statistic): 1.39e-09
No. Observations: 30
Df Residuals: 28
Df Model: 1
Fuel Efficiency of a Car:
Displacement
• Intercept (a)
• Slope (b)
• Sign of slope
• Sign of correlation coefficient
• For a unit increase in displacement, by how much does mpg change?
• Compare two cars with displacements 360 and 232. What is the
difference between their expected mpg?
• What are the actual mpg values?
• Why the difference?
• For a new car with displacement 300, will we know the exact mpg?
• For a new car with displacement 500, can we estimate mpg?
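The questions above can be worked through with the fitted coefficients from the displacement table (const 28.5089, displacement -0.0418). This is a sketch only; `expected_mpg` is a hypothetical helper name. The fitted line gives the expected mpg, not the exact mpg of any single car, and the slides do not state the observed range of displacement, so the value at 500 may be an extrapolation.

```python
# Coefficients as reported in the displacement regression table
A = 28.5089   # intercept (const)
B = -0.0418   # slope for displacement

def expected_mpg(displacement):
    """Expected mpg on the fitted line for a given displacement."""
    return A + B * displacement

# Difference in expected mpg between displacement 360 and 232:
# the intercept cancels, leaving (360 - 232) * B
diff = expected_mpg(360) - expected_mpg(232)

mpg_300 = expected_mpg(300)  # point estimate; the actual mpg also includes error
mpg_500 = expected_mpg(500)  # possibly outside the observed data range
print(diff, mpg_300, mpg_500)
```

The difference depends only on the slope: 128 units of displacement times -0.0418 mpg per unit.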
Fuel Efficiency of a Car
• Regress mpg on horsepower
• Regress mpg on weight
• Regress mpg on acceleration
Fuel Efficiency of a Car: HP
Call:
lm(formula = mpg ~ horsepower, data = Auto)
Residuals:
Min 1Q Median 3Q Max
-3.7103 -1.5578 -0.4834 0.7346 6.0558
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.36800 1.69819 16.705 4.30e-16 ***
horsepower -0.08780 0.01465 -5.993 1.86e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Fuel Efficiency of a Car: Weight
Call:
lm(formula = mpg ~ weight, data = Auto)
Residuals:
Min 1Q Median 3Q Max
-2.8887 -1.7401 -0.1255 1.0009 5.9506
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.8456229 2.6775772 13.39 1.08e-13 ***
weight -0.0053918 0.0008232 -6.55 4.22e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Fuel Efficiency of a Car: Acceleration
Residuals:
Min 1Q Median 3Q Max
-5.6133 -1.8768 -0.3501 1.4241 8.9104
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.5258 3.4051 3.091 0.00448 **
acceleration 0.5309 0.2236 2.374 0.02467 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Fuel Efficiency of a Car
If only one predictor is to be used to predict fuel efficiency of a car
which one should be used?
Predictor      R²
Displacement   73.6%
Weight         60.5%
Acceleration   16.8%
HP             56.2%
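As a cross-check, in a simple linear regression R² can be recovered from the slope's t statistic via R² = t² / (t² + residual df). The sketch below uses the t values printed in the four single-predictor outputs and assumes all four fits use the same 30 observations (28 residual degrees of freedom, as reported for the displacement fit). `r2_from_t` is a hypothetical helper name.

```python
def r2_from_t(t, df_resid):
    """R-squared of a one-predictor regression from the slope t statistic."""
    return t * t / (t * t + df_resid)

DF_RESID = 28  # assumption: n = 30 for every single-predictor fit

# Slope t statistics as printed in the regression outputs
t_stats = {
    "displacement": -8.829,
    "horsepower": -5.993,
    "weight": -6.55,
    "acceleration": 2.374,
}

r2 = {name: r2_from_t(t, DF_RESID) for name, t in t_stats.items()}
for name, value in sorted(r2.items(), key=lambda kv: -kv[1]):
    print(f"{name:13s} {100 * value:5.1f}%")
```

The ranking by R² is the same as the ranking by |t|, since all four models have the same residual degrees of freedom.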
On Your Own!
Insurance Claim vs GDP
district: The data is collected at the district level. There are 620 districts.
region: Each of the four regions (North, East, West, and South) has 155 districts.
Spend type: Spend type is classified into Public and Insurance. It is assumed here that public expenditure is borne by the government and is used for treating uninsured people. People with insurance do not use any public expenditure. These two groups are therefore mutually exclusive.
percapgdp: Per capita GDP of the district
avgcancerspend: Average spend on cancer by public and insurance systems per patient
avgheartspend: Average spend on heart by public and insurance systems per patient
avgorganspend: Average spend on organ treatment by public and insurance systems per patient
Insurance Claim vs GDP
Insurance companies want to know whether insurance spend
depends on per capita GDP.
Insurance Claim vs GDP:
Scatterplots
r = 0.78
Insurance Claim vs GDP
===================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
const 2.081e+04 582.763 35.702 0.000 1.97e+04 2.19e+04
AvgCancerSpend 0.0697 0.002 43.816 0.000 0.067 0.073
R-squared: 0.608
Insurance Claim vs GDP
Is there a dependency of insurance claim due to cancer on per
capita GDP?
Insurance Claim vs GDP:
Scatterplots
r = 0.24
Insurance Claim vs GDP
Is there a dependency of insurance claim due to heart on per
capita GDP?
Insurance Claim vs GDP:
Scatterplots
r = 0.30
Insurance Claim vs GDP
Is there a dependency of insurance claim due to organ on per
capita GDP?
Regression Residuals
Regression Model
Yi = α + βxi + εi
E(Yi) = α + βxi
εi ~ N(0, σ²)
Estimated
Ŷi = a + bxi
Residual
ri = Yi - Ŷi
Regression ANOVA
Residual extraction in R: $residuals
Fitted values in R: $fitted
Response: mpg
Df Sum Sq Mean Sq F value Pr(>F)
displacement 1 241.476 241.476 77.955 1.395e-09 ***
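The ANOVA row above is enough to recover the rest of the decomposition SST = SSR + SSE. The sketch below assumes 28 residual degrees of freedom (as reported in the displacement fit) and uses F = MSR / MSE to back out the residual mean square; the Residuals row itself is not shown in the table, so these values are reconstructed rather than quoted.

```python
# Values from the ANOVA row for displacement
ss_reg = 241.476   # regression sum of squares (SSR)
df_model = 1
f_value = 77.955

# Assumption: residual df taken from the displacement fit (n = 30, 2 coefficients)
df_resid = 28

ms_reg = ss_reg / df_model        # mean square for regression
ms_resid = ms_reg / f_value       # since F = MSR / MSE
ss_resid = ms_resid * df_resid    # residual sum of squares (SSE)
ss_total = ss_reg + ss_resid      # total sum of squares (SST)
r_squared = ss_reg / ss_total     # share of variability explained

print(round(ss_resid, 3), round(ss_total, 3), round(r_squared, 3))
```

The resulting R² agrees with the value reported alongside the displacement coefficients.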
Multiple Linear
Regression:
Continuous Response
Two or More Predictors
Fuel Efficiency of a Car
So what happens if all the predictors are included in the model
simultaneously?
Multiple Linear Regression
Observe (X1i, X2i, X3i, ⋯, Xpi, Yi) for a sample of size n
Y : Response (Continuous)
X1, X2, X3, ⋯, Xp : Predictors
E(Y) = α + β1 X1 + ⋯ + βp Xp
Yi = α + β1 X1i + ⋯ + βp Xpi + εi
εi : Error, independent
εi ~ N(0, σ²)
Estimate α, β1, ⋯, βp
Ŷi = a + b1 X1i + ⋯ + bp Xpi
Predict the response Y by Ŷ
Fuel Efficiency of a Car: Wt & HP
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 34.4319 2.593 13.276 0.000 29.110 39.753
horsepower -0.0440 0.020 -2.193 0.037 -0.085 -0.003
weight -0.0034 0.001 -2.879 0.008 -0.006 -0.001
Fuel Efficiency of a Car: Wt & HP
• Intercept (a)
• Slopes (b1, b2)
• Sign of slopes
• Sign of correlation coefficients
• For a unit increase in weight, by how much does mpg change? What
happens with horsepower?
• What are the fitted values?
• For a car with hp=90 and weight=3200, can you get the actual mpg?
• For a car with hp=150 and weight=4260, can you get the actual
mpg?
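A sketch of the fitted values for the two cars in the questions above, using the coefficients from the weight and horsepower table (Intercept 34.4319, horsepower -0.0440, weight -0.0034). `fitted_mpg` is a hypothetical helper name. The model returns the expected mpg; the actual mpg of a car is known only if that car is in the data.

```python
# Coefficients as reported in the two-predictor table
INTERCEPT = 34.4319
B_HP = -0.0440       # change in expected mpg per unit horsepower, weight held fixed
B_WEIGHT = -0.0034   # change in expected mpg per unit weight, horsepower held fixed

def fitted_mpg(horsepower, weight):
    """Expected mpg from the fitted two-predictor model."""
    return INTERCEPT + B_HP * horsepower + B_WEIGHT * weight

mpg_car_1 = fitted_mpg(horsepower=90, weight=3200)
mpg_car_2 = fitted_mpg(horsepower=150, weight=4260)
print(mpg_car_1, mpg_car_2)
```

Each slope is a partial effect: it describes the change in expected mpg when that predictor increases by one unit and the other predictor is held constant.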
Auto Data: Full Model
                 coef   std err        t    P>|t|
Intercept     34.7116     6.111    5.680    0.000
cylinders     -0.9100     0.864   -1.053    0.303
displacement  -0.0173     0.017   -1.044    0.307
horsepower    -0.0203     0.042   -0.483    0.633
weight        -0.0006     0.002   -0.329    0.745
Response: mpg
Df Sum Sq Mean Sq F value Pr(>F)
cylinders 1 235.512 235.512 72.7147 1.001e-08 ***
displacement 1 10.575 10.575 3.2651 0.08332 .
horsepower 1 0.571 0.571 0.1763 0.67830
weight 1 3.196 3.196 0.9869 0.33042
acceleration 1 0.623 0.623 0.1924 0.66481
Residuals 24 77.732 3.239
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Auto Data: Full Model
Should it matter?
Auto Data
Pairwise correlation among variables
                mpg  cylinders  displacement  horsepower  weight  acceleration
mpg            1.00      -0.85         -0.86       -0.75   -0.78          0.41
cylinders     -0.85       1.00          0.94        0.82    0.81         -0.55
displacement  -0.86       0.94          1.00        0.83    0.83         -0.48
horsepower    -0.75       0.82          0.83        1.00    0.76         -0.74
weight        -0.78       0.81          0.83        0.76    1.00         -0.24
acceleration   0.41      -0.55         -0.48       -0.74   -0.24          1.00
Multicollinearity
If predictor variables are strongly linearly related to one another,
the data is said to be multicollinear
It is then difficult to obtain reliable estimates of the
regression coefficients
This can lead to incorrect conclusions about the
relationship between the outcome variable and the predictor
variables
Regression coefficients have high variance
Coefficients may have the wrong sign
Estimates are extremely sensitive to slight changes in the data points
The prediction model becomes unreliable
Multicollinearity
Variance Inflation Factor (VIF) is a measure of
multicollinearity
The VIF is an index of how much the
variance of an estimated regression coefficient is
increased because of multicollinearity
Ideally VIF should be close to 1
VIF > 5 indicates moderate multicollinearity
VIF > 10 indicates severe multicollinearity
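A minimal sketch of the VIF calculation, assuming the usual definition VIF_j = 1 / (1 − R_j²), where R_j² is the R² from regressing predictor j on the other predictors. With only two predictors, R_j² is simply the squared pairwise correlation, so the Auto data correlation of -0.24 between weight and acceleration can be plugged in directly. `vif_from_r2` is a hypothetical helper name.

```python
def vif_from_r2(r2_j):
    """Variance inflation factor given R-squared of predictor j on the others."""
    return 1.0 / (1.0 - r2_j)

# Pairwise correlation between weight and acceleration (Auto data)
r_weight_accel = -0.24

# Two-predictor model: R_j^2 equals the squared pairwise correlation,
# and both predictors share the same VIF
vif_two_predictors = vif_from_r2(r_weight_accel ** 2)
print(round(vif_two_predictors, 2))

# For contrast: a pair correlated at 0.94 (cylinders and displacement)
vif_high = vif_from_r2(0.94 ** 2)
print(round(vif_high, 2))
```

With more than two predictors, R_j² has to come from a full auxiliary regression, so pairwise correlations alone can understate the VIF.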
VIF: Auto Data
VIF (set 1)
cylinders  displacement  horsepower  weight  acceleration
    11.71         11.69       13.06    7.70          6.29
VIF (set 2)
cylinders  displacement  weight  acceleration
    10.47          9.76    3.82          1.70
VIF (set 3)
displacement  weight  acceleration
        4.30    3.50          1.45
VIF (set 4)
weight  acceleration
  1.06          1.06
Auto Data: Final Model
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Response: mpg
Df Sum Sq Mean Sq F value Pr(>F)
weight 1 198.602 198.602 47.545 2.077e-07 ***
acceleration 1 16.825 16.825 4.028 0.05486 .
Residuals 27 112.782 4.177
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Are you satisfied with the multiple regression
model assumptions?
Transformation
• Sometimes a transformation of the response improves the
model fit
• Common transformations: square root, logarithm (base e),
inverse
• Instead of regressing Y on the predictors, the transformed Y is
regressed on them
Transformation Example
Regress 1/MPG on weight and acceleration
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.146e-02 9.476e-03 2.264 0.0318 *
weight 1.533e-05 1.981e-06 7.736 2.55e-08 ***
acceleration -1.002e-03 3.707e-04 -2.702 0.0118 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
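Because the response here is 1/MPG, a prediction has to be inverted to get back to the mpg scale. The sketch below uses the coefficients from the table above; the car (weight 3200, acceleration 15) is a made-up example, not a row from the data, and `predicted_inverse_mpg` is a hypothetical helper name.

```python
# Coefficients from the regression of 1/MPG on weight and acceleration
INTERCEPT = 2.146e-02
B_WEIGHT = 1.533e-05
B_ACCEL = -1.002e-03

def predicted_inverse_mpg(weight, acceleration):
    """Fitted value on the transformed scale (1/mpg)."""
    return INTERCEPT + B_WEIGHT * weight + B_ACCEL * acceleration

# Hypothetical car, chosen only for illustration
inv_mpg = predicted_inverse_mpg(weight=3200, acceleration=15)
mpg = 1.0 / inv_mpg   # back-transform to the original mpg scale
print(inv_mpg, mpg)
```

Note the sign flip relative to the untransformed models: a positive weight coefficient on 1/MPG means that heavier cars have lower mpg.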
On Your Own!
Concrete Strength
Concrete compressive strength -- quantitative -- MPa -- Output
Variable
Concrete Strength
• Does concrete strength depend on the input variables?
• Does concrete strength depend on ALL the input variables?
• Is there a multicollinearity problem in the model?
• Interpret the regression coefficients. How do changes in the
input affect the output?
• If the input variables are adjusted arbitrarily, do you think the
output variable will go on increasing?
• How much of the total variability in the strength is explained by
the input variables?
• Explore the residuals and decide whether the assumptions are
satisfied
Final Model – Comments?
Insurance Claim vs GDP
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.454e+04 8.540e+03 -1.703 0.0889 .
PerCapGDP 8.718e+00 1.804e-01 48.335 <2e-16 ***
Spendtypepublic -6.132e+04 3.735e+03 -16.417 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Insurance Claim vs GDP
For the districts where Spend Type = Insurance
Insurance Claim vs GDP
• Does Region have an effect AFTER the dependency on GDP
and SpendType has been accounted for?
• Region is a factor variable with 4 levels
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.259e+05 8.784e+03 25.718 <2e-16 ***
PerCapGDP -1.363e-01 2.751e-01 -0.496 0.62
Spendtypepublic -6.132e+04 2.549e+03 -24.059 <2e-16 ***
RegionNorth 1.371e+05 5.487e+03 24.993 <2e-16 ***
RegionSouth 2.733e+05 7.764e+03 35.196 <2e-16 ***
RegionWest 2.226e+05 6.409e+03 34.742 <2e-16 ***
---
Residual standard error: 44880 on 1234 degrees of freedom
Multiple R-squared: 0.8505, Adjusted R-squared: 0.8499
F-statistic: 1404 on 5 and 1234 DF, p-value: < 2.2e-16
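A sketch of how the factor coefficients enter a prediction. A 4-level factor is coded with 3 dummy variables; the level without a coefficient in the table (East for Region, insurance for Spendtype) is taken here to be the baseline, which is an inference from the output rather than something the slides state. `predict_spend` and the GDP value of 50000 are hypothetical.

```python
# Coefficients as reported in the table above
INTERCEPT = 2.259e+05
B_GDP = -1.363e-01
B_PUBLIC = -6.132e+04
REGION_EFFECT = {
    "East": 0.0,          # assumed baseline level (no dummy in the output)
    "North": 1.371e+05,
    "South": 2.733e+05,
    "West": 2.226e+05,
}

def predict_spend(percapgdp, spend_type, region):
    """Fitted response for a district with the given characteristics."""
    if spend_type not in ("insurance", "public"):
        raise ValueError("spend_type must be 'insurance' or 'public'")
    value = INTERCEPT + B_GDP * percapgdp + REGION_EFFECT[region]
    if spend_type == "public":
        value += B_PUBLIC   # shift relative to the assumed insurance baseline
    return value

east_ins = predict_spend(50000, "insurance", "East")
north_ins = predict_spend(50000, "insurance", "North")
north_pub = predict_spend(50000, "public", "North")
print(east_ins, north_ins - east_ins, north_pub)
```

Each region coefficient is the difference in fitted response from the baseline region at the same GDP and spend type, so the North minus East gap equals the RegionNorth coefficient.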
Insurance Claim vs GDP
• Write down the multiple regression model
• Repeat similar analysis for AvgHeartSpend and
AvgOrganSpend
• What do you conclude in each case?
Model Selection
• Adjusted R² = 1 − [SSE / (n − p)] / [SST / (n − 1)]
(p: number of estimated coefficients, including the intercept)
• Akaike's Information Criterion (AIC) =
2p + n ln(SSE) − n ln(n)
• Bayesian Information Criterion (BIC) =
p ln(n) + n ln(SSE) − n ln(n)
• For a sequence of models choose the one with the minimum value of
AIC or BIC
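A sketch comparing two of the Auto models on these criteria, using the common least-squares forms AIC = 2p + n ln(SSE/n) and BIC = p ln(n) + n ln(SSE/n), with p counting the intercept. SST is taken from the weight and acceleration ANOVA table (198.602 + 16.825 + 112.782), and the SSE of the displacement-only model is derived as SST minus its regression sum of squares (241.476); n = 30 is assumed for both.

```python
import math

def adjusted_r2(sse, sst, n, p):
    return 1 - (sse / (n - p)) / (sst / (n - 1))

def aic(sse, n, p):
    return 2 * p + n * math.log(sse) - n * math.log(n)

def bic(sse, n, p):
    return p * math.log(n) + n * math.log(sse) - n * math.log(n)

n = 30
sst = 198.602 + 16.825 + 112.782   # total sum of squares from the ANOVA table

# p counts every estimated coefficient, including the intercept
models = {
    "displacement": {"sse": sst - 241.476, "p": 2},
    "weight + acceleration": {"sse": 112.782, "p": 3},
}

for name, m in models.items():
    print(name,
          round(adjusted_r2(m["sse"], sst, n, m["p"]), 3),
          round(aic(m["sse"], n, m["p"]), 2),
          round(bic(m["sse"], n, m["p"]), 2))

best_by_aic = min(models, key=lambda k: aic(models[k]["sse"], n, models[k]["p"]))
print("lowest AIC:", best_by_aic)
```

These criteria only rank the candidate models by fit and complexity; they do not check multicollinearity or the residual assumptions, which have to be examined separately.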