Regression analysis
Regression analysis: objectives
Univariate linear regression
(when all you have is a single predictor variable)
Linear regression
• (a linear relationship, a straight line when plotted)
“Fitting a line through data”:
𝑦 = 𝛽₀ + 𝛽₁ × 𝑥 + error
where 𝑦 is the outcome variable, 𝑥 is the predictor variable, 𝛽₀ is the intercept and 𝛽₁ is the slope
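As a quick illustration of this model (a sketch not from the lecture; the 𝛽 values below are arbitrary and assume numpy and matplotlib are installed):

import numpy
import matplotlib.pyplot as plt

# arbitrary illustrative values for the intercept and slope
beta0 = 0.2
beta1 = 0.05
x = numpy.linspace(0, 10, 50)
# the "error" term: random scatter around the straight line
error = numpy.random.normal(0, 0.05, size=x.size)
y = beta0 + beta1 * x + error
plt.plot(x, y, "o")                               # simulated observations
plt.plot(x, beta0 + beta1 * x, color="darkgreen") # the underlying line
plt.show()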
Linear regression
▷ Assumption: 𝑦 = 𝛽₀ + 𝛽₁ × 𝑥 + error
▷ Fitting the model:
• minimize the length of the red lines in the figure to the right (called the “residuals”)
[figure: scatterplot with fitted regression line and residuals drawn as red vertical segments]
Ordinary Least Squares regression
▷ OLS fits the line by minimizing the sum of the squared residuals (the squared differences between actual and predicted values)
• equivalent to minimizing the size of the red rectangles in the figure to the right
[figure: scatterplot of actual values with fitted line and squared residuals drawn as red rectangles]
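To make the “minimize the squared residuals” idea concrete, here is a sketch (with made-up numbers, not the lecture data) of the closed-form least-squares estimates and of the quantity that OLS minimizes:

import numpy

# hypothetical observations, only to illustrate the computation
x = numpy.array([40, 45, 50, 55, 60, 65, 70], dtype=float)
y = numpy.array([139, 144, 150, 152, 157, 161, 168], dtype=float)

# closed-form OLS estimates: slope = cov(x, y) / var(x),
# intercept chosen so the line passes through the point of means
beta1 = numpy.sum((x - x.mean()) * (y - y.mean())) / numpy.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

residuals = y - (beta0 + beta1 * x)             # lengths of the "red lines"
print(beta0, beta1, numpy.sum(residuals ** 2))  # OLS minimizes this last sum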
Simple linear regression: example

Cigarettes smoked per day | CVD deaths | Lung cancer deaths
0                         | 572        | 14
10 (actually 1-14)        | 802        | 105
20 (actually 15-24)       | 892        | 208
30 (actually >24)         | 1025       | 355

CVD: cardiovascular disease
Simple linear regression: plots

import pandas
import matplotlib.pyplot as plt

[figure: scatterplot of CVD deaths against cigarettes smoked per day]

Quite tempting to conclude that cardiovascular disease deaths increase linearly with cigarette consumption…
Aside: beware assumptions of causality
[figure: causal diagram linking smoking and lung cancer]
Beware assumptions of causality
▷ To demonstrate causality, you need a randomized controlled experiment
▷ Take a large group of people and divide them into two groups
• one group is obliged to smoke
• the other group is not allowed to smoke (the “control” group)
▷ Observe whether the smoker group develops more lung cancer than the control group
▷ This eliminates any possible hidden factor causing both smoking and lung cancer
Fitting a linear model

import numpy
import pandas
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df = pandas.DataFrame({"cigarettes": [0,10,20,30],
                       "CVD": [572,802,892,1025],
                       "lung": [14,105,208,355]})
df.plot("cigarettes", "CVD", kind="scatter")
lm = smf.ols("CVD ~ cigarettes", data=df).fit()
xmin = df.cigarettes.min()
xmax = df.cigarettes.max()
X = numpy.linspace(xmin, xmax, 100)
# params[0] is the intercept (beta₀)
# params[1] is the slope (beta₁)
Y = lm.params[0] + lm.params[1] * X
plt.plot(X, Y, color="darkgreen")

[figure: scatterplot of CVD deaths against cigarettes smoked per day, with the fitted regression line in dark green]
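Once the model is fitted, statsmodels can also report the estimated coefficients and diagnostic statistics; a small self-contained sketch using the same data as above:

import pandas
import statsmodels.formula.api as smf

df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "CVD": [572, 802, 892, 1025]})
lm = smf.ols("CVD ~ cigarettes", data=df).fit()
print(lm.params)     # estimated intercept (beta₀) and slope (beta₁)
print(lm.summary())  # standard errors, confidence intervals, r², ...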
Parameters of the linear model
▷ 𝛽₀ is the intercept of the regression line (where it meets the 𝑥 = 0 axis)
▷ 𝛽₁ is the slope of the regression line
Scatterplot of lung cancer deaths

df = pandas.DataFrame({"cigarettes": [0,10,20,30],
                       "CVD": [572,802,892,1025],
                       "lung": [14,105,208,355]})
df.plot("cigarettes", "lung", kind="scatter")
plt.xlabel("Cigarettes smoked per day")
plt.ylabel("Lung cancer deaths")

[figure: scatterplot of lung cancer deaths against cigarettes smoked per day]

lm = smf.ols("lung ~ cigarettes", data=df).fit()
xmin = df.cigarettes.min()
xmax = df.cigarettes.max()
X = numpy.linspace(xmin, xmax, 100)
# params[0] is the intercept (beta₀)
# params[1] is the slope (beta₁)
Y = lm.params[0] + lm.params[1] * X
plt.plot(X, Y, color="darkgreen")

[figure: scatterplot of lung cancer deaths with the fitted regression line in dark green]

Download the associated Python notebook at risk-engineering.org
Using the model for prediction

Q: What is the expected lung cancer mortality risk for a group of people who smoke 15 cigarettes per day?

df = pandas.DataFrame({"cigarettes": [0,10,20,30],
                       "CVD": [572,802,892,1025],
                       "lung": [14,105,208,355]})
# create and fit the linear model
lm = smf.ols(formula="lung ~ cigarettes", data=df).fit()
# use the fitted model for prediction;
# the result is the probability of mortality from lung cancer, per person per year
lm.predict({"cigarettes": [15]}) / 100000.0

array([ 0.001705])
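The same fitted model can predict at several smoking levels at once; a small sketch (the levels 5 and 25 are illustrative, not from the slides):

import pandas
import statsmodels.formula.api as smf

df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "lung": [14, 105, 208, 355]})
lm = smf.ols("lung ~ cigarettes", data=df).fit()
# predicted lung cancer deaths (per 100 000) at several smoking levels
print(lm.predict(pandas.DataFrame({"cigarettes": [5, 15, 25]})))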
Assessing model quality
▷ How do we assess how well the linear model fits our observations?
• make a visual check on a scatterplot
• use a quantitative measure of “goodness of fit”
▷ Coefficient of determination 𝑟²: a number that indicates how well data fit a statistical model
• it’s the proportion of total variation of outcomes explained by the model
• 𝑟² = 1: regression line fits perfectly
• 𝑟² = 0: regression line does not fit at all
▷ For simple linear regression, 𝑟² is simply the square of the sample correlation coefficient 𝑟
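As a check (a sketch, assuming scipy is available alongside statsmodels), the 𝑟² reported for the lung cancer model can be compared with the squared sample correlation coefficient:

import pandas
import statsmodels.formula.api as smf
from scipy import stats

df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "lung": [14, 105, 208, 355]})
lm = smf.ols("lung ~ cigarettes", data=df).fit()
print(lm.rsquared)  # coefficient of determination from statsmodels
r, _ = stats.pearsonr(df.cigarettes, df.lung)
print(r ** 2)       # square of the sample correlation coefficient: same value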