REgression 1

This document discusses regression analysis, which is a statistical modeling technique used to investigate relationships between variables. It can be used for prediction of a dependent variable based on independent variables or for understanding the relationship between variables. Linear regression assumes a linear relationship between variables and is commonly used, with the goal of minimizing error between predicted and observed values using the least squares method. A simple example is presented using data from a British doctors' study on mortality rates from different diseases at various levels of cigarette consumption per day. Plots of the data demonstrate a linear relationship, but the document cautions that correlation does not necessarily prove causation.

Uploaded by

Shivani Gupta
Regression analysis

▷ A regression problem is composed of
• an outcome or response variable 𝑌
• a number of risk factors or predictor variables 𝑋ᵢ that affect 𝑌
  • also called explanatory variables, or features in the machine learning community
• a question about 𝑌, such as: how to predict 𝑌 under different conditions?

▷ 𝑌 is sometimes called the dependent variable and 𝑋ᵢ the independent variables
• not the same meaning as statistical independence
• experimental setting where the 𝑋ᵢ variables can be modified and changes in 𝑌 can be observed

Regression analysis: objectives

▷ Prediction: we want to estimate 𝑌 at some specific values of 𝑋ᵢ

▷ Model inference: we want to learn about the relationship between 𝑌 and 𝑋ᵢ, such as the combination of predictor variables which has the most effect on 𝑌

Univariate linear regression
(when all you have is a single predictor variable)

Linear regression

▷ Linear regression: one of the simplest and most commonly used statistical modeling techniques

▷ Makes strong assumptions about the relationship between the predictor variables (𝑋ᵢ) and the response (𝑌)
• (a linear relationship, a straight line when plotted)
• only valid for continuous outcome variables (not applicable to category outcomes such as success/failure)

▷ “Fitting a line through data”: 𝑦 = 𝛽₀ + 𝛽₁ × 𝑥 + error
• 𝑦 is the outcome variable and 𝑥 is the predictor variable
• 𝛽₀ is the intercept and 𝛽₁ is the slope

[Figure: scatterplot of data points with a straight line fitted through them]

Linear regression

▷ Assumption: 𝑦 = 𝛽₀ + 𝛽₁ × 𝑥 + error

▷ Our task: estimate 𝛽₀ and 𝛽₁ based on the available data

▷ Resulting model is 𝑦̂ = 𝛽̂₀ + 𝛽̂₁ × 𝑥
• the “hats” on the variables represent the fact that they are estimated from the available data
• 𝑦̂ is read as “the estimator for 𝑦”

▷ 𝛽₀ and 𝛽₁ are called the model parameters or coefficients

▷ Objective: minimize the error, the difference between our observations and the predictions made by our linear model
• minimize the length of the red lines in the figure (called the “residuals”)

[Figure: data points with fitted line; residuals shown as red lines]

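To make the notion of residuals concrete, here is a minimal sketch that compares two candidate lines by the total length of their residuals. The data points and the two candidate parameter pairs are hypothetical, invented for this illustration; they do not come from the slides.

```python
import numpy as np

# Hypothetical observations (not from the slides), roughly following y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def residuals(beta0, beta1):
    """Observed values minus the predictions of the line y = beta0 + beta1 * x."""
    return y - (beta0 + beta1 * x)

def total_residual_length(beta0, beta1):
    """Sum of the absolute residuals (the total length of the 'red lines')."""
    return float(np.sum(np.abs(residuals(beta0, beta1))))

# Two candidate lines: one close to the data, one far from it
close_line = total_residual_length(1.0, 2.0)
far_line = total_residual_length(0.0, 1.0)
print(close_line, far_line)
```

The line with the smaller total is the better candidate under this criterion; fitting a model amounts to searching for the parameters that make such an error measure as small as possible.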
Ordinary Least Squares regression

▷ Ordinary Least Squares (OLS) regression: a method for selecting the model parameters
• 𝛽₀ and 𝛽₁ are chosen to minimize the sum of the squared distances between the predicted values and the actual values
• equivalent to minimizing the total size of the red rectangles in the figure

▷ An application of a quadratic loss function
• in statistics and optimization theory, a loss function, or cost function, maps from an observation or event to a number that represents some form of “cost”

[Figure: data points with fitted line; squared residuals shown as red rectangles]

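For a single predictor, the parameters that minimize the quadratic loss have a well-known closed form: the slope is the sum of products of deviations of 𝑥 and 𝑦 from their means, divided by the sum of squared deviations of 𝑥, and the intercept follows from the means. The sketch below applies that textbook result to hypothetical data (not from the slides); the function names are illustrative only.

```python
import numpy as np

# Hypothetical observations (not from the slides)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def ols_fit(x, y):
    """Closed-form OLS estimates (intercept, slope) for one predictor."""
    x_mean, y_mean = x.mean(), y.mean()
    beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    beta0 = y_mean - beta1 * x_mean
    return float(beta0), float(beta1)

def quadratic_loss(beta0, beta1):
    """Sum of squared differences between observations and predictions."""
    return float(np.sum((y - (beta0 + beta1 * x)) ** 2))

beta0_hat, beta1_hat = ols_fit(x, y)
print(beta0_hat, beta1_hat, quadratic_loss(beta0_hat, beta1_hat))
```

Nudging either estimate away from the fitted value increases the loss, which is what "least squares" means in practice.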
Simple linear regression: example

▷ The British Doctors’ Study followed the health of a large number of physicians in the UK over the period 1951–2001

▷ Provided conclusive evidence of linkage between smoking and lung cancer, myocardial infarction, respiratory disease and other illnesses

▷ Provides data on annual mortality for a variety of diseases at four levels of cigarette smoking:
1. never smoked
2. 1-14 per day
3. 15-24 per day
4. 25 or more per day

More information: ctsu.ox.ac.uk/research/british-doctors-study

Simple linear regression: the data

cigarettes smoked        CVD mortality                 lung cancer mortality
(per day)                (per 100 000 men per year)    (per 100 000 men per year)
0                        572                           14
10 (actually 1-14)       802                           105
20 (actually 15-24)      892                           208
30 (actually >24)        1025                          355

CVD: cardiovascular disease

Source: British Doctors’ Study

Simple linear regression: plots

    import pandas
    import matplotlib.pyplot as plt

    # mortality per 100 000 men per year, by cigarettes smoked per day
    data = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                             "CVD": [572, 802, 892, 1025],
                             "lung": [14, 105, 208, 355]})
    data.plot("cigarettes", "CVD", kind="scatter")
    plt.title("Deaths for different smoking intensities")
    plt.xlabel("Cigarettes smoked per day")
    plt.ylabel("CVD deaths")

[Figure: scatterplot “Deaths for different smoking intensities”; x-axis: cigarettes smoked per day, y-axis: CVD deaths]

Quite tempting to conclude that cardiovascular disease deaths increase linearly with cigarette consumption…

Aside: beware assumptions of causality

1964: the US Surgeon General issues a report claiming that cigarette smoking causes lung cancer, based mostly on correlation data similar to the previous slide.

However, correlation is not sufficient to demonstrate causality. There might be some hidden genetic factor that causes both lung cancer and desire for nicotine.

[Diagram: smoking → lung cancer, with a possible hidden factor influencing both]

Beware assumptions of causality

▷ To demonstrate causality, you need a randomized controlled experiment

▷ Assume we have the power to force people to smoke or not smoke
• and ignore moral issues for now!

▷ Take a large group of people and divide them at random into two groups
• one group is obliged to smoke
• the other group is not allowed to smoke (the “control” group)

▷ Observe whether the smoker group develops more lung cancer than the control group

▷ Because group membership is assigned at random, we have eliminated any possible hidden factor causing both smoking and lung cancer

▷ More information: read about design of experiments

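The logic of the randomized experiment can be illustrated with a toy simulation. Every number below is invented, and the simulated world is deliberately constructed so that smoking has no effect at all on the outcome: only a hypothetical hidden factor raises both the tendency to smoke and the disease risk. This is a sketch of the statistical argument, not a model of real epidemiology.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 100_000

# Hypothetical hidden factor carried by about 30% of the population
hidden = rng.random(n) < 0.3

# By construction, disease risk depends ONLY on the hidden factor
disease = rng.random(n) < np.where(hidden, 0.10, 0.01)

# Observational setting: the hidden factor also makes people more likely to smoke
smokes_observed = rng.random(n) < np.where(hidden, 0.8, 0.2)

# Randomized setting: smoking is assigned by coin flip, independent of everything
smokes_assigned = rng.random(n) < 0.5

def rate_difference(smokes, disease):
    """Disease rate among smokers minus the rate among non-smokers."""
    return float(disease[smokes].mean() - disease[~smokes].mean())

observational_gap = rate_difference(smokes_observed, disease)
randomized_gap = rate_difference(smokes_assigned, disease)
print(observational_gap, randomized_gap)
```

In the observational data smokers show a clearly higher disease rate even though smoking does nothing in this toy world, while under random assignment the gap shrinks to sampling noise. Random assignment breaks the link between the hidden factor and group membership, which is why any difference that survives it can be attributed to the treatment.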
Fitting a linear model in Python

▷ In these examples, we use the statsmodels library for statistics in Python
• other possibility: the scikit-learn library for machine learning

▷ We use the formula interface to OLS regression, in statsmodels.formula.api

▷ Formulas are written outcome ~ observation
• meaning “build a linear model that predicts variable outcome as a function of input data on variable observation”

Fitting a linear model

    import numpy, pandas
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf

    df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                           "CVD": [572, 802, 892, 1025],
                           "lung": [14, 105, 208, 355]})
    df.plot("cigarettes", "CVD", kind="scatter")
    lm = smf.ols("CVD ~ cigarettes", data=df).fit()
    xmin = df.cigarettes.min()
    xmax = df.cigarettes.max()
    X = numpy.linspace(xmin, xmax, 100)
    # params["Intercept"] is the intercept (beta₀)
    # params["cigarettes"] is the slope (beta₁)
    Y = lm.params["Intercept"] + lm.params["cigarettes"] * X
    plt.plot(X, Y, color="darkgreen")

[Figure: scatterplot with fitted line, “CVD deaths for different smoking intensities”; x-axis: cigarettes smoked per day, y-axis: CVD deaths]

Parameters of the linear model

▷ 𝛽₀ is the intercept of the regression line (where it meets the 𝑋 = 0 axis)

▷ 𝛽₁ is the slope of the regression line: 𝛽₁ = Δ𝑦 / Δ𝑥

▷ Interpretation of 𝛽₁ = 0.0475: a “unit” increase in cigarette smoking is associated with a 0.0475 “unit” increase in deaths from lung cancer

[Figure: regression line with the intercept 𝛽₀ and the slope 𝛽₁ = Δ𝑦 / Δ𝑥 marked]

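As a check on how the two parameters are read off a fitted line, the sketch below fits the lung cancer column of the British Doctors’ Study table from this deck. It uses `numpy.polyfit` instead of the statsmodels formula interface used on the slides, purely to keep the example dependency-light; that substitution is a choice made for this illustration.

```python
import numpy as np

# Data from the deck's table: cigarettes per day and
# lung cancer mortality per 100 000 men per year
cigarettes = np.array([0, 10, 20, 30])
lung = np.array([14, 105, 208, 355])

# polyfit returns coefficients from the highest degree down: [slope, intercept]
beta1_hat, beta0_hat = np.polyfit(cigarettes, lung, deg=1)
print(f"intercept = {beta0_hat:.2f}, slope = {beta1_hat:.2f}")

# Predicted mortality at 15 cigarettes per day, per 100 000 men per year
predicted_at_15 = beta0_hat + beta1_hat * 15
print(f"prediction at 15 per day = {predicted_at_15:.1f}")
```

On this table the slope comes out near 11.26 deaths per 100 000 men per year for each additional daily cigarette, with an intercept near 1.6. The 0.0475 quoted on the slide therefore presumably refers to differently scaled data; the interpretation (change in 𝑦 per unit change in 𝑥) is the same either way.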
Scatterplot of lung cancer deaths

    import pandas
    import matplotlib.pyplot as plt

    data = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                             "CVD": [572, 802, 892, 1025],
                             "lung": [14, 105, 208, 355]})
    data.plot("cigarettes", "lung", kind="scatter")
    plt.xlabel("Cigarettes smoked per day")
    plt.ylabel("Lung cancer deaths")

[Figure: scatterplot “Lung cancer deaths for different smoking intensities”; x-axis: cigarettes smoked per day, y-axis: lung cancer deaths]

Quite tempting to conclude that lung cancer deaths increase linearly with cigarette consumption…

Fitting a linear model

    import numpy, pandas
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf

    df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                           "CVD": [572, 802, 892, 1025],
                           "lung": [14, 105, 208, 355]})
    df.plot("cigarettes", "lung", kind="scatter")
    lm = smf.ols("lung ~ cigarettes", data=df).fit()
    xmin = df.cigarettes.min()
    xmax = df.cigarettes.max()
    X = numpy.linspace(xmin, xmax, 100)
    # params["Intercept"] is the intercept (beta₀)
    # params["cigarettes"] is the slope (beta₁)
    Y = lm.params["Intercept"] + lm.params["cigarettes"] * X
    plt.plot(X, Y, color="darkgreen")

[Figure: scatterplot with fitted line, “Lung cancer deaths for different smoking intensities”; x-axis: cigarettes smoked per day, y-axis: lung cancer deaths]

Download the associated Python notebook at risk-engineering.org

Using the model for prediction

Q: What is the expected lung cancer mortality risk for a group of people who smoke 15 cigarettes per day?

    import pandas
    import statsmodels.formula.api as smf

    df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                           "CVD": [572, 802, 892, 1025],
                           "lung": [14, 105, 208, 355]})
    # create and fit the linear model
    lm = smf.ols(formula="lung ~ cigarettes", data=df).fit()
    # use the fitted model for prediction; the data are per 100 000 men,
    # so dividing gives the probability of mortality from lung cancer,
    # per person per year
    lm.predict(pandas.DataFrame({"cigarettes": [15]})) / 100000.0
    # → approximately 0.001705

Assessing model quality

▷ How do we assess how well the linear model fits our observations?
• make a visual check on a scatterplot
• use a quantitative measure of “goodness of fit”

▷ Coefficient of determination 𝑟²: a number that indicates how well data fit a statistical model
• it’s the proportion of total variation of outcomes explained by the model
• 𝑟² = 1: regression line fits perfectly
• 𝑟² = 0: regression line does not fit at all

For simple linear regression, 𝑟² is simply the square of the sample correlation coefficient 𝑟

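Both definitions of 𝑟² can be checked numerically on the lung cancer data from this deck's table. The sketch below computes the proportion of variation explained by the fitted line and compares it with the squared sample correlation coefficient. It uses `numpy.polyfit` and `numpy.corrcoef` rather than the statsmodels interface from the slides, which is a choice made for this illustration.

```python
import numpy as np

# Data from the deck's table: cigarettes per day and
# lung cancer mortality per 100 000 men per year
cigarettes = np.array([0, 10, 20, 30])
lung = np.array([14, 105, 208, 355])

# Fit the regression line and compute its predictions
slope, intercept = np.polyfit(cigarettes, lung, deg=1)
predicted = intercept + slope * cigarettes

# r² as the proportion of total variation explained by the model
ss_residual = np.sum((lung - predicted) ** 2)
ss_total = np.sum((lung - lung.mean()) ** 2)
r_squared = float(1 - ss_residual / ss_total)

# r² as the square of the sample correlation coefficient
r = float(np.corrcoef(cigarettes, lung)[0, 1])
print(r_squared, r ** 2)
```

The two numbers agree up to floating-point rounding, and a value close to 1 indicates that the straight line accounts for nearly all of the variation in these four data points.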