REgression 1

This document discusses regression analysis, which is a statistical modeling technique used to investigate relationships between variables. It can be used for prediction of a dependent variable based on independent variables or for understanding the relationship between variables. Linear regression assumes a linear relationship between variables and is commonly used, with the goal of minimizing error between predicted and observed values using the least squares method. A simple example is presented using data from a British doctors' study on mortality rates from different diseases at various levels of cigarette consumption per day. Plots of the data demonstrate a linear relationship, but the document cautions that correlation does not necessarily prove causation.

Uploaded by

Shivani Gupta

Regression analysis

▷ A regression problem is composed of

• an outcome or response variable 𝑌
• a number of risk factors or predictor variables 𝑋𝑖 that affect 𝑌
• also called explanatory variables, or features in the machine learning community
• a question about 𝑌, such as How to predict 𝑌 under different conditions?

▷ 𝑌 is sometimes called the dependent variable and 𝑋𝑖 the independent variables
• not the same meaning as statistical independence
• the terms come from an experimental setting where the 𝑋𝑖 variables can be modified and changes in 𝑌 can be observed

Regression analysis: objectives

▷ Prediction
• We want to estimate 𝑌 at some specific values of 𝑋𝑖

▷ Model inference
• We want to learn about the relationship between 𝑌 and 𝑋𝑖, such as the combination of predictor variables which has the most effect on 𝑌
Univariate linear regression
(when all you have is a single predictor variable)
Linear regression

▷ Linear regression: one of the simplest and most commonly used statistical modeling techniques

▷ Makes strong assumptions about the relationship between the predictor variables (𝑋𝑖) and the response (𝑌)
• (a linear relationship, a straight line when plotted)
• only valid for continuous outcome variables (not applicable to category outcomes such as success/failure)

▷ “Fitting a line through data”: 𝑦 = 𝛽0 + 𝛽1 × 𝑥 + error
• 𝑦 is the outcome variable and 𝑥 is the predictor variable
• 𝛽0 is the intercept and 𝛽1 is the slope
Linear regression

▷ Assumption: 𝑦 = 𝛽0 + 𝛽1 × 𝑥 + error

▷ Our task: estimate 𝛽0 and 𝛽1 based on the available data

▷ Resulting model is 𝑦̂ = 𝛽̂0 + 𝛽̂1 × 𝑥
• the “hats” on the variables represent the fact that they are estimated from the available data
• 𝑦̂ is read as “the estimator for 𝑦”

▷ 𝛽0 and 𝛽1 are called the model parameters or coefficients

▷ Objective: minimize the error, the difference between our observations and the predictions made by our linear model
• minimize the length of the red lines in the figure (called the “residuals”)

[Figure: data points with a fitted line; the residuals are drawn as red lines]
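The residuals described above can be made concrete with a few lines of Python. This is a minimal sketch on made-up data (the `xs` and `ys` values and the two candidate lines are illustrative assumptions, not taken from the slides): it computes each residual as the observed value minus the predicted value, then compares two candidate lines by their total squared residual.

```python
# Hypothetical observations (not from the slides), chosen only for illustration.
xs = [40, 50, 60, 70]
ys = [150, 160, 165, 178]


def residuals(beta0, beta1, xs, ys):
    """Return observed minus predicted values for the line y = beta0 + beta1 * x."""
    return [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]


def sum_of_squares(values):
    """Sum of squared values, used here as the total error of a candidate line."""
    return sum(v * v for v in values)


# Two candidate lines; the one with the smaller total squared residual fits better.
res_a = residuals(110.0, 1.0, xs, ys)
res_b = residuals(114.0, 0.9, xs, ys)
sse_a = sum_of_squares(res_a)
sse_b = sum_of_squares(res_b)

print("residuals A:", res_a, "sum of squares:", sse_a)
print("residuals B:", res_b, "sum of squares:", sse_b)
```

On this toy data the second line has the smaller total, so it is the better fit of the two under this criterion.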
Ordinary Least Squares regression

▷ Ordinary Least Squares (OLS) regression: a method for selecting the model parameters
• 𝛽0 and 𝛽1 are chosen to minimize the sum of the squared distances between the predicted values and the actual values
• equivalent to minimizing the total size of the red rectangles in the figure

▷ An application of a quadratic loss function
• in statistics and optimization theory, a loss function, or cost function, maps from an observation or event to a number that represents some form of “cost”

[Figure: data points with a fitted line; each squared residual is drawn as a red rectangle]
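For a single predictor, the quadratic loss has a closed-form minimizer: the slope is the co-variation of 𝑥 and 𝑦 divided by the variation of 𝑥, and the intercept makes the line pass through the point of means. The sketch below is a plain-Python illustration on made-up data (the function names `fit_line` and `quadratic_loss` and the data values are assumptions for the example). It also checks numerically that nudging the fitted parameters does not reduce the loss.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = beta0 + beta1 * x with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx                # slope
    beta0 = mean_y - beta1 * mean_x  # intercept
    return beta0, beta1


def quadratic_loss(beta0, beta1, xs, ys):
    """Sum of squared residuals for the line y = beta0 + beta1 * x."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0, beta1 = fit_line(xs, ys)
best = quadratic_loss(beta0, beta1, xs, ys)

# Perturbing either parameter should never give a smaller loss.
perturbed = [
    quadratic_loss(beta0 + d0, beta1 + d1, xs, ys)
    for d0 in (-0.1, 0.0, 0.1)
    for d1 in (-0.1, 0.0, 0.1)
]
print(f"intercept={beta0:.3f} slope={beta1:.3f} loss={best:.4f}")
print("fitted line is at least as good as every perturbation:",
      all(p >= best - 1e-12 for p in perturbed))
```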
Simple linear regression: example

▷ The British Doctors’ Study followed the health of a large number of physicians in the UK over the period 1951–2001

▷ Provided conclusive evidence of linkage between smoking and lung cancer, myocardial infarction, respiratory disease and other illnesses

▷ Provides data on annual mortality for a variety of diseases at four levels of cigarette smoking:
1 never smoked
2 1-14 per day
3 15-24 per day
4 25 or more per day

More information: ctsu.ox.ac.uk/research/british-doctors-study
Simple linear regression: the data

cigarettes smoked      CVD mortality                 lung cancer mortality
(per day)              (per 100 000 men per year)    (per 100 000 men per year)
0                      572                           14
10 (actually 1-14)     802                           105
20 (actually 15-24)    892                           208
30 (actually >24)      1025                          355

CVD: cardiovascular disease

Source: British Doctors’ Study
Simple linear regression: plots

import pandas
import matplotlib.pyplot as plt

# mortality figures are per 100 000 men per year
data = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                         "CVD": [572, 802, 892, 1025],
                         "lung": [14, 105, 208, 355]})
# scatterplot of CVD deaths against smoking intensity
data.plot("cigarettes", "CVD", kind="scatter")
plt.title("Deaths for different smoking intensities")
plt.xlabel("Cigarettes smoked per day")
plt.ylabel("CVD deaths")

[Figure: scatterplot “Deaths for different smoking intensities”; x-axis: Cigarettes smoked per day; y-axis: CVD deaths]

Quite tempting to conclude that cardiovascular disease deaths increase linearly with cigarette consumption…
Aside: beware assumptions of causality

1964: the US Surgeon General issues a report claiming that cigarette smoking causes lung cancer, based mostly on correlation data similar to the previous slide.

However, correlation is not sufficient to demonstrate causality. There might be some hidden genetic factor that causes both lung cancer and desire for nicotine.

[Figure: diagram relating “smoking”, “lung cancer” and a possible “hidden factor?”]
Beware assumptions of causality

▷ To demonstrate causality, you need a randomized controlled experiment

▷ Assume we have the power to force people to smoke or not smoke
• and ignore moral issues for now!

▷ Take a large group of people and divide them at random into two groups
• one group is obliged to smoke
• other group not allowed to smoke (the “control” group)

▷ Observe whether the smoker group develops more lung cancer than the control group

▷ Because group membership is assigned at random, we have eliminated any possible hidden factor causing both smoking and lung cancer

▷ More information: read about design of experiments
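The role of randomization can be illustrated with a small simulation. This is a sketch on entirely synthetic numbers: the variable names, the effect sizes and the absence of any direct causal link are assumptions built into the toy model, not findings about smoking. A hidden factor drives both the exposure and the outcome, which produces a clear correlation in observational data, while assigning the exposure at random makes that correlation vanish.

```python
import random


def pearson(xs, ys):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 20000

# Toy model: a hidden factor raises both the exposure and the outcome.
# By construction the exposure has NO direct effect on the outcome.
hidden = [rng.gauss(0, 1) for _ in range(n)]
outcome = [h + rng.gauss(0, 1) for h in hidden]

# Observational setting: exposure is influenced by the hidden factor.
exposure_observed = [h + rng.gauss(0, 1) for h in hidden]

# Randomized experiment: exposure is assigned by coin flip,
# independently of the hidden factor.
exposure_randomized = [rng.choice([0, 1]) for _ in range(n)]

corr_observed = pearson(exposure_observed, outcome)
corr_randomized = pearson(exposure_randomized, outcome)
print(f"observational correlation: {corr_observed:.3f}")
print(f"randomized correlation:    {corr_randomized:.3f}")
```

In this toy model the observational correlation is clearly positive even though the exposure does nothing, while the randomized correlation stays close to zero.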
Fitting a linear model in Python

▷ In these examples, we use the statsmodels library for statistics in Python
• other possibility: the scikit-learn library for machine learning

▷ We use the formula interface to OLS regression, in statsmodels.formula.api

▷ Formulas are written outcome ~ observation
• meaning “build a linear model that predicts variable outcome as a function of input data on variable observation”
Fitting a linear model

import numpy, pandas
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "CVD": [572, 802, 892, 1025],
                       "lung": [14, 105, 208, 355]})
df.plot("cigarettes", "CVD", kind="scatter")
lm = smf.ols("CVD ~ cigarettes", data=df).fit()
xmin = df.cigarettes.min()
xmax = df.cigarettes.max()
X = numpy.linspace(xmin, xmax, 100)
# lm.params["Intercept"] is the intercept (beta₀)
# lm.params["cigarettes"] is the slope (beta₁)
Y = lm.params["Intercept"] + lm.params["cigarettes"] * X
plt.plot(X, Y, color="darkgreen")

[Figure: “CVD deaths for different smoking intensities”, scatterplot with the fitted line; x-axis: Cigarettes smoked per day; y-axis: CVD deaths]
Parameters of the linear model

▷ 𝛽0 is the intercept of the regression line (where it meets the 𝑋 = 0 axis)

▷ 𝛽1 is the slope of the regression line: 𝛽1 = Δ𝑦 / Δ𝑥

▷ Interpretation of 𝛽1 = 0.0475: a “unit” increase in cigarette smoking is associated with a 0.0475 “unit” increase in deaths from lung cancer

[Figure: regression line annotated with the intercept 𝛽0 and the slope 𝛽1 = Δ𝑦 / Δ𝑥]
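To make the “unit” interpretation concrete, the slope and intercept can be computed directly from the lung cancer column of the data table shown earlier. This is a minimal sketch that uses the closed-form least-squares formulas in plain Python rather than statsmodels; the helper name `fit_line` is illustrative. The slope it reports is expressed in deaths per 100 000 men per year for each additional daily cigarette, so it need not match a slope quoted for data expressed in other units.

```python
# Data from the table above: cigarettes per day and lung cancer deaths
# per 100 000 men per year.
cigarettes = [0, 10, 20, 30]
lung = [14, 105, 208, 355]


def fit_line(xs, ys):
    """Closed-form least-squares estimates for y = beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx                # slope: change in y per unit change in x
    beta0 = mean_y - beta1 * mean_x  # intercept: fitted value at x = 0
    return beta0, beta1


beta0, beta1 = fit_line(cigarettes, lung)
print(f"intercept: {beta0:.2f} deaths per 100 000 men per year")
print(f"slope: {beta1:.2f} additional deaths per 100 000 men per year "
      "for each extra cigarette per day")
```

With these four rows the fitted slope is about 11.26 and the intercept about 1.6.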
Scatterplot of lung cancer deaths

import pandas
import matplotlib.pyplot as plt

# mortality figures are per 100 000 men per year
data = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                         "CVD": [572, 802, 892, 1025],
                         "lung": [14, 105, 208, 355]})
# scatterplot of lung cancer deaths against smoking intensity
data.plot("cigarettes", "lung", kind="scatter")
plt.xlabel("Cigarettes smoked per day")
plt.ylabel("Lung cancer deaths")

[Figure: scatterplot “Lung cancer deaths for different smoking intensities”; x-axis: Cigarettes smoked per day; y-axis: Lung cancer deaths]

Quite tempting to conclude that lung cancer deaths increase linearly with cigarette consumption…
Fitting a linear model

import numpy, pandas
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "CVD": [572, 802, 892, 1025],
                       "lung": [14, 105, 208, 355]})
df.plot("cigarettes", "lung", kind="scatter")
lm = smf.ols("lung ~ cigarettes", data=df).fit()
xmin = df.cigarettes.min()
xmax = df.cigarettes.max()
X = numpy.linspace(xmin, xmax, 100)
# lm.params["Intercept"] is the intercept (beta₀)
# lm.params["cigarettes"] is the slope (beta₁)
Y = lm.params["Intercept"] + lm.params["cigarettes"] * X
plt.plot(X, Y, color="darkgreen")

[Figure: “Lung cancer deaths for different smoking intensities”, scatterplot with the fitted line; x-axis: Cigarettes smoked per day; y-axis: Lung cancer deaths]

Download the associated Python notebook at risk-engineering.org
Using the model for prediction

Q: What is the expected lung cancer mortality risk for a group of people who smoke 15 cigarettes per day?

import numpy, pandas
import statsmodels.formula.api as smf

df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "CVD": [572, 802, 892, 1025],
                       "lung": [14, 105, 208, 355]})
# create and fit the linear model
lm = smf.ols(formula="lung ~ cigarettes", data=df).fit()
# use the fitted model for prediction
lm.predict({"cigarettes": [15]}) / 100000.0
# probability of mortality from lung cancer, per person per year
array([ 0.001705])
Assessing model quality

▷ How do we assess how well the linear model fits our observations?
• make a visual check on a scatterplot
• use a quantitative measure of “goodness of fit”

▷ Coefficient of determination 𝑟²: a number that indicates how well data fit a statistical model
• it’s the proportion of total variation of outcomes explained by the model
• 𝑟² = 1: regression line fits perfectly
• 𝑟² = 0: regression line does not fit at all

▷ For simple linear regression, 𝑟² is simply the square of the sample correlation coefficient 𝑟
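Both descriptions of 𝑟² can be checked against each other on the lung cancer data from the table shown earlier. The sketch below is a plain-Python illustration (the variable names are assumptions for the example, and the fit uses the closed-form least-squares formulas rather than statsmodels): it computes 𝑟² once as the proportion of variation explained by the fitted line and once as the square of the sample correlation coefficient.

```python
# Data from the slides: cigarettes per day and lung cancer deaths
# per 100 000 men per year.
cigarettes = [0, 10, 20, 30]
lung = [14, 105, 208, 355]

n = len(cigarettes)
mean_x = sum(cigarettes) / n
mean_y = sum(lung) / n
sxx = sum((x - mean_x) ** 2 for x in cigarettes)
syy = sum((y - mean_y) ** 2 for y in lung)  # total variation of the outcome
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cigarettes, lung))

# Least-squares line and its predictions
beta1 = sxy / sxx
beta0 = mean_y - beta1 * mean_x
predicted = [beta0 + beta1 * x for x in cigarettes]

# r^2 as the proportion of variation explained by the model
ss_res = sum((y - p) ** 2 for y, p in zip(lung, predicted))
r_squared = 1 - ss_res / syy

# r^2 as the square of the sample correlation coefficient
r = sxy / (sxx * syy) ** 0.5

print(f"r^2 from explained variation: {r_squared:.4f}")
print(f"square of correlation r:      {r * r:.4f}")
```

The two values agree up to floating-point rounding, and both are close to 1 for this data, which matches the nearly straight-line appearance of the scatterplot.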
