
Linear Regression - Kevin

Uploaded by

ken
Copyright
© © All Rights Reserved
Linear Regression

Recall

Correlation ($r$) measures the direction and strength of the linear relationship shared between two variables.
[Scatter plot: Ticket Sales (y-axis) against number of followers on the film's official Instagram account (x-axis)]

There is a +ve linear association between ticket sales and the number of followers on the film's official Instagram account.

We cannot conclude that there is CAUSATION (cause and effect). To establish causation we must conduct an EXPERIMENT.

Regression

[Scatter plot: Ticket Sales (y-axis) against number of followers on the film's official Instagram account (x-axis), showing a +ve linear relationship]

The regression line, a.k.a. the line of best fit, predicts the change in y when x increases or decreases.
Types of variables

Qualitative variable (a.k.a. categorical variable) vs quantitative variable

Qualitative (categorical) variable: classification of individuals based on their attributes or characteristics.
e.g. cities in a country; gender (Male / Female); level of education; marital status; breed of a dog; eye colour

Quantitative variable: numerical measures of individuals, with units.
e.g. heights (m); weight (kg); GPA (points); number of students in class; time (s); speed (mph); age of an individual (years)

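Scatter Plot

[Scatter plot: GPA (y-axis) against time spent studying (x-axis), with the line of best fit (REGRESSION LINE) and its slope marked]

Student   Time spent studying   GPA
A         2                     3.5
B         4                     3.8
C         1                     2.8

Eye colour categories noted alongside the table: blue, black, brown.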
Consider

[Two scatter plots of GPA (points), each with regression line $\hat{y} = b_0 + b_1 x$, where $\hat{y}$ is the predicted GPA]

1. x = study time (hr): $b_1$ (slope) > 0, a +ve linear relationship.
2. x = time spent on Instagram (hr): $b_1$ (slope) < 0, a -ve linear relationship.

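Regression Line

Compare with the straight line $y = mx + b$ (slope $m$, y-intercept $b$).

$\hat{y} = b_0 + b_1 x$

where $\hat{y}$ is the predicted value of y, $b_0$ is the intercept, $b_1$ is the slope, and x is any value of x.

$b_1 = r \dfrac{s_y}{s_x}$

where $r$ is the correlation coefficient, $s_y$ is the standard deviation of y, and $s_x$ is the standard deviation of x.

$b_0 = \bar{y} - b_1 \bar{x}$

where $\bar{y}$ is the mean value of y and $\bar{x}$ is the mean value of x.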
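The two coefficient formulas above can be sketched in a few lines of Python. This is only an illustration: the function name `fit_from_summary` and the input numbers are made up and do not come from these notes.

```python
def fit_from_summary(r, s_x, s_y, mean_x, mean_y):
    """Return (b0, b1) for y_hat = b0 + b1 * x from summary statistics."""
    b1 = r * s_y / s_x          # slope: b1 = r * s_y / s_x
    b0 = mean_y - b1 * mean_x   # intercept: b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical summary statistics, chosen only to keep the arithmetic easy.
b0, b1 = fit_from_summary(r=0.5, s_x=2.0, s_y=4.0, mean_x=10.0, mean_y=20.0)
print(f"y_hat = {b0} + {b1} x")
```

Because $b_0$ is defined through the two means, the fitted line always passes through the point $(\bar{x}, \bar{y})$.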
Ex 1

A researcher wants to predict a student's GPA from the amount of time they study each week (study time is the explanatory variable).

Study Time ($x_i$)   GPA ($y_i$)
1                    2.0
2                    1.5
3                    2.5
5                    3.5
6                    3.0
8                    4.0
10                   4.5

Recall:

$\hat{y} = b_0 + b_1 x$, with $b_1 = r \dfrac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$

$r = \dfrac{1}{n-1} \cdot \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}$

$\bar{x} = 5$, $\bar{y} = 3$, $s_x = 3.27$, $s_y = 1.08$

$r = \dfrac{1}{7-1} \cdot \dfrac{20}{(3.27)(1.08)} = 0.94$
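Student   Study time $x_i$   GPA $y_i$   $x_i - \bar{x}$   $y_i - \bar{y}$   $(x_i - \bar{x})(y_i - \bar{y})$
A         1                  2.0         -4                -1                4
B         2                  1.5         -3                -1.5              4.5
C         3                  2.5         -2                -0.5              1
D         5                  3.5         0                 0.5               0
E         6                  3.0         1                 0                 0
F         8                  4.0         3                 1                 3
G         10                 4.5         5                 1.5               7.5

$n = 7$, $\bar{x} = 5$, $\bar{y} = 3$, $\sum (x_i - \bar{x})(y_i - \bar{y}) = 20$
$s_x = 3.27$, $s_y = 1.08$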
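The columns of this table can be reproduced with Python's standard library. This is a minimal sketch that assumes the seven (study time, GPA) pairs listed in Ex 1; the variable names are illustrative.

```python
from statistics import mean, stdev

hours = [1, 2, 3, 5, 6, 8, 10]
gpa = [2.0, 1.5, 2.5, 3.5, 3.0, 4.0, 4.5]

n = len(hours)
x_bar, y_bar = mean(hours), mean(gpa)
s_x, s_y = stdev(hours), stdev(gpa)  # sample SDs (divide by n - 1)

# Last column of the table: product of the two deviations for each student.
products = [(x - x_bar) * (y - y_bar) for x, y in zip(hours, gpa)]
sum_products = sum(products)

# Correlation from the sum of products and the two SDs.
r = sum_products / ((n - 1) * s_x * s_y)

print(x_bar, y_bar, round(s_x, 2), round(s_y, 2), sum_products, round(r, 4))
```

Using the unrounded standard deviations gives a slightly larger r than plugging in the rounded values 3.27 and 1.08.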
Calculator output for $y = ax + b$:

a = 0.3125
b = 1.4375
r = 0.9449 (+ve linear relationship)
$r^2$ = 0.8928

$\hat{y} = 0.3125x + 1.4375$

GPA = 0.3125 × (no. of hours studied) + 1.4375

GPA is predicted to increase by 0.3125 for every additional hour studied.
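Scatter Plot

[Scatter plot: GPA (y-axis, 2.0 to 5.0) against Study Time (x-axis), with the regression line]

$b_1 = r \dfrac{s_y}{s_x} = 0.94 \times \dfrac{1.08}{3.27} \approx 0.311$

$b_0 = \bar{y} - b_1 \bar{x} = 3 - 0.311(5) = 1.445$

$\hat{y} = 1.445 + 0.311x$, i.e. GPA = 1.445 + 0.311 × StudyTime

Slope: as study time increases by 1 hr, we predict a student's GPA to increase by 0.311.

Intercept: if a student doesn't study at all, the predicted (minimum) GPA is 1.445. Here the y-intercept has a meaningful interpretation.

Q: What is the predicted GPA of a student who studies for 6.5 hrs a week?

GPA = 1.445 + 0.311(6.5) = 3.4665 ≈ 3.47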

$R^2$: 89% of the variation in GPA is accounted for by its regression on study time.

$b_1 = -0.897$: negative slope. As CentralPressure falls, MaxWindSpeed increases. MaxWindSpeed increases by approx. 0.897 knots for every 1 millibar drop in central pressure.

$b_0 = 955.27$: not meaningful. A central pressure of 0 mb is a vacuum, which is not possible on Earth.

b r
38 27197 281 mg
r b us r

Correlation coefficient $r$ vs $r^2$ (r squared)

$r$ measures the linear relationship between two quantitative variables w.r.t. direction and strength.

$r^2$ measures how close each data point fits to the regression line. It tells us how well the regression line predicts actual values, i.e. $r^2$ closer to 1 means better prediction.
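One way to see why $r^2$ describes how well the line predicts is to compute it from the prediction errors. The sketch below assumes the seven GPA data points from Ex 1 and uses the standard identity $r^2 = 1 - SSE/SST$ for a least-squares line; the variable names are illustrative.

```python
hours = [1, 2, 3, 5, 6, 8, 10]
gpa = [2.0, 1.5, 2.5, 3.5, 3.0, 4.0, 4.5]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(gpa) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, gpa))
s_xx = sum((x - x_bar) ** 2 for x in hours)
s_yy = sum((y - y_bar) ** 2 for y in gpa)  # total variation in y (SST)

# Least-squares line.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * x for x in hours]

# Variation left over after using the line (SSE).
sse = sum((y - p) ** 2 for y, p in zip(gpa, predicted))

r_squared = 1 - sse / s_yy          # share of variation accounted for
r = s_xy / (s_xx * s_yy) ** 0.5     # correlation coefficient

print(round(r_squared, 4), round(r ** 2, 4))
```

When the predicted values sit close to the actual values, SSE is small relative to SST and $r^2$ is close to 1.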

Consider

[Three scatter plots, each showing the regression line with actual values and predicted values marked]

$r^2 = 0.90$: actual and predicted values are close.
$r^2 = 0.07$: actual and predicted values are far away.
$r^2 = 1$: actual and predicted values are the same.

Residual: how far off the predicted value is from the actual value. It tells us the ERROR in a prediction.

Residual $= y_i - \hat{y}_i$

where $y_i$ is the actual value of y and $\hat{y}_i$ is the predicted value of y.
Consider

[Scatter plot: GPA (y-axis, 2.0 to 5.0) against Study Time (x-axis) with the regression line; the vertical gap between an actual value and the line is its residual]

At x = 2 the actual value is 1.5:

$\hat{y} = 1.445 + 0.311(2) = 2.067$

residual $= 1.5 - 2.067 = -0.567$ (negative)
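The prediction and residual above can be checked with a short script. This is a minimal sketch using the rounded model GPA = 1.445 + 0.311 × StudyTime from these notes; the function names are illustrative.

```python
B0, B1 = 1.445, 0.311  # intercept and slope of the rounded GPA model


def predict_gpa(hours):
    """Predicted GPA for a given weekly study time in hours."""
    return B0 + B1 * hours


def residual(actual_gpa, hours):
    """Residual = actual value - predicted value."""
    return actual_gpa - predict_gpa(hours)


print(round(predict_gpa(6.5), 4))   # student who studies 6.5 hrs a week
print(round(residual(1.5, 2), 3))   # student with x = 2, actual GPA 1.5
```

A negative residual means the model over-predicts: the actual point lies below the regression line.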
Assumptions for Linear Regression

Equal Variance Assumption

The scatterplot of residuals must NOT show a pattern, i.e. it should look random.

[Residual plot: GPA residual (y-axis) against Study Time (x-axis)]

"Does the plot thicken?" check:
Satisfied: the spread around the line looks about the same throughout.
Not satisfied: the data points have varying spread around the line.

Outliers: check for obvious outliers or groups.

Standardisation of Residuals

$s_e = \sqrt{\dfrac{\sum e^2}{n - 2}}$

where $s_e$ is the standard deviation of the residuals $e$.

[Histogram of residuals]

Any residual beyond $2 s_e$ is unusual.
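As a sketch of how the "beyond 2 SD" rule could be applied, the code below fits the least-squares line to the GPA data from Ex 1, computes the residual standard deviation with the $n - 2$ divisor, and flags any residual larger than twice that value in size. The variable names are illustrative.

```python
from math import sqrt

hours = [1, 2, 3, 5, 6, 8, 10]
gpa = [2.0, 1.5, 2.5, 3.5, 3.0, 4.0, 4.5]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(gpa) / n

# Least-squares slope and intercept.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, gpa)) / sum(
    (x - x_bar) ** 2 for x in hours
)
b0 = y_bar - b1 * x_bar

# Residual = actual - predicted for each student.
residuals = [y - (b0 + b1 * x) for x, y in zip(hours, gpa)]

# Standard deviation of the residuals (n - 2 in the denominator).
s_e = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Residuals more than 2 * s_e away from zero are treated as unusual.
unusual = [(x, e) for x, e in zip(hours, residuals) if abs(e) > 2 * s_e]

print(round(s_e, 3), unusual)
```

For this small data set no residual exceeds the threshold, so the list of unusual points is empty.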
Ex 2 cont'd

MaxWindSpeed $= 955.27 - 0.897(920) = 130.03$ knots

Residual $= y - \hat{y} = 110 - 130.03 = -20.03$ knots

The model predicts a wind speed 20 knots higher than was actually observed.


$94.454 \times 1000 = 94{,}454$

The price increases by \$94,454 as the house size increases by 1 thousand square feet.

Unit of $b_1$ = unit of y / unit of x, e.g. GPA / study time (hr). Here the unit is thousands of dollars per thousand square feet.

D 3.117 94.45412000

188.908 x 1000 thousands of dollars


188,908

$b_0 < 0$: not meaningful. Even if it were positive, no one would buy a house with 0 ft².

Residual SD: $53.79 \times 1000 = 53{,}790$ dollars
KY we

01 GPA 9
$r^2 = 0.595$: 59.5% of the variation in house price is accounted for by its regression on house size.
012
Te
Positive ($r > 0$): the slope (94.454) and the correlation coefficient $r$ should have the same sign.

613
$r$ would not change: the correlation coefficient is unchanged by a change of units.

$b_1 = r \dfrac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$ are affected.

Unit conversion of ft² to m² changes the standard deviation of the size ($s_x$), so $b_1$ (the slope) also changes.
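The effect of a unit conversion can be checked numerically. The sketch below uses made-up house sizes and prices (not data from these notes) and the exact conversion 1 ft² = 0.09290304 m²; the helper name `slope_and_r` is illustrative.

```python
FT2_TO_M2 = 0.09290304  # 1 square foot in square metres (0.3048 ** 2)


def slope_and_r(xs, ys):
    """Least-squares slope and correlation coefficient of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / s_xx, s_xy / (s_xx * s_yy) ** 0.5


# Hypothetical data: sizes in ft^2, prices in thousands of dollars.
sizes_ft2 = [1000, 1500, 2000, 2500, 3000]
prices = [100, 160, 170, 240, 280]

sizes_m2 = [s * FT2_TO_M2 for s in sizes_ft2]

slope_ft2, r_ft2 = slope_and_r(sizes_ft2, prices)
slope_m2, r_m2 = slope_and_r(sizes_m2, prices)

print(r_ft2, r_m2)          # same correlation in both units
print(slope_ft2, slope_m2)  # slope is rescaled by the conversion factor
```

Rescaling x by a positive constant rescales $s_x$ by the same constant, which cancels in $r$ but not in $b_1 = r\, s_y / s_x$.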
QI
We shouldn't be surprised by any residual smaller than 2 SD.

The residual SD is \$53,790. A residual of \$100,000 is less than $2 \times 53{,}790 = 107{,}580$, so it is not unusual.

We do not have the data, so to find a z-score we use the residual SD: 1 SD = \$53,790.

iii M

The model predicts that cereals will have approx. 27 more milligrams of potassium (K) for every additional gram of fibre.
300 pounds per foot: most reasonable.

a) Response variable: price (\$1K). Explanatory variable: size (ft²).

b) Units of the slope: \$1K per ft².

c) +ve: larger homes should cost more.

$R^2 = 0.714$: 71.4% of the variation in price is accounted for by its regression on size.

a) Yes: no pattern.
b) No: there's a curve pattern.
c) May not: the spread is changing.
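$R^2 = 71.4\%$, so $r = +\sqrt{0.714} = 0.845$. $r > 0$ because larger homes cost more.

[Plot: standardised price against standardised size, line with slope $r = 0.845$; marked values 0.845, 1.690, 2.535]

$\hat{z}_{price} = r \times z_{size} = 0.845\, z_{size}$

Price should be 0.845 SD below the mean in price.

Price should be 1.690 SD above the mean in price.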
mm

I
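a) $0.061 \times 1000 = 61.00$. Price increases by about \$61.00 for every additional sq ft.

b) $47.82 + 0.061(3000) = 230.82$, and $230.82 \times 1000 = 230{,}820$, so the predicted price is \$230,820.

c) $47.82 + 0.061(1200) = 121.02$, so the predicted price is \$121,020.

Asking price = 121,020 - 6,000 = \$115,020

Residual = Actual - Predicted = 115,020 - 121,020 = -\$6,000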
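Because this model reports price in thousands of dollars, it is easy to lose a factor of 1000. The sketch below keeps the unit conversion in one place; it assumes the model Price = 47.82 + 0.061 × Size from this exercise, and the function name is illustrative.

```python
INTERCEPT = 47.82  # thousands of dollars
SLOPE = 0.061      # thousands of dollars per square foot


def predicted_price_dollars(size_sqft):
    """Predicted price in dollars, rounded to the nearest dollar."""
    price_thousands = INTERCEPT + SLOPE * size_sqft
    return round(price_thousands * 1000)


price_3000 = predicted_price_dollars(3000)
price_1200 = predicted_price_dollars(1200)

asking_price = 115_020
residual = asking_price - price_1200  # actual - predicted

print(price_3000, price_1200, residual)
```

A negative residual here means the asking price is below what the model predicts for a home of that size.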
a) $R^2$ does not tell whether the model is appropriate. $R^2$ only measures the strength of the linear relationship. We must also check for a linear relationship in the scatterplot, for outliers, and for any pattern in the scatterplot of residuals.

b) The wingspan is predicted from the bird's height. The actual wingspan will vary around the prediction.
e

a) No. Your score is better than about 95% of people, assuming scores follow the Normal model.

b) Yes. His score is better than only 16% of people.

a) Probably. The residuals show some initially low points, but there's no clear curvature.

b) 92.4% of the variation in nicotine is accounted for by its regression on tar content.

v2 0.595 59.5 ofthevariation inhon


Size
is

iifftisiif.ee
a) Do you think a linear model is appropriate here? Explain.
b) Explain the meaning of $R^2$ in this context.
r

at
$R^2 = 92.4\% = 0.924$

$r = \sqrt{R^2} = \sqrt{0.924} \approx 0.96$

$\hat{z}_{Nicotine} = r \times z_{Tar} = 0.96 \times 2 = 1.92$: nicotine should be 1.92 SDs above average.

$\hat{z}_{Tar} = r \times z_{Nicotine} = 0.96$: tar should be 0.96 SDs above average.
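In standardised form, the predicted z-score of the response is just r times the z-score of the explanatory variable. The sketch below assumes $R^2 = 0.924$ from this exercise and a positive association; the function name `predicted_z` is illustrative.

```python
from math import sqrt

r = sqrt(0.924)  # positive root, since nicotine rises with tar


def predicted_z(r, z_x):
    """Predicted z-score of y for a point that is z_x SDs from the mean in x."""
    return r * z_x


print(round(r, 2))                     # correlation, rounded as in the notes
print(predicted_z(round(r, 2), 2))     # 2 SDs above average in x
print(predicted_z(round(r, 2), 1))     # 1 SD above average in x
```

Since $|r| \le 1$, the prediction is never further from the mean (in SDs) than the value it is predicted from.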
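$\hat{y} = b_0 + b_1 x$

$b_1 = r \dfrac{s_y}{s_x}$ ($r$: correlation coefficient; $s_y$, $s_x$: standard deviations)

$b_0 = \bar{y} - b_1 \bar{x}$ ($\bar{y}$, $\bar{x}$: mean values)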
f
a

$b_1 = 0.065052$
$\widehat{Nicotine} = 0.154030 + 0.065052\,Tar$
b
to Fotine 0.154030 0.065052147

O 39611

0.396
Nicotine content increases by 0.065 mg per additional milligram of tar.

Unit of $b_1$ = mg of nicotine / mg of tar

d) We'd expect a cigarette with no tar to have 0.154 mg of nicotine (the intercept $b_0 = 0.154030$).
g

le residual predicted
algal
9 Tav 7mg
Fotine
yi I 0.154030 0.06505217
0.609394
0.5 9 0.609394

Yi 0.5 0.609394
0.109394
mg
0.109mg to 3s f
abstentions
ie

actual
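Rearranging residual = actual - predicted gives actual = predicted + residual. The sketch below applies this to the tar and nicotine model from this exercise; the function name is illustrative.

```python
B0, B1 = 0.154030, 0.065052  # intercept and slope of the nicotine model


def predict_nicotine(tar_mg):
    """Predicted nicotine (mg) for a cigarette with the given tar (mg)."""
    return B0 + B1 * tar_mg


residual = -0.5                  # mg, given in the question
predicted = predict_nicotine(7)  # cigarette with 7 mg of tar
actual = predicted + residual    # actual = predicted + residual

print(round(predicted, 6), round(actual, 6))
```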

HCI
a) It's appropriate. The relationship is straight enough, with a few outliers.

$\widehat{HCI} = b_0 + b_1\,MFI$, with $b_1 = r \dfrac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$

$b_1 = 0.65 \times \dfrac{116.55}{7072.47} = 0.010711 \approx 0.0107$

$b_0 = 338.2 - 0.0107(46234) = -156.5038 \approx -156.50$

$\widehat{HCI} = -156.50 + 0.0107\,MFI$

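d) Actual: MFI = 44,993, HCI = 548.02

$\widehat{HCI} = -156.50 + 0.0107(44993) = 324.9251 \approx 324.93$

Residual $= y - \hat{y} = 548.02 - 324.93 = 223.09$

e) Standardisation: the regression line has slope $r = 0.65$ and intercept 0.

$\hat{z}_{HCI} = r \times z_{MFI} = 0.65\, z_{MFI}$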

a) $\hat{y} = b_0 + b_1 x$, with $b_1 = r \dfrac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$

$b_1 = 1.1026$ (with $r = 0.037$)

$b_0 = 572.52 - 1.1026(29.67) = 539.805$

$\widehat{TYP} = 539.805 + 1.1026\,Age$

R
b) Yes. Both variables are quantitative. The plot is straight, although flat, with a few outliers.

$\widehat{TYP} = 539.805 + 1.1026(18) = 559.6518 \approx 559.65$

$\widehat{TYP} = 539.805 + 1.1026(50) = 594.935 \approx 594.94$

d) $R^2 = 0.037^2 = 1.369 \times 10^{-3} = 0.001369$, and $0.001369 \times 100 = 0.1369\% \approx 0.14\%$

0.14% of the variability in TYP is accounted for by the regression model on age.

e) No. The plot is nearly flat. The model explains almost none of the variation in TYP ($R^2 = 0.0014$).
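The step from r to $R^2$ as a percentage is a one-line calculation, sketched below for $r = 0.037$ from this exercise. The variable names are illustrative.

```python
r = 0.037

r_squared = r ** 2                  # proportion of variation accounted for
r_squared_percent = 100 * r_squared  # the same quantity as a percentage

print(round(r_squared, 6), round(r_squared_percent, 2))
```

A correlation this close to zero leaves almost all of the variation in the response unexplained by the model.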
a) Fairly straight, positive and moderately strong. Possibly some outliers (higher than expected math scores).

b) The student with 800 math and 490 verbal.

c) Correlation coefficient $r = 0.685$: positive, fairly strong linear relationship.

$R^2 = 0.685^2 = 0.469225 = 46.9\%$

46.9% of the variation in math scores is accounted for by the regression model on verbal.

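d) $b_1 = r \dfrac{s_y}{s_x} = 0.685 \times \dfrac{96.1}{99.5} = 0.66159 \approx 0.662$

$b_0 = \bar{y} - b_1 \bar{x} = 612.2 - 0.66159(596.3) \approx 217.7$

$\widehat{Math} = 217.7 + 0.662\,Verbal$

e) Every point of verbal score adds 0.662 points to the predicted math score.

f) $\widehat{Math} = 217.7 + 0.662(500) = 548.7$ points

g) Residual $= y - \hat{y}$

$\widehat{Math} = 217.7 + 0.662(800) = 747.3$

Residual $= 800 - 747.3 = 52.7$ points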
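Swapping the variables (verbal as y, math as x) leaves the correlation unchanged: $r = 0.685$.

$\widehat{Verbal} = b_0 + b_1\,Math$

$b_1 = r \dfrac{s_y}{s_x} = 0.685 \times \dfrac{99.5}{96.1} = 0.7092 \approx 0.709$

$b_0 = \bar{y} - b_1 \bar{x} = 596.3 - 0.7092(612.2) \approx 162.1$

$\widehat{Verbal} = 162.1 + 0.709\,Math$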

Residual $= y_i - \hat{y}_i > 0$, so $y_i > \hat{y}_i$: the actual verbal score is HIGHER than the predicted verbal score.

$\widehat{Verbal} = 162.1 + 0.709(500) = 516.6$ pts

$\widehat{Math} = 217.7 + 0.662\,Verbal = 217.7 + 0.662(516.6) = 559.6892 \approx 559.7$ pts
f) Regression to the mean. Someone whose math score is below average is predicted to have a verbal score below average, but not as far below (in SDs). So if we use a predicted verbal score to predict a math score, the result is different from the actual math score.
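The round trip described above can be traced numerically. This sketch assumes the two fitted lines from this exercise (Verbal on Math, and Math on Verbal); the function names are illustrative.

```python
def predict_verbal(math_score):
    """Predicted verbal score from a math score (Verbal on Math line)."""
    return 162.1 + 0.709 * math_score


def predict_math(verbal_score):
    """Predicted math score from a verbal score (Math on Verbal line)."""
    return 217.7 + 0.662 * verbal_score


start_math = 500
verbal_hat = predict_verbal(start_math)  # step 1: math -> predicted verbal
math_hat = predict_math(verbal_hat)      # step 2: predicted verbal -> math

print(round(verbal_hat, 1), round(math_hat, 1))
```

The two regression lines are not inverses of each other, so predicting there and back does not return the starting score; each step pulls the prediction closer to the mean.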
