Ue23ma242a 20241029143447

The document discusses Simple Linear Regression and its application in predicting relationships between variables, such as weight and food consumption. It highlights the importance of regression analysis in forecasting, understanding causal relationships, and the impact of global warming on floods and droughts. Additionally, it provides examples and methods for computing the least squares line to estimate outcomes based on given data.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Simple Linear Regression

Dr. Mamatha H R
Department of Computer Science and Engineering
Dr. Karthiyayini
Department of Science and Humanities
Simple Linear Regression: Correlation & Regression Analysis
Regression Analysis

❖ Regression analysis is the study of a set of data in order to make a best guess or prediction.

▪ For example: by studying data that record how much you eat and how much you weigh, you can conclude that a relationship exists between the two.

▪ Regression analysis can help you quantify that relationship and predict how much you will weigh in 10 years' time if you continue to put on weight at the same rate.
Prediction of Floods / Droughts

Impact of global warming:

❖ Increase in rainfall, resulting in floods

❖ Increase in the amount of dry land, leading to droughts
Impact of Global Warming

Global warming in wet areas | Global warming in dry areas
Evaporation of water from land and sea | Increase in evaporation of water from land, water surfaces and plants
More rainfall | Dry areas become drier
Increase in floods | Increase in droughts
Impact of Global Warming: Graphs

[Figure: year range vs temperature, plotted by decade from 1881-1890 to 2011-2020]
Other factors influencing Floods

Causes of floods:
▪ Global warming
▪ Geography of the area
▪ Urbanisation
▪ Deforestation
▪ Other human factors
Regression Analysis

❖ In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables.

❖ It is a way of mathematically sorting out which of those variables actually have an impact.

❖ Which factors matter most?

❖ Which can we ignore?

❖ How do the factors interact with each other?

❖ And most importantly, how certain are we about all these factors?
Regression Analysis: A Broad Classification

Regression models
▪ Simple regression models (one independent variable)
   ▪ Linear
   ▪ Non-linear
▪ Multiple regression models (several independent variables)
   ▪ Linear
   ▪ Non-linear
Some Inputs

❖ Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning.
❖ In some situations regression analysis can be used to infer causal relationships between the independent and dependent variables.
❖ The term "regression" was coined by Francis Galton in the nineteenth century to describe a biological phenomenon.
❖ The earliest form of regression analysis is linear regression, in which a researcher finds the line that most closely fits the data according to a specific mathematical criterion.
❖ This criterion is the method of least squares, which was published by Legendre in 1805 and by Gauss in 1809; the resulting line is referred to as the least-squares line.
The Least-Squares Line

❖ When two variables have a linear relationship, the scatter plot tends to be clustered around a straight line.

❖ This line is referred to as the least-squares line.
The least squares line

❖ Consider the least squares line given by

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

where

▪ $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$

▪ $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
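The two formulas above translate almost line for line into code. The following is a minimal sketch in plain Python; the function name `least_squares_line` and the toy data are illustrative assumptions, not part of the slides.

```python
def least_squares_line(xs, ys):
    """Return (beta0_hat, beta1_hat) for the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


# Toy data lying exactly on y = 1 + 2x (illustrative values only).
b0, b1 = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)
```

Because the toy points lie exactly on a line, the function should recover that line's intercept and slope; with noisy data it returns the best-fitting line instead.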
Example:
❖ The number of hours spent by students in preparing for an entrance exam and the marks scored (on a scale of 0 to 100) are provided in the following table.

SL No. | No. of hours spent | Marks scored
1 | 6 | 82
2 | 10 | 88
3 | 2 | 56
4 | 4 | 64
5 | 6 | 77
6 | 7 | 92
7 | 0 | 23
8 | 1 | 41
9 | 8 | 80
10 | 5 | 59
11 | 3 | 47

Using these values,
i. Estimate the marks scored by a student who has spent 2.35 hours.
ii. Predict the marks that a student can score if he/she invests 20 hours.
Computing the least squares line

❖ We first need to obtain the least squares line, which is given by

$$y = \hat{\beta}_0 + \hat{\beta}_1 x$$

▪ $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$

▪ $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
Example:

SL No. | Hours spent ($x$) | Marks scored ($y$) | $x - \bar{x}$ | $(x - \bar{x})^2$ | $y - \bar{y}$ | $(x - \bar{x})(y - \bar{y})$
1 | 6 | 82 | 1.27 | 1.6129 | 17.55 | 22.33
2 | 10 | 88 | 5.27 | 27.7729 | 23.55 | 124.15
3 | 2 | 56 | -2.73 | 7.4529 | -8.45 | 23.06
4 | 4 | 64 | -0.73 | 0.5329 | -0.45 | 0.33
5 | 6 | 77 | 1.27 | 1.6129 | 12.55 | 15.97
6 | 7 | 92 | 2.27 | 5.1529 | 27.55 | 62.60
7 | 0 | 23 | -4.73 | 22.3729 | -41.45 | 195.97
8 | 1 | 41 | -3.73 | 13.9129 | -23.45 | 87.42
9 | 8 | 80 | 3.27 | 10.6929 | 15.55 | 50.88
10 | 5 | 59 | 0.27 | 0.0729 | -5.45 | -1.49
11 | 3 | 47 | -1.73 | 2.9929 | -17.45 | 30.15
Mean / Sum | $\bar{x} = 4.73$ | $\bar{y} = 64.45$ | | 94.1819 | | 611.37

Example:
From the table we have

$\bar{x} = 4.73$ ; $\bar{y} = 64.45$

▪ $\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = 611.37$

▪ $\sum_{i=1}^{n}(x_i - \bar{x})^2 = 94.1819$

▪ $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \dfrac{611.37}{94.1819} = 6.49$

▪ $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 64.45 - (6.49 \times 4.73) = 33.7523$

▪ The equation of the least squares line is given by

$$y = \hat{\beta}_0 + \hat{\beta}_1 x \Rightarrow y = 33.7523 + 6.49x$$
Example:
▪ The equation of the least squares line is given by

$$y = 33.7523 + 6.49x$$

i. To estimate the marks scored by a student who has spent 2.35 hours:

$$y = 33.7523 + (6.49 \times 2.35) = 49.0038$$

ii. To predict the marks that a student can score if he/she invests 20 hours:

$$y = 33.7523 + (6.49 \times 20) = 163.5523$$
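The hand computation can be cross-checked with a few lines of Python. This is a sketch under the assumption that the eleven (hours, marks) pairs are exactly those in the example table; the variable and function names are illustrative. Because the code carries full precision instead of the rounded means 4.73 and 64.45, its intercept comes out near 33.77 rather than 33.7523, and the predictions shift slightly in the second decimal place.

```python
hours = [6, 10, 2, 4, 6, 7, 0, 1, 8, 5, 3]
marks = [82, 88, 56, 64, 77, 92, 23, 41, 80, 59, 47]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(marks) / n

# Sums that appear in the bottom row of the worked table.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, marks))
s_xx = sum((x - x_bar) ** 2 for x in hours)

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar


def predict(x):
    """Marks predicted by the fitted line for x hours of preparation."""
    return intercept + slope * x


print("slope     =", round(slope, 4))
print("intercept =", round(intercept, 4))
print("2.35 hours ->", round(predict(2.35), 2))
print("20 hours   ->", round(predict(20), 2))
```

The 20-hour prediction lies above the maximum possible mark of 100, which is a useful sanity check on how far the line can be trusted.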

❖ How to compute the Least Squares Line

❖ Residuals and Errors

❖ Measuring Goodness of fit


How to compute the Least-Squares Line?

[Figure: two candidate lines, $l_1$ and $l_2$]
Scenario #1: No Errors!!

Weight (lb) ($x$) | Length (in.) ($y$)
0.0 | 5.02
0.2 | 5.04
0.4 | 5.06
0.6 | 5.08
0.8 | 5.10
1.0 | 5.12
1.2 | 5.14
1.4 | 5.16
1.6 | 5.18
1.8 | 5.20
2.0 | 5.22

[Figure: Weight (lb) (x) vs Length (in.) (y)]
Scenario #2: Measurement has Errors!!

Weight (lb) ($x$) | Length (in.) ($y$) | Weight (lb) ($x$) | Length (in.) ($y$)
0.0 | 5.06 | 2.0 | 5.40
0.2 | 5.01 | 2.2 | 5.57
0.4 | 5.12 | 2.4 | 5.47
0.6 | 5.13 | 2.6 | 5.53
0.8 | 5.14 | 2.8 | 5.61
1.0 | 5.16 | 3.0 | 5.59
1.2 | 5.25 | 3.2 | 5.61
1.4 | 5.19 | 3.4 | 5.75
1.6 | 5.24 | 3.6 | 5.68
1.8 | 5.46 | 3.8 | 5.80

[Figure: Weight (lb) (x) vs Length (in.) (y)]
Residual:
❖ $e_i = y_{\text{observed}} - y_{\text{predicted}}$

[Figure: Weight (lb) (x) vs Length (in.) (y)]
Least Squares Line:
NOTE: The least squares line is defined to be the line for which the sum of squared residuals is minimum.

❖ That is, it is the line for which $\sum_{i=1}^{n} e_i^2$ is minimum.

❖ Using some mathematical computations (setting the partial derivatives of this sum to zero), it can be shown that the minimum is attained at

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

[Figure: Weight (lb) (x) vs Length (in.) (y)]
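The "mathematical computations" referred to on this slide can be sketched as follows. This is the standard calculus argument; the candidate coefficients $b_0$ and $b_1$ and the function $S$ are symbols introduced here for illustration and do not appear in the slides.

```latex
% Sum of squared residuals for a candidate line y = b_0 + b_1 x
S(b_0, b_1) = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2

% Setting the partial derivative with respect to b_0 to zero
\frac{\partial S}{\partial b_0} = -2 \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right) = 0
\;\Longrightarrow\; b_0 = \bar{y} - b_1 \bar{x}

% Setting the partial derivative with respect to b_1 to zero
\frac{\partial S}{\partial b_1} = -2 \sum_{i=1}^{n} x_i \left( y_i - b_0 - b_1 x_i \right) = 0

% Substituting b_0 = \bar{y} - b_1 \bar{x} into the second equation
\sum_{i=1}^{n} x_i (y_i - \bar{y}) = b_1 \sum_{i=1}^{n} x_i (x_i - \bar{x})

% Since \sum (y_i - \bar{y}) = 0 and \sum (x_i - \bar{x}) = 0, x_i may be
% replaced by (x_i - \bar{x}) on both sides, which gives
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
```

The solution $(b_0, b_1)$ of these two equations is what the slides denote by $(\hat{\beta}_0, \hat{\beta}_1)$.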
Least Squares Line: Summary
Scenario #1: If there is no measurement error, then the data points lie on the straight line $y = \beta_0 + \beta_1 x$, and the values of $\beta_0$ and $\beta_1$ can be obtained easily by calculating the slope and the intercept.
Scenario #2: If there is a measurement error $\varepsilon_i$, then
❖ the exact values of $\beta_0$ and $\beta_1$ cannot be determined;
❖ $\beta_0$ and $\beta_1$ are estimated by computing the least squares line.
❖ The least squares line is given by $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$

where
▪ $\hat{\beta}_0$ → the $y$-intercept of the least squares line
   → gives an estimate of $\beta_0$, the initial (unloaded) length of the spring.
▪ $\hat{\beta}_1$ → the slope of the least squares line
   → gives an estimate of $\beta_1$, the actual value of the spring constant.
Computing formulas
Remark:

❖ $\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$

❖ $\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$

❖ $\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$

For computational purposes we use the equivalent formulas given on the right-hand side.
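As a quick numerical check of these identities, the sketch below evaluates both sides on the hours/marks data from the earlier example. The variable names are illustrative assumptions; only the standard library is used.

```python
import math

xs = [6, 10, 2, 4, 6, 7, 0, 1, 8, 5, 3]              # hours spent
ys = [82, 88, 56, 64, 77, 92, 23, 41, 80, 59, 47]    # marks scored

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Left-hand sides: sums of products of deviations from the means.
lhs_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
lhs_xx = sum((x - x_bar) ** 2 for x in xs)
lhs_yy = sum((y - y_bar) ** 2 for y in ys)

# Right-hand sides: the computing formulas.
rhs_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
rhs_xx = sum(x * x for x in xs) - n * x_bar ** 2
rhs_yy = sum(y * y for y in ys) - n * y_bar ** 2

for name, lhs, rhs in [("xy", lhs_xy, rhs_xy), ("xx", lhs_xx, rhs_xx), ("yy", lhs_yy, rhs_yy)]:
    print(name, round(lhs, 4), round(rhs, 4), math.isclose(lhs, rhs, rel_tol=1e-9))
```

The right-hand-side forms need only running totals of $x$, $y$, $x^2$, $y^2$ and $xy$, which is why they are convenient for hand calculation.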
Try This !!!
Using the Hooke's law data given in the table,

Weight (lb) ($x$) | Length (in.) ($y$) | Weight (lb) ($x$) | Length (in.) ($y$)
0.0 | 5.06 | 2.0 | 5.40
0.2 | 5.01 | 2.2 | 5.57
0.4 | 5.12 | 2.4 | 5.47
0.6 | 5.13 | 2.6 | 5.53
0.8 | 5.14 | 2.8 | 5.61
1.0 | 5.16 | 3.0 | 5.59
1.2 | 5.25 | 3.2 | 5.61
1.4 | 5.19 | 3.4 | 5.75
1.6 | 5.24 | 3.6 | 5.68
1.8 | 5.46 | 3.8 | 5.80

i. Compute the least squares estimates of the spring constant and the unloaded length of the spring.
ii. Write the equation of the least squares line.
iii. Estimate the length of the spring under a load of 1.3 lb.
iv. Estimate the length of the spring under a load of 1.4 lb.
v. Obtain the residuals corresponding to all the points $(x_i, y_i)$.
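One way to check your answers to this exercise is the short script below. It is a sketch, assuming the twenty (weight, length) pairs are exactly those in the table; the names `spring_constant`, `unloaded_length` and `predict_length` are illustrative, not from the slides.

```python
weights = [round(0.2 * k, 1) for k in range(20)]   # 0.0, 0.2, ..., 3.8 lb
lengths = [5.06, 5.01, 5.12, 5.13, 5.14, 5.16, 5.25, 5.19, 5.24, 5.46,
           5.40, 5.57, 5.47, 5.53, 5.61, 5.59, 5.61, 5.75, 5.68, 5.80]

n = len(weights)
x_bar = sum(weights) / n
y_bar = sum(lengths) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weights, lengths))
s_xx = sum((x - x_bar) ** 2 for x in weights)

spring_constant = s_xy / s_xx                       # slope estimate (i)
unloaded_length = y_bar - spring_constant * x_bar   # intercept estimate (i)


def predict_length(load):
    """Length given by the fitted line (ii) for a load in lb."""
    return unloaded_length + spring_constant * load


# (v) Residual = observed length minus length predicted by the line.
residuals = [y - predict_length(x) for x, y in zip(weights, lengths)]

print("line: y = %.4f + %.4f x" % (unloaded_length, spring_constant))
print("length at 1.3 lb:", round(predict_length(1.3), 4))   # (iii)
print("length at 1.4 lb:", round(predict_length(1.4), 4))   # (iv)
print("residuals:", [round(e, 4) for e in residuals])
```

Note that for part (iv) the table already contains an observation at 1.4 lb; the estimate from the line generally differs from that single noisy measurement, and the difference is the residual at that point.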
Some Observations :

❖ The Estimates are not the same as true values

❖ The Residuals are not the same as the Errors.

❖ Don’t extrapolate outside the range of the data.

❖ Don’t use the Least Squares line when the data aren’t linear.
The Estimates are not the same as true values

Weight (lb) ($x$) | True length (in.) ($y$) | Observed length (in.) ($y$)
0.0 | 5.02 | 5.06
0.2 | 5.04 | 5.01
0.4 | 5.06 | 5.12
0.6 | 5.08 | 5.13
0.8 | 5.10 | 5.14
1.0 | 5.12 | 5.16
1.2 | 5.14 | 5.25
1.4 | 5.16 | 5.19
1.6 | 5.18 | 5.24
1.8 | 5.20 | 5.46
2.0 | 5.22 | 5.40

[Figure: Weight (x) vs Length (y), showing the true line y = 0.1x + 5.02 and the least-squares line fitted to the observed lengths, y = 0.1859x + 5.0105]

The least-squares estimates (slope 0.1859, intercept 5.0105) are close to, but not equal to, the true values (slope 0.1, intercept 5.02).
The Residuals are not the same as Errors

[Figure: Weight (x) vs Length (y) for the same true and observed lengths, showing the true line y = 0.1x + 5.02 and the least-squares line y = 0.1859x + 5.0105]

The errors are the vertical distances from the observed points to the true line, whereas the residuals are the vertical distances from the observed points to the least-squares line. Because the two lines differ, so do the errors and the residuals. For example, at $x = 0.0$ the observed length is 5.06, so the error is $5.06 - 5.02 = 0.04$ while the residual is $5.06 - 5.0105 = 0.0495$.
Don't Extrapolate outside the range of the data!!
❖ The number of hours spent by students in preparing for an entrance exam and the marks scored (on a scale of 0 to 100) are provided in the following table.

SL No. | No. of hours spent | Marks scored
1 | 6 | 82
2 | 10 | 88
3 | 2 | 56
4 | 4 | 64
5 | 6 | 77
6 | 7 | 92
7 | 0 | 23
8 | 1 | 41
9 | 8 | 80
10 | 5 | 59
11 | 3 | 47

Using these values,
i. Estimate the marks scored by a student who has spent 2.35 hours.
ii. Predict the marks that a student can score if he/she invests 20 hours.

The observed hours range from 0 to 10. The value 2.35 lies inside this range, so the estimate in (i) is reasonable. The value 20 lies far outside it, and the fitted line gives 163.5523 marks, which is impossible on a scale of 0 to 100.
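A simple way to enforce this rule in code is to refuse predictions outside the observed range of $x$. The sketch below uses the rounded line $y = 33.7523 + 6.49x$ from the worked example; the function name `predict_within_range` and the choice to raise an error (rather than, say, print a warning) are illustrative assumptions.

```python
hours = [6, 10, 2, 4, 6, 7, 0, 1, 8, 5, 3]

# Coefficients of the fitted line from the worked example (rounded values).
INTERCEPT = 33.7523
SLOPE = 6.49


def predict_within_range(x, observed_x):
    """Predict y from the fitted line, but only inside the observed x range."""
    lo, hi = min(observed_x), max(observed_x)
    if not lo <= x <= hi:
        raise ValueError(
            f"x = {x} is outside the observed range [{lo}, {hi}]; "
            "the linear model has not been checked there"
        )
    return INTERCEPT + SLOPE * x


print(predict_within_range(2.35, hours))   # interpolation: allowed

try:
    predict_within_range(20, hours)        # extrapolation: rejected
except ValueError as err:
    print("refused:", err)
```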
Don't Extrapolate outside the range of the data!!

❖ The Hooke's law measurements (Scenario #2) cover loads from 0.0 lb to 3.8 lb, with observed lengths between 5.01 in. and 5.80 in.

❖ The least-squares line fitted to these data should be used only to estimate lengths for loads within this range; the data say nothing about whether the spring still behaves linearly under heavier loads.
Don't use the Least Squares Line when the data aren't linear

[Figure: scatter plot of projectile motion]

Note: In some cases the least-squares line can be used for non-linear data, but only after a variable transformation has been applied.
Measuring goodness of fit
❖ A goodness-of-fit statistic is a quantity that measures how well a model explains a given set of data.
❖ A linear model fits well if there is a strong linear relationship between the variables involved.
❖ The strength of a linear relationship can be measured by considering
$$\sum_{i=1}^{n}(y_i - \bar{y})^2 - \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 .$$
❖ This quantity is also referred to as a goodness-of-fit statistic.
❖ The drawback of this statistic is that it carries units (the squared units of $y$), so it cannot be used to compare the goodness of fit of two models fitted to data sets with different units.
❖ Hence we use the unitless ratio
$$r^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2 - \sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2},$$
which is the square of the correlation coefficient $r$.
❖ This is also referred to as the coefficient of determination.
Visualisation of $r^2$

[Figure: visualisation of $r^2$]
Some special terminologies!

❖ $r^2 = \dfrac{\sum_{i=1}^{n}(y_i - \bar{y})^2 - \sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$

❖ $\sum_{i=1}^{n}(y_i - \bar{y})^2$ : Total sum of squares

❖ $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ : Error sum of squares

❖ $\sum_{i=1}^{n}(y_i - \bar{y})^2 - \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ : Regression sum of squares

❖ Therefore, Total sum of squares = Regression sum of squares + Error sum of squares

❖ And, $r^2 = \dfrac{\text{Regression sum of squares}}{\text{Total sum of squares}}$

❖ $r^2$ is also referred to as the proportion of the variance in $y$ explained by regression.
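The three sums of squares and $r^2$ can be computed directly from their definitions. The sketch below does this for the hours/marks data from the earlier example and also checks that $r^2$ agrees with the square of the correlation coefficient; the variable names (`sst`, `sse`, `ssr`) are illustrative assumptions.

```python
import math

xs = [6, 10, 2, 4, 6, 7, 0, 1, 8, 5, 3]              # hours spent
ys = [82, 88, 56, 64, 77, 92, 23, 41, 80, 59, 47]    # marks scored

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in xs]

sst = s_yy                                               # total sum of squares
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))      # error sum of squares
ssr = sst - sse                                          # regression sum of squares

r_squared = ssr / sst
r = s_xy / math.sqrt(s_xx * s_yy)                        # correlation coefficient

print("SST =", round(sst, 2), " SSR =", round(ssr, 2), " SSE =", round(sse, 2))
print("r^2 =", round(r_squared, 4), " r =", round(r, 4))
```

For these data roughly 85% of the variation in marks is accounted for by the linear relationship with hours.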
More about $r^2$

❖ It is a quantity that indicates how well a statistical model fits a data set. In other words, it is a statistical measure of how close the observed data are to the fitted regression line.

❖ It measures how much of the variation in the dependent variable $y$ is explained by variation in the independent variable $x$.

❖ It indicates how much confidence can be placed in forecasts or predictions made from the fitted line.

❖ Its value lies between 0 and 1.

❖ The higher the value of $r^2$, the more closely the fitted line follows the data.


Uncertainties in the Least-Squares Coefficients
• The errors $\varepsilon_i$ create uncertainty in the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$.
• It is intuitively clear that if the $\varepsilon_i$ tend to be small in magnitude, the points will be tightly clustered around the line, and the uncertainty in the least-squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ will be small.
• On the other hand, if the $\varepsilon_i$ tend to be large in magnitude, the points will be widely scattered around the line, and the uncertainties (standard deviations) in the least-squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ will be larger.
Uncertainties in the Least-Squares Coefficients

• Assume we have $n$ data points $(x_1, y_1), \ldots, (x_n, y_n)$, and we plan to fit the least squares line.
• In order for the estimates $\hat{\beta}_1$ and $\hat{\beta}_0$ to be useful, we need to estimate just how large their uncertainties are. In order to do this, we need to know something about the nature of the errors $\varepsilon_i$.
• We will begin by studying the simplest situation, in which four important assumptions are satisfied.
Uncertainties in the Least-Squares Coefficients

Assumptions for Errors in Linear Models:

In the simplest situation, the following assumptions are satisfied:
1. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are random and independent. In particular, the magnitude of any error $\varepsilon_i$ does not influence the value of the next error $\varepsilon_{i+1}$.
2. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have mean 0.
3. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have the same variance, which we denote by $\sigma^2$.
4. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are normally distributed.

• When the sample size is large, the normality assumption (4) becomes less important.
• Mild violations of the assumption of constant variance (3) do not matter too much, but severe violations should be corrected.
Uncertainties in the Least-Squares Coefficients
• Under these assumptions, the effect of the $\varepsilon_i$ is largely governed by the magnitude of the variance $\sigma^2$, since it is this variance that determines how large the errors are likely to be.
• Therefore, in order to estimate the uncertainties in $\hat{\beta}_0$ and $\hat{\beta}_1$, we must first estimate the error variance $\sigma^2$.
• Since the magnitude of the variance is reflected in the degree of spread of the points around the least-squares line, it follows that by measuring this spread, we can estimate the variance.
• Specifically, the vertical distance from each data point $(x_i, y_i)$ to the least-squares line is given by the residual $e_i$.
Uncertainties in the Least-Squares Coefficients

• The spread of the points around the line can be measured by the sum of the squared residuals, $\sum_{i=1}^{n} e_i^2$.
• The estimate of the error variance $\sigma^2$ is the quantity $s^2$ given by
$$s^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2} = \frac{(1 - r^2)\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-2}$$
Distribution

In the linear model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, under assumptions 1 through 4, the observations $y_1, \ldots, y_n$ are independent random variables that follow the normal distribution. The mean and variance of $y_i$ are given by

$$\mu_{y_i} = \beta_0 + \beta_1 x_i$$

$$\sigma^2_{y_i} = \sigma^2$$

The slope represents the change in the mean of $y$ associated with an increase of one unit in the value of $x$.
More Distributions
Under assumptions 1 – 4:
• The quantities $\hat{\beta}_0$ and $\hat{\beta}_1$ are normally distributed random variables.

• The means of $\hat{\beta}_0$ and $\hat{\beta}_1$ are the true values $\beta_0$ and $\beta_1$, respectively.
More Distributions (cont.)
• The standard deviations of $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimated with

$$s_{\hat{\beta}_0} = s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}} \qquad \text{and} \qquad s_{\hat{\beta}_1} = \frac{s}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

where

$$s = \sqrt{\frac{(1 - r^2)\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-2}}$$

is an estimate of the error standard deviation $\sigma$.
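To see these formulas in action, the sketch below evaluates $s$, $s_{\hat{\beta}_0}$ and $s_{\hat{\beta}_1}$ for the twenty Hooke's law measurements used earlier in the deck. It computes $s$ twice, once from the residuals and once from the $(1 - r^2)$ form, as a consistency check. The variable names are illustrative assumptions.

```python
import math

weights = [round(0.2 * k, 1) for k in range(20)]   # 0.0, 0.2, ..., 3.8 lb
lengths = [5.06, 5.01, 5.12, 5.13, 5.14, 5.16, 5.25, 5.19, 5.24, 5.46,
           5.40, 5.57, 5.47, 5.53, 5.61, 5.59, 5.61, 5.75, 5.68, 5.80]

n = len(weights)
x_bar = sum(weights) / n
y_bar = sum(lengths) / n
s_xx = sum((x - x_bar) ** 2 for x in weights)
s_yy = sum((y - y_bar) ** 2 for y in lengths)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weights, lengths))

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# s from the residuals: sqrt(sum(e_i^2) / (n - 2)).
residuals = [y - (intercept + slope * x) for x, y in zip(weights, lengths)]
s = math.sqrt(sum(e * e for e in residuals) / (n - 2))

# s from the equivalent (1 - r^2) form given on the slide.
r_squared = s_xy ** 2 / (s_xx * s_yy)
s_alt = math.sqrt((1 - r_squared) * s_yy / (n - 2))

se_intercept = s * math.sqrt(1 / n + x_bar ** 2 / s_xx)
se_slope = s / math.sqrt(s_xx)

print("s            =", round(s, 4), "(alt form:", round(s_alt, 4), ")")
print("s_beta0_hat  =", round(se_intercept, 4))
print("s_beta1_hat  =", round(se_slope, 4))
```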


Notes
1. Since the quantity $\sum_{i=1}^{n}(x_i - \bar{x})^2$ appears in the denominators of $s_{\hat{\beta}_0}$ and $s_{\hat{\beta}_1}$, it follows that the more spread out the $x$'s are, the smaller the uncertainties in $\hat{\beta}_0$ and $\hat{\beta}_1$ will be.

2. Use caution: if the range of $x$ values extends beyond the range where the linear model holds, the results will not be valid.
3. The quantities $(\hat{\beta}_0 - \beta_0)/s_{\hat{\beta}_0}$ and $(\hat{\beta}_1 - \beta_1)/s_{\hat{\beta}_1}$ have Student's $t$ distribution with $n - 2$ degrees of freedom.
Checking Assumptions and Transforming Data

• We stated some assumptions for the errors. Here we want to see if any of those assumptions are violated.

• The single best diagnostic for least-squares regression is a plot of residuals versus the fitted values, sometimes called a residual plot.
More of the Residual Plot

• When the linear model is valid, and assumptions 1 – 4 are satisfied, the plot will
show no substantial pattern. There should be no curve to the plot, and the
vertical spread of the points should not vary too much over the horizontal range
of the data.

• A good-looking residual plot does not by itself prove that the linear model is
appropriate. However, a residual plot with a serious defect does clearly indicate
that the linear model is inappropriate.
Residual Plots

A: No noticeable pattern
B: Heteroscedastic
C: Trend
D: Outlier
Checking Assumptions to form a Linear Model
• Example of a residual plot: on the left is the plot of $y$ versus $x$; on the right is the plot of the residuals versus the fitted values of $y$.
Checking Assumptions to form a Linear Model

• A bit of terminology:
• If the vertical spread does not vary with the fitted value, we call the residual plot homoscedastic. Otherwise, we call the plot heteroscedastic.
Checking Assumptions to form a Linear Model
• Below on the left the plot is homoscedastic, while on the
right the spread increases with the fitted value and is thus
heteroscedastic.
Homoscedasticity or Heteroscedasticity? The way forward...

• If the residual plot is homoscedastic, and shows no substantial trend or curve, then a linear model can be found for the data plotted.
• If the residual plot is heteroscedastic, or shows a substantial trend or curve, then the assumptions for a linear model certainly do NOT hold! In such cases we need to transform the data or pursue other methods.
Transforming the Variables
• If we fit the linear model $y = \beta_0 + \beta_1 x + \varepsilon$ and find that the residual plot exhibits a trend or pattern, we can sometimes fix the problem by raising $x$, $y$, or both to a power.

• It may be the case that a model of the form $y^a = \beta_0 + \beta_1 x^b + \varepsilon$ fits the data well.

• Replacing a variable with a function of itself is called transforming the variable. Specifically, raising a variable to a power is called a power transformation.
Which transformation to apply?
It is possible with experience to look at a scatterplot, or a residual plot, and
make an educated guess as to how to transform the variables.
Mathematical methods are also available to determine a good
transformation.
Trial and Error is fine – Try various powers on both x and y (including
ln x and ln y), look at the residual plots, and hope to find a homoscedastic
one with no discernible pattern.
More advanced discussion in Draper and Smith (1998).
Which transformation to apply?

Recall the earlier example of a scatter plot (O3 concentration vs NOx concentration) whose residual plot on the right is heteroscedastic as shown below. Linear model NOT GOOD! Uh oh!
Also notice the outlier with ozone concentration nearly 100.
Logarithm Transformation on One Axis

Applying the logarithm to the y-axis variable (O3 concentration), we obtain the following scatter plot, with its residual plot on the right. Linear model looks GOOD! YAY! The outlier is less prominent too!
Logarithm Transformation on One Axis
Now consider the example below, where the plot on the left is Production (ft3/ft) vs Fracture fluid (gal/ft) and the residual plot is largely heteroscedastic! Not good for a linear model.
Logarithm Transformation on One Axis
Below is a plot of ln (production) vs ln (fracture fluid) for the same data. This time the residual plot is homoscedastic, good for a linear model!
Power Transformations – The reciprocal
Below (left side) is a plot of Rockwell (B scale) hardness of welds
versus their Ogden-Jaffe number. The residual plot (right side)
shows a pattern where negative residual is observed for the
extreme fitted values and positive residual for the ones in the
middle. Linear model NOT OK.
Power Transformations – The reciprocal
We plot the graph of Rockwell Hardness vs $(\text{Ogden-Jaffe})^{-1}$ for the same data (below, left side) and find that the residual plot (below, right side) is homoscedastic, having no discernible pattern.
Linear model is OK.
Power Transformations – Positive Powers
Plot $y$ vs $x$; the residual plot exhibits a discernible pattern.
Linear model is NOT OK.

x y x y
1 2.2 11 31.5
2 9 12 32.7
3 13.5 13 34.9
4 17 14 36.3
5 20.5 15 37.7
6 23.3 16 38.7
7 25.2 17 40
8 26.4 18 41.3
9 27.6 19 42.5
10 30.2 20 43.7
Power Transformations – Positive Powers
Plot $y^2$ vs $x$; the residual plot is homoscedastic and exhibits no discernible pattern.
Linear model is OK.
x y² x y²
1 4.84 11 992.25
2 81 12 1069.29
3 182.25 13 1218.01
4 289 14 1317.69
5 420.25 15 1421.29
6 542.89 16 1497.69
7 635.04 17 1600
8 696.96 18 1705.69
9 761.76 19 1806.25
10 912.04 20 1909.69
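The effect of this power transformation can be checked numerically without drawing the plots. The sketch below fits a line to $y$ vs $x$ and to $y^2$ vs $x$ for the twenty points in the tables above, then prints the sign of each residual (a crude text stand-in for a residual plot) and $r^2$ for both fits. The helper `fit_and_residuals` is an illustrative assumption, and comparing $r^2$ is an extra check added here; the slides themselves judge the fit by the residual plot.

```python
xs = list(range(1, 21))
ys = [2.2, 9, 13.5, 17, 20.5, 23.3, 25.2, 26.4, 27.6, 30.2,
      31.5, 32.7, 34.9, 36.3, 37.7, 38.7, 40, 41.3, 42.5, 43.7]


def fit_and_residuals(xs, ys):
    """Fit the least-squares line; return (intercept, slope, residuals, r_squared)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    r_squared = s_xy ** 2 / (s_xx * s_yy)
    return intercept, slope, residuals, r_squared


def sign_pattern(residuals):
    """One character per point, in order of increasing x."""
    return "".join("+" if e > 0 else "-" for e in residuals)


_, _, res_raw, r2_raw = fit_and_residuals(xs, ys)
_, _, res_sq, r2_sq = fit_and_residuals(xs, [y ** 2 for y in ys])

print("y   vs x:", sign_pattern(res_raw), " r^2 =", round(r2_raw, 4))
print("y^2 vs x:", sign_pattern(res_sq), " r^2 =", round(r2_sq, 4))
```

For the untransformed data the residuals are negative at both ends and positive in the middle, which is the curved pattern that rules out a straight-line model; after squaring $y$ the signs are much more mixed.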
Transformations – Do they always work?

It is important to remember that power transformations don't always work.

Sometimes, none of the residual plots looks good, no matter what transformations are tried. In these cases, other methods should be used. One of these is multiple regression, which is not covered here.

Some other methods are briefly mentioned in the next slide.


Alternatives to Transformations
The popular methods other than transformation are:

• Weighted Least Squares
➔ We assign greater weights to points in regions where the vertical spread is smaller and vice versa.

• Multiple Regression
➔ We add more independent variables in order to explain the variation in the dependent variable.
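The slides do not give a formula for weighted least squares, so the following is only a sketch of one common formulation: minimize $\sum w_i (y_i - b_0 - b_1 x_i)^2$, which leads to the usual slope and intercept formulas with weighted means in place of ordinary means. The function name, the sample data and the choice of weights $1/x^2$ are all hypothetical.

```python
def weighted_least_squares(xs, ys, ws):
    """Line minimizing sum(w_i * (y_i - b0 - b1 * x_i)**2); returns (b0, b1)."""
    w_sum = sum(ws)
    # Weighted means replace the ordinary means of the unweighted formulas.
    x_w = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_w = sum(w * y for w, y in zip(ws, ys)) / w_sum
    s_xy = sum(w * (x - x_w) * (y - y_w) for w, x, y in zip(ws, xs, ys))
    s_xx = sum(w * (x - x_w) ** 2 for w, x in zip(ws, xs))
    slope = s_xy / s_xx
    intercept = y_w - slope * x_w
    return intercept, slope


# Hypothetical data whose scatter grows with x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.3, 7.6, 10.9]

# Hypothetical weights: points with smaller assumed spread get larger weight.
ws = [1 / x ** 2 for x in xs]

print("equal weights:", weighted_least_squares(xs, ys, [1] * len(xs)))
print("1/x^2 weights:", weighted_least_squares(xs, ys, ws))
```

With equal weights the function reduces to the ordinary least-squares line, which is a convenient sanity check before trying other weights.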


How Many Points Make a Reliable Residual Plot?

When there are too few points on the residual plot, then…

➢ … it may appear to have a pattern or to be heteroscedastic even though that is just a visual effect created by one or two points.

➢ … detecting outliers may become difficult.

What to do if you can't interpret a residual plot reliably?

You can start by fitting a linear model but declare your result tentative; once more data are available, a reliable decision can be made.
How Many Points Make a Reliable Residual Plot?

NOT all residual plots with few points turn out to be hard to interpret.

Some of these show a pattern which cannot be changed by relocating just one
or two points.

In such a case a linear model should NOT be used!


Outliers

• Outliers are points that are detached from the bulk of the data.

• Both the scatter plot and the residual plot should be examined for outliers.

• The first thing to do with an outlier is to determine why it is different from the rest of the points.
Outliers

• Sometimes outliers are caused by data-recording errors or equipment malfunction. In this case, the outlier can be deleted from the data set, and you may present results that do not include the outlier.

• If it cannot be determined why there is an outlier, then it is not wise to delete it. Here the results presented should be the ones from the analysis with the outlier included in the data set.
Influential Point

• If there are outliers that cannot be removed from the data set, then the best thing to do is to fit the line twice: once to the whole data set, and once with the outlier removed.

• If none of the outliers, upon removal, makes a noticeable difference to the least-squares line or to the estimated standard deviations of the slope and intercept, then use the fit with the outliers included.
Influential Point

• If one or more outliers do make a difference, then the range of values for the least-squares coefficients should be reported. Avoid computing confidence and prediction intervals and performing hypothesis tests.

• An outlier that makes a considerable difference to the least-squares line when removed is called an influential point.
• In general, outliers with unusual x values are more likely to be influential than those with unusual y values, but every outlier should be checked.

• Some authors restrict the definition of outliers to points that have unusually large residuals.
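The "fit with and without the point" check described above is easy to automate. The sketch below refits the line with each point left out in turn and reports how much the slope changes; the data are hypothetical (five points near $y = 2x$ plus one point with an unusual $x$ value), and the helper names are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def leave_one_out_slopes(xs, ys):
    """Slope of the line refitted with each point removed in turn."""
    slopes = []
    for i in range(len(xs)):
        xs_i = xs[:i] + xs[i + 1:]
        ys_i = ys[:i] + ys[i + 1:]
        slopes.append(fit_line(xs_i, ys_i)[1])
    return slopes


# Hypothetical data: the last point has an unusual x value.
xs = [1, 2, 3, 4, 5, 15]
ys = [2.0, 4.1, 5.9, 8.2, 9.9, 12.0]

full_slope = fit_line(xs, ys)[1]
loo = leave_one_out_slopes(xs, ys)
changes = [abs(s - full_slope) for s in loo]
most_influential = changes.index(max(changes))

print("slope with all points:", round(full_slope, 4))
for i, (s, c) in enumerate(zip(loo, changes)):
    print(f"without point {i}: slope = {s:.4f} (change {c:.4f})")
print("largest change comes from removing point", most_influential)
```

How large a change counts as "considerable" is a judgment call that this sketch leaves to the analyst.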
Comments

• Transforming the variables is not the only method for analyzing data when the residual plot indicates a problem.

• There is a technique called weighted least squares regression. The effect is to make the points whose error variance is smaller have greater influence in the computation of the least-squares line.

• When the residual plot shows a trend, this sometimes indicates that more than one independent variable is needed to explain the variation in the dependent variable.
Comments

• If the relationship is nonlinear, then a method called nonlinear regression can be applied.

• If the plot of residuals versus fitted values looks good, it may be advisable to perform additional diagnostics to further check the fit of the linear model. A time series plot is used to see if time should be included in the model. A normal probability plot can be used to check the normality assumption.
Independence of Observations

• If the plot of residuals versus fitted values looks good, then further diagnostics may be used to check the fit of the linear model.

• One such diagnostic is a time order plot, which plots the residuals against the order in which the observations were made.

• If there are trends in this plot, then x and y may be varying with time. In this case, adding a time term to the model as an additional independent variable may be appropriate.
Normality Assumption

• To check that the errors are normally distributed, a normal probability plot of the residuals can be made.

• If the plot looks like it follows a rough straight line, then we can conclude that the residuals are approximately normally distributed.
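The coordinates of a normal probability plot can be computed with the standard library alone. The sketch below pairs the sorted residuals with standard normal quantiles at the plotting positions $(i - 0.5)/n$, which is one common convention and an assumption here since the slides do not specify one. The correlation between the two coordinates is included as a rough numerical stand-in for "follows a rough straight line". The sample residuals are hypothetical.

```python
import math
from statistics import NormalDist


def normal_probability_points(residuals):
    """Return (normal quantile, sorted residual) pairs for a normal probability plot."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


def correlation(pairs):
    """Pearson correlation of the two coordinates; near 1 suggests a straight line."""
    n = len(pairs)
    a_bar = sum(a for a, _ in pairs) / n
    b_bar = sum(b for _, b in pairs) / n
    s_ab = sum((a - a_bar) * (b - b_bar) for a, b in pairs)
    s_aa = sum((a - a_bar) ** 2 for a, _ in pairs)
    s_bb = sum((b - b_bar) ** 2 for _, b in pairs)
    return s_ab / math.sqrt(s_aa * s_bb)


# Hypothetical residuals from some fitted line.
residuals = [0.03, -0.05, 0.01, 0.08, -0.02, -0.07, 0.04, 0.00, -0.03, 0.02]

points = normal_probability_points(residuals)
for q, e in points:
    print(f"{q:7.3f}  {e:6.2f}")
print("straightness (correlation):", round(correlation(points), 3))
```

Plotting the pairs with any charting tool gives the normal probability plot itself; the correlation is only a summary and does not replace looking at the plot.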
THANK YOU
