
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Simple Linear Regression

Dr. Mamatha H R
Department of Computer Science and Engineering
Dr. Karthiyayini
Department of Science and Humanities
Simple Linear Regression: Correlation & Regression Analysis
Regression Analysis

❖ Regression analysis is the study of a set of data in order to make the best possible estimate or prediction.

▪ For example: by studying data on how much you eat and how much you weigh, you can conclude that there is a relationship between the two.

▪ Regression analysis can help you quantify that relationship and predict how much you will weigh in 10 years' time if you continue to put on weight at the same rate.
Prediction of Floods / Droughts

Impact of global warming:

❖ Increase in rainfall, resulting in floods

❖ Increase in the amount of dry land, leading to droughts
Impact of Global Warming

Global warming in wet areas:
➔ Evaporation of water from land and sea
➔ More rainfall
➔ Increase in floods

Global warming in dry areas:
➔ Increase in evaporation of water from land, water surfaces and plants
➔ Dry areas become drier
➔ Increase in droughts

Impact of Global Warming: Graphs
[Figure: temperature plotted against ten-year ranges from 1881-1890 to 2011-2020]
Other Factors Influencing Floods
[Figure: causes of floods: global warming, geography of the area, urbanisation, deforestation, other human factors]
Regression Analysis

❖ In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables.

❖ It is a way of mathematically sorting out which of those variables do have an impact.

❖ Which factors matter most?

❖ Which can we ignore?

❖ How do the factors interact with each other?

❖ And most importantly, how certain are we about all these factors?
Regression Analysis – A Broad Classification

Regression models
▪ Simple regression models (one independent variable)
  ➔ Linear
  ➔ Non-linear
▪ Multiple regression models (several independent variables)
  ➔ Linear
  ➔ Non-linear
Some Inputs

❖ Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning.
❖ In some situations regression analysis can be used to infer causal relationships between the independent and dependent variables.
❖ The term "regression" was coined by Francis Galton in the nineteenth century to describe a biological phenomenon.
❖ The earliest form of regression analysis is linear regression, in which a researcher finds the line that most closely fits the data according to a specific mathematical criterion.
❖ When that criterion is least squares, the line is referred to as the least-squares line. The method of least squares was published by Legendre in 1805 and by Gauss in 1809.
The Least-Squares Line

❖ When two variables have a linear relationship, the points in the scatter plot tend to be clustered around a straight line.

❖ The straight line that fits these points most closely is referred to as the least-squares line.
The Least-Squares Line

❖ Consider the least-squares line given by

$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$

where

▪ $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

▪ $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
Example:
❖ The number of hours spent by students in preparing for an entrance exam and the marks scored (on a scale of 0 – 100) are given in the following table.

SL No.   No. of hours spent   Marks scored
1        6                    82
2        10                   88
3        2                    56
4        4                    64
5        6                    77
6        7                    92
7        0                    23
8        1                    41
9        8                    80
10       5                    59
11       3                    47

Using these values,
i. Estimate the marks scored by a student who has spent 2.35 hours.
ii. Predict the marks that a student can score if he/she invests 20 hours.
Computing the least-squares line

❖ We first need to obtain the least-squares line, which is given by

$y = \hat{\beta}_0 + \hat{\beta}_1 x$

▪ $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

▪ $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
Example:
SL No.   Hours spent ($x$)   Marks scored ($y$)   $x - \bar{x}$   $(x - \bar{x})^2$   $y - \bar{y}$   $(x - \bar{x})(y - \bar{y})$
1        6                   82                   1.27            1.6129              17.55           22.33
2        10                  88                   5.27            27.7729             23.55           124.15
3        2                   56                   -2.73           7.4529              -8.45           23.06
4        4                   64                   -0.73           0.5329              -0.45           0.33
5        6                   77                   1.27            1.6129              12.55           15.97
6        7                   92                   2.27            5.1529              27.55           62.60
7        0                   23                   -4.73           22.3729             -41.45          195.97
8        1                   41                   -3.73           13.9129             -23.45          87.42
9        8                   80                   3.27            10.6929             15.55           50.88
10       5                   59                   0.27            0.0729              -5.45           -1.49
11       3                   47                   -1.73           2.9929              -17.45          30.15
Mean     4.73                64.45
Sum                                                               94.1819                             611.37

From the table we have

$\bar{x} = 4.73$ ; $\bar{y} = 64.45$

▪ $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 611.37$

▪ $\sum_{i=1}^{n} (x_i - \bar{x})^2 = 94.1819$

▪ $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \dfrac{611.37}{94.1819} = 6.49$

▪ $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 64.45 - (6.49 \times 4.73) = 33.7523$

▪ The equation of the least-squares line is given by

$y = \hat{\beta}_0 + \hat{\beta}_1 x \Rightarrow y = 33.7523 + 6.49x$
Example:
▪ The equation of the least-squares line is given by
$y = 33.7523 + 6.49x$

i. To estimate the marks scored by a student who has spent 2.35 hours:

$y = 33.7523 + (6.49 \times 2.35) = 49.0038$

ii. To predict the marks that a student can score if he/she invests 20 hours:

$y = 33.7523 + (6.49 \times 20) = 163.5523$
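The hand computation in this example can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function name `fit_least_squares` is an assumption chosen for illustration. Because it carries full precision instead of the rounded values 4.73 and 6.49, its intercept comes out near 33.77 rather than 33.7523.

```python
def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Data from the example table
hours = [6, 10, 2, 4, 6, 7, 0, 1, 8, 5, 3]
marks = [82, 88, 56, 64, 77, 92, 23, 41, 80, 59, 47]

b0, b1 = fit_least_squares(hours, marks)
print(f"least-squares line: y = {b0:.4f} + {b1:.4f} x")
print(f"estimate at 2.35 hours: {b0 + b1 * 2.35:.2f}")
print(f"prediction at 20 hours: {b0 + b1 * 20:.2f}")
```

The prediction at 20 hours exceeds the maximum mark of 100, which already hints that the line should not be trusted far outside the observed range of hours.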

❖ How to compute the Least Squares Line

❖ Residuals and Errors

❖ Measuring Goodness of fit

How to compute the Least-Squares Line?
[Figure: scatter plot with two candidate lines $l_1$ and $l_2$]
Scenario #1: No Errors

Weight (lb) ($x$)   Length (in.) ($y$)
0.0                 5.02
0.2                 5.04
0.4                 5.06
0.6                 5.08
0.8                 5.10
1.0                 5.12
1.2                 5.14
1.4                 5.16
1.6                 5.18
1.8                 5.20
2.0                 5.22

[Figure: length (in.) ($y$) plotted against weight (lb) ($x$)]
Scenario #2: Measurement has Errors

Weight (lb) ($x$)   Length (in.) ($y$)   Weight (lb) ($x$)   Length (in.) ($y$)
0.0                 5.06                 2.0                 5.40
0.2                 5.01                 2.2                 5.57
0.4                 5.12                 2.4                 5.47
0.6                 5.13                 2.6                 5.53
0.8                 5.14                 2.8                 5.61
1.0                 5.16                 3.0                 5.59
1.2                 5.25                 3.2                 5.61
1.4                 5.19                 3.4                 5.75
1.6                 5.24                 3.6                 5.68
1.8                 5.46                 3.8                 5.80

[Figure: scatter plot of length (in.) ($y$) against weight (lb) ($x$)]
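Least-Squares Line:
[Figure: scatter plot of length (in.) ($y$) against weight (lb) ($x$) with the least-squares line]

NOTE: The least-squares line is defined to be the line for which the sum of squared residuals is minimum.

❖ That is, it is the line for which $\sum_{i=1}^{n} e_i^2$ is minimum, where the residual $e_i = y_i - \hat{y}_i$ is the vertical distance from the point $(x_i, y_i)$ to the line.

❖ Using some mathematical computations it can be shown that

$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$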
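The "mathematical computations" mentioned on this slide are not shown in the deck. The following is a sketch of the usual calculus argument, added for completeness; the symbol $S$ for the sum of squared residuals is an assumption introduced here.

```latex
% Sum of squared residuals as a function of the candidate intercept and slope
S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

% Set both partial derivatives to zero
\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0
\qquad
\frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0

% The first equation gives the intercept
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

% Substituting into the second equation gives
\sum_{i=1}^{n} x_i (y_i - \bar{y}) = \hat{\beta}_1 \sum_{i=1}^{n} x_i (x_i - \bar{x})

% Since \sum (y_i - \bar{y}) = 0 and \sum (x_i - \bar{x}) = 0, each x_i may be replaced by (x_i - \bar{x})
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
```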
Least-Squares Line: Summary
Scenario #1: If there is no measurement error, then the data points lie on the straight line $y = \beta_0 + \beta_1 x$, and the values of $\beta_0$ and $\beta_1$ can be obtained directly by calculating the intercept and the slope.
Scenario #2: If there is a measurement error $\varepsilon_i$, then
❖ the exact values of $\beta_0$ and $\beta_1$ cannot be determined;
❖ the values of $\beta_0$ and $\beta_1$ are estimated by computing the least-squares line.
❖ The least-squares line is given by $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$

where
▪ $\hat{\beta}_0$ → the $y$-intercept of the least-squares line
→ gives an estimate of $\beta_0$, the initial (unloaded) length of the spring.
▪ $\hat{\beta}_1$ → the slope of the least-squares line
→ gives an estimate of $\beta_1$, the actual value of the spring constant.
Computing formulas
Remark:

❖ $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}$

❖ $\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2$

❖ $\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n \bar{y}^2$

For computational purposes we use the equivalent formulas given on the right-hand side.
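As a quick sanity check of these identities, the sketch below evaluates both sides on the hours/marks data from the earlier example. It is an illustration only; the variable names are assumptions.

```python
# Hours/marks data from the earlier example
hours = [6, 10, 2, 4, 6, 7, 0, 1, 8, 5, 3]
marks = [82, 88, 56, 64, 77, 92, 23, 41, 80, 59, 47]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(marks) / n

# Left-hand sides: sums of deviations from the means
s_xy_def = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, marks))
s_xx_def = sum((x - x_bar) ** 2 for x in hours)
s_yy_def = sum((y - y_bar) ** 2 for y in marks)

# Right-hand sides: raw sums minus a correction term
s_xy_comp = sum(x * y for x, y in zip(hours, marks)) - n * x_bar * y_bar
s_xx_comp = sum(x * x for x in hours) - n * x_bar ** 2
s_yy_comp = sum(y * y for y in marks) - n * y_bar ** 2

print(s_xy_def, s_xy_comp)
print(s_xx_def, s_xx_comp)
print(s_yy_def, s_yy_comp)
```

The right-hand forms need only running totals, which is convenient by hand; in floating-point arithmetic they can lose accuracy when the mean is large relative to the spread, so the deviation forms are often preferred in code.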
Try This
Using the Hooke's law data given in the table,

i. Compute the least-squares estimates of the spring constant and the unloaded length of the spring.
ii. Write the equation of the least-squares line.
iii. Estimate the length of the spring under a load of 1.3 lb.
iv. Estimate the length of the spring under a load of 1.4 lb.
v. Obtain the residuals corresponding to all the points $(x_i, y_i)$.

Weight (lb) ($x$)   Length (in.) ($y$)   Weight (lb) ($x$)   Length (in.) ($y$)
0.0                 5.06                 2.0                 5.40
0.2                 5.01                 2.2                 5.57
0.4                 5.12                 2.4                 5.47
0.6                 5.13                 2.6                 5.53
0.8                 5.14                 2.8                 5.61
1.0                 5.16                 3.0                 5.59
1.2                 5.25                 3.2                 5.61
1.4                 5.19                 3.4                 5.75
1.6                 5.24                 3.6                 5.68
1.8                 5.46                 3.8                 5.80
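One way to check answers to this exercise is sketched below in Python. It is offered as a self-check under the slide formulas, not as an official solution; the helper name `fit_least_squares` is an assumption.

```python
def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hooke's law data from the table: loads 0.0, 0.2, ..., 3.8 lb
weights = [round(0.2 * k, 1) for k in range(20)]
lengths = [5.06, 5.01, 5.12, 5.13, 5.14, 5.16, 5.25, 5.19, 5.24, 5.46,
           5.40, 5.57, 5.47, 5.53, 5.61, 5.59, 5.61, 5.75, 5.68, 5.80]

b0, b1 = fit_least_squares(weights, lengths)

# Residual = observed length minus fitted length at the same load
residuals = [y - (b0 + b1 * x) for x, y in zip(weights, lengths)]

print(f"i.   spring constant: {b1:.4f} in./lb, unloaded length: {b0:.4f} in.")
print(f"ii.  y = {b0:.4f} + {b1:.4f} x")
print(f"iii. length at 1.3 lb: {b0 + b1 * 1.3:.4f} in.")
print(f"iv.  length at 1.4 lb: {b0 + b1 * 1.4:.4f} in.")
print("v.   residuals:", [round(e, 4) for e in residuals])
```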
Some Observations :

❖ The Estimates are not the same as true values

❖ The Residuals are not the same as the Errors.

❖ Don’t extrapolate outside the range of the data.

❖ Don’t use the Least Squares line when the data aren’t linear.
The Estimates are not the same as true values

True length and observed length versus weight

Weight (lb) ($x$)   True length (in.) ($y$)   Observed length (in.) ($y$)
0.0                 5.02                      5.06
0.2                 5.04                      5.01
0.4                 5.06                      5.12
0.6                 5.08                      5.13
0.8                 5.10                      5.14
1.0                 5.12                      5.16
1.2                 5.14                      5.25
1.4                 5.16                      5.19
1.6                 5.18                      5.24
1.8                 5.20                      5.46
2.0                 5.22                      5.40

[Figure: length ($y$) against weight ($x$), showing the true line $y = 0.1x + 5.02$ and the least-squares line $y = 0.1859x + 5.0105$ fitted to the observed lengths]
The Residuals are not the same as Errors
Weight/True length ; Weight/Observed Length
5.5
Weight (𝑙𝑏) Length (𝑖𝑛. ) Length (𝑖𝑛. )
(𝑥) (𝑦) (𝑦) 5.45

0.0 5.02 5.06 5.4

y = 0.1859x + 5.0105

0.2 5.04 5.01 5.35

0.4 5.06 5.12 5.3

0.6 5.08 5.13

Length(y)
5.25
y = 0.1x + 5.02
0.8 5.10 5.14 5.2

1.0 5.12 5.16 5.15

1.2 5.14 5.25 5.1

1.4 5.16 5.19 5.05

1.6 5.18 5.24 5

1.8 5.20 5.46 4.95

0 0.5 1 1.5 2 2.5
Weight (X)
2.0 5.22 5.40
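The distinction can be made concrete with the eleven true and observed lengths from this slide. In the sketch below (an illustration, with assumed variable names), the errors are measured from the true line $y = 0.1x + 5.02$, while the residuals are measured from the fitted least-squares line.

```python
weights = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
true_lengths = [5.02, 5.04, 5.06, 5.08, 5.10, 5.12, 5.14, 5.16, 5.18, 5.20, 5.22]
observed = [5.06, 5.01, 5.12, 5.13, 5.14, 5.16, 5.25, 5.19, 5.24, 5.46, 5.40]

n = len(weights)
x_bar = sum(weights) / n
y_bar = sum(observed) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weights, observed))
s_xx = sum((x - x_bar) ** 2 for x in weights)
b1 = s_xy / s_xx            # estimated slope
b0 = y_bar - b1 * x_bar     # estimated intercept

# Errors: observed minus TRUE value (unknown in practice)
errors = [y - t for y, t in zip(observed, true_lengths)]
# Residuals: observed minus FITTED value (always computable)
residuals = [y - (b0 + b1 * x) for x, y in zip(weights, observed)]

print(f"fitted line: y = {b1:.4f}x + {b0:.4f}")
print(f"sum of errors:    {sum(errors):.4f}")
print(f"sum of residuals: {sum(residuals):.4f}")
```

The residuals always sum to zero for a least-squares line with an intercept, whereas the errors need not, which is one simple way to see that the two quantities differ.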
Don't Extrapolate outside the range of the data!!
❖ The number of hours spent by students in preparing for an entrance exam and the marks scored (on a scale of 0 – 100) are given in the following table.

SL No.   No. of hours spent   Marks scored
1        6                    82
2        10                   88
3        2                    56
4        4                    64
5        6                    77
6        7                    92
7        0                    23
8        1                    41
9        8                    80
10       5                    59
11       3                    47

Using these values,
i. Estimate the marks scored by a student who has spent 2.35 hours.
ii. Predict the marks that a student can score if he/she invests 20 hours.

The observed hours range from 0 to 10. A value of 2.35 hours lies inside this range, but 20 hours lies far outside it, and the least-squares line $y = 33.7523 + 6.49x$ then predicts 163.5523 marks on a scale that stops at 100.
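A simple guard against accidental extrapolation is to report, alongside each prediction, whether the input lies inside the observed range. The sketch below uses the rounded coefficients from the worked example; the function name `predict_marks` and its return convention are assumptions for illustration.

```python
hours = [6, 10, 2, 4, 6, 7, 0, 1, 8, 5, 3]

# Rounded coefficients of the least-squares line from the worked example
B0, B1 = 33.7523, 6.49


def predict_marks(x, observed=hours):
    """Return (prediction, inside_range) for x hours of preparation."""
    lo, hi = min(observed), max(observed)
    inside_range = lo <= x <= hi
    return B0 + B1 * x, inside_range


for x in (2.35, 20):
    y_hat, ok = predict_marks(x)
    note = "within observed range" if ok else "EXTRAPOLATION: outside observed range"
    print(f"x = {x}: predicted marks = {y_hat:.4f} ({note})")
```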
Don't Extrapolate outside the range of the data!!

Weight (lb) ($x$)   Length (in.) ($y$)   Weight (lb) ($x$)   Length (in.) ($y$)
0.0                 5.06                 2.0                 5.40
0.2                 5.01                 2.2                 5.57
0.4                 5.12                 2.4                 5.47
0.6                 5.13                 2.6                 5.53
0.8                 5.14                 2.8                 5.61
1.0                 5.16                 3.0                 5.59
1.2                 5.25                 3.2                 5.61
1.4                 5.19                 3.4                 5.75
1.6                 5.24                 3.6                 5.68
1.8                 5.46                 3.8                 5.80
Don't use the Least-Squares Line when the data aren't linear

[Figure: scatter plot of projectile motion]

Note: In some cases the least-squares line can be used for non-linear data, but only after a variable transformation is applied.
Measuring goodness of fit
❖ A goodness-of-fit statistic is a quantity that measures how well a model explains a given set of data.
❖ A linear model fits well if there is a strong linear relationship between the variables involved.
❖ The strength of a linear relationship can be measured by considering
$\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
❖ This quantity is itself a goodness-of-fit statistic.
❖ Its drawback is that it carries units (the squared units of $y$), so it cannot be used to compare the goodness of fit of models fitted to data sets measured in different units.
❖ Hence we use the ratio
$r^2 = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
which is the square of the correlation coefficient $r$.
❖ This is also referred to as the coefficient of determination.
[Figure: visualisation of $r^2$]
Some special terminologies!

❖ $r^2 = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$

❖ $\sum_{i=1}^{n} (y_i - \bar{y})^2$ : Total sum of squares

❖ $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ : Error sum of squares

❖ $\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ : Regression sum of squares

❖ Therefore, Total sum of squares = Regression sum of squares + Error sum of squares

❖ And, $r^2 = \dfrac{\text{Regression sum of squares}}{\text{Total sum of squares}}$

❖ $r^2$ is also referred to as the proportion of the variance in $y$ explained by regression.
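These quantities can be computed directly for the hours/marks example. The sketch below is an illustration with assumed variable names; it also checks that the ratio agrees with the square of the correlation coefficient.

```python
import math

hours = [6, 10, 2, 4, 6, 7, 0, 1, 8, 5, 3]
marks = [82, 88, 56, 64, 77, 92, 23, 41, 80, 59, 47]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(marks) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, marks))
s_xx = sum((x - x_bar) ** 2 for x in hours)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in hours]

total_ss = sum((y - y_bar) ** 2 for y in marks)              # total sum of squares
error_ss = sum((y - f) ** 2 for y, f in zip(marks, fitted))  # error sum of squares
regression_ss = total_ss - error_ss                          # regression sum of squares

r_squared = regression_ss / total_ss

# Correlation coefficient computed independently
r = s_xy / math.sqrt(s_xx * total_ss)

print(f"total SS = {total_ss:.2f}, regression SS = {regression_ss:.2f}, error SS = {error_ss:.2f}")
print(f"r^2 = {r_squared:.4f}, r = {r:.4f}")
```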
More about $r^2$

❖ It is a quantity that indicates how well a statistical model fits a data set. In other words, it is a statistical measure of how close the observed data are to the fitted regression line.

❖ It gives the proportion of the variation in the dependent variable $y$ that is accounted for by variation in the independent variable $x$.

❖ It indicates how much confidence can be placed in forecasts or predictions made from the fitted line.

❖ Its value lies between 0 and 1.

❖ The higher the value of $r^2$, the better the fit.

Uncertainties in the Least-Squares Coefficients
• The errors $\varepsilon_i$ create uncertainty in the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$.
• It is intuitively clear that if the $\varepsilon_i$ tend to be small in magnitude, the points will be tightly clustered around the line, and the uncertainty in the least-squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ will be small.
• On the other hand, if the $\varepsilon_i$ tend to be large in magnitude, the points will be widely scattered around the line, and the uncertainties (standard deviations) in the least-squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ will be larger.
• Assume we have $n$ data points $(x_1, y_1), \ldots, (x_n, y_n)$, and we plan to fit the least-squares line.
• In order for the estimates $\hat{\beta}_1$ and $\hat{\beta}_0$ to be useful, we need to estimate just how large their uncertainties are. In order to do this, we need to know something about the nature of the errors $\varepsilon_i$.
• We will begin by studying the simplest situation, in which four important assumptions are satisfied.
Assumptions for Errors in Linear Models:

In the simplest situation, the following assumptions are satisfied:
1. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are random and independent. In particular, the magnitude of any error $\varepsilon_i$ does not influence the value of the next error $\varepsilon_{i+1}$.
2. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have mean 0.
3. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have the same variance, which we denote by $\sigma^2$.
4. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are normally distributed.

• When the sample size is large, the normality assumption (4) becomes less important.
• Mild violations of the assumption of constant variance (3) do not matter too much, but severe violations should be corrected.
• Under these assumptions, the effect of the $\varepsilon_i$ is largely governed by the magnitude of the variance $\sigma^2$, since it is this variance that determines how large the errors are likely to be.
• Therefore, in order to estimate the uncertainties in $\hat{\beta}_0$ and $\hat{\beta}_1$, we must first estimate the error variance $\sigma^2$.
• Since the magnitude of the variance is reflected in the degree of spread of the points around the least-squares line, it follows that by measuring this spread, we can estimate the variance.
• Specifically, the vertical distance from each data point $(x_i, y_i)$ to the least-squares line is given by the residual $e_i$.
• The spread of the points around the line can be measured by the sum of the squared residuals $\sum_{i=1}^{n} e_i^2$.
• The estimate of the error variance $\sigma^2$ is the quantity $s^2$ given by
$s^2 = \dfrac{\sum_{i=1}^{n} e_i^2}{n - 2} = \dfrac{(1 - r^2) \sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 2}$
Distribution

In the linear model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, under assumptions 1 through 4, the observations $y_1, \ldots, y_n$ are independent random variables that follow the normal distribution. The mean and variance of $y_i$ are given by

$\mu_{y_i} = \beta_0 + \beta_1 x_i$

$\sigma^2_{y_i} = \sigma^2$

The slope represents the change in the mean of $y$ associated with an increase of one unit in the value of $x$.
More Distributions
Under assumptions 1 – 4:
• The quantities $\hat{\beta}_0$ and $\hat{\beta}_1$ are normally distributed random variables.

• The means of $\hat{\beta}_0$ and $\hat{\beta}_1$ are the true values $\beta_0$ and $\beta_1$, respectively.
More Distributions (cont.)
• The standard deviations of $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimated with

$s_{\hat{\beta}_0} = s \sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$ and $s_{\hat{\beta}_1} = \dfrac{s}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$

where $s = \sqrt{\dfrac{(1 - r^2) \sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 2}}$ is an estimate of the error standard deviation $\sigma$.
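To see these formulas in action, the sketch below evaluates them for the 20-point Hooke's law data set used earlier in the deck. It is an illustration only, with assumed variable names; it also cross-checks $s$ against the residual-based form $\sqrt{\sum e_i^2 / (n-2)}$.

```python
import math

# Hooke's law data: loads 0.0, 0.2, ..., 3.8 lb and measured lengths in inches
weights = [round(0.2 * k, 1) for k in range(20)]
lengths = [5.06, 5.01, 5.12, 5.13, 5.14, 5.16, 5.25, 5.19, 5.24, 5.46,
           5.40, 5.57, 5.47, 5.53, 5.61, 5.59, 5.61, 5.75, 5.68, 5.80]

n = len(weights)
x_bar = sum(weights) / n
y_bar = sum(lengths) / n
s_xx = sum((x - x_bar) ** 2 for x in weights)
s_yy = sum((y - y_bar) ** 2 for y in lengths)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weights, lengths))

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
r2 = s_xy ** 2 / (s_xx * s_yy)

# Estimate of the error standard deviation (formula from the slide)
s = math.sqrt((1 - r2) * s_yy / (n - 2))

# Estimated standard deviations of the intercept and slope
s_b0 = s * math.sqrt(1 / n + x_bar ** 2 / s_xx)
s_b1 = s / math.sqrt(s_xx)

# Cross-check: the same s computed from the residuals
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(weights, lengths))
s_from_residuals = math.sqrt(sse / (n - 2))

print(f"s = {s:.4f}, s_b0 = {s_b0:.4f}, s_b1 = {s_b1:.4f}")
```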

Notes
1. Since the quantity $\sum_{i=1}^{n} (x_i - \bar{x})^2$ appears in the denominators of $s_{\hat{\beta}_0}$ and $s_{\hat{\beta}_1}$, it follows that the more spread out the $x$'s are, the smaller the uncertainties in $\hat{\beta}_0$ and $\hat{\beta}_1$ will be.

2. Use caution: if the range of $x$ values extends beyond the range where the linear model holds, the results will not be valid.
3. The quantities $(\hat{\beta}_0 - \beta_0) / s_{\hat{\beta}_0}$ and $(\hat{\beta}_1 - \beta_1) / s_{\hat{\beta}_1}$ have Student's t distribution with $n - 2$ degrees of freedom.
Checking Assumptions and Transforming Data

• We stated some assumptions for the errors. Here we want to see if any of those assumptions are violated.

• The single best diagnostic for least-squares regression is a plot of residuals versus the fitted values, sometimes called a residual plot.
More of the Residual Plot

• When the linear model is valid, and assumptions 1 – 4 are satisfied, the plot will
show no substantial pattern. There should be no curve to the plot, and the
vertical spread of the points should not vary too much over the horizontal range
of the data.

• A good-looking residual plot does not by itself prove that the linear model is
appropriate. However, a residual plot with a serious defect does clearly indicate
that the linear model is inappropriate.
Residual Plots

A: No noticeable pattern
B: Heteroscedastic
C: Trend
D: Outlier
Checking Assumptions to form a Linear Model
• Example of a residual plot: on the left is the scatter plot of $y$ versus $x$; on the right is the plot of the residuals versus the fitted values of $y$.
• A bit of terminology:
• If the vertical spread does not vary with the fitted value, we call the residual plot homoscedastic. Otherwise we call the plot heteroscedastic.
Checking Assumptions to form a Linear Model
• Below on the left the plot is homoscedastic, while on the
right the spread increases with the fitted value and is thus
heteroscedastic.
Homoscedasticity or Heteroscedasticity? The way forward...

• If the residual plot is homoscedastic, and shows no substantial trend or curve, then a linear model can reasonably be fitted to the data plotted.
• If the residual plot is heteroscedastic, or shows a substantial trend or curve, then the assumptions for a linear model do NOT hold. In such cases we need to transform the data or pursue other methods.
Transforming the Variables
• If we fit the linear model $y = \beta_0 + \beta_1 x + \varepsilon$ and find that the residual plot exhibits a trend or pattern, we can sometimes fix the problem by raising $x$, $y$, or both to a power.

• It may be the case that a model of the form $y^a = \beta_0 + \beta_1 x^b + \varepsilon$ fits the data well.

• Replacing a variable with a function of itself is called transforming the variable. Specifically, raising a variable to a power is called a power transformation.
Which transformation to apply?
It is possible with experience to look at a scatterplot, or a residual plot, and
make an educated guess as to how to transform the variables.
Mathematical methods are also available to determine a good
transformation.
Trial and Error is fine – Try various powers on both x and y (including
ln x and ln y), look at the residual plots, and hope to find a homoscedastic
one with no discernible pattern.
More advanced discussion in Draper and Smith (1998).
Which transformation to apply?

Recall the earlier example of a scatter plot (O3 concentration vs NOx concentration) whose residual plot is heteroscedastic. Linear model NOT GOOD! Also notice the outlier with ozone concentration nearly 100.
[Figure: scatter plot (left) and residual plot (right)]
Logarithm Transformation on One Axis

Applying the logarithm to the y-axis variable (O3 concentration), we obtain the following scatter plot, with its residual plot on the right. Linear model looks GOOD! The outlier is less prominent too!
Logarithm Transformation on One Axis
Now consider the example below, where the plot on the left is Production (ft³/ft) vs Fracture fluid (gal/ft) and the residual plot is largely heteroscedastic! Not good for a linear model.
Logarithm Transformation on One Axis
Below is a plot of ln (production) vs ln (fracture fluid) for the same
data. This time the residual plot is homoscedastic, good for linear
model!
Power Transformations – The reciprocal
Below (left side) is a plot of Rockwell (B scale) hardness of welds
versus their Ogden-Jaffe number. The residual plot (right side)
shows a pattern where negative residual is observed for the
extreme fitted values and positive residual for the ones in the
middle. Linear model NOT OK.
Power Transformations – The reciprocal
We plot the graph of Rockwell hardness vs (Ogden-Jaffe)$^{-1}$ for the same data (below, left side) and find that the residual plot (below, right side) is homoscedastic, having no discernible pattern.
Linear model is OK.
Power Transformations – Positive Powers
Plot of $y$ vs $x$ and its residual plot, which exhibits a discernible pattern.
Linear model is NOT OK.

x y x y
1 2.2 11 31.5
2 9 12 32.7
3 13.5 13 34.9
4 17 14 36.3
5 20.5 15 37.7
6 23.3 16 38.7
7 25.2 17 40
8 26.4 18 41.3
9 27.6 19 42.5
10 30.2 20 43.7
Power Transformations – Positive Powers
Plot of $y^2$ vs $x$ and its homoscedastic residual plot, which exhibits no discernible pattern.
Linear model is OK.
x $y^2$ x $y^2$
1 4.84 11 992.25
2 81 12 1069.29
3 182.25 13 1218.01
4 289 14 1317.69
5 420.25 15 1421.29
6 542.89 16 1497.69
7 635.04 17 1600
8 696.96 18 1705.69
9 761.76 19 1806.25
10 912.04 20 1909.69
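The effect of this transformation can be reproduced numerically from the tabulated data. The sketch below fits a line to $y$ versus $x$ and to $y^2$ versus $x$, then compares the residuals and $r^2$; it is an illustration, and the helper names `fit_least_squares` and `residuals_and_r2` are assumptions.

```python
def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def residuals_and_r2(xs, ys):
    """Fit a line and return (residuals, r squared)."""
    b0, b1 = fit_least_squares(xs, ys)
    res = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    y_bar = sum(ys) / len(ys)
    total_ss = sum((y - y_bar) ** 2 for y in ys)
    error_ss = sum(e * e for e in res)
    return res, 1 - error_ss / total_ss


# Data from the slide
xs = list(range(1, 21))
ys = [2.2, 9, 13.5, 17, 20.5, 23.3, 25.2, 26.4, 27.6, 30.2,
      31.5, 32.7, 34.9, 36.3, 37.7, 38.7, 40, 41.3, 42.5, 43.7]

res_y, r2_y = residuals_and_r2(xs, ys)                      # y versus x
res_y2, r2_y2 = residuals_and_r2(xs, [y * y for y in ys])   # y^2 versus x

print("y vs x:   r^2 =", round(r2_y, 4))
print("  residuals at x = 1, 10, 20:", [round(res_y[i], 2) for i in (0, 9, 19)])
print("y^2 vs x: r^2 =", round(r2_y2, 4))
print("  residuals at x = 1, 10, 20:", [round(res_y2[i], 2) for i in (0, 9, 19)])
```

For the untransformed fit the residuals are negative at both ends and positive in the middle, which is the curved pattern the residual plot reveals. A printed summary like this complements the residual plot but does not replace it.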
Transformations – Do they always work?

It is important to remember that power transformations don't always work.

Sometimes, none of the residual plots looks good, no matter what transformations are tried. In these cases, other methods should be used. One of these is multiple regression, which is not covered here.

Some other methods are briefly mentioned in the next slide.

Alternatives to Transformations
The popular methods other than transformation are:

• Weighted Least Squares
➔ We assign greater weights to points in regions where the vertical spread is smaller and vice versa.

• Multiple Regression
➔ We add more independent variables in order to explain the variation in the dependent variable.
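The idea behind weighted least squares can be sketched for a single independent variable as follows. This is a minimal illustration, not a method given in the slides: the data, the spread values, the choice of weights as the reciprocal of the squared spread, and the function name `fit_weighted_least_squares` are all assumptions.

```python
def fit_weighted_least_squares(xs, ys, ws):
    """Return (intercept, slope) minimising sum(w * (y - b0 - b1 * x) ** 2)."""
    w_sum = sum(ws)
    # Weighted means replace the ordinary means
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    s_xy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    s_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical data in which the vertical spread grows with x
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.7, 10.9, 11.0, 15.6, 14.1]
spread = [0.2, 0.2, 0.4, 0.4, 0.8, 0.8, 1.6, 1.6]  # assumed error standard deviations

# Smaller spread -> larger weight (assumed weighting: 1 / spread^2)
weights = [1 / s ** 2 for s in spread]

b0_w, b1_w = fit_weighted_least_squares(xs, ys, weights)
b0_u, b1_u = fit_weighted_least_squares(xs, ys, [1] * len(xs))  # equal weights: ordinary least squares

print(f"weighted:   y = {b0_w:.3f} + {b1_w:.3f} x")
print(f"unweighted: y = {b0_u:.3f} + {b1_u:.3f} x")
```

With equal weights the function reduces to the ordinary least-squares formulas, so the two fits differ only through the weighting.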

How Many Points Make a Reliable Residual Plot?

When there are too few points on the residual plot, then…

➢ … it may appear to have a pattern or be heteroscedastic, even though that is just a visual effect created by one or two points.

➢ … detecting outliers may become difficult.

What to do if you can't interpret a residual plot reliably?

You can start by fitting a linear model but declare your result tentative; wait for more data, and then a reliable decision can be made.
How Many Points Make a Reliable Residual Plot?

NOT all residual plots with few points turn out to be hard to interpret.

Some of these show a pattern which cannot be changed by relocating just one
or two points.

In such a case a linear model should NOT be used!

Outliers

• Outliers are points that are detached from the bulk of the data.

• Both the scatter plot and the residual plot should be examined for outliers.

• The first thing to do with an outlier is to determine why it is different from the rest of the points.
Outliers

• Sometimes outliers are caused by data-recording errors or equipment malfunction. In this case, the outlier can be deleted from the data set, and you may present results that do not include the outlier.

• If it cannot be determined why there is an outlier, then it is not wise to delete it. Here the results presented should be the ones from the analysis with the outlier included in the data set.
Influential Point

• If there are outliers that cannot be removed from the data set, then the best thing to do is to fit a line to the whole data set, then remove each outlier in turn and fit a line to the remaining data.

• If none of the outliers, upon removal, makes a noticeable difference to the least-squares line or to the estimated standard deviations of the slope and intercept, then use the fit with the outliers included.
Influential Point

• If one or more outliers do make a difference, then the range of values for the least-squares coefficients should be reported. Avoid computing confidence and prediction intervals and performing hypothesis tests.

• An outlier that makes a considerable difference to the least-squares line when removed is called an influential point.
• In general, outliers with unusual $x$ values are more likely to be influential than those with unusual $y$ values, but every outlier should be checked.

• Some authors restrict the definition of outliers to points that have unusually large residuals.
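The fit-with and fit-without procedure described on these slides can be sketched as a leave-one-out loop. The data below are hypothetical and chosen so that the last point has an unusual $x$ value; the helper name `fit_least_squares` is an assumption.

```python
def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical data: six points close to y = 2x, plus one point with an unusual x value
xs = [1, 2, 3, 4, 5, 6, 15]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 14.0]

b0_all, b1_all = fit_least_squares(xs, ys)

# Refit with each point removed in turn and record how much the slope changes
slope_changes = []
for i in range(len(xs)):
    xs_without = xs[:i] + xs[i + 1:]
    ys_without = ys[:i] + ys[i + 1:]
    _, b1_without = fit_least_squares(xs_without, ys_without)
    slope_changes.append(abs(b1_without - b1_all))

most_influential = max(range(len(xs)), key=lambda i: slope_changes[i])

print(f"slope with all points: {b1_all:.4f}")
for i, change in enumerate(slope_changes):
    print(f"  removing point ({xs[i]}, {ys[i]}): slope changes by {change:.4f}")
print(f"most influential point: ({xs[most_influential]}, {ys[most_influential]})")
```

In this made-up data set, removing the point at $x = 15$ changes the slope far more than removing any other point, which is what marks it as influential.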
Comments

• Transforming the variables is not the only method for analyzing data when the residual plot indicates a problem.

• There is a technique called weighted least squares regression. The effect is to make the points whose error variance is smaller have greater influence in the computation of the least-squares line.

• When the residual plot shows a trend, this sometimes indicates that more than one independent variable is needed to explain the variation in the dependent variable.
Comments

• If the relationship is nonlinear, then a method called nonlinear regression can be applied.

• If the plot of residuals versus fitted values looks good, it may be advisable to perform additional diagnostics to further check the fit of the linear model. A time series plot is used to see if time should be included in the model. A normal probability plot can be used to check the normality assumption.
Independence of Observations

• If the plot of residuals versus fitted values looks good, then further diagnostics may be used to check the fit of the linear model.

• One such diagnostic is a time order plot: a plot of the residuals versus the order in which the observations were made.

• If there are trends in this plot, then $x$ and $y$ may be varying with time. In this case, adding a time term to the model as an additional independent variable may be appropriate.
Normality Assumption

• To check that the errors are normally distributed, a normal probability plot of the residuals can be made.

• If the plot looks like it follows a rough straight line, then we can conclude that the residuals are approximately normally distributed.
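The coordinates of a normal probability plot can be computed without any plotting library. The sketch below uses the residuals of the 20-point Hooke's law fit; the plotting positions $(i - 0.5)/n$ are one common convention and are an assumption here, as are the variable names. The correlation between the ordered residuals and the normal quantiles gives a rough numerical measure of how straight the plot is.

```python
import math
from statistics import NormalDist

# Hooke's law data: loads 0.0, 0.2, ..., 3.8 lb and measured lengths in inches
weights = [round(0.2 * k, 1) for k in range(20)]
lengths = [5.06, 5.01, 5.12, 5.13, 5.14, 5.16, 5.25, 5.19, 5.24, 5.46,
           5.40, 5.57, 5.47, 5.53, 5.61, 5.59, 5.61, 5.75, 5.68, 5.80]

n = len(weights)
x_bar = sum(weights) / n
y_bar = sum(lengths) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(weights, lengths))
      / sum((x - x_bar) ** 2 for x in weights))
b0 = y_bar - b1 * x_bar
residuals = [y - (b0 + b1 * x) for x, y in zip(weights, lengths)]

# Ordered residuals paired with standard normal quantiles
ordered = sorted(residuals)
quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def pearson(a, b):
    """Sample correlation coefficient between two equal-length lists."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    s_ab = sum((u - a_bar) * (v - b_bar) for u, v in zip(a, b))
    s_aa = sum((u - a_bar) ** 2 for u in a)
    s_bb = sum((v - b_bar) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


straightness = pearson(quantiles, ordered)

for q, e in zip(quantiles, ordered):
    print(f"{q:8.3f}  {e:8.4f}")
print(f"correlation between quantiles and ordered residuals: {straightness:.3f}")
```

Plotting `ordered` against `quantiles` gives the normal probability plot itself; a correlation close to 1 corresponds to points lying near a straight line.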
THANK YOU

Faculty of Eng. & Natural Sci.: Instructor
2 pages
Chapter 6sol
No ratings yet
Chapter 6sol
15 pages
L7 CurveFitting (LeastSquaresRegression)
No ratings yet
L7 CurveFitting (LeastSquaresRegression)
45 pages
Ch17 Curve Fitting
No ratings yet
Ch17 Curve Fitting
44 pages
Portion
No ratings yet
Portion
21 pages
Statistics 2 For Chemical Engineering: Department of Mathematics and Computer Science
No ratings yet
Statistics 2 For Chemical Engineering: Department of Mathematics and Computer Science
37 pages
Chapter-4 Curve Fitting PDF
No ratings yet
Chapter-4 Curve Fitting PDF
17 pages
Matlab Manual
No ratings yet
Matlab Manual
109 pages
AIML Lab
No ratings yet
AIML Lab
48 pages
Final Exam Review Spring 2014 111A
No ratings yet
Final Exam Review Spring 2014 111A
75 pages
Computer Modeling Techniques
No ratings yet
Computer Modeling Techniques
8 pages
Revised - Course Policy - Linear Algebra and Differential Equations - Sem II
No ratings yet
Revised - Course Policy - Linear Algebra and Differential Equations - Sem II
26 pages
Data Science For Civil Engineering Unit 2 Notes
No ratings yet
Data Science For Civil Engineering Unit 2 Notes
22 pages
UG Structure Syllabus 20.18 Batch 2
No ratings yet
UG Structure Syllabus 20.18 Batch 2
1 page
Lab Mariano2023 en
No ratings yet
Lab Mariano2023 en
64 pages
NM Presentation
No ratings yet
NM Presentation
16 pages
Least Square Regression
No ratings yet
Least Square Regression
13 pages
Lecture 1
No ratings yet
Lecture 1
31 pages
Syllabus - ML
No ratings yet
Syllabus - ML
9 pages
Programming Basics and AI Lecture
No ratings yet
Programming Basics and AI Lecture
270 pages
UECM2623
No ratings yet
UECM2623
3 pages
Mathematical Methods for Physicists and Engineers: Second Corrected Edition
From Everand
Mathematical Methods for Physicists and Engineers: Second Corrected Edition
Royal Eugene Collins
No ratings yet
Introductory Laplace Transform with Applications
From Everand
Introductory Laplace Transform with Applications
Dalpatadu
5/5 (1)
Form Supplier Registration Form - GDP
No ratings yet
Form Supplier Registration Form - GDP
6 pages
Internal Analysis: Resources, Capabilities, and Core Competencies
No ratings yet
Internal Analysis: Resources, Capabilities, and Core Competencies
59 pages
Automatic Drawing Machine
No ratings yet
Automatic Drawing Machine
2 pages
Prelim Intro To Multimedia Chap 1
No ratings yet
Prelim Intro To Multimedia Chap 1
38 pages
Extra Worksheets 1st Year
No ratings yet
Extra Worksheets 1st Year
41 pages
JAI MAHAKAAL! GOC Kohinoor Drive [Private] - _JAI MAHAKAAL! GOC Kohinoor Drive_[Educative.io] System Design_Grokking the System Design Interview_Course Contents_2.Glossary of System Design Basics_
No ratings yet
JAI MAHAKAAL! GOC Kohinoor Drive [Private] - _JAI MAHAKAAL! GOC Kohinoor Drive_[Educative.io] System Design_Grokking the System Design Interview_Course Contents_2.Glossary of System Design Basics_
139 pages
Aiml Micro Project DBMS
No ratings yet
Aiml Micro Project DBMS
14 pages
Connecting Python With SQL Database
No ratings yet
Connecting Python With SQL Database
21 pages
Osy Question Bank
No ratings yet
Osy Question Bank
8 pages
Python PYQ
No ratings yet
Python PYQ
10 pages
Nasscom Mlops Playbook 2022
No ratings yet
Nasscom Mlops Playbook 2022
55 pages
2023 R Programming Apr May (AICTE)
No ratings yet
2023 R Programming Apr May (AICTE)
3 pages
Sap HCM Payroll User Guide
100% (3)
Sap HCM Payroll User Guide
126 pages
8th STD - Maths - Qtrly Exam - Sep 2021 - 22 Online - 20.09.2021
No ratings yet
8th STD - Maths - Qtrly Exam - Sep 2021 - 22 Online - 20.09.2021
2 pages
SaratSasikumar v1701
No ratings yet
SaratSasikumar v1701
5 pages
Fixlog
No ratings yet
Fixlog
108 pages
Sap Powerdesigner: Object-Oriented Model Report
No ratings yet
Sap Powerdesigner: Object-Oriented Model Report
13 pages
TK Series Magnet GPS Tracker USER MANUAL
No ratings yet
TK Series Magnet GPS Tracker USER MANUAL
26 pages
Project Management Life Cycle
50% (2)
Project Management Life Cycle
5 pages
Zedboard Ubuntu
No ratings yet
Zedboard Ubuntu
11 pages
Least Learned Competencies
No ratings yet
Least Learned Competencies
2 pages
Top 58 MySql Interview Questions (2023) - Javatpoint
No ratings yet
Top 58 MySql Interview Questions (2023) - Javatpoint
37 pages
Project Management Book1
100% (1)
Project Management Book1
25 pages
Az1084s PDF
No ratings yet
Az1084s PDF
17 pages
Date Narration Chq./Ref - No. Value DT Withdrawal Amt. Deposit Amt. Closing Balance
No ratings yet
Date Narration Chq./Ref - No. Value DT Withdrawal Amt. Deposit Amt. Closing Balance
67 pages
GRES Integrated Energy Storage Systgem User Manual V1.01-EN
No ratings yet
GRES Integrated Energy Storage Systgem User Manual V1.01-EN
112 pages
Computer Science To The Point Computer Science For Life Sciences Students and Other Noncomputer Scientists Boris Tolg Instant Download
No ratings yet
Computer Science To The Point Computer Science For Life Sciences Students and Other Noncomputer Scientists Boris Tolg Instant Download
82 pages
Capstone Portfolio Template
100% (1)
Capstone Portfolio Template
4 pages