Ue23ma242a 20241029143447
Simple Linear Regression
Dr. Mamatha H R
Department of Computer Science and Engineering
Dr. Karthiyayini
Department of Science and Humanities
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Simple Linear Regression: Correlation &
Regression Analysis
Regression Analysis
▪ Regression analysis can help you quantify a relationship between variables
and predict, for example, how much you will weigh in 10 years' time if you
continue to put on weight at the same rate.
Prediction of Floods / Droughts
[Figure: chart of values by decade, 1881-1890 through 2011-2020, with plotted values between about 13.5 and 15; axis titles not recoverable]
Other factors influencing Floods !!!
[Diagram: Floods, influenced by Urbanisation, Deforestation, and Other Human Factors]
Regression Analysis
❖It is a way of mathematically sorting out which of those variables indeed have
an impact
❖And most importantly, how certain are we about all these factors?
Regression Analysis – A Broad Classification
$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
where,
▪ $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
▪ $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
Example :
❖ The number of hours spent by students in preparing for an entrance exam and
the marks scored (on a scale of 0 – 100) are provided in the following table.

SL No.   No. of hours spent   Marks scored
1        6                    82
2        10                   88
3        2                    56
4        4                    64
5        6                    77
6        7                    92
7        0                    23
8        1                    41
9        8                    80
10       5                    59
11       3                    47

Using these values,
i. Estimate the marks scored by a student who has spent 2.35 hours.
ii. Predict the marks that a student can score if he/she invests 20 hours.
Computing the least squares line
❖We need to first obtain the least square line which is given by,
$y = \hat{\beta}_0 + \hat{\beta}_1 x$
▪ $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
▪ $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
Example :
SL No.   Hours spent (x)   Marks scored (y)   x − x̄    (x − x̄)²   y − ȳ     (x − x̄)(y − ȳ)
1        6                 82                 1.27     1.6129     17.55     22.33
2        10                88                 5.27     27.7729    23.55     124.15
3        2                 56                 -2.73    7.4529     -8.45     23.06
4        4                 64                 -0.73    0.5329     -0.45     0.33
5        6                 77                 1.27     1.6129     12.55     15.97
6        7                 92                 2.27     5.1529     27.55     62.60
7        0                 23                 -4.73    22.3729    -41.45    195.97
8        1                 41                 -3.73    13.9129    -23.45    87.42
9        8                 80                 3.27     10.6929    15.55     50.88
10       5                 59                 0.27     0.0729     -5.45     -1.49
11       3                 47                 -1.73    2.9929     -17.45    30.15
         x̄ = 4.73          ȳ = 64.45                   Σ = 94.1819          Σ = 611.37

From the table we have,
$\bar{x} = 4.73$ ; $\bar{y} = 64.45$
▪ $\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = 611.37$ ; $\sum_{i=1}^{n}(x_i - \bar{x})^2 = 94.1819$
▪ $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = 611.37 / 94.1819 = 6.49$
▪ $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 64.45 - (6.49 \times 4.73) = 33.7523$
i. For 2.35 hours: $y = 33.7523 + (6.49 \times 2.35) = 49.0038$
ii. For 20 hours: $y = 33.7523 + (6.49 \times 20) = 163.5523$
This value exceeds the 100-mark scale because x = 20 lies far outside the observed range of 0 to 10 hours; the prediction is an extrapolation and is not reliable.
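The hand computation above can be checked with a short script. This is a minimal sketch in plain Python; the function names `least_squares_line` and `predict` are illustrative choices, not part of the slides. Because it keeps full precision instead of rounding the means to two decimals, its results can differ slightly from a hand computation.

```python
# Hours studied (x) and marks scored (y), copied from the example table.
hours = [6, 10, 2, 4, 6, 7, 0, 1, 8, 5, 3]
marks = [82, 88, 56, 64, 77, 92, 23, 41, 80, 59, 47]

def least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

b0, b1 = least_squares_line(hours, marks)

def predict(x_new):
    """Value of the fitted line at x_new."""
    return b0 + b1 * x_new

print(f"slope = {b1:.4f}, intercept = {b0:.4f}")
print(f"2.35 hours -> {predict(2.35):.2f} marks")
print(f"20 hours   -> {predict(20):.2f} marks (extrapolation beyond the data)")
```

The prediction for 20 hours comes out above 100, which is a reminder that the line should not be trusted far outside the observed 0 to 10 hour range.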
How to compute the Least – Squares Line ???
[Figure: two lines labelled l1 and l2]
Scenario # 1 : No Errors!!
(x)    (y)
0.0    5.02
0.2    5.04
0.4    5.06
0.6    5.08
0.8    5.10
1.0    5.12
1.2    5.14
1.4    5.16
1.6    5.18
1.8    5.20
2.0    5.22
[Figure: plot of y against x; horizontal axis 0 to 2.5, vertical axis 5 to 5.2]
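When the data contain no error, any two points determine the line. The sketch below, in plain Python with illustrative variable names, recovers the slope and intercept from the first and last rows of the table and checks that every other row lies on the same line.

```python
# Error-free data from Scenario #1.
x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
y = [5.02, 5.04, 5.06, 5.08, 5.10, 5.12, 5.14, 5.16, 5.18, 5.20, 5.22]

# Slope from the first and last points; intercept from the first point.
slope = (y[-1] - y[0]) / (x[-1] - x[0])
intercept = y[0] - slope * x[0]

# Largest vertical distance between any data point and the line.
# It should be zero up to floating-point rounding.
max_deviation = max(abs(yi - (intercept + slope * xi)) for xi, yi in zip(x, y))

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
print(f"largest deviation from the line = {max_deviation:.2e}")
```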
Scenario #2 : Measurement has Errors!!
[Figure: scatter plot of the measured data; horizontal axis 0 to 4, vertical axis about 4.9 to 5.8]
Least Square Line :
[Figure: Length (in.) (y) plotted against Weight (lb) (x); horizontal axis 0 to 4, vertical axis about 4.9 to 5.9]
NOTE : The least square line is defined to be the line for which the sum of squared residuals is minimum.
❖Using some mathematical computations it can be shown that,
▪ $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
▪ $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
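The defining property, that no other line has a smaller sum of squared residuals, can be checked numerically. The spring measurements are not listed in the text, so this sketch reuses the hours/marks data from the worked example; the nudge sizes are arbitrary illustrative choices.

```python
# Data from the hours/marks example in this deck.
x = [6, 10, 2, 4, 6, 7, 0, 1, 8, 5, 3]
y = [82, 88, 56, 64, 77, 92, 23, 41, 80, 59, 47]

def sum_squared_residuals(b0, b1):
    """Sum of squared vertical distances from the points to y = b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
b0_hat = y_bar - b1_hat * x_bar

best = sum_squared_residuals(b0_hat, b1_hat)

# Nudge the intercept and slope in every direction; each nudged line
# should have a larger sum of squared residuals than the fitted line.
nudged = [sum_squared_residuals(b0_hat + d0, b1_hat + d1)
          for d0 in (-1.0, 0.0, 1.0)
          for d1 in (-0.5, 0.0, 0.5)
          if (d0, d1) != (0.0, 0.0)]

print(f"least-squares line:  {best:.2f}")
print(f"best nudged line:    {min(nudged):.2f}")
```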
Least Squares Line : Summary
Scenario #1 : If there is no measurement error then the data points lie on the straight line
𝑦 = 𝛽0 + 𝛽1 𝑥 and values of 𝛽0 and 𝛽1 can be obtained easily by calculating the slope and the
intercept.
Scenario #2 : If there is a measurement error 𝜀𝑖 , then
❖ the exact value of 𝛽0 and 𝛽1 cannot be determined
❖ the values of 𝛽0 and 𝛽1 are computed by calculating the least square line.
❖ The least square line is given by $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
where
▪ $\hat{\beta}_0$ → the y-intercept of the least square line
   → gives an estimate of $\beta_0$, the initial length of the spring.
▪ $\hat{\beta}_1$ → the slope of the least square line
   → gives an estimate of the actual value of the spring constant $\beta_1$.
Computing formulas
Remark :
❖ $\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$
❖ Don’t use the Least Squares line when the data aren’t linear.
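The sum-of-squares identity in the remark follows from expanding the square and using $\sum_{i=1}^{n} y_i = n\bar{y}$. A short derivation, written out for reference:

```latex
\begin{aligned}
\sum_{i=1}^{n}(y_i - \bar{y})^2
  &= \sum_{i=1}^{n} y_i^2 \;-\; 2\bar{y}\sum_{i=1}^{n} y_i \;+\; n\bar{y}^2 \\
  &= \sum_{i=1}^{n} y_i^2 \;-\; 2\bar{y}\,(n\bar{y}) \;+\; n\bar{y}^2 \\
  &= \sum_{i=1}^{n} y_i^2 \;-\; n\bar{y}^2 .
\end{aligned}
```

The same expansion applied to the x values gives $\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$.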
The Estimates are not the same as true values
[Figure: Length (y) plotted against x, with the line y = 0.1x + 5.02]
[Figure: second plot; horizontal axis 0 to 4.5, vertical axis 0 to 60]
Note : In some cases the Least – Squares line can be used for non linear data, but only after
variable transformation is applied.
Measuring goodness of fit
❖ A goodness-of-fit statistic is a quantity that measures how well a
model explains a given set of data.
❖ A linear model fits well if there is a strong linear relationship between
the variables involved.
❖ The strength of a linear relationship can be measured by considering
$\sum_{i=1}^{n}(y_i - \bar{y})^2 - \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.
❖ The above quantity is also referred to as a goodness-of-fit statistic.
❖ The drawback of this statistic is that it carries units (the squared units
of y), so it cannot be used to compare the goodness of fit of two models
fitted to data sets that have different units.
❖ Hence we use the unitless ratio
$r^2 = \dfrac{\sum_{i=1}^{n}(y_i - \bar{y})^2 - \sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$
which is the square of the correlation coefficient r.
❖ This is also referred to as the coefficient of determination.
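As a numerical illustration, the sketch below computes the coefficient of determination for the hours/marks example in two ways: from the sums of squares, and as the square of the correlation coefficient. It is plain Python, and the variable names are illustrative.

```python
import math

# Data from the hours/marks example in this deck.
x = [6, 10, 2, 4, 6, 7, 0, 1, 8, 5, 3]
y = [82, 88, 56, 64, 77, 92, 23, 41, 80, 59, 47]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares coefficients.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Goodness of fit from the sums of squares.
total_ss = s_yy
error_ss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = (total_ss - error_ss) / total_ss

# Correlation coefficient, for comparison with r_squared.
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"r^2 from sums of squares = {r_squared:.4f}")
print(f"square of correlation    = {r ** 2:.4f}")
```

The two printed values agree, which is the sense in which $r^2$ is obtained from the correlation coefficient.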
Visualisation of $r^2$
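Some special terminologies!
❖ $r^2 = \dfrac{\sum_{i=1}^{n}(y_i - \bar{y})^2 - \sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$
❖ $\sum_{i=1}^{n}(y_i - \bar{y})^2 - \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ : Regression sum of squares
❖ $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ : Error sum of squares
❖ $\sum_{i=1}^{n}(y_i - \bar{y})^2$ : Total sum of squares
Uncertainties in the Least-Squares Coefficients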
• Assume we have n data points (x1, y1), . . . , (xn, yn), and we plan
to fit the least squares line.
• In order for the estimates β1 and β0 to be useful, we need to
estimate just how large their uncertainties are. In order to do this,
we need to know something about the nature of the errors εi .
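• We will begin by studying the simplest situation, in which four
important assumptions about the errors are satisfied:
  1. The errors ε1, …, εn are random and independent.
  2. The errors ε1, …, εn all have mean 0.
  3. The errors ε1, …, εn all have the same variance σ².
  4. The errors ε1, …, εn are normally distributed.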
Uncertainties in the Least-Squares Coefficients
• When the sample size is large, the normality assumption (4) becomes less
important.
• Mild violations of the assumption of constant variance (3) do not matter too
much, but severe violations should be corrected.
Uncertainties in the Least-Squares Coefficients
• Under these assumptions, the effect of the εi is largely governed by the
magnitude of the variance σ², since it is this variance that determines how
large the errors are likely to be.
• Therefore, in order to estimate the uncertainties in β0 and β1, we must first
estimate the error variance σ².
• Since the magnitude of the variance is reflected in the degree of spread of
the points around the least-squares line, it follows that by measuring this
spread, we can estimate the variance.
• Specifically, the vertical distance from each data point (xi , yi ) to the
least-squares line is given by the residual ei.
Uncertainties in the Least-Squares Coefficients
• The spread of the points around the line can be measured by the sum of the
squared residuals, $\sum_{i=1}^{n} e_i^2$.
• The estimate of the error variance σ² is the quantity s² given by
$s^2 = \dfrac{\sum_{i=1}^{n} e_i^2}{n-2} = \dfrac{(1-r^2)\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-2}$
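Distribution
Under assumptions 1 – 4, in the linear model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ the observations $y_i$ are independent, normally distributed random variables with
$\mu_{y_i} = \beta_0 + \beta_1 x_i$ , $\sigma^2_{y_i} = \sigma^2$
The slope $\beta_1$ represents the change in the mean of y associated with an increase of
one unit in the value of x.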
More Distributions
Under assumptions 1 – 4:
• The quantities $\hat{\beta}_0$ and $\hat{\beta}_1$ are normally distributed random variables.
• The means of $\hat{\beta}_0$ and $\hat{\beta}_1$ are the true values $\beta_0$ and $\beta_1$, respectively.
More Distributions (cont.)
• The standard deviations of $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimated with
$s_{\hat{\beta}_0} = s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$ and $s_{\hat{\beta}_1} = \dfrac{s}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$
where $s = \sqrt{\dfrac{(1 - r^2)\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-2}}$ is an estimate of the error standard deviation σ.
2. Use caution: if the range of x values extends beyond the range where
the linear model holds, the results will not be valid.
3. The quantities $(\hat{\beta}_0 - \beta_0)/s_{\hat{\beta}_0}$ and $(\hat{\beta}_1 - \beta_1)/s_{\hat{\beta}_1}$ have Student’s t
distribution with n – 2 degrees of freedom.
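To make these formulas concrete, the sketch below evaluates s and the two estimated standard deviations for the hours/marks example. It is plain Python with illustrative variable names. It also computes s a second way, from $r^2$, as a cross-check, and forms the statistic $\hat{\beta}_1 / s_{\hat{\beta}_1}$, which is the quantity in item 3 under the assumed hypothesis $\beta_1 = 0$.

```python
import math

# Data from the hours/marks example in this deck.
x = [6, 10, 2, 4, 6, 7, 0, 1, 8, 5, 3]
y = [82, 88, 56, 64, 77, 92, 23, 41, 80, 59, 47]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residuals and the estimate s of the error standard deviation.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Same estimate written in terms of r^2, as a consistency check.
r_squared = s_xy ** 2 / (s_xx * s_yy)
s_alt = math.sqrt((1 - r_squared) * s_yy / (n - 2))

# Estimated standard deviations of the intercept and slope.
se_b0 = s * math.sqrt(1 / n + x_bar ** 2 / s_xx)
se_b1 = s / math.sqrt(s_xx)

# t statistic for the slope under the assumed hypothesis beta_1 = 0.
t_slope = b1 / se_b1

print(f"s = {s:.3f} (from r^2: {s_alt:.3f})")
print(f"se(b0) = {se_b0:.3f}, se(b1) = {se_b1:.3f}")
print(f"t statistic for slope = {t_slope:.2f} with {n - 2} degrees of freedom")
```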
Checking Assumptions and Transforming Data
• We stated some assumptions for the errors. Here we want to see if any of
those assumptions are violated.
• When the linear model is valid, and assumptions 1 – 4 are satisfied, the
residual plot (residuals versus fitted values) will show no substantial pattern.
There should be no curve to the plot, and the vertical spread of the points
should not vary too much over the horizontal range of the data.
• A good-looking residual plot does not by itself prove that the linear model is
appropriate. However, a residual plot with a serious defect does clearly indicate
that the linear model is inappropriate.
Residual Plots
A: No noticeable pattern
B: Heteroscedastic
C: Trend
D: Outlier
Checking Assumptions to form a Linear Model
• Example of a residual plot: on the left is the plot of y versus x; on the
right, the residuals are plotted against the fitted values of y.
Checking Assumptions to form a Linear Model
• A bit of terminology:
• If the vertical spread does not vary with the fitted value, we
call the residual plot homoscedastic. Otherwise we call the plot
heteroscedastic.
Checking Assumptions to form a Linear Model
• Below on the left the plot is homoscedastic, while on the
right the spread increases with the fitted value and is thus
heteroscedastic.
Homoscedasticity or Heteroscedasticity? The way forward...
x y x y
1 2.2 11 31.5
2 9 12 32.7
3 13.5 13 34.9
4 17 14 36.3
5 20.5 15 37.7
6 23.3 16 38.7
7 25.2 17 40
8 26.4 18 41.3
9 27.6 19 42.5
10 30.2 20 43.7
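A residual plot for this table needs a plotting library, which this sketch avoids. Printing each fitted value next to its residual carries the same information. The code is plain Python and the variable names are illustrative.

```python
# Data from the table above: x = 1..20 with the listed y values.
x = list(range(1, 21))
y = [2.2, 9, 13.5, 17, 20.5, 23.3, 25.2, 26.4, 27.6, 30.2,
     31.5, 32.7, 34.9, 36.3, 37.7, 38.7, 40, 41.3, 42.5, 43.7]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# One row per point: the pairs (fitted, residual) are what a residual plot shows.
for fi, ei in zip(fitted, residuals):
    print(f"fitted = {fi:6.2f}   residual = {ei:+6.2f}")
```

With this data the residuals come out negative at both ends of the x range and positive in the middle. That curved pattern suggests a straight line in y is not adequate and a transformation is worth trying.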
Power Transformations – Positive Powers
The plot of y² versus x and its residual plot: the residual plot is
homoscedastic and exhibits no discernible pattern, so the linear model is OK.
x y² x y²
1 4.84 11 992.25
2 81 12 1069.29
3 182.25 13 1218.01
4 289 14 1317.69
5 420.25 15 1421.29
6 542.89 16 1497.69
7 635.04 17 1600
8 696.96 18 1705.69
9 761.76 19 1806.25
10 912.04 20 1909.69
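One way to see the effect of the transformation without plotting is to compare the coefficient of determination before and after squaring y. This is a rough sketch in plain Python; `r_squared` is an illustrative helper, and a higher value is only supporting evidence, not a substitute for inspecting the residual plot.

```python
def r_squared(x, y):
    """Coefficient of determination of the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy ** 2 / (s_xx * s_yy)

# Original data: x = 1..20 with the y values from the earlier table.
x = list(range(1, 21))
y = [2.2, 9, 13.5, 17, 20.5, 23.3, 25.2, 26.4, 27.6, 30.2,
     31.5, 32.7, 34.9, 36.3, 37.7, 38.7, 40, 41.3, 42.5, 43.7]

# Power transformation with a positive power: replace y by y squared.
y_squared = [yi ** 2 for yi in y]

r2_raw = r_squared(x, y)
r2_transformed = r_squared(x, y_squared)

print(f"r^2 for y   versus x: {r2_raw:.4f}")
print(f"r^2 for y^2 versus x: {r2_transformed:.4f}")
```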
Transformations – Do they always work?
• Multiple Regression
➔ We add more independent variables in order to explain the
When there are too few points on the residual plot, then…
You can start by fitting a linear model but declare your result
tentative; wait for more data and then a reliable decision can be
made.
How Many Points Make a Reliable Residual Plot?
NOT all residual plots with few points turn out to be hard to interpret.
Some of these show a pattern which cannot be changed by relocating just one
or two points.
• Outliers are points that are detached from the bulk of the data.
• Both the scatter plot and the residual plot should be examined for
outliers.
• If there are outliers that cannot be removed from the data set,
then the best thing to do is to fit a line to the whole data set, then
remove the outlier, fit a line again, and compare the two fits.
• If the plot of residuals versus fitted values looks good, then further
diagnostics may be used to check the fit of the linear model.
• If there are trends in this plot, then x and y may be varying with time.
In this case, add a time term to the model as an additional
independent variable.
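The outlier advice above, fitting with and without the suspect point, can be sketched as follows. The data here are hypothetical (they are not from the slides) and the index of the outlier is assumed to have been identified already from the scatter plot.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Hypothetical data: roughly y = 2x + 1, except for the last point.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 30.0]
outlier_index = 7  # assumed: chosen by inspecting the scatter plot

# Fit once with every point, once with the suspect point left out.
b0_all, b1_all = fit_line(x, y)
x_kept = [xi for i, xi in enumerate(x) if i != outlier_index]
y_kept = [yi for i, yi in enumerate(y) if i != outlier_index]
b0_kept, b1_kept = fit_line(x_kept, y_kept)

print(f"all points:      y = {b0_all:.2f} + {b1_all:.2f} x")
print(f"outlier removed: y = {b0_kept:.2f} + {b1_kept:.2f} x")
```

If the two lines differ noticeably, as they do for this made-up data, both fits are worth reporting.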
Normality Assumption
• If the normal probability plot of the residuals follows a rough straight
line, then we can conclude that the residuals are approximately normally
distributed.
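Without a plotting library, the straightness of a normal probability plot can be summarised by the correlation between the sorted residuals and the matching standard-normal quantiles. The sketch below does this for the hours/marks example. The plotting positions (i − 0.5)/n are one common convention, assumed here, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Data from the hours/marks example in this deck.
x = [6, 10, 2, 4, 6, 7, 0, 1, 8, 5, 3]
y = [82, 88, 56, 64, 77, 92, 23, 41, 80, 59, 47]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
ordered = sorted(yi - (b0 + b1 * xi) for xi, yi in zip(x, y))

# Standard-normal quantiles at the assumed plotting positions (i - 0.5) / n.
quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def correlation(u, v):
    """Sample correlation coefficient of two equal-length sequences."""
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    s_uv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    s_uu = sum((a - u_bar) ** 2 for a in u)
    s_vv = sum((b - v_bar) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)

# Values close to 1 mean the points of the plot lie near a straight line.
straightness = correlation(quantiles, ordered)
print(f"correlation of residuals with normal quantiles = {straightness:.3f}")
```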
THANK YOU